doc/transformation.texi

   1 @node Data Manipulation, Data Selection, Variable Attributes, Top
   2 @chapter Data transformations
   3 @cindex transformations
   4
   5 The PSPP procedures examined in this chapter manipulate data and
   6 prepare the active file for later analyses.  They do not produce output,
   7 as a rule.
   8
   9 @menu
  10 * AGGREGATE::                   Summarize multiple cases into a single case.
  11 * AUTORECODE::                  Automatic recoding of variables.
  12 * COMPUTE::                     Assigning a variable a calculated value.
  13 * COUNT::                       Counting variables with particular values.
  14 * FLIP::                        Exchange variables with cases.
  15 * IF::                          Conditionally assigning a calculated value.
  16 * RECODE::                      Mapping values from one set to another.
  17 * SORT CASES::                  Sort the active file.
  18 @end menu
  19
  20 @node AGGREGATE, AUTORECODE, Data Manipulation, Data Manipulation
  21 @section AGGREGATE
  22 @vindex AGGREGATE
  23
  24 @display
  25 AGGREGATE
  26         /BREAK=var_list
  27         /PRESORTED
  28         /OUTFILE=@{*,'filename'@}
  29         /DOCUMENT
  30         /MISSING=COLUMNWISE
  31         /dest_var['label']@dots{}=agr_func(src_vars, args@dots{})@dots{}
  32 @end display
  33
  34 @cmd{AGGREGATE} summarizes groups of cases into single cases.
  35 Cases are divided into groups that have the same values for one or more
  36 variables called @dfn{break variables}.  Several functions are available
  37 for summarizing case contents.
  38
  39 At least one break variable must be specified on BREAK, the only
  40 required subcommand.  The values of these variables are used to divide
  41 the active file into groups to be summarized.  In addition, at least
  42 one @var{dest_var} must be specified.
  43
  44 By default, the active file is sorted based on the break variables
  45 before aggregation takes place.  If the active file is already sorted
  46 or otherwise grouped in terms of the break variables, specify
  47 PRESORTED to save time.
  48
  49 The OUTFILE subcommand specifies a system file by file name string or
  50 file handle (@pxref{FILE HANDLE}).  The aggregated cases are written to
  51 this file.  If OUTFILE is not specified, or if @samp{*} is specified,
  52 then the aggregated cases replace the active file.
  53
  54 Specify DOCUMENT to copy the documents from the active file into the
  55 aggregate file (@pxref{DOCUMENT}).  Otherwise, the aggregate file will
  56 not contain any documents, even if the aggregate file replaces the
  57 active file.
  58
  59 One or more sets of aggregation variables must be specified.  Each set
  60 comprises a list of aggregation variables, an equals sign (@samp{=}),
  61 the name of an aggregation function (see the list below), and a list
  62 of source variables in parentheses.  Some aggregation functions expect
  63 additional arguments following the source variable names.
  64
  65 Aggregation variables typically are created with no variable label,
  66 value labels, or missing values.  Their default print and write
  67 formats depend on the aggregation function used, with details given in
  68 the table below.  A variable label for an aggregation variable may be
  69 specified just after the variable's name in the aggregation variable
  70 list.
  71
  72 Each set must have exactly as many source variables as aggregation
  73 variables.  Each aggregation variable receives the results of applying
  74 the specified aggregation function to the corresponding source
  75 variable.  Most aggregation functions may be applied to numeric and
  76 short and long string variables.  Others, marked below, are restricted
  77 to numeric values.
  78
  79 The available aggregation functions are as follows:
  80
  81 @table @asis
  82 @item FGT(var_name, value)
  83 Fraction of values greater than the specified constant.  The default
  84 format is F5.3.
  85
  86 @item FIN(var_name, low, high)
  87 Fraction of values within the specified inclusive range of constants.
  88 The default format is F5.3.
  89
  90 @item FLT(var_name, value)
  91 Fraction of values less than the specified constant.  The default
  92 format is F5.3.
  93
  94 @item FIRST(var_name)
  95 First non-missing value in break group.  The aggregation variable
  96 receives the complete dictionary information from the source variable.
  97
  98 @item FOUT(var_name, low, high)
  99 Fraction of values strictly outside the specified range of constants.
 100 The default format is F5.3.
 101
 102 @item LAST(var_name)
 103 Last non-missing value in break group.  The aggregation variable
 104 receives the complete dictionary information from the source variable.
 105
 106 @item MAX(var_name)
 107 Maximum value.  The aggregation variable receives the complete
 108 dictionary information from the source variable.
 109
 110 @item MEAN(var_name)
 111 Arithmetic mean.  Limited to numeric values.  The default format is
 112 F8.2.
 113
 114 @item MIN(var_name)
 115 Minimum value.  The aggregation variable receives the complete
 116 dictionary information from the source variable.
 117
 118 @item N(var_name)
 119 Number of non-missing values.  The default format is F7.0 if weighting
 120 is not enabled, F8.2 if it is (@pxref{WEIGHT}).
 121
 122 @item N
 123 Number of cases aggregated to form this group.  The default format is
 124 F7.0 if weighting is not enabled, F8.2 if it is (@pxref{WEIGHT}).
 125
 126 @item NMISS(var_name)
 127 Number of missing values.  The default format is F7.0 if weighting is
 128 not enabled, F8.2 if it is (@pxref{WEIGHT}).
 129
 130 @item NU(var_name)
 131 Number of non-missing values.  Each case is considered to have a weight
 132 of 1, regardless of the current weighting variable (@pxref{WEIGHT}).
 133 The default format is F7.0.
 134
 135 @item NU
 136 Number of cases aggregated to form this group.  Each case is considered
 137 to have a weight of 1, regardless of the current weighting variable.
 138 The default format is F7.0.
 139
 140 @item NUMISS(var_name)
 141 Number of missing values.  Each case is considered to have a weight of
 142 1, regardless of the current weighting variable.  The default format is F7.0.
 143
 144 @item PGT(var_name, value)
 145 Percentage between 0 and 100 of values greater than the specified
 146 constant.  The default format is F5.1.
 147
 148 @item PIN(var_name, low, high)
 149 Percentage of values within the specified inclusive range of
 150 constants.  The default format is F5.1.
 151
 152 @item PLT(var_name, value)
 153 Percentage of values less than the specified constant.  The default
 154 format is F5.1.
 155
 156 @item POUT(var_name, low, high)
 157 Percentage of values strictly outside the specified range of
 158 constants.  The default format is F5.1.
 159
 160 @item SD(var_name)
 161 Standard deviation of the mean.  Limited to numeric values.  The
 162 default format is F8.2.
 163
 164 @item SUM(var_name)
 165 Sum.  Limited to numeric values.  The default format is F8.2.
 166 @end table
 167
 168 Aggregation functions compare string values in terms of internal
 169 character codes.  On most modern computers, this is a form of ASCII.
 170
 171 The aggregation functions listed above exclude all user-missing values
 172 from calculations.  To include user-missing values, insert a period
 173 (@samp{.}) between the function name and left parenthesis
 174 (e.g.@: @samp{SUM.}).
 175
 176 Normally, only a single case (for SD and SD., two cases) need be
 177 non-missing in each group for the aggregate variable to be
 178 non-missing.  Specifying /MISSING=COLUMNWISE inverts this behavior, so
 179 that the aggregate variable becomes missing if any aggregated value is
 180 missing.
 181
 182 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
 183 settings (@pxref{SPLIT FILE}).
 184
 185 @node AUTORECODE, COMPUTE, AGGREGATE, Data Manipulation
 186 @section AUTORECODE
 187 @vindex AUTORECODE
 188
 189 @display
 190 AUTORECODE VARIABLES=src_vars INTO dest_vars
 191         /DESCENDING
 192         /PRINT
 193 @end display
 194
 195 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
 196 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
 197 variable.
 198
 199 Subcommand VARIABLES is the only required subcommand and must come
 200 first.  Specify VARIABLES, an equals sign (@samp{=}), a list of source
 201 variables, INTO, and a list of target variables.  There must the same
 202 number of source and target variables.  The target variables must not
 203 already exist.
 204
 205 By default, increasing values of a source variable (for a string, this
 206 is based on character code comparisons) are recoded to increasing values
 207 of its target variable.  To cause increasing values of a source variable
 208 to be recoded to decreasing values of its target variable (@var{n} down
 209 to 1), specify DESCENDING.
 210
 211 PRINT is currently ignored.
 212
 213 @cmd{AUTORECODE} is a procedure.  It causes the data to be read.
 214
 215 @node COMPUTE, COUNT, AUTORECODE, Data Manipulation
 216 @section COMPUTE
 217 @vindex COMPUTE
 218
 219 @display
 220 COMPUTE variable = expression.
 221   or
 222 COMPUTE vector(index) = expression.
 223 @end display
 224
 225 @cmd{COMPUTE} assigns the value of an expression to a target
 226 variable.  For each case, the expression is evaluated and its value
 227 assigned to the target variable.  Numeric and short and long string
 228 variables may be assigned.  When a string expression's width differs
 229 from the target variable's width, the string result of the expression
 230 is truncated or padded with spaces on the right as necessary.  The
 231 expression and variable types must match.
 232
 233 For numeric variables only, the target variable need not already
 234 exist.  Numeric variables created by @cmd{COMPUTE} are assigned an
 235 @code{F8.2} output format.  String variables must be declared before
 236 they can be used as targets for @cmd{COMPUTE}.
 237
 238 The target variable may be specified as an element of a vector
 239 (@pxref{VECTOR}).  In this case, a vector index expression must be
 240 specified in parentheses following the vector name.  The index
 241 expression must evaluate to a numeric value that, after rounding down
 242 to the nearest integer, is a valid index for the named vector.
 243
 244 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
 245 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 246 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
 247
 248 @cmd{COMPUTE} is a transformation.  It does not cause the active file to be
 249 read.
 250
 251 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
 252 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 253 (@pxref{LAG}).
 254
 255 @node COUNT, FLIP, COMPUTE, Data Manipulation
 256 @section COUNT
 257 @vindex COUNT
 258
 259 @display
 260 COUNT var_name = var@dots{} (value@dots{}).
 261
 262 Each value takes one of the following forms:
 263         number
 264         string
 265         num1 THRU num2
 266         MISSING
 267         SYSMIS
 268 In addition, num1 and num2 can be LO or LOWEST, or HI or HIGHEST,
 269 respectively.
 270 @end display
 271
 272 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
 273 counts the occurrence of a @dfn{criterion} value or set of values over
 274 one or more @dfn{test} variables for each case.
 275
 276 The target variable values are always nonnegative integers.  They are
 277 never missing.  The target variable is assigned an F8.2 output format.
 278 @xref{Input/Output Formats}.  Any variables, including long and short
 279 string variables, may be test variables.
 280
 281 User-missing values of test variables are treated just like any other
 282 values.  They are @strong{not} treated as system-missing values.
 283 User-missing values that are criterion values or inside ranges of
 284 criterion values are counted as any other values.  However (for numeric
 285 variables), keyword MISSING may be used to refer to all system-
 286 and user-missing values.
 287
 288 @cmd{COUNT} target variables are assigned values in the order
 289 specified.  In the command @code{COUNT A=A B(1) /B=A B(2).}, the
 290 following actions occur:
 291
 292 @itemize @minus
 293 @item
 294 The number of occurrences of 1 between @code{A} and @code{B} is counted.
 295
 296 @item
 297 @code{A} is assigned this value.
 298
 299 @item
 300 The number of occurrences of 1 between @code{B} and the @strong{new}
 301 value of @code{A} is counted.
 302
 303 @item
 304 @code{B} is assigned this value.
 305 @end itemize
 306
 307 Despite this ordering, all @cmd{COUNT} criterion variables must exist
 308 before the procedure is executed---they may not be created as target
 309 variables earlier in the command!  Break such a command into two
 310 separate commands.
 311
 312 The examples below may help to clarify.
 313
 314 @enumerate A
 315 @item
 316 Assuming @code{Q0}, @code{Q2}, @dots{}, @code{Q9} are numeric variables,
 317 the following commands:
 318
 319 @enumerate
 320 @item
 321 Count the number of times the value 1 occurs through these variables
 322 for each case and assigns the count to variable @code{QCOUNT}.
 323
 324 @item
 325 Print out the total number of times the value 1 occurs throughout
 326 @emph{all} cases using @cmd{DESCRIPTIVES}.  @xref{DESCRIPTIVES}, for
 327 details.
 328 @end enumerate
 329
 330 @example
 331 COUNT QCOUNT=Q0 TO Q9(1).
 332 DESCRIPTIVES QCOUNT /STATISTICS=SUM.
 333 @end example
 334
 335 @item
 336 Given these same variables, the following commands:
 337
 338 @enumerate
 339 @item
 340 Count the number of valid values of these variables for each case and
 341 assigns the count to variable @code{QVALID}.
 342
 343 @item
 344 Multiplies each value of @code{QVALID} by 10 to obtain a percentage of
 345 valid values, using @cmd{COMPUTE}.  @xref{COMPUTE}, for details.
 346
 347 @item
 348 Print out the percentage of valid values across all cases, using
 349 @cmd{DESCRIPTIVES}.  @xref{DESCRIPTIVES}, for details.
 350 @end enumerate
 351
 352 @example
 353 COUNT QVALID=Q0 TO Q9 (LO THRU HI).
 354 COMPUTE QVALID=QVALID*10.
 355 DESCRIPTIVES QVALID /STATISTICS=MEAN.
 356 @end example
 357 @end enumerate
 358
 359 @node FLIP, IF, COUNT, Data Manipulation
 360 @section FLIP
 361 @vindex FLIP
 362
 363 @display
 364 FLIP /VARIABLES=var_list /NEWNAMES=var_name.
 365 @end display
 366
 367 @cmd{FLIP} transposes rows and columns in the active file.  It
 368 causes cases to be swapped with variables, and vice versa.
 369
 370 All variables in the transposed active file are numeric.  String
 371 variables take on the system-missing value in the transposed file.
 372
 373 No subcommands are required.  The VARIABLES subcommand specifies
 374 variables that will be transformed into cases.  Variables not specified
 375 are discarded.  By default, all variables are selected for
 376 transposition.
 377
 378 The variables specified by NEWNAMES, which must be a string variable, is
 379 used to give names to the variables created by @cmd{FLIP}.  If
 380 NEWNAMES is not
 381 specified then the default is a variable named CASE_LBL, if it exists.
 382 If it does not then the variables created by FLIP are named VAR000
 383 through VAR999, then VAR1000, VAR1001, and so on.
 384
 385 When a NEWNAMES variable is available, the names must be canonicalized
 386 before becoming variable names.  Invalid characters are replaced by
 387 letter @samp{V} in the first position, or by @samp{_} in subsequent
 388 positions.  If the name thus generated is not unique, then numeric
 389 extensions are added, starting with 1, until a unique name is found or
 390 there are no remaining possibilities.  If the latter occurs then the
 391 FLIP operation aborts.
 392
 393 The resultant dictionary contains a CASE_LBL variable, which stores the
 394 names of the variables in the dictionary before the transposition.  If
 395 the active file is subsequently transposed using @cmd{FLIP}, this
 396 variable can
 397 be used to recreate the original variable names.
 398
 399 FLIP honors N OF CASES.  It ignores TEMPORARY, so that ``temporary''
 400 transformations become permanent.
 401
 402 @node IF, RECODE, FLIP, Data Manipulation
 403 @section IF
 404 @vindex IF
 405
 406 @display
 407 IF condition variable=expression.
 408   or
 409 IF condition vector(index)=expression.
 410 @end display
 411
 412 The @cmd{IF} transformation conditionally assigns the value of a target
 413 expression to a target variable, based on the truth of a test
 414 expression.
 415
 416 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
 417 following the IF keyword.  This expression is evaluated for each case.
 418 If the value is true, then the value of the expression is computed and
 419 assigned to the specified variable.  If the value is false or missing,
 420 nothing is done.  Numeric and short and long string variables may be
 421 assigned.  When a string expression's width differs from the target
 422 variable's width, the string result of the expression is truncated or
 423 padded with spaces on the right as necessary.  The expression and
 424 variable types must match.
 425
 426 The target variable may be specified as an element of a vector
 427 (@pxref{VECTOR}).  In this case, a vector index expression must be
 428 specified in parentheses following the vector name.  The index
 429 expression must evaluate to a numeric value that, after rounding down
 430 to the nearest integer, is a valid index for the named vector.
 431
 432 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
 433 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 434 @code{LEAVE} should be specified following @cmd{IF}, not before.
 435
 436 When @cmd{IF} is specified following @cmd{TEMPORARY}
 437 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 438 (@pxref{LAG}).
 439
 440 @node RECODE, SORT CASES, IF, Data Manipulation
 441 @section RECODE
 442 @vindex RECODE
 443
 444 @display
 445 RECODE var_list (src_value@dots{}=dest_value)@dots{} [INTO var_list].
 446
 447 src_value may take the following forms:
 448         number
 449         string
 450         num1 THRU num2
 451         MISSING
 452         SYSMIS
 453         ELSE
 454 Open-ended ranges may be specified using LO or LOWEST for num1
 455 or HI or HIGHEST for num2.
 456
 457 dest_value may take the following forms:
 458         num
 459         string
 460         SYSMIS
 461         COPY
 462 @end display
 463
 464 @cmd{RECODE} translates data from one range of values to
 465 another, via flexible user-specified mappings.  Data may be remapped
 466 in-place or copied to new variables.  Numeric, short string, and long
 467 string data can be recoded.
 468
 469 Specify the list of source variables, followed by one or more mapping
 470 specifications each enclosed in parentheses.  If the data is to be
 471 copied to new variables, specify INTO, then the list of target
 472 variables.  String target variables must already have been declared
 473 using @cmd{STRING} or another transformation, but numeric target
 474 variables can
 475 be created on the fly.  There must be exactly as many target variables
 476 as source variables.  Each source variable is remapped into its
 477 corresponding target variable.
 478
 479 When INTO is not used, the input and output variables must be of the
 480 same type.  Otherwise, string values can be recoded into numeric values,
 481 and vice versa.  When this is done and there is no mapping for a
 482 particular value, either a value consisting of all spaces or the
 483 system-missing value is assigned, depending on variable type.
 484
 485 Mappings are considered from left to right.  The first src_value that
 486 matches the value of the source variable causes the target variable to
 487 receive the value indicated by the dest_value.  Literal number, string,
 488 and range src_value's should be self-explanatory.  MISSING as a
 489 src_value matches any user- or system-missing value.  SYSMIS matches the
 490 system missing value only.  ELSE is a catch-all that matches anything.
 491 It should be the last src_value specified.
 492
 493 Numeric and string dest_value's should also be self-explanatory.  COPY
 494 causes the input values to be copied to the output.  This is only value
 495 if the source and target variables are of the same type.  SYSMIS
 496 indicates the system-missing value.
 497
 498 If the source variables are strings and the target variables are
 499 numeric, then there is one additional mapping available: (CONVERT),
 500 which must be the last specified mapping.  CONVERT causes a number
 501 specified as a string to be converted to a numeric value.  If the string
 502 cannot be parsed as a number, then the system-missing value is assigned.
 503
 504 Multiple recodings can be specified on a single @cmd{RECODE} invocation.
 505 Introduce additional recodings with a slash (@samp{/}) to
 506 separate them from the previous recodings.
 507
 508 @node SORT CASES,  , RECODE, Data Manipulation
 509 @section SORT CASES
 510 @vindex SORT CASES
 511
 512 @display
 513 SORT CASES BY var_list.
 514 @end display
 515
 516 @cmd{SORT CASES} sorts the active file by the values of one or more
 517 variables.
 518
 519 Specify BY and a list of variables to sort by.  By default, variables
 520 are sorted in ascending order.  To override sort order, specify (D) or
 521 (DOWN) after a list of variables to get descending order, or (A) or (UP)
 522 for ascending order.  These apply to the entire list of variables
 523 preceding them.
 524
 525 @cmd{SORT CASES} is a procedure.  It causes the data to be read.
 526
 527 @cmd{SORT CASES} attempts to sort the entire active file in main memory.
 528 If main memory is exhausted, it falls back to a merge sort algorithm that
 529 involves writing and reading numerous temporary files.
 530
 531 @cmd{SORT CASES} may not be specified following TEMPORARY.
 532 @setfilename ignored