doc/transformation.texi

   1 @node Data Manipulation
   2 @chapter Data transformations
   3 @cindex transformations
   4
   5 The PSPP procedures examined in this chapter manipulate data and
   6 prepare the active dataset for later analyses.  They do not produce output,
   7 as a rule.
   8
   9 @menu
  10 * AGGREGATE::                   Summarize multiple cases into a single case.
  11 * AUTORECODE::                  Automatic recoding of variables.
  12 * COMPUTE::                     Assigning a variable a calculated value.
  13 * COUNT::                       Counting variables with particular values.
  14 * FLIP::                        Exchange variables with cases.
  15 * IF::                          Conditionally assigning a calculated value.
  16 * RECODE::                      Mapping values from one set to another.
  17 * SORT CASES::                  Sort the active dataset.
  18 @end menu
  19
  20 @node AGGREGATE
  21 @section AGGREGATE
  22 @vindex AGGREGATE
  23
  24 @display
  25 AGGREGATE
  26         OUTFILE=@{*,'file-name',file_handle@} [MODE=@{REPLACE, ADDVARIABLES@}]
  27         /PRESORTED
  28         /DOCUMENT
  29         /MISSING=COLUMNWISE
  30         /BREAK=var_list
  31         /dest_var['label']@dots{}=agr_func(src_vars, args@dots{})@dots{}
  32 @end display
  33
  34 @cmd{AGGREGATE} summarizes groups of cases into single cases.
  35 Cases are divided into groups that have the same values for one or more
  36 variables called @dfn{break variables}.  Several functions are available
  37 for summarizing case contents.
  38
  39 The OUTFILE subcommand is required and must appear first.  Specify a
  40 system file or portable file by file name or file
  41 handle (@pxref{File Handles}), or a dataset by its name
  42 (@pxref{Datasets}).
  43 The aggregated cases are written to this file.  If @samp{*} is
  44 specified, then the aggregated cases replace the active dataset's data.
  45 Use of OUTFILE to write a portable file is a PSPP extension.
  46
  47 If OUTFILE=@samp{*} is given, then the subcommand MODE may also be
  48 specified.
  49 The mode subcommand has two possible values: ADDVARIABLES or REPLACE.
  50 In REPLACE mode, the entire active dataset is replaced by a new dataset
  51 which contains just the break variables and the destination varibles.
  52 In this mode, the new file will contain as many cases as there are
  53 unique combinations of the break variables.
  54 In ADDVARIABLES mode, the destination variables will be appended to
  55 the existing active dataset.
  56 Cases which have identical combinations of values in their break
  57 variables, will receive identical values for the destination variables.
  58 The number of cases in the active dataset will remain unchanged.
  59 Note that if ADDVARIABLES is specified, then the data @emph{must} be
  60 sorted on the break variables.
  61
  62 By default, the active dataset will be sorted based on the break variables
  63 before aggregation takes place.  If the active dataset is already sorted
  64 or otherwise grouped in terms of the break variables, specify
  65 PRESORTED to save time.
  66 PRESORTED is assumed if MODE=ADDVARIABLES is used.
  67
  68 Specify DOCUMENT to copy the documents from the active dataset into the
  69 aggregate file (@pxref{DOCUMENT}).  Otherwise, the aggregate file will
  70 not contain any documents, even if the aggregate file replaces the
  71 active dataset.
  72
  73 Normally, only a single case (for SD and SD., two cases) need be
  74 non-missing in each group for the aggregate variable to be
  75 non-missing.  Specifying /MISSING=COLUMNWISE inverts this behavior, so
  76 that the aggregate variable becomes missing if any aggregated value is
  77 missing.
  78
  79 If PRESORTED, DOCUMENT, or MISSING are specified, they must appear
  80 between OUTFILE and BREAK.
  81
  82 At least one break variable must be specified on BREAK, a
  83 required subcommand.  The values of these variables are used to divide
  84 the active dataset into groups to be summarized.  In addition, at least
  85 one @var{dest_var} must be specified.
  86
  87 One or more sets of aggregation variables must be specified.  Each set
  88 comprises a list of aggregation variables, an equals sign (@samp{=}),
  89 the name of an aggregation function (see the list below), and a list
  90 of source variables in parentheses.  Some aggregation functions expect
  91 additional arguments following the source variable names.
  92
  93 Aggregation variables typically are created with no variable label,
  94 value labels, or missing values.  Their default print and write
  95 formats depend on the aggregation function used, with details given in
  96 the table below.  A variable label for an aggregation variable may be
  97 specified just after the variable's name in the aggregation variable
  98 list.
  99
 100 Each set must have exactly as many source variables as aggregation
 101 variables.  Each aggregation variable receives the results of applying
 102 the specified aggregation function to the corresponding source
 103 variable.  The MEAN, MEDIAN, SD, and SUM aggregation functions may only be
 104 applied to numeric variables.  All the rest may be applied to numeric
 105 and string variables.
 106
 107 The available aggregation functions are as follows:
 108
 109 @table @asis
 110 @item FGT(var_name, value)
 111 Fraction of values greater than the specified constant.  The default
 112 format is F5.3.
 113
 114 @item FIN(var_name, low, high)
 115 Fraction of values within the specified inclusive range of constants.
 116 The default format is F5.3.
 117
 118 @item FLT(var_name, value)
 119 Fraction of values less than the specified constant.  The default
 120 format is F5.3.
 121
 122 @item FIRST(var_name)
 123 First non-missing value in break group.  The aggregation variable
 124 receives the complete dictionary information from the source variable.
 125 The sort performed by AGGREGATE (and by SORT CASES) is stable, so that
 126 the first case with particular values for the break variables before
 127 sorting will also be the first case in that break group after sorting.
 128
 129 @item FOUT(var_name, low, high)
 130 Fraction of values strictly outside the specified range of constants.
 131 The default format is F5.3.
 132
 133 @item LAST(var_name)
 134 Last non-missing value in break group.  The aggregation variable
 135 receives the complete dictionary information from the source variable.
 136 The sort performed by AGGREGATE (and by SORT CASES) is stable, so that
 137 the last case with particular values for the break variables before
 138 sorting will also be the last case in that break group after sorting.
 139
 140 @item MAX(var_name)
 141 Maximum value.  The aggregation variable receives the complete
 142 dictionary information from the source variable.
 143
 144 @item MEAN(var_name)
 145 Arithmetic mean.  Limited to numeric values.  The default format is
 146 F8.2.
 147
 148 @item MEDIAN(var_name)
 149 The median value.  Limited to numeric values.  The default format is F8.2.
 150
 151 @item MIN(var_name)
 152 Minimum value.  The aggregation variable receives the complete
 153 dictionary information from the source variable.
 154
 155 @item N(var_name)
 156 Number of non-missing values.  The default format is F7.0 if weighting
 157 is not enabled, F8.2 if it is (@pxref{WEIGHT}).
 158
 159 @item N
 160 Number of cases aggregated to form this group.  The default format is
 161 F7.0 if weighting is not enabled, F8.2 if it is (@pxref{WEIGHT}).
 162
 163 @item NMISS(var_name)
 164 Number of missing values.  The default format is F7.0 if weighting is
 165 not enabled, F8.2 if it is (@pxref{WEIGHT}).
 166
 167 @item NU(var_name)
 168 Number of non-missing values.  Each case is considered to have a weight
 169 of 1, regardless of the current weighting variable (@pxref{WEIGHT}).
 170 The default format is F7.0.
 171
 172 @item NU
 173 Number of cases aggregated to form this group.  Each case is considered
 174 to have a weight of 1, regardless of the current weighting variable.
 175 The default format is F7.0.
 176
 177 @item NUMISS(var_name)
 178 Number of missing values.  Each case is considered to have a weight of
 179 1, regardless of the current weighting variable.  The default format is F7.0.
 180
 181 @item PGT(var_name, value)
 182 Percentage between 0 and 100 of values greater than the specified
 183 constant.  The default format is F5.1.
 184
 185 @item PIN(var_name, low, high)
 186 Percentage of values within the specified inclusive range of
 187 constants.  The default format is F5.1.
 188
 189 @item PLT(var_name, value)
 190 Percentage of values less than the specified constant.  The default
 191 format is F5.1.
 192
 193 @item POUT(var_name, low, high)
 194 Percentage of values strictly outside the specified range of
 195 constants.  The default format is F5.1.
 196
 197 @item SD(var_name)
 198 Standard deviation of the mean.  Limited to numeric values.  The
 199 default format is F8.2.
 200
 201 @item SUM(var_name)
 202 Sum.  Limited to numeric values.  The default format is F8.2.
 203 @end table
 204
 205 Aggregation functions compare string values in terms of internal
 206 character codes.  On most modern computers, this is a form of ASCII.
 207
 208 The aggregation functions listed above exclude all user-missing values
 209 from calculations.  To include user-missing values, insert a period
 210 (@samp{.}) at the end of the function name.  (e.g.@: @samp{SUM.}).
 211 (Be aware that specifying such a function as the last token on a line
 212 will cause the period to be interpreted as the end of the command.)
 213
 214 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
 215 settings (@pxref{SPLIT FILE}).
 216
 217 @node AUTORECODE
 218 @section AUTORECODE
 219 @vindex AUTORECODE
 220
 221 @display
 222 AUTORECODE VARIABLES=src_vars INTO dest_vars
 223         [ /DESCENDING ]
 224         [ /PRINT ]
 225         [ /GROUP ]
 226 @end display
 227
 228 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
 229 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
 230 variable.
 231
 232 Subcommand VARIABLES is the only required subcommand and must come
 233 first.  Specify VARIABLES, an equals sign (@samp{=}), a list of source
 234 variables, INTO, and a list of target variables.  There must the same
 235 number of source and target variables.  The target variables must not
 236 already exist.
 237
 238 By default, increasing values of a source variable (for a string, this
 239 is based on character code comparisons) are recoded to increasing values
 240 of its target variable.  To cause increasing values of a source variable
 241 to be recoded to decreasing values of its target variable (@var{n} down
 242 to 1), specify DESCENDING.
 243
 244 PRINT is currently ignored.
 245
 246 The GROUP subcommand is relevant only if more than one variable is to be
 247 recoded.   It causes a single mapping between source and target values to
 248 be used, instead of one map per variable.
 249
 250 @cmd{AUTORECODE} is a procedure.  It causes the data to be read.
 251
 252 @node COMPUTE
 253 @section COMPUTE
 254 @vindex COMPUTE
 255
 256 @display
 257 COMPUTE variable = expression.
 258   or
 259 COMPUTE vector(index) = expression.
 260 @end display
 261
 262 @cmd{COMPUTE} assigns the value of an expression to a target
 263 variable.  For each case, the expression is evaluated and its value
 264 assigned to the target variable.  Numeric and string
 265 variables may be assigned.  When a string expression's width differs
 266 from the target variable's width, the string result of the expression
 267 is truncated or padded with spaces on the right as necessary.  The
 268 expression and variable types must match.
 269
 270 For numeric variables only, the target variable need not already
 271 exist.  Numeric variables created by @cmd{COMPUTE} are assigned an
 272 @code{F8.2} output format.  String variables must be declared before
 273 they can be used as targets for @cmd{COMPUTE}.
 274
 275 The target variable may be specified as an element of a vector
 276 (@pxref{VECTOR}).  In this case, a vector index expression must be
 277 specified in parentheses following the vector name.  The index
 278 expression must evaluate to a numeric value that, after rounding down
 279 to the nearest integer, is a valid index for the named vector.
 280
 281 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
 282 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 283 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
 284
 285 @cmd{COMPUTE} is a transformation.  It does not cause the active dataset to be
 286 read.
 287
 288 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
 289 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 290 (@pxref{LAG}).
 291
 292 @node COUNT
 293 @section COUNT
 294 @vindex COUNT
 295
 296 @display
 297 COUNT var_name = var@dots{} (value@dots{}).
 298
 299 Each value takes one of the following forms:
 300         number
 301         string
 302         num1 THRU num2
 303         MISSING
 304         SYSMIS
 305 In addition, num1 and num2 can be LO or LOWEST, or HI or HIGHEST,
 306 respectively.
 307 @end display
 308
 309 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
 310 counts the occurrence of a @dfn{criterion} value or set of values over
 311 one or more @dfn{test} variables for each case.
 312
 313 The target variable values are always nonnegative integers.  They are
 314 never missing.  The target variable is assigned an F8.2 output format.
 315 @xref{Input and Output Formats}.  Any variables, including
 316 string variables, may be test variables.
 317
 318 User-missing values of test variables are treated just like any other
 319 values.  They are @strong{not} treated as system-missing values.
 320 User-missing values that are criterion values or inside ranges of
 321 criterion values are counted as any other values.  However (for numeric
 322 variables), keyword MISSING may be used to refer to all system-
 323 and user-missing values.
 324
 325 @cmd{COUNT} target variables are assigned values in the order
 326 specified.  In the command @code{COUNT A=A B(1) /B=A B(2).}, the
 327 following actions occur:
 328
 329 @itemize @minus
 330 @item
 331 The number of occurrences of 1 between @code{A} and @code{B} is counted.
 332
 333 @item
 334 @code{A} is assigned this value.
 335
 336 @item
 337 The number of occurrences of 1 between @code{B} and the @strong{new}
 338 value of @code{A} is counted.
 339
 340 @item
 341 @code{B} is assigned this value.
 342 @end itemize
 343
 344 Despite this ordering, all @cmd{COUNT} criterion variables must exist
 345 before the procedure is executed---they may not be created as target
 346 variables earlier in the command!  Break such a command into two
 347 separate commands.
 348
 349 The examples below may help to clarify.
 350
 351 @enumerate A
 352 @item
 353 Assuming @code{Q0}, @code{Q2}, @dots{}, @code{Q9} are numeric variables,
 354 the following commands:
 355
 356 @enumerate
 357 @item
 358 Count the number of times the value 1 occurs through these variables
 359 for each case and assigns the count to variable @code{QCOUNT}.
 360
 361 @item
 362 Print out the total number of times the value 1 occurs throughout
 363 @emph{all} cases using @cmd{DESCRIPTIVES}.  @xref{DESCRIPTIVES}, for
 364 details.
 365 @end enumerate
 366
 367 @example
 368 COUNT QCOUNT=Q0 TO Q9(1).
 369 DESCRIPTIVES QCOUNT /STATISTICS=SUM.
 370 @end example
 371
 372 @item
 373 Given these same variables, the following commands:
 374
 375 @enumerate
 376 @item
 377 Count the number of valid values of these variables for each case and
 378 assigns the count to variable @code{QVALID}.
 379
 380 @item
 381 Multiplies each value of @code{QVALID} by 10 to obtain a percentage of
 382 valid values, using @cmd{COMPUTE}.  @xref{COMPUTE}, for details.
 383
 384 @item
 385 Print out the percentage of valid values across all cases, using
 386 @cmd{DESCRIPTIVES}.  @xref{DESCRIPTIVES}, for details.
 387 @end enumerate
 388
 389 @example
 390 COUNT QVALID=Q0 TO Q9 (LO THRU HI).
 391 COMPUTE QVALID=QVALID*10.
 392 DESCRIPTIVES QVALID /STATISTICS=MEAN.
 393 @end example
 394 @end enumerate
 395
 396 @node FLIP
 397 @section FLIP
 398 @vindex FLIP
 399
 400 @display
 401 FLIP /VARIABLES=var_list /NEWNAMES=var_name.
 402 @end display
 403
 404 @cmd{FLIP} transposes rows and columns in the active dataset.  It
 405 causes cases to be swapped with variables, and vice versa.
 406
 407 All variables in the transposed active dataset are numeric.  String
 408 variables take on the system-missing value in the transposed file.
 409
 410 No subcommands are required.  If specified, the VARIABLES subcommand
 411 selects variables to be transformed into cases, and variables not
 412 specified are discarded.  If the VARIABLES subcommand is omitted, all
 413 variables are selected for transposition.
 414
 415 The variables specified by NEWNAMES, which must be a string variable, is
 416 used to give names to the variables created by @cmd{FLIP}.  Only the
 417 first 8 characters of the variable are used.  If
 418 NEWNAMES is not
 419 specified then the default is a variable named CASE_LBL, if it exists.
 420 If it does not then the variables created by FLIP are named VAR000
 421 through VAR999, then VAR1000, VAR1001, and so on.
 422
 423 When a NEWNAMES variable is available, the names must be canonicalized
 424 before becoming variable names.  Invalid characters are replaced by
 425 letter @samp{V} in the first position, or by @samp{_} in subsequent
 426 positions.  If the name thus generated is not unique, then numeric
 427 extensions are added, starting with 1, until a unique name is found or
 428 there are no remaining possibilities.  If the latter occurs then the
 429 FLIP operation aborts.
 430
 431 The resultant dictionary contains a CASE_LBL variable, a string
 432 variable of width 8, which stores the names of the variables in the
 433 dictionary before the transposition.  Variables names longer than 8
 434 characters are truncated.  If the active dataset is subsequently
 435 transposed using @cmd{FLIP}, this variable can be used to recreate the
 436 original variable names.
 437
 438 FLIP honors @cmd{N OF CASES} (@pxref{N OF CASES}).  It ignores
 439 @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
 440 transformations become permanent.
 441
 442 @node IF
 443 @section IF
 444 @vindex IF
 445
 446 @display
 447 IF condition variable=expression.
 448   or
 449 IF condition vector(index)=expression.
 450 @end display
 451
 452 The @cmd{IF} transformation conditionally assigns the value of a target
 453 expression to a target variable, based on the truth of a test
 454 expression.
 455
 456 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
 457 following the IF keyword.  This expression is evaluated for each case.
 458 If the value is true, then the value of the expression is computed and
 459 assigned to the specified variable.  If the value is false or missing,
 460 nothing is done.  Numeric and string variables may be
 461 assigned.  When a string expression's width differs from the target
 462 variable's width, the string result of the expression is truncated or
 463 padded with spaces on the right as necessary.  The expression and
 464 variable types must match.
 465
 466 The target variable may be specified as an element of a vector
 467 (@pxref{VECTOR}).  In this case, a vector index expression must be
 468 specified in parentheses following the vector name.  The index
 469 expression must evaluate to a numeric value that, after rounding down
 470 to the nearest integer, is a valid index for the named vector.
 471
 472 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
 473 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 474 @code{LEAVE} should be specified following @cmd{IF}, not before.
 475
 476 When @cmd{IF} is specified following @cmd{TEMPORARY}
 477 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 478 (@pxref{LAG}).
 479
 480 @node RECODE
 481 @section RECODE
 482 @vindex RECODE
 483
 484 @display
 485 RECODE var_list (src_value@dots{}=dest_value)@dots{} [INTO var_list].
 486
 487 src_value may take the following forms:
 488         number
 489         string
 490         num1 THRU num2
 491         MISSING
 492         SYSMIS
 493         ELSE
 494 Open-ended ranges may be specified using LO or LOWEST for num1
 495 or HI or HIGHEST for num2.
 496
 497 dest_value may take the following forms:
 498         num
 499         string
 500         SYSMIS
 501         COPY
 502 @end display
 503
 504 @cmd{RECODE} translates data from one range of values to
 505 another, via flexible user-specified mappings.  Data may be remapped
 506 in-place or copied to new variables.  Numeric and
 507 string data can be recoded.
 508
 509 Specify the list of source variables, followed by one or more mapping
 510 specifications each enclosed in parentheses.  If the data is to be
 511 copied to new variables, specify INTO, then the list of target
 512 variables.  String target variables must already have been declared
 513 using @cmd{STRING} or another transformation, but numeric target
 514 variables can
 515 be created on the fly.  There must be exactly as many target variables
 516 as source variables.  Each source variable is remapped into its
 517 corresponding target variable.
 518
 519 When INTO is not used, the input and output variables must be of the
 520 same type.  Otherwise, string values can be recoded into numeric values,
 521 and vice versa.  When this is done and there is no mapping for a
 522 particular value, either a value consisting of all spaces or the
 523 system-missing value is assigned, depending on variable type.
 524
 525 Mappings are considered from left to right.  The first src_value that
 526 matches the value of the source variable causes the target variable to
 527 receive the value indicated by the dest_value.  Literal number, string,
 528 and range src_value's should be self-explanatory.  MISSING as a
 529 src_value matches any user- or system-missing value.  SYSMIS matches the
 530 system missing value only.  ELSE is a catch-all that matches anything.
 531 It should be the last src_value specified.
 532
 533 Numeric and string dest_value's should be self-explanatory.  COPY
 534 causes the input values to be copied to the output.  This is only valid
 535 if the source and target variables are of the same type.  SYSMIS
 536 indicates the system-missing value.
 537
 538 If the source variables are strings and the target variables are
 539 numeric, then there is one additional mapping available: (CONVERT),
 540 which must be the last specified mapping.  CONVERT causes a number
 541 specified as a string to be converted to a numeric value.  If the string
 542 cannot be parsed as a number, then the system-missing value is assigned.
 543
 544 Multiple recodings can be specified on a single @cmd{RECODE} invocation.
 545 Introduce additional recodings with a slash (@samp{/}) to
 546 separate them from the previous recodings.
 547
 548 @node SORT CASES
 549 @section SORT CASES
 550 @vindex SORT CASES
 551
 552 @display
 553 SORT CASES BY var_list[(@{D|A@}] [ var_list[(@{D|A@}] ] ...
 554 @end display
 555
 556 @cmd{SORT CASES} sorts the active dataset by the values of one or more
 557 variables.
 558
 559 Specify BY and a list of variables to sort by.  By default, variables
 560 are sorted in ascending order.  To override sort order, specify (D) or
 561 (DOWN) after a list of variables to get descending order, or (A) or (UP)
 562 for ascending order.  These apply to all the listed variables
 563 up until the preceding (A), (D), (UP) or (DOWN).
 564
 565 The sort algorithms used by @cmd{SORT CASES} are stable.  That is,
 566 records that have equal values of the sort variables will have the
 567 same relative order before and after sorting.  As a special case,
 568 re-sorting an already sorted file will not affect the ordering of
 569 cases.
 570
 571 @cmd{SORT CASES} is a procedure.  It causes the data to be read.
 572
 573 @cmd{SORT CASES} attempts to sort the entire active dataset in main memory.
 574 If workspace is exhausted, it falls back to a merge sort algorithm that
 575 involves creates numerous temporary files.
 576
 577 @cmd{SORT CASES} may not be specified following TEMPORARY.