1 @node Data Manipulation
2 @chapter Data transformations
3 @cindex transformations
5 The PSPP procedures examined in this chapter manipulate data and
6 prepare the active dataset for later analyses. They do not produce output,
10 * AGGREGATE:: Summarize multiple cases into a single case.
11 * AUTORECODE:: Automatic recoding of variables.
12 * COMPUTE:: Assigning a variable a calculated value.
13 * COUNT:: Counting variables with particular values.
14 * FLIP:: Exchange variables with cases.
15 * IF:: Conditionally assigning a calculated value.
16 * RECODE:: Mapping values from one set to another.
17 * SORT CASES:: Sort the active dataset.
26 OUTFILE=@{*,'file-name',file_handle@} [MODE=@{REPLACE, ADDVARIABLES@}]
31 /dest_var['label']@dots{}=agr_func(src_vars, args@dots{})@dots{}
34 @cmd{AGGREGATE} summarizes groups of cases into single cases.
35 Cases are divided into groups that have the same values for one or more
36 variables called @dfn{break variables}. Several functions are available
37 for summarizing case contents.
39 The OUTFILE subcommand is required and must appear first. Specify a
40 system file or portable file by file name or file
41 handle (@pxref{File Handles}), or a dataset by its name
43 The aggregated cases are written to this file. If @samp{*} is
44 specified, then the aggregated cases replace the active dataset's data.
45 Use of OUTFILE to write a portable file is a PSPP extension.
47 If OUTFILE=@samp{*} is given, then the subcommand MODE may also be
49 The mode subcommand has two possible values: ADDVARIABLES or REPLACE.
50 In REPLACE mode, the entire active dataset is replaced by a new dataset
51 which contains just the break variables and the destination varibles.
52 In this mode, the new file will contain as many cases as there are
53 unique combinations of the break variables.
54 In ADDVARIABLES mode, the destination variables will be appended to
55 the existing active dataset.
56 Cases which have identical combinations of values in their break
57 variables, will receive identical values for the destination variables.
58 The number of cases in the active dataset will remain unchanged.
59 Note that if ADDVARIABLES is specified, then the data @emph{must} be
60 sorted on the break variables.
62 By default, the active dataset will be sorted based on the break variables
63 before aggregation takes place. If the active dataset is already sorted
64 or otherwise grouped in terms of the break variables, specify
65 PRESORTED to save time.
66 PRESORTED is assumed if MODE=ADDVARIABLES is used.
68 Specify DOCUMENT to copy the documents from the active dataset into the
69 aggregate file (@pxref{DOCUMENT}). Otherwise, the aggregate file will
70 not contain any documents, even if the aggregate file replaces the
73 Normally, only a single case (for SD and SD., two cases) need be
74 non-missing in each group for the aggregate variable to be
75 non-missing. Specifying /MISSING=COLUMNWISE inverts this behavior, so
76 that the aggregate variable becomes missing if any aggregated value is
79 If PRESORTED, DOCUMENT, or MISSING are specified, they must appear
80 between OUTFILE and BREAK.
82 At least one break variable must be specified on BREAK, a
83 required subcommand. The values of these variables are used to divide
84 the active dataset into groups to be summarized. In addition, at least
85 one @var{dest_var} must be specified.
87 One or more sets of aggregation variables must be specified. Each set
88 comprises a list of aggregation variables, an equals sign (@samp{=}),
89 the name of an aggregation function (see the list below), and a list
90 of source variables in parentheses. Some aggregation functions expect
91 additional arguments following the source variable names.
93 Aggregation variables typically are created with no variable label,
94 value labels, or missing values. Their default print and write
95 formats depend on the aggregation function used, with details given in
96 the table below. A variable label for an aggregation variable may be
97 specified just after the variable's name in the aggregation variable
100 Each set must have exactly as many source variables as aggregation
101 variables. Each aggregation variable receives the results of applying
102 the specified aggregation function to the corresponding source
103 variable. The MEAN, MEDIAN, SD, and SUM aggregation functions may only be
104 applied to numeric variables. All the rest may be applied to numeric
105 and string variables.
107 The available aggregation functions are as follows:
110 @item FGT(var_name, value)
111 Fraction of values greater than the specified constant. The default
114 @item FIN(var_name, low, high)
115 Fraction of values within the specified inclusive range of constants.
116 The default format is F5.3.
118 @item FLT(var_name, value)
119 Fraction of values less than the specified constant. The default
122 @item FIRST(var_name)
123 First non-missing value in break group. The aggregation variable
124 receives the complete dictionary information from the source variable.
125 The sort performed by AGGREGATE (and by SORT CASES) is stable, so that
126 the first case with particular values for the break variables before
127 sorting will also be the first case in that break group after sorting.
129 @item FOUT(var_name, low, high)
130 Fraction of values strictly outside the specified range of constants.
131 The default format is F5.3.
134 Last non-missing value in break group. The aggregation variable
135 receives the complete dictionary information from the source variable.
136 The sort performed by AGGREGATE (and by SORT CASES) is stable, so that
137 the last case with particular values for the break variables before
138 sorting will also be the last case in that break group after sorting.
141 Maximum value. The aggregation variable receives the complete
142 dictionary information from the source variable.
145 Arithmetic mean. Limited to numeric values. The default format is
148 @item MEDIAN(var_name)
149 The median value. Limited to numeric values. The default format is F8.2.
152 Minimum value. The aggregation variable receives the complete
153 dictionary information from the source variable.
156 Number of non-missing values. The default format is F7.0 if weighting
157 is not enabled, F8.2 if it is (@pxref{WEIGHT}).
160 Number of cases aggregated to form this group. The default format is
161 F7.0 if weighting is not enabled, F8.2 if it is (@pxref{WEIGHT}).
163 @item NMISS(var_name)
164 Number of missing values. The default format is F7.0 if weighting is
165 not enabled, F8.2 if it is (@pxref{WEIGHT}).
168 Number of non-missing values. Each case is considered to have a weight
169 of 1, regardless of the current weighting variable (@pxref{WEIGHT}).
170 The default format is F7.0.
173 Number of cases aggregated to form this group. Each case is considered
174 to have a weight of 1, regardless of the current weighting variable.
175 The default format is F7.0.
177 @item NUMISS(var_name)
178 Number of missing values. Each case is considered to have a weight of
179 1, regardless of the current weighting variable. The default format is F7.0.
181 @item PGT(var_name, value)
182 Percentage between 0 and 100 of values greater than the specified
183 constant. The default format is F5.1.
185 @item PIN(var_name, low, high)
186 Percentage of values within the specified inclusive range of
187 constants. The default format is F5.1.
189 @item PLT(var_name, value)
190 Percentage of values less than the specified constant. The default
193 @item POUT(var_name, low, high)
194 Percentage of values strictly outside the specified range of
195 constants. The default format is F5.1.
198 Standard deviation of the mean. Limited to numeric values. The
199 default format is F8.2.
202 Sum. Limited to numeric values. The default format is F8.2.
205 Aggregation functions compare string values in terms of internal
206 character codes. On most modern computers, this is a form of ASCII.
208 The aggregation functions listed above exclude all user-missing values
209 from calculations. To include user-missing values, insert a period
210 (@samp{.}) at the end of the function name. (e.g.@: @samp{SUM.}).
211 (Be aware that specifying such a function as the last token on a line
212 will cause the period to be interpreted as the end of the command.)
214 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
215 settings (@pxref{SPLIT FILE}).
222 AUTORECODE VARIABLES=src_vars INTO dest_vars
228 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
229 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
232 Subcommand VARIABLES is the only required subcommand and must come
233 first. Specify VARIABLES, an equals sign (@samp{=}), a list of source
234 variables, INTO, and a list of target variables. There must the same
235 number of source and target variables. The target variables must not
238 By default, increasing values of a source variable (for a string, this
239 is based on character code comparisons) are recoded to increasing values
240 of its target variable. To cause increasing values of a source variable
241 to be recoded to decreasing values of its target variable (@var{n} down
242 to 1), specify DESCENDING.
244 PRINT is currently ignored.
246 The GROUP subcommand is relevant only if more than one variable is to be
247 recoded. It causes a single mapping between source and target values to
248 be used, instead of one map per variable.
250 @cmd{AUTORECODE} is a procedure. It causes the data to be read.
257 COMPUTE variable = expression.
259 COMPUTE vector(index) = expression.
262 @cmd{COMPUTE} assigns the value of an expression to a target
263 variable. For each case, the expression is evaluated and its value
264 assigned to the target variable. Numeric and string
265 variables may be assigned. When a string expression's width differs
266 from the target variable's width, the string result of the expression
267 is truncated or padded with spaces on the right as necessary. The
268 expression and variable types must match.
270 For numeric variables only, the target variable need not already
271 exist. Numeric variables created by @cmd{COMPUTE} are assigned an
272 @code{F8.2} output format. String variables must be declared before
273 they can be used as targets for @cmd{COMPUTE}.
275 The target variable may be specified as an element of a vector
276 (@pxref{VECTOR}). In this case, a vector index expression must be
277 specified in parentheses following the vector name. The index
278 expression must evaluate to a numeric value that, after rounding down
279 to the nearest integer, is a valid index for the named vector.
281 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
282 (@pxref{LEAVE}) resets the variable's left state. Therefore,
283 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
285 @cmd{COMPUTE} is a transformation. It does not cause the active dataset to be
288 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
289 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
297 COUNT var_name = var@dots{} (value@dots{}).
299 Each value takes one of the following forms:
305 In addition, num1 and num2 can be LO or LOWEST, or HI or HIGHEST,
309 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
310 counts the occurrence of a @dfn{criterion} value or set of values over
311 one or more @dfn{test} variables for each case.
313 The target variable values are always nonnegative integers. They are
314 never missing. The target variable is assigned an F8.2 output format.
315 @xref{Input and Output Formats}. Any variables, including
316 string variables, may be test variables.
318 User-missing values of test variables are treated just like any other
319 values. They are @strong{not} treated as system-missing values.
320 User-missing values that are criterion values or inside ranges of
321 criterion values are counted as any other values. However (for numeric
322 variables), keyword MISSING may be used to refer to all system-
323 and user-missing values.
325 @cmd{COUNT} target variables are assigned values in the order
326 specified. In the command @code{COUNT A=A B(1) /B=A B(2).}, the
327 following actions occur:
331 The number of occurrences of 1 between @code{A} and @code{B} is counted.
334 @code{A} is assigned this value.
337 The number of occurrences of 1 between @code{B} and the @strong{new}
338 value of @code{A} is counted.
341 @code{B} is assigned this value.
344 Despite this ordering, all @cmd{COUNT} criterion variables must exist
345 before the procedure is executed---they may not be created as target
346 variables earlier in the command! Break such a command into two
349 The examples below may help to clarify.
353 Assuming @code{Q0}, @code{Q2}, @dots{}, @code{Q9} are numeric variables,
354 the following commands:
358 Count the number of times the value 1 occurs through these variables
359 for each case and assigns the count to variable @code{QCOUNT}.
362 Print out the total number of times the value 1 occurs throughout
363 @emph{all} cases using @cmd{DESCRIPTIVES}. @xref{DESCRIPTIVES}, for
368 COUNT QCOUNT=Q0 TO Q9(1).
369 DESCRIPTIVES QCOUNT /STATISTICS=SUM.
373 Given these same variables, the following commands:
377 Count the number of valid values of these variables for each case and
378 assigns the count to variable @code{QVALID}.
381 Multiplies each value of @code{QVALID} by 10 to obtain a percentage of
382 valid values, using @cmd{COMPUTE}. @xref{COMPUTE}, for details.
385 Print out the percentage of valid values across all cases, using
386 @cmd{DESCRIPTIVES}. @xref{DESCRIPTIVES}, for details.
390 COUNT QVALID=Q0 TO Q9 (LO THRU HI).
391 COMPUTE QVALID=QVALID*10.
392 DESCRIPTIVES QVALID /STATISTICS=MEAN.
401 FLIP /VARIABLES=var_list /NEWNAMES=var_name.
404 @cmd{FLIP} transposes rows and columns in the active dataset. It
405 causes cases to be swapped with variables, and vice versa.
407 All variables in the transposed active dataset are numeric. String
408 variables take on the system-missing value in the transposed file.
410 No subcommands are required. If specified, the VARIABLES subcommand
411 selects variables to be transformed into cases, and variables not
412 specified are discarded. If the VARIABLES subcommand is omitted, all
413 variables are selected for transposition.
415 The variables specified by NEWNAMES, which must be a string variable, is
416 used to give names to the variables created by @cmd{FLIP}. Only the
417 first 8 characters of the variable are used. If
419 specified then the default is a variable named CASE_LBL, if it exists.
420 If it does not then the variables created by FLIP are named VAR000
421 through VAR999, then VAR1000, VAR1001, and so on.
423 When a NEWNAMES variable is available, the names must be canonicalized
424 before becoming variable names. Invalid characters are replaced by
425 letter @samp{V} in the first position, or by @samp{_} in subsequent
426 positions. If the name thus generated is not unique, then numeric
427 extensions are added, starting with 1, until a unique name is found or
428 there are no remaining possibilities. If the latter occurs then the
429 FLIP operation aborts.
431 The resultant dictionary contains a CASE_LBL variable, a string
432 variable of width 8, which stores the names of the variables in the
433 dictionary before the transposition. Variables names longer than 8
434 characters are truncated. If the active dataset is subsequently
435 transposed using @cmd{FLIP}, this variable can be used to recreate the
436 original variable names.
438 FLIP honors @cmd{N OF CASES} (@pxref{N OF CASES}). It ignores
439 @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
440 transformations become permanent.
447 IF condition variable=expression.
449 IF condition vector(index)=expression.
452 The @cmd{IF} transformation conditionally assigns the value of a target
453 expression to a target variable, based on the truth of a test
456 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
457 following the IF keyword. This expression is evaluated for each case.
458 If the value is true, then the value of the expression is computed and
459 assigned to the specified variable. If the value is false or missing,
460 nothing is done. Numeric and string variables may be
461 assigned. When a string expression's width differs from the target
462 variable's width, the string result of the expression is truncated or
463 padded with spaces on the right as necessary. The expression and
464 variable types must match.
466 The target variable may be specified as an element of a vector
467 (@pxref{VECTOR}). In this case, a vector index expression must be
468 specified in parentheses following the vector name. The index
469 expression must evaluate to a numeric value that, after rounding down
470 to the nearest integer, is a valid index for the named vector.
472 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
473 (@pxref{LEAVE}) resets the variable's left state. Therefore,
474 @code{LEAVE} should be specified following @cmd{IF}, not before.
476 When @cmd{IF} is specified following @cmd{TEMPORARY}
477 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
485 RECODE var_list (src_value@dots{}=dest_value)@dots{} [INTO var_list].
487 src_value may take the following forms:
494 Open-ended ranges may be specified using LO or LOWEST for num1
495 or HI or HIGHEST for num2.
497 dest_value may take the following forms:
504 @cmd{RECODE} translates data from one range of values to
505 another, via flexible user-specified mappings. Data may be remapped
506 in-place or copied to new variables. Numeric and
507 string data can be recoded.
509 Specify the list of source variables, followed by one or more mapping
510 specifications each enclosed in parentheses. If the data is to be
511 copied to new variables, specify INTO, then the list of target
512 variables. String target variables must already have been declared
513 using @cmd{STRING} or another transformation, but numeric target
515 be created on the fly. There must be exactly as many target variables
516 as source variables. Each source variable is remapped into its
517 corresponding target variable.
519 When INTO is not used, the input and output variables must be of the
520 same type. Otherwise, string values can be recoded into numeric values,
521 and vice versa. When this is done and there is no mapping for a
522 particular value, either a value consisting of all spaces or the
523 system-missing value is assigned, depending on variable type.
525 Mappings are considered from left to right. The first src_value that
526 matches the value of the source variable causes the target variable to
527 receive the value indicated by the dest_value. Literal number, string,
528 and range src_value's should be self-explanatory. MISSING as a
529 src_value matches any user- or system-missing value. SYSMIS matches the
530 system missing value only. ELSE is a catch-all that matches anything.
531 It should be the last src_value specified.
533 Numeric and string dest_value's should be self-explanatory. COPY
534 causes the input values to be copied to the output. This is only valid
535 if the source and target variables are of the same type. SYSMIS
536 indicates the system-missing value.
538 If the source variables are strings and the target variables are
539 numeric, then there is one additional mapping available: (CONVERT),
540 which must be the last specified mapping. CONVERT causes a number
541 specified as a string to be converted to a numeric value. If the string
542 cannot be parsed as a number, then the system-missing value is assigned.
544 Multiple recodings can be specified on a single @cmd{RECODE} invocation.
545 Introduce additional recodings with a slash (@samp{/}) to
546 separate them from the previous recodings.
553 SORT CASES BY var_list[(@{D|A@}] [ var_list[(@{D|A@}] ] ...
556 @cmd{SORT CASES} sorts the active dataset by the values of one or more
559 Specify BY and a list of variables to sort by. By default, variables
560 are sorted in ascending order. To override sort order, specify (D) or
561 (DOWN) after a list of variables to get descending order, or (A) or (UP)
562 for ascending order. These apply to all the listed variables
563 up until the preceding (A), (D), (UP) or (DOWN).
565 The sort algorithms used by @cmd{SORT CASES} are stable. That is,
566 records that have equal values of the sort variables will have the
567 same relative order before and after sorting. As a special case,
568 re-sorting an already sorted file will not affect the ordering of
571 @cmd{SORT CASES} is a procedure. It causes the data to be read.
573 @cmd{SORT CASES} attempts to sort the entire active dataset in main memory.
574 If workspace is exhausted, it falls back to a merge sort algorithm that
575 involves creates numerous temporary files.
577 @cmd{SORT CASES} may not be specified following TEMPORARY.