1 @node Data Manipulation, Data Selection, Variable Attributes, Top
2 @chapter Data transformations
3 @cindex transformations
5 The PSPP procedures examined in this chapter manipulate data and
6 prepare the active file for later analyses. They do not produce output,
10 * AGGREGATE:: Summarize multiple cases into a single case.
11 * AUTORECODE:: Automatic recoding of variables.
12 * COMPUTE:: Assigning a variable a calculated value.
13 * COUNT:: Counting variables with particular values.
14 * FLIP:: Exchange variables with cases.
15 * IF:: Conditionally assigning a calculated value.
16 * RECODE:: Mapping values from one set to another.
17 * SORT CASES:: Sort the active file.
20 @node AGGREGATE, AUTORECODE, Data Manipulation, Data Manipulation
28 /OUTFILE=@{*,'filename'@}
31 /dest_var['label']@dots{}=agr_func(src_vars, args@dots{})@dots{}
34 @cmd{AGGREGATE} summarizes groups of cases into single cases.
35 Cases are divided into groups that have the same values for one or more
36 variables called @dfn{break variables}. Several functions are available
37 for summarizing case contents.
39 At least one break variable must be specified on BREAK, the only
40 required subcommand. The values of these variables are used to divide
41 the active file into groups to be summarized. In addition, at least
42 one @var{dest_var} must be specified.
44 By default, the active file is sorted based on the break variables
45 before aggregation takes place. If the active file is already sorted
46 or otherwise grouped in terms of the break variables, specify
47 PRESORTED to save time.
49 The OUTFILE subcommand specifies a system file by file name string or
50 file handle (@pxref{FILE HANDLE}). The aggregated cases are written to
51 this file. If OUTFILE is not specified, or if @samp{*} is specified,
52 then the aggregated cases replace the active file.
54 Specify DOCUMENT to copy the documents from the active file into the
55 aggregate file (@pxref{DOCUMENT}). Otherwise, the aggregate file will
56 not contain any documents, even if the aggregate file replaces the
59 One or more sets of aggregation variables must be specified. Each set
60 comprises a list of aggregation variables, an equals sign (@samp{=}),
61 the name of an aggregation function (see the list below), and a list
62 of source variables in parentheses. Some aggregation functions expect
63 additional arguments following the source variable names.
65 Aggregation variables typically are created with no variable label,
66 value labels, or missing values. Their default print and write
67 formats depend on the aggregation function used, with details given in
68 the table below. A variable label for an aggregation variable may be
69 specified just after the variable's name in the aggregation variable
72 Each set must have exactly as many source variables as aggregation
73 variables. Each aggregation variable receives the results of applying
74 the specified aggregation function to the corresponding source
75 variable. Most aggregation functions may be applied to numeric and
76 short and long string variables. Others, marked below, are restricted
79 The available aggregation functions are as follows:
82 @item FGT(var_name, value)
83 Fraction of values greater than the specified constant. The default
86 @item FIN(var_name, low, high)
87 Fraction of values within the specified inclusive range of constants.
88 The default format is F5.3.
90 @item FLT(var_name, value)
91 Fraction of values less than the specified constant. The default
95 First non-missing value in break group. The aggregation variable
96 receives the complete dictionary information from the source variable.
98 @item FOUT(var_name, low, high)
99 Fraction of values strictly outside the specified range of constants.
100 The default format is F5.3.
103 Last non-missing value in break group. The aggregation variable
104 receives the complete dictionary information from the source variable.
107 Maximum value. The aggregation variable receives the complete
108 dictionary information from the source variable.
111 Arithmetic mean. Limited to numeric values. The default format is
115 Minimum value. The aggregation variable receives the complete
116 dictionary information from the source variable.
119 Number of non-missing values. The default format is F7.0 if weighting
120 is not enabled, F8.2 if it is (@pxref{WEIGHT}).
123 Number of cases aggregated to form this group. The default format is
124 F7.0 if weighting is not enabled, F8.2 if it is (@pxref{WEIGHT}).
126 @item NMISS(var_name)
127 Number of missing values. The default format is F7.0 if weighting is
128 not enabled, F8.2 if it is (@pxref{WEIGHT}).
131 Number of non-missing values. Each case is considered to have a weight
132 of 1, regardless of the current weighting variable (@pxref{WEIGHT}).
133 The default format is F7.0.
136 Number of cases aggregated to form this group. Each case is considered
137 to have a weight of 1, regardless of the current weighting variable.
138 The default format is F7.0.
140 @item NUMISS(var_name)
141 Number of missing values. Each case is considered to have a weight of
142 1, regardless of the current weighting variable. The default format is F7.0.
144 @item PGT(var_name, value)
145 Percentage between 0 and 100 of values greater than the specified
146 constant. The default format is F5.1.
148 @item PIN(var_name, low, high)
149 Percentage of values within the specified inclusive range of
150 constants. The default format is F5.1.
152 @item PLT(var_name, value)
153 Percentage of values less than the specified constant. The default
156 @item POUT(var_name, low, high)
157 Percentage of values strictly outside the specified range of
158 constants. The default format is F5.1.
161 Standard deviation of the mean. Limited to numeric values. The
162 default format is F8.2.
165 Sum. Limited to numeric values. The default format is F8.2.
168 Aggregation functions compare string values in terms of internal
169 character codes. On most modern computers, this is a form of ASCII.
171 The aggregation functions listed above exclude all user-missing values
172 from calculations. To include user-missing values, insert a period
173 (@samp{.}) between the function name and left parenthesis
174 (e.g.@: @samp{SUM.}).
176 Normally, only a single case (for SD and SD., two cases) need be
177 non-missing in each group for the aggregate variable to be
178 non-missing. Specifying /MISSING=COLUMNWISE inverts this behavior, so
179 that the aggregate variable becomes missing if any aggregated value is
182 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
183 settings (@pxref{SPLIT FILE}).
185 @node AUTORECODE, COMPUTE, AGGREGATE, Data Manipulation
190 AUTORECODE VARIABLES=src_vars INTO dest_vars
195 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
196 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
199 Subcommand VARIABLES is the only required subcommand and must come
200 first. Specify VARIABLES, an equals sign (@samp{=}), a list of source
201 variables, INTO, and a list of target variables. There must the same
202 number of source and target variables. The target variables must not
205 By default, increasing values of a source variable (for a string, this
206 is based on character code comparisons) are recoded to increasing values
207 of its target variable. To cause increasing values of a source variable
208 to be recoded to decreasing values of its target variable (@var{n} down
209 to 1), specify DESCENDING.
211 PRINT is currently ignored.
213 @cmd{AUTORECODE} is a procedure. It causes the data to be read.
215 @node COMPUTE, COUNT, AUTORECODE, Data Manipulation
220 COMPUTE variable = expression.
222 COMPUTE vector(index) = expression.
225 @cmd{COMPUTE} assigns the value of an expression to a target
226 variable. For each case, the expression is evaluated and its value
227 assigned to the target variable. Numeric and short and long string
228 variables may be assigned. When a string expression's width differs
229 from the target variable's width, the string result of the expression
230 is truncated or padded with spaces on the right as necessary. The
231 expression and variable types must match.
233 For numeric variables only, the target variable need not already
234 exist. Numeric variables created by @cmd{COMPUTE} are assigned an
235 @code{F8.2} output format. String variables must be declared before
236 they can be used as targets for @cmd{COMPUTE}.
238 The target variable may be specified as an element of a vector
239 (@pxref{VECTOR}). In this case, a vector index expression must be
240 specified in parentheses following the vector name. The index
241 expression must evaluate to a numeric value that, after rounding down
242 to the nearest integer, is a valid index for the named vector.
244 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
245 (@pxref{LEAVE}) resets the variable's left state. Therefore,
246 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
248 @cmd{COMPUTE} is a transformation. It does not cause the active file to be
251 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
252 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
255 @node COUNT, FLIP, COMPUTE, Data Manipulation
260 COUNT var_name = var@dots{} (value@dots{}).
262 Each value takes one of the following forms:
268 In addition, num1 and num2 can be LO or LOWEST, or HI or HIGHEST,
272 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
273 counts the occurrence of a @dfn{criterion} value or set of values over
274 one or more @dfn{test} variables for each case.
276 The target variable values are always nonnegative integers. They are
277 never missing. The target variable is assigned an F8.2 output format.
278 @xref{Input/Output Formats}. Any variables, including long and short
279 string variables, may be test variables.
281 User-missing values of test variables are treated just like any other
282 values. They are @strong{not} treated as system-missing values.
283 User-missing values that are criterion values or inside ranges of
284 criterion values are counted as any other values. However (for numeric
285 variables), keyword MISSING may be used to refer to all system-
286 and user-missing values.
288 @cmd{COUNT} target variables are assigned values in the order
289 specified. In the command @code{COUNT A=A B(1) /B=A B(2).}, the
290 following actions occur:
294 The number of occurrences of 1 between @code{A} and @code{B} is counted.
297 @code{A} is assigned this value.
300 The number of occurrences of 1 between @code{B} and the @strong{new}
301 value of @code{A} is counted.
304 @code{B} is assigned this value.
307 Despite this ordering, all @cmd{COUNT} criterion variables must exist
308 before the procedure is executed---they may not be created as target
309 variables earlier in the command! Break such a command into two
312 The examples below may help to clarify.
316 Assuming @code{Q0}, @code{Q2}, @dots{}, @code{Q9} are numeric variables,
317 the following commands:
321 Count the number of times the value 1 occurs through these variables
322 for each case and assigns the count to variable @code{QCOUNT}.
325 Print out the total number of times the value 1 occurs throughout
326 @emph{all} cases using @cmd{DESCRIPTIVES}. @xref{DESCRIPTIVES}, for
331 COUNT QCOUNT=Q0 TO Q9(1).
332 DESCRIPTIVES QCOUNT /STATISTICS=SUM.
336 Given these same variables, the following commands:
340 Count the number of valid values of these variables for each case and
341 assigns the count to variable @code{QVALID}.
344 Multiplies each value of @code{QVALID} by 10 to obtain a percentage of
345 valid values, using @cmd{COMPUTE}. @xref{COMPUTE}, for details.
348 Print out the percentage of valid values across all cases, using
349 @cmd{DESCRIPTIVES}. @xref{DESCRIPTIVES}, for details.
353 COUNT QVALID=Q0 TO Q9 (LO THRU HI).
354 COMPUTE QVALID=QVALID*10.
355 DESCRIPTIVES QVALID /STATISTICS=MEAN.
359 @node FLIP, IF, COUNT, Data Manipulation
364 FLIP /VARIABLES=var_list /NEWNAMES=var_name.
367 @cmd{FLIP} transposes rows and columns in the active file. It
368 causes cases to be swapped with variables, and vice versa.
370 All variables in the transposed active file are numeric. String
371 variables take on the system-missing value in the transposed file.
373 No subcommands are required. The VARIABLES subcommand specifies
374 variables that will be transformed into cases. Variables not specified
375 are discarded. By default, all variables are selected for
378 The variables specified by NEWNAMES, which must be a string variable, is
379 used to give names to the variables created by @cmd{FLIP}. If
381 specified then the default is a variable named CASE_LBL, if it exists.
382 If it does not then the variables created by FLIP are named VAR000
383 through VAR999, then VAR1000, VAR1001, and so on.
385 When a NEWNAMES variable is available, the names must be canonicalized
386 before becoming variable names. Invalid characters are replaced by
387 letter @samp{V} in the first position, or by @samp{_} in subsequent
388 positions. If the name thus generated is not unique, then numeric
389 extensions are added, starting with 1, until a unique name is found or
390 there are no remaining possibilities. If the latter occurs then the
391 FLIP operation aborts.
393 The resultant dictionary contains a CASE_LBL variable, which stores the
394 names of the variables in the dictionary before the transposition. If
395 the active file is subsequently transposed using @cmd{FLIP}, this
397 be used to recreate the original variable names.
399 FLIP honors N OF CASES. It ignores TEMPORARY, so that ``temporary''
400 transformations become permanent.
402 @node IF, RECODE, FLIP, Data Manipulation
407 IF condition variable=expression.
409 IF condition vector(index)=expression.
412 The @cmd{IF} transformation conditionally assigns the value of a target
413 expression to a target variable, based on the truth of a test
416 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
417 following the IF keyword. This expression is evaluated for each case.
418 If the value is true, then the value of the expression is computed and
419 assigned to the specified variable. If the value is false or missing,
420 nothing is done. Numeric and short and long string variables may be
421 assigned. When a string expression's width differs from the target
422 variable's width, the string result of the expression is truncated or
423 padded with spaces on the right as necessary. The expression and
424 variable types must match.
426 The target variable may be specified as an element of a vector
427 (@pxref{VECTOR}). In this case, a vector index expression must be
428 specified in parentheses following the vector name. The index
429 expression must evaluate to a numeric value that, after rounding down
430 to the nearest integer, is a valid index for the named vector.
432 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
433 (@pxref{LEAVE}) resets the variable's left state. Therefore,
434 @code{LEAVE} should be specified following @cmd{IF}, not before.
436 When @cmd{IF} is specified following @cmd{TEMPORARY}
437 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
440 @node RECODE, SORT CASES, IF, Data Manipulation
445 RECODE var_list (src_value@dots{}=dest_value)@dots{} [INTO var_list].
447 src_value may take the following forms:
454 Open-ended ranges may be specified using LO or LOWEST for num1
455 or HI or HIGHEST for num2.
457 dest_value may take the following forms:
464 @cmd{RECODE} translates data from one range of values to
465 another, via flexible user-specified mappings. Data may be remapped
466 in-place or copied to new variables. Numeric, short string, and long
467 string data can be recoded.
469 Specify the list of source variables, followed by one or more mapping
470 specifications each enclosed in parentheses. If the data is to be
471 copied to new variables, specify INTO, then the list of target
472 variables. String target variables must already have been declared
473 using @cmd{STRING} or another transformation, but numeric target
475 be created on the fly. There must be exactly as many target variables
476 as source variables. Each source variable is remapped into its
477 corresponding target variable.
479 When INTO is not used, the input and output variables must be of the
480 same type. Otherwise, string values can be recoded into numeric values,
481 and vice versa. When this is done and there is no mapping for a
482 particular value, either a value consisting of all spaces or the
483 system-missing value is assigned, depending on variable type.
485 Mappings are considered from left to right. The first src_value that
486 matches the value of the source variable causes the target variable to
487 receive the value indicated by the dest_value. Literal number, string,
488 and range src_value's should be self-explanatory. MISSING as a
489 src_value matches any user- or system-missing value. SYSMIS matches the
490 system missing value only. ELSE is a catch-all that matches anything.
491 It should be the last src_value specified.
493 Numeric and string dest_value's should also be self-explanatory. COPY
494 causes the input values to be copied to the output. This is only value
495 if the source and target variables are of the same type. SYSMIS
496 indicates the system-missing value.
498 If the source variables are strings and the target variables are
499 numeric, then there is one additional mapping available: (CONVERT),
500 which must be the last specified mapping. CONVERT causes a number
501 specified as a string to be converted to a numeric value. If the string
502 cannot be parsed as a number, then the system-missing value is assigned.
504 Multiple recodings can be specified on a single @cmd{RECODE} invocation.
505 Introduce additional recodings with a slash (@samp{/}) to
506 separate them from the previous recodings.
508 @node SORT CASES, , RECODE, Data Manipulation
513 SORT CASES BY var_list.
516 @cmd{SORT CASES} sorts the active file by the values of one or more
519 Specify BY and a list of variables to sort by. By default, variables
520 are sorted in ascending order. To override sort order, specify (D) or
521 (DOWN) after a list of variables to get descending order, or (A) or (UP)
522 for ascending order. These apply to the entire list of variables
525 @cmd{SORT CASES} is a procedure. It causes the data to be read.
527 @cmd{SORT CASES} attempts to sort the entire active file in main memory.
528 If main memory is exhausted, it falls back to a merge sort algorithm that
529 involves writing and reading numerous temporary files.
531 @cmd{SORT CASES} may not be specified following TEMPORARY.