@cindex transformations
The PSPP procedures examined in this chapter manipulate data and
-prepare the active file for later analyses. They do not produce output,
+prepare the active dataset for later analyses. They do not produce output,
as a rule.
@menu
* FLIP:: Exchange variables with cases.
* IF:: Conditionally assigning a calculated value.
* RECODE:: Mapping values from one set to another.
-* SORT CASES:: Sort the active file.
+* SORT CASES:: Sort the active dataset.
@end menu
@node AGGREGATE
@display
AGGREGATE
- OUTFILE=@{*,'file-name',file_handle@}
+ OUTFILE=@{*,'file-name',file_handle@} [MODE=@{REPLACE, ADDVARIABLES@}]
/PRESORTED
/DOCUMENT
/MISSING=COLUMNWISE
for summarizing case contents.
The OUTFILE subcommand is required and must appear first. Specify a
-system file, portable file, or scratch file by file name or file
-handle (@pxref{File Handles}).
+system file or portable file by file name or file
+handle (@pxref{File Handles}), or a dataset by its name
+(@pxref{Datasets}).
The aggregated cases are written to this file. If @samp{*} is
-specified, then the aggregated cases replace the active file. Use of
-OUTFILE to write a portable file or scratch file is a PSPP extension.
-
-By default, the active file will be sorted based on the break variables
-before aggregation takes place. If the active file is already sorted
+specified, then the aggregated cases replace the active dataset's data.
+Use of OUTFILE to write a portable file is a PSPP extension.
+
+If OUTFILE=@samp{*} is given, then the subcommand MODE may also be
+specified.
+The MODE subcommand has two possible values: ADDVARIABLES or REPLACE.
+In REPLACE mode, the entire active dataset is replaced by a new dataset
+which contains just the break variables and the destination variables.
+In this mode, the new file will contain as many cases as there are
+unique combinations of the break variables.
+In ADDVARIABLES mode, the destination variables will be appended to
+the existing active dataset.
+Cases which have identical combinations of values in their break
+variables will receive identical values for the destination variables.
+The number of cases in the active dataset will remain unchanged.
+Note that if ADDVARIABLES is specified, then the data @emph{must} be
+sorted on the break variables.
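+
+As an illustrative sketch of ADDVARIABLES mode, the following syntax
+appends a group mean to every case.
+The variable names @var{region}, @var{income} and @var{mean_income} are
+hypothetical and are not taken from any particular dataset:
+@example
+* region, income and mean_income are hypothetical variable names.
+sort cases by region.
+aggregate outfile=* mode=addvariables
+ /break=region
+ /mean_income = mean(income).
+@end example
+@noindent Assuming the behaviour described above, every case keeps its
+original values and gains @var{mean_income}, which is identical for all
+cases sharing the same value of @var{region}.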
+
+By default, the active dataset will be sorted based on the break variables
+before aggregation takes place. If the active dataset is already sorted
or otherwise grouped in terms of the break variables, specify
PRESORTED to save time.
+PRESORTED is assumed if MODE=ADDVARIABLES is used.
-Specify DOCUMENT to copy the documents from the active file into the
+Specify DOCUMENT to copy the documents from the active dataset into the
aggregate file (@pxref{DOCUMENT}). Otherwise, the aggregate file will
not contain any documents, even if the aggregate file replaces the
-active file.
+active dataset.
Normally, only a single case (for SD and SD., two cases) need be
non-missing in each group for the aggregate variable to be
At least one break variable must be specified on BREAK, a
required subcommand. The values of these variables are used to divide
-the active file into groups to be summarized. In addition, at least
+the active dataset into groups to be summarized. In addition, at least
one @var{dest_var} must be specified.
One or more sets of aggregation variables must be specified. Each set
@display
AUTORECODE VARIABLES=src_vars INTO dest_vars
- /DESCENDING
- /PRINT
+ [ /DESCENDING ]
+ [ /PRINT ]
+ [ /GROUP ]
+ [ /BLANK = @{VALID, MISSING@} ]
@end display
The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
PRINT is currently ignored.
+The GROUP subcommand is relevant only if more than one variable is to be
+recoded. It causes a single mapping between source and target values to
+be used, instead of one map per variable.
+
+If /BLANK=MISSING is given, then string values which contain only
+whitespace are recoded as SYSMIS. If /BLANK=VALID is given, then they
+will be allocated a value like any other. /BLANK is not relevant
+to numeric values. /BLANK=VALID is the default.
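+
+A minimal sketch combining both subcommands follows.
+The variable names are hypothetical:
+@example
+* city1, city2, code1 and code2 are hypothetical variable names.
+autorecode variables=city1 city2 into code1 code2
+ /group
+ /blank=missing.
+@end example
+@noindent With /GROUP, a given string should receive the same numeric
+code in both @var{code1} and @var{code2}; with /BLANK=MISSING, values
+consisting only of whitespace become system-missing.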
+
@cmd{AUTORECODE} is a procedure. It causes the data to be read.
@node COMPUTE
(@pxref{LEAVE}) resets the variable's left state. Therefore,
@code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
-@cmd{COMPUTE} is a transformation. It does not cause the active file to be
+@cmd{COMPUTE} is a transformation. It does not cause the active dataset to be
read.
When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
FLIP /VARIABLES=var_list /NEWNAMES=var_name.
@end display
-@cmd{FLIP} transposes rows and columns in the active file. It
+@cmd{FLIP} transposes rows and columns in the active dataset. It
causes cases to be swapped with variables, and vice versa.
-All variables in the transposed active file are numeric. String
+All variables in the transposed active dataset are numeric. String
variables take on the system-missing value in the transposed file.
No subcommands are required. If specified, the VARIABLES subcommand
The resultant dictionary contains a CASE_LBL variable, a string
variable of width 8, which stores the names of the variables in the
dictionary before the transposition. Variable names longer than 8
-characters are truncated. If the active file is subsequently
+characters are truncated. If the active dataset is subsequently
transposed using @cmd{FLIP}, this variable can be used to recreate the
original variable names.
@section RECODE
@vindex RECODE
+The @cmd{RECODE} command is used to transform existing values into other,
+user-specified values.
+The general form is:
+
@display
-RECODE var_list (src_value@dots{}=dest_value)@dots{} [INTO var_list].
+RECODE @var{src_vars}
+ (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
+ (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
+ (@var{src_value} @var{src_value} @dots{} = @var{dest_value}) @dots{}
+ [INTO @var{dest_vars}].
+@end display
-src_value may take the following forms:
- number
- string
- num1 THRU num2
- MISSING
- SYSMIS
- ELSE
-Open-ended ranges may be specified using LO or LOWEST for num1
-or HI or HIGHEST for num2.
+Following the RECODE keyword itself comes @var{src_vars} which is a list
+of variables whose values are to be transformed.
+These variables may be string variables or they may be numeric.
+However the list must be homogeneous; you may not mix string variables and
+numeric variables in the same recoding.
+
+After the list of source variables, there should be one or more @dfn{mappings}.
+Each mapping is enclosed in parentheses, and contains the source values and
+a destination value separated by a single @samp{=}.
+The source values are used to specify the values in the dataset which
+need to change, and the destination value specifies the new value
+to which they should be changed.
+Each @var{src_value} may take one of the following forms:
+@itemize @bullet
+@item @var{number}
+If the source variables are numeric then @var{src_value} may be a literal
+number.
+@item @var{string}
+If the source variables are string variables then @var{src_value} may be a
+literal string (like all strings, enclosed in single or double quotes).
+@item @var{num1} THRU @var{num2}
+This form is valid only when the source variables are numeric.
+It specifies all values in the range [@var{num1}, @var{num2}].
+Normally you would ensure that @var{num2} is greater than or equal to
+@var{num1}.
+If @var{num1} however is greater than @var{num2}, then the range
+[@var{num2},@var{num1}] will be used instead.
+Open-ended ranges may be specified using @samp{LO} or @samp{LOWEST}
+for @var{num1}
+or @samp{HI} or @samp{HIGHEST} for @var{num2}.
+@item @samp{MISSING}
+The literal keyword @samp{MISSING} matches both system missing and user
+missing values.
+It is valid for both numeric and string variables.
+@item @samp{SYSMIS}
+The literal keyword @samp{SYSMIS} matches system missing
+values.
+It is valid for numeric variables only.
+@item @samp{ELSE}
+The @samp{ELSE} keyword may be used to match any values which are
+not matched by any other @var{src_value} appearing in the command.
+If this keyword appears, it should be used in the last mapping of the
+command.
+@end itemize
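+As a sketch of how these forms combine, the following recodes a
+hypothetical numeric variable @var{age} in place into three bands,
+using open-ended ranges for the lowest and highest bands:
+@example
+* age is a hypothetical numeric variable.
+recode @var{age}
+ (LOWEST THRU 17 = 1)
+ (18 THRU 64 = 2)
+ (65 THRU HIGHEST = 3).
+@end example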
-dest_value may take the following forms:
- num
- string
- SYSMIS
- COPY
-@end display
+After the source values comes an @samp{=} and then the @var{dest_value}.
+The @var{dest_value} may take any of the following forms:
+@itemize @bullet
+@item @var{number}
+A literal numeric value to which the source values should be changed.
+This implies the destination variable must be numeric.
+@item @var{string}
+A literal string value (enclosed in quotation marks) to which the source
+values should be changed.
+This implies the destination variable must be a string variable.
+@item @samp{SYSMIS}
+The keyword @samp{SYSMIS} changes the value to the system missing value.
+This implies the destination variable must be numeric.
+@item @samp{COPY}
+The special keyword @samp{COPY} means that the source value should not be
+modified, but
+copied directly to the destination value.
+This is meaningful only if @samp{INTO @var{dest_vars}} is specified.
+@end itemize
-@cmd{RECODE} translates data from one range of values to
-another, via flexible user-specified mappings. Data may be remapped
-in-place or copied to new variables. Numeric and
-string data can be recoded.
-
-Specify the list of source variables, followed by one or more mapping
-specifications each enclosed in parentheses. If the data is to be
-copied to new variables, specify INTO, then the list of target
-variables. String target variables must already have been declared
-using @cmd{STRING} or another transformation, but numeric target
-variables can
-be created on the fly. There must be exactly as many target variables
-as source variables. Each source variable is remapped into its
-corresponding target variable.
-
-When INTO is not used, the input and output variables must be of the
-same type. Otherwise, string values can be recoded into numeric values,
-and vice versa. When this is done and there is no mapping for a
-particular value, either a value consisting of all spaces or the
-system-missing value is assigned, depending on variable type.
-
-Mappings are considered from left to right. The first src_value that
-matches the value of the source variable causes the target variable to
-receive the value indicated by the dest_value. Literal number, string,
-and range src_value's should be self-explanatory. MISSING as a
-src_value matches any user- or system-missing value. SYSMIS matches the
-system missing value only. ELSE is a catch-all that matches anything.
-It should be the last src_value specified.
-
-Numeric and string dest_value's should be self-explanatory. COPY
-causes the input values to be copied to the output. This is only valid
-if the source and target variables are of the same type. SYSMIS
-indicates the system-missing value.
-
-If the source variables are strings and the target variables are
-numeric, then there is one additional mapping available: (CONVERT),
-which must be the last specified mapping. CONVERT causes a number
-specified as a string to be converted to a numeric value. If the string
-cannot be parsed as a number, then the system-missing value is assigned.
-
-Multiple recodings can be specified on a single @cmd{RECODE} invocation.
+Mappings are considered from left to right.
+Therefore, if a value is matched by a @var{src_value} from more than
+one mapping, the first (leftmost) mapping which matches will be considered.
+Any subsequent matches will be ignored.
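+
+For instance, in the following sketch (the variable @var{score} is
+hypothetical) the value 50 falls within both ranges.
+Because the leftmost matching mapping takes precedence, 50 is recoded
+to 1, not 2:
+@example
+* score is a hypothetical numeric variable.
+recode @var{score}
+ (0 THRU 50 = 1)
+ (50 THRU 100 = 2).
+@end example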
+
+The clause @samp{INTO @var{dest_vars}} is optional.
+The behaviour of the command is slightly different depending on whether it
+appears or not.
+
+If @samp{INTO @var{dest_vars}} does not appear, then values will be recoded
+``in place''. This means that the recoded values are written back to the
+source variables from whence the original values came.
+In this case, the @var{dest_value} for every mapping must imply a value which
+has the same type as the @var{src_value}.
+For example, if the source value is a string value, it is not permissible for
+@var{dest_value} to be @samp{SYSMIS} or another form which implies a numeric
+result.
+In the following example, two numeric variables @var{x} and @var{y} are recoded
+in place.
+Zero is recoded to 99, the values 1 to 10 inclusive are unchanged,
+values 1000 and higher are recoded to the system-missing value and all other
+values are changed to 999:
+@example
+recode @var{x} @var{y}
+ (0 = 99)
+ (1 THRU 10 = COPY)
+ (1000 THRU HIGHEST = SYSMIS)
+ (ELSE = 999).
+@end example
+
+If @samp{INTO @var{dest_vars}} is given, then recoded values are written
+into the variables specified in @var{dest_vars}, which must therefore
+ contain a list of valid variable names.
+The number of variables in @var{dest_vars} must be the same as the number
+of variables in @var{src_vars}
+and the respective order of the variables in @var{dest_vars} corresponds to
+the order of @var{src_vars}.
+That is to say, recoded values whose
+original value came from the @var{n}th variable in @var{src_vars} will be
+placed into the @var{n}th variable in @var{dest_vars}.
+The source variables will be unchanged.
+If any mapping implies a string as its destination value, then the respective
+destination variable must already exist, or
+have been declared using @cmd{STRING} or another transformation.
+Numeric variables however will be automatically created if they don't already
+exist.
+The following example deals with two source variables, @var{a} and @var{b}
+which contain string values. Hence there are two destination variables
+@var{v1} and @var{v2}.
+Any cases where @var{a} or @var{b} contain the values @samp{apple},
+@samp{pear} or @samp{pomegranate} will result in @var{v1} or @var{v2} being
+filled with the string @samp{fruit} whilst cases with
+@samp{tomato}, @samp{lettuce} or @samp{carrot} will result in @samp{vegetable}.
+Any other values will produce the result @samp{unknown}:
+@example
+string @var{v1} (a20).
+string @var{v2} (a20).
+
+recode @var{a} @var{b}
+ ("apple" "pear" "pomegranate" = "fruit")
+ ("tomato" "lettuce" "carrot" = "vegetable")
+ (ELSE = "unknown")
+ into @var{v1} @var{v2}.
+@end example
+
+There is one very special mapping, not mentioned above.
+If the source variable is a string variable
+then a mapping may be specified as @samp{(CONVERT)}.
+This mapping, if it appears, must be the last mapping given, and
+the @samp{INTO @var{dest_vars}} clause must also be given and
+must not refer to a string variable.
+@samp{CONVERT} causes a number specified as a string to
+be converted to a numeric value.
+For example it will convert the string @samp{"3"} into the numeric
+value 3 (note that it will not convert @samp{three} into 3).
+If the string cannot be parsed as a number, then the system-missing value
+is assigned instead.
+In the following example, cases where the value of @var{x} (a string variable)
+is the empty string are recoded to 999 and all others are converted to the
+numeric equivalent of the input value. The results are placed into the
+numeric variable @var{y}:
+@example
+recode @var{x}
+ ("" = 999)
+ (convert)
+ into @var{y}.
+@end example
+
+It is possible to specify multiple recodings on a single command.
Introduce additional recodings with a slash (@samp{/}) to
-separate them from the previous recodings.
+separate them from the previous recodings:
+@example
+recode
+ @var{a} (2 = 22) (else = 99)
+ /@var{b} (1 = 3) into @var{z}
+ .
+@end example
+@noindent Here we have two recodings. The first affects the source variable
+@var{a} and recodes in-place the value 2 into 22 and all other values to 99.
+The second recoding copies the values of @var{b} into the variable @var{z},
+changing any instances of 1 into 3.
@node SORT CASES
@section SORT CASES
SORT CASES BY var_list[(@{D|A@})] [ var_list[(@{D|A@})] ] ...
@end display
-@cmd{SORT CASES} sorts the active file by the values of one or more
+@cmd{SORT CASES} sorts the active dataset by the values of one or more
variables.
Specify BY and a list of variables to sort by. By default, variables
@cmd{SORT CASES} is a procedure. It causes the data to be read.
-@cmd{SORT CASES} attempts to sort the entire active file in main memory.
+@cmd{SORT CASES} attempts to sort the entire active dataset in main memory.
If workspace is exhausted, it falls back to a merge sort algorithm that
involves creating numerous temporary files.
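+
+As a brief sketch, the following sorts by a hypothetical variable
+@var{region} in ascending order and, within each region, by a
+hypothetical variable @var{income} in descending order:
+@example
+* region and income are hypothetical variable names.
+sort cases by region (A) income (D).
+@end example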