-@node Data Selection, Conditionals and Looping, Data Manipulation, Top
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+@node Data Selection
@chapter Selecting data for analysis
-This chapter documents PSPP commands that temporarily or permanently
-select data records from the active file for analysis.
+This chapter documents @pspp{} commands that temporarily or permanently
+select data records from the active dataset for analysis.
@menu
* FILTER:: Exclude cases based on a variable.
-* N OF CASES:: Limit the size of the active file.
+* N OF CASES:: Limit the size of the active dataset.
* SAMPLE:: Select a specified proportion of cases.
* SELECT IF:: Permanently delete selected cases.
* SPLIT FILE:: Do multiple analyses with one command.
* WEIGHT:: Weight cases by a variable.
@end menu
-@node FILTER, N OF CASES, Data Selection, Data Selection
+@node FILTER
@section FILTER
@vindex FILTER
@display
-FILTER BY var_name.
+FILTER BY @var{var_name}.
FILTER OFF.
@end display
@cmd{FILTER} allows a boolean-valued variable to be used to select
cases from the data stream for processing.
-To set up filtering, specify BY and a variable name. Keyword
+To set up filtering, specify @subcmd{BY} and a variable name. Keyword
BY is optional but recommended. Cases which have a zero or system- or
user-missing value are excluded from analysis, but not deleted from the
data stream. Cases with other values are analyzed.
Filtering takes place immediately before cases pass to a procedure for
analysis. Only one filter variable may be active at a time. Normally,
case filtering continues until it is explicitly turned off with @code{FILTER
-OFF}. However, if @cmd{FILTER} is placed after TEMPORARY, it filters only
+OFF}. However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
the next procedure or procedure-like command.
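+
+As a minimal sketch, assuming a hypothetical dataset with a numeric
+variable @code{score} and a 0/1 indicator @code{valid} (neither name
+comes from this manual), the following might restrict one procedure to
+the cases where @code{valid} is nonzero and then restore all cases:
+
+@example
+FILTER BY valid.
+DESCRIPTIVES score.
+FILTER OFF.
+DESCRIPTIVES score.
+@end example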
@node N OF CASES
@section N OF CASES
@vindex N OF CASES
@display
-N [OF CASES] num_of_cases [ESTIMATED].
+N [OF CASES] @var{num_of_cases} [ESTIMATED].
@end display
-Sometimes you may want to disregard cases of your input. @cmd{N} can
-do this. @code{N 100} tells PSPP to disregard all cases after the
-first 100.
+@cmd{N OF CASES} limits the number of cases processed by any
+procedures that follow it in the command stream. @code{N OF CASES
+100}, for example, tells @pspp{} to disregard all cases after the first
+100.
-If the value specified for @cmd{N} is greater than the number of cases
-read in, the value is ignored.
+When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
+only the next procedure (@pxref{TEMPORARY}). Otherwise, cases beyond
+the limit specified are not processed by any later procedure.
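+
+For illustration only, assuming a hypothetical numeric variable
+@code{x}, the limit below should apply to the first
+@cmd{DESCRIPTIVES} but not the second, because @cmd{N OF CASES}
+follows @cmd{TEMPORARY}:
+
+@example
+TEMPORARY.
+N OF CASES 100.
+DESCRIPTIVES x.
+DESCRIPTIVES x.
+@end example
+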
-@cmd{N} does not discard cases or prevent them from being read. It
-just causes cases beyond the last one specified to be ignored by data
-analysis commands.
+If the limit specified on @cmd{N OF CASES} is greater than the number
+of cases in the active dataset, it has no effect.
-A later @cmd{N} command can increase or decrease the number of cases
-selected. (To select all the cases without knowing how many there are,
-specify a very high number: 100000 or whatever you think is large enough.)
+When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
+IF}, the case limit is applied to the cases obtained after sampling or
+case selection, regardless of how @cmd{N OF CASES} is placed relative
+to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the
+commands @code{N OF CASES 100} and @code{SAMPLE .5}, given together
+in either order, first randomly sample approximately half of the
+active dataset's cases, then select the first 100 of those sampled.
-Transformation procedures performed after @cmd{N} is executed
-@emph{do} cause cases to be discarded.
-
-@cmd{SAMPLE} and @cmd{SELECT IF} have
-precedence over @cmd{N}---the same results are obtained by both of the
-following fragments, given the same random number seeds:
-
-@example
-@i{@dots{}set up, read in data@dots{}}
-N 100.
-SAMPLE .5.
-@i{@dots{}analyze data@dots{}}
-
-@i{@dots{}set up, read in data@dots{}}
-SAMPLE .5.
-N 100.
-@i{@dots{}analyze data@dots{}}
-@end example
-
-Both fragments above first randomly sample approximately half of the
-cases, then select the first 100 of those sampled.
-
-@cmd{N} with the @code{ESTIMATED} keyword gives an
-estimated number of cases before @cmd{DATA LIST} or another command to
-read in data. @code{ESTIMATED} never limits the number of cases
-processed by procedures. PSPP currently does not make use of
-case count estimates.
-
-When @cmd{N} is specified after @cmd{TEMPORARY}, it affects only
-the next procedure (@pxref{TEMPORARY}).
+@cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
+number of cases before @cmd{DATA LIST} or another command to read in
+data. @code{ESTIMATED} never limits the number of cases processed by
+procedures. @pspp{} currently does not make use of case count estimates.
@node SAMPLE
@section SAMPLE
@vindex SAMPLE
@display
-SAMPLE num1 [FROM num2].
+SAMPLE @var{num1} [FROM @var{num2}].
@end display
@cmd{SAMPLE} randomly samples a proportion of the cases in the active
-file. Unless it follows @cmd{TEMPORARY}, it operates as a
+dataset. Unless it follows @cmd{TEMPORARY}, it operates as a
-transformation, permanently removing cases from the active file.
+transformation, permanently removing cases from the active dataset.
The proportion to sample can be expressed as a single number between 0
-and 1. If @code{k} is the number specified, and @code{N} is the number
-of currently-selected cases in the active file, then after
-@code{SAMPLE @var{k}.}, approximately @code{k*N} cases will be
+and 1. If @var{k} is the number specified, and @var{N} is the number
+of currently-selected cases in the active dataset, then after
+@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
selected.
-The proportion to sample can also be specified in the style @code{SAMPLE
+The proportion to sample can also be specified in the style @subcmd{SAMPLE
@var{m} FROM @var{N}}. With this style, cases are selected as follows:
@enumerate
@item
If @var{N} is equal to the number of currently-selected cases in the
-active file, exactly @var{m} cases will be selected.
+active dataset, exactly @var{m} cases will be selected.
@item
If @var{N} is greater than the number of currently-selected cases in the
-active file, an equivalent proportion of cases will be selected.
+active dataset, an equivalent proportion of cases will be selected.
@item
If @var{N} is less than the number of currently-selected cases in the
-active, exactly @var{m} cases will be selected @emph{from the first
+active dataset, exactly @var{m} cases will be selected @emph{from the first
-@var{N} cases in the active file.}
+@var{N} cases in the active dataset.}
@end enumerate
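+
+A brief sketch of both styles, assuming a hypothetical numeric
+variable @code{x}; the proportions and counts are arbitrary.
+@cmd{TEMPORARY} is used here only so that each sample affects a single
+procedure and the two styles do not compound:
+
+@example
+TEMPORARY.
+SAMPLE .25.
+DESCRIPTIVES x.
+
+TEMPORARY.
+SAMPLE 50 FROM 200.
+DESCRIPTIVES x.
+@end example
+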
@cmd{SAMPLE} and @cmd{SELECT IF} are performed in
differing endianness or floating-point formats. By default, the
random number seed is based on the system time.
-@node SELECT IF, SPLIT FILE, SAMPLE, Data Selection
+@node SELECT IF
@section SELECT IF
@vindex SELECT IF
@display
-SELECT IF expression.
+SELECT IF @var{expression}.
@end display
-@cmd{SELECT IF} selects cases for analysis based on the value of a
-boolean expression. Cases not selected are permanently eliminated
-from the active file, unless @cmd{TEMPORARY} is in effect
+@cmd{SELECT IF} selects cases for analysis based on the value of
+@var{expression}. Cases not selected are permanently eliminated
+from the active dataset, unless @cmd{TEMPORARY} is in effect
(@pxref{TEMPORARY}).
Specify a boolean expression (@pxref{Expressions}). If the value of the
(@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
(@pxref{LAG}).
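+
+As a hedged illustration, assuming a hypothetical numeric variable
+@code{age}, the first selection below affects only the procedure that
+follows it, while the second permanently drops the unselected cases:
+
+@example
+TEMPORARY.
+SELECT IF (age >= 18).
+DESCRIPTIVES age.
+
+SELECT IF (age >= 18).
+DESCRIPTIVES age.
+@end example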
-@node SPLIT FILE, TEMPORARY, SELECT IF, Data Selection
+@node SPLIT FILE
@section SPLIT FILE
@vindex SPLIT FILE
@display
-SPLIT FILE [@{LAYERED, SEPARATE@}] BY var_list.
+SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
SPLIT FILE OFF.
@end display
variable values for the group are printed along with the analysis.
When a list of variable names is specified, one of the keywords
-LAYERED or SEPARATE may also be specified. If provided, either
+@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either
keyword are ignored.
Groups are formed only by @emph{adjacent} cases. To create a split
using a variable where like values are not adjacent in the working file,
you should first sort the data by that variable (@pxref{SORT CASES}).
-Specify OFF to disable @cmd{SPLIT FILE} and resume analysis of the
-entire active file as a single group of data.
+Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
+entire active dataset as a single group of data.
When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).
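+
+A minimal sketch of the sort-then-split pattern described above,
+assuming hypothetical variables @code{region} and @code{income}:
+
+@example
+SORT CASES BY region.
+SPLIT FILE BY region.
+DESCRIPTIVES income.
+SPLIT FILE OFF.
+@end example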
-@node TEMPORARY, WEIGHT, SPLIT FILE, Data Selection
+@node TEMPORARY
@section TEMPORARY
@vindex TEMPORARY
@cmd{TEMPORARY} is used to make the effects of transformations
following its execution temporary. These transformations will
affect only the execution of the next procedure or procedure-like
-command. Their effects will not be saved to the active file.
+command. Their effects will not be saved to the active dataset.
The only specification on @cmd{TEMPORARY} is the command name.
20
24
END DATA.
+
COMPUTE X=X/2.
+
TEMPORARY.
COMPUTE X=X+3.
+
DESCRIPTIVES X.
DESCRIPTIVES X.
@end example
10.5, 13, 15. The data read by the first @cmd{DESCRIPTIVES} are 1, 2,
5, 7.5, 10, 12.
-@node WEIGHT, , TEMPORARY, Data Selection
+@node WEIGHT
@section WEIGHT
@vindex WEIGHT
@display
-WEIGHT BY var_name.
+WEIGHT BY @var{var_name}.
WEIGHT OFF.
@end display
@cmd{WEIGHT} assigns cases varying weights,
-changing the frequency distribution of the active file. Execution of
+changing the frequency distribution of the active dataset. Execution of
@cmd{WEIGHT} is delayed until data have been read.
If a variable name is specified, @cmd{WEIGHT} causes the values of that
variable to be used as weighting factors for subsequent statistical
-procedures. Use of keyword BY is optional but recommended. Weighting
+procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting
variables must be numeric. Scratch variables may not be used for
weighting (@pxref{Scratch Variables}).
-When OFF is specified, subsequent statistical procedures will weight all
+When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
cases equally.
A positive integer weighting factor @var{w} on a case will yield the
When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).
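+
+For example, assuming a hypothetical numeric variable @code{freq}
+that holds case frequencies and an analysis variable @code{score},
+weighting might be switched on for one procedure and then off again:
+
+@example
+WEIGHT BY freq.
+DESCRIPTIVES score.
+WEIGHT OFF.
+@end example
+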
-@cmd{WEIGHT} does not cause cases in the active file to be replicated in
-memory.
-@setfilename ignored
+@cmd{WEIGHT} does not cause cases in the active dataset to be
+replicated in memory.