+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
@node Data Selection
@chapter Selecting data for analysis
@vindex FILTER
@display
-FILTER BY var_name.
+FILTER BY @var{var_name}.
FILTER OFF.
@end display
@cmd{FILTER} allows a boolean-valued variable to be used to select
cases from the data stream for processing.
-To set up filtering, specify BY and a variable name. Keyword
+To set up filtering, specify @subcmd{BY} and a variable name. Keyword
BY is optional but recommended. Cases which have a zero or system- or
user-missing value are excluded from analysis, but not deleted from the
data stream. Cases with other values are analyzed.
Filtering takes place immediately before cases pass to a procedure for
analysis. Only one filter variable may be active at a time. Normally,
case filtering continues until it is explicitly turned off with @code{FILTER
-OFF}. However, if @cmd{FILTER} is placed after TEMPORARY, it filters only
+OFF}. However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
the next procedure or procedure-like command.
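+
+As a minimal sketch of the behavior described above, the following
+syntax filters on a computed indicator and then turns filtering off.
+The data and the variable names @code{id}, @code{age} and @code{over30}
+are invented for illustration only.
+
+@example
+DATA LIST LIST /id age.
+BEGIN DATA.
+1 25
+2 40
+3 31
+END DATA.
+* over30 is 1 where age exceeds 30, and 0 otherwise.
+COMPUTE over30 = (age > 30).
+FILTER BY over30.
+* Only cases 2 and 3 are analyzed here.
+DESCRIPTIVES age.
+FILTER OFF.
+* All three cases are analyzed again.
+DESCRIPTIVES age.
+@end example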
@node N OF CASES
@vindex N OF CASES
@display
-N [OF CASES] num_of_cases [ESTIMATED].
+N [OF CASES] @var{num_of_cases} [ESTIMATED].
@end display
@cmd{N OF CASES} limits the number of cases processed by any
IF}, the case limit is applied to the cases obtained after sampling or
case selection, regardless of how @cmd{N OF CASES} is placed relative
to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the
-commands @code{N OF CASES 100} and @code{SAMPLE .5} will both randomly
+commands @code{N OF CASES 100} and @code{SAMPLE .5} both randomly
sample approximately half of the active dataset's cases, then select the
first 100 of those sampled, regardless of their order in the command
file.
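+
+To illustrate the point, the two fragments below should behave the same
+way: each samples about half of the cases and then keeps the first 100
+of those sampled.  This is only a sketch, and it assumes that an active
+dataset has already been read.
+
+@example
+N OF CASES 100.
+SAMPLE .5.
+@end example
+
+@example
+SAMPLE .5.
+N OF CASES 100.
+@end example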
@vindex SAMPLE
@display
-SAMPLE num1 [FROM num2].
+SAMPLE @var{num1} [FROM @var{num2}].
@end display
@cmd{SAMPLE} randomly samples a proportion of the cases in the active
transformation, permanently removing cases from the active dataset.
The proportion to sample can be expressed as a single number between 0
-and 1. If @code{k} is the number specified, and @code{N} is the number
+and 1. If @var{k} is the number specified, and @var{N} is the number
of currently-selected cases in the active dataset, then after
-@code{SAMPLE @var{k}.}, approximately @code{k*N} cases will be
+@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases are
selected.
-The proportion to sample can also be specified in the style @code{SAMPLE
+The proportion to sample can also be specified in the style @subcmd{SAMPLE
@var{m} FROM @var{N}}. With this style, cases are selected as follows:
@enumerate
@item
If @var{N} is equal to the number of currently-selected cases in the
-active dataset, exactly @var{m} cases will be selected.
+active dataset, exactly @var{m} cases are selected.
@item
If @var{N} is greater than the number of currently-selected cases in the
-active dataset, an equivalent proportion of cases will be selected.
+active dataset, an equivalent proportion of cases is selected.
@item
If @var{N} is less than the number of currently-selected cases in the
-active, exactly @var{m} cases will be selected @emph{from the first
+active dataset, exactly @var{m} cases are selected @emph{from the first
@var{N} cases in the active dataset.}
@end enumerate
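+
+As a worked illustration, consider the command below applied to active
+datasets of different sizes.  The figures are invented for illustration,
+and the second case reads ``equivalent proportion'' as
+@var{m}/@var{N}, which is an assumption.
+
+@example
+SAMPLE 50 FROM 200.
+@end example
+
+@itemize @bullet
+@item
+With 200 currently-selected cases, exactly 50 cases are selected.
+@item
+With 100 currently-selected cases, about 25 cases (50/200 of 100) are selected.
+@item
+With 400 currently-selected cases, exactly 50 cases are selected from
+the first 200 cases.
+@end itemize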
@vindex SELECT IF
@display
-SELECT IF expression.
+SELECT IF @var{expression}.
@end display
-@cmd{SELECT IF} selects cases for analysis based on the value of a
-boolean expression. Cases not selected are permanently eliminated
+@cmd{SELECT IF} selects cases for analysis based on the value of
+@var{expression}. Cases not selected are permanently eliminated
from the active dataset, unless @cmd{TEMPORARY} is in effect
(@pxref{TEMPORARY}).
Specify a boolean expression (@pxref{Expressions}). If the value of the
-expression is true for a particular case, the case will be analyzed. If
-the expression has a false or missing value, then the case will be
+expression is true for a particular case, the case is analyzed. If
+the expression has a false or missing value, then the case is
deleted from the data stream.
Place @cmd{SELECT IF} as early in the command file as
possible. Cases that are deleted early can be processed more
efficiently in time and space.
+Once cases have been deleted from the active dataset using @cmd{SELECT IF}, they
+cannot be reinstated.
+If you want to be able to reinstate cases, use @cmd{FILTER} (@pxref{FILTER})
+instead.
When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
(@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
(@pxref{LAG}).
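+
+The following fragment sketches how @cmd{TEMPORARY} limits the effect of
+@cmd{SELECT IF} to a single procedure.  It assumes an active dataset
+with the hypothetical numeric variables @code{age} and @code{salary}.
+
+@example
+TEMPORARY.
+SELECT IF (age < 30).
+* Only cases with age below 30 are analyzed here.
+DESCRIPTIVES salary.
+* The selection has lapsed, so all cases are analyzed here.
+DESCRIPTIVES salary.
+@end example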
+@subsection Example Select-If
+
+A shop steward is interested in the salaries of younger personnel in a firm.
+The file @file{personnel.sav} provides the salaries of all the workers and their
+dates of birth. The syntax in @ref{select-if:ex} shows how @cmd{SELECT IF} can
+be used to limit analysis only to those persons born after December 31, 1999.
+
+@float Example, select-if:ex
+@psppsyntax {select-if.sps}
+@caption {Using @cmd{SELECT IF} to select persons born on or after a certain date.}
+@end float
+
+From @ref{select-if:res} one can see that there are 56 persons listed in the dataset,
+and 17 of them were born after December 31, 1999.
+
+@float Result, select-if:res
+@psppoutput {select-if}
+@caption {Salary descriptives before and after the @cmd{SELECT IF} transformation.}
+@end float
+
+Note that the @file{personnel.sav} file from which the data were read is unaffected.
+The transformation affects only the active file.
+
@node SPLIT FILE
@section SPLIT FILE
@vindex SPLIT FILE
@display
-SPLIT FILE [@{LAYERED, SEPARATE@}] BY var_list.
+SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
SPLIT FILE OFF.
@end display
variable values for the group are printed along with the analysis.
When a list of variable names is specified, one of the keywords
-LAYERED or SEPARATE may also be specified. If provided, either
+@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either
keyword is ignored.
-Groups are formed only by @emph{adjacent} cases. To create a split
+Groups are formed only by @emph{adjacent} cases. To create a split
using a variable where like values are not adjacent in the working file,
you should first sort the data by that variable (@pxref{SORT CASES}).
-Specify OFF to disable @cmd{SPLIT FILE} and resume analysis of the
+Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
entire active dataset as a single group of data.
When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).
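+
+A minimal sketch of the sort-then-split pattern described above, assuming
+an active dataset with the hypothetical variables @code{region} and
+@code{sales}:
+
+@example
+SORT CASES BY region.
+SPLIT FILE BY region.
+* One set of statistics is produced for each value of region.
+DESCRIPTIVES sales.
+SPLIT FILE OFF.
+* The whole dataset is analyzed as a single group again.
+DESCRIPTIVES sales.
+@end example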
+@subsection Example Split
+
+The file @file{horticulture.sav} contains data describing the @exvar{yield}
+of a number of horticultural specimens which have been subjected to
+various @exvar{treatment}s. If we wanted to investigate linear statistics
+of the @exvar{yield}, one way to do this is using the @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}).
+However, it is reasonable to expect the mean to be different depending
+on the @exvar{treatment}. So we might want to perform three separate
+procedures --- one for each treatment.
+@footnote{There are other, possibly better, ways to achieve a similar result
+using the @cmd{MEANS} or @cmd{EXAMINE} commands.}
+@ref{split:ex} shows how this can be done automatically using
+the @cmd{SPLIT FILE} command.
+
+@float Example, split:ex
+@psppsyntax {split.sps}
+@caption {Running @cmd{DESCRIPTIVES} on each value of @exvar{treatment}}
+@end float
+
+In @ref{split:res} you can see that the table of descriptive statistics
+appears 3 times --- once for each value of @exvar{treatment}.
+In this example, @samp{N}, the number of observations, is identical in
+all splits.  This is because the experiment was deliberately designed
+that way.  However, in general one can expect a different @samp{N} for each
+split.
+
+@float Result, split:res
+@psppoutput {split}
+@caption {The results of running @cmd{DESCRIPTIVES} with an active split}
+@end float
+
+Unless @cmd{TEMPORARY} was used, after a split has been defined for
+a dataset it remains active until explicitly disabled.
+In the graphical user interface, the active split variable (if any) is
+displayed in the status bar (@pxref{split-status-bar:scr}).
+If a dataset is saved to a system file (@pxref{SAVE}) whilst a split
+is active, the split status is stored in the file and will be
+automatically loaded when that file is loaded.
+
+@float Screenshot, split-status-bar:scr
+@psppimage {split-status-bar}
+@caption {The status bar indicating that the data set is split using the @exvar{treatment} variable}
+@end float
+
+
@node TEMPORARY
@section TEMPORARY
@vindex TEMPORARY
@end display
@cmd{TEMPORARY} is used to make the effects of transformations
-following its execution temporary. These transformations will
+following its execution temporary. These transformations
affect only the execution of the next procedure or procedure-like
-command. Their effects will not be saved to the active dataset.
+command. Their effects are not saved to the active dataset.
The only specification on @cmd{TEMPORARY} is the command name.
Scratch variables cannot be used following @cmd{TEMPORARY}.
-An example may help to clarify:
-
-@example
-DATA LIST /X 1-2.
-BEGIN DATA.
- 2
- 4
-10
-15
-20
-24
-END DATA.
-COMPUTE X=X/2.
-TEMPORARY.
-COMPUTE X=X+3.
-DESCRIPTIVES X.
-DESCRIPTIVES X.
-@end example
+@subsection Example Temporary
+
+In @ref{temporary:ex} there are two @cmd{COMPUTE} transformations.  One
+of them immediately follows a @cmd{TEMPORARY} command, and therefore has
+effect only for the next procedure, which in this case is the first
+@cmd{DESCRIPTIVES} command.
+
+@float Example, temporary:ex
+@psppsyntax {temporary.sps}
+@caption {Running a @cmd{COMPUTE} transformation after @cmd{TEMPORARY}}
+@end float
+
+The data read by the first @cmd{DESCRIPTIVES} procedure are 4, 5, 8,
+10.5, 13, 15. The data read by the second @cmd{DESCRIPTIVES} procedure are 1, 2,
+5, 7.5, 10, 12. This is because the second @cmd{COMPUTE} transformation
+has no effect on the second @cmd{DESCRIPTIVES} procedure. You can check these
+figures in @ref{temporary:res}.
+
+@float Result, temporary:res
+@psppoutput {temporary}
+@caption {The results of running two consecutive @cmd{DESCRIPTIVES} commands after
+ a temporary transformation}
+@end float
-The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
-10.5, 13, 15. The data read by the first @cmd{DESCRIPTIVES} are 1, 2,
-5, 7.5, 10, 12.
@node WEIGHT
@section WEIGHT
@vindex WEIGHT
@display
-WEIGHT BY var_name.
+WEIGHT BY @var{var_name}.
WEIGHT OFF.
@end display
If a variable name is specified, @cmd{WEIGHT} causes the values of that
variable to be used as weighting factors for subsequent statistical
-procedures. Use of keyword BY is optional but recommended. Weighting
+procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting
variables must be numeric. Scratch variables may not be used for
weighting (@pxref{Scratch Variables}).
-When OFF is specified, subsequent statistical procedures will weight all
+When @subcmd{OFF} is specified, subsequent statistical procedures weight all
cases equally.
-A positive integer weighting factor @var{w} on a case will yield the
+A positive integer weighting factor @var{w} on a case yields the
same statistical output as would replicating the case @var{w} times.
A weighting factor of 0 is treated for statistical purposes as if the
case did not exist in the input. Weighting values need not be
@cmd{WEIGHT} does not cause cases in the active dataset to be
replicated in memory.
+
+
+@subsection Example Weights
+
+One could define a dataset containing an inventory of stock items.
+It would be reasonable to use a string variable for a description of the
+item, and a numeric variable for the number in stock, as in @ref{weight:ex}.
+
+@float Example, weight:ex
+@psppsyntax {weight.sps}
+@caption {Setting the weight on the variable @exvar{quantity}}
+@end float
+
+One analysis which would almost certainly be of interest is
+the relative amounts of each item in stock.
+However, without setting a weight variable, @cmd{FREQUENCIES}
+(@pxref{FREQUENCIES}) does not tell us what we want to know, since
+there is only one case for each stock item. @ref{weight:res} shows the
+difference between the weighted and unweighted frequency tables.
+
+@float Result, weight:res
+@psppoutput {weight}
+@caption {Weighted and unweighted frequency tables of @exvar{items}}
+@end float