@c PSPP - a program for statistical analysis.
@c Copyright (C) 2017 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@chapter Selecting data for analysis

This chapter documents @pspp{} commands that temporarily or permanently
select data records from the active dataset for analysis.

@menu
* FILTER::                      Exclude cases based on a variable.
* N OF CASES::                  Limit the size of the active dataset.
* SAMPLE::                      Select a specified proportion of cases.
* SELECT IF::                   Permanently delete unselected cases.
* SPLIT FILE::                  Do multiple analyses with one command.
* TEMPORARY::                   Make transformations' effects temporary.
* WEIGHT::                      Weight cases by a variable.
@end menu
@node FILTER
@section FILTER

@display
FILTER BY @var{var_name}.
@end display

@cmd{FILTER} allows a boolean-valued variable to be used to select
cases from the data stream for processing.

To set up filtering, specify @subcmd{BY} and a variable name. Keyword
@subcmd{BY} is optional but recommended. Cases which have a zero or system- or
user-missing value are excluded from analysis, but not deleted from the
data stream. Cases with other values are analyzed.
To filter based on a different condition, use
transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
filter variable of the required form, then specify that variable on
@cmd{FILTER}.

@code{FILTER OFF} turns off case filtering.

Filtering takes place immediately before cases pass to a procedure for
analysis. Only one filter variable may be active at a time. Normally,
case filtering continues until it is explicitly turned off with @code{FILTER
OFF}. However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
the next procedure or procedure-like command.
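
As a minimal sketch of filtering on a condition rather than on an
existing variable, the following syntax computes a filter variable and
then applies it. The variable names @code{age}, @code{score}, and
@code{adult}, and the inline data, are invented for illustration.

@example
DATA LIST LIST /age (F3.0) score (F5.1).
BEGIN DATA.
17 62.5
25 71.0
42 58.5
16 80.0
END DATA.

* Build a 0/1 filter variable: 1 for adults, 0 otherwise.
COMPUTE adult = (age >= 18).
FILTER BY adult.

* Only the cases with age 25 and 42 are analyzed here.
DESCRIPTIVES score.

* Restore analysis of all four cases.
FILTER OFF.
DESCRIPTIVES score.
@end example

All four cases remain in the active dataset throughout; the filter
only controls which of them the first @cmd{DESCRIPTIVES} sees.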
@node N OF CASES
@section N OF CASES

@display
N [OF CASES] @var{num_of_cases} [ESTIMATED].
@end display

@cmd{N OF CASES} limits the number of cases processed by any
procedures that follow it in the command stream. @code{N OF CASES
100}, for example, tells @pspp{} to disregard all cases after the first
100.

When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
only the next procedure (@pxref{TEMPORARY}). Otherwise, cases beyond
the limit specified are not processed by any later procedure.

If the limit specified on @cmd{N OF CASES} is greater than the number
of cases in the active dataset, it has no effect.

When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
IF}, the case limit is applied to the cases obtained after sampling or
case selection, regardless of how @cmd{N OF CASES} is placed relative
to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the
commands @code{N OF CASES 100} and @code{SAMPLE .5}, given in either
order, randomly sample approximately half of the active dataset's
cases, then select the first 100 of those sampled.

@cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
number of cases before @cmd{DATA LIST} or another command to read in
data. @code{ESTIMATED} never limits the number of cases processed by
procedures. @pspp{} currently does not make use of case count estimates.
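
The interaction with @cmd{TEMPORARY} can be sketched as follows. The
sketch assumes that an active dataset of more than 100 cases is
already loaded and that it contains a numeric variable, hypothetically
named @code{x}.

@example
* Assumes an active dataset with more than 100 cases is already loaded.
TEMPORARY.
N OF CASES 100.

* This procedure sees only the first 100 cases.
DESCRIPTIVES x.

* The temporary limit has expired, so this procedure sees every case.
DESCRIPTIVES x.
@end example

Without the @cmd{TEMPORARY} line, both procedures would be limited to
the first 100 cases.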
@node SAMPLE
@section SAMPLE

@display
SAMPLE @var{num1} [FROM @var{num2}].
@end display

@cmd{SAMPLE} randomly samples a proportion of the cases in the active
dataset. Unless it follows @cmd{TEMPORARY}, it operates as a
transformation, permanently removing cases from the active dataset.

The proportion to sample can be expressed as a single number between 0
and 1. If @var{k} is the number specified, and @var{N} is the number
of currently-selected cases in the active dataset, then after
@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
selected.

The proportion to sample can also be specified in the style @subcmd{SAMPLE
@var{m} FROM @var{N}}. With this style, cases are selected as follows:

@itemize @bullet
@item
If @var{N} is equal to the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected.

@item
If @var{N} is greater than the number of currently-selected cases in the
active dataset, an equivalent proportion (@var{m}/@var{N}) of cases
will be selected.

@item
If @var{N} is less than the number of currently-selected cases in the
active dataset, exactly @var{m} cases will be selected @emph{from the first
@var{N} cases in the active dataset.}
@end itemize

@cmd{SAMPLE} and @cmd{SELECT IF} are performed in
the order specified by the syntax file.

@cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
of ordering in the syntax file (@pxref{N OF CASES}).

The same values for @cmd{SAMPLE} may result in different samples. To
obtain the same sample, use the @code{SET} command to set the random
number seed to the same value before each @cmd{SAMPLE}. Different
samples may still result when the file is processed on systems with
differing endianness or floating-point formats. By default, the
random number seed is based on the system time.
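
A minimal sketch of reproducible sampling follows. It assumes that an
active dataset is already loaded and that it contains a numeric
variable, hypothetically named @code{x}; the seed value is arbitrary.

@example
* Assumes an active dataset is already loaded.
SET SEED=54321.
TEMPORARY.
SAMPLE .1.
DESCRIPTIVES x.

* Resetting the seed to the same value repeats the same sample.
SET SEED=54321.
TEMPORARY.
SAMPLE .1.
DESCRIPTIVES x.
@end example

@cmd{TEMPORARY} is used here so that the first @cmd{SAMPLE} does not
permanently remove cases before the second one runs.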
@node SELECT IF
@section SELECT IF

@display
SELECT IF @var{expression}.
@end display

@cmd{SELECT IF} selects cases for analysis based on the value of
@var{expression}. Cases not selected are permanently eliminated
from the active dataset, unless @cmd{TEMPORARY} is in effect
(@pxref{TEMPORARY}).

Specify a boolean expression (@pxref{Expressions}). If the value of the
expression is true for a particular case, the case will be analyzed. If
the expression has a false or missing value, then the case will be
deleted from the data stream.

Place @cmd{SELECT IF} as early in the command file as
possible. Cases that are deleted early can be processed more
efficiently in time and space.

When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
(@pxref{TEMPORARY}), the @cmd{LAG} function may not be used.
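
The treatment of false and missing values can be sketched with a small
inline dataset. The variable names @code{id} and @code{height} and the
data values are invented for illustration; the lone period in the
second data line is read as a system-missing value.

@example
DATA LIST LIST /id (F2.0) height (F5.1).
BEGIN DATA.
1 172.5
2 .
3 158.0
4 181.0
END DATA.

* Cases 1 and 4 are kept. Case 3 is false and case 2 is missing, so both are deleted.
SELECT IF (height > 160).
LIST.
@end example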
@node SPLIT FILE
@section SPLIT FILE

@display
SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
@end display

@cmd{SPLIT FILE} allows multiple sets of data present in one data
file to be analyzed separately using single statistical procedure
commands.

Specify a list of variable names to analyze multiple sets of
data separately. Groups of adjacent cases having the same values for these
variables are analyzed by statistical procedure commands as one group.
An independent analysis is carried out for each group of cases, and the
variable values for the group are printed along with the analysis.

When a list of variable names is specified, one of the keywords
@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either

Groups are formed only by @emph{adjacent} cases. To create a split
using a variable where like values are not adjacent in the working file,
you should first sort the data by that variable (@pxref{SORT CASES}).

Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
entire active dataset as a single group of data.

When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).
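
Because groups are formed only from adjacent cases, a typical sequence
sorts before splitting. The sketch below assumes that an active
dataset is already loaded and that it contains variables
hypothetically named @code{group} and @code{score}.

@example
* Assumes an active dataset is already loaded.
* Sorting makes cases with equal group values adjacent.
SORT CASES BY group.
SPLIT FILE BY group.

* One set of statistics is produced for each value of group.
DESCRIPTIVES score.

* Resume analysis of the dataset as a single group.
SPLIT FILE OFF.
DESCRIPTIVES score.
@end example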
@subsection Example Split

The file @file{horticulture.sav} contains data describing the @exvar{yield}
of a number of horticultural specimens which have been subjected to
various @exvar{treatment}s. If we wanted to investigate linear statistics
of the @exvar{yield}, one way to do this is using the @cmd{DESCRIPTIVES}
command (@pxref{DESCRIPTIVES}).
However, it is reasonable to expect the mean to be different depending
on the @exvar{treatment}. So we might want to perform three separate
procedures, one for each treatment.
@footnote{There are other, possibly better, ways to achieve a similar result
using the @cmd{MEANS} or @cmd{EXAMINE} commands.}
@ref{split:ex} shows how this can be done automatically using
the @cmd{SPLIT FILE} command.

@float Example, split:ex
@psppsyntax {split.sps}
@caption {Running @cmd{DESCRIPTIVES} on each value of @exvar{treatment}}
@end float

In @ref{split:res} you can see that the table of descriptive statistics
appears 3 times, once for each value of @exvar{treatment}.
In this example, @samp{N}, the number of observations, is identical in
all splits. This is because the experiment was deliberately designed
that way. However, in general one can expect a different @samp{N} for each
split.

@float Example, split:res
@caption {The results of running @cmd{DESCRIPTIVES} with an active split}
@end float

Unless @cmd{TEMPORARY} was used, after a split has been defined for
a dataset it remains active until explicitly disabled.
@node TEMPORARY
@section TEMPORARY

@display
TEMPORARY.
@end display

@cmd{TEMPORARY} is used to make the effects of transformations
following its execution temporary. These transformations will
affect only the execution of the next procedure or procedure-like
command. Their effects will not be saved to the active dataset.

The only specification on @cmd{TEMPORARY} is the command name.

@cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
construct. It may appear only once between procedures and
procedure-like commands.

Scratch variables cannot be used following @cmd{TEMPORARY}.

An example may help to clarify:
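
The following sketch reads six values inline, halves them permanently,
and then adds 3 only temporarily. The variable name @code{X} and the
data values are chosen here for illustration, so that the values seen
by the first procedure are 4, 5, 8, 10.5, 13, and 15.

@example
DATA LIST /X 1-2.
BEGIN DATA.
 2
 4
10
15
20
24
END DATA.

* Permanent transformation: applies to both procedures below.
COMPUTE X=X/2.

* Temporary transformation: applies to the next procedure only.
TEMPORARY.
COMPUTE X=X+3.

DESCRIPTIVES X.
DESCRIPTIVES X.
@end example
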
The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
10.5, 13, 15. The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
@node WEIGHT
@section WEIGHT

@display
WEIGHT BY @var{var_name}.
@end display

@cmd{WEIGHT} assigns cases varying weights,
changing the frequency distribution of the active dataset. Execution of
@cmd{WEIGHT} is delayed until data have been read.

If a variable name is specified, @cmd{WEIGHT} causes the values of that
variable to be used as weighting factors for subsequent statistical
procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting
variables must be numeric. Scratch variables may not be used for
weighting (@pxref{Scratch Variables}).

When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
cases equally.

A positive integer weighting factor @var{w} on a case will yield the
same statistical output as would replicating the case @var{w} times.
A weighting factor of 0 is treated for statistical purposes as if the
case did not exist in the input. Weighting values need not be
integers, but negative and system-missing values for the weighting
variable are interpreted as weighting factors of 0. User-missing
values are not treated specially.

When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
the next procedure (@pxref{TEMPORARY}).

@cmd{WEIGHT} does not cause cases in the active dataset to be
replicated in memory.
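
The equivalence between an integer weight and case replication can be
sketched with two cases. The variable names @code{score} and @code{w}
and the data values are invented for illustration.

@example
DATA LIST LIST /score (F3.0) w (F3.0).
BEGIN DATA.
10 1
20 3
END DATA.

WEIGHT BY w.
* Weighted: N is 4 and the mean is (10 + 3*20)/4 = 17.5.
DESCRIPTIVES score.

WEIGHT OFF.
* Unweighted: N is 2 and the mean is 15.
DESCRIPTIVES score.
@end example

The weighted results are the same as those for a dataset holding the
values 10, 20, 20, and 20 with no weighting in effect.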
@subsection Example Weights

One could define a dataset containing an inventory of stock items.
It would be reasonable to use a string variable for a description of the
item, and a numeric variable for the number in stock, as in @ref{weight:ex}.

@float Example, weight:ex
@psppsyntax {weight.sps}
@caption {Setting the weight on the variable @exvar{quantity}}
@end float

One analysis which would surely be of interest is
the relative amounts of each item in stock.
However, without setting a weight variable, @cmd{FREQUENCIES} (@pxref{FREQUENCIES}) will not
tell us what we want to know, since there is only one case for each stock item.
@ref{weight:res} shows the difference between the weighted and unweighted
frequency tables.

@float Example, weight:res
@caption {Weighted and unweighted frequency tables of @exvar{items}}
@end float