doc/data-selection.texi

   1 @node Data Selection, Conditionals and Looping, Data Manipulation, Top
   2 @chapter Selecting data for analysis
   3
   4 This chapter documents PSPP commands that temporarily or permanently
   5 select data records from the active file for analysis.
   6
   7 @menu
   8 * FILTER::                      Exclude cases based on a variable.
   9 * N OF CASES::                  Limit the size of the active file.
  10 * SAMPLE::                      Select a specified proportion of cases.
  11 * SELECT IF::                   Permanently delete selected cases.
  12 * SPLIT FILE::                  Do multiple analyses with one command.
  13 * TEMPORARY::                   Make transformations' effects temporary.
  14 * WEIGHT::                      Weight cases by a variable.
  15 @end menu
  16
  17 @node FILTER, N OF CASES, Data Selection, Data Selection
  18 @section FILTER
  19 @vindex FILTER
  20
  21 @display
  22 FILTER BY var_name.
  23 FILTER OFF.
  24 @end display
  25
  26 @cmd{FILTER} allows a boolean-valued variable to be used to select
  27 cases from the data stream for processing.
  28
  29 To set up filtering, specify BY and a variable name.  Keyword
  30 BY is optional but recommended.  Cases which have a zero or system- or
  31 user-missing value are excluded from analysis, but not deleted from the
  32 data stream.  Cases with other values are analyzed.
  33 To filter based on a different condition, use
  34 transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
  35 filter variable of the required form, then specify that variable on
  36 @cmd{FILTER}.
  37
  38 @code{FILTER OFF} turns off case filtering.
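
As a rough sketch of the approach described above, the fragment below
derives a filter variable from a condition and then filters on it.  The
variable names @code{age}, @code{adult}, and @code{income} are
hypothetical and stand in for variables in your own active file.

@example
COMPUTE adult = (age >= 18).
FILTER BY adult.
DESCRIPTIVES income.
FILTER OFF.
@end example

Cases for which @code{adult} is zero or missing should be skipped by
@cmd{DESCRIPTIVES} but remain in the active file.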
  39
  40 Filtering takes place immediately before cases pass to a procedure for
  41 analysis.  Only one filter variable may be active at a time.  Normally,
  42 case filtering continues until it is explicitly turned off with @code{FILTER
  43 OFF}.  However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters
  44 only the next procedure or procedure-like command (@pxref{TEMPORARY}).
  45
  46 @node N OF CASES
  47 @section N OF CASES
  48 @vindex N OF CASES
  49
  50 @display
  51 N [OF CASES] num_of_cases [ESTIMATED].
  52 @end display
  53
  54 @cmd{N} limits analysis to the leading cases of the input.  For
  55 example, @code{N 100} tells PSPP to disregard all cases after the
  56 first 100.
  57
  58 If the value specified for @cmd{N} is greater than the number of cases
  59 read in, the value is ignored.
  60
  61 @cmd{N} does not discard cases or prevent them from being read.  It
  62 just causes cases beyond the last one specified to be ignored by data
  63 analysis commands.
  64
  65 A later @cmd{N} command can increase or decrease the number of cases
  66 selected.  (To select all the cases without knowing how many there are,
  67 specify a very high number: 100000 or whatever you think is large enough.)
  68
  69 Transformation procedures performed after @cmd{N} is executed
  70 @emph{do} cause cases to be discarded.
  71
  72 @cmd{SAMPLE} and @cmd{SELECT IF} have
  73 precedence over @cmd{N}---the same results are obtained by both of the
  74 following fragments, given the same random number seeds:
  75
  76 @example
  77 @i{@dots{}set up, read in data@dots{}}
  78 N 100.
  79 SAMPLE .5.
  80 @i{@dots{}analyze data@dots{}}
  81
  82 @i{@dots{}set up, read in data@dots{}}
  83 SAMPLE .5.
  84 N 100.
  85 @i{@dots{}analyze data@dots{}}
  86 @end example
  87
  88 Both fragments above first randomly sample approximately half of the
  89 cases, then select the first 100 of those sampled.
  90
  91 @cmd{N} with the @code{ESTIMATED} keyword gives an
  92 estimated number of cases before @cmd{DATA LIST} or another command to
  93 read in data.  @code{ESTIMATED} never limits the number of cases
  94 processed by procedures.  PSPP currently does not make use of
  95 case count estimates.
  96
  97 When @cmd{N} is specified after @cmd{TEMPORARY}, it affects only
  98 the next procedure (@pxref{TEMPORARY}).
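
A minimal illustration of the temporary form, assuming a hypothetical
numeric variable @code{x} in the active file:

@example
TEMPORARY.
N OF CASES 50.
DESCRIPTIVES x.
DESCRIPTIVES x.
@end example

Under the rule above, only the first @cmd{DESCRIPTIVES} would be limited
to the first 50 cases; the second would see the whole active file.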
  99
 100 @node SAMPLE
 101 @section SAMPLE
 102 @vindex SAMPLE
 103
 104 @display
 105 SAMPLE num1 [FROM num2].
 106 @end display
 107
 108 @cmd{SAMPLE} randomly samples a proportion of the cases in the active
 109 file.  Unless it follows @cmd{TEMPORARY}, it operates as a
 110 transformation, permanently removing cases from the active file.
 111
 112 The proportion to sample can be expressed as a single number between 0
 113 and 1.  If @code{k} is the number specified, and @code{N} is the number
 114 of currently-selected cases in the active file, then after
 115 @code{SAMPLE @var{k}.}, approximately @code{k*N} cases will be
 116 selected.
 117
 118 The proportion to sample can also be specified in the style @code{SAMPLE
 119 @var{m} FROM @var{N}}.  With this style, cases are selected as follows:
 120
 121 @enumerate
 122 @item
 123 If @var{N} is equal to the number of currently-selected cases in the
 124 active file, exactly @var{m} cases will be selected.
 125
 126 @item
 127 If @var{N} is greater than the number of currently-selected cases in the
 128 active file, an equivalent proportion of cases will be selected.
 129
 130 @item
 131 If @var{N} is less than the number of currently-selected cases in the
 132 active file, exactly @var{m} cases will be selected @emph{from the
 133 first @var{N} cases in the active file.}
 134 @end enumerate
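
For illustration, the two styles might be written as in the following
fragments.  The numbers are arbitrary, and each fragment is meant to be
read on its own rather than run one after the other:

@example
@i{@dots{}set up, read in data@dots{}}
SAMPLE .25.
@i{@dots{}analyze data@dots{}}

@i{@dots{}set up, read in data@dots{}}
SAMPLE 50 FROM 200.
@i{@dots{}analyze data@dots{}}
@end example

If the active file holds exactly 200 cases, the second form selects
exactly 50 of them.  With only 100 cases it would select the equivalent
proportion, 50/200 = .25, that is, roughly 25 cases.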
 135
 136 @cmd{SAMPLE} and @cmd{SELECT IF} are performed in
 137 the order specified by the syntax file.
 138
 139 @cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
 140 of ordering in the syntax file (@pxref{N OF CASES}).
 141
 142 The same values for @cmd{SAMPLE} may result in different samples.  To
 143 obtain the same sample, use the @code{SET} command to set the random
 144 number seed to the same value before each @cmd{SAMPLE}.  Different
 145 samples may still result when the file is processed on systems with
 146 differing endianness or floating-point formats.  By default, the
 147 random number seed is based on the system time.
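
A sketch of a repeatable sample follows.  The seed value 12345 is an
arbitrary choice; any fixed value should serve the same purpose:

@example
SET SEED=12345.
SAMPLE .5.
@end example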
 148
 149 @node SELECT IF, SPLIT FILE, SAMPLE, Data Selection
 150 @section SELECT IF
 151 @vindex SELECT IF
 152
 153 @display
 154 SELECT IF expression.
 155 @end display
 156
 157 @cmd{SELECT IF} selects cases for analysis based on the value of a
 158 boolean expression.  Cases not selected are permanently eliminated
 159 from the active file, unless @cmd{TEMPORARY} is in effect
 160 (@pxref{TEMPORARY}).
 161
 162 Specify a boolean expression (@pxref{Expressions}).  If the value of the
 163 expression is true for a particular case, the case will be analyzed.  If
 164 the expression has a false or missing value, then the case will be
 165 deleted from the data stream.
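
As a small example, assuming hypothetical numeric variables @code{age}
and @code{income}:

@example
SELECT IF (age >= 18 AND income > 0).
DESCRIPTIVES income.
@end example

Only cases for which the whole expression is true are kept.  A case
whose expression evaluates to false or missing is dropped before
@cmd{DESCRIPTIVES} runs.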
 166
 167 Place @cmd{SELECT IF} as early in the command file as
 168 possible.  Cases that are deleted early can be processed more
 169 efficiently in time and space.
 170
 171 When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
 172 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 173 (@pxref{LAG}).
 174
 175 @node SPLIT FILE, TEMPORARY, SELECT IF, Data Selection
 176 @section SPLIT FILE
 177 @vindex SPLIT FILE
 178
 179 @display
 180 SPLIT FILE [@{LAYERED, SEPARATE@}] BY var_list.
 181 SPLIT FILE OFF.
 182 @end display
 183
 184 @cmd{SPLIT FILE} allows multiple sets of data present in one data
 185 file to be analyzed separately using single statistical procedure
 186 commands.
 187
 188 Specify a list of variable names to analyze multiple sets of
 189 data separately.  Groups of adjacent cases having the same values for these
 190 variables are analyzed by statistical procedure commands as one group.
 191 An independent analysis is carried out for each group of cases, and the
 192 variable values for the group are printed along with the analysis.
 193
 194 When a list of variable names is specified, one of the keywords
 195 LAYERED or SEPARATE may also be specified.  If provided, either
 196 keyword is ignored.
 197
 198 Groups are formed only by @emph{adjacent} cases.  To create a split
 199 using a variable where like values are not adjacent in the working file,
 200 you should first sort the data by that variable (@pxref{SORT CASES}).
 201
 202 Specify OFF to disable @cmd{SPLIT FILE} and resume analysis of the
 203 entire active file as a single group of data.
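
The usual sequence can be sketched as follows, assuming hypothetical
variables @code{region} and @code{income}:

@example
SORT CASES BY region.
SPLIT FILE BY region.
DESCRIPTIVES income.
SPLIT FILE OFF.
@end example

Sorting first makes cases with the same @code{region} value adjacent, so
@cmd{DESCRIPTIVES} should produce one analysis per distinct value rather
than one per run of adjacent cases.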
 204
 205 When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
 206 the next procedure (@pxref{TEMPORARY}).
 207
 208 @node TEMPORARY, WEIGHT, SPLIT FILE, Data Selection
 209 @section TEMPORARY
 210 @vindex TEMPORARY
 211
 212 @display
 213 TEMPORARY.
 214 @end display
 215
 216 @cmd{TEMPORARY} is used to make the effects of transformations
 217 following its execution temporary.  These transformations will
 218 affect only the execution of the next procedure or procedure-like
 219 command.  Their effects will not be saved to the active file.
 220
 221 The only specification on @cmd{TEMPORARY} is the command name.
 222
 223 @cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
 224 construct.  It may appear only once between procedures and
 225 procedure-like commands.
 226
 227 Scratch variables cannot be used following @cmd{TEMPORARY}.
 228
 229 An example may help to clarify:
 230
 231 @example
 232 DATA LIST /X 1-2.
 233 BEGIN DATA.
 234  2
 235  4
 236 10
 237 15
 238 20
 239 24
 240 END DATA.
 241 COMPUTE X=X/2.
 242 TEMPORARY.
 243 COMPUTE X=X+3.
 244 DESCRIPTIVES X.
 245 DESCRIPTIVES X.
 246 @end example
 247
 248 The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
 249 10.5, 13, 15.  The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
 250 5, 7.5, 10, 12.
 251
 252 @node WEIGHT,  , TEMPORARY, Data Selection
 253 @section WEIGHT
 254 @vindex WEIGHT
 255
 256 @display
 257 WEIGHT BY var_name.
 258 WEIGHT OFF.
 259 @end display
 260
 261 @cmd{WEIGHT} assigns cases varying weights,
 262 changing the frequency distribution of the active file.  Execution of
 263 @cmd{WEIGHT} is delayed until data have been read.
 264
 265 If a variable name is specified, @cmd{WEIGHT} causes the values of that
 266 variable to be used as weighting factors for subsequent statistical
 267 procedures.  Use of keyword BY is optional but recommended.  Weighting
 268 variables must be numeric.  Scratch variables may not be used for
 269 weighting (@pxref{Scratch Variables}).
 270
 271 When OFF is specified, subsequent statistical procedures will weight all
 272 cases equally.
 273
 274 A positive integer weighting factor @var{w} on a case will yield the
 275 same statistical output as would replicating the case @var{w} times.
 276 A weighting factor of 0 is treated for statistical purposes as if the
 277 case did not exist in the input.  Weighting values need not be
 278 integers, but negative and system-missing values for the weighting
 279 variable are interpreted as weighting factors of 0.  User-missing
 280 values are not treated specially.
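
The following sketch illustrates these rules with made-up data.  The
variable names @code{score} and @code{count} are hypothetical:

@example
DATA LIST LIST /score count.
BEGIN DATA.
1 3
2 0
3 2
END DATA.
WEIGHT BY count.
FREQUENCIES /VARIABLES=score.
WEIGHT OFF.
@end example

According to the rules above, the frequencies reported for @code{score}
should match those of an unweighted file in which the value 1 appears
three times, the value 3 appears twice, and the value 2 does not appear
at all.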
 281
 282 When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
 283 the next procedure (@pxref{TEMPORARY}).
 284
 285 @cmd{WEIGHT} does not cause cases in the active file to be replicated in
 286 memory.
 287 @setfilename ignored