   1 @node Data Selection, Conditionals and Looping, Data Manipulation, Top
   2 @chapter Selecting data for analysis
   3
   4 This chapter documents PSPP commands that temporarily or permanently
   5 select data records from the active file for analysis.
   6
   7 @menu
   8 * FILTER::                      Exclude cases based on a variable.
   9 * N OF CASES::                  Limit the size of the active file.
  10 * PROCESS IF::                  Temporarily excluding cases.
  11 * SAMPLE::                      Select a specified proportion of cases.
  12 * SELECT IF::                   Permanently delete selected cases.
  13 * SPLIT FILE::                  Do multiple analyses with one command.
  14 * TEMPORARY::                   Make transformations' effects temporary.
  15 * WEIGHT::                      Weight cases by a variable.
  16 @end menu
  17
  18 @node FILTER, N OF CASES, Data Selection, Data Selection
  19 @section FILTER
  20 @vindex FILTER
  21
  22 @display
  23 FILTER BY var_name.
  24 FILTER OFF.
  25 @end display
  26
  27 @cmd{FILTER} allows a boolean-valued variable to be used to select
  28 cases from the data stream for processing.
  29
  30 To set up filtering, specify BY and a variable name.  Keyword
  31 BY is optional but recommended.  Cases that have a zero, system-missing,
  32 or user-missing value are excluded from analysis, but not deleted from the
  33 data stream.  Cases with other values are analyzed.
  34 To filter based on a different condition, use
  35 transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
  36 filter variable of the required form, then specify that variable on
  37 @cmd{FILTER}.
  38
  39 @code{FILTER OFF} turns off case filtering.
  40
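As a minimal sketch of this pattern, the fragment below derives a filter
variable from a condition and then filters on it.  The variable names
@code{age}, @code{income}, and @code{adult} and the data values are
hypothetical, chosen only for illustration.

@example
* Hypothetical variables and data, for illustration only.
DATA LIST LIST /age income.
BEGIN DATA.
15 0
22 31000
40 52000
END DATA.
* adult is 1 when age is at least 18, otherwise 0.
COMPUTE adult = (age >= 18).
FILTER BY adult.
* Only the two cases with adult equal to 1 are analyzed.
DESCRIPTIVES income.
FILTER OFF.
* All three cases are analyzed again.
DESCRIPTIVES income.
@end example
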
  41 Filtering takes place immediately before cases pass to a procedure for
  42 analysis.  Only one filter variable may be active at a time.  Normally,
  43 case filtering continues until it is explicitly turned off with @code{FILTER
  44 OFF}.  However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters
  45 only the next procedure or procedure-like command.
  46
  47 @node N OF CASES, PROCESS IF, FILTER, Data Selection
  48 @section N OF CASES
  49 @vindex N OF CASES
  50
  51 @display
  52 N [OF CASES] num_of_cases [ESTIMATED].
  53 @end display
  54
  55 @cmd{N OF CASES} limits analysis to the leading cases of the input.
  56 For example, @code{N 100} tells PSPP to disregard all cases after the
  57 first 100.
  58
  59 If the value specified for @cmd{N} is greater than the number of cases
  60 read in, the value is ignored.
  61
  62 @cmd{N} does not discard cases or prevent them from being read.  It
  63 just causes cases beyond the last one specified to be ignored by data
  64 analysis commands.
  65
  66 A later @cmd{N} command can increase or decrease the number of cases
  67 selected.  (To select all the cases without knowing how many there are,
  68 specify a very high number: 100000 or whatever you think is large enough.)
  69
  70 Transformation procedures performed after @cmd{N} is executed
  71 @emph{do} cause cases to be discarded.
  72
  73 @cmd{SAMPLE}, @cmd{PROCESS IF}, and @cmd{SELECT IF} have
  74 precedence over @cmd{N}---the same results are obtained by both of the
  75 following fragments, given the same random number seeds:
  76
  77 @example
  78 @i{@dots{}set up, read in data@dots{}}
  79 N 100.
  80 SAMPLE .5.
  81 @i{@dots{}analyze data@dots{}}
  82
  83 @i{@dots{}set up, read in data@dots{}}
  84 SAMPLE .5.
  85 N 100.
  86 @i{@dots{}analyze data@dots{}}
  87 @end example
  88
  89 Both fragments above first randomly sample approximately half of the
  90 cases, then select the first 100 of those sampled.
  91
  92 @cmd{N} with the @code{ESTIMATED} keyword gives an
  93 estimated number of cases before @cmd{DATA LIST} or another command to
  94 read in data.  @code{ESTIMATED} never limits the number of cases
  95 processed by procedures.  PSPP currently does not make use of
  96 case count estimates.
  97
  98 When @cmd{N} is specified after @cmd{TEMPORARY}, it affects only
  99 the next procedure (@pxref{TEMPORARY}).
 100
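A minimal sketch of the temporary form, assuming that a numeric variable
named @code{x} (a hypothetical name) is already in the active file:

@example
TEMPORARY.
N OF CASES 10.
* As described above, the limit applies to this procedure only.
DESCRIPTIVES x.
* The limit is no longer in effect for this procedure.
DESCRIPTIVES x.
@end example
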
 101 @node PROCESS IF, SAMPLE, N OF CASES, Data Selection
 102 @section PROCESS IF
 103 @vindex PROCESS IF
 104
 105 @display
 106 PROCESS IF expression.
 107 @end display
 108
 109 @cmd{PROCESS IF} temporarily eliminates cases from the
 110 data stream.  Its effects are active only through the execution of the
 111 next procedure or procedure-like command.
 112
 113 Specify a boolean expression (@pxref{Expressions}).  If the value of the
 114 expression is true for a particular case, the case will be analyzed.  If
 115 the expression has a false or missing value, then the case will be
 116 deleted from the data stream for this procedure only.
 117
 118 Regardless of its placement relative to other commands, @cmd{PROCESS IF}
 119 always takes effect immediately before data passes to the procedure.
 120 Only one @cmd{PROCESS IF} command may be in effect at any given time.
 121
 122 The effects of @cmd{PROCESS IF} are similar, but not identical, to the
 123 effects of executing @cmd{TEMPORARY}, then @cmd{SELECT IF}
 124 (@pxref{SELECT IF}).
 125
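The two forms can be sketched side by side as follows.  The variable
names @code{age} and @code{income} are hypothetical, and the fragments
are alternatives rather than a sequence.  As noted above, their effects
are similar but not identical.

@example
* Deprecated form.
PROCESS IF (age >= 18).
DESCRIPTIVES income.

* Similar form using TEMPORARY and SELECT IF.
TEMPORARY.
SELECT IF (age >= 18).
DESCRIPTIVES income.
@end example
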
 126 The filtering performed by @cmd{PROCESS IF} takes place immediately
 127 before cases pass to a procedure for analysis.  Because @cmd{PROCESS
 128 IF} affects only a single procedure, its placement relative to
 129 @cmd{TEMPORARY} is unimportant.
 130
 131 @cmd{PROCESS IF} is deprecated.  It is included for compatibility with
 132 old command files.  New syntax files should use @cmd{SELECT IF} or
 133 @cmd{FILTER} instead.
 134
 135 @node SAMPLE, SELECT IF, PROCESS IF, Data Selection
 136 @section SAMPLE
 137 @vindex SAMPLE
 138
 139 @display
 140 SAMPLE num1 [FROM num2].
 141 @end display
 142
 143 @cmd{SAMPLE} randomly samples a proportion of the cases in the active
 144 file.  Unless it follows @cmd{TEMPORARY}, it operates as a
 145 transformation, permanently removing cases from the active file.
 146
 147 The proportion to sample can be expressed as a single number between 0
 148 and 1.  If @code{k} is the number specified, and @code{N} is the number
 149 of currently-selected cases in the active file, then after
 150 @code{SAMPLE @var{k}.}, approximately @code{k*N} cases will be
 151 selected.
 152
 153 The proportion to sample can also be specified in the style @code{SAMPLE
 154 @var{m} FROM @var{N}}.  With this style, cases are selected as follows:
 155
 156 @enumerate
 157 @item
 158 If @var{N} is equal to the number of currently-selected cases in the
 159 active file, exactly @var{m} cases will be selected.
 160
 161 @item
 162 If @var{N} is greater than the number of currently-selected cases in the
 163 active file, an equivalent proportion of cases will be selected.
 164
 165 @item
 166 If @var{N} is less than the number of currently-selected cases in the
 167 active file, exactly @var{m} cases will be selected @emph{from the first
 168 @var{N} cases in the active file.}
 169 @end enumerate
 170
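To illustrate the three rules, suppose (hypothetically) that the active
file currently holds 200 selected cases.  Each command below is an
independent alternative, not part of a sequence.  The second comment
reads ``equivalent proportion'' as @var{m}/@var{N}, which is an
assumption; under that reading about 25 of the 200 cases would be
selected.

@example
* N equals the case count, so exactly 50 cases are selected.
SAMPLE 50 FROM 200.

* N exceeds the case count, so a proportion of 50/400 is selected.
SAMPLE 50 FROM 400.

* N is below the case count, so 50 of the first 100 cases are selected.
SAMPLE 50 FROM 100.
@end example
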
 171 @cmd{SAMPLE} and @cmd{SELECT IF} are performed in
 172 the order specified by the syntax file.
 173
 174 @cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
 175 of ordering in the syntax file (@pxref{N OF CASES}).
 176
 177 The same values for @cmd{SAMPLE} may result in different samples.  To
 178 obtain the same sample, use the @code{SET} command to set the random
 179 number seed to the same value before each @cmd{SAMPLE}.  Different
 180 samples may still result when the file is processed on systems with
 181 differing endianness or floating-point formats.  By default, the
 182 random number seed is based on the system time.
 183
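A minimal sketch of a repeatable sample follows.  The seed value is
arbitrary; any fixed value serves, provided the same value is set before
each @cmd{SAMPLE} whose result should match.

@example
* Fix the seed so that rerunning this file on the same system
* draws the same sample.
SET SEED=54321.
SAMPLE .5.
@end example
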
 184 @node SELECT IF, SPLIT FILE, SAMPLE, Data Selection
 185 @section SELECT IF
 186 @vindex SELECT IF
 187
 188 @display
 189 SELECT IF expression.
 190 @end display
 191
 192 @cmd{SELECT IF} selects cases for analysis based on the value of a
 193 boolean expression.  Cases not selected are permanently eliminated
 194 from the active file, unless @cmd{TEMPORARY} is in effect
 195 (@pxref{TEMPORARY}).
 196
 197 Specify a boolean expression (@pxref{Expressions}).  If the value of the
 198 expression is true for a particular case, the case will be analyzed.  If
 199 the expression has a false or missing value, then the case will be
 200 deleted from the data stream.
 201
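A minimal sketch of a permanent selection, using hypothetical variable
names and data:

@example
* Hypothetical variables and data, for illustration only.
DATA LIST LIST /age income.
BEGIN DATA.
15 0
22 31000
40 52000
END DATA.
SELECT IF (age >= 18).
* The case with age 15 has been deleted from the active file.
LIST.
@end example
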
 202 Place @cmd{SELECT IF} as early in the command file as
 203 possible.  Cases that are deleted early can be processed more
 204 efficiently in time and space.
 205
 206 When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
 207 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 208 (@pxref{LAG}).
 209
 210 @node SPLIT FILE, TEMPORARY, SELECT IF, Data Selection
 211 @section SPLIT FILE
 212 @vindex SPLIT FILE
 213
 214 @display
 215 SPLIT FILE [@{LAYERED, SEPARATE@}] BY var_list.
 216 SPLIT FILE OFF.
 217 @end display
 218
 219 @cmd{SPLIT FILE} allows multiple sets of data present in one data
 220 file to be analyzed separately using single statistical procedure
 221 commands.
 222
 223 Specify a list of variable names to analyze multiple sets of
 224 data separately.  Groups of adjacent cases having the same values for these
 225 variables are analyzed by statistical procedure commands as one group.
 226 An independent analysis is carried out for each group of cases, and the
 227 variable values for the group are printed along with the analysis.
 228
 229 When a list of variable names is specified, one of the keywords
 230 LAYERED or SEPARATE may also be specified.  If provided, either
 231 keyword is ignored.
 232
 233 Groups are formed only by @emph{adjacent} cases.  To create a split
 234 using a variable where like values are not adjacent in the working file,
 235 you should first sort the data by that variable (@pxref{SORT CASES}).
 236
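A minimal sketch of sorting before splitting, using hypothetical
variable names and data.  Without the sort, the alternating
@code{region} values in this data would form four one-case groups
rather than two groups.

@example
* Hypothetical variables and data, for illustration only.
DATA LIST LIST /region sales.
BEGIN DATA.
2 150
1 100
2 175
1 120
END DATA.
SORT CASES BY region.
SPLIT FILE BY region.
* One analysis for region 1 and one for region 2.
DESCRIPTIVES sales.
SPLIT FILE OFF.
* A single analysis of all four cases.
DESCRIPTIVES sales.
@end example
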
 237 Specify OFF to disable @cmd{SPLIT FILE} and resume analysis of the
 238 entire active file as a single group of data.
 239
 240 When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
 241 the next procedure (@pxref{TEMPORARY}).
 242
 243 @node TEMPORARY, WEIGHT, SPLIT FILE, Data Selection
 244 @section TEMPORARY
 245 @vindex TEMPORARY
 246
 247 @display
 248 TEMPORARY.
 249 @end display
 250
 251 @cmd{TEMPORARY} is used to make the effects of transformations
 252 following its execution temporary.  These transformations will
 253 affect only the execution of the next procedure or procedure-like
 254 command.  Their effects will not be saved to the active file.
 255
 256 The only specification on @cmd{TEMPORARY} is the command name.
 257
 258 @cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
 259 construct.  It may appear only once between procedures and
 260 procedure-like commands.
 261
 262 Scratch variables cannot be used following @cmd{TEMPORARY}.
 263
 264 An example may help to clarify:
 265
 266 @example
 267 DATA LIST /X 1-2.
 268 BEGIN DATA.
 269  2
 270  4
 271 10
 272 15
 273 20
 274 24
 275 END DATA.
 276 COMPUTE X=X/2.
 277 TEMPORARY.
 278 COMPUTE X=X+3.
 279 DESCRIPTIVES X.
 280 DESCRIPTIVES X.
 281 @end example
 282
 283 The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
 284 10.5, 13, 15.  The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
 285 5, 7.5, 10, 12.
 286
 287 @node WEIGHT,  , TEMPORARY, Data Selection
 288 @section WEIGHT
 289 @vindex WEIGHT
 290
 291 @display
 292 WEIGHT BY var_name.
 293 WEIGHT OFF.
 294 @end display
 295
 296 @cmd{WEIGHT} assigns cases varying weights,
 297 changing the frequency distribution of the active file.  Execution of
 298 @cmd{WEIGHT} is delayed until data have been read.
 299
 300 If a variable name is specified, @cmd{WEIGHT} causes the values of that
 301 variable to be used as weighting factors for subsequent statistical
 302 procedures.  Use of keyword BY is optional but recommended.  Weighting
 303 variables must be numeric.  Scratch variables may not be used for
 304 weighting (@pxref{Scratch Variables}).
 305
 306 When OFF is specified, subsequent statistical procedures will weight all
 307 cases equally.
 308
 309 A positive integer weighting factor @var{w} on a case will yield the
 310 same statistical output as would replicating the case @var{w} times.
 311 A weighting factor of 0 is treated for statistical purposes as if the
 312 case did not exist in the input.  Weighting values need not be
 313 integers, but negative and system-missing values for the weighting
 314 variable are interpreted as weighting factors of 0.  User-missing
 315 values are not treated specially.
 316
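The replication rule can be checked by hand on a small example.  The
variable names @code{score} and @code{w} and the data are hypothetical.
With the weights below, the weighted analysis matches an unweighted
analysis of the four cases 10, 10, 10, 20.

@example
* Hypothetical variables and data, for illustration only.
DATA LIST LIST /score w.
BEGIN DATA.
10 3
20 1
END DATA.
WEIGHT BY w.
* Weighted mean of score is (3*10 + 1*20) / (3 + 1) = 12.5.
DESCRIPTIVES score.
WEIGHT OFF.
* Unweighted mean of score is (10 + 20) / 2 = 15.
DESCRIPTIVES score.
@end example
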
 317 When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
 318 the next procedure (@pxref{TEMPORARY}).
 319
 320 @cmd{WEIGHT} does not cause cases in the active file to be replicated in
 321 memory.
 322 @setfilename ignored