   1 @node Data Selection
   2 @chapter Selecting data for analysis
   3
   4 This chapter documents @pspp{} commands that temporarily or permanently
   5 select data records from the active dataset for analysis.
   6
   7 @menu
   8 * FILTER::                      Exclude cases based on a variable.
   9 * N OF CASES::                  Limit the size of the active dataset.
  10 * SAMPLE::                      Select a specified proportion of cases.
  11 * SELECT IF::                   Permanently delete selected cases.
  12 * SPLIT FILE::                  Do multiple analyses with one command.
  13 * TEMPORARY::                   Make transformations' effects temporary.
  14 * WEIGHT::                      Weight cases by a variable.
  15 @end menu
  16
  17 @node FILTER
  18 @section FILTER
  19 @vindex FILTER
  20
  21 @display
  22 FILTER BY @var{var_name}.
  23 FILTER OFF.
  24 @end display
  25
  26 @cmd{FILTER} allows a boolean-valued variable to be used to select
  27 cases from the data stream for processing.
  28
  29 To set up filtering, specify @subcmd{BY} and a variable name.  Keyword
  30 @subcmd{BY} is optional but recommended.  Cases for which the filter
  31 variable is zero, system-missing, or user-missing are excluded from
  32 analysis, but not deleted from the data stream.  Cases with other values are analyzed.
  33 To filter based on a different condition, use
  34 transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
  35 filter variable of the required form, then specify that variable on
  36 @cmd{FILTER}.
  37
  38 @code{FILTER OFF} turns off case filtering.
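
As a minimal sketch of the approach described above, the following
syntax derives a filter variable with @cmd{COMPUTE} and then filters on
it.  The variable names @code{age}, @code{income}, and @code{adult} are
hypothetical and assumed to exist (or be created) in the active dataset.

@example
* Build a 0/1 filter variable from a condition.
COMPUTE adult = (age >= 18).

* Only cases with a nonzero, nonmissing value of adult are analyzed.
FILTER BY adult.
DESCRIPTIVES income.

* Return to analyzing every case.
FILTER OFF.
@end example

Because @cmd{FILTER} only excludes cases from analysis, the cases with
@code{adult} equal to zero or missing remain in the active dataset and
are analyzed again once filtering is turned off.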
  39
  40 Filtering takes place immediately before cases pass to a procedure for
  41 analysis.  Only one filter variable may be active at a time.  Normally,
  42 case filtering continues until it is explicitly turned off with @code{FILTER
  43 OFF}.  However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
  44 the next procedure or procedure-like command.
  45
  46 @node N OF CASES
  47 @section N OF CASES
  48 @vindex N OF CASES
  49
  50 @display
  51 N [OF CASES] @var{num_of_cases} [ESTIMATED].
  52 @end display
  53
  54 @cmd{N OF CASES} limits the number of cases processed by any
  55 procedures that follow it in the command stream.  @code{N OF CASES
  56 100}, for example, tells @pspp{} to disregard all cases after the first
  57 100.
  58
  59 When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
  60 only the next procedure (@pxref{TEMPORARY}).  Otherwise, cases beyond
  61 the limit specified are not processed by any later procedure.
  62
  63 If the limit specified on @cmd{N OF CASES} is greater than the number
  64 of cases in the active dataset, it has no effect.
  65
  66 When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
  67 IF}, the case limit is applied to the cases obtained after sampling or
  68 case selection, regardless of how @cmd{N OF CASES} is placed relative
  69 to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file.  Thus, the
  70 commands @code{N OF CASES 100} and @code{SAMPLE .5} together randomly
  71 sample approximately half of the active dataset's cases, then select the
  72 first 100 of those sampled, whichever of the two commands comes first
  73 in the command file.
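
To illustrate the ordering rule just described, the two fragments below
differ only in the order of the two commands.  According to that rule,
both should hand the procedure the first 100 cases of a random half of
the data.  The variable name @code{X} is a placeholder.

@example
* Variant 1: limit first, sample second.
N OF CASES 100.
SAMPLE .5.
DESCRIPTIVES X.
@end example

@example
* Variant 2: sample first, limit second.
SAMPLE .5.
N OF CASES 100.
DESCRIPTIVES X.
@end example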
  74
  75 @cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
  76 number of cases before @cmd{DATA LIST} or another command to read in
  77 data.  @code{ESTIMATED} never limits the number of cases processed by
  78 procedures.  @pspp{} currently does not make use of case count estimates.
  79
  80 @node SAMPLE
  81 @section SAMPLE
  82 @vindex SAMPLE
  83
  84 @display
  85 SAMPLE @var{num1} [FROM @var{num2}].
  86 @end display
  87
  88 @cmd{SAMPLE} randomly samples a proportion of the cases in the active
  89 dataset.  Unless it follows @cmd{TEMPORARY}, it operates as a
  90 transformation, permanently removing cases from the active dataset.
  91
  92 The proportion to sample can be expressed as a single number between 0
  93 and 1.  If @var{k} is the number specified, and @var{N} is the number
  94 of currently-selected cases in the active dataset, then after
  95 @subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
  96 selected.
  97
  98 The proportion to sample can also be specified in the style @subcmd{SAMPLE
  99 @var{m} FROM @var{N}}.  With this style, cases are selected as follows:
 100
 101 @enumerate
 102 @item
 103 If @var{N} is equal to the number of currently-selected cases in the
 104 active dataset, exactly @var{m} cases will be selected.
 105
 106 @item
 107 If @var{N} is greater than the number of currently-selected cases in the
 108 active dataset, an equivalent proportion of cases will be selected.
 109
 110 @item
 111 If @var{N} is less than the number of currently-selected cases in the
 112 active dataset, exactly @var{m} cases will be selected @emph{from the first
 113 @var{N} cases in the active dataset.}
 114 @end enumerate
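
The two styles can be sketched as follows.  The numbers are arbitrary
illustrations, and the two commands are shown as alternatives rather
than as a sequence to run together.

@example
* Proportion style: keep roughly a quarter of the cases.
SAMPLE .25.
@end example

@example
* Count style: if the active dataset holds exactly 200 cases,
  exactly 50 of them are kept.
SAMPLE 50 FROM 200.
@end example

With the count style, the three cases listed above determine what
happens when the active dataset does not hold exactly 200 cases.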
 115
 116 @cmd{SAMPLE} and @cmd{SELECT IF} are performed in
 117 the order specified by the syntax file.
 118
 119 @cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
 120 of ordering in the syntax file (@pxref{N OF CASES}).
 121
 122 The same values for @cmd{SAMPLE} may result in different samples.  To
 123 obtain the same sample, use the @code{SET} command to set the random
 124 number seed to the same value before each @cmd{SAMPLE}.  Different
 125 samples may still result when the file is processed on systems with
 126 differing endianness or floating-point formats.  By default, the
 127 random number seed is based on the system time.
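
As a sketch of the seeding advice above, the following syntax sets the
seed to the same value before each of two temporary samples, so that
both procedures should analyze the same sample on a given system.  The
seed value 12345 and the variable name @code{X} are arbitrary choices
made for illustration.

@example
SET SEED=12345.
TEMPORARY.
SAMPLE .1.
DESCRIPTIVES X.

SET SEED=12345.
TEMPORARY.
SAMPLE .1.
DESCRIPTIVES X.
@end example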
 128
 129 @node SELECT IF
 130 @section SELECT IF
 131 @vindex SELECT IF
 132
 133 @display
 134 SELECT IF @var{expression}.
 135 @end display
 136
 137 @cmd{SELECT IF} selects cases for analysis based on the value of
 138 @var{expression}.  Cases not selected are permanently eliminated
 139 from the active dataset, unless @cmd{TEMPORARY} is in effect
 140 (@pxref{TEMPORARY}).
 141
 142 Specify a boolean expression (@pxref{Expressions}).  If the value of the
 143 expression is true for a particular case, the case will be analyzed.  If
 144 the expression has a false or missing value, then the case will be
 145 deleted from the data stream.
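
For instance, the following sketch keeps only the cases that satisfy
both conditions.  The variable names @code{age} and @code{income} are
hypothetical.

@example
* Cases where the expression is false or missing are deleted.
SELECT IF (age >= 18 AND income > 0).
DESCRIPTIVES income.
@end example

A case with a missing @code{age} or @code{income} may make the
expression missing, in which case it is deleted along with the cases
for which the expression is false.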
 146
 147 Place @cmd{SELECT IF} as early in the command file as
 148 possible.  Cases that are deleted early can be processed more
 149 efficiently in time and space.
 150
 151 When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
 152 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 153 (@pxref{LAG}).
 154
 155 @node SPLIT FILE
 156 @section SPLIT FILE
 157 @vindex SPLIT FILE
 158
 159 @display
 160 SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
 161 SPLIT FILE OFF.
 162 @end display
 163
 164 @cmd{SPLIT FILE} allows multiple sets of data present in one data
 165 file to be analyzed separately using single statistical procedure
 166 commands.
 167
 168 Specify a list of variable names to analyze multiple sets of
 169 data separately.  Groups of adjacent cases having the same values for these
 170 variables are analyzed by statistical procedure commands as one group.
 171 An independent analysis is carried out for each group of cases, and the
 172 variable values for the group are printed along with the analysis.
 173
 174 When a list of variable names is specified, one of the keywords
 175 @subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified.  Either
 176 keyword, if provided, is ignored.
 177
 178 Groups are formed only by @emph{adjacent} cases.  To create a split
 179 using a variable where like values are not adjacent in the active dataset,
 180 you should first sort the data by that variable (@pxref{SORT CASES}).
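
A minimal sketch of this sort-then-split pattern follows.  The variable
names @code{region} and @code{income} are hypothetical.

@example
* Bring cases with the same region together.
SORT CASES BY region.

* Analyze each region as its own group.
SPLIT FILE BY region.
DESCRIPTIVES income.

* Resume analyzing the whole dataset as one group.
SPLIT FILE OFF.
@end example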
 181
 182 Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
 183 entire active dataset as a single group of data.
 184
 185 When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
 186 the next procedure (@pxref{TEMPORARY}).
 187
 188 @node TEMPORARY
 189 @section TEMPORARY
 190 @vindex TEMPORARY
 191
 192 @display
 193 TEMPORARY.
 194 @end display
 195
 196 @cmd{TEMPORARY} makes the effects of the transformations that
 197 follow it temporary.  These transformations
 198 affect only the execution of the next procedure or procedure-like
 199 command.  Their effects are not saved to the active dataset.
 200
 201 The only specification on @cmd{TEMPORARY} is the command name.
 202
 203 @cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
 204 construct.  It may appear only once between procedures and
 205 procedure-like commands.
 206
 207 Scratch variables cannot be used following @cmd{TEMPORARY}.
 208
 209 An example may help to clarify:
 210
 211 @example
 212 DATA LIST /X 1-2.
 213 BEGIN DATA.
 214  2
 215  4
 216 10
 217 15
 218 20
 219 24
 220 END DATA.
 221
 222 COMPUTE X=X/2.
 223
 224 TEMPORARY.
 225 COMPUTE X=X+3.
 226
 227 DESCRIPTIVES X.
 228 DESCRIPTIVES X.
 229 @end example
 230
 231 The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
 232 10.5, 13, 15.  The data read by the second @cmd{DESCRIPTIVES} are 1, 2,
 233 5, 7.5, 10, 12.
 234
 235 @node WEIGHT
 236 @section WEIGHT
 237 @vindex WEIGHT
 238
 239 @display
 240 WEIGHT BY @var{var_name}.
 241 WEIGHT OFF.
 242 @end display
 243
 244 @cmd{WEIGHT} assigns cases varying weights,
 245 changing the frequency distribution of the active dataset.  Execution of
 246 @cmd{WEIGHT} is delayed until data have been read.
 247
 248 If a variable name is specified, @cmd{WEIGHT} causes the values of that
 249 variable to be used as weighting factors for subsequent statistical
 250 procedures.  Use of keyword @subcmd{BY} is optional but recommended.  Weighting
 251 variables must be numeric.  Scratch variables may not be used for
 252 weighting (@pxref{Scratch Variables}).
 253
 254 When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
 255 cases equally.
 256
 257 A positive integer weighting factor @var{w} on a case will yield the
 258 same statistical output as would replicating the case @var{w} times.
 259 A weighting factor of 0 is treated for statistical purposes as if the
 260 case did not exist in the input.  Weighting values need not be
 261 integers, but negative and system-missing values for the weighting
 262 variable are interpreted as weighting factors of 0.  User-missing
 263 values are not treated specially.
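
The replication rule can be illustrated with a small made-up dataset.
The variable names @code{score} and @code{count} and the data values
are invented for this sketch.

@example
DATA LIST LIST /score count.
BEGIN DATA.
1 3
2 0
3 2
END DATA.

WEIGHT BY count.
DESCRIPTIVES score.
@end example

By the rules above, the procedure should produce the same statistics as
it would for the five unweighted values 1, 1, 1, 3, 3.  The case with a
weighting factor of 0 does not contribute.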
 264
 265 When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
 266 the next procedure (@pxref{TEMPORARY}).
 267
 268 @cmd{WEIGHT} does not cause cases in the active dataset to be
 269 replicated in memory.