   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @node Data Selection
  11 @chapter Selecting data for analysis
  12
  13 This chapter documents @pspp{} commands that temporarily or permanently
  14 select data records from the active dataset for analysis.
  15
  16 @menu
  17 * FILTER::                      Exclude cases based on a variable.
  18 * N OF CASES::                  Limit the size of the active dataset.
  19 * SAMPLE::                      Select a specified proportion of cases.
  20 * SELECT IF::                   Permanently delete selected cases.
  21 * SPLIT FILE::                  Do multiple analyses with one command.
  22 * TEMPORARY::                   Make transformations' effects temporary.
  23 * WEIGHT::                      Weight cases by a variable.
  24 @end menu
  25
  26 @node FILTER
  27 @section FILTER
  28 @vindex FILTER
  29
  30 @display
  31 FILTER BY @var{var_name}.
  32 FILTER OFF.
  33 @end display
  34
  35 @cmd{FILTER} allows a boolean-valued variable to be used to select
  36 cases from the data stream for processing.
  37
  38 To set up filtering, specify @subcmd{BY} and a variable name.  Keyword
  39 @subcmd{BY} is optional but recommended.  Cases which have a zero or system- or
  40 user-missing value are excluded from analysis, but not deleted from the
  41 data stream.  Cases with other values are analyzed.
  42 To filter based on a different condition, use
  43 transformations such as @cmd{COMPUTE} or @cmd{RECODE} to compute a
  44 filter variable of the required form, then specify that variable on
  45 @cmd{FILTER}.
  46
  47 @code{FILTER OFF} turns off case filtering.
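
As a minimal sketch of this pattern, the following syntax builds a filter
variable from a condition and then filters on it.  The variable names
@code{age}, @code{salary}, and @code{adult} are hypothetical and serve only
as an illustration.

@example
* adult is 1 where age is at least 18, 0 otherwise, and missing if age is missing.
COMPUTE adult = (age >= 18).
FILTER BY adult.
* Cases where adult is 0 or missing are skipped here but stay in the dataset.
DESCRIPTIVES /VARIABLES=salary.
FILTER OFF.
* With filtering off, this procedure reads every case again.
DESCRIPTIVES /VARIABLES=salary.
@end example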
  48
  49 Filtering takes place immediately before cases pass to a procedure for
  50 analysis.  Only one filter variable may be active at a time.  Normally,
  51 case filtering continues until it is explicitly turned off with @code{FILTER
  52 OFF}.  However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
  53 the next procedure or procedure-like command.
  54
  55 @node N OF CASES
  56 @section N OF CASES
  57 @vindex N OF CASES
  58
  59 @display
  60 N [OF CASES] @var{num_of_cases} [ESTIMATED].
  61 @end display
  62
  63 @cmd{N OF CASES} limits the number of cases processed by any
  64 procedures that follow it in the command stream.  @code{N OF CASES
  65 100}, for example, tells @pspp{} to disregard all cases after the first
  66 100.
  67
  68 When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
  69 only the next procedure (@pxref{TEMPORARY}).  Otherwise, cases beyond
  70 the limit specified are not processed by any later procedure.
  71
  72 If the limit specified on @cmd{N OF CASES} is greater than the number
  73 of cases in the active dataset, it has no effect.
  74
  75 When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
  76 IF}, the case limit is applied to the cases obtained after sampling or
  77 case selection, regardless of how @cmd{N OF CASES} is placed relative
  78 to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file.  Thus, the
  79 commands @code{N OF CASES 100} and @code{SAMPLE .5} together randomly
  80 sample approximately half of the active dataset's cases, then select the
  81 first 100 of those sampled, regardless of their order in the command
  82 file.
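
For instance, assuming a hypothetical numeric variable @code{salary}, the
two fragments below should be treated alike, since sampling is applied
before the case limit in both:

@example
* Fragment 1: the limit is written first.
N OF CASES 100.
SAMPLE .5.
DESCRIPTIVES /VARIABLES=salary.
@end example

@example
* Fragment 2: the sample is written first.
SAMPLE .5.
N OF CASES 100.
DESCRIPTIVES /VARIABLES=salary.
@end example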
  83
  84 @cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
  85 number of cases before @cmd{DATA LIST} or another command to read in
  86 data.  @code{ESTIMATED} never limits the number of cases processed by
  87 procedures.  @pspp{} currently does not make use of case count estimates.
  88
  89 @node SAMPLE
  90 @section SAMPLE
  91 @vindex SAMPLE
  92
  93 @display
  94 SAMPLE @var{num1} [FROM @var{num2}].
  95 @end display
  96
  97 @cmd{SAMPLE} randomly samples a proportion of the cases in the active
  98 dataset.  Unless it follows @cmd{TEMPORARY}, it operates as a
  99 transformation, permanently removing cases from the active dataset.
 100
 101 The proportion to sample can be expressed as a single number between 0
 102 and 1.  If @var{k} is the number specified, and @var{N} is the number
 103 of currently-selected cases in the active dataset, then after
 104 @subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases are
 105 selected.
 106
 107 The proportion to sample can also be specified in the style @subcmd{SAMPLE
 108 @var{m} FROM @var{N}}.  With this style, cases are selected as follows:
 109
 110 @enumerate
 111 @item
 112 If @var{N} is equal to the number of currently-selected cases in the
 113 active dataset, exactly @var{m} cases are selected.
 114
 115 @item
 116 If @var{N} is greater than the number of currently-selected cases in the
 117 active dataset, an equivalent proportion of cases are selected.
 118
 119 @item
 120 If @var{N} is less than the number of currently-selected cases in the
 121 active dataset, exactly @var{m} cases are selected @emph{from the first
 122 @var{N} cases in the active dataset.}
 123 @end enumerate
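
As a rough illustration of the two styles, assume an active dataset of
100 currently-selected cases.  The figures are made up for the example, and
each command is meant to be read as an alternative, not run in sequence.

@example
* Proportion style: keeps approximately 0.25 * 100 = 25 cases.
SAMPLE .25.

* N equals the number of cases: keeps exactly 50 cases.
SAMPLE 50 FROM 100.

* N exceeds the number of cases: keeps roughly 50/200 of them, about 25.
SAMPLE 50 FROM 200.

* N is less than the number of cases: keeps exactly 10 of the first 40 cases.
SAMPLE 10 FROM 40.
@end example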
 124
 125 @cmd{SAMPLE} and @cmd{SELECT IF} are performed in
 126 the order specified by the syntax file.
 127
 128 @cmd{SAMPLE} is always performed before @code{N OF CASES}, regardless
 129 of ordering in the syntax file (@pxref{N OF CASES}).
 130
 131 The same values for @cmd{SAMPLE} may result in different samples.  To
 132 obtain the same sample, use the @code{SET} command to set the random
 133 number seed to the same value before each @cmd{SAMPLE}.  Different
 134 samples may still result when the file is processed on systems with
 135 differing endianness or floating-point formats.  By default, the
 136 random number seed is based on the system time.
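
A minimal sketch of a repeatable sample follows.  The seed value 12345 is
arbitrary, and @code{SEED} is the @cmd{SET} subcommand assumed here for
setting the random number seed.

@example
* Fix the seed so that rerunning this syntax draws the same sample.
SET SEED=12345.
SAMPLE .1.
@end example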
 137
 138 @node SELECT IF
 139 @section SELECT IF
 140 @vindex SELECT IF
 141
 142 @display
 143 SELECT IF @var{expression}.
 144 @end display
 145
 146 @cmd{SELECT IF} selects cases for analysis based on the value of
 147 @var{expression}.  Cases not selected are permanently eliminated
 148 from the active dataset, unless @cmd{TEMPORARY} is in effect
 149 (@pxref{TEMPORARY}).
 150
 151 Specify a boolean expression (@pxref{Expressions}).  If the value of the
 152 expression is true for a particular case, the case is analyzed.  If
 153 the expression has a false or missing value, then the case is
 154 deleted from the data stream.
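
The difference between temporary and permanent selection can be sketched
as follows.  The variables @code{age} and @code{salary} are hypothetical
and used only for illustration.

@example
* Only the first DESCRIPTIVES sees the reduced set of cases.
TEMPORARY.
SELECT IF (age < 30).
DESCRIPTIVES /VARIABLES=salary.
* TEMPORARY has expired, so this procedure sees every case again.
DESCRIPTIVES /VARIABLES=salary.
* Without TEMPORARY, the deletion is permanent for the active dataset.
SELECT IF (age < 30).
DESCRIPTIVES /VARIABLES=salary.
@end example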
 155
 156 Place @cmd{SELECT IF} as early in the command file as
 157 possible.  Cases that are deleted early can be processed more
 158 efficiently in time and space.
 159 Once cases have been deleted from the active dataset using @cmd{SELECT IF}, they
 160 cannot be reinstated.
 161 If you want to be able to reinstate cases, use @cmd{FILTER} (@pxref{FILTER})
 162 instead.
 163
 164 When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
 165 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 166 (@pxref{LAG}).
 167
 168 @subsection Example Select-If
 169
 170 A shop steward is interested in the salaries of younger personnel in a firm.
 171 The file @file{personnel.sav} provides the salaries of all the workers and their
 172 dates of birth.  The syntax in @ref{select-if:ex} shows how @cmd{SELECT IF} can
 173 be used to limit analysis only to those persons born after December 31, 1999.
 174
 175 @float Example, select-if:ex
 176 @psppsyntax {select-if.sps}
 177 @caption {Using @cmd{SELECT IF} to select persons born on or after a certain date.}
 178 @end float
 179
 180 From @ref{select-if:res} one can see that there are 56 persons listed in the dataset,
 181 and 17 of them were born after December 31, 1999.
 182
 183 @float Result, select-if:res
 184 @psppoutput {select-if}
 185 @caption {Salary descriptives before and after the @cmd{SELECT IF} transformation.}
 186 @end float
 187
 188 Note that the @file{personnel.sav} file from which the data were read is unaffected.
 189 The transformation affects only the active dataset.
 190
 191 @node SPLIT FILE
 192 @section SPLIT FILE
 193 @vindex SPLIT FILE
 194
 195 @display
 196 SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
 197 SPLIT FILE OFF.
 198 @end display
 199
 200 @cmd{SPLIT FILE} allows multiple sets of data present in one data
 201 file to be analyzed separately using single statistical procedure
 202 commands.
 203
 204 Specify a list of variable names to analyze multiple sets of
 205 data separately.  Groups of adjacent cases having the same values for these
 206 variables are analyzed by statistical procedure commands as one group.
 207 An independent analysis is carried out for each group of cases, and the
 208 variable values for the group are printed along with the analysis.
 209
 210 When a list of variable names is specified, one of the keywords
 211 @subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified.  With
 212 @subcmd{LAYERED}, which is the default, the separate analyses for each
 213 group are presented together in a single table.  With
 214 @subcmd{SEPARATE}, each analysis is presented in a separate table.
 215 Not all procedures honor the distinction.
 216
 217 Groups are formed only by @emph{adjacent} cases.  To create a split
 218 using a variable where like values are not adjacent in the active dataset,
 219 first sort the data by that variable (@pxref{SORT CASES}).
 220
 221 Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
 222 entire active dataset as a single group of data.
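
A typical sequence might look like the sketch below, which assumes
hypothetical variables @code{region} (the grouping variable) and
@code{income} (the variable being analyzed):

@example
* Make cases with equal region values adjacent before splitting.
SORT CASES BY region.
SPLIT FILE SEPARATE BY region.
* One set of descriptive statistics is produced per region.
DESCRIPTIVES /VARIABLES=income.
SPLIT FILE OFF.
* The whole active dataset is analyzed as a single group again.
DESCRIPTIVES /VARIABLES=income.
@end example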
 223
 224 When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only
 225 the next procedure (@pxref{TEMPORARY}).
 226
 227 @subsection Example Split
 228
 229 The file @file{horticulture.sav} contains data describing the @exvar{yield}
 230 of a number of horticultural specimens which have been subjected to
 231 various @exvar{treatment}s.   If we wanted to investigate linear statistics
 232 of the @exvar{yield}, one way to do this is using the @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}).
 233 However, it is reasonable to expect the mean to be different depending
 234 on the @exvar{treatment}.   So we might want to perform three separate
 235 procedures --- one for each treatment.
 236 @footnote{There are other, possibly better, ways to achieve a similar result
 237 using the @cmd{MEANS} or @cmd{EXAMINE} commands.}
 238 @ref{split:ex} shows how this can be done automatically using
 239 the @cmd{SPLIT FILE} command.
 240
 241 @float Example, split:ex
 242 @psppsyntax {split.sps}
 243 @caption {Running @cmd{DESCRIPTIVES} on each value of @exvar{treatment}}
 244 @end float
 245
 246 In @ref{split:res} you can see that the table of descriptive statistics
 247 appears 3 times --- once for each value of @exvar{treatment}.
 248 In this example, @samp{N}, the number of observations, is identical in
 249 all splits.  This is because the experiment was deliberately designed
 250 that way.  However, in general one can expect a different @samp{N} for each
 251 split.
 252
 253 @float Example, split:res
 254 @psppoutput {split}
 255 @caption {The results of running @cmd{DESCRIPTIVES} with an active split}
 256 @end float
 257
 258 Unless @cmd{TEMPORARY} was used, after a split has been defined for
 259 a dataset it remains active until explicitly disabled.
 260 In the graphical user interface, the active split variable (if any) is
 261 displayed in the status bar (@pxref{split-status-bar:scr}).
 262 If a dataset is saved to a system file (@pxref{SAVE}) whilst a split
 263 is active, the split status is stored in the file and will be
 264 automatically loaded when that file is loaded.
 265
 266 @float Screenshot, split-status-bar:scr
 267 @psppimage {split-status-bar}
 268 @caption {The status bar indicating that the data set is split using the @exvar{treatment} variable}
 269 @end float
 270
 271
 272 @node TEMPORARY
 273 @section TEMPORARY
 274 @vindex TEMPORARY
 275
 276 @display
 277 TEMPORARY.
 278 @end display
 279
 280 @cmd{TEMPORARY} is used to make the effects of transformations
 281 following its execution temporary.  These transformations
 282 affect only the execution of the next procedure or procedure-like
 283 command.  Their effects are not saved to the active dataset.
 284
 285 The only specification on @cmd{TEMPORARY} is the command name.
 286
 287 @cmd{TEMPORARY} may not appear within a @cmd{DO IF} or @cmd{LOOP}
 288 construct.  It may appear only once between procedures and
 289 procedure-like commands.
 290
 291 Scratch variables cannot be used following @cmd{TEMPORARY}.
 292
 293 @subsection Example Temporary
 294
 295 In @ref{temporary:ex} there are two @cmd{COMPUTE} transformations.  One
 296 of them immediately follows a @cmd{TEMPORARY} command, and therefore has
 297 effect only for the next procedure, which in this case is the first
 298 @cmd{DESCRIPTIVES} command.
 299
 300 @float Example, temporary:ex
 301 @psppsyntax {temporary.sps}
 302 @caption {Running a @cmd{COMPUTE} transformation after @cmd{TEMPORARY}}
 303 @end float
 304
 305 The data read by the first @cmd{DESCRIPTIVES} procedure are 4, 5, 8,
 306 10.5, 13, 15.  The data read by the second @cmd{DESCRIPTIVES} procedure are 1, 2,
 307 5, 7.5, 10, 12.   This is because the second @cmd{COMPUTE} transformation
 308 has no effect on the second @cmd{DESCRIPTIVES} procedure.   You can check these
 309 figures in @ref{temporary:res}.
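
The figures above are consistent with syntax along the following lines.
This is only a reconstruction for illustration: the input values and the
two @cmd{COMPUTE} expressions are assumptions chosen to reproduce those
figures, and the actual contents of @file{temporary.sps} may differ.

@example
DATA LIST LIST /x.
BEGIN DATA.
2
4
10
15
20
24
END DATA.

* Permanent transformation: x becomes 1, 2, 5, 7.5, 10, 12.
COMPUTE x = x / 2.

TEMPORARY.
* Temporary transformation: x becomes 4, 5, 8, 10.5, 13, 15 for one procedure.
COMPUTE x = x + 3.

* Reads the temporarily shifted values.
DESCRIPTIVES /VARIABLES=x.
* Reads the values left by the permanent transformation only.
DESCRIPTIVES /VARIABLES=x.
@end example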
 310
 311 @float Result, temporary:res
 312 @psppoutput {temporary}
 313 @caption {The results of running two consecutive @cmd{DESCRIPTIVES} commands after
 314          a temporary transformation}
 315 @end float
 316
 317
 318 @node WEIGHT
 319 @section WEIGHT
 320 @vindex WEIGHT
 321
 322 @display
 323 WEIGHT BY @var{var_name}.
 324 WEIGHT OFF.
 325 @end display
 326
 327 @cmd{WEIGHT} assigns cases varying weights,
 328 changing the frequency distribution of the active dataset.  Execution of
 329 @cmd{WEIGHT} is delayed until data have been read.
 330
 331 If a variable name is specified, @cmd{WEIGHT} causes the values of that
 332 variable to be used as weighting factors for subsequent statistical
 333 procedures.  Use of keyword @subcmd{BY} is optional but recommended.  Weighting
 334 variables must be numeric.  Scratch variables may not be used for
 335 weighting (@pxref{Scratch Variables}).
 336
 337 When @subcmd{OFF} is specified, subsequent statistical procedures weight all
 338 cases equally.
 339
 340 A positive integer weighting factor @var{w} on a case yields the
 341 same statistical output as would replicating the case @var{w} times.
 342 A weighting factor of 0 is treated for statistical purposes as if the
 343 case did not exist in the input.  Weighting values need not be
 344 integers, but negative and system-missing values for the weighting
 345 variable are interpreted as weighting factors of 0.  User-missing
 346 values are not treated specially.
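
To make these rules concrete, consider the following sketch, in which the
variable names and data values are invented for illustration:

@example
DATA LIST LIST /score w.
BEGIN DATA.
10 2
20 0
30 -1
40 1.5
END DATA.

WEIGHT BY w.
* Under the rules above the effective weights are 2, 0, 0, and 1.5.
DESCRIPTIVES /VARIABLES=score.
@end example

Under the rules described above, the second and third cases contribute
nothing, so the weighted mean of @code{score} would be
(10*2 + 40*1.5) / (2 + 1.5) = 80 / 3.5, or approximately 22.86.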
 347
 348 When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
 349 the next procedure (@pxref{TEMPORARY}).
 350
 351 @cmd{WEIGHT} does not cause cases in the active dataset to be
 352 replicated in memory.
 353
 354
 355 @subsection Example Weights
 356
 357 One could define a dataset containing an inventory of stock items.
 358 It would be reasonable to use a string variable for a description of the
 359 item, and a numeric variable for the number in stock, as in @ref{weight:ex}.
 360
 361 @float Example, weight:ex
 362 @psppsyntax {weight.sps}
 363 @caption {Setting the weight on the variable @exvar{quantity}}
 364 @end float
 365
 366 One analysis which would surely be of interest is
 367 the relative amounts of each item in stock.
 368 However, without setting a weight variable, @cmd{FREQUENCIES}
 369 (@pxref{FREQUENCIES}) does not tell us what we want to know, since
 370 there is only one case for each stock item. @ref{weight:res} shows the
 371 difference between the weighted and unweighted frequency tables.
 372
 373 @float Example, weight:res
 374 @psppoutput {weight}
 375 @caption {Weighted and unweighted frequency tables of @exvar{items}}
 376 @end float