doc/statistics.texi

@node Statistics, Utilities, Conditionals and Looping, Top
@chapter Statistics

This chapter documents the statistical procedures that PSPP supports so
far.

@menu
* DESCRIPTIVES::                Descriptive statistics.
* FREQUENCIES::                 Frequency tables.
* CROSSTABS::                   Crosstabulation tables.
* T-TEST::                      Test hypotheses about means.
* ONEWAY::                      One-way analysis of variance.
@end menu

@node DESCRIPTIVES, FREQUENCIES, Statistics, Statistics
@section DESCRIPTIVES

@vindex DESCRIPTIVES
@display
DESCRIPTIVES
        /VARIABLES=var_list
        /MISSING=@{VARIABLE,LISTWISE@} @{INCLUDE,NOINCLUDE@}
        /FORMAT=@{LABELS,NOLABELS@} @{NOINDEX,INDEX@} @{LINE,SERIAL@}
        /SAVE
        /STATISTICS=@{ALL,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,
                     SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,DEFAULT,
                     SESKEWNESS,SEKURTOSIS@}
        /SORT=@{NONE,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,SKEWNESS,
               RANGE,MINIMUM,MAXIMUM,SUM,SESKEWNESS,SEKURTOSIS,NAME@}
              @{A,D@}
@end display

The @cmd{DESCRIPTIVES} procedure reads the active file and outputs the
descriptive statistics requested by the user.  It can also compute
Z scores.

The VARIABLES subcommand, which is required, specifies the list of
variables to be analyzed.  The keyword VARIABLES itself is optional.

All other subcommands are optional:

The MISSING subcommand determines the handling of missing values.  If
INCLUDE is set, then user-missing values are included in the
calculations.  If NOINCLUDE is set, which is the default, user-missing
values are excluded.  If VARIABLE is set, then missing values are
excluded on a variable by variable basis; if LISTWISE is set, then
the entire case is excluded whenever any value in that case has a
system-missing or, if NOINCLUDE is set, user-missing value.
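
As a minimal sketch of these settings, assuming hypothetical numeric
variables @code{height} and @code{weight} in the active file, the
following keeps user-missing values in the calculations and excludes
the remaining missing values one variable at a time:

@example
DESCRIPTIVES /VARIABLES=height weight
        /MISSING=VARIABLE INCLUDE.
@end example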

The FORMAT subcommand affects the output format.  Currently the
LABELS/NOLABELS and NOINDEX/INDEX settings are not used.  When SERIAL is
set, both the valid and the missing number of cases are listed in the
output; when LINE is set, only the number of valid cases is listed.

The SAVE subcommand causes @cmd{DESCRIPTIVES} to calculate Z scores for all
the specified variables.  The Z scores are saved to new variables.
Variable names are generated by trying first the original variable name
with Z prepended and truncated to a maximum of 8 characters, then the
names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through
ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence.  In addition, Z score
variable names can be specified explicitly on VARIABLES in the variable
list by enclosing them in parentheses after each variable.
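
For instance, assuming hypothetical variables @code{height} and
@code{weight}, the following sketch requests Z scores for both.  The
Z scores for @code{height} go to the explicitly named variable
@code{zheight}, while the name for the @code{weight} Z scores is
generated by the rule above, so the first name tried is @code{weight}
with Z prepended:

@example
DESCRIPTIVES /VARIABLES=height (zheight) weight
        /SAVE.
@end example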

The STATISTICS subcommand specifies the statistics to be displayed:

@table @code
@item ALL
All of the statistics below.
@item MEAN
Arithmetic mean.
@item SEMEAN
Standard error of the mean.
@item STDDEV
Standard deviation.
@item VARIANCE
Variance.
@item KURTOSIS
Kurtosis and standard error of the kurtosis.
@item SKEWNESS
Skewness and standard error of the skewness.
@item RANGE
Range.
@item MINIMUM
Minimum value.
@item MAXIMUM
Maximum value.
@item SUM
Sum.
@item DEFAULT
Mean, standard deviation, minimum, maximum.
@item SEKURTOSIS
Standard error of the kurtosis.
@item SESKEWNESS
Standard error of the skewness.
@end table

The SORT subcommand specifies how the statistics should be sorted.  Most
of the possible values should be self-explanatory.  NAME causes the
statistics to be sorted by variable name.  By default, the statistics
are listed in the order that the variables are specified on the
VARIABLES subcommand.  The A and D settings request an ascending or
descending sort order, respectively.
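
A short sketch, again with hypothetical variable names, that displays
four statistics and lists the variables in order of their means rather
than in the order given on VARIABLES:

@example
DESCRIPTIVES /VARIABLES=height weight age
        /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
        /SORT=MEAN.
@end example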

@node FREQUENCIES, CROSSTABS, DESCRIPTIVES, Statistics
@section FREQUENCIES

@vindex FREQUENCIES
@display
FREQUENCIES
        /VARIABLES=var_list
        /FORMAT=@{TABLE,NOTABLE,LIMIT(limit)@}
                @{STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]@}
                @{LABELS,NOLABELS@}
                @{AVALUE,DVALUE,AFREQ,DFREQ@}
                @{SINGLE,DOUBLE@}
                @{OLDPAGE,NEWPAGE@}
        /MISSING=@{EXCLUDE,INCLUDE@}
        /STATISTICS=@{DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
                     KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
                     SESKEWNESS,SEKURTOSIS,ALL,NONE@}
        /NTILES=ntiles
        /PERCENTILES=percent@dots{}

(These options are not currently implemented.)
        /BARCHART=@dots{}
        /HISTOGRAM=@dots{}
        /HBAR=@dots{}
        /GROUPED=@dots{}

(Integer mode.)
        /VARIABLES=var_list (low,high)@dots{}
@end display

The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
variables.
@cmd{FREQUENCIES} can also calculate and display descriptive statistics
(including median and mode) and percentiles.

In the future, @cmd{FREQUENCIES} will also support graphical output in the
form of bar charts and histograms.  In addition, it will be able to
support percentiles for grouped data.

The VARIABLES subcommand is the only required subcommand.  Specify the
variables to be analyzed.  In most cases, this is all that is required.
This is known as @dfn{general mode}.

Occasionally, one may want to invoke a special mode called @dfn{integer
mode}.  Normally, in general mode, PSPP will automatically determine
what values occur in the data.  In integer mode, the user specifies the
range of values that the data assume.  To invoke this mode, specify the
low and high ends of the range in parentheses after the variable list,
separated by a comma.  Data values inside the range are truncated to an
integer, then assigned to that value.  If values occur outside this
range, they are discarded.
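
As an illustration, assuming a hypothetical variable @code{rating}
coded from 1 to 5, the first command below runs in general mode and
the second in integer mode:

@example
FREQUENCIES /VARIABLES=rating.
FREQUENCIES /VARIABLES=rating (1,5).
@end example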

The FORMAT subcommand controls the output format.  It has several
possible settings:

@itemize @bullet
@item
TABLE, the default, causes a frequency table to be output for every
variable specified.  NOTABLE prevents them from being output.  LIMIT
with a numeric argument causes them to be output except when there are
more than the specified number of values in the table.

@item
STANDARD frequency tables contain more complete information, but also
take up more space on the printed page.  CONDENSE frequency tables are
less informative but take up less space.  ONEPAGE with a numeric
argument will output standard frequency tables if there are the
specified number of values or fewer, condensed tables otherwise.  ONEPAGE
without an argument defaults to a threshold of 50 values.

@item
LABELS causes value labels to be displayed in STANDARD frequency
tables.  NOLABELS prevents this.

@item
Normally frequency tables are sorted in ascending order by value.  This
is AVALUE.  DVALUE tables are sorted in descending order by value.
AFREQ and DFREQ tables are sorted in ascending and descending order,
respectively, by frequency count.

@item
SINGLE spaced frequency tables are closely spaced.  DOUBLE spaced
frequency tables have wider spacing.

@item
OLDPAGE and NEWPAGE are not currently used.
@end itemize
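
Several FORMAT settings can be combined.  The sketch below, which
assumes a hypothetical variable @code{region}, sorts the table by
descending frequency count, omits value labels, and suppresses the
table altogether if it would hold more than 20 values:

@example
FREQUENCIES /VARIABLES=region
        /FORMAT=DFREQ NOLABELS LIMIT(20).
@end example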

The MISSING subcommand controls the handling of user-missing values.
When EXCLUDE, the default, is set, user-missing values are not included
in frequency tables or statistics.  When INCLUDE is set, user-missing
values are included.  System-missing values are never included in
statistics, but are listed in frequency tables.

The available STATISTICS are the same as available in @cmd{DESCRIPTIVES}
(@pxref{DESCRIPTIVES}), with the addition of MEDIAN, the data's median
value, and MODE, the mode.  (If there are multiple modes, the smallest
value is reported.)  By default, the mean, standard deviation,
minimum, and maximum are reported for each variable.

PERCENTILES causes the specified percentiles to be reported.
The percentiles should be given as a list of numbers between 0
and 100 inclusive.
The NTILES subcommand causes the percentiles to be reported at the
boundaries of the data set divided into the specified number of ranges.
For instance, @code{/NTILES=4} would cause quartiles to be reported.
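
For example, assuming a hypothetical variable @code{income}, the
following sketch suppresses the frequency table and reports the median
and mode, the quartiles, and the 10th and 90th percentiles:

@example
FREQUENCIES /VARIABLES=income
        /FORMAT=NOTABLE
        /STATISTICS=MEDIAN MODE
        /NTILES=4
        /PERCENTILES=10 90.
@end example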

@node CROSSTABS, T-TEST, FREQUENCIES, Statistics
@section CROSSTABS

@vindex CROSSTABS
@display
CROSSTABS
        /TABLES=var_list BY var_list [BY var_list]@dots{}
        /MISSING=@{TABLE,INCLUDE,REPORT@}
        /WRITE=@{NONE,CELLS,ALL@}
        /FORMAT=@{TABLES,NOTABLES@}
                @{LABELS,NOLABELS,NOVALLABS@}
                @{PIVOT,NOPIVOT@}
                @{AVALUE,DVALUE@}
                @{NOINDEX,INDEX@}
                @{BOX,NOBOX@}
        /CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
                ASRESIDUAL,ALL,NONE@}
        /STATISTICS=@{CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
                     KAPPA,ETA,CORR,ALL,NONE@}

(Integer mode.)
        /VARIABLES=var_list (low,high)@dots{}
@end display

The @cmd{CROSSTABS} procedure displays crosstabulation
tables requested by the user.  It can calculate several statistics for
each cell in the crosstabulation tables.  In addition, a number of
statistics can be calculated for each table itself.

The TABLES subcommand is used to specify the tables to be reported.  Any
number of dimensions is permitted, and any number of variables per
dimension is allowed.  The TABLES subcommand may be repeated as many
times as needed.  This is the only required subcommand in @dfn{general
mode}.
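
As a sketch with hypothetical variables, the first TABLES subcommand
below requests a two-dimensional table of @code{sex} by @code{region},
and the second requests a three-dimensional table that adds
@code{agegroup} as a further dimension:

@example
CROSSTABS /TABLES=sex BY region
        /TABLES=sex BY region BY agegroup.
@end example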

Occasionally, one may want to invoke a special mode called @dfn{integer
mode}.  Normally, in general mode, PSPP automatically determines
what values occur in the data.  In integer mode, the user specifies the
range of values that the data assume.  To invoke this mode, specify the
VARIABLES subcommand, giving a range of data values in parentheses for
each variable to be used on the TABLES subcommand.  Data values inside
the range are truncated to an integer, then assigned to that
value.  If values occur outside this range, they are discarded.  When it
is present, the VARIABLES subcommand must precede the TABLES
subcommand.
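
A minimal integer mode sketch, assuming hypothetical numeric variables
@code{sex} coded 1 to 2 and @code{rating} coded 1 to 5:

@example
CROSSTABS /VARIABLES=sex (1,2) rating (1,5)
        /TABLES=sex BY rating.
@end example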

In general mode, numeric and string variables may be specified on
TABLES.  Although long string variables are allowed, only their
initial short-string parts are used.  In integer mode, only numeric
variables are allowed.

The MISSING subcommand determines the handling of user-missing values.
When set to TABLE, the default, missing values are dropped on a table by
table basis.  When set to INCLUDE, user-missing values are included in
tables and statistics.  When set to REPORT, which is allowed only in
integer mode, user-missing values are included in tables but marked with
an @samp{M} (for ``missing'') and excluded from statistical
calculations.

Currently the WRITE subcommand is ignored.

The FORMAT subcommand controls the characteristics of the
crosstabulation tables to be displayed.  It has a number of possible
settings:

@itemize @bullet
@item
TABLES, the default, causes crosstabulation tables to be output.
NOTABLES suppresses them.

@item
LABELS, the default, allows variable labels and value labels to appear
in the output.  NOLABELS suppresses them.  NOVALLABS displays variable
labels but suppresses value labels.

@item
PIVOT, the default, causes each TABLES subcommand to be displayed in a
pivot table format.  NOPIVOT causes the old-style crosstabulation format
to be used.

@item
AVALUE, the default, causes values to be sorted in ascending order.
DVALUE asserts a descending sort order.

@item
INDEX/NOINDEX is currently ignored.

@item
BOX/NOBOX is currently ignored.
@end itemize

The CELLS subcommand controls the contents of each cell in the displayed
crosstabulation table.  The possible settings are:

@table @asis
@item COUNT
Frequency count.
@item ROW
Row percent.
@item COLUMN
Column percent.
@item TOTAL
Table percent.
@item EXPECTED
Expected value.
@item RESIDUAL
Residual.
@item SRESIDUAL
Standardized residual.
@item ASRESIDUAL
Adjusted standardized residual.
@item ALL
All of the above.
@item NONE
Suppress cells entirely.
@end table

@samp{/CELLS} without any settings specified requests COUNT, ROW,
COLUMN, and TOTAL.  If CELLS is not specified at all then only COUNT
will be selected.
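
For instance, the following sketch (variable names are hypothetical)
shows the count, the row percent, and the expected value in each cell:

@example
CROSSTABS /TABLES=sex BY region
        /CELLS=COUNT ROW EXPECTED.
@end example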

The STATISTICS subcommand selects statistics for computation:

@table @asis
@item CHISQ
Pearson chi-square, likelihood ratio, Fisher's exact test, continuity
correction, linear-by-linear association.
@item PHI
Phi.
@item CC
Contingency coefficient.
@item LAMBDA
Lambda.
@item UC
Uncertainty coefficient.
@item BTAU
Tau-b.
@item CTAU
Tau-c.
@item RISK
Risk estimate.
@item GAMMA
Gamma.
@item D
Somers' D.
@item KAPPA
Cohen's Kappa.
@item ETA
Eta.
@item CORR
Spearman correlation, Pearson's r.
@item ALL
All of the above.
@item NONE
No statistics.
@end table

Selected statistics are only calculated when appropriate for the
statistic.  Certain statistics require tables of a particular size, and
some statistics are calculated only in integer mode.

@samp{/STATISTICS} without any settings selects CHISQ.  If the
STATISTICS subcommand is not given, no statistics are calculated.
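
The sketch below, again with hypothetical variables, requests the
chi-square tests together with phi and the contingency coefficient:

@example
CROSSTABS /TABLES=sex BY region
        /STATISTICS=CHISQ PHI CC.
@end example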

@strong{Please note:} Currently the implementation of CROSSTABS has the
following bugs:

@itemize @bullet
@item
Pearson's R (but not Spearman) is off a little.
@item
T values for Spearman's R and Pearson's R are wrong.
@item
Significance of symmetric and directional measures is not calculated.
@item
Asymmetric ASEs and T values for lambda are wrong.
@item
ASE of Goodman and Kruskal's tau is not calculated.
@item
ASE of symmetric Somers' D is wrong.
@item
Approximate T of uncertainty coefficient is wrong.
@end itemize

Fixes for any of these deficiencies would be welcomed.

@node T-TEST, ONEWAY, CROSSTABS, Statistics
@comment  node-name,  next,  previous,  up
@section T-TEST

@vindex T-TEST
@display
T-TEST
        /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
        /CRITERIA=CIN(confidence)


(One Sample mode.)
        TESTVAL=test_value
        /VARIABLES=var_list


(Independent Samples mode.)
        GROUPS=var(value1 [, value2])
        /VARIABLES=var_list


(Paired Samples mode.)
        PAIRS=var_list [WITH var_list [(PAIRED)] ]

@end display

The @cmd{T-TEST} procedure outputs tables used in testing hypotheses about
means.
It operates in one of three modes:
@itemize
@item One Sample mode.
@item Independent Samples mode.
@item Paired Samples mode.
@end itemize

@noindent
Each of these modes is described in more detail below.
There are two optional subcommands which are common to all modes.

The @cmd{/CRITERIA} subcommand tells PSPP the confidence level of the
confidence intervals used in the tests.  The default value is 0.95.

The @cmd{MISSING} subcommand determines the handling of missing
values.
If INCLUDE is set, then user-missing values are included in the
calculations, but system-missing values are not.
If EXCLUDE is set, which is the default, user-missing
values are excluded as well as system-missing values.

If LISTWISE is set, then the entire case is excluded from analysis
whenever any variable specified in the @cmd{/VARIABLES}, @cmd{/PAIRS} or
@cmd{/GROUPS} subcommands contains a missing value.
If ANALYSIS is set, then missing values are excluded only in the analysis for
which they would be needed.  This is the default.

@menu
* One Sample Mode::             Testing against a hypothesised mean
* Independent Samples Mode::    Testing two independent groups for equal mean
* Paired Samples Mode::         Testing two interdependent groups for equal mean
@end menu

@node One Sample Mode, Independent Samples Mode, T-TEST, T-TEST
@subsection One Sample Mode

The @cmd{TESTVAL} subcommand invokes the One Sample mode.
This mode is used to test a population mean against a hypothesised
mean.
The value given to the @cmd{TESTVAL} subcommand is the value against
which you wish to test.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
tell PSPP which variables you wish to test.
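
As a sketch, assuming a hypothetical variable @code{height}, the
following tests whether the mean of @code{height} differs from 170.
The subcommands are written as they appear in the syntax summary above:

@example
T-TEST TESTVAL=170
        /VARIABLES=height.
@end example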

@node Independent Samples Mode, Paired Samples Mode, One Sample Mode, T-TEST
@comment  node-name,  next,  previous,  up
@subsection Independent Samples Mode

The @cmd{GROUPS} subcommand invokes Independent Samples mode or
`Groups' mode.
This mode is used to test whether two groups of values have the
same population mean.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
tell PSPP the dependent variables you wish to test.

The variable given in the @cmd{GROUPS} subcommand is the independent
variable which determines to which group the samples belong.
The values in parentheses are the specific values of the independent
variable for each group.
If the parentheses are omitted and no values are given, the default values
of 1.0 and 2.0 are assumed.

If the independent variable is numeric,
it is acceptable to specify only one value inside the parentheses.
If you do this, cases where the independent variable is
less than or equal to this value belong to the first group, and cases
greater than this value belong to the second group.
When using this form of the @cmd{GROUPS} subcommand, missing values in
the independent variable are excluded on a listwise basis, regardless
of whether @cmd{/MISSING=LISTWISE} was specified.
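
Two sketches with hypothetical variables follow.  The first compares
the cases where @code{sex} is 1 with the cases where it is 2.  The
second uses a single cut point, so cases with @code{age} of 40 or less
form the first group and cases with @code{age} above 40 form the second:

@example
T-TEST GROUPS=sex(1,2)
        /VARIABLES=height weight.
T-TEST GROUPS=age(40)
        /VARIABLES=income.
@end example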

@node Paired Samples Mode,  , Independent Samples Mode, T-TEST
@comment  node-name,  next,  previous,  up
@subsection Paired Samples Mode

The @cmd{PAIRS} subcommand introduces Paired Samples mode.
Use this mode when repeated measures have been taken from the same
samples.
If the @code{WITH} keyword is omitted, then tables for all
combinations of variables given in the @cmd{PAIRS} subcommand are
generated.
If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword
is also given, then the number of variables preceding @code{WITH}
must be the same as the number following it.
In this case, tables for each respective pair of variables are
generated.
In the event that the @code{WITH} keyword is given, but the
@code{(PAIRED)} keyword is omitted, then tables for each combination
of a variable preceding @code{WITH} with a variable following
@code{WITH} are generated.
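
To illustrate the three forms, assume hypothetical variables
@code{pre1}, @code{pre2}, @code{post1}, and @code{post2}:

@example
T-TEST PAIRS=pre1 pre2 post1.
T-TEST PAIRS=pre1 pre2 WITH post1 post2 (PAIRED).
T-TEST PAIRS=pre1 pre2 WITH post1 post2.
@end example

Following the rules above, the first command compares every
combination of its three variables, the second compares @code{pre1}
with @code{post1} and @code{pre2} with @code{post2}, and the third
compares each of @code{pre1} and @code{pre2} with each of @code{post1}
and @code{post2}.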

@node ONEWAY, , T-TEST, Statistics
@comment  node-name,  next,  previous,  up
@section ONEWAY

@vindex ONEWAY
@cindex analysis of variance
@cindex ANOVA

@display
ONEWAY
        [/VARIABLES = ] var_list BY var
        /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
        /CONTRASTS= value1 [, value2] ... [,valueN]
        /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@}

@end display

The @cmd{ONEWAY} procedure performs a one-way analysis of variance of
variables factored by a single independent variable.
It is used to compare the means of a population
divided into more than two groups.

The variables to be analysed should be given in the @code{VARIABLES}
subcommand.
The list of variables must be followed by the @code{BY} keyword and
the name of the independent (or factor) variable.

You can use the @code{STATISTICS} subcommand to tell PSPP to display
ancillary information.  The options accepted are:
@itemize
@item DESCRIPTIVES
Displays descriptive statistics about the groups factored by the independent
variable.
@item HOMOGENEITY
Displays the Levene test of Homogeneity of Variance for the
variables and their groups.
@end itemize

The @code{CONTRASTS} subcommand is used when you anticipate certain
differences between the groups.
The subcommand must be followed by a list of numerals which are the
coefficients of the groups to be tested.
The number of coefficients must correspond to the number of distinct
groups (or values of the independent variable).
If the sum of the coefficients is not zero, then PSPP will
display a warning, but will proceed with the analysis.
The @code{CONTRASTS} subcommand may be given up to 10 times in order
to specify different contrast tests.
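
A sketch with hypothetical variables, assuming that @code{fertilizer}
takes exactly three distinct values so that each contrast needs three
coefficients.  Both sets of coefficients sum to zero:

@example
ONEWAY /VARIABLES=yield BY fertilizer
        /STATISTICS=DESCRIPTIVES HOMOGENEITY
        /CONTRASTS=-1, 0, 1
        /CONTRASTS=-1, 2, -1.
@end example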
@setfilename ignored