@node Statistics, Utilities, Conditionals and Looping, Top
@chapter Statistics

This chapter documents the statistical procedures that PSPP supports so
far.

@c If you add any new commands, then don't forget to remove the entry in
@c not-implemented.texi

@menu
* DESCRIPTIVES::                Descriptive statistics.
* FREQUENCIES::                 Frequency tables.
* CROSSTABS::                   Crosstabulation tables.
* T-TEST::                      Test hypotheses about means.
* ONEWAY::                      One way analysis of variance.
@end menu

@node DESCRIPTIVES, FREQUENCIES, Statistics, Statistics
@section DESCRIPTIVES

@vindex DESCRIPTIVES
@display
DESCRIPTIVES
        /VARIABLES=var_list
        /MISSING=@{VARIABLE,LISTWISE@} @{INCLUDE,NOINCLUDE@}
        /FORMAT=@{LABELS,NOLABELS@} @{NOINDEX,INDEX@} @{LINE,SERIAL@}
        /SAVE
        /STATISTICS=@{ALL,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,
                     SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,DEFAULT,
                     SESKEWNESS,SEKURTOSIS@}
        /SORT=@{NONE,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,SKEWNESS,
               RANGE,MINIMUM,MAXIMUM,SUM,SESKEWNESS,SEKURTOSIS,NAME@}
              @{A,D@}
@end display

The @cmd{DESCRIPTIVES} procedure reads the active file and outputs
the descriptive statistics requested by the user.  In addition, it can
optionally compute Z scores.

The VARIABLES subcommand, which is required, specifies the list of
variables to be analyzed.  Keyword VARIABLES is optional.

All other subcommands are optional:

The MISSING subcommand determines the handling of missing values.  If
INCLUDE is set, then user-missing values are included in the
calculations.  If NOINCLUDE is set, which is the default, user-missing
values are excluded.  If VARIABLE is set, then missing values are
excluded on a variable by variable basis; if LISTWISE is set, then
the entire case is excluded whenever any value in that case has a
system-missing or, unless INCLUDE is set, user-missing value.

The FORMAT subcommand affects the output format.  Currently the
LABELS/NOLABELS and NOINDEX/INDEX settings are not used.  When SERIAL is
set, the numbers of both valid and missing cases are listed in the
output; when LINE is set, only the number of valid cases is listed.

The SAVE subcommand causes @cmd{DESCRIPTIVES} to calculate Z scores for all
the specified variables.  The Z scores are saved to new variables.
Variable names are generated by trying first the original variable name
with Z prepended and truncated to a maximum of 8 characters, then the
names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through
ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence.  In addition, Z score
variable names can be specified explicitly on VARIABLES in the variable
list by enclosing them in parentheses after each variable.

The STATISTICS subcommand specifies the statistics to be displayed:

@table @code
@item ALL
All of the statistics below.
@item MEAN
Arithmetic mean.
@item SEMEAN
Standard error of the mean.
@item STDDEV
Standard deviation.
@item VARIANCE
Variance.
@item KURTOSIS
Kurtosis and standard error of the kurtosis.
@item SKEWNESS
Skewness and standard error of the skewness.
@item RANGE
Range.
@item MINIMUM
Minimum value.
@item MAXIMUM
Maximum value.
@item SUM
Sum.
@item DEFAULT
Mean, standard deviation, minimum, maximum.
@item SEKURTOSIS
Standard error of the kurtosis.
@item SESKEWNESS
Standard error of the skewness.
@end table

The SORT subcommand specifies how the statistics should be sorted.  Most
of the possible values should be self-explanatory.  NAME causes the
statistics to be sorted by variable name.  By default, the statistics
are listed in the order in which the variables are specified on the
VARIABLES subcommand.  The A and D settings request an ascending or
descending sort order, respectively.

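As an illustration, the following sketch assumes an active file with
hypothetical numeric variables @code{height} and @code{weight}.  It
requests four statistics, sorts the output by mean, and saves Z scores;
the explicit Z score name @code{zweight} is also only illustrative.

@example
* height, weight, and zweight are hypothetical names.
DESCRIPTIVES
        /VARIABLES=height weight (zweight)
        /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
        /SORT=MEAN
        /SAVE.
@end example

Here the Z scores for @code{weight} would be stored under the name given
in parentheses, while the name for the Z scores of @code{height} would
be generated by the rules described for SAVE.
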
@node FREQUENCIES, CROSSTABS, DESCRIPTIVES, Statistics
@section FREQUENCIES

@vindex FREQUENCIES
@display
FREQUENCIES
        /VARIABLES=var_list
        /FORMAT=@{TABLE,NOTABLE,LIMIT(limit)@}
                @{STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]@}
                @{LABELS,NOLABELS@}
                @{AVALUE,DVALUE,AFREQ,DFREQ@}
                @{SINGLE,DOUBLE@}
                @{OLDPAGE,NEWPAGE@}
        /MISSING=@{EXCLUDE,INCLUDE@}
        /STATISTICS=@{DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
                     KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
                     SESKEWNESS,SEKURTOSIS,ALL,NONE@}
        /NTILES=ntiles
        /PERCENTILES=percent@dots{}

(These options are not currently implemented.)
        /BARCHART=@dots{}
        /HISTOGRAM=@dots{}
        /HBAR=@dots{}
        /GROUPED=@dots{}

(Integer mode.)
        /VARIABLES=var_list (low,high)@dots{}
@end display

The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
variables.
@cmd{FREQUENCIES} can also calculate and display descriptive statistics
(including median and mode) and percentiles.

In the future, @cmd{FREQUENCIES} will also support graphical output in the
form of bar charts and histograms.  In addition, it will be able to
support percentiles for grouped data.

The VARIABLES subcommand is the only required subcommand.  Specify the
variables to be analyzed.  In most cases, this is all that is required.
This is known as @dfn{general mode}.

Occasionally, one may want to invoke a special mode called @dfn{integer
mode}.  Normally, in general mode, PSPP will automatically determine
what values occur in the data.  In integer mode, the user specifies the
range of values that the data assumes.  To invoke this mode, follow the
variable list with the low and high ends of the range in parentheses,
separated by a comma.  Data values inside the range are truncated to
integers, then counted under the resulting integer value.  If values
occur outside this range, they are discarded.

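As a sketch of integer mode, assume a hypothetical variable
@code{rating} whose values of interest run from 1 to 5:

@example
* rating is a hypothetical variable name.
FREQUENCIES /VARIABLES=rating (1,5).
@end example

Under the rules above, a value such as 3.7 would be truncated to 3, and
any value outside the range 1 to 5 would be discarded.
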
The FORMAT subcommand controls the output format.  It has several
possible settings:

@itemize @bullet
@item
TABLE, the default, causes a frequency table to be output for every
variable specified.  NOTABLE prevents them from being output.  LIMIT
with a numeric argument causes them to be output except when there are
more than the specified number of values in the table.

@item
STANDARD frequency tables contain more complete information, but also
take up more space on the printed page.  CONDENSE frequency tables are
less informative but take up less space.  ONEPAGE with a numeric
argument will output standard frequency tables if there are the
specified number of values or fewer, condensed tables otherwise.  ONEPAGE
without an argument defaults to a threshold of 50 values.

@item
LABELS causes value labels to be displayed in STANDARD frequency
tables.  NOLABELS prevents this.

@item
Normally frequency tables are sorted in ascending order by value.  This
is AVALUE.  DVALUE tables are sorted in descending order by value.
AFREQ and DFREQ tables are sorted in ascending and descending order,
respectively, by frequency count.

@item
SINGLE spaced frequency tables are closely spaced.  DOUBLE spaced
frequency tables have wider spacing.

@item
OLDPAGE and NEWPAGE are not currently used.
@end itemize

The MISSING subcommand controls the handling of user-missing values.
When EXCLUDE, the default, is set, user-missing values are not included
in frequency tables or statistics.  When INCLUDE is set, user-missing
values are included.  System-missing values are never included in
statistics, but are listed in frequency tables.

The available STATISTICS are the same as available in @cmd{DESCRIPTIVES}
(@pxref{DESCRIPTIVES}), with the addition of MEDIAN, the data's median
value, and MODE, the mode.  (If there are multiple modes, the smallest
value is reported.)  By default, the mean, standard deviation, minimum,
and maximum are reported for each variable.

PERCENTILES causes the specified percentiles to be reported.
The percentiles should be presented as a list of numbers between 0
and 100 inclusive.
The NTILES subcommand causes the percentiles to be reported at the
boundaries of the data set divided into the specified number of ranges.
For instance, @code{/NTILES=4} would cause quartiles to be reported.

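A possible invocation combining these subcommands is sketched below.
The variable @code{income} is hypothetical, and separating the
percentile values with spaces is an assumption.

@example
* income is a hypothetical variable name.
FREQUENCIES
        /VARIABLES=income
        /FORMAT=NOTABLE
        /STATISTICS=MEAN MEDIAN MODE
        /NTILES=4
        /PERCENTILES=10 90.
@end example

Here @code{/NTILES=4} asks for the quartile boundaries, that is, the
25th, 50th, and 75th percentiles, and @code{/PERCENTILES} adds the 10th
and 90th.  NOTABLE suppresses the frequency table itself, so only the
statistics and percentiles would be shown.
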
@node CROSSTABS, T-TEST, FREQUENCIES, Statistics
@section CROSSTABS

@vindex CROSSTABS
@display
CROSSTABS
        /TABLES=var_list BY var_list [BY var_list]@dots{}
        /MISSING=@{TABLE,INCLUDE,REPORT@}
        /WRITE=@{NONE,CELLS,ALL@}
        /FORMAT=@{TABLES,NOTABLES@}
                @{LABELS,NOLABELS,NOVALLABS@}
                @{PIVOT,NOPIVOT@}
                @{AVALUE,DVALUE@}
                @{NOINDEX,INDEX@}
                @{BOX,NOBOX@}
        /CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
                ASRESIDUAL,ALL,NONE@}
        /STATISTICS=@{CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
                     KAPPA,ETA,CORR,ALL,NONE@}

(Integer mode.)
        /VARIABLES=var_list (low,high)@dots{}
@end display

The @cmd{CROSSTABS} procedure displays crosstabulation
tables requested by the user.  It can calculate several statistics for
each cell in the crosstabulation tables.  In addition, a number of
statistics can be calculated for each table itself.

The TABLES subcommand is used to specify the tables to be reported.  Any
number of dimensions is permitted, and any number of variables per
dimension is allowed.  The TABLES subcommand may be repeated as many
times as needed.  This is the only required subcommand in @dfn{general
mode}.

Occasionally, one may want to invoke a special mode called @dfn{integer
mode}.  Normally, in general mode, PSPP automatically determines
what values occur in the data.  In integer mode, the user specifies the
range of values that the data assumes.  To invoke this mode, specify the
VARIABLES subcommand, giving a range of data values in parentheses for
each variable to be used on the TABLES subcommand.  Data values inside
the range are truncated to integers, then counted under the resulting
integer value.  If values occur outside this range, they are discarded.
When it is present, the VARIABLES subcommand must precede the TABLES
subcommand.

In general mode, numeric and string variables may be specified on
TABLES.  Although long string variables are allowed, only their
initial short-string parts are used.  In integer mode, only numeric
variables are allowed.

The MISSING subcommand determines the handling of user-missing values.
When set to TABLE, the default, missing values are dropped on a table by
table basis.  When set to INCLUDE, user-missing values are included in
tables and statistics.  When set to REPORT, which is allowed only in
integer mode, user-missing values are included in tables but marked with
an @samp{M} (for ``missing'') and excluded from statistical
calculations.

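As a sketch of integer mode, the following assumes hypothetical numeric
variables @code{sex}, coded 1 to 2, and @code{agegroup}, coded 1 to 4.
VARIABLES precedes TABLES, as required, and REPORT is permitted because
integer mode is in effect.

@example
* sex and agegroup are hypothetical variable names.
CROSSTABS
        /VARIABLES=sex (1,2) agegroup (1,4)
        /TABLES=sex BY agegroup
        /MISSING=REPORT.
@end example
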
Currently the WRITE subcommand is ignored.

The FORMAT subcommand controls the characteristics of the
crosstabulation tables to be displayed.  It has a number of possible
settings:

@itemize @bullet
@item
TABLES, the default, causes crosstabulation tables to be output.
NOTABLES suppresses them.

@item
LABELS, the default, allows variable labels and value labels to appear
in the output.  NOLABELS suppresses them.  NOVALLABS displays variable
labels but suppresses value labels.

@item
PIVOT, the default, causes each TABLES subcommand to be displayed in a
pivot table format.  NOPIVOT causes the old-style crosstabulation format
to be used.

@item
AVALUE, the default, causes values to be sorted in ascending order.
DVALUE asserts a descending sort order.

@item
INDEX/NOINDEX is currently ignored.

@item
BOX/NOBOX is currently ignored.
@end itemize

The CELLS subcommand controls the contents of each cell in the displayed
crosstabulation table.  The possible settings are:

@table @asis
@item COUNT
Frequency count.
@item ROW
Row percent.
@item COLUMN
Column percent.
@item TOTAL
Table percent.
@item EXPECTED
Expected value.
@item RESIDUAL
Residual.
@item SRESIDUAL
Standardized residual.
@item ASRESIDUAL
Adjusted standardized residual.
@item ALL
All of the above.
@item NONE
Suppress cells entirely.
@end table

@samp{/CELLS} without any settings specified requests COUNT, ROW,
COLUMN, and TOTAL.  If CELLS is not specified at all then only COUNT
will be selected.

The STATISTICS subcommand selects statistics for computation:

@table @asis
@item CHISQ
Pearson chi-square, likelihood ratio, Fisher's exact test, continuity
correction, linear-by-linear association.
@item PHI
Phi.
@item CC
Contingency coefficient.
@item LAMBDA
Lambda.
@item UC
Uncertainty coefficient.
@item BTAU
Tau-b.
@item CTAU
Tau-c.
@item RISK
Risk estimate.
@item GAMMA
Gamma.
@item D
Somers' D.
@item KAPPA
Cohen's Kappa.
@item ETA
Eta.
@item CORR
Spearman correlation, Pearson's r.
@item ALL
All of the above.
@item NONE
No statistics.
@end table

Selected statistics are only calculated when appropriate for the
statistic.  Certain statistics require tables of a particular size, and
some statistics are calculated only in integer mode.

@samp{/STATISTICS} without any settings selects CHISQ.  If the
STATISTICS subcommand is not given, no statistics are calculated.

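A general-mode request might look like the following sketch, in which
@code{sex} and @code{region} are hypothetical variables.  Each cell
would show the count together with row and column percentages, and the
chi-square family of tests and phi would be requested for the table.

@example
* sex and region are hypothetical variable names.
CROSSTABS
        /TABLES=sex BY region
        /CELLS=COUNT ROW COLUMN
        /STATISTICS=CHISQ PHI.
@end example
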
@strong{Please note:} Currently the implementation of CROSSTABS has the
following bugs:

@itemize @bullet
@item
Pearson's R (but not Spearman) is off a little.
@item
T values for Spearman's R and Pearson's R are wrong.
@item
Significance of symmetric and directional measures is not calculated.
@item
Asymmetric ASEs and T values for lambda are wrong.
@item
ASE of Goodman and Kruskal's tau is not calculated.
@item
ASE of symmetric Somers' D is wrong.
@item
Approximate T of uncertainty coefficient is wrong.
@end itemize

Fixes for any of these deficiencies would be welcomed.

@node T-TEST, ONEWAY, CROSSTABS, Statistics
@comment  node-name,  next,  previous,  up
@section T-TEST

@vindex T-TEST
@display
T-TEST
        /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
        /CRITERIA=CIN(confidence)


(One Sample mode.)
        TESTVAL=test_value
        /VARIABLES=var_list


(Independent Samples mode.)
        GROUPS=var(value1 [, value2])
        /VARIABLES=var_list


(Paired Samples mode.)
        PAIRS=var_list [WITH var_list [(PAIRED)] ]

@end display

The @cmd{T-TEST} procedure outputs tables used in testing hypotheses about
means.
It operates in one of three modes:
@itemize
@item One Sample mode.
@item Independent Samples mode.
@item Paired Samples mode.
@end itemize

@noindent
Each of these modes is described in more detail below.
There are two optional subcommands which are common to all modes.

The @cmd{/CRITERIA} subcommand tells PSPP the confidence level to use
for the confidence intervals in the tests.  The default value is 0.95.

The @cmd{MISSING} subcommand determines the handling of missing
values.
If INCLUDE is set, then user-missing values are included in the
calculations, but system-missing values are not.
If EXCLUDE is set, which is the default, user-missing
values are excluded as well as system-missing values.

If LISTWISE is set, then the entire case is excluded from analysis
whenever any variable specified in the @cmd{/VARIABLES}, @cmd{/PAIRS} or
@cmd{/GROUPS} subcommands contains a missing value.
If ANALYSIS is set, then missing values are excluded only in the analysis for
which they would be needed.  This is the default.

@menu
* One Sample Mode::             Testing against a hypothesised mean
* Independent Samples Mode::    Testing two independent groups for equal mean
* Paired Samples Mode::         Testing two interdependent groups for equal mean
@end menu

@node One Sample Mode, Independent Samples Mode, T-TEST, T-TEST
@subsection One Sample Mode

The @cmd{TESTVAL} subcommand invokes the One Sample mode.
This mode is used to test a population mean against a hypothesised
mean.
The value given to the @cmd{TESTVAL} subcommand is the value against
which you wish to test.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
tell PSPP which variables you wish to test.

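For instance, to test whether the mean of a hypothetical variable
@code{score} differs from 100, using a 99% confidence level, one might
write something like the following.  Placing the optional
@cmd{/CRITERIA} subcommand after the mode subcommands is an assumption;
the syntax summary above does not fix the order.

@example
* score is a hypothetical variable name.
T-TEST TESTVAL=100
        /VARIABLES=score
        /CRITERIA=CIN(0.99).
@end example
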
@node Independent Samples Mode, Paired Samples Mode, One Sample Mode, T-TEST
@comment  node-name,  next,  previous,  up
@subsection Independent Samples Mode

The @cmd{GROUPS} subcommand invokes Independent Samples mode or
`Groups' mode.
This mode is used to test whether two groups of values have the
same population mean.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
tell PSPP the dependent variables you wish to test.

The variable given in the @cmd{GROUPS} subcommand is the independent
variable which determines to which group the samples belong.
The values in parentheses are the specific values of the independent
variable for each group.
If the parentheses are omitted and no values are given, the default values
of 1.0 and 2.0 are assumed.

If the independent variable is numeric,
it is acceptable to specify only one value inside the parentheses.
If you do this, cases where the independent variable is
less than or equal to this value belong to the first group, and cases
where it is greater than this value belong to the second group.
When using this form of the @cmd{GROUPS} subcommand, missing values in
the independent variable are excluded on a listwise basis, regardless
of whether @cmd{/MISSING=LISTWISE} was specified.

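The two forms of @cmd{GROUPS} can be sketched as follows.  All variable
names are hypothetical, and @code{sex} and @code{age} are assumed to be
numeric.

@example
* Compare cases with sex = 1 against cases with sex = 2.
T-TEST GROUPS=sex(1,2)
        /VARIABLES=height weight.

* Compare cases with age <= 40 against cases with age > 40.
T-TEST GROUPS=age(40)
        /VARIABLES=income.
@end example
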
@node Paired Samples Mode,  , Independent Samples Mode, T-TEST
@comment  node-name,  next,  previous,  up
@subsection Paired Samples Mode

The @cmd{PAIRS} subcommand introduces Paired Samples mode.
Use this mode when repeated measures have been taken from the same
samples.
If the @code{WITH} keyword is omitted, then tables for all
combinations of variables given in the @cmd{PAIRS} subcommand are
generated.
If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword
is also given, then the number of variables preceding @code{WITH}
must be the same as the number following it.
In this case, tables for each respective pair of variables are
generated.
In the event that the @code{WITH} keyword is given, but the
@code{(PAIRED)} keyword is omitted, then tables for each combination
of a variable preceding @code{WITH} against a variable following
@code{WITH} are generated.

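The three forms can be sketched as follows, using hypothetical variable
names.  The table counts in the comments are inferred from the
description above, reading ``all combinations'' as all unordered pairs.

@example
* Two tables, pre1 with post1 and pre2 with post2.
T-TEST PAIRS=pre1 pre2 WITH post1 post2 (PAIRED).

* Four tables, each of pre1 and pre2 against each of post1 and post2.
T-TEST PAIRS=pre1 pre2 WITH post1 post2.

* Three tables, a with b, a with c, and b with c.
T-TEST PAIRS=a b c.
@end example
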
@node ONEWAY, , T-TEST, Statistics
@comment  node-name,  next,  previous,  up
@section ONEWAY

@vindex ONEWAY
@cindex analysis of variance
@cindex ANOVA

@display
ONEWAY
        [/VARIABLES = ] var_list BY var
        /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
        /CONTRASTS= value1 [, value2] @dots{} [,valueN]
        /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@}

@end display

The @cmd{ONEWAY} procedure performs a one-way analysis of variance of
variables factored by a single independent variable.
It is used to compare the means of a population
divided into more than two groups.

The variables to be analysed should be given in the @code{VARIABLES}
subcommand.
The list of variables must be followed by the @code{BY} keyword and
the name of the independent (or factor) variable.

You can use the @code{STATISTICS} subcommand to tell PSPP to display
ancillary information.  The options accepted are:
@table @asis
@item DESCRIPTIVES
Displays descriptive statistics about the groups factored by the independent
variable.
@item HOMOGENEITY
Displays the Levene test of Homogeneity of Variance for the
variables and their groups.
@end table

The @code{CONTRASTS} subcommand is used when you anticipate certain
differences between the groups.
The subcommand must be followed by a list of numbers which are the
coefficients of the groups to be tested.
The number of coefficients must correspond to the number of distinct
groups (or values of the independent variable).
If the coefficients do not sum to zero, then PSPP will
display a warning, but will proceed with the analysis.
The @code{CONTRASTS} subcommand may be given up to 10 times in order
to specify different contrast tests.
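
As an illustration, the following sketch assumes a hypothetical
dependent variable @code{yield} and a factor @code{fertilizer} that
takes exactly three distinct values, so each contrast needs three
coefficients.

@example
* yield and fertilizer are hypothetical variable names.
ONEWAY /VARIABLES=yield BY fertilizer
        /STATISTICS=DESCRIPTIVES HOMOGENEITY
        /CONTRASTS=-2, 1, 1
        /CONTRASTS=0, -1, 1.
@end example

The first contrast compares the first group with the other two taken
together, and the second compares the second group with the third.
Both sets of coefficients sum to zero (@minus{}2 + 1 + 1 and
0 @minus{} 1 + 1), so no warning would be expected.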
@setfilename ignored