1 @node Statistics, Utilities, Conditionals and Looping, Top
4 This chapter documents the statistical procedures that PSPP supports so
8 * DESCRIPTIVES:: Descriptive statistics.
9 * FREQUENCIES:: Frequency tables.
10 * CROSSTABS:: Crosstabulation tables.
11 * T-TEST:: Test hypotheses about means.
12 * ONEWAY:: One-way analysis of variance.
15 @node DESCRIPTIVES, FREQUENCIES, Statistics, Statistics
22 /MISSING=@{VARIABLE,LISTWISE@} @{INCLUDE,NOINCLUDE@}
23 /FORMAT=@{LABELS,NOLABELS@} @{NOINDEX,INDEX@} @{LINE,SERIAL@}
25 /STATISTICS=@{ALL,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,
26 SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,DEFAULT,
27 SESKEWNESS,SEKURTOSIS@}
28 /SORT=@{NONE,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,SKEWNESS,
29 RANGE,MINIMUM,MAXIMUM,SUM,SESKEWNESS,SEKURTOSIS,NAME@}
33 The @cmd{DESCRIPTIVES} procedure reads the active file and outputs
35 statistics requested by the user. In addition, it can optionally
38 The VARIABLES subcommand, which is required, specifies the list of
39 variables to be analyzed. Keyword VARIABLES is optional.
41 All other subcommands are optional:
43 The MISSING subcommand determines the handling of missing values. If
44 INCLUDE is set, then user-missing values are included in the
45 calculations. If NOINCLUDE is set, which is the default, user-missing
46 values are excluded. If VARIABLE is set, then missing values are
47 excluded on a variable-by-variable basis; if LISTWISE is set, then
48 the entire case is excluded whenever any value in that case has a
49 system-missing or, unless INCLUDE is set, user-missing value.
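
As a brief sketch of how the two groups of settings combine (the
variable names @code{height} and @code{weight} are hypothetical), the
following command handles missing values separately for each variable
and keeps user-missing values in the calculations:

@example
DESCRIPTIVES
  /VARIABLES=height weight
  /MISSING=VARIABLE INCLUDE.
@end example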
51 The FORMAT subcommand affects the output format. Currently the
52 LABELS/NOLABELS and NOINDEX/INDEX settings are not used. When SERIAL is
53 set, both valid and missing number of cases are listed in the output;
54 when LINE is set, only valid cases are listed.
56 The SAVE subcommand causes @cmd{DESCRIPTIVES} to calculate Z scores for all
57 the specified variables. The Z scores are saved to new variables.
58 Variable names are generated by trying first the original variable name
59 with Z prepended and truncated to a maximum of 8 characters, then the
60 names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through
61 ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence. In addition, Z score
62 variable names can be specified explicitly on VARIABLES in the variable
63 list by enclosing them in parentheses after each variable.
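
For illustration, assuming hypothetical variables @code{height} and
@code{weight}, a command along the following lines would save Z scores
for both.  The scores for @code{height} go into the explicitly named
variable @code{zheight}, while the name for the @code{weight} scores is
generated by the rules above, the first candidate being @code{weight}
with Z prepended:

@example
DESCRIPTIVES
  /VARIABLES=height (zheight) weight
  /SAVE.
@end example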
65 The STATISTICS subcommand specifies the statistics to be displayed:
69 All of the statistics below.
73 Standard error of the mean.
79 Kurtosis and standard error of the kurtosis.
81 Skewness and standard error of the skewness.
91 Mean, standard deviation, minimum, maximum.
93 Standard error of the kurtosis.
95 Standard error of the skewness.
98 The SORT subcommand specifies how the statistics should be sorted. Most
99 of the possible values should be self-explanatory. NAME causes the
100 statistics to be sorted by name. By default, the statistics are listed
101 in the order that they are specified on the VARIABLES subcommand. The A
102 and D settings request an ascending or descending sort order,
105 @node FREQUENCIES, CROSSTABS, DESCRIPTIVES, Statistics
112 /FORMAT=@{TABLE,NOTABLE,LIMIT(limit)@}
113 @{STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]@}
115 @{AVALUE,DVALUE,AFREQ,DFREQ@}
118 /MISSING=@{EXCLUDE,INCLUDE@}
119 /STATISTICS=@{DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
120 KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
121 SESKEWNESS,SEKURTOSIS,ALL,NONE@}
123 /PERCENTILES=percent@dots{}
125 (These options are not currently implemented.)
132 /VARIABLES=var_list (low,high)@dots{}
135 The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
137 @cmd{FREQUENCIES} can also calculate and display descriptive statistics
138 (including median and mode) and percentiles.
140 In the future, @cmd{FREQUENCIES} will also support graphical output in the
141 form of bar charts and histograms. In addition, it will be able to
142 support percentiles for grouped data.
144 The VARIABLES subcommand is the only required subcommand. Specify the
145 variables to be analyzed. In most cases, this is all that is required.
146 This is known as @dfn{general mode}.
148 Occasionally, one may want to invoke a special mode called @dfn{integer
149 mode}. Normally, in general mode, PSPP will automatically determine
150 what values occur in the data. In integer mode, the user specifies the
151 range of values that the data assume. To invoke this mode, specify a
152 range of data values in parentheses, separated by a comma. Data values
153 inside the range are truncated to integers and counted under the
154 resulting value. If values occur outside this range, they are discarded.
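
As a minimal sketch of integer mode, assume a hypothetical numeric
variable @code{rating} whose values of interest run from 1 to 5:

@example
FREQUENCIES
  /VARIABLES=rating (1,5).
@end example

With this command, a value such as 3.7 would be counted under 3, and a
value such as 9 would be discarded.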
156 The FORMAT subcommand controls the output format. It has several
161 TABLE, the default, causes a frequency table to be output for every
162 variable specified. NOTABLE prevents them from being output. LIMIT
163 with a numeric argument causes them to be output except when there are
164 more than the specified number of values in the table.
167 STANDARD frequency tables contain more complete information, but also
168 take up more space on the printed page. CONDENSE frequency tables are
169 less informative but take up less space. ONEPAGE with a numeric
170 argument will output standard frequency tables if there are the
171 specified number of values or fewer, condensed tables otherwise. ONEPAGE
172 without an argument defaults to a threshold of 50 values.
175 LABELS causes value labels to be displayed in STANDARD frequency
176 tables. NOLABELS prevents this.
179 Normally frequency tables are sorted in ascending order by value. This
180 is AVALUE. DVALUE tables are sorted in descending order by value.
181 AFREQ and DFREQ tables are sorted in ascending and descending order,
182 respectively, by frequency count.
185 SINGLE spaced frequency tables are closely spaced. DOUBLE spaced
186 frequency tables have wider spacing.
189 OLDPAGE and NEWPAGE are not currently used.
192 The MISSING subcommand controls the handling of user-missing values.
193 When EXCLUDE, the default, is set, user-missing values are not included
194 in frequency tables or statistics. When INCLUDE is set, user-missing
195 values are included. System-missing values are never included in statistics,
196 but are listed in frequency tables.
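
The FORMAT and MISSING settings can be combined in a single command.
As an illustrative sketch (the variable name @code{region} is
hypothetical), the following requests condensed tables sorted by
descending frequency count, with user-missing values included:

@example
FREQUENCIES
  /VARIABLES=region
  /FORMAT=CONDENSE DFREQ
  /MISSING=INCLUDE.
@end example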
198 The available STATISTICS are the same as available in @cmd{DESCRIPTIVES}
199 (@pxref{DESCRIPTIVES}), with the addition of MEDIAN, the data's median
200 value, and MODE, the mode. (If there are multiple modes, the smallest
201 value is reported.) By default, the mean, standard deviation,
202 minimum, and maximum are reported for each variable.
204 PERCENTILES causes the specified percentiles to be reported.
205 The percentiles should be presented as a list of numbers between 0
207 The NTILES subcommand causes the percentiles to be reported at the
208 boundaries of the data set divided into the specified number of ranges.
209 For instance, @code{/NTILES=4} would cause quartiles to be reported.
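
A fuller sketch, assuming a hypothetical variable @code{income}, might
suppress the frequency table and report only the median, the mode, and
the quartiles:

@example
FREQUENCIES
  /VARIABLES=income
  /FORMAT=NOTABLE
  /STATISTICS=MEDIAN MODE
  /NTILES=4.
@end example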
212 @node CROSSTABS, T-TEST, FREQUENCIES, Statistics
218 /TABLES=var_list BY var_list [BY var_list]@dots{}
219 /MISSING=@{TABLE,INCLUDE,REPORT@}
220 /WRITE=@{NONE,CELLS,ALL@}
221 /FORMAT=@{TABLES,NOTABLES@}
222 @{LABELS,NOLABELS,NOVALLABS@}
227 /CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
228 ASRESIDUAL,ALL,NONE@}
229 /STATISTICS=@{CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
230 KAPPA,ETA,CORR,ALL,NONE@}
233 /VARIABLES=var_list (low,high)@dots{}
236 The @cmd{CROSSTABS} procedure displays crosstabulation
237 tables requested by the user. It can calculate several statistics for
238 each cell in the crosstabulation tables. In addition, a number of
239 statistics can be calculated for each table itself.
241 The TABLES subcommand is used to specify the tables to be reported. Any
242 number of dimensions is permitted, and any number of variables per
243 dimension is allowed. The TABLES subcommand may be repeated as many
244 times as needed. This is the only required subcommand in @dfn{general
247 Occasionally, one may want to invoke a special mode called @dfn{integer
248 mode}. Normally, in general mode, PSPP automatically determines
249 what values occur in the data. In integer mode, the user specifies the
250 range of values that the data assume. To invoke this mode, specify the
251 VARIABLES subcommand, giving a range of data values in parentheses for
252 each variable to be used on the TABLES subcommand. Data values inside
253 the range are truncated to integers and counted under the resulting
254 value. If values occur outside this range, they are discarded. When it
255 is present, the VARIABLES subcommand must precede the TABLES
258 In general mode, numeric and string variables may be specified on
259 TABLES. Although long string variables are allowed, only their
260 initial short-string parts are used. In integer mode, only numeric
261 variables are allowed.
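
As a sketch of integer mode, assume two hypothetical numeric variables,
@code{sex} coded 1 to 2 and @code{agegroup} coded 1 to 4.  Note that
VARIABLES comes before TABLES:

@example
CROSSTABS
  /VARIABLES=sex (1,2) agegroup (1,4)
  /TABLES=sex BY agegroup.
@end example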
263 The MISSING subcommand determines the handling of user-missing values.
264 When set to TABLE, the default, missing values are dropped on a table by
265 table basis. When set to INCLUDE, user-missing values are included in
266 tables and statistics. When set to REPORT, which is allowed only in
267 integer mode, user-missing values are included in tables but marked with
268 an @samp{M} (for ``missing'') and excluded from statistical
271 Currently the WRITE subcommand is ignored.
273 The FORMAT subcommand controls the characteristics of the
274 crosstabulation tables to be displayed. It has a number of possible
279 TABLES, the default, causes crosstabulation tables to be output.
280 NOTABLES suppresses them.
283 LABELS, the default, allows variable labels and value labels to appear
284 in the output. NOLABELS suppresses them. NOVALLABS displays variable
285 labels but suppresses value labels.
288 PIVOT, the default, causes each TABLES subcommand to be displayed in a
289 pivot table format. NOPIVOT causes the old-style crosstabulation format
293 AVALUE, the default, causes values to be sorted in ascending order.
294 DVALUE requests a descending sort order.
297 INDEX/NOINDEX is currently ignored.
300 BOX/NOBOX is currently ignored.
303 The CELLS subcommand controls the contents of each cell in the displayed
304 crosstabulation table. The possible settings are:
320 Standardized residual.
322 Adjusted standardized residual.
326 Suppress cells entirely.
329 @samp{/CELLS} without any settings specified requests COUNT, ROW,
330 COLUMN, and TOTAL. If CELLS is not specified at all then only COUNT
333 The STATISTICS subcommand selects statistics for computation:
337 Pearson chi-square, likelihood ratio, Fisher's exact test, continuity
338 correction, linear-by-linear association.
342 Contingency coefficient.
346 Uncertainty coefficient.
362 Spearman correlation, Pearson's r.
369 Selected statistics are only calculated when appropriate for the
370 statistic. Certain statistics require tables of a particular size, and
371 some statistics are calculated only in integer mode.
373 @samp{/STATISTICS} without any settings selects CHISQ. If the
374 STATISTICS subcommand is not given, no statistics are calculated.
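
To illustrate how CELLS and STATISTICS fit together, the following
sketch (again using the hypothetical variables @code{sex} and
@code{agegroup}) requests three per-cell statistics and two per-table
statistics by keyword:

@example
CROSSTABS
  /TABLES=sex BY agegroup
  /CELLS=COUNT ROW EXPECTED
  /STATISTICS=CHISQ PHI.
@end example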
376 @strong{Please note:} Currently the implementation of CROSSTABS has the
381 Pearson's R (but not Spearman's) is off a little.
383 T values for Spearman's R and Pearson's R are wrong.
385 Significance of symmetric and directional measures is not calculated.
387 Asymmetric ASEs and T values for lambda are wrong.
389 ASE of Goodman and Kruskal's tau is not calculated.
391 ASE of symmetric Somers' d is wrong.
393 Approximate T of uncertainty coefficient is wrong.
396 Fixes for any of these deficiencies would be welcomed.
398 @node T-TEST, ONEWAY, CROSSTABS, Statistics
399 @comment node-name, next, previous, up
405 /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
406 /CRITERIA=CIN(confidence)
414 (Independent Samples mode.)
415 GROUPS=var(value1 [, value2])
419 (Paired Samples mode.)
420 PAIRS=var_list [WITH var_list [(PAIRED)] ]
425 The @cmd{T-TEST} procedure outputs tables used in testing hypotheses about
427 It operates in one of three modes:
429 @item One Sample mode.
430 @item Independent Groups mode.
435 Each of these modes is described in more detail below.
436 There are two optional subcommands which are common to all modes.
438 The @cmd{/CRITERIA} subcommand tells PSPP the confidence level to use for
439 the confidence intervals in the tests. The default value is 0.95.
442 The @cmd{MISSING} subcommand determines the handling of missing
444 If INCLUDE is set, then user-missing values are included in the
445 calculations, but system-missing values are not.
446 If EXCLUDE is set, which is the default, user-missing
447 values are excluded as well as system-missing values.
450 If LISTWISE is set, then the entire case is excluded from analysis
451 whenever any variable specified in the @cmd{/VARIABLES}, @cmd{/PAIRS} or
452 @cmd{/GROUPS} subcommands contains a missing value.
453 If ANALYSIS is set, then missing values are excluded only in the analysis for
454 which they would be needed. This is the default.
458 * One Sample Mode:: Testing against a hypothesised mean
459 * Independent Samples Mode:: Testing two independent groups for equal mean
460 * Paired Samples Mode:: Testing two interdependent groups for equal mean
463 @node One Sample Mode, Independent Samples Mode, T-TEST, T-TEST
464 @subsection One Sample Mode
466 The @cmd{TESTVAL} subcommand invokes the One Sample mode.
467 This mode is used to test a population mean against a hypothesised
469 The value given to the @cmd{TESTVAL} subcommand is the value against
470 which you wish to test.
471 In this mode, you must also use the @cmd{/VARIABLES} subcommand to
472 tell PSPP which variables you wish to test.
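
For example, assuming a hypothetical variable @code{score} and a
hypothesised population mean of 100, a One Sample test might be
requested as follows:

@example
T-TEST
  /TESTVAL=100
  /VARIABLES=score.
@end example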
474 @node Independent Samples Mode, Paired Samples Mode, One Sample Mode, T-TEST
475 @comment node-name, next, previous, up
476 @subsection Independent Samples Mode
478 The @cmd{GROUPS} subcommand invokes Independent Samples mode or
480 This mode is used to test whether two groups of values have the
481 same population mean.
482 In this mode, you must also use the @cmd{/VARIABLES} subcommand to
483 tell PSPP the dependent variables you wish to test.
485 The variable given in the @cmd{GROUPS} subcommand is the independent
486 variable which determines to which group the samples belong.
487 The values in parentheses are the specific values of the independent
488 variable for each group.
489 If the parentheses are omitted and no values are given, the default values
490 of 1.0 and 2.0 are assumed.
492 If the independent variable is numeric,
493 it is acceptable to specify only one value inside the parentheses.
494 If you do this, cases where the independent variable is
495 less than or equal to this value belong to the first group, and cases
496 greater than this value belong to the second group.
497 When using this form of the @cmd{GROUPS} subcommand, missing values in
498 the independent variable are excluded on a listwise basis, regardless
499 of whether @cmd{/MISSING=LISTWISE} was specified.
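
The two forms of @cmd{GROUPS} can be sketched as follows.  The variable
names @code{treatment}, @code{age} and @code{score} are hypothetical.
The first command compares the cases where @code{treatment} is 1 with
those where it is 2.  The second splits the cases at a single value of
@code{age}, so that values of 40 or less form the first group and
values above 40 form the second:

@example
T-TEST
  /GROUPS=treatment(1,2)
  /VARIABLES=score.

T-TEST
  /GROUPS=age(40)
  /VARIABLES=score.
@end example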
502 @node Paired Samples Mode, , Independent Samples Mode, T-TEST
503 @comment node-name, next, previous, up
504 @subsection Paired Samples Mode
506 The @cmd{PAIRS} subcommand introduces Paired Samples mode.
507 Use this mode when repeated measures have been taken from the same
509 If the @code{WITH} keyword is omitted, then tables for all
510 combinations of variables given in the @cmd{PAIRS} subcommand are
512 If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword
513 is also given, then the number of variables preceding @code{WITH}
514 must be the same as the number following it.
515 In this case, tables for each respective pair of variables are
517 In the event that the @code{WITH} keyword is given, but the
518 @code{(PAIRED)} keyword is omitted, then tables for each combination
519 of a variable preceding @code{WITH} against a variable following
520 @code{WITH} are generated.
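
As an illustrative sketch, take four hypothetical variables
@code{pre1}, @code{pre2}, @code{post1} and @code{post2}.  With
@code{(PAIRED)}, the command below tests @code{pre1} against
@code{post1} and @code{pre2} against @code{post2}:

@example
T-TEST
  /PAIRS=pre1 pre2 WITH post1 post2 (PAIRED).
@end example

Dropping @code{(PAIRED)} from the same command would instead produce
four tables, one for each variable before @code{WITH} against each
variable after it.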
523 @node ONEWAY, , T-TEST, Statistics
524 @comment node-name, next, previous, up
528 @cindex analysis of variance
533 [/VARIABLES = ] var_list BY var
534 /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
535 /CONTRASTS= value1 [, value2] ... [,valueN]
536 /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@}
540 The @cmd{ONEWAY} procedure performs a one-way analysis of variance of
541 variables factored by a single independent variable.
542 It is used to compare the means of a population
543 divided into more than two groups.
545 The variables to be analysed should be given in the @code{VARIABLES}
547 The list of variables must be followed by the @code{BY} keyword and
548 the name of the independent (or factor) variable.
550 You can use the @code{STATISTICS} subcommand to tell PSPP to display
551 ancillary information. The options accepted are:
554 Displays descriptive statistics about the groups factored by the independent
557 Displays the Levene test of Homogeneity of Variance for the
558 variables and their groups.
561 The @code{CONTRASTS} subcommand is used when you anticipate certain
562 differences between the groups.
563 The subcommand must be followed by a list of numerals which are the
564 coefficients of the groups to be tested.
565 The number of coefficients must correspond to the number of distinct
566 groups (or values of the independent variable).
567 If the coefficients do not sum to zero, then PSPP will
568 display a warning, but will proceed with the analysis.
569 The @code{CONTRASTS} subcommand may be given up to 10 times in order
570 to specify different contrast tests.
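
The subcommands above can be put together as in the following sketch.
It assumes a hypothetical dependent variable @code{score} and a factor
variable @code{group} with exactly three distinct values, and it spells
the subcommands as they appear in the synopsis.  Each contrast lists
three coefficients, one per group, and each list sums to zero:

@example
ONEWAY
  /VARIABLES=score BY group
  /STATISTICS=DESCRIPTIVES HOMOGENEITY
  /CONTRASTS=1, -1, 0
  /CONTRASTS=1, 1, -2.
@end example

The first contrast compares the first group with the second.  The
second compares the first two groups taken together with the third.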