1 @node Statistics, Utilities, Conditionals and Looping, Top
4 This chapter documents the statistical procedures that PSPP supports so
7 @c If you add any new commands, then don't forget to remove the entry in
8 @c not-implemented.texi
11 * DESCRIPTIVES:: Descriptive statistics.
12 * FREQUENCIES:: Frequency tables.
13 * EXAMINE:: Testing data for normality.
14 * CROSSTABS:: Crosstabulation tables.
15 * T-TEST:: Test hypotheses about means.
16 * ONEWAY:: One way analysis of variance.
19 @node DESCRIPTIVES, FREQUENCIES, Statistics, Statistics
26 /MISSING=@{VARIABLE,LISTWISE@} @{INCLUDE,NOINCLUDE@}
27 /FORMAT=@{LABELS,NOLABELS@} @{NOINDEX,INDEX@} @{LINE,SERIAL@}
29 /STATISTICS=@{ALL,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,
30 SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,DEFAULT,
31 SESKEWNESS,SEKURTOSIS@}
32 /SORT=@{NONE,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,SKEWNESS,
33 RANGE,MINIMUM,MAXIMUM,SUM,SESKEWNESS,SEKURTOSIS,NAME@}
37 The @cmd{DESCRIPTIVES} procedure reads the active file and outputs
39 statistics requested by the user. In addition, it can optionally
42 The VARIABLES subcommand, which is required, specifies the list of
43 variables to be analyzed. Keyword VARIABLES is optional.
45 All other subcommands are optional:
47 The MISSING subcommand determines the handling of missing values. If
48 INCLUDE is set, then user-missing values are included in the
49 calculations. If NOINCLUDE is set, which is the default, user-missing
50 values are excluded. If VARIABLE is set, then missing values are
51 excluded on a variable by variable basis; if LISTWISE is set, then
52 the entire case is excluded whenever any value in that case has a
53 system-missing or, unless INCLUDE is set, user-missing value.
55 The FORMAT subcommand affects the output format. Currently the
56 LABELS/NOLABELS and NOINDEX/INDEX settings are not used. When SERIAL is
57 set, both valid and missing number of cases are listed in the output;
58 when LINE is set, only valid cases are listed.
60 The SAVE subcommand causes @cmd{DESCRIPTIVES} to calculate Z scores for all
61 the specified variables. The Z scores are saved to new variables.
62 Variable names are generated by trying first the original variable name
63 with Z prepended and truncated to a maximum of 8 characters, then the
64 names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through
65 ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence. In addition, Z score
66 variable names can be specified explicitly on VARIABLES in the variable
67 list by enclosing them in parentheses after each variable.
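
As a minimal sketch of the SAVE subcommand, assume a hypothetical active
file with numeric variables named @code{height} and @code{weight} (these
names are illustrative only).  The following requests Z scores with one
generated name and one explicit name:

@example
* Hypothetical variables: height, weight.
DESCRIPTIVES
  /VARIABLES=height weight (zwt)
  /SAVE.
@end example

Under the naming rules above, the Z scores for @code{height} would be
expected in a new variable named @code{Zheight}, provided that name is
not already in use, and those for @code{weight} in @code{zwt}.
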
69 The STATISTICS subcommand specifies the statistics to be displayed:
73 All of the statistics below.
77 Standard error of the mean.
83 Kurtosis and standard error of the kurtosis.
85 Skewness and standard error of the skewness.
95 Mean, standard deviation, minimum, maximum.
97 Standard error of the kurtosis.
99 Standard error of the skewness.
102 The SORT subcommand specifies how the statistics should be sorted. Most
103 of the possible values should be self-explanatory. NAME causes the
104 statistics to be sorted by name. By default, the statistics are listed
105 in the order that they are specified on the VARIABLES subcommand. The A
106 and D settings request an ascending or descending sort order, respectively.
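
For illustration, the sketch below sorts the output by descending mean.
The variable names are hypothetical, and writing the @code{D} setting in
parentheses after the sort key is an assumption about the exact syntax:

@example
* Hypothetical variables: height, weight, age.
DESCRIPTIVES
  /VARIABLES=height weight age
  /STATISTICS=MEAN STDDEV
  /SORT=MEAN (D).
@end example
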
109 @node FREQUENCIES, EXAMINE, DESCRIPTIVES, Statistics
116 /FORMAT=@{TABLE,NOTABLE,LIMIT(limit)@}
117 @{STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]@}
119 @{AVALUE,DVALUE,AFREQ,DFREQ@}
122 /MISSING=@{EXCLUDE,INCLUDE@}
123 /STATISTICS=@{DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
124 KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
125 SESKEWNESS,SEKURTOSIS,ALL,NONE@}
127 /PERCENTILES=percent@dots{}
129 (These options are not currently implemented.)
136 /VARIABLES=var_list (low,high)@dots{}
139 The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
141 @cmd{FREQUENCIES} can also calculate and display descriptive statistics
142 (including median and mode) and percentiles.
144 In the future, @cmd{FREQUENCIES} will also support graphical output in the
145 form of bar charts and histograms. In addition, it will be able to
146 support percentiles for grouped data.
148 The VARIABLES subcommand is the only required subcommand. Specify the
149 variables to be analyzed. In most cases, this is all that is required.
150 This is known as @dfn{general mode}.
152 Occasionally, one may want to invoke a special mode called @dfn{integer
153 mode}. Normally, in general mode, PSPP will automatically determine
154 what values occur in the data. In integer mode, the user specifies the
155 range of values that the data assumes. To invoke this mode, specify a
156 range of data values in parentheses, separated by a comma. Data values
157 inside the range are truncated to the nearest integer, then assigned to
158 that value. If values occur outside this range, they are discarded.
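
A minimal sketch of integer mode, assuming a hypothetical variable
@code{rating} whose values are expected to fall between 1 and 5:

@example
* Hypothetical variable: rating, expected to hold values from 1 to 5.
FREQUENCIES
  /VARIABLES=rating (1,5).
@end example

With this range, a value such as 3.7 would be counted under 3, and a
value such as 9 would be discarded.
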
160 The FORMAT subcommand controls the output format. It has several
165 TABLE, the default, causes a frequency table to be output for every
166 variable specified. NOTABLE prevents them from being output. LIMIT
167 with a numeric argument causes them to be output except when there are
168 more than the specified number of values in the table.
171 STANDARD frequency tables contain more complete information, but also
172 take up more space on the printed page. CONDENSE frequency tables are
173 less informative but take up less space. ONEPAGE with a numeric
174 argument will output standard frequency tables if there are the
175 specified number of values or fewer, condensed tables otherwise. ONEPAGE
176 without an argument defaults to a threshold of 50 values.
179 LABELS causes value labels to be displayed in STANDARD frequency
180 tables. NOLABELS prevents this.
183 Normally frequency tables are sorted in ascending order by value. This
184 is AVALUE. DVALUE tables are sorted in descending order by value.
185 AFREQ and DFREQ tables are sorted in ascending and descending order,
186 respectively, by frequency count.
189 SINGLE spaced frequency tables are closely spaced. DOUBLE spaced
190 frequency tables have wider spacing.
193 OLDPAGE and NEWPAGE are not currently used.
196 The MISSING subcommand controls the handling of user-missing values.
197 When EXCLUDE, the default, is set, user-missing values are not included
198 in frequency tables or statistics. When INCLUDE is set, user-missing
199 values are included. System-missing values are never included in statistics,
200 but are listed in frequency tables.
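
The FORMAT and MISSING settings above can be combined as in the following
sketch, where @code{region} is a hypothetical variable name:

@example
* Hypothetical variable: region.
FREQUENCIES
  /VARIABLES=region
  /FORMAT=DFREQ LIMIT(20)
  /MISSING=INCLUDE.
@end example

This would sort the table by descending frequency, suppress it if it had
more than 20 values, and count user-missing values.
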
202 The available STATISTICS are the same as available in @cmd{DESCRIPTIVES}
203 (@pxref{DESCRIPTIVES}), with the addition of MEDIAN, the data's median
204 value, and MODE, the mode. (If there are multiple modes, the smallest
205 value is reported.) By default, the mean, standard deviation,
206 minimum, and maximum are reported for each variable.
208 PERCENTILES causes the specified percentiles to be reported.
209 The percentiles should be presented as a list of numbers between 0 and 100.
211 The NTILES subcommand causes the percentiles to be reported at the
212 boundaries of the data set divided into the specified number of ranges.
213 For instance, @code{/NTILES=4} would cause quartiles to be reported.
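
As a sketch of how these subcommands might be used together, assuming a
hypothetical variable @code{income}, the following suppresses the
frequency table and requests the median, the 10th and 90th percentiles,
and quartiles:

@example
* Hypothetical variable: income.
FREQUENCIES
  /VARIABLES=income
  /FORMAT=NOTABLE
  /STATISTICS=MEDIAN
  /PERCENTILES=10 90
  /NTILES=4.
@end example
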
216 @node EXAMINE, CROSSTABS, FREQUENCIES, Statistics
217 @comment node-name, next, previous, up
221 @cindex Normality, testing for
225 VARIABLES=var_list [BY factor_list ]
226 /STATISTICS=@{DESCRIPTIVES, EXTREME[(n)], ALL, NONE@}
227 /PLOT=@{STEMLEAF, BOXPLOT, NPPLOT, SPREADLEVEL(n), HISTOGRAM,
230 /COMPARE=@{GROUPS,VARIABLES@}
231 /ID=@{case_number, var_name@}
233 /PERCENTILES=[value_list]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @}
234 /MISSING=@{LISTWISE, PAIRWISE@} [@{EXCLUDE, INCLUDE@}]
235 [@{NOREPORT,REPORT@}]
239 The @cmd{EXAMINE} command is used to test how closely a distribution follows a
240 normal distribution. It also shows you outliers and extreme values.
242 The VARIABLES subcommand specifies the dependent variables and the
243 independent variables to use as factors for the analysis. Variables
244 listed before the first BY keyword are the dependent variables.
245 The dependent variables may optionally be followed by a list of
246 factors which tell PSPP how to break down the analysis for each
247 dependent variable. The format for each factor is
253 The STATISTICS subcommand specifies the analysis to be done.
254 DESCRIPTIVES will produce a table showing some parametric and
255 non-parametric statistics. EXTREME produces a table showing extreme
256 values of the dependent variable. A number in parentheses determines
257 how many upper and lower extremes to show. The default number is 5.
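
A minimal sketch, assuming a hypothetical dependent variable
@code{height} and a hypothetical factor @code{sex}:

@example
* Hypothetical variables: height (dependent), sex (factor).
EXAMINE
  VARIABLES=height BY sex
  /STATISTICS=DESCRIPTIVES EXTREME(3).
@end example

This asks for the descriptives table and for the three highest and three
lowest values of @code{height}, broken down by @code{sex}.
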
260 The PLOT subcommand specifies which plots are to be produced if any.
262 The CINTERVAL subcommand specifies the confidence interval to use in
263 calculating the descriptive statistics. The default is 95%.
265 The PERCENTILES subcommand specifies which percentiles are to be calculated,
266 and which algorithm to use for calculating them. The default is to
267 calculate the 5, 10, 25, 50, 75, 90, 95 percentiles using the
270 The TOTAL and NOTOTAL subcommands are mutually exclusive. If TOTAL
271 is given and factors have been specified in the VARIABLES subcommand,
272 then statistics for the unfactored dependent variables are
273 produced in addition to the factored variables. If there are no
274 factors specified then TOTAL and NOTOTAL have no effect.
277 If many dependent variables are given, or factors are given for which
278 there are many distinct values, then @cmd{EXAMINE} will produce a very
279 large quantity of output.
282 @node CROSSTABS, T-TEST, EXAMINE, Statistics
288 /TABLES=var_list BY var_list [BY var_list]@dots{}
289 /MISSING=@{TABLE,INCLUDE,REPORT@}
290 /WRITE=@{NONE,CELLS,ALL@}
291 /FORMAT=@{TABLES,NOTABLES@}
292 @{LABELS,NOLABELS,NOVALLABS@}
297 /CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
298 ASRESIDUAL,ALL,NONE@}
299 /STATISTICS=@{CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
300 KAPPA,ETA,CORR,ALL,NONE@}
303 /VARIABLES=var_list (low,high)@dots{}
306 The @cmd{CROSSTABS} procedure displays crosstabulation
307 tables requested by the user. It can calculate several statistics for
308 each cell in the crosstabulation tables. In addition, a number of
309 statistics can be calculated for each table itself.
311 The TABLES subcommand is used to specify the tables to be reported. Any
312 number of dimensions is permitted, and any number of variables per
313 dimension is allowed. The TABLES subcommand may be repeated as many
314 times as needed. This is the only required subcommand in @dfn{general mode}.
317 Occasionally, one may want to invoke a special mode called @dfn{integer
318 mode}. Normally, in general mode, PSPP automatically determines
319 what values occur in the data. In integer mode, the user specifies the
320 range of values that the data assumes. To invoke this mode, specify the
321 VARIABLES subcommand, giving a range of data values in parentheses for
322 each variable to be used on the TABLES subcommand. Data values inside
323 the range are truncated to the nearest integer, then assigned to that
324 value. If values occur outside this range, they are discarded. When it
325 is present, the VARIABLES subcommand must precede the TABLES subcommand.
328 In general mode, numeric and string variables may be specified on
329 TABLES. Although long string variables are allowed, only their
330 initial short-string parts are used. In integer mode, only numeric
331 variables are allowed.
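
Integer mode can be sketched as follows, assuming hypothetical numeric
variables @code{sex} (coded 1 to 2) and @code{agegroup} (coded 1 to 4).
Note that VARIABLES comes before TABLES:

@example
* Hypothetical variables: sex coded 1 to 2, agegroup coded 1 to 4.
CROSSTABS
  /VARIABLES=sex (1,2) agegroup (1,4)
  /TABLES=sex BY agegroup.
@end example
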
333 The MISSING subcommand determines the handling of user-missing values.
334 When set to TABLE, the default, missing values are dropped on a table by
335 table basis. When set to INCLUDE, user-missing values are included in
336 tables and statistics. When set to REPORT, which is allowed only in
337 integer mode, user-missing values are included in tables but marked with
338 an @samp{M} (for ``missing'') and excluded from statistical
341 Currently the WRITE subcommand is ignored.
343 The FORMAT subcommand controls the characteristics of the
344 crosstabulation tables to be displayed. It has a number of possible
349 TABLES, the default, causes crosstabulation tables to be output.
350 NOTABLES suppresses them.
353 LABELS, the default, allows variable labels and value labels to appear
354 in the output. NOLABELS suppresses them. NOVALLABS displays variable
355 labels but suppresses value labels.
358 PIVOT, the default, causes each TABLES subcommand to be displayed in a
359 pivot table format. NOPIVOT causes the old-style crosstabulation format
363 AVALUE, the default, causes values to be sorted in ascending order.
364 DVALUE asserts a descending sort order.
367 INDEX/NOINDEX is currently ignored.
370 BOX/NOBOX is currently ignored.
373 The CELLS subcommand controls the contents of each cell in the displayed
374 crosstabulation table. The possible settings are:
390 Standardized residual.
392 Adjusted standardized residual.
396 Suppress cells entirely.
399 @samp{/CELLS} without any settings specified requests COUNT, ROW,
400 COLUMN, and TOTAL. If CELLS is not specified at all then only COUNT
403 The STATISTICS subcommand selects statistics for computation:
407 Pearson chi-square, likelihood ratio, Fisher's exact test, continuity
408 correction, linear-by-linear association.
412 Contingency coefficient.
416 Uncertainty coefficient.
432 Spearman correlation, Pearson's r.
439 Selected statistics are only calculated when appropriate for the
440 statistic. Certain statistics require tables of a particular size, and
441 some statistics are calculated only in integer mode.
443 @samp{/STATISTICS} without any settings selects CHISQ. If the
444 STATISTICS subcommand is not given, no statistics are calculated.
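
As an illustrative sketch in general mode, with hypothetical variables
@code{sex}, @code{agegroup} and @code{region}, the following requests a
three-dimensional table with counts, row and column percentages, and two
of the statistics listed above:

@example
* Hypothetical variables: sex, agegroup, region.
CROSSTABS
  /TABLES=sex BY agegroup BY region
  /CELLS=COUNT ROW COLUMN
  /STATISTICS=CHISQ PHI.
@end example
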
446 @strong{Please note:} Currently the implementation of CROSSTABS has the
451 Pearson's R (but not Spearman) is off a little.
453 T values for Spearman's R and Pearson's R are wrong.
455 Significance of symmetric and directional measures is not calculated.
457 Asymmetric ASEs and T values for lambda are wrong.
459 ASE of Goodman and Kruskal's tau is not calculated.
461 ASE of symmetric Somers' d is wrong.
463 Approximate T of uncertainty coefficient is wrong.
466 Fixes for any of these deficiencies would be welcomed.
468 @node T-TEST, ONEWAY, CROSSTABS, Statistics
469 @comment node-name, next, previous, up
475 /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
476 /CRITERIA=CIN(confidence)
484 (Independent Samples mode.)
485 GROUPS=var(value1 [, value2])
489 (Paired Samples mode.)
490 PAIRS=var_list [WITH var_list [(PAIRED)] ]
495 The @cmd{T-TEST} procedure outputs tables used in testing hypotheses about
497 It operates in one of three modes:
499 @item One Sample mode.
500 @item Independent Groups mode.
505 Each of these modes is described in more detail below.
506 There are two optional subcommands which are common to all modes.
508 The @cmd{/CRITERIA} subcommand tells PSPP the confidence interval used
509 in the tests. The default value is 0.95.
512 The @cmd{MISSING} subcommand determines the handling of missing
514 If INCLUDE is set, then user-missing values are included in the
515 calculations, but system-missing values are not.
516 If EXCLUDE is set, which is the default, user-missing
517 values are excluded as well as system-missing values.
520 If LISTWISE is set, then the entire case is excluded from analysis
521 whenever any variable specified in the @cmd{/VARIABLES}, @cmd{/PAIRS} or
522 @cmd{/GROUPS} subcommands contains a missing value.
523 If ANALYSIS is set, then missing values are excluded only in the analysis for
524 which they would be needed. This is the default.
528 * One Sample Mode:: Testing against a hypothesised mean
529 * Independent Samples Mode:: Testing two independent groups for equal mean
530 * Paired Samples Mode:: Testing two interdependent groups for equal mean
533 @node One Sample Mode, Independent Samples Mode, T-TEST, T-TEST
534 @subsection One Sample Mode
536 The @cmd{TESTVAL} subcommand invokes the One Sample mode.
537 This mode is used to test a population mean against a hypothesised
539 The value given to the @cmd{TESTVAL} subcommand is the value against
540 which you wish to test.
541 In this mode, you must also use the @cmd{/VARIABLES} subcommand to
542 tell PSPP which variables you wish to test.
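
A minimal sketch of One Sample mode, assuming a hypothetical variable
@code{iq} tested against a hypothesised mean of 100:

@example
* Hypothetical variable: iq, tested against a mean of 100.
T-TEST
  /TESTVAL=100
  /VARIABLES=iq.
@end example
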
544 @node Independent Samples Mode, Paired Samples Mode, One Sample Mode, T-TEST
545 @comment node-name, next, previous, up
546 @subsection Independent Samples Mode
548 The @cmd{GROUPS} subcommand invokes Independent Samples mode or
550 This mode is used to test whether two groups of values have the
551 same population mean.
552 In this mode, you must also use the @cmd{/VARIABLES} subcommand to
553 tell PSPP the dependent variables you wish to test.
555 The variable given in the @cmd{GROUPS} subcommand is the independent
556 variable which determines to which group the samples belong.
557 The values in parentheses are the specific values of the independent
558 variable for each group.
559 If the parentheses are omitted and no values are given, the default values
560 of 1.0 and 2.0 are assumed.
562 If the independent variable is numeric,
563 it is acceptable to specify only one value inside the parentheses.
564 If you do this, cases where the independent variable is
565 less than or equal to this value belong to the first group, and cases
566 greater than this value belong to the second group.
567 When using this form of the @cmd{GROUPS} subcommand, missing values in
568 the independent variable are excluded on a listwise basis, regardless
569 of whether @cmd{/MISSING=LISTWISE} was specified.
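
The two forms of @cmd{GROUPS} can be sketched as follows.  The variable
names @code{score}, @code{treatment} and @code{age} are hypothetical, and
the 0.99 confidence level is chosen only for illustration:

@example
* Hypothetical variables: score (dependent), treatment and age (grouping).
* Two explicit group values.
T-TEST
  /CRITERIA=CIN(0.99)
  /GROUPS=treatment(1,2)
  /VARIABLES=score.

* A single cut point: age <= 40 against age > 40.
T-TEST
  /GROUPS=age(40)
  /VARIABLES=score.
@end example
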
572 @node Paired Samples Mode, , Independent Samples Mode, T-TEST
573 @comment node-name, next, previous, up
574 @subsection Paired Samples Mode
576 The @cmd{PAIRS} subcommand introduces Paired Samples mode.
577 Use this mode when repeated measures have been taken from the same
579 If the @code{WITH} keyword is omitted, then tables for all
580 combinations of variables given in the @cmd{PAIRS} subcommand are
582 If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword
583 is also given, then the number of variables preceding @code{WITH}
584 must be the same as the number following it.
585 In this case, tables for each respective pair of variables are
587 In the event that the @code{WITH} keyword is given, but the
588 @code{(PAIRED)} keyword is omitted, then tables for each combination
589 of variable preceding @code{WITH} against variable following
590 @code{WITH} are generated.
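
The effect of @code{(PAIRED)} can be sketched with four hypothetical
variables, @code{pre1}, @code{pre2}, @code{post1} and @code{post2}:

@example
* Hypothetical variables: pre1, pre2, post1, post2.
* Respective pairs only: pre1 with post1, and pre2 with post2.
T-TEST
  /PAIRS=pre1 pre2 WITH post1 post2 (PAIRED).

* Every variable before WITH against every variable after it.
T-TEST
  /PAIRS=pre1 pre2 WITH post1 post2.
@end example

The first command would be expected to test two pairs, the second four.
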
593 @node ONEWAY, , T-TEST, Statistics
594 @comment node-name, next, previous, up
598 @cindex analysis of variance
603 [/VARIABLES = ] var_list BY var
604 /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
605 /CONTRASTS= value1 [, value2] ... [,valueN]
606 /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@}
610 The @cmd{ONEWAY} procedure performs a one-way analysis of variance of
611 variables factored by a single independent variable.
612 It is used to compare the means of a population
613 divided into more than two groups.
615 The variables to be analysed should be given in the @code{VARIABLES}
617 The list of variables must be followed by the @code{BY} keyword and
618 the name of the independent (or factor) variable.
620 You can use the @code{STATISTICS} subcommand to tell PSPP to display
621 ancillary information. The options accepted are:
624 Displays descriptive statistics about the groups factored by the independent
627 Displays the Levene test of Homogeneity of Variance for the
628 variables and their groups.
631 The @code{CONTRASTS} subcommand is used when you anticipate certain
632 differences between the groups.
633 The subcommand must be followed by a list of numerals which are the
634 coefficients of the groups to be tested.
635 The number of coefficients must correspond to the number of distinct
636 groups (or values of the independent variable).
637 If the sum of the coefficients is not zero, then PSPP will
638 display a warning, but will proceed with the analysis.
639 The @code{CONTRASTS} subcommand may be given up to 10 times in order
640 to specify different contrast tests.
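
As a sketch, assume a hypothetical dependent variable @code{score} and a
hypothetical factor @code{dose} that takes three distinct values.  The
subcommand name is spelled as in the synopsis above, and the coefficient
values are illustrative only:

@example
* Hypothetical variables: score (dependent), dose (factor with three groups).
ONEWAY
  /VARIABLES=score BY dose
  /STATISTICS=DESCRIPTIVES HOMOGENEITY
  /CONTRASTS=-1, 0, 1
  /CONTRASTS=-1, 2, -1.
@end example

Each list has three coefficients, one per group, and sums to zero, so
neither would be expected to trigger the warning described above.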