@node Statistics
@chapter Statistics
-This chapter documents the statistical procedures that PSPP supports so
+This chapter documents the statistical procedures that @pspp{} supports so
far.
@menu
* CORRELATIONS:: Correlation tables.
* CROSSTABS:: Crosstabulation tables.
* FACTOR:: Factor analysis and Principal Components analysis
+* MEANS:: Average values and other statistics.
* NPAR TESTS:: Nonparametric tests.
* T-TEST:: Test hypotheses about means.
* ONEWAY:: One way analysis of variance.
but not currently honored.
@node EXAMINE
-@comment node-name, next, previous, up
@section EXAMINE
-@vindex EXAMINE
+@vindex EXAMINE
+@cindex Exploratory data analysis
@cindex Normality, testing for
@display
EXAMINE
- VARIABLES=var_list [BY factor_list ]
- /STATISTICS=@{DESCRIPTIVES, EXTREME[(n)], ALL, NONE@}
+ VARIABLES= @var{var1} [@var{var2}] @dots{} [@var{varN}]
+ [BY @var{factor1} [BY @var{subfactor1}]
+ [ @var{factor2} [BY @var{subfactor2}]]
+ @dots{}
+ [ @var{factor3} [BY @var{subfactor3}]]
+ ]
+ /STATISTICS=@{DESCRIPTIVES, EXTREME[(@var{n})], ALL, NONE@}
/PLOT=@{BOXPLOT, NPPLOT, HISTOGRAM, ALL, NONE@}
- /CINTERVAL n
+ /CINTERVAL @var{p}
/COMPARE=@{GROUPS,VARIABLES@}
- /ID=var_name
+ /ID=@var{identity_variable}
/@{TOTAL,NOTOTAL@}
- /PERCENTILE=[value_list]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @}
+ /PERCENTILE=[@var{percentiles}]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @}
/MISSING=@{LISTWISE, PAIRWISE@} [@{EXCLUDE, INCLUDE@}]
[@{NOREPORT,REPORT@}]
@end display
-The @cmd{EXAMINE} command is used to test how closely a distribution is to a
-normal distribution. It also shows you outliers and extreme values.
+The @cmd{EXAMINE} command is used to perform exploratory data analysis.
+In particular, it is useful for testing how closely a distribution follows a
+normal distribution, and for finding outliers and extreme values.
-The VARIABLES subcommand specifies the dependent variables and the
-independent variable to use as factors for the analysis. Variables
-listed before the first BY keyword are the dependent variables.
+The VARIABLES subcommand is mandatory.
+It specifies the dependent variables and optionally variables to use as
+factors for the analysis.
+Variables listed before the first BY keyword (if any) are the
+dependent variables.
The dependent variables may optionally be followed by a list of
-factors which tell PSPP how to break down the analysis for each
-dependent variable. The format for each factor is
+factors which tell @pspp{} how to break down the analysis for each
+dependent variable.
+
+Following the dependent variables, factors may be specified.
+The factors (if desired) should be preceded by a single BY keyword.
+The format for each factor is
@display
-var [BY var].
+@var{factorvar} [BY @var{subfactorvar}].
@end display
+Each unique combination of the values of @var{factorvar} and
+@var{subfactorvar} divides the dataset into @dfn{cells}.
+Statistics will be calculated for each cell
+and for the entire dataset (unless NOTOTAL is given).
-
-The STATISTICS subcommand specifies the analysis to be done.
+The STATISTICS subcommand specifies which statistics to show.
DESCRIPTIVES will produce a table showing some parametric and
-non-parametrics statistics. EXTREME produces a table showing extreme
-values of the dependent variable. A number in parentheses determines
-how many upper and lower extremes to show. The default number is 5.
-
+non-parametric statistics.
+EXTREME produces a table showing the extremities of each cell.
+A number in parentheses, @var{n}, determines
+how many upper and lower extremities to show.
+The default number is 5.
+
+The subcommands TOTAL and NOTOTAL are mutually exclusive.
+If TOTAL appears, then statistics will be produced for the entire dataset
+as well as for each cell.
+If NOTOTAL appears, then statistics will be produced only for the cells
+(unless no factor variables have been given).
+These subcommands have no effect if there have been no factor variables
+specified.
@cindex boxplot
@cindex histogram
@cindex npplot
The PLOT subcommand specifies which plots are to be produced if any.
Available plots are HISTOGRAM, NPPLOT and BOXPLOT.
+They can all be used to visualise how closely each cell conforms to a
+normal distribution.
+Boxplots will also show you the outliers and extreme values.
The COMPARE subcommand is only relevant if producing boxplots, and it is only
-useful there is more than one dependent variable and at least one factor. If
+useful if there is more than one dependent variable and at least one factor.
+If
/COMPARE=GROUPS is specified, then one plot per dependent variable is produced,
-containing boxplots for all the factors.
-If /COMPARE=VARIABLES is specified, then one plot per factor is produced, each
+each of which contains boxplots for all the cells.
+If /COMPARE=VARIABLES is specified, then one plot per cell is produced,
each containing one boxplot per dependent variable.
-If the /COMPARE subcommand is ommitted, then PSPP uses the default value of
-/COMPARE=GROUPS.
+If the /COMPARE subcommand is omitted, then @pspp{} behaves as if
+/COMPARE=GROUPS were given.
-The ID subcommand also pertains to boxplots. If given, it must
-specify a variable name. Outliers and extreme cases plotted in
-boxplots will be labelled with the case from that variable. Numeric or
-string variables are permissible. If the ID subcommand is not given,
-then the casenumber will be used for labelling.
+The ID subcommand is relevant only if /PLOT=BOXPLOT or
+/STATISTICS=EXTREME has been given.
+If given, it should provide the name of a variable which is to be used
+to label extreme values and outliers.
+Numeric or string variables are permissible.
+If the ID subcommand is not given, then the casenumber will be used for
+labelling.
The CINTERVAL subcommand specifies the confidence interval to use in
-calculation of the descriptives command. The default it 95%.
+calculation of the descriptives command. The default is 95%.
@cindex percentiles
The PERCENTILES subcommand specifies which percentiles are to be calculated,
produced in addition to the factored variables. If there are no
factors specified then TOTAL and NOTOTAL have no effect.
+
+The following example will generate descriptive statistics and histograms for
+two variables @var{score1} and @var{score2}.
+Two factors are given, @i{viz}: @var{gender} and @var{gender} BY @var{culture}.
+Therefore, the descriptives and histograms will be generated for each
+distinct value
+of @var{gender} @emph{and} for each distinct combination of the values
+of @var{gender} and @var{culture}.
+Since the NOTOTAL keyword is given, statistics and histograms for
+@var{score1} and @var{score2} covering the whole dataset are not produced.
+@example
+EXAMINE @var{score1} @var{score2} BY
+ @var{gender}
+ @var{gender} BY @var{culture}
+ /STATISTICS = DESCRIPTIVES
+ /PLOT = HISTOGRAM
+ /NOTOTAL.
+@end example
+
+Here is a second example showing how the @cmd{EXAMINE} command can be used to find extremities.
+@example
+EXAMINE @var{height} @var{weight} BY
+ @var{gender}
+ /STATISTICS = EXTREME (3)
+ /PLOT = BOXPLOT
+ /COMPARE = GROUPS
+ /ID = @var{name}.
+@end example
+In this example, we look at the height and weight of a sample of individuals and
+how they differ between male and female.
+A table showing the 3 largest and the 3 smallest values of @var{height} and
+@var{weight} for each gender, and for the whole dataset, will be shown.
+Boxplots will also be produced.
+Because /COMPARE = GROUPS was given, boxplots for male and female will be
+shown in the same graphic, allowing us to easily see the difference between
+the genders.
+Since the variable @var{name} was specified on the ID subcommand, this will be
+used to label the extreme values.
+
@strong{Warning!}
-If many dependent variable are given, or factors are given for which
+If many dependent variables are specified, or if factor variables are
+specified for which
there are many distinct values, then @cmd{EXAMINE} will produce a very
large quantity of output.
/MISSING=@{TABLE,INCLUDE,REPORT@}
/WRITE=@{NONE,CELLS,ALL@}
/FORMAT=@{TABLES,NOTABLES@}
- @{LABELS,NOLABELS,NOVALLABS@}
@{PIVOT,NOPIVOT@}
@{AVALUE,DVALUE@}
@{NOINDEX,INDEX@}
mode}.
Occasionally, one may want to invoke a special mode called @dfn{integer
-mode}. Normally, in general mode, PSPP automatically determines
+mode}. Normally, in general mode, @pspp{} automatically determines
what values occur in the data. In integer mode, the user specifies the
range of values that the data assumes. To invoke this mode, specify the
VARIABLES subcommand, giving a range of data values in parentheses for
TABLES, the default, causes crosstabulation tables to be output.
NOTABLES suppresses them.
-@item
-LABELS, the default, allows variable labels and value labels to appear
-in the output. NOLABELS suppresses them. NOVALLABS displays variable
-labels but suppresses value labels.
-
@item
PIVOT, the default, causes each TABLES subcommand to be displayed in a
pivot table format. NOPIVOT causes the old-style crosstabulation format
[ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, NOROTATE@}]
- [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [SIG] [ALL] [DEFAULT] ]
+ [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [SIG] [ALL] [DEFAULT] ]
[ /PLOT=[EIGEN] ]
The covariance matrix is printed.
@item DET
The determinant of the correlation or covariance matrix is printed.
+@item KMO
+ The Kaiser-Meyer-Olkin measure of sampling adequacy and the Bartlett test of sphericity are printed.
@item SIG
The significance of the elements of correlation matrix is printed.
@item ALL
Identical to INITIAL and EXTRACTION.
@end itemize
-If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualising
+If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualizing
which factors (components) should be retained.
The /FORMAT subcommand determined how data are to be displayed in loading matrices. If SORT is specified, then the variables
If PAIRWISE is set, then a case is considered missing only if either of the
values for the particular coefficient are missing.
The default is LISTWISE.
-
+
+@node MEANS
+@section MEANS
+
+@vindex MEANS
+@cindex means
+
+@display
+MEANS [TABLES =]
+ @{varlist@}
+ [ BY @{varlist@} [BY @{varlist@} [BY @{varlist@} @dots{} ]]]
+
+ [ /@{varlist@}
+ [ BY @{varlist@} [BY @{varlist@} [BY @{varlist@} @dots{} ]]] ]
+
+ [/CELLS = [MEAN] [COUNT] [STDDEV] [SEMEAN] [SUM] [MIN] [MAX] [RANGE]
+ [VARIANCE] [KURT] [SEKURT]
+ [SKEW] [SESKEW] [FIRST] [LAST]
+ [HARMONIC] [GEOMETRIC]
+ [DEFAULT]
+ [ALL]
+ [NONE] ]
+
+ [/MISSING = [TABLE] [INCLUDE] [DEPENDENT]]
+@end display
+
+You can use the @cmd{MEANS} command to calculate the arithmetic mean and similar
+statistics, either for the dataset as a whole or for categories of data.
+
+The simplest form of the command is
+@example
+MEANS @var{v}.
+@end example
+@noindent which calculates the mean, count and standard deviation for @var{v}.
+If you specify a grouping variable, for example
+@example
+MEANS @var{v} BY @var{g}.
+@end example
+@noindent then the means, counts and standard deviations for @var{v} after having
+been grouped by @var{g} will be calculated.
+Instead of the mean, count and standard deviation, you could specify the statistics
+in which you are interested:
+@example
+MEANS @var{x} @var{y} BY @var{g}
+ /CELLS = HARMONIC SUM MIN.
+@end example
+This example calculates the harmonic mean, the sum and the minimum values of @var{x} and @var{y}
+grouped by @var{g}.
+
+The CELLS subcommand specifies which statistics to calculate. The available statistics
+are:
+@itemize
+@item MEAN
+@cindex arithmetic mean
+ The arithmetic mean.
+@item COUNT
+ The count of the values.
+@item STDDEV
+ The standard deviation.
+@item SEMEAN
+ The standard error of the mean.
+@item SUM
+ The sum of the values.
+@item MIN
+ The minimum value.
+@item MAX
+ The maximum value.
+@item RANGE
+ The difference between the maximum and minimum values.
+@item VARIANCE
+ The variance.
+@item FIRST
+ The first value in the category.
+@item LAST
+ The last value in the category.
+@item SKEW
+ The skewness.
+@item SESKEW
+ The standard error of the skewness.
+@item KURT
+ The kurtosis.
+@item SEKURT
+ The standard error of the kurtosis.
+@item HARMONIC
+@cindex harmonic mean
+ The harmonic mean.
+@item GEOMETRIC
+@cindex geometric mean
+ The geometric mean.
+@end itemize
+
+In addition, three special keywords are recognized:
+@itemize
+@item DEFAULT
+ This is the same as MEAN COUNT STDDEV.
+@item ALL
+ All of the above statistics will be calculated.
+@item NONE
+ No statistics will be calculated (only a summary will be shown).
+@end itemize
+
+
+More than one @dfn{table} can be specified in a single command.
+Each table is separated by a @samp{/}. For
+example
+@example
+MEANS TABLES =
+ @var{c} @var{d} @var{e} BY @var{x}
+ /@var{a} @var{b} BY @var{x} @var{y}
+ /@var{f} BY @var{y} BY @var{z}.
+@end example
+has three tables (the @samp{TABLES =} is optional).
+The first table has three dependent variables @var{c}, @var{d} and @var{e}
+and a single categorical variable @var{x}.
+The second table has two dependent variables @var{a} and @var{b},
+and two categorical variables @var{x} and @var{y}.
+The third table has a single dependent variable @var{f}
+and a categorical variable formed by the combination of @var{y} and @var{z}.
+
+
+By default values are omitted from the analysis only if missing values
+(either system missing or user missing)
+for any of the variables directly involved in their calculation are
+encountered.
+This behaviour can be modified with the /MISSING subcommand.
+Three options are possible: TABLE, INCLUDE and DEPENDENT.
+
+/MISSING = TABLE causes cases to be dropped if any variable is missing
+in the table specification currently being processed, regardless of
+whether it is needed to calculate the statistic.
+
+/MISSING = INCLUDE says that user missing values, either in the dependent
+variables or in the categorical variables, should be taken at their face
+value, and not excluded.
+
+/MISSING = DEPENDENT says that user missing values in the dependent
+variables should be taken at their face value; however, cases which
+have user missing values for the categorical variables should be omitted
+from the calculation.
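+
+As a brief sketch, assuming hypothetical variables @var{v} (dependent) and
+@var{g} (categorical), the following syntax treats user missing values of
+@var{v} as valid data, while still omitting cases whose value of @var{g}
+is user missing:
+@example
+MEANS @var{v} BY @var{g}
+      /MISSING = DEPENDENT.
+@end example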
@node NPAR TESTS
@section NPAR TESTS
* COCHRAN:: Cochran Q Test
* FRIEDMAN:: Friedman Test
* KENDALL:: Kendall's W Test
+* KOLMOGOROV-SMIRNOV:: Kolmogorov Smirnov Test
* KRUSKAL-WALLIS:: Kruskal-Wallis Test
* MANN-WHITNEY:: Mann Whitney U Test
* MCNEMAR:: McNemar Test
+* MEDIAN:: Median Test
* RUNS:: Runs Test
* SIGN:: The Sign Test
* WILCOXON:: Wilcoxon Signed Ranks Test
That is to say, the test is always performed in the observed
direction.
-PSPP uses a very precise approximation to the gamma function to
+@pspp{} uses a very precise approximation to the gamma function to
compute the binomial significance. Thus, exact results are reported
even for very large sample sizes.
unity indicates complete agreement.
+@node KOLMOGOROV-SMIRNOV
+@subsection Kolmogorov-Smirnov Test
+@vindex KOLMOGOROV-SMIRNOV
+@vindex K-S
+@cindex Kolmogorov-Smirnov test
+
+@display
+ [ /KOLMOGOROV-SMIRNOV (@{NORMAL [@var{mu}, @var{sigma}], UNIFORM [@var{min}, @var{max}], POISSON [@var{lambda}], EXPONENTIAL [@var{scale}] @}) = varlist ]
+@end display
+
+The one sample Kolmogorov-Smirnov subcommand is used to test whether or not a dataset is
+drawn from a particular distribution. Four distributions are supported, @i{viz:}
+Normal, Uniform, Poisson and Exponential.
+
+Ideally you should provide the parameters of the distribution against which you wish to test
+the data. For example, with the normal distribution the mean (@var{mu}) and standard deviation (@var{sigma})
+should be given; with the uniform distribution, the minimum (@var{min}) and maximum (@var{max}) value should
+be provided.
+However, if the parameters are omitted they will be imputed from the data. Imputing the
+parameters reduces the power of the test so should be avoided if possible.
+
+In the following example, two variables @var{score} and @var{age} are tested to see if
+they follow a normal distribution with a mean of 3.5 and a standard deviation of 2.0.
+@example
+ NPAR TESTS
+ /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score} @var{age}.
+@end example
+If the variables need to be tested against different distributions, then a separate
+subcommand must be used. For example the following syntax tests @var{score} against
+a normal distribution with mean of 3.5 and standard deviation of 2.0 whilst @var{age}
+is tested against a normal distribution of mean 40 and standard deviation 1.5.
+@example
+ NPAR TESTS
+ /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score}
+ /KOLMOGOROV-SMIRNOV (normal 40 1.5) = @var{age}.
+@end example
+
+The abbreviated subcommand K-S may be used in place of KOLMOGOROV-SMIRNOV.
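+
+For instance, the following sketch uses the abbreviated form to test whether a
+variable follows a uniform distribution with a minimum of 0 and a maximum of 10.
+The variable name @var{waiting} and the parameter values are purely illustrative.
+@example
+  NPAR TESTS
+       /K-S (uniform 0 10) = @var{waiting}.
+@end example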
+
@node KRUSKAL-WALLIS
@subsection Kruskal-Wallis Test
@vindex KRUSKAL-WALLIS
than two distinct variables an error will occur and the test will
not be run.
+@node MEDIAN
+@subsection Median Test
+@vindex MEDIAN
+@cindex Median test
+
+@display
+ [ /MEDIAN [(value)] = varlist BY variable (value1, value2) ]
+@end display
+
+The median test is used to test whether independent samples come from
+populations with a common median.
+The median of the populations against which the samples are to be tested
+may be given in parentheses immediately after the
+/MEDIAN subcommand. If it is not given, the median will be imputed from the
+union of all the samples.
+
+The variables of the samples to be tested should immediately follow the @samp{=} sign. The
+keyword @code{BY} must come next, and then the grouping variable. Two values
+in parentheses should follow. If the first value is greater than the second,
+then a 2 sample test is performed using these two values to determine the groups.
+If, however, the first value is less than the second, then a @i{k} sample test is
+conducted and the group values used are all values encountered which lie in the
+range [@var{value1},@var{value2}].
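+
+As an illustration, the following sketch assumes a hypothetical dependent
+variable @var{score} and a hypothetical grouping variable @var{group}.
+No median is given in parentheses after /MEDIAN, so it will be imputed from the
+union of the samples.
+Because the first value in parentheses is less than the second, a @i{k} sample
+test is requested, using every value of @var{group} encountered in the range [1,3].
+@example
+  NPAR TESTS
+       /MEDIAN = @var{score} BY @var{group} (1, 3).
+@end example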
+
+
@node RUNS
@subsection Runs Test
@vindex RUNS
@cindex runs test
@display
- [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) varlist ]
+ [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) = varlist ]
@end display
The /RUNS subcommand tests whether a data sequence is randomly ordered.
Each of these modes are described in more detail below.
There are two optional subcommands which are common to all modes.
-The @cmd{/CRITERIA} subcommand tells PSPP the confidence interval used
+The @cmd{/CRITERIA} subcommand tells @pspp{} the confidence interval used
in the tests. The default value is 0.95.
@menu
-* One Sample Mode:: Testing against a hypothesised mean
+* One Sample Mode:: Testing against a hypothesized mean
* Independent Samples Mode:: Testing two independent groups for equal mean
* Paired Samples Mode:: Testing two interdependent groups for equal mean
@end menu
@subsection One Sample Mode
The @cmd{TESTVAL} subcommand invokes the One Sample mode.
-This mode is used to test a population mean against a hypothesised
+This mode is used to test a population mean against a hypothesized
mean.
The value given to the @cmd{TESTVAL} subcommand is the value against
which you wish to test.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
-tell PSPP which variables you wish to test.
+tell @pspp{} which variables you wish to test.
@node Independent Samples Mode
@comment node-name, next, previous, up
This mode is used to test whether two groups of values have the
same population mean.
In this mode, you must also use the @cmd{/VARIABLES} subcommand to
-tell PSPP the dependent variables you wish to test.
+tell @pspp{} the dependent variables you wish to test.
The variable given in the @cmd{GROUPS} subcommand is the independent
variable which determines to which group the samples belong.
The list of variables must be followed by the @code{BY} keyword and
the name of the independent (or factor) variable.
-You can use the @code{STATISTICS} subcommand to tell PSPP to display
+You can use the @code{STATISTICS} subcommand to tell @pspp{} to display
ancilliary information. The options accepted are:
@itemize
@item DESCRIPTIVES
coefficients of the groups to be tested.
The number of coefficients must correspond to the number of distinct
groups (or values of the independent variable).
-If the total sum of the coefficients are not zero, then PSPP will
+If the total sum of the coefficients are not zero, then @pspp{} will
display a warning, but will proceed with the analysis.
The @code{CONTRAST} subcommand may be given up to 10 times in order
to specify different contrast tests.
@end display
@cindex Cronbach's Alpha
-The @cmd{RELIABILTY} command performs reliablity analysis on the data.
+The @cmd{RELIABILITY} command performs reliability analysis on the data.
The VARIABLES subcommand is required. It determines the set of variables
upon which analysis is to be performed.
@section ROC
@vindex ROC
-@cindex Receiver Operating Characterstic
+@cindex Receiver Operating Characteristic
@cindex Area under curve
@display
Cases are excluded on a listwise basis; if any of the variables in @var{var_list}
or if the variable @var{state_var} is missing, then the entire case will be
excluded.
-
-