@c PSPP - a program for statistical analysis.
-@c Copyright (C) 2017 Free Software Foundation, Inc.
+@c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@end display
The @cmd{DESCRIPTIVES} procedure reads the active dataset and outputs
-descriptive
-statistics requested by the user. In addition, it can optionally
+linear descriptive statistics requested by the user. In addition, it can optionally
compute Z-scores.
The @subcmd{VARIABLES} subcommand, which is required, specifies the list of
The @subcmd{A} and @subcmd{D} settings request an ascending or descending
sort order, respectively.
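+
+As an illustrative sketch (the variable names @var{height} and
+@var{temperature} are assumed here, and the parenthesized direction follows
+the conventional form for this setting), the following syntax saves Z-scores
+and sorts the output rows by descending mean:
+
+@example
+DESCRIPTIVES
+        /VARIABLES = @var{height} @var{temperature}
+        /SAVE
+        /SORT = MEAN (D).
+@end example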
+@subsection Descriptives Example
+
+The @file{physiology.sav} file contains various physiological data for a sample
+of persons. Running the @cmd{DESCRIPTIVES} command on the variables @exvar{height}
+and @exvar{temperature} with the default options allows one to see simple linear
+statistics for these two variables. In @ref{descriptives:ex}, these variables
+are specified on the @subcmd{VARIABLES} subcommand and the @subcmd{SAVE} option
+has been used, to request that Z scores be calculated.
+
+After the command has completed, this example runs @cmd{DESCRIPTIVES} again, this
+time on the @exvar{zheight} and @exvar{ztemperature} variables,
+which are the two normalized (Z-score) variables generated by the
+first @cmd{DESCRIPTIVES} command.
+
+@float Example, descriptives:ex
+@psppsyntax {descriptives.sps}
+@caption {Running two @cmd{DESCRIPTIVES} commands, one with the @subcmd{SAVE} subcommand}
+@end float
+
+@float Screenshot, descriptives:scr
+@psppimage {descriptives}
+@caption {The Descriptives dialog box with two variables and Z-Scores option selected}
+@end float
+
+In @ref{descriptives:res}, we can see that there are 40 valid data for each of the variables
+and no missing values. The means of height and temperature are 16677.12
+and 37.02 respectively. The descriptive statistics for temperature seem reasonable.
+However there is a very high standard deviation for @exvar{height} and a suspiciously
+low minimum. This is due to a data entry error in the
+data (@pxref{Identifying incorrect data}).
+
+In the second Descriptive Statistics command, one can see that the mean and standard
+deviation of both Z score variables are 0 and 1 respectively. All Z score statistics
+should have these properties since they are normalized versions of the original scores.
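+
+For reference, the Z-score of a value @math{x} is conventionally defined as
+
+@display
+@math{z = (x - m) / s}
+@end display
+
+@noindent
+where @math{m} is the mean and @math{s} is the standard deviation of the
+variable. The symbols @math{m} and @math{s} are introduced here only for
+illustration. Subtracting @math{m} shifts the mean to 0, and dividing by
+@math{s} rescales the standard deviation to 1, provided that the same kind
+of standard deviation is used for normalizing and for reporting.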
+
+@float Result, descriptives:res
+@psppoutput {descriptives}
+@caption {Descriptive statistics including two normalized variables (Z-scores)}
+@end float
+
@node FREQUENCIES
@section FREQUENCIES
displayed slices to a given range of values.
The keyword @subcmd{NOMISSING} causes missing values to be omitted from the
piechart. This is the default.
-If instead, @subcmd{MISSING} is specified, then a single slice
-will be included representing all system missing and user-missing cases.
+If instead, @subcmd{MISSING} is specified, then the pie chart includes
+a single slice representing all system missing and user-missing cases.
@cindex bar chart
The @subcmd{BARCHART} subcommand produces a bar chart for each variable.
The @subcmd{MINIMUM} and @subcmd{MAXIMUM} keywords can be used to omit
categories whose counts which lie outside the specified limits.
The @subcmd{FREQ} option (default) causes the ordinate to display the frequency
-of each category, whereas the @subcmd{PERCENT} option will display relative
+of each category, whereas the @subcmd{PERCENT} option displays relative
percentages.
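+
+As a minimal sketch, assuming a variable named @var{occupation}, the
+following syntax draws a bar chart that omits categories with fewer than
+2 cases and labels the ordinate with percentages:
+
+@example
+FREQUENCIES
+        /VARIABLES = @var{occupation}
+        /BARCHART = MINIMUM (2) PERCENT.
+@end example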
The @subcmd{FREQ} and @subcmd{PERCENT} options on @subcmd{HISTOGRAM} and
The @subcmd{ORDER} subcommand is accepted but ignored.
+@subsection Frequencies Example
+
+@ref{frequencies:ex} runs a frequency analysis on the @exvar{sex}
+and @exvar{occupation} variables from the @file{personnel.sav} file.
+This is useful to get a general idea of the way in which these nominal
+variables are distributed.
+
+@float Example, frequencies:ex
+@psppsyntax {frequencies.sps}
+@caption {Running frequencies on the @exvar{sex} and @exvar{occupation} variables}
+@end float
+
+If you are using the graphical user interface, the dialog box is set up such that
+by default, several statistics are calculated. Some are not particularly useful
+for categorical variables, so you may want to disable those.
+
+@float Screenshot, frequencies:scr
+@psppimage {frequencies}
+@caption {The frequencies dialog box with the @exvar{sex} and @exvar{occupation} variables selected}
+@end float
+
+From @ref{frequencies:res} it is evident that there are 33 males, 21 females and
+2 persons whose sex has not been entered.
+
+One can also see how many of each occupation there are in the data.
+When dealing with string variables used as nominal values, running a frequency
+analysis is useful to detect data input errors. Notice that
+one @exvar{occupation} value has been mistyped as ``Scrientist''. This entry should
+be corrected, or marked as missing before using the data.
+
+@float Result, frequencies:res
+@psppoutput {frequencies}
+@caption {The relative frequencies of @exvar{sex} and @exvar{occupation}}
+@end float
+
@node EXAMINE
@section EXAMINE
@end display
Each unique combination of the values of @var{factorvar} and
@var{subfactorvar} divide the dataset into @dfn{cells}.
-Statistics will be calculated for each cell
+Statistics are calculated for each cell
and for the entire dataset (unless @subcmd{NOTOTAL} is given).
The @subcmd{STATISTICS} subcommand specifies which statistics to show.
-@subcmd{DESCRIPTIVES} will produce a table showing some parametric and
+@subcmd{DESCRIPTIVES} produces a table showing some parametric and
non-parametrics statistics.
@subcmd{EXTREME} produces a table showing the extremities of each cell.
A number in parentheses, @var{n} determines
The default number is 5.
The subcommands @subcmd{TOTAL} and @subcmd{NOTOTAL} are mutually exclusive.
-If @subcmd{TOTAL} appears, then statistics will be produced for the entire dataset
-as well as for each cell.
-If @subcmd{NOTOTAL} appears, then statistics will be produced only for the cells
+If @subcmd{TOTAL} appears, then statistics for the entire dataset
+as well as for each cell are produced.
+If @subcmd{NOTOTAL} appears, then statistics are produced only for the cells
(unless no factor variables have been given).
These subcommands have no effect if there have been no factor variables
specified.
@subcmd{SPREADLEVEL}.
The first three can be used to visualise how closely each cell conforms to a
normal distribution, whilst the spread vs.@: level plot can be useful to visualise
-how the variance of differs between factors.
-Boxplots will also show you the outliers and extreme values.
+how the variance differs between factors.
+Boxplots show you the outliers and extreme values.
@footnote{@subcmd{HISTOGRAM} uses Sturges' rule to determine the number of
bins, as approximately @math{1 + \log2(n)}, where @math{n} is the number of samples.
Note that @cmd{FREQUENCIES} uses a different algorithm to find the bin size.}
The @subcmd{SPREADLEVEL} plot displays the interquartile range versus the
median. It takes an optional parameter @var{t}, which specifies how the data
should be transformed prior to plotting.
-The given value @var{t} is a power to which the data is raised. For example, if
-@var{t} is given as 2, then the data will be squared.
+The given value @var{t} is a power to which the data are raised. For example, if
+@var{t} is given as 2, then the square of the data is used.
Zero, however is a special value. If @var{t} is 0 or
-is omitted, then data will be transformed by taking its natural logarithm instead of
+is omitted, then data are transformed by taking their natural logarithm instead of
raising to the power of @var{t}.
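+
+For instance, assuming a dependent variable @var{score1} and a factor
+@var{gender} (both names are hypothetical), the following sketch requests a
+spread vs.@: level plot of the squared data:
+
+@example
+EXAMINE @var{score1} BY @var{gender}
+        /PLOT = SPREADLEVEL (2).
+@end example
+
+@noindent
+Writing @code{SPREADLEVEL (0)}, or omitting the parenthesized value,
+would plot the natural logarithm of the data instead.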
@cindex Shapiro-Wilk
If given, it should provide the name of a variable which is to be used
to labels extreme values and outliers.
Numeric or string variables are permissible.
-If the @subcmd{ID} subcommand is not given, then the case number will be used for
+If the @subcmd{ID} subcommand is not given, then the case number is used for
labelling.
The @subcmd{CINTERVAL} subcommand specifies the confidence interval to use in
factors specified then @subcmd{TOTAL} and @subcmd{NOTOTAL} have no effect.
-The following example will generate descriptive statistics and histograms for
+The following example generates descriptive statistics and histograms for
two variables @var{score1} and @var{score2}.
Two factors are given, @i{viz}: @var{gender} and @var{gender} BY @var{culture}.
-Therefore, the descriptives and histograms will be generated for each
+Therefore, the descriptives and histograms are generated for each
distinct value
of @var{gender} @emph{and} for each distinct combination of the values
of @var{gender} and @var{race}.
@end example
In this example, we look at the height and weight of a sample of individuals and
how they differ between male and female.
-A table showing the 3 largest and the 3 smallest values of @var{height} and
-@var{weight} for each gender, and for the whole dataset will be shown.
-Boxplots will also be produced.
-Because @subcmd{/COMPARE = GROUPS} was given, boxplots for male and female will be
-shown in the same graphic, allowing us to easily see the difference between
+A table showing the 3 largest and the 3 smallest values of @exvar{height} and
+@exvar{weight} for each gender, and for the whole dataset, is shown.
+In addition, the @subcmd{/PLOT} subcommand requests boxplots.
+Because @subcmd{/COMPARE = GROUPS} was specified, boxplots for male and female are
+shown juxtaposed in the same graphic, allowing us to easily see the difference between
the genders.
-Since the variable @var{name} was specified on the @subcmd{ID} subcommand, this will be
-used to label the extreme values.
+Since the variable @var{name} was specified on the @subcmd{ID} subcommand,
+values of the @var{name} variable are used to label the extreme values.
@strong{Warning!}
-If many dependent variables are specified, or if factor variables are
-specified for which
-there are many distinct values, then @cmd{EXAMINE} will produce a very
+If you specify many dependent variables or factor variables
+for which there are many distinct values, then @cmd{EXAMINE} will produce a very
large quantity of output.
@node GRAPH
@end display
-The @cmd{GRAPH} produces graphical plots of data. Only one of the subcommands
-@subcmd{HISTOGRAM} or @subcmd{SCATTERPLOT} can be specified, i.e. only one plot
+The @cmd{GRAPH} command produces graphical plots of data. Only one of the subcommands
+@subcmd{HISTOGRAM}, @subcmd{BAR} or @subcmd{SCATTERPLOT} can be specified, @i{i.e.} only one plot
can be produced per call of @cmd{GRAPH}. The @subcmd{MISSING} is optional.
@menu
@cindex scatterplot
The subcommand @subcmd{SCATTERPLOT} produces an xy plot of the
-data. The different values of the optional third variable @var{var3}
-will result in different colours and/or markers for the plot. The
-following is an example for producing a scatterplot.
+data.
+@cmd{GRAPH} uses the third variable @var{var3}, if specified, to determine
+the colours and/or markers for the plot.
+The following is an example for producing a scatterplot.
@example
GRAPH
/SCATTERPLOT = @var{height} WITH @var{weight} BY @var{gender}.
@end example
-This example will produce a scatterplot where @var{height} is plotted versus @var{weight}. Depending
+This example produces a scatterplot where @var{height} is plotted versus @var{weight}. Depending
on the value of the @var{gender} variable, the colour of the datapoint is different. With
-this plot it is possible to analyze gender differences for @var{height} vs.@: @var{weight} relation.
+this plot it is possible to analyze gender differences in the relation of @var{height} versus @var{weight}.
@node HISTOGRAM
@subsection Histogram
The @cmd{CORRELATIONS} procedure produces tables of the Pearson correlation coefficient
for a set of variables. The significance of the coefficients are also given.
-At least one @subcmd{VARIABLES} subcommand is required. If the @subcmd{WITH}
-keyword is used, then a non-square correlation table will be produced.
-The variables preceding @subcmd{WITH}, will be used as the rows of the table,
-and the variables following will be the columns of the table.
-If no @subcmd{WITH} subcommand is given, then a square, symmetrical table using all variables is produced.
-
+At least one @subcmd{VARIABLES} subcommand is required. If you specify the @subcmd{WITH}
+keyword, then a non-square correlation table is produced.
+The variables preceding @subcmd{WITH} are used as the rows of the table,
+and the variables following @subcmd{WITH} are used as the columns of the table.
+If no @subcmd{WITH} subcommand is specified, then @cmd{CORRELATIONS} produces a
+square, symmetrical table using all variables.
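+
+As a brief sketch, assuming variables named @var{height}, @var{weight} and
+@var{age}, the following syntax produces a non-square table with two rows
+(@var{height} and @var{weight}) and one column (@var{age}):
+
+@example
+CORRELATIONS
+        /VARIABLES = @var{height} @var{weight} WITH @var{age}.
+@end example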
The @cmd{MISSING} subcommand determines the handling of missing variables.
If @subcmd{INCLUDE} is set, then user-missing values are included in the
The @subcmd{STATISTICS} subcommand requests additional statistics to be displayed. The keyword
@subcmd{DESCRIPTIVES} requests that the mean, number of non-missing cases, and the non-biased
estimator of the standard deviation are displayed.
-These statistics will be displayed in a separated table, for all the variables listed
+These statistics are displayed in a separate table, for all the variables listed
in any @subcmd{/VARIABLES} subcommand.
The @subcmd{XPROD} keyword requests cross-product deviations and covariance estimators to
be displayed for each pair of variables.
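+
+The following sketch, which assumes variables named @var{height} and
+@var{weight}, requests both kinds of additional statistics alongside the
+correlation table:
+
+@example
+CORRELATIONS
+        /VARIABLES = @var{height} @var{weight}
+        /STATISTICS = DESCRIPTIVES XPROD.
+@end example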
CROSSTABS
/TABLES=@var{var_list} BY @var{var_list} [BY @var{var_list}]@dots{}
/MISSING=@{TABLE,INCLUDE,REPORT@}
- /WRITE=@{NONE,CELLS,ALL@}
/FORMAT=@{TABLES,NOTABLES@}
- @{PIVOT,NOPIVOT@}
@{AVALUE,DVALUE@}
- @{NOINDEX,INDEX@}
- @{BOX,NOBOX@}
/CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
ASRESIDUAL,ALL,NONE@}
/COUNT=@{ASIS,CASE,CELL@}
integer mode, user-missing values are included in tables but marked with
a footnote and excluded from statistical calculations.
-Currently the @subcmd{WRITE} subcommand is ignored.
-
The @subcmd{FORMAT} subcommand controls the characteristics of the
crosstabulation tables to be displayed. It has a number of possible
settings:
@itemize @w{}
@item
@subcmd{TABLES}, the default, causes crosstabulation tables to be output.
-@subcmd{NOTABLES} suppresses them.
-
-@item
-@subcmd{PIVOT}, the default, causes each @subcmd{TABLES} subcommand to be displayed in a
-pivot table format. @subcmd{NOPIVOT} causes the old-style crosstabulation format
-to be used.
+@subcmd{NOTABLES}, which is equivalent to @code{CELLS=NONE}, suppresses them.
@item
@subcmd{AVALUE}, the default, causes values to be sorted in ascending order.
@subcmd{DVALUE} asserts a descending sort order.
-
-@item
-@subcmd{INDEX} and @subcmd{NOINDEX} are currently ignored.
-
-@item
-@subcmd{BOX} and @subcmd{NOBOX} is currently ignored.
@end itemize
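+
+As an illustrative sketch, assuming variables named @var{occupation} and
+@var{sex}, the following syntax displays the crosstabulation with values
+sorted in descending order:
+
+@example
+CROSSTABS
+        /TABLES = @var{occupation} BY @var{sex}
+        /FORMAT = DVALUE.
+@end example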
The @subcmd{CELLS} subcommand controls the contents of each cell in the displayed
@samp{/CELLS} without any settings specified requests @subcmd{COUNT}, @subcmd{ROW},
@subcmd{COLUMN}, and @subcmd{TOTAL}.
If @subcmd{CELLS} is not specified at all then only @subcmd{COUNT}
-will be selected.
+is selected.
By default, crosstabulation and statistics use raw case weights,
without rounding. Use the @subcmd{/COUNT} subcommand to perform
@table @asis
@item CHISQ
-@cindex chisquare
@cindex chi-square
Pearson chi-square, likelihood ratio, Fisher's exact test, continuity
The @samp{/BARCHART} subcommand produces a clustered bar chart for the first two
variables on each table.
If a table has more than two variables, the counts for the third and subsequent levels
-will be aggregated and the chart will be produces as if there were only two variables.
+are aggregated and the chart is produced as if there were only two variables.
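+
+A minimal sketch of requesting such a chart, again assuming variables named
+@var{occupation} and @var{sex}:
+
+@example
+CROSSTABS
+        /TABLES = @var{occupation} BY @var{sex}
+        /BARCHART.
+@end example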
@strong{Please note:} Currently the implementation of @cmd{CROSSTABS} has the
Fixes for any of these deficiencies would be welcomed.
+@subsection Crosstabs Example
+
+@cindex chi-square test of independence
+
+A researcher wishes to know if, in an industry, a person's sex is related to
+the person's occupation. To investigate this, she has determined that the
+@file{personnel.sav} is a representative, randomly selected sample of persons.
+The researcher's null hypothesis is that a person's sex has no relation to a
+person's occupation. She uses a chi-squared test of independence to investigate
+the hypothesis.
+
+@float Example, crosstabs:ex
+@psppsyntax {crosstabs.sps}
+@caption {Running crosstabs on the @exvar{sex} and @exvar{occupation} variables}
+@end float
+
+The syntax in @ref{crosstabs:ex} conducts a chi-squared test of independence.
+The line @code{/tables = occupation by sex} indicates that @exvar{occupation}
+and @exvar{sex} are the variables to be tabulated. To do this using the @gui{}
+you must place these variable names respectively in the @samp{Row} and
+@samp{Column} fields as shown in @ref{crosstabs:scr}.
+
+@float Screenshot, crosstabs:scr
+@psppimage {crosstabs}
+@caption {The Crosstabs dialog box with the @exvar{sex} and @exvar{occupation} variables selected}
+@end float
+
+Similarly, the @samp{Cells} button shows a dialog box to select the @code{count}
+and @code{expected} options. All other cell options can be deselected for this
+test.
+
+You would use the @samp{Format} and @samp{Statistics} buttons to select options
+for the @subcmd{FORMAT} and @subcmd{STATISTICS} subcommands. In this example,
+the @samp{Statistics} dialog requires only the @samp{Chisq} option to be checked. All
+other options should be unchecked. No special settings are required from the
+@samp{Format} dialog.
+
+As shown in @ref{crosstabs:res} @cmd{CROSSTABS} generates a contingency table
+containing the observed count and the expected count of each sex and each
+occupation. The expected count is the count which would be observed if the
+null hypothesis were true.
+
+The significance of the Pearson Chi-Square value is very much larger than the
+normally accepted value of 0.05 and so one cannot reject the null hypothesis.
+Thus the researcher must conclude that the data provide no evidence that a
+person's sex is related to the person's occupation.
+
+@float Results, crosstabs:res
+@psppoutput {crosstabs}
+@caption {The results of a test of independence between @exvar{sex} and @exvar{occupation}}
+@end float
+
+
@node FACTOR
@section FACTOR
If specified, @subcmd{MATRIX IN} must be followed by @samp{COV} or @samp{CORR},
then by @samp{=} and @var{file_spec} all in parentheses.
@var{file_spec} may either be an asterisk, which indicates the currently loaded
-dataset, or it may be a filename to be loaded. @xref{MATRIX DATA}, for the expected
+dataset, or it may be a file name to be loaded. @xref{MATRIX DATA}, for the expected
format of the file.
-The @subcmd{/EXTRACTION} subcommand is used to specify the way in which factors (components) are extracted from the data.
+The @subcmd{/EXTRACTION} subcommand is used to specify the way in which factors
+(components) are extracted from the data.
If @subcmd{PC} is specified, then Principal Components Analysis is used.
If @subcmd{PAF} is specified, then Principal Axis Factoring is
-used. By default Principal Components Analysis will be used.
+used. By default Principal Components Analysis is used.
-The @subcmd{/ROTATION} subcommand is used to specify the method by which the extracted solution will be rotated.
-Three orthogonal rotation methods are available:
+The @subcmd{/ROTATION} subcommand is used to specify the method by which the
+extracted solution is rotated. Three orthogonal rotation methods are available:
@subcmd{VARIMAX} (which is the default), @subcmd{EQUAMAX}, and @subcmd{QUARTIMAX}.
There is one oblique rotation method, @i{viz}: @subcmd{PROMAX}.
Optionally you may enter the power of the promax rotation @var{k}, which must be enclosed in parentheses.
The default value of @var{k} is 5.
-If you don't want any rotation to be performed, the word @subcmd{NOROTATE} will prevent the command from performing any
-rotation on the data.
+If you don't want any rotation to be performed, the word @subcmd{NOROTATE}
+prevents the command from performing any rotation on the data.
-The @subcmd{/METHOD} subcommand should be used to determine whether the covariance matrix or the correlation matrix of the data is
+The @subcmd{/METHOD} subcommand should be used to determine whether the
+covariance matrix or the correlation matrix of the data is
to be analysed. By default, the correlation matrix is analysed.
The @subcmd{/PRINT} subcommand may be used to select which features of the analysis are reported:
Identical to @subcmd{INITIAL} and @subcmd{EXTRACTION}.
@end itemize
-If @subcmd{/PLOT=EIGEN} is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualizing
+If @subcmd{/PLOT=EIGEN} is given, then a ``Scree'' plot of the eigenvalues is
+printed. This can be useful for visualizing the factors and deciding
which factors (components) should be retained.
-The @subcmd{/FORMAT} subcommand determined how data are to be displayed in loading matrices. If @subcmd{SORT} is specified, then the variables
-are sorted in descending order of significance. If @subcmd{BLANK(@var{n})} is specified, then coefficients whose absolute value is less
-than @var{n} will not be printed. If the keyword @subcmd{DEFAULT} is given, or if no @subcmd{/FORMAT} subcommand is given, then no sorting is
-performed, and all coefficients will be printed.
-
-The @subcmd{/CRITERIA} subcommand is used to specify how the number of extracted factors (components) are chosen.
-If @subcmd{FACTORS(@var{n})} is
-specified, where @var{n} is an integer, then @var{n} factors will be extracted. Otherwise, the @subcmd{MINEIGEN} setting will
-be used.
-@subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted.
-The default value of @var{l} is 1.
-The @subcmd{ECONVERGE} setting has effect only when iterative algorithms for factor
-extraction (such as Principal Axis Factoring) are used.
-@subcmd{ECONVERGE(@var{delta})} specifies that
-iteration should cease when
-the maximum absolute value of the communality estimate between one iteration and the previous is less than @var{delta}. The
-default value of @var{delta} is 0.001.
-The @subcmd{ITERATE(@var{m})} may appear any number of times and is used for two different purposes.
-It is used to set the maximum number of iterations (@var{m}) for convergence and also to set the maximum number of iterations
-for rotation.
-Whether it affects convergence or rotation depends upon which subcommand follows the @subcmd{ITERATE} subcommand.
+The @subcmd{/FORMAT} subcommand determines how data are to be
+displayed in loading matrices. If @subcmd{SORT} is specified, then
+the variables are sorted in descending order of significance. If
+@subcmd{BLANK(@var{n})} is specified, then coefficients whose absolute
+value is less than @var{n} are not printed. If the keyword
+@subcmd{DEFAULT} is specified, or if no @subcmd{/FORMAT} subcommand is
+specified, then no sorting is performed, and all coefficients are printed.
+
+You can use the @subcmd{/CRITERIA} subcommand to specify how the number of
+extracted factors (components) is chosen. If @subcmd{FACTORS(@var{n})} is
+specified, where @var{n} is an integer, then @var{n} factors are
+extracted. Otherwise, the @subcmd{MINEIGEN} setting is used.
+@subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues
+are greater than or equal to @var{l} are extracted. The default value
+of @var{l} is 1. The @subcmd{ECONVERGE} setting has effect only when
+using iterative algorithms for factor extraction (such as Principal Axis
+Factoring). @subcmd{ECONVERGE(@var{delta})} specifies that
+iteration should cease when the maximum absolute value of the
+communality estimate between one iteration and the previous is less
+than @var{delta}. The default value of @var{delta} is 0.001.
+
+The @subcmd{ITERATE(@var{m})} may appear any number of times and is
+used for two different purposes. It is used to set the maximum number
+of iterations (@var{m}) for convergence and also to set the maximum
+number of iterations for rotation.
+Whether it affects convergence or rotation depends upon which
+subcommand follows the @subcmd{ITERATE} subcommand.
If @subcmd{EXTRACTION} follows, it affects convergence.
If @subcmd{ROTATION} follows, it affects rotation.
-If neither @subcmd{ROTATION} nor @subcmd{EXTRACTION} follow a @subcmd{ITERATE} subcommand it will be ignored.
+If neither @subcmd{ROTATION} nor @subcmd{EXTRACTION} follows an
+@subcmd{ITERATE} subcommand, then the entire subcommand is ignored.
The default value of @var{m} is 25.
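+
+The settings described above can be combined as in the following sketch.
+The variable names @var{v1} to @var{v5} and all numeric values are
+assumptions chosen for illustration only. Because @subcmd{EXTRACTION}
+follows the @subcmd{ITERATE} setting, the limit of 50 iterations applies
+to convergence of the extraction, not to rotation.
+
+@example
+FACTOR VARIABLES = @var{v1} @var{v2} @var{v3} @var{v4} @var{v5}
+        /CRITERIA = FACTORS (2) ITERATE (50)
+        /EXTRACTION = PAF
+        /ROTATION = PROMAX (4)
+        /FORMAT = SORT BLANK (0.3)
+        /PLOT = EIGEN.
+@end example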
-The @cmd{MISSING} subcommand determines the handling of missing variables.
-If @subcmd{INCLUDE} is set, then user-missing values are included in the
-calculations, but system-missing values are not.
+The @cmd{MISSING} subcommand determines the handling of missing
+variables. If @subcmd{INCLUDE} is set, then user-missing values are
+included in the calculations, but system-missing values are not.
If @subcmd{EXCLUDE} is set, which is the default, user-missing
-values are excluded as well as system-missing values.
-This is the default.
-If @subcmd{LISTWISE} is set, then the entire case is excluded from analysis
-whenever any variable specified in the @cmd{VARIABLES} subcommand
-contains a missing value.
-If @subcmd{PAIRWISE} is set, then a case is considered missing only if either of the
-values for the particular coefficient are missing.
+values are excluded as well as system-missing values.
+If @subcmd{LISTWISE} is set, then the entire case is excluded
+from analysis whenever any variable specified in the @cmd{VARIABLES}
+subcommand contains a missing value.
+
+If @subcmd{PAIRWISE} is set, then a case is considered missing only if
+either of the values for the particular coefficient is missing.
The default is @subcmd{LISTWISE}.
@node GLM
appear before the @code{BY} keyword.
The @var{fixed_factors} list must be one or more categorical variables. Normally it
-will not make sense to enter a scalar variable in the @var{fixed_factors} and doing
+does not make sense to enter a scalar variable in the @var{fixed_factors} and doing
so may cause @pspp{} to do a lot of unnecessary processing.
The @subcmd{METHOD} subcommand is used to change the method for producing the sums of
MEANS @var{v} BY @var{g}.
@end example
@noindent then the means, counts and standard deviations for @var{v} after having
-been grouped by @var{g} will be calculated.
+been grouped by @var{g} are calculated.
Instead of the mean, count and standard deviation, you could specify the statistics
in which you are interested:
@example
@item @subcmd{DEFAULT}
This is the same as @subcmd{MEAN} @subcmd{COUNT} @subcmd{STDDEV}.
@item @subcmd{ALL}
- All of the above statistics will be calculated.
+ All of the above statistics are calculated.
@item @subcmd{NONE}
- No statistics will be calculated (only a summary will be shown).
+ No statistics are calculated (only a summary is shown).
@end itemize
have user missing values for the categorical variables should be omitted
from the calculation.
+@subsection Means Example
+
+The dataset in @file{repairs.sav} contains the mean time between failures (@exvar{mtbf})
+for a sample of artifacts produced by different factories and trialed under
+different operating conditions.
+Since there are four combinations of categorical variables, by simply looking
+at the list of data, it would be hard to see how the scores vary for each category.
+@ref{means:ex} shows one way of tabulating the @exvar{mtbf} in a way which is
+easier to understand.
+
+@float Example, means:ex
+@psppsyntax {means.sps}
+@caption {Running @cmd{MEANS} on the @exvar{mtbf} score with categories @exvar{factory} and @exvar{environment}}
+@end float
+
+The results are shown in @ref{means:res}. The figures shown indicate the mean,
+standard deviation and number of samples in each category.
+These figures however do not indicate whether the results are statistically
+significant. For that, you would need to use the procedures @cmd{ONEWAY}, @cmd{GLM} or
+@cmd{T-TEST} depending on the hypothesis being tested.
+
+@float Result, means:res
+@psppoutput {means}
+@caption {The @exvar{mtbf} categorised by @exvar{factory} and @exvar{environment}}
+@end float
+
+Note that there is no limit to the number of variables for which you can calculate
+statistics, nor to the number of categorical variables per layer, nor the number
+of layers.
+However, running @cmd{MEANS} on a large number of variables, or with categorical variables
+containing a large number of distinct values may result in an extremely large output, which
+will not be easy to interpret.
+So you should consider carefully which variables to select for participation in the analysis.
+
@node NPAR TESTS
@section NPAR TESTS
subcommand @subcmd{/METHOD=EXACT} is specified.
Exact tests give more accurate results, but may take an unacceptably long
time to perform. If the @subcmd{TIMER} keyword is used, it sets a maximum time,
-after which the test will be abandoned, and a warning message printed.
+after which the test is abandoned, and a warning message printed.
The time, in minutes, should be specified in parentheses after the @subcmd{TIMER} keyword.
If the @subcmd{TIMER} keyword is given without this figure, then a default value of 5 minutes
is used.
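+
+As a hedged sketch, assuming two paired variables named @var{before} and
+@var{after} and a test for which an exact method is available, the
+following syntax requests an exact test that is abandoned after 10 minutes:
+
+@example
+NPAR TESTS
+        /WILCOXON @var{before} WITH @var{after} (PAIRED)
+        /METHOD = EXACT TIMER (10).
+@end example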
@menu
* BINOMIAL:: Binomial Test
-* CHISQUARE:: Chisquare Test
+* CHISQUARE:: Chi-square Test
* COCHRAN:: Cochran Q Test
* FRIEDMAN:: Friedman Test
* KENDALL:: Kendall's W Test
than or equal to the threshold value form the first category. Values
greater than the threshold form the second category.
-If two values appear after the variable list, then they will be used
+If two values appear after the variable list, then they are used
as the values which a variable must take to be in the respective
category.
Cases for which a variable takes a value equal to neither of the specified
even for very large sample sizes.
-
@node CHISQUARE
-@subsection Chisquare Test
+@subsection Chi-square Test
@vindex CHISQUARE
-@cindex chisquare test
+@cindex chi-square test
@display
If no @subcmd{/EXPECTED} subcommand is given, then equal frequencies
are expected.
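+
+For example, assuming a variable named @var{sex} with two categories, the
+following sketch tests the observed counts against expected frequencies in
+the ratio 3 to 2 (the ratio is an arbitrary illustration):
+
+@example
+NPAR TESTS
+        /CHISQUARE = @var{sex}
+        /EXPECTED = 3 2.
+@end example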
+@subsubsection Chi-square Example
+
+A researcher wishes to investigate whether there are an equal number of
+persons of each sex in a population. The sample chosen for investigation
+is that from the @file{physiology.sav} dataset. The null hypothesis for
+the test is that the population comprises an equal number of males and females.
+The analysis is performed as shown in @ref{chisquare:ex}.
+
+@float Example, chisquare:ex
+@psppsyntax {chisquare.sps}
+@caption {Performing a chi-square test to check for equal distribution of sexes}
+@end float
+
+There is only one test variable, @i{viz:} @exvar{sex}. The other variables in the dataset
+are ignored.
+
+@float Screenshot, chisquare:scr
+@psppimage {chisquare}
+@caption {Performing a chi-square test using the graphical user interface}
+@end float
+
+In @ref{chisquare:res} the summary box shows that in the sample, there are more males
+than females. However the significance of the chi-square result is greater than 0.05
+(the most commonly used significance level) and therefore
+there is not enough evidence to reject the null hypothesis and one must conclude
+that the evidence does not indicate that there is an imbalance of the sexes
+in the population.
+
+@float Result, chisquare:res
+@psppoutput {chisquare}
+@caption {The results of running a chi-square test on @exvar{sex}}
+@end float
+
@node COCHRAN
@subsection Cochran Q Test
@end display
The Cochran Q test is used to test for differences between three or more groups.
-The data for @var{var_list} in all cases must assume exactly two distinct values (other than missing values).
+The data for @var{var_list} in all cases must assume exactly two
+distinct values (other than missing values).
-The value of Q will be displayed and its Asymptotic significance based on a chi-square distribution.
+The value of Q is displayed along with its asymptotic significance
+based on a chi-square distribution.
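+
+As a minimal sketch, the following syntax runs the test on three
+dichotomous variables.  The names @var{attempt1}, @var{attempt2} and
+@var{attempt3} are hypothetical and stand for any variables that take
+exactly two distinct values.
+@example
+NPAR TESTS
+     /COCHRAN = @var{attempt1} @var{attempt2} @var{attempt3}.
+@end example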
@node FRIEDMAN
@subsection Friedman Test
drawn from a particular distribution. Four distributions are supported, @i{viz:}
Normal, Uniform, Poisson and Exponential.
-Ideally you should provide the parameters of the distribution against which you wish to test
-the data. For example, with the normal distribution the mean (@var{mu})and standard deviation (@var{sigma})
-should be given; with the uniform distribution, the minimum (@var{min})and maximum (@var{max}) value should
-be provided.
-However, if the parameters are omitted they will be imputed from the data. Imputing the
-parameters reduces the power of the test so should be avoided if possible.
-
-In the following example, two variables @var{score} and @var{age} are tested to see if
-they follow a normal distribution with a mean of 3.5 and a standard deviation of 2.0.
+Ideally you should provide the parameters of the distribution against
+which you wish to test the data. For example, with the normal
+distribution the mean (@var{mu}) and standard deviation (@var{sigma})
+should be given; with the uniform distribution, the minimum
+(@var{min}) and maximum (@var{max}) value should be provided.
+However, if the parameters are omitted they are imputed from the
+data. Imputing the parameters reduces the power of the test so should
+be avoided if possible.
+
+In the following example, two variables @var{score} and @var{age} are
+tested to see if they follow a normal distribution with a mean of 3.5
+and a standard deviation of 2.0.
@example
NPAR TESTS
/KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score} @var{age}.
The data to be compared are specified by @var{var_list}.
The categorical variable determining the groups to which the
data belongs is given by @var{var}. The limits @var{lower} and
-@var{upper} specify the valid range of @var{var}. Any cases for
-which @var{var} falls outside [@var{lower}, @var{upper}] will be
-ignored.
+@var{upper} specify the valid range of @var{var}.
+If @var{upper} is smaller than @var{lower}, @pspp{} assumes their values
+to be reversed.  Any cases for which @var{var} falls outside
+[@var{lower}, @var{upper}] are ignored.
-The mean rank of each group as well as the chi-squared value and significance
-of the test will be printed.
-The abbreviated subcommand @subcmd{K-W} may be used in place of @subcmd{KRUSKAL-WALLIS}.
+The mean rank of each group as well as the chi-squared value and
+significance of the test are printed.
+The abbreviated subcommand @subcmd{K-W} may be used in place of
+@subcmd{KRUSKAL-WALLIS}.
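+
+As a minimal sketch, the following syntax compares @var{score} across
+the groups coded 1 to 3 in a grouping variable.  The names @var{score}
+and @var{region}, and the group codes, are hypothetical and serve only
+as an illustration.
+@example
+NPAR TESTS
+     /KRUSKAL-WALLIS = @var{score} BY @var{region} (1, 3).
+@end example
+Cases for which @var{region} lies outside the range 1 to 3 would not
+contribute to the test.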
@node MANN-WHITNEY
[ /MANN-WHITNEY = @var{var_list} BY var (@var{group1}, @var{group2}) ]
@end display
-The Mann-Whitney subcommand is used to test whether two groups of data come from different populations.
-The variables to be tested should be specified in @var{var_list} and the grouping variable, that determines to which group the test variables belong, in @var{var}.
+The Mann-Whitney subcommand is used to test whether two groups of data
+come from different populations. The variables to be tested should be
+specified in @var{var_list} and the grouping variable, that determines
+to which group the test variables belong, in @var{var}.
@var{Var} may be either a string or an alpha variable.
@var{Group1} and @var{group2} specify the
two values of @var{var} which determine the groups of the test data.
-Cases for which the @var{var} value is neither @var{group1} or @var{group2} will be ignored.
+Cases for which the @var{var} value is neither @var{group1} nor
+@var{group2} are ignored.
+
+The value of the Mann-Whitney U statistic, the Wilcoxon W, and the
+significance are printed.
+You may abbreviate the subcommand @subcmd{MANN-WHITNEY} to
+@subcmd{M-W}.
-The value of the Mann-Whitney U statistic, the Wilcoxon W, and the significance will be printed.
-The abbreviated subcommand @subcmd{M-W} may be used in place of @subcmd{MANN-WHITNEY}.
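+
+As a minimal sketch, the following syntax uses the abbreviated
+subcommand to compare @exvar{height} between the two groups of
+@exvar{sex}.  It assumes a dataset such as @file{physiology.sav} in
+which @exvar{sex} is coded 0 and 1; substitute the variables and group
+values of your own data.
+@example
+NPAR TESTS
+     /M-W = height BY sex (0, 1).
+@end example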
@node MCNEMAR
@subsection McNemar Test
populations with a common median.
The median of the populations against which the samples are to be tested
may be given in parentheses immediately after the
-@subcmd{/MEDIAN} subcommand. If it is not given, the median will be imputed from the
+@subcmd{/MEDIAN} subcommand. If it is not given, the median is imputed from the
union of all the samples.
The variables of the samples to be tested should immediately follow the @samp{=} sign. The
In this mode, you must also use the @subcmd{/VARIABLES} subcommand to
tell @pspp{} which variables you wish to test.
+@subsubsection Example - One Sample T-test
+
+A researcher wishes to know whether the weight of persons in a population
+is different from the national average.
+The samples are drawn from the population under investigation and recorded
+in the file @file{physiology.sav}.
+From the Department of Health, she
+knows that the national average weight of healthy adults is 76.8kg.
+Accordingly the @subcmd{TESTVAL} is set to 76.8.
+The null hypothesis therefore is that the mean average weight of the
+population from which the sample was drawn is 76.8kg.
+
+As previously noted (@pxref{Identifying incorrect data}), one
+sample in the dataset contains a weight value
+which is clearly incorrect. So this is excluded from the analysis
+using the @cmd{SELECT IF} command.
+
+@float Example, one-sample-t:ex
+@psppsyntax {one-sample-t.sps}
+@caption {Running a one sample T-Test after excluding all non-positive values}
+@end float
+
+@float Screenshot, one-sample-t:scr
+@psppimage {one-sample-t}
+@caption {Using the One Sample T-Test dialog box to test @exvar{weight} for a mean of 76.8kg}
+@end float
+
+
+@ref{one-sample-t:res} shows that the mean of our sample differs from the test value
+by -1.40kg.  However, the significance is very high (0.610), so one cannot
+reject the null hypothesis and must conclude there is not enough evidence
+to suggest that the mean weight of the persons in our population is different
+from 76.8kg.
+
+@float Result, one-sample-t:res
+@psppoutput {one-sample-t}
+@caption {The results of a one sample T-test of @exvar{weight} using a test value of 76.8kg}
+@end float
+
@node Independent Samples Mode
@subsection Independent Samples Mode
the independent variable are excluded on a listwise basis, regardless
of whether @subcmd{/MISSING=LISTWISE} was specified.
+@subsubsection Example - Independent Samples T-test
+
+A researcher wishes to know whether within a population, adult males
+are taller than adult females.
+The samples are drawn from the population under investigation and recorded
+in the file @file{physiology.sav}.
+
+As previously noted (@pxref{Identifying incorrect data}), one
+sample in the dataset contains a height value
+which is clearly incorrect. So this is excluded from the analysis
+using the @cmd{SELECT IF} command.
+
+
+@float Example, indepdendent-samples-t:ex
+@psppsyntax {independent-samples-t.sps}
+@caption {Running an independent samples T-Test after excluding all observations where @exvar{height} is less than 200}
+@end float
+
+
+The null hypothesis is that both males and females are on average
+of equal height.
+
+@float Screenshot, independent-samples-t:scr
+@psppimage {independent-samples-t}
+@caption {Using the Independent Samples T-test dialog to test for differences of @exvar{height} between values of @exvar{sex}}
+@end float
+
+
+In this case, the grouping variable is @exvar{sex}, so this is entered
+as the variable for the @subcmd{GROUPS} subcommand.  The group values are 0 (male) and
+1 (female).
+
+If you are running the procedure using syntax, then you need to enter
+the values corresponding to each group within parentheses.
+If you are using the graphical user interface, then you have to open
+the ``Define Groups'' dialog box and enter the values corresponding
+to each group as shown in @ref{define-groups-t:scr}. If, as in this case, the dataset has defined value
+labels for the group variable, then you can enter them by label
+or by value.
+
+@float Screenshot, define-groups-t:scr
+@psppimage {define-groups-t}
+@caption {Setting the values of the grouping variable for an Independent Samples T-test}
+@end float
+
+From @ref{independent-samples-t:res}, one can clearly see that the @emph{sample} mean height
+is greater for males than for females.  However, in order to see if this
+is a significant result, one must consult the T-Test table.
+
+The T-Test table contains two rows; one for use if the variance of the samples
+in each group may be safely assumed to be equal, and the second row
+if the variances in each group may not be safely assumed to be equal.
+
+In this case, however, both rows show a 2-tailed significance less than 0.001 and
+one must therefore reject the null hypothesis and conclude that within
+the population the mean heights of males and of females are unequal.
+
+@float Result, independent-samples-t:res
+@psppoutput {independent-samples-t}
+@caption {The results of an independent samples T-test of @exvar{height} by @exvar{sex}}
+@end float
@node Paired Samples Mode
@subsection Paired Samples Mode
to specify different contrast tests.
The @subcmd{MISSING} subcommand defines how missing values are handled.
If @subcmd{LISTWISE} is specified then cases which have missing values for
-the independent variable or any dependent variable will be ignored.
-If @subcmd{ANALYSIS} is specified, then cases will be ignored if the independent
+the independent variable or any dependent variable are ignored.
+If @subcmd{ANALYSIS} is specified, then cases are ignored if the independent
variable is missing or if the dependent variable currently being
analysed is missing. The default is @subcmd{ANALYSIS}.
A setting of @subcmd{EXCLUDE} means that variables whose values are
@end itemize
@noindent
-The optional syntax @code{ALPHA(@var{value})} is used to indicate
-that @var{value} should be used as the
-confidence level for which the posthoc tests will be performed.
-The default is 0.05.
+Use the optional syntax @code{ALPHA(@var{value})} to indicate that
+@cmd{ONEWAY} should perform the posthoc tests at a significance level of
+@var{value}.  If @code{ALPHA(@var{value})} is not specified, then the
+significance level used is 0.05.
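+
+As a minimal sketch, the following syntax requests Tukey posthoc
+tests at an alpha of 0.01 instead of the default.  The variable names
+@var{score} and @var{treatment} are hypothetical, and the choice of
+@subcmd{TUKEY} is only an illustration.
+@example
+ONEWAY
+     /VARIABLES = @var{score} BY @var{treatment}
+     /POSTHOC = TUKEY ALPHA (0.01).
+@end example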
@node QUICK CLUSTER
@section QUICK CLUSTER
If @subcmd{INITIAL} is set, then the initial cluster memberships will
be printed.
If @subcmd{CLUSTER} is set, the cluster memberships of the individual
-cases will be displayed (potentially generating lengthy output).
+cases are displayed (potentially generating lengthy output).
You can specify the subcommand @subcmd{SAVE} to ask that each case's cluster membership
and the euclidean distance between the case and its cluster center be saved to
The @subcmd{VARIABLES} subcommand is required. It determines the set of variables
upon which analysis is to be performed.
-The @subcmd{SCALE} subcommand determines which variables reliability is to be
-calculated for. If it is omitted, then analysis for all variables named
-in the @subcmd{VARIABLES} subcommand will be used.
+The @subcmd{SCALE} subcommand determines the variables for which
+reliability is to be calculated.  If @subcmd{SCALE} is omitted, then
+all variables named in the @subcmd{VARIABLES} subcommand are used in the analysis.
Optionally, the @var{name} parameter may be specified to set a string name
for the scale.
The default model is @subcmd{ALPHA}.
By default, any cases with user missing, or system missing values for
-any variables given
-in the @subcmd{VARIABLES} subcommand will be omitted from analysis.
-The @subcmd{MISSING} subcommand determines whether user missing values are to
-be included or excluded in the analysis.
+any variables given in the @subcmd{VARIABLES} subcommand are omitted
+from the analysis. The @subcmd{MISSING} subcommand determines whether
+user missing values are included or excluded in the analysis.
The @subcmd{SUMMARY} subcommand determines the type of summary analysis to be performed.
Currently there is only one type: @subcmd{SUMMARY=TOTAL}, which displays per-item
analysis tested against the totals.
+@subsection Example - Reliability
+
+Before analysing the results of a survey -- particularly for a multiple choice survey --
+it is desirable to know whether the respondents have considered their answers
+or simply provided random answers.
+
+In the following example the survey results from the file @file{hotel.sav} are used.
+All five survey questions are included in the reliability analysis.
+However, before running the analysis, the data must be preprocessed.
+An examination of the survey questions reveals that two questions, @i{viz:} v3 and v5,
+are negatively worded, whereas the others are positively worded.
+All questions must be based upon the same scale for the analysis to be meaningful.
+One could use the @cmd{RECODE} command (@pxref{RECODE}).  However, a simpler way is
+to use @cmd{COMPUTE} (@pxref{COMPUTE}), and this is what is done in @ref{reliability:ex}.
+
+@float Example, reliability:ex
+@psppsyntax {reliability.sps}
+@caption {Investigating the reliability of survey responses}
+@end float
+
+In this case, all variables in the data set are used. So we can use the special
+keyword @samp{ALL} (@pxref{BNF}).
+
+@float Screenshot, reliability:src
+@psppimage {reliability}
+@caption {Reliability dialog box with all variables selected}
+@end float
+
+@ref{reliability:res} shows that Cronbach's Alpha is 0.11, a value normally considered too
+low to indicate consistency within the data.  This is possibly due to the small number of
+survey questions.  The survey should be redesigned before its results are put to
+serious use.
+
+@float Result, reliability:res
+@psppoutput {reliability}
+@caption {The results of the reliability command on @file{hotel.sav}}
+@end float
@node ROC
If the keyword @subcmd{NONE} is given, then no @subcmd{ROC} curve is drawn.
By default, the curve is drawn with no reference line.
-The optional subcommand @subcmd{PRINT} determines which additional tables should be printed.
-Two additional tables are available.
-The @subcmd{SE} keyword says that standard error of the area under the curve should be printed as well as
-the area itself.
-In addition, a p-value under the null hypothesis that the area under the curve equals 0.5 will be
-printed.
-The @subcmd{COORDINATES} keyword says that a table of coordinates of the @subcmd{ROC} curve should be printed.
+The optional subcommand @subcmd{PRINT} determines which additional
+tables should be printed. Two additional tables are available. The
+@subcmd{SE} keyword says that the standard error of the area under the
+curve should be printed as well as the area itself. In addition, a
+p-value for the null hypothesis that the area under the curve equals
+0.5 is printed. The @subcmd{COORDINATES} keyword says that a
+table of coordinates of the @subcmd{ROC} curve should be printed.
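+
+As a minimal sketch, the following syntax requests both additional
+tables.  The variable names @var{marker} and @var{disease}, and the
+state value 1, are hypothetical and serve only as an illustration.
+@example
+ROC @var{marker} BY @var{disease} (1)
+     /PRINT = SE COORDINATES.
+@end example
+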
The @subcmd{CRITERIA} subcommand has four optional parameters:
@itemize @bullet
be included or excluded in the analysis. The default behaviour is to
exclude them.
Cases are excluded on a listwise basis; if any of the variables in @var{var_list}
-or if the variable @var{state_var} is missing, then the entire case will be
+or if the variable @var{state_var} is missing, then the entire case is
excluded.
@c LocalWords: subcmd subcommand