* DESCRIPTIVES:: Descriptive statistics.
* FREQUENCIES:: Frequency tables.
* EXAMINE:: Testing data for normality.
+* GRAPH:: Plot data.
* CORRELATIONS:: Correlation tables.
* CROSSTABS:: Crosstabulation tables.
* FACTOR:: Factor analysis and Principal Components analysis.
+* GLM:: Univariate Linear Models.
* LOGISTIC REGRESSION:: Bivariate Logistic Regression.
* MEANS:: Average values and other statistics.
* NPAR TESTS:: Nonparametric tests.
[@{FREQ[(@var{y_max})],PERCENT[(@var{y_max})]@}] [@{NONORMAL,NORMAL@}]
/PIECHART=[MINIMUM(@var{x_min})] [MAXIMUM(@var{x_max})]
[@{FREQ,PERCENT@}] [@{NOMISSING,MISSING@}]
+ /BARCHART=[MINIMUM(@var{x_min})] [MAXIMUM(@var{x_max})]
+ [@{FREQ,PERCENT@}]
+ /ORDER=@{ANALYSIS,VARIABLE@}
+
(These options are not currently implemented.)
- /BARCHART=@dots{}
/HBAR=@dots{}
/GROUPED=@dots{}
@end display
The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
variables.
@cmd{FREQUENCIES} can also calculate and display descriptive statistics
-(including median and mode) and percentiles,
-@cmd{FREQUENCIES} can also output
-histograms and pie charts.
+(including median and mode) and percentiles, and various graphical representations
+of the frequency distribution.
The @subcmd{VARIABLES} subcommand is the only required subcommand. Specify the
variables to be analyzed.
The @subcmd{HISTOGRAM} subcommand causes the output to include a histogram for
each specified numeric variable. The X axis by default ranges from
the minimum to the maximum value observed in the data, but the @subcmd{MINIMUM}
-and @subcmd{MAXIMUM} keywords can set an explicit range. Specify @subcmd{NORMAL} to
-superimpose a normal curve on the histogram. Histograms are not
-created for string variables.
+and @subcmd{MAXIMUM} keywords can set an explicit range.
+@footnote{The bin width is chosen according to the Freedman-Diaconis rule:
+@math{2 \times IQR(x)n^{-1/3}}, where @math{IQR(x)} is the interquartile range of @math{x}
+and @math{n} is the number of samples; the number of bins follows from
+dividing the range of the data by this width. Note that
+@cmd{EXAMINE} uses a different algorithm to determine bin sizes.}
+Histograms are not created for string variables.
+
+Specify @subcmd{NORMAL} to superimpose a normal curve on the
+histogram.
@cindex piechart
The @subcmd{PIECHART} subcommand adds a pie chart for each variable to the data. Each
slice represents one value, with the size of the slice proportional to
the value's frequency. By default, all non-missing values are given
-slices. The @subcmd{MINIMUM} and @subcmd{MAXIMUM} keywords can be used to limit the
-displayed slices to a given range of values. The @subcmd{MISSING} keyword adds
-slices for missing values.
-
-The @subcmd{FREQ} and @subcmd{PERCENT} options on @subcmd{HISTOGRAM} and @subcmd{PIECHART} are accepted
-but not currently honoured.
+slices.
+The @subcmd{MINIMUM} and @subcmd{MAXIMUM} keywords can be used to limit the
+displayed slices to a given range of values.
+The keyword @subcmd{NOMISSING} causes missing values to be omitted from the
+pie chart. This is the default.
+If, instead, @subcmd{MISSING} is specified, then a single slice
+will be included representing all system-missing and user-missing cases.
+
+@cindex bar chart
+The @subcmd{BARCHART} subcommand produces a bar chart for each variable.
+The @subcmd{MINIMUM} and @subcmd{MAXIMUM} keywords can be used to omit
+categories whose counts lie outside the specified limits.
+The @subcmd{FREQ} option (default) causes the ordinate to display the frequency
+of each category, whereas the @subcmd{PERCENT} option will display relative
+percentages.
+
+The @subcmd{FREQ} and @subcmd{PERCENT} options on @subcmd{HISTOGRAM} and
+@subcmd{PIECHART} are accepted but not currently honoured.
+
+The @subcmd{ORDER} subcommand is accepted but ignored.
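+
+As a minimal sketch of the charting subcommands described above, the
+following command requests a histogram with a superimposed normal curve,
+a pie chart which includes a slice for missing values, and a bar chart
+whose ordinate shows percentages. The variable name @var{age} is an
+assumption made purely for illustration.
+
+@example
+FREQUENCIES
+ /VARIABLES=@var{age}
+ /HISTOGRAM=NORMAL
+ /PIECHART=MISSING
+ /BARCHART=PERCENT.
+@end example
+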
@node EXAMINE
@section EXAMINE
dependent variable.
Following the dependent variables, factors may be specified.
-The factors (if desired) should be preceeded by a single @subcmd{BY} keyword.
+The factors (if desired) should be preceded by a single @subcmd{BY} keyword.
The format for each factor is
@display
@var{factorvar} [BY @var{subfactorvar}].
normal distribution, whilst the spread vs.@: level plot can be useful to visualise
how the variance of differs between factors.
Boxplots will also show you the outliers and extreme values.
+@footnote{@subcmd{HISTOGRAM} uses Sturges' rule to determine the number of
+bins, as approximately @math{1 + \log_2(n)}, where @math{n} is the number of samples.
+Note that @cmd{FREQUENCIES} uses a different algorithm to find the bin size.}
The @subcmd{SPREADLEVEL} plot displays the interquartile range versus the
median. It takes an optional parameter @var{t}, which specifies how the data
The @subcmd{ID} subcommand is relevant only if @subcmd{/PLOT=BOXPLOT} or
@subcmd{/STATISTICS=EXTREME} has been given.
-If given, it shoule provide the name of a variable which is to be used
+If given, it should provide the name of a variable which is to be used
to labels extreme values and outliers.
Numeric or string variables are permissible.
-If the @subcmd{ID} subcommand is not given, then the casenumber will be used for
+If the @subcmd{ID} subcommand is not given, then the case number will be used for
labelling.
The @subcmd{CINTERVAL} subcommand specifies the confidence interval to use in
there are many distinct values, then @cmd{EXAMINE} will produce a very
large quantity of output.
+@node GRAPH
+@section GRAPH
+
+@vindex GRAPH
+@cindex Exploratory data analysis
+@cindex normality, testing
+
+@display
+GRAPH
+ /HISTOGRAM [(NORMAL)]= @var{var}
+ /SCATTERPLOT [(BIVARIATE)] = @var{var1} WITH @var{var2} [BY @var{var3}]
+ /BAR = @{@var{summary-function}(@var{var1}) | @var{count-function}@} BY @var{var2} [BY @var{var3}]
+ [ /MISSING=@{LISTWISE, VARIABLE@} [@{EXCLUDE, INCLUDE@}] ]
+ [@{NOREPORT,REPORT@}]
+
+@end display
+
+The @cmd{GRAPH} command produces graphical plots of data. Only one of the subcommands
+@subcmd{HISTOGRAM}, @subcmd{SCATTERPLOT} or @subcmd{BAR} can be specified, i.e.@: only one plot
+can be produced per call of @cmd{GRAPH}. The @subcmd{MISSING} subcommand is optional.
+
+@menu
+* SCATTERPLOT:: Cartesian Plots
+* HISTOGRAM:: Histograms
+* BAR CHART:: Bar Charts
+@end menu
+
+@node SCATTERPLOT
+@subsection Scatterplot
+@cindex scatterplot
+
+The subcommand @subcmd{SCATTERPLOT} produces an xy plot of the
+data. The different values of the optional third variable @var{var3}
+will result in different colours and/or markers for the plot. The
+following is an example for producing a scatterplot.
+
+@example
+GRAPH
+ /SCATTERPLOT = @var{height} WITH @var{weight} BY @var{gender}.
+@end example
+
+This example will produce a scatterplot where @var{height} is plotted versus @var{weight}.
+The colour of each data point depends on the value of the @var{gender} variable, so
+the plot can be used to analyse gender differences in the @var{height} vs.@: @var{weight} relation.
+
+@node HISTOGRAM
+@subsection Histogram
+@cindex histogram
+
+The subcommand @subcmd{HISTOGRAM} produces a histogram. Only one variable is allowed for
+the histogram plot.
+The keyword @subcmd{NORMAL} may be specified in parentheses, to indicate that the ideal normal curve
+should be superimposed over the histogram.
+For an alternative method to produce histograms, see @ref{EXAMINE}. The
+following example produces a histogram plot for the variable @var{weight}.
+
+@example
+GRAPH
+ /HISTOGRAM = @var{weight}.
+@end example
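+
+Following the syntax shown at the start of this section, a sketch of the
+same plot with the normal curve superimposed might look like this
+(the variable name @var{weight} is, as before, only illustrative):
+
+@example
+GRAPH
+ /HISTOGRAM (NORMAL) = @var{weight}.
+@end example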
+
+@node BAR CHART
+@subsection Bar Chart
+@cindex bar chart
+
+The subcommand @subcmd{BAR} produces a bar chart.
+This subcommand requires that a @var{count-function} be specified (with no arguments) or a @var{summary-function} with a variable @var{var1} in parentheses.
+Following the summary or count function, the keyword @subcmd{BY} should be specified and then a categorical variable, @var{var2}.
+The values of the variable @var{var2} determine the labels of the bars to be plotted.
+Optionally a second categorical variable @var{var3} may be specified in which case a clustered (grouped) bar chart is produced.
+
+Valid count functions are
+@table @subcmd
+@item COUNT
+The weighted counts of the cases in each category.
+@item PCT
+The weighted counts of the cases in each category expressed as a percentage of the total weights of the cases.
+@item CUFREQ
+The cumulative weighted counts of the cases in each category.
+@item CUPCT
+The cumulative weighted counts of the cases in each category expressed as a percentage of the total weights of the cases.
+@end table
+
+The summary function is applied to @var{var1} across all cases in each category.
+The recognised summary functions are:
+@table @subcmd
+@item SUM
+The sum.
+@item MEAN
+The arithmetic mean.
+@item MAXIMUM
+The maximum value.
+@item MINIMUM
+The minimum value.
+@end table
+
+The following examples assume a dataset which contains the results of a survey.
+Each respondent has indicated their annual income, sex and city of residence.
+One could create a bar chart showing how the mean income varies between residents of different cities, thus:
+@example
+GRAPH /BAR = MEAN(@var{income}) BY @var{city}.
+@end example
+
+This can be extended to also indicate how income in each city differs between the sexes.
+@example
+GRAPH /BAR = MEAN(@var{income}) BY @var{city} BY @var{sex}.
+@end example
+
+One might also want to see how many respondents there are from each city. This can be achieved as follows:
+@example
+GRAPH /BAR = COUNT BY @var{city}.
+@end example
+
+Bar charts can also be produced using the @ref{FREQUENCIES} and @ref{CROSSTABS} commands.
+
@node CORRELATIONS
@section CORRELATIONS
@{BOX,NOBOX@}
/CELLS=@{COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
ASRESIDUAL,ALL,NONE@}
+ /COUNT=@{ASIS,CASE,CELL@}
+ @{ROUND,TRUNCATE@}
/STATISTICS=@{CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
KAPPA,ETA,CORR,ALL,NONE@}
+ /BARCHART
(Integer mode.)
/VARIABLES=@var{var_list} (@var{low},@var{high})@dots{}
crosstabulation tables to be displayed. It has a number of possible
settings:
-@itemize @asis
+@itemize @w{}
@item
@subcmd{TABLES}, the default, causes crosstabulation tables to be output.
@subcmd{NOTABLES} suppresses them.
If @subcmd{CELLS} is not specified at all then only @subcmd{COUNT}
will be selected.
+By default, crosstabulation and statistics use raw case weights,
+without rounding. Use the @subcmd{/COUNT} subcommand to perform
+rounding: CASE rounds the weights of individual cases as they are
+read, CELL rounds the weights of cells within each crosstabulation
+table after it has been constructed, and ASIS explicitly specifies the
+default non-rounding behavior. When rounding is requested, ROUND, the
+default, rounds to the nearest integer and TRUNCATE rounds toward
+zero.
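+
+To illustrate the two rounding modes with a hypothetical weight of 2.6:
+ROUND yields 3, whereas TRUNCATE yields 2. A sketch of a command which
+truncates the cell weights after each table has been constructed might
+look as follows; the variable names @var{city} and @var{sex} are
+assumptions made for the example.
+
+@example
+CROSSTABS
+ /TABLES=@var{city} BY @var{sex}
+ /COUNT=CELL TRUNCATE.
+@end example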
+
The @subcmd{STATISTICS} subcommand selects statistics for computation:
@table @asis
@samp{/STATISTICS} without any settings selects CHISQ. If the
@subcmd{STATISTICS} subcommand is not given, no statistics are calculated.
+@cindex bar chart
+The @samp{/BARCHART} subcommand produces a clustered bar chart for the first two
+variables on each table.
+If a table has more than two variables, the counts for the third and subsequent levels
+will be aggregated and the chart will be produced as if there were only two variables.
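+
+For instance, assuming hypothetical variables @var{city} and @var{sex},
+a command along the following lines should produce the crosstabulation
+together with a clustered bar chart of @var{city} by @var{sex}:
+
+@example
+CROSSTABS
+ /TABLES=@var{city} BY @var{sex}
+ /BARCHART.
+@end example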
+
+
@strong{Please note:} Currently the implementation of @cmd{CROSSTABS} has the
-following bugs:
+following limitations:
@itemize @bullet
@item
-Significance of symmetric and directional measures is not calculated.
-@item
-Asymptotic standard error is not calculated for asymmetric lambda.
+Significance of some symmetric and directional measures is not calculated.
@item
-ASE of Goodman and Kruskal's tau is not calculated.
+Asymptotic standard error is not calculated for
+Goodman and Kruskal's tau or symmetric Somers' d.
@item
-ASE of symmetric somers' d is wrong.
-@item
-Approximate T of uncertainty coefficient is wrong.
+Approximate T is not calculated for symmetric uncertainty coefficient.
@end itemize
Fixes for any of these deficiencies would be welcomed.
@cindex data reduction
@display
-FACTOR VARIABLES=@var{var_list}
+FACTOR @{
+ VARIABLES=@var{var_list},
+ MATRIX IN (@{CORR,COV@}=@{*,@var{file_spec}@})
+ @}
[ /METHOD = @{CORRELATION, COVARIANCE@} ]
+ [ /ANALYSIS=@var{var_list} ]
+
[ /EXTRACTION=@{PC, PAF@}]
- [ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, NOROTATE@}]
+ [ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, PROMAX[(@var{k})], NOROTATE@}]
[ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [SIG] [ALL] [DEFAULT] ]
The @cmd{FACTOR} command performs Factor Analysis or Principal Axis Factoring on a dataset. It may be used to find
common factors in the data or for data reduction purposes.
-The @subcmd{VARIABLES} subcommand is required. It lists the variables which are to partake in the analysis.
+The @subcmd{VARIABLES} subcommand is required (unless the @subcmd{MATRIX IN}
+subcommand is used).
+It lists the variables which are to partake in the analysis. (The @subcmd{ANALYSIS}
+subcommand may optionally further limit the variables that
+participate; it is useful primarily in conjunction with @subcmd{MATRIX IN}.)
+
+If @subcmd{MATRIX IN} instead of @subcmd{VARIABLES} is specified, then the analysis
+is performed on a pre-prepared correlation or covariance matrix file instead of on
+individual data cases. Typically the matrix file will have been generated by
+@cmd{MATRIX DATA} (@pxref{MATRIX DATA}) or provided by a third party.
+If specified, @subcmd{MATRIX IN} must be followed by @samp{COV} or @samp{CORR},
+then by @samp{=} and @var{file_spec} all in parentheses.
+@var{file_spec} may either be an asterisk, which indicates the currently loaded
+dataset, or it may be a filename to be loaded. @xref{MATRIX DATA}, for the expected
+format of the file.
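+
+A minimal sketch, following the syntax display above and assuming that
+the currently loaded dataset already holds a correlation matrix in the
+expected format, might be:
+
+@example
+FACTOR MATRIX IN (CORR=*)
+ /EXTRACTION=PC.
+@end example
+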
The @subcmd{/EXTRACTION} subcommand is used to specify the way in which factors (components) are extracted from the data.
If @subcmd{PC} is specified, then Principal Components Analysis is used.
used. By default Principal Components Analysis will be used.
The @subcmd{/ROTATION} subcommand is used to specify the method by which the extracted solution will be rotated.
-Three methods are available: @subcmd{VARIMAX} (which is the default), @subcmd{EQUAMAX}, and @subcmd{QUARTIMAX}.
-If don't want any rotation to be performed, the word @subcmd{NOROTATE} will prevent the command from performing any
-rotation on the data. Oblique rotations are not supported.
+Three orthogonal rotation methods are available:
+@subcmd{VARIMAX} (which is the default), @subcmd{EQUAMAX}, and @subcmd{QUARTIMAX}.
+There is one oblique rotation method, @i{viz}: @subcmd{PROMAX}.
+Optionally you may enter the power of the promax rotation @var{k}, which must be enclosed in parentheses.
+The default value of @var{k} is 5.
+If you don't want any rotation to be performed, the word @subcmd{NOROTATE} will prevent the command from performing any
+rotation on the data.
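+
+As a sketch, an oblique rotation with a promax power of 3 instead of the
+default could be requested as follows. The variable names @var{v1},
+@var{v2} and @var{v3}, and the choice of 3 for @var{k}, are assumptions
+made only for illustration.
+
+@example
+FACTOR VARIABLES=@var{v1} @var{v2} @var{v3}
+ /ROTATION=PROMAX(3).
+@end example
+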
The @subcmd{/METHOD} subcommand should be used to determine whether the covariance matrix or the correlation matrix of the data is
to be analysed. By default, the correlation matrix is analysed.
values for the particular coefficient are missing.
The default is @subcmd{LISTWISE}.
+@node GLM
+@section GLM
+
+@vindex GLM
+@cindex univariate analysis of variance
+@cindex fixed effects
+@cindex factorial anova
+@cindex analysis of variance
+@cindex ANOVA
+
+
+@display
+GLM @var{dependent_vars} BY @var{fixed_factors}
+ [/METHOD = SSTYPE(@var{type})]
+ [/DESIGN = @var{interaction_0} [@var{interaction_1} [... @var{interaction_n}]]]
+ [/INTERCEPT = @{INCLUDE|EXCLUDE@}]
+ [/MISSING = @{INCLUDE|EXCLUDE@}]
+@end display
+
+The @cmd{GLM} procedure can be used for fixed effects factorial ANOVA.
+
+The @var{dependent_vars} are the variables to be analysed.
+You may analyse several variables in the same command in which case they should all
+appear before the @code{BY} keyword.
+
+The @var{fixed_factors} list must be one or more categorical variables. Normally it
+will not make sense to enter a scalar variable in the @var{fixed_factors} and doing
+so may cause @pspp{} to do a lot of unnecessary processing.
+
+The @subcmd{METHOD} subcommand is used to change the method for producing the sums of
+squares. Available values of @var{type} are 1, 2 and 3. The default is type 3.
+
+You may specify a custom design using the @subcmd{DESIGN} subcommand.
+The design comprises a list of interactions where each interaction is a
+list of variables separated by a @samp{*}. For example the command
+@display
+GLM subject BY sex age_group race
+ /DESIGN = age_group sex race age_group*sex age_group*race
+@end display
+@noindent specifies the model @math{subject = age_group + sex + race + age_group*sex + age_group*race}.
+If no @subcmd{DESIGN} subcommand is specified, then the default is all possible combinations
+of the fixed factors. That is to say
+@display
+GLM subject BY sex age_group race
+@end display
+implies the model
+@math{subject = age_group + sex + race + age_group*sex + age_group*race + sex*race + age_group*sex*race}.
+
+
+The @subcmd{MISSING} subcommand determines the handling of missing
+values.
+If @subcmd{INCLUDE} is set then, for the purposes of GLM analysis,
+only system-missing values are considered
+to be missing; user-missing values are not regarded as missing.
+If @subcmd{EXCLUDE} is set, which is the default, then user-missing
+values are considered to be missing as well as system-missing values.
+A case for which any dependent variable or any factor
+variable has a missing value is excluded from the analysis.
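+
+Putting these subcommands together, a sketch of a main-effects-only
+analysis using type 2 sums of squares, in which user-missing values are
+not regarded as missing, might read as follows. The variable names are
+the illustrative ones used earlier in this section.
+
+@example
+GLM subject BY sex age_group
+ /METHOD = SSTYPE(2)
+ /DESIGN = sex age_group
+ /MISSING = INCLUDE.
+@end example
+
+@noindent Here the model is @math{subject = sex + age_group}, with no interaction term.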
+
@node LOGISTIC REGRESSION
@section LOGISTIC REGRESSION
The @subcmd{/EXPECTED} subcommand specifies the expected values of each
category.
There must be exactly one non-zero expected value, for each observed
-category, or the @subcmd{EQUAL} keywork must be specified.
+category, or the @subcmd{EQUAL} keyword must be specified.
You may use the notation @subcmd{@var{n}*@var{f}} to specify @var{n}
consecutive expected categories all taking a frequency of @var{f}.
The frequencies given are proportions, not absolute frequencies. The
The @subcmd{/WILCOXON} subcommand tests for differences between medians of the
variables listed.
The test does not make any assumptions about the variances of the samples.
-It does however assume that the distribution is symetrical.
+It does however assume that the distribution is symmetrical.
If the @subcmd{WITH} keyword is omitted, then tests for all
combinations of the listed variables are performed.
@display
T-TEST
/MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
- /CRITERIA=CIN(@var{confidence})
+ /CRITERIA=CI(@var{confidence})
(One Sample mode.)
the name of the independent (or factor) variable.
You can use the @subcmd{STATISTICS} subcommand to tell @pspp{} to display
-ancilliary information. The options accepted are:
+ancillary information. The options accepted are:
@itemize
@item DESCRIPTIVES
Displays descriptive statistics about the groups factored by the independent
@display
QUICK CLUSTER @var{var_list}
- [/CRITERIA=CLUSTERS(@var{k}) [MXITER(@var{max_iter})]]
+ [/CRITERIA=CLUSTERS(@var{k}) [MXITER(@var{max_iter})] [CONVERGE(@var{epsilon})] [NOINITIAL]]
[/MISSING=@{EXCLUDE,INCLUDE@} @{LISTWISE, PAIRWISE@}]
+ [/PRINT=@{INITIAL@} @{CLUSTER@}]
@end display
The @cmd{QUICK CLUSTER} command performs k-means clustering on the
The minimum specification is @samp{QUICK CLUSTER} followed by the names
of the variables which contain the cluster data. Normally you will also
want to specify @subcmd{/CRITERIA=CLUSTERS(@var{k})} where @var{k} is the
-number of clusters. If this is not given, then @var{k} defaults to 2.
+number of clusters. If this is not specified, then @var{k} defaults to 2.
+
+If you use @subcmd{/CRITERIA=NOINITIAL} then a naive algorithm to select
+the initial clusters is used. This will provide for faster execution but
+less well separated initial clusters and hence possibly an inferior final
+result.
-The command uses an iterative algorithm to determine the clusters for
-each case. It will continue iterating until convergence, or until @var{max_iter}
-iterations have been done. The default value of @var{max_iter} is 2.
+
+@cmd{QUICK CLUSTER} uses an iterative algorithm to select the cluster centers.
+The subcommand @subcmd{/CRITERIA=MXITER(@var{max_iter})} sets the maximum number of iterations.
+During classification, @pspp{} will continue iterating until @var{max_iter}
+iterations have been done or the convergence criterion (see below) is fulfilled.
+The default value of @var{max_iter} is 2.
+
+If, however, you specify @subcmd{/CRITERIA=NOUPDATE} then after selecting the initial centers,
+no further update to the cluster centers is done. In this case, @var{max_iter}, if specified,
+is ignored.
+
+The subcommand @subcmd{/CRITERIA=CONVERGE(@var{epsilon})} is used
+to set the convergence criterion. The value of the convergence criterion is @var{epsilon}
+times the minimum distance between the @emph{initial} cluster centers. Iteration stops when
+the mean cluster distance between one iteration and the next
+is less than the convergence criterion. The default value of @var{epsilon} is zero.
The @subcmd{MISSING} subcommand determines the handling of missing variables.
If @subcmd{INCLUDE} is set, then user-missing values are considered at their face
on the basis of the non-missing values.
The default is @subcmd{LISTWISE}.
+The @subcmd{PRINT} subcommand requests additional output to be printed.
+If @subcmd{INITIAL} is set, then the initial cluster memberships will
+be printed.
+If @subcmd{CLUSTER} is set, the cluster memberships of the individual
+cases will be displayed (potentially generating lengthy output).
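+
+The following sketch combines the subcommands described above. It asks
+for three clusters, allows up to 20 iterations, sets @var{epsilon} to
+0.01 and prints both the initial clusters and the membership of each case.
+The variable names @var{x} and @var{y} and all of the numeric values are
+assumptions chosen for illustration.
+
+@example
+QUICK CLUSTER @var{x} @var{y}
+ /CRITERIA=CLUSTERS(3) MXITER(20) CONVERGE(0.01)
+ /PRINT=INITIAL CLUSTER.
+@end example
+
+With these settings, if the minimum distance between the initial cluster
+centers happened to be 50, the convergence criterion would be
+@math{0.01 \times 50 = 0.5}.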
+
@node RANK
@section RANK
@end display
@cindex Cronbach's Alpha
-The @cmd{RELIABILTY} command performs reliability analysis on the data.
+The @cmd{RELIABILITY} command performs reliability analysis on the data.
The @subcmd{VARIABLES} subcommand is required. It determines the set of variables
upon which analysis is to be performed.
Cases are excluded on a listwise basis; if any of the variables in @var{var_list}
or if the variable @var{state_var} is missing, then the entire case will be
excluded.
+
+@c LocalWords: subcmd subcommand