but not currently honored.
@node EXAMINE
-@comment node-name, next, previous, up
@section EXAMINE
-@vindex EXAMINE
+@vindex EXAMINE
+@cindex Exploratory data analysis
@cindex Normality, testing for
@display
EXAMINE
- VARIABLES=var_list [BY factor_list ]
- /STATISTICS=@{DESCRIPTIVES, EXTREME[(n)], ALL, NONE@}
+ VARIABLES= @var{var1} [@var{var2}] @dots{} [@var{varN}]
+ [BY @var{factor1} [BY @var{subfactor1}]
+ [ @var{factor2} [BY @var{subfactor2}]]
+ @dots{}
+ [ @var{factor3} [BY @var{subfactor3}]]
+ ]
+ /STATISTICS=@{DESCRIPTIVES, EXTREME[(@var{n})], ALL, NONE@}
/PLOT=@{BOXPLOT, NPPLOT, HISTOGRAM, ALL, NONE@}
- /CINTERVAL n
+ /CINTERVAL @var{p}
/COMPARE=@{GROUPS,VARIABLES@}
- /ID=var_name
+ /ID=@var{identity_variable}
/@{TOTAL,NOTOTAL@}
- /PERCENTILE=[value_list]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @}
+ /PERCENTILE=[@var{percentiles}]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @}
/MISSING=@{LISTWISE, PAIRWISE@} [@{EXCLUDE, INCLUDE@}]
[@{NOREPORT,REPORT@}]
@end display
-The @cmd{EXAMINE} command is used to test how closely a distribution is to a
-normal distribution. It also shows you outliers and extreme values.
+The @cmd{EXAMINE} command is used to perform exploratory data analysis.
+In particular, it is useful for testing how closely a distribution follows a
+normal distribution, and for finding outliers and extreme values.
-The VARIABLES subcommand specifies the dependent variables and the
-independent variable to use as factors for the analysis. Variables
-listed before the first BY keyword are the dependent variables.
+The VARIABLES subcommand is mandatory.
+It specifies the dependent variables and optionally variables to use as
+factors for the analysis.
+Variables listed before the first BY keyword (if any) are the
+dependent variables.
The dependent variables may optionally be followed by a list of
factors which tell PSPP how to break down the analysis for each
-dependent variable. The format for each factor is
+dependent variable.
+
+Following the dependent variables, factors may be specified.
+The factors (if desired) should be preceded by a single BY keyword.
+The format for each factor is
@display
-var [BY var].
+@var{factorvar} [BY @var{subfactorvar}].
@end display
+Each unique combination of the values of @var{factorvar} and
+@var{subfactorvar} divides the dataset into @dfn{cells}.
+Statistics will be calculated for each cell
+and for the entire dataset (unless NOTOTAL is given).
-
-The STATISTICS subcommand specifies the analysis to be done.
+The STATISTICS subcommand specifies which statistics to show.
DESCRIPTIVES will produce a table showing some parametric and
-non-parametrics statistics. EXTREME produces a table showing extreme
-values of the dependent variable. A number in parentheses determines
-how many upper and lower extremes to show. The default number is 5.
-
+non-parametric statistics.
+EXTREME produces a table showing the extremities of each cell.
+A number in parentheses, @var{n}, determines
+how many upper and lower extremities to show.
+The default number is 5.
+
+The subcommands TOTAL and NOTOTAL are mutually exclusive.
+If TOTAL appears, then statistics will be produced for the entire dataset
+as well as for each cell.
+If NOTOTAL appears, then statistics will be produced only for the cells.
+These subcommands have no effect if no factor variables have been
+specified.
@cindex boxplot
@cindex histogram
@cindex npplot
The PLOT subcommand specifies which plots are to be produced if any.
Available plots are HISTOGRAM, NPPLOT and BOXPLOT.
+They can all be used to visualise how closely each cell conforms to a
+normal distribution.
+Boxplots will also show you the outliers and extreme values.
The COMPARE subcommand is only relevant if producing boxplots, and it is only
-useful there is more than one dependent variable and at least one factor. If
+useful if there is more than one dependent variable and at least one factor.
+If
/COMPARE=GROUPS is specified, then one plot per dependent variable is produced,
-containing boxplots for all the factors.
-If /COMPARE=VARIABLES is specified, then one plot per factor is produced, each
+each of which contains boxplots for all the cells.
+If /COMPARE=VARIABLES is specified, then one plot per cell is produced,
each containing one boxplot per dependent variable.
-If the /COMPARE subcommand is omitted, then PSPP uses the default value of
-/COMPARE=GROUPS.
+If the /COMPARE subcommand is omitted, then PSPP behaves as if
+/COMPARE=GROUPS were given.
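+
+As an illustration, the following sketch requests one plot per cell, each
+containing one boxplot per dependent variable.
+The variable names @var{height}, @var{weight} and @var{gender} are
+hypothetical; substitute variables from your own dataset.
+@example
+EXAMINE @var{height} @var{weight} BY
+ @var{gender}
+ /PLOT = BOXPLOT
+ /COMPARE = VARIABLES.
+@end example
+With /COMPARE = GROUPS instead, there would be one plot for @var{height} and
+one for @var{weight}, each containing boxplots for all the cells.
+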
-The ID subcommand also pertains to boxplots. If given, it must
-specify a variable name. Outliers and extreme cases plotted in
-boxplots will be labelled with the case from that variable. Numeric or
-string variables are permissible. If the ID subcommand is not given,
-then the casenumber will be used for labelling.
+The ID subcommand is relevant only if /PLOT=BOXPLOT or
+/STATISTICS=EXTREME has been given.
+If given, it should provide the name of a variable which is to be used
+to label extreme values and outliers.
+Numeric or string variables are permissible.
+If the ID subcommand is not given, then the casenumber will be used for
+labelling.
The CINTERVAL subcommand specifies the confidence interval to use in
-calculation of the descriptives command. The default it 95%.
+calculation of the descriptive statistics. The default is 95%.
@cindex percentiles
The PERCENTILES subcommand specifies which percentiles are to be calculated,
produced in addition to the factored variables. If there are no
factors specified then TOTAL and NOTOTAL have no effect.
+
+The following example will generate descriptive statistics and histograms for
+two variables @var{score1} and @var{score2}.
+Two factors are given, @i{viz}: @var{gender} and @var{gender} BY @var{culture}.
+Therefore, the descriptives and histograms will be generated for each
+distinct value
+of @var{gender} @emph{and} for each distinct combination of the values
+of @var{gender} and @var{culture}.
+Since the NOTOTAL keyword is given, statistics and histograms for
+@var{score1} and @var{score2} covering the whole dataset are not produced.
+@example
+EXAMINE @var{score1} @var{score2} BY
+ @var{gender}
+ @var{gender} BY @var{culture}
+ /STATISTICS = DESCRIPTIVES
+ /PLOT = HISTOGRAM
+ /NOTOTAL.
+@end example
+
+Here is a second example showing how the @cmd{EXAMINE} command can be used to
+find extremities.
+@example
+EXAMINE @var{height} @var{weight} BY
+ @var{gender}
+ /STATISTICS = EXTREME (3)
+ /PLOT = BOXPLOT
+ /COMPARE = GROUPS
+ /ID = @var{name}.
+@end example
+In this example, we look at the height and weight of a sample of individuals and
+how they differ between male and female.
+A table showing the 3 largest and the 3 smallest values of @var{height} and
+@var{weight} for each gender, and for the whole dataset, will be shown.
+Boxplots will also be produced.
+Because /COMPARE = GROUPS was given, boxplots for male and female will be
+shown in the same graphic, allowing us to easily see the difference between
+the genders.
+Since the variable @var{name} was specified on the ID subcommand, this will be
+used to label the extreme values.
+
@strong{Warning!}
-If many dependent variable are given, or factors are given for which
+If many dependent variables are specified, or if factor variables are
+specified for which
there are many distinct values, then @cmd{EXAMINE} will produce a very
large quantity of output.
[/MISSING = [TABLE] [INCLUDE] [DEPENDENT]]
@end display
-You can use the MEANS command to calculate the arithmetic mean and similar
+You can use the @cmd{MEANS} command to calculate the arithmetic mean and similar
statistics, either for the dataset as a whole or for categories of data.
The simplest form of the command is