From cad6c29bc4b26627ba85b1e268e8939585bc35a4 Mon Sep 17 00:00:00 2001 From: John Darrington Date: Fri, 23 Mar 2012 08:26:31 +0100 Subject: [PATCH 1/1] Updated the documentation for EXAMINE --- doc/statistics.texi | 136 +++++++++++++++++++++++++++++++++----------- 1 file changed, 103 insertions(+), 33 deletions(-) diff --git a/doc/statistics.texi b/doc/statistics.texi index edd96d0990..2f7022ba52 100644 --- a/doc/statistics.texi +++ b/doc/statistics.texi @@ -205,71 +205,101 @@ The FREQ and PERCENT options on HISTOGRAM and PIECHART are accepted but not currently honored. @node EXAMINE -@comment node-name, next, previous, up @section EXAMINE -@vindex EXAMINE +@vindex EXAMINE +@cindex Exploratory data analysis @cindex Normality, testing for @display EXAMINE - VARIABLES=var_list [BY factor_list ] - /STATISTICS=@{DESCRIPTIVES, EXTREME[(n)], ALL, NONE@} + VARIABLES= @var{var1} [@var{var2}] @dots{} [@var{varN}] + [BY @var{factor1} [BY @var{subfactor1}] + [ @var{factor2} [BY @var{subfactor2}]] + @dots{} + [ @var{factor3} [BY @var{subfactor3}]] + ] + /STATISTICS=@{DESCRIPTIVES, EXTREME[(@var{n})], ALL, NONE@} /PLOT=@{BOXPLOT, NPPLOT, HISTOGRAM, ALL, NONE@} - /CINTERVAL n + /CINTERVAL @var{p} /COMPARE=@{GROUPS,VARIABLES@} - /ID=var_name + /ID=@var{identity_variable} /@{TOTAL,NOTOTAL@} - /PERCENTILE=[value_list]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @} + /PERCENTILE=[@var{percentiles}]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @} /MISSING=@{LISTWISE, PAIRWISE@} [@{EXCLUDE, INCLUDE@}] [@{NOREPORT,REPORT@}] @end display -The @cmd{EXAMINE} command is used to test how closely a distribution is to a -normal distribution. It also shows you outliers and extreme values. +The @cmd{EXAMINE} command is used to perform exploratory data analysis. +In particular, it is useful for testing how closely a distribution follows a +normal distribution, and for finding outliers and extreme values. 
-The VARIABLES subcommand specifies the dependent variables and the -independent variable to use as factors for the analysis. Variables -listed before the first BY keyword are the dependent variables. +The VARIABLES subcommand is mandatory. +It specifies the dependent variables and optionally variables to use as +factors for the analysis. +Variables listed before the first BY keyword (if any) are the +dependent variables. The dependent variables may optionally be followed by a list of factors which tell PSPP how to break down the analysis for each -dependent variable. The format for each factor is +dependent variable. + +Following the dependent variables, factors may be specified. +The factors (if desired) should be preceded by a single BY keyword. +The format for each factor is @display -var [BY var]. +@var{factorvar} [BY @var{subfactorvar}]. @end display +Each unique combination of the values of @var{factorvar} and +@var{subfactorvar} divides the dataset into @dfn{cells}. +Statistics will be calculated for each cell +and for the entire dataset (unless NOTOTAL is given). - -The STATISTICS subcommand specifies the analysis to be done. +The STATISTICS subcommand specifies which statistics to show. DESCRIPTIVES will produce a table showing some parametric and -non-parametrics statistics. EXTREME produces a table showing extreme -values of the dependent variable. A number in parentheses determines -how many upper and lower extremes to show. The default number is 5. - +non-parametric statistics. +EXTREME produces a table showing the extremities of each cell. +A number in parentheses, @var{n}, determines +how many upper and lower extremities to show. +The default number is 5. + +The subcommands TOTAL and NOTOTAL are mutually exclusive. +If TOTAL appears, then statistics will be produced for the entire dataset +as well as for each cell. +If NOTOTAL appears, then statistics will be produced only for the cells +(unless no factor variables have been given).
+These subcommands have no effect if there have been no factor variables +specified. @cindex boxplot @cindex histogram @cindex npplot The PLOT subcommand specifies which plots are to be produced if any. Available plots are HISTOGRAM, NPPLOT and BOXPLOT. +They can all be used to visualise how closely each cell conforms to a +normal distribution. +Boxplots will also show you the outliers and extreme values. The COMPARE subcommand is only relevant if producing boxplots, and it is only -useful there is more than one dependent variable and at least one factor. If +useful if there is more than one dependent variable and at least one factor. +If /COMPARE=GROUPS is specified, then one plot per dependent variable is produced, -containing boxplots for all the factors. -If /COMPARE=VARIABLES is specified, then one plot per factor is produced, each +each of which contains boxplots for all the cells. +If /COMPARE=VARIABLES is specified, then one plot per cell is produced, each containing one boxplot per dependent variable. -If the /COMPARE subcommand is omitted, then PSPP uses the default value of -/COMPARE=GROUPS. +If the /COMPARE subcommand is omitted, then PSPP behaves as if +/COMPARE=GROUPS were given. -The ID subcommand also pertains to boxplots. If given, it must -specify a variable name. Outliers and extreme cases plotted in -boxplots will be labelled with the case from that variable. Numeric or -string variables are permissible. If the ID subcommand is not given, -then the casenumber will be used for labelling. +The ID subcommand is relevant only if /PLOT=BOXPLOT or +/STATISTICS=EXTREME has been given. +If given, it should provide the name of a variable which is to be used +to label extreme values and outliers. +Numeric or string variables are permissible. +If the ID subcommand is not given, then the case number will be used for +labelling. The CINTERVAL subcommand specifies the confidence interval to use in -calculation of the descriptives command. The default it 95%.
+calculation of the descriptives command. The default is 95%. @cindex percentiles The PERCENTILES subcommand specifies which percentiles are to be calculated, @@ -283,8 +313,48 @@ then then statistics for the unfactored dependent variables are produced in addition to the factored variables. If there are no factors specified then TOTAL and NOTOTAL have no effect. + +The following example will generate descriptive statistics and histograms for +two variables @var{score1} and @var{score2}. +Two factors are given, @i{viz}: @var{gender} and @var{gender} BY @var{culture}. +Therefore, the descriptives and histograms will be generated for each +distinct value +of @var{gender} @emph{and} for each distinct combination of the values +of @var{gender} and @var{culture}. +Since the NOTOTAL keyword is given, statistics and histograms for +@var{score1} and @var{score2} covering the whole dataset are not produced. +@example +EXAMINE @var{score1} @var{score2} BY + @var{gender} + @var{gender} BY @var{culture} + /STATISTICS = DESCRIPTIVES + /PLOT = HISTOGRAM + /NOTOTAL. +@end example + +Here is a second example showing how the @cmd{EXAMINE} command can be used to find extremities. +@example +EXAMINE @var{height} @var{weight} BY + @var{gender} + /STATISTICS = EXTREME (3) + /PLOT = BOXPLOT + /COMPARE = GROUPS + /ID = @var{name}. +@end example +In this example, we look at the height and weight of a sample of individuals and +how they differ between male and female. +A table showing the 3 largest and the 3 smallest values of @var{height} and +@var{weight} for each gender, and for the whole dataset, will be shown. +Boxplots will also be produced. +Because /COMPARE = GROUPS was given, boxplots for male and female will be +shown in the same graphic, allowing us to easily see the difference between +the genders. +Since the variable @var{name} was specified on the ID subcommand, this will be +used to label the extreme values.
+ @strong{Warning!} -If many dependent variable are given, or factors are given for which +If many dependent variables are specified, or if factor variables are +specified for which there are many distinct values, then @cmd{EXAMINE} will produce a very large quantity of output. @@ -661,7 +731,7 @@ MEANS [TABLES =] [/MISSING = [TABLE] [INCLUDE] [DEPENDENT]] @end display -You can use the MEANS command to calculate the arithmetic mean and similar +You can use the @cmd{MEANS} command to calculate the arithmetic mean and similar statistics, either for the dataset as a whole or for categories of data. The simplest form of the command is -- 2.30.2