Change terminology from "active file" to "active dataset".

diff --git a/doc/statistics.texi b/doc/statistics.texi

index 632c1a3754492fb0eb5aa4c59ff31e810342e272..e5e299a47f8a66ae63688581de53a4b57e18d35b 100644
--- a/doc/statistics.texi
+++ b/doc/statistics.texi
@@ -8,7 +8,9 @@ far.
  * DESCRIPTIVES::                Descriptive statistics.
  * FREQUENCIES::                 Frequency tables.
  * EXAMINE::                     Testing data for normality.
+* CORRELATIONS::                Correlation tables.
  * CROSSTABS::                   Crosstabulation tables.
+* FACTOR::                      Factor analysis and Principal Components analysis.
  * NPAR TESTS::                  Nonparametric tests.
  * T-TEST::                      Test hypotheses about means.
  * ONEWAY::                      One way analysis of variance.
@@ -36,7 +38,7 @@ DESCRIPTIVES
                @{A,D@}
  @end display
  
-The @cmd{DESCRIPTIVES} procedure reads the active file and outputs
+The @cmd{DESCRIPTIVES} procedure reads the active dataset and outputs
  descriptive
  statistics requested by the user.  In addition, it can optionally
  compute Z-scores.
@@ -116,11 +118,7 @@ respectively.
  FREQUENCIES
          /VARIABLES=var_list
          /FORMAT=@{TABLE,NOTABLE,LIMIT(limit)@}
-                @{STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]@}
-                @{LABELS,NOLABELS@}
                  @{AVALUE,DVALUE,AFREQ,DFREQ@}
-                @{SINGLE,DOUBLE@}
-                @{OLDPAGE,NEWPAGE@}
          /MISSING=@{EXCLUDE,INCLUDE@}
          /STATISTICS=@{DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
                       KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
@@ -128,8 +126,9 @@ FREQUENCIES
          /NTILES=ntiles
          /PERCENTILES=percent@dots{}
          /HISTOGRAM=[MINIMUM(x_min)] [MAXIMUM(x_max)] 
-                   [@{FREQ,PCNT@}] [@{NONORMAL,NORMAL@}]
-        /PIECHART=[MINIMUM(x_min)] [MAXIMUM(x_max)] @{NOMISSING,MISSING@}
+                   [@{FREQ[(y_max)],PERCENT[(y_max)]@}] [@{NONORMAL,NORMAL@}]
+        /PIECHART=[MINIMUM(x_min)] [MAXIMUM(x_max)]
+                  [@{FREQ,PERCENT@}] [@{NOMISSING,MISSING@}]
  
  (These options are not currently implemented.)
          /BARCHART=@dots{}
@@ -140,11 +139,9 @@ FREQUENCIES
  The @cmd{FREQUENCIES} procedure outputs frequency tables for specified
  variables.
  @cmd{FREQUENCIES} can also calculate and display descriptive statistics
-(including median and mode) and percentiles.
-
-@cmd{FREQUENCIES} also support graphical output in the form of
-histograms and pie charts.  In the future, it will be able to produce
-bar charts and output percentiles for grouped data.
+(including median and mode) and percentiles.
+@cmd{FREQUENCIES} can also output
+histograms and pie charts.  
  
  The VARIABLES subcommand is the only required subcommand.  Specify the
  variables to be analyzed.
@@ -159,30 +156,11 @@ variable specified.  NOTABLE prevents them from being output.  LIMIT
  with a numeric argument causes them to be output except when there are
  more than the specified number of values in the table.
  
-@item
-STANDARD frequency tables contain more complete information, but also to
-take up more space on the printed page.  CONDENSE frequency tables are
-less informative but take up less space.  ONEPAGE with a numeric
-argument will output standard frequency tables if there are the
-specified number of values or less, condensed tables otherwise.  ONEPAGE
-without an argument defaults to a threshold of 50 values.
-
-@item
-LABELS causes value labels to be displayed in STANDARD frequency
-tables.  NOLABLES prevents this.
-
  @item
  Normally frequency tables are sorted in ascending order by value.  This
  is AVALUE.  DVALUE tables are sorted in descending order by value.
  AFREQ and DFREQ tables are sorted in ascending and descending order,
  respectively, by frequency count.
-
-@item
-SINGLE spaced frequency tables are closely spaced.  DOUBLE spaced
-frequency tables have wider spacing.
-
-@item
-OLDPAGE and NEWPAGE are not currently used.
  @end itemize
  
  The MISSING subcommand controls the handling of user-missing values.
@@ -205,15 +183,15 @@ The NTILES subcommand causes the percentiles to be reported at the
  boundaries of the data set divided into the specified number of ranges.
  For instance, @code{/NTILES=4} would cause quartiles to be reported.
  
+@cindex histogram
  The HISTOGRAM subcommand causes the output to include a histogram for
-each specified numeric variable.  The X axis by default ranges from the
-minimum to the maximum value observed in the data, but the MINIMUM and
-MAXIMUM keywords can set an explicit range.  The Y axis by default is
-labeled in frequencies; use the PERCENT keyword to causes it to be
-labeled in percent of the total observed count.  Specify NORMAL to
-superimpose a normal curve on the histogram.
-Histograms are not created for string variables.
+each specified numeric variable.  The X axis by default ranges from
+the minimum to the maximum value observed in the data, but the MINIMUM
+and MAXIMUM keywords can set an explicit range.  Specify NORMAL to
+superimpose a normal curve on the histogram.  Histograms are not
+created for string variables.
  
+@cindex piechart
  The PIECHART adds a pie chart for each variable to the data.  Each
  slice represents one value, with the size of the slice proportional to
  the value's frequency.  By default, all non-missing values are given
@@ -221,6 +199,9 @@ slices.  The MINIMUM and MAXIMUM keywords can be used to limit the
  displayed slices to a given range of values.  The MISSING keyword adds
  slices for missing values.
  
+The FREQ and PERCENT options on HISTOGRAM and PIECHART are accepted
+but not currently honored.
+
  @node EXAMINE
  @comment  node-name,  next,  previous,  up
  @section EXAMINE
@@ -264,7 +245,11 @@ values of the dependent variable.  A number in parentheses determines
  how many upper and lower extremes to show.  The default number is 5.
  
  
+@cindex boxplot
+@cindex histogram
+@cindex npplot
  The PLOT subcommand specifies which plots are to be produced if any.
+Available plots are HISTOGRAM, NPPLOT and BOXPLOT.
  
  The COMPARE subcommand is only relevant if producing boxplots, and it is only 
  useful there is more than one dependent variable and at least one factor.   If 
@@ -301,6 +286,69 @@ If many dependent variable are given, or factors are given for which
  there are many distinct values, then @cmd{EXAMINE} will produce a very
  large quantity of output.
  
+@node CORRELATIONS
+@section CORRELATIONS
+
+@vindex CORRELATIONS
+@display
+CORRELATIONS
+     /VARIABLES = varlist [ WITH varlist ]
+     [
+      .
+      .
+      .
+      /VARIABLES = varlist [ WITH varlist ]
+      /VARIABLES = varlist [ WITH varlist ]
+     ]
+
+     [ /PRINT=@{TWOTAIL, ONETAIL@} @{SIG, NOSIG@} ]
+     [ /STATISTICS=DESCRIPTIVES XPROD ALL]
+     [ /MISSING=@{PAIRWISE, LISTWISE@} @{INCLUDE, EXCLUDE@} ]
+@end display    
+
+@cindex correlation
+The @cmd{CORRELATIONS} procedure produces tables of the Pearson correlation coefficient
+for a set of variables.  The significance of the coefficients is also given.
+
+At least one VARIABLES subcommand is required. If the WITH keyword is used, then a non-square
+correlation table will be produced.
+The variables preceding WITH will be used as the rows of the table, and the variables following
+it will be the columns of the table.
+If the WITH keyword is not given, then a square, symmetrical table using all variables is produced.
+
+
+The @cmd{MISSING} subcommand determines the handling of missing values.
+If INCLUDE is set, then user-missing values are included in the
+calculations, but system-missing values are not.
+If EXCLUDE is set, then user-missing
+values are excluded as well as system-missing values.
+This is the default.
+
+If LISTWISE is set, then the entire case is excluded from analysis
+whenever any variable  specified in any @cmd{/VARIABLES} subcommand
+contains a missing value.   
+If PAIRWISE is set, then a case is considered missing only if either of the
+values for the particular coefficient is missing.
+The default is PAIRWISE.
+
+The PRINT subcommand is used to control how the reported significance values are printed.
+If the TWOTAIL option is used, then a two-tailed test of significance is 
+printed.  If the ONETAIL option is given, then a one-tailed test is used.
+The default is TWOTAIL.
+
+If the NOSIG option is specified, then correlation coefficients with significance less than
+0.05 are highlighted.
+If SIG is specified, then no highlighting is performed.  This is the default.
+
+@cindex covariance
+The STATISTICS subcommand requests additional statistics to be displayed.  The keyword 
+DESCRIPTIVES requests that the mean, number of non-missing cases, and the unbiased
+estimator of the standard deviation be displayed.
+These statistics will be displayed in a separate table, for all the variables listed
+in any /VARIABLES subcommand.
+The XPROD keyword requests cross-product deviations and covariance estimators to 
+be displayed for each pair of variables.
+The keyword ALL is the union of DESCRIPTIVES and XPROD.
  
  @node CROSSTABS
  @section CROSSTABS
@@ -489,6 +537,108 @@ Approximate T of uncertainty coefficient is wrong.
  
  Fixes for any of these deficiencies would be welcomed.
  
+@node FACTOR
+@section FACTOR
+
+@vindex FACTOR
+@cindex factor analysis
+@cindex principal components analysis
+@cindex principal axis factoring
+@cindex data reduction
+
+@display
+FACTOR  VARIABLES=var_list
+
+        [ /METHOD = @{CORRELATION, COVARIANCE@} ]
+
+        [ /EXTRACTION=@{PC, PAF@}] 
+
+        [ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, NOROTATE@}]
+
+        [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [SIG] [ALL] [DEFAULT] ]
+
+        [ /PLOT=[EIGEN] ]
+
+        [ /FORMAT=[SORT] [BLANK(@var{n})] [DEFAULT] ]
+
+        [ /CRITERIA=[FACTORS(@var{n})] [MINEIGEN(@var{l})] [ITERATE(@var{m})] [ECONVERGE (@var{delta})] [DEFAULT] ]
+
+        [ /MISSING=[@{LISTWISE, PAIRWISE@}] [@{INCLUDE, EXCLUDE@}] ]
+@end display
+
+The FACTOR command performs Factor Analysis or Principal Axis Factoring on a dataset.  It may be used to find
+common factors in the data or for data reduction purposes.
+
+The VARIABLES subcommand is required.  It lists the variables which are to partake in the analysis.
+
+The /EXTRACTION subcommand is used to specify the way in which factors (components) are extracted from the data.
+If PC is specified, then Principal Components Analysis is used.  If PAF is specified, then Principal Axis Factoring is
+used. By default Principal Components Analysis will be used.
+
+The /ROTATION subcommand is used to specify the method by which the extracted solution will be rotated.
+Three methods are available: VARIMAX (which is the default), EQUAMAX, and QUARTIMAX.
+If you don't want any rotation to be performed, the word NOROTATE will prevent the command from performing any
+rotation on the data. Oblique rotations are not supported.
+
+The /METHOD subcommand should be used to determine whether the covariance matrix or the correlation matrix of the data is
+to be analysed.  By default, the correlation matrix is analysed.
+
+The /PRINT subcommand may be used to select which features of the analysis are reported:
+
+@itemize
+@item UNIVARIATE
+      A table of mean values, standard deviations and total weights is printed.
+@item INITIAL
+      Initial communalities and eigenvalues are printed.
+@item EXTRACTION
+      Extracted communalities and eigenvalues are printed.
+@item ROTATION
+      Rotated communalities and eigenvalues are printed.
+@item CORRELATION
+      The correlation matrix is printed.
+@item COVARIANCE
+      The covariance matrix is printed.
+@item DET
+      The determinant of the correlation or covariance matrix is printed.
+@item SIG
+      The significance of the elements of the correlation matrix is printed.
+@item ALL
+      All of the above are printed.
+@item DEFAULT
+      Identical to INITIAL and EXTRACTION.
+@end itemize
+
+If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed.  This can be useful for visualising
+which factors (components) should be retained.
+
+The /FORMAT subcommand determines how data are to be displayed in loading matrices.  If SORT is specified, then the variables
+are sorted in descending order of significance.  If BLANK(@var{n}) is specified, then coefficients whose absolute value is less
+than @var{n} will not be printed.  If the keyword DEFAULT is given, or if no /FORMAT subcommand is given, then no sorting is 
+performed, and all coefficients will be printed.
+
+The /CRITERIA subcommand is used to specify how the number of extracted factors (components) is chosen.  If FACTORS(@var{n}) is
+specified, where @var{n} is an integer, then @var{n} factors will be extracted.  Otherwise, the MINEIGEN setting will
+be used.  MINEIGEN(@var{l}) requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted.
+The default value of @var{l} is 1.    The ECONVERGE and ITERATE settings have effect only when iterative algorithms for factor
+extraction (such as Principal Axis Factoring) are used.   ECONVERGE(@var{delta}) specifies that iteration should cease when
+the maximum absolute change in the communality estimates between one iteration and the previous is less than @var{delta}. The
+default value of @var{delta} is 0.001.
+The ITERATE(@var{m}) setting sets the maximum number of iterations to @var{m}.  The default value of @var{m} is 25.
+
+The @cmd{MISSING} subcommand determines the handling of missing values.
+If INCLUDE is set, then user-missing values are included in the
+calculations, but system-missing values are not.
+If EXCLUDE is set, then user-missing
+values are excluded as well as system-missing values.
+This is the default.
+If LISTWISE is set, then the entire case is excluded from analysis
+whenever any variable  specified in the @cmd{VARIABLES} subcommand
+contains a missing value.   
+If PAIRWISE is set, then a case is considered missing only if either of the
+values for the particular coefficient is missing.
+The default is LISTWISE.
+ 
+
  @node NPAR TESTS
  @section NPAR TESTS
  
@@ -531,8 +681,14 @@ is used.
  @menu
  * BINOMIAL::                Binomial Test
  * CHISQUARE::               Chisquare Test
-* WILCOXON::                Wilcoxon Signed Ranks Test
+* COCHRAN::                 Cochran Q Test
+* FRIEDMAN::                Friedman Test
+* KENDALL::                 Kendall's W Test
+* KRUSKAL-WALLIS::          Kruskal-Wallis Test
+* MANN-WHITNEY::            Mann Whitney U Test
+* RUNS::                    Runs Test
  * SIGN::                    The Sign Test
+* WILCOXON::                Wilcoxon Signed Ranks Test
  @end menu
  
  
@@ -612,20 +768,131 @@ sum of the frequencies need not be 1.
  If no /EXPECTED subcommand is given, then then equal frequencies 
  are expected.
  
-@node WILCOXON
-@subsection Wilcoxon Matched Pairs Signed Ranks Test
-@comment  node-name,  next,  previous,  up
-@vindex WILCOXON
-@cindex wilcoxon matched pairs signed ranks test
+
+@node COCHRAN
+@subsection Cochran Q Test
+@vindex COCHRAN
+@cindex Cochran Q test
+@cindex Q, Cochran Q
  
  @display
-     [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]]
+     [ /COCHRAN = varlist ]
  @end display
  
-The /WILCOXON subcommand tests for differences between medians of the 
+The Cochran Q test is used to test for differences between three or more groups.
+The data for @var{varlist} in all cases must assume exactly two distinct values (other than missing values). 
+
+The value of Q will be displayed along with its asymptotic significance, based on a chi-square distribution.
+
+@node FRIEDMAN
+@subsection Friedman Test
+@vindex FRIEDMAN
+@cindex Friedman test
+
+@display
+     [ /FRIEDMAN = varlist ]
+@end display
+
+The Friedman test is used to test for differences between repeated measures when there is no indication that the data are normally distributed.
+
+A list of variables which contain the measured data must be given.  The procedure prints the sum of ranks for each variable, the test statistic and its significance.
+
+@node KENDALL
+@subsection Kendall's W Test
+@vindex KENDALL
+@cindex Kendall's W test
+@cindex coefficient of concordance
+
+@display
+     [ /KENDALL = varlist ]
+@end display
+
+The Kendall test investigates whether an arbitrary number of related samples come from the 
+same population.
+It is identical to the Friedman test except that the additional statistic W, Kendall's Coefficient of Concordance, is printed.
+It has the range [0,1] --- a value of zero indicates no agreement between the samples whereas a value of
+unity indicates complete agreement.
+
+
+@node KRUSKAL-WALLIS
+@subsection Kruskal-Wallis Test
+@vindex KRUSKAL-WALLIS
+@vindex K-W
+@cindex Kruskal-Wallis test
+
+@display
+     [ /KRUSKAL-WALLIS = varlist BY var (lower, upper) ]
+@end display
+
+The Kruskal-Wallis test is used to compare data from an 
+arbitrary number of populations.  It does not assume normality.
+The data to be compared are specified by @var{varlist}.
+The categorical variable determining the groups to which the
+data belong is given by @var{var}. The limits @var{lower} and
+@var{upper} specify the valid range of @var{var}. Any cases for
+which @var{var} falls outside [@var{lower}, @var{upper}] will be
+ignored.
+
+The mean rank of each group as well as the chi-squared value and significance
+of the test will be printed.
+The abbreviated subcommand  K-W may be used in place of KRUSKAL-WALLIS.
+
+
+@node MANN-WHITNEY
+@subsection Mann-Whitney U Test
+@vindex MANN-WHITNEY
+@vindex M-W
+@cindex Mann-Whitney U test
+@cindex U, Mann-Whitney U
+
+@display
+     [ /MANN-WHITNEY = varlist BY var (group1, group2) ]
+@end display
+
+The Mann-Whitney subcommand is used to test whether two groups of data come from different populations.
+The variables to be tested should be specified in @var{varlist} and the grouping variable, that determines to which group the test variables belong, in @var{var}.
+@var{Var} may be either a string or a numeric variable.
+@var{Group1} and @var{group2} specify the
+two values of @var{var} which determine the groups of the test data.
+Cases for which the @var{var} value is neither @var{group1} nor @var{group2} will be ignored.
+
+The value of the Mann-Whitney U statistic, the Wilcoxon W, and the significance will be printed.
+The abbreviated subcommand  M-W may be used in place of MANN-WHITNEY.
+
+
+@node RUNS
+@subsection Runs Test
+@vindex RUNS
+@cindex runs test
+
+@display 
+     [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) varlist ]
+@end display
+
+The /RUNS subcommand tests whether a data sequence is randomly ordered.
+
+It works by examining the number of times a variable's value crosses a given threshold. 
+The desired threshold must be specified within parentheses.
+It may either be specified as a number or as one of MEAN, MEDIAN or MODE.
+Following the threshold specification comes the list of variables whose values are to be
+tested.
+
+The subcommand shows the number of runs and the asymptotic significance based on the
+length of the data.
+
+@node SIGN
+@subsection Sign Test
+@vindex SIGN
+@cindex sign test
+
+@display
+     [ /SIGN varlist [ WITH varlist [ (PAIRED) ]]]
+@end display
+
+The /SIGN subcommand tests for differences between medians of the 
  variables listed.
-The test does not make any assumptions about the variances of the samples.
-It does however assume that the distribution is symetrical.
+The test does not make any assumptions about the
+distribution of the data.
  
  If the @code{WITH} keyword is omitted, then tests for all
  combinations of the listed variables are performed.
@@ -639,20 +906,20 @@ If the @code{WITH} keyword is given, but the
  of variable preceding @code{WITH} against variable following
  @code{WITH} are performed.
  
-
-@node SIGN
-@subsection Sign Test
-@vindex SIGN
-@cindex sign test
+@node WILCOXON
+@subsection Wilcoxon Matched Pairs Signed Ranks Test
+@comment  node-name,  next,  previous,  up
+@vindex WILCOXON
+@cindex wilcoxon matched pairs signed ranks test
  
  @display
-     [ /SIGN varlist [ WITH varlist [ (PAIRED) ]]]
+     [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]]
  @end display
  
-The /SIGN subcommand tests for differences between medians of the 
+The /WILCOXON subcommand tests for differences between medians of the 
  variables listed.
-The test does not make any assumptions about the
-distribution of the data.
+The test does not make any assumptions about the variances of the samples.
+It does however assume that the distribution is symmetrical.
  
  If the @code{WITH} keyword is omitted, then tests for all
  combinations of the listed variables are performed.
@@ -814,7 +1081,7 @@ variables factored by a single independent variable.
  It is used to compare the means of a population
  divided into more than two groups. 
  
-The  variables to be analysed should be given in the @code{VARIABLES}
+The dependent variables to be analysed should be given in the @code{VARIABLES}
  subcommand.  
  The list of variables must be followed by the @code{BY} keyword and
  the name of the independent (or factor) variable.
@@ -840,6 +1107,16 @@ If the total sum of the coefficients are not zero, then PSPP will
  display a warning, but will proceed with the analysis.
  The @code{CONTRAST} subcommand may be given up to 10 times in order
  to specify different contrast tests.
+The @code{MISSING} subcommand defines how missing values are handled.
+If LISTWISE is specified, then cases which have missing values for
+the independent variable or any dependent variable will be ignored.
+If ANALYSIS is specified, then cases will be ignored if the independent
+variable is missing or if the dependent variable currently being 
+analysed is missing.  The default is ANALYSIS.
+A setting of EXCLUDE means that values which are
+user-missing are to be excluded from the analysis. A setting of
+INCLUDE means they are to be included.  The default is EXCLUDE.
+
  
  @node RANK
  @comment  node-name,  next,  previous,  up
@@ -960,7 +1237,7 @@ analysis tested against the totals.
  
  @display
  ROC     @var{var_list} BY @var{state_var} (@var{state_value})
-        /PLOT @{ CURVE [(REFERENCE)], NONE @}
+        /PLOT = @{ CURVE [(REFERENCE)], NONE @}
          /PRINT = [ SE ] [ COORDINATES ]
          /CRITERIA = [ CUTOFF(@{INCLUDE,EXCLUDE@}) ]
            [ TESTPOS (@{LARGE,SMALL@}) ]