Added a base parameter to the interaction_case_hash function

[pspp] / doc / statistics.texi
diff --git a/doc/statistics.texi b/doc/statistics.texi

index 0cb7f70fe84e59d662bee2bbb220a15822e48038..18cc79e7ace82bdd38e55b4908aaf64a5e8a47b3 100644 (file)
--- a/doc/statistics.texi
+++ b/doc/statistics.texi
@@ -14,6 +14,7 @@ far.
  * NPAR TESTS::                  Nonparametric tests.
  * T-TEST::                      Test hypotheses about means.
  * ONEWAY::                      One way analysis of variance.
+* QUICK CLUSTER::               K-Means clustering.
  * RANK::                        Compute rank scores.
  * REGRESSION::                  Linear regression.
  * RELIABILITY::                 Reliability analysis.
@@ -38,7 +39,7 @@ DESCRIPTIVES
                @{A,D@}
  @end display
  
-The @cmd{DESCRIPTIVES} procedure reads the active file and outputs
+The @cmd{DESCRIPTIVES} procedure reads the active dataset and outputs
  descriptive
  statistics requested by the user.  In addition, it can optionally
  compute Z-scores.
@@ -555,7 +556,7 @@ FACTOR  VARIABLES=var_list
  
          [ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, NOROTATE@}]
  
-        [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [SIG] [ALL] [DEFAULT] ]
+        [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [SIG] [ALL] [DEFAULT] ]
  
          [ /PLOT=[EIGEN] ]
  
@@ -600,6 +601,8 @@ The /PRINT subcommand may be used to select which features of the analysis are r
        The covariance matrix is printed.
  @item DET
        The determinant of the correlation or covariance matrix is printed.
+@item KMO
+      The Kaiser-Meyer-Olkin measure of sampling adequacy and the Bartlett test of sphericity is printed.
  @item SIG
        The significance of the elements of correlation matrix is printed.
  @item ALL
@@ -681,10 +684,17 @@ is used.
  @menu
  * BINOMIAL::                Binomial Test
  * CHISQUARE::               Chisquare Test
+* COCHRAN::                 Cochran Q Test
+* FRIEDMAN::                Friedman Test
+* KENDALL::                 Kendall's W Test
+* KOLMOGOROV-SMIRNOV::      Kolmogorov Smirnov Test
  * KRUSKAL-WALLIS::          Kruskal-Wallis Test
-* WILCOXON::                Wilcoxon Signed Ranks Test
+* MANN-WHITNEY::            Mann Whitney U Test
+* MCNEMAR::                 McNemar Test
+* MEDIAN::                  Median Test
  * RUNS::                    Runs Test
  * SIGN::                    The Sign Test
+* WILCOXON::                Wilcoxon Signed Ranks Test
  @end menu
  
  
@@ -765,9 +775,92 @@ If no /EXPECTED subcommand is given, then then equal frequencies
  are expected.
  
  
+@node COCHRAN
+@subsection Cochran Q Test
+@vindex Cochran
+@cindex Cochran Q test
+@cindex Q, Cochran Q
+
+@display
+     [ /COCHRAN = varlist ]
+@end display
+
+The Cochran Q test is used to test for differences between three or more groups.
+The data for @var{varlist} in all cases must assume exactly two distinct values (other than missing values). 
+
+The value of Q will be displayed and its Asymptotic significance based on a chi-square distribution.
+
+@node FRIEDMAN
+@subsection Friedman Test
+@vindex FRIEDMAN
+@cindex Friedman test
+
+@display
+     [ /FRIEDMAN = varlist ]
+@end display
+
+The Friedman test is used to test for differences between repeated measures when there is no indication that the distributions are normally distributed.
+
+A list of variables which contain the measured data must be given.  The procedure prints the sum of ranks for each variable, the test statistic and its significance.
+
+@node KENDALL
+@subsection Kendall's W Test
+@vindex KENDALL
+@cindex Kendall's W test
+@cindex coefficient of concordance
+
+@display
+     [ /KENDALL = varlist ]
+@end display
+
+The Kendall test investigates whether an arbitrary number of related samples come from the 
+same population.
+It is identical to the Friedman test except that the additional statistic W, Kendall's Coefficient of Concordance is printed.
+It has the range [0,1] --- a value of zero indicates no agreement between the samples whereas a value of
+unity indicates complete agreement.
+
+
+@node KOLMOGOROV-SMIRNOV
+@subsection Kolmogorov-Smirnov Test
+@vindex KOLMOGOROV-SMIRNOV
+@vindex K-S
+@cindex Kolmogorov-Smirnov test
+
+@display
+     [ /KOLMOGOROV-SMIRNOV (@{NORMAL [@var{mu}, @var{sigma}], UNIFORM [@var{min}, @var{max}], POISSON [@var{lambda}], EXPONENTIAL [@var{scale}] @}) = varlist ]
+@end display
+
+The one sample Kolmogorov-Smirnov subcommand is used to test whether or not a dataset is
+drawn from a particular distribution.  Four distributions are supported, @i{viz:}
+Normal, Uniform, Poisson and Exponential.
+
+Ideally you should provide the parameters of the distribution against which you wish to test
+the data. For example, with the normal distribution  the mean (@var{mu})and standard deviation (@var{sigma})
+should be given; with the uniform distribution, the minimum (@var{min})and maximum (@var{max}) value should
+be provided.
+However, if the parameters are omitted they will be imputed from the data. Imputing the
+parameters reduces the power of the test so should be avoided if possible.
+
+In the following example, two variables @var{score} and @var{age} are tested to see if
+they follow a normal distribution with a mean of 3.5 and a standard deviation of 2.0.
+@example
+  NPAR TESTS
+        /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score} @var{age}.
+@end example
+If the variables need to be tested against different distributions, then a seperate
+subcommand must be used.  For example the following syntax tests @var{score} against
+a normal distribution with mean of 3.5 and standard deviation of 2.0 whilst @var{age}
+is tested against a normal distribution of mean 40 and standard deviation 1.5.
+@example
+  NPAR TESTS
+        /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score}
+        /KOLMOGOROV-SMIRNOV (normal 40 1.5) =  @var{age}.
+@end example
+
+The abbreviated subcommand  K-S may be used in place of KOLMOGOROV-SMIRNOV.
+
  @node KRUSKAL-WALLIS
  @subsection Kruskal-Wallis Test
-@comment  node-name,  next,  previous,  up
  @vindex KRUSKAL-WALLIS
  @vindex K-W
  @cindex Kruskal-Wallis test
@@ -790,20 +883,38 @@ of the test will be printed.
  The abbreviated subcommand  K-W may be used in place of KRUSKAL-WALLIS.
  
  
-@node WILCOXON
-@subsection Wilcoxon Matched Pairs Signed Ranks Test
-@comment  node-name,  next,  previous,  up
-@vindex WILCOXON
-@cindex wilcoxon matched pairs signed ranks test
+@node MANN-WHITNEY
+@subsection Mann-Whitney U Test
+@vindex MANN-WHITNEY
+@vindex M-W
+@cindex Mann-Whitney U test
+@cindex U, Mann-Whitney U
  
  @display
-     [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]]
+     [ /MANN-WHITNEY = varlist BY var (group1, group2) ]
  @end display
  
-The /WILCOXON subcommand tests for differences between medians of the 
-variables listed.
-The test does not make any assumptions about the variances of the samples.
-It does however assume that the distribution is symetrical.
+The Mann-Whitney subcommand is used to test whether two groups of data come from different populations.
+The variables to be tested should be specified in @var{varlist} and the grouping variable, that determines to which group the test variables belong, in @var{var}.
+@var{Var} may be either a string or an alpha variable.
+@var{Group1} and @var{group2} specify the
+two values of @var{var} which determine the groups of the test data.
+Cases for which the @var{var} value is neither @var{group1} or @var{group2} will be ignored.
+
+The value of the Mann-Whitney U statistic, the Wilcoxon W, and the significance will be printed.
+The abbreviated subcommand  M-W may be used in place of MANN-WHITNEY.
+
+@node MCNEMAR
+@subsection McNemar Test
+@vindex MCNEMAR
+@cindex McNemar test
+
+@display
+     [ /MCNEMAR varlist [ WITH varlist [ (PAIRED) ]]]
+@end display
+
+Use McNemar's test to analyse the significance of the difference between
+pairs of correlated proportions.
  
  If the @code{WITH} keyword is omitted, then tests for all
  combinations of the listed variables are performed.
@@ -817,13 +928,42 @@ If the @code{WITH} keyword is given, but the
  of variable preceding @code{WITH} against variable following
  @code{WITH} are performed.
  
+The data in each variable must be dichotomous.  If there are more
+than two distinct variables an error will occur and the test will
+not be run.
+
+@node MEDIAN
+@subsection Median Test
+@vindex MEDIAN
+@cindex Median test
+
+@display
+     [ /MEDIAN [(value)] = varlist BY variable (value1, value2) ]
+@end display
+
+The median test is used to test whether independent samples come from 
+populations with a common median.
+The median of the populations against which the samples are to be tested
+may be given in parentheses immediately after the 
+/MEDIAN subcommand.  If it is not given, the median will be imputed from the 
+union of all the samples.
+
+The variables of the samples to be tested should immediately follow the @samp{=} sign. The
+keyword @code{BY} must come next, and then the grouping variable.  Two values
+in parentheses should follow.  If the first value is greater than the second,
+then a 2 sample test is performed using these two values to determine the groups.
+If however, the first variable is less than the second, then a @i{k} sample test is
+conducted and the group values used are all values encountered which lie in the
+range [@var{value1},@var{value2}].
+
+
  @node RUNS
  @subsection Runs Test
  @vindex RUNS
  @cindex runs test
  
  @display 
-     [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) varlist ]
+     [ /RUNS (@{MEAN, MEDIAN, MODE, value@})  = varlist ]
  @end display
  
  The /RUNS subcommand tests whether a data sequence is randomly ordered.
@@ -863,6 +1003,33 @@ If the @code{WITH} keyword is given, but the
  of variable preceding @code{WITH} against variable following
  @code{WITH} are performed.
  
+@node WILCOXON
+@subsection Wilcoxon Matched Pairs Signed Ranks Test
+@comment  node-name,  next,  previous,  up
+@vindex WILCOXON
+@cindex wilcoxon matched pairs signed ranks test
+
+@display
+     [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]]
+@end display
+
+The /WILCOXON subcommand tests for differences between medians of the 
+variables listed.
+The test does not make any assumptions about the variances of the samples.
+It does however assume that the distribution is symetrical.
+
+If the @code{WITH} keyword is omitted, then tests for all
+combinations of the listed variables are performed.
+If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword
+is also given, then the number of variables preceding @code{WITH}
+must be the same as the number following it.
+In this case, tests for each respective pair of variables are
+performed.
+If the @code{WITH} keyword is given, but the
+@code{(PAIRED)} keyword is omitted, then tests for each combination
+of variable preceding @code{WITH} against variable following
+@code{WITH} are performed.
+
  @node T-TEST
  @comment  node-name,  next,  previous,  up
  @section T-TEST
@@ -1003,7 +1170,7 @@ ONEWAY
          /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@}
          /CONTRAST= value1 [, value2] ... [,valueN]
          /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@}
-
+        /POSTHOC=@{BONFERRONI, GH, LSD, SCHEFFE, SIDAK, TUKEY, ALPHA ([value])@}
  @end display
  
  The @cmd{ONEWAY} procedure performs a one-way analysis of variance of
@@ -1047,11 +1214,76 @@ A setting of EXCLUDE means that variables whose values are
  user-missing are to be excluded from the analysis. A setting of
  INCLUDE means they are to be included.  The default is EXCLUDE.
  
+Using the @code{POSTHOC} subcommand you can perform multiple
+pairwise comparisons on the data. The following comparison methods
+are available:
+@itemize
+@item LSD
+Least Significant Difference.
+@item TUKEY
+Tukey Honestly Significant Difference.
+@item BONFERRONI
+Bonferroni test.
+@item SCHEFFE
+Scheff@'e's test.
+@item SIDAK
+Sidak test.
+@item GH
+The Games-Howell test.
+@end itemize
+
+@noindent
+The optional syntax @code{ALPHA(@var{value})} is used to indicate
+that @var{value} should be used as the
+confidence level for which the posthoc tests will be performed.
+The default is 0.05.
+
+@node QUICK CLUSTER
+@comment  node-name,  next,  previous,  up
+@section QUICK CLUSTER
+@vindex QUICK CLUSTER
+
+@cindex K-means clustering
+@cindex clustering
+
+@display
+QUICK CLUSTER var_list
+      [/CRITERIA=CLUSTERS(@var{k}) [MXITER(@var{max_iter})]]
+      [/MISSING=@{EXCLUDE,INCLUDE@} @{LISTWISE, PAIRWISE@}]
+@end display
+
+The @cmd{QUICK CLUSTER} command performs k-means clustering on the
+dataset.  This is useful when you wish to allocate cases into clusters
+of similar values and you already know the number of clusters.
+
+The minimum specification is @samp{QUICK CLUSTER} followed by the names
+of the variables which contain the cluster data.  Normally you will also
+want to specify @samp{/CRITERIA=CLUSTERS(@var{k})} where @var{k} is the
+number of clusters.  If this is not given, then @var{k} defaults to 2.
+
+The command uses an iterative algorithm to determine the clusters for
+each case.  It will continue iterating until convergence, or until @var{max_iter}
+iterations have been done.  The default value of @var{max_iter} is 2.
+
+The @cmd{MISSING} subcommand determines the handling of missing variables.  
+If INCLUDE is set, then user-missing values are considered at their face
+value and not as missing values.
+If EXCLUDE is set, which is the default, user-missing
+values are excluded as well as system-missing values. 
+
+If LISTWISE is set, then the entire case is excluded from the analysis
+whenever any of the clustering variables contains a missing value.   
+If PAIRWISE is set, then a case is considered missing only if all the
+clustering variables contain missing values.  Otherwise it is clustered
+on the basis of the non-missing values.
+The default is LISTWISE.
+
  
  @node RANK
  @comment  node-name,  next,  previous,  up
  @section RANK
  
+
  @vindex RANK
  @display
  RANK