X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;ds=sidebyside;f=doc%2Fstatistics.texi;h=edd96d0990bb3fcd050a8490efc36d6560855f39;hb=5cdb1f83c37674ee6d588bb0104729d564012ecb;hp=7e7d9c05eeaa8a300cefd78dc0a549699ffe7b93;hpb=c7bb6c83f463cc9bf8602daeb7b3540bc4b73fc9;p=pspp-builds.git diff --git a/doc/statistics.texi b/doc/statistics.texi index 7e7d9c05..edd96d09 100644 --- a/doc/statistics.texi +++ b/doc/statistics.texi @@ -11,9 +11,11 @@ far. * CORRELATIONS:: Correlation tables. * CROSSTABS:: Crosstabulation tables. * FACTOR:: Factor analysis and Principal Components analysis +* MEANS:: Average values and other statistics. * NPAR TESTS:: Nonparametric tests. * T-TEST:: Test hypotheses about means. * ONEWAY:: One way analysis of variance. +* QUICK CLUSTER:: K-Means clustering. * RANK:: Compute rank scores. * REGRESSION:: Linear regression. * RELIABILITY:: Reliability analysis. @@ -38,7 +40,7 @@ DESCRIPTIVES @{A,D@} @end display -The @cmd{DESCRIPTIVES} procedure reads the active file and outputs +The @cmd{DESCRIPTIVES} procedure reads the active dataset and outputs descriptive statistics requested by the user. In addition, it can optionally compute Z-scores. @@ -257,7 +259,7 @@ useful there is more than one dependent variable and at least one factor. If containing boxplots for all the factors. If /COMPARE=VARIABLES is specified, then one plot per factor is produced, each each containing one boxplot per dependent variable. -If the /COMPARE subcommand is ommitted, then PSPP uses the default value of +If the /COMPARE subcommand is omitted, then PSPP uses the default value of /COMPARE=GROUPS. The ID subcommand also pertains to boxplots. If given, it must @@ -360,7 +362,6 @@ CROSSTABS /MISSING=@{TABLE,INCLUDE,REPORT@} /WRITE=@{NONE,CELLS,ALL@} /FORMAT=@{TABLES,NOTABLES@} - @{LABELS,NOLABELS,NOVALLABS@} @{PIVOT,NOPIVOT@} @{AVALUE,DVALUE@} @{NOINDEX,INDEX@} @@ -418,11 +419,6 @@ settings: TABLES, the default, causes crosstabulation tables to be output. 
NOTABLES suppresses them. -@item -LABELS, the default, allows variable labels and value labels to appear -in the output. NOLABELS suppresses them. NOVALLABS displays variable -labels but suppresses value labels. - @item PIVOT, the default, causes each TABLES subcommand to be displayed in a pivot table format. NOPIVOT causes the old-style crosstabulation format @@ -555,7 +551,7 @@ FACTOR VARIABLES=var_list [ /ROTATION=@{VARIMAX, EQUAMAX, QUARTIMAX, NOROTATE@}] - [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [SIG] [ALL] [DEFAULT] ] + [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [SIG] [ALL] [DEFAULT] ] [ /PLOT=[EIGEN] ] @@ -600,6 +596,8 @@ The /PRINT subcommand may be used to select which features of the analysis are r The covariance matrix is printed. @item DET The determinant of the correlation or covariance matrix is printed. +@item KMO + The Kaiser-Meyer-Olkin measure of sampling adequacy and the Bartlett test of sphericity are printed. @item SIG The significance of the elements of correlation matrix is printed. @item ALL @@ -608,7 +606,7 @@ The /PRINT subcommand may be used to select which features of the analysis are r Identical to INITIAL and EXTRACTION. @end itemize -If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualising +If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualizing which factors (components) should be retained. The /FORMAT subcommand determined how data are to be displayed in loading matrices. If SORT is specified, then the variables @@ -637,7 +635,145 @@ contains a missing value. If PAIRWISE is set, then a case is considered missing only if either of the values for the particular coefficient are missing. The default is LISTWISE. 
- + +@node MEANS +@section MEANS + +@vindex MEANS +@cindex means + +@display +MEANS [TABLES =] + @{varlist@} + [ BY @{varlist@} [BY @{varlist@} [BY @{varlist@} @dots{} ]]] + + [ /@{varlist@} + [ BY @{varlist@} [BY @{varlist@} [BY @{varlist@} @dots{} ]]] ] + + [/CELLS = [MEAN] [COUNT] [STDDEV] [SEMEAN] [SUM] [MIN] [MAX] [RANGE] + [VARIANCE] [KURT] [SEKURT] + [SKEW] [SESKEW] [FIRST] [LAST] + [HARMONIC] [GEOMETRIC] + [DEFAULT] + [ALL] + [NONE] ] + + [/MISSING = [TABLE] [INCLUDE] [DEPENDENT]] +@end display + +You can use the MEANS command to calculate the arithmetic mean and similar +statistics, either for the dataset as a whole or for categories of data. + +The simplest form of the command is +@example +MEANS @var{v}. +@end example +@noindent which calculates the mean, count and standard deviation for @var{v}. +If you specify a grouping variable, for example +@example +MEANS @var{v} BY @var{g}. +@end example +@noindent then the means, counts and standard deviations for @var{v} after having +been grouped by @var{g} will be calculated. +Instead of the mean, count and standard deviation, you could specify the statistics +in which you are interested: +@example +MEANS @var{x} @var{y} BY @var{g} + /CELLS = HARMONIC SUM MIN. +@end example +This example calculates the harmonic mean, the sum and the minimum values of @var{x} and @var{y} +grouped by @var{g}. + +The CELLS subcommand specifies which statistics to calculate. The available statistics +are: +@itemize +@item MEAN +@cindex arithmetic mean + The arithmetic mean. +@item COUNT + The count of the values. +@item STDDEV + The standard deviation. +@item SEMEAN + The standard error of the mean. +@item SUM + The sum of the values. +@item MIN + The minimum value. +@item MAX + The maximum value. +@item RANGE + The difference between the maximum and minimum values. +@item VARIANCE + The variance. +@item FIRST + The first value in the category. +@item LAST + The last value in the category. +@item SKEW + The skewness. 
+@item SESKEW + The standard error of the skewness. +@item KURT + The kurtosis. +@item SEKURT + The standard error of the kurtosis. +@item HARMONIC +@cindex harmonic mean + The harmonic mean. +@item GEOMETRIC +@cindex geometric mean + The geometric mean. +@end itemize + +In addition, three special keywords are recognized: +@itemize +@item DEFAULT + This is the same as MEAN COUNT STDDEV. +@item ALL + All of the above statistics will be calculated. +@item NONE + No statistics will be calculated (only a summary will be shown). +@end itemize + + +More than one @dfn{table} can be specified in a single command. +Each table is separated by a @samp{/}. For +example +@example +MEANS TABLES = + @var{c} @var{d} @var{e} BY @var{x} + /@var{a} @var{b} BY @var{x} @var{y} + /@var{f} BY @var{y} BY @var{z}. +@end example +has three tables (the @samp{TABLES =} is optional). +The first table has three dependent variables @var{c}, @var{d} and @var{e} +and a single categorical variable @var{x}. +The second table has two dependent variables @var{a} and @var{b}, +and two categorical variables @var{x} and @var{y}. +The third table has a single dependent variable @var{f} +and a categorical variable formed by the combination of @var{y} and @var{z}. + + +By default, values are omitted from the analysis only if missing values +(either system missing or user missing) +for any of the variables directly involved in their calculation are +encountered. +This behaviour can be modified with the /MISSING subcommand. +Three options are possible: TABLE, INCLUDE and DEPENDENT. + +/MISSING = TABLE causes cases to be dropped if any variable is missing +in the table specification currently being processed, regardless of +whether it is needed to calculate the statistic. + +/MISSING = INCLUDE says that user missing values, either in the dependent +variables or in the categorical variables should be taken at their face +value, and not excluded. 
+ +/MISSING = DEPENDENT says that user missing values in the dependent +variables should be taken at their face value; however, cases which +have user missing values for the categorical variables should be omitted +from the calculation. @node NPAR TESTS @section NPAR TESTS @@ -681,11 +817,17 @@ is used. @menu * BINOMIAL:: Binomial Test * CHISQUARE:: Chisquare Test +* COCHRAN:: Cochran Q Test * FRIEDMAN:: Friedman Test +* KENDALL:: Kendall's W Test +* KOLMOGOROV-SMIRNOV:: Kolmogorov-Smirnov Test * KRUSKAL-WALLIS:: Kruskal-Wallis Test -* WILCOXON:: Wilcoxon Signed Ranks Test +* MANN-WHITNEY:: Mann-Whitney U Test +* MCNEMAR:: McNemar Test +* MEDIAN:: Median Test * RUNS:: Runs Test * SIGN:: The Sign Test +* WILCOXON:: Wilcoxon Signed Ranks Test @end menu @@ -766,6 +908,21 @@ If no /EXPECTED subcommand is given, then then equal frequencies are expected. +@node COCHRAN +@subsection Cochran Q Test +@vindex Cochran +@cindex Cochran Q test +@cindex Q, Cochran Q + +@display + [ /COCHRAN = varlist ] +@end display + +The Cochran Q test is used to test for differences between three or more groups. +The data for @var{varlist} in all cases must assume exactly two distinct values (other than missing values). + +The value of Q will be displayed along with its asymptotic significance based on a chi-square distribution. + @node FRIEDMAN @subsection Friedman Test @vindex FRIEDMAN @@ -779,6 +936,62 @@ The Friedman test is used to test for differences between repeated measures when A list of variables which contain the measured data must be given. The procedure prints the sum of ranks for each variable, the test statistic and its significance. +@node KENDALL +@subsection Kendall's W Test +@vindex KENDALL +@cindex Kendall's W test +@cindex coefficient of concordance + +@display + [ /KENDALL = varlist ] +@end display + +The Kendall test investigates whether an arbitrary number of related samples come from the +same population. 
+It is identical to the Friedman test except that the additional statistic W, Kendall's Coefficient of Concordance, is printed. +It has the range [0,1] --- a value of zero indicates no agreement between the samples whereas a value of +unity indicates complete agreement. + + +@node KOLMOGOROV-SMIRNOV +@subsection Kolmogorov-Smirnov Test +@vindex KOLMOGOROV-SMIRNOV +@vindex K-S +@cindex Kolmogorov-Smirnov test + +@display + [ /KOLMOGOROV-SMIRNOV (@{NORMAL [@var{mu}, @var{sigma}], UNIFORM [@var{min}, @var{max}], POISSON [@var{lambda}], EXPONENTIAL [@var{scale}] @}) = varlist ] +@end display + +The one sample Kolmogorov-Smirnov subcommand is used to test whether or not a dataset is +drawn from a particular distribution. Four distributions are supported, @i{viz:} +Normal, Uniform, Poisson and Exponential. + +Ideally you should provide the parameters of the distribution against which you wish to test +the data. For example, with the normal distribution the mean (@var{mu}) and standard deviation (@var{sigma}) +should be given; with the uniform distribution, the minimum (@var{min}) and maximum (@var{max}) values should +be provided. +However, if the parameters are omitted they will be imputed from the data. Imputing the +parameters reduces the power of the test and so should be avoided if possible. + +In the following example, two variables @var{score} and @var{age} are tested to see if +they follow a normal distribution with a mean of 3.5 and a standard deviation of 2.0. +@example + NPAR TESTS + /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score} @var{age}. +@end example +If the variables need to be tested against different distributions, then a separate +subcommand must be used. For example, the following syntax tests @var{score} against +a normal distribution with mean of 3.5 and standard deviation of 2.0 whilst @var{age} +is tested against a normal distribution of mean 40 and standard deviation 1.5. 
+@example + NPAR TESTS + /KOLMOGOROV-SMIRNOV (normal 3.5 2.0) = @var{score} + /KOLMOGOROV-SMIRNOV (normal 40 1.5) = @var{age}. +@end example + +The abbreviated subcommand K-S may be used in place of KOLMOGOROV-SMIRNOV. + @node KRUSKAL-WALLIS @subsection Kruskal-Wallis Test @vindex KRUSKAL-WALLIS @@ -803,20 +1016,38 @@ of the test will be printed. The abbreviated subcommand K-W may be used in place of KRUSKAL-WALLIS. -@node WILCOXON -@subsection Wilcoxon Matched Pairs Signed Ranks Test -@comment node-name, next, previous, up -@vindex WILCOXON -@cindex wilcoxon matched pairs signed ranks test +@node MANN-WHITNEY +@subsection Mann-Whitney U Test +@vindex MANN-WHITNEY +@vindex M-W +@cindex Mann-Whitney U test +@cindex U, Mann-Whitney U @display - [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]] + [ /MANN-WHITNEY = varlist BY var (group1, group2) ] @end display -The /WILCOXON subcommand tests for differences between medians of the -variables listed. -The test does not make any assumptions about the variances of the samples. -It does however assume that the distribution is symetrical. +The Mann-Whitney subcommand is used to test whether two groups of data come from different populations. +The variables to be tested should be specified in @var{varlist} and the grouping variable, which determines the group to which each case belongs, in @var{var}. +@var{Var} may be either a numeric or a string variable. +@var{Group1} and @var{group2} specify the +two values of @var{var} which determine the groups of the test data. +Cases for which the @var{var} value is neither @var{group1} nor @var{group2} will be ignored. + +The value of the Mann-Whitney U statistic, the Wilcoxon W, and the significance will be printed. +The abbreviated subcommand M-W may be used in place of MANN-WHITNEY. 
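+
+As an illustrative sketch of the syntax above, the following tests whether
+@var{height} differs between two groups. The variable names @var{height}
+and @var{sex}, and the group codes 1 and 2, are hypothetical and chosen
+only for illustration:
+@example
+ NPAR TESTS
+      /MANN-WHITNEY = @var{height} BY @var{sex} (1, 2).
+@end example
+@noindent In this sketch, cases whose @var{sex} value is neither 1 nor 2
+would be ignored.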
+ +@node MCNEMAR +@subsection McNemar Test +@vindex MCNEMAR +@cindex McNemar test + +@display + [ /MCNEMAR varlist [ WITH varlist [ (PAIRED) ]]] +@end display + +Use McNemar's test to analyze the significance of the difference between +pairs of correlated proportions. If the @code{WITH} keyword is omitted, then tests for all combinations of the listed variables are performed. @@ -830,13 +1061,42 @@ If the @code{WITH} keyword is given, but the of variable preceding @code{WITH} against variable following @code{WITH} are performed. +The data in each variable must be dichotomous. If there are more +than two distinct values, an error will occur and the test will +not be run. + +@node MEDIAN +@subsection Median Test +@vindex MEDIAN +@cindex Median test + +@display + [ /MEDIAN [(value)] = varlist BY variable (value1, value2) ] +@end display + +The median test is used to test whether independent samples come from +populations with a common median. +The median of the populations against which the samples are to be tested +may be given in parentheses immediately after the +/MEDIAN subcommand. If it is not given, the median will be imputed from the +union of all the samples. + +The variables of the samples to be tested should immediately follow the @samp{=} sign. The +keyword @code{BY} must come next, and then the grouping variable. Two values +in parentheses should follow. If the first value is greater than the second, +then a 2 sample test is performed using these two values to determine the groups. +If, however, the first value is less than the second, then a @i{k} sample test is +conducted and the group values used are all values encountered which lie in the +range [@var{value1},@var{value2}]. + + @node RUNS @subsection Runs Test @vindex RUNS @cindex runs test @display - [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) varlist ] + [ /RUNS (@{MEAN, MEDIAN, MODE, value@}) = varlist ] @end display The /RUNS subcommand tests whether a data sequence is randomly ordered. 
@@ -876,6 +1136,33 @@ If the @code{WITH} keyword is given, but the of variable preceding @code{WITH} against variable following @code{WITH} are performed. +@node WILCOXON +@subsection Wilcoxon Matched Pairs Signed Ranks Test +@comment node-name, next, previous, up +@vindex WILCOXON +@cindex wilcoxon matched pairs signed ranks test + +@display + [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]] +@end display + +The /WILCOXON subcommand tests for differences between medians of the +variables listed. +The test does not make any assumptions about the variances of the samples. +It does, however, assume that the distribution is symmetrical. + +If the @code{WITH} keyword is omitted, then tests for all +combinations of the listed variables are performed. +If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword +is also given, then the number of variables preceding @code{WITH} +must be the same as the number following it. +In this case, tests for each respective pair of variables are +performed. +If the @code{WITH} keyword is given, but the +@code{(PAIRED)} keyword is omitted, then tests for each combination +of variable preceding @code{WITH} against variable following +@code{WITH} are performed. + @node T-TEST @comment node-name, next, previous, up @section T-TEST @@ -937,7 +1224,7 @@ which they would be needed. This is the default. @menu -* One Sample Mode:: Testing against a hypothesised mean +* One Sample Mode:: Testing against a hypothesized mean * Independent Samples Mode:: Testing two independent groups for equal mean * Paired Samples Mode:: Testing two interdependent groups for equal mean @end menu @@ -946,7 +1233,7 @@ which they would be needed. This is the default. @subsection One Sample Mode The @cmd{TESTVAL} subcommand invokes the One Sample mode. -This mode is used to test a population mean against a hypothesised +This mode is used to test a population mean against a hypothesized mean. 
The value given to the @cmd{TESTVAL} subcommand is the value against which you wish to test. @@ -1016,7 +1303,7 @@ ONEWAY /MISSING=@{ANALYSIS,LISTWISE@} @{EXCLUDE,INCLUDE@} /CONTRAST= value1 [, value2] ... [,valueN] /STATISTICS=@{DESCRIPTIVES,HOMOGENEITY@} - + /POSTHOC=@{BONFERRONI, GH, LSD, SCHEFFE, SIDAK, TUKEY, ALPHA ([value])@} @end display The @cmd{ONEWAY} procedure performs a one-way analysis of variance of @@ -1060,11 +1347,76 @@ A setting of EXCLUDE means that variables whose values are user-missing are to be excluded from the analysis. A setting of INCLUDE means they are to be included. The default is EXCLUDE. +Using the @code{POSTHOC} subcommand, you can perform multiple +pairwise comparisons on the data. The following comparison methods +are available: +@itemize +@item LSD +Least Significant Difference. +@item TUKEY +Tukey Honestly Significant Difference. +@item BONFERRONI +Bonferroni test. +@item SCHEFFE +Scheff@'e's test. +@item SIDAK +Sidak test. +@item GH +The Games-Howell test. +@end itemize + +@noindent +The optional syntax @code{ALPHA(@var{value})} is used to indicate +that @var{value} should be used as the +significance level at which the posthoc tests will be performed. +The default is 0.05. + +@node QUICK CLUSTER +@comment node-name, next, previous, up +@section QUICK CLUSTER +@vindex QUICK CLUSTER + +@cindex K-means clustering +@cindex clustering + +@display +QUICK CLUSTER var_list + [/CRITERIA=CLUSTERS(@var{k}) [MXITER(@var{max_iter})]] + [/MISSING=@{EXCLUDE,INCLUDE@} @{LISTWISE, PAIRWISE@}] +@end display + +The @cmd{QUICK CLUSTER} command performs k-means clustering on the +dataset. This is useful when you wish to allocate cases into clusters +of similar values and you already know the number of clusters. + +The minimum specification is @samp{QUICK CLUSTER} followed by the names +of the variables which contain the cluster data. Normally you will also +want to specify @samp{/CRITERIA=CLUSTERS(@var{k})} where @var{k} is the +number of clusters. 
If this is not given, then @var{k} defaults to 2. + +The command uses an iterative algorithm to determine the clusters for +each case. It will continue iterating until convergence, or until @var{max_iter} +iterations have been performed. The default value of @var{max_iter} is 2. + +The @cmd{MISSING} subcommand determines the handling of missing values. +If INCLUDE is set, then user-missing values are considered at their face +value and not as missing values. +If EXCLUDE is set, which is the default, user-missing +values are excluded as well as system-missing values. + +If LISTWISE is set, then the entire case is excluded from the analysis +whenever any of the clustering variables contains a missing value. +If PAIRWISE is set, then a case is considered missing only if all the +clustering variables contain missing values. Otherwise it is clustered +on the basis of the non-missing values. +The default is LISTWISE. + @node RANK @comment node-name, next, previous, up @section RANK + @vindex RANK @display RANK @@ -1140,7 +1492,7 @@ RELIABILITY @end display @cindex Cronbach's Alpha -The @cmd{RELIABILTY} command performs reliablity analysis on the data. +The @cmd{RELIABILITY} command performs reliability analysis on the data. The VARIABLES subcommand is required. It determines the set of variables upon which analysis is to be performed. @@ -1175,7 +1527,7 @@ analysis tested against the totals. @section ROC @vindex ROC -@cindex Receiver Operating Characterstic +@cindex Receiver Operating Characteristic @cindex Area under curve @display @@ -1239,5 +1591,3 @@ exclude them. Cases are excluded on a listwise basis; if any of the variables in @var{var_list} or if the variable @var{state_var} is missing, then the entire case will be excluded. - - 