X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fstatistics.texi;h=cdc5fdf6706dee358a119e252273d0ce22caf4f3;hb=12b00817442d691881d3565ebe51835cb6b11758;hp=8fa93b1566a38fe304836d8d473fcc2dde7189b3;hpb=0527ca8a9d13e99d51b579b27bbfde7d3e11cd94;p=pspp-builds.git diff --git a/doc/statistics.texi b/doc/statistics.texi index 8fa93b15..cdc5fdf6 100644 --- a/doc/statistics.texi +++ b/doc/statistics.texi @@ -8,12 +8,16 @@ far. * DESCRIPTIVES:: Descriptive statistics. * FREQUENCIES:: Frequency tables. * EXAMINE:: Testing data for normality. +* CORRELATIONS:: Correlation tables. * CROSSTABS:: Crosstabulation tables. +* FACTOR:: Factor analysis and Principal Components analysis * NPAR TESTS:: Nonparametric tests. * T-TEST:: Test hypotheses about means. * ONEWAY:: One way analysis of variance. * RANK:: Compute rank scores. * REGRESSION:: Linear regression. +* RELIABILITY:: Reliability analysis. +* ROC:: Receiver Operating Characteristic. @end menu @node DESCRIPTIVES @@ -204,12 +208,13 @@ boundaries of the data set divided into the specified number of ranges. For instance, @code{/NTILES=4} would cause quartiles to be reported. The HISTOGRAM subcommand causes the output to include a histogram for -each specified variable. The X axis by default ranges from the +each specified numeric variable. The X axis by default ranges from the minimum to the maximum value observed in the data, but the MINIMUM and MAXIMUM keywords can set an explicit range. The Y axis by default is labeled in frequencies; use the PERCENT keyword to causes it to be labeled in percent of the total observed count. Specify NORMAL to superimpose a normal curve on the histogram. +Histograms are not created for string variables. The PIECHART adds a pie chart for each variable to the data. 
Each slice represents one value, with the size of the slice proportional to @@ -232,7 +237,7 @@ EXAMINE /PLOT=@{BOXPLOT, NPPLOT, HISTOGRAM, ALL, NONE@} /CINTERVAL n /COMPARE=@{GROUPS,VARIABLES@} - /ID=@{case_number, var_name@} + /ID=var_name /@{TOTAL,NOTOTAL@} /PERCENTILE=[value_list]=@{HAVERAGE, WAVERAGE, ROUND, AEMPIRICAL, EMPIRICAL @} /MISSING=@{LISTWISE, PAIRWISE@} [@{EXCLUDE, INCLUDE@}] @@ -271,6 +276,12 @@ If /COMPARE=VARIABLES is specified, then one plot per factor is produced, each each containing one boxplot per dependent variable. If the /COMPARE subcommand is ommitted, then PSPP uses the default value of /COMPARE=GROUPS. + +The ID subcommand also pertains to boxplots. If given, it must +specify a variable name. Outliers and extreme cases plotted in +boxplots will be labelled with that variable's value for the case. Numeric or +string variables are permissible. If the ID subcommand is not given, +then the case number will be used for labelling. The CINTERVAL subcommand specifies the confidence interval to use in calculation of the descriptives command. The default it 95%. @@ -292,6 +303,69 @@ If many dependent variable are given, or factors are given for which there are many distinct values, then @cmd{EXAMINE} will produce a very large quantity of output. +@node CORRELATIONS +@section CORRELATIONS + +@vindex CORRELATIONS +@display +CORRELATIONS + /VARIABLES = varlist [ WITH varlist ] + [ + . + . + . + /VARIABLES = varlist [ WITH varlist ] + /VARIABLES = varlist [ WITH varlist ] + ] + + [ /PRINT=@{TWOTAIL, ONETAIL@} @{SIG, NOSIG@} ] + [ /STATISTICS=DESCRIPTIVES XPROD ALL] + [ /MISSING=@{PAIRWISE, LISTWISE@} @{INCLUDE, EXCLUDE@} ] +@end display + +@cindex correlation +The @cmd{CORRELATIONS} procedure produces tables of the Pearson correlation coefficient +for a set of variables. The significance of the coefficients is also given. + +At least one VARIABLES subcommand is required. If the WITH keyword is used, then a non-square +correlation table will be produced.
+The variables preceding WITH will be used as the rows of the table, and the variables following +will be the columns of the table. +If no WITH keyword is given, then a square, symmetrical table using all variables is produced. + + +The @cmd{MISSING} subcommand determines the handling of missing values. +If INCLUDE is set, then user-missing values are included in the +calculations, but system-missing values are not. +If EXCLUDE is set, user-missing +values are excluded as well as system-missing values. +This is the default. + +If LISTWISE is set, then the entire case is excluded from analysis +whenever any variable specified in any @cmd{/VARIABLES} subcommand +contains a missing value. +If PAIRWISE is set, then a case is considered missing only if either of the +values for the particular coefficient is missing. +The default is PAIRWISE. + +The PRINT subcommand is used to control how the reported significance values are printed. +If the TWOTAIL option is used, then a two-tailed test of significance is +printed. If the ONETAIL option is given, then a one-tailed test is used. +The default is TWOTAIL. + +If the NOSIG option is specified, then correlation coefficients with significance less than +0.05 are highlighted. +If SIG is specified, then no highlighting is performed. This is the default. + +@cindex covariance +The STATISTICS subcommand requests additional statistics to be displayed. The keyword +DESCRIPTIVES requests that the mean, number of non-missing cases, and the unbiased +estimator of the standard deviation be displayed. +These statistics will be displayed in a separate table, for all the variables listed +in any /VARIABLES subcommand. +The XPROD keyword requests cross-product deviations and covariance estimators to +be displayed for each pair of variables. +The keyword ALL is the union of DESCRIPTIVES and XPROD.
@node CROSSTABS @section CROSSTABS @@ -340,9 +414,7 @@ is present, the VARIABLES subcommand must precede the TABLES subcommand. In general mode, numeric and string variables may be specified on -TABLES. Although long string variables are allowed, only their -initial short-string parts are used. In integer mode, only numeric -variables are allowed. +TABLES. In integer mode, only numeric variables are allowed. The MISSING subcommand determines the handling of user-missing values. When set to TABLE, the default, missing values are dropped on a table by @@ -482,6 +554,99 @@ Approximate T of uncertainty coefficient is wrong. Fixes for any of these deficiencies would be welcomed. +@node FACTOR +@section FACTOR + +@vindex FACTOR +@cindex factor analysis +@cindex principal components analysis +@cindex principal axis factoring +@cindex data reduction + +@display +FACTOR VARIABLES=var_list + + [ /METHOD = @{CORRELATION, COVARIANCE@} ] + + [ /EXTRACTION=@{PC, PAF@}] + + [ /PRINT=[INITIAL] [EXTRACTION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [SIG] [ALL] [DEFAULT] ] + + [ /PLOT=[EIGEN] ] + + [ /FORMAT=[SORT] [BLANK(@var{n})] [DEFAULT] ] + + [ /CRITERIA=[FACTORS(@var{n})] [MINEIGEN(@var{l})] [ITERATE(@var{m})] [ECONVERGE (@var{delta})] [DEFAULT] ] + + [ /MISSING=[@{LISTWISE, PAIRWISE@}] [@{INCLUDE, EXCLUDE@}] ] +@end display + +The FACTOR command performs Factor Analysis or Principal Axis Factoring on a dataset. It may be used to find +common factors in the data or for data reduction purposes. + +The VARIABLES subcommand is required. It lists the variables which are to partake in the analysis. + +The /EXTRACTION subcommand is used to specify the way in which factors (components) are extracted from the data. +If PC is specified, then Principal Components Analysis is used. If PAF is specified, then Principal Axis Factoring is +used. By default Principal Components Analysis will be used. 
+ +The /METHOD subcommand should be used to determine whether the covariance matrix or the correlation matrix of the data is +to be analysed. By default, the correlation matrix is analysed. + +The /PRINT subcommand may be used to select which features of the analysis are reported: + +@itemize +@item UNIVARIATE + A table of mean values, standard deviations and total weights is printed. +@item INITIAL + Initial communalities and eigenvalues are printed. +@item EXTRACTION + Extracted communalities and eigenvalues are printed. +@item CORRELATION + The correlation matrix is printed. +@item COVARIANCE + The covariance matrix is printed. +@item DET + The determinant of the correlation or covariance matrix is printed. +@item SIG + The significance of the elements of the correlation matrix is printed. +@item ALL + All of the above are printed. +@item DEFAULT + Identical to INITIAL and EXTRACTION. +@end itemize + +If /PLOT=EIGEN is given, then a ``Scree'' plot of the eigenvalues will be printed. This can be useful for visualising +which factors (components) should be retained. + +The /FORMAT subcommand determines how data are to be displayed in loading matrices. If SORT is specified, then the variables +are sorted in descending order of significance. If BLANK(@var{n}) is specified, then coefficients whose absolute value is less +than @var{n} will not be printed. If the keyword DEFAULT is given, or if no /FORMAT subcommand is given, then no sorting is +performed, and all coefficients will be printed. + +The /CRITERIA subcommand is used to specify how the number of extracted factors (components) is chosen. If FACTORS(@var{n}) is +specified, where @var{n} is an integer, then @var{n} factors will be extracted. Otherwise, the MINEIGEN setting will +be used. MINEIGEN(@var{l}) requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted. +The default value of @var{l} is 1.
The ECONVERGE and ITERATE settings have effect only when iterative algorithms for factor +extraction (such as Principal Axis Factoring) are used. ECONVERGE(@var{delta}) specifies that iteration should cease when +the maximum absolute change in the communality estimate between one iteration and the previous is less than @var{delta}. The +default value of @var{delta} is 0.001. +The ITERATE(@var{m}) setting sets the maximum number of iterations to @var{m}. The default value of @var{m} is 25. + +The @cmd{MISSING} subcommand determines the handling of missing values. +If INCLUDE is set, then user-missing values are included in the +calculations, but system-missing values are not. +If EXCLUDE is set, user-missing +values are excluded as well as system-missing values. +This is the default. +If LISTWISE is set, then the entire case is excluded from analysis +whenever any variable specified in the @cmd{VARIABLES} subcommand +contains a missing value. +If PAIRWISE is set, then a case is considered missing only if either of the +values for the particular coefficient is missing. +The default is LISTWISE. + + @node NPAR TESTS @section NPAR TESTS @@ -499,6 +664,8 @@ NPAR TESTS [ /STATISTICS=@{DESCRIPTIVES@} ] [ /MISSING=@{ANALYSIS, LISTWISE@} @{INCLUDE, EXCLUDE@} ] + + [ /METHOD=EXACT [ TIMER [(n)] ] ] @end display NPAR TESTS performs nonparametric tests. @@ -508,10 +675,22 @@ One or more tests may be specified by using the corresponding subcommand. If the /STATISTICS subcommand is also specified, then summary statistics are produces for each variable that is the subject of any test. +Certain tests may take a long time to execute if an exact figure is required. +Therefore, by default asymptotic approximations are used unless the +subcommand /METHOD=EXACT is specified. +Exact tests give more accurate results, but may take an unacceptably long +time to perform.
If the TIMER keyword is used, it sets a maximum time, +after which the test will be abandoned, and a warning message printed. +The time, in minutes, should be specified in parentheses after the TIMER keyword. +If the TIMER keyword is given without this figure, then a default value of 5 minutes +is used. + @menu * BINOMIAL:: Binomial Test * CHISQUARE:: Chisquare Test +* WILCOXON:: Wilcoxon Signed Ranks Test +* SIGN:: The Sign Test @end menu @@ -524,7 +703,7 @@ produces for each variable that is the subject of any test. [ /BINOMIAL[(p)]=var_list[(value1[, value2)] ] ] @end display -The binomial test compares the observed distribution of a dichotomous +The /BINOMIAL subcommand compares the observed distribution of a dichotomous variable with that of a binomial distribution. The variable @var{p} specifies the test proportion of the binomial distribution. @@ -564,7 +743,7 @@ even for very large sample sizes. @node CHISQUARE -@subsection Chisquare test +@subsection Chisquare Test @vindex CHISQUARE @cindex chisquare test @@ -574,7 +753,7 @@ even for very large sample sizes. @end display -The chisquare test produces a chi-square statistic for the differences +The /CHISQUARE subcommand produces a chi-square statistic for the differences between the expected and observed frequencies of the categories of a variable. Optionally, a range of values may appear after the variable list. If a range is given, then non integer values are truncated, and values @@ -591,6 +770,59 @@ sum of the frequencies need not be 1. If no /EXPECTED subcommand is given, then then equal frequencies are expected. +@node WILCOXON +@subsection Wilcoxon Matched Pairs Signed Ranks Test +@comment node-name, next, previous, up +@vindex WILCOXON +@cindex wilcoxon matched pairs signed ranks test + +@display + [ /WILCOXON varlist [ WITH varlist [ (PAIRED) ]]] +@end display + +The /WILCOXON subcommand tests for differences between medians of the +variables listed. 
+The test does not make any assumptions about the variances of the samples. +It does, however, assume that the distribution is symmetrical. + +If the @code{WITH} keyword is omitted, then tests for all +combinations of the listed variables are performed. +If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword +is also given, then the number of variables preceding @code{WITH} +must be the same as the number following it. +In this case, tests for each respective pair of variables are +performed. +If the @code{WITH} keyword is given, but the +@code{(PAIRED)} keyword is omitted, then tests for each combination +of a variable preceding @code{WITH} against a variable following +@code{WITH} are performed. + + +@node SIGN +@subsection Sign Test +@vindex SIGN +@cindex sign test + +@display + [ /SIGN varlist [ WITH varlist [ (PAIRED) ]]] +@end display + +The /SIGN subcommand tests for differences between medians of the +variables listed. +The test does not make any assumptions about the +distribution of the data. + +If the @code{WITH} keyword is omitted, then tests for all +combinations of the listed variables are performed. +If the @code{WITH} keyword is given, and the @code{(PAIRED)} keyword +is also given, then the number of variables preceding @code{WITH} +must be the same as the number following it. +In this case, tests for each respective pair of variables are +performed. +If the @code{WITH} keyword is given, but the +@code{(PAIRED)} keyword is omitted, then tests for each combination +of a variable preceding @code{WITH} against a variable following +@code{WITH} are performed. @node T-TEST @comment node-name, next, previous, up @@ -766,7 +998,6 @@ If the total sum of the coefficients are not zero, then PSPP will display a warning, but will proceed with the analysis. The @code{CONTRAST} subcommand may be given up to 10 times in order to specify different contrast tests.
-@setfilename ignored @node RANK @comment node-name, next, previous, up @@ -831,3 +1062,120 @@ user-missing are to be excluded from the rank scores. A setting of INCLUDE means they are to be included. The default is EXCLUDE. @include regression.texi + + +@node RELIABILITY +@section RELIABILITY + +@vindex RELIABILITY +@display +RELIABILITY + /VARIABLES=var_list + /SCALE (@var{name}) = @{var_list, ALL@} + /MODEL=@{ALPHA, SPLIT[(N)]@} + /SUMMARY=@{TOTAL,ALL@} + /MISSING=@{EXCLUDE,INCLUDE@} +@end display + +@cindex Cronbach's Alpha +The @cmd{RELIABILITY} command performs reliability analysis on the data. + +The VARIABLES subcommand is required. It determines the set of variables +upon which analysis is to be performed. + +The SCALE subcommand determines the variables for which reliability is to be +calculated. If it is omitted, then all variables named +in the VARIABLES subcommand will be analysed. +Optionally, the @var{name} parameter may be specified to set a string name +for the scale. + +The MODEL subcommand determines the type of analysis. If ALPHA is specified, +then Cronbach's Alpha is calculated for the scale. If the model is SPLIT, +then the variables are divided into 2 subsets. An optional parameter +@var{N} may be given, to specify how many variables are to be in the first subset. +If @var{N} is omitted, then it defaults to one half of the variables in the +scale, or one half minus one if there is an odd number of variables. +The default model is ALPHA. + +By default, any cases with user-missing or system-missing values for +any variables given +in the VARIABLES subcommand will be omitted from analysis. +The MISSING subcommand determines whether user-missing values are to +be included or excluded in the analysis. + +The SUMMARY subcommand determines the type of summary analysis to be performed. +Currently there is only one type: SUMMARY=TOTAL, which displays per-item +analysis tested against the totals.
+ + + +@node ROC +@section ROC + +@vindex ROC +@cindex Receiver Operating Characteristic +@cindex Area under curve + +@display +ROC @var{var_list} BY @var{state_var} (@var{state_value}) + /PLOT = @{ CURVE [(REFERENCE)], NONE @} + /PRINT = [ SE ] [ COORDINATES ] + /CRITERIA = [ CUTOFF(@{INCLUDE,EXCLUDE@}) ] + [ TESTPOS (@{LARGE,SMALL@}) ] + [ CI (@var{confidence}) ] + [ DISTRIBUTION (@{FREE, NEGEXPO @}) ] + /MISSING=@{EXCLUDE,INCLUDE@} +@end display + + +The @cmd{ROC} command is used to plot the receiver operating characteristic curve +of a dataset, and to estimate the area under the curve. +This is useful for analysing the efficacy of a variable as a predictor of a state of nature. + +The mandatory @var{var_list} is the list of predictor variables. +The variable @var{state_var} is the variable whose values represent the actual states, +and @var{state_value} is the value of this variable which represents the positive state. + +The optional subcommand PLOT is used to determine if and how the ROC curve is drawn. +The keyword CURVE means that the ROC curve should be drawn, and the optional keyword REFERENCE, +which should be enclosed in parentheses, says that the diagonal reference line should be drawn. +If the keyword NONE is given, then no ROC curve is drawn. +By default, the curve is drawn with no reference line. + +The optional subcommand PRINT determines which additional tables should be printed. +Two additional tables are available. +The SE keyword says that the standard error of the area under the curve should be printed as well as +the area itself. +In addition, a p-value under the null hypothesis that the area under the curve equals 0.5 will be +printed. +The COORDINATES keyword says that a table of coordinates of the ROC curve should be printed. + +The CRITERIA subcommand has four optional parameters: +@itemize @bullet +@item The TESTPOS parameter may be LARGE or SMALL.
+LARGE is the default, and says that larger values in the predictor variables are to be +considered positive. SMALL indicates that smaller values should be considered positive. + +@item The CI parameter specifies the confidence interval that should be printed. +It has no effect if the SE keyword in the PRINT subcommand has not been given. + +@item The DISTRIBUTION parameter determines the method to be used when estimating the area +under the curve. +There are two possibilities, @i{viz}: FREE and NEGEXPO. +The FREE method uses a non-parametric estimate, and the NEGEXPO method a bi-negative +exponential distribution estimate. +The NEGEXPO method should only be used when the number of positive actual states is +equal to the number of negative actual states. +The default is FREE. + +@item The CUTOFF parameter is for compatibility and is ignored. +@end itemize + +The MISSING subcommand determines whether user-missing values are to +be included or excluded in the analysis. The default behaviour is to +exclude them. +Cases are excluded on a listwise basis; if any of the variables in @var{var_list} +or the variable @var{state_var} is missing, then the entire case will be +excluded. + +
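Note on the ROC section added by this patch: the text says only that the FREE setting of DISTRIBUTION uses a non-parametric estimate of the area under the curve. As a rough illustration, and not PSPP's own code, the Python sketch below computes the standard rank-based estimate: the proportion of (positive, negative) case pairs in which the positive case has the more "positive" predictor value, with ties counted as one half. The function name auc_free and its parameters are hypothetical; testpos_large mirrors the TESTPOS (LARGE/SMALL) setting, and state_value mirrors the value given after BY.

```python
def auc_free(scores, states, state_value, testpos_large=True):
    """Rank-based (non-parametric) estimate of the area under a ROC curve.

    Hypothetical helper for illustration only; PSPP's internal
    computation may differ in detail.
    """
    # Split predictor values by actual state.
    pos = [s for s, st in zip(scores, states) if st == state_value]
    neg = [s for s, st in zip(scores, states) if st != state_value]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p == n:
                wins += 0.5          # a tie counts as half a win
            elif (p > n) == testpos_large:
                wins += 1.0          # positive case ranks as more "positive"
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]
    states = [0, 0, 1, 1]
    print(auc_free(scores, states, 1))  # 0.75
```

In the example, three of the four (positive, negative) pairs are ordered correctly, so the estimate is 0.75. An estimate of 0.5 corresponds to the diagonal reference line, which is the null hypothesis mentioned under the SE keyword.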