X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fstatistics.texi;h=63e8f35aef9ba37f540661aee7018c3b6a9fadb4;hb=f4933a557797e27e8c51f1ad1e8aed6aafe3518f;hp=5430c6bc6958e7e22b2a6b9cff98cd65ccd48b22;hpb=2402013898889076c67683f1582f90448f4fa6ef;p=pspp-builds.git

diff --git a/doc/statistics.texi b/doc/statistics.texi
index 5430c6bc..63e8f35a 100644
--- a/doc/statistics.texi
+++ b/doc/statistics.texi
@@ -8,6 +8,7 @@ far.
 * DESCRIPTIVES:: Descriptive statistics.
 * FREQUENCIES:: Frequency tables.
 * EXAMINE:: Testing data for normality.
+* CORRELATIONS:: Correlation tables.
 * CROSSTABS:: Crosstabulation tables.
 * NPAR TESTS:: Nonparametric tests.
 * T-TEST:: Test hypotheses about means.
@@ -15,6 +16,7 @@ far.
 * RANK:: Compute rank scores.
 * REGRESSION:: Linear regression.
 * RELIABILITY:: Reliability analysis.
+* ROC:: Receiver Operating Characteristic.
 @end menu
 
 @node DESCRIPTIVES
@@ -300,6 +302,69 @@ If many dependent variable are given, or factors are given for which
 there are many distinct values, then @cmd{EXAMINE} will produce a very
 large quantity of output.
 
+@node CORRELATIONS
+@section CORRELATIONS
+
+@vindex CORRELATIONS
+@display
+CORRELATIONS
+     /VARIABLES = varlist [ WITH varlist ]
+     [
+      .
+      .
+      .
+     /VARIABLES = varlist [ WITH varlist ]
+     /VARIABLES = varlist [ WITH varlist ]
+     ]
+
+     [ /PRINT=@{TWOTAIL, ONETAIL@} @{SIG, NOSIG@} ]
+     [ /STATISTICS=DESCRIPTIVES XPROD ALL]
+     [ /MISSING=@{PAIRWISE, LISTWISE@} @{INCLUDE, EXCLUDE@} ]
+@end display
+
+@cindex correlation
+The @cmd{CORRELATIONS} procedure produces tables of the Pearson correlation coefficient
+for a set of variables. The significance of the coefficients is also given.
+
+At least one VARIABLES subcommand is required. If the WITH keyword is used, then a non-square
+correlation table will be produced.
+The variables preceding WITH will be used as the rows of the table, and the variables following
+will be the columns of the table.
+If no WITH keyword is given, then a square, symmetrical table using all variables is produced.
+
+
+The @cmd{MISSING} subcommand determines the handling of missing values.
+If INCLUDE is set, then user-missing values are included in the
+calculations, but system-missing values are not.
+If EXCLUDE is set, then user-missing
+values are excluded as well as system-missing values.
+This is the default.
+
+If LISTWISE is set, then the entire case is excluded from analysis
+whenever any variable specified in any @cmd{/VARIABLES} subcommand
+contains a missing value.
+If PAIRWISE is set, then a case is considered missing only if either of the
+values for the particular coefficient is missing.
+The default is PAIRWISE.
+
+The PRINT subcommand is used to control how the reported significance values are printed.
+If the TWOTAIL option is used, then a two-tailed test of significance is
+printed. If the ONETAIL option is given, then a one-tailed test is used.
+The default is TWOTAIL.
+
+If the NOSIG option is specified, then correlation coefficients with significance less than
+0.05 are highlighted.
+If SIG is specified, then no highlighting is performed. This is the default.
+
+@cindex covariance
+The STATISTICS subcommand requests additional statistics to be displayed. The keyword
+DESCRIPTIVES requests that the mean, number of non-missing cases, and the non-biased
+estimator of the standard deviation are displayed.
+These statistics will be displayed in a separate table, for all the variables listed
+in any /VARIABLES subcommand.
+The XPROD keyword requests cross-product deviations and covariance estimators to
+be displayed for each pair of variables.
+The keyword ALL is the union of DESCRIPTIVES and XPROD.
 @node CROSSTABS
 @section CROSSTABS
 
@@ -348,9 +413,7 @@ is present, the VARIABLES subcommand must precede the TABLES
 subcommand.
 
 In general mode, numeric and string variables may be specified on
-TABLES. Although long string variables are allowed, only their
-initial short-string parts are used. In integer mode, only numeric
-variables are allowed.
+TABLES. In integer mode, only numeric variables are allowed.
 
 The MISSING subcommand determines the handling of user-missing values.
 When set to TABLE, the default, missing values are dropped on a table by
@@ -952,3 +1015,73 @@ analysis tested against the totals.
 
 
 
+@node ROC
+@section ROC
+
+@vindex ROC
+@cindex Receiver Operating Characteristic
+@cindex Area under curve
+
+@display
+ROC @var{var_list} BY @var{state_var} (@var{state_value})
+     /PLOT = @{ CURVE [(REFERENCE)], NONE @}
+     /PRINT = [ SE ] [ COORDINATES ]
+     /CRITERIA = [ CUTOFF(@{INCLUDE,EXCLUDE@}) ]
+          [ TESTPOS (@{LARGE,SMALL@}) ]
+          [ CI (@var{confidence}) ]
+          [ DISTRIBUTION (@{FREE, NEGEXPO @}) ]
+     /MISSING=@{EXCLUDE,INCLUDE@}
+@end display
+
+
+The @cmd{ROC} command is used to plot the receiver operating characteristic curve
+of a dataset, and to estimate the area under the curve.
+This is useful for analysing the efficacy of a variable as a predictor of a state of nature.
+
+The mandatory @var{var_list} is the list of predictor variables.
+The variable @var{state_var} is the variable whose values represent the actual states,
+and @var{state_value} is the value of this variable which represents the positive state.
+
+The optional subcommand PLOT is used to determine if and how the ROC curve is drawn.
+The keyword CURVE means that the ROC curve should be drawn, and the optional keyword REFERENCE,
+which should be enclosed in parentheses, says that the diagonal reference line should be drawn.
+If the keyword NONE is given, then no ROC curve is drawn.
+By default, the curve is drawn with no reference line.
+
+The optional subcommand PRINT determines which additional tables should be printed.
+Two additional tables are available.
+The SE keyword says that the standard error of the area under the curve should be printed as well as
+the area itself.
+In addition, a p-value under the null hypothesis that the area under the curve equals 0.5 will be
+printed.
+The COORDINATES keyword says that a table of coordinates of the ROC curve should be printed.
+
+The CRITERIA subcommand has four optional parameters:
+@itemize @bullet
+@item The TESTPOS parameter may be LARGE or SMALL.
+LARGE is the default, and says that larger values in the predictor variables are to be
+considered positive. SMALL indicates that smaller values should be considered positive.
+
+@item The CI parameter specifies the confidence interval that should be printed.
+It has no effect if the SE keyword in the PRINT subcommand has not been given.
+
+@item The DISTRIBUTION parameter determines the method to be used when estimating the area
+under the curve.
+There are two possibilities, @i{viz}: FREE and NEGEXPO.
+The FREE method uses a non-parametric estimate, and the NEGEXPO method a bi-negative
+exponential distribution estimate.
+The NEGEXPO method should only be used when the number of positive actual states is
+equal to the number of negative actual states.
+The default is FREE.
+
+@item The CUTOFF parameter is for compatibility and is ignored.
+@end itemize
+
+The MISSING subcommand determines whether user-missing values are to
+be included or excluded in the analysis. The default behaviour is to
+exclude them.
+Cases are excluded on a listwise basis; if any of the variables in @var{var_list}
+or if the variable @var{state_var} is missing, then the entire case will be
+excluded.
+
+