X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fstatistics.texi;h=9cc49557ea2ff288b4abd32e57580bc1f68ae115;hb=47f9412c378ef8e0bcce43566f72caa3b856580b;hp=35c47eea2c9b1a5152678c4b4c15638c7c655d28;hpb=01056c26493652935e9ad48f6c5da5d84473d885;p=pspp diff --git a/doc/statistics.texi b/doc/statistics.texi index 35c47eea2c..9cc49557ea 100644 --- a/doc/statistics.texi +++ b/doc/statistics.texi @@ -8,9 +8,11 @@ far. * DESCRIPTIVES:: Descriptive statistics. * FREQUENCIES:: Frequency tables. * EXAMINE:: Testing data for normality. +* GRAPH:: Plot data. * CORRELATIONS:: Correlation tables. * CROSSTABS:: Crosstabulation tables. -* FACTOR:: Factor analysis and Principal Components analysis +* FACTOR:: Factor analysis and Principal Components analysis. +* LOGISTIC REGRESSION:: Bivariate Logistic Regression. * MEANS:: Average values and other statistics. * NPAR TESTS:: Nonparametric tests. * T-TEST:: Test hypotheses about means. @@ -72,6 +74,8 @@ names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence. In addition, Z score variable names can be specified explicitly on @subcmd{VARIABLES} in the variable list by enclosing them in parentheses after each variable. +When Z scores are calculated, @pspp{} ignores @cmd{TEMPORARY}, +treating temporary transformations as permanent. The @subcmd{STATISTICS} subcommand specifies the statistics to be displayed: @@ -191,9 +195,13 @@ For instance, @subcmd{/NTILES=4} would cause quartiles to be reported. The @subcmd{HISTOGRAM} subcommand causes the output to include a histogram for each specified numeric variable. The X axis by default ranges from the minimum to the maximum value observed in the data, but the @subcmd{MINIMUM} -and @subcmd{MAXIMUM} keywords can set an explicit range. Specify @subcmd{NORMAL} to -superimpose a normal curve on the histogram. Histograms are not -created for string variables. +and @subcmd{MAXIMUM} keywords can set an explicit range. 
The bin +width is 2 IQR(x) n^(-1/3) according to the Freedman-Diaconis rule. (Note that +@cmd{EXAMINE} uses a different algorithm to determine bin sizes.) +Histograms are not created for string variables. + +Specify @subcmd{NORMAL} to superimpose a normal curve on the +histogram. @cindex piechart The @subcmd{PIECHART} subcommand adds a pie chart for each variable to the data. Each @@ -211,7 +219,7 @@ but not currently honoured. @vindex EXAMINE @cindex Exploratory data analysis -@cindex Normality, testing for +@cindex normality, testing @display EXAMINE @@ -285,6 +293,10 @@ normal distribution, whilst the spread vs.@: level plot can be useful to visuali how the variance of differs between factors. Boxplots will also show you the outliers and extreme values. +@subcmd{HISTOGRAM} uses Sturges' rule to determine the number of +bins, as approximately 1 + log2(n). (Note that @cmd{FREQUENCIES} uses a +different algorithm to find the bin size.) + The @subcmd{SPREADLEVEL} plot displays the interquartile range versus the median. It takes an optional parameter @var{t}, which specifies how the data should be transformed prior to plotting. @@ -372,6 +384,52 @@ specified for which there are many distinct values, then @cmd{EXAMINE} will produce a very large quantity of output. +@node GRAPH +@section GRAPH + +@vindex GRAPH +@cindex Exploratory data analysis +@cindex normality, testing + +@display +GRAPH + /HISTOGRAM = @var{var} + /SCATTERPLOT [(BIVARIATE)] = @var{var1} WITH @var{var2} [BY @var{var3}] + [ /MISSING=@{LISTWISE, VARIABLE@} [@{EXCLUDE, INCLUDE@}] ] + [@{NOREPORT,REPORT@}] + +@end display + +The @cmd{GRAPH} command produces graphical plots of data. Only one of the subcommands +@subcmd{HISTOGRAM} or @subcmd{SCATTERPLOT} can be specified, i.e.@: only one plot +can be produced per call of @cmd{GRAPH}. The @subcmd{MISSING} subcommand is optional. + +@cindex scatterplot + +The subcommand @subcmd{SCATTERPLOT} produces an xy plot of the data. 
The different +values of the optional third variable @var{var3} will result in different colours and/or +markers for the plot. The following is an example of producing a scatterplot. + +@example +GRAPH + /SCATTERPLOT = @var{height} WITH @var{weight} BY @var{gender}. +@end example + +This example will produce a scatterplot where height is plotted versus weight. Depending +on the value of the gender variable, the colour of each data point differs. With +this plot it is possible to analyze gender differences in the height vs.@: weight relation. + +@cindex histogram + +The subcommand @subcmd{HISTOGRAM} produces a histogram. Only one variable is allowed for +the histogram plot. For an alternative method to produce histograms, @pxref{EXAMINE}. The +following example produces a histogram plot for the variable weight. + +@example +GRAPH + /HISTOGRAM = @var{weight}. +@end example + @node CORRELATIONS @section CORRELATIONS @@ -497,7 +555,7 @@ The @subcmd{FORMAT} subcommand controls the characteristics of the crosstabulation tables to be displayed. It has a number of possible settings: -@itemize @asis +@itemize @w{} @item @subcmd{TABLES}, the default, causes crosstabulation tables to be output. @subcmd{NOTABLES} suppresses them. @@ -596,23 +654,16 @@ some statistics are calculated only in integer mode. @subcmd{STATISTICS} subcommand is not given, no statistics are calculated. @strong{Please note:} Currently the implementation of @cmd{CROSSTABS} has the -followings bugs: +following bugs: @itemize @bullet @item -Pearson's R (but not Spearman) is off a little. -@item -T values for Spearman's R and Pearson's R are wrong. -@item -Significance of symmetric and directional measures is not calculated. -@item -Asymmetric ASEs and T values for lambda are wrong. +Significance of some symmetric and directional measures is not calculated. @item -ASE of Goodman and Kruskal's tau is not calculated. +Asymptotic standard error is not calculated for +Goodman and Kruskal's tau or symmetric Somers' d. 
@item -ASE of symmetric somers' d is wrong. -@item -Approximate T of uncertainty coefficient is wrong. +Approximate T is not calculated for the symmetric uncertainty coefficient. @end itemize Fixes for any of these deficiencies would be welcomed. @@ -702,13 +753,23 @@ performed, and all coefficients will be printed. The @subcmd{/CRITERIA} subcommand is used to specify how the number of extracted factors (components) are chosen. If @subcmd{FACTORS(@var{n})} is specified, where @var{n} is an integer, then @var{n} factors will be extracted. Otherwise, the @subcmd{MINEIGEN} setting will -be used. @subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted. -The default value of @var{l} is 1. The @subcmd{ECONVERGE} and @subcmd{ITERATE} settings have effect only when iterative algorithms for factor -extraction (such as Principal Axis Factoring) are used. @subcmd{ECONVERGE(@var{delta})} specifies that +be used. +@subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted. +The default value of @var{l} is 1. +The @subcmd{ECONVERGE} setting has effect only when iterative algorithms for factor +extraction (such as Principal Axis Factoring) are used. +@subcmd{ECONVERGE(@var{delta})} specifies that iteration should cease when the maximum absolute value of the communality estimate between one iteration and the previous is less than @var{delta}. The default value of @var{delta} is 0.001. -The @subcmd{ITERATE(@var{m})} setting sets the maximum number of iterations to @var{m}. The default value of @var{m} is 25. +The @subcmd{ITERATE(@var{m})} setting may appear any number of times and is used for two different purposes. +It sets the maximum number of iterations (@var{m}) for convergence and also the maximum number of iterations +for rotation. 
+Whether it affects convergence or rotation depends upon which subcommand follows the @subcmd{ITERATE} subcommand. +If @subcmd{EXTRACTION} follows, it affects convergence. +If @subcmd{ROTATION} follows, it affects rotation. +If neither @subcmd{ROTATION} nor @subcmd{EXTRACTION} follows an @subcmd{ITERATE} subcommand, it is ignored. +The default value of @var{m} is 25. The @cmd{MISSING} subcommand determines the handling of missing variables. If @subcmd{INCLUDE} is set, then user-missing values are included in the @@ -723,6 +784,92 @@ If @subcmd{PAIRWISE} is set, then a case is considered missing only if either of values for the particular coefficient are missing. The default is @subcmd{LISTWISE}. +@node LOGISTIC REGRESSION +@section LOGISTIC REGRESSION + +@vindex LOGISTIC REGRESSION +@cindex logistic regression +@cindex bivariate logistic regression + +@display +LOGISTIC REGRESSION [VARIABLES =] @var{dependent_var} WITH @var{predictors} + + [/CATEGORICAL = @var{categorical_predictors}] + + [@{/NOCONST | /ORIGIN | /NOORIGIN @}] + + [/PRINT = [SUMMARY] [DEFAULT] [CI(@var{confidence})] [ALL]] + + [/CRITERIA = [BCON(@var{min_delta})] [ITERATE(@var{max_iterations})] + [LCON(@var{min_likelihood_delta})] [EPS(@var{min_epsilon})] + [CUT(@var{cut_point})]] + + [/MISSING = @{INCLUDE|EXCLUDE@}] +@end display + +Bivariate Logistic Regression is used when you want to explain a dichotomous dependent +variable in terms of one or more predictor variables. + +The minimum command is +@example +LOGISTIC REGRESSION @var{y} WITH @var{x1} @var{x2} @dots{} @var{xn}. +@end example +Here, @var{y} is the dependent variable, which must be dichotomous, and @var{x1} @dots{} @var{xn} +are the predictor variables whose coefficients the procedure estimates. + +By default, a constant term is included in the model. 
+Hence, the full model is +@math{ +{\bf y} += b_0 + b_1 {\bf x_1} ++ b_2 {\bf x_2} ++ \dots ++ b_n {\bf x_n} +} + +Predictor variables which are categorical in nature should be listed on the @subcmd{/CATEGORICAL} subcommand. +Simple variables as well as interactions between variables may be listed here. + +If you want a model without the constant term @math{b_0}, use the keyword @subcmd{/ORIGIN}. +@subcmd{/NOCONST} is a synonym for @subcmd{/ORIGIN}. + +An iterative Newton-Raphson procedure is used to fit the model. +The @subcmd{/CRITERIA} subcommand is used to specify the stopping criteria of the procedure, +and other parameters. +The value of @var{cut_point} is used in the classification table. It is the +threshold above which predicted values are considered to be 1. Values +of @var{cut_point} must lie in the range [0,1]. +During iterations, if any one of the stopping criteria is satisfied, the procedure is +considered complete. +The stopping criteria are: +@itemize +@item The number of iterations exceeds @var{max_iterations}. + The default value of @var{max_iterations} is 20. +@item The change in all coefficient estimates is less than @var{min_delta}. +The default value of @var{min_delta} is 0.001. +@item The magnitude of change in the likelihood estimate is less than @var{min_likelihood_delta}. +The default value of @var{min_likelihood_delta} is zero. +This means that this criterion is disabled. +@item The differential of the estimated probability for all cases is less than @var{min_epsilon}. +In other words, the probabilities are close to zero or one. +The default value of @var{min_epsilon} is 0.00000001. +@end itemize + + +The @subcmd{PRINT} subcommand controls the display of optional statistics. +Currently there is one such option, @subcmd{CI}, which indicates that the +confidence interval of the odds ratio should be displayed as well as its value. 
+@subcmd{CI} should be followed by an integer in parentheses, to indicate the +confidence level of the desired confidence interval. + +The @subcmd{MISSING} subcommand determines the handling of missing +values. +If @subcmd{INCLUDE} is set, then user-missing values are included in the +calculations, but system-missing values are not. +If @subcmd{EXCLUDE} is set, user-missing +values are excluded as well as system-missing values. +This is the default. + @node MEANS @section MEANS