X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fstatistics.texi;h=9cc49557ea2ff288b4abd32e57580bc1f68ae115;hb=47f9412c378ef8e0bcce43566f72caa3b856580b;hp=35c47eea2c9b1a5152678c4b4c15638c7c655d28;hpb=01056c26493652935e9ad48f6c5da5d84473d885;p=pspp

diff --git a/doc/statistics.texi b/doc/statistics.texi
index 35c47eea2c..9cc49557ea 100644
--- a/doc/statistics.texi
+++ b/doc/statistics.texi
@@ -8,9 +8,11 @@ far.
 * DESCRIPTIVES::                Descriptive statistics.
 * FREQUENCIES::                 Frequency tables.
 * EXAMINE::                     Testing data for normality.
+* GRAPH::                       Plot data.
 * CORRELATIONS::                Correlation tables.
 * CROSSTABS::                   Crosstabulation tables.
-* FACTOR::                      Factor analysis and Principal Components analysis
+* FACTOR::                      Factor analysis and Principal Components analysis.
+* LOGISTIC REGRESSION::         Bivariate Logistic Regression.
 * MEANS::                       Average values and other statistics.
 * NPAR TESTS::                  Nonparametric tests.
 * T-TEST::                      Test hypotheses about means.
@@ -72,6 +74,8 @@ names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through
 ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence.  In addition, Z score
 variable names can be specified explicitly on @subcmd{VARIABLES} in the variable
 list by enclosing them in parentheses after each variable.
+When Z scores are calculated, @pspp{} ignores @cmd{TEMPORARY},
+treating temporary transformations as permanent.
 
 The @subcmd{STATISTICS} subcommand specifies the statistics to be displayed:
 
@@ -191,9 +195,13 @@ For instance, @subcmd{/NTILES=4} would cause quartiles to be reported.
 The @subcmd{HISTOGRAM} subcommand causes the output to include a histogram for
 each specified numeric variable.  The X axis by default ranges from
 the minimum to the maximum value observed in the data, but the @subcmd{MINIMUM}
-and @subcmd{MAXIMUM} keywords can set an explicit range.  Specify @subcmd{NORMAL} to
-superimpose a normal curve on the histogram.  Histograms are not
-created for string variables.
+and @subcmd{MAXIMUM} keywords can set an explicit range.  The bin
+width is 2 IQR(x) n^(-1/3), according to the Freedman-Diaconis rule.  (Note that
+@cmd{EXAMINE} uses a different algorithm to determine bin sizes.)
+Histograms are not created for string variables.
+
+Specify @subcmd{NORMAL} to superimpose a normal curve on the
+histogram.
 
 @cindex piechart
 The @subcmd{PIECHART} subcommand adds a pie chart for each variable to the data.  Each
@@ -211,7 +219,7 @@ but not currently honoured.
 
 @vindex EXAMINE
 @cindex Exploratory data analysis
-@cindex Normality, testing for
+@cindex normality, testing
 
 @display
 EXAMINE
@@ -285,6 +293,10 @@ normal distribution, whilst the spread vs.@: level plot can be useful to visuali
 how the variance of differs between factors.
 Boxplots will also show you the outliers and extreme values.
 
+@subcmd{HISTOGRAM} uses Sturges' rule to determine the number of
+bins, as approximately 1 + log2(n).  (Note that @cmd{FREQUENCIES} uses a
+different algorithm to find the bin size.)
+
 The @subcmd{SPREADLEVEL} plot displays the interquartile range versus the 
 median.  It takes an optional parameter @var{t}, which specifies how the data
 should be transformed prior to plotting.
@@ -372,6 +384,52 @@ specified for which
 there are many distinct values, then @cmd{EXAMINE} will produce a very
 large quantity of output.
 
+@node GRAPH
+@section GRAPH
+
+@vindex GRAPH
+@cindex Exploratory data analysis
+@cindex normality, testing
+
+@display
+GRAPH
+        /HISTOGRAM = @var{var}
+        /SCATTERPLOT [(BIVARIATE)] = @var{var1} WITH @var{var2} [BY @var{var3}] 
+        [ /MISSING=@{LISTWISE, VARIABLE@} [@{EXCLUDE, INCLUDE@}] ] 
+		[@{NOREPORT,REPORT@}]
+
+@end display
+
+The @cmd{GRAPH} command produces graphical plots of data.  Only one of the subcommands
+@subcmd{HISTOGRAM} or @subcmd{SCATTERPLOT} can be specified, i.e.@: only one plot
+can be produced per call of @cmd{GRAPH}.  The @subcmd{MISSING} subcommand is optional.
+
+@cindex scatterplot
+
+The subcommand @subcmd{SCATTERPLOT} produces an xy plot of the data.  Cases with different
+values of the optional third variable @var{var3} are drawn with different colours and/or
+markers.  The following is an example for producing a scatterplot.
+
+@example
+GRAPH   
+        /SCATTERPLOT = @var{height} WITH @var{weight} BY @var{gender}.
+@end example
+
+This example produces a scatterplot of height versus weight.  The colour of each
+data point depends on the value of the gender variable, so the plot can be used
+to analyze gender differences in the relation between height and weight.
+
+@cindex histogram
+
+The subcommand @subcmd{HISTOGRAM} produces a histogram.  Only one variable is allowed for
+the histogram plot.  For an alternative method to produce histograms, see @ref{EXAMINE}.  The
+following example produces a histogram plot for the variable weight.
+
+@example
+GRAPH   
+        /HISTOGRAM = @var{weight}.
+@end example
+
 @node CORRELATIONS
 @section CORRELATIONS
 
@@ -497,7 +555,7 @@ The @subcmd{FORMAT} subcommand controls the characteristics of the
 crosstabulation tables to be displayed.  It has a number of possible
 settings:
 
-@itemize @asis
+@itemize @w{}
 @item
 @subcmd{TABLES}, the default, causes crosstabulation tables to be output.
 @subcmd{NOTABLES} suppresses them.
@@ -596,23 +654,16 @@ some statistics are calculated only in integer mode.
 @subcmd{STATISTICS} subcommand is not given, no statistics are calculated.
 
 @strong{Please note:} Currently the implementation of @cmd{CROSSTABS} has the
-followings bugs:
+following bugs:
 
 @itemize @bullet
 @item
-Pearson's R (but not Spearman) is off a little.
-@item
-T values for Spearman's R and Pearson's R are wrong.
-@item
-Significance of symmetric and directional measures is not calculated.
-@item
-Asymmetric ASEs and T values for lambda are wrong.
+Significance of some symmetric and directional measures is not calculated.
 @item
-ASE of Goodman and Kruskal's tau is not calculated.
+Asymptotic standard error is not calculated for
+Goodman and Kruskal's tau or symmetric Somers' d.
 @item
-ASE of symmetric somers' d is wrong.
-@item
-Approximate T of uncertainty coefficient is wrong.
+Approximate T is not calculated for the symmetric uncertainty coefficient.
 @end itemize
 
 Fixes for any of these deficiencies would be welcomed.
@@ -702,13 +753,23 @@ performed, and all coefficients will be printed.
 The @subcmd{/CRITERIA} subcommand is used to specify how the number of extracted factors (components) are chosen.
 If @subcmd{FACTORS(@var{n})} is
 specified, where @var{n} is an integer, then @var{n} factors will be extracted.  Otherwise, the @subcmd{MINEIGEN} setting will
-be used.  @subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted.
-The default value of @var{l} is 1.    The @subcmd{ECONVERGE} and @subcmd{ITERATE} settings have effect only when iterative algorithms for factor
-extraction (such as Principal Axis Factoring) are used.   @subcmd{ECONVERGE(@var{delta})} specifies that
+be used.  
+@subcmd{MINEIGEN(@var{l})} requests that all factors whose eigenvalues are greater than or equal to @var{l} are extracted.
+The default value of @var{l} is 1.    
+The @subcmd{ECONVERGE} setting has effect only when iterative algorithms for factor
+extraction (such as Principal Axis Factoring) are used.   
+@subcmd{ECONVERGE(@var{delta})} specifies that
 iteration should cease when
 the maximum absolute value of the communality estimate between one iteration and the previous is less than @var{delta}. The
 default value of @var{delta} is 0.001.
-The @subcmd{ITERATE(@var{m})} setting sets the maximum number of iterations to @var{m}.  The default value of @var{m} is 25.
+The @subcmd{ITERATE(@var{m})} setting may appear any number of times and is used for two different purposes.
+It sets the maximum number of iterations (@var{m}) for convergence and also the maximum number of iterations
+for rotation.
+Whether it affects convergence or rotation depends upon which subcommand follows the @subcmd{ITERATE} setting.
+If @subcmd{EXTRACTION} follows, it affects convergence.
+If @subcmd{ROTATION} follows, it affects rotation.
+If neither @subcmd{ROTATION} nor @subcmd{EXTRACTION} follows an @subcmd{ITERATE} setting, it is ignored.
+The default value of @var{m} is 25.
 
 The @cmd{MISSING} subcommand determines the handling of missing variables.  
 If @subcmd{INCLUDE} is set, then user-missing values are included in the
@@ -723,6 +784,92 @@ If @subcmd{PAIRWISE} is set, then a case is considered missing only if either of
 values  for the particular coefficient are missing.
 The default is @subcmd{LISTWISE}.
 
+@node LOGISTIC REGRESSION
+@section LOGISTIC REGRESSION
+
+@vindex LOGISTIC REGRESSION
+@cindex logistic regression
+@cindex bivariate logistic regression
+
+@display
+LOGISTIC REGRESSION [VARIABLES =] @var{dependent_var} WITH @var{predictors}
+
+     [/CATEGORICAL = @var{categorical_predictors}]
+
+     [@{/NOCONST | /ORIGIN | /NOORIGIN @}]
+
+     [/PRINT = [SUMMARY] [DEFAULT] [CI(@var{confidence})] [ALL]]
+
+     [/CRITERIA = [BCON(@var{min_delta})] [ITERATE(@var{max_iterations})]
+                  [LCON(@var{min_likelihood_delta})] [EPS(@var{min_epsilon})]
+                  [CUT(@var{cut_point})]]
+
+     [/MISSING = @{INCLUDE|EXCLUDE@}]
+@end display
+
+Bivariate Logistic Regression is used when you want to explain a dichotomous dependent
+variable in terms of one or more predictor variables.
+
+The minimum command is
+@example
+LOGISTIC REGRESSION @var{y} WITH @var{x1} @var{x2} @dots{} @var{xn}.
+@end example
+Here, @var{y} is the dependent variable, which must be dichotomous, and @var{x1} @dots{} @var{xn}
+are the predictor variables whose coefficients the procedure estimates.
+
+By default, a constant term is included in the model.
+Hence, the full model is
+@math{
+{\bf y} 
+= b_0 + b_1 {\bf x_1} 
++ b_2 {\bf x_2} 
++ \dots
++ b_n {\bf x_n}
+}
+
+Predictor variables which are categorical in nature should be listed on the @subcmd{/CATEGORICAL} subcommand.
+Simple variables as well as interactions between variables may be listed here.
+
+If you want a model without the constant term @math{b_0}, use the keyword @subcmd{/ORIGIN}.
+@subcmd{/NOCONST} is a synonym for @subcmd{/ORIGIN}.
+
+An iterative Newton-Raphson procedure is used to fit the model.
+The @subcmd{/CRITERIA} subcommand is used to specify the stopping criteria of the procedure,
+and other parameters.
+The value of @var{cut_point} is used in the classification table.  It is the 
+threshold above which predicted values are considered to be 1.  Values
+of @var{cut_point} must lie in the range [0,1].
+During iterations, if any one of the stopping criteria is satisfied, the procedure is
+considered complete.
+The stopping criteria are:
+@itemize
+@item The number of iterations exceeds @var{max_iterations}.  
+      The default value of @var{max_iterations} is 20.
+@item The change in all coefficient estimates is less than @var{min_delta}.
+The default value of @var{min_delta} is 0.001.
+@item The magnitude of change in the likelihood estimate is less than @var{min_likelihood_delta}.
+The default value of @var{min_likelihood_delta} is zero.
+This means that this criterion is disabled.
+@item The differential of the estimated probability for all cases is less than @var{min_epsilon}.
+In other words, the probabilities are close to zero or one.
+The default value of @var{min_epsilon} is 0.00000001.
+@end itemize
+
+
+The @subcmd{PRINT} subcommand controls the display of optional statistics.
+Currently there is one such option, @subcmd{CI}, which indicates that the 
+confidence interval of the odds ratio should be displayed as well as its value.
+@subcmd{CI} should be followed by an integer in parentheses, to indicate the
+confidence level of the desired confidence interval.
+
+The @subcmd{MISSING} subcommand determines the handling of missing
+values.
+If @subcmd{INCLUDE} is set, then user-missing values are included in the
+calculations, but system-missing values are not.
+If @subcmd{EXCLUDE} is set, then user-missing values are excluded
+as well as system-missing values.
+This is the default.
+
 @node MEANS
 @section MEANS