- [CORRELATIONS](commands/statistics/correlations.md)
- [CROSSTABS](commands/statistics/crosstabs.md)
- [CTABLES](commands/statistics/ctables.md)
+ - [FACTOR](commands/statistics/factor.md)
+ - [GLM](commands/statistics/glm.md)
+ - [LOGISTIC REGRESSION](commands/statistics/logistic-regression.md)
+ - [MEANS](commands/statistics/means.md)
# Developer Documentation
--- /dev/null
+# FACTOR
+
+```
+FACTOR {
+ VARIABLES=VAR_LIST,
+ MATRIX IN ({CORR,COV}={*,FILE_SPEC})
+ }
+
+ [ /METHOD = {CORRELATION, COVARIANCE} ]
+
+ [ /ANALYSIS=VAR_LIST ]
+
+ [ /EXTRACTION={PC, PAF}]
+
+ [ /ROTATION={VARIMAX, EQUAMAX, QUARTIMAX, PROMAX[(K)], NOROTATE}]
+
+ [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [AIC] [SIG] [ALL] [DEFAULT] ]
+
+ [ /PLOT=[EIGEN] ]
+
+ [ /FORMAT=[SORT] [BLANK(N)] [DEFAULT] ]
+
+ [ /CRITERIA=[FACTORS(N)] [MINEIGEN(L)] [ITERATE(M)] [ECONVERGE (DELTA)] [DEFAULT] ]
+
+ [ /MISSING=[{LISTWISE, PAIRWISE}] [{INCLUDE, EXCLUDE}] ]
+```
+
+The `FACTOR` command performs Factor Analysis on a dataset, using
+either Principal Components Analysis or Principal Axis Factoring (see
+the `/EXTRACTION` subcommand). It may be used to find common factors
+in the data or for data reduction purposes.
+
+The `VARIABLES` subcommand is required (unless the `MATRIX IN`
+subcommand is used). It lists the variables which are to participate
+in the analysis. (The `ANALYSIS` subcommand may optionally further
+limit the variables that participate; it is useful primarily in
+conjunction with `MATRIX IN`.)
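+
+For illustration, a minimal sketch using hypothetical variables `x1`
+through `x4`:
+
+```
+FACTOR VARIABLES=x1 x2 x3 x4.
+```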
+
+If `MATRIX IN` is specified instead of `VARIABLES`, then the analysis
+is performed on a pre-prepared correlation or covariance matrix file
+instead of on individual data cases. Typically the matrix file will
+have been generated by `MATRIX DATA` or provided by a third party. If
+specified, `MATRIX IN` must be followed by `COV` or `CORR`, then by `=`
+and `FILE_SPEC`, all in parentheses. `FILE_SPEC` may either be an
+asterisk, which indicates the currently loaded dataset, or the name of
+a file to be loaded. See `MATRIX DATA` for the expected format of the
+file.
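+
+As a sketch, assuming that the active dataset is a correlation matrix
+prepared by `MATRIX DATA`, and that `x1`, `x2` and `x3` are
+hypothetical variables from that matrix:
+
+```
+FACTOR MATRIX IN (CORR=*)
+  /ANALYSIS=x1 x2 x3.
+```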
+
+The `/EXTRACTION` subcommand is used to specify the way in which
+factors (components) are extracted from the data. If `PC` is specified,
+then Principal Components Analysis is used. If `PAF` is specified, then
+Principal Axis Factoring is used. By default Principal Components
+Analysis is used.
+
+The `/ROTATION` subcommand is used to specify the method by which the
+extracted solution is rotated. Three orthogonal rotation methods are
+available: `VARIMAX` (which is the default), `EQUAMAX`, and `QUARTIMAX`.
+There is one oblique rotation method, namely `PROMAX`. Optionally you
+may specify the power of the promax rotation, K, which must be enclosed
+in parentheses. The default value of K is 5. The keyword `NOROTATE`
+prevents any rotation from being performed.
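+
+For example, the following sketch (with hypothetical variables)
+requests Principal Axis Factoring followed by a promax rotation with K
+set to 4:
+
+```
+FACTOR VARIABLES=x1 x2 x3 x4 x5
+  /EXTRACTION=PAF
+  /ROTATION=PROMAX(4).
+```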
+
+The `/METHOD` subcommand should be used to determine whether the
+covariance matrix or the correlation matrix of the data is to be
+analysed. By default, the correlation matrix is analysed.
+
+The `/PRINT` subcommand may be used to select which features of the
+analysis are reported:
+
+- `UNIVARIATE` A table of mean values, standard deviations and total
+  weights is printed.
+- `INITIAL` Initial communalities and eigenvalues are printed.
+- `EXTRACTION` Extracted communalities and eigenvalues are printed.
+- `ROTATION` Rotated communalities and eigenvalues are printed.
+- `CORRELATION` The correlation matrix is printed.
+- `COVARIANCE` The covariance matrix is printed.
+- `DET` The determinant of the correlation or covariance matrix is
+ printed.
+- `AIC` The anti-image covariance and anti-image correlation matrices
+ are printed.
+- `KMO` The Kaiser-Meyer-Olkin measure of sampling adequacy and the
+  Bartlett test of sphericity are printed.
+- `SIG` The significance of the elements of the correlation matrix is
+  printed.
+- `ALL` All of the above are printed.
+- `DEFAULT` Identical to `INITIAL` and `EXTRACTION`.
+
+If `/PLOT=EIGEN` is given, then a "Scree" plot of the eigenvalues is
+printed. This can be useful for visualizing the factors and deciding
+which factors (components) should be retained.
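+
+For example, the following sketch (with hypothetical variables) adds
+the KMO measure and the determinant to the default output and requests
+a scree plot:
+
+```
+FACTOR VARIABLES=x1 x2 x3 x4 x5
+  /PRINT=DEFAULT KMO DET
+  /PLOT=EIGEN.
+```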
+
+The `/FORMAT` subcommand determines how data are to be displayed in
+loading matrices. If `SORT` is specified, then the variables are sorted
+in descending order of significance. If `BLANK(N)` is specified, then
+coefficients whose absolute value is less than N are not printed. If
+the keyword `DEFAULT` is specified, or if no `/FORMAT` subcommand is
+specified, then no sorting is performed, and all coefficients are
+printed.
+
+You can use the `/CRITERIA` subcommand to specify how the number of
+extracted factors (components) is chosen. If `FACTORS(N)` is
+specified, where N is an integer, then N factors are extracted.
+Otherwise, the `MINEIGEN` setting is used. `MINEIGEN(L)` requests that
+all factors whose eigenvalues are greater than or equal to L be
+extracted. The default value of L is 1. The `ECONVERGE` setting has
+effect only when using iterative algorithms for factor extraction (such
+as Principal Axis Factoring). `ECONVERGE(DELTA)` specifies that
+iteration should cease when the maximum absolute change in the
+communality estimates between one iteration and the next is less than
+DELTA. The default value of DELTA is 0.001.
+
+`ITERATE(M)` may appear any number of times and is used for two
+different purposes. It is used to set the maximum number of iterations
+(M) for convergence and also to set the maximum number of iterations
+for rotation. Whether it affects convergence or rotation depends upon
+which subcommand follows the `ITERATE` subcommand. If `EXTRACTION`
+follows, it affects convergence. If `ROTATION` follows, it affects
+rotation. If neither `ROTATION` nor `EXTRACTION` follows an `ITERATE`
+subcommand, then the entire subcommand is ignored. The default value
+of M is 25.
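+
+The following sketch (with hypothetical variables) illustrates this
+ordering: the first `ITERATE` limits the extraction to 50 iterations
+and the second limits the rotation to 100:
+
+```
+FACTOR VARIABLES=x1 x2 x3 x4 x5
+  /CRITERIA=FACTORS(2) ITERATE(50)
+  /EXTRACTION=PAF
+  /CRITERIA=ITERATE(100)
+  /ROTATION=VARIMAX.
+```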
+
+The `MISSING` subcommand determines the handling of missing values.
+If `INCLUDE` is set, then user-missing values are included in the
+calculations, but system-missing values are not. If `EXCLUDE` is set,
+which is the default, user-missing values are excluded as well as
+system-missing values. If `LISTWISE` is set, then the entire case is
+excluded from the analysis whenever any variable specified in the
+`VARIABLES` subcommand contains a missing value.
+
+If `PAIRWISE` is set, then a case is considered missing only if
+either of the values for the particular coefficient is missing. The
+default is `LISTWISE`.
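+
+For example, the sketch below (with hypothetical variables) computes
+each correlation from the cases that are valid for that particular
+pair of variables, and treats user-missing values as valid:
+
+```
+FACTOR VARIABLES=x1 x2 x3
+  /MISSING=PAIRWISE INCLUDE.
+```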
+
--- /dev/null
+# GLM
+
+```
+GLM DEPENDENT_VARS BY FIXED_FACTORS
+ [/METHOD = SSTYPE(TYPE)]
+ [/DESIGN = INTERACTION_0 [INTERACTION_1 [... INTERACTION_N]]]
+ [/INTERCEPT = {INCLUDE|EXCLUDE}]
+ [/MISSING = {INCLUDE|EXCLUDE}]
+```
+
+The `GLM` procedure can be used for fixed effects factorial ANOVA
+(analysis of variance).
+
+The `DEPENDENT_VARS` are the variables to be analysed. You may analyse
+several variables in the same command, in which case they should all
+appear before the `BY` keyword.
+
+The `FIXED_FACTORS` list must be one or more categorical variables.
+Normally it does not make sense to enter a scalar variable in the
+`FIXED_FACTORS` list, and doing so may cause PSPP to do a lot of
+unnecessary processing.
+
+The `METHOD` subcommand is used to change the method for producing
+the sums of squares. Available values of `TYPE` are 1, 2 and 3. The
+default is type 3.
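+
+For example, a minimal sketch (with hypothetical variables `yield`,
+`fertilizer` and `irrigation`) requesting type 2 sums of squares:
+
+```
+GLM yield BY fertilizer irrigation
+  /METHOD=SSTYPE(2).
+```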
+
+You may specify a custom design using the `DESIGN` subcommand. The
+design comprises a list of interactions, where each interaction is a
+list of variables separated by a `*`. For example, the command
+```
+GLM subject BY sex age_group race
+  /DESIGN = age_group sex race age_group*sex age_group*race
+```
+specifies the model
+```
+subject = age_group + sex + race + age_group×sex + age_group×race
+```
+If no `DESIGN` subcommand is specified, then the
+default is all possible combinations of the fixed factors. That is to
+say
+```
+GLM subject BY sex age_group race
+```
+implies the model
+```
+subject = age_group + sex + race + age_group×sex + age_group×race + sex×race + age_group×sex×race
+```
+
+The `MISSING` subcommand determines the handling of missing values.
+If `INCLUDE` is set then, for the purposes of GLM analysis, only
+system-missing values are considered to be missing; user-missing
+values are not regarded as missing. If `EXCLUDE` is set, which is the
+default, then user-missing values are considered to be missing as well
+as system-missing values. A case for which any dependent variable or
+any factor variable has a missing value is excluded from the analysis.
+
--- /dev/null
+# LOGISTIC REGRESSION
+
+```
+LOGISTIC REGRESSION [VARIABLES =] DEPENDENT_VAR WITH PREDICTORS
+     [/CATEGORICAL = CATEGORICAL_PREDICTORS]
+     [{/NOCONST | /ORIGIN | /NOORIGIN }]
+     [/PRINT = [SUMMARY] [DEFAULT] [CI(CONFIDENCE)] [ALL]]
+     [/CRITERIA = [BCON(MIN_DELTA)] [ITERATE(MAX_ITERATIONS)]
+                  [LCON(MIN_LIKELIHOOD_DELTA)] [EPS(MIN_EPSILON)]
+                  [CUT(CUT_POINT)]]
+     [/MISSING = {INCLUDE|EXCLUDE}]
+```
+
+Bivariate Logistic Regression is used when you want to explain a
+dichotomous dependent variable in terms of one or more predictor
+variables.
+
+The minimum command is
+```
+LOGISTIC REGRESSION y WITH x1 x2 ... xN.
+```
+
+Here, `y` is the dependent variable, which must be dichotomous, and
+`x1` through `xN` are the predictor variables whose coefficients the
+procedure estimates.
+
+By default, a constant term is included in the model. Hence, the
+full model is \\[{\bf y} = b_0 + b_1 {\bf x_1} + b_2 {\bf x_2} + \dots +
+b_n {\bf x_n}.\\]
+
+Predictor variables which are categorical in nature should be listed
+on the `/CATEGORICAL` subcommand. Simple variables as well as
+interactions between variables may be listed here.
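+
+For example, in the sketch below (with hypothetical variables),
+`occupation` is treated as a categorical predictor while `income` is a
+continuous one:
+
+```
+LOGISTIC REGRESSION purchase WITH income occupation
+  /CATEGORICAL = occupation.
+```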
+
+If you want a model without the constant term \\(b_0\\), use the
+keyword `/ORIGIN`. `/NOCONST` is a synonym for `/ORIGIN`.
+
+An iterative Newton-Raphson procedure is used to fit the model. The
+`/CRITERIA` subcommand is used to specify the stopping criteria of the
+procedure, and other parameters. The value of `CUT_POINT` is used in the
+classification table. It is the threshold above which predicted values
+are considered to be 1. Values of `CUT_POINT` must lie in the range
+[0,1]. During iterations, if any one of the stopping criteria is
+satisfied, the procedure is considered complete. The stopping criteria
+are:
+
+- The number of iterations exceeds `MAX_ITERATIONS`. The default value
+ of `MAX_ITERATIONS` is 20.
+- The change in every coefficient estimate is less than
+  `MIN_DELTA`. The default value of `MIN_DELTA` is 0.001.
+- The magnitude of change in the likelihood estimate is less than
+ `MIN_LIKELIHOOD_DELTA`. The default value of `MIN_LIKELIHOOD_DELTA`
+ is zero. This means that this criterion is disabled.
+- The differential of the estimated probability for all cases is less
+ than `MIN_EPSILON`. In other words, the probabilities are close to
+ zero or one. The default value of `MIN_EPSILON` is 0.00000001.
+
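+For example, the following sketch (with hypothetical variables) allows
+up to 50 iterations and uses a classification threshold of 0.4:
+
+```
+LOGISTIC REGRESSION y WITH x1 x2
+  /CRITERIA = ITERATE(50) CUT(0.4).
+```
+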
+The `PRINT` subcommand controls the display of optional statistics.
+Currently there is one such option, `CI`, which indicates that the
+confidence interval of the odds ratio should be displayed as well as its
+value. `CI` should be followed by an integer in parentheses, to
+indicate the confidence level of the desired confidence interval.
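+
+For example, to display 95% confidence intervals for the odds ratios
+(with hypothetical variables):
+
+```
+LOGISTIC REGRESSION y WITH x1 x2
+  /PRINT = CI(95).
+```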
+
+The `MISSING` subcommand determines the handling of missing values.
+If `INCLUDE` is set, then user-missing values are included in the
+calculations, but system-missing values are not. If `EXCLUDE` is set,
+which is the default, user-missing values are excluded as well as
+system-missing values.
+
--- /dev/null
+# MEANS
+
+```
+MEANS [TABLES =]
+ {VAR_LIST}
+ [ BY {VAR_LIST} [BY {VAR_LIST} [BY {VAR_LIST} ... ]]]
+
+ [ /{VAR_LIST}
+ [ BY {VAR_LIST} [BY {VAR_LIST} [BY {VAR_LIST} ... ]]] ]
+
+ [/CELLS = [MEAN] [COUNT] [STDDEV] [SEMEAN] [SUM] [MIN] [MAX] [RANGE]
+ [VARIANCE] [KURT] [SEKURT]
+ [SKEW] [SESKEW] [FIRST] [LAST]
+ [HARMONIC] [GEOMETRIC]
+ [DEFAULT]
+ [ALL]
+ [NONE] ]
+
+ [/MISSING = [INCLUDE] [DEPENDENT]]
+```
+
+You can use the `MEANS` command to calculate the arithmetic mean and
+similar statistics, either for the dataset as a whole or for categories
+of data.
+
+The simplest form of the command is
+```
+MEANS V.
+```
+which calculates the mean, count and standard deviation for V. If you
+specify a grouping variable, for example
+```
+MEANS V BY G.
+```
+then the means, counts and standard deviations for V after having been
+grouped by G are calculated. Instead of the mean, count and standard
+deviation, you could specify the statistics in which you are interested:
+```
+MEANS X Y BY G
+ /CELLS = HARMONIC SUM MIN.
+```
+This example calculates the harmonic mean, the sum and the minimum
+values of X and Y grouped by G.
+
+The `CELLS` subcommand specifies which statistics to calculate. The
+available statistics are:
+- `MEAN`: The arithmetic mean.
+- `COUNT`: The count of the values.
+- `STDDEV`: The standard deviation.
+- `SEMEAN`: The standard error of the mean.
+- `SUM`: The sum of the values.
+- `MIN`: The minimum value.
+- `MAX`: The maximum value.
+- `RANGE`: The difference between the maximum and minimum values.
+- `VARIANCE`: The variance.
+- `FIRST`: The first value in the category.
+- `LAST`: The last value in the category.
+- `SKEW`: The skewness.
+- `SESKEW`: The standard error of the skewness.
+- `KURT`: The kurtosis.
+- `SEKURT`: The standard error of the kurtosis.
+- `HARMONIC`: The harmonic mean.
+- `GEOMETRIC`: The geometric mean.
+
+In addition, three special keywords are recognized:
+- `DEFAULT`: This is the same as `MEAN COUNT STDDEV`.
+- `ALL`: All of the above statistics are calculated.
+- `NONE`: No statistics are calculated (only a summary is shown).
+
+More than one "table" can be specified in a single command. Each
+table is separated by a `/`. For example
+
+```
+ MEANS TABLES =
+ c d e BY x
+ /a b BY x y
+ /f BY y BY z.
+```
+
+has three tables (the `TABLES =` is optional). The first table has
+three dependent variables `c`, `d`, and `e` and a single categorical
+variable `x`. The second table has two dependent variables `a` and
+`b`, and two categorical variables `x` and `y`. The third table has a
+single dependent variable `f` and a categorical variable formed by the
+combination of `y` and `z`.
+
+By default, values are omitted from the analysis only if missing
+values (either system-missing or user-missing) for any of the variables
+directly involved in their calculation are encountered. This behaviour
+can be modified with the `/MISSING` subcommand. Two options are
+possible: `INCLUDE` and `DEPENDENT`.
+
+`/MISSING = INCLUDE` says that user-missing values, either in the
+dependent variables or in the categorical variables, should be taken
+at their face value and not excluded.
+
+`/MISSING = DEPENDENT` says that user-missing values in the
+dependent variables should be taken at their face value; however, cases
+which have user-missing values for the categorical variables should be
+omitted from the calculation.
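+
+For example, the following sketch (with hypothetical variables `v` and
+`g`) treats user-missing values of `v` as valid data, but omits cases
+in which the grouping variable `g` is user-missing:
+
+```
+MEANS TABLES = v BY g
+  /MISSING = DEPENDENT.
+```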
+
+## Example
+
+The dataset in `repairs.sav` contains the mean time between failures
+(mtbf) for a sample of artifacts produced by different factories and
+trialed under different operating conditions. Since there are four
+combinations of categorical variables, by simply looking at the list
+of data it would be hard to see how the scores vary for each category.
+The syntax below shows one way of tabulating the mtbf in a way which
+is easier to understand.
+
+```
+get file='repairs.sav'.
+
+means tables = mtbf
+ by factory by environment.
+```
+
+The results are shown below. The figures shown indicate the mean,
+standard deviation and number of samples in each category. These
+figures, however, do not indicate whether the results are statistically
+significant. For that, you would need to use the procedures `ONEWAY`,
+`GLM` or `T-TEST` depending on the hypothesis being tested.
+
+```
+ Case Processing Summary
+┌────────────────────────────┬───────────────────────────────┐
+│ │ Cases │
+│ ├──────────┬─────────┬──────────┤
+│ │ Included │ Excluded│ Total │
+│ ├──┬───────┼─┬───────┼──┬───────┤
+│ │ N│Percent│N│Percent│ N│Percent│
+├────────────────────────────┼──┼───────┼─┼───────┼──┼───────┤
+│mtbf * factory * environment│30│ 100.0%│0│ .0%│30│ 100.0%│
+└────────────────────────────┴──┴───────┴─┴───────┴──┴───────┘
+
+ Report
+┌────────────────────────────────────────────┬─────┬──┬──────────────┐
+│Manufacturing facility Operating Environment│ Mean│ N│Std. Deviation│
+├────────────────────────────────────────────┼─────┼──┼──────────────┤
+│0 Temperate │ 7.26│ 9│ 2.57│
+│ Tropical │ 7.47│ 7│ 2.68│
+│ Total │ 7.35│16│ 2.53│
+├────────────────────────────────────────────┼─────┼──┼──────────────┤
+│1 Temperate │13.38│ 6│ 7.77│
+│ Tropical │ 8.20│ 8│ 8.39│
+│ Total │10.42│14│ 8.26│
+├────────────────────────────────────────────┼─────┼──┼──────────────┤
+│Total Temperate │ 9.71│15│ 5.91│
+│ Tropical │ 7.86│15│ 6.20│
+│ Total │ 8.78│30│ 6.03│
+└────────────────────────────────────────────┴─────┴──┴──────────────┘
+```
+
+PSPP does not limit the number of variables for which you can
+calculate statistics, nor the number of categorical variables per
+layer, nor the number of layers. However, running `MEANS` on a large
+number of variables, or with categorical variables containing a large
+number of distinct values, may result in an extremely large output
+which will not be easy to interpret. So you should consider carefully
+which variables to select for participation in the analysis.
+