From: Ben Pfaff Date: Thu, 8 May 2025 00:02:32 +0000 (-0700) Subject: work on manual X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=84e5b89de1d28ea61230cd44555fb932c862b879;p=pspp work on manual --- diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index ebf9c1febc..7ebd98825c 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -127,6 +127,10 @@ - [CORRELATIONS](commands/statistics/correlations.md) - [CROSSTABS](commands/statistics/crosstabs.md) - [CTABLES](commands/statistics/ctables.md) + - [FACTOR](commands/statistics/factor.md) + - [GLM](commands/statistics/glm.md) + - [LOGISTIC REGRESSION](commands/statistics/logistic-regression.md) + - [MEANS](commands/statistics/means.md) # Developer Documentation diff --git a/rust/doc/src/commands/statistics/factor.md b/rust/doc/src/commands/statistics/factor.md new file mode 100644 index 0000000000..b0c303f982 --- /dev/null +++ b/rust/doc/src/commands/statistics/factor.md @@ -0,0 +1,132 @@ +# FACTOR + +``` +FACTOR { + VARIABLES=VAR_LIST, + MATRIX IN ({CORR,COV}={*,FILE_SPEC}) + } + + [ /METHOD = {CORRELATION, COVARIANCE} ] + + [ /ANALYSIS=VAR_LIST ] + + [ /EXTRACTION={PC, PAF}] + + [ /ROTATION={VARIMAX, EQUAMAX, QUARTIMAX, PROMAX[(K)], NOROTATE}] + + [ /PRINT=[INITIAL] [EXTRACTION] [ROTATION] [UNIVARIATE] [CORRELATION] [COVARIANCE] [DET] [KMO] [AIC] [SIG] [ALL] [DEFAULT] ] + + [ /PLOT=[EIGEN] ] + + [ /FORMAT=[SORT] [BLANK(N)] [DEFAULT] ] + + [ /CRITERIA=[FACTORS(N)] [MINEIGEN(L)] [ITERATE(M)] [ECONVERGE (DELTA)] [DEFAULT] ] + + [ /MISSING=[{LISTWISE, PAIRWISE}] [{INCLUDE, EXCLUDE}] ] +``` + +The `FACTOR` command performs Factor Analysis or Principal Axis +Factoring on a dataset. It may be used to find common factors in the +data or for data reduction purposes. + +The `VARIABLES` subcommand is required (unless the `MATRIX IN` +subcommand is used). It lists the variables which are to partake in the +analysis. 
(The `ANALYSIS` subcommand may optionally further limit the +variables that participate; it is useful primarily in conjunction with +`MATRIX IN`.) + +If `MATRIX IN` is specified instead of `VARIABLES`, then the analysis +is performed on a pre-prepared correlation or covariance matrix file +instead of on individual data cases. Typically the matrix file will +have been generated by `MATRIX DATA` or provided +by a third party. If specified, `MATRIX IN` must be followed by `COV` +or `CORR`, then by `=` and `FILE_SPEC`, all in parentheses. `FILE_SPEC` may +either be an asterisk, which indicates the currently loaded dataset, or +it may be a file name to be loaded. See `MATRIX DATA` for the +expected format of the file. + +The `/EXTRACTION` subcommand is used to specify the way in which +factors (components) are extracted from the data. If `PC` is specified, +then Principal Components Analysis is used. If `PAF` is specified, then +Principal Axis Factoring is used. By default Principal Components +Analysis is used. + +The `/ROTATION` subcommand is used to specify the method by which the +extracted solution is rotated. Three orthogonal rotation methods are +available: `VARIMAX` (which is the default), `EQUAMAX`, and `QUARTIMAX`. +There is one oblique rotation method, `PROMAX`. Optionally you may +enter the power of the promax rotation K, which must be enclosed in +parentheses. The default value of K is 5. To suppress rotation +entirely, specify `NOROTATE`. + +The `/METHOD` subcommand should be used to determine whether the +covariance matrix or the correlation matrix of the data is to be +analysed. By default, the correlation matrix is analysed. + +The `/PRINT` subcommand may be used to select which features of the +analysis are reported: + +- `UNIVARIATE` A table of mean values, standard deviations and total + weights is printed. 
+- `INITIAL` Initial communalities and eigenvalues are printed. +- `EXTRACTION` Extracted communalities and eigenvalues are printed. +- `ROTATION` Rotated communalities and eigenvalues are printed. +- `CORRELATION` The correlation matrix is printed. +- `COVARIANCE` The covariance matrix is printed. +- `DET` The determinant of the correlation or covariance matrix is + printed. +- `AIC` The anti-image covariance and anti-image correlation matrices + are printed. +- `KMO` The Kaiser-Meyer-Olkin measure of sampling adequacy and the + Bartlett test of sphericity are printed. +- `SIG` The significance of the elements of the correlation matrix is + printed. +- `ALL` All of the above are printed. +- `DEFAULT` Identical to `INITIAL` and `EXTRACTION`. + +If `/PLOT=EIGEN` is given, then a "Scree" plot of the eigenvalues is +printed. This can be useful for visualizing the factors and deciding +which factors (components) should be retained. + +The `/FORMAT` subcommand determines how data are to be displayed in +loading matrices. If `SORT` is specified, then the variables are sorted +in descending order of significance. If `BLANK(N)` is specified, then +coefficients whose absolute value is less than N are not printed. If +the keyword `DEFAULT` is specified, or if no `/FORMAT` subcommand is +specified, then no sorting is performed, and all coefficients are +printed. + +You can use the `/CRITERIA` subcommand to specify how the number of +extracted factors (components) is chosen. If `FACTORS(N)` is +specified, where N is an integer, then N factors are extracted. +Otherwise, the `MINEIGEN` setting is used. `MINEIGEN(L)` requests that +all factors whose eigenvalues are greater than or equal to L are +extracted. The default value of L is 1. The `ECONVERGE` setting has +effect only when using iterative algorithms for factor extraction (such +as Principal Axis Factoring). 
`ECONVERGE(DELTA)` specifies that +iteration should cease when the maximum absolute change in the +communality estimate between one iteration and the previous is less than +DELTA. The default value of DELTA is 0.001. + +The `ITERATE(M)` setting may appear any number of times and is used for two +different purposes. It is used to set the maximum number of iterations +(M) for convergence and also to set the maximum number of iterations for +rotation. Whether it affects convergence or rotation depends upon which +subcommand follows the `ITERATE` subcommand. If `EXTRACTION` follows, +it affects convergence. If `ROTATION` follows, it affects rotation. If +neither `ROTATION` nor `EXTRACTION` follows an `ITERATE` subcommand, then +the entire subcommand is ignored. The default value of M is 25. + +The `MISSING` subcommand determines the handling of missing +values. If `INCLUDE` is set, then user-missing values are included +in the calculations, but system-missing values are not. If `EXCLUDE` is +set, which is the default, user-missing values are excluded as well as +system-missing values. If `LISTWISE` is set, then +the entire case is excluded from analysis whenever any variable +specified in the `VARIABLES` subcommand contains a missing value. + +If `PAIRWISE` is set, then a case is considered missing only if +either of the values for the particular coefficient is missing. The +default is `LISTWISE`. + diff --git a/rust/doc/src/commands/statistics/glm.md b/rust/doc/src/commands/statistics/glm.md new file mode 100644 index 0000000000..e88cf9ee66 --- /dev/null +++ b/rust/doc/src/commands/statistics/glm.md @@ -0,0 +1,55 @@ +# GLM + +``` +GLM DEPENDENT_VARS BY FIXED_FACTORS + [/METHOD = SSTYPE(TYPE)] + [/DESIGN = INTERACTION_0 [INTERACTION_1 [... INTERACTION_N]]] + [/INTERCEPT = {INCLUDE|EXCLUDE}] + [/MISSING = {INCLUDE|EXCLUDE}] +``` + +The `GLM` procedure can be used for fixed effects factorial ANOVA. + +The `DEPENDENT_VARS` are the variables to be analysed. 
You may analyse +several variables in the same command, in which case they should all +appear before the `BY` keyword. + +The `FIXED_FACTORS` list must be one or more categorical variables. +Normally it does not make sense to enter a scalar variable in the +`FIXED_FACTORS` list, and doing so may cause PSPP to do a lot of unnecessary +processing. + +The `METHOD` subcommand is used to change the method for producing +the sums of squares. Available values of `TYPE` are 1, 2 and 3. The +default is type 3. + +You may specify a custom design using the `DESIGN` subcommand. The +design comprises a list of interactions where each interaction is a list +of variables separated by a `*`. For example, the command +``` +GLM subject BY sex age_group race + /DESIGN = age_group sex race age_group*sex age_group*race +``` +specifies the model +``` +subject = age_group + sex + race + age_group×sex + age_group×race +``` +If no `DESIGN` subcommand is specified, then the +default is all possible combinations of the fixed factors. That is to +say +``` +GLM subject BY sex age_group race +``` +implies the model +``` +subject = age_group + sex + race + age_group×sex + age_group×race + sex×race + age_group×sex×race +``` + +The `MISSING` subcommand determines the handling of missing values. +If `INCLUDE` is set then, for the purposes of GLM analysis, only +system-missing values are considered to be missing; user-missing +values are not regarded as missing. If `EXCLUDE` is set, which is the +default, then user-missing values are considered to be missing as well +as system-missing values. A case for which any dependent variable or +any factor variable has a missing value is excluded from the analysis. 
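The default `GLM` design described above (all possible combinations of the fixed factors) can be enumerated mechanically. The following Python sketch is only an illustration of that rule, not PSPP's implementation; the function name `full_factorial_terms` is hypothetical, and the ordering of terms (main effects first, then higher-order interactions) is an assumption chosen to match the model shown in the text.

```python
from itertools import combinations


def full_factorial_terms(factors):
    """Return every main effect and interaction among `factors`.

    Hypothetical helper for illustration only. Terms are listed from
    lowest order (main effects) to highest order (the full interaction),
    with variables inside an interaction joined by `*` as in DESIGN.
    """
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append("*".join(combo))
    return terms


# The three fixed factors used in the GLM example above.
terms = full_factorial_terms(["age_group", "sex", "race"])
print("/DESIGN = " + " ".join(terms))
```

For k fixed factors this yields 2^k - 1 terms, so three factors give seven terms, the same seven that appear in the implied model above.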
+ diff --git a/rust/doc/src/commands/statistics/logistic-regression.md b/rust/doc/src/commands/statistics/logistic-regression.md new file mode 100644 index 0000000000..1e5f8d07dc --- /dev/null +++ b/rust/doc/src/commands/statistics/logistic-regression.md @@ -0,0 +1,69 @@ +# LOGISTIC REGRESSION + +``` +LOGISTIC REGRESSION [VARIABLES =] DEPENDENT_VAR WITH PREDICTORS + [/CATEGORICAL = CATEGORICAL_PREDICTORS] + [{/NOCONST | /ORIGIN | /NOORIGIN }] + [/PRINT = [SUMMARY] [DEFAULT] [CI(CONFIDENCE)] [ALL]] + [/CRITERIA = [BCON(MIN_DELTA)] [ITERATE(MAX_ITERATIONS)] + [LCON(MIN_LIKELIHOOD_DELTA)] [EPS(MIN_EPSILON)] + [CUT(CUT_POINT)]] + [/MISSING = {INCLUDE|EXCLUDE}] +``` + +Bivariate Logistic Regression is used when you want to explain a +dichotomous dependent variable in terms of one or more predictor +variables. + +The minimum command is +``` +LOGISTIC REGRESSION y WITH x1 x2 ... xN. +``` + +Here, `y` is the dependent variable, which must be dichotomous, and +`x1` through `xN` are the predictor variables whose coefficients the +procedure estimates. + +By default, a constant term is included in the model. Hence, the +full model is \\[{\bf y} = b_0 + b_1 {\bf x_1} + b_2 {\bf x_2} + \dots + +b_n {\bf x_n}.\\] + +Predictor variables which are categorical in nature should be listed +on the `/CATEGORICAL` subcommand. Simple variables as well as +interactions between variables may be listed here. + +If you want a model without the constant term b_0, use the keyword +`/ORIGIN`. `/NOCONST` is a synonym for `/ORIGIN`. + +An iterative Newton-Raphson procedure is used to fit the model. The +`/CRITERIA` subcommand is used to specify the stopping criteria of the +procedure, and other parameters. The value of `CUT_POINT` is used in the +classification table. It is the threshold above which predicted values +are considered to be 1. Values of `CUT_POINT` must lie in the range +[0,1]. 
During iterations, if any one of the stopping criteria is +satisfied, the procedure is considered complete. The stopping criteria +are: + +- The number of iterations exceeds `MAX_ITERATIONS`. The default value + of `MAX_ITERATIONS` is 20. +- The changes in all coefficient estimates are less than + `MIN_DELTA`. The default value of `MIN_DELTA` is 0.001. +- The magnitude of change in the likelihood estimate is less than + `MIN_LIKELIHOOD_DELTA`. The default value of `MIN_LIKELIHOOD_DELTA` + is zero. This means that this criterion is disabled. +- The differential of the estimated probability for all cases is less + than `MIN_EPSILON`. In other words, the probabilities are close to + zero or one. The default value of `MIN_EPSILON` is 0.00000001. + +The `PRINT` subcommand controls the display of optional statistics. +Currently there is one such option, `CI`, which indicates that the +confidence interval of the odds ratio should be displayed as well as its +value. `CI` should be followed by an integer in parentheses, to +indicate the confidence level of the desired confidence interval. + +The `MISSING` subcommand determines the handling of missing +values. If `INCLUDE` is set, then user-missing values are included +in the calculations, but system-missing values are not. If `EXCLUDE` is +set, which is the default, user-missing values are excluded as well as +system-missing values. + diff --git a/rust/doc/src/commands/statistics/means.md b/rust/doc/src/commands/statistics/means.md new file mode 100644 index 0000000000..e5d2caf9a1 --- /dev/null +++ b/rust/doc/src/commands/statistics/means.md @@ -0,0 +1,162 @@ +# MEANS + +``` +MEANS [TABLES =] + {VAR_LIST} + [ BY {VAR_LIST} [BY {VAR_LIST} [BY {VAR_LIST} ... ]]] + + [ /{VAR_LIST} + [ BY {VAR_LIST} [BY {VAR_LIST} [BY {VAR_LIST} ... 
]]] ] + + [/CELLS = [MEAN] [COUNT] [STDDEV] [SEMEAN] [SUM] [MIN] [MAX] [RANGE] + [VARIANCE] [KURT] [SEKURT] + [SKEW] [SESKEW] [FIRST] [LAST] + [HARMONIC] [GEOMETRIC] + [DEFAULT] + [ALL] + [NONE] ] + + [/MISSING = [INCLUDE] [DEPENDENT]] +``` + +You can use the `MEANS` command to calculate the arithmetic mean and +similar statistics, either for the dataset as a whole or for categories +of data. + +The simplest form of the command is +``` +MEANS V. +``` +which calculates the mean, count and standard deviation for V. If you +specify a grouping variable, for example +``` +MEANS V BY G. +``` +then the mean, count and standard deviation of V are calculated +within each group defined by G. Instead of the mean, count and standard +deviation, you could specify the statistics in which you are interested: +``` +MEANS X Y BY G + /CELLS = HARMONIC SUM MIN. +``` +This example calculates the harmonic mean, the sum and the minimum +values of X and Y grouped by G. + +The `CELLS` subcommand specifies which statistics to calculate. The +available statistics are: +- `MEAN`: The arithmetic mean. +- `COUNT`: The count of the values. +- `STDDEV`: The standard deviation. +- `SEMEAN`: The standard error of the mean. +- `SUM`: The sum of the values. +- `MIN`: The minimum value. +- `MAX`: The maximum value. +- `RANGE`: The difference between the maximum and minimum values. +- `VARIANCE`: The variance. +- `FIRST`: The first value in the category. +- `LAST`: The last value in the category. +- `SKEW`: The skewness. +- `SESKEW`: The standard error of the skewness. +- `KURT`: The kurtosis. +- `SEKURT`: The standard error of the kurtosis. +- `HARMONIC`: The harmonic mean. +- `GEOMETRIC`: The geometric mean. + +In addition, three special keywords are recognized: +- `DEFAULT`: This is the same as `MEAN COUNT STDDEV`. +- `ALL`: All of the above statistics are calculated. +- `NONE`: No statistics are calculated (only a summary is shown). + +More than one "table" can be specified in a single command. 
Each +table is separated by a `/`. For example + +``` + MEANS TABLES = + c d e BY x + /a b BY x y + /f BY y BY z. +``` + +has three tables (the `TABLES =` is optional). The first table has +three dependent variables `c`, `d`, and `e` and a single categorical +variable `x`. The second table has two dependent variables `a` and +`b`, and two categorical variables `x` and `y`. The third table has a +single dependent variable `f` and a categorical variable formed by the +combination of `y` and `z`. + +By default, values are omitted from the analysis only if missing +values (either system missing or user missing) for any of the variables +directly involved in their calculation are encountered. This behaviour +can be modified with the `/MISSING` subcommand. Three options are +possible: `TABLE`, `INCLUDE` and `DEPENDENT`. + +`/MISSING = INCLUDE` says that user missing values, either in the +dependent variables or in the categorical variables, should be taken at +their face value, and not excluded. + +`/MISSING = DEPENDENT` says that user missing values in the +dependent variables should be taken at their face value; however, cases +which have user missing values for the categorical variables should be +omitted from the calculation. + +## Example + +The dataset in `repairs.sav` contains the mean time between failures +(mtbf) for a sample of artifacts produced by different factories and +trialed under different operating conditions. Since there are four +combinations of categorical variables, by simply looking at the list +of data, it would be hard to see how the scores vary for each category. +The syntax below shows one way of tabulating the mtbf in a way which +is easier to understand. + +``` +get file='repairs.sav'. + +means tables = mtbf + by factory by environment. +``` + +The results are shown below. The figures shown indicate the mean, +standard deviation and number of samples in each category. 
These +figures however do not indicate whether the results are statistically +significant. For that, you would need to use the procedures `ONEWAY`, +`GLM` or `T-TEST` depending on the hypothesis being tested. + +``` + Case Processing Summary +┌────────────────────────────┬───────────────────────────────┐ +│ │ Cases │ +│ ├──────────┬─────────┬──────────┤ +│ │ Included │ Excluded│ Total │ +│ ├──┬───────┼─┬───────┼──┬───────┤ +│ │ N│Percent│N│Percent│ N│Percent│ +├────────────────────────────┼──┼───────┼─┼───────┼──┼───────┤ +│mtbf * factory * environment│30│ 100.0%│0│ .0%│30│ 100.0%│ +└────────────────────────────┴──┴───────┴─┴───────┴──┴───────┘ + + Report +┌────────────────────────────────────────────┬─────┬──┬──────────────┐ +│Manufacturing facility Operating Environment│ Mean│ N│Std. Deviation│ +├────────────────────────────────────────────┼─────┼──┼──────────────┤ +│0 Temperate │ 7.26│ 9│ 2.57│ +│ Tropical │ 7.47│ 7│ 2.68│ +│ Total │ 7.35│16│ 2.53│ +├────────────────────────────────────────────┼─────┼──┼──────────────┤ +│1 Temperate │13.38│ 6│ 7.77│ +│ Tropical │ 8.20│ 8│ 8.39│ +│ Total │10.42│14│ 8.26│ +├────────────────────────────────────────────┼─────┼──┼──────────────┤ +│Total Temperate │ 9.71│15│ 5.91│ +│ Tropical │ 7.86│15│ 6.20│ +│ Total │ 8.78│30│ 6.03│ +└────────────────────────────────────────────┴─────┴──┴──────────────┘ +``` + +PSPP does not limit the number of variables for which you can +calculate statistics, nor number of categorical variables per layer, +nor the number of layers. However, running `MEANS` on a large number +of variables, or with categorical variables containing a large number +of distinct values, may result in an extremely large output, which +will not be easy to interpret. So you should consider carefully which +variables to select for participation in the analysis. +
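To make the `DEFAULT` cell statistics (`MEAN COUNT STDDEV`) concrete, here is a minimal Python sketch of what `MEANS V BY G` tabulates. It is an illustration only, not PSPP code: the function name `means_by_group` and the sample data are hypothetical, missing values are represented by `None`, and the standard deviation is assumed to be the sample (n - 1) form.

```python
from collections import defaultdict
from statistics import mean, stdev


def means_by_group(cases):
    """Group (g, v) pairs by g and return {g: (mean, count, stddev)}.

    Hypothetical helper for illustration. A case is skipped when either
    its grouping value or its dependent value is None (treated here as
    missing). The standard deviation is None for single-case groups.
    """
    groups = defaultdict(list)
    for g, v in cases:
        if g is None or v is None:
            continue
        groups[g].append(v)
    return {
        g: (mean(vs), len(vs), stdev(vs) if len(vs) > 1 else None)
        for g, vs in groups.items()
    }


# Made-up data: (grouping value, dependent value) pairs.
cases = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 4.0), ("b", 6.0), ("b", None)]
for g, (m, n, sd) in sorted(means_by_group(cases).items()):
    print(g, m, n, sd)
```

Each additional `BY` layer multiplies the number of cells by the number of distinct values in that layer, which is why tables with many categorical variables grow large so quickly.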