work on manual

author Ben Pfaff <blp@cs.stanford.edu>

Wed, 7 May 2025 16:01:32 +0000 (09:01 -0700)

committer Ben Pfaff <blp@cs.stanford.edu>

Wed, 7 May 2025 16:01:32 +0000 (09:01 -0700)
diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md

index ea37e89028430e5bf4b335b19d61e4d3cafb13c7..4b01a293316f1f6d74482332c2cc5cc690b35898 100644 (file)
--- a/rust/doc/src/SUMMARY.md
+++ b/rust/doc/src/SUMMARY.md
@@ -97,6 +97,10 @@
    - [VARIABLE WIDTH](commands/variables/variable-width.md)
    - [VECTOR](commands/variables/vector.md)
    - [WRITE FORMATS](commands/variables/write-formats.md)
+- [Transforming Data](commands/data/index.md)
+  - [AGGREGATE](commands/data/aggregate.md)
+  - [AUTORECODE](commands/data/autorecode.md)
+  - [COMPUTE](commands/data/compute.md)
  
  # Developer Documentation
  
diff --git a/rust/doc/src/commands/data/aggregate.md b/rust/doc/src/commands/data/aggregate.md

new file mode 100644 (file)

index 0000000..55ca0de
--- /dev/null
+++ b/rust/doc/src/commands/data/aggregate.md
@@ -0,0 +1,233 @@
+# AGGREGATE
+
+```
+AGGREGATE
+        [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]]
+        [/MISSING=COLUMNWISE]
+        [/PRESORTED]
+        [/DOCUMENT]
+        [/BREAK=VAR_LIST]
+        /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)...
+```
+
+`AGGREGATE` summarizes groups of cases into single cases.  It divides
+cases into groups that have the same values for one or more variables
+called "break variables".  Several functions are available for
+summarizing case contents.
+
+The `AGGREGATE` syntax consists of subcommands to control its
+behavior, all of which are optional, followed by one or more
+destination variable assignments, each of which uses an aggregation
+function to define how it is calculated.
+
+The `OUTFILE` subcommand, which must be first, names the destination
+for `AGGREGATE` output.  It may name a system file by file name or
+[file handle](../../language/files/file-handles.md), a
+[dataset](../../language/datasets/index.md) by its name, or `*` to
+replace the active dataset.  `AGGREGATE` writes its output to this
+file.
+
+With `OUTFILE=*` only, `MODE` may be specified immediately afterward
+with the value `ADDVARIABLES` or `REPLACE`:
+
+- With `REPLACE`, the default, the active dataset is replaced by a
+  new dataset which contains just the break variables and the
+  destination variables.  The new file contains as many cases as there
+  are unique combinations of the break variables.
+
+- With `ADDVARIABLES`, the destination variables are added to those
+  in the existing active dataset.  Cases that have the same
+  combination of values in their break variables receive identical
+  values for the destination variables.  The number of cases in the
+  active dataset remains unchanged.  The data must be sorted on the
+  break variables, that is, `ADDVARIABLES` implies `PRESORTED`.
+
+Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=*
+MODE=ADDVARIABLES` were specified.
+
+By default, `AGGREGATE` first sorts the data on the break variables.
+If the active dataset is already sorted or grouped by the break
+variables, specify `PRESORTED` to save time.  With
+`MODE=ADDVARIABLES`, the data must be pre-sorted.
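+
+As a minimal sketch of the sorting requirement (it assumes the
+`occupation` and `salary` variables of the `personnel.sav` file used in
+the example below), the following sorts the active dataset first and
+then adds each occupation's mean salary to every case:
+
+```
+GET FILE='personnel.sav'.
+SORT CASES BY occupation.
+AGGREGATE OUTFILE=* MODE=ADDVARIABLES
+        /PRESORTED
+        /BREAK=occupation
+        /occ_mean_salary=MEAN(salary).
+```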
+
+Specify `DOCUMENT` (see the `DOCUMENT` command) to copy the documents from the
+active dataset into the aggregate file.  Otherwise, the aggregate file
+does not contain any documents, even if the aggregate file replaces
+the active dataset.
+
+Normally, `AGGREGATE` produces a non-missing value whenever there is
+enough non-missing data for the aggregation function in use, that is,
+just one non-missing value or, for the `SD` and `SD.` aggregation
+functions, two non-missing values.  Specify `/MISSING=COLUMNWISE` to
+make `AGGREGATE` output a missing value when one or more of the input
+values are missing.
+
+The `BREAK` subcommand is optional but usually present.  On `BREAK`,
+list the variables used to divide the active dataset into groups to be
+summarized.
+
+`AGGREGATE` is particular about the order of subcommands.  `OUTFILE`
+must be first, followed by `MISSING`.  `PRESORTED` and `DOCUMENT`
+follow `MISSING`, in either order, followed by `BREAK`, then followed
+by aggregation variable specifications.
+
+At least one set of aggregation variables is required.  Each set
+comprises a list of aggregation variables, an equals sign (`=`), the
+name of an aggregation function (see the list below), and a list of
+source variables in parentheses.  A few aggregation functions do not
+accept source variables, and some aggregation functions expect
+additional arguments after the source variable names.
+
+`AGGREGATE` typically creates aggregation variables with no variable
+label, value labels, or missing values.  Their default print and write
+formats depend on the aggregation function used, with details given in
+the list below.  A variable label for an aggregation variable may be
+specified just after the variable's name in the aggregation variable
+list.
+
+Each set must have exactly as many source variables as aggregation
+variables.  Each aggregation variable receives the results of applying
+the specified aggregation function to the corresponding source variable.
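+
+The following sketch illustrates that correspondence.  It assumes the
+`height` and `weight` variables of the `physiology.sav` file used
+elsewhere in this manual; the destination names and labels are only
+illustrative.  `BREAK` is omitted, so the whole dataset forms a single
+group:
+
+```
+GET FILE='physiology.sav'.
+AGGREGATE OUTFILE=*
+        /mean_height 'Mean height' mean_weight 'Mean weight'
+         =MEAN(height weight)
+        /pct_tall=PGT(height, 1800).
+```
+
+Here `mean_height` is computed from `height` and `mean_weight` from
+`weight`, while `PGT` takes its threshold as an additional argument
+after the source variable.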
+
+The following aggregation functions may be applied only to numeric
+variables:
+
+* `MEAN(VAR_NAME...)`  
+  Arithmetic mean.  Limited to numeric values.  The default format is
+  `F8.2`.
+
+* `MEDIAN(VAR_NAME...)`  
+  The median value.  Limited to numeric values.  The default format
+  is `F8.2`.
+
+* `SD(VAR_NAME...)`  
+  Standard deviation.  Limited to numeric values.  The
+  default format is `F8.2`.
+
+* `SUM(VAR_NAME...)`  
+  Sum.  Limited to numeric values.  The default format is `F8.2`.
+
+These aggregation functions may be applied to numeric and string
+variables:
+
+* `CGT(VAR_NAME..., VALUE)`  
+  `CLT(VAR_NAME..., VALUE)`  
+  `CIN(VAR_NAME..., LOW, HIGH)`  
+  `COUT(VAR_NAME..., LOW, HIGH)`  
+  Total weight of cases greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FGT(VAR_NAME..., VALUE)`  
+  `FLT(VAR_NAME..., VALUE)`  
+  `FIN(VAR_NAME..., LOW, HIGH)`  
+  `FOUT(VAR_NAME..., LOW, HIGH)`  
+  Fraction of values greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FIRST(VAR_NAME...)`  
+  `LAST(VAR_NAME...)`  
+  First or last non-missing value, respectively, in break group.  The
+  aggregation variable receives the complete dictionary information
+  from the source variable.  The sort performed by `AGGREGATE` (and
+  by `SORT CASES`) is stable.  This means that the first (or last)
+  case with particular values for the break variables before sorting
+  is also the first (or last) case in that break group after sorting.
+
+* `MIN(VAR_NAME...)`  
+  `MAX(VAR_NAME...)`  
+  Minimum or maximum value, respectively.  The aggregation variable
+  receives the complete dictionary information from the source
+  variable.
+
+* `N(VAR_NAME...)`  
+  `NMISS(VAR_NAME...)`  
+  Total weight of non-missing or missing values, respectively.  The
+  default format is `F7.0` if weighting is not enabled, `F8.2` if it is
+  (see `WEIGHT`).
+
+* `NU(VAR_NAME...)`  
+  `NUMISS(VAR_NAME...)`  
+  Count of non-missing or missing values, respectively, ignoring case
+  weights.  The default format is `F7.0`.
+
+* `PGT(VAR_NAME..., VALUE)`  
+  `PLT(VAR_NAME..., VALUE)`  
+  `PIN(VAR_NAME..., LOW, HIGH)`  
+  `POUT(VAR_NAME..., LOW, HIGH)`  
+  Percentage between 0 and 100 of values greater than or less than
+  `VALUE` or inside or outside the closed range `[LOW,HIGH]`,
+  respectively.  The default format is `F5.1`.
+
+These aggregation functions do not accept source variables:
+
+* `N`  
+  Total weight of cases aggregated to form this group.  The default
+  format is `F7.0` if weighting is not enabled, `F8.2` if it is (see
+  `WEIGHT`).
+
+* `NU`  
+  Count of cases aggregated to form this group, ignoring case
+  weights.  The default format is `F7.0`.
+
+Aggregation functions compare string values in terms of Unicode
+character codes.
+
+The aggregation functions listed above exclude all user-missing values
+from calculations.  To include user-missing values, insert a period
+(`.`) at the end of the function name (e.g. `SUM.`).  (Be aware that
+specifying such a function as the last token on a line causes the
+period to be interpreted as the end of the command.)
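+
+As a sketch, suppose (hypothetically) that a placeholder salary of
+99999 is declared user-missing in `personnel.sav`.  Then `SUM` skips
+those values, whereas `SUM.` includes them.  Because `SUM.` is followed
+by its argument list on the same line, its period does not end the
+command:
+
+```
+GET FILE='personnel.sav'.
+MISSING VALUES salary (99999).
+AGGREGATE OUTFILE=*
+        /BREAK=occupation
+        /sum_valid=SUM(salary)
+        /sum_all=SUM.(salary).
+```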
+
+`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings
+(see `SPLIT FILE`).
+
+## Example
+
+The `personnel.sav` dataset provides the occupations and salaries of
+many individuals.  For many purposes, however, such detailed information
+is not interesting, whereas the aggregated statistics of each
+occupation are.  Here, the `AGGREGATE` command is used to
+calculate the mean, the median and the standard deviation of the
+salaries within each occupation.
+
+```
+GET FILE="personnel.sav".
+AGGREGATE OUTFILE=* MODE=REPLACE
+        /BREAK=occupation
+        /occ_mean_salary=MEAN(salary)
+        /occ_median_salary=MEDIAN(salary)
+        /occ_std_dev_salary=SD(salary).
+LIST.
+```
+
+Since we chose the `MODE=REPLACE` option, cases for the individual
+persons are no longer present.  They have been replaced by a single
+case for each distinct value of the break variable `occupation`.
+
+```
+                                Data List
+┌──────────────────┬───────────────┬─────────────────┬──────────────────┐
+│    occupation    │occ_mean_salary│occ_median_salary│occ_std_dev_salary│
+├──────────────────┼───────────────┼─────────────────┼──────────────────┤
+│Artist            │       37836.18│         34712.50│           7631.48│
+│Baker             │       45075.20│         45075.20│           4411.21│
+│Barrister         │       39504.00│         39504.00│                 .│
+│Carpenter         │       39349.11│         36190.04│           7453.40│
+│Cleaner           │       41142.50│         39647.49│          14378.98│
+│Cook              │       40357.79│         43194.00│          11064.51│
+│Manager           │       46452.14│         45657.56│           6901.69│
+│Mathematician     │       34531.06│         34763.06│           5267.68│
+│Painter           │       45063.55│         45063.55│          15159.67│
+│Payload Specialist│       34355.72│         34355.72│                 .│
+│Plumber           │       40413.91│         40410.00│           4726.05│
+│Scientist         │       36687.07│         36803.83│          10873.54│
+│Scrientist        │       42530.65│         42530.65│                 .│
+│Tailor            │       34586.79│         34586.79│           3728.98│
+└──────────────────┴───────────────┴─────────────────┴──────────────────┘
+```
+
+Some values for the standard deviation are blank because there is only
+one case with the respective occupation.
+
diff --git a/rust/doc/src/commands/data/autorecode.md b/rust/doc/src/commands/data/autorecode.md

new file mode 100644 (file)

index 0000000..20831e6
--- /dev/null
+++ b/rust/doc/src/commands/data/autorecode.md
@@ -0,0 +1,133 @@
+# AUTORECODE
+
+```
+AUTORECODE VARIABLES=SRC_VARS INTO DEST_VARS
+        [ /DESCENDING ]
+        [ /PRINT ]
+        [ /GROUP ]
+        [ /BLANK = {VALID, MISSING} ]
+```
+
+The `AUTORECODE` procedure considers the N values that a variable
+takes on and maps them onto values 1...N on a new numeric variable.
+
+Subcommand `VARIABLES` is the only required subcommand and must come
+first.  Specify `VARIABLES`, an equals sign (`=`), a list of source
+variables, `INTO`, and a list of target variables.  There must be the
+same number of source and target variables.  The target variables must
+not already exist.
+
+`AUTORECODE` ordinarily assigns each increasing non-missing value of a
+source variable (for a string, this is based on character code
+comparisons) to consecutive values of its target variable.  For
+example, the smallest non-missing value of the source variable is
+recoded to value 1, the next smallest to 2, and so on.  If the source
+variable has user-missing values, they are recoded to consecutive
+values just above the non-missing values.  For example, if a source
+variable has seven distinct non-missing values, then the smallest
+missing value would be recoded to 8, the next smallest to 9, and so
+on.
+
+Use `DESCENDING` to reverse the sort order for non-missing values, so
+that the largest non-missing value is recoded to 1, the second-largest
+to 2, and so on.  Even with `DESCENDING`, user-missing values are
+still recoded in ascending order just above the non-missing values.
+
+The system-missing value is always recoded into the system-missing
+value in target variables.
+
+If a source value has a value label, then that value label is retained
+for the new value in the target variable.  Otherwise, the source value
+itself becomes each new value's label.
+
+Variable labels are copied from the source to target variables.
+
+`PRINT` is currently ignored.
+
+The `GROUP` subcommand is relevant only if more than one variable is
+to be recoded.  It causes a single mapping between source and target
+values to be used, instead of one map per variable.  With `GROUP`,
+user-missing values are taken from the first source variable that has
+any user-missing values.
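+
+A minimal sketch of `DESCENDING` and `GROUP` together, using made-up
+inline data and illustrative variable names:
+
+```
+DATA LIST LIST NOTABLE /first_choice second_choice (A8).
+BEGIN DATA.
+tea coffee
+coffee water
+water tea
+END DATA.
+
+AUTORECODE VARIABLES=first_choice second_choice
+        INTO first_code second_code
+        /DESCENDING
+        /GROUP.
+LIST.
+```
+
+Because of `GROUP`, a given string should receive the same code in
+both target variables; because of `DESCENDING`, the largest string
+("water") should be recoded to 1.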
+
+If `/BLANK=MISSING` is given, then string values which contain
+only whitespace are recoded as SYSMIS.  If `/BLANK=VALID` is specified
+then they are allocated a value like any other.  `/BLANK` is not
+relevant to numeric values.  `/BLANK=VALID` is the default.
+
+`AUTORECODE` is a procedure.  It causes the data to be read.
+
+## Example
+
+In the file `personnel.sav`, the variable occupation is a string
+variable.  Except for data of a purely commentary nature, string
+variables are generally a bad idea.  One reason is that data entry
+errors are easily overlooked.  This has happened in `personnel.sav`;
+one entry which should read "Scientist" has been mistyped as
+"Scrientist".  The syntax below corrects this error with a
+`DO IF` clause[^1], then uses `AUTORECODE` to create a new numeric
+variable which takes recoded values of occupation.  Finally, we remove
+the old variable and rename the new variable to the name of the old
+variable:
+
+[^1]: One must use care when correcting such data input errors rather
+than simply marking them as missing.  For example, if an occupation
+has been entered "Barister", did the person mean "Barrister" or
+"Barista"?
+
+```
+get file='personnel.sav'.
+
+* Correct a typing error in the original file.
+do if occupation = "Scrientist".
+ compute occupation = "Scientist".
+end if.
+
+autorecode
+   variables = occupation into occ
+   /blank = missing.
+
+* Delete the old variable.
+delete variables occupation.
+
+* Rename the new variable to the old variable's name.
+rename variables (occ = occupation).
+
+* Inspect the new variable.
+display dictionary /variables=occupation.
+```
+
+
+Notice, in the output below, how the new variable has been
+automatically allocated value labels which correspond to the strings
+of the old variable.  This means that in future analyses the
+descriptive strings are reported instead of the numeric values.
+
+```
+                                   Variables
++----------+--------+--------------+-----+-----+---------+----------+---------+
+|          |        |  Measurement |     |     |         |   Print  |  Write  |
+|Name      |Position|     Level    | Role|Width|Alignment|  Format  |  Format |
++----------+--------+--------------+-----+-----+---------+----------+---------+
+|occupation|       6|Unknown       |Input|    8|Right    |F2.0      |F2.0     |
++----------+--------+--------------+-----+-----+---------+----------+---------+
+
+            Value Labels
++---------------+------------------+
+|Variable Value |       Label      |
++---------------+------------------+
+|occupation 1   |Artist            |
+|           2   |Baker             |
+|           3   |Barrister         |
+|           4   |Carpenter         |
+|           5   |Cleaner           |
+|           6   |Cook              |
+|           7   |Manager           |
+|           8   |Mathematician     |
+|           9   |Painter           |
+|           10  |Payload Specialist|
+|           11  |Plumber           |
+|           12  |Scientist         |
+|           13  |Tailor            |
++---------------+------------------+
+```
diff --git a/rust/doc/src/commands/data/compute.md b/rust/doc/src/commands/data/compute.md

new file mode 100644 (file)

index 0000000..dd3067b
--- /dev/null
+++ b/rust/doc/src/commands/data/compute.md
@@ -0,0 +1,87 @@
+# COMPUTE
+
+```
+COMPUTE VARIABLE = EXPRESSION.
+   or
+COMPUTE vector(INDEX) = EXPRESSION.
+```
+
+`COMPUTE` assigns the value of an expression to a target variable.
+For each case, the expression is evaluated and its value assigned to
+the target variable.  Numeric and string variables may be assigned.
+When a string expression's width differs from the target variable's
+width, the string result of the expression is truncated or padded with
+spaces on the right as necessary.  The expression and variable types
+must match.
+
+For numeric variables only, the target variable need not already
+exist.  Numeric variables created by `COMPUTE` are assigned an `F8.2`
+output format.  String variables must be declared before they can be
+used as targets for `COMPUTE`.
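+
+For instance, in the following sketch (which assumes the string
+variable `occupation` from `personnel.sav`; `initial` is an
+illustrative name), `STRING` declares the target before `COMPUTE`
+assigns to it:
+
+```
+GET FILE='personnel.sav'.
+STRING initial (A1).
+COMPUTE initial = SUBSTR(occupation, 1, 1).
+LIST.
+```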
+
+The target variable may be specified as an element of a
+[vector](../../commands/variables/vector.md).  In this case, an
+expression `INDEX` must be specified in parentheses following the vector
+name.  The expression `INDEX` must evaluate to a numeric value that,
+after rounding down to the nearest integer, is a valid index for the
+named vector.
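+
+A minimal sketch with made-up inline data: `VECTOR` creates the
+numeric variables `power1` through `power3`, and `COMPUTE` fills them
+in through the vector name and an index expression:
+
+```
+DATA LIST LIST NOTABLE /base.
+BEGIN DATA.
+2
+3
+END DATA.
+
+VECTOR power(3).
+LOOP #i = 1 TO 3.
+  COMPUTE power(#i) = base ** #i.
+END LOOP.
+LIST.
+```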
+
+Using `COMPUTE` to assign to a variable specified on
+[`LEAVE`](../../commands/variables/leave.md) resets the variable's
+left state.  Therefore, `LEAVE` should be specified following
+`COMPUTE`, not before.
+
+`COMPUTE` is a transformation.  It does not cause the active dataset
+to be read.
+
+When `COMPUTE` is specified following `TEMPORARY` (see `TEMPORARY`),
+the [`LAG`](../../language/expressions/functions/miscellaneous.md)
+function may not be used.
+
+## Examples
+
+The dataset `physiology.sav` contains the height and weight of
+persons.  For some purposes, neither height nor weight alone is of
+interest.  Epidemiologists are often more interested in the "body mass
+index" which can sometimes be used as a predictor for clinical
+conditions.  The body mass index is defined as the weight of the
+person in kilograms divided by the square of the person's height in
+metres.[^1]
+
+[^1]: Since BMI is a quantity with a ratio scale and has units, the
+term "index" is a misnomer, but that is what it is called.
+
+```
+get file='physiology.sav'.
+
+* height is in mm so we must divide by 1000 to get metres.
+compute bmi = weight / (height/1000)**2.
+variable label bmi "Body Mass Index".
+
+descriptives /weight height bmi.
+```
+
+This syntax shows how you can use `COMPUTE` to generate a new variable
+called bmi and have every case's value calculated from the existing
+values of weight and height.  It also shows how you can [add a
+label](../../commands/variables/variable-labels.md) to this new
+variable, so that a more descriptive label appears in subsequent
+analyses, and this can be seen in the output from the `DESCRIPTIVES`
+command, below.
+
+The expression which follows the `=` sign can be as complicated as
+necessary.  See [Expressions](../../language/expressions.md) for a
+full description of the language accepted.
+
+```
+                  Descriptive Statistics
+┌─────────────────────┬──┬───────┬───────┬───────┬───────┐
+│                     │ N│  Mean │Std Dev│Minimum│Maximum│
+├─────────────────────┼──┼───────┼───────┼───────┼───────┤
+│Weight in kilograms  │40│  72.12│  26.70│  -55.6│   92.1│
+│Height in millimeters│40│1677.12│ 262.87│    179│   1903│
+│Body Mass Index      │40│  67.46│ 274.08│ -21.62│1756.82│
+│Valid N (listwise)   │40│       │       │       │       │
+│Missing N (listwise) │ 0│       │       │       │       │
+└─────────────────────┴──┴───────┴───────┴───────┴───────┘
+```
diff --git a/rust/doc/src/commands/data/index.md b/rust/doc/src/commands/data/index.md

new file mode 100644 (file)

index 0000000..d4f0517
--- /dev/null
+++ b/rust/doc/src/commands/data/index.md
@@ -0,0 +1,2 @@
+The PSPP procedures in this chapter manipulate data and prepare the
+active dataset for later analyses.  They do not produce output.
diff --git a/rust/doc/src/data/aggregate.md b/rust/doc/src/data/aggregate.md

new file mode 100644 (file)

index 0000000..fea17fe
--- /dev/null
+++ b/rust/doc/src/data/aggregate.md
@@ -0,0 +1,233 @@
+# AGGREGATE
+
+```
+AGGREGATE
+        [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]]
+        [/MISSING=COLUMNWISE]
+        [/PRESORTED]
+        [/DOCUMENT]
+        [/BREAK=VAR_LIST]
+        /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)...
+```
+
+`AGGREGATE` summarizes groups of cases into single cases.  It divides
+cases into groups that have the same values for one or more variables
+called "break variables".  Several functions are available for
+summarizing case contents.
+
+The `AGGREGATE` syntax consists of subcommands to control its
+behavior, all of which are optional, followed by one or more
+destination variable assignments, each of which uses an aggregation
+function to define how it is calculated.
+
+The `OUTFILE` subcommand, which must be first, names the destination
+for `AGGREGATE` output.  It may name a system file by file name or
+[file handle](../../language/files/file-handles.md), a
+[dataset](../../language/datasets/index.md) by its name, or `*` to
+replace the active dataset.  `AGGREGATE` writes its output to this
+file.
+
+With `OUTFILE=*` only, `MODE` may be specified immediately afterward
+with the value `ADDVARIABLES` or `REPLACE`:
+
+- With `REPLACE`, the default, the active dataset is replaced by a
+  new dataset which contains just the break variables and the
+  destination variables.  The new file contains as many cases as there
+  are unique combinations of the break variables.
+
+- With `ADDVARIABLES`, the destination variables are added to those
+  in the existing active dataset.  Cases that have the same
+  combination of values in their break variables receive identical
+  values for the destination variables.  The number of cases in the
+  active dataset remains unchanged.  The data must be sorted on the
+  break variables, that is, `ADDVARIABLES` implies `PRESORTED`.
+
+Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=*
+MODE=ADDVARIABLES` were specified.
+
+By default, `AGGREGATE` first sorts the data on the break variables.
+If the active dataset is already sorted or grouped by the break
+variables, specify `PRESORTED` to save time.  With
+`MODE=ADDVARIABLES`, the data must be pre-sorted.
+
+Specify `DOCUMENT` (see the `DOCUMENT` command) to copy the documents from the
+active dataset into the aggregate file.  Otherwise, the aggregate file
+does not contain any documents, even if the aggregate file replaces
+the active dataset.
+
+Normally, `AGGREGATE` produces a non-missing value whenever there is
+enough non-missing data for the aggregation function in use, that is,
+just one non-missing value or, for the `SD` and `SD.` aggregation
+functions, two non-missing values.  Specify `/MISSING=COLUMNWISE` to
+make `AGGREGATE` output a missing value when one or more of the input
+values are missing.
+
+The `BREAK` subcommand is optional but usually present.  On `BREAK`,
+list the variables used to divide the active dataset into groups to be
+summarized.
+
+`AGGREGATE` is particular about the order of subcommands.  `OUTFILE`
+must be first, followed by `MISSING`.  `PRESORTED` and `DOCUMENT`
+follow `MISSING`, in either order, followed by `BREAK`, then followed
+by aggregation variable specifications.
+
+At least one set of aggregation variables is required.  Each set
+comprises a list of aggregation variables, an equals sign (`=`), the
+name of an aggregation function (see the list below), and a list of
+source variables in parentheses.  A few aggregation functions do not
+accept source variables, and some aggregation functions expect
+additional arguments after the source variable names.
+
+`AGGREGATE` typically creates aggregation variables with no variable
+label, value labels, or missing values.  Their default print and write
+formats depend on the aggregation function used, with details given in
+the list below.  A variable label for an aggregation variable may be
+specified just after the variable's name in the aggregation variable
+list.
+
+Each set must have exactly as many source variables as aggregation
+variables.  Each aggregation variable receives the results of applying
+the specified aggregation function to the corresponding source variable.
+
+The following aggregation functions may be applied only to numeric
+variables:
+
+* `MEAN(VAR_NAME...)`  
+  Arithmetic mean.  Limited to numeric values.  The default format is
+  `F8.2`.
+
+* `MEDIAN(VAR_NAME...)`  
+  The median value.  Limited to numeric values.  The default format
+  is `F8.2`.
+
+* `SD(VAR_NAME...)`  
+  Standard deviation.  Limited to numeric values.  The
+  default format is `F8.2`.
+
+* `SUM(VAR_NAME...)`  
+  Sum.  Limited to numeric values.  The default format is `F8.2`.
+
+These aggregation functions may be applied to numeric and string
+variables:
+
+* `CGT(VAR_NAME..., VALUE)`  
+  `CLT(VAR_NAME..., VALUE)`  
+  `CIN(VAR_NAME..., LOW, HIGH)`  
+  `COUT(VAR_NAME..., LOW, HIGH)`  
+  Total weight of cases greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FGT(VAR_NAME..., VALUE)`  
+  `FLT(VAR_NAME..., VALUE)`  
+  `FIN(VAR_NAME..., LOW, HIGH)`  
+  `FOUT(VAR_NAME..., LOW, HIGH)`  
+  Fraction of values greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FIRST(VAR_NAME...)`  
+  `LAST(VAR_NAME...)`  
+  First or last non-missing value, respectively, in break group.  The
+  aggregation variable receives the complete dictionary information
+  from the source variable.  The sort performed by `AGGREGATE` (and
+  by `SORT CASES`) is stable.  This means that the first (or last)
+  case with particular values for the break variables before sorting
+  is also the first (or last) case in that break group after sorting.
+
+* `MIN(VAR_NAME...)`  
+  `MAX(VAR_NAME...)`  
+  Minimum or maximum value, respectively.  The aggregation variable
+  receives the complete dictionary information from the source
+  variable.
+
+* `N(VAR_NAME...)`  
+  `NMISS(VAR_NAME...)`  
+  Total weight of non-missing or missing values, respectively.  The
+  default format is `F7.0` if weighting is not enabled, `F8.2` if it is
+  (see `WEIGHT`).
+
+* `NU(VAR_NAME...)`  
+  `NUMISS(VAR_NAME...)`  
+  Count of non-missing or missing values, respectively, ignoring case
+  weights.  The default format is `F7.0`.
+
+* `PGT(VAR_NAME..., VALUE)`  
+  `PLT(VAR_NAME..., VALUE)`  
+  `PIN(VAR_NAME..., LOW, HIGH)`  
+  `POUT(VAR_NAME..., LOW, HIGH)`  
+  Percentage between 0 and 100 of values greater than or less than
+  `VALUE` or inside or outside the closed range `[LOW,HIGH]`,
+  respectively.  The default format is `F5.1`.
+
+These aggregation functions do not accept source variables:
+
+* `N`  
+  Total weight of cases aggregated to form this group.  The default
+  format is `F7.0` if weighting is not enabled, `F8.2` if it is (see
+  `WEIGHT`).
+
+* `NU`  
+  Count of cases aggregated to form this group, ignoring case
+  weights.  The default format is `F7.0`.
+
+Aggregation functions compare string values in terms of Unicode
+character codes.
+
+The aggregation functions listed above exclude all user-missing values
+from calculations.  To include user-missing values, insert a period
+(`.`) at the end of the function name (e.g. `SUM.`).  (Be aware that
+specifying such a function as the last token on a line causes the
+period to be interpreted as the end of the command.)
+
+`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings
+(see `SPLIT FILE`).
+
+## Aggregate Example
+
+The `personnel.sav` dataset provides the occupations and salaries of
+many individuals.  For many purposes, however, such detailed information
+is not interesting, whereas the aggregated statistics of each
+occupation are.  Here, the `AGGREGATE` command is used to
+calculate the mean, the median and the standard deviation of the
+salaries within each occupation.
+
+```
+GET FILE="personnel.sav".
+AGGREGATE OUTFILE=* MODE=REPLACE
+        /BREAK=occupation
+        /occ_mean_salary=MEAN(salary)
+        /occ_median_salary=MEDIAN(salary)
+        /occ_std_dev_salary=SD(salary).
+LIST.
+```
+
+Since we chose the `MODE=REPLACE` option, cases for the individual
+persons are no longer present.  They have been replaced by a single
+case for each distinct value of the break variable `occupation`.
+
+```
+                                Data List
+┌──────────────────┬───────────────┬─────────────────┬──────────────────┐
+│    occupation    │occ_mean_salary│occ_median_salary│occ_std_dev_salary│
+├──────────────────┼───────────────┼─────────────────┼──────────────────┤
+│Artist            │       37836.18│         34712.50│           7631.48│
+│Baker             │       45075.20│         45075.20│           4411.21│
+│Barrister         │       39504.00│         39504.00│                 .│
+│Carpenter         │       39349.11│         36190.04│           7453.40│
+│Cleaner           │       41142.50│         39647.49│          14378.98│
+│Cook              │       40357.79│         43194.00│          11064.51│
+│Manager           │       46452.14│         45657.56│           6901.69│
+│Mathematician     │       34531.06│         34763.06│           5267.68│
+│Painter           │       45063.55│         45063.55│          15159.67│
+│Payload Specialist│       34355.72│         34355.72│                 .│
+│Plumber           │       40413.91│         40410.00│           4726.05│
+│Scientist         │       36687.07│         36803.83│          10873.54│
+│Scrientist        │       42530.65│         42530.65│                 .│
+│Tailor            │       34586.79│         34586.79│           3728.98│
+└──────────────────┴───────────────┴─────────────────┴──────────────────┘
+```
+
+Note that some values for the standard deviation are blank.  This is
+because there is only one case with the respective occupation.
+
diff --git a/rust/doc/src/data/index.md b/rust/doc/src/data/index.md

new file mode 100644 (file)

index 0000000..d4f0517
--- /dev/null
+++ b/rust/doc/src/data/index.md
@@ -0,0 +1,2 @@
+The PSPP procedures in this chapter manipulate data and prepare the
+active dataset for later analyses.  They do not produce output.