From: Ben Pfaff <blp@cs.stanford.edu>
Date: Wed, 7 May 2025 16:01:32 +0000 (-0700)
Subject: work on manual
X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=bcc97775854126ca0491af060a228fb068d6f3e9;p=pspp

work on manual
---

diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md
index ea37e89028..4b01a29331 100644
--- a/rust/doc/src/SUMMARY.md
+++ b/rust/doc/src/SUMMARY.md
@@ -97,6 +97,10 @@
   - [VARIABLE WIDTH](commands/variables/variable-width.md)
   - [VECTOR](commands/variables/vector.md)
   - [WRITE FORMATS](commands/variables/write-formats.md)
+- [Transforming Data](commands/data/index.md)
+  - [AGGREGATE](commands/data/aggregate.md)
+  - [AUTORECODE](commands/data/autorecode.md)
+  - [COMPUTE](commands/data/compute.md)
 
 # Developer Documentation
 
diff --git a/rust/doc/src/commands/data/aggregate.md b/rust/doc/src/commands/data/aggregate.md
new file mode 100644
index 0000000000..55ca0dedc1
--- /dev/null
+++ b/rust/doc/src/commands/data/aggregate.md
@@ -0,0 +1,233 @@
+# AGGREGATE
+
+```
+AGGREGATE
+        [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]]
+        [/MISSING=COLUMNWISE]
+        [/PRESORTED]
+        [/DOCUMENT]
+        [/BREAK=VAR_LIST]
+        /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)...
+```
+
+`AGGREGATE` summarizes groups of cases into single cases.  It divides
+cases into groups that have the same values for one or more variables
+called "break variables".  Several functions are available for
+summarizing case contents.
+
+The `AGGREGATE` syntax consists of subcommands to control its
+behavior, all of which are optional, followed by one or more
+destination variable assignments, each of which uses an aggregation
+function to define how it is calculated.
+
+The `OUTFILE` subcommand, which must be first, names the destination
+for `AGGREGATE` output.  It may name a system file by file name or
+[file handle](../../language/files/file-handles.md), a
+[dataset](../../language/datasets/index.md) by its name, or `*` to
+replace the active dataset.  `AGGREGATE` writes its output to this
+file.
+
+With `OUTFILE=*` only, `MODE` may be specified immediately afterward
+with the value `ADDVARIABLES` or `REPLACE`:
+
+- With `REPLACE`, the default, the active dataset is replaced by a
+  new dataset which contains just the break variables and the
+  destination variables.  The new file contains as many cases as there
+  are unique combinations of the break variables.
+
+- With `ADDVARIABLES`, the destination variables are added to those
+  in the existing active dataset.  Cases that have the same
+  combination of values in their break variables receive identical
+  values for the destination variables.  The number of cases in the
+  active dataset remains unchanged.  The data must be sorted on the
+  break variables, that is, `ADDVARIABLES` implies `PRESORTED`.
+
+Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=*
+MODE=ADDVARIABLES` were specified.
+
+By default, `AGGREGATE` first sorts the data on the break variables.
+If the active dataset is already sorted or grouped by the break
+variables, specify `PRESORTED` to save time.  With
+`MODE=ADDVARIABLES`, the data must be pre-sorted.
+
+Specify `DOCUMENT` (see the `DOCUMENT` command) to copy the documents from the
+active dataset into the aggregate file.  Otherwise, the aggregate file
+does not contain any documents, even if the aggregate file replaces
+the active dataset.
+
+Normally, `AGGREGATE` produces a non-missing value whenever there is
+enough non-missing data for the aggregation function in use, that is,
+just one non-missing value or, for the `SD` and `SD.` aggregation
+functions, two non-missing values.  Specify `/MISSING=COLUMNWISE` to
+make `AGGREGATE` output a missing value when one or more of the input
+values are missing.
+
+The `BREAK` subcommand is optional but usually present.  On `BREAK`,
+list the variables used to divide the active dataset into groups to be
+summarized.
+
+`AGGREGATE` is particular about the order of subcommands.  `OUTFILE`
+must be first, followed by `MISSING`.  `PRESORTED` and `DOCUMENT`
+follow `MISSING`, in either order, followed by `BREAK`, then followed
+by aggregation variable specifications.
+
+At least one set of aggregation variables is required.  Each set
+comprises a list of aggregation variables, an equals sign (`=`), the
+name of an aggregation function (see the list below), and a list of
+source variables in parentheses.  A few aggregation functions do not
+accept source variables, and some aggregation functions expect
+additional arguments after the source variable names.
+
+`AGGREGATE` typically creates aggregation variables with no variable
+label, value labels, or missing values.  Their default print and write
+formats depend on the aggregation function used, with details given in
+the table below.  A variable label for an aggregation variable may be
+specified just after the variable's name in the aggregation variable
+list.
+
+Each set must have exactly as many source variables as aggregation
+variables.  Each aggregation variable receives the results of applying
+the specified aggregation function to the corresponding source variable.
+
+The following aggregation functions may be applied only to numeric
+variables:
+
+* `MEAN(VAR_NAME...)`  
+  Arithmetic mean.  Limited to numeric values.  The default format is
+  `F8.2`.
+
+* `MEDIAN(VAR_NAME...)`  
+  The median value.  Limited to numeric values.  The default format
+  is `F8.2`.
+
+* `SD(VAR_NAME...)`  
+  Standard deviation.  Limited to numeric values.  The
+  default format is `F8.2`.
+
+* `SUM(VAR_NAME...)`  
+  Sum.  Limited to numeric values.  The default format is `F8.2`.
+
+These aggregation functions may be applied to numeric and string
+variables:
+
+* `CGT(VAR_NAME..., VALUE)`  
+  `CLT(VAR_NAME..., VALUE)`  
+  `CIN(VAR_NAME..., LOW, HIGH)`  
+  `COUT(VAR_NAME..., LOW, HIGH)`  
+  Total weight of cases greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FGT(VAR_NAME..., VALUE)`  
+  `FLT(VAR_NAME..., VALUE)`  
+  `FIN(VAR_NAME..., LOW, HIGH)`  
+  `FOUT(VAR_NAME..., LOW, HIGH)`  
+  Fraction of values greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FIRST(VAR_NAME...)`  
+  `LAST(VAR_NAME...)`  
+  First or last non-missing value, respectively, in break group.  The
+  aggregation variable receives the complete dictionary information
+  from the source variable.  The sort performed by `AGGREGATE` (and
+  by `SORT CASES`) is stable.  This means that the first (or last)
+  case with particular values for the break variables before sorting
+  is also the first (or last) case in that break group after sorting.
+
+* `MIN(VAR_NAME...)`  
+  `MAX(VAR_NAME...)`  
+  Minimum or maximum value, respectively.  The aggregation variable
+  receives the complete dictionary information from the source
+  variable.
+
+* `N(VAR_NAME...)`  
+  `NMISS(VAR_NAME...)`  
+  Total weight of non-missing or missing values, respectively.  The
+  default format is `F7.0` if weighting is not enabled, `F8.2` if it is
+  (see the `WEIGHT` command).
+
+* `NU(VAR_NAME...)`  
+  `NUMISS(VAR_NAME...)`  
+  Count of non-missing or missing values, respectively, ignoring case
+  weights.  The default format is `F7.0`.
+
+* `PGT(VAR_NAME..., VALUE)`  
+  `PLT(VAR_NAME..., VALUE)`  
+  `PIN(VAR_NAME..., LOW, HIGH)`  
+  `POUT(VAR_NAME..., LOW, HIGH)`  
+  Percentage between 0 and 100 of values greater than or less than
+  `VALUE` or inside or outside the closed range `[LOW,HIGH]`,
+  respectively.  The default format is `F5.1`.
+
+These aggregation functions do not accept source variables:
+
+* `N`  
+  Total weight of cases aggregated to form this group.  The default
+  format is `F7.0` if weighting is not enabled, `F8.2` if it is (see
+  the `WEIGHT` command).
+
+* `NU`  
+  Count of cases aggregated to form this group, ignoring case
+  weights.  The default format is `F7.0`.
+
+Aggregation functions compare string values in terms of Unicode
+character codes.
+
+The aggregation functions listed above exclude all user-missing values
+from calculations.  To include user-missing values, insert a period
+(`.`) at the end of the function name (e.g. `SUM.`).  (Be aware that
+specifying such a function as the last token on a line causes the
+period to be interpreted as the end of the command.)
+
+`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings
+(see the `SPLIT FILE` command).
+
+## Example
+
+The `personnel.sav` dataset provides the occupations and salaries of
+many individuals.  For many purposes, however, such detailed
+information is not interesting, but the aggregated statistics of each
+occupation are.  Here, the `AGGREGATE` command is used to calculate
+the mean, the median, and the standard deviation of the salaries in
+each occupation.
+
+```
+GET FILE="personnel.sav".
+AGGREGATE OUTFILE=* MODE=REPLACE
+        /BREAK=occupation
+        /occ_mean_salary=MEAN(salary)
+        /occ_median_salary=MEDIAN(salary)
+        /occ_std_dev_salary=SD(salary).
+LIST.
+```
+
+Since we chose the `MODE=REPLACE` option, cases for the individual
+persons are no longer present.  They have been replaced by a single
+case per occupation.
+
+```
+                                Data List
+┌──────────────────┬───────────────┬─────────────────┬──────────────────┐
+│    occupation    │occ_mean_salary│occ_median_salary│occ_std_dev_salary│
+├──────────────────┼───────────────┼─────────────────┼──────────────────┤
+│Artist            │       37836.18│         34712.50│           7631.48│
+│Baker             │       45075.20│         45075.20│           4411.21│
+│Barrister         │       39504.00│         39504.00│                 .│
+│Carpenter         │       39349.11│         36190.04│           7453.40│
+│Cleaner           │       41142.50│         39647.49│          14378.98│
+│Cook              │       40357.79│         43194.00│          11064.51│
+│Manager           │       46452.14│         45657.56│           6901.69│
+│Mathematician     │       34531.06│         34763.06│           5267.68│
+│Painter           │       45063.55│         45063.55│          15159.67│
+│Payload Specialist│       34355.72│         34355.72│                 .│
+│Plumber           │       40413.91│         40410.00│           4726.05│
+│Scientist         │       36687.07│         36803.83│          10873.54│
+│Scrientist        │       42530.65│         42530.65│                 .│
+│Tailor            │       34586.79│         34586.79│           3728.98│
+└──────────────────┴───────────────┴─────────────────┴──────────────────┘
+```
+
+Some values for the standard deviation are missing (shown as `.`)
+because there is only one case with the respective occupation.
+
diff --git a/rust/doc/src/commands/data/autorecode.md b/rust/doc/src/commands/data/autorecode.md
new file mode 100644
index 0000000000..20831e6078
--- /dev/null
+++ b/rust/doc/src/commands/data/autorecode.md
@@ -0,0 +1,133 @@
+# AUTORECODE
+
+```
+AUTORECODE VARIABLES=SRC_VARS INTO DEST_VARS
+        [ /DESCENDING ]
+        [ /PRINT ]
+        [ /GROUP ]
+        [ /BLANK = {VALID, MISSING} ]
+```
+
+The `AUTORECODE` procedure considers the N values that a variable
+takes on and maps them onto values 1...N on a new numeric variable.
+
+Subcommand `VARIABLES` is the only required subcommand and must come
+first.  Specify `VARIABLES`, an equals sign (`=`), a list of source
+variables, `INTO`, and a list of target variables.  There must be the
+same number of source and target variables.  The target variables must
+not already exist.
+
+`AUTORECODE` ordinarily assigns each increasing non-missing value of a
+source variable (for a string, this is based on character code
+comparisons) to consecutive values of its target variable.  For
+example, the smallest non-missing value of the source variable is
+recoded to value 1, the next smallest to 2, and so on.  If the source
+variable has user-missing values, they are recoded to consecutive
+values just above the non-missing values.  For example, if a source
+variable has seven distinct non-missing values, then the smallest
+missing value would be recoded to 8, the next smallest to 9, and so
+on.
+
+Use `DESCENDING` to reverse the sort order for non-missing values, so
+that the largest non-missing value is recoded to 1, the second-largest
+to 2, and so on.  Even with `DESCENDING`, user-missing values are
+still recoded in ascending order just above the non-missing values.
+
+The system-missing value is always recoded into the system-missing
+value in target variables.
+
+If a source value has a value label, then that value label is retained
+for the new value in the target variable.  Otherwise, the source value
+itself becomes each new value's label.
+
+Variable labels are copied from the source to target variables.
+
+`PRINT` is currently ignored.
+
+The `GROUP` subcommand is relevant only if more than one variable is
+to be recoded.  It causes a single mapping between source and target
+values to be used, instead of one map per variable.  With `GROUP`,
+user-missing values are taken from the first source variable that has
+any user-missing values.
+
+If `/BLANK=MISSING` is given, then string values which contain only
+whitespace are recoded as SYSMIS.  If `/BLANK=VALID` is specified,
+then they are allocated a value like any other.  `/BLANK` is not
+relevant to numeric values.  `/BLANK=VALID` is the default.
+
+`AUTORECODE` is a procedure.  It causes the data to be read.
+
+## Example
+
+In the file `personnel.sav`, the variable occupation is a string
+variable.  Except for data of a purely commentary nature, string
+variables are generally a bad idea.  One reason is that data entry
+errors are easily overlooked.  This has happened in `personnel.sav`;
+one entry which should read "Scientist" has been mistyped as
+"Scrientist".  The syntax below corrects this error with a `DO IF`
+clause[^1], then uses `AUTORECODE` to create a new numeric variable
+which takes recoded values of occupation.  Finally, we remove
+the old variable and rename the new variable to the name of the old
+variable:
+
+[^1]: One must use care when correcting such data input errors rather
+than simply marking them as missing.  For example, if an occupation
+has been entered "Barister", did the person mean "Barrister" or
+"Barista"?
+
+```
+get file='personnel.sav'.
+
+* Correct a typing error in the original file.
+do if occupation = "Scrientist".
+ compute occupation = "Scientist".
+end if.
+
+autorecode
+   variables = occupation into occ
+   /blank = missing.
+
+* Delete the old variable.
+delete variables occupation.
+
+* Rename the new variable to the old variable's name.
+rename variables (occ = occupation).
+
+* Inspect the new variable.
+display dictionary /variables=occupation.
+```
+
+
+Notice, in the output below, how the new variable has been
+automatically allocated value labels which correspond to the strings
+of the old variable.  This means that in future analyses the
+descriptive strings are reported instead of the numeric values.
+
+```
+                                   Variables
++----------+--------+--------------+-----+-----+---------+----------+---------+
+|          |        |  Measurement |     |     |         |   Print  |  Write  |
+|Name      |Position|     Level    | Role|Width|Alignment|  Format  |  Format |
++----------+--------+--------------+-----+-----+---------+----------+---------+
+|occupation|       6|Unknown       |Input|    8|Right    |F2.0      |F2.0     |
++----------+--------+--------------+-----+-----+---------+----------+---------+
+
+            Value Labels
++---------------+------------------+
+|Variable Value |       Label      |
++---------------+------------------+
+|occupation 1   |Artist            |
+|           2   |Baker             |
+|           3   |Barrister         |
+|           4   |Carpenter         |
+|           5   |Cleaner           |
+|           6   |Cook              |
+|           7   |Manager           |
+|           8   |Mathematician     |
+|           9   |Painter           |
+|           10  |Payload Specialist|
+|           11  |Plumber           |
+|           12  |Scientist         |
+|           13  |Tailor            |
++---------------+------------------+
+```
diff --git a/rust/doc/src/commands/data/compute.md b/rust/doc/src/commands/data/compute.md
new file mode 100644
index 0000000000..dd3067bf52
--- /dev/null
+++ b/rust/doc/src/commands/data/compute.md
@@ -0,0 +1,87 @@
+# COMPUTE
+
+```
+COMPUTE VARIABLE = EXPRESSION.
+   or
+COMPUTE vector(INDEX) = EXPRESSION.
+```
+
+`COMPUTE` assigns the value of an expression to a target variable.
+For each case, the expression is evaluated and its value assigned to
+the target variable.  Numeric and string variables may be assigned.
+When a string expression's width differs from the target variable's
+width, the string result of the expression is truncated or padded with
+spaces on the right as necessary.  The expression and variable types
+must match.
+
+For numeric variables only, the target variable need not already
+exist.  Numeric variables created by `COMPUTE` are assigned an `F8.2`
+output format.  String variables must be declared before they can be
+used as targets for `COMPUTE`.
+
+The target variable may be specified as an element of a
+[vector](../../commands/variables/vector.md).  In this case, an
+expression `INDEX` must be specified in parentheses following the vector
+name.  The expression `INDEX` must evaluate to a numeric value that,
+after rounding down to the nearest integer, is a valid index for the
+named vector.
+
+Using `COMPUTE` to assign to a variable specified on
+[`LEAVE`](../../commands/variables/leave.md) resets the variable's
+left state.  Therefore, `LEAVE` should be specified following
+`COMPUTE`, not before.
+
+`COMPUTE` is a transformation.  It does not cause the active dataset
+to be read.
+
+When `COMPUTE` is specified following the `TEMPORARY` command,
+the [`LAG`](../../language/expressions/functions/miscellaneous.md)
+function may not be used.
+
+## Examples
+
+The dataset `physiology.sav` contains the height and weight of
+persons.  For some purposes, neither height nor weight alone is of
+interest.  Epidemiologists are often more interested in the "body mass
+index" which can sometimes be used as a predictor for clinical
+conditions.  The body mass index is defined as the weight of the
+person in kilograms divided by the square of the person's height in
+metres.[^1]
+
+[^1]: Since BMI is a quantity with a ratio scale and has units, the
+term "index" is a misnomer, but that is what it is called.
+
+```
+get file='physiology.sav'.
+
+* height is in mm so we must divide by 1000 to get metres.
+compute bmi = weight / (height/1000)**2.
+variable label bmi "Body Mass Index".
+
+descriptives /weight height bmi.
+```
+
+This syntax shows how you can use `COMPUTE` to generate a new variable
+called bmi and have every case's value calculated from the existing
+values of weight and height.  It also shows how you can [add a
+label](../../commands/variables/variable-labels.md) to this new
+variable, so that a more descriptive label appears in subsequent
+analyses, as can be seen in the output from the `DESCRIPTIVES`
+command, below.
+
+The expression which follows the `=` sign can be as complicated as
+necessary.  See [Expressions](../../language/expressions.md) for a
+full description of the language accepted.
+
+```
+                  Descriptive Statistics
+┌─────────────────────┬──┬───────┬───────┬───────┬───────┐
+│                     │ N│  Mean │Std Dev│Minimum│Maximum│
+├─────────────────────┼──┼───────┼───────┼───────┼───────┤
+│Weight in kilograms  │40│  72.12│  26.70│  −55.6│   92.1│
+│Height in millimeters│40│1677.12│ 262.87│    179│   1903│
+│Body Mass Index      │40│  67.46│ 274.08│ −21.62│1756.82│
+│Valid N (listwise)   │40│       │       │       │       │
+│Missing N (listwise) │ 0│       │       │       │       │
+└─────────────────────┴──┴───────┴───────┴───────┴───────┘
+```
diff --git a/rust/doc/src/commands/data/index.md b/rust/doc/src/commands/data/index.md
new file mode 100644
index 0000000000..d4f051786a
--- /dev/null
+++ b/rust/doc/src/commands/data/index.md
@@ -0,0 +1,2 @@
+The PSPP procedures in this chapter manipulate data and prepare the
+active dataset for later analyses.  They do not produce output.
diff --git a/rust/doc/src/data/aggregate.md b/rust/doc/src/data/aggregate.md
new file mode 100644
index 0000000000..fea17feb61
--- /dev/null
+++ b/rust/doc/src/data/aggregate.md
@@ -0,0 +1,233 @@
+# AGGREGATE
+
+```
+AGGREGATE
+        [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]]
+        [/MISSING=COLUMNWISE]
+        [/PRESORTED]
+        [/DOCUMENT]
+        [/BREAK=VAR_LIST]
+        /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)...
+```
+
+`AGGREGATE` summarizes groups of cases into single cases.  It divides
+cases into groups that have the same values for one or more variables
+called "break variables".  Several functions are available for
+summarizing case contents.
+
+The `AGGREGATE` syntax consists of subcommands to control its
+behavior, all of which are optional, followed by one or more
+destination variable assignments, each of which uses an aggregation
+function to define how it is calculated.
+
+The `OUTFILE` subcommand, which must be first, names the destination
+for `AGGREGATE` output.  It may name a system file by file name or
+[file handle](../../language/files/file-handles.md), a
+[dataset](../../language/datasets/index.md) by its name, or `*` to
+replace the active dataset.  `AGGREGATE` writes its output to this
+file.
+
+With `OUTFILE=*` only, `MODE` may be specified immediately afterward
+with the value `ADDVARIABLES` or `REPLACE`:
+
+- With `REPLACE`, the default, the active dataset is replaced by a
+  new dataset which contains just the break variables and the
+  destination variables.  The new file contains as many cases as there
+  are unique combinations of the break variables.
+
+- With `ADDVARIABLES`, the destination variables are added to those
+  in the existing active dataset.  Cases that have the same
+  combination of values in their break variables receive identical
+  values for the destination variables.  The number of cases in the
+  active dataset remains unchanged.  The data must be sorted on the
+  break variables, that is, `ADDVARIABLES` implies `PRESORTED`.
+
+Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=*
+MODE=ADDVARIABLES` were specified.
+
+By default, `AGGREGATE` first sorts the data on the break variables.
+If the active dataset is already sorted or grouped by the break
+variables, specify `PRESORTED` to save time.  With
+`MODE=ADDVARIABLES`, the data must be pre-sorted.
+
+Specify `DOCUMENT` (see the `DOCUMENT` command) to copy the documents from the
+active dataset into the aggregate file.  Otherwise, the aggregate file
+does not contain any documents, even if the aggregate file replaces
+the active dataset.
+
+Normally, `AGGREGATE` produces a non-missing value whenever there is
+enough non-missing data for the aggregation function in use, that is,
+just one non-missing value or, for the `SD` and `SD.` aggregation
+functions, two non-missing values.  Specify `/MISSING=COLUMNWISE` to
+make `AGGREGATE` output a missing value when one or more of the input
+values are missing.
+
+The `BREAK` subcommand is optional but usually present.  On `BREAK`,
+list the variables used to divide the active dataset into groups to be
+summarized.
+
+`AGGREGATE` is particular about the order of subcommands.  `OUTFILE`
+must be first, followed by `MISSING`.  `PRESORTED` and `DOCUMENT`
+follow `MISSING`, in either order, followed by `BREAK`, then followed
+by aggregation variable specifications.
+
+At least one set of aggregation variables is required.  Each set
+comprises a list of aggregation variables, an equals sign (`=`), the
+name of an aggregation function (see the list below), and a list of
+source variables in parentheses.  A few aggregation functions do not
+accept source variables, and some aggregation functions expect
+additional arguments after the source variable names.
+
+`AGGREGATE` typically creates aggregation variables with no variable
+label, value labels, or missing values.  Their default print and write
+formats depend on the aggregation function used, with details given in
+the table below.  A variable label for an aggregation variable may be
+specified just after the variable's name in the aggregation variable
+list.
+
+Each set must have exactly as many source variables as aggregation
+variables.  Each aggregation variable receives the results of applying
+the specified aggregation function to the corresponding source variable.
+
+The following aggregation functions may be applied only to numeric
+variables:
+
+* `MEAN(VAR_NAME...)`  
+  Arithmetic mean.  Limited to numeric values.  The default format is
+  `F8.2`.
+
+* `MEDIAN(VAR_NAME...)`  
+  The median value.  Limited to numeric values.  The default format
+  is `F8.2`.
+
+* `SD(VAR_NAME...)`  
+  Standard deviation.  Limited to numeric values.  The
+  default format is `F8.2`.
+
+* `SUM(VAR_NAME...)`  
+  Sum.  Limited to numeric values.  The default format is `F8.2`.
+
+These aggregation functions may be applied to numeric and string
+variables:
+
+* `CGT(VAR_NAME..., VALUE)`  
+  `CLT(VAR_NAME..., VALUE)`  
+  `CIN(VAR_NAME..., LOW, HIGH)`  
+  `COUT(VAR_NAME..., LOW, HIGH)`  
+  Total weight of cases greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FGT(VAR_NAME..., VALUE)`  
+  `FLT(VAR_NAME..., VALUE)`  
+  `FIN(VAR_NAME..., LOW, HIGH)`  
+  `FOUT(VAR_NAME..., LOW, HIGH)`  
+  Fraction of values greater than or less than `VALUE` or inside or
+  outside the closed range `[LOW,HIGH]`, respectively.  The default
+  format is `F5.3`.
+
+* `FIRST(VAR_NAME...)`  
+  `LAST(VAR_NAME...)`  
+  First or last non-missing value, respectively, in break group.  The
+  aggregation variable receives the complete dictionary information
+  from the source variable.  The sort performed by `AGGREGATE` (and
+  by `SORT CASES`) is stable.  This means that the first (or last)
+  case with particular values for the break variables before sorting
+  is also the first (or last) case in that break group after sorting.
+
+* `MIN(VAR_NAME...)`  
+  `MAX(VAR_NAME...)`  
+  Minimum or maximum value, respectively.  The aggregation variable
+  receives the complete dictionary information from the source
+  variable.
+
+* `N(VAR_NAME...)`  
+  `NMISS(VAR_NAME...)`  
+  Total weight of non-missing or missing values, respectively.  The
+  default format is `F7.0` if weighting is not enabled, `F8.2` if it is
+  (see the `WEIGHT` command).
+
+* `NU(VAR_NAME...)`  
+  `NUMISS(VAR_NAME...)`  
+  Count of non-missing or missing values, respectively, ignoring case
+  weights.  The default format is `F7.0`.
+
+* `PGT(VAR_NAME..., VALUE)`  
+  `PLT(VAR_NAME..., VALUE)`  
+  `PIN(VAR_NAME..., LOW, HIGH)`  
+  `POUT(VAR_NAME..., LOW, HIGH)`  
+  Percentage between 0 and 100 of values greater than or less than
+  `VALUE` or inside or outside the closed range `[LOW,HIGH]`,
+  respectively.  The default format is `F5.1`.
+
+These aggregation functions do not accept source variables:
+
+* `N`  
+  Total weight of cases aggregated to form this group.  The default
+  format is `F7.0` if weighting is not enabled, `F8.2` if it is (see
+  the `WEIGHT` command).
+
+* `NU`  
+  Count of cases aggregated to form this group, ignoring case
+  weights.  The default format is `F7.0`.
+
+Aggregation functions compare string values in terms of Unicode
+character codes.
+
+The aggregation functions listed above exclude all user-missing values
+from calculations.  To include user-missing values, insert a period
+(`.`) at the end of the function name (e.g. `SUM.`).  (Be aware that
+specifying such a function as the last token on a line causes the
+period to be interpreted as the end of the command.)
+
+`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings
+(see the `SPLIT FILE` command).
+
+## Aggregate Example
+
+The `personnel.sav` dataset provides the occupations and salaries of
+many individuals.  For many purposes, however, such detailed
+information is not interesting, but the aggregated statistics of each
+occupation are.  Here, the `AGGREGATE` command is used to calculate
+the mean, the median, and the standard deviation of the salaries in
+each occupation.
+
+```
+GET FILE="personnel.sav".
+AGGREGATE OUTFILE=* MODE=REPLACE
+        /BREAK=occupation
+        /occ_mean_salary=MEAN(salary)
+        /occ_median_salary=MEDIAN(salary)
+        /occ_std_dev_salary=SD(salary).
+LIST.
+```
+
+Since we chose the `MODE=REPLACE` option, cases for the individual
+persons are no longer present.  They have been replaced by a single
+case per occupation.
+
+```
+                                Data List
+┌──────────────────┬───────────────┬─────────────────┬──────────────────┐
+│    occupation    │occ_mean_salary│occ_median_salary│occ_std_dev_salary│
+├──────────────────┼───────────────┼─────────────────┼──────────────────┤
+│Artist            │       37836.18│         34712.50│           7631.48│
+│Baker             │       45075.20│         45075.20│           4411.21│
+│Barrister         │       39504.00│         39504.00│                 .│
+│Carpenter         │       39349.11│         36190.04│           7453.40│
+│Cleaner           │       41142.50│         39647.49│          14378.98│
+│Cook              │       40357.79│         43194.00│          11064.51│
+│Manager           │       46452.14│         45657.56│           6901.69│
+│Mathematician     │       34531.06│         34763.06│           5267.68│
+│Painter           │       45063.55│         45063.55│          15159.67│
+│Payload Specialist│       34355.72│         34355.72│                 .│
+│Plumber           │       40413.91│         40410.00│           4726.05│
+│Scientist         │       36687.07│         36803.83│          10873.54│
+│Scrientist        │       42530.65│         42530.65│                 .│
+│Tailor            │       34586.79│         34586.79│           3728.98│
+└──────────────────┴───────────────┴─────────────────┴──────────────────┘
+```
+
+Note that some values for the standard deviation are missing (shown as
+`.`).  This is because there is only one case with the respective occupation.
+
diff --git a/rust/doc/src/data/index.md b/rust/doc/src/data/index.md
new file mode 100644
index 0000000000..d4f051786a
--- /dev/null
+++ b/rust/doc/src/data/index.md
@@ -0,0 +1,2 @@
+The PSPP procedures in this chapter manipulate data and prepare the
+active dataset for later analyses.  They do not produce output.