From: Ben Pfaff Date: Wed, 7 May 2025 16:01:32 +0000 (-0700) Subject: work on manual X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=bcc97775854126ca0491af060a228fb068d6f3e9;p=pspp work on manual --- diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index ea37e89028..4b01a29331 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -97,6 +97,10 @@ - [VARIABLE WIDTH](commands/variables/variable-width.md) - [VECTOR](commands/variables/vector.md) - [WRITE FORMATS](commands/variables/write-formats.md) +- [Transforming Data](commands/data/index.md) + - [AGGREGATE](commands/data/aggregate.md) + - [AUTORECODE](commands/data/autorecode.md) + - [COMPUTE](commands/data/compute.md) # Developer Documentation diff --git a/rust/doc/src/commands/data/aggregate.md b/rust/doc/src/commands/data/aggregate.md new file mode 100644 index 0000000000..55ca0dedc1 --- /dev/null +++ b/rust/doc/src/commands/data/aggregate.md @@ -0,0 +1,233 @@ +# AGGREGATE + +``` +AGGREGATE + [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]] + [/MISSING=COLUMNWISE] + [/PRESORTED] + [/DOCUMENT] + [/BREAK=VAR_LIST] + /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)... +``` + +`AGGREGATE` summarizes groups of cases into single cases. It divides +cases into groups that have the same values for one or more variables +called "break variables". Several functions are available for +summarizing case contents. + +The `AGGREGATE` syntax consists of subcommands to control its +behavior, all of which are optional, followed by one or more +destination variable assignments, each of which uses an aggregation +function to define how it is calculated. + +The `OUTFILE` subcommand, which must be first, names the destination +for `AGGREGATE` output. It may name a system file by file name or +[file handle](../../language/files/file-handles.md), a +[dataset](../../language/datasets/index.md) by its name, or `*` to +replace the active dataset. 
`AGGREGATE` writes its output to this +file. + +With `OUTFILE=*` only, `MODE` may be specified immediately afterward +with the value `ADDVARIABLES` or `REPLACE`: + +- With `REPLACE`, the default, the active dataset is replaced by a + new dataset which contains just the break variables and the + destination variables. The new dataset contains as many cases as there + are unique combinations of the break variables. + +- With `ADDVARIABLES`, the destination variables are added to those + in the existing active dataset. Cases that have the same + combination of values in their break variables receive identical + values for the destination variables. The number of cases in the + active dataset remains unchanged. The data must be sorted on the + break variables; that is, `ADDVARIABLES` implies `PRESORTED`. + +Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=* +MODE=ADDVARIABLES` were specified. + +By default, `AGGREGATE` first sorts the data on the break variables. +If the active dataset is already sorted or grouped by the break +variables, specify `PRESORTED` to save time. With +`MODE=ADDVARIABLES`, the data must be pre-sorted. + +Specify `DOCUMENT` (see `DOCUMENT`) to copy the documents from the +active dataset into the aggregate file. Otherwise, the aggregate file +does not contain any documents, even if the aggregate file replaces +the active dataset. + +Normally, `AGGREGATE` produces a non-missing value whenever there is +enough non-missing data for the aggregation function in use, that is, +just one non-missing value or, for the `SD` and `SD.` aggregation +functions, two non-missing values. Specify `/MISSING=COLUMNWISE` to +make `AGGREGATE` output a missing value when one or more of the input +values are missing. + +The `BREAK` subcommand is optional but usually present. On `BREAK`, +list the variables used to divide the active dataset into groups to be +summarized. + +`AGGREGATE` is particular about the order of subcommands. 
`OUTFILE` +must be first, followed by `MISSING`. `PRESORTED` and `DOCUMENT` +follow `MISSING`, in either order, followed by `BREAK`, then followed +by aggregation variable specifications. + +At least one set of aggregation variables is required. Each set +comprises a list of aggregation variables, an equals sign (`=`), the +name of an aggregation function (see the list below), and a list of +source variables in parentheses. A few aggregation functions do not +accept source variables, and some aggregation functions expect +additional arguments after the source variable names. + +`AGGREGATE` typically creates aggregation variables with no variable +label, value labels, or missing values. Their default print and write +formats depend on the aggregation function used, with details given in +the table below. A variable label for an aggregation variable may be +specified just after the variable's name in the aggregation variable +list. + +Each set must have exactly as many source variables as aggregation +variables. Each aggregation variable receives the results of applying +the specified aggregation function to the corresponding source variable. + +The following aggregation functions may be applied only to numeric +variables: + +* `MEAN(VAR_NAME...)` + Arithmetic mean. Limited to numeric values. The default format is + `F8.2`. + +* `MEDIAN(VAR_NAME...)` + The median value. Limited to numeric values. The default format + is `F8.2`. + +* `SD(VAR_NAME...)` + Standard deviation of the mean. Limited to numeric values. The + default format is `F8.2`. + +* `SUM(VAR_NAME...)` + Sum. Limited to numeric values. The default format is `F8.2`. + + These aggregation functions may be applied to numeric and string +variables: + +* `CGT(VAR_NAME..., VALUE)` + `CLT(VAR_NAME..., VALUE)` + `CIN(VAR_NAME..., LOW, HIGH)` + `COUT(VAR_NAME..., LOW, HIGH)` + Total weight of cases greater than or less than `VALUE` or inside or + outside the closed range `[LOW,HIGH]`, respectively. 
The default + format is `F5.3`. + +* `FGT(VAR_NAME..., VALUE)` + `FLT(VAR_NAME..., VALUE)` + `FIN(VAR_NAME..., LOW, HIGH)` + `FOUT(VAR_NAME..., LOW, HIGH)` + Fraction of values greater than or less than `VALUE` or inside or + outside the closed range `[LOW,HIGH]`, respectively. The default + format is `F5.3`. + +* `FIRST(VAR_NAME...)` + `LAST(VAR_NAME...)` + First or last non-missing value, respectively, in break group. The + aggregation variable receives the complete dictionary information + from the source variable. The sort performed by `AGGREGATE` (and + by `SORT CASES`) is stable. This means that the first (or last) + case with particular values for the break variables before sorting + is also the first (or last) case in that break group after sorting. + +* `MIN(VAR_NAME...)` + `MAX(VAR_NAME...)` + Minimum or maximum value, respectively. The aggregation variable + receives the complete dictionary information from the source + variable. + +* `N(VAR_NAME...)` + `NMISS(VAR_NAME...)` + Total weight of non-missing or missing values, respectively. The + default format is `F7.0` if weighting is not enabled, `F8.2` if it is + (*note WEIGHT::). + +* `NU(VAR_NAME...)` + `NUMISS(VAR_NAME...)` + Count of non-missing or missing values, respectively, ignoring case + weights. The default format is `F7.0`. + +* `PGT(VAR_NAME..., VALUE)` + `PLT(VAR_NAME..., VALUE)` + `PIN(VAR_NAME..., LOW, HIGH)` + `POUT(VAR_NAME..., LOW, HIGH)` + Percentage between 0 and 100 of values greater than or less than + `VALUE` or inside or outside the closed range `[LOW,HIGH]`, + respectively. The default format is `F5.1`. + +These aggregation functions do not accept source variables: + +* `N` + Total weight of cases aggregated to form this group. The default + format is `F7.0` if weighting is not enabled, `F8.2` if it is (*note + WEIGHT::). + +* `NU` + Count of cases aggregated to form this group, ignoring case + weights. The default format is `F7.0`. 
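The distinction between the weighted counts (`N`, `NMISS`) and the unweighted counts (`NU`, `NUMISS`) can be modeled outside PSPP. The following Python sketch is not PSPP code: it represents one break group as a list of `(value, weight)` pairs, uses `None` to stand in for a missing value, and the helper names `n_weighted`, `nu_unweighted`, and `n_group` are hypothetical.

```python
# Illustrative model only: one break group as (value, weight) pairs.
# None stands in for a missing value; all names here are hypothetical.
group = [(10.0, 2.0), (None, 1.5), (7.0, 1.0), (3.0, 0.5)]


def n_weighted(cases, missing=False):
    """Total weight of non-missing (or missing) values, like N(var) / NMISS(var)."""
    return sum(w for v, w in cases if (v is None) == missing)


def nu_unweighted(cases, missing=False):
    """Count of non-missing (or missing) values ignoring weights, like NU(var) / NUMISS(var)."""
    return sum(1 for v, _ in cases if (v is None) == missing)


def n_group(cases):
    """Total weight of every case in the group, like N with no source variable."""
    return sum(w for _, w in cases)
```

With weighting disabled every weight is 1, so the weighted and unweighted results coincide, which is consistent with the `F7.0` default format documented for that case.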
+ +Aggregation functions compare string values in terms of Unicode +character codes. + +The aggregation functions listed above exclude all user-missing values +from calculations. To include user-missing values, insert a period +(`.`) at the end of the function name (e.g. `SUM.`). (Be aware that +specifying such a function as the last token on a line causes the +period to be interpreted as the end of the command.) + +`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings +(see `SPLIT FILE`). + +## Example + +The `personnel.sav` dataset provides the occupations and salaries of +many individuals. For many purposes, however, such detailed information +is not interesting, and the aggregated statistics of each +occupation are what matter. Here, the `AGGREGATE` command is used to +calculate the mean, the median, and the standard deviation of salary for each +occupation. + +``` +GET FILE="personnel.sav". +AGGREGATE OUTFILE=* MODE=REPLACE + /BREAK=occupation + /occ_mean_salary=MEAN(salary) + /occ_median_salary=MEDIAN(salary) + /occ_std_dev_salary=SD(salary). +LIST. +``` + +Since we chose the `MODE=REPLACE` option, cases for the individual +persons are no longer present. They have been replaced by a +single case per occupation. 
+ +``` + Data List +┌──────────────────┬───────────────┬─────────────────┬──────────────────┐ +│ occupation │occ_mean_salary│occ_median_salary│occ_std_dev_salary│ +├──────────────────┼───────────────┼─────────────────┼──────────────────┤ +│Artist │ 37836.18│ 34712.50│ 7631.48│ +│Baker │ 45075.20│ 45075.20│ 4411.21│ +│Barrister │ 39504.00│ 39504.00│ .│ +│Carpenter │ 39349.11│ 36190.04│ 7453.40│ +│Cleaner │ 41142.50│ 39647.49│ 14378.98│ +│Cook │ 40357.79│ 43194.00│ 11064.51│ +│Manager │ 46452.14│ 45657.56│ 6901.69│ +│Mathematician │ 34531.06│ 34763.06│ 5267.68│ +│Painter │ 45063.55│ 45063.55│ 15159.67│ +│Payload Specialist│ 34355.72│ 34355.72│ .│ +│Plumber │ 40413.91│ 40410.00│ 4726.05│ +│Scientist │ 36687.07│ 36803.83│ 10873.54│ +│Scrientist │ 42530.65│ 42530.65│ .│ +│Tailor │ 34586.79│ 34586.79│ 3728.98│ +└──────────────────┴───────────────┴─────────────────┴──────────────────┘ +``` + +Some values for the standard deviation are missing (shown as `.`) because there is only +one case with the respective occupation. + diff --git a/rust/doc/src/commands/data/autorecode.md b/rust/doc/src/commands/data/autorecode.md new file mode 100644 index 0000000000..20831e6078 --- /dev/null +++ b/rust/doc/src/commands/data/autorecode.md @@ -0,0 +1,133 @@ +# AUTORECODE + +``` +AUTORECODE VARIABLES=SRC_VARS INTO DEST_VARS + [ /DESCENDING ] + [ /PRINT ] + [ /GROUP ] + [ /BLANK = {VALID, MISSING} ] +``` + +The `AUTORECODE` procedure considers the N distinct values that a variable +takes on and maps them onto values 1...N on a new numeric variable. + +Subcommand `VARIABLES` is the only required subcommand and must come +first. Specify `VARIABLES`, an equals sign (`=`), a list of source +variables, `INTO`, and a list of target variables. There must be the +same number of source and target variables. The target variables must +not already exist. 
+ +`AUTORECODE` ordinarily assigns each increasing non-missing value of a +source variable (for a string, this is based on character code +comparisons) to consecutive values of its target variable. For +example, the smallest non-missing value of the source variable is +recoded to value 1, the next smallest to 2, and so on. If the source +variable has user-missing values, they are recoded to consecutive +values just above the non-missing values. For example, if a source +variable has seven distinct non-missing values, then the smallest +missing value would be recoded to 8, the next smallest to 9, and so +on. + +Use `DESCENDING` to reverse the sort order for non-missing values, so +that the largest non-missing value is recoded to 1, the second-largest +to 2, and so on. Even with `DESCENDING`, user-missing values are +still recoded in ascending order just above the non-missing values. + +The system-missing value is always recoded to the system-missing +value in target variables. + +If a source value has a value label, then that value label is retained +for the new value in the target variable. Otherwise, the source value +itself becomes each new value's label. + +Variable labels are copied from the source to target variables. + +`PRINT` is currently ignored. + +The `GROUP` subcommand is relevant only if more than one variable is +to be recoded. It causes a single mapping between source and target +values to be used, instead of one map per variable. With `GROUP`, +user-missing values are taken from the first source variable that has +any user-missing values. + +If `/BLANK=MISSING` is given, then string values which contain +only whitespace are recoded to the system-missing value. If `/BLANK=VALID` is specified +then they are assigned a value like any other. `/BLANK` is not +relevant to numeric values. `/BLANK=VALID` is the default. + +`AUTORECODE` is a procedure. It causes the data to be read. 
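The ordering rules described above (ascending non-missing values first, user-missing values just above them, `DESCENDING` affecting only the non-missing values) can be sketched in a few lines of Python. This is an illustrative model, not PSPP's implementation: the function name `autorecode` and its parameters are hypothetical, `None` stands in for the system-missing value, only values that actually occur in the data are assumed to receive a code, and `GROUP` and `BLANK` are not modeled.

```python
def autorecode(values, user_missing=(), descending=False):
    """Map each distinct source value to its recoded integer (hypothetical helper).

    None models the system-missing value and never receives a code.
    """
    present = {v for v in values if v is not None}
    # Non-missing values are ordered first; DESCENDING reverses only these.
    valid = sorted((v for v in present if v not in user_missing),
                   reverse=descending)
    # User-missing values stay in ascending order, just above the valid ones.
    missing = sorted(v for v in present if v in user_missing)
    return {v: code for code, v in enumerate(valid + missing, start=1)}


occupations = ["Baker", "Artist", "Cook", "Artist", None, "Unknown"]
ascending = autorecode(occupations, user_missing=("Unknown",))
reversed_codes = autorecode(occupations, user_missing=("Unknown",),
                            descending=True)
```

Python compares strings by code point, which matches the character-code comparison described for string source variables.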
+ +## Example + +In the file `personnel.sav`, the variable `occupation` is a string +variable. Except for data of a purely commentary nature, string +variables are generally a bad idea. One reason is that data entry +errors are easily overlooked. This has happened in `personnel.sav`; +one entry which should read "Scientist" has been mistyped as +"Scrientist". The syntax below corrects this error in a +`DO IF` clause[^1], then uses `AUTORECODE` to create a new numeric +variable which takes recoded values of `occupation`. Finally, we remove +the old variable and rename the new variable to the name of the old +variable: + +[^1]: One must use care when correcting such data input errors rather +than simply marking them as missing. For example, if an occupation +has been entered as "Barister", did the person mean "Barrister" or +"Barista"? + +``` +get file='personnel.sav'. + +* Correct a typing error in the original file. +do if occupation = "Scrientist". + compute occupation = "Scientist". +end if. + +autorecode + variables = occupation into occ + /blank = missing. + +* Delete the old variable. +delete variables occupation. + +* Rename the new variable to the old variable's name. +rename variables (occ = occupation). + +* Inspect the new variable. +display dictionary /variables=occupation. +``` + + +Notice, in the output below, how the new variable has been +automatically assigned value labels which correspond to the strings +of the old variable. This means that in future analyses the +descriptive strings are reported instead of the numeric values. 
+ +``` + Variables ++----------+--------+--------------+-----+-----+---------+----------+---------+ +| | | Measurement | | | | Print | Write | +|Name |Position| Level | Role|Width|Alignment| Format | Format | ++----------+--------+--------------+-----+-----+---------+----------+---------+ +|occupation| 6|Unknown |Input| 8|Right |F2.0 |F2.0 | ++----------+--------+--------------+-----+-----+---------+----------+---------+ + + Value Labels ++---------------+------------------+ +|Variable Value | Label | ++---------------+------------------+ +|occupation 1 |Artist | +| 2 |Baker | +| 3 |Barrister | +| 4 |Carpenter | +| 5 |Cleaner | +| 6 |Cook | +| 7 |Manager | +| 8 |Mathematician | +| 9 |Painter | +| 10 |Payload Specialist| +| 11 |Plumber | +| 12 |Scientist | +| 13 |Tailor | ++---------------+------------------+ +``` diff --git a/rust/doc/src/commands/data/compute.md b/rust/doc/src/commands/data/compute.md new file mode 100644 index 0000000000..dd3067bf52 --- /dev/null +++ b/rust/doc/src/commands/data/compute.md @@ -0,0 +1,87 @@ +# COMPUTE + +``` +COMPUTE VARIABLE = EXPRESSION. + or +COMPUTE vector(INDEX) = EXPRESSION. +``` + +`COMPUTE` assigns the value of an expression to a target variable. +For each case, the expression is evaluated and its value assigned to +the target variable. Numeric and string variables may be assigned. +When a string expression's width differs from the target variable's +width, the string result of the expression is truncated or padded with +spaces on the right as necessary. The expression and variable types +must match. + +For numeric variables only, the target variable need not already +exist. Numeric variables created by `COMPUTE` are assigned an `F8.2` +output format. String variables must be declared before they can be +used as targets for `COMPUTE`. + +The target variable may be specified as an element of a +[vector](../../commands/variables/vector.md). 
In this case, an +expression `INDEX` must be specified in parentheses following the vector +name. The expression `INDEX` must evaluate to a numeric value that, +after rounding down to the nearest integer, is a valid index for the +named vector. + +Using `COMPUTE` to assign to a variable specified on +[`LEAVE`](../../commands/variables/leave.md) resets the variable's +left state. Therefore, `LEAVE` should be specified following +`COMPUTE`, not before. + +`COMPUTE` is a transformation. It does not cause the active dataset +to be read. + +When `COMPUTE` is specified following `TEMPORARY` (see `TEMPORARY`), +the [`LAG`](../../language/expressions/functions/miscellaneous.md) +function may not be used. + +## Examples + +The dataset `physiology.sav` contains the height and weight of +persons. For some purposes, neither height nor weight alone is of +interest. Epidemiologists are often more interested in the "body mass +index" which can sometimes be used as a predictor for clinical +conditions. The body mass index is defined as the weight of the +person in kilograms divided by the square of the person's height in +metres.[^1] + +[^1]: Since BMI is a quantity with a ratio scale and has units, the +term "index" is a misnomer, but that is what it is called. + +``` +get file='physiology.sav'. + +* height is in mm so we must divide by 1000 to get metres. +compute bmi = weight / (height/1000)**2. +variable label bmi "Body Mass Index". + +descriptives /weight height bmi. +``` + +This syntax shows how you can use `COMPUTE` to generate a new variable +called `bmi` and have every case's value calculated from the existing +values of `weight` and `height`. It also shows how you can [add a +label](../../commands/variables/variable-labels.md) to this new +variable, so that a more descriptive label appears in subsequent +analyses. The label can be seen in the output from the `DESCRIPTIVES` +command, below. + +The expression which follows the `=` sign can be as complicated as +necessary. 
See [Expressions](../../language/expressions.md) for a +full description of the language accepted. + +``` + Descriptive Statistics +┌─────────────────────┬──┬───────┬───────┬───────┬───────┐ +│ │ N│ Mean │Std Dev│Minimum│Maximum│ +├─────────────────────┼──┼───────┼───────┼───────┼───────┤ +│Weight in kilograms │40│ 72.12│ 26.70│ -55.6│ 92.1│ +│Height in millimeters│40│1677.12│ 262.87│ 179│ 1903│ +│Body Mass Index │40│ 67.46│ 274.08│ -21.62│1756.82│ +│Valid N (listwise) │40│ │ │ │ │ +│Missing N (listwise) │ 0│ │ │ │ │ +└─────────────────────┴──┴───────┴───────┴───────┴───────┘ +``` diff --git a/rust/doc/src/commands/data/index.md b/rust/doc/src/commands/data/index.md new file mode 100644 index 0000000000..d4f051786a --- /dev/null +++ b/rust/doc/src/commands/data/index.md @@ -0,0 +1,2 @@ +The PSPP procedures in this chapter manipulate data and prepare the +active dataset for later analyses. They do not produce output. diff --git a/rust/doc/src/data/aggregate.md b/rust/doc/src/data/aggregate.md new file mode 100644 index 0000000000..fea17feb61 --- /dev/null +++ b/rust/doc/src/data/aggregate.md @@ -0,0 +1,233 @@ +# AGGREGATE + +``` +AGGREGATE + [OUTFILE={*,'FILE_NAME',FILE_HANDLE} [MODE={REPLACE,ADDVARIABLES}]] + [/MISSING=COLUMNWISE] + [/PRESORTED] + [/DOCUMENT] + [/BREAK=VAR_LIST] + /DEST_VAR['LABEL']...=AGR_FUNC(SRC_VARS[, ARGS]...)... +``` + +`AGGREGATE` summarizes groups of cases into single cases. It divides +cases into groups that have the same values for one or more variables +called "break variables". Several functions are available for +summarizing case contents. + +The `AGGREGATE` syntax consists of subcommands to control its +behavior, all of which are optional, followed by one or more +destination variable assignments, each of which uses an aggregation +function to define how it is calculated. + +The `OUTFILE` subcommand, which must be first, names the destination +for `AGGREGATE` output. 
It may name a system file by file name or +[file handle](../../language/files/file-handles.md), a +[dataset](../../language/datasets/index.md) by its name, or `*` to +replace the active dataset. `AGGREGATE` writes its output to this +file. + +With `OUTFILE=*` only, `MODE` may be specified immediately afterward +with the value `ADDVARIABLES` or `REPLACE`: + +- With `REPLACE`, the default, the active dataset is replaced by a + new dataset which contains just the break variables and the + destination variables. The new dataset contains as many cases as there + are unique combinations of the break variables. + +- With `ADDVARIABLES`, the destination variables are added to those + in the existing active dataset. Cases that have the same + combination of values in their break variables receive identical + values for the destination variables. The number of cases in the + active dataset remains unchanged. The data must be sorted on the + break variables; that is, `ADDVARIABLES` implies `PRESORTED`. + +Without `OUTFILE`, `AGGREGATE` acts as if `OUTFILE=* +MODE=ADDVARIABLES` were specified. + +By default, `AGGREGATE` first sorts the data on the break variables. +If the active dataset is already sorted or grouped by the break +variables, specify `PRESORTED` to save time. With +`MODE=ADDVARIABLES`, the data must be pre-sorted. + +Specify `DOCUMENT` (see `DOCUMENT`) to copy the documents from the +active dataset into the aggregate file. Otherwise, the aggregate file +does not contain any documents, even if the aggregate file replaces +the active dataset. + +Normally, `AGGREGATE` produces a non-missing value whenever there is +enough non-missing data for the aggregation function in use, that is, +just one non-missing value or, for the `SD` and `SD.` aggregation +functions, two non-missing values. Specify `/MISSING=COLUMNWISE` to +make `AGGREGATE` output a missing value when one or more of the input +values are missing. + +The `BREAK` subcommand is optional but usually present. 
On `BREAK`, +list the variables used to divide the active dataset into groups to be +summarized. + +`AGGREGATE` is particular about the order of subcommands. `OUTFILE` +must be first, followed by `MISSING`. `PRESORTED` and `DOCUMENT` +follow `MISSING`, in either order, followed by `BREAK`, then followed +by aggregation variable specifications. + +At least one set of aggregation variables is required. Each set +comprises a list of aggregation variables, an equals sign (`=`), the +name of an aggregation function (see the list below), and a list of +source variables in parentheses. A few aggregation functions do not +accept source variables, and some aggregation functions expect +additional arguments after the source variable names. + +`AGGREGATE` typically creates aggregation variables with no variable +label, value labels, or missing values. Their default print and write +formats depend on the aggregation function used, with details given in +the table below. A variable label for an aggregation variable may be +specified just after the variable's name in the aggregation variable +list. + +Each set must have exactly as many source variables as aggregation +variables. Each aggregation variable receives the results of applying +the specified aggregation function to the corresponding source variable. + +The following aggregation functions may be applied only to numeric +variables: + +* `MEAN(VAR_NAME...)` + Arithmetic mean. Limited to numeric values. The default format is + `F8.2`. + +* `MEDIAN(VAR_NAME...)` + The median value. Limited to numeric values. The default format + is `F8.2`. + +* `SD(VAR_NAME...)` + Standard deviation of the mean. Limited to numeric values. The + default format is `F8.2`. + +* `SUM(VAR_NAME...)` + Sum. Limited to numeric values. The default format is `F8.2`. 
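The interaction between the numeric-only functions and the missing-value rules (one non-missing value suffices, two for `SD`, and `/MISSING=COLUMNWISE` makes any missing input yield a missing result) can be modeled as follows. This Python sketch is illustrative rather than PSPP's implementation: the function name `aggregate` and its parameters are hypothetical, `None` models a missing value, case weights are ignored, and `SD` is assumed to be the sample (n - 1) standard deviation, which is consistent with single-case groups producing a missing value.

```python
import statistics


def aggregate(func, values, columnwise=False):
    """Apply one numeric aggregation to a break group (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if columnwise and len(present) < len(values):
        return None  # COLUMNWISE: any missing input makes the result missing
    needed = 2 if func == "SD" else 1
    if len(present) < needed:
        return None  # not enough non-missing data for this function
    if func == "MEAN":
        return statistics.mean(present)
    if func == "MEDIAN":
        return statistics.median(present)
    if func == "SD":
        return statistics.stdev(present)  # sample standard deviation (assumed)
    if func == "SUM":
        return sum(present)
    raise ValueError(f"unknown aggregation function {func!r}")


salaries = [30000.0, None, 50000.0]
mean_default = aggregate("MEAN", salaries)
mean_columnwise = aggregate("MEAN", salaries, columnwise=True)
sd_single_case = aggregate("SD", [39504.0])
```

Here `mean_default` is computed from the two non-missing salaries, while `mean_columnwise` and `sd_single_case` are both missing.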
+ + These aggregation functions may be applied to numeric and string +variables: + +* `CGT(VAR_NAME..., VALUE)` + `CLT(VAR_NAME..., VALUE)` + `CIN(VAR_NAME..., LOW, HIGH)` + `COUT(VAR_NAME..., LOW, HIGH)` + Total weight of cases greater than or less than `VALUE` or inside or + outside the closed range `[LOW,HIGH]`, respectively. The default + format is `F5.3`. + +* `FGT(VAR_NAME..., VALUE)` + `FLT(VAR_NAME..., VALUE)` + `FIN(VAR_NAME..., LOW, HIGH)` + `FOUT(VAR_NAME..., LOW, HIGH)` + Fraction of values greater than or less than `VALUE` or inside or + outside the closed range `[LOW,HIGH]`, respectively. The default + format is `F5.3`. + +* `FIRST(VAR_NAME...)` + `LAST(VAR_NAME...)` + First or last non-missing value, respectively, in break group. The + aggregation variable receives the complete dictionary information + from the source variable. The sort performed by `AGGREGATE` (and + by `SORT CASES`) is stable. This means that the first (or last) + case with particular values for the break variables before sorting + is also the first (or last) case in that break group after sorting. + +* `MIN(VAR_NAME...)` + `MAX(VAR_NAME...)` + Minimum or maximum value, respectively. The aggregation variable + receives the complete dictionary information from the source + variable. + +* `N(VAR_NAME...)` + `NMISS(VAR_NAME...)` + Total weight of non-missing or missing values, respectively. The + default format is `F7.0` if weighting is not enabled, `F8.2` if it is + (*note WEIGHT::). + +* `NU(VAR_NAME...)` + `NUMISS(VAR_NAME...)` + Count of non-missing or missing values, respectively, ignoring case + weights. The default format is `F7.0`. + +* `PGT(VAR_NAME..., VALUE)` + `PLT(VAR_NAME..., VALUE)` + `PIN(VAR_NAME..., LOW, HIGH)` + `POUT(VAR_NAME..., LOW, HIGH)` + Percentage between 0 and 100 of values greater than or less than + `VALUE` or inside or outside the closed range `[LOW,HIGH]`, + respectively. The default format is `F5.1`. 
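The count, fraction, and percentage families (`C*`, `F*`, `P*`) differ only in how the same comparison is scaled. The Python sketch below models that relationship; it is not PSPP code. Cases are `(value, weight)` pairs, `None` models a missing value, all function names are hypothetical, and the denominator (total weight of non-missing values) is an assumption made for illustration.

```python
def count_matching(cases, test):
    """Total weight of non-missing values satisfying test, like CGT/CLT/CIN/COUT."""
    return sum(w for v, w in cases if v is not None and test(v))


def fraction_matching(cases, test):
    """Matching weight over non-missing weight, like FGT/FLT/FIN/FOUT (assumed denominator)."""
    total = sum(w for v, w in cases if v is not None)
    return count_matching(cases, test) / total if total else None


def percent_matching(cases, test):
    """Fraction scaled to 0..100, like PGT/PLT/PIN/POUT."""
    fraction = fraction_matching(cases, test)
    return None if fraction is None else 100.0 * fraction


def gt(value):
    return lambda v: v > value


def lt(value):
    return lambda v: v < value


def inside(low, high):
    return lambda v: low <= v <= high  # closed range [LOW,HIGH]


def outside(low, high):
    return lambda v: v < low or v > high


cases = [(1, 1.0), (5, 1.0), (10, 2.0), (None, 1.0)]
```

Because the range is closed, a value equal to `LOW` or `HIGH` counts as inside, so `inside` and `outside` always partition the non-missing values.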
+ +These aggregation functions do not accept source variables: + +* `N` + Total weight of cases aggregated to form this group. The default + format is `F7.0` if weighting is not enabled, `F8.2` if it is (see + `WEIGHT`). + +* `NU` + Count of cases aggregated to form this group, ignoring case + weights. The default format is `F7.0`. + +Aggregation functions compare string values in terms of Unicode +character codes. + +The aggregation functions listed above exclude all user-missing values +from calculations. To include user-missing values, insert a period +(`.`) at the end of the function name (e.g. `SUM.`). (Be aware that +specifying such a function as the last token on a line causes the +period to be interpreted as the end of the command.) + +`AGGREGATE` both ignores and cancels the current `SPLIT FILE` settings +(see `SPLIT FILE`). + +## Aggregate Example + +The `personnel.sav` dataset provides the occupations and salaries of +many individuals. For many purposes, however, such detailed information +is not interesting, and the aggregated statistics of each +occupation are what matter. Here, the `AGGREGATE` command is used to +calculate the mean, the median, and the standard deviation of salary for each +occupation. + +``` +GET FILE="personnel.sav". +AGGREGATE OUTFILE=* MODE=REPLACE + /BREAK=occupation + /occ_mean_salary=MEAN(salary) + /occ_median_salary=MEDIAN(salary) + /occ_std_dev_salary=SD(salary). +LIST. +``` + +Since we chose the `MODE=REPLACE` option, cases for the individual +persons are no longer present. They have been replaced by a +single case per occupation. 
+ +``` + Data List +┌──────────────────┬───────────────┬─────────────────┬──────────────────┐ +│ occupation │occ_mean_salary│occ_median_salary│occ_std_dev_salary│ +├──────────────────┼───────────────┼─────────────────┼──────────────────┤ +│Artist │ 37836.18│ 34712.50│ 7631.48│ +│Baker │ 45075.20│ 45075.20│ 4411.21│ +│Barrister │ 39504.00│ 39504.00│ .│ +│Carpenter │ 39349.11│ 36190.04│ 7453.40│ +│Cleaner │ 41142.50│ 39647.49│ 14378.98│ +│Cook │ 40357.79│ 43194.00│ 11064.51│ +│Manager │ 46452.14│ 45657.56│ 6901.69│ +│Mathematician │ 34531.06│ 34763.06│ 5267.68│ +│Painter │ 45063.55│ 45063.55│ 15159.67│ +│Payload Specialist│ 34355.72│ 34355.72│ .│ +│Plumber │ 40413.91│ 40410.00│ 4726.05│ +│Scientist │ 36687.07│ 36803.83│ 10873.54│ +│Scrientist │ 42530.65│ 42530.65│ .│ +│Tailor │ 34586.79│ 34586.79│ 3728.98│ +└──────────────────┴───────────────┴─────────────────┴──────────────────┘ +``` + +Note that some values for the standard deviation are blank. This is +because there is only one case with the respective occupation. + diff --git a/rust/doc/src/data/index.md b/rust/doc/src/data/index.md new file mode 100644 index 0000000000..d4f051786a --- /dev/null +++ b/rust/doc/src/data/index.md @@ -0,0 +1,2 @@ +The PSPP procedures in this chapter manipulate data and prepare the +active dataset for later analyses. They do not produce output.