From: Ben Pfaff Date: Tue, 6 May 2025 02:38:45 +0000 (-0700) Subject: work on manual X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=e7e12d3e146633cf87a44701dd045158bae05fab;p=pspp work on manual --- diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index 69876f7a74..e6e59f1056 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -3,6 +3,19 @@ [Introduction](introduction.md) [License](license.md) +# Language Syntax + +- [Basics](language/basics/index.md) + - [Tokens](language/basics/tokens.md) + - [Forming Commands](language/basics/commands.md) + - [Syntax Variants](language/basics/syntax-variants.md) + - [Handling Missing Values](language/basics/missing-values.md) +- [Datasets](language/datasets/index.md) + - [Variables](language/datasets/variables.md) + - [Variable Lists](language/datasets/variable-lists.md) + - [Input and Output Formats](language/datasets/formats/index.md) + - [Basic Numeric Formats](language/datasets/formats/basic.md) + # Developer Documentation - [System File Format](system-file/index.md) diff --git a/rust/doc/src/language/basics/commands.md b/rust/doc/src/language/basics/commands.md new file mode 100644 index 0000000000..9e5241cb56 --- /dev/null +++ b/rust/doc/src/language/basics/commands.md @@ -0,0 +1,21 @@ +# Forming Commands + +Most PSPP commands share a common structure. A command begins with a +command name, such as `FREQUENCIES`, `DATA LIST`, or `N OF CASES`. The +command name may be abbreviated to its first word, and each word in the +command name may be abbreviated to its first three or more characters, +where these abbreviations are unambiguous. + +The command name may be followed by one or more "subcommands". Each +subcommand begins with a subcommand name, which may be abbreviated to +its first three letters. Some subcommands accept a series of one or +more specifications, which follow the subcommand name, optionally +separated from it by an equals sign (`=`). 
Specifications may be +separated from each other by commas or spaces. Each subcommand must +be separated from the next (if any) by a forward slash (`/`). + +There are multiple ways to mark the end of a command. The most common +way is to end the last line of the command with a period (`.`) as +described in the previous section. A blank line, or one that consists +only of white space or comments, also ends a command. + diff --git a/rust/doc/src/language/basics/index.md b/rust/doc/src/language/basics/index.md new file mode 100644 index 0000000000..b7d305c858 --- /dev/null +++ b/rust/doc/src/language/basics/index.md @@ -0,0 +1,2 @@ +This chapter discusses elements common to many PSPP commands. Later +chapters describe individual commands in detail. diff --git a/rust/doc/src/language/basics/missing-values.md b/rust/doc/src/language/basics/missing-values.md new file mode 100644 index 0000000000..a63e0a45cf --- /dev/null +++ b/rust/doc/src/language/basics/missing-values.md @@ -0,0 +1,19 @@ +# Handling Missing Values + +PSPP includes special support for unknown numeric data values. Missing +observations are assigned a special value, called the "system-missing +value". This "value" actually indicates the absence of a value; it +means that the actual value is unknown. Procedures automatically +exclude from analyses those observations or cases that have missing +values. Details of missing value exclusion depend on the procedure and +can often be controlled by the user; refer to descriptions of individual +procedures for details. + + The system-missing value exists only for numeric variables. String +variables always have a defined value, even if it is only a string of +spaces. + + Variables, whether numeric or string, can have designated +"user-missing values". Every user-missing value is an actual value for +that variable. However, most of the time user-missing values are +treated in the same way as the system-missing value. 
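The exclusion rule described above can be illustrated with a minimal sketch. It is not PSPP's implementation: `SYSMIS`, `is_missing`, and `mean_excluding_missing` are hypothetical names, and `None` merely stands in for the system-missing value.

```python
# Hypothetical stand-in for the system-missing value; PSPP's real
# representation is not described in this text.
SYSMIS = None


def is_missing(value, user_missing=()):
    """Return True for the system-missing value or any user-missing value."""
    return value is SYSMIS or value in user_missing


def mean_excluding_missing(values, user_missing=()):
    """Average only the valid observations, as a procedure typically would."""
    valid = [v for v in values if not is_missing(v, user_missing)]
    if not valid:
        # With no valid observations the result itself is unknown.
        return SYSMIS
    return sum(valid) / len(valid)
```

Here a user-missing code such as 99 is an actual value of the variable, yet it is dropped from the mean just like the system-missing value.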
diff --git a/rust/doc/src/language/basics/syntax-variants.md b/rust/doc/src/language/basics/syntax-variants.md new file mode 100644 index 0000000000..59092ab566 --- /dev/null +++ b/rust/doc/src/language/basics/syntax-variants.md @@ -0,0 +1,30 @@ +# Syntax Variants + +There are three variants of command syntax, which vary only in how they +detect the end of one command and the start of the next. + + In "interactive mode", which is the default for syntax typed at a +command prompt, a period as the last non-blank character on a line ends +a command. A blank line also ends a command. + + In "batch mode", an end-of-line period or a blank line also ends a +command. Additionally, it treats any line that has a non-blank +character in the leftmost column as beginning a new command. Thus, in +batch mode the second and subsequent lines in a command must be +indented. + + Regardless of the syntax mode, a plus sign, minus sign, or period in +the leftmost column of a line is ignored and causes that line to begin a +new command. This is most useful in batch mode, in which the first line +of a new command could not otherwise be indented, but it is accepted +regardless of syntax mode. + + The default mode for reading commands from a file is "auto mode". It +is the same as batch mode, except that a line with a non-blank in the +leftmost column only starts a new command if that line begins with the +name of a PSPP command. This correctly interprets most valid PSPP +syntax files regardless of the syntax mode for which they are intended. + + The `--interactive` (or `-i`) or `--batch` (or `-b`) options set the +syntax mode for files listed on the PSPP command line. + diff --git a/rust/doc/src/language/basics/tokens.md b/rust/doc/src/language/basics/tokens.md new file mode 100644 index 0000000000..5da24061b9 --- /dev/null +++ b/rust/doc/src/language/basics/tokens.md @@ -0,0 +1,115 @@ +# Tokens + +PSPP divides most syntax file lines into series of short chunks called +"tokens". 
Tokens are then grouped to form commands, each of which +tells PSPP to take some action—read in data, write out data, perform a +statistical procedure, etc. Each type of token is described below. + +## Identifiers + +Identifiers are names that typically specify variables, commands, or +subcommands. The first character in an identifier must be a letter, +`#`, or `@`. The remaining characters in the identifier must be +letters, digits, or one of the following special characters: + +``` +. _ $ # @ +``` + +Identifiers may be any length, but only the first 64 bytes are +significant. Identifiers are not case-sensitive: `foobar`, +`Foobar`, `FooBar`, `FOOBAR`, and `FoObaR` are different +representations of the same identifier. + +Some identifiers are reserved. Reserved identifiers may not be +used in any context besides those explicitly described in this +manual. The reserved identifiers are: + +``` +ALL AND BY EQ GE GT LE LT NE NOT OR TO WITH +``` + +## Keywords + +Keywords are a subclass of identifiers that form a fixed part of +command syntax. For example, command and subcommand names are +keywords. Keywords may be abbreviated to their first 3 characters +if this abbreviation is unambiguous. (Unique abbreviations of 3 or +more characters are also accepted: `FRE`, `FREQ`, and `FREQUENCIES` +are equivalent when the last is a keyword.) + +Reserved identifiers are always used as keywords. Other +identifiers may be used both as keywords and as user-defined +identifiers, such as variable names. + +## Numbers + +Numbers are expressed in decimal. A decimal point is optional. +Numbers may be expressed in scientific notation by adding `e` and a +base-10 exponent, so that `1.234e3` has the value 1234. Here are +some more examples of valid numbers: + +``` +-5 3.14159265359 1e100 -.707 8945. +``` + +Negative numbers are expressed with a `-` prefix. 
However, in +situations where a literal `-` token is expected, what appears to +be a negative number is treated as `-` followed by a positive +number. + +No white space is allowed within a number token, except for +horizontal white space between `-` and the rest of the number. + +The last example above, `8945.` is interpreted as two tokens, +`8945` and `.`, if it is the last token on a line. *Note Forming +commands of tokens: Commands. + +## Strings + +Strings are literal sequences of characters enclosed in pairs of +single quotes (`'`) or double quotes (`"`). To include the +character used for quoting in the string, double it, e.g. `'it''s +an apostrophe'`. White space and case of letters are significant +inside strings. + +Strings can be concatenated using `+`, so that `"a" + 'b' + 'c'` is +equivalent to `'abc'`. So that a long string may be broken across +lines, a line break may precede or follow, or both precede and +follow, the `+`. (However, an entirely blank line preceding or +following the `+` is interpreted as ending the current command.) + +Strings may also be expressed as hexadecimal character values by +prefixing the initial quote character by `x` or `X`. Regardless of +the syntax file or active dataset's encoding, the hexadecimal +digits in the string are interpreted as Unicode characters in UTF-8 +encoding. + +> Individual Unicode code points may also be expressed by specifying +the hexadecimal code point number in single or double quotes +preceded by `u` or `U`. For example, Unicode code point U+1D11E, +the musical G clef character, could be expressed as `U'1D11E'`. +Invalid Unicode code points (above U+10FFFF or in between U+D800 +and U+DFFF) are not allowed. + +When strings are concatenated with `+`, each segment's prefix is +considered individually. For example, `'The G clef symbol is:' + +u"1d11e" + "."` inserts a G clef symbol in the middle of an +otherwise plain text string. 
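The string rules above (doubled quotes, `+` concatenation, and the `x` and `u` prefixes) can be sketched as a small decoder. This is an illustration only, not PSPP's lexer: `decode_string` and `decode_segment` are hypothetical names, and the sketch does not model the rule that a blank line next to `+` ends the command.

```python
import re

# One string segment: an optional x/X or u/U prefix, then a body in single
# or double quotes in which the quote character is escaped by doubling it.
_SEGMENT = re.compile(
    r"\s*([xXuU]?)"
    r"(?:'((?:[^']|'')*)'"
    r'|"((?:[^"]|"")*)")'
    r"\s*"
)


def decode_segment(prefix, body):
    """Interpret one segment's body according to its own prefix."""
    prefix = prefix.lower()
    if prefix == "x":
        # Hexadecimal digits give bytes that are read as UTF-8.
        return bytes.fromhex(body).decode("utf-8")
    if prefix == "u":
        code_point = int(body, 16)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("invalid Unicode code point")
        return chr(code_point)
    return body


def decode_string(source):
    """Decode segments joined by `+`, each with its prefix handled separately."""
    parts = []
    pos = 0
    while True:
        match = _SEGMENT.match(source, pos)
        if match is None:
            raise ValueError("expected a quoted string")
        prefix, single, double = match.groups()
        if single is not None:
            body = single.replace("''", "'")
        else:
            body = double.replace('""', '"')
        parts.append(decode_segment(prefix, body))
        pos = match.end()
        if pos == len(source):
            return "".join(parts)
        if source[pos] != "+":
            raise ValueError("expected `+` between string segments")
        pos += 1
```

Because each segment is decoded before joining, a plain segment and a `u`-prefixed segment can be mixed freely, which mirrors the G clef example in the text.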
+ +## Punctuators and Operators + +These tokens are the punctuators and operators: + +``` +, / = ( ) + - * / ** < <= <> > >= ~= & | . +``` + +Most of these appear within the syntax of commands, but the period +(`.`) punctuator is used only at the end of a command. It is a +punctuator only as the last character on a line (except white +space). When it is the last non-space character on a line, a +period is not treated as part of another token, even if it would +otherwise be part of, e.g., an identifier or a floating-point +number. + diff --git a/rust/doc/src/language/datasets/formats/basic.md b/rust/doc/src/language/datasets/formats/basic.md new file mode 100644 index 0000000000..10d4d316ee --- /dev/null +++ b/rust/doc/src/language/datasets/formats/basic.md @@ -0,0 +1,161 @@ +# Basic Numeric Formats + +The basic numeric formats are used for input and output of real numbers +in standard or scientific notation. The following table shows an +example of how each format displays positive and negative numbers with +the default decimal point setting: + +|Format |3141.59 |-3141.59| +|:--------------|--------------:|---------:| +|`F8.2` |` 3141.59` |`-3141.59`| +|`COMMA9.2` |` 3,141.59` |`-3,141.59`| +|`DOT9.2` |` 3.141,59` |`-3.141,59`| +|`DOLLAR10.2` |` $3,141.59` |`-$3,141.59`| +|`PCT9.2` |` 3141.59%` |`-3141.59%`| +|`E8.1` |` 3.1E+003` |`-3.1E+003`| + + On output, numbers in `F` format are expressed in standard decimal +notation with the requested number of decimal places. The other formats +output some variation on this style: + + - Numbers in `COMMA` format are additionally grouped every three digits + by inserting a grouping character. The grouping character is + ordinarily a comma, but it can be changed to a period (*note SET + DECIMAL::). + + - `DOT` format is like `COMMA` format, but it interchanges the role of + the decimal point and grouping characters. That is, the current + grouping character is used as a decimal point and vice versa. 
+ + - `DOLLAR` format is like `COMMA` format, but it prefixes the number with + `$`. + + - `PCT` format is like `F` format, but adds `%` after the number. + + - The `E` format always produces output in scientific notation. + + On input, the basic numeric formats accept positive and negative numbers in +standard decimal notation or scientific notation. Leading and trailing +spaces are allowed. An empty or all-spaces field, or one that contains +only a single period, is treated as the system-missing value. + + In scientific notation, the exponent may be introduced by a sign (`+` +or `-`), or by one of the letters `e` or `d` (in uppercase or +lowercase), or by a letter followed by a sign. A single space may +follow the letter or the sign or both. + + On fixed-format `DATA LIST` (*note DATA LIST FIXED::) and in a few +other contexts, decimals are implied when the field does not contain a +decimal point. In `F6.5` format, for example, the field `314159` is taken +as the value 3.14159 with implied decimals. Decimals are never implied +if an explicit decimal point is present or if scientific notation is +used. + + `E` and `F` formats accept the basic syntax already described. The other +formats allow some additional variations: + +- `COMMA`, `DOLLAR`, and `DOT` formats ignore grouping characters within + the integer part of the input field. The identity of the grouping + character depends on the format. + +- `DOLLAR` format allows a dollar sign to precede the number. In a + negative number, the dollar sign may precede or follow the minus + sign. + +- `PCT` format allows a percent sign to follow the number. + + All of the basic numeric formats have a maximum field width of 40 and +accept no more than 16 decimal places, on both input and output. Some +additional restrictions apply: + +- As input formats, the basic numeric formats allow no more decimal + places than the field width. 
As output formats, the field width + must be greater than the number of decimal places; that is, large + enough to allow for a decimal point and the number of requested + decimal places. `DOLLAR` and `PCT` formats must allow an additional + column for `$` or `%`. + +- The default output format for a given input format increases the + field width enough to make room for optional input characters. If + an input format calls for decimal places, the width is increased by + 1 to make room for an implied decimal point. `COMMA`, `DOT`, and + `DOLLAR` formats also increase the output width to make room for + grouping characters. `DOLLAR` and `PCT` further increase the output + field width by 1 to make room for `$` or `%`. The increased output + width is capped at 40, the maximum field width. + +- The `E` format is exceptional. For output, `E` format has a minimum + width of 7 plus the number of decimal places. The default output + format for an `E` input format is an `E` format with at least 3 decimal + places and thus a minimum width of 10. + +More details of basic numeric output formatting are given below: + +- Output rounds to nearest, with ties rounded away from zero. Thus, + 2.5 is output as `3` in `F1.0` format, and -1.125 as `-1.13` in `F5.2` + format. + +- The system-missing value is output as a period in a field of + spaces, placed in the decimal point's position, or in the rightmost + column if no decimal places are requested. A period is used even + if the decimal point character is a comma. + +- A number that does not fill its field is right-justified within the + field. + +- A number that is too large for its field causes decimal places to be + dropped to make room. If dropping decimals does not make enough + room, scientific notation is used if the field is wide enough. If + a number does not fit in the field, even in scientific notation, + the overflow is indicated by filling the field with asterisks + (`*`). 
+ +- `COMMA`, `DOT`, and `DOLLAR` formats insert grouping characters only if + space is available for all of them. Grouping characters are never + inserted when all decimal places must be dropped. Thus, 1234.56 in + `COMMA5.2` format is output as ` 1235` without a comma, even though + there is room for one, because all decimal places were dropped. + +- `DOLLAR` and `PCT` formats drop the `$` or `%` only if the number would + not fit at all without it. Scientific notation with `$` or `%` is + preferred to ordinary decimal notation without it. + +- Except in scientific notation, a decimal point is included only + when it is followed by a digit. If the integer part of the number + being output is 0, and a decimal point is included, then PSPP + ordinarily drops the zero before the decimal point. However, in + `F`, `COMMA`, or `DOT` formats, PSPP keeps the zero if `SET + LEADZERO` is set to `ON` (*note SET LEADZERO::). + + In scientific notation, the number always includes a decimal point, + even if it is not followed by a digit. + +- A negative number includes a minus sign only in the presence of a + nonzero digit: -0.01 is output as `-.01` in `F4.2` format but as + ` .0` in `F4.1` format. Thus, a "negative zero" never includes a + minus sign. + +- In negative numbers output in `DOLLAR` format, the dollar sign + follows the negative sign. Thus, -9.99 in `DOLLAR6.2` format is + output as `-$9.99`. + +- In scientific notation, the exponent is output as `E` followed by + `+` or `-` and exactly three digits. Numbers with magnitude less + than 10**-999 or larger than 10**999 are not supported by most + computers, but if they are supported then their output is + considered to overflow the field and they are output as asterisks. + +- On most computers, no more than 15 decimal digits are significant + in output, even if more are printed. In any case, output precision + cannot be any higher than input precision; few data sets are + accurate to 15 digits of precision. 
Unavoidable loss of precision + in intermediate calculations may also reduce precision of output. + +- Special values such as infinities and "not a number" values are + usually converted to the system-missing value before printing. In + a few circumstances, these values are output directly. In fields + of width 3 or greater, special values are output as however many + characters fit from `+Infinity` or `-Infinity` for infinities, from + `NaN` for "not a number," or from `Unknown` for other values (if + any are supported by the system). In fields under 3 columns wide, + special values are output as asterisks. diff --git a/rust/doc/src/language/datasets/formats/index.md b/rust/doc/src/language/datasets/formats/index.md new file mode 100644 index 0000000000..4d68607530 --- /dev/null +++ b/rust/doc/src/language/datasets/formats/index.md @@ -0,0 +1,31 @@ +# Input and Output Formats + +An "input format" describes how to interpret the contents of an input +field as a number or a string. It might specify that the field contains +an ordinary decimal number, a time or date, a number in binary or +hexadecimal notation, or one of several other notations. Input formats +are used by commands such as `DATA LIST` that read data or syntax files +into the PSPP active dataset. + + Every input format corresponds to a default "output format" that +specifies the formatting used when the value is output later. It is +always possible to explicitly specify an output format that resembles +the input format. Usually, this is the default, but in cases where the +input format is unfriendly to human readability, such as binary or +hexadecimal formats, the default output format is an easier-to-read +decimal format. + + Every variable has two output formats, called its "print format" and +"write format". Print formats are used in most output contexts; write +formats are used only by `WRITE` (*note WRITE::). 
Newly created +variables have identical print and write formats, and `FORMATS`, the +most commonly used command for changing formats (*note FORMATS::), sets +both of them to the same value as well. Thus, most of the time, the +distinction between print and write formats is unimportant. + + Input and output formats are specified to PSPP with a "format +specification" of the form `TypeW` or `TypeW.D`, where `Type` is one +of the format types described later, `W` is a field width measured in +columns, and `D` is an optional number of decimal places. If `D` is +omitted, a value of 0 is assumed. Some formats do not allow a nonzero +`D` to be specified. diff --git a/rust/doc/src/language/datasets/index.md b/rust/doc/src/language/datasets/index.md new file mode 100644 index 0000000000..679a0a9811 --- /dev/null +++ b/rust/doc/src/language/datasets/index.md @@ -0,0 +1,13 @@ +# Datasets + +PSPP works with data organized into "datasets". A dataset consists of a +set of "variables", which taken together are said to form a +"dictionary", and one or more "cases", each of which has one value for +each variable. + + At any given time PSPP has exactly one distinguished dataset, called +the "active dataset". Most PSPP commands work only with the active +dataset. In addition to the active dataset, PSPP also supports any +number of additional open datasets. The `DATASET` commands can choose a +new active dataset from among those that are open, as well as create and +destroy datasets (*note DATASET::). diff --git a/rust/doc/src/language/datasets/variable-lists.md b/rust/doc/src/language/datasets/variable-lists.md new file mode 100644 index 0000000000..2e46e9316d --- /dev/null +++ b/rust/doc/src/language/datasets/variable-lists.md @@ -0,0 +1,24 @@ +# Variable Lists + +To refer to a set of variables, list their names one after another. +Optionally, their names may be separated by commas. 
To include a +range of variables from the dictionary in the list, write the name of +the first and last variable in the range, separated by `TO`. For +instance, if the dictionary contains six variables with the names +`ID`, `X1`, `X2`, `GOAL`, `MET`, and `NEXTGOAL`, in that order, then +`X2 TO MET` would include variables `X2`, `GOAL`, and `MET`. + + Commands that define variables, such as `DATA LIST`, give `TO` an +alternate meaning. With these commands, `TO` defines sequences of +variables whose names end in consecutive integers. The syntax is two +identifiers that begin with the same root and end with numbers, +separated by `TO`. The syntax `X1 TO X5` defines 5 variables, named +`X1`, `X2`, `X3`, `X4`, and `X5`. The syntax `ITEM0008 TO ITEM0013` +defines 6 variables, named `ITEM0008`, `ITEM0009`, `ITEM0010`, +`ITEM0011`, `ITEM0012`, and `ITEM0013`. The syntaxes `QUES001 TO +QUES9` and `QUES6 TO QUES3` are invalid. + + After a set of variables has been defined with `DATA LIST` or +another command with this method, the same set can be referenced on +later commands using the same syntax. + diff --git a/rust/doc/src/language/datasets/variables.md b/rust/doc/src/language/datasets/variables.md new file mode 100644 index 0000000000..74dae3f42c --- /dev/null +++ b/rust/doc/src/language/datasets/variables.md @@ -0,0 +1,130 @@ +# Attributes of Variables + +Each variable has a number of attributes, including: + +* Name + An identifier, up to 64 bytes long. Each variable must have a + different name. *Note Tokens::. + + Some system variable names begin with `$`, but user-defined + variables' names may not begin with `$`. + + The final character in a variable name should not be `.`, because + such an identifier will be misinterpreted when it is the final + token on a line: `FOO.` is divided into two separate tokens, `FOO` + and `.`, indicating end-of-command. *Note Tokens::. 
+ + The final character in a variable name should not be `_`, because + some such identifiers are used for special purposes by PSPP + procedures. + + As with all PSPP identifiers, variable names are not + case-sensitive. PSPP capitalizes variable names on output the same + way they were capitalized at their point of definition in the + input. + +* Type + Numeric or string. + +* Width (string variables only) + String variables with a width of 8 + characters or fewer are called "short string variables". Short + string variables may be used in a few contexts where "long string + variables" (those with widths greater than 8) are not allowed. + +* Position + Variables in the dictionary are arranged in a specific order. + `DISPLAY` can be used to show this order: see *note DISPLAY::. + +* Initialization + Either reinitialized to 0 or spaces for each case, or left at its + existing value. *Note LEAVE::. + +* Missing values + Optionally, up to three values, or a range of values, or a specific + value plus a range, can be specified as "user-missing values". + There is also a "system-missing value" that is assigned to an + observation when there is no other obvious value for that + observation. Observations with missing values are automatically + excluded from analyses. User-missing values are actual data + values, while the system-missing value is not a value at all. + *Note Missing Observations::. + +* Variable label + A string that describes the variable. *Note VARIABLE LABELS::. + +* Value label + Optionally, these associate each possible value of the variable + with a string. *Note VALUE LABELS::. + +* Print format + Display width, format, and (for numeric variables) number of + decimal places. This attribute does not affect how data are + stored, just how they are displayed. Example: a width of 8, with 2 + decimal places. *Note Input and Output Formats::. + +* Write format + Similar to print format, but used by the `WRITE` command (*note + WRITE::). 
+ +* Measurement level + One of the following: + + - *Nominal*: Each value of a nominal variable represents a distinct + category. The possible categories are finite and often have value + labels. The order of categories is not significant. Political + parties, US states, and yes/no choices are nominal. Numeric and + string variables can be nominal. + + - *Ordinal*: Ordinal variables also represent distinct categories, but + their values are arranged according to some natural order. Likert + scales, e.g. from strongly disagree to strongly agree, are + ordinal. Data grouped into ranges, e.g. age groups or income + groups, are ordinal. Both numeric and string variables can be + ordinal. String values are ordered alphabetically, so letter + grades from A to F will work as expected, but `poor`, + `satisfactory`, `excellent` will not. + + - *Scale*: Scale variables are ones for which differences and ratios + are meaningful. These are often values which have a natural unit + attached, such as age in years, income in dollars, or distance in + miles. Only numeric variables are scalar. + + Variables created by `COMPUTE` and similar transformations, + obtained from external sources, etc., initially have an unknown + measurement level. Any procedure that reads the data will then + assign a default measurement level. PSPP can assign some defaults + without reading the data: + + - Nominal, if it's a string variable. + + - Nominal, if the variable has a WKDAY or MONTH print format. + + - Scale, if the variable has a DOLLAR, CCA through CCE, or time + or date print format. + + Otherwise, PSPP reads the data and decides based on its + distribution: + + - Nominal, if all observations are missing. + + - Scale, if one or more valid observations are noninteger or + negative. + + - Scale, if no valid observation is less than 10. + + - Scale, if the variable has 24 or more unique valid values. + The value 24 is the default and can be adjusted (*note SET + SCALEMIN::). 
+ + Finally, if none of the above is true, PSPP assigns the variable a + nominal measurement level. + +* Custom attributes + User-defined associations between names and values. *Note VARIABLE + ATTRIBUTE::. + +* Role + The intended role of a variable for use in dialog boxes in + graphical user interfaces. *Note VARIABLE ROLE::. + diff --git a/rust/doc/src/language/index.md b/rust/doc/src/language/index.md new file mode 100644 index 0000000000..b7d305c858 --- /dev/null +++ b/rust/doc/src/language/index.md @@ -0,0 +1,2 @@ +This chapter discusses elements common to many PSPP commands. Later +chapters describe individual commands in detail. diff --git a/rust/doc/src/license.md b/rust/doc/src/license.md index d273a3cbb8..3ed60ec097 100644 --- a/rust/doc/src/license.md +++ b/rust/doc/src/license.md @@ -33,4 +33,5 @@ PSPP is licensed under the [GNU General Public License](https://www.gnu.org/licenses/licenses.html#GPL), version 3 or later. This manual is licensed under the [GNU Free Documentation License](https://www.gnu.org/licenses/licenses.html#FDL), version 1.3 -or later. +or later; with no Invariant Sections, no Front-Cover Texts, and no +Back-Cover Texts.