From: Ben Pfaff Date: Fri, 25 Apr 2025 22:48:41 +0000 (-0700) Subject: Convert system file format documentation to mdbook. X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=8f24f1a70ceb98132b5de54b05c679b7937ab446;p=pspp Convert system file format documentation to mdbook. --- diff --git a/rust/doc/.gitignore b/rust/doc/.gitignore new file mode 100644 index 0000000000..7585238efe --- /dev/null +++ b/rust/doc/.gitignore @@ -0,0 +1 @@ +book diff --git a/rust/doc/book.toml b/rust/doc/book.toml new file mode 100644 index 0000000000..7381a1feaa --- /dev/null +++ b/rust/doc/book.toml @@ -0,0 +1,6 @@ +[book] +authors = ["Ben Pfaff"] +language = "en" +multilingual = false +src = "src" +title = "GNU PSPP" diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md new file mode 100644 index 0000000000..6a6725ddf6 --- /dev/null +++ b/rust/doc/src/SUMMARY.md @@ -0,0 +1,29 @@ +# Summary + +[Introduction](introduction.md) +[License](license.md) + +# Developer Documentation + +- [System File Format](system-file/index.md) + - [System File Record Structure](system-file/system-file-record-structure.md) + - [File Header Record](system-file/file-header-record.md) + - [Variable Record](system-file/variable-record.md) + - [Value Labels Records](system-file/value-labels-records.md) + - [Document Record](system-file/document-record.md) + - [Machine Integer Info Record](system-file/machine-integer-info-record.md) + - [Machine Floating-Point Info Record](system-file/machine-floating-point-info-record.md) + - [Multiple Response Sets Records](system-file/multiple-response-sets-records.md) + - [Extra Product Info Record](system-file/extra-product-info-record.md) + - [Variable Display Parameter Record](system-file/variable-display-parameter-record.md) + - [Variable Sets Record](system-file/variable-sets-record.md) + - [Long Variable Names Record](system-file/long-variable-names-record.md) + - [Very Long String Record](system-file/very-long-string-record.md) + - 
[Character Encoding Record](system-file/character-encoding-record.md) + - [Long String Value Labels Record](system-file/long-string-value-labels-record.md) + - [Long String Missing Values Record](system-file/long-string-missing-values-record.md) + - [Data File and Variable Attributes Record](system-file/data-file-and-variable-attributes-record.md) + - [Extended Number of Cases Record](system-file/extended-number-of-cases-record.md) + - [Other Informational Records](system-file/other-informational-records.md) + - [Dictionary Termination Record](system-file/dictionary-termination-record.md) + - [Data Record](system-file/data-record.md) diff --git a/rust/doc/src/introduction.md b/rust/doc/src/introduction.md new file mode 100644 index 0000000000..ecd11d1150 --- /dev/null +++ b/rust/doc/src/introduction.md @@ -0,0 +1,18 @@ +# Introduction + +GNU PSPP is a tool for statistical analysis of sampled data. It reads +the data, analyzes the data according to commands provided, and writes +the results to a listing file, to the standard output or to a window +of the graphical display. + +The language accepted by PSPP is similar to those accepted by SPSS +statistical products. The details of PSPP's language are given later +in this manual. + +PSPP produces tables and charts as output, which it can produce in +several formats; currently, ASCII, PostScript, PDF, HTML, DocBook and +TeX are supported. + +PSPP is a work in progress. The authors hope to fully support all +features in the products that PSPP replaces, eventually. The authors +welcome questions, comments, donations, and code submissions. diff --git a/rust/doc/src/license.md b/rust/doc/src/license.md new file mode 100644 index 0000000000..d273a3cbb8 --- /dev/null +++ b/rust/doc/src/license.md @@ -0,0 +1,36 @@ +# License + +PSPP is not in the public domain. 
It is copyrighted and there are +restrictions on its distribution, but these restrictions are designed to +permit everything that a good cooperating citizen would want to do. +What is not allowed is to try to prevent others from further sharing any +version of this program that they might get from you. + +Specifically, we want to make sure that you have the right to give +away copies of PSPP, that you receive source code or else can get it if +you want it, that you can change these programs or use pieces of them in +new free programs, and that you know you can do these things. + +To make sure that everyone has such rights, we have to forbid you to +deprive anyone else of these rights. For example, if you distribute +copies of PSPP, you must give the recipients all the rights that you +have. You must make sure that they, too, receive or can get the source +code. And you must tell them their rights. + +Also, for our own protection, we must make certain that everyone +finds out that there is no warranty for PSPP. If these programs are +modified by someone else and passed on, we want their recipients to know +that what they have is not what we distributed, so that any problems +introduced by others will not reflect on our reputation. + +Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + +PSPP is licensed under the [GNU General Public +License](https://www.gnu.org/licenses/licenses.html#GPL), version 3 or +later. This manual is licensed under the [GNU Free Documentation +License](https://www.gnu.org/licenses/licenses.html#FDL), version 1.3 +or later. 
diff --git a/rust/doc/src/system-file/character-encoding-record.md b/rust/doc/src/system-file/character-encoding-record.md new file mode 100644 index 0000000000..46e12a7f53 --- /dev/null +++ b/rust/doc/src/system-file/character-encoding-record.md @@ -0,0 +1,166 @@ +# Character Encoding Record + +This record, if present, indicates the character encoding for string +data, long variable names, variable labels, value labels and other +strings in the file. + + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. */ + char encoding[]; + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 20. + +* `int32 size;` + + The size of each element in the `encoding` member. Always set to 1. + +* `int32 count;` + + The total number of bytes in `encoding`. + +* `char encoding[];` + + The name of the character encoding. Normally this will be an + [official IANA character set name or + alias](http://www.iana.org/assignments/character-sets). Character + set names are not case-sensitive, and SPSS is not consistent, + e.g. both `windows-1251` and `WINDOWS-1252` have both been observed, + as have `Big5` and `BIG5`. + +This record is not present in files generated by older software. See +also the [`character_code` field in the machine integer info +record](machine-integer-info-record.md#character-code). + +The following character encoding names have been observed. The names +are shown in lowercase, even though they were not always in lowercase in +the file. Alternative names for the same encoding are, when known, +listed together. For each encoding, the `character_code` values that +they were observed paired with are also listed. 
First, the following +are strictly single-byte, ASCII-compatible encodings: + +* (encoding record missing) + + 0, 2, 3, 874, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 20127, + 28591, 28592, 28605 + +* `ansi_x3.4-1968` + `ascii` + + 1252 + +* `cp28605` + + 2 + +* `cp874` + + 9066 + +* `iso-8859-1` + + 819 + +* `windows-874` + + 874 + +* `windows-1250` + + 2, 1250, 1252 + +* `windows-1251` + + 2, 1251 + +* `cp1252` + `windows-1252` + + 2, 1250, 1252, 1253 + +* `cp1253` + `windows-1253` + + 1253 + +* `windows-1254` + + 2, 1254 + +* `windows-1255` + + 2, 1255 + +* `windows-1256` + + 2, 1252, 1256 + +* `windows-1257` + + 2, 1257 + +* `windows-1258` + + 1258 + +The others are multibyte encodings, in which some code points occupy +a single byte and others multiple bytes. The following multibyte +encodings are "ASCII compatible," that is, they use ASCII values only to +indicate ASCII: + +* (encoding record missing) + + 65001, 949 + +* `euc-kr` + + 2, 51949 + +* `utf-8` + + 0, 2, 1250, 1251, 1252, 1256, 65001 + +The following multibyte encodings are not ASCII compatible, that is, +while they encode ASCII characters as their native values, they also use +ASCII values as second or later bytes in multibyte sequences: + +* (encoding record missing) + + 932, 936, 950 + +* `big5` + `cp950` + + 2, 950 + +* `gbk` + + 936 + +* `cp932` + `windows-31j` + + 932 + +As the tables above show, when the character encoding record and the +machine integer info record are both present, they can contradict each +other. Observations show that, in this case, the character encoding +record should be honored. + +If, for testing purposes, a file is crafted with different +`character_code` and `encoding`, it seems that `character_code` +controls the encoding for all strings in the system file before the +dictionary termination record, including strings in data (e.g. string +missing values), and `encoding` controls the encoding for strings +following the dictionary termination record. 
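The precedence rule described above (honor the character encoding record when it is present, otherwise fall back to `character_code`) could be sketched in Rust roughly as follows. This is only an illustrative sketch, not PSPP's implementation: `choose_encoding` is a hypothetical name, and the fallback table assumes that `character_code` holds a Windows code page number and covers only a handful of the values listed above. It also ignores the artificially crafted case in which the two records disagree on purpose.

```rust
/// Picks an encoding name for a system file (hypothetical helper).
///
/// `encoding_record` is the text of the character encoding record, if the
/// file has one; `character_code` is the field of the same name from the
/// machine integer info record, if the file has one.
pub fn choose_encoding(
    encoding_record: Option<&str>,
    character_code: Option<i32>,
) -> Option<String> {
    // The character encoding record is honored whenever it is present.
    if let Some(name) = encoding_record {
        let name = name.trim();
        if !name.is_empty() {
            // Encoding names are not case-sensitive, so normalize them.
            return Some(name.to_ascii_lowercase());
        }
    }

    // Assumed fallback: treat `character_code` as a Windows code page
    // number. Codes such as 2 or 3 carry no encoding name, so they map
    // to `None` here.
    match character_code? {
        874 => Some("windows-874".to_string()),
        932 => Some("windows-31j".to_string()),
        936 => Some("gbk".to_string()),
        950 => Some("big5".to_string()),
        code @ 1250..=1258 => Some(format!("windows-{}", code)),
        20127 => Some("ascii".to_string()),
        28591 => Some("iso-8859-1".to_string()),
        65001 => Some("utf-8".to_string()),
        _ => None,
    }
}
```

A caller that gets `None` back would still need its own default, since many observed files have neither a usable encoding record nor an informative `character_code`.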
+ diff --git a/rust/doc/src/system-file/data-file-and-variable-attributes-record.md b/rust/doc/src/system-file/data-file-and-variable-attributes-record.md new file mode 100644 index 0000000000..fcdcfcc0f6 --- /dev/null +++ b/rust/doc/src/system-file/data-file-and-variable-attributes-record.md @@ -0,0 +1,99 @@ +# Data File and Variable Attributes Records + +The data file and variable attributes records represent custom +attributes for the system file or for individual variables in the +system file, as defined on the `DATAFILE ATTRIBUTE` and `VARIABLE +ATTRIBUTE` commands, respectively. + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. */ + char attributes[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 17 for a data file attribute record + or to 18 for a variable attributes record. + +* `int32 size;` + + The size of each element in the `attributes` member. Always set to + value 1. + +* `int32 count;` + + The total number of bytes in `attributes`. + +* `char attributes[];` + + The attributes, in a text-based format. + + In record subtype 17, this field contains a single attribute set. + An attribute set is a sequence of one or more attributes + concatenated together. Each attribute consists of a name, which + has the same syntax as a variable name, followed by, inside + parentheses, a sequence of one or more values. Each value consists + of a string enclosed in single quotes (`'`) followed by a line feed + (byte 0x0a). A value may contain single quote characters, which + are not themselves escaped or quoted or required to be present in + pairs. There is no apparent way to embed a line feed in a value. + There is no distinction between an attribute with a single value + and an attribute array with one element. + + In record subtype 18, this field contains a sequence of one or more + variable attribute sets. 
If more than one variable attribute set + is present, each one after the first is delimited from the previous + by `/`. Each variable attribute set consists of a long variable + name, followed by `:`, followed by an attribute set with the same + syntax as on record subtype 17. + + System files written by `Stata 14.1/-savespss- 1.77 by S.Radyakin` + may include multiple records with subtype 18, one per variable that + has variable attributes. + + The total length is `count` bytes. + +## Example + +A system file produced with the following VARIABLE ATTRIBUTE commands in +effect: + +``` +VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=fred[1]('23') fred[2]('34'). +VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123'). +``` + +will contain a variable attribute record with the following contents: + +``` +0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| +0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| +0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| +0030 0a 29 |.) | +``` + +## Variable Roles + +A variable's role is represented as an attribute named `$@Role`. This +attribute has a single element whose values and their meanings are: + +|Value|Role | +|----:|:----------| +| 0 | Input | +| 1 | Output | +| 2 | Both | +| 3 | None | +| 4 | Partition | +| 5 | Split | + +The default and most common role is 0 (input). diff --git a/rust/doc/src/system-file/data-record.md b/rust/doc/src/system-file/data-record.md new file mode 100644 index 0000000000..d0fcf944c5 --- /dev/null +++ b/rust/doc/src/system-file/data-record.md @@ -0,0 +1,186 @@ +# Data Record + +The data record must follow all other records in the system file. Every +system file must have a data record that specifies data for at least one +case. The format of the data record varies depending on the value of +`compression` in the file header record: + +* 0: no compression + + Data is arranged as a series of 8-byte elements. 
Each element + corresponds to the variable declared in the respective variable + record (see [Variable Record](variable-record.md)). Numeric values are given in + `flt64` format; string values are literal character strings, padded + on the right when necessary to fill out 8-byte units. + +* 1: bytecode compression + + The first 8 bytes of the data record are divided into a series of + 1-byte command codes. These codes have meanings as described + below: + + - 0 + + Ignored. If the program writing the system file accumulates + compressed data in blocks of fixed length, 0 bytes can be used + to pad out extra bytes remaining at the end of a fixed-size + block. + + - 1 through 251 + + A number with value `code - bias`, where `code` is the value of + the compression code and `bias` comes from the file header. For + example, code 105 with bias 100.0 (the normal value) indicates a + numeric variable of value 5. + + A code of 0 (after subtracting the bias) in a string field encodes + null bytes. This is unusual, since a string field normally + encodes text data, but it exists in real system files. + + - 252 + + End of file. This code may or may not appear at the end of the + data stream. PSPP always outputs this code but its use is not + required. + + - 253 + + A numeric or string value that is not compressible. The value is + stored in the 8 bytes following the current block of command + bytes. If this value appears twice in a block of command bytes, + then it indicates the second group of 8 bytes following the + command bytes, and so on. + + - 254 + + An 8-byte string value that is all spaces. + + - 255 + + The system-missing value. + + The end of the 8-byte group of bytecodes is followed by any 8-byte + blocks of non-compressible values indicated by code 253. After that + follows another 8-byte group of bytecodes, then those bytecodes' + non-compressible values. The pattern repeats to the end of the file + or a code with value 252.
+ +* 2: ZLIB compression + + The data record consists of the following, in order: + + - ZLIB data header, 24 bytes long. + + - One or more variable-length blocks of ZLIB compressed data. + + - ZLIB data trailer, with a 24-byte fixed header plus an + additional 24 bytes for each preceding ZLIB compressed data + block. + + The ZLIB data header has the following format: + + ``` + int64 zheader_ofs; + int64 ztrailer_ofs; + int64 ztrailer_len; + ``` + + - `int64 zheader_ofs;` + + The offset, in bytes, of the beginning of this structure + within the system file. + + - `int64 ztrailer_ofs;` + + The offset, in bytes, of the first byte of the ZLIB data + trailer. + + - `int64 ztrailer_len;` + + The number of bytes in the ZLIB data trailer. This and the + previous field sum to the size of the system file in bytes. + + The data header is followed by `(ztrailer_len - 24) / 24` ZLIB + compressed data blocks. Each ZLIB compressed data block begins + with a ZLIB header as specified in RFC 1950, e.g. hex bytes `78 01` + (the only header yet observed in practice). Each block + decompresses to a fixed number of bytes (in practice only + `0x3ff000`-byte blocks have been observed), except that the last + block of data may be shorter. The last ZLIB compressed data block + ends just before offset `ztrailer_ofs`. + + The result of ZLIB decompression is bytecode compressed data as + described above for compression format 1. + + The ZLIB data trailer begins with the following 24-byte fixed + header: + + ``` + int64 int_bias; + int64 zero; + int32 block_size; + int32 n_blocks; + ``` + + - `int64 int_bias;` + + The compression bias as a negative integer, e.g. if `bias` in + the file header record is 100.0, then `int_bias` is −100 (this + is the only value yet observed in practice). + + - `int64 zero;` + + Always observed to be zero. + + - `int32 block_size;` + + The number of bytes in each ZLIB compressed data block, except + possibly the last, following decompression.
Only `0x3ff000` + has been observed so far. + + - `int32 n_blocks;` + + The number of ZLIB compressed data blocks, always exactly + `(ztrailer_len - 24) / 24`. + + The fixed header is followed by `n_blocks` 24-byte ZLIB data block + descriptors, each of which describes the compressed data block + corresponding to its offset. Each block descriptor has the + following format: + + ``` + int64 uncompressed_ofs; + int64 compressed_ofs; + int32 uncompressed_size; + int32 compressed_size; + ``` + + - `int64 uncompressed_ofs;` + + The offset, in bytes, that this block of data would have in a + similar system file that uses compression format 1. This is + `zheader_ofs` in the first block descriptor, and in each + succeeding block descriptor it is the sum of the previous + descriptor's `uncompressed_ofs` and `uncompressed_size`. + + - `int64 compressed_ofs;` + + The offset, in bytes, of the actual beginning of this + compressed data block. This is `zheader_ofs + 24` in the + first block descriptor, and in each succeeding block + descriptor it is the sum of the previous descriptor's + `compressed_ofs` and `compressed_size`. The final block + descriptor's `compressed_ofs` and `compressed_size` sum to + `ztrailer_ofs`. + + - `int32 uncompressed_size;` + + The number of bytes in this data block, after decompression. + This is `block_size` in every data block except the last, + which may be smaller. + + - `int32 compressed_size;` + + The number of bytes in this data block, as stored compressed + in this system file. + diff --git a/rust/doc/src/system-file/dictionary-termination-record.md b/rust/doc/src/system-file/dictionary-termination-record.md new file mode 100644 index 0000000000..8eaba85caf --- /dev/null +++ b/rust/doc/src/system-file/dictionary-termination-record.md @@ -0,0 +1,18 @@ +# Dictionary Termination Record + +The dictionary termination record separates all other records from the +data records.
+ +``` + int32 rec_type; + int32 filler; +``` + +* `int32 rec_type;` + + Record type. Always set to 999. + +* `int32 filler;` + + Ignored padding. Should be set to 0. + diff --git a/rust/doc/src/system-file/document-record.md b/rust/doc/src/system-file/document-record.md new file mode 100644 index 0000000000..a73bf6646a --- /dev/null +++ b/rust/doc/src/system-file/document-record.md @@ -0,0 +1,25 @@ +# Document Record + +The document record, if present, has the following format: + +``` + int32 rec_type; + int32 n_lines; + char lines[][80]; +``` + +* `int32 rec_type;` + + Record type. Always set to 6. + +* `int32 n_lines;` + + Number of lines of documents present. This should be greater than + zero, but ReadStat writes system files with zero `n_lines`. + +* `char lines[][80];` + + Document lines. The number of elements is defined by `n_lines`. + Lines shorter than 80 characters are padded on the right with + spaces. + diff --git a/rust/doc/src/system-file/extended-number-of-cases-record.md b/rust/doc/src/system-file/extended-number-of-cases-record.md new file mode 100644 index 0000000000..4843315075 --- /dev/null +++ b/rust/doc/src/system-file/extended-number-of-cases-record.md @@ -0,0 +1,43 @@ +# Extended Number of Cases Record + +The [file header record](file-header-record.md#ncases) expresses the +number of cases in the system file as an `int32`. This record allows +the number of cases in the system file to be expressed as a 64-bit +number. + +``` + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + int64 unknown; + int64 ncases64; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 16. + +* `int32 size;` + + Size of each element. Always set to 8. + +* `int32 count;` + + Number of pieces of data in the data part. Always set to 2. + +* `int64 unknown;` + + Meaning unknown. Always set to 1. + +* `int64 ncases64;` + + Number of cases in the file as a 64-bit integer.
Presumably this + could be -1 to indicate that the number of cases is unknown, for + the same reason as `ncases` in the file header record, but this has + not been observed in the wild. + diff --git a/rust/doc/src/system-file/extra-product-info-record.md b/rust/doc/src/system-file/extra-product-info-record.md new file mode 100644 index 0000000000..b3e2561c84 --- /dev/null +++ b/rust/doc/src/system-file/extra-product-info-record.md @@ -0,0 +1,38 @@ +# Extra Product Info Record + +This optional record appears to contain a text string that describes the +program that wrote the file and the source of the data. (This is +redundant with the file label and product info found in the [file header +record](file-header-record.md).) + + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. */ + char info[]; + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 10. + +* `int32 size;` + + The size of each element in the `info` member. Always set to 1. + +* `int32 count;` + + The total number of bytes in `info`. + +* `char info[];` + + A text string. A product that identifies itself as `VOXCO + INTERVIEWER 4.3` uses CR-only line ends in this field, rather than + the more usual LF-only or CR LF line ends. 
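Every record with `rec_type` 7, including the extra product info record above, shares the same 16-byte header followed by `size * count` bytes of data. A reader might split such a record apart along the following lines. This is a minimal sketch under stated assumptions: `Extension`, `read_i32`, and `parse_extension` are hypothetical names, the caller is assumed to have already determined the integer endianness, and `bytes` is assumed to start at the record's `rec_type` field.

```rust
/// One `rec_type` 7 record, split into its header fields and raw data
/// (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct Extension<'a> {
    pub subtype: i32,
    pub size: i32,
    pub count: i32,
    pub data: &'a [u8],
}

/// Reads one `int32` at `offset`, or returns `None` if `bytes` is too short.
fn read_i32(bytes: &[u8], offset: usize, big_endian: bool) -> Option<i32> {
    let s = bytes.get(offset..offset + 4)?;
    let raw = [s[0], s[1], s[2], s[3]];
    Some(if big_endian {
        i32::from_be_bytes(raw)
    } else {
        i32::from_le_bytes(raw)
    })
}

/// Parses the 16-byte header and borrows the `size * count` data bytes
/// that follow it. Returns `None` for a wrong record type, negative
/// sizes, or truncated input.
pub fn parse_extension(bytes: &[u8], big_endian: bool) -> Option<Extension<'_>> {
    if read_i32(bytes, 0, big_endian)? != 7 {
        return None;
    }
    let subtype = read_i32(bytes, 4, big_endian)?;
    let size = read_i32(bytes, 8, big_endian)?;
    let count = read_i32(bytes, 12, big_endian)?;
    if size < 0 || count < 0 {
        return None;
    }
    let len = (size as usize).checked_mul(count as usize)?;
    let end = 16usize.checked_add(len)?;
    let data = bytes.get(16..end)?;
    Some(Extension { subtype, size, count, data })
}
```

For an extra product info record, `subtype` would be 10, `size` would be 1, and `data` would hold the `info` text, which the caller still has to decode and whose line ends (CR, LF, or CR LF) it has to normalize itself.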
+ diff --git a/rust/doc/src/system-file/file-header-record.md b/rust/doc/src/system-file/file-header-record.md new file mode 100644 index 0000000000..9b95430898 --- /dev/null +++ b/rust/doc/src/system-file/file-header-record.md @@ -0,0 +1,128 @@ +# File Header Record + +A system file begins with the file header, with the following format: + +``` + char rec_type[4]; + char prod_name[60]; + int32 layout_code; + int32 nominal_case_size; + int32 compression; + int32 weight_index; + int32 ncases; + flt64 bias; + char creation_date[9]; + char creation_time[8]; + char file_label[64]; + char padding[3]; +``` + +* `char rec_type[4];` + + Record type code, either `$FL2` for system files with uncompressed + data or data compressed with simple bytecode compression, or `$FL3` + for system files with ZLIB compressed data. + + This is truly a character field that uses the same character encoding as + other strings. Thus, in a file with an ASCII-based character + encoding this field contains `24 46 4c 32` or `24 46 4c 33`, and in + a file with an EBCDIC-based encoding this field contains `5b c6 d3 + f2`. (No EBCDIC-based ZLIB-compressed files have been observed.) + +* `char prod_name[60];` + + Product identification string. This always begins with the + characters `@(#) SPSS DATA FILE`. PSPP uses the remaining + characters to give its version and the operating system name; for + example, `GNU pspp 0.1.4 - sparc-sun-solaris2.5.2`. The string is + truncated if it would be longer than 60 characters; otherwise it is + padded on the right with spaces. + + The product name field allows readers to behave differently based on + quirks in the way that particular software writes system files. + See [Value Labels Records](value-labels-records.md) for the details of the quirk that the + PSPP system file reader tolerates in files written by ReadStat, + which has `https://github.com/WizardMac/ReadStat` in `prod_name`.
+ +* `int32 layout_code;` + + Normally set to 2, although a few system files have been spotted in + the wild with a value of 3 here. PSPP uses this value to determine + the file's integer endianness. + +* `int32 nominal_case_size;` + + Number of data elements per case. This is the number of variables, + except that long string variables add extra data elements (one for + every 8 characters after the first 8). However, string variables + do not contribute to this value beyond the first 255 bytes. + Further, some software always writes -1 or 0 in this field. In + general, it is unsafe for systems reading system files to rely upon + this value. + +* `int32 compression;` + + Set to 0 if the data in the file is not compressed, 1 if the data is + compressed with simple bytecode compression, 2 if the data is ZLIB + compressed. This field has value 2 if and only if `rec_type` is + `$FL3`. + +* `int32 weight_index;` + + If one of the variables in the data set is used as a weighting + variable, set to the [dictionary + index](variable-record.md#dictionary-index) of that variable. + Otherwise, set to 0. + +* `int32 ncases;` + + Set to the number of cases in the file if it is + known, or -1 otherwise. + + In the general case it is not possible to determine the number of + cases that will be output to a system file at the time that the + header is written. The way that this is dealt with is by writing + the entire system file, including the header, then seeking back to + the beginning of the file and writing just the `ncases` field. For + files in which this is not valid, the seek operation fails. In + this case, `ncases` remains -1. + +* `flt64 bias;` + + Compression bias, ordinarily set to 100. Only integers between `1 + - bias` and `251 - bias` can be compressed. + + By assuming that its value is 100, PSPP uses `bias` to determine the + file's floating-point format and endianness.
If the compression + bias is not 100, PSPP cannot auto-detect the floating-point format + and assumes that it is IEEE 754 format with the same endianness as + the system file's integers, which is correct for all known system + files. + +* `char creation_date[9];` + + Date of creation of the system file, in `dd mmm yy` format, with + the month as standard English abbreviations, using an initial + capital letter and following with lowercase. If the date is not + available then this field is arbitrarily set to `01 Jan 70`. + +* `char creation_time[8];` + + Time of creation of the system file, in `hh:mm:ss` format and using + 24-hour time. If the time is not available then this field is + arbitrarily set to `00:00:00`. + +* `char file_label[64];` + + File label declared by the user, if any. Padded on the right with + spaces. + + A product that identifies itself as `VOXCO INTERVIEWER 4.3` uses + CR-only line ends in this field, rather than the more usual LF-only + or CR LF line ends. + +* `char padding[3];` + + Ignored padding bytes to make the structure a multiple of 32 bits + in length. Set to zeros. + diff --git a/rust/doc/src/system-file/index.md b/rust/doc/src/system-file/index.md new file mode 100644 index 0000000000..bf4cf38436 --- /dev/null +++ b/rust/doc/src/system-file/index.md @@ -0,0 +1,70 @@ +An SPSS system file holds a set of cases and dictionary information that +describes how they may be interpreted. The system file format dates +back 40+ years and has evolved greatly over that time to support new +features, but in a way to facilitate interchange between even the oldest +and newest versions of software. This chapter describes the system file +format. + +System files use four data types: 8-bit characters, 32-bit integers, +64-bit integers, and 64-bit floating points, called here `char`, +`int32`, `int64`, and `flt64`, respectively.
Data is not necessarily +aligned on a word or double-word boundary: the [long variable names +record](long-variable-names-record.md) and [very long string +record](very-long-string-record.md) have arbitrary byte length and can +therefore cause all data coming after them in the file to be +misaligned. + +Integer data in system files may be big-endian or little-endian. A +reader may detect the endianness of a system file by examining +[layout_code](file-header-record.md#layout_code) in the file header +record. + +Floating-point data in system files may nominally be in IEEE 754, IBM, +or VAX formats. A reader may detect the floating-point format in use +by examining [bias](file-header-record.md#bias) in the file header +record. Only files with IEEE 754 floating point data have actually +been encountered. + +PSPP detects big-endian and little-endian integer formats in system +files and translates as necessary. PSPP also detects the floating-point +format in use, as well as the endianness of IEEE 754 floating-point +numbers, and translates as needed. However, only IEEE 754 numbers with +the same endianness as integer data in the same file have actually been +observed in system files, and it is likely that other formats are +obsolete or were never used. + +System files use a few floating point values for special purposes: + +* `SYSMIS` + + The system-missing value is represented by the largest possible + negative number in the floating point format (`-DBL_MAX` or + `f64::MIN`). + +* `HIGHEST` + + `HIGHEST` is used as the high end of a missing value range with an + unbounded maximum. It is represented by the largest possible + positive number (`DBL_MAX` or `f64::MAX`). + +* `LOWEST` + + `LOWEST` is used as the low end of a missing value range with an + unbounded minimum. It was originally represented by the + second-largest negative number (in IEEE 754 format, + `0xffeffffffffffffe`).
System files written by SPSS 21 and later + instead use the largest negative number (`-DBL_MAX` or `f64::MIN`), + the same value as `SYSMIS`. This does not lead to ambiguity because + `LOWEST` appears in system files only in missing value ranges, which + never contain `SYSMIS`. + +System files may use most character encodings based on an 8-bit unit. +UTF-16 and UTF-32, based on wider units, appear to be unacceptable. +`rec_type` in the file header record is sufficient to distinguish +between ASCII and EBCDIC based encodings. The best way to determine +the specific encoding in use is to consult the [character encoding +record](character-encoding-record.md), if present, and failing that +the [character_code](machine-integer-info-record.md#character_code) in +the machine integer info record. The same encoding should be used +for the dictionary and the data in the file, although it is possible +to artificially synthesize files that use different encodings. diff --git a/rust/doc/src/system-file/long-string-missing-values-record.md b/rust/doc/src/system-file/long-string-missing-values-record.md new file mode 100644 index 0000000000..4d1ed7c50e --- /dev/null +++ b/rust/doc/src/system-file/long-string-missing-values-record.md @@ -0,0 +1,70 @@ +# Long String Missing Values Record + +This record, if present, specifies missing values for long string +variables. + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Repeated up to exactly `count` bytes. */ + int32 var_name_len; + char var_name[]; + char n_missing_values; + int32 value_len; + char values[value_len * n_missing_values]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 22. + +* `int32 size;` + + Always set to 1. + +* `int32 count;` + + The number of bytes following the header until the next header.
+ +* `int32 var_name_len;` + `char var_name[];` + + The number of bytes in the name of the long string variable that + has missing values, plus the variable name itself, which consists + of exactly `var_name_len` bytes. The variable name is not padded + to any particular boundary, nor is it null-terminated. + +* `char n_missing_values;` + + The number of missing values, either 1, 2, or 3. (This is, + unusually, a single byte instead of a 32-bit number.) + +* `int32 value_len;` + + The length of each missing value string, in bytes. This value + should be 8, because long string variables are at least 8 bytes + wide (by definition), only the first 8 bytes of a long string + variable's missing values are allowed to be non-spaces, and any + spaces within the first 8 bytes are included in the missing value + here. + +* `char values[value_len * n_missing_values]` + + The missing values themselves, without any padding or null + terminators. + +An earlier version of this document stated that `value_len` was +repeated before each of the missing values, so that there was an extra +`int32` value of 8 before each missing value after the first. Old +versions of PSPP wrote data files in this format. Readers can tolerate +this mistake, if they wish, by noticing and skipping the extra `int32` +values, which wouldn't ordinarily occur in strings. + diff --git a/rust/doc/src/system-file/long-string-value-labels-record.md b/rust/doc/src/system-file/long-string-value-labels-record.md new file mode 100644 index 0000000000..9a5c043444 --- /dev/null +++ b/rust/doc/src/system-file/long-string-value-labels-record.md @@ -0,0 +1,77 @@ +# Long String Value Labels Record + +This record, if present, specifies value labels for long string +variables. + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Repeated up to exactly `count` bytes. 
*/ + int32 var_name_len; + char var_name[]; + int32 var_width; + int32 n_labels; + long_string_label labels[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 21. + +* `int32 size;` + + Always set to 1. + +* `int32 count;` + + The number of bytes following the header until the next header. + +* `int32 var_name_len;` + `char var_name[];` + + The number of bytes in the name of the variable that has long + string value labels, plus the variable name itself, which consists + of exactly `var_name_len` bytes. The variable name is not padded + to any particular boundary, nor is it null-terminated. + +* `int32 var_width;` + + The width of the variable, in bytes, which will be between 9 and + 32767. + +* `int32 n_labels;` + `long_string_label labels[];` + + The long string labels themselves. The `labels` array contains + exactly `n_labels` elements, each of which has the following + substructure: + + ``` + int32 value_len; + char value[]; + int32 label_len; + char label[]; + ``` + + - `int32 value_len;` + `char value[];` + + The string value being labeled. `value_len` is the number of + bytes in `value`; it is equal to `var_width`. The `value` + array is not padded or null-terminated. + + - `int32 label_len;` + `char label[];` + + The label for the string value. `label_len`, which must be + between 0 and 120, is the number of bytes in `label`. The + `label` array is not padded or null-terminated. + diff --git a/rust/doc/src/system-file/long-variable-names-record.md b/rust/doc/src/system-file/long-variable-names-record.md new file mode 100644 index 0000000000..489f992e4c --- /dev/null +++ b/rust/doc/src/system-file/long-variable-names-record.md @@ -0,0 +1,41 @@ +# Long Variable Names Record + +If present, the long variable names record has the following format: + + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. 
*/ + char var_name_pairs[]; + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 13. + +* `int32 size;` + + The size of each element in the `var_name_pairs` member. Always + set to 1. + +* `int32 count;` + + The total number of bytes in `var_name_pairs`. + +* `char var_name_pairs[];` + + A list of key-value tuples, where each key is the name of a + variable, and the value is its long variable name. The key field is + at most 8 bytes long and must match the name of a variable which + appears in the [variable record](variable-record.md). The value + field is at most 64 bytes long. The key and value fields are + separated by a `=` byte. Each tuple is separated by a byte whose + value is 09. There is no trailing separator following the last + tuple. The total length is `count` bytes. + diff --git a/rust/doc/src/system-file/machine-floating-point-info-record.md b/rust/doc/src/system-file/machine-floating-point-info-record.md new file mode 100644 index 0000000000..a8ccf13028 --- /dev/null +++ b/rust/doc/src/system-file/machine-floating-point-info-record.md @@ -0,0 +1,48 @@ +# Machine Floating-Point Info Record + +The floating-point info record, if present, has the following format: + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Data. */ + flt64 sysmis; + flt64 highest; + flt64 lowest; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 4. + +* `int32 size;` + + Size of each piece of data in the data part, in bytes. Always set + to 8. + +* `int32 count;` + + Number of pieces of data in the data part. Always set to 3. + +* `flt64 sysmis;` + `flt64 highest;` + `flt64 lowest;` + + The system missing value, the value used for `HIGHEST` in missing + values, and the value used for `LOWEST` in missing values, + respectively. See the [introduction](index.md) for more + information. 
+ + The SPSSWriter library in PHP, which identifies itself as `FOM SPSS + 1.0.0` in the file header record `prod_name` field, writes + unexpected values to these fields, but it uses the same values + consistently throughout the rest of the file. + diff --git a/rust/doc/src/system-file/machine-integer-info-record.md b/rust/doc/src/system-file/machine-integer-info-record.md new file mode 100644 index 0000000000..f08b0107ff --- /dev/null +++ b/rust/doc/src/system-file/machine-integer-info-record.md @@ -0,0 +1,196 @@ +# Machine Integer Info Record + +The integer info record, if present, has the following format: + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Data. */ + int32 version_major; + int32 version_minor; + int32 version_revision; + int32 machine_code; + int32 floating_point_rep; + int32 compression_code; + int32 endianness; + int32 character_code; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 3. + +* `int32 size;` + + Size of each piece of data in the data part, in bytes. Always set + to 4. + +* `int32 count;` + + Number of pieces of data in the data part. Always set to 8. + +* `int32 version_major;` + + PSPP major version number. In version X.Y.Z, this is X. + +* `int32 version_minor;` + + PSPP minor version number. In version X.Y.Z, this is Y. + +* `int32 version_revision;` + + PSPP version revision number. In version X.Y.Z, this is Z. + +* `int32 machine_code;` + + Machine code. PSPP always sets this field to -1, but other + values may appear. + +* `int32 floating_point_rep;` + + Floating point representation code. For IEEE 754 systems (the most + common) this is 1. IBM 370 is supposed to set this to 2, and DEC + VAX E to 3, but neither of these has been observed. + +* `int32 compression_code;` + + Compression code. Always set to 1, regardless of whether or how + the file is compressed.
+ +* `int32 endianness;` + + Machine endianness. 1 indicates big-endian, 2 indicates + little-endian. + +* `int32 character_code;` + + Character code. The following values + have actually been observed in system files: + + - 1 + + EBCDIC. Only one example has been observed. + + - 2 + + 7-bit ASCII. Old versions of SPSS for Unix and Windows always + wrote value 2 in this field, regardless of the encoding in + use, so it is not reliable and should be ignored. + + - 3 + + 8-bit "ASCII". + + - 819 + + ISO 8859-1 (IBM AIX code page number). + + - 874 + 9066 + + The `windows-874` code page for Thai. + + - 932 + + The `windows-932` code page for Japanese (aka `Shift_JIS`). + + - 936 + + The `windows-936` code page for simplified Chinese (aka `GBK`). + + - 949 + + Probably `ks_c_5601-1987`, Unified Hangul Code. + + - 950 + + The `big5` code page for traditional Chinese. + + - 1250 + + The `windows-1250` code page for Central European and Eastern + European languages. + + - 1251 + + The `windows-1251` code page for Cyrillic languages. + + - 1252 + + The `windows-1252` code page for Western European languages. + + - 1253 + + The `windows-1253` code page for modern Greek. + + - 1254 + + The `windows-1254` code page for Turkish. + + - 1255 + + The `windows-1255` code page for Hebrew. + + - 1256 + + The `windows-1256` code page for Arabic script. + + - 1257 + + The `windows-1257` code page for Estonian, Latvian, and + Lithuanian. + + - 1258 + + The `windows-1258` code page for Vietnamese. + + - 20127 + + US-ASCII. + + - 28591 + + ISO 8859-1 (Latin-1). + + - 28592 + + ISO 8859-2 (Central European). + + - 28605 + + ISO 8859-15 (Latin-9). + + - 51949 + + The `euc-kr` code page for Korean. + + - 65001 + + UTF-8. + + The following additional values are known to be defined: + + - 3 + + 8-bit "ASCII". + + - 4 + + DEC Kanji. + + The most common values observed, from most to least common, are + 1252, 65001, 2, and 28591.
+ + Other Windows code page numbers are known to be generally valid. + + Newer versions also write the character encoding [as a + string](character-encoding-record.md). + diff --git a/rust/doc/src/system-file/multiple-response-sets-records.md b/rust/doc/src/system-file/multiple-response-sets-records.md new file mode 100644 index 0000000000..c7a091f79c --- /dev/null +++ b/rust/doc/src/system-file/multiple-response-sets-records.md @@ -0,0 +1,124 @@ +# Multiple Response Sets Records + +The system file format has two different types of records that +represent multiple response sets. The first type of record describes +multiple response sets that can be understood by SPSS before +version 14. The second type of record, with a closely related format, +is used for multiple dichotomy sets that use the +`CATEGORYLABELS=COUNTEDVALUES` feature added in version 14. + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. */ + char mrsets[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Set to 7 for records that describe multiple + response sets understood by SPSS before version 14, or to 19 for + records that describe dichotomy sets that use the + `CATEGORYLABELS=COUNTEDVALUES` feature added in version 14. + +* `int32 size;` + + The size of each element in the `mrsets` member. Always set to 1. + +* `int32 count;` + + The total number of bytes in `mrsets`. + +* `char mrsets[];` + + Zero or more line feeds (byte 0x0a), followed by a series of + multiple response sets, each of which consists of the following: + + - The set's name (an identifier that begins with `$`), in mixed + upper and lower case. + + - An equals sign (`=`). + + - `C` for a multiple category set, `D` for a multiple dichotomy + set with `CATEGORYLABELS=VARLABELS`, or `E` for a multiple + dichotomy set with `CATEGORYLABELS=COUNTEDVALUES`.
+ + - For a multiple dichotomy set with + `CATEGORYLABELS=COUNTEDVALUES`, a space, followed by a number + expressed as decimal digits, followed by a space. If + `LABELSOURCE=VARLABEL` was specified on MRSETS, then the number + is 11; otherwise it is 1.[^note] + + - For either kind of multiple dichotomy set, the counted value, + as a positive integer count specified as decimal digits, + followed by a space, followed by as many string bytes as + specified in the count. If the set contains numeric + variables, the string consists of the counted integer value + expressed as decimal digits. If the set contains string + variables, the string contains the counted string value. + Either way, the string may be padded on the right with spaces + (older versions of SPSS seem to always pad to a width of 8 + bytes; newer versions don't). + + - A space. + + - The multiple response set's label, using the same format as + for the counted value for multiple dichotomy sets. A string + of length 0 means that the set does not have a label. A + string of length 0 is also written if LABELSOURCE=VARLABEL was + specified. + + - The short names of the variables in the set, converted to + lowercase, each preceded by a single space. + + Even though a multiple response set must have at least two + variables, some system files contain multiple response sets + with no variables or one variable. The source and meaning of + these multiple response sets is unknown. (Perhaps they arise + from creating a multiple response set then deleting all the + variables that it contains?) + + - One line feed (byte 0x0a). Sometimes multiple, even hundreds, + of line feeds are present. 
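The counted value and the set label in the list above share one encoding: a decimal byte count, a space, and then exactly that many bytes. As a minimal sketch of how a reader might split such a counted string off the front of `mrsets`, assuming the text is already available as a byte slice (the function name `parse_counted_string` is hypothetical and is not taken from PSPP):

```rust
/// Parses a counted string such as `10 my mcgroup`: a decimal byte count,
/// one space, then exactly that many bytes.  Returns the string bytes and
/// the unparsed remainder, or `None` if the input is malformed.
/// (Hypothetical helper for illustration, not part of PSPP.)
fn parse_counted_string(input: &[u8]) -> Option<(&[u8], &[u8])> {
    // Length of the leading run of ASCII decimal digits.
    let n_digits = input.iter().take_while(|b| b.is_ascii_digit()).count();
    if n_digits == 0 {
        return None;
    }
    let count: usize = std::str::from_utf8(&input[..n_digits])
        .ok()?
        .parse()
        .ok()?;

    // A single space separates the count from the string bytes.
    let rest = input[n_digits..].strip_prefix(b" ")?;
    if rest.len() < count {
        return None;
    }
    Some(rest.split_at(count))
}
```

Because the count is in bytes, the label itself may contain spaces, which is why a reader cannot simply split the line on whitespace.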
+ +Example: Given appropriate variable definitions, consider the +following MRSETS command: + +``` +MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c + /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55 + /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes' + /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES + VARIABLES=k l m VALUE=34 + /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL + VARIABLES=n o p VALUE='choice'. +``` + +The above would generate the following multiple response set record +of subtype 7: + +``` +$a=C 10 my mcgroup a b c +$b=D2 55 0 g e f d +$c=D3 Yes 10 mdgroup #2 h i j +``` + +It would also generate the following multiple response set record +with subtype 19: + +``` +$d=E 1 2 34 13 third mdgroup k l m +$e=E 11 6 choice 0 n o p +``` + +[^note]: This part of the format may not be fully understood, because + only a single example of each possibility has been examined. + diff --git a/rust/doc/src/system-file/other-informational-records.md b/rust/doc/src/system-file/other-informational-records.md new file mode 100644 index 0000000000..c0a5a66d19 --- /dev/null +++ b/rust/doc/src/system-file/other-informational-records.md @@ -0,0 +1,24 @@ +# Other Informational Records + +This chapter documents many specific types of extension +records, but others are known to exist. PSPP ignores unknown +extension records when reading system files. + + The following extension record subtypes have also been observed, with +the following believed meanings: + +* 6 + + Date info, probably related to USE (according to Aapi Hämäläinen). + +* 12 + + A UUID in the format described in RFC 4122. Only two examples + observed, both written by SPSS 13, and in each case the UUID + contained both upper and lower case. + +* 24 + + XML that describes how data in the file should be displayed + on-screen.
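Ignoring an unknown extension record comes down to reading its `size` and `count` fields and skipping `size * count` bytes. The following is a minimal sketch of that step, not PSPP's actual reader: the function names and the `big_endian` flag are assumptions for illustration, and the caller is assumed to have already read `rec_type` (7) and `subtype`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads one `int32` in the file's byte order (hypothetical helper).
fn read_i32<R: Read>(r: &mut R, big_endian: bool) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(if big_endian {
        i32::from_be_bytes(buf)
    } else {
        i32::from_le_bytes(buf)
    })
}

/// Skips the data of an extension record whose `rec_type` and `subtype`
/// have already been read.  Returns the number of data bytes skipped.
fn skip_extension_body<R: Read + Seek>(r: &mut R, big_endian: bool) -> io::Result<u64> {
    let size = read_i32(r, big_endian)?;
    let count = read_i32(r, big_endian)?;
    if size < 0 || count < 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "negative size or count in extension record",
        ));
    }
    // Both factors are below 2^31, so the product fits in an i64.
    let n = i64::from(size) * i64::from(count);
    r.seek(SeekFrom::Current(n))?;
    Ok(n as u64)
}
```

A reader that cannot seek could instead read and discard the same number of bytes.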
+ diff --git a/rust/doc/src/system-file/system-file-record-structure.md b/rust/doc/src/system-file/system-file-record-structure.md new file mode 100644 index 0000000000..59c8eb9373 --- /dev/null +++ b/rust/doc/src/system-file/system-file-record-structure.md @@ -0,0 +1,83 @@ +# System File Record Structure + +System files are divided into records with the following format: + +``` + int32 type; + char data[]; +``` + + This header does not identify the length of the `data` or any +information about what it contains, so the system file reader must +understand the format of `data` based on `type`. However, records with +type 7, called “extension records”, have a stricter format: + +``` + int32 type; + int32 subtype; + int32 size; + int32 count; + char data[size * count]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. This value identifies a particular kind of + extension record. + +* `int32 size;` + + The size of each piece of data that follows the header, in bytes. + Known extension records use 1, 4, or 8, for `char`, `int32`, and + `flt64` format data, respectively. + +* `int32 count;` + + The number of pieces of data that follow the header. + +* `char data[size * count];` + + Data, whose format and interpretation depend on the subtype. + +An extension record contains exactly `size * count` bytes of data, +which allows a reader that does not understand an extension record to +skip it. Extension records provide only nonessential information, so +this allows for files written by newer software to preserve backward +compatibility with older or less capable readers. + +Records in a system file must appear in the following order: + +* File header record. + +* Variable records. + +* All pairs of value labels records and value label variables + records, if present. + +* Document record, if present. + +* Extension (type 7) records, in ascending numerical order of their + subtypes. 
+ + System files written by SPSS include at most one of each kind of + extension record. This is generally true of system files written + by other software as well, with known exceptions noted below in the + individual sections about each type of record. + +* Dictionary termination record. + +* Data record. + +We advise authors of programs that read system files to tolerate +format variations. Various kinds of misformatting and corruption have +been observed in system files written by SPSS and other software +alike. In particular, because extension records provide nonessential +information, it is generally better to ignore an extension record +entirely than to refuse to read a system file. + +The following sections describe the known kinds of records. + diff --git a/rust/doc/src/system-file/value-labels-records.md b/rust/doc/src/system-file/value-labels-records.md new file mode 100644 index 0000000000..91a9cba882 --- /dev/null +++ b/rust/doc/src/system-file/value-labels-records.md @@ -0,0 +1,86 @@ +# Value Labels Records + +The value label records documented in this section are used for +numeric and short string variables only. Long string variables may +have value labels, but their value labels are recorded using a +[different record type](long-string-value-labels-record.md). + +[ReadStat](file-header-record.md) writes value labels that label a +single value more than once. In more detail, it emits value labels +whose values are longer than string variables' widths, that are +identical within the actual width of the variable, e.g. labels for values +`ABC123` and `ABC456` for a string variable with width 3. For files +written by this software, PSPP ignores such labels. + +## Value Label Record for Labels + +The value label record has the following format: + +``` + int32 rec_type; + int32 label_count; + + /* Repeated `label_count` times. */ + char value[8]; + char label_len; + char label[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 3.
+ +* `int32 label_count;` + + Number of value labels present in this record. + +The remaining fields are repeated `label_count` times. Each repetition +specifies one value label. + +* `char value[8];` + + A numeric value or a short string value padded as necessary to 8 + bytes in length. Its type and width cannot be determined until the + following value label variables record (see below) is read. + +* `char label_len;` + + The label's length, in bytes. The documented maximum length varies + from 60 to 120 based on SPSS version. PSPP supports value labels + up to 255 bytes long. + +* `char label[];` + + `label_len` bytes of the actual label, followed by up to 7 bytes of + padding to bring `label` and `label_len` together to a multiple of + 8 bytes in length. + +## Value Label Record for Variables + +The value label record is always immediately followed by a value +label variables record with the following format: + +``` + int32 rec_type; + int32 var_count; + int32 vars[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 4. + +* `int32 var_count;` + + Number of variables to which the associated value labels from the + value label record are to be applied. + +* `int32 vars[];` + + A list of 1-based [dictionary + indexes](variable-record.md#dictionary-index) of variables to which + to apply the value labels. There are `var_count` elements. + + String variables wider than 8 bytes may not be specified in this + list. + diff --git a/rust/doc/src/system-file/variable-display-parameter-record.md b/rust/doc/src/system-file/variable-display-parameter-record.md new file mode 100644 index 0000000000..f2f08edfdf --- /dev/null +++ b/rust/doc/src/system-file/variable-display-parameter-record.md @@ -0,0 +1,71 @@ +# Variable Display Parameter Record + +The variable display parameter record, if present, has the following +format: + + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Repeated `count` times.
*/ + int32 measure; + int32 width; /* Not always present. */ + int32 alignment; + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 11. + +* `int32 size;` + + The size of `int32`. Always set to 4. + +* `int32 count;` + + The number of sets of variable display parameters (ordinarily the + number of variables in the dictionary), times 2 or 3. + +The remaining members are repeated `count` times, in the same order as +the variable records. No element corresponds to variable records that +continue long string variables. The meanings of these members are as +follows: + +* `int32 measure;` + + The measurement level of the variable: + + | Value | Level | + |------:|:---------| + | 0 | Unknown | + | 1 | Nominal | + | 2 | Ordinal | + | 3 | Scale | + + An "unknown" `measure` of 0 means that the variable was created in + some way that doesn't make the measurement level clear, e.g. with a + `COMPUTE` transformation. PSPP sets the measurement level the first + time it reads the data, so this should rarely appear. + +* `int32 width;` + + The width of the display column for the variable in characters. + + This field is present if `count` is 3 times the number of variables + in the dictionary. It is omitted if `count` is 2 times the number of + variables. + +* `int32 alignment;` + + The alignment of the variable for display purposes: + + | Value | Alignment | + |------:|:---------------| + | 0 | Left aligned | + | 1 | Right aligned | + | 2 | Centre aligned | diff --git a/rust/doc/src/system-file/variable-record.md b/rust/doc/src/system-file/variable-record.md new file mode 100644 index 0000000000..433243fde1 --- /dev/null +++ b/rust/doc/src/system-file/variable-record.md @@ -0,0 +1,186 @@ +# Variable Record + +There must be one variable record for each numeric variable and each +string variable with width 8 bytes or less. String variables wider than +8 bytes have one variable record for each 8 bytes, rounding up. 
The +first variable record for a long string specifies the variable's correct +dictionary information. Subsequent variable records for a long string +are filled with dummy information: a type of -1, no variable label or +missing values, print and write formats that are ignored, and an empty +string as name. A few system files have been encountered that include a +variable label on dummy variable records, so readers should take care to +parse dummy variable records in the same way as other variable records. + + +The "dictionary index" of a variable is +a 1-based offset in the set of variable records, including dummy +variable records for long string variables. The first variable record +has a dictionary index of 1, the second has a dictionary index of 2, +and so on. + +The system file format does not directly support string variables +wider than 255 bytes. Such very long string variables are represented +by a number of narrower string variables. See [very long string +record](very-long-string-record.md) for details. + + A system file should contain at least one variable and thus at least +one variable record, but system files have been observed in the wild +without any variables (thus, no data either). + +``` + int32 rec_type; + int32 type; + int32 has_var_label; + int32 n_missing_values; + int32 print; + int32 write; + char name[8]; + + /* Present only if `has_var_label` is 1. */ + int32 label_len; + char label[]; + + /* Present only if `n_missing_values` is nonzero. */ + flt64 missing_values[]; +``` + +* `int32 rec_type;` + + Record type code. Always set to 2. + +* `int32 type;` + + Variable type code. Set to 0 for a numeric variable. For a short + string variable or the first part of a long string variable, this + is set to the width of the string. For the second and subsequent + parts of a long string variable, set to -1, and the remaining + fields in the structure are ignored. 
+ +* `int32 has_var_label;` + + If this variable has a variable label, set to 1; otherwise, set to + 0. + +* `int32 n_missing_values;` + + If the variable has no missing values, set to 0. If the variable + has one, two, or three discrete missing values, set to 1, 2, or 3, + respectively. If the variable has a range for missing values, + set to -2; if the variable has a range for missing values plus a + single discrete value, set to -3. + + A long string variable always has the value 0 here. A separate + record indicates [missing values for long string + variables](long-string-missing-values-record.md). + +* `int32 print;` + + Print format for this variable. See below. + +* `int32 write;` + + Write format for this variable. See below. + +* `char name[8];` + + Variable name. The variable name must begin with a capital letter + or the at-sign (`@`). Subsequent characters may also be digits, + octothorpes (`#`), dollar signs (`$`), underscores (`_`), or full + stops (`.`). The variable name is padded on the right with spaces. + + The `name` fields should be unique within a system file. System + files written by SPSS that contain very long string variables with + similar names sometimes contain duplicate names that are later + eliminated by resolving the [very long string + names](very-long-string-record.md). PSPP handles duplicates by + assigning them new, unique names. + +* `int32 label_len;` + + This field is present only if `has_var_label` is set to 1. It is + set to the length, in characters, of the variable label. The + documented maximum length varies from 120 to 255 based on SPSS + version, but some files have been seen with longer labels. PSPP + accepts labels of any length. + +* `char label[];` + + This field is present only if `has_var_label` is set to 1. It has + length `label_len`, rounded up to the nearest multiple of 32 bits. + The first `label_len` characters are the variable's variable label.
+ +* `flt64 missing_values[];` + + This field is present only if `n_missing_values` is nonzero. It has + the same number of 8-byte elements as the absolute value of + `n_missing_values`. Each element is interpreted as a number for + numeric variables (with `HIGHEST` and `LOWEST` indicated as + described in the [introduction](index.md)). For string variables of + width less than 8 bytes, elements are right-padded with spaces; for + string variables wider than 8 bytes, only the first 8 bytes of each + missing value are specified, with the remainder implicitly all + spaces. + + For discrete missing values, each element represents one missing + value. When a range is present, the first element denotes the + minimum value in the range, and the second element denotes the + maximum value in the range. When a range plus a value are present, + the third element denotes the additional discrete missing value. + +The `print` and `write` members of sysfile_variable are output +formats coded into `int32` types. The least-significant byte of the +`int32` represents the number of decimal places, and the next two bytes +in order of increasing significance represent field width and format +type, respectively. The most-significant byte is not used and should be +set to zero. + +Format types are defined as follows: + +| Value | Meaning | +|------:|:------------| +| 0 | Not used. | +| 1 | `A` | +| 2 | `AHEX` | +| 3 | `COMMA` | +| 4 | `DOLLAR` | +| 5 | `F` | +| 6 | `IB` | +| 7 | `PIBHEX` | +| 8 | `P` | +| 9 | `PIB` | +| 10 | `PK` | +| 11 | `RB` | +| 12 | `RBHEX` | +| 13 | Not used. | +| 14 | Not used. | +| 15 | `Z` | +| 16 | `N` | +| 17 | `E` | +| 18 | Not used. | +| 19 | Not used. 
| +| 20 | `DATE` | +| 21 | `TIME` | +| 22 | `DATETIME` | +| 23 | `ADATE` | +| 24 | `JDATE` | +| 25 | `DTIME` | +| 26 | `WKDAY` | +| 27 | `MONTH` | +| 28 | `MOYR` | +| 29 | `QYR` | +| 30 | `WKYR` | +| 31 | `PCT` | +| 32 | `DOT` | +| 33 | `CCA` | +| 34 | `CCB` | +| 35 | `CCC` | +| 36 | `CCD` | +| 37 | `CCE` | +| 38 | `EDATE` | +| 39 | `SDATE` | +| 40 | `MTIME` | +| 41 | `YMDHMS` | + +A few system files have been observed in the wild with invalid +`write` fields, in particular with value 0. Readers should probably +treat invalid `print` or `write` fields as some default format. diff --git a/rust/doc/src/system-file/variable-sets-record.md b/rust/doc/src/system-file/variable-sets-record.md new file mode 100644 index 0000000000..0c53aec7e0 --- /dev/null +++ b/rust/doc/src/system-file/variable-sets-record.md @@ -0,0 +1,50 @@ +# Variable Sets Record + +The SPSS GUI offers users the ability to arrange variables in sets. +Users may enable and disable sets individually, and the data editor and +analysis dialog boxes only show enabled sets. Syntax does not use +variable sets. + + The variable sets record, if present, has the following format: + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of text. */ + char text[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 5. + +* `int32 size;` + + Always set to 1. + +* `int32 count;` + + The total number of bytes in `text`. + +* `char text[];` + + The variable sets, in a text-based format. + + Each variable set occupies one line of text, each of which ends + with a line feed (byte 0x0a), optionally preceded by a carriage + return (byte 0x0d). + + Each line begins with the name of the variable set, followed by an + equals sign (`=`) and a space (byte 0x20), followed by the long + variable names of the members of the set, separated by spaces. 
A + variable set may be empty, in which case the equals sign and the + space following it are still present. + diff --git a/rust/doc/src/system-file/very-long-string-record.md b/rust/doc/src/system-file/very-long-string-record.md new file mode 100644 index 0000000000..fbf772c469 --- /dev/null +++ b/rust/doc/src/system-file/very-long-string-record.md @@ -0,0 +1,84 @@ +# Very Long String Record + +Old versions of SPSS limited string variables to a width of 255 bytes. +For backward compatibility with these older versions, the system file +format represents a string longer than 255 bytes, called a “very long +string”, as a collection of strings no longer than 255 bytes each. The +strings concatenated to make a very long string are called its +“segments”; for consistency, variables other than very long strings are +considered to have a single segment. + +A very long string with a width of `w` has `n = (w + 251) / 252` +segments, that is, one segment for every 252 bytes of width, rounding +up. It would be logical, then, for each of the segments except the +last to have a width of 252 and the last segment to have the +remainder, but this is not the case. In fact, each segment except the +last has a width of 255 bytes. The last has width `w - (n - 1) * +252`; some versions of SPSS make it slightly wider, but not wide +enough to make the last segment require another 8 bytes of data. + +Data is packed tightly into segments of a very long string, 255 bytes +per segment. Because 255 bytes of segment data are allocated for +every 252 bytes of the very long string's width (approximately), some +unused space is left over at the end of the allocated segments. Data +in unused space is ignored. + +> Example: Consider a very long string of width 20,000. Such a very +long string has 20,000 / 252 = 80 (rounding up) segments. The first +79 segments have width 255; the last segment has width 20,000 - 79 * +252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8). 
+The very long string's data is actually stored in the 19,890 bytes in +the first 78 segments, plus the first 110 bytes of the 79th segment +(19,890 + 110 = 20,000). The remaining 145 bytes of the 79th segment +and all 92 bytes of the 80th segment are unused. + +The very long string record explains how to stitch together segments +to obtain very long string data. For each of the very long string +variables in the dictionary, it specifies the name of its first +segment's variable and the very long string variable's actual width. +The remaining segments immediately follow the named variable in the +system file's dictionary. + +The very long string record, which is present only if the system file +contains very long string variables, has the following format: + +``` + /* Header. */ + int32 rec_type; + int32 subtype; + int32 size; + int32 count; + + /* Exactly `count` bytes of data. */ + char string_lengths[]; +``` + +* `int32 rec_type;` + + Record type. Always set to 7. + +* `int32 subtype;` + + Record subtype. Always set to 14. + +* `int32 size;` + + The size of each element in the `string_lengths` member. Always + set to 1. + +* `int32 count;` + + The total number of bytes in `string_lengths`. + +* `char string_lengths[];` + + A list of key-value tuples, where each key is the name of a variable, and + the value is its length. The key field is at most 8 bytes long and must + match the name of a variable which appears in the [variable + record](variable-record.md). The value field is exactly 5 bytes + long. It is a zero-padded, ASCII-encoded string that is the length + of the variable. The key and value fields are separated by a `=` + byte. Tuples are delimited by a two-byte sequence {00, 09}. After + the last tuple, there may be a single byte 00, or {00, 09}. The + total length is `count` bytes. +
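The segment arithmetic and the `string_lengths` layout described in this section can be sketched in Rust as follows. This is an illustration under stated assumptions rather than PSPP's implementation: both function names are hypothetical, variable names are assumed to be valid UTF-8 (a real reader would decode them with the file's character encoding), and malformed tuples are silently skipped.

```rust
/// Returns the widths of the segments of a very long string of width `w`,
/// using the rule from this section: `n = (w + 251) / 252` segments, each
/// 255 bytes wide except the last, which is `w - (n - 1) * 252` bytes.
/// (Hypothetical helper; some SPSS versions write a slightly wider last
/// segment, which this sketch does not model.)
fn segment_widths(w: usize) -> Vec<usize> {
    assert!(w > 0, "width must be positive");
    let n = (w + 251) / 252; // one segment per 252 bytes, rounding up
    let mut widths = vec![255; n - 1];
    widths.push(w - (n - 1) * 252);
    widths
}

/// Parses `string_lengths` into (variable name, width) pairs.
/// Assumes UTF-8 names; tuples that do not parse are skipped.
fn parse_string_lengths(data: &[u8]) -> Vec<(String, usize)> {
    data.split(|&b| b == 0x09) // each tuple is followed by {00, 09}
        .filter_map(|tuple| {
            // Drop the 00 byte that precedes each 09 or trails the record.
            let tuple = tuple.strip_suffix(&[0u8]).unwrap_or(tuple);
            let text = std::str::from_utf8(tuple).ok()?;
            let (name, width) = text.split_once('=')?;
            Some((name.to_string(), width.parse::<usize>().ok()?))
        })
        .collect()
}
```

For the width-20,000 example above, `segment_widths(20_000)` yields 80 entries, the first 79 equal to 255 and the last equal to 92.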