# Developer Documentation
-- [System File Format](system-file/index.md)
- - [System File Record Structure](system-file/system-file-record-structure.md)
- - [File Header Record](system-file/file-header-record.md)
- - [Variable Record](system-file/variable-record.md)
- - [Value Labels Records](system-file/value-labels-records.md)
- - [Document Record](system-file/document-record.md)
- - [Machine Integer Info Record](system-file/machine-integer-info-record.md)
- - [Machine Floating-Point Info Record](system-file/machine-floating-point-info-record.md)
- - [Multiple Response Sets Records](system-file/multiple-response-sets-records.md)
- - [Extra Product Info Record](system-file/extra-product-info-record.md)
- - [Variable Display Parameter Record](system-file/variable-display-parameter-record.md)
- - [Variable Sets Record](system-file/variable-sets-record.md)
- - [Long Variable Names Record](system-file/long-variable-names-record.md)
- - [Very Long String Record](system-file/very-long-string-record.md)
- - [Character Encoding Record](system-file/character-encoding-record.md)
- - [Long String Value Labels Record](system-file/long-string-value-labels-record.md)
- - [Long String Missing Values Record](system-file/long-string-missing-values-record.md)
- - [Data File and Variable Attributes Record](system-file/data-file-and-variable-attributes-record.md)
- - [Extended Number of Cases Record](system-file/extended-number-of-cases-record.md)
- - [Other Informational Records](system-file/other-informational-records.md)
- - [Dictionary Termination Record](system-file/dictionary-termination-record.md)
- - [Data Record](system-file/data-record.md)
+- [System File Format](system-file.md)
- [SPSS Viewer File Format](spv/index.md)
- [Structure Members](spv/structure.md)
- [Light Detail Members](spv/light-detail.md)
formats. This is what portable files are designed to do.
The portable file format is mostly obsolete. [System
-files](system-file/index.md) are a better alternative.
+files](system-file.md) are a better alternative.
> This information is gleaned from examination of ASCII-formatted
portable files only, so some of it may be incorrect for portable files
- Print format. This is a set of three integer fields:
- - [Format type](system-file/variable-record.md#format-types) encoded
- the same as in system files.
+ - [Format type](system-file.md#format-types) encoded the same as in
+ system files.
- Format width. 1-40.
A few portable files with invalid format types or formats that are
not of the appropriate width for their variables have been spotted
- in the wild. PSPP assigns a default F or A format to a variable
+ in the wild. PSPP assigns a default `F` or `A` format to a variable
with an invalid format.
- Write format. Same structure as the print format described above.
--- /dev/null
+# System File Format
+
+An SPSS system file holds a set of cases and dictionary information that
+describes how they may be interpreted. The system file format dates
+back 40+ years and has evolved greatly over that time to support new
+features, but in a way to facilitate interchange between even the oldest
+and newest versions of software. This chapter describes the system file
+format.
+
+<!-- toc -->
+
+## Introduction
+
+System files use four data types: 8-bit characters, 32-bit integers,
+64-bit integers, and 64-bit floating points, called here `char’,
+`int32’, `int64’, and `flt64’, respectively. Data is not necessarily
+aligned on a word or double-word boundary: the [long variable name
+record](#long-variable-names-record) and [very long string
+record](#very-long-string-record) have arbitrary byte length and can
+therefore cause all data coming after them in the file to be
+misaligned.
+
+Integer data in system files may be big-endian or little-endian. A
+reader may detect the endianness of a system file by examining
+`layout_code` in the [file header record](#file-header-record).
+
+Floating-point data in system files may nominally be in IEEE 754, IBM,
+or VAX formats. A reader may detect the floating-point format in use
+by examining `bias` in the [file header record](#file-header-record).
+Only files with IEEE 754 floating point data have actually been
+encountered.
+
+PSPP detects big-endian and little-endian integer formats in system
+files and translates as necessary. PSPP also detects the floating-point
+format in use, as well as the endianness of IEEE 754 floating-point
+numbers, and translates as needed. However, only IEEE 754 numbers with
+the same endianness as integer data in the same file have actually been
+observed in system files, and it is likely that other formats are
+obsolete or were never used.
+
+System files use a few floating point values for special purposes:
+
+* `SYSMIS`
+
+ The system-missing value is represented by the largest possible
+ negative number in the floating point format (`-DBL_MAX` or
+ `f64::MIN`).
+
+* `HIGHEST`
+
+ `HIGHEST` is used as the high end of a missing value range with an
+ unbounded maximum. It is represented by the largest possible
+ positive number (`DBL_MAX` or `f64::MAX`).
+
+* `LOWEST`
+
+ `LOWEST` is used as the low end of a missing value range with an
+ unbounded minimum. It was originally represented by the
+ second-largest negative number (in IEEE 754 format,
+ `0xffeffffffffffffe`). System files written by SPSS 21 and later
+ instead use the largest negative number (`-DBL_MAX` or `f64::MIN`),
+ the same value as `SYSMIS`. This does not lead to ambiguity because
+ `LOWEST` appears in system files only in missing value ranges, which
+ never contain `SYSMIS`.
+
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+`rec_type` in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings. The best way to determine
+the specific encoding in use is to consult the [character encoding
+record](#character-encoding-record), if present, and failing that
+`character_code` in the [machine integer info
+record](#machine-integer-info-record). The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different
+encodings.
+
+## System File Record Structure
+
+System files are divided into records with the following format:
+
+```
+ int32 type;
+ char data[];
+```
+
+ This header does not identify the length of the `data` or any
+information about what it contains, so the system file reader must
+understand the format of `data` based on `type`. However, records with
+type 7, called “extension records”, have a stricter format:
+
+```
+ int32 type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+ char data[size * count];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. This value identifies a particular kind of
+ extension record.
+
+* `int32 size;`
+
+ The size of each piece of data that follows the header, in bytes.
+ Known extension records use 1, 4, or 8, for `char`, `int32`, and
+ `flt64` format data, respectively.
+
+* `int32 count;`
+
+ The number of pieces of data that follow the header.
+
+* `char data[size * count];`
+
+ Data, whose format and interpretation depend on the subtype.
+
+An extension record contains exactly `size * count` bytes of data,
+which allows a reader that does not understand an extension record to
+skip it. Extension records provide only nonessential information, so
+this allows for files written by newer software to preserve backward
+compatibility with older or less capable readers.
+
+Records in a system file must appear in the following order:
+
+* File header record.
+
+* Variable records.
+
+* All pairs of value labels records and value label variables
+ records, if present.
+
+* Document record, if present.
+
+* Extension (type 7) records, in ascending numerical order of their
+ subtypes.
+
+ System files written by SPSS include at most one of each kind of
+ extension record. This is generally true of system files written
+ by other software as well, with known exceptions noted below in the
+ individual sections about each type of record.
+
+* Dictionary termination record.
+
+* Data record.
+
+We advise authors of programs that read system files to tolerate
+format variations. Various kinds of misformatting and corruption have
+been observed in system files written by SPSS and other software
+alike. In particular, because extension records provide nonessential
+information, it is generally better to ignore an extension record
+entirely than to refuse to read a system file.
+
+The following sections describe the known kinds of records.
+
+## File Header Record
+
+A system file begins with the file header, with the following format:
+
+```
+ char rec_type[4];
+ char prod_name[60];
+ int32 layout_code;
+ int32 nominal_case_size;
+ int32 compression;
+ int32 weight_index;
+ int32 ncases;
+ flt64 bias;
+ char creation_date[9];
+ char creation_time[8];
+ char file_label[64];
+ char padding[3];
+```
+
+* `char rec_type[4];`
+
+ Record type code, either `$FL2` for system files with uncompressed
+ data or data compressed with simple bytecode compression, or `$FL3`
+ for system files with ZLIB compressed data.
+
+ This is truly a character field that uses the character encoding as
+ other strings. Thus, in a file with an ASCII-based character
+ encoding this field contains `24 46 4c 32` or `24 46 4c 33`, and in
+ a file with an EBCDIC-based encoding this field contains `5b c6 d3
+ f2`. (No EBCDIC-based ZLIB-compressed files have been observed.)
+
+* `char prod_name[60];`
+
+ Product identification string. This always begins with the
+ characters `@(#) SPSS DATA FILE`. PSPP uses the remaining
+ characters to give its version and the operating system name; for
+ example, `GNU pspp 0.1.4 - sparc-sun-solaris2.5.2`. The string is
+ truncated if it would be longer than 60 characters; otherwise it is
+ padded on the right with spaces.
+
+ The product name field allow readers to behave differently based on
+ quirks in the way that particular software writes system files.
+ *Note Value Labels Records::, for the detail of the quirk that the
+ PSPP system file reader tolerates in files written by ReadStat,
+ which has `https://github.com/WizardMac/ReadStat` in `prod_name`.
+
+* `int32 layout_code;`
+
+ Normally set to 2, although a few system files have been spotted in
+ the wild with a value of 3 here. PSPP use this value to determine
+ the file's integer endianness.
+
+* `int32 nominal_case_size;`
+
+ Number of data elements per case. This is the number of variables,
+ except that long string variables add extra data elements (one for
+ every 8 characters after the first 8). However, string variables
+ do not contribute to this value beyond the first 255 bytes.
+ Further, some software always writes -1 or 0 in this field. In
+ general, it is unsafe for systems reading system files to rely upon
+ this value.
+
+* `int32 compression;`
+
+ Set to 0 if the data in the file is not compressed, 1 if the data is
+ compressed with simple bytecode compression, 2 if the data is ZLIB
+ compressed. This field has value 2 if and only if `rec_type` is
+ `$FL3`.
+
+* `int32 weight_index;`
+
+ If one of the variables in the data set is used as a weighting
+ variable, set to the [dictionary index](#dictionary-index) of that
+ variable. Otherwise, set to 0.
+
+* `int32 ncases;`
+
+ Set to the number of cases in the file if it is known, or -1
+ otherwise.
+
+ In the general case it is not possible to determine the number of
+ cases that will be output to a system file at the time that the
+ header is written. The way that this is dealt with is by writing
+ the entire system file, including the header, then seeking back to
+ the beginning of the file and writing just the `ncases` field. For
+ files in which this is not valid, the seek operation fails. In
+ this case, `ncases` remains -1.
+
+* `flt64 bias;`
+
+ Compression bias, usually 100. Only integers between `1 - bias` and
+ `251 - bias` can be compressed.
+
+ By assuming that its value is 100, PSPP uses `bias` to determine the
+ file's floating-point format and endianness. If the compression
+ bias is not 100, PSPP cannot auto-detect the floating-point format
+ and assumes that it is IEEE 754 format with the same endianness as
+ the system file's integers, which is correct for all known system
+ files.
+
+* `char creation_date[9];`
+
+ Date of creation of the system file, in `dd mmm yy` format, with
+ the month as standard English abbreviations, using an initial
+ capital letter and following with lowercase. If the date is not
+ available then this field is arbitrarily set to `01 Jan 70`.
+
+* `char creation_time[8];`
+
+ Time of creation of the system file, in `hh:mm:ss` format and using
+ 24-hour time. If the time is not available then this field is
+ arbitrarily set to `00:00:00`.
+
+* `char file_label[64];`
+
+ File label declared by the user, if any. Padded on the right with
+ spaces.
+
+ A product that identifies itself as `VOXCO INTERVIEWER 4.3` uses
+ CR-only line ends in this field, rather than the more usual LF-only
+ or CR LF line ends.
+
+* `char padding[3];`
+
+ Ignored padding bytes to make the structure a multiple of 32 bits
+ in length. Set to zeros.
+
+## Variable Record
+
+There must be one variable record for each numeric variable and each
+string variable with width 8 bytes or less. String variables wider than
+8 bytes have one variable record for each 8 bytes, rounding up. The
+first variable record for a long string specifies the variable's correct
+dictionary information. Subsequent variable records for a long string
+are filled with dummy information: a type of -1, no variable label or
+missing values, print and write formats that are ignored, and an empty
+string as name. A few system files have been encountered that include a
+variable label on dummy variable records, so readers should take care to
+parse dummy variable records in the same way as other variable records.
+
+The "<a name="dictionary-index">dictionary index</a>" of a variable is
+a 1-based offset in the set of variable records, including dummy
+variable records for long string variables. The first variable record
+has a dictionary index of 1, the second has a dictionary index of 2,
+and so on.
+
+The system file format does not directly support string variables
+wider than 255 bytes. Such very long string variables are represented
+by a number of narrower string variables. See [very long string
+record](#very-long-string-record) for details.
+
+ A system file should contain at least one variable and thus at least
+one variable record, but system files have been observed in the wild
+without any variables (thus, no data either).
+
+```
+ int32 rec_type;
+ int32 type;
+ int32 has_var_label;
+ int32 n_missing_values;
+ int32 print;
+ int32 write;
+ char name[8];
+
+ /* Present only if `has_var_label` is 1. */
+ int32 label_len;
+ char label[];
+
+ /* Present only if `n_missing_values` is nonzero. */
+ flt64 missing_values[];
+```
+
+* `int32 rec_type;`
+
+ Record type code. Always set to 2.
+
+* `int32 type;`
+
+ Variable type code. Set to 0 for a numeric variable. For a short
+ string variable or the first part of a long string variable, this
+ is set to the width of the string. For the second and subsequent
+ parts of a long string variable, set to -1, and the remaining
+ fields in the structure are ignored.
+
+* `int32 has_var_label;`
+
+ If this variable has a variable label, set to 1; otherwise, set to
+ 0.
+
+* `int32 n_missing_values;`
+
+ If the variable has no missing values, set to 0. If the variable
+ has one, two, or three discrete missing values, set to 1, 2, or 3,
+ respectively. If the variable has a range for missing variables,
+ set to -2; if the variable has a range for missing variables plus a
+ single discrete value, set to -3.
+
+ A long string variable always has the value 0 here. A separate
+ record indicates [missing values for long string
+ variables](#long-string-missing-values-record).
+
+* `int32 print;`
+
+ Print format for this variable. See below.
+
+* `int32 write;`
+
+ Write format for this variable. See below.
+
+* `char name[8];`
+
+ Variable name. The variable name must begin with a capital letter
+ or the at-sign (`@`). Subsequent characters may also be digits,
+ octothorpes (`#`), dollar signs (`$`), underscores (`_`), or full
+ stops (`.`). The variable name is padded on the right with spaces.
+
+ The `name` fields should be unique within a system file. System
+ files written by SPSS that contain very long string variables with
+ similar names sometimes contain duplicate names that are later
+ eliminated by resolving the [very long string
+ names](#very-long-string-record). PSPP handles duplicates by
+ assigning them new, unique names.
+
+* `int32 label_len;`
+
+ This field is present only if `has_var_label` is set to 1. It is
+ set to the length, in characters, of the variable label. The
+ documented maximum length varies from 120 to 255 based on SPSS
+ version, but some files have been seen with longer labels. PSPP
+ accepts labels of any length.
+
+* `char label[];`
+
+ This field is present only if `has_var_label` is set to 1. It has
+ length `label_len`, rounded up to the nearest multiple of 32 bits.
+ The first `label_len` characters are the variable's variable label.
+
+* `flt64 missing_values[];`
+
+ This field is present only if `n_missing_values` is nonzero. It has
+ the same number of 8-byte elements as the absolute value of
+ `n_missing_values`. Each element is interpreted as a number for
+ numeric variables (with `HIGHEST` and `LOWEST` indicated as
+ described in the [introduction](#introduction)). For string
+ variables of width less than 8 bytes, elements are right-padded with
+ spaces.
+
+ For discrete missing values, each element represents one missing
+ value. When a range is present, the first element denotes the
+ minimum value in the range, and the second element denotes the
+ maximum value in the range. When a range plus a value are present,
+ the third element denotes the additional discrete missing value.
+
+### Format Types
+
+The `print` and `write` members of `sysfile_variable` are output
+formats coded into `int32` types. The least-significant byte of the
+`int32` represents the number of decimal places, and the next two bytes
+in order of increasing significance represent field width and format
+type, respectively. The most-significant byte is not used and should be
+set to zero.
+
+Format types are defined as follows:
+
+| Value | Meaning |
+|------:|:------------|
+| 0 | Not used. |
+| 1 | `A` |
+| 2 | `AHEX` |
+| 3 | `COMMA` |
+| 4 | `DOLLAR` |
+| 5 | `F` |
+| 6 | `IB` |
+| 7 | `PIBHEX` |
+| 8 | `P` |
+| 9 | `PIB` |
+| 10 | `PK` |
+| 11 | `RB` |
+| 12 | `RBHEX` |
+| 13 | Not used. |
+| 14 | Not used. |
+| 15 | `Z` |
+| 16 | `N` |
+| 17 | `E` |
+| 18 | Not used. |
+| 19 | Not used. |
+| 20 | `DATE` |
+| 21 | `TIME` |
+| 22 | `DATETIME` |
+| 23 | `ADATE` |
+| 24 | `JDATE` |
+| 25 | `DTIME` |
+| 26 | `WKDAY` |
+| 27 | `MONTH` |
+| 28 | `MOYR` |
+| 29 | `QYR` |
+| 30 | `WKYR` |
+| 31 | `PCT` |
+| 32 | `DOT` |
+| 33 | `CCA` |
+| 34 | `CCB` |
+| 35 | `CCC` |
+| 36 | `CCD` |
+| 37 | `CCE` |
+| 38 | `EDATE` |
+| 39 | `SDATE` |
+| 40 | `MTIME` |
+| 41 | `YMDHMS` |
+
+A few system files have been observed in the wild with invalid
+`write` fields, in particular with value 0. Readers should probably
+treat invalid `print` or `write` fields as some default format.
+
+### Obsolete Treatment of Long String Missing Values
+
+SPSS and most versions of PSPP write missing values for string
+variables wider than 8 bytes with a [Long String Missing Values
+Record](#long-string-missing-values-record). Very old versions of
+PSPP instead wrote these missing values on the variables record,
+writing only the first 8 bytes of each missing value, with the
+remainder implicitly all spaces. Any new software should use the
+[Long String Missing Values
+Record](#long-string-missing-values-record), but it might possibly be
+worthwhile also to accept the old format used by PSPP.
+
+## Value Labels Records
+
+The value label records documented in this section are used for
+numeric and short string variables only. Long string variables may
+have value labels, but their value labels are recorded using a
+[different record type](#long-string-value-labels-record).
+
+[ReadStat](#file-header-record) writes value labels that label a
+single value more than once. In more detail, it emits value labels
+whose values are longer than string variables' widths, that are
+identical in the actual width of the variable, e.g. labels for values
+`ABC123` and `ABC456` for a string variable with width 3. For files
+written by this software, PSPP ignores such labels.
+
+### Value Label Record for Labels
+
+The value label record has the following format:
+
+```
+ int32 rec_type;
+ int32 label_count;
+
+ /* Repeated `n_label` times. */
+ char value[8];
+ char label_len;
+ char label[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 3.
+
+* `int32 label_count;`
+
+ Number of value labels present in this record.
+
+The remaining fields are repeated `count` times. Each repetition
+specifies one value label.
+
+* `char value[8];`
+
+ A numeric value or a short string value padded as necessary to 8
+ bytes in length. Its type and width cannot be determined until the
+ following value label variables record (see below) is read.
+
+* `char label_len;`
+
+ The label's length, in bytes. The documented maximum length varies
+ from 60 to 120 based on SPSS version. PSPP supports value labels
+ up to 255 bytes long.
+
+* `char label[];`
+
+ `label_len` bytes of the actual label, followed by up to 7 bytes of
+ padding to bring `label` and `label_len` together to a multiple of
+ 8 bytes in length.
+
+### Value Label Record for Variables
+
+The value label record is always immediately followed by a value
+label variables record with the following format:
+
+```
+ int32 rec_type;
+ int32 var_count;
+ int32 vars[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 4.
+
+* `int32 var_count;`
+
+ Number of variables that the associated value labels from the value
+ label record are to be applied.
+
+* `int32 vars[];`
+
+ A list of 1-based [dictionary indexes](#dictionary-index) of
+ variables to which to apply the value labels. There are `var_count`
+ elements.
+
+ String variables wider than 8 bytes may not be specified in this
+ list.
+
+## Document Record
+
+The document record, if present, has the following format:
+
+```
+ int32 rec_type;
+ int32 n_lines;
+ char lines[][80];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 6.
+
+* `int32 n_lines;`
+
+ Number of lines of documents present. This should be greater than
+ zero, but ReadStats writes system files with zero `n_lines`.
+
+* `char lines[][80];`
+
+ Document lines. The number of elements is defined by `n_lines`.
+ Lines shorter than 80 characters are padded on the right with
+ spaces.
+
+## Machine Integer Info Record
+
+The integer info record, if present, has the following format:
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Data. */
+ int32 version_major;
+ int32 version_minor;
+ int32 version_revision;
+ int32 machine_code;
+ int32 floating_point_rep;
+ int32 compression_code;
+ int32 endianness;
+ int32 character_code;
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 3.
+
+* `int32 size;`
+
+ Size of each piece of data in the data part, in bytes. Always set
+ to 4.
+
+* `int32 count;`
+
+ Number of pieces of data in the data part. Always set to 8.
+
+* `int32 version_major;`
+
+ PSPP major version number. In version X.Y.Z, this is X.
+
+* `int32 version_minor;`
+
+ PSPP minor version number. In version X.Y.Z, this is Y.
+
+* `int32 version_revision;`
+
+ PSPP version revision number. In version X.Y.Z, this is Z.
+
+* `int32 machine_code;`
+
+ Machine code. PSPP always set this field to value to -1, but other
+ values may appear.
+
+* `int32 floating_point_rep;`
+
+ Floating point representation code. For IEEE 754 systems (the most
+ common) this is 1. IBM 370 is supposed to set this to 2, and DEC
+ VAX E to 3, but neither of these has been observed.
+
+* `int32 compression_code;`
+
+ Compression code. Always set to 1, regardless of whether or how
+ the file is compressed.
+
+* `int32 endianness;`
+
+ Machine endianness. 1 indicates big-endian, 2 indicates
+ little-endian.
+
+* `int32 character_code;`
+
+ <a name="character-code">Character code</a>. The following values
+ have been actually observed in system files:
+
+ - 1
+
+ EBCDIC. Only one example has been observed.
+
+ - 2
+
+ 7-bit ASCII. Old versions of SPSS for Unix and Windows always
+ wrote value 2 in this field, regardless of the encoding in
+ use, so it is not reliable and should be ignored.
+
+ - 3
+
+ 8-bit "ASCII".
+
+ - 819
+
+ ISO 8859-1 (IBM AIX code page number).
+
+ - 874
+ 9066
+
+ The `windows-874` code page for Thai.
+
+ - 932
+
+ The `windows-932` code page for Japanese (aka `Shift_JIS`).
+
+ - 936
+
+ The `windows-936` code page for simplified Chinese (aka `GBK`).
+
+ - 949
+
+ Probably `ks_c_5601-1987`, Unified Hangul Code.
+
+ - 950
+
+ The `big5` code page for traditional Chinese.
+
+ - 1250
+
+ The `windows-1250` code page for Central European and Eastern
+ European languages.
+
+ - 1251
+
+ The `windows-1251` code page for Cyrillic languages.
+
+ - 1252
+
+ The `windows-1252` code page for Western European languages.
+
+ - 1253
+
+ The `windows-1253` code page for modern Greek.
+
+ - 1254
+
+ The `windows-1254` code page for Turkish.
+
+ - 1255
+
+ The `windows-1255` code page for Hebrew.
+
+ - 1256
+
+ The `windows-1256` code page for Arabic script.
+
+ - 1257
+
+ The `windows-1257` code page for Estonian, Latvian, and
+ Lithuanian.
+
+ - 1258
+
+ The `windows-1258` code page for Vietnamese.
+
+ - 20127
+
+ US-ASCII.
+
+ - 28591
+
+ ISO 8859-1 (Latin-1).
+
+ - 25592
+
+ ISO 8859-2 (Central European).
+
+ - 28605
+
+ ISO 8895-9 (Latin-9).
+
+ - 51949
+
+ The `euc-kr` code page for Korean.
+
+ - 65001
+
+ UTF-8.
+
+ The following additional values are known to be defined:
+
+ - 3
+
+ 8-bit "ASCII".
+
+ - 4
+
+ DEC Kanji.
+
+ The most common values observed, from most to least common, are
+ 1252, 65001, 2, and 28591.
+
+ Other Windows code page numbers are known to be generally valid.
+
+ Newer versions also write the character encoding [as a
+ string](#character-encoding-record).
+
+## Machine Floating-Point Info Record
+
+The floating-point info record, if present, has the following format:
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Data. */
+ flt64 sysmis;
+ flt64 highest;
+ flt64 lowest;
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 4.
+
+* `int32 size;`
+
+ Size of each piece of data in the data part, in bytes. Always set
+ to 8.
+
+* `int32 count;`
+
+ Number of pieces of data in the data part. Always set to 3.
+
+* `flt64 sysmis;`
+ `flt64 highest;`
+ `flt64 lowest;`
+
+ The system missing value, the value used for `HIGHEST` in missing
+ values, and the value used for `LOWEST` in missing values,
+ respectively. See the [introduction](#introduction) for more
+ information.
+
+ The SPSSWriter library in PHP, which identifies itself as `FOM SPSS
+ 1.0.0` in the file header record `prod_name` field, writes
+ unexpected values to these fields, but it uses the same values
+ consistently throughout the rest of the file.
+
+## Multiple Response Sets Records
+
+The system file format has two different types of records that
+represent multiple response sets. The first type of record describes
+multiple response sets that can be understood by SPSS before
+version 14. The second type of record, with a closely related format,
+is used for multiple dichotomy sets that use the
+`CATEGORYLABELS=COUNTEDVALUES` feature added in version 14.
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char mrsets[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Set to 7 for records that describe multiple
+ response sets understood by SPSS before version 14, or to 19 for
+ records that describe dichotomy sets that use the
+ `CATEGORYLABELS=COUNTEDVALUES` feature added in version 14.
+
+* `int32 size;`
+
+ The size of each element in the `mrsets` member. Always set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `mrsets`.
+
+* `char mrsets[];`
+
+ Zero or more line feeds (byte 0x0a), followed by a series of
+ multiple response sets, each of which consists of the following:
+
+ - The set's name (an identifier that begins with `$`), in mixed
+ upper and lower case.
+
+ - An equals sign (`=`).
+
+ - `C` for a multiple category set, `D` for a multiple dichotomy
+ set with `CATEGORYLABELS=VARLABELS`, or `E` for a multiple
+ dichotomy set with `CATEGORYLABELS=COUNTEDVALUES`.
+
+ - For a multiple dichotomy set with
+ `CATEGORYLABELS=COUNTEDVALUES`, a space, followed by a number
+ expressed as decimal digits, followed by a space. If
+ `LABELSOURCE=VARLABEL` was specified on MRSETS, then the number
+ is 11; otherwise it is 1.[^note]
+
+ - For either kind of multiple dichotomy set, the counted value,
+ as a positive integer count specified as decimal digits,
+ followed by a space, followed by as many string bytes as
+ specified in the count. If the set contains numeric
+ variables, the string consists of the counted integer value
+ expressed as decimal digits. If the set contains string
+ variables, the string contains the counted string value.
+ Either way, the string may be padded on the right with spaces
+ (older versions of SPSS seem to always pad to a width of 8
+ bytes; newer versions don't).
+
+ - A space.
+
+ - The multiple response set's label, using the same format as
+ for the counted value for multiple dichotomy sets. A string
+ of length 0 means that the set does not have a label. A
+ string of length 0 is also written if LABELSOURCE=VARLABEL was
+ specified.
+
+ - The short names of the variables in the set, converted to
+ lowercase, each preceded by a single space.
+
+ Even though a multiple response set must have at least two
+ variables, some system files contain multiple response sets
+ with no variables or one variable. The source and meaning of
+ these multiple response sets is unknown. (Perhaps they arise
+ from creating a multiple response set then deleting all the
+ variables that it contains?)
+
+ - One line feed (byte 0x0a). Sometimes multiple, even hundreds,
+ of line feeds are present.
+
+Example: Given appropriate variable definitions, consider the
+following MRSETS command:
+
+```
+MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
+ /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
+ /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
+ /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
+ VARIABLES=k l m VALUE=34
+ /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
+ VARIABLES=n o p VALUE='choice'.
+```
+
+The above would generate the following multiple response set record
+of subtype 7:
+
+```
+$a=C 10 my mcgroup a b c
+$b=D2 55 0 g e f d
+$c=D3 Yes 10 mdgroup #2 h i j
+```
+
+It would also generate the following multiple response set record
+with subtype 19:
+
+```
+$d=E 1 2 34 13 third mdgroup k l m
+$e=E 11 6 choice 0 n o p
+```
+
+[^note]: This part of the format may not be fully understood, because
+ only a single example of each possibility has been examined.
+
+## Extra Product Info Record
+
+This optional record appears to contain a text string that describes
+the program that wrote the file and the source of the data. (This is
+redundant with the file label and product info found in the [file
+header record](#file-header-record).)
+
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char info[];
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 10.
+
+* `int32 size;`
+
+ The size of each element in the `info` member. Always set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `info`.
+
+* `char info[];`
+
+ A text string. A product that identifies itself as `VOXCO
+ INTERVIEWER 4.3` uses CR-only line ends in this field, rather than
+ the more usual LF-only or CR LF line ends.
+
+## Variable Display Parameter Record
+
+The variable display parameter record, if present, has the following
+format:
+
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Repeated `count` times. */
+ int32 measure;
+ int32 width; /* Not always present. */
+ int32 alignment;
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 11.
+
+* `int32 size;`
+
+ The size of `int32`. Always set to 4.
+
+* `int32 count;`
+
+ The number of sets of variable display parameters (ordinarily the
+ number of variables in the dictionary), times 2 or 3.
+
+The remaining members are repeated `count` times, in the same order as
+the variable records. No element corresponds to variable records that
+continue long string variables. The meanings of these members are as
+follows:
+
+* `int32 measure;`
+
+ The measurement level of the variable:
+
+ | Value | Level |
+ |------:|:---------|
+ | 0 | Unknown |
+ | 1 | Nominal |
+ | 2 | Ordinal |
+ | 3 | Scale |
+
+ An "unknown" `measure` of 0 means that the variable was created in
+ some way that doesn't make the measurement level clear, e.g. with a
+ `COMPUTE` transformation. PSPP sets the measurement level the first
+ time it reads the data, so this should rarely appear.
+
+* `int32 width;`
+
+ The width of the display column for the variable in characters.
+
+ This field is present if `count` is 3 times the number of variables
+ in the dictionary. It is omitted if `count` is 2 times the number of
+ variables.
+
+* `int32 alignment;`
+
+ The alignment of the variable for display purposes:
+
+ | Value | Alignment |
+ |------:|:---------------|
+ | 0 | Left aligned |
+ | 1 | Right aligned |
+ | 2 | Centre aligned |
+
+## Variable Sets Record
+
+The SPSS GUI offers users the ability to arrange variables in sets.
+Users may enable and disable sets individually, and the data editor and
+analysis dialog boxes only show enabled sets. Syntax does not use
+variable sets.
+
+ The variable sets record, if present, has the following format:
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of text. */
+ char text[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 5.
+
+* `int32 size;`
+
+ Always set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `text`.
+
+* `char text[];`
+
+ The variable sets, in a text-based format.
+
+ Each variable set occupies one line of text, each of which ends
+ with a line feed (byte 0x0a), optionally preceded by a carriage
+ return (byte 0x0d).
+
+ Each line begins with the name of the variable set, followed by an
+ equals sign (`=`) and a space (byte 0x20), followed by the long
+ variable names of the members of the set, separated by spaces. A
+ variable set may be empty, in which case the equals sign and the
+ space following it are still present.
+
+## Long Variable Names Record
+
+If present, the long variable names record has the following format:
+
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char var_name_pairs[];
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 13.
+
+* `int32 size;`
+
+ The size of each element in the `var_name_pairs` member. Always
+ set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `var_name_pairs`.
+
+* `char var_name_pairs[];`
+
+ A list of key-value tuples, where each key is the name of a
+ variable, and the value is its long variable name. The key field is
+ at most 8 bytes long and must match the name of a variable which
+ appears in the [variable record](#variable-record). The value
+ field is at most 64 bytes long. The key and value fields are
+ separated by a `=` byte. Each tuple is separated by a byte whose
+ value is 09. There is no trailing separator following the last
+ tuple. The total length is `count` bytes.
+
+## Very Long String Record
+
+Old versions of SPSS limited string variables to a width of 255 bytes.
+For backward compatibility with these older versions, the system file
+format represents a string longer than 255 bytes, called a “very long
+string”, as a collection of strings no longer than 255 bytes each. The
+strings concatenated to make a very long string are called its
+“segments”; for consistency, variables other than very long strings are
+considered to have a single segment.
+
+A very long string with a width of `w` has `n = (w + 251) / 252`
+segments, that is, one segment for every 252 bytes of width, rounding
+up. It would be logical, then, for each of the segments except the
+last to have a width of 252 and the last segment to have the
+remainder, but this is not the case. In fact, each segment except the
+last has a width of 255 bytes. The last has width `w - (n - 1) *
+252`; some versions of SPSS make it slightly wider, but not wide
+enough to make the last segment require another 8 bytes of data.
+
+Data is packed tightly into segments of a very long string, 255 bytes
+per segment. Because 255 bytes of segment data are allocated for
+every 252 bytes of the very long string's width (approximately), some
+unused space is left over at the end of the allocated segments. Data
+in unused space is ignored.
+
+> Example: Consider a very long string of width 20,000. Such a very
+long string has 20,000 / 252 = 80 (rounding up) segments. The first
+79 segments have width 255; the last segment has width 20,000 - 79 *
+252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
+The very long string's data is actually stored in the 19,890 bytes in
+the first 78 segments, plus the first 110 bytes of the 79th segment
+(19,890 + 110 = 20,000). The remaining 145 bytes of the 79th segment
+and all 92 bytes of the 80th segment are unused.
+
+The very long string record explains how to stitch together segments
+to obtain very long string data. For each of the very long string
+variables in the dictionary, it specifies the name of its first
+segment's variable and the very long string variable's actual width.
+The remaining segments immediately follow the named variable in the
+system file's dictionary.
+
+The very long string record, which is present only if the system file
+contains very long string variables, has the following format:
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char string_lengths[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 14.
+
+* `int32 size;`
+
+ The size of each element in the `string_lengths` member. Always
+ set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `string_lengths`.
+
+* `char string_lengths[];`
+
+ a list of key-value tuples, where key is the name of a variable, and
+ value is its length. the key field is at most 8 bytes long and must
+ match the name of a variable which appears in the [variable
+ record](#variable-record). the value field is exactly 5 bytes long.
+ it is a zero-padded, ASCII-encoded string that is the length of the
+ variable. the key and value fields are separated by a `=` byte.
+ tuples are delimited by a two-byte sequence {00, 09}. After the
+ last tuple, there may be a single byte 00, or {00, 09}. The total
+ length is `count` bytes.
+
+## Character Encoding Record
+
+This record, if present, indicates the character encoding for string
+data, long variable names, variable labels, value labels and other
+strings in the file.
+
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char encoding[];
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 20.
+
+* `int32 size;`
+
+ The size of each element in the `encoding` member. Always set to 1.
+
+* `int32 count;`
+
+ The total number of bytes in `encoding`.
+
+* `char encoding[];`
+
+ The name of the character encoding. Normally this will be an
+ [official IANA character set name or
+ alias](http://www.iana.org/assignments/character-sets). Character
+ set names are not case-sensitive, and SPSS is not consistent,
+ e.g. both `windows-1251` and `WINDOWS-1252` have both been observed,
+ as have `Big5` and `BIG5`.
+
+This record is not present in files generated by older software. See
+also `character_code` in the [machine integer info
+record](#machine-integer-info-record).
+
+The following character encoding names have been observed. The names
+are shown in lowercase, even though they were not always in lowercase in
+the file. Alternative names for the same encoding are, when known,
+listed together. For each encoding, the `character_code` values that
+they were observed paired with are also listed. First, the following
+are strictly single-byte, ASCII-compatible encodings:
+
+* (encoding record missing)
+
+ 0, 2, 3, 874, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 20127,
+ 28591, 28592, 28605
+
+* `ansi_x3.4-1968`
+ `ascii`
+
+ 1252
+
+* `cp28605`
+
+ 2
+
+* `cp874`
+
+ 9066
+
+* `iso-8859-1`
+
+ 819
+
+* `windows-874`
+
+ 874
+
+* `windows-1250`
+
+ 2, 1250, 1252
+
+* `windows-1251`
+
+ 2, 1251
+
+* `cp1252`
+ `windows-1252`
+
+ 2, 1250, 1252, 1253
+
+* `cp1253`
+ `windows-1253`
+
+ 1253
+
+* `windows-1254`
+
+ 2, 1254
+
+* `windows-1255`
+
+ 2, 1255
+
+* `windows-1256`
+
+ 2, 1252, 1256
+
+* `windows-1257`
+
+ 2, 1257
+
+* `windows-1258`
+
+ 1258
+
+The others are multibyte encodings, in which some code points occupy
+a single byte and others multiple bytes. The following multibyte
+encodings are "ASCII compatible," that is, they use ASCII values only to
+indicate ASCII:
+
+* (encoding record missing)
+
+ 65001, 949
+
+* `euc-kr`
+
+ 2, 51949
+
+* `utf-8`
+
+ 0, 2, 1250, 1251, 1252, 1256, 65001
+
+The following multibyte encodings are not ASCII compatible, that is,
+while they encode ASCII characters as their native values, they also use
+ASCII values as second or later bytes in multibyte sequences:
+
+* (encoding record missing)
+
+ 932, 936, 950
+
+* `big5`
+ `cp950`
+
+ 2, 950
+
+* `gbk`
+
+ 936
+
+* `cp932`
+ `windows-31j`
+
+ 932
+
+As the tables above show, when the character encoding record and the
+machine integer info record are both present, they can contradict each
+other. Observations show that, in this case, the character encoding
+record should be honored.
+
+If, for testing purposes, a file is crafted with different
+`character_code` and `encoding`, it seems that `character_code`
+controls the encoding for all strings in the system file before the
+dictionary termination record, including strings in data (e.g. string
+missing values), and `encoding` controls the encoding for strings
+following the dictionary termination record.
+
+## Long String Value Labels Record
+
+This record, if present, specifies value labels for long string
+variables.
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Repeated up to exactly `count` bytes. */
+ int32 var_name_len;
+ char var_name[];
+ int32 var_width;
+ int32 n_labels;
+ long_string_label labels[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 21.
+
+* `int32 size;`
+
+ Always set to 1.
+
+* `int32 count;`
+
+ The number of bytes following the header until the next header.
+
+* `int32 var_name_len;`
+ `char var_name[];`
+
+ The number of bytes in the name of the variable that has long
+ string value labels, plus the variable name itself, which consists
+ of exactly `var_name_len` bytes. The variable name is not padded
+ to any particular boundary, nor is it null-terminated.
+
+* `int32 var_width;`
+
+ The width of the variable, in bytes, which will be between 9 and
+ 32767.
+
+* `int32 n_labels;`
+ `long_string_label labels[];`
+
+ The long string labels themselves. The `labels` array contains
+ exactly `n_labels` elements, each of which has the following
+ substructure:
+
+ ```
+ int32 value_len;
+ char value[];
+ int32 label_len;
+ char label[];
+ ```
+
+ - `int32 value_len;`
+ `char value[];`
+
+ The string value being labeled. `value_len` is the number of
+ bytes in `value`; it is equal to `var_width`. The `value`
+ array is not padded or null-terminated.
+
+ - `int32 label_len;`
+ `char label[];`
+
+ The label for the string value. `label_len`, which must be
+ between 0 and 120, is the number of bytes in `label`. The
+ `label` array is not padded or null-terminated.
+
+## Long String Missing Values Record
+
+This record, if present, specifies missing values for long string
+variables.
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Repeated up to exactly `count` bytes. */
+ int32 var_name_len;
+ char var_name[];
+ char n_missing_values;
+ int32 value_len;
+ char values[value_len * n_missing_values];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 22.
+
+* `int32 size;`
+
+ Always set to 1.
+
+* `int32 count;`
+
+ The number of bytes following the header until the next header.
+
+* `int32 var_name_len;`
+ `char var_name[];`
+
+ The number of bytes in the name of the long string variable that
+ has missing values, plus the variable name itself, which consists
+ of exactly `var_name_len` bytes. The variable name is not padded
+ to any particular boundary, nor is it null-terminated.
+
+* `char n_missing_values;`
+
+ The number of missing values, either 1, 2, or 3. (This is,
+ unusually, a single byte instead of a 32-bit number.)
+
+* `int32 value_len;`
+
+ The length of each missing value string, in bytes. This value
+ should be 8, because long string variables are at least 8 bytes
+ wide (by definition), only the first 8 bytes of a long string
+ variable's missing values are allowed to be non-spaces, and any
+ spaces within the first 8 bytes are included in the missing value
+ here.
+
+* `char values[value_len * n_missing_values]`
+
+ The missing values themselves, without any padding or null
+ terminators.
+
+An earlier version of this document stated that `value_len` was
+repeated before each of the missing values, so that there was an extra
+`int32` value of 8 before each missing value after the first. Old
+versions of PSPP wrote data files in this format. Readers can tolerate
+this mistake, if they wish, by noticing and skipping the extra `int32`
+values, which wouldn't ordinarily occur in strings.
+
+## Data File and Variable Attributes Records
+
+The data file and variable attributes records represent custom
+attributes for the system file or for individual variables in the
+system file, as defined on the `DATAFILE ATTRIBUTE` and `VARIABLE
+ATTRIBUTE` commands, respectively.
+
+```
+ /* Header. */
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+
+ /* Exactly `count` bytes of data. */
+ char attributes[];
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 17 for a data file attribute record
+ or to 18 for a variable attributes record.
+
+* `int32 size;`
+
+ The size of each element in the `attributes` member. Always set to
+ value 1.
+
+* `int32 count;`
+
+ The total number of bytes in `attributes`.
+
+* `char attributes[];`
+
+ The attributes, in a text-based format.
+
+ In record subtype 17, this field contains a single attribute set.
+ An attribute set is a sequence of one or more attributes
+ concatenated together. Each attribute consists of a name, which
+ has the same syntax as a variable name, followed by, inside
+ parentheses, a sequence of one or more values. Each value consists
+ of a string enclosed in single quotes (`'`) followed by a line feed
+ (byte 0x0a). A value may contain single quote characters, which
+ are not themselves escaped or quoted or required to be present in
+ pairs. There is no apparent way to embed a line feed in a value.
+ There is no distinction between an attribute with a single value
+ and an attribute array with one element.
+
+ In record subtype 18, this field contains a sequence of one or more
+ variable attribute sets. If more than one variable attribute set
+ is present, each one after the first is delimited from the previous
+ by `/`. Each variable attribute set consists of a long variable
+ name, followed by `:`, followed by an attribute set with the same
+ syntax as on record subtype 17.
+
+ System files written by `Stata 14.1/-savespss- 1.77 by S.Radyakin`
+ may include multiple records with subtype 18, one per variable that
+ has variable attributes.
+
+ The total length is `count` bytes.
+
+### Example
+
+A system file produced with the following VARIABLE ATTRIBUTE commands in
+effect:
+
+```
+VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=fred[1]('23') fred[2]('34').
+VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
+```
+
+will contain a variable attribute record with the following contents:
+
+```
+0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...|
+0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.|
+0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'|
+0030 0a 29 |.) |
+```
+
+### Variable Roles
+
+A variable's role is represented as an attribute named `$@Role`. This
+attribute has a single element whose values and their meanings are:
+
+|Value|Role |
+|----:|:----------|
+| 0 | Input |
+| 1 | Target |
+| 2 | Both |
+| 3 | None |
+| 4 | Partition |
+| 5 | Split |
+
+The default and most common role is 0 (input).
+
+## Extended Number of Cases Record
+
+`ncases` in the [file header record](#file-header-record) expresses
+the number of cases in the system file as an `int32`. This record
+allows the number of cases in the system file to be expressed as a
+64-bit number.
+
+```
+ int32 rec_type;
+ int32 subtype;
+ int32 size;
+ int32 count;
+ int64 unknown;
+ int64 ncases64;
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 7.
+
+* `int32 subtype;`
+
+ Record subtype. Always set to 16.
+
+* `int32 size;`
+
+ Size of each element. Always set to 8.
+
+* `int32 count;`
+
+ Number of pieces of data in the data part. Alway set to 2.
+
+* `int64 unknown;`
+
+ Meaning unknown. Always set to 1.
+
+* `int64 ncases64;`
+
+ Number of cases in the file as a 64-bit integer. Presumably this
+ could be -1 to indicate that the number of cases is unknown, for
+ the same reason as `ncases` in the file header record, but this has
+ not been observed in the wild.
+
+## Other Informational Records
+
+This chapter documents many specific types of extension records are
+documented here, but others are known to exist. PSPP ignores unknown
+extension records when reading system files.
+
+ The following extension record subtypes have also been observed, with
+the following believed meanings:
+
+* 6
+
+ Date info, probably related to USE (according to Aapi Hämäläinen).
+
+* 12
+
+ A UUID in the format described in RFC 4122. Only two examples
+ observed, both written by SPSS 13, and in each case the UUID
+ contained both upper and lower case.
+
+* 24
+
+ XML that describes how data in the file should be displayed
+ on-screen.
+
+## Dictionary Termination Record
+
+The dictionary termination record separates all other records from the
+data records.
+
+```
+ int32 rec_type;
+ int32 filler;
+```
+
+* `int32 rec_type;`
+
+ Record type. Always set to 999.
+
+* `int32 filler;`
+
+ Ignored padding. Should be set to 0.
+
+## Data Record
+
+The data record must follow all other records in the system file. Every
+system file must have a data record that specifies data for at least one
+case. The format of the data record varies depending on the value of
+`compression` in the file header record:
+
+* 0: no compression
+
+ Data is arranged as a series of 8-byte elements. Each element
+ corresponds to the variable declared in the respective variable
+ record (*note Variable Record::). Numeric values are given in
+ `flt64` format; string values are literal characters string, padded
+ on the right when necessary to fill out 8-byte units.
+
+* 1: bytecode compression
+
+ The first 8 bytes of the data record is divided into a series of
+ 1-byte command codes. These codes have meanings as described
+ below:
+
+ - 0
+
+ Ignored. If the program writing the system file accumulates
+ compressed data in blocks of fixed length, 0 bytes can be used
+ to pad out extra bytes remaining at the end of a fixed-size
+ block.
+
+ - 1 through 251
+
+ A number with value `code - bias`, where `code` is the value of
+ the compression code and `bias` comes from the file header. For
+ example, code 105 with bias 100.0 (the normal value) indicates a
+ numeric variable of value 5.
+
+ A code of 0 (after subtracting the bias) in a string field encodes
+ null bytes. This is unusual, since a string field normally
+ encodes text data, but it exists in real system files.
+
+ - 252
+
+ End of file. This code may or may not appear at the end of the
+ data stream. PSPP always outputs this code but its use is not
+ required.
+
+ - 253
+
+ A numeric or string value that is not compressible. The value is
+ stored in the 8 bytes following the current block of command
+ bytes. If this value appears twice in a block of command bytes,
+ then it indicates the second group of 8 bytes following the
+ command bytes, and so on.
+
+ - 254
+
+ An 8-byte string value that is all spaces.
+
+ - 255
+
+ The system-missing value.
+
+ The end of the 8-byte group of bytecodes is followed by any 8-byte
+ blocks of non-compressible values indicated by code 253. After that
+ follows another 8-byte group of bytecodes, then those bytecodes'
+ non-compressible values. The pattern repeats to the end of the file
+ or a code with value 252.
+
+* 2: ZLIB compression
+
+ The data record consists of the following, in order:
+
+ - ZLIB data header, 24 bytes long.
+
+ - One or more variable-length blocks of ZLIB compressed data.
+
+ - ZLIB data trailer, with a 24-byte fixed header plus an
+ additional 24 bytes for each preceding ZLIB compressed data
+ block.
+
+ The ZLIB data header has the following format:
+
+ ```
+ int64 zheader_ofs;
+ int64 ztrailer_ofs;
+ int64 ztrailer_len;
+ ```
+
+ - `int64 zheader_ofs;`
+
+ The offset, in bytes, of the beginning of this structure
+ within the system file.
+
+ - `int64 ztrailer_ofs;`
+
+ The offset, in bytes, of the first byte of the ZLIB data
+ trailer.
+
+ - `int64 ztrailer_len;`
+
+ The number of bytes in the ZLIB data trailer. This and the
+ previous field sum to the size of the system file in bytes.
+
+ The data header is followed by `(ztrailer_len - 24) / 24` ZLIB
+ compressed data blocks. Each ZLIB compressed data block begins
+ with a ZLIB header as specified in RFC 1950, e.g. hex bytes `78 01`
+ (the only header yet observed in practice). Each block
+ decompresses to a fixed number of bytes (in practice only
+ `0x3ff000`-byte blocks have been observed), except that the last
+ block of data may be shorter. The last ZLIB compressed data block
+ gends just before offset `ztrailer_ofs`.
+
+ The result of ZLIB decompression is bytecode compressed data as
+ described above for compression format 1.
+
+ The ZLIB data trailer begins with the following 24-byte fixed
+ header:
+
+ ```
+ int64 bias;
+ int64 zero;
+ int32 block_size;
+ int32 n_blocks;
+ ```
+
+ - `int64 int_bias;`
+
+ The compression bias as a negative integer, e.g. if `bias` in
+ the file header record is 100.0, then `int_bias` is −100 (this
+ is the only value yet observed in practice).
+
+ - `int64 zero;`
+
+ Always observed to be zero.
+
+ - `int32 block_size;`
+
+ The number of bytes in each ZLIB compressed data block, except
+ possibly the last, following decompression. Only `0x3ff000`
+ has been observed so far.
+
+ - `int32 n_blocks;`
+
+ The number of ZLIB compressed data blocks, always exactly
+ `(ztrailer_len - 24) / 24`.
+
+ The fixed header is followed by `n_blocks` 24-byte ZLIB data block
+ descriptors, each of which describes the compressed data block
+ corresponding to its offset. Each block descriptor has the
+ following format:
+
+ ```
+ int64 uncompressed_ofs;
+ int64 compressed_ofs;
+ int32 uncompressed_size;
+ int32 compressed_size;
+ ```
+
+ - `int64 uncompressed_ofs;`
+
+ The offset, in bytes, that this block of data would have in a
+ similar system file that uses compression format 1. This is
+ `zheader_ofs` in the first block descriptor, and in each
+ succeeding block descriptor it is the sum of the previous
+ desciptor's `uncompressed_ofs` and `uncompressed_size`.
+
+ - `int64 compressed_ofs;`
+
+ The offset, in bytes, of the actual beginning of this
+ compressed data block. This is `zheader_ofs + 24` in the
+ first block descriptor, and in each succeeding block
+ descriptor it is the sum of the previous descriptor's
+ `compressed_ofs` and `compressed_size`. The final block
+ descriptor's `compressed_ofs` and `compressed_size` sum to
+ `ztrailer_ofs`.
+
+ - `int32 uncompressed_size;`
+
+ The number of bytes in this data block, after decompression.
+ This is `block_size` in every data block except the last,
+ which may be smaller.
+
+ - `int32 compressed_size;`
+
+ The number of bytes in this data block, as stored compressed
+ in this system file.
+
+++ /dev/null
-# Character Encoding Record
-
-This record, if present, indicates the character encoding for string
-data, long variable names, variable labels, value labels and other
-strings in the file.
-
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char encoding[];
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 20.
-
-* `int32 size;`
-
- The size of each element in the `encoding` member. Always set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `encoding`.
-
-* `char encoding[];`
-
- The name of the character encoding. Normally this will be an
- [official IANA character set name or
- alias](http://www.iana.org/assignments/character-sets). Character
- set names are not case-sensitive, and SPSS is not consistent,
- e.g. both `windows-1251` and `WINDOWS-1252` have both been observed,
- as have `Big5` and `BIG5`.
-
-This record is not present in files generated by older software. See
-also the [`character_code` field in the machine integer info
-record](machine-integer-info-record.md#character-code).
-
-The following character encoding names have been observed. The names
-are shown in lowercase, even though they were not always in lowercase in
-the file. Alternative names for the same encoding are, when known,
-listed together. For each encoding, the `character_code` values that
-they were observed paired with are also listed. First, the following
-are strictly single-byte, ASCII-compatible encodings:
-
-* (encoding record missing)
-
- 0, 2, 3, 874, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 20127,
- 28591, 28592, 28605
-
-* `ansi_x3.4-1968`
- `ascii`
-
- 1252
-
-* `cp28605`
-
- 2
-
-* `cp874`
-
- 9066
-
-* `iso-8859-1`
-
- 819
-
-* `windows-874`
-
- 874
-
-* `windows-1250`
-
- 2, 1250, 1252
-
-* `windows-1251`
-
- 2, 1251
-
-* `cp1252`
- `windows-1252`
-
- 2, 1250, 1252, 1253
-
-* `cp1253`
- `windows-1253`
-
- 1253
-
-* `windows-1254`
-
- 2, 1254
-
-* `windows-1255`
-
- 2, 1255
-
-* `windows-1256`
-
- 2, 1252, 1256
-
-* `windows-1257`
-
- 2, 1257
-
-* `windows-1258`
-
- 1258
-
-The others are multibyte encodings, in which some code points occupy
-a single byte and others multiple bytes. The following multibyte
-encodings are "ASCII compatible," that is, they use ASCII values only to
-indicate ASCII:
-
-* (encoding record missing)
-
- 65001, 949
-
-* `euc-kr`
-
- 2, 51949
-
-* `utf-8`
-
- 0, 2, 1250, 1251, 1252, 1256, 65001
-
-The following multibyte encodings are not ASCII compatible, that is,
-while they encode ASCII characters as their native values, they also use
-ASCII values as second or later bytes in multibyte sequences:
-
-* (encoding record missing)
-
- 932, 936, 950
-
-* `big5`
- `cp950`
-
- 2, 950
-
-* `gbk`
-
- 936
-
-* `cp932`
- `windows-31j`
-
- 932
-
-As the tables above show, when the character encoding record and the
-machine integer info record are both present, they can contradict each
-other. Observations show that, in this case, the character encoding
-record should be honored.
-
-If, for testing purposes, a file is crafted with different
-`character_code` and `encoding`, it seems that `character_code`
-controls the encoding for all strings in the system file before the
-dictionary termination record, including strings in data (e.g. string
-missing values), and `encoding` controls the encoding for strings
-following the dictionary termination record.
-
+++ /dev/null
-# Data File and Variable Attributes Records
-
-The data file and variable attributes records represent custom
-attributes for the system file or for individual variables in the
-system file, as defined on the `DATAFILE ATTRIBUTE` and `VARIABLE
-ATTRIBUTE` commands, respectively.
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char attributes[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 17 for a data file attribute record
- or to 18 for a variable attributes record.
-
-* `int32 size;`
-
- The size of each element in the `attributes` member. Always set to
- value 1.
-
-* `int32 count;`
-
- The total number of bytes in `attributes`.
-
-* `char attributes[];`
-
- The attributes, in a text-based format.
-
- In record subtype 17, this field contains a single attribute set.
- An attribute set is a sequence of one or more attributes
- concatenated together. Each attribute consists of a name, which
- has the same syntax as a variable name, followed by, inside
- parentheses, a sequence of one or more values. Each value consists
- of a string enclosed in single quotes (`'`) followed by a line feed
- (byte 0x0a). A value may contain single quote characters, which
- are not themselves escaped or quoted or required to be present in
- pairs. There is no apparent way to embed a line feed in a value.
- There is no distinction between an attribute with a single value
- and an attribute array with one element.
-
- In record subtype 18, this field contains a sequence of one or more
- variable attribute sets. If more than one variable attribute set
- is present, each one after the first is delimited from the previous
- by `/`. Each variable attribute set consists of a long variable
- name, followed by `:`, followed by an attribute set with the same
- syntax as on record subtype 17.
-
- System files written by `Stata 14.1/-savespss- 1.77 by S.Radyakin`
- may include multiple records with subtype 18, one per variable that
- has variable attributes.
-
- The total length is `count` bytes.
-
-## Example
-
-A system file produced with the following VARIABLE ATTRIBUTE commands in
-effect:
-
-```
-VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=fred[1]('23') fred[2]('34').
-VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
-```
-
-will contain a variable attribute record with the following contents:
-
-```
-0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...|
-0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.|
-0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'|
-0030 0a 29 |.) |
-```
-
-## Variable Roles
-
-A variable's role is represented as an attribute named `$@Role`. This
-attribute has a single element whose values and their meanings are:
-
-|Value|Role |
-|----:|:----------|
-| 0 | Input |
-| 1 | Target |
-| 2 | Both |
-| 3 | None |
-| 4 | Partition |
-| 5 | Split |
-
-The default and most common role is 0 (input).
+++ /dev/null
-# Data Record
-
-The data record must follow all other records in the system file. Every
-system file must have a data record that specifies data for at least one
-case. The format of the data record varies depending on the value of
-`compression` in the file header record:
-
-* 0: no compression
-
- Data is arranged as a series of 8-byte elements. Each element
- corresponds to the variable declared in the respective variable
- record (*note Variable Record::). Numeric values are given in
- `flt64` format; string values are literal characters string, padded
- on the right when necessary to fill out 8-byte units.
-
-* 1: bytecode compression
-
- The first 8 bytes of the data record is divided into a series of
- 1-byte command codes. These codes have meanings as described
- below:
-
- - 0
-
- Ignored. If the program writing the system file accumulates
- compressed data in blocks of fixed length, 0 bytes can be used
- to pad out extra bytes remaining at the end of a fixed-size
- block.
-
- - 1 through 251
-
- A number with value `code - bias`, where `code` is the value of
- the compression code and `bias` comes from the file header. For
- example, code 105 with bias 100.0 (the normal value) indicates a
- numeric variable of value 5.
-
- A code of 0 (after subtracting the bias) in a string field encodes
- null bytes. This is unusual, since a string field normally
- encodes text data, but it exists in real system files.
-
- - 252
-
- End of file. This code may or may not appear at the end of the
- data stream. PSPP always outputs this code but its use is not
- required.
-
- - 253
-
- A numeric or string value that is not compressible. The value is
- stored in the 8 bytes following the current block of command
- bytes. If this value appears twice in a block of command bytes,
- then it indicates the second group of 8 bytes following the
- command bytes, and so on.
-
- - 254
-
- An 8-byte string value that is all spaces.
-
- - 255
-
- The system-missing value.
-
- The end of the 8-byte group of bytecodes is followed by any 8-byte
- blocks of non-compressible values indicated by code 253. After that
- follows another 8-byte group of bytecodes, then those bytecodes'
- non-compressible values. The pattern repeats to the end of the file
- or a code with value 252.
-
-* 2: ZLIB compression
-
- The data record consists of the following, in order:
-
- - ZLIB data header, 24 bytes long.
-
- - One or more variable-length blocks of ZLIB compressed data.
-
- - ZLIB data trailer, with a 24-byte fixed header plus an
- additional 24 bytes for each preceding ZLIB compressed data
- block.
-
- The ZLIB data header has the following format:
-
- ```
- int64 zheader_ofs;
- int64 ztrailer_ofs;
- int64 ztrailer_len;
- ```
-
- - `int64 zheader_ofs;`
-
- The offset, in bytes, of the beginning of this structure
- within the system file.
-
- - `int64 ztrailer_ofs;`
-
- The offset, in bytes, of the first byte of the ZLIB data
- trailer.
-
- - `int64 ztrailer_len;`
-
- The number of bytes in the ZLIB data trailer. This and the
- previous field sum to the size of the system file in bytes.
-
- The data header is followed by `(ztrailer_len - 24) / 24` ZLIB
- compressed data blocks. Each ZLIB compressed data block begins
- with a ZLIB header as specified in RFC 1950, e.g. hex bytes `78 01`
- (the only header yet observed in practice). Each block
- decompresses to a fixed number of bytes (in practice only
- `0x3ff000`-byte blocks have been observed), except that the last
- block of data may be shorter. The last ZLIB compressed data block
- gends just before offset `ztrailer_ofs`.
-
- The result of ZLIB decompression is bytecode compressed data as
- described above for compression format 1.
-
- The ZLIB data trailer begins with the following 24-byte fixed
- header:
-
- ```
- int64 bias;
- int64 zero;
- int32 block_size;
- int32 n_blocks;
- ```
-
- - `int64 int_bias;`
-
- The compression bias as a negative integer, e.g. if `bias` in
- the file header record is 100.0, then `int_bias` is −100 (this
- is the only value yet observed in practice).
-
- - `int64 zero;`
-
- Always observed to be zero.
-
- - `int32 block_size;`
-
- The number of bytes in each ZLIB compressed data block, except
- possibly the last, following decompression. Only `0x3ff000`
- has been observed so far.
-
- - `int32 n_blocks;`
-
- The number of ZLIB compressed data blocks, always exactly
- `(ztrailer_len - 24) / 24`.
-
- The fixed header is followed by `n_blocks` 24-byte ZLIB data block
- descriptors, each of which describes the compressed data block
- corresponding to its offset. Each block descriptor has the
- following format:
-
- ```
- int64 uncompressed_ofs;
- int64 compressed_ofs;
- int32 uncompressed_size;
- int32 compressed_size;
- ```
-
- - `int64 uncompressed_ofs;`
-
- The offset, in bytes, that this block of data would have in a
- similar system file that uses compression format 1. This is
- `zheader_ofs` in the first block descriptor, and in each
- succeeding block descriptor it is the sum of the previous
- desciptor's `uncompressed_ofs` and `uncompressed_size`.
-
- - `int64 compressed_ofs;`
-
- The offset, in bytes, of the actual beginning of this
- compressed data block. This is `zheader_ofs + 24` in the
- first block descriptor, and in each succeeding block
- descriptor it is the sum of the previous descriptor's
- `compressed_ofs` and `compressed_size`. The final block
- descriptor's `compressed_ofs` and `compressed_size` sum to
- `ztrailer_ofs`.
-
- - `int32 uncompressed_size;`
-
- The number of bytes in this data block, after decompression.
- This is `block_size` in every data block except the last,
- which may be smaller.
-
- - `int32 compressed_size;`
-
- The number of bytes in this data block, as stored compressed
- in this system file.
-
+++ /dev/null
-# Dictionary Termination Record
-
-The dictionary termination record separates all other records from the
-data records.
-
-```
- int32 rec_type;
- int32 filler;
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 999.
-
-* `int32 filler;`
-
- Ignored padding. Should be set to 0.
-
+++ /dev/null
-# Document Record
-
-The document record, if present, has the following format:
-
-```
- int32 rec_type;
- int32 n_lines;
- char lines[][80];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 6.
-
-* `int32 n_lines;`
-
- Number of lines of documents present. This should be greater than
- zero, but ReadStats writes system files with zero `n_lines`.
-
-* `char lines[][80];`
-
- Document lines. The number of elements is defined by `n_lines`.
- Lines shorter than 80 characters are padded on the right with
- spaces.
-
+++ /dev/null
-# Extended Number of Cases Record
-
-The [file header record](file-header-record.md#ncases) expresses the
-number of cases in the system file as an `int32`. This record allows
-the number of cases in the system file to be expressed as a 64-bit
-number.
-
-```
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
- int64 unknown;
- int64 ncases64;
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 16.
-
-* `int32 size;`
-
- Size of each element. Always set to 8.
-
-* `int32 count;`
-
- Number of pieces of data in the data part. Alway set to 2.
-
-* `int64 unknown;`
-
- Meaning unknown. Always set to 1.
-
-* `int64 ncases64;`
-
- Number of cases in the file as a 64-bit integer. Presumably this
- could be -1 to indicate that the number of cases is unknown, for
- the same reason as `ncases` in the file header record, but this has
- not been observed in the wild.
-
+++ /dev/null
-# Extra Product Info Record
-
-This optional record appears to contain a text string that describes the
-program that wrote the file and the source of the data. (This is
-redundant with the file label and product info found in the [file header
-record](file-header-record.md).)
-
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char info[];
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 10.
-
-* `int32 size;`
-
- The size of each element in the `info` member. Always set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `info`.
-
-* `char info[];`
-
- A text string. A product that identifies itself as `VOXCO
- INTERVIEWER 4.3` uses CR-only line ends in this field, rather than
- the more usual LF-only or CR LF line ends.
-
+++ /dev/null
-# File Header Record
-
-A system file begins with the file header, with the following format:
-
-```
- char rec_type[4];
- char prod_name[60];
- int32 layout_code;
- int32 nominal_case_size;
- int32 compression;
- int32 weight_index;
- int32 ncases;
- flt64 bias;
- char creation_date[9];
- char creation_time[8];
- char file_label[64];
- char padding[3];
-```
-
-* `char rec_type[4];`
-
- Record type code, either `$FL2` for system files with uncompressed
- data or data compressed with simple bytecode compression, or `$FL3`
- for system files with ZLIB compressed data.
-
- This is truly a character field that uses the character encoding as
- other strings. Thus, in a file with an ASCII-based character
- encoding this field contains `24 46 4c 32` or `24 46 4c 33`, and in
- a file with an EBCDIC-based encoding this field contains `5b c6 d3
- f2`. (No EBCDIC-based ZLIB-compressed files have been observed.)
-
-* `char prod_name[60];`
-
- Product identification string. This always begins with the
- characters `@(#) SPSS DATA FILE`. PSPP uses the remaining
- characters to give its version and the operating system name; for
- example, `GNU pspp 0.1.4 - sparc-sun-solaris2.5.2`. The string is
- truncated if it would be longer than 60 characters; otherwise it is
- padded on the right with spaces.
-
- The product name field allow readers to behave differently based on
- quirks in the way that particular software writes system files.
- *Note Value Labels Records::, for the detail of the quirk that the
- PSPP system file reader tolerates in files written by ReadStat,
- which has `https://github.com/WizardMac/ReadStat` in `prod_name`.
-
-* `int32 layout_code;`
-
- Normally set to 2, although a few system files have been spotted in
- the wild with a value of 3 here. PSPP use this value to determine
- the file's integer endianness.
-
-* `int32 nominal_case_size;`
-
- Number of data elements per case. This is the number of variables,
- except that long string variables add extra data elements (one for
- every 8 characters after the first 8). However, string variables
- do not contribute to this value beyond the first 255 bytes.
- Further, some software always writes -1 or 0 in this field. In
- general, it is unsafe for systems reading system files to rely upon
- this value.
-
-* `int32 compression;`
-
- Set to 0 if the data in the file is not compressed, 1 if the data is
- compressed with simple bytecode compression, 2 if the data is ZLIB
- compressed. This field has value 2 if and only if `rec_type` is
- `$FL3`.
-
-* `int32 weight_index;`
-
- If one of the variables in the data set is used as a weighting
- variable, set to the [dictionary
- index](variable-record.md#dictionary-index) of that variable.
- Otherwise, set to 0.
-
-* `int32 ncases;`
-
- Set to the <a name="ncases">number of cases</a> in the file if it is
- known, or -1 otherwise.
-
- In the general case it is not possible to determine the number of
- cases that will be output to a system file at the time that the
- header is written. The way that this is dealt with is by writing
- the entire system file, including the header, then seeking back to
- the beginning of the file and writing just the `ncases` field. For
- files in which this is not valid, the seek operation fails. In
- this case, `ncases` remains -1.
-
-* `flt64 bias;`
-
- Compression bias, usually 100. Only integers between `1 - bias` and
- `251 - bias` can be compressed.
-
- By assuming that its value is 100, PSPP uses `bias` to determine the
- file's floating-point format and endianness. If the compression
- bias is not 100, PSPP cannot auto-detect the floating-point format
- and assumes that it is IEEE 754 format with the same endianness as
- the system file's integers, which is correct for all known system
- files.
-
-* `char creation_date[9];`
-
- Date of creation of the system file, in `dd mmm yy` format, with
- the month as standard English abbreviations, using an initial
- capital letter and following with lowercase. If the date is not
- available then this field is arbitrarily set to `01 Jan 70`.
-
-* `char creation_time[8];`
-
- Time of creation of the system file, in `hh:mm:ss` format and using
- 24-hour time. If the time is not available then this field is
- arbitrarily set to `00:00:00`.
-
-* `char file_label[64];`
-
- File label declared by the user, if any. Padded on the right with
- spaces.
-
- A product that identifies itself as `VOXCO INTERVIEWER 4.3` uses
- CR-only line ends in this field, rather than the more usual LF-only
- or CR LF line ends.
-
-* `char padding[3];`
-
- Ignored padding bytes to make the structure a multiple of 32 bits
- in length. Set to zeros.
-
+++ /dev/null
-An SPSS system file holds a set of cases and dictionary information that
-describes how they may be interpreted. The system file format dates
-back 40+ years and has evolved greatly over that time to support new
-features, but in a way to facilitate interchange between even the oldest
-and newest versions of software. This chapter describes the system file
-format.
-
-System files use four data types: 8-bit characters, 32-bit integers,
-64-bit integers, and 64-bit floating points, called here `char’,
-`int32’, `int64’, and `flt64’, respectively. Data is not necessarily
-aligned on a word or double-word boundary: the [long variable name
-record](long-variable-names-record.md) and [very long string
-record](very-long-string-record.md) have arbitrary byte length and can
-therefore cause all data coming after them in the file to be
-misaligned.
-
-Integer data in system files may be big-endian or little-endian. A
-reader may detect the endianness of a system file by examining
-[layout_code](file-header-record.md#layout_code) in the file header
-record.
-
-Floating-point data in system files may nominally be in IEEE 754, IBM,
-or VAX formats. A reader may detect the floating-point format in use
-by examining [bias](file-header-record.md#bias) in the file header
-record. Only files with IEEE 754 floating point data have actually
-been encountered.
-
-PSPP detects big-endian and little-endian integer formats in system
-files and translates as necessary. PSPP also detects the floating-point
-format in use, as well as the endianness of IEEE 754 floating-point
-numbers, and translates as needed. However, only IEEE 754 numbers with
-the same endianness as integer data in the same file have actually been
-observed in system files, and it is likely that other formats are
-obsolete or were never used.
-
-System files use a few floating point values for special purposes:
-
-* `SYSMIS`
-
- The system-missing value is represented by the largest possible
- negative number in the floating point format (`-DBL_MAX` or
- `f64::MIN`).
-
-* `HIGHEST`
-
- `HIGHEST` is used as the high end of a missing value range with an
- unbounded maximum. It is represented by the largest possible
- positive number (`DBL_MAX` or `f64::MAX`).
-
-* `LOWEST`
-
- `LOWEST` is used as the low end of a missing value range with an
- unbounded minimum. It was originally represented by the
- second-largest negative number (in IEEE 754 format,
- `0xffeffffffffffffe`). System files written by SPSS 21 and later
- instead use the largest negative number (`-DBL_MAX` or `f64::MIN`),
- the same value as `SYSMIS`. This does not lead to ambiguity because
- `LOWEST` appears in system files only in missing value ranges, which
- never contain `SYSMIS`.
-
-System files may use most character encodings based on an 8-bit unit.
-UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
-`rec_type` in the file header record is sufficient to distinguish
-between ASCII and EBCDIC based encodings. The best way to determine
-the specific encoding in use is to consult the [character encoding
-record](character-encoding-record.md), if present, and failing that
-the [character_code](machine-integer-info-record.md#character_code) in
-the machine integer info record. The same encoding should be used
-for the dictionary and the data in the file, although it is possible
-to artificially synthesize files that use different encodings.
+++ /dev/null
-# Long String Missing Values Record
-
-This record, if present, specifies missing values for long string
-variables.
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Repeated up to exactly `count` bytes. */
- int32 var_name_len;
- char var_name[];
- char n_missing_values;
- int32 value_len;
- char values[value_len * n_missing_values];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 22.
-
-* `int32 size;`
-
- Always set to 1.
-
-* `int32 count;`
-
- The number of bytes following the header until the next header.
-
-* `int32 var_name_len;`
- `char var_name[];`
-
- The number of bytes in the name of the long string variable that
- has missing values, plus the variable name itself, which consists
- of exactly `var_name_len` bytes. The variable name is not padded
- to any particular boundary, nor is it null-terminated.
-
-* `char n_missing_values;`
-
- The number of missing values, either 1, 2, or 3. (This is,
- unusually, a single byte instead of a 32-bit number.)
-
-* `int32 value_len;`
-
- The length of each missing value string, in bytes. This value
- should be 8, because long string variables are at least 8 bytes
- wide (by definition), only the first 8 bytes of a long string
- variable's missing values are allowed to be non-spaces, and any
- spaces within the first 8 bytes are included in the missing value
- here.
-
-* `char values[value_len * n_missing_values]`
-
- The missing values themselves, without any padding or null
- terminators.
-
-An earlier version of this document stated that `value_len` was
-repeated before each of the missing values, so that there was an extra
-`int32` value of 8 before each missing value after the first. Old
-versions of PSPP wrote data files in this format. Readers can tolerate
-this mistake, if they wish, by noticing and skipping the extra `int32`
-values, which wouldn't ordinarily occur in strings.
-
+++ /dev/null
-# Long String Value Labels Record
-
-This record, if present, specifies value labels for long string
-variables.
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Repeated up to exactly `count` bytes. */
- int32 var_name_len;
- char var_name[];
- int32 var_width;
- int32 n_labels;
- long_string_label labels[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 21.
-
-* `int32 size;`
-
- Always set to 1.
-
-* `int32 count;`
-
- The number of bytes following the header until the next header.
-
-* `int32 var_name_len;`
- `char var_name[];`
-
- The number of bytes in the name of the variable that has long
- string value labels, plus the variable name itself, which consists
- of exactly `var_name_len` bytes. The variable name is not padded
- to any particular boundary, nor is it null-terminated.
-
-* `int32 var_width;`
-
- The width of the variable, in bytes, which will be between 9 and
- 32767.
-
-* `int32 n_labels;`
- `long_string_label labels[];`
-
- The long string labels themselves. The `labels` array contains
- exactly `n_labels` elements, each of which has the following
- substructure:
-
- ```
- int32 value_len;
- char value[];
- int32 label_len;
- char label[];
- ```
-
- - `int32 value_len;`
- `char value[];`
-
- The string value being labeled. `value_len` is the number of
- bytes in `value`; it is equal to `var_width`. The `value`
- array is not padded or null-terminated.
-
- - `int32 label_len;`
- `char label[];`
-
- The label for the string value. `label_len`, which must be
- between 0 and 120, is the number of bytes in `label`. The
- `label` array is not padded or null-terminated.
-
+++ /dev/null
-# Long Variable Names Record
-
-If present, the long variable names record has the following format:
-
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char var_name_pairs[];
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 13.
-
-* `int32 size;`
-
- The size of each element in the `var_name_pairs` member. Always
- set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `var_name_pairs`.
-
-* `char var_name_pairs[];`
-
- A list of key-value tuples, where each key is the name of a
- variable, and the value is its long variable name. The key field is
- at most 8 bytes long and must match the name of a variable which
- appears in the [variable record](variable-record.md). The value
- field is at most 64 bytes long. The key and value fields are
- separated by a `=` byte. Each tuple is separated by a byte whose
- value is 09. There is no trailing separator following the last
- tuple. The total length is `count` bytes.
-
+++ /dev/null
-# Machine Floating-Point Info Record
-
-The floating-point info record, if present, has the following format:
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Data. */
- flt64 sysmis;
- flt64 highest;
- flt64 lowest;
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 4.
-
-* `int32 size;`
-
- Size of each piece of data in the data part, in bytes. Always set
- to 8.
-
-* `int32 count;`
-
- Number of pieces of data in the data part. Always set to 3.
-
-* `flt64 sysmis;`
- `flt64 highest;`
- `flt64 lowest;`
-
- The system missing value, the value used for `HIGHEST` in missing
- values, and the value used for `LOWEST` in missing values,
- respectively. See the [introduction](index.md) for more
- information.
-
- The SPSSWriter library in PHP, which identifies itself as `FOM SPSS
- 1.0.0` in the file header record `prod_name` field, writes
- unexpected values to these fields, but it uses the same values
- consistently throughout the rest of the file.
-
+++ /dev/null
-# Machine Integer Info Record
-
-The integer info record, if present, has the following format:
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Data. */
- int32 version_major;
- int32 version_minor;
- int32 version_revision;
- int32 machine_code;
- int32 floating_point_rep;
- int32 compression_code;
- int32 endianness;
- int32 character_code;
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 3.
-
-* `int32 size;`
-
- Size of each piece of data in the data part, in bytes. Always set
- to 4.
-
-* `int32 count;`
-
- Number of pieces of data in the data part. Always set to 8.
-
-* `int32 version_major;`
-
- PSPP major version number. In version X.Y.Z, this is X.
-
-* `int32 version_minor;`
-
- PSPP minor version number. In version X.Y.Z, this is Y.
-
-* `int32 version_revision;`
-
- PSPP version revision number. In version X.Y.Z, this is Z.
-
-* `int32 machine_code;`
-
- Machine code. PSPP always set this field to value to -1, but other
- values may appear.
-
-* `int32 floating_point_rep;`
-
- Floating point representation code. For IEEE 754 systems (the most
- common) this is 1. IBM 370 is supposed to set this to 2, and DEC
- VAX E to 3, but neither of these has been observed.
-
-* `int32 compression_code;`
-
- Compression code. Always set to 1, regardless of whether or how
- the file is compressed.
-
-* `int32 endianness;`
-
- Machine endianness. 1 indicates big-endian, 2 indicates
- little-endian.
-
-* `int32 character_code;`
-
- <a name="character-code">Character code</a>. The following values
- have been actually observed in system files:
-
- - 1
-
- EBCDIC. Only one example has been observed.
-
- - 2
-
- 7-bit ASCII. Old versions of SPSS for Unix and Windows always
- wrote value 2 in this field, regardless of the encoding in
- use, so it is not reliable and should be ignored.
-
- - 3
-
- 8-bit "ASCII".
-
- - 819
-
- ISO 8859-1 (IBM AIX code page number).
-
- - 874
- 9066
-
- The `windows-874` code page for Thai.
-
- - 932
-
- The `windows-932` code page for Japanese (aka `Shift_JIS`).
-
- - 936
-
- The `windows-936` code page for simplified Chinese (aka `GBK`).
-
- - 949
-
- Probably `ks_c_5601-1987`, Unified Hangul Code.
-
- - 950
-
- The `big5` code page for traditional Chinese.
-
- - 1250
-
- The `windows-1250` code page for Central European and Eastern
- European languages.
-
- - 1251
-
- The `windows-1251` code page for Cyrillic languages.
-
- - 1252
-
- The `windows-1252` code page for Western European languages.
-
- - 1253
-
- The `windows-1253` code page for modern Greek.
-
- - 1254
-
- The `windows-1254` code page for Turkish.
-
- - 1255
-
- The `windows-1255` code page for Hebrew.
-
- - 1256
-
- The `windows-1256` code page for Arabic script.
-
- - 1257
-
- The `windows-1257` code page for Estonian, Latvian, and
- Lithuanian.
-
- - 1258
-
- The `windows-1258` code page for Vietnamese.
-
- - 20127
-
- US-ASCII.
-
- - 28591
-
- ISO 8859-1 (Latin-1).
-
- - 25592
-
- ISO 8859-2 (Central European).
-
- - 28605
-
- ISO 8895-9 (Latin-9).
-
- - 51949
-
- The `euc-kr` code page for Korean.
-
- - 65001
-
- UTF-8.
-
- The following additional values are known to be defined:
-
- - 3
-
- 8-bit "ASCII".
-
- - 4
-
- DEC Kanji.
-
- The most common values observed, from most to least common, are
- 1252, 65001, 2, and 28591.
-
- Other Windows code page numbers are known to be generally valid.
-
- Newer versions also write the character encoding [as a
- string](character-encoding-record.md)..
-
+++ /dev/null
-# Multiple Response Sets Records
-
-The system file format has two different types of records that
-represent multiple response sets. The first type of record describes
-multiple response sets that can be understood by SPSS before
-version 14. The second type of record, with a closely related format,
-is used for multiple dichotomy sets that use the
-`CATEGORYLABELS=COUNTEDVALUES` feature added in version 14.
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char mrsets[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Set to 7 for records that describe multiple
- response sets understood by SPSS before version 14, or to 19 for
- records that describe dichotomy sets that use the
- `CATEGORYLABELS=COUNTEDVALUES` feature added in version 14.
-
-* `int32 size;`
-
- The size of each element in the `mrsets` member. Always set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `mrsets`.
-
-* `char mrsets[];`
-
- Zero or more line feeds (byte 0x0a), followed by a series of
- multiple response sets, each of which consists of the following:
-
- - The set's name (an identifier that begins with `$`), in mixed
- upper and lower case.
-
- - An equals sign (`=`).
-
- - `C` for a multiple category set, `D` for a multiple dichotomy
- set with `CATEGORYLABELS=VARLABELS`, or `E` for a multiple
- dichotomy set with `CATEGORYLABELS=COUNTEDVALUES`.
-
- - For a multiple dichotomy set with
- `CATEGORYLABELS=COUNTEDVALUES`, a space, followed by a number
- expressed as decimal digits, followed by a space. If
- `LABELSOURCE=VARLABEL` was specified on MRSETS, then the number
- is 11; otherwise it is 1.[^note]
-
- - For either kind of multiple dichotomy set, the counted value,
- as a positive integer count specified as decimal digits,
- followed by a space, followed by as many string bytes as
- specified in the count. If the set contains numeric
- variables, the string consists of the counted integer value
- expressed as decimal digits. If the set contains string
- variables, the string contains the counted string value.
- Either way, the string may be padded on the right with spaces
- (older versions of SPSS seem to always pad to a width of 8
- bytes; newer versions don't).
-
- - A space.
-
- - The multiple response set's label, using the same format as
- for the counted value for multiple dichotomy sets. A string
- of length 0 means that the set does not have a label. A
- string of length 0 is also written if LABELSOURCE=VARLABEL was
- specified.
-
- - The short names of the variables in the set, converted to
- lowercase, each preceded by a single space.
-
- Even though a multiple response set must have at least two
- variables, some system files contain multiple response sets
- with no variables or one variable. The source and meaning of
- these multiple response sets is unknown. (Perhaps they arise
- from creating a multiple response set then deleting all the
- variables that it contains?)
-
- - One line feed (byte 0x0a). Sometimes multiple, even hundreds,
- of line feeds are present.
-
-Example: Given appropriate variable definitions, consider the
-following MRSETS command:
-
-```
-MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
- /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
- /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
- /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
- VARIABLES=k l m VALUE=34
- /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
- VARIABLES=n o p VALUE='choice'.
-```
-
-The above would generate the following multiple response set record
-of subtype 7:
-
-```
-$a=C 10 my mcgroup a b c
-$b=D2 55 0 g e f d
-$c=D3 Yes 10 mdgroup #2 h i j
-```
-
-It would also generate the following multiple response set record
-with subtype 19:
-
-```
-$d=E 1 2 34 13 third mdgroup k l m
-$e=E 11 6 choice 0 n o p
-```
-
-[^note]: This part of the format may not be fully understood, because
- only a single example of each possibility has been examined.
-
+++ /dev/null
-# Other Informational Records
-
-This chapter documents many specific types of extension records are
-documented here, but others are known to exist. PSPP ignores unknown
-extension records when reading system files.
-
- The following extension record subtypes have also been observed, with
-the following believed meanings:
-
-* 6
-
- Date info, probably related to USE (according to Aapi Hämäläinen).
-
-* 12
-
- A UUID in the format described in RFC 4122. Only two examples
- observed, both written by SPSS 13, and in each case the UUID
- contained both upper and lower case.
-
-* 24
-
- XML that describes how data in the file should be displayed
- on-screen.
-
+++ /dev/null
-# System File Record Structure
-
-System files are divided into records with the following format:
-
-```
- int32 type;
- char data[];
-```
-
- This header does not identify the length of the `data` or any
-information about what it contains, so the system file reader must
-understand the format of `data` based on `type`. However, records with
-type 7, called “extension records”, have a stricter format:
-
-```
- int32 type;
- int32 subtype;
- int32 size;
- int32 count;
- char data[size * count];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. This value identifies a particular kind of
- extension record.
-
-* `int32 size;`
-
- The size of each piece of data that follows the header, in bytes.
- Known extension records use 1, 4, or 8, for `char`, `int32`, and
- `flt64` format data, respectively.
-
-* `int32 count;`
-
- The number of pieces of data that follow the header.
-
-* `char data[size * count];`
-
- Data, whose format and interpretation depend on the subtype.
-
-An extension record contains exactly `size * count` bytes of data,
-which allows a reader that does not understand an extension record to
-skip it. Extension records provide only nonessential information, so
-this allows for files written by newer software to preserve backward
-compatibility with older or less capable readers.
-
-Records in a system file must appear in the following order:
-
-* File header record.
-
-* Variable records.
-
-* All pairs of value labels records and value label variables
- records, if present.
-
-* Document record, if present.
-
-* Extension (type 7) records, in ascending numerical order of their
- subtypes.
-
- System files written by SPSS include at most one of each kind of
- extension record. This is generally true of system files written
- by other software as well, with known exceptions noted below in the
- individual sections about each type of record.
-
-* Dictionary termination record.
-
-* Data record.
-
-We advise authors of programs that read system files to tolerate
-format variations. Various kinds of misformatting and corruption have
-been observed in system files written by SPSS and other software
-alike. In particular, because extension records provide nonessential
-information, it is generally better to ignore an extension record
-entirely than to refuse to read a system file.
-
-The following sections describe the known kinds of records.
-
+++ /dev/null
-# Value Labels Records
-
-The value label records documented in this section are used for
-numeric and short string variables only. Long string variables may
-have value labels, but their value labels are recorded using a
-[different record type](long-string-value-labels-record.md).
-
-[ReadStat](file-header-record.md) writes value labels that label a
-single value more than once. In more detail, it emits value labels
-whose values are longer than string variables' widths, that are
-identical in the actual width of the variable, e.g. labels for values
-`ABC123` and `ABC456` for a string variable with width 3. For files
-written by this software, PSPP ignores such labels.
-
-## Value Label Record for Labels
-
-The value label record has the following format:
-
-```
- int32 rec_type;
- int32 label_count;
-
- /* Repeated `n_label` times. */
- char value[8];
- char label_len;
- char label[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 3.
-
-* `int32 label_count;`
-
- Number of value labels present in this record.
-
-The remaining fields are repeated `count` times. Each repetition
-specifies one value label.
-
-* `char value[8];`
-
- A numeric value or a short string value padded as necessary to 8
- bytes in length. Its type and width cannot be determined until the
- following value label variables record (see below) is read.
-
-* `char label_len;`
-
- The label's length, in bytes. The documented maximum length varies
- from 60 to 120 based on SPSS version. PSPP supports value labels
- up to 255 bytes long.
-
-* `char label[];`
-
- `label_len` bytes of the actual label, followed by up to 7 bytes of
- padding to bring `label` and `label_len` together to a multiple of
- 8 bytes in length.
-
-## Value Label Record for Variables
-
-The value label record is always immediately followed by a value
-label variables record with the following format:
-
-```
- int32 rec_type;
- int32 var_count;
- int32 vars[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 4.
-
-* `int32 var_count;`
-
- Number of variables that the associated value labels from the value
- label record are to be applied.
-
-* `int32 vars[];`
-
- A list of 1-based [dictionary
- indexes](variable-record.md#dictionary-index) of variables to which
- to apply the value labels. There are `var_count` elements.
-
- String variables wider than 8 bytes may not be specified in this
- list.
-
+++ /dev/null
-# Variable Display Parameter Record
-
-The variable display parameter record, if present, has the following
-format:
-
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Repeated `count` times. */
- int32 measure;
- int32 width; /* Not always present. */
- int32 alignment;
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 11.
-
-* `int32 size;`
-
- The size of `int32`. Always set to 4.
-
-* `int32 count;`
-
- The number of sets of variable display parameters (ordinarily the
- number of variables in the dictionary), times 2 or 3.
-
-The remaining members are repeated `count` times, in the same order as
-the variable records. No element corresponds to variable records that
-continue long string variables. The meanings of these members are as
-follows:
-
-* `int32 measure;`
-
- The measurement level of the variable:
-
- | Value | Level |
- |------:|:---------|
- | 0 | Unknown |
- | 1 | Nominal |
- | 2 | Ordinal |
- | 3 | Scale |
-
- An "unknown" `measure` of 0 means that the variable was created in
- some way that doesn't make the measurement level clear, e.g. with a
- `COMPUTE` transformation. PSPP sets the measurement level the first
- time it reads the data, so this should rarely appear.
-
-* `int32 width;`
-
- The width of the display column for the variable in characters.
-
- This field is present if `count` is 3 times the number of variables
- in the dictionary. It is omitted if `count` is 2 times the number of
- variables.
-
-* `int32 alignment;`
-
- The alignment of the variable for display purposes:
-
- | Value | Alignment |
- |------:|:---------------|
- | 0 | Left aligned |
- | 1 | Right aligned |
- | 2 | Centre aligned |
+++ /dev/null
-# Variable Record
-
-There must be one variable record for each numeric variable and each
-string variable with width 8 bytes or less. String variables wider than
-8 bytes have one variable record for each 8 bytes, rounding up. The
-first variable record for a long string specifies the variable's correct
-dictionary information. Subsequent variable records for a long string
-are filled with dummy information: a type of -1, no variable label or
-missing values, print and write formats that are ignored, and an empty
-string as name. A few system files have been encountered that include a
-variable label on dummy variable records, so readers should take care to
-parse dummy variable records in the same way as other variable records.
-
-The "<a name="dictionary-index">dictionary index</a>" of a variable is
-a 1-based offset in the set of variable records, including dummy
-variable records for long string variables. The first variable record
-has a dictionary index of 1, the second has a dictionary index of 2,
-and so on.
-
-The system file format does not directly support string variables
-wider than 255 bytes. Such very long string variables are represented
-by a number of narrower string variables. See [very long string
-record](very-long-string-record.md) for details.
-
- A system file should contain at least one variable and thus at least
-one variable record, but system files have been observed in the wild
-without any variables (thus, no data either).
-
-```
- int32 rec_type;
- int32 type;
- int32 has_var_label;
- int32 n_missing_values;
- int32 print;
- int32 write;
- char name[8];
-
- /* Present only if `has_var_label` is 1. */
- int32 label_len;
- char label[];
-
- /* Present only if `n_missing_values` is nonzero. */
- flt64 missing_values[];
-```
-
-* `int32 rec_type;`
-
- Record type code. Always set to 2.
-
-* `int32 type;`
-
- Variable type code. Set to 0 for a numeric variable. For a short
- string variable or the first part of a long string variable, this
- is set to the width of the string. For the second and subsequent
- parts of a long string variable, set to -1, and the remaining
- fields in the structure are ignored.
-
-* `int32 has_var_label;`
-
- If this variable has a variable label, set to 1; otherwise, set to
- 0.
-
-* `int32 n_missing_values;`
-
- If the variable has no missing values, set to 0. If the variable
- has one, two, or three discrete missing values, set to 1, 2, or 3,
- respectively. If the variable has a range for missing variables,
- set to -2; if the variable has a range for missing variables plus a
- single discrete value, set to -3.
-
- A long string variable always has the value 0 here. A separate
- record indicates [missing values for long string
- variables](long-string-missing-values-record.md).
-
-* `int32 print;`
-
- Print format for this variable. See below.
-
-* `int32 write;`
-
- Write format for this variable. See below.
-
-* `char name[8];`
-
- Variable name. The variable name must begin with a capital letter
- or the at-sign (`@`). Subsequent characters may also be digits,
- octothorpes (`#`), dollar signs (`$`), underscores (`_`), or full
- stops (`.`). The variable name is padded on the right with spaces.
-
- The `name` fields should be unique within a system file. System
- files written by SPSS that contain very long string variables with
- similar names sometimes contain duplicate names that are later
- eliminated by resolving the [very long string
- names](very-long-string-record.md). PSPP handles duplicates by
- assigning them new, unique names.
-
-* `int32 label_len;`
-
- This field is present only if `has_var_label` is set to 1. It is
- set to the length, in characters, of the variable label. The
- documented maximum length varies from 120 to 255 based on SPSS
- version, but some files have been seen with longer labels. PSPP
- accepts labels of any length.
-
-* `char label[];`
-
- This field is present only if `has_var_label` is set to 1. It has
- length `label_len`, rounded up to the nearest multiple of 32 bits.
- The first `label_len` characters are the variable's variable label.
-
-* `flt64 missing_values[];`
-
- This field is present only if `n_missing_values` is nonzero. It has
- the same number of 8-byte elements as the absolute value of
- `n_missing_values`. Each element is interpreted as a number for
- numeric variables (with `HIGHEST` and `LOWEST` indicated as
- described in the [introduction](index.md)). For string variables of
- width less than 8 bytes, elements are right-padded with spaces.
-
- For discrete missing values, each element represents one missing
- value. When a range is present, the first element denotes the
- minimum value in the range, and the second element denotes the
- maximum value in the range. When a range plus a value are present,
- the third element denotes the additional discrete missing value.
-
-## Format Types
-
-The `print` and `write` members of `sysfile_variable` are output
-formats coded into `int32` types. The least-significant byte of the
-`int32` represents the number of decimal places, and the next two bytes
-in order of increasing significance represent field width and format
-type, respectively. The most-significant byte is not used and should be
-set to zero.
-
-Format types are defined as follows:
-
-| Value | Meaning |
-|------:|:------------|
-| 0 | Not used. |
-| 1 | `A` |
-| 2 | `AHEX` |
-| 3 | `COMMA` |
-| 4 | `DOLLAR` |
-| 5 | `F` |
-| 6 | `IB` |
-| 7 | `PIBHEX` |
-| 8 | `P` |
-| 9 | `PIB` |
-| 10 | `PK` |
-| 11 | `RB` |
-| 12 | `RBHEX` |
-| 13 | Not used. |
-| 14 | Not used. |
-| 15 | `Z` |
-| 16 | `N` |
-| 17 | `E` |
-| 18 | Not used. |
-| 19 | Not used. |
-| 20 | `DATE` |
-| 21 | `TIME` |
-| 22 | `DATETIME` |
-| 23 | `ADATE` |
-| 24 | `JDATE` |
-| 25 | `DTIME` |
-| 26 | `WKDAY` |
-| 27 | `MONTH` |
-| 28 | `MOYR` |
-| 29 | `QYR` |
-| 30 | `WKYR` |
-| 31 | `PCT` |
-| 32 | `DOT` |
-| 33 | `CCA` |
-| 34 | `CCB` |
-| 35 | `CCC` |
-| 36 | `CCD` |
-| 37 | `CCE` |
-| 38 | `EDATE` |
-| 39 | `SDATE` |
-| 40 | `MTIME` |
-| 41 | `YMDHMS` |
-
-A few system files have been observed in the wild with invalid
-`write` fields, in particular with value 0. Readers should probably
-treat invalid `print` or `write` fields as some default format.
-
-## Obsolete Treatment of Long String Missing Values
-
-SPSS and most versions of PSPP write missing values for string
-variables wider than 8 bytes with a [Long String Missing Values
-Record](long-string-missing-values-record.md). Very old versions of
-PSPP instead wrote these missing values on the variables record,
-writing only the first 8 bytes of each missing value, with the
-remainder implicitly all spaces. Any new software should use the
-[Long String Missing Values
-Record](long-string-missing-values-record.md), but it might possibly
-be worthwhile also to accept the old format used by PSPP.
+++ /dev/null
-# Variable Sets Record
-
-The SPSS GUI offers users the ability to arrange variables in sets.
-Users may enable and disable sets individually, and the data editor and
-analysis dialog boxes only show enabled sets. Syntax does not use
-variable sets.
-
- The variable sets record, if present, has the following format:
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of text. */
- char text[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 5.
-
-* `int32 size;`
-
- Always set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `text`.
-
-* `char text[];`
-
- The variable sets, in a text-based format.
-
- Each variable set occupies one line of text, each of which ends
- with a line feed (byte 0x0a), optionally preceded by a carriage
- return (byte 0x0d).
-
- Each line begins with the name of the variable set, followed by an
- equals sign (`=`) and a space (byte 0x20), followed by the long
- variable names of the members of the set, separated by spaces. A
- variable set may be empty, in which case the equals sign and the
- space following it are still present.
-
+++ /dev/null
-# Very Long String Record
-
-Old versions of SPSS limited string variables to a width of 255 bytes.
-For backward compatibility with these older versions, the system file
-format represents a string longer than 255 bytes, called a “very long
-string”, as a collection of strings no longer than 255 bytes each. The
-strings concatenated to make a very long string are called its
-“segments”; for consistency, variables other than very long strings are
-considered to have a single segment.
-
-A very long string with a width of `w` has `n = (w + 251) / 252`
-segments, that is, one segment for every 252 bytes of width, rounding
-up. It would be logical, then, for each of the segments except the
-last to have a width of 252 and the last segment to have the
-remainder, but this is not the case. In fact, each segment except the
-last has a width of 255 bytes. The last has width `w - (n - 1) *
-252`; some versions of SPSS make it slightly wider, but not wide
-enough to make the last segment require another 8 bytes of data.
-
-Data is packed tightly into segments of a very long string, 255 bytes
-per segment. Because 255 bytes of segment data are allocated for
-every 252 bytes of the very long string's width (approximately), some
-unused space is left over at the end of the allocated segments. Data
-in unused space is ignored.
-
-> Example: Consider a very long string of width 20,000. Such a very
-long string has 20,000 / 252 = 80 (rounding up) segments. The first
-79 segments have width 255; the last segment has width 20,000 - 79 *
-252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
-The very long string's data is actually stored in the 19,890 bytes in
-the first 78 segments, plus the first 110 bytes of the 79th segment
-(19,890 + 110 = 20,000). The remaining 145 bytes of the 79th segment
-and all 92 bytes of the 80th segment are unused.
-
-The very long string record explains how to stitch together segments
-to obtain very long string data. For each of the very long string
-variables in the dictionary, it specifies the name of its first
-segment's variable and the very long string variable's actual width.
-The remaining segments immediately follow the named variable in the
-system file's dictionary.
-
-The very long string record, which is present only if the system file
-contains very long string variables, has the following format:
-
-```
- /* Header. */
- int32 rec_type;
- int32 subtype;
- int32 size;
- int32 count;
-
- /* Exactly `count` bytes of data. */
- char string_lengths[];
-```
-
-* `int32 rec_type;`
-
- Record type. Always set to 7.
-
-* `int32 subtype;`
-
- Record subtype. Always set to 14.
-
-* `int32 size;`
-
- The size of each element in the `string_lengths` member. Always
- set to 1.
-
-* `int32 count;`
-
- The total number of bytes in `string_lengths`.
-
-* `char string_lengths[];`
-
- a list of key-value tuples, where key is the name of a variable, and
- value is its length. the key field is at most 8 bytes long and must
- match the name of a variable which appears in the [variable
- record](variable-record.md). the value field is exactly 5 bytes
- long. it is a zero-padded, ASCII-encoded string that is the length
- of the variable. the key and value fields are separated by a `=`
- byte. tuples are delimited by a two-byte sequence {00, 09}. After
- the last tuple, there may be a single byte 00, or {00, 09}. The
- total length is `count` bytes.
-