--- /dev/null
+# SPSS/PC+ System File Format
+
+SPSS/PC+, first released in 1984, was a simplified version of SPSS for
+IBM PC and compatible computers. It used a data file format related to
+the one described in the previous chapter, but simplified and
+incompatible. The SPSS/PC+ software became obsolete in the 1990s, so
+files in this format are rarely encountered today. Nevertheless, for
+completeness, and because it is not very difficult, it seems worthwhile
+to support at least reading these files. This chapter documents this
+format, based on examination of a corpus of about 60 files from a
+variety of sources.
+
+System files use four data types: 8-bit characters, 16-bit unsigned
+integers, 32-bit unsigned integers, and 64-bit floating-point numbers,
+called here `char`, `uint16`, `uint32`, and `flt64`, respectively. Data
+is not necessarily aligned on a word or double-word boundary.
+
+SPSS/PC+ ran only on IBM PC and compatible computers. Therefore,
+values in these files are always in little-endian byte order.
+Floating-point numbers are always in IEEE 754 format.
+
+SPSS/PC+ system files represent the system-missing value as
+-1.66e308, or `f5 1e 26 02 8a 8c ed ff` expressed as hexadecimal. (This
+is an unusual choice: it is close to, but not equal to, the most
+negative finite 64-bit IEEE 754 value, which is about -1.8e308.)
+
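As a minimal sketch of the point above, the quoted byte sequence can be decoded as a little-endian IEEE 754 double with Python's standard `struct` module. The names `SYSMIS_BYTES`, `SYSMIS`, and `is_sysmis` are illustrative only and do not come from any SPSS or PSPP source. Comparing the raw bytes, rather than the decoded floating-point value, is one way to avoid depending on how a decimal literal such as -1.66e308 happens to round.

```python
import math
import struct
import sys

# Byte sequence quoted in the text for the system-missing value.
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")

# "<d" reads a little-endian IEEE 754 double.
(SYSMIS,) = struct.unpack("<d", SYSMIS_BYTES)


def is_sysmis(raw: bytes) -> bool:
    """Compare the raw 8 bytes, which sidesteps floating-point rounding."""
    return raw == SYSMIS_BYTES


# The decoded value is finite and lies above the most negative finite double.
print(SYSMIS, math.isfinite(SYSMIS), SYSMIS > -sys.float_info.max)
```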
+Text in SPSS/PC+ system files is encoded in ASCII-based 8-bit MS-DOS
+code pages. The files in the corpus used for investigating the format
+were all ASCII-only.
+
+An SPSS/PC+ system file begins with the following 256-byte directory:
+
+```
+uint32 two;
+uint32 zero;
+struct {
+ uint32 ofs;
+ uint32 len;
+} records[15];
+char filename[128];
+```
+
+* `uint32 two;`
+ `uint32 zero;`
+ Always set to 2 and 0, respectively.
+
+ These fields could be used as a signature for the file format, but
+ the `product` field in record 0 seems more likely to be unique (see
+ the [main header record](#record-0-main-header-record)).
+
+* `struct { ... } records[15];`
+ Each of the elements in this array identifies a record in the
+ system file. The `ofs` is a byte offset, from the beginning of the
+ file, that identifies the start of the record. `len` specifies the
+ length of the record, in bytes. Many records are optional or not
+ used. If a record is not present, `ofs` and `len` for that record
+ are both zero.
+
+* `char filename[128];`
+ In most files in the corpus, this field is entirely filled with
+ spaces. In one file, it contains a file name, followed by a null
+ byte, followed by spaces to fill the remainder of the field. The
+ meaning is unknown.
+
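The directory layout above could be read along the following lines. This is only a sketch under the assumptions stated in the text (little-endian, unaligned fields); `parse_directory` and the constant names are hypothetical. Rejecting files whose first two fields are not 2 and 0 is a choice made for this example, not something the format is known to require.

```python
import struct

DIRECTORY_SIZE = 256
N_RECORDS = 15


def parse_directory(buf: bytes):
    """Return (records, filename) from the 256-byte directory.

    records is a list of 15 (ofs, len) pairs; absent records are (0, 0).
    """
    if len(buf) < DIRECTORY_SIZE:
        raise ValueError("file too short for an SPSS/PC+ directory")
    two, zero = struct.unpack_from("<II", buf, 0)
    if (two, zero) != (2, 0):
        raise ValueError("unexpected values in the first two directory fields")
    # The 15 (ofs, len) pairs start right after the two leading uint32 fields.
    records = [struct.unpack_from("<II", buf, 8 + 8 * i) for i in range(N_RECORDS)]
    filename = buf[128:256]
    return records, filename
```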
+The following sections describe the contents of each record,
+identified by the index into the `records` array.
+
+<!-- toc -->
+
+## Record 0: Main Header Record
+
+All files in the corpus have this record at offset 0x100 with length
+0xb0 (but readers should find this record, like the others, via the
+`records` table in the directory). Its format is:
+
+```
+uint16 one0;
+char product[62];
+flt64 sysmis;
+uint32 zero0;
+uint32 zero1;
+uint16 one1;
+uint16 compressed;
+uint16 nominal_case_size;
+uint16 n_cases0;
+uint16 weight_index;
+uint16 zero2;
+uint16 n_cases1;
+uint16 zero3;
+char creation_date[8];
+char creation_time[8];
+char label[64];
+```
+
+* `uint16 one0;`
+ `uint16 one1;`
+ Always set to 1.
+
+* `uint32 zero0;`
+ `uint32 zero1;`
+ `uint16 zero2;`
+ `uint16 zero3;`
+ Always set to 0.
+
+ It seems likely that one of these variables is set to 1 if
+ weighting is enabled, but none of the files in the corpus is
+ weighted.
+
+* `char product[62];`
+ Name of the program that created the file. Only the following
+ unique values have been observed, in each case padded on the right
+ with spaces:
+
+ ```
+ DESPSS/PC+ System File Written by Data Entry II
+ PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+
+ PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0
+ PCSPSS SYSTEM FILE. IBM PC DOS, SPSS for Windows
+ ```
+
+ Thus, it is reasonable to use the presence of the string `SPSS` at
+ offset 0x104 as a simple test for an SPSS/PC+ data file.
+
+* `flt64 sysmis;`
+ The system-missing value, as described previously.
+
+* `uint16 compressed;`
+ Set to 0 if the data in the file is not compressed, 1 if the data
+ is compressed with simple bytecode compression.
+
+* `uint16 nominal_case_size;`
+ Number of data elements per case. This is the number of variables,
+ except that long string variables add extra data elements (one for
+ every 8 bytes after the first 8). String variables in SPSS/PC+
+ system files are limited to 255 bytes.
+
+* `uint16 n_cases0;`
+ `uint16 n_cases1;`
+ The number of cases in the data record. Both values are the same.
+ Some files in the corpus contain data for the number of cases noted
+ here, followed by garbage that somewhat resembles data.
+
+* `uint16 weight_index;`
+ 0, if the file is unweighted, otherwise a 1-based index into the
+ data record of the weighting variable, e.g. 4 for the first
+ variable after the 3 system-defined variables.
+
+* `char creation_date[8];`
+ The date that the file was created, in `mm/dd/yy` format.
+ Single-digit days and months are not prefixed by zeros. The string
+ is padded with spaces on the right, the left, or both, e.g. `_2/4/93_`,
+ `10/5/87_`, and `_1/11/88` (with `_` standing in for a space) are
+ all actual examples from the corpus.
+
+* `char creation_time[8];`
+ The time that the file was created, in `HH:MM:SS` format.
+ Single-digit hours are padded on the left with a space. Minutes and
+ seconds are always written as two digits.
+
+* `char label[64];`
+ [File label](commands/utilities/file-label.md) declared by the user,
+ if any. Padded on the right with spaces.
+
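The field list above maps directly onto a `struct` format string, so a reader might unpack record 0 roughly as follows. This is a sketch, not a reference implementation: `MAIN_HEADER`, `parse_main_header`, and `looks_like_spss_pc` are made-up names, and decoding text as ASCII with replacement is an assumption made here because the corpus was ASCII-only (real files may use an MS-DOS code page).

```python
import struct

# Field layout of record 0, little-endian and unaligned ("<").
MAIN_HEADER = struct.Struct("<H62sdIIHHHHHHHH8s8s64s")
assert MAIN_HEADER.size == 0xB0


def parse_main_header(rec: bytes) -> dict:
    (one0, product, sysmis, zero0, zero1, one1, compressed,
     nominal_case_size, n_cases0, weight_index, zero2, n_cases1, zero3,
     creation_date, creation_time, label) = MAIN_HEADER.unpack_from(rec)
    return {
        "product": product.decode("ascii", "replace").rstrip(" "),
        "sysmis": sysmis,
        "compressed": bool(compressed),
        "nominal_case_size": nominal_case_size,
        "n_cases": n_cases0,
        "weight_index": weight_index,
        # Dates and times may be padded on either side, so strip both ends.
        "creation_date": creation_date.decode("ascii", "replace").strip(" "),
        "creation_time": creation_time.decode("ascii", "replace").strip(" "),
        "label": label.decode("ascii", "replace").rstrip(" "),
    }


def looks_like_spss_pc(data: bytes) -> bool:
    """Signature test from the text: the string SPSS at file offset 0x104."""
    return data[0x104:0x108] == b"SPSS"
```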
+## Record 1: Variables Record
+
+The variables record most commonly starts at offset 0x1b0, but it can be
+placed elsewhere. The record contains instances of the following
+32-byte structure:
+
+```
+uint32 value_label_start;
+uint32 value_label_end;
+uint32 var_label_ofs;
+uint32 format;
+char name[8];
+union {
+ flt64 f;
+ char s[8];
+} missing;
+```
+
+The number of instances is the `nominal_case_size` specified in the
+main header record. There is one instance for each numeric variable
+and each string variable with width 8 bytes or less. String variables
+wider than 8 bytes have one instance for each 8 bytes, rounding up.
+The first instance for a long string specifies the variable's correct
+dictionary information. Subsequent instances for a long string are
+generally filled with all-zero bytes, although the `missing` field
+contains the numeric system-missing value, and some writers also fill
+in `var_label_ofs`, `format`, and `name`, sometimes filling the latter
+with the numeric system-missing value rather than a text string.
+Regardless of the values used, readers should ignore the contents of
+these additional instances for long strings.
+
+* `uint32 value_label_start;`
+ `uint32 value_label_end;`
+ For a variable with value labels, these specify offsets into the
+ label record of the start and end of this variable's value
+ labels, respectively. See the [labels
+ record](#record-2-labels-record) for more information.
+
+ For a variable without any value labels, these are both zero.
+
+ A long string variable may not have value labels.
+
+* `uint32 var_label_ofs;`
+ For a variable with a variable label, this specifies an offset into
+ the label record. See the [labels
+ record](#record-2-labels-record) for more information.
+
+ For a variable without a variable label, this is zero.
+
+* `uint32 format;`
+ The variable's output format, in the same format used in system
+ files. See [format types](system-file/variable-record.md#format-types)
+ for details. SPSS/PC+ system files only use format types 5 (F, for
+ numeric variables) and 1 (A, for string variables).
+
+* `char name[8];`
+ The variable's name, padded on the right with spaces.
+
+* `union { ... } missing;`
+ A user-missing value. For numeric variables, `missing.f` is the
+ variable's user-missing value. For string variables, `missing.s`
+ is a string missing value. A variable without a user-missing value
+ is indicated with `missing.f` set to the system-missing value, even
+ for string variables (!). A long string variable may not have a
+ missing value.
+
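A sketch of reading the 32-byte instances described above follows. The helper name `parse_variable_instances` is hypothetical, the `format` field is left as a raw integer, and the sketch does not skip the extra instances of long string variables; a real reader would need to do that using the width taken from the first instance. The `missing` field is compared as raw bytes against the system-missing bit pattern, since the same pattern marks "no missing value" for both numeric and string variables.

```python
import struct

VAR_INSTANCE = struct.Struct("<IIII8s8s")   # 32 bytes per instance
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")


def parse_variable_instances(rec: bytes, nominal_case_size: int):
    """Return one dict per 32-byte instance in the variables record."""
    out = []
    for i in range(nominal_case_size):
        (vl_start, vl_end, var_label_ofs, fmt, name, missing) = (
            VAR_INSTANCE.unpack_from(rec, i * VAR_INSTANCE.size))
        out.append({
            "value_label_start": vl_start,
            "value_label_end": vl_end,
            "var_label_ofs": var_label_ofs,
            "format": fmt,                  # raw value, left undecoded here
            "name": name.decode("ascii", "replace").rstrip(" "),
            # None when the field holds the system-missing bit pattern.
            "missing": None if missing == SYSMIS_BYTES else missing,
        })
    return out
```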
+In addition to the user-defined variables, every SPSS/PC+ system file
+contains, as its first three variables, the following system-defined
+variables, in the following order. The system-defined variables have
+no variable label, value labels, or missing values.
+
+* `$CASENUM`
+ A numeric variable with format `F8.0`. Most of the time this is a
+ sequence number, starting with 1 for the first case and counting up
+ for each subsequent case. Some files skip over values, which
+ probably reflects cases that were deleted.
+
+* `$DATE`
+ A string variable with format `A8`. Same format (including varying
+ padding) as the `creation_date` field in the [main header
+ record](#record-0-main-header-record). The actual date can differ
+ from `creation_date` and from record to record. This may reflect
+ when individual cases were added or updated.
+
+* `$WEIGHT`
+ A numeric variable with format `F8.2`. This represents the case's
+ weight; SPSS/PC+ files do not have a user-defined weighting
+ variable. If weighting has not been enabled, every case has value
+ 1.0.
+
+## Record 2: Labels Record
+
+The labels record holds value labels and variable labels. Unlike the
+other records, it is not meant to be read directly and sequentially.
+Instead, this record must be interpreted one piece at a time, by
+following pointers from the variables record.
+
+The `value_label_start`, `value_label_end`, and `var_label_ofs`
+fields in a variable record are all offsets relative to the beginning of
+the labels record, with an additional 7-byte offset. That is, if the
+labels record starts at byte offset `labels_ofs` and a variable has a
+given `var_label_ofs`, then the variable label begins at byte offset
+`labels_ofs` + `var_label_ofs` + 7 in the file.
+
+A variable label, starting at the offset indicated by
+`var_label_ofs`, consists of a one-byte length followed by the specified
+number of bytes of the variable label string, like this:
+
+```
+uint8 length;
+char s[length];
+```
+
+A set of value labels, extending from `value_label_start` to
+`value_label_end` (exclusive), consists of a numeric or string value
+followed by a string in the format just described. String values are
+padded on the right with spaces to fill the 8-byte field, like this:
+
+```
+union {
+ flt64 f;
+ char s[8];
+} value;
+uint8 length;
+char s[length];
+```
+
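The pointer arithmetic and the two layouts above can be sketched as follows. All function names and `LABEL_BIAS` are illustrative, `data` is assumed to hold the whole file, and `labels_ofs` is assumed to come from entry 2 of the directory. Whether a value is decoded as `flt64` or kept as 8 raw bytes has to be decided by the caller from the variable's type.

```python
import struct

LABEL_BIAS = 7  # extra offset applied to every pointer into the labels record


def read_label(data: bytes, pos: int):
    """Read a length-prefixed string at absolute offset pos.

    Returns (label_bytes, next_pos).
    """
    length = data[pos]
    return data[pos + 1:pos + 1 + length], pos + 1 + length


def read_var_label(data: bytes, labels_ofs: int, var_label_ofs: int) -> bytes:
    label, _ = read_label(data, labels_ofs + var_label_ofs + LABEL_BIAS)
    return label


def read_value_labels(data, labels_ofs, start, end, numeric=True):
    """Return a list of (value, label) pairs for one variable."""
    pos = labels_ofs + start + LABEL_BIAS
    stop = labels_ofs + end + LABEL_BIAS      # exclusive end
    pairs = []
    while pos < stop:
        raw = data[pos:pos + 8]
        value = struct.unpack("<d", raw)[0] if numeric else raw
        label, pos = read_label(data, pos + 8)
        pairs.append((value, label))
    return pairs
```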
+The labels record begins with a pair of `uint32` values. The first of
+these is always 3. The second is between 8 and 16 less than the number
+of bytes in the record. Neither value is important for interpreting the
+file.
+
+## Record 3: Data Record
+
+The format of the data record varies depending on the value of
+`compressed` in the main header record:
+
+* 0: no compression
+ Data is arranged as a series of 8-byte elements, one per variable
+ instance in the [variables record](#record-1-variables-record).
+ Numeric values are given in `flt64` format; string values are
+ literal character strings, padded on the right with spaces when
+ necessary to fill out 8-byte units.
+
+* 1: bytecode compression
+ The first 8 bytes of the data record are divided into a series of
+ 1-byte command codes. These codes have meanings as described
+ below:
+
+ - 0
+ The system-missing value.
+
+ - 1
+ A numeric or string value that is not compressible. The value
+ is stored in the 8 bytes following the current block of
+ command bytes. If this value appears twice in a block of
+ command bytes, then it indicates the second group of 8 bytes
+ following the command bytes, and so on.
+
+ - 2 through 255
+ A number with value CODE - 100, where CODE is the value of the
+ compression code. For example, code 105 indicates a numeric
+ variable of value 5.
+
+ The end of the 8-byte group of bytecodes is followed by any 8-byte
+ blocks of non-compressible values indicated by code 1. After that
+ follows another 8-byte group of bytecodes, then those bytecodes'
+ non-compressible values. The pattern repeats until the number of
+ cases specified by the main header record has been read.
+
+ The corpus does not contain any files with command codes 2 through
+ 95, so it is possible that some of these codes are used for special
+ purposes.
+
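The bytecode scheme above might be decoded roughly as in this sketch. `decompress` is a hypothetical helper: it returns `None` for the system-missing code, the raw 8 bytes for code 1 (the caller decides whether they are a `flt64` or a string chunk), and a float for every other code. Treating codes 2 through 95 as ordinary biased numbers is an assumption, since the corpus does not exercise them. `n_values` is assumed to be the number of cases times `nominal_case_size`.

```python
def decompress(data: bytes, n_values: int):
    """Decode bytecode-compressed data into a flat list of n_values elements."""
    out = []
    pos = 0
    while len(out) < n_values:
        codes = data[pos:pos + 8]        # one group of 8 command bytes
        if len(codes) < 8:
            raise ValueError("data record ended before all values were read")
        pos += 8
        for code in codes:
            if len(out) == n_values:
                break                    # leftover command bytes are ignored
            if code == 0:
                out.append(None)         # system-missing value
            elif code == 1:
                out.append(data[pos:pos + 8])   # literal 8-byte element
                pos += 8
            else:
                out.append(float(code - 100))   # biased small number
    return out
```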
+Cases of data often, but not always, fill the entire data record.
+Readers should stop reading after the number of cases specified in the
+main header record. Otherwise, readers may try to interpret garbage
+following the data as additional cases.
+
+## Records 4 and 5: Data Entry
+
+Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.
+
--- /dev/null
+# Portable File Format
+
+These days, most computers use the same internal data formats for
+integer and floating-point data, if one ignores little differences like
+big- versus little-endian byte ordering. However, occasionally it is
+necessary to exchange data between systems with incompatible data
+formats. This is what portable files are designed to do.
+
+The portable file format is mostly obsolete. [System
+files](system-file/index.md) are a better alternative.
+
+> This information is gleaned from examination of ASCII-formatted
+portable files only, so some of it may be incorrect for portable files
+formatted in EBCDIC or other character sets.
+
+<!-- toc -->
+
+## Portable File Characters
+
+Portable files are arranged as a series of lines of 80 characters each.
+Each line is terminated by a carriage-return, line-feed sequence
+("new-lines"). New-lines are only used to avoid line length limits
+imposed by some OSes; they are not meaningful.
+
+Most lines in portable files are exactly 80 characters long. The
+only exception is a line that ends in one or more spaces, in which case
+the trailing spaces may be omitted. Thus, a portable file reader must
+act as though a line shorter than 80 characters is padded to that length
+with spaces.
+
+The file must be terminated with a `Z` character. In addition, if
+the final line in the file does not have exactly 80 characters, then it
+is padded on the right with `Z` characters. (The file contents may be
+in any character set; the file contains a description of its own
+character set, as explained in the next section. Therefore, the `Z`
+character is not necessarily an ASCII `Z`.)
+
+For the rest of the description of the portable file format,
+new-lines and the trailing `Z`s will be ignored, as if they did not
+exist, because they are not an important part of understanding the file
+contents.
+
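A reader might undo the line wrapping described above with something like the following sketch. `unwrap_lines` is a made-up name, and the default pad byte assumes an ASCII-formatted file; for other character sets the space character would have to be looked up in the translation table first. Trailing `Z` characters are deliberately left in place here, since the data section uses `Z` as its terminator.

```python
LINE_WIDTH = 80


def unwrap_lines(raw: bytes, pad: bytes = b" ") -> bytes:
    """Join CR-LF-terminated lines, padding short ones to 80 characters."""
    lines = raw.split(b"\r\n")
    if lines and lines[-1] == b"":
        lines.pop()      # the split leaves an empty tail after the final CR-LF
    return b"".join(line.ljust(LINE_WIDTH, pad) for line in lines)
```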
+## Portable File Structure
+
+Every portable file consists of the following records, in sequence:
+
+- File header.
+
+- Version and date info.
+
+- Product identification.
+
+- Author identification (optional).
+
+- Subproduct identification (optional).
+
+- Variable count.
+
+- Precision.
+
+- Case weight variable (optional).
+
+- Variables. Each variable record may optionally be followed by a
+ missing value record and a variable label record.
+
+- Value labels (optional).
+
+- Documents (optional).
+
+- Data.
+
+Most records are identified by a single-character tag code. The file
+header and version info record do not have a tag.
+
+Other than these single-character codes, there are three types of
+fields in a portable file: floating-point, integer, and string.
+Floating-point fields have the following format:
+
+- Zero or more leading spaces.
+
+- Optional asterisk (`*`), which indicates a missing value. The
+ asterisk must be followed by a single character, generally a period
+ (`.`), but it appears that other characters may also be possible.
+ This completes the specification of a missing value.
+
+- Optional minus sign (`-`) to indicate a negative number.
+
+- A whole number, consisting of one or more base-30 digits: `0`
+ through `9` plus capital letters `A` through `T`.
+
+- Optional fraction, consisting of a radix point (`.`) followed by
+ one or more base-30 digits.
+
+- Optional exponent, consisting of a plus or minus sign (`+` or `-`)
+ followed by one or more base-30 digits.
+
+- A forward slash (`/`).
+
+Integer fields take a form identical to floating-point fields, but
+they may not contain a fraction.
+
+String fields take the form of an integer field having value N,
+followed by exactly N characters, which are the string content.
+
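The field grammar above can be sketched as a small parser. The names `parse_number` and `parse_string` are hypothetical, the input is assumed to be text already translated to ordinary characters with new-lines removed, and one detail is an assumption not stated above: the exponent is taken to scale the value by powers of 30. For example, under these rules `-1.F/` decodes to -1.5 (digit `F` is 15, and 15/30 is 0.5) and `10/` decodes to 30.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRST"   # base-30 digits


def _digits(text, pos):
    """Read one or more base-30 digits; return (digit_values, next_pos)."""
    vals = []
    while pos < len(text) and text[pos] in DIGITS:
        vals.append(DIGITS.index(text[pos]))
        pos += 1
    if not vals:
        raise ValueError(f"expected base-30 digit at {pos}")
    return vals, pos


def parse_number(text, pos=0):
    """Parse a floating-point field; return (value or None, next_pos)."""
    while text[pos] == " ":
        pos += 1
    if text[pos] == "*":                 # missing value: '*' plus one character
        return None, pos + 2
    negative = text[pos] == "-"
    if negative:
        pos += 1
    vals, pos = _digits(text, pos)
    value = 0.0
    for d in vals:
        value = value * 30 + d
    if text[pos] == ".":
        vals, pos = _digits(text, pos + 1)
        for i, d in enumerate(vals, start=1):
            value += d / 30 ** i
    if text[pos] in "+-":
        sign = text[pos]
        vals, pos = _digits(text, pos + 1)
        exp = 0
        for d in vals:
            exp = exp * 30 + d
        # Assumption: the exponent scales by powers of 30.
        value = value * 30 ** exp if sign == "+" else value / 30 ** exp
    if text[pos] != "/":
        raise ValueError(f"expected '/' at {pos}")
    return (-value if negative else value), pos + 1


def parse_string(text, pos=0):
    """Parse a string field: an integer N, then exactly N characters."""
    n, pos = parse_number(text, pos)
    n = int(n)
    return text[pos:pos + n], pos + n
```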
+## Portable File Header
+
+Every portable file begins with a 464-byte header, consisting of a
+200-byte collection of vanity splash strings, followed by a 256-byte
+character set translation table, followed by an 8-byte tag string.
+
+The 200-byte segment is divided into five 40-byte sections, each of
+which represents the string `CHARSET SPSS PORT FILE` in a different
+character set encoding, where `CHARSET` is the name of the character set
+used in the file, e.g. `ASCII` or `EBCDIC`. Each string is padded on
+the right with spaces in its respective character set.
+
+It appears that these strings exist only to inform those who might
+view the file on a screen, and that they are not parsed by SPSS
+products. Thus, they can be safely ignored. For those interested, the
+strings are supposed to be in the following character sets, in the
+specified order: EBCDIC, 7-bit ASCII, CDC 6-bit ASCII, 6-bit ASCII,
+Honeywell 6-bit ASCII.
+
+The 256-byte segment describes a mapping from the character set used
+in the portable file to an arbitrary character set having characters at
+the following positions:
+
+* 0-60: Control characters. Not important enough to describe in full here.
+
+* 61-63: Reserved.
+
+* 64-73: Digits `0` through `9`.
+
+* 74-99: Capital letters `A` through `Z`.
+
+* 100-125: Lowercase letters `a` through `z`.
+
+* 126: Space.
+
+* 127-130: Symbols `.<(+`
+
+* 131: Solid vertical pipe.
+
+* 132-142: Symbols `&[]!$*);^-/`
+
+* 143: Broken vertical pipe.
+
+* 144-150: Symbols ``,%_>?`:``
+
+* 151: British pound symbol.
+
+* 152-155: Symbols `@'="`.
+
+* 156: Less than or equal symbol.
+
+* 157: Empty box.
+
+* 158: Plus or minus.
+
+* 159: Filled box.
+
+* 160: Degree symbol.
+
+* 161: Dagger.
+
+* 162: Symbol `~`.
+
+* 163: En dash.
+
+* 164: Lower left corner box draw.
+
+* 165: Upper left corner box draw.
+
+* 166: Greater than or equal symbol.
+
+* 167-176: Superscript `0` through `9`.
+
+* 177: Lower right corner box draw.
+
+* 178: Upper right corner box draw.
+
+* 179: Not equal symbol.
+
+* 180: Em dash.
+
+* 181: Superscript `(`.
+
+* 182: Superscript `)`.
+
+* 183: Horizontal dagger (?).
+
+* 184-186: Symbols `{}\`.
+
+* 187: Cents symbol.
+
+* 188: Centered dot, or bullet.
+
+* 189-255: Reserved.
+
+Symbols that are not defined in a particular character set are set to
+the same value as symbol 64; i.e., to `0`.
+
+The 8-byte tag string consists of the exact characters `SPSSPORT` in
+the portable file's character set, which can be used to verify that the
+file is indeed a portable file.
+
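One way to use the translation table and tag string is sketched below. `parse_header`, `CANON`, and the other names are illustrative only. The sketch covers just positions 64 through 130 of the table (digits, letters, space, and the symbols `.<(+`), builds a map from file bytes to characters, skips entries that share the byte for `0` because those mark undefined symbols, and then checks that the tag decodes to `SPSSPORT`.

```python
import string

SPLASH_SIZE, TABLE_SIZE, TAG_SIZE = 200, 256, 8

# Characters at positions 64-130 of the table, per the list above.
CANON = string.digits + string.ascii_uppercase + string.ascii_lowercase + " .<(+"
FIRST = 64


def parse_header(data: bytes):
    """Return a dict mapping file bytes to characters, after checking the tag."""
    table = data[SPLASH_SIZE:SPLASH_SIZE + TABLE_SIZE]
    tag = data[SPLASH_SIZE + TABLE_SIZE:SPLASH_SIZE + TABLE_SIZE + TAG_SIZE]
    if len(tag) != TAG_SIZE:
        raise ValueError("file too short for a portable file header")
    zero = table[FIRST]
    decode = {}
    for i, ch in enumerate(CANON, start=FIRST):
        if i != FIRST and table[i] == zero:
            continue             # undefined symbol, stored as the byte for `0`
        decode[table[i]] = ch
    if "".join(decode.get(b, "?") for b in tag) != "SPSSPORT":
        raise ValueError("not a portable file")
    return decode
```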
+## Version and Date Info Record
+
+This record does not have a tag code. It has the following structure:
+
+- A single character identifying the file format version. The letter
+ A represents version 0, and so on.
+
+- An 8-character string field giving the file creation date in the
+ format YYYYMMDD.
+
+- A 6-character string field giving the file creation time in the
+ format HHMMSS.
+
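Under the structure just described, a version and date record such as `A8/199301016/120000` would decode to version 0 and a timestamp of 1993-01-01 12:00:00. The sketch below shows one way to read it; the function names are hypothetical, the sample record is constructed for illustration rather than taken from a real file, and the input is assumed to be already translated to ASCII-compatible text (the version arithmetic relies on that).

```python
from datetime import datetime

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRST"


def read_string_field(text, pos):
    """Minimal string-field reader: base-30 length, '/', then that many characters."""
    n = 0
    while text[pos] != "/":
        n = n * 30 + DIGITS.index(text[pos])
        pos += 1
    pos += 1
    return text[pos:pos + n], pos + n


def parse_version_and_date(text, pos=0):
    """Return (version, timestamp, next_pos) for the untagged record."""
    version = ord(text[pos]) - ord("A")      # 'A' is version 0
    date, pos = read_string_field(text, pos + 1)
    time, pos = read_string_field(text, pos)
    return version, datetime.strptime(date + time, "%Y%m%d%H%M%S"), pos
```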
+## Identification Records
+
+The product identification record has tag code `1`. It consists of a
+single string field giving the name of the product that wrote the
+portable file.
+
+The author identification record has tag code `2`. It is optional.
+If present, it consists of a single string field giving the name of the
+person who caused the portable file to be written.
+
+The subproduct identification record has tag code `3`. It is
+optional. If present, it consists of a single string field giving
+additional information on the product that wrote the portable file.
+
+## Variable Count Record
+
+The variable count record has tag code `4`. It consists of a single
+integer field giving the number of variables in the file dictionary.
+
+## Precision Record
+
+The precision record has tag code `5`. It consists of a single integer
+field specifying the maximum number of base-30 digits used in data in
+the file.
+
+## Case Weight Variable Record
+
+The case weight variable record is optional. If it is present, it
+indicates the variable used for weighting cases; if it is absent, cases
+are unweighted. It has tag code `6`. It consists of a single string
+field that names the weighting variable.
+
+## Variable Records
+
+Each variable record represents a single variable. Variable records
+have tag code `7`. They have the following structure:
+
+- Width (integer). This is 0 for a numeric variable, and a number
+ between 1 and 255 for a string variable.
+
+- Name (string). 1-8 characters long. Must be in all capitals.
+
+ A few portable files that contain duplicate variable names have
+ been spotted in the wild. PSPP handles these by renaming the
+ duplicates with numeric extensions: `VAR_1`, `VAR_2`, and so on.
+
+- Print format. This is a set of three integer fields:
+
+ - [Format type](system-file/variable-record.md#format-types) encoded
+ the same as in system files.
+
+ - Format width. 1-40.
+
+ - Number of decimal places. 0-40.
+
+ A few portable files with invalid format types or formats that are
+ not of the appropriate width for their variables have been spotted
+ in the wild. PSPP assigns a default F or A format to a variable
+ with an invalid format.
+
+- Write format. Same structure as the print format described above.
+
+Each variable record can optionally be followed by a missing value
+record, which has tag code `8`. A missing value record has one field,
+the missing value itself (a floating-point or string, as appropriate).
+Up to three of these missing value records can be used.
+
+There is also a record for missing value ranges, which has tag code
+`B`. It is followed by two fields representing the range, which are
+floating-point or string as appropriate. If a missing value range is
+present, it may be followed by a single missing value record.
+
+Tag codes `9` and `A` represent `LO THRU X` and `X THRU HI` ranges,
+respectively. Each is followed by a single field representing X. If
+one of the ranges is present, it may be followed by a single missing
+value record.
+
+In addition, each variable record can optionally be followed by a
+variable label record, which has tag code `C`. A variable label record
+has one field, the variable label itself (string).
+
+## Value Label Records
+
+Value label records have tag code `D`. They have the following format:
+
+- Variable count (integer).
+
+- List of variables (strings). The variable count specifies the
+ number in the list. Variables are specified by their names. All
+ variables must be of the same type (numeric or string), but string
+ variables do not necessarily have the same width.
+
+- Label count (integer).
+
+- List of (value, label) tuples. The label count specifies the
+ number of tuples. Each tuple consists of a value, which is numeric
+ or string as appropriate to the variables, followed by a label
+ (string).
+
+A few portable files that specify duplicate value labels, that is,
+two different labels for a single value of a single variable, have been
+spotted in the wild. PSPP uses the last value label specified in these
+cases.
+
+## Document Record
+
+One document record may optionally follow the value label record. The
+document record consists of tag code `E`, followed by the number of
+document lines as an integer, followed by that number of strings, each
+of which represents one document line. Document lines must be 80 bytes
+long or shorter.
+
+## Portable File Data
+
+The data record has tag code `F`. There is only one tag for all the
+data; thus, all the data must follow the dictionary. The data is
+terminated by the end-of-file marker `Z`, which is not valid as the
+beginning of a data element.
+
+Data elements are output in the same order as the variable records
+describing them. String variables are output as string fields, and
+numeric variables are output as floating-point fields.
+