From 9c0b0dce29dcc30268cf2a06fb88648ec9028a88 Mon Sep 17 00:00:00 2001 From: Ben Pfaff Date: Fri, 9 May 2025 08:49:08 -0700 Subject: [PATCH] document SPSS/PC+ and portable files --- rust/doc/src/SUMMARY.md | 2 + rust/doc/src/pc+.md | 346 ++++++++++++++++++++ rust/doc/src/pc+/index.md | 1 + rust/doc/src/pc.md | 1 + rust/doc/src/portable.md | 340 +++++++++++++++++++ rust/doc/src/system-file/variable-record.md | 4 +- 6 files changed, 693 insertions(+), 1 deletion(-) create mode 100644 rust/doc/src/pc+.md create mode 100644 rust/doc/src/pc+/index.md create mode 100644 rust/doc/src/pc.md create mode 100644 rust/doc/src/portable.md diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index f5ae0d5e4b..39dadbae09 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -195,3 +195,5 @@ - [Encrypted File Wrappers](encrypted-wrapper/index.md) - [Common Wrapper Format](encrypted-wrapper/common-wrapper-format.md) - [Password Encoding](encrypted-wrapper/password-encoding.md) +- [SPSS Portable File Format](portable.md) +- [SPSS/PC+ System File Format](pc+.md) \ No newline at end of file diff --git a/rust/doc/src/pc+.md b/rust/doc/src/pc+.md new file mode 100644 index 0000000000..261e978856 --- /dev/null +++ b/rust/doc/src/pc+.md @@ -0,0 +1,346 @@ +# SPSS/PC+ System File Format + +SPSS/PC+, first released in 1984, was a simplified version of SPSS for +IBM PC and compatible computers. It used a data file format related to +the one described in the previous chapter, but simplified and +incompatible. The SPSS/PC+ software became obsolete in the 1990s, so +files in this format are rarely encountered today. Nevertheless, for +completeness, and because it is not very difficult, it seems worthwhile +to support at least reading these files. This chapter documents this +format, based on examination of a corpus of about 60 files from a +variety of sources. 
+ +System files use four data types: 8-bit characters, 16-bit unsigned +integers, 32-bit unsigned integers, and 64-bit floating-point numbers, called +here `char`, `uint16`, `uint32`, and `flt64`, respectively. Data is not +necessarily aligned on a word or double-word boundary. + +SPSS/PC+ ran only on IBM PC and compatible computers. Therefore, +values in these files are always in little-endian byte order. +Floating-point numbers are always in IEEE 754 format. + +SPSS/PC+ system files represent the system-missing value as +-1.66e308, or `f5 1e 26 02 8a 8c ed ff` expressed as hexadecimal. (This +is an unusual choice: it is close to, but not equal to, the most +negative finite 64-bit IEEE 754 value, which is about -1.8e308.) + +Text in SPSS/PC+ system files is encoded in ASCII-based 8-bit MS DOS +codepages. The files in the corpus used for investigating the format were all +ASCII-only. + +An SPSS/PC+ system file begins with the following 256-byte directory: + +``` +uint32 two; +uint32 zero; +struct { + uint32 ofs; + uint32 len; +} records[15]; +char filename[128]; +``` + +* `uint32 two;` + `uint32 zero;` + Always set to 2 and 0, respectively. + + These fields could be used as a signature for the file format, but + the `product` field in record 0 seems more likely to be unique + (see the [main header record](#record-0-main-header-record)). + +* `struct { ... } records[15];` + Each of the elements in this array identifies a record in the + system file. The `ofs` is a byte offset, from the beginning of the + file, that identifies the start of the record. `len` specifies the + length of the record, in bytes. Many records are optional or not + used. If a record is not present, `ofs` and `len` for that record + are both zero. + +* `char filename[128];` + In most files in the corpus, this field is entirely filled with + spaces. In one file, it contains a file name, followed by a null + byte, followed by spaces to fill the remainder of the field. The + meaning is unknown. 
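As a rough illustration of the directory layout above, the following Rust sketch decodes the 256-byte directory and the system-missing bit pattern. The names `RecordRef`, `Directory`, `parse_directory`, and `sysmis` are hypothetical and chosen only for this example; they are not taken from the PSPP sources. Rejecting a file whose `two` and `zero` fields differ from 2 and 0 is likewise an assumption of the sketch.

```rust
/// One entry of the directory's `records` array (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RecordRef {
    pub ofs: u32,
    pub len: u32,
}

/// The decoded 256-byte directory (hypothetical type).
#[derive(Debug)]
pub struct Directory {
    /// `None` for a record that is not present (`ofs` and `len` both zero).
    pub records: [Option<RecordRef>; 15],
    pub filename: [u8; 128],
}

/// Reads a little-endian `uint32` at byte offset `ofs`.
fn le_u32(bytes: &[u8], ofs: usize) -> u32 {
    u32::from_le_bytes([bytes[ofs], bytes[ofs + 1], bytes[ofs + 2], bytes[ofs + 3]])
}

/// Decodes the directory, or returns `None` if `two` and `zero` are not 2 and 0.
pub fn parse_directory(header: &[u8; 256]) -> Option<Directory> {
    if le_u32(header, 0) != 2 || le_u32(header, 4) != 0 {
        return None;
    }
    let mut records: [Option<RecordRef>; 15] = [None; 15];
    for (i, slot) in records.iter_mut().enumerate() {
        // Each entry is 8 bytes; the array starts right after `two` and `zero`.
        let ofs = le_u32(header, 8 + 8 * i);
        let len = le_u32(header, 12 + 8 * i);
        if ofs != 0 || len != 0 {
            *slot = Some(RecordRef { ofs, len });
        }
    }
    let mut filename = [0u8; 128];
    filename.copy_from_slice(&header[128..256]);
    Some(Directory { records, filename })
}

/// The system-missing value, built from the byte pattern given in the text.
pub fn sysmis() -> f64 {
    f64::from_le_bytes([0xf5, 0x1e, 0x26, 0x02, 0x8a, 0x8c, 0xed, 0xff])
}
```

A reader would then seek to `records[n].ofs` for each record that is present, rather than assuming fixed offsets.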
+ +The following sections describe the contents of each record, +identified by the index into the `records` array. + + + +## Record 0: Main Header Record + +All files in the corpus have this record at offset 0x100 with length +0xb0 (but readers should find this record, like the others, via the +`records` table in the directory). Its format is: + +``` +uint16 one0; +char product[62]; +flt64 sysmis; +uint32 zero0; +uint32 zero1; +uint16 one1; +uint16 compressed; +uint16 nominal_case_size; +uint16 n_cases0; +uint16 weight_index; +uint16 zero2; +uint16 n_cases1; +uint16 zero3; +char creation_date[8]; +char creation_time[8]; +char label[64]; +``` + +* `uint16 one0;` + `uint16 one1;` + Always set to 1. + +* `uint32 zero0;` + `uint32 zero1;` + `uint16 zero2;` + `uint16 zero3;` + Always set to 0. + + It seems likely that one of these variables is set to 1 if + weighting is enabled, but none of the files in the corpus is + weighted. + +* `char product[62];` + Name of the program that created the file. Only the following + unique values have been observed, in each case padded on the right + with spaces: + + ``` + DESPSS/PC+ System File Written by Data Entry II + PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ + PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0 + PCSPSS SYSTEM FILE. IBM PC DOS, SPSS for Windows + ``` + + Thus, it is reasonable to use the presence of the string `SPSS` at + offset 0x104 as a simple test for an SPSS/PC+ data file. + +* `flt64 sysmis;` + The system-missing value, as described previously. + +* `uint16 compressed;` + Set to 0 if the data in the file is not compressed, 1 if the data + is compressed with simple bytecode compression. + +* `uint16 nominal_case_size;` + Number of data elements per case. This is the number of variables, + except that long string variables add extra data elements (one for + every 8 bytes after the first 8). String variables in SPSS/PC+ + system files are limited to 255 bytes. 
+ +* `uint16 n_cases0;` + `uint16 n_cases1;` + The number of cases in the data record. Both values are the same. + Some files in the corpus contain data for the number of cases noted + here, followed by garbage that somewhat resembles data. + +* `uint16 weight_index;` + 0 if the file is unweighted; otherwise, a 1-based index into the + data record of the weighting variable, e.g. 4 for the first + variable after the 3 system-defined variables. + +* `char creation_date[8];` + The date that the file was created, in `mm/dd/yy` format. + Single-digit days and months are not prefixed by zeros. The string + is padded with spaces on the right or left or both: `_2/4/93_`, + `10/5/87_`, and `_1/11/88` (with `_` standing in for a space) are + all actual examples from the corpus. + +* `char creation_time[8];` + The time that the file was created, in `HH:MM:SS` format. + Single-digit hours are padded on the left with a space. Minutes and + seconds are always written as two digits. + +* `char label[64];` + [File label](commands/utilities/file-label.md) declared by the user, + if any. Padded on the right with spaces. + +## Record 1: Variables Record + +The variables record most commonly starts at offset 0x1b0, but it can be +placed elsewhere. The record contains instances of the following +32-byte structure: + +``` +uint32 value_label_start; +uint32 value_label_end; +uint32 var_label_ofs; +uint32 format; +char name[8]; +union { + flt64 f; + char s[8]; +} missing; +``` + +The number of instances is the `nominal_case_size` specified in the +main header record. There is one instance for each numeric variable +and each string variable with width 8 bytes or less. String variables +wider than 8 bytes have one instance for each 8 bytes, rounding up. +The first instance for a long string specifies the variable's correct +dictionary information. 
Subsequent instances for a long string are +generally filled with all-zero bytes, although the `missing` field +contains the numeric system-missing value, and some writers also fill +in `var_label_ofs`, `format`, and `name`, sometimes filling the last of these +with the numeric system-missing value rather than a text string. +Regardless of the values used, readers should ignore the contents of +these additional instances for long strings. + +* `uint32 value_label_start;` + `uint32 value_label_end;` + For a variable with value labels, these specify offsets into the + labels record of the start and end of this variable's value + labels, respectively. See the [labels + record](#record-2-labels-record) for more information. + + For a variable without any value labels, these are both zero. + + A long string variable may not have value labels. + +* `uint32 var_label_ofs;` + For a variable with a variable label, this specifies an offset into + the labels record. See the [labels record](#record-2-labels-record) for more + information. + + For a variable without a variable label, this is zero. + +* `uint32 format;` + The variable's output format, in the same format used in system + files. See [format types](system-file/variable-record.md#format-types) for details. SPSS/PC+ + system files only use format types 5 (F, for numeric variables) and + 1 (A, for string variables). + +* `char name[8];` + The variable's name, padded on the right with spaces. + +* `union { ... } missing;` + A user-missing value. For numeric variables, `missing.f` is the + variable's user-missing value. For string variables, `missing.s` + is a string missing value. A variable without a user-missing value + is indicated with `missing.f` set to the system-missing value, even + for string variables (!). A long string variable may not have a + missing value. + +In addition to the user-defined variables, every SPSS/PC+ system file +contains, as its first three variables, the following system-defined +variables, in the following order. 
The system-defined variables have +no variable label, value labels, or missing values. + +* `$CASENUM` + A numeric variable with format `F8.0`. Most of the time this is a + sequence number, starting with 1 for the first case and counting up + for each subsequent case. Some files skip over values, which + probably reflects cases that were deleted. + +* `$DATE` + A string variable with format `A8`. Same format (including varying + padding) as the `creation_date` field in the [main header + record](#record-0-main-header-record). The actual date can differ + from `creation_date` and from record to record. This may reflect + when individual cases were added or updated. + +* `$WEIGHT` + A numeric variable with format `F8.2`. This represents the case's + weight; SPSS/PC+ files do not have a user-defined weighting + variable. If weighting has not been enabled, every case has value + 1.0. + +## Record 2: Labels Record + +The labels record holds value labels and variable labels. Unlike the +other records, it is not meant to be read directly and sequentially. +Instead, this record must be interpreted one piece at a time, by +following pointers from the variables record. + +The `value_label_start`, `value_label_end`, and `var_label_ofs` +fields in a variable record are all offsets relative to the beginning of +the labels record, with an additional 7-byte offset. That is, if the +labels record starts at byte offset `labels_ofs` and a variable has a +given `var_label_ofs`, then the variable label begins at byte offset +`labels_ofs` + `var_label_ofs` + 7 in the file. + +A variable label, starting at the offset indicated by +`var_label_ofs`, consists of a one-byte length followed by the specified +number of bytes of the variable label string, like this: + +``` +uint8 length; +char s[length]; +``` + + A set of value labels, extending from `value_label_start` to +`value_label_end` (exclusive), consists of a numeric or string value +followed by a string in the format just described. 
String values are +padded on the right with spaces to fill the 8-byte field, like this: + +``` +union { + flt64 f; + char s[8]; +} value; +uint8 length; +char s[length]; +``` + + The labels record begins with a pair of `uint32` values. The first of +these is always 3. The second is between 8 and 16 less than the number +of bytes in the record. Neither value is important for interpreting the +file. + +## Record 3: Data Record + +The format of the data record varies depending on the value of +`compressed` in the main header record: + +* 0: no compression + Data is arranged as a series of 8-byte elements, one per variable + instance in the [variables + record](#record-1-variables-record). Numeric values are given in `flt64` format; string + values are literal character strings, padded on the right with + spaces when necessary to fill out 8-byte units. + +* 1: bytecode compression + The first 8 bytes of the data record are divided into a series of + 1-byte command codes. These codes have meanings as described + below: + + - 0 + The system-missing value. + + - 1 + A numeric or string value that is not compressible. The value + is stored in the 8 bytes following the current block of + command bytes. If this value appears twice in a block of + command bytes, then it indicates the second group of 8 bytes + following the command bytes, and so on. + + - 2 through 255 + A number with value CODE - 100, where CODE is the value of the + compression code. For example, code 105 indicates a numeric + value of 5. + + The end of the 8-byte group of bytecodes is followed by any 8-byte + blocks of non-compressible values indicated by code 1. After that + follows another 8-byte group of bytecodes, then those bytecodes' + non-compressible values. The pattern repeats until the number of + cases specified by the main header record has been seen. 
+ + The corpus does not contain any files with command codes 2 through + 95, so it is possible that some of these codes are used for special + purposes. + +Cases of data often, but not always, fill the entire data record. +Readers should stop reading after the number of cases specified in the +main header record. Otherwise, readers may try to interpret garbage +following the data as additional cases. + +## Records 4 and 5: Data Entry + +Records 4 and 5 appear to be related to SPSS/PC+ Data Entry. + diff --git a/rust/doc/src/pc+/index.md b/rust/doc/src/pc+/index.md new file mode 100644 index 0000000000..c7f15adba3 --- /dev/null +++ b/rust/doc/src/pc+/index.md @@ -0,0 +1 @@ +# SPSS/PC+ System File Format diff --git a/rust/doc/src/pc.md b/rust/doc/src/pc.md new file mode 100644 index 0000000000..c7f15adba3 --- /dev/null +++ b/rust/doc/src/pc.md @@ -0,0 +1 @@ +# SPSS/PC+ System File Format diff --git a/rust/doc/src/portable.md b/rust/doc/src/portable.md new file mode 100644 index 0000000000..e038d730ee --- /dev/null +++ b/rust/doc/src/portable.md @@ -0,0 +1,340 @@ +# Portable File Format + +These days, most computers use the same internal data formats for +integer and floating-point data, if one ignores little differences like +big- versus little-endian byte ordering. However, occasionally it is +necessary to exchange data between systems with incompatible data +formats. This is what portable files are designed to do. + +The portable file format is mostly obsolete. [System +files](system-file/index.md) are a better alternative. + +> This information is gleaned from examination of ASCII-formatted +portable files only, so some of it may be incorrect for portable files +formatted in EBCDIC or other character sets. + + + +## Portable File Characters + +Portable files are arranged as a series of lines of 80 characters each. +Each line is terminated by a carriage-return, line-feed sequence +("new-lines"). 
New-lines are only used to avoid line length limits +imposed by some OSes; they are not meaningful. + +Most lines in portable files are exactly 80 characters long. The +only exception is a line that ends in one or more spaces, in which case the +spaces may optionally be omitted. Thus, a portable file reader must act +as though a line shorter than 80 characters is padded to that length +with spaces. + +The file must be terminated with a `Z` character. In addition, if +the final line in the file does not have exactly 80 characters, then it +is padded on the right with `Z` characters. (The file contents may be +in any character set; the file contains a description of its own +character set, as explained under Portable File Header below. Therefore, the `Z` +character is not necessarily an ASCII `Z`.) + +For the rest of the description of the portable file format, +new-lines and the trailing `Z`s will be ignored, as if they did not +exist, because they are not an important part of understanding the file +contents. + +## Portable File Structure + +Every portable file consists of the following records, in sequence: + +- File header. + +- Version and date info. + +- Product identification. + +- Author identification (optional). + +- Subproduct identification (optional). + +- Variable count. + +- Case weight variable (optional). + +- Variables. Each variable record may optionally be followed by a + missing value record and a variable label record. + +- Value labels (optional). + +- Documents (optional). + +- Data. + +Most records are identified by a single-character tag code. The file +header and version info record do not have a tag. + +Other than these single-character codes, there are three types of +fields in a portable file: floating-point, integer, and string. +Floating-point fields have the following format: + +- Zero or more leading spaces. + +- Optional asterisk (`*`), which indicates a missing value. 
The + asterisk must be followed by a single character, generally a period + (`.`), but other characters also appear to be possible. + This completes the specification of a missing value. + +- Optional minus sign (`-`) to indicate a negative number. + +- A whole number, consisting of one or more base-30 digits: `0` + through `9` plus capital letters `A` through `T`. + +- Optional fraction, consisting of a radix point (`.`) followed by + one or more base-30 digits. + +- Optional exponent, consisting of a plus or minus sign (`+` or `-`) + followed by one or more base-30 digits. + +- A forward slash (`/`). + +Integer fields take a form identical to floating-point fields, but +they may not contain a fraction. + +String fields take the form of an integer field having value N, +followed by exactly N characters, which are the string content. + +## Portable File Header + +Every portable file begins with a 464-byte header, consisting of a +200-byte collection of vanity splash strings, followed by a 256-byte +character set translation table, followed by an 8-byte tag string. + +The 200-byte segment is divided into five 40-byte sections, each of +which represents the string `CHARSET SPSS PORT FILE` in a different +character set encoding, where `CHARSET` is the name of the character set +used in the file, e.g. `ASCII` or `EBCDIC`. Each string is padded on +the right with spaces in its respective character set. + +It appears that these strings exist only to inform those who might +view the file on a screen, and that they are not parsed by SPSS +products. Thus, they can be safely ignored. For those interested, the +strings are supposed to be in the following character sets, in the +specified order: EBCDIC, 7-bit ASCII, CDC 6-bit ASCII, 6-bit ASCII, +Honeywell 6-bit ASCII. + +The 256-byte segment describes a mapping from the character set used +in the portable file to an arbitrary character set having characters at +the following positions: + +* 0-60: Control characters. 
Not important enough to describe in full here. + +* 61-63: Reserved. + +* 64-73: Digits `0` through `9`. + +* 74-99: Capital letters `A` through `Z`. + +* 100-125: Lowercase letters `a` through `z`. + +* 126: Space. + +* 127-130: Symbols `.<(+` + +* 131: Solid vertical pipe. + +* 132-142: Symbols `&[]!$*);^-/` + +* 143: Broken vertical pipe. + +* 144-150: Symbols `` ,%_>?`: `` + +* 151: British pound symbol. + +* 152-155: Symbols `@'="`. + +* 156: Less than or equal symbol. + +* 157: Empty box. + +* 158: Plus or minus. + +* 159: Filled box. + +* 160: Degree symbol. + +* 161: Dagger. + +* 162: Symbol `~`. + +* 163: En dash. + +* 164: Lower left corner box draw. + +* 165: Upper left corner box draw. + +* 166: Greater than or equal symbol. + +* 167-176: Superscript `0` through `9`. + +* 177: Lower right corner box draw. + +* 178: Upper right corner box draw. + +* 179: Not equal symbol. + +* 180: Em dash. + +* 181: Superscript `(`. + +* 182: Superscript `)`. + +* 183: Horizontal dagger (?). + +* 184-186: Symbols `{}\`. + +* 187: Cents symbol. + +* 188: Centered dot, or bullet. + +* 189-255: Reserved. + +Symbols that are not defined in a particular character set are set to +the same value as symbol 64; i.e., to `0`. + +The 8-byte tag string consists of the exact characters `SPSSPORT` in +the portable file's character set, which can be used to verify that the +file is indeed a portable file. + +## Version and Date Info Record + +This record does not have a tag code. It has the following structure: + +- A single character identifying the file format version. The letter + A represents version 0, and so on. + +- An 8-character string field giving the file creation date in the + format YYYYMMDD. + +- A 6-character string field giving the file creation time in the + format HHMMSS. + +## Identification Records + +The product identification record has tag code `1`. It consists of a +single string field giving the name of the product that wrote the +portable file. 
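The identification records are each a tag code followed by one string field, and a string field is an integer field N followed by N characters. A minimal Rust sketch of decoding those two field types follows. It assumes an ASCII-encoded portable file with new-lines already removed, and it ignores the optional exponent that an integer field may in principle carry. The function names `parse_integer` and `parse_string` are hypothetical and used only for this example.

```rust
/// Value of one base-30 digit (`0`-`9`, then `A`-`T`), if `c` is one.
/// Assumes the file is ASCII-encoded.
fn base30_digit(c: u8) -> Option<i64> {
    match c {
        b'0'..=b'9' => Some(i64::from(c - b'0')),
        b'A'..=b'T' => Some(i64::from(c - b'A') + 10),
        _ => None,
    }
}

/// Parses an integer field at the start of `input`.
/// Returns the value and the bytes that follow the terminating `/`.
pub fn parse_integer(input: &[u8]) -> Option<(i64, &[u8])> {
    let mut pos = 0;
    // Zero or more leading spaces.
    while input.get(pos) == Some(&b' ') {
        pos += 1;
    }
    // Optional minus sign.
    let negative = input.get(pos) == Some(&b'-');
    if negative {
        pos += 1;
    }
    // One or more base-30 digits.
    let start = pos;
    let mut value: i64 = 0;
    while let Some(digit) = input.get(pos).copied().and_then(base30_digit) {
        value = value.checked_mul(30)?.checked_add(digit)?;
        pos += 1;
    }
    // The field must contain a digit and end with a forward slash.
    if pos == start || input.get(pos) != Some(&b'/') {
        return None;
    }
    Some((if negative { -value } else { value }, &input[pos + 1..]))
}

/// Parses a string field: an integer field N, then exactly N characters.
/// Returns the string content and the bytes that follow it.
pub fn parse_string(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (len, rest) = parse_integer(input)?;
    if len < 0 {
        return None;
    }
    let len = len as usize;
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}
```

With these helpers, a product identification record such as `15/HELLO` would be read by consuming the tag `1` and then calling `parse_string` on the remainder.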
+ +The author identification record has tag code `2`. It is optional. +If present, it consists of a single string field giving the name of the +person who caused the portable file to be written. + +The subproduct identification record has tag code `3`. It is +optional. If present, it consists of a single string field giving +additional information on the product that wrote the portable file. + +## Variable Count Record + +The variable count record has tag code `4`. It consists of a single +integer field giving the number of variables in the file dictionary. + +## Precision Record + +The precision record has tag code `5`. It consists of a single integer +field specifying the maximum number of base-30 digits used in data in +the file. + +## Case Weight Variable Record + +The case weight variable record is optional. If it is present, it +indicates the variable used for weighting cases; if it is absent, cases +are unweighted. It has tag code `6`. It consists of a single string +field that names the weighting variable. + +## Variable Records + +Each variable record represents a single variable. Variable records +have tag code `7`. They have the following structure: + +- Width (integer). This is 0 for a numeric variable, and a number + between 1 and 255 for a string variable. + +- Name (string). 1-8 characters long. Must be in all capitals. + + A few portable files that contain duplicate variable names have + been spotted in the wild. PSPP handles these by renaming the + duplicates with numeric extensions: `VAR_1`, `VAR_2`, and so on. + +- Print format. This is a set of three integer fields: + + - [Format type](system-file/variable-record.md#format-types) encoded + the same as in system files. + + - Format width. 1-40. + + - Number of decimal places. 1-40. + + A few portable files with invalid format types or formats that are + not of the appropriate width for their variables have been spotted + in the wild. 
PSPP assigns a default F or A format to a variable + with an invalid format. + +- Write format. Same structure as the print format described above. + +Each variable record can optionally be followed by a missing value +record, which has tag code `8`. A missing value record has one field, +the missing value itself (a floating-point or string, as appropriate). +Up to three of these missing value records can be used. + +There is also a record for missing value ranges, which has tag code +`B`. It is followed by two fields representing the range, which are +floating-point or string as appropriate. If a missing value range is +present, it may be followed by a single missing value record. + +Tag codes `9` and `A` represent `LO THRU X` and `X THRU HI` ranges, +respectively. Each is followed by a single field representing X. If +one of the ranges is present, it may be followed by a single missing +value record. + +In addition, each variable record can optionally be followed by a +variable label record, which has tag code `C`. A variable label record +has one field, the variable label itself (string). + +## Value Label Records + +Value label records have tag code `D`. They have the following format: + +- Variable count (integer). + +- List of variables (strings). The variable count specifies the + number in the list. Variables are specified by their names. All + variables must be of the same type (numeric or string), but string + variables do not necessarily have the same width. + +- Label count (integer). + +- List of (value, label) tuples. The label count specifies the + number of tuples. Each tuple consists of a value, which is numeric + or string as appropriate to the variables, followed by a label + (string). + +A few portable files that specify duplicate value labels, that is, +two different labels for a single value of a single variable, have been +spotted in the wild. PSPP uses the last value label specified in these +cases. 
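Numeric values in value label records, like numeric data elements, are floating-point fields in the base-30 notation described under Portable File Structure. The Rust sketch below decodes one such field. It assumes an ASCII-encoded file with new-lines removed, and it assumes that the exponent is a power of 30, which the text above does not state explicitly. The name `parse_float` is hypothetical; a missing value is returned as `None` inside the result.

```rust
/// Value of one base-30 digit (`0`-`9`, then `A`-`T`), if `c` is one.
fn base30_digit(c: u8) -> Option<u32> {
    match c {
        b'0'..=b'9' => Some(u32::from(c - b'0')),
        b'A'..=b'T' => Some(u32::from(c - b'A') + 10),
        _ => None,
    }
}

/// Parses a floating-point field at the start of `input`.
/// Returns `Some((None, rest))` for a missing value (`*` plus one character)
/// and `Some((Some(x), rest))` for an ordinary number ending in `/`.
pub fn parse_float(input: &[u8]) -> Option<(Option<f64>, &[u8])> {
    let mut pos = 0;
    while input.get(pos) == Some(&b' ') {
        pos += 1;
    }
    // Missing value: an asterisk followed by exactly one more character.
    if input.get(pos) == Some(&b'*') {
        input.get(pos + 1)?;
        return Some((None, &input[pos + 2..]));
    }
    let negative = input.get(pos) == Some(&b'-');
    if negative {
        pos += 1;
    }
    // Whole part: one or more base-30 digits.
    let start = pos;
    let mut value = 0.0f64;
    while let Some(d) = input.get(pos).copied().and_then(base30_digit) {
        value = value * 30.0 + f64::from(d);
        pos += 1;
    }
    if pos == start {
        return None;
    }
    // Optional fraction: radix point plus one or more digits.
    if input.get(pos) == Some(&b'.') {
        pos += 1;
        let frac_start = pos;
        let mut scale = 1.0 / 30.0;
        while let Some(d) = input.get(pos).copied().and_then(base30_digit) {
            value += f64::from(d) * scale;
            scale /= 30.0;
            pos += 1;
        }
        if pos == frac_start {
            return None;
        }
    }
    // Optional exponent: sign plus one or more digits (assumed power of 30).
    let sign = input.get(pos).copied();
    if sign == Some(b'+') || sign == Some(b'-') {
        pos += 1;
        let exp_start = pos;
        let mut exponent: i32 = 0;
        while let Some(d) = input.get(pos).copied().and_then(base30_digit) {
            exponent = exponent.checked_mul(30)?.checked_add(d as i32)?;
            pos += 1;
        }
        if pos == exp_start {
            return None;
        }
        if sign == Some(b'-') {
            exponent = -exponent;
        }
        value *= 30f64.powi(exponent);
    }
    if input.get(pos) != Some(&b'/') {
        return None;
    }
    Some((Some(if negative { -value } else { value }), &input[pos + 1..]))
}
```

Accumulating the fraction in floating point is the simplest approach but can lose a little precision for long digit strings; a production reader might prefer to accumulate digits as an integer and scale once.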
+ +## Document Record + +One document record may optionally follow the value label record. The +document record consists of tag code `E`, followed by the number of +document lines as an integer, followed by that number of strings, each +of which represents one document line. Document lines must be 80 bytes +long or shorter. + +## Portable File Data + +The data record has tag code `F`. There is only one tag for all the +data; thus, all the data must follow the dictionary. The data is +terminated by the end-of-file marker `Z`, which is not valid as the +beginning of a data element. + +Data elements are output in the same order as the variable records +describing them. String variables are output as string fields, and +numeric variables are output as floating-point fields. + diff --git a/rust/doc/src/system-file/variable-record.md b/rust/doc/src/system-file/variable-record.md index 433243fde1..603cff8309 100644 --- a/rust/doc/src/system-file/variable-record.md +++ b/rust/doc/src/system-file/variable-record.md @@ -127,7 +127,9 @@ without any variables (thus, no data either). maximum value in the range. When a range plus a value are present, the third element denotes the additional discrete missing value. -The `print` and `write` members of sysfile_variable are output +## Format Types + +The `print` and `write` members of `sysfile_variable` are output formats coded into `int32` types. The least-significant byte of the `int32` represents the number of decimal places, and the next two bytes in order of increasing significance represent field width and format -- 2.30.2