+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+
@node System File Format
-@appendix System File Format
+@chapter System File Format
-A system file encapsulates a set of cases and dictionary information
-that describes how they may be interpreted. This chapter describes
-the format of a system file.
+An SPSS system file holds a set of cases and dictionary information
+that describes how they may be interpreted. The system file format
+dates back 40+ years and has evolved greatly over that time to support
+new features, but in a way that facilitates interchange between even the
+oldest and newest versions of software. This chapter describes the
+system file format.
System files use four data types: 8-bit characters, 32-bit integers,
-64-bit integers,
+64-bit integers,
and 64-bit floating points, called here @code{char}, @code{int32},
@code{int64}, and
@code{flt64}, respectively. Data is not necessarily aligned on a word
* Multiple Response Sets Records::
* Extra Product Info Record::
* Variable Display Parameter Record::
+* Variable Sets Record::
* Long Variable Names Record::
* Very Long String Record::
* Character Encoding Record::
* Other Informational Records::
* Dictionary Termination Record::
* Data Record::
-* Encrypted System Files::
@end menu
@node System File Record Structure
Extension (type 7) records, in ascending numerical order of their
subtypes.
+System files written by SPSS include at most one of each kind of
+extension record. This is generally true of system files written by
+other software as well, with known exceptions noted below in the
+individual sections about each type of record.
+
@item
Dictionary termination record.
would be longer than 60 characters; otherwise it is padded on the right
with spaces.
+The product name field allows readers to behave differently based on
+quirks in the way that particular software writes system files.
+@xref{Value Labels Records}, for details of a quirk that the PSPP
+system file reader tolerates in files written by ReadStat, which has
+@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.
+
@anchor{layout_code}
@item int32 layout_code;
Normally set to 2, although a few system files have been spotted in
Number of data elements per case. This is the number of variables,
except that long string variables add extra data elements (one for every
8 characters after the first 8). However, string variables do not
-contribute to this value beyond the first 255 bytes. Further, system
-files written by some systems set this value to -1. In general, it is
+contribute to this value beyond the first 255 bytes. Further, some
+software always writes -1 or 0 in this field. In general, it is
unsafe for systems reading system files to rely upon this value.
@item int32 compression;
same way as other variable records.
@anchor{Dictionary Index}
-The @dfn{dictionary index} of a variable is its offset in the set of
+The @dfn{dictionary index} of a variable is a 1-based offset in the set of
variable records, including dummy variable records for long string
-variables. The first variable record has a dictionary index of 0, the
-second has a dictionary index of 1, and so on.
+variables. The first variable record has a dictionary index of 1, the
+second has a dictionary index of 2, and so on.
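
As a minimal sketch of this numbering, the Python fragment below looks
up a variable record by its 1-based dictionary index. The function
name @code{variable_at} and the @code{records} list are hypothetical
and exist only for this illustration; the list holds one entry per
variable record in file order, with @code{None} standing in for each
dummy record.

@example
```python
def variable_at(records, dict_index):
    """Return the entry stored at a 1-based dictionary index.

    `records` is a hypothetical list with one entry per variable record,
    in file order; dummy records for long strings are stored as None.
    """
    if dict_index < 1 or dict_index > len(records):
        raise ValueError("dictionary index out of range: %d" % dict_index)
    return records[dict_index - 1]


# A numeric variable, a 16-byte string (one real record plus one dummy
# record), and another numeric variable.
records = ["AGE", "NAME", None, "SCORE"]
first = variable_at(records, 1)
last = variable_at(records, 4)
```
@end example

A reader that keeps variable records in a 0-based array has to
subtract 1 from each dictionary index, as this sketch does.
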
The system file format does not directly support string variables
wider than 255 bytes. Such very long string variables are represented
@tab @code{EDATE}
@item 39
@tab @code{SDATE}
+@item 40
+@tab @code{MTIME}
+@item 41
+@tab @code{YMDHMS}
@end multitable
@end quotation
have value labels, but their value labels are recorded using a
different record type (@pxref{Long String Value Labels Record}).
+ReadStat (@pxref{File Header Record}) writes value labels that label a
+single value more than once. In more detail, it emits value labels
+whose values are longer than the string variable's width and that
+become identical when truncated to that width, e.g.@: labels for
+values @code{ABC123} and @code{ABC456} for a string variable with
+width 3. For files written by this software, PSPP ignores such
+labels.
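
One way a reader might detect such labels is sketched below in Python.
The function name @code{dedupe_labels} and its arguments are
hypothetical, and keeping the first label seen for each truncated
value is an assumption of the sketch, not a description of what PSPP
does.

@example
```python
def dedupe_labels(width, labels):
    """Keep at most one label per value after truncation to `width` bytes.

    `labels` is a list of (value bytes, label text) pairs. Keeping the
    first label seen for each value is an assumption of this sketch.
    """
    kept = []
    seen = set()
    for value, label in labels:
        key = value[:width]
        if key in seen:
            continue  # Same value within the variable's actual width.
        seen.add(key)
        kept.append((key, label))
    return kept


labels = [
    (b"ABC123  ", "first"),
    (b"ABC456  ", "second"),
    (b"XYZ     ", "third"),
]
result = dedupe_labels(3, labels)
```
@end example

With a width of 3, the first two values both truncate to @code{ABC},
so only one of their labels survives.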
+
The value label record has the following format:
@example
int32 rec_type;
int32 label_count;
-/* @r{Repeated @code{label_cnt} times}. */
+/* @r{Repeated @code{label_count} times}. */
char value[8];
char label_len;
char label[];
label record are to be applied.
@item int32 vars[];
-A list of dictionary indexes of variables to which to apply the value
+A list of 1-based dictionary indexes of variables to which to apply the value
labels (@pxref{Dictionary Index}). There are @code{var_count}
elements.
Record type. Always set to 6.
@item int32 n_lines;
-Number of lines of documents present.
+Number of lines of documents present. This should be greater than
+zero, but ReadStat writes system files with zero @code{n_lines}.
@item char lines[][80];
Document lines. The number of elements is defined by @code{n_lines}.
@table @code
@item int32 measure;
-The measurement type of the variable:
+The measurement level of the variable:
@table @asis
+@item 0
+Unknown
@item 1
-Nominal Scale
+Nominal
@item 2
-Ordinal Scale
+Ordinal
@item 3
-Continuous Scale
+Scale
@end table
-SPSS sometimes writes a @code{measure} of 0. PSPP interprets this as
-nominal scale.
+An ``unknown'' @code{measure} of 0 means that the variable was created
+in some way that doesn't make the measurement level clear, e.g.@: with
+a @code{COMPUTE} transformation. PSPP sets the measurement level the
+first time it reads the data using the rules documented in
+@ref{Measurement Level,,,pspp, PSPP Users Guide}, so this should
+rarely appear.
@item int32 width;
The width of the display column for the variable in characters.
@end table
@end table
+@node Variable Sets Record
+@section Variable Sets Record
+
+The SPSS GUI offers users the ability to arrange variables in sets.
+Users may enable and disable sets individually, and the data editor
+and analysis dialog boxes only show enabled sets. Syntax does not use
+variable sets.
+
+The variable sets record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32 rec_type;
+int32 subtype;
+int32 size;
+int32 count;
+
+/* @r{Exactly @code{count} bytes of text.} */
+char text[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type. Always set to 7.
+
+@item int32 subtype;
+Record subtype. Always set to 5.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{text}.
+
+@item char text[];
+The variable sets, in a text-based format.
+
+Each variable set occupies one line of text, each of which ends with a
+line feed (byte 0x0a), optionally preceded by a carriage return (byte
+0x0d).
+
+Each line begins with the name of the variable set, followed by an
+equals sign (@samp{=}) and a space (byte 0x20), followed by the long
+variable names of the members of the set, separated by spaces. A
+variable set may be empty, in which case the equals sign and the space
+following it are still present.
+@end table
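
The text format above might be parsed along the following lines. This
Python sketch assumes that @code{text} has already been decoded into a
string using the file's character encoding and that set names contain
no equals sign; @code{parse_variable_sets} is a hypothetical name.

@example
```python
def parse_variable_sets(text):
    """Parse variable sets text into a list of (set name, member names)."""
    sets = []
    for line in text.split("\n"):
        line = line.rstrip("\r")  # Tolerate an optional carriage return.
        if not line:
            continue  # Skip the empty piece after the final line feed.
        name, _, members = line.partition("=")
        sets.append((name, members.split()))
    return sets


text = "Demographics= age sex region\r\nEmpty= \nScores= math reading\n"
parsed = parse_variable_sets(text)
```
@end example

An empty set such as @code{Empty} above yields an empty member list,
because only the equals sign and the space after it follow its name.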
+
@node Long Variable Names Record
@section Long Variable Names Record
int32 var_name_len;
char var_name[];
char n_missing_values;
-long_string_missing_value values[];
+int32 value_len;
+char values[value_len * n_missing_values];
@end example
@table @code
The number of missing values, either 1, 2, or 3. (This is, unusually,
a single byte instead of a 32-bit number.)
-@item long_string_missing_value values[];
-The missing values themselves. This array contains exactly
-@code{n_missing_values} elements, each of which has the following
-substructure:
-
-@example
-int32 value_len;
-char value[];
-@end example
-
-@table @code
@item int32 value_len;
-The length of the missing value string, in bytes. This value should
+The length of each missing value string, in bytes. This value should
be 8, because long string variables are at least 8 bytes wide (by
definition), only the first 8 bytes of a long string variable's
missing values are allowed to be non-spaces, and any spaces within the
first 8 bytes are included in the missing value here.
-@item char value[];
-The missing value string, exactly @code{value_len} bytes, without
-any padding or null terminator.
-@end table
+@item char values[value_len * n_missing_values];
+The missing values themselves, without any padding or null
+terminators.
@end table
+An earlier version of this document stated that @code{value_len} was
+repeated before each of the missing values, so that there was an extra
+@code{int32} value of 8 before each missing value after the first.
+Old versions of PSPP wrote data files in this format. Readers can
+tolerate this mistake, if they wish, by noticing and skipping the
+extra @code{int32} values, which wouldn't ordinarily occur in strings.
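
The tolerant reading strategy described above could look roughly like
the following Python sketch, which parses the entry for a single
variable. Little-endian integers are an assumption made for brevity,
and @code{parse_missing_values} is a hypothetical name.

@example
```python
import struct


def parse_missing_values(buf, offset=0):
    """Parse one variable's entry; return (name, values, next offset).

    Assumes little-endian integers, which is an assumption of this sketch.
    """
    (name_len,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    name = buf[offset:offset + name_len]
    offset += name_len
    n_missing = buf[offset]  # A single byte, not an int32.
    offset += 1
    (value_len,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    repeated = struct.pack("<i", value_len)
    values = []
    for i in range(n_missing):
        # Old PSPP versions repeated value_len before each later value.
        if i > 0 and buf[offset:offset + 4] == repeated:
            offset += 4
        values.append(buf[offset:offset + value_len])
        offset += value_len
    return name, values, offset


head = struct.pack("<i", 4) + b"NAME" + bytes([2]) + struct.pack("<i", 8)
good = head + b"abcdefgh" + b"ijklmnop"
legacy = head + b"abcdefgh" + struct.pack("<i", 8) + b"ijklmnop"
```
@end example

Both @code{good} and @code{legacy} parse to the same name and the same
two missing values; only the returned offset differs.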
+
@node Data File and Variable Attributes Records
@section Data File and Variable Attributes Records
@item char attributes[];
The attributes, in a text-based format.
-In record type 17, this field contains a single attribute set. An
+In record subtype 17, this field contains a single attribute set. An
attribute set is a sequence of one or more attributes concatenated
together. Each attribute consists of a name, which has the same
syntax as a variable name, followed by, inside parentheses, a sequence
an attribute with a single value and an attribute array with one
element.
-In record type 18, this field contains a sequence of one or more
+In record subtype 18, this field contains a sequence of one or more
variable attribute sets. If more than one variable attribute set is
present, each one after the first is delimited from the previous by
@code{/}. Each variable attribute set consists of a long
variable name,
followed by @code{:}, followed by an attribute set with the same
-syntax as on record type 17.
+syntax as on record subtype 17.
+
+System files written by @code{Stata 14.1/-savespss- 1.77 by
+S.Radyakin} may include multiple records with subtype 18, one per
+variable that has variable attributes.
The total length is @code{count} bytes.
@end table
the following believed meanings:
@table @asis
-@item 5
-A set of grouped variables (according to Aapi H@"am@"al@"ainen).
-
@item 6
Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
variable @code{bias} from the file header. For example,
code 105 with bias 100.0 (the normal value) indicates a numeric variable
of value 5.
-One file has been seen written by SPSS 14 that contained such a code
-in a @emph{string} field with the value 0 (after the bias is
-subtracted) as a way of encoding null bytes.
+
+A code of 0 (after subtracting the bias) in a string field encodes
+null bytes. This is unusual, since a string field normally encodes
+text data, but it exists in real system files.
@item 252
End of file. This code may or may not appear at the end of the data
field sum to the size of the system file in bytes.
@end table
-The data header is followed by @code{(ztrailer_ofs - 24) / 24} ZLIB
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
compressed data blocks. Each ZLIB compressed data block begins with a
ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
01} (the only header yet observed in practice). Each block
@item int32 n_blocks;
The number of ZLIB compressed data blocks, always exactly
-@code{(ztrailer_ofs - 24) / 24}.
+@code{(ztrailer_len - 24) / 24}.
@end table
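
As a small worked example of the formula above, the Python sketch
below derives the expected block count from @code{ztrailer_len}. The
function name @code{expected_n_blocks} is hypothetical, and rejecting
lengths that do not fit the formula is a choice made for this sketch
rather than documented reader behavior.

@example
```python
def expected_n_blocks(ztrailer_len):
    """Return the block count implied by the trailer length in bytes."""
    if ztrailer_len < 24 or (ztrailer_len - 24) % 24 != 0:
        raise ValueError("implausible ztrailer_len: %d" % ztrailer_len)
    return (ztrailer_len - 24) // 24


# A 24-byte fixed header plus three 24-byte entries is 96 bytes.
n_blocks = expected_n_blocks(96)
```
@end example

A reader can compare this result against the @code{n_blocks} field
that it reads from the file.
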
The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
@end table
@setfilename ignored
-
-@node Encrypted System Files
-@section Encrypted System Files
-
-SPSS 21 and later support an encrypted system file format.
-
-@quotation Warning
-The SPSS encrypted file format is poorly designed. It is much cheaper
-and faster to decrypt a file encrypted this way than if a well
-designed alternative were used. If you must use this format, use a
-10-byte randomly generated password.
-@end quotation
-
-@subheading Encrypted File Format
-
-Encrypted system files begin with the following 36-byte fixed header:
-
-@example
-0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE|
-0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............|
-0020 00 00 00 00 |....|
-@end example
-
-Following the fixed header is a complete system file in the usual
-format, except that each 16-byte block is encrypted with AES-256 in
-ECB mode. The AES-256 key is derived from a password in the following
-way:
-
-@enumerate
-@item
-Start from the literal password typed by the user. Truncate it to at
-most 10 bytes, then append (between 1 and 22) null bytes until there
-are exactly 32 bytes. Call this @var{password}.
-
-@item
-Let @var{constant} be the following 73-byte constant:
-
-@example
-0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11
-0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80
-0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85
-0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1
-0040 e1 83 07 d8 0d 00 00 01 00
-@end example
-
-@item
-Compute CMAC-AES-256(@var{password}, @var{constant}). Call the
-16-byte result @var{cmac}.
-
-@item
-The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
-@var{cmac} repeated twice.
-@end enumerate
-
-@subsubheading Example
-
-Consider the password @samp{pspp}. @var{password} is:
-
-@example
-0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............|
-0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-@end example
-
-@noindent
-@var{cmac} is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@noindent
-The AES-256 key is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@subheading Password Encoding
-
-SPSS also supports what it calls ``encrypted passwords.'' These are
-not encrypted. They are encoded with a simple, fixed scheme. An
-encoded password is always a multiple of 2 characters long, and never
-longer than 20 characters. The characters in an encoded password are
-always in the graphic ASCII range 33 through 126. Each successive
-pair of characters in the password encodes a single byte in the
-plaintext password.
-
-Use the following algorithm to decode a pair of characters:
-
-@enumerate
-@item
-Let @var{a} be the ASCII code of the first character, and @var{b} be
-the ASCII code of the second character.
-
-@item
-Let @var{ah} be the most significant 4 bits of @var{a}. Find the line
-in the table below that has @var{ah} on the left side. The right side
-of the line is a set of possible values for the most significant 4
-bits of the decoded byte.
-
-@display
-@t{2 } @result{} @t{2367}
-@t{3 } @result{} @t{0145}
-@t{47} @result{} @t{89cd}
-@t{56} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bh} be the most significant 4 bits of @var{b}. Find the line
-in the second table below that has @var{bh} on the left side. The
-right side of the line is a set of possible values for the most
-significant 4 bits of the decoded byte. Together with the results of
-the previous step, only a single possibility is left.
-
-@display
-@t{2 } @result{} @t{139b}
-@t{3 } @result{} @t{028a}
-@t{47} @result{} @t{46ce}
-@t{56} @result{} @t{57df}
-@end display
-
-@item
-Let @var{al} be the least significant 4 bits of @var{a}. Find the
-line in the table below that has @var{al} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte.
-
-@display
-@t{03cf} @result{} @t{0145}
-@t{12de} @result{} @t{2367}
-@t{478b} @result{} @t{89cd}
-@t{569a} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bl} be the least significant 4 bits of @var{b}. Find the
-line in the table below that has @var{bl} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte. Together with the results of the previous
-step, only a single possibility is left.
-
-@display
-@t{03cf} @result{} @t{028a}
-@t{12de} @result{} @t{139b}
-@t{478b} @result{} @t{46ce}
-@t{569a} @result{} @t{57df}
-@end display
-@end enumerate
-
-@subsubheading Example
-
-Consider the encoded character pair @samp{-|}. @var{a} is
-0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
-0xd, and @var{bl} is 0xc. @var{ah} means that the most significant
-four bits of the decoded character is 2, 3, 6, or 7, and @var{bh}
-means that they are 4, 6, 0xc, or 0xe. The single possibility in
-common is 6, so the most significant four bits are 6. Similarly,
-@var{al} means that the least significant four bits are 2, 3, 6, or 7,
-and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
-four bits are 2. The decoded character is therefore 0x62, the letter
-@samp{b}.