has actually been observed in system files, and it is likely that
other formats are obsolete or were never used.
-The PSPP system-missing value is represented by the largest possible
-negative number in the floating point format (@code{-DBL_MAX}). Two
-other values are important for use as missing values: @code{HIGHEST},
-represented by the largest possible positive number (@code{DBL_MAX}),
-and @code{LOWEST}, represented by the second-largest negative number
-(in IEEE 754 format, @code{0xffeffffffffffffe}).
+System files use a few floating point values for special purposes:
+
+@table @asis
+@item SYSMIS
+The system-missing value is represented by the largest possible
+negative number in the floating point format (@code{-DBL_MAX}).
+
+@item HIGHEST
+HIGHEST is used as the high end of a missing value range with an
+unbounded maximum. It is represented by the largest possible positive
+number (@code{DBL_MAX}).
+
+@item LOWEST
+LOWEST is used as the low end of a missing value range with an
+unbounded minimum. It was originally represented by the
+second-largest negative number (in IEEE 754 format,
+@code{0xffeffffffffffffe}). System files written by SPSS 21 and later
+instead use the largest negative number (@code{-DBL_MAX}), the same
+value as SYSMIS. This does not lead to ambiguity because LOWEST
+appears in system files only in missing value ranges, which never
+contain SYSMIS.
+@end table
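
As a rough illustration of the three special values, the following Python sketch reproduces their bit patterns. The constant names and the `bits` helper are made up for this example, and it assumes the host uses IEEE 754 doubles (true of every common Python platform).

```python
import struct
import sys

# Illustrative names only; the bit patterns come from the text above.
SYSMIS = -sys.float_info.max   # largest negative double, -DBL_MAX
HIGHEST = sys.float_info.max   # largest positive double, DBL_MAX
# Original LOWEST: the second-largest negative double.
LOWEST_OLD = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))[0]
# LOWEST as written by SPSS 21 and later: same bits as SYSMIS.
LOWEST_NEW = SYSMIS


def bits(x: float) -> str:
    """Return the big-endian IEEE 754 bit pattern of x as hex digits."""
    return struct.pack(">d", x).hex()


if __name__ == "__main__":
    for name, value in [("SYSMIS", SYSMIS), ("HIGHEST", HIGHEST),
                        ("LOWEST (old)", LOWEST_OLD),
                        ("LOWEST (new)", LOWEST_NEW)]:
        print(name, bits(value))
```

Because the newer LOWEST shares its bits with SYSMIS, a reader can only tell them apart by context: a value found in a missing value range is LOWEST.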
+
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings. The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}). The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
System files are divided into records, each of which begins with a
4-byte record type, usually regarded as an @code{int32}.
* Machine Integer Info Record::
* Machine Floating-Point Info Record::
* Multiple Response Sets Records::
+* Extra Product Info Record::
* Variable Display Parameter Record::
* Long Variable Names Record::
* Very Long String Record::
* Character Encoding Record::
* Long String Value Labels Record::
+* Long String Missing Values Record::
* Data File and Variable Attributes Records::
* Extended Number of Cases Record::
* Miscellaneous Informational Records::
char prod_name[60];
int32 layout_code;
int32 nominal_case_size;
-int32 compressed;
+int32 compression;
int32 weight_index;
int32 ncases;
flt64 bias;
@table @code
@item char rec_type[4];
-Record type code, set to @samp{$FL2}.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the same character encoding
+as other strings. Thus, in a file with an ASCII-based character
+encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}. (No EBCDIC-based ZLIB-compressed files have been observed.)
@item char prod_name[60];
Product identification string. This always begins with the characters
unsafe for systems reading system files to rely upon this value.
@item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed. This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
@item int32 weight_index;
If one of the variables in the data set is used as a weighting
File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
PSPP Users Guide}). Padded on the right with spaces.
+A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
+CR-only line ends in this field, rather than the more usual LF-only or
+CR LF line ends.
+
@item char padding[3];
Ignored padding bytes to make the structure a multiple of 32 bits in
length. Set to zeros.
-2; if the variable has a range for missing variables plus a single
discrete value, set to -3.
+A long string variable always has the value 0 here. A separate record
+indicates missing values for long string variables (@pxref{Long String
+Missing Values Record}).
+
@item int32 print;
Print format for this variable. See below.
IBM 370 sets this to 2, and DEC VAX E to 3.
@item int32 compression_code;
-Compression code. Always set to 1.
+Compression code. Always set to 1, regardless of whether or how the
+file is compressed.
@item int32 endianness;
Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
been actually observed in system files:
@table @asis
+@item 1
+EBCDIC.
+
@item 2
7-bit ASCII.
The following additional values are known to be defined:
@table @asis
-@item 1
-EBCDIC.
-
@item 3
8-bit ``ASCII''.
Other Windows code page numbers are known to be generally valid.
-Old versions of SPSS always wrote value 2 in this field, regardless of
-the encoding in use. Newer versions also write the character encoding
-as a string (see @ref{Character Encoding Record}).
+Old versions of SPSS for Unix and Windows always wrote value 2 in this
+field, regardless of the encoding in use. Newer versions also write
+the character encoding as a string (see @ref{Character Encoding
+Record}).
@end table
@node Machine Floating-Point Info Record
$e=E 11 6 choice 0 n o p
@end example
+@node Extra Product Info Record
+@section Extra Product Info Record
+
+This optional record appears to contain a text string that describes
+the program that wrote the file and the source of the data. (This is
+redundant with the file label and product info found in the file
+header record.)
+
+@example
+/* @r{Header.} */
+int32 rec_type;
+int32 subtype;
+int32 size;
+int32 count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char info[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type. Always set to 7.
+
+@item int32 subtype;
+Record subtype. Always set to 10.
+
+@item int32 size;
+The size of each element in the @code{info} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{info}.
+
+@item char info[];
+A text string. A product that identifies itself as @code{VOXCO
+INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
+more usual LF-only or CR LF line ends.
+@end table
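
A minimal sketch of reading this record in Python, under stated assumptions: the function name is hypothetical, the whole record (header included) is already in memory, and the integers are little-endian unless the caller says otherwise. It splits the text on CR, LF, or CR LF so that CR-only files such as the one mentioned above are handled.

```python
import struct


def parse_extra_product_info(record: bytes, endian: str = "<") -> list:
    """Return the lines of text in an extra product info record.

    `record` starts at the record header.  `endian` is a struct
    byte-order prefix ("<" or ">"); little-endian is an assumption.
    """
    rec_type, subtype, size, count = struct.unpack_from(endian + "4i", record, 0)
    if (rec_type, subtype, size) != (7, 10, 1):
        raise ValueError("not an extra product info record")
    info = record[16:16 + count]
    if len(info) != count:
        raise ValueError("record is truncated")
    # bytes.splitlines() accepts LF, CR, and CR LF line ends.
    return info.splitlines()
```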
+
@node Variable Display Parameter Record
@section Variable Display Parameter Record
Continuous Scale
@end table
-SPSS 14 sometimes writes a @code{measure} of 0 for string variables.
-PSPP interprets this as nominal scale.
+SPSS sometimes writes a @code{measure} of 0. PSPP interprets this as
+nominal scale.
@item int32 width;
The width of the display column for the variable in characters.
@end table
@end table
+@node Long String Missing Values Record
+@section Long String Missing Values Record
+
+This record, if present, specifies missing values for long string
+variables.
+
+@example
+/* @r{Header.} */
+int32 rec_type;
+int32 subtype;
+int32 size;
+int32 count;
+
+/* @r{Repeated up to exactly @code{count} bytes.} */
+int32 var_name_len;
+char var_name[];
+char n_missing_values;
+long_string_missing_value values[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type. Always set to 7.
+
+@item int32 subtype;
+Record subtype. Always set to 22.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The number of bytes following the header until the next header.
+
+@item int32 var_name_len;
+@itemx char var_name[];
+The number of bytes in the name of the long string variable that has
+missing values, plus the variable name itself, which consists of
+exactly @code{var_name_len} bytes. The variable name is not padded to
+any particular boundary, nor is it null-terminated.
+
+@item char n_missing_values;
+The number of missing values, either 1, 2, or 3. (This is, unusually,
+a single byte instead of a 32-bit number.)
+
+@item long_string_missing_value values[];
+The missing values themselves. This array contains exactly
+@code{n_missing_values} elements, each of which has the following
+substructure:
+
+@example
+int32 value_len;
+char value[];
+@end example
+
+@table @code
+@item int32 value_len;
+The length of the missing value string, in bytes. This value should
+be 8, because long string variables are at least 8 bytes wide (by
+definition), only the first 8 bytes of a long string variable's
+missing values are allowed to be non-spaces, and any spaces within the
+first 8 bytes are included in the missing value here.
+
+@item char value[];
+The missing value string, exactly @code{value_len} bytes, without
+any padding or null terminator.
+@end table
+@end table
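
The layout above can be walked with a short loop. The Python sketch below follows the structure exactly as described in this section and has not been checked against real files. The function name is hypothetical, `data` is assumed to hold the `count` bytes that follow the header, and little-endian integers are assumed by default.

```python
import struct


def parse_long_string_missing_values(data: bytes, endian: str = "<") -> dict:
    """Map each variable name (bytes) to its list of missing values (bytes)."""
    result = {}
    pos = 0
    while pos < len(data):
        # int32 var_name_len; char var_name[];
        (name_len,) = struct.unpack_from(endian + "i", data, pos)
        pos += 4
        name = data[pos:pos + name_len]
        pos += name_len
        # char n_missing_values; (a single byte, not an int32)
        n_missing = data[pos]
        pos += 1
        values = []
        for _ in range(n_missing):
            # int32 value_len; char value[];
            (value_len,) = struct.unpack_from(endian + "i", data, pos)
            pos += 4
            values.append(data[pos:pos + value_len])
            pos += value_len
        result[name] = values
    return result
```

Names and values are returned as raw bytes because decoding them depends on the file's character encoding, which this sketch does not try to determine.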
+
@node Data File and Variable Attributes Records
@section Data File and Variable Attributes Records
00000030 0a 29 |.) |
@end example
+@menu
+* Variable Roles::
+@end menu
+
+@node Variable Roles
+@subsection Variable Roles
+
+A variable's role is represented as an attribute named @code{$@@Role}.
+This attribute has a single element whose values and their meanings
+are:
+
+@table @code
+@item 0
+Input. This, the default, is the most common role.
+@item 1
+Output.
+@item 2
+Both.
+@item 3
+None.
+@item 4
+Partition.
+@item 5
+Split.
+@end table
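
For readers who want the role codes in program form, one possible Python representation is an enumeration. The class and function names are made up for this example, and the sketch assumes the attribute value arrives as decimal text, as attribute values generally do in the attributes record.

```python
from enum import IntEnum


class Role(IntEnum):
    """Variable roles, numbered as in the table above."""
    INPUT = 0
    OUTPUT = 1
    BOTH = 2
    NONE = 3
    PARTITION = 4
    SPLIT = 5


def parse_role(attr_value: str) -> Role:
    """Convert the text of the role attribute to a Role.

    Raises ValueError for text that is not one of the six known codes.
    """
    return Role(int(attr_value))
```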
+
@node Extended Number of Cases Record
@section Extended Number of Cases Record
@item int32 subtype;
Record subtype. May take any value. According to Aapi
H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
-indicates date info (probably related to USE).
+indicates date info (probably related to USE). Subtype 24 appears to
+contain XML that describes how data in the file should be displayed
+on-screen.
@item int32 size;
Size of each piece of data in the data part. Should have the value 1,
@node Data Record
@section Data Record
-Data records must follow all other records in the system file. There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed. Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case. The format of the data record varies depending on the
+value of @code{compression} in the file header record:
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
the variable declared in the respective variable record (@pxref{Variable
Record}). Numeric values are given in @code{flt64} format; string
values are literal characters string, padded on the right when
necessary to fill out 8-byte units.
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record are divided into a series of 1-byte command
codes. These codes have meanings as described below:
@table @asis
The system-missing value.
@end table
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253. After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values. The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64 zheader_ofs;
+int64 ztrailer_ofs;
+int64 ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer. This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
+compressed data blocks. Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice). Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter. The last ZLIB compressed data block
+ends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64 int_bias;
+int64 zero;
+int32 block_size;
+int32 n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression. Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_len - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset. Each block descriptor has the following
+format:
+
+@example
+int64 uncompressed_ofs;
+int64 compressed_ofs;
+int32 uncompressed_size;
+int32 compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1. This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous descriptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block. This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}. The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression. This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
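
The relationships among the ZLIB data header, the compressed blocks, and the trailer can be checked mechanically. The Python sketch below is one way to do so, not a reference implementation. The function name is hypothetical, the whole file is assumed to be in memory, and little-endian integers are assumed by default. The block count is derived from the trailer length, since the trailer is a 24-byte fixed header plus 24 bytes per block.

```python
import struct
import zlib


def read_zlib_data_record(buf: bytes, zheader_pos: int, endian: str = "<") -> bytes:
    """Return the bytecode-compressed stream held in a ZLIB data record.

    `buf` is the entire system file; `zheader_pos` is where the ZLIB
    data header starts.  Raises ValueError if an invariant fails.
    """
    zheader_ofs, ztrailer_ofs, ztrailer_len = struct.unpack_from(
        endian + "3q", buf, zheader_pos)
    if zheader_ofs != zheader_pos:
        raise ValueError("zheader_ofs does not match header position")
    if ztrailer_ofs + ztrailer_len != len(buf):
        raise ValueError("trailer does not end at end of file")

    # Fixed trailer header: int_bias, zero, block_size, n_blocks.
    int_bias, zero, block_size, n_blocks = struct.unpack_from(
        endian + "2q2i", buf, ztrailer_ofs)
    if n_blocks != (ztrailer_len - 24) // 24:
        raise ValueError("n_blocks inconsistent with ztrailer_len")

    blocks = []
    expect_unc = zheader_ofs        # first uncompressed_ofs
    expect_cmp = zheader_ofs + 24   # first compressed_ofs
    for i in range(n_blocks):
        unc_ofs, cmp_ofs, unc_size, cmp_size = struct.unpack_from(
            endian + "2q2i", buf, ztrailer_ofs + 24 + 24 * i)
        if (unc_ofs, cmp_ofs) != (expect_unc, expect_cmp):
            raise ValueError("block descriptor %d has unexpected offsets" % i)
        block = zlib.decompress(buf[cmp_ofs:cmp_ofs + cmp_size])
        if len(block) != unc_size:
            raise ValueError("block %d has unexpected size" % i)
        blocks.append(block)
        expect_unc += unc_size
        expect_cmp += cmp_size
    if expect_cmp != ztrailer_ofs:
        raise ValueError("blocks do not end at ztrailer_ofs")
    return b"".join(blocks)
```

The returned bytes still need bytecode decompression, as for compression format 1. A more forgiving reader might warn rather than fail when an invariant does not hold.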
+
@setfilename ignored