floating-point format in use, as well as the endianness of IEEE 754
floating-point numbers, and translates as needed. However, only IEEE
754 numbers with the same endianness as integer data in the same file
-has actually been observed in system files, and it is likely that
+have actually been observed in system files, and it is likely that
other formats are obsolete or were never used.
System files use a few floating point values for special purposes:
possible to artificially synthesize files that use different encodings
(@pxref{Character Encoding Record}).
-System files are divided into records, each of which begins with a
-4-byte record type, usually regarded as an @code{int32}.
+@menu
+* System File Record Structure::
+* File Header Record::
+* Variable Record::
+* Value Labels Records::
+* Document Record::
+* Machine Integer Info Record::
+* Machine Floating-Point Info Record::
+* Multiple Response Sets Records::
+* Extra Product Info Record::
+* Variable Display Parameter Record::
+* Long Variable Names Record::
+* Very Long String Record::
+* Character Encoding Record::
+* Long String Value Labels Record::
+* Long String Missing Values Record::
+* Data File and Variable Attributes Records::
+* Extended Number of Cases Record::
+* Other Informational Records::
+* Dictionary Termination Record::
+* Data Record::
+@end menu
+
+@node System File Record Structure
+@section System File Record Structure
+
+System files are divided into records with the following format:
+
+@example
+int32 type;
+char data[];
+@end example
+
+This header gives neither the length of @code{data} nor any
+information about what it contains, so the system file reader must
+infer the format of @code{data} from @code{type}. However,
+records with type 7, called @dfn{extension records}, have a stricter
+format:
+
+@example
+int32 rec_type;
+int32 subtype;
+int32 size;
+int32 count;
+char data[size * count];
+@end example
-The records must appear in the following order:
+@table @code
+@item int32 rec_type;
+Record type. Always set to 7.
+
+@item int32 subtype;
+Record subtype. This value identifies a particular kind of extension
+record.
+
+@item int32 size;
+The size of each piece of data that follows the header, in bytes.
+Known extension records use 1, 4, or 8, for @code{char}, @code{int32},
+and @code{flt64} format data, respectively.
+
+@item int32 count;
+The number of pieces of data that follow the header.
+
+@item char data[size * count];
+Data, whose format and interpretation depend on the subtype.
+@end table
+
+An extension record contains exactly @code{size * count} bytes of
+data, which allows a reader that does not understand an extension
+record to skip it. Extension records provide only nonessential
+information, so files written by newer software remain readable by
+older or less capable readers.
+
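The skipping behavior described above can be illustrated with a minimal Python sketch. It is not taken from PSPP: the function names, the `known_subtypes` parameter, and the little-endian default for `byte_order` are assumptions made for illustration, and a real reader would first detect the file's byte order.

```python
import io
import struct

EXTENSION_RECORD = 7  # record type shared by all extension records


def read_extension_body(stream, byte_order="<"):
    """Read subtype, size, count, and data of one extension record.

    Assumes the 4-byte record type has already been read and equals 7.
    """
    header = stream.read(12)
    if len(header) != 12:
        raise EOFError("truncated extension record header")
    subtype, size, count = struct.unpack(byte_order + "3i", header)
    if size < 0 or count < 0:
        raise ValueError("negative size or count")
    data = stream.read(size * count)
    if len(data) != size * count:
        raise EOFError("truncated extension record data")
    return subtype, size, count, data


def known_extension_records(stream, known_subtypes, byte_order="<"):
    """Yield (subtype, data) for known subtypes, skipping all others."""
    while True:
        raw = stream.read(4)
        if len(raw) < 4:
            return
        (rec_type,) = struct.unpack(byte_order + "i", raw)
        if rec_type != EXTENSION_RECORD:
            stream.seek(-4, io.SEEK_CUR)  # leave other records to the caller
            return
        subtype, _size, _count, data = read_extension_body(stream, byte_order)
        if subtype in known_subtypes:
            yield subtype, data
        # Otherwise the data has already been consumed, so the record
        # is skipped without the reader understanding its contents.
```

Because `size * count` gives the exact length of the data, the loop never needs to interpret a subtype it does not recognize in order to find the start of the next record.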
+Records in a system file must appear in the following order:
@itemize @bullet
@item
Data record.
@end itemize
-Each type of record is described separately below.
+We advise authors of programs that read system files to tolerate
+format variations. Various kinds of misformatting and corruption have
+been observed in system files written by SPSS and other software
+alike. In particular, because extension records provide nonessential
+information, it is generally better to ignore an extension record
+entirely than to refuse to read a system file.
-@menu
-* File Header Record::
-* Variable Record::
-* Value Labels Records::
-* Document Record::
-* Machine Integer Info Record::
-* Machine Floating-Point Info Record::
-* Multiple Response Sets Records::
-* Extra Product Info Record::
-* Variable Display Parameter Record::
-* Long Variable Names Record::
-* Very Long String Record::
-* Character Encoding Record::
-* Long String Value Labels Record::
-* Long String Missing Values Record::
-* Data File and Variable Attributes Records::
-* Extended Number of Cases Record::
-* Miscellaneous Informational Records::
-* Dictionary Termination Record::
-* Data Record::
-* Encrypted System Files::
-@end menu
+The following sections describe the known kinds of records.
@node File Header Record
@section File Header Record
-The file header is always the first record in the file. It has the
-following format:
+A system file begins with the file header, with the following format:
@example
char rec_type[4];
by a number of narrower string variables. @xref{Very Long String
Record}, for details.
+A system file should contain at least one variable and thus at least
+one variable record, but system files have been observed in the wild
+without any variables (thus, no data either).
+
@example
int32 rec_type;
int32 type;
Number of pieces of data in the data part. Always set to 3.
@item flt64 sysmis;
-The system missing value.
-
-@item flt64 highest;
-The value used for HIGHEST in missing values.
-
-@item flt64 lowest;
-The value used for LOWEST in missing values.
+@itemx flt64 highest;
+@itemx flt64 lowest;
+The system missing value, the value used for HIGHEST in missing
+values, and the value used for LOWEST in missing values, respectively.
+@xref{System File Format}, for more information.
+
+The SPSSWriter library in PHP, which identifies itself as @code{FOM
+SPSS 1.0.0} in the file header record @code{prod_name} field, writes
+unexpected values to these fields, but it uses the same values
+consistently throughout the rest of the file.
@end table
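As a rough illustration of the three-value data part, the following Python sketch unpacks `sysmis`, `highest`, and `lowest` from the 24 bytes that follow the record header. The function name and the `byte_order` parameter are hypothetical. The sample values in the usage lines are an assumption about what a typical writer emits; they are not stated in this section. Taking the values from the record itself, rather than hard-coding constants, is one way to cope with writers that use unexpected values consistently.

```python
import struct
import sys


def parse_float_info(data, byte_order="<"):
    """Return (sysmis, highest, lowest) from the record's data part.

    Expects exactly three flt64 values (size 8, count 3).
    """
    if len(data) != 24:
        raise ValueError("expected 3 flt64 values (24 bytes)")
    sysmis, highest, lowest = struct.unpack(byte_order + "3d", data)
    return sysmis, highest, lowest


# Usage with assumed sample values: the most negative finite double, the
# largest finite double, and the second most negative finite double.
(second_lowest,) = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))
sample = struct.pack("<3d", -sys.float_info.max, sys.float_info.max, second_lowest)
sysmis, highest, lowest = parse_float_info(sample)
```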
@node Multiple Response Sets Records
each separated from the previous by a single space.
Even though a multiple response set must have at least two variables,
-some system files contain multiple response sets with no variables at
-all. The source and meaning of these multiple response sets is
+some system files contain multiple response sets with no variables or
+one variable. The source and meaning of these multiple response sets are
unknown. (Perhaps they arise from creating a multiple response set
then deleting all the variables that it contains?)
not been observed in the wild.
@end table
-@node Miscellaneous Informational Records
-@section Miscellaneous Informational Records
+@node Other Informational Records
+@section Other Informational Records
-Some specific types of miscellaneous informational records are
+Many specific types of extension records are
documented here, but others are known to exist. PSPP ignores unknown
-miscellaneous informational records when reading system files.
-
-@example
-/* @r{Header.} */
-int32 rec_type;
-int32 subtype;
-int32 size;
-int32 count;
-
-/* @r{Exactly @code{size * count} bytes of data.} */
-char data[];
-@end example
+extension records when reading system files.
-@table @code
-@item int32 rec_type;
-Record type. Always set to 7.
+The following extension record subtypes have also been observed, with
+the following believed meanings:
-@item int32 subtype;
-Record subtype. May take any value. According to Aapi
-H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
-indicates date info (probably related to USE). Subtype 24 appears to
-contain XML that describes how data in the file should be displayed
-on-screen.
+@table @asis
+@item 5
+A set of grouped variables (according to Aapi H@"am@"al@"ainen).
-@item int32 size;
-Size of each piece of data in the data part. Should have the value 1,
-4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data,
-respectively.
+@item 6
+Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
-@item int32 count;
-Number of pieces of data in the data part.
+@item 12
+A UUID in the format described in RFC 4122. Only two examples have
+been observed, both written by SPSS 13, and in each case the UUID
+contained both upper and lower case letters.
-@item char data[];
-Arbitrary data. There must be @code{size} times @code{count} bytes of
-data.
+@item 24
+XML that describes how data in the file should be displayed on-screen.
@end table
@node Dictionary Termination Record
@end table
@setfilename ignored
-
-@node Encrypted System Files
-@section Encrypted System Files
-
-SPSS 21 and later support an encrypted system file format.
-
-@quotation Warning
-The SPSS encrypted file format is poorly designed. It is much cheaper
-and faster to decrypt a file encrypted this way than if a well
-designed alternative were used. If you must use this format, use a
-10-byte randomly generated password.
-@end quotation
-
-@subheading Encrypted File Format
-
-Encrypted system files begin with the following 36-byte fixed header:
-
-@example
-0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE|
-0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............|
-0020 00 00 00 00 |....|
-@end example
-
-Following the fixed header is a complete system file in the usual
-format, except that each 16-byte block is encrypted with AES-256 in
-ECB mode. The AES-256 key is derived from a password in the following
-way:
-
-@enumerate
-@item
-Start from the literal password typed by the user. Truncate it to at
-most 10 bytes, then append (between 1 and 22) null bytes until there
-are exactly 32 bytes. Call this @var{password}.
-
-@item
-Let @var{constant} be the following 73-byte constant:
-
-@example
-0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11
-0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80
-0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85
-0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1
-0040 e1 83 07 d8 0d 00 00 01 00
-@end example
-
-@item
-Compute CMAC-AES-256(@var{password}, @var{constant}). Call the
-16-byte result @var{cmac}.
-
-@item
-The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
-@var{cmac} repeated twice.
-@end enumerate
-
-@subsubheading Example
-
-Consider the password @samp{pspp}. @var{password} is:
-
-@example
-0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............|
-0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-@end example
-
-@noindent
-@var{cmac} is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@noindent
-The AES-256 key is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@subheading Password Encoding
-
-SPSS also supports what it calls ``encrypted passwords.'' These are
-not encrypted. They are encoded with a simple, fixed scheme. An
-encoded password is always a multiple of 2 characters long, and never
-longer than 20 characters. The characters in an encoded password are
-always in the graphic ASCII range 33 through 126. Each successive
-pair of characters in the password encodes a single byte in the
-plaintext password.
-
-Use the following algorithm to decode a pair of characters:
-
-@enumerate
-@item
-Let @var{a} be the ASCII code of the first character, and @var{b} be
-the ASCII code of the second character.
-
-@item
-Let @var{ah} be the most significant 4 bits of @var{a}. Find the line
-in the table below that has @var{ah} on the left side. The right side
-of the line is a set of possible values for the most significant 4
-bits of the decoded byte.
-
-@display
-@t{2 } @result{} @t{2367}
-@t{3 } @result{} @t{0145}
-@t{47} @result{} @t{89cd}
-@t{56} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bh} be the most significant 4 bits of @var{b}. Find the line
-in the second table below that has @var{bh} on the left side. The
-right side of the line is a set of possible values for the most
-significant 4 bits of the decoded byte. Together with the results of
-the previous step, only a single possibility is left.
-
-@display
-@t{2 } @result{} @t{139b}
-@t{3 } @result{} @t{028a}
-@t{47} @result{} @t{46ce}
-@t{56} @result{} @t{57df}
-@end display
-
-@item
-Let @var{al} be the least significant 4 bits of @var{a}. Find the
-line in the table below that has @var{al} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte.
-
-@display
-@t{03cf} @result{} @t{0145}
-@t{12de} @result{} @t{2367}
-@t{478b} @result{} @t{89cd}
-@t{569a} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bl} be the least significant 4 bits of @var{b}. Find the
-line in the table below that has @var{bl} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte. Together with the results of the previous
-step, only a single possibility is left.
-
-@display
-@t{03cf} @result{} @t{028a}
-@t{12de} @result{} @t{139b}
-@t{478b} @result{} @t{46ce}
-@t{569a} @result{} @t{57df}
-@end display
-@end enumerate
-
-@subsubheading Example
-
-Consider the encoded character pair @samp{-|}. @var{a} is
-0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
-0xd, and @var{bl} is 0xc. @var{ah} means that the most significant
-four bits of the decoded character is 2, 3, 6, or 7, and @var{bh}
-means that they are 4, 6, 0xc, or 0xe. The single possibility in
-common is 6, so the most significant four bits are 6. Similarly,
-@var{al} means that the least significant four bits are 2, 3, 6, or 7,
-and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
-four bits are 2. The decoded character is therefore 0x62, the letter
-@samp{b}.