+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+
@node System File Format
@appendix System File Format
the format of a system file.
System files use four data types: 8-bit characters, 32-bit integers,
-64-bit integers,
+64-bit integers,
and 64-bit floating points, called here @code{char}, @code{int32},
@code{int64}, and
@code{flt64}, respectively. Data is not necessarily aligned on a word
* Other Informational Records::
* Dictionary Termination Record::
* Data Record::
-* Encrypted System Files::
@end menu
@node System File Record Structure
Extension (type 7) records, in ascending numerical order of their
subtypes.
+System files written by SPSS include at most one of each kind of
+extension record. This is generally true of system files written by
+other software as well, with known exceptions noted below in the
+individual sections about each type of record.
+
@item
Dictionary termination record.
would be longer than 60 characters; otherwise it is padded on the right
with spaces.
+The product name field allows readers to behave differently based on
+quirks in the way that particular software writes system files.
+@xref{Value Labels Records}, for details of the quirk that the PSPP
+system file reader tolerates in files written by ReadStat, which has
+@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.
+
@anchor{layout_code}
@item int32 layout_code;
Normally set to 2, although a few system files have been spotted in
Number of data elements per case. This is the number of variables,
except that long string variables add extra data elements (one for every
8 characters after the first 8). However, string variables do not
-contribute to this value beyond the first 255 bytes. Further, system
-files written by some systems set this value to -1. In general, it is
+contribute to this value beyond the first 255 bytes. Further, some
+software always writes -1 or 0 in this field. In general, it is
unsafe for systems reading system files to rely upon this value.
@item int32 compression;
same way as other variable records.
@anchor{Dictionary Index}
-The @dfn{dictionary index} of a variable is its offset in the set of
+The @dfn{dictionary index} of a variable is a 1-based offset in the set of
variable records, including dummy variable records for long string
-variables. The first variable record has a dictionary index of 0, the
-second has a dictionary index of 1, and so on.
+variables. The first variable record has a dictionary index of 1, the
+second has a dictionary index of 2, and so on.
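
As a rough illustration of the 1-based numbering described above, the
following Python sketch assigns dictionary indexes to a list of
variables.  The function name @code{dictionary_indexes} and the
@code{(name, width)} input convention (width 0 for numeric) are
assumptions made for this example only.  It also assumes that a string
variable of at most 255 bytes occupies one variable record per 8 bytes
of width, rounded up.

```python
def dictionary_indexes(variables):
    """Map each variable name to its 1-based dictionary index.

    `variables` is a list of (name, width) pairs; width 0 means numeric
    (a convention assumed for this sketch, not part of the format).
    """
    indexes = {}
    next_index = 1  # The first variable record has dictionary index 1.
    for name, width in variables:
        if width > 255:
            raise ValueError("very long strings must be split into segments first")
        indexes[name] = next_index
        # One record for the variable itself, plus one dummy record for
        # every 8 bytes of width after the first 8.
        n_records = 1 if width <= 8 else -(-width // 8)
        next_index += n_records
    return indexes
```

With this sketch, a numeric variable followed by a string variable of
width 20 and then another numeric variable would receive dictionary
indexes 1, 2, and 5, because the string variable takes up three
variable records.
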
The system file format does not directly support string variables
wider than 255 bytes. Such very long string variables are represented
@tab @code{EDATE}
@item 39
@tab @code{SDATE}
+@item 40
+@tab @code{MTIME}
+@item 41
+@tab @code{YMDHMS}
@end multitable
@end quotation
have value labels, but their value labels are recorded using a
different record type (@pxref{Long String Value Labels Record}).
+ReadStat (@pxref{File Header Record}) writes value labels that label a
+single value more than once.  In more detail, it emits value labels
+whose values are longer than the string variable's width and that
+become identical when truncated to that width, e.g.@: labels for
+values @code{ABC123} and @code{ABC456} for a string variable with
+width 3. For files written by this software, PSPP ignores such
+labels.
+
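The ReadStat quirk can be sketched in Python as follows.  This is only
one possible reading of ``ignores such labels'': the helper name
@code{split_labels} and the rule it applies (a label is ignored when
its 8-byte value has non-space bytes beyond the variable's width) are
assumptions for illustration, and PSPP's actual logic may differ.

```python
def split_labels(labels, width):
    """Separate value labels that fit a string variable from those that do not.

    `labels` is a list of (value, label) pairs, where each value is the
    8-byte `value` field of a value label record.  `width` is the width
    of the string variable in bytes.
    """
    kept, ignored = [], []
    for value, label in labels:
        # Bytes beyond the variable's width are expected to be padding spaces.
        if value[width:].strip(b" ") == b"":
            kept.append((value[:width], label))
        else:
            ignored.append((value, label))
    return kept, ignored
```

For a variable of width 3, the values @code{ABC123} and @code{ABC456}
both land in the ignored list under this rule, while a value such as
@code{XYZ} padded with spaces is kept.
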
The value label record has the following format:
@example
int32 rec_type;
int32 label_count;
-/* @r{Repeated @code{label_cnt} times}. */
+/* @r{Repeated @code{label_count} times}. */
char value[8];
char label_len;
char label[];
label record are to be applied.
@item int32 vars[];
-A list of dictionary indexes of variables to which to apply the value
+A list of 1-based dictionary indexes of variables to which to apply the value
labels (@pxref{Dictionary Index}). There are @code{var_count}
elements.
Record type. Always set to 6.
@item int32 n_lines;
-Number of lines of documents present.
+Number of lines of documents present. This should be greater than
+zero, but ReadStat writes system files with zero @code{n_lines}.
@item char lines[][80];
Document lines. The number of elements is defined by @code{n_lines}.
Number of pieces of data in the data part. Always set to 3.
@item flt64 sysmis;
-The system missing value.
-
-@item flt64 highest;
-The value used for HIGHEST in missing values.
-
-@item flt64 lowest;
-The value used for LOWEST in missing values.
+@itemx flt64 highest;
+@itemx flt64 lowest;
+The system missing value, the value used for HIGHEST in missing
+values, and the value used for LOWEST in missing values, respectively.
+@xref{System File Format}, for more information.
+
+The SPSSWriter library in PHP, which identifies itself as @code{FOM
+SPSS 1.0.0} in the file header record @code{prod_name} field, writes
+unexpected values to these fields, but it uses the same values
+consistently throughout the rest of the file.
@end table
@node Multiple Response Sets Records
each separated from the previous by a single space.
Even though a multiple response set must have at least two variables,
-some system files contain multiple response sets with no variables at
-all. The source and meaning of these multiple response sets is
+some system files contain multiple response sets with no variables or
+one variable. The source and meaning of these multiple response sets is
unknown. (Perhaps they arise from creating a multiple response set
then deleting all the variables that it contains?)
@item char attributes[];
The attributes, in a text-based format.
-In record type 17, this field contains a single attribute set. An
+In record subtype 17, this field contains a single attribute set. An
attribute set is a sequence of one or more attributes concatenated
together. Each attribute consists of a name, which has the same
syntax as a variable name, followed by, inside parentheses, a sequence
an attribute with a single value and an attribute array with one
element.
-In record type 18, this field contains a sequence of one or more
+In record subtype 18, this field contains a sequence of one or more
variable attribute sets. If more than one variable attribute set is
present, each one after the first is delimited from the previous by
@code{/}. Each variable attribute set consists of a long
variable name,
followed by @code{:}, followed by an attribute set with the same
-syntax as on record type 17.
+syntax as on record subtype 17.
+
+System files written by @code{Stata 14.1/-savespss- 1.77 by
+S.Radyakin} may include multiple records with subtype 18, one per
+variable that has variable attributes.
The total length is @code{count} bytes.
@end table
@item 6
Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
+@item 12
+A UUID in the format described in RFC 4122.  Only two examples have
+been observed, both written by SPSS 13, and in each case the UUID
+contained both uppercase and lowercase letters.
+
@item 24
XML that describes how data in the file should be displayed on-screen.
@end table
variable @code{bias} from the file header. For example,
code 105 with bias 100.0 (the normal value) indicates a numeric variable
of value 5.
-One file has been seen written by SPSS 14 that contained such a code
-in a @emph{string} field with the value 0 (after the bias is
-subtracted) as a way of encoding null bytes.
+
+A code of 0 (after subtracting the bias) in a string field encodes
+null bytes. This is unusual, since a string field normally encodes
+text data, but it exists in real system files.
@item 252
End of file. This code may or may not appear at the end of the data
field sum to the size of the system file in bytes.
@end table
-The data header is followed by @code{(ztrailer_ofs - 24) / 24} ZLIB
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
compressed data blocks. Each ZLIB compressed data block begins with a
ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
01} (the only header yet observed in practice). Each block
@item int32 n_blocks;
The number of ZLIB compressed data blocks, always exactly
-@code{(ztrailer_ofs - 24) / 24}.
+@code{(ztrailer_len - 24) / 24}.
@end table
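
The relationship between @code{ztrailer_len} and @code{n_blocks} can be
checked with a small Python sketch.  The function name
@code{expected_n_blocks} is hypothetical, and the sketch assumes that
the 24 bytes subtracted in the formula correspond to the fixed header
and that each block descriptor is 24 bytes long.

```python
def expected_n_blocks(ztrailer_len):
    """Return the number of 24-byte block descriptors in a ZLIB trailer.

    Raises ValueError if the trailer length cannot hold a whole number
    of descriptors after the 24-byte fixed header.
    """
    payload = ztrailer_len - 24  # Bytes left after the fixed header.
    if payload < 0 or payload % 24 != 0:
        raise ValueError("ztrailer_len is inconsistent with 24-byte descriptors")
    return payload // 24
```

A reader could compare this result against the @code{n_blocks} field
and treat a mismatch as a sign of a corrupted file.
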
The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
@end table
@setfilename ignored
-
-@node Encrypted System Files
-@section Encrypted System Files
-
-SPSS 21 and later support an encrypted system file format.
-
-@quotation Warning
-The SPSS encrypted file format is poorly designed. It is much cheaper
-and faster to decrypt a file encrypted this way than if a well
-designed alternative were used. If you must use this format, use a
-10-byte randomly generated password.
-@end quotation
-
-@subheading Encrypted File Format
-
-Encrypted system files begin with the following 36-byte fixed header:
-
-@example
-0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE|
-0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............|
-0020 00 00 00 00 |....|
-@end example
-
-Following the fixed header is a complete system file in the usual
-format, except that each 16-byte block is encrypted with AES-256 in
-ECB mode. The AES-256 key is derived from a password in the following
-way:
-
-@enumerate
-@item
-Start from the literal password typed by the user. Truncate it to at
-most 10 bytes, then append (between 1 and 22) null bytes until there
-are exactly 32 bytes. Call this @var{password}.
-
-@item
-Let @var{constant} be the following 73-byte constant:
-
-@example
-0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11
-0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80
-0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85
-0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1
-0040 e1 83 07 d8 0d 00 00 01 00
-@end example
-
-@item
-Compute CMAC-AES-256(@var{password}, @var{constant}). Call the
-16-byte result @var{cmac}.
-
-@item
-The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
-@var{cmac} repeated twice.
-@end enumerate
-
-@subsubheading Example
-
-Consider the password @samp{pspp}. @var{password} is:
-
-@example
-0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............|
-0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-@end example
-
-@noindent
-@var{cmac} is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@noindent
-The AES-256 key is:
-
-@example
-0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45
-@end example
-
-@subheading Password Encoding
-
-SPSS also supports what it calls ``encrypted passwords.'' These are
-not encrypted. They are encoded with a simple, fixed scheme. An
-encoded password is always a multiple of 2 characters long, and never
-longer than 20 characters. The characters in an encoded password are
-always in the graphic ASCII range 33 through 126. Each successive
-pair of characters in the password encodes a single byte in the
-plaintext password.
-
-Use the following algorithm to decode a pair of characters:
-
-@enumerate
-@item
-Let @var{a} be the ASCII code of the first character, and @var{b} be
-the ASCII code of the second character.
-
-@item
-Let @var{ah} be the most significant 4 bits of @var{a}. Find the line
-in the table below that has @var{ah} on the left side. The right side
-of the line is a set of possible values for the most significant 4
-bits of the decoded byte.
-
-@display
-@t{2 } @result{} @t{2367}
-@t{3 } @result{} @t{0145}
-@t{47} @result{} @t{89cd}
-@t{56} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bh} be the most significant 4 bits of @var{b}. Find the line
-in the second table below that has @var{bh} on the left side. The
-right side of the line is a set of possible values for the most
-significant 4 bits of the decoded byte. Together with the results of
-the previous step, only a single possibility is left.
-
-@display
-@t{2 } @result{} @t{139b}
-@t{3 } @result{} @t{028a}
-@t{47} @result{} @t{46ce}
-@t{56} @result{} @t{57df}
-@end display
-
-@item
-Let @var{al} be the least significant 4 bits of @var{a}. Find the
-line in the table below that has @var{al} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte.
-
-@display
-@t{03cf} @result{} @t{0145}
-@t{12de} @result{} @t{2367}
-@t{478b} @result{} @t{89cd}
-@t{569a} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bl} be the least significant 4 bits of @var{b}. Find the
-line in the table below that has @var{bl} on the left side. The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte. Together with the results of the previous
-step, only a single possibility is left.
-
-@display
-@t{03cf} @result{} @t{028a}
-@t{12de} @result{} @t{139b}
-@t{478b} @result{} @t{46ce}
-@t{569a} @result{} @t{57df}
-@end display
-@end enumerate
-
-@subsubheading Example
-
-Consider the encoded character pair @samp{-|}. @var{a} is
-0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
-0xd, and @var{bl} is 0xc. @var{ah} means that the most significant
-four bits of the decoded character is 2, 3, 6, or 7, and @var{bh}
-means that they are 4, 6, 0xc, or 0xe. The single possibility in
-common is 6, so the most significant four bits are 6. Similarly,
-@var{al} means that the least significant four bits are 2, 3, 6, or 7,
-and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
-four bits are 2. The decoded character is therefore 0x62, the letter
-@samp{b}.