Add support for reading SPSS/PC+ system files.

[pspp] / doc / dev / system-file-format.texi
diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi

index d1d00873f3b3666aa97f8f1e10553a5815d268bb..d100aa9d24d59fcd6265e73e70310d6b556403f8 100644 (file)
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -33,12 +33,40 @@ floating-point numbers, and translates as needed.  However, only IEEE
  has actually been observed in system files, and it is likely that
  other formats are obsolete or were never used.
  
-The PSPP system-missing value is represented by the largest possible
-negative number in the floating point format (@code{-DBL_MAX}).  Two
-other values are important for use as missing values: @code{HIGHEST},
-represented by the largest possible positive number (@code{DBL_MAX}),
-and @code{LOWEST}, represented by the second-largest negative number
-(in IEEE 754 format, @code{0xffeffffffffffffe}).
+System files use a few floating point values for special purposes:
+
+@table @asis
+@item SYSMIS
+The system-missing value is represented by the largest possible
+negative number in the floating point format (@code{-DBL_MAX}).
+
+@item HIGHEST
+HIGHEST is used as the high end of a missing value range with an
+unbounded maximum.  It is represented by the largest possible positive
+number (@code{DBL_MAX}).
+
+@item LOWEST
+LOWEST is used as the low end of a missing value range with an
+unbounded minimum.  It was originally represented by the
+second-largest negative number (in IEEE 754 format,
+@code{0xffeffffffffffffe}).  System files written by SPSS 21 and later
+instead use the largest negative number (@code{-DBL_MAX}), the same
+value as SYSMIS.  This does not lead to ambiguity because LOWEST
+appears in system files only in missing value ranges, which never
+contain SYSMIS.
+@end table
+
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings.  The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}).  The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
  
  System files are divided into records, each of which begins with a
  4-byte record type, usually regarded as an @code{int32}.
@@ -60,7 +88,8 @@ if present.
  Document record, if present.
  
  @item
-Any records not explicitly included in this list, in any order.
+Extension (type 7) records, in ascending numerical order of their
+subtypes.
  
  @item
  Dictionary termination record.
@@ -78,16 +107,20 @@ Each type of record is described separately below.
  * Document Record::
  * Machine Integer Info Record::
  * Machine Floating-Point Info Record::
+* Multiple Response Sets Records::
+* Extra Product Info Record::
  * Variable Display Parameter Record::
  * Long Variable Names Record::
  * Very Long String Record::
  * Character Encoding Record::
  * Long String Value Labels Record::
+* Long String Missing Values Record::
  * Data File and Variable Attributes Records::
  * Extended Number of Cases Record::
  * Miscellaneous Informational Records::
  * Dictionary Termination Record::
  * Data Record::
+* Encrypted System Files::
  @end menu
  
  @node File Header Record
@@ -101,7 +134,7 @@ char                rec_type[4];
  char                prod_name[60];
  int32               layout_code;
  int32               nominal_case_size;
-int32               compressed;
+int32               compression;
  int32               weight_index;
  int32               ncases;
  flt64               bias;
@@ -113,7 +146,15 @@ char                padding[3];
  
  @table @code
  @item char rec_type[4];
-Record type code, set to @samp{$FL2}.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the character encoding as
+other strings.  Thus, in a file with an ASCII-based character encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}.  (No EBCDIC-based ZLIB-compressed files have been observed.)
  
  @item char prod_name[60];
  Product identification string.  This always begins with the characters
@@ -137,8 +178,11 @@ contribute to this value beyond the first 255 bytes.   Further, system
  files written by some systems set this value to -1.  In general, it is
  unsafe for systems reading system files to rely upon this value.
  
-@item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+@item int32 compression;
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed.  This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
  
  @item int32 weight_index;
  If one of the variables in the data set is used as a weighting
@@ -183,6 +227,10 @@ field is arbitrarily set to @samp{00:00:00}.
  File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
  PSPP Users Guide}).  Padded on the right with spaces.
  
+A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
+CR-only line ends in this field, rather than the more usual LF-only or
+CR LF line ends.
+
  @item char padding[3];
  Ignored padding bytes to make the structure a multiple of 32 bits in
  length.  Set to zeros.
@@ -252,6 +300,10 @@ respectively.  If the variable has a range for missing variables, set to
  -2; if the variable has a range for missing variables plus a single
  discrete value, set to -3.
  
+A long string variable always has the value 0 here.  A separate record
+indicates missing values for long string variables (@pxref{Long String
+Missing Values Record}).
+
  @item int32 print;
  Print format for this variable.  See below.
  
@@ -264,10 +316,19 @@ the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
  (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
  stops (@samp{.}).  The variable name is padded on the right with spaces.
  
+The @samp{name} fields should be unique within a system file.  System
+files written by SPSS that contain very long string variables with
+similar names sometimes contain duplicate names that are later
+eliminated by resolving the very long string names (@pxref{Very Long
+String Record}).  PSPP handles duplicates by assigning them new,
+unique names.
+
  @item int32 label_len;
  This field is present only if @code{has_var_label} is set to 1.  It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label.  The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels.  PSPP
+accepts labels of any length.
  
  @item char label[];
  This field is present only if @code{has_var_label} is set to 1.  It has
@@ -291,6 +352,7 @@ in the range.  When a range plus a value are present, the third
  element denotes the additional discrete missing value.
  @end table
  
+@anchor{System File Output Formats}
  The @code{print} and @code{write} members of sysfile_variable are output
  formats coded into @code{int32} types.  The least-significant byte
  of the @code{int32} represents the number of decimal places, and the
@@ -387,6 +449,11 @@ Format types are defined as follows:
  @end multitable
  @end quotation
  
+A few system files have been observed in the wild with invalid
+@code{write} fields, in particular with value 0.  Readers should
+probably treat invalid @code{print} or @code{write} fields as some
+default format.
+
  @node Value Labels Records
  @section Value Labels Records
  
@@ -425,7 +492,9 @@ in length.  Its type and width cannot be determined until the
  following value label variables record (see below) is read.
  
  @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes.  The documented maximum length varies
+from 60 to 120 based on SPSS version.  PSPP supports value labels up
+to 255 bytes long.
  
  @item char label[];
  @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -538,20 +607,53 @@ Floating point representation code.  For IEEE 754 systems this is 1.
  IBM 370 sets this to 2, and DEC VAX E to 3.
  
  @item int32 compression_code;
-Compression code.  Always set to 1.
+Compression code.  Always set to 1, regardless of whether or how the
+file is compressed.
  
  @item int32 endianness;
  Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
  
  @item int32 character_code;
-@anchor{character-code}
-Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code.  The following values have
+been actually observed in system files:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS for Unix and Windows always wrote value 2 in this
+field, regardless of the encoding in use.  Newer versions also write
+the character encoding as a string (see @ref{Character Encoding
+Record}).
  @end table
  
  @node Machine Floating-Point Info Record
@@ -595,6 +697,175 @@ The value used for HIGHEST in missing values.
  The value used for LOWEST in missing values.
  @end table
  
+@node Multiple Response Sets Records
+@section Multiple Response Sets Records
+
+The system file format has two different types of records that
+represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
+Guide}).  The first type of record describes multiple response sets
+that can be understood by SPSS before version 14.  The second type of
+record, with a closely related format, is used for multiple dichotomy
+sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
+version 14.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                mrsets[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Set to 7 for records that describe multiple response
+sets understood by SPSS before version 14, or to 19 for records that
+describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
+feature added in version 14.
+
+@item int32 size;
+The size of each element in the @code{mrsets} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{mrsets}.
+
+@item char mrsets[];
+Zero or more line feeds (byte 0x0a), followed by a series of multiple
+response sets, each of which consists of the following:
+
+@itemize @bullet
+@item
+The set's name (an identifier that begins with @samp{$}), in mixed
+upper and lower case.
+
+@item
+An equals sign (@samp{=}).
+
+@item
+@samp{C} for a multiple category set, @samp{D} for a multiple
+dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
+multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
+
+@item
+For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
+space, followed by a number expressed as decimal digits, followed by a
+space.  If LABELSOURCE=VARLABEL was specified on MRSETS, then the
+number is 11; otherwise it is 1.@footnote{This part of the format may
+not be fully understood, because only a single example of each
+possibility has been examined.}
+
+@item
+For either kind of multiple dichotomy set, the counted value, as a
+positive integer count specified as decimal digits, followed by a
+space, followed by as many string bytes as specified in the count.  If
+the set contains numeric variables, the string consists of the counted
+integer value expressed as decimal digits.  If the set contains string
+variables, the string contains the counted string value.  Either way,
+the string may be padded on the right with spaces (older versions of
+SPSS seem to always pad to a width of 8 bytes; newer versions don't).
+
+@item
+A space.
+
+@item
+The multiple response set's label, using the same format as for the
+counted value for multiple dichotomy sets.  A string of length 0 means
+that the set does not have a label.  A string of length 0 is also
+written if LABELSOURCE=VARLABEL was specified.
+
+@item
+A space.
+
+@item
+The short names of the variables in the set, converted to lowercase,
+each separated from the previous by a single space.
+
+Even though a multiple response set must have at least two variables,
+some system files contain multiple response sets with no variables at
+all.  The source and meaning of these multiple response sets is
+unknown.  (Perhaps they arise from creating a multiple response set
+then deleting all the variables that it contains?)
+
+@item
+One line feed (byte 0x0a).  Sometimes multiple, even hundreds, of line
+feeds are present.
+@end itemize
+@end table
+
+Example: Given appropriate variable definitions, consider the
+following MRSETS command:
+
+@example
+MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
+       /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
+       /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
+       /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
+        VARIABLES=k l m VALUE=34
+       /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
+        VARIABLES=n o p VALUE='choice'.
+@end example
+
+The above would generate the following multiple response set record of
+subtype 7:
+
+@example
+$a=C 10 my mcgroup a b c
+$b=D2 55 0  g e f d
+$c=D3 Yes 10 mdgroup #2 h i j
+@end example
+
+It would also generate the following multiple response set record with
+subtype 19:
+
+@example
+$d=E 1 2 34 13 third mdgroup k l m
+$e=E 11 6 choice 0  n o p
+@end example
+
+@node Extra Product Info Record
+@section Extra Product Info Record
+
+This optional record appears to contain a text string that describes
+the program that wrote the file and the source of the data.  (This is
+redundant with the file label and product info found in the file
+header record.)
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                info[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 10.
+
+@item int32 size;
+The size of each element in the @code{info} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{info}.
+
+@item char info[];
+A text string.  A product that identifies itself as @code{VOXCO
+INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
+more usual LF-only or CR LF line ends.
+@end table
+
  @node Variable Display Parameter Record
  @section Variable Display Parameter Record
  
@@ -646,8 +917,8 @@ Ordinal Scale
  Continuous Scale
  @end table
  
-SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
-as nominal scale.
+SPSS sometimes writes a @code{measure} of 0.  PSPP interprets this as
+nominal scale.
  
  @item int32 width;
  The width of the display column for the variable in characters.
@@ -827,12 +1098,29 @@ The size of each element in the @code{encoding} member. Always set to 1.
  The total number of bytes in @code{encoding}.
  
  @item char encoding[];
-The name of the character encoding.  Normally this will be an official IANA characterset name or alias.
+The name of the character encoding.  Normally this will be an official
+IANA character set name or alias.
  See @url{http://www.iana.org/assignments/character-sets}.
+Character set names are not case-sensitive, but SPSS appears to write
+them in all-uppercase.
  @end table
  
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software.  See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
  
  @node Long String Value Labels Record
  @section Long String Value Labels Record
@@ -907,6 +1195,74 @@ between 0 and 120, is the number of bytes in @code{label}.  The
  @end table
  @end table
  
+@node Long String Missing Values Record
+@section Long String Missing Values Record
+
+This record, if present, specifies missing values for long string
+variables.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Repeated up to exactly @code{count} bytes.} */
+int32               var_name_len;
+char                var_name[];
+char                n_missing_values;
+long_string_missing_value   values[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 22.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The number of bytes following the header until the next header.
+
+@item int32 var_name_len;
+@itemx char var_name[];
+The number of bytes in the name of the long string variable that has
+missing values, plus the variable name itself, which consists of
+exactly @code{var_name_len} bytes.  The variable name is not padded to
+any particular boundary, nor is it null-terminated.
+
+@item char n_missing_values;
+The number of missing values, either 1, 2, or 3.  (This is, unusually,
+a single byte instead of a 32-bit number.)
+
+@item long_string_missing_value values[];
+The missing values themselves.  This array contains exactly
+@code{n_missing_values} elements, each of which has the following
+substructure:
+
+@example
+int32               value_len;
+char                value[];
+@end example
+
+@table @code
+@item int32 value_len;
+The length of the missing value string, in bytes.  This value should
+be 8, because long string variables are at least 8 bytes wide (by
+definition), only the first 8 bytes of a long string variable's
+missing values are allowed to be non-spaces, and any spaces within the
+first 8 bytes are included in the missing value here.
+
+@item char value[];
+The missing value string, exactly @code{value_len} bytes, without
+any padding or null terminator.
+@end table
+@end table
+
  @node Data File and Variable Attributes Records
  @section Data File and Variable Attributes Records
  
@@ -959,7 +1315,8 @@ element.
  In record type 18, this field contains a sequence of one or more
  variable attribute sets.  If more than one variable attribute set is
  present, each one after the first is delimited from the previous by
-@code{/}.  Each variable attribute set consists of a variable name,
+@code{/}.  Each variable attribute set consists of a long
+variable name,
  followed by @code{:}, followed by an attribute set with the same
  syntax as on record type 17.
  
@@ -980,12 +1337,38 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
  will contain a variable attribute record with the following contents:
  
  @example
-00000000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
-00000010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
-00000020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
-00000030  0a 29                                             |.)              |
+0000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
+0010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
+0020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
+0030  0a 29                                             |.)              |
  @end example
  
+@menu
+* Variable Roles::
+@end menu
+
+@node Variable Roles
+@subsection Variable Roles
+
+A variable's role is represented as an attribute named @code{$@@Role}.
+This attribute has a single element whose values and their meanings
+are:
+
+@table @code
+@item 0
+Input.  This, the default, is the most common role.
+@item 1
+Output.
+@item 2
+Both.
+@item 3
+None.
+@item 4
+Partition.
+@item 5
+Split.
+@end table
+
  @node Extended Number of Cases Record
  @section Extended Number of Cases Record
  
@@ -1050,7 +1433,9 @@ Record type.  Always set to 7.
  @item int32 subtype;
  Record subtype.  May take any value.  According to Aapi
  H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
-indicates date info (probably related to USE).
+indicates date info (probably related to USE).  Subtype 24 appears to
+contain XML that describes how data in the file should be displayed
+on-screen.
  
  @item int32 size;
  Size of each piece of data in the data part.  Should have the value 1,
@@ -1087,22 +1472,23 @@ Ignored padding.  Should be set to 0.
  @node Data Record
  @section Data Record
  
-Data records must follow all other records in the system file.  There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed.  Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case.  The format of the data record varies depending on the
+value of @code{compression} in the file header record:
  
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
  the variable declared in the respective variable record (@pxref{Variable
  Record}).  Numeric values are given in @code{flt64} format; string
  values are literal characters string, padded on the right when
  necessary to fill out 8-byte units.
  
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record is divided into a series of 1-byte command
  codes.  These codes have meanings as described below:
  
  @table @asis
@@ -1140,8 +1526,287 @@ An 8-byte string value that is all spaces.
  The system-missing value.
  @end table
  
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253.  After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values.  The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64               zheader_ofs;
+int64               ztrailer_ofs;
+int64               ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer.  This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_ofs - 24) / 24} ZLIB
+compressed data blocks.  Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice).  Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter.  The last ZLIB compressed data block
+gends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64               bias;
+int64               zero;
+int32               block_size;
+int32               n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression.  Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_ofs - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset.  Each block descriptor has the following
+format:
+
+@example
+int64               uncompressed_ofs;
+int64               compressed_ofs;
+int32               uncompressed_size;
+int32               compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1.  This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous desciptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block.  This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression.  This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
+
  @setfilename ignored
+
+@node Encrypted System Files
+@section Encrypted System Files
+
+SPSS 21 and later support an encrypted system file format.
+
+@quotation Warning
+The SPSS encrypted file format is poorly designed.  It is much cheaper
+and faster to decrypt a file encrypted this way than if a well
+designed alternative were used.  If you must use this format, use a
+10-byte randomly generated password.
+@end quotation
+
+@subheading Encrypted File Format
+
+Encrypted system files begin with the following 36-byte fixed header:
+
+@example
+0000  1c 00 00 00 00 00 00 00  45 4e 43 52 59 50 54 45  |........ENCRYPTE|
+0010  44 53 41 56 15 00 00 00  00 00 00 00 00 00 00 00  |DSAV............|
+0020  00 00 00 00                                       |....|
+@end example
+
+Following the fixed header is a complete system file in the usual
+format, except that each 16-byte block is encrypted with AES-256 in
+ECB mode.  The AES-256 key is derived from a password in the following
+way:
+
+@enumerate 
+@item
+Start from the literal password typed by the user.  Truncate it to at
+most 10 bytes, then append (between 1 and 22) null bytes until there
+are exactly 32 bytes.  Call this @var{password}.
+
+@item
+Let @var{constant} be the following 73-byte constant:
+
+@example
+0000  00 00 00 01 35 27 13 cc  53 a7 78 89 87 53 22 11
+0010  d6 5b 31 58 dc fe 2e 7e  94 da 2f 00 cc 15 71 80
+0020  0a 6c 63 53 00 38 c3 38  ac 22 f3 63 62 0e ce 85
+0030  3f b8 07 4c 4e 2b 77 c7  21 f5 1a 80 1d 67 fb e1
+0040  e1 83 07 d8 0d 00 00 01  00
+@end example
+
+@item
+Compute CMAC-AES-256(@var{password}, @var{constant}).  Call the
+16-byte result @var{cmac}.
+
+@item
+The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
+@var{cmac} repeated twice.
+@end enumerate
+
+@subsubheading Example
+
+Consider the password @samp{pspp}.  @var{password} is:
+
+@example
+0000  70 73 70 70 00 00 00 00  00 00 00 00 00 00 00 00  |pspp............|
+0010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
+@end example
+
+@noindent
+@var{cmac} is:
+
+@example
+0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+@end example
+
+@noindent
+The AES-256 key is:
+
+@example
+0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+0010  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+@end example
+
+@subheading Password Encoding
+
+SPSS also supports what it calls ``encrypted passwords.''  These are
+not encrypted.  They are encoded with a simple, fixed scheme.  An
+encoded password is always a multiple of 2 characters long, and never
+longer than 20 characters.  The characters in an encoded password are
+always in the graphic ASCII range 33 through 126.  Each successive
+pair of characters in the password encodes a single byte in the
+plaintext password.
+
+Use the following algorithm to decode a pair of characters:
+
+@enumerate
+@item
+Let @var{a} be the ASCII code of the first character, and @var{b} be
+the ASCII code of the second character.
+
+@item
+Let @var{ah} be the most significant 4 bits of @var{a}.  Find the line
+in the table below that has @var{ah} on the left side.  The right side
+of the line is a set of possible values for the most significant 4
+bits of the decoded byte.
+
+@display
+@t{2 } @result{} @t{2367}
+@t{3 } @result{} @t{0145}
+@t{47} @result{} @t{89cd}
+@t{56} @result{} @t{abef}
+@end display
+
+@item
+Let @var{bh} be the most significant 4 bits of @var{b}.  Find the line
+in the second table below that has @var{bh} on the left side.  The
+right side of the line is a set of possible values for the most
+significant 4 bits of the decoded byte.  Together with the results of
+the previous step, only a single possibility is left.
+
+@display
+@t{2 } @result{} @t{139b}
+@t{3 } @result{} @t{028a}
+@t{47} @result{} @t{46ce}
+@t{56} @result{} @t{57df}
+@end display
+
+@item
+Let @var{al} be the least significant 4 bits of @var{a}.  Find the
+line in the table below that has @var{al} on the left side.  The right
+side of the line is a set of possible values for the least significant
+4 bits of the decoded byte.
+
+@display
+@t{03cf} @result{} @t{0145}
+@t{12de} @result{} @t{2367}
+@t{478b} @result{} @t{89cd}
+@t{569a} @result{} @t{abef}
+@end display
+
+@item
+Let @var{bl} be the least significant 4 bits of @var{b}.  Find the
+line in the table below that has @var{bl} on the left side.  The right
+side of the line is a set of possible values for the least significant
+4 bits of the decoded byte.  Together with the results of the previous
+step, only a single possibility is left.
+
+@display
+@t{03cf} @result{} @t{028a}
+@t{12de} @result{} @t{139b}
+@t{478b} @result{} @t{46ce}
+@t{569a} @result{} @t{57df}
+@end display
+@end enumerate
+
+@subsubheading Example
+
+Consider the encoded character pair @samp{-|}.  @var{a} is
+0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
+0xd, and @var{bl} is 0xc.  @var{ah} means that the most significant
+four bits of the decoded character is 2, 3, 6, or 7, and @var{bh}
+means that they are 4, 6, 0xc, or 0xe.  The single possibility in
+common is 6, so the most significant four bits are 6.  Similarly,
+@var{al} means that the least significant four bits are 2, 3, 6, or 7,
+and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
+four bits are 2.  The decoded character is therefore 0x62, the letter
+@samp{b}.