Add support for reading SPSS/PC+ system files.

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi

index af74fd2f600c6e95f0380afa23465605365263b8..d100aa9d24d59fcd6265e73e70310d6b556403f8 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -56,6 +56,18 @@ appears in system files only in missing value ranges, which never
  contain SYSMIS.
  @end table
  
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings.  The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}).  The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
+
  System files are divided into records, each of which begins with a
  4-byte record type, usually regarded as an @code{int32}.
  
@@ -96,16 +108,19 @@ Each type of record is described separately below.
  * Machine Integer Info Record::
  * Machine Floating-Point Info Record::
  * Multiple Response Sets Records::
+* Extra Product Info Record::
  * Variable Display Parameter Record::
  * Long Variable Names Record::
  * Very Long String Record::
  * Character Encoding Record::
  * Long String Value Labels Record::
+* Long String Missing Values Record::
  * Data File and Variable Attributes Records::
  * Extended Number of Cases Record::
  * Miscellaneous Informational Records::
  * Dictionary Termination Record::
  * Data Record::
+* Encrypted System Files::
  @end menu
  
  @node File Header Record
@@ -119,7 +134,7 @@ char                rec_type[4];
  char                prod_name[60];
  int32               layout_code;
  int32               nominal_case_size;
-int32               compressed;
+int32               compression;
  int32               weight_index;
  int32               ncases;
  flt64               bias;
@@ -131,9 +146,15 @@ char                padding[3];
  
  @table @code
  @item char rec_type[4];
-Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c
-32} if the file uses an ASCII-based character encoding, or @code{5b c6
-d3 f2} if the file uses an EBCDIC-based character encoding.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the same character encoding
+as other strings.  Thus, in a file with an ASCII-based character encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}.  (No EBCDIC-based ZLIB-compressed files have been observed.)
  
  @item char prod_name[60];
  Product identification string.  This always begins with the characters
@@ -157,8 +178,11 @@ contribute to this value beyond the first 255 bytes.   Further, system
  files written by some systems set this value to -1.  In general, it is
  unsafe for systems reading system files to rely upon this value.
  
-@item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+@item int32 compression;
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed.  This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
  
  @item int32 weight_index;
  If one of the variables in the data set is used as a weighting
@@ -203,6 +227,10 @@ field is arbitrarily set to @samp{00:00:00}.
  File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
  PSPP Users Guide}).  Padded on the right with spaces.
  
+A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
+CR-only line ends in this field, rather than the more usual LF-only or
+CR LF line ends.
+
  @item char padding[3];
  Ignored padding bytes to make the structure a multiple of 32 bits in
  length.  Set to zeros.
@@ -272,6 +300,10 @@ respectively.  If the variable has a range for missing variables, set to
  -2; if the variable has a range for missing variables plus a single
  discrete value, set to -3.
  
+A long string variable always has the value 0 here.  A separate record
+indicates missing values for long string variables (@pxref{Long String
+Missing Values Record}).
+
  @item int32 print;
  Print format for this variable.  See below.
  
@@ -284,12 +316,19 @@ the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
  (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
  stops (@samp{.}).  The variable name is padded on the right with spaces.
  
+The @samp{name} fields should be unique within a system file.  System
+files written by SPSS that contain very long string variables with
+similar names sometimes contain duplicate names that are later
+eliminated by resolving the very long string names (@pxref{Very Long
+String Record}).  PSPP handles duplicates by assigning them new,
+unique names.
+
  @item int32 label_len;
  This field is present only if @code{has_var_label} is set to 1.  It is
  set to the length, in characters, of the variable label.  The
  documented maximum length varies from 120 to 255 based on SPSS
  version, but some files have been seen with longer labels.  PSPP
-accepts longer labels and truncates them to 255 bytes on input.
+accepts labels of any length.
  
  @item char label[];
  This field is present only if @code{has_var_label} is set to 1.  It has
@@ -313,6 +352,7 @@ in the range.  When a range plus a value are present, the third
  element denotes the additional discrete missing value.
  @end table
  
+@anchor{System File Output Formats}
  The @code{print} and @code{write} members of sysfile_variable are output
  formats coded into @code{int32} types.  The least-significant byte
  of the @code{int32} represents the number of decimal places, and the
@@ -567,7 +607,8 @@ Floating point representation code.  For IEEE 754 systems this is 1.
  IBM 370 sets this to 2, and DEC VAX E to 3.
  
  @item int32 compression_code;
-Compression code.  Always set to 1.
+Compression code.  Always set to 1, regardless of whether or how the
+file is compressed.
  
  @item int32 endianness;
  Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
@@ -695,8 +736,8 @@ The size of each element in the @code{mrsets} member. Always set to 1.
  The total number of bytes in @code{mrsets}.
  
  @item char mrsets[];
-A series of multiple response sets, each of which consists of the
-following:
+Zero or more line feeds (byte 0x0a), followed by a series of multiple
+response sets, each of which consists of the following:
  
  @itemize @bullet
  @item
@@ -745,8 +786,15 @@ A space.
  The short names of the variables in the set, converted to lowercase,
  each separated from the previous by a single space.
  
+Even though a multiple response set must have at least two variables,
+some system files contain multiple response sets with no variables at
+all.  The source and meaning of these multiple response sets are
+unknown.  (Perhaps they arise from creating a multiple response set
+then deleting all the variables that it contains?)
+
  @item
-A line feed (byte 0x0a).
+One line feed (byte 0x0a).  Sometimes multiple line feeds, even
+hundreds, are present.
  @end itemize
  @end table
  
@@ -780,6 +828,44 @@ $d=E 1 2 34 13 third mdgroup k l m
  $e=E 11 6 choice 0  n o p
  @end example
  
+@node Extra Product Info Record
+@section Extra Product Info Record
+
+This optional record appears to contain a text string that describes
+the program that wrote the file and the source of the data.  (This is
+redundant with the file label and product info found in the file
+header record.)
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                info[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 10.
+
+@item int32 size;
+The size of each element in the @code{info} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{info}.
+
+@item char info[];
+A text string.  A product that identifies itself as @code{VOXCO
+INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
+more usual LF-only or CR LF line ends.
+@end table
+
  @node Variable Display Parameter Record
  @section Variable Display Parameter Record
  
@@ -1109,6 +1195,74 @@ between 0 and 120, is the number of bytes in @code{label}.  The
  @end table
  @end table
  
+@node Long String Missing Values Record
+@section Long String Missing Values Record
+
+This record, if present, specifies missing values for long string
+variables.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Repeated up to exactly @code{count} bytes.} */
+int32               var_name_len;
+char                var_name[];
+char                n_missing_values;
+long_string_missing_value   values[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 22.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The number of bytes following the header until the next header.
+
+@item int32 var_name_len;
+@itemx char var_name[];
+The number of bytes in the name of the long string variable that has
+missing values, plus the variable name itself, which consists of
+exactly @code{var_name_len} bytes.  The variable name is not padded to
+any particular boundary, nor is it null-terminated.
+
+@item char n_missing_values;
+The number of missing values, either 1, 2, or 3.  (This is, unusually,
+a single byte instead of a 32-bit number.)
+
+@item long_string_missing_value values[];
+The missing values themselves.  This array contains exactly
+@code{n_missing_values} elements, each of which has the following
+substructure:
+
+@example
+int32               value_len;
+char                value[];
+@end example
+
+@table @code
+@item int32 value_len;
+The length of the missing value string, in bytes.  This value should
+be 8, because long string variables are at least 8 bytes wide (by
+definition), only the first 8 bytes of a long string variable's
+missing values are allowed to be non-spaces, and any spaces within the
+first 8 bytes are included in the missing value here.
+
+@item char value[];
+The missing value string, exactly @code{value_len} bytes, without
+any padding or null terminator.
+@end table
+@end table
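
A minimal Python sketch of walking this record's payload as laid out above. The function name `parse_long_string_missing_values` and the `endian` parameter are illustrative only, not part of PSPP. The sketch assumes the `count` payload bytes have already been read into memory and that integers are little-endian unless the caller says otherwise.

```python
import struct


def parse_long_string_missing_values(payload, endian="<"):
    """Map each variable name (bytes) to its list of missing values (bytes)."""
    pos = 0

    def take(n):
        # Consume exactly n bytes of the payload, refusing to run past the end.
        nonlocal pos
        if n < 0 or pos + n > len(payload):
            raise ValueError("record is truncated or malformed")
        chunk = payload[pos:pos + n]
        pos += n
        return chunk

    def take_int32():
        return struct.unpack(endian + "i", take(4))[0]

    result = {}
    while pos < len(payload):
        name = take(take_int32())          # var_name_len, then var_name
        n_missing_values = take(1)[0]      # a single byte, not an int32
        result[name] = [take(take_int32()) for _ in range(n_missing_values)]
    return result
```

Names and values are returned as raw bytes because decoding them depends on the file's character encoding, which this sketch does not try to determine.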
+
  @node Data File and Variable Attributes Records
  @section Data File and Variable Attributes Records
  
@@ -1183,12 +1337,38 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
  will contain a variable attribute record with the following contents:
  
  @example
-00000000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
-00000010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
-00000020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
-00000030  0a 29                                             |.)              |
+0000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
+0010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
+0020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
+0030  0a 29                                             |.)              |
  @end example
  
+@menu
+* Variable Roles::
+@end menu
+
+@node Variable Roles
+@subsection Variable Roles
+
+A variable's role is represented as an attribute named @code{$@@Role}.
+This attribute has a single element whose values and their meanings
+are:
+
+@table @code
+@item 0
+Input.  This, the default, is the most common role.
+@item 1
+Output.
+@item 2
+Both.
+@item 3
+None.
+@item 4
+Partition.
+@item 5
+Split.
+@end table
+
  @node Extended Number of Cases Record
  @section Extended Number of Cases Record
  
@@ -1292,22 +1472,23 @@ Ignored padding.  Should be set to 0.
  @node Data Record
  @section Data Record
  
-Data records must follow all other records in the system file.  There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed.  Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case.  The format of the data record varies depending on the
+value of @code{compression} in the file header record:
  
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
  the variable declared in the respective variable record (@pxref{Variable
  Record}).  Numeric values are given in @code{flt64} format; string
  values are literal characters string, padded on the right when
  necessary to fill out 8-byte units.
  
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record are divided into a series of 1-byte command
  codes.  These codes have meanings as described below:
  
  @table @asis
@@ -1345,8 +1526,287 @@ An 8-byte string value that is all spaces.
  The system-missing value.
  @end table
  
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253.  After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values.  The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64               zheader_ofs;
+int64               ztrailer_ofs;
+int64               ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer.  This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
+compressed data blocks.  Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice).  Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter.  The last ZLIB compressed data block
+ends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64               int_bias;
+int64               zero;
+int32               block_size;
+int32               n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression.  Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_len - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset.  Each block descriptor has the following
+format:
+
+@example
+int64               uncompressed_ofs;
+int64               compressed_ofs;
+int32               uncompressed_size;
+int32               compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1.  This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous descriptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block.  This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression.  This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
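
The ZLIB header and trailer layouts above can be sketched in Python as follows. This is an illustration under stated assumptions, not PSPP's implementation: the names `ZHeader`, `ZBlock`, `parse_zheader`, and `parse_ztrailer` are made up for this example, and the `endian` parameter (little-endian by default) assumes these fields use the same byte order as the rest of the file. The block count is checked against the trailer length, since the trailer is described as a 24-byte fixed header plus 24 bytes per block.

```python
import struct
from typing import NamedTuple


class ZHeader(NamedTuple):
    zheader_ofs: int
    ztrailer_ofs: int
    ztrailer_len: int


class ZBlock(NamedTuple):
    uncompressed_ofs: int
    compressed_ofs: int
    uncompressed_size: int
    compressed_size: int


def parse_zheader(buf, endian="<"):
    """Decode the 24-byte ZLIB data header."""
    return ZHeader(*struct.unpack(endian + "qqq", buf[:24]))


def parse_ztrailer(buf, zheader, endian="<"):
    """Decode the trailer; return (int_bias, block_size, [ZBlock, ...])."""
    if len(buf) != zheader.ztrailer_len or len(buf) < 24 or (len(buf) - 24) % 24:
        raise ValueError("unexpected trailer length")
    int_bias, zero, block_size, n_blocks = struct.unpack_from(endian + "qqii", buf, 0)
    if n_blocks != (len(buf) - 24) // 24:
        raise ValueError("n_blocks does not match trailer length")
    blocks = [ZBlock(*struct.unpack_from(endian + "qqii", buf, 24 + 24 * i))
              for i in range(n_blocks)]
    # Each descriptor must start where the previous block ended.
    uncompressed_ofs = zheader.zheader_ofs
    compressed_ofs = zheader.zheader_ofs + 24
    for block in blocks:
        if (block.uncompressed_ofs, block.compressed_ofs) != (uncompressed_ofs, compressed_ofs):
            raise ValueError("block descriptors are not contiguous")
        uncompressed_ofs += block.uncompressed_size
        compressed_ofs += block.compressed_size
    if blocks and compressed_ofs != zheader.ztrailer_ofs:
        raise ValueError("last block does not end at ztrailer_ofs")
    return int_bias, block_size, blocks
```

With the descriptors validated, a reader could slice each block out of the file by `compressed_ofs` and `compressed_size` and pass it to `zlib.decompress`, then feed the concatenated result to the bytecode decoder for compression format 1.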
+
  @setfilename ignored
+
+@node Encrypted System Files
+@section Encrypted System Files
+
+SPSS 21 and later support an encrypted system file format.
+
+@quotation Warning
+The SPSS encrypted file format is poorly designed.  It is much cheaper
+and faster to decrypt a file encrypted this way than if a well
+designed alternative were used.  If you must use this format, use a
+10-byte randomly generated password.
+@end quotation
+
+@subheading Encrypted File Format
+
+Encrypted system files begin with the following 36-byte fixed header:
+
+@example
+0000  1c 00 00 00 00 00 00 00  45 4e 43 52 59 50 54 45  |........ENCRYPTE|
+0010  44 53 41 56 15 00 00 00  00 00 00 00 00 00 00 00  |DSAV............|
+0020  00 00 00 00                                       |....|
+@end example
+
+Following the fixed header is a complete system file in the usual
+format, except that each 16-byte block is encrypted with AES-256 in
+ECB mode.  The AES-256 key is derived from a password in the following
+way:
+
+@enumerate 
+@item
+Start from the literal password typed by the user.  Truncate it to at
+most 10 bytes, then append null bytes (at least 22) until there
+are exactly 32 bytes.  Call this @var{password}.
+
+@item
+Let @var{constant} be the following 73-byte constant:
+
+@example
+0000  00 00 00 01 35 27 13 cc  53 a7 78 89 87 53 22 11
+0010  d6 5b 31 58 dc fe 2e 7e  94 da 2f 00 cc 15 71 80
+0020  0a 6c 63 53 00 38 c3 38  ac 22 f3 63 62 0e ce 85
+0030  3f b8 07 4c 4e 2b 77 c7  21 f5 1a 80 1d 67 fb e1
+0040  e1 83 07 d8 0d 00 00 01  00
+@end example
+
+@item
+Compute CMAC-AES-256(@var{password}, @var{constant}).  Call the
+16-byte result @var{cmac}.
+
+@item
+The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
+@var{cmac} repeated twice.
+@end enumerate
+
+@subsubheading Example
+
+Consider the password @samp{pspp}.  @var{password} is:
+
+@example
+0000  70 73 70 70 00 00 00 00  00 00 00 00 00 00 00 00  |pspp............|
+0010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
+@end example
+
+@noindent
+@var{cmac} is:
+
+@example
+0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+@end example
+
+@noindent
+The AES-256 key is:
+
+@example
+0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+0010  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
+@end example
+
+@subheading Password Encoding
+
+SPSS also supports what it calls ``encrypted passwords.''  These are
+not encrypted.  They are encoded with a simple, fixed scheme.  An
+encoded password is always a multiple of 2 characters long, and never
+longer than 20 characters.  The characters in an encoded password are
+always in the graphic ASCII range 33 through 126.  Each successive
+pair of characters in the password encodes a single byte in the
+plaintext password.
+
+Use the following algorithm to decode a pair of characters:
+
+@enumerate
+@item
+Let @var{a} be the ASCII code of the first character, and @var{b} be
+the ASCII code of the second character.
+
+@item
+Let @var{ah} be the most significant 4 bits of @var{a}.  Find the line
+in the table below that has @var{ah} on the left side.  The right side
+of the line is a set of possible values for the most significant 4
+bits of the decoded byte.
+
+@display
+@t{2 } @result{} @t{2367}
+@t{3 } @result{} @t{0145}
+@t{47} @result{} @t{89cd}
+@t{56} @result{} @t{abef}
+@end display
+
+@item
+Let @var{bh} be the most significant 4 bits of @var{b}.  Find the line
+in the second table below that has @var{bh} on the left side.  The
+right side of the line is a set of possible values for the most
+significant 4 bits of the decoded byte.  Together with the results of
+the previous step, only a single possibility is left.
+
+@display
+@t{2 } @result{} @t{139b}
+@t{3 } @result{} @t{028a}
+@t{47} @result{} @t{46ce}
+@t{56} @result{} @t{57df}
+@end display
+
+@item
+Let @var{al} be the least significant 4 bits of @var{a}.  Find the
+line in the table below that has @var{al} on the left side.  The right
+side of the line is a set of possible values for the least significant
+4 bits of the decoded byte.
+
+@display
+@t{03cf} @result{} @t{0145}
+@t{12de} @result{} @t{2367}
+@t{478b} @result{} @t{89cd}
+@t{569a} @result{} @t{abef}
+@end display
+
+@item
+Let @var{bl} be the least significant 4 bits of @var{b}.  Find the
+line in the table below that has @var{bl} on the left side.  The right
+side of the line is a set of possible values for the least significant
+4 bits of the decoded byte.  Together with the results of the previous
+step, only a single possibility is left.
+
+@display
+@t{03cf} @result{} @t{028a}
+@t{12de} @result{} @t{139b}
+@t{478b} @result{} @t{46ce}
+@t{569a} @result{} @t{57df}
+@end display
+@end enumerate
+
+@subsubheading Example
+
+Consider the encoded character pair @samp{-|}.  @var{a} is
+0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
+0xd, and @var{bl} is 0xc.  @var{ah} means that the most significant
+four bits of the decoded character are 2, 3, 6, or 7, and @var{bh}
+means that they are 4, 6, 0xc, or 0xe.  The single possibility in
+common is 6, so the most significant four bits are 6.  Similarly,
+@var{al} means that the least significant four bits are 2, 3, 6, or 7,
+and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
+four bits are 2.  The decoded character is therefore 0x62, the letter
+@samp{b}.
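
The decoding procedure can be sketched in Python by transcribing the four tables and intersecting the candidate sets. The names `AH`, `BH`, `AL`, `BL`, `decode_pair`, and `decode_password` are chosen for this illustration only; the length checks reflect the limits stated above.

```python
# Each table maps a group of input nibbles (as hex digits) to the set of
# output nibbles it allows, copied from the tables in the steps above.
AH = {"2": "2367", "3": "0145", "47": "89cd", "56": "abef"}
BH = {"2": "139b", "3": "028a", "47": "46ce", "56": "57df"}
AL = {"03cf": "0145", "12de": "2367", "478b": "89cd", "569a": "abef"}
BL = {"03cf": "028a", "12de": "139b", "478b": "46ce", "569a": "57df"}


def _candidates(table, nibble):
    """Return the set of output nibbles the table allows for this input."""
    digit = format(nibble, "x")
    for inputs, outputs in table.items():
        if digit in inputs:
            return {int(ch, 16) for ch in outputs}
    raise ValueError("character outside the encoded-password alphabet")


def decode_pair(first, second):
    """Decode two encoded characters into one plaintext byte value."""
    a, b = ord(first), ord(second)
    # Unpacking a one-element set fails loudly if the tables disagree.
    (high,) = _candidates(AH, a >> 4) & _candidates(BH, b >> 4)
    (low,) = _candidates(AL, a & 0xF) & _candidates(BL, b & 0xF)
    return (high << 4) | low


def decode_password(encoded):
    """Decode a whole encoded password into plaintext bytes."""
    if len(encoded) % 2 or len(encoded) > 20:
        raise ValueError("encoded passwords have an even length of at most 20")
    return bytes(decode_pair(encoded[i], encoded[i + 1])
                 for i in range(0, len(encoded), 2))
```

For the pair from the example, `decode_pair("-", "|")` yields 0x62, the letter `b`.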