diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index 19a6c057479d0294cfee1c79dc269373abd53fdf..ba390c1c0243b7e3f32facffb712bc6f936ca6b3 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -1,12 +1,25 @@
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+
  @node System File Format
-@appendix System File Format
+@chapter System File Format
  
-A system file encapsulates a set of cases and dictionary information
-that describes how they may be interpreted.  This chapter describes
-the format of a system file.
+An SPSS system file holds a set of cases and dictionary information
+that describes how they may be interpreted.  The system file format
+dates back more than 40 years and has evolved greatly over that time
+to support new features, but in a way that facilitates interchange
+between even the oldest and newest versions of software.  This chapter
+describes the system file format.
  
  System files use four data types: 8-bit characters, 32-bit integers,
-64-bit integers, 
+64-bit integers,
  and 64-bit floating points, called here @code{char}, @code{int32},
  @code{int64}, and
  @code{flt64}, respectively.  Data is not necessarily aligned on a word
@@ -79,6 +92,7 @@ possible to artificially synthesize files that use different encodings
  * Multiple Response Sets Records::
  * Extra Product Info Record::
  * Variable Display Parameter Record::
+* Variable Sets Record::
  * Long Variable Names Record::
  * Very Long String Record::
  * Character Encoding Record::
@@ -89,7 +103,6 @@ possible to artificially synthesize files that use different encodings
  * Other Informational Records::
  * Dictionary Termination Record::
  * Data Record::
-* Encrypted System Files::
  @end menu
  
  @node System File Record Structure
@@ -162,6 +175,11 @@ Document record, if present.
  Extension (type 7) records, in ascending numerical order of their
  subtypes.
  
+System files written by SPSS include at most one of each kind of
+extension record.  This is generally true of system files written by
+other software as well, with known exceptions noted below in the
+individual sections about each type of record.
+
  @item
  Dictionary termination record.
  
@@ -218,6 +236,12 @@ pspp 0.1.4 - sparc-sun-solaris2.5.2}.  The string is truncated if it
  would be longer than 60 characters; otherwise it is padded on the right
  with spaces.
  
+The product name field allows readers to behave differently based on
+quirks in the way that particular software writes system files.
+@xref{Value Labels Records}, for details of the quirk that the PSPP
+system file reader tolerates in files written by ReadStat, which has
+@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.
+
  @anchor{layout_code}
  @item int32 layout_code;
  Normally set to 2, although a few system files have been spotted in
@@ -228,8 +252,8 @@ file's integer endianness (@pxref{System File Format}).
  Number of data elements per case.  This is the number of variables,
  except that long string variables add extra data elements (one for every
  8 characters after the first 8).  However, string variables do not
-contribute to this value beyond the first 255 bytes.   Further, system
-files written by some systems set this value to -1.  In general, it is
+contribute to this value beyond the first 255 bytes.   Further, some
+software always writes -1 or 0 in this field.  In general, it is
  unsafe for systems reading system files to rely upon this value.
  
  @item int32 compression;
@@ -306,10 +330,10 @@ so readers should take care to parse dummy variable records in the
  same way as other variable records.
  
  @anchor{Dictionary Index}
-The @dfn{dictionary index} of a variable is its offset in the set of
+The @dfn{dictionary index} of a variable is a 1-based offset in the set of
  variable records, including dummy variable records for long string
-variables.  The first variable record has a dictionary index of 0, the
-second has a dictionary index of 1, and so on.
+variables.  The first variable record has a dictionary index of 1, the
+second has a dictionary index of 2, and so on.
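As a minimal sketch of the 1-based convention, the following Python
function resolves a dictionary index against a list of variable records
that still includes the dummy records for long string variables.  The
function name and the list-of-records representation are hypothetical
and chosen only for illustration.

```python
def record_for_dictionary_index(variable_records, dict_index):
    """Return the variable record for a 1-based dictionary index.

    `variable_records` is assumed to hold every variable record in file
    order, including dummy records for long string variables, so that
    positions line up with dictionary indexes.
    """
    if not 1 <= dict_index <= len(variable_records):
        raise IndexError(f"dictionary index {dict_index} out of range")
    # Dictionary indexes start at 1; Python list positions start at 0.
    return variable_records[dict_index - 1]


# Hypothetical dictionary: a numeric variable, a 16-byte string variable
# (one real record plus one dummy record), then another numeric variable.
records = ["NUM1", "STR16", "<dummy for STR16>", "NUM2"]
first = record_for_dictionary_index(records, 1)
last = record_for_dictionary_index(records, 4)
```

A reader that drops dummy records before resolving indexes would map
index 4 to the wrong variable, which is why the sketch keeps them.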
  
  The system file format does not directly support string variables
  wider than 255 bytes.  Such very long string variables are represented
@@ -504,6 +528,10 @@ Format types are defined as follows:
  @tab @code{EDATE}
  @item 39
  @tab @code{SDATE}
+@item 40
+@tab @code{MTIME}
+@item 41
+@tab @code{YMDHMS}
  @end multitable
  @end quotation
  
@@ -520,13 +548,21 @@ numeric and short string variables only.  Long string variables may
  have value labels, but their value labels are recorded using a
  different record type (@pxref{Long String Value Labels Record}).
  
+ReadStat (@pxref{File Header Record}) writes value labels that label a
+single value more than once.  In more detail, it emits value labels
+whose values are longer than the string variable's width and that are
+identical within the actual width of the variable, e.g.@: labels for
+values @code{ABC123} and @code{ABC456} for a string variable with
+width 3.  For files written by this software, PSPP ignores such
+labels.
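The collision described above can be illustrated with a short Python
sketch.  It truncates each 8-byte label value to the variable's width
and skips any label whose truncated value was already seen.  Keeping the
first label and dropping later ones is an assumption made for this
sketch; the text above does not say which label, if any, a reader
retains.  The function name is hypothetical.

```python
def drop_colliding_string_labels(labels, width):
    """Filter (value, label) pairs for a string variable of `width` bytes.

    `labels` holds 8-byte values as read from the value label record.
    Values that are identical within the first `width` bytes are treated
    as duplicates; only the first is kept (an assumption of this sketch).
    """
    seen = set()
    kept = []
    for value, label in labels:
        key = value[:width]          # only `width` bytes are significant
        if key in seen:
            continue                 # duplicate within the real width
        seen.add(key)
        kept.append((key, label))
    return kept


raw_labels = [
    (b"ABC123  ", "first"),
    (b"ABC456  ", "second"),   # collides with ABC123 at width 3
    (b"XYZ     ", "third"),
]
filtered = drop_colliding_string_labels(raw_labels, 3)
```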
+
  The value label record has the following format:
  
  @example
  int32               rec_type;
  int32               label_count;
  
-/* @r{Repeated @code{label_cnt} times}. */
+/* @r{Repeated @code{label_count} times}. */
  char                value[8];
  char                label_len;
  char                label[];
@@ -578,7 +614,7 @@ Number of variables that the associated value labels from the value
  label record are to be applied.
  
  @item int32 vars[];
-A list of dictionary indexes of variables to which to apply the value
+A list of 1-based dictionary indexes of variables to which to apply the value
  labels (@pxref{Dictionary Index}).  There are @code{var_count}
  elements.
  
@@ -601,7 +637,8 @@ char                lines[][80];
  Record type.  Always set to 6.
  
  @item int32 n_lines;
-Number of lines of documents present.
+Number of lines of documents present.  This should be greater than
+zero, but ReadStat writes system files with zero @code{n_lines}.
  
  @item char lines[][80];
  Document lines.  The number of elements is defined by @code{n_lines}.
@@ -968,18 +1005,24 @@ members are as follows:
  
  @table @code
  @item int32 measure;
-The measurement type of the variable:
+The measurement level of the variable:
  @table @asis
+@item 0
+Unknown
  @item 1
-Nominal Scale
+Nominal
  @item 2
-Ordinal Scale
+Ordinal
  @item 3
-Continuous Scale
+Scale
  @end table
  
-SPSS sometimes writes a @code{measure} of 0.  PSPP interprets this as
-nominal scale.
+An ``unknown'' @code{measure} of 0 means that the variable was created
+in some way that doesn't make the measurement level clear, e.g.@: with
+a @code{COMPUTE} transformation.  PSPP sets the measurement level the
+first time it reads the data using the rules documented in
+@ref{Measurement Level,,,pspp, PSPP Users Guide}, so this should
+rarely appear.
  
  @item int32 width;
  The width of the display column for the variable in characters.
@@ -1001,6 +1044,54 @@ Centre aligned
  @end table
  @end table
  
+@node Variable Sets Record
+@section Variable Sets Record
+
+The SPSS GUI offers users the ability to arrange variables in sets.
+Users may enable and disable sets individually, and the data editor
+and analysis dialog boxes only show enabled sets.  Syntax does not use
+variable sets.
+
+The variable sets record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of text.} */
+char                text[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 5.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{text}.
+
+@item char text[];
+The variable sets, in a text-based format.
+
+Each variable set occupies one line of text, each of which ends with a
+line feed (byte 0x0a), optionally preceded by a carriage return (byte
+0x0d).
+
+Each line begins with the name of the variable set, followed by an
+equals sign (@samp{=}) and a space (byte 0x20), followed by the long
+variable names of the members of the set, separated by spaces.  A
+variable set may be empty, in which case the equals sign and the space
+following it are still present.
+@end table
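The text layout described above is simple enough to parse in a few
lines.  The following Python sketch splits @code{text} into lines,
strips an optional carriage return, and separates each set name from
its members.  It assumes that the text uses the dictionary's character
encoding (UTF-8 by default here) and that the first equals sign on a
line ends the set name; neither assumption is stated by the record
description.  The function name is hypothetical.

```python
def parse_variable_sets(text, encoding="utf-8"):
    """Parse the `text` field of a variable sets record.

    Returns a dict that maps each set name to a list of long variable
    names, in file order.
    """
    sets = {}
    for raw_line in text.split(b"\n"):
        # A line feed may be preceded by an optional carriage return.
        if raw_line.endswith(b"\r"):
            raw_line = raw_line[:-1]
        if not raw_line:
            continue                       # nothing after the final LF
        name, sep, members = raw_line.partition(b"=")
        if not sep:
            raise ValueError(f"missing '=' in variable set line {raw_line!r}")
        # An empty set still has "= " after its name, so members may be
        # nothing but the single space.
        sets[name.decode(encoding)] = members.decode(encoding).split()
    return sets


sample = b"Demographics= age sex\r\nEmpty= \nScores= math reading\n"
parsed = parse_variable_sets(sample)
```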
+
  @node Long Variable Names Record
  @section Long Variable Names Record
  
@@ -1273,7 +1364,8 @@ int32               count;
  int32               var_name_len;
  char                var_name[];
  char                n_missing_values;
-long_string_missing_value   values[];
+int32               value_len;
+char                values[value_len * n_missing_values];
  @end example
  
  @table @code
@@ -1300,30 +1392,25 @@ any particular boundary, nor is it null-terminated.
  The number of missing values, either 1, 2, or 3.  (This is, unusually,
  a single byte instead of a 32-bit number.)
  
-@item long_string_missing_value values[];
-The missing values themselves.  This array contains exactly
-@code{n_missing_values} elements, each of which has the following
-substructure:
-
-@example
-int32               value_len;
-char                value[];
-@end example
-
-@table @code
  @item int32 value_len;
-The length of the missing value string, in bytes.  This value should
+The length of each missing value string, in bytes.  This value should
  be 8, because long string variables are at least 8 bytes wide (by
  definition), only the first 8 bytes of a long string variable's
  missing values are allowed to be non-spaces, and any spaces within the
  first 8 bytes are included in the missing value here.
  
-@item char value[];
-The missing value string, exactly @code{value_len} bytes, without
-any padding or null terminator.
-@end table
+@item char values[value_len * n_missing_values];
+The missing values themselves, without any padding or null
+terminators.
  @end table
  
+An earlier version of this document stated that @code{value_len} was
+repeated before each of the missing values, so that there was an extra
+@code{int32} value of 8 before each missing value after the first.
+Old versions of PSPP wrote data files in this format.  Readers can
+tolerate this mistake, if they wish, by noticing and skipping the
+extra @code{int32} values, which wouldn't ordinarily occur in strings.
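A reader that wants to accept both layouts might look like the
following Python sketch.  It parses one variable's entry, reads
@code{value_len} once, and before each missing value after the first it
checks whether the next four bytes repeat @code{value_len}; if so, it
skips them.  The function name, the little-endian default, and the
choice to return names and values as raw bytes are assumptions of this
sketch.  The skip test is a heuristic: it would misfire on a missing
value that really began with those four bytes.

```python
import struct


def parse_long_string_missing_entry(buf, ofs=0, endian="<"):
    """Parse one variable's entry; return (var_name, values, next_ofs)."""
    (var_name_len,) = struct.unpack_from(endian + "i", buf, ofs)
    ofs += 4
    var_name = buf[ofs:ofs + var_name_len]
    ofs += var_name_len
    n_missing_values = buf[ofs]            # a single byte, not an int32
    ofs += 1
    (value_len,) = struct.unpack_from(endian + "i", buf, ofs)
    ofs += 4

    repeated_len = struct.pack(endian + "i", value_len)
    values = []
    for i in range(n_missing_values):
        # Old PSPP versions repeated value_len before every value after
        # the first.  Skip the repeat if it appears to be present.
        if i > 0 and buf[ofs:ofs + 4] == repeated_len:
            ofs += 4
        values.append(buf[ofs:ofs + value_len])
        ofs += value_len
    return var_name, values, ofs


header = struct.pack("<i", 4) + b"NAME" + bytes([2]) + struct.pack("<i", 8)
current_layout = header + b"abc     " + b"xyz     "
old_pspp_layout = header + b"abc     " + struct.pack("<i", 8) + b"xyz     "

parsed_current = parse_long_string_missing_entry(current_layout)
parsed_old = parse_long_string_missing_entry(old_pspp_layout)
```

Both calls yield the same name and values; only the returned offset
differs, by the four bytes of the repeated length.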
+
  @node Data File and Variable Attributes Records
  @section Data File and Variable Attributes Records
  
@@ -1361,7 +1448,7 @@ The total number of bytes in @code{attributes}.
  @item char attributes[];
  The attributes, in a text-based format.
  
-In record type 17, this field contains a single attribute set.  An
+In record subtype 17, this field contains a single attribute set.  An
  attribute set is a sequence of one or more attributes concatenated
  together.  Each attribute consists of a name, which has the same
  syntax as a variable name, followed by, inside parentheses, a sequence
@@ -1373,13 +1460,17 @@ way to embed a line feed in a value.  There is no distinction between
  an attribute with a single value and an attribute array with one
  element.
  
-In record type 18, this field contains a sequence of one or more
+In record subtype 18, this field contains a sequence of one or more
  variable attribute sets.  If more than one variable attribute set is
  present, each one after the first is delimited from the previous by
  @code{/}.  Each variable attribute set consists of a long
  variable name,
  followed by @code{:}, followed by an attribute set with the same
-syntax as on record type 17.
+syntax as on record subtype 17.
+
+System files written by @code{Stata 14.1/-savespss- 1.77 by
+S.Radyakin} may include multiple records with subtype 18, one per
+variable that has variable attributes.
  
  The total length is @code{count} bytes.
  @end table
@@ -1480,9 +1571,6 @@ The following extension record subtypes have also been observed, with
  the following believed meanings:
  
  @table @asis
-@item 5
-A set of grouped variables (according to Aapi H@"am@"al@"ainen).
-
  @item 6
  Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
  
@@ -1549,9 +1637,10 @@ value @var{code} - @var{bias}, where
  variable @code{bias} from the file header.  For example,
  code 105 with bias 100.0 (the normal value) indicates a numeric variable
  of value 5.
-One file has been seen written by SPSS 14 that contained such a code
-in a @emph{string} field with the value 0 (after the bias is
-subtracted) as a way of encoding null bytes.
+
+A code of 0 (after subtracting the bias) in a string field encodes
+null bytes.  This is unusual, since a string field normally encodes
+text data, but it exists in real system files.
  
  @item 252
  End of file.  This code may or may not appear at the end of the data
@@ -1613,7 +1702,7 @@ The number of bytes in the ZLIB data trailer.  This and the previous
  field sum to the size of the system file in bytes.
  @end table
  
-The data header is followed by @code{(ztrailer_ofs - 24) / 24} ZLIB
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
  compressed data blocks.  Each ZLIB compressed data block begins with a
  ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
  01} (the only header yet observed in practice).  Each block
@@ -1650,7 +1739,7 @@ been observed so far.
  
  @item int32 n_blocks;
  The number of ZLIB compressed data blocks, always exactly
-@code{(ztrailer_ofs - 24) / 24}.
+@code{(ztrailer_len - 24) / 24}.
  @end table
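The relationship between @code{ztrailer_len} and @code{n_blocks} gives
a reader a cheap consistency check.  The Python sketch below derives the
expected block count from the trailer length, assuming the 24-byte
fixed header and 24-byte block descriptors described in this section.
The function and constant names are hypothetical.

```python
ZTRAILER_FIXED_HEADER_LEN = 24   # bytes in the trailer's fixed header
ZBLOCK_DESCRIPTOR_LEN = 24       # bytes in each block descriptor


def expected_n_blocks(ztrailer_len):
    """Return the block count implied by the ZLIB data trailer length."""
    descriptors_len = ztrailer_len - ZTRAILER_FIXED_HEADER_LEN
    if descriptors_len < 0 or descriptors_len % ZBLOCK_DESCRIPTOR_LEN != 0:
        raise ValueError(f"implausible ztrailer_len {ztrailer_len}")
    return descriptors_len // ZBLOCK_DESCRIPTOR_LEN


# A trailer holding the fixed header plus three block descriptors.
blocks = expected_n_blocks(24 + 3 * 24)
```

A reader could compare this result with the @code{n_blocks} field and
treat a mismatch as a sign of a corrupt or truncated file.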
  
  The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
@@ -1693,165 +1782,3 @@ system file.
  @end table
  
  @setfilename ignored
-
-@node Encrypted System Files
-@section Encrypted System Files
-
-SPSS 21 and later support an encrypted system file format.
-
-@quotation Warning
-The SPSS encrypted file format is poorly designed.  It is much cheaper
-and faster to decrypt a file encrypted this way than if a well
-designed alternative were used.  If you must use this format, use a
-10-byte randomly generated password.
-@end quotation
-
-@subheading Encrypted File Format
-
-Encrypted system files begin with the following 36-byte fixed header:
-
-@example
-0000  1c 00 00 00 00 00 00 00  45 4e 43 52 59 50 54 45  |........ENCRYPTE|
-0010  44 53 41 56 15 00 00 00  00 00 00 00 00 00 00 00  |DSAV............|
-0020  00 00 00 00                                       |....|
-@end example
-
-Following the fixed header is a complete system file in the usual
-format, except that each 16-byte block is encrypted with AES-256 in
-ECB mode.  The AES-256 key is derived from a password in the following
-way:
-
-@enumerate 
-@item
-Start from the literal password typed by the user.  Truncate it to at
-most 10 bytes, then append (between 1 and 22) null bytes until there
-are exactly 32 bytes.  Call this @var{password}.
-
-@item
-Let @var{constant} be the following 73-byte constant:
-
-@example
-0000  00 00 00 01 35 27 13 cc  53 a7 78 89 87 53 22 11
-0010  d6 5b 31 58 dc fe 2e 7e  94 da 2f 00 cc 15 71 80
-0020  0a 6c 63 53 00 38 c3 38  ac 22 f3 63 62 0e ce 85
-0030  3f b8 07 4c 4e 2b 77 c7  21 f5 1a 80 1d 67 fb e1
-0040  e1 83 07 d8 0d 00 00 01  00
-@end example
-
-@item
-Compute CMAC-AES-256(@var{password}, @var{constant}).  Call the
-16-byte result @var{cmac}.
-
-@item
-The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is,
-@var{cmac} repeated twice.
-@end enumerate
-
-@subsubheading Example
-
-Consider the password @samp{pspp}.  @var{password} is:
-
-@example
-0000  70 73 70 70 00 00 00 00  00 00 00 00 00 00 00 00  |pspp............|
-0010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
-@end example
-
-@noindent
-@var{cmac} is:
-
-@example
-0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
-@end example
-
-@noindent
-The AES-256 key is:
-
-@example
-0000  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
-0010  3e da 09 8e 66 04 d4 fd  f9 63 0c 2c a8 6f b0 45
-@end example
-
-@subheading Password Encoding
-
-SPSS also supports what it calls ``encrypted passwords.''  These are
-not encrypted.  They are encoded with a simple, fixed scheme.  An
-encoded password is always a multiple of 2 characters long, and never
-longer than 20 characters.  The characters in an encoded password are
-always in the graphic ASCII range 33 through 126.  Each successive
-pair of characters in the password encodes a single byte in the
-plaintext password.
-
-Use the following algorithm to decode a pair of characters:
-
-@enumerate
-@item
-Let @var{a} be the ASCII code of the first character, and @var{b} be
-the ASCII code of the second character.
-
-@item
-Let @var{ah} be the most significant 4 bits of @var{a}.  Find the line
-in the table below that has @var{ah} on the left side.  The right side
-of the line is a set of possible values for the most significant 4
-bits of the decoded byte.
-
-@display
-@t{2 } @result{} @t{2367}
-@t{3 } @result{} @t{0145}
-@t{47} @result{} @t{89cd}
-@t{56} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bh} be the most significant 4 bits of @var{b}.  Find the line
-in the second table below that has @var{bh} on the left side.  The
-right side of the line is a set of possible values for the most
-significant 4 bits of the decoded byte.  Together with the results of
-the previous step, only a single possibility is left.
-
-@display
-@t{2 } @result{} @t{139b}
-@t{3 } @result{} @t{028a}
-@t{47} @result{} @t{46ce}
-@t{56} @result{} @t{57df}
-@end display
-
-@item
-Let @var{al} be the least significant 4 bits of @var{a}.  Find the
-line in the table below that has @var{al} on the left side.  The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte.
-
-@display
-@t{03cf} @result{} @t{0145}
-@t{12de} @result{} @t{2367}
-@t{478b} @result{} @t{89cd}
-@t{569a} @result{} @t{abef}
-@end display
-
-@item
-Let @var{bl} be the least significant 4 bits of @var{b}.  Find the
-line in the table below that has @var{bl} on the left side.  The right
-side of the line is a set of possible values for the least significant
-4 bits of the decoded byte.  Together with the results of the previous
-step, only a single possibility is left.
-
-@display
-@t{03cf} @result{} @t{028a}
-@t{12de} @result{} @t{139b}
-@t{478b} @result{} @t{46ce}
-@t{569a} @result{} @t{57df}
-@end display
-@end enumerate
-
-@subsubheading Example
-
-Consider the encoded character pair @samp{-|}.  @var{a} is
-0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is
-0xd, and @var{bl} is 0xc.  @var{ah} means that the most significant
-four bits of the decoded character is 2, 3, 6, or 7, and @var{bh}
-means that they are 4, 6, 0xc, or 0xe.  The single possibility in
-common is 6, so the most significant four bits are 6.  Similarly,
-@var{al} means that the least significant four bits are 2, 3, 6, or 7,
-and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant
-four bits are 2.  The decoded character is therefore 0x62, the letter
-@samp{b}.