Automatically infer variables' measurement level from format and data.

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi

index 0358961bab9aa39426cc2233048ae478b56e86db..17f5c1450fa0be9e996e3d47b79c5de6e96c4e34 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -1,3 +1,13 @@
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+
  @node System File Format
  @appendix System File Format
  
@@ -5,8 +15,10 @@ A system file encapsulates a set of cases and dictionary information
  that describes how they may be interpreted.  This chapter describes
  the format of a system file.
  
-System files use three data types: 8-bit characters, 32-bit integers,
-and 64-bit floating points, called here @code{char}, @code{int32}, and
+System files use four data types: 8-bit characters, 32-bit integers,
+64-bit integers,
+and 64-bit floating points, called here @code{char}, @code{int32},
+@code{int64}, and
  @code{flt64}, respectively.  Data is not necessarily aligned on a word
  or double-word boundary: the long variable name record (@pxref{Long
  Variable Names Record}) and very long string records (@pxref{Very Long
@@ -28,20 +40,118 @@ files and translates as necessary.  PSPP also detects the
  floating-point format in use, as well as the endianness of IEEE 754
  floating-point numbers, and translates as needed.  However, only IEEE
  754 numbers with the same endianness as integer data in the same file
-has actually been observed in system files, and it is likely that
+have actually been observed in system files, and it is likely that
  other formats are obsolete or were never used.
  
-The PSPP system-missing value is represented by the largest possible
-negative number in the floating point format (@code{-DBL_MAX}).  Two
-other values are important for use as missing values: @code{HIGHEST},
-represented by the largest possible positive number (@code{DBL_MAX}),
-and @code{LOWEST}, represented by the second-largest negative number
-(in IEEE 754 format, @code{0xffeffffffffffffe}).
+System files use a few floating point values for special purposes:
+
+@table @asis
+@item SYSMIS
+The system-missing value is represented by the largest possible
+negative number in the floating point format (@code{-DBL_MAX}).
+
+@item HIGHEST
+HIGHEST is used as the high end of a missing value range with an
+unbounded maximum.  It is represented by the largest possible positive
+number (@code{DBL_MAX}).
+
+@item LOWEST
+LOWEST is used as the low end of a missing value range with an
+unbounded minimum.  It was originally represented by the
+second-largest negative number (in IEEE 754 format,
+@code{0xffeffffffffffffe}).  System files written by SPSS 21 and later
+instead use the largest negative number (@code{-DBL_MAX}), the same
+value as SYSMIS.  This does not lead to ambiguity because LOWEST
+appears in system files only in missing value ranges, which never
+contain SYSMIS.
+@end table
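
The bit patterns named above can be checked with a short Python sketch. It is only an illustration, not PSPP code, and it assumes that Python's `float` is an IEEE 754 double, which holds on all common platforms. The helper name `bits` is made up for this example.

```python
import struct
import sys


def bits(x: float) -> str:
    """Return the big-endian IEEE 754 bit pattern of x as a hex string."""
    return struct.pack(">d", x).hex()


SYSMIS = -sys.float_info.max   # -DBL_MAX
HIGHEST = sys.float_info.max   # DBL_MAX
# Original LOWEST encoding: the pattern quoted in the text above.
LOWEST_OLD = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))[0]

print(bits(SYSMIS))         # ffefffffffffffff
print(bits(HIGHEST))        # 7fefffffffffffff
print(LOWEST_OLD > SYSMIS)  # True: one step closer to zero than SYSMIS
```

A reader that sees `-DBL_MAX` as the low end of a missing value range can therefore treat it as LOWEST, since SYSMIS never appears there.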
+
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings.  The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}).  The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
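
The lookup order described above might be sketched as follows in Python. The function names `encoding_family` and `encoding_hint` are hypothetical, and the magic byte values are the ones this document lists for `rec_type` in the file header record; nothing here is PSPP's actual implementation.

```python
# Byte values as listed for rec_type in the file header record.
ASCII_MAGICS = {b"\x24\x46\x4c\x32": "$FL2", b"\x24\x46\x4c\x33": "$FL3"}
EBCDIC_MAGICS = {b"\x5b\xc6\xd3\xf2": "$FL2"}


def encoding_family(rec_type: bytes):
    """Classify the first 4 bytes of a file as ASCII- or EBCDIC-based."""
    if rec_type in ASCII_MAGICS:
        return "ASCII"
    if rec_type in EBCDIC_MAGICS:
        return "EBCDIC"
    return None  # not recognized as a system file


def encoding_hint(encoding_record, character_code):
    """Prefer the character encoding record, then character_code."""
    if encoding_record:
        return ("encoding", encoding_record)
    if character_code is not None:
        return ("character_code", character_code)
    return None  # neither record present; the caller must guess


print(encoding_family(b"$FL2"))       # ASCII
print(encoding_hint(None, 1252))      # ('character_code', 1252)
```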
+
+@menu
+* System File Record Structure::
+* File Header Record::
+* Variable Record::
+* Value Labels Records::
+* Document Record::
+* Machine Integer Info Record::
+* Machine Floating-Point Info Record::
+* Multiple Response Sets Records::
+* Extra Product Info Record::
+* Variable Display Parameter Record::
+* Long Variable Names Record::
+* Very Long String Record::
+* Character Encoding Record::
+* Long String Value Labels Record::
+* Long String Missing Values Record::
+* Data File and Variable Attributes Records::
+* Extended Number of Cases Record::
+* Other Informational Records::
+* Dictionary Termination Record::
+* Data Record::
+@end menu
+
+@node System File Record Structure
+@section System File Record Structure
+
+System files are divided into records with the following format:
+
+@example
+int32               type;
+char                data[];
+@end example
+
+This header does not identify the length of the @code{data} or any
+information about what it contains, so the system file reader must
+understand the format of @code{data} based on @code{type}.  However,
+records with type 7, called @dfn{extension records}, have a stricter
+format:
+
+@example
+int32               type;
+int32               subtype;
+int32               size;
+int32               count;
+char                data[size * count];
+@end example
+
+@table @code
+@item int32 type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  This value identifies a particular kind of extension
+record.
  
-System files are divided into records, each of which begins with a
-4-byte record type, usually regarded as an @code{int32}.
+@item int32 size;
+The size of each piece of data that follows the header, in bytes.
+Known extension records use 1, 4, or 8, for @code{char}, @code{int32},
+and @code{flt64} format data, respectively.
  
-The records must appear in the following order:
+@item int32 count;
+The number of pieces of data that follow the header.
+
+@item char data[size * count];
+Data, whose format and interpretation depend on the subtype.
+@end table
+
+An extension record contains exactly @code{size * count} bytes of
+data, which allows a reader that does not understand an extension
+record to skip it.  Extension records provide only nonessential
+information, so this allows for files written by newer software to
+preserve backward compatibility with older or less capable readers.
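
A minimal Python sketch of the skipping behavior described above follows. It assumes that the stream is positioned just after the `type` field of a type 7 record and that integers are little-endian; a real reader would detect the byte order first. The function name `read_or_skip_extension` is invented for illustration.

```python
import io
import struct


def read_or_skip_extension(stream, known_subtypes, byteorder="<"):
    """Read subtype/size/count, then return the payload or skip it.

    Assumes the 4-byte record type (7) has already been consumed.
    """
    subtype, size, count = struct.unpack(byteorder + "3i", stream.read(12))
    payload_len = size * count  # exactly this many data bytes follow
    if subtype in known_subtypes:
        return subtype, stream.read(payload_len)
    stream.seek(payload_len, io.SEEK_CUR)  # unknown subtype: skip it
    return subtype, None


# An extension record body with a made-up subtype 99, followed by other data.
raw = struct.pack("<3i", 99, 1, 5) + b"hello" + b"NEXT"
stream = io.BytesIO(raw)
print(read_or_skip_extension(stream, known_subtypes={3, 4}))  # (99, None)
print(stream.read(4))                                         # b'NEXT'
```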
+
+Records in a system file must appear in the following order:
  
  @itemize @bullet
  @item
@@ -58,24 +168,13 @@ if present.
  Document record, if present.
  
  @item
-Any of the following records, if present, in any order:
-
-@itemize @minus
-@item
-Machine integer info record.
-
-@item
-Machine floating-point info record.
-
-@item
-Variable display parameter record.
-
-@item
-Long variable names record.
+Extension (type 7) records, in ascending numerical order of their
+subtypes.
  
-@item
-Miscellaneous informational records.
-@end itemize
+System files written by SPSS include at most one of each kind of
+extension record.  This is generally true of system files written by
+other software as well, with known exceptions noted below in the
+individual sections about each type of record.
  
  @item
  Dictionary termination record.
@@ -84,38 +183,26 @@ Dictionary termination record.
  Data record.
  @end itemize
  
-Each type of record is described separately below.
+We advise authors of programs that read system files to tolerate
+format variations.  Various kinds of misformatting and corruption have
+been observed in system files written by SPSS and other software
+alike.  In particular, because extension records provide nonessential
+information, it is generally better to ignore an extension record
+entirely than to refuse to read a system file.
  
-@menu
-* File Header Record::
-* Variable Record::
-* Value Labels Records::
-* Document Record::
-* Machine Integer Info Record::
-* Machine Floating-Point Info Record::
-* Variable Display Parameter Record::
-* Long Variable Names Record::
-* Very Long String Record::
-* Character Encoding Record::
-* Long String Value Labels Record::
-* Data File and Variable Attributes Records::
-* Miscellaneous Informational Records::
-* Dictionary Termination Record::
-* Data Record::
-@end menu
+The following sections describe the known kinds of records.
  
  @node File Header Record
  @section File Header Record
  
-The file header is always the first record in the file.  It has the
-following format:
+A system file begins with the file header, with the following format:
  
  @example
  char                rec_type[4];
  char                prod_name[60];
  int32               layout_code;
  int32               nominal_case_size;
-int32               compressed;
+int32               compression;
  int32               weight_index;
  int32               ncases;
  flt64               bias;
@@ -127,7 +214,15 @@ char                padding[3];
  
  @table @code
  @item char rec_type[4];
-Record type code, set to @samp{$FL2}.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the same character encoding
+as other strings.  Thus, in a file with an ASCII-based character encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}.  (No EBCDIC-based ZLIB-compressed files have been observed.)
  
  @item char prod_name[60];
  Product identification string.  This always begins with the characters
@@ -137,6 +232,12 @@ pspp 0.1.4 - sparc-sun-solaris2.5.2}.  The string is truncated if it
  would be longer than 60 characters; otherwise it is padded on the right
  with spaces.
  
+The product name field allows readers to behave differently based on
+quirks in the way that particular software writes system files.
+@xref{Value Labels Records}, for the detail of the quirk that the PSPP
+system file reader tolerates in files written by ReadStat, which has
+@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.
+
  @anchor{layout_code}
  @item int32 layout_code;
  Normally set to 2, although a few system files have been spotted in
@@ -147,12 +248,15 @@ file's integer endianness (@pxref{System File Format}).
  Number of data elements per case.  This is the number of variables,
  except that long string variables add extra data elements (one for every
  8 characters after the first 8).  However, string variables do not
-contribute to this value beyond the first 255 bytes.   Further, system
-files written by some systems set this value to -1.  In general, it is
+contribute to this value beyond the first 255 bytes.   Further, some
+software always writes -1 or 0 in this field.  In general, it is
  unsafe for systems reading system files to rely upon this value.
  
-@item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+@item int32 compression;
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed.  This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
  
  @item int32 weight_index;
  If one of the variables in the data set is used as a weighting
@@ -166,7 +270,7 @@ In the general case it is not possible to determine the number of cases
  that will be output to a system file at the time that the header is
  written.  The way that this is dealt with is by writing the entire
  system file, including the header, then seeking back to the beginning of
-the file and writing just the @code{ncases} field.  For `files' in which
+the file and writing just the @code{ncases} field.  For files in which
  this is not valid, the seek operation fails.  In this case,
  @code{ncases} remains -1.
  
@@ -197,6 +301,10 @@ field is arbitrarily set to @samp{00:00:00}.
  File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
  PSPP Users Guide}).  Padded on the right with spaces.
  
+A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
+CR-only line ends in this field, rather than the more usual LF-only or
+CR LF line ends.
+
  @item char padding[3];
  Ignored padding bytes to make the structure a multiple of 32 bits in
  length.  Set to zeros.
@@ -218,16 +326,20 @@ so readers should take care to parse dummy variable records in the
  same way as other variable records.
  
  @anchor{Dictionary Index}
-The @dfn{dictionary index} of a variable is its offset in the set of
+The @dfn{dictionary index} of a variable is a 1-based offset in the set of
  variable records, including dummy variable records for long string
-variables.  The first variable record has a dictionary index of 0, the
-second has a dictionary index of 1, and so on.
+variables.  The first variable record has a dictionary index of 1, the
+second has a dictionary index of 2, and so on.
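
Because dictionary indexes are 1-based while most in-memory lists are 0-based, a reader has to subtract one when resolving them. The Python sketch below shows the idea; the list contents and the function name are placeholders, not PSPP data structures.

```python
def record_for_dictionary_index(variable_records, dict_index):
    """Resolve a 1-based dictionary index against a 0-based list."""
    if not 1 <= dict_index <= len(variable_records):
        raise IndexError(f"dictionary index {dict_index} out of range")
    return variable_records[dict_index - 1]


# A 20-byte string variable occupies three variable records (one real
# record plus two dummy continuation records), so AGE has index 4.
records = ["NAME", "<dummy>", "<dummy>", "AGE"]
print(record_for_dictionary_index(records, 1))  # NAME
print(record_for_dictionary_index(records, 4))  # AGE
```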
  
  The system file format does not directly support string variables
  wider than 255 bytes.  Such very long string variables are represented
  by a number of narrower string variables.  @xref{Very Long String
  Record}, for details.
  
+A system file should contain at least one variable and thus at least
+one variable record, but system files have been observed in the wild
+without any variables (thus, no data either).
+
  @example
  int32               rec_type;
  int32               type;
@@ -266,6 +378,10 @@ respectively.  If the variable has a range for missing variables, set to
  -2; if the variable has a range for missing variables plus a single
  discrete value, set to -3.
  
+A long string variable always has the value 0 here.  A separate record
+indicates missing values for long string variables (@pxref{Long String
+Missing Values Record}).
+
  @item int32 print;
  Print format for this variable.  See below.
  
@@ -278,10 +394,19 @@ the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
  (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
  stops (@samp{.}).  The variable name is padded on the right with spaces.
  
+The @samp{name} fields should be unique within a system file.  System
+files written by SPSS that contain very long string variables with
+similar names sometimes contain duplicate names that are later
+eliminated by resolving the very long string names (@pxref{Very Long
+String Record}).  PSPP handles duplicates by assigning them new,
+unique names.
+
  @item int32 label_len;
  This field is present only if @code{has_var_label} is set to 1.  It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label.  The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels.  PSPP
+accepts labels of any length.
  
  @item char label[];
  This field is present only if @code{has_var_label} is set to 1.  It has
@@ -289,17 +414,23 @@ length @code{label_len}, rounded up to the nearest multiple of 32 bits.
  The first @code{label_len} characters are the variable's variable label.
  
  @item flt64 missing_values[];
-This field is present only if @code{n_missing_values} is not 0.  It has
-the same number of elements as the absolute value of
-@code{n_missing_values}.  For discrete missing values, each element
-represents one missing value.  When a range is present, the first
-element denotes the minimum value in the range, and the second element
-denotes the maximum value in the range.  When a range plus a value are
-present, the third element denotes the additional discrete missing
-value.  HIGHEST and LOWEST are indicated as described in the chapter
-introduction.
+This field is present only if @code{n_missing_values} is nonzero.  It
+has the same number of 8-byte elements as the absolute value of
+@code{n_missing_values}.  Each element is interpreted as a number for
+numeric variables (with HIGHEST and LOWEST indicated as described in
+the chapter introduction).  For string variables of width less than 8
+bytes, elements are right-padded with spaces; for string variables
+wider than 8 bytes, only the first 8 bytes of each missing value are
+specified, with the remainder implicitly all spaces.
+
+For discrete missing values, each element represents one missing
+value.  When a range is present, the first element denotes the minimum
+value in the range, and the second element denotes the maximum value
+in the range.  When a range plus a value are present, the third
+element denotes the additional discrete missing value.
  @end table
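
The interpretation of `n_missing_values` and `missing_values` might be sketched in Python as below. The return shape (a dict with `range` and `discrete` keys) and the function name are choices made for this example only. Elements are passed in already decoded, as floats for numeric variables or 8-byte strings for string variables.

```python
def interpret_missing(n_missing_values, elements):
    """Split missing_values[] into an optional range and discrete values."""
    if n_missing_values not in (0, 1, 2, 3, -2, -3):
        raise ValueError(f"invalid n_missing_values {n_missing_values}")
    if len(elements) != abs(n_missing_values):
        raise ValueError("element count must equal abs(n_missing_values)")
    if n_missing_values >= 0:
        # 1, 2, or 3 discrete values (or none at all).
        return {"range": None, "discrete": list(elements)}
    low, high = elements[0], elements[1]
    # -3 means a range plus one discrete value in the third element.
    return {"range": (low, high), "discrete": list(elements[2:])}


print(interpret_missing(2, [98.0, 99.0]))
print(interpret_missing(-3, [1.0, 5.0, 99.0]))
```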
  
+@anchor{System File Output Formats}
  The @code{print} and @code{write} members of sysfile_variable are output
  formats coded into @code{int32} types.  The least-significant byte
  of the @code{int32} represents the number of decimal places, and the
@@ -393,9 +524,18 @@ Format types are defined as follows:
  @tab @code{EDATE}
  @item 39
  @tab @code{SDATE}
+@item 40
+@tab @code{MTIME}
+@item 41
+@tab @code{YMDHMS}
  @end multitable
  @end quotation
  
+A few system files have been observed in the wild with invalid
+@code{write} fields, in particular with value 0.  Readers should
+probably treat invalid @code{print} or @code{write} fields as some
+default format.
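
The fallback suggested above could look like the following Python sketch. Several things here are assumptions: this excerpt only states that the least-significant byte holds the number of decimal places, so the placement of the width in the next byte and the format type in the byte after that follows the full document rather than the lines shown; the `FORMAT_TYPES` table lists only the four codes visible in this hunk; and the `F8.2` default is a hypothetical choice, not something the text prescribes.

```python
# Only the format type codes visible in this excerpt; the full table is longer.
FORMAT_TYPES = {38: "EDATE", 39: "SDATE", 40: "MTIME", 41: "YMDHMS"}


def decode_format(code, default=("F", 8, 2)):
    """Decode a print/write int32 into (type name, width, decimals).

    Assumed layout: byte 0 = decimals, byte 1 = width, byte 2 = type.
    Unknown or invalid type codes yield the (hypothetical) default.
    """
    decimals = code & 0xFF
    width = (code >> 8) & 0xFF
    type_code = (code >> 16) & 0xFF
    name = FORMAT_TYPES.get(type_code)
    if name is None:
        return default
    return (name, width, decimals)


print(decode_format((40 << 16) | (8 << 8)))  # ('MTIME', 8, 0)
print(decode_format(0))                      # ('F', 8, 2), the fallback
```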
+
  @node Value Labels Records
  @section Value Labels Records
  
@@ -404,13 +544,21 @@ numeric and short string variables only.  Long string variables may
  have value labels, but their value labels are recorded using a
  different record type (@pxref{Long String Value Labels Record}).
  
+ReadStat (@pxref{File Header Record}) writes value labels that label a
+single value more than once.  In more detail, it emits value labels
+whose values are longer than the string variable's width and that
+become identical when truncated to that width, e.g.@: labels for
+values @code{ABC123} and @code{ABC456} for a string variable with
+width 3.  For files written by this software, PSPP ignores such
+labels.
+
  The value label record has the following format:
  
  @example
  int32               rec_type;
  int32               label_count;
  
-/* @r{Repeated @code{label_cnt} times}. */
+/* @r{Repeated @code{label_count} times}. */
  char                value[8];
  char                label_len;
  char                label[];
@@ -434,7 +582,9 @@ in length.  Its type and width cannot be determined until the
  following value label variables record (see below) is read.
  
  @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes.  The documented maximum length varies
+from 60 to 120 based on SPSS version.  PSPP supports value labels up
+to 255 bytes long.
  
  @item char label[];
  @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -460,7 +610,7 @@ Number of variables that the associated value labels from the value
  label record are to be applied.
  
  @item int32 vars[];
-A list of dictionary indexes of variables to which to apply the value
+A list of 1-based dictionary indexes of variables to which to apply the value
  labels (@pxref{Dictionary Index}).  There are @code{var_count}
  elements.
  
@@ -483,7 +633,8 @@ char                lines[][80];
  Record type.  Always set to 6.
  
  @item int32 n_lines;
-Number of lines of documents present.
+Number of lines of documents present.  This should be greater than
+zero, but ReadStat writes system files with zero @code{n_lines}.
  
  @item char lines[][80];
  Document lines.  The number of elements is defined by @code{n_lines}.
@@ -547,20 +698,53 @@ Floating point representation code.  For IEEE 754 systems this is 1.
  IBM 370 sets this to 2, and DEC VAX E to 3.
  
  @item int32 compression_code;
-Compression code.  Always set to 1.
+Compression code.  Always set to 1, regardless of whether or how the
+file is compressed.
  
  @item int32 endianness;
  Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
  
  @item int32 character_code;
-@anchor{character-code}
-Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code.  The following values have
+been actually observed in system files:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS for Unix and Windows always wrote value 2 in this
+field, regardless of the encoding in use.  Newer versions also write
+the character encoding as a string (see @ref{Character Encoding
+Record}).
  @end table
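
One way a reader written in Python might turn `character_code` into a codec name is sketched below. The mapping to Python codec names is an assumption of this example, as is the function name. Codes 1, 3, and 4 return `None` because the text does not pin them to one specific code page. Since old SPSS versions wrote 2 regardless of the real encoding, a result of `ascii` should be treated as a weak hint only.

```python
import codecs

# Observed values mapped to Python codec names (an assumption of this sketch).
OBSERVED = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "iso8859-1",
    65001: "utf-8",
}


def codec_for_character_code(code):
    """Return a Python codec name for character_code, or None if unknown."""
    if code in OBSERVED:
        return OBSERVED[code]
    if code in (1, 3, 4):
        return None  # EBCDIC, 8-bit "ASCII", DEC Kanji: no single codec
    candidate = f"cp{code}"  # try it as a Windows code page number
    try:
        codecs.lookup(candidate)
    except LookupError:
        return None
    return candidate


print(codec_for_character_code(1252))   # cp1252
print(codec_for_character_code(65001))  # utf-8
print(codec_for_character_code(1))      # None
```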
  
  @node Machine Floating-Point Info Record
@@ -595,13 +779,185 @@ Size of each piece of data in the data part, in bytes.  Always set to 8.
  Number of pieces of data in the data part.  Always set to 3.
  
  @item flt64 sysmis;
-The system missing value.
+@itemx flt64 highest;
+@itemx flt64 lowest;
+The system missing value, the value used for HIGHEST in missing
+values, and the value used for LOWEST in missing values, respectively.
+@xref{System File Format}, for more information.
+
+The SPSSWriter library in PHP, which identifies itself as @code{FOM
+SPSS 1.0.0} in the file header record @code{prod_name} field, writes
+unexpected values to these fields, but it uses the same values
+consistently throughout the rest of the file.
+@end table
+
+@node Multiple Response Sets Records
+@section Multiple Response Sets Records
+
+The system file format has two different types of records that
+represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
+Guide}).  The first type of record describes multiple response sets
+that can be understood by SPSS before version 14.  The second type of
+record, with a closely related format, is used for multiple dichotomy
+sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
+version 14.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                mrsets[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Set to 7 for records that describe multiple response
+sets understood by SPSS before version 14, or to 19 for records that
+describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
+feature added in version 14.
+
+@item int32 size;
+The size of each element in the @code{mrsets} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{mrsets}.
+
+@item char mrsets[];
+Zero or more line feeds (byte 0x0a), followed by a series of multiple
+response sets, each of which consists of the following:
+
+@itemize @bullet
+@item
+The set's name (an identifier that begins with @samp{$}), in mixed
+upper and lower case.
+
+@item
+An equals sign (@samp{=}).
+
+@item
+@samp{C} for a multiple category set, @samp{D} for a multiple
+dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
+multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
+
+@item
+For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
+space, followed by a number expressed as decimal digits, followed by a
+space.  If LABELSOURCE=VARLABEL was specified on MRSETS, then the
+number is 11; otherwise it is 1.@footnote{This part of the format may
+not be fully understood, because only a single example of each
+possibility has been examined.}
+
+@item
+For either kind of multiple dichotomy set, the counted value, as a
+positive integer count specified as decimal digits, followed by a
+space, followed by as many string bytes as specified in the count.  If
+the set contains numeric variables, the string consists of the counted
+integer value expressed as decimal digits.  If the set contains string
+variables, the string contains the counted string value.  Either way,
+the string may be padded on the right with spaces (older versions of
+SPSS seem to always pad to a width of 8 bytes; newer versions don't).
+
+@item
+A space.
+
+@item
+The multiple response set's label, using the same format as for the
+counted value for multiple dichotomy sets.  A string of length 0 means
+that the set does not have a label.  A string of length 0 is also
+written if LABELSOURCE=VARLABEL was specified.
+
+@item
+A space.
+
+@item
+The short names of the variables in the set, converted to lowercase,
+each separated from the previous by a single space.
+
+Even though a multiple response set must have at least two variables,
+some system files contain multiple response sets with no variables or
+one variable.  The source and meaning of these multiple response sets are
+unknown.  (Perhaps they arise from creating a multiple response set
+then deleting all the variables that it contains?)
+
+@item
+One line feed (byte 0x0a).  Sometimes multiple line feeds, even
+hundreds, are present.
+@end itemize
+@end table
  
-@item flt64 highest;
-The value used for HIGHEST in missing values.
+Example: Given appropriate variable definitions, consider the
+following MRSETS command:
  
-@item flt64 lowest;
-The value used for LOWEST in missing values.
+@example
+MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
+       /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
+       /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
+       /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
+        VARIABLES=k l m VALUE=34
+       /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
+        VARIABLES=n o p VALUE='choice'.
+@end example
+
+The above would generate the following multiple response set record of
+subtype 7:
+
+@example
+$a=C 10 my mcgroup a b c
+$b=D2 55 0  g e f d
+$c=D3 Yes 10 mdgroup #2 h i j
+@end example
+
+It would also generate the following multiple response set record with
+subtype 19:
+
+@example
+$d=E 1 2 34 13 third mdgroup k l m
+$e=E 11 6 choice 0  n o p
+@end example
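
The example records above can be taken apart with a small Python parser such as the following sketch. It works on one line of already-decoded text and assumes a single-byte encoding so that byte counts equal character counts; a real reader would operate on bytes. The function name and the dict keys are invented for illustration.

```python
def parse_mrset(line):
    """Parse one multiple response set line into its parts."""
    name, rest = line.split("=", 1)
    kind, pos = rest[0], 1  # 'C', 'D', or 'E'

    def counted_string(pos):
        # "<count> <count bytes>": returns the string and the new position.
        end = rest.index(" ", pos)
        n = int(rest[pos:end])
        start = end + 1
        return rest[start:start + n], start + n

    label_source = None
    if kind == "E":
        # " <number> " (1 or 11) precedes the counted value.
        end = rest.index(" ", pos + 1)
        label_source = int(rest[pos + 1:end])
        pos = end + 1
    counted = None
    if kind in "DE":
        counted, pos = counted_string(pos)
    pos += 1  # the space before the label
    label, pos = counted_string(pos)
    return {"name": name, "kind": kind, "label_source": label_source,
            "counted": counted, "label": label,
            "variables": rest[pos:].split()}


print(parse_mrset("$b=D2 55 0  g e f d"))
print(parse_mrset("$e=E 11 6 choice 0  n o p"))
```

Splitting the variable list on whitespace also tolerates the sets with zero or one variable mentioned above.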
+
+@node Extra Product Info Record
+@section Extra Product Info Record
+
+This optional record appears to contain a text string that describes
+the program that wrote the file and the source of the data.  (This is
+redundant with the file label and product info found in the file
+header record.)
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                info[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 10.
+
+@item int32 size;
+The size of each element in the @code{info} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{info}.
+
+@item char info[];
+A text string.  A product that identifies itself as @code{VOXCO
+INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
+more usual LF-only or CR LF line ends.
  @end table
  
  @node Variable Display Parameter Record
@@ -645,18 +1001,24 @@ members are as follows:
  
  @table @code
  @item int32 measure;
-The measurement type of the variable:
+The measurement level of the variable:
  @table @asis
+@item 0
+Unknown
  @item 1
-Nominal Scale
+Nominal
  @item 2
-Ordinal Scale
+Ordinal
  @item 3
-Continuous Scale
+Scale
  @end table
  
-SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
-as nominal scale.
+An ``unknown'' @code{measure} of 0 means that the variable was created
+in some way that doesn't make the measurement level clear, e.g.@: with
+a @code{COMPUTE} transformation.  PSPP sets the measurement level the
+first time it reads the data using the rules documented in
+@ref{Measurement Level,,,pspp, PSPP Users Guide}, so this should
+rarely appear.
  
  @item int32 width;
  The width of the display column for the variable in characters.
@@ -836,12 +1198,29 @@ The size of each element in the @code{encoding} member. Always set to 1.
  The total number of bytes in @code{encoding}.
  
  @item char encoding[];
-The name of the character encoding.  Normally this will be an official IANA characterset name or alias.
+The name of the character encoding.  Normally this will be an official
+IANA character set name or alias.
  See @url{http://www.iana.org/assignments/character-sets}.
+Character set names are not case-sensitive, but SPSS appears to write
+them in all-uppercase.
  @end table
  
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software.  See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
  
  @node Long String Value Labels Record
  @section Long String Value Labels Record
@@ -916,6 +1295,74 @@ between 0 and 120, is the number of bytes in @code{label}.  The
  @end table
  @end table
  
+@node Long String Missing Values Record
+@section Long String Missing Values Record
+
+This record, if present, specifies missing values for long string
+variables.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Repeated up to exactly @code{count} bytes.} */
+int32               var_name_len;
+char                var_name[];
+char                n_missing_values;
+long_string_missing_value   values[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 22.
+
+@item int32 size;
+Always set to 1.
+
+@item int32 count;
+The number of bytes following the header until the next header.
+
+@item int32 var_name_len;
+@itemx char var_name[];
+The number of bytes in the name of the long string variable that has
+missing values, plus the variable name itself, which consists of
+exactly @code{var_name_len} bytes.  The variable name is not padded to
+any particular boundary, nor is it null-terminated.
+
+@item char n_missing_values;
+The number of missing values, either 1, 2, or 3.  (This is, unusually,
+a single byte instead of a 32-bit number.)
+
+@item long_string_missing_value values[];
+The missing values themselves.  This array contains exactly
+@code{n_missing_values} elements, each of which has the following
+substructure:
+
+@example
+int32               value_len;
+char                value[];
+@end example
+
+@table @code
+@item int32 value_len;
+The length of the missing value string, in bytes.  This value should
+be 8, for three reasons: long string variables are at least 8 bytes
+wide (by definition); only the first 8 bytes of a long string
+variable's missing values are allowed to be non-spaces; and any spaces
+within the first 8 bytes are included in the missing value here.
+
+@item char value[];
+The missing value string, exactly @code{value_len} bytes, without
+any padding or null terminator.
+@end table
+@end table
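The layout above can be walked with a short loop.  The following is a minimal sketch in Python, not PSPP's actual reader: the function name @code{parse_long_string_missing_values}, the @code{endian} parameter, and the sample bytes are assumptions made for illustration, and the input is taken to be the @code{count} bytes that follow the 16-byte header.

```python
import struct


def parse_long_string_missing_values(body, endian="<"):
    """Parse the `count` bytes that follow the record header.

    Returns a list of (var_name, [missing_value, ...]) tuples.  Names and
    values are returned as raw bytes because their encoding depends on the
    rest of the file.  (Hypothetical helper, for illustration only.)
    """
    records = []
    ofs = 0
    while ofs < len(body):
        # int32 var_name_len, then exactly that many bytes of name.
        (name_len,) = struct.unpack_from(endian + "i", body, ofs)
        ofs += 4
        var_name = body[ofs:ofs + name_len]
        ofs += name_len

        # n_missing_values is a single byte, not an int32.
        n_missing = body[ofs]
        ofs += 1

        values = []
        for _ in range(n_missing):
            # int32 value_len, then exactly that many bytes of value.
            (value_len,) = struct.unpack_from(endian + "i", body, ofs)
            ofs += 4
            values.append(body[ofs:ofs + value_len])
            ofs += value_len
        records.append((var_name, values))
    return records


# Made-up sample: variable "S1" with one 8-byte missing value.
example_body = (struct.pack("<i", 2) + b"S1" + bytes([1])
                + struct.pack("<i", 8) + b"abc     ")
parsed = parse_long_string_missing_values(example_body)
```

The sketch does no validation; a robust reader would also check that @code{n_missing_values} is 1, 2, or 3 and that no length runs past the end of the record.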
+
  @node Data File and Variable Attributes Records
  @section Data File and Variable Attributes Records
  
@@ -953,7 +1400,7 @@ The total number of bytes in @code{attributes}.
  @item char attributes[];
  The attributes, in a text-based format.
  
-In record type 17, this field contains a single attribute set.  An
+In record subtype 17, this field contains a single attribute set.  An
  attribute set is a sequence of one or more attributes concatenated
  together.  Each attribute consists of a name, which has the same
  syntax as a variable name, followed by, inside parentheses, a sequence
@@ -965,12 +1412,17 @@ way to embed a line feed in a value.  There is no distinction between
  an attribute with a single value and an attribute array with one
  element.
  
-In record type 18, this field contains a sequence of one or more
+In record subtype 18, this field contains a sequence of one or more
  variable attribute sets.  If more than one variable attribute set is
  present, each one after the first is delimited from the previous by
-@code{/}.  Each variable attribute set consists of a variable name,
+@code{/}.  Each variable attribute set consists of a long
+variable name,
  followed by @code{:}, followed by an attribute set with the same
-syntax as on record type 17.
+syntax as on record subtype 17.
+
+System files written by @code{Stata 14.1/-savespss- 1.77 by
+S.Radyakin} may include multiple records with subtype 18, one per
+variable that has variable attributes.
  
  The total length is @code{count} bytes.
  @end table
@@ -989,28 +1441,52 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
  will contain a variable attribute record with the following contents:
  
  @example
-00000000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
-00000010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
-00000020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
-00000030  0a 29                                             |.)              |
+0000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
+0010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
+0020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
+0030  0a 29                                             |.)              |
  @end example
  
-@node Miscellaneous Informational Records
-@section Miscellaneous Informational Records
+@menu
+* Variable Roles::
+@end menu
  
-Some specific types of miscellaneous informational records are
-documented here, but others are known to exist.  PSPP ignores unknown
-miscellaneous informational records when reading system files.
+@node Variable Roles
+@subsection Variable Roles
+
+A variable's role is represented as an attribute named @code{$@@Role}.
+This attribute has a single element whose values and their meanings
+are:
+
+@table @code
+@item 0
+Input.  This, the default, is the most common role.
+@item 1
+Output.
+@item 2
+Both.
+@item 3
+None.
+@item 4
+Partition.
+@item 5
+Split.
+@end table
+
+@node Extended Number of Cases Record
+@section Extended Number of Cases Record
+
+The file header record expresses the number of cases in the system
+file as an int32 (@pxref{File Header Record}).  This record allows the
+number of cases in the system file to be expressed as a 64-bit number.
  
  @example
-/* @r{Header.} */
  int32               rec_type;
  int32               subtype;
  int32               size;
  int32               count;
-
-/* @r{Exactly @code{size * count} bytes of data.} */
-char                data[];
+int64               unknown;
+int64               ncases64;
  @end example
  
  @table @code
@@ -1018,21 +1494,49 @@ char                data[];
  Record type.  Always set to 7.
  
  @item int32 subtype;
-Record subtype.  May take any value.  According to Aapi
-H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
-indicates date info (probably related to USE).
+Record subtype.  Always set to 16.
  
  @item int32 size;
-Size of each piece of data in the data part.  Should have the value 1,
-4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data,
-respectively.
+Size of each element.  Always set to 8.
  
  @item int32 count;
-Number of pieces of data in the data part.
+Number of pieces of data in the data part.  Always set to 2.
+
+@item int64 unknown;
+Meaning unknown.  Always set to 1.
  
-@item char data[];
-Arbitrary data.  There must be @code{size} times @code{count} bytes of
-data.
+@item int64 ncases64;
+Number of cases in the file as a 64-bit integer.  Presumably this
+could be -1 to indicate that the number of cases is unknown, for the
+same reason as @code{ncases} in the file header record, but this has
+not been observed in the wild.
+@end table
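Because every field in this record is fixed-size, it can be decoded in one step.  The sketch below uses Python's @code{struct} module; the function name @code{parse_extended_ncases} and the @code{endian} parameter are assumptions for illustration, and the checks simply mirror the "always set to" values listed above.

```python
import struct


def parse_extended_ncases(record, endian="<"):
    """Decode a 32-byte extended number of cases record.

    `record` is the whole record, header included.  Returns ncases64.
    (Hypothetical helper, for illustration only.)
    """
    # Four int32 header fields followed by two int64 values; with an
    # explicit byte order, struct inserts no alignment padding.
    rec_type, subtype, size, count, unknown, ncases64 = struct.unpack(
        endian + "iiiiqq", record)
    if (rec_type, subtype, size, count) != (7, 16, 8, 2):
        raise ValueError("not an extended number of cases record")
    return ncases64


# Made-up sample: a little-endian record claiming 5 billion cases,
# which would not fit in the int32 field of the file header.
sample = struct.pack("<iiiiqq", 7, 16, 8, 2, 1, 5_000_000_000)
ncases = parse_extended_ncases(sample)
```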
+
+@node Other Informational Records
+@section Other Informational Records
+
+This chapter documents many specific types of extension records,
+but others are known to exist.  PSPP ignores unknown
+extension records when reading system files.
+
+The following extension record subtypes have also been observed, with
+the following believed meanings:
+
+@table @asis
+@item 5
+A named variable set for use in the GUI (according to Aapi
+H@"am@"al@"ainen).
+
+@item 6
+Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
+
+@item 12
+A UUID in the format described in RFC 4122.  Only two examples have
+been observed, both written by SPSS 13, and in each case the UUID
+contained both upper and lower case.
+
+@item 24
+XML that describes how data in the file should be displayed on-screen.
  @end table
  
  @node Dictionary Termination Record
@@ -1057,22 +1561,23 @@ Ignored padding.  Should be set to 0.
  @node Data Record
  @section Data Record
  
-Data records must follow all other records in the system file.  There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed.  Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case.  The format of the data record varies depending on the
+value of @code{compression} in the file header record:
  
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
  the variable declared in the respective variable record (@pxref{Variable
  Record}).  Numeric values are given in @code{flt64} format; string
  values are literal characters string, padded on the right when
  necessary to fill out 8-byte units.
  
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record are divided into a series of 1-byte command
  codes.  These codes have meanings as described below:
  
  @table @asis
@@ -1089,6 +1594,10 @@ variable @code{bias} from the file header.  For example,
  code 105 with bias 100.0 (the normal value) indicates a numeric variable
  of value 5.
  
+A code of 0 (after subtracting the bias) in a string field encodes
+null bytes.  This is unusual, since a string field normally encodes
+text data, but it exists in real system files.
+
  @item 252
  End of file.  This code may or may not appear at the end of the data
  stream.  PSPP always outputs this code but its use is not required.
@@ -1107,8 +1616,125 @@ An 8-byte string value that is all spaces.
  The system-missing value.
  @end table
  
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253.  After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values.  The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64               zheader_ofs;
+int64               ztrailer_ofs;
+int64               ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer.  This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
+compressed data blocks.  Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice).  Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter.  The last ZLIB compressed data block
+ends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64               int_bias;
+int64               zero;
+int32               block_size;
+int32               n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression.  Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_len - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset.  Each block descriptor has the following
+format:
+
+@example
+int64               uncompressed_ofs;
+int64               compressed_ofs;
+int32               uncompressed_size;
+int32               compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1.  This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous descriptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block.  This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression.  This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
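The offset relationships stated for the ZLIB data trailer lend themselves to a consistency check.  The following Python sketch parses a trailer and verifies those relationships; the names @code{parse_ztrailer} and @code{check_blocks}, the @code{endian} parameter, and the sample numbers are assumptions for illustration and do not come from PSPP's source.

```python
import struct


def parse_ztrailer(trailer, endian="<"):
    """Split a ZLIB data trailer into its fixed header and descriptors.

    Returns (int_bias, block_size, blocks), where each element of blocks is
    (uncompressed_ofs, compressed_ofs, uncompressed_size, compressed_size).
    (Hypothetical helper, for illustration only.)
    """
    # Fixed header: int64 int_bias, int64 zero, int32 block_size,
    # int32 n_blocks (24 bytes in total).
    int_bias, _zero, block_size, n_blocks = struct.unpack_from(
        endian + "qqii", trailer, 0)
    if n_blocks != (len(trailer) - 24) // 24:
        raise ValueError("n_blocks disagrees with trailer length")
    # Each 24-byte descriptor has the same field widths as the header.
    blocks = [struct.unpack_from(endian + "qqii", trailer, 24 + 24 * i)
              for i in range(n_blocks)]
    return int_bias, block_size, blocks


def check_blocks(zheader_ofs, ztrailer_ofs, block_size, blocks):
    """Verify the offset chain described for the block descriptors."""
    expect_unc = zheader_ofs        # first uncompressed_ofs
    expect_cmp = zheader_ofs + 24   # first compressed_ofs
    for i, (unc_ofs, cmp_ofs, unc_size, cmp_size) in enumerate(blocks):
        if unc_ofs != expect_unc or cmp_ofs != expect_cmp:
            return False
        is_last = i == len(blocks) - 1
        # Every block but the last decompresses to exactly block_size.
        if unc_size > block_size or (not is_last and unc_size != block_size):
            return False
        expect_unc += unc_size
        expect_cmp += cmp_size
    # The final compressed block ends where the trailer begins.
    return expect_cmp == ztrailer_ofs


# Made-up sample: ZLIB header at offset 1000, two compressed blocks.
sample_trailer = (
    struct.pack("<qqii", -100, 0, 0x3FF000, 2)
    + struct.pack("<qqii", 1000, 1024, 0x3FF000, 500)
    + struct.pack("<qqii", 1000 + 0x3FF000, 1524, 100, 40))
int_bias, block_size, blocks = parse_ztrailer(sample_trailer)
consistent = check_blocks(1000, 1564, block_size, blocks)
```

A reader that trusts the descriptors could then seek to each @code{compressed_ofs}, read @code{compressed_size} bytes, and pass them to a ZLIB decompressor; the check above only confirms that the trailer agrees with itself.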
+
  @setfilename ignored