Patch #6262. New developers guide and resulting fixes and cleanups.

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi

new file mode 100644
index 0000000..193c660
--- /dev/null
+++ b/doc/dev/system-file-format.texi
@@ -0,0 +1,906 @@
+@node System File Format
+@appendix System File Format
+
+A system file encapsulates a set of cases and dictionary information
+that describes how they may be interpreted.  This chapter describes
+the format of a system file.
+
+System files use three data types: 8-bit characters, 32-bit integers,
+and 64-bit floating-point numbers, called here @code{char},
+@code{int32}, and @code{flt64}, respectively.  Data is not necessarily
+aligned on a word
+or double-word boundary: the long variable name record (@pxref{Long
+Variable Names Record}) and very long string records (@pxref{Very Long
+String Record}) have arbitrary byte length and can therefore cause all
+data coming after them in the file to be misaligned.
+
+Integer data in system files may be big-endian or little-endian.  A
+reader may detect the endianness of a system file by examining
+@code{layout_code} in the file header record
+(@pxref{layout_code,,@code{layout_code}}).
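+
+As a rough illustration of this check, the following C sketch decodes
+the four raw bytes of @code{layout_code} in both byte orders and
+reports which order yields 2 or 3, the values described for that
+field.  The names @code{sfm_endian} and
+@code{sfm_detect_int_endianness} are hypothetical and are not taken
+from the PSPP sources.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sfm_endian { SFM_ENDIAN_UNKNOWN, SFM_ENDIAN_BIG, SFM_ENDIAN_LITTLE };

/* Hypothetical helper: guesses the integer byte order of a system file
   from the 4 raw bytes of layout_code, which should decode to 2 or 3. */
enum sfm_endian
sfm_detect_int_endianness (const unsigned char raw[4])
{
  assert (raw != NULL);

  uint32_t be = ((uint32_t) raw[0] << 24) | ((uint32_t) raw[1] << 16)
                | ((uint32_t) raw[2] << 8) | (uint32_t) raw[3];
  uint32_t le = ((uint32_t) raw[3] << 24) | ((uint32_t) raw[2] << 16)
                | ((uint32_t) raw[1] << 8) | (uint32_t) raw[0];

  if (be == 2 || be == 3)
    return SFM_ENDIAN_BIG;
  if (le == 2 || le == 3)
    return SFM_ENDIAN_LITTLE;
  return SFM_ENDIAN_UNKNOWN;
}
```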
+
+Floating-point data in system files may nominally be in IEEE 754, IBM,
+or VAX formats.  A reader may detect the floating-point format in use
+by examining @code{bias} in the file header record
+(@pxref{bias,,@code{bias}}).
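+
+A minimal sketch of such a check, covering only the two IEEE 754 byte
+orders, compares the raw bytes of @code{bias} against the encoding of
+100.0 (@code{0x4059000000000000}).  It assumes that @code{bias} is 100
+and does not attempt to recognize the IBM or VAX encodings.  The
+identifiers are hypothetical.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sfm_float_format
  {
    SFM_FLOAT_UNKNOWN,
    SFM_FLOAT_IEEE_BE,          /* IEEE 754, big-endian. */
    SFM_FLOAT_IEEE_LE           /* IEEE 754, little-endian. */
  };

/* Hypothetical helper: classifies the 8 raw bytes of the header's bias
   field, assuming that the value stored there is 100.0. */
enum sfm_float_format
sfm_detect_float_format (const unsigned char raw[8])
{
  /* 100.0 as an IEEE 754 double is 0x4059000000000000. */
  static const unsigned char be[8] = {0x40, 0x59, 0, 0, 0, 0, 0, 0};
  static const unsigned char le[8] = {0, 0, 0, 0, 0, 0, 0x59, 0x40};

  assert (raw != NULL);
  if (memcmp (raw, be, 8) == 0)
    return SFM_FLOAT_IEEE_BE;
  if (memcmp (raw, le, 8) == 0)
    return SFM_FLOAT_IEEE_LE;
  return SFM_FLOAT_UNKNOWN;
}
```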
+
+PSPP detects big-endian and little-endian integer formats in system
+files and translates as necessary.  PSPP also detects the
+floating-point format in use, as well as the endianness of IEEE 754
+floating-point numbers, and translates as needed.  However, only IEEE
+754 numbers with the same endianness as integer data in the same file
+have actually been observed in system files, and it is likely that
+other formats are obsolete or were never used.
+
+The PSPP system-missing value is represented by the largest possible
+negative number in the floating-point format (@code{-DBL_MAX}).  Two
+other values are important for use as missing values: @code{HIGHEST},
+represented by the largest possible positive number (@code{DBL_MAX}),
+and @code{LOWEST}, represented by the second-largest negative number
+(in IEEE 754 format, @code{0xffeffffffffffffe}).
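+
+The three special values can be produced as in the sketch below.  It
+assumes a host whose @code{double} is a 64-bit IEEE 754 number stored
+in the same byte order as its 64-bit integers; the function names are
+hypothetical.
+

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* System-missing value: the largest-magnitude negative number. */
double
sfm_sysmis (void)
{
  return -DBL_MAX;
}

/* HIGHEST: the largest positive number. */
double
sfm_highest (void)
{
  return DBL_MAX;
}

/* LOWEST: the second-largest-magnitude negative number, built from the
   IEEE 754 bit pattern given in the text.  Assumes an IEEE 754 host. */
double
sfm_lowest (void)
{
  const uint64_t bits = UINT64_C (0xffeffffffffffffe);
  double d;

  assert (sizeof d == sizeof bits);
  memcpy (&d, &bits, sizeof d);
  return d;
}
```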
+
+System files are divided into records, each of which begins with a
+4-byte record type, usually regarded as an @code{int32}.
+
+The records must appear in the following order:
+
+@itemize @bullet
+@item
+File header record.
+
+@item
+Variable records.
+
+@item
+All pairs of value labels records and value label variables records,
+if present.
+
+@item
+Document record, if present.
+
+@item
+Any of the following records, if present, in any order:
+
+@itemize @minus
+@item
+Machine integer info record.
+
+@item
+Machine floating-point info record.
+
+@item
+Variable display parameter record.
+
+@item
+Long variable names record.
+
+@item
+Miscellaneous informational records.
+@end itemize
+
+@item
+Dictionary termination record.
+
+@item
+Data record.
+@end itemize
+
+Each type of record is described separately below.
+
+@menu
+* File Header Record::
+* Variable Record::
+* Value Labels Records::
+* Document Record::
+* Machine Integer Info Record::
+* Machine Floating-Point Info Record::
+* Variable Display Parameter Record::
+* Long Variable Names Record::
+* Very Long String Record::
+* Miscellaneous Informational Records::
+* Dictionary Termination Record::
+* Data Record::
+@end menu
+
+@node File Header Record
+@section File Header Record
+
+The file header is always the first record in the file.  It has the
+following format:
+
+@example
+char                rec_type[4];
+char                prod_name[60];
+int32               layout_code;
+int32               nominal_case_size;
+int32               compressed;
+int32               weight_index;
+int32               ncases;
+flt64               bias;
+char                creation_date[9];
+char                creation_time[8];
+char                file_label[64];
+char                padding[3];
+@end example
+
+@table @code
+@item char rec_type[4];
+Record type code, set to @samp{$FL2}.
+
+@item char prod_name[60];
+Product identification string.  This always begins with the characters
+@samp{@@(#) SPSS DATA FILE}.  PSPP uses the remaining characters to
+give its version and the operating system name; for example, @samp{GNU
+pspp 0.1.4 - sparc-sun-solaris2.5.2}.  The string is truncated if it
+would be longer than 60 characters; otherwise it is padded on the right
+with spaces.
+
+@anchor{layout_code}
+@item int32 layout_code;
+Normally set to 2, although a few system files have been spotted in
+the wild with a value of 3 here.  PSPP uses this value to determine the
+file's integer endianness (@pxref{System File Format}).
+
+@item int32 nominal_case_size;
+Number of data elements per case.  This is the number of variables,
+except that long string variables add extra data elements (one for every
+8 characters after the first 8).  However, string variables do not
+contribute to this value beyond the first 255 bytes.   Further, system
+files written by some systems set this value to -1.  In general, it is
+unsafe for systems reading system files to rely upon this value.
+
+@item int32 compressed;
+Set to 1 if the data in the file is compressed, 0 otherwise.
+
+@item int32 weight_index;
+If one of the variables in the data set is used as a weighting
+variable, set to the dictionary index of that variable, plus 1
+(@pxref{Dictionary Index}).  Otherwise, set to 0.
+
+@item int32 ncases;
+Set to the number of cases in the file if it is known, or -1 otherwise.
+
+In the general case it is not possible to determine the number of cases
+that will be output to a system file at the time that the header is
+written.  A writer can handle this by writing the entire system file,
+including the header, then seeking back to the beginning of the file
+and rewriting just the @code{ncases} field.  For `files' that cannot be
+repositioned in this way, the seek operation fails and @code{ncases}
+remains -1.
+
+@anchor{bias}
+@item flt64 bias;
+Compression bias, ordinarily set to 100.  Only integers between
+@code{1 - bias} and @code{251 - bias} can be compressed.
+
+By assuming that its value is 100, PSPP uses @code{bias} to determine
+the file's floating-point format and endianness (@pxref{System File
+Format}).  If the compression bias is not 100, PSPP cannot auto-detect
+the floating-point format and assumes that it is IEEE 754 format with
+the same endianness as the system file's integers, which is correct
+for all known system files.
+
+@item char creation_date[9];
+Date of creation of the system file, in @samp{dd mmm yy}
+format, with the month as standard English abbreviations, using an
+initial capital letter and following with lowercase.  If the date is not
+available then this field is arbitrarily set to @samp{01 Jan 70}.
+
+@item char creation_time[8];
+Time of creation of the system file, in @samp{hh:mm:ss}
+format and using 24-hour time.  If the time is not available then this
+field is arbitrarily set to @samp{00:00:00}.
+
+@item char file_label[64];
+File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
+PSPP Users Guide}).  Padded on the right with spaces.
+
+@item char padding[3];
+Ignored padding bytes to make the structure a multiple of 32 bits in
+length.  Set to zeros.
+@end table
+
+@node Variable Record
+@section Variable Record
+
+There must be one variable record for each numeric variable and each
+string variable with width 8 bytes or less.  String variables wider
+than 8 bytes have one variable record for each 8 bytes, rounding up.
+The first variable record for a long string specifies the variable's
+correct dictionary information.  Subsequent variable records for a
+long string are filled with dummy information: a type of -1, no
+variable label or missing values, print and write formats that are
+ignored, and an empty string as name.  A few system files have been
+encountered that include a variable label on dummy variable records,
+so readers should take care to parse dummy variable records in the
+same way as other variable records.
+
+@anchor{Dictionary Index}
+The @dfn{dictionary index} of a variable is its offset in the set of
+variable records, including dummy variable records for long string
+variables.  The first variable record has a dictionary index of 0, the
+second has a dictionary index of 1, and so on.
+
+The system file format does not directly support string variables
+wider than 255 bytes.  Such very long string variables are represented
+by a number of narrower string variables.  @xref{Very Long String
+Record}, for details.
+
+@example
+int32               rec_type;
+int32               type;
+int32               has_var_label;
+int32               n_missing_values;
+int32               print;
+int32               write;
+char                name[8];
+
+/* @r{Present only if @code{has_var_label} is 1.} */
+int32               label_len;
+char                label[];
+
+/* @r{Present only if @code{n_missing_values} is nonzero}. */
+flt64               missing_values[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type code.  Always set to 2.
+
+@item int32 type;
+Variable type code.  Set to 0 for a numeric variable.  For a short
+string variable or the first part of a long string variable, this is set
+to the width of the string.  For the second and subsequent parts of a
+long string variable, set to -1, and the remaining fields in the
+structure are ignored.
+
+@item int32 has_var_label;
+If this variable has a variable label, set to 1; otherwise, set to 0.
+
+@item int32 n_missing_values;
+If the variable has no missing values, set to 0.  If the variable has
+one, two, or three discrete missing values, set to 1, 2, or 3,
+respectively.  If the variable has a range of missing values, set to
+-2; if the variable has a range of missing values plus a single
+discrete value, set to -3.
+
+@item int32 print;
+Print format for this variable.  See below.
+
+@item int32 write;
+Write format for this variable.  See below.
+
+@item char name[8];
+Variable name.  The variable name must begin with a capital letter or
+the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
+(@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
+stops (@samp{.}).  The variable name is padded on the right with spaces.
+
+@item int32 label_len;
+This field is present only if @code{has_var_label} is set to 1.  It is
+set to the length, in characters, of the variable label, which must be a
+number between 0 and 120.
+
+@item char label[];
+This field is present only if @code{has_var_label} is set to 1.  It has
+length @code{label_len}, rounded up to the nearest multiple of 32 bits.
+The first @code{label_len} characters are the variable's variable label.
+
+@item flt64 missing_values[];
+This field is present only if @code{n_missing_values} is not 0.  It has
+the same number of elements as the absolute value of
+@code{n_missing_values}.  For discrete missing values, each element
+represents one missing value.  When a range is present, the first
+element denotes the minimum value in the range, and the second element
+denotes the maximum value in the range.  When a range plus a value are
+present, the third element denotes the additional discrete missing
+value.  HIGHEST and LOWEST are indicated as described in the chapter
+introduction.
+@end table
+
+The @code{print} and @code{write} members of the variable record are output
+formats coded into @code{int32} types.  The least-significant byte
+of the @code{int32} represents the number of decimal places, and the
+next two bytes in order of increasing significance represent field width
+and format type, respectively.  The most-significant byte is not
+used and should be set to zero.
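+
+The byte layout just described can be sketched in C as follows.  The
+input is assumed to be the @code{int32} value after any byte-order
+conversion, and @code{struct sfm_format} and both function names are
+hypothetical.
+

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of a print or write format. */
struct sfm_format
  {
    int type;                   /* Format type, e.g. 5 for F. */
    int width;                  /* Field width. */
    int decimals;               /* Number of decimal places. */
  };

/* Splits a format word into its three one-byte components. */
struct sfm_format
sfm_decode_format (uint32_t raw)
{
  struct sfm_format f;

  f.type = (int) ((raw >> 16) & 0xff);
  f.width = (int) ((raw >> 8) & 0xff);
  f.decimals = (int) (raw & 0xff);
  return f;
}

/* Packs a format back into a word; the top byte is left as zero. */
uint32_t
sfm_encode_format (struct sfm_format f)
{
  assert (f.type >= 0 && f.type <= 255);
  assert (f.width >= 0 && f.width <= 255);
  assert (f.decimals >= 0 && f.decimals <= 255);
  return ((uint32_t) f.type << 16) | ((uint32_t) f.width << 8)
         | (uint32_t) f.decimals;
}
```

+
+For instance, the word @code{0x00050802} decodes to format type 5
+(@code{F}), width 8, and 2 decimal places.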
+
+Format types are defined as follows:
+
+@quotation
+@multitable {Value} {@code{DATETIME}}
+@headitem Value
+@tab Meaning
+@item 0
+@tab Not used.
+@item 1
+@tab @code{A}
+@item 2
+@tab @code{AHEX}
+@item 3
+@tab @code{COMMA}
+@item 4
+@tab @code{DOLLAR}
+@item 5
+@tab @code{F}
+@item 6
+@tab @code{IB}
+@item 7
+@tab @code{PIBHEX}
+@item 8
+@tab @code{P}
+@item 9
+@tab @code{PIB}
+@item 10
+@tab @code{PK}
+@item 11
+@tab @code{RB}
+@item 12
+@tab @code{RBHEX}
+@item 13
+@tab Not used.
+@item 14
+@tab Not used.
+@item 15
+@tab @code{Z}
+@item 16
+@tab @code{N}
+@item 17
+@tab @code{E}
+@item 18
+@tab Not used.
+@item 19
+@tab Not used.
+@item 20
+@tab @code{DATE}
+@item 21
+@tab @code{TIME}
+@item 22
+@tab @code{DATETIME}
+@item 23
+@tab @code{ADATE}
+@item 24
+@tab @code{JDATE}
+@item 25
+@tab @code{DTIME}
+@item 26
+@tab @code{WKDAY}
+@item 27
+@tab @code{MONTH}
+@item 28
+@tab @code{MOYR}
+@item 29
+@tab @code{QYR}
+@item 30
+@tab @code{WKYR}
+@item 31
+@tab @code{PCT}
+@item 32
+@tab @code{DOT}
+@item 33
+@tab @code{CCA}
+@item 34
+@tab @code{CCB}
+@item 35
+@tab @code{CCC}
+@item 36
+@tab @code{CCD}
+@item 37
+@tab @code{CCE}
+@item 38
+@tab @code{EDATE}
+@item 39
+@tab @code{SDATE}
+@end multitable
+@end quotation
+
+@node Value Labels Records
+@section Value Labels Records
+
+The value label record has the following format:
+
+@example
+int32               rec_type;
+int32               label_count;
+
+/* @r{Repeated @code{label_count} times}. */
+char                value[8];
+char                label_len;
+char                label[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 3.
+
+@item int32 label_count;
+Number of value labels present in this record.
+@end table
+
+The remaining fields are repeated @code{label_count} times.  Each
+repetition specifies one value label.
+
+@table @code
+@item char value[8];
+A numeric value or a short string value padded as necessary to 8 bytes
+in length.  Its type and width cannot be determined until the
+following value label variables record (see below) is read.
+
+@item char label_len;
+The label's length, in bytes.
+
+@item char label[];
+@code{label_len} bytes of the actual label, followed by up to 7 bytes
+of padding to bring @code{label} and @code{label_len} together to a
+multiple of 8 bytes in length.
+@end table
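+
+The padding rule for @code{label} can be expressed as a small
+calculation.  The sketch below is one reading of the description above
+(the one-byte @code{label_len} plus the @code{label} field round up to
+a multiple of 8 bytes); the function names are hypothetical.
+

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes occupied by the label field, including padding,
   for a value label whose text is LABEL_LEN bytes long. */
size_t
sfm_value_label_field_size (unsigned char label_len)
{
  /* label_len itself takes 1 byte; round the pair up to 8 bytes. */
  size_t total = ((size_t) label_len + 1 + 7) / 8 * 8;

  assert (total >= (size_t) label_len + 1);
  return total - 1;
}

/* Number of padding bytes that follow the label text. */
size_t
sfm_value_label_padding (unsigned char label_len)
{
  return sfm_value_label_field_size (label_len) - label_len;
}
```

+
+A reader would skip @code{sfm_value_label_padding (label_len)} bytes
+after reading the label text before parsing the next value.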
+
+The value label record is always immediately followed by a value label
+variables record with the following format:
+
+@example
+int32               rec_type;
+int32               var_count;
+int32               vars[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 4.
+
+@item int32 var_count;
+Number of variables to which the value labels from the preceding value
+label record are to be applied.
+
+@item int32 vars[];
+A list of dictionary indexes of variables to which to apply the value
+labels (@pxref{Dictionary Index}).  There are @code{var_count}
+elements.
+
+String variables wider than 8 bytes may not have value labels.
+@end table
+
+@node Document Record
+@section Document Record
+
+The document record, if present, has the following format:
+
+@example
+int32               rec_type;
+int32               n_lines;
+char                lines[][80];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 6.
+
+@item int32 n_lines;
+Number of lines of documents present.
+
+@item char lines[][80];
+Document lines.  The number of elements is defined by @code{n_lines}.
+Lines shorter than 80 characters are padded on the right with spaces.
+@end table
+
+@node Machine Integer Info Record
+@section Machine Integer Info Record
+
+The integer info record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Data.} */
+int32               version_major;
+int32               version_minor;
+int32               version_revision;
+int32               machine_code;
+int32               floating_point_rep;
+int32               compression_code;
+int32               endianness;
+int32               character_code;
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 3.
+
+@item int32 size;
+Size of each piece of data in the data part, in bytes.  Always set to 4.
+
+@item int32 count;
+Number of pieces of data in the data part.  Always set to 8.
+
+@item int32 version_major;
+PSPP major version number.  In version @var{x}.@var{y}.@var{z}, this
+is @var{x}.
+
+@item int32 version_minor;
+PSPP minor version number.  In version @var{x}.@var{y}.@var{z}, this
+is @var{y}.
+
+@item int32 version_revision;
+PSPP version revision number.  In version @var{x}.@var{y}.@var{z},
+this is @var{z}.
+
+@item int32 machine_code;
+Machine code.  PSPP always sets this field to -1, but other
+values may appear.
+
+@item int32 floating_point_rep;
+Floating point representation code.  For IEEE 754 systems this is 1.
+IBM 370 sets this to 2, and DEC VAX E to 3.
+
+@item int32 compression_code;
+Compression code.  Always set to 1.
+
+@item int32 endianness;
+Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
+
+@item int32 character_code;
+Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
+indicates 8-bit ASCII, 4 indicates DEC Kanji.
+Windows code page numbers are also valid.
+@end table
+
+@node Machine Floating-Point Info Record
+@section Machine Floating-Point Info Record
+
+The floating-point info record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Data.} */
+flt64               sysmis;
+flt64               highest;
+flt64               lowest;
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 4.
+
+@item int32 size;
+Size of each piece of data in the data part, in bytes.  Always set to 8.
+
+@item int32 count;
+Number of pieces of data in the data part.  Always set to 3.
+
+@item flt64 sysmis;
+The system missing value.
+
+@item flt64 highest;
+The value used for HIGHEST in missing values.
+
+@item flt64 lowest;
+The value used for LOWEST in missing values.
+@end table
+
+@node Variable Display Parameter Record
+@section Variable Display Parameter Record
+
+The variable display parameter record, if present, has the following
+format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Repeated @code{count / 3} times}. */
+int32               measure;
+int32               width;
+int32               alignment;
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 11.
+
+@item int32 size;
+The size of @code{int32}.  Always set to 4.
+
+@item int32 count;
+The number of sets of variable display parameters (ordinarily the
+number of variables in the dictionary), times 3.
+@end table
+
+The remaining members are repeated @code{count / 3} times, in the same
+order as the variable records.  No element corresponds to variable
+records that continue long string variables.  The meanings of these
+members are as follows:
+
+@table @code
+@item int32 measure;
+The measurement type of the variable:
+@table @asis
+@item 1
+Nominal Scale
+@item 2
+Ordinal Scale
+@item 3
+Continuous Scale
+@end table
+
+SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
+as nominal scale.
+
+@item int32 width;
+The width of the display column for the variable in characters.
+
+@item int32 alignment;
+The alignment of the variable for display purposes:
+
+@table @asis
+@item 0
+Left aligned
+@item 1
+Right aligned
+@item 2
+Centre aligned
+@end table
+@end table
+
+@node Long Variable Names Record
+@section Long Variable Names Record
+
+If present, the long variable names record has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                var_name_pairs[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 13.
+
+@item int32 size;
+The size of each element in the @code{var_name_pairs} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{var_name_pairs}.
+
+@item char var_name_pairs[];
+A list of @var{key}--@var{value} tuples, where @var{key} is the name
+of a variable, and @var{value} is its long variable name.
+The @var{key} field is at most 8 bytes long and must match the
+name of a variable which appears in the variable record (@pxref{Variable
+Record}).
+The @var{value} field is at most 64 bytes long.
+The @var{key} and @var{value} fields are separated by a @samp{=} byte.
+Each tuple is separated by a byte whose value is 09.  There is no
+trailing separator following the last tuple.
+The total length is @code{count} bytes.
+@end table
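+
+One way to walk the tuples in @code{var_name_pairs} is sketched below.
+The function name and its calling convention are hypothetical; it
+treats a tuple without @samp{=}, a key longer than 8 bytes, or a value
+longer than 64 bytes as malformed and stops.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Extracts the tuple that starts at offset *POS in the COUNT bytes at
   DATA, storing null-terminated copies in KEY and VALUE and advancing
   *POS past the tuple and its separator.  Returns 1 on success, 0 at
   the end of the data or if the tuple is malformed. */
int
sfm_next_name_pair (const char *data, size_t count, size_t *pos,
                    char key[9], char value[65])
{
  assert (data != NULL && pos != NULL);
  if (*pos >= count)
    return 0;

  const char *start = data + *pos;
  size_t left = count - *pos;

  /* Tuples are separated by a byte with value 09 (a tab). */
  const char *tab = memchr (start, '\t', left);
  size_t tuple_len = tab != NULL ? (size_t) (tab - start) : left;

  const char *eq = memchr (start, '=', tuple_len);
  if (eq == NULL)
    return 0;

  size_t key_len = (size_t) (eq - start);
  size_t value_len = tuple_len - key_len - 1;
  if (key_len == 0 || key_len > 8 || value_len > 64)
    return 0;

  memcpy (key, start, key_len);
  key[key_len] = '\0';
  memcpy (value, eq + 1, value_len);
  value[value_len] = '\0';

  *pos += tuple_len + (tab != NULL ? 1 : 0);
  return 1;
}
```

+
+A caller would start with @code{*pos} set to 0 and call the function
+repeatedly until it returns 0.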
+
+@node Very Long String Record
+@section Very Long String Record
+
+Old versions of SPSS limited string variables to a width of 255 bytes.
+For backward compatibility with these older versions, the system file
+format represents a string longer than 255 bytes, called a @dfn{very
+long string}, as a collection of strings no longer than 255 bytes
+each.  The strings concatenated to make a very long string are called
+its @dfn{segments}; for consistency, variables other than very long
+strings are considered to have a single segment.
+
+A very long string with a width of @var{w} has @var{n} =
+(@var{w} + 251) / 252 segments, that is, one segment for every
+252 bytes of width, rounding up.  It would be logical, then, for each
+of the segments except the last to have a width of 252 and the last
+segment to have the remainder, but this is not the case.  In fact,
+each segment except the last has a width of 255 bytes.  The last
+segment has width @var{w} - (@var{n} - 1) * 252; some versions
+of SPSS make it slightly wider, but not wide enough to make the last
+segment require another 8 bytes of data.
+
+Data is packed tightly into segments of a very long string, 255 bytes
+per segment.  Because 255 bytes of segment data are allocated for
+every 252 bytes of the very long string's width (approximately), some
+unused space is left over at the end of the allocated segments.  Data
+in unused space is ignored.
+
+Example: Consider a very long string of width 20,000.  Such a very
+long string has 20,000 / 252 = 80 (rounding up) segments.  The first
+79 segments have width 255; the last segment has width 20,000 - 79 *
+252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
+The very long string's data is actually stored in the 19,890 bytes in
+the first 78 segments, plus the first 110 bytes of the 79th segment
+(19,890 + 110 = 20,000).  The remaining 145 bytes of the 79th segment
+and all 92 bytes of the 80th segment are unused.
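+
+The arithmetic in this example can be checked with a short C sketch.
+It applies the formulas given above to a width @var{w} greater than
+255 and uses zero-based segment indexes; the function names are
+hypothetical, and the sketch returns the nominal width of the last
+segment rather than the slightly wider one that some SPSS versions
+write.
+

```c
#include <assert.h>

/* Number of segments in a very long string of width W. */
int
sfm_segment_count (int w)
{
  assert (w > 255);
  return (w + 251) / 252;
}

/* Declared width of segment I (zero-based) of a string of width W. */
int
sfm_segment_width (int w, int i)
{
  int n = sfm_segment_count (w);

  assert (i >= 0 && i < n);
  return i < n - 1 ? 255 : w - (n - 1) * 252;
}

/* Finds where byte K (zero-based) of the string's data is stored:
   data is packed 255 bytes per segment. */
void
sfm_locate_byte (int w, int k, int *segment, int *offset)
{
  assert (k >= 0 && k < w);
  *segment = k / 255;
  *offset = k % 255;
}
```

+
+For a width of 20,000, the last data byte (index 19,999) lands at
+offset 109 of segment 78, that is, the 110th byte of the 79th segment,
+in agreement with the example.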
+
+The very long string record explains how to stitch together segments
+to obtain very long string data.  For each of the very long string
+variables in the dictionary, it specifies the name of its first
+segment's variable and the very long string variable's actual width.
+The remaining segments immediately follow the named variable in the
+system file's dictionary.
+
+The very long string record, which is present only if the system file
+contains very long string variables, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                string_lengths[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Always set to 14.
+
+@item int32 size;
+The size of each element in the @code{string_lengths} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{string_lengths}.
+
+@item char string_lengths[];
+A list of @var{key}--@var{value} tuples, where @var{key} is the name
+of a variable, and @var{value} is its length.
+The @var{key} field is at most 8 bytes long and must match the
+name of a variable which appears in the variable record (@pxref{Variable
+Record}).
+The @var{value} field is exactly 5 bytes long. It is a zero-padded,
+ASCII-encoded string that is the length of the variable.
+The @var{key} and @var{value} fields are separated by a @samp{=} byte.
+Tuples are delimited by a two-byte sequence @{00, 09@}.
+After the last tuple, there may be a single byte 00, or @{00, 09@}.
+The total length is @code{count} bytes.
+@end table
+
+@node Miscellaneous Informational Records
+@section Miscellaneous Informational Records
+
+Some specific types of miscellaneous informational records are
+documented here, but others are known to exist.  PSPP ignores unknown
+miscellaneous informational records when reading system files.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{size * count} bytes of data.} */
+char                data[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  May take any value.  According to Aapi
+H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
+indicates date info (probably related to USE).
+
+@item int32 size;
+Size of each piece of data in the data part.  Should have the value 1,
+4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data,
+respectively.
+
+@item int32 count;
+Number of pieces of data in the data part.
+
+@item char data[];
+Arbitrary data.  There must be @code{size} times @code{count} bytes of
+data.
+@end table
+
+@node Dictionary Termination Record
+@section Dictionary Termination Record
+
+The dictionary termination record separates all other records from the
+data records.
+
+@example
+int32               rec_type;
+int32               filler;
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 999.
+
+@item int32 filler;
+Ignored padding.  Should be set to 0.
+@end table
+
+@node Data Record
+@section Data Record
+
+Data records must follow all other records in the system file.  There must
+be at least one data record in every system file.
+
+The format of data records varies depending on whether the data is
+compressed.  Regardless, the data is arranged in a series of 8-byte
+elements.
+
+When data is not compressed,
+each element corresponds to
+the variable declared in the respective variable record (@pxref{Variable
+Record}).  Numeric values are given in @code{flt64} format; string
+values are literal character strings, padded on the right when
+necessary to fill out 8-byte units.
+
+Compressed data is arranged in the following manner: the first 8 bytes
+in the data section are divided into a series of 1-byte command
+codes.  These codes have meanings as described below:
+
+@table @asis
+@item 0
+Ignored.  If the program writing the system file accumulates compressed
+data in blocks of fixed length, 0 bytes can be used to pad out extra
+bytes remaining at the end of a fixed-size block.
+
+@item 1 through 251
+A number with
+value @var{code} - @var{bias}, where
+@var{code} is the value of the compression code and @var{bias} is the
+variable @code{bias} from the file header.  For example,
+code 105 with bias 100.0 (the normal value) indicates a numeric variable
+of value 5.
+
+@item 252
+End of file.  This code may or may not appear at the end of the data
+stream.  PSPP always outputs this code but its use is not required.
+
+@item 253
+A numeric or string value that is not
+compressible.  The value is stored in the 8 bytes following the
+current block of command bytes.  If this value appears twice in a block
+of command bytes, then it indicates the second group of 8 bytes following the
+command bytes, and so on.
+
+@item 254
+An 8-byte string value that is all spaces.
+
+@item 255
+The system-missing value.
+@end table
+
+When the end of an 8-byte group of command bytes is reached, any
+blocks of non-compressible values indicated by code 253 are skipped,
+and the next element of command bytes is read and interpreted, until
+the end of the file or a code with value 252 is reached.
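+
+The decoding loop described above might look like the following C
+sketch.  It is an illustration under several assumptions, not PSPP's
+implementation: the whole compressed stream is already in memory,
+numbers are written to the output in the host's @code{double} format,
+uncompressed values (code 253) are copied through unchanged, and the
+caller decides from the dictionary whether each 8-byte element is
+numeric or string.  The function name is hypothetical.
+

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <string.h>

/* Expands the IN_LEN bytes of compressed data at IN into at most
   MAX_OUT 8-byte elements at OUT.  Returns the number of elements
   written.  Stops at code 252, at the end of the input, or when OUT
   is full. */
size_t
sfm_decompress (const unsigned char *in, size_t in_len, double bias,
                unsigned char (*out)[8], size_t max_out)
{
  size_t pos = 0;               /* Read offset; never exceeds in_len. */
  size_t n = 0;                 /* Elements written so far. */

  assert (in != NULL && out != NULL);
  assert (sizeof (double) == 8);

  while (in_len - pos >= 8)
    {
      const unsigned char *cmd = in + pos;   /* Group of 8 commands. */
      pos += 8;

      for (int i = 0; i < 8; i++)
        {
          unsigned char code = cmd[i];

          if (code == 0)
            continue;                        /* Padding, ignored. */
          if (code == 252 || n >= max_out)
            return n;                        /* End of data. */

          if (code == 253)
            {
              /* Literal value stored after the command bytes. */
              if (in_len - pos < 8)
                return n;
              memcpy (out[n], in + pos, 8);
              pos += 8;
            }
          else if (code == 254)
            memset (out[n], ' ', 8);         /* All-spaces string. */
          else
            {
              /* 255 is system-missing; 1...251 is code - bias. */
              double d = code == 255 ? -DBL_MAX : code - bias;
              memcpy (out[n], &d, 8);
            }
          n++;
        }
    }
  return n;
}
```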
+@setfilename ignored