Add support for reading SPSS/PC+ system files.

[pspp] / doc / dev / pc+-file-format.texi
diff --git a/doc/dev/pc+-file-format.texi b/doc/dev/pc+-file-format.texi

new file mode 100644 (file)

index 0000000..f3f1a9d
--- /dev/null
+++ b/doc/dev/pc+-file-format.texi
@@ -0,0 +1,362 @@
+@node SPSS/PC+ System File Format
+@appendix SPSS/PC+ System File Format
+
+SPSS/PC+, first released in 1984, was a simplified version of SPSS for
+IBM PC and compatible computers.  It used a data file format related
+to the one described in the previous chapter, but simplified and
+incompatible.  The SPSS/PC+ software became obsolete in the 1990s, so
+files in this format are rarely encountered today.  Nevertheless, for
+completeness, and because it is not very difficult, it seems
+worthwhile to support at least reading these files.  This chapter
+documents this format, based on examination of a corpus of about 60
+files from a variety of sources.
+
+System files use four data types: 8-bit characters, 16-bit unsigned
+integers, 32-bit unsigned integers, and 64-bit floating points, called
+here @code{char}, @code{uint16}, @code{uint32}, and @code{flt64},
+respectively.  Data is not necessarily aligned on a word or
+double-word boundary.
+
+SPSS/PC+ ran only on IBM PC and compatible computers.  Therefore,
+values in these files are always in little-endian byte order.
+Floating-point numbers are always in IEEE 754 format.
+
+SPSS/PC+ system files represent the system-missing value as -1.66e308,
+or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal.  (This is
+an unusual choice: it is close to, but not equal to, the largest
+negative 64-bit IEEE 754, which is about -1.8e308.)
+
+Text in SPSS/PC+ system file is encoded in ASCII-based 8-bit MS DOS
+codepages.  The corpus used for investigating the format were all
+ASCII-only.
+
+An SPSS/PC+ system file begins with the following 256-byte directory:
+
+@example
+uint32              two;
+uint32              zero;
+struct @{
+    uint32          ofs;
+    uint32          len;
+@} records[15];
+char                filename[128];
+@end example
+
+@table @code
+@item uint32 two;
+@itemx uint32 zero;
+Always set to 2 and 0, respectively.
+
+These fields could be used as a signature for the file format, but the
+@code{product} field in record 0 seems more likely to be unique
+(@pxref{Record 0 Main Header Record}).
+
+@item struct @{ @dots{} @} records[15];
+Each of the elements in this array identifies a record in the system
+file.  The @code{ofs} is a byte offset, from the beginning of the
+file, that identifies the start of the record.  @code{len} specifies
+the length of the record, in bytes.  Many records are optional or not
+used.  If a record is not present, @code{ofs} and @code{len} for that
+record are both are zero.
+
+@item char filename[128];
+In most files in the corpus, this field is entirely filled with
+spaces.  In one file, it contains a file name, followed by a null
+bytes, followed by spaces to fill the remainder of the field.  The
+meaning is unknown.
+@end table
+
+The following sections describe the contents of each record,
+identified by the index into the @code{records} array.
+
+@menu
+* Record 0 Main Header Record::
+* Record 1 Variables Record::
+* Record 2 Labels Record::
+* Record 3 Data Record::
+* Records 4 and 5 Data Entry::
+@end menu
+
+@node Record 0 Main Header Record
+@section Record 0: Main Header Record
+
+All files in the corpus have this record at offset 0x100 with length
+0xb0 (but readers should find this record, like the others, via the
+@code{records} table in the directory).  Its format is:
+
+@example
+uint16              one0;
+char                product[62];
+flt64               sysmis;
+uint32              zero0;
+uint32              zero1;
+uint16              one1;
+uint16              compressed;
+uint16              nominal_case_size;
+uint32              n_cases0;
+uint16              zero2;
+uint32              n_cases1;
+char                creation_date[8];
+char                creation_time[8];
+char                label[64];
+@end example
+
+@table @code
+@item uint16 one0;
+@itemx uint16 one1;
+Always set to 1.
+
+@item uint32 zero0;
+@itemx uint32 zero1;
+@itemx uint16 zero2;
+Always set to 0.
+
+It seems likely that one of these variables is set to 1 if weighting
+is enabled, but none of the files in the corpus is weighted.
+
+@item char product[62];
+Name of the program that created the file.  Only the following unique
+values have been observed, in each case padded on the right with
+spaces:
+
+@example
+DESPSS/PC+ System File Written by Data Entry II
+PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+
+PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+ V3.0
+PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS for Windows
+@end example
+
+Thus, it is reasonable to use the presence of the string @samp{SPSS}
+at offset 0x104 as a simple test for an SPSS/PC+ data file.
+
+@item flt64 sysmis;
+The system-missing value, as described previously (@pxref{SPSS/PC+
+System File Format}).
+
+@item uint16 compressed;
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression.
+
+@item uint16 nominal_case_size;
+Number of data elements per case.  This is the number of variables,
+except that long string variables add extra data elements (one for
+every 8 bytes after the first 8).  String variables in SPSS/PC+ system
+files are limited to 255 bytes.
+
+@item uint32 n_cases0;
+@itemx uint32 n_cases1;
+The number of cases in the data record.  Both values are the same.
+Some files in the corpus contain data for the number of cases noted
+here, followed by garbage that somewhat resembles data.
+
+@item char creation_date[8];
+The date that the file was created, in @samp{mm/dd/yy} format.
+Single-digit days and months are not prefixed by zeros.  The string is
+padded with spaces on right or left or both, e.g. @samp{_2/4/93_},
+@samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_} standing in for a
+space) are all actual examples from the corpus.
+
+@item char creation_time[8];
+The time that the file was created, in @samp{HH:MM:SS} format.
+Single-digit hours are padded on a left with a space.  Minutes and
+seconds are always written as two digits.
+
+@item char file_label[64];
+File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
+PSPP Users Guide}).  Padded on the right with spaces.
+@end table
+
+@node Record 1 Variables Record
+@section Record 1: Variables Record
+
+The variables record most commonly starts at offset 0x1b0, but it can
+be placed elsewhere.  The record contains instances of the following
+32-byte structure:
+
+@example
+uint32              value_label_start;
+uint32              value_label_end;
+uint32              var_label_ofs;
+uint32              format;
+char                name[8];
+union @{
+    flt64           f;
+    char            s[8];
+@} missing;
+@end example
+
+The number of instances is the @code{nominal_case_size} specified in
+the main header record.  There is one instance for each numeric
+variable and each string variable with width 8 bytes or less.  String
+variables wider than 8 bytes have one instance for each 8 bytes,
+rounding up.  The first instance for a long string specifies the
+variable's correct dictionary information.  Subsequent instances for a
+long string are generally filled with all-zero bytes, although the
+@code{missing} field contains the numeric system-missing value, and
+some writers also fill in @code{var_label_ofs}, @code{format}, and
+@code{name}, sometimes filling the latter with the numeric
+system-missing value rather than a text string.  Regardless of the
+values used, readers should ignore the contents of these additional
+instances for long strings.
+
+@table @code
+@item uint32 value_label_start;
+@itemx uint32 value_label_end;
+For a variable with value labels, these specify offsets into the label
+record of the start and end of this variable's value labels,
+respectively.  @xref{Record 2 Labels Record}, for more information.
+
+For a variable without any value labels, these are both zero.
+
+A long string variable may not have value labels.
+
+@item uint32 var_label_ofs;
+For a variable with a variable label, this specifies an offset into
+the label record.  @xref{Record 2 Labels Record}, for more
+information.
+
+For a variable without a variable label, this is zero.
+
+@item uint32 format;
+The variable's output format, in the same format used in system files.
+@xref{System File Output Formats}, for details.  SPSS/PC+ system files
+only use format types 5 (F, for numeric variables) and 1 (A, for
+string variables).
+
+@item char name[8];
+The variable's name, padded on the right with spaces.
+
+@item union @{ @dots{} @} missing;
+A user-missing value.  For numeric variables, @code{missing.f} is the
+variable's user-missing value.  For string variables, @code{missing.s}
+is a string missing value.  A variable without a user-missing value is
+indicated with @code{missing.f} set to the system-missing value, even
+for string variables (!).  A Long string variable may not have a
+missing value.
+@end table
+
+In addition to the user-defined variables, every SPSS/PC+ system file
+contains, as its first three variables, the following system-defined
+variables, in the following order.  The system-defined variables have
+no variable label, value labels, or missing values.
+
+@table @code
+@item $CASENUM
+A numeric variable with format F8.0.  Most of the time this is a
+sequence number, starting with 1 for the first case and counting up
+for each subsequent case.  Some files skip over values, which probably
+reflects cases that were deleted.
+
+@item $DATE
+A string variable with format A8.  Same format (including varying
+padding) as the @code{creation_date} field in the main header record
+(@pxref{Record 0 Main Header Record}).  The actual date can differ
+from @code{creation_date} and from record to record.  This may reflect
+when individual cases were added or updated.
+
+@item $WEIGHT
+A numeric variable with format F8.2.  This represents the case's
+weight; SPSS/PC+ files do not have a user-defined weighting variable.
+If weighting has not been enabled, every case has value 1.0.
+@end table
+
+@node Record 2 Labels Record
+@section Record 2: Labels Record
+
+The labels record holds value labels and variable labels.  Unlike the
+other records, it is not meant to be read directly and sequentially.
+Instead, this record must be interpreted one piece at a time, by
+following pointers from the variables record.
+
+The @code{value_label_start}, @code{value_label_end}, and
+@code{var_label_ofs} fields in a variable record are all offsets
+relative to the beginning of the labels record, with an additional
+7-byte offset.  That is, if the labels record starts at byte offset
+@code{labels_ofs} and a variable has a given @code{var_label_ofs},
+then the variable label begins at byte offset @math{@code{labels_ofs}
++ @code{var_label_ofs} + 7} in the file.
+
+A variable label, starting at the offset indicated by
+@code{var_label_ofs}, consists of a one-byte length followed by the
+specified number of bytes of the variable label string, like this:
+
+@example
+uint8               length;
+char                s[length];
+@end example
+
+A set of value labels, extending from @code{value_label_start} to
+@code{value_label_end} (exclusive), consists of a numeric or string
+value followed by a string in the format just described.  String
+values are padded on the right with spaces to fill the 8-byte field,
+like this:
+
+@example
+union @{
+    flt64           f;
+    char            s[8];
+@} value;
+uint8               length;
+char                s[length];
+@end example
+
+The labels record begins with a pair of uint32 values.  The first of
+these is always 3.  The second is between 8 and 16 less than the
+number of bytes in the record.  Neither value is important for
+interpreting the file.
+
+@node Record 3 Data Record
+@section Record 3: Data Record
+
+The format of the data record varies depending on the value of
+@code{compressed} in the file header record:
+
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements, one per variable
+instance variable in the variable record (@pxref{Record 1 Variables
+Record}).  Numeric values are given in @code{flt64} format; string
+values are literal characters string, padded on the right with spaces
+when necessary to fill out 8-byte units.
+
+@item 1: bytecode compression
+The first 8 bytes of the data record is divided into a series of
+1-byte command codes.  These codes have meanings as described below:
+
+@table @asis
+@item 0
+The system-missing value.
+
+@item 1
+A numeric or string value that is not
+compressible.  The value is stored in the 8 bytes following the
+current block of command bytes.  If this value appears twice in a block
+of command bytes, then it indicates the second group of 8 bytes following the
+command bytes, and so on.
+
+@item 2 through 255
+A number with value @var{code} - 100, where @var{code} is the value of
+the compression code.  For example, code 105 indicates a numeric
+variable of value 5.
+@end table
+
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 1.  After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values.  The pattern repeats up to the number of
+cases specified by the main header record have been seen.
+
+The corpus does not contain any files with command codes 2 through 95,
+so it is possible that some of these codes are used for special
+purposes.
+@end table
+
+Cases of data often, but not always, fill the entire data record.
+Readers should stop reading after the number of cases specified in the
+main header record.  Otherwise, readers may try to interpret garbage
+following the data as additional cases.
+
+@node Records 4 and 5 Data Entry
+@section Records 4 and 5: Data Entry
+
+Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.