X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fpc%2B-file-format.texi;fp=doc%2Fdev%2Fpc%2B-file-format.texi;h=f3f1a9d62041bd2493e984e20194421fb124e579;hb=63c7521729b947ace9e192dff9330813ecfb5812;hp=0000000000000000000000000000000000000000;hpb=7eeac30440dc09a42ee869b69263a8879bbb3ff8;p=pspp diff --git a/doc/dev/pc+-file-format.texi b/doc/dev/pc+-file-format.texi new file mode 100644 index 0000000000..f3f1a9d620 --- /dev/null +++ b/doc/dev/pc+-file-format.texi @@ -0,0 +1,362 @@ +@node SPSS/PC+ System File Format +@appendix SPSS/PC+ System File Format + +SPSS/PC+, first released in 1984, was a simplified version of SPSS for +IBM PC and compatible computers. It used a data file format related +to the one described in the previous chapter, but simplified and +incompatible. The SPSS/PC+ software became obsolete in the 1990s, so +files in this format are rarely encountered today. Nevertheless, for +completeness, and because it is not very difficult, it seems +worthwhile to support at least reading these files. This chapter +documents this format, based on examination of a corpus of about 60 +files from a variety of sources. + +System files use four data types: 8-bit characters, 16-bit unsigned +integers, 32-bit unsigned integers, and 64-bit floating points, called +here @code{char}, @code{uint16}, @code{uint32}, and @code{flt64}, +respectively. Data is not necessarily aligned on a word or +double-word boundary. + +SPSS/PC+ ran only on IBM PC and compatible computers. Therefore, +values in these files are always in little-endian byte order. +Floating-point numbers are always in IEEE 754 format. + +SPSS/PC+ system files represent the system-missing value as -1.66e308, +or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal. (This is +an unusual choice: it is close to, but not equal to, the largest +negative 64-bit IEEE 754, which is about -1.8e308.) + +Text in SPSS/PC+ system file is encoded in ASCII-based 8-bit MS DOS +codepages. The corpus used for investigating the format were all +ASCII-only. + +An SPSS/PC+ system file begins with the following 256-byte directory: + +@example +uint32 two; +uint32 zero; +struct @{ + uint32 ofs; + uint32 len; +@} records[15]; +char filename[128]; +@end example + +@table @code +@item uint32 two; +@itemx uint32 zero; +Always set to 2 and 0, respectively. + +These fields could be used as a signature for the file format, but the +@code{product} field in record 0 seems more likely to be unique +(@pxref{Record 0 Main Header Record}). + +@item struct @{ @dots{} @} records[15]; +Each of the elements in this array identifies a record in the system +file. The @code{ofs} is a byte offset, from the beginning of the +file, that identifies the start of the record. @code{len} specifies +the length of the record, in bytes. Many records are optional or not +used. If a record is not present, @code{ofs} and @code{len} for that +record are both are zero. + +@item char filename[128]; +In most files in the corpus, this field is entirely filled with +spaces. In one file, it contains a file name, followed by a null +bytes, followed by spaces to fill the remainder of the field. The +meaning is unknown. +@end table + +The following sections describe the contents of each record, +identified by the index into the @code{records} array. + +@menu +* Record 0 Main Header Record:: +* Record 1 Variables Record:: +* Record 2 Labels Record:: +* Record 3 Data Record:: +* Records 4 and 5 Data Entry:: +@end menu + +@node Record 0 Main Header Record +@section Record 0: Main Header Record + +All files in the corpus have this record at offset 0x100 with length +0xb0 (but readers should find this record, like the others, via the +@code{records} table in the directory). Its format is: + +@example +uint16 one0; +char product[62]; +flt64 sysmis; +uint32 zero0; +uint32 zero1; +uint16 one1; +uint16 compressed; +uint16 nominal_case_size; +uint32 n_cases0; +uint16 zero2; +uint32 n_cases1; +char creation_date[8]; +char creation_time[8]; +char label[64]; +@end example + +@table @code +@item uint16 one0; +@itemx uint16 one1; +Always set to 1. + +@item uint32 zero0; +@itemx uint32 zero1; +@itemx uint16 zero2; +Always set to 0. + +It seems likely that one of these variables is set to 1 if weighting +is enabled, but none of the files in the corpus is weighted. + +@item char product[62]; +Name of the program that created the file. Only the following unique +values have been observed, in each case padded on the right with +spaces: + +@example +DESPSS/PC+ System File Written by Data Entry II +PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ +PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0 +PCSPSS SYSTEM FILE. IBM PC DOS, SPSS for Windows +@end example + +Thus, it is reasonable to use the presence of the string @samp{SPSS} +at offset 0x104 as a simple test for an SPSS/PC+ data file. + +@item flt64 sysmis; +The system-missing value, as described previously (@pxref{SPSS/PC+ +System File Format}). + +@item uint16 compressed; +Set to 0 if the data in the file is not compressed, 1 if the data is +compressed with simple bytecode compression. + +@item uint16 nominal_case_size; +Number of data elements per case. This is the number of variables, +except that long string variables add extra data elements (one for +every 8 bytes after the first 8). String variables in SPSS/PC+ system +files are limited to 255 bytes. + +@item uint32 n_cases0; +@itemx uint32 n_cases1; +The number of cases in the data record. Both values are the same. +Some files in the corpus contain data for the number of cases noted +here, followed by garbage that somewhat resembles data. + +@item char creation_date[8]; +The date that the file was created, in @samp{mm/dd/yy} format. +Single-digit days and months are not prefixed by zeros. The string is +padded with spaces on right or left or both, e.g. @samp{_2/4/93_}, +@samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_} standing in for a +space) are all actual examples from the corpus. + +@item char creation_time[8]; +The time that the file was created, in @samp{HH:MM:SS} format. +Single-digit hours are padded on a left with a space. Minutes and +seconds are always written as two digits. + +@item char file_label[64]; +File label declared by the user, if any (@pxref{FILE LABEL,,,pspp, +PSPP Users Guide}). Padded on the right with spaces. +@end table + +@node Record 1 Variables Record +@section Record 1: Variables Record + +The variables record most commonly starts at offset 0x1b0, but it can +be placed elsewhere. The record contains instances of the following +32-byte structure: + +@example +uint32 value_label_start; +uint32 value_label_end; +uint32 var_label_ofs; +uint32 format; +char name[8]; +union @{ + flt64 f; + char s[8]; +@} missing; +@end example + +The number of instances is the @code{nominal_case_size} specified in +the main header record. There is one instance for each numeric +variable and each string variable with width 8 bytes or less. String +variables wider than 8 bytes have one instance for each 8 bytes, +rounding up. The first instance for a long string specifies the +variable's correct dictionary information. Subsequent instances for a +long string are generally filled with all-zero bytes, although the +@code{missing} field contains the numeric system-missing value, and +some writers also fill in @code{var_label_ofs}, @code{format}, and +@code{name}, sometimes filling the latter with the numeric +system-missing value rather than a text string. Regardless of the +values used, readers should ignore the contents of these additional +instances for long strings. + +@table @code +@item uint32 value_label_start; +@itemx uint32 value_label_end; +For a variable with value labels, these specify offsets into the label +record of the start and end of this variable's value labels, +respectively. @xref{Record 2 Labels Record}, for more information. + +For a variable without any value labels, these are both zero. + +A long string variable may not have value labels. + +@item uint32 var_label_ofs; +For a variable with a variable label, this specifies an offset into +the label record. @xref{Record 2 Labels Record}, for more +information. + +For a variable without a variable label, this is zero. + +@item uint32 format; +The variable's output format, in the same format used in system files. +@xref{System File Output Formats}, for details. SPSS/PC+ system files +only use format types 5 (F, for numeric variables) and 1 (A, for +string variables). + +@item char name[8]; +The variable's name, padded on the right with spaces. + +@item union @{ @dots{} @} missing; +A user-missing value. For numeric variables, @code{missing.f} is the +variable's user-missing value. For string variables, @code{missing.s} +is a string missing value. A variable without a user-missing value is +indicated with @code{missing.f} set to the system-missing value, even +for string variables (!). A Long string variable may not have a +missing value. +@end table + +In addition to the user-defined variables, every SPSS/PC+ system file +contains, as its first three variables, the following system-defined +variables, in the following order. The system-defined variables have +no variable label, value labels, or missing values. + +@table @code +@item $CASENUM +A numeric variable with format F8.0. Most of the time this is a +sequence number, starting with 1 for the first case and counting up +for each subsequent case. Some files skip over values, which probably +reflects cases that were deleted. + +@item $DATE +A string variable with format A8. Same format (including varying +padding) as the @code{creation_date} field in the main header record +(@pxref{Record 0 Main Header Record}). The actual date can differ +from @code{creation_date} and from record to record. This may reflect +when individual cases were added or updated. + +@item $WEIGHT +A numeric variable with format F8.2. This represents the case's +weight; SPSS/PC+ files do not have a user-defined weighting variable. +If weighting has not been enabled, every case has value 1.0. +@end table + +@node Record 2 Labels Record +@section Record 2: Labels Record + +The labels record holds value labels and variable labels. Unlike the +other records, it is not meant to be read directly and sequentially. +Instead, this record must be interpreted one piece at a time, by +following pointers from the variables record. + +The @code{value_label_start}, @code{value_label_end}, and +@code{var_label_ofs} fields in a variable record are all offsets +relative to the beginning of the labels record, with an additional +7-byte offset. That is, if the labels record starts at byte offset +@code{labels_ofs} and a variable has a given @code{var_label_ofs}, +then the variable label begins at byte offset @math{@code{labels_ofs} ++ @code{var_label_ofs} + 7} in the file. + +A variable label, starting at the offset indicated by +@code{var_label_ofs}, consists of a one-byte length followed by the +specified number of bytes of the variable label string, like this: + +@example +uint8 length; +char s[length]; +@end example + +A set of value labels, extending from @code{value_label_start} to +@code{value_label_end} (exclusive), consists of a numeric or string +value followed by a string in the format just described. String +values are padded on the right with spaces to fill the 8-byte field, +like this: + +@example +union @{ + flt64 f; + char s[8]; +@} value; +uint8 length; +char s[length]; +@end example + +The labels record begins with a pair of uint32 values. The first of +these is always 3. The second is between 8 and 16 less than the +number of bytes in the record. Neither value is important for +interpreting the file. + +@node Record 3 Data Record +@section Record 3: Data Record + +The format of the data record varies depending on the value of +@code{compressed} in the file header record: + +@table @asis +@item 0: no compression +Data is arranged as a series of 8-byte elements, one per variable +instance variable in the variable record (@pxref{Record 1 Variables +Record}). Numeric values are given in @code{flt64} format; string +values are literal characters string, padded on the right with spaces +when necessary to fill out 8-byte units. + +@item 1: bytecode compression +The first 8 bytes of the data record is divided into a series of +1-byte command codes. These codes have meanings as described below: + +@table @asis +@item 0 +The system-missing value. + +@item 1 +A numeric or string value that is not +compressible. The value is stored in the 8 bytes following the +current block of command bytes. If this value appears twice in a block +of command bytes, then it indicates the second group of 8 bytes following the +command bytes, and so on. + +@item 2 through 255 +A number with value @var{code} - 100, where @var{code} is the value of +the compression code. For example, code 105 indicates a numeric +variable of value 5. +@end table + +The end of the 8-byte group of bytecodes is followed by any 8-byte +blocks of non-compressible values indicated by code 1. After that +follows another 8-byte group of bytecodes, then those bytecodes' +non-compressible values. The pattern repeats up to the number of +cases specified by the main header record have been seen. + +The corpus does not contain any files with command codes 2 through 95, +so it is possible that some of these codes are used for special +purposes. +@end table + +Cases of data often, but not always, fill the entire data record. +Readers should stop reading after the number of cases specified in the +main header record. Otherwise, readers may try to interpret garbage +following the data as additional cases. + +@node Records 4 and 5 Data Entry +@section Records 4 and 5: Data Entry + +Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.