@node Data File Format, q2c Input Format, Portable File Format, Top
@appendix Data File Format

PSPP necessarily uses the same format for system files as do the
products with which it is compatible. This chapter is a description of

There are three data types used in system files: 32-bit integers, 64-bit
floating-point numbers, and 1-byte characters. In this document these will
simply be referred to as @code{int32}, @code{flt64}, and @code{char},
the names that are used in the PSPP source code. Every field of type
@code{int32} or @code{flt64} is aligned on a 32-bit boundary relative to
the start of the record.

The endianness of data in PSPP system files is not specified. System
files output on a computer of a particular endianness will have the
endianness of that computer. However, PSPP can read files of either
endianness, regardless of its host computer's endianness. PSPP
translates endianness for both integer and floating point numbers.
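
As a minimal sketch of the byte-order translation described above, the
following C fragment converts four raw bytes from a system file into a
host @code{int32}. The names @code{swap32} and @code{read_int32} are
hypothetical and are not taken from the PSPP source. The caller is
assumed to have already decided whether swapping is needed, for example
by checking whether a field with a known value reads back as expected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t
swap32 (uint32_t x)
{
  return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical helper: convert 4 raw bytes from a system file to a
   host int32.  SWAP is nonzero when the file's byte order differs
   from the host's. */
int32_t
read_int32 (const unsigned char bytes[4], int swap)
{
  uint32_t u;
  int32_t value;

  assert (bytes != NULL);
  memcpy (&u, bytes, sizeof u);   /* Avoids unaligned access. */
  if (swap)
    u = swap32 (u);
  memcpy (&value, &u, sizeof value);
  return value;
}
```

The same approach extends to @code{flt64} by swapping all eight bytes
before reinterpreting them as a floating point number.
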

Floating point formats are also not specified. PSPP does not
translate between floating point formats. This is unlikely to be a
problem as all modern computer architectures use IEEE 754 format for
floating point representation.

The PSPP system-missing value is represented by the negative number of
largest magnitude in the floating point format; in C, this is most likely
@code{-DBL_MAX}. There are two other important values used in missing
values: @code{HIGHEST} and @code{LOWEST}. These are represented,
respectively, by the largest possible positive number (probably
@code{DBL_MAX}) and the negative number of second-largest magnitude.
The latter must be determined in a system-dependent manner; in IEEE 754
format it is represented by value @code{0xffeffffffffffffe}.
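
The three special values can be sketched in C as follows. This is only
an illustration and assumes that @code{double} is an IEEE 754 64-bit
type, so that stepping the bit pattern of @code{-DBL_MAX} down by one
yields the neighbouring value closer to zero. All function names are
hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of D. */
uint64_t
double_bits (double d)
{
  uint64_t bits;

  assert (sizeof bits == sizeof d);
  memcpy (&bits, &d, sizeof bits);
  return bits;
}

/* System-missing value and HIGHEST. */
double sysmis_value (void) { return -DBL_MAX; }
double highest_value (void) { return DBL_MAX; }

/* LOWEST: the negative number of second-largest magnitude.
   Assumes IEEE 754, where decrementing the bit pattern of a
   negative number moves it one step toward zero. */
double
lowest_value (void)
{
  uint64_t bits = double_bits (-DBL_MAX) - 1;
  double d;

  memcpy (&d, &bits, sizeof d);
  return d;
}
```

On a format other than IEEE 754 the bit manipulation in
@code{lowest_value} would not be valid and a different method would be
needed.
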

System files are divided into records. Each record begins with an
@code{int32} giving a numeric record type. Individual record types are

* File Header Record::
* Value Label Record::
* Value Label Variable Record::
* Machine int32 Info Record::
* Machine flt64 Info Record::
* Auxiliary Variable Parameter Record::
* Long Variable Names Record::
* Very Long String Length Record::
* Miscellaneous Informational Records::
* Dictionary Termination Record::

@node File Header Record, Variable Record, Data File Format, Data File Format
@section File Header Record

The file header is always the first record in the file.

    int32 nominal_case_size;
    char creation_date[9];
    char creation_time[8];

@item char rec_type[4];
Record type code. Always set to @samp{$FL2}. This is the only record
for which the record type is not of type @code{int32}.

@item char prod_name[60];
Product identification string. This always begins with the characters
@samp{@@(#) SPSS DATA FILE}. PSPP uses the remaining characters to
give its version and the operating system name; for example, @samp{GNU
pspp 0.1.4 - sparc-sun-solaris2.5.2}. The string is truncated if it
would be longer than 60 characters; otherwise it is padded on the right

@item int32 layout_code;
Always set to 2. PSPP reads this value to determine the

@item int32 nominal_case_size;
Number of data elements per case. This is the number of variables,
except that long string variables add extra data elements (one for every
8 characters after the first 8). However, string variables do not
contribute to this value beyond the first 255 bytes. Further, system
files written by some systems set this value to -1. In general, it is
unsafe for systems reading system files to rely upon this value.

@item int32 compressed;
Set to 1 if the data in the file is compressed, 0 otherwise.

@item int32 weight_index;
If one of the variables in the data set is used as a weighting variable,
set to the index of that variable. Otherwise, set to 0.

Set to the number of cases in the file if it is known, or -1 otherwise.

In the general case it is not possible to determine the number of cases
that will be output to a system file at the time that the header is
written. PSPP deals with this by writing the entire system file,
including the header, then seeking back to the beginning of the file and
writing just the @code{ncases} field. For `files' in which this is not
valid, the seek operation fails. In this case, @code{ncases} remains -1.

Compression bias. Always set to 100. The significance of this value is
that only numbers between @code{(1 - bias)} and @code{(251 - bias)} can

@item char creation_date[9];
Set to the date of creation of the system file, in @samp{dd mmm yy}
format, with the month as standard English abbreviations, using an
initial capital letter and following with lowercase. If the date is not
available then this field is arbitrarily set to @samp{01 Jan 70}.

@item char creation_time[8];
Set to the time of creation of the system file, in @samp{hh:mm:ss}
format and using 24-hour time. If the time is not available then this
field is arbitrarily set to @samp{00:00:00}.

@item char file_label[64];
Set to the file label declared by the user, if any (@pxref{FILE LABEL}).
Padded on the right with spaces.

@item char padding[3];
Ignored padding bytes to make the structure a multiple of 32 bits in
length. Set to zeros.
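
The @code{creation_date} and @code{creation_time} layouts described
above can be produced as in the following sketch. The function name
@code{format_creation_stamp} is hypothetical. The month table is spelled
out explicitly so that the result does not depend on the current locale,
and the fields are written without a null terminator because they occupy
exactly 9 and 8 bytes in the header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static const char *const month_abbrev[12] =
  {
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
  };

/* Hypothetical helper: fill DATE with "dd mmm yy" and TIME_ with
   "hh:mm:ss".  Neither output is null-terminated. */
void
format_creation_stamp (const struct tm *tm, char date[9], char time_[8])
{
  char buf[32];

  assert (tm != NULL);
  assert (tm->tm_mon >= 0 && tm->tm_mon < 12);

  snprintf (buf, sizeof buf, "%02d %s %02d",
            tm->tm_mday, month_abbrev[tm->tm_mon], tm->tm_year % 100);
  memcpy (date, buf, 9);

  snprintf (buf, sizeof buf, "%02d:%02d:%02d",
            tm->tm_hour, tm->tm_min, tm->tm_sec);
  memcpy (time_, buf, 8);
}
```
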

@node Variable Record, Value Label Record, File Header Record, Data File Format
@section Variable Record

Immediately following the header must come the variable records. There
must be one variable record for every variable and every 8 characters in
a long string beyond the first 8.
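
The rule above amounts to one record per 8-byte segment of a string.
A small sketch, using the hypothetical name
@code{variable_record_count} and assuming that a width of 0 denotes a
numeric variable and that string widths do not exceed 255 bytes:

```c
#include <assert.h>

/* Hypothetical helper: number of variable records written for one
   variable.  WIDTH is 0 for a numeric variable, otherwise the string
   width in bytes (assumed to be at most 255). */
int
variable_record_count (int width)
{
  assert (width >= 0 && width <= 255);
  if (width == 0)
    return 1;
  return (width + 7) / 8;   /* One record per 8 characters. */
}
```

For example, a string of width 8 needs a single record, while a string
of width 9 needs two.
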

struct sysfile_variable
    int32 n_missing_values;

    /* The following two fields are present
       only if has_var_label is 1. */
    char label[/* variable length */];

    /* The following field is present only
       if n_missing_values is not 0. */
    flt64 missing_values[/* variable length */];

@item int32 rec_type;
Record type code. Always set to 2.

Variable type code. Set to 0 for a numeric variable. For a short
string variable or the first part of a long string variable, this is set
to the width of the string. For the second and subsequent parts of a
long string variable, set to -1, and the remaining fields in the
structure are ignored.

@item int32 has_var_label;
If this variable has a variable label, set to 1; otherwise, set to 0.

@item int32 n_missing_values;
If the variable has no missing values, set to 0. If the variable has
one, two, or three discrete missing values, set to 1, 2, or 3,
respectively. If the variable has a range of missing values, set to
-2; if the variable has a range of missing values plus a single
discrete value, set to -3.

Print format for this variable. See below.

Write format for this variable. See below.

Variable name. The variable name must begin with a capital letter or
the at-sign (@samp{@@}). Subsequent characters may also be digits, octothorpes
(@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
stops (@samp{.}). The variable name is padded on the right with spaces.

@item int32 label_len;
This field is present only if @code{has_var_label} is set to 1. It is
set to the length, in characters, of the variable label, which must be a
number between 0 and 120.

@item char label[/* variable length */];
This field is present only if @code{has_var_label} is set to 1. It has
length @code{label_len}, rounded up to the nearest multiple of 32 bits.
The first @code{label_len} characters are the variable's variable label.

@item flt64 missing_values[/* variable length */];
This field is present only if @code{n_missing_values} is not 0. It has
the same number of elements as the absolute value of
@code{n_missing_values}. For discrete missing values, each element
represents one missing value. When a range is present, the first
element denotes the minimum value in the range, and the second element
denotes the maximum value in the range. When a range plus a value are
present, the third element denotes the additional discrete missing
value. HIGHEST and LOWEST are indicated as described in the chapter

The @code{print} and @code{write} members of @code{sysfile_variable} are
output formats coded into @code{int32} types. The LSB (least-significant
byte) of the @code{int32} represents the number of decimal places, and the
next two bytes in order of increasing significance represent field width
and format type, respectively. The MSB (most-significant byte) is not
used and should be set to zero.
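
The byte layout just described can be packed and unpacked with shifts
and masks. The following is a sketch only; @code{struct format_spec},
@code{encode_format}, and @code{decode_format} are hypothetical names,
and the format type used in the usage note is an arbitrary number chosen
for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct format_spec
  {
    int type;       /* Format type code. */
    int width;      /* Field width. */
    int decimals;   /* Number of decimal places. */
  };

/* Pack a format into an int32: type in the third byte, width in the
   second, decimals in the least-significant byte, MSB zero. */
int32_t
encode_format (int type, int width, int decimals)
{
  assert (type >= 0 && type <= 255);
  assert (width >= 0 && width <= 255);
  assert (decimals >= 0 && decimals <= 255);
  return (int32_t) (((uint32_t) type << 16)
                    | ((uint32_t) width << 8)
                    | (uint32_t) decimals);
}

/* Unpack an int32 read from a variable record. */
struct format_spec
decode_format (int32_t raw)
{
  uint32_t u = (uint32_t) raw;
  struct format_spec f;

  f.decimals = (int) (u & 0xff);
  f.width = (int) ((u >> 8) & 0xff);
  f.type = (int) ((u >> 16) & 0xff);
  return f;
}
```

With these definitions, a format of type 5, width 8, and 2 decimal
places is stored as the value @code{0x00050802}.
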

Format types are defined as follows:

@node Value Label Record, Value Label Variable Record, Variable Record, Data File Format
@section Value Label Record

Value label records must follow the variable records and must precede
the dictionary termination record. Other than this, they may appear
anywhere in the system file. Every value label record must be
immediately followed by a value label variable record, described below.

Value label records begin with @code{rec_type}, an @code{int32} value
set to the record type of 3. This is followed by @code{count}, an
@code{int32} value set to the number of value labels present in this

These two fields are followed by a series of @code{count} tuples. Each
tuple is divided into two fields, the value and the label. The first of
these, the value, is composed of a 64-bit value, which is either a
@code{flt64} value or up to 8 characters (padded on the right to 8
bytes) denoting a short string value. Whether the value is a
@code{flt64} or a character string is not defined inside the value label

The second field in the tuple, the label, has variable length. The
first @code{char} is a count of the number of characters in the value
label. The remainder of the field is the label itself. The field is
padded on the right to a multiple of 64 bits in length.
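
The size of the label field follows directly from this description: one
count byte plus the label, rounded up to a multiple of 8 bytes. A
minimal sketch, with the hypothetical name
@code{value_label_field_size}:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total size in bytes of the label field of one
   value label tuple, including the leading count byte and padding.
   LABEL_LEN must fit in the single count byte. */
size_t
value_label_field_size (size_t label_len)
{
  assert (label_len <= 255);
  return (1 + label_len + 7) / 8 * 8;
}
```

A 7-character label therefore occupies 8 bytes, while an 8-character
label occupies 16.
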

@node Value Label Variable Record, Document Record, Value Label Record, Data File Format
@section Value Label Variable Record

Every value label variable record must be immediately preceded by a
value label record, described above.

struct sysfile_value_label_variable
    int32 vars[/* variable length */];

@item int32 rec_type;
Record type. Always set to 4.

Number of variables to which the associated value labels from the value
label record are to be applied.

@item int32 vars[/* variable length */];
A list of variables to which to apply the value labels. There are
@code{count} elements. Each element identifies a variable record, where
the first element is numbered 1 and long string variables are considered
to occupy multiple indexes.

@node Document Record, Machine int32 Info Record, Value Label Variable Record, Data File Format
@section Document Record

There must be no more than one document record per system file.
Document records must follow the variable records and precede the
dictionary termination record.

struct sysfile_document
    char lines[/* variable length */][80];

@item int32 rec_type;
Record type. Always set to 6.

Number of lines of documents present.

@item char lines[/* variable length */][80];
Document lines. The number of elements is defined by @code{n_lines}.
Lines shorter than 80 characters are padded on the right with spaces.
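
Padding a document line to its fixed 80-byte slot might look like the
sketch below. The name @code{pad_document_line} is hypothetical, and
truncating input longer than 80 characters is an assumption made for the
example; the text above does not say how over-long lines are handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy TEXT into an 80-byte document line,
   padding on the right with spaces.  No null terminator is written.
   Truncation of longer input is an assumption of this sketch. */
void
pad_document_line (const char *text, char line[80])
{
  size_t n;

  assert (text != NULL);
  n = strlen (text);
  if (n > 80)
    n = 80;
  memcpy (line, text, n);
  memset (line + n, ' ', 80 - n);
}
```
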

@node Machine int32 Info Record, Machine flt64 Info Record, Document Record, Data File Format
@section Machine @code{int32} Info Record

There must be no more than one machine @code{int32} info record per
system file. Machine @code{int32} info records must follow the variable
records and precede the dictionary termination record.

struct sysfile_machine_int32_info
    int32 version_revision;
    int32 floating_point_rep;
    int32 compression_code;
    int32 character_code;

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. Always set to 3.

Size of each piece of data in the data part, in bytes. Always set to 4.

Number of pieces of data in the data part. Always set to 8.

@item int32 version_major;
PSPP major version number. In version @var{x}.@var{y}.@var{z}, this

@item int32 version_minor;
PSPP minor version number. In version @var{x}.@var{y}.@var{z}, this

@item int32 version_revision;
PSPP version revision number. In version @var{x}.@var{y}.@var{z},

@item int32 machine_code;
Machine code. PSPP always sets this field to -1, but other

@item int32 floating_point_rep;
Floating point representation code. For IEEE 754 systems this is 1.
IBM 370 sets this to 2, and DEC VAX E to 3.

@item int32 compression_code;
Compression code. Always set to 1.

@item int32 endianness;
Machine endianness. 1 indicates big-endian, 2 indicates little-endian.

@item int32 character_code;
Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
indicates 8-bit ASCII, 4 indicates DEC Kanji.
Windows code page numbers are also valid.
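
A writer needs the host's value for the @code{endianness} field, and a
reader can compare the file's value with it. The sketch below shows one
way to do both; @code{host_endianness_code} and @code{needs_byte_swap}
are hypothetical names, and only the two codes listed above are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the endianness code of the host: 1 for big-endian,
   2 for little-endian. */
int
host_endianness_code (void)
{
  const uint32_t probe = 1;
  unsigned char first;

  memcpy (&first, &probe, 1);   /* Lowest-addressed byte of PROBE. */
  return first == 1 ? 2 : 1;
}

/* Return nonzero if data in a file whose endianness field is
   FILE_ENDIANNESS must be byte-swapped on this host. */
int
needs_byte_swap (int file_endianness)
{
  assert (file_endianness == 1 || file_endianness == 2);
  return file_endianness != host_endianness_code ();
}
```
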

@node Machine flt64 Info Record, Auxiliary Variable Parameter Record, Machine int32 Info Record, Data File Format
@section Machine @code{flt64} Info Record

There must be no more than one machine @code{flt64} info record per
system file. Machine @code{flt64} info records must follow the variable
records and precede the dictionary termination record.

struct sysfile_machine_flt64_info

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. Always set to 4.

Size of each piece of data in the data part, in bytes. Always set to 8.

Number of pieces of data in the data part. Always set to 3.

The system missing value.

The value used for HIGHEST in missing values.

The value used for LOWEST in missing values.

@node Auxiliary Variable Parameter Record, Long Variable Names Record, Machine flt64 Info Record, Data File Format
@section Auxiliary Variable Parameter Record

There must be no more than one auxiliary variable parameter record per
system file. This record must follow the variable
records and precede the dictionary termination record.

struct sysfile_aux_var_parameter
    struct aux_params aux_params[/* variable length */];

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. Always set to 11.

The size of an @code{int32}. Always set to 4.

The total number of records in @code{aux_params}, multiplied by 3.

@item struct aux_params aux_params[];
An array of @code{struct aux_params}. The order of the elements corresponds
to the order of the variables in the Variable Records. No element
corresponds to variable records that continue long string variables.
The @code{struct aux_params} type is defined as follows:

The measurement type of the variable:

Occasionally a value of 0 is seen here. PSPP interprets this to mean

The width of the display column for the variable in characters.

@item int32 alignment
The alignment of the variable for display purposes:

@node Long Variable Names Record, Very Long String Length Record, Auxiliary Variable Parameter Record, Data File Format
@section Long Variable Names Record

There must be no more than one long variable names record per
system file. This record must follow the variable
records and precede the dictionary termination record.

struct sysfile_long_variable_names
    char var_name_pairs[/* variable length */];

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. Always set to 13.

The size of each element in the @code{var_name_pairs} member. Always set to 1.

The total number of bytes in @code{var_name_pairs}.

@item char var_name_pairs[/* variable length */];
A list of @var{key}--@var{value} tuples, where @var{key} is the name
of a variable, and @var{value} is its long variable name.
The @var{key} field is at most 8 bytes long and must match the
name of a variable which appears in the variable record (@pxref{Variable
Record}).
The @var{value} field is at most 64 bytes long.
The @var{key} and @var{value} fields are separated by a @samp{=} byte.
Each tuple is separated by a byte whose value is 09. There is no
trailing separator following the last tuple.
The total length is @code{count} bytes.
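
The tuple layout described above can be split with two calls to
@code{memchr} per tuple. The following is a sketch under the stated
limits (keys of 1 to 8 bytes, values of at most 64 bytes, tuples
separated by byte 09, which is the ASCII tab). The type
@code{struct name_pair} and the function @code{parse_var_name_pairs}
are hypothetical and do not reflect how PSPP itself parses the record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct name_pair
  {
    char key[9];      /* Up to 8 bytes plus a terminator. */
    char value[65];   /* Up to 64 bytes plus a terminator. */
  };

/* Hypothetical parser for the COUNT bytes at DATA.  Stores up to
   MAX_PAIRS tuples in OUT.  Returns the number of tuples stored, or
   -1 if the data is malformed or holds more than MAX_PAIRS tuples. */
int
parse_var_name_pairs (const char *data, size_t count,
                      struct name_pair *out, int max_pairs)
{
  size_t pos = 0;
  int n = 0;

  assert (data != NULL && out != NULL);
  while (pos < count)
    {
      const char *tuple = data + pos;
      size_t rest = count - pos;
      const char *tab = memchr (tuple, '\t', rest);
      size_t tuple_len = tab != NULL ? (size_t) (tab - tuple) : rest;
      const char *eq = memchr (tuple, '=', tuple_len);
      size_t key_len, value_len;

      if (eq == NULL || n >= max_pairs)
        return -1;
      key_len = (size_t) (eq - tuple);
      value_len = tuple_len - key_len - 1;
      if (key_len == 0 || key_len > 8 || value_len > 64)
        return -1;

      memcpy (out[n].key, tuple, key_len);
      out[n].key[key_len] = '\0';
      memcpy (out[n].value, eq + 1, value_len);
      out[n].value[value_len] = '\0';
      n++;

      pos += tuple_len + (tab != NULL ? 1 : 0);   /* Skip separator. */
    }
  return n;
}
```
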

@node Very Long String Length Record, Miscellaneous Informational Records, Long Variable Names Record, Data File Format
@comment node-name, next, previous, up
@section Very Long String Length Record

There must be no more than one very long string length record per
system file. This record must follow the variable records and precede the
dictionary termination record.

struct sysfile_very_long_string_lengths
    char string_lengths[/* variable length */];

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. Always set to 14.

The size of each element in the @code{string_lengths} member. Always set to 1.

The total number of bytes in @code{string_lengths}.

@item char string_lengths[/* variable length */];
A list of @var{key}--@var{value} tuples, where @var{key} is the name
of a variable, and @var{value} is its length.
The @var{key} field is at most 8 bytes long and must match the
name of a variable which appears in the variable record (@pxref{Variable
Record}).
The @var{value} field is exactly 5 bytes long. It is a zero-padded,
ASCII-encoded string that is the length of the variable.
The @var{key} and @var{value} fields are separated by a @samp{=} byte.
Tuples are delimited by a two-byte sequence @{00, 09@}.
After the last tuple, there may be a single byte 00, or @{00, 09@}.
The total length is @code{count} bytes.
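
Converting the 5-byte @var{value} field to a number needs no library
call, since the field is a fixed run of ASCII digits without a
terminator. A minimal sketch, with the hypothetical name
@code{parse_string_length}:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert the 5-byte, zero-padded ASCII VALUE
   field to a number.  Returns -1 if any byte is not a digit. */
long
parse_string_length (const char value[5])
{
  long length = 0;
  int i;

  assert (value != NULL);
  for (i = 0; i < 5; i++)
    {
      if (value[i] < '0' || value[i] > '9')
        return -1;
      length = length * 10 + (value[i] - '0');
    }
  return length;
}
```

For example, the field @samp{00300} yields 300.
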

@node Miscellaneous Informational Records, Dictionary Termination Record, Very Long String Length Record, Data File Format
@section Miscellaneous Informational Records

Miscellaneous informational records must follow the variable records and
precede the dictionary termination record.

Some specific types of miscellaneous informational records are
documented here, but others are known to exist. PSPP ignores unknown
miscellaneous informational records when reading system files.
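
To ignore an unknown record, a reader has to know how many data bytes
to skip, which is the product of the record's @code{size} and
@code{count} fields. The sketch below computes that product in 64 bits
so that it cannot overflow. The name @code{misc_record_data_bytes} is
hypothetical, and rejecting negative field values is an assumption of
the example rather than documented PSPP behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of data bytes in a miscellaneous
   informational record, or -1 if either field is negative. */
int64_t
misc_record_data_bytes (int32_t size, int32_t count)
{
  int64_t bytes;

  if (size < 0 || count < 0)
    return -1;
  bytes = (int64_t) size * (int64_t) count;
  assert (bytes >= 0);   /* Two non-negative int32s fit in int64. */
  return bytes;
}
```
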

struct sysfile_misc_info
    char data[/* variable length */];

@item int32 rec_type;
Record type. Always set to 7.

Record subtype. May take any value. According to Aapi
H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6
indicates date info (probably related to USE).

Size of each piece of data in the data part. Should have the value 4 or
8, for @code{int32} and @code{flt64}, respectively.

Number of pieces of data in the data part.

@item char data[/* variable length */];
Arbitrary data. There must be @code{size} times @code{count} bytes of

@node Dictionary Termination Record, Data Record, Miscellaneous Informational Records, Data File Format
@section Dictionary Termination Record

The dictionary termination record must follow all other records, except
for the actual cases, which it must precede. There must be exactly one
dictionary termination record in every system file.

struct sysfile_dict_term

@item int32 rec_type;
Record type. Always set to 999.

Ignored padding. Should be set to 0.

@node Data Record, , Dictionary Termination Record, Data File Format

Data records must follow all other records in the data file. There must
be at least one data record in every system file.

The format of data records varies depending on whether the data is
compressed. Regardless, the data is arranged in a series of 8-byte

When data is not compressed, each element corresponds to the variable
declared in the respective variable record (@pxref{Variable
Record}). Numeric values are given in @code{flt64} format; string
values are literal character strings, padded on the right when

Compressed data is arranged in the following manner: the first 8-byte
element in the data section is divided into a series of 1-byte command
codes. These codes have meanings as described below:

Ignored. If the program writing the system file accumulates compressed
data in blocks of fixed length, 0 bytes can be used to pad out extra
bytes remaining at the end of a fixed-size block.

These values indicate that the corresponding numeric variable has the
value @code{(@var{code} - @var{bias})} for the case being read, where
@var{code} is the value of the compression code and @var{bias} is the
variable @code{compression_bias} from the file header. For example,
code 105 with bias 100.0 (the normal value) indicates a numeric variable

End of file. This code may or may not appear at the end of the data
stream. PSPP always outputs this code but its use is not required.

This value indicates that the numeric or string value is not
compressible. The value is stored in the 8-byte element following the
current block of command bytes. If this value appears twice in a block
of command bytes, then it indicates the second element following the
command bytes, and so on.

Used to indicate a string value that is all spaces.

Used to indicate the system-missing value.
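
The two simplest kinds of command code can be decoded as in the sketch
below. It handles only padding (code 0) and biased numeric values, whose
range of 1 through 251 is inferred from the description of the
compression bias in the file header. All other codes are reported as
@code{CODE_OTHER} and left to the caller. The enumeration and the
function @code{decode_command} are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum code_kind
  {
    CODE_PADDING,   /* Code 0: ignored. */
    CODE_NUMBER,    /* Codes 1 through 251: value is code - bias. */
    CODE_OTHER      /* Any other code: not handled by this sketch. */
  };

/* Hypothetical helper: classify one command byte.  For CODE_NUMBER
   the decoded value is stored in *VALUE. */
enum code_kind
decode_command (unsigned char code, double bias, double *value)
{
  assert (value != NULL);
  if (code == 0)
    return CODE_PADDING;
  if (code <= 251)
    {
      *value = code - bias;
      return CODE_NUMBER;
    }
  return CODE_OTHER;
}
```

With the usual bias of 100, code 105 decodes to the value 5.
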

When the end of the first 8-byte element of command bytes is reached,
any blocks of non-compressible values are skipped, and the next element
of command bytes is read and interpreted, until the end of the file is