X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=89c35aab3f16dbe064f14dc94c3ced83ce285f60;hb=refs%2Fbuilds%2F20131212033030%2Fpspp;hp=7b0eff9f79823f0bb623e10d2f1da8c22922eff7;hpb=fe8dc2171009e90d2335f159d05f7e6660e24780;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index 7b0eff9f79..89c35aab3f 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -33,12 +33,40 @@ floating-point numbers, and translates as needed. However, only IEEE has actually been observed in system files, and it is likely that other formats are obsolete or were never used. -The PSPP system-missing value is represented by the largest possible -negative number in the floating point format (@code{-DBL_MAX}). Two -other values are important for use as missing values: @code{HIGHEST}, -represented by the largest possible positive number (@code{DBL_MAX}), -and @code{LOWEST}, represented by the second-largest negative number -(in IEEE 754 format, @code{0xffeffffffffffffe}). +System files use a few floating point values for special purposes: + +@table @asis +@item SYSMIS +The system-missing value is represented by the largest possible +negative number in the floating point format (@code{-DBL_MAX}). + +@item HIGHEST +HIGHEST is used as the high end of a missing value range with an +unbounded maximum. It is represented by the largest possible positive +number (@code{DBL_MAX}). + +@item LOWEST +LOWEST is used as the low end of a missing value range with an +unbounded minimum. It was originally represented by the +second-largest negative number (in IEEE 754 format, +@code{0xffeffffffffffffe}). System files written by SPSS 21 and later +instead use the largest negative number (@code{-DBL_MAX}), the same +value as SYSMIS. This does not lead to ambiguity because LOWEST +appears in system files only in missing value ranges, which never +contain SYSMIS. 
+@end table + +System files may use most character encodings based on an 8-bit unit. +UTF-16 and UTF-32, based on wider units, appear to be unacceptable. +@code{rec_type} in the file header record is sufficient to distinguish +between ASCII and EBCDIC based encodings. The best way to determine +the specific encoding in use is to consult the character encoding +record (@pxref{Character Encoding Record}), if present, and failing +that the @code{character_code} in the machine integer info record +(@pxref{Machine Integer Info Record}). The same encoding should be +used for the dictionary and the data in the file, although it is +possible to artificially synthesize files that use different encodings +(@pxref{Character Encoding Record}). System files are divided into records, each of which begins with a 4-byte record type, usually regarded as an @code{int32}. @@ -80,11 +108,13 @@ Each type of record is described separately below. * Machine Integer Info Record:: * Machine Floating-Point Info Record:: * Multiple Response Sets Records:: +* Extra Product Info Record:: * Variable Display Parameter Record:: * Long Variable Names Record:: * Very Long String Record:: * Character Encoding Record:: * Long String Value Labels Record:: +* Long String Missing Values Record:: * Data File and Variable Attributes Records:: * Extended Number of Cases Record:: * Miscellaneous Informational Records:: @@ -103,7 +133,7 @@ char rec_type[4]; char prod_name[60]; int32 layout_code; int32 nominal_case_size; -int32 compressed; +int32 compression; int32 weight_index; int32 ncases; flt64 bias; @@ -115,7 +145,15 @@ char padding[3]; @table @code @item char rec_type[4]; -Record type code, set to @samp{$FL2}. +Record type code, either @samp{$FL2} for system files with +uncompressed data or data compressed with simple bytecode compression, +or @samp{$FL3} for system files with ZLIB compressed data. + +This is truly a character field that uses the same character encoding as +other strings. 
Thus, in a file with an ASCII-based character encoding +this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a +file with an EBCDIC-based encoding this field contains @code{5b c6 d3 +f2}. (No EBCDIC-based ZLIB-compressed files have been observed.) @item char prod_name[60]; Product identification string. This always begins with the characters @@ -140,7 +178,10 @@ files written by some systems set this value to -1. In general, it is unsafe for systems reading system files to rely upon this value. @item int32 compressed; -Set to 1 if the data in the file is compressed, 0 otherwise. +Set to 0 if the data in the file is not compressed, 1 if the data is +compressed with simple bytecode compression, 2 if the data is ZLIB +compressed. This field has value 2 if and only if @code{rec_type} is +@samp{$FL3}. @item int32 weight_index; If one of the variables in the data set is used as a weighting @@ -185,6 +226,10 @@ field is arbitrarily set to @samp{00:00:00}. File label declared by the user, if any (@pxref{FILE LABEL,,,pspp, PSPP Users Guide}). Padded on the right with spaces. +A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses +CR-only line ends in this field, rather than the more usual LF-only or +CR LF line ends. + @item char padding[3]; Ignored padding bytes to make the structure a multiple of 32 bits in length. Set to zeros. @@ -254,6 +299,10 @@ respectively. If the variable has a range for missing variables, set to -2; if the variable has a range for missing variables plus a single discrete value, set to -3. +A long string variable always has the value 0 here. A separate record +indicates missing values for long string variables (@pxref{Long String +Missing Values Record}). + @item int32 print; Print format for this variable. See below. @@ -549,7 +598,8 @@ Floating point representation code. For IEEE 754 systems this is 1. IBM 370 sets this to 2, and DEC VAX E to 3. @item int32 compression_code; -Compression code. Always set to 1. 
+Compression code. Always set to 1, regardless of whether or how the +file is compressed. @item int32 endianness; Machine endianness. 1 indicates big-endian, 2 indicates little-endian. @@ -559,6 +609,9 @@ Machine endianness. 1 indicates big-endian, 2 indicates little-endian. been actually observed in system files: @table @asis +@item 1 +EBCDIC. + @item 2 7-bit ASCII. @@ -579,9 +632,6 @@ UTF-8. The following additional values are known to be defined: @table @asis -@item 1 -EBCDIC. - @item 3 8-bit ``ASCII''. @@ -591,9 +641,10 @@ DEC Kanji. Other Windows code page numbers are known to be generally valid. -Old versions of SPSS always wrote value 2 in this field, regardless of -the encoding in use. Newer versions also write the character encoding -as a string (see @ref{Character Encoding Record}). +Old versions of SPSS for Unix and Windows always wrote value 2 in this +field, regardless of the encoding in use. Newer versions also write +the character encoding as a string (see @ref{Character Encoding +Record}). @end table @node Machine Floating-Point Info Record @@ -761,6 +812,44 @@ $d=E 1 2 34 13 third mdgroup k l m $e=E 11 6 choice 0 n o p @end example +@node Extra Product Info Record +@section Extra Product Info Record + +This optional record appears to contain a text string that describes +the program that wrote the file and the source of the data. (This is +redundant with the file label and product info found in the file +header record.) + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Exactly @code{count} bytes of data.} */ +char info[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Always set to 10. + +@item int32 size; +The size of each element in the @code{info} member. Always set to 1. + +@item int32 count; +The total number of bytes in @code{info}. + +@item char info[]; +A text string. 
A product that identifies itself as @code{VOXCO +INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the +more usual LF-only or CR LF line ends. +@end table + @node Variable Display Parameter Record @section Variable Display Parameter Record @@ -812,8 +901,8 @@ Ordinal Scale Continuous Scale @end table -SPSS 14 sometimes writes a @code{measure} of 0 for string variables. -PSPP interprets this as nominal scale. +SPSS sometimes writes a @code{measure} of 0. PSPP interprets this as +nominal scale. @item int32 width; The width of the display column for the variable in characters. @@ -1090,6 +1179,74 @@ between 0 and 120, is the number of bytes in @code{label}. The @end table @end table +@node Long String Missing Values Record +@section Long String Missing Values Record + +This record, if present, specifies missing values for long string +variables. + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Repeated up to exactly @code{count} bytes.} */ +int32 var_name_len; +char var_name[]; +char n_missing_values; +long_string_missing_value values[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Always set to 22. + +@item int32 size; +Always set to 1. + +@item int32 count; +The number of bytes following the header until the next header. + +@item int32 var_name_len; +@itemx char var_name[]; +The number of bytes in the name of the long string variable that has +missing values, plus the variable name itself, which consists of +exactly @code{var_name_len} bytes. The variable name is not padded to +any particular boundary, nor is it null-terminated. + +@item char n_missing_values; +The number of missing values, either 1, 2, or 3. (This is, unusually, +a single byte instead of a 32-bit number.) + +@item long_string_missing_value values[]; +The missing values themselves. 
This array contains exactly +@code{n_missing_values} elements, each of which has the following +substructure: + +@example +int32 value_len; +char value[]; +@end example + +@table @code +@item int32 value_len; +The length of the missing value string, in bytes. This value should +be 8, because long string variables are at least 8 bytes wide (by +definition), only the first 8 bytes of a long string variable's +missing values are allowed to be non-spaces, and any spaces within the +first 8 bytes are included in the missing value here. + +@item char value[]; +The missing value string, exactly @code{value_len} bytes, without +any padding or null terminator. +@end table +@end table + @node Data File and Variable Attributes Records @section Data File and Variable Attributes Records @@ -1170,6 +1327,32 @@ will contain a variable attribute record with the following contents: 00000030 0a 29 |.) | @end example +@menu +* Variable Roles:: +@end menu + +@node Variable Roles +@subsection Variable Roles + +A variable's role is represented as an attribute named @code{$@@Role}. +This attribute has a single element whose values and their meanings +are: + +@table @code +@item 0 +Input. This, the default, is the most common role. +@item 1 +Output. +@item 2 +Both. +@item 3 +None. +@item 4 +Partition. +@item 5 +Split. +@end table + @node Extended Number of Cases Record @section Extended Number of Cases Record @@ -1234,7 +1417,9 @@ Record type. Always set to 7. @item int32 subtype; Record subtype. May take any value. According to Aapi H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6 -indicates date info (probably related to USE). +indicates date info (probably related to USE). Subtype 24 appears to +contain XML that describes how data in the file should be displayed +on-screen. @item int32 size; Size of each piece of data in the data part. Should have the value 1, @@ -1271,22 +1456,23 @@ Ignored padding. Should be set to 0. 
@node Data Record @section Data Record -Data records must follow all other records in the system file. There must -be at least one data record in every system file. - -The format of data records varies depending on whether the data is -compressed. Regardless, the data is arranged in a series of 8-byte -elements. +The data record must follow all other records in the system file. +Every system file must have a data record that specifies data for at +least one case. The format of the data record varies depending on the +value of @code{compression} in the file header record: -When data is not compressed, -each element corresponds to +@table @asis +@item 0: no compression +Data is arranged as a series of 8-byte elements. +Each element corresponds to the variable declared in the respective variable record (@pxref{Variable Record}). Numeric values are given in @code{flt64} format; string values are literal characters string, padded on the right when necessary to fill out 8-byte units. -Compressed data is arranged in the following manner: the first 8 bytes -in the data section is divided into a series of 1-byte command +@item 1: bytecode compression +The first 8 bytes +of the data record is divided into a series of 1-byte command codes. These codes have meanings as described below: @table @asis @@ -1324,8 +1510,125 @@ An 8-byte string value that is all spaces. The system-missing value. @end table -When the end of the an 8-byte group of command bytes is reached, any -blocks of non-compressible values indicated by code 253 are skipped, -and the next element of command bytes is read and interpreted, until -the end of the file or a code with value 252 is reached. +The end of the 8-byte group of bytecodes is followed by any 8-byte +blocks of non-compressible values indicated by code 253. After that +follows another 8-byte group of bytecodes, then those bytecodes' +non-compressible values. The pattern repeats to the end of the file +or a code with value 252. 
+ +@item 2: ZLIB compression +The data record consists of the following, in order: + +@itemize @bullet +@item +ZLIB data header, 24 bytes long. + +@item +One or more variable-length blocks of ZLIB compressed data. + +@item +ZLIB data trailer, with a 24-byte fixed header plus an additional 24 +bytes for each preceding ZLIB compressed data block. +@end itemize + +The ZLIB data header has the following format: + +@example +int64 zheader_ofs; +int64 ztrailer_ofs; +int64 ztrailer_len; +@end example + +@table @code +@item int64 zheader_ofs; +The offset, in bytes, of the beginning of this structure within the +system file. + +@item int64 ztrailer_ofs; +The offset, in bytes, of the first byte of the ZLIB data trailer. + +@item int64 ztrailer_len; +The number of bytes in the ZLIB data trailer. This and the previous +field sum to the size of the system file in bytes. +@end table + +The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB +compressed data blocks. Each ZLIB compressed data block begins with a +ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78 +01} (the only header yet observed in practice). Each block +decompresses to a fixed number of bytes (in practice only +@code{0x3ff000}-byte blocks have been observed), except that the last +block of data may be shorter. The last ZLIB compressed data block +ends just before offset @code{ztrailer_ofs}. + +The result of ZLIB decompression is bytecode compressed data as +described above for compression format 1. + +The ZLIB data trailer begins with the following 24-byte fixed header: + +@example +int64 int_bias; +int64 zero; +int32 block_size; +int32 n_blocks; +@end example + +@table @code +@item int64 int_bias; +The compression bias as a negative integer, e.g.@: if @code{bias} in +the file header record is 100.0, then @code{int_bias} is @minus{}100 +(this is the only value yet observed in practice). + +@item int64 zero; +Always observed to be zero. 
+ +@item int32 block_size; +The number of bytes in each ZLIB compressed data block, except +possibly the last, following decompression. Only @code{0x3ff000} has +been observed so far. + +@item int32 n_blocks; +The number of ZLIB compressed data blocks, always exactly +@code{(ztrailer_len - 24) / 24}. +@end table + +The fixed header is followed by @code{n_blocks} 24-byte ZLIB data +block descriptors, each of which describes the compressed data block +corresponding to its offset. Each block descriptor has the following +format: + +@example +int64 uncompressed_ofs; +int64 compressed_ofs; +int32 uncompressed_size; +int32 compressed_size; +@end example + +@table @code +@item int64 uncompressed_ofs; +The offset, in bytes, that this block of data would have in a similar +system file that uses compression format 1. This is +@code{zheader_ofs} in the first block descriptor, and in each +succeeding block descriptor it is the sum of the previous descriptor's +@code{uncompressed_ofs} and @code{uncompressed_size}. + +@item int64 compressed_ofs; +The offset, in bytes, of the actual beginning of this compressed data +block. This is @code{zheader_ofs + 24} in the first block descriptor, +and in each succeeding block descriptor it is the sum of the previous +descriptor's @code{compressed_ofs} and @code{compressed_size}. The +final block descriptor's @code{compressed_ofs} and +@code{compressed_size} sum to @code{ztrailer_ofs}. + +@item int32 uncompressed_size; +The number of bytes in this data block, after decompression. This is +@code{block_size} in every data block except the last, which may be +smaller. + +@item int32 compressed_size; +The number of bytes in this data block, as stored compressed in this +system file. +@end table +@end table + @setfilename ignored
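
Reviewer note: the ZLIB data record layout described in this patch implies several cross-checks that a reader could apply. The following is a minimal Python sketch of those checks, not code from PSPP. It assumes little-endian integers (a real reader would take the byte order from the file itself), and the names `read_zlib_data_record` and `build_zlib_data_record` are hypothetical. The builder exists only to produce a synthetic image for exercising the reader; it does not write a valid system file. The block count is checked against the trailer size, which the layout gives as a 24-byte fixed header plus 24 bytes per block.

```python
import struct
import zlib

# Assumption: little-endian layout. Each structure is exactly 24 bytes.
ZHEADER = struct.Struct("<qqq")    # zheader_ofs, ztrailer_ofs, ztrailer_len
ZTRAILER = struct.Struct("<qqii")  # int_bias, zero, block_size, n_blocks
ZBLOCK = struct.Struct("<qqii")    # uncompressed_ofs, compressed_ofs,
                                   # uncompressed_size, compressed_size


def read_zlib_data_record(buf, zheader_ofs):
    """Validate the ZLIB data record in buf and return (payload, int_bias).

    The payload is the concatenated decompressed blocks, i.e. data in
    bytecode compression format 1.
    """
    zofs, ztrailer_ofs, ztrailer_len = ZHEADER.unpack_from(buf, zheader_ofs)
    if zofs != zheader_ofs:
        raise ValueError("zheader_ofs does not match the header position")
    if ztrailer_ofs + ztrailer_len != len(buf):
        raise ValueError("trailer does not end at the end of the file")

    int_bias, _zero, block_size, n_blocks = ZTRAILER.unpack_from(
        buf, ztrailer_ofs)
    if ztrailer_len != 24 + 24 * n_blocks:
        raise ValueError("ztrailer_len is inconsistent with n_blocks")

    expected_unc = zheader_ofs       # first uncompressed_ofs
    expected_cmp = zheader_ofs + 24  # first compressed_ofs
    payload = bytearray()
    for i in range(n_blocks):
        unc_ofs, cmp_ofs, unc_size, cmp_size = ZBLOCK.unpack_from(
            buf, ztrailer_ofs + 24 + 24 * i)
        if (unc_ofs, cmp_ofs) != (expected_unc, expected_cmp):
            raise ValueError(f"block {i}: offsets do not chain")
        is_last = i == n_blocks - 1
        if unc_size > block_size or (not is_last and unc_size != block_size):
            raise ValueError(f"block {i}: unexpected uncompressed_size")
        data = zlib.decompress(bytes(buf[cmp_ofs:cmp_ofs + cmp_size]))
        if len(data) != unc_size:
            raise ValueError(f"block {i}: decompressed length mismatch")
        payload += data
        expected_unc += unc_size
        expected_cmp += cmp_size
    if expected_cmp != ztrailer_ofs:
        raise ValueError("last block does not end at ztrailer_ofs")
    return bytes(payload), int_bias


def build_zlib_data_record(prefix, payload, block_size, bias=100):
    """Build a synthetic image (prefix + ZLIB data record) for testing."""
    chunks = [payload[i:i + block_size]
              for i in range(0, len(payload), block_size)]
    blocks = [zlib.compress(chunk) for chunk in chunks]
    zheader_ofs = len(prefix)
    ztrailer_ofs = zheader_ofs + 24 + sum(len(b) for b in blocks)
    ztrailer_len = 24 + 24 * len(blocks)

    out = bytearray(prefix)
    out += ZHEADER.pack(zheader_ofs, ztrailer_ofs, ztrailer_len)
    for block in blocks:
        out += block
    out += ZTRAILER.pack(-bias, 0, block_size, len(blocks))
    unc_ofs, cmp_ofs = zheader_ofs, zheader_ofs + 24
    for chunk, block in zip(chunks, blocks):
        out += ZBLOCK.pack(unc_ofs, cmp_ofs, len(chunk), len(block))
        unc_ofs += len(chunk)
        cmp_ofs += len(block)
    return bytes(out)
```

A round trip through both functions exercises every invariant: build an image from an arbitrary prefix and payload, then read it back starting at `len(prefix)`. The returned payload would still need bytecode decompression as described for compression format 1.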