X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=a480195857f8d027cc910d9c9ac76b6d6ff26fb7;hb=008fe5fdec4f94535df888ff8cdd94f802a3660d;hp=af74fd2f600c6e95f0380afa23465605365263b8;hpb=f1e84802839683eba11166a85e2f9fb5d49f3490;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index af74fd2f60..a480195857 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -56,6 +56,18 @@ appears in system files only in missing value ranges, which never contain SYSMIS. @end table +System files may use most character encodings based on an 8-bit unit. +UTF-16 and UTF-32, based on wider units, appear to be unacceptable. +@code{rec_type} in the file header record is sufficient to distinguish +between ASCII and EBCDIC based encodings. The best way to determine +the specific encoding in use is to consult the character encoding +record (@pxref{Character Encoding Record}), if present, and failing +that the @code{character_code} in the machine integer info record +(@pxref{Machine Integer Info Record}). The same encoding should be +used for the dictionary and the data in the file, although it is +possible to artificially synthesize files that use different encodings +(@pxref{Character Encoding Record}). + System files are divided into records, each of which begins with a 4-byte record type, usually regarded as an @code{int32}. @@ -96,16 +108,19 @@ Each type of record is described separately below. 
* Machine Integer Info Record:: * Machine Floating-Point Info Record:: * Multiple Response Sets Records:: +* Extra Product Info Record:: * Variable Display Parameter Record:: * Long Variable Names Record:: * Very Long String Record:: * Character Encoding Record:: * Long String Value Labels Record:: +* Long String Missing Values Record:: * Data File and Variable Attributes Records:: * Extended Number of Cases Record:: * Miscellaneous Informational Records:: * Dictionary Termination Record:: * Data Record:: +* Encrypted System Files:: @end menu @node File Header Record @@ -119,7 +134,7 @@ char rec_type[4]; char prod_name[60]; int32 layout_code; int32 nominal_case_size; -int32 compressed; +int32 compression; int32 weight_index; int32 ncases; flt64 bias; @@ -131,9 +146,15 @@ char padding[3]; @table @code @item char rec_type[4]; -Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c -32} if the file uses an ASCII-based character encoding, or @code{5b c6 -d3 f2} if the file uses an EBCDIC-based character encoding. +Record type code, either @samp{$FL2} for system files with +uncompressed data or data compressed with simple bytecode compression, +or @samp{$FL3} for system files with ZLIB compressed data. + +This is truly a character field that uses the character encoding as +other strings. Thus, in a file with an ASCII-based character encoding +this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a +file with an EBCDIC-based encoding this field contains @code{5b c6 d3 +f2}. (No EBCDIC-based ZLIB-compressed files have been observed.) @item char prod_name[60]; Product identification string. This always begins with the characters @@ -158,7 +179,10 @@ files written by some systems set this value to -1. In general, it is unsafe for systems reading system files to rely upon this value. @item int32 compressed; -Set to 1 if the data in the file is compressed, 0 otherwise. 
+Set to 0 if the data in the file is not compressed, 1 if the data is +compressed with simple bytecode compression, 2 if the data is ZLIB +compressed. This field has value 2 if and only if @code{rec_type} is +@samp{$FL3}. @item int32 weight_index; If one of the variables in the data set is used as a weighting @@ -203,6 +227,10 @@ field is arbitrarily set to @samp{00:00:00}. File label declared by the user, if any (@pxref{FILE LABEL,,,pspp, PSPP Users Guide}). Padded on the right with spaces. +A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses +CR-only line ends in this field, rather than the more usual LF-only or +CR LF line ends. + @item char padding[3]; Ignored padding bytes to make the structure a multiple of 32 bits in length. Set to zeros. @@ -272,6 +300,10 @@ respectively. If the variable has a range for missing variables, set to -2; if the variable has a range for missing variables plus a single discrete value, set to -3. +A long string variable always has the value 0 here. A separate record +indicates missing values for long string variables (@pxref{Long String +Missing Values Record}). + @item int32 print; Print format for this variable. See below. @@ -284,6 +316,13 @@ the at-sign (@samp{@@}). Subsequent characters may also be digits, octothorpes (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full stops (@samp{.}). The variable name is padded on the right with spaces. +The @samp{name} fields should be unique within a system file. System +files written by SPSS that contain very long string variables with +similar names sometimes contain duplicate names that are later +eliminated by resolving the very long string names (@pxref{Very Long +String Record}). PSPP handles duplicates by assigning them new, +unique names. + @item int32 label_len; This field is present only if @code{has_var_label} is set to 1. It is set to the length, in characters, of the variable label. The @@ -567,7 +606,8 @@ Floating point representation code. 
For IEEE 754 systems this is 1. IBM 370 sets this to 2, and DEC VAX E to 3. @item int32 compression_code; -Compression code. Always set to 1. +Compression code. Always set to 1, regardless of whether or how the +file is compressed. @item int32 endianness; Machine endianness. 1 indicates big-endian, 2 indicates little-endian. @@ -780,6 +820,44 @@ $d=E 1 2 34 13 third mdgroup k l m $e=E 11 6 choice 0 n o p @end example +@node Extra Product Info Record +@section Extra Product Info Record + +This optional record appears to contain a text string that describes +the program that wrote the file and the source of the data. (This is +redundant with the file label and product info found in the file +header record.) + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Exactly @code{count} bytes of data.} */ +char info[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Always set to 10. + +@item int32 size; +The size of each element in the @code{info} member. Always set to 1. + +@item int32 count; +The total number of bytes in @code{info}. + +@item char info[]; +A text string. A product that identifies itself as @code{VOXCO +INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the +more usual LF-only or CR LF line ends. +@end table + @node Variable Display Parameter Record @section Variable Display Parameter Record @@ -1109,6 +1187,74 @@ between 0 and 120, is the number of bytes in @code{label}. The @end table @end table +@node Long String Missing Values Record +@section Long String Missing Values Record + +This record, if present, specifies missing values for long string +variables. 
+ +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Repeated up to exactly @code{count} bytes.} */ +int32 var_name_len; +char var_name[]; +char n_missing_values; +long_string_missing_value values[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Always set to 22. + +@item int32 size; +Always set to 1. + +@item int32 count; +The number of bytes following the header until the next header. + +@item int32 var_name_len; +@itemx char var_name[]; +The number of bytes in the name of the long string variable that has +missing values, plus the variable name itself, which consists of +exactly @code{var_name_len} bytes. The variable name is not padded to +any particular boundary, nor is it null-terminated. + +@item char n_missing_values; +The number of missing values, either 1, 2, or 3. (This is, unusually, +a single byte instead of a 32-bit number.) + +@item long_string_missing_value values[]; +The missing values themselves. This array contains exactly +@code{n_missing_values} elements, each of which has the following +substructure: + +@example +int32 value_len; +char value[]; +@end example + +@table @code +@item int32 value_len; +The length of the missing value string, in bytes. This value should +be 8, because long string variables are at least 8 bytes wide (by +definition), only the first 8 bytes of a long string variable's +missing values are allowed to be non-spaces, and any spaces within the +first 8 bytes are included in the missing value here. + +@item char value[]; +The missing value string, exactly @code{value_len} bytes, without +any padding or null terminator. +@end table +@end table + @node Data File and Variable Attributes Records @section Data File and Variable Attributes Records @@ -1183,12 +1329,38 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123'). 
will contain a variable attribute record with the following contents: @example -00000000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| -00000010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| -00000020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| -00000030 0a 29 |.) | +0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| +0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| +0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| +0030 0a 29 |.) | @end example +@menu +* Variable Roles:: +@end menu + +@node Variable Roles +@subsection Variable Roles + +A variable's role is represented as an attribute named @code{$@@Role}. +This attribute has a single element whose values and their meanings +are: + +@table @code +@item 0 +Input. This, the default, is the most common role. +@item 1 +Output. +@item 2 +Both. +@item 3 +None. +@item 4 +Partition. +@item 5 +Split. +@end table + @node Extended Number of Cases Record @section Extended Number of Cases Record @@ -1292,22 +1464,23 @@ Ignored padding. Should be set to 0. @node Data Record @section Data Record -Data records must follow all other records in the system file. There must -be at least one data record in every system file. +The data record must follow all other records in the system file. +Every system file must have a data record that specifies data for at +least one case. The format of the data record varies depending on the +value of @code{compression} in the file header record: -The format of data records varies depending on whether the data is -compressed. Regardless, the data is arranged in a series of 8-byte -elements. - -When data is not compressed, -each element corresponds to +@table @asis +@item 0: no compression +Data is arranged as a series of 8-byte elements. +Each element corresponds to the variable declared in the respective variable record (@pxref{Variable Record}). 
Numeric values are given in @code{flt64} format; string values are literal characters string, padded on the right when necessary to fill out 8-byte units. -Compressed data is arranged in the following manner: the first 8 bytes -in the data section is divided into a series of 1-byte command +@item 1: bytecode compression +The first 8 bytes +of the data record is divided into a series of 1-byte command codes. These codes have meanings as described below: @table @asis @@ -1345,8 +1518,287 @@ An 8-byte string value that is all spaces. The system-missing value. @end table -When the end of the an 8-byte group of command bytes is reached, any -blocks of non-compressible values indicated by code 253 are skipped, -and the next element of command bytes is read and interpreted, until -the end of the file or a code with value 252 is reached. +The end of the 8-byte group of bytecodes is followed by any 8-byte +blocks of non-compressible values indicated by code 253. After that +follows another 8-byte group of bytecodes, then those bytecodes' +non-compressible values. The pattern repeats to the end of the file +or a code with value 252. + +@item 2: ZLIB compression +The data record consists of the following, in order: + +@itemize @bullet +@item +ZLIB data header, 24 bytes long. + +@item +One or more variable-length blocks of ZLIB compressed data. + +@item +ZLIB data trailer, with a 24-byte fixed header plus an additional 24 +bytes for each preceding ZLIB compressed data block. +@end itemize + +The ZLIB data header has the following format: + +@example +int64 zheader_ofs; +int64 ztrailer_ofs; +int64 ztrailer_len; +@end example + +@table @code +@item int64 zheader_ofs; +The offset, in bytes, of the beginning of this structure within the +system file. + +@item int64 ztrailer_ofs; +The offset, in bytes, of the first byte of the ZLIB data trailer. + +@item int64 ztrailer_len; +The number of bytes in the ZLIB data trailer. 
This and the previous +field sum to the size of the system file in bytes. +@end table + +The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB +compressed data blocks. Each ZLIB compressed data block begins with a +ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78 +01} (the only header yet observed in practice). Each block +decompresses to a fixed number of bytes (in practice only +@code{0x3ff000}-byte blocks have been observed), except that the last +block of data may be shorter. The last ZLIB compressed data block +ends just before offset @code{ztrailer_ofs}. + +The result of ZLIB decompression is bytecode compressed data as +described above for compression format 1. + +The ZLIB data trailer begins with the following 24-byte fixed header: + +@example +int64 int_bias; +int64 zero; +int32 block_size; +int32 n_blocks; +@end example + +@table @code +@item int64 int_bias; +The compression bias as a negative integer, e.g.@: if @code{bias} in +the file header record is 100.0, then @code{int_bias} is @minus{}100 +(this is the only value yet observed in practice). + +@item int64 zero; +Always observed to be zero. + +@item int32 block_size; +The number of bytes in each ZLIB compressed data block, except +possibly the last, following decompression. Only @code{0x3ff000} has +been observed so far. + +@item int32 n_blocks; +The number of ZLIB compressed data blocks, always exactly +@code{(ztrailer_len - 24) / 24}. +@end table + +The fixed header is followed by @code{n_blocks} 24-byte ZLIB data +block descriptors, each of which describes the compressed data block +corresponding to its offset. Each block descriptor has the following +format: + +@example +int64 uncompressed_ofs; +int64 compressed_ofs; +int32 uncompressed_size; +int32 compressed_size; +@end example + +@table @code +@item int64 uncompressed_ofs; +The offset, in bytes, that this block of data would have in a similar +system file that uses compression format 1. 
This is +@code{zheader_ofs} in the first block descriptor, and in each +succeeding block descriptor it is the sum of the previous descriptor's +@code{uncompressed_ofs} and @code{uncompressed_size}. + +@item int64 compressed_ofs; +The offset, in bytes, of the actual beginning of this compressed data +block. This is @code{zheader_ofs + 24} in the first block descriptor, +and in each succeeding block descriptor it is the sum of the previous +descriptor's @code{compressed_ofs} and @code{compressed_size}. The +final block descriptor's @code{compressed_ofs} and +@code{compressed_size} sum to @code{ztrailer_ofs}. + +@item int32 uncompressed_size; +The number of bytes in this data block, after decompression. This is +@code{block_size} in every data block except the last, which may be +smaller. + +@item int32 compressed_size; +The number of bytes in this data block, as stored compressed in this +system file. +@end table +@end table + @setfilename ignored + +@node Encrypted System Files +@section Encrypted System Files + +SPSS 21 and later support an encrypted system file format. + +@quotation Warning +The SPSS encrypted file format is poorly designed. It is much cheaper +and faster to decrypt a file encrypted this way than if a well +designed alternative were used. If you must use this format, use a +10-byte randomly generated password. +@end quotation + +@subheading Encrypted File Format + +Encrypted system files begin with the following 36-byte fixed header: + +@example +0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE| +0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............| +0020 00 00 00 00 |....| +@end example + +Following the fixed header is a complete system file in the usual +format, except that each 16-byte block is encrypted with AES-256 in +ECB mode. The AES-256 key is derived from a password in the following +way: + +@enumerate +@item +Start from the literal password typed by the user. 
Truncate it to at +most 10 bytes, then append as many null bytes as needed (at least 22) +until there are exactly 32 bytes. Call this @var{password}. + +@item +Let @var{constant} be the following 73-byte constant: + +@example +0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11 +0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80 +0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85 +0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1 +0040 e1 83 07 d8 0d 00 00 01 00 +@end example + +@item +Compute CMAC-AES-256(@var{password}, @var{constant}). Call the +16-byte result @var{cmac}. + +@item +The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is, +@var{cmac} repeated twice. +@end enumerate + +@subsubheading Example + +Consider the password @samp{pspp}. @var{password} is: + +@example +0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............| +0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| +@end example + +@noindent +@var{cmac} is: + +@example +0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +@end example + +@noindent +The AES-256 key is: + +@example +0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +@end example + +@subheading Password Encoding + +SPSS also supports what it calls ``encrypted passwords.'' These are +not encrypted. They are encoded with a simple, fixed scheme. An +encoded password is always a multiple of 2 characters long, and never +longer than 20 characters. The characters in an encoded password are +always in the graphic ASCII range 33 through 126. Each successive +pair of characters in the password encodes a single byte in the +plaintext password. + +Use the following algorithm to decode a pair of characters: + +@enumerate +@item +Let @var{a} be the ASCII code of the first character, and @var{b} be +the ASCII code of the second character. + +@item +Let @var{ah} be the most significant 4 bits of @var{a}. 
Find the line +in the table below that has @var{ah} on the left side. The right side +of the line is a set of possible values for the most significant 4 +bits of the decoded byte. + +@display +@t{2 } @result{} @t{2367} +@t{3 } @result{} @t{0145} +@t{47} @result{} @t{89cd} +@t{56} @result{} @t{abef} +@end display + +@item +Let @var{bh} be the most significant 4 bits of @var{b}. Find the line +in the second table below that has @var{bh} on the left side. The +right side of the line is a set of possible values for the most +significant 4 bits of the decoded byte. Together with the results of +the previous step, only a single possibility is left. + +@display +@t{2 } @result{} @t{139b} +@t{3 } @result{} @t{028a} +@t{47} @result{} @t{46ce} +@t{56} @result{} @t{57df} +@end display + +@item +Let @var{al} be the least significant 4 bits of @var{a}. Find the +line in the table below that has @var{al} on the left side. The right +side of the line is a set of possible values for the least significant +4 bits of the decoded byte. + +@display +@t{03cf} @result{} @t{0145} +@t{12de} @result{} @t{2367} +@t{478b} @result{} @t{89cd} +@t{569a} @result{} @t{abef} +@end display + +@item +Let @var{bl} be the least significant 4 bits of @var{b}. Find the +line in the table below that has @var{bl} on the left side. The right +side of the line is a set of possible values for the least significant +4 bits of the decoded byte. Together with the results of the previous +step, only a single possibility is left. + +@display +@t{03cf} @result{} @t{028a} +@t{12de} @result{} @t{139b} +@t{478b} @result{} @t{46ce} +@t{569a} @result{} @t{57df} +@end display +@end enumerate + +@subsubheading Example + +Consider the encoded character pair @samp{-|}. @var{a} is +0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is +0xd, and @var{bl} is 0xc. @var{ah} means that the most significant +four bits of the decoded character are 2, 3, 6, or 7, and @var{bh} +means that they are 4, 6, 0xc, or 0xe. 
The single possibility in +common is 6, so the most significant four bits are 6. Similarly, +@var{al} means that the least significant four bits are 2, 3, 6, or 7, +and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant +four bits are 2. The decoded character is therefore 0x62, the letter +@samp{b}.
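
The decoding steps above can be sketched in a few lines of Python. This is only an illustration, not code taken from PSPP or SPSS: the names @code{decode_pair}, @code{decode_password}, and the four table constants are hypothetical, and the tables are transcribed directly from the four displays in the algorithm. Each nibble of the decoded byte is recovered by intersecting the candidate set selected by @var{a} with the one selected by @var{b}.

```python
def _table(rows):
    """Expand {"47": "89cd"} into {4: {8, 9, 12, 13}, 7: {8, 9, 12, 13}}."""
    return {int(key, 16): {int(c, 16) for c in values}
            for keys, values in rows.items() for key in keys}

# Hypothetical names; contents copied from the four tables above.
HIGH_A = _table({"2": "2367", "3": "0145", "47": "89cd", "56": "abef"})
HIGH_B = _table({"2": "139b", "3": "028a", "47": "46ce", "56": "57df"})
LOW_A = _table({"03cf": "0145", "12de": "2367", "478b": "89cd", "569a": "abef"})
LOW_B = _table({"03cf": "028a", "12de": "139b", "478b": "46ce", "569a": "57df"})

def decode_pair(a, b):
    """Decode one byte from the ASCII codes of a pair of encoded characters."""
    # Each intersection contains exactly one value; the tuple unpacking
    # would raise ValueError if that ever failed to hold.
    (high,) = HIGH_A[a >> 4] & HIGH_B[b >> 4]
    (low,) = LOW_A[a & 0xF] & LOW_B[b & 0xF]
    return (high << 4) | low

def decode_password(encoded):
    """Decode a whole encoded password (an ASCII string) into bytes."""
    data = encoded.encode("ascii")
    if len(data) % 2 != 0 or len(data) > 20:
        raise ValueError("encoded password must be an even length of at most 20")
    return bytes(decode_pair(data[i], data[i + 1])
                 for i in range(0, len(data), 2))

# The worked example from the text: the pair "-|" decodes to the letter "b".
print(decode_password("-|"))
```

Characters outside the graphic ASCII range 33 through 126 have a high nibble that appears in neither table, so in this sketch they surface as a @code{KeyError}; a real reader would probably want to report that more gracefully.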