X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=ca7c668287634dde3cc9fe5df2b909a593323860;hb=6417b81fb2d5665471ba9fadd180ed1ddeb29246;hp=19a6c057479d0294cfee1c79dc269373abd53fdf;hpb=274f8e25103e776ce990bb84be7f149efd6f384b;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index 19a6c05747..ca7c668287 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -1,3 +1,13 @@ +@c PSPP - a program for statistical analysis. +@c Copyright (C) 2019 Free Software Foundation, Inc. +@c Permission is granted to copy, distribute and/or modify this document +@c under the terms of the GNU Free Documentation License, Version 1.3 +@c or any later version published by the Free Software Foundation; +@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. +@c A copy of the license is included in the section entitled "GNU +@c Free Documentation License". +@c + @node System File Format @appendix System File Format @@ -6,7 +16,7 @@ that describes how they may be interpreted. This chapter describes the format of a system file. System files use four data types: 8-bit characters, 32-bit integers, -64-bit integers, +64-bit integers, and 64-bit floating points, called here @code{char}, @code{int32}, @code{int64}, and @code{flt64}, respectively. Data is not necessarily aligned on a word @@ -89,7 +99,6 @@ possible to artificially synthesize files that use different encodings * Other Informational Records:: * Dictionary Termination Record:: * Data Record:: -* Encrypted System Files:: @end menu @node System File Record Structure @@ -162,6 +171,11 @@ Document record, if present. Extension (type 7) records, in ascending numerical order of their subtypes. +System files written by SPSS include at most one of each kind of +extension record. 
This is generally true of system files written by +other software as well, with known exceptions noted below in the +individual sections about each type of record. + @item Dictionary termination record. @@ -218,6 +232,12 @@ pspp 0.1.4 - sparc-sun-solaris2.5.2}. The string is truncated if it would be longer than 60 characters; otherwise it is padded on the right with spaces. +The product name field allows readers to behave differently based on +quirks in the way that particular software writes system files. +@xref{Value Labels Records}, for the detail of the quirk that the PSPP +system file reader tolerates in files written by ReadStat, which has +@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}. + @anchor{layout_code} @item int32 layout_code; Normally set to 2, although a few system files have been spotted in @@ -228,8 +248,8 @@ file's integer endianness (@pxref{System File Format}). Number of data elements per case. This is the number of variables, except that long string variables add extra data elements (one for every 8 characters after the first 8). However, string variables do not -contribute to this value beyond the first 255 bytes. Further, system -files written by some systems set this value to -1. In general, it is +contribute to this value beyond the first 255 bytes. Further, some +software always writes -1 or 0 in this field. In general, it is unsafe for systems reading system files to rely upon this value. @item int32 compression; @@ -306,10 +326,10 @@ so readers should take care to parse dummy variable records in the same way as other variable records. @anchor{Dictionary Index} -The @dfn{dictionary index} of a variable is its offset in the set of +The @dfn{dictionary index} of a variable is a 1-based offset in the set of variable records, including dummy variable records for long string -variables. The first variable record has a dictionary index of 0, the -second has a dictionary index of 1, and so on. +variables.
The first variable record has a dictionary index of 1, the +second has a dictionary index of 2, and so on. The system file format does not directly support string variables wider than 255 bytes. Such very long string variables are represented @@ -504,6 +524,10 @@ Format types are defined as follows: @tab @code{EDATE} @item 39 @tab @code{SDATE} +@item 40 +@tab @code{MTIME} +@item 41 +@tab @code{YMDHMS} @end multitable @end quotation @@ -520,13 +544,21 @@ numeric and short string variables only. Long string variables may have value labels, but their value labels are recorded using a different record type (@pxref{Long String Value Labels Record}). +ReadStat (@pxref{File Header Record}) writes value labels that label a +single value more than once. In more detail, it emits value labels +whose values are longer than string variables' widths, that are +identical in the actual width of the variable, e.g.@: labels for +values @code{ABC123} and @code{ABC456} for a string variable with +width 3. For files written by this software, PSPP ignores such +labels. + The value label record has the following format: @example int32 rec_type; int32 label_count; -/* @r{Repeated @code{label_cnt} times}. */ +/* @r{Repeated @code{label_count} times}. */ char value[8]; char label_len; char label[]; @@ -578,7 +610,7 @@ Number of variables that the associated value labels from the value label record are to be applied. @item int32 vars[]; -A list of dictionary indexes of variables to which to apply the value +A list of 1-based dictionary indexes of variables to which to apply the value labels (@pxref{Dictionary Index}). There are @code{var_count} elements. @@ -601,7 +633,8 @@ char lines[][80]; Record type. Always set to 6. @item int32 n_lines; -Number of lines of documents present. +Number of lines of documents present. This should be greater than +zero, but ReadStat writes system files with zero @code{n_lines}. @item char lines[][80]; Document lines.
The number of elements is defined by @code{n_lines}. @@ -1361,7 +1394,7 @@ The total number of bytes in @code{attributes}. @item char attributes[]; The attributes, in a text-based format. -In record type 17, this field contains a single attribute set. An +In record subtype 17, this field contains a single attribute set. An attribute set is a sequence of one or more attributes concatenated together. Each attribute consists of a name, which has the same syntax as a variable name, followed by, inside parentheses, a sequence @@ -1373,13 +1406,17 @@ way to embed a line feed in a value. There is no distinction between an attribute with a single value and an attribute array with one element. -In record type 18, this field contains a sequence of one or more +In record subtype 18, this field contains a sequence of one or more variable attribute sets. If more than one variable attribute set is present, each one after the first is delimited from the previous by @code{/}. Each variable attribute set consists of a long variable name, followed by @code{:}, followed by an attribute set with the same -syntax as on record type 17. +syntax as on record subtype 17. + +System files written by @code{Stata 14.1/-savespss- 1.77 by +S.Radyakin} may include multiple records with subtype 18, one per +variable that has variable attributes. The total length is @code{count} bytes. @end table @@ -1481,7 +1518,8 @@ the following believed meanings: @table @asis @item 5 -A set of grouped variables (according to Aapi H@"am@"al@"ainen). +A named variable set for use in the GUI (according to Aapi +H@"am@"al@"ainen). @item 6 Date info, probably related to USE (according to Aapi H@"am@"al@"ainen). @@ -1549,9 +1587,10 @@ value @var{code} - @var{bias}, where variable @code{bias} from the file header. For example, code 105 with bias 100.0 (the normal value) indicates a numeric variable of value 5. 
-One file has been seen written by SPSS 14 that contained such a code -in a @emph{string} field with the value 0 (after the bias is -subtracted) as a way of encoding null bytes. + +A code of 0 (after subtracting the bias) in a string field encodes +null bytes. This is unusual, since a string field normally encodes +text data, but it exists in real system files. @item 252 End of file. This code may or may not appear at the end of the data @@ -1613,7 +1652,7 @@ The number of bytes in the ZLIB data trailer. This and the previous field sum to the size of the system file in bytes. @end table -The data header is followed by @code{(ztrailer_ofs - 24) / 24} ZLIB +The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB compressed data blocks. Each ZLIB compressed data block begins with a ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78 01} (the only header yet observed in practice). Each block @@ -1650,7 +1689,7 @@ been observed so far. @item int32 n_blocks; The number of ZLIB compressed data blocks, always exactly -@code{(ztrailer_ofs - 24) / 24}. +@code{(ztrailer_len - 24) / 24}. @end table The fixed header is followed by @code{n_blocks} 24-byte ZLIB data @@ -1693,165 +1732,3 @@ system file. @end table @setfilename ignored - -@node Encrypted System Files -@section Encrypted System Files - -SPSS 21 and later support an encrypted system file format. - -@quotation Warning -The SPSS encrypted file format is poorly designed. It is much cheaper -and faster to decrypt a file encrypted this way than if a well -designed alternative were used. If you must use this format, use a -10-byte randomly generated password. 
-@end quotation - -@subheading Encrypted File Format - -Encrypted system files begin with the following 36-byte fixed header: - -@example -0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE| -0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............| -0020 00 00 00 00 |....| -@end example - -Following the fixed header is a complete system file in the usual -format, except that each 16-byte block is encrypted with AES-256 in -ECB mode. The AES-256 key is derived from a password in the following -way: - -@enumerate -@item -Start from the literal password typed by the user. Truncate it to at -most 10 bytes, then append (between 1 and 22) null bytes until there -are exactly 32 bytes. Call this @var{password}. - -@item -Let @var{constant} be the following 73-byte constant: - -@example -0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11 -0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80 -0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85 -0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1 -0040 e1 83 07 d8 0d 00 00 01 00 -@end example - -@item -Compute CMAC-AES-256(@var{password}, @var{constant}). Call the -16-byte result @var{cmac}. - -@item -The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is, -@var{cmac} repeated twice. -@end enumerate - -@subsubheading Example - -Consider the password @samp{pspp}. @var{password} is: - -@example -0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............| -0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -@end example - -@noindent -@var{cmac} is: - -@example -0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -@end example - -@noindent -The AES-256 key is: - -@example -0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -@end example - -@subheading Password Encoding - -SPSS also supports what it calls ``encrypted passwords.'' These are -not encrypted. 
They are encoded with a simple, fixed scheme. An -encoded password is always a multiple of 2 characters long, and never -longer than 20 characters. The characters in an encoded password are -always in the graphic ASCII range 33 through 126. Each successive -pair of characters in the password encodes a single byte in the -plaintext password. - -Use the following algorithm to decode a pair of characters: - -@enumerate -@item -Let @var{a} be the ASCII code of the first character, and @var{b} be -the ASCII code of the second character. - -@item -Let @var{ah} be the most significant 4 bits of @var{a}. Find the line -in the table below that has @var{ah} on the left side. The right side -of the line is a set of possible values for the most significant 4 -bits of the decoded byte. - -@display -@t{2 } @result{} @t{2367} -@t{3 } @result{} @t{0145} -@t{47} @result{} @t{89cd} -@t{56} @result{} @t{abef} -@end display - -@item -Let @var{bh} be the most significant 4 bits of @var{b}. Find the line -in the second table below that has @var{bh} on the left side. The -right side of the line is a set of possible values for the most -significant 4 bits of the decoded byte. Together with the results of -the previous step, only a single possibility is left. - -@display -@t{2 } @result{} @t{139b} -@t{3 } @result{} @t{028a} -@t{47} @result{} @t{46ce} -@t{56} @result{} @t{57df} -@end display - -@item -Let @var{al} be the least significant 4 bits of @var{a}. Find the -line in the table below that has @var{al} on the left side. The right -side of the line is a set of possible values for the least significant -4 bits of the decoded byte. - -@display -@t{03cf} @result{} @t{0145} -@t{12de} @result{} @t{2367} -@t{478b} @result{} @t{89cd} -@t{569a} @result{} @t{abef} -@end display - -@item -Let @var{bl} be the least significant 4 bits of @var{b}. Find the -line in the table below that has @var{bl} on the left side. 
The right -side of the line is a set of possible values for the least significant -4 bits of the decoded byte. Together with the results of the previous -step, only a single possibility is left. - -@display -@t{03cf} @result{} @t{028a} -@t{12de} @result{} @t{139b} -@t{478b} @result{} @t{46ce} -@t{569a} @result{} @t{57df} -@end display -@end enumerate - -@subsubheading Example - -Consider the encoded character pair @samp{-|}. @var{a} is -0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is -0xd, and @var{bl} is 0xc. @var{ah} means that the most significant -four bits of the decoded character is 2, 3, 6, or 7, and @var{bh} -means that they are 4, 6, 0xc, or 0xe. The single possibility in -common is 6, so the most significant four bits are 6. Similarly, -@var{al} means that the least significant four bits are 2, 3, 6, or 7, -and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant -four bits are 2. The decoded character is therefore 0x62, the letter -@samp{b}.
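
Note on the removed ``Password Encoding'' text: the four-table decoding procedure described there can be sketched in a few lines of Python. This is a minimal illustration, not code from PSPP. The names @code{HIGH_A}, @code{HIGH_B}, @code{LOW_A}, @code{LOW_B}, @code{decode_pair}, and @code{decode_password} are hypothetical, and the tables are transcribed directly from the four @code{@@display} blocks above. The length and character-range checks reflect the constraints stated in the prose (an even number of characters, at most 20, all in ASCII 33 through 126).

```python
# Each table maps the hex digits on the left side of a line to the
# candidate nibbles on the right side, exactly as listed in the text.
HIGH_A = {"2": "2367", "3": "0145", "47": "89cd", "56": "abef"}
HIGH_B = {"2": "139b", "3": "028a", "47": "46ce", "56": "57df"}
LOW_A = {"03cf": "0145", "12de": "2367", "478b": "89cd", "569a": "abef"}
LOW_B = {"03cf": "028a", "12de": "139b", "478b": "46ce", "569a": "57df"}


def candidates(table, nibble):
    """Return the set of candidate nibbles for the line that lists `nibble`."""
    digit = format(nibble, "x")
    for left, right in table.items():
        if digit in left:
            return {int(ch, 16) for ch in right}
    raise ValueError(f"nibble {digit!r} does not appear in the table")


def decode_nibble(table_a, nibble_a, table_b, nibble_b):
    """Intersect the two candidate sets; exactly one value should remain."""
    common = candidates(table_a, nibble_a) & candidates(table_b, nibble_b)
    if len(common) != 1:
        raise ValueError("candidate sets do not share exactly one value")
    return common.pop()


def decode_pair(a, b):
    """Decode one plaintext byte from the ASCII codes of a character pair."""
    high = decode_nibble(HIGH_A, a >> 4, HIGH_B, b >> 4)
    low = decode_nibble(LOW_A, a & 0xF, LOW_B, b & 0xF)
    return (high << 4) | low


def decode_password(encoded):
    """Decode an encoded password string into plaintext bytes."""
    data = encoded.encode("ascii")
    if len(data) % 2 != 0 or len(data) > 20:
        raise ValueError("encoded password must be an even length of at most 20")
    if any(not 33 <= code <= 126 for code in data):
        raise ValueError("encoded password must use graphic ASCII only")
    return bytes(decode_pair(data[i], data[i + 1]) for i in range(0, len(data), 2))


if __name__ == "__main__":
    # The worked example from the text: the pair "-|" decodes to the letter b.
    print(decode_password("-|"))
```

Storing each table as a mapping from a string of hex digits to a string of hex digits keeps the code visually identical to the tables in the text, which makes transcription errors easier to spot than with a precomputed 256-entry lookup.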