X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=4cb142593d284046cf497ac9de82d0e451d5a1d3;hb=fe7682b3c3d36cf9ba3e867588e5b808af833262;hp=89c35aab3f16dbe064f14dc94c3ced83ce285f60;hpb=e2da62d735c597afeef2e0e9b36e5a4a83d7da94;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index 89c35aab3f..4cb142593d 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -30,7 +30,7 @@ files and translates as necessary. PSPP also detects the floating-point format in use, as well as the endianness of IEEE 754 floating-point numbers, and translates as needed. However, only IEEE 754 numbers with the same endianness as integer data in the same file -has actually been observed in system files, and it is likely that +have actually been observed in system files, and it is likely that other formats are obsolete or were never used. System files use a few floating point values for special purposes: @@ -68,10 +68,81 @@ used for the dictionary and the data in the file, although it is possible to artificially synthesize files that use different encodings (@pxref{Character Encoding Record}). -System files are divided into records, each of which begins with a -4-byte record type, usually regarded as an @code{int32}. 
+@menu +* System File Record Structure:: +* File Header Record:: +* Variable Record:: +* Value Labels Records:: +* Document Record:: +* Machine Integer Info Record:: +* Machine Floating-Point Info Record:: +* Multiple Response Sets Records:: +* Extra Product Info Record:: +* Variable Display Parameter Record:: +* Long Variable Names Record:: +* Very Long String Record:: +* Character Encoding Record:: +* Long String Value Labels Record:: +* Long String Missing Values Record:: +* Data File and Variable Attributes Records:: +* Extended Number of Cases Record:: +* Other Informational Records:: +* Dictionary Termination Record:: +* Data Record:: +* Encrypted System Files:: +@end menu + +@node System File Record Structure +@section System File Record Structure + +System files are divided into records with the following format: + +@example +int32 type; +char data[]; +@end example + +This header does not identify the length of the @code{data} or any +information about what it contains, so the system file reader must +understand the format of @code{data} based on @code{type}. However, +records with type 7, called @dfn{extension records}, have a stricter +format: + +@example +int32 type; +int32 subtype; +int32 size; +int32 count; +char data[size * count]; +@end example -The records must appear in the following order: +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. This value identifies a particular kind of extension +record. + +@item int32 size; +The size of each piece of data that follows the header, in bytes. +Known extension records use 1, 4, or 8, for @code{char}, @code{int32}, +and @code{flt64} format data, respectively. + +@item int32 count; +The number of pieces of data that follow the header. + +@item char data[size * count]; +Data, whose format and interpretation depend on the subtype. 
+@end table + +An extension record contains exactly @code{size * count} bytes of +data, which allows a reader that does not understand an extension +record to skip it. Extension records provide only nonessential +information, so this allows for files written by newer software to +preserve backward compatibility with older or less capable readers. + +Records in a system file must appear in the following order: @itemize @bullet @item @@ -98,35 +169,19 @@ Dictionary termination record. Data record. @end itemize -Each type of record is described separately below. +We advise authors of programs that read system files to tolerate +format variations. Various kinds of misformatting and corruption have +been observed in system files written by SPSS and other software +alike. In particular, because extension records provide nonessential +information, it is generally better to ignore an extension record +entirely than to refuse to read a system file. -@menu -* File Header Record:: -* Variable Record:: -* Value Labels Records:: -* Document Record:: -* Machine Integer Info Record:: -* Machine Floating-Point Info Record:: -* Multiple Response Sets Records:: -* Extra Product Info Record:: -* Variable Display Parameter Record:: -* Long Variable Names Record:: -* Very Long String Record:: -* Character Encoding Record:: -* Long String Value Labels Record:: -* Long String Missing Values Record:: -* Data File and Variable Attributes Records:: -* Extended Number of Cases Record:: -* Miscellaneous Informational Records:: -* Dictionary Termination Record:: -* Data Record:: -@end menu +The following sections describe the known kinds of records. @node File Header Record @section File Header Record -The file header is always the first record in the file. It has the -following format: +A system file begins with the file header, with the following format: @example char rec_type[4]; @@ -177,7 +232,7 @@ contribute to this value beyond the first 255 bytes. 
Further, system files written by some systems set this value to -1. In general, it is unsafe for systems reading system files to rely upon this value. -@item int32 compressed; +@item int32 compression; Set to 0 if the data in the file is not compressed, 1 if the data is compressed with simple bytecode compression, 2 if the data is ZLIB compressed. This field has value 2 if and only if @code{rec_type} is @@ -261,6 +316,10 @@ wider than 255 bytes. Such very long string variables are represented by a number of narrower string variables. @xref{Very Long String Record}, for details. +A system file should contain at least one variable and thus at least +one variable record, but system files have been observed in the wild +without any variables (thus, no data either). + @example int32 rec_type; int32 type; @@ -315,12 +374,19 @@ the at-sign (@samp{@@}). Subsequent characters may also be digits, octothorpes (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full stops (@samp{.}). The variable name is padded on the right with spaces. +The @samp{name} fields should be unique within a system file. System +files written by SPSS that contain very long string variables with +similar names sometimes contain duplicate names that are later +eliminated by resolving the very long string names (@pxref{Very Long +String Record}). PSPP handles duplicates by assigning them new, +unique names. + @item int32 label_len; This field is present only if @code{has_var_label} is set to 1. It is set to the length, in characters, of the variable label. The documented maximum length varies from 120 to 255 based on SPSS version, but some files have been seen with longer labels. PSPP -accepts longer labels and truncates them to 255 bytes on input. +accepts labels of any length. @item char label[]; This field is present only if @code{has_var_label} is set to 1. It has @@ -344,6 +410,7 @@ in the range. 
When a range plus a value are present, the third element denotes the additional discrete missing value. @end table +@anchor{System File Output Formats} The @code{print} and @code{write} members of sysfile_variable are output formats coded into @code{int32} types. The least-significant byte of the @code{int32} represents the number of decimal places, and the @@ -727,8 +794,8 @@ The size of each element in the @code{mrsets} member. Always set to 1. The total number of bytes in @code{mrsets}. @item char mrsets[]; -A series of multiple response sets, each of which consists of the -following: +Zero or more line feeds (byte 0x0a), followed by a series of multiple +response sets, each of which consists of the following: @itemize @bullet @item @@ -777,8 +844,15 @@ A space. The short names of the variables in the set, converted to lowercase, each separated from the previous by a single space. +Even though a multiple response set must have at least two variables, +some system files contain multiple response sets with no variables at +all. The source and meaning of these multiple response sets is +unknown. (Perhaps they arise from creating a multiple response set +then deleting all the variables that it contains?) + @item -A line feed (byte 0x0a). +One line feed (byte 0x0a). Sometimes multiple, even hundreds, of line +feeds are present. @end itemize @end table @@ -1321,10 +1395,10 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123'). will contain a variable attribute record with the following contents: @example -00000000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| -00000010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| -00000020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| -00000030 0a 29 |.) 
| +0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| +0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| +0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| +0030 0a 29 |.) | @end example @menu @@ -1392,46 +1466,25 @@ same reason as @code{ncases} in the file header record, but this has not been observed in the wild. @end table -@node Miscellaneous Informational Records -@section Miscellaneous Informational Records +@node Other Informational Records +@section Other Informational Records -Some specific types of miscellaneous informational records are +Many specific types of extension records are documented here, but others are known to exist. PSPP ignores unknown -miscellaneous informational records when reading system files. - -@example -/* @r{Header.} */ -int32 rec_type; -int32 subtype; -int32 size; -int32 count; - -/* @r{Exactly @code{size * count} bytes of data.} */ -char data[]; -@end example +extension records when reading system files. -@table @code -@item int32 rec_type; -Record type. Always set to 7. +The following extension record subtypes have also been observed, with +the following believed meanings: -@item int32 subtype; -Record subtype. May take any value. According to Aapi -H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6 -indicates date info (probably related to USE). Subtype 24 appears to -contain XML that describes how data in the file should be displayed -on-screen. - -@item int32 size; -Size of each piece of data in the data part. Should have the value 1, -4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data, -respectively. +@table @asis +@item 5 +A set of grouped variables (according to Aapi H@"am@"al@"ainen). -@item int32 count; -Number of pieces of data in the data part. +@item 6 +Date info, probably related to USE (according to Aapi H@"am@"al@"ainen). -@item char data[]; -Arbitrary data. 
There must be @code{size} times @code{count} bytes of -data. +@item 24 +XML that describes how data in the file should be displayed on-screen. @end table @node Dictionary Termination Record @@ -1632,3 +1685,165 @@ system file. @end table @setfilename ignored + +@node Encrypted System Files +@section Encrypted System Files + +SPSS 21 and later support an encrypted system file format. + +@quotation Warning +The SPSS encrypted file format is poorly designed. It is much cheaper +and faster to decrypt a file encrypted this way than if a well +designed alternative were used. If you must use this format, use a +10-byte randomly generated password. +@end quotation + +@subheading Encrypted File Format + +Encrypted system files begin with the following 36-byte fixed header: + +@example +0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE| +0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............| +0020 00 00 00 00 |....| +@end example + +Following the fixed header is a complete system file in the usual +format, except that each 16-byte block is encrypted with AES-256 in +ECB mode. The AES-256 key is derived from a password in the following +way: + +@enumerate +@item +Start from the literal password typed by the user. Truncate it to at +most 10 bytes, then append (between 22 and 32) null bytes until there +are exactly 32 bytes. Call this @var{password}. + +@item +Let @var{constant} be the following 73-byte constant: + +@example +0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11 +0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80 +0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85 +0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1 +0040 e1 83 07 d8 0d 00 00 01 00 +@end example + +@item +Compute CMAC-AES-256(@var{password}, @var{constant}). Call the +16-byte result @var{cmac}. + +@item +The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is, +@var{cmac} repeated twice. 
+@end enumerate + +@subsubheading Example + +Consider the password @samp{pspp}. @var{password} is: + +@example +0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............| +0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| +@end example + +@noindent +@var{cmac} is: + +@example +0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +@end example + +@noindent +The AES-256 key is: + +@example +0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 +@end example + +@subheading Password Encoding + +SPSS also supports what it calls ``encrypted passwords.'' These are +not encrypted. They are encoded with a simple, fixed scheme. An +encoded password is always a multiple of 2 characters long, and never +longer than 20 characters. The characters in an encoded password are +always in the graphic ASCII range 33 through 126. Each successive +pair of characters in the password encodes a single byte in the +plaintext password. + +Use the following algorithm to decode a pair of characters: + +@enumerate +@item +Let @var{a} be the ASCII code of the first character, and @var{b} be +the ASCII code of the second character. + +@item +Let @var{ah} be the most significant 4 bits of @var{a}. Find the line +in the table below that has @var{ah} on the left side. The right side +of the line is a set of possible values for the most significant 4 +bits of the decoded byte. + +@display +@t{2 } @result{} @t{2367} +@t{3 } @result{} @t{0145} +@t{47} @result{} @t{89cd} +@t{56} @result{} @t{abef} +@end display + +@item +Let @var{bh} be the most significant 4 bits of @var{b}. Find the line +in the second table below that has @var{bh} on the left side. The +right side of the line is a set of possible values for the most +significant 4 bits of the decoded byte. Together with the results of +the previous step, only a single possibility is left. 
+@display +@t{2 } @result{} @t{139b} +@t{3 } @result{} @t{028a} +@t{47} @result{} @t{46ce} +@t{56} @result{} @t{57df} +@end display + +@item +Let @var{al} be the least significant 4 bits of @var{a}. Find the +line in the table below that has @var{al} on the left side. The right +side of the line is a set of possible values for the least significant +4 bits of the decoded byte. + +@display +@t{03cf} @result{} @t{0145} +@t{12de} @result{} @t{2367} +@t{478b} @result{} @t{89cd} +@t{569a} @result{} @t{abef} +@end display + +@item +Let @var{bl} be the least significant 4 bits of @var{b}. Find the +line in the table below that has @var{bl} on the left side. The right +side of the line is a set of possible values for the least significant +4 bits of the decoded byte. Together with the results of the previous +step, only a single possibility is left. + +@display +@t{03cf} @result{} @t{028a} +@t{12de} @result{} @t{139b} +@t{478b} @result{} @t{46ce} +@t{569a} @result{} @t{57df} +@end display +@end enumerate + +@subsubheading Example + +Consider the encoded character pair @samp{-|}. @var{a} is +0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is +0xd, and @var{bl} is 0xc. @var{ah} means that the most significant +four bits of the decoded character are 2, 3, 6, or 7, and @var{bh} +means that they are 4, 6, 0xc, or 0xe. The single possibility in +common is 6, so the most significant four bits are 6. Similarly, +@var{al} means that the least significant four bits are 2, 3, 6, or 7, +and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant +four bits are 2. The decoded character is therefore 0x62, the letter +@samp{b}.
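The pair-decoding steps of the Password Encoding section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not PSPP's implementation: the names `decode_pair` and `decode_password` are hypothetical, and the four lookup tables are transcribed directly from the tables in the patch text. Each nibble of the decoded byte is found by intersecting the two candidate sets, which should leave exactly one value.

```python
def _expand(table):
    """Map each hex digit on a table's left side to the set of candidate
    nibbles on its right side."""
    return {int(key, 16): {int(digit, 16) for digit in right}
            for left, right in table
            for key in left}


# Tables transcribed from the document (left side => right side).
AH = _expand([("2", "2367"), ("3", "0145"), ("47", "89cd"), ("56", "abef")])
BH = _expand([("2", "139b"), ("3", "028a"), ("47", "46ce"), ("56", "57df")])
AL = _expand([("03cf", "0145"), ("12de", "2367"),
              ("478b", "89cd"), ("569a", "abef")])
BL = _expand([("03cf", "028a"), ("12de", "139b"),
              ("478b", "46ce"), ("569a", "57df")])


def _only(candidates):
    """Return the single remaining candidate, or fail if there is not
    exactly one."""
    if len(candidates) != 1:
        raise ValueError("ambiguous or invalid encoded pair")
    return next(iter(candidates))


def decode_pair(pair):
    """Decode two encoded characters into one plaintext byte value."""
    if len(pair) != 2:
        raise ValueError("an encoded pair must be exactly 2 characters")
    a, b = ord(pair[0]), ord(pair[1])
    try:
        high = _only(AH[a >> 4] & BH[b >> 4])    # most significant 4 bits
        low = _only(AL[a & 0xF] & BL[b & 0xF])   # least significant 4 bits
    except KeyError:
        raise ValueError("character outside the encoded range") from None
    return (high << 4) | low


def decode_password(encoded):
    """Decode a whole encoded password (hypothetical helper).

    The length checks follow the limits stated in the text: an even
    number of characters, at most 20.
    """
    if len(encoded) % 2 != 0 or len(encoded) > 20:
        raise ValueError("encoded password must be an even length <= 20")
    return bytes(decode_pair(encoded[i:i + 2])
                 for i in range(0, len(encoded), 2))
```

For the worked example above, `decode_pair("-|")` yields 0x62, the letter `b`. Representing each table row as a set keeps the code close to the prose description; a precomputed 256-entry lookup per character pair would be an equally valid design.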