X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=fb131f5274f0d61e90f9eefdf0c994c413b7e468;hb=8e27b1a0dba7f33b7acb0d8894efe2045b0bb98f;hp=0f0940b58f88a2bfe9d13c0678dc52343be14982;hpb=24c5f7c629e68801492d7ca1766953a2a954a820;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index 0f0940b58f..fb131f5274 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -30,7 +30,7 @@ files and translates as necessary. PSPP also detects the floating-point format in use, as well as the endianness of IEEE 754 floating-point numbers, and translates as needed. However, only IEEE 754 numbers with the same endianness as integer data in the same file -has actually been observed in system files, and it is likely that +have actually been observed in system files, and it is likely that other formats are obsolete or were never used. System files use a few floating point values for special purposes: @@ -68,10 +68,80 @@ used for the dictionary and the data in the file, although it is possible to artificially synthesize files that use different encodings (@pxref{Character Encoding Record}). -System files are divided into records, each of which begins with a -4-byte record type, usually regarded as an @code{int32}. 
+@menu +* System File Record Structure:: +* File Header Record:: +* Variable Record:: +* Value Labels Records:: +* Document Record:: +* Machine Integer Info Record:: +* Machine Floating-Point Info Record:: +* Multiple Response Sets Records:: +* Extra Product Info Record:: +* Variable Display Parameter Record:: +* Long Variable Names Record:: +* Very Long String Record:: +* Character Encoding Record:: +* Long String Value Labels Record:: +* Long String Missing Values Record:: +* Data File and Variable Attributes Records:: +* Extended Number of Cases Record:: +* Other Informational Records:: +* Dictionary Termination Record:: +* Data Record:: +@end menu + +@node System File Record Structure +@section System File Record Structure + +System files are divided into records with the following format: + +@example +int32 type; +char data[]; +@end example + +This header does not identify the length of the @code{data} or any +information about what it contains, so the system file reader must +understand the format of @code{data} based on @code{type}. However, +records with type 7, called @dfn{extension records}, have a stricter +format: -The records must appear in the following order: +@example +int32 type; +int32 subtype; +int32 size; +int32 count; +char data[size * count]; +@end example + +@table @code +@item int32 type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. This value identifies a particular kind of extension +record. + +@item int32 size; +The size of each piece of data that follows the header, in bytes. +Known extension records use 1, 4, or 8, for @code{char}, @code{int32}, +and @code{flt64} format data, respectively. + +@item int32 count; +The number of pieces of data that follow the header. + +@item char data[size * count]; +Data, whose format and interpretation depend on the subtype.
+@end table + +An extension record contains exactly @code{size * count} bytes of +data, which allows a reader that does not understand an extension +record to skip it. Extension records provide only nonessential +information, so this allows for files written by newer software to +preserve backward compatibility with older or less capable readers. + +Records in a system file must appear in the following order: @itemize @bullet @item @@ -91,6 +161,11 @@ Document record, if present. Extension (type 7) records, in ascending numerical order of their subtypes. +System files written by SPSS include at most one of each kind of +extension record. This is generally true of system files written by +other software as well, with known exceptions noted below in the +individual sections about each type of record. + @item Dictionary termination record. @@ -98,36 +173,19 @@ Dictionary termination record. Data record. @end itemize -Each type of record is described separately below. +We advise authors of programs that read system files to tolerate +format variations. Various kinds of misformatting and corruption have +been observed in system files written by SPSS and other software +alike. In particular, because extension records provide nonessential +information, it is generally better to ignore an extension record +entirely than to refuse to read a system file. 
-@menu -* File Header Record:: -* Variable Record:: -* Value Labels Records:: -* Document Record:: -* Machine Integer Info Record:: -* Machine Floating-Point Info Record:: -* Multiple Response Sets Records:: -* Extra Product Info Record:: -* Variable Display Parameter Record:: -* Long Variable Names Record:: -* Very Long String Record:: -* Character Encoding Record:: -* Long String Value Labels Record:: -* Long String Missing Values Record:: -* Data File and Variable Attributes Records:: -* Extended Number of Cases Record:: -* Miscellaneous Informational Records:: -* Dictionary Termination Record:: -* Data Record:: -* Encrypted System Files:: -@end menu +The following sections describe the known kinds of records. @node File Header Record @section File Header Record -The file header is always the first record in the file. It has the -following format: +A system file begins with the file header, with the following format: @example char rec_type[4]; @@ -178,7 +236,7 @@ contribute to this value beyond the first 255 bytes. Further, system files written by some systems set this value to -1. In general, it is unsafe for systems reading system files to rely upon this value. -@item int32 compressed; +@item int32 compression; Set to 0 if the data in the file is not compressed, 1 if the data is compressed with simple bytecode compression, 2 if the data is ZLIB compressed. This field has value 2 if and only if @code{rec_type} is @@ -262,6 +320,10 @@ wider than 255 bytes. Such very long string variables are represented by a number of narrower string variables. @xref{Very Long String Record}, for details. +A system file should contain at least one variable and thus at least +one variable record, but system files have been observed in the wild +without any variables (thus, no data either). + @example int32 rec_type; int32 type; @@ -352,6 +414,7 @@ in the range. When a range plus a value are present, the third element denotes the additional discrete missing value. 
@end table +@anchor{System File Output Formats} The @code{print} and @code{write} members of sysfile_variable are output formats coded into @code{int32} types. The least-significant byte of the @code{int32} represents the number of decimal places, and the @@ -542,7 +605,10 @@ char lines[][80]; Record type. Always set to 6. @item int32 n_lines; -Number of lines of documents present. +Number of lines of documents present. This should be greater than +zero, but the system file writer that identifies itself as +@url{https://github.com/WizardMac/ReadStat} writes document records +with zero @code{n_lines}. @item char lines[][80]; Document lines. The number of elements is defined by @code{n_lines}. @@ -687,13 +753,16 @@ Size of each piece of data in the data part, in bytes. Always set to 8. Number of pieces of data in the data part. Always set to 3. @item flt64 sysmis; -The system missing value. - -@item flt64 highest; -The value used for HIGHEST in missing values. - -@item flt64 lowest; -The value used for LOWEST in missing values. +@itemx flt64 highest; +@itemx flt64 lowest; +The system missing value, the value used for HIGHEST in missing +values, and the value used for LOWEST in missing values, respectively. +@xref{System File Format}, for more information. + +The SPSSWriter library in PHP, which identifies itself as @code{FOM +SPSS 1.0.0} in the file header record @code{prod_name} field, writes +unexpected values to these fields, but it uses the same values +consistently throughout the rest of the file. @end table @node Multiple Response Sets Records @@ -786,8 +855,8 @@ The short names of the variables in the set, converted to lowercase, each separated from the previous by a single space. Even though a multiple response set must have at least two variables, -some system files contain multiple response sets with no variables at -all. 
The source and meaning of these multiple response sets is +some system files contain multiple response sets with no variables or +one variable. The source and meaning of these multiple response sets is unknown. (Perhaps they arise from creating a multiple response set then deleting all the variables that it contains?) @@ -1299,7 +1368,7 @@ The total number of bytes in @code{attributes}. @item char attributes[]; The attributes, in a text-based format. -In record type 17, this field contains a single attribute set. An +In record subtype 17, this field contains a single attribute set. An attribute set is a sequence of one or more attributes concatenated together. Each attribute consists of a name, which has the same syntax as a variable name, followed by, inside parentheses, a sequence @@ -1311,13 +1380,17 @@ way to embed a line feed in a value. There is no distinction between an attribute with a single value and an attribute array with one element. -In record type 18, this field contains a sequence of one or more +In record subtype 18, this field contains a sequence of one or more variable attribute sets. If more than one variable attribute set is present, each one after the first is delimited from the previous by @code{/}. Each variable attribute set consists of a long variable name, followed by @code{:}, followed by an attribute set with the same -syntax as on record type 17. +syntax as on record subtype 17. + +System files written by @code{Stata 14.1/-savespss- 1.77 by +S.Radyakin} may include multiple records with subtype 18, one per +variable that has variable attributes. The total length is @code{count} bytes. @end table @@ -1407,46 +1480,30 @@ same reason as @code{ncases} in the file header record, but this has not been observed in the wild. 
@end table -@node Miscellaneous Informational Records -@section Miscellaneous Informational Records +@node Other Informational Records +@section Other Informational Records -Some specific types of miscellaneous informational records are +Many specific types of extension records are documented here, but others are known to exist. PSPP ignores unknown -miscellaneous informational records when reading system files. - -@example -/* @r{Header.} */ -int32 rec_type; -int32 subtype; -int32 size; -int32 count; +extension records when reading system files. -/* @r{Exactly @code{size * count} bytes of data.} */ -char data[]; -@end example +The following extension record subtypes have also been observed, with +the following believed meanings: -@table @code -@item int32 rec_type; -Record type. Always set to 7. - -@item int32 subtype; -Record subtype. May take any value. According to Aapi -H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6 -indicates date info (probably related to USE). Subtype 24 appears to -contain XML that describes how data in the file should be displayed -on-screen. +@table @asis +@item 5 +A set of grouped variables (according to Aapi H@"am@"al@"ainen). -@item int32 size; -Size of each piece of data in the data part. Should have the value 1, -4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data, -respectively. +@item 6 +Date info, probably related to USE (according to Aapi H@"am@"al@"ainen). -@item int32 count; -Number of pieces of data in the data part. +@item 12 +A UUID in the format described in RFC 4122. Only two examples +observed, both written by SPSS 13, and in each case the UUID contained +both upper and lower case. -@item char data[]; -Arbitrary data. There must be @code{size} times @code{count} bytes of -data. +@item 24 +XML that describes how data in the file should be displayed on-screen. @end table @node Dictionary Termination Record @@ -1647,165 +1704,3 @@ system file.
@end table @setfilename ignored - -@node Encrypted System Files -@section Encrypted System Files - -SPSS 21 and later support an encrypted system file format. - -@quotation Warning -The SPSS encrypted file format is poorly designed. It is much cheaper -and faster to decrypt a file encrypted this way than if a well -designed alternative were used. If you must use this format, use a -10-byte randomly generated password. -@end quotation - -@subheading Encrypted File Format - -Encrypted system files begin with the following 36-byte fixed header: - -@example -0000 1c 00 00 00 00 00 00 00 45 4e 43 52 59 50 54 45 |........ENCRYPTE| -0010 44 53 41 56 15 00 00 00 00 00 00 00 00 00 00 00 |DSAV............| -0020 00 00 00 00 |....| -@end example - -Following the fixed header is a complete system file in the usual -format, except that each 16-byte block is encrypted with AES-256 in -ECB mode. The AES-256 key is derived from a password in the following -way: - -@enumerate -@item -Start from the literal password typed by the user. Truncate it to at -most 10 bytes, then append (between 1 and 22) null bytes until there -are exactly 32 bytes. Call this @var{password}. - -@item -Let @var{constant} be the following 73-byte constant: - -@example -0000 00 00 00 01 35 27 13 cc 53 a7 78 89 87 53 22 11 -0010 d6 5b 31 58 dc fe 2e 7e 94 da 2f 00 cc 15 71 80 -0020 0a 6c 63 53 00 38 c3 38 ac 22 f3 63 62 0e ce 85 -0030 3f b8 07 4c 4e 2b 77 c7 21 f5 1a 80 1d 67 fb e1 -0040 e1 83 07 d8 0d 00 00 01 00 -@end example - -@item -Compute CMAC-AES-256(@var{password}, @var{constant}). Call the -16-byte result @var{cmac}. - -@item -The 32-byte AES-256 key is @var{cmac} || @var{cmac}, that is, -@var{cmac} repeated twice. -@end enumerate - -@subsubheading Example - -Consider the password @samp{pspp}. 
@var{password} is: - -@example -0000 70 73 70 70 00 00 00 00 00 00 00 00 00 00 00 00 |pspp............| -0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -@end example - -@noindent -@var{cmac} is: - -@example -0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -@end example - -@noindent -The AES-256 key is: - -@example -0000 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -0010 3e da 09 8e 66 04 d4 fd f9 63 0c 2c a8 6f b0 45 -@end example - -@subheading Password Encoding - -SPSS also supports what it calls ``encrypted passwords.'' These are -not encrypted. They are encoded with a simple, fixed scheme. An -encoded password is always a multiple of 2 characters long, and never -longer than 20 characters. The characters in an encoded password are -always in the graphic ASCII range 33 through 126. Each successive -pair of characters in the password encodes a single byte in the -plaintext password. - -Use the following algorithm to decode a pair of characters: - -@enumerate -@item -Let @var{a} be the ASCII code of the first character, and @var{b} be -the ASCII code of the second character. - -@item -Let @var{ah} be the most significant 4 bits of @var{a}. Find the line -in the table below that has @var{ah} on the left side. The right side -of the line is a set of possible values for the most significant 4 -bits of the decoded byte. - -@display -@t{2 } @result{} @t{2367} -@t{3 } @result{} @t{0145} -@t{47} @result{} @t{89cd} -@t{56} @result{} @t{abef} -@end display - -@item -Let @var{bh} be the most significant 4 bits of @var{b}. Find the line -in the second table below that has @var{bh} on the left side. The -right side of the line is a set of possible values for the most -significant 4 bits of the decoded byte. Together with the results of -the previous step, only a single possibility is left. 
- -@display -@t{2 } @result{} @t{139b} -@t{3 } @result{} @t{028a} -@t{47} @result{} @t{46ce} -@t{56} @result{} @t{57df} -@end display - -@item -Let @var{al} be the least significant 4 bits of @var{a}. Find the -line in the table below that has @var{al} on the left side. The right -side of the line is a set of possible values for the least significant -4 bits of the decoded byte. - -@display -@t{03cf} @result{} @t{0145} -@t{12de} @result{} @t{2367} -@t{478b} @result{} @t{89cd} -@t{569a} @result{} @t{abef} -@end display - -@item -Let @var{bl} be the least significant 4 bits of @var{b}. Find the -line in the table below that has @var{bl} on the left side. The right -side of the line is a set of possible values for the least significant -4 bits of the decoded byte. Together with the results of the previous -step, only a single possibility is left. - -@display -@t{03cf} @result{} @t{028a} -@t{12de} @result{} @t{139b} -@t{478b} @result{} @t{46ce} -@t{569a} @result{} @t{57df} -@end display -@end enumerate - -@subsubheading Example - -Consider the encoded character pair @samp{-|}. @var{a} is -0x2d and @var{b} is 0x7c, so @var{ah} is 2, @var{bh} is 7, @var{al} is -0xd, and @var{bl} is 0xc. @var{ah} means that the most significant -four bits of the decoded character is 2, 3, 6, or 7, and @var{bh} -means that they are 4, 6, 0xc, or 0xe. The single possibility in -common is 6, so the most significant four bits are 6. Similarly, -@var{al} means that the least significant four bits are 2, 3, 6, or 7, -and @var{bl} means they are 0, 2, 8, or 0xa, so the least significant -four bits are 2. The decoded character is therefore 0x62, the letter -@samp{b}.
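Reviewer's illustration, not part of the patch: the four lookup tables in the "Password Encoding" text removed by the last hunk translate directly into set intersections. The sketch below is one possible reading of that algorithm. The names `decode_pair` and `decode_password` are hypothetical, and the length and range checks simply restate the constraints given in the text (an even length of at most 20 characters, ASCII codes 33 through 126).

```python
def _table(rows):
    """Expand rows such as ("47", "89cd") into {nibble: set of candidates}."""
    return {
        int(key, 16): {int(value, 16) for value in values}
        for keys, values in rows
        for key in keys
    }


# Candidates for the decoded byte's high nibble, from the high nibbles of a and b.
A_HIGH = _table([("2", "2367"), ("3", "0145"), ("47", "89cd"), ("56", "abef")])
B_HIGH = _table([("2", "139b"), ("3", "028a"), ("47", "46ce"), ("56", "57df")])
# Candidates for the decoded byte's low nibble, from the low nibbles of a and b.
A_LOW = _table([("03cf", "0145"), ("12de", "2367"), ("478b", "89cd"), ("569a", "abef")])
B_LOW = _table([("03cf", "028a"), ("12de", "139b"), ("478b", "46ce"), ("569a", "57df")])


def decode_pair(a, b):
    """Decode two ASCII codes into one plaintext byte."""
    for code in (a, b):
        if not 33 <= code <= 126:
            raise ValueError("character outside the ASCII range 33 through 126")
    # Each intersection leaves exactly one candidate nibble.
    (high,) = A_HIGH[a >> 4] & B_HIGH[b >> 4]
    (low,) = A_LOW[a & 0x0F] & B_LOW[b & 0x0F]
    return (high << 4) | low


def decode_password(encoded):
    """Decode an encoded password string into plaintext bytes."""
    if len(encoded) % 2 != 0 or len(encoded) > 20:
        raise ValueError("encoded password must have an even length of at most 20")
    codes = encoded.encode("ascii")
    return bytes(decode_pair(codes[i], codes[i + 1]) for i in range(0, len(codes), 2))
```

With the pair from the worked example, `decode_password("-|")` returns `b"b"`, matching the 0x62 derived by hand in the text.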