X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=484fbb43f1eca08fc089bd0f5ec171215f279403;hb=80527716392c066fdf72f37729c42089a2174bae;hp=d1d00873f3b3666aa97f8f1e10553a5815d268bb;hpb=b8f75b2ac6bc701ecacaa248d630918d7a7346e2;p=pspp diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi index d1d00873f3..484fbb43f1 100644 --- a/doc/dev/system-file-format.texi +++ b/doc/dev/system-file-format.texi @@ -30,20 +30,118 @@ files and translates as necessary. PSPP also detects the floating-point format in use, as well as the endianness of IEEE 754 floating-point numbers, and translates as needed. However, only IEEE 754 numbers with the same endianness as integer data in the same file -has actually been observed in system files, and it is likely that +have actually been observed in system files, and it is likely that other formats are obsolete or were never used. -The PSPP system-missing value is represented by the largest possible -negative number in the floating point format (@code{-DBL_MAX}). Two -other values are important for use as missing values: @code{HIGHEST}, -represented by the largest possible positive number (@code{DBL_MAX}), -and @code{LOWEST}, represented by the second-largest negative number -(in IEEE 754 format, @code{0xffeffffffffffffe}). +System files use a few floating point values for special purposes: -System files are divided into records, each of which begins with a -4-byte record type, usually regarded as an @code{int32}. +@table @asis +@item SYSMIS +The system-missing value is represented by the largest possible +negative number in the floating point format (@code{-DBL_MAX}). + +@item HIGHEST +HIGHEST is used as the high end of a missing value range with an +unbounded maximum. It is represented by the largest possible positive +number (@code{DBL_MAX}). + +@item LOWEST +LOWEST is used as the low end of a missing value range with an +unbounded minimum. 
It was originally represented by the +second-largest negative number (in IEEE 754 format, +@code{0xffeffffffffffffe}). System files written by SPSS 21 and later +instead use the largest negative number (@code{-DBL_MAX}), the same +value as SYSMIS. This does not lead to ambiguity because LOWEST +appears in system files only in missing value ranges, which never +contain SYSMIS. +@end table + +System files may use most character encodings based on an 8-bit unit. +UTF-16 and UTF-32, based on wider units, appear to be unacceptable. +@code{rec_type} in the file header record is sufficient to distinguish +between ASCII and EBCDIC based encodings. The best way to determine +the specific encoding in use is to consult the character encoding +record (@pxref{Character Encoding Record}), if present, and failing +that the @code{character_code} in the machine integer info record +(@pxref{Machine Integer Info Record}). The same encoding should be +used for the dictionary and the data in the file, although it is +possible to artificially synthesize files that use different encodings +(@pxref{Character Encoding Record}). 
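The special floating-point values listed in the table above can be checked with a short sketch. It assumes IEEE 754 doubles and reads the hex patterns as big-endian; the names `SYSMIS`, `HIGHEST`, `LOWEST_OLD`, and `flt64_from_hex` are illustrative and do not come from PSPP.

```python
import struct
import sys


def flt64_from_hex(hex_bits):
    """Interpret 16 hex digits as a big-endian IEEE 754 double."""
    return struct.unpack(">d", bytes.fromhex(hex_bits))[0]


# -DBL_MAX: the system-missing value (and LOWEST in files from SPSS 21 on).
SYSMIS = -sys.float_info.max
# DBL_MAX: the high end of an unbounded missing value range.
HIGHEST = sys.float_info.max
# The original LOWEST: one step closer to zero than -DBL_MAX.
LOWEST_OLD = flt64_from_hex("ffeffffffffffffe")
```

A reader that accepts both encodings of LOWEST only needs to compare the low end of a missing value range against `SYSMIS` and `LOWEST_OLD`.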
+ +@menu +* System File Record Structure:: +* File Header Record:: +* Variable Record:: +* Value Labels Records:: +* Document Record:: +* Machine Integer Info Record:: +* Machine Floating-Point Info Record:: +* Multiple Response Sets Records:: +* Extra Product Info Record:: +* Variable Display Parameter Record:: +* Long Variable Names Record:: +* Very Long String Record:: +* Character Encoding Record:: +* Long String Value Labels Record:: +* Long String Missing Values Record:: +* Data File and Variable Attributes Records:: +* Extended Number of Cases Record:: +* Other Informational Records:: +* Dictionary Termination Record:: +* Data Record:: +@end menu + +@node System File Record Structure +@section System File Record Structure + +System files are divided into records with the following format: + +@example +int32 type; +char data[]; +@end example + +This header does not identify the length of the @code{data} or any +information about what it contains, so the system file reader must +understand the format of @code{data} based on @code{type}. However, +records with type 7, called @dfn{extension records}, have a stricter +format: + +@example +int32 type; +int32 subtype; +int32 size; +int32 count; +char data[size * count]; +@end example -The records must appear in the following order: +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. This value identifies a particular kind of extension +record. + +@item int32 size; +The size of each piece of data that follows the header, in bytes. +Known extension records use 1, 4, or 8, for @code{char}, @code{int32}, +and @code{flt64} format data, respectively. + +@item int32 count; +The number of pieces of data that follow the header. + +@item char data[size * count]; +Data, whose format and interpretation depend on the subtype. 
+@end table + +An extension record contains exactly @code{size * count} bytes of +data, which allows a reader that does not understand an extension +record to skip it. Extension records provide only nonessential +information, so this allows for files written by newer software to +preserve backward compatibility with older or less capable readers. + +Records in a system file must appear in the following order: @itemize @bullet @item @@ -60,7 +158,13 @@ if present. Document record, if present. @item -Any records not explicitly included in this list, in any order. +Extension (type 7) records, in ascending numerical order of their +subtypes. + +System files written by SPSS include at most one of each kind of +extension record. This is generally true of system files written by +other software as well, with known exceptions noted below in the +individual sections about each type of record. @item Dictionary termination record. @@ -69,39 +173,26 @@ Dictionary termination record. Data record. @end itemize -Each type of record is described separately below. +We advise authors of programs that read system files to tolerate +format variations. Various kinds of misformatting and corruption have +been observed in system files written by SPSS and other software +alike. In particular, because extension records provide nonessential +information, it is generally better to ignore an extension record +entirely than to refuse to read a system file. 
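As a rough sketch of the advice above, a reader can use `size * count` to step over any extension record it does not understand. The function names and the `KNOWN_SUBTYPES` set are hypothetical, and the sketch assumes the 4-byte record type (7) has already been read and that the file is little-endian unless stated otherwise.

```python
import io
import struct

# Hypothetical: the only subtypes this toy reader claims to understand.
KNOWN_SUBTYPES = {10, 17, 18}


def read_extension_record(stream, endian="<"):
    """Read subtype, size, count and data of a type 7 record.

    The caller has already consumed the int32 record type.
    """
    header = stream.read(12)
    if len(header) != 12:
        raise EOFError("truncated extension record header")
    subtype, size, count = struct.unpack(endian + "3i", header)
    if size < 0 or count < 0:
        raise ValueError("negative size or count in extension record")
    data = stream.read(size * count)
    if len(data) != size * count:
        raise EOFError("truncated extension record data")
    return subtype, size, count, data


def handle_extension(stream, endian="<"):
    """Return (subtype, data) for known records, None for skipped ones."""
    subtype, _size, _count, data = read_extension_record(stream, endian)
    if subtype not in KNOWN_SUBTYPES:
        # The data was consumed, so the stream is at the next record.
        return None
    return subtype, data
```

Returning `None` instead of raising an error keeps an unknown or malformed subtype from making the whole file unreadable.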
-@menu -* File Header Record:: -* Variable Record:: -* Value Labels Records:: -* Document Record:: -* Machine Integer Info Record:: -* Machine Floating-Point Info Record:: -* Variable Display Parameter Record:: -* Long Variable Names Record:: -* Very Long String Record:: -* Character Encoding Record:: -* Long String Value Labels Record:: -* Data File and Variable Attributes Records:: -* Extended Number of Cases Record:: -* Miscellaneous Informational Records:: -* Dictionary Termination Record:: -* Data Record:: -@end menu +The following sections describe the known kinds of records. @node File Header Record @section File Header Record -The file header is always the first record in the file. It has the -following format: +A system file begins with the file header, with the following format: @example char rec_type[4]; char prod_name[60]; int32 layout_code; int32 nominal_case_size; -int32 compressed; +int32 compression; int32 weight_index; int32 ncases; flt64 bias; @@ -113,7 +204,15 @@ char padding[3]; @table @code @item char rec_type[4]; -Record type code, set to @samp{$FL2}. +Record type code, either @samp{$FL2} for system files with +uncompressed data or data compressed with simple bytecode compression, +or @samp{$FL3} for system files with ZLIB compressed data. + +This is truly a character field that uses the same character encoding as +other strings. Thus, in a file with an ASCII-based character encoding +this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a +file with an EBCDIC-based encoding this field contains @code{5b c6 d3 +f2}. (No EBCDIC-based ZLIB-compressed files have been observed.) @item char prod_name[60]; Product identification string. This always begins with the characters @@ -123,6 +222,12 @@ pspp 0.1.4 - sparc-sun-solaris2.5.2}. The string is truncated if it would be longer than 60 characters; otherwise it is padded on the right with spaces. 
+The product name field allows readers to behave differently based on +quirks in the way that particular software writes system files. +@xref{Value Labels Records}, for details of the quirk that the PSPP +system file reader tolerates in files written by ReadStat, which has +@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}. + @anchor{layout_code} @item int32 layout_code; Normally set to 2, although a few system files have been spotted in @@ -133,12 +238,15 @@ file's integer endianness (@pxref{System File Format}). Number of data elements per case. This is the number of variables, except that long string variables add extra data elements (one for every 8 characters after the first 8). However, string variables do not -contribute to this value beyond the first 255 bytes. Further, system -files written by some systems set this value to -1. In general, it is +contribute to this value beyond the first 255 bytes. Further, some +software always writes -1 or 0 in this field. In general, it is unsafe for systems reading system files to rely upon this value. -@item int32 compressed; -Set to 1 if the data in the file is compressed, 0 otherwise. +@item int32 compression; +Set to 0 if the data in the file is not compressed, 1 if the data is +compressed with simple bytecode compression, 2 if the data is ZLIB +compressed. This field has value 2 if and only if @code{rec_type} is +@samp{$FL3}. @item int32 weight_index; If one of the variables in the data set is used as a weighting @@ -183,6 +291,10 @@ field is arbitrarily set to @samp{00:00:00}. File label declared by the user, if any (@pxref{FILE LABEL,,,pspp, PSPP Users Guide}). Padded on the right with spaces. +A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses +CR-only line ends in this field, rather than the more usual LF-only or +CR LF line ends. + @item char padding[3]; Ignored padding bytes to make the structure a multiple of 32 bits in length. Set to zeros. 
@@ -214,6 +326,10 @@ wider than 255 bytes. Such very long string variables are represented by a number of narrower string variables. @xref{Very Long String Record}, for details. +A system file should contain at least one variable and thus at least +one variable record, but system files have been observed in the wild +without any variables (thus, no data either). + @example int32 rec_type; int32 type; @@ -252,6 +368,10 @@ respectively. If the variable has a range for missing variables, set to -2; if the variable has a range for missing variables plus a single discrete value, set to -3. +A long string variable always has the value 0 here. A separate record +indicates missing values for long string variables (@pxref{Long String +Missing Values Record}). + @item int32 print; Print format for this variable. See below. @@ -264,10 +384,19 @@ the at-sign (@samp{@@}). Subsequent characters may also be digits, octothorpes (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full stops (@samp{.}). The variable name is padded on the right with spaces. +The @samp{name} fields should be unique within a system file. System +files written by SPSS that contain very long string variables with +similar names sometimes contain duplicate names that are later +eliminated by resolving the very long string names (@pxref{Very Long +String Record}). PSPP handles duplicates by assigning them new, +unique names. + @item int32 label_len; This field is present only if @code{has_var_label} is set to 1. It is -set to the length, in characters, of the variable label, which must be a -number between 0 and 120. +set to the length, in characters, of the variable label. The +documented maximum length varies from 120 to 255 based on SPSS +version, but some files have been seen with longer labels. PSPP +accepts labels of any length. @item char label[]; This field is present only if @code{has_var_label} is set to 1. It has @@ -291,6 +420,7 @@ in the range. 
When a range plus a value are present, the third element denotes the additional discrete missing value. @end table +@anchor{System File Output Formats} The @code{print} and @code{write} members of sysfile_variable are output formats coded into @code{int32} types. The least-significant byte of the @code{int32} represents the number of decimal places, and the @@ -387,6 +517,11 @@ Format types are defined as follows: @end multitable @end quotation +A few system files have been observed in the wild with invalid +@code{write} fields, in particular with value 0. Readers should +probably treat invalid @code{print} or @code{write} fields as some +default format. + @node Value Labels Records @section Value Labels Records @@ -395,6 +530,14 @@ numeric and short string variables only. Long string variables may have value labels, but their value labels are recorded using a different record type (@pxref{Long String Value Labels Record}). +ReadStat (@pxref{File Header Record}) writes value labels that label a +single value more than once. In more detail, it emits value labels +whose values are longer than string variables' widths, that are +identical in the actual width of the variable, e.g.@: labels for +values @code{ABC123} and @code{ABC456} for a string variable with +width 3. For files written by this software, PSPP ignores such +labels. + The value label record has the following format: @example @@ -425,7 +568,9 @@ in length. Its type and width cannot be determined until the following value label variables record (see below) is read. @item char label_len; -The label's length, in bytes. +The label's length, in bytes. The documented maximum length varies +from 60 to 120 based on SPSS version. PSPP supports value labels up +to 255 bytes long. @item char label[]; @code{label_len} bytes of the actual label, followed by up to 7 bytes @@ -474,7 +619,8 @@ char lines[][80]; Record type. Always set to 6. @item int32 n_lines; -Number of lines of documents present. 
+Number of lines of documents present. This should be greater than +zero, but ReadStat writes system files with zero @code{n_lines}. @item char lines[][80]; Document lines. The number of elements is defined by @code{n_lines}. @@ -538,20 +684,53 @@ Floating point representation code. For IEEE 754 systems this is 1. IBM 370 sets this to 2, and DEC VAX E to 3. @item int32 compression_code; -Compression code. Always set to 1. +Compression code. Always set to 1, regardless of whether or how the +file is compressed. @item int32 endianness; Machine endianness. 1 indicates big-endian, 2 indicates little-endian. @item int32 character_code; -@anchor{character-code} -Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3 -indicates 8-bit ASCII, 4 indicates DEC Kanji. -Windows code page numbers are also valid. - -Experience has shown that in many files, this field is ignored or incorrect. -For a more reliable indication of the file's character encoding -see @ref{Character Encoding Record}. +@anchor{character-code} Character code. The following values have +actually been observed in system files: + +@table @asis +@item 1 +EBCDIC. + +@item 2 +7-bit ASCII. + +@item 1250 +The @code{windows-1250} code page for Central European and Eastern +European languages. + +@item 1252 +The @code{windows-1252} code page for Western European languages. + +@item 28591 +ISO 8859-1. + +@item 65001 +UTF-8. +@end table + +The following additional values are known to be defined: + +@table @asis +@item 3 +8-bit ``ASCII''. + +@item 4 +DEC Kanji. +@end table + +Other Windows code page numbers are known to be generally valid. + +Old versions of SPSS for Unix and Windows always wrote value 2 in this +field, regardless of the encoding in use. Newer versions also write +the character encoding as a string (see @ref{Character Encoding +Record}). @end table @node Machine Floating-Point Info Record @@ -586,13 +765,185 @@ Size of each piece of data in the data part, in bytes. Always set to 8. 
Number of pieces of data in the data part. Always set to 3. @item flt64 sysmis; -The system missing value. +@itemx flt64 highest; +@itemx flt64 lowest; +The system missing value, the value used for HIGHEST in missing +values, and the value used for LOWEST in missing values, respectively. +@xref{System File Format}, for more information. + +The SPSSWriter library in PHP, which identifies itself as @code{FOM +SPSS 1.0.0} in the file header record @code{prod_name} field, writes +unexpected values to these fields, but it uses the same values +consistently throughout the rest of the file. +@end table + +@node Multiple Response Sets Records +@section Multiple Response Sets Records + +The system file format has two different types of records that +represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users +Guide}). The first type of record describes multiple response sets +that can be understood by SPSS before version 14. The second type of +record, with a closely related format, is used for multiple dichotomy +sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in +version 14. + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Exactly @code{count} bytes of data.} */ +char mrsets[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Set to 7 for records that describe multiple response +sets understood by SPSS before version 14, or to 19 for records that +describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES +feature added in version 14. + +@item int32 size; +The size of each element in the @code{mrsets} member. Always set to 1. + +@item int32 count; +The total number of bytes in @code{mrsets}. 
+ +@item char mrsets[]; +Zero or more line feeds (byte 0x0a), followed by a series of multiple +response sets, each of which consists of the following: + +@itemize @bullet +@item +The set's name (an identifier that begins with @samp{$}), in mixed +upper and lower case. + +@item +An equals sign (@samp{=}). + +@item +@samp{C} for a multiple category set, @samp{D} for a multiple +dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a +multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES. + +@item +For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a +space, followed by a number expressed as decimal digits, followed by a +space. If LABELSOURCE=VARLABEL was specified on MRSETS, then the +number is 11; otherwise it is 1.@footnote{This part of the format may +not be fully understood, because only a single example of each +possibility has been examined.} + +@item +For either kind of multiple dichotomy set, the counted value, as a +positive integer count specified as decimal digits, followed by a +space, followed by as many string bytes as specified in the count. If +the set contains numeric variables, the string consists of the counted +integer value expressed as decimal digits. If the set contains string +variables, the string contains the counted string value. Either way, +the string may be padded on the right with spaces (older versions of +SPSS seem to always pad to a width of 8 bytes; newer versions don't). + +@item +A space. + +@item +The multiple response set's label, using the same format as for the +counted value for multiple dichotomy sets. A string of length 0 means +that the set does not have a label. A string of length 0 is also +written if LABELSOURCE=VARLABEL was specified. + +@item +A space. + +@item +The short names of the variables in the set, converted to lowercase, +each separated from the previous by a single space. 
+ +Even though a multiple response set must have at least two variables, +some system files contain multiple response sets with no variables or +one variable. The source and meaning of these multiple response sets are +unknown. (Perhaps they arise from creating a multiple response set +then deleting all the variables that it contains?) + +@item +One line feed (byte 0x0a). Sometimes multiple line feeds, even +hundreds, are present. +@end itemize +@end table + +Example: Given appropriate variable definitions, consider the +following MRSETS command: + +@example +MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c + /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55 + /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes' + /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES + VARIABLES=k l m VALUE=34 + /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL + VARIABLES=n o p VALUE='choice'. +@end example -@item flt64 highest; -The value used for HIGHEST in missing values. +The above would generate the following multiple response set record of +subtype 7: -@item flt64 lowest; -The value used for LOWEST in missing values. +@example +$a=C 10 my mcgroup a b c +$b=D2 55 0 g e f d +$c=D3 Yes 10 mdgroup #2 h i j +@end example + +It would also generate the following multiple response set record with +subtype 19: + +@example +$d=E 1 2 34 13 third mdgroup k l m +$e=E 11 6 choice 0 n o p +@end example + +@node Extra Product Info Record +@section Extra Product Info Record + +This optional record appears to contain a text string that describes +the program that wrote the file and the source of the data. (This is +redundant with the file label and product info found in the file +header record.) + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Exactly @code{count} bytes of data.} */ +char info[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. 
+ +@item int32 subtype; +Record subtype. Always set to 10. + +@item int32 size; +The size of each element in the @code{info} member. Always set to 1. + +@item int32 count; +The total number of bytes in @code{info}. + +@item char info[]; +A text string. A product that identifies itself as @code{VOXCO +INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the +more usual LF-only or CR LF line ends. @end table @node Variable Display Parameter Record @@ -646,8 +997,8 @@ Ordinal Scale Continuous Scale @end table -SPSS 14 sometimes writes a @code{measure} of 0. PSPP interprets this -as nominal scale. +SPSS sometimes writes a @code{measure} of 0. PSPP interprets this as +nominal scale. @item int32 width; The width of the display column for the variable in characters. @@ -827,12 +1178,29 @@ The size of each element in the @code{encoding} member. Always set to 1. The total number of bytes in @code{encoding}. @item char encoding[]; -The name of the character encoding. Normally this will be an official IANA characterset name or alias. +The name of the character encoding. Normally this will be an official +IANA character set name or alias. See @url{http://www.iana.org/assignments/character-sets}. +Character set names are not case-sensitive, but SPSS appears to write +them in all-uppercase. @end table -This record is not present in files generated by older software. -See also @ref{character-code}. +This record is not present in files generated by older software. See +also the @code{character_code} field in the machine integer info +record (@pxref{character-code}). + +When the character encoding record and the machine integer info record +are both present, all system files observed in practice indicate the +same character encoding, e.g.@: 1252 as @code{character_code} and +@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc. 
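One way a reader might combine the two sources of encoding information is sketched below: prefer the character encoding record, then fall back to `character_code`. The function name, the mapping to Python codec names, and the `cp1252` fallback are all assumptions of this sketch, not PSPP behavior. Codes 1 and 2 are deliberately left out of the table because EBCDIC has many code pages and old SPSS versions wrote 2 regardless of the actual encoding.

```python
import codecs

# character_code values observed in system files, mapped to Python codec
# names.  The codec names are this sketch's choice.
CODE_PAGES = {
    1250: "cp1250",
    1252: "cp1252",
    28591: "iso8859-1",
    65001: "utf-8",
}


def choose_encoding(encoding_record=None, character_code=None,
                    fallback="cp1252"):
    """Pick a Python codec name for the strings in a system file."""
    if encoding_record:
        # Names are case-insensitive; drop padding and stray bytes.
        name = encoding_record.decode("ascii", "ignore").strip(" \0")
        try:
            return codecs.lookup(name).name
        except (LookupError, ValueError):
            pass  # Unknown name: fall through to character_code.
    if character_code in CODE_PAGES:
        return CODE_PAGES[character_code]
    return fallback  # Arbitrary default, an assumption of this sketch.
```

For example, `choose_encoding(b"WINDOWS-1252", 1252)` and `choose_encoding(None, 1252)` both select the `cp1252` codec.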
+ +If, for testing purposes, a file is crafted with different +@code{character_code} and @code{encoding}, it seems that +@code{character_code} controls the encoding for all strings in the +system file before the dictionary termination record, including +strings in data (e.g.@: string missing values), and @code{encoding} +controls the encoding for strings following the dictionary termination +record. @node Long String Value Labels Record @section Long String Value Labels Record @@ -907,6 +1275,74 @@ between 0 and 120, is the number of bytes in @code{label}. The @end table @end table +@node Long String Missing Values Record +@section Long String Missing Values Record + +This record, if present, specifies missing values for long string +variables. + +@example +/* @r{Header.} */ +int32 rec_type; +int32 subtype; +int32 size; +int32 count; + +/* @r{Repeated up to exactly @code{count} bytes.} */ +int32 var_name_len; +char var_name[]; +char n_missing_values; +long_string_missing_value values[]; +@end example + +@table @code +@item int32 rec_type; +Record type. Always set to 7. + +@item int32 subtype; +Record subtype. Always set to 22. + +@item int32 size; +Always set to 1. + +@item int32 count; +The number of bytes following the header until the next header. + +@item int32 var_name_len; +@itemx char var_name[]; +The number of bytes in the name of the long string variable that has +missing values, plus the variable name itself, which consists of +exactly @code{var_name_len} bytes. The variable name is not padded to +any particular boundary, nor is it null-terminated. + +@item char n_missing_values; +The number of missing values, either 1, 2, or 3. (This is, unusually, +a single byte instead of a 32-bit number.) + +@item long_string_missing_value values[]; +The missing values themselves. 
This array contains exactly +@code{n_missing_values} elements, each of which has the following +substructure: + +@example +int32 value_len; +char value[]; +@end example + +@table @code +@item int32 value_len; +The length of the missing value string, in bytes. This value should +be 8, because long string variables are at least 8 bytes wide (by +definition), only the first 8 bytes of a long string variable's +missing values are allowed to be non-spaces, and any spaces within the +first 8 bytes are included in the missing value here. + +@item char value[]; +The missing value string, exactly @code{value_len} bytes, without +any padding or null terminator. +@end table +@end table + @node Data File and Variable Attributes Records @section Data File and Variable Attributes Records @@ -944,7 +1380,7 @@ The total number of bytes in @code{attributes}. @item char attributes[]; The attributes, in a text-based format. -In record type 17, this field contains a single attribute set. An +In record subtype 17, this field contains a single attribute set. An attribute set is a sequence of one or more attributes concatenated together. Each attribute consists of a name, which has the same syntax as a variable name, followed by, inside parentheses, a sequence @@ -956,12 +1392,17 @@ way to embed a line feed in a value. There is no distinction between an attribute with a single value and an attribute array with one element. -In record type 18, this field contains a sequence of one or more +In record subtype 18, this field contains a sequence of one or more variable attribute sets. If more than one variable attribute set is present, each one after the first is delimited from the previous by -@code{/}. Each variable attribute set consists of a variable name, +@code{/}. Each variable attribute set consists of a long +variable name, followed by @code{:}, followed by an attribute set with the same -syntax as on record type 17. +syntax as on record subtype 17. 
+ +System files written by @code{Stata 14.1/-savespss- 1.77 by +S.Radyakin} may include multiple records with subtype 18, one per +variable that has variable attributes. The total length is @code{count} bytes. @end table @@ -980,12 +1421,38 @@ VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123'). will contain a variable attribute record with the following contents: @example -00000000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| -00000010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| -00000020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| -00000030 0a 29 |.) | +0000 07 00 00 00 12 00 00 00 01 00 00 00 22 00 00 00 |............"...| +0010 64 75 6d 6d 79 3a 66 72 65 64 28 27 32 33 27 0a |dummy:fred('23'.| +0020 27 33 34 27 0a 29 62 65 72 74 28 27 31 32 33 27 |'34'.)bert('123'| +0030 0a 29 |.) | @end example +@menu +* Variable Roles:: +@end menu + +@node Variable Roles +@subsection Variable Roles + +A variable's role is represented as an attribute named @code{$@@Role}. +This attribute has a single element whose values and their meanings +are: + +@table @code +@item 0 +Input. This, the default, is the most common role. +@item 1 +Output. +@item 2 +Both. +@item 3 +None. +@item 4 +Partition. +@item 5 +Split. +@end table + @node Extended Number of Cases Record @section Extended Number of Cases Record @@ -1025,44 +1492,30 @@ same reason as @code{ncases} in the file header record, but this has not been observed in the wild. @end table -@node Miscellaneous Informational Records -@section Miscellaneous Informational Records +@node Other Informational Records +@section Other Informational Records -Some specific types of miscellaneous informational records are +Many specific types of extension records are documented here, but others are known to exist. PSPP ignores unknown -miscellaneous informational records when reading system files. +extension records when reading system files. 
-@example -/* @r{Header.} */ -int32 rec_type; -int32 subtype; -int32 size; -int32 count; +The following extension record subtypes have also been observed, with +the following believed meanings: -/* @r{Exactly @code{size * count} bytes of data.} */ -char data[]; -@end example - -@table @code -@item int32 rec_type; -Record type. Always set to 7. - -@item int32 subtype; -Record subtype. May take any value. According to Aapi -H@"am@"al@"ainen, value 5 indicates a set of grouped variables and 6 -indicates date info (probably related to USE). +@table @asis +@item 5 +A set of grouped variables (according to Aapi H@"am@"al@"ainen). -@item int32 size; -Size of each piece of data in the data part. Should have the value 1, -4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data, -respectively. +@item 6 +Date info, probably related to USE (according to Aapi H@"am@"al@"ainen). -@item int32 count; -Number of pieces of data in the data part. +@item 12 +A UUID in the format described in RFC 4122. Only two examples +observed, both written by SPSS 13, and in each case the UUID contained +both upper and lower case. -@item char data[]; -Arbitrary data. There must be @code{size} times @code{count} bytes of -data. +@item 24 +XML that describes how data in the file should be displayed on-screen. @end table @node Dictionary Termination Record @@ -1087,22 +1540,23 @@ Ignored padding. Should be set to 0. @node Data Record @section Data Record -Data records must follow all other records in the system file. There must -be at least one data record in every system file. - -The format of data records varies depending on whether the data is -compressed. Regardless, the data is arranged in a series of 8-byte -elements. +The data record must follow all other records in the system file. +Every system file must have a data record that specifies data for at +least one case. 
The format of the data record varies depending on the +value of @code{compression} in the file header record: -When data is not compressed, -each element corresponds to +@table @asis +@item 0: no compression +Data is arranged as a series of 8-byte elements. +Each element corresponds to the variable declared in the respective variable record (@pxref{Variable Record}). Numeric values are given in @code{flt64} format; string values are literal characters string, padded on the right when necessary to fill out 8-byte units. -Compressed data is arranged in the following manner: the first 8 bytes -in the data section is divided into a series of 1-byte command +@item 1: bytecode compression +The first 8 bytes +of the data record is divided into a series of 1-byte command codes. These codes have meanings as described below: @table @asis @@ -1140,8 +1594,125 @@ An 8-byte string value that is all spaces. The system-missing value. @end table -When the end of the an 8-byte group of command bytes is reached, any -blocks of non-compressible values indicated by code 253 are skipped, -and the next element of command bytes is read and interpreted, until -the end of the file or a code with value 252 is reached. +The end of the 8-byte group of bytecodes is followed by any 8-byte +blocks of non-compressible values indicated by code 253. After that +follows another 8-byte group of bytecodes, then those bytecodes' +non-compressible values. The pattern repeats to the end of the file +or a code with value 252. + +@item 2: ZLIB compression +The data record consists of the following, in order: + +@itemize @bullet +@item +ZLIB data header, 24 bytes long. + +@item +One or more variable-length blocks of ZLIB compressed data. + +@item +ZLIB data trailer, with a 24-byte fixed header plus an additional 24 +bytes for each preceding ZLIB compressed data block. 
+@end itemize + +The ZLIB data header has the following format: + +@example +int64 zheader_ofs; +int64 ztrailer_ofs; +int64 ztrailer_len; +@end example + +@table @code +@item int64 zheader_ofs; +The offset, in bytes, of the beginning of this structure within the +system file. + +@item int64 ztrailer_ofs; +The offset, in bytes, of the first byte of the ZLIB data trailer. + +@item int64 ztrailer_len; +The number of bytes in the ZLIB data trailer. This and the previous +field sum to the size of the system file in bytes. +@end table + +The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB +compressed data blocks. Each ZLIB compressed data block begins with a +ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78 +01} (the only header yet observed in practice). Each block +decompresses to a fixed number of bytes (in practice only +@code{0x3ff000}-byte blocks have been observed), except that the last +block of data may be shorter. The last ZLIB compressed data block +ends just before offset @code{ztrailer_ofs}. + +The result of ZLIB decompression is bytecode compressed data as +described above for compression format 1. + +The ZLIB data trailer begins with the following 24-byte fixed header: + +@example +int64 int_bias; +int64 zero; +int32 block_size; +int32 n_blocks; +@end example + +@table @code +@item int64 int_bias; +The compression bias as a negative integer, e.g.@: if @code{bias} in +the file header record is 100.0, then @code{int_bias} is @minus{}100 +(this is the only value yet observed in practice). + +@item int64 zero; +Always observed to be zero. + +@item int32 block_size; +The number of bytes in each ZLIB compressed data block, except +possibly the last, following decompression. Only @code{0x3ff000} has +been observed so far. + +@item int32 n_blocks; +The number of ZLIB compressed data blocks, always exactly +@code{(ztrailer_len - 24) / 24}. 
+@end table + +The fixed header is followed by @code{n_blocks} 24-byte ZLIB data +block descriptors, each of which describes the compressed data block +corresponding to its offset. Each block descriptor has the following +format: + +@example +int64 uncompressed_ofs; +int64 compressed_ofs; +int32 uncompressed_size; +int32 compressed_size; +@end example + +@table @code +@item int64 uncompressed_ofs; +The offset, in bytes, that this block of data would have in a similar +system file that uses compression format 1. This is +@code{zheader_ofs} in the first block descriptor, and in each +succeeding block descriptor it is the sum of the previous descriptor's +@code{uncompressed_ofs} and @code{uncompressed_size}. + +@item int64 compressed_ofs; +The offset, in bytes, of the actual beginning of this compressed data +block. This is @code{zheader_ofs + 24} in the first block descriptor, +and in each succeeding block descriptor it is the sum of the previous +descriptor's @code{compressed_ofs} and @code{compressed_size}. The +final block descriptor's @code{compressed_ofs} and +@code{compressed_size} sum to @code{ztrailer_ofs}. + +@item int32 uncompressed_size; +The number of bytes in this data block, after decompression. This is +@code{block_size} in every data block except the last, which may be +smaller. + +@item int32 compressed_size; +The number of bytes in this data block, as stored compressed in this +system file. +@end table +@end table + @setfilename ignored
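
The ZLIB data header, trailer, and block descriptors described above carry several redundant offsets that a reader can cross-check. The following is a minimal Python sketch of such a reader, not PSPP's implementation. The function name `read_zlib_data_record`, its parameters, and the default little-endian byte order are assumptions made for illustration; a real reader would take the byte order from the file header record. The block count is derived from the trailer layout (a 24-byte fixed header plus 24 bytes per block).

```python
import struct
import zlib


def read_zlib_data_record(buf, zheader_ofs, endian="<"):
    """Decode a compression-format-2 data record (hypothetical helper).

    buf         -- the entire system file as bytes
    zheader_ofs -- offset of the ZLIB data header within buf
    endian      -- struct byte-order prefix; "<" is an assumption here

    Returns (int_bias, block_size, data), where data is the concatenated
    decompressed blocks, i.e. bytecode-compressed data as in format 1.
    """
    # ZLIB data header: three int64 fields, 24 bytes.
    zh_ofs, zt_ofs, zt_len = struct.unpack_from(endian + "qqq", buf, zheader_ofs)
    if zh_ofs != zheader_ofs:
        raise ValueError("zheader_ofs does not match the header's position")
    if zt_ofs + zt_len != len(buf):
        raise ValueError("ztrailer_ofs + ztrailer_len is not the file size")

    # Trailer fixed header: int64, int64, int32, int32 (24 bytes).
    int_bias, zero, block_size, n_blocks = struct.unpack_from(
        endian + "qqii", buf, zt_ofs)
    if n_blocks != (zt_len - 24) // 24:
        raise ValueError("n_blocks is inconsistent with ztrailer_len")

    out = bytearray()
    expected_u = zheader_ofs        # first uncompressed_ofs
    expected_c = zheader_ofs + 24   # first compressed_ofs
    for i in range(n_blocks):
        # One 24-byte descriptor per compressed block.
        u_ofs, c_ofs, u_size, c_size = struct.unpack_from(
            endian + "qqii", buf, zt_ofs + 24 + 24 * i)
        if (u_ofs, c_ofs) != (expected_u, expected_c):
            raise ValueError("descriptor %d has unexpected offsets" % i)
        block = zlib.decompress(buf[c_ofs:c_ofs + c_size])
        if len(block) != u_size:
            raise ValueError("descriptor %d has wrong uncompressed_size" % i)
        out += block
        expected_u += u_size
        expected_c += c_size

    # The last compressed block must end just before the trailer.
    if expected_c != zt_ofs:
        raise ValueError("compressed blocks do not end at ztrailer_ofs")
    return int_bias, block_size, bytes(out)
```

The returned bytes still need bytecode decompression as described for compression format 1. The sketch ignores the `zero` field and does not require every block except the last to be exactly `block_size` bytes, since only the per-descriptor sizes are needed to decode.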