X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=8315762cef52cb26ea20f1aa2757ca3d6f622554;hb=refs%2Fbuilds%2F20120731001846%2Fpspp;hp=d1d00873f3b3666aa97f8f1e10553a5815d268bb;hpb=b8f75b2ac6bc701ecacaa248d630918d7a7346e2;p=pspp

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index d1d00873f3..8315762cef 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -60,7 +60,8 @@ if present.
 Document record, if present.

 @item
-Any records not explicitly included in this list, in any order.
+Extension (type 7) records, in ascending numerical order of their
+subtypes.

 @item
 Dictionary termination record.
@@ -78,6 +79,7 @@ Each type of record is described separately below.
 * Document Record::
 * Machine Integer Info Record::
 * Machine Floating-Point Info Record::
+* Multiple Response Sets Records::
 * Variable Display Parameter Record::
 * Long Variable Names Record::
 * Very Long String Record::
@@ -113,7 +115,9 @@ char padding[3];

 @table @code
 @item char rec_type[4];
-Record type code, set to @samp{$FL2}.
+Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c
+32} if the file uses an ASCII-based character encoding, or @code{5b c6
+d3 f2} if the file uses an EBCDIC-based character encoding.

 @item char prod_name[60];
 Product identification string. This always begins with the characters
@@ -266,8 +270,10 @@ stops (@samp{.}). The variable name is padded on the right with spaces.

 @item int32 label_len;
 This field is present only if @code{has_var_label} is set to 1. It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label. The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels. PSPP
+accepts longer labels and truncates them to 255 bytes on input.

 @item char label[];
 This field is present only if @code{has_var_label} is set to 1. It has
@@ -387,6 +393,11 @@ Format types are defined as follows:
 @end multitable
 @end quotation

+A few system files have been observed in the wild with invalid
+@code{write} fields, in particular with value 0. Readers should
+probably treat invalid @code{print} or @code{write} fields as some
+default format.
+
 @node Value Labels Records
 @section Value Labels Records

@@ -425,7 +436,9 @@ in length. Its type and width cannot be determined until the
 following value label variables record (see below) is read.

 @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes. The documented maximum length varies
+from 60 to 120 based on SPSS version. PSPP supports value labels up
+to 255 bytes long.

 @item char label[];
 @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -544,14 +557,46 @@ Compression code. Always set to 1.
 Machine endianness. 1 indicates big-endian, 2 indicates little-endian.

 @item int32 character_code;
-@anchor{character-code}
-Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code. The following values have
+been actually observed in system files:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS for Unix and Windows always wrote value 2 in this
+field, regardless of the encoding in use. Newer versions also write
+the character encoding as a string (see @ref{Character Encoding
+Record}).
 @end table

 @node Machine Floating-Point Info Record
@@ -595,6 +640,130 @@ The value used for HIGHEST in missing values.
 The value used for LOWEST in missing values.
 @end table

+@node Multiple Response Sets Records
+@section Multiple Response Sets Records
+
+The system file format has two different types of records that
+represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
+Guide}). The first type of record describes multiple response sets
+that can be understood by SPSS before version 14. The second type of
+record, with a closely related format, is used for multiple dichotomy
+sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
+version 14.
+
+@example
+/* @r{Header.} */
+int32 rec_type;
+int32 subtype;
+int32 size;
+int32 count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char mrsets[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type. Always set to 7.
+
+@item int32 subtype;
+Record subtype. Set to 7 for records that describe multiple response
+sets understood by SPSS before version 14, or to 19 for records that
+describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
+feature added in version 14.
+
+@item int32 size;
+The size of each element in the @code{mrsets} member. Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{mrsets}.
+
+@item char mrsets[];
+A series of multiple response sets, each of which consists of the
+following:
+
+@itemize @bullet
+@item
+The set's name (an identifier that begins with @samp{$}), in mixed
+upper and lower case.
+
+@item
+An equals sign (@samp{=}).
+
+@item
+@samp{C} for a multiple category set, @samp{D} for a multiple
+dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
+multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
+
+@item
+For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
+space, followed by a number expressed as decimal digits, followed by a
+space. If LABELSOURCE=VARLABEL was specified on MRSETS, then the
+number is 11; otherwise it is 1.@footnote{This part of the format may
+not be fully understood, because only a single example of each
+possibility has been examined.}
+
+@item
+For either kind of multiple dichotomy set, the counted value, as a
+positive integer count specified as decimal digits, followed by a
+space, followed by as many string bytes as specified in the count. If
+the set contains numeric variables, the string consists of the counted
+integer value expressed as decimal digits. If the set contains string
+variables, the string contains the counted string value. Either way,
+the string may be padded on the right with spaces (older versions of
+SPSS seem to always pad to a width of 8 bytes; newer versions don't).
+
+@item
+A space.
+
+@item
+The multiple response set's label, using the same format as for the
+counted value for multiple dichotomy sets. A string of length 0 means
+that the set does not have a label. A string of length 0 is also
+written if LABELSOURCE=VARLABEL was specified.
+
+@item
+A space.
+
+@item
+The short names of the variables in the set, converted to lowercase,
+each separated from the previous by a single space.
+
+@item
+A line feed (byte 0x0a).
+@end itemize
+@end table
+
+Example: Given appropriate variable definitions, consider the
+following MRSETS command:
+
+@example
+MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
+       /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
+       /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
+       /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
+        VARIABLES=k l m VALUE=34
+       /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
+        VARIABLES=n o p VALUE='choice'.
+@end example
+
+The above would generate the following multiple response set record of
+subtype 7:
+
+@example
+$a=C 10 my mcgroup a b c
+$b=D2 55 0  g e f d
+$c=D3 Yes 10 mdgroup #2 h i j
+@end example
+
+It would also generate the following multiple response set record with
+subtype 19:
+
+@example
+$d=E 1 2 34 13 third mdgroup k l m
+$e=E 11 6 choice 0  n o p
+@end example
+
 @node Variable Display Parameter Record
 @section Variable Display Parameter Record

@@ -646,8 +815,8 @@ Ordinal Scale
 Continuous Scale
 @end table

-SPSS 14 sometimes writes a @code{measure} of 0. PSPP interprets this
-as nominal scale.
+SPSS 14 sometimes writes a @code{measure} of 0 for string variables.
+PSPP interprets this as nominal scale.

 @item int32 width;
 The width of the display column for the variable in characters.
@@ -827,12 +996,29 @@ The size of each element in the @code{encoding} member. Always set to 1.
 The total number of bytes in @code{encoding}.

 @item char encoding[];
-The name of the character encoding. Normally this will be an official IANA characterset name or alias.
+The name of the character encoding. Normally this will be an official
+IANA character set name or alias.
 See @url{http://www.iana.org/assignments/character-sets}.
+Character set names are not case-sensitive, but SPSS appears to write
+them in all-uppercase.
 @end table

-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software. See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.

 @node Long String Value Labels Record
 @section Long String Value Labels Record
@@ -959,7 +1145,8 @@ element.
 In record type 18, this field contains a sequence of one or more
 variable attribute sets. If more than one variable attribute set is
 present, each one after the first is delimited from the previous by
-@code{/}. Each variable attribute set consists of a variable name,
+@code{/}. Each variable attribute set consists of a long
+variable name,
 followed by @code{:}, followed by an attribute set with the same
 syntax as on record type 17.
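
Illustration (not part of the patch): the multiple response sets record layout added above can be read with a small cursor-based parser. The following is a minimal Python sketch under stated assumptions, not PSPP's implementation. The names parse_mrsets and Cursor, the dictionary keys, the default encoding, and the decision to strip trailing space padding from the counted value are all hypothetical choices made for this example. It follows the itemized field list literally, so a set with a zero-length label has two consecutive spaces before its variable names.

```python
class Cursor:
    """Tracks a read position inside the raw bytes of the mrsets member."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def at_end(self) -> bool:
        return self.pos >= len(self.data)

    def expect(self, byte: bytes) -> None:
        """Consume exactly one given byte or fail."""
        if self.data[self.pos:self.pos + 1] != byte:
            raise ValueError(f"expected {byte!r} at offset {self.pos}")
        self.pos += 1

    def take_until(self, byte: bytes) -> bytes:
        """Return bytes up to the next delimiter and skip the delimiter."""
        end = self.data.index(byte, self.pos)  # raises ValueError if absent
        chunk = self.data[self.pos:end]
        self.pos = end + 1
        return chunk

    def integer(self) -> int:
        """Read a run of ASCII decimal digits."""
        start = self.pos
        while self.data[self.pos:self.pos + 1].isdigit():
            self.pos += 1
        if start == self.pos:
            raise ValueError(f"expected digits at offset {start}")
        return int(self.data[start:self.pos].decode("ascii"))

    def counted_string(self) -> bytes:
        """Read '<count> <count bytes>' as used for values and labels."""
        n = self.integer()
        self.expect(b" ")
        chunk = self.data[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("counted string runs past end of record")
        self.pos += n
        return chunk


def parse_mrsets(data: bytes, encoding: str = "utf-8") -> list:
    """Parse the mrsets bytes of a subtype 7 or subtype 19 record."""
    cur = Cursor(data)
    sets = []
    while not cur.at_end():
        entry = {"name": cur.take_until(b"=").decode(encoding),
                 "label_source_code": None, "counted_value": None}
        kind = cur.data[cur.pos:cur.pos + 1].decode("ascii")
        cur.pos += 1
        if kind not in ("C", "D", "E"):
            raise ValueError(f"unknown set type {kind!r}")
        entry["type"] = kind
        if kind == "E":
            # Number documented as 11 with LABELSOURCE=VARLABEL, else 1.
            cur.expect(b" ")
            entry["label_source_code"] = cur.integer()
            cur.expect(b" ")
        if kind in ("D", "E"):
            # Assumption: right-hand space padding is not significant.
            value = cur.counted_string().rstrip(b" ")
            entry["counted_value"] = value.decode(encoding)
        cur.expect(b" ")
        entry["label"] = cur.counted_string().decode(encoding)
        cur.expect(b" ")
        entry["variables"] = cur.take_until(b"\n").decode(encoding).split(" ")
        sets.append(entry)
    return sets
```

Byte counts are applied to the raw bytes before any decoding, because the counts in the record are byte counts rather than character counts. A real reader would also need to pick the decoding from the file's character encoding information rather than assume UTF-8.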