X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=b0b69dd4cae54f0922bd497232811ff1bd027c1f;hb=df291f0446ae11b5a3c788a06ee73eb9ac877ca5;hp=1309cb79195d7bcb04705b624e70c90abc251654;hpb=c5ad65b0351ab1d897eb072eeaec06fb37802b01;p=pspp

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index 1309cb7919..b0b69dd4ca 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -60,7 +60,8 @@ if present.
 Document record, if present.
 
 @item
-Any records not explicitly included in this list, in any order.
+Extension (type 7) records, in ascending numerical order of their
+subtypes.
 
 @item
 Dictionary termination record.
@@ -267,8 +268,10 @@ stops (@samp{.}). The variable name is padded on the right with spaces.
 
 @item int32 label_len;
 This field is present only if @code{has_var_label} is set to 1. It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label. The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels. PSPP
+accepts longer labels and truncates them to 255 bytes on input.
 
 @item char label[];
 This field is present only if @code{has_var_label} is set to 1. It has
@@ -426,7 +429,9 @@ in length. Its type and width cannot be determined until the
 following value label variables record (see below) is read.
 
 @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes. The documented maximum length varies
+from 60 to 120 based on SPSS version. PSPP supports value labels up
+to 255 bytes long.
 
 @item char label[];
 @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -545,14 +550,45 @@ Compression code. Always set to 1.
 Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
 
 @item int32 character_code;
-@anchor{character-code}
-Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code. The following values have
+been actually observed in system files:
+
+@table @asis
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS always wrote value 2 in this field, regardless of
+the encoding in use. Newer versions also write the character encoding
+as a string (see @ref{Character Encoding Record}).
 @end table
 
 @node Machine Floating-Point Info Record
@@ -770,8 +806,8 @@ Ordinal Scale
 Continuous Scale
 @end table
 
-SPSS 14 sometimes writes a @code{measure} of 0. PSPP interprets this
-as nominal scale.
+SPSS 14 sometimes writes a @code{measure} of 0 for string variables.
+PSPP interprets this as nominal scale.
 
 @item int32 width;
 The width of the display column for the variable in characters.
@@ -955,8 +991,22 @@ The name of the character encoding. Normally this will be an official IANA char
 See @url{http://www.iana.org/assignments/character-sets}.
 @end table
 
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software. See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
 
 @node Long String Value Labels Record
 @section Long String Value Labels Record
@@ -1083,7 +1133,8 @@ element.
 In record type 18, this field contains a sequence of one or more
 variable attribute sets. If more than one variable attribute set is
 present, each one after the first is delimited from the previous by
-@code{/}. Each variable attribute set consists of a variable name,
+@code{/}. Each variable attribute set consists of a (potentially
+long) variable name,
 followed by @code{:}, followed by an attribute set with the same syntax
 as on record type 17.
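
Reader's note, not part of the patch: the sketch below illustrates how a reader of system files might act on the `character_code` values and the `character_code` versus `encoding` behavior that the patched text describes. It is a rough illustration under stated assumptions, not PSPP's implementation. The function names are hypothetical, the right-hand codec names are Python codec names chosen for this sketch, and the before/after split follows the patch's own tentative wording ("it seems that") about crafted test files.

```python
import codecs

# character_code values the patched text lists as observed in system files.
# The codec names are Python's, picked for this sketch (an assumption).
OBSERVED_CODES = {
    2: "ascii",          # 7-bit ASCII
    1250: "cp1250",      # windows-1250
    1252: "cp1252",      # windows-1252
    28591: "iso8859-1",  # ISO 8859-1
    65001: "utf-8",      # UTF-8
}


def codec_for_character_code(code):
    """Return a Python codec name for a character_code value, or None.

    Hypothetical helper. Values 1 (EBCDIC), 3 (8-bit "ASCII") and
    4 (DEC Kanji) are defined but do not identify one specific code page,
    so this sketch declines to guess and returns None for them.
    """
    if code in OBSERVED_CODES:
        return OBSERVED_CODES[code]
    if code in (1, 3, 4):
        return None
    # Other Windows code page numbers are "generally valid"; try cpNNNN.
    name = "cp%d" % code
    try:
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def codec_for_string(character_code, encoding, after_dict_termination):
    """Choose a codec for one string in a system file.

    Follows the tentative observation in the patch: character_code governs
    strings before the dictionary termination record, and the encoding
    record governs strings after it. `encoding` is None when the file has
    no character encoding record. Note that old SPSS versions always wrote
    2 in character_code, so an "ascii" result may not reflect the real
    encoding of such files.
    """
    if after_dict_termination and encoding is not None:
        return codecs.lookup(encoding).name
    return codec_for_character_code(character_code)
```

In the files observed in practice the two fields agree, so both branches give the same answer; the split only matters for deliberately crafted files.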