X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdev%2Fsystem-file-format.texi;h=b0b69dd4cae54f0922bd497232811ff1bd027c1f;hb=df291f0446ae11b5a3c788a06ee73eb9ac877ca5;hp=1309cb79195d7bcb04705b624e70c90abc251654;hpb=c5ad65b0351ab1d897eb072eeaec06fb37802b01;p=pspp

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index 1309cb7919..b0b69dd4ca 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -60,7 +60,8 @@ if present.
 Document record, if present.
 
 @item
-Any records not explicitly included in this list, in any order.
+Extension (type 7) records, in ascending numerical order of their
+subtypes.
 
 @item
 Dictionary termination record.
@@ -267,8 +268,10 @@ stops (@samp{.}).  The variable name is padded on the right with spaces.
 
 @item int32 label_len;
 This field is present only if @code{has_var_label} is set to 1.  It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label.  The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels.  PSPP
+accepts longer labels and truncates them to 255 bytes on input.
 
 @item char label[];
 This field is present only if @code{has_var_label} is set to 1.  It has
@@ -426,7 +429,9 @@ in length.  Its type and width cannot be determined until the
 following value label variables record (see below) is read.
 
 @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes.  The documented maximum length varies
+from 60 to 120 based on SPSS version.  PSPP supports value labels up
+to 255 bytes long.
 
 @item char label[];
 @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -545,14 +550,45 @@ Compression code.  Always set to 1.
 Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
 
 @item int32 character_code;
-@anchor{character-code}
-Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code.  The following values have
+been actually observed in system files:
+
+@table @asis
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS always wrote value 2 in this field, regardless of
+the encoding in use.  Newer versions also write the character encoding
+as a string (see @ref{Character Encoding Record}).
 @end table
 
 @node Machine Floating-Point Info Record
@@ -770,8 +806,8 @@ Ordinal Scale
 Continuous Scale
 @end table
 
-SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
-as nominal scale.
+SPSS 14 sometimes writes a @code{measure} of 0 for string variables.
+PSPP interprets this as nominal scale.
 
 @item int32 width;
 The width of the display column for the variable in characters.
@@ -955,8 +991,22 @@ The name of the character encoding.  Normally this will be an official IANA char
 See @url{http://www.iana.org/assignments/character-sets}.
 @end table
 
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software.  See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
 
 @node Long String Value Labels Record
 @section Long String Value Labels Record
@@ -1083,7 +1133,8 @@ element.
 In record type 18, this field contains a sequence of one or more
 variable attribute sets.  If more than one variable attribute set is
 present, each one after the first is delimited from the previous by
-@code{/}.  Each variable attribute set consists of a variable name,
+@code{/}.  Each variable attribute set consists of a (potentially
+long) variable name,
 followed by @code{:}, followed by an attribute set with the same
 syntax as on record type 17.