doc: Update description of character encoding information in system files.

[pspp-builds.git] / doc / dev / system-file-format.texi
diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi

index d1d00873f3b3666aa97f8f1e10553a5815d268bb..3f82a4628e0918e1797e6c23a0670e42e0a5e8dc 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -78,6 +78,7 @@ Each type of record is described separately below.
  * Document Record::
  * Machine Integer Info Record::
  * Machine Floating-Point Info Record::
+* Multiple Response Sets Records::
  * Variable Display Parameter Record::
  * Long Variable Names Record::
  * Very Long String Record::
@@ -266,8 +267,10 @@ stops (@samp{.}).  The variable name is padded on the right with spaces.
  
  @item int32 label_len;
  This field is present only if @code{has_var_label} is set to 1.  It is
-set to the length, in characters, of the variable label, which must be a
-number between 0 and 120.
+set to the length, in characters, of the variable label.  The
+documented maximum length varies from 120 to 255 based on SPSS
+version, but some files have been seen with longer labels.  PSPP
+accepts longer labels and truncates them to 255 bytes on input.
  
  @item char label[];
  This field is present only if @code{has_var_label} is set to 1.  It has
@@ -425,7 +428,9 @@ in length.  Its type and width cannot be determined until the
  following value label variables record (see below) is read.
  
  @item char label_len;
-The label's length, in bytes.
+The label's length, in bytes.  The documented maximum length varies
+from 60 to 120 based on SPSS version.  PSPP supports value labels up
+to 255 bytes long.
  
  @item char label[];
  @code{label_len} bytes of the actual label, followed by up to 7 bytes
@@ -544,14 +549,45 @@ Compression code.  Always set to 1.
  Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
  
  @item int32 character_code;
-@anchor{character-code}
-Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code.  The following values have
+actually been observed in system files:
+
+@table @asis
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS always wrote value 2 in this field, regardless of
+the encoding in use.  Newer versions also write the character encoding
+as a string (see @ref{Character Encoding Record}).
  @end table
  
  @node Machine Floating-Point Info Record
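
Reviewer note on the hunk above: the new `character_code` tables could be turned into a lookup along the following lines. This is only a minimal sketch in Python, not PSPP's implementation. The function name `codec_for_character_code` and the table `OBSERVED_CODECS` are hypothetical, and the codec names are Python's, not names taken from the patch. Values 1, 3, and 4 are left unmapped because the text does not say which EBCDIC, 8-bit, or Kanji variant they denote.

```python
import codecs

# Values the patch lists as actually observed, mapped to Python codec names
# (the Python names are an assumption of this sketch).
OBSERVED_CODECS = {
    2: "ascii",          # 7-bit ASCII; old SPSS wrote 2 regardless of encoding
    1250: "cp1250",      # windows-1250
    1252: "cp1252",      # windows-1252
    28591: "iso-8859-1", # ISO 8859-1
    65001: "utf-8",      # UTF-8
}

# Defined values whose exact encoding the text does not pin down.
UNSPECIFIED = {1, 3, 4}  # EBCDIC, 8-bit "ASCII", DEC Kanji


def codec_for_character_code(code):
    """Return a Python codec name for character_code, or None if unknown."""
    if code in UNSPECIFIED or code <= 0:
        return None
    # Fall back to treating the value as a Windows code page number.
    name = OBSERVED_CODECS.get(code, "cp%d" % code)
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None
```

Because old SPSS versions always wrote 2, a reader would probably treat a result of `ascii` as a weak hint and prefer the character encoding record when one is present.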
@@ -595,6 +631,129 @@ The value used for HIGHEST in missing values.
  The value used for LOWEST in missing values.
  @end table
  
+@node Multiple Response Sets Records
+@section Multiple Response Sets Records
+
+The system file format has two different types of records that
+represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
+Guide}).  The first type of record describes multiple response sets
+that can be understood by SPSS before version 14.  The second type of
+record, with a closely related format, is used for multiple dichotomy
+sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
+version 14.
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                mrsets[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 7.
+
+@item int32 subtype;
+Record subtype.  Set to 7 for records that describe multiple response
+sets understood by SPSS before version 14, or to 19 for records that
+describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
+feature added in version 14.
+
+@item int32 size;
+The size of each element in the @code{mrsets} member.  Always set to 1.
+
+@item int32 count;
+The total number of bytes in @code{mrsets}.
+
+@item char mrsets[];
+A series of multiple response sets, each of which consists of the
+following:
+
+@itemize @bullet
+@item
+The set's name (an identifier that begins with @samp{$}).
+
+@item
+An equals sign (@samp{=}).
+
+@item
+@samp{C} for a multiple category set, @samp{D} for a multiple
+dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
+multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
+
+@item
+For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
+space, followed by a number expressed as decimal digits, followed by a
+space.  If LABELSOURCE=VARLABEL was specified on MRSETS, then the
+number is 11; otherwise it is 1.@footnote{This part of the format may
+not be fully understood, because only a single example of each
+possibility has been examined.}
+
+@item
+For either kind of multiple dichotomy set, the counted value, as a
+positive integer count specified as decimal digits, followed by a
+space, followed by as many string bytes as specified in the count.  If
+the set contains numeric variables, the string consists of the counted
+integer value expressed as decimal digits.  If the set contains string
+variables, the string contains the counted string value.  Either way,
+the string may be padded on the right with spaces (older versions of
+SPSS seem to always pad to a width of 8 bytes; newer versions don't).
+
+@item
+A space.
+
+@item
+The multiple response set's label, using the same format as for the
+counted value for multiple dichotomy sets.  A string of length 0 means
+that the set does not have a label.  A string of length 0 is also
+written if LABELSOURCE=VARLABEL was specified.
+
+@item
+A space.
+
+@item
+The names of the variables in the set, each separated from the
+previous by a single space.
+
+@item
+A line feed (byte 0x0a).
+@end itemize
+@end table
+
+Example: Given appropriate variable definitions, consider the
+following MRSETS command:
+
+@example
+MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
+       /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
+       /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
+       /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
+        VARIABLES=k l m VALUE=34
+       /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
+        VARIABLES=n o p VALUE='choice'.
+@end example
+
+The above would generate the following multiple response set record of
+subtype 7:
+
+@example
+$a=C 10 my mcgroup a b c
+$b=D2 55 0  g e f d
+$c=D3 Yes 10 mdgroup #2 h i j
+@end example
+
+It would also generate the following multiple response set record with
+subtype 19:
+
+@example
+$d=E 1 2 34 13 third mdgroup k l m
+$e=E 11 6 choice 0  n o p
+@end example
+
  @node Variable Display Parameter Record
  @section Variable Display Parameter Record
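
Reviewer note on the hunk above: the `mrsets` layout described in the new section can be checked against its own examples with a small parser. The following Python sketch is hedged accordingly. The names `parse_counted` and `parse_mrsets` and the dictionary keys are hypothetical. It assumes the record body has already been extracted as bytes, that every set ends with a line feed (or the end of the data), and that variable names contain no spaces. Counted values are returned as stored, including any right padding.

```python
def parse_counted(data, pos):
    """Parse '<decimal count> <count bytes>' at pos; return (bytes, new_pos)."""
    end = data.index(b" ", pos)
    count = int(data[pos:end])
    start = end + 1
    return data[start:start + count], start + count


def parse_mrsets(data):
    """Parse the mrsets bytes of a subtype 7 or 19 record into a list of dicts."""
    sets = []
    pos = 0
    while pos < len(data):
        eq = data.index(b"=", pos)
        name = data[pos:eq]
        kind = data[eq + 1:eq + 2]          # b"C", b"D", or b"E"
        pos = eq + 2
        flag = None
        counted = None
        if kind == b"E":
            # Space, decimal number (1 or 11 in the examples seen), space.
            end = data.index(b" ", pos + 1)
            flag = int(data[pos + 1:end])
            pos = end + 1
        if kind in (b"D", b"E"):
            counted, pos = parse_counted(data, pos)
        # Skip the separating space, then read the length-prefixed label.
        label, pos = parse_counted(data, pos + 1)
        end = data.find(b"\n", pos)
        if end < 0:
            end = len(data)
        variables = data[pos:end].split()   # drops the leading space
        pos = end + 1
        sets.append({"name": name, "type": kind, "flag": flag,
                     "counted": counted, "label": label,
                     "variables": variables})
    return sets
```

Run on the subtype 7 example from the patch, this yields three sets named `$a`, `$b`, and `$c`, with `$b` having an empty label and the counted value `55`. A caller that wants the logical counted value would strip trailing spaces itself, since older SPSS versions are said to pad it.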
  
@@ -646,8 +805,8 @@ Ordinal Scale
  Continuous Scale
  @end table
  
-SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
-as nominal scale.
+SPSS 14 sometimes writes a @code{measure} of 0 for string variables.
+PSPP interprets this as nominal scale.
  
  @item int32 width;
  The width of the display column for the variable in characters.
@@ -831,8 +990,22 @@ The name of the character encoding.  Normally this will be an official IANA char
  See @url{http://www.iana.org/assignments/character-sets}.
  @end table
  
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software.  See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
  
  @node Long String Value Labels Record
  @section Long String Value Labels Record
@@ -959,7 +1132,8 @@ element.
  In record type 18, this field contains a sequence of one or more
  variable attribute sets.  If more than one variable attribute set is
  present, each one after the first is delimited from the previous by
-@code{/}.  Each variable attribute set consists of a variable name,
+@code{/}.  Each variable attribute set consists of a (potentially
+long) variable name,
  followed by @code{:}, followed by an attribute set with the same
  syntax as on record type 17.