Finish description of legacy ("heavy") binary format.

diff --git a/spv-file-format.texi b/spv-file-format.texi

index 82367ddef40185f38a5d6a6ff1fc017cdc5d72d3..a2b07aa711bdc839a26b94f6c81be2e798e098e4 100644
--- a/spv-file-format.texi
+++ b/spv-file-format.texi
@@ -9,7 +9,7 @@ write them.
  
  An an aside, SPSS 15 and earlier versions use a completely different
  output format based on the Microsoft Compound Document Format.  This
-format is not documented.
+format is not documented here.
  
  An SPV file is a Zip archive that can be read with @command{zipinfo}
  and @command{unzip} and similar programs.  The final member in the Zip
@@ -59,7 +59,7 @@ The structure of a chart plus its data.  Charts do not have a
  @itemx @var{prefix}_pmml.scf
  Not yet investigated.  The corpus contains only one example of each.
  
-@itemx @var{prefix}_stats.xml
+@item @var{prefix}_stats.xml
  Not yet investigated.  The corpus contains few examples.
  @end table
  
@@ -174,7 +174,7 @@ describes what it labels, often by naming the statistical procedure
  that was executed, e.g.@: ``Frequencies'' or ``T-Test''.  Labels are
  often very generic, especially within a @code{container}, e.g.@:
  ``Title'' or ``Warnings'' or ``Notes''.  Label text is localized
-according to the output language, e.g. in Italian a frequency table
+according to the output language, e.g.@: in Italian a frequency table
  procedure is labeled ``Frequenze''.
  
  The corpus contains one example of an empty label, one that contains
@@ -376,10 +376,10 @@ table-id := int
  
  @code{header} includes @code{version}, a version number that affects
  the interpretation of some of the other data in the member.  We will
-refer to ``version 1'' and ``version 3'' members later on.  It also
-@code{table-id} is a binary version of @code{tableId} attribute in the
-structure member that refers to the detail member.  For example, if
-@code{tableId} is @code{-4154297861994971133}, then @code{table-id}
+refer to ``version 1'' and ``version 3'' members later on.
+@code{table-id} is a binary version of the @code{tableId} attribute in
+the structure member that refers to the detail member.  For example,
+if @code{tableId} is @code{-4154297861994971133}, then @code{table-id}
  would be 0xdca00003.  The meaning of the other variable parts of the
  header is not known.
  
@@ -621,32 +621,18 @@ index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}.
  
  @example
  value := 00? 00? 00? 00? raw-value
-raw-value := 01 opt-value int32[format] double[x]
-           | 02 opt-value int32[format] double[x]
-             string[varname] string[vallab] (01 | 02 | 03)
-           | 03 string[local] opt-value string[id] string[c] (00 | 01)[type]
-           | 04 opt-value int32[format] string[vallab] string[varname]
-             (01 | 02 | 03) string[vallab]
-           | 05 opt-value string[varname] string[varlabel] (01 | 02 | 03)
-           | opt-value string[format] int32[n-substs] substitution*[n-substs]
-substitution := i0 value
-              | int32[x] value*[x + 1]      /* @r{x > 0} */
-opt-value := 31 i0 (i0 | i1 string) opt-value-i0-v1        /* @r{version 1} */
-           | 31 i0 (i0 | i1 string) opt-value-i0-v3        /* @r{version 3} */
-           | 31 i1 int32[footnote-number] nested-string
-           | 31 i2 (00 | 01 | 02) 00 (i1 | i2 | i3) nested-string
-           | 31 i3 00 00 01 00 i2 nested-string
-           | 58
-opt-value-i0-v1 := 00 (i1 | i2) 00 00 int32 00 00
-opt-value-i0-v3 := count(counted-string
-                         (58 | 31 style)
-                         (58
-                          | 31 i0 i0 i0 i0 01 00 (01 | 02 | 08)
-                            00 08 00 0a 00))
-
-style := 01? 00? 00? 00? 01 string[fgcolor] string[bgcolor] string[font] byte
-nested-string := 00 00 count(counted-string (58 | 31 style) 58)
-counted-string := count((i0 (58 | 31 string))?)
+raw-value :=
+    01 value-mod int[format] double[x]
+  | 02 value-mod int[format] double[x]
+    string[varname] string[vallab] (01 | 02 | 03)
+  | 03 string[local] value-mod string[id] string[c] (00 | 01)[type]
+  | 04 value-mod int[format] string[vallab] string[varname]
+    (01 | 02 | 03) string[s]
+  | 05 value-mod string[varname] string[varlabel] (01 | 02 | 03)
+  | value-mod string[format] int[n-args] arg*[n-args]
+arg :=
+    i0 value
+  | int[x] i0 value*[x + 1]      /* @r{x > 0} */
  @end example
  
  A @code{value} boils down to a number or a string.  There are several
@@ -675,14 +661,16 @@ The meaning of the final byte is unknown.  Possibly it is connected to
  whether the value or the label should be displayed.
  
  @item 03
-A text string that originates from the software program (rather than
-from user data).  The string is provided in two forms: @code{c} is in
-English and @code{local} is localized to the user's language
-environment.  In an English-language locale, the two strings are often
-the same, and in cases where they differ @code{c} is often abbreviated
-or obscure and @code{local} is more appropriate for a user interface,
-e.g.@: @code{c} of ``Not a PxP table for MCN...'' versus @code{local}
-of ``Computed only for a PxP table, where P must be greater than 1.''
+A text string, in two forms: @code{c} is in English, and sometimes
+abbreviated or obscure, and @code{local} is localized to the user's
+locale.  In an English-language locale, the two strings are often the
+same, and in the cases where they differ, @code{local} is more
+appropriate for a user interface, e.g.@: @code{c} of ``Not a PxP table
+for MCN...'' versus @code{local} of ``Computed only for a PxP table,
+where P must be greater than 1.''
+
+@code{c} and @code{local} are always either both empty or both
+nonempty.
  
  @code{id} is a brief identifying string whose form seems to resemble a
  programming language identifier, e.g.@: @code{cumulative_percent} or
@@ -690,6 +678,242 @@ programming language identifier, e.g.@: @code{cumulative_percent} or
  
  @code{type} is 00 for text taken from user input, such as syntax
  fragment, expressions, file names, data set names, and 01 for fixed
-text strings such as names of procedures or statistics.
+text strings such as names of procedures or statistics.  In the former
+case, @code{id} is always the empty string; in the latter case,
+@code{id} is still sometimes empty.
  
  @item 04
+The string value @code{s}, presented to the user formatted according
+to @code{format}.  The format for a string is not too interesting, and
+clearly invalid formats like A16.39 or A255.127 or A134.1 abound in
+the corpus, so readers should probably ignore the format entirely.
+
+@code{s} is a value of variable @code{varname} and has value label
+@code{vallab}.  @code{varname} is never empty but @code{vallab} is
+commonly empty.
+
+The meaning of the final byte is unknown.
+
+@item 05
+Variable @code{varname}, which is rarely observed as empty in the
+corpus, with variable label @code{varlabel}, which is often empty.
+
+The meaning of the final byte is unknown.
+
+@item 31
+@itemx 58
+(These bytes begin a @code{value-mod}.)  A format string, analogous to
+@code{printf}, followed by one or more arguments, each of which has
+one or more values.  The format string uses the following syntax:
+
+@table @code
+@item \%
+@item \:
+@item \[
+@item \]
+Each of these expands to the character following @samp{\\}.  This is
+useful to escape characters that have special meaning in format
+strings.  These are effective inside and outside the @code{[@dots{}]}
+syntax forms described below.
+
+@item \n
+Expands to a new-line, inside or outside the @code{[@dots{}]} forms
+described below.
+
+@item ^@var{i}
+Expands to a formatted version of argument @var{i}, which must have
+only a single value.  For example, @code{^1} would expand to the first
+argument's @code{value}.
+
+@item [:@var{a}:]@var{i}
+Expands @var{a} for each of the @code{value}s in @var{i}.  @var{a}
+should contain one or more @code{^@var{j}} conversions, which are
+drawn from the values for argument @var{i} in order.  Some examples
+from the corpus:
+
+@table @code
+@item [:^1:]1
+All of the values for the first argument, concatenated.
+
+@item [:^1\n:]1
+Expands to the values for the first argument, each followed by
+a new-line.
+
+@item [:^1 = ^2:]2
+Expands to @code{@var{x} = @var{y}} where @var{x} is the second
+argument's first value and @var{y} is its second value.  (This would
+be used only if the argument has two values.  With additional values,
+the second and third values would be directly concatenated, which
+would look funny.)
+@end table
+
+@item [@var{a}:@var{b}:]@var{i}
+This extends the previous form so that the first values are expanded
+using @var{a} and later values are expanded using @var{b}.  For an
+unknown reason, within @var{a} the @code{^@var{j}} conversions are
+instead written as @code{%@var{j}}.  Some examples from the corpus:
+
+@table @code
+@item [%1:*^1:]1
+Expands to all of the values for the first argument, separated by
+@samp{*}.
+
+@item [%1 = %2:, ^1 = ^2:]1
+Given appropriate values for the first argument, expands to @code{X =
+1, Y = 2, Z = 3}.
+
+@item [%1:, ^1:]1
+Given appropriate values, expands to @code{1, 2, 3}.
+@end table
+@end table
+
+The format string is localized to the user's locale.
+@end table
+
+@example
+value-mod :=
+    31 i0 (i0 | i1 string[subscript]) value-mod-i0-v1 /* @r{version 1} */
+  | 31 i0 (i0 | i1 string[subscript]) value-mod-i0-v3 /* @r{version 3} */
+  | 31 i1 int[footnote-number] format
+  | 31 i2 (00 | 01 | 02) 00 (i1 | i2 | i3) format
+  | 31 i3 00 00 01 00 i2 format
+  | 58
+value-mod-i0-v1 := 00 (i1 | i2) 00 00 int 00 00
+value-mod-i0-v3 := count(format-string
+                         (58 | 31 style)
+                         (58
+                          | 31 i0 i0 i0 i0 01 00 (01 | 02 | 08)
+                            00 08 00 0a 00))
+
+style := 01? 00? 00? 00? 01 string[fgcolor] string[bgcolor] string[font] byte
+format := 00 00 count(format-string (58 | 31 style) 58)
+format-string := count((i0 (58 | 31 string))?)
+@end example
+
+A @code{value-mod} can specify special modifications to a @code{value}:
+
+@itemize @bullet
+@item
+The @code{footnote-number}, if present, specifies a footnote that the
+@code{value} references.  The footnote's marker is shown appended to
+the main text of the @code{value}, as a superscript.
+
+@item
+The @code{subscript}, if present, specifies a string to append to the
+main text of the @code{value}, as a subscript.  The subscript text is
+normally a brief indicator, e.g.@: @samp{a} or @samp{a,b}, with its
+meaning indicated by the table caption.  In this usage, subscripts are
+similar to footnotes; one apparent difference is that a @code{value}
+can only reference one footnote but a subscript can list more than one
+letter.
+
+@item
+The @code{format}, if present, is a format string for substitutions
+using the syntax explained previously.  It appears to be an
+English-language version of the localized format string in the
+@code{value} in which the @code{format} is nested.
+
+@item
+The @code{style}, if present, changes the style for this individual
+@code{value}.
+@end itemize
+
+@node SPV Legacy Detail Member Binary Format
+@subsection SPV Legacy Detail Member Binary Format
+
+Whereas the light binary format represents everything about a given
+pivot table, the legacy binary format conceptually consists of a
+number of named sources, each of which consists of a number of named
+series, each of which is a 1-dimensional array of numbers or strings
+or a mix.  Thus, the legacy binary file format is quite simple.
+
+@example
+legacy-binary := 00 byte[version] int16[n-sources] int[file-size]
+                 metadata*[n-sources] data*[n-sources]
+@end example
+
+@code{version} is a version number that affects the interpretation of
+some of the other data in the member.  Versions 0xaf and 0xb0 are
+known.  We will refer to ``version 0xaf'' and ``version 0xb0'' members
+later on.
+
+A legacy member consists of @code{n-sources} data sources, each of
+which has @code{metadata} and @code{data}.
+
+@code{file-size} is the size of the file, in bytes.
+
+@example
+/* @r{version 0xaf} */
+metadata := int[per-series] int[n-series] int[ofs] byte*32[source-name]
+
+/* @r{version 0xb0} */
+metadata := int[per-series] int[n-series] int[ofs] byte*64[source-name] int[x]
+@end example
+
+A data source consists of @code{n-series} series of data, with
+@code{per-series} data values per series.
+
+@code{source-name} is a 32- or 64-byte string padded on the right with
+zero bytes.  The names that appear in the corpus are very generic,
+usually @code{tableData} or @code{source0}.
+
+The @code{ofs} is the offset, in bytes, from the beginning of the file
+to the start of this data source's @code{data}.  This allows programs
+to skip to the beginning of the data for a particular source; it is
+also important to determine whether a source includes any string data
+(see below).
+
+The meaning of @code{x} in version 0xb0 is unknown.
+
+@example
+data := numeric-data string-data?
+numeric-data := numeric-series*[n-series]
+numeric-series := byte*288[series-name] double*[per-series]
+@end example
+
+Data follow the metadata in the legacy binary format, with sources in
+the same order.  Each series begins with a @code{series-name}, which
+generally indicates its role in the pivot table, e.g.@: ``cell'',
+``cellFormat'', ``dimension0categories'', ``dimension0group0''.  The
+name is followed by the data, one double per element in the series.  A
+double equal to @code{-DBL_MAX}, the most negative finite double,
+represents the system-missing value SYSMIS.
+
+@example
+string-data := i1 string[source-name] pairs labels
+
+pairs := int[n-string-series] pair-series*[n-string-series]
+pair-series := string[pair-series-name] int[n-pairs] pair*[n-pairs]
+pair := int[i] int[j]
+
+labels := int[n-labels] label*[n-labels]
+label := int[frequency] string[s]
+@end example
+
+A source may include a mix of numeric and string data values.  When a
+source includes any string data, the data values that are strings are
+set to SYSMIS in the @code{numeric-series}, and @code{string-data}
+follows the @code{numeric-data}.  To reliably determine whether a
+source includes @code{string-data}, the reader should check whether
+the offset following the @code{numeric-data} is the offset of the next
+source, as indicated by its @code{metadata} (or end of file, in the
+case of the last source in a file).
+
+@code{string-data} repeats the name of the source.
+
+The string data overlays the numeric data.  @code{n-string-series} is
+the number of series within the source that include string data.  More
+precisely, it is the 1-based index of the last series in the source
+that includes any string data; thus, it would be 4 if there are 5
+series and only the fourth one includes string data.
+
+Each @code{pair-series} consists of a sequence of 0 or more pairs, each
+of which maps from a 0-based index within the series @code{i} to a
+0-based label index @code{j}.  The pair @code{i} = 2, @code{j} = 3,
+for example, would mean that the third data value (with value SYSMIS)
+is to be replaced by the string of the fourth label.
+
+The labels themselves follow the pairs.  The valuable part of each
+label is the string @code{s}.  Each label also includes a
+@code{frequency} that reports the number of pairs that reference it
+(although this is not useful).
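
The legacy binary layout added by this patch can be sketched as a small reader. This is only an illustration under stated assumptions, not PSPP's implementation: it assumes all integers and doubles are little-endian and that `int` means a 32-bit integer, neither of which this excerpt states. The names `parse_legacy`, `c_name`, and `has_string_data` are hypothetical. The sketch reads the header, the per-source `metadata`, and each `numeric-series`, maps `-DBL_MAX` to `None` for SYSMIS, and uses the offset comparison described above to report whether `string-data` follows. It does not decode `string-data` itself.

```python
import struct
import sys

SYSMIS = -sys.float_info.max  # -DBL_MAX marks the system-missing value


def c_name(raw):
    """Decode a fixed-width name field padded on the right with zero bytes."""
    return raw.split(b"\0", 1)[0].decode("ascii", "replace")


def parse_legacy(buf):
    """Parse header, metadata, and numeric series of a legacy member.

    Assumes little-endian encoding and 32-bit `int` (not stated in the text).
    """
    # legacy-binary := 00 byte[version] int16[n-sources] int[file-size] ...
    zero, version, n_sources, file_size = struct.unpack_from("<BBHI", buf, 0)
    if zero != 0 or version not in (0xAF, 0xB0):
        raise ValueError("not a known legacy member")

    # Version 0xaf: 32-byte source-name.  Version 0xb0: 64 bytes plus int[x].
    meta_fmt = "<III32s" if version == 0xAF else "<III64sI"
    meta_size = struct.calcsize(meta_fmt)
    metas = [struct.unpack_from(meta_fmt, buf, 8 + i * meta_size)
             for i in range(n_sources)]

    sources = []
    for k, meta in enumerate(metas):
        per_series, n_series, ofs, raw_name = meta[:4]
        pos = ofs  # ofs is measured from the beginning of the file
        series = []
        for _ in range(n_series):
            # numeric-series := byte*288[series-name] double*[per-series]
            name = c_name(buf[pos:pos + 288])
            pos += 288
            values = struct.unpack_from("<%dd" % per_series, buf, pos)
            pos += 8 * per_series
            series.append((name, [None if v == SYSMIS else v for v in values]))
        # string-data is present if the numeric data stops short of the
        # next source's offset (or of end of file for the last source).
        end = metas[k + 1][2] if k + 1 < n_sources else file_size
        sources.append({"name": c_name(raw_name),
                        "series": series,
                        "has_string_data": pos < end})
    return sources
```

Series are returned as a list of `(name, values)` pairs rather than a dictionary so that repeated series names, if any occur, are not silently merged.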