Work on documentation.

[pspp] / spv-file-format.texi
diff --git a/spv-file-format.texi b/spv-file-format.texi

index 50a94141b924452dc0ee9bd255d6c4d2acdc6d35..d245f7203931d6e701f2b9fb3c805c80c3ba855f 100644
--- a/spv-file-format.texi
+++ b/spv-file-format.texi
@@ -80,13 +80,18 @@ their exact names do not matter to readers as long as they are unique.
  @node SPV Structure Member Format
  @section Structure Member Format
  
+A structure member lays out the high-level structure for a group of
+output items such as headings, tables, and charts.  Structure members
+do not include the details of tables and charts but instead refer to
+them by their member names.
+
  Structure members' XML files claim conformance with a collection of
  XML Schemas.  These schemas are distributed, under a nonfree license,
  with SPSS binaries.  Fortunately, the schemas are not necessary to
-understand the structure members.  To a degree, the schemas can even
+understand the structure members.  The schemas can even
  be deceptive because they document elements and attributes that are
  not in the corpus and do not document elements and attributes that are
-commonly found there.
+commonly found in the corpus.
  
  Structure members use a different XML namespace for each schema, but
  these namespaces are not entirely consistent.  In some SPV files, for
@@ -142,6 +147,46 @@ times.
  @center @image{dev/spv-structure, 5in}
  @end iftex
  
+The following example shows the contents of a typical structure member
+for a @cmd{DESCRIPTIVES} procedure.  A real structure member would not
+have indentation.  This example also omits most attributes, all XML
+namespace information, and the CSS from the embedded HTML:
+
+@example
+<?xml version="1.0" encoding="utf-8"?>
+<heading>
+  <label>Output</label>
+  <heading commandName="Descriptives">
+    <label>Descriptives</label>
+    <container>
+      <label>Title</label>
+      <text commandName="Descriptives" type="title">
+        <html lang="en">
+<![CDATA[<head><style type="text/css">...</style></head><BR>Descriptives]]>
+        </html>
+      </text>
+    </container>
+    <container visibility="hidden">
+      <label>Notes</label>
+      <table commandName="Descriptives" subType="Notes" type="note">
+        <tableStructure>
+          <dataPath>00000000001_lightNotesData.bin</dataPath>
+        </tableStructure>
+      </table>
+    </container>
+    <container>
+      <label>Descriptive Statistics</label>
+      <table commandName="Descriptives" subType="Descriptive Statistics"
+             type="table">
+        <tableStructure>
+          <dataPath>00000000002_lightTableData.bin</dataPath>
+        </tableStructure>
+      </table>
+    </container>
+  </heading>
+</heading>
+@end example
+
  @menu
  * SPV Structure heading Element::
  * SPV Structure label Element::
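As a rough illustration of how a reader might use the structure member shown in this hunk, the Python sketch below lists each table container together with the detail member its @code{dataPath} names.  It assumes namespace-free XML, as in the abridged example; real structure members carry XML namespaces, so the tag names would need to be qualified.  The function name @code{list_tables} is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example structure member (no namespaces,
# most attributes omitted, title container left out).
EXAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<heading>
  <label>Output</label>
  <heading commandName="Descriptives">
    <label>Descriptives</label>
    <container visibility="hidden">
      <label>Notes</label>
      <table commandName="Descriptives" subType="Notes" type="note">
        <tableStructure>
          <dataPath>00000000001_lightNotesData.bin</dataPath>
        </tableStructure>
      </table>
    </container>
    <container>
      <label>Descriptive Statistics</label>
      <table commandName="Descriptives" subType="Descriptive Statistics"
             type="table">
        <tableStructure>
          <dataPath>00000000002_lightTableData.bin</dataPath>
        </tableStructure>
      </table>
    </container>
  </heading>
</heading>
"""


def list_tables(xml_bytes):
    """Return (label, hidden, data_path) for each container holding a table."""
    root = ET.fromstring(xml_bytes)
    found = []
    for container in root.iter("container"):
        path = container.findtext("table/tableStructure/dataPath")
        if path is None:
            continue  # a container that holds text rather than a table
        label = container.findtext("label")
        hidden = container.get("visibility") == "hidden"
        found.append((label, hidden, path))
    return found


if __name__ == "__main__":
    for label, hidden, path in list_tables(EXAMPLE):
        print(label, "(hidden)" if hidden else "", path)
```

The structure member only names the detail members; the table contents themselves live in the @file{.bin} members that the paths refer to.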
@@ -315,7 +360,7 @@ Parent: @code{text} @*
  Contents: CDATA
  
  The CDATA contains an HTML document.  In some cases, the document
-starts with @code{<html>} and ends with @code{</html}; in others the
+starts with @code{<html>} and ends with @code{</html>}; in others the
  @code{html} element is implied.  Generally the HTML includes a
  @code{head} element with a CSS stylesheet.  The HTML body often begins
  with @code{<BR>}.  The actual content ranges from trivial to simple:
@@ -440,7 +485,7 @@ This element has no attributes.
  @node SPV Structure @code{text} Element (Inside @code{pageParagraph})
  @subsection The @code{text} Element (Inside @code{pageParagraph})
  
-Parent: @code{pageParagraph}
+Parent: @code{pageParagraph} @*
  Contents: CDATA?
  
  This @code{text} element is nested inside a @code{pageParagraph}.  There
@@ -451,7 +496,7 @@ The element is either empty, or contains CDATA that holds almost-XHTML
  text: in the corpus, either an @code{html} or @code{p} element.  It is
  @emph{almost}-XHTML because the @code{html} element designates the
  default namespace as
-@code{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
+@indicateurl{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
  namespace, and because the CDATA can contain substitution variables:
  @code{&[Page]} for the page number and @code{&[PageTitle]} for the
  page title.
@@ -607,7 +652,12 @@ An SPV light member begins with a 39-byte header:
  Header @result{}
      01 00
      (i1 @math{|} i3)[@t{version}]
-    01 bool*4 int
+    bool
+    bool[@t{show-numeric-markers}]
+    bool[@t{rotate-inner-column-labels}]
+    bool[@t{rotate-outer-row-labels}]
+    bool
+    int
      int[@t{min-column-width}] int[@t{max-column-width}]
      int[@t{min-row-width}] int[@t{max-row-width}]
      int64[@t{table-id}]
@@ -619,6 +669,18 @@ some of the other data in the member.  We will refer to ``version 1''
  and ``version 3'' later on and use v1(@dots{}) and v3(@dots{}) for
  version-specific formatting (as described previously).
  
+If @code{show-numeric-markers} is 1, footnote markers are shown as
+numbers, starting from 1; otherwise, they are shown as letters,
+starting from @samp{a}.
+
+If @code{rotate-inner-column-labels} is 1, then column labels closest
+to the data are rotated to be vertical; otherwise, they are shown
+in the normal way.
+
+If @code{rotate-outer-row-labels} is 1, then row labels farthest from
+the data are rotated to be vertical; otherwise, they are shown in the
+normal way.
+
  @code{table-id} is a binary version of the @code{tableId} attribute in
  the structure member that refers to the detail member.  For example,
  if @code{tableId} is @code{-4122591256483201023}, then @code{table-id}
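The 39-byte header layout from the preceding hunks can be sketched with Python's @code{struct} module.  This is only a sketch under stated assumptions: @code{int} is taken to be a 32-bit little-endian integer, @code{int64} a 64-bit little-endian integer, and @code{bool} a single byte.  The function name and the width values in the sample are made up for illustration; only the @code{tableId} value comes from the text.

```python
import struct

# 01 00 | version | 5 bools | int | 4 width ints | int64 table-id
# Assumption: little-endian throughout, one byte per bool.
HEADER = struct.Struct("<2s i 5? i 4i q")


def parse_header(data):
    """Decode the light member header into a dict of named fields."""
    (magic, version,
     _bool0, show_numeric_markers,
     rotate_inner_column_labels, rotate_outer_row_labels,
     _bool4, _unknown_int,
     min_col, max_col, min_row, max_row,
     table_id) = HEADER.unpack_from(data)
    if magic != b"\x01\x00":
        raise ValueError("bad header magic")
    if version not in (1, 3):
        raise ValueError("unexpected version %d" % version)
    return {
        "version": version,
        "show_numeric_markers": show_numeric_markers,
        "rotate_inner_column_labels": rotate_inner_column_labels,
        "rotate_outer_row_labels": rotate_outer_row_labels,
        "min_column_width": min_col,
        "max_column_width": max_col,
        "min_row_width": min_row,
        "max_row_width": max_row,
        "table_id": table_id,
    }


# Hypothetical sample: the widths are invented; table_id is the
# tableId value quoted in the text.
SAMPLE = HEADER.pack(b"\x01\x00", 3, False, True, False, True, False,
                     0, 36, 72, 36, 120, -4122591256483201023)

if __name__ == "__main__":
    print(parse_header(SAMPLE))
```

The struct size works out to 2 + 4 + 5 + 4 + 16 + 8 = 39 bytes, which matches the header length stated in the text.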
@@ -860,11 +922,24 @@ TableSettings @result{}
      bool[@t{footnote-marker-position}]
      v3(
        byte
-      be32[@t{n}] byte*[@t{n}]
+      count(
+        Breakpoints[@t{row-breaks}] Breakpoints[@t{column-breaks}]
+        Keeps[@t{row-keeps}] Keeps[@t{column-keeps}]
+        PointKeeps[@t{row-point-keeps}] PointKeeps[@t{column-point-keeps}]
+      )
        bestring[@t{notes}]
        bestring[@t{table-look}]
        00...
      )
+
+Breakpoints @result{} be32[@t{n-breaks}] be32*[@t{n-breaks}]
+
+Keeps @result{} be32[@t{n-keeps}] Keep*@t{n-keeps}
+Keep @result{} be32[@t{offset}] be32[@t{n}]
+
+PointKeeps @result{} be32[@t{n-point-keeps}] PointKeep*@t{n-point-keeps}
+PointKeep @result{} be32[@t{offset}] be32 be32
+
  @end format
  @end cartouche
  
@@ -886,6 +961,20 @@ shown as numbers starting from 1.
  When @code{footnote-marker-position} is 1, footnote markers are shown
  as superscripts, otherwise as subscripts.
  
+The Breakpoints are rows or columns after which there is a page break;
+for example, a row break of 1 requests a page break after the second
+row.  Usually no breakpoints are specified, indicating that page
+breaks should be selected automatically.
+
+The Keeps are ranges of rows or columns to be kept together without a
+page break; for example, a row Keep with @code{offset} 1 and @code{n}
+10 requests that the 10 rows starting with the second row be kept
+together.  Usually no Keeps are specified.
+
+The PointKeeps seem to be generated automatically based on
+user-specified Keeps.  They seem to indicate a conversion from rows
+or columns to pixel or point offsets.
+
  @code{notes} is a text string that contains user-specified notes.  It
  is displayed when the user hovers the cursor over the table, like
  ``alt text'' on a webpage.  It is not printed.  It is usually empty.
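The Breakpoints and Keeps productions described in this hunk could be read along the following lines.  The sketch assumes that @code{be32} is an unsigned 32-bit big-endian integer and that a Keep's @code{n} field is also a @code{be32}; it does not handle the enclosing @code{count(...)} wrapper or PointKeeps.  All function names are hypothetical.

```python
import struct


def read_be32s(data, pos, count):
    """Read `count` big-endian 32-bit integers starting at `pos`."""
    values = struct.unpack_from(">%dI" % count, data, pos)
    return list(values), pos + 4 * count


def parse_breakpoints(data, pos):
    """Breakpoints => be32[n-breaks] be32*[n-breaks]"""
    (n,), pos = read_be32s(data, pos, 1)
    return read_be32s(data, pos, n)


def parse_keeps(data, pos):
    """Keeps => be32[n-keeps] Keep*n-keeps, each Keep an (offset, n) pair."""
    (n,), pos = read_be32s(data, pos, 1)
    flat, pos = read_be32s(data, pos, 2 * n)
    return list(zip(flat[0::2], flat[1::2])), pos


def rows_kept(keep):
    """0-based row (or column) numbers that a Keep holds together."""
    offset, n = keep
    return range(offset, offset + n)


if __name__ == "__main__":
    # One row break after the second row (value 1), then one Keep with
    # offset 1 and n 10, matching the examples in the text.
    data = struct.pack(">2I", 1, 1) + struct.pack(">3I", 1, 1, 10)
    breaks, pos = parse_breakpoints(data, 0)
    keeps, pos = parse_keeps(data, pos)
    print(breaks, keeps, list(rows_kept(keeps[0])))
```

Because offsets are 0-based, a Keep with @code{offset} 1 and @code{n} 10 covers rows 1 through 10, that is, the second through eleventh rows.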
@@ -901,33 +990,59 @@ TableSettings ends with an arbitrary number of null bytes.
  @cartouche
  @format
  Formats @result{}
-    int[@t{nwidths}] int*[@t{nwidths}]
+    int[@t{n-widths}] int*[@t{n-widths}]
      string[@t{encoding}]
-    int (00 @math{|} 01) 00 (00 @math{|} 01)
+    int[@t{current-layer}]
+    bool[@t{digit-grouping}] bool[@t{leading-zero}] bool
      int[@t{epoch}]
      byte[@t{decimal}] byte[@t{grouping}]
      CustomCurrency
-    v1(i0)
-    v3(count(count(X5) count(X6)))
-
-CustomCurrency @result{} int[@t{n-ccs}] string*[@t{n-ccs}]
+    count(
+      v1(X0?)
+      v3(count(X1 count(X2)) count(X3)))
  
-X5 @result{} byte*33 int[@t{n}] int*[@t{n}]
-X6 @result{}
+X0 @result{}
+    byte*14
+    string[@t{command}] string[@t{command-local}]
+    string[@t{language}] string[@t{charset}] string[@t{locale}]
+    bool 00 bool bool
+    int[@t{epoch}]
+    byte[@t{decimal}] byte[@t{grouping}]
+    CustomCurrency
+    byte[@t{missing}] bool
+
+X1 @result{}
+    byte*2
+    byte[@t{lang}]
+    byte[@t{variable-mode}]
+    byte[@t{value-mode}]
+    int*2
+    00*17
+    bool
+    01
+X2 @result{}
+    int[@t{n-heights}] int*[@t{n-heights}]
+    int[@t{n-style-map}] StyleMap*[@t{n-style-map}]
+    int[@t{n-styles}] StylePair*[@t{n-styles}]
+    count((i0 i0)?)
+StyleMap @result{} int64[@t{cell-index}] int16[@t{style-index}]
+X3 @result{}
      01 00 (03 @math{|} 04) 00 00 00
-    string[@t{command}] string[@t{subcommand}]
+    string[@t{command}] string[@t{command-local}]
      string[@t{language}] string[@t{charset}] string[@t{locale}]
-    (00 @math{|} 01) 00 bool bool
+    bool 00 bool bool
      int[@t{epoch}]
      byte[@t{decimal}] byte[@t{grouping}]
      double[@t{small}] 01
      (string[@t{dataset}] string[@t{datafile}] i0 int[@t{date}] i0)?
      CustomCurrency
      byte[@t{missing}] bool (i2000000 i0)?
+
+CustomCurrency @result{} int[@t{n-ccs}] string*[@t{n-ccs}]
  @end format
  @end cartouche
  
-If @code{nwidths} is nonzero, then the accompanying integers are
+If @code{n-widths} is nonzero, then the accompanying integers are
  column widths as manually adjusted by the user.  (Row heights are
  computed automatically based on the widths.)
  
@@ -950,6 +1065,13 @@ are @samp{.} and @samp{,}.
  @samp{'} (apostrophe), @samp{ } (space), and zero (presumably
  indicating that digits should not be grouped).
  
+@code{command} describes the statistical procedure that generated the
+output, in English.  It is not necessarily the literal syntax name of
+the procedure: for example, NPAR TESTS becomes ``Nonparametric
+Tests.''  @code{command-local} is the procedure's name, translated
+into the output language; it is often empty and, when it is not,
+sometimes the same as @code{command}.
+
  @code{dataset} is the name of the dataset analyzed to produce the
  output, e.g.@: @code{DataSet1}, and @code{datafile} the name of the
  file it was read from, e.g.@: @file{C:\Users\foo\bar.sav}.  The latter
@@ -971,6 +1093,9 @@ following strings are CCA through CCE format strings.  @xref{Custom
  Currency Formats,,, pspp, PSPP}.  Most commonly these are all
  @code{-,,,} but other strings occur.
  
+@code{missing} is the character used to indicate that a cell contains
+a missing value.  It is always observed as @samp{.}.
+
  @node SPV Light Member Dimensions
  @subsection Dimensions
  
@@ -980,30 +1105,34 @@ the categories associated with each dimension.
  @cartouche
  @format
  Dimensions @result{} int[@t{n-dims}] Dimension*[@t{n-dims}]
-Dimension @result{} Value[@t{name}] DimUnknown int[@t{n-categories}] Category*[@t{n-categories}]
-DimUnknown @result{}
+Dimension @result{} Value[@t{name}] DimProperties int[@t{n-categories}] Category*[@t{n-categories}]
+DimProperties @result{}
      byte[@t{d1}]
      (00 @math{|} 01 @math{|} 02)[@t{d2}]
      (i0 @math{|} i2)[@t{d3}]
-    (00 @math{|} 01)[@t{d4}]
-    (00 @math{|} 01)[@t{d5}]
-    01
-    int[@t{d6}]
+    bool[@t{show-dim-label}]
+    bool[@t{hide-all-labels}]
+    01 int[@t{dim-index}]
  @end format
  @end cartouche
  
  @code{name} is the name of the dimension, e.g. @code{Variables},
  @code{Statistics}, or a variable name.
  
+The meanings of @code{d1}, @code{d2}, and @code{d3} are unknown.
  @code{d1} is usually 0 but many other values have been observed.
  
-@code{d3} is 2 over 99% of the time.
+If @code{show-dim-label} is 01, the pivot table displays a label for
+the dimension itself.  Because the group and category labels usually
+provide enough explanation, it is usually 00.
  
-@code{d5} is 0 over 99% of the time.
+If @code{hide-all-labels} is 01, the pivot table omits all labels for
+the dimension, including group and category labels.  It is usually 00.
+When @code{hide-all-labels} is 01, @code{show-dim-label} is ignored.
  
-@code{d6} is either -1 or the 0-based index of the dimension, e.g.@: 0
-for the first dimension, 1 for the second, and so on.  The latter is
-the case 98% of the time in the corpus.
+@code{dim-index} is usually the 0-based index of the dimension, e.g.@:
+0 for the first dimension, 1 for the second, and so on.  Sometimes it
+is -1.  There is no visible difference.
  
  @node SPV Light Member Categories
  @subsection Categories
@@ -1014,23 +1143,26 @@ are really categories; the others just serve as grouping constructs.
  @cartouche
  @format
  Category @result{} Value[@t{name}] (Leaf @math{|} Group)
-Leaf @result{} 00 00 00 i2 int[@t{index}] i0
+Leaf @result{} 00 00 00 i2 int[@t{cat-index}] i0
  Group @result{}
-    (00 @math{|} 01)[@t{merge}] 00 01 (i0 @math{|} i2)[@t{data}]
+    bool[@t{merge}] 00 01 (i0 @math{|} i2)[@t{data}]
      i-1 int[@t{n-subcategories}] Category*[@t{n-subcategories}]
  @end format
  @end cartouche
  
  @code{name} is the name of the category (or group).
  
-A Leaf represents a leaf category.  The Leaf's @code{index} is a
+A Leaf represents a leaf category.  The Leaf's @code{cat-index} is a
  nonnegative integer less than @code{n-categories} in the Dimension in
-which the Category is nested (directly or indirectly).
+which the Category is nested (directly or indirectly).  These
+indexes represent the original order in which the categories were
+sorted; if the user sorted or rearranged the categories, then the
+order of categories in the file reflects that without changing the
+@code{cat-index} values.
  
-A Group represents a Group of nested categories.  Usually a Group
-contains at least one Category, so that @code{n-subcategories} is
-positive, but a few Groups with @code{n-subcategories} 0 has been
-observed.
+A Group is a group of nested categories.  Usually a Group contains at
+least one Category, so that @code{n-subcategories} is positive, but a
+few Groups with @code{n-subcategories} 0 have been observed.
  
  If a Group's @code{merge} is 00, the most common value, then the group
  is really a distinct group that should be represented as such in the
@@ -1056,23 +1188,23 @@ The final part of an SPV light member contains the actual data.
  Data @result{}
      int[@t{layers}] int[@t{rows}] int[@t{columns}] int*[@t{n-dimensions}]
      int[@t{n-data}] Datum*[@t{n-data}]
-Datum @result{} int64[@t{index}] v3(00?) Value
+Datum @result{} int64[@t{index}] v1(00?) Value
  @end format
  @end cartouche
  
-The values of @code{layers}, @code{rows}, and @code{columns} each
-specifies the number of dimensions displayed in layers, rows, and
+The values of @code{n-layers}, @code{n-rows}, and @code{n-columns}
+specify the number of dimensions displayed in layers, rows, and
  columns, respectively.  Any of them may be zero.  Their values sum to
  @code{n-dimensions} from Dimensions (@pxref{SPV Light Member
  Dimensions}).
  
  The @code{n-dimensions} integers are a permutation of the 0-based
-dimension numbers.  The first @code{layers} integers specify each of
-the dimensions represented by layers, the next @code{rows} integers
+dimension numbers.  The first @code{n-layers} integers specify each of
+the dimensions represented by layers, the next @code{n-rows} integers
  specify the dimensions represented by rows, and the final
-@code{columns} integers specify the dimensions represented by columns.
-When there is more than one dimension of a given kind, the inner
-dimensions are given first.
+@code{n-columns} integers specify the dimensions represented by
+columns.  When there is more than one dimension of a given kind, the
+inner dimensions are given first.
  
  The format of a Datum varies slightly from version 1 to version 3: in
  version 1 it allows for an extra optional 00 byte.
@@ -1092,6 +1224,7 @@ for each @math{i} from 0 to @math{d - 1}:
  For example, suppose there are 3 dimensions with 3, 4, and 5
  categories, respectively.  The datum at coordinates (1, 2, 3) has
  index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}.
+Within a given dimension, the index is the @code{cat-index} in a Leaf.
  
  @node SPV Light Member Value
  @subsection Value
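The datum index calculation in this hunk amounts to a mixed-radix encoding of the per-dimension @code{cat-index} values.  The following Python sketch reproduces the worked example from the text and also shows the inverse mapping; the function names are hypothetical.

```python
def datum_index(sizes, coords):
    """Fold per-dimension category indexes into a single datum index.

    sizes[i] is the number of categories in dimension i and coords[i]
    is the cat-index within that dimension.
    """
    index = 0
    for n, k in zip(sizes, coords):
        if not 0 <= k < n:
            raise ValueError("coordinate %d out of range 0..%d" % (k, n - 1))
        index = n * index + k
    return index


def datum_coords(sizes, index):
    """Inverse of datum_index: recover the per-dimension coordinates."""
    coords = []
    for n in reversed(sizes):
        index, k = divmod(index, n)
        coords.append(k)
    return tuple(reversed(coords))


if __name__ == "__main__":
    sizes = (3, 4, 5)
    print(datum_index(sizes, (1, 2, 3)))   # 5 * (4 * (3 * 0 + 1) + 2) + 3
    print(datum_coords(sizes, 33))
```

With 3, 4, and 5 categories, coordinates (1, 2, 3) give index 33, as in the text, and decoding 33 returns the same coordinates.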
@@ -1106,7 +1239,7 @@ RawValue @result{}
      01 ValueMod int[@t{format}] double[@t{x}]
    @math{|} 02 ValueMod int[@t{format}] double[@t{x}]
      string[@t{varname}] string[@t{vallab}] (01 @math{|} 02 @math{|} 03)
-  @math{|} 03 string[@t{local}] ValueMod string[@t{id}] string[@t{c}] (00 @math{|} 01)[@t{type}]
+  @math{|} 03 string[@t{local}] ValueMod string[@t{id}] string[@t{c}] bool[@t{type}]
    @math{|} 04 ValueMod int[@t{format}] string[@t{vallab}] string[@t{varname}]
      (01 @math{|} 02 @math{|} 03) string[@t{s}]
    @math{|} 05 ValueMod string[@t{varname}] string[@t{varlabel}] (01 @math{|} 02 @math{|} 03)
@@ -1261,13 +1394,26 @@ A ValueMod can specify special modifications to a Value.
  ValueMod @result{}
      31 i0 (i0 @math{|} i1 string[@t{subscript}])
      v1(00 (i1 @math{|} i2) 00 00 int 00 00)
-    v3(count(FormatString Style ValueModUnknown))
+    v3(count(FormatString StylePair))
    @math{|} 31 int[@t{n-refs}] int16*[@t{n-refs}] Format
    @math{|} 58
-Style @result{} 58 @math{|} 31 01? 00? 00? 00? 01 string[@t{fgcolor}] string[@t{bgcolor}] string[@t{typeface}] byte[@t{size}]
+
  Format @result{} 00 00 count(FormatString Style 58)
-FormatString @result{} count((i0 (58 @math{|} 31 string))?)
-ValueModUnknown @result{} 58 @math{|} 31 i0 i0 i0 i0 01 00 (01 @math{|} 02 @math{|} 08) 00 08 00 0a 00)
+FormatString @result{} count((count((i0 58)?) (58 @math{|} 31 string))?)
+
+StylePair @result{}
+    (31 Style @math{|} 58)
+    (31 Style2 @math{|} 58)
+
+Style @result{}
+    bool[@t{bold}] bool[@t{italic}] bool[@t{underline}] bool[@t{show}]
+    string[@t{fgcolor}] string[@t{bgcolor}]
+    string[@t{typeface}] byte[@t{size}]
+
+Style2 @result{}
+    int[@t{halign}] int[@t{valign}] double[@t{offset}]
+    int16[@t{left-margin}] int16[@t{right-margin}]
+    int16[@t{top-margin}] int16[@t{bottom-margin}]
  @end format
  @end cartouche
  
@@ -1287,8 +1433,22 @@ syntax explained previously.  It appears to be an English-language
  version of the localized format string in the Value in which the
  Format is nested.
  
-The Style, if present, changes the style for this individual Value.
-The @code{size} is a font size in units of 1/96 inch.
+Style and Style2, if present, change the style for this individual
+Value.  @code{bold}, @code{italic}, and @code{underline} enable the
+corresponding font styles.  @code{fgcolor} and @code{bgcolor} are
+color strings, such as @code{#ffffff}.  The @code{size} is a font size
+in units of 1/96 inch.
+
+@code{halign} specifies horizontal alignment: 0 for center, 2 for
+left, 4 for right, 6 for decimal, and 0xffffffad for mixed.  For
+decimal alignment, @code{offset} is the decimal point's offset from
+the right side of the cell, in units of 1/72 inch.
+
+@code{valign} specifies vertical alignment: 0 for center, 1 for top,
+and 3 for bottom.
+
+@code{left-margin}, @code{right-margin}, @code{top-margin}, and
+@code{bottom-margin} are in units of 1/72 inch.
  
  @node SPV Legacy Detail Member Binary Format
  @section Legacy Detail Member Binary Format
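To make the Style2 layout and its alignment codes concrete, here is a Python sketch that decodes the 24 bytes following the @code{31} marker.  It assumes little-endian encoding for @code{int}, @code{int16}, and @code{double}, and reads the two alignment fields as unsigned so that 0xffffffad compares directly.  The function name and the sample values are hypothetical.

```python
import struct

# int[halign] int[valign] double[offset] int16 * 4 margins
# Assumption: little-endian; alignment codes read as unsigned.
STYLE2 = struct.Struct("<IId4h")

HALIGN = {0: "center", 2: "left", 4: "right", 6: "decimal",
          0xFFFFFFAD: "mixed"}
VALIGN = {0: "center", 1: "top", 3: "bottom"}


def parse_style2(data, pos=0):
    """Decode a Style2 record, converting 1/72 inch units to inches."""
    (halign, valign, offset,
     left, right, top, bottom) = STYLE2.unpack_from(data, pos)
    return {
        "halign": HALIGN.get(halign, "unknown(%d)" % halign),
        "valign": VALIGN.get(valign, "unknown(%d)" % valign),
        # offset only matters for decimal alignment
        "decimal_offset_in": offset / 72.0,
        "margins_in": {
            "left": left / 72.0,
            "right": right / 72.0,
            "top": top / 72.0,
            "bottom": bottom / 72.0,
        },
    }


if __name__ == "__main__":
    # Invented sample: decimal alignment, decimal point half an inch
    # from the right edge of the cell, top vertical alignment.
    sample = STYLE2.pack(6, 1, 36.0, 9, 18, 0, 0)
    print(parse_style2(sample))
```

Unknown alignment codes are reported rather than rejected, since the text only lists the values observed so far.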
@@ -1472,6 +1632,7 @@ elements are assigned @code{id} attributes that are never referenced.
  * SPV Detail coordinates Element::
  * SPV Detail faceting Element::
  * SPV Detail facetLayout Element::
+* SPV Detail style Element::
  @end menu
  
  @node SPV Detail visualization Element
@@ -1516,7 +1677,7 @@ The title of the pivot table, localized to the output language.
  
  @defvr {Required} style
  The @code{id} of a @code{style} element (@pxref{SPV Detail style
-element}).  This is the base style for the entire pivot table.  In
+Element}).  This is the base style for the entire pivot table.  In
  every example in the corpus, the value is @code{visualizationStyle}
  and the corresponding @code{style} element has no attributes other
  than @code{id}.
@@ -1726,7 +1887,7 @@ Contents: @code{location}@math{+} @code{coordinates} @code{faceting} @code{facet
  @defvr {Required} cellStyle
  @defvrx {Required} style
  Each of these is the @code{id} of a @code{style} element (@pxref{SPV
-Detail style element}).  The former is the default style for
+Detail style Element}).  The former is the default style for
  individual cells, the latter for the entire table.
  @end defvr
  
@@ -2504,7 +2665,7 @@ This element has the following attributes.
  @defvrx {Optional} useGroupging
  The syntax and meaning of these attributes is the same as on the
  @code{format} element for a numeric format.  @pxref{SPV Detail format
-element}.
+Element}.
  @end defvr
  
  @node SPV Detail stringFormat Element
@@ -2678,7 +2839,7 @@ A value, or multiple values separated by semicolons,
  e.g.@: @code{0} or @code{13;14;15;16}.
  @end defvr
  
-@subsubheading The @code{intersectWhere}
+@subsubheading The @code{intersectWhere} Element
  
  Parent: @code{intersect} @*
  Contents: empty
@@ -2691,3 +2852,8 @@ The meaning of these attributes is unknown.  In the four examples in
  the corpus they always take the values @code{dimension2categories} and
  @code{dimension0categories}, respectively.
  @end defvr
+
+@node SPV Detail style Element
+@subsection The @code{style} Element
+
+TBD.