   1 @node SPSS Viewer Format
   2 @section SPSS Viewer Format
   3
   4 SPSS Viewer or @file{.spv} files, here called SPV files, are written
   5 by SPSS 16 and later to represent the contents of its output editor.
   6 This section documents the format.  This description is detailed
   7 enough to read SPV files, but it is probably not sufficient to
   8 write them.
   9
  10 As an aside, SPSS 15 and earlier versions use a completely different
  11 output format based on the Microsoft Compound Document Format.  This
  12 format is not documented here.
  13
  14 An SPV file is a Zip archive that can be read with @command{zipinfo}
  15 and @command{unzip} and similar programs.  The final member in the Zip
  16 archive is a file named @file{META-INF/MANIFEST.MF}.  This structure
  17 makes SPV files resemble Java ``JAR'' files, but whereas a JAR
  18 manifest contains a sequence of colon-delimited key/value pairs, an
  19 SPV manifest contains the string @samp{allowPivoting=true}, without a
  20 new-line.
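
The layout described above can be checked with a short Python sketch
built on the standard @code{zipfile} module.  The function name
@code{check_spv_manifest} is hypothetical, chosen only for illustration,
and the check tolerates a trailing new-line in the manifest even though
none is expected.

```python
import zipfile

MANIFEST = "META-INF/MANIFEST.MF"

def check_spv_manifest(source):
    """Return the member names of an SPV file after checking its manifest.

    `source` is a file name or a binary file object.  Raises ValueError
    if the archive lacks the manifest described above.
    """
    with zipfile.ZipFile(source) as spv:
        names = spv.namelist()
        if MANIFEST not in names:
            raise ValueError("not an SPV file: no " + MANIFEST)
        manifest = spv.read(MANIFEST).decode("ascii")
        # Tolerate a trailing new-line even though none is expected.
        if manifest.strip() != "allowPivoting=true":
            raise ValueError("unexpected manifest: %r" % manifest)
        return names
```

The returned list keeps archive order, so the manifest should normally be
its last element.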
  21
  22 The rest of the members in an SPV file's Zip archive fall into two
  23 categories: structure and details.  ``Structure'' member names begin
  24 with @file{outputViewer@var{nnnnnnnnnn}}, where each @var{n} is a
  25 decimal digit, and end with @file{.xml}, and often include the string
  26 @file{_heading} in between.  Each of these members represents some
  27 kind of output item (a table, a heading, a block of text, etc.) or a
  28 group of them.  The member whose output goes at the beginning of the
  29 document is numbered 0, the next member in the output is numbered 1,
  30 and so on.
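
As a rough sketch, structure members can be picked out of the member
list and put into output order with a regular expression.  The pattern
below assumes that @file{_heading} is the only string that appears
between the digits and @file{.xml}, which is stricter than the
description above; @code{structure_members} is a hypothetical helper
name.

```python
import re

# "outputViewer", ten decimal digits, optional "_heading", then ".xml".
STRUCTURE_RE = re.compile(r"^outputViewer(\d{10})(_heading)?\.xml$")

def structure_members(names):
    """Return the structure member names in `names`, in output order."""
    found = []
    for name in names:
        m = STRUCTURE_RE.match(name)
        if m:
            # Sort on the numeric value of the ten digits.
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]
```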
  31
  32 Structure members contain XML.  This XML is sometimes self-contained,
  33 but it often references other members in the Zip archive named as
  34 follows:
  35
  36 @table @asis
  37 @item @file{@var{prefix}_table.xml} and @file{@var{prefix}_tableData.bin}
  38 @itemx @file{@var{prefix}_lightTableData.bin}
  39 The structure of a table plus its data.  Older SPV files pair a
  40 @file{@var{prefix}_table.xml} file that describes the table's
  41 structure with a binary @file{@var{prefix}_tableData.bin} file that
  42 gives its data.  Newer SPV files (the majority of those in the corpus)
  43 instead include a single @file{@var{prefix}_lightTableData.bin} file
  44 that incorporates both into a single binary format.
  45
  46 @item @file{@var{prefix}_warning.xml} and @file{@var{prefix}_warningData.bin}
  47 @itemx @file{@var{prefix}_lightWarningData.bin}
  48 Same format used for tables, with a different name.
  49
  50 @item @file{@var{prefix}_notes.xml} and @file{@var{prefix}_notesData.bin}
  51 @itemx @file{@var{prefix}_lightNotesData.bin}
  52 Same format used for tables, with a different name.
  53
  54 @item @file{@var{prefix}_chartData.bin} and @file{@var{prefix}_chart.xml}
  55 The structure of a chart plus its data.  Charts do not have a
  56 ``light'' format.
  57
  58 @item @var{prefix}_model.scf
  59 @itemx @var{prefix}_pmml.scf
  60 Not yet investigated.  The corpus contains only one example of each.
  61
  62 @item @var{prefix}_stats.xml
  63 Not yet investigated.  The corpus contains few examples.
  64 @end table
  65
  66 The @file{@var{prefix}} in the names of the detail members is
  67 typically an 11-digit decimal number that increases for each item,
  68 tending to skip values.  Older SPV files use different naming
  69 conventions.  Structure members refer to detail members by name, and so
  70 their exact names do not appear to matter as long as they are unique.
  71
  72 @node SPV Structure Member Format
  73 @subsection Structure Member Format
  74
  75 Structure member XML files claim conformance with a collection of XML
  76 Schemas.  These schemas are distributed, under a nonfree license, with
  77 SPSS binaries.  Fortunately, the schemas are not necessary to
  78 understand the structure members.  To a degree, the schemas can even
  79 be deceptive because they document elements and attributes that are
  80 not in the corpus and lack documentation of elements and attributes
  81 that are commonly found in the corpus.
  82
  83 Structure members use a different XML namespace for each schema, but
  84 these namespaces are not entirely consistent: in some SPV files, for
  85 example, the @code{viewer-tree} schema is associated with namespace
  86 @indicateurl{http://xml.spss.com/spss/viewer-tree} and in others with
  87 @indicateurl{http://xml.spss.com/spss/viewer/viewer-tree} (note the
  88 additional @file{viewer/}).  In any case, the schema URIs are
  89 not resolvable to obtain the schemas themselves.
  90
  91 One may ignore all of the above in interpreting a structure member.
  92 The actual XML has a simple and straightforward form that does not
  93 require a reader to take schemas or namespaces into account.
  94
  95 @table @code
  96 @item heading
  97 Parent: Document root or @code{heading} @*
  98 Contents: [@code{pageSetup}] @code{label} [@code{container} | @code{heading}]*
  99
 100 The root of a structure member is a @code{heading}, which represents a
 101 section of output beginning with a title (the @code{label}) and
 102 ordinarily followed by content containers or further nested
 103 (sub)-sections of output.
 104
 105 The document root heading may also contain a @code{pageSetup} element.
 106
 107 The following attributes have been observed on both document root and
 108 nested @code{heading} elements:
 109
 110 @table @asis
 111 @item Optional attribute: @code{creator-version}
 112 The version of the software that created this SPV file.  A string of
 113 the form @code{xxyyzzww} represents software version xx.yy.zz.ww,
 114 e.g.@: @code{21000001} is version 21.0.0.1.  Trailing pairs of zeros
 115 are sometimes omitted, so that @code{21}, @code{210000}, and
 116 @code{21000000} are all version 21.0.0.0 (and the corpus contains all
 117 three of those forms).
 118 @end table
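
A minimal sketch of decoding @code{creator-version}, assuming that
omitted trailing pairs always stand for zeros as described above.  The
function name @code{parse_creator_version} is hypothetical.

```python
def parse_creator_version(s):
    """Turn 'xxyyzzww' into (xx, yy, zz, ww), padding omitted pairs with 0."""
    if not s.isdigit() or len(s) % 2 or not 2 <= len(s) <= 8:
        raise ValueError("unexpected creator-version: %r" % s)
    padded = s.ljust(8, "0")
    return tuple(int(padded[i:i + 2]) for i in range(0, 8, 2))
```

With this, @code{21}, @code{210000}, and @code{21000000} all decode to
the same tuple.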
 119
 120 The following attributes have been observed on document root
 121 @code{heading} elements only:
 122
 123 @table @asis
 124 @item Optional attribute: @code{creator}
 125 The directory of the software that created this SPV file,
 126 e.g.@: @file{C:\PROGRA~1\IBM\SPSS\STATIS~1\22} or
 127 @file{/Applications/IBM/SPSS/Statistics/22/SPSSStatistics.app/Contents/Resources/Java/../../bin}.
 128
 129 @item Optional attribute: @code{creation-date-time}
 130 The date and time at which the SPV file was written, in a
 131 locale-specific format, e.g.@: @code{Friday, May 16, 2014 6:47:37 PM
 132 PDT} or @code{lunedì 17 marzo 2014 3.15.48 CET} or even @code{Friday,
 133 December 5, 2014 5:00:19 o'clock PM EST}.
 134
 135 @item Optional attribute: @code{lockReader}
 136 Whether a reader should be allowed to edit the output.  The possible
 137 values are @code{true} and @code{false}, but the corpus only contains
 138 @code{false}.
 139
 140 @item Optional attribute: @code{schemaLocation}
 141 This is actually an XML Namespace attribute.  A reader may ignore it.
 142 @end table
 143
 144 The following attributes have been observed only on nested
 145 @code{heading} elements:
 146
 147 @table @asis
 148 @item Required attribute: @code{commandName}
 149 The locale-invariant name of the command that produced the output,
 150 e.g.@: @code{Frequencies}, @code{T-Test}, @code{Non Par Corr}.
 151
 152 @item Optional attribute: @code{visibility}
 153 To what degree the output represented by the element is visible.  The
 154 only observed value is @code{collapsed}.
 155
 156 @item Optional attribute: @code{locale}
 157 The locale used for output, in Windows format, which is similar to the
 158 format used in Unix with the underscore replaced by a hyphen, e.g.@:
 159 @code{en-US}, @code{en-GB}, @code{el-GR}, @code{sr-Cyrl-RS}.
 160
 161 @item Optional attribute: @code{olang}
 162 The output language, e.g.@: @code{en}, @code{it}, @code{es},
 163 @code{de}, @code{pt-BR}.
 164 @end table
 165
 166 @item label
 167 Parent: @code{heading} or @code{container} @*
 168 Contents: text
 169
 170 Every @code{heading} and @code{container} holds a @code{label} as its
 171 first child.  The @code{label} of a structure member's root
 172 @code{heading} always contains ``Output''.  Otherwise, the text in @code{label}
 173 describes what it labels, often by naming the statistical procedure
 174 that was executed, e.g.@: ``Frequencies'' or ``T-Test''.  Labels are
 175 often very generic, especially within a @code{container}, e.g.@:
 176 ``Title'' or ``Warnings'' or ``Notes''.  Label text is localized
 177 according to the output language, e.g.@: in Italian a frequency table
 178 procedure is labeled ``Frequenze''.
 179
 180 The corpus contains one example of an empty label, one that contains
 181 no text.
 182
 183 @item container
 184 Parent: @code{heading} @*
 185 Contents: @code{label} [@code{table} | @code{text}]
 186
 187 A @code{container} serves to label a @code{table} or a @code{text}
 188 item.
 189
 190 @table @asis
 191 @item Required attribute: @code{visibility}
 192 Either @code{visible} or @code{hidden}, this indicates whether the
 193 container's content is displayed.
 194
 195 @item Optional attribute: @code{text-align}
 196 Presumably indicates the alignment of text within the container.  The
 197 only observed value is @code{left}.  Observed with nested @code{table}
 198 and @code{text} elements.
 199
 200 @item Optional attribute: @code{width}
 201 The width of the container in the form @code{@var{n}px}, e.g.@:
 202 @code{1097px}.
 203 @end table
 204
 205 @item text
 206 Parent: @code{container} @*
 207 Contents: @code{html}
 208
 209 This @code{text} element is nested inside a @code{container}.  There
 210 is a different @code{text} element that is nested inside a
 211 @code{pageParagraph}.
 212
 213 @table @asis
 214 @item Required attribute: @code{type}
 215 One of @code{title}, @code{log}, or @code{text}.
 216
 217 @item Optional attribute: @code{commandName}
 218 As on the @code{heading} element.  For output not specific to a
 219 command, this is simply @code{log}.  The corpus contains one example
 220 in which @code{commandName} is present but set to the empty string.
 221
 222 @item Optional attribute: @code{creator-version}
 223 As on the @code{heading} element.
 224 @end table
 225
 226 @item html
 227 Parent: @code{text} @*
 228 Contents: cdata
 229
 230 The cdata contains an HTML document.  In some cases, the document
 231 starts with @code{<html>} and ends with @code{</html>}; in others the
 232 @code{html} element is implied.  Generally the HTML includes a
 233 @code{head} element with a CSS stylesheet.  The HTML body often begins
 234 with @code{<BR>}.  The actual content ranges from trivial to simple:
 235 just discarding the CSS and tags yields readable results.
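
The ``discard the CSS and tags'' approach can be sketched with the
standard @code{html.parser} module.  Treating @code{<BR>} and the end of
a paragraph as line breaks is an assumption of this sketch, and the names
@code{TextExtractor} and @code{html_to_text} are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text outside <style>; treat <br> and </p> as line breaks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "style":
            self.in_style = True
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.in_style:
            self.parts.append(data)

def html_to_text(html):
    """Return the readable text of an HTML fragment or document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```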
 236
 237 @table @asis
 238 @item Required attribute: @code{lang}
 239 This always contains @code{en} in the corpus.
 240 @end table
 241
 242 @item table
 243 Parent: @code{container} @*
 244 Contents: @code{tableStructure}
 245
 246 @table @asis
 247 @item Required attribute: @code{commandName}
 248 As on the @code{heading} element.
 249
 250 @item Required attribute: @code{type}
 251 One of @code{table}, @code{note}, or @code{warning}.
 252
 253 @item Required attribute: @code{subType}
 254 The locale-invariant name for the particular kind of output that this
 255 table represents in the procedure.  This can be the same as
 256 @code{commandName} e.g.@: @code{Frequencies}, or different, e.g.@:
 257 @code{Case Processing Summary}.  Generic subtypes @code{Notes} and
 258 @code{Warnings} are often used.
 259
 260 @item Required attribute: @code{tableId}
 261 A number that uniquely identifies the table within the SPV file,
 262 typically a large negative number such as @code{-4147135649387905023}.
 263
 264 @item Optional attribute: @code{creator-version}
 265 As on the @code{heading} element.  In the corpus, this is only present
 266 for version 21 and up and always includes all 8 digits.
 267 @end table
 268
 269 @item tableStructure
 270 Parent: @code{table} @*
 271 Contents: @code{dataPath}
 272
 273 @item dataPath
 274 Parent: @code{tableStructure} @*
 275 Contents: text
 276
 277 Contains the name of the Zip member that holds the table details,
 278 e.g.@: @code{0000000001437_lightTableData.bin}.
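
Because schemas and namespaces can be ignored, a reader can find every
@code{dataPath} by comparing local element names only.  The following is
a minimal sketch using the standard @code{xml.etree.ElementTree} module;
the helper names @code{local_name} and @code{data_paths} are
hypothetical.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]

def data_paths(xml_text):
    """Return the text of every dataPath element, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if local_name(el.tag) == "dataPath"]
```

The same code works whichever of the two @code{viewer-tree} namespace
URIs a file uses, and also if no namespace is declared at all.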
 279
 280 @item pageSetup
 281 Parent: @code{heading} @*
 282 Contents: @code{pageHeader} @code{pageFooter}
 283
 284 @table @asis
 285 @item Required attribute: @code{initial-page-number}
 286 Always @code{1}.
 287
 288 @item Optional attribute: @code{chart-size}
 289 Always @code{as-is} or a localization (!) of it (e.g.@: @code{dimensione
 290 attuale}, @code{Wie vorgegeben}).
 291
 292 @item Optional attribute: @code{margin-left}
 293 @itemx Optional attribute: @code{margin-right}
 294 @itemx Optional attribute: @code{margin-top}
 295 @itemx Optional attribute: @code{margin-bottom}
 296 Margin sizes in the form @code{@var{size}in}, e.g.@: @code{0.25in}.
 297
 298 @item Optional attribute: @code{paper-height}
 299 @itemx Optional attribute: @code{paper-width}
 300 Paper sizes in the form @code{@var{size}in}, e.g.@: @code{8.5in} by
 301 @code{11in} for letter paper or @code{8.267in} by @code{11.692in} for
 302 A4 paper.
 303
 304 @item Optional attribute: @code{reference-orientation}
 305 Always @code{0deg}.
 306
 307 @item Optional attribute: @code{space-after}
 308 Always @code{12pt}.
 309 @end table
 310
 311 @item pageHeader
 312 @itemx pageFooter
 313 Parent: @code{pageSetup} @*
 314 Contents: @code{pageParagraph}*
 315
 316 No attributes.
 317
 318 @item pageParagraph
 319 Parent: @code{pageHeader} or @code{pageFooter} @*
 320 Contents: @code{text}
 321
 322 Text to go at the top or bottom of a page, respectively.
 323
 324 @item text
 325 Parent: @code{pageParagraph} @*
 326 Contents: [cdata]
 327
 328 This @code{text} element is nested inside a @code{pageParagraph}.  There
 329 is a different @code{text} element that is nested inside a
 330 @code{container}.
 331
 332 The element is either empty, or contains cdata that holds almost-XHTML
 333 text: in the corpus, either an @code{html} or @code{p} element.  It is
 334 @emph{almost}-XHTML because the @code{html} element designates the
 335 default namespace as
 336 @code{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
 337 namespace.
 338
 339 The cdata can contain substitution variables: @code{&[Page]} for the
 340 page number and @code{&[PageTitle]} for the page title.
 341
 342 Typical contents (indented for clarity):
 343
 344 @example
 345 <html xmlns="http://xml.spss.com/spss/viewer/viewer-tree">
 346     <head></head>
 347     <body>
 348         <p style="text-align:right; margin-top: 0">Page &[Page]</p>
 349     </body>
 350 </html>
 351 @end example
 352
 353 @table @asis
 354 @item Required attribute: @code{type}
 355 Always @code{text}.
 356 @end table
 357 @end table
 358
 359 @node SPV Light Detail Member Format
 360 @subsection Light Detail Member Format
 361
 362 A ``light'' detail member @file{.bin} consists of a number of sections
 363 concatenated together, terminated by a byte 01:
 364
 365 @example
 366 light-member := header title styles dimensions data 01
 367 @end example
 368
 369 The first section is a 0x27-byte header:
 370
 371 @example
 372 header := 01 00 version 01 (00 | 01) byte*21 00 00 table-id byte*4
 373 version := i1 | i3
 374 table-id := int
 375 @end example
 376
 377 @code{header} includes @code{version}, a version number that affects
 378 the interpretation of some of the other data in the member.  We will
 379 refer to ``version 1'' and ``version 3'' members later on.
 380 @code{table-id} is a binary version of the @code{tableId} attribute in
 381 the structure member that refers to the detail member.  For example,
 382 if @code{tableId} is @code{-4154297861994971133}, then @code{table-id}
 383 would be 0xdca00003, the low 32 bits of that value in two's complement.
 384 The meaning of the other variable parts of the header is not known.
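
A minimal sketch of reading the header with the standard @code{struct}
module.  It assumes that integers are 32 bits wide and little-endian,
which this section does not state, and that @code{table-id} is always
the low 32 bits of @code{tableId}, as it is in the example above.  The
function names are hypothetical.

```python
import struct

HEADER_LEN = 0x27

def parse_light_header(data):
    """Return (version, table_id) from the start of a light detail member.

    Assumption: integers are 32-bit little-endian.
    """
    if len(data) < HEADER_LEN:
        raise ValueError("light member too short for header")
    if data[0:2] != b"\x01\x00" or data[6] != 1 or data[7] not in (0, 1):
        raise ValueError("unexpected bytes at start of header")
    if data[29:31] != b"\x00\x00":
        raise ValueError("unexpected bytes before table-id")
    version, = struct.unpack_from("<i", data, 2)
    if version not in (1, 3):
        raise ValueError("unknown version %d" % version)
    table_id, = struct.unpack_from("<I", data, 31)
    return version, table_id

def expected_table_id(table_id_attr):
    """Low 32 bits of the structure member's tableId attribute."""
    return int(table_id_attr) & 0xFFFFFFFF
```

Comparing the two results gives a cheap consistency check between a
structure member and the detail member it names.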
 385
 386 @example
 387 title := value 01?              /* @r{localized title} */
 388          value 01? 31           /* @r{subtype} */
 389          value 01? 00? 58       /* @r{locale-invariant title} */
 390          (31 value | 58)        /* @r{caption} */
 391          int[n] footnote*[n]    /* @r{footnotes} */
 392 footnote := value (31 value | 58) byte*4
 393 @end example
 394
 395 @example
 396 styles := 00 font*8
 397           int[x1] byte*[x1]
 398           int[x2] byte*[x2]
 399           int[x3] byte*[x3]
 400           int[x4] int*[x4]
 401           string[encoding]
 402           (i0 | i-1) (00 | 01) 00 (00 | 01)
 403           int
 404           byte[decimal] byte[grouping]
 405           int[n-ccs] string*[n-ccs]     /* @r{custom currency} */
 406           styles2
 407
 408 x2 := 00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00 00  /* @r{18 bytes} */
 409
 410 styles2 := i0                           /* @r{version 1} */
 411 styles2 := count(count(x5) count(x6))   /* @r{version 3} */
 412 x5 := byte*33 int[n] int*n
 413 x6 := 01 00 (03 | 04) 00 00 00
 414       string[command] string[subcommand]
 415       string[language] string[charset] string[locale]
 416       (00 | 01) 00 (00 | 01) (00 | 01)
 417       int
 418       byte[decimal] byte[grouping]
 419       byte*8 01
 420       (string[dataset] string[datafile] i0 int i0)?
 421       int[n-ccs] string*[n-ccs]
 422       2e (00 | 01) (i2000000 i0)?
 423 @end example
 424
 425 In every example in the corpus, @code{x1} is 240.  The meaning of the
 426 bytes that follow it is unknown.
 427
 428 In every example in the corpus, @code{x2} is 18 and the bytes that
 429 follow it are @code{00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00
 430 00}.  The meaning of these bytes is unknown.
 431
 432 In every example in the corpus for version 1, @code{x3} is 16 and the
 433 bytes that follow it are @code{00 00 00 01 00 00 00 01 00 00 00 00 01
 434 01 01 01}.  In version 3, observed @code{x3} varies from 117 to 150,
 435 and its bytes include a 1-byte count at offset 0x34.  When the count
 436 is nonzero, a text string of that length at offset 0x35 is the name of
 437 a ``TableLook'', e.g.@: ``Default'' or ``Academic''.
 438
 439 Observed values of @code{x4} vary from 0 to 17.  Out of 7060 examples
 440 in the corpus, it is nonzero only 36 times.
 441
 442 @code{encoding} is a character encoding, usually a Windows code page
 443 such as @code{en_US.windows-1252} or @code{it_IT.windows-1252}.  The
 444 encoding string is itself encoded in US-ASCII.  The rest of the
 445 character strings in the file use this encoding.
 446
 447 @code{decimal} is the decimal point character.  The observed values
 448 are @samp{.} and @samp{,}.
 449
 450 @code{grouping} is the grouping character.  The observed values are
 451 @samp{,}, @samp{.}, @samp{'}, @samp{ }, and zero (presumably
 452 indicating that digits should not be grouped).
 453
 454 @code{n-ccs} is observed as either 0 or 5.  When it is 5, the
 455 following strings are CCA through CCE format strings.  Most commonly
 456 these are all @code{-,,,} but other strings occur.
 457
 458 @example
 459 font := byte[index] 31 string[typeface]
 460         00 00
 461         (10 | 20 | 40 | 50 | 70 | 80)[f1]
 462         41
 463         (i0 | i1 | i2)[f2]
 464         00
 465         (i0 | i2 | i64173)[f3]
 466         (i0 | i1 | i2 | i3)[f4]
 467         string[fgcolor] string[bgcolor]
 468         i0 i0 00
 469         (v3: int[f5] int[f6] int[f7] int[f8])
 470 @end example
 471
 472 Each @code{font}, in order, represents the font style for a different
 473 element: title, caption, footnote, row labels, column labels, corner
 474 labels, data, and layers.
 475
 476 @code{index} is the 1-based index of the @code{font}, i.e.@: 1 for the
 477 first @code{font}, through 8 for the final @code{font}.
 478
 479 @code{typeface} is the string name of the font.  In the corpus, this
 480 is @code{SansSerif} in over 99% of instances and @code{Times New
 481 Roman} in the rest.
 482
 483 @code{fgcolor} and @code{bgcolor} are the foreground color and
 484 background color, respectively.  In the corpus, these are always
 485 @code{#000000} and @code{#ffffff}, respectively.
 486
 487 The meaning of the remaining data is unknown.  It seems likely to
 488 include font sizes, horizontal and vertical alignment, attributes such
 489 as bold or italic, and margins.  @code{f1} is @code{40} most of the
 490 time.  @code{f2} is @code{i1} most of the time for the title and
 491 @code{i0} most of the time for other fonts.
 492
 493 The table below lists the values observed in the corpus.  When a cell
 494 contains a single value, then 99+% of the corpus contains that value.
 495 When a cell contains a pair of values, then the first value is seen in
 496 about two-thirds of the corpus and the second value in about the
 497 remaining one-third.  In fonts that include multiple pairs, values are
 498 correlated, that is, for font 3, f5 = 24, f6 = 24, f7 = 2 appears
 499 about two-thirds of the time, as does the combination of f4 = 0, f6 =
 500 10 for font 7.
 501
 502 @example
 503 font  f1  f2     f3   f4     f5    f6  f7  f8
 504
 505    1  40   1      0    0      8 10/11   1   8
 506    2  40   0      2    1      8 10/11   1   1
 507    3  40   0      2    1  24/11 24/ 8 2/3   4
 508    4  40   0      2    3      8 10/11   1   1
 509    5  40   0      0    1      8 10/11   1   4
 510    6  40   0      2    1      8 10/11   1   4
 511    7  40   0  64173  0/1      8 10/11   1   1
 512    8  40   0      2    3      8 10/11   1   4
 513 @end example
 514
 515 @example
 516 dimensions := int[n-dims] dimension*[n-dims]
 517 dimension := value[name]
 518              byte[d1]
 519              (00 | 01 | 02)[d2]
 520              (i0 | i2)[d3]
 521              (00 | 01)[d4]
 522              (00 | 01)[d5]
 523              01
 524              int[d6]
 525              int[n-categories] category*[n-categories]
 526 @end example
 527
 528 @code{name} is the name of the dimension, e.g.@: @code{Variables},
 529 @code{Statistics}, or a variable name.
 530
 531 @code{d1} is usually 0 but many other values have been observed.
 532
 533 @code{d3} is 2 over 99% of the time.
 534
 535 @code{d5} is 0 over 99% of the time.
 536
 537 @code{d6} is either -1 or the 0-based index of the dimension, e.g.@: 0
 538 for the first dimension, 1 for the second, and so on.  The latter is
 539 the case 98% of the time in the corpus.
 540
 541 @example
 542 category := value[name] (terminal | group)
 543 terminal := 00 00 00 i2 int[index] i0
 544 @end example
 545
 546 @code{name} is the name of the category (or group).
 547
 548 @code{category} can represent a terminal category.  In that case,
 549 @code{index} is a nonnegative integer less than @code{n-categories} in
 550 the @code{dimension} in which the @code{category} is nested (directly
 551 or indirectly).
 552
 553 Alternatively, @code{category} can represent a @code{group} of nested
 554 categories:
 555
 556 @example
 557 group := (00 | 01)[merge] 00 01 (i0 | i2)[data]
 558          i-1 int[n-subcategories] category*[n-subcategories]
 559 @end example
 560
 561 Ordinarily a group has some nested content, so that
 562 @code{n-subcategories} is positive, but a few instances of groups with
 563 @code{n-subcategories} 0 have been observed.
 564
 565 If @code{merge} is 00, the most common value, then the group is really
 566 a distinct group that should be represented as such in the visual
 567 representation and user interface.  If @code{merge} is 01, however,
 568 the categories in this group should be shown and treated as if they
 569 were direct children of the group's parent group (or if it has no
 570 parent group, then direct children of the dimension), and this group's
 571 name is irrelevant and should not be displayed.  (Merged groups can be
 572 nested!)
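
The treatment of merged groups can be sketched as a recursive flattening
step.  The @code{Category} class below is a hypothetical in-memory
representation, not part of the file format: a category counts as
terminal when its @code{index} is set and as a group otherwise.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Category:
    name: str
    index: Optional[int] = None      # set for terminal categories only
    merge: bool = False              # meaningful for groups only
    children: List["Category"] = field(default_factory=list)

def visible_children(categories):
    """Return `categories` with every merged group replaced by its children."""
    result = []
    for cat in categories:
        if cat.index is None and cat.merge:
            # A merged group is not shown itself; its children take its
            # place.  Recurse because merged groups can be nested.
            result.extend(visible_children(cat.children))
        else:
            result.append(cat)
    return result
```

A reader would apply @code{visible_children} to the categories of a
dimension and then again to the children of each group that it displays.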
 573
 574 @code{data} appears to be i2 when all of the categories within a group
 575 are terminal categories that directly represent data values for a
 576 variable (e.g.@: in a frequency table or crosstabulation, a group of
 577 values in a variable being tabulated) and i0 otherwise, but this might
 578 be naive.
 579
 580 @example
 581 data := int[layers] int[rows] int[columns] int*[n-dimensions]
 582         int[n-data] datum*[n-data]
 583 @end example
 584
 585 The values of @code{layers}, @code{rows}, and @code{columns} each
 586 specify the number of dimensions represented in layers or rows or
 587 columns, respectively, and their values sum to the number of
 588 dimensions.
 589
 590 The @code{n-dimensions} integers are a permutation of the 0-based
 591 dimension numbers.  The first @code{layers} of them specify each of
 592 the dimensions represented by layers, the next @code{rows} of them
 593 specify the dimensions represented by rows, and the final
 594 @code{columns} of them specify the dimensions represented by columns.
 595 When there is more than one dimension of a given kind, the inner
 596 dimensions are given first.
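
Splitting the permutation into the three axes is a matter of slicing.
The sketch below uses a hypothetical function name and leaves the
ordering within each axis as found in the file, that is, inner
dimensions first.

```python
def split_axes(n_layers, n_rows, n_columns, dim_order):
    """Split the dimension permutation into (layers, rows, columns) lists."""
    n = n_layers + n_rows + n_columns
    if len(dim_order) != n or sorted(dim_order) != list(range(n)):
        raise ValueError("dim_order is not a permutation of the dimensions")
    layers = dim_order[:n_layers]
    rows = dim_order[n_layers:n_layers + n_rows]
    columns = dim_order[n_layers + n_rows:]
    return layers, rows, columns
```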
 597
 598 @example
 599 datum := int64[index] 00? value      /* @r{version 1} */
 600 datum := int64[index] value          /* @r{version 3} */
 601 @end example
 602
 603 The format of a datum varies slightly from version 1 to version 3: in
 604 version 1 it allows for an extra optional 00 byte.
 605
 606 A datum consists of an index and a value.  Suppose there are @math{d}
 607 dimensions and dimension @math{i} for @math{0 \le i < d} has
 608 @math{n_i} categories.  Consider the datum at coordinates @math{x_i}
 609 for @math{0 \le i < d}; note that @math{0 \le x_i < n_i}.  Then the
 610 index is calculated by the following algorithm:
 611
 612 @display
 613 let index = 0
 614 for each @math{i} from 0 to @math{d - 1}:
 615     index = @math{n_i \times} index + @math{x_i}
 616 @end display
 617
 618 For example, suppose there are 3 dimensions with 3, 4, and 5
 619 categories, respectively.  The datum at coordinates (1, 2, 3) has
 620 index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}.
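
The index calculation translates directly into Python.  The inverse
function, which recovers coordinates from an index, is not described
above; it follows arithmetically from the same formula.  Both function
names are hypothetical.

```python
def datum_index(sizes, coords):
    """Compute a datum's index from per-dimension sizes and coordinates."""
    index = 0
    for n, x in zip(sizes, coords):
        if not 0 <= x < n:
            raise ValueError("coordinate %d out of range for size %d" % (x, n))
        index = n * index + x
    return index

def datum_coords(sizes, index):
    """Invert datum_index(): recover the coordinates from an index."""
    coords = []
    for n in reversed(sizes):
        # The last dimension varies fastest, so peel it off first.
        index, x = divmod(index, n)
        coords.append(x)
    return list(reversed(coords))
```

For the example above, @code{datum_index([3, 4, 5], [1, 2, 3])} yields
33 and @code{datum_coords([3, 4, 5], 33)} yields the original
coordinates.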
 621
 622 @example
 623 value := 00? 00? 00? 00? raw-value
 624 raw-value :=
 625     01 value-mod int[format] double[x]
 626   | 02 value-mod int[format] double[x]
 627     string[varname] string[vallab] (01 | 02 | 03)
 628   | 03 string[local] value-mod string[id] string[c] (00 | 01)[type]
 629   | 04 value-mod int[format] string[vallab] string[varname]
 630     (01 | 02 | 03) string[s]
 631   | 05 value-mod string[varname] string[varlabel] (01 | 02 | 03)
 632   | value-mod string[format] int[n-args] arg*[n-args]
 633 arg :=
 634     i0 value
 635   | int[x] i0 value*[x + 1]      /* @r{x > 0} */
 636 @end example
 637
 638 A @code{value} boils down to a number or a string.  There are several
 639 possibilities, which one can distinguish by the first nonzero byte in
 640 the encoding:
 641
 642 @table @code
 643 @item 01
 644 The numeric value @code{x}, presented to the user formatted according
 645 to @code{format}, which is in the format described for system files.
 646 @xref{System File Output Formats}, for details.  Most commonly
 647 @code{format} has width 40 (the maximum).
 648
 649 An @code{x} equal to the most negative finite double, @code{-DBL_MAX},
 650 represents the system-missing value SYSMIS.  (HIGHEST and LOWEST have
 651 not been observed.)  @xref{System File Format}, for more about these
 652 special values.
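
A sketch of recognizing the system-missing value when decoding
@code{x}.  It assumes that doubles are stored as 8-byte little-endian
IEEE 754 values, which this section does not state; returning
@code{None} for SYSMIS is a choice made for this sketch.

```python
import struct
import sys

# In Python, sys.float_info.max has the same value as C's DBL_MAX.
SYSMIS = -sys.float_info.max

def decode_double(raw):
    """Decode 8 bytes as a double; return None for the system-missing value.

    Assumption: little-endian IEEE 754 representation.
    """
    x, = struct.unpack("<d", raw)
    return None if x == SYSMIS else x
```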
 653
 654 @item 02
 655 Similar to @code{01}, with the additional information that @code{x} is
 656 a value of variable @code{varname} and has value label @code{vallab}.
 657 Both @code{varname} and @code{vallab} can be the empty string, the
 658 latter very commonly.
 659
 660 The meaning of the final byte is unknown.  Possibly it is connected to
 661 whether the value or the label should be displayed.
 662
 663 @item 03
 664 A text string, in two forms: @code{c} is in English, and sometimes
 665 abbreviated or obscure, and @code{local} is localized to the user's
 666 locale.  In an English-language locale, the two strings are often the
 667 same, and in the cases where they differ, @code{local} is more
 668 appropriate for a user interface, e.g.@: @code{c} of ``Not a PxP table
 669 for MCN...'' versus @code{local} of ``Computed only for a PxP table,
 670 where P must be greater than 1.''
 671
 672 @code{c} and @code{local} are always either both empty or both
 673 nonempty.
 674
 675 @code{id} is a brief identifying string whose form seems to resemble a
 676 programming language identifier, e.g.@: @code{cumulative_percent} or
 677 @code{factor_14}.  It is not unique.
 678
 679 @code{type} is 00 for text taken from user input, such as syntax
 680 fragments, expressions, file names, and data set names, and 01 for fixed
 681 text strings such as names of procedures or statistics.  In the former
 682 case, @code{id} is always the empty string; in the latter case,
 683 @code{id} is still sometimes empty.
 684
 685 @item 04
 686 The string value @code{s}, presented to the user formatted according
 687 to @code{format}.  The format for a string is not too interesting, and
 688 clearly invalid formats like A16.39 or A255.127 or A134.1 abound in
 689 the corpus, so readers should probably ignore the format entirely.
 690
 691 @code{s} is a value of variable @code{varname} and has value label
 692 @code{vallab}.  @code{varname} is never empty but @code{vallab} is
 693 commonly empty.
 694
 695 The meaning of the final byte is unknown.
 696
 697 @item 05
 698 Variable @code{varname}, which is rarely observed as empty in the
 699 corpus, with variable label @code{varlabel}, which is often empty.
 700
 701 The meaning of the final byte is unknown.
 702
 703 @item 31
 704 @itemx 58
 705 (These bytes begin a @code{value-mod}.)  A format string, analogous to
 706 @code{printf}, followed by one or more arguments, each of which has
 707 one or more values.  The format string uses the following syntax:
 708
 709 @table @code
 710 @item \%
 711 @itemx \:
 712 @itemx \[
 713 @itemx \]
 714 Each of these expands to the character following @samp{\}.  This is
 715 useful to escape characters that have special meaning in format
 716 strings.  These are effective inside and outside the @code{[@dots{}]}
 717 syntax forms described below.
 718
 719 @item \n
 720 Expands to a new-line, inside or outside the @code{[@dots{}]} forms
 721 described below.
 722
 723 @item ^@var{i}
 724 Expands to a formatted version of argument @var{i}, which must have
 725 only a single value.  For example, @code{^1} would expand to the first
 726 argument's @code{value}.
 727
 728 @item [:@var{a}:]@var{i}
 729 Expands @var{a} for each of the @code{value}s in argument @var{i}.  @var{a}
 730 should contain one or more @code{^@var{j}} conversions, which are
 731 drawn from the values for argument @var{i} in order.  Some examples
 732 from the corpus:
 733
 734 @table @code
 735 @item [:^1:]1
 736 All of the values for the first argument, concatenated.
 737
 738 @item [:^1\n:]1
 739 Expands to the values for the first argument, each followed by
 740 a new-line.
 741
 742 @item [:^1 = ^2:]2
 743 Expands to @code{@var{x} = @var{y}} where @var{x} is the second
 744 argument's first value and @var{y} is its second value.  (This would
 745 be used only if the argument has two values.  With additional values,
 746 the second and third values would be directly concatenated, which
 747 would look funny.)
 748 @end table
 749
 750 @item [@var{a}:@var{b}:]@var{i}
 751 This extends the previous form so that the first values are expanded
 752 using @var{a} and later values are expanded using @var{b}.  For an
 753 unknown reason, within @var{a} the @code{^@var{j}} conversions are
 754 instead written as @code{%@var{j}}.  Some examples from the corpus:
 755
 756 @table @code
 757 @item [%1:*^1:]1
 758 Expands to all of the values for the first argument, separated by
 759 @samp{*}.
 760
 761 @item [%1 = %2:, ^1 = ^2:]1
 762 Given appropriate values for the first argument, expands to @code{X =
 763 1, Y = 2, Z = 3}.
 764
 765 @item [%1:, ^1:]1
 766 Given appropriate values, expands to @code{1, 2, 3}.
 767 @end table
 768 @end table
 769
 770 The format string is localized to the user's locale.
 771 @end table
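
The substitution rules above can be sketched in Python as follows.
This is only an illustration, not the algorithm that SPSS or PSPP
uses.  The function names are hypothetical, every argument is assumed
to be a list of already-formatted strings, and each pass over a
bracketed template is assumed to consume as many values as its highest
conversion number, which is an inference from the corpus examples.
Malformed format strings are not handled.

@example
def expand_template(template, values, marker):
    """Expand one pass of a template; return (text, values used)."""
    out = []
    used = 0
    pos = 0
    while pos < len(template):
        ch = template[pos]
        if ch == "\\" and pos + 1 < len(template):
            # \n is a new-line; other escaped characters stand for
            # themselves.
            nxt = template[pos + 1]
            out.append("\n" if nxt == "n" else nxt)
            pos += 2
        elif ch == marker and template[pos + 1:pos + 2].isdigit():
            end = pos + 1
            while end < len(template) and template[end].isdigit():
                end += 1
            index = int(template[pos + 1:end])
            if index <= len(values):
                out.append(values[index - 1])
            used = max(used, index)
            pos = end
        else:
            out.append(ch)
            pos += 1
    return "".join(out), used


def split_bracket(fmt, pos):
    """Parse [a:b:]i at fmt[pos]; return (a, b, i, next position)."""
    parts = []
    start = pos + 1
    scan = start
    while len(parts) < 2:
        if fmt[scan] == "\\":
            scan += 2              # skip an escaped character
        elif fmt[scan] == ":":
            parts.append(fmt[start:scan])
            start = scan = scan + 1
        else:
            scan += 1
    # fmt[scan] is assumed to be the closing bracket.
    end = scan + 1
    while end < len(fmt) and fmt[end].isdigit():
        end += 1
    return parts[0], parts[1], int(fmt[scan + 1:end]), end


def expand(fmt, args):
    """Expand a format string; args is a list of lists of strings."""
    singles = [a[0] if a else "" for a in args]
    out = []
    pos = 0
    plain_start = 0
    while pos < len(fmt):
        if fmt[pos] == "\\":
            pos += 2
        elif fmt[pos] == "[":
            plain = fmt[plain_start:pos]
            out.append(expand_template(plain, singles, "^")[0])
            first, rest, arg, pos = split_bracket(fmt, pos)
            values = list(args[arg - 1])
            if first:
                # The leading template uses % conversions.
                text, used = expand_template(first, values, "%")
                out.append(text)
                values = values[used:]
            while values:
                text, used = expand_template(rest, values, "^")
                if used == 0:
                    break          # no conversions: avoid looping forever
                out.append(text)
                values = values[used:]
            plain_start = pos
        else:
            pos += 1
    out.append(expand_template(fmt[plain_start:], singles, "^")[0])
    return "".join(out)
@end example

Treating @code{[:@var{a}:]@var{i}} as the two-template form with an
empty first template lets one code path serve both bracketed forms.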
 772
 773 @example
 774 value-mod :=
 775     31 i0 (i0 | i1 string[subscript]) value-mod-i0-v1 /* @r{version 1} */
 776   | 31 i0 (i0 | i1 string[subscript]) value-mod-i0-v3 /* @r{version 3} */
 777   | 31 i1 int[footnote-number] format
 778   | 31 i2 (00 | 01 | 02) 00 (i1 | i2 | i3) format
 779   | 31 i3 00 00 01 00 i2 format
 780   | 58
 781 value-mod-i0-v1 := 00 (i1 | i2) 00 00 int 00 00
 782 value-mod-i0-v3 := count(format-string
 783                          (58 | 31 style)
 784                          (58
 785                           | 31 i0 i0 i0 i0 01 00 (01 | 02 | 08)
 786                             00 08 00 0a 00))
 787
 788 style := 01? 00? 00? 00? 01 string[fgcolor] string[bgcolor] string[font] byte
 789 format := 00 00 count(format-string (58 | 31 style) 58)
 790 format-string := count((i0 (58 | 31 string))?)
 791 @end example
 792
 793 A @code{value-mod} can specify special modifications to a @code{value}:
 794
 795 @itemize @bullet
 796 @item
 797 The @code{footnote-number}, if present, specifies a footnote that the
 798 @code{value} references.  The footnote's marker is shown appended to
 799 the main text of the @code{value}, as a superscript.
 800
 801 @item
 802 The @code{subscript}, if present, specifies a string to append to the
 803 main text of the @code{value}, as a subscript.  The subscript text is
 804 normally a brief indicator, e.g.@: @samp{a} or @samp{a,b}, with its
 805 meaning indicated by the table caption.  In this usage, subscripts are
 806 similar to footnotes; one apparent difference is that a @code{value}
 807 can only reference one footnote but a subscript can list more than one
 808 letter.
 809
 810 @item
 811 The @code{format}, if present, is a format string for substitutions
 812 using the syntax explained previously.  It appears to be an
 813 English-language version of the localized format string in the
 814 @code{value} in which the @code{format} is nested.
 815
 816 @item
 817 The @code{style}, if present, changes the style for this individual
 818 @code{value}.
 819 @end itemize
 820
 821 @node SPV Legacy Detail Member Binary Format
 822 @subsection SPV Legacy Detail Member Binary Format
 823
 824 Whereas the light binary format represents everything about a given
 825 pivot table, the legacy binary format conceptually consists of a
 826 number of named sources, each of which consists of a number of named
 827 series, each of which is a 1-dimensional array of numbers or strings
 828 or a mix.  Thus, the legacy binary file format is quite simple.
 829
 830 @example
 831 legacy-binary := 00 byte[version] int16[n-sources] int[file-size]
 832                  metadata*[n-sources] data*[n-sources]
 833 @end example
 834
 835 @code{version} is a version number that affects the interpretation of
 836 some of the other data in the member.  Versions 0xaf and 0xb0 are
 837 known.  We will refer to ``version 0xaf'' and ``version 0xb0'' members
 838 later on.
 839
 840 A legacy member consists of @code{n-sources} data sources, each of
 841 which has @code{metadata} and @code{data}.
 842
 843 @code{file-size} is the size of the file, in bytes.
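
As a rough illustration, the fixed-size header could be decoded in
Python as shown below.  The sketch assumes that integers are
little-endian and that @code{int} is 32 bits wide; neither is stated
in this subsection.  The function name is hypothetical.

@example
import struct


def parse_legacy_header(data):
    """Return (version, n_sources, file_size) from the first 8 bytes."""
    # Layout: 00, byte[version], int16[n-sources], int[file-size].
    # Little-endian byte order is an assumption.
    zero, version, n_sources, file_size = struct.unpack_from(
        "<BBhi", data, 0)
    if zero != 0:
        raise ValueError("legacy member does not start with a 00 byte")
    if version not in (0xaf, 0xb0):
        raise ValueError("unknown legacy version 0x%02x" % version)
    return version, n_sources, file_size
@end example

A reader could compare the returned @code{file_size} against the
actual member length as a sanity check.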
 844
 845 @example
 846 /* @r{version 0xaf} */
 847 metadata := int[per-series] int[n-series] int[ofs] byte*32[source-name]
 848
 849 /* @r{version 0xb0} */
 850 metadata := int[per-series] int[n-series] int[ofs] byte*64[source-name] int[x]
 851 @end example
 852
 853 A data source consists of @code{n-series} series of data, with
 854 @code{per-series} data values per series.
 855
 856 @code{source-name} is a 32- or 64-byte string padded on the right with
 857 zero bytes.  The names that appear in the corpus are very generic,
 858 usually @code{tableData} or @code{source0}.
 859
 860 The @code{ofs} is the offset, in bytes, from the beginning of the file
 861 to the start of this data source's @code{data}.  This allows programs
 862 to skip to the beginning of the data for a particular source; it is
 863 also important to determine whether a source includes any string data
 864 (see below).
 865
 866 The meaning of @code{x} in version 0xb0 is unknown.
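
A minimal Python sketch of reading one @code{metadata} record
follows.  It assumes little-endian 32-bit integers and an ASCII
@code{source-name}; the function name is hypothetical.  The
@code{x} field in version 0xb0 is skipped because its meaning is
unknown.

@example
import struct


def parse_metadata(data, offset, version):
    """Parse one metadata record; return (fields, offset past it)."""
    name_len = 32 if version == 0xaf else 64
    per_series, n_series, ofs = struct.unpack_from("<iii", data, offset)
    offset += 12
    raw_name = data[offset:offset + name_len]
    offset += name_len
    if version == 0xb0:
        offset += 4            # skip x, whose meaning is unknown
    # The name is padded on the right with zero bytes.
    name = raw_name.rstrip(b"\0").decode("ascii", "replace")
    return (per_series, n_series, ofs, name), offset
@end example

Under these assumptions a record occupies 44 bytes in version 0xaf and
80 bytes in version 0xb0.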
 867
 868 @example
 869 data := numeric-data string-data?
 870 numeric-data := numeric-series*[n-series]
 871 numeric-series := byte*288[series-name] double*[per-series]
 872 @end example
 873
 874 Data follow the metadata in the legacy binary format, with sources in
 875 the same order.  Each series begins with a @code{series-name}, which
 876 generally indicates its role in the pivot table, e.g.@: ``cell'',
 877 ``cellFormat'', ``dimension0categories'', ``dimension0group0''.  The
 878 name is followed by the data, one double per element in the series.  A
 879 double equal to @code{-DBL_MAX}, the most negative finite double,
 880 represents the system-missing value SYSMIS.
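
The following Python sketch reads one @code{numeric-series} and maps
SYSMIS to @code{None}.  It assumes little-endian IEEE 754 doubles and
a @code{series-name} terminated or padded by zero bytes, which this
subsection does not state explicitly.  The function name is
hypothetical.

@example
import struct
import sys

SYSMIS = -sys.float_info.max       # the C constant -DBL_MAX


def parse_numeric_series(data, offset, per_series):
    """Return (name, values, offset past the series)."""
    raw_name = data[offset:offset + 288]
    # Assumption: the name ends at the first zero byte.
    name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
    offset += 288
    values = struct.unpack_from("<%dd" % per_series, data, offset)
    offset += 8 * per_series
    values = [None if v == SYSMIS else v for v in values]
    return name, values, offset
@end example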
 881
 882 @example
 883 string-data := i1 string[source-name] pairs labels
 884
 885 pairs := int[n-string-series] pair-series*[n-string-series]
 886 pair-series := string[pair-series-name] int[n-pairs] pair*[n-pairs]
 887 pair := int[i] int[j]
 888
 889 labels := int[n-labels] label*[n-labels]
 890 label := int[frequency] string[s]
 891 @end example
 892
 893 A source may include a mix of numeric and string data values.  When a
 894 source includes any string data, the data values that are strings are
 895 set to SYSMIS in the @code{numeric-series}, and @code{string-data}
 896 follows the @code{numeric-data}.  To reliably determine whether a
 897 source includes @code{string-data}, the reader should check whether
 898 the offset following the @code{numeric-data} is the offset of the next
 899 source, as indicated by its @code{metadata} (or end of file, in the
 900 case of the last source in a file).
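
The check described above can be sketched in Python.  The sketch
assumes 8-byte doubles and that the metadata for every source has
already been collected, in file order, as
@code{(per_series, n_series, ofs)} tuples.  The function name and the
tuple layout are hypothetical.

@example
def has_string_data(sources, index, file_size):
    """Report whether source number INDEX is followed by string-data."""
    per_series, n_series, ofs = sources[index]
    # Each numeric-series is a 288-byte name plus per_series doubles.
    numeric_end = ofs + n_series * (288 + 8 * per_series)
    if index + 1 < len(sources):
        next_start = sources[index + 1][2]
    else:
        next_start = file_size     # last source: compare to end of file
    return numeric_end != next_start
@end example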
 901
 902 @code{string-data} repeats the name of the source.
 903
 904 The string data overlays the numeric data.  @code{n-string-series} is
 905 the number of series within the source that include string data.  More
 906 precisely, it is the 1-based index of the last series in the source
 907 that includes any string data; thus, it would be 4 if there are 5
 908 series and only the fourth one includes string data.
 909
 910 Each @code{pair-series} consists of a sequence of 0 or more pairs, each
 911 of which maps from a 0-based index within the series @code{i} to a
 912 0-based label index @code{j}.  The pair @code{i} = 2, @code{j} = 3,
 913 for example, would mean that the third data value (with value SYSMIS)
 914 is to be replaced by the string of the fourth label.
 915
 916 The labels themselves follow the pairs.  The valuable part of each
 917 label is the string @code{s}.  Each label also includes a
 918 @code{frequency} that reports the number of pairs that reference it
 919 (although this is not useful).
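
Once the numeric series, pairs, and labels have been parsed, the
overlay itself is simple.  The Python sketch below assumes that the
series are lists of values in file order, that @code{pair_series}
holds one list of @code{(i, j)} tuples for each of the first
@code{n-string-series} series, and that @code{labels} holds the label
strings @code{s}.  These names and layouts are hypothetical.

@example
def overlay_strings(series, pair_series, labels):
    """Return a copy of SERIES with string labels substituted."""
    result = [list(values) for values in series]
    # zip() stops after the series that have pairs, so any series
    # beyond n-string-series are left untouched.
    for values, pairs in zip(result, pair_series):
        for i, j in pairs:
            values[i] = labels[j]  # both indexes are 0-based
    return result
@end example

With the pair @code{i} = 2, @code{j} = 3 from the example above, the
third value of the series is replaced by the fourth label.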