1 @node SPSS Viewer File Format
2 @chapter SPSS Viewer File Format
4 SPSS Viewer or @file{.spv} files, here called SPV files, are written
5 by SPSS 16 and later to represent the contents of its output editor.
6 This chapter documents the format, based on examination of a corpus of
7 about 500 files from a variety of sources. This description is
8 detailed enough to read SPV files, but probably not enough to write
11 SPSS 15 and earlier versions use a completely different output format
12 based on the Microsoft Compound Document Format. This format is not
15 An SPV file is a Zip archive that can be read with @command{zipinfo}
16 and @command{unzip} and similar programs. The final member in the Zip
17 archive is a file named @file{META-INF/MANIFEST.MF}. This structure
18 makes SPV files resemble Java ``JAR'' files (and ODF files), but
19 whereas a JAR manifest contains a sequence of colon-delimited
20 key/value pairs, an SPV manifest contains the string
21 @samp{allowPivoting=true}, without a new-line. (This string may be
22 the best way to identify an SPV file; it is invariant across the corpus.)
25 The rest of the members in an SPV file's Zip archive fall into two
26 categories: @dfn{structure} and @dfn{detail} members. Structure
27 member names begin with @file{outputViewer@var{nnnnnnnnnn}}, where
28 each @var{n} is a decimal digit, and end with @file{.xml}, and often
29 include the string @file{_heading} in between. Each of these members
30 represents some kind of output item (a table, a heading, a block of
31 text, etc.) or a group of them. The member whose output goes at the
32 beginning of the document is numbered 0, the next member in the output
33 is numbered 1, and so on.
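As a rough illustration of the container layout described above, the following Python sketch checks the manifest and lists the structure members in document order. The function names @code{looks_like_spv} and @code{structure_members} are invented for this example, and the regular expression is only one reading of the naming rule given above.

```python
import re
import zipfile

# Structure members: "outputViewer" + 10 decimal digits, anything, ".xml".
STRUCTURE_RE = re.compile(r"outputViewer(\d{10}).*\.xml")

def looks_like_spv(zf):
    """Return True if the manifest starts with the expected marker string."""
    try:
        manifest = zf.read("META-INF/MANIFEST.MF")
    except KeyError:            # the archive has no manifest member
        return False
    return manifest.startswith(b"allowPivoting=true")

def structure_members(zf):
    """Return structure member names sorted by their 10-digit number."""
    found = []
    for name in zf.namelist():
        m = STRUCTURE_RE.fullmatch(name)
        if m:
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]
```

Both functions take an open @code{zipfile.ZipFile}, so a caller can reuse one archive handle for the manifest check, the listing, and later reads of detail members.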
35 Structure members contain XML. This XML is sometimes self-contained,
36 but it often references detail members in the Zip archive, which are named
40 @item @file{@var{prefix}_table.xml} and @file{@var{prefix}_tableData.bin}
41 @itemx @file{@var{prefix}_lightTableData.bin}
42 The structure of a table plus its data. Older SPV files pair a
43 @file{@var{prefix}_table.xml} file that describes the table's
44 structure with a binary @file{@var{prefix}_tableData.bin} file that
45 gives its data. Newer SPV files (the majority of those in the corpus)
46 instead include a single @file{@var{prefix}_lightTableData.bin} file
47 that incorporates both into a single binary format.
49 @item @file{@var{prefix}_warning.xml} and @file{@var{prefix}_warningData.bin}
50 @itemx @file{@var{prefix}_lightWarningData.bin}
51 Same format used for tables, with a different name.
53 @item @file{@var{prefix}_notes.xml} and @file{@var{prefix}_notesData.bin}
54 @itemx @file{@var{prefix}_lightNotesData.bin}
55 Same format used for tables, with a different name.
57 @item @file{@var{prefix}_chartData.bin} and @file{@var{prefix}_chart.xml}
58 The structure of a chart plus its data. Charts do not have a
61 @item @file{@var{prefix}_model.scf}
62 @itemx @file{@var{prefix}_pmml.scf}
63 Not yet investigated. The corpus contains only one example of each.
65 @item @file{@var{prefix}_stats.xml}
66 Not yet investigated. The corpus contains few examples.
69 The @file{@var{prefix}} in the names of the detail members is
70 typically an 11-digit decimal number that increases for each item,
71 tending to skip values. Older SPV files use different naming
72 conventions. Structure members refer to detail members by name, and so
73 their exact names do not matter to readers as long as they are unique.
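Because only the suffix of a detail member name carries meaning, a reader can classify members with a simple lookup. The sketch below is illustrative: the table @code{DETAIL_SUFFIXES}, the function @code{classify_detail_member}, and the role labels (``structure'', ``data'', ``light'', ``unknown'') are names chosen for this example, not terms used by SPSS.

```python
# Suffix -> (kind of output, role of the member).
# "light" marks the newer single-file binary format.
DETAIL_SUFFIXES = {
    "_table.xml": ("table", "structure"),
    "_tableData.bin": ("table", "data"),
    "_lightTableData.bin": ("table", "light"),
    "_warning.xml": ("warning", "structure"),
    "_warningData.bin": ("warning", "data"),
    "_lightWarningData.bin": ("warning", "light"),
    "_notes.xml": ("notes", "structure"),
    "_notesData.bin": ("notes", "data"),
    "_lightNotesData.bin": ("notes", "light"),
    "_chart.xml": ("chart", "structure"),
    "_chartData.bin": ("chart", "data"),
    "_model.scf": ("model", "unknown"),
    "_pmml.scf": ("pmml", "unknown"),
    "_stats.xml": ("stats", "unknown"),
}

def classify_detail_member(name):
    """Return (kind, role) for a detail member name, or None."""
    for suffix, info in DETAIL_SUFFIXES.items():
        if name.endswith(suffix):
            return info
    return None
```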
76 * SPV Structure Member Format::
77 * SPV Light Detail Member Format::
78 * SPV Legacy Detail Member Binary Format::
81 @node SPV Structure Member Format
82 @section Structure Member Format
84 Structure members' XML files claim conformance with a collection of
85 XML Schemas. These schemas are distributed, under a nonfree license,
86 with SPSS binaries. Fortunately, the schemas are not necessary to
87 understand the structure members. To a degree, the schemas can even
88 be deceptive because they document elements and attributes that are
89 not in the corpus and do not document elements and attributes that are
92 Structure members use a different XML namespace for each schema, but
93 these namespaces are not entirely consistent. In some SPV files, for
94 example, the @code{viewer-tree} schema is associated with namespace
95 @indicateurl{http://xml.spss.com/spss/viewer-tree} and in others with
96 @indicateurl{http://xml.spss.com/spss/viewer/viewer-tree} (note the
97 additional @file{viewer/}). Under either name, the schema URIs are
98 not resolvable to obtain the schemas themselves.
100 One may ignore all of the above in interpreting a structure member.
101 The actual XML has a simple and straightforward form that does not
102 require a reader to take schemas or namespaces into account.
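One way to act on this advice is to discard the namespace part of every element name while walking the tree. The Python sketch below does so with the standard @code{xml.etree.ElementTree} module and collects the text of @code{dataPath} elements (documented later in this section). The helper names @code{local_name} and @code{data_paths} are invented for this example.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def data_paths(xml_text):
    """Return the text of every dataPath element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter()
            if local_name(el.tag) == "dataPath" and el.text]
```

Comparing local names this way works equally for either of the two namespace URIs mentioned above, and for members that declare no namespace at all.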
104 The elements found in structure members are documented below. For
105 each element, we note the possible parent elements and the element's
106 contents. The contents are specified as pseudo-regular expressions
107 with the following conventions:
120 Grouping multiple elements.
125 @item @var{a} @math{|} @var{b}
126 A choice between @var{a} and @var{b}.
129 Zero or more @var{x}.
133 For a diagram illustrating the hierarchy of elements within an SPV
134 structure member, please refer to a PDF version of the manual.
138 The following diagram shows the hierarchy of elements within an SPV
139 structure member. Edges point from parent to child elements.
140 Unlabeled edges indicate that the child appears exactly once; edges
141 labeled with *, zero or more times; edges labeled with ?, zero or one
143 @center @image{dev/spv-structure, 5in}
147 * SPV heading Element::
148 * SPV label Element::
149 * SPV container Element::
150 * SPV text Element (Inside @code{container})::
152 * SPV table Element::
153 * SPV tableStructure Element::
154 * SPV dataPath Element::
155 * SPV pageSetup Element::
156 * SPV pageHeader and pageFooter Elements::
157 * SPV pageParagraph Element::
158 * SPV @code{text} Element (Inside @code{pageParagraph})::
161 @node SPV heading Element
162 @subsection The @code{heading} Element
164 Parent: Document root or @code{heading} @*
165 Contents: [@code{pageSetup}] @code{label} (@code{container} @math{|} @code{heading})*
167 The root of a structure member is a @code{heading}, which represents a
168 section of output beginning with a title (the @code{label}) and
169 ordinarily followed by content containers or further nested
170 (sub)-sections of output.
172 The document root heading, only, may also contain a @code{pageSetup}
175 The following attributes have been observed on both document root and
176 nested @code{heading} elements.
178 @defvr {Optional} creator-version
179 The version of the software that created this SPV file. A string of
180 the form @code{xxyyzzww} represents software version xx.yy.zz.ww,
181 e.g.@: @code{21000001} is version 21.0.0.1. Trailing pairs of zeros
182 are sometimes omitted, so that @code{21}, @code{210000}, and
183 @code{21000000} are all version 21.0.0.0 (and the corpus contains all
184 three of those forms).
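A reader that wants to compare versions can normalize all of these spellings by restoring the omitted trailing zeros first. The following Python sketch shows one way to do this; the function name @code{parse_creator_version} is an invention of this example.

```python
def parse_creator_version(text):
    """Turn 'xxyyzzww' (possibly shortened) into a 4-tuple of ints."""
    padded = text.ljust(8, "0")     # restore omitted trailing "00" pairs
    return tuple(int(padded[i:i + 2]) for i in range(0, 8, 2))
```

Tuples compare element by element in Python, so the results can be ordered directly, e.g.@: to test whether a file was written by version 21 or later.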
188 The following attributes have been observed on document root
189 @code{heading} elements only:
191 @defvr {Optional} @code{creator}
192 The directory in the file system of the software that created this SPV
196 @defvr {Optional} @code{creation-date-time}
197 The date and time at which the SPV file was written, in a
198 locale-specific format, e.g.@: @code{Friday, May 16, 2014 6:47:37 PM
199 PDT} or @code{lunedì 17 marzo 2014 3.15.48 CET} or even @code{Friday,
200 December 5, 2014 5:00:19 o'clock PM EST}.
203 @defvr {Optional} @code{lockReader}
204 Whether a reader should be allowed to edit the output. The possible
205 values are @code{true} and @code{false}, but the corpus only contains
209 @defvr {Optional} @code{schemaLocation}
210 This is actually an XML Namespace attribute. A reader may ignore it.
214 The following attributes have been observed only on nested
215 @code{heading} elements:
217 @defvr {Required} @code{commandName}
218 The locale-invariant name of the command that produced the output,
219 e.g.@: @code{Frequencies}, @code{T-Test}, @code{Non Par Corr}.
222 @defvr {Optional} @code{visibility}
223 To what degree the output represented by the element is visible. The
224 only observed value is @code{collapsed}.
227 @defvr {Optional} @code{locale}
228 The locale used for output, in Windows format, which is similar to the
229 format used in Unix with the underscore replaced by a hyphen, e.g.@:
230 @code{en-US}, @code{en-GB}, @code{el-GR}, @code{sr-Cyrl-RS}.
233 @defvr {Optional} @code{olang}
234 The output language, e.g.@: @code{en}, @code{it}, @code{es},
235 @code{de}, @code{pt-BR}.
238 @node SPV label Element
239 @subsection The @code{label} Element
241 Parent: @code{heading} or @code{container} @*
244 Every @code{heading} and @code{container} holds a @code{label} as its
245 first child. The root @code{heading} in a structure member always
246 contains the string ``Output''. Otherwise, the text in @code{label}
247 describes what it labels, often by naming the statistical procedure
248 that was executed, e.g.@: ``Frequencies'' or ``T-Test''. Labels are
249 often very generic, especially within a @code{container}, e.g.@:
250 ``Title'' or ``Warnings'' or ``Notes''. Label text is localized
251 according to the output language, e.g.@: in Italian a frequency table
252 procedure is labeled ``Frequenze''.
254 The corpus contains one example of an empty label, one that contains
257 This element has no attributes.
259 @node SPV container Element
260 @subsection The @code{container} Element
262 Parent: @code{heading} @*
263 Contents: @code{label} [@code{table} @math{|} @code{text}]
265 A @code{container} serves to label a @code{table} or a @code{text}
268 This element has the following attributes.
270 @defvr {Required} @code{visibility}
271 Either @code{visible} or @code{hidden}, this indicates whether the
272 container's content is displayed.
275 @defvr {Optional} @code{text-align}
276 Presumably indicates the alignment of text within the container. The
277 only observed value is @code{left}. Observed with nested @code{table}
278 and @code{text} elements.
281 @defvr {Optional} @code{width}
282 The width of the container in the form @code{@var{n}px}, e.g.@:
286 @node SPV text Element (Inside @code{container})
287 @subsection The @code{text} Element (Inside @code{container})
289 Parent: @code{container} @*
290 Contents: @code{html}
292 This @code{text} element is nested inside a @code{container}. There
293 is a different @code{text} element that is nested inside a
294 @code{pageParagraph}.
296 This element has the following attributes.
298 @defvr {Required} @code{type}
299 One of @code{title}, @code{log}, or @code{text}.
302 @defvr {Optional} @code{commandName}
303 As on the @code{heading} element. For output not specific to a
304 command, this is simply @code{log}. The corpus contains one example
305 in which @code{commandName} is present but set to the empty string.
308 @defvr {Optional} @code{creator-version}
309 As on the @code{heading} element.
312 @node SPV html Element
313 @subsection The @code{html} Element
315 Parent: @code{text} @*
318 The CDATA contains an HTML document. In some cases, the document
319 starts with @code{<html>} and ends with @code{</html>}; in others the
320 @code{html} element is implied. Generally the HTML includes a
321 @code{head} element with a CSS stylesheet. The HTML body often begins
322 with @code{<BR>}. The actual content ranges from trivial to simple:
323 just discarding the CSS and tags yields readable results.
325 This element has the following attributes.
327 @defvr {Required} @code{lang}
328 This always contains @code{en} in the corpus.
331 @node SPV table Element
332 @subsection The @code{table} Element
334 Parent: @code{container} @*
335 Contents: @code{tableStructure}
337 This element has the following attributes.
339 @defvr {Required} @code{commandName}
340 As on the @code{heading} element.
343 @defvr {Required} @code{type}
344 One of @code{table}, @code{note}, or @code{warning}.
347 @defvr {Required} @code{subType}
348 The locale-invariant name for the particular kind of output that this
349 table represents in the procedure. This can be the same as
350 @code{commandName} e.g.@: @code{Frequencies}, or different, e.g.@:
351 @code{Case Processing Summary}. Generic subtypes @code{Notes} and
352 @code{Warnings} are often used.
355 @defvr {Required} @code{tableId}
356 A number that uniquely identifies the table within the SPV file,
357 typically a large negative number such as @code{-4147135649387905023}.
360 @defvr {Optional} @code{creator-version}
361 As on the @code{heading} element. In the corpus, this is only present
362 for version 21 and up and always includes all 8 digits.
365 @node SPV tableStructure Element
366 @subsection The @code{tableStructure} Element
368 Parent: @code{table} @*
369 Contents: @code{dataPath}
371 This element has no attributes.
373 @node SPV dataPath Element
374 @subsection The @code{dataPath} Element
376 Parent: @code{tableStructure} @*
379 Contains the name of the Zip member that holds the table details,
380 e.g.@: @code{0000000001437_lightTableData.bin}.
382 This element has no attributes.
384 @node SPV pageSetup Element
385 @subsection The @code{pageSetup} Element
387 Parent: @code{heading} @*
388 Contents: @code{pageHeader} @code{pageFooter}
390 This element has the following attributes.
392 @defvr {Required} @code{initial-page-number}
396 @defvr {Optional} @code{chart-size}
397 Always @code{as-is} or a localization (!) of it (e.g.@: @code{dimensione
398 attuale}, @code{Wie vorgegeben}).
401 @defvr {Optional} @code{margin-left}
402 @defvrx {Optional} @code{margin-right}
403 @defvrx {Optional} @code{margin-top}
404 @defvrx {Optional} @code{margin-bottom}
405 Margin sizes in the form @code{@var{size}in}, e.g.@: @code{0.25in}.
408 @defvr {Optional} @code{paper-height}
409 @defvrx {Optional} @code{paper-width}
410 Paper sizes in the form @code{@var{size}in}, e.g.@: @code{8.5in} by
411 @code{11in} for letter paper or @code{8.267in} by @code{11.692in} for
415 @defvr {Optional} @code{reference-orientation}
419 @defvr {Optional} @code{space-after}
423 @node SPV pageHeader and pageFooter Elements
424 @subsection The @code{pageHeader} and @code{pageFooter} Elements
426 Parent: @code{pageSetup} @*
427 Contents: @code{pageParagraph}*
429 This element has no attributes.
431 @node SPV pageParagraph Element
432 @subsection The @code{pageParagraph} Element
434 Parent: @code{pageHeader} or @code{pageFooter} @*
435 Contents: @code{text}
437 Text to go at the top or bottom of a page, respectively.
439 This element has no attributes.
441 @node SPV @code{text} Element (Inside @code{pageParagraph})
442 @subsection The @code{text} Element (Inside @code{pageParagraph})
444 Parent: @code{pageParagraph} @*
447 This @code{text} element is nested inside a @code{pageParagraph}. There
448 is a different @code{text} element that is nested inside a
451 The element is either empty, or contains CDATA that holds almost-XHTML
452 text: in the corpus, either an @code{html} or @code{p} element. It is
453 @emph{almost}-XHTML because the @code{html} element designates the
455 @code{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
456 namespace, and because the CDATA can contain substitution variables:
457 @code{&[Page]} for the page number and @code{&[PageTitle]} for the
460 Typical contents (indented for clarity):
463 <html xmlns="http://xml.spss.com/spss/viewer/viewer-tree">
466 <p style="text-align:right; margin-top: 0">Page &[Page]</p>
471 This element has the following attributes.
473 @defvr {Required} @code{type}
477 @node SPV Light Detail Member Format
478 @section Light Detail Member Format
480 This section describes the format of ``light'' detail member
481 @file{.bin} files. These files have a binary format which we describe
482 here in terms of a context-free grammar using the following conventions:
486 Names of nonterminals are written in CamelCaps style.
488 @item 00, 01, @dots{}, ff.
489 Bytes with fixed values are written in hexadecimal:
491 @item i0, i1, @dots{}, i9, i10, i11, @dots{}
492 32-bit integers with fixed values are written in decimal, prefixed by
499 An arbitrary 32-bit integer.
502 An arbitrary 64-bit IEEE floating-point number.
505 A 32-bit integer followed by the specified number of bytes of
506 character data, encoded in UTF-8.
509 @var{x} is optional, e.g.@: 00? is an optional zero byte.
511 @item @var{x}*@var{n}
512 @var{x} is repeated @var{n} times, e.g.@: byte*10 for ten arbitrary bytes.
514 @item @var{x}[@var{name}]
515 Gives @var{x} the specified @var{name}. Names are useful in textual
516 explanations. They are also used, also bracketed, to indicate counts,
517 e.g.@: int[@t{n}] byte*[@t{n}] for a 32-bit integer followed by the
518 specified number of arbitrary bytes.
520 @item @var{a} @math{|} @var{b}
521 Either @var{a} or @var{b}.
524 Parentheses are used for grouping to make precedence clear, especially
525 in the presence of @math{|}, e.g.@: in 00 (01 @math{|} 02 @math{|} 03)
529 A 32-bit integer that indicates the number of bytes in @var{x},
530 followed by @var{x} itself.
533 In a version 1 @file{.bin} file, @var{x}; in version 3, nothing.
534 (The @file{.bin} file header indicates the version.)
537 In a version 3 @file{.bin} file, @var{x}; in version 1, nothing.
540 All integer and floating-point values in this format use little-endian byte order.
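The primitive notations above map naturally onto a small cursor class. The Python sketch below is a minimal illustration built on the standard @code{struct} module. The class name @code{Reader} and its method names are invented here, integers are assumed to be signed (the grammar uses values such as i-1), and strings are decoded as UTF-8 as the notation states.

```python
import struct

class Reader:
    """Cursor over a bytes object holding a light detail member."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def take(self, n):
        """Return the next n bytes and advance past them."""
        chunk = self.data[self.pos:self.pos + n]
        if len(chunk) != n:
            raise EOFError("unexpected end of member")
        self.pos += n
        return chunk

    def int32(self):
        return struct.unpack("<i", self.take(4))[0]    # little-endian

    def double(self):
        return struct.unpack("<d", self.take(8))[0]    # IEEE 754 binary64

    def string(self):
        return self.take(self.int32()).decode("utf-8")

    def count(self):
        """Return a Reader limited to the bytes covered by count(x)."""
        return Reader(self.take(self.int32()))
```

Returning a nested @code{Reader} from @code{count} keeps a parser for the counted content from running past the end of that content.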
543 A ``light'' detail member @file{.bin} consists of a number of sections
544 concatenated together, terminated by a byte 01:
547 LightMember @result{} Header Title Styles Dimensions Data 01
550 The first section is a 0x27-byte header:
555 (i1 @math{|} i3)[@t{version}]
556 01 (00 @math{|} 01) byte*21 00 00
557 int[@t{table-id}] byte*4
560 The Header includes @code{version}, a version number that affects the
561 interpretation of some of the other data in the member. We will refer
562 to ``version 1'' and ``version 3'' members later on. @code{table-id}
563 is a binary version of the @code{tableId} attribute in the structure
564 member that refers to the detail member. For example, if
565 @code{tableId} is @code{-4154297861994971133}, then @code{table-id}
566 would be 0xdca00003. The meaning of the other variable parts of the header is not known.
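The Python sketch below extracts @code{version} and @code{table-id} from a header and checks them against a @code{tableId} attribute. It rests on two assumptions that the text above does not state outright. First, the field offsets are derived from the stated 0x27-byte total: the fields from @code{version} onward add up to 37 bytes, which places @code{version} at offset 2 and @code{table-id} 8 bytes before the end. Second, the example value above is consistent with @code{table-id} being the low 32 bits of @code{tableId}, and the sketch treats that as the rule. The function names are invented here.

```python
import struct

HEADER_SIZE = 0x27

def parse_header(data):
    """Return (version, table_id) from the first 0x27 bytes of a member."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short for a light member header")
    # Assumed offsets: version at 2, table-id at HEADER_SIZE - 8 (see text).
    (version,) = struct.unpack_from("<i", data, 2)
    (table_id,) = struct.unpack_from("<I", data, HEADER_SIZE - 8)
    if version not in (1, 3):
        raise ValueError("unexpected version %d" % version)
    return version, table_id

def table_id_matches(table_id_attr, table_id):
    """Compare a decimal tableId attribute with a header table-id.

    Assumption: table-id holds the low 32 bits of tableId.
    """
    return (int(table_id_attr) & 0xFFFFFFFF) == table_id
```

Python integers behave as two's complement values of unlimited width under @code{&}, so the mask works for negative @code{tableId} values without further conversion.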
572 Value[@t{subtype}] 01? 31
573 Value[@t{c}] 01? 00? 58
574 (58 @math{|} 31 Value[@t{caption}])
575 int[@t{n}] Footnote*[@t{n}]
578 (58 @math{|} 31 Value[@t{marker}])
585 int[@t{n1}] byte*[@t{n1}]
586 int[@t{n2}] byte*[@t{n2}]
587 int[@t{n3}] byte*[@t{n3}]
588 int[@t{n4}] int*[@t{n4}]
590 (i0 @math{|} i-1) (00 @math{|} 01) 00 (00 @math{|} 01)
592 byte[@t{decimal}] byte[@t{grouping}]
593 int[@t{n-ccs}] string*[@t{n-ccs}]
595 v3(count(count(X5) count(X6)))
597 X5 @result{} byte*33 int[@t{n}] int*[@t{n}]
599 01 00 (03 @math{|} 04) 00 00 00
600 string[@t{command}] string[@t{subcommand}]
601 string[@t{language}] string[@t{charset}] string[@t{locale}]
602 (00 @math{|} 01) 00 (00 @math{|} 01) (00 @math{|} 01)
604 byte[@t{decimal}] byte[@t{grouping}]
606 (string[@t{dataset}] string[@t{datafile}] i0 int i0)?
607 int[@t{n-ccs}] string*[@t{n-ccs}]
608 2e (00 @math{|} 01) (i2000000 i0)?
611 In every example in the corpus, @code{n1} is 240. The meaning of the
612 bytes that follow it is unknown.
614 In every example in the corpus, @code{n2} is 18 and the bytes that
615 follow it are @code{00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00
616 00}. The meaning of these bytes is unknown.
618 In every example in the corpus for version 1, @code{n3} is 16 and the
619 bytes that follow it are @code{00 00 00 01 00 00 00 01 00 00 00 00 01
620 01 01 01}. In version 3, observed @code{n3} varies from 117 to 150,
621 and its bytes include a 1-byte count at offset 0x34. When the count
622 is nonzero, a text string of that length at offset 0x35 is the name of
623 a ``TableLook'', e.g.@: ``Default'' or ``Academic''.
625 Observed values of @code{n4} vary from 0 to 17. Out of 7060 examples
626 in the corpus, it is nonzero only 36 times.
628 @code{encoding} is a character encoding, usually a Windows code page
629 such as @code{en_US.windows-1252} or @code{it_IT.windows-1252}. The
630 encoding string is itself encoded in US-ASCII. The rest of the
631 character strings in the file use this encoding.
633 @code{decimal} is the decimal point character. The observed values
634 are @samp{.} and @samp{,}.
636 @code{grouping} is the grouping character. The observed values are
637 @samp{,}, @samp{.}, @samp{'}, @samp{ }, and zero (presumably
638 indicating that digits should not be grouped).
640 @code{n-ccs} is observed as either 0 or 5. When it is 5, the
641 following strings are CCA through CCE format strings. Most commonly
642 these are all @code{-,,,} but other strings occur.
646 byte[@t{index}] 31 string[@t{typeface}]
648 (10 @math{|} 20 @math{|} 40 @math{|} 50 @math{|} 70 @math{|} 80)[@t{f1}]
650 (i0 @math{|} i1 @math{|} i2)[@t{f2}]
652 (i0 @math{|} i2 @math{|} i64173)[@t{f3}]
653 (i0 @math{|} i1 @math{|} i2 @math{|} i3)[@t{f4}]
654 string[@t{fgcolor}] string[@t{bgcolor}]
656 v3(int[@t{f5}] int[@t{f6}] int[@t{f7}] int[@t{f8}]))
659 Each Font, in order, represents the font style for a different
660 element: title, caption, footnote, row labels, column labels, corner
661 labels, data, and layers.
663 @code{index} is the 1-based index of the Font, i.e.@: 1 for the first
664 Font, through 8 for the final Font.
666 @code{typeface} is the string name of the font. In the corpus, this
667 is @code{SansSerif} in over 99% of instances and @code{Times New
670 @code{fgcolor} and @code{bgcolor} are the foreground color and
671 background color, respectively. In the corpus, these are always
672 @code{#000000} and @code{#ffffff}, respectively.
674 The meaning of the remaining data is unknown. It seems likely to
675 include font sizes, horizontal and vertical alignment, attributes such
676 as bold or italic, and margins. @code{f1} is @code{40} most of the
677 time. @code{f2} is @code{i1} most of the time for the title and
678 @code{i0} most of the time for other fonts.
680 The table below lists the values observed in the corpus. When a cell
681 contains a single value, then 99@math{+}% of the corpus contains that value.
682 When a cell contains a pair of values, then the first value is seen in
683 about two-thirds of the corpus and the second value in about the
684 remaining one-third. In fonts that include multiple pairs, values are
685 correlated, that is, for font 3, f5 = 24, f6 = 24, f7 = 2 appears
686 about two-thirds of the time, as does the combination of f4 = 0, f6 =
689 @multitable {font} {40} {f2} {64173} {0/1} {24/11} {10/11} {2/3} {f8}
690 @headitem font @tab f1 @tab f2 @tab f3 @tab f4 @tab f5 @tab f6 @tab f7 @tab f8
691 @item 1 @tab 40 @tab 1 @tab 0 @tab 0 @tab 8 @tab 10/11 @tab 1 @tab 8
692 @item 2 @tab 40 @tab 0 @tab 2 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 1
693 @item 3 @tab 40 @tab 0 @tab 2 @tab 1 @tab 24/11 @tab 24/8 @tab 2/3 @tab 4
694 @item 4 @tab 40 @tab 0 @tab 2 @tab 3 @tab 8 @tab 10/11 @tab 1 @tab 1
695 @item 5 @tab 40 @tab 0 @tab 0 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 4
696 @item 6 @tab 40 @tab 0 @tab 2 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 4
697 @item 7 @tab 40 @tab 0 @tab 64173 @tab 0/1 @tab 8 @tab 10/11 @tab 1 @tab 1
698 @item 8 @tab 40 @tab 0 @tab 2 @tab 3 @tab 8 @tab 10/11 @tab 1 @tab 4
702 Dimensions @result{} int[@t{n-dims}] Dimension*[@t{n-dims}]
706 (00 @math{|} 01 @math{|} 02)[@t{d2}]
707 (i0 @math{|} i2)[@t{d3}]
708 (00 @math{|} 01)[@t{d4}]
709 (00 @math{|} 01)[@t{d5}]
712 int[@t{n-categories}] Category*[@t{n-categories}]
715 @code{name} is the name of the dimension, e.g.@: @code{Variables},
716 @code{Statistics}, or a variable name.
718 @code{d1} is usually 0 but many other values have been observed.
720 @code{d3} is 2 over 99% of the time.
722 @code{d5} is 0 over 99% of the time.
724 @code{d6} is either -1 or the 0-based index of the dimension, e.g.@: 0
725 for the first dimension, 1 for the second, and so on. The latter is
726 the case 98% of the time in the corpus.
729 Category @result{} Value[@t{name}] (Terminal @math{|} Group)
730 Terminal @result{} 00 00 00 i2 int[@t{index}] i0
732 (00 @math{|} 01)[@t{merge}] 00 01 (i0 @math{|} i2)[@t{data}]
733 i-1 int[@t{n-subcategories}] Category*[@t{n-subcategories}]
736 @code{name} is the name of the category (or group).
738 A Category can represent a terminal category. In that case,
739 @code{index} is a nonnegative integer less than @code{n-categories} in
740 the Dimension in which the Category is nested (directly or
743 Alternatively, a Category can represent a Group of nested categories.
744 Usually a Group has some nested content, so that
745 @code{n-subcategories} is positive, but a few Groups with
746 @code{n-subcategories} 0 have been observed.
748 If a Group's @code{merge} is 00, the most common value, then the group
749 is really a distinct group that should be represented as such in the
750 visual representation and user interface. If @code{merge} is 01,
751 however, the categories in this group should be shown and treated as
752 if they were direct children of the group's containing group (or if it
753 has no parent group, then direct children of the dimension), and this
754 group's name is irrelevant and should not be displayed. (Merged
755 groups can be nested!)
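The treatment of merged groups amounts to splicing their children into the enclosing list, recursively. The Python sketch below illustrates this on an in-memory tree. The @code{Category} class and the function @code{visible_children} are invented for this example and do not mirror the binary layout.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list = field(default_factory=list)   # empty for terminals
    is_group: bool = False
    merge: bool = False        # True when the Group's merge byte is 01

def visible_children(categories):
    """Splice the children of merged groups into their parent's list."""
    result = []
    for cat in categories:
        if cat.is_group and cat.merge:
            # A merged group vanishes; its children take its place.
            # The recursion handles merged groups nested in merged groups.
            result.extend(visible_children(cat.children))
        else:
            result.append(cat)
    return result
```

A caller would apply @code{visible_children} to the top-level categories of a dimension and again to the children of each group that remains.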
757 @code{data} appears to be i2 when all of the categories within a group
758 are terminal categories that directly represent data values for a
759 variable (e.g.@: in a frequency table or crosstabulation, a group of
760 values in a variable being tabulated) and i0 otherwise.
764 int[@t{layers}] int[@t{rows}] int[@t{columns}] int*[@t{n-dimensions}]
765 int[@t{n-data}] Datum*[@t{n-data}]
768 The values of @code{layers}, @code{rows}, and @code{columns} each
769 specify the number of dimensions represented in layers or rows or
770 columns, respectively, and their values sum to the number of
773 The @code{n-dimensions} integers are a permutation of the 0-based
774 dimension numbers. The first @code{layers} of them specify each of
775 the dimensions represented by layers, the next @code{rows} of them
776 specify the dimensions represented by rows, and the final
777 @code{columns} of them specify the dimensions represented by columns.
778 When there is more than one dimension of a given kind, the inner
779 dimensions are given first.
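Splitting the permutation into its three runs is a matter of slicing. The Python sketch below shows one way to do it and validates the two constraints stated above. The function name @code{split_axes} and the dictionary keys are choices made for this example.

```python
def split_axes(layers, rows, columns, dims):
    """Split the dimension permutation into layer, row, and column axes."""
    if layers + rows + columns != len(dims):
        raise ValueError("axis counts do not sum to the number of dimensions")
    if sorted(dims) != list(range(len(dims))):
        raise ValueError("dimension list is not a permutation")
    return {
        "layers": dims[:layers],
        "rows": dims[layers:layers + rows],
        "columns": dims[layers + rows:],
    }
```

Within each returned list the order of the file is preserved, so the inner dimensions still come first.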
782 Datum @result{} int64[@t{index}] v1(00?) Value
785 The format of a Datum varies slightly from version 1 to version 3: in
786 version 1 it allows for an extra optional 00 byte.
788 A Datum consists of an @code{index} and a Value. Suppose there are
789 @math{d} dimensions and dimension @math{i}, @math{0 \le i < d}, has
790 @math{n_i} categories. Consider the datum at coordinates @math{x_i},
791 @math{0 \le i < d}, and note that @math{0 \le x_i < n_i}. Then the
792 index is calculated by the following algorithm:
796 for each @math{i} from 0 to @math{d - 1}:
797 @i{index} = @math{n_i \times} @i{index} @math{+} @math{x_i}
800 For example, suppose there are 3 dimensions with 3, 4, and 5
801 categories, respectively. The datum at coordinates (1, 2, 3) has
802 index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}.
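The algorithm and its inverse can be written in a few lines of Python. The inverse, which recovers coordinates from an index, is not spelled out above; it follows from reversing the steps of the loop. The function names @code{datum_index} and @code{datum_coords} are invented for this sketch.

```python
def datum_index(sizes, coords):
    """Compute a Datum index from per-dimension sizes and coordinates."""
    index = 0
    for n, x in zip(sizes, coords):
        if not 0 <= x < n:
            raise ValueError("coordinate out of range")
        index = n * index + x
    return index

def datum_coords(sizes, index):
    """Invert datum_index: recover the coordinates from an index."""
    coords = []
    for n in reversed(sizes):       # undo the last dimension first
        index, x = divmod(index, n)
        coords.append(x)
    return tuple(reversed(coords))
```

With the sizes (3, 4, 5) from the example above, @code{datum_index} maps (1, 2, 3) to 33 and @code{datum_coords} maps 33 back to (1, 2, 3).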
805 Value @result{} 00? 00? 00? 00? RawValue
807 01 ValueMod int[@t{format}] double[@t{x}]
808 @math{|} 02 ValueMod int[@t{format}] double[@t{x}]
809 string[@t{varname}] string[@t{vallab}] (01 @math{|} 02 @math{|} 03)
810 @math{|} 03 string[@t{local}] ValueMod string[@t{id}] string[@t{c}] (00 @math{|} 01)[@t{type}]
811 @math{|} 04 ValueMod int[@t{format}] string[@t{vallab}] string[@t{varname}]
812 (01 @math{|} 02 @math{|} 03) string[@t{s}]
813 @math{|} 05 ValueMod string[@t{varname}] string[@t{varlabel}] (01 @math{|} 02 @math{|} 03)
814 @math{|} ValueMod string[@t{format}] int[@t{n-args}] Arg*[@t{n-args}]
817 @math{|} int[@t{x}] i0 Value*[@t{x}@math{+}1] /* @t{x} @math{>} 0 */
820 A Value boils down to a number or a string. There are several
821 possibilities, which one can distinguish by the first nonzero byte in
826 The numeric value @code{x}, intended to be presented to the user
827 formatted according to @code{format}, which is in the format described
828 for system files. @xref{System File Output Formats}, for details.
829 Most commonly, @code{format} has width 40 (the maximum).
831 When @code{x} is the most negative finite double value, @code{-DBL_MAX}, it
832 represents the system-missing value SYSMIS. (HIGHEST and LOWEST have
833 not been observed.) @xref{System File Format}, for more about these
837 Similar to @code{01}, with the additional information that @code{x} is
838 a value of variable @code{varname} and has value label @code{vallab}.
839 Both @code{varname} and @code{vallab} can be the empty string, the
840 latter very commonly.
842 The meaning of the final byte is unknown. Possibly it is connected to
843 whether the value or the label should be displayed.
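For the numeric forms described above, the system-missing check is an exact comparison against the most negative finite double. The Python sketch below shows the test on the 8 raw bytes of @code{x}. The names @code{SYSMIS} and @code{decode_number} are chosen for this example, and returning @code{None} for a missing value is a design choice of the sketch, not part of the format.

```python
import struct
import sys

SYSMIS = -sys.float_info.max        # same value as C's -DBL_MAX

def decode_number(raw8):
    """Decode a little-endian double; return None for system-missing."""
    (x,) = struct.unpack("<d", raw8)
    return None if x == SYSMIS else x
```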
846 A text string, in two forms: @code{c} is in English, and sometimes
847 abbreviated or obscure, and @code{local} is localized to the user's
848 locale. In an English-language locale, the two strings are often the
849 same, and in the cases where they differ, @code{local} is more
850 appropriate for a user interface, e.g.@: @code{c} of ``Not a PxP table
851 for MCN...'' versus @code{local} of ``Computed only for a PxP table,
852 where P must be greater than 1.''
854 @code{c} and @code{local} are always either both empty or both
857 @code{id} is a brief identifying string whose form seems to resemble a
858 programming language identifier, e.g.@: @code{cumulative_percent} or
859 @code{factor_14}. It is not unique.
861 @code{type} is 00 for text taken from user input, such as syntax
862 fragments, expressions, file names, and data set names, and 01 for fixed
863 text strings such as names of procedures or statistics. In the former
864 case, @code{id} is always the empty string; in the latter case,
865 @code{id} is still sometimes empty.
868 The string value @code{s}, intended to be presented to the user
869 formatted according to @code{format}. The format for a string is not
870 too interesting, and the corpus contains many clearly invalid formats
871 like A16.39 or A255.127 or A134.1, so readers should probably ignore
874 @code{s} is a value of variable @code{varname} and has value label
875 @code{vallab}. @code{varname} is never empty but @code{vallab} is
878 The meaning of the final byte is unknown.
881 Variable @code{varname}, which is rarely observed as empty in the
882 corpus, with variable label @code{varlabel}, which is often empty.
884 The meaning of the final byte is unknown.
888 (These bytes begin a ValueMod.) A format string, analogous to
889 @code{printf}, followed by one or more arguments, each of which has
890 one or more values. The format string uses the following syntax:
897 Each of these expands to the character following @samp{\\}, to escape
898 characters that have special meaning in format strings. These are
899 effective inside and outside the @code{[@dots{}]} syntax forms
903 Expands to a new-line, inside or outside the @code{[@dots{}]} forms
907 Expands to a formatted version of argument @var{i}, which must have
908 only a single value. For example, @code{^1} expands to the first
909 argument's @code{value}.
911 @item [:@var{a}:]@var{i}
912 Expands @var{a} for each of the values in @var{i}. @var{a}
913 should contain one or more @code{^@var{j}} conversions, which are
914 drawn from the values for argument @var{i} in order. Some examples
919 All of the values for the first argument, concatenated.
922 Expands to the values for the first argument, each followed by
926 Expands to @code{@var{x} = @var{y}} where @var{x} is the second
927 argument's first value and @var{y} is its second value. (This would
928 be used only if the argument has two values. If there were more
929 values, the second and third values would be directly concatenated,
930 which would look funny.)
@item [@var{a}:@var{b}:]@var{i}
This extends the previous form so that the first values are expanded
using @var{a} and later values are expanded using @var{b}. For an
unknown reason, within @var{a} the @code{^@var{j}} conversions are
instead written as @code{%@var{j}}. Some examples from the corpus:

Expands to all of the values for the first argument, separated by

@item [%1 = %2:, ^1 = ^2:]1
Given appropriate values for the first argument, expands to a
comma-separated list of @code{@var{x} = @var{y}} pairs.

Given appropriate values, expands to @code{1, 2, 3}.

The format string is localized to the user's locale.

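The expansion rules above can be sketched in Python. This is only an
illustration under stated assumptions, not the algorithm SPSS uses:
the names @code{expand_format}, @code{_split_part}, and
@code{_expand_once} are hypothetical, argument and value indexes are
assumed to be single digits, and each argument is assumed to be a list
of values that have already been formatted as strings.

```python
def _split_part(fmt, pos):
    """Return (text, next_pos) for the text from pos up to the next
    unescaped ':'; next_pos is the position just past that ':'."""
    start = pos
    while pos < len(fmt) and fmt[pos] != ":":
        pos += 2 if fmt[pos] == "\\" else 1   # skip escaped characters whole
    return fmt[start:pos], pos + 1


def _expand_once(template, values, marker):
    """Expand template once against the front of values.
    marker is '^' or '%'.  Returns (text, number of values consumed)."""
    out, used, pos = [], 0, 0
    while pos < len(template):
        c = template[pos]
        nxt = template[pos + 1] if pos + 1 < len(template) else ""
        if c == "\\" and nxt:
            out.append("\n" if nxt == "n" else nxt)
            pos += 2
        elif c == marker and nxt.isdigit():
            j = int(nxt)                      # assumption: single-digit index
            if 1 <= j <= len(values):
                out.append(str(values[j - 1]))
            used = max(used, j)
            pos += 2
        else:
            out.append(c)
            pos += 1
    return "".join(out), used


def expand_format(fmt, args):
    """Expand fmt, where args[k] is the list of values of argument k+1."""
    out, pos = [], 0
    while pos < len(fmt):
        c = fmt[pos]
        nxt = fmt[pos + 1] if pos + 1 < len(fmt) else ""
        if c == "\\" and nxt:
            out.append("\n" if nxt == "n" else nxt)
            pos += 2
        elif c == "^" and nxt.isdigit():
            out.append(str(args[int(nxt) - 1][0]))
            pos += 2
        elif c == "[":
            first, pos = _split_part(fmt, pos + 1)   # empty in the [:a:]i form
            rest, pos = _split_part(fmt, pos)
            # pos is now at the closing ']', followed by the argument index.
            values = list(args[int(fmt[pos + 1]) - 1])
            pos += 2
            if first:
                text, used = _expand_once(first, values, "%")
                out.append(text)
                values = values[max(used, 1):]
            while values:
                text, used = _expand_once(rest, values, "^")
                out.append(text)
                values = values[max(used, 1):]       # always make progress
        else:
            out.append(c)
            pos += 1
    return "".join(out)
```

With these assumptions, @code{expand_format("[%1:, ^1:]1", [["1", "2", "3"]])}
yields @code{1, 2, 3}, and giving @code{[:^1 = ^2:]1} four values
reproduces the direct concatenation of the second and third values
described above.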
@example
    31 i0 (i0 @math{|} i1 string[@t{subscript}])
    v1(00 (i1 @math{|} i2) 00 00 int 00 00)
    v3(count(FormatString
             (58 @math{|} 31 Style)
  @math{|} 31 i0 i0 i0 i0 01 00 (01 @math{|} 02 @math{|} 08)
  @math{|} 31 i1 int[@t{footnote-number}] Format
  @math{|} 31 i2 (00 @math{|} 01 @math{|} 02) 00 (i1 @math{|} i2 @math{|} i3) Format
  @math{|} 31 i3 00 00 01 00 i2 Format

Style @result{} 01? 00? 00? 00? 01 string[@t{fgcolor}] string[@t{bgcolor}] string[@t{font}] byte
Format @result{} 00 00 count(FormatString (58 @math{|} 31 Style) 58)
FormatString @result{} count((i0 (58 @math{|} 31 string))?)
@end example
A ValueMod can specify special modifications to a Value:

The @code{footnote-number}, if present, specifies a footnote that the
Value references. The footnote's marker is shown appended to the main
text of the Value, as a superscript.

The @code{subscript}, if present, specifies a string to append to the
main text of the Value, as a subscript. The subscript text is
normally a brief indicator, e.g.@: @samp{a} or @samp{a,b}, with its
meaning indicated by the table caption. In this usage, subscripts are
similar to footnotes; one apparent difference is that a Value can
reference only one footnote but a subscript can list more than one
letter.

The Format, if present, is a format string for substitutions using the
syntax explained previously. It appears to be an English-language
version of the localized format string in the enclosing Value.

The Style, if present, changes the style for this individual Value.

@node SPV Legacy Detail Member Binary Format
@section Legacy Detail Member Binary Format

Whereas the light binary format represents everything about a given
pivot table, the legacy binary format conceptually consists of a
number of named sources, each of which consists of a number of named
series, each of which is a 1-dimensional array of numbers or strings
or a mix. Thus, the legacy binary file format is quite simple.

This section uses the same context-free grammar notation as in the
previous section, with the following additions:

In a version 0xaf legacy member, @var{x}; in other versions, nothing.
(The legacy member header indicates the version; see below.)

In a version 0xb0 legacy member, @var{x}; in other versions, nothing.

@example
LegacyBinary @result{}
    00 byte[@t{version}] int16[@t{n-sources}] int[@t{file-size}]
    Metadata*[@t{n-sources}] Data*[@t{n-sources}]
@end example

@code{version} is a version number that affects the interpretation of
some of the other data in the member. Versions 0xaf and 0xb0 are
known. We will refer to ``version 0xaf'' and ``version 0xb0'' members.

A legacy member consists of @code{n-sources} data sources, each of
which has Metadata and Data.

@code{file-size} is the size of the file, in bytes.

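As a minimal sketch, the fixed-size header could be decoded as follows.
The function name @code{parse_legacy_header} is hypothetical, and the
code assumes that multi-byte integers are little-endian, as in the
previous section's notation.

```python
import struct


def parse_legacy_header(data):
    """Decode the 8-byte LegacyBinary header at the start of data.

    Assumption: int16 and int are little-endian.
    """
    zero, version, n_sources, file_size = struct.unpack_from("<BBHi", data, 0)
    if zero != 0:
        raise ValueError("legacy member should start with a 00 byte")
    if version not in (0xAF, 0xB0):
        raise ValueError("unknown legacy version 0x%02x" % version)
    return {"version": version, "n_sources": n_sources, "file_size": file_size}
```

A reader might additionally compare @code{file_size} against the
actual member length as a sanity check.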
@example
Metadata @result{}
    int[@t{per-series}] int[@t{n-series}] int[@t{ofs}]
    vAF(byte*32[@t{source-name}])
    vB0(byte*64[@t{source-name}] int[@t{x}])
@end example

A data source consists of @code{n-series} series of data, with
@code{per-series} data values per series.

@code{source-name} is a 32- or 64-byte string padded on the right with
zero bytes. The names that appear in the corpus are very generic,
usually @code{tableData} or @code{source0}.

The @code{ofs} is the offset, in bytes, from the beginning of the file
to the start of this data source's Data. This allows programs to skip
to the beginning of the data for a particular source; it is also
needed to determine whether a source includes any string data (see
below).

The meaning of @code{x} in version 0xb0 is unknown.

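One Metadata entry could be read along these lines. This is a sketch,
not a definitive reader: @code{parse_metadata} is a hypothetical name,
integers are assumed to be little-endian, and the source name is
assumed to be ASCII.

```python
import struct


def parse_metadata(data, pos, version):
    """Decode one Metadata entry starting at pos.

    Returns (entry, next_pos).  Assumptions: little-endian ints and an
    ASCII source name; version is 0xAF or 0xB0.
    """
    per_series, n_series, ofs = struct.unpack_from("<iii", data, pos)
    pos += 12
    name_len = 32 if version == 0xAF else 64
    raw_name = data[pos:pos + name_len]
    pos += name_len
    if version == 0xB0:
        pos += 4                      # skip x, whose meaning is unknown
    entry = {
        "per_series": per_series,
        "n_series": n_series,
        "ofs": ofs,
        # The name is padded on the right with zero bytes.
        "source_name": raw_name.rstrip(b"\0").decode("ascii", errors="replace"),
    }
    return entry, pos
```

Calling this @code{n-sources} times in a row, starting just after the
header, would yield the Metadata for every source.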
@example
Data @result{} NumericData StringData?
NumericData @result{} NumericSeries*[@t{n-series}]
NumericSeries @result{} byte*288[@t{series-name}] double*[@t{per-series}]
@end example

Data follow the Metadata in the legacy binary format, with sources in
the same order. Each NumericSeries begins with a @code{series-name}
that generally indicates its role in the pivot table, e.g.@: ``cell'',
``cellFormat'', ``dimension0categories'', ``dimension0group0''. The
name is followed by the data, one double per element in the series. A
double equal to @code{-DBL_MAX}, the most negative finite double,
represents the system-missing value SYSMIS.

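The following sketch reads one NumericSeries and maps SYSMIS to
@code{None}. The name @code{parse_numeric_series} is hypothetical; the
code assumes little-endian IEEE doubles and that the 288-byte name is
ASCII text terminated or padded by zero bytes, which the grammar does
not state explicitly.

```python
import struct
import sys

SERIES_NAME_LEN = 288


def parse_numeric_series(data, pos, per_series):
    """Decode one NumericSeries starting at pos.

    Returns (name, values, next_pos), with SYSMIS (-DBL_MAX) as None.
    """
    raw_name = data[pos:pos + SERIES_NAME_LEN]
    pos += SERIES_NAME_LEN
    values = struct.unpack_from("<%dd" % per_series, data, pos)
    pos += 8 * per_series
    # Assumption: the name ends at the first zero byte.
    name = raw_name.split(b"\0", 1)[0].decode("ascii", errors="replace")
    # sys.float_info.max is the Python equivalent of C's DBL_MAX.
    values = [None if v == -sys.float_info.max else v for v in values]
    return name, values, pos
```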
@example
StringData @result{} i1 string[@t{source-name}] Pairs Labels

Pairs @result{} int[@t{n-string-series}] PairSeries*[@t{n-string-series}]
PairSeries @result{} string[@t{pair-series-name}] int[@t{n-pairs}] Pair*[@t{n-pairs}]
Pair @result{} int[@t{i}] int[@t{j}]

Labels @result{} int[@t{n-labels}] Label*[@t{n-labels}]
Label @result{} int[@t{frequency}] string[@t{s}]
@end example

A source may include a mix of numeric and string data values. When a
source includes any string data, the data values that are strings are
set to SYSMIS in the NumericSeries, and StringData follows the
NumericData. A source that contains no string data omits the
StringData. To reliably determine whether a source includes
StringData, the reader should check whether the offset following the
NumericData is the offset of the next source, as indicated by its
Metadata (or end of file, in the case of the last source in a file).
If the two offsets match, the source has no StringData.

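The offset comparison can be sketched as below. The function name
@code{has_string_data} and the dictionary keys are hypothetical; each
entry of @code{sources} is assumed to hold the @code{per-series},
@code{n-series}, and @code{ofs} fields of one source's Metadata, in
file order.

```python
SERIES_NAME_LEN = 288   # bytes of series-name in each NumericSeries
DOUBLE_LEN = 8


def has_string_data(sources, index, file_size):
    """Report whether source number index (0-based) carries StringData."""
    src = sources[index]
    series_len = SERIES_NAME_LEN + DOUBLE_LEN * src["per_series"]
    numeric_end = src["ofs"] + src["n_series"] * series_len
    if index + 1 < len(sources):
        next_start = sources[index + 1]["ofs"]
    else:
        next_start = file_size          # last source: compare to end of file
    return numeric_end != next_start
```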
StringData repeats the name of the source.

The string data overlays the numeric data. @code{n-string-series} is
the number of series within the source that include string data. More
precisely, it is the 1-based index of the last series in the source
that includes any string data; thus, it would be 4 if there are 5
series and only the fourth one includes string data.

Each PairSeries consists of a sequence of 0 or more pairs, each of
which maps from a 0-based index within the series @code{i} to a
0-based label index @code{j}. The pair @code{i} = 2, @code{j} = 3, for
example, would mean that the third data value (with value SYSMIS) is
to be replaced by the string of the fourth label.

The labels themselves follow the pairs. The valuable part of each
label is the string @code{s}. Each label also includes a
@code{frequency} that reports the number of pairs that reference it
(although this is not useful).
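
Applying the overlay to one decoded series can be sketched as follows.
The names @code{overlay_strings} and @code{label_frequencies} are
hypothetical; @code{values} is assumed to be the decoded numeric
series, @code{pairs} a list of (@code{i}, @code{j}) tuples from the
matching PairSeries, and @code{labels} the list of label strings.

```python
from collections import Counter


def overlay_strings(values, pairs, labels):
    """Return a copy of values with string labels substituted.

    Each pair (i, j) replaces the value at 0-based index i with the
    label at 0-based index j.
    """
    result = list(values)
    for i, j in pairs:
        result[i] = labels[j]
    return result


def label_frequencies(pairs, n_labels):
    """Count how many of the given pairs reference each label index."""
    counts = Counter(j for _, j in pairs)
    return [counts[j] for j in range(n_labels)]
```

For the example in the text, the pair (2, 3) replaces the third value
with the fourth label.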