@node SPSS Viewer Format
@section SPSS Viewer Format

SPSS Viewer or @file{.spv} files, here called SPV files, are written
by SPSS 16 and later to represent the contents of its output editor.
This section documents the format.  This description is detailed
enough to read SPV files, but it is probably not sufficient to
write them.

As an aside, SPSS 15 and earlier versions use a completely different
output format based on the Microsoft Compound Document Format.  That
format is not documented here.

An SPV file is a Zip archive that can be read with @command{zipinfo}
and @command{unzip} and similar programs.  The final member in the Zip
archive is a file named @file{META-INF/MANIFEST.MF}.  This structure
makes SPV files resemble Java ``JAR'' files, but whereas a JAR
manifest contains a sequence of colon-delimited key/value pairs, an
SPV manifest contains the string @samp{allowPivoting=true}, without a
new-line.
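
As a minimal sketch of the check described above, the following Python
uses the standard @code{zipfile} module to confirm that an archive ends
with the manifest and that the manifest holds the expected string.  The
function names are hypothetical, and the in-memory archive is only a
stand-in for a real SPV file.

```python
import io
import zipfile

MANIFEST_NAME = "META-INF/MANIFEST.MF"


def read_spv_manifest(spv):
    """Return the manifest text of an SPV file (a path or file-like object)."""
    with zipfile.ZipFile(spv) as archive:
        names = archive.namelist()
        # The manifest is described as the final member of the archive.
        if not names or names[-1] != MANIFEST_NAME:
            raise ValueError("manifest is not the final member")
        return archive.read(MANIFEST_NAME).decode("ascii")


def looks_like_spv(spv):
    """True if the manifest is exactly the expected string, with no new-line."""
    return read_spv_manifest(spv) == "allowPivoting=true"


# Build a tiny stand-in archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("outputViewer0000000000.xml", "<heading/>")
    archive.writestr(MANIFEST_NAME, "allowPivoting=true")
buffer.seek(0)
print(looks_like_spv(buffer))
```

A stricter or looser reader is equally plausible; the comparison here
simply mirrors the one manifest string this section reports.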

The rest of the members in an SPV file's Zip archive fall into two
categories: structure and details.  ``Structure'' member names begin
with @file{outputViewer@var{nnnnnnnnnn}}, where each @var{n} is a
decimal digit, and end with @file{.xml}, and often include the string
@file{_heading} in between.  Each of these members represents some
kind of output item (a table, a heading, a block of text, etc.) or a
group of them.  The member whose output goes at the beginning of the
document is numbered 0, the next member in the output is numbered 1,
and so on.
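
The naming rule above can be sketched as a regular expression.  The
pattern below is an assumption drawn only from this description (ten
digits, an optional @file{_heading}, then @file{.xml}), and
@code{structure_members} is a hypothetical helper that returns the
structure members in document order.

```python
import re

# Assumed pattern: "outputViewer", ten decimal digits, an optional
# "_heading", then ".xml".
STRUCTURE_NAME = re.compile(r"outputViewer(\d{10})(?:_heading)?\.xml")


def structure_members(names):
    """Return the structure member names sorted by their member number."""
    numbered = []
    for name in names:
        match = STRUCTURE_NAME.fullmatch(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


names = [
    "outputViewer0000000001_heading.xml",
    "0000000001437_lightTableData.bin",
    "outputViewer0000000000.xml",
    "META-INF/MANIFEST.MF",
]
print(structure_members(names))
```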

Structure members contain XML.  This XML is sometimes self-contained,
but it often references other members in the Zip archive named as
follows:

@table @asis
@item @file{@var{prefix}_table.xml} and @file{@var{prefix}_tableData.bin}
@itemx @file{@var{prefix}_lightTableData.bin}
The structure of a table plus its data.  Older SPV files pair a
@file{@var{prefix}_table.xml} file that describes the table's
structure with a binary @file{@var{prefix}_tableData.bin} file that
gives its data.  Newer SPV files (the majority of those in the corpus)
instead include a single @file{@var{prefix}_lightTableData.bin} file
that incorporates both into a single binary format.

@item @file{@var{prefix}_warning.xml} and @file{@var{prefix}_warningData.bin}
@itemx @file{@var{prefix}_lightWarningData.bin}
Same format used for tables, with a different name.

@item @file{@var{prefix}_notes.xml} and @file{@var{prefix}_notesData.bin}
@itemx @file{@var{prefix}_lightNotesData.bin}
Same format used for tables, with a different name.

@item @file{@var{prefix}_chartData.bin} and @file{@var{prefix}_chart.xml}
The structure of a chart plus its data.  Charts do not have a
``light'' format.

@item @file{@var{prefix}_model.xml}
@itemx @file{@var{prefix}_pmml.xml}
@itemx @file{@var{prefix}_stats.xml}
Not yet investigated.  The corpus contains only one example of each.
@end table

The @file{@var{prefix}} in the names of the detail members is
typically an 11-digit decimal number that increases for each item,
tending to skip values.  Older SPV files use different naming
conventions.  Structure members refer to detail members by name, and so
their exact names do not appear to matter as long as they are unique.
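
For illustration only, the suffixes listed above can be turned into a
small lookup.  The descriptive labels and the @code{classify_detail}
helper are hypothetical; a real reader should follow the names given in
the structure members rather than guess from suffixes.

```python
# Suffix -> description, taken from the member names listed above.
DETAIL_SUFFIXES = {
    "_table.xml": "table structure (older files)",
    "_tableData.bin": "table data (older files)",
    "_lightTableData.bin": "light table",
    "_warning.xml": "warning structure (older files)",
    "_warningData.bin": "warning data (older files)",
    "_lightWarningData.bin": "light warning",
    "_notes.xml": "notes structure (older files)",
    "_notesData.bin": "notes data (older files)",
    "_lightNotesData.bin": "light notes",
    "_chart.xml": "chart structure",
    "_chartData.bin": "chart data",
    "_model.xml": "not yet investigated",
    "_pmml.xml": "not yet investigated",
    "_stats.xml": "not yet investigated",
}


def classify_detail(name):
    """Return (prefix, description) for a detail member name, else None."""
    for suffix, description in DETAIL_SUFFIXES.items():
        if name.endswith(suffix):
            return name[: -len(suffix)], description
    return None


print(classify_detail("0000000001437_lightTableData.bin"))
```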

@node SPV Structure Member Format
@subsection Structure Member Format

Structure member XML files claim conformance with a collection of XML
Schemas.  These schemas are distributed, under a nonfree license, with
SPSS binaries.  Fortunately, the schemas are not necessary to
understand the structure members.  To a degree, the schemas can even
be deceptive because they document elements and attributes that are
not in the corpus and lack documentation of elements and attributes
that are commonly found in the corpus.

Structure members use a different XML namespace for each schema, but
these namespaces are not entirely consistent: in some SPV files, for
example, the @code{viewer-tree} schema is associated with namespace
@indicateurl{http://xml.spss.com/spss/viewer-tree} and in others with
@indicateurl{http://xml.spss.com/spss/viewer/viewer-tree} (note the
additional @file{viewer/} directory).  In any case, the schema URIs are
not resolvable to obtain the schemas themselves.

One may ignore all of the above in interpreting a structure member.
The actual XML has a simple and straightforward form that does not
require a reader to take schemas or namespaces into account.
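
One way to ignore namespaces in practice, sketched here with Python's
standard @code{xml.etree.ElementTree}, is to strip the namespace part
from every tag and attribute name after parsing.  The helper names are
hypothetical and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Drop an ElementTree-style "{namespace}" prefix, if present."""
    return name.rsplit("}", 1)[-1]


def strip_namespaces(root):
    """Rewrite tags and attribute names in place without namespaces."""
    for element in root.iter():
        if isinstance(element.tag, str):
            element.tag = local_name(element.tag)
        for key in list(element.attrib):
            bare = local_name(key)
            if bare != key:
                element.attrib[bare] = element.attrib.pop(key)
    return root


sample = (
    '<heading xmlns="http://xml.spss.com/spss/viewer/viewer-tree">'
    "<label>Output</label>"
    "</heading>"
)
root = strip_namespaces(ET.fromstring(sample))
print(root.tag, root.find("label").text)
```

After this step the same code can handle both namespace variants
mentioned above, because neither appears in the tag names any longer.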

@table @code
@item heading
Parent: Document root or @code{heading} @*
Contents: [@code{pageSetup}] @code{label} [@code{container} | @code{heading}]*

The root of a structure member is a @code{heading}, which represents a
section of output beginning with a title (the @code{label}) and
ordinarily followed by content containers or further nested
subsections of output.

The document root heading may also contain a @code{pageSetup} element.

The following attributes have been observed on both document root and
nested @code{heading} elements:

@table @asis
@item Optional attribute: @code{creator-version}
The version of the software that created this SPV file.  A string of
the form @code{xxyyzzww} represents software version xx.yy.zz.ww,
e.g.@: @code{21000001} is version 21.0.0.1.  Trailing pairs of zeros
are sometimes omitted, so that @code{21}, @code{210000}, and
@code{21000000} are all version 21.0.0.0 (and the corpus contains all
three of those forms).
@end table

The following attributes have been observed on document root
@code{heading} elements only:

@table @asis
@item Optional attribute: @code{creator}
The directory of the software that created this SPV file,
e.g.@: @file{C:\PROGRA~1\IBM\SPSS\STATIS~1\22} or
@file{/Applications/IBM/SPSS/Statistics/22/SPSSStatistics.app/Contents/Resources/Java/../../bin}.

@item Optional attribute: @code{creation-date-time}
The date and time at which the SPV file was written, in a
locale-specific format, e.g.@: @code{Friday, May 16, 2014 6:47:37 PM
PDT} or @code{lunedì 17 marzo 2014 3.15.48 CET} or even @code{Friday,
December 5, 2014 5:00:19 o'clock PM EST}.

@item Optional attribute: @code{lockReader}
Whether a reader should be allowed to edit the output.  The possible
values are @code{true} and @code{false}, but the corpus only contains
@code{false}.

@item Optional attribute: @code{schemaLocation}
This is actually an XML Namespace attribute.  A reader may ignore it.
@end table

The following attributes have been observed only on nested
@code{heading} elements:

@table @asis
@item Required attribute: @code{commandName}
The locale-invariant name of the command that produced the output,
e.g.@: @code{Frequencies}, @code{T-Test}, @code{Non Par Corr}.

@item Optional attribute: @code{visibility}
To what degree the output represented by the element is visible.  The
only observed value is @code{collapsed}.

@item Optional attribute: @code{locale}
The locale used for output, in Windows format, which is similar to the
format used in Unix with the underscore replaced by a hyphen, e.g.@:
@code{en-US}, @code{en-GB}, @code{el-GR}, @code{sr-Cyrl-RS}.

@item Optional attribute: @code{olang}
The output language, e.g.@: @code{en}, @code{it}, @code{es},
@code{de}, @code{pt-BR}.
@end table

@item label
Parent: @code{heading} or @code{container} @*
Contents: text

Every @code{heading} and @code{container} holds a @code{label} as its
first child.  The @code{label} of the root @code{heading} in a
structure member always contains the string ``Output''.  Otherwise,
the text in @code{label} describes what it labels, often by naming the
statistical procedure that was executed, e.g.@: ``Frequencies'' or
``T-Test''.  Labels are often very generic, especially within a
@code{container}, e.g.@: ``Title'' or ``Warnings'' or ``Notes''.
Label text is localized according to the output language, e.g.@: in
Italian a frequency table procedure is labeled ``Frequenze''.

The corpus contains one example of an empty label, one that contains
no text.

@item container
Parent: @code{heading} @*
Contents: @code{label} [@code{table} | @code{text}]

A @code{container} serves to label a @code{table} or a @code{text}
item.

@table @asis
@item Required attribute: @code{visibility}
Either @code{visible} or @code{hidden}, this indicates whether the
container's content is displayed.

@item Optional attribute: @code{text-align}
Presumably indicates the alignment of text within the container.  The
only observed value is @code{left}.  Observed with nested @code{table}
and @code{text} elements.

@item Optional attribute: @code{width}
The width of the container in the form @code{@var{n}px}, e.g.@:
@code{1097px}.
@end table

@item text
Parent: @code{container} @*
Contents: @code{html}

This @code{text} element is nested inside a @code{container}.  There
is a different @code{text} element that is nested inside a
@code{pageParagraph}.

@table @asis
@item Required attribute: @code{type}
One of @code{title}, @code{log}, or @code{text}.

@item Optional attribute: @code{commandName}
As on the @code{heading} element.  For output not specific to a
command, this is simply @code{log}.  The corpus contains one example
where @code{commandName} is present but set to the empty string.

@item Optional attribute: @code{creator-version}
As on the @code{heading} element.
@end table

@item html
Parent: @code{text} @*
Contents: cdata

The cdata contains an HTML document.  In some cases, the document
starts with @code{<html>} and ends with @code{</html>}; in others the
@code{html} element is implied.  Generally the HTML includes a
@code{head} element with a CSS stylesheet.  The HTML body often begins
with @code{<BR>}.  The actual content ranges from trivial to simple:
just discarding the CSS and tags yields readable results.

@table @asis
@item Required attribute: @code{lang}
This always contains @code{en} in the corpus.
@end table

@item table
Parent: @code{container} @*
Contents: @code{tableStructure}

@table @asis
@item Required attribute: @code{commandName}
As on the @code{heading} element.

@item Required attribute: @code{type}
One of @code{table}, @code{note}, or @code{warning}.

@item Required attribute: @code{subType}
The locale-invariant name for the particular kind of output that this
table represents in the procedure.  This can be the same as
@code{commandName} e.g.@: @code{Frequencies}, or different, e.g.@:
@code{Case Processing Summary}.  Generic subtypes @code{Notes} and
@code{Warnings} are often used.

@item Required attribute: @code{tableId}
A number that uniquely identifies the table within the SPV file,
typically a large negative number such as @code{-4147135649387905023}.

@item Optional attribute: @code{creator-version}
As on the @code{heading} element.  In the corpus, this is only present
for version 21 and up and always includes all 8 digits.
@end table

@item tableStructure
Parent: @code{table} @*
Contents: @code{dataPath}

@item dataPath
Parent: @code{tableStructure} @*
Contents: text

Contains the name of the Zip member that holds the table details,
e.g.@: @code{0000000001437_lightTableData.bin}.

@item pageSetup
Parent: @code{heading} @*
Contents: @code{pageHeader} @code{pageFooter}

@table @asis
@item Required attribute: @code{initial-page-number}
Always @code{1}.

@item Optional attribute: @code{chart-size}
Always @code{as-is} or a localization (!) of it (e.g.@: @code{dimensione
attuale}, @code{Wie vorgegeben}).

@item Optional attribute: @code{margin-left}
@itemx Optional attribute: @code{margin-right}
@itemx Optional attribute: @code{margin-top}
@itemx Optional attribute: @code{margin-bottom}
Margin sizes in the form @code{@var{size}in}, e.g.@: @code{0.25in}.

@item Optional attribute: @code{paper-height}
@itemx Optional attribute: @code{paper-width}
Paper sizes in the form @code{@var{size}in}, e.g.@: @code{8.5in} by
@code{11in} for letter paper or @code{8.267in} by @code{11.692in} for
A4 paper.

@item Optional attribute: @code{reference-orientation}
Always @code{0deg}.

@item Optional attribute: @code{space-after}
Always @code{12pt}.
@end table

@item pageHeader
@itemx pageFooter
Parent: @code{pageSetup} @*
Contents: @code{pageParagraph}*

No attributes.

@item pageParagraph
Parent: @code{pageHeader} or @code{pageFooter} @*
Contents: @code{text}

Text to go at the top or bottom of a page, respectively.

@item text
Parent: @code{pageParagraph} @*
Contents: [cdata]

This @code{text} element is nested inside a @code{pageParagraph}.  There
is a different @code{text} element that is nested inside a
@code{container}.

The element is either empty, or contains cdata that holds almost-XHTML
text: in the corpus, either an @code{html} or @code{p} element.  It is
@emph{almost}-XHTML because the @code{html} element designates the
default namespace as
@code{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
namespace.

The cdata can contain substitution variables: @code{&[Page]} for the
page number and @code{&[PageTitle]} for the page title.

Typical contents (indented for clarity):

@example
<html xmlns="http://xml.spss.com/spss/viewer/viewer-tree">
    <head></head>
    <body>
        <p style="text-align:right; margin-top: 0">Page &[Page]</p>
    </body>
</html>
@end example

@table @asis
@item Required attribute: @code{type}
Always @code{text}.
@end table
@end table
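
To tie the elements above together, here is a rough sketch that decodes
a @code{creator-version} string and walks nested @code{heading}
elements to collect each table's @code{dataPath}.  It assumes that
namespaces have already been removed from the tag names.  The function
names and the sample document are hypothetical, assembled from values
quoted in this section.

```python
import xml.etree.ElementTree as ET


def parse_creator_version(text):
    """Turn "21000001" into (21, 0, 0, 1); omitted trailing pairs count as 0."""
    padded = text.ljust(8, "0")
    return tuple(int(padded[i:i + 2]) for i in range(0, 8, 2))


def table_data_paths(heading):
    """Yield (commandName, subType, dataPath) for every table under heading."""
    for child in heading:
        if child.tag == "heading":
            # Headings nest, so recurse into subsections.
            yield from table_data_paths(child)
        elif child.tag == "container":
            for table in child.findall("table"):
                yield (table.get("commandName"), table.get("subType"),
                       table.findtext("tableStructure/dataPath"))


# Hypothetical structure member, with namespaces already removed.
SAMPLE = """
<heading creator-version="21">
  <label>Output</label>
  <heading commandName="Frequencies">
    <label>Frequencies</label>
    <container visibility="visible">
      <label>Notes</label>
      <table commandName="Frequencies" type="note" subType="Notes"
             tableId="-4147135649387905023">
        <tableStructure>
          <dataPath>0000000001437_lightTableData.bin</dataPath>
        </tableStructure>
      </table>
    </container>
  </heading>
</heading>
"""

root = ET.fromstring(SAMPLE.strip())
print(parse_creator_version(root.get("creator-version")))
print(list(table_data_paths(root)))
```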

@node SPV Light Detail Member Format
@subsection Light Detail Member Format

A ``light'' detail member (a @file{.bin} file) consists of a number of
sections concatenated together, terminated by a byte 01:

@example
light-member := header title styles dimensions data 01
@end example

The first section is a 0x27-byte header:

@example
header := 01 00 version 01 (00 | 01) byte*21 00 00 table-id byte*4
version := i1 | i3
table-id := int
@end example

@code{header} includes @code{version}, a version number that affects
the interpretation of some of the other data in the member.  We will
refer to ``version 1'' and ``version 3'' members later on.  It also
includes @code{table-id}, a binary version of the @code{tableId}
attribute in the structure member that refers to the detail member.
For example, if @code{tableId} is @code{-4154297861994971133}, then
@code{table-id} would be 0xdca00003, which is the low 32 bits of that
number.  The meaning of the other variable parts of the header is not
known.
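
The header grammar can be checked mechanically.  The sketch below uses
Python's @code{struct} module and assumes that @code{int} means a
32-bit little-endian integer, which this section does not state;
@code{parse_light_header} is a hypothetical name, and the header bytes
are synthesized for the example rather than taken from a real file.

```python
import struct

HEADER_SIZE = 0x27  # 2 + 4 + 1 + 1 + 21 + 2 + 4 + 4 bytes


def parse_light_header(data):
    """Return (version, table_id) from the start of a light detail member."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated header")
    if data[0:2] != b"\x01\x00":
        raise ValueError("bad magic bytes")
    # Assumption: ints are 32-bit little-endian.
    (version,) = struct.unpack_from("<i", data, 2)
    if version not in (1, 3):
        raise ValueError("unexpected version")
    if data[6] != 0x01 or data[7] not in (0x00, 0x01):
        raise ValueError("unexpected bytes after version")
    # Bytes 8..28 are the 21 bytes of unknown meaning.
    if data[29:31] != b"\x00\x00":
        raise ValueError("expected 00 00 before table-id")
    (table_id,) = struct.unpack_from("<I", data, 31)
    # The final 4 bytes (35..38) are also of unknown meaning.
    return version, table_id


# Synthetic header for a version 3 member with table-id 0xdca00003.
header = (b"\x01\x00" + struct.pack("<i", 3) + b"\x01\x00" + bytes(21)
          + b"\x00\x00" + struct.pack("<I", 0xDCA00003) + bytes(4))
version, table_id = parse_light_header(header)
print(version, hex(table_id))

# The tableId attribute from the example above, reduced to 32 bits.
print(hex(-4154297861994971133 & 0xFFFFFFFF))
```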

@example
title := value 01?              /* @r{localized title} */
         value 01? 31           /* @r{subtype} */
         value 01? 00? 58       /* @r{locale-invariant title} */
         (31 value | 58)        /* @r{caption} */
         int[n] footnote*[n]    /* @r{footnotes} */
footnote := value (31 value | 58) byte*4
@end example

@example
styles := 00 font*8
          int[x1] byte*[x1]
          int[x2] byte*[x2]
          int[x3] byte*[x3]
          int[x4] int*[x4]
          string[encoding]
          (i0 | i-1) (00 | 01) 00 (00 | 01)
          int
          byte[decimal] byte[grouping]
          int[x5] string*[x5]    /* @r{custom currency} */
          int[x6] byte*[x6]
@end example

In every example in the corpus, @code{x1} is 240.  The meaning of the
bytes that follow it is unknown.

In every example in the corpus, @code{x2} is 18 and the bytes that
follow it are @code{00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00
00}.  The meaning of these bytes is unknown.

Observed values of @code{x3} vary from 16 to 150.  The bytes that
follow it vary somewhat.

Observed values of @code{x4} vary from 0 to 17.  Out of 7060 examples
in the corpus, it is nonzero only 36 times.

@code{encoding} names a character encoding, usually a locale followed
by a Windows code page, such as @code{en_US.windows-1252} or
@code{it_IT.windows-1252}.  The encoding string is itself encoded in
US-ASCII.  The rest of the character strings in the file use this
encoding.
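
As a hedged illustration of how a reader might apply @code{encoding},
the sketch below splits the string at its first dot and uses the part
after it as a codec name.  Both the splitting rule and the helper names
are assumptions based only on the two examples above, and the input
bytes stand for a string that has already been extracted from the
member.

```python
def split_encoding(encoding):
    """Split e.g. "en_US.windows-1252" into ("en_US", "windows-1252")."""
    locale_name, dot, charset = encoding.partition(".")
    if not dot:
        # No locale part: treat the whole string as the charset.
        return None, encoding
    return locale_name, charset


def decode_string(raw, encoding):
    """Decode the bytes of one string using the member's encoding."""
    _, charset = split_encoding(encoding)
    return raw.decode(charset)


print(split_encoding("it_IT.windows-1252"))
print(decode_string(b"luned\xec", "it_IT.windows-1252"))
```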

@code{decimal} is the decimal point character.  The observed values
are @samp{.} and @samp{,}.

@code{grouping} is the grouping character.  The observed values are
@samp{,}, @samp{.}, @samp{'}, @samp{ }, and zero (presumably
indicating that digits should not be grouped).

@code{x5} is observed as either 0 or 5.  When it is 5, the following
strings are CCA through CCE format strings.  Most commonly these are
all @code{-,,,} but other strings occur.

@example
font := byte[index] 31 string[typeface]
        00 00
        (10 | 20 | 40 | 50 | 70 | 80)[f1]
        41
        (i0 | i1 | i2)[f2]
        00
        (i0 | i2 | i64173)[f3]
        (i0 | i1 | i2 | i3)[f4]
        string[fgcolor] string[bgcolor]
        i0 i0 00
        (v3: int[f5] int[f6] int[f7] int[f8])
@end example

Each @code{font}, in order, represents the font style for a different
element: title, caption, footnote, row labels, column labels, corner
labels, data, and layers.

@code{index} is the 1-based index of the @code{font}, i.e.@: 1 for the
first @code{font}, through 8 for the final @code{font}.

@code{typeface} is the string name of the font.  In the corpus, this
is @code{SansSerif} in over 99% of instances and @code{Times New
Roman} in the rest.

@code{fgcolor} and @code{bgcolor} are the foreground color and
background color, respectively.  In the corpus, these are always
@code{#000000} and @code{#ffffff}, respectively.

The meaning of the remaining data is unknown.  It seems likely to
include font sizes, horizontal and vertical alignment, attributes such
as bold or italic, and margins.  @code{f1} is @code{40} most of the
time.  @code{f2} is @code{i1} most of the time for the title and
@code{i0} most of the time for other fonts.

The table below lists the values observed in the corpus.  When a cell
contains a single value, then 99+% of the corpus contains that value.
When a cell contains a pair of values, then the first value is seen in
about two-thirds of the corpus and the second value in about the
remaining one-third.  In fonts that include multiple pairs, values are
correlated, that is, for font 3, f5 = 24, f6 = 24, f7 = 2 appears
about two-thirds of the time, as does the combination of f4 = 0, f6 =
10 for font 7.

@example
font  f1  f2     f3   f4     f5    f6  f7  f8

   1  40   1      0    0      8 10/11   1   8
   2  40   0      2    1      8 10/11   1   1
   3  40   0      2    1  24/11 24/ 8 2/3   4
   4  40   0      2    3      8 10/11   1   1
   5  40   0      0    1      8 10/11   1   4
   6  40   0      2    1      8 10/11   1   4
   7  40   0  64173  0/1      8 10/11   1   1
   8  40   0      2    3      8 10/11   1   4
@end example

@example
dimensions := int[n-dims] dimension*[n-dims]
dimension := value[name]
             byte[d1]
             (00 | 01 | 02)[d2]
             (i0 | i2)[d3]
             (00 | 01)[d4]
             (00 | 01)[d5]
             01
             int[d6]
             int[n-categories] category*[n-categories]
@end example

@code{name} is the name of the dimension, e.g.@: @code{Variables},
@code{Statistics}, or a variable name.

@code{d1} is usually 0 but many other values have been observed.

@code{d3} is 2 over 99% of the time.

@code{d5} is 0 over 99% of the time.

@code{d6} is either -1 or the 0-based index of the dimension, e.g.@: 0
for the first dimension, 1 for the second, and so on.  The latter is
the case 98% of the time in the corpus.

@example
category := value[name] (terminal | group)
terminal := 00 00 00 i2 int[index] i0
@end example

@code{name} is the name of the category (or group).

@code{category} can represent a terminal category.  In that case,
@code{index} is a nonnegative integer less than @code{n-categories} in
the @code{dimension} in which the @code{category} is nested (directly
or indirectly).

Alternatively, @code{category} can represent a @code{group} of nested
categories:

@example
group := (00 | 01)[merge] 00 01 (i0 | i2)[data]
         i-1 int[n-subcategories] category*[n-subcategories]
@end example

Ordinarily a group has some nested content, so that
@code{n-subcategories} is positive, but a few instances of groups with
@code{n-subcategories} 0 have been observed.

If @code{merge} is 00, the most common value, then the group is really
a distinct group that should be represented as such in the visual
representation and user interface.  If @code{merge} is 01, however,
the categories in this group should be shown and treated as if they
were direct children of the group's parent group (or if it has no
parent group, then direct children of the dimension), and this group's
name is irrelevant and should not be displayed.  (Merged groups can be
nested!)

@code{data} appears to be i2 when all of the categories within a group
are terminal categories that directly represent data values for a
variable (e.g.@: in a frequency table or crosstabulation, a group of
values in a variable being tabulated) and i0 otherwise, but this might
be naive.
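
The effect of @code{merge} can be shown on categories that have already
been parsed.  The @code{Category} class and @code{visible_children}
function below are hypothetical and model only the fields discussed
above.  The sketch splices the children of merged groups into their
parent, recursively, so that nested merged groups are handled too.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    name: str
    index: Optional[int] = None  # set for terminal categories only
    merge: bool = False          # meaningful for groups only
    children: List["Category"] = field(default_factory=list)

    @property
    def is_group(self):
        return self.index is None


def visible_children(categories):
    """Return categories with merged groups replaced by their children."""
    result = []
    for category in categories:
        if category.is_group and category.merge:
            # A merged group's name is dropped; its children move up.
            result.extend(visible_children(category.children))
        elif category.is_group:
            result.append(Category(category.name,
                                   children=visible_children(category.children)))
        else:
            result.append(category)
    return result


tree = [
    Category("Statistics", merge=True, children=[
        Category("Mean", index=0),
        Category("Inner", merge=True, children=[Category("N", index=1)]),
    ]),
    Category("Group", children=[Category("Valid", index=2)]),
]
print([category.name for category in visible_children(tree)])
```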

@example
data := int[layers] int[rows] int[columns] int*[n-dims]
@end example

The values of @code{layers}, @code{rows}, and @code{columns} each
specify the number of dimensions represented in layers or rows or
columns, respectively, and their values sum to @code{n-dims}, the
number of dimensions.

The @code{n-dims} integers are a permutation of the 0-based
dimension numbers.  The first @code{layers} of them specify each of
the dimensions represented by layers, the next @code{rows} of them
specify the dimensions represented by rows, and the final
@code{columns} of them specify the dimensions represented by columns.
When there is more than one dimension of a given kind, the inner
dimensions are given first.
 572 When there is more than one dimension of a given kind, the inner
 573 dimensions are given first.