From 7d38d3c5c32f0ca2ef9568c2e699bf75a99c9c34 Mon Sep 17 00:00:00 2001 From: Ben Pfaff Date: Fri, 9 May 2025 13:50:01 -0700 Subject: [PATCH] work on manual --- rust/doc/src/SUMMARY.md | 3 +- rust/doc/src/spv/light-detail.md | 1061 +++++++++++++++++ .../spv/{structure/index.md => structure.md} | 0 3 files changed, 1063 insertions(+), 1 deletion(-) create mode 100644 rust/doc/src/spv/light-detail.md rename rust/doc/src/spv/{structure/index.md => structure.md} (100%) diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index 0ef511b3ca..ab2ede3ab6 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -193,7 +193,8 @@ - [Dictionary Termination Record](system-file/dictionary-termination-record.md) - [Data Record](system-file/data-record.md) - [SPSS Viewer File Format](spv/index.md) - - [Structure Members](spv/structure/index.md) + - [Structure Members](spv/structure.md) + - [Light Detail Members](spv/light-detail.md) - [Encrypted File Wrappers](encrypted-wrapper/index.md) - [Common Wrapper Format](encrypted-wrapper/common-wrapper-format.md) - [Password Encoding](encrypted-wrapper/password-encoding.md) diff --git a/rust/doc/src/spv/light-detail.md b/rust/doc/src/spv/light-detail.md new file mode 100644 index 0000000000..6b70c7be58 --- /dev/null +++ b/rust/doc/src/spv/light-detail.md @@ -0,0 +1,1061 @@ +# Light Detail Member Format + +This section describes the format of "light" detail `.bin` members. +These members have a binary format which we describe here in terms of a +context-free grammar using the following conventions: + +* `NonTerminal ⇒ ...` + Nonterminals have CamelCaps names, and ⇒ indicates a production. + The right-hand side of a production is often broken across multiple + lines. Break points are chosen for aesthetics only and have no + semantic significance. + +* `00, 01, ..., ff.` + A byte with a fixed value, written as a pair of hexadecimal + digits. 
+ +* `i0, i1, ..., i9, i10, i11, ...` + `ib0, ib1, ..., ib9, ib10, ib11, ...` + A 32-bit integer in little-endian or big-endian byte order, + respectively, with a fixed value, written in decimal. Prefixed by + `i` for little-endian or `ib` for big-endian. + +* `byte` + A byte. + +* `bool` + A byte with value 0 or 1. + +* `int16` + `be16` + A 16-bit unsigned integer in little-endian or big-endian byte + order, respectively. + +* `int32` + `be32` + A 32-bit unsigned integer in little-endian or big-endian byte + order, respectively. + +* `int64` + `be64` + A 64-bit unsigned integer in little-endian or big-endian byte + order, respectively. + +* `double` + A 64-bit IEEE floating-point number. + +* `float` + A 32-bit IEEE floating-point number. + +* `string` + `bestring` + A 32-bit unsigned integer, in little-endian or big-endian byte + order, respectively, followed by the specified number of bytes of + character data. (The encoding is indicated by the Formats + nonterminal.) + +* `X?` + X is optional, e.g. 00? is an optional zero byte. + +* `X*N` + X is repeated N times, e.g. byte*10 for ten arbitrary bytes. + +* `X[NAME]` + Gives X the specified NAME. Names are used in textual + explanations. They are also used, also bracketed, to indicate + counts, e.g. `int32[n] byte*[n]` for a 32-bit integer followed by + the specified number of arbitrary bytes. + +* `A | B` + Either A or B. + +* `(X)` + Parentheses are used for grouping to make precedence clear, + especially in the presence of |, e.g. in 00 (01 | 02 | 03) 00. + +* `count(X)` + `becount(X)` + A 32-bit unsigned integer, in little-endian or big-endian byte + order, respectively, that indicates the number of bytes in X, + followed by X itself. + +* `v1(X)` + In a version 1 `.bin` member, X; in version 3, nothing. (The + `.bin` header indicates the version.) + +* `v3(X)` + In a version 3 `.bin` member, X; in version 1, nothing. + +PSPP uses this grammar to parse light detail members. 
See +`src/output/spv/light-binary.grammar` in the PSPP source tree for the +full grammar. + +Little-endian byte order is far more common in this format, but a few +pieces of the format use big-endian byte order. + +Light detail members express linear units in two ways: points (pt), at 72/inch, and "device-independent pixels" (px), at 96/inch. To +convert from pt to px, multiply by 1.33 and round up. To convert from +px to pt, divide by 1.33 and round down. + +A "light" detail member `.bin` consists of a number of sections +concatenated together, terminated by an optional byte 01: + +``` +Table => + Header Titles Footnotes + Areas Borders PrintSettings TableSettings Formats + Dimensions Axes Cells + 01? +``` + +The following sections go into more detail. + + + +## Header + +An SPV light member begins with a 39-byte header: + +``` +Header => + 01 00 + (i1 | i3)[version] + bool[x0] + bool[x1] + bool[rotate-inner-column-labels] + bool[rotate-outer-row-labels] + bool[x2] + int32[x3] + int32[min-col-heading-width] int32[max-col-heading-width] + int32[min-row-heading-width] int32[max-row-heading-width] + int64[table-id] +``` + +`version` is a version number that affects the interpretation of some +of the other data in the member. We will refer to "version 1" and +"version 3" later on and use `v1(...)` and `v3(...)` for +version-specific formatting (as described previously). + +If `rotate-inner-column-labels` is 1, then column labels closest to +the data are rotated 90° counterclockwise; otherwise, they are shown in +the normal way. + +If `rotate-outer-row-labels` is 1, then row labels farthest from the +data are rotated 90° counterclockwise; otherwise, they are shown in the +normal way. + +`min-col-heading-width`, `max-col-heading-width`, +`min-row-heading-width`, and `max-row-heading-width` are measurements in +1/96 inch units (called "device independent pixel" units in Windows) +whose values influence column widths. 
For the purpose of interpreting +these values, a table is divided into the three regions shown below: + +``` ++------------------+-------------------------------------------------+ +| | column headings | +| +-------------------------------------------------+ +| corner | | +| and | | +| row headings | data | +| | | +| | | ++------------------+-------------------------------------------------+ +``` + +`min-col-heading-width` and `max-col-heading-width` apply to the +columns in the column headings region. `min-col-heading-width` is the +minimum width that any of these columns will be given automatically. In +addition, `max-col-heading-width` is the maximum width that a column +will be assigned to accommodate a long label in the column headings +cells. These columns will still be made wider to accommodate wide data +values in the data region. + +`min-row-heading-width` is the minimum width that a column in the +corner and row headings region will be given automatically. +`max-row-heading-width` is the maximum width that a column in this +region will be assigned to accommodate a long label. This region doesn't +include data, so data values don't affect column widths. + +`table-id` is a binary version of the `tableId` attribute in the +structure member that refers to the detail member. For example, if +`tableId` is `-4122591256483201023`, then `table-id` would be +0xc6c99d183b300001. + +The meaning of the other variable parts of the header is not known. +A writer may safely use version 3, true for `x0`, false for `x1`, true +for `x2`, and 0x15 for `x3`. + +## Titles + +``` +Titles => + Value[title] 01? + Value[subtype] 01? 31 + Value[user-title] 01? + (31 Value[corner-text] | 58) + (31 Value[caption] | 58) +``` + +The `Titles` follow the Header and specify the table's title, caption, +and corner text. + +The `user-title` reflects any user editing of the title text or +style. The `title` is the title originally generated by the procedure. 
+Both of these are appropriate for presentation and localized to the +user's language. For example, for a frequency table, `title` and +`user-title` normally name the variable and `c` is simply "Frequencies". + +`subtype` is the same as the `subType` attribute in the [`table` +structure XML element](structure.md#the-table-element) that referred +to this member. + +The `corner-text`, if present, is shown in the upper-left corner of +the table, above the row headings and to the left of the column +headings. It is usually absent. When row dimension labels are +displayed in the corner (see `show-row-labels-in-corner`), corner text +is hidden. + +The `caption`, if present, is shown below the table. `caption` +reflects user editing of the caption. + +## Footnotes + +``` +Footnotes => int32[n-footnotes] Footnote*[n-footnotes] +Footnote => Value[text] (58 | 31 Value[marker]) int32[show] +``` + +Each footnote has `text` and an optional custom `marker` (such as +`*`). + +The syntax for Value would allow footnotes (and their markers) to +reference other footnotes, but in practice this doesn't work. + +`show` is a 32-bit signed integer. It is positive to show the +footnote or negative to hide it. Its magnitude is often 1, and in other +cases tends to be the number of references to the footnote. It is safe +to write 1 to show a footnote and -1 to hide it. + +## Areas + +``` +Areas => 00? Area*8 +Area => + byte[index] 31 + string[typeface] float[size] int32[style] bool[underline] + int32[halign] int32[valign] + string[fg-color] string[bg-color] + bool[alternate] string[alt-fg-color] string[alt-bg-color] + v3(int32[left-margin] int32[right-margin] int32[top-margin] int32[bottom-margin]) +``` + +Each `Area` represents the style for a different area of the table, in +the following order: title, caption, footer, corner, column labels, +row labels, data, and layers. + +`index` is the 1-based index of the Area, i.e. 1 for the first `Area`, +through 8 for the final `Area`. 
+ +`typeface` is the string name of the font used in the area. In the +corpus, this is `SansSerif` in over 99% of instances and `Times New +Roman` in the rest. + +`size` is the size of the font, in [px](#px). The most common size in +the corpus is 12 px. Even though `size` has a floating-point type, in +the corpus its values are always integers. + +`style` is a bit mask. Bit 0 (with value 1) is set for bold, bit 1 +(with value 2) is set for italic. + +`underline` is 1 if the font is underlined, 0 otherwise. + +`halign` specifies horizontal alignment: 0 for center, 2 for left, 4 +for right, 61453 for decimal, 64173 for mixed. Mixed alignment varies +according to type: string data is left-justified, numbers and most other +formats are right-justified. + +`valign` specifies vertical alignment: 0 for center, 1 for top, 3 for +bottom. + +`fg-color` and `bg-color` are the foreground color and background +color, respectively. In the corpus, these are always `#000000` and +`#ffffff`, respectively. + +`alternate` is 1 if rows should alternate colors, 0 if all rows +should be the same color. When `alternate` is 1, `alt-fg-color` and +`alt-bg-color` specify the colors for the alternate rows; otherwise they +are empty strings. + +`left-margin`, `right-margin`, `top-margin`, and `bottom-margin` are +measured in px. + +## Borders + +``` +Borders => + count( + ib1[endian] + be32[n-borders] Border*[n-borders] + bool[show-grid-lines] + 00 00 00) + +Border => + be32[border-type] + be32[stroke-type] + be32[color] +``` + +`Borders` reflects how borders between regions are drawn. + +The fixed value of `endian` can be used to validate the endianness. + +`show-grid-lines` is 1 to draw grid lines, otherwise 0. + +Each `Border` describes one kind of border. `n-borders` seems to +always be 19. Each `border-type` appears once (although in an +unpredictable order) and corresponds to one of the following borders: + +* 0: Title. +* 1...4: Left, top, right, and bottom outer frame. 
+* 5...8: Left, top, right, and bottom inner frame. +* 9, 10: Left and top of data area. +* 11, 12: Horizontal and vertical dimension rows. +* 13, 14: Horizontal and vertical dimension columns. +* 15, 16: Horizontal and vertical category rows. +* 17, 18: Horizontal and vertical category columns. + +`stroke-type` describes how a border is drawn, as one of: + +* 0: No line. +* 1: Solid line. +* 2: Dashed line. +* 3: Thick line. +* 4: Thin line. +* 5: Double line. + +`color` is an RGB color. Bits 24-31 are alpha, bits 16-23 are red, +8-15 are green, 0-7 are blue. An alpha of 255 indicates an opaque +color, therefore opaque black is 0xff000000. + +## Print Settings + +``` +PrintSettings => + count( + ib1[endian] + bool[all-layers] + bool[paginate-layers] + bool[fit-width] + bool[fit-length] + bool[top-continuation] + bool[bottom-continuation] + be32[n-orphan-lines] + bestring[continuation-string]) +``` + +`PrintSettings` reflects settings for printing. The fixed value of +`endian` can be used to validate the endianness. + +`all-layers` is 1 to print all layers, 0 to print only the layer +designated by `current-layer` in [`TableSettings`](#table-settings). + +`paginate-layers` is 1 to print each layer at the start of a new +page, 0 otherwise. (This setting is honored only if `all-layers` is 1, +since otherwise only one layer is printed.) + +`fit-width` and `fit-length` control whether the table is shrunk to +fit within a page's width or length, respectively. + +`n-orphan-lines` is the minimum number of rows or columns to put in +one part of a table that is broken across pages. + +If `top-continuation` is 1, then `continuation-string` is printed at +the top of a page when a table is broken across pages for printing; +similarly for `bottom-continuation` and the bottom of a page. Usually, +`continuation-string` is empty. 
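The `PrintSettings` grammar above can be read with a few fixed-size unpacks. The following is only a minimal sketch in Python, not PSPP's own parser: the function name `parse_print_settings` and the dictionary layout are hypothetical, and `continuation-string` is returned as raw bytes because its encoding is determined elsewhere (by `Formats`).

```python
import struct

def parse_print_settings(buf, offset=0):
    """Parse one PrintSettings section; return (settings, next_offset).

    Hypothetical helper: the name and the returned dict are illustrative.
    """
    # count(...): a little-endian 32-bit byte length, then the payload.
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    body = buf[start:start + length]
    if len(body) != length:
        raise ValueError("truncated PrintSettings")

    # ib1[endian], six bool bytes, be32[n-orphan-lines], bestring length.
    head = struct.Struct(">I6BII")
    endian, *flags, n_orphan_lines, str_len = head.unpack_from(body, 0)
    if endian != 1:
        raise ValueError("unexpected endian marker")
    if any(f not in (0, 1) for f in flags):
        raise ValueError("bool field is not 0 or 1")

    text = body[head.size:head.size + str_len]
    if len(text) != str_len:
        raise ValueError("truncated continuation-string")

    names = ("all-layers", "paginate-layers", "fit-width",
             "fit-length", "top-continuation", "bottom-continuation")
    settings = dict(zip(names, map(bool, flags)))
    settings["n-orphan-lines"] = n_orphan_lines
    settings["continuation-string"] = text  # raw bytes, not yet decoded
    return settings, start + length
```

Checking `endian` first is a cheap way to detect that the reader has lost synchronization before trusting any of the later fields.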
+ +## Table Settings + +``` +TableSettings => + count( + v3( + ib1[endian] + be32[x5] + be32[current-layer] + bool[omit-empty] + bool[show-row-labels-in-corner] + bool[show-alphabetic-markers] + bool[footnote-marker-superscripts] + byte[x6] + becount( + Breakpoints[row-breaks] Breakpoints[column-breaks] + Keeps[row-keeps] Keeps[column-keeps] + PointKeeps[row-point-keeps] PointKeeps[column-point-keeps] + ) + bestring[notes] + bestring[table-look] + )...) + +Breakpoints => be32[n-breaks] be32*[n-breaks] + +Keeps => be32[n-keeps] Keep*[n-keeps] +Keep => be32[offset] be32[n] + +PointKeeps => be32[n-point-keeps] PointKeep*[n-point-keeps] +PointKeep => be32[offset] be32 be32 +``` + +`TableSettings` reflects display settings. The fixed value of +`endian` can be used to validate the endianness. + +`current-layer` is the displayed layer. Suppose there are \\(d\\) +layers, numbered 1 through \\(d\\) in the order given in the +[`Dimensions`](#dimensions), and that the +displayed value of dimension \\(i\\) is \\(x_i\\), \\(0 \le x_i < n_i\\), +where \\(n_i\\) is the number of categories in dimension \\(i\\). +Then `current-layer` is the \\(k\\) calculated by the following algorithm: + +> let \\(k = 0\\). +> for each \\(i\\) from \\(d\\) downto 1: +> \\(\quad k = (n_i \times k) + x_i\\). + +If `omit-empty` is 1, empty rows or columns (ones with nothing in any +cell) are hidden; otherwise, they are shown. + +If `show-row-labels-in-corner` is 1, then row labels are shown in the +upper left corner; otherwise, they are shown nested. + +If `show-alphabetic-markers` is 1, markers are shown as letters (e.g. +`a`, `b`, `c`, ...); otherwise, they are shown as numbers starting from +1. + +When `footnote-marker-superscripts` is 1, footnote markers are shown +as superscripts, otherwise as subscripts. + +The `Breakpoints` are rows or columns after which there is a page +break; for example, a row break of 1 requests a page break after the +second row. 
Usually no breakpoints are specified, indicating that page +breaks should be selected automatically. + +The `Keeps` are ranges of rows or columns to be kept together without +a page break; for example, a row Keep with `offset` 1 and `n` 10 +requests that the 10 rows starting with the second row be kept +together. Usually no `Keeps` are specified. + +The `PointKeeps` seem to be generated automatically based on +user-specified Keeps. They seem to indicate a conversion from rows or +columns to pixel or point offsets. + +`notes` is a text string that contains user-specified notes. It is +displayed when the user hovers the cursor over the table, like text in +the `title` attribute in HTML. It is not printed. It is usually empty. + +`table-look` is the name of an SPSS "TableLook" table style, such as +"Default" or "Academic"; it is often empty. + +`TableSettings` ends with an arbitrary number of null bytes. A writer +may safely write 82 null bytes. + +A writer may safely use 4 for `x5` and 0 for `x6`. + +## Formats + +``` +Formats => + int32[n-widths] int32*[n-widths] + string[locale] + int32[current-layer] + bool[x7] bool[x8] bool[x9] + Y0 + CustomCurrency + count( + v1(X0?) + v3(count(X1 count(X2)) count(X3))) +Y0 => int32[epoch] byte[decimal] byte[grouping] +CustomCurrency => int32[n-ccs] string*[n-ccs] +``` + +If `n-widths` is nonzero, then the accompanying integers are column +widths as manually adjusted by the user. + +`locale` is a locale including an encoding, such as +`en_US.windows-1252` or `it_IT.windows-1252`. (`locale` is often +duplicated in Y1, described below). + +`epoch` is the year that starts the epoch. A 2-digit year is +interpreted as belonging to the 100 years beginning at the epoch. The +default epoch year is 69 years prior to the current year; thus, in 2017 +this field by default contains 1948. In the corpus, `epoch` ranges from +1943 to 1948, plus some contain -1. + +`decimal` is the decimal point character. The observed values are +`.` and `,`. 
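The `epoch` rule described above (a 2-digit year belongs to the 100 years beginning at the epoch) can be illustrated with a short sketch. The function names here are hypothetical, and the sketch deliberately rejects the `-1` value seen in the corpus, because the text does not say what it means.

```python
def default_epoch(current_year):
    """Default epoch: 69 years before the current year."""
    return current_year - 69

def resolve_two_digit_year(yy, epoch):
    """Map a 2-digit year into the range [epoch, epoch + 99]."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a 2-digit year")
    if epoch < 0:
        # The corpus contains -1; its meaning is unknown, so refuse it here.
        raise ValueError("unsupported epoch")
    # Start in the epoch's own century, then move up if that falls too early.
    year = epoch - epoch % 100 + yy
    if year < epoch:
        year += 100
    return year
```

With the default epoch for 2017, which is 1948, the year `48` resolves to 1948 and `47` resolves to 2047.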
+ +`grouping` is the grouping character. Usually, it is `,` if +`decimal` is `.`, and vice versa. Other observed values are `'` +(apostrophe), ` ` (space), and zero (presumably indicating that digits +should not be grouped). + +`n-ccs` is observed as either 0 or 5. When it is 5, the following +strings are CCA through CCE format strings. *Note (pspp)Custom Currency +Formats::. Most commonly these are all `-,,,` but other strings occur. + +A writer may safely use false for `x7`, `x8`, and `x9`. + +### X0 + +X0 only appears, optionally, in version 1 members. + +``` +X0 => byte*14 Y1 Y2 +Y1 => + string[command] string[command-local] + string[language] string[charset] string[locale] + bool[x10] bool[include-leading-zero] bool[x12] bool[x13] + Y0 +Y2 => CustomCurrency byte[missing] bool[x17] +``` + +`command` describes the statistical procedure that generated the +output, in English. It is not necessarily the literal syntax name of +the procedure: for example, NPAR TESTS becomes "Nonparametric Tests." +`command-local` is the procedure's name, translated into the output +language; it is often empty and, when it is not, sometimes the same as +`command`. + +`include-leading-zero` is the `LEADZERO` setting for the table, where +false is `OFF` (the default) and true is `ON`. *Note (pspp)SET +LEADZERO::. + +`missing` is the character used to indicate that a cell contains a +missing value. It is always observed as `.`. + +A writer may safely use false for `x10` and `x17` and true for `x12` +and `x13`. + +### X1 + +`X1` only appears in version 3 members. + +``` +X1 => + bool[x14] + byte[show-title] + bool[x16] + byte[lang] + byte[show-variables] + byte[show-values] + int32[x18] int32[x19] + 00*17 + bool[x20] + bool[show-caption] +``` + +`lang` may indicate the language in use. Some values seem to be 0: +en, 1: de, 2: es, 3: it, 5: ko, 6: pl, 8: zh-tw, 10: pt_BR, 11: fr. + +`show-variables` determines how variables are displayed by default. 
+A value of 1 means to display variable names, 2 to display variable +labels when available, 3 to display both (name followed by label, +separated by a space). The most common value is 0, which probably means +to use a global default. + +`show-values` is a similar setting for values. A value of 1 means to +display the value, 2 to display the value label when available, 3 to +display both. Again, the most common value is 0, which probably means +to use a global default. + +`show-title` is 1 to show the title, 10 to hide it. + +`show-caption` is true to show the caption, false to hide it. + +A writer may safely use false for `x14`, false for `x16`, 0 for +`lang`, -1 for `x18` and `x19`, and false for `x20`. + +### X2 + +`X2` only appears in version 3 members. + +``` +X2 => + int32[n-row-heights] int32*[n-row-heights] + int32[n-style-map] StyleMap*[n-style-map] + int32[n-styles] StylePair*[n-styles] + count((i0 i0)?) +StyleMap => int64[cell-index] int16[style-index] +``` + +If present, `n-row-heights` and the accompanying integers are row +heights as manually adjusted by the user. + +The rest of `X2` specifies styles for data cells. At first glance +this is odd, because each data cell can have its own style embedded as +part of the data, but in practice `X2` specifies a style for a cell +only if that cell is empty (and thus does not appear in the data at +all). Each StyleMap specifies the index of a blank cell, calculated +the same way as in the [`Cells`](#cells), along +with a 0-based index into the accompanying StylePair array. + +A writer may safely omit the optional `i0 i0` inside the +`count(...)`. + +### X3 + +`X3` only appears in version 3 members. + +``` +X3 => + 01 00 byte[x21] 00 00 00 + Y1 + double[small] 01 + (string[dataset] string[datafile] i0 int32[date] i0)? + Y2 + (int32[x22] i0 01?)? +``` + +`small` is a small real number. In the corpus, it overwhelmingly +takes the value 0.0001, with zero occasionally seen. 
Nonzero numbers +with format 40 (see [`Value`](#value)) whose magnitudes are +smaller than `small` are displayed in scientific notation. (Thus, a `small` of zero +prevents scientific notation from being chosen.) + +`dataset` is the name of the dataset analyzed to produce the output, +e.g. `DataSet1`, and `datafile` the name of the file it was read from, +e.g. `C:\Users\foo\bar.sav`. The latter is sometimes the empty string. + +`date` is a date, as seconds since the epoch, i.e. since January 1, +1970. Pivot tables within an SPV file often have dates a few minutes +apart, so this is probably a creation date for the table rather than for +the file. + +Sometimes `dataset`, `datafile`, and `date` are present and other +times they are absent. The reader can distinguish by assuming that they +are present and then checking whether the presumptive `dataset` contains +a null byte (a valid string never will). + +`x22` is usually 0 or 2000000. + +A writer may safely use 4 for `x21` and omit `x22` and the other +optional bytes at the end. + +### Encoding + +Formats contains several indications of character encoding: + +- `locale` in Formats itself. + +- `locale` in Y1 (in version 1, Y1 is optionally nested inside X0; in +version 3, Y1 is nested inside X3). + +- `charset` in version 3, in Y1. + +- `lang` in X1, in version 3. + +`charset`, if present, is a good indication of character encoding, +and in its absence the encoding suffix on `locale` in Formats will work. + +`locale` in Y1 can be disregarded: it is normally the same as +`locale` in Formats, and it is only present if `charset` is also. + +`lang` is not helpful and should be ignored for character encoding +purposes. + +However, the corpus contains many examples of light members whose +strings are encoded in UTF-8 despite declaring some other character set. 
+Furthermore, the corpus contains several examples of light members in +which some strings are encoded in UTF-8 (and contain multibyte +characters) and other strings are encoded in another character set (and +contain non-ASCII characters). PSPP treats any valid UTF-8 string as +UTF-8 and only falls back to the declared encoding for strings that are +not valid UTF-8. + +The `pspp-output` program's `strings` command can help analyze the +encoding in an SPV light member. Use `pspp-output --help-dev` to see +its usage. + +## Dimensions + +A pivot table presents multidimensional data. A Dimension identifies +the categories associated with each dimension. + +``` +Dimensions => int32[n-dims] Dimension*[n-dims] +Dimension => + Value[name] DimProperties + int32[n-categories] Category*[n-categories] +DimProperties => + byte[x1] + byte[x2] + int32[x3] + bool[hide-dim-label] + bool[hide-all-labels] + 01 int32[dim-index] +``` + +`name` is the name of the dimension, e.g. `Variables`, `Statistics`, +or a variable name. + +The meanings of `x1` and `x3` are unknown. `x1` is usually 0 but +many other values have been observed. A writer may safely use 0 for +`x1` and 2 for `x3`. + +`x2` is 0, 1, or 2. For a pivot table with L layer dimensions, R row +dimensions, and C column dimensions, `x2` is 2 for the first L +dimensions, 0 for the next R dimensions, and 1 for the remaining C +dimensions. This does not mean that the layer dimensions must be +presented first, followed by the row dimensions, followed by the column +dimensions--on the contrary, they are frequently in a different +order--but `x2` must follow this pattern to prevent the pivot table from +being misinterpreted. + +If `hide-dim-label` is 00, the pivot table displays a label for the +dimension itself. Because usually the group and category labels are +enough explanation, it is usually 01. + +If `hide-all-labels` is 01, the pivot table omits all labels for the +dimension, including group and category labels. 
It is usually 00. When +`hide-all-labels` is 01, `hide-dim-label` is ignored. + +`dim-index` is usually the 0-based index of the dimension, e.g. 0 for +the first dimension, 1 for the second, and so on. Sometimes it is -1. +There is no visible difference. A writer may safely use the 0-based +index. + +## Categories + +Categories are arranged in a tree. Only the leaf nodes in the tree are +really categories; the others just serve as grouping constructs. + +``` +Category => Value[name] (Leaf | Group) +Leaf => 00 00 00 i2 int32[leaf-index] i0 +Group => + bool[merge] 00 01 int32[x23] + i-1 int32[n-subcategories] Category*[n-subcategories] +``` + +`name` is the name of the category (or group). + +A Leaf represents a leaf category. The Leaf's `leaf-index` is a +nonnegative integer unique within the Dimension and less than +`n-categories` in the Dimension. If the user does not sort or rearrange +the categories, then `leaf-index` starts at 0 for the first Leaf in the +dimension and increments by 1 with each successive Leaf. If the user +does sort or rearrange the categories, then the order of categories in +the file reflects that change and `leaf-index` reflects the original +order. + +A dimension can have no leaf categories at all. A table that +contains such a dimension necessarily has no data at all. + +A Group is a group of nested categories. Usually a Group contains at +least one Category, so that `n-subcategories` is positive, but Groups +with zero subcategories have been observed. + +If a Group's `merge` is 00, the most common value, then the group is +really a distinct group that should be represented as such in the visual +representation and user interface. If `merge` is 01, the categories in +this group should be shown and treated as if they were direct children +of the group's containing group (or if it has no parent group, then +direct children of the dimension), and this group's name is irrelevant +and should not be displayed. (Merged groups can be nested!) 
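The handling of merged groups described above amounts to splicing a merged group's children into its parent, recursively. The following sketch assumes a simple in-memory model; the `Leaf` and `Group` classes and the `splice_merged` function are hypothetical and are not taken from PSPP.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    name: str
    leaf_index: int

@dataclass
class Group:
    name: str
    merge: bool
    children: list = field(default_factory=list)

def splice_merged(categories):
    """Return the category list as it should be displayed.

    Children of a merged group replace the group itself; because the
    function recurses first, nested merged groups collapse as well.
    """
    out = []
    for cat in categories:
        if isinstance(cat, Group):
            kids = splice_merged(cat.children)
            if cat.merge:
                out.extend(kids)   # the merged group's own name is dropped
            else:
                out.append(Group(cat.name, False, kids))
        else:
            out.append(cat)
    return out
```

The function builds a new tree rather than mutating its input, so the original file order and `leaf-index` values remain available to the caller.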
+ +Writers need not use merged groups. + +A Group's `x23` appears to be `i2` when all of the categories within a +group are leaf categories that directly represent data values for a +variable (e.g. in a frequency table or crosstabulation, a group of +values in a variable being tabulated) and i0 otherwise. A writer may +safely write a constant 0 in this field. + +## Axes + +After the dimensions come assignment of each dimension to one of the +axes: layers, rows, and columns. + +``` +Axes => + int32[n-layers] int32[n-rows] int32[n-columns] + int32*[n-layers] int32*[n-rows] int32*[n-columns] +``` + +`n-layers`, `n-rows`, and `n-columns` specify +the number of dimensions displayed in layers, rows, and columns, +respectively. Any of them may be zero. Their values sum to +`n-dims` from [`Dimensions`](#dimensions). + +The following `n-dims` integers, in three groups, are a +permutation of the 0-based dimension numbers. The first `n-layers` +integers specify each of the dimensions represented by layers, the next +`n-rows` integers specify the dimensions represented by rows, and the +final `n-columns` integers specify the dimensions represented by +columns. When there is more than one dimension of a given kind, the +inner dimensions are given first. (For the layer axis, this means that +the first dimension is at the bottom of the list and the last dimension +is at the top when the current layer is displayed.) + +## Cells + +The final part of an SPV light member contains the actual data. + +``` +Cells => int32[n-cells] Cell*[n-cells] +Cell => int64[index] v1(00?) Value +``` + +A Cell consists of an `index` and a Value. Suppose there are \\(d\\) +dimensions, numbered 1 through \\(d\\) in the order given in the [`Dimensions`](#dimensions) +previously, and that dimension \\(i\\) has \\(n_i\\) categories. Consider the cell +at coordinates \\(x_i, 1 \le i \le d\\), and note that \\(0 \le x_i < n_i\\). 
Then +the index \\(k\\) is calculated by the following algorithm: + + +> let \\(k = 0\\). +> for each \\(i\\) from 1 to \\(d\\): +> \\(\quad k = (n_i \times k) + x_i\\) + +For example, suppose there are 3 dimensions with 3, 4, and 5 +categories, respectively. The cell at coordinates (1, 2, 3) has index +\\(k = 5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33\\). Within a +given dimension, the index is the `leaf-index` in a Leaf. + +## Value + +`Value` is used throughout the SPV light member format. It boils down to +a number or a string. + +``` +Value => 00? 00? 00? 00? RawValue +RawValue => + 01 ValueMod int32[format] double[x] + | 02 ValueMod int32[format] double[x] + string[var-name] string[value-label] byte[show] + | 03 string[local] ValueMod string[id] string[c] bool[fixed] + | 04 ValueMod int32[format] string[value-label] string[var-name] + byte[show] string[s] + | 05 ValueMod string[var-name] string[var-label] byte[show] + | 06 string[local] ValueMod string[id] string[c] + | ValueMod string[template] int32[n-args] Argument*[n-args] +Argument => + i0 Value + | int32[x] i0 Value*[x] /* x > 0 */ +``` + +There are several possible encodings, which one can distinguish by +the first nonzero byte in the encoding. + +* `01` + The numeric value `x`, intended to be presented to the user + formatted according to `format`, which is about the same as the + format described for system files (*note System File Output + Formats::). The exception is that format 40 is not MTIME but + instead approximately a synonym for F format with a different rule + for whether a value is shown in scientific notation: a value in + format 40 is shown in scientific notation if and only if it is + nonzero and its magnitude is less than [`small`](#formats). + + Most commonly, `format` has width 40 (the maximum). + + An `x` with the maximum negative double value `-DBL_MAX` represents + the system-missing value `SYSMIS`. (`HIGHEST` and `LOWEST` have not been + observed.) 
See *note System File Format::, for more about these + special values. + +* `02` + Similar to `01`, with the additional information that `x` is a + value of variable `var-name` and has value label `value-label`. + Both `var-name` and `value-label` can be the empty string, the + latter very commonly. + + `show` determines whether to show the numeric value or the value + label. A value of 1 means to show the value, 2 to show the label, + 3 to show both, and 0 means to use the default specified in + [`show-values`](#formats). + +* `03` + A text string, in two forms: `c` is in English, and sometimes + abbreviated or obscure, and `local` is localized to the user's + locale. In an English-language locale, the two strings are often + the same, and in the cases where they differ, `local` is more + appropriate for a user interface, e.g. `c` of "Not a PxP table for + MCN..." versus `local` of "Computed only for a PxP table, where P + must be greater than 1." + + `c` and `local` are always either both empty or both nonempty. + + `id` is a brief identifying string whose form seems to resemble a + programming language identifier, e.g. `cumulative_percent` or + `factor_14`. It is not unique. + + `fixed` is: + + * `00` for text taken from user input, such as syntax fragment, + expressions, file names, data set names. `id` is always the + empty string. + + * `01` for fixed text strings such as names of procedures or + statistics. `id` is sometimes empty. + +* `04` + The string value `s`, intended to be presented to the user formatted + according to `format`. The format for a string is not too + interesting, and the corpus contains many clearly invalid formats + like `A16.39` or `A255.127` or `A134.1`, so readers should probably + entirely disregard the format. PSPP only checks `format` to + distinguish AHEX format. + + `s` is a value of variable `var-name` and has value label + `value-label`. `var-name` is never empty but `value-label` is + commonly empty. 
+ + `show` has the same meaning as in the encoding for `02`. + +* `05` + Variable `var-name` with variable label `var-label`. In the + corpus, `var-name` is rarely empty and `var-label` is often empty. + + `show` determines whether to show the variable name or the variable + label. A value of 1 means to show the name, 2 to show the label, 3 + to show both, and 0 means to use the default specified in + [`show-variables`](#formats). + +* `06` + Similar to type `03`, with `fixed` assumed to be true. + +* otherwise + When the first byte of a `RawValue` is not one of the above, the + `RawValue` starts with a `ValueMod`, whose syntax is described in + the next section. (A `ValueMod` always begins with byte 31 or 58.) + + This case is a template string, analogous to `printf`, followed by + one or more `Argument`s, each of which has one or more values. The + template string is copied directly into the output except for the + following special syntax: + + * `\%` + `\:` + `\[` + `\]` + Each of these expands to the character following `\`, to + escape characters that have special meaning in template + strings. These are effective inside and outside the `[...]` + syntax forms described below. + + * `\n` + Expands to a new-line, inside or outside the `[...]` forms + described below. + + * `^I` + Expands to a formatted version of argument `I`, which must have + only a single value. For example, `^1` expands to the first + argument's `value`. + + * `[:A:]I` + Expands `A` for each of the values in `I`. `A` should contain one + or more `^J` conversions, which are drawn from the values for + argument `I` in order. Some examples from the corpus: + + * `[:^1:]1` + All of the values for the first argument, concatenated. + + * `[:^1\n:]1` + Expands to the values for the first argument, each followed by a + new-line. + + * `[:^1 = ^2:]2` + Expands to `X = Y` where X is the second argument's first value + and Y is its second value.
(This would be used only if the + argument has two values. If there were more values, the second + and third values would be directly concatenated, which would + look funny.) + + * `[A:B:]I` + This extends the previous form so that the first values are + expanded using `A` and later values are expanded using `B`. For + an unknown reason, within `A` the `^J` conversions are instead + written as `%J`. Some examples from the corpus: + + * `[%1:*^1:]1` + Expands to all of the values for the first argument, + separated by `*`. + + * `[%1 = %2:, ^1 = ^2:]1` + Given appropriate values for the first argument, expands + to `X = 1, Y = 2, Z = 3`. + + * `[%1:, ^1:]1` + Given appropriate values, expands to `1, 2, 3`. + + The template string is localized to the user's locale. + +A writer may safely omit all of the optional 00 bytes at the +beginning of a Value, except that it should write a single 00 byte +before a templated Value. + +## ValueMod + +A `ValueMod` can specify special modifications to a Value. + +``` +ValueMod => + 58 + | 31 + int32[n-refs] int16*[n-refs] + int32[n-subscripts] string*[n-subscripts] + v1(00 (i1 | i2) 00? 00? int32 00? 00?) + v3(count(TemplateString StylePair)) + +TemplateString => count((count((i0 (58 | 31 55))?) (58 | 31 string[id]))?) + +StylePair => + (31 FontStyle | 58) + (31 CellStyle | 58) + +FontStyle => + bool[bold] bool[italic] bool[underline] bool[show] + string[fg-color] string[bg-color] + string[typeface] byte[size] + +CellStyle => + int32[halign] int32[valign] double[decimal-offset] + int16[left-margin] int16[right-margin] + int16[top-margin] int16[bottom-margin] +``` + +A `ValueMod` that begins with `31` specifies special modifications to +a `Value`. + +Each of the `n-refs` integers is a reference to a +[`Footnote`](#footnotes) by a 0-based index. Footnote markers are +shown appended to the main text of the `Value`, as superscripts or +subscripts. 
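
The leading fields of a `ValueMod` in the grammar above can be read with a few little-endian helpers. The following is only a minimal sketch, not PSPP's parser: the names `ValueModPrefix`, `parse_value_mod_prefix`, `take`, `read_int16`, `read_int32`, and `read_string` are hypothetical, and decoding strings as lossy UTF-8 is an assumption (the real encoding is indicated by the Formats nonterminal). It stops after the subscripts and does not read the `v1(...)` or `v3(...)` data that follows.

```rust
/// Footnote references and subscripts read from the start of a `ValueMod`.
/// (Hypothetical type, named here only for illustration.)
#[derive(Debug, Default, PartialEq)]
struct ValueModPrefix {
    /// 0-based indexes into the footnotes.
    refs: Vec<u16>,
    /// Subscript strings, in file order.
    subscripts: Vec<String>,
}

/// Removes and returns the first `n` bytes of `input`, or `None` if fewer
/// than `n` bytes remain.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let data: &'a [u8] = *input;
    if data.len() < n {
        return None;
    }
    let (head, tail) = data.split_at(n);
    *input = tail;
    Some(head)
}

/// Reads a little-endian `int16`.
fn read_int16(input: &mut &[u8]) -> Option<u16> {
    let b = take(input, 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

/// Reads a little-endian `int32`.
fn read_int32(input: &mut &[u8]) -> Option<u32> {
    let b = take(input, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Reads a `string`: a 32-bit length followed by that many bytes.
/// Assumption: bytes are decoded as lossy UTF-8 for illustration only.
fn read_string(input: &mut &[u8]) -> Option<String> {
    let len = read_int32(input)? as usize;
    let bytes = take(input, len)?;
    Some(String::from_utf8_lossy(bytes).into_owned())
}

/// Parses the `58` form, or the `n-refs` and `n-subscripts` fields of the
/// `31` form. Returns `None` on truncated input or an unexpected first byte.
fn parse_value_mod_prefix(input: &mut &[u8]) -> Option<ValueModPrefix> {
    match take(input, 1)?[0] {
        // `58`: no modifications at all.
        0x58 => Some(ValueModPrefix::default()),
        // `31`: footnote references, then subscripts.
        0x31 => {
            let n_refs = read_int32(input)?;
            let mut refs = Vec::new();
            for _ in 0..n_refs {
                refs.push(read_int16(input)?);
            }
            let n_subscripts = read_int32(input)?;
            let mut subscripts = Vec::new();
            for _ in 0..n_subscripts {
                subscripts.push(read_string(input)?);
            }
            Some(ValueModPrefix { refs, subscripts })
        }
        _ => None,
    }
}
```

The caller passes a mutable slice reference; on success the slice has advanced past the fields that were read, so the version-specific remainder can be parsed from the same cursor.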
+ +The `subscripts`, if present, are strings to append to the main text +of the Value, as subscripts. Each subscript text is a brief indicator, +e.g. `a` or `b`, with its meaning indicated by the table caption. When +multiple subscripts are present, they are displayed separated by commas. + +The `id` inside the `TemplateString`, if present, is a template string +for substitutions using the syntax explained previously. It appears +to be an English-language version of the localized template string in +the Value in which the `TemplateString` is nested. A writer may safely omit +the optional fixed data in `TemplateString`. + +`FontStyle` and `CellStyle`, if present, change the style for this +individual Value. In `FontStyle`, `bold`, `italic`, and `underline` +control the particular style. `show` is ordinarily 1; if it is 0, then +the cell data is not shown. `fg-color` and `bg-color` are strings in +the format `#rrggbb`, e.g. `#ff0000` for red or `#ffffff` for white. +The empty string is occasionally observed also. The `size` is a font +size in units of 1/128 inch. + +In `CellStyle`, `halign` is 0 for center, 2 for left, 4 for right, 6 +for decimal, 0xffffffad for mixed. For decimal alignment, +`decimal-offset` is the decimal point's offset from the right side of +the cell, in [pt](#pt). `valign` specifies vertical alignment: 0 for +center, 1 for top, 3 for bottom. `left-margin`, `right-margin`, +`top-margin`, and `bottom-margin` are in pt. + diff --git a/rust/doc/src/spv/structure/index.md b/rust/doc/src/spv/structure.md similarity index 100% rename from rust/doc/src/spv/structure/index.md rename to rust/doc/src/spv/structure.md -- 2.30.2