X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=spv-file-format.texi;h=223b2ef90fef85544c5fca6093913d78a4fac271;hb=7501fe1209b07d18977d8a9249036f2f3a34014d;hp=38effad71ff460b2978ea65eea991eea7af992cd;hpb=6a45803579c7f58e62668306240784beec7a611a;p=pspp diff --git a/spv-file-format.texi b/spv-file-format.texi index 38effad71f..223b2ef90f 100644 --- a/spv-file-format.texi +++ b/spv-file-format.texi @@ -55,10 +55,12 @@ Same format used for tables, with a different name. The structure of a chart plus its data. Charts do not have a ``light'' format. -@item @var{prefix}_model.xml -@itemx @var{prefix}_pmml.xml -@itemx @var{prefix}_stats.xml +@item @var{prefix}_model.scf +@itemx @var{prefix}_pmml.scf Not yet investigated. The corpus contains only one example of each. + +@item @var{prefix}_stats.xml +Not yet investigated. The corpus contains few examples. @end table The @file{@var{prefix}} in the names of the detail members is @@ -172,7 +174,7 @@ describes what it labels, often by naming the statistical procedure that was executed, e.g.@: ``Frequencies'' or ``T-Test''. Labels are often very generic, especially within a @code{container}, e.g.@: ``Title'' or ``Warnings'' or ``Notes''. Label text is localized -according to the output language, e.g. in Italian a frequency table +according to the output language, e.g.@: in Italian a frequency table procedure is labeled ``Frequenze''. The corpus contains one example of an empty label, one that contains @@ -374,10 +376,10 @@ table-id := int @code{header} includes @code{version}, a version number that affects the interpretation of some of the other data in the member. We will -refer to ``version 1'' and ``version 3'' members later on. It also -@code{table-id} is a binary version of @code{tableId} attribute in the -structure member that refers to the detail member. For example, if -@code{tableId} is @code{-4154297861994971133}, then @code{table-id} +refer to ``version 1'' and ``version 3'' members later on. 
+@code{table-id} is a binary version of the @code{tableId} attribute in +the structure member that refers to the detail member. For example, +if @code{tableId} is @code{-4154297861994971133}, then @code{table-id} would be 0xdca00003. The meaning of the other variable parts of the header is not known. @@ -403,6 +405,8 @@ styles := 00 font*8 int[n-ccs] string*[n-ccs] /* @r{custom currency} */ styles2 +x2 := 00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00 00 /* @r{18 bytes} */ + styles2 := i0 /* @r{version 1} */ styles2 := count(count(x5) count(x6)) /* @r{version 3} */ x5 := byte*33 int[n] int*n @@ -415,7 +419,7 @@ x6 := 01 00 (03 | 04) 00 00 00 byte*8 01 (string[dataset] string[datafile] i0 int i0)? int[n-ccs] string*[n-ccs] - 2e (00 | 01) + 2e (00 | 01) (i2000000 i0)? @end example In every example in the corpus, @code{x1} is 240. The meaning of the @@ -427,10 +431,10 @@ follow it are @code{00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00 In every example in the corpus for version 1, @code{x3} is 16 and the bytes that follow it are @code{00 00 00 01 00 00 00 01 00 00 00 00 01 -01 01 01}. In version 3, observed @code{x3} varies from 117 to 150 and -the bytes that follow it vary somewhat and often include a readable -text string, e.g. ``Default'' or ``Academic'', which appears to be the -name of a ``TableLook''. +01 01 01}. In version 3, observed @code{x3} varies from 117 to 150, +and its bytes include a 1-byte count at offset 0x34. When the count +is nonzero, a text string of that length at offset 0x35 is the name of +a ``TableLook'', e.g. ``Default'' or ``Academic''. Observed values of @code{x4} vary from 0 to 17. Out of 7060 examples in the corpus, it is nonzero only 36 times. @@ -447,9 +451,9 @@ are @samp{.} and @samp{,}. @samp{,}, @samp{.}, @samp{'}, @samp{ }, and zero (presumably indicating that digits should not be grouped). -@code{x5} is observed as either 0 or 5. When it is 5, the following -strings are CCA through CCE format strings. 
Most commonly these are -all @code{-,,,} but other strings occur. +@code{n-ccs} is observed as either 0 or 5. When it is 5, the +following strings are CCA through CCE format strings. Most commonly +these are all @code{-,,,} but other strings occur. @example font := byte[index] 31 string[typeface] @@ -617,32 +621,241 @@ index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}. @example value := 00? 00? 00? 00? raw-value -raw-value := 01 opt-value int32[format] double - | 02 opt-value int32[format] double string[varname] string[vallab] - (01 | 02 | 03) - | 03 string[local] opt-value string[id] string[c] (00 | 01) - | 04 opt-value int32[format] string[vallab] string[varname] - (01 | 02 | 03) string[vallab] - | 05 opt-value string[varname] string[varlabel] (01 | 02 | 03) - | opt-value string[format] int32[n-substs] substitution*[n-substs] -substitution := i0 value - | int32[x] value*[x + 1] /* @r{x > 0} */ -opt-value := 31 i0 (i0 | i1 string) opt-value-i0-v1 /* @r{version 1} */ - | 31 i0 (i0 | i1 string) opt-value-i0-v3 /* @r{version 3} */ - | 31 i1 int32[footnote-number] nested-string - | 31 i2 (00 | 02) 00 (i1 | i2 | i3) nested-string - | 31 i3 00 00 01 00 i2 nested-string - | 58 -opt-value-i0-v1 := 00 (i1 | i2) 00 00 int32 00 00 -opt-value-i0-v3 := count(counted-string - (58 - | 31 01? 00? 00? 00? 01 - string[fgcolor] string[bgcolor] string[typeface] - byte) +raw-value := + 01 value-mod int32[format] double[x] + | 02 value-mod int32[format] double[x] + string[varname] string[vallab] (01 | 02 | 03) + | 03 string[local] value-mod string[id] string[c] (00 | 01)[type] + | 04 value-mod int32[format] string[vallab] string[varname] + (01 | 02 | 03) string[s] + | 05 value-mod string[varname] string[varlabel] (01 | 02 | 03) + | value-mod string[format] int32[n-args] arg*[n-args] +arg := + i0 value + | int32[x] i0 value*[x + 1] /* @r{x > 0} */ +@end example + +A @code{value} boils down to a number or a string. 
There are several +possibilities, which one can distinguish by the first nonzero byte in +the encoding: + +@table @code +@item 01 +The numeric value @code{x}, presented to the user formatted according +to @code{format}, which is in the format described for system files. +@xref{System File Output Formats}, for details. Most commonly +@code{format} has width 40 (the maximum). + +An @code{x} with the maximum negative double @code{-DBL_MAX} +represents the system-missing value SYSMIS. (HIGHEST and LOWEST have +not been observed.) @xref{System File Format}, for more about these +special values. + +@item 02 +Similar to @code{01}, with the additional information that @code{x} is +a value of variable @code{varname} and has value label @code{vallab}. +Both @code{varname} and @code{vallab} can be the empty string, the +latter very commonly. + +The meaning of the final byte is unknown. Possibly it is connected to +whether the value or the label should be displayed. + +@item 03 +A text string, in two forms: @code{c} is in English, and sometimes +abbreviated or obscure, and @code{local} is localized to the user's +locale. In an English-language locale, the two strings are often the +same, and in the cases where they differ, @code{local} is more +appropriate for a user interface, e.g.@: @code{c} of ``Not a PxP table +for MCN...'' versus @code{local} of ``Computed only for a PxP table, +where P must be greater than 1.'' + +@code{c} and @code{local} are always either both empty or both +nonempty. + +@code{id} is a brief identifying string whose form seems to resemble a +programming language identifier, e.g.@: @code{cumulative_percent} or +@code{factor_14}. It is not unique. + +@code{type} is 00 for text taken from user input, such as syntax +fragment, expressions, file names, data set names, and 01 for fixed +text strings such as names of procedures or statistics. In the former +case, @code{id} is always the empty string; in the latter case, +@code{id} is still sometimes empty. 
+ +@item 04 +The string value @code{s}, presented to the user formatted according +to @code{format}. The format for a string is not too interesting, and +clearly invalid formats like A16.39 or A255.127 or A134.1 abound in +the corpus, so readers should probably ignore the format entirely. + +@code{s} is a value of variable @code{varname} and has value label +@code{vallab}. @code{varname} is never empty but @code{vallab} is +commonly empty. + +The meaning of the final byte is unknown. + +@item 05 +Variable @code{varname}, which is rarely observed as empty in the +corpus, with variable label @code{varlabel}, which is often empty. + +The meaning of the final byte is unknown. + +@item 31 +@itemx 58 +(These bytes begin a @code{value-mod}.) A format string, analogous to +@code{printf}, followed by one or more arguments, each of which has +one or more values. The format string uses the following syntax: + +@table @code +@item \% +@item \: +@item \[ +@item \] +Each of these expands to the character following @samp{\\}. This is +useful to escape characters that have special meaning in format +strings. These are effective inside and outside the @code{[@dots{}]} +syntax forms described below. + +@item \n +Expands to a new-line, inside or outside the @code{[@dots{}]} forms +described below. + +@item ^@var{i} +Expands to a formatted version of argument @var{i}, which must have +only a single value. For example, @code{^1} would expand to the first +argument's @code{value}. + +@item [:@var{a}:]@var{i} +Expands @var{a} for each of the @code{value}s in @var{i}. @var{a} +should contain one or more @code{^@var{j}} conversions, which are +drawn from the values for argument @var{i} in order. Some examples +from the corpus: + +@table @code +@item [:^1:]1 +All of the values for the first argument, concatenated. + +@item [:^1\n:]1 +Expands to the values for the first argument, each followed by +a new-line. 
+ +@item [:^1 = ^2:]2 +Expands to @code{@var{x} = @var{y}} where @var{x} is the second +argument's first value and @var{y} is its second value. (This would +be used only if the argument has two values. With additional values, +the second and third values would be directly concatenated, which +would look funny.) +@end table + +@item [@var{a}:@var{b}:]@var{i} +This extends the previous form so that the first values are expanded +using @var{a} and later values are expanded using @var{b}. For an +unknown reason, within @var{a} the @code{^@var{j}} conversions are +instead written as @code{%@var{j}}. Some examples from the corpus: + +@table @code +@item [%1:*^1:]1 +Expands to all of the values for the first argument, separated by +@samp{*}. + +@item [%1 = %2:, ^1 = ^2:]1 +Given appropriate values for the first argument, expands to @code{X = +1, Y = 2, Z = 3}. + +@item [%1:, ^1:]1 +Given appropriate values, expands to @code{1, 2, 3}. +@end table +@end table + +The format string is localized to the user's locale. +@end table + +@example +value-mod := + 31 i0 (i0 | i1 string[subscript]) value-mod-i0-v1 /* @r{version 1} */ + | 31 i0 (i0 | i1 string[subscript]) value-mod-i0-v3 /* @r{version 3} */ + | 31 i1 int32[footnote-number] format + | 31 i2 (00 | 01 | 02) 00 (i1 | i2 | i3) format + | 31 i3 00 00 01 00 i2 format + | 58 +value-mod-i0-v1 := 00 (i1 | i2) 00 00 int32 00 00 +value-mod-i0-v3 := count(format-string + (58 | 31 style) (58 | 31 i0 i0 i0 i0 01 00 (01 | 02 | 08) 00 08 00 0a 00)) -nested-string := 00 00 count(counted-string 58 58) -counted-string := count((i0 (58 | 31 string))?) +style := 01? 00? 00? 00? 01 string[fgcolor] string[bgcolor] string[font] byte +format := 00 00 count(format-string (58 | 31 style) 58) +format-string := count((i0 (58 | 31 string))?) +@end example + +A @code{value-mod} can specify special modifications to a @code{value}: + +@itemize @bullet +@item +The @code{footnote-number}, if present, specifies a footnote that the +@code{value} references. 
The footnote's marker is shown appended to +the main text of the @code{value}, as a superscript. + +@item +The @code{subscript}, if present, specifies a string to append to the +main text of the @code{value}, as a subscript. The subscript text is +normally a brief indicator, e.g.@: @samp{a} or @samp{a,b}, with its +meaning indicated by the table caption. In this usage, subscripts are +similar to footnotes; one apparent difference is that a @code{value} +can only reference one footnote but a subscript can list more than one +letter. + +@item +The @code{format}, if present, is a format string for substitutions +using the syntax explained previously. It appears to be an +English-language version of the localized format string in the +@code{value} in which the @code{format} is nested. + +@item +The @code{style}, if present, changes the style for this individual +@code{value}. +@end itemize + +@node SPV Legacy Detail Member Binary Format +@subsection SPV Legacy Detail Member Binary Format + +A legacy detail member's binary file has a much simpler format than +the light member binary format. + +@example +legacy-member := 00 byte[version] int16[n-sources] int32[file-size] + metadata*[n-sources] data*[n-sources] +@end example + +@code{version} is a version number that affects the interpretation of +some of the other data in the member. Versions 0xaf and 0xb0 are +known. We will refer to ``version 0xaf'' and ``version 0xb0'' members +later on. + +A legacy member consists of @code{n-sources} data sources, each of +which has @code{metadata} and @code{data}. + +@code{file-size} is the size of the file, in bytes. + +@example +metadata := int32[per-series] int32[n-series] int32[offset] source-name +source-name := byte*[32] /* @r{version 0xaf} */ +source-name := byte*[64] int32 /* @r{version 0xb0} */ +@end example + +A data source consists of @code{n-series} series of data, with +@code{per-series} data items per series. 
Depending on the version, +@code{source-name} is a 32- or 64-byte string padded on the right with +zero bytes. The names that appear in the corpus are very generic, +usually @code{tableData} or @code{source0}. The @code{offset} is the +offset, in bytes, from the beginning of the file to the start of this +data source's @code{data}. + +The meaning of the number in version 0xb0 @code{source-name} is +unknown. + +@example + @end example
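The @code{legacy-member} and @code{metadata} grammar above can be read with a few fixed-width unpacks. The following is a minimal sketch, not PSPP's implementation: the function name @code{parse_legacy_header} and the returned dictionary keys are hypothetical, and it assumes little-endian, unsigned integer fields, which this section does not state. It stops before @code{data}, whose layout is not described here.

```python
import struct


def parse_legacy_header(buf: bytes) -> dict:
    """Parse the header and per-source metadata of a legacy detail member.

    Assumptions (not stated in the text above): integers are little-endian
    and are read as unsigned values.
    """
    if len(buf) < 8 or buf[0] != 0x00:
        raise ValueError("not a legacy member: missing leading 00 byte")

    # legacy-member := 00 byte[version] int16[n-sources] int32[file-size] ...
    version, n_sources, file_size = struct.unpack_from("<BHI", buf, 1)
    if version not in (0xAF, 0xB0):
        raise ValueError(f"unknown legacy member version {version:#x}")

    # source-name is 32 bytes in version 0xaf, 64 bytes plus an int32 in 0xb0.
    name_len = 32 if version == 0xAF else 64

    pos = 8
    sources = []
    for _ in range(n_sources):
        # metadata := int32[per-series] int32[n-series] int32[offset] source-name
        per_series, n_series, offset = struct.unpack_from("<III", buf, pos)
        pos += 12

        raw_name = buf[pos:pos + name_len]
        if len(raw_name) != name_len:
            raise ValueError("truncated source-name")
        pos += name_len

        extra = None
        if version == 0xB0:
            # The meaning of this trailing number is unknown; keep it as-is.
            (extra,) = struct.unpack_from("<I", buf, pos)
            pos += 4

        sources.append({
            "name": raw_name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "per_series": per_series,
            "n_series": n_series,
            "offset": offset,
            "extra": extra,
        })

    return {"version": version, "file_size": file_size, "sources": sources}
```

A reader could then seek to each source's @code{offset} to locate its @code{data}, and compare @code{file_size} against the actual member length as a sanity check.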