   1 @node System File Format
   2 @appendix System File Format
   3
   4 A system file encapsulates a set of cases and dictionary information
   5 that describes how they may be interpreted.  This chapter describes
   6 the format of a system file.
   7
System files use four data types: 8-bit characters, 32-bit integers,
64-bit integers, and 64-bit floating-point numbers, called here
@code{char}, @code{int32}, @code{int64}, and @code{flt64},
respectively.  Data is not necessarily aligned on a word
  13 or double-word boundary: the long variable name record (@pxref{Long
  14 Variable Names Record}) and very long string records (@pxref{Very Long
  15 String Record}) have arbitrary byte length and can therefore cause all
  16 data coming after them in the file to be misaligned.
  17
  18 Integer data in system files may be big-endian or little-endian.  A
  19 reader may detect the endianness of a system file by examining
  20 @code{layout_code} in the file header record
  21 (@pxref{layout_code,,@code{layout_code}}).
  22
  23 Floating-point data in system files may nominally be in IEEE 754, IBM,
  24 or VAX formats.  A reader may detect the floating-point format in use
  25 by examining @code{bias} in the file header record
  26 (@pxref{bias,,@code{bias}}).
  27
  28 PSPP detects big-endian and little-endian integer formats in system
  29 files and translates as necessary.  PSPP also detects the
  30 floating-point format in use, as well as the endianness of IEEE 754
  31 floating-point numbers, and translates as needed.  However, only IEEE
  32 754 numbers with the same endianness as integer data in the same file
  33 have actually been observed in system files, and it is likely that
  34 other formats are obsolete or were never used.
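
The detection steps above can be sketched in Python as follows.  This
is only an illustration, not PSPP's implementation: the function names
are hypothetical, and the byte offsets are derived from the field sizes
listed for the file header record.

```python
import struct

LAYOUT_OFFSET = 64   # after rec_type[4] and prod_name[60]
BIAS_OFFSET = 84     # after five int32 fields that start at offset 64

def detect_byte_order(header):
    """Return '>' (big-endian) or '<' (little-endian), or None if unclear."""
    raw = header[LAYOUT_OFFSET:LAYOUT_OFFSET + 4]
    for order in (">", "<"):
        (layout_code,) = struct.unpack(order + "i", raw)
        if layout_code in (2, 3):   # the only values seen in practice
            return order
    return None

def bias_is_ieee(header, order):
    """True if bias decodes to 100.0 as an IEEE 754 double in this order."""
    (bias,) = struct.unpack_from(order + "d", header, BIAS_OFFSET)
    return bias == 100.0
```

A reader that gets @code{None} from the first function, or false from
the second, would fall back on its own defaults.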
  35
  36 System files use a few floating point values for special purposes:
  37
  38 @table @asis
  39 @item SYSMIS
  40 The system-missing value is represented by the largest possible
  41 negative number in the floating point format (@code{-DBL_MAX}).
  42
  43 @item HIGHEST
  44 HIGHEST is used as the high end of a missing value range with an
  45 unbounded maximum.  It is represented by the largest possible positive
  46 number (@code{DBL_MAX}).
  47
  48 @item LOWEST
  49 LOWEST is used as the low end of a missing value range with an
  50 unbounded minimum.  It was originally represented by the
  51 second-largest negative number (in IEEE 754 format,
  52 @code{0xffeffffffffffffe}).  System files written by SPSS 21 and later
  53 instead use the largest negative number (@code{-DBL_MAX}), the same
  54 value as SYSMIS.  This does not lead to ambiguity because LOWEST
  55 appears in system files only in missing value ranges, which never
  56 contain SYSMIS.
  57 @end table
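
For IEEE 754 files, the special values can be written down as a short
Python sketch.  The constant and function names are hypothetical; the
bit pattern for the older LOWEST value is the one given above.

```python
import struct
import sys

SYSMIS = -sys.float_info.max     # -DBL_MAX
HIGHEST = sys.float_info.max     # DBL_MAX
# Older LOWEST: the double adjacent to -DBL_MAX, built from its bit pattern.
(LOWEST_OLD,) = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))

def is_sysmis(x):
    """True if x is the system-missing value (for values outside ranges)."""
    return x == SYSMIS

def is_lowest(x):
    """Accept both encodings of LOWEST; valid only inside a missing value range."""
    return x == LOWEST_OLD or x == SYSMIS
```

Because @code{is_lowest} also accepts @code{-DBL_MAX}, it should be
applied only to the low end of a missing value range.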
  58
  59 System files may use most character encodings based on an 8-bit unit.
  60 UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
@code{rec_type} in the file header record is sufficient to distinguish
between ASCII-based and EBCDIC-based encodings.  The best way to determine
  63 the specific encoding in use is to consult the character encoding
  64 record (@pxref{Character Encoding Record}), if present, and failing
  65 that the @code{character_code} in the machine integer info record
  66 (@pxref{Machine Integer Info Record}).  The same encoding should be
  67 used for the dictionary and the data in the file, although it is
  68 possible to artificially synthesize files that use different encodings
  69 (@pxref{Character Encoding Record}).
  70
  71 @menu
  72 * System File Record Structure::
  73 * File Header Record::
  74 * Variable Record::
  75 * Value Labels Records::
  76 * Document Record::
  77 * Machine Integer Info Record::
  78 * Machine Floating-Point Info Record::
  79 * Multiple Response Sets Records::
  80 * Extra Product Info Record::
  81 * Variable Display Parameter Record::
  82 * Long Variable Names Record::
  83 * Very Long String Record::
  84 * Character Encoding Record::
  85 * Long String Value Labels Record::
  86 * Long String Missing Values Record::
  87 * Data File and Variable Attributes Records::
  88 * Extended Number of Cases Record::
  89 * Other Informational Records::
  90 * Dictionary Termination Record::
  91 * Data Record::
  92 @end menu
  93
  94 @node System File Record Structure
  95 @section System File Record Structure
  96
  97 System files are divided into records with the following format:
  98
  99 @example
 100 int32               type;
 101 char                data[];
 102 @end example
 103
 104 This header does not identify the length of the @code{data} or any
 105 information about what it contains, so the system file reader must
 106 understand the format of @code{data} based on @code{type}.  However,
 107 records with type 7, called @dfn{extension records}, have a stricter
 108 format:
 109
 110 @example
 111 int32               type;
 112 int32               subtype;
 113 int32               size;
 114 int32               count;
 115 char                data[size * count];
 116 @end example
 117
 118 @table @code
@item int32 type;
 120 Record type.  Always set to 7.
 121
 122 @item int32 subtype;
 123 Record subtype.  This value identifies a particular kind of extension
 124 record.
 125
 126 @item int32 size;
 127 The size of each piece of data that follows the header, in bytes.
 128 Known extension records use 1, 4, or 8, for @code{char}, @code{int32},
 129 and @code{flt64} format data, respectively.
 130
 131 @item int32 count;
 132 The number of pieces of data that follow the header.
 133
 134 @item char data[size * count];
 135 Data, whose format and interpretation depend on the subtype.
 136 @end table
 137
 138 An extension record contains exactly @code{size * count} bytes of
 139 data, which allows a reader that does not understand an extension
 140 record to skip it.  Extension records provide only nonessential
 141 information, so this allows for files written by newer software to
 142 preserve backward compatibility with older or less capable readers.
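
Skipping an unrecognized extension record might look like the
following Python sketch.  It assumes that the @code{type} field has
already been read and found to be 7, and that the byte order is known;
the function names are hypothetical.

```python
import io
import struct

def read_extension_header(f, order="<"):
    """Read subtype, size and count; the type field (7) is already consumed."""
    subtype, size, count = struct.unpack(order + "3i", f.read(12))
    return subtype, size, count

def skip_extension_data(f, size, count):
    """Skip the payload of an extension record the reader does not understand."""
    if size < 0 or count < 0:
        raise ValueError("corrupt extension record")
    f.seek(size * count, io.SEEK_CUR)
```

The sign check is a precaution of this sketch, since corrupted files
have been observed.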
 143
 144 Records in a system file must appear in the following order:
 145
 146 @itemize @bullet
 147 @item
 148 File header record.
 149
 150 @item
 151 Variable records.
 152
 153 @item
 154 All pairs of value labels records and value label variables records,
 155 if present.
 156
 157 @item
 158 Document record, if present.
 159
 160 @item
 161 Extension (type 7) records, in ascending numerical order of their
 162 subtypes.
 163
 164 @item
 165 Dictionary termination record.
 166
 167 @item
 168 Data record.
 169 @end itemize
 170
 171 We advise authors of programs that read system files to tolerate
 172 format variations.  Various kinds of misformatting and corruption have
 173 been observed in system files written by SPSS and other software
 174 alike.  In particular, because extension records provide nonessential
 175 information, it is generally better to ignore an extension record
 176 entirely than to refuse to read a system file.
 177
 178 The following sections describe the known kinds of records.
 179
 180 @node File Header Record
 181 @section File Header Record
 182
 183 A system file begins with the file header, with the following format:
 184
 185 @example
 186 char                rec_type[4];
 187 char                prod_name[60];
 188 int32               layout_code;
 189 int32               nominal_case_size;
 190 int32               compression;
 191 int32               weight_index;
 192 int32               ncases;
 193 flt64               bias;
 194 char                creation_date[9];
 195 char                creation_time[8];
 196 char                file_label[64];
 197 char                padding[3];
 198 @end example
 199
 200 @table @code
 201 @item char rec_type[4];
 202 Record type code, either @samp{$FL2} for system files with
 203 uncompressed data or data compressed with simple bytecode compression,
 204 or @samp{$FL3} for system files with ZLIB compressed data.
 205
This is truly a character field that uses the same character encoding
as other strings.  Thus, in a file with an ASCII-based character encoding
 208 this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
 209 file with an EBCDIC-based encoding this field contains @code{5b c6 d3
 210 f2}.  (No EBCDIC-based ZLIB-compressed files have been observed.)
 211
 212 @item char prod_name[60];
 213 Product identification string.  This always begins with the characters
 214 @samp{@@(#) SPSS DATA FILE}.  PSPP uses the remaining characters to
 215 give its version and the operating system name; for example, @samp{GNU
 216 pspp 0.1.4 - sparc-sun-solaris2.5.2}.  The string is truncated if it
 217 would be longer than 60 characters; otherwise it is padded on the right
 218 with spaces.
 219
 220 @anchor{layout_code}
 221 @item int32 layout_code;
 222 Normally set to 2, although a few system files have been spotted in
the wild with a value of 3 here.  PSPP uses this value to determine the
 224 file's integer endianness (@pxref{System File Format}).
 225
 226 @item int32 nominal_case_size;
 227 Number of data elements per case.  This is the number of variables,
 228 except that long string variables add extra data elements (one for every
 229 8 characters after the first 8).  However, string variables do not
contribute to this value beyond the first 255 bytes.  Further, system
 231 files written by some systems set this value to -1.  In general, it is
 232 unsafe for systems reading system files to rely upon this value.
 233
 234 @item int32 compression;
 235 Set to 0 if the data in the file is not compressed, 1 if the data is
 236 compressed with simple bytecode compression, 2 if the data is ZLIB
 237 compressed.  This field has value 2 if and only if @code{rec_type} is
 238 @samp{$FL3}.
 239
 240 @item int32 weight_index;
 241 If one of the variables in the data set is used as a weighting
 242 variable, set to the dictionary index of that variable, plus 1
 243 (@pxref{Dictionary Index}).  Otherwise, set to 0.
 244
 245 @item int32 ncases;
 246 Set to the number of cases in the file if it is known, or -1 otherwise.
 247
In the general case it is not possible to determine the number of cases
that will be output to a system file at the time that the header is
written.  A writer can deal with this by writing the entire
system file, including the header, then seeking back to the beginning of
the file and rewriting just the @code{ncases} field.  For output on which
seeking is not valid, the seek operation fails and
@code{ncases} remains -1.
 255
 256 @anchor{bias}
 257 @item flt64 bias;
 258 Compression bias, ordinarily set to 100.  Only integers between
 259 @code{1 - bias} and @code{251 - bias} can be compressed.
 260
 261 By assuming that its value is 100, PSPP uses @code{bias} to determine
 262 the file's floating-point format and endianness (@pxref{System File
 263 Format}).  If the compression bias is not 100, PSPP cannot auto-detect
 264 the floating-point format and assumes that it is IEEE 754 format with
 265 the same endianness as the system file's integers, which is correct
 266 for all known system files.
 267
 268 @item char creation_date[9];
 269 Date of creation of the system file, in @samp{dd mmm yy}
 270 format, with the month as standard English abbreviations, using an
 271 initial capital letter and following with lowercase.  If the date is not
 272 available then this field is arbitrarily set to @samp{01 Jan 70}.
 273
 274 @item char creation_time[8];
 275 Time of creation of the system file, in @samp{hh:mm:ss}
 276 format and using 24-hour time.  If the time is not available then this
 277 field is arbitrarily set to @samp{00:00:00}.
 278
 279 @item char file_label[64];
 280 File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
 281 PSPP Users Guide}).  Padded on the right with spaces.
 282
 283 A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
 284 CR-only line ends in this field, rather than the more usual LF-only or
 285 CR LF line ends.
 286
 287 @item char padding[3];
 288 Ignored padding bytes to make the structure a multiple of 32 bits in
 289 length.  Set to zeros.
 290 @end table
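
As an illustration, the 176-byte header can be unpacked in one call
with Python's @code{struct} module.  This is a sketch only: the
@code{FileHeader} tuple and @code{parse_file_header} are hypothetical
names, the byte order is assumed to be known already, and strings are
left as undecoded bytes.

```python
import struct
from collections import namedtuple

FileHeader = namedtuple(
    "FileHeader",
    "rec_type prod_name layout_code nominal_case_size compression "
    "weight_index ncases bias creation_date creation_time file_label")

HEADER_SIZE = 176

def parse_file_header(data, order="<"):
    """Unpack the file header; the 3 padding bytes are skipped."""
    fmt = order + "4s60s5id9s8s64s3x"
    if struct.calcsize(fmt) != HEADER_SIZE or len(data) < HEADER_SIZE:
        raise ValueError("truncated file header")
    return FileHeader(*struct.unpack(fmt, data[:HEADER_SIZE]))
```

An explicit byte-order prefix makes @code{struct} pack fields without
alignment padding, which matches the unaligned @code{bias} field at
offset 84.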
 291
 292 @node Variable Record
 293 @section Variable Record
 294
 295 There must be one variable record for each numeric variable and each
 296 string variable with width 8 bytes or less.  String variables wider
 297 than 8 bytes have one variable record for each 8 bytes, rounding up.
 298 The first variable record for a long string specifies the variable's
 299 correct dictionary information.  Subsequent variable records for a
 300 long string are filled with dummy information: a type of -1, no
 301 variable label or missing values, print and write formats that are
 302 ignored, and an empty string as name.  A few system files have been
 303 encountered that include a variable label on dummy variable records,
 304 so readers should take care to parse dummy variable records in the
 305 same way as other variable records.
 306
 307 @anchor{Dictionary Index}
 308 The @dfn{dictionary index} of a variable is its offset in the set of
 309 variable records, including dummy variable records for long string
 310 variables.  The first variable record has a dictionary index of 0, the
 311 second has a dictionary index of 1, and so on.
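
The relationship between variables and dictionary indexes can be
sketched as follows.  The function names are hypothetical, and widths
are assumed to be 0 for numeric variables and 1 to 255 for strings, as
they appear in variable records.

```python
def n_variable_records(width):
    """Variable records used by one variable; width 0 means numeric."""
    if width == 0:
        return 1
    return (width + 7) // 8      # one record per 8 bytes, rounding up

def dictionary_indexes(widths):
    """Return the 0-based dictionary index of each variable's first record."""
    indexes = []
    next_index = 0
    for width in widths:
        indexes.append(next_index)
        next_index += n_variable_records(width)
    return indexes
```

For example, a numeric variable followed by a 20-byte string occupies
dictionary indexes 0 through 3, so the next variable has index 4.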
 312
 313 The system file format does not directly support string variables
 314 wider than 255 bytes.  Such very long string variables are represented
 315 by a number of narrower string variables.  @xref{Very Long String
 316 Record}, for details.
 317
 318 A system file should contain at least one variable and thus at least
 319 one variable record, but system files have been observed in the wild
 320 without any variables (thus, no data either).
 321
 322 @example
 323 int32               rec_type;
 324 int32               type;
 325 int32               has_var_label;
 326 int32               n_missing_values;
 327 int32               print;
 328 int32               write;
 329 char                name[8];
 330
 331 /* @r{Present only if @code{has_var_label} is 1.} */
 332 int32               label_len;
 333 char                label[];
 334
 335 /* @r{Present only if @code{n_missing_values} is nonzero}. */
 336 flt64               missing_values[];
 337 @end example
 338
 339 @table @code
 340 @item int32 rec_type;
 341 Record type code.  Always set to 2.
 342
 343 @item int32 type;
 344 Variable type code.  Set to 0 for a numeric variable.  For a short
 345 string variable or the first part of a long string variable, this is set
 346 to the width of the string.  For the second and subsequent parts of a
 347 long string variable, set to -1, and the remaining fields in the
 348 structure are ignored.
 349
 350 @item int32 has_var_label;
 351 If this variable has a variable label, set to 1; otherwise, set to 0.
 352
 353 @item int32 n_missing_values;
If the variable has no missing values, set to 0.  If the variable has
one, two, or three discrete missing values, set to 1, 2, or 3,
respectively.  If the variable has a range of missing values, set to
-2; if the variable has a range of missing values plus a single
discrete value, set to -3.
 359
 360 A long string variable always has the value 0 here.  A separate record
 361 indicates missing values for long string variables (@pxref{Long String
 362 Missing Values Record}).
 363
 364 @item int32 print;
 365 Print format for this variable.  See below.
 366
 367 @item int32 write;
 368 Write format for this variable.  See below.
 369
 370 @item char name[8];
 371 Variable name.  The variable name must begin with a capital letter or
 372 the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
 373 (@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
 374 stops (@samp{.}).  The variable name is padded on the right with spaces.
 375
 376 The @samp{name} fields should be unique within a system file.  System
 377 files written by SPSS that contain very long string variables with
 378 similar names sometimes contain duplicate names that are later
 379 eliminated by resolving the very long string names (@pxref{Very Long
 380 String Record}).  PSPP handles duplicates by assigning them new,
 381 unique names.
 382
 383 @item int32 label_len;
 384 This field is present only if @code{has_var_label} is set to 1.  It is
 385 set to the length, in characters, of the variable label.  The
 386 documented maximum length varies from 120 to 255 based on SPSS
 387 version, but some files have been seen with longer labels.  PSPP
 388 accepts labels of any length.
 389
 390 @item char label[];
 391 This field is present only if @code{has_var_label} is set to 1.  It has
 392 length @code{label_len}, rounded up to the nearest multiple of 32 bits.
 393 The first @code{label_len} characters are the variable's variable label.
 394
 395 @item flt64 missing_values[];
 396 This field is present only if @code{n_missing_values} is nonzero.  It
 397 has the same number of 8-byte elements as the absolute value of
 398 @code{n_missing_values}.  Each element is interpreted as a number for
 399 numeric variables (with HIGHEST and LOWEST indicated as described in
 400 the chapter introduction).  For string variables of width less than 8
 401 bytes, elements are right-padded with spaces; for string variables
 402 wider than 8 bytes, only the first 8 bytes of each missing value are
 403 specified, with the remainder implicitly all spaces.
 404
 405 For discrete missing values, each element represents one missing
 406 value.  When a range is present, the first element denotes the minimum
 407 value in the range, and the second element denotes the maximum value
 408 in the range.  When a range plus a value are present, the third
 409 element denotes the additional discrete missing value.
 410 @end table
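
The interpretation of @code{n_missing_values} for a numeric variable
can be sketched in Python as follows.  The function name and the shape
of its return value are choices of this sketch, not part of the format.

```python
def interpret_missing(n_missing_values, values):
    """Return (discrete values, range) where range is (low, high) or None."""
    if n_missing_values not in (0, 1, 2, 3, -2, -3):
        raise ValueError("unexpected n_missing_values")
    if abs(n_missing_values) != len(values):
        raise ValueError("wrong number of missing value elements")
    if n_missing_values >= 0:
        return list(values), None
    # Negative: the first two elements are the range, then an optional value.
    return list(values[2:]), (values[0], values[1])
```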
 411
 412 @anchor{System File Output Formats}
The @code{print} and @code{write} fields of the variable record are output
 414 formats coded into @code{int32} types.  The least-significant byte
 415 of the @code{int32} represents the number of decimal places, and the
 416 next two bytes in order of increasing significance represent field width
 417 and format type, respectively.  The most-significant byte is not
 418 used and should be set to zero.
 419
 420 Format types are defined as follows:
 421
 422 @quotation
 423 @multitable {Value} {@code{DATETIME}}
 424 @headitem Value
 425 @tab Meaning
 426 @item 0
 427 @tab Not used.
 428 @item 1
 429 @tab @code{A}
 430 @item 2
 431 @tab @code{AHEX}
 432 @item 3
 433 @tab @code{COMMA}
 434 @item 4
 435 @tab @code{DOLLAR}
 436 @item 5
 437 @tab @code{F}
 438 @item 6
 439 @tab @code{IB}
 440 @item 7
 441 @tab @code{PIBHEX}
 442 @item 8
 443 @tab @code{P}
 444 @item 9
 445 @tab @code{PIB}
 446 @item 10
 447 @tab @code{PK}
 448 @item 11
 449 @tab @code{RB}
 450 @item 12
 451 @tab @code{RBHEX}
 452 @item 13
 453 @tab Not used.
 454 @item 14
 455 @tab Not used.
 456 @item 15
 457 @tab @code{Z}
 458 @item 16
 459 @tab @code{N}
 460 @item 17
 461 @tab @code{E}
 462 @item 18
 463 @tab Not used.
 464 @item 19
 465 @tab Not used.
 466 @item 20
 467 @tab @code{DATE}
 468 @item 21
 469 @tab @code{TIME}
 470 @item 22
 471 @tab @code{DATETIME}
 472 @item 23
 473 @tab @code{ADATE}
 474 @item 24
 475 @tab @code{JDATE}
 476 @item 25
 477 @tab @code{DTIME}
 478 @item 26
 479 @tab @code{WKDAY}
 480 @item 27
 481 @tab @code{MONTH}
 482 @item 28
 483 @tab @code{MOYR}
 484 @item 29
 485 @tab @code{QYR}
 486 @item 30
 487 @tab @code{WKYR}
 488 @item 31
 489 @tab @code{PCT}
 490 @item 32
 491 @tab @code{DOT}
 492 @item 33
 493 @tab @code{CCA}
 494 @item 34
 495 @tab @code{CCB}
 496 @item 35
 497 @tab @code{CCC}
 498 @item 36
 499 @tab @code{CCD}
 500 @item 37
 501 @tab @code{CCE}
 502 @item 38
 503 @tab @code{EDATE}
 504 @item 39
 505 @tab @code{SDATE}
 506 @end multitable
 507 @end quotation
 508
 509 A few system files have been observed in the wild with invalid
 510 @code{write} fields, in particular with value 0.  Readers should
 511 probably treat invalid @code{print} or @code{write} fields as some
 512 default format.
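
The byte layout of the @code{print} and @code{write} fields described
above can be decoded with shifts and masks.  The helper names below are
hypothetical; the sketch assumes the @code{int32} has already been
converted to host byte order.

```python
def decode_format(spec):
    """Split a print/write int32 into (format type, width, decimals)."""
    decimals = spec & 0xFF
    width = (spec >> 8) & 0xFF
    fmt_type = (spec >> 16) & 0xFF
    return fmt_type, width, decimals

def encode_format(fmt_type, width, decimals):
    """Inverse of decode_format; the most-significant byte stays zero."""
    return (fmt_type << 16) | (width << 8) | decimals
```

For example, @code{0x050802} decodes to format type 5 (@code{F}), width
8, and 2 decimal places.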
 513
 514 @node Value Labels Records
 515 @section Value Labels Records
 516
 517 The value label records documented in this section are used for
 518 numeric and short string variables only.  Long string variables may
 519 have value labels, but their value labels are recorded using a
 520 different record type (@pxref{Long String Value Labels Record}).
 521
 522 The value label record has the following format:
 523
 524 @example
 525 int32               rec_type;
 526 int32               label_count;
 527
/* @r{Repeated @code{label_count} times}. */
 529 char                value[8];
 530 char                label_len;
 531 char                label[];
 532 @end example
 533
 534 @table @code
 535 @item int32 rec_type;
 536 Record type.  Always set to 3.
 537
 538 @item int32 label_count;
 539 Number of value labels present in this record.
 540 @end table
 541
The remaining fields are repeated @code{label_count} times.  Each
 543 repetition specifies one value label.
 544
 545 @table @code
 546 @item char value[8];
 547 A numeric value or a short string value padded as necessary to 8 bytes
 548 in length.  Its type and width cannot be determined until the
 549 following value label variables record (see below) is read.
 550
 551 @item char label_len;
 552 The label's length, in bytes.  The documented maximum length varies
 553 from 60 to 120 based on SPSS version.  PSPP supports value labels up
 554 to 255 bytes long.
 555
 556 @item char label[];
 557 @code{label_len} bytes of the actual label, followed by up to 7 bytes
 558 of padding to bring @code{label} and @code{label_len} together to a
 559 multiple of 8 bytes in length.
 560 @end table
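
The padding rule is easy to get wrong, so here is a Python sketch of
reading the repeated fields.  It assumes @code{data} holds the bytes
that follow @code{label_count}; the function name is hypothetical, and
values are returned as raw 8-byte strings because their type is not yet
known.

```python
def read_value_labels(data, label_count, offset=0):
    """Parse label_count (value, label) pairs starting at offset.

    Returns the pairs and the offset of the first byte after them.
    """
    labels = []
    for _ in range(label_count):
        value = data[offset:offset + 8]
        label_len = data[offset + 8]
        label = data[offset + 9:offset + 9 + label_len]
        # label_len (1 byte) plus label are padded together to a multiple of 8.
        padded = (1 + label_len + 7) // 8 * 8
        offset += 8 + padded
        labels.append((value, label))
    return labels, offset
```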
 561
 562 The value label record is always immediately followed by a value label
 563 variables record with the following format:
 564
 565 @example
 566 int32               rec_type;
 567 int32               var_count;
 568 int32               vars[];
 569 @end example
 570
 571 @table @code
 572 @item int32 rec_type;
 573 Record type.  Always set to 4.
 574
 575 @item int32 var_count;
Number of variables to which the associated value labels from the value
label record are to be applied.
 578
 579 @item int32 vars[];
 580 A list of dictionary indexes of variables to which to apply the value
 581 labels (@pxref{Dictionary Index}).  There are @code{var_count}
 582 elements.
 583
 584 String variables wider than 8 bytes may not be specified in this list.
 585 @end table
 586
 587 @node Document Record
 588 @section Document Record
 589
 590 The document record, if present, has the following format:
 591
 592 @example
 593 int32               rec_type;
 594 int32               n_lines;
 595 char                lines[][80];
 596 @end example
 597
 598 @table @code
 599 @item int32 rec_type;
 600 Record type.  Always set to 6.
 601
 602 @item int32 n_lines;
 603 Number of lines of documents present.
 604
 605 @item char lines[][80];
 606 Document lines.  The number of elements is defined by @code{n_lines}.
 607 Lines shorter than 80 characters are padded on the right with spaces.
 608 @end table
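
A sketch of splitting the document record into lines follows.  It
assumes @code{data} starts at @code{n_lines}, that is, just after
@code{rec_type}; the function name is hypothetical, and lines are
returned as undecoded bytes with the right padding removed.

```python
import struct

def parse_document_record(data, order="<"):
    """Return the document lines that follow rec_type (6)."""
    (n_lines,) = struct.unpack_from(order + "i", data, 0)
    lines = []
    for i in range(n_lines):
        start = 4 + 80 * i
        lines.append(data[start:start + 80].rstrip(b" "))
    return lines
```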
 609
 610 @node Machine Integer Info Record
 611 @section Machine Integer Info Record
 612
 613 The integer info record, if present, has the following format:
 614
 615 @example
 616 /* @r{Header.} */
 617 int32               rec_type;
 618 int32               subtype;
 619 int32               size;
 620 int32               count;
 621
 622 /* @r{Data.} */
 623 int32               version_major;
 624 int32               version_minor;
 625 int32               version_revision;
 626 int32               machine_code;
 627 int32               floating_point_rep;
 628 int32               compression_code;
 629 int32               endianness;
 630 int32               character_code;
 631 @end example
 632
 633 @table @code
 634 @item int32 rec_type;
 635 Record type.  Always set to 7.
 636
 637 @item int32 subtype;
 638 Record subtype.  Always set to 3.
 639
 640 @item int32 size;
 641 Size of each piece of data in the data part, in bytes.  Always set to 4.
 642
 643 @item int32 count;
 644 Number of pieces of data in the data part.  Always set to 8.
 645
 646 @item int32 version_major;
 647 PSPP major version number.  In version @var{x}.@var{y}.@var{z}, this
 648 is @var{x}.
 649
 650 @item int32 version_minor;
 651 PSPP minor version number.  In version @var{x}.@var{y}.@var{z}, this
 652 is @var{y}.
 653
 654 @item int32 version_revision;
 655 PSPP version revision number.  In version @var{x}.@var{y}.@var{z},
 656 this is @var{z}.
 657
 658 @item int32 machine_code;
Machine code.  PSPP always sets this field to -1, but other
values may appear.
 661
 662 @item int32 floating_point_rep;
 663 Floating point representation code.  For IEEE 754 systems this is 1.
 664 IBM 370 sets this to 2, and DEC VAX E to 3.
 665
 666 @item int32 compression_code;
 667 Compression code.  Always set to 1, regardless of whether or how the
 668 file is compressed.
 669
 670 @item int32 endianness;
 671 Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
 672
 673 @item int32 character_code;
 674 @anchor{character-code} Character code.  The following values have
 675 been actually observed in system files:
 676
 677 @table @asis
 678 @item 1
 679 EBCDIC.
 680
 681 @item 2
 682 7-bit ASCII.
 683
 684 @item 1250
 685 The @code{windows-1250} code page for Central European and Eastern
 686 European languages.
 687
 688 @item 1252
 689 The @code{windows-1252} code page for Western European languages.
 690
 691 @item 28591
 692 ISO 8859-1.
 693
 694 @item 65001
 695 UTF-8.
 696 @end table
 697
 698 The following additional values are known to be defined:
 699
 700 @table @asis
 701 @item 3
 702 8-bit ``ASCII''.
 703
 704 @item 4
 705 DEC Kanji.
 706 @end table
 707
 708 Other Windows code page numbers are known to be generally valid.
 709
 710 Old versions of SPSS for Unix and Windows always wrote value 2 in this
 711 field, regardless of the encoding in use.  Newer versions also write
 712 the character encoding as a string (see @ref{Character Encoding
 713 Record}).
 714 @end table
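
One possible way to turn @code{character_code} into a decoder name is
sketched below in Python.  The mapping to Python codec names and the
@code{cp}@var{number} fallback are assumptions of this sketch; codes 1,
3, and 4 are left unresolved because the text above does not identify a
specific code page for them.

```python
import codecs

KNOWN_CODES = dict([
    (2, "ascii"),        # old SPSS versions wrote 2 regardless of encoding
    (1250, "cp1250"),
    (1252, "cp1252"),
    (28591, "iso8859-1"),
    (65001, "utf-8"),
])

def codec_for_character_code(code):
    """Return a Python codec name for character_code, or None if unknown."""
    if code in (1, 3, 4):
        return None      # EBCDIC, 8-bit "ASCII", DEC Kanji: not resolved here
    if code in KNOWN_CODES:
        return KNOWN_CODES[code]
    try:
        # Many Windows code pages have a Python codec named "cp<number>".
        return codecs.lookup("cp" + str(code)).name
    except LookupError:
        return None
```

A reader would normally prefer the character encoding record, when
present, over the result of this lookup.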
 715
 716 @node Machine Floating-Point Info Record
 717 @section Machine Floating-Point Info Record
 718
 719 The floating-point info record, if present, has the following format:
 720
 721 @example
 722 /* @r{Header.} */
 723 int32               rec_type;
 724 int32               subtype;
 725 int32               size;
 726 int32               count;
 727
 728 /* @r{Data.} */
 729 flt64               sysmis;
 730 flt64               highest;
 731 flt64               lowest;
 732 @end example
 733
 734 @table @code
 735 @item int32 rec_type;
 736 Record type.  Always set to 7.
 737
 738 @item int32 subtype;
 739 Record subtype.  Always set to 4.
 740
 741 @item int32 size;
 742 Size of each piece of data in the data part, in bytes.  Always set to 8.
 743
 744 @item int32 count;
 745 Number of pieces of data in the data part.  Always set to 3.
 746
 747 @item flt64 sysmis;
 748 @itemx flt64 highest;
 749 @itemx flt64 lowest;
 750 The system missing value, the value used for HIGHEST in missing
 751 values, and the value used for LOWEST in missing values, respectively.
 752 @xref{System File Format}, for more information.
 753
 754 The SPSSWriter library in PHP, which identifies itself as @code{FOM
 755 SPSS 1.0.0} in the file header record @code{prod_name} field, writes
 756 unexpected values to these fields, but it uses the same values
 757 consistently throughout the rest of the file.
 758 @end table
 759
 760 @node Multiple Response Sets Records
 761 @section Multiple Response Sets Records
 762
 763 The system file format has two different types of records that
 764 represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
 765 Guide}).  The first type of record describes multiple response sets
 766 that can be understood by SPSS before version 14.  The second type of
 767 record, with a closely related format, is used for multiple dichotomy
 768 sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
 769 version 14.
 770
 771 @example
 772 /* @r{Header.} */
 773 int32               rec_type;
 774 int32               subtype;
 775 int32               size;
 776 int32               count;
 777
 778 /* @r{Exactly @code{count} bytes of data.} */
 779 char                mrsets[];
 780 @end example
 781
 782 @table @code
 783 @item int32 rec_type;
 784 Record type.  Always set to 7.
 785
 786 @item int32 subtype;
 787 Record subtype.  Set to 7 for records that describe multiple response
 788 sets understood by SPSS before version 14, or to 19 for records that
 789 describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
 790 feature added in version 14.
 791
 792 @item int32 size;
 793 The size of each element in the @code{mrsets} member. Always set to 1.
 794
 795 @item int32 count;
 796 The total number of bytes in @code{mrsets}.
 797
 798 @item char mrsets[];
 799 Zero or more line feeds (byte 0x0a), followed by a series of multiple
 800 response sets, each of which consists of the following:
 801
 802 @itemize @bullet
 803 @item
 804 The set's name (an identifier that begins with @samp{$}), in mixed
 805 upper and lower case.
 806
 807 @item
 808 An equals sign (@samp{=}).
 809
 810 @item
 811 @samp{C} for a multiple category set, @samp{D} for a multiple
 812 dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
 813 multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
 814
 815 @item
 816 For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
 817 space, followed by a number expressed as decimal digits, followed by a
 818 space.  If LABELSOURCE=VARLABEL was specified on MRSETS, then the
 819 number is 11; otherwise it is 1.@footnote{This part of the format may
 820 not be fully understood, because only a single example of each
 821 possibility has been examined.}
 822
 823 @item
 824 For either kind of multiple dichotomy set, the counted value, as a
 825 positive integer count specified as decimal digits, followed by a
 826 space, followed by as many string bytes as specified in the count.  If
 827 the set contains numeric variables, the string consists of the counted
 828 integer value expressed as decimal digits.  If the set contains string
 829 variables, the string contains the counted string value.  Either way,
 830 the string may be padded on the right with spaces (older versions of
 831 SPSS seem to always pad to a width of 8 bytes; newer versions don't).
 832
 833 @item
 834 A space.
 835
 836 @item
 837 The multiple response set's label, using the same format as for the
 838 counted value for multiple dichotomy sets.  A string of length 0 means
 839 that the set does not have a label.  A string of length 0 is also
 840 written if LABELSOURCE=VARLABEL was specified.
 841
 842 @item
 843 A space.
 844
 845 @item
 846 The short names of the variables in the set, converted to lowercase,
 847 each separated from the previous by a single space.
 848
 849 Even though a multiple response set must have at least two variables,
 850 some system files contain multiple response sets with no variables or
one variable.  The source and meaning of these multiple response sets are
 852 unknown.  (Perhaps they arise from creating a multiple response set
 853 then deleting all the variables that it contains?)
 854
 855 @item
One line feed (byte 0x0a).  Sometimes multiple line feeds, even
hundreds, are present.
 858 @end itemize
 859 @end table
 860
 861 Example: Given appropriate variable definitions, consider the
 862 following MRSETS command:
 863
 864 @example
 865 MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
 866        /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
 867        /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
 868        /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
 869         VARIABLES=k l m VALUE=34
 870        /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
 871         VARIABLES=n o p VALUE='choice'.
 872 @end example
 873
 874 The above would generate the following multiple response set record of
 875 subtype 7:
 876
 877 @example
 878 $a=C 10 my mcgroup a b c
 879 $b=D2 55 0  g e f d
 880 $c=D3 Yes 10 mdgroup #2 h i j
 881 @end example
 882
 883 It would also generate the following multiple response set record with
 884 subtype 19:
 885
 886 @example
 887 $d=E 1 2 34 13 third mdgroup k l m
 888 $e=E 11 6 choice 0  n o p
 889 @end example
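
The example lines above can be split mechanically.  The following is a
minimal Python sketch, not PSPP's own reader.  The names
@code{take_counted} and @code{parse_mrset} are hypothetical, and the
handling of the @samp{C}, @samp{D}, and @samp{E} prefixes is inferred
from the example lines shown here, so treat it as an assumption rather
than a complete parser.

@example
```python
def take_counted(text, pos):
    """Read '<count> <count bytes>' at pos; return (string, new_pos)."""
    space = text.index(b" ", pos)
    count = int(text[pos:space])
    start = space + 1
    return text[start:start + count], start + count


def parse_mrset(line):
    """Split one line of a multiple response set record (bytes)."""
    name, rest = line.split(b"=", 1)
    kind = rest[0:1]
    pos = 1
    code = None       # the number after 'E' (1 or 11), if any
    counted = None    # counted value, multiple dichotomy sets only
    if kind == b"C":
        pos += 1      # skip the space after 'C'
    elif kind == b"E":
        pos += 1      # skip the space after 'E'
        space = rest.index(b" ", pos)
        code = int(rest[pos:space])
        pos = space + 1
    if kind in (b"D", b"E"):
        counted, pos = take_counted(rest, pos)
        pos += 1      # skip the space after the counted value
    label, pos = take_counted(rest, pos)
    variables = rest[pos:].split()
    return name, kind, code, counted, label, variables


print(parse_mrset(b"$b=D2 55 0  g e f d"))
```
@end example

The counted value and label are returned exactly as stored, so any
right padding with spaces is still present in the result.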
 890
 891 @node Extra Product Info Record
 892 @section Extra Product Info Record
 893
 894 This optional record appears to contain a text string that describes
 895 the program that wrote the file and the source of the data.  (This is
 896 redundant with the file label and product info found in the file
 897 header record.)
 898
 899 @example
 900 /* @r{Header.} */
 901 int32               rec_type;
 902 int32               subtype;
 903 int32               size;
 904 int32               count;
 905
 906 /* @r{Exactly @code{count} bytes of data.} */
 907 char                info[];
 908 @end example
 909
 910 @table @code
 911 @item int32 rec_type;
 912 Record type.  Always set to 7.
 913
 914 @item int32 subtype;
 915 Record subtype.  Always set to 10.
 916
 917 @item int32 size;
 918 The size of each element in the @code{info} member. Always set to 1.
 919
 920 @item int32 count;
 921 The total number of bytes in @code{info}.
 922
 923 @item char info[];
 924 A text string.  A product that identifies itself as @code{VOXCO
 925 INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
 926 more usual LF-only or CR LF line ends.
 927 @end table
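
This record, like the other extension records in this chapter, begins
with the same four @code{int32} header fields, so a reader can load
the payload generically and interpret it by subtype afterward.  A
minimal Python sketch follows, assuming the caller has already
determined the integer endianness of the file.  The function name
@code{read_extension_record} and its @code{endian} parameter are
hypothetical.

@example
```python
import io
import struct


def read_extension_record(stream, endian="<"):
    """Read one record of type 7; return (subtype, size, count, payload)."""
    header = stream.read(16)
    if len(header) != 16:
        raise ValueError("truncated extension record header")
    rec_type, subtype, size, count = struct.unpack(endian + "4i", header)
    if rec_type != 7:
        raise ValueError("not an extension record")
    # The payload holds count elements of size bytes each.
    payload = stream.read(size * count)
    if len(payload) != size * count:
        raise ValueError("truncated extension record payload")
    return subtype, size, count, payload


# An extra product info record (subtype 10) carrying 5 bytes of text.
record = struct.pack("<4i", 7, 10, 1, 5) + b"hello"
print(read_extension_record(io.BytesIO(record)))
```
@end example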
 928
 929 @node Variable Display Parameter Record
 930 @section Variable Display Parameter Record
 931
 932 The variable display parameter record, if present, has the following
 933 format:
 934
 935 @example
 936 /* @r{Header.} */
 937 int32               rec_type;
 938 int32               subtype;
 939 int32               size;
 940 int32               count;
 941
 942 /* @r{Repeated once per set of parameters}. */
 943 int32               measure;
 944 int32               width;           /* @r{Not always present.} */
 945 int32               alignment;
 946 @end example
 947
 948 @table @code
 949 @item int32 rec_type;
 950 Record type.  Always set to 7.
 951
 952 @item int32 subtype;
 953 Record subtype.  Always set to 11.
 954
 955 @item int32 size;
 956 The size of @code{int32}.  Always set to 4.
 957
 958 @item int32 count;
 959 The number of sets of variable display parameters (ordinarily the
 960 number of variables in the dictionary), times 2 or 3.
 961 @end table
 962
 963 The remaining members are repeated once per set of parameters, in the same
 964 order as the variable records.  No element corresponds to variable
 965 records that continue long string variables.  The meanings of these
 966 members are as follows:
 967
 968 @table @code
 969 @item int32 measure;
 970 The measurement type of the variable:
 971 @table @asis
 972 @item 1
 973 Nominal Scale
 974 @item 2
 975 Ordinal Scale
 976 @item 3
 977 Continuous Scale
 978 @end table
 979
 980 SPSS sometimes writes a @code{measure} of 0.  PSPP interprets this as
 981 nominal scale.
 982
 983 @item int32 width;
 984 The width of the display column for the variable in characters.
 985
 986 This field is present if @var{count} is 3 times the number of
 987 variables in the dictionary.  It is omitted if @var{count} is 2 times
 988 the number of variables.
 989
 990 @item int32 alignment;
 991 The alignment of the variable for display purposes:
 992
 993 @table @asis
 994 @item 0
 995 Left aligned
 996 @item 1
 997 Right aligned
 998 @item 2
 999 Centre aligned
1000 @end table
1001 @end table
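
Because @code{width} is optional, a reader has to compare
@code{count} against the number of variables before it can split the
payload.  The following Python sketch illustrates that check.  The
function name @code{parse_display_params} is hypothetical, and
@code{n_vars} is assumed to exclude variable records that continue
long string variables.

@example
```python
import struct


def parse_display_params(payload, count, n_vars, endian="<"):
    """Return a list of (measure, width, alignment); width may be None."""
    values = struct.unpack(endian + str(count) + "i", payload)
    if count == 3 * n_vars:
        step = 3          # measure, width, alignment
    elif count == 2 * n_vars:
        step = 2          # measure, alignment
    else:
        raise ValueError("count is not 2 or 3 times the variable count")
    result = []
    for i in range(0, count, step):
        group = values[i:i + step]
        width = group[1] if step == 3 else None
        result.append((group[0], width, group[-1]))
    return result


payload = struct.pack("<6i", 1, 8, 0, 3, 10, 1)
print(parse_display_params(payload, 6, 2))
```
@end example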
1002
1003 @node Long Variable Names Record
1004 @section Long Variable Names Record
1005
1006 If present, the long variable names record has the following format:
1007
1008 @example
1009 /* @r{Header.} */
1010 int32               rec_type;
1011 int32               subtype;
1012 int32               size;
1013 int32               count;
1014
1015 /* @r{Exactly @code{count} bytes of data.} */
1016 char                var_name_pairs[];
1017 @end example
1018
1019 @table @code
1020 @item int32 rec_type;
1021 Record type.  Always set to 7.
1022
1023 @item int32 subtype;
1024 Record subtype.  Always set to 13.
1025
1026 @item int32 size;
1027 The size of each element in the @code{var_name_pairs} member. Always set to 1.
1028
1029 @item int32 count;
1030 The total number of bytes in @code{var_name_pairs}.
1031
1032 @item char var_name_pairs[];
1033 A list of @var{key}--@var{value} tuples, where @var{key} is the name
1034 of a variable, and @var{value} is its long variable name.
1035 The @var{key} field is at most 8 bytes long and must match the
1036 name of a variable which appears in the variable record (@pxref{Variable
1037 Record}).
1038 The @var{value} field is at most 64 bytes long.
1039 The @var{key} and @var{value} fields are separated by a @samp{=} byte.
1040 Each tuple is separated by a byte whose value is 09.  There is no
1041 trailing separator following the last tuple.
1042 The total length is @code{count} bytes.
1043 @end table
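
Splitting @code{var_name_pairs} takes only a few lines.  The Python
sketch below assumes the payload has already been read as bytes and
leaves character decoding to the caller.  The function name
@code{parse_long_names} is hypothetical.

@example
```python
def parse_long_names(payload):
    """Map short variable names to long names (both as bytes)."""
    names = dict()
    for pair in payload.split(b"\t"):      # tuples are separated by byte 09
        if not pair:
            continue
        short, long_name = pair.split(b"=", 1)
        names[short] = long_name
    return names


print(parse_long_names(b"V1=first_variable\tV2=second_variable"))
```
@end example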
1044
1045 @node Very Long String Record
1046 @section Very Long String Record
1047
1048 Old versions of SPSS limited string variables to a width of 255 bytes.
1049 For backward compatibility with these older versions, the system file
1050 format represents a string longer than 255 bytes, called a @dfn{very
1051 long string}, as a collection of strings no longer than 255 bytes
1052 each.  The strings concatenated to make a very long string are called
1053 its @dfn{segments}; for consistency, variables other than very long
1054 strings are considered to have a single segment.
1055
1056 A very long string with a width of @var{w} has @var{n} =
1057 (@var{w} + 251) / 252 segments, that is, one segment for every
1058 252 bytes of width, rounding up.  It would be logical, then, for each
1059 of the segments except the last to have a width of 252 and the last
1060 segment to have the remainder, but this is not the case.  In fact,
1061 each segment except the last has a width of 255 bytes.  The last
1062 segment has width @var{w} - (@var{n} - 1) * 252; some versions
1063 of SPSS make it slightly wider, but not wide enough to make the last
1064 segment require another 8 bytes of data.
1065
1066 Data is packed tightly into segments of a very long string, 255 bytes
1067 per segment.  Because 255 bytes of segment data are allocated for
1068 every 252 bytes of the very long string's width (approximately), some
1069 unused space is left over at the end of the allocated segments.  Data
1070 in unused space is ignored.
1071
1072 Example: Consider a very long string of width 20,000.  Such a very
1073 long string has 20,000 / 252 = 80 (rounding up) segments.  The first
1074 79 segments have width 255; the last segment has width 20,000 - 79 *
1075 252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
1076 The very long string's data is actually stored in the 19,890 bytes in
1077 the first 78 segments, plus the first 110 bytes of the 79th segment
1078 (19,890 + 110 = 20,000).  The remaining 145 bytes of the 79th segment
1079 and all 92 bytes of the 80th segment are unused.
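
The arithmetic in this example can be checked with a short Python
sketch.  The function names @code{segment_widths} and
@code{join_segments} are hypothetical, and @code{join_segments}
assumes that the caller passes each segment's value as bytes in
dictionary order.

@example
```python
def segment_widths(w):
    """Widths of the segments of a very long string of width w."""
    n = (w + 251) // 252                  # one segment per 252 bytes
    return [255] * (n - 1) + [w - (n - 1) * 252]


def join_segments(segments, w):
    """Concatenate 255 bytes per segment and keep the first w bytes."""
    return b"".join(seg[:255] for seg in segments)[:w]


widths = segment_widths(20000)
print(len(widths), widths[0], widths[-1])
```
@end example

For a width of 20,000 this yields 80 segments, of which the first 79
have width 255 and the last has width 92, matching the example.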
1080
1081 The very long string record explains how to stitch together segments
1082 to obtain very long string data.  For each of the very long string
1083 variables in the dictionary, it specifies the name of its first
1084 segment's variable and the very long string variable's actual width.
1085 The remaining segments immediately follow the named variable in the
1086 system file's dictionary.
1087
1088 The very long string record, which is present only if the system file
1089 contains very long string variables, has the following format:
1090
1091 @example
1092 /* @r{Header.} */
1093 int32               rec_type;
1094 int32               subtype;
1095 int32               size;
1096 int32               count;
1097
1098 /* @r{Exactly @code{count} bytes of data.} */
1099 char                string_lengths[];
1100 @end example
1101
1102 @table @code
1103 @item int32 rec_type;
1104 Record type.  Always set to 7.
1105
1106 @item int32 subtype;
1107 Record subtype.  Always set to 14.
1108
1109 @item int32 size;
1110 The size of each element in the @code{string_lengths} member. Always set to 1.
1111
1112 @item int32 count;
1113 The total number of bytes in @code{string_lengths}.
1114
1115 @item char string_lengths[];
1116 A list of @var{key}--@var{value} tuples, where @var{key} is the name
1117 of a variable, and @var{value} is its length.
1118 The @var{key} field is at most 8 bytes long and must match the
1119 name of a variable which appears in the variable record (@pxref{Variable
1120 Record}).
1121 The @var{value} field is exactly 5 bytes long.  It is a zero-padded,
1122 ASCII-encoded decimal number that gives the width of the variable.
1123 The @var{key} and @var{value} fields are separated by a @samp{=} byte.
1124 Tuples are delimited by a two-byte sequence @{00, 09@}.
1125 After the last tuple, there may be a single byte 00, or @{00, 09@}.
1126 The total length is @code{count} bytes.
1127 @end table
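
A reader has to tolerate both forms of trailing delimiter.  The
Python sketch below shows one way to do that.  The function name
@code{parse_very_long_strings} is hypothetical, and the payload is
assumed to be available as bytes.

@example
```python
def parse_very_long_strings(payload):
    """Map first-segment variable names (bytes) to actual widths."""
    widths = dict()
    for item in payload.split(b"\x00\t"):  # tuples end with bytes 00, 09
        item = item.rstrip(b"\x00")        # a lone trailing 00 is allowed
        if not item:
            continue
        name, value = item.split(b"=", 1)
        widths[name] = int(value)          # zero-padded decimal digits
    return widths


print(parse_very_long_strings(b"STR1=00300\x00\tSTR2=20000\x00"))
```
@end example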
1128
1129 @node Character Encoding Record
1130 @section Character Encoding Record
1131
1132 This record, if present, indicates the character encoding for string data,
1133 long variable names, variable labels, value labels and other strings in the
1134 file.
1135
1136 @example
1137 /* @r{Header.} */
1138 int32               rec_type;
1139 int32               subtype;
1140 int32               size;
1141 int32               count;
1142
1143 /* @r{Exactly @code{count} bytes of data.} */
1144 char                encoding[];
1145 @end example
1146
1147 @table @code
1148 @item int32 rec_type;
1149 Record type.  Always set to 7.
1150
1151 @item int32 subtype;
1152 Record subtype.  Always set to 20.
1153
1154 @item int32 size;
1155 The size of each element in the @code{encoding} member. Always set to 1.
1156
1157 @item int32 count;
1158 The total number of bytes in @code{encoding}.
1159
1160 @item char encoding[];
1161 The name of the character encoding.  Normally this will be an official
1162 IANA character set name or alias.
1163 See @url{http://www.iana.org/assignments/character-sets}.
1164 Character set names are not case-sensitive, but SPSS appears to write
1165 them in all-uppercase.
1166 @end table
1167
1168 This record is not present in files generated by older software.  See
1169 also the @code{character_code} field in the machine integer info
1170 record (@pxref{character-code}).
1171
1172 When the character encoding record and the machine integer info record
1173 are both present, all system files observed in practice indicate the
1174 same character encoding, e.g.@: 1252 as @code{character_code} and
1175 @code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
1176
1177 If, for testing purposes, a file is crafted with different
1178 @code{character_code} and @code{encoding}, it seems that
1179 @code{character_code} controls the encoding for all strings in the
1180 system file before the dictionary termination record, including
1181 strings in data (e.g.@: string missing values), and @code{encoding}
1182 controls the encoding for strings following the dictionary termination
1183 record.
1184
1185 @node Long String Value Labels Record
1186 @section Long String Value Labels Record
1187
1188 This record, if present, specifies value labels for long string
1189 variables.
1190
1191 @example
1192 /* @r{Header.} */
1193 int32               rec_type;
1194 int32               subtype;
1195 int32               size;
1196 int32               count;
1197
1198 /* @r{Repeated until exactly @code{count} bytes are consumed.} */
1199 int32               var_name_len;
1200 char                var_name[];
1201 int32               var_width;
1202 int32               n_labels;
1203 long_string_label   labels[];
1204 @end example
1205
1206 @table @code
1207 @item int32 rec_type;
1208 Record type.  Always set to 7.
1209
1210 @item int32 subtype;
1211 Record subtype.  Always set to 21.
1212
1213 @item int32 size;
1214 Always set to 1.
1215
1216 @item int32 count;
1217 The number of bytes following the header until the next header.
1218
1219 @item int32 var_name_len;
1220 @itemx char var_name[];
1221 The number of bytes in the name of the variable that has long string
1222 value labels, plus the variable name itself, which consists of exactly
1223 @code{var_name_len} bytes.  The variable name is not padded to any
1224 particular boundary, nor is it null-terminated.
1225
1226 @item int32 var_width;
1227 The width of the variable, in bytes, which will be between 9 and
1228 32767.
1229
1230 @item int32 n_labels;
1231 @itemx long_string_label labels[];
1232 The long string labels themselves.  The @code{labels} array contains
1233 exactly @code{n_labels} elements, each of which has the following
1234 substructure:
1235
1236 @example
1237 int32               value_len;
1238 char                value[];
1239 int32               label_len;
1240 char                label[];
1241 @end example
1242
1243 @table @code
1244 @item int32 value_len;
1245 @itemx char value[];
1246 The string value being labeled.  @code{value_len} is the number of
1247 bytes in @code{value}; it is equal to @code{var_width}.  The
1248 @code{value} array is not padded or null-terminated.
1249
1250 @item int32 label_len;
1251 @itemx char label[];
1252 The label for the string value.  @code{label_len}, which must be
1253 between 0 and 120, is the number of bytes in @code{label}.  The
1254 @code{label} array is not padded or null-terminated.
1255 @end table
1256 @end table
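
Since none of the strings in this record are padded or terminated, a
reader must walk the payload with a cursor.  The following Python
sketch shows the idea under the assumption that the payload
(@code{count} bytes) is already in memory.  The function name
@code{parse_long_string_labels} and its helpers are hypothetical.

@example
```python
import struct


def parse_long_string_labels(payload, endian="<"):
    """Return a list of (var_name, var_width, [(value, label), ...])."""
    int32 = struct.Struct(endian + "i")
    pos = 0

    def take_int():
        nonlocal pos
        (value,) = int32.unpack_from(payload, pos)
        pos += 4
        return value

    def take_bytes(n):
        nonlocal pos
        chunk = payload[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("truncated record")
        pos += n
        return chunk

    result = []
    while pos < len(payload):
        name = take_bytes(take_int())      # var_name_len, var_name
        width = take_int()                 # var_width
        labels = []
        for _ in range(take_int()):        # n_labels
            value = take_bytes(take_int())
            label = take_bytes(take_int())
            labels.append((value, label))
        result.append((name, width, labels))
    return result
```
@end example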
1257
1258 @node Long String Missing Values Record
1259 @section Long String Missing Values Record
1260
1261 This record, if present, specifies missing values for long string
1262 variables.
1263
1264 @example
1265 /* @r{Header.} */
1266 int32               rec_type;
1267 int32               subtype;
1268 int32               size;
1269 int32               count;
1270
1271 /* @r{Repeated until exactly @code{count} bytes are consumed.} */
1272 int32               var_name_len;
1273 char                var_name[];
1274 char                n_missing_values;
1275 long_string_missing_value   values[];
1276 @end example
1277
1278 @table @code
1279 @item int32 rec_type;
1280 Record type.  Always set to 7.
1281
1282 @item int32 subtype;
1283 Record subtype.  Always set to 22.
1284
1285 @item int32 size;
1286 Always set to 1.
1287
1288 @item int32 count;
1289 The number of bytes following the header until the next header.
1290
1291 @item int32 var_name_len;
1292 @itemx char var_name[];
1293 The number of bytes in the name of the long string variable that has
1294 missing values, plus the variable name itself, which consists of
1295 exactly @code{var_name_len} bytes.  The variable name is not padded to
1296 any particular boundary, nor is it null-terminated.
1297
1298 @item char n_missing_values;
1299 The number of missing values, either 1, 2, or 3.  (This is, unusually,
1300 a single byte instead of a 32-bit number.)
1301
1302 @item long_string_missing_value values[];
1303 The missing values themselves.  This array contains exactly
1304 @code{n_missing_values} elements, each of which has the following
1305 substructure:
1306
1307 @example
1308 int32               value_len;
1309 char                value[];
1310 @end example
1311
1312 @table @code
1313 @item int32 value_len;
1314 The length of the missing value string, in bytes.  This value should
1315 be 8, because long string variables are at least 8 bytes wide (by
1316 definition), only the first 8 bytes of a long string variable's
1317 missing values are allowed to be non-spaces, and any spaces within the
1318 first 8 bytes are included in the missing value here.
1319
1320 @item char value[];
1321 The missing value string, exactly @code{value_len} bytes, without
1322 any padding or null terminator.
1323 @end table
1324 @end table
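
The single-byte @code{n_missing_values} field is easy to misread as
an @code{int32}.  The Python sketch below shows the layout, assuming
the payload (@code{count} bytes) is already in memory.  The function
name @code{parse_long_string_missing} is hypothetical.

@example
```python
import struct


def parse_long_string_missing(payload, endian="<"):
    """Return a list of (var_name, [missing value, ...]) as bytes."""
    result = []
    pos = 0
    while pos < len(payload):
        (name_len,) = struct.unpack_from(endian + "i", payload, pos)
        pos += 4
        name = payload[pos:pos + name_len]
        pos += name_len
        n_missing = payload[pos]           # one byte, not an int32
        pos += 1
        values = []
        for _ in range(n_missing):
            (value_len,) = struct.unpack_from(endian + "i", payload, pos)
            pos += 4
            values.append(payload[pos:pos + value_len])
            pos += value_len
        result.append((name, values))
    return result
```
@end example

A truncated payload makes this sketch raise @code{struct.error} or
@code{IndexError}; a production reader would report such damage more
gracefully.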
1325
1326 @node Data File and Variable Attributes Records
1327 @section Data File and Variable Attributes Records
1328
1329 The data file and variable attributes records represent custom
1330 attributes for the system file or for individual variables in the
1331 system file, as defined on the DATAFILE ATTRIBUTE (@pxref{DATAFILE
1332 ATTRIBUTE,,,pspp, PSPP Users Guide}) and VARIABLE ATTRIBUTE commands
1333 (@pxref{VARIABLE ATTRIBUTE,,,pspp, PSPP Users Guide}), respectively.
1334
1335 @example
1336 /* @r{Header.} */
1337 int32               rec_type;
1338 int32               subtype;
1339 int32               size;
1340 int32               count;
1341
1342 /* @r{Exactly @code{count} bytes of data.} */
1343 char                attributes[];
1344 @end example
1345
1346 @table @code
1347 @item int32 rec_type;
1348 Record type.  Always set to 7.
1349
1350 @item int32 subtype;
1351 Record subtype.  Always set to 17 for a data file attribute record or
1352 to 18 for a variable attributes record.
1353
1354 @item int32 size;
1355 The size of each element in the @code{attributes} member. Always set to 1.
1356
1357 @item int32 count;
1358 The total number of bytes in @code{attributes}.
1359
1360 @item char attributes[];
1361 The attributes, in a text-based format.
1362
1363 In record subtype 17, this field contains a single attribute set.  An
1364 attribute set is a sequence of one or more attributes concatenated
1365 together.  Each attribute consists of a name, which has the same
1366 syntax as a variable name, followed by, inside parentheses, a sequence
1367 of one or more values.  Each value consists of a string enclosed in
1368 single quotes (@code{'}) followed by a line feed (byte 0x0a).  A value
1369 may contain single quote characters, which are not themselves escaped
1370 or quoted or required to be present in pairs.  There is no apparent
1371 way to embed a line feed in a value.  There is no distinction between
1372 an attribute with a single value and an attribute array with one
1373 element.
1374
1375 In record subtype 18, this field contains a sequence of one or more
1376 variable attribute sets.  If more than one variable attribute set is
1377 present, each one after the first is delimited from the previous by
1378 @code{/}.  Each variable attribute set consists of a long
1379 variable name,
1380 followed by @code{:}, followed by an attribute set with the same
1381 syntax as on record subtype 17.
1382
1383 The total length is @code{count} bytes.
1384 @end table
1385
1386 @subheading Example
1387
1388 A system file produced with the following VARIABLE ATTRIBUTE commands
1389 in effect:
1390
1391 @example
1392 VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=fred[1]('23') fred[2]('34').
1393 VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
1394 @end example
1395
1396 @noindent
1397 will contain a variable attribute record with the following contents:
1398
1399 @example
1400 0000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
1401 0010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
1402 0020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
1403 0030  0a 29                                             |.)              |
1404 @end example
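
The text in the dump above can be parsed by scanning for the
quote-plus-line-feed sequence that ends each value.  The Python sketch
below works on text that has already been decoded.  The function names
@code{parse_attribute_set} and @code{parse_variable_attributes} are
hypothetical, and the sketch assumes well-formed input.

@example
```python
def parse_attribute_set(text, pos=0):
    """Parse attributes from pos; return ([(name, values)], new_pos)."""
    attrs = []
    while pos < len(text) and text[pos] != "/":
        paren = text.index("(", pos)
        name = text[pos:paren]
        pos = paren + 1
        values = []
        while text[pos] != ")":
            # A value runs from its opening quote to a quote followed
            # by a line feed; quotes inside the value are not escaped.
            end = text.index("'\n", pos + 1)
            values.append(text[pos + 1:end])
            pos = end + 2
        pos += 1                           # skip the closing parenthesis
        attrs.append((name, values))
    return attrs, pos


def parse_variable_attributes(text):
    """Parse the attributes field of a variable attributes record."""
    result = []
    pos = 0
    while pos < len(text):
        colon = text.index(":", pos)
        variable = text[pos:colon]
        attrs, pos = parse_attribute_set(text, colon + 1)
        result.append((variable, attrs))
        if pos < len(text) and text[pos] == "/":
            pos += 1                       # delimiter between variables
    return result


print(parse_variable_attributes("dummy:fred('23'\n'34'\n)bert('123'\n)"))
```
@end example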
1405
1406 @menu
1407 * Variable Roles::
1408 @end menu
1409
1410 @node Variable Roles
1411 @subsection Variable Roles
1412
1413 A variable's role is represented as an attribute named @code{$@@Role}.
1414 This attribute has a single element whose values and their meanings
1415 are:
1416
1417 @table @code
1418 @item 0
1419 Input.  This, the default, is the most common role.
1420 @item 1
1421 Output.
1422 @item 2
1423 Both.
1424 @item 3
1425 None.
1426 @item 4
1427 Partition.
1428 @item 5
1429 Split.
1430 @end table
1431
1432 @node Extended Number of Cases Record
1433 @section Extended Number of Cases Record
1434
1435 The file header record expresses the number of cases in the system
1436 file as an int32 (@pxref{File Header Record}).  This record allows the
1437 number of cases in the system file to be expressed as a 64-bit number.
1438
1439 @example
1440 int32               rec_type;
1441 int32               subtype;
1442 int32               size;
1443 int32               count;
1444 int64               unknown;
1445 int64               ncases64;
1446 @end example
1447
1448 @table @code
1449 @item int32 rec_type;
1450 Record type.  Always set to 7.
1451
1452 @item int32 subtype;
1453 Record subtype.  Always set to 16.
1454
1455 @item int32 size;
1456 Size of each element.  Always set to 8.
1457
1458 @item int32 count;
1459 Number of pieces of data in the data part.  Always set to 2.
1460
1461 @item int64 unknown;
1462 Meaning unknown.  Always set to 1.
1463
1464 @item int64 ncases64;
1465 Number of cases in the file as a 64-bit integer.  Presumably this
1466 could be -1 to indicate that the number of cases is unknown, for the
1467 same reason as @code{ncases} in the file header record, but this has
1468 not been observed in the wild.
1469 @end table
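
The whole record is 32 bytes long and can be unpacked in one step.
The Python sketch below assumes the record bytes and the file's
endianness are known.  The function name @code{parse_ncases64} is
hypothetical, and treating @minus{}1 as ``unknown'' follows the
presumption stated above rather than observed files.

@example
```python
import struct


def parse_ncases64(record, endian="<"):
    """Return the 64-bit case count, or None if it is -1 (unknown)."""
    fields = struct.unpack(endian + "4i2q", record[:32])
    rec_type, subtype, size, count, unknown, ncases64 = fields
    if (rec_type, subtype, size, count) != (7, 16, 8, 2):
        raise ValueError("not an extended number of cases record")
    return None if ncases64 == -1 else ncases64


record = struct.pack("<4i2q", 7, 16, 8, 2, 1, 5000000000)
print(parse_ncases64(record))
```
@end example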
1470
1471 @node Other Informational Records
1472 @section Other Informational Records
1473
1474 This chapter documents many specific types of extension records,
1475 but others are known to exist.  PSPP ignores unknown
1476 extension records when reading system files.
1477
1478 The following extension record subtypes have also been observed, with
1479 the following believed meanings:
1480
1481 @table @asis
1482 @item 5
1483 A set of grouped variables (according to Aapi H@"am@"al@"ainen).
1484
1485 @item 6
1486 Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
1487
1488 @item 12
1489 A UUID in the format described in RFC 4122.  Only two examples have
1490 been observed, both written by SPSS 13, and in each case the UUID
1491 contained both upper and lower case.
1492
1493 @item 24
1494 XML that describes how data in the file should be displayed on-screen.
1495 @end table
1496
1497 @node Dictionary Termination Record
1498 @section Dictionary Termination Record
1499
1500 The dictionary termination record separates all other records from the
1501 data records.
1502
1503 @example
1504 int32               rec_type;
1505 int32               filler;
1506 @end example
1507
1508 @table @code
1509 @item int32 rec_type;
1510 Record type.  Always set to 999.
1511
1512 @item int32 filler;
1513 Ignored padding.  Should be set to 0.
1514 @end table
1515
1516 @node Data Record
1517 @section Data Record
1518
1519 The data record must follow all other records in the system file.
1520 Every system file must have a data record that specifies data for at
1521 least one case.  The format of the data record varies depending on the
1522 value of @code{compression} in the file header record:
1523
1524 @table @asis
1525 @item 0: no compression
1526 Data is arranged as a series of 8-byte elements.
1527 Each element corresponds to
1528 the variable declared in the respective variable record (@pxref{Variable
1529 Record}).  Numeric values are given in @code{flt64} format; string
1530 values are literal character strings, padded on the right when
1531 necessary to fill out 8-byte units.
1532
1533 @item 1: bytecode compression
1534 The first 8 bytes
1535 of the data record are divided into a series of 1-byte command
1536 codes.  These codes have meanings as described below:
1537
1538 @table @asis
1539 @item 0
1540 Ignored.  If the program writing the system file accumulates compressed
1541 data in blocks of fixed length, 0 bytes can be used to pad out extra
1542 bytes remaining at the end of a fixed-size block.
1543
1544 @item 1 through 251
1545 A number with
1546 value @var{code} - @var{bias}, where
1547 @var{code} is the value of the compression code and @var{bias} is the
1548 variable @code{bias} from the file header.  For example,
1549 code 105 with bias 100.0 (the normal value) indicates a numeric variable
1550 of value 5.
1551 One file has been seen written by SPSS 14 that contained such a code
1552 in a @emph{string} field with the value 0 (after the bias is
1553 subtracted) as a way of encoding null bytes.
1554
1555 @item 252
1556 End of file.  This code may or may not appear at the end of the data
1557 stream.  PSPP always outputs this code but its use is not required.
1558
1559 @item 253
1560 A numeric or string value that is not
1561 compressible.  The value is stored in the 8 bytes following the
1562 current block of command bytes.  If this value appears twice in a block
1563 of command bytes, then it indicates the second group of 8 bytes following the
1564 command bytes, and so on.
1565
1566 @item 254
1567 An 8-byte string value that is all spaces.
1568
1569 @item 255
1570 The system-missing value.
1571 @end table
1572
1573 The end of the 8-byte group of bytecodes is followed by any 8-byte
1574 blocks of non-compressible values indicated by code 253.  After that
1575 follows another 8-byte group of bytecodes, then those bytecodes'
1576 non-compressible values.  The pattern repeats to the end of the file
1577 or a code with value 252.
1578
1579 @item 2: ZLIB compression
1580 The data record consists of the following, in order:
1581
1582 @itemize @bullet
1583 @item
1584 ZLIB data header, 24 bytes long.
1585
1586 @item
1587 One or more variable-length blocks of ZLIB compressed data.
1588
1589 @item
1590 ZLIB data trailer, with a 24-byte fixed header plus an additional 24
1591 bytes for each preceding ZLIB compressed data block.
1592 @end itemize
1593
1594 The ZLIB data header has the following format:
1595
1596 @example
1597 int64               zheader_ofs;
1598 int64               ztrailer_ofs;
1599 int64               ztrailer_len;
1600 @end example
1601
1602 @table @code
1603 @item int64 zheader_ofs;
1604 The offset, in bytes, of the beginning of this structure within the
1605 system file.
1606
1607 @item int64 ztrailer_ofs;
1608 The offset, in bytes, of the first byte of the ZLIB data trailer.
1609
1610 @item int64 ztrailer_len;
1611 The number of bytes in the ZLIB data trailer.  This and the previous
1612 field sum to the size of the system file in bytes.
1613 @end table
1614
1615 The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
1616 compressed data blocks.  Each ZLIB compressed data block begins with a
1617 ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
1618 01} (the only header yet observed in practice).  Each block
1619 decompresses to a fixed number of bytes (in practice only
1620 @code{0x3ff000}-byte blocks have been observed), except that the last
1621 block of data may be shorter.  The last ZLIB compressed data block
1622 ends just before offset @code{ztrailer_ofs}.
1623
1624 The result of ZLIB decompression is bytecode compressed data as
1625 described above for compression format 1.
1626
1627 The ZLIB data trailer begins with the following 24-byte fixed header:
1628
1629 @example
1630 int64               int_bias;
1631 int64               zero;
1632 int32               block_size;
1633 int32               n_blocks;
1634 @end example
1635
1636 @table @code
1637 @item int64 int_bias;
1638 The compression bias as a negative integer, e.g.@: if @code{bias} in
1639 the file header record is 100.0, then @code{int_bias} is @minus{}100
1640 (this is the only value yet observed in practice).
1641
1642 @item int64 zero;
1643 Always observed to be zero.
1644
1645 @item int32 block_size;
1646 The number of bytes in each ZLIB compressed data block, except
1647 possibly the last, following decompression.  Only @code{0x3ff000} has
1648 been observed so far.
1649
1650 @item int32 n_blocks;
1651 The number of ZLIB compressed data blocks, always exactly
1652 @code{(ztrailer_len - 24) / 24}.
1653 @end table
1654
1655 The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
1656 block descriptors, each of which describes the compressed data block
1657 corresponding to its offset.  Each block descriptor has the following
1658 format:
1659
1660 @example
1661 int64               uncompressed_ofs;
1662 int64               compressed_ofs;
1663 int32               uncompressed_size;
1664 int32               compressed_size;
1665 @end example
1666
1667 @table @code
1668 @item int64 uncompressed_ofs;
1669 The offset, in bytes, that this block of data would have in a similar
1670 system file that uses compression format 1.  This is
1671 @code{zheader_ofs} in the first block descriptor, and in each
1672 succeeding block descriptor it is the sum of the previous descriptor's
1673 @code{uncompressed_ofs} and @code{uncompressed_size}.
1674
1675 @item int64 compressed_ofs;
1676 The offset, in bytes, of the actual beginning of this compressed data
1677 block.  This is @code{zheader_ofs + 24} in the first block descriptor,
1678 and in each succeeding block descriptor it is the sum of the previous
1679 descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
1680 final block descriptor's @code{compressed_ofs} and
1681 @code{compressed_size} sum to @code{ztrailer_ofs}.
1682
1683 @item int32 uncompressed_size;
1684 The number of bytes in this data block, after decompression.  This is
1685 @code{block_size} in every data block except the last, which may be
1686 smaller.
1687
1688 @item int32 compressed_size;
1689 The number of bytes in this data block, as stored compressed in this
1690 system file.
1691 @end table
1692 @end table
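
The two compressed formats can be sketched together in Python, since
format 2 decompresses to format 1 data.  This is an illustration under
stated assumptions, not PSPP's reader.  The function names
@code{decode_bytecodes} and @code{inflate_data_record} are
hypothetical, the whole file is assumed to be in memory, and the
trailer length is assumed to be 24 bytes plus 24 bytes per block, as
described above.  Elements come back tagged by kind, so the caller
decides whether a @code{raw} element is a @code{flt64} or 8 string
bytes.

@example
```python
import struct
import zlib


def decode_bytecodes(data, bias=100.0):
    """Expand format 1 data into a list of (kind, value) elements."""
    out = []
    pos = 0
    while pos + 8 <= len(data):
        codes = data[pos:pos + 8]          # one group of command codes
        pos += 8
        for code in codes:
            if code == 0:
                continue                   # padding, ignored
            if code == 252:
                return out                 # end of file
            if code == 253:
                out.append(("raw", data[pos:pos + 8]))
                pos += 8                   # next block after the codes
            elif code == 254:
                out.append(("spaces", b" " * 8))
            elif code == 255:
                out.append(("sysmis", None))
            else:
                out.append(("number", code - bias))
    return out


def inflate_data_record(file_bytes, zheader_ofs, endian="<"):
    """Return (format 1 data, bias) for a format 2 data record."""
    header = struct.unpack_from(endian + "3q", file_bytes, zheader_ofs)
    ztrailer_ofs, ztrailer_len = header[1], header[2]
    fixed = struct.unpack_from(endian + "2q2i", file_bytes, ztrailer_ofs)
    int_bias, n_blocks = fixed[0], fixed[3]
    if ztrailer_len != 24 + 24 * n_blocks:
        raise ValueError("trailer length does not match n_blocks")
    chunks = []
    for i in range(n_blocks):
        ofs = ztrailer_ofs + 24 + 24 * i
        desc = struct.unpack_from(endian + "2q2i", file_bytes, ofs)
        c_ofs, u_size, c_size = desc[1], desc[2], desc[3]
        chunk = zlib.decompress(file_bytes[c_ofs:c_ofs + c_size])
        if len(chunk) != u_size:
            raise ValueError("unexpected decompressed size")
        chunks.append(chunk)
    return b"".join(chunks), float(-int_bias)
```
@end example

A caller would pass the first result of @code{inflate_data_record} to
@code{decode_bytecodes} along with the returned bias.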
1693
1694 @setfilename ignored