@c PSPP - a program for statistical analysis.
@c Copyright (C) 2019 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c

@node System File Format
@chapter System File Format

An SPSS system file holds a set of cases and dictionary information
that describes how they may be interpreted.  The system file format
dates back 40+ years and has evolved greatly over that time to support
new features, but in a way that facilitates interchange between even the
oldest and newest versions of software.  This chapter describes the
system file format.

System files use four data types: 8-bit characters, 32-bit integers,
64-bit integers, and 64-bit floating-point numbers, called here
@code{char}, @code{int32}, @code{int64}, and
@code{flt64}, respectively.  Data is not necessarily aligned on a word
or double-word boundary: the long variable name record (@pxref{Long
Variable Names Record}) and very long string records (@pxref{Very Long
String Record}) have arbitrary byte length and can therefore cause all
data coming after them in the file to be misaligned.

Integer data in system files may be big-endian or little-endian.  A
reader may detect the endianness of a system file by examining
@code{layout_code} in the file header record
(@pxref{layout_code,,@code{layout_code}}).

Floating-point data in system files may nominally be in IEEE 754, IBM,
or VAX formats.  A reader may detect the floating-point format in use
by examining @code{bias} in the file header record
(@pxref{bias,,@code{bias}}).

PSPP detects big-endian and little-endian integer formats in system
files and translates as necessary.  PSPP also detects the
floating-point format in use, as well as the endianness of IEEE 754
floating-point numbers, and translates as needed.  However, only IEEE
754 numbers with the same endianness as integer data in the same file
have actually been observed in system files, and it is likely that
other formats are obsolete or were never used.
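
As a rough illustration of the detection described above, the following
Python sketch tries both byte orders.  The function names are
hypothetical and are not PSPP's own; the accepted @code{layout_code}
values (2 and 3) and the usual @code{bias} of 100 are taken from the
file header record description, and IEEE 754 doubles are assumed.

```python
import struct


def detect_integer_endianness(layout_code_bytes):
    """Return '>' (big-endian), '<' (little-endian), or None if unclear."""
    for prefix in (">", "<"):
        (value,) = struct.unpack(prefix + "i", layout_code_bytes)
        if value in (2, 3):
            return prefix
    return None


def detect_float_endianness(bias_bytes):
    """Return the byte order under which bias reads as 100.0, else None."""
    for prefix in (">", "<"):
        (value,) = struct.unpack(prefix + "d", bias_bytes)
        if value == 100.0:
            return prefix
    return None
```

When @code{detect_float_endianness} returns @code{None}, a reader could
fall back to the integer byte order, which matches all system files
observed so far.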

System files use a few floating point values for special purposes:

@table @asis
@item SYSMIS
The system-missing value is represented by the largest possible
negative number in the floating point format (@code{-DBL_MAX}).

@item HIGHEST
HIGHEST is used as the high end of a missing value range with an
unbounded maximum.  It is represented by the largest possible positive
number (@code{DBL_MAX}).

@item LOWEST
LOWEST is used as the low end of a missing value range with an
unbounded minimum.  It was originally represented by the
second-largest negative number (in IEEE 754 format,
@code{0xffeffffffffffffe}).  System files written by SPSS 21 and later
instead use the largest negative number (@code{-DBL_MAX}), the same
value as SYSMIS.  This does not lead to ambiguity because LOWEST
appears in system files only in missing value ranges, which never
contain SYSMIS.
@end table
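
The special values above can be written down concretely for IEEE 754
doubles.  This is a minimal sketch; the constant and function names
are illustrative assumptions, not identifiers from PSPP.

```python
import struct
import sys

# -DBL_MAX and DBL_MAX for IEEE 754 double precision.
SYSMIS = -sys.float_info.max
HIGHEST = sys.float_info.max

# Original encoding of LOWEST: bit pattern 0xffeffffffffffffe.
(LOWEST_ORIGINAL,) = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))


def is_lowest(value):
    """True if value is either encoding of LOWEST.

    Only meaningful for the low end of a missing value range, where
    SYSMIS cannot appear.
    """
    return value in (LOWEST_ORIGINAL, SYSMIS)
```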

System files may use most character encodings based on an 8-bit unit.
UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
@code{rec_type} in the file header record is sufficient to distinguish
between ASCII and EBCDIC based encodings.  The best way to determine
the specific encoding in use is to consult the character encoding
record (@pxref{Character Encoding Record}), if present, and failing
that the @code{character_code} in the machine integer info record
(@pxref{Machine Integer Info Record}).  The same encoding should be
used for the dictionary and the data in the file, although it is
possible to artificially synthesize files that use different encodings
(@pxref{Character Encoding Record}).
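
The order of preference described above might be sketched as follows.
The function name and the mapping from @code{character_code} values to
Python codec names are assumptions made for illustration; in
particular, old versions of SPSS wrote 2 regardless of the encoding in
use, so a result of @code{ascii} should be treated with suspicion.

```python
# Assumed mapping from character_code values to Python codec names.
CODEC_BY_CHARACTER_CODE = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "iso-8859-1",
    65001: "utf-8",
}


def choose_encoding(encoding_record=None, character_code=None):
    """Prefer the character encoding record, then character_code.

    Returns None when neither source identifies an encoding, leaving
    the choice of a default to the caller.
    """
    if encoding_record:
        return encoding_record.strip()
    return CODEC_BY_CHARACTER_CODE.get(character_code)
```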

@menu
* System File Record Structure::
* File Header Record::
* Variable Record::
* Value Labels Records::
* Document Record::
* Machine Integer Info Record::
* Machine Floating-Point Info Record::
* Multiple Response Sets Records::
* Extra Product Info Record::
* Variable Display Parameter Record::
* Long Variable Names Record::
* Very Long String Record::
* Character Encoding Record::
* Long String Value Labels Record::
* Long String Missing Values Record::
* Data File and Variable Attributes Records::
* Extended Number of Cases Record::
* Other Informational Records::
* Dictionary Termination Record::
* Data Record::
@end menu

@node System File Record Structure
@section System File Record Structure

System files are divided into records with the following format:

@example
int32               type;
char                data[];
@end example

This header does not identify the length of the @code{data} or any
information about what it contains, so the system file reader must
understand the format of @code{data} based on @code{type}.  However,
records with type 7, called @dfn{extension records}, have a stricter
format:

@example
int32               type;
int32               subtype;
int32               size;
int32               count;
char                data[size * count];
@end example

@table @code
@item int32 type;
Record type.  Always set to 7.

@item int32 subtype;
Record subtype.  This value identifies a particular kind of extension
record.

@item int32 size;
The size of each piece of data that follows the header, in bytes.
Known extension records use 1, 4, or 8, for @code{char}, @code{int32},
and @code{flt64} format data, respectively.

@item int32 count;
The number of pieces of data that follow the header.

@item char data[size * count];
Data, whose format and interpretation depend on the subtype.
@end table

An extension record contains exactly @code{size * count} bytes of
data, which allows a reader that does not understand an extension
record to skip it.  Extension records provide only nonessential
information, so this allows for files written by newer software to
preserve backward compatibility with older or less capable readers.
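
The skipping behavior described above can be sketched in Python as
follows.  The function name and its parameters are hypothetical; the
sketch assumes that the caller has already read the @code{int32} record
type (7) and knows the file's integer byte order.

```python
import io
import struct


def read_extension_record(stream, endian="<", known_subtypes=()):
    """Read one extension record body; return (subtype, data or None).

    The data of an unknown subtype is consumed and discarded, so the
    stream is left positioned at the start of the next record.
    """
    header = stream.read(12)
    if len(header) != 12:
        raise ValueError("truncated extension record header")
    subtype, size, count = struct.unpack(endian + "3i", header)
    if size < 0 or count < 0:
        raise ValueError("negative size or count in extension record")
    data = stream.read(size * count)
    if len(data) != size * count:
        raise ValueError("truncated extension record data")
    return subtype, (data if subtype in known_subtypes else None)


# Usage: an unrecognized subtype 99 with five 1-byte elements is skipped.
demo = io.BytesIO(struct.pack("<3i", 99, 1, 5) + b"hello" + b"NEXT")
print(read_extension_record(demo))
```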

Records in a system file must appear in the following order:

@itemize @bullet
@item
File header record.

@item
Variable records.

@item
All pairs of value labels records and value label variables records,
if present.

@item
Document record, if present.

@item
Extension (type 7) records, in ascending numerical order of their
subtypes.

System files written by SPSS include at most one of each kind of
extension record.  This is generally true of system files written by
other software as well, with known exceptions noted below in the
individual sections about each type of record.

@item
Dictionary termination record.

@item
Data record.
@end itemize

We advise authors of programs that read system files to tolerate
format variations.  Various kinds of misformatting and corruption have
been observed in system files written by SPSS and other software
alike.  In particular, because extension records provide nonessential
information, it is generally better to ignore an extension record
entirely than to refuse to read a system file.

The following sections describe the known kinds of records.

@node File Header Record
@section File Header Record

A system file begins with the file header, with the following format:

@example
char                rec_type[4];
char                prod_name[60];
int32               layout_code;
int32               nominal_case_size;
int32               compression;
int32               weight_index;
int32               ncases;
flt64               bias;
char                creation_date[9];
char                creation_time[8];
char                file_label[64];
char                padding[3];
@end example

@table @code
@item char rec_type[4];
Record type code, either @samp{$FL2} for system files with
uncompressed data or data compressed with simple bytecode compression,
or @samp{$FL3} for system files with ZLIB compressed data.

This is truly a character field that uses the same character encoding
as other strings.  Thus, in a file with an ASCII-based character encoding
this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
file with an EBCDIC-based encoding this field contains @code{5b c6 d3
f2}.  (No EBCDIC-based ZLIB-compressed files have been observed.)

@item char prod_name[60];
Product identification string.  This always begins with the characters
@samp{@@(#) SPSS DATA FILE}.  PSPP uses the remaining characters to
give its version and the operating system name; for example, @samp{GNU
pspp 0.1.4 - sparc-sun-solaris2.5.2}.  The string is truncated if it
would be longer than 60 characters; otherwise it is padded on the right
with spaces.

The product name field allows readers to behave differently based on
quirks in the way that particular software writes system files.
@xref{Value Labels Records}, for the detail of the quirk that the PSPP
system file reader tolerates in files written by ReadStat, which has
@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.

@anchor{layout_code}
@item int32 layout_code;
Normally set to 2, although a few system files have been spotted in
the wild with a value of 3 here.  PSPP uses this value to determine the
file's integer endianness (@pxref{System File Format}).

@item int32 nominal_case_size;
Number of data elements per case.  This is the number of variables,
except that long string variables add extra data elements (one for every
8 characters after the first 8).  However, string variables do not
contribute to this value beyond the first 255 bytes.  Further, some
software always writes -1 or 0 in this field.  In general, it is
unsafe for programs that read system files to rely on this value.

@item int32 compression;
Set to 0 if the data in the file is not compressed, 1 if the data is
compressed with simple bytecode compression, 2 if the data is ZLIB
compressed.  This field has value 2 if and only if @code{rec_type} is
@samp{$FL3}.

@item int32 weight_index;
If one of the variables in the data set is used as a weighting
variable, set to the 1-based dictionary index of that variable
(@pxref{Dictionary Index}).  Otherwise, set to 0.

@item int32 ncases;
Set to the number of cases in the file if it is known, or -1 otherwise.

In the general case it is not possible to determine the number of cases
that will be output to a system file at the time that the header is
written.  A writer can deal with this by writing the entire
system file, including the header, then seeking back to the beginning of
the file and rewriting just the @code{ncases} field.  For output on
which seeking is not possible, the seek operation fails and
@code{ncases} remains -1.

@anchor{bias}
@item flt64 bias;
Compression bias, ordinarily set to 100.  Only integers between
@code{1 - bias} and @code{251 - bias} can be compressed.

By assuming that its value is 100, PSPP uses @code{bias} to determine
the file's floating-point format and endianness (@pxref{System File
Format}).  If the compression bias is not 100, PSPP cannot auto-detect
the floating-point format and assumes that it is IEEE 754 format with
the same endianness as the system file's integers, which is correct
for all known system files.

@item char creation_date[9];
Date of creation of the system file, in @samp{dd mmm yy}
format, with the month as standard English abbreviations, using an
initial capital letter and following with lowercase.  If the date is not
available then this field is arbitrarily set to @samp{01 Jan 70}.

@item char creation_time[8];
Time of creation of the system file, in @samp{hh:mm:ss}
format and using 24-hour time.  If the time is not available then this
field is arbitrarily set to @samp{00:00:00}.

@item char file_label[64];
File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
PSPP Users Guide}).  Padded on the right with spaces.

A product that identifies itself as @code{VOXCO INTERVIEWER 4.3} uses
CR-only line ends in this field, rather than the more usual LF-only or
CR LF line ends.

@item char padding[3];
Ignored padding bytes to make the structure a multiple of 32 bits in
length.  Set to zeros.
@end table
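
The header layout above maps directly onto a fixed-size binary
structure of 176 bytes.  The following Python sketch unpacks it; the
names @code{FileHeader}, @code{HEADER_FORMAT}, and
@code{parse_file_header} are illustrative only, and the sketch assumes
IEEE 754 doubles with the same byte order as the integers.

```python
import struct
from collections import namedtuple

FileHeader = namedtuple(
    "FileHeader",
    "rec_type prod_name layout_code nominal_case_size compression "
    "weight_index ncases bias creation_date creation_time file_label",
)

# Field layout without a byte-order prefix; "3x" skips the padding bytes.
HEADER_FORMAT = "4s60s5id9s8s64s3x"
HEADER_SIZE = struct.calcsize("<" + HEADER_FORMAT)


def parse_file_header(raw):
    """Return (byte_order_prefix, FileHeader) for the first 176 bytes."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("file header is truncated")
    for endian in ("<", ">"):
        # layout_code follows rec_type (4 bytes) and prod_name (60 bytes).
        (layout_code,) = struct.unpack_from(endian + "i", raw, 64)
        if layout_code in (2, 3):
            fields = struct.unpack_from(endian + HEADER_FORMAT, raw)
            return endian, FileHeader(*fields)
    raise ValueError("cannot determine byte order from layout_code")
```

String fields are returned as raw bytes, since the character encoding
is generally not known until later records have been read.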

@node Variable Record
@section Variable Record

There must be one variable record for each numeric variable and each
string variable with width 8 bytes or less.  String variables wider
than 8 bytes have one variable record for each 8 bytes, rounding up.
The first variable record for a long string specifies the variable's
correct dictionary information.  Subsequent variable records for a
long string are filled with dummy information: a type of -1, no
variable label or missing values, print and write formats that are
ignored, and an empty string as name.  A few system files have been
encountered that include a variable label on dummy variable records,
so readers should take care to parse dummy variable records in the
same way as other variable records.

@anchor{Dictionary Index}
The @dfn{dictionary index} of a variable is a 1-based offset in the set of
variable records, including dummy variable records for long string
variables.  The first variable record has a dictionary index of 1, the
second has a dictionary index of 2, and so on.
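
To make the counting rule concrete, the following sketch computes
dictionary indexes from a list of variable widths.  The function names
are hypothetical; a width of 0 stands for a numeric variable, and
positive widths are string widths in bytes.

```python
def variable_record_count(width):
    """Number of variable records used by one variable of this width."""
    if width <= 8:
        return 1
    # One record per 8 bytes of string width, rounding up.
    return (width + 7) // 8


def dictionary_indexes(widths):
    """Return the 1-based dictionary index of each variable."""
    indexes = []
    next_index = 1
    for width in widths:
        indexes.append(next_index)
        next_index += variable_record_count(width)
    return indexes


# A numeric variable, a 20-byte string, an 8-byte string, and a numeric.
print(dictionary_indexes([0, 20, 8, 0]))
```

In this example the 20-byte string occupies three variable records, so
the variable that follows it has dictionary index 5 rather than 3.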

The system file format does not directly support string variables
wider than 255 bytes.  Such very long string variables are represented
by a number of narrower string variables.  @xref{Very Long String
Record}, for details.

A system file should contain at least one variable and thus at least
one variable record, but system files have been observed in the wild
without any variables (thus, no data either).

@example
int32               rec_type;
int32               type;
int32               has_var_label;
int32               n_missing_values;
int32               print;
int32               write;
char                name[8];

/* @r{Present only if @code{has_var_label} is 1.} */
int32               label_len;
char                label[];

/* @r{Present only if @code{n_missing_values} is nonzero}. */
flt64               missing_values[];
@end example

@table @code
@item int32 rec_type;
Record type code.  Always set to 2.

@item int32 type;
Variable type code.  Set to 0 for a numeric variable.  For a short
string variable or the first part of a long string variable, this is set
to the width of the string.  For the second and subsequent parts of a
long string variable, set to -1, and the remaining fields in the
structure are ignored.

@item int32 has_var_label;
If this variable has a variable label, set to 1; otherwise, set to 0.

@item int32 n_missing_values;
If the variable has no missing values, set to 0.  If the variable has
one, two, or three discrete missing values, set to 1, 2, or 3,
respectively.  If the variable has a range for missing values, set to
-2; if the variable has a range for missing values plus a single
discrete value, set to -3.

A long string variable always has the value 0 here.  A separate record
indicates missing values for long string variables (@pxref{Long String
Missing Values Record}).

@item int32 print;
Print format for this variable.  See below.

@item int32 write;
Write format for this variable.  See below.

@item char name[8];
Variable name.  The variable name must begin with a capital letter or
the at-sign (@samp{@@}).  Subsequent characters may also be digits, octothorpes
(@samp{#}), dollar signs (@samp{$}), underscores (@samp{_}), or full
stops (@samp{.}).  The variable name is padded on the right with spaces.

The @samp{name} fields should be unique within a system file.  System
files written by SPSS that contain very long string variables with
similar names sometimes contain duplicate names that are later
eliminated by resolving the very long string names (@pxref{Very Long
String Record}).  PSPP handles duplicates by assigning them new,
unique names.

@item int32 label_len;
This field is present only if @code{has_var_label} is set to 1.  It is
set to the length, in characters, of the variable label.  The
documented maximum length varies from 120 to 255 based on SPSS
version, but some files have been seen with longer labels.  PSPP
accepts labels of any length.

@item char label[];
This field is present only if @code{has_var_label} is set to 1.  It has
length @code{label_len}, rounded up to the nearest multiple of 32 bits.
The first @code{label_len} characters are the variable's variable label.

@item flt64 missing_values[];
This field is present only if @code{n_missing_values} is nonzero.  It
has the same number of 8-byte elements as the absolute value of
@code{n_missing_values}.  Each element is interpreted as a number for
numeric variables (with HIGHEST and LOWEST indicated as described in
the chapter introduction).  For string variables of width less than 8
bytes, elements are right-padded with spaces; for string variables
wider than 8 bytes, only the first 8 bytes of each missing value are
specified, with the remainder implicitly all spaces.

For discrete missing values, each element represents one missing
value.  When a range is present, the first element denotes the minimum
value in the range, and the second element denotes the maximum value
in the range.  When a range plus a value are present, the third
element denotes the additional discrete missing value.
@end table
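
The interpretation of @code{n_missing_values} and
@code{missing_values} described above can be sketched as follows.
The function name and its return convention are assumptions made for
illustration.

```python
def interpret_missing_values(n_missing_values, elements):
    """Return (discrete_values, value_range) for one variable.

    elements holds the already-decoded missing_values entries;
    value_range is a (minimum, maximum) tuple, or None if no range is
    present.
    """
    if len(elements) != abs(n_missing_values):
        raise ValueError("element count does not match n_missing_values")
    if 0 <= n_missing_values <= 3:
        return list(elements), None
    if n_missing_values == -2:
        return [], (elements[0], elements[1])
    if n_missing_values == -3:
        return [elements[2]], (elements[0], elements[1])
    raise ValueError("unexpected n_missing_values: %d" % n_missing_values)
```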

@anchor{System File Output Formats}
The @code{print} and @code{write} fields of the variable record are output
formats coded into @code{int32} types.  The least-significant byte
of the @code{int32} represents the number of decimal places, and the
next two bytes in order of increasing significance represent field width
and format type, respectively.  The most-significant byte is not
used and should be set to zero.
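
The byte packing just described can be decoded with shifts and masks.
In this sketch the function name is hypothetical, and the argument is
the @code{int32} after conversion to the host's native integer.

```python
def decode_format(code):
    """Split a print or write format code into (type, width, decimals)."""
    decimals = code & 0xFF
    width = (code >> 8) & 0xFF
    format_type = (code >> 16) & 0xFF
    return format_type, width, decimals


# 0x050802 has format type 5, width 8, and 2 decimal places.
print(decode_format(0x050802))
```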

Format types are defined as follows:

@quotation
@multitable {Value} {@code{DATETIME}}
@headitem Value
@tab Meaning
@item 0
@tab Not used.
@item 1
@tab @code{A}
@item 2
@tab @code{AHEX}
@item 3
@tab @code{COMMA}
@item 4
@tab @code{DOLLAR}
@item 5
@tab @code{F}
@item 6
@tab @code{IB}
@item 7
@tab @code{PIBHEX}
@item 8
@tab @code{P}
@item 9
@tab @code{PIB}
@item 10
@tab @code{PK}
@item 11
@tab @code{RB}
@item 12
@tab @code{RBHEX}
@item 13
@tab Not used.
@item 14
@tab Not used.
@item 15
@tab @code{Z}
@item 16
@tab @code{N}
@item 17
@tab @code{E}
@item 18
@tab Not used.
@item 19
@tab Not used.
@item 20
@tab @code{DATE}
@item 21
@tab @code{TIME}
@item 22
@tab @code{DATETIME}
@item 23
@tab @code{ADATE}
@item 24
@tab @code{JDATE}
@item 25
@tab @code{DTIME}
@item 26
@tab @code{WKDAY}
@item 27
@tab @code{MONTH}
@item 28
@tab @code{MOYR}
@item 29
@tab @code{QYR}
@item 30
@tab @code{WKYR}
@item 31
@tab @code{PCT}
@item 32
@tab @code{DOT}
@item 33
@tab @code{CCA}
@item 34
@tab @code{CCB}
@item 35
@tab @code{CCC}
@item 36
@tab @code{CCD}
@item 37
@tab @code{CCE}
@item 38
@tab @code{EDATE}
@item 39
@tab @code{SDATE}
@item 40
@tab @code{MTIME}
@item 41
@tab @code{YMDHMS}
@end multitable
@end quotation

A few system files have been observed in the wild with invalid
@code{write} fields, in particular with value 0.  Readers should
probably treat invalid @code{print} or @code{write} fields as some
default format.

@node Value Labels Records
@section Value Labels Records

The value label records documented in this section are used for
numeric and short string variables only.  Long string variables may
have value labels, but their value labels are recorded using a
different record type (@pxref{Long String Value Labels Record}).

ReadStat (@pxref{File Header Record}) writes value labels that label a
single value more than once.  Specifically, it emits value labels
whose values are longer than the string variable's width and that
become identical when truncated to that width, e.g.@: labels for
values @code{ABC123} and @code{ABC456} for a string variable with
width 3.  For files written by this software, PSPP ignores such
labels.

The value label record has the following format:

@example
int32               rec_type;
int32               label_count;

/* @r{Repeated @code{label_count} times}. */
char                value[8];
char                label_len;
char                label[];
@end example

@table @code
@item int32 rec_type;
Record type.  Always set to 3.

@item int32 label_count;
Number of value labels present in this record.
@end table

The remaining fields are repeated @code{label_count} times.  Each
repetition specifies one value label.

@table @code
@item char value[8];
A numeric value or a short string value padded as necessary to 8 bytes
in length.  Its type and width cannot be determined until the
following value label variables record (see below) is read.

@item char label_len;
The label's length, in bytes.  The documented maximum length varies
from 60 to 120 based on SPSS version.  PSPP supports value labels up
to 255 bytes long.

@item char label[];
@code{label_len} bytes of the actual label, followed by up to 7 bytes
of padding to bring @code{label} and @code{label_len} together to a
multiple of 8 bytes in length.
@end table
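
The padding rule for @code{label_len} and @code{label} is easy to get
wrong, so here is a minimal sketch of a parser for the repeated part of
the record.  The function name and signature are hypothetical;
@code{data} is assumed to start just after @code{label_count}.

```python
import struct


def parse_value_labels(data, label_count, offset=0):
    """Return ([(value_bytes, label_bytes), ...], offset_after_labels)."""
    labels = []
    for _ in range(label_count):
        if offset + 9 > len(data):
            raise ValueError("truncated value label")
        value = data[offset:offset + 8]
        label_len = data[offset + 8]
        # label_len (1 byte) plus the label, rounded up to 8 bytes.
        padded = (1 + label_len + 7) // 8 * 8
        if offset + 8 + padded > len(data):
            raise ValueError("truncated value label")
        label = data[offset + 9:offset + 9 + label_len]
        labels.append((value, label))
        offset += 8 + padded
    return labels, offset


# Usage: a 3-byte label takes 4 bytes of padding, an 8-byte label takes 7.
first = struct.pack("<d", 1.0) + b"\x03Yes" + b" " * 4
second = struct.pack("<d", 2.0) + b"\x08Not sure" + b" " * 7
print(parse_value_labels(first + second, 2))
```

The 8 bytes of @code{value} are returned undecoded because, as noted
above, their type is not known until the value label variables record
has been read.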

The value label record is always immediately followed by a value label
variables record with the following format:

@example
int32               rec_type;
int32               var_count;
int32               vars[];
@end example

@table @code
@item int32 rec_type;
Record type.  Always set to 4.

@item int32 var_count;
Number of variables to which the associated value labels from the
value label record are to be applied.

@item int32 vars[];
A list of 1-based dictionary indexes of variables to which to apply the value
labels (@pxref{Dictionary Index}).  There are @code{var_count}
elements.

String variables wider than 8 bytes may not be specified in this list.
@end table
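
A reader typically converts the 1-based indexes in @code{vars} into
positions in its own list of variable records.  A minimal sketch,
with a hypothetical function name and with @code{data} assumed to
start just after @code{rec_type}:

```python
import struct


def parse_value_label_variables(data, endian="<"):
    """Return zero-based variable record positions from vars[]."""
    (var_count,) = struct.unpack_from(endian + "i", data, 0)
    if var_count < 0:
        raise ValueError("negative var_count")
    indexes = struct.unpack_from(endian + "%di" % var_count, data, 4)
    if any(index < 1 for index in indexes):
        raise ValueError("dictionary indexes are 1-based")
    return [index - 1 for index in indexes]
```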

@node Document Record
@section Document Record

The document record, if present, has the following format:

@example
int32               rec_type;
int32               n_lines;
char                lines[][80];
@end example

@table @code
@item int32 rec_type;
Record type.  Always set to 6.

@item int32 n_lines;
Number of lines of documents present.  This should be greater than
zero, but ReadStat writes system files with zero @code{n_lines}.

@item char lines[][80];
Document lines.  The number of elements is defined by @code{n_lines}.
Lines shorter than 80 characters are padded on the right with spaces.
@end table
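
Splitting @code{lines} into individual document lines is a matter of
slicing fixed 80-byte chunks.  In the following sketch the function
name and the default encoding are assumptions; a real reader would use
the encoding determined for the file.

```python
def parse_document_lines(data, n_lines, encoding="ascii"):
    """Return the document lines with right-hand space padding removed."""
    if n_lines < 0 or len(data) < 80 * n_lines:
        raise ValueError("document record is truncated or malformed")
    lines = []
    for i in range(n_lines):
        raw = data[80 * i:80 * (i + 1)]
        lines.append(raw.decode(encoding, errors="replace").rstrip(" "))
    return lines
```

Accepting @code{n_lines} of zero, as this sketch does, tolerates the
files written by ReadStat mentioned above.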

@node Machine Integer Info Record
@section Machine Integer Info Record

The integer info record, if present, has the following format:

@example
/* @r{Header.} */
int32               rec_type;
int32               subtype;
int32               size;
int32               count;

/* @r{Data.} */
int32               version_major;
int32               version_minor;
int32               version_revision;
int32               machine_code;
int32               floating_point_rep;
int32               compression_code;
int32               endianness;
int32               character_code;
@end example

@table @code
@item int32 rec_type;
Record type.  Always set to 7.

@item int32 subtype;
Record subtype.  Always set to 3.

@item int32 size;
Size of each piece of data in the data part, in bytes.  Always set to 4.

@item int32 count;
Number of pieces of data in the data part.  Always set to 8.

@item int32 version_major;
PSPP major version number.  In version @var{x}.@var{y}.@var{z}, this
is @var{x}.

@item int32 version_minor;
PSPP minor version number.  In version @var{x}.@var{y}.@var{z}, this
is @var{y}.

@item int32 version_revision;
PSPP version revision number.  In version @var{x}.@var{y}.@var{z},
this is @var{z}.

@item int32 machine_code;
Machine code.  PSPP always sets this field to -1, but other
values may appear.

@item int32 floating_point_rep;
Floating point representation code.  For IEEE 754 systems this is 1.
IBM 370 sets this to 2, and DEC VAX E to 3.

@item int32 compression_code;
Compression code.  Always set to 1, regardless of whether or how the
file is compressed.

@item int32 endianness;
Machine endianness.  1 indicates big-endian, 2 indicates little-endian.

@item int32 character_code;
@anchor{character-code} Character code.  The following values have
actually been observed in system files:

@table @asis
@item 1
EBCDIC.

@item 2
7-bit ASCII.

@item 1250
The @code{windows-1250} code page for Central European and Eastern
European languages.

@item 1252
The @code{windows-1252} code page for Western European languages.

@item 28591
ISO 8859-1.

@item 65001
UTF-8.
@end table

The following additional values are known to be defined:

@table @asis
@item 3
8-bit ``ASCII''.

@item 4
DEC Kanji.
@end table

Other Windows code page numbers are known to be generally valid.

Old versions of SPSS for Unix and Windows always wrote value 2 in this
field, regardless of the encoding in use.  Newer versions also write
the character encoding as a string (see @ref{Character Encoding
Record}).
@end table
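
Because the data part of this record is eight consecutive
@code{int32} values, it can be unpacked in one step.  The following is
a sketch only; @code{IntegerInfo} and @code{parse_integer_info} are
illustrative names, and @code{data} is assumed to hold the 32 bytes
that follow the extension record header.

```python
import struct
from collections import namedtuple

IntegerInfo = namedtuple(
    "IntegerInfo",
    "version_major version_minor version_revision machine_code "
    "floating_point_rep compression_code endianness character_code",
)


def parse_integer_info(data, endian="<"):
    """Unpack the data part of a machine integer info record."""
    if len(data) != 32:
        raise ValueError("expected 8 elements of 4 bytes each")
    return IntegerInfo(*struct.unpack(endian + "8i", data))
```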
 752
 753 @node Machine Floating-Point Info Record
 754 @section Machine Floating-Point Info Record
 755
 756 The floating-point info record, if present, has the following format:
 757
 758 @example
 759 /* @r{Header.} */
 760 int32               rec_type;
 761 int32               subtype;
 762 int32               size;
 763 int32               count;
 764
 765 /* @r{Data.} */
 766 flt64               sysmis;
 767 flt64               highest;
 768 flt64               lowest;
 769 @end example
 770
 771 @table @code
 772 @item int32 rec_type;
 773 Record type.  Always set to 7.
 774
 775 @item int32 subtype;
 776 Record subtype.  Always set to 4.
 777
 778 @item int32 size;
 779 Size of each piece of data in the data part, in bytes.  Always set to 8.
 780
 781 @item int32 count;
 782 Number of pieces of data in the data part.  Always set to 3.
 783
 784 @item flt64 sysmis;
 785 @itemx flt64 highest;
 786 @itemx flt64 lowest;
 787 The system missing value, the value used for HIGHEST in missing
 788 values, and the value used for LOWEST in missing values, respectively.
 789 @xref{System File Format}, for more information.
 790
 791 The SPSSWriter library in PHP, which identifies itself as @code{FOM
 792 SPSS 1.0.0} in the file header record @code{prod_name} field, writes
 793 unexpected values to these fields, but it uses the same values
 794 consistently throughout the rest of the file.
 795 @end table
 796
 797 @node Multiple Response Sets Records
 798 @section Multiple Response Sets Records
 799
 800 The system file format has two different types of records that
 801 represent multiple response sets (@pxref{MRSETS,,,pspp, PSPP Users
 802 Guide}).  The first type of record describes multiple response sets
 803 that can be understood by SPSS before version 14.  The second type of
 804 record, with a closely related format, is used for multiple dichotomy
 805 sets that use the CATEGORYLABELS=COUNTEDVALUES feature added in
 806 version 14.
 807
 808 @example
 809 /* @r{Header.} */
 810 int32               rec_type;
 811 int32               subtype;
 812 int32               size;
 813 int32               count;
 814
 815 /* @r{Exactly @code{count} bytes of data.} */
 816 char                mrsets[];
 817 @end example
 818
 819 @table @code
 820 @item int32 rec_type;
 821 Record type.  Always set to 7.
 822
 823 @item int32 subtype;
 824 Record subtype.  Set to 7 for records that describe multiple response
 825 sets understood by SPSS before version 14, or to 19 for records that
 826 describe dichotomy sets that use the CATEGORYLABELS=COUNTEDVALUES
 827 feature added in version 14.
 828
 829 @item int32 size;
 830 The size of each element in the @code{mrsets} member. Always set to 1.
 831
 832 @item int32 count;
 833 The total number of bytes in @code{mrsets}.
 834
 835 @item char mrsets[];
 836 Zero or more line feeds (byte 0x0a), followed by a series of multiple
 837 response sets, each of which consists of the following:
 838
 839 @itemize @bullet
 840 @item
 841 The set's name (an identifier that begins with @samp{$}), in mixed
 842 upper and lower case.
 843
 844 @item
 845 An equals sign (@samp{=}).
 846
 847 @item
 848 @samp{C} for a multiple category set, @samp{D} for a multiple
 849 dichotomy set with CATEGORYLABELS=VARLABELS, or @samp{E} for a
 850 multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES.
 851
 852 @item
 853 For a multiple dichotomy set with CATEGORYLABELS=COUNTEDVALUES, a
 854 space, followed by a number expressed as decimal digits, followed by a
 855 space.  If LABELSOURCE=VARLABEL was specified on MRSETS, then the
 856 number is 11; otherwise it is 1.@footnote{This part of the format may
 857 not be fully understood, because only a single example of each
 858 possibility has been examined.}
 859
 860 @item
 861 For either kind of multiple dichotomy set, the counted value, as a
 862 positive integer count specified as decimal digits, followed by a
 863 space, followed by as many string bytes as specified in the count.  If
 864 the set contains numeric variables, the string consists of the counted
 865 integer value expressed as decimal digits.  If the set contains string
 866 variables, the string contains the counted string value.  Either way,
 867 the string may be padded on the right with spaces (older versions of
 868 SPSS seem to always pad to a width of 8 bytes; newer versions don't).
 869
 870 @item
 871 A space.
 872
 873 @item
 874 The multiple response set's label, using the same format as for the
 875 counted value for multiple dichotomy sets.  A string of length 0 means
 876 that the set does not have a label.  A string of length 0 is also
 877 written if LABELSOURCE=VARLABEL was specified.
 878
 879 @item
 880 A space.
 881
 882 @item
 883 The short names of the variables in the set, converted to lowercase,
 884 each separated from the previous by a single space.
 885
 886 Even though a multiple response set must have at least two variables,
 887 some system files contain multiple response sets with no variables or
 888 one variable.  The source and meaning of these multiple response sets are
 889 unknown.  (Perhaps they arise from creating a multiple response set
 890 then deleting all the variables that it contains?)
 891
 892 @item
 893 One line feed (byte 0x0a).  Sometimes multiple line feeds, even
 894 hundreds, are present.
 895 @end itemize
 896 @end table
 897
 898 Example: Given appropriate variable definitions, consider the
 899 following MRSETS command:
 900
 901 @example
 902 MRSETS /MCGROUP NAME=$a LABEL='my mcgroup' VARIABLES=a b c
 903        /MDGROUP NAME=$b VARIABLES=g e f d VALUE=55
 904        /MDGROUP NAME=$c LABEL='mdgroup #2' VARIABLES=h i j VALUE='Yes'
 905        /MDGROUP NAME=$d LABEL='third mdgroup' CATEGORYLABELS=COUNTEDVALUES
 906         VARIABLES=k l m VALUE=34
 907        /MDGROUP NAME=$e CATEGORYLABELS=COUNTEDVALUES LABELSOURCE=VARLABEL
 908         VARIABLES=n o p VALUE='choice'.
 909 @end example
 910
 911 The above would generate the following multiple response set record of
 912 subtype 7:
 913
 914 @example
 915 $a=C 10 my mcgroup a b c
 916 $b=D2 55 0  g e f d
 917 $c=D3 Yes 10 mdgroup #2 h i j
 918 @end example
 919
 920 It would also generate the following multiple response set record with
 921 subtype 19:
 922
 923 @example
 924 $d=E 1 2 34 13 third mdgroup k l m
 925 $e=E 11 6 choice 0  n o p
 926 @end example
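
The grammar described above can be sketched as a small parser.  The
following Python is only an illustration, not PSPP's implementation:
the helper names @code{read_counted}, @code{parse_mrset}, and
@code{parse_mrsets} are hypothetical, and the sketch assumes that set
labels and counted values never contain line feeds.

```python
def read_counted(buf, pos):
    """Read '<decimal count> <count bytes>' at pos; return (bytes, new pos)."""
    end = buf.index(b" ", pos)
    count = int(buf[pos:end])
    start = end + 1
    return buf[start:start + count], start + count


def parse_mrset(line):
    """Parse one multiple response set (one line, without the line feed)."""
    name, rest = line.split(b"=", 1)
    kind = rest[:1]                  # b"C", b"D", or b"E"
    pos = 1
    flag = None                      # the 1-or-11 number, E sets only
    counted = None                   # counted value, D and E sets only
    if kind == b"E":
        end = rest.index(b" ", pos + 1)
        flag = int(rest[pos + 1:end])
        pos = end + 1
    if kind in (b"D", b"E"):
        counted, pos = read_counted(rest, pos)
        counted = counted.rstrip(b" ")   # older writers pad with spaces
    pos += 1                         # the space that precedes the label
    label, pos = read_counted(rest, pos)
    variables = rest[pos:].split()   # tolerates sets with 0 or 1 variables
    return name, kind, flag, counted, label, variables


def parse_mrsets(data):
    """Parse a whole mrsets member, skipping empty lines (extra line feeds)."""
    return [parse_mrset(line) for line in data.split(b"\n") if line]
```

Applied to the subtype 19 example above, @code{parse_mrset} reports
the number 1 for @code{$d} and 11 for @code{$e}, each with the counted
value and label split out from the variable names.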
 927
 928 @node Extra Product Info Record
 929 @section Extra Product Info Record
 930
 931 This optional record appears to contain a text string that describes
 932 the program that wrote the file and the source of the data.  (This is
 933 redundant with the file label and product info found in the file
 934 header record.)
 935
 936 @example
 937 /* @r{Header.} */
 938 int32               rec_type;
 939 int32               subtype;
 940 int32               size;
 941 int32               count;
 942
 943 /* @r{Exactly @code{count} bytes of data.} */
 944 char                info[];
 945 @end example
 946
 947 @table @code
 948 @item int32 rec_type;
 949 Record type.  Always set to 7.
 950
 951 @item int32 subtype;
 952 Record subtype.  Always set to 10.
 953
 954 @item int32 size;
 955 The size of each element in the @code{info} member. Always set to 1.
 956
 957 @item int32 count;
 958 The total number of bytes in @code{info}.
 959
 960 @item char info[];
 961 A text string.  A product that identifies itself as @code{VOXCO
 962 INTERVIEWER 4.3} uses CR-only line ends in this field, rather than the
 963 more usual LF-only or CR LF line ends.
 964 @end table
 965
 966 @node Variable Display Parameter Record
 967 @section Variable Display Parameter Record
 968
 969 The variable display parameter record, if present, has the following
 970 format:
 971
 972 @example
 973 /* @r{Header.} */
 974 int32               rec_type;
 975 int32               subtype;
 976 int32               size;
 977 int32               count;
 978
 979 /* @r{Repeated once per set of display parameters.} */
 980 int32               measure;
 981 int32               width;           /* @r{Not always present.} */
 982 int32               alignment;
 983 @end example
 984
 985 @table @code
 986 @item int32 rec_type;
 987 Record type.  Always set to 7.
 988
 989 @item int32 subtype;
 990 Record subtype.  Always set to 11.
 991
 992 @item int32 size;
 993 The size of @code{int32}.  Always set to 4.
 994
 995 @item int32 count;
 996 The number of sets of variable display parameters (ordinarily the
 997 number of variables in the dictionary), times 2 or 3.
 998 @end table
 999
1000 The remaining members are repeated once per set of display parameters
1001 (@code{count} / 2 or @code{count} / 3 times), in the same order as the
1002 variable records.  No element corresponds to variable records that continue
1003 long string variables.  The meanings of these members are as follows:
1004
1005 @table @code
1006 @item int32 measure;
1007 The measurement level of the variable:
1008 @table @asis
1009 @item 0
1010 Unknown
1011 @item 1
1012 Nominal
1013 @item 2
1014 Ordinal
1015 @item 3
1016 Scale
1017 @end table
1018
1019 An ``unknown'' @code{measure} of 0 means that the variable was created
1020 in some way that doesn't make the measurement level clear, e.g.@: with
1021 a @code{COMPUTE} transformation.  PSPP sets the measurement level the
1022 first time it reads the data using the rules documented in
1023 @ref{Measurement Level,,,pspp, PSPP Users Guide}, so this should
1024 rarely appear.
1025
1026 @item int32 width;
1027 The width of the display column for the variable in characters.
1028
1029 This field is present if @code{count} is 3 times the number of
1030 variables in the dictionary.  It is omitted if @code{count} is 2 times
1031 the number of variables.
1032
1033 @item int32 alignment;
1034 The alignment of the variable for display purposes:
1035
1036 @table @asis
1037 @item 0
1038 Left aligned
1039 @item 1
1040 Right aligned
1041 @item 2
1042 Centre aligned
1043 @end table
1044 @end table
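
A reader has to decide from @code{count} whether @code{width} is
present.  The following Python sketch shows one way to do so.  The
function name @code{parse_display_params} and its parameters are
hypothetical; @code{n_vars} is assumed to be the number of variables,
not counting the records that continue long string variables, and the
byte order is assumed to be already known.

```python
import struct


def parse_display_params(payload, count, n_vars, endian="<"):
    """Split the int32 payload into (measure, width, alignment) tuples.

    width is None when the file omits it (count == 2 * n_vars).
    """
    if n_vars <= 0 or count not in (2 * n_vars, 3 * n_vars):
        raise ValueError("count is not 2 or 3 times the number of variables")
    per_var = count // n_vars
    values = struct.unpack_from("%s%di" % (endian, count), payload)
    params = []
    for i in range(n_vars):
        group = values[i * per_var:(i + 1) * per_var]
        width = group[1] if per_var == 3 else None
        params.append((group[0], width, group[-1]))
    return params
```

A stricter reader might also check that @code{measure} is between 0
and 3 and that @code{alignment} is between 0 and 2.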
1045
1046 @node Long Variable Names Record
1047 @section Long Variable Names Record
1048
1049 If present, the long variable names record has the following format:
1050
1051 @example
1052 /* @r{Header.} */
1053 int32               rec_type;
1054 int32               subtype;
1055 int32               size;
1056 int32               count;
1057
1058 /* @r{Exactly @code{count} bytes of data.} */
1059 char                var_name_pairs[];
1060 @end example
1061
1062 @table @code
1063 @item int32 rec_type;
1064 Record type.  Always set to 7.
1065
1066 @item int32 subtype;
1067 Record subtype.  Always set to 13.
1068
1069 @item int32 size;
1070 The size of each element in the @code{var_name_pairs} member. Always set to 1.
1071
1072 @item int32 count;
1073 The total number of bytes in @code{var_name_pairs}.
1074
1075 @item char var_name_pairs[];
1076 A list of @var{key}--@var{value} tuples, where @var{key} is the name
1077 of a variable, and @var{value} is its long variable name.
1078 The @var{key} field is at most 8 bytes long and must match the
1079 name of a variable which appears in the variable record (@pxref{Variable
1080 Record}).
1081 The @var{value} field is at most 64 bytes long.
1082 The @var{key} and @var{value} fields are separated by a @samp{=} byte.
1083 Each tuple is separated by a byte whose value is 09.  There is no
1084 trailing separator following the last tuple.
1085 The total length is @code{count} bytes.
1086 @end table
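
As an illustration, the tuple list can be split with two levels of
separators.  This Python sketch is not taken from PSPP; the function
name @code{parse_long_names} is hypothetical, and names are left as
undecoded bytes because the character encoding is determined
elsewhere.

```python
def parse_long_names(var_name_pairs):
    """Return a list of (short name, long name) pairs as bytes."""
    pairs = []
    for item in var_name_pairs.split(b"\t"):   # byte 09 separates tuples
        if not item:
            continue                            # tolerate a stray separator
        key, value = item.split(b"=", 1)
        pairs.append((key, value))
    return pairs
```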
1087
1088 @node Very Long String Record
1089 @section Very Long String Record
1090
1091 Old versions of SPSS limited string variables to a width of 255 bytes.
1092 For backward compatibility with these older versions, the system file
1093 format represents a string longer than 255 bytes, called a @dfn{very
1094 long string}, as a collection of strings no longer than 255 bytes
1095 each.  The strings concatenated to make a very long string are called
1096 its @dfn{segments}; for consistency, variables other than very long
1097 strings are considered to have a single segment.
1098
1099 A very long string with a width of @var{w} has @var{n} =
1100 (@var{w} + 251) / 252 segments, that is, one segment for every
1101 252 bytes of width, rounding up.  It would be logical, then, for each
1102 of the segments except the last to have a width of 252 and the last
1103 segment to have the remainder, but this is not the case.  In fact,
1104 each segment except the last has a width of 255 bytes.  The last
1105 segment has width @var{w} - (@var{n} - 1) * 252; some versions
1106 of SPSS make it slightly wider, but not wide enough to make the last
1107 segment require another 8 bytes of data.
1108
1109 Data is packed tightly into segments of a very long string, 255 bytes
1110 per segment.  Because 255 bytes of segment data are allocated for
1111 every 252 bytes of the very long string's width (approximately), some
1112 unused space is left over at the end of the allocated segments.  Data
1113 in unused space is ignored.
1114
1115 Example: Consider a very long string of width 20,000.  Such a very
1116 long string has 20,000 / 252 = 80 (rounding up) segments.  The first
1117 79 segments have width 255; the last segment has width 20,000 - 79 *
1118 252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
1119 The very long string's data is actually stored in the 19,890 bytes in
1120 the first 78 segments, plus the first 110 bytes of the 79th segment
1121 (19,890 + 110 = 20,000).  The remaining 145 bytes of the 79th segment
1122 and all 92 bytes of the 80th segment are unused.
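
The arithmetic above can be checked with a few lines of Python.  This
is a sketch only: the function names are hypothetical, and
@code{join_segments} assumes that each segment's data has already been
read from a case, possibly with trailing padding.

```python
def segment_count(width):
    """Number of segments for a very long string of the given width."""
    return (width + 251) // 252


def segment_widths(width):
    """Declared widths of the segments: 255 for all but the last."""
    n = segment_count(width)
    return [255] * (n - 1) + [width - (n - 1) * 252]


def join_segments(segments, width):
    """Concatenate 255 bytes per segment, then drop the unused tail."""
    packed = b"".join(seg[:255] for seg in segments)
    return packed[:width]
```

For a width of 20,000 this yields 80 segments whose last declared
width is 92, matching the example.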
1123
1124 The very long string record explains how to stitch together segments
1125 to obtain very long string data.  For each of the very long string
1126 variables in the dictionary, it specifies the name of its first
1127 segment's variable and the very long string variable's actual width.
1128 The remaining segments immediately follow the named variable in the
1129 system file's dictionary.
1130
1131 The very long string record, which is present only if the system file
1132 contains very long string variables, has the following format:
1133
1134 @example
1135 /* @r{Header.} */
1136 int32               rec_type;
1137 int32               subtype;
1138 int32               size;
1139 int32               count;
1140
1141 /* @r{Exactly @code{count} bytes of data.} */
1142 char                string_lengths[];
1143 @end example
1144
1145 @table @code
1146 @item int32 rec_type;
1147 Record type.  Always set to 7.
1148
1149 @item int32 subtype;
1150 Record subtype.  Always set to 14.
1151
1152 @item int32 size;
1153 The size of each element in the @code{string_lengths} member. Always set to 1.
1154
1155 @item int32 count;
1156 The total number of bytes in @code{string_lengths}.
1157
1158 @item char string_lengths[];
1159 A list of @var{key}--@var{value} tuples, where @var{key} is the name
1160 of a variable, and @var{value} is its length.
1161 The @var{key} field is at most 8 bytes long and must match the
1162 name of a variable which appears in the variable record (@pxref{Variable
1163 Record}).
1164 The @var{value} field is exactly 5 bytes long.  It is a zero-padded,
1165 ASCII-encoded decimal number that gives the width of the variable.
1166 The @var{key} and @var{value} fields are separated by a @samp{=} byte.
1167 Tuples are delimited by a two-byte sequence @{00, 09@}.
1168 After the last tuple, there may be a single byte 00, or @{00, 09@}.
1169 The total length is @code{count} bytes.
1170 @end table
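
A tolerant reader can handle both forms of the trailing delimiter by
splitting on byte 09 and discarding the 00 bytes.  The following
Python is a sketch under that assumption; the function name
@code{parse_string_lengths} is hypothetical.

```python
def parse_string_lengths(string_lengths):
    """Return a list of (variable name, width) pairs."""
    result = []
    for item in string_lengths.split(b"\t"):
        item = item.rstrip(b"\x00")     # the 00 that precedes each 09
        if not item:
            continue                     # trailing 00 or 00 09
        key, value = item.split(b"=", 1)
        result.append((key, int(value)))
    return result
```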
1171
1172 @node Character Encoding Record
1173 @section Character Encoding Record
1174
1175 This record, if present, indicates the character encoding for string data,
1176 long variable names, variable labels, value labels and other strings in the
1177 file.
1178
1179 @example
1180 /* @r{Header.} */
1181 int32               rec_type;
1182 int32               subtype;
1183 int32               size;
1184 int32               count;
1185
1186 /* @r{Exactly @code{count} bytes of data.} */
1187 char                encoding[];
1188 @end example
1189
1190 @table @code
1191 @item int32 rec_type;
1192 Record type.  Always set to 7.
1193
1194 @item int32 subtype;
1195 Record subtype.  Always set to 20.
1196
1197 @item int32 size;
1198 The size of each element in the @code{encoding} member. Always set to 1.
1199
1200 @item int32 count;
1201 The total number of bytes in @code{encoding}.
1202
1203 @item char encoding[];
1204 The name of the character encoding.  Normally this will be an official
1205 IANA character set name or alias.
1206 See @url{http://www.iana.org/assignments/character-sets}.
1207 Character set names are not case-sensitive, but SPSS appears to write
1208 them in all-uppercase.
1209 @end table
1210
1211 This record is not present in files generated by older software.  See
1212 also the @code{character_code} field in the machine integer info
1213 record (@pxref{character-code}).
1214
1215 When the character encoding record and the machine integer info record
1216 are both present, all system files observed in practice indicate the
1217 same character encoding, e.g.@: 1252 as @code{character_code} and
1218 @code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
1219
1220 If, for testing purposes, a file is crafted with different
1221 @code{character_code} and @code{encoding}, it seems that
1222 @code{character_code} controls the encoding for all strings in the
1223 system file before the dictionary termination record, including
1224 strings in data (e.g.@: string missing values), and @code{encoding}
1225 controls the encoding for strings following the dictionary termination
1226 record.
1227
1228 @node Long String Value Labels Record
1229 @section Long String Value Labels Record
1230
1231 This record, if present, specifies value labels for long string
1232 variables.
1233
1234 @example
1235 /* @r{Header.} */
1236 int32               rec_type;
1237 int32               subtype;
1238 int32               size;
1239 int32               count;
1240
1241 /* @r{Repeated until exactly @code{count} bytes are consumed.} */
1242 int32               var_name_len;
1243 char                var_name[];
1244 int32               var_width;
1245 int32               n_labels;
1246 long_string_label   labels[];
1247 @end example
1248
1249 @table @code
1250 @item int32 rec_type;
1251 Record type.  Always set to 7.
1252
1253 @item int32 subtype;
1254 Record subtype.  Always set to 21.
1255
1256 @item int32 size;
1257 Always set to 1.
1258
1259 @item int32 count;
1260 The number of bytes following the header until the next header.
1261
1262 @item int32 var_name_len;
1263 @itemx char var_name[];
1264 The number of bytes in the name of the variable that has long string
1265 value labels, plus the variable name itself, which consists of exactly
1266 @code{var_name_len} bytes.  The variable name is not padded to any
1267 particular boundary, nor is it null-terminated.
1268
1269 @item int32 var_width;
1270 The width of the variable, in bytes, which will be between 9 and
1271 32767.
1272
1273 @item int32 n_labels;
1274 @itemx long_string_label labels[];
1275 The long string labels themselves.  The @code{labels} array contains
1276 exactly @code{n_labels} elements, each of which has the following
1277 substructure:
1278
1279 @example
1280 int32               value_len;
1281 char                value[];
1282 int32               label_len;
1283 char                label[];
1284 @end example
1285
1286 @table @code
1287 @item int32 value_len;
1288 @itemx char value[];
1289 The string value being labeled.  @code{value_len} is the number of
1290 bytes in @code{value}; it is equal to @code{var_width}.  The
1291 @code{value} array is not padded or null-terminated.
1292
1293 @item int32 label_len;
1294 @itemx char label[];
1295 The label for the string value.  @code{label_len}, which must be
1296 between 0 and 120, is the number of bytes in @code{label}.  The
1297 @code{label} array is not padded or null-terminated.
1298 @end table
1299 @end table
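
Because every field is length-prefixed, this record can be read with a
simple cursor.  The Python below is a sketch, not PSPP's reader: the
helper names are hypothetical, and the byte order (@code{endian}) is
assumed to have been detected already.

```python
import struct


def _int32(buf, pos, endian):
    return struct.unpack_from(endian + "i", buf, pos)[0], pos + 4


def _take(buf, pos, n):
    if n < 0 or pos + n > len(buf):
        raise ValueError("truncated long string value labels record")
    return buf[pos:pos + n], pos + n


def parse_long_string_value_labels(data, endian="<"):
    """Return a list of (var_name, var_width, [(value, label), ...])."""
    pos = 0
    variables = []
    while pos < len(data):
        n, pos = _int32(data, pos, endian)
        name, pos = _take(data, pos, n)
        width, pos = _int32(data, pos, endian)
        n_labels, pos = _int32(data, pos, endian)
        labels = []
        for _ in range(n_labels):
            n, pos = _int32(data, pos, endian)
            value, pos = _take(data, pos, n)
            n, pos = _int32(data, pos, endian)
            label, pos = _take(data, pos, n)
            labels.append((value, label))
        variables.append((name, width, labels))
    return variables
```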
1300
1301 @node Long String Missing Values Record
1302 @section Long String Missing Values Record
1303
1304 This record, if present, specifies missing values for long string
1305 variables.
1306
1307 @example
1308 /* @r{Header.} */
1309 int32               rec_type;
1310 int32               subtype;
1311 int32               size;
1312 int32               count;
1313
1314 /* @r{Repeated until exactly @code{count} bytes are consumed.} */
1315 int32               var_name_len;
1316 char                var_name[];
1317 char                n_missing_values;
1318 int32               value_len;
1319 char                values[value_len * n_missing_values];
1320 @end example
1321
1322 @table @code
1323 @item int32 rec_type;
1324 Record type.  Always set to 7.
1325
1326 @item int32 subtype;
1327 Record subtype.  Always set to 22.
1328
1329 @item int32 size;
1330 Always set to 1.
1331
1332 @item int32 count;
1333 The number of bytes following the header until the next header.
1334
1335 @item int32 var_name_len;
1336 @itemx char var_name[];
1337 The number of bytes in the name of the long string variable that has
1338 missing values, plus the variable name itself, which consists of
1339 exactly @code{var_name_len} bytes.  The variable name is not padded to
1340 any particular boundary, nor is it null-terminated.
1341
1342 @item char n_missing_values;
1343 The number of missing values, either 1, 2, or 3.  (This is, unusually,
1344 a single byte instead of a 32-bit number.)
1345
1346 @item int32 value_len;
1347 The length of each missing value string, in bytes.  This value should
1348 be 8, because long string variables are at least 8 bytes wide (by
1349 definition), only the first 8 bytes of a long string variable's
1350 missing values are allowed to be non-spaces, and any spaces within the
1351 first 8 bytes are included in the missing value here.
1352
1353 @item char values[value_len * n_missing_values];
1354 The missing values themselves, without any padding or null
1355 terminators.
1356 @end table
1357
1358 An earlier version of this document stated that @code{value_len} was
1359 repeated before each of the missing values, so that there was an extra
1360 @code{int32} value of 8 before each missing value after the first.
1361 Old versions of PSPP wrote data files in this format.  Readers can
1362 tolerate this mistake, if they wish, by noticing and skipping the
1363 extra @code{int32} values, which wouldn't ordinarily occur in strings.
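
One way to tolerate the old format is sketched below in Python.  The
function name is hypothetical, the byte order is assumed to be known,
and the heuristic (skip four bytes that repeat @code{value_len} before
the second and later values) is only one possible reading of the
advice above.

```python
import struct


def parse_long_string_missing_values(data, endian="<"):
    """Return a list of (var_name, [missing value, ...])."""
    pos = 0
    result = []
    while pos < len(data):
        (name_len,) = struct.unpack_from(endian + "i", data, pos)
        pos += 4
        name = data[pos:pos + name_len]
        pos += name_len
        n_missing = data[pos]            # a single byte, not an int32
        pos += 1
        (value_len,) = struct.unpack_from(endian + "i", data, pos)
        pos += 4
        repeated_len = struct.pack(endian + "i", value_len)
        values = []
        for i in range(n_missing):
            # Old PSPP repeated value_len before each later value.
            if i > 0 and data[pos:pos + 4] == repeated_len:
                pos += 4
            values.append(data[pos:pos + value_len])
            pos += value_len
        result.append((name, values))
    return result
```

The heuristic relies on the encoded @code{int32} containing null
bytes, which would not ordinarily begin a missing string value.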
1364
1365 @node Data File and Variable Attributes Records
1366 @section Data File and Variable Attributes Records
1367
1368 The data file and variable attributes records represent custom
1369 attributes for the system file or for individual variables in the
1370 system file, as defined on the DATAFILE ATTRIBUTE (@pxref{DATAFILE
1371 ATTRIBUTE,,,pspp, PSPP Users Guide}) and VARIABLE ATTRIBUTE commands
1372 (@pxref{VARIABLE ATTRIBUTE,,,pspp, PSPP Users Guide}), respectively.
1373
1374 @example
1375 /* @r{Header.} */
1376 int32               rec_type;
1377 int32               subtype;
1378 int32               size;
1379 int32               count;
1380
1381 /* @r{Exactly @code{count} bytes of data.} */
1382 char                attributes[];
1383 @end example
1384
1385 @table @code
1386 @item int32 rec_type;
1387 Record type.  Always set to 7.
1388
1389 @item int32 subtype;
1390 Record subtype.  Always set to 17 for a data file attribute record or
1391 to 18 for a variable attributes record.
1392
1393 @item int32 size;
1394 The size of each element in the @code{attributes} member. Always set to 1.
1395
1396 @item int32 count;
1397 The total number of bytes in @code{attributes}.
1398
1399 @item char attributes[];
1400 The attributes, in a text-based format.
1401
1402 In record subtype 17, this field contains a single attribute set.  An
1403 attribute set is a sequence of one or more attributes concatenated
1404 together.  Each attribute consists of a name, which has the same
1405 syntax as a variable name, followed by, inside parentheses, a sequence
1406 of one or more values.  Each value consists of a string enclosed in
1407 single quotes (@code{'}) followed by a line feed (byte 0x0a).  A value
1408 may contain single quote characters, which are not themselves escaped
1409 or quoted or required to be present in pairs.  There is no apparent
1410 way to embed a line feed in a value.  There is no distinction between
1411 an attribute with a single value and an attribute array with one
1412 element.
1413
1414 In record subtype 18, this field contains a sequence of one or more
1415 variable attribute sets.  If more than one variable attribute set is
1416 present, each one after the first is delimited from the previous by
1417 @code{/}.  Each variable attribute set consists of a long
1418 variable name,
1419 followed by @code{:}, followed by an attribute set with the same
1420 syntax as on record subtype 17.
1421
1422 System files written by @code{Stata 14.1/-savespss- 1.77 by
1423 S.Radyakin} may include multiple records with subtype 18, one per
1424 variable that has variable attributes.
1425
1426 The total length is @code{count} bytes.
1427 @end table
1428
1429 @subheading Example
1430
1431 A system file produced with the following VARIABLE ATTRIBUTE commands
1432 in effect:
1433
1434 @example
1435 VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=fred[1]('23') fred[2]('34').
1436 VARIABLE ATTRIBUTE VARIABLES=dummy ATTRIBUTE=bert('123').
1437 @end example
1438
1439 @noindent
1440 will contain a variable attribute record with the following contents:
1441
1442 @example
1443 0000  07 00 00 00 12 00 00 00  01 00 00 00 22 00 00 00  |............"...|
1444 0010  64 75 6d 6d 79 3a 66 72  65 64 28 27 32 33 27 0a  |dummy:fred('23'.|
1445 0020  27 33 34 27 0a 29 62 65  72 74 28 27 31 32 33 27  |'34'.)bert('123'|
1446 0030  0a 29                                             |.)              |
1447 @end example
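
The text format can be parsed line by line, since each value ends at a
line feed and a value's own quotes are not escaped.  The Python below
is a sketch with hypothetical function names; it assumes that
@code{attributes} has already been decoded to a string and that the
record is well formed.

```python
def parse_attribute_set(text, pos=0):
    """Parse one attribute set; return ([(name, [values])], new pos)."""
    attrs = []
    while pos < len(text) and text[pos] != "/":
        paren = text.index("(", pos)
        name = text[pos:paren]
        pos = paren + 1
        values = []
        while text[pos] != ")":          # every value starts with a quote
            eol = text.index("\n", pos)
            values.append(text[pos + 1:eol - 1])   # drop the outer quotes
            pos = eol + 1
        pos += 1                         # skip the closing parenthesis
        attrs.append((name, values))
    return attrs, pos


def parse_variable_attributes(text):
    """Parse a subtype 18 record into [(variable, attribute set)]."""
    result = []
    pos = 0
    while pos < len(text):
        colon = text.index(":", pos)
        variable = text[pos:colon]
        attrs, pos = parse_attribute_set(text, colon + 1)
        result.append((variable, attrs))
        if pos < len(text) and text[pos] == "/":
            pos += 1                     # delimiter between variable sets
    return result
```

For subtype 17, calling @code{parse_attribute_set} on the whole field
is enough, because there is no variable name prefix.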
1448
1449 @menu
1450 * Variable Roles::
1451 @end menu
1452
1453 @node Variable Roles
1454 @subsection Variable Roles
1455
1456 A variable's role is represented as an attribute named @code{$@@Role}.
1457 This attribute has a single element whose values and their meanings
1458 are:
1459
1460 @table @code
1461 @item 0
1462 Input.  This, the default, is the most common role.
1463 @item 1
1464 Output.
1465 @item 2
1466 Both.
1467 @item 3
1468 None.
1469 @item 4
1470 Partition.
1471 @item 5
1472 Split.
1473 @end table
1474
1475 @node Extended Number of Cases Record
1476 @section Extended Number of Cases Record
1477
1478 The file header record expresses the number of cases in the system
1479 file as an int32 (@pxref{File Header Record}).  This record allows the
1480 number of cases in the system file to be expressed as a 64-bit number.
1481
1482 @example
1483 int32               rec_type;
1484 int32               subtype;
1485 int32               size;
1486 int32               count;
1487 int64               unknown;
1488 int64               ncases64;
1489 @end example
1490
1491 @table @code
1492 @item int32 rec_type;
1493 Record type.  Always set to 7.
1494
1495 @item int32 subtype;
1496 Record subtype.  Always set to 16.
1497
1498 @item int32 size;
1499 Size of each element.  Always set to 8.
1500
1501 @item int32 count;
1502 Number of pieces of data in the data part.  Always set to 2.
1503
1504 @item int64 unknown;
1505 Meaning unknown.  Always set to 1.
1506
1507 @item int64 ncases64;
1508 Number of cases in the file as a 64-bit integer.  Presumably this
1509 could be -1 to indicate that the number of cases is unknown, for the
1510 same reason as @code{ncases} in the file header record, but this has
1511 not been observed in the wild.
1512 @end table
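
As a sketch, the whole 32-byte record can be unpacked in one step.
The function name below is hypothetical, the byte order is assumed to
be known, and treating a negative result as ``unknown'' follows the
presumption stated above rather than any observed file.

```python
import struct


def parse_extended_ncases(record, endian="<"):
    """Return ncases64, or None if it is negative (presumed unknown)."""
    rec_type, subtype, size, count, unknown, ncases64 = struct.unpack_from(
        endian + "4i2q", record)
    if (rec_type, subtype, size, count) != (7, 16, 8, 2):
        raise ValueError("not an extended number of cases record")
    return ncases64 if ncases64 >= 0 else None
```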
1513
1514 @node Other Informational Records
1515 @section Other Informational Records
1516
1517 This chapter documents many specific types of extension records, but
1518 others are known to exist.  PSPP ignores unknown extension records
1519 when reading system files.
1520
1521 The following extension record subtypes have also been observed, with
1522 the following believed meanings:
1523
1524 @table @asis
1525 @item 5
1526 A named variable set for use in the GUI (according to Aapi
1527 H@"am@"al@"ainen).
1528
1529 @item 6
1530 Date info, probably related to USE (according to Aapi H@"am@"al@"ainen).
1531
1532 @item 12
1533 A UUID in the format described in RFC 4122.  Only two examples
1534 observed, both written by SPSS 13, and in each case the UUID contained
1535 both upper and lower case.
1536
1537 @item 24
1538 XML that describes how data in the file should be displayed on-screen.
1539 @end table
1540
1541 @node Dictionary Termination Record
1542 @section Dictionary Termination Record
1543
1544 The dictionary termination record separates all other records from the
1545 data records.
1546
1547 @example
1548 int32               rec_type;
1549 int32               filler;
1550 @end example
1551
1552 @table @code
1553 @item int32 rec_type;
1554 Record type.  Always set to 999.
1555
1556 @item int32 filler;
1557 Ignored padding.  Should be set to 0.
1558 @end table
1559
1560 @node Data Record
1561 @section Data Record
1562
1563 The data record must follow all other records in the system file.
1564 Every system file must have a data record that specifies data for at
1565 least one case.  The format of the data record varies depending on the
1566 value of @code{compression} in the file header record:
1567
1568 @table @asis
1569 @item 0: no compression
1570 Data is arranged as a series of 8-byte elements.
1571 Each element corresponds to
1572 the variable declared in the respective variable record (@pxref{Variable
1573 Record}).  Numeric values are given in @code{flt64} format; string
1574 values are literal character strings, padded on the right when
1575 necessary to fill out 8-byte units.
1576
1577 @item 1: bytecode compression
1578 The first 8 bytes
1579 of the data record are divided into a series of 1-byte command
1580 codes.  These codes have meanings as described below:
1581
1582 @table @asis
1583 @item 0
1584 Ignored.  If the program writing the system file accumulates compressed
1585 data in blocks of fixed length, 0 bytes can be used to pad out extra
1586 bytes remaining at the end of a fixed-size block.
1587
1588 @item 1 through 251
1589 A number with
1590 value @var{code} - @var{bias}, where
1591 @var{code} is the value of the compression code and @var{bias} is the
1592 variable @code{bias} from the file header.  For example,
1593 code 105 with bias 100.0 (the normal value) indicates a numeric variable
1594 of value 5.
1595
1596 A code of 0 (after subtracting the bias) in a string field encodes
1597 null bytes.  This is unusual, since a string field normally encodes
1598 text data, but it exists in real system files.
1599
1600 @item 252
1601 End of file.  This code may or may not appear at the end of the data
1602 stream.  PSPP always outputs this code but its use is not required.
1603
1604 @item 253
1605 A numeric or string value that is not
1606 compressible.  The value is stored in the 8 bytes following the
1607 current block of command bytes.  If this value appears twice in a block
1608 of command bytes, then it indicates the second group of 8 bytes following the
1609 command bytes, and so on.
1610
1611 @item 254
1612 An 8-byte string value that is all spaces.
1613
1614 @item 255
1615 The system-missing value.
1616 @end table
1617
1618 The end of the 8-byte group of bytecodes is followed by any 8-byte
1619 blocks of non-compressible values indicated by code 253.  After that
1620 follows another 8-byte group of bytecodes, then those bytecodes'
1621 non-compressible values.  The pattern repeats to the end of the file
1622 or a code with value 252.
1623
1624 @item 2: ZLIB compression
1625 The data record consists of the following, in order:
1626
1627 @itemize @bullet
1628 @item
1629 ZLIB data header, 24 bytes long.
1630
1631 @item
1632 One or more variable-length blocks of ZLIB compressed data.
1633
1634 @item
1635 ZLIB data trailer, with a 24-byte fixed header plus an additional 24
1636 bytes for each preceding ZLIB compressed data block.
1637 @end itemize
1638
1639 The ZLIB data header has the following format:
1640
1641 @example
1642 int64               zheader_ofs;
1643 int64               ztrailer_ofs;
1644 int64               ztrailer_len;
1645 @end example
1646
1647 @table @code
1648 @item int64 zheader_ofs;
1649 The offset, in bytes, of the beginning of this structure within the
1650 system file.
1651
1652 @item int64 ztrailer_ofs;
1653 The offset, in bytes, of the first byte of the ZLIB data trailer.
1654
1655 @item int64 ztrailer_len;
1656 The number of bytes in the ZLIB data trailer.  This and the previous
1657 field sum to the size of the system file in bytes.
1658 @end table
1659
1660 The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
1661 compressed data blocks.  Each ZLIB compressed data block begins with a
1662 ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
1663 01} (the only header yet observed in practice).  Each block
1664 decompresses to a fixed number of bytes (in practice only
1665 @code{0x3ff000}-byte blocks have been observed), except that the last
1666 block of data may be shorter.  The last ZLIB compressed data block
1667 ends just before offset @code{ztrailer_ofs}.
1668
1669 The result of ZLIB decompression is bytecode compressed data as
1670 described above for compression format 1.
1671
1672 The ZLIB data trailer begins with the following 24-byte fixed header:
1673
1674 @example
1675 int64               int_bias;
1676 int64               zero;
1677 int32               block_size;
1678 int32               n_blocks;
1679 @end example
1680
1681 @table @code
1682 @item int64 int_bias;
1683 The compression bias as a negative integer, e.g.@: if @code{bias} in
1684 the file header record is 100.0, then @code{int_bias} is @minus{}100
1685 (this is the only value yet observed in practice).
1686
1687 @item int64 zero;
1688 Always observed to be zero.
1689
1690 @item int32 block_size;
1691 The number of bytes in each ZLIB compressed data block, except
1692 possibly the last, following decompression.  Only @code{0x3ff000} has
1693 been observed so far.
1694
1695 @item int32 n_blocks;
1696 The number of ZLIB compressed data blocks, always exactly
1697 @code{(ztrailer_len - 24) / 24}.
1698 @end table
1699
1700 The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
1701 block descriptors, each of which describes the compressed data block
1702 corresponding to its offset.  Each block descriptor has the following
1703 format:
1704
1705 @example
1706 int64               uncompressed_ofs;
1707 int64               compressed_ofs;
1708 int32               uncompressed_size;
1709 int32               compressed_size;
1710 @end example
1711
1712 @table @code
1713 @item int64 uncompressed_ofs;
1714 The offset, in bytes, that this block of data would have in a similar
1715 system file that uses compression format 1.  This is
1716 @code{zheader_ofs} in the first block descriptor, and in each
1717 succeeding block descriptor it is the sum of the previous descriptor's
1718 @code{uncompressed_ofs} and @code{uncompressed_size}.
1719
1720 @item int64 compressed_ofs;
1721 The offset, in bytes, of the actual beginning of this compressed data
1722 block.  This is @code{zheader_ofs + 24} in the first block descriptor,
1723 and in each succeeding block descriptor it is the sum of the previous
1724 descriptor's @code{compressed_ofs} and @code{compressed_size}.  The
1725 final block descriptor's @code{compressed_ofs} and
1726 @code{compressed_size} sum to @code{ztrailer_ofs}.
1727
1728 @item int32 uncompressed_size;
1729 The number of bytes in this data block, after decompression.  This is
1730 @code{block_size} in every data block except the last, which may be
1731 smaller.
1732
1733 @item int32 compressed_size;
1734 The number of bytes in this data block, as stored compressed in this
1735 system file.
1736 @end table
1737 @end table
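
The bytecode scheme described above for compression format 1 can be
sketched as a loop over 8-byte command groups.  The Python below is an
illustration only: the function name and the token tuples are
hypothetical, and mapping tokens onto numeric or string variables is
left to the caller, who knows the dictionary.

```python
def decode_bytecode(data, bias=100.0):
    """Expand bytecode-compressed data into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos + 8 <= len(data):
        commands = data[pos:pos + 8]
        pos += 8
        for code in commands:
            if code == 0:
                continue                          # padding, ignored
            elif code <= 251:
                tokens.append(("number", code - bias))
            elif code == 252:
                return tokens                     # end of file
            elif code == 253:
                if pos + 8 > len(data):
                    raise ValueError("missing non-compressible value")
                tokens.append(("raw", data[pos:pos + 8]))
                pos += 8                          # next raw block, in order
            elif code == 254:
                tokens.append(("spaces", b" " * 8))
            else:                                 # code 255
                tokens.append(("sysmis", None))
    return tokens
```

A @code{raw} token holds 8 bytes that the caller interprets as a
@code{flt64} or as string data, depending on the variable's type.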
1738
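For compression format 2, the ZLIB data header and trailer give a
reader everything needed to locate and inflate each block.  The
following Python sketch uses the standard @code{zlib} module; the
function name is hypothetical, @code{buf} is assumed to hold the whole
system file, and @code{zheader_ofs} is assumed to be the offset at
which the data record begins.

```python
import struct
import zlib


def read_zlib_data(buf, zheader_ofs, endian="<"):
    """Return the bytecode-compressed data recovered from all ZLIB blocks."""
    stored_ofs, ztrailer_ofs, ztrailer_len = struct.unpack_from(
        endian + "3q", buf, zheader_ofs)
    if stored_ofs != zheader_ofs:
        raise ValueError("zheader_ofs does not match its own position")
    int_bias, zero, block_size, n_blocks = struct.unpack_from(
        endian + "2q2i", buf, ztrailer_ofs)
    if n_blocks != (ztrailer_len - 24) // 24:
        raise ValueError("n_blocks disagrees with ztrailer_len")
    chunks = []
    for i in range(n_blocks):
        descriptor_ofs = ztrailer_ofs + 24 + 24 * i
        u_ofs, c_ofs, u_size, c_size = struct.unpack_from(
            endian + "2q2i", buf, descriptor_ofs)
        chunk = zlib.decompress(buf[c_ofs:c_ofs + c_size])
        if len(chunk) != u_size:
            raise ValueError("block %d inflated to an unexpected size" % i)
        chunks.append(chunk)
    return b"".join(chunks)
```

The result is bytecode-compressed data, so it still has to be expanded
as described for compression format 1.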
1739 @setfilename ignored