@c PSPP - a program for statistical analysis.
@c Copyright (C) 2019 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c

@node SPSS/PC+ System File Format
@appendix SPSS/PC+ System File Format

SPSS/PC+, first released in 1984, was a simplified version of SPSS for
IBM PC and compatible computers.  It used a data file format related
to the one described in the previous chapter, but simplified and
incompatible.  The SPSS/PC+ software became obsolete in the 1990s, so
files in this format are rarely encountered today.  Nevertheless, for
completeness, and because it is not very difficult, it seems
worthwhile to support at least reading these files.  This chapter
documents this format, based on examination of a corpus of about 60
files from a variety of sources.

System files use four data types: 8-bit characters, 16-bit unsigned
integers, 32-bit unsigned integers, and 64-bit floating-point numbers,
called here @code{char}, @code{uint16}, @code{uint32}, and
@code{flt64}, respectively.  Data is not necessarily aligned on a word
or double-word boundary.

SPSS/PC+ ran only on IBM PC and compatible computers.  Therefore,
values in these files are always in little-endian byte order.
Floating-point numbers are always in IEEE 754 format.

SPSS/PC+ system files represent the system-missing value as -1.66e308,
or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal.  (This is
an unusual choice: it is close to, but not equal to, the most negative
finite 64-bit IEEE 754 value, which is about -1.8e308.)

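As a rough illustration of the byte order and the system-missing
value, the following Python sketch decodes the byte sequence quoted
above with the standard @code{struct} module.  The names
@code{SYSMIS_BYTES} and @code{is_sysmis} are invented for this
example; comparing raw bytes rather than floating-point values is a
design choice of the sketch, not something the format requires.

```python
import struct
import sys

# The eight bytes quoted above, in file (little-endian) order.
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")

# "<d" reads a little-endian IEEE 754 double without alignment padding.
(sysmis,) = struct.unpack("<d", SYSMIS_BYTES)

def is_sysmis(raw):
    """Hypothetical helper: test an 8-byte field against the raw pattern."""
    return bytes(raw) == SYSMIS_BYTES

print(sysmis)                          # roughly -1.66e308
print(sysmis == -sys.float_info.max)   # False: not the most negative finite double
```
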
Text in SPSS/PC+ system files is encoded in ASCII-based 8-bit MS-DOS
codepages.  The files in the corpus used for investigating the format
were all ASCII-only.

An SPSS/PC+ system file begins with the following 256-byte directory:

@example
uint32              two;
uint32              zero;
struct @{
    uint32          ofs;
    uint32          len;
@} records[15];
char                filename[128];
@end example

@table @code
@item uint32 two;
@itemx uint32 zero;
Always set to 2 and 0, respectively.

These fields could be used as a signature for the file format, but the
@code{product} field in record 0 seems more likely to be unique
(@pxref{Record 0 Main Header Record}).

@item struct @{ @dots{} @} records[15];
Each of the elements in this array identifies a record in the system
file.  The @code{ofs} is a byte offset, from the beginning of the
file, that identifies the start of the record.  @code{len} specifies
the length of the record, in bytes.  Many records are optional or not
used.  If a record is not present, @code{ofs} and @code{len} for that
record are both zero.

@item char filename[128];
In most files in the corpus, this field is entirely filled with
spaces.  In one file, it contains a file name, followed by a null
byte, followed by spaces to fill the remainder of the field.  The
meaning is unknown.
@end table

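The directory layout above can be read with a few fixed-offset
unpacks.  The following Python sketch is one possible reading, not a
reference implementation: the function name @code{parse_directory},
the use of @code{None} for absent records, and the synthetic test
bytes are all assumptions made for illustration.

```python
import struct

DIRECTORY_SIZE = 256
N_RECORDS = 15

def parse_directory(data):
    """Split the 256-byte directory into (two, zero, records, filename)."""
    if len(data) < DIRECTORY_SIZE:
        raise ValueError("file too short to hold the directory")
    two, zero = struct.unpack_from("<II", data, 0)
    records = []
    for i in range(N_RECORDS):
        ofs, length = struct.unpack_from("<II", data, 8 + 8 * i)
        # An absent record has both ofs and len set to zero.
        records.append(None if (ofs, length) == (0, 0) else (ofs, length))
    filename = data[128:256]
    return two, zero, records, filename

# Synthetic directory for illustration: only record 0 is present.
demo = struct.pack("<II", 2, 0) + struct.pack("<II", 0x100, 0xB0)
demo += bytes(8 * 14) + b" " * 128
two, zero, records, filename = parse_directory(demo)
print(two, zero, records[0], records[1])   # 2 0 (256, 176) None
```
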
The following sections describe the contents of each record,
identified by the index into the @code{records} array.

@menu
* Record 0 Main Header Record::
* Record 1 Variables Record::
* Record 2 Labels Record::
* Record 3 Data Record::
* Records 4 and 5 Data Entry::
@end menu

@node Record 0 Main Header Record
@section Record 0: Main Header Record

All files in the corpus have this record at offset 0x100 with length
0xb0 (but readers should find this record, like the others, via the
@code{records} table in the directory).  Its format is:

@example
uint16              one0;
char                product[62];
flt64               sysmis;
uint32              zero0;
uint32              zero1;
uint16              one1;
uint16              compressed;
uint16              nominal_case_size;
uint16              n_cases0;
uint16              weight_index;
uint16              zero2;
uint16              n_cases1;
uint16              zero3;
char                creation_date[8];
char                creation_time[8];
char                file_label[64];
@end example

@table @code
@item uint16 one0;
@itemx uint16 one1;
Always set to 1.

@item uint32 zero0;
@itemx uint32 zero1;
@itemx uint16 zero2;
@itemx uint16 zero3;
Always set to 0.

It seems likely that one of these fields is set to 1 if weighting
is enabled, but none of the files in the corpus is weighted.

@item char product[62];
Name of the program that created the file.  Only the following unique
values have been observed, in each case padded on the right with
spaces:

@example
DESPSS/PC+ System File Written by Data Entry II
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+ V3.0
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS for Windows
@end example

Thus, it is reasonable to use the presence of the string @samp{SPSS}
at offset 0x104 as a simple test for an SPSS/PC+ data file.

@item flt64 sysmis;
The system-missing value, as described previously (@pxref{SPSS/PC+
System File Format}).

@item uint16 compressed;
Set to 0 if the data in the file is not compressed, 1 if the data is
compressed with simple bytecode compression.

@item uint16 nominal_case_size;
Number of data elements per case.  This is the number of variables,
except that long string variables add extra data elements (one for
every 8 bytes after the first 8).  String variables in SPSS/PC+ system
files are limited to 255 bytes.

@item uint16 n_cases0;
@itemx uint16 n_cases1;
The number of cases in the data record.  Both values are the same.
Some files in the corpus contain data for the number of cases noted
here, followed by garbage that somewhat resembles data.

@item uint16 weight_index;
0, if the file is unweighted, otherwise a 1-based index into the data
record of the weighting variable, e.g.@: 4 for the first variable
after the 3 system-defined variables.

@item char creation_date[8];
The date that the file was created, in @samp{mm/dd/yy} format.
Single-digit days and months are not prefixed by zeros.  The string is
padded with spaces on the right or left or both, e.g.@:
@samp{_2/4/93_}, @samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_}
standing in for a space) are all actual examples from the corpus.

@item char creation_time[8];
The time that the file was created, in @samp{HH:MM:SS} format.
Single-digit hours are padded on the left with a space.  Minutes and
seconds are always written as two digits.

@item char file_label[64];
File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
PSPP Users Guide}).  Padded on the right with spaces.
@end table

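The 176-byte (0xb0) layout of this record maps directly onto a single
@code{struct} format string.  The Python sketch below shows one way a
reader might unpack it, apply the @samp{SPSS} signature test, and
split the date field.  The names @code{parse_main_header},
@code{looks_like_pcp}, and @code{parse_date} are hypothetical, the
demonstration bytes are synthetic, and decoding text as ASCII is an
assumption that happens to hold for the corpus described here.

```python
import struct

# Field order follows the record 0 layout above; "<" disables padding.
HEADER = struct.Struct("<H62sdIIHHHHHHHH8s8s64s")
FIELDS = ("one0", "product", "sysmis", "zero0", "zero1", "one1",
          "compressed", "nominal_case_size", "n_cases0", "weight_index",
          "zero2", "n_cases1", "zero3", "creation_date", "creation_time",
          "file_label")
TEXT_FIELDS = ("product", "creation_date", "creation_time", "file_label")

def parse_main_header(record):
    """Return the header fields as a dict, with text decoded and trimmed."""
    values = dict(zip(FIELDS, HEADER.unpack_from(record)))
    for key in TEXT_FIELDS:
        values[key] = values[key].decode("ascii", "replace").strip(" ")
    return values

def looks_like_pcp(file_bytes):
    """Signature test from the text: the string SPSS at offset 0x104."""
    return file_bytes[0x104:0x108] == b"SPSS"

def parse_date(text):
    """Split a date such as '2/4/93' into (month, day, two-digit year)."""
    month, day, year = (int(part) for part in text.strip().split("/"))
    return month, day, year

# Synthetic header for illustration only.
demo = HEADER.pack(1, b"PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+".ljust(62),
                   -1.66e308, 0, 0, 1, 0, 5, 10, 0, 0, 10, 0,
                   b" 2/4/93 ", b" 9:05:07", b"".ljust(64))
header = parse_main_header(demo)
print(HEADER.size, header["product"], parse_date(header["creation_date"]))
print(looks_like_pcp(bytes(0x100) + demo))   # True
```
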
@node Record 1 Variables Record
@section Record 1: Variables Record

The variables record most commonly starts at offset 0x1b0, but it can
be placed elsewhere.  The record contains instances of the following
32-byte structure:

@example
uint32              value_label_start;
uint32              value_label_end;
uint32              var_label_ofs;
uint32              format;
char                name[8];
union @{
    flt64           f;
    char            s[8];
@} missing;
@end example

The number of instances is the @code{nominal_case_size} specified in
the main header record.  There is one instance for each numeric
variable and each string variable with width 8 bytes or less.  String
variables wider than 8 bytes have one instance for each 8 bytes,
rounding up.  The first instance for a long string specifies the
variable's correct dictionary information.  Subsequent instances for a
long string are generally filled with all-zero bytes, although the
@code{missing} field contains the numeric system-missing value, and
some writers also fill in @code{var_label_ofs}, @code{format}, and
@code{name}, sometimes filling the latter with the numeric
system-missing value rather than a text string.  Regardless of the
values used, readers should ignore the contents of these additional
instances for long strings.

@table @code
@item uint32 value_label_start;
@itemx uint32 value_label_end;
For a variable with value labels, these specify offsets into the label
record of the start and end of this variable's value labels,
respectively.  @xref{Record 2 Labels Record}, for more information.

For a variable without any value labels, these are both zero.

A long string variable may not have value labels.

@item uint32 var_label_ofs;
For a variable with a variable label, this specifies an offset into
the label record.  @xref{Record 2 Labels Record}, for more
information.

For a variable without a variable label, this is zero.

@item uint32 format;
The variable's output format, in the same format used in system files.
@xref{System File Output Formats}, for details.  SPSS/PC+ system files
only use format types 5 (F, for numeric variables) and 1 (A, for
string variables).

@item char name[8];
The variable's name, padded on the right with spaces.

@item union @{ @dots{} @} missing;
A user-missing value.  For numeric variables, @code{missing.f} is the
variable's user-missing value.  For string variables, @code{missing.s}
is a string missing value.  A variable without a user-missing value is
indicated with @code{missing.f} set to the system-missing value, even
for string variables (!).  A long string variable may not have a
missing value.
@end table

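The rules above (32-byte instances, extra instances for long strings
that readers should skip, and the system-missing pattern as the ``no
missing value'' marker) can be sketched in Python as follows.  This is
a sketch under stated assumptions rather than a definitive reader: it
assumes that @code{format} packs the type, width, and decimals into
bits 16--23, 8--15, and 0--7 as in the system file chapter, and that
the width of a string variable can be taken from that width field.
All function and type names are invented for the example, and the
sample record is synthetic.

```python
import struct
from collections import namedtuple

VARIABLE = struct.Struct("<IIII8s8s")            # one 32-byte instance
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")
Variable = namedtuple("Variable", "name fmt_type width decimals missing "
                      "value_label_start value_label_end var_label_ofs")

def decode_format(fmt):
    """Assumed layout: type in bits 16-23, width in 8-15, decimals in 0-7."""
    return (fmt >> 16) & 0xFF, (fmt >> 8) & 0xFF, fmt & 0xFF

def parse_variables(record, nominal_case_size):
    variables = []
    index = 0
    while index < nominal_case_size:
        start, end, label_ofs, fmt, name, raw = VARIABLE.unpack_from(
            record, 32 * index)
        fmt_type, width, decimals = decode_format(fmt)
        is_string = fmt_type == 1                # 1 is A, 5 is F
        if raw == SYSMIS_BYTES:
            missing = None                       # no user-missing value
        elif is_string:
            missing = raw
        else:
            (missing,) = struct.unpack("<d", raw)
        variables.append(Variable(name.decode("ascii", "replace").rstrip(" "),
                                  fmt_type, width, decimals, missing,
                                  start, end, label_ofs))
        # Skip the extra instances that follow a long string variable.
        index += max(1, (width + 7) // 8) if is_string else 1
    return variables

def pack_var(name, fmt, missing=SYSMIS_BYTES):
    return VARIABLE.pack(0, 0, 0, fmt, name.ljust(8), missing)

def fmt_word(fmt_type, width, decimals=0):
    return (fmt_type << 16) | (width << 8) | decimals

# Synthetic record: three system variables, a 20-byte string, a numeric.
record = b"".join([
    pack_var(b"$CASENUM", fmt_word(5, 8)),
    pack_var(b"$DATE", fmt_word(1, 8)),
    pack_var(b"$WEIGHT", fmt_word(5, 8, 2)),
    pack_var(b"CITY", fmt_word(1, 20)),
    bytes(24) + SYSMIS_BYTES,
    bytes(24) + SYSMIS_BYTES,
    pack_var(b"AGE", fmt_word(5, 8), struct.pack("<d", 99.0)),
])
for var in parse_variables(record, 7):
    print(var.name, var.fmt_type, var.width, var.missing)
```
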
In addition to the user-defined variables, every SPSS/PC+ system file
contains, as its first three variables, the following system-defined
variables, in the following order.  The system-defined variables have
no variable label, value labels, or missing values.

@table @code
@item $CASENUM
A numeric variable with format F8.0.  Most of the time this is a
sequence number, starting with 1 for the first case and counting up
for each subsequent case.  Some files skip over values, which probably
reflects cases that were deleted.

@item $DATE
A string variable with format A8.  Same format (including varying
padding) as the @code{creation_date} field in the main header record
(@pxref{Record 0 Main Header Record}).  The actual date can differ
from @code{creation_date} and from record to record.  This may reflect
when individual cases were added or updated.

@item $WEIGHT
A numeric variable with format F8.2.  This represents the case's
weight; SPSS/PC+ files do not have a user-defined weighting variable.
If weighting has not been enabled, every case has value 1.0.
@end table

@node Record 2 Labels Record
@section Record 2: Labels Record

The labels record holds value labels and variable labels.  Unlike the
other records, it is not meant to be read directly and sequentially.
Instead, this record must be interpreted one piece at a time, by
following pointers from the variables record.

The @code{value_label_start}, @code{value_label_end}, and
@code{var_label_ofs} fields in a variable record are all offsets
relative to the beginning of the labels record, with an additional
7-byte offset.  That is, if the labels record starts at byte offset
@code{labels_ofs} and a variable has a given @code{var_label_ofs},
then the variable label begins at byte offset @math{@code{labels_ofs}
+ @code{var_label_ofs} + 7} in the file.

A variable label, starting at the offset indicated by
@code{var_label_ofs}, consists of a one-byte length followed by the
specified number of bytes of the variable label string, like this:

@example
uint8               length;
char                s[length];
@end example

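The offset arithmetic for variable labels can be sketched in Python as
follows.  The helper name @code{read_var_label} and the constant
@code{LABEL_BIAS} are invented here, and the sample bytes (including
the offset value 1) are synthetic rather than taken from a real file.

```python
import struct

LABEL_BIAS = 7   # the additional 7-byte offset described above

def read_var_label(file_bytes, labels_ofs, var_label_ofs):
    """Return the raw label bytes, or None if the variable has no label."""
    if var_label_ofs == 0:
        return None
    pos = labels_ofs + var_label_ofs + LABEL_BIAS
    length = file_bytes[pos]                 # one-byte length prefix
    return file_bytes[pos + 1:pos + 1 + length]

# Synthetic file: a labels record placed at offset 0x40 that starts with
# its two uint32 values, followed by a single length-prefixed label.
labels_record = struct.pack("<II", 3, 10) + bytes([9]) + b"Age group"
file_bytes = bytes(0x40) + labels_record
print(read_var_label(file_bytes, 0x40, 1))   # b'Age group'
```
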
A set of value labels, extending from @code{value_label_start} to
@code{value_label_end} (exclusive), consists of a numeric or string
value followed by a string in the format just described.  String
values are padded on the right with spaces to fill the 8-byte field,
like this:

@example
union @{
    flt64           f;
    char            s[8];
@} value;
uint8               length;
char                s[length];
@end example

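A reader might walk a variable's value labels as in the following
Python sketch, which applies the same 7-byte adjustment to both
offsets.  The function name @code{read_value_labels} and its
@code{is_string} parameter are assumptions of this example, and the
sample data is synthetic.

```python
import struct

LABEL_BIAS = 7   # same adjustment as for variable labels

def read_value_labels(file_bytes, labels_ofs, start, end, is_string):
    """Return a list of (value, raw label bytes) pairs for one variable."""
    labels = []
    pos = labels_ofs + start + LABEL_BIAS
    stop = labels_ofs + end + LABEL_BIAS     # exclusive
    while pos + 9 <= stop:
        raw = file_bytes[pos:pos + 8]
        value = raw if is_string else struct.unpack("<d", raw)[0]
        length = file_bytes[pos + 8]
        labels.append((value, file_bytes[pos + 9:pos + 9 + length]))
        pos += 9 + length
    return labels

# Synthetic labels record at offset 0x40 holding two numeric value labels.
body = (struct.pack("<d", 1.0) + bytes([4]) + b"Male"
        + struct.pack("<d", 2.0) + bytes([6]) + b"Female")
file_bytes = bytes(0x40) + struct.pack("<II", 3, 20) + body
start = 1
end = start + len(body)
print(read_value_labels(file_bytes, 0x40, start, end, False))
# [(1.0, b'Male'), (2.0, b'Female')]
```
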
The labels record begins with a pair of uint32 values.  The first of
these is always 3.  The second is between 8 and 16 less than the
number of bytes in the record.  Neither value is important for
interpreting the file.

@node Record 3 Data Record
@section Record 3: Data Record

The format of the data record varies depending on the value of
@code{compressed} in the main header record:

@table @asis
@item 0: no compression
Data is arranged as a series of 8-byte elements, one per variable
instance in the variables record (@pxref{Record 1 Variables
Record}).  Numeric values are given in @code{flt64} format; string
values are literal character strings, padded on the right with spaces
when necessary to fill out 8-byte units.

@item 1: bytecode compression
The first 8 bytes of the data record are divided into a series of
1-byte command codes.  These codes have meanings as described below:

@table @asis
@item 0
The system-missing value.

@item 1
A numeric or string value that is not compressible.  The value is
stored in the 8 bytes following the current block of command bytes.
If this code appears twice in a block of command bytes, then the
second occurrence indicates the second group of 8 bytes following the
command bytes, and so on.

@item 2 through 255
A number with value @var{code} - 100, where @var{code} is the value of
the compression code.  For example, code 105 indicates a numeric
value of 5.
@end table

The end of the 8-byte group of bytecodes is followed by any 8-byte
blocks of non-compressible values indicated by code 1.  After that
follows another 8-byte group of bytecodes, then those bytecodes'
non-compressible values.  The pattern repeats until the number of
cases specified by the main header record has been read.

The corpus does not contain any files with command codes 2 through 95,
so it is possible that some of these codes are used for special
purposes.
@end table

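The bytecode scheme just described can be sketched as a small decoder.
The Python below is illustrative only: the function name
@code{decompress} is invented, the input is synthetic, and it assumes
that data elements flow continuously across command blocks, so that a
case may straddle two blocks.  Codes 2 through 95 are treated like any
other @var{code} - 100 value, although, as noted above, their real
meaning is unconfirmed.

```python
import struct

SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")

def decompress(data, nominal_case_size, n_cases):
    """Expand compressed data into n_cases lists of raw 8-byte elements."""
    needed = nominal_case_size * n_cases
    elements = []
    pos = 0
    while len(elements) < needed and pos + 8 <= len(data):
        codes = data[pos:pos + 8]            # one block of command bytes
        pos += 8
        for code in codes:
            if len(elements) == needed:
                break                        # ignore anything past n_cases
            if code == 0:
                elements.append(SYSMIS_BYTES)
            elif code == 1:
                literal = data[pos:pos + 8]  # next non-compressible value
                if len(literal) < 8:
                    raise ValueError("truncated data record")
                elements.append(literal)
                pos += 8
            else:
                elements.append(struct.pack("<d", float(code - 100)))
    return [elements[i:i + nominal_case_size]
            for i in range(0, len(elements), nominal_case_size)]

# Synthetic input: two cases of three elements, then trailing garbage.
data = (bytes([101, 1, 0, 102, 1, 105, 0, 0])
        + b"ABCDEFGH" + struct.pack("<d", 3.5) + b"garbage!")
cases = decompress(data, 3, 2)
print(cases[0][1], struct.unpack("<d", cases[1][1])[0])   # b'ABCDEFGH' 3.5
```
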
Cases of data often, but not always, fill the entire data record.
Readers should stop reading after the number of cases specified in the
main header record.  Otherwise, readers may try to interpret garbage
following the data as additional cases.

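For uncompressed data, stopping at the declared number of cases might
look like the following Python sketch.  The function name
@code{read_uncompressed} is hypothetical and the sample bytes are
synthetic; the point is only that the loop is driven by the case
count from the main header rather than by the length of the record.

```python
import struct

def read_uncompressed(data, nominal_case_size, n_cases):
    """Slice exactly n_cases cases of raw 8-byte elements out of data."""
    case_bytes = 8 * nominal_case_size
    cases = []
    for i in range(n_cases):
        chunk = data[i * case_bytes:(i + 1) * case_bytes]
        if len(chunk) < case_bytes:
            raise ValueError("data record ends before the declared cases")
        cases.append([chunk[j:j + 8] for j in range(0, case_bytes, 8)])
    return cases

# Synthetic record: two cases of two elements, followed by garbage.
data = (struct.pack("<d", 1.0) + b"1/2/90  "
        + struct.pack("<d", 2.0) + b"1/3/90  "
        + b"leftover bytes that are not a case")
cases = read_uncompressed(data, 2, 2)
print(len(cases), cases[1][1])   # 2 b'1/3/90  '
```
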
@node Records 4 and 5 Data Entry
@section Records 4 and 5: Data Entry

Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.