@node SPSS/PC+ System File Format
@appendix SPSS/PC+ System File Format

SPSS/PC+, first released in 1984, was a simplified version of SPSS for
IBM PC and compatible computers.  It used a data file format related
to the one described in the previous chapter, but simplified and
incompatible.  The SPSS/PC+ software became obsolete in the 1990s, so
files in this format are rarely encountered today.  Nevertheless, for
completeness, and because it is not very difficult, it seems
worthwhile to support at least reading these files.  This chapter
documents this format, based on examination of a corpus of about 60
files from a variety of sources.

System files use four data types: 8-bit characters, 16-bit unsigned
integers, 32-bit unsigned integers, and 64-bit floating-point numbers,
called here @code{char}, @code{uint16}, @code{uint32}, and
@code{flt64}, respectively.  Data is not necessarily aligned on a word
or double-word boundary.

SPSS/PC+ ran only on IBM PC and compatible computers.  Therefore,
values in these files are always in little-endian byte order.
Floating-point numbers are always in IEEE 754 format.

SPSS/PC+ system files represent the system-missing value as -1.66e308,
or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal.  (This is
an unusual choice: it is close to, but not equal to, the most negative
finite 64-bit IEEE 754 value, which is about -1.8e308.)
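
As a quick check of the byte pattern above, the following Python
sketch decodes it as a little-endian @code{flt64}.  The names
@code{SYSMIS_BYTES}, @code{SYSMIS}, and @code{is_sysmis} are
illustrative only and are not part of any existing reader.

```python
import struct

# Bytes of the system-missing value, in the order they appear in the file.
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")

# "<d" decodes a little-endian 64-bit IEEE 754 number.
SYSMIS = struct.unpack("<d", SYSMIS_BYTES)[0]


def is_sysmis(raw):
    """Return True if the 8-byte field `raw` holds the system-missing value."""
    return raw == SYSMIS_BYTES
```

Comparing the raw bytes, rather than the decoded number, sidesteps any
doubt about how a decimal literal such as -1.66e308 would be rounded.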

Text in SPSS/PC+ system files is encoded in ASCII-based 8-bit MS-DOS
codepages.  The files in the corpus used for investigating the format
were all ASCII-only.

An SPSS/PC+ system file begins with the following 256-byte directory:

@example
uint32              two;
uint32              zero;
struct @{
    uint32          ofs;
    uint32          len;
@} records[15];
char                filename[128];
@end example

@table @code
@item uint32 two;
@itemx uint32 zero;
Always set to 2 and 0, respectively.

These fields could be used as a signature for the file format, but the
@code{product} field in record 0 seems more likely to be unique
(@pxref{Record 0 Main Header Record}).

@item struct @{ @dots{} @} records[15];
Each of the elements in this array identifies a record in the system
file.  The @code{ofs} is a byte offset, from the beginning of the
file, that identifies the start of the record.  @code{len} specifies
the length of the record, in bytes.  Many records are optional or not
used.  If a record is not present, @code{ofs} and @code{len} for that
record are both zero.

@item char filename[128];
In most files in the corpus, this field is entirely filled with
spaces.  In one file, it contains a file name, followed by a null
byte, followed by spaces to fill the remainder of the field.  The
meaning is unknown.
@end table
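
The directory layout can be read in one step.  The following Python
sketch is only an illustration under the layout described above; the
function name @code{parse_directory} and its return shape are
assumptions made for this example.

```python
import struct

DIRECTORY_SIZE = 256   # 4 + 4 + 15 * 8 + 128 bytes
N_RECORDS = 15


def parse_directory(data):
    """Return (two, zero, records, filename) from the first 256 bytes.

    `records` is a list of 15 (ofs, len) pairs; absent records are (0, 0).
    """
    if len(data) < DIRECTORY_SIZE:
        raise ValueError("file too short for an SPSS/PC+ directory")
    # "<" selects little-endian with no alignment padding; "I" is uint32.
    # 2 leading words plus 15 (ofs, len) pairs make 32 words, 128 bytes.
    fields = struct.unpack_from("<32I", data, 0)
    two, zero = fields[0], fields[1]
    records = [(fields[2 + 2 * i], fields[3 + 2 * i])
               for i in range(N_RECORDS)]
    filename = data[128:256]
    return two, zero, records, filename
```

A reader would then look up each record as @code{records[index]} and
treat a @code{(0, 0)} pair as ``record not present''.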

The following sections describe the contents of each record,
identified by the index into the @code{records} array.

@menu
* Record 0 Main Header Record::
* Record 1 Variables Record::
* Record 2 Labels Record::
* Record 3 Data Record::
* Records 4 and 5 Data Entry::
@end menu

@node Record 0 Main Header Record
@section Record 0: Main Header Record

All files in the corpus have this record at offset 0x100 with length
0xb0 (but readers should find this record, like the others, via the
@code{records} table in the directory).  Its format is:

@example
uint16              one0;
char                product[62];
flt64               sysmis;
uint32              zero0;
uint32              zero1;
uint16              one1;
uint16              compressed;
uint16              nominal_case_size;
uint32              n_cases0;
uint16              zero2;
uint32              n_cases1;
char                creation_date[8];
char                creation_time[8];
char                label[64];
@end example

@table @code
@item uint16 one0;
@itemx uint16 one1;
Always set to 1.

@item uint32 zero0;
@itemx uint32 zero1;
@itemx uint16 zero2;
Always set to 0.

It seems likely that one of these fields is set to 1 if weighting
is enabled, but none of the files in the corpus is weighted.

@item char product[62];
Name of the program that created the file.  Only the following unique
values have been observed, in each case padded on the right with
spaces:

@example
DESPSS/PC+ System File Written by Data Entry II
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS/PC+ V3.0
PCSPSS SYSTEM FILE.  IBM PC DOS, SPSS for Windows
@end example

Thus, it is reasonable to use the presence of the string @samp{SPSS}
at offset 0x104 as a simple test for an SPSS/PC+ data file.

@item flt64 sysmis;
The system-missing value, as described previously (@pxref{SPSS/PC+
System File Format}).

@item uint16 compressed;
Set to 0 if the data in the file is not compressed, 1 if the data is
compressed with simple bytecode compression.

@item uint16 nominal_case_size;
Number of data elements per case.  This is the number of variables,
except that long string variables add extra data elements (one for
every 8 bytes after the first 8).  String variables in SPSS/PC+ system
files are limited to 255 bytes.

@item uint32 n_cases0;
@itemx uint32 n_cases1;
The number of cases in the data record.  Both values are the same.
Some files in the corpus contain data for the number of cases noted
here, followed by garbage that somewhat resembles data.

@item char creation_date[8];
The date that the file was created, in @samp{mm/dd/yy} format.
Single-digit days and months are not prefixed by zeros.  The string is
padded with spaces on the right or left or both, e.g.@: @samp{_2/4/93_},
@samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_} standing in for a
space) are all actual examples from the corpus.

@item char creation_time[8];
The time that the file was created, in @samp{HH:MM:SS} format.
Single-digit hours are padded on the left with a space.  Minutes and
seconds are always written as two digits.

@item char label[64];
File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
PSPP Users Guide}).  Padded on the right with spaces.
@end table
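
The field list above maps directly onto a packed little-endian
structure of 176 (0xb0) bytes.  The following Python sketch shows one
way to decode it; @code{HEADER} and @code{parse_main_header} are
hypothetical names, and decoding text as ASCII is an assumption that
matches the corpus but not necessarily every file.

```python
import struct

# one0, product, sysmis, zero0, zero1, one1, compressed,
# nominal_case_size, n_cases0, zero2, n_cases1, date, time, label
HEADER = struct.Struct("<H62sdIIHHHIHI8s8s64s")


def parse_main_header(data, ofs):
    """Decode the main header record found at byte offset `ofs`."""
    (one0, product, sysmis, zero0, zero1, one1, compressed,
     nominal_case_size, n_cases0, zero2, n_cases1,
     date, time, label) = HEADER.unpack_from(data, ofs)
    # Every product string seen in the corpus has "SPSS" two bytes into
    # the field, which is file offset 0x104 when the record is at 0x100.
    if product[2:6] != b"SPSS":
        raise ValueError("not an SPSS/PC+ system file")
    return dict(
        product=product.decode("ascii", "replace").rstrip(),
        sysmis=sysmis,
        compressed=compressed,
        nominal_case_size=nominal_case_size,
        n_cases=n_cases0,
        # Dates and times may be padded on either side, so strip both.
        creation_date=date.decode("ascii", "replace").strip(),
        creation_time=time.decode("ascii", "replace").strip(),
        label=label.decode("ascii", "replace").rstrip(),
    )
```

The offset passed as @code{ofs} should come from @code{records[0]} in
the directory rather than being hard-coded as 0x100.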

@node Record 1 Variables Record
@section Record 1: Variables Record

The variables record most commonly starts at offset 0x1b0, but it can
be placed elsewhere.  The record contains instances of the following
32-byte structure:

@example
uint32              value_label_start;
uint32              value_label_end;
uint32              var_label_ofs;
uint32              format;
char                name[8];
union @{
    flt64           f;
    char            s[8];
@} missing;
@end example

The number of instances is the @code{nominal_case_size} specified in
the main header record.  There is one instance for each numeric
variable and each string variable with width 8 bytes or less.  String
variables wider than 8 bytes have one instance for each 8 bytes,
rounding up.  The first instance for a long string specifies the
variable's correct dictionary information.  Subsequent instances for a
long string are generally filled with all-zero bytes, although the
@code{missing} field contains the numeric system-missing value, and
some writers also fill in @code{var_label_ofs}, @code{format}, and
@code{name}, sometimes filling the latter with the numeric
system-missing value rather than a text string.  Regardless of the
values used, readers should ignore the contents of these additional
instances for long strings.

@table @code
@item uint32 value_label_start;
@itemx uint32 value_label_end;
For a variable with value labels, these specify offsets into the label
record of the start and end of this variable's value labels,
respectively.  @xref{Record 2 Labels Record}, for more information.

For a variable without any value labels, these are both zero.

A long string variable may not have value labels.

@item uint32 var_label_ofs;
For a variable with a variable label, this specifies an offset into
the label record.  @xref{Record 2 Labels Record}, for more
information.

For a variable without a variable label, this is zero.

@item uint32 format;
The variable's output format, in the same format used in system files.
@xref{System File Output Formats}, for details.  SPSS/PC+ system files
only use format types 5 (F, for numeric variables) and 1 (A, for
string variables).

@item char name[8];
The variable's name, padded on the right with spaces.

@item union @{ @dots{} @} missing;
A user-missing value.  For numeric variables, @code{missing.f} is the
variable's user-missing value.  For string variables, @code{missing.s}
is a string missing value.  A variable without a user-missing value is
indicated with @code{missing.f} set to the system-missing value, even
for string variables (!).  A long string variable may not have a
missing value.
@end table
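
The following Python sketch walks the 32-byte instances and skips the
continuation instances of long strings, as recommended above.  The
names @code{VARIABLE} and @code{parse_variables} are hypothetical.  It
also assumes that the @code{format} word keeps the format type in
bits 16--23 and the width in bits 8--15, as in regular system files;
that detail comes from the other chapter, not from this one.

```python
import struct

VARIABLE = struct.Struct("<IIII8s8s")   # 32 bytes per instance


def parse_variables(data, ofs, nominal_case_size):
    """Return one tuple per variable, skipping long-string continuations.

    Each tuple is (name, format, value_label_start, value_label_end,
    var_label_ofs, missing), where `missing` is the raw 8-byte field.
    """
    variables = []
    i = 0
    while i < nominal_case_size:
        (vl_start, vl_end, var_label_ofs, fmt,
         name, missing) = VARIABLE.unpack_from(data, ofs + 32 * i)
        variables.append((name.decode("ascii", "replace").rstrip(), fmt,
                          vl_start, vl_end, var_label_ofs, missing))
        # Assumed layout of the format word: type in bits 16-23,
        # width in bits 8-15, decimals in bits 0-7.
        fmt_type = (fmt >> 16) & 0xff
        width = (fmt >> 8) & 0xff
        # A string (type 1, A) wider than 8 bytes occupies
        # ceil(width / 8) instances; step over the extra ones.
        if fmt_type == 1 and width > 8:
            i += (width + 7) // 8
        else:
            i += 1
    return variables
```

The raw @code{missing} bytes are returned undecoded because the same
field holds either a @code{flt64} or an 8-byte string, and a field
equal to the system-missing byte pattern means ``no missing value'' in
both cases.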

In addition to the user-defined variables, every SPSS/PC+ system file
contains, as its first three variables, the following system-defined
variables, in the following order.  The system-defined variables have
no variable label, value labels, or missing values.

@table @code
@item $CASENUM
A numeric variable with format F8.0.  Most of the time this is a
sequence number, starting with 1 for the first case and counting up
for each subsequent case.  Some files skip over values, which probably
reflects cases that were deleted.

@item $DATE
A string variable with format A8.  Same format (including varying
padding) as the @code{creation_date} field in the main header record
(@pxref{Record 0 Main Header Record}).  The actual date can differ
from @code{creation_date} and from record to record.  This may reflect
when individual cases were added or updated.

@item $WEIGHT
A numeric variable with format F8.2.  This represents the case's
weight; SPSS/PC+ files do not have a user-defined weighting variable.
If weighting has not been enabled, every case has value 1.0.
@end table

@node Record 2 Labels Record
@section Record 2: Labels Record

The labels record holds value labels and variable labels.  Unlike the
other records, it is not meant to be read directly and sequentially.
Instead, this record must be interpreted one piece at a time, by
following pointers from the variables record.

The @code{value_label_start}, @code{value_label_end}, and
@code{var_label_ofs} fields in a variable record are all offsets
relative to the beginning of the labels record, with an additional
7-byte offset.  That is, if the labels record starts at byte offset
@code{labels_ofs} and a variable has a given @code{var_label_ofs},
then the variable label begins at byte offset @math{@code{labels_ofs}
+ @code{var_label_ofs} + 7} in the file.

A variable label, starting at the offset indicated by
@code{var_label_ofs}, consists of a one-byte length followed by the
specified number of bytes of the variable label string, like this:

@example
uint8               length;
char                s[length];
@end example
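
The offset arithmetic and the length-prefixed layout can be combined
as in the following Python sketch.  @code{LABEL_BIAS} and
@code{read_var_label} are names invented for this illustration.

```python
LABEL_BIAS = 7   # extra offset added to every pointer into the labels record


def read_var_label(data, labels_ofs, var_label_ofs):
    """Return the variable label bytes, or None if var_label_ofs is zero."""
    if var_label_ofs == 0:
        return None                          # variable has no label
    pos = labels_ofs + var_label_ofs + LABEL_BIAS
    length = data[pos]                       # one-byte length prefix
    return data[pos + 1:pos + 1 + length]    # the label text itself
```

For example, with the labels record at file offset 0x400 and
@code{var_label_ofs} equal to 1, the length byte is read from file
offset 0x408.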

A set of value labels, extending from @code{value_label_start} to
@code{value_label_end} (exclusive), consists of one or more entries,
each of which is a numeric or string value followed by a label string
in the format just described.  String values are padded on the right
with spaces to fill the 8-byte field, like this:

@example
union @{
    flt64           f;
    char            s[8];
@} value;
uint8               length;
char                s[length];
@end example
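
A reader might walk a variable's value labels as in the following
Python sketch.  The function name @code{read_value_labels} is
hypothetical, and the loop assumes that entries are stored back to
back between the start and end offsets, each with the same 7-byte bias
used for variable labels.

```python
import struct


def read_value_labels(data, labels_ofs, start, end, is_string):
    """Return a list of (value, label_bytes) pairs for one variable.

    Assumes the range holds back-to-back entries, each an 8-byte value
    followed by a length-prefixed label.
    """
    labels = []
    pos = labels_ofs + start + 7     # same 7-byte bias as variable labels
    stop = labels_ofs + end + 7      # the end offset is exclusive
    while pos < stop:
        raw = data[pos:pos + 8]
        if is_string:
            value = raw.rstrip(b" ")             # drop right padding
        else:
            value = struct.unpack("<d", raw)[0]  # little-endian flt64
        length = data[pos + 8]
        labels.append((value, data[pos + 9:pos + 9 + length]))
        pos += 9 + length
    return labels
```

When both offsets are zero the loop body never runs, so a variable
without value labels yields an empty list.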

The labels record begins with a pair of uint32 values.  The first of
these is always 3.  The second is between 8 and 16 less than the
number of bytes in the record.  Neither value is important for
interpreting the file.

@node Record 3 Data Record
@section Record 3: Data Record

The format of the data record varies depending on the value of
@code{compressed} in the main header record:

@table @asis
@item 0: no compression
Data is arranged as a series of 8-byte elements, one per variable
instance in the variables record (@pxref{Record 1 Variables
Record}).  Numeric values are given in @code{flt64} format; string
values are literal character strings, padded on the right with spaces
when necessary to fill out 8-byte units.

@item 1: bytecode compression
The first 8 bytes of the data record are divided into a series of
1-byte command codes.  These codes have meanings as described below:

@table @asis
@item 0
The system-missing value.

@item 1
A numeric or string value that is not compressible.  The value is
stored in the 8 bytes following the current block of command bytes.
If this code appears twice in a block of command bytes, then the
second occurrence indicates the second group of 8 bytes following the
command bytes, and so on.

@item 2 through 255
A number with value @var{code} - 100, where @var{code} is the value of
the compression code.  For example, code 105 indicates a numeric
value of 5.
@end table

The end of the 8-byte group of bytecodes is followed by any 8-byte
blocks of non-compressible values indicated by code 1.  After that
follows another 8-byte group of bytecodes, then those bytecodes'
non-compressible values.  The pattern repeats until the number of
cases specified by the main header record has been seen.

The corpus does not contain any files with command codes 2 through 95,
so it is possible that some of these codes are used for special
purposes.
@end table
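
The bytecode scheme above can be sketched in Python as follows.  The
function name @code{decompress} is hypothetical.  The sketch assumes
that cases need not begin on a command-group boundary, so a case may
continue into the next group of 8 codes, and it leaves code-1 values
as raw bytes because only the variables record says whether they are
strings or numbers.

```python
import struct

SYSMIS = struct.unpack("<d", bytes.fromhex("f51e26028a8cedff"))[0]


def decompress(data, ofs, n_cases, nominal_case_size):
    """Return n_cases lists of elements decoded from the bytecodes.

    Each element is a float (codes 0 and 2 through 255) or 8 raw bytes
    (code 1).
    """
    cases, case, pos = [], [], ofs
    while len(cases) < n_cases:
        codes = data[pos:pos + 8]       # one group of 8 command codes
        if len(codes) < 8:
            raise ValueError("data record ended early")
        pos += 8
        for code in codes:
            if code == 0:
                case.append(SYSMIS)
            elif code == 1:
                case.append(data[pos:pos + 8])   # next literal block
                pos += 8
            else:
                case.append(float(code - 100))
            if len(case) == nominal_case_size:
                cases.append(case)
                case = []
                if len(cases) == n_cases:
                    break               # ignore codes past the last case
    return cases
```

Because literal blocks are consumed in the order their code-1 bytes
appear, @code{pos} always points at the next command group once a
group has been processed.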

Cases of data often, but not always, fill the entire data record.
Readers should stop reading after the number of cases specified in the
main header record.  Otherwise, readers may try to interpret garbage
following the data as additional cases.
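
For uncompressed data, stopping at the right place amounts to using
the case count, not the record length, as the loop bound.  The
following Python sketch illustrates this; @code{read_uncompressed} is
a hypothetical name, and elements are returned as raw 8-byte fields.

```python
def read_uncompressed(data, ofs, length, n_cases, nominal_case_size):
    """Return n_cases lists of raw 8-byte elements from the data record."""
    case_bytes = 8 * nominal_case_size
    if n_cases * case_bytes > length:
        raise ValueError("data record too short for n_cases")
    cases = []
    for i in range(n_cases):            # stop here even if bytes remain
        start = ofs + i * case_bytes
        cases.append([data[start + 8 * j:start + 8 * j + 8]
                      for j in range(nominal_case_size)])
    return cases
```

Any bytes between the last case and the end of the record are simply
never examined.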

@node Records 4 and 5 Data Entry
@section Records 4 and 5: Data Entry

Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.