@c PSPP - a program for statistical analysis.
@c Copyright (C) 2017, 2020, 2021 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c
@node Matrices
@chapter Matrices

Some @pspp{} procedures work with matrices by producing numeric
matrices that report results of data analysis, or by consuming
matrices as a basis for further analysis.  This chapter documents the
format of data files that store these matrices and commands for
working with them.

@node Matrix Files
@section Matrix Files
@vindex Matrix file

A matrix file is an SPSS system file that conforms to the dictionary
and case structure described in this section.  Procedures that read
matrices from files expect them to be in the matrix file format.
Procedures that write matrices also use this format.

Text files that contain matrices can be converted to matrix file
format.  @xref{MATRIX DATA}, for a command to read a text file as a
matrix file.

A matrix file's dictionary must have the following variables in the
specified order:

@enumerate
@item
Zero or more numeric split variables.  These are included by
procedures when @cmd{SPLIT FILE} is active.  @cmd{MATRIX DATA} assigns
split variables format F4.0.

@item
@code{ROWTYPE_}, a string variable with width 8.  This variable
indicates the kind of matrix or vector that a given case represents.
The supported row types are listed below.

@item
Zero or more numeric factor variables.  These are included by
procedures that divide data into cells.  For within-cell data, factor
variables are filled with non-missing values; for pooled data, they
are missing.  @cmd{MATRIX DATA} assigns factor variables format F4.0.

@item
@code{VARNAME_}, a string variable.  Matrix data includes one row per
continuous variable (see below), naming each continuous variable in
order.  This column is blank for vector data.  @cmd{MATRIX DATA} makes
@code{VARNAME_} wide enough for the name of any of the continuous
variables, but at least 8 bytes.

@item
One or more continuous variables.  These are the variables whose data
was analyzed to produce the matrices.  @cmd{MATRIX DATA} assigns
continuous variables format F10.4.
@end enumerate
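
As an illustration only, the required ordering can be sketched as a
small Python function.  This is not @pspp{} code: the function name
@code{matrix_file_order} and its arguments are hypothetical, chosen
just to show where each group of variables lands in the dictionary.

```python
def matrix_file_order(splits, factors, continuous):
    """Return variable names in the order a matrix file's dictionary requires.

    Hypothetical helper for illustration; not part of PSPP.
    """
    if not continuous:
        raise ValueError("a matrix file needs at least one continuous variable")
    # Split variables, ROWTYPE_, factor variables, VARNAME_, continuous variables.
    return (list(splits) + ["ROWTYPE_"] + list(factors)
            + ["VARNAME_"] + list(continuous))


order = matrix_file_order(["s1"], ["f1"], ["var01", "var02"])
print(order)
```

Split and factor variables are optional, so either list may be empty;
only the continuous variables are mandatory.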

Case weights are ignored in matrix files.

@subheading Row Types
@anchor{Matrix File Row Types}

Matrix files support a fixed set of types of matrix and vector data.
The @code{ROWTYPE_} variable in each case of a matrix file indicates
its row type.

The supported matrix row types are listed below.  Each type is listed
with the keyword that identifies it in @code{ROWTYPE_}.  All supported
types of matrices are square, meaning that each matrix must include
one row per continuous variable, with the @code{VARNAME_} variable
indicating each continuous variable in turn in the same order as the
dictionary.

@table @code
@item CORR
Correlation coefficients.

@item COV
Covariance coefficients.

@item MAT
General-purpose matrix.

@item N_MATRIX
Counts.

@item PROX
Proximities matrix.
@end table

The supported vector row types are listed below, along with their
associated keyword.  Vector row types only require a single row, whose
@code{VARNAME_} is blank:

@table @code
@item COUNT
Unweighted counts.

@item DFE
Degrees of freedom.

@item MEAN
Means.

@item MSE
Mean squared errors.

@item N
Counts.

@item STDDEV
Standard deviations.
@end table

Only the row types listed above may appear in matrix files.  The
@cmd{MATRIX DATA} command, however, accepts the additional row types
listed below, which it changes into matrix file row types as part of
its conversion process:

@table @code
@item N_VECTOR
Synonym for @code{N}.

@item SD
Synonym for @code{STDDEV}.

@item N_SCALAR
Accepts a single number from the @cmd{MATRIX DATA} input and writes
it as an @code{N} row with the number replicated across all the
continuous variables.
@end table
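
The conversion of these extra row types can be sketched in Python as
follows.  This is a minimal illustration of the rules stated above,
not the @pspp{} implementation; the function @code{normalize_row} and
the constant names are hypothetical.

```python
MATRIX_ROW_TYPES = ("CORR", "COV", "MAT", "N_MATRIX", "PROX")
VECTOR_ROW_TYPES = ("COUNT", "DFE", "MEAN", "MSE", "N", "STDDEV")
# Input-only synonyms accepted by MATRIX DATA, per the table above.
SYNONYMS = dict(N_VECTOR="N", SD="STDDEV")


def normalize_row(rowtype, values, n_continuous):
    """Map an input row type to a matrix file row type (illustrative only)."""
    if rowtype == "N_SCALAR":
        # A single number is replicated across all continuous variables.
        if len(values) != 1:
            raise ValueError("N_SCALAR takes exactly one value")
        return "N", [values[0]] * n_continuous
    rowtype = SYNONYMS.get(rowtype, rowtype)
    if rowtype not in MATRIX_ROW_TYPES + VECTOR_ROW_TYPES:
        raise ValueError("unsupported row type: " + rowtype)
    return rowtype, list(values)


sd_row = normalize_row("SD", [5.7, 1.5], 2)
n_row = normalize_row("N_SCALAR", [92], 3)
```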

@node MATRIX DATA
@section MATRIX DATA
@vindex MATRIX DATA

@display
MATRIX DATA
        VARIABLES=@var{variables}
        [/FILE=@{'@var{file_name}' | INLINE@}]
        [/FORMAT=[@{LIST | FREE@}]
                 [@{UPPER | LOWER | FULL@}]
                 [@{DIAGONAL | NODIAGONAL@}]]
        [/SPLIT=@var{split_vars}]
        [/FACTORS=@var{factor_vars}]
        [/N=@var{n}]

The following subcommands are only needed when ROWTYPE_ is not
specified on the VARIABLES subcommand:
        [/CONTENTS=@{CORR,COUNT,COV,DFE,MAT,MEAN,MSE,
                    N_MATRIX,N|N_VECTOR,N_SCALAR,PROX,SD|STDDEV@}]
        [/CELLS=@var{n_cells}]
@end display

The @cmd{MATRIX DATA} command converts matrices and vectors from text
format into the matrix file format (@pxref{Matrix Files}) for use by
procedures that read matrices.  It reads a text file or inline data
and writes the result to the active file, replacing any data already
in the active dataset.  The matrix file may then be used by other
commands directly from the active file, or it may be written to a
@file{.sav} file using the @cmd{SAVE} command.

The text data read by @cmd{MATRIX DATA} can be delimited by spaces or
commas.  A plus or minus sign, except immediately following a @samp{d}
or @samp{e}, also begins a new value.  Optionally, values may be
enclosed in single or double quotes.
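
The delimiter rules above can be made concrete with a short Python
sketch.  It is not @pspp{}'s parser, and the function name
@code{split_values} is hypothetical.  It also makes two assumptions
that the text above does not state: an upper-case @samp{D} or @samp{E}
is treated like the lower-case letter, and runs of delimiters are
collapsed rather than read as empty values.

```python
def split_values(line):
    """Split one line of text into values (a sketch, not PSPP's parser)."""
    values = []
    current = ""
    quote = None
    for ch in line:
        if quote is not None:
            if ch == quote:
                # The closing quote ends the quoted value.
                values.append(current)
                current = ""
                quote = None
            else:
                current += ch
        elif ch in "'\"":
            if current:
                values.append(current)
                current = ""
            quote = ch
        elif ch in " ,":
            # Spaces and commas separate values.
            if current:
                values.append(current)
                current = ""
        elif ch in "+-" and current and current[-1] not in "dDeE":
            # A sign starts a new value unless it follows an exponent marker.
            # (Accepting upper-case D and E is an assumption of this sketch.)
            values.append(current)
            current = ch
        else:
            current += ch
    if current:
        values.append(current)
    return values


signs = split_values("1.5-2.3, .18 +4e-2")
quoted = split_values("'MEAN' 24.3 \"5.4\"")
```

Here the minus sign in @samp{1.5-2.3} separates two values, while the
one in @samp{4e-2} stays part of the exponent.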

@cmd{MATRIX DATA} can read the types of matrix and vector data
supported in matrix files (@pxref{Matrix File Row Types}).

The @subcmd{FILE} subcommand specifies the source of the command's
input.  To read input from a text file, specify its name in quotes.
To supply input inline, omit @subcmd{FILE} or specify @code{INLINE}.
Inline data must directly follow @cmd{MATRIX DATA}, inside @cmd{BEGIN
DATA} (@pxref{BEGIN DATA}).

@subcmd{VARIABLES} is the only required subcommand.  It names the
variables present in each input record in the order that they appear.
(@cmd{MATRIX DATA} reorders the variables in the matrix file it
produces, if needed to fit the matrix file format.)  The variable list
must include split variables and factor variables, if they are present
in the data, in addition to the continuous variables that form matrix
rows and columns.  It may also include a special variable named
@code{ROWTYPE_}.

Matrix data may include split variables or factor variables or both.
List split variables, if any, on the @subcmd{SPLIT} subcommand and
factor variables, if any, on the @subcmd{FACTORS} subcommand.  Split
and factor variables must be numeric.  Split and factor variables must
also be listed on @subcmd{VARIABLES}, with one exception: if
@subcmd{VARIABLES} does not include @code{ROWTYPE_}, then
@subcmd{SPLIT} may name a single variable that is not in
@subcmd{VARIABLES} (@pxref{MATRIX DATA Example 8}).

The @subcmd{FORMAT} subcommand accepts settings to describe the format
of the input data:

@table @asis
@item @code{LIST} (default)
@itemx @code{FREE}
LIST requires each row to begin at the start of a new input line.
FREE allows rows to begin in the middle of a line.  Either setting
allows a single row to continue across multiple input lines.

@item @code{LOWER} (default)
@itemx @code{UPPER}
@itemx @code{FULL}
With LOWER, only the lower triangle is read from the input data and
the upper triangle is mirrored across the main diagonal.  UPPER
behaves similarly for the upper triangle.  FULL reads the entire
matrix.

@item @code{DIAGONAL} (default)
@itemx @code{NODIAGONAL}
With DIAGONAL, the main diagonal is read from the input data.  With
NODIAGONAL, which is incompatible with FULL, the main diagonal is not
read from the input data but instead set to 1 for correlation matrices
and system-missing for others.
@end table
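
The triangle and diagonal settings can be pictured with the Python
sketch below, which rebuilds a full square matrix from the rows as
they appear in the input.  It illustrates the rules in the table above
and is not @pspp{}'s implementation.  The name @code{expand_matrix} is
hypothetical, @code{None} stands in for the system-missing value, and
the sketch assumes that with NODIAGONAL the one row that would hold no
values is simply absent from @code{rows}.

```python
def expand_matrix(rows, triangle="LOWER", diagonal=True, corr=True):
    """Build a full square matrix from triangular input rows.

    rows holds the values as read, one list per non-empty input row.
    Illustrative only; not PSPP's implementation.
    """
    if triangle == "FULL":
        if not diagonal:
            raise ValueError("NODIAGONAL is incompatible with FULL")
        return [list(row) for row in rows]
    # Without the diagonal, n variables take n - 1 non-empty rows.
    n = len(rows) if diagonal else len(rows) + 1
    # Omitted diagonal: 1 for correlations, missing (None) otherwise.
    fill = 1.0 if corr else None
    full = [[fill] * n for _ in range(n)]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if triangle == "LOWER":
                i = r if diagonal else r + 1
                j = c
            else:  # UPPER
                i = r
                j = c + (r if diagonal else r + 1)
            full[i][j] = value
            full[j][i] = value  # mirror across the main diagonal
    return full


lower = expand_matrix([[1.0], [0.18, 1.0], [-0.22, -0.17, 1.0]])
upper = expand_matrix([[0.18, -0.22], [-0.17]],
                      triangle="UPPER", diagonal=False)
```

Both calls describe the same 3-by-3 correlation matrix, once as a
lower triangle with its diagonal and once as an upper triangle
without it.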

The @subcmd{N} subcommand specifies the population size.  It is
equivalent to supplying an @code{N} vector with the specified value
for each split file.

@cmd{MATRIX DATA} supports two different ways to indicate the kinds of
matrices and vectors present in the data, depending on whether a
variable with the special name @code{ROWTYPE_} is present in
@subcmd{VARIABLES}.  The following subsections explain @cmd{MATRIX DATA}
syntax and behavior in each case.

@node MATRIX DATA with ROWTYPE_
@subsection With @code{ROWTYPE_}

If @subcmd{VARIABLES} includes @code{ROWTYPE_}, each case's
@code{ROWTYPE_} indicates the type of data contained in the row.
@xref{Matrix File Row Types}, for a list of supported row types.

@subsubheading Example 1: Defaults with @code{ROWTYPE_}
@anchor{MATRIX DATA Example 1}

This example shows a simple use of @cmd{MATRIX DATA} with
@code{ROWTYPE_} plus 8 variables named @code{var01} through
@code{var08}.

Because @code{ROWTYPE_} is the first variable in @subcmd{VARIABLES},
it appears first on each line.  The first three lines in the example
data have @code{ROWTYPE_} values of @samp{MEAN}, @samp{SD}, and
@samp{N}.  These indicate that these lines contain vectors of means,
standard deviations, and counts, respectively, for @code{var01}
through @code{var08} in order.

The remaining 8 lines have a @code{ROWTYPE_} of @samp{CORR}, which
indicates that the values are correlation coefficients.  Each of the
lines corresponds to a row in the correlation matrix: the first line
is for @code{var01}, the next line for @code{var02}, and so on.  The
input only contains values for the lower triangle, including the
diagonal, since @code{FORMAT=LOWER DIAGONAL} is the default.

With @code{ROWTYPE_}, the @subcmd{CONTENTS} subcommand is optional and
the @subcmd{CELLS} subcommand may not be used.

@example
MATRIX DATA
    VARIABLES=ROWTYPE_ var01 TO var08.
BEGIN DATA.
MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
N       92    92    92    92    92    92    92    92
CORR  1.00
CORR   .18  1.00
CORR  -.22  -.17  1.00
CORR   .36   .31  -.14  1.00
CORR   .27   .16  -.12   .22  1.00
CORR   .33   .15  -.17   .24   .21  1.00
CORR   .50   .29  -.20   .32   .12   .38  1.00
CORR   .17   .29  -.05   .20   .27   .20   .04  1.00
END DATA.
@end example

@subsubheading Example 2: @code{FORMAT=UPPER NODIAGONAL}

This syntax produces the same matrix file as example 1, but it uses
@code{FORMAT=UPPER NODIAGONAL} to specify the upper triangle and omit
the diagonal.  Because the matrix's @code{ROWTYPE_} is @code{CORR},
@pspp{} automatically fills in the diagonal with 1.

@example
MATRIX DATA
    VARIABLES=ROWTYPE_ var01 TO var08
    /FORMAT=UPPER NODIAGONAL.
BEGIN DATA.
MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
N       92    92    92    92    92    92    92    92
CORR         .17   .50  -.33   .27   .36  -.22   .18
CORR               .29   .29  -.20   .32   .12   .38
CORR                     .05   .20  -.15   .16   .21
CORR                           .20   .32  -.17   .12
CORR                                 .27   .12  -.24
CORR                                      -.20  -.38
CORR                                             .04
END DATA.
@end example

@subsubheading Example 3: @subcmd{N} subcommand

This syntax uses the @subcmd{N} subcommand in place of an @code{N}
vector.  It produces the same matrix file as examples 1 and 2.

@example
MATRIX DATA
    VARIABLES=ROWTYPE_ var01 TO var08
    /FORMAT=UPPER NODIAGONAL
    /N 92.
BEGIN DATA.
MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
CORR         .17   .50  -.33   .27   .36  -.22   .18
CORR               .29   .29  -.20   .32   .12   .38
CORR                     .05   .20  -.15   .16   .21
CORR                           .20   .32  -.17   .12
CORR                                 .27   .12  -.24
CORR                                      -.20  -.38
CORR                                             .04
END DATA.
@end example

@subsubheading Example 4: Split variables
@anchor{MATRIX DATA Example 4}

This syntax defines two matrices, using the variable @samp{s1} to
distinguish between them.  Notice how the order of variables in the
input matches their order on @subcmd{VARIABLES}.  This example also
uses @code{FORMAT=FULL}.

@example
MATRIX DATA
    VARIABLES=s1 ROWTYPE_  var01 TO var04
    /SPLIT=s1
    /FORMAT=FULL.
BEGIN DATA.
0 MEAN 34 35 36 37
0 SD   22 11 55 66
0 N    99 98 99 92
0 CORR  1 .9 .8 .7
0 CORR .9  1 .6 .5
0 CORR .8 .6  1 .4
0 CORR .7 .5 .4  1
1 MEAN 44 45 34 39
1 SD   23 15 51 46
1 N    98 34 87 23
1 CORR  1 .2 .3 .4
1 CORR .2  1 .5 .6
1 CORR .3 .5  1 .7
1 CORR .4 .6 .7  1
END DATA.
@end example

@subsubheading Example 5: Factor variables
@anchor{MATRIX DATA Example 5}

This syntax defines a matrix file that includes a factor variable
@samp{f1}.  The data includes mean, standard deviation, and count
vectors for two values of the factor variable, plus a correlation
matrix for pooled data.

@example
MATRIX DATA
    VARIABLES=ROWTYPE_ f1 var01 TO var04
    /FACTOR=f1.
BEGIN DATA.
MEAN 0 34 35 36 37
SD   0 22 11 55 66
N    0 99 98 99 92
MEAN 1 44 45 34 39
SD   1 23 15 51 46
N    1 98 34 87 23
CORR .  1
CORR . .9  1
CORR . .8 .6  1
CORR . .7 .5 .4  1
END DATA.
@end example

@node MATRIX DATA without ROWTYPE_
@subsection Without @code{ROWTYPE_}

If @subcmd{VARIABLES} does not contain @code{ROWTYPE_}, the
@subcmd{CONTENTS} subcommand defines the row types that appear in the
file and their order.  If @subcmd{CONTENTS} is omitted,
@code{CONTENTS=CORR} is assumed.

Factor variables without @code{ROWTYPE_} introduce special
requirements, illustrated below in Examples 9 and 10.

@subsubheading Example 6: Defaults without @code{ROWTYPE_}

This example shows a simple use of @cmd{MATRIX DATA} with 8 variables
named @code{var01} through @code{var08}, without @code{ROWTYPE_}.
This yields the same matrix file as Example 1 (@pxref{MATRIX DATA
Example 1}).

@example
MATRIX DATA
    VARIABLES=var01 TO var08
    /CONTENTS=MEAN SD N CORR.
BEGIN DATA.
24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
 5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
  92    92    92    92    92    92    92    92
1.00
 .18  1.00
-.22  -.17  1.00
 .36   .31  -.14  1.00
 .27   .16  -.12   .22  1.00
 .33   .15  -.17   .24   .21  1.00
 .50   .29  -.20   .32   .12   .38  1.00
 .17   .29  -.05   .20   .27   .20   .04  1.00
END DATA.
@end example

@subsubheading Example 7: Split variables with explicit values

This syntax defines two matrices, using the variable @code{s1} to
distinguish between them.  Each line of data begins with @code{s1}.
This yields the same matrix file as Example 4 (@pxref{MATRIX DATA
Example 4}).

@example
MATRIX DATA
    VARIABLES=s1 var01 TO var04
    /SPLIT=s1
    /FORMAT=FULL
    /CONTENTS=MEAN SD N CORR.
BEGIN DATA.
0 34 35 36 37
0 22 11 55 66
0 99 98 99 92
0  1 .9 .8 .7
0 .9  1 .6 .5
0 .8 .6  1 .4
0 .7 .5 .4  1
1 44 45 34 39
1 23 15 51 46
1 98 34 87 23
1  1 .2 .3 .4
1 .2  1 .5 .6
1 .3 .5  1 .7
1 .4 .6 .7  1
END DATA.
@end example

@subsubheading Example 8: Split variable with sequential values
@anchor{MATRIX DATA Example 8}

Like the previous example, this syntax defines two matrices with
split variable @code{s1}.  In this case, though, @code{s1} is not
listed in @subcmd{VARIABLES}, which means that its value does not
appear in the data.  Instead, @cmd{MATRIX DATA} reads matrix data
until the input is exhausted, supplying 1 for the first split, 2 for
the second, and so on.

@example
MATRIX DATA
    VARIABLES=var01 TO var04
    /SPLIT=s1
    /FORMAT=FULL
    /CONTENTS=MEAN SD N CORR.
BEGIN DATA.
34 35 36 37
22 11 55 66
99 98 99 92
 1 .9 .8 .7
.9  1 .6 .5
.8 .6  1 .4
.7 .5 .4  1
44 45 34 39
23 15 51 46
98 34 87 23
 1 .2 .3 .4
.2  1 .5 .6
.3 .5  1 .7
.4 .6 .7  1
END DATA.
@end example

@subsubsection Factor variables without @code{ROWTYPE_}

Without @code{ROWTYPE_}, factor variables introduce two new wrinkles
to @cmd{MATRIX DATA} syntax.  First, the @subcmd{CELLS} subcommand
must declare the number of combinations of factor variables present in
the data.  If there is, for example, one factor variable for which the
data contains three values, one would write @code{CELLS=3}; if there
are two (or more) factor variables for which the data contains five
combinations, one would use @code{CELLS=5}; and so on.

Second, the @subcmd{CONTENTS} subcommand must distinguish within-cell
data from pooled data by enclosing within-cell row types in
parentheses.  When different within-cell row types for a single factor
appear in consecutive lines, enclose the row types in a single set of
parentheses; when different factors' values for a given within-cell
row type appear in consecutive lines, enclose each row type in
individual parentheses.
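
One way to picture the effect of the parentheses is to list the
sequence of input rows that a given @subcmd{CONTENTS} setting implies.
The Python sketch below does this under simplifying assumptions: it
assumes the default @code{DIAGONAL} setting, so every matrix takes one
row per continuous variable, and it numbers cells 0, 1, @dots{} rather
than using actual factor values.  The function @code{expected_rows}
and its data representation (a tuple for each parenthesized group) are
hypothetical and not part of @pspp{}.

```python
def expected_rows(contents, n_cells, n_continuous):
    """List the (row type, cell index) of each input row, in order.

    contents is a list whose items are either a row type string (pooled
    data) or a tuple of row types (one parenthesized within-cell group).
    A cell index of None marks pooled data.  Illustrative only.
    """
    matrix_types = ("CORR", "COV", "MAT", "N_MATRIX", "PROX")

    def rows_for(rowtype, cell):
        # Matrix row types take one row per continuous variable, vectors one.
        count = n_continuous if rowtype in matrix_types else 1
        return [(rowtype, cell)] * count

    rows = []
    for item in contents:
        if isinstance(item, tuple):
            # A parenthesized group repeats as a whole for each cell.
            for cell in range(n_cells):
                for rowtype in item:
                    rows += rows_for(rowtype, cell)
        else:
            rows += rows_for(item, None)
    return rows


by_factor = expected_rows([("MEAN", "SD", "N"), "CORR"], 2, 4)
by_rowtype = expected_rows([("MEAN",), ("SD",), ("N",), "CORR"], 2, 4)
```

The first call corresponds to @code{CONTENTS=(MEAN SD N) CORR} and the
second to @code{CONTENTS=(MEAN) (SD) (N) CORR}, each with
@code{CELLS=2} and four continuous variables.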

Without @code{ROWTYPE_}, input lines for pooled data do not include
factor values, not even as missing values, but input lines for
within-cell data do.

The following examples aim to clarify this syntax.

@subsubheading Example 9: Factor variables, grouping within-cell records by factor

This syntax defines the same matrix file as Example 5 (@pxref{MATRIX
DATA Example 5}), without using @code{ROWTYPE_}.  It declares
@code{CELLS=2} because the data contains two values (0 and 1) for
factor variable @code{f1}.  Within-cell vector row types @code{MEAN},
@code{SD}, and @code{N} are in a single set of parentheses on
@subcmd{CONTENTS} because they are grouped together in consecutive
lines for a single factor value.  The data lines with the pooled
correlation matrix do not have any factor values.

@example
MATRIX DATA
    VARIABLES=f1 var01 TO var04
    /FACTOR=f1
    /CELLS=2
    /CONTENTS=(MEAN SD N) CORR.
BEGIN DATA.
0 34 35 36 37
0 22 11 55 66
0 99 98 99 92
1 44 45 34 39
1 23 15 51 46
1 98 34 87 23
   1
  .9  1
  .8 .6  1
  .7 .5 .4  1
END DATA.
@end example

@subsubheading Example 10: Factor variables, grouping within-cell records by row type

This syntax defines the same matrix file as the previous example.  The
only difference is that the within-cell vector rows are grouped
differently: two rows of means (one for each factor), followed by two
rows of standard deviations, followed by two rows of counts.

@example
MATRIX DATA
    VARIABLES=f1 var01 TO var04
    /FACTOR=f1
    /CELLS=2
    /CONTENTS=(MEAN) (SD) (N) CORR.
BEGIN DATA.
0 34 35 36 37
1 44 45 34 39
0 22 11 55 66
1 23 15 51 46
0 99 98 99 92
1 98 34 87 23
   1
  .9  1
  .8 .6  1
  .7 .5 .4  1
END DATA.
@end example