MATRIX DATA: Fully implement.

diff --git a/doc/matrices.texi b/doc/matrices.texi

new file mode 100644

index 0000000..26a29db
--- /dev/null
+++ b/doc/matrices.texi
@@ -0,0 +1,574 @@
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017, 2020, 2021 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+@node Matrices
+@chapter Matrices
+
+Some @pspp{} procedures work with matrices by producing numeric
+matrices that report results of data analysis, or by consuming
+matrices as a basis for further analysis.  This chapter documents the
+format of data files that store these matrices and commands for
+working with them.
+
+@node Matrix Files
+@section Matrix Files
+@vindex Matrix file
+
+A matrix file is an SPSS system file that conforms to the dictionary
+and case structure described in this section.  Procedures that read
+matrices from files expect them to be in the matrix file format.
+Procedures that write matrices also use this format.
+
+Text files that contain matrices can be converted to matrix file
+format.  @xref{MATRIX DATA}, for a command to read a text file as a
+matrix file.
+
+A matrix file's dictionary must have the following variables in the
+specified order:
+
+@enumerate
+@item
+Zero or more numeric split variables.  These are included by
+procedures when @cmd{SPLIT FILE} is active.  @cmd{MATRIX DATA} assigns
+split variables format F4.0.
+
+@item
+@code{ROWTYPE_}, a string variable with width 8.  This variable
+indicates the kind of matrix or vector that a given case represents.
+The supported row types are listed below.
+
+@item
+Zero or more numeric factor variables.  These are included by
+procedures that divide data into cells.  For within-cell data, factor
+variables are filled with non-missing values; for pooled data, they
+are missing.  @cmd{MATRIX DATA} assigns factor variables format F4.0.
+
+@item
+@code{VARNAME_}, a string variable.  Matrix data includes one row per
+continuous variable (see below), naming each continuous variable in
+order.  This column is blank for vector data.  @cmd{MATRIX DATA} makes
+@code{VARNAME_} wide enough for the name of any of the continuous
+variables, but at least 8 bytes.
+
+@item
+One or more continuous variables.  These are the variables whose data
+was analyzed to produce the matrices.  @cmd{MATRIX DATA} assigns
+continuous variables format F10.4.
+@end enumerate
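+
+As a rough illustration of this layout, the cases of a matrix file
+with no split or factor variables and two continuous variables might
+look like the sketch below.  The variable names and values are
+borrowed from the first two variables of the @cmd{MATRIX DATA}
+examples later in this chapter.  This is a schematic of the case
+structure, not literal @pspp{} output.
+
+@example
+ROWTYPE_  VARNAME_   var01   var02
+MEAN                  24.3     5.4
+STDDEV                 5.7     1.5
+N                       92      92
+CORR      var01       1.00     .18
+CORR      var02        .18    1.00
+@end example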
+
+Case weights are ignored in matrix files.
+
+@subheading Row Types
+@anchor{Matrix File Row Types}
+
+Matrix files support a fixed set of types of matrix and vector data.
+The @code{ROWTYPE_} variable in each case of a matrix file indicates
+its row type.
+
+The supported matrix row types are listed below.  Each type is listed
+with the keyword that identifies it in @code{ROWTYPE_}.  All supported
+types of matrices are square, meaning that each matrix must include
+one row per continuous variable, with the @code{VARNAME_} variable
+indicating each continuous variable in turn in the same order as the
+dictionary.
+
+@table @code
+@item CORR
+Correlation coefficients.
+
+@item COV
+Covariance coefficients.
+
+@item MAT
+General-purpose matrix.
+
+@item N_MATRIX
+Counts.
+
+@item PROX
+Proximities matrix.
+@end table
+
+The supported vector row types are listed below, along with their
+associated keyword.  Vector row types only require a single row, whose
+@code{VARNAME_} is blank:
+
+@table @code
+@item COUNT
+Unweighted counts.
+
+@item DFE
+Degrees of freedom.
+
+@item MEAN
+Means.
+
+@item MSE
+Mean squared errors.
+
+@item N
+Counts.
+
+@item STDDEV
+Standard deviations.
+@end table
+
+Only the row types listed above may appear in matrix files.  The
+@cmd{MATRIX DATA} command, however, accepts the additional row types
+listed below, which it changes into matrix file row types as part of
+its conversion process:
+
+@table @code
+@item N_VECTOR
+Synonym for @code{N}.
+
+@item SD
+Synonym for @code{STDDEV}.
+
+@item N_SCALAR
+Accepts a single number from the @cmd{MATRIX DATA} input and writes
+it as an @code{N} row with the number replicated across all the
+continuous variables.
+@end table
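+
+As a sketch of how @code{N_SCALAR} might be used, the following
+syntax supplies a single count of 92 instead of a full @code{N}
+vector.  Going by the description above, the resulting matrix file
+should contain an @code{N} row with 92 for each of @code{var01}
+through @code{var03}.  The three-variable layout is an assumption
+made for brevity, with values borrowed from Example 1 in the
+@cmd{MATRIX DATA} section.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var03.
+BEGIN DATA.
+MEAN      24.3   5.4  69.7
+SD         5.7   1.5  23.5
+N_SCALAR    92
+CORR      1.00
+CORR       .18  1.00
+CORR      -.22  -.17  1.00
+END DATA.
+@end example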
+
+@node MATRIX DATA
+@section MATRIX DATA
+@vindex MATRIX DATA
+
+@display
+MATRIX DATA
+        VARIABLES=@var{variables}
+        [FILE=@{'@var{file_name}' | INLINE@}]
+        [/FORMAT=[@{LIST | FREE@}]
+                 [@{UPPER | LOWER | FULL@}]
+                 [@{DIAGONAL | NODIAGONAL@}]]
+        [/SPLIT=@var{split_vars}]
+        [/FACTORS=@var{factor_vars}]
+        [/N=@var{n}]
+
+The following subcommands are only needed when ROWTYPE_ is not
+specified on the VARIABLES subcommand:
+        [/CONTENTS=@{CORR,COUNT,COV,DFE,MAT,MEAN,MSE,
+                    N_MATRIX,N|N_VECTOR,N_SCALAR,PROX,SD|STDDEV@}]
+        [/CELLS=@var{n_cells}]
+@end display
+
+The @cmd{MATRIX DATA} command converts matrices and vectors from text
+format into the matrix file format (@pxref{Matrix Files}) for use by
+procedures that read matrices.  It reads a text file or inline data
+and outputs to the active file, replacing any data already in the
+active dataset.  The matrix file may then be used by other commands
+directly from the active file, or it may be written to a @file{.sav}
+file using the @cmd{SAVE} command.
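+
+A minimal sketch of that workflow follows.  It reads a small inline
+matrix and then writes the active file to disk.  The output file name
+@file{corr.sav} is hypothetical, and the data are a three-variable
+subset of the values in Example 1 below; the syntax itself is
+explained in the rest of this section.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var03.
+BEGIN DATA.
+MEAN  24.3   5.4  69.7
+SD     5.7   1.5  23.5
+N       92    92    92
+CORR  1.00
+CORR   .18  1.00
+CORR  -.22  -.17  1.00
+END DATA.
+SAVE OUTFILE='corr.sav'.
+@end example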
+
+The text data read by @cmd{MATRIX DATA} can be delimited by spaces or
+commas.  A plus or minus sign, except immediately following a @samp{d}
+or @samp{e}, also begins a new value.  Optionally, values may be
+enclosed in single or double quotes.
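+
+The following sketch exercises these rules and is meant to be read as
+an illustration rather than a recommended style.  Assuming the
+delimiter rules work as described above, it should be read the same
+way as space-delimited input: the first data line uses commas, the
+second quotes its row type, and the last relies on a minus sign to
+start a new value.  The values are a three-variable subset of Example
+1 below.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var03.
+BEGIN DATA.
+MEAN,24.3,5.4,69.7
+'SD' 5.7 1.5 23.5
+N 92,92,92
+CORR 1.00
+CORR .18 1.00
+CORR -.22-.17 1.00
+END DATA.
+@end example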
+
+@cmd{MATRIX DATA} can read the types of matrix and vector data
+supported in matrix files (@pxref{Matrix File Row Types}).
+
+The @subcmd{FILE} subcommand specifies the source of the command's
+input.  To read input from a text file, specify its name in quotes.
+To supply input inline, omit @subcmd{FILE} or specify @code{INLINE}.
+Inline data must directly follow @cmd{MATRIX DATA}, inside @cmd{BEGIN
+DATA} (@pxref{BEGIN DATA}).
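+
+For instance, reading from an external text file might look like the
+sketch below.  The file name @file{matrix.txt} is hypothetical; it
+would hold the same lines that otherwise appear between @cmd{BEGIN
+DATA} and @cmd{END DATA}.  The subcommand is written here with a
+leading slash, like the other subcommands, which is an assumption
+about the separator.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var08
+    /FILE='matrix.txt'.
+@end example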
+
+@subcmd{VARIABLES} is the only required subcommand.  It names the
+variables present in each input record in the order that they appear.
+(@cmd{MATRIX DATA} reorders the variables in the matrix file it
+produces, if needed to fit the matrix file format.)  The variable list
+must include split variables and factor variables, if they are present
+in the data, in addition to the continuous variables that form matrix
+rows and columns.  It may also include a special variable named
+@code{ROWTYPE_}.
+
+Matrix data may include split variables or factor variables or both.
+List split variables, if any, on the @subcmd{SPLIT} subcommand and
+factor variables, if any, on the @subcmd{FACTORS} subcommand.  Split
+and factor variables must be numeric.  Split and factor variables must
+also be listed on @subcmd{VARIABLES}, with one exception: if
+@subcmd{VARIABLES} does not include @code{ROWTYPE_}, then
+@subcmd{SPLIT} may name a single variable that is not in
+@subcmd{VARIABLES} (@pxref{MATRIX DATA Example 8}).
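+
+As a sketch of using both kinds of variables together, the syntax
+below combines one split variable with one factor variable.  The
+variable names @code{s1} and @code{f1} and all of the values are
+illustrative assumptions.  Each split contains within-cell means for
+two factor values, followed by a pooled correlation matrix whose
+factor value is missing.
+
+@example
+MATRIX DATA
+    VARIABLES=s1 ROWTYPE_ f1 var01 TO var02
+    /SPLIT=s1
+    /FACTORS=f1.
+BEGIN DATA.
+0 MEAN 0 34 35
+0 MEAN 1 44 45
+0 CORR .  1
+0 CORR . .9  1
+1 MEAN 0 36 37
+1 MEAN 1 34 39
+1 CORR .  1
+1 CORR . .6  1
+END DATA.
+@end example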
+
+The @subcmd{FORMAT} subcommand accepts settings to describe the format
+of the input data:
+
+@table @asis
+@item @code{LIST} (default)
+@itemx @code{FREE}
+LIST requires each row to begin at the start of a new input line.
+FREE allows rows to begin in the middle of a line.  Either setting
+allows a single row to continue across multiple input lines.
+
+@item @code{LOWER} (default)
+@itemx @code{UPPER}
+@itemx @code{FULL}
+With LOWER, only the lower triangle is read from the input data and
+the upper triangle is mirrored across the main diagonal.  UPPER
+behaves similarly for the upper triangle.  FULL reads the entire
+matrix.
+
+@item @code{DIAGONAL} (default)
+@itemx @code{NODIAGONAL}
+With DIAGONAL, the main diagonal is read from the input data.  With
+NODIAGONAL, which is incompatible with FULL, the main diagonal is not
+read from the input data but instead set to 1 for correlation matrices
+and system-missing for others.
+@end table
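+
+As a sketch of the difference between @code{LIST} and @code{FREE},
+the following input places several rows on one line, which
+@code{FREE} should accept according to the description above.  The
+values are a three-variable subset of Example 1 below, and the
+particular line breaks are arbitrary.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var03
+    /FORMAT=FREE.
+BEGIN DATA.
+MEAN 24.3 5.4 69.7 SD 5.7 1.5 23.5
+N 92 92 92
+CORR 1.00 CORR .18 1.00 CORR -.22 -.17 1.00
+END DATA.
+@end example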
+
+The @subcmd{N} subcommand specifies the size of the population.  It
+is equivalent to supplying an @code{N} vector with the specified
+value for each split file.
+
+@cmd{MATRIX DATA} supports two different ways to indicate the kinds of
+matrices and vectors present in the data, depending on whether a
+variable with the special name @code{ROWTYPE_} is present in
+@subcmd{VARIABLES}.  The following subsections explain @cmd{MATRIX DATA}
+syntax and behavior in each case.
+
+@node MATRIX DATA with ROWTYPE_
+@subsection With @code{ROWTYPE_}
+
+If @subcmd{VARIABLES} includes @code{ROWTYPE_}, each case's
+@code{ROWTYPE_} indicates the type of data contained in the row.
+@xref{Matrix File Row Types}, for a list of supported row types.
+
+@subsubheading Example 1: Defaults with @code{ROWTYPE_}
+@anchor{MATRIX DATA Example 1}
+
+This example shows a simple use of @cmd{MATRIX DATA} with
+@code{ROWTYPE_} plus 8 variables named @code{var01} through
+@code{var08}.
+
+Because @code{ROWTYPE_} is the first variable in @subcmd{VARIABLES},
+it appears first on each line. The first three lines in the example
+data have @code{ROWTYPE_} values of @samp{MEAN}, @samp{SD}, and
+@samp{N}.  These indicate that these lines contain vectors of means,
+standard deviations, and counts, respectively, for @code{var01}
+through @code{var08} in order.
+
+The remaining 8 lines have a @code{ROWTYPE_} of @samp{CORR}, which
+indicates that the values are correlation coefficients.  Each of the lines
+corresponds to a row in the correlation matrix: the first line is for
+@code{var01}, the next line for @code{var02}, and so on.  The input
+only contains values for the lower triangle, including the diagonal,
+since @code{FORMAT=LOWER DIAGONAL} is the default.
+
+With @code{ROWTYPE_}, the @subcmd{CONTENTS} subcommand is optional and
+the @subcmd{CELLS} subcommand may not be used.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var08.
+BEGIN DATA.
+MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
+SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
+N       92    92    92    92    92    92    92    92
+CORR  1.00
+CORR   .18  1.00
+CORR  -.22  -.17  1.00
+CORR   .36   .31  -.14  1.00
+CORR   .27   .16  -.12   .22  1.00
+CORR   .33   .15  -.17   .24   .21  1.00
+CORR   .50   .29  -.20   .32   .12   .38  1.00
+CORR   .17   .29  -.05   .20   .27   .20   .04  1.00
+END DATA.
+@end example
+
+@subsubheading Example 2: @code{FORMAT=UPPER NODIAGONAL}
+
+This syntax supplies the same kinds of data as Example 1, but it uses
+@code{FORMAT=UPPER NODIAGONAL} to specify the upper triangle and omit
+the diagonal.  Because the matrix's @code{ROWTYPE_} is @code{CORR},
+@pspp{} automatically fills in the diagonal with 1.  The correlation
+values below differ from those in Example 1, so the resulting
+matrices are not identical.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var08
+    /FORMAT=UPPER NODIAGONAL.
+BEGIN DATA.
+MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
+SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
+N       92    92    92    92    92    92    92    92
+CORR         .17   .50  -.33   .27   .36  -.22   .18
+CORR               .29   .29  -.20   .32   .12   .38
+CORR                     .05   .20  -.15   .16   .21
+CORR                           .20   .32  -.17   .12
+CORR                                 .27   .12  -.24
+CORR                                      -.20  -.38
+CORR                                             .04
+END DATA.
+@end example
+
+@subsubheading Example 3: @subcmd{N} subcommand
+
+This syntax uses the @subcmd{N} subcommand in place of an @code{N}
+vector.  It produces the same matrix file as Example 2.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ var01 TO var08
+    /FORMAT=UPPER NODIAGONAL
+    /N 92.
+BEGIN DATA.
+MEAN  24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
+SD     5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
+CORR         .17   .50  -.33   .27   .36  -.22   .18
+CORR               .29   .29  -.20   .32   .12   .38
+CORR                     .05   .20  -.15   .16   .21
+CORR                           .20   .32  -.17   .12
+CORR                                 .27   .12  -.24
+CORR                                      -.20  -.38
+CORR                                             .04
+END DATA.
+@end example
+
+@subsubheading Example 4: Split variables
+@anchor{MATRIX DATA Example 4}
+
+This syntax defines two matrices, using the variable @code{s1} to
+distinguish between them.  Notice how the order of variables in the
+input matches their order on @subcmd{VARIABLES}.  This example also
+uses @code{FORMAT=FULL}.
+
+@example
+MATRIX DATA
+    VARIABLES=s1 ROWTYPE_  var01 TO var04
+    /SPLIT=s1
+    /FORMAT=FULL.
+BEGIN DATA.
+0 MEAN 34 35 36 37
+0 SD   22 11 55 66
+0 N    99 98 99 92
+0 CORR  1 .9 .8 .7
+0 CORR .9  1 .6 .5
+0 CORR .8 .6  1 .4
+0 CORR .7 .5 .4  1
+1 MEAN 44 45 34 39
+1 SD   23 15 51 46
+1 N    98 34 87 23
+1 CORR  1 .2 .3 .4
+1 CORR .2  1 .5 .6
+1 CORR .3 .5  1 .7
+1 CORR .4 .6 .7  1
+END DATA.
+@end example
+
+@subsubheading Example 5: Factor variables
+@anchor{MATRIX DATA Example 5}
+
+This syntax defines a matrix file that includes a factor variable
+@code{f1}.  The data includes mean, standard deviation, and count
+vectors for two values of the factor variable, plus a correlation
+matrix for pooled data.
+
+@example
+MATRIX DATA
+    VARIABLES=ROWTYPE_ f1 var01 TO var04
+    /FACTORS=f1.
+BEGIN DATA.
+MEAN 0 34 35 36 37
+SD   0 22 11 55 66
+N    0 99 98 99 92
+MEAN 1 44 45 34 39
+SD   1 23 15 51 46
+N    1 98 34 87 23
+CORR .  1
+CORR . .9  1
+CORR . .8 .6  1
+CORR . .7 .5 .4  1
+END DATA.
+@end example
+
+@node MATRIX DATA without ROWTYPE_
+@subsection Without @code{ROWTYPE_}
+
+If @subcmd{VARIABLES} does not contain @code{ROWTYPE_}, the
+@subcmd{CONTENTS} subcommand defines the row types that appear in the
+file and their order.  If @subcmd{CONTENTS} is omitted,
+@code{CONTENTS=CORR} is assumed.
+
+Factor variables without @code{ROWTYPE_} introduce special
+requirements, illustrated below in Examples 9 and 10.
+
+@subsubheading Example 6: Defaults without @code{ROWTYPE_}
+
+This example shows a simple use of @cmd{MATRIX DATA} with 8 variables
+named @code{var01} through @code{var08}, without @code{ROWTYPE_}.
+This yields the same matrix file as Example 1 (@pxref{MATRIX DATA
+Example 1}).
+
+@example
+MATRIX DATA
+    VARIABLES=var01 TO var08
+    /CONTENTS=MEAN SD N CORR.
+BEGIN DATA.
+24.3   5.4  69.7  20.1  13.4   2.7  27.9   3.7
+ 5.7   1.5  23.5   5.8   2.8   4.5   5.4   1.5
+  92    92    92    92    92    92    92    92
+1.00
+ .18  1.00
+-.22  -.17  1.00
+ .36   .31  -.14  1.00
+ .27   .16  -.12   .22  1.00
+ .33   .15  -.17   .24   .21  1.00
+ .50   .29  -.20   .32   .12   .38  1.00
+ .17   .29  -.05   .20   .27   .20   .04  1.00
+END DATA.
+@end example
+
+@subsubheading Example 7: Split variables with explicit values
+
+This syntax defines two matrices, using the variable @code{s1} to
+distinguish between them.  Each line of data begins with @code{s1}.
+This yields the same matrix file as Example 4 (@pxref{MATRIX DATA
+Example 4}).
+
+@example
+MATRIX DATA
+    VARIABLES=s1 var01 TO var04
+    /SPLIT=s1
+    /FORMAT=FULL
+    /CONTENTS=MEAN SD N CORR.
+BEGIN DATA.
+0 34 35 36 37
+0 22 11 55 66
+0 99 98 99 92
+0  1 .9 .8 .7
+0 .9  1 .6 .5
+0 .8 .6  1 .4
+0 .7 .5 .4  1
+1 44 45 34 39
+1 23 15 51 46
+1 98 34 87 23
+1  1 .2 .3 .4
+1 .2  1 .5 .6
+1 .3 .5  1 .7
+1 .4 .6 .7  1
+END DATA.
+@end example
+
+@subsubheading Example 8: Split variable with sequential values
+@anchor{MATRIX DATA Example 8}
+
+Like the previous example, this syntax defines two matrices with
+split variable @code{s1}.  In this case, though, @code{s1} is not
+listed in @subcmd{VARIABLES}, which means that its value does not
+appear in the data.  Instead, @cmd{MATRIX DATA} reads matrix data
+until the input is exhausted, supplying 1 for the first split, 2 for
+the second, and so on.
+
+@example
+MATRIX DATA
+    VARIABLES=var01 TO var04
+    /SPLIT=s1
+    /FORMAT=FULL
+    /CONTENTS=MEAN SD N CORR.
+BEGIN DATA.
+34 35 36 37
+22 11 55 66
+99 98 99 92
+ 1 .9 .8 .7
+.9  1 .6 .5
+.8 .6  1 .4
+.7 .5 .4  1
+44 45 34 39
+23 15 51 46
+98 34 87 23
+ 1 .2 .3 .4
+.2  1 .5 .6
+.3 .5  1 .7
+.4 .6 .7  1
+END DATA.
+@end example
+
+@subsubsection Factor variables without @code{ROWTYPE_}
+
+Without @code{ROWTYPE_}, factor variables introduce two new wrinkles
+to @cmd{MATRIX DATA} syntax.  First, the @subcmd{CELLS} subcommand
+must declare the number of combinations of factor variables present in
+the data.  If there is, for example, one factor variable for which the
+data contains three values, one would write @code{CELLS=3}; if there
+are two (or more) factor variables for which the data contains five
+combinations, one would use @code{CELLS=5}; and so on.
+
+Second, the @subcmd{CONTENTS} subcommand must distinguish within-cell
+data from pooled data by enclosing within-cell row types in
+parentheses.  When the different within-cell row types for a single
+factor value appear on consecutive lines, enclose those row types in a
+single set of parentheses; when the different factor values for a
+given within-cell row type appear on consecutive lines, enclose each
+row type in its own parentheses.
+
+Without @code{ROWTYPE_}, input lines for pooled data do not include
+factor values, not even as missing values, but input lines for
+within-cell data do.
+
+The following examples aim to clarify this syntax.
+
+@subsubheading Example 9: Factor variables, grouping within-cell records by factor
+
+This syntax defines the same matrix file as Example 5 (@pxref{MATRIX
+DATA Example 5}), without using @code{ROWTYPE_}.  It declares
+@code{CELLS=2} because the data contains two values (0 and 1) for
+factor variable @code{f1}.  Within-cell vector row types @code{MEAN},
+@code{SD}, and @code{N} are in a single set of parentheses on
+@subcmd{CONTENTS} because they are grouped together on consecutive
+lines for a single factor value.  The data lines with the pooled
+correlation matrix do not have any factor values.
+
+@example
+MATRIX DATA
+    VARIABLES=f1 var01 TO var04
+    /FACTORS=f1
+    /CELLS=2
+    /CONTENTS=(MEAN SD N) CORR.
+BEGIN DATA.
+0 34 35 36 37
+0 22 11 55 66
+0 99 98 99 92
+1 44 45 34 39
+1 23 15 51 46
+1 98 34 87 23
+   1
+  .9  1
+  .8 .6  1
+  .7 .5 .4  1
+END DATA.
+@end example
+
+@subsubheading Example 10: Factor variables, grouping within-cell records by row type
+
+This syntax defines the same matrix file as the previous example.  The
+only difference is that the within-cell vector rows are grouped
+differently: two rows of means (one for each factor), followed by two
+rows of standard deviations, followed by two rows of counts.
+
+@example
+MATRIX DATA
+    VARIABLES=f1 var01 TO var04
+    /FACTORS=f1
+    /CELLS=2
+    /CONTENTS=(MEAN) (SD) (N) CORR.
+BEGIN DATA.
+0 34 35 36 37
+1 44 45 34 39
+0 22 11 55 66
+1 23 15 51 46
+0 99 98 99 92
+1 98 34 87 23
+   1
+  .9  1
+  .8 .6  1
+  .7 .5 .4  1
+END DATA.
+@end example
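+
+The @subcmd{CELLS} rule for more than one factor variable can be
+sketched in the same way.  The syntax below assumes two factor
+variables, hypothetically named @code{f1} and @code{f2}, whose values
+form four combinations, so it declares @code{CELLS=4}.  The values
+are illustrative only.  As in the examples above, the pooled
+correlation lines carry no factor values.
+
+@example
+MATRIX DATA
+    VARIABLES=f1 f2 var01 TO var02
+    /FACTORS=f1 f2
+    /CELLS=4
+    /CONTENTS=(MEAN) CORR.
+BEGIN DATA.
+0 0 34 35
+0 1 44 45
+1 0 36 37
+1 1 34 39
+     1
+    .9  1
+END DATA.
+@end example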