X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-io.texi;h=f7d3e7ff336726413ffd6105a4c8c9426655434c;hb=d26105c398be227dc38668ce3e742c31adef15f7;hp=e1eeff09a37d1876eacc2fb4c269f84a0f21d617;hpb=a74ac710e3cd2b6a52fd763ab31ce68e83f21dfe;p=pspp

diff --git a/doc/data-io.texi b/doc/data-io.texi
index e1eeff09a3..f7d3e7ff33 100644
--- a/doc/data-io.texi
+++ b/doc/data-io.texi
@@ -1,3 +1,12 @@
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
 @c (modify-syntax-entry ?_ "w")
 @c (modify-syntax-entry ?' "'")
 @c (modify-syntax-entry ?@ "'")
@@ -11,7 +20,7 @@
 @cindex cases
 @cindex observations
 
-Data are the focus of the @pspp{} language.  
+Data are the focus of the @pspp{} language.
 Each datum  belongs to a @dfn{case} (also called an @dfn{observation}).
 Each case represents an individual or ``experimental unit''.
 For example, in the results of a survey, the names of the respondents,
@@ -39,6 +48,7 @@ actually be read until a procedure is executed.
 * INPUT PROGRAM::               Support for complex input programs.
 * LIST::                        List cases in the active dataset.
 * NEW FILE::                    Clear the active dataset.
+* MATRIX DATA::                 Defining matrix material for procedures.
 * PRINT::                       Display values in print formats.
 * PRINT EJECT::                 Eject the current page then print.
 * PRINT SPACE::                 Print blank lines.
@@ -180,7 +190,7 @@ The DATASET DECLARE command creates a new dataset that is initially
 ``empty,'' that is, it has no dictionary or data.  If a dataset with
 the given name already exists, this has no effect.  The new dataset
 can be used with commands that support output to a dataset,
-e.g. AGGREGATE (@pxref{AGGREGATE}).
+@i{e.g.} AGGREGATE (@pxref{AGGREGATE}).
 
 @vindex DATASET CLOSE
 The DATASET CLOSE command deletes a dataset.  If the active dataset is
@@ -295,7 +305,7 @@ the beginning of an input file.  It can be used to skip over a row
 that contains variable names, for example.
 
 @cmd{DATA LIST} can optionally output a table describing how the data file
-will be read.  The @subcmd{TABLE} subcommand enables this output, and
+is read.  The @subcmd{TABLE} subcommand enables this output, and
 @subcmd{NOTABLE} disables it.  The default is to output the table.
 
 The list of variables to be read from the data list must come last.
@@ -319,7 +329,7 @@ changed; see @ref{SET} for more information.)
 
 In columnar style, to use a variable format other than the default,
 specify the format type in parentheses after the column numbers.  For
-instance, for alphanumeric @samp{A} format, use @samp{(A)}.  
+instance, for alphanumeric @samp{A} format, use @samp{(A)}.
 
 In addition, implied decimal places can be specified in parentheses
 after the column numbers.  As an example, suppose that a data file has a
@@ -375,7 +385,7 @@ FORTRAN and columnar styles may be freely intermixed.  Columnar style
 leaves the active column immediately after the ending column
 specified.  Record motion using @code{NEWREC} in FORTRAN style also
 applies to later FORTRAN and columnar specifiers.
- 
+
 @menu
 * DATA LIST FIXED Examples::    Examples of DATA LIST FIXED.
 @end menu
@@ -484,7 +494,10 @@ where each @var{var_spec} takes one of the forms
 @end display
 
 In free format, the input data is, by default, structured as a series
-of fields separated by spaces, tabs, commas, or line breaks.  Each
+of fields separated by spaces, tabs, or line breaks.
+If the current @subcmd{DECIMAL} separator is @subcmd{DOT} (@pxref{SET}),
+then commas are also treated as field separators.
+Each
 field's content may be unquoted, or it may be quoted with a pairs of
 apostrophes (@samp{'}) or double quotes (@samp{"}).  Unquoted white
 space separates fields but is not part of any field.  Any mix of
@@ -511,7 +524,7 @@ The variables to be parsed are given as a single list of variable names.
 This list must be introduced by a single slash (@samp{/}).  The set of
 variable names may contain format specifications in parentheses
 (@pxref{Input and Output Formats}).  Format specifications apply to all
-variables back to the previous parenthesized format specification.  
+variables back to the previous parenthesized format specification.
 
 In addition, an asterisk may be used to indicate that all variables
 preceding it are to have input/output format @samp{F8.0}.
@@ -743,7 +756,7 @@ For reading text files in CHARACTER mode, all of the forms described
 for ENCODING on the INSERT command are supported (@pxref{INSERT}).
 For reading in other file-based modes, encoding autodetection is not
 supported; if the specified encoding requests autodetection then the
-default encoding will be used.  This is also true when a file handle
+default encoding is used.  This is also true when a file handle
 is used for writing a file in any mode.
 
 @node INPUT PROGRAM
@@ -819,7 +832,7 @@ the extra data in the longer file is ignored.
 @example
 INPUT PROGRAM.
         NUMERIC #A #B.
-        
+
         DO IF NOT #A.
                 DATA LIST NOTABLE END=#A FILE='a.data'/X 1-10.
         END IF.
@@ -955,12 +968,166 @@ NEW FILE.
 @cmd{NEW FILE} command clears the dictionary and data from the current
 active dataset.
 
+@node MATRIX DATA
+@section MATRIX DATA
+@vindex MATRIX DATA
+
+@display
+MATRIX DATA
+        VARIABLES = @var{columns}
+        [FILE=@{'@var{file_name}' | INLINE@}]
+        [/FORMAT= [@{LIST | FREE@}]
+                  [@{UPPER | LOWER | FULL@}]
+                  [@{DIAGONAL | NODIAGONAL@}]]
+        [/N= @var{n}]
+        [/SPLIT= @var{split_variables}].
+@end display
+
+The @cmd{MATRIX DATA} command is used to input data in the form of matrices
+which can subsequently be used by other commands.  If the @subcmd{FILE}
+subcommand is omitted or takes the value @samp{INLINE} then the command
+should be immediately followed by @cmd{BEGIN DATA} (@pxref{BEGIN DATA}).
+
+There is one mandatory subcommand, @i{viz:} @subcmd{VARIABLES}, which defines
+the @var{columns} of the matrix.
+Normally, the @var{columns} should include an item called @samp{ROWTYPE_}.
+The @samp{ROWTYPE_} column is used to specify the purpose of a row in the
+matrix.
+
+@example
+matrix data
+    variables = rowtype_ var01 TO var08.
+
+begin data.
+mean  24.3  5.4  69.7  20.1  13.4  2.7  27.9  3.7
+sd    5.7   1.5  23.5  5.8   2.8   4.5  5.4   1.5
+n     92    92   92    92    92    92   92    92
+corr 1.00
+corr .18  1.00
+corr -.22  -.17  1.00
+corr .36  .31  -.14  1.00
+corr .27  .16  -.12  .22  1.00
+corr .33  .15  -.17  .24  .21  1.00
+corr .50  .29  -.20  .32  .12  .38  1.00
+corr .17  .29  -.05  .20  .27  .20  .04  1.00
+end data.
+@end example
+
+In the above example, the first three rows have ROWTYPE_ values of
+@samp{mean}, @samp{sd}, and @samp{n}.  These indicate that the rows
+contain mean values, standard deviations and counts, respectively.
+All subsequent rows have a ROWTYPE_ of @samp{corr} which indicates
+that the values are correlation coefficients.
+
+Note that in this example, the upper right portion of each @samp{corr}
+row is blank, and in each case, the rightmost value is unity.
+This is because the
+@subcmd{FORMAT} subcommand defaults to @samp{LOWER DIAGONAL},
+which indicates that only the lower triangle, including the diagonal,
+is provided in the data.  The opposite triangle is automatically
+inferred.  One could instead specify the upper triangle as follows:
+
+
+@example
+matrix data
+    variables = rowtype_ var01 TO var08
+    /format = upper nodiagonal.
+
+begin data.
+mean  24.3 5.4  69.7  20.1  13.4  2.7  27.9  3.7
+sd    5.7  1.5  23.5  5.8   2.8   4.5  5.4   1.5
+n     92    92   92    92    92    92   92    92
+corr         .17  .50  -.33  .27  .36  -.22  .18
+corr               .29  .29  -.20  .32  .12  .38
+corr                    .05  .20  -.15  .16  .21
+corr                         .20  .32  -.17  .12
+corr                              .27  .12  -.24
+corr                                  -.20  -.38
+corr                                         .04
+end data.
+@end example
+
+In this example the @samp{NODIAGONAL} keyword is used.  Accordingly
+the diagonal values of the matrix are omitted.  This implies that
+there is one fewer @samp{corr} line than there are variables.
+If the @samp{FULL} option is passed to the @subcmd{FORMAT} subcommand,
+then all the matrix elements must be provided, including the diagonal
+elements.
+
+In the preceding examples, each matrix row has been specified on a
+single line.  If you pass the keyword @samp{FREE} to @subcmd{FORMAT}
+then data for several matrix rows may be specified on the same
+line, or a single row may be split across lines.
+
+The @subcmd{N} subcommand may be used to specify the number
+of valid cases for each variable.  It should not be used if the
+data contains a record whose ROWTYPE_ column is @samp{N} or @samp{N_VECTOR}.
+It implies an @samp{N} record whose values are all @var{n}.
+That is to say,
+@example
+matrix data
+    variables = rowtype_  var01 TO var04
+    /format = upper nodiagonal
+    /n = 99.
+begin data
+mean 34 35 36 37
+sd   22 11 55 66
+corr 9 8 7
+corr 6 5
+corr 4
+end data.
+@end example
+produces an effect identical to
+@example
+matrix data
+    variables = rowtype_  var01 TO var04
+    /format = upper nodiagonal.
+begin data
+n    99 99 99 99
+mean 34 35 36 37
+sd   22 11 55 66
+corr 9 8 7
+corr 6 5
+corr 4
+end data.
+@end example
+
+
+The @subcmd{SPLIT} subcommand is used to indicate that variables are
+to be treated as split variables.  For example, the following
+defines two matrices using the variable @samp{S1} to distinguish
+between them.
+
+@example
+matrix data
+    variables = s1 rowtype_  var01 TO var04
+    /split = s1
+    /format = full diagonal.
+
+begin data
+0 mean 34 35 36 37
+0 sd   22 11 55 66
+0 n    99 98 99 92
+0 corr 1 9 8 7
+0 corr 9 1 6 5
+0 corr 8 6 1 4
+0 corr 7 5 4 1
+1 mean 44 45 34 39
+1 sd   23 15 51 46
+1 n    98 34 87 23
+1 corr 1 2 3 4
+1 corr 2 1 5 6
+1 corr 3 5 1 7
+1 corr 4 6 7 1
+end data.
+@end example
+
 @node PRINT
 @section PRINT
 @vindex PRINT
 
 @display
-PRINT 
+PRINT
         [OUTFILE='@var{file_name}']
         [RECORDS=@var{n_lines}]
         [@{NOTABLE,TABLE@}]
@@ -984,10 +1151,11 @@ are specified, @cmd{PRINT} outputs a single blank line.
 
 The @subcmd{OUTFILE} subcommand specifies the file to receive the output.  The
 file may be a file name as a string or a file handle (@pxref{File
-Handles}).  If @subcmd{OUTFILE} is not present then output will be sent to
-@pspp{}'s output listing file.  When @subcmd{OUTFILE} is present, a space is
-inserted at beginning of each output line, even lines that otherwise
-would be blank.
+Handles}).  If @subcmd{OUTFILE} is not present then output is sent to
+@pspp{}'s output listing file.  When @subcmd{OUTFILE} is present, the
+output is written to @var{file_name} in a plain text format, with a
+space inserted at the beginning of each output line, even lines that
+otherwise would be blank.
 
 The @subcmd{ENCODING} subcommand may only be used if the
 @subcmd{OUTFILE} subcommand is also used.  It specifies the character
@@ -1003,15 +1171,15 @@ default, suppresses this output table.
 
 Introduce the strings and variables to be printed with a slash
 (@samp{/}).  Optionally, the slash may be followed by a number
-indicating which output line will be specified.  In the absence of this
-line number, the next line number will be specified.  Multiple lines may
+indicating which output line is specified.  In the absence of this
+line number, the next line number is specified.  Multiple lines may
 be specified using multiple slashes with the intended output for a line
 following its respective slash.
 
 Literal strings may be printed.  Specify the string itself.
 Optionally the string may be followed by a column number, specifying
 the column on the line where the string should start.  Otherwise, the
-string will be printed at the current position on the line.
+string is printed at the current position on the line.
 
 Variables to be printed can be specified in the same ways as available
 for @cmd{DATA LIST FIXED} (@pxref{DATA LIST FIXED}).  In addition, a
@@ -1019,10 +1187,10 @@ variable
 list may be followed by an asterisk (@samp{*}), which indicates that the
 variables should be printed in their dictionary print formats, separated
 by spaces.  A variable list followed by a slash or the end of command
-will be interpreted the same way.
+is interpreted in the same way.
 
 If a FORTRAN type specification is used to move backwards on the current
-line, then text is written at that point on the line, the line will be
+line, then text is written at that point on the line, the line is
 truncated to that length, although additional text being added will
 again extend the line to that length.
 
@@ -1031,7 +1199,7 @@ again extend the line to that length.
 @vindex PRINT EJECT
 
 @display
-PRINT EJECT 
+PRINT EJECT
         OUTFILE='@var{file_name}'
         RECORDS=@var{n_lines}
         @{NOTABLE,TABLE@}
@@ -1073,7 +1241,7 @@ PRINT SPACE [OUTFILE='file_name'] [ENCODING='@var{encoding}'] [n_lines].
 
 The @subcmd{OUTFILE} subcommand is optional.  It may be used to direct output to
 a file specified by file name as a string or file handle (@pxref{File
-Handles}).  If OUTFILE is not specified then output will be directed to
+Handles}).  If OUTFILE is not specified then output is directed to
 the listing file.
 
 The @subcmd{ENCODING} subcommand may only be used if @subcmd{OUTFILE}
@@ -1100,7 +1268,7 @@ for further processing.
 The @subcmd{FILE} subcommand, which is optional, is used to specify the file to
 have its line re-read.  The file must be specified as the name of a file
 handle (@pxref{File Handles}).  If FILE is not specified then the last
-file specified on @cmd{DATA LIST} will be assumed (last file specified
+file specified on @cmd{DATA LIST} is assumed (last file specified
 lexically, not in terms of flow-of-control).
 
 By default, the line re-read is re-read in its entirety.  With the
@@ -1203,7 +1371,7 @@ structure (@pxref{LOOP}).  Use @cmd{DATA LIST} before, not after,
 @vindex WRITE
 
 @display
-WRITE 
+WRITE
         OUTFILE='@var{file_name}'
         RECORDS=@var{n_lines}
         @{NOTABLE,TABLE@}
@@ -1216,7 +1384,7 @@ WRITE
         @var{var_list} *
 @end display
 
-@code{WRITE} writes text or binary data to an output file.  
+@code{WRITE} writes text or binary data to an output file.
 
 @xref{PRINT}, for more information on syntax and usage.  @cmd{PRINT}
 and @cmd{WRITE} differ in only a few ways: