X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-io.texi;h=f7d3e7ff336726413ffd6105a4c8c9426655434c;hb=4daafd79f8b250b1651c1b66a3171a654abf252d;hp=7d11cc6f05ad75dd8236f6694c960f64d3f1ecb3;hpb=1b1837591924226078c96db15888b68beec2ef6d;p=pspp diff --git a/doc/data-io.texi b/doc/data-io.texi index 7d11cc6f05..f7d3e7ff33 100644 --- a/doc/data-io.texi +++ b/doc/data-io.texi @@ -1,3 +1,12 @@ +@c PSPP - a program for statistical analysis. +@c Copyright (C) 2017, 2020 Free Software Foundation, Inc. +@c Permission is granted to copy, distribute and/or modify this document +@c under the terms of the GNU Free Documentation License, Version 1.3 +@c or any later version published by the Free Software Foundation; +@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. +@c A copy of the license is included in the section entitled "GNU +@c Free Documentation License". +@c @c (modify-syntax-entry ?_ "w") @c (modify-syntax-entry ?' "'") @c (modify-syntax-entry ?@ "'") @@ -11,7 +20,7 @@ @cindex cases @cindex observations -Data are the focus of the @pspp{} language. +Data are the focus of the @pspp{} language. Each datum belongs to a @dfn{case} (also called an @dfn{observation}). Each case represents an individual or ``experimental unit''. For example, in the results of a survey, the names of the respondents, @@ -39,6 +48,7 @@ actually be read until a procedure is executed. * INPUT PROGRAM:: Support for complex input programs. * LIST:: List cases in the active dataset. * NEW FILE:: Clear the active dataset. +* MATRIX DATA:: Defining matrix material for procedures. * PRINT:: Display values in print formats. * PRINT EJECT:: Eject the current page then print. * PRINT SPACE:: Print blank lines. @@ -180,7 +190,7 @@ The DATASET DECLARE command creates a new dataset that is initially ``empty,'' that is, it has no dictionary or data. If a dataset with the given name already exists, this has no effect. 
The new dataset can be used with commands that support output to a dataset, -e.g. AGGREGATE (@pxref{AGGREGATE}). +@i{e.g.} AGGREGATE (@pxref{AGGREGATE}). @vindex DATASET CLOSE The DATASET CLOSE command deletes a dataset. If the active dataset is @@ -277,8 +287,9 @@ external file. It may be used to specify a file name as a string or a file handle (@pxref{File Handles}). If the @subcmd{FILE} subcommand is not used, then input is assumed to be specified within the command file using @cmd{BEGIN DATA}@dots{}@cmd{END DATA} (@pxref{BEGIN DATA}). -The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE} subcommand is also used. -It specifies the character encoding of the file. +The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE} +subcommand is also used. It specifies the character encoding of the +file. @xref{INSERT}, for information on supported encodings. The optional @subcmd{RECORDS} subcommand, which takes a single integer as an argument, is used to specify the number of lines per record. @@ -294,7 +305,7 @@ the beginning of an input file. It can be used to skip over a row that contains variable names, for example. @cmd{DATA LIST} can optionally output a table describing how the data file -will be read. The @subcmd{TABLE} subcommand enables this output, and +is read. The @subcmd{TABLE} subcommand enables this output, and @subcmd{NOTABLE} disables it. The default is to output the table. The list of variables to be read from the data list must come last. @@ -304,7 +315,7 @@ of variable specifications may be present. Each variable specification consists of a list of variable names followed by a description of their location on the input line. Sets of -variables may be specified using the @code{DATA LIST} TO convention +variables may be specified using the @cmd{DATA LIST} @subcmd{TO} convention (@pxref{Sets of Variables}). There are two ways to specify the location of the variable on the line: columnar style and FORTRAN style. 
@@ -318,7 +329,7 @@ changed; see @ref{SET} for more information.) In columnar style, to use a variable format other than the default, specify the format type in parentheses after the column numbers. For -instance, for alphanumeric @samp{A} format, use @samp{(A)}. +instance, for alphanumeric @samp{A} format, use @samp{(A)}. In addition, implied decimal places can be specified in parentheses after the column numbers. As an example, suppose that a data file has a @@ -374,7 +385,7 @@ FORTRAN and columnar styles may be freely intermixed. Columnar style leaves the active column immediately after the ending column specified. Record motion using @code{NEWREC} in FORTRAN style also applies to later FORTRAN and columnar specifiers. - + @menu * DATA LIST FIXED Examples:: Examples of DATA LIST FIXED. @end menu @@ -483,7 +494,10 @@ where each @var{var_spec} takes one of the forms @end display In free format, the input data is, by default, structured as a series -of fields separated by spaces, tabs, commas, or line breaks. Each +of fields separated by spaces, tabs, or line breaks. +If the current @subcmd{DECIMAL} separator is @subcmd{DOT} (@pxref{SET}), +then commas are also treated as field separators. +Each field's content may be unquoted, or it may be quoted with a pair of apostrophes (@samp{'}) or double quotes (@samp{"}). Unquoted white space separates fields but is not part of any field. Any mix of @@ -503,13 +517,14 @@ of quoting is allowed. The @subcmd{NOTABLE} and @subcmd{TABLE} subcommands are as in @cmd{DATA LIST FIXED} above. @subcmd{NOTABLE} is the default. -The @subcmd{FILE} and @subcmd{SKIP} subcommands are as in @cmd{DATA LIST FIXED} above. +The @subcmd{FILE}, @subcmd{SKIP}, and @subcmd{ENCODING} subcommands +are as in @cmd{DATA LIST FIXED} above. The variables to be parsed are given as a single list of variable names. This list must be introduced by a single slash (@samp{/}).
The set of variable names may contain format specifications in parentheses (@pxref{Input and Output Formats}). Format specifications apply to all -variables back to the previous parenthesized format specification. +variables back to the previous parenthesized format specification. In addition, an asterisk may be used to indicate that all variables preceding it are to have input/output format @samp{F8.0}. @@ -525,7 +540,7 @@ on field width apply, but they are honored on output. DATA LIST LIST [(@{TAB,'@var{c}'@}, @dots{})] [@{NOTABLE,TABLE@}] - [FILE='@var{file_name'} [ENCODING='@var{encoding}']] + [FILE='@var{file_name}' [ENCODING='@var{encoding}']] [SKIP=@var{record_count}] /@var{var_spec}@dots{} @@ -571,19 +586,23 @@ For text files: FILE HANDLE @var{handle_name} /NAME='@var{file_name}' [/MODE=CHARACTER] + [/ENDS=@{CR,CRLF@}] /TABWIDTH=@var{tab_width} + [ENCODING='@var{encoding}'] For binary files in native encoding with fixed-length records: FILE HANDLE @var{handle_name} /NAME='@var{file_name}' /MODE=IMAGE [/LRECL=@var{rec_len}] + [ENCODING='@var{encoding}'] For binary files in native encoding with variable-length records: FILE HANDLE @var{handle_name} /NAME='@var{file_name}' /MODE=BINARY [/LRECL=@var{rec_len}] + [ENCODING='@var{encoding}'] For binary files encoded in EBCDIC: FILE HANDLE @var{handle_name} @@ -591,6 +610,7 @@ For binary files encoded in EBCDIC: /MODE=360 /RECFORM=@{FIXED,VARIABLE,SPANNED@} [/LRECL=@var{rec_len}] + [ENCODING='@var{encoding}'] @end display Use @cmd{FILE HANDLE} to associate a file handle name with a file and @@ -613,9 +633,8 @@ The effect and syntax of @cmd{FILE HANDLE} depends on the selected MODE: @itemize @item -In CHARACTER mode, the default, the data file is read as a text file, -according to the local system's conventions, and each text line is -read as one record. +In CHARACTER mode, the default, the data file is read as a text file. +Each text line is read as one record.
In CHARACTER mode only, tabs are expanded to spaces by input programs, except by @cmd{DATA LIST FREE} with explicitly specified delimiters. @@ -623,6 +642,12 @@ Each tab is 4 characters wide by default, but TABWIDTH (a @pspp{} extension) may be used to specify an alternate width. Use a TABWIDTH of 0 to suppress tab expansion. +A file written in CHARACTER mode by default uses the line ends of the +system on which PSPP is running, that is, on Windows, the default is +CR LF line ends, and on other systems the default is LF only. Specify +ENDS as CR or CRLF to override the default. PSPP reads files using +either convention on any kind of system, regardless of ENDS. + @item In IMAGE mode, the data file is treated as a series of fixed-length binary records. LRECL should be used to specify the record length in @@ -726,6 +751,14 @@ The @subcmd{NAME} subcommand specifies the name of the file associated with the handle. It is required in all modes but SCRATCH mode, in which its use is forbidden. +The ENCODING subcommand specifies the encoding of text in the file. +For reading text files in CHARACTER mode, all of the forms described +for ENCODING on the INSERT command are supported (@pxref{INSERT}). +For reading in other file-based modes, encoding autodetection is not +supported; if the specified encoding requests autodetection then the +default encoding is used. This is also true when a file handle +is used for writing a file in any mode. + @node INPUT PROGRAM @section INPUT PROGRAM @vindex INPUT PROGRAM @@ -775,6 +808,9 @@ so an infinite loop results. @cmd{END FILE}, when executed, stops the flow of input data and passes out of the @cmd{INPUT PROGRAM} structure. +@cmd{INPUT PROGRAM} must contain at least one @cmd{DATA LIST} or +@cmd{END FILE} command. + All this is very confusing. A few examples should help to clarify. @c If you change this example, change the regression test1 in @@ -796,7 +832,7 @@ the extra data in the longer file is ignored. @example INPUT PROGRAM. 
NUMERIC #A #B. - + DO IF NOT #A. DATA LIST NOTABLE END=#A FILE='a.data'/X 1-10. END IF. @@ -906,23 +942,19 @@ printed. Keyword VARIABLES is optional. If the @subcmd{VARIABLES} subcommand is not specified then all variables in the active dataset are printed. The @subcmd{CASES} subcommand can be used to specify a subset of cases to be -printed. Specify FROM and the case number of the first case to print, -TO and the case number of the last case to print, and BY and the number +printed. Specify @subcmd{FROM} and the case number of the first case to print, +@subcmd{TO} and the case number of the last case to print, and @subcmd{BY} and the number of cases to advance between printing cases, or any subset of those -settings. If CASES is not specified then all cases are printed. +settings. If @subcmd{CASES} is not specified then all cases are printed. -The @subcmd{FORMAT} subcommand can be used to change the output format. NUMBERED -will print case numbers along with each case; UNNUMBERED, the default, -causes the case numbers to be omitted. The WRAP and SINGLE settings are +The @subcmd{FORMAT} subcommand can be used to change the output format. @subcmd{NUMBERED} +will print case numbers along with each case; @subcmd{UNNUMBERED}, the default, +causes the case numbers to be omitted. The @subcmd{WRAP} and @subcmd{SINGLE} settings are currently not used. Case numbers start from 1. They are counted after all transformations have been considered. -@cmd{LIST} attempts to fit all the values on a single line. If needed -to make them fit, variable names are displayed vertically. If values -cannot fit on a single line, then a multi-line format will be used. - @cmd{LIST} is a procedure. It causes the data to be read. @node NEW FILE @@ -936,19 +968,174 @@ NEW FILE. The @cmd{NEW FILE} command clears the dictionary and data from the current active dataset.
+@node MATRIX DATA +@section MATRIX DATA +@vindex MATRIX DATA + +@display +MATRIX DATA + VARIABLES = @var{columns} + [FILE=@{'@var{file_name}' | INLINE@}] + [/FORMAT= [@{LIST | FREE@}] + [@{UPPER | LOWER | FULL@}] + [@{DIAGONAL | NODIAGONAL@}]] + [/N= @var{n}] + [/SPLIT= @var{split_variables}]. +@end display + +The @cmd{MATRIX DATA} command is used to input data in the form of matrices +which can subsequently be used by other commands. If the +@subcmd{FILE} subcommand is omitted or takes the value @samp{INLINE} then the command +should be immediately followed by @cmd{BEGIN DATA} (@pxref{BEGIN DATA}). + +There is one mandatory subcommand, @i{viz:} @subcmd{VARIABLES}, which defines +the @var{columns} of the matrix. +Normally, the @var{columns} should include an item called @samp{ROWTYPE_}. +The @samp{ROWTYPE_} column is used to specify the purpose of a row in the +matrix. + +@example +matrix data + variables = rowtype_ var01 TO var08. + +begin data. +mean 24.3 5.4 69.7 20.1 13.4 2.7 27.9 3.7 +sd 5.7 1.5 23.5 5.8 2.8 4.5 5.4 1.5 +n 92 92 92 92 92 92 92 92 +corr 1.00 +corr .18 1.00 +corr -.22 -.17 1.00 +corr .36 .31 -.14 1.00 +corr .27 .16 -.12 .22 1.00 +corr .33 .15 -.17 .24 .21 1.00 +corr .50 .29 -.20 .32 .12 .38 1.00 +corr .17 .29 -.05 .20 .27 .20 .04 1.00 +end data. +@end example + +In the above example, the first three rows have ROWTYPE_ values of +@samp{mean}, @samp{sd}, and @samp{n}. These indicate that the rows +contain mean values, standard deviations and counts, respectively. +All subsequent rows have a ROWTYPE_ of @samp{corr}, which indicates +that the values are correlation coefficients. + +Note that in this example, the upper-right entries of the @samp{corr} +rows are blank, and in each case, the rightmost value is unity. +This is because the +@subcmd{FORMAT} subcommand defaults to @samp{LOWER DIAGONAL}, +which indicates that only the lower triangle is provided in the data. +The opposite triangle is automatically inferred.
One could instead +specify the upper triangle as follows: + + +@example +matrix data + variables = rowtype_ var01 TO var08 + /format = upper nodiagonal. + +begin data. +mean 24.3 5.4 69.7 20.1 13.4 2.7 27.9 3.7 +sd 5.7 1.5 23.5 5.8 2.8 4.5 5.4 1.5 +n 92 92 92 92 92 92 92 92 +corr .17 .50 -.33 .27 .36 -.22 .18 +corr .29 .29 -.20 .32 .12 .38 +corr .05 .20 -.15 .16 .21 +corr .20 .32 -.17 .12 +corr .27 .12 -.24 +corr -.20 -.38 +corr .04 +end data. +@end example + +In this example the @samp{NODIAGONAL} keyword is used. Accordingly +the diagonal values of the matrix are omitted. This implies that +there is one fewer @samp{corr} line than there are variables. +If the @samp{FULL} option is passed to the @subcmd{FORMAT} subcommand, +then all the matrix elements must be provided, including the diagonal +elements. + +In the preceding examples, each matrix row has been specified on a +single line. If you pass the keyword @samp{FREE} to @subcmd{FORMAT} +then the data for several matrix rows may be specified on +the same line, or a single row may be split across lines. + +The @subcmd{N} subcommand may be used to specify the number +of valid cases for each variable. It should not be used if the +data contains a record whose ROWTYPE_ column is @samp{N} or @samp{N_VECTOR}. +It implies an @samp{N} record whose values are all @var{n}. +That is to say, +@example +matrix data + variables = rowtype_ var01 TO var04 + /format = upper nodiagonal + /n = 99. +begin data. +mean 34 35 36 37 +sd 22 11 55 66 +corr 9 8 7 +corr 6 5 +corr 4 +end data. +@end example +produces an effect identical to +@example +matrix data + variables = rowtype_ var01 TO var04 + /format = upper nodiagonal. +begin data. +n 99 99 99 99 +mean 34 35 36 37 +sd 22 11 55 66 +corr 9 8 7 +corr 6 5 +corr 4 +end data. +@end example + + +The @subcmd{SPLIT} subcommand is used to indicate that variables are to be +considered as split variables.
For example, the following +defines two matrices using the variable @samp{S1} to distinguish +between them. + +@example +matrix data + variables = s1 rowtype_ var01 TO var04 + /split = s1 + /format = full diagonal. + +begin data +0 mean 34 35 36 37 +0 sd 22 11 55 66 +0 n 99 98 99 92 +0 corr 1 9 8 7 +0 corr 9 1 6 5 +0 corr 8 6 1 4 +0 corr 7 5 4 1 +1 mean 44 45 34 39 +1 sd 23 15 51 46 +1 n 98 34 87 23 +1 corr 1 2 3 4 +1 corr 2 1 5 6 +1 corr 3 5 1 7 +1 corr 4 6 7 1 +end data. +@end example + @node PRINT @section PRINT @vindex PRINT @display -PRINT - OUTFILE='@var{file_name}' - RECORDS=@var{n_lines} - @{NOTABLE,TABLE@} +PRINT + [OUTFILE='@var{file_name}'] + [RECORDS=@var{n_lines}] + [@{NOTABLE,TABLE@}] + [ENCODING='@var{encoding}'] [/[@var{line_no}] @var{arg}@dots{}] @var{arg} takes one of the following forms: - '@var{string}' [@var{start}-@var{end}] + '@var{string}' [@var{start}] @var{var_list} @var{start}-@var{end} [@var{type_spec}] @var{var_list} (@var{fortran_spec}) @var{var_list} * @@ -964,10 +1151,16 @@ are specified, @cmd{PRINT} outputs a single blank line. The @subcmd{OUTFILE} subcommand specifies the file to receive the output. The file may be a file name as a string or a file handle (@pxref{File -Handles}). If @subcmd{OUTFILE} is not present then output will be sent to -@pspp{}'s output listing file. When @subcmd{OUTFILE} is present, a space is -inserted at beginning of each output line, even lines that otherwise -would be blank. +Handles}). If @subcmd{OUTFILE} is not present then output is sent to +@pspp{}'s output listing file. When @subcmd{OUTFILE} is present, the +output is written to @var{file_name} in a plain text format, with a +space inserted at beginning of each output line, even lines that +otherwise would be blank. + +The @subcmd{ENCODING} subcommand may only be used if the +@subcmd{OUTFILE} subcommand is also used. It specifies the character +encoding of the file. @xref{INSERT}, for information on supported +encodings. 
The @subcmd{RECORDS} subcommand specifies the number of lines to be output. The number of lines may optionally be surrounded by parentheses. @@ -978,17 +1171,15 @@ default, suppresses this output table. Introduce the strings and variables to be printed with a slash (@samp{/}). Optionally, the slash may be followed by a number -indicating which output line will be specified. In the absence of this -line number, the next line number will be specified. Multiple lines may +indicating which output line is specified. In the absence of this +line number, the next line number is specified. Multiple lines may be specified using multiple slashes with the intended output for a line following its respective slash. - -Literal strings may be printed. Specify the string itself. Optionally -the string may be followed by a column number or range of column -numbers, specifying the location on the line for the string to be -printed. Otherwise, the string will be printed at the current position -on the line. +Literal strings may be printed. Specify the string itself. +Optionally the string may be followed by a column number, specifying +the column on the line where the string should start. Otherwise, the +string is printed at the current position on the line. Variables to be printed can be specified in the same ways as available for @cmd{DATA LIST FIXED} (@pxref{DATA LIST FIXED}). In addition, a @@ -996,10 +1187,10 @@ variable list may be followed by an asterisk (@samp{*}), which indicates that the variables should be printed in their dictionary print formats, separated by spaces. A variable list followed by a slash or the end of command -will be interpreted the same way. +is interpreted in the same way. 
If a FORTRAN type specification is used to move backwards on the current -line, then text is written at that point on the line, the line will be +line, then text is written at that point on the line, the line is truncated to that length, although additional text being added will again extend the line to that length. @@ -1008,7 +1199,7 @@ again extend the line to that length. @vindex PRINT EJECT @display -PRINT EJECT +PRINT EJECT OUTFILE='@var{file_name}' RECORDS=@var{n_lines} @{NOTABLE,TABLE@} @@ -1034,7 +1225,7 @@ With @subcmd{OUTFILE}, @cmd{PRINT EJECT} writes its output to the specified file The first line of output is written with @samp{1} inserted in the first column. Commonly, this is the only line of output. If additional lines of output are specified, these additional lines are -written with a space inserted in the first column, as with PRINT. +written with a space inserted in the first column, as with @subcmd{PRINT}. @xref{PRINT}, for more information on syntax and usage. @@ -1043,16 +1234,20 @@ written with a space inserted in the first column, as with PRINT. @vindex PRINT SPACE @display -PRINT SPACE OUTFILE='file_name' n_lines. +PRINT SPACE [OUTFILE='file_name'] [ENCODING='@var{encoding}'] [n_lines]. @end display @cmd{PRINT SPACE} prints one or more blank lines to an output file. The @subcmd{OUTFILE} subcommand is optional. It may be used to direct output to a file specified by file name as a string or file handle (@pxref{File -Handles}). If OUTFILE is not specified then output will be directed to +Handles}). If OUTFILE is not specified then output is directed to the listing file. +The @subcmd{ENCODING} subcommand may only be used if @subcmd{OUTFILE} +is also used. It specifies the character encoding of the file. +@xref{INSERT}, for information on supported encodings. + n_lines is also optional. If present, it is an expression (@pxref{Expressions}) specifying the number of blank lines to be printed. The expression must evaluate to a nonnegative value. 
@@ -1062,7 +1257,7 @@ printed. The expression must evaluate to a nonnegative value. @vindex REREAD @display -REREAD FILE=handle COLUMN=column. +REREAD [FILE=handle] [COLUMN=column] [ENCODING='@var{encoding}']. @end display The @cmd{REREAD} transformation allows the previous input line in a @@ -1073,7 +1268,7 @@ for further processing. The @subcmd{FILE} subcommand, which is optional, is used to specify the file to have its line re-read. The file must be specified as the name of a file handle (@pxref{File Handles}). If FILE is not specified then the last -file specified on @cmd{DATA LIST} will be assumed (last file specified +file specified on @cmd{DATA LIST} is assumed (last file specified lexically, not in terms of flow-of-control). By default, the line re-read is re-read in its entirety. With the @@ -1082,6 +1277,10 @@ re-reading. Specify an expression (@pxref{Expressions}) evaluating to the first column that should be included in the re-read line. Columns are numbered from 1 at the left margin. +The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE} +subcommand is also used. It specifies the character encoding of the +file. @xref{INSERT}, for information on supported encodings. + Issuing @code{REREAD} multiple times will not back up in the data file. Instead, it will re-read the same line multiple times. @@ -1172,7 +1371,7 @@ structure (@pxref{LOOP}). Use @cmd{DATA LIST} before, not after, @vindex WRITE @display -WRITE +WRITE OUTFILE='@var{file_name}' RECORDS=@var{n_lines} @{NOTABLE,TABLE@} @@ -1185,7 +1384,7 @@ WRITE @var{var_list} * @end display -@code{WRITE} writes text or binary data to an output file. +@code{WRITE} writes text or binary data to an output file. @xref{PRINT}, for more information on syntax and usage. @cmd{PRINT} and @cmd{WRITE} differ in only a few ways: