X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-io.texi;h=9f5b8dfbeec3303c19ffe29b89249d96e8c2a67f;hb=refs%2Fheads%2Fctables7;hp=4862ccc5964e4d2e9783e5900b1f28901b542caf;hpb=e8b26fb0d765310d4c7400c39465008f1bb8601d;p=pspp

diff --git a/doc/data-io.texi b/doc/data-io.texi
index 4862ccc596..9f5b8dfbee 100644
--- a/doc/data-io.texi
+++ b/doc/data-io.texi
@@ -1,3 +1,12 @@
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
 @c (modify-syntax-entry ?_ "w")
 @c (modify-syntax-entry ?' "'")
 @c (modify-syntax-entry ?@ "'")
@@ -11,7 +20,7 @@
 @cindex cases
 @cindex observations
 
-Data are the focus of the @pspp{} language.  
+Data are the focus of the @pspp{} language.
 Each datum  belongs to a @dfn{case} (also called an @dfn{observation}).
 Each case represents an individual or ``experimental unit''.
 For example, in the results of a survey, the names of the respondents,
@@ -180,7 +189,7 @@ The DATASET DECLARE command creates a new dataset that is initially
 ``empty,'' that is, it has no dictionary or data.  If a dataset with
 the given name already exists, this has no effect.  The new dataset
 can be used with commands that support output to a dataset,
-e.g. AGGREGATE (@pxref{AGGREGATE}).
+@i{e.g.} AGGREGATE (@pxref{AGGREGATE}).
 
 @vindex DATASET CLOSE
 The DATASET CLOSE command deletes a dataset.  If the active dataset is
@@ -277,8 +286,9 @@ external file.  It may be used to specify a file name as a string or a
 file handle (@pxref{File Handles}).  If the @subcmd{FILE} subcommand is not used,
 then input is assumed to be specified within the command file using
 @cmd{BEGIN DATA}@dots{}@cmd{END DATA} (@pxref{BEGIN DATA}).
-The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE} subcommand is also used.
-It specifies the character encoding of the file.
+The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE}
+subcommand is also used.  It specifies the character encoding of the
+file.  @xref{INSERT}, for information on supported encodings.
 
 The optional @subcmd{RECORDS} subcommand, which takes a single integer as an
 argument, is used to specify the number of lines per record.
@@ -294,7 +304,7 @@ the beginning of an input file.  It can be used to skip over a row
 that contains variable names, for example.
 
 @cmd{DATA LIST} can optionally output a table describing how the data file
-will be read.  The @subcmd{TABLE} subcommand enables this output, and
+is read.  The @subcmd{TABLE} subcommand enables this output, and
 @subcmd{NOTABLE} disables it.  The default is to output the table.
 
 The list of variables to be read from the data list must come last.
@@ -318,7 +328,7 @@ changed; see @ref{SET} for more information.)
 
 In columnar style, to use a variable format other than the default,
 specify the format type in parentheses after the column numbers.  For
-instance, for alphanumeric @samp{A} format, use @samp{(A)}.  
+instance, for alphanumeric @samp{A} format, use @samp{(A)}.
 
 In addition, implied decimal places can be specified in parentheses
 after the column numbers.  As an example, suppose that a data file has a
@@ -374,7 +384,7 @@ FORTRAN and columnar styles may be freely intermixed.  Columnar style
 leaves the active column immediately after the ending column
 specified.  Record motion using @code{NEWREC} in FORTRAN style also
 applies to later FORTRAN and columnar specifiers.
- 
+
 @menu
 * DATA LIST FIXED Examples::    Examples of DATA LIST FIXED.
 @end menu
@@ -474,7 +484,7 @@ DATA LIST FREE
         [(@{TAB,'@var{c}'@}, @dots{})]
         [@{NOTABLE,TABLE@}]
         [FILE='@var{file_name}' [ENCODING='@var{encoding}']]
-        [SKIP=@var{record_cnt}]
+        [SKIP=@var{n_records}]
         /@var{var_spec}@dots{}
 
 where each @var{var_spec} takes one of the forms
@@ -483,7 +493,10 @@ where each @var{var_spec} takes one of the forms
 @end display
 
 In free format, the input data is, by default, structured as a series
-of fields separated by spaces, tabs, commas, or line breaks.  Each
+of fields separated by spaces, tabs, or line breaks.
+If the current @subcmd{DECIMAL} separator is @subcmd{DOT} (@pxref{SET}),
+then commas are also treated as field separators.
+Each
 field's content may be unquoted, or it may be quoted with a pairs of
 apostrophes (@samp{'}) or double quotes (@samp{"}).  Unquoted white
 space separates fields but is not part of any field.  Any mix of
@@ -503,13 +516,14 @@ of quoting is allowed.
 The @subcmd{NOTABLE} and @subcmd{TABLE} subcommands are as in @cmd{DATA LIST FIXED} above.
 @subcmd{NOTABLE} is the default.
 
-The @subcmd{FILE} and @subcmd{SKIP} subcommands are as in @cmd{DATA LIST FIXED} above.
+The @subcmd{FILE}, @subcmd{SKIP}, and @subcmd{ENCODING} subcommands
+are as in @cmd{DATA LIST FIXED} above.
 
 The variables to be parsed are given as a single list of variable names.
 This list must be introduced by a single slash (@samp{/}).  The set of
 variable names may contain format specifications in parentheses
 (@pxref{Input and Output Formats}).  Format specifications apply to all
-variables back to the previous parenthesized format specification.  
+variables back to the previous parenthesized format specification.
 
 In addition, an asterisk may be used to indicate that all variables
 preceding it are to have input/output format @samp{F8.0}.
@@ -525,7 +539,7 @@ on field width apply, but they are honored on output.
 DATA LIST LIST
         [(@{TAB,'@var{c}'@}, @dots{})]
         [@{NOTABLE,TABLE@}]
-        [FILE='@var{file_name'} [ENCODING='@var{encoding}']]
+        [FILE='@var{file_name}' [ENCODING='@var{encoding}']]
         [SKIP=@var{record_count}]
         /@var{var_spec}@dots{}
 
@@ -571,19 +585,23 @@ For text files:
         FILE HANDLE @var{handle_name}
                 /NAME='@var{file_name}
                 [/MODE=CHARACTER]
+                [/ENDS=@{CR,CRLF@}]
                 /TABWIDTH=@var{tab_width}
+                [ENCODING='@var{encoding}']
 
 For binary files in native encoding with fixed-length records:
         FILE HANDLE @var{handle_name}
                 /NAME='@var{file_name}'
                 /MODE=IMAGE
                 [/LRECL=@var{rec_len}]
+                [ENCODING='@var{encoding}']
 
 For binary files in native encoding with variable-length records:
         FILE HANDLE @var{handle_name}
                 /NAME='@var{file_name}'
                 /MODE=BINARY
                 [/LRECL=@var{rec_len}]
+                [ENCODING='@var{encoding}']
 
 For binary files encoded in EBCDIC:
         FILE HANDLE @var{handle_name}
@@ -591,6 +609,7 @@ For binary files encoded in EBCDIC:
                 /MODE=360
                 /RECFORM=@{FIXED,VARIABLE,SPANNED@}
                 [/LRECL=@var{rec_len}]
+                [ENCODING='@var{encoding}']
 @end display
 
 Use @cmd{FILE HANDLE} to associate a file handle name with a file and
@@ -613,9 +632,8 @@ The effect and syntax of @cmd{FILE HANDLE} depends on the selected MODE:
 
 @itemize
 @item
-In CHARACTER mode, the default, the data file is read as a text file,
-according to the local system's conventions, and each text line is
-read as one record.
+In CHARACTER mode, the default, the data file is read as a text file.
+Each text line is read as one record.
 
 In CHARACTER mode only, tabs are expanded to spaces by input programs,
 except by @cmd{DATA LIST FREE} with explicitly specified delimiters.
@@ -623,6 +641,12 @@ Each tab is 4 characters wide by default, but TABWIDTH (a @pspp{}
 extension) may be used to specify an alternate width.  Use a TABWIDTH
 of 0 to suppress tab expansion.
 
+A file written in CHARACTER mode by default uses the line ends of the
+system on which PSPP is running, that is, on Windows, the default is
+CR LF line ends, and on other systems the default is LF only.  Specify
+ENDS as CR or CRLF to override the default.  PSPP reads files using
+either convention on any kind of system, regardless of ENDS.
+
 @item
 In IMAGE mode, the data file is treated as a series of fixed-length
 binary records.  LRECL should be used to specify the record length in
@@ -726,6 +750,14 @@ The @subcmd{NAME} subcommand specifies the name of the file associated with the
 handle.  It is required in all modes but SCRATCH mode, in which its
 use is forbidden.
 
+The ENCODING subcommand specifies the encoding of text in the file.
+For reading text files in CHARACTER mode, all of the forms described
+for ENCODING on the INSERT command are supported (@pxref{INSERT}).
+For reading in other file-based modes, encoding autodetection is not
+supported; if the specified encoding requests autodetection then the
+default encoding is used.  This is also true when a file handle
+is used for writing a file in any mode.
+
 @node INPUT PROGRAM
 @section INPUT PROGRAM
 @vindex INPUT PROGRAM
@@ -775,118 +807,118 @@ so an infinite loop results.  @cmd{END FILE}, when executed,
 stops the flow of input data and passes out of the @cmd{INPUT PROGRAM}
 structure.
 
-All this is very confusing.  A few examples should help to clarify.
+@cmd{INPUT PROGRAM} must contain at least one @cmd{DATA LIST} or
+@cmd{END FILE} command.
+
+@subheading Example 1: Read two files in parallel to the end of the shorter
+
+The following example reads variable X from file @file{a.txt} and
+variable Y from file @file{b.txt}.  If one file is shorter than the
+other then the extra data in the longer file is ignored.
 
-@c If you change this example, change the regression test1 in
-@c tests/command/input-program.sh to match.
 @example
 INPUT PROGRAM.
-        DATA LIST NOTABLE FILE='a.data'/X 1-10.
-        DATA LIST NOTABLE FILE='b.data'/Y 1-10.
+    DATA LIST NOTABLE FILE='a.txt'/X 1-10.
+    DATA LIST NOTABLE FILE='b.txt'/Y 1-10.
 END INPUT PROGRAM.
 LIST.
 @end example
 
-The example above reads variable X from file @file{a.data} and variable
-Y from file @file{b.data}.  If one file is shorter than the other then
-the extra data in the longer file is ignored.
+@subheading Example 2: Read two files in parallel, supplementing the shorter
+
+The following example also reads variable X from @file{a.txt} and
+variable Y from @file{b.txt}.  If one file is shorter than the other
+then it continues reading the longer to its end, setting the other
+variable to system-missing.
 
-@c If you change this example, change the regression test2 in
-@c tests/command/input-program.sh to match.
 @example
 INPUT PROGRAM.
-        NUMERIC #A #B.
-        
-        DO IF NOT #A.
-                DATA LIST NOTABLE END=#A FILE='a.data'/X 1-10.
-        END IF.
-        DO IF NOT #B.
-                DATA LIST NOTABLE END=#B FILE='b.data'/Y 1-10.
-        END IF.
-        DO IF #A AND #B.
-                END FILE.
-        END IF.
-        END CASE.
+    NUMERIC #A #B.
+
+    DO IF NOT #A.
+        DATA LIST NOTABLE END=#A FILE='a.txt'/X 1-10.
+    END IF.
+    DO IF NOT #B.
+        DATA LIST NOTABLE END=#B FILE='b.txt'/Y 1-10.
+    END IF.
+    DO IF #A AND #B.
+        END FILE.
+    END IF.
+    END CASE.
 END INPUT PROGRAM.
 LIST.
 @end example
 
-The above example reads variable X from @file{a.data} and variable Y from
-@file{b.data}.  If one file is shorter than the other then the missing
-field is set to the system-missing value alongside the present value for
-the remaining length of the longer file.
+@subheading Example 3: Concatenate two files (version 1)
+
+The following example reads data from file @file{a.txt}, then from
+@file{b.txt}, and concatenates them into a single active dataset.
 
-@c If you change this example, change the regression test3 in
-@c tests/command/input-program.sh to match.
 @example
 INPUT PROGRAM.
-        NUMERIC #A #B.
-
-        DO IF #A.
-                DATA LIST NOTABLE END=#B FILE='b.data'/X 1-10.
-                DO IF #B.
-                        END FILE.
-                ELSE.
-                        END CASE.
-                END IF.
+    NUMERIC #A #B.
+
+    DO IF #A.
+        DATA LIST NOTABLE END=#B FILE='b.txt'/X 1-10.
+        DO IF #B.
+            END FILE.
         ELSE.
-                DATA LIST NOTABLE END=#A FILE='a.data'/X 1-10.
-                DO IF NOT #A.
-                        END CASE.
-                END IF.
+            END CASE.
         END IF.
+    ELSE.
+        DATA LIST NOTABLE END=#A FILE='a.txt'/X 1-10.
+        DO IF NOT #A.
+            END CASE.
+        END IF.
+    END IF.
 END INPUT PROGRAM.
 LIST.
 @end example
 
-The above example reads data from file @file{a.data}, then from
-@file{b.data}, and concatenates them into a single active dataset.
+@subheading Example 4: Concatenate two files (version 2)
+
+This is another way to do the same thing as Example 3.
 
-@c If you change this example, change the regression test4 in
-@c tests/command/input-program.sh to match.
 @example
 INPUT PROGRAM.
-        NUMERIC #EOF.
-
-        LOOP IF NOT #EOF.
-                DATA LIST NOTABLE END=#EOF FILE='a.data'/X 1-10.
-                DO IF NOT #EOF.
-                        END CASE.
-                END IF.
-        END LOOP.
-
-        COMPUTE #EOF = 0.
-        LOOP IF NOT #EOF.
-                DATA LIST NOTABLE END=#EOF FILE='b.data'/X 1-10.
-                DO IF NOT #EOF.
-                        END CASE.
-                END IF.
-        END LOOP.
+    NUMERIC #EOF.
 
-        END FILE.
+    LOOP IF NOT #EOF.
+        DATA LIST NOTABLE END=#EOF FILE='a.txt'/X 1-10.
+        DO IF NOT #EOF.
+            END CASE.
+        END IF.
+    END LOOP.
+
+    COMPUTE #EOF = 0.
+    LOOP IF NOT #EOF.
+        DATA LIST NOTABLE END=#EOF FILE='b.txt'/X 1-10.
+        DO IF NOT #EOF.
+            END CASE.
+        END IF.
+    END LOOP.
+
+    END FILE.
 END INPUT PROGRAM.
 LIST.
 @end example
 
-The above example does the same thing as the previous example, in a
-different way.
+@subheading Example 5: Generate random variates
+
+The follows example creates a dataset that consists of 50 random
+variates between 0 and 10.
 
-@c If you change this example, make similar changes to the regression
-@c test5 in tests/command/input-program.sh.
 @example
 INPUT PROGRAM.
-        LOOP #I=1 TO 50.
-                COMPUTE X=UNIFORM(10).
-                END CASE.
-        END LOOP.
-        END FILE.
+    LOOP #I=1 TO 50.
+        COMPUTE X=UNIFORM(10).
+        END CASE.
+    END LOOP.
+    END FILE.
 END INPUT PROGRAM.
-LIST/FORMAT=NUMBERED.
+LIST /FORMAT=NUMBERED.
 @end example
 
-The above example causes an active dataset to be created consisting of 50
-random variates between 0 and 10.
-
 @node LIST
 @section LIST
 @vindex LIST
@@ -919,10 +951,6 @@ currently not used.
 Case numbers start from 1.  They are counted after all transformations
 have been considered.
 
-@cmd{LIST} attempts to fit all the values on a single line.  If needed
-to make them fit, variable names are displayed vertically.  If values
-cannot fit on a single line, then a multi-line format will be used.
-
 @cmd{LIST} is a procedure.  It causes the data to be read.
 
 @node NEW FILE
@@ -941,14 +969,15 @@ active dataset.
 @vindex PRINT
 
 @display
-PRINT 
-        OUTFILE='@var{file_name}'
-        RECORDS=@var{n_lines}
-        @{NOTABLE,TABLE@}
+PRINT
+        [OUTFILE='@var{file_name}']
+        [RECORDS=@var{n_lines}]
+        [@{NOTABLE,TABLE@}]
+        [ENCODING='@var{encoding}']
         [/[@var{line_no}] @var{arg}@dots{}]
 
 @var{arg} takes one of the following forms:
-        '@var{string}' [@var{start}-@var{end}]
+        '@var{string}' [@var{start}]
         @var{var_list} @var{start}-@var{end} [@var{type_spec}]
         @var{var_list} (@var{fortran_spec})
         @var{var_list} *
@@ -964,10 +993,16 @@ are specified, @cmd{PRINT} outputs a single blank line.
 
 The @subcmd{OUTFILE} subcommand specifies the file to receive the output.  The
 file may be a file name as a string or a file handle (@pxref{File
-Handles}).  If @subcmd{OUTFILE} is not present then output will be sent to
-@pspp{}'s output listing file.  When @subcmd{OUTFILE} is present, a space is
-inserted at beginning of each output line, even lines that otherwise
-would be blank.
+Handles}).  If @subcmd{OUTFILE} is not present then output is sent to
+@pspp{}'s output listing file.  When @subcmd{OUTFILE} is present, the
+output is written to @var{file_name} in a plain text format, with a
+space inserted at beginning of each output line, even lines that
+otherwise would be blank.
+
+The @subcmd{ENCODING} subcommand may only be used if the
+@subcmd{OUTFILE} subcommand is also used.  It specifies the character
+encoding of the file.  @xref{INSERT}, for information on supported
+encodings.
 
 The @subcmd{RECORDS} subcommand specifies the number of lines to be output.  The
 number of lines may optionally be surrounded by parentheses.
@@ -978,17 +1013,15 @@ default, suppresses this output table.
 
 Introduce the strings and variables to be printed with a slash
 (@samp{/}).  Optionally, the slash may be followed by a number
-indicating which output line will be specified.  In the absence of this
-line number, the next line number will be specified.  Multiple lines may
+indicating which output line is specified.  In the absence of this
+line number, the next line number is specified.  Multiple lines may
 be specified using multiple slashes with the intended output for a line
 following its respective slash.
 
-
-Literal strings may be printed.  Specify the string itself.  Optionally
-the string may be followed by a column number or range of column
-numbers, specifying the location on the line for the string to be
-printed.  Otherwise, the string will be printed at the current position
-on the line.
+Literal strings may be printed.  Specify the string itself.
+Optionally the string may be followed by a column number, specifying
+the column on the line where the string should start.  Otherwise, the
+string is printed at the current position on the line.
 
 Variables to be printed can be specified in the same ways as available
 for @cmd{DATA LIST FIXED} (@pxref{DATA LIST FIXED}).  In addition, a
@@ -996,10 +1029,10 @@ variable
 list may be followed by an asterisk (@samp{*}), which indicates that the
 variables should be printed in their dictionary print formats, separated
 by spaces.  A variable list followed by a slash or the end of command
-will be interpreted the same way.
+is interpreted in the same way.
 
 If a FORTRAN type specification is used to move backwards on the current
-line, then text is written at that point on the line, the line will be
+line, then text is written at that point on the line, the line is
 truncated to that length, although additional text being added will
 again extend the line to that length.
 
@@ -1008,7 +1041,7 @@ again extend the line to that length.
 @vindex PRINT EJECT
 
 @display
-PRINT EJECT 
+PRINT EJECT
         OUTFILE='@var{file_name}'
         RECORDS=@var{n_lines}
         @{NOTABLE,TABLE@}
@@ -1043,16 +1076,20 @@ written with a space inserted in the first column, as with @subcmd{PRINT}.
 @vindex PRINT SPACE
 
 @display
-PRINT SPACE OUTFILE='file_name' n_lines.
+PRINT SPACE [OUTFILE='file_name'] [ENCODING='@var{encoding}'] [n_lines].
 @end display
 
 @cmd{PRINT SPACE} prints one or more blank lines to an output file.
 
 The @subcmd{OUTFILE} subcommand is optional.  It may be used to direct output to
 a file specified by file name as a string or file handle (@pxref{File
-Handles}).  If OUTFILE is not specified then output will be directed to
+Handles}).  If OUTFILE is not specified then output is directed to
 the listing file.
 
+The @subcmd{ENCODING} subcommand may only be used if @subcmd{OUTFILE}
+is also used.  It specifies the character encoding of the file.
+@xref{INSERT}, for information on supported encodings.
+
 n_lines is also optional.  If present, it is an expression
 (@pxref{Expressions}) specifying the number of blank lines to be
 printed.  The expression must evaluate to a nonnegative value.
@@ -1062,7 +1099,7 @@ printed.  The expression must evaluate to a nonnegative value.
 @vindex REREAD
 
 @display
-REREAD FILE=handle COLUMN=column.
+REREAD [FILE=handle] [COLUMN=column] [ENCODING='@var{encoding}'].
 @end display
 
 The @cmd{REREAD} transformation allows the previous input line in a
@@ -1073,7 +1110,7 @@ for further processing.
 The @subcmd{FILE} subcommand, which is optional, is used to specify the file to
 have its line re-read.  The file must be specified as the name of a file
 handle (@pxref{File Handles}).  If FILE is not specified then the last
-file specified on @cmd{DATA LIST} will be assumed (last file specified
+file specified on @cmd{DATA LIST} is assumed (last file specified
 lexically, not in terms of flow-of-control).
 
 By default, the line re-read is re-read in its entirety.  With the
@@ -1082,6 +1119,10 @@ re-reading.  Specify an expression (@pxref{Expressions}) evaluating to
 the first column that should be included in the re-read line.  Columns
 are numbered from 1 at the left margin.
 
+The @subcmd{ENCODING} subcommand may only be used if the @subcmd{FILE}
+subcommand is also used.  It specifies the character encoding of the
+file.   @xref{INSERT}, for information on supported encodings.
+
 Issuing @code{REREAD} multiple times will not back up in the data
 file.  Instead, it will re-read the same line multiple times.
 
@@ -1172,7 +1213,7 @@ structure (@pxref{LOOP}).  Use @cmd{DATA LIST} before, not after,
 @vindex WRITE
 
 @display
-WRITE 
+WRITE
         OUTFILE='@var{file_name}'
         RECORDS=@var{n_lines}
         @{NOTABLE,TABLE@}
@@ -1185,7 +1226,7 @@ WRITE
         @var{var_list} *
 @end display
 
-@code{WRITE} writes text or binary data to an output file.  
+@code{WRITE} writes text or binary data to an output file.
 
 @xref{PRINT}, for more information on syntax and usage.  @cmd{PRINT}
 and @cmd{WRITE} differ in only a few ways: