Convert to utf8 in data_out function.

[pspp-builds.git] / doc / data-io.texi
diff --git a/doc/data-io.texi b/doc/data-io.texi

index b5e6ca22450c36308fc42377f8c41284f18bf37f..1bad334e95b60a1dc0c7872ada8f2bff4b28b5a8 100644
--- a/doc/data-io.texi
+++ b/doc/data-io.texi
@@ -1,4 +1,4 @@
-@node Data Input and Output, System and Portable Files, Expressions, Top
+@node Data Input and Output
  @chapter Data Input and Output
  @cindex input
  @cindex output
@@ -14,6 +14,8 @@ their sex, age, etc.@: and their responses are all data and the data
  pertaining to single respondent is a case.
  This chapter examines
  the PSPP commands for defining variables and reading and writing data.
+There are alternative commands to read data from predefined sources
+such as system files or databases (@pxref{GET, GET DATA}).
  
  @quotation Note
  These commands tell PSPP how to read data, but the data will not
@@ -22,15 +24,14 @@ actually be read until a procedure is executed.
  
  @menu
  * BEGIN DATA::                  Embed data within a syntax file.
-* CLEAR TRANSFORMATIONS::       Clear pending transformations.
  * CLOSE FILE HANDLE::           Close a file handle.
+* DATAFILE ATTRIBUTE::          Set custom attributes on data files.
  * DATA LIST::                   Fundamental data reading command.
  * END CASE::                    Output the current case.
  * END FILE::                    Terminate the current input program.
  * FILE HANDLE::                 Support for special file formats.
  * INPUT PROGRAM::               Support for complex input programs.
  * LIST::                        List cases in the active file.
-* MATRIX DATA::                 Read matrices in text format.
  * NEW FILE::                    Clear the active file and dictionary.
  * PRINT::                       Display values in print formats.
  * PRINT EJECT::                 Eject the current page then print.
@@ -65,17 +66,6 @@ white space and exactly one space between the words @code{END} and
  END DATA.
  @end example
  
-@node CLEAR TRANSFORMATIONS
-@section CLEAR TRANSFORMATIONS
-@vindex CLEAR TRANSFORMATIONS
-
-@display
-CLEAR TRANSFORMATIONS.
-@end display
-
-@cmd{CLEAR TRANSFORMATIONS} clears out all pending
-transformations.  It does not cancel the current input program.
-
  @node CLOSE FILE HANDLE
  @section CLOSE FILE HANDLE
  
@@ -100,6 +90,52 @@ DATA} and @cmd{END DATA}, cannot be closed.  Attempts to close it with
  
  @cmd{CLOSE FILE HANDLE} is a PSPP extension.
  
+@node DATAFILE ATTRIBUTE
+@section DATAFILE ATTRIBUTE
+@vindex DATAFILE ATTRIBUTE
+
+@display
+DATAFILE ATTRIBUTE
+         ATTRIBUTE=name('value') [name('value')]@dots{}
+         ATTRIBUTE=name@b{[}index@b{]}('value') [name@b{[}index@b{]}('value')]@dots{}
+         DELETE=name [name]@dots{}
+         DELETE=name@b{[}index@b{]} [name@b{[}index@b{]}]@dots{}
+@end display
+
+@cmd{DATAFILE ATTRIBUTE} adds, modifies, or removes user-defined
+attributes associated with the active file.  Custom data file
+attributes are not interpreted by PSPP, but they are saved as part of
+system files and may be used by other software that reads them.
+
+Use the ATTRIBUTE subcommand to add or modify a custom data file
+attribute.  Specify the name of the attribute as an identifier
+(@pxref{Tokens}), followed by the desired value, in parentheses, as a
+quoted string.  Attribute names that begin with @code{$} are reserved
+for PSPP's internal use, and attribute names that begin with @code{@@}
+or @code{$@@} are not displayed by most PSPP commands that display
+other attributes.  Other attribute names are not treated specially.
+
+Attributes may also be organized into arrays.  To assign to an array
+element, add an integer array index enclosed in square brackets
+(@code{[} and @code{]}) between the attribute name and value.  Array
+indexes start at 1, not 0.  An attribute array that has a single
+element (number 1) is not distinguished from a non-array attribute.
+
+Use the DELETE subcommand to delete an attribute.  Specify an
+attribute name by itself to delete an entire attribute, including all
+array elements for attribute arrays.  Specify an attribute name
+followed by an array index in square brackets to delete a single
+element of an attribute array.  In the latter case, all the array
+elements numbered higher than the deleted element are shifted down,
+filling the vacated position.
+
+To associate custom attributes with particular variables, instead of
+with the entire active file, use @cmd{VARIABLE ATTRIBUTE} (@pxref{VARIABLE ATTRIBUTE}).
+
+@cmd{DATAFILE ATTRIBUTE} takes effect immediately.  It is not affected
+by conditional and looping structures such as @cmd{DO IF} or
+@cmd{LOOP}.
+
  @node DATA LIST
  @section DATA LIST
  @vindex DATA LIST
@@ -121,6 +157,10 @@ free format.
  
  Each form of @cmd{DATA LIST} is described in detail below.
  
+@xref{GET DATA}, for a command that offers a few enhancements over
+@cmd{DATA LIST} and that may be substituted for @cmd{DATA LIST} in many
+situations.
+
  @menu
  * DATA LIST FIXED::             Fixed columnar locations for data.
  * DATA LIST FREE::              Any spacing you like.
@@ -138,7 +178,7 @@ Each form of @cmd{DATA LIST} is described in detail below.
  @display
  DATA LIST [FIXED]
          @{TABLE,NOTABLE@}
-        [FILE='file-name']
+        [FILE='file-name' [ENCODING='encoding']]
          [RECORDS=record_count]
          [END=end_var]
          [SKIP=record_count]
@@ -158,6 +198,8 @@ external file.  It may be used to specify a file name as a string or a
  file handle (@pxref{File Handles}).  If the FILE subcommand is not used,
  then input is assumed to be specified within the command file using
  @cmd{BEGIN DATA}@dots{}@cmd{END DATA} (@pxref{BEGIN DATA}).
+The ENCODING subcommand may only be used if the FILE subcommand is also used.
+It specifies the character encoding of the file.
  
  The optional RECORDS subcommand, which takes a single integer as an
  argument, is used to specify the number of lines per record.  If RECORDS
@@ -276,7 +318,7 @@ Defines the following variables:
  
  @itemize @bullet
  @item
-@code{NAME}, a 10-character-wide long string variable, in columns 1
+@code{NAME}, a 10-character-wide string variable, in columns 1
  through 10.
  
  @item
@@ -317,15 +359,15 @@ Defines the following variables:
  @code{ID}, a numeric variable, in columns 1-5 of the first record.
  
  @item
-@code{NAME}, a 30-character long string variable, in columns 7-36 of the
+@code{NAME}, a 30-character string variable, in columns 7-36 of the
  first record.
  
  @item
-@code{SURNAME}, a 30-character long string variable, in columns 38-67 of
+@code{SURNAME}, a 30-character string variable, in columns 38-67 of
  the first record.
  
  @item
-@code{MINITIAL}, a 1-character short string variable, in column 69 of
+@code{MINITIAL}, a 1-character string variable, in column 69 of
  the first record.
  
  @item
@@ -351,8 +393,7 @@ This example shows keywords abbreviated to their first 3 letters.
  DATA LIST FREE
          [(@{TAB,'c'@}, @dots{})]
          [@{NOTABLE,TABLE@}]
-        [FILE='file-name']
-        [END=end_var]
+        [FILE='file-name' [ENCODING='encoding']]
          [SKIP=record_cnt]
          /var_spec@dots{}
  
@@ -382,7 +423,7 @@ of quoting is allowed.
  The NOTABLE and TABLE subcommands are as in @cmd{DATA LIST FIXED} above.
  NOTABLE is the default.
  
-The FILE, END, and SKIP subcommands are as in @cmd{DATA LIST FIXED} above.
+The FILE and SKIP subcommands are as in @cmd{DATA LIST FIXED} above.
  
  The variables to be parsed are given as a single list of variable names.
  This list must be introduced by a single slash (@samp{/}).  The set of
@@ -404,8 +445,7 @@ on field width apply, but they are honored on output.
  DATA LIST LIST
          [(@{TAB,'c'@}, @dots{})]
          [@{NOTABLE,TABLE@}]
-        [FILE='file-name']
-        [END=end_var]
+        [FILE='file-name' [ENCODING='encoding']]
          [SKIP=record_count]
          /var_spec@dots{}
  
@@ -453,12 +493,25 @@ For text files:
                  [/MODE=CHARACTER]
                  /TABWIDTH=tab_width
  
-For binary files with fixed-length records:
+For binary files in native encoding with fixed-length records:
          FILE HANDLE handle_name
                  /NAME='file-name'
                  /MODE=IMAGE
                  [/LRECL=rec_len]
  
+For binary files in native encoding with variable-length records:
+        FILE HANDLE handle_name
+                /NAME='file-name'
+                /MODE=BINARY
+                [/LRECL=rec_len]
+
+For binary files encoded in EBCDIC:
+        FILE HANDLE handle_name
+                /NAME='file-name'
+                /MODE=360
+                /RECFORM=@{FIXED,VARIABLE,SPANNED@}
+                [/LRECL=rec_len]
+
  To explicitly declare a scratch handle:
          FILE HANDLE handle_name
                  /MODE=SCRATCH
@@ -480,28 +533,129 @@ file handle name must not already have been used in a previous
  invocation of @cmd{FILE HANDLE}, unless it has been closed by an
  intervening command (@pxref{CLOSE FILE HANDLE}).
  
-MODE specifies a file mode.  In CHARACTER mode, the default, the data
-file is read as a text file, according to the local system's 
-conventions, and each text line is read as one record.
-In CHARACTER mode, most input programs will expand tabs to spaces
-(@cmd{DATA LIST FREE} with explicitly specified delimiters is an
-exception).  By default, each tab is 4 characters wide, but an
-alternate width may be specified on TABWIDTH.  A tab width of 0
-suppresses tab expansion entirely.
-
-In IMAGE mode, the data file is opened in ANSI C binary mode.  Record
-length is fixed, with output data truncated or padded with spaces to
-the record length.  LRECL specifies the record length in bytes, with a
-default of 1024.  Tab characters are never expanded to spaces in
-binary mode.  Records
+The effect and syntax of @cmd{FILE HANDLE} depend on the selected MODE:
  
-The NAME subcommand specifies the name of the file associated with the
-handle.  It is required in CHARACTER and IMAGE modes.
+@itemize
+@item
+In CHARACTER mode, the default, the data file is read as a text file,
+according to the local system's conventions, and each text line is
+read as one record.
+
+In CHARACTER mode only, tabs are expanded to spaces by input programs,
+except by @cmd{DATA LIST FREE} with explicitly specified delimiters.
+Each tab is 4 characters wide by default, but TABWIDTH (a PSPP
+extension) may be used to specify an alternate width.  Use a TABWIDTH
+of 0 to suppress tab expansion.
+
+@item
+In IMAGE mode, the data file is treated as a series of fixed-length
+binary records.  LRECL should be used to specify the record length in
+bytes, with a default of 1024.  On input, it is an error if an IMAGE
+file's length is not an integer multiple of the record length.  On
+output, each record is padded with spaces or truncated, if necessary,
+to make it exactly the correct length.
  
-The SCRATCH mode designates the file handle as a scratch file handle.
+@item
+In BINARY mode, the data file is treated as a series of
+variable-length binary records.  LRECL may be specified, but its value
+is ignored.  The data for each record is both preceded and followed by
+a 32-bit signed integer in little-endian byte order that specifies the
+length of the record.  (This redundancy permits records in these
+files to be efficiently read in reverse order, although PSPP always
+reads them in forward order.)  The length does not include either
+integer.
+
+@item
+Mode 360 reads and writes files in formats first used for tapes in the
+1960s on IBM mainframe operating systems and still supported today by
+the modern successors of those operating systems.  For more
+information, see @cite{OS/400 Tape and Diskette Device Programming},
+available on IBM's website.
+
+Alphanumeric data in mode 360 files are encoded in EBCDIC.  PSPP
+translates EBCDIC to or from the host's native format as necessary on
+input or output, using an ASCII/EBCDIC translation that is one-to-one,
+so that a ``round trip'' from ASCII to EBCDIC back to ASCII, or vice
+versa, always yields exactly the original data.
+
+The RECFORM subcommand is required in mode 360.  The precise file
+format depends on its setting:
+
+@table @asis
+@item F
+@itemx FIXED
+This record format is equivalent to IMAGE mode, except for EBCDIC
+translation.
+
+IBM documentation calls this @code{*F} (fixed-length, deblocked)
+format.
+
+@item V
+@itemx VARIABLE
+The file comprises a sequence of zero or more variable-length blocks.
+Each block begins with a 4-byte @dfn{block descriptor word} (BDW).
+The first two bytes of the BDW are an unsigned integer in big-endian
+byte order that specifies the length of the block, including the BDW
+itself.  The other two bytes of the BDW are ignored on input and
+written as zeros on output.
+
+Following the BDW, the remainder of each block is a sequence of one or
+more variable-length records, each of which in turn begins with a
+4-byte @dfn{record descriptor word} (RDW) that has the same format as
+the BDW.  Following the RDW, the remainder of each record is the
+record data.
+
+The maximum length of a record in VARIABLE mode is 65,527 bytes:
+65,535 bytes (the maximum value of a 16-bit unsigned integer), minus 4
+bytes for the BDW, minus 4 bytes for the RDW.
+
+In mode VARIABLE, LRECL specifies a maximum, not a fixed, record
+length, in bytes.  The default is 8,192.
+
+IBM documentation calls this @code{*VB} (variable-length, blocked,
+unspanned) format.
+
+@item VS
+@itemx SPANNED
+The file format is like that of VARIABLE mode, except that logical
+records may be split among multiple physical records (called
+@dfn{segments}) or blocks.  In SPANNED mode, the third byte of each
+RDW is called the segment control character (SCC).  Odd SCC values
+cause the segment to be appended to a record buffer maintained in
+memory; even values also append the segment and then flush its
+contents to the input procedure.  Canonically, SCC value 0 designates
+a record not spanned among multiple segments, and values 1 through 3
+designate the first segment, the last segment, or an intermediate
+segment, respectively, within a multi-segment record.  The record
+buffer is also flushed at end of file regardless of the final record's
+SCC.
+
+The maximum length of a logical record in SPANNED mode is limited
+only by memory available to PSPP.  Segments are limited to 65,527
+bytes, as in VARIABLE mode.
+
+This format is similar to what IBM documentation calls @code{*VS}
+(variable-length, deblocked, spanned) format.
+@end table
+
+In mode 360, fields of type A that extend beyond the end of a record
+read from disk are padded with spaces in the host's native character
+set, which are then translated from EBCDIC to the native character
+set.  Thus, when the host's native character set is based on ASCII,
+these fields are effectively padded with character @code{X'80'}.  This
+wart is implemented for compatibility.
+
+@item
+SCRATCH mode is a PSPP extension that designates the file handle as a
+scratch file handle.
  Its use is usually unnecessary because file handle names that begin with
  @samp{#} are assumed to refer to scratch files.  @pxref{File Handles},
  for more information.
+@end itemize
+
+The NAME subcommand specifies the name of the file associated with the
+handle.  It is required in all modes but SCRATCH mode, in which its
+use is forbidden.
  
  @node INPUT PROGRAM
  @section INPUT PROGRAM
@@ -554,6 +708,8 @@ structure.
  
  All this is very confusing.  A few examples should help to clarify.
  
+@c If you change this example, change the regression test1 in
+@c tests/command/input-program.sh to match.
  @example
  INPUT PROGRAM.
          DATA LIST NOTABLE FILE='a.data'/X 1-10.
@@ -566,6 +722,8 @@ The example above reads variable X from file @file{a.data} and variable
  Y from file @file{b.data}.  If one file is shorter than the other then
  the extra data in the longer file is ignored.
  
+@c If you change this example, change the regression test2 in
+@c tests/command/input-program.sh to match.
  @example
  INPUT PROGRAM.
          NUMERIC #A #B.
@@ -589,6 +747,8 @@ The above example reads variable X from @file{a.data} and variable Y from
  field is set to the system-missing value alongside the present value for
  the remaining length of the longer file.
  
+@c If you change this example, change the regression test3 in
+@c tests/command/input-program.sh to match.
  @example
  INPUT PROGRAM.
          NUMERIC #A #B.
@@ -613,6 +773,8 @@ LIST.
  The above example reads data from file @file{a.data}, then from
  @file{b.data}, and concatenates them into a single active file.
  
+@c If you change this example, change the regression test4 in
+@c tests/command/input-program.sh to match.
  @example
  INPUT PROGRAM.
          NUMERIC #EOF.
@@ -640,6 +802,8 @@ LIST.
  The above example does the same thing as the previous example, in a
  different way.
  
+@c If you change this example, make similar changes to the regression
+@c test5 in tests/command/input-program.sh.
  @example
  INPUT PROGRAM.
          LOOP #I=1 TO 50.
@@ -695,108 +859,6 @@ cannot fit on a single line, then a multi-line format will be used.
  
  @cmd{LIST} is a procedure.  It causes the data to be read.
  
-@node MATRIX DATA
-@section MATRIX DATA
-@vindex MATRIX DATA
-
-@display
-MATRIX DATA
-        /VARIABLES=var_list
-        /FILE='file-name'
-        /FORMAT=@{LIST,FREE@} @{LOWER,UPPER,FULL@} @{DIAGONAL,NODIAGONAL@}
-        /SPLIT=@{new_var,var_list@}
-        /FACTORS=var_list
-        /CELLS=n_cells
-        /N=n
-        /CONTENTS=@{N_VECTOR,N_SCALAR,N_MATRIX,MEAN,STDDEV,COUNT,MSE,
-                   DFE,MAT,COV,CORR,PROX@}
-@end display
-
-@cmd{MATRIX DATA} command reads square matrices in one of several textual
-formats.  @cmd{MATRIX DATA} clears the dictionary and replaces it and
-reads a
-data file.
-
-Use VARIABLES to specify the variables that form the rows and columns of
-the matrices.  You may not specify a variable named @code{VARNAME_}.  You
-should specify VARIABLES first.
-
-Specify the file to read on FILE, either as a file name string or a file
-handle (@pxref{File Handles}).  If FILE is not specified then matrix data
-must immediately follow @cmd{MATRIX DATA} with a @cmd{BEGIN
-DATA}@dots{}@cmd{END DATA}
-construct (@pxref{BEGIN DATA}).
-
-The FORMAT subcommand specifies how the matrices are formatted.  LIST,
-the default, indicates that there is one line per row of matrix data;
-FREE allows single matrix rows to be broken across multiple lines.  This
-is analogous to the difference between @cmd{DATA LIST FREE} and
-@cmd{DATA LIST LIST}
-(@pxref{DATA LIST}).  LOWER, the default, indicates that the lower
-triangle of the matrix is given; UPPER indicates the upper triangle; and
-FULL indicates that the entire matrix is given.  DIAGONAL, the default,
-indicates that the diagonal is part of the data; NODIAGONAL indicates
-that it is omitted.  DIAGONAL/NODIAGONAL have no effect when FULL is
-specified.
-
-The SPLIT subcommand is used to specify @cmd{SPLIT FILE} variables for the
-input matrices (@pxref{SPLIT FILE}).  Specify either a single variable
-not specified on VARIABLES, or one or more variables that are specified
-on VARIABLES.  In the former case, the SPLIT values are not present in
-the data and ROWTYPE_ may not be specified on VARIABLES.  In the latter
-case, the SPLIT values are present in the data.
-
-Specify a list of factor variables on FACTORS.  Factor variables must
-also be listed on VARIABLES.  Factor variables are used when there are
-some variables where, for each possible combination of their values,
-statistics on the matrix variables are included in the data.
-
-If FACTORS is specified and ROWTYPE_ is not specified on VARIABLES, the
-CELLS subcommand is required.  Specify the number of factor variable
-combinations that are given.  For instance, if factor variable A has 2
-values and factor variable B has 3 values, specify 6.
-
-The N subcommand specifies a population number of observations.  When N
-is specified, one N record is output for each @cmd{SPLIT FILE}.
-
-Use CONTENTS to specify what sort of information the matrices include.
-Each possible option is described in more detail below.  When ROWTYPE_
-is specified on VARIABLES, CONTENTS is optional; otherwise, if CONTENTS
-is not specified then /CONTENTS=CORR is assumed.
-
-@table @asis
-@item N
-@item N_VECTOR
-Number of observations as a vector, one value for each variable.
-@item N_SCALAR
-Number of observations as a single value.
-@item N_MATRIX
-Matrix of counts.
-@item MEAN
-Vector of means.
-@item STDDEV
-Vector of standard deviations.
-@item COUNT
-Vector of counts.
-@item MSE
-Vector of mean squared errors.
-@item DFE
-Vector of degrees of freedom.
-@item MAT
-Generic matrix.
-@item COV
-Covariance matrix.
-@item CORR
-Correlation matrix.
-@item PROX
-Proximities matrix.
-@end table
-
-The exact semantics of the matrices read by @cmd{MATRIX DATA} are complex.
-Right now @cmd{MATRIX DATA} isn't too useful due to a lack of procedures
-accepting or producing related data, so these semantics aren't
-documented.  Later, they'll be described here in detail.
-
  @node NEW FILE
  @section NEW FILE
  @vindex NEW FILE
@@ -1082,4 +1144,3 @@ specified output format, whereas @cmd{WRITE} outputs the
  system-missing value as a field filled with spaces.  Binary formats
  are an exception.
  @end itemize
-@setfilename ignored
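
The BINARY mode record layout described in the FILE HANDLE hunk above (each record both preceded and followed by a 32-bit signed little-endian length that counts neither integer) can be illustrated with a short Python sketch. This is not PSPP's implementation; the function names `write_records` and `read_records` are hypothetical and exist only for this illustration, and the consistency check between the two length fields is an assumption about what a careful reader would want, not documented PSPP behavior.

```python
import io
import struct


def write_records(stream, records):
    """Write each bytes object framed by its length on both sides."""
    for data in records:
        # "<i" is a 32-bit signed integer in little-endian byte order.
        length = struct.pack("<i", len(data))
        stream.write(length + data + length)


def read_records(stream):
    """Yield record payloads from a stream framed as described above."""
    while True:
        head = stream.read(4)
        if not head:
            return  # clean end of file
        if len(head) != 4:
            raise ValueError("truncated leading length")
        (length,) = struct.unpack("<i", head)
        if length < 0:
            raise ValueError("negative record length")
        data = stream.read(length)
        tail = stream.read(4)
        if len(data) != length or len(tail) != 4:
            raise ValueError("truncated record")
        # Assumption for this sketch: treat mismatched lengths as an error.
        if tail != head:
            raise ValueError("leading and trailing lengths differ")
        yield data


buf = io.BytesIO()
write_records(buf, [b"abc", b"", b"hello world"])
buf.seek(0)
records = list(read_records(buf))
```

Because the length is repeated after the payload, a reader positioned at the end of a record could step backward by reading the trailing integer first, which is the reverse-order property the documentation mentions.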