X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-io.texi;h=cfb99eb7927d8b2f7a7940e32c81d9f2a4ef4e4b;hb=f3dc389d453a81ffd3957662e287c96b3a73c4bb;hp=3439ca3a8a0631c8dc0f777ee3e0e38f0029367d;hpb=a1d8ce0ad176357cf250ec065e0bf1a4520a19e3;p=pspp-builds.git diff --git a/doc/data-io.texi b/doc/data-io.texi index 3439ca3a..cfb99eb7 100644 --- a/doc/data-io.texi +++ b/doc/data-io.texi @@ -14,6 +14,8 @@ their sex, age, etc.@: and their responses are all data and the data pertaining to single respondent is a case. This chapter examines the PSPP commands for defining variables and reading and writing data. +There are alternative commands to read data from predefined sources +such as system files or databases (@xref{GET, GET DATA}.) @quotation Note These commands tell PSPP how to read data, but the data will not @@ -23,13 +25,15 @@ actually be read until a procedure is executed. @menu * BEGIN DATA:: Embed data within a syntax file. * CLOSE FILE HANDLE:: Close a file handle. +* DATAFILE ATTRIBUTE:: Set custom attributes on data files. +* DATASET:: Manage multiple datasets. * DATA LIST:: Fundamental data reading command. * END CASE:: Output the current case. * END FILE:: Terminate the current input program. * FILE HANDLE:: Support for special file formats. * INPUT PROGRAM:: Support for complex input programs. -* LIST:: List cases in the active file. -* NEW FILE:: Clear the active file and dictionary. +* LIST:: List cases in the active dataset. +* NEW FILE:: Clear the active dataset. * PRINT:: Display values in print formats. * PRINT EJECT:: Eject the current page then print. * PRINT SPACE:: Print blank lines. @@ -75,18 +79,137 @@ given file. The only specification is the name of the handle to close. Afterward @cmd{FILE HANDLE}. -If the file handle name refers to a scratch file, then the storage -associated with the scratch file in memory or on disk will be freed. 
-If the scratch file is in use, e.g.@: it has been specified on a -@cmd{GET} command whose execution has not completed, then freeing is -delayed until it is no longer in use. - The file named INLINE, which represents data entered between @cmd{BEGIN DATA} and @cmd{END DATA}, cannot be closed. Attempts to close it with @cmd{CLOSE FILE HANDLE} have no effect. @cmd{CLOSE FILE HANDLE} is a PSPP extension. +@node DATAFILE ATTRIBUTE +@section DATAFILE ATTRIBUTE +@vindex DATAFILE ATTRIBUTE + +@display +DATAFILE ATTRIBUTE + ATTRIBUTE=name('value') [name('value')]@dots{} + ATTRIBUTE=name@b{[}index@b{]}('value') [name@b{[}index@b{]}('value')]@dots{} + DELETE=name [name]@dots{} + DELETE=name@b{[}index@b{]} [name@b{[}index@b{]}]@dots{} +@end display + +@cmd{DATAFILE ATTRIBUTE} adds, modifies, or removes user-defined +attributes associated with the active dataset. Custom data file +attributes are not interpreted by PSPP, but they are saved as part of +system files and may be used by other software that reads them. + +Use the ATTRIBUTE subcommand to add or modify a custom data file +attribute. Specify the name of the attribute as an identifier +(@pxref{Tokens}), followed by the desired value, in parentheses, as a +quoted string. Attribute names that begin with @code{$} are reserved +for PSPP's internal use, and attribute names that begin with @code{@@} +or @code{$@@} are not displayed by most PSPP commands that display +other attributes. Other attribute names are not treated specially. + +Attributes may also be organized into arrays. To assign to an array +element, add an integer array index enclosed in square brackets +(@code{[} and @code{]}) between the attribute name and value. Array +indexes start at 1, not 0. An attribute array that has a single +element (number 1) is not distinguished from a non-array attribute. + +Use the DELETE subcommand to delete an attribute. 
Specify an +attribute name by itself to delete an entire attribute, including all +array elements for attribute arrays. Specify an attribute name +followed by an array index in square brackets to delete a single +element of an attribute array. In the latter case, all the array +elements numbered higher than the deleted element are shifted down, +filling the vacated position. + +To associate custom attributes with particular variables, instead of +with the entire active dataset, use @cmd{VARIABLE ATTRIBUTE} +(@pxref{VARIABLE ATTRIBUTE}) instead. + +@cmd{DATAFILE ATTRIBUTE} takes effect immediately. It is not affected +by conditional and looping structures such as @cmd{DO IF} or +@cmd{LOOP}. + +@node DATASET +@section DATASET commands +@vindex DATASET + +@display +DATASET NAME name [WINDOW=@{ASIS,FRONT@}]. +DATASET ACTIVATE name [WINDOW=@{ASIS,FRONT@}]. +DATASET COPY name [WINDOW=@{MINIMIZED,HIDDEN,FRONT@}]. +DATASET DECLARE name [WINDOW=@{MINIMIZED,HIDDEN,FRONT@}]. +DATASET CLOSE @{name,*,ALL@}. +DATASET DISPLAY. +@end display + +The @cmd{DATASET} commands simplify use of multiple datasets within a +PSPP session. They allow datasets to be created and destroyed. At +any given time, most PSPP commands work with a single dataset, called +the active dataset. + +@vindex DATASET NAME +The DATASET NAME command gives the active dataset the specified name, or +if it already had a name, it renames it. If another dataset already +had the given name, that dataset is deleted. + +@vindex DATASET ACTIVATE +The DATASET ACTIVATE command selects the named dataset, which must +already exist, as the active dataset. Before switching the active +dataset, any pending transformations are executed, as if @cmd{EXECUTE} +had been specified. If the active dataset is unnamed before +switching, then it is deleted and becomes unavailable after switching. 
+ +@vindex DATASET COPY +The DATASET COPY command creates a new dataset with the specified +name, whose contents are a copy of the active dataset. Any pending +transformations are executed, as if @cmd{EXECUTE} had been specified, +before making the copy. If a dataset with the given name already +exists, it is replaced. If the name is the name of the active +dataset, then the active dataset becomes unnamed. + +@vindex DATASET DECLARE +The DATASET DECLARE command creates a new dataset that is initially +``empty,'' that is, it has no dictionary or data. If a dataset with +the given name already exists, this has no effect. The new dataset +can be used with commands that support output to a dataset, +e.g. AGGREGATE (@pxref{AGGREGATE}). + +@vindex DATASET CLOSE +The DATASET CLOSE command deletes a dataset. If the active dataset is +specified by name, or if @samp{*} is specified, then the active +dataset becomes unnamed. If a different dataset is specified by name, +then it is deleted and becomes unavailable. Specifying ALL deletes +all datasets except for the active dataset, which becomes unnamed. + +@vindex DATASET DISPLAY +The DATASET DISPLAY command lists all the currently defined datasets. + +Many DATASET commands accept an optional WINDOW subcommand. In the +PSPPIRE GUI, the value given for this subcommand influences how the +dataset's window is displayed. Outside the GUI, the WINDOW subcommand +has no effect. The valid values are: + +@table @asis +@item ASIS +Do not change how the window is displayed. This is the default for +DATASET NAME and DATASET ACTIVATE. + +@item FRONT +Raise the dataset's window to the top. Make it the default dataset +for running syntax. + +@item MINIMIZED +Display the window ``minimized'' to an icon. Prefer other datasets +for running syntax. This is the default for DATASET COPY and DATASET +DECLARE. + +@item HIDDEN +Hide the dataset's window. Prefer other datasets for running syntax. 
+@end table + @node DATA LIST @section DATA LIST @vindex DATA LIST @@ -108,6 +231,10 @@ free format. Each form of @cmd{DATA LIST} is described in detail below. +@xref{GET DATA}, for a command that offers a few enhancements over +DATA LIST and that may be substituted for DATA LIST in many +situations. + @menu * DATA LIST FIXED:: Fixed columnar locations for data. * DATA LIST FREE:: Any spacing you like. @@ -125,7 +252,7 @@ Each form of @cmd{DATA LIST} is described in detail below. @display DATA LIST [FIXED] @{TABLE,NOTABLE@} - [FILE='file-name'] + [FILE='file-name' [ENCODING='encoding']] [RECORDS=record_count] [END=end_var] [SKIP=record_count] @@ -145,6 +272,8 @@ external file. It may be used to specify a file name as a string or a file handle (@pxref{File Handles}). If the FILE subcommand is not used, then input is assumed to be specified within the command file using @cmd{BEGIN DATA}@dots{}@cmd{END DATA} (@pxref{BEGIN DATA}). +The ENCODING subcommand may only be used if the FILE subcommand is also used. +It specifies the character encoding of the file. The optional RECORDS subcommand, which takes a single integer as an argument, is used to specify the number of lines per record. If RECORDS @@ -263,7 +392,7 @@ Defines the following variables: @itemize @bullet @item -@code{NAME}, a 10-character-wide long string variable, in columns 1 +@code{NAME}, a 10-character-wide string variable, in columns 1 through 10. @item @@ -304,15 +433,15 @@ Defines the following variables: @code{ID}, a numeric variable, in columns 1-5 of the first record. @item -@code{NAME}, a 30-character long string variable, in columns 7-36 of the +@code{NAME}, a 30-character string variable, in columns 7-36 of the first record. @item -@code{SURNAME}, a 30-character long string variable, in columns 38-67 of +@code{SURNAME}, a 30-character string variable, in columns 38-67 of the first record. 
@item -@code{MINITIAL}, a 1-character short string variable, in column 69 of +@code{MINITIAL}, a 1-character string variable, in column 69 of the first record. @item @@ -338,7 +467,7 @@ This example shows keywords abbreviated to their first 3 letters. DATA LIST FREE [(@{TAB,'c'@}, @dots{})] [@{NOTABLE,TABLE@}] - [FILE='file-name'] + [FILE='file-name' [ENCODING='encoding']] [SKIP=record_cnt] /var_spec@dots{} @@ -390,7 +519,7 @@ on field width apply, but they are honored on output. DATA LIST LIST [(@{TAB,'c'@}, @dots{})] [@{NOTABLE,TABLE@}] - [FILE='file-name'] + [FILE='file-name' [ENCODING='encoding']] [SKIP=record_count] /var_spec@dots{} @@ -438,15 +567,24 @@ For text files: [/MODE=CHARACTER] /TABWIDTH=tab_width -For binary files with fixed-length records: +For binary files in native encoding with fixed-length records: FILE HANDLE handle_name /NAME='file-name' /MODE=IMAGE [/LRECL=rec_len] -To explicitly declare a scratch handle: +For binary files in native encoding with variable-length records: + FILE HANDLE handle_name + /NAME='file-name' + /MODE=BINARY + [/LRECL=rec_len] + +For binary files encoded in EBCDIC: FILE HANDLE handle_name - /MODE=SCRATCH + /NAME='file-name' + /MODE=360 + /RECFORM=@{FIXED,VARIABLE,SPANNED@} + [/LRECL=rec_len] @end display Use @cmd{FILE HANDLE} to associate a file handle name with a file and @@ -465,28 +603,122 @@ file handle name must not already have been used in a previous invocation of @cmd{FILE HANDLE}, unless it has been closed by an intervening command (@pxref{CLOSE FILE HANDLE}). -MODE specifies a file mode. In CHARACTER mode, the default, the data -file is read as a text file, according to the local system's -conventions, and each text line is read as one record. -In CHARACTER mode, most input programs will expand tabs to spaces -(@cmd{DATA LIST FREE} with explicitly specified delimiters is an -exception). By default, each tab is 4 characters wide, but an -alternate width may be specified on TABWIDTH. 
A tab width of 0 -suppresses tab expansion entirely. - -In IMAGE mode, the data file is opened in ANSI C binary mode. Record -length is fixed, with output data truncated or padded with spaces to -the record length. LRECL specifies the record length in bytes, with a -default of 1024. Tab characters are never expanded to spaces in -binary mode. Records +The effect and syntax of FILE HANDLE depend on the selected MODE: -The NAME subcommand specifies the name of the file associated with the -handle. It is required in CHARACTER and IMAGE modes. +@itemize +@item +In CHARACTER mode, the default, the data file is read as a text file, +according to the local system's conventions, and each text line is +read as one record. + +In CHARACTER mode only, tabs are expanded to spaces by input programs, +except by @cmd{DATA LIST FREE} with explicitly specified delimiters. +Each tab is 4 characters wide by default, but TABWIDTH (a PSPP +extension) may be used to specify an alternate width. Use a TABWIDTH +of 0 to suppress tab expansion. + +@item +In IMAGE mode, the data file is treated as a series of fixed-length +binary records. LRECL should be used to specify the record length in +bytes, with a default of 1024. On input, it is an error if an IMAGE +file's length is not an integer multiple of the record length. On +output, each record is padded with spaces or truncated, if necessary, +to make it exactly the correct length. + +@item +In BINARY mode, the data file is treated as a series of +variable-length binary records. LRECL may be specified, but its value +is ignored. The data for each record is both preceded and followed by +a 32-bit signed integer in little-endian byte order that specifies the +length of the record. (This redundancy permits records in these +files to be efficiently read in reverse order, although PSPP always +reads them in forward order.) The length does not include either +integer. -The SCRATCH mode designates the file handle as a scratch file handle. 
-Its use is usually unnecessary because file handle names that begin with -@samp{#} are assumed to refer to scratch files. @pxref{File Handles}, -for more information. +@item +Mode 360 reads and writes files in formats first used for tapes in the +1960s on IBM mainframe operating systems and still supported today by +the modern successors of those operating systems. For more +information, see @cite{OS/400 Tape and Diskette Device Programming}, +available on IBM's website. + +Alphanumeric data in mode 360 files are encoded in EBCDIC. PSPP +translates EBCDIC to or from the host's native format as necessary on +input or output, using an ASCII/EBCDIC translation that is one-to-one, +so that a ``round trip'' from ASCII to EBCDIC back to ASCII, or vice +versa, always yields exactly the original data. + +The RECFORM subcommand is required in mode 360. The precise file +format depends on its setting: + +@table @asis +@item F +@itemx FIXED +This record format is equivalent to IMAGE mode, except for EBCDIC +translation. + +IBM documentation calls this @code{*F} (fixed-length, deblocked) +format. + +@item V +@itemx VARIABLE +The file comprises a sequence of zero or more variable-length blocks. +Each block begins with a 4-byte @dfn{block descriptor word} (BDW). +The first two bytes of the BDW are an unsigned integer in big-endian +byte order that specifies the length of the block, including the BDW +itself. The other two bytes of the BDW are ignored on input and +written as zeros on output. + +Following the BDW, the remainder of each block is a sequence of one or +more variable-length records, each of which in turn begins with a +4-byte @dfn{record descriptor word} (RDW) that has the same format as +the BDW. Following the RDW, the remainder of each record is the +record data. + +The maximum length of a record in VARIABLE mode is 65,527 bytes: +65,535 bytes (the maximum value of a 16-bit unsigned integer), minus 4 +bytes for the BDW, minus 4 bytes for the RDW. 
+ +In mode VARIABLE, LRECL specifies a maximum, not a fixed, record +length, in bytes. The default is 8,192. + +IBM documentation calls this @code{*VB} (variable-length, blocked, +unspanned) format. + +@item VS +@itemx SPANNED +The file format is like that of VARIABLE mode, except that logical +records may be split among multiple physical records (called +@dfn{segments}) or blocks. In SPANNED mode, the third byte of each +RDW is called the segment control character (SCC). Odd SCC values +cause the segment to be appended to a record buffer maintained in +memory; even values also append the segment and then flush its +contents to the input procedure. Canonically, SCC value 0 designates +a record not spanned among multiple segments, and values 1 through 3 +designate the first segment, the last segment, or an intermediate +segment, respectively, within a multi-segment record. The record +buffer is also flushed at end of file regardless of the final record's +SCC. + +The maximum length of a logical record in SPANNED mode is limited +only by memory available to PSPP. Segments are limited to 65,527 +bytes, as in VARIABLE mode. + +This format is similar to what IBM documentation calls @code{*VS} +(variable-length, deblocked, spanned) format. +@end table + +In mode 360, fields of type A that extend beyond the end of a record +read from disk are padded with spaces in the host's native character +set, which are then translated from EBCDIC to the native character +set. Thus, when the host's native character set is based on ASCII, +these fields are effectively padded with character @code{X'80'}. This +wart is implemented for compatibility. +@end itemize + +The NAME subcommand specifies the name of the file associated with the +handle. It is required in all modes but SCRATCH mode, in which its +use is forbidden. @node INPUT PROGRAM @section INPUT PROGRAM @@ -602,7 +834,7 @@ LIST. 
@end example The above example reads data from file @file{a.data}, then from -@file{b.data}, and concatenates them into a single active file. +@file{b.data}, and concatenates them into a single active dataset. @c If you change this example, change the regression test4 in @c tests/command/input-program.sh to match. @@ -646,7 +878,7 @@ END INPUT PROGRAM. LIST/FORMAT=NUMBERED. @end example -The above example causes an active file to be created consisting of 50 +The above example causes an active dataset to be created consisting of 50 random variates between 0 and 10. @node LIST @@ -657,8 +889,7 @@ random variates between 0 and 10. LIST /VARIABLES=var_list /CASES=FROM start_index TO end_index BY incr_index - /FORMAT=@{UNNUMBERED,NUMBERED@} @{WRAP,SINGLE@} - @{NOWEIGHT,WEIGHT@} + /FORMAT=@{UNNUMBERED,NUMBERED@} @{WRAP,SINGLE@} @end display The @cmd{LIST} procedure prints the values of specified variables to the @@ -666,7 +897,7 @@ listing file. The VARIABLES subcommand specifies the variables whose values are to be printed. Keyword VARIABLES is optional. If VARIABLES subcommand is not -specified then all variables in the active file are printed. +specified then all variables in the active dataset are printed. The CASES subcommand can be used to specify a subset of cases to be printed. Specify FROM and the case number of the first case to print, @@ -677,9 +908,7 @@ settings. If CASES is not specified then all cases are printed. The FORMAT subcommand can be used to change the output format. NUMBERED will print case numbers along with each case; UNNUMBERED, the default, causes the case numbers to be omitted. The WRAP and SINGLE settings are -currently not used. WEIGHT will cause case weights to be printed along -with variable values; NOWEIGHT, the default, causes case weights to be -omitted from the output. +currently not used. Case numbers start from 1. They are counted after all transformations have been considered. 
@@ -698,7 +927,8 @@ cannot fit on a single line, then a multi-line format will be used. NEW FILE. @end display -@cmd{NEW FILE} command clears the current active file. +@cmd{NEW FILE} command clears the dictionary and data from the current +active dataset. @node PRINT @section PRINT @@ -975,4 +1205,3 @@ specified output format, whereas @cmd{WRITE} outputs the system-missing value as a field filled with spaces. Binary formats are an exception. @end itemize -@setfilename ignored
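Illustration (not part of the patch): the MODE=BINARY record framing described in the FILE HANDLE hunk above can be sketched in a few lines of Python. This is a minimal sketch based only on the wording of that hunk: each record's data is preceded and followed by a 32-bit signed little-endian length that does not count either integer. The function names `write_records` and `read_records` are hypothetical and are not PSPP APIs; PSPP's own reader may handle malformed files differently.

```python
import io
import struct

# 32-bit signed integer, little-endian, as the BINARY mode text describes.
LENGTH = struct.Struct("<i")


def write_records(stream, records):
    """Write each record as: length, data, length (hypothetical helper)."""
    for data in records:
        marker = LENGTH.pack(len(data))  # length excludes both markers
        stream.write(marker)
        stream.write(data)
        stream.write(marker)


def read_records(stream):
    """Yield record payloads in forward order, checking both length markers."""
    while True:
        head = stream.read(LENGTH.size)
        if not head:
            return  # clean end of file
        if len(head) < LENGTH.size:
            raise ValueError("truncated leading length")
        (size,) = LENGTH.unpack(head)
        if size < 0:
            raise ValueError("negative record length")
        data = stream.read(size)
        tail = stream.read(LENGTH.size)
        if len(data) != size or len(tail) != LENGTH.size:
            raise ValueError("truncated record")
        if LENGTH.unpack(tail)[0] != size:
            raise ValueError("leading and trailing lengths differ")
        yield data


if __name__ == "__main__":
    buf = io.BytesIO()
    write_records(buf, [b"abc", b"", b"hello world"])
    buf.seek(0)
    for record in read_records(buf):
        print(record)
```

The trailing copy of the length is what would let a reader walk the file backwards, which is the redundancy the hunk mentions. This sketch reads forward only, as the text says PSPP does.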