-@node System and Portable Files, Variable Attributes, Data Input and Output, Top
+@node System and Portable Files
@chapter System Files and Portable Files
The commands in this chapter read, write, and examine system files and
* APPLY DICTIONARY:: Apply system file dictionary to active file.
* EXPORT:: Write to a portable file.
* GET:: Read from a system file.
+* GET DATA:: Read from foreign files.
* IMPORT:: Read from a portable file.
* MATCH FILES:: Merge system files.
* SAVE:: Write to a system file.
Use of @cmd{GET} to read a portable file or scratch file is a PSPP
extension.
+@node GET DATA
+@section GET DATA
+@vindex GET DATA
+
+@display
+GET DATA
+ /TYPE=@{GNM,PSQL,TXT@}
+ @dots{}additional subcommands depending on TYPE@dots{}
+@end display
+
+The @cmd{GET DATA} command is used to read files and other data sources
+created by other applications.
+When this command is executed, the current dictionary and active file are
+replaced with variables and data read from the specified source.
+
+The TYPE subcommand is mandatory and must be the first subcommand
+specified. It determines the type of the file or source to read.
+PSPP currently supports the following file types:
+
+@table @asis
+@item GNM
+Spreadsheet files created by Gnumeric (@url{http://gnumeric.org}).
+
+@item PSQL
+Relations from PostgreSQL databases (@url{http://postgresql.org}).
+
+@item TXT
+Textual data files in columnar and delimited formats.
+@end table
+
+Each supported file type has additional subcommands, explained in
+separate sections below.
+
+@menu
+* GET DATA /TYPE=GNM::
+* GET DATA /TYPE=PSQL::
+* GET DATA /TYPE=TXT::
+@end menu
+
+@node GET DATA /TYPE=GNM
+@subsection Gnumeric Spreadsheet Files
+
+@display
+GET DATA /TYPE=GNM
+ /FILE=@{'file-name'@}
+ /SHEET=@{NAME 'sheet-name', INDEX n@}
+ /CELLRANGE=@{RANGE 'range', FULL@}
+ /READNAMES=@{ON, OFF@}
+ /ASSUMEDVARWIDTH=n.
+@end display
+
+@cindex Gnumeric
+@cindex spreadsheet files
+To use GET DATA to read a spreadsheet file created by Gnumeric
+(@url{http://gnumeric.org}), specify TYPE=GNM to indicate the file's
+format and use FILE to indicate the Gnumeric file to be read. All
+other subcommands are optional.
+
+The format of each variable is determined by the format of the spreadsheet
+cell containing the first datum for the variable.
+If this cell is of string (text) format, then the width of the variable is
+determined from the length of the string it contains, unless the
+ASSUMEDVARWIDTH subcommand is given.
+
+
+The FILE subcommand is mandatory. Specify the name of the file
+to be read.
+
+The SHEET subcommand specifies the sheet within the spreadsheet file to read.
+There are two forms of the SHEET subcommand.
+In the first form,
+@samp{/SHEET=name @var{sheet-name}}, the string @var{sheet-name} is the
+name of the sheet to read.
+In the second form, @samp{/SHEET=index @var{idx}}, @var{idx} is a
+integer which is the index of the sheet to read.
+The first sheet has the index 1.
+If the SHEET subcommand is omitted, then the command will read the
+first sheet in the file.
+
+The CELLRANGE subcommand specifies the range of cells within the sheet to read.
+If the subcommand is given as @samp{/CELLRANGE=FULL}, then the entire
+sheet is read.
+To read only part of a sheet, use the form
+@samp{/CELLRANGE=range '@var{top-left-cell}:@var{bottom-right-cell}'}.
+For example, the subcommand @samp{/CELLRANGE=range 'C3:P19'} reads
+columns C--P, and rows 3--19 inclusive.
+If no CELLRANGE subcommand is given, then the entire sheet is read.
+
+If @samp{/READNAMES=ON} is specified, then the contents of cells of
+the first row are used as the names of the variables in which to store
+the data from subsequent rows.
+If the READNAMES command is omitted, or if @samp{/READNAMES=OFF} is
+used, then the variables receive automatically assigned names.
+
+The ASSUMEDVARWIDTH subcommand specifies the maximum width of string
+variables read from the file.
+If omitted, the default value is determined from the length of the
+string in the first spreadsheet cell for each variable.
+
+
+@node GET DATA /TYPE=PSQL
+@subsection Postgres Database Queries
+
+@display
+GET DATA /TYPE=PSQL
+ /CONNECT=@{connection info@}
+ /SQL=@{query@}
+ [/ASSUMEDVARWIDTH=n]
+ [/UNENCRYPTED]
+ [/BSIZE=n].
+@end display
+
+@cindex postgres
+@cindex databases
+
+The PSQL type is used to import data from a postgres database server.
+The server may be located locally or remotely.
+Variables are automatically created based on the table column names
+or the names specified in the SQL query.
+Postgres data types of high precision, will loose precision when
+imported into PSPP.
+Not all the postgres data types are able to be represented in PSPP.
+If a datum cannot be represented a warning will be issued and that
+datum will be set to SYSMIS.
+
+The CONNECT subcommand is mandatory.
+It is a string specifying the parameters of the database server from
+which the data should be fetched.
+The format of the string is given in the postgres manual
+@url{http://www.postgresql.org/docs/8.0/static/libpq.html#LIBPQ-CONNECT}.
+
+The SQL subcommand is mandatory.
+It must be a valid SQL string to retrieve data from the database.
+
+The ASSUMEDVARWIDTH subcommand specifies the maximum width of string
+variables read from the database.
+If omitted, the default value is determined from the length of the
+string in the first value read for each variable.
+
+The UNENCRYPTED subcommand allows data to be retrieved over an insecure
+connection.
+If the connection is not encrypted, and the UNENCRYPTED subcommand is not
+given, then an error will occur.
+Whether or not the connection is
+encrypted depends upon the underlying psql library and the
+capabilities of the database server.
+
+The BSIZE subcommand serves only to optimise the speed of data transfer.
+It specifies an upper limit on
+number of cases to fetch from the database at once.
+The default value is 4096.
+If your SQL statement fetches a large number of cases but only a small number of
+variables, then the data transfer may be faster if you increase this value.
+Conversely, if the number of variables is large, or if the machine on which
+PSPP is running has only a
+small amount of memory, then a smaller value will be better.
+
+
+The following syntax is an example:
+@example
+GET DATA /TYPE=PSQL
+ /CONNECT='host=example.com port=5432 dbname=product user=fred passwd=xxxx'
+ /SQL='select * from manufacturer'.
+@end example
+
+
+@node GET DATA /TYPE=TXT
+@subsection Textual Data Files
+
+@display
+GET DATA /TYPE=TXT
+ /FILE=@{'file-name',file_handle@}
+ [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+ [/FIRSTCASE=@{first_case@}]
+ [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+ @dots{}additional subcommands depending on ARRANGEMENT@dots{}
+@end display
+
+@cindex text files
+@cindex data files
+When TYPE=TXT is specified, GET DATA reads data in a delimited or
+fixed columnar format, much like DATA LIST (@pxref{DATA LIST}).
+
+The FILE subcommand is mandatory. Specify the file to be read as
+a string file name or (for textual data
+only) a file handle (@pxref{File Handles}).
+
+The ARRANGEMENT subcommand determines the file's basic format.
+DELIMITED, the default setting, specifies that fields in the input
+data are separated by spaces, tabs, or other user-specified
+delimiters. FIXED specifies that fields in the input data appear at
+particular fixed column positions within records of a case.
+
+By default, cases are read from the input file starting from the first
+line. To skip lines at the beginning of an input file, set FIRSTCASE
+to the number of the first line to read: 2 to skip the first line, 3
+to skip the first two lines, and so on.
+
+IMPORTCASE can be used to limit the number of cases read from the
+input file. With the default setting, ALL, all cases in the file are
+read. Specify FIRST @i{max_cases} to read at most @i{max_cases} cases
+from the file. Use PERCENT @i{percent} to read only @i{percent}
+percent, approximately, of the cases contained in the file. (The
+percentage is approximate, because there is no way to accurately count
+the number of cases in the file without reading the entire file. The
+number of cases in some kinds of unusual files cannot be estimated;
+PSPP will read all cases in such files.)
+
+FIRSTCASE and IMPORTCASE may be used with delimited and fixed-format
+data. The remaining subcommands, which apply only to one of the two file
+arrangements, are described below.
+
+@menu
+* GET DATA /TYPE=TXT /ARRANGEMENT=DELIMITED::
+* GET DATA /TYPE=TXT /ARRANGEMENT=FIXED::
+@end menu
+
+@node GET DATA /TYPE=TXT /ARRANGEMENT=DELIMITED
+@subsubsection Reading Delimited Data
+
+@display
+GET DATA /TYPE=TXT
+ /FILE=@{'file-name',file_handle@}
+ [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+ [/FIRSTCASE=@{first_case@}]
+ [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+
+ /DELIMITERS="delimiters"
+ [/QUALIFIER="quotes" [/ESCAPE]]
+ [/DELCASE=@{LINE,VARIABLES n_variables@}]
+ /VARIABLES=del_var [del_var]@dots{}
+where each del_var takes the form:
+ variable format
+@end display
+
+The GET DATA command with TYPE=TXT and ARRANGEMENT=DELIMITED reads
+input data from text files in delimited format, where fields are
+separated by a set of user-specified delimiters. Its capabilities are
+similar to those of DATA LIST FREE (@pxref{DATA LIST FREE}), with a
+few enhancements.
+
+The required FILE subcommand and optional FIRSTCASE and IMPORTCASE
+subcommands are described above (@pxref{GET DATA /TYPE=TXT}).
+
+DELIMITERS, which is required, specifies the set of characters that
+may separate fields. Each character in the string specified on
+DELIMITERS separates one field from the next. The end of a line also
+separates fields, regardless of DELIMITERS. Two consecutive
+delimiters in the input yield an empty field, as does a delimiter at
+the end of a line. A space character as a delimiter is an exception:
+consecutive spaces do not yield an empty field and neither does any
+number of spaces at the end of a line.
+
+To use a tab as a delimiter, specify @samp{\t} at the beginning of the
+DELIMITERS string. To use a backslash as a delimiter, specify
+@samp{\\} as the first delimiter or, if a tab should also be a
+delimiter, immediately following @samp{\t}. To read a data file in
+which each field appears on a separate line, specify the empty string
+for DELIMITERS.
+
+The optional QUALIFIER subcommand names one or more characters that
+can be used to quote values within fields in the input. A field that
+begins with one of the specified quote characters ends at the next
+matching quote. Intervening delimiters become part of the field,
+instead of terminating it. The ability to specify more than one quote
+character is a PSPP extension.
+
+By default, a character specified on QUALIFIER cannot itself be
+embedded within a field that it quotes, because the quote character
+always terminates the quoted field. With ESCAPE, however, a doubled
+quote character within a quoted field inserts a single instance of the
+quote into the field. For example, if @samp{'} is specified on
+QUALIFIER, then without ESCAPE @code{'a''b'} specifies a pair of
+fields that contain @samp{a} and @samp{b}, but with ESCAPE it
+specifies a single field that contains @samp{a'b}. ESCAPE is a PSPP
+extension.
+
+The DELCASE subcommand controls how data may be broken across lines in
+the data file. With LINE, the default setting, each line must contain
+all the data for exactly one case. For additional flexibility, to
+allow a single case to be split among lines or multiple cases to be
+contained on a single line, specify VARIABLES @i{n_variables}, where
+@i{n_variables} is the number of variables per case.
+
+The VARIABLES subcommand is required and must be the last subcommand.
+Specify the name of each variable and its input format (@pxref{Input
+and Output Formats}) in the order they should be read from the input
+file.
+
+@subsubheading Examples
+
+@noindent
+On a Unix-like system, the @samp{/etc/passwd} file has a format
+similar to this:
+
+@example
+root:$1$nyeSP5gD$pDq/:0:0:,,,:/root:/bin/bash
+blp:$1$BrP/pFg4$g7OG:1000:1000:Ben Pfaff,,,:/home/blp:/bin/bash
+john:$1$JBuq/Fioq$g4A:1001:1001:John Darrington,,,:/home/john:/bin/bash
+jhs:$1$D3li4hPL$88X1:1002:1002:Jason Stover,,,:/home/jhs:/bin/csh
+@end example
+
+@noindent
+The following syntax reads a file in the format used by
+@samp{/etc/passwd}:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='/etc/passwd' /DELIMITERS=':'
+ /VARIABLES=username A20
+ password A40
+ uid F10
+ gid F10
+ gecos A40
+ home A40
+ shell A40.
+@end example
+
+@noindent
+Consider the following data on used cars:
+
+@example
+model year mileage price type age
+Civic 2002 29883 15900 Si 2
+Civic 2003 13415 15900 EX 1
+Civic 1992 107000 3800 n/a 12
+Accord 2002 26613 17900 EX 1
+@end example
+
+@noindent
+The following syntax can be used to read the used car data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='cars.data' /DELIMITERS=' ' /FIRSTCASE=2
+ /VARIABLES=model A8
+ year F4
+ mileage F6
+ price F5
+ type A4
+ age F2.
+@end example
+
+@noindent
+Consider the following information on animals in a pet store:
+
+@example
+'Pet''s Name', "Age", "Color", "Date Received", "Price", "Height", "Type"
+, (Years), , , (Dollars), ,
+"Rover", 4.5, Brown, "12 Feb 2004", 80, '1''4"', "Dog"
+"Charlie", , Gold, "5 Apr 2007", 12.3, "3""", "Fish"
+"Molly", 2, Black, "12 Dec 2006", 25, '5"', "Cat"
+"Gilly", , White, "10 Apr 2007", 10, "3""", "Guinea Pig"
+@end example
+
+@noindent
+The following syntax can be used to read the pet store data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='pets.data' /DELIMITERS=', ' /QUALIFIER='''"' /ESCAPE
+ /FIRSTCASE=3
+ /VARIABLES=name A10
+ age F3.1
+ color A5
+ received EDATE10
+ price F5.2
+ height a5
+ type a10.
+@end example
+
+@node GET DATA /TYPE=TXT /ARRANGEMENT=FIXED
+@subsubsection Reading Fixed Columnar Data
+
+@display
+GET DATA /TYPE=TXT
+ /FILE=@{'file-name',file_handle@}
+ [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+ [/FIRSTCASE=@{first_case@}]
+ [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+
+ [/FIXCASE=n]
+ /VARIABLES fixed_var [fixed_var]@dots{}
+ [/rec# fixed_var [fixed_var]@dots{}]@dots{}
+where each fixed_var takes the form:
+ variable start-end format
+@end display
+
+The GET DATA command with TYPE=TXT and ARRANGEMENT=FIXED reads input
+data from text files in fixed format, where each field is located in
+particular fixed column positions within records of a case. Its
+capabilities are similar to those of DATA LIST FIXED (@pxref{DATA LIST
+FIXED}), with a few enhancements.
+
+The required FILE subcommand and optional FIRSTCASE and IMPORTCASE
+subcommands are described above (@pxref{GET DATA /TYPE=TXT}).
+
+The optional FIXCASE subcommand may be used to specify the positive
+integer number of input lines that make up each case. The default
+value is 1.
+
+The VARIABLES subcommand, which is required, specifies the positions
+at which each variable can be found. For each variable, specify its
+name, followed by its start and end column separated by @samp{-}
+(e.g.@: @samp{0-9}), followed by the input format type (e.g.@:
+@samp{F}). For this command, columns are numbered starting from 0 at
+the left column. Introduce the variables in the second and later
+lines of a case by a slash followed by the number of the line within
+the case, e.g.@: @samp{/2} for the second line.
+
+@subsubheading Examples
+
+@noindent
+Consider the following data on used cars:
+
+@example
+model year mileage price type age
+Civic 2002 29883 15900 Si 2
+Civic 2003 13415 15900 EX 1
+Civic 1992 107000 3800 n/a 12
+Accord 2002 26613 17900 EX 1
+@end example
+
+@noindent
+The following syntax can be used to read the used car data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='cars.data' /ARRANGEMENT=FIXED /FIRSTCASE=2
+ /VARIABLES=model 0-7 A
+ year 8-15 F
+ mileage 16-23 F
+ price 24-31 F
+ type 32-40 A
+ age 40-47 F.
+@end example
+
@node IMPORT
@section IMPORT
@vindex IMPORT
@display
MATCH FILES
/@{FILE,TABLE@}=@{*,'file-name'@}
- /DROP=var_list
- /KEEP=var_list
/RENAME=(src_names=target_names)@dots{}
/IN=var_name
- /BY var_list
+ /BY=var_list
+ /DROP=var_list
+ /KEEP=var_list
/FIRST=var_name
/LAST=var_name
/MAP
@cmd{MATCH FILES} merges one or more system, portable, or scratch files,
optionally
-including the active file. Records with the same values for BY
-variables are combined into a single record. Records with different
+including the active file. Cases with the same values for BY
+variables are combined into a single case. Cases with different
values are output in order. Thus, multiple sorted files are
combined into a single sorted file based on the value of the BY
variables. The results of the merge become the new active file.
-The BY subcommand specifies a list of variables that are used to match
-records from each of the files. Variables specified must exist
-in all the files specified on FILE and TABLE. BY should usually be
-specified. If TABLE or IN is used then BY is required.
-
Specify FILE with a system, portable, or scratch file as a file name
string or file handle
(@pxref{File Handles}), or with an asterisk (@samp{*}) to
indicate the current active file. The files specified on FILE are
merged together based on the BY variables, or combined case-by-case if
-BY is not specified. Normally at least two FILE subcommands should be
-specified.
+BY is not specified.
Specify TABLE with a file to use it as a @dfn{table
-lookup file}. Records in table lookup files are not used up after
+lookup file}. Cases in table lookup files are not used up after
they've been used once. This means that data in table lookup files can
-correspond to any number of records in FILE files. Table lookup files
+correspond to any number of cases in FILE files. Table lookup files
correspond to lookup tables in traditional relational database systems.
-It is incorrect to have records with duplicate BY values in table lookup
-files.
+If a table lookup file contains more than one case with a given set of
+BY variables, only the first case is used.
-Any number of FILE and TABLE subcommands may be specified. Each
-instance of FILE or TABLE can be followed by any sequence of DROP,
-KEEP, or RENAME subcommands. These have the same form and meaning as
-the corresponding subcommands of @cmd{GET} (@pxref{GET}), but apply
-only to variables in the given file.
+Any number of FILE and TABLE subcommands may be specified.
+Ordinarily, at least two FILE subcommands, or one FILE and at least
+one TABLE, should be specified. Each instance of FILE or TABLE can be
+followed by any sequence of RENAME subcommands. These have the same
+form and meaning as the corresponding subcommands of @cmd{GET}
+(@pxref{GET}), but apply only to variables in the given file.
Each FILE or TABLE may optionally be followed by an IN subcommand,
which creates a numeric variable with the specified name and format
contributed a row to the merged file, 0 otherwise. The DROP, KEEP,
and RENAME subcommands do not affect IN variables.
-Variables belonging to files that are not present for the current case
-are set to the system-missing value for numeric variables or spaces for
-string variables.
+When more than one FILE or TABLE contains a variable with a given
+name, those variables must all have the same type (numeric or string)
+and, for string variables, the same width. This rules applies to
+variable names after renaming with RENAME; thus, RENAME can be used to
+resolve conflicts.
-FIRST, LAST, and MAP are currently ignored.
+FILE and TABLE must be specified at the beginning of the command, with
+any RENAME or IN specifications immediately after the corresponding
+FILE or TABLE. These subcommands are followed by BY, DROP, KEEP,
+FIRST, LAST, and MAP.
+
+The BY subcommand specifies a list of variables that are used to match
+cases from each of the files. When TABLE or IN is used, BY is
+required; otherwise, it is optional. When BY is specified, all the
+files named on FILE and TABLE subcommands must be sorted in ascending
+order of the BY variables. Variables belonging to files that are not
+present for the current case are set to the system-missing value for
+numeric variables or spaces for string variables.
+
+The DROP and KEEP subcommands allow variables to be dropped from or
+reordered within the new active file. These subcommands have the same
+form and meaning as the corresponding subcommands of @cmd{GET}
+(@pxref{GET}). They apply to the new active file as a whole, not to
+individual input files. The variable names specified on DROP and KEEP
+are those after any renaming with RENAME.
+
+The optional FIRST and LAST subcommands name variables that @cmd{MATCH
+FILES} adds to the active file. The new variables are numeric with
+print and write format F1.0. The value of the FIRST variable is 1 in
+the first case with a given set of values for the BY variables, and 0
+in other cases. Similarly, the LAST variable is 1 in the last case
+with a given of BY values, and 0 in other cases.
@cmd{MATCH FILES} may not be specified following @cmd{TEMPORARY}
(@pxref{TEMPORARY}) if the active file is used as an input source.
@vindex XEXPORT
@display
-EXPORT
+XEXPORT
/OUTFILE='file-name'
/DIGITS=n
/DROP=var_list
@xref{EXPORT}, for more information.
-@node XSAVE, , XEXPORT, System and Portable Files
+@node XSAVE
@section XSAVE
@vindex XSAVE