sys-file-reader: Tolerate nonsensical opcodes in compressed data.

[pspp-builds.git] / doc / files.texi
diff --git a/doc/files.texi b/doc/files.texi

index 479fc0671960cd5a8fb82632239062053eaa7b17..30a023aeb90d3a22793169e4a4cce20edaaed77a 100644 (file)
--- a/doc/files.texi
+++ b/doc/files.texi
@@ -1,4 +1,4 @@
-@node System and Portable Files, Variable Attributes, Data Input and Output, Top
+@node System and Portable Files
  @chapter System Files and Portable Files
  
  The commands in this chapter read, write, and examine system files and
@@ -8,6 +8,7 @@ portable files.
  * APPLY DICTIONARY::            Apply system file dictionary to active file.
  * EXPORT::                      Write to a portable file.
  * GET::                         Read from a system file.
+* GET DATA::                    Read from foreign files.
  * IMPORT::                      Read from a portable file.
  * MATCH FILES::                 Merge system files.
  * SAVE::                        Write to a system file.
@@ -156,6 +157,447 @@ is read later, when a procedure is executed.
  Use of @cmd{GET} to read a portable file or scratch file is a PSPP
  extension.
  
+@node GET DATA
+@section GET DATA
+@vindex GET DATA
+
+@display
+GET DATA
+        /TYPE=@{GNM,PSQL,TXT@}
+        @dots{}additional subcommands depending on TYPE@dots{}
+@end display
+
+The @cmd{GET DATA} command is used to read files and other data sources 
+created by other applications. 
+When this command is executed, the current dictionary and active file are
+replaced  with variables and data read from the specified source.
+
+The TYPE subcommand is mandatory and must be the first subcommand
+specified.  It determines the type of the file or source to read.
+PSPP currently supports the following file types:
+
+@table @asis
+@item GNM
+Spreadsheet files created by Gnumeric (@url{http://gnumeric.org}).
+
+@item PSQL
+Relations from PostgreSQL databases (@url{http://postgresql.org}).
+
+@item TXT
+Textual data files in columnar and delimited formats.
+@end table
+
+Each supported file type has additional subcommands, explained in
+separate sections below.
+
+@menu
+* GET DATA /TYPE=GNM::
+* GET DATA /TYPE=PSQL::
+* GET DATA /TYPE=TXT::
+@end menu
+
+@node GET DATA /TYPE=GNM
+@subsection Gnumeric Spreadsheet Files
+
+@display
+GET DATA /TYPE=GNM
+        /FILE=@{'file-name'@}
+        /SHEET=@{NAME 'sheet-name', INDEX n@}
+        /CELLRANGE=@{RANGE 'range', FULL@}
+        /READNAMES=@{ON, OFF@}
+        /ASSUMEDVARWIDTH=n.
+@end display
+
+@cindex Gnumeric
+@cindex spreadsheet files
+To use GET DATA to read a spreadsheet file created by Gnumeric
+(@url{http://gnumeric.org}), specify TYPE=GNM to indicate the file's
+format and use FILE to indicate the Gnumeric file to be read.  All
+other subcommands are optional.
+
+The format of each variable is determined by the format of the spreadsheet 
+cell containing the first datum for the variable.
+If this cell is of string (text) format, then the width of the variable is
+determined from the length of the string it contains, unless the 
+ASSUMEDVARWIDTH subcommand is given.
+
+
+The FILE subcommand is mandatory. Specify the name of the file
+to be read.
+
+The SHEET subcommand specifies the sheet within the spreadsheet file to read.
+There are two forms of the SHEET subcommand.
+In the first form,
+@samp{/SHEET=name @var{sheet-name}}, the string @var{sheet-name} is the
+name of the sheet to read.
+In the second form, @samp{/SHEET=index @var{idx}}, @var{idx} is a
+integer which is the index of the sheet to read.
+The first sheet has the index 1.
+If the SHEET subcommand is omitted, then the command will read the
+first sheet in the file.
+
+The CELLRANGE subcommand specifies the range of cells within the sheet to read.
+If the subcommand is given as @samp{/CELLRANGE=FULL}, then the entire
+sheet  is read.
+To read only part of a sheet, use the form 
+@samp{/CELLRANGE=range '@var{top-left-cell}:@var{bottom-right-cell}'}.
+For example, the subcommand @samp{/CELLRANGE=range 'C3:P19'} reads 
+columns C--P, and rows 3--19 inclusive.
+If no CELLRANGE subcommand is given, then the entire sheet is read.
+
+If @samp{/READNAMES=ON} is specified, then the contents of cells of
+the first row are used as the names of the variables in which to store
+the data from subsequent rows. 
+If the READNAMES command is omitted, or if @samp{/READNAMES=OFF} is
+used, then the variables  receive automatically assigned names.
+
+The ASSUMEDVARWIDTH subcommand specifies the maximum width of string
+variables read  from the file.
+If omitted, the default value is determined from the length of the 
+string in the first spreadsheet cell for each variable.
+
+
+@node GET DATA /TYPE=PSQL
+@subsection Postgres Database Queries
+
+@display
+GET DATA /TYPE=PSQL
+         /CONNECT=@{connection info@}
+         /SQL=@{query@}
+         [/ASSUMEDVARWIDTH=n]
+         [/UNENCRYPTED]
+         [/BSIZE=n].
+@end display
+
+@cindex postgres
+@cindex databases
+
+The PSQL type is used to import data from a postgres database server.
+The server may be located locally or remotely.
+Variables are automatically created based on the table column names 
+or the names specified in the SQL query.
+Postgres data types of high precision, will loose precision when 
+imported into PSPP.
+Not all the postgres data types are able to be represented in PSPP.
+If a datum cannot be represented a warning will be issued and that 
+datum will be set to SYSMIS.
+
+The CONNECT subcommand is mandatory.
+It is a string specifying the parameters of the database server from
+which the data should be fetched.
+The format of the string is given in the postgres manual
+@url{http://www.postgresql.org/docs/8.0/static/libpq.html#LIBPQ-CONNECT}.
+
+The SQL subcommand is mandatory.
+It must be a valid SQL string to retrieve data from the database.
+
+The ASSUMEDVARWIDTH subcommand specifies the maximum width of string
+variables read  from the database.
+If omitted, the default value is determined from the length of the 
+string in the first value read for each variable.
+
+The UNENCRYPTED subcommand allows data to be retrieved over an insecure
+connection.
+If the connection is not encrypted, and the UNENCRYPTED subcommand is not
+given, then an error will occur.
+Whether or not the connection is
+encrypted depends upon the underlying psql library and the 
+capabilities of the database server.
+
+The BSIZE subcommand serves only to optimise the speed of data transfer.
+It specifies an upper limit on
+number of cases to fetch from the database at once.
+The default value is 4096.
+If your SQL statement fetches a large number of cases but only a small number of
+variables, then the data transfer may be faster if you increase this value.
+Conversely, if the number of variables is large, or if the machine on which 
+PSPP is running has only a
+small amount of memory, then a smaller value will be better.
+
+
+The following syntax is an example:
+@example
+GET DATA /TYPE=PSQL
+     /CONNECT='host=example.com port=5432 dbname=product user=fred passwd=xxxx'
+     /SQL='select * from manufacturer'.
+@end example
+
+
+@node GET DATA /TYPE=TXT
+@subsection Textual Data Files
+
+@display
+GET DATA /TYPE=TXT
+        /FILE=@{'file-name',file_handle@}
+        [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+        [/FIRSTCASE=@{first_case@}]
+        [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+        @dots{}additional subcommands depending on ARRANGEMENT@dots{}
+@end display
+
+@cindex text files
+@cindex data files
+When TYPE=TXT is specified, GET DATA reads data in a delimited or
+fixed columnar format, much like DATA LIST (@pxref{DATA LIST}).
+
+The FILE subcommand is mandatory.  Specify the file to be read as 
+a string file name or (for textual data
+only) a file handle (@pxref{File Handles}).
+
+The ARRANGEMENT subcommand determines the file's basic format.
+DELIMITED, the default setting, specifies that fields in the input
+data are separated by spaces, tabs, or other user-specified
+delimiters.  FIXED specifies that fields in the input data appear at
+particular fixed column positions within records of a case.
+
+By default, cases are read from the input file starting from the first
+line.  To skip lines at the beginning of an input file, set FIRSTCASE
+to the number of the first line to read: 2 to skip the first line, 3
+to skip the first two lines, and so on.
+
+IMPORTCASE can be used to limit the number of cases read from the
+input file.  With the default setting, ALL, all cases in the file are
+read.  Specify FIRST @i{max_cases} to read at most @i{max_cases} cases
+from the file.  Use PERCENT @i{percent} to read only @i{percent}
+percent, approximately, of the cases contained in the file.  (The
+percentage is approximate, because there is no way to accurately count
+the number of cases in the file without reading the entire file.  The
+number of cases in some kinds of unusual files cannot be estimated;
+PSPP will read all cases in such files.)
+
+FIRSTCASE and IMPORTCASE may be used with delimited and fixed-format
+data.  The remaining subcommands, which apply only to one of the two  file
+arrangements, are described below.
+
+@menu
+* GET DATA /TYPE=TXT /ARRANGEMENT=DELIMITED::
+* GET DATA /TYPE=TXT /ARRANGEMENT=FIXED::
+@end menu
+
+@node GET DATA /TYPE=TXT /ARRANGEMENT=DELIMITED
+@subsubsection Reading Delimited Data
+
+@display
+GET DATA /TYPE=TXT
+        /FILE=@{'file-name',file_handle@}
+        [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+        [/FIRSTCASE=@{first_case@}]
+        [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+
+        /DELIMITERS="delimiters"
+        [/QUALIFIER="quotes" [/ESCAPE]]
+        [/DELCASE=@{LINE,VARIABLES n_variables@}]
+        /VARIABLES=del_var [del_var]@dots{}
+where each del_var takes the form:
+        variable format
+@end display
+
+The GET DATA command with TYPE=TXT and ARRANGEMENT=DELIMITED reads
+input data from text files in delimited format, where fields are
+separated by a set of user-specified delimiters.  Its capabilities are
+similar to those of DATA LIST FREE (@pxref{DATA LIST FREE}), with a
+few enhancements.
+
+The required FILE subcommand and optional FIRSTCASE and IMPORTCASE
+subcommands are described above (@pxref{GET DATA /TYPE=TXT}).
+
+DELIMITERS, which is required, specifies the set of characters that
+may separate fields.  Each character in the string specified on
+DELIMITERS separates one field from the next.  The end of a line also
+separates fields, regardless of DELIMITERS.  Two consecutive
+delimiters in the input yield an empty field, as does a delimiter at
+the end of a line.  A space character as a delimiter is an exception:
+consecutive spaces do not yield an empty field and neither does any
+number of spaces at the end of a line.
+
+To use a tab as a delimiter, specify @samp{\t} at the beginning of the
+DELIMITERS string.  To use a backslash as a delimiter, specify
+@samp{\\} as the first delimiter or, if a tab should also be a
+delimiter, immediately following @samp{\t}.  To read a data file in
+which each field appears on a separate line, specify the empty string
+for DELIMITERS.
+
+The optional QUALIFIER subcommand names one or more characters that
+can be used to quote values within fields in the input.  A field that
+begins with one of the specified quote characters ends at the next
+matching quote.  Intervening delimiters become part of the field,
+instead of terminating it.  The ability to specify more than one quote
+character is a PSPP extension.
+
+By default, a character specified on QUALIFIER cannot itself be
+embedded within a field that it quotes, because the quote character
+always terminates the quoted field.  With ESCAPE, however, a doubled
+quote character within a quoted field inserts a single instance of the
+quote into the field.  For example, if @samp{'} is specified on
+QUALIFIER, then without ESCAPE @code{'a''b'} specifies a pair of
+fields that contain @samp{a} and @samp{b}, but with ESCAPE it
+specifies a single field that contains @samp{a'b}.  ESCAPE is a PSPP
+extension.
+
+The DELCASE subcommand controls how data may be broken across lines in
+the data file.  With LINE, the default setting, each line must contain
+all the data for exactly one case.  For additional flexibility, to
+allow a single case to be split among lines or multiple cases to be
+contained on a single line, specify VARIABLES @i{n_variables}, where
+@i{n_variables} is the number of variables per case.
+
+The VARIABLES subcommand is required and must be the last subcommand.
+Specify the name of each variable and its input format (@pxref{Input
+and Output Formats}) in the order they should be read from the input
+file.
+
+@subsubheading Examples
+
+@noindent
+On a Unix-like system, the @samp{/etc/passwd} file has a format
+similar to this:
+
+@example
+root:$1$nyeSP5gD$pDq/:0:0:,,,:/root:/bin/bash
+blp:$1$BrP/pFg4$g7OG:1000:1000:Ben Pfaff,,,:/home/blp:/bin/bash
+john:$1$JBuq/Fioq$g4A:1001:1001:John Darrington,,,:/home/john:/bin/bash
+jhs:$1$D3li4hPL$88X1:1002:1002:Jason Stover,,,:/home/jhs:/bin/csh
+@end example
+
+@noindent
+The following syntax reads a file in the format used by
+@samp{/etc/passwd}:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='/etc/passwd' /DELIMITERS=':'
+        /VARIABLES=username A20
+                   password A40
+                   uid F10
+                   gid F10
+                   gecos A40
+                   home A40
+                   shell A40.
+@end example
+
+@noindent
+Consider the following data on used cars:
+
+@example
+model   year    mileage price   type    age
+Civic   2002    29883   15900   Si      2
+Civic   2003    13415   15900   EX      1
+Civic   1992    107000  3800    n/a     12
+Accord  2002    26613   17900   EX      1
+@end example
+
+@noindent
+The following syntax can be used to read the used car data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='cars.data' /DELIMITERS=' ' /FIRSTCASE=2
+        /VARIABLES=model A8
+                   year F4
+                   mileage F6
+                   price F5
+                   type A4
+                   age F2.
+@end example
+
+@noindent
+Consider the following information on animals in a pet store:
+
+@example
+'Pet''s Name', "Age", "Color", "Date Received", "Price", "Height", "Type"
+, (Years), , , (Dollars), ,
+"Rover", 4.5, Brown, "12 Feb 2004", 80, '1''4"', "Dog"
+"Charlie", , Gold, "5 Apr 2007", 12.3, "3""", "Fish"
+"Molly", 2, Black, "12 Dec 2006", 25, '5"', "Cat"
+"Gilly", , White, "10 Apr 2007", 10, "3""", "Guinea Pig"
+@end example
+
+@noindent
+The following syntax can be used to read the pet store data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='pets.data' /DELIMITERS=', ' /QUALIFIER='''"' /ESCAPE
+        /FIRSTCASE=3
+        /VARIABLES=name A10
+                   age F3.1
+                   color A5
+                   received EDATE10
+                   price F5.2
+                   height a5
+                   type a10.
+@end example
+
+@node GET DATA /TYPE=TXT /ARRANGEMENT=FIXED
+@subsubsection Reading Fixed Columnar Data
+
+@display
+GET DATA /TYPE=TXT
+        /FILE=@{'file-name',file_handle@}
+        [/ARRANGEMENT=@{DELIMITED,FIXED@}]
+        [/FIRSTCASE=@{first_case@}]
+        [/IMPORTCASE=@{ALL,FIRST max_cases,PERCENT percent@}]
+
+        [/FIXCASE=n]
+        /VARIABLES fixed_var [fixed_var]@dots{}
+            [/rec# fixed_var [fixed_var]@dots{}]@dots{}
+where each fixed_var takes the form:
+        variable start-end format
+@end display
+
+The GET DATA command with TYPE=TXT and ARRANGEMENT=FIXED reads input
+data from text files in fixed format, where each field is located in
+particular fixed column positions within records of a case.  Its
+capabilities are similar to those of DATA LIST FIXED (@pxref{DATA LIST
+FIXED}), with a few enhancements.
+
+The required FILE subcommand and optional FIRSTCASE and IMPORTCASE
+subcommands are described above (@pxref{GET DATA /TYPE=TXT}).
+
+The optional FIXCASE subcommand may be used to specify the positive
+integer number of input lines that make up each case.  The default
+value is 1.
+
+The VARIABLES subcommand, which is required, specifies the positions
+at which each variable can be found.  For each variable, specify its
+name, followed by its start and end column separated by @samp{-}
+(e.g.@: @samp{0-9}), followed by the input format type (e.g.@:
+@samp{F}).  For this command, columns are numbered starting from 0 at
+the left column.  Introduce the variables in the second and later
+lines of a case by a slash followed by the number of the line within
+the case, e.g.@: @samp{/2} for the second line.
+
+@subsubheading Examples
+
+@noindent
+Consider the following data on used cars:
+
+@example
+model   year    mileage price   type    age
+Civic   2002    29883   15900   Si      2
+Civic   2003    13415   15900   EX      1
+Civic   1992    107000  3800    n/a     12
+Accord  2002    26613   17900   EX      1
+@end example
+
+@noindent
+The following syntax can be used to read the used car data:
+
+@c If you change this example, change the regression test in
+@c tests/command/get-data-txt-examples.sh to match.
+@example
+GET DATA /TYPE=TXT /FILE='cars.data' /ARRANGEMENT=FIXED /FIRSTCASE=2
+        /VARIABLES=model 0-7 A
+                   year 8-15 F
+                   mileage 16-23 F
+                   price 24-31 F
+                   type 32-40 A
+                   age 40-47 F.
+@end example
+
  @node IMPORT
  @section IMPORT
  @vindex IMPORT
@@ -195,12 +637,12 @@ extension.
  @display
  MATCH FILES
          /@{FILE,TABLE@}=@{*,'file-name'@}
-        /DROP=var_list
-        /KEEP=var_list
          /RENAME=(src_names=target_names)@dots{}
          /IN=var_name
  
-        /BY var_list
+        /BY=var_list
+        /DROP=var_list
+        /KEEP=var_list
          /FIRST=var_name
          /LAST=var_name
          /MAP
@@ -208,38 +650,33 @@ MATCH FILES
  
  @cmd{MATCH FILES} merges one or more system, portable, or scratch files,
  optionally
-including the active file.  Records with the same values for BY
-variables are combined into a single record.  Records with different
+including the active file.  Cases with the same values for BY
+variables are combined into a single case.  Cases with different
  values are output in order.  Thus, multiple sorted files are
  combined into a single sorted file based on the value of the BY
  variables.  The results of the merge become the new active file.
  
-The BY subcommand specifies a list of variables that are used to match
-records from each of the files.  Variables specified must exist
-in all the files specified on FILE and TABLE.  BY should usually be
-specified.  If TABLE or IN is used then BY is required.
-
  Specify FILE with a system, portable, or scratch file as a file name
  string or file handle
  (@pxref{File Handles}), or with an asterisk (@samp{*}) to
  indicate the current active file.  The files specified on FILE are
  merged together based on the BY variables, or combined case-by-case if
-BY is not specified.  Normally at least two FILE subcommands should be
-specified.
+BY is not specified.
  
  Specify TABLE with a file to use it as a @dfn{table
-lookup file}.  Records in table lookup files are not used up after
+lookup file}.  Cases in table lookup files are not used up after
  they've been used once.  This means that data in table lookup files can
-correspond to any number of records in FILE files.  Table lookup files
+correspond to any number of cases in FILE files.  Table lookup files
  correspond to lookup tables in traditional relational database systems.
-It is incorrect to have records with duplicate BY values in table lookup
-files.
+If a table lookup file contains more than one case with a given set of
+BY variables, only the first case is used.
  
-Any number of FILE and TABLE subcommands may be specified.  Each
-instance of FILE or TABLE can be followed by any sequence of DROP,
-KEEP, or RENAME subcommands.  These have the same form and meaning as
-the corresponding subcommands of @cmd{GET} (@pxref{GET}), but apply
-only to variables in the given file.
+Any number of FILE and TABLE subcommands may be specified.
+Ordinarily, at least two FILE subcommands, or one FILE and at least
+one TABLE, should be specified.  Each instance of FILE or TABLE can be
+followed by any sequence of RENAME subcommands.  These have the same
+form and meaning as the corresponding subcommands of @cmd{GET}
+(@pxref{GET}), but apply only to variables in the given file.
  
  Each FILE or TABLE may optionally be followed by an IN subcommand,
  which creates a numeric variable with the specified name and format
@@ -247,11 +684,38 @@ F1.0.  The IN variable takes value 1 in a case if the given file
  contributed a row to the merged file, 0 otherwise.  The DROP, KEEP,
  and RENAME subcommands do not affect IN variables.
  
-Variables belonging to files that are not present for the current case
-are set to the system-missing value for numeric variables or spaces for
-string variables.
+When more than one FILE or TABLE contains a variable with a given
+name, those variables must all have the same type (numeric or string)
+and, for string variables, the same width.  This rules applies to
+variable names after renaming with RENAME; thus, RENAME can be used to
+resolve conflicts.
  
-FIRST, LAST, and MAP are currently ignored.
+FILE and TABLE must be specified at the beginning of the command, with
+any RENAME or IN specifications immediately after the corresponding
+FILE or TABLE.  These subcommands are followed by BY, DROP, KEEP,
+FIRST, LAST, and MAP.
+
+The BY subcommand specifies a list of variables that are used to match
+cases from each of the files.  When TABLE or IN is used, BY is
+required; otherwise, it is optional.  When BY is specified, all the
+files named on FILE and TABLE subcommands must be sorted in ascending
+order of the BY variables.  Variables belonging to files that are not
+present for the current case are set to the system-missing value for
+numeric variables or spaces for string variables.
+
+The DROP and KEEP subcommands allow variables to be dropped from or
+reordered within the new active file.  These subcommands have the same
+form and meaning as the corresponding subcommands of @cmd{GET}
+(@pxref{GET}).  They apply to the new active file as a whole, not to
+individual input files.  The variable names specified on DROP and KEEP
+are those after any renaming with RENAME.
+
+The optional FIRST and LAST subcommands name variables that @cmd{MATCH
+FILES} adds to the active file.  The new variables are numeric with
+print and write format F1.0.  The value of the FIRST variable is 1 in
+the first case with a given set of values for the BY variables, and 0
+in other cases.  Similarly, the LAST variable is 1 in the last case
+with a given of BY values, and 0 in other cases.
  
  @cmd{MATCH FILES} may not be specified following @cmd{TEMPORARY}
  (@pxref{TEMPORARY}) if the active file is used as an input source.
@@ -350,7 +814,7 @@ a system file and displays information on its dictionary.
  @vindex XEXPORT
  
  @display
-EXPORT
+XEXPORT
          /OUTFILE='file-name'
          /DIGITS=n
          /DROP=var_list
@@ -378,7 +842,7 @@ the data is read by a procedure or procedure-like command.
  
  @xref{EXPORT}, for more information.
  
-@node XSAVE,  , XEXPORT, System and Portable Files
+@node XSAVE
  @section XSAVE
  @vindex XSAVE