-@node Language, Expressions, Invocation, Top
+@node Language
@chapter The PSPP language
@cindex language, PSPP
@cindex PSPP, language
-@quotation
-@strong{Please note:} PSPP is not even close to completion.
-Only a few actual statistical procedures are implemented. PSPP
-is a work in progress.
-@end quotation
-
This chapter discusses elements common to many PSPP commands.
Later chapters will describe individual commands in detail.
@menu
* Tokens:: Characters combine to form tokens.
* Commands:: Tokens combine to form commands.
+* Syntax Variants:: Batch vs. interactive mode.
* Types of Commands:: Commands come in several flavors.
* Order of Commands:: Commands combine to form syntax files.
* Missing Observations:: Handling missing observations.
-* Variables:: The unit of data storage.
+* Datasets:: Data organization.
* Files:: Files used by PSPP.
+* File Handles:: How files are named.
* BNF:: How command syntax is described.
@end menu
-@node Tokens, Commands, Language, Language
+
+@node Tokens
@section Tokens
@cindex language, lexical analysis
@cindex language, tokens
@cindex tokens
@cindex lexical analysis
-@cindex lexemes
PSPP divides most syntax file lines into series of short chunks
-called @dfn{tokens}, @dfn{lexical elements}, or @dfn{lexemes}. These
-tokens are then grouped to form commands, each of which tells
+called @dfn{tokens}.
+Tokens are then grouped to form commands, each of which tells
PSPP to take some action---read in data, write out data, perform
-a statistical procedure, etc. The process of dividing input into tokens
-is @dfn{tokenization}, or @dfn{lexical analysis}. Each type of token is
+a statistical procedure, etc. Each type of token is
described below.
-@cindex delimiters
-@cindex whitespace
-Tokens must be separated from each other by @dfn{delimiters}.
-Delimiters include whitespace (spaces, tabs, carriage returns, line
-feeds, vertical tabs), punctuation (commas, forward slashes, etc.), and
-operators (plus, minus, times, divide, etc.) Note that while whitespace
-only separates tokens, other delimiters are tokens in themselves.
-
@table @strong
@cindex identifiers
@item Identifiers
-Identifiers are names that specify variable names, commands, or command
-details.
-
-@itemize @bullet
-@item
-The first character in an identifier must be a letter, @samp{#}, or
-@samp{@@}. Some system identifiers begin with @samp{$}, but
-user-defined variables' names may not begin with @samp{$}.
-
-@item
-The remaining characters in the identifier must be letters, digits, or
-one of the following special characters:
+Identifiers are names that typically specify variables, commands, or
+subcommands. The first character in an identifier must be a letter,
+@samp{#}, or @samp{@@}. The remaining characters in the identifier
+must be letters, digits, or one of the following special characters:
@example
-. _ $ # @@
+@center @. _ $ # @@
@end example
-@item
-@cindex variable names
-@cindex names, variable
-Variable names may be up any length up to 64 bytes long.
-
-
-@item
@cindex case-sensitivity
-Identifiers are not case-sensitive: @code{foobar}, @code{Foobar},
-@code{FooBar}, @code{FOOBAR}, and @code{FoObaR} are different
-representations of the same identifier.
+Identifiers may be any length, but only the first 64 bytes are
+significant. Identifiers are not case-sensitive: @code{foobar},
+@code{Foobar}, @code{FooBar}, @code{FOOBAR}, and @code{FoObaR} are
+different representations of the same identifier.
-@item
-@cindex keywords
-Identifiers other than variable names may be abbreviated to their first
-3 characters if this abbreviation is unambiguous. These identifiers are
-often called @dfn{keywords}. (Unique abbreviations of 3 or more
-characters are also accepted: @samp{FRE}, @samp{FREQ}, and
-@samp{FREQUENCIES} are equivalent when the last is a keyword.)
-
-@item
-Whether an identifier is a keyword depends on the context.
-
-@item
-@cindex keywords, reserved
-@cindex reserved keywords
-Some keywords are reserved. These keywords may not be used in any
-context besides those explicitly described in this manual. The reserved
-keywords are:
+@cindex identifiers, reserved
+@cindex reserved identifiers
+Some identifiers are reserved. Reserved identifiers may not be used
+in any context besides those explicitly described in this manual. The
+reserved identifiers are:
@example
-ALL AND BY EQ GE GT LE LT NE NOT OR TO WITH
+@center ALL AND BY EQ GE GT LE LT NE NOT OR TO WITH
@end example
-@item
-Since keywords are identifiers, all the rules for identifiers apply.
-Specifically, they must be delimited as are other identifiers:
-@code{WITH} is a reserved keyword, but @code{WITHOUT} is a valid
-variable name.
-@end itemize
+@item Keywords
+Keywords are a subclass of identifiers that form a fixed part of
+command syntax. For example, command and subcommand names are
+keywords. Keywords may be abbreviated to their first 3 characters if
+this abbreviation is unambiguous. (Unique abbreviations of 3 or more
+characters are also accepted: @samp{FRE}, @samp{FREQ}, and
+@samp{FREQUENCIES} are equivalent when the last is a keyword.)
-@cindex @samp{.}
-@cindex period
-@cindex variable names, ending with period
-@strong{Caution:} It is legal to end a variable name with a period, but
-@emph{don't do it!} The variable name will be misinterpreted when it is
-the final token on a line: @code{FOO.} will be divided into two separate
-tokens, @samp{FOO} and @samp{.}, the @dfn{terminal dot}.
-@xref{Commands, , Forming commands of tokens}.
+Reserved identifiers are always used as keywords. Other identifiers
+may be used both as keywords and as user-defined identifiers, such as
+variable names.
@item Numbers
@cindex numbers
@cindex integers
@cindex reals
-Numbers may be specified as integers or reals. Integers are internally
-converted into reals. Scientific notation is not supported. Here are
-some examples of valid numbers:
+Numbers are expressed in decimal. A decimal point is optional.
+Numbers may be expressed in scientific notation by adding @samp{e} and
+a base-10 exponent, so that @samp{1.234e3} has the value 1234. Here
+are some more examples of valid numbers:
@example
-1234 3.14159265359 .707106781185 8945.
+-5 3.14159265359 1e100 -.707 8945.
@end example
-@strong{Caution:} The last example will be interpreted as two tokens,
-@samp{8945} and @samp{.}, if it is the last token on a line.
+Negative numbers are expressed with a @samp{-} prefix. However, in
+situations where a literal @samp{-} token is expected, what appears to
+be a negative number is treated as @samp{-} followed by a positive
+number.
+
+No white space is allowed within a number token, except for horizontal
+white space between @samp{-} and the rest of the number.
+
+The last example above, @samp{8945.}, will be interpreted as two
+tokens, @samp{8945} and @samp{.}, if it is the last token on a line.
+@xref{Commands, , Forming commands of tokens}.
@item Strings
@cindex strings
@cindex @samp{'}
@cindex @samp{"}
@cindex case-sensitivity
-Strings are literal sequences of characters enclosed in pairs of single
-quotes (@samp{'}) or double quotes (@samp{"}).
-
-@itemize @bullet
-@item
-Whitespace and case of letters @emph{are} significant inside strings.
-@item
-Whitespace characters inside a string are not delimiters.
-@item
-To include single-quote characters in a string, enclose the string in
-double quotes.
-@item
-To include double-quote characters in a string, enclose the string in
-single quotes.
-@item
-It is not possible to put both single- and double-quote characters
-inside one string.
-@end itemize
-
-@item Hexstrings
-@cindex hexstrings
-Hexstrings are string variants that use hex digits to specify
-characters.
-
-@itemize @bullet
-@item
-A hexstring may be used anywhere that an ordinary string is allowed.
-
-@item
-@cindex @samp{X'}
-@cindex @samp{'}
-A hexstring begins with @samp{X'} or @samp{x'}, and ends with @samp{'}.
-
-@cindex whitespace
-@item
-No whitespace is allowed between the initial @samp{X} and @samp{'}.
-
-@item
-Double quotes @samp{"} may be used in place of single quotes @samp{'} if
-done in both places.
-
-@item
-Each pair of hex digits is internally changed into a single character
-with the given value.
-
-@item
-If there is an odd number of hex digits, the missing last digit is
-assumed to be @samp{0}.
-
-@item
-@cindex portability
-@strong{Please note:} Use of hexstrings is nonportable because the same
-numeric values are associated with different glyphs by different
-operating systems. Therefore, their use should be confined to syntax
-files that will not be widely distributed.
-
-@item
-@cindex characters, reserved
-@cindex 0
-@cindex whitespace
-@strong{Please note also:} The character with value 00 is reserved for
-internal use by PSPP. Its use in strings causes an error and
-replacement with a blank space (in ASCII, hex 20, decimal 32).
-@end itemize
-
-@item Punctuation
-@cindex punctuation
-Punctuation separates tokens; punctuators are delimiters. These are the
-punctuation characters:
-
-@example
-, / = ( )
-@end example
-
-@item Operators
+Strings are literal sequences of characters enclosed in pairs of
+single quotes (@samp{'}) or double quotes (@samp{"}). To include the
+character used for quoting in the string, double it, e.g.@:
+@samp{'it''s an apostrophe'}. White space and case of letters are
+significant inside strings.
+
+Strings can be concatenated using @samp{+}, so that @samp{"a" + 'b' +
+'c'} is equivalent to @samp{'abc'}. So that a long string may be
+broken across lines, a line break may precede or follow, or both
+precede and follow, the @samp{+}. (However, an entirely blank line
+preceding or following the @samp{+} is interpreted as ending the
+current command.)
+
+Strings may also be expressed as hexadecimal character values by
+prefixing the initial quote character by @samp{x} or @samp{X}.
+Regardless of the syntax file or active dataset's encoding, the
+hexadecimal digits in the string are interpreted as Unicode characters
+in UTF-8 encoding.
+
+Individual Unicode code points may also be expressed by specifying the
+hexadecimal code point number in single or double quotes preceded by
+@samp{u} or @samp{U}. For example, Unicode code point U+1D11E, the
+musical G clef character, could be expressed as @code{U'1D11E'}.
+Invalid Unicode code points (above U+10FFFF or in between U+D800 and
+U+DFFF) are not allowed.
+
+When strings are concatenated with @samp{+}, each segment's prefix is
+considered individually. For example, @code{'The G clef symbol is:' +
+u"1d11e" + "."} inserts a G clef symbol in the middle of an otherwise
+plain text string.
+
+@item Punctuators and Operators
+@cindex punctuators
@cindex operators
-Operators describe mathematical operations. Some operators are delimiters:
-
-@example
-( ) + - * / **
-@end example
-
-Many of the above operators are also punctuators. Punctuators are
-distinguished from operators by context.
-
-The other operators are all reserved keywords. None of these are
-delimiters:
+These tokens are the punctuators and operators:
@example
-AND EQ GE GT LE LT NE OR
+@center , / = ( ) + - * / ** < <= <> > >= ~= & | .
@end example
-@item Terminal Dot
-@cindex terminal dot
-@cindex dot, terminal
-@cindex period
-@cindex @samp{.}
-A period (@samp{.}) at the end of a line (except for whitespace) is one
-type of a @dfn{terminal dot}, although not every terminal dot is a
-period at the end of a line. @xref{Commands, , Forming commands of
-tokens}. A period is a terminal dot @emph{only}
-when it is at the end of a line; otherwise it is part of a
-floating-point number. (A period outside a number in the middle of a
-line is an error.)
-
-@quotation
-@cindex terminal dot, changing
-@cindex dot, terminal, changing
-@strong{Please note:} The character used for the @dfn{terminal dot}
-can be changed with @cmd{SET}'s ENDCMD subcommand (@pxref{SET}). This
-is strongly discouraged, and throughout all the remainder of this
-manual it will be assumed that the default setting is in effect.
-@end quotation
-
+Most of these appear within the syntax of commands, but the period
+(@samp{.}) punctuator is used only at the end of a command. It is a
+punctuator only as the last character on a line (except white space).
+When it is the last non-space character on a line, a period is not
+treated as part of another token, even if it would otherwise be part
+of, e.g.@:, an identifier or a floating-point number.
@end table
-@node Commands, Types of Commands, Tokens, Language
+@node Commands
@section Forming commands of tokens
@cindex PSPP, command structure
@cindex language, command structure
@cindex commands, structure
-Most PSPP commands share a common structure, diagrammed below:
-
-@example
-@var{cmd}@dots{} [@var{sbc}[=][@var{spec} [[,]@var{spec}]@dots{}]] [[/[=][@var{spec} [[,]@var{spec}]@dots{}]]@dots{}].
-@end example
-
-@cindex @samp{[ ]}
-In the above, rather daunting, expression, pairs of square brackets
-(@samp{[ ]}) indicate optional elements, and names such as @var{cmd}
-indicate parts of the syntax that vary from command to command.
-Ellipses (@samp{...}) indicate that the preceding part may be repeated
-an arbitrary number of times. Let's pick apart what it says above:
-
-@itemize @bullet
-@cindex commands, names
-@item
-A command begins with a command name of one or more keywords, such as
-@cmd{FREQUENCIES}, @cmd{DATA LIST}, or @cmd{N OF CASES}. @var{cmd}
-may be abbreviated to its first word if that is unambiguous; each word
-in @var{cmd} may be abbreviated to a unique prefix of three or more
-characters as described above.
-
-@cindex subcommands
-@item
-The command name may be followed by one or more @dfn{subcommands}:
-
-@itemize @minus
-@item
-Each subcommand begins with a unique keyword, indicated by @var{sbc}
-above. This is analogous to the command name.
-
-@item
-The subcommand name is optionally followed by an equals sign (@samp{=}).
-
-@item
-Some subcommands accept a series of one or more specifications
-(@var{spec}), optionally separated by commas.
-
-@item
-Each subcommand must be separated from the next (if any) by a forward
-slash (@samp{/}).
-@end itemize
-
-@cindex dot, terminal
-@cindex terminal dot
-@item
-Each command must be terminated with a @dfn{terminal dot}.
-The terminal dot may be given one of three ways:
-
-@itemize @minus
-@item
-(most commonly) A period character at the very end of a line, as
-described above.
-
-@item
-(only if NULLINE is on: @xref{SET, , Setting user preferences}, for more
-details.) A completely blank line.
-
-@item
-(in batch mode only) Any line that is not indented from the left side of
-the page causes a terminal dot to be inserted before that line.
-Therefore, each command begins with a line that is flush left, followed
-by zero or more lines that are indented one or more characters from the
-left margin.
-
-In batch mode, PSPP will ignore a plus sign, minus sign, or period
-(@samp{+}, @samp{@minus{}}, or @samp{.}) as the first character in a
-line. Any of these characters as the first character on a line will
-begin a new command. This allows for visual indentation of a command
-without that command being considered part of the previous command.
-
-PSPP is in batch mode when it is reading input from a file, rather
-than from an interactive user. Note that the other forms of the
-terminal dot may also be used in batch mode.
-
-Sometimes, one encounters syntax files that are intended to be
-interpreted in interactive mode rather than batch mode (for instance,
-this can happen if a session log file is used directly as a syntax
-file). When this occurs, use the @samp{-i} command line option to force
-interpretation in interactive mode (@pxref{Language control options}).
-@end itemize
-@end itemize
-
-PSPP ignores empty commands when they are generated by the above
-rules. Note that, as a consequence of these rules, each command must
-begin on a new line.
-
-@node Types of Commands, Order of Commands, Commands, Language
+Most PSPP commands share a common structure. A command begins with a
+command name, such as @cmd{FREQUENCIES}, @cmd{DATA LIST}, or @cmd{N OF
+CASES}. The command name may be abbreviated to its first word, and
+each word in the command name may be abbreviated to its first three
+or more characters, where these abbreviations are unambiguous.
+
+The command name may be followed by one or more @dfn{subcommands}.
+Each subcommand begins with a subcommand name, which may be
+abbreviated to its first three letters. Some subcommands accept a
+series of one or more specifications, which follow the subcommand
+name, optionally separated from it by an equals sign
+(@samp{=}). Specifications may be separated from each other
+by commas or spaces. Each subcommand must be separated from the next (if any)
+by a forward slash (@samp{/}).
+
+There are multiple ways to mark the end of a command. The most common
+way is to end the last line of the command with a period (@samp{.}) as
+described in the previous section (@pxref{Tokens}). A blank line, or
+one that consists only of white space or comments, also ends a command.
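+
+As a hedged illustration of this structure, the sketch below shows one
+@cmd{FREQUENCIES} command with three subcommands. The variable names
+@code{age} and @code{income} are hypothetical and assume an active
+dataset that already defines them.
+
+@example
+FREQUENCIES
+    /VARIABLES=age income
+    /FORMAT=NOTABLE
+    /STATISTICS=MEAN MEDIAN.
+@end example
+
+Here each subcommand name is followed by an optional equals sign and
+its specifications, subcommands are separated by forward slashes, and
+the period at the end of the last line ends the command.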
+
+@node Syntax Variants
+@section Syntax Variants
+
+@cindex Batch syntax
+@cindex Interactive syntax
+
+There are three variants of command syntax, which vary only in how
+they detect the end of one command and the start of the next.
+
+In @dfn{interactive mode}, which is the default for syntax typed at a
+command prompt, a period as the last non-blank character on a line
+ends a command. A blank line also ends a command.
+
+In @dfn{batch mode}, an end-of-line period or a blank line also ends a
+command. Additionally, batch mode treats any line that has a non-blank
+character in the leftmost column as beginning a new command. Thus, in
+batch mode the second and subsequent lines in a command must be
+indented.
+
+Regardless of the syntax mode, a plus sign, minus sign, or period in
+the leftmost column of a line is ignored and causes that line to begin
+a new command. This is most useful in batch mode, in which the first
+line of a new command could not otherwise be indented, but it is
+accepted regardless of syntax mode.
+
+The default mode for reading commands from a file is @dfn{auto mode}.
+It is the same as batch mode, except that a line with a non-blank character in
+the leftmost column starts a new command only if that line begins with
+the name of a PSPP command. This correctly interprets most valid PSPP
+syntax files regardless of the syntax mode for which they are
+intended.
+
+The @option{--interactive} (or @option{-i}) or @option{--batch} (or
+@option{-b}) options set the syntax mode for files listed on the PSPP
+command line. @xref{Main Options}, for more details.
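+
+The following sketch illustrates the batch-mode rule, assuming an
+active dataset with hypothetical variables @code{x} and @code{y}. No
+line ends with a period, yet batch mode reads two commands, because
+@cmd{LIST} starts in the leftmost column while the subcommand lines
+are indented.
+
+@example
+DESCRIPTIVES
+    /VARIABLES=x y
+LIST
+    /VARIABLES=x
+@end example
+
+Auto mode should split these lines the same way, since @code{LIST} is
+the name of a PSPP command. In interactive mode nothing ends the
+first command before @code{LIST}, so the four lines would be read as
+a single command.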
+
+@node Types of Commands
@section Types of Commands
Commands in PSPP are divided roughly into six categories:
@item File definition commands
@cindex file definition commands
Give instructions for reading data from text files or from special
-binary ``system files''. Most of these commands discard any previous
-data or variables to replace it with the new data and
-variables. At least one must appear before the first command in any of
+binary ``system files''. Most of these commands replace any previous
+data or variables with new data or
+variables. At least one file definition command must appear before the first command in any of
the categories below. @xref{Data Input and Output}.
@item Input program commands
@cindex input program commands
-Though rarely used, these provide powerful tools for reading data files
+Though rarely used, these provide tools for reading data files
in arbitrary textual or binary formats. @xref{INPUT PROGRAM}.
@item Transformations
@item Restricted transformations
@cindex restricted transformations
-Same as transformations for most purposes. @xref{Order of Commands}, for a
-detailed description of the differences.
+Transformations that cannot appear in certain contexts. @xref{Order
+of Commands}, for details.
@item Procedures
@cindex procedures
Analyze data, writing results of analyses to the listing file. Cause
transformations specified earlier in the file to be performed. In a
more general sense, a @dfn{procedure} is any command that causes the
-active file (the data) to be read.
+active dataset (the data) to be read.
@end table
-@node Order of Commands, Missing Observations, Types of Commands, Language
+@node Order of Commands
@section Order of Commands
@cindex commands, ordering
@cindex order of commands
-PSPP does not place many restrictions on ordering of commands.
-The main restriction is that variables must be defined with one of the
-file-definition commands before they are otherwise referred to.
+PSPP does not place many restrictions on ordering of commands. The
+main restriction is that variables must be defined before they are otherwise
+referenced. This section describes the details of command ordering,
+but most users will have no need to refer to them.
-Of course, there are specific rules, for those who are interested.
PSPP possesses five internal states, called initial, INPUT PROGRAM,
FILE TYPE, transformation, and procedure states. (Please note the
distinction between the @cmd{INPUT PROGRAM} and @cmd{FILE TYPE}
@emph{commands} and the INPUT PROGRAM and FILE TYPE @emph{states}.)
-PSPP starts up in the initial state. Each successful completion
+PSPP starts in the initial state. Each successful completion
of a command may cause a state transition. Each type of command has its
own rules for state transitions:
@item Utility commands
@itemize @bullet
@item
-Legal in all states.
+Valid in any state.
@item
Do not cause state transitions. Exception: when @cmd{N OF CASES}
is executed in the procedure state, it causes a transition to the
@item @cmd{DATA LIST}
@itemize @bullet
@item
-Legal in all states.
+Valid in any state.
@item
When executed in the initial or procedure state, causes a transition to
the transformation state.
@item
-Clears the active file if executed in the procedure or transformation
+Clears the active dataset if executed in the procedure or transformation
state.
@end itemize
@item
Causes a transition to the INPUT PROGRAM state.
@item
-Clears the active file.
+Clears the active dataset.
@end itemize
@item @cmd{FILE TYPE}
@item
Causes a transition to the FILE TYPE state.
@item
-Clears the active file.
+Clears the active dataset.
@end itemize
@item Other file definition commands
@item
Cause a transition to the transformation state.
@item
-Clear the active file, except for @cmd{ADD FILES}, @cmd{MATCH FILES},
+Clear the active dataset, except for @cmd{ADD FILES}, @cmd{MATCH FILES},
and @cmd{UPDATE}.
@end itemize
@end itemize
@end table
-@node Missing Observations, Variables, Order of Commands, Language
+@node Missing Observations
@section Handling missing observations
@cindex missing values
@cindex values, missing
PSPP includes special support for unknown numeric data values.
Missing observations are assigned a special value, called the
@dfn{system-missing value}. This ``value'' actually indicates the
-absence of value; it means that the actual value is unknown. Procedures
+absence of a value; it means that the actual value is unknown. Procedures
automatically exclude from analyses those observations or cases that
-have missing values. Whether single observations or entire cases are
-excluded depends on the procedure.
+have missing values. Details of missing value exclusion depend on the
+procedure and can often be controlled by the user; refer to
+descriptions of individual procedures for details.
The system-missing value exists only for numeric variables. String
variables always have a defined value, even if it is only a string of
Variables, whether numeric or string, can have designated
@dfn{user-missing values}. Every user-missing value is an actual value
for that variable. However, most of the time user-missing values are
-treated in the same way as the system-missing value. String variables
-that are wider than a certain width, usually 8 characters (depending on
-computer architecture), cannot have user-missing values.
+treated in the same way as the system-missing value.
For more information on missing values, see the following sections:
-@ref{Variables}, @ref{MISSING VALUES}, @ref{Expressions}. See also the
+@ref{Datasets}, @ref{MISSING VALUES}, @ref{Expressions}. See also the
documentation on individual procedures for information on how they
handle missing values.
-@node Variables, Files, Missing Observations, Language
-@section Variables
-@cindex variables
+@node Datasets
+@section Datasets
+@cindex dataset
+@cindex variable
@cindex dictionary
-Variables are the basic unit of data storage in PSPP. All the
-variables in a file taken together, apart from any associated data, are
-said to form a @dfn{dictionary}.
-Some details of variables are described in the sections below.
+PSPP works with data organized into @dfn{datasets}. A dataset
+consists of a set of @dfn{variables}, which taken together are said to
+form a @dfn{dictionary}, and one or more @dfn{cases}, each of which
+has one value for each variable.
+
+At any given time PSPP has exactly one distinguished dataset, called
+the @dfn{active dataset}. Most PSPP commands work only with the
+active dataset. In addition to the active dataset, PSPP also supports
+any number of additional open datasets. The @cmd{DATASET} commands
+can choose a new active dataset from among those that are open, as
+well as create and destroy datasets (@pxref{DATASET}).
+
+The sections below describe variables in more detail.
@menu
* Attributes:: Attributes of variables.
* System Variables:: Variables automatically defined by PSPP.
* Sets of Variables:: Lists of variable names.
-* Input/Output Formats:: Input and output formats.
+* Input and Output Formats:: Input and output formats.
* Scratch Variables:: Variables deleted by procedures.
@end menu
-@node Attributes, System Variables, Variables, Variables
+@node Attributes
@subsection Attributes of Variables
@cindex variables, attributes of
@cindex attributes of variables
@table @strong
@item Name
-This is an identifier. Each variable must have a different name.
+An identifier, up to 64 bytes long. Each variable must have a different name.
@xref{Tokens}.
+Some system variable names begin with @samp{$}, but user-defined
+variables' names may not begin with @samp{$}.
+
+@cindex @samp{.}
+@cindex period
+@cindex variable names, ending with period
+The final character in a variable name should not be @samp{.}, because
+such an identifier will be misinterpreted when it is the final token
+on a line: @code{FOO.} will be divided into two separate tokens,
+@samp{FOO} and @samp{.}, indicating end-of-command. @xref{Tokens}.
+
+@cindex @samp{_}
+The final character in a variable name should not be @samp{_}, because
+some such identifiers are used for special purposes by PSPP
+procedures.
+
+As with all PSPP identifiers, variable names are not case-sensitive.
+PSPP capitalizes variable names on output the same way they were
+capitalized at their point of definition in the input.
+
@cindex variables, type
@cindex type of variables
@item Type
@item Width
(string variables only) String variables with a width of 8 characters or
fewer are called @dfn{short string variables}. Short string variables
-can be used in many procedures where @dfn{long string variables} (those
+may be used in a few contexts where @dfn{long string variables} (those
with widths greater than 8) are not allowed.
-@quotation
-@strong{Please note:} Certain systems may consider strings longer than 8
-characters to be short strings. Eight characters represents a minimum
-figure for the maximum length of a short string.
-@end quotation
-
@item Position
Variables in the dictionary are arranged in a specific order.
@cmd{DISPLAY} can be used to show this order: see @ref{DISPLAY}.
Display width, format, and (for numeric variables) number of decimal
places. This attribute does not affect how data are stored, just how
they are displayed. Example: a width of 8, with 2 decimal places.
-@xref{PRINT FORMATS}.
+@xref{Input and Output Formats}.
@cindex write format
@item Write format
-Similar to print format, but used by certain commands that are
-designed to write to binary files. @xref{WRITE FORMATS}.
+Similar to print format, but used by the @cmd{WRITE} command
+(@pxref{WRITE}).
+
+@cindex custom attributes
+@item Custom attributes
+User-defined associations between names and values. @xref{VARIABLE
+ATTRIBUTE}.
@end table
-@node System Variables, Sets of Variables, Attributes, Variables
+@node System Variables
@subsection Variables Automatically Defined by PSPP
@cindex system variables
@cindex variables, system
There are seven system variables. These are not like ordinary
-variables, as they are not stored in each case. They can only be used
+variables because system variables are not always stored. They can be used only
in expressions. These system variables, whose values and output formats
cannot be modified, are described below.
@cindex @code{$TIME}
@item $TIME
-Number of seconds between midnight 14 Oct 1582 and the time the active file
+Number of seconds between midnight 14 Oct 1582 and the time the active dataset
was read, in format F20.
@cindex @code{$WIDTH}
Page width, in characters, in format F3.
@end table
-@node Sets of Variables, Input/Output Formats, System Variables, Variables
+@node Sets of Variables
@subsection Lists of variable names
-@cindex TO convention
-@cindex convention, TO
+@cindex @code{TO} convention
+@cindex convention, @code{TO}
+
+To refer to a set of variables, list their names one after another.
+Optionally, their names may be separated by commas. To include a
+range of variables from the dictionary in the list, write the name of
+the first and last variable in the range, separated by @code{TO}. For
+instance, if the dictionary contains six variables with the names
+@code{ID}, @code{X1}, @code{X2}, @code{GOAL}, @code{MET}, and
+@code{NEXTGOAL}, in that order, then @code{X2 TO MET} would include
+variables @code{X2}, @code{GOAL}, and @code{MET}.
-There are several ways to specify a set of variables:
+Commands that define variables, such as @cmd{DATA LIST}, give
+@code{TO} an alternate meaning. With these commands, @code{TO} defines
+sequences of variables whose names end in consecutive integers. The
+syntax is two identifiers that begin with the same root and end with
+numbers, separated by @code{TO}. The syntax @code{X1 TO X5} defines 5
+variables, named @code{X1}, @code{X2}, @code{X3}, @code{X4}, and
+@code{X5}. The syntax @code{ITEM0008 TO ITEM0013} defines 6
+variables, named @code{ITEM0008}, @code{ITEM0009}, @code{ITEM0010},
+@code{ITEM0011}, @code{ITEM0012}, and @code{ITEM0013}. The syntaxes
+@code{QUES001 TO QUES9} and @code{QUES6 TO QUES3} are invalid.
+
+After a set of variables has been defined with @cmd{DATA LIST} or
+another command with this method, the same set can be referenced on
+later commands using the same syntax.
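+
+As a minimal sketch of both meanings of @code{TO}, the commands below
+first define variables and then refer to a range of them. The file
+name @file{scores.txt} and the variable names are hypothetical.
+
+@example
+DATA LIST LIST FILE='scores.txt' /id x1 TO x5.
+DESCRIPTIVES /VARIABLES=x2 TO x4.
+@end example
+
+On @cmd{DATA LIST}, @code{x1 TO x5} defines @code{x1}, @code{x2},
+@code{x3}, @code{x4}, and @code{x5}. On @cmd{DESCRIPTIVES}, which
+does not define variables, @code{x2 TO x4} selects the variables from
+@code{x2} through @code{x4} in dictionary order.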
-@enumerate
-@item
-(Most commonly.) List the variable names one after another, optionally
-separating them by commas.
+@node Input and Output Formats
+@subsection Input and Output Formats
-@cindex @code{TO}
-@item
-(This method cannot be used on commands that define the dictionary, such
-as @cmd{DATA LIST}.) The syntax is the names of two existing variables,
-separated by the reserved keyword @code{TO}. The meaning is to include
-every variable in the dictionary between and including the variables
-specified. For instance, if the dictionary contains six variables with
-the names @code{ID}, @code{X1}, @code{X2}, @code{GOAL}, @code{MET}, and
-@code{NEXTGOAL}, in that order, then @code{X2 TO MET} would include
-variables @code{X2}, @code{GOAL}, and @code{MET}.
+@cindex formats
+An @dfn{input format} describes how to interpret the contents of an
+input field as a number or a string. It might specify that the field
+contains an ordinary decimal number, a time or date, a number in binary
+or hexadecimal notation, or one of several other notations. Input
+formats are used by commands such as @cmd{DATA LIST} that read data or
+syntax files into the PSPP active dataset.
+
+Every input format corresponds to a default @dfn{output format} that
+specifies the formatting used when the value is output later. It is
+always possible to explicitly specify an output format that resembles
+the input format. Usually, this is the default, but in cases where the
+input format is unfriendly to human readability, such as binary or
+hexadecimal formats, the default output format is an easier-to-read
+decimal format.
+
+Every variable has two output formats, called its @dfn{print format} and
+@dfn{write format}. Print formats are used in most output contexts;
+write formats are used only by @cmd{WRITE} (@pxref{WRITE}). Newly
+created variables have identical print and write formats, and
+@cmd{FORMATS}, the most commonly used command for changing formats
+(@pxref{FORMATS}), sets both of them to the same value as well. Thus,
+most of the time, the distinction between print and write formats is
+unimportant.
+
+Input and output formats are specified to PSPP with a @dfn{format
+specification} of the form @code{TYPEw} or @code{TYPEw.d}, where
+@code{TYPE} is one of the format types described later, @code{w} is a
+field width measured in columns, and @code{d} is an optional number of
+decimal places. If @code{d} is omitted, a value of 0 is assumed. Some
+formats do not allow a nonzero @code{d} to be specified.
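+
+For example, the following fragment (with hypothetical variable names
+@code{PRICE} and @code{SHARE}) sketches how format specifications might
+appear both as input formats on @cmd{DATA LIST} and as output formats on
+@cmd{FORMATS}, @cmd{PRINT FORMATS}, and @cmd{WRITE FORMATS}:
+
+@example
+@group
+DATA LIST LIST /PRICE (F8.2) SHARE (F8.2).
+FORMATS PRICE (DOLLAR10.2).    /* Sets print and write formats.
+PRINT FORMATS SHARE (PCT9.2).  /* Sets the print format only.
+WRITE FORMATS SHARE (F8.4).    /* Sets the write format only.
+@end group
+@end example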
+
+The following sections describe the input and output formats supported
+by PSPP.
-@item
-(This method can be used only on commands that define the dictionary,
-such as @cmd{DATA LIST}.) It is used to define sequences of variables
-that end in consecutive integers. The syntax is two identifiers that
-end in numbers. This method is best illustrated with examples:
+@menu
+* Basic Numeric Formats::
+* Custom Currency Formats::
+* Legacy Numeric Formats::
+* Binary and Hexadecimal Numeric Formats::
+* Time and Date Formats::
+* Date Component Formats::
+* String Formats::
+@end menu
+
+@node Basic Numeric Formats
+@subsubsection Basic Numeric Formats
+
+@cindex numeric formats
+The basic numeric formats are used for input and output of real numbers
+in standard or scientific notation. The following table shows an
+example of how each format displays positive and negative numbers with
+the default decimal point setting:
+
+@float
+@multitable {DOLLAR10.2} {@code{@tie{}$3,141.59}} {@code{-$3,141.59}}
+@headitem Format @tab @code{@tie{}3141.59} @tab @code{-3141.59}
+@item F8.2 @tab @code{@tie{}3141.59} @tab @code{-3141.59}
+@item COMMA9.2 @tab @code{@tie{}3,141.59} @tab @code{-3,141.59}
+@item DOT9.2 @tab @code{@tie{}3.141,59} @tab @code{-3.141,59}
+@item DOLLAR10.2 @tab @code{@tie{}$3,141.59} @tab @code{-$3,141.59}
+@item PCT9.2 @tab @code{@tie{}3141.59%} @tab @code{-3141.59%}
+@item E8.1 @tab @code{@tie{}3.1E+003} @tab @code{-3.1E+003}
+@end multitable
+@end float
+
+On output, numbers in F format are expressed in standard decimal
+notation with the requested number of decimal places. The other formats
+output some variation on this style:
@itemize @bullet
@item
-The syntax @code{X1 TO X5} defines 5 variables:
+Numbers in COMMA format are additionally grouped every three digits by
+inserting a grouping character. The grouping character is ordinarily a
+comma, but it can be changed to a period (@pxref{SET DECIMAL}).
-@itemize @minus
-@item
-X1
-@item
-X2
-@item
-X3
-@item
-X4
@item
-X5
-@end itemize
+DOT format is like COMMA format, but it interchanges the role of the
+decimal point and grouping characters. That is, the current grouping
+character is used as a decimal point and vice versa.
@item
-The syntax @code{ITEM0008 TO ITEM0013} defines 6 variables:
+DOLLAR format is like COMMA format, but it prefixes the number with
+@samp{$}.
-@itemize @minus
-@item
-ITEM0008
-@item
-ITEM0009
-@item
-ITEM0010
-@item
-ITEM0011
-@item
-ITEM0012
@item
-ITEM0013
-@end itemize
+PCT format is like F format, but adds @samp{%} after the number.
@item
-Each of the syntaxes @code{QUES001 TO QUES9} and @code{QUES6 TO QUES3}
-are invalid, although for different reasons, which should be evident.
+The E format always produces output in scientific notation.
@end itemize
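+
+As a rough sketch, the positive column of the table above could be
+reproduced with syntax along these lines, assuming the default decimal
+point setting.  The variable names @code{V1} through @code{V6} are
+hypothetical:
+
+@example
+@group
+DATA LIST LIST /V1 TO V6.
+BEGIN DATA.
+3141.59 3141.59 3141.59 3141.59 3141.59 3141.59
+END DATA.
+FORMATS V1 (F8.2) V2 (COMMA9.2) V3 (DOT9.2)
+        V4 (DOLLAR10.2) V5 (PCT9.2) V6 (E8.1).
+LIST.
+@end group
+@end example
+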
-Note that after a set of variables has been defined with @cmd{DATA LIST}
-or another command with this method, the same set can be referenced on
-later commands using the same syntax.
-
-@item
-The above methods can be combined, either one after another or delimited
-by commas. For instance, the combined syntax @code{A Q5 TO Q8 X TO Z}
-is legal as long as each part @code{A}, @code{Q5 TO Q8}, @code{X TO Z}
-is individually legal.
-@end enumerate
+On input, the basic numeric formats accept positive and negative numbers in
+standard decimal notation or scientific notation. Leading and trailing
+spaces are allowed. An empty or all-spaces field, or one that contains
+only a single period, is treated as the system missing value.
-@node Input/Output Formats, Scratch Variables, Sets of Variables, Variables
-@subsection Input and Output Formats
+In scientific notation, the exponent may be introduced by a sign
+(@samp{+} or @samp{-}), or by one of the letters @samp{e} or @samp{d}
+(in uppercase or lowercase), or by a letter followed by a sign. A
+single space may follow the letter or the sign or both.
-Data that PSPP inputs and outputs must have one of a number of formats.
-These formats are described, in general, by a format specification of
-the form @code{NAMEw.d}, where @var{name} is the
-format name and @var{w} is a field width. @var{d} is the optional
-desired number of decimal places, if appropriate. If @var{d} is not
-included then it is assumed to be 0. Some formats do not allow @var{d}
-to be specified.
-
-When an input format is specified on @cmd{DATA LIST} or another
-command, then
-it is converted to an output format for the purposes of @cmd{PRINT}
-and other
-data output commands. For most purposes, input and output formats are
-the same; the salient differences are described below.
-
-Below are listed the input and output formats supported by PSPP. If an
-input format is mapped to a different output format by default, then
-that mapping is indicated with @result{}. Each format has the listed
-bounds on input width (iw) and output width (ow).
-
-The standard numeric input and output formats are given in the following
-table:
+On fixed-format @cmd{DATA LIST} (@pxref{DATA LIST FIXED}) and in a few
+other contexts, decimals are implied when the field does not contain a
+decimal point. In F6.5 format, for example, the field @code{314159} is
+taken as the value 3.14159 with implied decimals. Decimals are never
+implied if an explicit decimal point is present or if scientific
+notation is used.
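+
+The following sketch illustrates implied decimals; the variable name
+@code{X} and the data are made up for the example.  Following the rules
+above, the first record should be read as 3.14159 (decimals implied),
+the second as 3.1415 (explicit decimal point), and the third as 310
+(scientific notation).  The wider output format is set only to make the
+values easier to compare:
+
+@example
+@group
+DATA LIST FIXED /X (F6.5).
+BEGIN DATA.
+314159
+3.1415
+3.1E+2
+END DATA.
+FORMATS X (F10.5).
+LIST.
+@end group
+@end example
+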
-@table @asis
-@item Fw.d: 1 <= iw,ow <= 40
-Standard decimal format with @var{d} decimal places. If the number is
-too large to fit within the field width, it is expressed in scientific
-notation (@code{1.2+34}) if w >= 6, with always at least two digits in
-the exponent. When used as an input format, scientific notation is
-allowed but an E or an F must be used to introduce the exponent.
-
-The default output format is the same as the input format, except if
-@var{d} > 1. In that case the output @var{w} is always made to be at
-least 2 + @var{d}.
+E and F formats accept the basic syntax already described. The other
+formats allow some additional variations:
-@item Ew.d: 1 <= iw <= 40; 6 <= ow <= 40
-For input this is equivalent to F format except that no E or F is
-require to introduce the exponent. For output, produces scientific
-notation in the form @code{1.2+34}. There are always at least two
-digits given in the exponent.
+@itemize @bullet
+@item
+COMMA, DOLLAR, and DOT formats ignore grouping characters within the
+integer part of the input field. The identity of the grouping
+character depends on the format.
-The default output @var{w} is the largest of the input @var{w}, the
-input @var{d} + 7, and 10. The default output @var{d} is the input
-@var{d}, but at least 3.
+@item
+DOLLAR format allows a dollar sign to precede the number. In a negative
+number, the dollar sign may precede or follow the minus sign.
-@item COMMAw.d: 1 <= iw,ow <= 40
-Equivalent to F format, except that groups of three digits are
-comma-separated on output. If the number is too large to express in the
-field width, then first commas are eliminated, then if there is still
-not enough space the number is expressed in scientific notation given
-that w >= 6. Commas are allowed and ignored when this is used as an
-input format.
+@item
+PCT format allows a percent sign to follow the number.
+@end itemize
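+
+As a sketch of these variations, the fixed-format input below (with
+hypothetical variables @code{A}, @code{B}, and @code{C}) places a
+grouped number in columns 1 to 9, a negative dollar amount in columns
+10 to 20, and a percentage in columns 21 to 28.  Under the default
+settings, the fields should be read as 3141.59, -3141.59, and 3141.59:
+
+@example
+@group
+DATA LIST FIXED /A (COMMA9.2) B (DOLLAR11.2) C (PCT8.2).
+BEGIN DATA.
+3,141.59 -$3,141.59 3141.59%
+END DATA.
+LIST.
+@end group
+@end example
+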
-@item DOTw.d: 1 <= iw,ow <= 40
-Equivalent to COMMA format except that the roles of comma and decimal
-point are interchanged. However: If SET /DECIMAL=DOT is in effect, then
-COMMA uses @samp{,} for a decimal point and DOT uses @samp{.} for a
-decimal point.
+All of the basic number formats have a maximum field width of 40 and
+accept no more than 16 decimal places, on both input and output. Some
+additional restrictions apply:
-@item DOLLARw.d: 1 <= iw <= 40; 2 <= ow <= 40
-Equivalent to COMMA format, except that the number is prefixed by a
-dollar sign (@samp{$}) if there is room. On input the value is allowed
-to be prefixed by a dollar sign, which is ignored.
+@itemize @bullet
+@item
+As input formats, the basic numeric formats allow no more decimal places
+than the field width. As output formats, the field width must be
+greater than the number of decimal places; that is, large enough to
+allow for a decimal point and the number of requested decimal places.
+DOLLAR and PCT formats must allow an additional column for @samp{$} or
+@samp{%}.
-The default output @var{w} is the input @var{w}, but at least 2.
+@item
+The default output format for a given input format increases the field
+width enough to make room for optional input characters. If an input
+format calls for decimal places, the width is increased by 1 to make
+room for an implied decimal point. COMMA, DOT, and DOLLAR formats also
+increase the output width to make room for grouping characters. DOLLAR
+and PCT further increase the output field width by 1 to make room for
+@samp{$} or @samp{%}. The increased output width is capped at 40, the
+maximum field width.
-@item PCTw.d: 2 <= iw,ow <= 40
-Equivalent to F format, except that the number is suffixed by a percent
-sign (@samp{%}) if there is room. On input the value is allowed to be
-suffixed by a percent sign, which is ignored.
+@item
+The E format is exceptional. For output, E format has a minimum width
+of 7 plus the number of decimal places. The default output format for
+an E input format is an E format with at least 3 decimal places and
+thus a minimum width of 10.
+@end itemize
-The default output @var{w} is the input @var{w}, but at least 2.
+More details of basic numeric output formatting are given below:
-@item Nw.d: 1 <= iw,ow <= 40
-Only digits are allowed within the field width. The decimal point is
-assumed to be @var{d} digits from the right margin.
+@itemize @bullet
+@item
+Output rounds to nearest, with ties rounded away from zero. Thus, 2.5
+is output as @code{3} in F1.0 format, and -1.125 as @code{-1.13} in F5.2
+format.
-The default output format is F with the same @var{w} and @var{d}, except
-if @var{d} > 1. In that case the output @var{w} is always made to be at
-least 2 + @var{d}.
+@item
+The system-missing value is output as a period in a field of spaces,
+placed in the decimal point's position, or in the rightmost column if no
+decimal places are requested. A period is used even if the decimal
+point character is a comma.
-@item Zw.d @result{} F: 1 <= iw,ow <= 40
-Zoned decimal input. If you need to use this then you know how.
+@item
+A number that does not fill its field is right-justified within the
+field.
-@item IBw.d @result{} F: 1 <= iw,ow <= 8
-Integer binary format. The field is interpreted as a fixed-point
-positive or negative binary number in two's-complement notation. The
-location of the decimal point is implied. Endianness is the same as the
-host machine.
+@item
+A number that is too large for its field causes decimal places to be dropped
+to make room. If dropping decimals does not make enough room,
+scientific notation is used if the field is wide enough. If a number
+does not fit in the field, even in scientific notation, the overflow is
+indicated by filling the field with asterisks (@samp{*}).
-The default output format is F8.2 if @var{d} is 0. Otherwise it is F,
-with output @var{w} as 9 + input @var{d} and output @var{d} as input
-@var{d}.
+@item
+COMMA, DOT, and DOLLAR formats insert grouping characters only if space
+is available for all of them. Grouping characters are never inserted
+when all decimal places must be dropped. Thus, 1234.56 in COMMA5.2
+format is output as @samp{@tie{}1235} without a comma, even though there
+is room for one, because all decimal places were dropped.
-@item PIB @result{} F: 1 <= iw,ow <= 8
-Positive integer binary format. The field is interpreted as a
-fixed-point positive binary number. The location of the decimal point
-is implied. Endianness is teh same as the host machine.
+@item
+DOLLAR or PCT format drops the @samp{$} or @samp{%} only if the number
+would not fit at all without it. Scientific notation with @samp{$} or
+@samp{%} is preferred to ordinary decimal notation without it.
-The default output format follows the rules for IB format.
+@item
+Except in scientific notation, a decimal point is included only when
+it is followed by a digit. If the integer part of the number being
+output is 0, and a decimal point is included, then the zero before the
+decimal point is dropped.
-@item Pw.d @result{} F: 1 <= iw,ow <= 16
-Binary coded decimal format. Each byte from left to right, except the
-rightmost, represents two digits. The upper nibble of each byte is more
-significant. The upper nibble of the final byte is the least
-significant digit. The lower nibble of the final byte is the sign; a
-value of D represents a negative sign and all other values are
-considered positive. The decimal point is implied.
+In scientific notation, the number always includes a decimal point,
+even if it is not followed by a digit.
-The default output format follows the rules for IB format.
+@item
+A negative number includes a minus sign only in the presence of a
+nonzero digit: -0.01 is output as @samp{-.01} in F4.2 format but as
+@samp{@tie{}@tie{}.0} in F4.1 format. Thus, a ``negative zero'' never
+includes a minus sign.
-@item PKw.d @result{} F: 1 <= iw,ow <= 16
-Positive binary code decimal format. Same as P but the last byte is the
-same as the others.
+@item
+In negative numbers output in DOLLAR format, the dollar sign follows the
+negative sign. Thus, -9.99 in DOLLAR6.2 format is output as
+@code{-$9.99}.
-The default output format follows the rules for IB format.
+@item
+In scientific notation, the exponent is output as @samp{E} followed by
+@samp{+} or @samp{-} and exactly three digits. Numbers with magnitude
+less than 10**-999 or larger than 10**999 are not supported by most
+computers, but if they are supported then their output is considered
+to overflow the field and will be output as asterisks.
-@item RBw @result{} F: 2 <= iw,ow <= 8
+@item
+On most computers, no more than 15 decimal digits are significant in
+output, even if more are printed. In any case, output precision cannot
+be any higher than input precision; few data sets are accurate to 15
+digits of precision. Unavoidable loss of precision in intermediate
+calculations may also reduce precision of output.
-Binary C architecture-dependent ``double'' format. For a standard
-IEEE754 implementation @var{w} should be 8.
+@item
+Special values such as infinities and ``not a number'' values are
+usually converted to the system-missing value before printing. In a few
+circumstances, these values are output directly. In fields of width 3
+or greater, special values are output as however many characters will
+fit from @code{+Infinity} or @code{-Infinity} for infinities, from
+@code{NaN} for ``not a number,'' or from @code{Unknown} for other values
+(if any are supported by the system). In fields under 3 columns wide,
+special values are output as asterisks.
+@end itemize
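+
+Several of the rules above can be tried with a sketch like the
+following, in which the variable names are hypothetical.  According to
+the examples already given, the four values should be formatted as
+@samp{3}, @samp{@tie{}1235}, @samp{-$9.99}, and
+@samp{@tie{}@tie{}.0}, respectively:
+
+@example
+@group
+DATA LIST LIST /A B C D.
+BEGIN DATA.
+2.5 1234.56 -9.99 -0.01
+END DATA.
+FORMATS A (F1.0) B (COMMA5.2) C (DOLLAR6.2) D (F4.1).
+LIST.
+@end group
+@end example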
-The default output format follows the rules for IB format.
+@node Custom Currency Formats
+@subsubsection Custom Currency Formats
+
+@cindex currency formats
+The custom currency formats are closely related to the basic numeric
+formats, but they allow users to customize the output format. The
+SET command configures custom currency formats, using the syntax
+@display
+SET CC@var{x}=@t{"}@var{string}@t{"}.
+@end display
+@noindent
+where @var{x} is A, B, C, D, or E, and @var{string} is no more than 16
+characters long.
+
+@var{string} must contain exactly three commas or exactly three periods
+(but not both), except that a single quote character may be used to
+``escape'' a following comma, period, or single quote. If three commas
+are used, commas will be used for grouping in output, and a period will
+be used as the decimal point.  Use of periods reverses these roles.
+
+The commas or periods divide @var{string} into four fields, called the
+@dfn{negative prefix}, @dfn{prefix}, @dfn{suffix}, and @dfn{negative
+suffix}, respectively. The prefix and suffix are added to output
+whenever space is available. The negative prefix and negative suffix
+are always added to a negative number when the output includes a nonzero
+digit.
+
+The following syntax shows how custom currency formats could be used to
+reproduce basic numeric formats:
-@item PIBHEXw.d @result{} F: 2 <= iw,ow <= 16
-PIB format encoded as textual hex digit pairs. @var{w} must be even.
+@example
+@group
+SET CCA="-,,,". /* Same as COMMA.
+SET CCB="-...". /* Same as DOT.
+SET CCC="-,$,,". /* Same as DOLLAR.
+SET CCD="-,,%,". /* Like PCT, but groups with commas.
+@end group
+@end example
-The input width is mapped to a default output width as follows:
-2@result{}4, 4@result{}6, 6@result{}9, 8@result{}11, 10@result{}14,
-12@result{}16, 14@result{}18, 16@result{}21. No allowances are made for
-decimal places.
+Here are some more examples of custom currency formats. The final
+example shows how to use a single quote to escape a delimiter:
-@item RBHEXw @result{} F: 4 <= iw,ow <= 16
+@example
+@group
+SET CCA=",EUR,,-". /* Euro.
+SET CCB="(,USD ,,)". /* US dollar.
+SET CCC="-.R$..". /* Brazilian real.
+SET CCD="-,, NIS,". /* Israel shekel.
+SET CCE="-.Rp'. ..". /* Indonesia Rupiah.
+@end group
+@end example
-RB format encoded as textual hex digits pairs. @var{w} must be even.
+@noindent These formats would yield the following output:
-The default output format is F8.2.
+@float
+@multitable {CCD13.2} {@code{@tie{}@tie{}USD 3,145.59}} {@code{(USD 3,145.59)}}
+@headitem Format @tab @code{@tie{}3145.59} @tab @code{-3145.59}
+@item CCA12.2 @tab @code{@tie{}EUR3,145.59} @tab @code{EUR3,145.59-}
+@item CCB14.2 @tab @code{@tie{}@tie{}USD 3,145.59} @tab @code{(USD 3,145.59)}
+@item CCC11.2 @tab @code{@tie{}R$3.145,59} @tab @code{-R$3.145,59}
+@item CCD13.2 @tab @code{@tie{}3,145.59 NIS} @tab @code{-3,145.59 NIS}
+@item CCE10.0 @tab @code{@tie{}Rp. 3.146} @tab @code{-Rp. 3.146}
+@end multitable
+@end float
-@item CCAw.d: 1 <= ow <= 40
-@itemx CCBw.d: 1 <= ow <= 40
-@itemx CCCw.d: 1 <= ow <= 40
-@itemx CCDw.d: 1 <= ow <= 40
-@itemx CCEw.d: 1 <= ow <= 40
+The default for all the custom currency formats is @samp{-,,,},
+equivalent to COMMA format.
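+
+To apply a custom currency format to a variable, it can be assigned
+like any other output format.  The sketch below uses a hypothetical
+variable @code{AMOUNT} with the Euro definition shown earlier; per the
+table above, the two cases should be displayed as
+@samp{@tie{}EUR3,145.59} and @samp{EUR3,145.59-}:
+
+@example
+@group
+SET CCA=",EUR,,-".
+DATA LIST LIST /AMOUNT.
+BEGIN DATA.
+3145.59
+-3145.59
+END DATA.
+FORMATS AMOUNT (CCA12.2).
+LIST.
+@end group
+@end example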
-User-defined custom currency formats. May not be used as an input
-format. @xref{SET}, for more details.
-@end table
+@node Legacy Numeric Formats
+@subsubsection Legacy Numeric Formats
-The date and time numeric input and output formats accept a number of
-possible formats. Before describing the formats themselves, some
-definitions of the elements that make up their formats will be helpful:
+The N and Z numeric formats provide compatibility with legacy file
+formats. They have much in common:
-@table @dfn
-@item leader
-All formats accept an optional whitespace leader.
+@itemize @bullet
+@item
+Output is rounded to the nearest representable value, with ties rounded
+away from zero.
-@item day
-An integer between 1 and 31 representing the day of month.
+@item
+Numbers too large to display are output as a field filled with asterisks
+(@samp{*}).
-@item day-count
-An integer representing a number of days.
+@item
+The decimal point is always implicitly the specified number of digits
+from the right edge of the field, except that Z format input allows an
+explicit decimal point.
-@item date-delimiter
-One or more characters of whitespace or the following characters:
-@code{- / . ,}
+@item
+Scientific notation may not be used.
-@item month
-A month name in one of the following forms:
-@itemize @bullet
@item
-An integer between 1 and 12.
+The system-missing value is output as a period in a field of spaces.
+The period is placed just to the right of the implied decimal point in
+Z format, or at the right end in N format or in Z format if no decimal
+places are requested. A period is used even if the decimal point
+character is a comma.
+
@item
-Roman numerals representing an integer between 1 and 12.
+Field width may range from 1 to 40. Decimal places may range from 0 up
+to the field width, to a maximum of 16.
+
@item
-At least the first three characters of an English month name (January,
-February, @dots{}).
+When a legacy numeric format used for input is converted to an output
+format, it is changed into the equivalent F format. The field width is
+increased by 1 if any decimal places are specified, to make room for a
+decimal point. For Z format, the field width is increased by 1 more
+column, to make room for a negative sign. The output field width is
+capped at 40 columns.
@end itemize
-@item year
-An integer year number between 1582 and 19999, or between 1 and 199.
-Years between 1 and 199 will have 1900 added.
+@subsubheading N Format
+
+The N format supports input and output of fields that contain only
+digits. On input, leading or trailing spaces, a decimal point, or any
+other non-digit character causes the field to be read as the
+system-missing value. As a special exception, an N format used on
+@cmd{DATA LIST FREE} or @cmd{DATA LIST LIST} is treated as the
+equivalent F format.
+
+On output, N pads the field on the left with zeros. Negative numbers
+are output like the system-missing value.
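+
+For instance, in the following sketch (the variable name @code{ID} is
+hypothetical), the first case should be output as @samp{00042} and the
+second like the system-missing value, because it is negative:
+
+@example
+@group
+DATA LIST LIST /ID.
+BEGIN DATA.
+42
+-42
+END DATA.
+FORMATS ID (N5).
+LIST.
+@end group
+@end example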
+
+@subsubheading Z Format
+
+The Z format is a ``zoned decimal'' format used on IBM mainframes. Z
+format encodes the sign as part of the final digit, which must be one of
+the following:
+@example
+0123456789
+@{ABCDEFGHI
+@}JKLMNOPQR
+@end example
+@noindent
+where the characters in each row represent digits 0 through 9 in order.
+Characters in the first two rows indicate a positive sign; those in the
+third indicate a negative sign.
+
+On output, Z fields are padded on the left with spaces. On input,
+leading and trailing spaces are ignored. Any character in an input
+field other than spaces, the digit characters above, and @samp{.} causes
+the field to be read as system-missing.
+
+The decimal point character for input and output is always @samp{.},
+even if the decimal point character is a comma (@pxref{SET DECIMAL}).
+
+Nonzero, negative values output in Z format are marked as negative even
+when no nonzero digits are output. For example, -0.2 is output in Z1.0
+format as @samp{J}. The ``negative zero'' value supported by most
+machines is output as positive.
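+
+As a sketch of how the final character encodes the sign, the three
+records below (read into a hypothetical variable @code{X}) should be
+interpreted as 123, 123, and -123, respectively, because @samp{C}
+stands for a positive 3 and @samp{L} for a negative 3 in the table
+above:
+
+@example
+@group
+DATA LIST FIXED /X (Z3).
+BEGIN DATA.
+123
+12C
+12L
+END DATA.
+LIST.
+@end group
+@end example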
+
+@node Binary and Hexadecimal Numeric Formats
+@subsubsection Binary and Hexadecimal Numeric Formats
+
+@cindex binary formats
+@cindex hexadecimal formats
+The binary and hexadecimal formats are primarily designed for
+compatibility with existing machine formats, not for human readability.
+All of them therefore have an F format as default output format.  Some of
+these formats are only portable between machines with compatible byte
+ordering (endianness) or floating-point format.
+
+Binary formats use byte values that in text files are interpreted as
+special control functions, such as carriage return and line feed. Thus,
+data in binary formats should not be included in syntax files or read
+from data files with variable-length records, such as ordinary text
+files. They may be read from or written to data files with fixed-length
+records. @xref{FILE HANDLE}, for information on working with
+fixed-length records.
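+
+A minimal sketch of reading binary data from fixed-length records
+follows.  The handle name @code{BINDATA}, the file name
+@file{values.bin}, the record layout (an 8-byte RB field followed by a
+2-byte PIB field), and the variable names are all assumptions made for
+the example:
+
+@example
+@group
+FILE HANDLE BINDATA /NAME='values.bin' /MODE=IMAGE /LRECL=10.
+DATA LIST FIXED FILE=BINDATA /X (RB8) N (PIB2).
+LIST.
+@end group
+@end example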
+
+@subsubheading P and PK Formats
+
+These are binary-coded decimal formats, in which every byte (except the
+last, in P format) represents two decimal digits. The most-significant
+4 bits of the first byte are the most-significant decimal digit, the
+least-significant 4 bits of the first byte are the next decimal digit,
+and so on.
+
+In P format, the most-significant 4 bits of the last byte are the
+least-significant decimal digit. The least-significant 4 bits represent
+the sign: decimal 13 indicates a negative value, decimal 15 indicates a
+positive value.
+
+Numbers are rounded downward on output. The system-missing value and
+numbers outside representable range are output as zero.
+
+The maximum field width is 16. Decimal places may range from 0 up to
+the number of decimal digits represented by the field.
+
+The default output format is an F format with twice the input field
+width, plus one column for a decimal point (if decimal places were
+requested).
+
+@subsubheading IB and PIB Formats
+
+These are integer binary formats. IB reads and writes 2's complement
+binary integers, and PIB reads and writes unsigned binary integers. The
+byte ordering is by default the host machine's, but SET RIB may be used
+to select a specific byte ordering for reading (@pxref{SET RIB}) and
+SET WIB, similarly, for writing (@pxref{SET WIB}).
+
+The maximum field width is 8. Decimal places may range from 0 up to the
+number of decimal digits in the largest value representable in the field
+width.
+
+The default output format is an F format whose width is the number of
+decimal digits in the largest value representable in the field width,
+plus 1 if the format has decimal places.
+
+@subsubheading RB Format
+
+This is a binary format for real numbers. By default it reads and
+writes the host machine's floating-point format, but SET RRB may be
+used to select an alternate floating-point format for reading
+(@pxref{SET RRB}) and SET WRB, similarly, for writing (@pxref{SET
+WRB}).
+
+The recommended field width depends on the floating-point format.
+NATIVE (the default format), IDL, IDB, VD, VG, and ZL formats should use
+a field width of 8. ISL, ISB, VF, and ZS formats should use a field
+width of 4. Other field widths will not produce useful results. The
+maximum field width is 8. No decimal places may be specified.
-@item julian
-A single number with a year number in the first 2, 3, or 4 digits (as
-above) and the day number within the year in the last 3 digits.
+The default output format is F8.2.
-@item quarter
-An integer between 1 and 4 representing a quarter.
+@subsubheading PIBHEX and RBHEX Formats
+
+These are hexadecimal formats, for reading and writing binary formats
+where each byte has been recoded as a pair of hexadecimal digits.
+
+A hexadecimal field consists solely of hexadecimal digits
+@samp{0}@dots{}@samp{9} and @samp{A}@dots{}@samp{F}. Uppercase and
+lowercase are accepted on input; output is in uppercase.
+
+Other than the hexadecimal representation, these formats are equivalent
+to PIB and RB formats, respectively. However, bytes in PIBHEX format
+are always ordered with the most-significant byte first (big-endian
+order), regardless of the host machine's native byte order or PSPP
+settings.
+
+Field widths must be even and between 2 and 16. RBHEX format allows no
+decimal places; PIBHEX allows as many decimal places as a PIB format
+with half the given width.
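+
+As an illustration of big-endian PIBHEX input, the sketch below reads a
+hypothetical variable @code{X} from 4-digit hexadecimal fields.  The
+three records should be read as 255, 256, and 255, since hexadecimal
+@samp{00FF} is 255, @samp{0100} is 256, and lowercase digits are
+accepted on input:
+
+@example
+@group
+DATA LIST FIXED /X (PIBHEX4).
+BEGIN DATA.
+00FF
+0100
+00ff
+END DATA.
+LIST.
+@end group
+@end example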
+
+@node Time and Date Formats
+@subsubsection Time and Date Formats
+
+@cindex time formats
+@cindex date formats
+In PSPP, a @dfn{time} is an interval. The time formats translate
+between human-friendly descriptions of time intervals and PSPP's
+internal representation of time intervals, which is simply the number of
+seconds in the interval. PSPP has two time formats:
+
+@float
+@multitable {Time Format} {@code{dd-mmm-yyyy HH:MM:SS.ss}} {@code{01-OCT-1978 04:31:17.01}}
+@headitem Time Format @tab Template @tab Example
+@item TIME @tab @code{hh:MM:SS.ss} @tab @code{04:31:17.01}
+@item DTIME @tab @code{DD HH:MM:SS.ss} @tab @code{00 04:31:17.01}
+@end multitable
+@end float
+
+A @dfn{date} is a moment in the past or the future. Internally, PSPP
+represents a date as the number of seconds since the @dfn{epoch},
+midnight, Oct. 14, 1582. The date formats translate between
+human-readable dates and PSPP's numeric representation of dates and
+times. PSPP has several date formats:
+
+@float
+@multitable {Date Format} {@code{dd-mmm-yyyy HH:MM:SS.ss}} {@code{01-OCT-1978 04:31:17.01}}
+@headitem Date Format @tab Template @tab Example
+@item DATE @tab @code{dd-mmm-yyyy} @tab @code{01-OCT-1978}
+@item ADATE @tab @code{mm/dd/yyyy} @tab @code{10/01/1978}
+@item EDATE @tab @code{dd.mm.yyyy} @tab @code{01.10.1978}
+@item JDATE @tab @code{yyyyjjj} @tab @code{1978274}
+@item SDATE @tab @code{yyyy/mm/dd} @tab @code{1978/10/01}
+@item QYR @tab @code{q Q yyyy} @tab @code{3 Q 1978}
+@item MOYR @tab @code{mmm yyyy} @tab @code{OCT 1978}
+@item WKYR @tab @code{ww WK yyyy} @tab @code{40 WK 1978}
+@item DATETIME @tab @code{dd-mmm-yyyy HH:MM:SS.ss} @tab @code{01-OCT-1978 04:31:17.01}
+@end multitable
+@end float
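+
+Because a date is just a number of seconds, the same value can be
+displayed either as a date or as a plain number.  The following sketch
+assumes hypothetical variable names and uses the @code{DATE.DMY}
+function to build the date:
+
+@example
+@group
+DATA LIST LIST /DAY MONTH YEAR.
+BEGIN DATA.
+1 10 1978
+END DATA.
+COMPUTE D = DATE.DMY(DAY, MONTH, YEAR).
+COMPUTE RAW = D.
+FORMATS D (DATE11) RAW (F12.0).
+LIST /VARIABLES=D RAW.
+@end group
+@end example
+
+@noindent
+Here @code{D} should appear as @code{01-OCT-1978}, and @code{RAW}
+as the count of seconds since the epoch.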
+
+The templates in the preceding tables describe how the time and date
+formats are input and output:
-@item q-delimiter
-The letter @samp{Q} or @samp{q}.
+@table @code
+@item dd
+Day of month, from 1 to 31. Always output as two digits.
+
+@item mm
+@itemx mmm
+Month. In output, @code{mm} is output as two digits, @code{mmm} as the
+first three letters of an English month name (January, February,
+@dots{}). In input, both of these formats, plus Roman numerals, are
+accepted.
+
+@item yyyy
+Year. In output, DATETIME always produces a 4-digit year; other
+formats can produce a 2- or 4-digit year. The century assumed for
+2-digit years depends on the EPOCH setting (@pxref{SET EPOCH}). In
+output, a year outside the epoch causes the whole field to be filled
+with asterisks (@samp{*}).
+
+@item jjj
+Day of year (Julian day), from 1 to 366. This is exactly three digits
+giving the count of days from the start of the year. January 1 is
+considered day 1.
+
+@item q
+Quarter of year, from 1 to 4. Quarters start on January 1, April 1,
+July 1, and October 1.
+
+@item ww
+Week of year, from 1 to 53. Output as exactly two digits. January 1 is
+the first day of week 1.
+
+@item DD
+Count of days, which may be positive or negative. Output as at least
+two digits.
+
+@item hh
+Count of hours, which may be positive or negative. Output as at least
+two digits.
+
+@item HH
+Hour of day, from 0 to 23. Output as exactly two digits.
+
+@item MM
+Minute of hour, from 0 to 59. Output as exactly two digits.
+
+@item SS.ss
+Seconds within minute, from 0 to 59. The integer part is output as
+exactly two digits. On output, seconds and fractional seconds may or
+may not be included, depending on field width and decimal places. On
+input, seconds and fractional seconds are optional. The DECIMAL setting
+controls the character accepted and displayed as the decimal point
+(@pxref{SET DECIMAL}).
+@end table
-@item week
-An integer between 1 and 53 representing a week within a year.
+For output, the date and time formats use the delimiters indicated in
+the table. For input, date components may be separated by spaces or by
+one of the characters @samp{-}, @samp{/}, @samp{.}, or @samp{,}, and
+time components may be separated by spaces, @samp{:}, or @samp{.}. On
+input, the @samp{Q} separating quarter from year and the @samp{WK}
+separating week from year may be uppercase or lowercase, and the spaces
+around them are optional.
+
+On input, all time and date formats accept any amount of leading and
+trailing white space.
+
+The maximum width for time and date formats is 40 columns. The minimum
+input and output widths for each of the time and date formats are shown
+below:
+
+@float
+@multitable {DATETIME} {Min. Input Width} {Min. Output Width} {4-digit year}
+@headitem Format @tab Min. Input Width @tab Min. Output Width @tab Option
+@item DATE @tab 8 @tab 9 @tab 4-digit year
+@item ADATE @tab 8 @tab 8 @tab 4-digit year
+@item EDATE @tab 8 @tab 8 @tab 4-digit year
+@item JDATE @tab 5 @tab 5 @tab 4-digit year
+@item SDATE @tab 8 @tab 8 @tab 4-digit year
+@item QYR @tab 4 @tab 6 @tab 4-digit year
+@item MOYR @tab 6 @tab 6 @tab 4-digit year
+@item WKYR @tab 6 @tab 8 @tab 4-digit year
+@item DATETIME @tab 17 @tab 17 @tab seconds
+@item TIME @tab 5 @tab 5 @tab seconds
+@item DTIME @tab 8 @tab 8 @tab seconds
+@end multitable
+@end float
+@noindent
+In the table, ``Option'' describes what increased output width enables:
-@item wk-delimiter
-The letters @samp{wk} in any case.
+@table @asis
+@item 4-digit year
+A field 2 columns wider than the minimum will include a 4-digit year.
+(DATETIME format always includes a 4-digit year.)
+
+@item seconds
+A field 3 columns wider than the minimum will include seconds as well
+as minutes. A field at least 5 columns wider than the minimum can also
+include a decimal point and fractional seconds (but no more than allowed
+by the format's decimal places).
+@end table
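+
+As an illustration of the ``4-digit year'' option, the following sketch
+reads one date and lists it at two widths. The variable name @code{d}
+and the data value are made up for this example; with DATE format, the
+9-column field should show a 2-digit year and the 11-column field a
+4-digit year.
+
+@example
+DATA LIST LIST /d (DATE11).
+BEGIN DATA.
+31-DEC-2005
+END DATA.
+FORMATS d (DATE9).
+LIST.
+FORMATS d (DATE11).
+LIST.
+@end example
+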
-@item time-delimiter
-At least one characters of whitespace or @samp{:} or @samp{.}.
+For the time and date formats, the default output format is the same as
+the input format, except that PSPP increases the field width, if
+necessary, to the minimum allowed for output.
-@item hour
-An integer greater than 0 representing an hour.
+Times or dates narrower than the field width are right-justified within
+the field.
-@item minute
-An integer between 0 and 59 representing a minute within an hour.
+When a time or date exceeds the field width, characters are trimmed from
+the end until it fits. This can occur in unusual situations, e.g.@:
+with a year greater than 9999 (which adds an extra digit), or with a
+negative value in TIME or DTIME format (which adds a leading minus sign).
-@item opt-second
-Optionally, a time-delimiter followed by a real number representing a
-number of seconds.
+@c What about out-of-range values?
-@item hour24
-An integer between 0 and 23 representing an hour within a day.
+The system-missing value is output as a period at the right end of the
+field.
-@item weekday
-At least the first two characters of an English day word.
+@node Date Component Formats
+@subsubsection Date Component Formats
-@item spaces
-Any amount or no amount of whitespace.
+The WKDAY and MONTH formats provide input and output for the names of
+weekdays and months, respectively.
-@item sign
-An optional positive or negative sign.
+On output, these formats convert a number between 1 and 7, for WKDAY, or
+between 1 and 12, for MONTH, into the English name of a day or month,
+respectively. If the name is longer than the field, it is trimmed to
+fit. If the name is shorter than the field, it is padded on the right
+with spaces. Values outside the valid range, and the system-missing
+value, are output as all spaces.
-@item trailer
-All formats accept an optional whitespace trailer.
-@end table
+On input, English weekday or month names (in uppercase or lowercase) are
+converted back to their corresponding numbers. Weekday and month names
+may be abbreviated to their first 2 or 3 letters, respectively.
-The date input formats are strung together from the above pieces. On
-output, the date formats are always printed in a single canonical
-manner, based on field width. The date input and output formats are
-described below:
+The field width may range from 2 to 40, for WKDAY, or from 3 to 40, for
+MONTH. No decimal places are allowed.
-@table @asis
-@item DATEw: 9 <= iw,ow <= 40
-Date format. Input format: leader + day + date-delimiter +
-month + date-delimiter + year + trailer. Output format: DD-MMM-YY for
-@var{w} < 11, DD-MMM-YYYY otherwise.
-
-@item EDATEw: 8 <= iw,ow <= 40
-European date format. Input format same as DATE. Output format:
-DD.MM.YY for @var{w} < 10, DD.MM.YYYY otherwise.
-
-@item SDATEw: 8 <= iw,ow <= 40
-Standard date format. Input format: leader + year + date-delimiter +
-month + date-delimiter + day + trailer. Output format: YY/MM/DD for
-@var{w} < 10, YYYY/MM/DD otherwise.
-
-@item ADATEw: 8 <= iw,ow <= 40
-American date format. Input format: leader + month + date-delimiter +
-day + date-delimiter + year + trailer. Output format: MM/DD/YY for
-@var{w} < 10, MM/DD/YYYY otherwise.
-
-@item JDATEw: 5 <= iw,ow <= 40
-Julian date format. Input format: leader + julian + trailer. Output
-format: YYDDD for @var{w} < 7, YYYYDDD otherwise.
-
-@item QYRw: 4 <= iw <= 40, 6 <= ow <= 40
-Quarter/year format. Input format: leader + quarter + q-delimiter +
-year + trailer. Output format: @samp{Q Q YY}, where the first
-@samp{Q} is one of the digits 1, 2, 3, 4, if @var{w} < 8, @code{Q Q
-YYYY} otherwise.
-
-@item MOYRw: 6 <= iw,ow <= 40
-Month/year format. Input format: leader + month + date-delimiter + year
-+ trailer. Output format: @samp{MMM YY} for @var{w} < 8, @samp{MMM
-YYYY} otherwise.
-
-@item WKYRw: 6 <= iw <= 40, 8 <= ow <= 40
-Week/year format. Input format: leader + week + wk-delimiter + year +
-trailer. Output format: @samp{WW WK YY} for @var{w} < 10, @samp{WW WK
-YYYY} otherwise.
-
-@item DATETIMEw.d: 17 <= iw,ow <= 40
-Date and time format. Input format: leader + day + date-delimiter +
-month + date-delimiter + yaer + time-delimiter + hour24 + time-delimiter
-+ minute + opt-second. Output format: @samp{DD-MMM-YYYY HH:MM}. If
-@var{w} > 19 then seconds @samp{:SS} is added. If @var{w} > 22 and
-@var{d} > 0 then fractional seconds @samp{.SS} are added.
-
-@item TIMEw.d: 5 <= iw,ow <= 40
-Time format. Input format: leader + sign + spaces + hour +
-time-delimiter + minute + opt-second. Output format: @samp{HH:MM}.
-Seconds and fractional seconds are available with @var{w} of at least 8
-and 10, respectively.
-
-@item DTIMEw.d: 1 <= iw <= 40, 8 <= ow <= 40
-Time format with day count. Input format: leader + sign + spaces +
-day-count + time-delimiter + hour + time-delimiter + minute +
-opt-second. Output format: @samp{DD HH:MM}. Seconds and fractional
-seconds are available with @var{w} of at least 8 and 10, respectively.
-
-@item WKDAYw: 2 <= iw,ow <= 40
-A weekday as a number between 1 and 7, where 1 is Sunday. Input format:
-leader + weekday + trailer. Output format: as many characters, in all
-capital letters, of the English name of the weekday as will fit in the
-field width.
-
-@item MONTHw: 3 <= iw,ow <= 40
-A month as a number between 1 and 12, where 1 is January. Input format:
-leader + month + trailer. Output format: as many character, in all
-capital letters, of the English name of the month as will fit in the
-field width.
-@end table
+The default output format is the same as the input format.
-There are only two formats that may be used with string variables:
+@node String Formats
+@subsubsection String Formats
-@table @asis
-@item Aw: 1 <= iw <= 255, 1 <= ow <= 254
-The entire field is treated as a string value.
+@cindex string formats
+The A and AHEX formats are the only ones that may be assigned to string
+variables. Neither format allows any decimal places.
-@item AHEXw @result{} A: 2 <= iw <= 254; 2 <= ow <= 510
-The field is composed of characters in a string encoded as textual hex
-digit pairs.
+In A format, the entire field is treated as a string value. The field
+width may range from 1 to 32,767, the maximum string width. The default
+output format is the same as the input format.
-The default output @var{w} is half the input @var{w}.
-@end table
+In AHEX format, the field is composed of characters in a string encoded
+as hex digit pairs. On output, hex digits are output in uppercase; on
+input, uppercase and lowercase are both accepted. The default output
+format is A format with half the input width.
-@node Scratch Variables, , Input/Output Formats, Variables
+@node Scratch Variables
@subsection Scratch Variables
+@cindex scratch variables
Most of the time, variables don't retain their values between cases.
-Instead, either they're being read from a data file or the active file,
+Instead, either they're being read from a data file or the active dataset,
in which case they assume the value read, or, if created with
@cmd{COMPUTE} or
another transformation, they're initialized to the system-missing value
names begin with an octothorpe (@samp{#}).
Scratch variables have the same properties as variables left with
-@cmd{LEAVE}:
-they retain their values between cases, and for the first case they are
-initialized to 0 or blanks. They have the additional property that they
-are deleted before the execution of any procedure. For this reason,
-scratch variables can't be used for analysis. To obtain the same
-effect, use @cmd{COMPUTE} (@pxref{COMPUTE}) to copy the scratch variable's
-value into an ordinary variable, then analysis that variable.
-
-@node Files, BNF, Variables, Language
+@cmd{LEAVE}: they retain their values between cases, and for the first
+case they are initialized to 0 or blanks. They have the additional
+property that they are deleted before the execution of any procedure.
+For this reason, scratch variables can't be used for analysis. To use
+a scratch variable in an analysis, use @cmd{COMPUTE} (@pxref{COMPUTE})
+to copy its value into an ordinary variable, then use that ordinary
+variable in the analysis.
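+
+As a minimal sketch of this technique, the following syntax keeps a
+running total in a scratch variable and copies it into an ordinary
+variable for analysis. The names @code{x}, @code{#total}, and
+@code{running} are invented for the example.
+
+@example
+DATA LIST LIST /x.
+BEGIN DATA.
+1
+2
+3
+END DATA.
+COMPUTE #total = #total + x.
+COMPUTE running = #total.
+DESCRIPTIVES /VARIABLES=running.
+@end example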
+
+@node Files
@section Files Used by PSPP
PSPP makes use of many files each time it runs. Some of these it
@cindex syntax file
@item command file
@itemx syntax file
-These names (synonyms) refer to the file that contains instructions to
-PSPP that tell it what to do. The syntax file's name is specified on
-the PSPP command line. Syntax files can also be pulled in with
+These names (synonyms) refer to the file that contains instructions
+that tell PSPP what to do. The syntax file's name is specified on
+the PSPP command line. Syntax files can also be read with
@cmd{INCLUDE} (@pxref{INCLUDE}).
@cindex file, data
@cindex data file
@item data file
-Data files contain raw data in ASCII format suitable for being read in
-by @cmd{DATA LIST}. Data can be embedded in the syntax
-file with @cmd{BEGIN DATA} and @cmd{END DATA}: this makes the
-syntax file a data file too.
+Data files contain raw data in text or binary format. Data can also
+be embedded in a syntax file with @cmd{BEGIN DATA} and @cmd{END DATA}.
@cindex file, output
@cindex output file
statistical procedures. The output files may be in any number of formats,
depending on how PSPP is configured.
-@cindex active file
-@cindex file, active
-@item active file
-The active file is the ``file'' on which all PSPP procedures
-are performed. The active file contains variable definitions and
-cases. The active file is not necessarily a disk file: it is stored
-in memory if there is room.
+@cindex system file
+@cindex file, system
+@item system file
+System files are binary files that store a dictionary and a set of
+cases. @cmd{GET} and @cmd{SAVE} read and write system files.
+
+@cindex portable file
+@cindex file, portable
+@item portable file
+Portable files are files in a text-based format that store a dictionary
+and a set of cases. @cmd{IMPORT} and @cmd{EXPORT} read and write
+portable files.
@end table
-@node BNF, , Files, Language
+@node File Handles
+@section File Handles
+@cindex file handles
+
+A @dfn{file handle} is a reference to a data file, system file, or
+portable file. Most often, a file handle is specified as a file name
+written as a string, that is, enclosed within @samp{'} or
+@samp{"}.
+
+A file name string that begins or ends with @samp{|} is treated as the
+name of a command to pipe data to or from. You can use this feature
+to read data over the network using a program such as @samp{curl}
+(e.g.@: @code{GET '|curl -s -S http://example.com/mydata.sav'}), to
+read compressed data from a file using a program such as @samp{zcat}
+(e.g.@: @code{GET '|zcat mydata.sav.gz'}), and for many other
+purposes.
+
+PSPP also supports declaring named file handles with the @cmd{FILE
+HANDLE} command. This command associates an identifier of your choice
+(the file handle's name) with a file. Later, the file handle name can
+be substituted for the name of the file. When PSPP syntax accesses a
+file multiple times, declaring a named file handle simplifies updating
+the syntax later to use a different file. Use of @cmd{FILE HANDLE} is
+also required to read data files in binary formats. @xref{FILE HANDLE},
+for more information.
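+
+For instance, the following sketch declares a named file handle and
+then uses it in place of the file name. The handle name
+@code{survey}, the file name, and the variable layout are hypothetical.
+
+@example
+FILE HANDLE survey /NAME='survey.txt'.
+DATA LIST FILE=survey /id 1-4 age 6-7.
+LIST.
+@end example
+
+To read a different file later, only the @cmd{FILE HANDLE} line needs
+to change.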
+
+In some circumstances, PSPP must distinguish whether a file handle
+refers to a system file or a portable file. When this is necessary to
+read a file, e.g.@: as an input file for @cmd{GET} or @cmd{MATCH FILES},
+PSPP uses the file's contents to decide. In the context of writing a
+file, e.g.@: as an output file for @cmd{SAVE} or @cmd{AGGREGATE}, PSPP
+decides based on the file's name: if it ends in @samp{.por} (with any
+capitalization), then PSPP writes a portable file; otherwise, PSPP
+writes a system file.
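+
+As a short illustration, assuming an active dataset already exists and
+using made-up file names, the first command below should write a system
+file and the second a portable file:
+
+@example
+SAVE OUTFILE='results.sav'.
+SAVE OUTFILE='results.por'.
+@end example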
+
+INLINE is reserved as a file handle name. It refers to the ``data
+file'' embedded into the syntax file between @cmd{BEGIN DATA} and
+@cmd{END DATA}. @xref{BEGIN DATA}, for more information.
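+
+The following sketch names INLINE explicitly to read data embedded in
+the syntax file. The variable names and data values are invented for
+the example.
+
+@example
+DATA LIST FILE=INLINE LIST /x y.
+BEGIN DATA.
+1 2
+3 4
+END DATA.
+LIST.
+@end example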
+
+The file to which a file handle refers may be reassigned on a later
+@cmd{FILE HANDLE} command if it is first closed using @cmd{CLOSE FILE
+HANDLE}. @xref{CLOSE FILE HANDLE}, for
+more information.
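+
+A sketch of reassigning a handle, with a hypothetical handle name
+@code{src} and hypothetical file names, might look like this:
+
+@example
+FILE HANDLE src /NAME='first.txt'.
+DATA LIST FILE=src /x 1-5.
+EXECUTE.
+CLOSE FILE HANDLE src.
+FILE HANDLE src /NAME='second.txt'.
+@end example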
+
+@node BNF
@section Backus-Naur Form
@cindex BNF
@cindex Backus-Naur Form
@item
Words in all-uppercase are PSPP keyword tokens. In BNF, these are
often called @dfn{terminals}. There are some special terminals, which
-are actually written in lowercase for clarity:
+are written in lowercase for clarity:
@table @asis
@cindex @code{number}
Operators and punctuators.
@cindex @code{.}
-@cindex terminal dot
-@cindex dot, terminal
@item @code{.}
-The terminal dot. This is not necessarily an actual dot in the syntax
-file: @xref{Commands}, for more details.
+The end of the command. This is not necessarily an actual dot in the
+syntax file. @xref{Commands}, for more details.
@end table
@item
@end table
@item
-@cindex @code{::=}
@cindex ``is defined as''
@cindex productions
@samp{::=} means ``is defined as''. The left side of @samp{::=} gives
@dfn{start symbol}. The start symbol defines the entire syntax for
that command.
@end itemize
-@setfilename ignored