doc/combining.texi

   1 @node Combining Data Files
   2 @chapter Combining Data Files
   3
   4 This chapter describes commands that allow data from system files,
   5 portable files, scratch files, and the active dataset to be combined to
   6 form a new active dataset.  These commands can combine data files in the
   7 following ways:
   8
   9 @itemize
  10 @item
  11 @cmd{ADD FILES} interleaves or appends the cases from each input file.
  12 It is used with input files that have variables in common, but
  13 distinct sets of cases.
  14
  15 @item
  16 @cmd{MATCH FILES} adds the data together in cases that match across
  17 multiple input files.  It is used with input files that have cases in
  18 common, but different information about each case.
  19
  20 @item
  21 @cmd{UPDATE} updates a master data file from data in a set of
  22 transaction files.  Each case in a transaction data file modifies a
  23 matching case in the master data file, or it adds a new case if no
  24 matching case can be found.
  25 @end itemize
  26
  27 These commands share the majority of their syntax, which is described
  28 in the following section, followed by one section for each command
  29 that describes its specific syntax and semantics.
  30
  31 @menu
  32 * Combining Files Common Syntax::
  33 * ADD FILES::                   Interleave cases from multiple files.
  34 * MATCH FILES::                 Merge cases from multiple files.
  35 * UPDATE::                      Update cases using transactional data.
  36 @end menu
  37
  38 @node Combining Files Common Syntax
  39 @section Common Syntax
  40
  41 @display
  42 Per input file:
  43         /FILE=@{*,'file-name'@}
  44         [/RENAME=(src_names=target_names)@dots{}]
  45         [/IN=var_name]
  46         [/SORT]
  47
  48 Once per command:
  49         /BY var_list[(@{D|A@})] [var_list[(@{D|A@})]]@dots{}
  50         [/DROP=var_list]
  51         [/KEEP=var_list]
  52         [/FIRST=var_name]
  53         [/LAST=var_name]
  54         [/MAP]
  55 @end display
  56
  57 This section describes the syntactical features in common among the
  58 @cmd{ADD FILES}, @cmd{MATCH FILES}, and @cmd{UPDATE} commands.  The
  59 following sections describe details specific to each command.
  60
  61 Each of these commands reads two or more input files and combines
  62 them.  The command's output becomes the new active dataset.  The input
  63 files are not changed on disk.
  64
  65 The syntax of each command begins with a specification of the files to
  66 be read as input.  For each input file, specify FILE with a system,
  67 portable, or scratch file's name as a string or a file handle
  68 (@pxref{File Handles}), or specify an asterisk (@samp{*}) to use the
  69 active dataset as input.  Use of portable or scratch files on FILE is a
  70 PSPP extension.
  71
  72 At least two FILE subcommands must be specified.  If the active dataset
  73 is used as an input source, then @cmd{TEMPORARY} must not be in
  74 effect.
  75
  76 Each FILE subcommand may be followed by any number of RENAME
  77 subcommands that specify a parenthesized group or groups of variable
  78 names as they appear in the input file, followed by those variables'
  79 new names, separated by an equals sign (@samp{=}),
  80 e.g. @samp{/RENAME=(OLD1=NEW1)(OLD2=NEW2)}.  To rename a single
  81 variable, the parentheses may be omitted: @samp{/RENAME=OLD=NEW}.
  82 Within a parenthesized group, variables are renamed simultaneously, so
  83 that @samp{/RENAME=(A B=B A)} exchanges the names of variables A and
  84 B.  Otherwise, renaming occurs in left-to-right order.
  85
  86 Each FILE subcommand may optionally be followed by a single IN
  87 subcommand, which creates a numeric variable with the specified name
  88 and format F1.0.  The IN variable takes value 1 in an output case if
  89 the given input file contributed to that output case, and 0 otherwise.
  90 The DROP, KEEP, and RENAME subcommands have no effect on IN variables.
  91
  92 If BY is used (see below), the SORT keyword must be specified after a
  93 FILE if that input file is not already sorted on the BY variables.
  94 When SORT is specified, PSPP sorts the input file's data on the BY
  95 variables before the command reads it.  When SORT is used, BY
  96 is required.  SORT is a PSPP extension.
  97
  98 PSPP merges the dictionaries of all of the input files to form the
  99 dictionary of the new active dataset, like so:
 100
 101 @itemize @bullet
 102 @item
 103 The variables in the new active dataset are the union of all the input files'
 104 variables, matched based on their name.  When a single input file
 105 contains a variable with a given name, the output file will contain
 106 exactly that variable.  When more than one input file contains a
 107 variable with a given name, those variables must all have the same
 108 type (numeric or string) and, for string variables, the same width.
 109 Variables are matched after renaming with the RENAME subcommand.
 110 Thus, RENAME can be used to resolve conflicts.
 111
 112 @item
 113 The variable label for each output variable is taken from the first
 114 specified input file that has a variable label for that variable, and
 115 similarly for value labels and missing values.
 116
 117 @item
 118 The file label of the new active dataset (@pxref{FILE LABEL}) is that of the
 119 first specified FILE that has a file label.
 120
 121 @item
 122 The documents in the new active dataset (@pxref{DOCUMENT}) are the
 123 concatenation of all the input files' documents, in the order in which
 124 the FILE subcommands are specified.
 125
 126 @item
 127 If all of the input files are weighted on the same variable, then the
 128 new active dataset is weighted on that variable.  Otherwise, the new
 129 active dataset is not weighted.
 130 @end itemize
 131
 132 The remaining subcommands apply to the output file as a whole, rather
 133 than to individual input files.  They must be specified at the end of
 134 the command specification, following all of the FILE and related
 135 subcommands.  The most important of these subcommands is BY, which
 136 specifies a set of one or more variables that may be used to find
 137 corresponding cases in each of the input files.  The variables
 138 specified on BY must be present in all of the input files.
 139 Furthermore, if any of the input files are not sorted on the BY
 140 variables, then SORT must be specified for those input files.
 141
 142 The variables listed on BY may include (A) or (D) annotations to
 143 specify ascending or descending sort order.  @xref{SORT CASES}, for
 144 more details on this notation.  Adding (A) or (D) to the BY subcommand
 145 specification is a PSPP extension.
 146
 147 The DROP subcommand can be used to specify a list of variables to
 148 exclude from the output.  By contrast, the KEEP subcommand can be used
 149 to specify variables to include in the output; all variables not
 150 listed are dropped.  DROP and KEEP are executed in left-to-right order
 151 and may be repeated any number of times.  DROP and KEEP do not affect
 152 variables created by the IN, FIRST, and LAST subcommands, which are
 153 always included in the new active dataset, but they can be used to drop
 154 BY variables.
 155
 156 The FIRST and LAST subcommands are optional.  They may only be
 157 specified on @cmd{MATCH FILES} and @cmd{ADD FILES}, and only when BY
 158 is used.  FIRST and LAST each add a numeric variable to the new
 159 active dataset, with the name given as the subcommand's argument and F1.0
 160 print and write formats.  The value of the FIRST variable is 1 in the
 161 first output case with a given set of values for the BY variables, and
 162 0 in other cases.  Similarly, the LAST variable is 1 in the last case
 163 with a given set of BY values, and 0 in other cases.
 164
 165 When any of these commands creates an output case, variables that
 166 appear only in input files that contributed no data to that case are
 167 set to the system-missing value for numeric variables or spaces for
 168 string variables.
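
As a minimal sketch of how these subcommands fit together, the
following syntax uses RENAME to resolve a name conflict, IN to record
which input contributed to each output case, and SORT to order an
unsorted input file.  The file name @file{spring.sav}, the variable
names, and the data are hypothetical and chosen only for illustration.

@example
* Create an input file that is not sorted on id.
DATA LIST LIST NOTABLE /id (F2.0) score (F3.0).
BEGIN DATA.
2 65
1 80
END DATA.
SAVE OUTFILE='spring.sav'.

* The active dataset is already sorted on id.
DATA LIST LIST NOTABLE /id (F2.0) score (F3.0).
BEGIN DATA.
2 70
3 90
END DATA.

* Both inputs contain score, so rename it differently for each.
MATCH FILES
    /FILE='spring.sav' /RENAME=(score=spring) /IN=in_spring /SORT
    /FILE=* /RENAME=(score=fall) /IN=in_fall
    /BY id.
LIST.
@end example

Under the rules described in this section, the output case with
@samp{id} 2 should carry values for both @samp{spring} and
@samp{fall}, with @samp{in_spring} and @samp{in_fall} both set to 1.
The cases with @samp{id} 1 and 3 each come from a single input, so the
variable from the other input should be system-missing and its IN
variable should be 0.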
 169
 170 @node ADD FILES
 171 @section ADD FILES
 172 @vindex ADD FILES
 173
 174 @display
 175 ADD FILES
 176
 177 Per input file:
 178         /FILE=@{*,'file-name'@}
 179         [/RENAME=(src_names=target_names)@dots{}]
 180         [/IN=var_name]
 181         [/SORT]
 182
 183 Once per command:
 184         [/BY var_list[(@{D|A@})] [var_list[(@{D|A@})]@dots{}]]
 185         [/DROP=var_list]
 186         [/KEEP=var_list]
 187         [/FIRST=var_name]
 188         [/LAST=var_name]
 189         [/MAP]
 190 @end display
 191
 192 @cmd{ADD FILES} adds cases from multiple input files.  The output,
 193 which replaces the active dataset, consists of all of the cases in all of
 194 the input files.
 195
 196 ADD FILES shares the bulk of its syntax with other PSPP commands for
 197 combining multiple data files.  @xref{Combining Files Common Syntax},
 198 above, for an explanation of this common syntax.
 199
 200 When BY is not used, the output of ADD FILES consists of all the cases
 201 from the first input file specified, followed by all the cases from
 202 the second file specified, and so on.  When BY is used, the output is
 203 additionally sorted on the BY variables.
 204
 205 When ADD FILES creates an output case, variables that are not part of
 206 the input file from which the case was drawn are set to the
 207 system-missing value for numeric variables or spaces for string
 208 variables.
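
The following sketch illustrates @cmd{ADD FILES} with BY.  The file
name @file{first.sav}, the variable names, and the data are
hypothetical and serve only as an example.

@example
* First input: cases 1 and 3, with a numeric score.
DATA LIST LIST NOTABLE /id (F2.0) score (F3.0).
BEGIN DATA.
1 90
3 70
END DATA.
SAVE OUTFILE='first.sav'.

* Second input (the active dataset): cases 2 and 4, with a string grade.
DATA LIST LIST NOTABLE /id (F2.0) grade (A1).
BEGIN DATA.
2 B
4 A
END DATA.

ADD FILES
    /FILE='first.sav' /IN=from_first
    /FILE=* /IN=from_second
    /BY id.
LIST.
@end example

Because BY is specified and both inputs are already sorted on
@samp{id}, the output should contain the four cases interleaved in
@samp{id} order.  Following the rule stated above, @samp{score} should
be system-missing in the cases drawn from the active dataset, and
@samp{grade} should be blank in the cases drawn from @file{first.sav}.
Without BY, the two cases from @file{first.sav} would instead be
followed by the two cases from the active dataset.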
 209
 210 @node MATCH FILES
 211 @section MATCH FILES
 212 @vindex MATCH FILES
 213
 214 @display
 215 MATCH FILES
 216
 217 Per input file:
 218         /@{FILE,TABLE@}=@{*,'file-name'@}
 219         [/RENAME=(src_names=target_names)@dots{}]
 220         [/IN=var_name]
 221         [/SORT]
 222
 223 Once per command:
 224         [/BY var_list[(@{D|A@})] [var_list[(@{D|A@})]@dots{}]]
 225         [/DROP=var_list]
 226         [/KEEP=var_list]
 227         [/FIRST=var_name]
 228         [/LAST=var_name]
 229         [/MAP]
 230 @end display
 231
 232 @cmd{MATCH FILES} merges sets of corresponding cases in multiple
 233 input files into single cases in the output, combining their data.
 234
 235 MATCH FILES shares the bulk of its syntax with other PSPP commands for
 236 combining multiple data files.  @xref{Combining Files Common Syntax},
 237 above, for an explanation of this common syntax.
 238
 239 How MATCH FILES matches up cases from the input files depends on
 240 whether BY is specified:
 241
 242 @itemize @bullet
 243 @item
 244 If BY is not used, MATCH FILES combines the first case from each input
 245 file to produce the first output case, then the second case from each
 246 input file for the second output case, and so on.  If some input files
 247 have fewer cases than others, then the shorter files do not contribute
 248 to cases output after their input has been exhausted.
 249
 250 @item
 251 If BY is used, MATCH FILES combines cases from each input file that
 252 have identical values for the BY variables.
 253
 254 When BY is used, TABLE subcommands may be used to introduce @dfn{table
 255 lookup files}.  TABLE has the same syntax as FILE, and the RENAME, IN, and
 256 SORT subcommands may follow a TABLE in the same way as a FILE.
 257 Regardless of the number of TABLEs, at least one FILE must be specified.
 258 Table lookup files are treated in the same way as other input files
 259 for most purposes and, in particular, table lookup files must be
 260 sorted on the BY variables or the SORT subcommand must be specified
 261 for that TABLE.
 262
 263 Cases in table lookup files are not consumed after they have been used
 264 once.  This means that data in table lookup files can correspond to
 265 any number of cases in FILE input files.  Table lookup files are
 266 analogous to lookup tables in traditional relational database systems.
 267
 268 If a table lookup file contains more than one case with a given set of
 269 values for the BY variables, only the first case is used.
 270 @end itemize
 271
 272 When MATCH FILES creates an output case, variables that appear only in
 273 input files that contributed no data to that case are set to the
 274 system-missing value for numeric variables or spaces for string
 275 variables.
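
As an illustration of a table lookup file, the sketch below attaches a
department name to each employee case.  The file name
@file{departments.sav}, the variable names, and the data are
hypothetical.

@example
* Lookup table: one case per department, sorted on dept.
DATA LIST LIST NOTABLE /dept (F2.0) dept_name (A10).
BEGIN DATA.
1 Sales
2 Support
END DATA.
SAVE OUTFILE='departments.sav'.

* Employee data (the active dataset), also sorted on dept.
DATA LIST LIST NOTABLE /emp (F3.0) dept (F2.0).
BEGIN DATA.
101 1
102 1
103 2
END DATA.

MATCH FILES
    /FILE=*
    /TABLE='departments.sav'
    /BY dept.
LIST.
@end example

Since cases in a table lookup file are not consumed when they are
used, the single case for department 1 should supply
@samp{dept_name} to both employees 101 and 102.  The output should
therefore contain three cases, one per employee.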
 276
 277 @node UPDATE
 278 @section UPDATE
 279 @vindex UPDATE
 280
 281 @display
 282 UPDATE
 283
 284 Per input file:
 285         /FILE=@{*,'file-name'@}
 286         [/RENAME=(src_names=target_names)@dots{}]
 287         [/IN=var_name]
 288         [/SORT]
 289
 290 Once per command:
 291         /BY var_list[(@{D|A@})] [var_list[(@{D|A@})]]@dots{}
 292         [/DROP=var_list]
 293         [/KEEP=var_list]
 294         [/MAP]
 295 @end display
 296
 297 @cmd{UPDATE} updates a @dfn{master file} by applying modifications
 298 from one or more @dfn{transaction files}.
 299
 300 UPDATE shares the bulk of its syntax with other PSPP commands for
 301 combining multiple data files.  @xref{Combining Files Common Syntax},
 302 above, for an explanation of this common syntax.
 303
 304 At least two FILE subcommands must be specified.  The first FILE
 305 subcommand names the master file, and the rest name transaction files.
 306 Every input file must either be sorted on the variables named on the
 307 BY subcommand, or the SORT subcommand must be used just after the FILE
 308 subcommand for that input file.
 309
 310 UPDATE uses the variables specified on the BY subcommand, which is
 311 required, to attempt to match each case in a transaction file with a
 312 case in the master file:
 313
 314 @itemize @bullet
 315 @item
 316 When a match is found, then the values of the variables present in the
 317 transaction file replace those variables' values in the new active
 318 dataset.  If there are matching cases in more than one transaction file,
 319 PSPP applies the replacements from the first transaction file, then
 320 from the second transaction file, and so on.  Similarly, if a single
 321 transaction file has cases with duplicate BY values, then those are
 322 applied in order to the master file.
 323
 324 When a variable in a transaction file has a missing value or a string
 325 variable's value is all blanks, that value is never used to update the
 326 master file.
 327
 328 @item
 329 If a case in the master file has no matching case in any transaction
 330 file, then it is copied unchanged to the output.
 331
 332 @item
 333 If a case in a transaction file has no matching case in the master
 334 file, then it causes a new case to be added to the output, initialized
 335 from the values in the transaction file.
 336 @end itemize
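
The three rules above can be sketched with a small example.  The file
names @file{master.sav} and @file{trans.sav}, the variable names, and
the data are hypothetical.  The lone period in the transaction data is
intended to be read as the system-missing value.

@example
* Master file: two products.
DATA LIST LIST NOTABLE /id (F2.0) price (F6.2) qty (F3.0).
BEGIN DATA.
1 9.99 10
2 4.50 20
END DATA.
SAVE OUTFILE='master.sav'.

* Transaction file: a new price for product 2, and a new product 3.
DATA LIST LIST NOTABLE /id (F2.0) price (F6.2) qty (F3.0).
BEGIN DATA.
2 5.00 .
3 1.25 5
END DATA.
SAVE OUTFILE='trans.sav'.

UPDATE
    /FILE='master.sav'
    /FILE='trans.sav'
    /BY id.
LIST.
@end example

Under these rules, the case with @samp{id} 1 should be copied
unchanged, because no transaction matches it.  The case with
@samp{id} 2 should take the new @samp{price} of 5.00 but keep its
@samp{qty} of 20, because a missing value in a transaction file is
never used to update the master file.  The case with @samp{id} 3 has
no match in the master file, so it should be added as a new case.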