   1 @node Combining Data Files
   2 @chapter Combining Data Files
   3
   4 This chapter describes commands that allow data from system files,
   5 portable files, and open datasets to be combined to
   6 form a new active dataset.  These commands can combine data files in the
   7 following ways:
   8
   9 @itemize
  10 @item
  11 @cmd{ADD FILES} interleaves or appends the cases from each input file.
  12 It is used with input files that have variables in common, but
  13 distinct sets of cases.
  14
  15 @item
  16 @cmd{MATCH FILES} adds the data together in cases that match across
  17 multiple input files.  It is used with input files that have cases in
  18 common, but different information about each case.
  19
  20 @item
  21 @cmd{UPDATE} updates a master data file from data in a set of
  22 transaction files.  Each case in a transaction data file modifies a
  23 matching case in the master data file, or it adds a new case if no
  24 matching case can be found.
  25 @end itemize
  26
  27 These commands share most of their syntax, which the following section
  28 describes.  After that, one section per command covers that command's
  29 specific syntax and semantics.
  30
  31 @menu
  32 * Combining Files Common Syntax::
  33 * ADD FILES::                   Interleave cases from multiple files.
  34 * MATCH FILES::                 Merge cases from multiple files.
  35 * UPDATE::                      Update cases using transactional data.
  36 @end menu
  37
  38 @node Combining Files Common Syntax
  39 @section Common Syntax
  40
  41 @display
  42 Per input file:
  43         /FILE=@{*,'@var{file_name}'@}
  44         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
  45         [/IN=@var{var_name}]
  46         [/SORT]
  47
  48 Once per command:
  49         /BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]]@dots{}
  50         [/DROP=@var{var_list}]
  51         [/KEEP=@var{var_list}]
  52         [/FIRST=@var{var_name}]
  53         [/LAST=@var{var_name}]
  54         [/MAP]
  55 @end display
  56
  57 This section describes the syntactical features in common among the
  58 @cmd{ADD FILES}, @cmd{MATCH FILES}, and @cmd{UPDATE} commands.  The
  59 following sections describe details specific to each command.
  60
  61 Each of these commands reads two or more input files and combines
  62 them.  The command's output becomes the new active dataset.  The input
  63 files are not changed on disk.
  64
  65 The syntax of each command begins with a specification of the files to
  66 be read as input.  For each input file, specify @subcmd{FILE} with a system
  67 file or portable file's name as a string, a dataset (@pxref{Datasets})
  68 or file handle name (@pxref{File Handles}), or an asterisk (@samp{*})
  69 to use the active dataset as input.  Use of portable files on @subcmd{FILE} is a
  70 @pspp{} extension.
  71
  72 At least two @subcmd{FILE} subcommands must be specified.  If the active dataset
  73 is used as an input source, then @cmd{TEMPORARY} must not be in
  74 effect.
  75
  76 Each @subcmd{FILE} subcommand may be followed by any number of @subcmd{RENAME}
  77 subcommands that specify a parenthesized group or groups of variable
  78 names as they appear in the input file, followed by those variables'
  79 new names, separated by an equals sign (@subcmd{=}),
  80 e.g. @subcmd{/RENAME=(OLD1=NEW1)(OLD2=NEW2)}.  To rename a single
  81 variable, the parentheses may be omitted: @subcmd{/RENAME=@var{old}=@var{new}}.
  82 Within a parenthesized group, variables are renamed simultaneously, so
  83 that @subcmd{/RENAME=(@var{A} @var{B}=@var{B} @var{A})} exchanges the
  84 names of variables @var{A} and @var{B}.
  85 Otherwise, renaming occurs in left-to-right order.
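
As an illustrative sketch (the file names @file{a.sav} and @file{b.sav}
and all of the variable names are hypothetical), the following renames
variables in each input file before the files' variables are matched by
name:

@example
ADD FILES
        /FILE='a.sav' /RENAME=(x y=y x)
        /FILE='b.sav' /RENAME=total=score.
@end example

Here the first @subcmd{RENAME} exchanges the names of @code{x} and
@code{y} within a single parenthesized group, and the second uses the
unparenthesized form to rename the single variable @code{total} to
@code{score}.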
  86
  87 Each @subcmd{FILE} subcommand may optionally be followed by a single @subcmd{IN}
  88 subcommand, which creates a numeric variable with the specified name
  89 and format F1.0.  The @subcmd{IN} variable takes value 1 in an output case if
  90 the given input file contributed to that output case, and 0 otherwise.
  91 The @subcmd{DROP}, @subcmd{KEEP}, and @subcmd{RENAME} subcommands have no effect on @subcmd{IN} variables.
  92
  93 If @subcmd{BY} is used (see below), the @subcmd{SORT} keyword must be specified after a
  94 @subcmd{FILE} if that input file is not already sorted on the @subcmd{BY} variables.
  95 When @subcmd{SORT} is specified, @pspp{} sorts that input file's data on the @subcmd{BY}
  96 variables before the command reads it.  When @subcmd{SORT} is used, @subcmd{BY}
  97 is required.  @subcmd{SORT} is a @pspp{} extension.
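
For instance, a minimal sketch of @subcmd{IN} and @subcmd{SORT} used
together might look like the following.  The file names and variable
names are hypothetical, and the sketch assumes that @file{census.sav}
is already sorted on @code{id} while @file{survey.sav} is not:

@example
MATCH FILES
        /FILE='survey.sav' /IN=in_survey /SORT
        /FILE='census.sav' /IN=in_census
        /BY id.
@end example

In each output case, @code{in_survey} is 1 if @file{survey.sav}
contributed to that case and 0 otherwise, and likewise for
@code{in_census} and @file{census.sav}.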
  98
  99 @pspp{} merges the dictionaries of all of the input files to form the
 100 dictionary of the new active dataset, like so:
 101
 102 @itemize @bullet
 103 @item
 104 The variables in the new active dataset are the union of all the input files'
 105 variables, matched based on their name.  When a single input file
 106 contains a variable with a given name, the output file will contain
 107 exactly that variable.  When more than one input file contains a
 108 variable with a given name, those variables must all have the same
 109 type (numeric or string) and, for string variables, the same width.
 110 Variables are matched after renaming with the @subcmd{RENAME} subcommand.
 111 Thus, @subcmd{RENAME} can be used to resolve conflicts.
 112
 113 @item
 114 The variable label for each output variable is taken from the first
 115 specified input file that has a variable label for that variable, and
 116 similarly for value labels and missing values.
 117
 118 @item
 119 The file label of the new active dataset (@pxref{FILE LABEL}) is that of the
 120 first specified @subcmd{FILE} that has a file label.
 121
 122 @item
 123 The documents in the new active dataset (@pxref{DOCUMENT}) are the
 124 concatenation of all the input files' documents, in the order in which
 125 the @subcmd{FILE} subcommands are specified.
 126
 127 @item
 128 If all of the input files are weighted on the same variable, then the
 129 new active dataset is weighted on that variable.  Otherwise, the new
 130 active dataset is not weighted.
 131 @end itemize
 132
 133 The remaining subcommands apply to the output file as a whole, rather
 134 than to individual input files.  They must be specified at the end of
 135 the command specification, following all of the @subcmd{FILE} and related
 136 subcommands.  The most important of these subcommands is @subcmd{BY}, which
 137 specifies a set of one or more variables that may be used to find
 138 corresponding cases in each of the input files.  The variables
 139 specified on @subcmd{BY} must be present in all of the input files.
 140 Furthermore, if any of the input files are not sorted on the @subcmd{BY}
 141 variables, then @subcmd{SORT} must be specified for those input files.
 142
 143 The variables listed on @subcmd{BY} may include (A) or (D) annotations to
 144 specify ascending or descending sort order.  @xref{SORT CASES}, for
 145 more details on this notation.  Adding (A) or (D) to the @subcmd{BY} subcommand
 146 specification is a @pspp{} extension.
 147
 148 The @subcmd{DROP} subcommand can be used to specify a list of variables to
 149 exclude from the output.  By contrast, the @subcmd{KEEP} subcommand can be used
 150 to specify variables to include in the output; all variables not
 151 listed are dropped.  @subcmd{DROP} and @subcmd{KEEP} are executed in left-to-right order
 152 and may be repeated any number of times.  @subcmd{DROP} and @subcmd{KEEP} do not affect
 153 variables created by the @subcmd{IN}, @subcmd{FIRST}, and @subcmd{LAST} subcommands, which are
 154 always included in the new active dataset, but they can be used to drop
 155 @subcmd{BY} variables.
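
The left-to-right rule can be seen in a short sketch such as the
following, in which the file names and variable names are hypothetical:

@example
MATCH FILES
        /FILE='a.sav'
        /FILE='b.sav' /IN=in_b
        /BY id
        /KEEP=id name score
        /DROP=score.
@end example

Under the rules above, @subcmd{KEEP} first reduces the output to
@code{id}, @code{name}, and @code{score}, and @subcmd{DROP} then removes
@code{score}.  The @subcmd{IN} variable @code{in_b} is unaffected by
either subcommand, so it remains in the new active dataset.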
 156
 157 The @subcmd{FIRST} and @subcmd{LAST} subcommands are optional.  They may only be
 158 specified on @cmd{MATCH FILES} and @cmd{ADD FILES}, and only when @subcmd{BY}
 159 is used.  @subcmd{FIRST} and @subcmd{LAST} each add a numeric variable to the new
 160 active dataset, with the name given as the subcommand's argument and F1.0
 161 print and write formats.  The value of the @subcmd{FIRST} variable is 1 in the
 162 first output case with a given set of values for the @subcmd{BY} variables, and
 163 0 in other cases.  Similarly, the @subcmd{LAST} variable is 1 in the last case
 164 with a given set of @subcmd{BY} values, and 0 in other cases.
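
As a hedged illustration, with hypothetical file and variable names, the
following flags the boundaries of each group of cases that share a
@code{household} value:

@example
ADD FILES
        /FILE='q1.sav'
        /FILE='q2.sav'
        /BY household
        /FIRST=is_first
        /LAST=is_last.
@end example

A @code{household} value that appears in only one output case yields a
case in which both @code{is_first} and @code{is_last} are 1, because
that case is both the first and the last of its group.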
 165
 166 When any of these commands creates an output case, variables that are
 167 only in files that are not present for the current case are set to the
 168 system-missing value for numeric variables or spaces for string
 169 variables.
 170
 171 @node ADD FILES
 172 @section ADD FILES
 173 @vindex ADD FILES
 174
 175 @display
 176 ADD FILES
 177
 178 Per input file:
 179         /FILE=@{*,'@var{file_name}'@}
 180         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 181         [/IN=@var{var_name}]
 182         [/SORT]
 183
 184 Once per command:
 185         [/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]@dots{}]]
 186         [/DROP=@var{var_list}]
 187         [/KEEP=@var{var_list}]
 188         [/FIRST=@var{var_name}]
 189         [/LAST=@var{var_name}]
 190         [/MAP]
 191 @end display
 192
 193 @cmd{ADD FILES} adds cases from multiple input files.  The output,
 194 which replaces the active dataset, consists of all of the cases in all of
 195 the input files.
 196
 197 @cmd{ADD FILES} shares the bulk of its syntax with other @pspp{} commands for
 198 combining multiple data files.  @xref{Combining Files Common Syntax},
 199 above, for an explanation of this common syntax.
 200
 201 When @subcmd{BY} is not used, the output of @cmd{ADD FILES} consists of all the cases
 202 from the first input file specified, followed by all the cases from
 203 the second file specified, and so on.  When @subcmd{BY} is used, the output is
 204 additionally sorted on the @subcmd{BY} variables.
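
The difference can be sketched with two small files.  The file names,
variable names, and data below are made up for illustration:

@example
DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
1 10
3 30
END DATA.
SAVE OUTFILE='odd.sav'.

DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
2 20
4 40
END DATA.
SAVE OUTFILE='even.sav'.

ADD FILES /FILE='odd.sav' /FILE='even.sav'.
LIST.

ADD FILES /FILE='odd.sav' /FILE='even.sav' /BY id.
LIST.
@end example

Following the description above, the first @cmd{ADD FILES} should
append the cases, giving @code{id} values in the order 1, 3, 2, 4,
whereas the second should interleave them on @code{id}, giving
1, 2, 3, 4.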
 205
 206 When @cmd{ADD FILES} creates an output case, variables that are not part of
 207 the input file from which the case was drawn are set to the
 208 system-missing value for numeric variables or spaces for string
 209 variables.
 210
 211 @node MATCH FILES
 212 @section MATCH FILES
 213 @vindex MATCH FILES
 214
 215 @display
 216 MATCH FILES
 217
 218 Per input file:
 219         /@{FILE,TABLE@}=@{*,'@var{file_name}'@}
 220         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 221         [/IN=@var{var_name}]
 222         [/SORT]
 223
 224 Once per command:
 225         [/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]@dots{}]]
 226         [/DROP=@var{var_list}]
 227         [/KEEP=@var{var_list}]
 228         [/FIRST=@var{var_name}]
 229         [/LAST=@var{var_name}]
 230         [/MAP]
 231 @end display
 232
 233 @cmd{MATCH FILES} merges sets of corresponding cases in multiple
 234 input files into single cases in the output, combining their data.
 235
 236 @cmd{MATCH FILES} shares the bulk of its syntax with other @pspp{} commands for
 237 combining multiple data files.  @xref{Combining Files Common Syntax},
 238 above, for an explanation of this common syntax.
 239
 240 How @cmd{MATCH FILES} matches up cases from the input files depends on
 241 whether @subcmd{BY} is specified:
 242
 243 @itemize @bullet
 244 @item
 245 If @subcmd{BY} is not used, @cmd{MATCH FILES} combines the first case from each input
 246 file to produce the first output case, then the second case from each
 247 input file for the second output case, and so on.  If some input files
 248 have fewer cases than others, then the shorter files do not contribute
 249 to cases output after their input has been exhausted.
 250
 251 @item
 252 If @subcmd{BY} is used, @cmd{MATCH FILES} combines cases from each input file that
 253 have identical values for the @subcmd{BY} variables.
 254
 255 When @subcmd{BY} is used, @subcmd{TABLE} subcommands may be used to introduce @dfn{table
 256 lookup files}.  @subcmd{TABLE} has the same syntax as @subcmd{FILE}, and the @subcmd{RENAME}, @subcmd{IN}, and
 257 @subcmd{SORT} subcommands may follow a @subcmd{TABLE} in the same way as @subcmd{FILE}.
 258 Regardless of the number of @subcmd{TABLE}s, at least one @subcmd{FILE} must be specified.
 259 Table lookup files are treated in the same way as other input files
 260 for most purposes and, in particular, table lookup files must be
 261 sorted on the @subcmd{BY} variables or the @subcmd{SORT} subcommand must be specified
 262 for that @subcmd{TABLE}.
 263
 264 Cases in table lookup files are not consumed after they have been used
 265 once.  This means that data in table lookup files can correspond to
 266 any number of cases in @subcmd{FILE} input files.  Table lookup files are
 267 analogous to lookup tables in traditional relational database systems.
 268
 269 If a table lookup file contains more than one case with a given set of
 270 @subcmd{BY} values, only the first case is used.
 271 @end itemize
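
A table lookup might be sketched as follows.  The file names and
variable names are hypothetical, and both files are assumed to be
sorted on @code{cust_id} already:

@example
MATCH FILES
        /FILE='orders.sav'
        /TABLE='customers.sav' /RENAME=(name=cust_name)
        /BY cust_id.
@end example

Because cases in a table lookup file are not consumed, every case in
@file{orders.sav} with a given @code{cust_id} can pick up the variables
of the same @file{customers.sav} case.  Naming @file{customers.sav} on
@subcmd{FILE} instead would treat it as an ordinary input file.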
 272
 273 When @cmd{MATCH FILES} creates an output case, variables that are only in
 274 files that are not present for the current case are set to the
 275 system-missing value for numeric variables or spaces for string
 276 variables.
 277
 278 @node UPDATE
 279 @section UPDATE
 280 @vindex UPDATE
 281
 282 @display
 283 UPDATE
 284
 285 Per input file:
 286         /FILE=@{*,'@var{file_name}'@}
 287         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 288         [/IN=@var{var_name}]
 289         [/SORT]
 290
 291 Once per command:
 292         /BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]]@dots{}
 293         [/DROP=@var{var_list}]
 294         [/KEEP=@var{var_list}]
 295         [/MAP]
 296 @end display
 297
 298 @cmd{UPDATE} updates a @dfn{master file} by applying modifications
 299 from one or more @dfn{transaction files}.
 300
 301 @cmd{UPDATE} shares the bulk of its syntax with other @pspp{} commands for
 302 combining multiple data files.  @xref{Combining Files Common Syntax},
 303 above, for an explanation of this common syntax.
 304
 305 At least two @subcmd{FILE} subcommands must be specified.  The first @subcmd{FILE}
 306 subcommand names the master file, and the rest name transaction files.
 307 Every input file must either be sorted on the variables named on the
 308 @subcmd{BY} subcommand, or the @subcmd{SORT} subcommand must be used just after the @subcmd{FILE}
 309 subcommand for that input file.
 310
 311 @cmd{UPDATE} uses the variables specified on the @subcmd{BY} subcommand, which is
 312 required, to attempt to match each case in a transaction file with a
 313 case in the master file:
 314
 315 @itemize @bullet
 316 @item
 317 When a match is found, then the values of the variables present in the
 318 transaction file replace those variables' values in the new active
 319 dataset.  If there are matching cases in more than one transaction file,
 320 @pspp{} applies the replacements from the first transaction file, then
 321 from the second transaction file, and so on.  Similarly, if a single
 322 transaction file has cases with duplicate @subcmd{BY} values, then those are
 323 applied in order to the master file.
 324
 325 When a variable in a transaction file has a missing value or a string
 326 variable's value is all blanks, that value is never used to update the
 327 master file.
 328
 329 @item
 330 If a case in the master file has no matching case in any transaction
 331 file, then it is copied unchanged to the output.
 332
 333 @item
 334 If a case in a transaction file has no matching case in the master
 335 file, then it causes a new case to be added to the output, initialized
 336 from the values in the transaction file.
 337 @end itemize
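
The rules above can be illustrated with a small sketch.  The file names,
variable names, and data are invented for the example:

@example
DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
1 10
2 20
3 30
END DATA.
SAVE OUTFILE='master.sav'.

DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
2 25
4 40
END DATA.
SAVE OUTFILE='trans.sav'.

UPDATE /FILE='master.sav' /FILE='trans.sav' /BY id.
LIST.
@end example

According to the rules above, the cases with @code{id} 1 and 3 should
be copied unchanged, the case with @code{id} 2 should have its
@code{score} replaced by 25, and a new case with @code{id} 4 and
@code{score} 40 should be added.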