X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-selection.texi;h=9d8cbbf2f51d61dfa796fa0f0b65e6527a6a0a06;hb=7559bde1ff007c0ac0230fba30ae6c416148e171;hp=d7f4c79fc11d756c2fdbe278a10c0edb1862d62a;hpb=38006c9843177b65a2ce9bee47c1a8eeb6243973;p=pspp

diff --git a/doc/data-selection.texi b/doc/data-selection.texi
index d7f4c79fc1..9d8cbbf2f5 100644
--- a/doc/data-selection.texi
+++ b/doc/data-selection.texi
@@ -1,12 +1,21 @@
-@node Data Selection, Conditionals and Looping, Data Manipulation, Top
+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2017 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+@node Data Selection
 @chapter Selecting data for analysis
 
-This chapter documents PSPP commands that temporarily or permanently
-select data records from the active file for analysis.
+This chapter documents @pspp{} commands that temporarily or permanently
+select data records from the active dataset for analysis.
 
 @menu
 * FILTER:: Exclude cases based on a variable.
-* N OF CASES:: Limit the size of the active file.
+* N OF CASES:: Limit the size of the active dataset.
 * SAMPLE:: Select a specified proportion of cases.
 * SELECT IF:: Permanently delete selected cases.
 * SPLIT FILE:: Do multiple analyses with one command.
@@ -14,19 +23,19 @@ select data records from the active file for analysis.
 * WEIGHT:: Weight cases by a variable.
 @end menu
 
-@node FILTER, N OF CASES, Data Selection, Data Selection
+@node FILTER
 @section FILTER
 @vindex FILTER
 
 @display
-FILTER BY var_name.
+FILTER BY @var{var_name}.
 FILTER OFF.
 @end display
 
 @cmd{FILTER} allows a boolean-valued variable to be used to select
 cases from the data stream for processing.
 
-To set up filtering, specify BY and a variable name. Keyword
+To set up filtering, specify @subcmd{BY} and a variable name. Keyword
 BY is optional but recommended. Cases which have a zero or system- or
 user-missing value are excluded from analysis, but not deleted from the
 data stream. Cases with other values are analyzed.
@@ -40,7 +49,7 @@ filter variable of the required form, then specify that variable on
 Filtering takes place immediately before cases pass to a procedure for
 analysis. Only one filter variable may be active at a time. Normally,
 case filtering continues until it is explicitly turned off with @code{FILTER
-OFF}. However, if @cmd{FILTER} is placed after TEMPORARY, it filters only
+OFF}. However, if @cmd{FILTER} is placed after @cmd{TEMPORARY}, it filters only
 the next procedure or procedure-like command.
 
 @node N OF CASES
@@ -48,89 +57,69 @@ the next procedure or procedure-like command.
 @vindex N OF CASES
 
 @display
-N [OF CASES] num_of_cases [ESTIMATED].
+N [OF CASES] @var{num_of_cases} [ESTIMATED].
 @end display
 
-Sometimes you may want to disregard cases of your input. @cmd{N} can
-do this. @code{N 100} tells PSPP to disregard all cases after the
-first 100.
+@cmd{N OF CASES} limits the number of cases processed by any
+procedures that follow it in the command stream. @code{N OF CASES
+100}, for example, tells @pspp{} to disregard all cases after the first
+100.
 
-If the value specified for @cmd{N} is greater than the number of cases
-read in, the value is ignored.
+When @cmd{N OF CASES} is specified after @cmd{TEMPORARY}, it affects
+only the next procedure (@pxref{TEMPORARY}). Otherwise, cases beyond
+the limit specified are not processed by any later procedure.
 
-@cmd{N} does not discard cases or prevent them from being read. It
-just causes cases beyond the last one specified to be ignored by data
-analysis commands.
+If the limit specified on @cmd{N OF CASES} is greater than the number
+of cases in the active dataset, it has no effect.
 
-A later @cmd{N} command can increase or decrease the number of cases
-selected. (To select all the cases without knowing how many there are,
-specify a very high number: 100000 or whatever you think is large enough.)
+When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT
+IF}, the case limit is applied to the cases obtained after sampling or
+case selection, regardless of how @cmd{N OF CASES} is placed relative
+to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the
+commands @code{N OF CASES 100} and @code{SAMPLE .5} will both randomly
+sample approximately half of the active dataset's cases, then select the
+first 100 of those sampled, regardless of their order in the command
+file.
 
-Transformation procedures performed after @cmd{N} is executed
-@emph{do} cause cases to be discarded.
-
-@cmd{SAMPLE} and @cmd{SELECT IF} have
-precedence over @cmd{N}---the same results are obtained by both of the
-following fragments, given the same random number seeds:
-
-@example
-@i{@dots{}set up, read in data@dots{}}
-N 100.
-SAMPLE .5.
-@i{@dots{}analyze data@dots{}}
-
-@i{@dots{}set up, read in data@dots{}}
-SAMPLE .5.
-N 100.
-@i{@dots{}analyze data@dots{}}
-@end example
-
-Both fragments above first randomly sample approximately half of the
-cases, then select the first 100 of those sampled.
-
-@cmd{N} with the @code{ESTIMATED} keyword gives an
-estimated number of cases before @cmd{DATA LIST} or another command to
-read in data. @code{ESTIMATED} never limits the number of cases
-processed by procedures. PSPP currently does not make use of
-case count estimates.
-
-When @cmd{N} is specified after @cmd{TEMPORARY}, it affects only
-the next procedure (@pxref{TEMPORARY}).
+@cmd{N OF CASES} with the @code{ESTIMATED} keyword gives an estimated
+number of cases before @cmd{DATA LIST} or another command to read in
+data. @code{ESTIMATED} never limits the number of cases processed by
+procedures. @pspp{} currently does not make use of case count estimates.
 
 @node SAMPLE
 @section SAMPLE
 @vindex SAMPLE
 
 @display
-SAMPLE num1 [FROM num2].
+SAMPLE @var{num1} [FROM @var{num2}].
 @end display
 
 @cmd{SAMPLE} randomly samples a proportion of the cases in the active
 file. Unless it follows @cmd{TEMPORARY}, it operates as a
-transformation, permanently removing cases from the active file.
+transformation, permanently removing cases from the active dataset.
 
 The proportion to sample can be expressed as a single number between 0
-and 1. If @code{k} is the number specified, and @code{N} is the number
-of currently-selected cases in the active file, then after
-@code{SAMPLE @var{k}.}, approximately @code{k*N} cases will be
+and 1. If @var{k} is the number specified, and @var{N} is the number
+of currently-selected cases in the active dataset, then after
+@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be
 selected.
 
-The proportion to sample can also be specified in the style @code{SAMPLE
+The proportion to sample can also be specified in the style @subcmd{SAMPLE
 @var{m} FROM @var{N}}. With this style, cases are selected as follows:
 
 @enumerate
 @item
 If @var{N} is equal to the number of currently-selected cases in the
-active file, exactly @var{m} cases will be selected.
+active dataset, exactly @var{m} cases will be selected.
 
 @item
 If @var{N} is greater than the number of currently-selected cases in the
-active file, an equivalent proportion of cases will be selected.
+active dataset, an equivalent proportion of cases will be selected.
 
 @item
 If @var{N} is less than the number of currently-selected cases in the
 active, exactly @var{m} cases will be selected @emph{from the first
-@var{N} cases in the active file.}
+@var{N} cases in the active dataset.}
 @end enumerate
 
 @cmd{SAMPLE} and @cmd{SELECT IF} are performed in
@@ -146,17 +135,17 @@ samples may still result when the file is processed on systems with
 differing endianness or floating-point formats. By default, the random
 number seed is based on the system time.
 
-@node SELECT IF, SPLIT FILE, SAMPLE, Data Selection
+@node SELECT IF
 @section SELECT IF
 @vindex SELECT IF
 
 @display
-SELECT IF expression.
+SELECT IF @var{expression}.
 @end display
 
-@cmd{SELECT IF} selects cases for analysis based on the value of a
-boolean expression. Cases not selected are permanently eliminated
-from the active file, unless @cmd{TEMPORARY} is in effect
+@cmd{SELECT IF} selects cases for analysis based on the value of
+@var{expression}. Cases not selected are permanently eliminated
+from the active dataset, unless @cmd{TEMPORARY} is in effect
 (@pxref{TEMPORARY}).
 
 Specify a boolean expression (@pxref{Expressions}). If the value of the
@@ -172,12 +161,12 @@ When @cmd{SELECT IF} is specified following @cmd{TEMPORARY}
 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 (@pxref{LAG}).
 
-@node SPLIT FILE, TEMPORARY, SELECT IF, Data Selection
+@node SPLIT FILE
 @section SPLIT FILE
 @vindex SPLIT FILE
 
 @display
-SPLIT FILE [@{LAYERED, SEPARATE@}] BY var_list.
+SPLIT FILE [@{LAYERED, SEPARATE@}] BY @var{var_list}.
 SPLIT FILE OFF.
 @end display
 
@@ -192,20 +181,20 @@ An independent analysis is carried out for each group of cases, and the
 variable values for the group are printed along with the analysis.
 
 When a list of variable names is specified, one of the keywords
-LAYERED or SEPARATE may also be specified. If provided, either
+@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either
 keyword are ignored.
 
 Groups are formed only by @emph{adjacent} cases. To create a split
 using a variable where like values are not adjacent in the working file,
 you should first sort the data by that variable (@pxref{SORT CASES}).
 
-Specify OFF to disable @cmd{SPLIT FILE} and resume analysis of the
-entire active file as a single group of data.
+Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the
+entire active dataset as a single group of data.
 
 When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects
 only the next procedure (@pxref{TEMPORARY}).
 
-@node TEMPORARY, WEIGHT, SPLIT FILE, Data Selection
+@node TEMPORARY
 @section TEMPORARY
 @vindex TEMPORARY
 
@@ -216,7 +205,7 @@ TEMPORARY.
 @cmd{TEMPORARY} is used to make the effects of transformations
 following its execution temporary. These transformations will
 affect only the execution of the next procedure or procedure-like
-command. Their effects will not be saved to the active file.
+command. Their effects will not be saved to the active dataset.
 
 The only specification on @cmd{TEMPORARY} is the command name.
 
@@ -238,9 +227,12 @@ BEGIN DATA.
 20
 24
 END DATA.
+
 COMPUTE X=X/2.
+
 TEMPORARY.
 COMPUTE X=X+3.
+
 DESCRIPTIVES X.
 DESCRIPTIVES X.
 @end example
@@ -249,26 +241,26 @@ The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8,
 10.5, 13, 15. The data read by the first @cmd{DESCRIPTIVES} are 1, 2,
 5, 7.5, 10, 12.
 
-@node WEIGHT, , TEMPORARY, Data Selection
+@node WEIGHT
 @section WEIGHT
 @vindex WEIGHT
 
 @display
-WEIGHT BY var_name.
+WEIGHT BY @var{var_name}.
 WEIGHT OFF.
 @end display
 
 @cmd{WEIGHT} assigns cases varying weights,
-changing the frequency distribution of the active file. Execution of
+changing the frequency distribution of the active dataset. Execution of
 @cmd{WEIGHT} is delayed until data have been read.
 
 If a variable name is specified, @cmd{WEIGHT} causes the values of that
 variable to be used as weighting factors for subsequent statistical
-procedures. Use of keyword BY is optional but recommended. Weighting
+procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting
 variables must be numeric. Scratch variables may not be used for
 weighting (@pxref{Scratch Variables}).
 
-When OFF is specified, subsequent statistical procedures will weight all
+When @subcmd{OFF} is specified, subsequent statistical procedures will weight all
 cases equally.
 
 A positive integer weighting factor @var{w} on a case will yield the
@@ -282,6 +274,5 @@ values are not treated specially.
 When @cmd{WEIGHT} is specified after @cmd{TEMPORARY}, it affects only
 the next procedure (@pxref{TEMPORARY}).
 
-@cmd{WEIGHT} does not cause cases in the active file to be replicated in
-memory.
-@setfilename ignored
+@cmd{WEIGHT} does not cause cases in the active dataset to be
+replicated in memory.