- [IF](commands/data/if.md)
- [RECODE](commands/data/recode.md)
- [SORT CASES](commands/data/sort-cases.md)
+- [Selecting Data](commands/selection/index.md)
+ - [FILTER](commands/selection/filter.md)
+ - [N OF CASES](commands/selection/n.md)
+ - [SAMPLE](commands/selection/sample.md)
+ - [SELECT IF](commands/selection/select-if.md)
+ - [SPLIT FILE](commands/selection/split-file.md)
+ - [TEMPORARY](commands/selection/temporary.md)
+ - [WEIGHT](commands/selection/weight.md)
# Developer Documentation
--- /dev/null
+# FILTER
+
+```
+FILTER BY VAR_NAME.
+FILTER OFF.
+```
+
+`FILTER` allows a boolean-valued variable to be used to select cases
+from the data stream for processing.
+
+To set up filtering, specify `BY` and a variable name. Keyword `BY` is
+optional but recommended. Cases which have a zero or system- or
+user-missing value are excluded from analysis, but not deleted from
+the data stream. Cases with other values are analyzed. To filter
+based on a different condition, use transformations such as `COMPUTE`
+or `RECODE` to compute a filter variable of the required form, then
+specify that variable on `FILTER`.
+
+`FILTER OFF` turns off case filtering.
+
+Filtering takes place immediately before cases pass to a procedure for
+analysis. Only one filter variable may be active at a time.
+Normally, case filtering continues until it is explicitly turned off
+with `FILTER OFF`. However, if `FILTER` is placed after `TEMPORARY`,
+it filters only the next procedure or procedure-like command.
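+
+As a minimal sketch of the steps above, the following syntax computes
+a 0/1 filter variable and then filters on it. The data and the
+variable names `age`, `income`, and `adult` are made up for
+illustration.
+
+```
+data list notable list /age (f3.0) income (f8.0).
+begin data.
+15 0
+22 31000
+41 52000
+17 4000
+end data.
+
+* Hypothetical filter variable, 1 for adults and 0 otherwise.
+compute adult = (age >= 18).
+filter by adult.
+
+* Only cases with a nonzero, nonmissing value of adult are analyzed.
+descriptives /variables = income.
+
+filter off.
+
+* All four cases are analyzed again.
+descriptives /variables = income.
+```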
+
--- /dev/null
+# Selecting Data
+
+This chapter documents PSPP commands that temporarily or permanently
+select data records from the active dataset for analysis.
--- /dev/null
+# N OF CASES
+
+```
+N [OF CASES] NUM_OF_CASES [ESTIMATED].
+```
+
+`N OF CASES` limits the number of cases processed by any procedures
+that follow it in the command stream. `N OF CASES 100`, for example,
+tells PSPP to disregard all cases after the first 100.
+
+When `N OF CASES` is specified after [`TEMPORARY`](temporary.md), it
+affects only the next procedure. Otherwise, cases beyond the limit
+specified are not processed by any later procedure.
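+
+A short sketch of the difference, assuming the `personnel.sav` file
+and its `salary` variable from the [`SELECT IF`](select-if.md)
+example. The limit of 10 is arbitrary.
+
+```
+get file = 'personnel.sav'.
+
+* The limit applies only to the next procedure.
+temporary.
+n of cases 10.
+descriptives salary.
+
+* This procedure sees every case again.
+descriptives salary.
+
+* Without TEMPORARY, the limit applies to all later procedures.
+n of cases 10.
+descriptives salary.
+```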
+
+If the limit specified on `N OF CASES` is greater than the number of
+cases in the active dataset, it has no effect.
+
+When `N OF CASES` is used along with `SAMPLE` or `SELECT IF`, the
+case limit is applied to the cases obtained after sampling or case
+selection, regardless of how `N OF CASES` is placed relative to `SAMPLE`
+or `SELECT IF` in the command file. Thus, the commands `N OF CASES 100`
+and `SAMPLE .5` together randomly sample approximately half of the
+active dataset's cases, then select the first 100 of those sampled,
+regardless of their order in the command file.
+
+`N OF CASES` with the `ESTIMATED` keyword gives an estimated number of
+cases before `DATA LIST` or another command to read in data.
+`ESTIMATED` never limits the number of cases processed by procedures.
+PSPP currently does not use case count estimates.
+
--- /dev/null
+# SAMPLE
+
+```
+SAMPLE NUM1 [FROM NUM2].
+```
+
+`SAMPLE` randomly samples a proportion of the cases in the active
+dataset. Unless it follows `TEMPORARY`, it permanently removes cases
+from the active dataset.
+
+The proportion to sample may be expressed as a single number between 0
+and 1. If `N` is the number of currently-selected cases in the active
+dataset, then `SAMPLE K.` will select approximately `K×N` cases.
+
+The proportion to sample can also be specified in the style `SAMPLE M
+FROM N`. With this style, cases are selected as follows:
+
+1. If `N` is equal to the number of currently-selected cases in the
+ active dataset, exactly `M` cases are selected.
+
+2. If `N` is greater than the number of currently-selected cases in
+ the active dataset, an equivalent proportion of cases is selected.
+
+3. If `N` is less than the number of currently-selected cases in the
+ active dataset, exactly `M` cases are selected *from the first `N`
+ cases* in the active dataset.
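+
+The two styles can be sketched as follows, assuming the 56-case
+`personnel.sav` file and its `salary` variable from the
+[`SELECT IF`](select-if.md) example. `TEMPORARY` is used here so that
+each sample is drawn from the full dataset rather than from the
+previous sample.
+
+```
+get file = 'personnel.sav'.
+
+* Proportion style, which keeps roughly half of the cases.
+temporary.
+sample .5.
+descriptives salary.
+
+* M FROM N style with N equal to the case count, so exactly 10 cases.
+temporary.
+sample 10 from 56.
+descriptives salary.
+```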
+
+`SAMPLE` and `SELECT IF` are performed in the order specified by the
+syntax file.
+
+`SAMPLE` is always performed before [`N OF CASES`](n.md), regardless
+of ordering in the syntax file.
+
+The same values for `SAMPLE` may result in different samples. To
+obtain the same sample, use the `SET` command to set the random number
+seed to the same value before each `SAMPLE`. Different samples may
+still result when the file is processed on systems with different
+machine types or PSPP versions. By default, the random number seed is
+based on the system time.
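+
+For example, the following sketch should draw the same sample twice
+on a given system and PSPP version. The seed value 12345 is
+arbitrary, and `personnel.sav` with its `salary` variable is assumed
+from the [`SELECT IF`](select-if.md) example.
+
+```
+get file = 'personnel.sav'.
+
+set seed = 12345.
+temporary.
+sample .5.
+descriptives salary.
+
+* Resetting the seed before the second SAMPLE repeats the first draw.
+set seed = 12345.
+temporary.
+sample .5.
+descriptives salary.
+```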
+
--- /dev/null
+# SELECT IF
+
+```
+SELECT IF EXPRESSION.
+```
+
+`SELECT IF` selects cases for analysis based on the value of
+EXPRESSION. Cases not selected are permanently eliminated from the
+active dataset, unless [`TEMPORARY`](temporary.md) is in effect.
+
+Specify a [boolean
+expression](../../language/expressions/index.md#boolean-values). If
+the expression is true for a particular case, the case is analyzed.
+If the expression is false or missing, then the case is deleted from
+the data stream.
+
+Place `SELECT IF` early in the command file. Cases that are deleted
+early can be processed more efficiently in time and space. Once cases
+have been deleted from the active dataset using `SELECT IF` they
+cannot be re-instated. If you want to be able to re-instate cases,
+then use [`FILTER`](filter.md) instead.
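+
+Both ways of keeping the unselected cases, `TEMPORARY` and `FILTER`,
+can be sketched as follows. The sketch uses the `personnel.sav` file
+and the `dob` and `salary` variables from the example below. The
+variable name `young` is hypothetical.
+
+```
+get file = 'personnel.sav'.
+
+* Alternative 1: the selection lasts for one procedure only.
+temporary.
+select if dob > date.dmy (31,12,1999).
+descriptives salary.
+
+* Alternative 2: filter on a computed variable, then turn it off.
+compute young = (dob > date.dmy (31,12,1999)).
+filter by young.
+descriptives salary.
+filter off.
+```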
+
+When `SELECT IF` is specified following [`TEMPORARY`](temporary.md),
+the [`LAG`](../../language/expressions/functions/miscellaneous.md)
+function may not be used.
+
+## Example
+
+A shop steward is interested in the salaries of younger personnel in a
+firm. The file `personnel.sav` provides the salaries of all the
+workers and their dates of birth. The syntax below shows how `SELECT
+IF` can be used to limit analysis only to those persons born after
+December 31, 1999.
+
+```
+get file = 'personnel.sav'.
+
+echo 'Salaries of all personnel'.
+descriptives salary.
+
+echo 'Salaries of personnel born after December 31 1999'.
+select if dob > date.dmy (31,12,1999).
+descriptives salary.
+```
+
+From the output shown below, one can see that there are 56 persons
+listed in the dataset, and 17 of them were born after December 31,
+1999.
+
+```
+Salaries of all personnel
+
+ Descriptive Statistics
+┌────────────────────────┬──┬────────┬───────┬───────┬───────┐
+│ │ N│ Mean │Std Dev│Minimum│Maximum│
+├────────────────────────┼──┼────────┼───────┼───────┼───────┤
+│Annual salary before tax│56│40028.97│8721.17│$23,451│$57,044│
+│Valid N (listwise) │56│ │ │ │ │
+│Missing N (listwise) │ 0│ │ │ │ │
+└────────────────────────┴──┴────────┴───────┴───────┴───────┘
+
+Salaries of personnel born after December 31 1999
+
+ Descriptive Statistics
+┌────────────────────────┬──┬────────┬───────┬───────┬───────┐
+│ │ N│ Mean │Std Dev│Minimum│Maximum│
+├────────────────────────┼──┼────────┼───────┼───────┼───────┤
+│Annual salary before tax│17│31828.59│4454.80│$23,451│$39,504│
+│Valid N (listwise) │17│ │ │ │ │
+│Missing N (listwise) │ 0│ │ │ │ │
+└────────────────────────┴──┴────────┴───────┴───────┴───────┘
+```
+
+Note that the `personnel.sav` file from which the data were read is
+unaffected. The transformation affects only the active dataset.
+
--- /dev/null
+# SPLIT FILE
+
+```
+SPLIT FILE [{LAYERED, SEPARATE}] BY VAR_LIST.
+SPLIT FILE OFF.
+```
+
+`SPLIT FILE` allows multiple sets of data present in one data file to
+be analyzed separately using single statistical procedure commands.
+
+Specify a list of variable names to analyze multiple sets of data
+separately. Groups of adjacent cases having the same values for these
+variables are analyzed by statistical procedure commands as one group.
+An independent analysis is carried out for each group of cases, and the
+variable values for the group are printed along with the analysis.
+
+When a list of variable names is specified, one of the keywords
+`LAYERED` or `SEPARATE` may also be specified. With `LAYERED`, which
+is the default, the separate analyses for each group are presented
+together in a single table. With `SEPARATE`, each analysis is
+presented in a separate table. Not all procedures honor the
+distinction.
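+
+A brief sketch of the two keywords, assuming the `horticulture.sav`
+file and its `treatment` and `yield` variables from the example
+below:
+
+```
+get file = 'horticulture.sav'.
+sort cases by treatment.
+
+* Explicitly request the default layout.
+split file layered by treatment.
+descriptives /variables = yield.
+
+* Request a separate table for each group instead.
+split file separate by treatment.
+descriptives /variables = yield.
+```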
+
+Groups are formed only by _adjacent_ cases. To create a split using a
+variable where like values are not adjacent in the active dataset, first
+[sort the data](../../commands/data/sort-cases.md) by that variable.
+
+Specify `OFF` to disable `SPLIT FILE` and resume analysis of the
+entire active dataset as a single group of data.
+
+When `SPLIT FILE` is specified after [`TEMPORARY`](temporary.md), it
+affects only the next procedure.
+
+## Example
+
+The file `horticulture.sav` contains data describing the yield of a
+number of horticultural specimens which have been subjected to various
+treatments. If we wanted to investigate linear statistics of the
+yield, one way to do this is to use the `DESCRIPTIVES` command.
+However, it is reasonable to expect the mean to be
+different depending on the treatment. So we might want to perform
+three separate procedures -- one for each treatment.[^1] The following
+syntax shows how this can be done automatically using the `SPLIT FILE`
+command.
+
+[^1]: There are other, possibly better, ways to achieve a similar
+result using the `MEANS` or `EXAMINE` commands.
+
+```
+get file='horticulture.sav'.
+
+* Ensure cases are sorted before splitting.
+sort cases by treatment.
+
+split file by treatment.
+
+* Run descriptives on the yield variable.
+descriptives /variable = yield.
+```
+
+In the following output, you can see that the table of descriptive
+statistics appears 3 times, once for each value of `treatment`. In this
+example `N`, the number of observations, is identical in all splits.
+This is because the experiment was deliberately designed that way.
+In general, however, one can expect a different `N` for each split.
+
+```
+ Split Values
+┌─────────┬───────┐
+│Variable │ Value │
+├─────────┼───────┤
+│treatment│control│
+└─────────┴───────┘
+
+ Descriptive Statistics
+┌────────────────────┬──┬─────┬───────┬───────┬───────┐
+│ │ N│ Mean│Std Dev│Minimum│Maximum│
+├────────────────────┼──┼─────┼───────┼───────┼───────┤
+│yield │30│51.23│ 8.28│ 37.86│ 68.59│
+│Valid N (listwise) │30│ │ │ │ │
+│Missing N (listwise)│ 0│ │ │ │ │
+└────────────────────┴──┴─────┴───────┴───────┴───────┘
+
+ Split Values
+┌─────────┬────────────┐
+│Variable │ Value │
+├─────────┼────────────┤
+│treatment│conventional│
+└─────────┴────────────┘
+
+ Descriptive Statistics
+┌────────────────────┬──┬─────┬───────┬───────┬───────┐
+│ │ N│ Mean│Std Dev│Minimum│Maximum│
+├────────────────────┼──┼─────┼───────┼───────┼───────┤
+│yield │30│53.57│ 8.92│ 36.30│ 70.66│
+│Valid N (listwise) │30│ │ │ │ │
+│Missing N (listwise)│ 0│ │ │ │ │
+└────────────────────┴──┴─────┴───────┴───────┴───────┘
+
+ Split Values
+┌─────────┬───────────┐
+│Variable │ Value │
+├─────────┼───────────┤
+│treatment│traditional│
+└─────────┴───────────┘
+
+ Descriptive Statistics
+┌────────────────────┬──┬─────┬───────┬───────┬───────┐
+│ │ N│ Mean│Std Dev│Minimum│Maximum│
+├────────────────────┼──┼─────┼───────┼───────┼───────┤
+│yield │30│56.87│ 8.88│ 39.08│ 75.93│
+│Valid N (listwise) │30│ │ │ │ │
+│Missing N (listwise)│ 0│ │ │ │ │
+└────────────────────┴──┴─────┴───────┴───────┴───────┘
+```
+
+Example 13.3: The results of running `DESCRIPTIVES` with an active split
+
+Unless `TEMPORARY` was used, after a split has been defined for a
+dataset it remains active until explicitly disabled.
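+
+Continuing the example above, a sketch of disabling the split
+explicitly:
+
+```
+split file off.
+
+* The whole dataset is analyzed as a single group again.
+descriptives /variables = yield.
+```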
+
--- /dev/null
+# TEMPORARY
+
+```
+TEMPORARY.
+```
+
+`TEMPORARY` makes the effects of the transformations that follow it
+temporary. These transformations affect only the execution of the
+next procedure or procedure-like command. Their effects are not
+saved to the active dataset.
+
+The only specification on `TEMPORARY` is the command name.
+
+`TEMPORARY` may not appear within a `DO IF` or `LOOP` construct. It
+may appear only once between procedures and procedure-like commands.
+
+Scratch variables cannot be used following `TEMPORARY`.
+
+## Example
+
+In the syntax below, there are two `COMPUTE` transformations. One of
+them immediately follows a `TEMPORARY` command, and therefore affects
+only the next procedure, which in this case is the first
+`DESCRIPTIVES` command.
+
+```
+data list notable /x 1-2.
+begin data.
+ 2
+ 4
+10
+15
+20
+24
+end data.
+
+compute x=x/2.
+
+temporary.
+compute x=x+3.
+
+descriptives x.
+descriptives x.
+```
+
+The data read by the first `DESCRIPTIVES` procedure are 4, 5, 8, 10.5,
+13, 15. The data read by the second `DESCRIPTIVES` procedure are 1,
+2, 5, 7.5, 10, 12. This is because the second `COMPUTE`
+transformation has no effect on the second `DESCRIPTIVES` procedure.
+You can check these figures in the following output.
+
+```
+ Descriptive Statistics
+┌────────────────────┬─┬────┬───────┬───────┬───────┐
+│ │N│Mean│Std Dev│Minimum│Maximum│
+├────────────────────┼─┼────┼───────┼───────┼───────┤
+│x │6│9.25│ 4.38│ 4│ 15│
+│Valid N (listwise) │6│ │ │ │ │
+│Missing N (listwise)│0│ │ │ │ │
+└────────────────────┴─┴────┴───────┴───────┴───────┘
+
+ Descriptive Statistics
+┌────────────────────┬─┬────┬───────┬───────┬───────┐
+│ │N│Mean│Std Dev│Minimum│Maximum│
+├────────────────────┼─┼────┼───────┼───────┼───────┤
+│x │6│6.25│ 4.38│ 1│ 12│
+│Valid N (listwise) │6│ │ │ │ │
+│Missing N (listwise)│0│ │ │ │ │
+└────────────────────┴─┴────┴───────┴───────┴───────┘
+```
--- /dev/null
+# WEIGHT
+
+```
+WEIGHT BY VAR_NAME.
+WEIGHT OFF.
+```
+
+`WEIGHT` assigns cases varying weights, changing the frequency
+distribution of the active dataset. Execution of `WEIGHT` is delayed
+until data have been read.
+
+If a variable name is specified, `WEIGHT` causes the values of that
+variable to be used as weighting factors for subsequent statistical
+procedures. Use of keyword `BY` is optional but recommended.
+Weighting variables must be numeric. Scratch variables may not be
+used for weighting.
+
+When `OFF` is specified, subsequent statistical procedures weight all
+cases equally.
+
+A positive integer weighting factor `W` on a case yields the same
+statistical output as would replicating the case `W` times. A
+weighting factor of 0 is treated for statistical purposes as if the
+case did not exist in the input. Weighting values need not be
+integers, but negative and system-missing values for the weighting
+variable are interpreted as weighting factors of 0. User-missing
+values are not treated specially.
+
+When `WEIGHT` is specified after [`TEMPORARY`](temporary.md), it
+affects only the next procedure.
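+
+For instance, this sketch weights only the first of two procedures.
+It reuses the `item` and `quantity` data from the example below.
+
+```
+data list notable list /item (a16) quantity (f8.0).
+begin data.
+nuts 345
+screws 10034
+washers 32012
+bolts 876
+end data.
+
+temporary.
+weight by quantity.
+
+* Weighted by quantity.
+frequencies /variables = item.
+
+* Unweighted again, because TEMPORARY covers only one procedure.
+frequencies /variables = item.
+```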
+
+`WEIGHT` does not cause cases in the active dataset to be replicated
+in memory.
+
+## Example
+
+One could define a dataset containing an inventory of stock items. It
+would be reasonable to use a string variable for a description of the
+item, and a numeric variable for the number in stock, like in the
+syntax below.
+
+```
+data list notable list /item (a16) quantity (f8.0).
+begin data
+nuts 345
+screws 10034
+washers 32012
+bolts 876
+end data.
+
+echo 'Unweighted frequency table'.
+frequencies /variables = item /format=dfreq.
+
+weight by quantity.
+
+echo 'Weighted frequency table'.
+frequencies /variables = item /format=dfreq.
+```
+
+One analysis that would surely be of interest is the relative
+amounts of each item in stock. However, without setting a weight
+variable, `FREQUENCIES` does not tell us what we
+want to know, since there is only one case for each stock item. The
+output below shows the difference between the weighted and unweighted
+frequency tables.
+
+```
+Unweighted frequency table
+
+ item
+┌─────────────┬─────────┬───────┬─────────────┬──────────────────┐
+│ │Frequency│Percent│Valid Percent│Cumulative Percent│
+├─────────────┼─────────┼───────┼─────────────┼──────────────────┤
+│Valid bolts │ 1│ 25.0%│ 25.0%│ 25.0%│
+│ nuts │ 1│ 25.0%│ 25.0%│ 50.0%│
+│ screws │ 1│ 25.0%│ 25.0%│ 75.0%│
+│ washers│ 1│ 25.0%│ 25.0%│ 100.0%│
+├─────────────┼─────────┼───────┼─────────────┼──────────────────┤
+│Total │ 4│ 100.0%│ │ │
+└─────────────┴─────────┴───────┴─────────────┴──────────────────┘
+
+Weighted frequency table
+
+ item
+┌─────────────┬─────────┬───────┬─────────────┬──────────────────┐
+│ │Frequency│Percent│Valid Percent│Cumulative Percent│
+├─────────────┼─────────┼───────┼─────────────┼──────────────────┤
+│Valid washers│ 32012│ 74.0%│ 74.0%│ 74.0%│
+│ screws │ 10034│ 23.2%│ 23.2%│ 97.2%│
+│ bolts │ 876│ 2.0%│ 2.0%│ 99.2%│
+│ nuts │ 345│ .8%│ .8%│ 100.0%│
+├─────────────┼─────────┼───────┼─────────────┼──────────────────┤
+│Total │ 43267│ 100.0%│ │ │
+└─────────────┴─────────┴───────┴─────────────┴──────────────────┘
+```