X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-selection.texi;h=b92d1826efe79d804d582aeae8384197e2064a13;hb=86e6b87d7ad411378c3204fe87504c7e6749be78;hp=5c846b3abc318b2dc87b07544d5c4bef633a20fd;hpb=f1141d27ca616a8c8edc2a1f18067085ceaaf448;p=pspp diff --git a/doc/data-selection.texi b/doc/data-selection.texi index 5c846b3abc..b92d1826ef 100644 --- a/doc/data-selection.texi +++ b/doc/data-selection.texi @@ -1,5 +1,5 @@ @c PSPP - a program for statistical analysis. -@c Copyright (C) 2017 Free Software Foundation, Inc. +@c Copyright (C) 2017, 2020 Free Software Foundation, Inc. @c Permission is granted to copy, distribute and/or modify this document @c under the terms of the GNU Free Documentation License, Version 1.3 @c or any later version published by the Free Software Foundation; @@ -76,7 +76,7 @@ When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT IF}, the case limit is applied to the cases obtained after sampling or case selection, regardless of how @cmd{N OF CASES} is placed relative to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the -commands @code{N OF CASES 100} and @code{SAMPLE .5} will both randomly +commands @code{N OF CASES 100} and @code{SAMPLE .5} both randomly sample approximately half of the active dataset's cases, then select the first 100 of those sampled, regardless of their order in the command file. @@ -101,7 +101,7 @@ transformation, permanently removing cases from the active dataset. The proportion to sample can be expressed as a single number between 0 and 1. If @var{k} is the number specified, and @var{N} is the number of currently-selected cases in the active dataset, then after -@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be +@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases are selected. 
The proportion to sample can also be specified in the style @subcmd{SAMPLE @@ -110,15 +110,15 @@ The proportion to sample can also be specified in the style @subcmd{SAMPLE @enumerate @item If @var{N} is equal to the number of currently-selected cases in the -active dataset, exactly @var{m} cases will be selected. +active dataset, exactly @var{m} cases are selected. @item If @var{N} is greater than the number of currently-selected cases in the -active dataset, an equivalent proportion of cases will be selected. +active dataset, an equivalent proportion of cases are selected. @item If @var{N} is less than the number of currently-selected cases in the -active, exactly @var{m} cases will be selected @emph{from the first +active, exactly @var{m} cases are selected @emph{from the first @var{N} cases in the active dataset.} @end enumerate @@ -149,18 +149,45 @@ from the active dataset, unless @cmd{TEMPORARY} is in effect (@pxref{TEMPORARY}). Specify a boolean expression (@pxref{Expressions}). If the value of the -expression is true for a particular case, the case will be analyzed. If -the expression has a false or missing value, then the case will be +expression is true for a particular case, the case is analyzed. If +the expression has a false or missing value, then the case is deleted from the data stream. Place @cmd{SELECT IF} as early in the command file as possible. Cases that are deleted early can be processed more efficiently in time and space. +Once cases have been deleted from the active dataset using @cmd{SELECT IF} they +cannot be re-instated. +If you want to be able to re-instate cases, then use @cmd{FILTER} (@pxref{FILTER}) +instead. When @cmd{SELECT IF} is specified following @cmd{TEMPORARY} (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used (@pxref{LAG}). +@subsection Example Select-If + +A shop steward is interested in the salaries of younger personnel in a firm. 
+The file @file{personnel.sav} provides the salaries of all the workers and their +dates of birth. The syntax in @ref{select-if:ex} shows how @cmd{SELECT IF} can +be used to limit analysis only to those persons born after December 31, 1999. + +@float Example, select-if:ex +@psppsyntax {select-if.sps} +@caption {Using @cmd{SELECT IF} to select persons born on or after a certain date.} +@end float + +From @ref{select-if:res} one can see that there are 56 persons listed in the dataset, +and 17 of them were born after December 31, 1999. + +@float Result, select-if:res +@psppoutput {select-if} +@caption {Salary descriptives before and after the @cmd{SELECT IF} transformation.} +@end float + +Note that the @file{personnel.sav} file from which the data were read is unaffected. +The transformation affects only the active file. + @node SPLIT FILE @section SPLIT FILE @vindex SPLIT FILE @@ -181,12 +208,15 @@ An independent analysis is carried out for each group of cases, and the variable values for the group are printed along with the analysis. When a list of variable names is specified, one of the keywords -@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either -keyword are ignored. +@subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. With +@subcmd{LAYERED}, which is the default, the separate analyses for each +group are presented together in a single table. With +@subcmd{SEPARATE}, each analysis is presented in a separate table. +Not all procedures honor the distinction. Groups are formed only by @emph{adjacent} cases. To create a split using a variable where like values are not adjacent in the working file, -you should first sort the data by that variable (@pxref{SORT CASES}). +first sort the data by that variable (@pxref{SORT CASES}). Specify @subcmd{OFF} to disable @cmd{SPLIT FILE} and resume analysis of the entire active dataset as a single group of data. @@ -221,12 +251,23 @@ that way. 
However in general one can expect a different @samp{N} for each split. @float Example, split:res -@psppsyntax {split.out} +@psppoutput {split} @caption {The results of running @cmd{DESCRIPTIVES} with an active split} @end float Unless @cmd{TEMPORARY} was used, after a split has been defined for a dataset it remains active until explicitly disabled. +In the graphical user interface, the active split variable (if any) is +displayed in the status bar (@pxref{split-status-bar:scr}). +If a dataset is saved to a system file (@pxref{SAVE}) whilst a split +is active, the split status is stored in the file and will be +automatically loaded when that file is loaded. + +@float Screenshot, split-status-bar:scr +@psppimage {split-status-bar} +@caption {The status bar indicating that the data set is split using the @exvar{treatment} variable} +@end float + @node TEMPORARY @section TEMPORARY @@ -237,9 +278,9 @@ TEMPORARY. @end display @cmd{TEMPORARY} is used to make the effects of transformations -following its execution temporary. These transformations will +following its execution temporary. These transformations affect only the execution of the next procedure or procedure-like -command. Their effects will not be saved to the active dataset. +command. Their effects are not saved to the active dataset. The only specification on @cmd{TEMPORARY} is the command name. @@ -249,31 +290,30 @@ procedure-like commands. Scratch variables cannot be used following @cmd{TEMPORARY}. -An example may help to clarify: +@subsection Example Temporary -@example -DATA LIST /X 1-2. -BEGIN DATA. - 2 - 4 -10 -15 -20 -24 -END DATA. +In @ref{temporary:ex} there are two @cmd{COMPUTE} transformations. One +of them immediately follows a @cmd{TEMPORARY} command, and therefore has +effect only for the next procedure, which in this case is the first +@cmd{DESCRIPTIVES} command. -COMPUTE X=X/2. 
+@float Example, temporary:ex +@psppsyntax {temporary.sps} +@caption {Running a @cmd{COMPUTE} transformation after @cmd{TEMPORARY}} +@end float -TEMPORARY. -COMPUTE X=X+3. +The data read by the first @cmd{DESCRIPTIVES} procedure are 4, 5, 8, +10.5, 13, 15. The data read by the second @cmd{DESCRIPTIVES} procedure are 1, 2, +5, 7.5, 10, 12. This is because the second @cmd{COMPUTE} transformation +has no effect on the second @cmd{DESCRIPTIVES} procedure. You can check these +figures in @ref{temporary:res}. -DESCRIPTIVES X. -DESCRIPTIVES X. -@end example +@float Result, temporary:res +@psppoutput {temporary} +@caption {The results of running two consecutive @cmd{DESCRIPTIVES} commands after + a temporary transformation} +@end float -The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8, -10.5, 13, 15. The data read by the first @cmd{DESCRIPTIVES} are 1, 2, -5, 7.5, 10, 12. @node WEIGHT @section WEIGHT @@ -294,10 +334,10 @@ procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting variables must be numeric. Scratch variables may not be used for weighting (@pxref{Scratch Variables}). -When @subcmd{OFF} is specified, subsequent statistical procedures will weight all +When @subcmd{OFF} is specified, subsequent statistical procedures weight all cases equally. -A positive integer weighting factor @var{w} on a case will yield the +A positive integer weighting factor @var{w} on a case yields the same statistical output as would replicating the case @var{w} times. A weighting factor of 0 is treated for statistical purposes as if the case did not exist in the input. Weighting values need not be @@ -325,10 +365,10 @@ item, and a numeric variable for the number in stock, like in @ref{weight:ex}. One analysis which most surely would be of interest is the relative amounts or each item in stock. 
-However without setting a weight variable, @cmd{FREQUENCIES} (@pxref{FREQUENCIES}) will not -tell us what we want to know, since there is only one case for each stock item. -@ref{weight:res} shows the difference between the weighted and unweighted -frequency tables. +However without setting a weight variable, @cmd{FREQUENCIES} +(@pxref{FREQUENCIES}) does not tell us what we want to know, since +there is only one case for each stock item. @ref{weight:res} shows the +difference between the weighted and unweighted frequency tables. @float Example, weight:res @psppoutput {weight}