X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=doc%2Fdata-selection.texi;h=c2f3ab31a4ddf0fdd33797a83981acd820e35590;hb=refs%2Fheads%2Fctables7;hp=a8cb12d95364d95c0525ec06974cac9c7981fdfc;hpb=1b1837591924226078c96db15888b68beec2ef6d;p=pspp diff --git a/doc/data-selection.texi b/doc/data-selection.texi index a8cb12d953..c2f3ab31a4 100644 --- a/doc/data-selection.texi +++ b/doc/data-selection.texi @@ -1,3 +1,12 @@ +@c PSPP - a program for statistical analysis. +@c Copyright (C) 2017, 2020 Free Software Foundation, Inc. +@c Permission is granted to copy, distribute and/or modify this document +@c under the terms of the GNU Free Documentation License, Version 1.3 +@c or any later version published by the Free Software Foundation; +@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. +@c A copy of the license is included in the section entitled "GNU +@c Free Documentation License". +@c @node Data Selection @chapter Selecting data for analysis @@ -67,7 +76,7 @@ When @cmd{N OF CASES} is used along with @cmd{SAMPLE} or @cmd{SELECT IF}, the case limit is applied to the cases obtained after sampling or case selection, regardless of how @cmd{N OF CASES} is placed relative to @cmd{SAMPLE} or @cmd{SELECT IF} in the command file. Thus, the -commands @code{N OF CASES 100} and @code{SAMPLE .5} will both randomly +commands @code{N OF CASES 100} and @code{SAMPLE .5} both randomly sample approximately half of the active dataset's cases, then select the first 100 of those sampled, regardless of their order in the command file. @@ -92,7 +101,7 @@ transformation, permanently removing cases from the active dataset. The proportion to sample can be expressed as a single number between 0 and 1. If @var{k} is the number specified, and @var{N} is the number of currently-selected cases in the active dataset, then after -@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases will be +@subcmd{SAMPLE @var{k}.}, approximately @var{k}*@var{N} cases are selected. The proportion to sample can also be specified in the style @subcmd{SAMPLE @@ -101,15 +110,15 @@ The proportion to sample can also be specified in the style @subcmd{SAMPLE @enumerate @item If @var{N} is equal to the number of currently-selected cases in the -active dataset, exactly @var{m} cases will be selected. +active dataset, exactly @var{m} cases are selected. @item If @var{N} is greater than the number of currently-selected cases in the -active dataset, an equivalent proportion of cases will be selected. +active dataset, an equivalent proportion of cases are selected. @item If @var{N} is less than the number of currently-selected cases in the -active, exactly @var{m} cases will be selected @emph{from the first +active, exactly @var{m} cases are selected @emph{from the first @var{N} cases in the active dataset.} @end enumerate @@ -140,18 +149,45 @@ from the active dataset, unless @cmd{TEMPORARY} is in effect (@pxref{TEMPORARY}). Specify a boolean expression (@pxref{Expressions}). If the value of the -expression is true for a particular case, the case will be analyzed. If -the expression has a false or missing value, then the case will be +expression is true for a particular case, the case is analyzed. If +the expression has a false or missing value, then the case is deleted from the data stream. Place @cmd{SELECT IF} as early in the command file as possible. Cases that are deleted early can be processed more efficiently in time and space. +Once cases have been deleted from the active dataset using @cmd{SELECT IF} they +cannot be re-instated. +If you want to be able to re-instate cases, then use @cmd{FILTER} (@pxref{FILTER}) +instead. When @cmd{SELECT IF} is specified following @cmd{TEMPORARY} (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used (@pxref{LAG}). +@subsection Example Select-If + +A shop steward is interested in the salaries of younger personnel in a firm. +The file @file{personnel.sav} provides the salaries of all the workers and their +dates of birth. The syntax in @ref{select-if:ex} shows how @cmd{SELECT IF} can +be used to limit analysis only to those persons born after December 31, 1999. + +@float Example, select-if:ex +@psppsyntax {select-if.sps} +@caption {Using @cmd{SELECT IF} to select persons born on or after a certain date.} +@end float + +From @ref{select-if:res} one can see that there are 56 persons listed in the dataset, +and 17 of them were born after December 31, 1999. + +@float Result, select-if:res +@psppoutput {select-if} +@caption {Salary descriptives before and after the @cmd{SELECT IF} transformation.} +@end float + +Note that the @file{personnel.sav} file from which the data were read is unaffected. +The transformation affects only the active file. + @node SPLIT FILE @section SPLIT FILE @vindex SPLIT FILE @@ -175,7 +211,7 @@ When a list of variable names is specified, one of the keywords @subcmd{LAYERED} or @subcmd{SEPARATE} may also be specified. If provided, either keyword are ignored. -Groups are formed only by @emph{adjacent} cases. To create a split +Groups are formed only by @emph{adjacent} cases. To create a split using a variable where like values are not adjacent in the working file, you should first sort the data by that variable (@pxref{SORT CASES}). @@ -185,6 +221,51 @@ entire active dataset as a single group of data. When @cmd{SPLIT FILE} is specified after @cmd{TEMPORARY}, it affects only the next procedure (@pxref{TEMPORARY}). +@subsection Example Split + +The file @file{horticulture.sav} contains data describing the @exvar{yield} +of a number of horticultural specimens which have been subjected to +various @exvar{treatment}s. If we wanted to investigate linear statistics +of the @exvar{yeild}, one way to do this is using the @cmd{DESCRIPTIVES} (@pxref{DESCRIPTIVES}). +However, it is reasonable to expect the mean to be different depending +on the @exvar{treatment}. So we might want to perform three separate +procedures --- one for each treatment. +@footnote{There are other, possibly better, ways to achieve a similar result +using the @cmd{MEANS} or @cmd{EXAMINE} commands.} +@ref{split:ex} shows how this can be done automatically using +the @cmd{SPLIT FILE} command. + +@float Example, split:ex +@psppsyntax {split.sps} +@caption {Running @cmd{DESCRIPTIVES} on each value of @exvar{treatment}} +@end float + +In @ref{split:res} you can see that the table of descriptive statistics +appears 3 times --- once for each value of @exvar{treatment}. +In this example @samp{N}, the number of observations are identical in +all splits. This is because that experiment was deliberately designed +that way. However in general one can expect a different @samp{N} for each +split. + +@float Example, split:res +@psppoutput {split} +@caption {The results of running @cmd{DESCRIPTIVES} with an active split} +@end float + +Unless @cmd{TEMPORARY} was used, after a split has been defined for +a dataset it remains active until explicitly disabled. +In the graphical user interface, the active split variable (if any) is +displayed in the status bar (@pxref{split-status-bar:scr}. +If a dataset is saved to a system file (@pxref{SAVE}) whilst a split +is active, the split stastus is stored in the file and will be +automatically loaded when that file is loaded. + +@float Screenshot, split-status-bar:scr +@psppimage {split-status-bar} +@caption {The status bar indicating that the data set is split using the @exvar{treatment} variable} +@end float + + @node TEMPORARY @section TEMPORARY @vindex TEMPORARY @@ -194,9 +275,9 @@ TEMPORARY. @end display @cmd{TEMPORARY} is used to make the effects of transformations -following its execution temporary. These transformations will +following its execution temporary. These transformations affect only the execution of the next procedure or procedure-like -command. Their effects will not be saved to the active dataset. +command. Their effects are not be saved to the active dataset. The only specification on @cmd{TEMPORARY} is the command name. @@ -206,31 +287,30 @@ procedure-like commands. Scratch variables cannot be used following @cmd{TEMPORARY}. -An example may help to clarify: +@subsection Example Temporary -@example -DATA LIST /X 1-2. -BEGIN DATA. - 2 - 4 -10 -15 -20 -24 -END DATA. +In @ref{temporary:ex} there are two @cmd{COMPUTE} transformation. One +of them immediatly follows a @cmd{TEMPORARY} command, and therefore has +effect only for the next procedure, which in this case is the first +@cmd{DESCRIPTIVES} command. -COMPUTE X=X/2. +@float Example, temporary:ex +@psppsyntax {temporary.sps} +@caption {Running a @cmd{COMPUTE} transformation after @cmd{TEMPORARY}} +@end float -TEMPORARY. -COMPUTE X=X+3. +The data read by the first @cmd{DESCRIPTIVES} procedure are 4, 5, 8, +10.5, 13, 15. The data read by the second @cmd{DESCRIPTIVES} procedure are 1, 2, +5, 7.5, 10, 12. This is because the second @cmd{COMPUTE} transformation +has no effect on the second @cmd{DESCRIPTIVES} procedure. You can check these +figures in @ref{temporary:res}. -DESCRIPTIVES X. -DESCRIPTIVES X. -@end example +@float Result, temporary:res +@psppoutput {temporary} +@caption {The results of running two consecutive @cmd{DESCRIPTIVES} commands after + a temporary transformation} +@end float -The data read by the first @cmd{DESCRIPTIVES} are 4, 5, 8, -10.5, 13, 15. The data read by the first @cmd{DESCRIPTIVES} are 1, 2, -5, 7.5, 10, 12. @node WEIGHT @section WEIGHT @@ -251,10 +331,10 @@ procedures. Use of keyword @subcmd{BY} is optional but recommended. Weighting variables must be numeric. Scratch variables may not be used for weighting (@pxref{Scratch Variables}). -When @subcmd{OFF} is specified, subsequent statistical procedures will weight all +When @subcmd{OFF} is specified, subsequent statistical procedures weight all cases equally. -A positive integer weighting factor @var{w} on a case will yield the +A positive integer weighting factor @var{w} on a case yields the same statistical output as would replicating the case @var{w} times. A weighting factor of 0 is treated for statistical purposes as if the case did not exist in the input. Weighting values need not be @@ -267,3 +347,27 @@ the next procedure (@pxref{TEMPORARY}). @cmd{WEIGHT} does not cause cases in the active dataset to be replicated in memory. + + +@subsection Example Weights + +One could define a dataset containing an inventory of stock items. +It would be reasonable to use a string variable for a description of the +item, and a numeric variable for the number in stock, like in @ref{weight:ex}. + +@float Example, weight:ex +@psppsyntax {weight.sps} +@caption {Setting the weight on the variable @exvar{quantity}} +@end float + +One analysis which most surely would be of interest is +the relative amounts or each item in stock. +However without setting a weight variable, @cmd{FREQUENCIES} +(@pxref{FREQUENCIES}) does not tell us what we want to know, since +there is only one case for each stock item. @ref{weight:res} shows the +difference between the weighted and unweighted frequency tables. + +@float Example, weight:res +@psppoutput {weight} +@caption {Weighted and unweighted frequency tables of @exvar{items}} +@end float