From: Ben Pfaff Date: Sat, 10 May 2025 15:39:13 +0000 (-0700) Subject: work on manual X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=39e7f4f214956b848be53211c9d4b9bc6024db2e;p=pspp work on manual --- diff --git a/rust/doc/src/SUMMARY.md b/rust/doc/src/SUMMARY.md index 65862d132f..13185aa046 100644 --- a/rust/doc/src/SUMMARY.md +++ b/rust/doc/src/SUMMARY.md @@ -5,6 +5,10 @@ # Language Overview +- [Tutorial](language/tutorial/index.md) + - [Data Preparation](language/tutorial/preparation.md) + - [Data Screening and Transformation](language/tutorial/transformations.md) + - [Hypothesis Testing](language/tutorial/hypotheses.md) - [Basics](language/basics/index.md) - [Tokens](language/basics/tokens.md) - [Forming Commands](language/basics/commands.md) diff --git a/rust/doc/src/language/tutorial/hypotheses.md b/rust/doc/src/language/tutorial/hypotheses.md new file mode 100644 index 0000000000..4806cb3282 --- /dev/null +++ b/rust/doc/src/language/tutorial/hypotheses.md @@ -0,0 +1,236 @@ +# Hypothesis Testing + +One of the most fundamental purposes of statistical analysis is +hypothesis testing. Researchers commonly need to test hypotheses about +a set of data. For example, a researcher might want to test whether one set of +data comes from the same distribution as another, or whether the mean of +a dataset significantly differs from a particular value. This section +presents just some of the possible tests that PSPP offers. + +The researcher starts by making a "null hypothesis". Often this is a +hypothesis which the researcher suspects to be false. For example, a researcher who suspects +that A is greater than B will state the null hypothesis as A = B.[^1] + +[^1]: This example assumes that it is already proven that B is not +greater than A. + +The "p-value" is a recurring concept in hypothesis testing. It is +the highest acceptable probability that evidence implying that the null +hypothesis is false could have been obtained when the null hypothesis +is in fact true. 
Note that this is not the same as "the probability of +making an error", nor is it the same as "the probability of rejecting a +hypothesis when it is true". + +## Testing for differences of means + +A common statistical test involves hypotheses about means. The `T-TEST` +command is used to find out whether or not two separate subsets have the +same mean. + +A researcher suspected that the heights and core body temperatures of +persons might differ depending upon their sex. To investigate +this, the researcher posed two null hypotheses based on the data from +`physiology.sav` previously encountered: + + - The mean heights of males and females in the population are equal. + + - The mean body temperatures of males and females in the population + are equal. + +For the purposes of the investigation the researcher decided to use a +p-value of 0.05. + +In addition to the T-test, the `T-TEST` command also performs the +Levene test for equal variances. If the variances are equal, then a +more powerful form of the T-test can be used. However, if it is unsafe +to assume equal variances, then an alternative calculation is necessary. +PSPP performs both calculations. + +For the height variable, the output shows the significance of the +Levene test to be 0.33, which means there is a 33% probability that the +Levene test produces an outcome at least this extreme when the variances are equal. Had the +significance been less than 0.05, then it would have been unsafe to +assume that the variances were equal. However, because the value is +higher than 0.05, the homogeneity of variances assumption is safe and the +"Equal Variances" row (the more powerful test) can be used. In +this row, the two tailed significance for the height t-test is less than +0.05, so it is safe to reject the null hypothesis and conclude that the +mean heights of males and females are unequal. + +For the temperature variable, the significance of the Levene test is +0.58, so again it is safe to use the row for equal variances. 
The equal +variances row indicates that the two tailed significance for temperature +is 0.20. Since this is greater than 0.05, we cannot reject the null +hypothesis and must conclude that there is insufficient evidence to suggest +that the body temperatures of male and female persons are different. + +The syntax for this analysis is: + +``` +PSPP> get file='/usr/local/share/pspp/examples/physiology.sav'. +PSPP> recode height (179 = SYSMIS). +PSPP> t-test group=sex(0,1) /variables = height temperature. +``` + +PSPP produces the following output for this syntax: + +``` + Group Statistics +┌───────────────────────────────────────────┬──┬───────┬─────────────┬────────┐ +│ │ │ │ Std. │ S.E. │ +│ Group │ N│ Mean │ Deviation │ Mean │ +├───────────────────────────────────────────┼──┼───────┼─────────────┼────────┤ +│Height in millimeters Male │22│1796.49│ 49.71│ 10.60│ +│ Female│17│1610.77│ 25.43│ 6.17│ +├───────────────────────────────────────────┼──┼───────┼─────────────┼────────┤ +│Internal body temperature in degrees Male │22│ 36.68│ 1.95│ .42│ +│Celcius Female│18│ 37.43│ 1.61│ .38│ +└───────────────────────────────────────────┴──┴───────┴─────────────┴────────┘ + + Independent Samples Test +┌─────────────────────┬──────────┬────────────────────────────────────────── +│ │ Levene's │ +│ │ Test for │ +│ │ Equality │ +│ │ of │ +│ │ Variances│ T─Test for Equality of Means +│ ├────┬─────┼─────┬─────┬───────┬──────────┬──────────┐ +│ │ │ │ │ │ │ │ │ +│ │ │ │ │ │ │ │ │ +│ │ │ │ │ │ │ │ │ +│ │ │ │ │ │ │ │ │ +│ │ │ │ │ │ Sig. │ │ │ +│ │ │ │ │ │ (2─ │ Mean │Std. 
Error│ +│ │ F │ Sig.│ t │ df │tailed)│Difference│Difference│ +├─────────────────────┼────┼─────┼─────┼─────┼───────┼──────────┼──────────┤ +│Height in Equal │ .97│ .331│14.02│37.00│ .000│ 185.72│ 13.24│ +│millimeters variances│ │ │ │ │ │ │ │ +│ assumed │ │ │ │ │ │ │ │ +│ Equal │ │ │15.15│32.71│ .000│ 185.72│ 12.26│ +│ variances│ │ │ │ │ │ │ │ +│ not │ │ │ │ │ │ │ │ +│ assumed │ │ │ │ │ │ │ │ +├─────────────────────┼────┼─────┼─────┼─────┼───────┼──────────┼──────────┤ +│Internal Equal │ .31│ .581│─1.31│38.00│ .198│ ─.75│ .57│ +│body variances│ │ │ │ │ │ │ │ +│temperature assumed │ │ │ │ │ │ │ │ +│in degrees Equal │ │ │─1.33│37.99│ .190│ ─.75│ .56│ +│Celcius variances│ │ │ │ │ │ │ │ +│ not │ │ │ │ │ │ │ │ +│ assumed │ │ │ │ │ │ │ │ +└─────────────────────┴────┴─────┴─────┴─────┴───────┴──────────┴──────────┘ + +┌─────────────────────┬─────────────┐ +│ │ │ +│ │ │ +│ │ │ +│ │ │ +│ │ │ +│ ├─────────────┤ +│ │ 95% │ +│ │ Confidence │ +│ │ Interval of │ +│ │ the │ +│ │ Difference │ +│ ├──────┬──────┤ +│ │ Lower│ Upper│ +├─────────────────────┼──────┼──────┤ +│Height in Equal │158.88│212.55│ +│millimeters variances│ │ │ +│ assumed │ │ │ +│ Equal │160.76│210.67│ +│ variances│ │ │ +│ not │ │ │ +│ assumed │ │ │ +├─────────────────────┼──────┼──────┤ +│Internal Equal │ ─1.91│ .41│ +│body variances│ │ │ +│temperature assumed │ │ │ +│in degrees Equal │ ─1.89│ .39│ +│Celcius variances│ │ │ +│ not │ │ │ +│ assumed │ │ │ +└─────────────────────┴──────┴──────┘ +``` + +The `T-TEST` command tests for differences of means. Here, the height +variable's two tailed significance is less than 0.05, so the null +hypothesis can be rejected. Thus, the evidence suggests there is a +difference between the heights of male and female persons. However +the significance of the test for the temperature variable is greater +than 0.05 so the null hypothesis cannot be rejected, and there is +insufficient evidence to suggest a difference in body temperature. 
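As a cross-check on the "Equal variances assumed" rows above, the following sketch recomputes the t statistic, degrees of freedom, and standard error of the difference from the rounded figures in the Group Statistics table. It is illustrative Python rather than PSPP syntax. The helper name `pooled_t` is hypothetical, and the formula is the standard pooled-variance two-sample t statistic, which we assume (the text above does not state it) is what PSPP reports in that row.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance (equal variances assumed) two-sample t statistic.

    Hypothetical helper for illustration only; it works from summary
    statistics (group size, mean, standard deviation) of each group.
    Returns (t, degrees of freedom, standard error of the difference).
    """
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    return t, df, std_err


# Figures copied from the Group Statistics table (male group first).
height = pooled_t(22, 1796.49, 49.71, 17, 1610.77, 25.43)
temperature = pooled_t(22, 36.68, 1.95, 18, 37.43, 1.61)

for name, (t, df, se) in (("height", height), ("temperature", temperature)):
    print(f"{name}: t = {t:.2f}, df = {df}, std. error = {se:.2f}")
```

Because the inputs are already rounded to two decimal places, the results should agree with the table only up to rounding: roughly t = 14.02 with 37 degrees of freedom for height and t = -1.31 with 38 degrees of freedom for temperature.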
+ +## Linear Regression + +Linear regression is a technique used to investigate if and how a +variable is linearly related to others. If a variable is found to be +linearly related, then this can be used to predict future values of that +variable. + +In the following example, the service department of a company wanted +to be able to predict the time to repair equipment, in order to +improve the accuracy of their quotations. It was suggested that the +time to repair might be related to the time between failures and the +duty cycle of the equipment. A p-value of 0.1 was chosen for this +investigation. In order to investigate this hypothesis, the +[`REGRESSION`](../../commands/statistics/regression.md) command was +used. This command not only tests if the variables are related, but +also identifies the potential linear relationship. + +A first attempt includes `duty_cycle`: + +``` +PSPP> get file='/usr/local/share/pspp/examples/repairs.sav'. +PSPP> regression /variables = mtbf duty_cycle /dependent = mttr. +``` + +This attempt yields the following output (in part): + +``` + Coefficients (Mean time to repair (hours) ) +┌────────────────────────┬─────────────────────┬───────────────────┬─────┬────┐ +│ │ Unstandardized │ Standardized │ │ │ +│ │ Coefficients │ Coefficients │ │ │ +│ ├─────────┬───────────┼───────────────────┤ │ │ +│ │ B │ Std. Error│ Beta │ t │Sig.│ +├────────────────────────┼─────────┼───────────┼───────────────────┼─────┼────┤ +│(Constant) │ 10.59│ 3.11│ .00│ 3.40│.002│ +│Mean time between │ 3.02│ .20│ .95│14.88│.000│ +│failures (months) │ │ │ │ │ │ +│Ratio of working to non─│ ─1.12│ 3.69│ ─.02│ ─.30│.763│ +│working time │ │ │ │ │ │ +└────────────────────────┴─────────┴───────────┴───────────────────┴─────┴────┘ +``` + +The coefficients in the B column of the above table suggest that the formula +\\(\textrm{MTTR} = 10.59 + 3.02 \times \textrm{MTBF} - 1.12 \times +\textrm{DUTY\_CYCLE}\\) can be used to predict the time to repair. 
+However, the significance value for the `DUTY_CYCLE` coefficient is +very high (0.763, far above the chosen 0.1), which would make this an unsafe predictor. For this +reason, the test was repeated, but omitting the `duty_cycle` variable: + +``` +PSPP> regression /variables = mtbf /dependent = mttr. +``` + +This second try produces the following output (in part): + +``` + Coefficients (Mean time to repair (hours) ) +┌───────────────────────┬──────────────────────┬───────────────────┬─────┬────┐ +│ │ Unstandardized │ Standardized │ │ │ +│ │ Coefficients │ Coefficients │ │ │ +│ ├─────────┬────────────┼───────────────────┤ │ │ +│ │ B │ Std. Error │ Beta │ t │Sig.│ +├───────────────────────┼─────────┼────────────┼───────────────────┼─────┼────┤ +│(Constant) │ 9.90│ 2.10│ .00│ 4.71│.000│ +│Mean time between │ 3.01│ .20│ .94│15.21│.000│ +│failures (months) │ │ │ │ │ │ +└───────────────────────┴─────────┴────────────┴───────────────────┴─────┴────┘ +``` + +This time, the significance of every coefficient is well below the +chosen p-value of 0.1 (both are reported as .000), suggesting that the formula \\(\textrm{MTTR} = 9.90 + +3.01 \times \textrm{MTBF}\\) is a reliable predictor of the time to repair. + diff --git a/rust/doc/src/language/tutorial/index.md b/rust/doc/src/language/tutorial/index.md new file mode 100644 index 0000000000..39e9d6d163 --- /dev/null +++ b/rust/doc/src/language/tutorial/index.md @@ -0,0 +1,36 @@ +# PSPP Language Tutorial + +PSPP is a tool for the statistical analysis of sampled data. You can +use it to discover patterns in the data, to explain differences in one +subset of data in terms of another subset and to find out whether +certain beliefs about the data are justified. This chapter does not +attempt to introduce the theory behind the statistical analysis, but it +shows how such analysis can be performed using PSPP. + +This tutorial assumes that you are using PSPP in its interactive mode +from the command line. 
However, the example commands can also be +typed into a file and executed non-interactively by typing `pspp +FILE-NAME` at a shell prompt, where `FILE-NAME` is the name of the +file containing the commands. Alternatively, from the graphical +interface, you can select File → New → Syntax to open a new syntax +window and use the Run menu when a syntax fragment is ready to be +executed. Whichever method you choose, the syntax is identical. + +When using the interactive method, PSPP tells you that it's waiting +for your input with a prompt like `PSPP>` or `data>`. In the examples +of this chapter, whenever you see text like this, it indicates the +prompt displayed by PSPP, _not_ something that you should type. + +Throughout this chapter reference is made to a number of sample data +files. So that you can try the examples for yourself, you should have +received these files along with your copy of PSPP.[^1] + +> Normally these files are installed in the directory +`/usr/local/share/pspp/examples`. If, however, your system +administrator or operating system vendor has chosen to install them in +a different location, you will have to adjust the examples +accordingly. + +[^1]: These files contain purely fictitious data. They should not be +used for research purposes. + diff --git a/rust/doc/src/language/tutorial/preparation.md b/rust/doc/src/language/tutorial/preparation.md new file mode 100644 index 0000000000..d5f0b3791c --- /dev/null +++ b/rust/doc/src/language/tutorial/preparation.md @@ -0,0 +1,191 @@ +# Preparation of Data Files + +Before analysis can commence, the data must be loaded into PSPP and +arranged such that both PSPP and humans can understand what the data +represents. There are two aspects of data: + +- The variables: these are the parameters of a quantity which has + been measured or estimated in some way. For example, height, weight + and geographic location are all variables. 
+ +- The observations (also called 'cases') of the variables: each + observation represents an instance when the variables were measured + or observed. + +For example, a data set which has the variables height, weight, and +name might have the observations: + +``` +1881 89.2 Ahmed +1192 107.01 Frank +1230 67 Julie +``` + +The following sections explain how to define a dataset. + +## Defining Variables + +Variables come in two basic types: "numeric" and "string". +Variables such as age, height and satisfaction are numeric, whereas name +is a string variable. String variables are best reserved for commentary +data to assist the human observer. However, they can also be used for +nominal or categorical data. + +The following example defines two variables, `forename` and `height`, +and reads data into them by manual input: + +``` +PSPP> data list list /forename (A12) height. +PSPP> begin data. +data> Ahmed 188 +data> Bertram 167 +data> Catherine 134.231 +data> David 109.1 +data> end data +PSPP> +``` + +There are several things to note about this example. + +- The words `data list list` are an example of the [`DATA + LIST`](../../commands/data-io/data-list.md) command, which tells + PSPP to prepare for reading data. The word `list` intentionally + appears twice. The first occurrence is part of the `DATA LIST` + command name, whilst the second tells PSPP that the data is to be read as + free format data with one record per line. + + Usually this manual shows command names and other fixed elements of + syntax in upper case, but case doesn't matter in most parts of + command syntax. In the tutorial, we usually show them in lowercase + because they are easier to type that way. + +- The `/` character is important. It marks the start of the list of + variables which you wish to define. + +- The text `forename` is the name of the first variable, and `(A12)` + says that the variable forename is a string variable and that its + maximum length is 12 bytes. 
The second variable's name is specified + by the text `height`. Since no format is given, this variable has + the default format. Normally the default format expects numeric + data, which should be entered in the locale of the operating system. + Thus, the example is correct for English locales and other locales + which use a period (`.`) as the decimal separator. However if you + are using a system with a locale which uses the comma (`,`) as the + decimal separator, then you should in the subsequent lines + substitute `.` with `,`. Alternatively, you could explicitly tell + PSPP that the height variable is to be read using a period as its + decimal separator by appending the text `DOT8.3` after the word + `height`. For more information on data formats, see [Input and + Output Formats](../../language/datasets/formats/index.md). + +- PSPP displays the prompt `PSPP>` when it's expecting a command. + When it's expecting data, the prompt changes to `data>` so that you + know to enter data and not a command. + +- At the end of every command there is a terminating `.` which tells + PSPP that the end of a command has been encountered. You should not + enter `.` when data is expected (ie. when the `data>` prompt is + current) since it is appropriate only for terminating commands. + + You can also terminate a command with a blank line. + +## Listing the data + +Once the data has been entered, you could type +``` +PSPP> list /format=numbered. +``` +to list the data. The optional text `/format=numbered` requests the +case numbers to be shown along with the data. It should show the +following output: + +``` + Data List +┌───────────┬─────────┬──────┐ +│Case Number│ forename│height│ +├───────────┼─────────┼──────┤ +│1 │Ahmed │188.00│ +│2 │Bertram │167.00│ +│3 │Catherine│134.23│ +│4 │David │109.10│ +└───────────┴─────────┴──────┘ +``` + + +Note that the numeric variable height is displayed to 2 decimal +places, because the format for that variable is `F8.2`. 
For a +complete description of the `LIST` command, see +[`LIST`](../../commands/data-io/list.md). + +## Reading data from a text file + +The previous example showed how to define a set of variables and to +manually enter the data for those variables. Entering data manually is +tedious work, and often a file containing the data will have been +previously prepared. Let us assume that you have a file called +`mydata.dat` containing the ASCII-encoded data: + +``` +Ahmed 188.00 +Bertram 167.00 +Catherine 134.23 +David 109.10 + . + . + . +Zachariah 113.02 +``` + +You can tell the `DATA LIST` command to read the data directly +from this file instead of by manual entry, with a command like +`data list file='mydata.dat' list /forename (A12) height.` Notice, +however, that it is still necessary to specify the names of the +variables and their formats, since this information is not contained +in the file. It is also possible to specify the file's character +encoding and other parameters. For full details refer to [`DATA +LIST`](../../commands/data-io/data-list.md). + +## Reading data from a pre-prepared PSPP file + +When working with other PSPP users, or users of other software which +uses the PSPP data format, you may be given the data in a pre-prepared +PSPP file. Such files contain not only the data, but also the variable +definitions, along with their formats, labels and other meta-data. +Conventionally, these files (sometimes called "system" files) have the +suffix `.sav`, but that is not mandatory. The following syntax loads a +file called `my-file.sav`. + +``` +PSPP> get file='my-file.sav'. +``` + +You will encounter several instances of this in future examples. + +## Saving data to a PSPP file + +If you want to save your data, along with the variable definitions so +that you or other PSPP users can use it later, you can do this with the +`SAVE` command. + +The following syntax will save the existing data and variables to a +file called `my-new-file.sav`. 
+ +``` +PSPP> save outfile='my-new-file.sav'. +``` + +If `my-new-file.sav` already exists, then it will be overwritten. +Otherwise it will be created. + +## Reading data from other sources + +Sometimes it's useful to be able to read data from comma separated +text, from spreadsheets, databases or other sources. In these +instances you should use the [`GET +DATA`](../../commands/spss-io/get-data.md) command. + +## Exiting PSPP + +Use the `FINISH` command to exit PSPP: + PSPP> finish. + diff --git a/rust/doc/src/language/tutorial/transformations.md b/rust/doc/src/language/tutorial/transformations.md new file mode 100644 index 0000000000..1975bfb257 --- /dev/null +++ b/rust/doc/src/language/tutorial/transformations.md @@ -0,0 +1,385 @@ +# Data Screening and Transformation + +Once data has been entered, it is often desirable, or even necessary, +to transform it in some way before performing analysis upon it. At +the very least, it's good practice to check for errors. + +## Identifying incorrect data + +Data from real sources is rarely error free. PSPP has a number of +procedures which can be used to help identify data which might be +incorrect. + +The [`DESCRIPTIVES`](../../commands/statistics/descriptives.md) +command is used to generate simple linear statistics for a dataset. +It is also useful for identifying potential problems in the data. The +example file `physiology.sav` contains a number of physiological +measurements of a sample of healthy adults selected at random. +However, the data entry clerk made a number of mistakes when entering +the data. The following example illustrates the use of `DESCRIPTIVES` +to screen this data and identify the erroneous values: + +``` +PSPP> get file='/usr/local/share/pspp/examples/physiology.sav'. +PSPP> descriptives sex, weight, height. 
+``` + +For this example, PSPP produces the following output: + +``` + Descriptive Statistics +┌─────────────────────┬──┬───────┬───────┬───────┬───────┐ +│ │ N│ Mean │Std Dev│Minimum│Maximum│ +├─────────────────────┼──┼───────┼───────┼───────┼───────┤ +│Sex of subject │40│ .45│ .50│Male │Female │ +│Weight in kilograms │40│ 72.12│ 26.70│ ─55.6│ 92.1│ +│Height in millimeters│40│1677.12│ 262.87│ 179│ 1903│ +│Valid N (listwise) │40│ │ │ │ │ +│Missing N (listwise) │ 0│ │ │ │ │ +└─────────────────────┴──┴───────┴───────┴───────┴───────┘ +``` + + +The most interesting column in the output is the minimum value. The +weight variable has a minimum value of less than zero, which is clearly +erroneous. Similarly, the height variable's minimum value seems to be +very low. In fact, it is more than 5 standard deviations from the mean, +and is a seemingly bizarre height for an adult person. + +We can look deeper into these discrepancies by issuing an additional +`EXAMINE` command: + +``` +PSPP> examine height, weight /statistics=extreme(3). +``` + +This command produces the following additional output (in part): + +``` + Extreme Values +┌───────────────────────────────┬───────────┬─────┐ +│ │Case Number│Value│ +├───────────────────────────────┼───────────┼─────┤ +│Height in millimeters Highest 1│ 14│ 1903│ +│ 2│ 15│ 1884│ +│ 3│ 12│ 1802│ +│ ──────────┼───────────┼─────┤ +│ Lowest 1│ 30│ 179│ +│ 2│ 31│ 1598│ +│ 3│ 28│ 1601│ +├───────────────────────────────┼───────────┼─────┤ +│Weight in kilograms Highest 1│ 13│ 92.1│ +│ 2│ 5│ 92.1│ +│ 3│ 17│ 91.7│ +│ ──────────┼───────────┼─────┤ +│ Lowest 1│ 38│─55.6│ +│ 2│ 39│ 54.5│ +│ 3│ 33│ 55.4│ +└───────────────────────────────┴───────────┴─────┘ +``` + +From this new output, you can see that the lowest value of height is 179 +(which we suspect to be erroneous), but the second lowest is 1598 which +we know from `DESCRIPTIVES` is within 1 standard deviation from the +mean. 
Similarly, the lowest value of weight is negative, but its second +lowest value is plausible. This suggests that the two extreme values +are outliers and probably represent data entry errors. + +The output also identifies the case numbers for each extreme value, +so we can see that cases 30 and 38 are the ones with the erroneous +values. + +## Dealing with suspicious data + +If possible, suspect data should be checked and re-measured. However, +this may not always be feasible, in which case the researcher may +decide to disregard these values. PSPP has a feature for [missing +values](../../language/basics/missing-values.md), whereby data can +assume the special value 'SYSMIS', and will be disregarded in future +analysis. You can set the two suspect values to the `SYSMIS` value +using the [`RECODE`](../../commands/data/recode.md) command. + +``` +PSPP> recode height (179 = SYSMIS). +PSPP> recode weight (LOWEST THRU 0 = SYSMIS). +``` + +The first command says that for any observation which has a height value +of 179, that value should be changed to the SYSMIS value. The second +command says that any weight values of zero or less should be changed to +SYSMIS. From now on, they will be ignored in analysis. + +If you now re-run the `DESCRIPTIVES` or `EXAMINE` commands from the +previous section, you will see a data summary with more plausible +parameters. You will also notice that the data summaries indicate the +two missing values. + +## Inverting negatively coded variables + +Data entry errors are not the only reason for wanting to recode data. +The sample file `hotel.sav` comprises data gathered from a customer +satisfaction survey of clients at a particular hotel. The following +commands load the file and display its variables and associated data: + +``` +PSPP> get file='/usr/local/share/pspp/examples/hotel.sav'. +PSPP> display dictionary. 
+``` + +It yields the following output: + +``` + Variables +┌────┬────────┬─────────────┬────────────┬─────┬─────┬─────────┬──────┬───────┐ +│ │ │ │ Measurement│ │ │ │ Print│ Write │ +│Name│Position│ Label │ Level │ Role│Width│Alignment│Format│ Format│ +├────┼────────┼─────────────┼────────────┼─────┼─────┼─────────┼──────┼───────┤ +│v1 │ 1│I am │Ordinal │Input│ 8│Right │F8.0 │F8.0 │ +│ │ │satisfied │ │ │ │ │ │ │ +│ │ │with the │ │ │ │ │ │ │ +│ │ │level of │ │ │ │ │ │ │ +│ │ │service │ │ │ │ │ │ │ +│v2 │ 2│The value for│Ordinal │Input│ 8│Right │F8.0 │F8.0 │ +│ │ │money was │ │ │ │ │ │ │ +│ │ │good │ │ │ │ │ │ │ +│v3 │ 3│The staff │Ordinal │Input│ 8│Right │F8.0 │F8.0 │ +│ │ │were slow in │ │ │ │ │ │ │ +│ │ │responding │ │ │ │ │ │ │ +│v4 │ 4│My concerns │Ordinal │Input│ 8│Right │F8.0 │F8.0 │ +│ │ │were dealt │ │ │ │ │ │ │ +│ │ │with in an │ │ │ │ │ │ │ +│ │ │efficient │ │ │ │ │ │ │ +│ │ │manner │ │ │ │ │ │ │ +│v5 │ 5│There was too│Ordinal │Input│ 8│Right │F8.0 │F8.0 │ +│ │ │much noise in│ │ │ │ │ │ │ +│ │ │the rooms │ │ │ │ │ │ │ +└────┴────────┴─────────────┴────────────┴─────┴─────┴─────────┴──────┴───────┘ + + Value Labels +┌────────────────────────────────────────────────────┬─────────────────┐ +│Variable Value │ Label │ +├────────────────────────────────────────────────────┼─────────────────┤ +│I am satisfied with the level of service 1│Strongly Disagree│ +│ 2│Disagree │ +│ 3│No Opinion │ +│ 4│Agree │ +│ 5│Strongly Agree │ +├────────────────────────────────────────────────────┼─────────────────┤ +│The value for money was good 1│Strongly Disagree│ +│ 2│Disagree │ +│ 3│No Opinion │ +│ 4│Agree │ +│ 5│Strongly Agree │ +├────────────────────────────────────────────────────┼─────────────────┤ +│The staff were slow in responding 1│Strongly Disagree│ +│ 2│Disagree │ +│ 3│No Opinion │ +│ 4│Agree │ +│ 5│Strongly Agree │ +├────────────────────────────────────────────────────┼─────────────────┤ +│My concerns were dealt with in an efficient manner 1│Strongly Disagree│ +│ 
2│Disagree │ +│ 3│No Opinion │ +│ 4│Agree │ +│ 5│Strongly Agree │ +├────────────────────────────────────────────────────┼─────────────────┤ +│There was too much noise in the rooms 1│Strongly Disagree│ +│ 2│Disagree │ +│ 3│No Opinion │ +│ 4│Agree │ +│ 5│Strongly Agree │ +└────────────────────────────────────────────────────┴─────────────────┘ +``` + +The output shows that all of the variables v1 through v5 are measured +on a 5 point Likert scale, with 1 meaning "Strongly disagree" and 5 +meaning "Strongly agree". However, some of the questions are positively +worded (v1, v2, v4) and others are negatively worded (v3, v5). To +perform meaningful analysis, we need to recode the variables so that +they all measure in the same direction. We could use the `RECODE` +command, with syntax such as: + +``` +recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1). +``` + +However an easier and more elegant way uses the +[`COMPUTE`](../../commands/data/compute.md) command. Since the +variables are Likert variables in the range (1 ... 5), subtracting +their value from 6 has the effect of inverting them: + +``` +compute VAR = 6 - VAR. +``` + +The following section uses this technique to recode the +variables v3 and v5. After applying `COMPUTE` for both variables, all +subsequent commands will use the inverted values. + +## Testing data consistency + +A sensible check to perform on survey data is the calculation of +reliability. This gives the statistician some confidence that the +questionnaires have been completed thoughtfully. If you examine the +labels of variables v1, v3 and v4, you will notice that they ask very +similar questions. One would therefore expect the values of these +variables (after recoding) to closely follow one another, and we can +test that with the +[`RELIABILITY`](../../commands/statistics/reliability.md) command. +The following example shows a PSPP session where the user recodes +negatively scaled variables and then requests reliability statistics +for v1, v3, and v4. 
+ +``` +PSPP> get file='/usr/local/share/pspp/examples/hotel.sav'. +PSPP> compute v3 = 6 - v3. +PSPP> compute v5 = 6 - v5. +PSPP> reliability v1, v3, v4. +``` + +This yields the following output: + +``` +Scale: ANY + +Case Processing Summary +┌────────┬──┬───────┐ +│Cases │ N│Percent│ +├────────┼──┼───────┤ +│Valid │17│ 100.0%│ +│Excluded│ 0│ .0%│ +│Total │17│ 100.0%│ +└────────┴──┴───────┘ + + Reliability Statistics +┌────────────────┬──────────┐ +│Cronbach's Alpha│N of Items│ +├────────────────┼──────────┤ +│ .81│ 3│ +└────────────────┴──────────┘ +``` + +As a rule of thumb, many statisticians consider a value of Cronbach's +Alpha of 0.7 or higher to indicate reliable data. + +Here, the value is 0.81, which suggests a high degree of reliability +among variables v1, v3 and v4, so the data and the recoding that we +performed are vindicated. + +## Testing for normality + +Many statistical tests rely upon certain properties of the data. One +common property, upon which many linear tests depend, is that of +normality -- the data must have been drawn from a normal distribution. +It is necessary then to ensure normality before deciding upon the test +procedure to use. One way to do this uses the `EXAMINE` command. + +In the following example, a researcher was examining the failure +rates of equipment produced by an engineering company. The file +`repairs.sav` contains the mean time between failures (mtbf) of some +items of equipment subject to the study. Before performing linear +analysis on the data, the researcher wanted to ascertain that the data +is normally distributed. + +``` +PSPP> get file='/usr/local/share/pspp/examples/repairs.sav'. +PSPP> examine mtbf /statistics=descriptives. +``` + +This produces the following output: + +``` + Descriptives +┌──────────────────────────────────────────────────────────┬─────────┬────────┐ +│ │ │ Std. 
│ +│ │Statistic│ Error │ +├──────────────────────────────────────────────────────────┼─────────┼────────┤ +│Mean time between Mean │ 8.78│ 1.10│ +│failures (months) ──────────────────────────────────┼─────────┼────────┤ +│ 95% Confidence Interval Lower │ 6.53│ │ +│ for Mean Bound │ │ │ +│ Upper │ 11.04│ │ +│ Bound │ │ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ 5% Trimmed Mean │ 8.20│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Median │ 8.29│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Variance │ 36.34│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Std. Deviation │ 6.03│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Minimum │ 1.63│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Maximum │ 26.47│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Range │ 24.84│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Interquartile Range │ 6.03│ │ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Skewness │ 1.65│ .43│ +│ ──────────────────────────────────┼─────────┼────────┤ +│ Kurtosis │ 3.41│ .83│ +└──────────────────────────────────────────────────────────┴─────────┴────────┘ +``` + +A normal distribution has a skewness and kurtosis of zero. The +skewness of mtbf in the output above makes it clear that the mtbf +figures have a lot of positive skew and are therefore not drawn from a +normally distributed variable. Positive skew can often be compensated +for by applying a logarithmic transformation, as in the following +continuation of the example: + +``` +PSPP> compute mtbf_ln = ln (mtbf). +PSPP> examine mtbf_ln /statistics=descriptives. +``` + +which produces the following additional output: + +``` + Descriptives +┌────────────────────────────────────────────────────┬─────────┬──────────┐ +│ │Statistic│Std. 
Error│ +├────────────────────────────────────────────────────┼─────────┼──────────┤ +│mtbf_ln Mean │ 1.95│ .13│ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ 95% Confidence Interval for Mean Lower Bound│ 1.69│ │ +│ Upper Bound│ 2.22│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ 5% Trimmed Mean │ 1.96│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Median │ 2.11│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Variance │ .49│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Std. Deviation │ .70│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Minimum │ .49│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Maximum │ 3.28│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Range │ 2.79│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Interquartile Range │ .88│ │ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Skewness │ ─.37│ .43│ +│ ─────────────────────────────────────────────┼─────────┼──────────┤ +│ Kurtosis │ .01│ .83│ +└────────────────────────────────────────────────────┴─────────┴──────────┘ +``` + +The `COMPUTE` command in the first line above performs the logarithmic +transformation: ``` compute mtbf_ln = ln (mtbf). ``` Rather than +redefining the existing variable, this use of `COMPUTE` defines a new +variable mtbf_ln which is the natural logarithm of mtbf. The final +command in this example calls `EXAMINE` on this new variable. The +results show that both the skewness and kurtosis for mtbf_ln are very +close to zero. This provides some confidence that the mtbf_ln +variable is normally distributed and thus safe for linear analysis. +In the event that no suitable transformation can be found, then it +would be worth considering an appropriate non-parametric test instead +of a linear one. 
See [`NPAR +TESTS`](../../commands/statistics/npar-tests.md) for information +about non-parametric tests. + diff --git a/rust/doc/src/system-file/file-header-record.md b/rust/doc/src/system-file/file-header-record.md index 9b95430898..d444313f3b 100644 --- a/rust/doc/src/system-file/file-header-record.md +++ b/rust/doc/src/system-file/file-header-record.md @@ -89,8 +89,8 @@ A system file begins with the file header, with the following format: * `flt64 bias;` - Compression bias, ordinarily set to 100. Only integers between `1 - - bias` and `251 - bias` can be compressed. + Compression bias, usually 100. Only integers between `1 - bias` and + `251 - bias` can be compressed. By assuming that its value is 100, PSPP uses `bias` to determine the file's floating-point format and endianness. If the compression