doc/tutorial.texi

   1 @alias prompt = sansserif
   2
   3 @include doc/tut.texi
   4
   5 @node Using PSPP
   6 @chapter Using PSPP
   7
   8 PSPP is a tool for the statistical analysis of sampled data.
   9 You can use it to discover patterns in the data,
  10 to explain differences in one subset of data in terms of another subset
  11 and to find out
  12 whether certain beliefs about the data are justified.
  13 This chapter does not attempt to introduce the theory behind the
  14 statistical analysis,
  15 but it shows how such analysis can be performed using PSPP.
  16
  17 For the purposes of this tutorial, it is assumed that you are using PSPP in its
  18 interactive mode from the command line.
  19 However, the example commands can also be typed into a file and executed in
  20 a post-hoc mode by typing @samp{pspp @var{filename}} at a shell prompt,
  21 where @var{filename} is the name of the file containing the commands.
  22 Alternatively, from the graphical interface, you can select
  23 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  24 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  25 executed.
  26 Whichever method you choose, the syntax is identical.
  27
  28 When using the interactive method, PSPP tells you that it's waiting for your
  29 data with a string like @prompt{PSPP>} or @prompt{data>}.
  30 In the examples of this chapter, whenever you see text like this, it
  31 indicates the prompt displayed by PSPP, @emph{not} something that you
  32 should type.
  33
  34 Throughout this chapter reference is made to a number of sample data files.
  35 So that you can try the examples for yourself,
  36 you should have received these files along with your copy of PSPP.
  37 @footnote{These files contain purely fictitious data.  They should not be used
  38 for research purposes.}
  39 @note{Normally these files are installed in the directory
  40 @file{@value{example-dir}}.
  41 If however your system administrator or operating system vendor has
  42 chosen to install them in a different location, you will have to adjust
  43 the examples accordingly.}
  44
  45
  46 @menu
  47 * Preparation of Data Files::
  48 * Data Screening and Transformation::
  49 * Hypothesis Testing::
  50 @end menu
  51
  52 @node Preparation of Data Files
  53 @section Preparation of Data Files
  54
  55
  56 Before analysis can commence,  the data must be loaded into PSPP and
  57 arranged such that both PSPP and humans can understand what
  58 the data represents.
  59 There are two aspects of data:
  60
  61 @itemize @bullet
  62 @item The variables --- these are the parameters of a quantity
  63  which has been measured or estimated in some way.
  64  For example height, weight and geographic location are all variables.
  65 @item The observations (also called `cases') of the variables ---
  66  each observation represents an instance when the variables were measured
  67  or observed.
  68 @end itemize
  69
  70 @noindent
  71 For example, a data set which has the variables @var{height}, @var{weight}, and
  72 @var{name}, might have the observations:
  73 @example
  74 188 89 Ahmed
  75 119 107 Frank
  76 123 67 Julie
  77 @end example
  78 @noindent
  79 The following sections explain how to define a dataset.
  80
  81 @menu
  82 * Defining Variables::
  83 * Listing the data::
  84 * Reading data from a text file::
  85 * Reading data from a pre-prepared PSPP file::
  86 * Saving data to a PSPP file.::
  87 * Reading data from other sources::
  88 @end menu
  89
  90 @node Defining Variables
  91 @subsection Defining Variables
  92 @cindex variables
  93
  94 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
  95 Variables such as age, height and satisfaction are numeric,
  96 whereas name is a string variable.
  97 String variables are best reserved for commentary data to assist the
  98 human observer.
  99 However they can also be used for nominal or categorical data.
 100
 101
 102 @ref{data-list} defines two variables @var{forename} and @var{height},
 103 and reads data into them by manual input.
 104
 105 @float Example, data-list
 106 @cartouche
 107 @example
 108 @prompt{PSPP>} data list list /forename (A12) height *.
 109 @prompt{PSPP>} begin data.
 110 @prompt{data>} Ahmed 188
 111 @prompt{data>} Bertram 167
 112 @prompt{data>} Catherine 134.231
 113 @prompt{data>} David 109.1
 114 @prompt{data>} end data
 115 @prompt{PSPP>}
 116 @end example
 117 @end cartouche
 118 @caption{Manual entry of data using the @cmd{DATA LIST} command.
 119 Two variables
 120 @var{forename} and @var{height} are defined and subsequently filled
 121 with  manually entered data.}
 122 @end float
 123
 124 There are several things to note about this example.
 125
 126 @itemize @bullet
 127 @item
 128 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 129 command. @xref{DATA LIST}.
 130 It tells PSPP to prepare for reading data.
 131 The word @samp{list} intentionally appears twice.
 132 The first occurrence is part of the @cmd{DATA LIST} call,
 133 whilst the second
 134 tells PSPP that the data is to be read as free format data with
 135 one record per line.
 136
 137 @item
 138 The @samp{/} character is important. It marks the start of the list of
 139 variables which you wish to define.
 140
 141 @item
 142 The text @samp{forename} is the name of the first variable,
 143 and @samp{(A12)} says that the variable @var{forename} is a string
 144 variable and that its maximum length is 12 bytes.
 145 The second variable's name is specified by the text @samp{height}
 146 and the @samp{*}
 147 means that this variable has the default format.
Instead of typing @samp{*} you could also have typed @samp{(F8.2)}.
However, since @samp{F8.2} is the default format (unless you changed it
with the @cmd{SET} command (@pxref{SET})), it's quicker to simply type
@samp{*}.
@xref{Input and Output Formats}, for more information on data formats.
 153
 154
 155 @item
 156 Normally, PSPP displays the  prompt @prompt{PSPP>} whenever it's
 157 expecting a command.
 158 However, when it's expecting data, the prompt changes to @prompt{data>}
 159 so that you know to enter data and not a command.
 160
 161 @item
 162 At the end of every command there is a terminating @samp{.} which tells
 163 PSPP that the end of a command has been encountered.
You should not enter @samp{.} when data is expected (@i{i.e.} when
 165 the @prompt{data>} prompt is current) since it is appropriate only for
 166 terminating commands.
 167 @end itemize
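
@noindent
As a sketch of the equivalence described above, the same two variables
could be declared with an explicit format instead of @samp{*}.
This assumes that the default format has not been changed with @cmd{SET}:
@example
data list list /forename (A12) height (F8.2).
@end example
@noindent
The data is then entered between @samp{begin data.} and @samp{end data}
exactly as before.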
 168
 169 @node Listing the data
 170 @subsection Listing the data
 171 @vindex LIST
 172
 173 Once the data has been entered,
 174 you could type
 175 @example
 176 @prompt{PSPP>} list /format=numbered.
 177 @end example
 178 @noindent
 179 to list the data.
 180 The optional text @samp{/format=numbered} requests the case numbers to be
 181 shown along with the data.
 182 It should show the following output:
 183 @example
 184 @group
 185 Case#     forename   height
 186 ----- ------------ --------
 187     1 Ahmed          188.00
 188     2 Bertram        167.00
 189     3 Catherine      134.23
 190     4 David          109.10
 191 @end group
 192 @end example
 193 @noindent
 194 Note that the numeric variable @var{height} is displayed to 2 decimal
 195 places, because the format for that variable is @samp{F8.2}.
@xref{LIST}, for a complete description of the @cmd{LIST} command.
 197
 198 @node Reading data from a text file
 199 @subsection Reading data from a text file
 200 @cindex reading data
 201
 202 The previous example showed how to define a set of variables and to
 203 manually enter the data for those variables.
Manual entering of data is tedious work, and often
a file containing the data will have been previously
prepared.
Let us assume that you have a file called @file{mydata.dat} containing the
ASCII-encoded data:
 209 @example
 210 Ahmed          188.00
 211 Bertram        167.00
 212 Catherine      134.23
 213 David          109.10
 214 @              .
 215 @              .
 216 @              .
 217 Zachariah      113.02
 218 @end example
 219 @noindent
You can tell the @cmd{DATA LIST} command to read the data directly from
 221 this file instead of by manual entry, with a command like:
 222 @example
 223 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height *.
 224 @end example
 225 @noindent
 226 Notice however, that it is still necessary to specify the names of the
 227 variables and their formats, since this information is not contained
 228 in the file.
 229 It is also possible to specify the file's character encoding and other
 230 parameters.
@xref{DATA LIST}, for full details.
 232
 233 @node Reading data from a pre-prepared PSPP file
 234 @subsection Reading data from a pre-prepared PSPP file
 235 @cindex system files
 236 @vindex GET
 237
 238 When working with other PSPP users, or users of other software which
 239 uses the PSPP data format, you may be given the data in
 240 a pre-prepared PSPP file.
 241 Such files contain not only the data, but the variable definitions,
 242 along with their formats, labels and other meta-data.
 243 Conventionally, these files (sometimes called ``system'' files)
 244 have the suffix @file{.sav}, but that is
 245 not mandatory.
 246 The following syntax loads a file called @file{my-file.sav}.
 247 @example
 248 @prompt{PSPP>} get file='my-file.sav'.
 249 @end example
 250 @noindent
 251 You will encounter several instances of this in future examples.
 252
 253
 254 @node Saving data to a PSPP file.
 255 @subsection Saving data to a PSPP file.
 256 @cindex saving
 257 @vindex SAVE
 258
 259 If you want to save your data, along with the variable definitions so
 260 that you or other PSPP users can use it later, you can do this with
 261 the @cmd{SAVE} command.
 262
 263 The following syntax will save the existing data and variables to a
 264 file called @file{my-new-file.sav}.
 265 @example
 266 @prompt{PSPP>} save outfile='my-new-file.sav'.
 267 @end example
 268 @noindent
 269 If @file{my-new-file.sav} already exists, then it will be overwritten.
 270 Otherwise it will be created.
 271
 272
 273 @node Reading data from other sources
 274 @subsection Reading data from other sources
 275 @cindex comma separated values
 276 @cindex spreadsheets
 277 @cindex databases
 278
 279 Sometimes it's useful to be able to read data from comma
 280 separated text, from spreadsheets, databases or other sources.
 281 In these instances you should
 282 use the @cmd{GET DATA} command (@pxref{GET DATA}).
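
As a minimal sketch, suppose the data from the earlier example had been
supplied as comma separated text in a file called @file{mydata.csv}
(a hypothetical name), with a header line followed by one case per line.
A command along the following lines should read it; the file name, the
header line and the variable formats are all assumptions made for
illustration:
@example
get data /type=txt /file='mydata.csv'
         /delimiters=','
         /firstcase=2
         /variables=forename A12
                    height F8.2.
@end example
@noindent
Here @samp{/firstcase=2} skips the assumed header line, and each entry
after @samp{/variables=} gives a variable name followed by its format.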
 283
 284
 285 @node Data Screening and Transformation
 286 @section Data Screening and Transformation
 287
 288 @cindex screening
 289 @cindex transformation
 290
 291 Once data has been entered, it is often desirable, or even necessary,
 292 to transform it in some way before performing analysis upon it.
 293 At the very least, it's good practice to check for errors.
 294
 295 @menu
 296 * Identifying incorrect data::
 297 * Dealing with suspicious data::
 298 * Inverting negatively coded variables::
 299 * Testing data consistency::
* Testing for normality::
 301 @end menu
 302
 303 @node Identifying incorrect data
 304 @subsection Identifying incorrect data
 305 @cindex erroneous data
 306 @cindex errors, in data
 307
 308 Data from real sources is rarely error free.
 309 PSPP has a number of procedures which can be used to help
 310 identify data which might be incorrect.
 311
 312 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 313 simple linear statistics for a dataset.  It is also useful for
 314 identifying potential problems in the data.
 315 The example file @file{physiology.sav} contains a number of physiological
 316 measurements of a sample of healthy adults selected at random.
 317 However, the data entry clerk made a number of mistakes when entering
 318 the data.
 319 @ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this
 320 data and identify the erroneous values.
 321
 322 @float Example, descriptives
 323 @cartouche
 324 @example
 325 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 326 @prompt{PSPP>} descriptives sex, weight, height.
 327 @end example
 328
 329 Output:
 330 @example
 331 DESCRIPTIVES.  Valid cases = 40; cases with missing value(s) = 0.
 332 +--------#--+-------+-------+-------+-------+
 333 |Variable# N|  Mean |Std Dev|Minimum|Maximum|
 334 #========#==#=======#=======#=======#=======#
 335 |sex     #40|    .45|    .50|    .00|   1.00|
 336 |height  #40|1677.12| 262.87| 179.00|1903.00|
 337 |weight  #40|  72.12|  26.70| -55.60|  92.07|
 338 +--------#--+-------+-------+-------+-------+
 339 @end example
 340 @end cartouche
 341 @caption{Using the @cmd{DESCRIPTIVES} command to display simple
 342 summary information about the data.
 343 In this case, the results show unexpectedly low values in the Minimum
 344 column, suggesting incorrect data entry.}
 345 @end float
 346
 347 In the output of @ref{descriptives},
 348 the most interesting column is the minimum value.
 349 The @var{weight} variable has a minimum value of less than zero,
 350 which is clearly erroneous.
 351 Similarly, the @var{height} variable's minimum value seems to be very low.
 352 In fact, it is more than 5 standard deviations from the mean, and is a
 353 seemingly bizarre height for an adult person.
We can examine the data in more detail with the @cmd{EXAMINE}
command (@pxref{EXAMINE}).
 356
 357 In @ref{examine} you can see that the lowest value of @var{height} is
 358 179 (which we suspect to be erroneous), but the second lowest is 1598
 359 which
 360 we know from the @cmd{DESCRIPTIVES} command
 361 is within 1 standard deviation from the mean.
 362 Similarly the @var{weight} variable has a lowest value which is
 363 negative but a plausible value for the second lowest value.
 364 This suggests that the two extreme values are outliers and probably
 365 represent data entry errors.
 366
 367 @float Example, examine
 368 @cartouche
 369 [@dots{} continue from @ref{descriptives}]
 370 @example
 371 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 372 @end example
 373
 374 Output:
 375 @example
 376 #===============================#===========#=======#
 377 #                               #Case Number| Value #
 378 #===============================#===========#=======#
 379 #Height in millimetres Highest 1#         14|1903.00#
 380 #                              2#         15|1884.00#
 381 #                              3#         12|1801.65#
 382 #                     ----------#-----------+-------#
 383 #                       Lowest 1#         30| 179.00#
 384 #                              2#         31|1598.00#
 385 #                              3#         28|1601.00#
 386 #                     ----------#-----------+-------#
 387 #Weight in kilograms   Highest 1#         13|  92.07#
 388 #                              2#          5|  92.07#
 389 #                              3#         17|  91.74#
 390 #                     ----------#-----------+-------#
 391 #                       Lowest 1#         38| -55.60#
 392 #                              2#         39|  54.48#
 393 #                              3#         33|  55.45#
 394 #===============================#===========#=======#
 395 @end example
 396 @end cartouche
 397 @caption{Using the @cmd{EXAMINE} command to see the extremities of the data
 398 for different variables.  Cases 30 and 38 seem to contain values
 399 very much lower than the rest of the data.
 400 They are possibly erroneous.}
 401 @end float
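
The figure of ``more than 5 standard deviations'' quoted above follows
directly from the @cmd{DESCRIPTIVES} output:
(1677.12 - 179) / 262.87 is approximately 5.7.
As a sketch of how such a check might be automated, the @samp{/save}
subcommand of @cmd{DESCRIPTIVES} stores standardized scores as new
variables.
The variable names @var{Zheight} and @var{Zweight} and the cutoff of 3
standard deviations are assumptions made for this illustration:
@example
descriptives /variables=height weight /save.
temporary.
select if (abs(Zheight) > 3 or abs(Zweight) > 3).
list /variables=height Zheight weight Zweight.
@end example
@noindent
The @cmd{TEMPORARY} command limits the effect of @cmd{SELECT IF} to the
next procedure, so no cases are permanently discarded.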
 402
 403 @node Dealing with suspicious data
 404 @subsection Dealing with suspicious data
 405
 406 @cindex SYSMIS
 407 @cindex recoding data
 408 If possible, suspect data should be checked and re-measured.
 409 However, this may not always be feasible, in which case the researcher may
 410 decide to disregard these values.
 411 PSPP has a feature whereby data can assume the special value `SYSMIS', and
 412 will be disregarded in future analysis. @xref{Missing Observations}.
 413 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 414 command.
 415 @example
 416 PSPP> recode height (179 = SYSMIS).
 417 PSPP> recode weight (LOWEST THRU 0 = SYSMIS).
 418 @end example
 419 @noindent
 420 The first command says that for any observation which has a
 421 @var{height} value of 179, that value should be changed to the SYSMIS
 422 value.
 423 The second command says that any @var{weight} values of zero or less
 424 should be changed to SYSMIS.
 425 From now on, they will be ignored in analysis.
@xref{RECODE}, for detailed information about the @cmd{RECODE} command.
 427
 428 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in
 429 @ref{descriptives} and @ref{examine} you
 430 will see a data summary with more plausible parameters.
 431 You will also notice that the data summaries indicate the two missing values.
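
Continuing the session from @ref{examine}, the whole sequence might look
like the following sketch, which only combines commands already shown in
this chapter:
@example
recode height (179 = SYSMIS).
recode weight (LOWEST THRU 0 = SYSMIS).
descriptives sex, weight, height.
examine height, weight /statistics=extreme(3).
@end example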
 432
 433 @node Inverting negatively coded variables
 434 @subsection Inverting negatively coded variables
 435
 436 @cindex Likert scale
 437 @cindex Inverting data
 438 Data entry errors are not the only reason for wanting to recode data.
 439 The sample file @file{hotel.sav} comprises data gathered from a
 440 customer satisfaction survey of clients at a particular hotel.
 441 In @ref{reliability}, this file is loaded for analysis.
 442 The line @code{display dictionary.} tells PSPP to display the
 443 variables and associated data.
 444 The output from this command has been omitted from the example for the sake of clarity, but
you will notice that each of the variables
@var{v1}, @var{v2} @dots{} @var{v5} is measured on a 5-point Likert scale,
 447 with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 448 Whilst variables @var{v1}, @var{v2} and @var{v4} record responses
 449 to a positively posed question, variables @var{v3} and @var{v5} are
 450 responses to negatively worded questions.
 451 In order to perform meaningful analysis, we need to recode the variables so
 452 that they all measure in the same direction.
 453 We could use the @cmd{RECODE} command, with syntax such as:
 454 @example
 455 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 456 @end example
 457 @noindent
 458 However an easier and more elegant way uses the @cmd{COMPUTE}
 459 command (@pxref{COMPUTE}).
 460 Since the variables are Likert variables in the range (1 @dots{} 5),
 461 subtracting their value  from 6 has the effect of inverting them:
 462 @example
 463 compute @var{var} = 6 - @var{var}.
 464 @end example
 465 @noindent
 466 @ref{reliability} uses this technique to recode the variables
 467 @var{v3} and @var{v5}.
 468 After applying  @cmd{COMPUTE} for both variables,
 469 all subsequent commands will use the inverted values.
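
When several variables need the same inversion, the separate
@cmd{COMPUTE} commands can be folded into one loop.
The following is a sketch using @cmd{DO REPEAT}; the stand-in name
@var{x} is arbitrary and chosen only for this illustration:
@example
do repeat x = v3 v5.
  compute x = 6 - x.
end repeat.
@end example
@noindent
Note that running the inversion a second time restores the original
values, so it should be applied only once to any given copy of the data.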
 470
 471
 472 @node Testing data consistency
 473 @subsection Testing data consistency
 474
 475 @cindex reliability
 476 @cindex consistency
 477
 478 A sensible check to perform on survey data is the calculation of
 479 reliability.
 480 This gives the statistician some confidence that the questionnaires have been
 481 completed thoughtfully.
 482 If you examine the labels of variables @var{v1},  @var{v3} and @var{v5},
 483 you will notice that they ask very similar questions.
 484 One would therefore expect the values of these variables (after recoding)
 485 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 486 command (@pxref{RELIABILITY}).
 487 @ref{reliability} shows a PSPP session where the user (after recoding
 488 negatively scaled variables) requests reliability statistics for
 489 @var{v1}, @var{v3} and @var{v5}.
 490
 491 @float Example, reliability
 492 @cartouche
 493 @example
 494 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 495 @prompt{PSPP>} display dictionary.
 496 @prompt{PSPP>} * recode negatively worded questions.
 497 @prompt{PSPP>} compute v3 = 6 - v3.
 498 @prompt{PSPP>} compute v5 = 6 - v5.
 499 @prompt{PSPP>} reliability v1, v3, v5.
 500 @end example
 501
 502 Output (dictionary information omitted for clarity):
 503 @example
 504 1.1 RELIABILITY.  Case Processing Summary
 505 #==============#==#======#
 506 #              # N|   %  #
 507 #==============#==#======#
 508 #Cases Valid   #17|100.00#
 509 #      Excluded# 0|   .00#
 510 #      Total   #17|100.00#
 511 #==============#==#======#
 512
 513 1.2 RELIABILITY.  Reliability Statistics
 514 #================#==========#
 515 #Cronbach's Alpha#N of items#
 516 #================#==========#
 517 #             .86#         3#
 518 #================#==========#
 519 @end example
 520 @end cartouche
 521 @caption{Recoding negatively scaled variables, and testing for
 522 reliability with the @cmd{RELIABILITY} command. The Cronbach Alpha
 523 coefficient suggests a high degree of reliability among variables
@var{v1}, @var{v3} and @var{v5}.}
 525 @end float
 526
 527 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 528 0.7 or higher to indicate reliable data.
 529 Here, the value is 0.86 so the data and the recoding that we performed
 530 are vindicated.
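
For reference, Cronbach's Alpha for a scale of @var{k} items is
conventionally defined as follows, where @var{s_i} is the standard
deviation of item @var{i} and @var{s_total} is the standard deviation of
the sum of all the items:
@example
alpha = (k / (k - 1)) * (1 - sum(s_i^2) / s_total^2)
@end example
@noindent
As a sketch, per-item detail can be requested in addition to the overall
coefficient by adding a @samp{/summary} subcommand.
The use of this subcommand on these particular variables is an
illustration and does not appear in the original example:
@example
reliability /variables=v1 v3 v5 /summary=total.
@end example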
 531
 532
 533 @node Testing for normality
 534 @subsection Testing for normality
 535 @cindex normality, testing
 536
 537 Many statistical tests rely upon certain properties of the data.
 538 One common property, upon which many linear tests depend, is that of
 539 normality --- the data must have been drawn from a normal distribution.
 540 It is necessary then to ensure normality before deciding upon the
 541 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 542
 543 In @ref{normality}, a researcher was examining the failure rates
 544 of equipment produced by an engineering company.
 545 The file @file{repairs.sav} contains the mean time between
 546 failures (@var{mtbf}) of some items of equipment subject to the study.
 547 Before performing linear analysis on the data,
 548 the researcher wanted to ascertain that the data is normally distributed.
 549
A normal distribution has a skewness and an excess kurtosis (the kurtosis figure that PSPP reports) of zero.
 551 Looking at the skewness of @var{mtbf} in @ref{normality} it is clear
 552 that the mtbf figures have a lot of positive skew and are therefore
 553 not drawn from a normally distributed variable.
 554 Positive skew can often be compensated for by applying a logarithmic
 555 transformation.
 556 This is done with the @cmd{COMPUTE} command in the line
 557 @example
 558 compute mtbf_ln = ln (mtbf).
 559 @end example
 560 @noindent
 561 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 562 defines a new variable @var{mtbf_ln} which is
 563 the natural logarithm of @var{mtbf}.
 564 The final command in this example calls @cmd{EXAMINE} on this new variable,
 565 and it can be seen from the results that both the skewness and
 566 kurtosis for @var{mtbf_ln} are very close to zero.
 567 This provides some confidence that the @var{mtbf_ln} variable is
 568 normally distributed and thus safe for linear analysis.
 569 In the event that no suitable transformation can be found,
 570 then it would be worth considering
 571 an appropriate non-parametric test instead of a linear one.
 572 @xref{NPAR TESTS}, for information about non-parametric tests.
 573
 574 @float Example, normality
 575 @cartouche
 576 @example
 577 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 578 @prompt{PSPP>} examine mtbf
 579                 /statistics=descriptives.
 580 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 581 @prompt{PSPP>} examine mtbf_ln
 582                 /statistics=descriptives.
 583 @end example
 584
 585 Output:
 586 @example
 587 1.2 EXAMINE.  Descriptives
 588 #====================================================#=========#==========#
 589 #                                                    #Statistic|Std. Error#
 590 #====================================================#=========#==========#
 591 #mtbf    Mean                                        #   8.32  |   1.62   #
 592 #        95% Confidence Interval for Mean Lower Bound#   4.85  |          #
 593 #                                         Upper Bound#  11.79  |          #
 594 #        5% Trimmed Mean                             #   7.69  |          #
 595 #        Median                                      #   8.12  |          #
 596 #        Variance                                    #  39.21  |          #
 597 #        Std. Deviation                              #   6.26  |          #
 598 #        Minimum                                     #   1.63  |          #
 599 #        Maximum                                     #  26.47  |          #
 600 #        Range                                       #  24.84  |          #
 601 #        Interquartile Range                         #   5.83  |          #
 602 #        Skewness                                    #   1.85  |    .58   #
 603 #        Kurtosis                                    #   4.49  |   1.12   #
 604 #====================================================#=========#==========#
 605
 606 2.2 EXAMINE.  Descriptives
 607 #====================================================#=========#==========#
 608 #                                                    #Statistic|Std. Error#
 609 #====================================================#=========#==========#
 610 #mtbf_ln Mean                                        #   1.88  |    .19   #
 611 #        95% Confidence Interval for Mean Lower Bound#   1.47  |          #
 612 #                                         Upper Bound#   2.29  |          #
 613 #        5% Trimmed Mean                             #   1.88  |          #
 614 #        Median                                      #   2.09  |          #
 615 #        Variance                                    #   .54   |          #
 616 #        Std. Deviation                              #   .74   |          #
 617 #        Minimum                                     #   .49   |          #
 618 #        Maximum                                     #   3.28  |          #
 619 #        Range                                       #   2.79  |          #
 620 #        Interquartile Range                         #   .92   |          #
 621 #        Skewness                                    #   -.16  |    .58   #
 622 #        Kurtosis                                    #   -.09  |   1.12   #
 623 #====================================================#=========#==========#
 624 @end example
 625 @end cartouche
 626 @caption{Testing for normality using the @cmd{EXAMINE} command and applying
 627 a logarithmic transformation.
 628 The @var{mtbf} variable has a large positive skew and is therefore
 629 unsuitable for linear statistical analysis.
 630 However the transformed variable (@var{mtbf_ln}) is close to normal and
 631 would appear to be more suitable.}
 632 @end float
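
One rough way to read the output above, offered here as a rule of thumb
rather than a formal test, is to compare each skewness figure with its
standard error.
For @var{mtbf} the ratio is 1.85 / .58, or about 3.2, whereas for
@var{mtbf_ln} it is -.16 / .58, or about -0.3.
A visual check is also possible.
The following sketch asks @cmd{EXAMINE} for a normal probability plot and
a histogram of the transformed variable:
@example
examine mtbf_ln
        /plot=npplot histogram.
@end example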
 633
 634
 635 @node Hypothesis Testing
 636 @section Hypothesis Testing
 637
 638 @cindex Hypothesis testing
 639 @cindex p-value
 640 @cindex null hypothesis
 641
 642 One of the most fundamental purposes of statistical analysis
 643 is hypothesis testing.
 644 Researchers commonly need to test hypotheses about a set of data.
For example, a researcher might want to test whether one set of data comes from
the same distribution as another,
or whether the mean of a dataset differs significantly from a particular
value.
 649 This section presents just some of the possible tests that PSPP offers.
 650
 651 The researcher starts by making a @dfn{null hypothesis}.
 652 Often this is a hypothesis which he suspects to be false.
 653 For example, if he suspects that @var{A} is greater than @var{B} he will
 654 state the null hypothesis as @math{ @var{A} = @var{B}}.
@footnote{This example assumes that it is already proven that @var{B} is
 656 not greater than @var{A}.}
 657
The @dfn{p-value} is a recurring concept in hypothesis testing.
It is the probability that evidence at least as strong as that observed
against the null hypothesis could have been obtained when the null
hypothesis is in fact true.
The null hypothesis is rejected when the p-value is smaller than the
@dfn{significance level}, which is the highest such probability that the
researcher is prepared to accept.
Note that the p-value is not the same as ``the probability of making an
error'' nor is it the same as ``the probability of rejecting a
hypothesis when it is true''.
 665
 666
 667
 668 @menu
 669 * Testing for differences of means::
 670 * Linear Regression::
 671 @end menu
 672
 673 @node Testing for differences of means
 674 @subsection Testing for differences of means
 675
 676 @cindex T-test
 677 @vindex T-TEST
 678
 679 A common statistical test involves hypotheses about means.
 680 The @cmd{T-TEST} command is used to find out whether or not two separate
 681 subsets have the same mean.
 682
 683 @ref{t-test} uses the file @file{physiology.sav} previously
 684 encountered.
 685 A researcher suspected that the heights and core body
 686 temperature of persons might be different depending upon their sex.
 687 To investigate this, he posed two null hypotheses:
 688 @itemize @bullet
 689 @item The mean heights of males and females in the population are equal.
 690 @item The mean body temperature of males and
 691       females in the population are equal.
 692 @end itemize
 693 @noindent
For the purposes of the investigation the researcher
decided to use a significance level of 0.05.
 696
 697 In addition to the T-test, the @cmd{T-TEST} command also performs the
 698 Levene test for equal variances.
 699 If the variances are equal, then a more powerful form of the T-test can be used.
 700 However if it is unsafe to assume equal variances,
 701 then an alternative calculation is necessary.
 702 PSPP performs both calculations.
 703
For the @var{height} variable, the output shows the significance of the
Levene test to be 0.33 which means there is a
33% probability that the
Levene test produces an outcome at least this extreme when the variances are equal.
Because this is greater than 0.05, there is no evidence against the
assumption of equal variances, so the more powerful row
for equal variances can be used.
Examining this row, the two tailed significance for the @var{height} t-test
is less than 0.05, so it is safe to reject the null hypothesis and conclude
that the mean heights of males and females are unequal.

For the @var{temperature} variable, the significance of the Levene test
is 0.58 so again, it is safe to use the row for equal variances.
The equal variances row indicates that the two tailed significance for
@var{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
the null hypothesis, and we conclude that there is insufficient evidence to
suggest that the body temperatures of male and female persons are different.
 721
 722 @float Example, t-test
 723 @cartouche
 724 @example
 725 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 726 @prompt{PSPP>} recode height (179 = SYSMIS).
 727 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 728 @end example
 729 Output:
 730 @example
 731 1.1 T-TEST.  Group Statistics
 732 #==================#==#=======#==============#========#
 733 #              sex | N|  Mean |Std. Deviation|SE. Mean#
 734 #==================#==#=======#==============#========#
 735 #height      Male  |22|1796.49|         49.71|   10.60#
 736 #            Female|17|1610.77|         25.43|    6.17#
 737 #temperature Male  |22|  36.68|          1.95|     .42#
 738 #            Female|18|  37.43|          1.61|     .38#
 739 #==================#==#=======#==============#========#
 740 1.2 T-TEST.  Independent Samples Test
 741 #===========================#=========#===============================   =#
 742 #                           # Levene's| t-test for Equality of Means      #
 743 #                           #----+----+------+-----+------+---------+-   -#
 744 #                           #    |    |      |     |      |         |     #
 745 #                           #    |    |      |     |Sig. 2|         |     #
 746 #                           #  F |Sig.|   t  |  df |tailed|Mean Diff|     #
 747 #===========================#====#====#======#=====#======#=========#=   =#
 748 #height      Equal variances# .97| .33| 14.02|37.00|   .00|   185.72| ... #
 749 #          Unequal variances#    |    | 15.15|32.71|   .00|   185.72| ... #
 750 #temperature Equal variances# .31| .58| -1.31|38.00|   .20|     -.75| ... #
 751 #          Unequal variances#    |    | -1.33|37.99|   .19|     -.75| ... #
 752 #===========================#====#====#======#=====#======#=========#=   =#
 753 @end example
 754 @end cartouche
 755 @caption{The @cmd{T-TEST} command tests for differences of means.
 756 Here, the @var{height} variable's two tailed significance is less than
 757 0.05, so the null hypothesis can be rejected.
 758 Thus, the evidence suggests there is a difference between the heights of
 759 male and female persons.
 760 However the significance of the test for the @var{temperature}
 761 variable is greater than 0.05 so the null hypothesis cannot be
 762 rejected, and there is insufficient evidence to suggest a difference
 763 in body temperature.}
 764 @end float
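
The t and df columns for @var{height} can be checked by hand from the
Group Statistics table.
The following Python sketch is not part of PSPP.
It assumes that the two rows correspond to the usual pooled (Student) and
Welch forms of the t statistic, and the function names @code{pooled_t} and
@code{welch_t} are chosen here purely for illustration.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df


def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = sd1 ** 2 / n1
    b = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# (N, Mean, Std. Deviation) for height, copied from the Group Statistics table.
male = (22, 1796.49, 49.71)
female = (17, 1610.77, 25.43)

t_equal, df_equal = pooled_t(*male, *female)
t_unequal, df_unequal = welch_t(*male, *female)
print("equal variances:   t = %.2f, df = %.2f" % (t_equal, df_equal))
print("unequal variances: t = %.2f, df = %.2f" % (t_unequal, df_unequal))
```

With these inputs the sketch gives t values of 14.02 and 15.15 with 37.00
and 32.71 degrees of freedom, which agrees with the Independent Samples
Test table.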
 765
 766 @node Linear Regression
 767 @subsection Linear Regression
 768 @cindex linear regression
 769 @vindex REGRESSION
 770
 771 Linear regression is a technique used to investigate if and how a variable
 772 is linearly related to others.
 773 If a variable is found to be linearly related, then this can be used to
 774 predict future values of that variable.
 775
 776 In example @ref{regression}, the service department of the company wanted to
 777 be able to predict the time to repair equipment, in order to improve
 778 the accuracy of their quotations.
 779 It was suggested that the time to repair might be related to the time
 780 between failures and the duty cycle of the equipment.
A significance level of 0.1 was chosen for this investigation.
 782 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 783 was used.
 784 This command not only tests if the variables are related, but also
 785 identifies the potential linear relationship. @xref{REGRESSION}.
 786
 787
 788 @float Example, regression
 789 @cartouche
 790 @example
 791 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 792 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 793 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 794 @end example
 795 Output:
 796 @example
 797 1.3(1) REGRESSION.  Coefficients
 798 #=============================================#====#==========#====#=====#
 799 #                                             #  B |Std. Error|Beta|  t  #
 800 #========#====================================#====#==========#====#=====#
 801 #        |(Constant)                          #9.81|      1.50| .00| 6.54#
 802 #        |Mean time between failures (months) #3.10|       .10| .99|32.43#
 803 #        |Ratio of working to non-working time#1.09|      1.78| .02|  .61#
 804 #        |                                    #    |          |    |     #
 805 #========#====================================#====#==========#====#=====#
 806
 807 1.3(2) REGRESSION.  Coefficients
 808 #=============================================#============#
 809 #                                             #Significance#
 810 #========#====================================#============#
 811 #        |(Constant)                          #         .10#
 812 #        |Mean time between failures (months) #         .00#
 813 #        |Ratio of working to non-working time#         .55#
 814 #        |                                    #            #
 815 #========#====================================#============#
 816 2.3(1) REGRESSION.  Coefficients
 817 #============================================#=====#==========#====#=====#
 818 #                                            #  B  |Std. Error|Beta|  t  #
 819 #========#===================================#=====#==========#====#=====#
 820 #        |(Constant)                         #10.50|       .96| .00|10.96#
 821 #        |Mean time between failures (months)# 3.11|       .09| .99|33.39#
 822 #        |                                   #     |          |    |     #
 823 #========#===================================#=====#==========#====#=====#
 824
 825 2.3(2) REGRESSION.  Coefficients
 826 #============================================#============#
 827 #                                            #Significance#
 828 #========#===================================#============#
 829 #        |(Constant)                         #         .06#
 830 #        |Mean time between failures (months)#         .00#
 831 #        |                                   #            #
 832 #========#===================================#============#
 833 @end example
 834 @end cartouche
 835 @caption{Linear regression analysis to find a predictor for
 836 @var{mttr}.
The first attempt, including @var{duty_cycle}, produces some
unacceptably high significance values.
 839 However the second attempt, which excludes @var{duty_cycle}, produces
 840 significance values no higher than 0.06.
 841 This suggests that @var{mtbf} alone may be a suitable predictor
 842 for @var{mttr}.}
 843 @end float
 844
 845 The coefficients in the first table suggest that the formula
 846 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 847 can be used to predict the time to repair.
 848 However, the significance value for the @var{duty_cycle} coefficient
 849 is very high, which would make this an unsafe predictor.
 850 For this reason, the test was repeated, but omitting the
 851 @var{duty_cycle} variable.
This time, the significance of all coefficients is no higher than 0.06,
 853 suggesting that at the 0.06 level, the formula
 854 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 855 predictor of the time to repair.
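
To show how the fitted formula would be applied, here is a minimal Python
sketch that evaluates the second regression's prediction outside PSPP.
The function name @code{predict_mttr} and the input value 8.0 are
hypothetical; only the two coefficients come from the output above.

```python
def predict_mttr(mtbf_months):
    """Predicted time to repair from the second regression's coefficients."""
    intercept = 10.50  # B for (Constant) in table 2.3(1)
    slope = 3.11       # B for mean time between failures (months)
    return intercept + slope * mtbf_months


# 8.0 is a hypothetical input chosen for illustration,
# not a value taken from repairs.sav.
example_mtbf = 8.0
print("predicted mttr: %.2f" % predict_mttr(example_mtbf))
```

For a mean time between failures of 8 months, this evaluates
10.5 + 3.11 * 8 and prints a predicted @var{mttr} of 35.38.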
 856
 857
 858 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 859 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 860 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 861 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 862 @c  LocalWords:  mttr outfile