doc/tutorial.texi

   1 @alias prompt = sansserif
   2
   3 @include doc/tut.texi
   4
   5 @node Using PSPP
   6 @chapter Using PSPP
   7
   8 PSPP is a tool for the statistical analysis of sampled data.
   9 You can use it to discover patterns in the data,
  10 to explain differences in one subset of data in terms of another subset
  11 and to find out
  12 whether certain beliefs about the data are justified.
  13 This chapter does not attempt to introduce the theory behind the
  14 statistical analysis,
  15 but it shows how such analysis can be performed using PSPP.
  16
  17 For the purposes of this tutorial, it is assumed that you are using PSPP in its
  18 interactive mode from the command line.
  19 However, the example commands can also be typed into a file and executed in
  20 a post-hoc mode by typing @samp{pspp @var{filename}} at a shell prompt,
  21 where @var{filename} is the name of the file containing the commands.
  22 Alternatively, from the graphical interface, you can select
  23 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  24 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  25 executed.
  26 Whichever method you choose, the syntax is identical.
  27
  28 When using the interactive method, PSPP tells you that it's waiting for your
  29 input with a string like @prompt{PSPP>} or @prompt{data>}.
  30 In the examples of this chapter, whenever you see text like this, it
  31 indicates the prompt displayed by PSPP, @emph{not} something that you
  32 should type.
  33
  34 Throughout this chapter reference is made to a number of sample data files.
  35 So that you can try the examples for yourself,
  36 you should have received these files along with your copy of PSPP.@c
  37 @footnote{These files contain purely fictitious data.  They should not be used
  38 for research purposes.}
  39 @note{Normally these files are installed in the directory
  40 @file{@value{example-dir}}.
  41 If however your system administrator or operating system vendor has
  42 chosen to install them in a different location, you will have to adjust
  43 the examples accordingly.}
  44
  45
  46 @menu
  47 * Preparation of Data Files::
  48 * Data Screening and Transformation::
  49 * Hypothesis Testing::
  50 @end menu
  51
  52 @node Preparation of Data Files
  53 @section Preparation of Data Files
  54
  55
  56 Before analysis can commence,  the data must be loaded into PSPP and
  57 arranged such that both PSPP and humans can understand what
  58 the data represents.
  59 There are two aspects of data:
  60
  61 @itemize @bullet
  62 @item The variables --- these are the parameters of a quantity
  63  which has been measured or estimated in some way.
  64  For example height, weight and geographic location are all variables.
  65 @item The observations (also called `cases') of the variables ---
  66  each observation represents an instance when the variables were measured
  67  or observed.
  68 @end itemize
  69
  70 @noindent
  71 For example, a data set which has the variables @var{height}, @var{weight}, and
  72 @var{name}, might have the observations:
  73 @example
  74 1881 89.2 Ahmed
  75 1192 107.01 Frank
  76 1230 67 Julie
  77 @end example
  78 @noindent
  79 The following sections explain how to define a dataset.
  80
  81 @menu
  82 * Defining Variables::
  83 * Listing the data::
  84 * Reading data from a text file::
  85 * Reading data from a pre-prepared PSPP file::
  86 * Saving data to a PSPP file.::
  87 * Reading data from other sources::
  88 @end menu
  89
  90 @node Defining Variables
  91 @subsection Defining Variables
  92 @cindex variables
  93
  94 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
  95 Variables such as age, height and satisfaction are numeric,
  96 whereas name is a string variable.
  97 String variables are best reserved for commentary data to assist the
  98 human observer.
  99 However, they can also be used for nominal or categorical data.
 100
 101
 102 @ref{data-list} defines two variables @var{forename} and @var{height},
 103 and reads data into them by manual input.
 104
 105 @float Example, data-list
 106 @cartouche
 107 @example
 108 @prompt{PSPP>} data list list /forename (A12) height.
 109 @prompt{PSPP>} begin data.
 110 @prompt{data>} Ahmed 188
 111 @prompt{data>} Bertram 167
 112 @prompt{data>} Catherine 134.231
 113 @prompt{data>} David 109.1
 114 @prompt{data>} end data
 115 @prompt{PSPP>}
 116 @end example
 117 @end cartouche
 118 @caption{Manual entry of data using the @cmd{DATA LIST} command.
 119 Two variables
 120 @var{forename} and @var{height} are defined and subsequently filled
 121 with  manually entered data.}
 122 @end float
 123
 124 There are several things to note about this example.
 125
 126 @itemize @bullet
 127 @item
 128 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 129 command. @xref{DATA LIST}.
 130 It tells PSPP to prepare for reading data.
 131 The word @samp{list} intentionally appears twice.
 132 The first occurrence is part of the @cmd{DATA LIST} call,
 133 whilst the second
 134 tells PSPP that the data is to be read as free format data with
 135 one record per line.
 136
 137 @item
 138 The @samp{/} character is important. It marks the start of the list of
 139 variables which you wish to define.
 140
 141 @item
 142 The text @samp{forename} is the name of the first variable,
 143 and @samp{(A12)} says that the variable @var{forename} is a string
 144 variable and that its maximum length is 12 bytes.
 145 The second variable's name is specified by the text @samp{height}.
 146 Since no format is given, this variable has the default format.
 147 For more information on data formats, @pxref{Input and Output Formats}.
 148
 149
 150 @item
 151 Normally, PSPP displays the  prompt @prompt{PSPP>} whenever it's
 152 expecting a command.
 153 However, when it's expecting data, the prompt changes to @prompt{data>}
 154 so that you know to enter data and not a command.
 155
 156 @item
 157 At the end of every command there is a terminating @samp{.} which tells
 158 PSPP that the end of a command has been encountered.
 159 You should not enter @samp{.} when data is expected (@i{i.e.} when
 160 the @prompt{data>} prompt is current) since it is appropriate only for
 161 terminating commands.
 162 @end itemize
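
The same session can also be kept in a syntax file and run
non-interactively, as described at the start of this chapter.
The following is a minimal sketch, assuming a hypothetical file called
@file{heights.sps}; the prompts are omitted because they appear only in
interactive mode, and the final @cmd{LIST} command is added here simply
to display what was read.

@example
* Hypothetical syntax file; it is not one of the sample files.
data list list /forename (A12) height.
begin data.
Ahmed 188
Bertram 167
Catherine 134.231
David 109.1
end data.
list.
@end example

@noindent
Typing @samp{pspp heights.sps} at a shell prompt should then define the
two variables, read the four cases and list them.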
 163
 164 @node Listing the data
 165 @subsection Listing the data
 166 @vindex LIST
 167
 168 Once the data has been entered,
 169 you could type
 170 @example
 171 @prompt{PSPP>} list /format=numbered.
 172 @end example
 173 @noindent
 174 to list the data.
 175 The optional text @samp{/format=numbered} requests the case numbers to be
 176 shown along with the data.
 177 It should show the following output:
 178 @example
 179 @group
 180 Case#     forename   height
 181 ----- ------------ --------
 182     1 Ahmed          188.00
 183     2 Bertram        167.00
 184     3 Catherine      134.23
 185     4 David          109.10
 186 @end group
 187 @end example
 188 @noindent
 189 Note that the numeric variable @var{height} is displayed to 2 decimal
 190 places, because the format for that variable is @samp{F8.2}.
 191 For a complete description of the @cmd{LIST} command, @pxref{LIST}.
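
@cmd{LIST} can also be restricted to particular variables and cases.
The following sketch assumes the @samp{/variables} and @samp{/cases}
subcommands described in the reference section for @cmd{LIST}; check
that section for the exact forms accepted by your version of PSPP.

@example
list /variables=forename height
     /cases=from 1 to 2
     /format=numbered.
@end example

@noindent
With the four cases entered earlier, this should list only the cases
for Ahmed and Bertram.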
 192
 193 @node Reading data from a text file
 194 @subsection Reading data from a text file
 195 @cindex reading data
 196
 197 The previous example showed how to define a set of variables and to
 198 manually enter the data for those variables.
 199 Entering data manually is tedious work, and often
 200 a file containing the data will have been previously
 201 prepared.
 202 Let us assume that you have a file called @file{mydata.dat} containing the
 203 ASCII-encoded data:
 204 @example
 205 Ahmed          188.00
 206 Bertram        167.00
 207 Catherine      134.23
 208 David          109.10
 209 @              .
 210 @              .
 211 @              .
 212 Zachariah      113.02
 213 @end example
 214 @noindent
 215 You can tell the @cmd{DATA LIST} command to read the data directly from
 216 this file instead of by manual entry, with a command like:
 217 @example
 218 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
 219 @end example
 220 @noindent
 221 Notice, however, that it is still necessary to specify the names of the
 222 variables and their formats, since this information is not contained
 223 in the file.
 224 It is also possible to specify the file's character encoding and other
 225 parameters.
 226 For full details, @pxref{DATA LIST}.
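
As an illustration of such parameters, the sketch below gives
@var{height} an explicit format and names the character encoding of the
file.
It assumes that @file{mydata.dat} can be read as UTF-8 (of which ASCII
is a subset) and that your version of PSPP accepts the @samp{encoding}
subcommand; both are assumptions made for the sake of the example.

@example
data list file='mydata.dat' encoding='UTF-8' list
  /forename (A12) height (F8.2).
@end example

@noindent
Because the command continues until the terminating @samp{.}, it can be
split over several lines as shown.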
 227
 228 @node Reading data from a pre-prepared PSPP file
 229 @subsection Reading data from a pre-prepared PSPP file
 230 @cindex system files
 231 @vindex GET
 232
 233 When working with other PSPP users, or users of other software which
 234 uses the PSPP data format, you may be given the data in
 235 a pre-prepared PSPP file.
 236 Such files contain not only the data, but the variable definitions,
 237 along with their formats, labels and other meta-data.
 238 Conventionally, these files (sometimes called ``system'' files)
 239 have the suffix @file{.sav}, but that is
 240 not mandatory.
 241 The following syntax loads a file called @file{my-file.sav}.
 242 @example
 243 @prompt{PSPP>} get file='my-file.sav'.
 244 @end example
 245 @noindent
 246 You will encounter several instances of this in future examples.
 247
 248
 249 @node Saving data to a PSPP file.
 250 @subsection Saving data to a PSPP file.
 251 @cindex saving
 252 @vindex SAVE
 253
 254 If you want to save your data, along with the variable definitions so
 255 that you or other PSPP users can use it later, you can do this with
 256 the @cmd{SAVE} command.
 257
 258 The following syntax will save the existing data and variables to a
 259 file called @file{my-new-file.sav}.
 260 @example
 261 @prompt{PSPP>} save outfile='my-new-file.sav'.
 262 @end example
 263 @noindent
 264 If @file{my-new-file.sav} already exists, then it will be overwritten.
 265 Otherwise it will be created.
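
One way to check a saved file is to read it straight back.
The sketch below assumes that the @var{forename} and @var{height}
variables from the earlier examples are currently loaded; it uses only
commands that appear elsewhere in this chapter.

@example
save outfile='my-new-file.sav'.
get file='my-new-file.sav'.
display dictionary.
list.
@end example

@noindent
The dictionary and the listed cases should match what was in memory
before the @cmd{SAVE} command was run.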
 266
 267
 268 @node Reading data from other sources
 269 @subsection Reading data from other sources
 270 @cindex comma separated values
 271 @cindex spreadsheets
 272 @cindex databases
 273
 274 Sometimes it's useful to be able to read data from comma
 275 separated text, from spreadsheets, databases or other sources.
 276 In these instances you should
 277 use the @cmd{GET DATA} command (@pxref{GET DATA}).
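
For instance, a comma separated file might be read with syntax along
the following lines.
This is only a sketch: the file name @file{mydata.csv}, the presence of
a header line (skipped here with @samp{/firstcase=2}) and the variable
formats are all assumptions, and the reference section for
@cmd{GET DATA} remains the authority on the available subcommands.

@example
get data /type=txt /file='mydata.csv'
  /arrangement=delimited
  /delimiters=','
  /firstcase=2
  /variables=forename A12
             height F8.2.
@end example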
 278
 279
 280 @node Data Screening and Transformation
 281 @section Data Screening and Transformation
 282
 283 @cindex screening
 284 @cindex transformation
 285
 286 Once data has been entered, it is often desirable, or even necessary,
 287 to transform it in some way before performing analysis upon it.
 288 At the very least, it's good practice to check for errors.
 289
 290 @menu
 291 * Identifying incorrect data::
 292 * Dealing with suspicious data::
 293 * Inverting negatively coded variables::
 294 * Testing data consistency::
 295 * Testing for normality ::
 296 @end menu
 297
 298 @node Identifying incorrect data
 299 @subsection Identifying incorrect data
 300 @cindex erroneous data
 301 @cindex errors, in data
 302
 303 Data from real sources is rarely error free.
 304 PSPP has a number of procedures which can be used to help
 305 identify data which might be incorrect.
 306
 307 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 308 simple linear statistics for a dataset.  It is also useful for
 309 identifying potential problems in the data.
 310 The example file @file{physiology.sav} contains a number of physiological
 311 measurements of a sample of healthy adults selected at random.
 312 However, the data entry clerk made a number of mistakes when entering
 313 the data.
 314 @ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this
 315 data and identify the erroneous values.
 316
 317 @float Example, descriptives
 318 @cartouche
 319 @example
 320 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 321 @prompt{PSPP>} descriptives sex, weight, height.
 322 @end example
 323
 324 Output:
 325 @example
 326 DESCRIPTIVES.  Valid cases = 40; cases with missing value(s) = 0.
 327 +--------#--+-------+-------+-------+-------+
 328 |Variable# N|  Mean |Std Dev|Minimum|Maximum|
 329 #========#==#=======#=======#=======#=======#
 330 |sex     #40|    .45|    .50|    .00|   1.00|
 331 |height  #40|1677.12| 262.87| 179.00|1903.00|
 332 |weight  #40|  72.12|  26.70| -55.60|  92.07|
 333 +--------#--+-------+-------+-------+-------+
 334 @end example
 335 @end cartouche
 336 @caption{Using the @cmd{DESCRIPTIVES} command to display simple
 337 summary information about the data.
 338 In this case, the results show unexpectedly low values in the Minimum
 339 column, suggesting incorrect data entry.}
 340 @end float
 341
 342 In the output of @ref{descriptives},
 343 the most interesting column is the minimum value.
 344 The @var{weight} variable has a minimum value of less than zero,
 345 which is clearly erroneous.
 346 Similarly, the @var{height} variable's minimum value seems to be very low.
 347 In fact, it is more than 5 standard deviations from the mean, and is a
 348 seemingly bizarre height for an adult person.
 349 We can examine the data in more detail with the @cmd{EXAMINE}
 350 command (@pxref{EXAMINE}):
 351
 352 In @ref{examine} you can see that the lowest value of @var{height} is
 353 179 (which we suspect to be erroneous), but the second lowest is 1598
 354 which
 355 we know from the @cmd{DESCRIPTIVES} command
 356 is within 1 standard deviation from the mean.
 357 Similarly the @var{weight} variable has a lowest value which is
 358 negative but a plausible value for the second lowest value.
 359 This suggests that the two extreme values are outliers and probably
 360 represent data entry errors.
 361
 362 @float Example, examine
 363 @cartouche
 364 [@dots{} continue from @ref{descriptives}]
 365 @example
 366 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 367 @end example
 368
 369 Output:
 370 @example
 371 #===============================#===========#=======#
 372 #                               #Case Number| Value #
 373 #===============================#===========#=======#
 374 #Height in millimetres Highest 1#         14|1903.00#
 375 #                              2#         15|1884.00#
 376 #                              3#         12|1801.65#
 377 #                     ----------#-----------+-------#
 378 #                       Lowest 1#         30| 179.00#
 379 #                              2#         31|1598.00#
 380 #                              3#         28|1601.00#
 381 #                     ----------#-----------+-------#
 382 #Weight in kilograms   Highest 1#         13|  92.07#
 383 #                              2#          5|  92.07#
 384 #                              3#         17|  91.74#
 385 #                     ----------#-----------+-------#
 386 #                       Lowest 1#         38| -55.60#
 387 #                              2#         39|  54.48#
 388 #                              3#         33|  55.45#
 389 #===============================#===========#=======#
 390 @end example
 391 @end cartouche
 392 @caption{Using the @cmd{EXAMINE} command to see the extremities of the data
 393 for different variables.  Cases 30 and 38 seem to contain values
 394 very much lower than the rest of the data.
 395 They are possibly erroneous.}
 396 @end float
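
The distance of a value from the mean, measured in standard deviations,
is the value minus the mean, divided by the standard deviation.
For the lowest @var{height}, the figures reported by @cmd{DESCRIPTIVES}
give (1677.12 - 179) / 262.87, which is approximately 5.7.
The same quantity can be computed for every case by asking
@cmd{DESCRIPTIVES} to save standardised scores.
The sketch below assumes the @samp{/save} subcommand and its default
naming scheme, under which the new variables would be called
@var{Zheight} and @var{Zweight}; if your version names them differently,
adjust the @cmd{LIST} command accordingly.

@example
descriptives /variables=height weight /save.
list /variables=height Zheight weight Zweight.
@end example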
 397
 398 @node Dealing with suspicious data
 399 @subsection Dealing with suspicious data
 400
 401 @cindex SYSMIS
 402 @cindex recoding data
 403 If possible, suspect data should be checked and re-measured.
 404 However, this may not always be feasible, in which case the researcher may
 405 decide to disregard these values.
 406 PSPP has a feature whereby data can assume the special value `SYSMIS', and
 407 will be disregarded in future analysis. @xref{Missing Observations}.
 408 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 409 command.
 410 @example
 411 @prompt{PSPP>} recode height (179 = SYSMIS).
 412 @prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
 413 @end example
 414 @noindent
 415 The first command says that for any observation which has a
 416 @var{height} value of 179, that value should be changed to the SYSMIS
 417 value.
 418 The second command says that any @var{weight} values of zero or less
 419 should be changed to SYSMIS.
 420 From now on, they will be ignored in analysis.
 421 For detailed information about the @cmd{RECODE} command, @pxref{RECODE}.
 422
 423 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in
 424 @ref{descriptives} and @ref{examine} you
 425 will see a data summary with more plausible parameters.
 426 You will also notice that the data summaries indicate the two missing values.
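
If you would rather keep the original values for later inspection,
@cmd{RECODE} can write its results to new variables instead of
overwriting the existing ones.
The following is a sketch of that alternative; the names
@var{height_ok} and @var{weight_ok} are hypothetical and do not exist in
@file{physiology.sav}.

@example
recode height (179 = SYSMIS) (ELSE = COPY) into height_ok.
recode weight (LOWEST THRU 0 = SYSMIS) (ELSE = COPY) into weight_ok.
descriptives height_ok, weight_ok.
@end example

@noindent
Here @samp{(ELSE = COPY)} carries every other value across unchanged, so
only the two suspect observations differ between the old and new
variables.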
 427
 428 @node Inverting negatively coded variables
 429 @subsection Inverting negatively coded variables
 430
 431 @cindex Likert scale
 432 @cindex Inverting data
 433 Data entry errors are not the only reason for wanting to recode data.
 434 The sample file @file{hotel.sav} comprises data gathered from a
 435 customer satisfaction survey of clients at a particular hotel.
 436 In @ref{reliability}, this file is loaded for analysis.
 437 The line @code{display dictionary.} tells PSPP to display the
 438 variables and associated data.
 439 The output from this command has been omitted from the example for the sake of clarity, but
 440 you will notice that each of the variables
 441 @var{v1}, @var{v2} @dots{} @var{v5} is measured on a 5-point Likert scale,
 442 with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 443 Whilst variables @var{v1}, @var{v2} and @var{v4} record responses
 444 to a positively posed question, variables @var{v3} and @var{v5} are
 445 responses to negatively worded questions.
 446 In order to perform meaningful analysis, we need to recode the variables so
 447 that they all measure in the same direction.
 448 We could use the @cmd{RECODE} command, with syntax such as:
 449 @example
 450 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 451 @end example
 452 @noindent
 453 However, an easier and more elegant way uses the @cmd{COMPUTE}
 454 command (@pxref{COMPUTE}).
 455 Since the variables are Likert variables in the range (1 @dots{} 5),
 456 subtracting their value  from 6 has the effect of inverting them:
 457 @example
 458 compute @var{var} = 6 - @var{var}.
 459 @end example
 460 @noindent
 461 @ref{reliability} uses this technique to recode the variables
 462 @var{v3} and @var{v5}.
 463 After applying  @cmd{COMPUTE} for both variables,
 464 all subsequent commands will use the inverted values.
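
More generally, a variable coded from 1 to @var{n} is inverted by
subtracting it from @var{n} + 1, so that 1 becomes @var{n} and @var{n}
becomes 1.
When several variables need the same treatment, the transformation can
be written once with @cmd{DO REPEAT}.
The sketch below is an alternative to the two separate @cmd{COMPUTE}
commands; the placeholder name @var{v} is arbitrary.

@example
do repeat v = v3 v5.
compute v = 6 - v.
end repeat.
@end example

@noindent
Use either this form or the separate commands, not both, since inverting
a variable twice restores its original coding.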
 465
 466
 467 @node Testing data consistency
 468 @subsection Testing data consistency
 469
 470 @cindex reliability
 471 @cindex consistency
 472
 473 A sensible check to perform on survey data is the calculation of
 474 reliability.
 475 This gives the statistician some confidence that the questionnaires have been
 476 completed thoughtfully.
 477 If you examine the labels of variables @var{v1},  @var{v3} and @var{v5},
 478 you will notice that they ask very similar questions.
 479 One would therefore expect the values of these variables (after recoding)
 480 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 481 command (@pxref{RELIABILITY}).
 482 @ref{reliability} shows a PSPP session where the user (after recoding
 483 negatively scaled variables) requests reliability statistics for
 484 @var{v1}, @var{v3} and @var{v5}.
 485
 486 @float Example, reliability
 487 @cartouche
 488 @example
 489 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 490 @prompt{PSPP>} display dictionary.
 491 @prompt{PSPP>} * recode negatively worded questions.
 492 @prompt{PSPP>} compute v3 = 6 - v3.
 493 @prompt{PSPP>} compute v5 = 6 - v5.
 494 @prompt{PSPP>} reliability v1, v3, v5.
 495 @end example
 496
 497 Output (dictionary information omitted for clarity):
 498 @example
 499 1.1 RELIABILITY.  Case Processing Summary
 500 #==============#==#======#
 501 #              # N|   %  #
 502 #==============#==#======#
 503 #Cases Valid   #17|100.00#
 504 #      Excluded# 0|   .00#
 505 #      Total   #17|100.00#
 506 #==============#==#======#
 507
 508 1.2 RELIABILITY.  Reliability Statistics
 509 #================#==========#
 510 #Cronbach's Alpha#N of items#
 511 #================#==========#
 512 #             .86#         3#
 513 #================#==========#
 514 @end example
 515 @end cartouche
 516 @caption{Recoding negatively scaled variables, and testing for
 517 reliability with the @cmd{RELIABILITY} command. The Cronbach Alpha
 518 coefficient suggests a high degree of reliability among variables
 519 @var{v1}, @var{v3} and @var{v5}.}
 520 @end float
 521
 522 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 523 0.7 or higher to indicate reliable data.
 524 Here, the value is 0.86 so the data and the recoding that we performed
 525 are vindicated.
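
To see how each individual item contributes to the scale, a per-item
summary can be requested as well.
The sketch below assumes the @samp{/summary=total} subcommand described
in the reference section for @cmd{RELIABILITY}, and that @var{v3} and
@var{v5} have already been inverted as shown above.

@example
reliability /variables=v1 v3 v5
  /summary=total.
@end example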
 526
 527
 528 @node Testing for normality
 529 @subsection Testing for normality
 530 @cindex normality, testing
 531
 532 Many statistical tests rely upon certain properties of the data.
 533 One common property, upon which many linear tests depend, is that of
 534 normality --- the data must have been drawn from a normal distribution.
 535 It is necessary then to ensure normality before deciding upon the
 536 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 537
 538 In @ref{normality}, a researcher was examining the failure rates
 539 of equipment produced by an engineering company.
 540 The file @file{repairs.sav} contains the mean time between
 541 failures (@var{mtbf}) of some items of equipment subject to the study.
 542 Before performing linear analysis on the data,
 543 the researcher wanted to ascertain that the data is normally distributed.
 544
 545 A normal distribution has a skewness and kurtosis of zero.
 546 Looking at the skewness of @var{mtbf} in @ref{normality} it is clear
 547 that the mtbf figures have a lot of positive skew and are therefore
 548 not drawn from a normally distributed variable.
 549 Positive skew can often be compensated for by applying a logarithmic
 550 transformation.
 551 This is done with the @cmd{COMPUTE} command in the line
 552 @example
 553 compute mtbf_ln = ln (mtbf).
 554 @end example
 555 @noindent
 556 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 557 defines a new variable @var{mtbf_ln} which is
 558 the natural logarithm of @var{mtbf}.
 559 The final command in this example calls @cmd{EXAMINE} on this new variable,
 560 and it can be seen from the results that both the skewness and
 561 kurtosis for @var{mtbf_ln} are very close to zero.
 562 This provides some confidence that the @var{mtbf_ln} variable is
 563 normally distributed and thus safe for linear analysis.
 564 In the event that no suitable transformation can be found,
 565 then it would be worth considering
 566 an appropriate non-parametric test instead of a linear one.
 567 @xref{NPAR TESTS}, for information about non-parametric tests.
 568
 569 @float Example, normality
 570 @cartouche
 571 @example
 572 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 573 @prompt{PSPP>} examine mtbf
 574                 /statistics=descriptives.
 575 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 576 @prompt{PSPP>} examine mtbf_ln
 577                 /statistics=descriptives.
 578 @end example
 579
 580 Output:
 581 @example
 582 1.2 EXAMINE.  Descriptives
 583 #====================================================#=========#==========#
 584 #                                                    #Statistic|Std. Error#
 585 #====================================================#=========#==========#
 586 #mtbf    Mean                                        #   8.32  |   1.62   #
 587 #        95% Confidence Interval for Mean Lower Bound#   4.85  |          #
 588 #                                         Upper Bound#  11.79  |          #
 589 #        5% Trimmed Mean                             #   7.69  |          #
 590 #        Median                                      #   8.12  |          #
 591 #        Variance                                    #  39.21  |          #
 592 #        Std. Deviation                              #   6.26  |          #
 593 #        Minimum                                     #   1.63  |          #
 594 #        Maximum                                     #  26.47  |          #
 595 #        Range                                       #  24.84  |          #
 596 #        Interquartile Range                         #   5.83  |          #
 597 #        Skewness                                    #   1.85  |    .58   #
 598 #        Kurtosis                                    #   4.49  |   1.12   #
 599 #====================================================#=========#==========#
 600
 601 2.2 EXAMINE.  Descriptives
 602 #====================================================#=========#==========#
 603 #                                                    #Statistic|Std. Error#
 604 #====================================================#=========#==========#
 605 #mtbf_ln Mean                                        #   1.88  |    .19   #
 606 #        95% Confidence Interval for Mean Lower Bound#   1.47  |          #
 607 #                                         Upper Bound#   2.29  |          #
 608 #        5% Trimmed Mean                             #   1.88  |          #
 609 #        Median                                      #   2.09  |          #
 610 #        Variance                                    #   .54   |          #
 611 #        Std. Deviation                              #   .74   |          #
 612 #        Minimum                                     #   .49   |          #
 613 #        Maximum                                     #   3.28  |          #
 614 #        Range                                       #   2.79  |          #
 615 #        Interquartile Range                         #   .92   |          #
 616 #        Skewness                                    #   -.16  |    .58   #
 617 #        Kurtosis                                    #   -.09  |   1.12   #
 618 #====================================================#=========#==========#
 619 @end example
 620 @end cartouche
 621 @caption{Testing for normality using the @cmd{EXAMINE} command and applying
 622 a logarithmic transformation.
 623 The @var{mtbf} variable has a large positive skew and is therefore
 624 unsuitable for linear statistical analysis.
 625 However the transformed variable (@var{mtbf_ln}) is close to normal and
 626 would appear to be more suitable.}
 627 @end float
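
Skewness and kurtosis figures can be complemented by a visual check.
As a sketch, assuming that the @samp{/plot} subcommand of @cmd{EXAMINE}
is available and that your output format supports charts, the following
requests a histogram and a normal probability plot of the transformed
variable:

@example
examine mtbf_ln
  /plot=histogram npplot.
@end example

@noindent
For data that is close to normal, the points of the normal probability
plot should lie near a straight line.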
 628
 629
 630 @node Hypothesis Testing
 631 @section Hypothesis Testing
 632
 633 @cindex Hypothesis testing
 634 @cindex p-value
 635 @cindex null hypothesis
 636
 637 One of the most fundamental purposes of statistical analysis
 638 is hypothesis testing.
 639 Researchers commonly need to test hypotheses about a set of data.
 640 For example, a researcher might want to test whether one set of data comes from
 641 the same distribution as another,
 642 or
 643 whether the mean of a dataset significantly differs from a particular
 644 value.
 645 This section presents just some of the possible tests that PSPP offers.
 646
 647 The researcher starts by making a @dfn{null hypothesis}.
 648 Often this is a hypothesis which he suspects to be false.
 649 For example, if he suspects that @var{A} is greater than @var{B} he will
 650 state the null hypothesis as @math{ @var{A} = @var{B}}.@c
 651 @footnote{This example assumes that it is already proven that @var{B} is
 652 not greater than @var{A}.}
 653
 654 The @dfn{p-value} is a recurring concept in hypothesis testing.
 655 It is the highest acceptable probability that evidence implying that the
 656 null hypothesis is false could have been obtained when the null
 657 hypothesis is in fact true.
 658 Note that this is not the same as ``the probability of making an
 659 error'' nor is it the same as ``the probability of rejecting a
 660 hypothesis when it is true''.
 661
 662
 663
 664 @menu
 665 * Testing for differences of means::
 666 * Linear Regression::
 667 @end menu
 668
 669 @node Testing for differences of means
 670 @subsection Testing for differences of means
 671
 672 @cindex T-test
 673 @vindex T-TEST
 674
 675 A common statistical test involves hypotheses about means.
 676 The @cmd{T-TEST} command is used to find out whether or not two separate
 677 subsets have the same mean.
 678
 679 @ref{t-test} uses the file @file{physiology.sav} previously
 680 encountered.
 681 A researcher suspected that the heights and core body
 682 temperature of persons might be different depending upon their sex.
 683 To investigate this, he posed two null hypotheses:
 684 @itemize @bullet
 685 @item The mean heights of males and females in the population are equal.
 686 @item The mean body temperature of males and
 687       females in the population are equal.
 688 @end itemize
 689 @noindent
 690 For the purposes of the investigation the researcher
 691 decided to use a  p-value of 0.05.
 692
 693 In addition to the T-test, the @cmd{T-TEST} command also performs the
 694 Levene test for equal variances.
 695 If the variances are equal, then a more powerful form of the T-test can be used.
 696 However if it is unsafe to assume equal variances,
 697 then an alternative calculation is necessary.
 698 PSPP performs both calculations.
 699
 700 For the @var{height} variable, the output shows the significance of the
 701 Levene test to be 0.33, which means there is a
 702 33% probability that the
 703 Levene test produces an outcome at least this extreme when the variances are equal.
 704 Such a probability is too high
 705 to reject the assumption of equal variances, so the row
 706 for equal variances (the more powerful test) can be used.
 707 Examining this row, the two tailed significance for the @var{height} t-test
 708 is less than 0.05, so it is safe to reject the null hypothesis and conclude
 709 that the mean heights of males and females are unequal.
 710
 711 For the @var{temperature} variable, the significance of the Levene test
 712 is 0.58 so again, it is safe to use the row for equal variances.
 713 The equal variances row indicates that the two tailed significance for
 714 @var{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
 715 the null hypothesis, and must conclude that there is insufficient evidence to
 716 suggest that the body temperatures of male and female persons are different.
 717
 718 @float Example, t-test
 719 @cartouche
 720 @example
 721 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 722 @prompt{PSPP>} recode height (179 = SYSMIS).
 723 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 724 @end example
 725 Output:
 726 @example
 727 1.1 T-TEST.  Group Statistics
 728 #==================#==#=======#==============#========#
 729 #              sex | N|  Mean |Std. Deviation|SE. Mean#
 730 #==================#==#=======#==============#========#
 731 #height      Male  |22|1796.49|         49.71|   10.60#
 732 #            Female|17|1610.77|         25.43|    6.17#
 733 #temperature Male  |22|  36.68|          1.95|     .42#
 734 #            Female|18|  37.43|          1.61|     .38#
 735 #==================#==#=======#==============#========#
 736 1.2 T-TEST.  Independent Samples Test
 737 #===========================#=========#===============================   =#
 738 #                           # Levene's| t-test for Equality of Means      #
 739 #                           #----+----+------+-----+------+---------+-   -#
 740 #                           #    |    |      |     |      |         |     #
 741 #                           #    |    |      |     |Sig. 2|         |     #
 742 #                           #  F |Sig.|   t  |  df |tailed|Mean Diff|     #
 743 #===========================#====#====#======#=====#======#=========#=   =#
 744 #height      Equal variances# .97| .33| 14.02|37.00|   .00|   185.72| ... #
 745 #          Unequal variances#    |    | 15.15|32.71|   .00|   185.72| ... #
 746 #temperature Equal variances# .31| .58| -1.31|38.00|   .20|     -.75| ... #
 747 #          Unequal variances#    |    | -1.33|37.99|   .19|     -.75| ... #
 748 #===========================#====#====#======#=====#======#=========#=   =#
 749 @end example
 750 @end cartouche
 751 @caption{The @cmd{T-TEST} command tests for differences of means.
 752 Here, the @var{height} variable's two-tailed significance is less than
 753 0.05, so the null hypothesis can be rejected.
 754 Thus, the evidence suggests there is a difference between the heights of
 755 male and female persons.
 756 However, the significance of the test for the @var{temperature}
 757 variable is greater than 0.05 so the null hypothesis cannot be
 758 rejected, and there is insufficient evidence to suggest a difference
 759 in body temperature.}
 760 @end float
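
The unequal variances rows of the Independent Samples Test table can be
cross-checked by hand from the Group Statistics table.
The following Python sketch is not part of PSPP.
It assumes that those rows are computed with Welch's t-test and the
Welch-Satterthwaite approximation for the degrees of freedom, and the
function name @code{welch_t} is purely illustrative.
Its inputs are the rounded means, standard deviations and counts printed
in the Group Statistics table.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    # Squared standard error of each group mean.
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    # t statistic: difference of the means over its standard error.
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Group statistics copied from the output: male first, then female.
height_t, height_df = welch_t(1796.49, 49.71, 22, 1610.77, 25.43, 17)
temp_t, temp_df = welch_t(36.68, 1.95, 22, 37.43, 1.61, 18)

print("height:      t = %.2f, df = %.2f" % (height_t, height_df))
print("temperature: t = %.2f, df = %.2f" % (temp_t, temp_df))
```

Rounded to two decimal places, the results should agree with the
@samp{t} and @samp{df} columns of the unequal variances rows, up to the
rounding of the inputs.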
 761
 762 @node Linear Regression
 763 @subsection Linear Regression
 764 @cindex linear regression
 765 @vindex REGRESSION
 766
 767 Linear regression is a technique used to investigate if and how a variable
 768 is linearly related to others.
 769 If a variable is found to be linearly related, then this can be used to
 770 predict future values of that variable.
 771
 772 In example @ref{regression}, the service department of the company wanted to
 773 be able to predict the time to repair equipment, in order to improve
 774 the accuracy of their quotations.
 775 It was suggested that the time to repair might be related to the time
 776 between failures and the duty cycle of the equipment.
 777 A significance level of 0.1 was chosen for this investigation.
 778 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 779 was used.
 780 This command not only tests if the variables are related, but also
 781 identifies the potential linear relationship. @xref{REGRESSION}.
 782
 783
 784 @float Example, regression
 785 @cartouche
 786 @example
 787 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 788 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 789 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 790 @end example
 791 Output:
 792 @example
 793 1.3(1) REGRESSION.  Coefficients
 794 #=============================================#====#==========#====#=====#
 795 #                                             #  B |Std. Error|Beta|  t  #
 796 #========#====================================#====#==========#====#=====#
 797 #        |(Constant)                          #9.81|      1.50| .00| 6.54#
 798 #        |Mean time between failures (months) #3.10|       .10| .99|32.43#
 799 #        |Ratio of working to non-working time#1.09|      1.78| .02|  .61#
 800 #        |                                    #    |          |    |     #
 801 #========#====================================#====#==========#====#=====#
 802
 803 1.3(2) REGRESSION.  Coefficients
 804 #=============================================#============#
 805 #                                             #Significance#
 806 #========#====================================#============#
 807 #        |(Constant)                          #         .10#
 808 #        |Mean time between failures (months) #         .00#
 809 #        |Ratio of working to non-working time#         .55#
 810 #        |                                    #            #
 811 #========#====================================#============#
 812 2.3(1) REGRESSION.  Coefficients
 813 #============================================#=====#==========#====#=====#
 814 #                                            #  B  |Std. Error|Beta|  t  #
 815 #========#===================================#=====#==========#====#=====#
 816 #        |(Constant)                         #10.50|       .96| .00|10.96#
 817 #        |Mean time between failures (months)# 3.11|       .09| .99|33.39#
 818 #        |                                   #     |          |    |     #
 819 #========#===================================#=====#==========#====#=====#
 820
 821 2.3(2) REGRESSION.  Coefficients
 822 #============================================#============#
 823 #                                            #Significance#
 824 #========#===================================#============#
 825 #        |(Constant)                         #         .06#
 826 #        |Mean time between failures (months)#         .00#
 827 #        |                                   #            #
 828 #========#===================================#============#
 829 @end example
 830 @end cartouche
 831 @caption{Linear regression analysis to find a predictor for
 832 @var{mttr}.
 833 The first attempt, including @var{duty_cycle}, produces some
 834 unacceptably high significance values.
 835 However, the second attempt, which excludes @var{duty_cycle}, produces
 836 significance values no higher than 0.06.
 837 This suggests that @var{mtbf} alone may be a suitable predictor
 838 for @var{mttr}.}
 839 @end float
 840
 841 The coefficients in the first table suggest that the formula
 842 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 843 can be used to predict the time to repair.
 844 However, the significance value for the @var{duty_cycle} coefficient
 845 is very high, which would make this an unsafe predictor.
 846 For this reason, the test was repeated, but omitting the
 847 @var{duty_cycle} variable.
 848 This time, the significance of every coefficient is no higher than 0.06,
 849 suggesting that at the 0.06 level, the formula
 850 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 851 predictor of the time to repair.
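
As a small illustration of how the second formula might be applied when
preparing a quotation, the Python sketch below evaluates it for a given
time between failures.
The function name @code{predict_mttr} and the input value of 8 months are
hypothetical; only the coefficients 10.50 and 3.11 come from the output
above.

```python
def predict_mttr(mtbf_months):
    # Coefficients taken from the second regression (mtbf only):
    # constant 10.50, slope 3.11 per month between failures.
    return 10.50 + 3.11 * mtbf_months

# Hypothetical input: equipment that fails on average every 8 months.
print("predicted mttr = %.2f" % predict_mttr(8))
```

For a mean time between failures of 8 months the formula gives
10.50 + 3.11 * 8 = 35.38, in whatever units @var{mttr} is recorded in.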
 852
 853
 854 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 855 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 856 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 857 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 858 @c  LocalWords:  mttr outfile