
   1 @alias prompt = sansserif
   2
   3 @include tut.texi
   4
   5 @node Using PSPP
   6 @chapter Using @pspp{}
   7
   8 @pspp{} is a tool for the statistical analysis of sampled data.
   9 You can use it to discover patterns in the data,
  10 to explain differences in one subset of data in terms of another subset
  11 and to find out
  12 whether certain beliefs about the data are justified.
  13 This chapter does not attempt to introduce the theory behind the
  14 statistical analysis,
  15 but it shows how such analysis can be performed using @pspp{}.
  16
For the purposes of this tutorial, it is assumed that you are using @pspp{} in its
interactive mode from the command line.
However, the example commands can also be typed into a file and executed
non-interactively by typing @samp{pspp @var{filename}} at a shell prompt,
where @var{filename} is the name of the file containing the commands.
  22 Alternatively, from the graphical interface, you can select
  23 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  24 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  25 executed.
  26 Whichever method you choose, the syntax is identical.
  27
When using the interactive method, @pspp{} tells you that it's waiting for your
input with a prompt like @prompt{PSPP>} or @prompt{data>}.
  30 In the examples of this chapter, whenever you see text like this, it
  31 indicates the prompt displayed by @pspp{}, @emph{not} something that you
  32 should type.
  33
  34 Throughout this chapter reference is made to a number of sample data files.
  35 So that you can try the examples for yourself,
  36 you should have received these files along with your copy of @pspp{}.@c
  37 @footnote{These files contain purely fictitious data.  They should not be used
  38 for research purposes.}
  39 @note{Normally these files are installed in the directory
  40 @file{@value{example-dir}}.
  41 If however your system administrator or operating system vendor has
  42 chosen to install them in a different location, you will have to adjust
  43 the examples accordingly.}
  44
  45
  46 @menu
  47 * Preparation of Data Files::
  48 * Data Screening and Transformation::
  49 * Hypothesis Testing::
  50 @end menu
  51
  52 @node Preparation of Data Files
  53 @section Preparation of Data Files
  54
  55
  56 Before analysis can commence,  the data must be loaded into @pspp{} and
  57 arranged such that both @pspp{} and humans can understand what
  58 the data represents.
  59 There are two aspects of data:
  60
  61 @itemize @bullet
  62 @item The variables --- these are the parameters of a quantity
  63  which has been measured or estimated in some way.
  64  For example height, weight and geographic location are all variables.
  65 @item The observations (also called `cases') of the variables ---
  66  each observation represents an instance when the variables were measured
  67  or observed.
  68 @end itemize
  69
  70 @noindent
  71 For example, a data set which has the variables @var{height}, @var{weight}, and
  72 @var{name}, might have the observations:
  73 @example
  74 1881 89.2 Ahmed
  75 1192 107.01 Frank
  76 1230 67 Julie
  77 @end example
  78 @noindent
  79 The following sections explain how to define a dataset.
  80
  81 @menu
  82 * Defining Variables::
  83 * Listing the data::
  84 * Reading data from a text file::
  85 * Reading data from a pre-prepared PSPP file::
  86 * Saving data to a PSPP file.::
  87 * Reading data from other sources::
  88 @end menu
  89
  90 @node Defining Variables
  91 @subsection Defining Variables
  92 @cindex variables
  93
  94 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
  95 Variables such as age, height and satisfaction are numeric,
  96 whereas name is a string variable.
  97 String variables are best reserved for commentary data to assist the
  98 human observer.
  99 However they can also be used for nominal or categorical data.
 100
 101
 102 @ref{data-list} defines two variables @var{forename} and @var{height},
 103 and reads data into them by manual input.
 104
 105 @float Example, data-list
 106 @cartouche
 107 @example
 108 @prompt{PSPP>} data list list /forename (A12) height.
 109 @prompt{PSPP>} begin data.
 110 @prompt{data>} Ahmed 188
 111 @prompt{data>} Bertram 167
 112 @prompt{data>} Catherine 134.231
 113 @prompt{data>} David 109.1
 114 @prompt{data>} end data
 115 @prompt{PSPP>}
 116 @end example
 117 @end cartouche
 118 @caption{Manual entry of data using the @cmd{DATA LIST} command.
 119 Two variables
 120 @var{forename} and @var{height} are defined and subsequently filled
 121 with  manually entered data.}
 122 @end float
 123
 124 There are several things to note about this example.
 125
 126 @itemize @bullet
 127 @item
 128 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 129 command. @xref{DATA LIST}.
 130 It tells @pspp{} to prepare for reading data.
 131 The word @samp{list} intentionally appears twice.
 132 The first occurrence is part of the @cmd{DATA LIST} call,
 133 whilst the second
 134 tells @pspp{} that the data is to be read as free format data with
 135 one record per line.
 136
 137 @item
 138 The @samp{/} character is important. It marks the start of the list of
 139 variables which you wish to define.
 140
 141 @item
 142 The text @samp{forename} is the name of the first variable,
 143 and @samp{(A12)} says that the variable @var{forename} is a string
 144 variable and that its maximum length is 12 bytes.
 145 The second variable's name is specified by the text @samp{height}.
 146 Since no format is given, this variable has the default format.
 147 Normally the default format expects numeric data, which should be
 148 entered in the locale of the operating system.
 149 Thus, the example is correct for English locales and other
 150 locales which use a period (@samp{.}) as the decimal separator.
 151 However if you are using a system with a locale which uses the comma (@samp{,})
 152 as the decimal separator, then you should in the subsequent lines substitute
 153 @samp{.} with @samp{,}.
 154 Alternatively, you could explicitly tell @pspp{} that the @var{height}
 155 variable is to be read using a period as its decimal separator by appending the
 156 text @samp{DOT8.3} after the word @samp{height}.
 157 For more information on data formats, @pxref{Input and Output Formats}.
 158
 159
 160 @item
 161 Normally, @pspp{} displays the  prompt @prompt{PSPP>} whenever it's
 162 expecting a command.
 163 However, when it's expecting data, the prompt changes to @prompt{data>}
 164 so that you know to enter data and not a command.
 165
 166 @item
 167 At the end of every command there is a terminating @samp{.} which tells
 168 @pspp{} that the end of a command has been encountered.
You should not enter @samp{.} when data is expected (@i{i.e.} when
 170 the @prompt{data>} prompt is current) since it is appropriate only for
 171 terminating commands.
 172 @end itemize
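
As a sketch of the alternative described above for locales that use the
comma as the decimal separator, the definition below gives @var{height} an
explicit @samp{DOT8.3} format.
It assumes that the format is written in parentheses after the variable
name, in the same way as @samp{(A12)}, so that the data lines of
@ref{data-list} can then be entered unchanged.

@example
@prompt{PSPP>} data list list /forename (A12) height (DOT8.3).
@end example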
 173
 174 @node Listing the data
 175 @subsection Listing the data
 176 @vindex LIST
 177
 178 Once the data has been entered,
 179 you could type
 180 @example
 181 @prompt{PSPP>} list /format=numbered.
 182 @end example
 183 @noindent
 184 to list the data.
 185 The optional text @samp{/format=numbered} requests the case numbers to be
 186 shown along with the data.
 187 It should show the following output:
 188 @example
 189 @group
 190 Case#     forename   height
 191 ----- ------------ --------
 192     1 Ahmed          188.00
 193     2 Bertram        167.00
 194     3 Catherine      134.23
 195     4 David          109.10
 196 @end group
 197 @end example
 198 @noindent
 199 Note that the numeric variable @var{height} is displayed to 2 decimal
 200 places, because the format for that variable is @samp{F8.2}.
 201 For a complete description of the @cmd{LIST} command, @pxref{LIST}.
 202
 203 @node Reading data from a text file
 204 @subsection Reading data from a text file
 205 @cindex reading data
 206
 207 The previous example showed how to define a set of variables and to
 208 manually enter the data for those variables.
Manual entering of data is tedious work, and often
a file containing the data will have been previously
prepared.
Let us assume that you have a file called @file{mydata.dat} containing the
ASCII-encoded data:
 214 @example
 215 Ahmed          188.00
 216 Bertram        167.00
 217 Catherine      134.23
 218 David          109.10
 219 @              .
 220 @              .
 221 @              .
 222 Zachariah      113.02
 223 @end example
 224 @noindent
You can tell the @cmd{DATA LIST} command to read the data directly from
this file instead of by manual entry, with a command like:
 227 @example
 228 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
 229 @end example
 230 @noindent
 231 Notice however, that it is still necessary to specify the names of the
 232 variables and their formats, since this information is not contained
 233 in the file.
 234 It is also possible to specify the file's character encoding and other
 235 parameters.
For full details, @pxref{DATA LIST}.
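
As a sketch of specifying the character encoding, the command below assumes
that @file{mydata.dat} is encoded in UTF-8 and uses the @samp{encoding}
setting of @cmd{DATA LIST}; both the encoding name and the placement of the
setting are assumptions to check against @ref{DATA LIST}.

@example
@prompt{PSPP>} data list file='mydata.dat' encoding='UTF-8' list /forename (A12) height.
@end example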
 237
 238 @node Reading data from a pre-prepared PSPP file
 239 @subsection Reading data from a pre-prepared @pspp{} file
 240 @cindex system files
 241 @vindex GET
 242
 243 When working with other @pspp{} users, or users of other software which
 244 uses the @pspp{} data format, you may be given the data in
 245 a pre-prepared @pspp{} file.
Such files contain not only the data, but also the variable definitions,
along with their formats, labels and other meta-data.
 248 Conventionally, these files (sometimes called ``system'' files)
 249 have the suffix @file{.sav}, but that is
 250 not mandatory.
 251 The following syntax loads a file called @file{my-file.sav}.
 252 @example
 253 @prompt{PSPP>} get file='my-file.sav'.
 254 @end example
 255 @noindent
 256 You will encounter several instances of this in future examples.
 257
 258
 259 @node Saving data to a PSPP file.
 260 @subsection Saving data to a @pspp{} file.
 261 @cindex saving
 262 @vindex SAVE
 263
 264 If you want to save your data, along with the variable definitions so
 265 that you or other @pspp{} users can use it later, you can do this with
 266 the @cmd{SAVE} command.
 267
 268 The following syntax will save the existing data and variables to a
 269 file called @file{my-new-file.sav}.
 270 @example
 271 @prompt{PSPP>} save outfile='my-new-file.sav'.
 272 @end example
 273 @noindent
 274 If @file{my-new-file.sav} already exists, then it will be overwritten.
 275 Otherwise it will be created.
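
A minimal sketch of a round trip, using only the commands introduced so
far: the data is saved, loaded again from the new file, and listed to
confirm that the variable definitions came back with it.

@example
@prompt{PSPP>} save outfile='my-new-file.sav'.
@prompt{PSPP>} get file='my-new-file.sav'.
@prompt{PSPP>} list.
@end example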
 276
 277
 278 @node Reading data from other sources
 279 @subsection Reading data from other sources
 280 @cindex comma separated values
 281 @cindex spreadsheets
 282 @cindex databases
 283
 284 Sometimes it's useful to be able to read data from comma
 285 separated text, from spreadsheets, databases or other sources.
 286 In these instances you should
 287 use the @cmd{GET DATA} command (@pxref{GET DATA}).
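
For instance, a comma separated text file might be read with something
like the following sketch.
The file name @file{mydata.csv} is hypothetical; the sketch assumes that
its first line holds column headings and that it has the same two columns
as @file{mydata.dat}.

@example
get data /type=txt /file='mydata.csv'
  /delimiters=','
  /firstcase=2
  /variables=forename A12
             height F8.2.
@end example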
 288
 289
 290 @node Data Screening and Transformation
 291 @section Data Screening and Transformation
 292
 293 @cindex screening
 294 @cindex transformation
 295
 296 Once data has been entered, it is often desirable, or even necessary,
 297 to transform it in some way before performing analysis upon it.
 298 At the very least, it's good practice to check for errors.
 299
 300 @menu
 301 * Identifying incorrect data::
 302 * Dealing with suspicious data::
 303 * Inverting negatively coded variables::
 304 * Testing data consistency::
* Testing for normality::
 306 @end menu
 307
 308 @node Identifying incorrect data
 309 @subsection Identifying incorrect data
 310 @cindex erroneous data
 311 @cindex errors, in data
 312
 313 Data from real sources is rarely error free.
 314 @pspp{} has a number of procedures which can be used to help
 315 identify data which might be incorrect.
 316
 317 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 318 simple linear statistics for a dataset.  It is also useful for
 319 identifying potential problems in the data.
 320 The example file @file{physiology.sav} contains a number of physiological
 321 measurements of a sample of healthy adults selected at random.
 322 However, the data entry clerk made a number of mistakes when entering
 323 the data.
 324 @ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this
 325 data and identify the erroneous values.
 326
 327 @float Example, descriptives
 328 @cartouche
 329 @example
 330 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 331 @prompt{PSPP>} descriptives sex, weight, height.
 332 @end example
 333
 334 Output:
 335 @example
 336 DESCRIPTIVES.  Valid cases = 40; cases with missing value(s) = 0.
 337 +--------#--+-------+-------+-------+-------+
 338 |Variable# N|  Mean |Std Dev|Minimum|Maximum|
 339 #========#==#=======#=======#=======#=======#
 340 |sex     #40|    .45|    .50|    .00|   1.00|
 341 |height  #40|1677.12| 262.87| 179.00|1903.00|
 342 |weight  #40|  72.12|  26.70| -55.60|  92.07|
 343 +--------#--+-------+-------+-------+-------+
 344 @end example
 345 @end cartouche
 346 @caption{Using the @cmd{DESCRIPTIVES} command to display simple
 347 summary information about the data.
 348 In this case, the results show unexpectedly low values in the Minimum
 349 column, suggesting incorrect data entry.}
 350 @end float
 351
 352 In the output of @ref{descriptives},
 353 the most interesting column is the minimum value.
 354 The @var{weight} variable has a minimum value of less than zero,
 355 which is clearly erroneous.
 356 Similarly, the @var{height} variable's minimum value seems to be very low.
 357 In fact, it is more than 5 standard deviations from the mean, and is a
 358 seemingly bizarre height for an adult person.
We can examine the data in more detail with the @cmd{EXAMINE}
command (@pxref{EXAMINE}).
 361
 362 In @ref{examine} you can see that the lowest value of @var{height} is
 363 179 (which we suspect to be erroneous), but the second lowest is 1598
 364 which
 365 we know from the @cmd{DESCRIPTIVES} command
 366 is within 1 standard deviation from the mean.
 367 Similarly the @var{weight} variable has a lowest value which is
 368 negative but a plausible value for the second lowest value.
 369 This suggests that the two extreme values are outliers and probably
 370 represent data entry errors.
 371
 372 @float Example, examine
 373 @cartouche
 374 [@dots{} continue from @ref{descriptives}]
 375 @example
 376 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 377 @end example
 378
 379 Output:
 380 @example
 381 #===============================#===========#=======#
 382 #                               #Case Number| Value #
 383 #===============================#===========#=======#
 384 #Height in millimetres Highest 1#         14|1903.00#
 385 #                              2#         15|1884.00#
 386 #                              3#         12|1801.65#
 387 #                     ----------#-----------+-------#
 388 #                       Lowest 1#         30| 179.00#
 389 #                              2#         31|1598.00#
 390 #                              3#         28|1601.00#
 391 #                     ----------#-----------+-------#
 392 #Weight in kilograms   Highest 1#         13|  92.07#
 393 #                              2#          5|  92.07#
 394 #                              3#         17|  91.74#
 395 #                     ----------#-----------+-------#
 396 #                       Lowest 1#         38| -55.60#
 397 #                              2#         39|  54.48#
 398 #                              3#         33|  55.45#
 399 #===============================#===========#=======#
 400 @end example
 401 @end cartouche
 402 @caption{Using the @cmd{EXAMINE} command to see the extremities of the data
 403 for different variables.  Cases 30 and 38 seem to contain values
 404 very much lower than the rest of the data.
 405 They are possibly erroneous.}
 406 @end float
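
The figure of ``more than 5 standard deviations'' quoted above follows from
the @cmd{DESCRIPTIVES} output: (1677.12 - 179) / 262.87 is approximately 5.7.
As a sketch, @pspp{} can compute such standardized values for every case
itself.
This assumes the @samp{/save} subcommand of @cmd{DESCRIPTIVES}, which stores
the standardized values in new variables whose names are normally the
original names prefixed with @samp{Z}.

@example
descriptives /variables=height weight /save.
list /variables=Zheight Zweight.
@end example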
 407
 408 @node Dealing with suspicious data
 409 @subsection Dealing with suspicious data
 410
 411 @cindex SYSMIS
 412 @cindex recoding data
 413 If possible, suspect data should be checked and re-measured.
 414 However, this may not always be feasible, in which case the researcher may
 415 decide to disregard these values.
 416 @pspp{} has a feature whereby data can assume the special value `SYSMIS', and
 417 will be disregarded in future analysis. @xref{Missing Observations}.
 418 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 419 command.
 420 @example
 421 @pspp{}> recode height (179 = SYSMIS).
 422 @pspp{}> recode weight (LOWEST THRU 0 = SYSMIS).
 423 @end example
 424 @noindent
 425 The first command says that for any observation which has a
 426 @var{height} value of 179, that value should be changed to the SYSMIS
 427 value.
 428 The second command says that any @var{weight} values of zero or less
 429 should be changed to SYSMIS.
 430 From now on, they will be ignored in analysis.
For detailed information about the @cmd{RECODE} command, @pxref{RECODE}.
 432
 433 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in
 434 @ref{descriptives} and @ref{examine} you
 435 will see a data summary with more plausible parameters.
 436 You will also notice that the data summaries indicate the two missing values.
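
If the original values should be kept for reference, a variation on the
commands above writes the cleaned values into new variables instead.
This is a sketch: the names @var{height_clean} and @var{weight_clean} are
hypothetical, and it relies on the @samp{INTO} and @samp{ELSE = COPY}
features of @cmd{RECODE} (@pxref{RECODE}).

@example
recode height (179 = SYSMIS) (ELSE = COPY) INTO height_clean.
recode weight (LOWEST THRU 0 = SYSMIS) (ELSE = COPY) INTO weight_clean.
@end example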
 437
 438 @node Inverting negatively coded variables
 439 @subsection Inverting negatively coded variables
 440
 441 @cindex Likert scale
 442 @cindex Inverting data
 443 Data entry errors are not the only reason for wanting to recode data.
 444 The sample file @file{hotel.sav} comprises data gathered from a
 445 customer satisfaction survey of clients at a particular hotel.
 446 In @ref{reliability}, this file is loaded for analysis.
 447 The line @code{display dictionary.} tells @pspp{} to display the
 448 variables and associated data.
 449 The output from this command has been omitted from the example for the sake of clarity, but
you will notice that each of the variables
@var{v1}, @var{v2} @dots{} @var{v5} is measured on a 5-point Likert scale,
with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 453 Whilst variables @var{v1}, @var{v2} and @var{v4} record responses
 454 to a positively posed question, variables @var{v3} and @var{v5} are
 455 responses to negatively worded questions.
 456 In order to perform meaningful analysis, we need to recode the variables so
 457 that they all measure in the same direction.
 458 We could use the @cmd{RECODE} command, with syntax such as:
 459 @example
 460 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 461 @end example
 462 @noindent
However, an easier and more elegant way uses the @cmd{COMPUTE}
command (@pxref{COMPUTE}).
 465 Since the variables are Likert variables in the range (1 @dots{} 5),
 466 subtracting their value  from 6 has the effect of inverting them:
 467 @example
 468 compute @var{var} = 6 - @var{var}.
 469 @end example
 470 @noindent
 471 @ref{reliability} uses this technique to recode the variables
 472 @var{v3} and @var{v5}.
 473 After applying  @cmd{COMPUTE} for both variables,
 474 all subsequent commands will use the inverted values.
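
More generally, a scale running from @var{min} to @var{max} is inverted by
subtracting each value from @var{min} + @var{max}; here that sum is
1 + 5 = 6, so 1 becomes 5, 2 becomes 4, and 3 is unchanged.
When several variables need the same treatment, the repetition can be
avoided with @cmd{DO REPEAT}, as in this sketch (the placeholder name
@var{v} is arbitrary):

@example
do repeat v = v3 v5.
  compute v = 6 - v.
end repeat.
@end example

@noindent
Note that running the transformation a second time would restore the
original coding, so it should be applied exactly once.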
 475
 476
 477 @node Testing data consistency
 478 @subsection Testing data consistency
 479
 480 @cindex reliability
 481 @cindex consistency
 482
 483 A sensible check to perform on survey data is the calculation of
 484 reliability.
 485 This gives the statistician some confidence that the questionnaires have been
 486 completed thoughtfully.
 487 If you examine the labels of variables @var{v1},  @var{v3} and @var{v5},
 488 you will notice that they ask very similar questions.
 489 One would therefore expect the values of these variables (after recoding)
 490 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 491 command (@pxref{RELIABILITY}).
 492 @ref{reliability} shows a @pspp{} session where the user (after recoding
 493 negatively scaled variables) requests reliability statistics for
 494 @var{v1}, @var{v3} and @var{v5}.
 495
 496 @float Example, reliability
 497 @cartouche
 498 @example
 499 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 500 @prompt{PSPP>} display dictionary.
 501 @prompt{PSPP>} * recode negatively worded questions.
 502 @prompt{PSPP>} compute v3 = 6 - v3.
 503 @prompt{PSPP>} compute v5 = 6 - v5.
 504 @prompt{PSPP>} reliability v1, v3, v5.
 505 @end example
 506
 507 Output (dictionary information omitted for clarity):
 508 @example
 509 1.1 RELIABILITY.  Case Processing Summary
 510 #==============#==#======#
 511 #              # N|   %  #
 512 #==============#==#======#
 513 #Cases Valid   #17|100.00#
 514 #      Excluded# 0|   .00#
 515 #      Total   #17|100.00#
 516 #==============#==#======#
 517
 518 1.2 RELIABILITY.  Reliability Statistics
 519 #================#==========#
 520 #Cronbach's Alpha#N of Items#
 521 #================#==========#
 522 #             .86#         3#
 523 #================#==========#
 524 @end example
 525 @end cartouche
 526 @caption{Recoding negatively scaled variables, and testing for
 527 reliability with the @cmd{RELIABILITY} command. The Cronbach Alpha
 528 coefficient suggests a high degree of reliability among variables
@var{v1}, @var{v3} and @var{v5}.}
 530 @end float
 531
 532 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 533 0.7 or higher to indicate reliable data.
 534 Here, the value is 0.86 so the data and the recoding that we performed
 535 are vindicated.
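
To see how each item contributes, the @cmd{RELIABILITY} command can be asked
for item-total statistics.
The following is a sketch that assumes the fuller form of the command, with
explicit @samp{/variables} and @samp{/summary} subcommands
(@pxref{RELIABILITY}):

@example
reliability /variables=v1 v3 v5
  /model=alpha
  /summary=total.
@end example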
 536
 537
 538 @node Testing for normality
 539 @subsection Testing for normality
 540 @cindex normality, testing
 541
 542 Many statistical tests rely upon certain properties of the data.
 543 One common property, upon which many linear tests depend, is that of
 544 normality --- the data must have been drawn from a normal distribution.
 545 It is necessary then to ensure normality before deciding upon the
 546 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 547
 548 In @ref{normality}, a researcher was examining the failure rates
 549 of equipment produced by an engineering company.
 550 The file @file{repairs.sav} contains the mean time between
 551 failures (@var{mtbf}) of some items of equipment subject to the study.
 552 Before performing linear analysis on the data,
 553 the researcher wanted to ascertain that the data is normally distributed.
 554
A normal distribution has a skewness of zero and an excess kurtosis
(the measure that @pspp{} reports as kurtosis) of zero.
Looking at the skewness of @var{mtbf} in @ref{normality} it is clear
that the @var{mtbf} figures have a lot of positive skew and are therefore
not drawn from a normally distributed variable.
 559 Positive skew can often be compensated for by applying a logarithmic
 560 transformation.
 561 This is done with the @cmd{COMPUTE} command in the line
 562 @example
 563 compute mtbf_ln = ln (mtbf).
 564 @end example
 565 @noindent
 566 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 567 defines a new variable @var{mtbf_ln} which is
 568 the natural logarithm of @var{mtbf}.
 569 The final command in this example calls @cmd{EXAMINE} on this new variable,
 570 and it can be seen from the results that both the skewness and
 571 kurtosis for @var{mtbf_ln} are very close to zero.
 572 This provides some confidence that the @var{mtbf_ln} variable is
 573 normally distributed and thus safe for linear analysis.
 574 In the event that no suitable transformation can be found,
 575 then it would be worth considering
 576 an appropriate non-parametric test instead of a linear one.
 577 @xref{NPAR TESTS}, for information about non-parametric tests.
 578
 579 @float Example, normality
 580 @cartouche
 581 @example
 582 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 583 @prompt{PSPP>} examine mtbf
 584                 /statistics=descriptives.
 585 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 586 @prompt{PSPP>} examine mtbf_ln
 587                 /statistics=descriptives.
 588 @end example
 589
 590 Output:
 591 @example
 592 1.2 EXAMINE.  Descriptives
 593 #====================================================#=========#==========#
 594 #                                                    #Statistic|Std. Error#
 595 #====================================================#=========#==========#
 596 #mtbf    Mean                                        #   8.32  |   1.62   #
 597 #        95% Confidence Interval for Mean Lower Bound#   4.85  |          #
 598 #                                         Upper Bound#  11.79  |          #
 599 #        5% Trimmed Mean                             #   7.69  |          #
 600 #        Median                                      #   8.12  |          #
 601 #        Variance                                    #  39.21  |          #
 602 #        Std. Deviation                              #   6.26  |          #
 603 #        Minimum                                     #   1.63  |          #
 604 #        Maximum                                     #  26.47  |          #
 605 #        Range                                       #  24.84  |          #
 606 #        Interquartile Range                         #   5.83  |          #
 607 #        Skewness                                    #   1.85  |    .58   #
 608 #        Kurtosis                                    #   4.49  |   1.12   #
 609 #====================================================#=========#==========#
 610
 611 2.2 EXAMINE.  Descriptives
 612 #====================================================#=========#==========#
 613 #                                                    #Statistic|Std. Error#
 614 #====================================================#=========#==========#
 615 #mtbf_ln Mean                                        #   1.88  |    .19   #
 616 #        95% Confidence Interval for Mean Lower Bound#   1.47  |          #
 617 #                                         Upper Bound#   2.29  |          #
 618 #        5% Trimmed Mean                             #   1.88  |          #
 619 #        Median                                      #   2.09  |          #
 620 #        Variance                                    #   .54   |          #
 621 #        Std. Deviation                              #   .74   |          #
 622 #        Minimum                                     #   .49   |          #
 623 #        Maximum                                     #   3.28  |          #
 624 #        Range                                       #   2.79  |          #
 625 #        Interquartile Range                         #   .92   |          #
 626 #        Skewness                                    #   -.16  |    .58   #
 627 #        Kurtosis                                    #   -.09  |   1.12   #
 628 #====================================================#=========#==========#
 629 @end example
 630 @end cartouche
 631 @caption{Testing for normality using the @cmd{EXAMINE} command and applying
 632 a logarithmic transformation.
 633 The @var{mtbf} variable has a large positive skew and is therefore
 634 unsuitable for linear statistical analysis.
 635 However the transformed variable (@var{mtbf_ln}) is close to normal and
 636 would appear to be more suitable.}
 637 @end float
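
A visual check can complement the skewness and kurtosis figures.
As a sketch, assuming the @samp{/plot} subcommand of @cmd{EXAMINE}, the
following requests a histogram and a normal probability plot for both the
original and the transformed variable:

@example
examine mtbf mtbf_ln
  /plot=histogram npplot.
@end example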
 638
 639
 640 @node Hypothesis Testing
 641 @section Hypothesis Testing
 642
 643 @cindex Hypothesis testing
 644 @cindex p-value
 645 @cindex null hypothesis
 646
 647 One of the most fundamental purposes of statistical analysis
 648 is hypothesis testing.
Researchers commonly need to test hypotheses about a set of data.
For example, a researcher might want to test whether one set of data comes from
 651 the same distribution as another,
 652 or
 653 whether the mean of a dataset significantly differs from a particular
 654 value.
 655 This section presents just some of the possible tests that @pspp{} offers.
 656
 657 The researcher starts by making a @dfn{null hypothesis}.
 658 Often this is a hypothesis which he suspects to be false.
 659 For example, if he suspects that @var{A} is greater than @var{B} he will
 660 state the null hypothesis as @math{ @var{A} = @var{B}}.@c
 661 @footnote{This example assumes that it is already proven that @var{B} is
 662 not greater than @var{A}.}
 663
The @dfn{p-value} is a recurring concept in hypothesis testing.
It is the probability that evidence at least as strong as that observed
against the null hypothesis could have been obtained when the null
hypothesis is in fact true.
The researcher rejects the null hypothesis when the p-value falls below
a threshold chosen in advance, called the @dfn{significance level}.
Note that the p-value is not the same as ``the probability of making an
error'' nor is it the same as ``the probability of rejecting a
hypothesis when it is true''.
 671
 672
 673
 674 @menu
 675 * Testing for differences of means::
 676 * Linear Regression::
 677 @end menu
 678
 679 @node Testing for differences of means
 680 @subsection Testing for differences of means
 681
 682 @cindex T-test
 683 @vindex T-TEST
 684
 685 A common statistical test involves hypotheses about means.
 686 The @cmd{T-TEST} command is used to find out whether or not two separate
 687 subsets have the same mean.
 688
 689 @ref{t-test} uses the file @file{physiology.sav} previously
 690 encountered.
 691 A researcher suspected that the heights and core body
 692 temperature of persons might be different depending upon their sex.
 693 To investigate this, he posed two null hypotheses:
 694 @itemize @bullet
 695 @item The mean heights of males and females in the population are equal.
@item The mean body temperatures of males and
      females in the population are equal.
 698 @end itemize
 699 @noindent
For the purposes of the investigation the researcher
decided to reject a null hypothesis only if its p-value was less than 0.05.
 702
 703 In addition to the T-test, the @cmd{T-TEST} command also performs the
 704 Levene test for equal variances.
 705 If the variances are equal, then a more powerful form of the T-test can be used.
 706 However if it is unsafe to assume equal variances,
 707 then an alternative calculation is necessary.
 708 @pspp{} performs both calculations.
 709
For the @var{height} variable, the output shows the significance of the
Levene test to be 0.33 which means there is a
33% probability that the Levene test produces an outcome at least this
extreme when the variances are equal.
 714 Had the significance been less than 0.05, then it would have been unsafe to assume that
 715 the variances were equal.
 716 However, because the value is higher than 0.05 the homogeneity of variances assumption
 717 is safe and the ``Equal Variances'' row (the more powerful test) can be used.
 718 Examining this row, the two tailed significance for the @var{height} t-test
 719 is less than 0.05, so it is safe to reject the null hypothesis and conclude
 720 that the mean heights of males and females are unequal.
 721
 722 For the @var{temperature} variable, the significance of the Levene test
 723 is 0.58 so again, it is safe to use the row for equal variances.
The equal variances row indicates that the two tailed significance for
@var{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
the null hypothesis, and must conclude that there is insufficient evidence to
suggest that the body temperatures of male and female persons are different.
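
The same command can also compare the mean of a single variable with a
fixed value, the other kind of test mentioned at the start of this section.
The sketch below uses a hypothetical reference height of 1700 millimetres;
it assumes the @samp{testval} form of @cmd{T-TEST}.

@example
t-test testval=1700 /variables=height.
@end example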
 728
 729 @float Example, t-test
 730 @cartouche
 731 @example
 732 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 733 @prompt{PSPP>} recode height (179 = SYSMIS).
 734 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 735 @end example
 736 Output:
 737 @example
 738 1.1 T-TEST.  Group Statistics
 739 #==================#==#=======#==============#========#
 740 #              sex | N|  Mean |Std. Deviation|SE. Mean#
 741 #==================#==#=======#==============#========#
 742 #height      Male  |22|1796.49|         49.71|   10.60#
 743 #            Female|17|1610.77|         25.43|    6.17#
 744 #temperature Male  |22|  36.68|          1.95|     .42#
 745 #            Female|18|  37.43|          1.61|     .38#
 746 #==================#==#=======#==============#========#
 747 1.2 T-TEST.  Independent Samples Test
 748 #===========================#=========#===============================   =#
 749 #                           # Levene's| t-test for Equality of Means      #
 750 #                           #----+----+------+-----+------+---------+-   -#
 751 #                           #    |    |      |     |      |         |     #
 752 #                           #    |    |      |     |Sig. 2|         |     #
 753 #                           #  F |Sig.|   t  |  df |tailed|Mean Diff|     #
 754 #===========================#====#====#======#=====#======#=========#=   =#
 755 #height      Equal variances# .97| .33| 14.02|37.00|   .00|   185.72| ... #
 756 #          Unequal variances#    |    | 15.15|32.71|   .00|   185.72| ... #
 757 #temperature Equal variances# .31| .58| -1.31|38.00|   .20|     -.75| ... #
 758 #          Unequal variances#    |    | -1.33|37.99|   .19|     -.75| ... #
 759 #===========================#====#====#======#=====#======#=========#=   =#
 760 @end example
 761 @end cartouche
 762 @caption{The @cmd{T-TEST} command tests for differences of means.
 763 Here, the @var{height} variable's two-tailed significance is less than
 764 0.05, so the null hypothesis can be rejected.
 765 Thus, the evidence suggests there is a difference between the heights of
 766 male and female persons.
 767 However, the significance of the test for the @var{temperature}
 768 variable is greater than 0.05 so the null hypothesis cannot be
 769 rejected, and there is insufficient evidence to suggest a difference
 770 in body temperature.}
 771 @end float
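
The @var{height} rows of the Independent Samples Test can be checked by hand
from the Group Statistics table alone.
The following Python fragment is a minimal sketch, assuming the textbook
pooled-variance and Welch formulas; it is not taken from @pspp{} itself, and
the function names are illustrative only.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom (unequal variances)."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Group Statistics for height: (mean, standard deviation, N)
male = (1796.49, 49.71, 22)
female = (1610.77, 25.43, 17)

t_equal, df_equal = pooled_t(*male, *female)
t_unequal, df_unequal = welch_t(*male, *female)
print(round(t_equal, 2), df_equal)
print(round(t_unequal, 2), round(df_unequal, 2))
```

With the rounded figures from the table, this prints 14.02 with 37 degrees of
freedom for the equal variances row, and 15.15 with 32.71 degrees of freedom
for the unequal variances row, in agreement with the output above.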
 772
 773 @node Linear Regression
 774 @subsection Linear Regression
 775 @cindex linear regression
 776 @vindex REGRESSION
 777
 778 Linear regression is a technique used to investigate if and how a variable
 779 is linearly related to others.
 780 If a variable is found to be linearly related, then this can be used to
 781 predict future values of that variable.
 782
 783 In example @ref{regression}, the service department of the company wanted to
 784 be able to predict the time to repair equipment, in order to improve
 785 the accuracy of their quotations.
 786 It was suggested that the time to repair might be related to the time
 787 between failures and the duty cycle of the equipment.
 788 A significance level of 0.1 was chosen for this investigation.
 789 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 790 was used.
 791 This command not only tests if the variables are related, but also
 792 identifies the potential linear relationship. @xref{REGRESSION}.
 793
 794
 795 @float Example, regression
 796 @cartouche
 797 @example
 798 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 799 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 800 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 801 @end example
 802 Output:
 803 @example
 804 1.3(1) REGRESSION.  Coefficients
 805 #=============================================#====#==========#====#=====#
 806 #                                             #  B |Std. Error|Beta|  t  #
 807 #========#====================================#====#==========#====#=====#
 808 #        |(Constant)                          #9.81|      1.50| .00| 6.54#
 809 #        |Mean time between failures (months) #3.10|       .10| .99|32.43#
 810 #        |Ratio of working to non-working time#1.09|      1.78| .02|  .61#
 811 #        |                                    #    |          |    |     #
 812 #========#====================================#====#==========#====#=====#
 813
 814 1.3(2) REGRESSION.  Coefficients
 815 #=============================================#============#
 816 #                                             #Significance#
 817 #========#====================================#============#
 818 #        |(Constant)                          #         .10#
 819 #        |Mean time between failures (months) #         .00#
 820 #        |Ratio of working to non-working time#         .55#
 821 #        |                                    #            #
 822 #========#====================================#============#
 823 2.3(1) REGRESSION.  Coefficients
 824 #============================================#=====#==========#====#=====#
 825 #                                            #  B  |Std. Error|Beta|  t  #
 826 #========#===================================#=====#==========#====#=====#
 827 #        |(Constant)                         #10.50|       .96| .00|10.96#
 828 #        |Mean time between failures (months)# 3.11|       .09| .99|33.39#
 829 #        |                                   #     |          |    |     #
 830 #========#===================================#=====#==========#====#=====#
 831
 832 2.3(2) REGRESSION.  Coefficients
 833 #============================================#============#
 834 #                                            #Significance#
 835 #========#===================================#============#
 836 #        |(Constant)                         #         .06#
 837 #        |Mean time between failures (months)#         .00#
 838 #        |                                   #            #
 839 #========#===================================#============#
 840 @end example
 841 @end cartouche
 842 @caption{Linear regression analysis to find a predictor for
 843 @var{mttr}.
 844 The first attempt, including @var{duty_cycle}, produces some
 845 unacceptably high significance values.
 846 However, the second attempt, which excludes @var{duty_cycle}, produces
 847 significance values no higher than 0.06.
 848 This suggests that @var{mtbf} alone may be a suitable predictor
 849 for @var{mttr}.}
 850 @end float
 851
 852 The coefficients in the first table suggest that the formula
 853 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 854 can be used to predict the time to repair.
 855 However, the significance value for the @var{duty_cycle} coefficient
 856 is 0.55, far above the chosen level of 0.1, which makes @var{duty_cycle} an unsafe predictor.
 857 For this reason, the test was repeated, but omitting the
 858 @var{duty_cycle} variable.
 859 This time, the significance of every coefficient is no higher than 0.06,
 860 suggesting that at the 0.06 level, the formula
 861 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 862 predictor of the time to repair.
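
To show how the fitted formula might be applied, the following Python fragment
is a minimal sketch that evaluates it for a few values of @var{mtbf}.
The coefficients are the rounded values from the second regression; the
function name and the input values are hypothetical and chosen only for
illustration.

```python
def predict_mttr(mtbf):
    """Predicted time to repair, using the second regression's coefficients."""
    intercept = 10.5   # B for (Constant)
    slope = 3.11       # B for mean time between failures (months)
    return intercept + slope * mtbf

# The mtbf values below are hypothetical, not taken from repairs.sav.
for mtbf in (2.0, 5.0, 10.0):
    print(mtbf, round(predict_mttr(mtbf), 2))
```

Because the coefficients are rounded to two decimal places, predictions made
this way are approximate.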
 863
 864
 865 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 866 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 867 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 868 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 869 @c  LocalWords:  mttr outfile