   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @alias prompt = sansserif
  11
  12 @include tut.texi
  13
  14 @node Using PSPP
  15 @chapter Using @pspp{}
  16
  17 @pspp{} is a tool for the statistical analysis of sampled data.
  18 You can use it to discover patterns in the data,
  19 to explain differences in one subset of data in terms of another subset
  20 and to find out
  21 whether certain beliefs about the data are justified.
  22 This chapter does not attempt to introduce the theory behind the
  23 statistical analysis,
  24 but it shows how such analysis can be performed using @pspp{}.
  25
  26 For the purposes of this tutorial, it is assumed that you are using @pspp{} in its
  27 interactive mode from the command line.
  28 However, the example commands can also be typed into a file and executed in
  29 a post-hoc mode by typing @samp{pspp @var{filename}} at a shell prompt,
  30 where @var{filename} is the name of the file containing the commands.
  31 Alternatively, from the graphical interface, you can select
  32 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  33 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  34 executed.
  35 Whichever method you choose, the syntax is identical.
  36
  37 When using the interactive method, @pspp{} tells you that it's waiting for your
  38 input with a string like @prompt{PSPP>} or @prompt{data>}.
  39 In the examples of this chapter, whenever you see text like this, it
  40 indicates the prompt displayed by @pspp{}, @emph{not} something that you
  41 should type.
  42
  43 Throughout this chapter reference is made to a number of sample data files.
  44 So that you can try the examples for yourself,
  45 you should have received these files along with your copy of @pspp{}.@c
  46 @footnote{These files contain purely fictitious data.  They should not be used
  47 for research purposes.}
  48 @note{Normally these files are installed in the directory
  49 @file{@value{example-dir}}.
  50 If however your system administrator or operating system vendor has
  51 chosen to install them in a different location, you will have to adjust
  52 the examples accordingly.}
  53
  54
  55 @menu
  56 * Preparation of Data Files::
  57 * Data Screening and Transformation::
  58 * Hypothesis Testing::
  59 @end menu
  60
  61 @node Preparation of Data Files
  62 @section Preparation of Data Files
  63
  64
  65 Before analysis can commence,  the data must be loaded into @pspp{} and
  66 arranged such that both @pspp{} and humans can understand what
  67 the data represents.
  68 There are two aspects of data:
  69
  70 @itemize @bullet
  71 @item The variables --- these are the parameters of a quantity
  72  which has been measured or estimated in some way.
  73  For example height, weight and geographic location are all variables.
  74 @item The observations (also called `cases') of the variables ---
  75  each observation represents an instance when the variables were measured
  76  or observed.
  77 @end itemize
  78
  79 @noindent
  80 For example, a data set which has the variables @var{height}, @var{weight}, and
  81 @var{name}, might have the observations:
  82 @example
  83 1881 89.2 Ahmed
  84 1192 107.01 Frank
  85 1230 67 Julie
  86 @end example
  87 @noindent
  88 The following sections explain how to define a dataset.
  89
  90 @menu
  91 * Defining Variables::
  92 * Listing the data::
  93 * Reading data from a text file::
  94 * Reading data from a pre-prepared PSPP file::
  95 * Saving data to a PSPP file.::
  96 * Reading data from other sources::
  97 * Exiting PSPP::
  98 @end menu
  99
 100 @node Defining Variables
 101 @subsection Defining Variables
 102 @cindex variables
 103
 104 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
 105 Variables such as age, height and satisfaction are numeric,
 106 whereas name is a string variable.
 107 String variables are best reserved for commentary data to assist the
 108 human observer.
 109 However they can also be used for nominal or categorical data.
 110
 111
 112 @ref{data-list} defines two variables @var{forename} and @var{height},
 113 and reads data into them by manual input.
 114
 115 @float Example, data-list
 116 @cartouche
 117 @example
 118 @prompt{PSPP>} data list list /forename (A12) height.
 119 @prompt{PSPP>} begin data.
 120 @prompt{data>} Ahmed 188
 121 @prompt{data>} Bertram 167
 122 @prompt{data>} Catherine 134.231
 123 @prompt{data>} David 109.1
 124 @prompt{data>} end data
 125 @prompt{PSPP>}
 126 @end example
 127 @end cartouche
 128 @caption{Manual entry of data using the @cmd{DATA LIST} command.
 129 Two variables
 130 @var{forename} and @var{height} are defined and subsequently filled
 131 with  manually entered data.}
 132 @end float
 133
 134 There are several things to note about this example.
 135
 136 @itemize @bullet
 137 @item
 138 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 139 command. @xref{DATA LIST}.
 140 It tells @pspp{} to prepare for reading data.
 141 The word @samp{list} intentionally appears twice.
 142 The first occurrence is part of the @cmd{DATA LIST} call,
 143 whilst the second
 144 tells @pspp{} that the data is to be read as free format data with
 145 one record per line.
 146
 147 @item
 148 The @samp{/} character is important. It marks the start of the list of
 149 variables which you wish to define.
 150
 151 @item
 152 The text @samp{forename} is the name of the first variable,
 153 and @samp{(A12)} says that the variable @var{forename} is a string
 154 variable and that its maximum length is 12 bytes.
 155 The second variable's name is specified by the text @samp{height}.
 156 Since no format is given, this variable has the default format.
 157 Normally the default format expects numeric data, which should be
 158 entered in the locale of the operating system.
 159 Thus, the example is correct for English locales and other
 160 locales which use a period (@samp{.}) as the decimal separator.
 161 However if you are using a system with a locale which uses the comma (@samp{,})
 162 as the decimal separator, then you should in the subsequent lines substitute
 163 @samp{.} with @samp{,}.
 164 Alternatively, you could explicitly tell @pspp{} that the @var{height}
 165 variable is to be read using a period as its decimal separator by appending the
 166 text @samp{DOT8.3} after the word @samp{height}.
 167 For more information on data formats, @pxref{Input and Output Formats}.
 168
 169
 170 @item
 171 Normally, @pspp{} displays the  prompt @prompt{PSPP>} whenever it's
 172 expecting a command.
 173 However, when it's expecting data, the prompt changes to @prompt{data>}
 174 so that you know to enter data and not a command.
 175
 176 @item
 177 At the end of every command there is a terminating @samp{.} which tells
 178 @pspp{} that the end of a command has been encountered.
 179 You should not enter @samp{.} when data is expected (@i{i.e.} when
 180 the @prompt{data>} prompt is current) since it is appropriate only for
 181 terminating commands.
 182 @end itemize
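
As a further sketch, the three-variable data set shown at the start of this
section could be entered from a syntax file as follows.
The explicit @samp{(F8.2)} is a precaution based on an assumption:
a format in parentheses is understood here to apply to every variable in
the list that precedes it, so omitting it would risk @var{height} and
@var{weight} being read as strings.

@example
* Define two numeric variables and one string variable.
data list list /height weight (F8.2) name (A12).
begin data.
1881 89.2 Ahmed
1192 107.01 Frank
1230 67 Julie
end data.
* Show the cases that were read.
list.
@end example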
 183
 184 @node Listing the data
 185 @subsection Listing the data
 186 @vindex LIST
 187
 188 Once the data has been entered,
 189 you could type
 190 @example
 191 @prompt{PSPP>} list /format=numbered.
 192 @end example
 193 @noindent
 194 to list the data.
 195 The optional text @samp{/format=numbered} requests the case numbers to be
 196 shown along with the data.
 197 It should show the following output:
 198 @example
 199 @group
 200            Data List
 201 +-----------+---------+------+
 202 |Case Number| forename|height|
 203 +-----------+---------+------+
 204 |1          |Ahmed    |188.00|
 205 |2          |Bertram  |167.00|
 206 |3          |Catherine|134.23|
 207 |4          |David    |109.10|
 208 +-----------+---------+------+
 209 @end group
 210 @end example
 211 @noindent
 212 Note that the numeric variable @var{height} is displayed to 2 decimal
 213 places, because the format for that variable is @samp{F8.2}.
 214 For a complete description of the @cmd{LIST} command, @pxref{LIST}.
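
To list only part of the data, the @cmd{LIST} command also accepts
@samp{/variables} and @samp{/cases} subcommands.
The following sketch, which assumes the four cases entered earlier are
still loaded, should show only the @var{forename} values of the first
two cases:

@example
@prompt{PSPP>} list /variables=forename /cases=from 1 to 2 /format=numbered.
@end example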
 215
 216 @node Reading data from a text file
 217 @subsection Reading data from a text file
 218 @cindex reading data
 219
 220 The previous example showed how to define a set of variables and to
 221 manually enter the data for those variables.
 222 Manual entry of data is tedious work, and often
 223 a file containing the data will have been previously
 224 prepared.
 225 Let us assume that you have a file called @file{mydata.dat} containing the
 226 ASCII-encoded data:
 227 @example
 228 Ahmed          188.00
 229 Bertram        167.00
 230 Catherine      134.23
 231 David          109.10
 232 @              .
 233 @              .
 234 @              .
 235 Zachariah      113.02
 236 @end example
 237 @noindent
 238 You can tell the @cmd{DATA LIST} command to read the data directly from
 239 this file instead of by manual entry, with a command like:
 240 @example
 241 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
 242 @end example
 243 @noindent
 244 Notice however, that it is still necessary to specify the names of the
 245 variables and their formats, since this information is not contained
 246 in the file.
 247 It is also possible to specify the file's character encoding and other
 248 parameters.
 249 For full details, @pxref{DATA LIST}.
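
For instance, if @file{mydata.dat} were encoded in UTF-8 rather than
ASCII, a command along the following lines could be used.
This is a sketch only; the encoding name is an assumption about the file.

@example
@prompt{PSPP>} data list file='mydata.dat' encoding='UTF-8' list /forename (A12) height.
@end example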
 250
 251 @node Reading data from a pre-prepared PSPP file
 252 @subsection Reading data from a pre-prepared @pspp{} file
 253 @cindex system files
 254 @vindex GET
 255
 256 When working with other @pspp{} users, or users of other software which
 257 uses the @pspp{} data format, you may be given the data in
 258 a pre-prepared @pspp{} file.
 259 Such files contain not only the data, but the variable definitions,
 260 along with their formats, labels and other meta-data.
 261 Conventionally, these files (sometimes called ``system'' files)
 262 have the suffix @file{.sav}, but that is
 263 not mandatory.
 264 The following syntax loads a file called @file{my-file.sav}.
 265 @example
 266 @prompt{PSPP>} get file='my-file.sav'.
 267 @end example
 268 @noindent
 269 You will encounter several instances of this in future examples.
 270
 271
 272 @node Saving data to a PSPP file.
 273 @subsection Saving data to a @pspp{} file
 274 @cindex saving
 275 @vindex SAVE
 276
 277 If you want to save your data, along with the variable definitions so
 278 that you or other @pspp{} users can use it later, you can do this with
 279 the @cmd{SAVE} command.
 280
 281 The following syntax will save the existing data and variables to a
 282 file called @file{my-new-file.sav}.
 283 @example
 284 @prompt{PSPP>} save outfile='my-new-file.sav'.
 285 @end example
 286 @noindent
 287 If @file{my-new-file.sav} already exists, then it will be overwritten.
 288 Otherwise it will be created.
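
A quick way to confirm that a file was written as intended is to read
it back and inspect its dictionary.
The following sketch reuses the file name from the example above:

@example
@prompt{PSPP>} save outfile='my-new-file.sav'.
@prompt{PSPP>} get file='my-new-file.sav'.
@prompt{PSPP>} display dictionary.
@end example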
 289
 290
 291 @node Reading data from other sources
 292 @subsection Reading data from other sources
 293 @cindex comma separated values
 294 @cindex spreadsheets
 295 @cindex databases
 296
 297 Sometimes it's useful to be able to read data from comma
 298 separated text, from spreadsheets, databases or other sources.
 299 In these instances you should
 300 use the @cmd{GET DATA} command (@pxref{GET DATA}).
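
As a minimal sketch, suppose the earlier data were held in a
comma-separated file whose first line contains column headings.
The file name @file{mydata.csv} and its layout are assumptions made
purely for illustration:

@example
get data /type=txt /file='mydata.csv'
  /delimiters=','
  /firstcase=2
  /variables=forename A12
             height F8.2.
@end example

@noindent
Here @samp{/firstcase=2} skips the heading line, and each variable
name is followed by its format.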
 301
 302 @node Exiting PSPP
 303 @subsection Exiting @pspp{}
 304
 305 Use the @cmd{FINISH} command to exit @pspp{}:
 306 @example
 307 @prompt{PSPP>} finish.
 308 @end example
 309
 310 @node Data Screening and Transformation
 311 @section Data Screening and Transformation
 312
 313 @cindex screening
 314 @cindex transformation
 315
 316 Once data has been entered, it is often desirable, or even necessary,
 317 to transform it in some way before performing analysis upon it.
 318 At the very least, it's good practice to check for errors.
 319
 320 @menu
 321 * Identifying incorrect data::
 322 * Dealing with suspicious data::
 323 * Inverting negatively coded variables::
 324 * Testing data consistency::
 325 * Testing for normality::
 326 @end menu
 327
 328 @node Identifying incorrect data
 329 @subsection Identifying incorrect data
 330 @cindex erroneous data
 331 @cindex errors, in data
 332
 333 Data from real sources is rarely error free.
 334 @pspp{} has a number of procedures which can be used to help
 335 identify data which might be incorrect.
 336
 337 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 338 simple linear statistics for a dataset.  It is also useful for
 339 identifying potential problems in the data.
 340 The example file @file{physiology.sav} contains a number of physiological
 341 measurements of a sample of healthy adults selected at random.
 342 However, the data entry clerk made a number of mistakes when entering
 343 the data.
 344 @ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this
 345 data and identify the erroneous values.
 346
 347 @float Example, descriptives
 348 @cartouche
 349 @example
 350 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 351 @prompt{PSPP>} descriptives sex, weight, height.
 352 @end example
 353
 354 Output:
 355 @example
 356                   Descriptive Statistics
 357 +---------------------+--+-------+-------+-------+-------+
 358 |                     | N|  Mean |Std Dev|Minimum|Maximum|
 359 +---------------------+--+-------+-------+-------+-------+
 360 |Sex of subject       |40|    .45|    .50|Male   |Female |
 361 |Weight in kilograms  |40|  72.12|  26.70|  -55.6|   92.1|
 362 |Height in millimeters|40|1677.12| 262.87|    179|   1903|
 363 |Valid N (listwise)   |40|       |       |       |       |
 364 |Missing N (listwise) | 0|       |       |       |       |
 365 +---------------------+--+-------+-------+-------+-------+
 366 @end example
 367 @end cartouche
 368 @caption{Using the @cmd{DESCRIPTIVES} command to display simple
 369 summary information about the data.
 370 In this case, the results show unexpectedly low values in the Minimum
 371 column, suggesting incorrect data entry.}
 372 @end float
 373
 374 In the output of @ref{descriptives},
 375 the most interesting column is the minimum value.
 376 The @var{weight} variable has a minimum value of less than zero,
 377 which is clearly erroneous.
 378 Similarly, the @var{height} variable's minimum value seems to be very low.
 379 In fact, it is more than 5 standard deviations from the mean, and is a
 380 seemingly bizarre height for an adult person.
 381 We can examine the data in more detail with the @cmd{EXAMINE}
 382 command (@pxref{EXAMINE}).
 383
 384 In @ref{ex1} you can see that the lowest value of @var{height} is
 385 179 (which we suspect to be erroneous), but the second lowest is 1598
 386 which
 387 we know from the @cmd{DESCRIPTIVES} command
 388 is within 1 standard deviation from the mean.
 389 Similarly the @var{weight} variable has a lowest value which is
 390 negative but a plausible value for the second lowest value.
 391 This suggests that the two extreme values are outliers and probably
 392 represent data entry errors.
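
The distances from the mean quoted above can be checked by hand from
the @cmd{DESCRIPTIVES} output (mean 1677.12, standard deviation 262.87).
The results are rounded to two decimal places:

@example
(1677.12 - 179)  / 262.87 = 5.70   (lowest height)
(1677.12 - 1598) / 262.87 = 0.30   (second lowest height)
@end example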
 393
 394 @float Example, ex1
 395 @cartouche
 396 [@dots{} continue from @ref{descriptives}]
 397 @example
 398 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 399 @end example
 400
 401 Output:
 402 @example
 403                    Extreme Values
 404 +-------------------------------+-----------+-----+
 405 |                               |Case Number|Value|
 406 +-------------------------------+-----------+-----+
 407 |Height in millimeters Highest 1|         14| 1903|
 408 |                              2|         15| 1884|
 409 |                              3|         12| 1802|
 410 |                      Lowest  1|         30|  179|
 411 |                              2|         31| 1598|
 412 |                              3|         28| 1601|
 413 +-------------------------------+-----------+-----+
 414 |Weight in kilograms   Highest 1|         13| 92.1|
 415 |                              2|          5| 92.1|
 416 |                              3|         17| 91.7|
 417 |                      Lowest  1|         38|-55.6|
 418 |                              2|         39| 54.5|
 419 |                              3|         33| 55.4|
 420 +-------------------------------+-----------+-----+
 421 @end example
 422 @end cartouche
 423 @caption{Using the @cmd{EXAMINE} command to see the extremities of the data
 424 for different variables.  Cases 30 and 38 seem to contain values
 425 very much lower than the rest of the data.
 426 They are possibly erroneous.}
 427 @end float
 428
 429 @node Dealing with suspicious data
 430 @subsection Dealing with suspicious data
 431
 432 @cindex SYSMIS
 433 @cindex recoding data
 434 If possible, suspect data should be checked and re-measured.
 435 However, this may not always be feasible, in which case the researcher may
 436 decide to disregard these values.
 437 @pspp{} has a feature whereby data can assume the special value `SYSMIS', and
 438 will be disregarded in future analysis. @xref{Missing Observations}.
 439 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 440 command.
 441 @example
 442 @prompt{PSPP>} recode height (179 = SYSMIS).
 443 @prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
 444 @end example
 445 @noindent
 446 The first command says that for any observation which has a
 447 @var{height} value of 179, that value should be changed to the SYSMIS
 448 value.
 449 The second command says that any @var{weight} values of zero or less
 450 should be changed to SYSMIS.
 451 From now on, they will be ignored in analysis.
 452 For detailed information about the @cmd{RECODE} command, @pxref{RECODE}.
 453
 454 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in
 455 @ref{descriptives} and @ref{ex1} you
 456 will see a data summary with more plausible parameters.
 457 You will also notice that the data summaries indicate the two missing values.
 458
 459 @node Inverting negatively coded variables
 460 @subsection Inverting negatively coded variables
 461
 462 @cindex Likert scale
 463 @cindex Inverting data
 464 Data entry errors are not the only reason for wanting to recode data.
 465 The sample file @file{hotel.sav} comprises data gathered from a
 466 customer satisfaction survey of clients at a particular hotel.
 467 In @ref{reliability}, this file is loaded for analysis.
 468 The line @code{display dictionary.} tells @pspp{} to display the
 469 variables and associated data.
 470 The output from this command has been omitted from the example for the sake of clarity, but
 471 you will notice that each of the variables
 472 @var{v1}, @var{v2} @dots{} @var{v5} is measured on a 5-point Likert scale,
 473 with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 474 Whilst variables @var{v1}, @var{v2} and @var{v4} record responses
 475 to a positively posed question, variables @var{v3} and @var{v5} are
 476 responses to negatively worded questions.
 477 In order to perform meaningful analysis, we need to recode the variables so
 478 that they all measure in the same direction.
 479 We could use the @cmd{RECODE} command, with syntax such as:
 480 @example
 481 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 482 @end example
 483 @noindent
 484 However an easier and more elegant way uses the @cmd{COMPUTE}
 485 command (@pxref{COMPUTE}).
 486 Since the variables are Likert variables in the range (1 @dots{} 5),
 487 subtracting their value  from 6 has the effect of inverting them:
 488 @example
 489 compute @var{var} = 6 - @var{var}.
 490 @end example
 491 @noindent
 492 @ref{reliability} uses this technique to recode the variables
 493 @var{v3} and @var{v5}.
 494 After applying  @cmd{COMPUTE} for both variables,
 495 all subsequent commands will use the inverted values.
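
If you would rather keep the original responses, the inverted values
can be written to a new variable instead.
In this sketch the name @var{v3_inv} is hypothetical; the mapping
produced is 1 to 5, 2 to 4, 3 to 3, 4 to 2 and 5 to 1:

@example
* Store the inverted score in a new variable, leaving v3 untouched.
compute v3_inv = 6 - v3.
* Compare the first few original and inverted values.
list /variables=v3 v3_inv /cases=from 1 to 5.
@end example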
 496
 497
 498 @node Testing data consistency
 499 @subsection Testing data consistency
 500
 501 @cindex reliability
 502 @cindex consistency
 503
 504 A sensible check to perform on survey data is the calculation of
 505 reliability.
 506 This gives the statistician some confidence that the questionnaires have been
 507 completed thoughtfully.
 508 If you examine the labels of variables @var{v1},  @var{v3} and @var{v4},
 509 you will notice that they ask very similar questions.
 510 One would therefore expect the values of these variables (after recoding)
 511 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 512 command (@pxref{RELIABILITY}).
 513 @ref{reliability} shows a @pspp{} session where the user (after recoding
 514 negatively scaled variables) requests reliability statistics for
 515 @var{v1}, @var{v3} and @var{v4}.
 516
 517 @float Example, reliability
 518 @cartouche
 519 @example
 520 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 521 @prompt{PSPP>} display dictionary.
 522 @prompt{PSPP>} * recode negatively worded questions.
 523 @prompt{PSPP>} compute v3 = 6 - v3.
 524 @prompt{PSPP>} compute v5 = 6 - v5.
 525 @prompt{PSPP>} reliability v1, v3, v4.
 526 @end example
 527
 528 Output (dictionary information omitted for clarity):
 529 @example
 530 Scale: ANY
 531
 532 Case Processing Summary
 533 +--------+--+-------+
 534 |Cases   | N|Percent|
 535 +--------+--+-------+
 536 |Valid   |17| 100.0%|
 537 |Excluded| 0|    .0%|
 538 |Total   |17| 100.0%|
 539 +--------+--+-------+
 540
 541     Reliability Statistics
 542 +----------------+----------+
 543 |Cronbach's Alpha|N of Items|
 544 +----------------+----------+
 545 |             .81|         3|
 546 +----------------+----------+
 547 @end example
 548 @end cartouche
 549 @caption{Recoding negatively scaled variables, and testing for
 550 reliability with the @cmd{RELIABILITY} command. The Cronbach Alpha
 551 coefficient suggests a high degree of reliability among variables
 552 @var{v1}, @var{v3} and @var{v4}.}
 553 @end float
 554
 555 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 556 0.7 or higher to indicate reliable data.
 557 Here, the value is 0.81 so the data and the recoding that we performed
 558 are vindicated.
 559
 560
 561 @node Testing for normality
 562 @subsection Testing for normality
 563 @cindex normality, testing
 564
 565 Many statistical tests rely upon certain properties of the data.
 566 One common property, upon which many linear tests depend, is that of
 567 normality --- the data must have been drawn from a normal distribution.
 568 It is necessary then to ensure normality before deciding upon the
 569 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 570
 571 In @ref{normality}, a researcher was examining the failure rates
 572 of equipment produced by an engineering company.
 573 The file @file{repairs.sav} contains the mean time between
 574 failures (@var{mtbf}) of some items of equipment subject to the study.
 575 Before performing linear analysis on the data,
 576 the researcher wanted to ascertain that the data is normally distributed.
 577
 578 A normal distribution has a skewness and excess kurtosis of zero.
 579 Looking at the skewness of @var{mtbf} in @ref{normality} it is clear
 580 that the mtbf figures have a lot of positive skew and are therefore
 581 not drawn from a normally distributed variable.
 582 Positive skew can often be compensated for by applying a logarithmic
 583 transformation.
 584 This is done with the @cmd{COMPUTE} command in the line
 585 @example
 586 compute mtbf_ln = ln (mtbf).
 587 @end example
 588 @noindent
 589 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 590 defines a new variable @var{mtbf_ln} which is
 591 the natural logarithm of @var{mtbf}.
 592 The final command in this example calls @cmd{EXAMINE} on this new variable,
 593 and it can be seen from the results that both the skewness and
 594 kurtosis for @var{mtbf_ln} are very close to zero.
 595 This provides some confidence that the @var{mtbf_ln} variable is
 596 normally distributed and thus safe for linear analysis.
 597 In the event that no suitable transformation can be found,
 598 then it would be worth considering
 599 an appropriate non-parametric test instead of a linear one.
 600 @xref{NPAR TESTS}, for information about non-parametric tests.
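
@cmd{NPAR TESTS} can also complement the skewness and kurtosis check:
its one-sample Kolmogorov-Smirnov test compares a variable against a
normal distribution.
The following is a sketch only; it assumes that, when no mean and
standard deviation are given after @samp{normal}, they are estimated
from the data:

@example
@prompt{PSPP>} npar tests /kolmogorov-smirnov (normal) = mtbf_ln.
@end example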
 601
 602 @float Example, normality
 603 @cartouche
 604 @example
 605 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 606 @prompt{PSPP>} examine mtbf
 607                 /statistics=descriptives.
 608 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 609 @prompt{PSPP>} examine mtbf_ln
 610                 /statistics=descriptives.
 611 @end example
 612
 613 Output:
 614 @example
 615                        Case Processing Summary
 616 +-----------------------------------+-------------------------------+
 617 |                                   |             Cases             |
 618 |                                   +----------+---------+----------+
 619 |                                   |   Valid  | Missing |   Total  |
 620 |                                   | N|Percent|N|Percent| N|Percent|
 621 +-----------------------------------+--+-------+-+-------+--+-------+
 622 |Mean time between failures (months)|15| 100.0%|0|    .0%|15| 100.0%|
 623 +-----------------------------------+--+-------+-+-------+--+-------+
 624
 625                                   Descriptives
 626 +----------------------------------------------------------+---------+--------+
 627 |                                                          |         |  Std.  |
 628 |                                                          |Statistic|  Error |
 629 +----------------------------------------------------------+---------+--------+
 630 |Mean time between        Mean                             |     8.32|    1.62|
 631 |failures (months)        95% Confidence Interval Lower    |     4.85|        |
 632 |                         for Mean                Bound    |         |        |
 633 |                                                 Upper    |    11.79|        |
 634 |                                                 Bound    |         |        |
 635 |                         5% Trimmed Mean                  |     7.69|        |
 636 |                         Median                           |     8.12|        |
 637 |                         Variance                         |    39.21|        |
 638 |                         Std. Deviation                   |     6.26|        |
 639 |                         Minimum                          |     1.63|        |
 640 |                         Maximum                          |    26.47|        |
 641 |                         Range                            |    24.84|        |
 642 |                         Interquartile Range              |     5.83|        |
 643 |                         Skewness                         |     1.85|     .58|
 644 |                         Kurtosis                         |     4.49|    1.12|
 645 +----------------------------------------------------------+---------+--------+
 646
 647          Case Processing Summary
 648 +-------+-------------------------------+
 649 |       |             Cases             |
 650 |       +----------+---------+----------+
 651 |       |   Valid  | Missing |   Total  |
 652 |       | N|Percent|N|Percent| N|Percent|
 653 +-------+--+-------+-+-------+--+-------+
 654 |mtbf_ln|15| 100.0%|0|    .0%|15| 100.0%|
 655 +-------+--+-------+-+-------+--+-------+
 656
 657                                 Descriptives
 658 +----------------------------------------------------+---------+----------+
 659 |                                                    |Statistic|Std. Error|
 660 +----------------------------------------------------+---------+----------+
 661 |mtbf_ln Mean                                        |     1.88|       .19|
 662 |        95% Confidence Interval for Mean Lower Bound|     1.47|          |
 663 |                                         Upper Bound|     2.29|          |
 664 |        5% Trimmed Mean                             |     1.88|          |
 665 |        Median                                      |     2.09|          |
 666 |        Variance                                    |      .54|          |
 667 |        Std. Deviation                              |      .74|          |
 668 |        Minimum                                     |      .49|          |
 669 |        Maximum                                     |     3.28|          |
 670 |        Range                                       |     2.79|          |
 671 |        Interquartile Range                         |      .92|          |
 672 |        Skewness                                    |     -.16|       .58|
 673 |        Kurtosis                                    |     -.09|      1.12|
 674 +----------------------------------------------------+---------+----------+
 675 @end example
 676 @end cartouche
 677 @caption{Testing for normality using the @cmd{EXAMINE} command and applying
 678 a logarithmic transformation.
 679 The @var{mtbf} variable has a large positive skew and is therefore
 680 unsuitable for linear statistical analysis.
 681 However the transformed variable (@var{mtbf_ln}) is close to normal and
 682 would appear to be more suitable.}
 683 @end float
 684
 685
 686 @node Hypothesis Testing
 687 @section Hypothesis Testing
 688
 689 @cindex Hypothesis testing
 690 @cindex p-value
 691 @cindex null hypothesis
 692
 693 One of the most fundamental purposes of statistical analysis
 694 is hypothesis testing.
 695 Researchers commonly need to test hypotheses about a set of data.
 696 For example, a researcher might want to test whether one set of data comes from
 697 the same distribution as another,
 698 or
 699 whether the mean of a dataset significantly differs from a particular
 700 value.
 701 This section presents just some of the possible tests that @pspp{} offers.
 702
 703 The researcher starts by making a @dfn{null hypothesis}.
 704 Often this is a hypothesis which he suspects to be false.
 705 For example, if he suspects that @var{A} is greater than @var{B} he will
 706 state the null hypothesis as @math{ @var{A} = @var{B}}.@c
 707 @footnote{This example assumes that it is already proven that @var{B} is
 708 not greater than @var{A}.}
 709
 710 The @dfn{p-value} is a recurring concept in hypothesis testing.
 711 It is the probability, calculated on the assumption that the null hypothesis is true,
 712 of obtaining evidence at least as strong against the null hypothesis as that actually observed.
 713 Before testing, the researcher fixes the highest p-value at which the null hypothesis will still be rejected.
 714 Note that this is not the same as ``the probability of making an
 715 error'' nor is it the same as ``the probability of rejecting a
 716 hypothesis when it is true''.
 717
 718
 719
 720 @menu
 721 * Testing for differences of means::
 722 * Linear Regression::
 723 @end menu
 724
 725 @node Testing for differences of means
 726 @subsection Testing for differences of means
 727
 728 @cindex T-test
 729 @vindex T-TEST
 730
 731 A common statistical test involves hypotheses about means.
 732 The @cmd{T-TEST} command is used to find out whether or not two separate
 733 subsets have the same mean.
 734
 735 @ref{t-test} uses the file @file{physiology.sav} previously
 736 encountered.
 737 A researcher suspected that the heights and core body
 738 temperature of persons might be different depending upon their sex.
 739 To investigate this, he posed two null hypotheses:
 740 @itemize @bullet
 741 @item The mean heights of males and females in the population are equal.
 742 @item The mean body temperatures of males and
 743       females in the population are equal.
 744 @end itemize
 745 @noindent
 746 For the purposes of the investigation the researcher
 747 decided to use a threshold p-value of 0.05.
 748
 749 In addition to the T-test, the @cmd{T-TEST} command also performs the
 750 Levene test for equal variances.
 751 If the variances are equal, then a more powerful form of the T-test can be used.
 752 However if it is unsafe to assume equal variances,
 753 then an alternative calculation is necessary.
 754 @pspp{} performs both calculations.
 755
 756 For the @var{height} variable, the output shows the significance of the
 757 Levene test to be 0.33, which means that, when the variances are equal,
 758 there is a 33% probability that the
 759 Levene test produces an outcome at least this extreme.
 760 Had the significance been less than 0.05, then it would have been unsafe to assume that
 761 the variances were equal.
 762 However, because the value is higher than 0.05, the homogeneity of variances assumption
 763 is safe and the ``Equal Variances'' row (the more powerful test) can be used.
 764 Examining this row, the two-tailed significance for the @var{height} t-test
 765 is less than 0.05, so it is safe to reject the null hypothesis and conclude
 766 that the mean heights of males and females are unequal.
 767
 768 For the @var{temperature} variable, the significance of the Levene test
 769 is 0.58 so again, it is safe to use the row for equal variances.
 770 The equal variances row indicates that the two-tailed significance for
 771 @var{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
 772 the null hypothesis, and must conclude that there is insufficient evidence to
 773 suggest that the body temperatures of male and female persons are different.
 774
 775 @float Example, t-test
 776 @cartouche
 777 @example
 778 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 779 @prompt{PSPP>} recode height (179 = SYSMIS).
 780 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 781 @end example
 782 Output:
 783 @example
 784                                 Group Statistics
 785 +-------------------------------------------+--+-------+-------------+--------+
 786 |                                           |  |       |     Std.    |  S.E.  |
 787 |                                     Group | N|  Mean |  Deviation  |  Mean  |
 788 +-------------------------------------------+--+-------+-------------+--------+
 789 |Height in millimeters                Male  |22|1796.49|        49.71|   10.60|
 790 |                                     Female|17|1610.77|        25.43|    6.17|
 791 +-------------------------------------------+--+-------+-------------+--------+
 792 |Internal body temperature in degrees Male  |22|  36.68|         1.95|     .42|
 793 |Celcius                              Female|18|  37.43|         1.61|     .38|
 794 +-------------------------------------------+--+-------+-------------+--------+
 795
 796                           Independent Samples Test
 797 +---------------------+-----------------------------------------------------
 798 |                     | Levene's
 799 |                     | Test for
 800 |                     | Equality
 801 |                     |    of
 802 |                     | Variances               T-Test for Equality of Means
 803 |                     +----+-----+-----+-----+-------+----------+----------+
 804 |                     |    |     |     |     |       |          |          |
 805 |                     |    |     |     |     |       |          |          |
 806 |                     |    |     |     |     |       |          |          |
 807 |                     |    |     |     |     |       |          |          |
 808 |                     |    |     |     |     |  Sig. |          |          |
 809 |                     |    |     |     |     |  (2-  |   Mean   |Std. Error|
 810 |                     |  F | Sig.|  t  |  df |tailed)|Difference|Difference|
 811 +---------------------+----+-----+-----+-----+-------+----------+----------+
 812 |Height in   Equal    | .97| .331|14.02|37.00|   .000|    185.72|     13.24|
 813 |millimeters variances|    |     |     |     |       |          |          |
 814 |            assumed  |    |     |     |     |       |          |          |
 815 |            Equal    |    |     |15.15|32.71|   .000|    185.72|     12.26|
 816 |            variances|    |     |     |     |       |          |          |
 817 |            not      |    |     |     |     |       |          |          |
 818 |            assumed  |    |     |     |     |       |          |          |
 819 +---------------------+----+-----+-----+-----+-------+----------+----------+
 820 |Internal    Equal    | .31| .581|-1.31|38.00|   .198|      -.75|       .57|
 821 |body        variances|    |     |     |     |       |          |          |
 822 |temperature assumed  |    |     |     |     |       |          |          |
 823 |in degrees  Equal    |    |     |-1.33|37.99|   .190|      -.75|       .56|
 824 |Celcius     variances|    |     |     |     |       |          |          |
 825 |            not      |    |     |     |     |       |          |          |
 826 |            assumed  |    |     |     |     |       |          |          |
 827 +---------------------+----+-----+-----+-----+-------+----------+----------+
 828
 829 +---------------------+-------------+
 830 |                     |             |
 831 |                     |             |
 832 |                     |             |
 833 |                     |             |
 834 |                     |             |
 835 |                     +-------------+
 836 |                     |     95%     |
 837 |                     |  Confidence |
 838 |                     | Interval of |
 839 |                     |     the     |
 840 |                     |  Difference |
 841 |                     +------+------+
 842 |                     | Lower| Upper|
 843 +---------------------+------+------+
 844 |Height in   Equal    |158.88|212.55|
 845 |millimeters variances|      |      |
 846 |            assumed  |      |      |
 847 |            Equal    |160.76|210.67|
 848 |            variances|      |      |
 849 |            not      |      |      |
 850 |            assumed  |      |      |
 851 +---------------------+------+------+
 852 |Internal    Equal    | -1.91|   .41|
 853 |body        variances|      |      |
 854 |temperature assumed  |      |      |
 855 |in degrees  Equal    | -1.89|   .39|
 856 |Celcius     variances|      |      |
 857 |            not      |      |      |
 858 |            assumed  |      |      |
 859 +---------------------+------+------+
 860 @end example
 861 @end cartouche
 862 @caption{The @cmd{T-TEST} command tests for differences of means.
 863 Here, the @var{height} variable's two-tailed significance is less than
 864 0.05, so the null hypothesis can be rejected.
 865 Thus, the evidence suggests there is a difference between the heights of
 866 male and female persons.
 867 However, the significance of the test for the @var{temperature}
 868 variable is greater than 0.05 so the null hypothesis cannot be
 869 rejected, and there is insufficient evidence to suggest a difference
 870 in body temperature.}
 871 @end float
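
The two rows reported for @var{height} can be cross-checked by hand from the Group Statistics table.
The following Python sketch is not part of @pspp{}; it applies the usual textbook formulas for the pooled-variance (Student) and unequal-variance (Welch) t-tests to the rounded summary values shown above.
That @pspp{} computes its rows in exactly this way is an assumption, although the results agree with the table to within rounding.
The function names are illustrative only.

```python
import math

# Summary statistics copied from the "Group Statistics" table for height.
n_male, mean_male, sd_male = 22, 1796.49, 49.71
n_female, mean_female, sd_female = 17, 1610.77, 25.43


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Student's t-test from summary statistics (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / se, df, se


def welch_t(n1, m1, s1, n2, m2, s2):
    """Welch's t-test from summary statistics (equal variances not assumed)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / se, df, se


height = (n_male, mean_male, sd_male, n_female, mean_female, sd_female)
t_eq, df_eq, se_eq = pooled_t(*height)
t_ne, df_ne, se_ne = welch_t(*height)

print("equal variances assumed:     t=%.2f df=%.2f se=%.2f" % (t_eq, df_eq, se_eq))
print("equal variances not assumed: t=%.2f df=%.2f se=%.2f" % (t_ne, df_ne, se_ne))
```

The sketch stops at the t statistic and degrees of freedom; converting these to a two-tailed significance requires the t distribution, which the Python standard library does not provide.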
 872
 873 @node Linear Regression
 874 @subsection Linear Regression
 875 @cindex linear regression
 876 @vindex REGRESSION
 877
 878 Linear regression is a technique used to investigate if and how a variable
 879 is linearly related to others.
 880 If a variable is found to be linearly related, then this can be used to
 881 predict future values of that variable.
 882
 883 In @ref{regression}, the service department of a company wanted to
 884 be able to predict the time to repair equipment, in order to improve
 885 the accuracy of their quotations.
 886 It was suggested that the time to repair might be related to the time
 887 between failures and the duty cycle of the equipment.
 888 A threshold p-value of 0.1 was chosen for this investigation.
 889 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 890 was used.
 891 This command not only tests if the variables are related, but also
 892 identifies the potential linear relationship. @xref{REGRESSION}.
 893
 894
 895 @float Example, regression
 896 @cartouche
 897 @example
 898 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 899 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 900 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 901 @end example
 902 Output (excerpts):
 903 @example
 904                   Coefficients (Mean time to repair (hours) )
 905 +------------------------+-----------------------------------------+-----+----+
 906 |                        |    Unstandardized        Standardized   |     |    |
 907 |                        |     Coefficients         Coefficients   |     |    |
 908 |                        +---------+-----------+-------------------+     |    |
 909 |                        |    B    | Std. Error|        Beta       |  t  |Sig.|
 910 +------------------------+---------+-----------+-------------------+-----+----+
 911 |(Constant)              |     9.81|       1.50|                .00| 6.54|.000|
 912 |Mean time between       |     3.10|        .10|                .99|32.43|.000|
 913 |failures (months)       |         |           |                   |     |    |
 914 |Ratio of working to non-|     1.09|       1.78|                .02|  .61|.552|
 915 |working time            |         |           |                   |     |    |
 916 +------------------------+---------+-----------+-------------------+-----+----+
 917
 918                   Coefficients (Mean time to repair (hours) )
 919 +-----------------------+------------------------------------------+-----+----+
 920 |                       |    Unstandardized         Standardized   |     |    |
 921 |                       |     Coefficients          Coefficients   |     |    |
 922 |                       +---------+------------+-------------------+     |    |
 923 |                       |    B    | Std. Error |        Beta       |  t  |Sig.|
 924 +-----------------------+---------+------------+-------------------+-----+----+
 925 |(Constant)             |    10.50|         .96|                .00|10.96|.000|
 926 |Mean time between      |     3.11|         .09|                .99|33.39|.000|
 927 |failures (months)      |         |            |                   |     |    |
 928 +-----------------------+---------+------------+-------------------+-----+----+
 929 @end example
 930 @end cartouche
 931 @caption{Linear regression analysis to find a predictor for
 932 @var{mttr}.
 933 The first attempt, including @var{duty_cycle}, produces an
 934 unacceptably high significance value.
 935 However, the second attempt, which excludes @var{duty_cycle}, produces
 936 significance values no higher than 0.06.
 937 This suggests that @var{mtbf} alone may be a suitable predictor
 938 for @var{mttr}.}
 939 @end float
 940
 941 The coefficients in the first table suggest that the formula
 942 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 943 can be used to predict the time to repair.
 944 However, the significance value for the @var{duty_cycle} coefficient
 945 is very high, which would make this an unsafe predictor.
 946 For this reason, the test was repeated, but omitting the
 947 @var{duty_cycle} variable.
 948 This time, the significance of all coefficients is no higher than 0.06,
 949 suggesting that at the 0.06 level, the formula
 950 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 951 predictor of the time to repair.
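
As an illustration of how the fitted formula might be applied, the following Python sketch evaluates the second model for a few inputs.
It is not part of @pspp{}; the coefficients are copied from the second Coefficients table, while the function name and the sample @var{mtbf} values are hypothetical and are not taken from @file{repairs.sav}.

```python
# Coefficients copied from the second "Coefficients" table (mtbf only).
INTERCEPT = 10.50   # (Constant), in hours
SLOPE_MTBF = 3.11   # hours of repair time per month between failures


def predict_mttr(mtbf_months):
    """Predicted mean time to repair (hours) for a given mtbf (months)."""
    return INTERCEPT + SLOPE_MTBF * mtbf_months


# Hypothetical inputs chosen for illustration only.
for mtbf in (0, 5, 10):
    print("mtbf=%2d months -> predicted mttr=%.2f hours" % (mtbf, predict_mttr(mtbf)))
```

A prediction of this kind is only as trustworthy as the fit itself, and it should be treated with caution for @var{mtbf} values outside the range covered by the original data.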
 952
 953
 954 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 955 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 956 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 957 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 958 @c  LocalWords:  mttr outfile