   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @alias prompt = sansserif
  11
  12 @include tut.texi
  13
  14 @node Using PSPP
  15 @chapter Using @pspp{}
  16
  17 @pspp{} is a tool for the statistical analysis of sampled data.
  18 You can use it to discover patterns in the data,
  19 to explain differences in one subset of data in terms of another subset
  20 and to find out
  21 whether certain beliefs about the data are justified.
  22 This chapter does not attempt to introduce the theory behind the
  23 statistical analysis,
  24 but it shows how such analysis can be performed using @pspp{}.
  25
  26 For the purposes of this tutorial, it is assumed that you are using @pspp{} in its
  27 interactive mode from the command line.
  28 However, the example commands can also be typed into a file and executed
  29 non-interactively by typing @samp{pspp @var{file-name}} at a shell prompt,
  30 where @var{file-name} is the name of the file containing the commands.
  31 Alternatively, from the graphical interface, you can select
  32 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  33 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  34 executed.
  35 Whichever method you choose, the syntax is identical.
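
As a sketch of the file-based method, the commands could be saved in a
syntax file and run from the shell.
The file name @file{tutorial.sps} is only an assumption made for this
illustration; the commands themselves are explained later in this chapter.
No prompts appear because the text is read from a file rather than typed
interactively.

@example
data list list /forename (A12) height.
begin data.
Ahmed 188
Bertram 167
end data.
list /format=numbered.
@end example

@noindent
Typing @samp{pspp tutorial.sps} at a shell prompt should then run the
commands and print the listing.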
  36
  37 When using the interactive method, @pspp{} tells you that it's waiting for your
  38 input with a prompt like @prompt{PSPP>} or @prompt{data>}.
  39 In the examples of this chapter, whenever you see text like this, it
  40 indicates the prompt displayed by @pspp{}, @emph{not} something that you
  41 should type.
  42
  43 Throughout this chapter reference is made to a number of sample data files.
  44 So that you can try the examples for yourself,
  45 you should have received these files along with your copy of @pspp{}.@c
  46 @footnote{These files contain purely fictitious data.  They should not be used
  47 for research purposes.}
  48 @note{Normally these files are installed in the directory
  49 @file{@value{example-dir}}.
  50 If however your system administrator or operating system vendor has
  51 chosen to install them in a different location, you will have to adjust
  52 the examples accordingly.}
  53
  54
  55 @menu
  56 * Preparation of Data Files::
  57 * Data Screening and Transformation::
  58 * Hypothesis Testing::
  59 @end menu
  60
  61 @node Preparation of Data Files
  62 @section Preparation of Data Files
  63
  64
  65 Before analysis can commence, the data must be loaded into @pspp{} and
  66 arranged such that both @pspp{} and humans can understand what
  67 the data represents.
  68 There are two aspects of data:
  69
  70 @itemize @bullet
  71 @item The variables --- these are the parameters of a quantity
  72  which has been measured or estimated in some way.
  73  For example height, weight and geographic location are all variables.
  74 @item The observations (also called `cases') of the variables ---
  75  each observation represents an instance when the variables were measured
  76  or observed.
  77 @end itemize
  78
  79 @noindent
  80 For example, a data set which has the variables @exvar{height}, @exvar{weight}, and
  81 @exvar{name}, might have the observations:
  82 @example
  83 1881 89.2 Ahmed
  84 1192 107.01 Frank
  85 1230 67 Julie
  86 @end example
  87 @noindent
  88 The following sections explain how to define a dataset.
  89
  90 @menu
  91 * Defining Variables::
  92 * Listing the data::
  93 * Reading data from a text file::
  94 * Reading data from a pre-prepared PSPP file::
  95 * Saving data to a PSPP file.::
  96 * Reading data from other sources::
  97 * Exiting PSPP::
  98 @end menu
  99
 100 @node Defining Variables
 101 @subsection Defining Variables
 102 @cindex variables
 103
 104 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
 105 Variables such as age, height and satisfaction are numeric,
 106 whereas name is a string variable.
 107 String variables are best reserved for commentary data to assist the
 108 human observer.
 109 However they can also be used for nominal or categorical data.
 110
 111 The following example defines two variables @exvar{forename} and @exvar{height},
 112 and reads data into them by manual input:
 113
 114 @example
 115 @prompt{PSPP>} data list list /forename (A12) height.
 116 @prompt{PSPP>} begin data.
 117 @prompt{data>} Ahmed 188
 118 @prompt{data>} Bertram 167
 119 @prompt{data>} Catherine 134.231
 120 @prompt{data>} David 109.1
 121 @prompt{data>} end data.
 122 @prompt{PSPP>}
 123 @end example
 124
 125 There are several things to note about this example.
 126
 127 @itemize @bullet
 128 @item
 129 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 130 command. @xref{DATA LIST}.
 131 It tells @pspp{} to prepare for reading data.
 132 The word @samp{list} intentionally appears twice.
 133 The first occurrence is part of the @cmd{DATA LIST} call,
 134 whilst the second
 135 tells @pspp{} that the data is to be read as free format data with
 136 one record per line.
 137
 138 @item
 139 The @samp{/} character is important. It marks the start of the list of
 140 variables which you wish to define.
 141
 142 @item
 143 The text @samp{forename} is the name of the first variable,
 144 and @samp{(A12)} says that the variable @exvar{forename} is a string
 145 variable and that its maximum length is 12 bytes.
 146 The second variable's name is specified by the text @samp{height}.
 147 Since no format is given, this variable has the default format.
 148 Normally the default format expects numeric data, which should be
 149 entered in the locale of the operating system.
 150 Thus, the example is correct for English locales and other
 151 locales which use a period (@samp{.}) as the decimal separator.
 152 However if you are using a system with a locale which uses the comma (@samp{,})
 153 as the decimal separator, then you should in the subsequent lines substitute
 154 @samp{.} with @samp{,}.
 155 Alternatively, you could explicitly tell @pspp{} that the @exvar{height}
 156 variable is to be read using a period as its decimal separator by appending the
 157 text @samp{DOT8.3} after the word @samp{height}.
 158 @xref{Input and Output Formats}, for more information on data formats.
 159
 160
 161 @item
 162 Normally, @pspp{} displays the prompt @prompt{PSPP>} whenever it's
 163 expecting a command.
 164 However, when it's expecting data, the prompt changes to @prompt{data>}
 165 so that you know to enter data and not a command.
 166
 167 @item
 168 At the end of every command there is a terminating @samp{.} which tells
 169 @pspp{} that the end of a command has been encountered.
 170 You should not enter @samp{.} when data is expected (@i{i.e.} when
 171 the @prompt{data>} prompt is current) since it is appropriate only for
 172 terminating commands.
 173 @end itemize
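
If you prefer not to rely on the default, the numeric format can be
written out explicitly.
The following sketch spells out @samp{(F8.2)}, which is the format that
@pspp{} otherwise assigns to @exvar{height} by default, so it should behave
the same as the example above:

@example
@prompt{PSPP>} data list list /forename (A12) height (F8.2).
@end example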
 174
 175 @node Listing the data
 176 @subsection Listing the data
 177 @vindex LIST
 178
 179 Once the data has been entered,
 180 you could type
 181 @example
 182 @prompt{PSPP>} list /format=numbered.
 183 @end example
 184 @noindent
 185 to list the data.
 186 The optional text @samp{/format=numbered} requests the case numbers to be
 187 shown along with the data.
 188 It should show the following output:
 189 @psppoutput {tutorial1}
 190 @noindent
 191 Note that the numeric variable @exvar{height} is displayed to 2 decimal
 192 places, because the format for that variable is @samp{F8.2}.
 193 @xref{LIST}, for a complete description of the @cmd{LIST} command.
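
@cmd{LIST} also accepts subcommands that restrict what is shown.
As a sketch, the following should display only the @exvar{forename}
variable for the first two cases, again with case numbers:

@example
@prompt{PSPP>} list /variables=forename /cases=from 1 to 2 /format=numbered.
@end example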
 194
 195 @node Reading data from a text file
 196 @subsection Reading data from a text file
 197 @cindex reading data
 198
 199 The previous example showed how to define a set of variables and to
 200 manually enter the data for those variables.
 201 Manual entry of data is tedious work, and often
 202 a file containing the data will have been prepared
 203 previously.
 204 Let us assume that you have a file called @file{mydata.dat} containing the
 205 ASCII-encoded data:
 206 @example
 207 Ahmed          188.00
 208 Bertram        167.00
 209 Catherine      134.23
 210 David          109.10
 211 @              .
 212 @              .
 213 @              .
 214 Zachariah      113.02
 215 @end example
 216 @noindent
 217 You can tell the @cmd{DATA LIST} command to read the data directly from
 218 this file instead of by manual entry, with a command like:
 219 @example
 220 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
 221 @end example
 222 @noindent
 223 Notice, however, that it is still necessary to specify the names of the
 224 variables and their formats, since this information is not contained
 225 in the file.
 226 It is also possible to specify the file's character encoding and other
 227 parameters.
 228 @xref{DATA LIST}, for full details.
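
For instance, the character encoding can be stated with the
@samp{encoding} keyword following the file name.
This is a minimal sketch that assumes the file really is ASCII encoded, as
described above; a different file would need a different encoding name.

@example
@prompt{PSPP>} data list file='mydata.dat' encoding='ASCII' list /forename (A12) height.
@prompt{PSPP>} list.
@end example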
 229
 230 @node Reading data from a pre-prepared PSPP file
 231 @subsection Reading data from a pre-prepared @pspp{} file
 232 @cindex system files
 233 @vindex GET
 234
 235 When working with other @pspp{} users, or users of other software which
 236 uses the @pspp{} data format, you may be given the data in
 237 a pre-prepared @pspp{} file.
 238 Such files contain not only the data, but the variable definitions,
 239 along with their formats, labels and other meta-data.
 240 Conventionally, these files (sometimes called ``system'' files)
 241 have the suffix @file{.sav}, but that is
 242 not mandatory.
 243 The following syntax loads a file called @file{my-file.sav}.
 244 @example
 245 @prompt{PSPP>} get file='my-file.sav'.
 246 @end example
 247 @noindent
 248 You will encounter several instances of this in future examples.
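
Because a system file carries its own variable definitions, a quick way to
see what has been loaded is to display the dictionary straight after
reading the file.
A sketch, reusing the file name from above:

@example
@prompt{PSPP>} get file='my-file.sav'.
@prompt{PSPP>} display dictionary.
@end example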
 249
 250
 251 @node Saving data to a PSPP file.
 252 @subsection Saving data to a @pspp{} file.
 253 @cindex saving
 254 @vindex SAVE
 255
 256 If you want to save your data, along with the variable definitions so
 257 that you or other @pspp{} users can use it later, you can do this with
 258 the @cmd{SAVE} command.
 259
 260 The following syntax will save the existing data and variables to a
 261 file called @file{my-new-file.sav}.
 262 @example
 263 @prompt{PSPP>} save outfile='my-new-file.sav'.
 264 @end example
 265 @noindent
 266 If @file{my-new-file.sav} already exists, then it will be overwritten.
 267 Otherwise it will be created.
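
The following sketch shows a complete round trip: two cases are entered by
hand, saved, and then read back.
No variable definitions are needed when reading the file back, because
they are stored in the file itself.

@example
@prompt{PSPP>} data list list /forename (A12) height.
@prompt{PSPP>} begin data.
@prompt{data>} Ahmed 188
@prompt{data>} Bertram 167
@prompt{data>} end data.
@prompt{PSPP>} save outfile='my-new-file.sav'.
@prompt{PSPP>} get file='my-new-file.sav'.
@prompt{PSPP>} list.
@end example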
 268
 269
 270 @node Reading data from other sources
 271 @subsection Reading data from other sources
 272 @cindex comma separated values
 273 @cindex spreadsheets
 274 @cindex databases
 275
 276 Sometimes it's useful to be able to read data from comma
 277 separated text, from spreadsheets, databases or other sources.
 278 In these instances you should
 279 use the @cmd{GET DATA} command (@pxref{GET DATA}).
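
As an illustration, the sketch below reads comma separated text.
It assumes a hypothetical file @file{mydata.csv} whose first line is a
header and whose remaining lines each hold a forename and a height; the
file name, variable names and formats are assumptions made for this
example only.

@example
@prompt{PSPP>} get data /type=txt /file='mydata.csv'
                /arrangement=delimited
                /firstcase=2
                /delimiters=','
                /variables=forename A12 height F8.2.
@end example

@noindent
Here @samp{/firstcase=2} skips the header line, and each entry after
@samp{/variables=} pairs a variable name with its format.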
 280
 281 @node Exiting PSPP
 282 @subsection Exiting PSPP
 283
 284 Use the @cmd{FINISH} command to exit @pspp{}:
 285 @example
 286 @prompt{PSPP>} finish.
 287 @end example
 288
 289 @node Data Screening and Transformation
 290 @section Data Screening and Transformation
 291
 292 @cindex screening
 293 @cindex transformation
 294
 295 Once data has been entered, it is often desirable, or even necessary,
 296 to transform it in some way before performing analysis upon it.
 297 At the very least, it's good practice to check for errors.
 298
 299 @menu
 300 * Identifying incorrect data::
 301 * Dealing with suspicious data::
 302 * Inverting negatively coded variables::
 303 * Testing data consistency::
 304 * Testing for normality::
 305 @end menu
 306
 307 @node Identifying incorrect data
 308 @subsection Identifying incorrect data
 309 @cindex erroneous data
 310 @cindex errors, in data
 311
 312 Data from real sources is rarely error free.
 313 @pspp{} has a number of procedures which can be used to help
 314 identify data which might be incorrect.
 315
 316 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 317 simple linear statistics for a dataset.  It is also useful for
 318 identifying potential problems in the data.
 319 The example file @file{physiology.sav} contains a number of physiological
 320 measurements of a sample of healthy adults selected at random.
 321 However, the data entry clerk made a number of mistakes when entering
 322 the data.
 323 The following example illustrates the use of @cmd{DESCRIPTIVES} to screen this
 324 data and identify the erroneous values:
 325
 326 @example
 327 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 328 @prompt{PSPP>} descriptives sex, weight, height.
 329 @end example
 330
 331 @noindent For this example, @pspp{} produces the following output:
 332 @psppoutput {tutorial2a}
 333
 334 The most interesting column in the output is the minimum value.
 335 The @exvar{weight} variable has a minimum value of less than zero,
 336 which is clearly erroneous.
 337 Similarly, the @exvar{height} variable's minimum value seems to be very low.
 338 In fact, it is more than 5 standard deviations from the mean, and is a
 339 seemingly bizarre height for an adult person.
 340
 341 We can look deeper into these discrepancies by issuing an additional
 342 @cmd{EXAMINE} command:
 343
 344 @example
 345 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 346 @end example
 347
 348 @noindent This command produces the following additional output (in part):
 349 @psppoutput {tutorial2b}
 350
 351 @noindent
 352 From this new output, you can see that the lowest value of @exvar{height} is
 353 179 (which we suspect to be erroneous), but the second lowest is 1598,
 354 which,
 355 as we know from @cmd{DESCRIPTIVES},
 356 is within 1 standard deviation of the mean.
 357 Similarly, the lowest value of @exvar{weight} is
 358 negative, but its second lowest value is plausible.
 359 This suggests that the two extreme values are outliers and probably
 360 represent data entry errors.
 361
 362 The output also identifies the case numbers for each extreme value,
 363 so we can see that
 364 cases 30 and 38 are the ones with the erroneous values.
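
The ``how many standard deviations from the mean'' reasoning used above can
also be delegated to @pspp{}.
As a sketch, the @samp{/save} subcommand asks @cmd{DESCRIPTIVES} to store
Z scores in new variables.
The names @exvar{Zweight} and @exvar{Zheight} are an assumption based on
the documented convention of prefixing @samp{Z} to the original name; if
@cmd{LIST} does not recognise them, @samp{display dictionary} shows the
names that were actually created.

@example
@prompt{PSPP>} descriptives /variables=weight height /save.
@prompt{PSPP>} list /variables=Zweight Zheight /cases=from 30 to 38.
@end example

@noindent
The suspect cases should stand out with Z scores of much larger magnitude
than their neighbours.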
 365
 366 @node Dealing with suspicious data
 367 @subsection Dealing with suspicious data
 368
 369 @cindex SYSMIS
 370 @cindex recoding data
 371 If possible, suspect data should be checked and re-measured.
 372 However, this may not always be feasible, in which case the researcher may
 373 decide to disregard these values.
 374 @pspp{} has a feature whereby data can assume the special value `SYSMIS', and
 375 will be disregarded in future analysis. @xref{Missing Observations}.
 376 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 377 command.
 378 @example
 379 @prompt{PSPP>} recode height (179 = SYSMIS).
 380 @prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
 381 @end example
 382 @noindent
 383 The first command says that for any observation which has a
 384 @exvar{height} value of 179, that value should be changed to the SYSMIS
 385 value.
 386 The second command says that any @exvar{weight} values of zero or less
 387 should be changed to SYSMIS.
 388 From now on, they will be ignored in analysis.
 389 @xref{RECODE}, for detailed information about the @cmd{RECODE} command.
 390
 391 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands from
 392 the previous section,
 393 you will see a data summary with more plausible parameters.
 394 You will also notice that the data summaries indicate the two missing values.
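
Put together, the whole screening session might look like the following
sketch, which only combines commands already shown in this chapter:

@example
@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
@prompt{PSPP>} recode height (179 = SYSMIS).
@prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
@prompt{PSPP>} descriptives sex, weight, height.
@end example

@noindent
Compared with the earlier output, the number of valid cases reported for
@exvar{weight} and for @exvar{height} should each drop by one.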
 395
 396 @node Inverting negatively coded variables
 397 @subsection Inverting negatively coded variables
 398
 399 @cindex Likert scale
 400 @cindex Inverting data
 401 Data entry errors are not the only reason for wanting to recode data.
 402 The sample file @file{hotel.sav} comprises data gathered from a
 403 customer satisfaction survey of clients at a particular hotel.
 404 The following commands load the file and display its
 405 variables and their associated metadata:
 406
 407 @example
 408 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 409 @prompt{PSPP>} display dictionary.
 410 @end example
 411
 412 @noindent This yields the following output:
 413
 414 @psppoutput {tutorial3}
 415
 416 The output shows that all of the variables @exvar{v1} through @exvar{v5} are measured on a 5 point Likert scale,
 417 with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 418 However, some of the questions are positively worded (@exvar{v1}, @exvar{v2}, @exvar{v4}) and others are negatively worded (@exvar{v3}, @exvar{v5}).
 419 To perform meaningful analysis, we need to recode the variables so
 420 that they all measure in the same direction.
 421 We could use the @cmd{RECODE} command, with syntax such as:
 422 @example
 423 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 424 @end example
 425 @noindent
 426 However an easier and more elegant way uses the @cmd{COMPUTE}
 427 command (@pxref{COMPUTE}).
 428 Since the variables are Likert variables in the range (1 @dots{} 5),
 429 subtracting each value from 6 has the effect of inverting it:
 430 @example
 431 compute @var{var} = 6 - @var{var}.
 432 @end example
 433 @noindent
 434 The following section uses this technique to recode the variables
 435 @exvar{v3} and @exvar{v5}.
 436 After applying @cmd{COMPUTE} to both variables,
 437 all subsequent commands will use the inverted values.
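
If you would rather keep the original values while checking the
transformation, the result can be written to a new variable instead.
In this sketch the name @exvar{v3_inv} is hypothetical, chosen only for
illustration:

@example
@prompt{PSPP>} compute v3_inv = 6 - v3.
@prompt{PSPP>} list /variables=v3 v3_inv /cases=from 1 to 5.
@end example

@noindent
In every listed case with a valid response, the two values should sum to 6.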
 438
 439
 440 @node Testing data consistency
 441 @subsection Testing data consistency
 442
 443 @cindex reliability
 444 @cindex consistency
 445
 446 A sensible check to perform on survey data is the calculation of
 447 reliability.
 448 This gives the statistician some confidence that the questionnaires have been
 449 completed thoughtfully.
 450 If you examine the labels of variables @exvar{v1}, @exvar{v3} and @exvar{v4},
 451 you will notice that they ask very similar questions.
 452 One would therefore expect the values of these variables (after recoding)
 453 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 454 command (@pxref{RELIABILITY}).
 455 The following example shows a @pspp{} session where the user recodes
 456 negatively scaled variables and then requests reliability statistics for
 457 @exvar{v1}, @exvar{v3}, and @exvar{v4}.
 458
 459 @example
 460 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 461 @prompt{PSPP>} compute v3 = 6 - v3.
 462 @prompt{PSPP>} compute v5 = 6 - v5.
 463 @prompt{PSPP>} reliability v1, v3, v4.
 464 @end example
 465
 466 @noindent This yields the following output:
 467 @psppoutput {tutorial4}
 468
 469 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 470 0.7 or higher to indicate reliable data.
 471
 472 Here, the value is 0.81, which suggests a high degree of reliability
 473 among variables @exvar{v1}, @exvar{v3} and @exvar{v4}, so the data and
 474 the recoding that we performed are vindicated.
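
To see how much each question contributes to the overall figure, the
command can be written with explicit subcommands.
This is a sketch that assumes @exvar{v3} has already been inverted as
above; @samp{/summary=total} requests per-item statistics, which should
include the alpha that would result if each item were left out.

@example
@prompt{PSPP>} reliability /variables=v1 v3 v4 /model=alpha /summary=total.
@end example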
 475
 476
 477 @node Testing for normality
 478 @subsection Testing for normality
 479 @cindex normality, testing
 480
 481 Many statistical tests rely upon certain properties of the data.
 482 One common property, upon which many linear tests depend, is that of
 483 normality --- the data must have been drawn from a normal distribution.
 484 It is necessary then to ensure normality before deciding upon the
 485 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 486
 487 In the following example, a researcher was examining the failure rates
 488 of equipment produced by an engineering company.
 489 The file @file{repairs.sav} contains the mean time between
 490 failures (@exvar{mtbf}) of some items of equipment subject to the study.
 491 Before performing linear analysis on the data,
 492 the researcher wanted to ascertain that the data is normally distributed.
 493
 494 @example
 495 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 496 @prompt{PSPP>} examine mtbf
 497                 /statistics=descriptives.
 498 @end example
 499
 500 @noindent This produces the following output:
 501 @psppoutput {tutorial5a}
 502
 503 A normal distribution has a skewness and (excess) kurtosis of zero.
 504 The skewness of @exvar{mtbf} in the output above makes it clear
 505 that the @exvar{mtbf} figures have a lot of positive skew and are therefore
 506 not drawn from a normally distributed variable.
 507 Positive skew can often be compensated for by applying a logarithmic
 508 transformation, as in the following continuation of the example:
 509
 510 @example
 511 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 512 @prompt{PSPP>} examine mtbf_ln
 513                 /statistics=descriptives.
 514 @end example
 515
 516 @noindent which produces the following additional output:
 517 @psppoutput {tutorial5b}
 518
 519 The @cmd{COMPUTE} command in the first line above performs the
 520 logarithmic transformation:
 521 @example
 522 compute mtbf_ln = ln (mtbf).
 523 @end example
 524 @noindent
 525 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 526 defines a new variable @exvar{mtbf_ln} which is
 527 the natural logarithm of @exvar{mtbf}.
 528 The final command in this example calls @cmd{EXAMINE} on this new variable.
 529 The results show that both the skewness and
 530 kurtosis for @exvar{mtbf_ln} are very close to zero.
 531 This provides some confidence that the @exvar{mtbf_ln} variable is
 532 normally distributed and thus safe for linear analysis.
 533 In the event that no suitable transformation can be found,
 534 then it would be worth considering
 535 an appropriate non-parametric test instead of a linear one.
 536 @xref{NPAR TESTS}, for information about non-parametric tests.
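
The numerical check can be complemented by a visual one.
As a sketch, the @samp{/plot} subcommand of @cmd{EXAMINE} can request
normal probability plots and histograms for both the original and the
transformed variable.
This assumes that @exvar{mtbf_ln} has been computed earlier in the same
session; whether the charts are actually drawn depends on the output
driver in use.

@example
@prompt{PSPP>} examine mtbf mtbf_ln
                /plot=npplot histogram.
@end example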
 537
 538 @node Hypothesis Testing
 539 @section Hypothesis Testing
 540
 541 @cindex Hypothesis testing
 542 @cindex p-value
 543 @cindex null hypothesis
 544
 545 One of the most fundamental purposes of statistical analysis
 546 is hypothesis testing.
 547 Researchers commonly need to test hypotheses about a set of data.
 548 For example, a researcher might want to test whether one set of data comes from
 549 the same distribution as another,
 550 or
 551 whether the mean of a dataset significantly differs from a particular
 552 value.
 553 This section presents just some of the possible tests that @pspp{} offers.
 554
 555 The researcher starts by making a @dfn{null hypothesis}.
 556 Often this is a hypothesis which he suspects to be false.
 557 For example, if he suspects that @var{A} is greater than @var{B} he will
 558 state the null hypothesis as @math{ @var{A} = @var{B}}.@c
 559 @footnote{This example assumes that it is already proven that @var{B} is
 560 not greater than @var{A}.}
 561
 562 The @dfn{p-value} is a recurring concept in hypothesis testing.
 563 It is the probability that evidence at least as strong as that observed
 564 against the null hypothesis could have been obtained when the null
 565 hypothesis is in fact true.  The highest p-value that the researcher accepts as grounds for
 566 rejecting the null hypothesis is called the @dfn{significance level}.  Note that the p-value is not the same as ``the probability of making an
 567 error'' nor is it the same as ``the probability of rejecting a
 568 hypothesis when it is true''.
 569
 570
 571
 572 @menu
 573 * Testing for differences of means::
 574 * Linear Regression::
 575 @end menu
 576
 577 @node Testing for differences of means
 578 @subsection Testing for differences of means
 579
 580 @cindex T-test
 581 @vindex T-TEST
 582
 583 A common statistical test involves hypotheses about means.
 584 The @cmd{T-TEST} command is used to find out whether or not two separate
 585 subsets have the same mean.
 586
 587 A researcher suspected that the heights and core body
 588 temperature of persons might be different depending upon their sex.
 589 To investigate this, he posed two null hypotheses based on the data
 590 from @file{physiology.sav} previously encountered:
 591 @itemize @bullet
 592 @item The mean heights of males and females in the population are equal.
 593 @item The mean body temperatures of males and
 594       females in the population are equal.
 595 @end itemize
 596 @noindent
 597 For the purposes of the investigation the researcher
 598 decided to use a significance level of 0.05.
 599
 600 In addition to the T-test, the @cmd{T-TEST} command also performs the
 601 Levene test for equal variances.
 602 If the variances are equal, then a more powerful form of the T-test can be used.
 603 However if it is unsafe to assume equal variances,
 604 then an alternative calculation is necessary.
 605 @pspp{} performs both calculations.
 606
 607 For the @exvar{height} variable, the output shows the significance of the
 608 Levene test to be 0.33, which means there is a
 609 33% probability that the
 610 Levene test produces an outcome at least this extreme when the variances are equal.
 611 Had the significance been less than 0.05, then it would have been unsafe to assume that
 612 the variances were equal.
 613 However, because the value is higher than 0.05 the homogeneity of variances assumption
 614 is safe and the ``Equal Variances'' row (the more powerful test) can be used.
 615 Examining this row, the two tailed significance for the @exvar{height} t-test
 616 is less than 0.05, so it is safe to reject the null hypothesis and conclude
 617 that the mean heights of males and females are unequal.
 618
 619 For the @exvar{temperature} variable, the significance of the Levene test
 620 is 0.58, so again it is safe to use the row for equal variances.
 621 The equal variances row indicates that the two tailed significance for
 622 @exvar{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
 623 the null hypothesis, and must conclude that there is insufficient evidence to
 624 suggest that the body temperatures of male and female persons are different.
 625
 626 The syntax for this analysis is:
 627 @example
 628 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 629 @prompt{PSPP>} recode height (179 = SYSMIS).
 630 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 631 @end example
 632
 633 @pspp{} produces the following output for this syntax:
 634 @psppoutput {tutorial6}
 635
 636 The @cmd{T-TEST} command tests for differences of means.
 637 Here, the @exvar{height} variable's two tailed significance is less than
 638 0.05, so the null hypothesis can be rejected.
 639 Thus, the evidence suggests there is a difference between the heights of
 640 male and female persons.
 641 However the significance of the test for the @exvar{temperature}
 642 variable is greater than 0.05 so the null hypothesis cannot be
 643 rejected, and there is insufficient evidence to suggest a difference
 644 in body temperature.
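
The same command can also compare a single mean against a fixed value,
the other kind of question mentioned at the start of this section.
In the sketch below, the reference value 37 is purely hypothetical, and it
assumes that @exvar{temperature} is recorded in units for which that value
is meaningful.

@example
@prompt{PSPP>} t-test /testval=37 /variables=temperature.
@end example

@noindent
The two tailed significance in the output is read in the same way as
before: a value below the chosen threshold is grounds for rejecting the
hypothesis that the population mean equals the reference value.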
 645
 646 @node Linear Regression
 647 @subsection Linear Regression
 648 @cindex linear regression
 649 @vindex REGRESSION
 650
 651 Linear regression is a technique used to investigate if and how a variable
 652 is linearly related to others.
 653 If a variable is found to be linearly related, then this can be used to
 654 predict future values of that variable.
 655
 656 In the following example, the service department of the company wanted to
 657 be able to predict the time to repair equipment, in order to improve
 658 the accuracy of their quotations.
 659 It was suggested that the time to repair might be related to the time
 660 between failures and the duty cycle of the equipment.
 661 A significance level of 0.1 was chosen for this investigation.
 662 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 663 was used.
 664 This command not only tests if the variables are related, but also
 665 identifies the potential linear relationship. @xref{REGRESSION}.
 666
 667 A first attempt includes @exvar{duty_cycle}:
 668
 669 @example
 670 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 671 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 672 @end example
 673
 674 @noindent This attempt yields the following output (in part):
 675 @psppoutput {tutorial7a}
 676
 677 The coefficients in the above table suggest that the formula
 678 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 679 can be used to predict the time to repair.
 680 However, the significance value for the @exvar{duty_cycle} coefficient
 681 is very high, which would make this an unsafe predictor.
 682 For this reason, the test was repeated, but omitting the
 683 @exvar{duty_cycle} variable:
 684
 685 @example
 686 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 687 @end example
 688
 689 @noindent
 690 This second try produces the following output (in part):
 691 @psppoutput {tutorial7b}
 692
 693 This time, the significance of all coefficients is no higher than 0.06,
 694 which is below the chosen threshold of 0.1, suggesting that the formula
 695 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 696 predictor of the time to repair.
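
To put the fitted formula to work, it can be applied with @cmd{COMPUTE}.
The following sketch uses the rounded coefficients quoted above; the
variable names @exvar{mttr_pred} and @exvar{mttr_resid} are hypothetical,
chosen only for this illustration.

@example
@prompt{PSPP>} compute mttr_pred = 10.5 + 3.11 * mtbf.
@prompt{PSPP>} compute mttr_resid = mttr - mttr_pred.
@prompt{PSPP>} descriptives mttr_pred, mttr_resid.
@end example

@noindent
The first command predicts the time to repair for each case and the second
records how far each prediction is from the observed value.
A mean residual far from zero would suggest that the coefficients were
copied incorrectly.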
 697
 698
 699 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 700 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 701 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 702 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 703 @c  LocalWords:  mttr outfile