
   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @alias prompt = sansserif
  11
  12 @include tut.texi
  13
  14 @node Using PSPP
  15 @chapter Using @pspp{}
  16
  17 @pspp{} is a tool for the statistical analysis of sampled data.
  18 You can use it to discover patterns in the data,
  19 to explain differences in one subset of data in terms of another subset
  20 and to find out
  21 whether certain beliefs about the data are justified.
  22 This chapter does not attempt to introduce the theory behind the
  23 statistical analysis,
  24 but it shows how such analysis can be performed using @pspp{}.
  25
  26 @image{doc/examples/autorecode}
  27
  28 For the purposes of this tutorial, it is assumed that you are using @pspp{} in its
  29 interactive mode from the command line.
  30 However, the example commands can also be typed into a file and executed in
  31 a post-hoc mode by typing @samp{pspp @var{file-name}} at a shell prompt,
  32 where @var{file-name} is the name of the file containing the commands.
  33 Alternatively, from the graphical interface, you can select
  34 @clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
  35 and use the @clicksequence{Run} menu when a syntax fragment is ready to be
  36 executed.
  37 Whichever method you choose, the syntax is identical.
  38
  39 When using the interactive method, @pspp{} tells you that it's waiting for your
  40 input with a string like @prompt{PSPP>} or @prompt{data>}.
  41 In the examples of this chapter, whenever you see text like this, it
  42 indicates the prompt displayed by @pspp{}, @emph{not} something that you
  43 should type.
  44
  45 Throughout this chapter reference is made to a number of sample data files.
  46 So that you can try the examples for yourself,
  47 you should have received these files along with your copy of @pspp{}.@c
  48 @footnote{These files contain purely fictitious data.  They should not be used
  49 for research purposes.}
  50 @note{Normally these files are installed in the directory
  51 @file{@value{example-dir}}.
  52 If however your system administrator or operating system vendor has
  53 chosen to install them in a different location, you will have to adjust
  54 the examples accordingly.}
  55
  56
  57 @menu
  58 * Preparation of Data Files::
  59 * Data Screening and Transformation::
  60 * Hypothesis Testing::
  61 @end menu
  62
  63 @node Preparation of Data Files
  64 @section Preparation of Data Files
  65
  66
  67 Before analysis can commence,  the data must be loaded into @pspp{} and
  68 arranged such that both @pspp{} and humans can understand what
  69 the data represents.
  70 There are two aspects of data:
  71
  72 @itemize @bullet
  73 @item The variables --- these are the parameters of a quantity
  74  which has been measured or estimated in some way.
  75  For example height, weight and geographic location are all variables.
  76 @item The observations (also called `cases') of the variables ---
  77  each observation represents an instance when the variables were measured
  78  or observed.
  79 @end itemize
  80
  81 @noindent
  82 For example, a data set which has the variables @exvar{height}, @exvar{weight}, and
  83 @exvar{name}, might have the observations:
  84 @example
  85 1881 89.2 Ahmed
  86 1192 107.01 Frank
  87 1230 67 Julie
  88 @end example
  89 @noindent
  90 The following sections explain how to define a dataset.
  91
  92 @menu
  93 * Defining Variables::
  94 * Listing the data::
  95 * Reading data from a text file::
  96 * Reading data from a pre-prepared PSPP file::
  97 * Saving data to a PSPP file.::
  98 * Reading data from other sources::
  99 * Exiting PSPP::
 100 @end menu
 101
 102 @node Defining Variables
 103 @subsection Defining Variables
 104 @cindex variables
 105
 106 Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
 107 Variables such as age, height and satisfaction are numeric,
 108 whereas name is a string variable.
 109 String variables are best reserved for commentary data to assist the
 110 human observer.
 111 However they can also be used for nominal or categorical data.
 112
 113 The following example defines two variables @exvar{forename} and @exvar{height},
 114 and reads data into them by manual input:
 115
 116 @example
 117 @prompt{PSPP>} data list list /forename (A12) height.
 118 @prompt{PSPP>} begin data.
 119 @prompt{data>} Ahmed 188
 120 @prompt{data>} Bertram 167
 121 @prompt{data>} Catherine 134.231
 122 @prompt{data>} David 109.1
 123 @prompt{data>} end data
 124 @prompt{PSPP>}
 125 @end example
 126
 127 There are several things to note about this example.
 128
 129 @itemize @bullet
 130 @item
 131 The words @samp{data list list} are an example of the @cmd{DATA LIST}
 132 command. @xref{DATA LIST}.
 133 It tells @pspp{} to prepare for reading data.
 134 The word @samp{list} intentionally appears twice.
 135 The first occurrence is part of the @cmd{DATA LIST} call,
 136 whilst the second
 137 tells @pspp{} that the data is to be read as free format data with
 138 one record per line.
 139
 140 @item
 141 The @samp{/} character is important. It marks the start of the list of
 142 variables which you wish to define.
 143
 144 @item
 145 The text @samp{forename} is the name of the first variable,
 146 and @samp{(A12)} says that the variable @exvar{forename} is a string
 147 variable and that its maximum length is 12 bytes.
 148 The second variable's name is specified by the text @samp{height}.
 149 Since no format is given, this variable has the default format.
 150 Normally the default format expects numeric data, which should be
 151 entered in the locale of the operating system.
 152 Thus, the example is correct for English locales and other
 153 locales which use a period (@samp{.}) as the decimal separator.
 154 However if you are using a system with a locale which uses the comma (@samp{,})
 155 as the decimal separator, then you should in the subsequent lines substitute
 156 @samp{.} with @samp{,}.
 157 Alternatively, you could explicitly tell @pspp{} that the @exvar{height}
 158 variable is to be read using a period as its decimal separator by appending the
 159 text @samp{DOT8.3} after the word @samp{height}.
 160 For more information on data formats, @pxref{Input and Output Formats}.
 161
 162
 163 @item
 164 Normally, @pspp{} displays the  prompt @prompt{PSPP>} whenever it's
 165 expecting a command.
 166 However, when it's expecting data, the prompt changes to @prompt{data>}
 167 so that you know to enter data and not a command.
 168
 169 @item
 170 At the end of every command there is a terminating @samp{.} which tells
 171 @pspp{} that the end of a command has been encountered.
 172 You should not enter @samp{.} when data is expected (@i{i.e.} when
 173 the @prompt{data>} prompt is current) since it is appropriate only for
 174 terminating commands.
 175 @end itemize
 176
 177 @node Listing the data
 178 @subsection Listing the data
 179 @vindex LIST
 180
 181 Once the data has been entered,
 182 you could type
 183 @example
 184 @prompt{PSPP>} list /format=numbered.
 185 @end example
 186 @noindent
 187 to list the data.
 188 The optional text @samp{/format=numbered} requests the case numbers to be
 189 shown along with the data.
 190 It should show the following output:
 191 @psppoutput {tutorial1}
 192 @noindent
 193 Note that the numeric variable @exvar{height} is displayed to 2 decimal
 194 places, because the format for that variable is @samp{F8.2}.
 195 For a complete description of the @cmd{LIST} command, @pxref{LIST}.
 196
 197 @node Reading data from a text file
 198 @subsection Reading data from a text file
 199 @cindex reading data
 200
 201 The previous example showed how to define a set of variables and to
 202 manually enter the data for those variables.
 203 Manually entering data is tedious work, and often
 204 a file containing the data will have been previously
 205 prepared.
 206 Let us assume that you have a file called @file{mydata.dat} containing the
 207 ASCII-encoded data:
 208 @example
 209 Ahmed          188.00
 210 Bertram        167.00
 211 Catherine      134.23
 212 David          109.10
 213 @              .
 214 @              .
 215 @              .
 216 Zachariah      113.02
 217 @end example
 218 @noindent
 219 You can tell the @cmd{DATA LIST} command to read the data directly from
 220 this file instead of by manual entry, with a command like:
 221 @example
 222 @prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
 223 @end example
 224 @noindent
 225 Notice however, that it is still necessary to specify the names of the
 226 variables and their formats, since this information is not contained
 227 in the file.
 228 It is also possible to specify the file's character encoding and other
 229 parameters.
 230 For full details, @pxref{DATA LIST}.
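
As a hedged sketch of the encoding option mentioned above, and assuming
purely for illustration that @file{mydata.dat} were stored as UTF-8, the
@samp{encoding} keyword can follow the file name:

@example
@prompt{PSPP>} data list file='mydata.dat' encoding='UTF-8' list /forename (A12) height.
@end example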
 231
 232 @node Reading data from a pre-prepared PSPP file
 233 @subsection Reading data from a pre-prepared @pspp{} file
 234 @cindex system files
 235 @vindex GET
 236
 237 When working with other @pspp{} users, or users of other software which
 238 uses the @pspp{} data format, you may be given the data in
 239 a pre-prepared @pspp{} file.
 240 Such files contain not only the data, but the variable definitions,
 241 along with their formats, labels and other meta-data.
 242 Conventionally, these files (sometimes called ``system'' files)
 243 have the suffix @file{.sav}, but that is
 244 not mandatory.
 245 The following syntax loads a file called @file{my-file.sav}.
 246 @example
 247 @prompt{PSPP>} get file='my-file.sav'.
 248 @end example
 249 @noindent
 250 You will encounter several instances of this in future examples.
 251
 252
 253 @node Saving data to a PSPP file.
 254 @subsection Saving data to a @pspp{} file.
 255 @cindex saving
 256 @vindex SAVE
 257
 258 If you want to save your data, along with the variable definitions so
 259 that you or other @pspp{} users can use it later, you can do this with
 260 the @cmd{SAVE} command.
 261
 262 The following syntax will save the existing data and variables to a
 263 file called @file{my-new-file.sav}.
 264 @example
 265 @prompt{PSPP>} save outfile='my-new-file.sav'.
 266 @end example
 267 @noindent
 268 If @file{my-new-file.sav} already exists, then it will be overwritten.
 269 Otherwise it will be created.
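
A quick way to confirm what was written is to load the new file again
and inspect its dictionary.
The following sketch combines only commands that appear elsewhere in
this chapter:

@example
@prompt{PSPP>} save outfile='my-new-file.sav'.
@prompt{PSPP>} get file='my-new-file.sav'.
@prompt{PSPP>} display dictionary.
@end example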
 270
 271
 272 @node Reading data from other sources
 273 @subsection Reading data from other sources
 274 @cindex comma separated values
 275 @cindex spreadsheets
 276 @cindex databases
 277
 278 Sometimes it's useful to be able to read data from comma
 279 separated text, from spreadsheets, databases or other sources.
 280 In these instances you should
 281 use the @cmd{GET DATA} command (@pxref{GET DATA}).
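
As an illustrative sketch only, the following reads comma separated text
with @cmd{GET DATA}.
The file name @file{mydata.csv}, the presence of a header line (skipped
by @samp{/firstcase=2}) and the variable formats are all assumptions
made for this example:

@example
@prompt{PSPP>} get data /type=txt /file='mydata.csv'
                /arrangement=delimited /delimiters=','
                /firstcase=2
                /variables=forename A12 height F8.2.
@end example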
 282
 283 @node Exiting PSPP
 284 @subsection Exiting PSPP
 285
 286 Use the @cmd{FINISH} command to exit PSPP:
 287 @example
 288 @prompt{PSPP>} finish.
 289 @end example
 290
 291 @node Data Screening and Transformation
 292 @section Data Screening and Transformation
 293
 294 @cindex screening
 295 @cindex transformation
 296
 297 Once data has been entered, it is often desirable, or even necessary,
 298 to transform it in some way before performing analysis upon it.
 299 At the very least, it's good practice to check for errors.
 300
 301 @menu
 302 * Identifying incorrect data::
 303 * Dealing with suspicious data::
 304 * Inverting negatively coded variables::
 305 * Testing data consistency::
 306 * Testing for normality::
 307 @end menu
 308
 309 @node Identifying incorrect data
 310 @subsection Identifying incorrect data
 311 @cindex erroneous data
 312 @cindex errors, in data
 313
 314 Data from real sources is rarely error free.
 315 @pspp{} has a number of procedures which can be used to help
 316 identify data which might be incorrect.
 317
 318 The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
 319 simple linear statistics for a dataset.  It is also useful for
 320 identifying potential problems in the data.
 321 The example file @file{physiology.sav} contains a number of physiological
 322 measurements of a sample of healthy adults selected at random.
 323 However, the data entry clerk made a number of mistakes when entering
 324 the data.
 325 The following example illustrates the use of @cmd{DESCRIPTIVES} to screen this
 326 data and identify the erroneous values:
 327
 328 @example
 329 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 330 @prompt{PSPP>} descriptives sex, weight, height.
 331 @end example
 332
 333 @noindent For this example, PSPP produces the following output:
 334 @psppoutput {tutorial2a}
 335
 336 The most interesting column in the output is the minimum value.
 337 The @exvar{weight} variable has a minimum value of less than zero,
 338 which is clearly erroneous.
 339 Similarly, the @exvar{height} variable's minimum value seems to be very low.
 340 In fact, it is more than 5 standard deviations from the mean, and is a
 341 seemingly bizarre height for an adult person.
 342
 343 We can look deeper into these discrepancies by issuing an additional
 344 @cmd{EXAMINE} command:
 345
 346 @example
 347 @prompt{PSPP>} examine height, weight /statistics=extreme(3).
 348 @end example
 349
 350 @noindent This command produces the following additional output (in part):
 351 @psppoutput {tutorial2b}
 352
 353 @noindent
 354 From this new output, you can see that the lowest value of @exvar{height} is
 355 179 (which we suspect to be erroneous), but the second lowest is 1598
 356 which
 357 we know from @cmd{DESCRIPTIVES}
 358 is within 1 standard deviation from the mean.
 359 Similarly, the lowest value of @exvar{weight} is
 360 negative, but its second lowest value is plausible.
 361 This suggests that the two extreme values are outliers and probably
 362 represent data entry errors.
 363
 364 The output also identifies the case numbers for each extreme value,
 365 so we can see that
 366 cases 30 and 38 are the ones with the erroneous values.
 367
 368 @node Dealing with suspicious data
 369 @subsection Dealing with suspicious data
 370
 371 @cindex SYSMIS
 372 @cindex recoding data
 373 If possible, suspect data should be checked and re-measured.
 374 However, this may not always be feasible, in which case the researcher may
 375 decide to disregard these values.
 376 @pspp{} has a feature whereby data can assume the special value `SYSMIS', and
 377 will be disregarded in future analysis. @xref{Missing Observations}.
 378 You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
 379 command.
 380 @example
 381 @prompt{PSPP>} recode height (179 = SYSMIS).
 382 @prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
 383 @end example
 384 @noindent
 385 The first command says that for any observation which has a
 386 @exvar{height} value of 179, that value should be changed to the SYSMIS
 387 value.
 388 The second command says that any @exvar{weight} values of zero or less
 389 should be changed to SYSMIS.
 390 From now on, they will be ignored in analysis.
 391 For detailed information about the @cmd{RECODE} command, @pxref{RECODE}.
 392
 393 If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands from
 394 the previous section,
 395 you will see a data summary with more plausible parameters.
 396 You will also notice that the data summaries indicate the two missing values.
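
Put together, the screening session might look like the following
sketch, which simply repeats the commands from this and the previous
section in order:

@example
@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
@prompt{PSPP>} recode height (179 = SYSMIS).
@prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS).
@prompt{PSPP>} descriptives sex, weight, height.
@prompt{PSPP>} examine height, weight /statistics=extreme(3).
@end example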
 397
 398 @node Inverting negatively coded variables
 399 @subsection Inverting negatively coded variables
 400
 401 @cindex Likert scale
 402 @cindex Inverting data
 403 Data entry errors are not the only reason for wanting to recode data.
 404 The sample file @file{hotel.sav} comprises data gathered from a
 405 customer satisfaction survey of clients at a particular hotel.
 406 The following commands load the file and display its
 407 variables and their attributes:
 408
 409 @example
 410 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 411 @prompt{PSPP>} display dictionary.
 412 @end example
 413
 414 @noindent It yields the following output:
 415
 416 @psppoutput {tutorial3}
 417
 418 The output shows that all of the variables @exvar{v1} through @exvar{v5} are measured on a 5 point Likert scale,
 419 with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
 420 However, some of the questions are positively worded (@exvar{v1}, @exvar{v2}, @exvar{v4}) and others are negatively worded (@exvar{v3}, @exvar{v5}).
 421 To perform meaningful analysis, we need to recode the variables so
 422 that they all measure in the same direction.
 423 We could use the @cmd{RECODE} command, with syntax such as:
 424 @example
 425 recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
 426 @end example
 427 @noindent
 428 However an easier and more elegant way uses the @cmd{COMPUTE}
 429 command (@pxref{COMPUTE}).
 430 Since the variables are Likert variables in the range (1 @dots{} 5),
 431 subtracting their value  from 6 has the effect of inverting them:
 432 @example
 433 compute @var{var} = 6 - @var{var}.
 434 @end example
 435 @noindent
 436 The following section uses this technique to recode the variables
 437 @exvar{v3} and @exvar{v5}.
 438 After applying  @cmd{COMPUTE} for both variables,
 439 all subsequent commands will use the inverted values.
 440
 441
 442 @node Testing data consistency
 443 @subsection Testing data consistency
 444
 445 @cindex reliability
 446 @cindex consistency
 447
 448 A sensible check to perform on survey data is the calculation of
 449 reliability.
 450 This gives the statistician some confidence that the questionnaires have been
 451 completed thoughtfully.
 452 If you examine the labels of variables @exvar{v1},  @exvar{v3} and @exvar{v4},
 453 you will notice that they ask very similar questions.
 454 One would therefore expect the values of these variables (after recoding)
 455 to closely follow one another, and we can test that with the @cmd{RELIABILITY}
 456 command (@pxref{RELIABILITY}).
 457 The following example shows a @pspp{} session where the user recodes
 458 negatively scaled variables and then requests reliability statistics for
 459 @exvar{v1}, @exvar{v3}, and @exvar{v4}.
 460
 461 @example
 462 @prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
 463 @prompt{PSPP>} compute v3 = 6 - v3.
 464 @prompt{PSPP>} compute v5 = 6 - v5.
 465 @prompt{PSPP>} reliability v1, v3, v4.
 466 @end example
 467
 468 @noindent This yields the following output:
 469 @psppoutput {tutorial4}
 470
 471 As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of
 472 0.7 or higher to indicate reliable data.
 473
 474 Here, the value is 0.81, which suggests a high degree of reliability
 475 among variables @exvar{v1}, @exvar{v3} and @exvar{v4}, so the data and
 476 the recoding that we performed are vindicated.
 477
 478
 479 @node Testing for normality
 480 @subsection Testing for normality
 481 @cindex normality, testing
 482
 483 Many statistical tests rely upon certain properties of the data.
 484 One common property, upon which many linear tests depend, is that of
 485 normality --- the data must have been drawn from a normal distribution.
 486 It is necessary then to ensure normality before deciding upon the
 487 test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.
 488
 489 In the following example, a researcher was examining the failure rates
 490 of equipment produced by an engineering company.
 491 The file @file{repairs.sav} contains the mean time between
 492 failures (@exvar{mtbf}) of some items of equipment subject to the study.
 493 Before performing linear analysis on the data,
 494 the researcher wanted to ascertain that the data is normally distributed.
 495
 496 @example
 497 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 498 @prompt{PSPP>} examine mtbf
 499                 /statistics=descriptives.
 500 @end example
 501
 502 @noindent This produces the following output:
 503 @psppoutput {tutorial5a}
 504
 505 A normal distribution has a skewness and an excess kurtosis (the measure that @pspp{} reports as kurtosis) of zero.
 506 The skewness of @exvar{mtbf} in the output above makes it clear
 507 that the mtbf figures have a lot of positive skew and are therefore
 508 not drawn from a normally distributed variable.
 509 Positive skew can often be compensated for by applying a logarithmic
 510 transformation, as in the following continuation of the example:
 511
 512 @example
 513 @prompt{PSPP>} compute mtbf_ln = ln (mtbf).
 514 @prompt{PSPP>} examine mtbf_ln
 515                 /statistics=descriptives.
 516 @end example
 517
 518 @noindent which produces the following additional output:
 519 @psppoutput {tutorial5b}
 520
 521 The @cmd{COMPUTE} command in the first line above performs the
 522 logarithmic transformation:
 523 @example
 524 compute mtbf_ln = ln (mtbf).
 525 @end example
 526 @noindent
 527 Rather than redefining the existing variable, this use of @cmd{COMPUTE}
 528 defines a new variable @exvar{mtbf_ln} which is
 529 the natural logarithm of @exvar{mtbf}.
 530 The final command in this example calls @cmd{EXAMINE} on this new variable.
 531 The results show that both the skewness and
 532 kurtosis for @exvar{mtbf_ln} are very close to zero.
 533 This provides some confidence that the @exvar{mtbf_ln} variable is
 534 normally distributed and thus safe for linear analysis.
 535 In the event that no suitable transformation can be found,
 536 then it would be worth considering
 537 an appropriate non-parametric test instead of a linear one.
 538 @xref{NPAR TESTS}, for information about non-parametric tests.
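
A graphical check can complement the skewness and kurtosis figures.
As a sketch, and assuming that your @pspp{} build and output driver
support charts, @cmd{EXAMINE} can be asked for a normal probability
plot of the transformed variable:

@example
@prompt{PSPP>} examine mtbf_ln
                /plot=npplot.
@end example

@noindent
Points lying close to a straight line in such a plot are consistent
with a normal distribution.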
 539
 540 @node Hypothesis Testing
 541 @section Hypothesis Testing
 542
 543 @cindex Hypothesis testing
 544 @cindex p-value
 545 @cindex null hypothesis
 546
 547 One of the most fundamental purposes of statistical analysis
 548 is hypothesis testing.
 549 Researchers commonly need to test hypotheses about a set of data.
 550 For example, a researcher might want to test whether one set of data comes from
 551 the same distribution as another,
 552 or
 553 whether the mean of a dataset significantly differs from a particular
 554 value.
 555 This section presents just some of the possible tests that @pspp{} offers.
 556
 557 The researcher starts by making a @dfn{null hypothesis}.
 558 Often this is a hypothesis which he suspects to be false.
 559 For example, if he suspects that @var{A} is greater than @var{B} he will
 560 state the null hypothesis as @math{ @var{A} = @var{B}}.@c
 561 @footnote{This example assumes that it is already proven that @var{B} is
 562 not greater than @var{A}.}
 563
 564 The @dfn{p-value} is a recurring concept in hypothesis testing.
 565 It is the highest acceptable probability that the evidence implying a
 566 null hypothesis is false could have been obtained when the null
 567 hypothesis is in fact true.
 568 Note that this is not the same as ``the probability of making an
 569 error'' nor is it the same as ``the probability of rejecting a
 570 hypothesis when it is true''.
 571
 572
 573
 574 @menu
 575 * Testing for differences of means::
 576 * Linear Regression::
 577 @end menu
 578
 579 @node Testing for differences of means
 580 @subsection Testing for differences of means
 581
 582 @cindex T-test
 583 @vindex T-TEST
 584
 585 A common statistical test involves hypotheses about means.
 586 The @cmd{T-TEST} command is used to find out whether or not two separate
 587 subsets have the same mean.
 588
 589 A researcher suspected that the heights and core body
 590 temperature of persons might be different depending upon their sex.
 591 To investigate this, he posed two null hypotheses based on the data
 592 from @file{physiology.sav} previously encountered:
 593 @itemize @bullet
 594 @item The mean heights of males and females in the population are equal.
 595 @item The mean body temperatures of males and
 596       females in the population are equal.
 597 @end itemize
 598 @noindent
 599 For the purposes of the investigation the researcher
 600 decided to use a  p-value of 0.05.
 601
 602 In addition to the T-test, the @cmd{T-TEST} command also performs the
 603 Levene test for equal variances.
 604 If the variances are equal, then a more powerful form of the T-test can be used.
 605 However if it is unsafe to assume equal variances,
 606 then an alternative calculation is necessary.
 607 @pspp{} performs both calculations.
 608
 609 For the @exvar{height} variable, the output shows the significance of the
 610 Levene test to be 0.33 which means there is a
 611 33% probability that the
 612 Levene test produces this outcome when the variances are equal.
 613 Had the significance been less than 0.05, then it would have been unsafe to assume that
 614 the variances were equal.
 615 However, because the value is higher than 0.05 the homogeneity of variances assumption
 616 is safe and the ``Equal Variances'' row (the more powerful test) can be used.
 617 Examining this row, the two tailed significance for the @exvar{height} t-test
 618 is less than 0.05, so it is safe to reject the null hypothesis and conclude
 619 that the mean heights of males and females are unequal.
 620
 621 For the @exvar{temperature} variable, the significance of the Levene test
 622 is 0.58 so again, it is safe to use the row for equal variances.
 623 The equal variances row indicates that the two tailed significance for
 624 @exvar{temperature} is 0.20.  Since this is greater than 0.05 we cannot reject
 625 the null hypothesis, and must conclude that there is insufficient evidence to
 626 suggest that the body temperatures of male and female persons are different.
 627
 628 The syntax for this analysis is:
 629 @example
 630 @prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
 631 @prompt{PSPP>} recode height (179 = SYSMIS).
 632 @prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
 633 @end example
 634
 635 PSPP produces the following output for this syntax:
 636 @psppoutput {tutorial6}
 637
 638 The @cmd{T-TEST} command tests for differences of means.
 639 Here, the @exvar{height} variable's two tailed significance is less than
 640 0.05, so the null hypothesis can be rejected.
 641 Thus, the evidence suggests there is a difference between the heights of
 642 male and female persons.
 643 However the significance of the test for the @exvar{temperature}
 644 variable is greater than 0.05 so the null hypothesis cannot be
 645 rejected, and there is insufficient evidence to suggest a difference
 646 in body temperature.
 647
 648 @node Linear Regression
 649 @subsection Linear Regression
 650 @cindex linear regression
 651 @vindex REGRESSION
 652
 653 Linear regression is a technique used to investigate if and how a variable
 654 is linearly related to others.
 655 If a variable is found to be linearly related, then this can be used to
 656 predict future values of that variable.
 657
 658 In the following example, the service department of the company wanted to
 659 be able to predict the time to repair equipment, in order to improve
 660 the accuracy of their quotations.
 661 It was suggested that the time to repair might be related to the time
 662 between failures and the duty cycle of the equipment.
 663 The p-value of 0.1 was chosen for this investigation.
 664 In order to investigate this hypothesis, the @cmd{REGRESSION} command
 665 was used.
 666 This command not only tests if the variables are related, but also
 667 identifies the potential linear relationship. @xref{REGRESSION}.
 668
 669 A first attempt includes @exvar{duty_cycle}:
 670
 671 @example
 672 @prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
 673 @prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
 674 @end example
 675
 676 @noindent This attempt yields the following output (in part):
 677 @psppoutput {tutorial7a}
 678
 679 The coefficients in the above table suggest that the formula
 680 @math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
 681 can be used to predict the time to repair.
 682 However, the significance value for the @exvar{duty_cycle} coefficient
 683 is very high, which would make this an unsafe predictor.
 684 For this reason, the test was repeated, but omitting the
 685 @exvar{duty_cycle} variable:
 686
 687 @example
 688 @prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
 689 @end example
 690
 691 @noindent
 692 This second try produces the following output (in part):
 693 @psppoutput {tutorial7b}
 694
 695 This time, the significance of all coefficients is no higher than 0.06,
 696 suggesting that at the 0.06 level, the formula
 697 @math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
 698 predictor of the time to repair.
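
To illustrate how the fitted formula might be applied, the following
sketch computes a predicted repair time for each case and lists it
beside the observed value.
The variable name @exvar{mttr_pred} and the choice of the first five
cases are arbitrary, and the coefficients are simply the rounded values
quoted above:

@example
@prompt{PSPP>} compute mttr_pred = 10.5 + 3.11 * mtbf.
@prompt{PSPP>} list /variables=mtbf mttr mttr_pred /cases=from 1 to 5.
@end example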
 699
 700
 701 @c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
 702 @c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
 703 @c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
 704 @c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
 705 @c  LocalWords:  mttr outfile