From: John Darrington Date: Sun, 17 May 2009 05:06:59 +0000 (+0800) Subject: Added a tutorial chapter to the manual. X-Git-Tag: v0.7.3~118 X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=a645efe8a26ace14b0c417a57559381d4c7193f5;p=pspp-builds.git Added a tutorial chapter to the manual. --- diff --git a/doc/automake.mk b/doc/automake.mk index 67c2d83c..9402d7af 100644 --- a/doc/automake.mk +++ b/doc/automake.mk @@ -23,6 +23,8 @@ doc_pspp_TEXINFOS = doc/version.texi \ doc/not-implemented.texi \ doc/statistics.texi \ doc/transformation.texi \ + doc/tutorial.texi \ + doc/tut.texi \ doc/regression.texi \ doc/utilities.texi \ doc/variables.texi \ @@ -46,6 +48,11 @@ doc/ni.texi: $(top_srcdir)/src/language/command.def doc/get-commands.pl @$(MKDIR_P) doc @PERL@ $(top_srcdir)/doc/get-commands.pl $(top_srcdir)/src/language/command.def > $@ +doc/tut.texi: + @$(MKDIR_P) doc + echo "@set example-dir $(examplesdir)" > $@ + + doc/pspp.xml: doc/pspp.texinfo $(doc_pspp_TEXINFOS) @$(MKDIR_P) doc $(MAKEINFO) --docbook -I $(top_srcdir) $< -o $@ diff --git a/doc/expressions.texi b/doc/expressions.texi index 32ace6c2..6ab0443e 100644 --- a/doc/expressions.texi +++ b/doc/expressions.texi @@ -259,7 +259,7 @@ The sections below describe each function in detail. * Statistical Functions:: CFVAR MAX MEAN MIN SD SUM VARIANCE * String Functions:: CONCAT INDEX LENGTH LOWER LPAD LTRIM NUMBER RINDEX RPAD RTRIM STRING SUBSTR UPCASE -* Time & Date:: CTIME.xxx DATE.xxx TIME.xxx XDATE.xxx +* Time and Date:: CTIME.xxx DATE.xxx TIME.xxx XDATE.xxx DATEDIFF DATESUM * Miscellaneous Functions:: LAG YRMODA VALUELABEL * Statistical Distribution Functions:: PDF CDF SIG IDF RV NPDF NCDF @@ -691,7 +691,7 @@ has value @code{"cd"}; @code{SUBSTR("nonsense", 4, 10)} has the value Returns @var{string}, changing lowercase letters to uppercase letters. 
@end deftypefn -@node Time & Date +@node Time and Date @subsection Time & Date Functions @cindex functions, time & date @cindex times @@ -702,17 +702,17 @@ For compatibility, PSPP considers dates before 15 Oct 1582 invalid. Most time and date functions will not accept earlier dates. @menu -* Time & Date Concepts:: How times & dates are defined and represented +* Time and Date Concepts:: How times & dates are defined and represented * Time Construction:: TIME.@{DAYS HMS@} * Time Extraction:: CTIME.@{DAYS HOURS MINUTES SECONDS@} * Date Construction:: DATE.@{DMY MDY MOYR QYR WKYR YRDAY@} * Date Extraction:: XDATE.@{DATE HOUR JDAY MDAY MINUTE MONTH QUARTER SECOND TDAY TIME WEEK WKDAY YEAR@} -* Time & Date Arithmetic:: DATEDIFF DATESUM +* Time and Date Arithmetic:: DATEDIFF DATESUM @end menu -@node Time & Date Concepts +@node Time and Date Concepts @subsubsection How times & dates are defined and represented @cindex time, concepts @@ -1001,7 +1001,7 @@ Returns the year (as an integer 1582 or greater) corresponding to @var{date}. @end deftypefn -@node Time & Date Arithmetic +@node Time and Date Arithmetic @subsubsection Time and Date Arithmetic @cindex time, mathematical properties of diff --git a/doc/invoking.texi b/doc/invoking.texi index 6b45392b..3052162f 100644 --- a/doc/invoking.texi +++ b/doc/invoking.texi @@ -1,10 +1,38 @@ @node Invocation -@chapter Invoking PSPP + +@chapter Starting PSPP @cindex invocation @cindex PSPP, invoking +There are two separate user interfaces for PSPP. +There is the command line interface, which responds to commands +typed by the user. +The command line interface is generally available on more platforms +than the graphic user interface and since it doesn't require a +graphics device it can be used from a remote terminal. +Platforms which have a windowing system may also be able to support +the graphic user interface. +The graphic user interface can perform all functionality of the +command line interface. 
+In addition it gives an instantaneous view of the data, variables and +statistical output. + +Whichever interface you choose, a basic understanding of the concepts +used by PSPP is necessary before effective use of the system can be achieved. + + +@menu +* The command line user interface:: +* The graphic user interface:: +@end menu + + +@node The command line user interface +@section The command line user interface + @cindex command line, options @cindex options, command-line + @example pspp [ -B @var{dir} | --config-dir=@var{dir} ] [ -o @var{device} | --device=@var{device} ] [ -a @{compatible|enhanced@} | --algorithm=@{compatible|enhanced@}] @@ -26,7 +54,7 @@ pspp [ -B @var{dir} | --config-dir=@var{dir} ] [ -o @var{device} | --device=@var @end menu @node Non-option Arguments -@section Non-option Arguments +@subsection Non-option Arguments Syntax files and output device substitutions can be specified on PSPP's command line: @@ -64,7 +92,7 @@ typing its name. You can include any options on the command line as usual. PSPP entirely ignores any lines beginning with @samp{#!}. @node Configuration Options -@section Configuration Options +@subsection Configuration Options Configuration options are used to change PSPP's configuration for the current run. The configuration options are: @@ -95,7 +123,7 @@ option disables all devices besides those mentioned on the command line. @end table @node Input and output options -@section Input and output options +@subsection Input and output options Input and output options affect how PSPP reads input and writes output. These are the input and output options: @@ -121,7 +149,7 @@ check} and similar scripts. @end table @node Language control options -@section Language control options +@subsection Language control options Language control options control how PSPP syntax files are parsed and interpreted. The available language control options are: @@ -152,7 +180,7 @@ HOST commands, as well as use of pipes as input and output files. 
@end table @node Informational options -@section Informational options +@subsection Informational options Informational options cause information about PSPP to be written to the terminal. Here are the available options: @@ -224,3 +252,19 @@ Individual directories included in file searches. Each verbosity level also includes messages from lower verbosity levels. @end table + + +@node The graphic user interface +@section The graphic user interface + +@cindex Graphic user interface +@cindex PSPPIRE + +The graphic user interface can be started by typing @command{psppire} at a +command prompt. +Alternatively many systems have a system of interactive menus or buttons +from which @command{psppire} can be started by a series of mouse clicks. + +Once the principles of the PSPP system are understood, +the graphic user interface is designed to be largely intuitive, and +for this reason is covered only very briefly by this manual. diff --git a/doc/language.texi b/doc/language.texi index 62267414..3a7302cc 100644 --- a/doc/language.texi +++ b/doc/language.texi @@ -3,11 +3,9 @@ @cindex language, PSPP @cindex PSPP, language -@quotation -@strong{Please note:} PSPP is not even close to completion. +@note{PSPP is not even close to completion. Only a few statistical procedures are implemented. PSPP -is a work in progress. -@end quotation +is a work in progress.} This chapter discusses elements common to many PSPP commands. Later chapters will describe individual commands in detail. diff --git a/doc/pspp.texinfo b/doc/pspp.texinfo index 975ee660..9c50904e 100644 --- a/doc/pspp.texinfo +++ b/doc/pspp.texinfo @@ -6,6 +6,13 @@ @c @setchapternewpage odd @c %**end of header + +@macro note{param1} +@quotation +@strong{Please note:} \param1\ +@end quotation +@end macro + @include version.texi @macro cmd{CMDNAME} @@ -25,7 +32,7 @@ This manual is for GNU PSPP version @value{VERSION}, software for statistical analysis. 
-Copyright @copyright{} 1997, 1998, 2004, 2005 Free Software Foundation, Inc. +Copyright @copyright{} 1997, 1998, 2004, 2005, 2009 Free Software Foundation, Inc. @quotation Permission is granted to copy, distribute and/or modify this document @@ -51,6 +58,14 @@ modify this GNU manual.'' @insertcopying @end titlepage +@chapheading Acknowledgements +The authors wish to thank +Network Theory Ltd +@url{http://www.network-theory.co.uk} +for their financial support +in the production of this manual. + + @contents @@ -66,6 +81,7 @@ modify this GNU manual.'' * License:: Your rights and obligations. * Invocation:: Starting and running PSPP. +* Using PSPP:: How to use PSPP --- A brief tutorial. * Language:: Basics of the PSPP command language. * Expressions:: Numeric and string expression syntax. @@ -95,6 +111,7 @@ modify this GNU manual.'' @include license.texi @include invoking.texi +@include tutorial.texi @include language.texi @include expressions.texi @include data-io.texi diff --git a/doc/tutorial.texi b/doc/tutorial.texi new file mode 100644 index 00000000..5bc80f56 --- /dev/null +++ b/doc/tutorial.texi @@ -0,0 +1,857 @@ +@alias prompt = sansserif + +@include doc/tut.texi + +@node Using PSPP +@chapter Using PSPP + +PSPP is a tool for the statistical analysis of sampled data. +You can use it to discover patterns in the data, +to explain differences in one subset of data in terms of another subset +and to find out +whether certain beliefs about the data are justified. +This chapter does not attempt to introduce the theory behind the +statistical analysis, +but it shows how such analysis can be performed using PSPP. + +For the purposes of this tutorial, it is assumed that you are using PSPP in its +interactive mode from the command line. +However, the example commands can also be typed into a file and executed in +a post-hoc mode by typing @samp{pspp @var{filename}} at a shell prompt, +where @var{filename} is the name of the file containing the commands. 
+Alternatively, from the graphical interface, you can select +@clicksequence{File @click{} New @click{} Syntax} to open a new syntax window +and use the @clicksequence{Run} menu when a syntax fragment is ready to be +executed. +Whichever method you choose, the syntax is identical. + +When using the interactive method, PSPP tells you that it's waiting for your +data with a string like @prompt{PSPP>} or @prompt{data>}. +In the examples of this chapter, whenever you see text like this, it +indicates the prompt displayed by PSPP, @emph{not} something that you +should type. + +Throughout this chapter reference is made to a number of sample data files. +So that you can try the examples for yourself, +you should have received these files along with your copy of PSPP. +@footnote{These files contain purely fictitious data. They should not be used +for research purposes.} +@note{Normally these files are installed in the directory +@file{@value{example-dir}}. +If however your system administrator or operating system vendor has +chosen to install them in a different location, you will have to adjust +the examples accordingly.} + + +@menu +* Preparation of Data Files:: +* Data Screening and Transformation:: +* Hypothesis Testing:: +@end menu + +@node Preparation of Data Files +@section Preparation of Data Files + + +Before analysis can commence, the data must be loaded into PSPP and +arranged such that both PSPP and humans can understand what +the data represents. +There are two aspects of data: + +@itemize @bullet +@item The variables --- these are the parameters of a quantity + which has been measured or estimated in some way. + For example height, weight and geographic location are all variables. +@item The observations (also called `cases') of the variables --- + each observation represents an instance when the variables were measured + or observed. 
+@end itemize + +@noindent +For example, a data set which has the variables @var{height}, @var{weight}, and +@var{name}, might have the observations: +@example +188 89 Ahmed +119 107 Frank +123 67 Julie +@end example +@noindent +The following sections explain how to define a dataset. + +@menu +* Defining Variables:: +* Listing the data:: +* Reading data from a text file:: +* Reading data from a pre-prepared PSPP file:: +* Saving data to a PSPP file.:: +* Reading data from other sources:: +@end menu + +@node Defining Variables +@subsection Defining Variables +@cindex variables + +Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}. +Variables such as age, height and satisfaction are numeric, +whereas name is a string variable. +String variables are best reserved for commentary data to assist the +human observer. +However they can also be used for nominal or categorical data. + + +@ref{data-list} defines two variables @var{forename} and @var{height}, +and reads data into them by manual input. + +@float Example, data-list +@cartouche +@example +@prompt{PSPP>} data list list /forename (A12) height *. +@prompt{PSPP>} begin data. +@prompt{data>} Ahmed 188 +@prompt{data>} Bertram 167 +@prompt{data>} Catherine 134.231 +@prompt{data>} David 109.1 +@prompt{data>} end data +@prompt{PSPP>} +@end example +@end cartouche +@caption{Manual entry of data using the @cmd{DATA LIST} command. +Two variables +@var{forename} and @var{height} are defined and subsequently filled +with manually entered data.} +@end float + +There are several things to note about this example. + +@itemize @bullet +@item +The words @samp{data list list} are an example of the @cmd{DATA LIST} +command. @xref{DATA LIST}. +It tells PSPP to prepare for reading data. + +@item +The @samp{/} character is important. It marks the start of the list of +variables which you wish to define. 
+ +@item +The text @samp{forename} is the name of the first variable, +and @samp{(A12)} says that the variable @var{forename} is a string +variable and that its maximum length is 12 bytes. +The second variable's name is specified by the text @samp{height} +and the @samp{*} +means that this variable has the default format. +Instead of typing @samp{*} you could also have typed @samp{(F8.2)}. +However, since @samp{F8.2} is the default format (unless you changed it +with the @cmd{SET} command (@pxref{SET})), it's quicker to simply type +@samp{*}. +For more information on data formats, @pxref{Input and Output Formats}. + + +@item +Normally, PSPP displays the prompt @prompt{PSPP>} whenever it's +expecting a command. +However, when it's expecting data, the prompt changes to @prompt{data>} +so that you know to enter data and not a command. + +@item +At the end of every command there is a terminating @samp{.} which tells +PSPP that the end of a command has been encountered. +You should not enter @samp{.} when data is expected (@i{i.e.} when +the @prompt{data>} prompt is current) since it is appropriate only for +terminating commands. +@end itemize + +@node Listing the data +@subsection Listing the data +@vindex LIST + +Once the data has been entered, +you could type +@example +@prompt{PSPP>} list /format=numbered. +@end example +@noindent +to list the data. +The optional text @samp{/format=numbered} requests the case numbers to be +shown along with the data. +It should show the following output: +@example +@group +Case# forename height +----- ------------ -------- + 1 Ahmed 188.00 + 2 Bertram 167.00 + 3 Catherine 134.23 + 4 David 109.10 +@end group +@end example +@noindent +Note that the numeric variable @var{height} is displayed to 2 decimal +places, because the format for that variable is @samp{F8.2}. +For a complete description of the @cmd{LIST} command, @pxref{LIST}. 
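+
+As a further illustration, the following sketch lists only part of the
+data entered in @ref{data-list}.
+It assumes that those four cases are still the active dataset; the case
+range 2 to 3 is an arbitrary choice made purely for this illustration.
+@example
+@prompt{PSPP>} * cases 2 to 3 are an arbitrary range chosen for illustration.
+@prompt{PSPP>} list /variables=forename height /cases=from 2 to 3 /format=numbered.
+@end example
+@noindent
+Here the @samp{/variables} subcommand names the variables to be shown and
+the @samp{/cases} subcommand restricts the listing to a range of cases.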
+ +@node Reading data from a text file +@subsection Reading data from a text file +@cindex reading data + +The previous example showed how to define a set of variables and to +manually enter the data for those variables. +Manual entering of data is tedious work, and often +a file containing the data will have been previously +prepared. +Let us assume that you have a file called @file{mydata.dat} containing the +ASCII-encoded data: +@example +Ahmed 188.00 +Bertram 167.00 +Catherine 134.23 +David 109.10 +@ . +@ . +@ . +Zachariah 113.02 +@end example +@noindent +You can tell the @cmd{DATA LIST} command to read the data directly from +this file instead of by manual entry, with a command like: +@example +@prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height *. +@end example +@noindent +Notice, however, that it is still necessary to specify the names of the +variables and their formats, since this information is not contained +in the file. +It is also possible to specify the file's character encoding and other +parameters. +For full details, @pxref{DATA LIST}. + +@node Reading data from a pre-prepared PSPP file +@subsection Reading data from a pre-prepared PSPP file +@cindex system files +@vindex GET + +When working with other PSPP users, or users of other software which +uses the PSPP data format, you may be given the data in +a pre-prepared PSPP file. +Such files contain not only the data, but also the variable definitions, +along with their formats, labels and other meta-data. +Conventionally, these files (sometimes called ``system'' files) +have the suffix @file{.sav}, but that is +not mandatory. +The following syntax loads a file called @file{my-file.sav}. +@example +@prompt{PSPP>} get file='my-file.sav'. +@end example +@noindent +You will encounter several instances of this in future examples. + + +@node Saving data to a PSPP file. +@subsection Saving data to a PSPP file. 
+@cindex saving +@vindex SAVE + +If you want to save your data, along with the variable definitions so +that you or other PSPP users can use it later, you can do this with +the @cmd{SAVE} command. + +The following syntax will save the existing data and variables to a +file called @file{my-new-file.sav}. +@example +@prompt{PSPP>} save outfile='my-new-file.sav'. +@end example +@noindent +If @file{my-new-file.sav} already exists, then it will be overwritten. +Otherwise it will be created. + + +@node Reading data from other sources +@subsection Reading data from other sources +@cindex comma separated values +@cindex spreadsheets +@cindex databases + +Sometimes it's useful to be able to read data from comma +separated text, from spreadsheets, databases or other sources. +In these instances you should +use the @cmd{GET DATA} command (@pxref{GET DATA}). + + +@node Data Screening and Transformation +@section Data Screening and Transformation + +@cindex screening +@cindex transformation + +Once data has been entered, it is often desirable, or even necessary, +to transform it in some way before performing analysis upon it. +At the very least, it's good practice to check for errors. + +@menu +* Identifying incorrect data:: +* Dealing with suspicious data:: +* Inverting negatively coded variables:: +* Testing data consistency:: +* Testing for normality :: +@end menu + +@node Identifying incorrect data +@subsection Identifying incorrect data +@cindex erroneous data +@cindex errors, in data + +Data from real sources is rarely error free. +PSPP has a number of procedures which can be used to help +identify data which might be incorrect. + +The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate +simple linear statistics for a dataset. It is also useful for +identifying potential problems in the data. +The example file @file{physiology.sav} contains a number of physiological +measurements of a sample of healthy adults selected at random. 
+However, the data entry clerk made a number of mistakes when entering +the data. +@ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this +data and identify the erroneous values. + +@float Example, descriptives +@cartouche +@example +@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'. +@prompt{PSPP>} descriptives sex, weight, height. +@end example + +Output: +@example +DESCRIPTIVES. Valid cases = 40; cases with missing value(s) = 0. ++--------#--+-------+-------+-------+-------+ +|Variable# N| Mean |Std Dev|Minimum|Maximum| +#========#==#=======#=======#=======#=======# +|sex #40| .45| .50| .00| 1.00| +|height #40|1677.12| 262.87| 179.00|1903.00| +|weight #40| 72.12| 26.70| -55.60| 92.07| ++--------#--+-------+-------+-------+-------+ +@end example +@end cartouche +@caption{Using the @cmd{DESCRIPTIVES} command to display simple +summary information about the data. +In this case, the results show unexpectedly low values in the Minimum +column, suggesting incorrect data entry.} +@end float + +In the output of @ref{descriptives}, +the most interesting column is the minimum value. +The @var{weight} variable has a minimum value of less than zero, +which is clearly erroneous. +Similarly, the @var{height} variable's minimum value seems to be very low. +In fact, it is more than 5 standard deviations from the mean, and is a +seemingly bizarre height for an adult person. +We can examine the data in more detail with the @cmd{EXAMINE} +command (@pxref{EXAMINE}): + +In @ref{examine} you can see that the lowest value of @var{height} is +179 (which we suspect to be erroneous), but the second lowest is 1598 +which +we know from the @cmd{DESCRIPTIVES} command +is within 1 standard deviation from the mean. +Similarly the @var{weight} variable has a lowest value which is +negative but a plausible value for the second lowest value. +This suggests that the two extreme values are outliers and probably +represent data entry errors. 
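+
+As a rough check of these distances, assuming the mean (1677.12) and
+standard deviation (262.87) of @var{height} reported in @ref{descriptives},
+the two lowest values lie approximately the following number of standard
+deviations below the mean:
+@example
+(1677.12 - 179) / 262.87 = 5.70
+(1677.12 - 1598) / 262.87 = 0.30
+@end example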
+ +@float Example, examine +@cartouche +[@dots{} continued from @ref{descriptives}] +@example +@prompt{PSPP>} examine height, weight /statistics=extreme(3). +@end example + +Output: +@example +#===============================#===========#=======# +# #Case Number| Value # +#===============================#===========#=======# +#Height in millimetres Highest 1# 14|1903.00# +# 2# 15|1884.00# +# 3# 12|1801.65# +# ----------#-----------+-------# +# Lowest 1# 30| 179.00# +# 2# 31|1598.00# +# 3# 28|1601.00# +# ----------#-----------+-------# +#Weight in kilograms Highest 1# 13| 92.07# +# 2# 5| 92.07# +# 3# 17| 91.74# +# ----------#-----------+-------# +# Lowest 1# 38| -55.60# +# 2# 39| 54.48# +# 3# 33| 55.45# +#===============================#===========#=======# +@end example +@end cartouche +@caption{Using the @cmd{EXAMINE} command to see the extremities of the data +for different variables. Cases 30 and 38 seem to contain values +very much lower than the rest of the data. +They are possibly erroneous.} +@end float + +@node Dealing with suspicious data +@subsection Dealing with suspicious data + +@cindex SYSMIS +@cindex recoding data +If possible, suspect data should be checked and re-measured. +However, this may not always be feasible, in which case the researcher may +decide to disregard these values. +PSPP has a feature whereby data can assume the special value `SYSMIS', and +will be disregarded in future analysis. @xref{Missing Observations}. +You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE} +command. +@example +@prompt{PSPP>} recode height (179 = SYSMIS). +@prompt{PSPP>} recode weight (LOWEST THRU 0 = SYSMIS). +@end example +@noindent +The first command says that for any observation which has a +@var{height} value of 179, that value should be changed to the SYSMIS +value. +The second command says that any @var{weight} values of zero or less +should be changed to SYSMIS. +From now on, they will be ignored in analysis. 
+For detailed information about the @cmd{RECODE} command, @pxref{RECODE}. + +If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in +@ref{descriptives} and @ref{examine} you +will see a data summary with more plausible parameters. +You will also notice that the data summaries indicate the two missing values. + +@node Inverting negatively coded variables +@subsection Inverting negatively coded variables + +@cindex Likert scale +@cindex Inverting data +Data entry errors are not the only reason for wanting to recode data. +The sample file @file{hotel.sav} comprises data gathered from a +customer satisfaction survey of clients at a particular hotel. +In @ref{reliability}, this file is loaded for analysis. +The line @code{display dictionary.} tells PSPP to display the +variables and their associated attributes, such as labels. +The output from this command has been omitted from the example for the sake of clarity, but +if you run it you will notice that each of the variables +@var{v1}, @var{v2} @dots{} @var{v5} is measured on a 5 point Likert scale, +with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''. +Whilst variables @var{v1}, @var{v2} and @var{v4} record responses +to positively posed questions, variables @var{v3} and @var{v5} are +responses to negatively worded questions. +In order to perform meaningful analysis, we need to recode the variables so +that they all measure in the same direction. +We could use the @cmd{RECODE} command, with syntax such as: +@example +recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1). +@end example +@noindent +However, an easier and more elegant way uses the @cmd{COMPUTE} +command (@pxref{COMPUTE}). +Since the variables are Likert variables in the range (1 @dots{} 5), +subtracting their value from 6 has the effect of inverting them: +@example +compute @var{var} = 6 - @var{var}. +@end example +@noindent +@ref{reliability} uses this technique to recode the variables +@var{v3} and @var{v5}. 
+After applying @cmd{COMPUTE} for both variables, +all subsequent commands will use the inverted values. + + +@node Testing data consistency +@subsection Testing data consistency + +@cindex reliability +@cindex consistency + +A sensible check to perform on survey data is the calculation of +reliability. +This gives the statistician some confidence that the questionnaires have been +completed thoughtfully. +If you examine the labels of variables @var{v1}, @var{v3} and @var{v5}, +you will notice that they ask very similar questions. +One would therefore expect the values of these variables (after recoding) +to closely follow one another, and we can test that with the @cmd{RELIABILITY} +command (@pxref{RELIABILITY}). +@ref{reliability} shows a PSPP session where the user (after recoding +negatively scaled variables) requests reliability statistics for +@var{v1}, @var{v3} and @var{v5}. + +@float Example, reliability +@cartouche +@example +@prompt{PSPP>} get file='@value{example-dir}/hotel.sav'. +@prompt{PSPP>} display dictionary. +@prompt{PSPP>} * recode negatively worded questions. +@prompt{PSPP>} compute v3 = 6 - v3. +@prompt{PSPP>} compute v5 = 6 - v5. +@prompt{PSPP>} reliability v1, v3, v5. +@end example + +Output (dictionary information omitted for clarity): +@example +1.1 RELIABILITY. Case Processing Summary +#==============#==#======# +# # N| % # +#==============#==#======# +#Cases Valid #17|100.00# +# Excluded# 0| .00# +# Total #17|100.00# +#==============#==#======# + +1.2 RELIABILITY. Reliability Statistics +#================#==========# +#Cronbach's Alpha#N of items# +#================#==========# +# .86# 3# +#================#==========# +@end example +@end cartouche +@caption{Recoding negatively scaled variables, and testing for +reliability with the @cmd{RELIABILITY} command. 
The Cronbach Alpha +coefficient suggests a high degree of reliability among variables +@var{v1}, @var{v3} and @var{v5}.} +@end float + +As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of +0.7 or higher to indicate reliable data. +Here, the value is 0.86 so the data and the recoding that we performed +are vindicated. + + +@node Testing for normality +@subsection Testing for normality +@cindex normality, testing + +Many statistical tests rely upon certain properties of the data. +One common property, upon which many linear tests depend, is that of +normality --- the data must have been drawn from a normal distribution. +It is necessary then to ensure normality before deciding upon the +test procedure to use. One way to do this uses the @cmd{EXAMINE} command. + +In @ref{normality}, a researcher was examining the failure rates +of equipment produced by an engineering company. +The file @file{repairs.sav} contains the mean time between +failures (@var{mtbf}) of some items of equipment subject to the study. +Before performing linear analysis on the data, +the researcher wanted to ascertain that the data is normally distributed. + +A normal distribution has a skewness and kurtosis of zero. +Looking at the skewness of @var{mtbf} in @ref{normality} it is clear +that the @var{mtbf} figures have a lot of positive skew and are therefore +not drawn from a normally distributed variable. +Positive skew can often be compensated for by applying a logarithmic +transformation. +This is done with the @cmd{COMPUTE} command in the line +@example +compute mtbf_ln = ln (mtbf). +@end example +@noindent +Rather than redefining the existing variable, this use of @cmd{COMPUTE} +defines a new variable @var{mtbf_ln} which is +the natural logarithm of @var{mtbf}. +The final command in this example calls @cmd{EXAMINE} on this new variable, +and it can be seen from the results that both the skewness and +kurtosis for @var{mtbf_ln} are very close to zero. 
+This provides some confidence that the @var{mtbf_ln} variable is +normally distributed and thus safe for linear analysis. +In the event that no suitable transformation can be found, +then it would be worth considering +an appropriate non-parametric test instead of a linear one. +@xref{NPAR TESTS}, for information about non-parametric tests. + +@float Example, normality +@cartouche +@example +@prompt{PSPP>} get file='@value{example-dir}/repairs.sav'. +@prompt{PSPP>} examine mtbf + /statistics=descriptives. +@prompt{PSPP>} compute mtbf_ln = ln (mtbf). +@prompt{PSPP>} examine mtbf_ln + /statistics=descriptives. +@end example + +Output: +@example +1.2 EXAMINE. Descriptives +#====================================================#=========#==========# +# #Statistic|Std. Error# +#====================================================#=========#==========# +#mtbf Mean # 8.32 | 1.62 # +# 95% Confidence Interval for Mean Lower Bound# 4.85 | # +# Upper Bound# 11.79 | # +# 5% Trimmed Mean # 7.69 | # +# Median # 8.12 | # +# Variance # 39.21 | # +# Std. Deviation # 6.26 | # +# Minimum # 1.63 | # +# Maximum # 26.47 | # +# Range # 24.84 | # +# Interquartile Range # 5.83 | # +# Skewness # 1.85 | .58 # +# Kurtosis # 4.49 | 1.12 # +#====================================================#=========#==========# + +2.2 EXAMINE. Descriptives +#====================================================#=========#==========# +# #Statistic|Std. Error# +#====================================================#=========#==========# +#mtbf_ln Mean # 1.88 | .19 # +# 95% Confidence Interval for Mean Lower Bound# 1.47 | # +# Upper Bound# 2.29 | # +# 5% Trimmed Mean # 1.88 | # +# Median # 2.09 | # +# Variance # .54 | # +# Std. 
Deviation # .74 | # +# Minimum # .49 | # +# Maximum # 3.28 | # +# Range # 2.79 | # +# Interquartile Range # .92 | # +# Skewness # -.16 | .58 # +# Kurtosis # -.09 | 1.12 # +#====================================================#=========#==========# +@end example +@end cartouche +@caption{Testing for normality using the @cmd{EXAMINE} command and applying +a logarithmic transformation. +The @var{mtbf} variable has a large positive skew and is therefore +unsuitable for linear statistical analysis. +However the transformed variable (@var{mtbf_ln}) is close to normal and +would appear to be more suitable.} +@end float + + +@node Hypothesis Testing +@section Hypothesis Testing + +@cindex Hypothesis testing +@cindex p-value +@cindex null hypothesis + +One of the most fundamental purposes of statistical analysis +is hypothesis testing. +Researchers commonly need to test hypotheses about a set of data. +For example, a researcher might want to test whether one set of data comes from +the same distribution as another, +or whether the mean of a dataset differs significantly from a particular +value. +This section presents just some of the possible tests that PSPP offers. + +The researcher starts by making a @dfn{null hypothesis}. +Often this is a hypothesis which he suspects to be false. +For example, if he suspects that @var{A} is greater than @var{B} he will +state the null hypothesis as @math{ @var{A} = @var{B}}. +@footnote{This example assumes that it is already proven that @var{B} is +not greater than @var{A}.} + +The @dfn{p-value} is a recurring concept in hypothesis testing. +It is the highest acceptable probability that evidence implying that the +null hypothesis is false could have been obtained when the null +hypothesis is in fact true. +Note that this is not the same as ``the probability of making an +error'' nor is it the same as ``the probability of rejecting a +hypothesis when it is true''. 
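+
+As a sketch of the second kind of question mentioned above, whether the
+mean of a dataset differs from a particular value, the @cmd{T-TEST}
+command can be run in its one-sample form.
+The example below assumes the sample file @file{physiology.sav}
+with its suspect @var{height} value recoded as before; the test value
+of 1700 is purely hypothetical and chosen only for illustration.
+@example
+@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
+@prompt{PSPP>} recode height (179 = SYSMIS).
+@prompt{PSPP>} * the test value 1700 is a hypothetical figure.
+@prompt{PSPP>} t-test testval=1700 /variables=height.
+@end example
+@noindent
+Here the null hypothesis is that the mean @var{height} in the population
+is equal to 1700, and the two tailed significance reported by the command
+is compared against the chosen p-value.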
+ + + +@menu +* Testing for differences of means:: +* Linear Regression:: +@end menu + +@node Testing for differences of means +@subsection Testing for differences of means + +@cindex T-test +@vindex T-TEST + +A common statistical test involves hypotheses about means. +The @cmd{T-TEST} command is used to find out whether or not two separate +subsets have the same mean. + +@ref{t-test} uses the file @file{physiology.sav} previously +encountered. +A researcher suspected that the heights and core body +temperatures of persons might be different depending upon their sex. +To investigate this, he posed two null hypotheses: +@itemize @bullet +@item The mean heights of males and females in the population are equal. +@item The mean body temperatures of males and + females in the population are equal. +@end itemize +@noindent +For the purposes of the investigation the researcher +decided to use a p-value of 0.05. + +In addition to the T-test, the @cmd{T-TEST} command also performs the +Levene test for equal variances. +If the variances are equal, then a more powerful form of the T-test can be used. +However, if it is unsafe to assume equal variances, +then an alternative calculation is necessary. +PSPP performs both calculations. + +For the @var{height} variable, the output shows the significance of the +Levene test to be 0.33, which means there is a +33% probability that the +Levene test produces an outcome at least this extreme when the variances are equal. +Such a probability gives no grounds for concluding that the variances differ. +The row for unequal variances does not depend on the variances being equal, +and here both rows show the same significance, so the row +for unequal variances is used. +Examining this row, the two tailed significance for the @var{height} t-test +is less than 0.05, so it is safe to reject the null hypothesis and conclude +that the mean heights of males and females are unequal. + +For the @var{temperature} variable, the significance of the Levene test +is 0.58, so again there is no evidence of unequal variances, and again the row for unequal variances is used. 
+The unequal variances row indicates that the two tailed significance for +@var{temperature} is 0.19. Since this is greater than 0.05 we cannot reject +the null hypothesis and must conclude that there is insufficient evidence to +suggest that the body temperatures of male and female persons are different. + +@float Example, t-test +@cartouche +@example +@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'. +@prompt{PSPP>} recode height (179 = SYSMIS). +@prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature. +@end example +Output: +@example +1.1 T-TEST. Group Statistics +#==================#==#=======#==============#========# +# sex | N| Mean |Std. Deviation|SE. Mean# +#==================#==#=======#==============#========# +#height Male |22|1796.49| 49.71| 10.60# +# Female|17|1610.77| 25.43| 6.17# +#temperature Male |22| 36.68| 1.95| .42# +# Female|18| 37.43| 1.61| .38# +#==================#==#=======#==============#========# +1.2 T-TEST. Independent Samples Test +#===========================#=========#=============================== =# +# # Levene's| t-test for Equality of Means # +# #----+----+------+-----+------+---------+- -# +# # | | | | | | # +# # | | | |Sig. 2| | # +# # F |Sig.| t | df |tailed|Mean Diff| # +#===========================#====#====#======#=====#======#=========#= =# +#height Equal variances# .97| .33| 14.02|37.00| .00| 185.72| ... # +# Unequal variances# | | 15.15|32.71| .00| 185.72| ... # +#temperature Equal variances# .31| .58| -1.31|38.00| .20| -.75| ... # +# Unequal variances# | | -1.33|37.99| .19| -.75| ... # +#===========================#====#====#======#=====#======#=========#= =# +@end example +@end cartouche +@caption{The @cmd{T-TEST} command tests for differences of means. +Here, the @var{height} variable's two tailed significance is less than +0.05, so the null hypothesis can be rejected. +Thus, the evidence suggests there is a difference between the heights of +male and female persons. 
+However, the significance of the test for the @var{temperature} +variable is greater than 0.05 so the null hypothesis cannot be +rejected, and there is insufficient evidence to suggest a difference +in body temperature.} +@end float + +@node Linear Regression +@subsection Linear Regression +@cindex linear regression +@vindex REGRESSION + +Linear regression is a technique used to investigate whether and how a variable +is linearly related to others. +If a variable is found to be linearly related, then this can be used to +predict future values of that variable. + +In example @ref{regression}, the service department of the company wanted to +be able to predict the time to repair equipment, in order to improve +the accuracy of their quotations. +It was suggested that the time to repair might be related to the time +between failures and the duty cycle of the equipment. +A p-value of 0.1 was chosen for this investigation. +In order to investigate this hypothesis, the @cmd{REGRESSION} command +was used. +This command not only tests if the variables are related, but also +identifies the potential linear relationship. @xref{REGRESSION}. + + +@float Example, regression +@cartouche +@example +@prompt{PSPP>} get file='@value{example-dir}/repairs.sav'. +@prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr. +@prompt{PSPP>} regression /variables = mtbf /dependent = mttr. +@end example +Output: +@example +1.3(1) REGRESSION. Coefficients +#=============================================#====#==========#====#=====# +# # B |Std. Error|Beta| t # +#========#====================================#====#==========#====#=====# +# |(Constant) #9.81| 1.50| .00| 6.54# +# |Mean time between failures (months) #3.10| .10| .99|32.43# +# |Ratio of working to non-working time#1.09| 1.78| .02| .61# +# | # | | | # +#========#====================================#====#==========#====#=====# + +1.3(2) REGRESSION. 
Coefficients +#=============================================#============# +# #Significance# +#========#====================================#============# +# |(Constant) # .10# +# |Mean time between failures (months) # .00# +# |Ratio of working to non-working time# .55# +# | # # +#========#====================================#============# +2.3(1) REGRESSION. Coefficients +#============================================#=====#==========#====#=====# +# # B |Std. Error|Beta| t # +#========#===================================#=====#==========#====#=====# +# |(Constant) #10.50| .96| .00|10.96# +# |Mean time between failures (months)# 3.11| .09| .99|33.39# +# | # | | | # +#========#===================================#=====#==========#====#=====# + +2.3(2) REGRESSION. Coefficients +#============================================#============# +# #Significance# +#========#===================================#============# +# |(Constant) # .06# +# |Mean time between failures (months)# .00# +# | # # +#========#===================================#============# +@end example +@end cartouche +@caption{Linear regression analysis to find a predictor for +@var{mttr}. +The first attempt, including @var{duty_cycle}, produces some +unacceptably high significance values. +However, the second attempt, which excludes @var{duty_cycle}, produces +significance values no higher than 0.06. +This suggests that @var{mtbf} alone may be a suitable predictor +for @var{mttr}.} +@end float + +The coefficients in the first table suggest that the formula +@math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}} +can be used to predict the time to repair. +However, the significance value for the @var{duty_cycle} coefficient +is very high, which would make this an unsafe predictor. +For this reason, the test was repeated, but omitting the +@var{duty_cycle} variable. 
+This time, the significance of all coefficients is no higher than 0.06, +suggesting that at the 0.06 level, the formula +@math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable +predictor of the time to repair. + + +@c LocalWords: PSPP dir itemize noindent var cindex dfn cartouche samp xref +@c LocalWords: pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph +@c LocalWords: Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig +@c LocalWords: vindex Levene Levene's df Diff clicksequence mydata dat ascii +@c LocalWords: mttr outfile diff --git a/examples/automake.mk b/examples/automake.mk index cdbec4bb..b69470a4 100644 --- a/examples/automake.mk +++ b/examples/automake.mk @@ -1,8 +1,14 @@ ## Process this file with automake to produce Makefile.in -*- makefile -*- -EXTRA_DIST += \ + +examplesdir = $(pkgdatadir)/examples + +examples_DATA = \ examples/descript.stat \ + examples/hotel.sav \ + examples/physiology.sav \ + examples/repairs.sav \ examples/regress.stat \ examples/regress_categorical.stat -EXTRA_DIST += examples/OChangeLog +EXTRA_DIST += examples/OChangeLog $(examples_DATA) diff --git a/examples/hotel.sav b/examples/hotel.sav new file mode 100644 index 00000000..49f6318b Binary files /dev/null and b/examples/hotel.sav differ diff --git a/examples/physiology.sav b/examples/physiology.sav new file mode 100644 index 00000000..005a88cd Binary files /dev/null and b/examples/physiology.sav differ diff --git a/examples/repairs.sav b/examples/repairs.sav new file mode 100644 index 00000000..a1991624 Binary files /dev/null and b/examples/repairs.sav differ