This is pspp.info, produced by makeinfo version 4.0 from pspp.texi.

START-INFO-DIR-ENTRY
* PSPP: (pspp).                 Statistical analysis package.
END-INFO-DIR-ENTRY

PSPP, for statistical analysis of sampled data, by Ben Pfaff.

This file documents PSPP, a statistical package for analysis of sampled data that uses a command language compatible with SPSS.

Copyright (C) 1996-9, 2000 Free Software Foundation, Inc.

This version of the PSPP documentation is consistent with version 2 of "texinfo.tex".

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above condition for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.

File: pspp.info, Node: DESCRIPTIVES, Next: FREQUENCIES, Prev: Statistics, Up: Statistics

DESCRIPTIVES
============

     DESCRIPTIVES
             /VARIABLES=var_list
             /MISSING={VARIABLE,LISTWISE} {INCLUDE,NOINCLUDE}
             /FORMAT={LABELS,NOLABELS} {NOINDEX,INDEX} {LINE,SERIAL}
             /SAVE
             /STATISTICS={ALL,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,
                          SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,DEFAULT,
                          SESKEWNESS,SEKURTOSIS}
             /SORT={NONE,MEAN,SEMEAN,STDDEV,VARIANCE,KURTOSIS,SKEWNESS,
                    RANGE,MINIMUM,MAXIMUM,SUM,SESKEWNESS,SEKURTOSIS,NAME}
                   {A,D}

The DESCRIPTIVES procedure reads the active file and outputs descriptive statistics requested by the user. In addition, it can optionally compute Z-scores.

The VARIABLES subcommand, which is required, specifies the list of variables to be analyzed. Keyword VARIABLES is optional.
All other subcommands are optional:

The MISSING subcommand determines the handling of missing values. If INCLUDE is set, then user-missing values are included in the calculations. If NOINCLUDE is set, which is the default, user-missing values are excluded. If VARIABLE is set, then missing values are excluded on a variable by variable basis; if LISTWISE is set, then the entire case is excluded whenever any value in that case has a system-missing or, if INCLUDE is set, user-missing value.

The FORMAT subcommand affects the output format. Currently the LABELS/NOLABELS and NOINDEX/INDEX settings are not used. When SERIAL is set, both valid and missing number of cases are listed in the output; when NOSERIAL is set, only valid cases are listed.

The SAVE subcommand causes DESCRIPTIVES to calculate Z scores for all the specified variables. The Z scores are saved to new variables. Variable names are generated by trying first the original variable name with Z prepended and truncated to a maximum of 8 characters, then the names ZSC000 through ZSC999, STDZ00 through STDZ09, ZZZZ00 through ZZZZ09, ZQZQ00 through ZQZQ09, in that sequence. In addition, Z score variable names can be specified explicitly on VARIABLES in the variable list by enclosing them in parentheses after each variable.

The STATISTICS subcommand specifies the statistics to be displayed:

`ALL'
     All of the statistics below.

`MEAN'
     Arithmetic mean.

`SEMEAN'
     Standard error of the mean.

`STDDEV'
     Standard deviation.

`VARIANCE'
     Variance.

`KURTOSIS'
     Kurtosis and standard error of the kurtosis.

`SKEWNESS'
     Skewness and standard error of the skewness.

`RANGE'
     Range.

`MINIMUM'
     Minimum value.

`MAXIMUM'
     Maximum value.

`SUM'
     Sum.

`DEFAULT'
     Mean, standard deviation of the mean, minimum, maximum.

`SEKURTOSIS'
     Standard error of the kurtosis.

`SESKEWNESS'
     Standard error of the skewness.

The SORT subcommand specifies how the statistics should be sorted. Most of the possible values should be self-explanatory.
NAME causes the statistics to be sorted by name. By default, the statistics are listed in the order that they are specified on the VARIABLES subcommand. The A and D settings request an ascending or descending sort order, respectively.

File: pspp.info, Node: FREQUENCIES, Next: CROSSTABS, Prev: DESCRIPTIVES, Up: Statistics

FREQUENCIES
===========

     FREQUENCIES
             /VARIABLES=var_list
             /FORMAT={TABLE,NOTABLE,LIMIT(limit)}
                     {STANDARD,CONDENSE,ONEPAGE[(onepage_limit)]}
                     {LABELS,NOLABELS}
                     {AVALUE,DVALUE,AFREQ,DFREQ}
                     {SINGLE,DOUBLE}
                     {OLDPAGE,NEWPAGE}
             /MISSING={EXCLUDE,INCLUDE}
             /STATISTICS={DEFAULT,MEAN,SEMEAN,MEDIAN,MODE,STDDEV,VARIANCE,
                          KURTOSIS,SKEWNESS,RANGE,MINIMUM,MAXIMUM,SUM,
                          SESKEWNESS,SEKURTOSIS,ALL,NONE}
             /NTILES=ntiles
             /PERCENTILES=percent...

     (These options are not currently implemented.)
             /BARCHART=...
             /HISTOGRAM=...
             /HBAR=...
             /GROUPED=...

     (Integer mode.)
             /VARIABLES=var_list (low,high)...

FREQUENCIES causes the data to be read and frequency tables to be built and output for specified variables. FREQUENCIES can also calculate and display descriptive statistics (including median and mode) and percentiles.

In the future, FREQUENCIES will also support graphical output in the form of bar charts and histograms. In addition, it will be able to support percentiles for grouped data. (As a historical note, these options were supported in a version of PSPP written years ago, but the code has not survived.)

The VARIABLES subcommand is the only required subcommand. Specify the variables to be analyzed. In most cases, this is all that is required. This is known as "general mode".

Occasionally, one may want to invoke a special mode called "integer mode". Normally, in general mode, PSPP will automatically determine what values occur in the data. In integer mode, the user specifies the range of values that the data assumes. To invoke this mode, specify a range of data values in parentheses, separated by a comma.
Data values inside the range are truncated to the nearest integer, then assigned to that value. If values occur outside this range, they are discarded.

The FORMAT subcommand controls the output format. It has several possible settings:

   * TABLE, the default, causes a frequency table to be output for every variable specified. NOTABLE prevents them from being output. LIMIT with a numeric argument causes them to be output except when there are more than the specified number of values in the table.

   * STANDARD frequency tables contain more complete information, but also take up more space on the printed page. CONDENSE frequency tables are less informative but take up less space. ONEPAGE with a numeric argument will output standard frequency tables if there are the specified number of values or less, condensed tables otherwise. ONEPAGE without an argument defaults to a threshold of 50 values.

   * LABELS causes value labels to be displayed in STANDARD frequency tables. NOLABELS prevents this.

   * Normally frequency tables are sorted in ascending order by value. This is AVALUE. DVALUE tables are sorted in descending order by value. AFREQ and DFREQ tables are sorted in ascending and descending order, respectively, by frequency count.

   * SINGLE spaced frequency tables are closely spaced. DOUBLE spaced frequency tables have wider spacing.

   * OLDPAGE and NEWPAGE are not currently used.

The MISSING subcommand controls the handling of user-missing values. When EXCLUDE, the default, is set, user-missing values are not included in frequency tables or statistics. When INCLUDE is set, user-missing values are included. System-missing values are never included in statistics, but are listed in frequency tables.

The available STATISTICS are the same as available in DESCRIPTIVES (*note DESCRIPTIVES::), with the addition of MEDIAN, the data's median value, and MODE, the mode. (If there are multiple modes, the smallest value is reported.)
By default, the mean, standard deviation of the mean, minimum, and maximum are reported for each variable.

NTILES causes the specified quantiles to be reported. For instance, `/NTILES=4' would cause quartiles to be reported. In addition, particular percentiles can be requested with the PERCENTILES subcommand.

File: pspp.info, Node: CROSSTABS, Prev: FREQUENCIES, Up: Statistics

CROSSTABS
=========

     CROSSTABS
             /TABLES=var_list BY var_list [BY var_list]...
             /MISSING={TABLE,INCLUDE,REPORT}
             /WRITE={NONE,CELLS,ALL}
             /FORMAT={TABLES,NOTABLES}
                     {LABELS,NOLABELS,NOVALLABS}
                     {PIVOT,NOPIVOT}
                     {AVALUE,DVALUE}
                     {NOINDEX,INDEX}
                     {BOX,NOBOX}
             /CELLS={COUNT,ROW,COLUMN,TOTAL,EXPECTED,RESIDUAL,SRESIDUAL,
                     ASRESIDUAL,ALL,NONE}
             /STATISTICS={CHISQ,PHI,CC,LAMBDA,UC,BTAU,CTAU,RISK,GAMMA,D,
                          KAPPA,ETA,CORR,ALL,NONE}

     (Integer mode.)
             /VARIABLES=var_list (low,high)...

CROSSTABS reads the active file and builds and displays crosstabulation tables requested by the user. It can calculate several statistics for each cell in the crosstabulation tables. In addition, a number of statistics can be calculated for each table itself.

The TABLES subcommand is used to specify the tables to be reported. Any number of dimensions is permitted, and any number of variables per dimension is allowed. The TABLES subcommand may be repeated as many times as needed. This is the only required subcommand in "general mode".

Occasionally, one may want to invoke a special mode called "integer mode". Normally, in general mode, PSPP will automatically determine what values occur in the data. In integer mode, the user specifies the range of values that the data assumes. To invoke this mode, specify the VARIABLES subcommand, giving a range of data values in parentheses for each variable to be used on the TABLES subcommand. Data values inside the range are truncated to the nearest integer, then assigned to that value. If values occur outside this range, they are discarded.
When it is present, the VARIABLES subcommand must precede the TABLES subcommand.

The MISSING subcommand determines the handling of user-missing values. When set to TABLE, the default, missing values are dropped on a table by table basis. When set to INCLUDE, user-missing values are included in tables and statistics. When set to REPORT, which is allowed only in integer mode, user-missing values are included in tables but marked with an `M' (for "missing") and excluded from statistical calculations.

Currently the WRITE subcommand is not used.

The FORMAT subcommand controls the characteristics of the crosstabulation tables to be displayed. It has a number of possible settings:

   * TABLES, the default, causes crosstabulation tables to be output. NOTABLES suppresses them.

   * LABELS, the default, allows variable labels and value labels to appear in the output. NOLABELS suppresses them. NOVALLABS displays variable labels but suppresses value labels.

   * PIVOT, the default, causes each TABLES subcommand to be displayed in a pivot table format. NOPIVOT causes the old-style crosstabulation format to be used.

   * AVALUE, the default, causes values to be sorted in ascending order. DVALUE asserts a descending sort order.

   * INDEX/NOINDEX is currently ignored.

   * BOX/NOBOX is currently ignored.

The CELLS subcommand controls the contents of each cell in the displayed crosstabulation table. The possible settings are:

COUNT
     Frequency count.

ROW
     Row percent.

COLUMN
     Column percent.

TOTAL
     Table percent.

EXPECTED
     Expected value.

RESIDUAL
     Residual.

SRESIDUAL
     Standardized residual.

ASRESIDUAL
     Adjusted standardized residual.

ALL
     All of the above.

NONE
     Suppress cells entirely.

`/CELLS' without any settings specified requests COUNT, ROW, COLUMN, and TOTAL. If CELLS is not specified at all then only COUNT will be selected.

The STATISTICS subcommand selects statistics for computation:

CHISQ
     Pearson chi-square, likelihood ratio, Fisher's exact test, continuity correction, linear-by-linear association.

PHI
     Phi.

CC
     Contingency coefficient.

LAMBDA
     Lambda.

UC
     Uncertainty coefficient.

BTAU
     Tau-b.

CTAU
     Tau-c.

RISK
     Risk estimate.

GAMMA
     Gamma.

D
     Somers' D.

KAPPA
     Cohen's Kappa.

ETA
     Eta.

CORR
     Spearman correlation, Pearson's r.

ALL
     All of the above.

NONE
     No statistics.

Selected statistics are only calculated when appropriate for the statistic. Certain statistics require tables of a particular size, and some statistics are calculated only in integer mode.

`/STATISTICS' without any settings selects CHISQ. If the STATISTICS subcommand is not given, no statistics are calculated.

*Please note:* Currently the implementation of CROSSTABS has the following bugs:

   * Pearson's R (but not Spearman!) is off a little.

   * T values for Spearman's R and Pearson's R are wrong.

   * How to calculate significance of symmetric and directional measures?

   * Asymmetric ASEs and T values for lambda are wrong.

   * ASE of Goodman and Kruskal's tau is not calculated.

   * ASE of symmetric Somers' d is wrong.

   * Approx. T of uncertainty coefficient is wrong.

Fixes for any of these deficiencies would be welcomed.

File: pspp.info, Node: Utilities, Next: Not Implemented, Prev: Statistics, Up: Top

Utilities
*********

Commands that don't fit any other category are placed here.

Most of these commands are not affected by commands like IF and LOOP: they take effect only once, unconditionally, at the time that they are encountered in the input.

* Menu:

* COMMENT::                     Document your syntax file.
* DOCUMENT::                    Document the active file.
* DISPLAY DOCUMENTS::           Display active file documents.
* DISPLAY FILE LABEL::          Display the active file label.
* DROP DOCUMENTS::              Remove documents from the active file.
* EXECUTE::                     Execute pending transformations.
* FILE LABEL::                  Set the active file's label.
* INCLUDE::                     Include a file within the current one.
* QUIT::                        Terminate the PSPP session.
* SET::                         Adjust PSPP runtime parameters.
* SUBTITLE::                    Provide a document subtitle.
* SYSFILE INFO::                Display the dictionary in a system file.
* TITLE::                       Provide a document title.

File: pspp.info, Node: COMMENT, Next: DOCUMENT, Prev: Utilities, Up: Utilities

COMMENT
=======

     Two possible syntaxes:
             COMMENT comment text ... .
             *comment text ... .

The COMMENT command is ignored. It is used to provide information to the author and other readers of the PSPP syntax file.

A COMMENT command can extend over any number of lines. Don't forget to terminate it with a dot or a blank line!

File: pspp.info, Node: DOCUMENT, Next: DISPLAY DOCUMENTS, Prev: COMMENT, Up: Utilities

DOCUMENT
========

     DOCUMENT documentary_text.

The DOCUMENT command adds one or more lines of descriptive commentary to the active file. Documents added in this way are saved to system files. They can be viewed using SYSFILE INFO or DISPLAY DOCUMENTS. They can be removed from the active file with DROP DOCUMENTS.

Specify the documentary text following the DOCUMENT keyword. You can extend the documentary text over as many lines as necessary. Lines are truncated at 80 characters width. Don't forget to terminate the DOCUMENT command with a dot or a blank line.

File: pspp.info, Node: DISPLAY DOCUMENTS, Next: DISPLAY FILE LABEL, Prev: DOCUMENT, Up: Utilities

DISPLAY DOCUMENTS
=================

     DISPLAY DOCUMENTS.

DISPLAY DOCUMENTS displays the documents in the active file. Each document is preceded by a line giving the time and date that it was added. *Note DOCUMENT::.

File: pspp.info, Node: DISPLAY FILE LABEL, Next: DROP DOCUMENTS, Prev: DISPLAY DOCUMENTS, Up: Utilities

DISPLAY FILE LABEL
==================

     DISPLAY FILE LABEL.

DISPLAY FILE LABEL displays the file label contained in the active file, if any. *Note FILE LABEL::.

File: pspp.info, Node: DROP DOCUMENTS, Next: EXECUTE, Prev: DISPLAY FILE LABEL, Up: Utilities

DROP DOCUMENTS
==============

     DROP DOCUMENTS.

The DROP DOCUMENTS command removes all documents from the active file. New documents can be added with the DOCUMENT utility (*note DOCUMENT::).

DROP DOCUMENTS only changes the active file.
It does not modify any system files stored on disk.

File: pspp.info, Node: EXECUTE, Next: FILE LABEL, Prev: DROP DOCUMENTS, Up: Utilities

EXECUTE
=======

     EXECUTE.

The EXECUTE utility causes the active file to be read and all pending transformations to be executed.

File: pspp.info, Node: FILE LABEL, Next: INCLUDE, Prev: EXECUTE, Up: Utilities

FILE LABEL
==========

     FILE LABEL file_label.

Use the FILE LABEL command to provide a title for the active file. This title will be saved into system files and portable files that are created during this PSPP run.

It is not necessary to include quotes around file_label. If they are included then they become part of the file label.

File: pspp.info, Node: INCLUDE, Next: QUIT, Prev: FILE LABEL, Up: Utilities

INCLUDE
=======

     Two possible syntaxes:
             INCLUDE 'filename'.
             @filename.

The INCLUDE command causes the PSPP command processor to read an additional command file as if it were included bodily in the current command file.

INCLUDE files may be nested to any depth, up to the limit of available memory.

File: pspp.info, Node: QUIT, Next: SET, Prev: INCLUDE, Up: Utilities

QUIT
====

     Two possible syntaxes:
             QUIT.
             EXIT.

The QUIT command terminates the current PSPP session and returns control to the operating system.

This command is not valid within a command file.

File: pspp.info, Node: SET, Next: SUBTITLE, Prev: QUIT, Up: Utilities

SET
===

     SET

     (data input)
             /BLANKS={SYSMIS,'.',number}
             /DECIMAL={DOT,COMMA}
             /FORMAT=fmt_spec

     (program input)
             /ENDCMD='.'
             /NULLINE={ON,OFF}

     (interaction)
             /CPROMPT='cprompt_string'
             /DPROMPT='dprompt_string'
             /ERRORBREAK={OFF,ON}
             /MXERRS=max_errs
             /MXWARNS=max_warnings
             /PROMPT='prompt'
             /VIEWLENGTH={MINIMUM,MEDIAN,MAXIMUM,n_lines}
             /VIEWWIDTH=n_characters

     (program execution)
             /MEXPAND={ON,OFF}
             /MITERATE=max_iterations
             /MNEST=max_nest
             /MPRINT={ON,OFF}
             /MXLOOPS=max_loops
             /SEED={RANDOM,seed_value}
             /UNDEFINED={WARN,NOWARN}

     (data output)
             /CC{A,B,C,D,E}={'npre,pre,suf,nsuf','npre.pre.suf.nsuf'}
             /DECIMAL={DOT,COMMA}
             /FORMAT=fmt_spec

     (output routing)
             /ECHO={ON,OFF}
             /ERRORS={ON,OFF,TERMINAL,LISTING,BOTH,NONE}
             /INCLUDE={ON,OFF}
             /MESSAGES={ON,OFF,TERMINAL,LISTING,BOTH,NONE}
             /PRINTBACK={ON,OFF}
             /RESULTS={ON,OFF,TERMINAL,LISTING,BOTH,NONE}

     (output activation)
             /LISTING={ON,OFF}
             /PRINTER={ON,OFF}
             /SCREEN={ON,OFF}

     (output driver options)
             /HEADERS={NO,YES,BLANK}
             /LENGTH={NONE,length_in_lines}
             /LISTING=filename
             /MORE={ON,OFF}
             /PAGER={OFF,"pager_name"}
             /WIDTH={NARROW,WIDTH,n_characters}

     (logging)
             /JOURNAL={ON,OFF} [filename]
             /LOG={ON,OFF} [filename]

     (system files)
             /COMPRESSION={ON,OFF}
             /SCOMPRESSION={ON,OFF}

     (security)
             /SAFER=ON

     (obsolete settings accepted for compatibility, but ignored)
             /AUTOMENU={ON,OFF}
             /BEEP={ON,OFF}
             /BLOCK='c'
             /BOXSTRING={'xxx','xxxxxxxxxxx'}
             /CASE={UPPER,UPLOW}
             /COLOR=...
             /CPI=cpi_value
             /DISK={ON,OFF}
             /EJECT={ON,OFF}
             /HELPWINDOWS={ON,OFF}
             /HIGHRES={ON,OFF}
             /HISTOGRAM='c'
             /LOWRES={AUTO,ON,OFF}
             /LPI=lpi_value
             /MENUS={STANDARD,EXTENDED}
             /MXMEMORY=max_memory
             /PTRANSLATE={ON,OFF}
             /RCOLORS=...
             /RUNREVIEW={AUTO,MANUAL}
             /SCRIPTTAB='c'
             /TB1={'xxx','xxxxxxxxxxx'}
             /TBFONTS='string'
             /WORKDEV=drive_letter
             /WORKSPACE=workspace_size
             /XSORT={YES,NO}

The SET command allows the user to adjust several parameters relating to PSPP's execution. Since there are many subcommands to this command, its subcommands will be examined in groups.

As a general comment, ON and YES are considered synonymous, and so are OFF and NO, when used as subcommand values.

The data input subcommands affect the way that data is read from data files. The data input subcommands are

BLANKS
     This is the value assigned to a data item that is empty or contains only whitespace. An argument of SYSMIS or '.' will cause the system-missing value to be assigned to null items. This is the default. Any real value may be assigned.

DECIMAL
     The default DOT setting causes the decimal point character to be `.'. A setting of COMMA causes the decimal point character to be `,'.

FORMAT
     Allows the default numeric input/output format to be specified. The default is F8.2. *Note Input/Output Formats::.

Program input subcommands affect the way that programs are parsed when they are typed interactively or run from a script. They are

ENDCMD
     This is a single character indicating the end of a command. The default is `.'. Don't change this.

NULLINE
     Whether a blank line is interpreted as ending the current command. The default is ON.

Interaction subcommands affect the way that PSPP interacts with an online user. The interaction subcommands are

CPROMPT
     The command continuation prompt. The default is ` > '.

DPROMPT
     Prompt used when expecting data input within BEGIN DATA (*note BEGIN DATA::). The default is `data> '.

ERRORBREAK
     Whether an error causes PSPP to stop processing the current command file after finishing the current command. The default is OFF.

MXERRS
     The maximum number of errors before PSPP halts processing of the current command file. The default is 50.

MXWARNS
     The maximum number of warnings + errors before PSPP halts processing the current command file. The default is 100.

PROMPT
     The command prompt. The default is `PSPP> '.

VIEWLENGTH
     The length of the screen in lines. MINIMUM means 25 lines, MEDIAN and MAXIMUM mean 43 lines. Otherwise specify the number of lines. Normally PSPP should auto-detect your screen size so this shouldn't have to be used.

VIEWWIDTH
     The width of the screen in characters. Normally 80 or 132.

Program execution subcommands control the way that PSPP commands execute. The program execution subcommands are

MEXPAND
MITERATE
MNEST
MPRINT
     Currently not used.

MXLOOPS
     The maximum number of iterations for an uncontrolled loop.

SEED
     The initial pseudo-random number seed. Set to a real number or to RANDOM, which will obtain an initial seed from the current time of day.

UNDEFINED
     Currently not used.

Data output subcommands affect the format of output data. These subcommands are

CCA
CCB
CCC
CCD
CCE
     Set up custom currency formats. The argument is a string which must contain exactly three commas or exactly three periods. If commas, then the grouping character for the currency format is `,', and the decimal point character is `.'; if periods, then the situation is reversed.

     The commas or periods divide the string into four fields, which are, in order, the negative prefix, prefix, suffix, and negative suffix. When a value is formatted using the custom currency format, the prefix precedes the value formatted and the suffix follows it. In addition, if the value is negative, the negative prefix precedes the prefix and the negative suffix follows the suffix.

DECIMAL
     The default DOT setting causes the decimal point character to be `.'. A setting of COMMA causes the decimal point character to be `,'.

FORMAT
     Allows the default numeric input/output format to be specified. The default is F8.2. *Note Input/Output Formats::.

Output routing subcommands affect where the output of transformations and procedures is sent. These subcommands are

ECHO
     If turned on, commands are written to the listing file as they are read from command files. The default is OFF.

ERRORS
INCLUDE
MESSAGES
PRINTBACK
RESULTS
     Currently not used.

Output activation subcommands affect whether output devices of particular types are enabled. These subcommands are

LISTING
     Enable or disable listing devices.

PRINTER
     Enable or disable printer devices.

SCREEN
     Enable or disable screen devices.
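
The field layout of the CCA through CCE argument described above can be illustrated with a short C sketch. This is not PSPP's own parser: the names `cc_format', `cc_parse', `cc_decorate', and the `CC_FIELD_MAX' limit are hypothetical, and the sketch assumes that a string with exactly three commas is treated as comma-separated before periods are considered.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CC_FIELD_MAX 16   /* hypothetical per-field limit for this sketch */

struct cc_format {
    char neg_prefix[CC_FIELD_MAX + 1];
    char prefix[CC_FIELD_MAX + 1];
    char suffix[CC_FIELD_MAX + 1];
    char neg_suffix[CC_FIELD_MAX + 1];
    char grouping;              /* ',' or '.' */
    char decimal;               /* whichever of the two is not grouping */
};

static int count_char(const char *s, char c)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}

/* Splits SPEC into its four fields.  Returns 1 on success, 0 if SPEC
   has neither exactly three commas nor exactly three periods, or if
   a field exceeds CC_FIELD_MAX characters. */
int cc_parse(const char *spec, struct cc_format *cc)
{
    char *fields[4];
    char sep;
    int i;

    assert(spec != NULL && cc != NULL);
    if (count_char(spec, ',') == 3)
        sep = ',';
    else if (count_char(spec, '.') == 3)
        sep = '.';
    else
        return 0;

    fields[0] = cc->neg_prefix;
    fields[1] = cc->prefix;
    fields[2] = cc->suffix;
    fields[3] = cc->neg_suffix;
    for (i = 0; i < 4; i++) {
        const char *end = strchr(spec, sep);
        size_t len = end != NULL ? (size_t) (end - spec) : strlen(spec);
        if (len > CC_FIELD_MAX)
            return 0;
        memcpy(fields[i], spec, len);
        fields[i][len] = '\0';
        spec = end != NULL ? end + 1 : spec + len;
    }
    cc->grouping = sep;
    cc->decimal = sep == ',' ? '.' : ',';
    return 1;
}

/* Wraps the already-formatted magnitude DIGITS in the prefixes and
   suffixes; the negative fields are added only if NEGATIVE is nonzero. */
void cc_decorate(const struct cc_format *cc, const char *digits,
                 int negative, char *out, size_t out_size)
{
    snprintf(out, out_size, "%s%s%s%s%s",
             negative ? cc->neg_prefix : "", cc->prefix, digits,
             cc->suffix, negative ? cc->neg_suffix : "");
}
```

Under these assumptions, the string `(,$,,)' yields a negative prefix of `(', a prefix of `$', an empty suffix, and a negative suffix of `)', so a negative magnitude is wrapped in parentheses after the dollar sign is attached.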

Output driver option subcommands affect output drivers' settings. These subcommands are

HEADERS
LENGTH
LISTING
MORE
PAGER
WIDTH
     Currently not used.

Logging subcommands affect logging of commands executed to external files. These subcommands are

JOURNAL
LOG
     Not currently used.

System file subcommands affect the default format of system files produced by PSPP. These subcommands are

COMPRESSION
     Not currently used.

SCOMPRESSION
     Whether system files created by SAVE or XSAVE are compressed by default. The default is ON.

Security subcommands affect the operations that commands are allowed to perform. The security subcommands are

SAFER
     When set, this setting cannot ever be reset, for obvious security reasons. Setting this option disables the following operations:

        * The ERASE command.

        * The HOST command.

        * Pipe filenames (filenames beginning or ending with `|').

     Be aware that this setting does not guarantee safety (commands can still overwrite files, for instance) but it is an improvement.

File: pspp.info, Node: SUBTITLE, Next: TITLE, Prev: SET, Up: Utilities

SUBTITLE
========

     Two possible syntaxes:
             SUBTITLE 'subtitle_string'.
             SUBTITLE subtitle_string.

The SUBTITLE command is used to provide a subtitle to a particular PSPP run. This subtitle appears at the top of each output page below the title, if titles are enabled on the output device.

Specify a subtitle as a string in quotes. The alternate syntax that did not require quotes is now obsolete. If it is used then the subtitle is converted to all uppercase.

File: pspp.info, Node: TITLE, Prev: SUBTITLE, Up: Utilities

TITLE
=====

     Two possible syntaxes:
             TITLE 'title_string'.
             TITLE title_string.

The TITLE command is used to provide a title to a particular PSPP run. This title appears at the top of each output page, if titles are enabled on the output device.

Specify a title as a string in quotes. The alternate syntax that did not require quotes is now obsolete. If it is used then the title is converted to all uppercase.

File: pspp.info, Node: Not Implemented, Next: Data File Format, Prev: Utilities, Up: Top

Not Implemented
***************

This chapter lists parts of the PSPP language that are not yet implemented.

The following transformations and utilities are not yet implemented, but they will be supported in a later release.

   * ADD FILES
   * DEFINE
   * FILE TYPE
   * GET SAS
   * GET TRANSLATE
   * MCONVERT
   * PRESERVE
   * PROCEDURE OUTPUT
   * RESTORE
   * SAVE TRANSLATE
   * SHOW
   * UPDATE

The following transformations and utilities are not implemented. There are no plans to support them in future releases. Contributions to implement them will still be accepted.

   * EDIT
   * GET DATABASE
   * GET OSIRIS
   * GET SCSS
   * GSET
   * HELP
   * INFO
   * INPUT MATRIX
   * KEYED DATA LIST
   * NUMBERED and UNNUMBERED
   * OPTIONS
   * REVIEW
   * SAVE SCSS
   * SPSS MANAGER
   * STATISTICS

File: pspp.info, Node: Data File Format, Next: Portable File Format, Prev: Not Implemented, Up: Top

Data File Format
****************

PSPP necessarily uses the same format for system files as do the products with which it is compatible. This chapter is a description of that format.

There are three data types used in system files: 32-bit integers, 64-bit floating points, and 1-byte characters. In this document these will simply be referred to as `int32', `flt64', and `char', the names that are used in the PSPP source code. Every field of type `int32' or `flt64' is aligned on a 32-bit boundary.

The endianness of data in PSPP system files is not specified. System files output on a computer of a particular endianness will have the endianness of that computer. However, PSPP can read files of either endianness, regardless of its host computer's endianness. PSPP translates endianness for both integer and floating point numbers.

Floating point formats are also not specified. PSPP does not translate between floating point formats. This is unlikely to be a problem as all modern computer architectures use IEEE 754 format for floating point representation.
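
As a rough illustration of the endianness translation mentioned above, the following C sketch reads an `int32' or `flt64' field from a raw byte buffer and reverses its bytes on request. The function names `get_int32' and `get_flt64' are hypothetical, not taken from the PSPP source, and the sketch assumes that the host `double' is a 64-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverses the N bytes starting at P in place. */
static void reverse_bytes(unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = tmp;
    }
}

/* Reads a 32-bit integer field from SRC.  If SWAP is nonzero the
   file's endianness differs from the host's, so the bytes are
   reversed before they are interpreted. */
int32_t get_int32(const unsigned char *src, int swap)
{
    unsigned char b[sizeof(int32_t)];
    int32_t value;

    memcpy(b, src, sizeof b);
    if (swap)
        reverse_bytes(b, sizeof b);
    memcpy(&value, b, sizeof value);
    return value;
}

/* Same idea for a 64-bit floating point field.  Only the byte order
   is translated, not the floating point format itself. */
double get_flt64(const unsigned char *src, int swap)
{
    unsigned char b[sizeof(double)];
    double value;

    assert(sizeof(double) == 8);   /* assumption of this sketch */
    memcpy(b, src, sizeof b);
    if (swap)
        reverse_bytes(b, sizeof b);
    memcpy(&value, b, sizeof value);
    return value;
}
```

Working on a byte copy and converting with memcpy() avoids unaligned access and keeps the byte reversal away from floating point registers until the value is in host order.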

The PSPP system-missing value is represented by the largest possible negative number in the floating point format; in C, this is most likely `-DBL_MAX'. There are two other important values used in missing values: `HIGHEST' and `LOWEST'. These are represented by the largest possible positive number (probably `DBL_MAX') and the second-largest negative number. The latter must be determined in a system-dependent manner; in IEEE 754 format it is represented by value `0xffeffffffffffffe'.

System files are divided into records. Each record begins with an `int32' giving a numeric record type. Individual record types are described below:

* Menu:

* File Header Record::
* Variable Record::
* Value Label Record::
* Value Label Variable Record::
* Document Record::
* Machine int32 Info Record::
* Machine flt64 Info Record::
* Miscellaneous Informational Records::
* Dictionary Termination Record::
* Data Record::

File: pspp.info, Node: File Header Record, Next: Variable Record, Prev: Data File Format, Up: Data File Format

File Header Record
==================

The file header is always the first record in the file.

     struct sysfile_header
       {
         char            rec_type[4];
         char            prod_name[60];
         int32           layout_code;
         int32           case_size;
         int32           compressed;
         int32           weight_index;
         int32           ncases;
         flt64           bias;
         char            creation_date[9];
         char            creation_time[8];
         char            file_label[64];
         char            padding[3];
       };

`char rec_type[4];'
     Record type code. Always set to `$FL2'. This is the only record for which the record type is not of type `int32'.

`char prod_name[60];'
     Product identification string. This always begins with the characters `@(#) SPSS DATA FILE'. PSPP uses the remaining characters to give its version and the operating system name; for example, `GNU pspp 0.1.4 - sparc-sun-solaris2.5.2'. The string is truncated if it would be longer than 60 characters; otherwise it is padded on the right with spaces.

`int32 layout_code;'
     Always set to 2. PSPP reads this value in order to determine the file's endianness.

`int32 case_size;'
     Number of data elements per case. This is the number of variables, except that long string variables add extra data elements (one for every 8 characters after the first 8).

`int32 compressed;'
     Set to 1 if the data in the file is compressed, 0 otherwise.

`int32 weight_index;'
     If one of the variables in the data set is used as a weighting variable, set to the index of that variable. Otherwise, set to 0.

`int32 ncases;'
     Set to the number of cases in the file if it is known, or -1 otherwise.

     In the general case it is not possible to determine the number of cases that will be output to a system file at the time that the header is written. The way that this is dealt with is by writing the entire system file, including the header, then seeking back to the beginning of the file and writing just the `ncases' field. For `files' in which this is not valid, the seek operation fails. In this case, `ncases' remains -1.

`flt64 bias;'
     Compression bias. Always set to 100. The significance of this value is that only numbers between `(1 - bias)' and `(251 - bias)' can be compressed.

`char creation_date[9];'
     Set to the date of creation of the system file, in `dd mmm yy' format, with the month as standard English abbreviations, using an initial capital letter and following with lowercase. If the date is not available then this field is arbitrarily set to `01 Jan 70'.

`char creation_time[8];'
     Set to the time of creation of the system file, in `hh:mm:ss' format and using 24-hour time. If the time is not available then this field is arbitrarily set to `00:00:00'.

`char file_label[64];'
     Set to the file label declared by the user, if any. Padded on the right with spaces.

`char padding[3];'
     Ignored padding bytes to make the structure a multiple of 32 bits in length. Set to zeros.
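
The header layout above can be decoded along the following lines. This is a minimal C sketch rather than PSPP's own reader: `header_info', `parse_header', and the helper names are hypothetical, and the byte offsets are simply the running sums of the field sizes in `struct sysfile_header' (4 + 60 = 64 for `layout_code', and so on up to a total of 176 bytes). The sketch also assumes that the host `double' is a 64-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SYSFILE_HEADER_SIZE = 176 };    /* sum of the field sizes above */

struct header_info {
    int swapped;                /* 1 if file and host endianness differ */
    int32_t case_size, compressed, weight_index, ncases;
    double bias;
    char file_label[65];        /* trailing spaces removed */
};

static void reverse_bytes(unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = tmp;
    }
}

static int32_t load_int32(const unsigned char *src, int swap)
{
    unsigned char b[4];
    int32_t v;
    memcpy(b, src, 4);
    if (swap)
        reverse_bytes(b, 4);
    memcpy(&v, b, 4);
    return v;
}

static double load_flt64(const unsigned char *src, int swap)
{
    unsigned char b[8];
    double v;
    memcpy(b, src, 8);
    if (swap)
        reverse_bytes(b, 8);
    memcpy(&v, b, 8);
    return v;
}

/* Decodes the header in BUF.  Returns 1 on success, 0 if the record
   type is not `$FL2' or layout_code is 2 in neither byte order. */
int parse_header(const unsigned char buf[SYSFILE_HEADER_SIZE],
                 struct header_info *h)
{
    size_t len = 64;

    assert(buf != NULL && h != NULL);
    if (memcmp(buf, "$FL2", 4) != 0)
        return 0;

    /* layout_code is always 2, so reading it tells us the byte order. */
    if (load_int32(buf + 64, 0) == 2)
        h->swapped = 0;
    else if (load_int32(buf + 64, 1) == 2)
        h->swapped = 1;
    else
        return 0;

    h->case_size = load_int32(buf + 68, h->swapped);
    h->compressed = load_int32(buf + 72, h->swapped);
    h->weight_index = load_int32(buf + 76, h->swapped);
    h->ncases = load_int32(buf + 80, h->swapped);
    h->bias = load_flt64(buf + 84, h->swapped);

    /* file_label follows creation_date[9] and creation_time[8]. */
    memcpy(h->file_label, buf + 109, 64);
    while (len > 0 && h->file_label[len - 1] == ' ')
        len--;
    h->file_label[len] = '\0';
    return 1;
}
```

With the usual bias of 100, the compressible range `(1 - bias)' to `(251 - bias)' works out to -99 through 151. A reader built this way would still need to treat an `ncases' of -1 as "unknown" rather than as a count.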

File: pspp.info, Node: Variable Record, Next: Value Label Record, Prev: File Header Record, Up: Data File Format

Variable Record
===============

Immediately following the header must come the variable records. There must be one variable record for every variable and every 8 characters in a long string beyond the first 8; i.e., there must be exactly as many variable records as the value specified for `case_size' in the file header record.

     struct sysfile_variable
       {
         int32               rec_type;
         int32               type;
         int32               has_var_label;
         int32               n_missing_values;
         int32               print;
         int32               write;
         char                name[8];

         /* The following two fields are present
            only if has_var_label is 1. */
         int32               label_len;
         char                label[/* variable length */];

         /* The following field is present only
            if n_missing_values is not 0. */
         flt64               missing_values[/* variable length */];
       };

`int32 rec_type;'
     Record type code. Always set to 2.

`int32 type;'
     Variable type code. Set to 0 for a numeric variable. For a short string variable or the first part of a long string variable, this is set to the width of the string. For the second and subsequent parts of a long string variable, set to -1, and the remaining fields in the structure are ignored.

`int32 has_var_label;'
     If this variable has a variable label, set to 1; otherwise, set to 0.

`int32 n_missing_values;'
     If the variable has no missing values, set to 0. If the variable has one, two, or three discrete missing values, set to 1, 2, or 3, respectively. If the variable has a range for missing values, set to -2; if the variable has a range for missing values plus a single discrete value, set to -3.

`int32 print;'
     Print format for this variable. See below.

`int32 write;'
     Write format for this variable. See below.

`char name[8];'
     Variable name. The variable name must begin with a capital letter or the at-sign (`@'). Subsequent characters may also be octothorpes (`#'), dollar signs (`$'), underscores (`_'), or full stops (`.'). The variable name is padded on the right with spaces.
`int32 label_len;' This field is present only if `has_var_label' is set to 1. It is set to the length, in characters, of the variable label, which must be a number between 0 and 120. `char label[/* variable length */];' This field is present only if `has_var_label' is set to 1. It has length `label_len', rounded up to the nearest multiple of 32 bits. The first `label_len' characters are the variable's variable label. `flt64 missing_values[/* variable length */];' This field is present only if `n_missing_values' is not 0. It has the same number of elements as the absolute value of `n_missing_values'. For discrete missing values, each element represents one missing value. When a range is present, the first element denotes the minimum value in the range, and the second element denotes the maximum value in the range. When a range plus a value are present, the third element denotes the additional discrete missing value. HIGHEST and LOWEST are indicated as described in the chapter introduction. The `print' and `write' members of sysfile_variable are output formats coded into `int32' types. The LSB (least-significant byte) of the `int32' represents the number of decimal places, and the next two bytes in order of increasing significance represent field width and format type, respectively. The MSB (most-significant byte) is not used and should be set to zero. Format types are defined as follows: 0 Not used. 1 `A' 2 `AHEX' 3 `COMMA' 4 `DOLLAR' 5 `F' 6 `IB' 7 `PIBHEX' 8 `P' 9 `PIB' 10 `PK' 11 `RB' 12 `RBHEX' 13 Not used. 14 Not used. 15 `Z' 16 `N' 17 `E' 18 Not used. 19 Not used. 
20 `DATE' 21 `TIME' 22 `DATETIME' 23 `ADATE' 24 `JDATE' 25 `DTIME' 26 `WKDAY' 27 `MONTH' 28 `MOYR' 29 `QYR' 30 `WKYR' 31 `PCT' 32 `DOT' 33 `CCA' 34 `CCB' 35 `CCC' 36 `CCD' 37 `CCE' 38 `EDATE' 39 `SDATE'  File: pspp.info, Node: Value Label Record, Next: Value Label Variable Record, Prev: Variable Record, Up: Data File Format Value Label Record ================== Value label records must follow the variable records and must precede the dictionary termination record. Other than this, they may appear anywhere in the system file. Every value label record must be immediately followed by a value label variable record, described below. Value label records begin with `rec_type', an `int32' value set to the record type of 3. This is followed by `count', an `int32' value set to the number of value labels present in this record. These two fields are followed by a series of `count' tuples. Each tuple is divided into two fields, the value and the label. The first of these, the value, is composed of a 64-bit value, which is either a `flt64' value or up to 8 characters (padded on the right to 8 bytes) denoting a short string value. Whether the value is a `flt64' or a character string is not defined inside the value label record. The second field in the tuple, the label, has variable length. The first `char' is a count of the number of characters in the value label. The remainder of the field is the label itself. The field is padded on the right to a multiple of 64 bits in length.  File: pspp.info, Node: Value Label Variable Record, Next: Document Record, Prev: Value Label Record, Up: Data File Format Value Label Variable Record =========================== Every value label variable record must be immediately preceded by a value label record, described above. struct sysfile_value_label_variable { int32 rec_type; int32 count; int32 vars[/* variable length */]; }; `int32 rec_type;' Record type. Always set to 4. 
`int32 count;' Number of variables to which the value labels from the preceding value label record are to be applied. `int32 vars[/* variable length */];' A list of variables to which to apply the value labels. There are `count' elements.  File: pspp.info, Node: Document Record, Next: Machine int32 Info Record, Prev: Value Label Variable Record, Up: Data File Format Document Record =============== There must be no more than one document record per system file. Document records must follow the variable records and precede the dictionary termination record. struct sysfile_document { int32 rec_type; int32 n_lines; char lines[/* variable length */][80]; }; `int32 rec_type;' Record type. Always set to 6. `int32 n_lines;' Number of lines of documents present. `char lines[/* variable length */][80];' Document lines. The number of elements is defined by `n_lines'. Lines shorter than 80 characters are padded on the right with spaces.  File: pspp.info, Node: Machine int32 Info Record, Next: Machine flt64 Info Record, Prev: Document Record, Up: Data File Format Machine `int32' Info Record =========================== There must be no more than one machine `int32' info record per system file. Machine `int32' info records must follow the variable records and precede the dictionary termination record. struct sysfile_machine_int32_info { /* Header. */ int32 rec_type; int32 subtype; int32 size; int32 count; /* Data. */ int32 version_major; int32 version_minor; int32 version_revision; int32 machine_code; int32 floating_point_rep; int32 compression_code; int32 endianness; int32 character_code; }; `int32 rec_type;' Record type. Always set to 7. `int32 subtype;' Record subtype. Always set to 3. `int32 size;' Size of each piece of data in the data part, in bytes. Always set to 4. `int32 count;' Number of pieces of data in the data part. Always set to 8. `int32 version_major;' PSPP major version number. In version X.Y.Z, this is X. `int32 version_minor;' PSPP minor version number. 
In version X.Y.Z, this is Y. `int32 version_revision;' PSPP version revision number. In version X.Y.Z, this is Z. `int32 machine_code;' Machine code. PSPP always sets this field to -1, but other values may appear. `int32 floating_point_rep;' Floating point representation code. For IEEE 754 systems this is 1. IBM 370 sets this to 2, and DEC VAX E to 3. `int32 compression_code;' Compression code. Always set to 1. `int32 endianness;' Machine endianness. 1 indicates big-endian, 2 indicates little-endian. `int32 character_code;' Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3 indicates 8-bit ASCII, 4 indicates DEC Kanji.  File: pspp.info, Node: Machine flt64 Info Record, Next: Miscellaneous Informational Records, Prev: Machine int32 Info Record, Up: Data File Format Machine `flt64' Info Record =========================== There must be no more than one machine `flt64' info record per system file. Machine `flt64' info records must follow the variable records and precede the dictionary termination record. struct sysfile_machine_flt64_info { /* Header. */ int32 rec_type; int32 subtype; int32 size; int32 count; /* Data. */ flt64 sysmis; flt64 highest; flt64 lowest; }; `int32 rec_type;' Record type. Always set to 7. `int32 subtype;' Record subtype. Always set to 4. `int32 size;' Size of each piece of data in the data part, in bytes. Always set to 8. `int32 count;' Number of pieces of data in the data part. Always set to 3. `flt64 sysmis;' The system missing value. `flt64 highest;' The value used for HIGHEST in missing values. `flt64 lowest;' The value used for LOWEST in missing values.  File: pspp.info, Node: Miscellaneous Informational Records, Next: Dictionary Termination Record, Prev: Machine flt64 Info Record, Up: Data File Format Miscellaneous Informational Records =================================== Miscellaneous informational records must follow the variable records and precede the dictionary termination record. 
Miscellaneous informational records are ignored by PSPP when reading system files. They are not written by PSPP when writing system files. struct sysfile_misc_info { /* Header. */ int32 rec_type; int32 subtype; int32 size; int32 count; /* Data. */ char data[/* variable length */]; }; `int32 rec_type;' Record type. Always set to 7. `int32 subtype;' Record subtype. May take any value. `int32 size;' Size of each piece of data in the data part. Should have the value 4 or 8, for `int32' and `flt64', respectively. `int32 count;' Number of pieces of data in the data part. `char data[/* variable length */];' Arbitrary data. There must be `size' times `count' bytes of data.  File: pspp.info, Node: Dictionary Termination Record, Next: Data Record, Prev: Miscellaneous Informational Records, Up: Data File Format Dictionary Termination Record ============================= The dictionary termination record must follow all other records, except for the actual cases, which it must precede. There must be exactly one dictionary termination record in every system file. struct sysfile_dict_term { int32 rec_type; int32 filler; }; `int32 rec_type;' Record type. Always set to 999. `int32 filler;' Ignored padding. Should be set to 0.  File: pspp.info, Node: Data Record, Prev: Dictionary Termination Record, Up: Data File Format Data Record =========== Data records must follow all other records in the data file. There must be at least one data record in every system file. The format of data records varies depending on whether the data is compressed. Regardless, the data is arranged in a series of 8-byte elements. When data is not compressed, every case is composed of `case_size' of these 8-byte elements, where `case_size' comes from the file header record (*note File Header Record::). Each element corresponds to the variable declared in the respective variable record (*note Variable Record::). 
Numeric values are given in `flt64' format; string values are literal character strings, padded on the right when necessary. Compressed data is arranged in the following manner: the first 8-byte element in the data section is divided into a series of 1-byte command codes. These codes have meanings as described below: 0 Ignored. If the program writing the system file accumulates compressed data in blocks of fixed length, 0 bytes can be used to pad out extra bytes remaining at the end of a fixed-size block. 1 through 251 These values indicate that the corresponding numeric variable has the value `(CODE - BIAS)' for the case being read, where CODE is the value of the command code and BIAS is the `bias' field from the file header. For example, code 105 with bias 100.0 (the normal value) indicates a numeric variable of value 5. 252 End of file. This code may or may not appear at the end of the data stream. PSPP always outputs this code but its use is not required. 253 This value indicates that the numeric or string value is not compressible. The value is stored in the 8-byte element following the current block of command bytes. If this value appears twice in a block of command bytes, then it indicates the second element following the command bytes, and so on. 254 Used to indicate a string value that is all spaces. 255 Used to indicate the system-missing value. When the end of the first 8-byte element of command bytes is reached, any blocks of non-compressible values are skipped, and the next element of command bytes is read and interpreted, until the end of the file is reached.
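The command codes above can be sketched as a small in-memory expander. This is an illustration only, not PSPP's reader: the function name `sketch_decompress' and its parameters are hypothetical, the system-missing value is passed in by the caller (it is given by the machine `flt64' info record), and the sketch assumes that a host `double' is 8 bytes and already has the byte order used in the file. Each output element is left as 8 raw bytes, because telling numeric from string elements requires the variable records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

_Static_assert (sizeof (double) == 8, "sketch assumes 8-byte doubles");

/* Hypothetical helper, not part of PSPP: expands the N_IN bytes of
   compressed data at IN into at most MAX_OUT 8-byte elements in OUT.
   BIAS comes from the file header; SYSMIS is the system-missing value.
   Returns the number of elements written. */
size_t
sketch_decompress (const unsigned char *in, size_t n_in,
                   double bias, double sysmis,
                   unsigned char (*out)[8], size_t max_out)
{
  size_t pos = 0, n_out = 0;

  assert (in != NULL && out != NULL);
  while (pos + 8 <= n_in)
    {
      /* One 8-byte element holding eight 1-byte command codes. */
      const unsigned char *cmd = in + pos;
      int i;

      pos += 8;
      for (i = 0; i < 8; i++)
        {
          unsigned char code = cmd[i];

          if (code == 0)
            continue;                   /* Padding: ignored. */
          if (code == 252 || n_out == max_out)
            return n_out;               /* End of file, or OUT is full. */

          if (code == 253)
            {
              /* Literal element stored after the command bytes; each
                 further 253 in this block takes the next one. */
              if (pos + 8 > n_in)
                return n_out;           /* Truncated input. */
              memcpy (out[n_out], in + pos, 8);
              pos += 8;
            }
          else if (code == 254)
            memset (out[n_out], ' ', 8);        /* All-spaces string. */
          else
            {
              /* 255 is system-missing; 1 through 251 are CODE - BIAS. */
              double value = code == 255 ? sysmis : code - bias;
              memcpy (out[n_out], &value, 8);
            }
          n_out++;
        }
      /* POS is now past any literals, at the next command element. */
    }
  return n_out;
}
```

Because the literals for code 253 are consumed as they are encountered, the position after each command block is already past any non-compressible values, which matches the skipping described above.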