From: Ben Pfaff
Date: Sun, 29 Jul 2007 05:40:51 +0000 (+0000)
Subject: Make PSPP able to read all the portable files I could find on the
X-Git-Tag: v0.6.0~339
X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?p=pspp-builds.git;a=commitdiff_plain;h=29956ba4326b9d6a2bc4d22a9f323902c7a08d43

Make PSPP able to read all the portable files I could find on the
web.

* por-file-reader.c (struct pfm_reader): New member `line_length'.
(error): Print file offset in hexadecimal.
(warning): New function.
(advance): Treat lines less than 80 bytes long as padded to 80
bytes with spaces.
(pfm_open_reader): Call read_documents if we find an "E" record.
(convert_format): Convert invalid formats to the default format
instead of aborting reading the file.
(read_variables): Rename duplicate variable names instead of
aborting reading the file.
(read_value_label): Allow string variables of different widths to
be assigned value labels in the same record. Replace duplicate
value labels instead of aborting.
(read_documents): New function.

* por-file-writer.c (pfm_open_writer): Call write_documents if the
dictionary has documents.
(write_documents): New function.
---

diff --git a/doc/portable-file-format.texi b/doc/portable-file-format.texi
index f2882952..d58a987f 100644
--- a/doc/portable-file-format.texi
+++ b/doc/portable-file-format.texi
@@ -22,17 +22,24 @@ may be incorrect in the general case.
 * Case Weight Variable Record::
 * Variable Records::
 * Value Label Records::
+* Portable File Document Record::
 * Portable File Data::
 @end menu

 @node Portable File Characters
 @section Portable File Characters

-Portable files are arranged as a series of lines of exactly 80
+Portable files are arranged as a series of lines of 80
 characters each. Each line is terminated by a carriage-return,
-line-feed sequence ``new-lines''). New-lines are only used to avoid
+line-feed sequence (``new-lines''). New-lines are only used to avoid
 line length limits imposed by some OSes; they are not meaningful.
+Most lines in portable files are exactly 80 characters long. The only
+exception is a line that ends in one or more spaces, in which the
+spaces may optionally be omitted. Thus, a portable file reader must
+act as though a line shorter than 80 characters is padded to that
+length with spaces.
+
 The file must be terminated with a @samp{Z} character. In addition, if
 the final line in the file does not have exactly 80 characters, then
 it is padded on the right with @samp{Z} characters. (The file contents may
@@ -80,6 +87,9 @@ missing value record and a variable label record.
 @item
 Value labels (optional).

+@item
+Documents (optional).
+
 @item
 Data.
 @end itemize
@@ -369,6 +379,11 @@ and 255 for a string variable.
 @item
 Name (string). 1--8 characters long. Must be in all capitals.

+A few portable files that contain duplicate variable names have been
+spotted in the wild. PSPP handles these by renaming the duplicates
+with numeric extensions: @code{@var{var}_1}, @code{@var{var}_2}, and
+so on.
+
 @item
 Print format. This is a set of three integer fields:

@@ -384,6 +399,11 @@ Format width. 1--40.
 Number of decimal places. 1--40.
 @end itemize

+A few portable files with invalid format types or formats that are not
+of the appropriate width for their variables have been spotted in the
+wild. PSPP assigns a default F or A format to a variable with an
+invalid format.
+
 @item
 Write format. Same structure as the print format described above.
 @end itemize
@@ -420,7 +440,8 @@ Variable count (integer).
 @item
 List of variables (strings). The variable count specifies the number in
 the list. Variables are specified by their names. All variables must
-be of the same type (numeric or string).
+be of the same type (numeric or string), but string variables do not
+necessarily have the same width.

 @item
 Label count (integer).
@@ -431,6 +452,20 @@ tuples. Each tuple consists of a value, which is numeric or string as
 appropriate to the variables, followed by a label (string).
 @end itemize

+A few portable files that specify duplicate value labels, that is, two
+different labels for a single value of a single variable, have been
+spotted in the wild. PSPP uses the last value label specified in
+these cases.
+
+@node Portable File Document Record
+@section Document Record
+
+One document record may optionally follow the value label record. The
+document record consists of tag code @samp{E}, following by the number
+of document lines as an integer, followed by that number of strings,
+each of which represents one document line. Document lines must be 80
+bytes long or shorter.
+
 @node Portable File Data
 @section Portable File Data

diff --git a/src/data/ChangeLog b/src/data/ChangeLog
index 32651710..553b753e 100644
--- a/src/data/ChangeLog
+++ b/src/data/ChangeLog
@@ -1,3 +1,26 @@
+2007-07-28 Ben Pfaff
+
+	Make PSPP able to read all the portable files I could find on the
+	web. Thanks to John Darrington for review. Bug #17620.
+	* por-file-reader.c (struct pfm_reader): New member `line_length'.
+	(error): Print file offset in hexadecimal.
+	(warning): New function.
+	(advance): Treat lines less than 80 bytes long as padded to 80
+	bytes with spaces.
+	(pfm_open_reader): Call read_documents if we find an "E" record.
+	(convert_format): Convert invalid formats to the default format
+	instead of aborting reading the file.
+	(read_variables): Rename duplicate variable names instead of
+	aborting reading the file.
+	(read_value_label): Allow string variables of different widths to
+	be assigned value labels in the same record. Replace duplicate
+	value labels instead of aborting.
+	(read_documents): New function.
+
+	* por-file-writer.c (pfm_open_writer): Call write_documents if the
+	dictionary has documents.
+	(write_documents): New function.
+
 2007-07-25 Ben Pfaff

 	Fix bugs related to bug #17213.
diff --git a/src/data/por-file-reader.c b/src/data/por-file-reader.c
index f9eb28b6..e82e1361 100644
--- a/src/data/por-file-reader.c
+++ b/src/data/por-file-reader.c
@@ -65,6 +65,7 @@ struct pfm_reader

     struct file_handle *fh;     /* File handle. */
     FILE *file;                 /* File stream. */
+    int line_length;            /* Number of characters so far on this line. */
     char cc;                    /* Current character. */
     char *trans;                /* 256-byte character set translation table. */
     int var_cnt;                /* Number of variables. */
@@ -91,7 +92,7 @@ error (struct pfm_reader *r, const char *msg, ...)
   va_list args;

   ds_init_empty (&text);
-  ds_put_format (&text, _("portable file %s corrupt at offset %ld: "),
+  ds_put_format (&text, _("portable file %s corrupt at offset 0x%lx: "),
                  fh_get_file_name (r->fh), ftell (r->file));
   va_start (args, msg);
   ds_put_vformat (&text, msg, args);
@@ -110,6 +111,31 @@ error (struct pfm_reader *r, const char *msg, ...)
   longjmp (r->bail_out, 1);
 }

+/* Displays MSG as an warning for the current position in
+   portable file reader R. */
+static void
+warning (struct pfm_reader *r, const char *msg, ...)
+{
+  struct msg m;
+  struct string text;
+  va_list args;
+
+  ds_init_empty (&text);
+  ds_put_format (&text, _("reading portable file %s at offset 0x%lx: "),
+                 fh_get_file_name (r->fh), ftell (r->file));
+  va_start (args, msg);
+  ds_put_vformat (&text, msg, args);
+  va_end (args);
+
+  m.category = MSG_GENERAL;
+  m.severity = MSG_WARNING;
+  m.where.file_name = NULL;
+  m.where.line_number = 0;
+  m.text = ds_cstr (&text);
+
+  msg_emit (&m);
+}
+
 /* Closes portable file reader R, after we're done with it. */
 static void
 por_file_casereader_destroy (struct casereader *reader UNUSED, void *r_)
@@ -124,14 +150,33 @@ advance (struct pfm_reader *r)
 {
   int c;

-  while ((c = getc (r->file)) == '\r' || c == '\n')
-    continue;
+  /* Read the next character from the file.
+     Ignore carriage returns entirely.
+     Mostly ignore new-lines, but if a new-line occurs before the
+     line has reached 80 bytes in length, then treat the
+     "missing" bytes as spaces. */
+  for (;;)
+    {
+      while ((c = getc (r->file)) == '\r')
+        continue;
+      if (c != '\n')
+        break;
+
+      if (r->line_length < 80)
+        {
+          c = ' ';
+          ungetc ('\n', r->file);
+          break;
+        }
+      r->line_length = 0;
+    }
   if (c == EOF)
     error (r, _("unexpected end of file"));

   if (r->trans != NULL)
     c = r->trans[c];
   r->cc = c;
+  r->line_length++;
 }

 /* Skip a single character if present, and return whether it was
@@ -152,7 +197,7 @@ static void read_header (struct pfm_reader *);
 static void read_version_data (struct pfm_reader *, struct pfm_read_info *);
 static void read_variables (struct pfm_reader *, struct dictionary *);
 static void read_value_label (struct pfm_reader *, struct dictionary *);
-void dump_dictionary (struct dictionary *);
+static void read_documents (struct pfm_reader *, struct dictionary *);

 /* Reads the dictionary from file with handle H, and returns it in a
    dictionary structure. This dictionary may be modified in order to
@@ -176,6 +221,7 @@ pfm_open_reader (struct file_handle *fh, struct dictionary **dict,
     goto error;
   r->fh = fh;
   r->file = pool_fopen (r->pool, fh_get_file_name (r->fh), "rb");
+  r->line_length = 0;
   r->weight_index = -1;
   r->trans = NULL;
   r->var_cnt = 0;
@@ -201,6 +247,10 @@ pfm_open_reader (struct file_handle *fh, struct dictionary **dict,
   while (match (r, 'D'))
     read_value_label (r, *dict);

+  /* Read documents. */
+  if (match (r, 'E'))
+    read_documents (r, *dict);
+
   /* Check that we've made it to the data. */
   if (!match (r, 'F'))
     error (r, _("Data record expected."));
@@ -469,14 +519,20 @@ read_version_data (struct pfm_reader *r, struct pfm_read_info *info)
    checking that the format is appropriate for variable V. */
 static struct fmt_spec
 convert_format (struct pfm_reader *r, const int portable_format[3],
-                struct variable *v)
+                struct variable *v, bool *report_error)
 {
   struct fmt_spec format;
   bool ok;

   if (!fmt_from_io (portable_format[0], &format.type))
-    error (r, _("%s: Bad format specifier byte (%d)."),
-           var_get_name (v), portable_format[0]);
+    {
+      if (*report_error)
+        warning (r, _("%s: Bad format specifier byte (%d). Variable "
+                      "will be assigned a default format."),
+                 var_get_name (v), portable_format[0]);
+      goto assign_default;
+    }
+
   format.w = portable_format[1];
   format.d = portable_format[2];

@@ -487,14 +543,27 @@ convert_format (struct pfm_reader *r, const int portable_format[3],

   if (!ok)
     {
-      char fmt_string[FMT_STRING_LEN_MAX + 1];
-      error (r, _("%s variable %s has invalid format specifier %s."),
-             var_is_numeric (v) ? _("Numeric") : _("String"),
-             var_get_name (v), fmt_to_string (&format, fmt_string));
-      format = fmt_default_for_width (var_get_width (v));
+      if (*report_error)
+        {
+          char fmt_string[FMT_STRING_LEN_MAX + 1];
+          fmt_to_string (&format, fmt_string);
+          if (var_is_numeric (v))
+            warning (r, _("Numeric variable %s has invalid format "
+                          "specifier %s."),
+                     var_get_name (v), fmt_string);
+          else
+            warning (r, _("String variable %s with width %d has "
+                          "invalid format specifier %s."),
+                     var_get_name (v), var_get_width (v), fmt_string);
+        }
+      goto assign_default;
     }

   return format;
+
+assign_default:
+  *report_error = false;
+  return fmt_default_for_width (var_get_width (v));
 }

 static union value parse_value (struct pfm_reader *, struct variable *);
@@ -532,6 +601,7 @@ read_variables (struct pfm_reader *r, struct dictionary *dict)
       struct variable *v;
       struct missing_values miss;
       struct fmt_spec print, write;
+      bool report_error = true;
       int j;

       if (!match (r, '7'))
@@ -547,7 +617,7 @@ read_variables (struct pfm_reader *r, struct dictionary *dict)
         fmt[j] = read_int (r);

       if (!var_is_valid_name (name, false) || *name == '#' || *name == '$')
-        error (r, _("position %d: Invalid variable name `%s'."), i, name);
+        error (r, _("Invalid variable name `%s' in position %d."), name, i);
       str_uppercase (name);

       if (width < 0 || width > 255)
@@ -555,10 +625,24 @@ read_variables (struct pfm_reader *r, struct dictionary *dict)

       v = dict_create_var (dict, name, width);
       if (v == NULL)
-        error (r, _("Duplicate variable name %s."), name);
+        {
+          int i;
+          for (i = 1; i < 100000; i++)
+            {
+              char try_name[LONG_NAME_LEN + 1];
+              sprintf (try_name, "%.*s_%d", LONG_NAME_LEN - 6, name, i);
+              v = dict_create_var (dict, try_name, width);
+              if (v != NULL)
+                break;
+            }
+          if (v == NULL)
+            error (r, _("Duplicate variable name %s in position %d."), name, i);
+          warning (r, _("Duplicate variable name %s in position %d renamed "
+                        "to %s."), name, i, var_get_name (v));
+        }

-      print = convert_format (r, &fmt[0], v);
-      write = convert_format (r, &fmt[3], v);
+      print = convert_format (r, &fmt[0], v, &report_error);
+      write = convert_format (r, &fmt[3], v, &report_error);
       var_set_print_format (v, &print);
       var_set_write_format (v, &write);

@@ -645,9 +729,9 @@ read_value_label (struct pfm_reader *r, struct dictionary *dict)
       if (v[i] == NULL)
         error (r, _("Unknown variable %s while parsing value labels."), name);

-      if (var_get_width (v[0]) != var_get_width (v[i]))
+      if (var_get_type (v[0]) != var_get_type (v[i]))
         error (r, _("Cannot assign value labels to %s and %s, which "
-                    "have different variable types or widths."),
+                    "have different variable types."),
                var_get_name (v[0]), var_get_name (v[i]));
     }

@@ -661,24 +745,33 @@ read_value_label (struct pfm_reader *r, struct dictionary *dict)
       val = parse_value (r, v[0]);
       read_string (r, label);

-      /* Assign the value_label's to each variable. */
+      /* Assign the value label to each variable. */
       for (j = 0; j < nv; j++)
         {
           struct variable *var = v[j];

-          if (!var_add_value_label (var, &val, label))
-            continue;
-
-          if (var_is_numeric (var))
-            error (r, _("Duplicate label for value %g for variable %s."),
-                   val.f, var_get_name (var));
-          else
-            error (r, _("Duplicate label for value `%.*s' for variable %s."),
-                   var_get_width (var), val.s, var_get_name (var));
+          if (!var_is_long_string (var))
+            var_replace_value_label (var, &val, label);
         }
     }
 }

+/* Reads a set of documents from portable file R into DICT. */
+static void
+read_documents (struct pfm_reader *r, struct dictionary *dict)
+{
+  int line_cnt;
+  int i;
+
+  line_cnt = read_int (r);
+  for (i = 0; i < line_cnt; i++)
+    {
+      char line[256];
+      read_string (r, line);
+      dict_add_document_line (dict, line);
+    }
+}
+
 /* Reads one case from portable file R into C. */
 static bool
 por_file_casereader_read (struct casereader *reader, void *r_, struct ccase *c)
diff --git a/src/data/por-file-writer.c b/src/data/por-file-writer.c
index 0bb1df60..cf447456 100644
--- a/src/data/por-file-writer.c
+++ b/src/data/por-file-writer.c
@@ -83,6 +83,8 @@ static void write_version_data (struct pfm_writer *);
 static void write_variables (struct pfm_writer *, struct dictionary *);
 static void write_value_labels (struct pfm_writer *,
                                 const struct dictionary *);
+static void write_documents (struct pfm_writer *,
+                             const struct dictionary *);

 static void format_trig_double (long double, int base_10_precision, char[]);
 static char *format_trig_int (int, bool force_sign, char[]);
@@ -159,6 +161,8 @@ pfm_open_writer (struct file_handle *fh, struct dictionary *dict,
   write_version_data (w);
   write_variables (w, dict);
   write_value_labels (w, dict);
+  if (dict_get_document_line_cnt (dict) > 0)
+    write_documents (w, dict);
   buf_write (w, "F", 1);
   if (ferror (w->file))
     goto error;
@@ -414,7 +418,25 @@ write_value_labels (struct pfm_writer *w, const struct dictionary *dict)
     }
 }

-/* Writes case C to the portable file represented by H. */
+/* Write documents in DICT to portable file W. */
+static void
+write_documents (struct pfm_writer *w, const struct dictionary *dict)
+{
+  size_t line_cnt = dict_get_document_line_cnt (dict);
+  struct string line = DS_EMPTY_INITIALIZER;
+  int i;
+
+  buf_write (w, "E", 1);
+  write_int (w, line_cnt);
+  for (i = 0; i < line_cnt; i++)
+    {
+      dict_get_document_line (dict, i, &line);
+      write_string (w, ds_cstr (&line));
+    }
+  ds_destroy (&line);
+}
+
+/* Writes case C to the portable file represented by WRITER. */
 static void
 por_file_casewriter_write (struct casewriter *writer, void *w_,
                            struct ccase *c)
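
Note on the advance() change: the short-line rule in the patch (a line shorter than 80 characters reads as if padded with spaces) can be tried outside PSPP with the sketch below. It is only an illustration under stated assumptions: it reads from an in-memory string instead of a FILE, it leaves the new-line unconsumed instead of calling ungetc, and the names `pad_reader`, `pad_reader_next`, `pad_expand` and `LINE_WIDTH` are hypothetical and do not appear in the PSPP sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_WIDTH 80           /* Hypothetical name for the fixed line width. */

/* Hypothetical reader state over an in-memory buffer. */
struct pad_reader
  {
    const char *data;           /* Input bytes. */
    size_t len;                 /* Number of input bytes. */
    size_t pos;                 /* Index of the next unread byte. */
    int line_length;            /* Characters delivered so far on this line. */
  };

/* Returns the next logical character, or -1 at end of input.
   Carriage returns are dropped.  A new-line that arrives before
   LINE_WIDTH characters have been delivered is left unread and a
   space is returned instead, so the line is padded one call at a
   time until it reaches LINE_WIDTH. */
static int
pad_reader_next (struct pad_reader *r)
{
  for (;;)
    {
      int c;

      while (r->pos < r->len && r->data[r->pos] == '\r')
        r->pos++;
      if (r->pos >= r->len)
        return -1;

      c = (unsigned char) r->data[r->pos];
      if (c != '\n')
        {
          r->pos++;
          r->line_length++;
          return c;
        }

      if (r->line_length < LINE_WIDTH)
        {
          r->line_length++;     /* New-line stays unread; emit padding. */
          return ' ';
        }

      r->pos++;                 /* Line is full: consume the new-line. */
      r->line_length = 0;
    }
}

/* Copies the logical characters of TEXT into OUT (at most
   OUT_SIZE - 1 of them), null-terminates OUT, and returns the
   number of characters stored. */
size_t
pad_expand (const char *text, char *out, size_t out_size)
{
  struct pad_reader r = { text, strlen (text), 0, 0 };
  size_t n = 0;
  int c;

  assert (out_size > 0);
  while (n + 1 < out_size && (c = pad_reader_next (&r)) != -1)
    out[n++] = (char) c;
  out[n] = '\0';
  return n;
}
```

Under these assumptions, expanding the two short lines "AB" and "CD" (each ended by a carriage-return, line-feed pair) yields two full 80-character lines, so the caller sees the same stream whether or not the writer trimmed trailing spaces.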