From: Ben Pfaff
Date: Sun, 30 Jul 2017 00:13:28 +0000 (-0700)
Subject: sys-file-reader: Ignore duplicate value labels written by ReadStat.
X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?p=pspp;a=commitdiff_plain;h=80527716392c066fdf72f37729c42089a2174bae

sys-file-reader: Ignore duplicate value labels written by ReadStat.

The ReadStat software writes duplicate value labels due to a
misunderstanding of string widths. This change make PSPP ignore these
duplicates.
---

diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index e52b957116..484fbb43f1 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -222,6 +222,12 @@ pspp 0.1.4 - sparc-sun-solaris2.5.2}. The string is truncated if it would be
 longer than 60 characters; otherwise it is padded on the right with
 spaces.
 
+The product name field allow readers to behave differently based on
+quirks in the way that particular software writes system files.
+@xref{Value Labels Records}, for the detail of the quirk that the PSPP
+system file reader tolerates in files written by ReadStat, which has
+@code{https://github.com/WizardMac/ReadStat} in @code{prod_name}.
+
 @anchor{layout_code}
 @item int32 layout_code;
 Normally set to 2, although a few system files have been spotted in
@@ -524,6 +530,14 @@ numeric and short string variables only. Long string variables may
 have value labels, but their value labels are recorded using a different
 record type (@pxref{Long String Value Labels Record}).
 
+ReadStat (@pxref{File Header Record}) writes value labels that label a
+single value more than once. In more detail, it emits value labels
+whose values are longer than string variables' widths, that are
+identical in the actual width of the variable, e.g.@: labels for
+values @code{ABC123} and @code{ABC456} for a string variable with
+width 3. For files written by this software, PSPP ignores such
+labels.
+
 The value label record has the following format:
 
 @example
@@ -606,9 +620,7 @@ Record type. Always set to 6.
 
 @item int32 n_lines;
 Number of lines of documents present. This should be greater than
-zero, but the system file writer that identifies itself as
-@url{https://github.com/WizardMac/ReadStat} writes document records
-with zero @code{n_lines}.
+zero, but ReadStats writes system files with zero @code{n_lines}.
 
 @item char lines[][80];
 Document lines. The number of elements is defined by @code{n_lines}.
diff --git a/src/data/sys-file-reader.c b/src/data/sys-file-reader.c
index 5e5dc8b144..9d7e392d01 100644
--- a/src/data/sys-file-reader.c
+++ b/src/data/sys-file-reader.c
@@ -203,6 +203,7 @@ struct sfm_reader
     size_t sfm_var_cnt;         /* Number of variables. */
     int case_cnt;               /* Number of cases */
     const char *encoding;       /* String encoding. */
+    bool written_by_readstat; /* From https://github.com/WizardMac/ReadStat? */
 
     /* Decompression. */
     enum any_compression compression;
@@ -967,6 +968,8 @@ read_header (struct sfm_reader *r, struct any_read_info *info,
   if (!read_string (r, header->magic, sizeof header->magic)
       || !read_string (r, header->eye_catcher, sizeof header->eye_catcher))
     return false;
+  r->written_by_readstat = strstr (header->eye_catcher,
+                                   "https://github.com/WizardMac/ReadStat");
 
   if (!strcmp (ASCII_MAGIC, header->magic)
       || !strcmp (EBCDIC_MAGIC, header->magic))
@@ -2228,7 +2231,15 @@ parse_value_labels (struct sfm_reader *r, struct dictionary *dict,
 
           if (!var_add_value_label (var, &value, utf8_labels[j]))
             {
-              if (var_is_numeric (var))
+              if (r->written_by_readstat)
+                {
+                  /* Ignore the problem. ReadStat is buggy and emits value
+                     labels whose values are longer than string variables'
+                     widths, that are identical in the actual width of the
+                     variable, e.g. both values "ABC123" and "ABC456" for a
+                     string variable with width 3. */
+                }
+              else if (var_is_numeric (var))
                 sys_warn (r, record->pos,
                           _("Duplicate value label for %g on %s."),
                           value.f, var_get_name (var));
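
The quirk this patch works around can be shown in isolation. The sketch below is not PSPP code: the helper names is_written_by_readstat and values_collide are hypothetical, and it assumes that only the first WIDTH bytes of a string value are significant, as the documentation change describes. It shows why labels for "ABC123" and "ABC456" become duplicates on a width-3 variable, and how a substring test on the product name, like the strstr() call in the patch, identifies the writer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PSPP: returns true if PROD_NAME
   contains the URL that ReadStat puts in the product name field.
   This mirrors the strstr() test added to read_header() above. */
static bool
is_written_by_readstat (const char *prod_name)
{
  return strstr (prod_name, "https://github.com/WizardMac/ReadStat") != NULL;
}

/* Hypothetical helper, not part of PSPP: returns true if A and B are
   the same value for a string variable of the given WIDTH, that is,
   if they agree in their first WIDTH bytes.  Assumes that bytes past
   WIDTH are not significant. */
static bool
values_collide (const char *a, const char *b, size_t width)
{
  assert (width > 0);
  return strncmp (a, b, width) == 0;
}
```

Under these assumptions, values_collide ("ABC123", "ABC456", 3) is true, so the second label duplicates the first, while the same call with width 6 is false. A reader could pair this with is_written_by_readstat to decide whether to warn about the duplicate or skip it silently.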