lexer: Reimplement for better testability and internationalization.
author Ben Pfaff <blp@cs.stanford.edu>
Sun, 20 Mar 2011 00:05:47 +0000 (17:05 -0700)
committer Ben Pfaff <blp@cs.stanford.edu>
Sun, 20 Mar 2011 16:43:45 +0000 (09:43 -0700)
commit 9ade26c8349b4434008c46cf09bc7473ec743972
tree 8b088c842b96696ed3fbbe17d2c92e3bdfe65274
parent afdf3096926b561f4e6511c10fcf73fc6796b9d2

This commit reimplements PSPP lexical analysis from the ground up.
From a PSPP user's perspective, this should make PSPP more reliable
and make it easier to work with syntax files in non-ASCII encodings.
See the changes to NEWS for more details.

From a developer's perspective, the most visible change may be that
strings within tokens are now always encoded in UTF-8, regardless of
the syntax file's encoding.  Many of the changes in this commit are
due to this, especially those to functions that check for valid
identifiers: an identifier in UTF-8 is not necessarily the same length
when encoded in the dictionary's encoding, but limits on identifier
length must be enforced in the dictionary's encoding (otherwise it
might not be possible to write out a valid system file, since the
identifier might not fit in the fixed length fields in such files).
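
To illustrate why the length check depends on the encoding, here is a
minimal sketch in C.  It is not PSPP code: the function names are
hypothetical, and it assumes a dictionary encoding in which every
character occupies exactly one byte (as in ISO-8859-1), so that the
encoded length equals the number of code points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PSPP: counts the code points in the
   valid UTF-8 string S by skipping continuation bytes (10xxxxxx). */
static size_t
utf8_count_code_points (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      n++;
  return n;
}

/* Returns true if ID, measured in UTF-8 bytes, is at most MAX_LEN bytes
   long.  This is the naive check that the commit message warns about. */
static bool
id_fits_in_utf8 (const char *id, size_t max_len)
{
  return strlen (id) <= max_len;
}

/* Returns true if ID would be at most MAX_LEN bytes long once recoded
   into a single-byte dictionary encoding (an assumption of this sketch:
   one byte per character). */
static bool
id_fits_in_single_byte_encoding (const char *id, size_t max_len)
{
  return utf8_count_code_points (id) <= max_len;
}
```

With these definitions, the identifier "café" occupies 5 bytes in UTF-8
but only 4 bytes in the assumed single-byte encoding, so the two checks
disagree for a 4-byte limit.  A multibyte dictionary encoding would need
a real conversion instead of a code point count.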

Another important change concerns special syntax.  Before, the parser
had to handle some of it by providing feedback to the lexer.  Now the
lexer is sophisticated enough to analyze all PSPP syntax into tokens
on its own.  This permitted some other improvements:

  - An arbitrary number of tokens of lookahead, up to the end of the
    current command, is now supported using lex_next_token() and
    related functions.

  - Before, some command implementations had a special attribute that
    meant that the top-level PSPP command parser would not consume the
    final token of the command name (because that token was not
    followed by tokenizable syntax).  This is no longer necessary and
    has been removed.

  - Before, each command implementation was responsible for ensuring
    that valid command syntax was not followed by trailing garbage,
    often by calling lex_end_of_command() as the last step of parsing.
    This is no longer necessary; the main command parser will ensure
    this for itself.
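
The lookahead and trailing-garbage points above can be sketched with a
toy token stream.  This is only an illustration under stated
assumptions: every name below (toy_lex_next, toy_lex_get,
toy_run_command, and so on) is hypothetical and does not reflect the
signatures of lex_next_token() or any other PSPP function, and the
sketch assumes that the token array always ends with an end-of-command
token.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy token types; hypothetical, not PSPP's. */
enum toy_token_type { TOY_ID, TOY_NUMBER, TOY_ENDCMD };

struct toy_token
  {
    enum toy_token_type type;
    const char *text;
  };

/* TOKENS must end with a TOY_ENDCMD token (assumption of this sketch). */
struct toy_lexer
  {
    const struct toy_token *tokens;
    size_t pos;                 /* Index of the current token. */
  };

/* Returns the token N positions ahead of the current one without
   consuming anything.  Lookahead stops at the end of the current
   command, so a large N yields the TOY_ENDCMD token. */
static const struct toy_token *
toy_lex_next (const struct toy_lexer *lexer, size_t n)
{
  size_t i;

  assert (lexer != NULL && lexer->tokens != NULL);
  i = lexer->pos;
  while (n-- > 0 && lexer->tokens[i].type != TOY_ENDCMD)
    i++;
  return &lexer->tokens[i];
}

/* Consumes the current token, unless it ends the command. */
static void
toy_lex_get (struct toy_lexer *lexer)
{
  if (toy_lex_next (lexer, 0)->type != TOY_ENDCMD)
    lexer->pos++;
}

typedef bool toy_cmd_func (struct toy_lexer *);

/* Example command body: accepts exactly one number.  It does not check
   for trailing garbage itself. */
static bool
toy_parse_count (struct toy_lexer *lexer)
{
  if (toy_lex_next (lexer, 0)->type != TOY_NUMBER)
    return false;
  toy_lex_get (lexer);
  return true;
}

/* Top-level parser: consumes the command name, runs HANDLER, then
   checks in one central place that nothing follows the command. */
static bool
toy_run_command (struct toy_lexer *lexer, toy_cmd_func *handler)
{
  toy_lex_get (lexer);
  if (!handler (lexer))
    return false;
  return toy_lex_next (lexer, 0)->type == TOY_ENDCMD;
}
```

Keeping the end-of-command check in toy_run_command means that a
command body such as toy_parse_count cannot forget it, which is the
design choice the last item above describes.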
152 files changed:
NEWS
Smake
doc/dev/concepts.texi
doc/flow-control.texi
doc/invoking.texi
doc/language.texi
doc/utilities.texi
perl-module/PSPP.xs
perl-module/t/Pspp.t
src/data/automake.mk
src/data/dictionary.c
src/data/dictionary.h
src/data/file-handle-def.c
src/data/gnumeric-reader.h
src/data/identifier.c
src/data/identifier.h
src/data/identifier2.c [new file with mode: 0644]
src/data/mrset.c
src/data/mrset.h
src/data/por-file-reader.c
src/data/por-file-writer.c
src/data/procedure.c
src/data/procedure.h
src/data/sys-file-reader.c
src/data/sys-file-writer.c
src/data/variable.c
src/data/variable.h
src/data/vector.c
src/data/vector.h
src/language/automake.mk
src/language/command.c
src/language/command.def
src/language/control/automake.mk
src/language/control/do-if.c
src/language/control/loop.c
src/language/control/repeat.c
src/language/control/repeat.h [deleted file]
src/language/control/temporary.c
src/language/data-io/combine-files.c
src/language/data-io/data-list.c
src/language/data-io/data-parser.c
src/language/data-io/data-reader.c
src/language/data-io/file-handle.q
src/language/data-io/get-data.c
src/language/data-io/inpt-pgm.c
src/language/data-io/save-translate.c
src/language/data-io/trim.c
src/language/dictionary/apply-dictionary.c
src/language/dictionary/attributes.c
src/language/dictionary/missing-values.c
src/language/dictionary/modify-variables.c
src/language/dictionary/mrsets.c
src/language/dictionary/numeric.c
src/language/dictionary/rename-variables.c
src/language/dictionary/split-file.c
src/language/dictionary/sys-file-info.c
src/language/dictionary/value-labels.c
src/language/dictionary/variable-label.c
src/language/dictionary/vector.c
src/language/dictionary/weight.c
src/language/expressions/parse.c
src/language/expressions/private.h
src/language/lexer/automake.mk
src/language/lexer/include-path.c [new file with mode: 0644]
src/language/lexer/include-path.h [new file with mode: 0644]
src/language/lexer/lexer.c
src/language/lexer/lexer.h
src/language/lexer/q2c.c
src/language/lexer/value-parser.c
src/language/lexer/variable-parser.c
src/language/lexer/variable-parser.h
src/language/prompt.c [deleted file]
src/language/prompt.h [deleted file]
src/language/stats/aggregate.c
src/language/stats/autorecode.c
src/language/stats/descriptives.c
src/language/stats/flip.c
src/language/stats/frequencies.q
src/language/stats/npar.c
src/language/stats/rank.q
src/language/stats/sort-cases.c
src/language/syntax-file.c [deleted file]
src/language/syntax-file.h [deleted file]
src/language/syntax-string-source.c [deleted file]
src/language/syntax-string-source.h [deleted file]
src/language/tests/format-guesser-test.c
src/language/tests/moments-test.c
src/language/tests/paper-size.c
src/language/utilities/cache.c
src/language/utilities/cd.c
src/language/utilities/date.c
src/language/utilities/host.c
src/language/utilities/include.c
src/language/utilities/permissions.c
src/language/utilities/set.q
src/language/utilities/title.c
src/language/xforms/compute.c
src/language/xforms/count.c
src/language/xforms/fail.c
src/language/xforms/recode.c
src/language/xforms/sample.c
src/language/xforms/select-if.c
src/libpspp/automake.mk
src/libpspp/getl.c [deleted file]
src/libpspp/getl.h [deleted file]
src/libpspp/message.c
src/libpspp/message.h
src/libpspp/msg-locator.c [deleted file]
src/libpspp/msg-locator.h [deleted file]
src/output/driver.c
src/ui/gui/automake.mk
src/ui/gui/comments-dialog.c
src/ui/gui/executor.c
src/ui/gui/executor.h
src/ui/gui/main.c
src/ui/gui/psppire-data-window.c
src/ui/gui/psppire-dict.c
src/ui/gui/psppire-syntax-window.c
src/ui/gui/psppire-syntax-window.h
src/ui/gui/psppire-var-store.c
src/ui/gui/psppire.c
src/ui/gui/psppire.h
src/ui/gui/syntax-editor-source.c [deleted file]
src/ui/gui/syntax-editor-source.h [deleted file]
src/ui/source-init-opts.c
src/ui/source-init-opts.h
src/ui/terminal/automake.mk
src/ui/terminal/main.c
src/ui/terminal/msg-ui.c [deleted file]
src/ui/terminal/msg-ui.h [deleted file]
src/ui/terminal/read-line.c [deleted file]
src/ui/terminal/read-line.h [deleted file]
src/ui/terminal/terminal-opts.c
src/ui/terminal/terminal-opts.h
src/ui/terminal/terminal-reader.c [new file with mode: 0644]
src/ui/terminal/terminal-reader.h [new file with mode: 0644]
tests/data/data-in.at
tests/data/sys-file-reader.at
tests/dissect-sysfile.c
tests/language/control/do-repeat.at
tests/language/data-io/data-list.at
tests/language/data-io/get.at
tests/language/data-io/inpt-pgm.at
tests/language/data-io/print.at
tests/language/dictionary/missing-values.at
tests/language/expressions/evaluate.at
tests/language/expressions/parse.at
tests/language/lexer/lexer.at
tests/language/lexer/q2c.at
tests/language/stats/aggregate.at
tests/language/stats/rank.at
tests/language/utilities/insert.at