From: Ben Pfaff
Date: Sun, 20 Mar 2011 00:05:47 +0000 (-0700)
Subject: lexer: Reimplement for better testability and internationalization.
X-Git-Tag: v0.7.7~16
X-Git-Url: https://pintos-os.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=9ade26c8349b4434008c46cf09bc7473ec743972;hp=9ade26c8349b4434008c46cf09bc7473ec743972;p=pspp-builds.git

lexer: Reimplement for better testability and internationalization.

This commit reimplements PSPP lexical analysis from the ground up.
From a PSPP user's perspective, this should make PSPP more reliable
and make it easier to work with syntax files in non-ASCII encodings.
See the changes to NEWS for more details.

From a developer's perspective, the most visible change may be that
strings within tokens are now always encoded in UTF-8, regardless of
the syntax file's encoding.  Many of the changes in this commit are
due to this, especially those to functions that check for valid
identifiers: an identifier in UTF-8 is not necessarily the same
length when encoded in the dictionary's encoding, but limits on
identifier length must be enforced in the dictionary's encoding
(otherwise it might not be possible to write out a valid system file,
since the identifier might not fit in the fixed length fields in such
files).

Another important change is that, whereas before some special syntax
had to be handled by the parser providing feedback to the lexer, now
increasing the sophistication of the lexer has enabled all PSPP
syntax to be analyzed into tokens.  This permitted some other
improvements:

  - An arbitrary number of tokens of lookahead, up to the end of the
    current command, is now supported using lex_next_token() and
    related functions.

  - Before, some command implementations had a special attribute that
    meant that the top-level PSPP command parser would not consume
    the final token of the command name (because that token was not
    followed by tokenizable syntax).  This is no longer necessary and
    has been removed.

  - Before, each command implementation was responsible for ensuring
    that valid command syntax was not followed by trailing garbage,
    often by calling lex_end_of_command() as the last step of
    parsing.  This is no longer necessary; the main command parser
    will ensure this for itself.
---
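
Note on the identifier-length point above: the following is a minimal sketch, not code from this commit, of why a length limit has to be checked in the dictionary's encoding rather than on the UTF-8 token text. Everything named sketch_* or SKETCH_* is hypothetical, the limit of 8 bytes is an arbitrary value chosen for illustration, and ISO-8859-1 merely stands in for "the dictionary's encoding" because each of its characters occupies exactly one byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit, for illustration only: the maximum identifier
   length in bytes, measured in the dictionary's encoding. */
#define SKETCH_ID_MAX_LEN 8

/* Returns the number of bytes that UTF8 would occupy if it were recoded
   to ISO-8859-1, or (size_t) -1 if UTF8 is malformed or contains a
   character that ISO-8859-1 cannot represent. */
static size_t
sketch_latin1_length (const char *utf8)
{
  const unsigned char *p = (const unsigned char *) utf8;
  size_t n = 0;

  assert (utf8 != NULL);
  while (*p != '\0')
    {
      if (*p < 0x80)
        p += 1;                 /* ASCII: one byte in both encodings. */
      else if ((*p == 0xc2 || *p == 0xc3) && (p[1] & 0xc0) == 0x80)
        p += 2;                 /* U+0080...U+00FF: two bytes in UTF-8. */
      else
        return (size_t) -1;     /* Not representable in ISO-8859-1. */
      n++;                      /* One byte per character in ISO-8859-1. */
    }
  return n;
}

/* Returns true if the identifier UTF8 is nonempty and no longer than
   SKETCH_ID_MAX_LEN bytes once recoded to ISO-8859-1. */
static bool
sketch_id_fits (const char *utf8)
{
  size_t len = sketch_latin1_length (utf8);
  return len != (size_t) -1 && len >= 1 && len <= SKETCH_ID_MAX_LEN;
}
```

With these definitions, an identifier made of five copies of U+00E9 is 10 bytes long according to strlen() on the UTF-8 text, yet only 5 bytes in ISO-8859-1, so it passes sketch_id_fits() even though a check on the UTF-8 length would have rejected it. A real implementation would recode into whatever encoding the dictionary actually uses instead of assuming ISO-8859-1.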