--- /dev/null
+//! Syntax segmentation.
+//!
+//! PSPP divides traditional "lexical analysis" or "tokenization" into two
+//! phases: a lower-level phase called "segmentation" and a higher-level phase
+//! called "scanning". This module implements the segmentation phase.
+//! [`super::scan`] contains declarations for the scanning phase.
+//!
+//! Segmentation accepts a stream of UTF-8 bytes as input. It outputs a label
+//! (a segment type) for each byte or contiguous sequence of bytes in the input.
+//! It also, in a few corner cases, outputs zero-width segments that label the
+//! boundary between a pair of bytes in the input.
+//!
+//! Some segment types correspond directly to tokens; for example, an
+//! "identifier" segment ([`Type::Identifier`]) becomes an identifier token
+//! (`T_ID`) later in lexical analysis. Other segments contribute to tokens but
+//! do not correspond directly; for example, multiple quoted string segments
+//! ([`Type::QuotedString`]) separated by spaces ([`Type::Spaces`]) and "+"
+//! punctuators ([`Type::Punct`]) may be combined to form a single string token
+//! (`T_STRING`). Still other segments are ignored (e.g. [`Type::Spaces`]) or
+//! trigger special behavior such as error messages later in tokenization (e.g.
+//! [`Type::ExpectedQuote`]).
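+//!
+//! As a rough illustration of the combining step described above, the sketch
+//! below joins quoted string segments separated by spaces and "+" punctuators
+//! into one string. The `Seg` enum and `join_strings` function are
+//! hypothetical stand-ins written for this example only; they are not the
+//! types or functions that this module or [`super::scan`] defines.
+//!
+//! ```
+//! // Hypothetical, simplified segments: just enough for the example.
+//! #[derive(Debug, PartialEq)]
+//! enum Seg<'a> {
+//!     QuotedString(&'a str), // Contents, with the quotes already removed.
+//!     Spaces,
+//!     Punct(&'a str),
+//! }
+//!
+//! // Concatenates the string segments, skipping spaces and "+". Returns
+//! // `None` for any other punctuator or for a misplaced or dangling "+".
+//! fn join_strings(segs: &[Seg<'_>]) -> Option<String> {
+//!     let mut out = String::new();
+//!     let mut expect_string = true;
+//!     for seg in segs {
+//!         match seg {
+//!             Seg::Spaces => (),
+//!             Seg::QuotedString(s) if expect_string => {
+//!                 out.push_str(s);
+//!                 expect_string = false;
+//!             }
+//!             Seg::Punct("+") if !expect_string => expect_string = true,
+//!             _ => return None,
+//!         }
+//!     }
+//!     (!expect_string).then_some(out)
+//! }
+//!
+//! // Segments for the input `'abc' + 'def'`.
+//! let segs = [
+//!     Seg::QuotedString("abc"),
+//!     Seg::Spaces,
+//!     Seg::Punct("+"),
+//!     Seg::Spaces,
+//!     Seg::QuotedString("def"),
+//! ];
+//! assert_eq!(join_strings(&segs).as_deref(), Some("abcdef"));
+//! ```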
+
+/// Segmentation mode.
+///
+/// PSPP syntax is written in one of two modes which are broadly defined as
+/// follows:
+///
+/// - In interactive mode, commands end with a period at the end of the line
+///   or with a blank line.
+///
+/// - In batch mode, the second and subsequent lines of a command are indented
+///   from the left margin.
+///
+/// The segmenter can also try to automatically detect the mode in use, using a
+/// heuristic that is usually correct.
+#[derive(Copy, Clone, Debug, PartialEq, Eq, Default)]
+pub enum Mode {
+    /// Try to interpret input correctly regardless of whether it is written
+    /// for interactive or batch mode.
+    #[default]
+    Auto,
+
+    /// Interactive syntax mode.
+    Interactive,
+
+    /// Batch syntax mode.
+    Batch,
+}
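+
+// A minimal sketch of how the `#[default]` attribute above plays out: the
+// derived `Default` implementation yields `Mode::Auto`, so a caller that does
+// not pick a mode gets automatic detection. The module and test names are
+// arbitrary and chosen only for this illustration.
+#[cfg(test)]
+mod mode_default_sketch {
+    use super::Mode;
+
+    #[test]
+    fn default_mode_is_auto() {
+        assert_eq!(Mode::default(), Mode::Auto);
+    }
+}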
+
+/// The type of a segment.
+#[derive(Copy, Clone, Debug, PartialEq, Eq)]
+pub enum Type {
+    Number,
+    QuotedString,
+    HexString,
+    UnicodeString,
+    UnquotedString,
+    ReservedWord,
+    Identifier,
+    Punct,
+    Shbang,
+    Spaces,
+    Comment,
+    Newline,
+    CommentCommand,
+    DoRepeatCommand,
+    InlineData,
+    MacroId,
+    MacroName,
+    MacroBody,
+    StartDocument,
+    Document,
+    StartCommand,
+    SeparateCommands,
+    EndCommand,
+    End,
+    ExpectedQuote,
+    ExpectedExponent,
+    UnexpectedChar,
+}
+
+/// A syntax segmenter, which divides input into segments.
+pub struct Segmenter {
+    /// Current segmentation state.
+    state: State,
+}