# lexgen: A fully-featured lexer generator, implemented as a proc macro

```rust
lexer! {
    // First line specifies name of the lexer and the token type returned by
    // user actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in user actions.
    rule Init {
        // Rules without a right-hand side can be used for skipping
        // whitespace, comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* => |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item = Result<(usize, T, usize), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and the `usize`s indicate byte indices of the beginning and
// end of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((1, Token::Id("abc123Q-t".to_owned()), 10)))
);
assert_eq!(
    lexer.next(),
    Some(Ok((12, Token::Id("z9_9".to_owned()), 16)))
);
assert_eq!(lexer.next(), None);
```

You can see more examples here, and a full Lua 5.1 lexer here.

## Motivation

Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy to use lexer generator.

## Usage

lexgen doesn't require a build step. Just add it as a dependency in your Cargo.toml.
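For example (the version number below is illustrative; check crates.io for the latest release):

```toml
[dependencies]
lexgen = "0.1"  # illustrative version; use the latest release
```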

## Lexer syntax

lexgen lexers start with the type of the generated lexer struct, an optional user state part, and the token type (the type of values returned by user actions). Example:

```rust
lexer! {
    Lexer(LexerState) -> Token;
    ...
}
```

Here the lexer struct is named `Lexer`. The user state type is `LexerState` (this type should be defined by the user). The token type is `Token`.

Next come `let` bindings for regular expressions. These are optional. The syntax is `let <id> = <regex>;`, where `<id>` is a Rust identifier and `<regex>` is as described below.

```rust
let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];
```

Finally we define the lexer rules:

```rust
rule Init {
    ...
}

rule SomeOtherRule {
    ...
}
```

The first rule set defines the initial state of the lexer and needs to be named `Init`.

In the body of a `rule` block we define the rules for that lexer state. The syntax for a rule is `<regex> => <user action>,`. Regex syntax is described below. A user action is any Rust code with type `fn(LexerHandle) -> LexerAction`, where `LexerHandle` and `LexerAction` are generated names derived from the lexer name (`Lexer`). More on these types below.
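To make the `fn(LexerHandle) -> LexerAction` shape concrete, here is a minimal self-contained sketch. `MockHandle` and `MockAction` are stand-ins invented for illustration; they are not the actual types generated by the `lexer!` macro:

```rust
// Mock stand-ins for the generated handle and action types (illustrative
// only; the real types are generated by the `lexer!` macro).

struct MockHandle<'input> {
    match_: &'input str,
}

#[allow(dead_code)]
enum MockAction<Token> {
    Return(Token), // return a token to the caller
    Continue,      // skip the match and keep lexing
}

impl<'input> MockHandle<'input> {
    // The current match as a string slice
    fn match_(&self) -> &'input str {
        self.match_
    }

    // Consume the handle, returning a token
    fn return_<Token>(self, token: Token) -> MockAction<Token> {
        MockAction::Return(token)
    }
}

// A user action has the shape fn(handle) -> action:
fn id_action(handle: MockHandle) -> MockAction<String> {
    let token = handle.match_().to_owned();
    handle.return_(token)
}

fn main() {
    let handle = MockHandle { match_: "abc123" };
    match id_action(handle) {
        MockAction::Return(tok) => assert_eq!(tok, "abc123"),
        MockAction::Continue => panic!("expected a token"),
    }
    println!("ok");
}
```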

You can omit the rule Init { ... } part and have all of your rules at the top level if you don't need rule sets.


## Regex syntax

Regex syntax can be used on the right-hand side of `let` bindings and the left-hand side of rules. The syntax is:

Binding powers (precedences), from higher to lower:

You can use parentheses for grouping, e.g. `('a' | 'b')*`.

Example: `'a' 'b' | 'c'+` is the same as `(('a' 'b') | ('c'+))`.

## Built-in regular expressions

lexgen comes with a set of built-in regular expressions. The regular expressions listed below match the same set of characters as their Rust counterparts. For example, `$$alphabetic` matches the same set of characters as Rust's `char::is_alphabetic`:

(Note that the generated code doesn't use Rust `char` methods. For simple cases like `$$ascii` we generate simple range checks. For more complicated cases like `$$lowercase` we generate a binary search table and run a binary search when checking a character.)
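The range-table idea can be sketched as follows. This is illustrative, not lexgen's actual generated code: a sorted table of inclusive character ranges, searched with a binary search. The table below covers only a few lowercase ranges for demonstration.

```rust
// Illustrative sketch of a binary-search range table for a character class.
// The table must be sorted and non-overlapping for binary search to work.
const LOWERCASE_RANGES: &[(char, char)] = &[
    ('a', 'z'), // ASCII lowercase
    ('µ', 'µ'), // U+00B5 MICRO SIGN
    ('ß', 'ö'), // U+00DF..U+00F6, Latin-1 lowercase letters
];

fn in_ranges(c: char, ranges: &[(char, char)]) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                // This range lies entirely above `c`
                std::cmp::Ordering::Greater
            } else if c > hi {
                // This range lies entirely below `c`
                std::cmp::Ordering::Less
            } else {
                // `lo <= c <= hi`: found a containing range
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

fn main() {
    assert!(in_ranges('m', LOWERCASE_RANGES));
    assert!(!in_ranges('M', LOWERCASE_RANGES));
    assert!(in_ranges('ß', LOWERCASE_RANGES));
    println!("ok");
}
```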

In addition, these two built-in regular expressions match the Unicode XID_Start and XID_Continue character classes:

## Rule syntax

## Handle, rule, error, and action types

The `lexer` macro generates a few types with names derived from the lexer name and the types specified by the user. If the lexer type is declared as `Lexer(State) -> Token` at the beginning of the lexer, the generated types are:

## Stateful lexer example

Here's an example lexer that counts the number of `=`s that appear between two `[`s:

```rust
lexer! {
    Lexer(usize) -> usize;

    rule Init {
        ' ',                                            // line 5

        '[' => |lexer| {
            *lexer.state() = 0;                         // line 8
            lexer.switch(LexerRule::Count)              // line 9
        },
    }

    rule Count {
        '=' => |lexer| {
            let n = *lexer.state();
            *lexer.state() = n + 1;                     // line 16
            lexer.continue_()                           // line 17
        },

        '[' => |lexer| {
            let n = *lexer.state();
            lexer.switch_and_return(LexerRule::Init, n) // line 22
        },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(lexer.next(), Some(Ok((0, 0, 2))));
assert_eq!(lexer.next(), Some(Ok((3, 1, 6))));
assert_eq!(lexer.next(), Some(Ok((7, 2, 11))));
assert_eq!(lexer.next(), None);
```

Initially (the Init rule set) we skip spaces (line 5). When we see a [ we initialize the user state (line 8) and switch to the Count state (line 9). In Count, each = increments the user state by one (line 16) and skips the match (line 17). A [ in the Count state returns the current number and switches to the Init state (line 22).

## Implementation details

lexgen's implementation should be fairly standard. Each rule set is compiled to a separate NFA. NFAs are then compiled to DFAs. DFAs are added to the same DFA type, but there are no transitions between nodes of different DFAs: transitions between DFAs are done in user actions, using the `switch` method of lexer handles, as described above.

Generated code for a DFA is basically a loop that iterates over characters of the input string:

```
loop {
    match <current state> {
        S1 => {
            match <next character> {
                C1 => ...              // transition to next state

                ...                    // other characters expected in this state

                _ => ...               // for an accepting state, run user
                                       // action, for a non-accepting, fail
            }
        },
        ...                            // same stuff for other DFA states
    }
}
```
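As a concrete, hand-written analogue of this loop (simplified and illustrative, not lexgen's actual output), here is a runnable two-state DFA matching `[a-z][a-z0-9_]*` that returns the byte length of the longest match at the start of the input:

```rust
// Hand-written, simplified analogue of the generated DFA loop.
fn match_ident(input: &str) -> Option<usize> {
    #[derive(Clone, Copy)]
    enum State {
        S0, // initial state, non-accepting
        S1, // seen a valid identifier prefix, accepting
    }

    let mut state = State::S0;
    let mut matched = None; // byte length of the longest match so far

    for (idx, c) in input.char_indices() {
        state = match state {
            State::S0 => match c {
                'a'..='z' => State::S1,
                // Non-accepting state with no transition: fail (None)
                _ => return matched,
            },
            State::S1 => match c {
                'a'..='z' | '0'..='9' | '_' => State::S1,
                // Accepting state with no transition: report the match
                _ => return matched,
            },
        };
        matched = Some(idx + c.len_utf8());
    }
    matched
}

fn main() {
    assert_eq!(match_ident("abc_1 rest"), Some(5));
    assert_eq!(match_ident("9abc"), None);
    println!("ok");
}
```

A real generated lexer also has to track the last accepting position so it can backtrack when a longer match fails, but the state-by-state `match` structure is the same.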