```rust
lexer! {
    // The first line specifies the name of the lexer and the token type
    // returned by user actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in user actions.
    rule Init {
        // Rules without a right-hand side, for skipping whitespace, comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* => |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
    }
}

// The token type
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item=Result<(usize, T, usize), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and the `usize`s indicate the byte indices of the beginning
// and end of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((1, Token::Id("abc123Q-t".to_owned()), 10)))
);
assert_eq!(
    lexer.next(),
    Some(Ok((12, Token::Id("z9_9".to_owned()), 16)))
);
assert_eq!(lexer.next(), None);
```
You can see more examples in the lexgen repository, including a full Lua 5.1 lexer.
Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.
My goal with lexgen is to provide a feature-complete and easy-to-use lexer generator.
lexgen doesn't require a build step. Just add it as a dependency in your `Cargo.toml`.
A lexgen lexer definition starts with the name of the generated lexer struct, an optional user state part, and the token type (the type of values returned by user actions). Example:
```rust
lexer! {
    Lexer(LexerState) -> Token;
    ...
}
```
Here the lexer struct is named `Lexer`. The user state type is `LexerState` (this type should be defined by the user). The token type is `Token`.
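The user state can be any type the lexer's actions need. A hypothetical state type (assuming the generated lexer default-initializes it, which is consistent with the stateful example below):

```rust
// Hypothetical user state: tracks nesting depth of block comments.
// User actions get `&mut` access to this value via the handle's
// `state()` method (described below).
#[derive(Debug, Default)]
struct LexerState {
    comment_depth: usize,
}
```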
Next come `let` bindings for regular expressions. These are optional. The syntax is `let <id> = <regex>;`, where `<id>` is a Rust identifier and `<regex>` is a regex as described below.
```rust
let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];
```
Finally we define the lexer rules:
```rust
rule Init { ... }

rule SomeOtherRule { ... }
```
The first rule set defines the initial state of the lexer and needs to be named `Init`.
In the body of a `rule` block we define the rules for that lexer state. The syntax for a rule is `<regex> => <user action>,`. Regex syntax is described below. A user action is any Rust code with type `fn(LexerHandle) -> LexerAction`, where `LexerHandle` and `LexerAction` are generated names derived from the lexer name (`Lexer`). More on these types below.
You can omit the `rule Init { ... }` part and have all of your rules at the top level if you don't need rule sets, as in the sketch below.
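For example, a minimal whitespace-and-identifiers lexer (a sketch reusing the `Token` type from the first example) can be written entirely with top-level rules:

```rust
lexer! {
    Lexer -> Token;

    // Top-level rules, no `rule Init { ... }` wrapper needed
    [' ' '\t' '\n']+,

    ['a'-'z']+ => |lexer| {
        let token = Token::Id(lexer.match_().to_owned());
        lexer.return_(token)
    },
}
```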
In summary:

- The first line is of the form `<lexer name>(<user state type>) -> <token type name>`. The `(<user state type>)` part can be omitted for stateless lexers.
- Next we have `let` bindings for regexes. This part is optional.
- Next are the rule sets. There should be at least one rule set with the name `Init`, which is the name of the initial state.
Regex syntax can be used on the right-hand side of `let` bindings and the left-hand side of rules. The syntax is:

- `$var` for variables defined in the `let` binding section. Variables need to be defined before use.
- `$$var` for built-in regexes (see the "Built-in regular expressions" section below).
- Rust character syntax for matching a single character, e.g. `'a'`.
- Rust string syntax for matching a string, e.g. `"abc"`.
- `[...]` for character sets. Inside the brackets you can have one or more characters or character ranges like `'a'-'z'`. Here's an example character set for ASCII alphanumerics: `['a'-'z' 'A'-'Z' '0'-'9']`.
- `_` for matching any character.
- `$` for matching end-of-input.
- `<regex>*` for zero or more repetitions of `<regex>`.
- `<regex>+` for one or more repetitions of `<regex>`.
- `<regex>?` for zero or one repetitions of `<regex>`.
- `<regex> <regex>` for concatenation.
- `<regex> | <regex>` for alternation (match the first one, or the second one).

Binding powers (precedences), from higher to lower:

- `*`, `+`, `?`
- concatenation
- `|`

You can use parentheses for grouping, e.g. `('a' | 'b')*`.

Example: `'a' 'b' | 'c'+` is the same as `(('a' 'b') | ('c'+))`.
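As an illustration, here are two hypothetical `let` bindings for hexadecimal integer literals, written with these combinators inside a `lexer!` block:

```rust
// "0x" followed by one or more hex digits, with optional underscore
// separators after the first digit, e.g. "0x1F", "0xdead_beef"
let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];
let hex_int = "0x" $hex_digit ($hex_digit | '_')*;
```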
lexgen comes with a set of built-in regular expressions. The regular expressions listed below match the same set of characters as their Rust counterparts. For example, `$$alphabetic` matches the same set of characters as Rust's `char::is_alphabetic`:

- `$$alphabetic`
- `$$alphanumeric`
- `$$ascii`
- `$$ascii_alphabetic`
- `$$ascii_alphanumeric`
- `$$ascii_control`
- `$$ascii_digit`
- `$$ascii_graphic`
- `$$ascii_hexdigit`
- `$$ascii_lowercase`
- `$$ascii_punctuation`
- `$$ascii_uppercase`
- `$$ascii_whitespace`
- `$$control`
- `$$lowercase`
- `$$numeric`
- `$$uppercase`
- `$$whitespace`

(Note that in the generated code we don't use Rust `char` methods. For simple cases like `$$ascii` we generate simple range checks. For more complicated cases like `$$lowercase` we generate a binary search table and run a binary search when checking a character.)

In addition, these two built-in regular expressions match the Unicode XID_Start and XID_Continue character classes:

- `$$XID_Start`
- `$$XID_Continue`
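These make Unicode-aware identifier rules straightforward. A sketch, reusing the `Token::Id` variant from the first example:

```rust
lexer! {
    Lexer -> Token;

    // An identifier: an XID_Start character followed by zero or
    // more XID_Continue characters
    $$XID_Start $$XID_Continue* => |lexer| {
        let token = Token::Id(lexer.match_().to_owned());
        lexer.return_(token)
    },
}
```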
A rule can take one of the following forms:

- `<regex> => <user action>,`: `<regex>` syntax is as described above. `<user action>` is any Rust code with type `fn(LexerHandle) -> LexerAction<LexerReturn>`. More on the `LexerHandle` and `LexerAction` types below. `LexerReturn` is the type declared at the beginning of the lexer with `Lexer -> LexerReturn;`.
- `<regex> =? <user action>,`: fallible actions. This syntax is similar to the syntax above, except `<user action>` has type `fn(LexerHandle) -> LexerAction<Result<Token, UserError>>`. When using rules of this kind, the error type needs to be declared at the beginning of the lexer with the `type Error = UserError;` syntax. When a rule of this kind returns an error, the error is returned to the caller of the lexer's `next` method.
- `<regex>,`: syntactic sugar for `<regex> => |lexer| lexer.continue_(),`. Useful for skipping whitespace.
- `<regex> = <token>,`: syntactic sugar for `<regex> => |lexer| lexer.return_(<token>),`. Useful for matching keywords, punctuation (operators) and delimiters (parens, brackets), as in the sketch below.
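Here is a sketch that uses all four rule forms. The `If`/`Id`/`Int` token variants and the `UserError` type are hypothetical, and it assumes `return_` accepts the `Result` in fallible rules, matching the action type given above:

```rust
#[derive(Debug)]
struct UserError(String);

#[derive(Debug, PartialEq, Eq)]
enum Token {
    If,
    Id(String),
    Int(i64),
}

lexer! {
    Lexer -> Token;

    type Error = UserError;

    // `<regex>,`: skip whitespace
    [' ' '\t' '\n']+,

    // `<regex> = <token>,`: keywords, punctuation, delimiters
    "if" = Token::If,

    // `<regex> => <user action>,`: infallible action
    ['a'-'z']+ => |lexer| {
        let token = Token::Id(lexer.match_().to_owned());
        lexer.return_(token)
    },

    // `<regex> =? <user action>,`: fallible action; errors are
    // returned to the caller of `next`
    ['0'-'9']+ =? |lexer| {
        let result = lexer
            .match_()
            .parse::<i64>()
            .map(Token::Int)
            .map_err(|e| UserError(e.to_string()));
        lexer.return_(result)
    },
}
```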
The `lexer` macro generates a few types with names derived from the lexer name and the types specified by the user. If the lexer type is declared as `Lexer(State) -> Token` at the beginning of the lexer, the generated types are:

- `LexerAction`: this is the type returned by user actions. You don't need to worry about the details of this type, as the handle type has methods for generating `LexerAction`s.
- `LexerRule`: see the `LexerHandle::switch` method below.
- `LexerHandle`: this type is the argument type of user actions. It provides methods for manipulating user and lexer states, and getting the current match. The API is:
  - `fn match_(&self) -> &str`: returns the current match
  - `fn peek(&mut self) -> Option<char>`: looks ahead one character
  - `fn state(&mut self) -> &mut <user state type>`: returns a mutable reference to the user state
  - `fn return_(&self, token: <user token type>) -> LexerAction`: returns the passed token as a match
  - `fn continue_(&self) -> LexerAction`: ignores the current match and continues lexing in the same lexer state. Useful for skipping whitespace and comments.
  - `fn switch(&mut self, rule: LexerRule) -> LexerAction`: used for switching between lexer states. `LexerRule` is an enum with a variant for each rule set name, for example `LexerRule::Init`. See the stateful lexer example below.
  - `fn switch_and_return(&mut self, rule: LexerRule, token: <user token type>) -> LexerAction`: switches to the given lexer state and returns the given token
  - `fn reset_match(&mut self)`: resets the current match, e.g. calling `match_()` right after `reset_match()` returns an empty string

Here's an example lexer that counts the number of `=`s that appear between two `[`s:
```rust
lexer! {
    Lexer(usize) -> usize;

    rule Init {
        ' ',                                             // line 5

        '[' => |lexer| {
            *lexer.state() = 0;                          // line 8
            lexer.switch(LexerRule::Count)               // line 9
        },
    }

    rule Count {
        '=' => |lexer| {
            let n = *lexer.state();
            *lexer.state() = n + 1;                      // line 16
            lexer.continue_()                            // line 17
        },

        '[' => |lexer| {
            let n = *lexer.state();
            lexer.switch_and_return(LexerRule::Init, n)  // line 22
        },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(lexer.next(), Some(Ok((0, 0, 2))));
assert_eq!(lexer.next(), Some(Ok((3, 1, 6))));
assert_eq!(lexer.next(), Some(Ok((7, 2, 11))));
assert_eq!(lexer.next(), None);
```
Initially (in the `Init` rule set) we skip spaces (line 5). When we see a `[` we initialize the user state (line 8) and switch to the `Count` state (line 9). In `Count`, each `=` increments the user state by one (line 16) and skips the match (line 17). A `[` in the `Count` state returns the current count and switches to the `Init` state (line 22).
lexgen's implementation should be fairly standard. Each rule set is compiled to a separate NFA. NFAs are then compiled to DFAs. The DFAs are added to the same DFA type, but there are no transitions between nodes of different DFAs: transitions between DFAs are done by user actions, using the `switch` method of lexer handles, as described above.
Generated code for a DFA is basically a loop that iterates over characters of the input string:
```
loop {
    match <current DFA state> {
        <state> => match <next character> {
            ... // other characters expected in this state
            _ => ... // for an accepting state, run the user
                     // action; for a non-accepting state, fail
        },
        ... // same stuff for other DFA states
    }
}
```