Logos logo

Logos

Test Crates.io version shield Docs Crates.io license shield

Create ridiculously fast Lexers.

Logos has two goals:

To achieve those, Logos:

Example

```rust use logos::Logos;

#[derive(Logos, Debug, PartialEq)] #[logos(skip r"[ \t\n\f]+")] // Ignore this regex pattern between tokens enum Token { // Tokens can be literal strings, of any length. #[token("fast")] Fast,

 #[token(".")]
 Period,

 // Or regular expressions.
 #[regex("[a-zA-Z]+")]
 Text,

}

fn main() { let mut lex = Token::lexer("Create ridiculously fast Lexers.");

 assert_eq!(lex.next(), Some(Ok(Token::Text)));
 assert_eq!(lex.span(), 0..6);
 assert_eq!(lex.slice(), "Create");

 assert_eq!(lex.next(), Some(Ok(Token::Text)));
 assert_eq!(lex.span(), 7..19);
 assert_eq!(lex.slice(), "ridiculously");

 assert_eq!(lex.next(), Some(Ok(Token::Fast)));
 assert_eq!(lex.span(), 20..24);
 assert_eq!(lex.slice(), "fast");

 assert_eq!(lex.next(), Some(Ok(Token::Text)));
 assert_eq!(lex.slice(), "Lexers");
 assert_eq!(lex.span(), 25..31);

 assert_eq!(lex.next(), Some(Ok(Token::Period)));
 assert_eq!(lex.span(), 31..32);
 assert_eq!(lex.slice(), ".");

 assert_eq!(lex.next(), None);

} ```

Callbacks

Logos can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:

```rust use logos::{Logos, Lexer};

// Note: callbacks can return Option or Result fn kilo(lex: &mut Lexer) -> Option { let slice = lex.slice(); let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k' Some(n * 1_000) }

fn mega(lex: &mut Lexer) -> Option { let slice = lex.slice(); let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm' Some(n * 1000000) }

#[derive(Logos, Debug, PartialEq)] #[logos(skip r"[ \t\n\f]+")] enum Token { // Callbacks can use closure syntax, or refer // to a function defined elsewhere. // // Each pattern can have it's own callback. #[regex("[0-9]+", |lex| lex.slice().parse().ok())] #[regex("[0-9]+k", kilo)] #[regex("[0-9]+m", mega)] Number(u64), }

fn main() { let mut lex = Token::lexer("5 42k 75m");

 assert_eq!(lex.next(), Some(Ok(Token::Number(5))));
 assert_eq!(lex.slice(), "5");

 assert_eq!(lex.next(), Some(Ok(Token::Number(42_000))));
 assert_eq!(lex.slice(), "42k");

 assert_eq!(lex.next(), Some(Ok(Token::Number(75_000_000))));
 assert_eq!(lex.slice(), "75m");

 assert_eq!(lex.next(), None);

} ```

Logos can handle callbacks with following return types:

| Return type | Produces | |------------------------|-----------------------------------------------------------------------------------------------------| | () | Ok(Token::Unit) | | bool | Ok(Token::Unit) or Err(<Token as Logos>::Error::default()) | | Result<(), E> | Ok(Token::Unit) or Err(<Token as Logos>::Error::from(err)) | | T | Ok(Token::Value(T)) | | Option<T> | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::default()) | | Result<T, E> | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::from(err)) | | [Skip] | skips matched input | | [Filter<T>] | Ok(Token::Value(T)) or skips matched input | | [FilterResult<T, E>] | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::from(err)) or skips matched input |

Callbacks can be also used to do perform more specialized lexing in place where regular expressions are too limiting. For specifics look at Lexer::remainder and Lexer::bump.

Errors

By default, Logos uses () as the error type, which means that it doesn't store any information about the error. This can be changed by using #[logos(error = T)] attribute on the enum. The type T can be any type that implements Clone, PartialEq, Default and From<E> for each callback's error type E.

Token disambiguation

Rule of thumb is:

If any two definitions could match the same input, like fast and [a-zA-Z]+ in the example above, it's the longer and more specific definition of Token::Fast that will be the result.

This is done by comparing numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops or optional blocks are ignored, while alternations count the shortest alternative:

If two definitions compute to the same priority and can match the same input Logos will fail to compile, point out the problematic definitions, and ask you to specify a manual priority for either of them.

For example: [abc]+ and [cde]+ both can match sequences of c, and both have priority of 1. Turning the first definition to #[regex("[abc]+", priority = 2)] will allow for tokens to be disambiguated again, in this case all sequences of c will match [abc]+.

How fast?

Ridiculously fast!

norust test identifiers ... bench: 647 ns/iter (+/- 27) = 1204 MB/s test keywords_operators_and_punctators ... bench: 2,054 ns/iter (+/- 78) = 1037 MB/s test strings ... bench: 553 ns/iter (+/- 34) = 1575 MB/s

Acknowledgements

Thank you

Logos is very much a labor of love. If you find it useful, consider getting me some coffee. ☕

License

This code is distributed under the terms of both the MIT license and the Apache License (Version 2.0), choose whatever works for you.

See LICENSE-APACHE and LICENSE-MIT for details.