It is easier to write things out than to read them in, since more things can go wrong. The read may fail, the text may not be valid UTF-8, the number may be malformed or simply out of range.
Lexical scanners split a stream of characters into tokens. Tokens are returned by repeatedly calling the `get` method of `Scanner` (which returns `Token::End` if no tokens are left), or by iterating over the scanner. They represent numbers, characters, identifiers, or single/double-quoted strings. There is also `Token::Error` to indicate a badly formed token.
This lexical scanner makes some assumptions, such as that a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings. All whitespace is ignored. It's intended for processing generic structured data, rather than code.
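For instance, a small sketch of that rule (it assumes `Token::Error` wraps an error value):

```rust
// "2d" is rejected: a number may not be directly followed by a letter
let mut scan = Scanner::new("2d");
assert!(matches!(scan.get(), Token::Error(_)));
```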
For example, the string "hello 'dolly' * 42" will be broken into four tokens:
```rust
extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```

To extract the values, use code like this:
```rust
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
```
`Scanner` implements `Iterator`. If you just wanted to extract the words from a string, then filtering with `as_iden` will do the trick, since it returns `Option<String>`.
```rust
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
```
Using `as_number` instead, you can use this strategy to extract all the numbers from a document, ignoring all other structure. The `scan.rs` example shows the tokens that would be generated by parsing the given string on the command-line.
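For instance, a sketch assuming `as_number` returns `Option<f64>` for both integer and float tokens:

```rust
let s = Scanner::new("10 green bottles, 4.5 litres each");
let v: Vec<_> = s.filter_map(|t| t.as_number()).collect();
assert_eq!(v,&[10.0,4.5]);
```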
This iterator only stops at `Token::End` - you can handle `Token::Error` yourself.
Usually it's important not to ignore structure. Say we have input strings that look like `(WORD) = NUMBER`:
```rust
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
```
Any of these calls may fail!
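A minimal sketch of how this composes into a fallible function (`parse_line` is a made-up name; it assumes `scanlex::ScanError` as the error type):

```rust
use scanlex::{Scanner,ScanError};

fn parse_line(text: &str) -> Result<(String,f64),ScanError> {
    let mut scan = Scanner::new(text);
    scan.skip_chars("(")?;
    let word = scan.get_iden()?;
    scan.skip_chars(")=")?;
    let num = scan.get_number()?;
    Ok((word,num))
}
```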
It is a common pattern to create a scanner for each line of text read from a readable source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this.
```rust
use scanlex::ScanLines;
use std::fs::File;

let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
```
A more serious example (taken from the tests) is parsing JSON:
```rust
use std::collections::HashMap;

type JsonArray = Vec<Value>;
type JsonObject = HashMap<String,Value>;

pub enum Value { Str(String), Num(f64), Bool(bool), Arr(JsonArray), Obj(JsonObject), Null }

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    // ...
}
```
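To give the flavour, here is a hedged sketch of how the body of `scan_json` might dispatch on tokens (the match arms are illustrative, not the crate's test code; it assumes `ScanError::new` builds an error from a message):

```rust
fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    match scan.get() {
        Token::Str(s) => Ok(Value::Str(s)),
        Token::Num(x) => Ok(Value::Num(x)),
        Token::Int(n) => Ok(Value::Num(n as f64)),
        Token::Iden(w) => match w.as_str() {
            "true" => Ok(Value::Bool(true)),
            "false" => Ok(Value::Bool(false)),
            "null" => Ok(Value::Null),
            w => Err(ScanError::new(&format!("unexpected word {}",w))),
        },
        // '[' and '{' would recurse here to build Arr and Obj values
        t => Err(ScanError::new(&format!("unexpected token {:?}",t))),
    }
}
```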
(This is of course an Illustrative Example. JSON is a solved problem.)
With `no_float` you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the existing rules are too strict - e.g. "2d" is fine in `no_float` mode, but an error in the default mode. `chrono-english` uses this mode to parse date expressions.
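A sketch, assuming `no_float` is a builder-style method on `Scanner` and that "2d" then splits into a number and an identifier:

```rust
let mut scan = Scanner::new("2d").no_float();
assert_eq!(scan.get(),Token::Int(2));
assert_eq!(scan.get(),Token::Iden("d".into()));
```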
With `line_comment` you provide a character; after this character, the rest of the current line will be ignored.
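A sketch, again assuming a builder-style method (here `#` starts a comment):

```rust
let mut scan = Scanner::new("42 # the answer").line_comment('#');
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```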