build non-recursive grammars
Pidgin is a Rust port of list_matcher. It can be used to generate simple, non-recursive grammars which allow one to obtain structured matches: parse trees.
Pidgin's grammars are implemented underlyingly as regular expressions with
named matching groups. Rust's regular expression engine is very fast but it has
a limitation which makes parsing hierarchically structured patterns with
repeating components difficult: you cannot use a particular match name more
than once in a pattern. This makes a simple grammar such as
foo -> bar baz | baz bar
bar -> '1'
baz -> '2'
problematic. The foo
rule uses both bar
and baz
twice.
Pidgin allows you to work around this restriction by managing group renaming.
Another disadvantage of regular expressions is that the more expressive they are the more difficult they are to read. Pidgin allows you to construct expressive regular expressions programmatically such that the intention of the pattern doesn't become obscured.
```rust use pidgin::Pidgin; use std::error::Error;
fn experiment() -> Result<(), Box
// now test it
let matcher = date.matcher()?;
// we let whitespace vary
assert!(matcher.is_match(" June 6, 1969 "));
// we made it case-insensitive
assert!(matcher.is_match("june 6, 1969"));
// but we want to respect word boundaries
assert!(!matcher.is_match("jejune 6, 1969"));
// we can inspect the parse tree
let m = matcher.parse("2018/10/6").unwrap();
assert!(m.name("numeric_date").is_some());
assert_eq!(m.name("year").unwrap().as_str(), "2018");
let m = matcher.parse("Friday").unwrap();
assert!(!m.name("numeric_date").is_some());
assert!(m.name("weekday").is_some());
// still more crazy things we allow
assert!(matcher.is_match("F"));
assert!(matcher.is_match("friday"));
assert!(matcher.is_match("Fri"));
// but we said single-letter days had to be capitalized
assert!(!matcher.is_match("f"));
println!("{}", date);
Ok(())
} ```
The final println!
in the example above produces a BNF-esque description of the
grammar:
TOP := (?i) \b{monthday}(?-i:\s+){month}(?-i:\s+){year}|{month}(?-i:\s+){monthday}(:?,(?-i:\s+){year})?|{weekday}(:?,(?-i:\s+){month}(?-i:\s+){monthday},(?-i:\s+){year})?|{numeric_date}\b
monthday := (?i) \b[4-9]|30?|1[0-9]?|2[0-9]?\b
month := (?i) \bMa(?:y|r(:?ch)?)|Oct(:?ober)?|Dec(:?ember)?|Feb(:?ruary)?|Nov(:?ember)?|Sep(:?tember)?|A(?:pr(:?il)?|ug(:?ust)?)|J(?:u(?:ly?|ne?)|an(:?uary)?)\b
year := [12][0-9]{3}|[0-9]{2}
weekday := (?i) (?:\b(?:Fr(:?i(:?day)?)?|Mo(:?n(:?day)?)?|We(:?d(:?nesday)?)?|S(?:u(:?n(:?day)?)?|a(:?t(:?urday)?)?)|T(?:u(:?e(:?sday)?)?|h(:?u(:?rsday)?)?))\b)|(?-i:\b[FMR-UW]\b)
numeric_date := (?i) \b{year}(?-i:\s*)(?:\-(?-i:\s*){numeric_months}(?-i:\s*)\-|/(?-i:\s*){numeric_months}(?-i:\s*)/)(?-i:\s*){numeric_days}|{numeric_days}(?-i:\s*)(?:\-(?-i:\s*){numeric_months}(?-i:\s*)\-|/(?-i:\s*){numeric_months}(?-i:\s*)/)(?-i:\s*){year}|{numeric_months}(?-i:\s*)(?:\-(?-i:\s*){numeric_days}(?-i:\s*)\-|/(?-i:\s*){numeric_days}(?-i:\s*)/)(?-i:\s*){year}\b
numeric_days := (?i) \b[4-9]|30?|0[1-9]|1[0-9]?|2[0-9]?\b
numeric_months := (?i) \b[2-9]|1[01]?|0[1-9]\b
Because you must define sub-grammars before you use them, and because Rust's
regular expression engine and formalism cannot represent it in any case, you
cannot produce a recursive grammar in Pidgin. Something like this is impossible:
XP -> XP conj XP
Typically a grammar is defined top down like so:
TOP -> foo | bar
foo -> baz plugh | xyzzy
bar -> xyzzy baz plugh
baz -> '1' | '2'
plugh -> 'cat' | 'dog'
xyzzy -> 'green' | 'blue' | 'grue'
Because in Pidgin sub-grammars need to be defined before they can be used in the
definition of other rules, construction has to go bottom up.
I am planning to write Pidgin macros which will allow one to define a grammar top-down, but this is currently still in the pie-in-the-sky stage.
Pidgin does not generate unbounded repetitions and it only attempts to match
the examples you present it. Sometimes this is inconvenient: you want to match
all phone numbers but you don't want to provide all phone numbers as patterns,
say. To work around this you can supply a "foreign rule". This is just a regular
expression you have acquired outside of pidgin.
rust
pidgin.foreign_rule("local_us_phone", r"\d{3}-?\d{4}").unwrap();
The unwrap
is necessary in this case because your expression might not
compile, so foreign_rule
returns a Result<(),regex::Error>
so you can
handle this exception.
This is all fine, but suppose you provide a foreign rule with a named capture:
rust
pidgin.foreign_rule("foo", "(?<foo>bar)").unwrap();
This will compile when you define the rule, but if you use foo
in more than
one place in your grammar, this will cause a regex::Error
when you attempt
to generate a pidgin::Matcher
from your grammar.
Pidgin attempts alternations easiest to hardest. In effect it always tries to
pick low-hanging fruit. This means if told to match "elephant" and "cat" it
will generate the pattern cat|elephant
. Generally this is what you want, but
it may happen that you want to prioritize rules differently. Pidgin provides
a mechanism for this: if you at the same rule more than once, this makes the
rule an alternation with the alternate order you specify:
rust
pidgin.rule("foo", &bar);
pidgin.rule("foo", &baz);
This is all fine and good, but it means you cannot replace one rule with
another simply by redefining it. If you wish to do this, you must first remove
the rule:
rust
pidgin.remove_rule("foo");
Because the general pattern of use for Pidgin is that you compile a bunch of
sub-grammars and then use them in other grammars, and because you generall don't
want the alternates in one sub-grammar to be included in the next you compile,
the compile
method clears the phrase list. It does not clear the rule list.
You usually don't want to clear the rule list, but there is a clear
method
which clears both.
The clear
method does not clear any flags you have set. If you want that as
well, you're better off just building a fresh Pidgin.
The principle motivation behind Pidgin is simply to produce something like an
abstract syntax tree one can use to better understand matched text, but the
regular expressions it generates generally match as well as or better than
naively constructed regular expressions represented as alternations of
expressions. There is a simple benchmark suite included in the benches/
directory demonstrating this for certain cases. The suite compares matching
speed for "naive" regular expressions of the form foo|bar|baz
to those
generated by Pidgin, which involve no backtracking and, in this case, would
be something like foo|ba[rz]
. The benchmark suite compares non-matching to
matching and bounded to unbounded patterns, where a bounded pattern must match
from the first character to the last and an unbounded pattern must match
somewhere in the string searched. The mean match/non-match times for these 8
cases were
|pidgin|match|bounded|time| |:------:|:-----:|:-------:|:----:| |✓|✓| |5.1287 ms| | |✓| |27.151 ms| |✓| | |15.757 ms| | | | |168.24 ms| |✓|✓|✓|3.0683 ms| | |✓|✓|4.0253 ms| |✓| |✓|5.0138 ms| | | |✓|1.5124 ms|
The full report can be generated by running
bash
cargo bench