build non-recursive grammars
Pidgin is a Rust port of list_matcher. It can be used to generate simple, non-recursive grammars which allow one to obtain structured matches: parse trees.
Pidgin's grammars are implemented underlyingly as regular expressions with
named matching groups. Rust's regular expression engine is very fast but it has
a limitation which makes parsing hierarchically structured patterns with
repeating components difficult: you cannot use a particular match name more
than once in a pattern. This makes a simple grammar such as
foo -> bar baz | baz bar
bar -> '1'
baz -> '2'
problematic. The foo
rule uses both bar
and baz
twice.
Pidgin allows you to work around this restriction by managing group renaming.
Another disadvantage of regular expressions is that the more expressive they are the more difficult they are to read. Pidgin allows you to construct expressive regular expressions programmatically such that the intention of the pattern doesn't become obscured.
```rust use pidgin::Pidgin;
// set up the initial building for our pidgin grammar let mut p = Pidgin::new() .enclosed(true) .wordbound() .caseinsensitive(true); // you can build word lists and add them in let weekdays = vec![ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", ]; p.add(&weekdays); // various abbreviations for s in weekdays { p.addstr(&s[0..3]); p.addstr(&s[0..2]); } let g = p.compile(); p.rule("weekday", &g); // for these ones we care about case p = p.caseinsensitive(false); // you can build and compile all in one go let g = p.grammar(&vec!["M", "T", "W", "R", "F", "S", "U"]); // add a case to an existing rule p.rule("weekday", &g); // back to case insensitivity p = p.caseinsensitive(true); // we can add words to a rule piecemeal let months = vec![ "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", ]; p.add(&months); for m in months { p.addstr(&m[0..3]); } let g = p.compile(); p.rule("month", &g); for i in 1..31 { p.addstr(i.tostring().asstr()); } let g = p.compile(); p.rule("monthday", &g); for i in 1..31 { p.addstr(i.tostring().asstr()); // allow both 1 and 01, etc. // adding a word such as "10" twice has no ill effect p.addstr(&format!("{:02}", i)); } let g = p.compile(); p.rule("numericdays", &g); for i in 1..12 { p.addstr(i.tostring().asstr()); p.addstr(&format!("{:02}", i)); } let g = p.compile(); p.rule("numericmonths", &g); // sometimes you may need to add in a handwritten regex // take care with named groups -- names cannot be repeated p.foreignrule("year", "[12][0-9]{3}|[0-9]{2}").unwrap(); // for the following patterns make whitespace optional p = p.normalizewhitespace(false); let g = p.grammar(&vec![ "year / numericmonths / numericdays", "numericmonths / numericdays / year", "numericdays / numericmonths / year", "year - numericmonths - numericdays", "numericmonths - numericdays - year", "numericdays - numericmonths - year", ]); p.rule("numericdate", &g); // for the remaining rules, whitespace is required if present p = p.normalizewhitespace(true); // and finally, the pattern we've been working towards let date = p.grammar(&vec![ "weekday, month monthday, year", "month monthday", "weekday", "monthday month year", "month monthday, year", "numeric_date", ]);
// now test it
let matcher = date.matcher().unwrap();
// we let whitespace vary assert!(matcher.ismatch(" June 6, 1969 ")); // we made it case-insensitive assert!(matcher.ismatch("june 6, 1969")); // but we want to respect word boundaries assert!(!matcher.ismatch("jejune 6, 1969")); // we can inspect the parse tree let m = matcher.parse("2018/10/6").unwrap(); assert!(m.name("numericdate").issome()); asserteq!(m.name("year").unwrap().asstr(), "2018"); let m = matcher.parse("Friday").unwrap(); assert!(!m.name("numericdate").issome()); assert!(m.name("weekday").issome()); // still more crazy things we allow assert!(matcher.ismatch("F")); assert!(matcher.ismatch("friday")); assert!(matcher.ismatch("Fri")); // but we said single-letter days had to be capitalized assert!(!matcher.ismatch("f")); ```
Because you must define sub-grammars before you use them, and because Rust's
regular expression engine and formalism cannot represent it in any case, you
cannot produce a recursive grammar in Pidgin. Something like this is impossible:
XP -> XP conj XP
Currently you can produce something like this
foo -> bar{1,3}
but not
foo -> bar+
I am planning to introduce Pidgin macros which will allow this, but I am not
sure when those will arrive.
Typically a grammar is defined top down like so:
TOP -> foo | bar
foo -> baz plugh | xyzzy
bar -> xyzzy baz plugh
baz -> '1' | '2'
plugh -> 'cat' | 'dog'
xyzzy -> 'green' | 'blue' | 'grue'
Because in Pidgin sub-grammars need to be defned before they can be used in the
definition of other rules, construction has to go bottom up.
I am planning to write Pidgin macros which will allow one to define a grammar top-down, but this is currently still in the pie-in-the-sky stage.
Pidgin does not generate unbounded repetitions and it only attempts to match
the examples you present it. Sometimes this is inconvenient: you want to match
all phone numbers but you don't want to provide all phone numbers as patterns,
say. To work around this you can supply a "foreign rule". This is just a regular
expression you have acquired outside of pidgin.
rust
pidgin.foreign_rule("local_us_phone", r"\d{3}-?\d{4}").unwrap();
The unwrap
is necessary in this case because your expression might not
compile, so foreign_rule
returns a Result<(),regex::Error>
so you can
handle this exception.
This is all fine, but suppose you provide a foreign rule with a named capture:
rust
pidgin.foreign_rule("foo", "(?<foo>bar)").unwrap();
This will compile when you define the rule, but if you use foo
in more than
one place in your grammar, this will cause a regex::Error
when you attempt
to generate a pidgin::Matcher
from your grammar.
Pidgin attempts alternations easiest to hardest. In effect it always tries to
pick low-hanging fruit. This means if told to match "elephant" and "cat" it
will generate the pattern cat|elephant
. Generally this is what you want, but
it may happen that you want to prioritize rules differently. Pidgin provides
a mechanism for this: if you at the same rule more than once, this makes the
rule an alternation with the alternate order you specify:
rust
pidgin.rule("foo", &bar);
pidgin.rule("foo", &baz);
This is all fine and good, but it means you cannot replace one rule with
another simply by redefining it. If you wish to do this, you must first remove
the rule:
rust
pidgin.remove_rule("foo");
Because the general pattern of use for Pidgin is that you compile a bunch of
sub-grammars and then use them in other grammars, and because you generall don't
want the alternates in one sub-grammar to be included in the next you compile,
the compile
method clears the phrase list. It does not clear the rule list.
You usually don't want to clear the rule list, but there is a clear
method
which clears both.
The clear
method does not clear any flags you have set. If you want that as
well, you're better off just building a fresh Pidgin.
The principle motivation behind Pidgin is simply to produce something like an
abstract syntax tree one can use to better understand matched text, but the
regular expressions it generates generally match as well as or better than
naively constructed regular expressions represented as alternations of
expressions. There is a simple benchmark suite included in the benches/
directory demonstrating this for certain cases. The suite compares matching
speed for "naive" regular expressions of the form foo|bar|baz
to those
generated by Pidgin, which involve no backtracking and, in this case, would
be something like foo|ba[rz]
. The benchmark suite compares non-matching to
matching and bounded to unbounded patterns, where a bounded pattern must match
from the first character to the last and an unbounded pattern must match
somewhere in the string searched. The mean match/non-match times for these 8
cases were
|pidgin|match|bounded|time| |:------:|:-----:|:-------:|:----:| |✓|✓| |5.1287 ms| | |✓| |27.151 ms| |✓| | |15.757 ms| | | | |168.24 ms| |✓|✓|✓|3.0683 ms| | |✓|✓|4.0253 ms| |✓| |✓|5.0138 ms| | | |✓|1.5124 ms|
The full report can be generated by running
bash
cargo bench