regex-cli

This is a command line tool for interacting with the regex, regex-automata and regex-syntax crates. It enables one to print debug representations of various values, run searches, generate DFAs and deserialization code and perform various regex development tasks such as generating tests.

Installation

Currently regex-cli is not on crates.io and should be installed from this git repository:

$ cargo install --git https://github.com/rust-lang/regex regex-cli

Example: print debug output

The regex-cli command provides a way to print the debug output for most of the principle types in the regex-automata crate. This can be useful for debugging purposes when working on the regex project, or even if you just want a better look at a regex object's internal representation. For example, the following two commands compare and contrast the differences in the NFA for . and (?-u:.):

``` $ regex-cli debug thompson '.' --no-table

thompson::NFA(

000000: binary-union(2, 1) 000001: \x00-\xFF => 0 ^000002: capture(pid=0, group=0, slot=0) => 10 000003: \x80-\xBF => 11 000004: \xA0-\xBF => 3 000005: \x80-\xBF => 3 000006: \x80-\x9F => 3 000007: \x90-\xBF => 5 000008: \x80-\xBF => 5 000009: \x80-\x8F => 5 000010: sparse(\x00-\t => 11, \x0B-\x7F => 11, \xC2-\xDF => 3, \xE0 => 4, \xE1-\xEC => 5, \xED => 6, \xEE-\xEF => 5, \xF0 => 7, \xF1-\xF3 => 8, \xF4 => 9) 000011: capture(pid=0, group=0, slot=1) => 12 000012: MATCH(0)

transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\x7F], 3 => [\x80-\x8F], 4 => [\x90-\x9F], 5 => [\xA0-\xBF], 6 => [\xC0-\xC1], 7 => [\xC2-\xDF], 8 => [\xE0], 9 => [\xE1-\xEC], 10 => [\xED], 11 => [\xEE-\xEF], 12 => [\xF0], 13 => [\xF1-\xF3], 14 => [\xF4], 15 => [\xF5-\xFF], 16 => [EOI]) ) ```

And now for (?-u:.):

``` $ regex-cli debug thompson -b '(?-u:.)' --no-table

thompson::NFA(

000000: binary-union(2, 1) 000001: \x00-\xFF => 0 ^000002: capture(pid=0, group=0, slot=0) => 3 000003: sparse(\x00-\t => 4, \x0B-\xFF => 4) 000004: capture(pid=0, group=0, slot=1) => 5 000005: MATCH(0)

transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\xFF], 3 => [EOI]) ) ```

To make things a bit more concise, we use --no-table to omit some extra metadata about the size of the NFA and the time required to build it.

In the second example, we also pass the -b/--no-utf8-syntax flag. Without it, the command returns an error because patterns are compiled with default settings. The default setting is to forbid any pattern that can possibly match invalid UTF-8. Since (?-u:.) matches any byte except for \n, it can match invalid UTF-8. Thus, you have to say, "I am explicitly okay with matching invalid UTF-8."

Example: execute a search

This command shows how to run a search with multiple patterns with each containing capture groups. The output shows the value of each matching group.

$ regex-cli find capture meta -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' -y 'best_album="Blow Your Face Out"' parse time: 81.541µs translate time: 52.035µs build meta time: 805.696µs search time: 426.391µs total matches: 1 0:{ 0: 0..31/best_album="Blow\x20Your\x20Face\x20Out", 1/key: 0..10/best_album, 2/val: 12..30/Blow\x20Your\x20Face\x20Out }

In this case, meta refers to the regex engine. It can be a number of other things, including lite for testing the regex-lite crate. Also, capture refers to the kind of search. You can also just ask for the match which will print the overall match and not the capture groups:

$ regex-cli find match meta -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' -y 'best_album="Blow Your Face Out"' parse time: 67.067µs translate time: 40.005µs build meta time: 586.163µs search time: 291.633µs total matches: 1 0:0:31:best_album="Blow\x20Your\x20Face\x20Out"

Since not all regex engines support capture groups, using match will open up the ability to test other regex engines such as hybrid.

Finally, the -p/--pattern flag specifies a pattern and the -y/--haystack flag provides a haystack to search as a command line argument. One can also omit the -y/--haystack flag and provide a file path to search instead:

$ echo 'best_album="Blow Your Face Out"' > haystack $ regex-cli find match hybrid -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' haystack parse time: 60.278µs translate time: 43.832µs compile forward nfa time: 462.148µs compile reverse nfa time: 56.275µs build forward hybrid time: 6.532µs build reverse hybrid time: 4.089µs build regex time: 4.899µs cache creation time: 18.59µs search time: 54.653µs total matches: 1 0:0:31:best_album="Blow\x20Your\x20Face\x20Out"

Example: serialize a DFA

One particularly useful command in regex-cli is regex-cli generate serialize. It takes care of generating and writing a fully compiled DFA to a file, and then producing Rust code that deserializes it. The command line provides oodles of options, including all options found in the regex-automata crate for building the DFA in code.

Let's walk through a complete end-to-end example. We assume regex-cli is already installed per instructions above. Let's start with an empty binary Rust project:

$ mkdir regex-dfa-test $ cd regex-dfa-test $ cargo init --bin

Now add a dependency on regex-automata. Technically, the only feature that needs to be enabled for this example is dfa-search, but we include std as well to get some conveniences like std::error::Error implementations and also optimizations. But you can drop std and just use alloc or even drop alloc too altogether if necessary.

$ cargo add regex-automata --features std,dfa-search

Now we can generate a DFA with regex-cli. This will create three files: the little endian binary serialization of the DFA, the big endian version and a simple Rust source file for lazily deserializing the DFA via a static into a regex_automata::util::lazy::Lazy:

regex-cli generate serialize sparse dfa \ --minimize \ --shrink \ --start-kind anchored \ --rustfmt \ --safe \ SIMPLE_WORD_FWD \ ./src/ \ "\w"

We pass a number of flags here. There are even more available, and generally speaking, there is at least one flag for each configuration knob available in the library. This means that it should be possible to configure the DFA in any way you might expect to be able to in the code. We can briefly explain the flags we use here though:

The final three positional arguments are as follows:

Once the DFA is generated, you should see three new files in ./src/:

$ ls -l src/ total 32 -rw-rw-r-- 1 andrew users 45 May 28 22:04 main.rs -rw-rw-r-- 1 andrew users 11095 May 30 10:24 simple_word_fwd.bigendian.dfa -rw-rw-r-- 1 andrew users 11095 May 30 10:24 simple_word_fwd.littleendian.dfa -rw-rw-r-- 1 andrew users 711 May 30 10:24 simple_word_fwd.rs

At this point, you just need to add the appropriate mod definition in main.rs and use the DFA:

```rust use regex_automata::{dfa::Automaton, Anchored, Input};

use crate::simplewordfwd::SIMPLEWORDFWD as DFA;

mod simplewordfwd;

fn main() { let input = Input::new("ω").anchored(Anchored::Yes); println!("is a word: {:?}", DFA.trysearchfwd(&input));

let input = Input::new("☃").anchored(Anchored::Yes);
println!("not a word: {:?}", DFA.try_search_fwd(&input));

} ```

And now run the program:

$ cargo run Compiling regex-dfa-test v0.1.0 (/home/andrew/tmp/regex-dfa-test) Finished dev [unoptimized + debuginfo] target(s) in 0.17s Running `target/debug/regex-dfa-test` is a word: Ok(Some(HalfMatch { pattern: PatternID(0), offset: 2 })) not a word: Ok(None)

There are a few other things worth mentioning: