Splitting the output of `Read` types with regular expressions.

The chief type in this crate is the `ByteChunker`, which wraps a type that implements `Read` and iterates over chunks of its byte stream delimited by a supplied regular expression. The following example reads from the standard input and prints word counts:
```rust
use std::collections::BTreeMap;
use std::io::stdin;
use regex_chunker::ByteChunker;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = stdin();
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();

    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
```
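To make the delimiter semantics concrete, here is a rough, std-only analogue of what the chunker does by default: split input on runs of "between-word" bytes and drop the delimiters themselves. (This is only an illustration; the real `ByteChunker` works incrementally over a `Read` stream with a full regex, and this `split_on` helper is hypothetical.)

```rust
// Split a byte slice on any byte in `delims`, dropping empty chunks,
// as a naive stand-in for splitting on the regex `[ "\r\n.,!?:;/]+`.
fn split_on(input: &[u8], delims: &[u8]) -> Vec<Vec<u8>> {
    input
        .split(|b| delims.contains(b))
        .filter(|chunk| !chunk.is_empty())
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let chunks = split_on(b"hello, world!\n", b" \"\r\n.,!?:;/");
    assert_eq!(chunks, vec![b"hello".to_vec(), b"world".to_vec()]);
    println!("{} chunks", chunks.len());
}
```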
The `async` feature enables the `stream` submodule, which contains an asynchronous version of `ByteChunker` that wraps a `tokio::io::AsyncRead` type and produces a `Stream` of byte chunks.
There isn't yet an async `StringChunker`. The next update will introduce an "adaptor" trait for transforming the output of a `ByteChunker` into an arbitrary type; the `StringChunker` will then be reimplemented that way.
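One possible shape for such an adaptor trait is sketched below. This is purely speculative: the trait and type names (`Adapter`, `StringAdapter`) are hypothetical, not part of the crate's current API.

```rust
// Hypothetical sketch of the proposed adaptor trait: it consumes the
// raw byte chunk a ByteChunker produces and turns it into an
// arbitrary output type.
pub trait Adapter {
    type Item;
    fn adapt(&mut self, chunk: &[u8]) -> Self::Item;
}

// A StringChunker could then be a ByteChunker paired with an adapter
// that lossily decodes each chunk as UTF-8.
pub struct StringAdapter;

impl Adapter for StringAdapter {
    type Item = String;
    fn adapt(&mut self, chunk: &[u8]) -> String {
        String::from_utf8_lossy(chunk).into_owned()
    }
}

fn main() {
    let mut a = StringAdapter;
    assert_eq!(a.adapt(b"hello"), "hello");
    println!("ok");
}
```

Because `adapt` takes `&mut self`, an adapter could also carry state (for example, a streaming decoder) between chunks.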
This is, as yet, an essentially naive implementation. Remaining work:

  * Investigate what can be done to optimize performance.
  * Add a trait to arbitrarily transform the output of the `ByteChunker`, expose it, and reimplement the `StringChunker` (including an async version) with it.
If you want to run the tests, you need to first (debug) build `src/bin/slowsource.rs` with the `test` feature enabled:

```sh
$ cargo build --bin slowsource --features test
```
Some of the `stream` module tests run it in a subprocess and use it as a source of bytes.
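The pattern those tests rely on looks roughly like this: spawn a child process with piped stdout and treat that handle as a `Read` source. This sketch uses `echo` as a stand-in for the `slowsource` binary, and simply reads the bytes rather than feeding them to a chunker.

```rust
use std::io::Read;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Spawn a child process whose stdout will serve as a byte source,
    // the way the stream-module tests use the slowsource binary.
    // `echo` stands in here for `target/debug/slowsource`.
    let mut child = Command::new("echo")
        .arg("some test bytes")
        .stdout(Stdio::piped())
        .spawn()?;

    let mut stdout = child.stdout.take().expect("child stdout was piped");
    // In the real tests this handle would be handed to a ByteChunker;
    // here we just drain it into a buffer.
    let mut buf = Vec::new();
    stdout.read_to_end(&mut buf)?;
    child.wait()?;

    assert_eq!(String::from_utf8_lossy(&buf).trim(), "some test bytes");
    println!("ok");
    Ok(())
}
```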