Splitting output from `Read` types with regular expressions.
The chief type in this crate is the `ByteChunker`, which wraps a type that
implements `Read` and iterates over chunks of its byte stream delimited by a
supplied regular expression. The following example reads from the standard
input and prints word counts:
```rust
use std::collections::BTreeMap;
use std::error::Error;

use regex_chunker::ByteChunker;

fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();

    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
```
See the crate documentation for more details.
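
To make the idea of "chunks delimited by a pattern" concrete, here is a minimal, std-only sketch of the same technique. It is not this crate's implementation: `DelimChunker`, `split_words`, and the `block_size` parameter are hypothetical names invented for illustration, and a fixed set of delimiter bytes stands in for the regular expression. The sketch's handling of leading and trailing delimiters is its own choice and may differ from `ByteChunker`'s.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for illustration only; not a regex_chunker type.
/// Yields the bytes found between runs of delimiter bytes.
struct DelimChunker<R: Read> {
    source: R,
    delims: Vec<u8>,
    block: Vec<u8>, // scratch space for each read() call
    buf: Vec<u8>,   // bytes read from the source but not yet yielded
    eof: bool,
}

impl<R: Read> DelimChunker<R> {
    fn new(source: R, delims: &[u8], block_size: usize) -> Self {
        Self {
            source,
            delims: delims.to_vec(),
            block: vec![0; block_size.max(1)],
            buf: Vec::new(),
            eof: false,
        }
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // Naively rescan the whole buffer for the first delimiter byte.
            let start = self.buf.iter().position(|b| self.delims.contains(b));
            if let Some(start) = start {
                // Length of the delimiter run, if it ends inside the buffer.
                let run = self.buf[start..]
                    .iter()
                    .position(|b| !self.delims.contains(b));
                if let Some(n) = run {
                    let rest = self.buf.split_off(start + n);
                    let mut chunk = std::mem::replace(&mut self.buf, rest);
                    chunk.truncate(start); // drop the delimiter run
                    return Some(Ok(chunk));
                }
                if self.eof {
                    let mut chunk = std::mem::take(&mut self.buf);
                    chunk.truncate(start);
                    return Some(Ok(chunk));
                }
                // Otherwise the run may continue in the next block: read more.
            } else if self.eof {
                if self.buf.is_empty() {
                    return None;
                }
                // Whatever is left after the last delimiter is the final chunk.
                return Some(Ok(std::mem::take(&mut self.buf)));
            }

            match self.source.read(&mut self.block) {
                Ok(0) => self.eof = true,
                Ok(n) => self.buf.extend_from_slice(&self.block[..n]),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Some(Err(e)),
            }
        }
    }
}

/// Splits `text` into lowercase words, reading `block_size` bytes at a time.
fn split_words(text: &str, block_size: usize) -> io::Result<Vec<String>> {
    DelimChunker::new(text.as_bytes(), b" \r\n.,!?:;/", block_size)
        .map(|chunk| chunk.map(|bytes| String::from_utf8_lossy(&bytes).to_lowercase()))
        .collect()
}
```

Calling `split_words` with a deliberately small `block_size` exercises the case where a delimiter run straddles two reads, which is the main subtlety in any buffered splitter of this kind.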
This is, as of yet, an essentially naive implementation. What can be done to optimize performance?
The next major version will support async versions of the `*Chunker` types
that read from `tokio::io::AsyncRead` types and produce a `Stream` of chunks.