Provides alternatives to `BufRead::read_line` and `BufRead::lines` that allow
getting UTF-8 strings but do not stop on newline delimiters, to avoid loading
large amounts of data into memory when reading files with few newlines.
Add this crate as a dependency in your `Cargo.toml`:
```toml
[dependencies]
utf8-bufread = "0.1.5"
```
This will allow you to use the `BufRead` trait provided by this crate and
automatically implemented on any type implementing `std::io::BufRead`.
This trait provides functions to read UTF-8 strings from a stream, but none of
those functions guarantee that the chunk of data read will end on a newline
delimiter (unlike `BufRead::read_line` or `BufRead::lines`). This allows you to
use buffered readers and `std::io::BufRead`'s API on a large stream without
worrying about loading a huge amount of data into memory if there is no
newline delimiter.
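To make the memory concern concrete, here is a small sketch that uses only the standard library, not this crate. It assumes a 1000-byte input with no newline and an arbitrary 16-byte buffer capacity; the function names are illustrative only. It compares how much data `read_line` accumulates with how much a single `fill_buf` call exposes.

```rust
use std::io::{BufRead, BufReader};

/// Returns how many bytes `read_line` had to store to produce one "line".
fn bytes_held_by_read_line(data: &[u8]) -> usize {
    let mut reader = BufReader::with_capacity(16, data);
    let mut line = String::new();
    // Without a newline, `read_line` keeps appending until EOF.
    reader.read_line(&mut line).unwrap();
    line.len()
}

/// Returns the largest chunk seen when walking the stream one buffer at a time.
fn largest_buffered_chunk(data: &[u8]) -> usize {
    let mut reader = BufReader::with_capacity(16, data);
    let mut largest = 0;
    loop {
        let len = reader.fill_buf().unwrap().len();
        if len == 0 {
            break; // EOF
        }
        largest = largest.max(len);
        reader.consume(len);
    }
    largest
}

fn main() {
    // A stream of ASCII bytes with no newline at all
    let data = vec![b'a'; 1000];
    println!("read_line stored {} bytes", bytes_held_by_read_line(&data));
    println!("largest chunk was {} bytes", largest_buffered_chunk(&data));
}
```

The chunk size is bounded by the reader's buffer capacity, whereas the line size is bounded only by the input.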
The functions of this trait are centered around `BufRead::with_utf8_chunk`,
which takes a closure that is passed the string slice of UTF-8 data read from
the inner reader, and returns an `io::Result` of the number of bytes read, in
the same fashion as most functions of `std::io`'s traits and structs.
The string slice may be of arbitrary length and may stop at any point in the
stream, but will always contain valid UTF-8.
```rust
fn main() {
    use std::io::Cursor;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a string slice
    let mut reader = Cursor::new("The quick fox jumps over the lazy dog");
    let mut o_counter = 0;

    // Counts the number of "o"s in the stream
    loop {
        match reader.with_utf8_chunk(|s| { o_counter += s.matches('o').count() }) {
            Ok(0) | Err(_) => break,
            Ok(_) => continue,
        }
    }
    assert_eq!(3, o_counter);
}
```
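For intuition, a helper in the spirit of `with_utf8_chunk` could be sketched on top of `std::io::BufRead` as follows. This is an assumption about the general approach, not this crate's actual implementation: `with_valid_chunk` and `count_matches` are hypothetical names. The sketch also simplifies by reporting a codepoint split across two buffer fills as an error, whereas a real implementation would need to carry the partial bytes over to the next read.

```rust
use std::io::{self, BufRead, BufReader, Cursor, ErrorKind};
use std::str;

/// Hypothetical helper: passes the longest valid UTF-8 prefix of the
/// reader's current buffer to `f` and returns the number of bytes consumed.
fn with_valid_chunk<R, F>(reader: &mut R, f: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnOnce(&str),
{
    let buf = reader.fill_buf()?;
    if buf.is_empty() {
        return Ok(0); // EOF
    }
    let valid = match str::from_utf8(buf) {
        Ok(s) => s,
        // Nothing valid at the front of the buffer: report the error.
        // (Simplification: this also fires for a codepoint that is merely
        // split across two buffer fills.)
        Err(e) if e.valid_up_to() == 0 => {
            return Err(io::Error::new(ErrorKind::InvalidData, e));
        }
        // Some valid bytes first: hand those over, the error comes next call.
        Err(e) => str::from_utf8(&buf[..e.valid_up_to()])
            .expect("prefix was already validated"),
    };
    let n = valid.len();
    f(valid);
    reader.consume(n);
    Ok(n)
}

/// Counts occurrences of `needle` without ever holding more than one chunk.
fn count_matches<R: BufRead>(reader: &mut R, needle: char) -> io::Result<usize> {
    let mut count = 0;
    while with_valid_chunk(&mut *reader, |s| count += s.matches(needle).count())? != 0 {}
    Ok(count)
}

fn main() -> io::Result<()> {
    let text = "The quick fox jumps over the lazy dog";
    // A deliberately tiny buffer so the text arrives in several chunks
    let mut reader = BufReader::with_capacity(8, Cursor::new(text));
    let count = count_matches(&mut reader, 'o')?;
    println!("found {count} matches");
    Ok(())
}
```

Because the closure only ever sees a borrowed `&str` into the reader's buffer, no chunk needs to be copied or kept around after the call returns.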
The trait also provides functions to append to a provided buffer and to iterate over read chunks.
```rust
fn main() {
    use std::fs::File;
    use std::io::BufReader;
    use utf8_bufread::BufRead;

    // Open our file
    let mut reader = BufReader::new(File::open("myfile.txt").unwrap());
    // The string we'll use to store the text of the file
    let mut text = String::new();
    loop {
        // Loop until EOF
        match reader.read_utf8(&mut text) {
            Ok(0) => break, // EOF
            Ok(_) => continue,
            Err(e) => std::panic::panic_any(e),
        }
    }

    // Do something with text
    // ...
}
```
If a valid UTF-8 codepoint is read, it will always be processed, whether it is
passed to a closure or appended to a provided buffer. If an invalid or
incomplete codepoint is read, the functions of this crate will first process
all the valid bytes read, and a relevant `io::Error` will be returned on the
next call:
```rust
fn main() {
    use std::io::{Cursor, ErrorKind};
    use std::str::Utf8Error;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a u8 slice
    // "foo\nbar" + some invalid bytes
    let mut reader = Cursor::new([
        0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
    ]);
    let mut n_read = 0;
    let mut buf = String::new();

    // First read all the valid bytes until EOF or error
    // (in this case, an error)
    let err = loop {
        match reader.read_utf8(&mut buf) {
            Ok(0) => break Ok(()),
            Ok(n) => {
                n_read += n;
                continue;
            }
            Err(e) => break Err(e),
        };
    };
    // We did get all our valid bytes
    assert_eq!("foo\nbar", buf.as_str());
    assert_eq!(7, n_read);

    // And our last call gave us an io::Error caused by a
    // std::str::Utf8Error
    assert!(err.is_err());
    let err = err.unwrap_err();
    assert_eq!(ErrorKind::InvalidData, err.kind());
    let err = err.into_inner();
    assert!(err.is_some());
    assert!(err.unwrap().is::<Utf8Error>());
}
```
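The shape of that error can be explored with the standard library alone. The sketch below does not use this crate, and `split_valid_prefix` is a hypothetical helper written for illustration. It assumes the same bytes as the example above and shows how `Utf8Error::valid_up_to` marks the end of the valid prefix, and how a `Utf8Error` wrapped in an `io::Error` can be recovered with `into_inner`.

```rust
use std::io::{self, ErrorKind};
use std::str::{self, Utf8Error};

/// Hypothetical helper: returns the valid UTF-8 prefix of `bytes`, plus the
/// validation error (wrapped in an `io::Error`) if the input was not fully valid.
fn split_valid_prefix(bytes: &[u8]) -> (&str, Option<io::Error>) {
    match str::from_utf8(bytes) {
        Ok(s) => (s, None),
        Err(e) => {
            // Everything before `valid_up_to` is guaranteed to be valid UTF-8
            let valid = str::from_utf8(&bytes[..e.valid_up_to()])
                .expect("prefix was already validated");
            (valid, Some(io::Error::new(ErrorKind::InvalidData, e)))
        }
    }
}

fn main() {
    // "foo\nbar" + some invalid bytes
    let bytes = [
        0x66u8, 0x6f, 0x6f, 0x0a, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x00,
    ];
    let (valid, err) = split_valid_prefix(&bytes);
    assert_eq!("foo\nbar", valid);

    // The io::Error carries the original Utf8Error as its inner error
    let err = err.expect("the input ends with invalid bytes");
    assert_eq!(ErrorKind::InvalidData, err.kind());
    let inner = err.into_inner().expect("custom errors keep their source");
    let utf8_err = inner
        .downcast_ref::<Utf8Error>()
        .expect("the inner error is a Utf8Error");
    // Seven valid bytes, then an invalid byte
    assert_eq!(7, utf8_err.valid_up_to());
}
```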
This crate is fairly new, and for now only provides a limited API with a rather simple implementation. In the near future these features should be added:
- Lossy and unchecked versions of `read_utf8` (see `from_utf8_lossy` & `from_utf8_unchecked`).
- A `char`s iterator from the buffer, and its lossy version.

This also means it may have a pretty unstable API.
Given I'm not the most experienced developer at all, you are very welcome to submit pull requests.
Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.