UTF-8 Buffered Reader

Provides alternatives to BufRead::read_line and BufRead::lines that return UTF-8 strings but do not stop on newline delimiters, to avoid loading large amounts of data into memory when reading files with few newlines.

Usage

Add this crate as a dependency in your Cargo.toml:

```toml
[dependencies]
utf8-bufread = "0.1.5"
```

This will allow you to use the BufRead trait provided by this crate, which is automatically implemented for any type implementing std::io::BufRead.

This trait provides functions to read UTF-8 strings from a stream, but none of those functions guarantee that the chunk of data read will end on a newline delimiter (unlike BufRead::read_line or BufRead::lines). This allows you to use buffered readers and std::io::BufRead's API on a large stream without worrying about loading a huge amount of data into memory if there is no newline delimiter.
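To make the memory concern concrete, here is a minimal sketch using only the standard library (not this crate). The helper names `read_line_len` and `first_chunk_len` are hypothetical and exist only for illustration. On input without any newline, `read_line` accumulates the whole stream, while a single `fill_buf` call returns at most one buffer's worth of bytes.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Hypothetical helper: reads one "line" and returns how many bytes
/// `read_line` had to accumulate before it could return.
fn read_line_len(data: &[u8]) -> usize {
    let mut reader = BufReader::with_capacity(16, Cursor::new(data));
    let mut line = String::new();
    reader.read_line(&mut line).unwrap()
}

/// Hypothetical helper: returns the size of the first chunk exposed by
/// `fill_buf`, which is bounded by the reader's buffer capacity.
fn first_chunk_len(data: &[u8]) -> usize {
    let mut reader = BufReader::with_capacity(16, Cursor::new(data));
    let len = reader.fill_buf().unwrap().len();
    reader.consume(len);
    len
}

fn main() {
    // 1000 bytes of ASCII without a single newline
    let data = "x".repeat(1000);
    // `read_line` only stops at EOF, so the whole input ends up in memory
    assert_eq!(1000, read_line_len(data.as_bytes()));
    // A single buffered chunk never exceeds the 16-byte capacity
    assert_eq!(16, first_chunk_len(data.as_bytes()));
}
```

The raw chunk returned by `fill_buf` is not guaranteed to be valid UTF-8, which is the gap this crate's functions are meant to fill.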

The functions of this trait are centered around BufRead::with_utf8_chunk, which takes a closure that is passed the string slice of UTF-8 data read from the inner reader, and returns an io::Result of the number of bytes read, in the same fashion as most functions of std::io's traits and structs. The string slice may be of arbitrary length and may stop at any point in the stream, but will always contain valid UTF-8.

```rust
fn main() {
    use std::io::Cursor;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a string slice
    let mut reader = Cursor::new("The quick fox jumps over the lazy dog");
    let mut o_counter = 0;

    // Counts the number of "o"s in the stream
    loop {
        match reader.with_utf8_chunk(|s| o_counter += s.matches('o').count()) {
            Ok(0) | Err(_) => break,
            Ok(_) => continue,
        }
    }
    assert_eq!(3, o_counter);
}
```
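For readers curious how a chunk can stop anywhere in the stream yet always hold valid UTF-8, the following is a rough sketch built on the standard library only. It is not this crate's implementation: `with_valid_chunk` and `collect_chunks` are hypothetical names, and the handling of a code point split across two buffer refills (gathering its bytes one at a time) is just one possible approach.

```rust
use std::io::{self, BufRead, BufReader, Cursor, ErrorKind};
use std::str;

/// Hypothetical sketch: passes the next valid UTF-8 chunk to `f` and
/// returns the number of bytes consumed (0 at EOF).
fn with_valid_chunk<R, F>(reader: &mut R, mut f: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&str),
{
    let buf = reader.fill_buf()?;
    // Length of the longest valid UTF-8 prefix of the buffered bytes
    let valid_len = match str::from_utf8(buf) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    };
    if valid_len > 0 || buf.is_empty() {
        let s = str::from_utf8(&buf[..valid_len]).expect("prefix was validated");
        f(s);
        reader.consume(valid_len);
        return Ok(valid_len);
    }
    // The buffer starts with an incomplete or invalid sequence:
    // gather the bytes of a single code point one at a time.
    let mut pending = [0u8; 4];
    let mut len = 0;
    while len < pending.len() {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // EOF in the middle of a code point
        }
        pending[len] = buf[0];
        len += 1;
        reader.consume(1);
        match str::from_utf8(&pending[..len]) {
            Ok(s) => {
                f(s);
                return Ok(len);
            }
            // `error_len` is `Some` for definitely invalid bytes
            Err(e) if e.error_len().is_some() => {
                return Err(io::Error::new(ErrorKind::InvalidData, e));
            }
            // Otherwise the sequence is merely incomplete so far
            Err(_) => continue,
        }
    }
    Err(io::Error::new(ErrorKind::InvalidData, "incomplete UTF-8 sequence"))
}

/// Hypothetical helper: reads everything through a tiny buffer.
fn collect_chunks(data: &[u8], capacity: usize) -> io::Result<String> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(data));
    let mut out = String::new();
    while with_valid_chunk(&mut reader, |s| out.push_str(s))? > 0 {}
    Ok(out)
}

fn main() -> io::Result<()> {
    let text = "héllo wörld ✓";
    // A 4-byte buffer forces the 3-byte "✓" to straddle two refills
    assert_eq!(text, collect_chunks(text.as_bytes(), 4)?);
    Ok(())
}
```

Unlike the behavior described for this crate, the sketch consumes invalid bytes before reporting them; it only aims to show why a chunk boundary never needs to fall on a newline.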

The trait also provides functions to append to a provided buffer and to iterate over read chunks.

```rust
use std::fs::File;
use std::io::BufReader;
use utf8_bufread::BufRead;

fn main() {
    // Open our file
    let mut reader = BufReader::new(File::open("myfile.txt").unwrap());
    // The string we'll use to store the text of the file
    let mut text = String::new();
    // Loop until EOF
    loop {
        match reader.read_utf8(&mut text) {
            Ok(0) => break, // EOF
            Ok(_) => continue,
            Err(e) => std::panic::panic_any(e),
        }
    }

    // Do something with text ...
}
```

If a valid UTF-8 code point is read, it will always be processed, whether it is passed to a closure or appended to a provided buffer. If an invalid or incomplete code point is read, the functions of this crate will first process all the valid bytes read, and a relevant io::Error will be returned on the next call:

```rust
fn main() {
    use std::io::{Cursor, ErrorKind};
    use std::str::Utf8Error;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a u8 slice
    // "foo\nbar" + some invalid bytes
    let mut reader = Cursor::new([
        0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
    ]);
    let mut n_read = 0;
    let mut buf = String::new();

    // First read all the valid bytes until EOF or error
    // (in this case, an error)
    let err = loop {
        match reader.read_utf8(&mut buf) {
            Ok(0) => break Ok(()),
            Ok(n) => {
                n_read += n;
                continue;
            }
            Err(e) => break Err(e),
        };
    };
    // We did get all our valid bytes
    assert_eq!("foo\nbar", buf.as_str());
    assert_eq!(7, n_read);

    // And our last call gave us an io::Error caused by an
    // std::str::Utf8Error
    assert!(err.is_err());
    let err = err.unwrap_err();
    assert_eq!(ErrorKind::InvalidData, err.kind());
    let err = err.into_inner();
    assert!(err.is_some());
    assert!(err.unwrap().is::<Utf8Error>());
}
```
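When handling such an error, it can be useful to tell an invalid byte from a sequence that was merely cut short. The sketch below uses only the standard library: `check` is a hypothetical stand-in that wraps a `Utf8Error` in an `io::Error` of kind `InvalidData` (mirroring what the example above asserts), and `utf8_cause` is a hypothetical helper that borrows the wrapped error without consuming the `io::Error`.

```rust
use std::io::{self, ErrorKind};
use std::str::{self, Utf8Error};

/// Hypothetical helper: borrows the `Utf8Error` wrapped in `err`, if any.
fn utf8_cause(err: &io::Error) -> Option<&Utf8Error> {
    err.get_ref()?.downcast_ref::<Utf8Error>()
}

/// Hypothetical stand-in for a reader: validates `bytes` and wraps any
/// failure in an `io::Error` of kind `InvalidData`.
fn check(bytes: &[u8]) -> io::Result<&str> {
    str::from_utf8(bytes).map_err(|e| io::Error::new(ErrorKind::InvalidData, e))
}

fn main() {
    // "bar" followed by a stray continuation byte: definitely invalid
    let err = check(&[0x62u8, 0x61, 0x72, 0x9f]).unwrap_err();
    let cause = utf8_cause(&err).expect("wraps a Utf8Error");
    assert_eq!(3, cause.valid_up_to());
    assert_eq!(Some(1), cause.error_len());

    // "b" followed by the first two bytes of a 3-byte code point:
    // `error_len` is `None` because more input could still complete it
    let err = check(&[0x62u8, 0xe2, 0x9c]).unwrap_err();
    let cause = utf8_cause(&err).expect("wraps a Utf8Error");
    assert_eq!(1, cause.valid_up_to());
    assert_eq!(None, cause.error_len());
}
```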

Work in progress

This crate is fairly new, and for now only provides a limited API with a rather simple implementation. In the near future these features should be added:

A lossy and unchecked version of read_utf8 (see from_utf8_lossy & from_utf8_unchecked).
A chars iterator from the buffer, and its lossy version.
I'm open to suggestions, if you have ideas 😉
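As a point of reference for the planned lossy variant, this is how the standard library's `String::from_utf8_lossy` behaves on the invalid bytes used earlier. The `decode_lossy` wrapper is a hypothetical name for illustration only and says nothing about the API this crate will eventually expose.

```rust
use std::borrow::Cow;

/// Hypothetical wrapper around the standard lossy decoder: invalid
/// sequences are replaced by U+FFFD instead of producing an error.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // "bar" followed by some invalid bytes
    let bytes = [0x62u8, 0x61, 0x72, 0x9f, 0x92, 0x96];
    let text = decode_lossy(&bytes);
    assert!(text.starts_with("bar"));
    assert!(text.contains('\u{FFFD}'));

    // Valid input is borrowed as-is, without allocating
    assert!(matches!(decode_lossy(b"bar"), Cow::Borrowed("bar")));
}
```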

Being new also means the crate may have a fairly unstable API.

Given that I'm far from the most experienced developer, you are very welcome to submit pull requests here.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0, see the LICENSE file in the root directory of this repository.