UTF-8 Buffered Reader

Provides alternatives to BufRead::read_line and BufRead::lines that yield UTF-8 strings but do not stop on newline delimiters, to avoid loading large amounts of data into memory when reading files with few newlines.


Usage

Add this crate as a dependency in your Cargo.toml:

```toml
[dependencies]
utf8-bufread = "0.1.5"
```

This will allow you to use the BufRead trait provided by this crate and automatically implemented on any type implementing std::io::BufRead.

This trait provides functions to read UTF-8 strings from a stream, but none of them guarantees that the chunk read will end on a newline delimiter (unlike BufRead::read_line or BufRead::lines). This lets you use buffered readers and the std::io::BufRead API on a large stream without worrying about loading a huge amount of data into memory when there is no newline delimiter.
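To make the memory concern concrete, here is a small sketch that uses only the standard library, not this crate. It compares how much data std's read_line accumulates on a newline-free input with the size of the chunks handed out by fill_buf. The helper names `read_line_len` and `largest_chunk` are made up for this illustration.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Length of the string produced by a single `read_line` call.
/// (Hypothetical helper, for illustration only.)
fn read_line_len(data: &[u8], capacity: usize) -> std::io::Result<usize> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(data));
    let mut line = String::new();
    // Without a newline, read_line keeps appending until EOF
    reader.read_line(&mut line)?;
    Ok(line.len())
}

/// Size of the largest chunk returned by `fill_buf` while draining `data`.
/// (Hypothetical helper, for illustration only.)
fn largest_chunk(data: &[u8], capacity: usize) -> std::io::Result<usize> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(data));
    let mut largest = 0;
    loop {
        let len = reader.fill_buf()?.len();
        if len == 0 {
            // An empty buffer means EOF
            return Ok(largest);
        }
        largest = largest.max(len);
        // Mark the chunk as processed so the next call reads new data
        reader.consume(len);
    }
}

fn main() -> std::io::Result<()> {
    // 10 000 bytes of text without a single newline
    let data = "a".repeat(10_000);
    println!("read_line loaded {} bytes", read_line_len(data.as_bytes(), 64)?);
    println!("largest chunk was {} bytes", largest_chunk(data.as_bytes(), 64)?);
    Ok(())
}
```

With read_line, the whole input ends up in one String, while the chunked loop never holds more than the reader's buffer capacity at once. Unlike this crate's functions, the raw fill_buf chunks are plain bytes and may end in the middle of a code point.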

The functions of this trait are centered around BufRead::with_utf8_chunk, which takes a closure that is passed the string slice of UTF-8 data read from the inner reader, and returns an io::Result of the number of bytes read, in the same fashion as most functions of std::io's traits and structs. The string slice may be of arbitrary length and may stop at any point in the stream, but will always contain valid UTF-8.

```rust
fn main() {
    use std::io::Cursor;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a string slice
    let mut reader = Cursor::new("The quick fox jumps over the lazy dog");
    let mut o_counter = 0;

    // Counts the number of "o"s in the stream
    loop {
        match reader.with_utf8_chunk(|s| o_counter += s.matches('o').count()) {
            Ok(0) | Err(_) => break,
            Ok(_) => continue,
        }
    }
    assert_eq!(3, o_counter);
}
```

The trait also provides functions to append to a provided buffer and to iterate over read chunks.

```rust
fn main() {
    use std::fs::File;
    use std::io::BufReader;
    use utf8_bufread::BufRead;

    // Open our file
    let mut reader = BufReader::new(File::open("myfile.txt").unwrap());
    // The string we'll use to store the text of the file
    let mut text = String::new();
    // Loop until EOF
    loop {
        match reader.read_utf8(&mut text) {
            Ok(0) => break, // EOF
            Ok(_) => continue,
            Err(e) => std::panic::panic_any(e),
        }
    }

    // Do something with text ...
}
```

If a valid UTF-8 code point is read, it will always be processed, whether it is passed to a closure or appended to the provided buffer. If an invalid or incomplete code point is read, the functions of this crate will first process all the valid bytes read, and a relevant io::Error will be returned on the next call:

```rust
fn main() {
    use std::io::{Cursor, ErrorKind};
    use std::str::Utf8Error;
    use utf8_bufread::BufRead;

    // Cursor implements BufRead when wrapping a u8 slice
    // "foo\nbar" + some invalid bytes
    let mut reader = Cursor::new([
        0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
    ]);
    let mut n_read = 0;
    let mut buf = String::new();

    // First read all the valid bytes until EOF or error
    // (in this case, an error)
    let err = loop {
        match reader.read_utf8(&mut buf) {
            Ok(0) => break Ok(()),
            Ok(n) => {
                n_read += n;
                continue;
            }
            Err(e) => break Err(e),
        };
    };
    // We did get all our valid bytes
    assert_eq!("foo\nbar", buf.as_str());
    assert_eq!(7, n_read);

    // And our last call gave us an io::Error caused by an
    // std::str::Utf8Error
    assert!(err.is_err());
    let err = err.unwrap_err();
    assert_eq!(ErrorKind::InvalidData, err.kind());
    let err = err.into_inner();
    assert!(err.is_some());
    assert!(err.unwrap().is::<Utf8Error>());
}
```
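The "valid bytes first, error afterwards" behaviour can be approximated with the standard library alone. The sketch below is not this crate's implementation, only one way to get a similar split using std::str::from_utf8 and Utf8Error::valid_up_to. The helpers `split_valid_prefix` and `to_io_error` are hypothetical names introduced for this example.

```rust
use std::io::{Error, ErrorKind};
use std::str::{self, Utf8Error};

/// Splits `bytes` into its longest valid UTF-8 prefix and the error
/// describing what follows it, if anything.
/// (Hypothetical helper, for illustration only.)
fn split_valid_prefix(bytes: &[u8]) -> (&str, Option<Utf8Error>) {
    match str::from_utf8(bytes) {
        Ok(s) => (s, None),
        Err(e) => {
            // Everything before `valid_up_to` is guaranteed to be valid
            let valid = &bytes[..e.valid_up_to()];
            let s = str::from_utf8(valid).expect("prefix is valid UTF-8");
            (s, Some(e))
        }
    }
}

/// Wraps a `Utf8Error` in an `io::Error` of kind `InvalidData`.
/// (Hypothetical helper, for illustration only.)
fn to_io_error(e: Utf8Error) -> Error {
    Error::new(ErrorKind::InvalidData, e)
}

fn main() {
    // "foo\nbar" followed by some invalid bytes
    let bytes = [
        0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
    ];
    let (valid, err) = split_valid_prefix(&bytes);
    assert_eq!("foo\nbar", valid);

    // The remaining bytes produce an error that can be reported later
    let err = to_io_error(err.expect("input ends with invalid bytes"));
    assert_eq!(ErrorKind::InvalidData, err.kind());
    assert!(err.into_inner().unwrap().is::<Utf8Error>());

    // A truncated code point is reported with `error_len() == None`,
    // which distinguishes "incomplete" from "invalid"
    let (valid, err) = split_valid_prefix(&[0x61, 0xc3]);
    assert_eq!("a", valid);
    assert!(err.unwrap().error_len().is_none());
}
```

Checking error_len is what lets a reader tell a code point cut off at the end of a buffer, which may become valid once more bytes arrive, from bytes that can never be valid.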

Work in progress

This crate is fairly new and for now provides only a limited API with a rather simple implementation. More features should be added in the near future.

This also means its API may be fairly unstable.

Given that I'm far from the most experienced developer, you are very welcome to submit pull requests to this repository.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.