⛓️gzp


Multithreaded encoding.

Why?

This crate provides a near drop-in replacement for `Write` that compresses chunks of data in parallel and writes them to an underlying writer in the same order that the bytes were handed in. This allows for much faster compression of data.

Supported Encodings:

- Gzip (`pargz`)
- Snappy (`parsnap`)

Usage / Features

There are no features enabled by default. `pargz_default` enables `pargz` and `flate2_default`. `flate2_default` in turn uses the default backend for `flate2`, which, as of this writing, is the pure Rust backend. `parsnap_default` is just a wrapper over `parsnap`, which pulls in the `snap` dependency.
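For example, a minimal `Cargo.toml` that enables only Gzip support with the default `flate2` backend might look like this (the wildcard version is just for illustration):

```toml
[dependencies]
gzp = { version = "*", features = ["pargz_default"] }
```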

The following examples demonstrate common ways to override the features:

A simple way to enable both `pargz` and `parsnap`:

```toml
[dependencies]
gzp = { version = "*", features = ["pargz_default", "parsnap_default"] }
```

To use the `zlib-ng-compat` backend for `flate2` (the most performant option), enable `pargz` together with `zlib-ng-compat`:

```toml
[dependencies]
gzp = { version = "*", features = ["pargz", "zlib-ng-compat"] }
```

To use Snappy only (you could also use `parsnap_default`):

```toml
[dependencies]
gzp = { version = "*", default-features = false, features = ["parsnap"] }
```

To use both Snappy and Gzip with a specific backend:

```toml
[dependencies]
gzp = { version = "*", default-features = false, features = ["parsnap_default", "pargz", "zlib-ng-compat"] }
```

Examples

Simple example

```rust
use std::{env, fs::File, io::Write};

use gzp::pargz::ParGz;

fn main() {
    let file = env::args().skip(1).next().unwrap();
    let writer = File::create(file).unwrap();
    let mut pargz = ParGz::builder(writer).build();
    pargz.write_all(b"This is a first test line\n").unwrap();
    pargz.write_all(b"This is a second test line\n").unwrap();
    pargz.finish().unwrap();
}
```
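To sanity-check the output, one option (not part of the gzp examples) is to read the file back with `flate2`'s `MultiGzDecoder`, which handles both single- and multi-member gzip streams. This is a minimal sketch that assumes `flate2` is added as a direct dependency:

```rust
use std::{env, fs::File, io::Read};

use flate2::read::MultiGzDecoder;

fn main() {
    // Hypothetical verification step: decompress the file written above
    // and print its contents. Assumes the file is valid gzip output.
    let file = env::args().skip(1).next().unwrap();
    let mut decoder = MultiGzDecoder::new(File::open(file).unwrap());
    let mut contents = String::new();
    decoder.read_to_string(&mut contents).unwrap();
    print!("{}", contents);
}
```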

An updated version of `pgz`:

```rust
use gzp::pargz::ParGz;
use std::io::{Read, Write};

fn main() {
    let chunksize = 64 * (1 << 10) * 2;

    let stdout = std::io::stdout();
    let mut writer = ParGz::builder(stdout).build();

    let stdin = std::io::stdin();
    let mut stdin = stdin.lock();

    let mut buffer = Vec::with_capacity(chunksize);
    loop {
        let mut limit = (&mut stdin).take(chunksize as u64);
        limit.read_to_end(&mut buffer).unwrap();
        if buffer.is_empty() {
            break;
        }
        writer.write_all(&buffer).unwrap();
        buffer.clear();
    }
    writer.finish().unwrap();
}
```

The same thing, but using Snappy instead:

```rust
use gzp::parsnap::ParSnap;
use std::io::{Read, Write};

fn main() {
    let chunksize = 64 * (1 << 10) * 2;

    let stdout = std::io::stdout();
    let mut writer = ParSnap::builder(stdout).build();

    let stdin = std::io::stdin();
    let mut stdin = stdin.lock();

    let mut buffer = Vec::with_capacity(chunksize);
    loop {
        let mut limit = (&mut stdin).take(chunksize as u64);
        limit.read_to_end(&mut buffer).unwrap();
        if buffer.is_empty() {
            break;
        }
        writer.write_all(&buffer).unwrap();
        buffer.clear();
    }
    writer.finish().unwrap();
}
```

Notes

Future todos

Benchmarks

All benchmarks were run on ./bench-data/shakespeare.txt concatenated 100 times, which creates a roughly 550 MB file.

The primary takeaway is that you probably want to give gzp at least four threads: two threads roughly break even with the overhead of orchestrating the multithreading, while four threads give a roughly 3x improvement.
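As a sketch of how you might set the thread count, the builder exposes a `num_threads` option in some versions of the crate; the exact method name and signature are an assumption here and may differ between releases, so check the crate docs for the version you depend on:

```rust
use std::io::Write;

use gzp::pargz::ParGz;

fn main() {
    // Assumption: the ParGz builder has a `num_threads` setter; verify
    // this against the gzp version in your Cargo.toml.
    let stdout = std::io::stdout();
    let mut writer = ParGz::builder(stdout).num_threads(4).build();
    writer.write_all(b"hello world\n").unwrap();
    writer.finish().unwrap();
}
```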

(benchmark results plot)