mayda

Build Status Latest Version Works on nightly

mayda is a Rust library to compress integer arrays (all primitive integer types are supported). The design favors decompression speed and the ability to index the compressed array over the compression ratio, on the principle that the runtime penalty for using compressed arrays should be as small as possible.

This crate provides three variations on a single compression algorithm. The [Uniform] type can decompress around six billion u32s per second, or 24 GiB/s of decompressed integers, on a 2.6 GHz Intel Core i7-6700HQ processor (see below for specifics). The [Monotone] and [Unimodal] types decompress at around half the speed, but can have much better compression ratios depending on the distribution of the integers.

Compiling mayda requires the nightly compiler and CPU support for the SSE2 instruction set (any Intel or AMD processor manufactured after 2003). The basic approach is further described in [Zukowski2006] and [Lemire2015].

Documentation

The module documentation provides further examples and some more detailed descriptions of the algorithms involved.

Usage

Add this to your Cargo.toml:

toml [dependencies] mayda = "^0.1"

and this to your crate root:

rust extern crate mayda;

Example: encoding and decoding

The Uniform struct is recommended for general purpose integer compression. Use the Encode trait to encode and decode the array.

```rust extern crate mayda;

use mayda::{Uniform, Encode};

fn main() { // Integers in some interval of length 255, require one byte per integer let input: Vec = (0..128).map(|x| (x * 73) % 181 + 307).collect();

let mut uniform = Uniform::new();
uniform.encode(&input).unwrap();

let i_bytes = std::mem::size_of_val(input.as_slice());
let u_bytes = std::mem::size_of_val(uniform.storage());

// 128 bytes for encoded entries, 12 bytes of overhead
assert_eq!(i_bytes, 512);
assert_eq!(u_bytes, 140);

let output: Vec<u32> = uniform.decode();
assert_eq!(input, output);

} ```

Example: indexing

Use the Access trait to index the compressed array. This is similar to Index, but returns a vector instead of a slice.

```rust extern crate mayda;

use mayda::{Uniform, Encode, Access};

fn main() { // All primitive integer types supported let input: Vec = (-64..64).collect();

let mut uniform = Uniform::new();
uniform.encode(&input).unwrap();

let val: isize = uniform.access(64);
assert_eq!(val, 0);

// Vector is returned to give ownership to the caller
let range: Vec<isize> = uniform.access(..10);
assert_eq!(range, (-64..-54).collect::<Vec<_>>());

} ```

Performance

Consider the following three vectors: ```rust extern crate rand;

use rand::distributions::{IndependentSample, Range, Normal};

let mut rng = rand::thread_rng(); let length: usize = 1024;

// input1 contains random integers let mut input1: Vec = Vec::withcapacity(length); let range = Range::new(0, 1024); for _ in 0..length { input1.push(range.indsample(&mut rng)); }

// input2 contains monotone increasing integers let mut input2: Vec = input1.clone(); input2.sort();

// input3 contains Gaussian distributed integers with outliers let mut input3: Vec = Vec::withcapacity(length); let gaussian = Normal::new(4086., 128.); for _ in 0..length { input3.push(gaussian.indsample(&mut rng) as u32); } let index = Range::new(0, length); let outliers = Range::new(0, std::u32::MAX); for _ in 0..(length / 16) { input3[index.indsample(&mut rng)] = outliers.indsample(&mut rng); } ```

The performance of the [Uniform], [Monotone] and [Unimodal] types for these three vectors on a 2.6 GHz Intel Core i7-6700HQ processor is given below. Encoding and decoding speeds using decode_into are reported in billions of integers per second, time required to index the last entry is reported in nanoseconds, and compression is reported as bits per integer. Native encoding and decoding speed measurements perform a memcpy. The Shannon entropy is a reasonable target for the bits per integer.

For input1 the Shannon entropy is 10.00. Uniform is preferrable in every respect for the general case.

| | Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |----------|-----------------|-----------------|------------|----------| | Uniform | 1.27 | 5.92 | 28 | 10.63 | | Monotone | 1.18 | 2.35 | 69 | 32.63 | | Unimodal | 0.21 | 2.23 | 55 | 11.16 | | Native | 20.08 | 20.08 | 0 | 32 |

For input2 the Shannon entropy is 10.00, but the additional structure is used by Monotone to improve compression.

| | Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |----------|-----------------|-----------------|------------|----------| | Uniform | 1.28 | 6.06 | 28 | 8.00 | | Monotone | 1.24 | 2.27 | 67 | 3.63 | | Unimodal | 0.24 | 2.57 | 30 | 8.19 | | Native | 20.08 | 20.08 | 0 | 32 |

For input3 the Shannon entropy is 12.46, but compression is difficult due to the presence of outliers. Unimodal gives the most robust compression.

| | Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |----------|-----------------|-----------------|------------|----------| | Uniform | 1.23 | 6.21 | 28 | 32.63 | | Monotone | 1.18 | 2.35 | 70 | 32.63 | | Unimodal | 0.16 | 1.68 | 60 | 12.50 | | Native | 20.08 | 20.08 | 0 | 32 |

License

mayda is licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.