rust-gd: A Rust Implementation of Generalized Deduplication

Rust implementation of Generalized Deduplication (GD) based on several types of error-correcting codes.

This is an implementation (and somewhat extension) of the novel concept of data deduplication method, called Generalized Deduplication (GD). The original concept of GD was introduced by a group of Aarhus University, Denmark, leaded by Prof. D. E. Lucani.

Vestergaard, Rasmus, Qi Zhang, and Daniel E. Lucani. "Generalized deduplication: bounds, convergence, and asymptotic properties." 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019.

Vestergaard, Rasmus, Daniel E. Lucani, and Qi Zhang. "Generalized deduplication: Lossless compression for large amounts of small IoT data." European Wireless 2019; 25th European Wireless Conference. VDE, 2019.

etc.

Usage

Add the following to your Cargo.toml as imported directly from GitHub:

toml:Cargo.toml [dependencies] rust-gd = { git = "https://github.com/junkurihara/rust-gd.git" }

or from crates.io:

toml:Cargo.toml [dependencies] rust-gd = "*" // or appropriate version

Then, add use in your .rs file.

rust: use rust_gd::*;

Example

NOTE: The compression rate strongly depends on the data alignment and data structure. So you should carefully choose the parameters according to the characteristics of given data.

GD with Reed-Solomon code over $\mathrm{GF}(2^8)$

```rust: use rust_gd::*;

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

let codelen = 128; // codeword length over GF(256), i.e., N in (N, K) RS code let msglen = 124; // message length over GF(256), i.e., K in (N, K) RS code let dict_size = 127; // max entry size of a dictionary used in GD process

// GD instance for deduplication (compress) let mut gddedup = GD::ReedSolomon(codelen, msglen).setup(dictsize).await.unwrap(); // Async API

// GD instance for duplication (decompress) let mut gddup = GD::ReedSolomon(codelen, msglen).setup(dictsize).await.unwrap(); // Async API

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).await.unwrap(); // Async API println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).await.unwrap(); // Async API. println!("> Duped size {} bytes", y.len();

assert_eq!(duped, words); ```

In GD with RS codes, an approach of error-alignment can be employed by

```rust: // Linear transformation matrix used for error-alignment. This must be nonsinglar. let trans: [&[u8; 4]; 4] = [ &[1, 0, 0, 0], &[1, 1, 1, 4], &[1, 1, 3, 0], &[1, 2, 0, 0], ];

// Instantiation let mut gddedup = GD::ReedSolomon(4, 3).setup(15).await.unwrap(); let mut gddup = GD::ReedSolomon(4, 3).setup(15).await.unwrap();

// Set error alignment let resdedup = gddedup.seterroralignment(trans).await; // this simply returns Result<()> let resdup = gddup.seterroralignment(trans).await; // this simply returns Result<()> assert!(resdedup.isok()); assert!(resdup.isok());

// then use gd instances to deduplicate/duplicate data as above. ```

For the detailed design of RS-code based implementation and the basic idea error-alignment, see DESIGN.md.

GD with Hamming code

```rust: let hammingdeg = 4; // Degree m of (2^m - 1, 2^m - m -1) Hamming code let hammingdict_size = 511; // max entry size of a dictionary used in GD process

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

// GD instance for deduplication (compress) let mut gddedup = GD::Hamming(hammingdeg).setup(hammingdictsize).await.unwrap(); // Async API

// GD instance for duplication (decompress) let mut gddup = GD::Hamming(hammingdeg).setup(hammingdictsize).await.unwrap(); // Async API

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).await.unwrap(); Async API println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).await.unwrap(); // Async API. println!("> Duped size {} bytes", y.len(); ```

Codes in our implementation

Currently, our GD implementation is based only on Hamming and Reed-Solomon (RS) codes. The GD based on RS codes processes data chunks as byte stream. On the other hand, Hamming-based GD serves data chunks as bit stream.

For GD implementation using Hamming codes, Hamming code with the degree $m = 3$ of the code works in the internal libecc library of error-correcting codes, i.e., a case of the code length $n = 2^m - 1 = 7$. However, the Hamming code of $m = 3$ cannot be employed as the underlying linear code of Hamming-based GD. This is because the code length, i.e., $n=7$ bits, is not sufficient to deduplicate a "byte"-based data. In order to reasonably deduplicate byte-based data, byte alignment is needed. So, we omitted $m = 3$ and considers the parameter $m \geq 4$.

Byte alignment: Our implementation employs an encoding method that chunks message sequences in the unit of bytes. For example, if $(15, 11)$ Hamming code is employed, a 2-byte message is divided into two one byte (= 8 bits) sequences, and pads $15-8=7$ bits of zeros to each sequence to deal as a 15-bit codeword of Hamming code.

TODO

Following should be considered to be implemented.

Benchmark for the performance of deduplication
Optimization of math operations
Deletion and deviation using PRNG (Yggdrasil paper)
Golomb-Rice codes

Caveats

At this time this solution should be considered suitable for research and experimentation, further code and security review is needed before utilization in a production application.

License

Licensed under the MIT license, see LICENSE file.