rust-gd: A Rust Implementation of Generalized Deduplication

Rust implementation of Generalized Deduplication (GD) based on several types of error-correcting codes.

This is an implementation (and somewhat extension) of the novel concept of data deduplication method, called Generalized Deduplication (GD). The original concept of GD was introduced by a group of Aarhus University, Denmark, leaded by Prof. D. E. Lucani.

Vestergaard, Rasmus, Qi Zhang, and Daniel E. Lucani. "Generalized deduplication: bounds, convergence, and asymptotic properties." 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019.

Vestergaard, Rasmus, Daniel E. Lucani, and Qi Zhang. "Generalized deduplication: Lossless compression for large amounts of small IoT data." European Wireless 2019; 25th European Wireless Conference. VDE, 2019.

etc.

Usage

Add the following to your Cargo.toml as imported directly from GitHub:

toml:Cargo.toml [dependencies] rust-gd = { git = "https://github.com/junkurihara/rust-gd.git" }

or from crates.io (not published yet):

toml:Cargo.toml [dependencies] rust-gd = "0.1.0"

Then, add use in your .rs file.

rust: use rust_gd::*;

Example

NOTE: The compression rate strongly depends on the data alignment and data structure. So you should carefully choose the parameters according to the characteristics of given data.

GD with Reed-Solomon code over GF(256)

```rust: use rust_gd::*;

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

let codelen = 128; // codeword length over GF(256), i.e., N in (N, K) RS code let msglen = 124; // message length over GF(256), i.e., K in (N, K) RS code let dict_size = 127; // max entry size of a dictionary used in GD process

// GD instance for deduplication (compress) let mut gddedup = GD::ReedSolomon(codelen, msglen).setup(dictsize).unwrap();

// GD instance for duplication (decompress) let mut gddup = GD::ReedSolomon(codelen, msglen).setup(dictsize).unwrap();

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).unwrap(); println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).unwrap(); println!("> Duped size {} bytes", y.len();

assert_eq!(duped, words); ```

In GD with RS codes, error-alignment can be employed by

```rust: // Linear transformation matrix used for error-alignment. This must be nonsinglar. let trans: [&[u8; 4]; 4] = [ &[1, 0, 0, 0], &[1, 1, 1, 4], &[1, 1, 3, 0], &[1, 2, 0, 0], ];

// Instantiation let mut gddedup = GD::ReedSolomon(4, 3).setup(15).unwrap(); let mut gddup = GD::ReedSolomon(4, 3).setup(15).unwrap();

// Set error alignment let resdedup = gddedup.seterroralignment(trans); // this simply returns Result<()> let resdup = gddup.seterroralignment(trans); // this simply returns Result<()> assert!(resdedup.isok()); assert!(resdup.isok());

// then use gd instances to deduplicate/duplicate data as above. ```

For the detailed design of RS-code based implementation and the basic idea error-alignment, see DESIGN.md.

GD with Hamming code

```rust: let hammingdeg = 4; // Degree m of (2^m - 1, 2^m - m -1) Hamming code let hammingdict_size = 511; // max entry size of a dictionary used in GD process

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

// GD instance for deduplication (compress) let mut gddedup = GD::Hamming(hammingdeg).setup(hammingdictsize).unwrap();

// GD instance for duplication (decompress) let mut gddup = GD::Hamming(hammingdeg).setup(hammingdictsize).unwrap();

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).unwrap(); println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).unwrap(); println!("> Duped size {} bytes", y.len(); ```

Codes in our implementation

Currently, our implementation is based on Hamming code and Reed-Solomon (RS) code. GD based on RS codes processes data chunks as byte stream. On the other hand, Hamming-based GD considered as data chunks as bit stream.

For GD implementation based on Hamming codes, in the internal libecc library of error-correcting codes, Hamming code with m = 3 works. However, the parameter of m = 3 does not work in GD. This is because the code length, i.e., 7 bits, is not sufficient to deduplicate a "byte"-based data. In order to reasonably deduplicate byte-based data, byte alignment is needed. So, we omitted this small length parameter.

Byte alignment: Our implementation employs an encoding method that chunks message sequences in the unit of bytes. For example, if (15, 11) Hamming code is employed, a 2-bytes message is divided into two one byte (= 8 bits) sequences, and pads 7 bits of zeros to each sequence to deal as 15-bits codeword of Hamming code.

TODO

Following should be considered to be implemented.

Optimization of math operations
Deletion and deviation using PRNG (Yggdrasil paper)
Golomb-Rice codes

Caveats

At this time this solution should be considered suitable for research and experimentation, further code and security review is needed before utilization in a production application.

License

Licensed under the MIT license, see LICENSE file.