rust-gd: A Rust Implementation of Generalized Deduplication

Rust implementation of Generalized Deduplication (GD) based on several types of error-correcting codes.

This is an implementation (and somewhat extension) of the novel concept of data deduplication method, called Generalized Deduplication (GD). The original concept of GD was introduced by a group of Aarhus University, Denmark, leaded by Prof. D. E. Lucani.

Usage

Add the following to your Cargo.toml as imported directly from GitHub:

toml:Cargo.toml [dependencies] rust-gd = { git = "https://github.com/junkurihara/rust-gd.git" }

or from crates.io (not published yet):

toml:Cargo.toml [dependencies] rust-gd = "0.1.0"

Then, add use in your .rs file.

rust: use rust_gd::*;

Example

NOTE: The compression rate strongly depends on the data alignment and data structure. So you should carefully choose the parameters according to the characteristics of given data.

GD with Reed-Solomon code over GF(256)

```rust: use rust_gd::*;

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

let codelen = 128; // codeword length over GF(256), i.e., N in (N, K) RS code let msglen = 124; // message length over GF(256), i.e., K in (N, K) RS code let dict_size = 127; // max entry size of a dictionary used in GD process

// GD instance for deduplication (compress) let mut gddedup = GD::ReedSolomon(codelen, msglen).setup(dictsize).unwrap();

// GD instance for duplication (decompress) let mut gddup = GD::ReedSolomon(codelen, msglen).setup(dictsize).unwrap();

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).unwrap(); println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).unwrap(); println!("> Duped size {} bytes", y.len();

assert_eq!(duped, words); ```

In GD with RS codes, error-alignment can be employed by

```rust: // Linear transformation matrix used for error-alignment. This must be nonsinglar. let trans: [&[u8; 4]; 4] = [ &[1, 0, 0, 0], &[1, 1, 1, 4], &[1, 1, 3, 0], &[1, 2, 0, 0], ];

// Instantiation let mut gddedup = GD::ReedSolomon(4, 3).setup(15).unwrap(); let mut gddup = GD::ReedSolomon(4, 3).setup(15).unwrap();

// Set error alignment let resdedup = gddedup.seterroralignment(trans); // this simply returns Result<()> let resdup = gddup.seterroralignment(trans); // this simply returns Result<()> assert!(resdedup.isok()); assert!(resdup.isok());

// then use gd instances to deduplicate/duplicate data as above. ```

For the detailed design of RS-code based implementation and the basic idea error-alignment, see DESIGN.md.

GD with Hamming code

```rust: let hammingdeg = 4; // Degree m of (2^m - 1, 2^m - m -1) Hamming code let hammingdict_size = 511; // max entry size of a dictionary used in GD process

let tobededuped: &[u8] = "寿限無(じゅげむ)寿限無(じゅげむ)五劫(ごこう)のすりきれ海砂利(かいじゃり)padpadpadpadpadpadpadpad".tostring().repeat(128).asbytes()

// GD instance for deduplication (compress) let mut gddedup = GD::Hamming(hammingdeg).setup(hammingdictsize).unwrap();

// GD instance for duplication (decompress) let mut gddup = GD::Hamming(hammingdeg).setup(hammingdictsize).unwrap();

// struct Deduped = {pub data: Vec, pub lastchunkpadbytelen: usize} let deduped: Deduped = gddedup.dedup(tobededuped).unwrap(); println!("> Deduped data size is {} bytes", x.data.len());

let duped: Vec = gd_dup.dup(&deduped).unwrap(); println!("> Duped size {} bytes", y.len(); ```

Codes in our implementation

Currently, our implementation is based on Hamming code and Reed-Solomon (RS) code. GD based on RS codes processes data chunks as byte stream. On the other hand, Hamming-based GD considered as data chunks as bit stream.

For GD implementation based on Hamming codes, in the internal libecc library of error-correcting codes, Hamming code with m = 3 works. However, the parameter of m = 3 does not work in GD. This is because the code length, i.e., 7 bits, is not sufficient to deduplicate a "byte"-based data. In order to reasonably deduplicate byte-based data, byte alignment is needed. So, we omitted this small length parameter.

Byte alignment: Our implementation employs an encoding method that chunks message sequences in the unit of bytes. For example, if (15, 11) Hamming code is employed, a 2-bytes message is divided into two one byte (= 8 bits) sequences, and pads 7 bits of zeros to each sequence to deal as 15-bits codeword of Hamming code.

TODO

Following should be considered to be implemented.

Caveats

At this time this solution should be considered suitable for research and experimentation, further code and security review is needed before utilization in a production application.

License

Licensed under the MIT license, see LICENSE file.