This crate implements the "FastCDC" content-defined chunking algorithm in pure Rust. A critical aspect of its behavior is that it is deterministic: the same input, chunked with the same parameters, always yields exactly the same results. To learn more about content-defined chunking and its applications, see the reference material linked below.
```shell
$ cargo clean
$ cargo build
$ cargo test
```
An example can be found in the examples
directory of the source repository,
which demonstrates reading files of arbitrary size into a memory-mapped buffer
and passing them through the chunker (and computing the SHA256 hash digest of
each chunk).
The unit tests also include short examples of using the chunker, such as this snippet:
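```rust
use std::fs;
// Assumes `FastCDC` and `Chunk` are exported at the crate root.
use fastcdc::{Chunk, FastCDC};

// Read the whole fixture file into memory.
let read_result = fs::read("test/fixtures/SekienAkashita.jpg");
assert!(read_result.is_ok());
let contents = read_result.unwrap();
// Arguments: source bytes, then minimum, average, and maximum chunk sizes.
let chunker = FastCDC::new(&contents, 16384, 32768, 65536);
let results: Vec<Chunk> = chunker.collect();
assert_eq!(results.len(), 3);
// Chunks are contiguous: each offset is the previous offset plus its length.
assert_eq!(results[0].offset, 0);
assert_eq!(results[0].length, 32857);
assert_eq!(results[1].offset, 32857);
assert_eq!(results[1].length, 16408);
assert_eq!(results[2].offset, 49265);
assert_eq!(results[2].length, 60201);
```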
The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details.
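
For intuition, the core idea of gear-based content-defined chunking can be sketched with the standard library alone. This is a simplified illustration and not this crate's implementation: it uses a single cut-point mask rather than the normalized chunking described in the paper, and its gear table comes from an arbitrary fixed seed. The names `gear_table` and `toy_chunks` are hypothetical.

```rust
/// Build a 256-entry gear table from a fixed seed using splitmix64.
/// The seed is arbitrary (an assumption for this sketch); a fixed table
/// is what makes the chunk boundaries reproducible.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into (offset, length) pairs. A cut is made where the
/// rolling gear hash has all `mask` bits clear, but never before `min`
/// bytes and never after `max` bytes. Assumes 1 <= min <= max.
fn toy_chunks(data: &[u8], min: usize, mask: u64, max: usize) -> Vec<(usize, usize)> {
    let gear = gear_table();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let remaining = data.len() - start;
        let limit = remaining.min(max);
        // Default: cut at the maximum size or at the end of the input.
        let mut len = limit;
        if remaining > min {
            let mut hash: u64 = 0;
            // Skip the first `min` bytes, then look for a cut point.
            for i in min..limit {
                hash = (hash << 1).wrapping_add(gear[data[start + i] as usize]);
                if hash & mask == 0 {
                    len = i + 1;
                    break;
                }
            }
        }
        chunks.push((start, len));
        start += len;
    }
    chunks
}
```

Calling `toy_chunks(&data, 1024, 0x0FFF, 16384)` twice on the same buffer returns identical boundaries, because cut points depend only on the bytes and the fixed table.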
This crate is little more than a Rust rewrite of the implementation by Joran Dirk Greef (see the ronomon link below), with greatly simplified usage. One significant difference is that the chunker in this crate does not calculate a hash digest of the chunks.
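
Because digests are left to the caller, one way to add them is to slice the source buffer using each chunk's offset and length and hash the slice. The sketch below is only an illustration: `ChunkSpan` is a hypothetical stand-in for the crate's `Chunk` (mirroring the `offset` and `length` fields used in the snippet above), and the standard library's `DefaultHasher` stands in for a real digest. It is not cryptographic, so a deduplication tool would likely use something like SHA-256 from a separate crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in for the crate's `Chunk`, with only the two
/// fields this sketch needs.
struct ChunkSpan {
    offset: usize,
    length: usize,
}

/// Compute one 64-bit hash per chunk by slicing `data` at each span.
/// `DefaultHasher` is a placeholder, not a cryptographic digest.
fn digest_chunks(data: &[u8], chunks: &[ChunkSpan]) -> Vec<u64> {
    chunks
        .iter()
        .map(|chunk| {
            let bytes = &data[chunk.offset..chunk.offset + chunk.length];
            let mut hasher = DefaultHasher::new();
            hasher.write(bytes);
            hasher.finish()
        })
        .collect()
}
```

Chunks with identical bytes produce identical hashes, which is the property deduplication relies on.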