This crate is a fork of the original Rust implementation by Nathan Fiedler (nlfiedler/fastcdc-rs).\
Included here is the enhanced 2020 "FastCDC" content-defined chunking algorithm described by Wen Xia, et al.

This fork introduces an adjusted, slightly more complex alternative API that increases flexibility and reduces memory overhead.\
The cut points produced by this fork are identical to the ones produced by the original crate.

This README and all docs have been adapted to the adjusted API.

The adjusted `FastCDC` structure now lets you provide the data required to find the next cut point piece by piece.\
The advantage of this approach is that the entire data no longer has to be kept in one contiguous memory block, which also saves some memory copies.\
This is most useful for, e.g., advanced streaming logic.
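
To illustrate the piece-by-piece idea, here is a minimal sketch in plain Rust. It does not use this crate and is not the FastCDC algorithm: `ToyChunker`, `gear`, `feed`, and `cut_points` are hypothetical names, and the hash is a simplified stand-in. It only shows that a chunker which keeps its rolling state between calls finds the same cut points no matter how the input is split into buffers.

```rust
/// Per-byte pseudo-random value (a SplitMix64-style mixer standing in for a gear table).
fn gear(byte: u8) -> u64 {
    let mut x = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Toy incremental chunker (hypothetical, for illustration only).
struct ToyChunker {
    min_size: usize,
    max_size: usize,
    mask: u64,
    hash: u64,        // rolling hash of the current, unfinished chunk
    chunk_len: usize, // bytes consumed so far for the current chunk
}

impl ToyChunker {
    fn new(min_size: usize, max_size: usize, mask: u64) -> Self {
        ToyChunker { min_size, max_size, mask, hash: 0, chunk_len: 0 }
    }

    /// Consumes bytes from `piece` until a cut point is found or the piece is exhausted.
    /// Returns the number of bytes consumed and, if a cut point was hit, the chunk length.
    fn feed(&mut self, piece: &[u8]) -> (usize, Option<usize>) {
        for (i, &byte) in piece.iter().enumerate() {
            self.hash = (self.hash << 1).wrapping_add(gear(byte));
            self.chunk_len += 1;
            let hash_cut = self.chunk_len >= self.min_size && (self.hash & self.mask) == 0;
            if hash_cut || self.chunk_len >= self.max_size {
                let len = self.chunk_len;
                // Reset the state so the next chunk starts fresh.
                self.hash = 0;
                self.chunk_len = 0;
                return (i + 1, Some(len));
            }
        }
        (piece.len(), None)
    }
}

/// Feeds arbitrary pieces to one chunker and collects absolute cut offsets.
/// Bytes after the last cut point are not reported in this sketch.
fn cut_points<'a>(pieces: impl Iterator<Item = &'a [u8]>) -> Vec<usize> {
    let mut chunker = ToyChunker::new(64, 1024, 0xFF);
    let mut cuts = Vec::new();
    let mut offset = 0; // absolute position in the overall stream
    for piece in pieces {
        let mut cursor = 0;
        while cursor < piece.len() {
            let (consumed, chunk) = chunker.feed(&piece[cursor..]);
            cursor += consumed;
            offset += consumed;
            if chunk.is_some() {
                cuts.push(offset);
            }
        }
    }
    cuts
}
```

For the same `data`, `cut_points(std::iter::once(&data[..]))` and `cut_points(data.chunks(100))` return the same offsets, because the hash and the chunk length survive across `feed` calls.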

Example usage with the original crate:
```rust
use std::sync::mpsc::Receiver; // assumed channel type; any receiver with `iter()` works
use fastcdc::v2020::FastCDC;

// The element type of the channel is an assumption; every buffer must be exactly 4096 bytes.
fn process(consumer: Receiver<Vec<u8>>) {
    let mut cursor = 0;
    let mut intermediate_buffer = vec![0u8; 1024 * 1024 * 16]; // 16 MiB
    for buffer in consumer.iter() {
        // Copy each incoming buffer into the contiguous intermediate buffer.
        intermediate_buffer[cursor..cursor + 4096].copy_from_slice(&buffer);
        cursor += 4096;
        if cursor == intermediate_buffer.len() {
            let fastcdc = FastCDC::new(&intermediate_buffer, 65535, 1048576, 16777216);
            for chunk in fastcdc {
                // .. process chunk
            }
            cursor = 0;
        }
    }
}
```

Example usage with this fork:
```rust
use std::sync::mpsc::Receiver; // assumed channel type; any receiver with `iter()` works
use fastcdc_alt::FastCDC;

// The element type of the channel and the sizes passed to `new` are assumptions.
fn process(consumer: Receiver<Vec<u8>>) {
    let mut fastcdc = FastCDC::new(65535, 1048576, 16777216);
    // Inform the FastCDC struct how much data we are expecting.
    fastcdc.set_content_length(134217728); // 128 MiB
    // If we are interested in the chunk data, we need to hold on to the buffers temporarily here.
    let mut chunk_data = Vec::<u8>::new();
    for buffer in consumer.iter() {
        // ... feed `buffer` to `fastcdc` to obtain `result` ...
        if let Some(chunk) = result {
            // .. process chunk and data.
            // Clear the held buffers with e.g. chunk_data.clear();
            // Employ further handling for the cases where the cut point is not exactly at the buffer boundaries.
        }
    }
}
```

What else?
* The `FastCDC` iterator is now accessible using the `FastCDC::as_iterator(&self, buffer: &[u8])` method.
* The `AsyncStreamCDC` and `StreamCDC` implementations have been adapted; their APIs changed only slightly.
* To focus solely on the 2020 version in this fork, the `ronomon` and `v2016` implementations and examples have been removed.

Requirements

Building and Testing
```shell
$ cargo clean
$ cargo build
$ cargo test
```

Example Usage

Examples can be found in the `examples` directory of the source repository, which demonstrate finding chunk boundaries in a given file. There are both streaming and non-streaming examples, where the non-streaming examples use the `memmap2` crate to read large files efficiently.
```shell
$ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/v2020 --size 16384 test/fixtures/SekienAkashita.jpg`
hash=17968276318003433923 offset=0 size=21325
hash=4098594969649699419 offset=21325 size=17140
hash=15733367461443853673 offset=38465 size=28084
hash=4509236223063678303 offset=66549 size=18217
hash=2504464741100432583 offset=84766 size=24700
```

Non-streaming

An example using `FastCDC` to find chunk boundaries in data loaded into memory:
```rust
let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let mut chunker = fastcdc_alt::FastCDC::new(16384, 32768, 65536);
for chunk in chunker.as_iterator(&contents) {
    println!("offset={} length={}", chunk.offset, chunk.length);
}
```

Streaming

The `StreamCDC` version takes a `Read` source and uses a byte vector with capacity equal to the specified maximum chunk size.
```rust
let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc_alt::StreamCDC::new(source, 4096, 16384, 65535).unwrap();
for result in chunker {
    let (_data, chunk) = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.cutpoint);
}
```

Async Streaming

There is also an async streaming version of FastCDC named `AsyncStreamCDC`, which takes an `AsyncRead` (both `tokio` and `futures` are supported via feature flags) and uses a byte vector with capacity equal to the specified maximum chunk size.
```rust
let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc_alt::AsyncStreamCDC::new(&source, 4096, 16384, 65535);
let stream = chunker.as_stream();
let chunks = stream.collect::<Vec<_>>().await;
for result in chunks {
    let chunk = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.length);
}
```

Reference Material

The original algorithm from 2016 is described in FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.\
The improved "rolling two bytes each time" version from 2020 is detailed in The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems.

Other Implementations