FastCDC-Alt docs.rs Crates.io Test

This crate is a fork of the original Rust implementation by Nathan Fiedler. (nlfiedler/fastcdc-rs)\ Included here is the enhanced 2020 "FastCDC" content defined chunking algorithm described by Wen Xia, et al.\ \ This fork introduces an adjusted and a bit more complicated alternative API to increase the flexibility and reduce the memory overhead.\ The cut points produced by this fork are identical to the ones produced by the original crate.\ \ This README and all docs are adapted to the adjusted API.

Whats different?

The adjusted FastCDC structure now allows you to provide the data piece by piece required to find the next cut point.\ The advantages of this approach are that it is no more required to keep the entire data in one contiguous memory block and therefore also save on some memory copies.\ This is most useful for e.g. advanced streaming logics.

Example usage with the original crate: ```rust fn main(consumer: Receiver

let cursor = 0; let mut intermediatebuffer = vec![1024 * 1024 * 16]; // 16 MiB for buffer in consumer.iter() { &intermediatebuffer[cursor..cursor + 4096].copyfromslice(&buffer);

cursor += 4096;
if cursor == intermediate_buffer.len() {

  let fastcdc = FastCDC::new(&intermediate_buffer, 65535, 1048576, 16777216);
  for chunk in fastcdc {
    // .. process chunk
  }

  cursor = 0;
}

} } ```

Example usage with this fork: ```rust fn main(consumer: Receiver

// Inform the FastCDC struct how much data we are expecting. fastcdc.setcontentlength(134217728); // 128 MiB

// If we are interested in the chunk data, we need to hold to buffers temporarily here. let chunkdata = Vec::data.push(buffer);

if let Some(chunk) = result {
  // .. process chunk and data.

  // Clear the held buffers with e.g. chunk_data.clear();
  // Employ further handling for the cases where the cut point is not exactly at the buffer boundaries.
}

} } ```


What else? * The FastCDC iterator is now accessible using the FastCDC::as_iterator(&self, buffer: &[u8]) method. * The AsyncStreamCDC and StreamCDC implementations have been adapted, their APIs changed just a little bit. * To focus solely on the 2020 version in this fork, the ronomon and v2016 implementations and examples have been removed.

Requirements

Building and Testing

shell $ cargo clean $ cargo build $ cargo test

Example Usage

Examples can be found in the examples directory of the source repository, which demonstrate finding chunk boundaries in a given file. There are both streaming and non-streaming examples, where the non-streaming examples use the memmap2 crate to read large files efficiently.

shell $ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg Finished dev [unoptimized + debuginfo] target(s) in 0.03s Running `target/debug/examples/v2020 --size 16384 test/fixtures/SekienAkashita.jpg` hash=17968276318003433923 offset=0 size=21325 hash=4098594969649699419 offset=21325 size=17140 hash=15733367461443853673 offset=38465 size=28084 hash=4509236223063678303 offset=66549 size=18217 hash=2504464741100432583 offset=84766 size=24700

Non-streaming

An example using FastCDC to find chunk boundaries in data loaded into memory:

rust let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap(); let mut chunker = fastcdc_alt::FastCDC::new(16384, 32768, 65536); for chunk in chunker.as_iterator(&contents) { println!("offset={} length={}", chunk.offset, chunk.length); }

Streaming

The StreamCDC version takes a Read source and uses a byte vector with capacity equal to the specified maximum chunk size.

rust let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap(); let chunker = fastcdc_alt::StreamCDC::new(source, 4096, 16384, 65535).unwrap(); for result in chunker { let (_data, chunk) = result.unwrap(); println!("offset={} length={}", chunk.offset, chunk.cutpoint); }

Async Streaming

There is also an async streaming version of FastCDC named AsyncStreamCDC, which takes an AsyncRead (both tokio and futures are supported via feature flags) and uses a byte vector with capacity equal to the specified maximum chunk size.

```rust let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap(); let chunker = fastcdcalt::AsyncStreamCDC::new(&source, 4096, 16384, 65535); let stream = chunker.asstream(); let chunks = stream.collect::>().await;

for result in chunks { let chunk = result.unwrap(); println!("offset={} length={}", chunk.offset, chunk.length); } ```

Reference Material

The original algorithm from 2016 is described in FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.\ The improved "rolling two bytes each time" version from 2020 is detailed in The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems.

Other Implementations