arrow-digest


Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.

Motivation

Today, structured data formats like Parquet are binary-unstable / non-reproducible - writing the same logical data may produce different files on the binary level depending on which writer implementation you use, and may vary between versions of the same writer.

This crate provides a method and implementation for computing stable hashes of structured data (logical hash) based on Apache Arrow in-memory format.

Benefits:
- Fast way to check for equality / equivalence of large datasets
- Two parties can compare data without needing to transfer it or reveal its contents
- A step towards content addressability of structured data (e.g. when storing dataset chunks in DHTs like IPFS)

Use

```rust
// Hash a single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);
println!("{:x}", digest);

// Alternatively: use .update(&array) to hash multiple arrays of the same type

// Hash record batches
let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Int32, false),
    Field::new("b", DataType::Utf8, false),
]));

let record_batch = RecordBatch::try_new(schema, vec![
    Arc::new(Int32Array::from(vec![1, 2, 3])),
    Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();

let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);
println!("{:x}", digest);

// Alternatively: use .update(&batch) to hash multiple batches with the same schema
```

Status

While we are working towards v1, we reserve the right to break hash stability. Please create an issue if you're planning to use this crate.

Design Goals

Drawbacks

Hashing Process

Starting from primitives and building up:

| Type (in Schema.fb) | TypeID (as u16) | Followed by                                     |
| ------------------- | :-------------: | ----------------------------------------------- |
| Null                | 0               |                                                 |
| Int                 | 1               | unsigned/signed (0/1) as u8, bitwidth as u64    |
| FloatingPoint       | 2               | bitwidth as u64                                 |
| Binary              | 3               |                                                 |
| Utf8                | 4               |                                                 |
| Bool                | 5               |                                                 |
| Decimal             | 6               | bitwidth as u64, precision as u64, scale as u64 |
| Date                | 7               | bitwidth as u64, DateUnitID                     |
| Time                | 8               | bitwidth as u64, TimeUnitID                     |
| Timestamp           | 9               | TimeUnitID, timeZone as nullable Utf8           |
| Interval            | 10              |                                                 |
| List                | 11              | items data type                                 |
| Struct              | 12              |                                                 |
| Union               | 13              |                                                 |
| FixedSizeBinary     | 3               |                                                 |
| FixedSizeList       | 11              | items data type                                 |
| Map                 | 16              |                                                 |
| Duration            | 17              |                                                 |
| LargeBinary         | 3               |                                                 |
| LargeUtf8           | 4               |                                                 |
| LargeList           | 11              | items data type                                 |
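As a hypothetical sketch (not the crate's actual code), the table's "TypeID followed by parameters" scheme can be illustrated by building the byte prefix for an Int type; the little-endian serialization of the integers is an assumption for illustration:

```rust
// Hypothetical sketch: encode the type-descriptor prefix for an Int
// column per the table above. Little-endian byte order is assumed here
// purely for illustration.
fn int_type_descriptor(signed: bool, bitwidth: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&1u16.to_le_bytes()); // TypeID 1 = Int, as u16
    buf.push(if signed { 1 } else { 0 });       // unsigned/signed (0/1) as u8
    buf.extend_from_slice(&bitwidth.to_le_bytes()); // bitwidth as u64
    buf
}

fn main() {
    let desc = int_type_descriptor(true, 32); // descriptor for Int32
    // 2 bytes TypeID + 1 byte signedness + 8 bytes bitwidth = 11 bytes
    assert_eq!(desc.len(), 11);
    println!("{:?}", desc);
}
```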

Note that some types (Utf8 and LargeUtf8; Binary, FixedSizeBinary, and LargeBinary; List, FixedSizeList, and LargeList) are represented identically in the hash, as the difference between them is purely an encoding concern.
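This collapsing of encoding variants can be sketched as a TypeID mapping; the enum below is a stand-in for illustration, not Arrow's actual DataType:

```rust
// Stand-in enum (not Arrow's DataType) covering the encoding variants
// that the table maps to a shared TypeID.
#[derive(Clone, Copy)]
enum LogicalType {
    Binary, FixedSizeBinary, LargeBinary,
    Utf8, LargeUtf8,
    List, FixedSizeList, LargeList,
}

// All variants of the same logical type hash under one TypeID.
fn type_id(t: LogicalType) -> u16 {
    match t {
        LogicalType::Binary | LogicalType::FixedSizeBinary | LogicalType::LargeBinary => 3,
        LogicalType::Utf8 | LogicalType::LargeUtf8 => 4,
        LogicalType::List | LogicalType::FixedSizeList | LogicalType::LargeList => 11,
    }
}

fn main() {
    // Utf8 and LargeUtf8 contribute the same TypeID to the hash input.
    assert_eq!(type_id(LogicalType::Utf8), type_id(LogicalType::LargeUtf8));
    println!("Utf8 TypeID = {}", type_id(LogicalType::Utf8));
}
```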

| DateUnit (in Schema.fb) | DateUnitID (as u16) |
| ----------------------- | :-----------------: |
| DAY                     | 0                   |
| MILLISECOND             | 1                   |

| TimeUnit (in Schema.fb) | TimeUnitID (as u16) |
| ----------------------- | :-----------------: |
| SECOND                  | 0                   |
| MILLISECOND             | 1                   |
| MICROSECOND             | 2                   |
| NANOSECOND              | 3                   |
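The two unit tables can be sketched as plain lookup functions; the enums below are stand-ins for illustration, not Arrow's actual types:

```rust
// Stand-in enums (not Arrow's types) mirroring the DateUnit and
// TimeUnit tables above.
#[derive(Clone, Copy)]
enum DateUnit { Day, Millisecond }

#[derive(Clone, Copy)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

// DateUnitID as u16, per the table above.
fn date_unit_id(u: DateUnit) -> u16 {
    match u {
        DateUnit::Day => 0,
        DateUnit::Millisecond => 1,
    }
}

// TimeUnitID as u16, per the table above.
fn time_unit_id(u: TimeUnit) -> u16 {
    match u {
        TimeUnit::Second => 0,
        TimeUnit::Millisecond => 1,
        TimeUnit::Microsecond => 2,
        TimeUnit::Nanosecond => 3,
    }
}

fn main() {
    println!("DAY -> {}", date_unit_id(DateUnit::Day));
    println!("NANOSECOND -> {}", time_unit_id(TimeUnit::Nanosecond));
}
```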

References