Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.
Today, structured data formats like Parquet are binary-unstable / non-reproducible: writing the same logical data may produce different files at the binary level depending on which writer implementation you use, and may vary from version to version.
This crate provides a method and implementation for computing stable hashes of structured data (logical hash) based on Apache Arrow in-memory format.
Benefits:
- Fast way to check for equality / equivalence of large datasets
- Two parties can compare data without needing to transfer it or reveal its contents
- A step towards content addressability of structured data (e.g. when storing dataset chunks in DHTs like IPFS)
```rust
// Hash a single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);

// Alternatively: use `.update(&array)` to hash multiple arrays of the same type

// Hash record batches
let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Int32, false),
    Field::new("b", DataType::Utf8, false),
]));

let record_batch = RecordBatch::try_new(schema, vec![
    Arc::new(Int32Array::from(vec![1, 2, 3])),
    Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();

let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);

// Alternatively: use `.update(&batch)` to hash multiple batches with the same schema
```
Logical hashes could also support content addressability in IPFS and the likes, but this is a stretch, as this is not a general-purpose hashing algorithm.

The hashing scheme starts from primitives and builds up:
- `{U}Int{8,16,32,64}`, `Float{16,32,64}` - hashed using their in-memory binary representation
- `Utf8`, `LargeUtf8` - hash the length (as `u64`) followed by the in-memory representation of the string, so that `digest(["foo", "bar"]) != digest(["f", "oobar"])`
- Null values - hashed as a `0` (zero) byte
- `Bool` - hashed as `1` for false and `2` for true (`1` and `2` differentiate the values from nulls)

Schema types are encoded into the hash as follows:

| Type (in `Schema.fb`) | TypeID (as `u16`) | Followed by                                            |
| --------------------- | :---------------: | ------------------------------------------------------ |
| Null                  | 0                 |                                                        |
| Int                   | 1                 | unsigned/signed (0/1) as `u8`, bitwidth as `u64`       |
| FloatingPoint         | 2                 | bitwidth as `u64`                                      |
| Binary                | 3                 |                                                        |
| Utf8                  | 4                 |                                                        |
| Bool                  | 5                 |                                                        |
| Decimal               | 6                 | bitwidth as `u64`, precision as `u64`, scale as `u64`  |
| Date                  | 7                 | bitwidth as `u64`, `DateUnitID`                        |
| Time                  | 8                 | bitwidth as `u64`, `TimeUnitID`                        |
| Timestamp             | 9                 | `TimeUnitID`, timeZone as nullable `Utf8`              |
| Interval              | 10                |                                                        |
| List                  | 11                |                                                        |
| Struct                | 12                |                                                        |
| Union                 | 13                |                                                        |
| FixedSizeBinary       | 14                |                                                        |
| FixedSizeList         | 15                |                                                        |
| Map                   | 16                |                                                        |
| Duration              | 17                |                                                        |
| LargeBinary           | 3                 |                                                        |
| LargeUtf8             | 4                 |                                                        |
| LargeList             | 11                |                                                        |
Note that some types (`Utf8` and `LargeUtf8`; `Binary`, `FixedSizeBinary`, and `LargeBinary`; `List`, `FixedSizeList`, and `LargeList`) are represented in the hash the same way, as the difference between them is purely an encoding concern.
| DateUnit (in `Schema.fb`) | DateUnitID (as `u16`) |
| ------------------------- | :-------------------: |
| DAY                       | 0                     |
| MILLISECOND               | 1                     |
| TimeUnit (in `Schema.fb`) | TimeUnitID (as `u16`) |
| ------------------------- | :-------------------: |
| SECOND                    | 0                     |
| MILLISECOND               | 1                     |
| MICROSECOND               | 2                     |
| NANOSECOND                | 3                     |