Spec — Rust Crate — Rust Docs
Caution! Not yet suitable for production use. The output of Bao isn't stable. There might be more changes before 1.0.
Bao (rhymes with bough 🌳) is a general purpose tree hash for files. Here's the full specification. Here's a talk I gave about it in November 2018. What makes a tree hash different from a regular hash? Depending on how many cores you've got in your machine, the first thing you might notice is that it's ten times faster than SHA-2:
Why is bao hash
so fast? It's mostly parallelism, multiple threads
working on different subtrees at the same time. The demo above is from
the 4-core i5-8250U processor in my laptop, but Bao can scale much
higher. In-memory benchmarks on a 48-core AWS m5.24xlarge instance hit
91 GB/s of
throughput.
Bao is also based on BLAKE2, which was designed to outperform
SHA-1, and it uses the fastest SIMD
implementation available.
Apart from parallelism, tree hashes make it possible to verify a file piece-by-piece rather than all-at-once. This is done by storing both the input and the entire hash tree together in an encoded file. Clients can stream these files, or do random seeks into them, with the guarantee that every byte they read matches the root hash.
Use case: A secure messaging app might support attachment files by including the hash of an attachment in a message. With a regular hash, the recipient would need to download an entire attachment to verify it, but that's impractical for e.g. large video files. With a Bao hash, the recipient can stream a video, while still verifying each byte as it comes in. (This scenario was in fact the original motivation for the Bao project.)
```sh
head -c 1000000 /dev/urandom > f
bao encode f f.bao
stat -c "%n %s" f f.bao | column -t f 1000000 f.bao 1015624
hash=
bao hash f
bao decode $hash f.bao f2 cmp f f2
badhash="0000000000000000000000000000000000000000000000000000000000000000" bao decode $badhash f.bao f3 Error: Custom { kind: InvalidData, error: StringError("hash mismatch") } ```
Encoded files support random seeking, but seeking might not be available or efficent over a network. (Note that one seek in the content usually requires several seeks in the encoding, as the decoder traverses the hash tree level-by-level.) In these situations, rather than e.g. hacking a seek interface into your HTTP client, you can instead request an encoded slice. A slice contains some part of the original file and the parts of the hash tree needed to verify it. Creating a slice requires seeking over the full encoding, but the recipient can then stream the slice without needing to seek. Decoding a slice uses the same root hash as regular decoding, so it doesn't require any preparation in advance from the sender or the recipient.
Use case: A BitTorrent-like application could fetch different slices of a file from different peers, without needing to define the slices ahead of time. Or a distributed file storage application could request random slices of an archived file from its storage providers, to prove that they're honestly storing the file, without needing to prepare or store challenges for the future.
```sh
bao slice 500000 100000 f.bao f.slice
stat -c "%n %s" f.slice f.slice 104584
bao decode-slice $hash 500000 100000 f.slice > f.slice.out
tail
numbers bytes starting with 1.)tail --bytes=+500001 f | head -c 100000 > expected.out cmp f.slice.out expected.out
bao decode-slice $bad_hash 500000 100000 f.slice Error: Custom { kind: InvalidData, error: StringError("hash mismatch") } ```
By default, all of the operations above work with a "combined" encoded
file, that is, one that contains both the content bytes and the tree
hash bytes interleaved. However, sometimes you want to keep them
separate, for example to avoid duplicating a very large input file. In
these cases, you can use the "outboard" encoding format, via the
--outboard
flag:
```sh
bao encode f --outboard f.obao
stat -c "%n %s" f f.bao f.obao | column -t f 1000000 f.bao 1015624 f.obao 15624
cmp f <(bao decode $hash f --outboard f.obao) ```
The bao
command line utility is published on
crates.io as the
bao_bin
crate. To install it, add
~/.cargo/bin
to your PATH
and then run:
sh
cargo install bao_bin
To build the binary directly from this repo:
sh
git clone https://github.com/oconnor663/bao
cd bao/bao_bin
cargo build --release
./target/release/bao --help
tests/bao.py
is a fully functional second
implementation in Python, designed to be as short and readable as
possible. It's a good starting point for understanding the algorithms
involved, before diving into the Rust code.
The bao
library crate includes no_std
support if you set
default-features = false
in your Cargo.toml
. Currently only the
hash
module is available, with Rayon-based multithreading disabled.
The encode
and decode
modules currently depend on the
std::io::{Read, Write, Seek}
traits. Those traits are all they need
from std
, and they do not allocate, so they could be made
no_std
-compatible in the future with some compatibility layer.