Docs — Rust Crate — Spec
Caution: Bao is intended to be a cryptographic hash function, but it hasn't yet been reviewed. The output may change prior to the 1.0 release.
Bao (rhymes with bough 🌳) is a general purpose tree hash for files,
implemented as the `bao` command line utility. Here's the full
specification. What makes a tree hash different from a regular hash?
Depending on how many cores you've got in your machine, the first thing
you might notice is that it's five times faster.
Why is `bao hash` so fast? The main reason is that tree hashes can use
multiple threads to process different parts of the tree in parallel.
Given enough input, the tree hash can occupy any number of processors:
in-memory benchmarks on one of Amazon's 96-core m5.24xlarge instances
measure 60 GB/s of throughput. Bao is also based on BLAKE2b, which was
designed to outperform SHA1, and it includes the fastest SIMD
implementation available.
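
To make the parallelism concrete, here is a toy tree hash in Python. It is only a sketch of the general idea, not Bao's construction: the 4096-byte chunk size, the one-byte leaf and parent prefixes, the rule for odd nodes, and the names `toy_tree_hash`, `hash_chunk`, and `hash_parent` are all assumptions made for illustration. Its output will not match `bao hash`, and it has not been designed to be secure. The specification describes the real design.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # assumption for this sketch only


def hash_chunk(chunk: bytes) -> bytes:
    # Leaf hash. The 0x00 prefix keeps leaves distinct from parents.
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def hash_parent(left: bytes, right: bytes) -> bytes:
    # Parent hash over the two 32-byte child hashes, with a 0x01 prefix.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def toy_tree_hash(data: bytes, workers: int = 4) -> bytes:
    # Split the input into fixed-size chunks (a single empty chunk if empty).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    # Chunks are independent, so their hashes can be computed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        level = list(pool.map(hash_chunk, chunks))
    # Combine hashes pairwise until only the root is left.
    while len(level) > 1:
        nxt = [hash_parent(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # toy rule: an odd node moves up unchanged
        level = nxt
    return level[0]
```

Because each chunk is hashed on its own, the number of worker threads changes how fast the leaves are processed but not the result. A regular hash has to consume its input in order, which is why it can't split the work this way.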
Apart from parallelism, tree hashes make it possible to verify a file piece-by-piece rather than all-at-once. This is done by storing both the input and the branches of the hash tree together in an encoded file:
```sh
head -c 1000000 /dev/urandom > f
bao encode f f.bao
stat -c "%n %s" f f.bao | column -t
# f      1000000
# f.bao  1015624

# The `bao hash` of the input file is the same as the `bao hash --encoded`
# of the encoded file, but the latter is faster.
bao hash f
# [some hash...]
bao hash --encoded f.bao
# [the same hash...]

# Decode the encoded file using the hash, and compare it to the input.
hash=$(bao hash --encoded f.bao)
cmp f <(bao decode $hash f.bao)

# Decoding with a corrupted hash fails.
bad_hash=$(echo $hash | sed s/a/b/)
cmp f <(bao decode $bad_hash f.bao)
# Error: Custom { kind: InvalidData, error: StringError("hash mismatch") }
# cmp: EOF on /proc/self/fd/11 which is empty
```
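
Here is a rough sketch of why storing the branches of the tree makes piece-by-piece verification possible. Given only the root hash, one chunk can be checked using the sibling hashes along its path to the root, without reading any other chunk. This is a toy model, not the Bao encoding: the chunk size, the prefixes, the power-of-two chunk count, and the helper names (`build_levels`, `make_proof`, `verify_chunk`) are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption for this sketch only


def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def build_levels(chunks):
    # All levels of the tree, leaves first, root last.
    # Assumes the number of chunks is a power of two, to keep the toy short.
    levels = [[leaf_hash(c) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([parent_hash(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def make_proof(levels, index):
    # Collect the sibling hash at every level below the root.
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])
        index //= 2
    return proof


def verify_chunk(root, chunk, index, proof):
    # Recompute the path from this chunk up to the root and compare.
    h = leaf_hash(chunk)
    for sibling in proof:
        h = parent_hash(h, sibling) if index % 2 == 0 else parent_hash(sibling, h)
        index //= 2
    return h == root
```

For example, with eight chunks the proof for chunk 5 is three hashes long, and `verify_chunk(root, chunks[5], 5, proof)` succeeds while any other chunk or any other root fails.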
The decoding above doesn't require you to have the entire encoded file on disk locally. Streaming it over a pipe or a network socket works just as well. For situations where you only want to consume some bytes from the middle of the file, and you don't want to transfer the whole encoding, you can extract an encoded slice:
```sh
bao slice 500000 100000 f.bao f.slice
stat -c "%n %s" f.slice
# f.slice 104584

# Decode the slice with the same hash as above.
bao decode-slice $hash 500000 100000 f.slice > f.slice.out

# Compare against the same range of the input. (`tail` numbers bytes
# starting with 1.)
tail --bytes=+500001 f | head -c 100000 > expected.out
cmp f.slice.out expected.out

# Decoding the slice with a corrupted hash fails.
bao decode-slice $bad_hash 500000 100000 f.slice
# Error: Custom { kind: InvalidData, error: StringError("hash mismatch") }
```
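
The size of `f.slice` above can be reproduced with some arithmetic, under assumptions that this README doesn't state: 4096-byte chunks, 64-byte parent nodes, an 8-byte length header, and a tree whose left subtree always holds the largest power-of-two number of chunks that is strictly smaller than the total. In that model a slice holds the header, every chunk that overlaps the requested range, and every parent node on the way down to those chunks. The function names are hypothetical, and the specification is the authority on the real format.

```python
CHUNK_SIZE = 4096   # assumed chunk size
PARENT_SIZE = 64    # assumed: two 32-byte child hashes per parent node
HEADER_SIZE = 8     # assumed: length header at the front


def chunk_count(content_len: int) -> int:
    # Number of chunks, rounding up; an empty input still has one chunk.
    return max(1, -(-content_len // CHUNK_SIZE))


def left_len(n: int) -> int:
    # Largest power of two strictly less than n (assumed tree shape).
    p = 1
    while p * 2 < n:
        p *= 2
    return p


def slice_size(content_len: int, start: int, length: int) -> int:
    # Assumes length > 0 and that the range lies inside the content.
    first = start // CHUNK_SIZE
    last = (start + length - 1) // CHUNK_SIZE  # inclusive

    def parents(lo: int, n: int) -> int:
        # Count parent nodes in the subtree over chunks [lo, lo + n)
        # whose range overlaps the requested chunks.
        if n == 1 or lo > last or lo + n - 1 < first:
            return 0
        left = left_len(n)
        return 1 + parents(lo, left) + parents(lo + left, n - left)

    chunk_bytes = sum(
        min(CHUNK_SIZE, content_len - i * CHUNK_SIZE) for i in range(first, last + 1)
    )
    return HEADER_SIZE + PARENT_SIZE * parents(0, chunk_count(content_len)) + chunk_bytes
```

With these assumptions, `slice_size(1000000, 500000, 100000)` comes out to 104584, matching the `stat` output: 25 whole chunks, 34 parent nodes, and the header.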
By default, all of the operations above work with a "combined" encoded
file, that is, one that contains both the content bytes and the tree
hash bytes interleaved. However, sometimes you want to keep them
separate, for example to avoid copying the bytes of a very large input
file. In these cases, you can use the "outboard" encoded format, via the
`--outboard` flag:
```sh
bao encode f --outboard f.obao
stat -c "%n %s" f f.bao f.obao | column -t
# f       1000000
# f.bao   1015624
# f.obao  15624
cmp f <(bao decode $hash f --outboard f.obao)
```
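
The sizes in that listing fit a simple model: the outboard file is just the tree, and the combined file is the tree plus the content. The sketch below assumes 4096-byte chunks, 64-byte parent nodes, and an 8-byte length header. None of these constants come from this README; they are assumptions that happen to reproduce the numbers shown, and the function names are hypothetical. It relies on the fact that a binary tree with n leaves has n - 1 parent nodes.

```python
CHUNK_SIZE = 4096   # assumed chunk size
PARENT_SIZE = 64    # assumed: two 32-byte child hashes per parent node
HEADER_SIZE = 8     # assumed: length header at the front


def chunk_count(content_len: int) -> int:
    # Number of chunks, rounding up; an empty input still has one chunk.
    return max(1, -(-content_len // CHUNK_SIZE))


def outboard_size(content_len: int) -> int:
    # Header plus one parent node for every chunk except the first.
    return HEADER_SIZE + PARENT_SIZE * (chunk_count(content_len) - 1)


def combined_size(content_len: int) -> int:
    # The combined encoding interleaves the same tree with the content bytes.
    return content_len + outboard_size(content_len)
```

For the 1000000-byte input, that's 245 chunks and 244 parent nodes, so `outboard_size(1000000)` is 15624 and `combined_size(1000000)` is 1015624, the same sizes `stat` reports for `f.obao` and `f.bao`.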
The `bao` command line utility is published on crates.io as the
`bao_bin` crate. To install it, add `~/.cargo/bin` to your `PATH` and
then run:

```sh
cargo install bao_bin
```
To build the binary directly from this repo:

```sh
git clone https://github.com/oconnor663/bao
cd bao/bao_bin
cargo build --release
./target/release/bao --help
```
`tests/bao.py` is a fully functional second implementation in Python,
designed to be as short and readable as possible. It's a good starting
point for understanding the algorithms involved, before diving into the
Rust code.
The `bao` library crate includes `no_std` support if you set
`default-features = false` in your `Cargo.toml`. Most of the standalone
functions that don't obviously depend on `std` are available. For
example, `bao::encode::encode` is available with a single threaded
implementation, but `bao::encode::encode_to_vec` isn't available. Of the
streaming implementations, only `hash::Writer` is available, because the
encoding and decoding implementations rely more on the
`std::io::{Read, Write, Seek}` interfaces. If there are any callers that
want to do streaming encoding or decoding under `no_std`, please let me
know, and we can figure out which libcore-compatible traits it makes
sense to implement.