OFilter
OFilter is an opinionated Bloom filter implemented in Rust.
It implements:
The basic Bloom filter is inspired from the existing and stable
bloomfilter crate but does not directly depend on it.
The API is slightly different, and makes a few opinionated changes.
- it has a fancy parameters support so
that you can define a filter by its number of items, false positive ratio, size in bits, number
of hash, or any combination of them. It figures out the magic.
- it uses the bitlen crate internally and exports its internal
bit vector in that format.
- it uses the siphasher crate as the internal
hashing func. It is not possible to override this.
- it introduces that notion of "streaming filter" which can be useful when you do not work
in batches and never know when items are going to stop coming in. As some point it is
a naive implementation of an aging Bloom filter, though I prefer the term streaming.
- performance-wise, the default settings tend to optimize for less CPU usage but more
memory footprint. This can be controlled in options but by default the filter can
more memory than strictly required.
In practice, I am using this filter to implement a self-expiring
KV store.

Status
For now this is a toy project, clearly NOT suitable for production use.

https://en.wikipedia.org/wiki/Bloom_filter
Usage
```rust
use ofilter::Bloom;
let mut filter: Bloom = Filter::new(100);
assert!(!filter.check(&42));
filter.set(&42);
assert!(filter.check(&42));
```
A bit of theory
This interactive Bloom filter params calulator
proved very useful while developping this.
Also it recaps the usual formulas:
```
With:
n: number of items in the filter
p: probability of false positives, fraction between 0 and 1
m: number of bits in the filter
k: number of hash functions
We have:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))));
k = round((m / n) * log(2));
```
A command line equivalent would be this practical
bloom-filter-calculator.rb gist
```ruby
Optimal bloom filter size and number of hashes
Tips:
1. One byte per item in the input set gives about a 2% false positive rate.
2. The optimal number of hash functions is ~0.7x the number of bits per item.
3. The number of hashes dominates performance.
Expected number of items in the collection
n = (m * ln(2))/k;
n = 300_000
Acceptable false-positive rate (0.01 = 1%)
p = e^(-(m/n) * (ln(2)^2));
fpr = 0.01
Optimal size (number of elements in the bit array)
m = -((n*ln(p))/(ln(2)^2));
m = (n * Math.log(fpr).abs) / (Math.log(2) ** 2)
Optimal number of hash functions
k = (m/n) * ln(2);
k = (m / n) * Math.log(2)
puts "Optimal bloom filter size: #{m.ceil} bits"
puts "Optimal number of hash functions: #{k.ceil}"
```
Links
- crate on crates.io
- doc on docs.rs
- source on gitlab.com
- MenhirKV, project using this filter
- More about Bloom filters:
- https://en.wikipedia.org/wiki/Bloomfilter
- https://www.vldb.org/archives/workshop/2010/proceedings/files/vldb2010workshop/ADMS2010/adms10-canim.pdf
- https://arxiv.org/pdf/2001.03147.pdf
- https://arxiv.org/pdf/2106.04364.pdf
- http://www.vldb.org/pvldb/vol13/p2355-liu.pdf
- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf
- https://freecontent.manning.com/all-about-bloom-filters/
License
OFilter is licensed under the MIT license.