This crate provides implementation of some recent algorithms deriving from the original Minhash. They have better performance and are more general.
It implements:
These algorithms compute an estimation of the Jaccard weighted index via sensitive hashing.
It is an extension of the Jaccard index to the case where objects have a weight, or a multiplicity associated.
This Jaccard weighted index provides a metric on discrete probability distributions as explained in :
Moulton Jiang. Maximally consistent sampling and the Jaccard index of probability distributions (2018)
Moulton-Jiang-ieee or Moulton-Jiang-arxiv
Noting Jp the Jaccard weighted index, then 1. - Jp defines a metric on finite discrete probabilities.
This module is the core of the crate which has two other modules.
An implementation of Superminhash :
A new minwise Hashing Algorithm for Jaccard Similarity Estimation
Otmar Ertl 2017-2018 Cf superminhash Ertl
This algorithm runs on unweighted objects. The hash values are computed by the sketch method or can be computed before entering SuperMinHash methods. In this case (pre-hashed values) the structure just computes permutation according to the paper.
It runs in one pass on data so it can be used in streaming.
ProbOrdMinHash2 is a locality-sensitive hashing for the edit distance implemented over ProbMinHash2 as in Ertl's probordminhash2.
It is inspired by Marcais.G et al. BioInformatics 2019, see Marcais
Invhash
It is just a module providing invertible hash from u32 to u32 or u64 to u64 and can be used to run a prehash on indexes. (See reference to Thomas Wang's invertible integer hash functions in invhash.rs)
Some example of usage (more in the tests in each module) consisting to estimate intersection of contents of 2 vectors:
An example of Prominhash3a with an IndexMap
(see test probminhash::tests::testprobminhash3acountintersectionunequal_weights)
```rust
type FnvIndexMap
let mut wb : FnvIndexMap::<usize,f64> = FnvIndexMap::with_capacity_and_hasher(70, FnvBuildHasher::default());
// initialize ...
let mut waprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
waprobhash.hash_weigthed_idxmap(&wa);
//
let mut wbprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
wbprobhash.hash_weigthed_idxmap(&wb);
//
let siga = waprobhash.get_signature();
let sigb = wbprobhash.get_signature();
let jp_approx = compute_probminhash_jaccard(siga, sigb);
```
An example of Probminhash3 with items sent one by one:
rust
let set_size = 100;
let mut wa = Vec::<f64>::with_capacity(set_size);
let mut wb = Vec::<f64>::with_capacity(set_size);
// initialize wa, wb
....
// probminhash
let mut waprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
for i in 0..set_size {
if wa[i] > 0. {
waprobhash.hash_item(i, wa[i]);
}
}
//
let mut wbprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
for i in 0..set_size {
if wb[i] > 0. {
wbprobhash.hash_item(i, wb[i]);
}
}
let siga = waprobhash.get_signature();
let sigb = wbprobhash.get_signature();
let jp_approx = compute_probminhash_jaccard(siga, sigb);
rust
let va : Vec<usize> = (0..1000).collect();
let vb : Vec<usize> = (900..2000).collect();
let bh = BuildHasherDefault::<FnvHasher>::default();
let mut sminhash : SuperMinHash<usize, FnvHasher>= SuperMinHash::new(70, &bh);
// now compute sketches
let resa = sminhash.sketch_slice(&va);
// we decide to reuse sminhash instead of allocating another SuperMinHash structure
let ska = sminhash.get_hsketch().clone();
sminhash.reinit();
let resb = sminhash.sketch_slice(&vb);
let skb = sminhash.get_hsketch();
//
let jac = get_jaccard_index_estimate(&ska, &skb).unwrap();
...
Licensed under either of
at your option.
This software was written on my own while working at CEA