A persistent datastore backed by RocksDB with fuzzy key lookup using an arbitrary distance function accelerated by the SymSpell algorithm.
The reasons to use this crate over another SymSpell implementation are: - You need to use a custom distance function - Startup time matters more than lookups-per-second - You care about resident memory footprint
This crate manages records, each of which has a unique [RecordID]. Keys are used to perform fuzzy lookups but keys are not guaranteed to be unique. Inserting the same key into a [Table] twice will result in two distinct records.
```rust use fuzzy_rocks::{*};
//Create and reset the FuzzyRocks Table
let mut table = Table::
//Insert some records let thu = table.insert("Thursday", &"Mokuyoubi".tostring()).unwrap(); let wed = table.insert("Wednesday", &"Suiyoubi".tostring()).unwrap(); let tue = table.insert("Tuesday", &"Kayoubi".tostring()).unwrap(); let mon = table.insert("Monday", &"Getsuyoubi".tostring()).unwrap();
//Try out lookupbest, to get the closest fuzzy match let result = table.lookupbest("Bonday", table.defaultdistancefunc()).unwrap(); assert_eq!(result, mon);
//Try out lookupfuzzy, to get all matches and their distances let results : Vec<(RecordID, u64)> = table .lookupfuzzy("Tuesday", table.defaultdistancefunc(), 2) .unwrap().collect(); assert_eq!(results.len(), 2); assert!(results.contains(&(tue, 0))); //Tuesday -> Tuesday with 0 edits assert!(results.contains(&(thu, 2))); //Thursday -> Tuesday with 2 edits
//Retrieve the key and value from a record let (key, val) = table.getrecord(wed).unwrap(); asserteq!(key, "Wednesday"); assert_eq!(val, "Suiyoubi"); ```
Additional usage examples can be found in the tests, located at the bottom of the src/lib.rs
file.
A distance function is any function that returns a scalar distance between two keys. The smaller the
distance, the closer the match. Two identical keys must have a distance of zero. The fuzzy
methods
in this crate, such as lookup_fuzzy, require a distance function to be supplied.
This crate includes a simple Levenstein Distance function called edit_distance. However, you may often want to use a different function.
One reason to use a custom distance function is to account for expected variation patterns. For example: a distance function that considers likely OCR errors might consider 'lo' to be very close to 'b', '0' to be extremely close to 'O', and 'A' to be somewhat near to '^', while '#' would be much further from ',' even though the Levenstein distances tell a different story with 'lo' being two edits away from 'b' and '#' being only one edit away from ',' (comma).
You may have a different distance function to catch common typos on a QWERTY keyboard, etc.
Another reason for a custom distance function is if your keys are not human-readable strings, in which case you may need a different interpretation of variances between keys. For example DNA snippets could be used as keys.
Any distance function you choose must be compatible with SymSpell's delete-distance optimization. In other
words, you must be able to delete no more than MAX_DELETES
characters from each of the record's
key and the lookup key and arrive at identical key-variants. If your distance function is incompatible
with this property then the SymSpell optimization won't work for you and you should use a different fuzzy
lookup technique and a different crate.
Distance functions may return any scalar type, so floating point distances will work. However, the
MAX_DELETES
constant is an integer. Records that can't be reached by deleting MAX_DELETES
characters
from both the record key and the lookup key will never be evaluated by the distance function and are
conceptually "too far away". Once the distance function has been evaluated, its return value is
considered the authoritative distance and the delete distance is irrelevant.
A [Table] may allow for unicode keys or not, depending on the value of the UNICODE_KEYS
constant used
when the Table was created.
If UNICODE_KEYS
is true
, keys may use unicode characters and multi-byte characters will still be
considered as single characters for the purpose of deleting characters to create key variants.
If UNICODE_KEYS
is false
, keys are just strings of [u8] characters.
This option has better performance.
This crate is designed for large databases where startup time and resident memory footprint are significant considerations. This create has been tested with 250,000 unique keys and about 10 million variants.
If your use-case can cope with a higher startup latency and you are ok with all of your keys and variants being loaded into memory, then query performance will certainly be better using a solution built on Rust's native collections, such as this symspell crate on crates.io.
NOTE: The included geonames_megacities.txt
file is a stub for the geonames_test
, designed to stress-test
this crate. The abriged file is included so the test will pass regardless, and to avoid bloating the
download. The content of geonames_megacities.txt
was derived from data on geonames.org,
and licensed under a Creative Commons Attribution 4.0 License