# textdistance.rs

[ [github.com](https://github.com/life4/textdistance.rs) ] [ [docs.rs](https://docs.rs/textdistance/) ] [ [crates.io](https://crates.io/crates/textdistance) ]

A Rust library with a large collection of algorithms for measuring how similar two sequences are.


## Available algorithms

Edit-based:

  1. DamerauLevenshtein, in both the unrestricted and the restricted (optimal string alignment) variants.
  2. Hamming
  3. Jaro
  4. JaroWinkler
  5. Levenshtein
  6. Sift4Common
  7. Sift4Simple
  8. SmithWaterman

Token-based:

  1. Bag
  2. Cosine (aka Orchini, Tucker, Otsuka–Ochiai)
  3. EntropyNCD (Entropy-based Normalized Compression Distance)
  4. Jaccard (aka Tanimoto, Critical Success Index)
  5. Overlap (aka Szymkiewicz–Simpson)
  6. Roberts
  7. SorensenDice (aka F1, Czekanowski, Zijdenbos)
  8. Tversky

Sequence-based:

  1. LCSSeq (Longest Common SubSequence)
  2. LCSStr (Longest Common SubString)
  3. RatcliffObershelp (aka Gestalt pattern matching)

Naive:

  1. Prefix
  2. Suffix
  3. Length

Normalization for other metrics:

  1. LIG3 normalization for Hamming by Levenshtein
  2. MLIPNS normalization for Hamming
  3. YujianBo normalization for Levenshtein

## Installation

```shell
cargo add textdistance
```

## Usage

The `textdistance::str` module provides a shortcut function for each algorithm that calculates the distance/similarity between two strings:

```rust
use textdistance::str::damerau_levenshtein;

assert!(damerau_levenshtein("abc", "acbd") == 2);
```

The `textdistance::nstr` module is the same, but all algorithms return a normalized value (between 0.0 and 1.0):

```rust
use textdistance::nstr::damerau_levenshtein;

assert!(damerau_levenshtein("abc", "acbd") == 2./4.);
```

For more advanced usage, each algorithm is provided as a struct implementing the `Algorithm` trait:

```rust
use textdistance::{Algorithm, DamerauLevenshtein};

let a = DamerauLevenshtein::default();
let r = a.for_str("abc", "acbd");
assert!(r.val() == 2);
assert!(r.nval() == 2./4.);
```

  1. The `Algorithm` trait provides `for_str`, `for_vec`, and `for_iter` to calculate the result for two strings, vectors (slices), or iterators respectively. In addition, the `for_words` and `for_bigrams` methods split the text into words or bigrams respectively before calculating the distance.
  2. Each method returns a `textdistance::Result` that provides methods to get the absolute (`val`) or normalized (`nval`) value of the metric, the distance (`dist` and `ndist`), or the similarity (`sim` and `nsim`).

## Unicode support

The `for_str` method (and hence all functions in the `str` and `nstr` modules) uses `String::chars` to split the string and then runs it through the `for_iter` method. So, the decomposed form of Γ© is treated as two distinct characters ("latin small letter e" and "combining acute accent"). Usually, that's ok, and this is how Python works. You can read more in the official Rust documentation.

If you want Γ© to be considered a single symbol, use the unicode-segmentation crate:

```rust
use textdistance::{Algorithm, DamerauLevenshtein};
use unicode_segmentation::UnicodeSegmentation;

let s1 = "a̐éâ̲\r\n";
let s2 = "éa̐â̲\r\n";
let g1 = s1.graphemes(true);
let g2 = s2.graphemes(true);
let a = DamerauLevenshtein::default();
let r = a.for_iter(g1, g2);
assert!(r.val() == 1);
```

## Choosing the algorithm

The algorithm to use depends on your use case. First, you need to decide on the algorithm category:

  1. Edit-based algorithms work well on short sequences for detecting typos and minor changes.
  2. Token-based algorithms work well on longer sequences for comparing long texts with noticeable modifications.
  3. Sequence-based algorithms work well for calculating diff size between the original and the changed version of the sequence.

If you go with edit-based, the next thing is to decide what kind of changes you need to detect:

| alg                | sub | add | del | trans |
| ------------------ | --- | --- | --- | ----- |
| Hamming            | βœ…  | ❌  | ❌  | ❌    |
| Jaro               | ❌  | ❌  | ❌  | βœ…    |
| JaroWinkler        | ❌  | ❌  | ❌  | βœ…    |
| Sift4              | ❌  | ❌  | ❌  | βœ…    |
| Levenshtein        | βœ…  | βœ…  | βœ…  | ❌    |
| DamerauLevenshtein | βœ…  | βœ…  | βœ…  | βœ…    |


## Benchmarks

Legend:

  1. 🐎: the fastest (tens to hundreds of Β΅s)
  2. 🐇: fast (under 1 ms)
  3. 🐒: slow (1 to 5 ms)
  4. 🐌: the slowest (more than 5 ms)

| algorithm                      | time         |
| ------------------------------ | ------------ |
| bag                            | 🐇 523.06 Β΅s |
| cosine                         | 🐇 508.59 Β΅s |
| damerau_levenshtein            | 🐌 41.938 ms |
| damerau_levenshtein_restricted | 🐌 10.301 ms |
| entropy_ncd                    | 🐇 731.68 Β΅s |
| hamming                        | 🐎 19.203 Β΅s |
| jaccard                        | 🐇 580.79 Β΅s |
| jaro_winkler                   | 🐒 1.7174 ms |
| jaro                           | 🐒 1.7148 ms |
| lcsseq                         | 🐌 7.4349 ms |
| lcsstr                         | 🐒 3.2658 ms |
| length                         | 🐎 2.5300 Β΅s |
| levenshtein                    | 🐒 4.5999 ms |
| lig3                           | 🐌 6.5563 ms |
| mlipns                         | 🐎 20.625 Β΅s |
| overlap                        | 🐇 513.76 Β΅s |
| prefix                         | 🐎 22.473 Β΅s |
| ratcliff_obershelp             | 🐌 36.308 ms |
| roberts                        | 🐇 714.79 Β΅s |
| sift4_common                   | 🐎 238.86 Β΅s |
| sift4_simple                   | 🐎 143.69 Β΅s |
| smith_waterman                 | 🐌 9.5146 ms |
| sorensen_dice                  | 🐇 510.75 Β΅s |
| suffix                         | 🐎 38.821 Β΅s |
| tversky                        | 🐇 512.41 Β΅s |
| yujian_bo                      | 🐒 4.6044 ms |

The benchmarks are powered by criterion and live in the benches directory. They are quite simple: grab 10 open-source licenses, take a 200-character prefix of each, and cross-compare these prefixes. The numbers may be very different for another kind of input, another input length, when comparing words rather than characters, or when running the benchmarks on a different machine. The goal of these benchmarks is to provide basic guidance rather than a definitive answer. If performance is critical for your application, I encourage you to run your own benchmarks on the real data you have.

## Versioning

We stick to SemVer:

  1. The patch number is for bug fixes. The results of an algorithm may change in some corner cases if we find that the previous implementation doesn't match the algorithm described in the original paper.
  2. The minor number is for new algorithms and features.
  3. The major number is for big changes in the API. We try to avoid breaking things, but we prefer providing a friendly and convenient API over preserving backward compatibility.

## Limitations

## Acknowledgments

These are the libraries I used as reference implementations and as sources of test cases:

Special thanks to Trevor Gross for transferring ownership of the textdistance name on crates.io to me.

## Testing locally

To run everything locally, all you need is Rust, Python, and task. Run `task all` to execute all code formatters, linters, and tests.

Thank you ❀️