Robust and Fast tokenizations alignment library for Rust and Python

- Demo: demo
- Rust documentation: docs.rs
- Python documentation: python/README.md
- Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Overview

Get an alignment map for two different and noisy tokenizations:

```python
import tokenizations

tokens_a = ["げん", "ご"]
tokens_b = ["けんこ"]  # all accents are dropped (が -> か, ご -> こ)
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
print(a2b)  # [[0], [0]]
print(b2a)  # [[0, 1]]
```

`a2b[i]` is a list of the `tokens_b` indices aligned to `tokens_a[i]`, i.e. the alignment from `tokens_a` to `tokens_b`; `b2a` is the reverse mapping.
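These alignment lists can be used, for example, to project a token span from one tokenization onto the other. Below is a minimal sketch of such a projection; the example tokens and the `project_span` helper are illustrative, not part of the library:

```python
import tokenizations

def project_span(span, mapping):
    """Map a (start, end) token span through an alignment list (hypothetical helper)."""
    indices = [j for i in range(span[0], span[1]) for j in mapping[i]]
    return (min(indices), max(indices) + 1) if indices else None

tokens_a = ["New", "York", "City"]
tokens_b = ["New York", "City"]
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
print(project_span((0, 2), a2b))  # (0, 1): covers "New York" in tokens_b
print(project_span((0, 1), b2a))  # (0, 2): covers "New", "York" in tokens_a
```

Tokens with no counterpart get an empty alignment list, so code that consumes the mappings should handle `None` (or empty) results.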

Algorithm