# Robust and Fast tokenizations alignment library for Rust and Python



- Demo
- Rust documentation: docs.rs
- Blog post: "How to calculate the alignment between BERT and spaCy tokens effectively and robustly"

## Usage (Python)

```bash
$ pip install -U pip  # update pip
$ pip install pytokenizations
```

This library uses maturin to build the wheel.

```console
$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build
```

The wheel is created in the `python/target/wheels` directory, and you can install it with `pip install *.whl`.

### get_alignments

```python
def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...
```

Returns alignment mappings for two different tokenizations:

```python
tokens_a = ["å", "BC"]
tokens_b = ["abc"]  # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
print(a2b)
# [[0], [0]]
print(b2a)
# [[0, 1]]
```

`a2b[i]` is a list of the indices in `tokens_b` that align with the i-th token of `tokens_a`; `b2a` is the mapping in the opposite direction.
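One common use of such a mapping is projecting per-token annotations from one tokenization onto another. The following is a minimal, self-contained sketch of that idea; the `project_labels` helper is hypothetical and not part of pytokenizations:

```python
def project_labels(labels_a, a2b, num_b):
    """Project per-token labels from tokenization A onto tokenization B,
    using an a2b alignment of the shape returned by get_alignments."""
    labels_b = [None] * num_b
    for i, targets in enumerate(a2b):
        for j in targets:
            labels_b[j] = labels_a[i]  # on many-to-one alignments, the last label wins
    return labels_b

# Using the alignment from the example above (["å", "BC"] -> ["abc"]):
a2b = [[0], [0]]
print(project_labels(["B-ORG", "I-ORG"], a2b, 1))  # ['I-ORG']
```

Unaligned tokens in B keep the `None` placeholder, so the caller can decide how to handle gaps.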

## Usage (Rust)

See here: docs.rs

## Related