Demo: demo
Rust document: docs.rs
Python document: python/README.md
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly
Installation:
```bash
$ pip install pytokenizations
```
get_alignments
```python
def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...
```
Returns alignment mappings for two different tokenizations:
```python
import tokenizations

tokens_a = ["å", "BC"]
tokens_b = ["abc"]
# The accent is dropped (å -> a) and the letters are lowercased (BC -> bc).
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
print(a2b)  # [[0], [0]]
print(b2a)  # [[0, 1]]
```
`a2b[i]` is a list of the indices in `tokens_b` that align with `tokens_a[i]`; `b2a[j]` is the corresponding list of indices in `tokens_a` that align with `tokens_b[j]`.
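As a rough illustration of how these mappings might be consumed, the sketch below projects per-token labels from `tokens_a` onto `tokens_b`. The helper `project_labels` and the label values are hypothetical and not part of the library. The `b2a` value is copied from the example output rather than computed, so the snippet runs without the package installed.

```python
from typing import List, Optional, Sequence


def project_labels(
    labels_a: Sequence[str], b2a: Sequence[Sequence[int]]
) -> List[Optional[str]]:
    """Give each token in b the label of the first token in a aligned to it.

    Tokens in b with no aligned token in a receive None.
    """
    projected: List[Optional[str]] = []
    for aligned in b2a:
        projected.append(labels_a[aligned[0]] if aligned else None)
    return projected


# Alignment taken from the example: tokens_a = ["å", "BC"], tokens_b = ["abc"].
b2a = [[0, 1]]
labels_a = ["B-ORG", "I-ORG"]  # made-up labels, for illustration only
labels_b = project_labels(labels_a, b2a)
print(labels_b)  # ['B-ORG']
```

Taking the first aligned token is only one possible policy; a majority vote over all aligned tokens would be an equally reasonable choice.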
get_original_spans
```python
def get_original_spans(tokens: Sequence[str], original_text: str) -> List[Optional[Tuple[int, int]]]: ...
```
Returns the span of each token in `original_text` as `(start, end)` character indices. This is useful, for example, when mapping a processed result back to original text that has not been normalized.
```python
import tokenizations

tokens = ["a", "bc"]
original_text = "å  BC"  # two spaces between "å" and "BC"
tokenizations.get_original_spans(tokens, original_text)  # [(0, 1), (3, 5)]
```
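The returned spans can be used to recover the unnormalized surface form of each token. The following is a minimal sketch: `slice_spans` is a hypothetical helper, not a library function, and the spans are hard-coded from the example instead of being computed. It assumes a `None` entry, which the return type allows, means that no span was found for that token.

```python
from typing import List, Optional, Sequence, Tuple


def slice_spans(
    text: str, spans: Sequence[Optional[Tuple[int, int]]]
) -> List[Optional[str]]:
    """Return the substring of text for each (start, end) span, or None."""
    return [None if span is None else text[span[0]:span[1]] for span in spans]


original_text = "å  BC"  # two spaces between "å" and "BC"
spans = [(0, 1), (3, 5)]  # spans from the example
surface = slice_spans(original_text, spans)
print(surface)  # ['å', 'BC']
```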
get_charmap
```python
def get_charmap(a: str, b: str) -> Tuple[List[Optional[int]], List[Optional[int]]]: ...
```
Returns character mappings `a2b` (from `a` to `b`) and `b2a` (from `b` to `a`).
```python
import tokenizations

a = "åBC"
b = "abc"
tokenizations.get_charmap(a, b)  # ([0, 1, 2], [0, 1, 2])
```
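One possible use of a character map is translating a character span in `a` into the matching span in `b`. The sketch below assumes that a `None` entry, permitted by the `Optional[int]` in the signature, marks a character with no counterpart. `map_span` is a hypothetical helper, and `a2b` is hard-coded from the example output.

```python
from typing import Optional, Sequence, Tuple


def map_span(
    start: int, end: int, a2b: Sequence[Optional[int]]
) -> Optional[Tuple[int, int]]:
    """Map the half-open span [start, end) in a to a half-open span in b.

    Unmapped characters (None) are skipped; returns None if nothing maps.
    """
    mapped = [j for j in a2b[start:end] if j is not None]
    if not mapped:
        return None
    return (min(mapped), max(mapped) + 1)


a2b = [0, 1, 2]  # from get_charmap("åBC", "abc") in the example
span_in_b = map_span(1, 3, a2b)  # the span covering "BC" in a
print(span_in_b)  # (1, 3)
```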