bash
$ pip install pytokenizations
Get an alignment map for two different tokenizations:
python
import tokenizations
a = ["New York"]
b = ["New", "York"]
a2b = [[0, 1]]
b2a = [[0], [0]]
assert tokenizations.get_alignments(a, b) == (a2b, b2a)
a2b[i]
is a list representing the alignment from a
to b
.
You can get the alignments for "dirty" tokens:
python
a = ["げん", "ご"]
b = ["けんこ"] # all accents are dropped (が -> か, ご -> こ)
a2b = [[0], [0]]
b2a = [[0, 1]]
assert tokenizations.get_alignments(a, b) == (a2b, b2a)
Let $A = a{11}a{12}..a{1k1},a{21}..a{NkN}$ and $B = b{11}b{12}..b{1l1},b{21}..b{MlM}$ be tokens of length N and M respectively. Each token $Ai$ in $A$ and $Bj$ in $B$ have length $ki$ and $lj$ respectively. The alignment $AL{AB}$ of $A$ to $B$ is such that $ \forall j \in AL{AB,i} => Bj \cap Ai $. ($t \cap s$ means t partially matches s.) For example, $a=["f","o","o"], b=["fo","o"] => AL{AB} = [[1],[1],[2]], AL{BA} = [[1, 2], [3]]$. The goal of this algorithm is to find such $AL{AB}$ and $AL{BA}$
Details:
To compare the token positions, we must compare each characters in tokens. Because the two tokenizations may be partially different, we normalize them in "NFKD" first.
Before calculating the edit graph, we combine tokens into text. For example, if we have tokens ["Foo", "bar"]
, we concatenate them into one text Foobar
.
We calculate the shortest path on edit graph from texts $TA$ and $TB$ to get character map between them. The path can be calculated, for example, by Myers' algorighm
Let $TAi$ and $TBj$ be the i-th and j-th character in the text $TA$ and $TB$, respectively. $C{AB}$ is a mapping from $TA$ to $TB$ such that $C{AB},i \neq -1 \land C{AB,i} = j \Rightarrow TAi = TAj$. For example, $TA = f0o, TB = fboo$ then $C{AB} = [1,-1,3], C{BA} = [1,-1,3,-1]$. We can calculate $C{AB}$ and $C{BA}$ from the shortest path on the edit graph. If there exists diagonal edge $(i-1,j-1) -> (i, j)$ in the path, $C{AB,i} = j$ and $C{BA,j} = i$. If there doesn't exist any diagonal edge to $\forall j (i, j)$ then $C{AB,i} = -1$.
Now we can calculate the desired $AL{AB}$ and $AL{BA}$ from the previous calculated $C{AB}$ and $C{BA}$.