wordcut-engine

Word segmentation library in Rust

Example

```Rust use wordcutengine::loaddict; use wordcut_engine::Wordcut; use std::path::Path;

fn main() { let dictpath = Path::new(concat!( env!("CARGOMANIFESTDIR"), "/dict.txt" )); let dict = loaddict(dictpath).unwrap(); let wordcut = Wordcut::new(dict); println!("{}", wordcut.putdelimiters("หมากินไก่", "|")); } ```

Algorithm

wordcut-engine has three steps:

Identifying clusters, which are substrings that must not be split
Identifying edges of split directed acyclic graph (split-DAG); The program does not add edges that break any cluster to the graph.
Tokenizing a string by finding the shortest path in the split-DAG

Identifying clusters

Identifying clusters identify which substrings that must not be split.

Wrapping regular expressions with parentheses

For example,

[ก-ฮ]็ [ก-ฮ][่-๋] [ก-ฮ][่-๋][ะาำ]

The above rules are wrapped with parentheses as shown below:

([ก-ฮ]็) ([ก-ฮ][่-๋]) ([ก-ฮ][่-๋][ะาำ])

Joining regular expressions with vertical bars (|)

for example,

([ก-ฮ]็)|([ก-ฮ][่-๋])|([ก-ฮ][่-๋][ะาำ])

Building a DFA from the joined regular expression using regex-automata
Creating a directed acyclic graph (DAG) by adding edges using the DFA
Identifying clusters following a shortest path of a DAG from step above

Note: wordcut-engine does not allow a context sensitive rule, since it hurts the performance too much. Moreover, instead of longest matching, we use a DAG, and its shortest path to contraint cluster boundary by another cluster, therefore newmm-style context sensitive rules are not required.

Identifying split-DAG edges

In contrary to identifying clusters, identifying split-DAG edges identify what must be split. Split-DAG edge makers, wordcut-engine has three types of split-DAG edge maker, that are:

Dictionary-based maker
Rule-based maker
Default maker (Unk edge builder)

The dictionary-based maker traverses a prefix tree, which is particularly a trie in wordcut-engine and create an edge that matched word in the prefix tree. Rule-based maker uses regex-automata's Regex matcher built from split rules to find longest matched substrings, and add corresponding edges to the graph. wordcut-engine removes edges that break clusters. The example of split rules are shown below:

[\r\t\n ]+ [A-Za-z]+ [0-9]+ [๐-๙]+ [\(\)"'`\[\]{}\\/]

If there is no edge for each of character indice yet, a default maker create a edge that connected the known rightmost boundary.