Lindera UniDic Builder

License: MIT. Join the chat at https://gitter.im/lindera-morphology/lindera

UniDic builder for Lindera. This project is a fork of fulmicoton's kuromoji-rs.

Install

    % cargo install lindera-unidic-builder

Build

Building from source requires a Rust toolchain with Cargo:

    % cargo build --release

Dictionary version

This project supports UniDic 2.1.2. See the UniDic documentation for details.

Building a dictionary

Build a dictionary with the lindera-unidic-builder command:

    % UNIDIC_VERSION=2.1.2
    % curl -L -O "https://ccd.ninjal.ac.jp/unidic_archive/cwj/${UNIDIC_VERSION}/unidic-mecab-${UNIDIC_VERSION}_src.zip"
    % unzip ./unidic-mecab-${UNIDIC_VERSION}_src.zip
    % lindera-unidic-builder -s ./unidic-mecab-${UNIDIC_VERSION}_src -d ./lindera-unidic-${UNIDIC_VERSION}

Dictionary format

Refer to the manual for details on the unidic-mecab dictionary format and part-of-speech tags.

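| Index | Name (Japanese) | Name (English) | Notes |
| --- | --- | --- | --- |
| 0 | 品詞大分類 | Part of speech (major category) | |
| 1 | 品詞中分類 | Part of speech (middle category) | |
| 2 | 品詞小分類 | Part of speech (minor category) | |
| 3 | 品詞細分類 | Part of speech (fine category) | |
| 4 | 活用型 | Conjugation type | |
| 5 | 活用形 | Conjugation form | |
| 6 | 語彙素読み | Lexeme reading | |
| 7 | 語彙素(語彙素表記 + 語彙素細分類) | Lexeme | Lexeme notation + lexeme subcategory |
| 8 | 書字形出現形 | Orthographic surface form | |
| 9 | 発音形出現形 | Pronunciation surface form | |
| 10 | 書字形基本形 | Orthographic base form | |
| 11 | 発音形基本形 | Pronunciation base form | |
| 12 | 語種 | Word origin type | |
| 13 | 語頭変化型 | Word-initial alternation type | |
| 14 | 語頭変化形 | Word-initial alternation form | |
| 15 | 語末変化型 | Word-final alternation type | |
| 16 | 語末変化形 | Word-final alternation form | |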
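To line up a feature string against the indexes in the table, it can help to print each comma-separated field with its 0-based position. The following is a minimal sketch, not part of this project: `number_fields` is a hypothetical helper name, and it assumes that no field contains an embedded or quoted comma.

```shell
# Hypothetical helper (not shipped with lindera-unidic-builder).
# Prints "index<TAB>value" for every comma-separated field of $1.
# Assumes fields never contain embedded or quoted commas.
number_fields() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | awk '{ printf "%d\t%s\n", NR - 1, $0 }'   # NR is 1-based, the table is 0-based
}
```

For example, `number_fields "名詞,固有名詞,人名,姓"` prints one line per field, starting at index 0.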

Tokenizing text using the produced dictionary

You can tokenize text using the produced dictionary with the lindera command:

    % echo "羽田空港限定トートバッグ" | lindera -d ./lindera-unidic-2.1.2

    羽田	名詞,固有名詞,人名,姓,*,*,羽田,ハタ,ハタ
    空港	名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
    限定	名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
    トート	名詞,普通名詞,一般,*,*,*,トート,トート,トート
    バッグ	名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
    EOS
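The tokenizer output can be post-processed with standard tools. The sketch below pulls one feature out of each token line. It is an illustration under assumptions, not a documented interface: `extract_feature` is a hypothetical helper, each line is assumed to be a surface form followed by whitespace and a comma-separated feature list, and surfaces are assumed to contain no whitespace.

```shell
# Hypothetical helper: read tokenizer output on stdin and print
# "surface<TAB>feature" for each token.
# $1 is the 0-based index of the feature field to extract.
# Assumes "surface<whitespace>feature,feature,..." lines and an "EOS" terminator.
extract_feature() {
  awk -v idx="$1" '
    $1 == "EOS" { next }                 # skip the end-of-sentence marker
    NF >= 2 {
      n = split($2, f, ",")              # f[1..n] holds the features (1-based)
      if (idx + 1 <= n) print $1 "\t" f[idx + 1]
    }
  '
}
```

With the dictionary built above, a pipeline such as `echo "羽田空港限定トートバッグ" | lindera -d ./lindera-unidic-2.1.2 | extract_feature 7` would print each surface form next to the field at index 7 of its feature list. Tokens with fewer fields than the requested index are silently skipped.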

For more details about the lindera command, refer to the Lindera CLI documentation.

API reference

The API reference is available at the following URL:

- Lindera UniDic Builder