Lindera CLI

A Japanese morphological analysis command-line interface for Lindera. This project is forked from fulmicoton's kuromoji-rs.

Install

% cargo install lindera-cli

Build

The following products are required to build:

Rust >= 1.39.0
make >= 3.81

% make build
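
Before building, the version requirements above can be checked with a small script. This is only a sketch: `version_ge` and `check_tool` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (as in GNU coreutils).

```shell
#!/bin/sh
# Sketch: check installed tools against the minimum versions listed above.
# version_ge and check_tool are hypothetical helpers, not part of Lindera.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints a status line: tool name, installed version, minimum version.
check_tool() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

# Only query tools that are actually installed.
if command -v rustc >/dev/null 2>&1; then
    check_tool rustc "$(rustc --version | awk '{print $2}')" 1.39.0
fi
if command -v make >/dev/null 2>&1; then
    check_tool make "$(make --version | head -n 1 | awk '{print $NF}')" 3.81
fi
```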

Usage

Basic usage

The CLI already includes IPADIC as the default Japanese dictionary.
You can easily tokenize the text and see the results as follows:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera
関西国際空港	名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定	名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ	UNK,*,*,*,*,*,*,*,*
EOS

Switching dictionary

You can also tokenize with pre-built dictionary data instead of the default dictionary. The following example uses a pre-built UniDic dictionary:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera -d ../lindera-unidic-builder/lindera-unidic-2.1.2
関西	名詞,固有名詞,地名,一般,*,*,関西,カンサイ,カンサイ
国際	名詞,普通名詞,一般,*,*,*,国際,コクサイ,コクサイ
空港	名詞,普通名詞,一般,*,*,*,空港,クーコー,クーコー
限定	名詞,普通名詞,サ変可能,*,*,*,限定,ゲンテー,ゲンテー
トート	名詞,普通名詞,一般,*,*,*,トート,トート,トート
バッグ	名詞,普通名詞,一般,*,*,*,バッグ,バッグ,バッグ
EOS

Please refer to the following repository for building a dictionary:

Tokenize mode

Lindera provides two tokenization modes: normal and decompose.

normal mode tokenizes strictly according to the words registered in the dictionary (default):

% echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=normal
関西国際空港	名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
限定	名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ	UNK,*,*,*,*,*,*,*,*
EOS

decompose mode additionally splits compound nouns into their component words:

% echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=decompose
関西	名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ
国際	名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
空港	名詞,一般,*,*,*,*,空港,クウコウ,クーコー
限定	名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
トートバッグ	UNK,*,*,*,*,*,*,*,*
EOS
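
To compare how the two modes segment the same text, it can help to look only at the surface forms. The sketch below assumes each token line has the MeCab-style layout "surface, TAB, features" and that the result ends with an `EOS` line; `surfaces` is a hypothetical helper name, not part of Lindera.

```shell
#!/bin/sh
# Sketch: list only the surface forms from MeCab-style output.
# Assumes "surface<TAB>features" token lines terminated by an "EOS" line.
# `surfaces` is a hypothetical helper name, not part of Lindera.

# Reads MeCab-style lines on stdin and prints the first tab-separated field,
# skipping the EOS marker.
surfaces() {
    awk -F '\t' '$0 != "EOS" { print $1 }'
}

# With a built CLI, the two modes could be compared like this:
#   echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=normal | surfaces
#   echo "関西国際空港限定トートバッグ" | ./bin/lindera --mode=decompose | surfaces

# Self-contained demonstration on two lines from the decompose output above.
printf '関西\t名詞,固有名詞,地域,一般,*,*,関西,カンサイ,カンサイ\n国際\t名詞,一般,*,*,*,*,国際,コクサイ,コクサイ\nEOS\n' | surfaces
```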

Output format

Lindera provides three output formats: mecab, wakati and json.

mecab outputs results in a format like MeCab:

% echo "お待ちしております。" | ./bin/lindera --output=mecab
お待ち	名詞,サ変接続,*,*,*,*,お待ち,オマチ,オマチ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
おり	動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS

wakati outputs the token text separated by spaces:

% echo "お待ちしております。" | ./bin/lindera --output=wakati
お待ち し て おり ます 。
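
Because wakati output is space-separated, it is straightforward to post-process with standard Unix tools. A minimal sketch that prints one token per line follows; `one_per_line` is a hypothetical helper name, not part of Lindera.

```shell
#!/bin/sh
# Sketch: post-process wakati output with standard Unix tools.
# `one_per_line` is a hypothetical helper name, not part of Lindera.

# Converts space-separated tokens on stdin to one token per line.
one_per_line() {
    tr -s ' ' '\n'
}

# With a built CLI this would be used as:
#   echo "お待ちしております。" | ./bin/lindera --output=wakati | one_per_line

# Self-contained demonstration on the token sequence from the example above.
printf '%s\n' 'お待ち し て おり ます 。' | one_per_line
```

Piping the result into `sort | uniq -c` would then give a simple token frequency count.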

json outputs the token information in JSON format:

% echo "お待ちしております。" | ./bin/lindera --output=json
[
  {
    "text": "お待ち",
    "detail": {
      "left_id": 1283,
      "right_id": 1283,
      "word_cost": 6376,
      "pos_level1": "名詞",
      "pos_level2": "サ変接続",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "お待ち",
      "reading": "オマチ",
      "pronunciation": "オマチ"
    }
  },
  {
    "text": "し",
    "detail": {
      "left_id": 610,
      "right_id": 610,
      "word_cost": 8718,
      "pos_level1": "動詞",
      "pos_level2": "自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "サ変・スル",
      "conjugate_form": "連用形",
      "base_form": "する",
      "reading": "シ",
      "pronunciation": "シ"
    }
  },
  {
    "text": "て",
    "detail": {
      "left_id": 307,
      "right_id": 307,
      "word_cost": 5170,
      "pos_level1": "助詞",
      "pos_level2": "接続助詞",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "て",
      "reading": "テ",
      "pronunciation": "テ"
    }
  },
  {
    "text": "おり",
    "detail": {
      "left_id": 1197,
      "right_id": 1197,
      "word_cost": 8773,
      "pos_level1": "動詞",
      "pos_level2": "非自立",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "五段・ラ行",
      "conjugate_form": "連用形",
      "base_form": "おる",
      "reading": "オリ",
      "pronunciation": "オリ"
    }
  },
  {
    "text": "ます",
    "detail": {
      "left_id": 491,
      "right_id": 491,
      "word_cost": 5537,
      "pos_level1": "助動詞",
      "pos_level2": "*",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "特殊・マス",
      "conjugate_form": "基本形",
      "base_form": "ます",
      "reading": "マス",
      "pronunciation": "マス"
    }
  },
  {
    "text": "。",
    "detail": {
      "left_id": 8,
      "right_id": 8,
      "word_cost": 215,
      "pos_level1": "記号",
      "pos_level2": "句点",
      "pos_level3": "*",
      "pos_level4": "*",
      "conjugation_type": "*",
      "conjugate_form": "*",
      "base_form": "。",
      "reading": "。",
      "pronunciation": "。"
    }
  }
]

When results are output in JSON format, tokens can easily be filtered with the jq command.
For example, the following command performs these steps:

1. Tokenize the text
2. Filter tokens by part of speech (名詞)
3. Concatenate the token text with spaces
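
A minimal sketch of these three steps, assuming jq is installed and the JSON layout shown above (`text` and `detail.pos_level1` fields). The helper name `nouns` is illustrative and not part of Lindera.

```shell
#!/bin/sh
# Sketch: filter Lindera JSON output by part of speech with jq.
# Assumes the JSON layout shown above; `nouns` is a hypothetical helper name.

# Reads a JSON token array on stdin, keeps tokens whose pos_level1 is 名詞,
# and joins their text with single spaces.
nouns() {
    jq -r '[.[] | select(.detail.pos_level1 == "名詞") | .text] | join(" ")'
}

# With a built CLI this would be used as:
#   echo "お待ちしております。" | ./bin/lindera --output=json | nouns

# Self-contained demonstration on a trimmed sample of the output above.
nouns <<'EOF'
[
  {"text": "お待ち", "detail": {"pos_level1": "名詞"}},
  {"text": "し", "detail": {"pos_level1": "動詞"}},
  {"text": "。", "detail": {"pos_level1": "記号"}}
]
EOF
```

Collecting the matching tokens into an array before calling `join` keeps the whole pipeline inside jq, so no extra tool is needed to concatenate the lines.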

Project links

Lindera consists of several projects. They are listed below: