A morphological analysis library in Rust. This project is a fork of kuromoji-rs.
Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.
A Rust toolchain is required to build this project.
Make sure you have activated the full features of the lindera crate in your Cargo.toml:
```toml
[dependencies]
lindera = { version = "0.19.0", features = ["all-dictionaries"] }
```
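If you do not need every bundled dictionary, you can enable a single dictionary feature instead; for example, the examples below are run with the `ipadic` feature:

```toml
[dependencies]
lindera = { version = "0.19.0", features = ["ipadic"] }
```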
This example covers the basic usage of Lindera.
It will:
- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens
```rust
use lindera::{
    mode::Mode,
    tokenizer::{DictionaryConfig, Tokenizer, TokenizerConfig},
    DictionaryKind, LinderaResult,
};

fn main() -> LinderaResult<()> {
    // use the bundled IPADIC dictionary
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    // create tokenizer
    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.get_text());
    }

    Ok(())
}
```
The above example can be run as follows:
```shell
% cargo run --features=ipadic --example=basic_example
```
You can see the result as follows:
```text
関西国際空港
限定
トートバッグ
```
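Lindera also supports a decompose mode that splits compound words into their component parts. The sketch below assumes the `Mode::Decompose` variant with a default `Penalty`; treat it as illustrative rather than canonical:

```rust
use lindera::{
    mode::{Mode, Penalty},
    tokenizer::{DictionaryConfig, Tokenizer, TokenizerConfig},
    DictionaryKind, LinderaResult,
};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    // decompose mode penalizes long tokens so that compound
    // words are split into their component words
    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Decompose(Penalty::default()),
    };

    let tokenizer = Tokenizer::from_config(config)?;

    // "関西国際空港" would likely come out as 関西 / 国際 / 空港
    for token in tokenizer.tokenize("関西国際空港限定トートバッグ")? {
        println!("{}", token.get_text());
    }

    Ok(())
}
```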
This example covers the basic usage of the Lindera Analysis Framework.
It will:
- Apply a character filter for Unicode normalization (NFKC)
- Tokenize the input text with IPADIC
- Apply token filters for removing stop tags (part-of-speech) and Japanese katakana stemming
```rust
use std::{fs, path::PathBuf};

use lindera::analyzer::Analyzer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    // load the analyzer configuration from a JSON file
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("lindera_ipadic_conf.json");
    let config_bytes = fs::read(path).unwrap();

    // create the analyzer from the JSON config
    let analyzer = Analyzer::from_slice(&config_bytes).unwrap();

    let mut text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {}", text);

    // analyze the text
    let tokens = analyzer.analyze(&mut text)?;

    // output the tokens
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.get_text(),
            token.byte_start,
            token.byte_end,
            token.clone().get_details()
        );
    }

    Ok(())
}
```
The above example can be run as follows:
```shell
% cargo run --features=ipadic --example=analysis_example
```
You can see the result as follows:
```text
text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: Lindera, start: 0, end: 21, details: Some(["UNK"])
token: 形態素, start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: 解析, start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: エンジン, start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: ユーザ, start: 0, end: 26, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: 辞書, start: 26, end: 32, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: 利用, start: 35, end: 41, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: 可能, start: 41, end: 47, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])
```
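Note that `byte_start` and `byte_end` are byte offsets into the UTF-8 input rather than character indices (most Japanese characters occupy three bytes), so a token's surface form can be recovered by byte slicing. A minimal standalone sketch, with illustrative offsets not taken from the analyzer:

```rust
fn main() {
    let text = "関西国際空港限定トートバッグ";

    // illustrative byte offsets: "関西国際空港" is 6 characters x 3 bytes = 18 bytes
    let (byte_start, byte_end) = (0, 18);

    // slicing a &str by a byte range is valid as long as both
    // offsets fall on UTF-8 character boundaries
    let surface = &text[byte_start..byte_end];
    println!("{}", surface); // prints: 関西国際空港
}
```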
You can provide user dictionary entries alongside the default system dictionary. The user dictionary should be a CSV file with the following format:
```text
<surface>,<part_of_speech>,<reading>
```
For example:
```shell
% cat ./resources/simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
```
With a user dictionary, the Tokenizer is created as follows:
```rust
use std::path::PathBuf;

use lindera::tokenizer::{DictionaryConfig, Tokenizer, TokenizerConfig, UserDictionaryConfig};
use lindera::DictionaryKind;
use lindera_core::viterbi::Mode;
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    // load the user dictionary from a CSV file
    let user_dictionary = Some(UserDictionaryConfig {
        kind: DictionaryKind::IPADIC,
        path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
    });

    let config = TokenizerConfig {
        dictionary,
        user_dictionary,
        mode: Mode::Normal,
    };

    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.get_text());
    }

    Ok(())
}
```
The above example can be run with `cargo run --example`:
```shell
% cargo run --features=ipadic --example=userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です
```
The API reference is available. Please see the following URL:
- [lindera](https://docs.rs/lindera)