A morphological analysis library in Rust. This project is a fork of fulmicoton's kuromoji-rs.
Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.
The following products are required to build:

- Rust (with `cargo`)

```sh
% cargo build --release
```
You can reduce the size of the binary that contains Lindera by using the `smallbinary` feature flag. In exchange, you will be penalized with longer program execution time, since the compressed dictionary must be decompressed at runtime. Build with the feature enabled as follows:
```sh
% cargo build --release --features smallbinary
```
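If you use Lindera as a dependency in another crate, the same feature can be enabled from `Cargo.toml`. A minimal sketch; the version below is a placeholder, not a pinned recommendation:

```toml
[dependencies]
# The version is a placeholder; use the release you actually depend on.
lindera = { version = "*", features = ["smallbinary"] }
```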
The `smallbinary` feature also depends on liblzma to compress the dictionary. Please install the dependent package as follows (on Debian/Ubuntu):

```sh
% sudo apt install liblzma-dev
```
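On macOS, the equivalent dependency can usually be installed via Homebrew's `xz` package, which provides liblzma:

```sh
% brew install xz
```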
This example covers the basic usage of Lindera.

It will:

- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens
```rust
use lindera::tokenizer::Tokenizer;
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    // create tokenizer
    let mut tokenizer = Tokenizer::new()?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}
```
The above example can be run as follows:
```sh
% cargo run --example basic_example
```
You can see the result as follows:
```text
関西国際空港
限定
トートバッグ
```
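Each token also carries the matched dictionary entry. As a minimal sketch, assuming the `Token` struct exposes its dictionary row as a `detail: Vec<String>` field, the part-of-speech information can be printed next to the surface form:

```rust
use lindera::tokenizer::Tokenizer;
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    let mut tokenizer = Tokenizer::new()?;
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // assumption: token.detail holds the dictionary columns
    // (part-of-speech, reading, etc.) for the matched entry
    for token in tokens {
        println!("{}\t{}", token.text, token.detail.join(","));
    }

    Ok(())
}
```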
You can register user dictionary entries alongside the default system dictionary. The user dictionary should be a CSV file with the following format:

```text
<surface_form>,<part_of_speech>,<reading>
```
For example:
```sh
% cat userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
```
With a user dictionary, a `Tokenizer` can be created as follows:
```rust
use std::path::Path;

use lindera::tokenizer::{Tokenizer, TokenizerConfig};
use lindera_core::viterbi::Mode;
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    // create tokenizer with a user dictionary
    let config = TokenizerConfig {
        user_dict_path: Some(Path::new("resources/userdic.csv")),
        mode: Mode::Normal,
        ..TokenizerConfig::default()
    };
    let mut tokenizer = Tokenizer::with_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}
```
The above example can be run as follows:

```sh
% cargo run --example userdic_example
```

You can see the result as follows:

```text
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です
```
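Besides `Mode::Normal`, the `mode` field of `TokenizerConfig` can select a decompose mode that splits long compound words into shorter units. A minimal sketch, assuming `Mode::Decompose` takes a `Penalty` from `lindera_core::viterbi` and that `Penalty` implements `Default`:

```rust
use lindera::tokenizer::{Tokenizer, TokenizerConfig};
use lindera_core::viterbi::{Mode, Penalty};
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    // assumption: Mode::Decompose(Penalty) as defined in lindera_core::viterbi;
    // decompose mode splits compound nouns such as 関西国際空港 into smaller units
    let config = TokenizerConfig {
        mode: Mode::Decompose(Penalty::default()),
        ..TokenizerConfig::default()
    };
    let mut tokenizer = Tokenizer::with_config(config)?;

    for token in tokenizer.tokenize("関西国際空港限定トートバッグ")? {
        println!("{}", token.text);
    }

    Ok(())
}
```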
The API reference is available. Please see the following URL:

- [lindera](https://docs.rs/lindera)