sudachiclone-rs is a Rust version of Sudachi, a Japanese morphological analyzer.
sudachiclone is distributed from crates.io. You can install sudachiclone by executing `cargo install sudachiclone` from the command line.

```bash
$ cargo install sudachiclone
```
The default dictionary package `SudachiDict_core` is distributed from the WorksApplications download site. Install it with `pip` as follows:

```bash
$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20200127.tar.gz
```
After installing sudachiclone, you can also use it in the terminal via the `sudachiclone` command.
To run sudachiclone on text from standard input:

```bash
$ sudachiclone
```
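The standard-input usage above can also be driven from another program. The following is a minimal sketch using only the Rust standard library, not part of sudachiclone itself: the helper name `run_with_stdin` is hypothetical, and it assumes the `sudachiclone` binary is on `PATH` with a dictionary already installed. It writes all input before reading any output, so it is only suitable for short texts.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Hypothetical helper: pipes `input` to `program` on stdin and returns
/// whatever the program prints on stdout.
fn run_with_stdin(program: &str, args: &[&str], input: &str) -> io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the text, then drop the handle so the child sees end-of-input.
    {
        let mut stdin = child.stdin.take().expect("stdin was requested as piped");
        stdin.write_all(input.as_bytes())?;
    }

    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() {
    // Assumption: `sudachiclone` is installed and reads stdin when given no files.
    match run_with_stdin("sudachiclone", &[], "国家公務員\n") {
        Ok(out) => print!("{}", out),
        Err(e) => eprintln!("could not run sudachiclone: {}", e),
    }
}
```

If the binary is missing, `spawn` returns an error and the sketch reports it instead of panicking.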
`sudachiclone` has 4 subcommands (default: `tokenize`).
```bash
$ sudachiclone -h
Japanese Morphological Analyzer

USAGE:
    sudachiclone [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    build       Build Sudachi Dictionary
    help        Prints this message or the help of the given subcommand(s)
    link        Link Default Dict Package
    tokenize    Tokenize Text
    ubuild      Build User Dictionary
```
```bash
$ sudachiclone tokenize -h
sudachiclone-tokenize
Tokenize Text

USAGE:
    sudachiclone tokenize [FLAGS] [OPTIONS] [in_files]...

FLAGS:
    -h, --help       (default) see tokenize -h
    -a               print all of the fields
    -d               print the debug information
    -V, --version    Prints version information
    -v               print sudachipy version

OPTIONS:
    -o

ARGS:
```
```bash
$ sudachiclone link -h
sudachiclone-link
Link Default Dict Package

USAGE:
    sudachiclone link [OPTIONS]

FLAGS:
    -h, --help       see link -h
    -V, --version    Prints version information

OPTIONS:
    -t
```
```bash
$ sudachiclone build -h
sudachiclone-build
Build Sudachi Dictionary

USAGE:
    sudachiclone build [FLAGS] [OPTIONS] -m [in_files]

FLAGS:
    -h, --help       see build -h
    -m               connection matrix file with MeCab's matrix.def format
    -V, --version    Prints version information

OPTIONS:
    -d

ARGS:
```
Here is an example usage:
```rust
use sudachiclone::prelude::*;

let dictionary = Dictionary::new(None, None).unwrap();
let tokenizer = dictionary.create();

// Multi-granular tokenization
// using `system_core.dic` or `system_full.dic` version 20190781
// you may not be able to replicate this particular example due to the dictionary you use

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::C), None).unwrap() {
    println!("{}", m.surface());
}

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::B), None).unwrap() {
    println!("{}", m.surface());
}

for m in tokenizer.tokenize("国家公務員", &Some(SplitMode::A), None).unwrap() {
    println!("{}", m.surface());
}

// Morpheme information
let m = tokenizer.tokenize("食べ", &Some(SplitMode::A), None).unwrap().get(0).unwrap();
println!("{}", m.surface());
println!("{}", m.dictionary_form());
println!("{}", m.reading_form());
println!("{:?}", m.partofspeech());

// Normalization
println!("{}", tokenizer.tokenize("附属", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
println!("{}", tokenizer.tokenize("SUMMER", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
println!("{}", tokenizer.tokenize("シュミレーション", &Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
```