Vaporetto

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction.

Examples

```rust
use std::fs::File;

use vaporetto::{Model, Predictor, Sentence};

// Load a model file and build a predictor from it.
let f = File::open("../resources/model.bin")?;
let model = Model::read(f)?;
let predictor = Predictor::new(model, true)?;

let mut buf = String::new();

let mut s = Sentence::default();

// Tokenize the first sentence and write the result into `buf`.
s.update_raw("まぁ社長は火星猫だ")?;
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);
assert_eq!(
    "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ",
    buf,
);

// Reuse the same `Sentence` and buffer for another input.
s.update_raw("まぁ良いだろう")?;
predictor.predict(&mut s);
s.fill_tags();
s.write_tokenized_text(&mut buf);
assert_eq!(
    "まぁ/副詞/マー 良い/形容詞/ヨイ だろう/助動詞/ダロー",
    buf,
);
```

Feature flags

The following features are disabled by default:

The following features are enabled by default:

Notes for distributed models

The distributed models are compressed in the zstd format. The API does not decompress them itself, so to load one of these models you must decompress it before passing it to the API.

```rust
// Requires zstd crate or ruzstd crate
let reader = zstd::Decoder::new(File::open("path/to/model.bin.zst")?)?;
let model = Model::read(reader)?;
```

You can also decompress the file using the unzstd command, which is bundled with modern Linux distributions.
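If a program may receive either a compressed or an already decompressed model file, one possible way to tell them apart is to look for the magic number that starts a zstd frame. The following is a minimal sketch using only the standard library. It is not part of the Vaporetto API, and the helper name `looks_like_zstd` is hypothetical. It only recognizes regular zstd frames, not skippable frames.

```rust
use std::io::{self, Read};

/// Magic number at the start of a zstd frame (0xFD2FB528, stored little-endian).
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper (not a Vaporetto function): returns `true` if the
/// data begins with the zstd frame magic number.
/// Note that this consumes the first four bytes of the reader.
fn looks_like_zstd<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut head = [0u8; 4];
    match reader.read_exact(&mut head) {
        Ok(()) => Ok(head == ZSTD_MAGIC),
        // Input shorter than four bytes cannot be a zstd frame.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

Because the check consumes bytes from the reader, a caller would need to reopen the file or seek back to the start before passing it to `zstd::Decoder::new` or directly to `Model::read`.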

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.