langid-rs

N-gram-based text classifier written in Rust.

This crate is not yet ready for general use: it ships without pre-trained models and lacks proper documentation.

Usage

Classifying using pre-trained models

Use the glob crate to get a list of files. Filenames will be used as names for models.

```rust
extern crate langid;
extern crate glob;

use langid::Classifier;
use glob::glob;

fn main() {
    // Collect the profile paths; entries that glob could not read are skipped.
    let paths = glob("./languageprofiles/*.json").unwrap().filter_map(Result::ok);
    let classifier = Classifier::from_files(paths);

    let language = classifier.classify("Sample text that you want classified.");
    println!("Sample language: {}", language);
}
```

Training and classifying on the fly

```rust
extern crate langid;

use langid::Classifier;

fn main() {
    let first_language_training_text = "...";
    let second_language_training_text = "...";

    // Train one model per language, labelling each with a name of your choice.
    let mut classifier = Classifier::new();
    classifier.train(first_language_training_text, "first");
    classifier.train(second_language_training_text, "second");

    let language = classifier.classify("Sample in the first language.");
    println!("Sample language: {}", language);
}
```

Training

Run cargo install langid to get the langid CLI utility.

langid train [-o FILE] <FILE FILE...>

Create a model based on input text files. Write to stdout or to the file specified by -o or --output.

Credits

Implements the algorithm described by William B. Cavnar and John M. Trenkle in “N-Gram-Based Text Categorization” (1994).
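
For readers unfamiliar with the paper, the core idea can be sketched in a few functions: build a frequency-ranked n-gram profile for each language, build the same kind of profile for the input text, and pick the language whose profile has the smallest "out-of-place" rank distance. The sketch below is an illustration only and is not taken from this crate's source. The names `build_profile`, `out_of_place`, `classify` and `PROFILE_SIZE`, the cutoff of 300, the underscore padding, and the penalty used for missing n-grams are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical cutoff: how many top-ranked n-grams a profile keeps.
/// It doubles as the penalty for an n-gram missing from a category profile.
const PROFILE_SIZE: usize = 300;

/// Build a ranked profile: n-grams of length 1 to 5, most frequent first.
fn build_profile(text: &str) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    let words = text
        .split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty());
    for word in words {
        // Pad each word so that word boundaries show up in the n-grams.
        let padded: Vec<char> = format!("_{}_", word.to_lowercase()).chars().collect();
        for n in 1..=5 {
            for window in padded.windows(n) {
                let gram: String = window.iter().collect();
                *counts.entry(gram).or_insert(0) += 1;
            }
        }
    }
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    // Descending by count; ties are broken alphabetically to stay deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
        .into_iter()
        .take(PROFILE_SIZE)
        .map(|(gram, _)| gram)
        .collect()
}

/// Sum, over the document's n-grams, of how far each one is from its rank
/// in the category profile. Missing n-grams cost PROFILE_SIZE.
fn out_of_place(document: &[String], category: &[String]) -> usize {
    let ranks: HashMap<&str, usize> = category
        .iter()
        .enumerate()
        .map(|(rank, gram)| (gram.as_str(), rank))
        .collect();
    document
        .iter()
        .enumerate()
        .map(|(i, gram)| match ranks.get(gram.as_str()) {
            Some(&j) => {
                if i > j {
                    i - j
                } else {
                    j - i
                }
            }
            None => PROFILE_SIZE,
        })
        .sum()
}

/// Return the name of the profile closest to `text`, or None without profiles.
fn classify<'a>(text: &str, profiles: &'a [(String, Vec<String>)]) -> Option<&'a str> {
    let document = build_profile(text);
    profiles
        .iter()
        .min_by_key(|(_, profile)| out_of_place(&document, profile))
        .map(|(name, _)| name.as_str())
}

fn main() {
    // Tiny made-up training texts; real profiles need far more data.
    let profiles = vec![
        (
            "english".to_string(),
            build_profile("the quick brown fox jumps over the lazy dog and then the dog sleeps"),
        ),
        (
            "german".to_string(),
            build_profile("der schnelle braune fuchs springt über den faulen hund und dann schläft der hund"),
        ),
    ];
    match classify("the dog and the fox", &profiles) {
        Some(name) => println!("Closest profile: {}", name),
        None => println!("No profiles available"),
    }
}
```

Ranking n-grams rather than comparing raw counts keeps the distance independent of text length, which is why a short sample can be compared against a profile trained on a much larger corpus.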