Whatlang

Build Status

Natural language detection in Rust.

Features

Get started

The library is still in active development. Here is the short example how to use it:

Add to you Cargo.toml: ``` [dependencies]

whatlang = "*" ```

In you program:

```rust extern crate whatlang;

use whatlang::{detect_lang, Lang, Script, Query};

fn main() { let text = "Guten Abend, meine Damen und Herren!".tostring(); let query = Query::new(&text); let result = detectlang(query).unwrap(); asserteq!(result.lang, Lang::Deu); asserteq!(result.lang.tocode(), "deu"); asserteq!(result.script, Script::Latin); } ```

Blacklist

Your can blacklist undesired languages, passing a vector. In the example blow English and Spanish will be ignored:

rust let query = Query::new(&text). blacklist(vec![Lang::Eng, Lang::Spa]);

Whitelist

In similar way, you can whitelist specified languages. In this example, the library will recognize only Esperanto and Russian. Note, if it detects a script that is different from Latin(Esperanto) or Cyrillic(Russian), e.g. Greek, it will return None.

rust let query = Query::new(&text). whitelist(vec![Lang::Epo, Lang::Rus]);

Roadmap

Supported languages

| Language | ISO 639-3 | Enum | | ----------- | --------- | ----------- | | Esperanto | epo | Lang::Epo | | English | eng | Lang::Eng | | Russian | rus | Lang::Rus | | Mandarin | cmn | Lang::Cmn | | Spanish | spa | Lang::Spa | | Portuguese | por | Lang::Por | | Italian | ita | Lang::Ita | | Bengali | ben | Lang::Ben | | French | fra | Lang::Fra | | German | deu | Lang::Deu | | Ukrainian | ukr | Lang::Ukr | | Georgian | kat | Lang::Kat | | Arabic | arb | Lang::Arb | | Hindi | hin | Lang::Hin | | Japanese | jpn | Lang::Jpn | | Hebrew | heb | Lang::Heb | | Yiddish | ydd | Lang::Ydd | | Polish | pol | Lang::Pol | | Amharic | ahm | Lang::Ahm | | Tigrinya | tir | Lang::Tir | | Javanese | jav | Lang::Jav | | Korean | kor | Lang::Kor | | Bokmal | nob | Lang::Nob | | Nynorsk | nno | Lang::Nno | | Danish | dan | Lang::Dan | | Swedish | swe | Lang::Swe | | Finnish | fin | Lang::Fin | | Turkish | tur | Lang::Tur | | Dutch | nld | Lang::Nld | | Hungarian | hun | Lang::Hun | | Czech | ces | Lang::Ces | | Greek | ell | Lang::Ell | | Bulgarian | bul | Lang::Bul | | Belarusian | bel | Lang::Bel | | Marathi | mar | Lang::Mar | | Kannada | kan | Lang::Kan | | Romanian | ron | Lang::Ron | | Slovene | slv | Lang::Slv | | Croatian | hrv | Lang::Hrv | | Serbian | srp | Lang::Srp | | Macedonian | mkd | Lang::Mkd | | Lithuanian | lit | Lang::Lit | | Latvian | lav | Lang::Lav | | Estonian | est | Lang::Est | | Tamil | tam | Lang::Tam | | Vietnamese | vie | Lang::Vie | | Urdu | urd | Lang::Urd | | Thai | tha | Lang::Tha | | Gujarati | guj | Lang::Guj | | Uzbek | uzb | Lang::Uzb | | Punjabi | pan | Lang::Pan | | Azerbaijani | azj | Lang::Azj |

License

MIT

Thanks

I have to thank Franc JS project for inspiration and list of trigrams that I took.

Contributors