The `deunicode` library transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid".
It started as a Rust port of the `Text::Unidecode` Perl module, and was extended to support emoji.
This is a fork of the `unidecode` crate. This fork uses a compact representation of Unicode data to minimize memory overhead and executable size (about 70K codepoints mapped to 240K ASCII characters, using 450KB of memory, 160KB gzipped).
```rust
extern crate deunicode;
use deunicode::deunicode;

fn main() {
    // Latin ligatures and accented letters
    assert_eq!(deunicode("Æneid"), "AEneid");
    assert_eq!(deunicode("étude"), "etude");
    // Non-Latin scripts
    assert_eq!(deunicode("北亰"), "Bei Jing");
    assert_eq!(deunicode("ᔕᓇᓇ"), "shanana");
    assert_eq!(deunicode("げんまい茶"), "genmaiCha");
    // Emoji and symbols
    assert_eq!(deunicode("🦄☣"), "unicorn biohazard");
}
```
Here are some guarantees you have when calling `deunicode()`:
* The `String` returned will be valid ASCII; the decimal representation of every `char` in the string will be between 0 and 127, inclusive.
* Every ASCII character (0x00 - 0x7F) is mapped to itself.
* All Unicode characters will translate to printable ASCII characters (`\n` or characters in the range 0x20 - 0x7E).
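
The guarantees above can be checked mechanically on any output string. The sketch below uses only the standard library. The helper names `is_ascii_output` and `is_printable_or_newline` are hypothetical and are not part of `deunicode`. The second check is meant for the result of transliterating non-ASCII input, since ASCII control characters in the input map to themselves.

```rust
/// Guarantee 1: every char has a code point between 0 and 127, inclusive.
fn is_ascii_output(s: &str) -> bool {
    s.chars().all(|c| (c as u32) <= 127)
}

/// Guarantee 3: every char is `\n` or lies in the printable range 0x20..=0x7E.
fn is_printable_or_newline(s: &str) -> bool {
    s.chars()
        .all(|c| c == '\n' || ('\u{20}'..='\u{7E}').contains(&c))
}
```

In a test suite, one might assert both predicates on the value returned by `deunicode()` for a corpus of non-ASCII inputs.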
There are, however, some things you should keep in mind:
* Some transliterations do produce `\n` characters.
* Some Unicode characters transliterate to an empty string, either on purpose or because `deunicode` does not know about the character.
* Some Unicode characters are unknown and transliterate to `"[?]"` (or a custom placeholder, or `None` if you use a chars iterator).
* Many Unicode characters transliterate to multi-character strings. For
example, "北" is transliterated as "Bei".
* Transliteration is context-free and not sophisticated enough to produce proper Chinese or Japanese.
Han characters used in multiple languages are mapped to a single Mandarin pronunciation,
and will be mostly illegible to Japanese readers. Transliteration can't
handle cases where a single character has multiple possible pronunciations.
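
To make these caveats concrete, here is a toy context-free transliterator with a three-entry table. It is a standard-library-only illustration, not `deunicode`'s actual data or implementation; `toy_char`, `toy_transliterate`, and the table contents are hypothetical. Each character is looked up in isolation, so a lookup may yield several ASCII characters, an empty string, or nothing at all, in which case the caller's placeholder is used.

```rust
/// Looks up one non-ASCII character with no knowledge of its neighbours.
/// Returns `None` when the toy table does not know the character.
fn toy_char(c: char) -> Option<&'static str> {
    match c {
        'Æ' => Some("AE"),       // one character becomes two
        '北' => Some("Bei"),      // one character becomes a whole syllable
        '\u{200B}' => Some(""),  // zero-width space: deliberately empty
        _ => None,               // unknown to this toy table
    }
}

/// Transliterates `input`, substituting `placeholder` for unknown characters.
fn toy_transliterate(input: &str, placeholder: &str) -> String {
    let mut out = String::new();
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c); // ASCII maps to itself
        } else {
            out.push_str(toy_char(c).unwrap_or(placeholder));
        }
    }
    out
}
```

Because the lookup sees only one character at a time, the same character always produces the same output regardless of the surrounding text. That is why a table-driven approach like this cannot choose between multiple readings of a Han character.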
`Text::Unidecode` by Sean M. Burke

For a detailed explanation of the rationale behind the original dataset, refer to this article written by Burke in 2001.