Library used by Meilisearch to tokenize queries and documents
The tokenizer’s role is to take a sentence or phrase and split it into smaller units of language, called tokens. It finds all the words in a string according to the particularities of each language.
Charabia is modular: it works field by field, detecting the most likely language of each field and running the specialized pipeline for that language.
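In practice, tokenizing a text looks roughly like the sketch below. It assumes the crate's `Tokenize` trait and its `tokenize()`, `is_word()`, and `lemma()` methods; the exact lemmas and token classification it produces may differ between versions.

```rust
use charabia::Tokenize;

fn main() {
    let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

    // `tokenize()` detects the script/language of the text and runs the
    // matching segmentation + normalization pipeline, yielding tokens.
    for token in orig.tokenize() {
        // `lemma()` is the normalized form (lowercased and deunicoded for
        // Latin script); `is_word()` skips separators such as spaces and ",".
        if token.is_word() {
            println!("{}", token.lemma());
        }
    }
}
```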
Charabia is multilingual, featuring optimized support for:
| Script - Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
|---|---|---|---|---|
| Latin - Any | ✅ unicode-segmentation | ✅ lowercase + deunicode | 🟩 ~45MiB/sec | 🟨 ~24MiB/sec |
| Chinese - CMN 🇨🇳 | ✅ jieba | ✅ traditional-to-simplified conversion | 🟨 ~21MiB/sec | 🟧 ~9MiB/sec |
We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our GitHub repository.
Performance levels are based on the throughput (MiB/sec) of the tokenizer (computed on a MacBook Pro 2021 - Apple M1 Pro) using jemalloc:
- 0️⃣⬛️: 0 -> 1 MiB/sec
- 1️⃣🟥: 1 -> 5 MiB/sec
- 2️⃣🟧: 5 -> 12 MiB/sec
- 3️⃣🟨: 12 -> 35 MiB/sec
- 4️⃣🟩: 35 -> 75 MiB/sec
- 5️⃣🟪: 75 -> ... MiB/sec
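The table distinguishes segmentation (splitting only) from tokenization (splitting plus normalization), which is why the two columns report different throughputs. The sketch below illustrates the difference, assuming the crate's `Segment` and `Tokenize` traits; the printed values are illustrative rather than guaranteed output.

```rust
use charabia::{Segment, Tokenize};

fn main() {
    let orig = "Brr, it's 29.3°F!";

    // Segmentation only: split the text into raw pieces without normalizing.
    let segments: Vec<&str> = orig.segment_str().collect();
    println!("{segments:?}");

    // Tokenization: segmentation followed by normalization
    // (lowercasing, deunicoding, etc.).
    let lemmas: Vec<String> = orig
        .tokenize()
        .map(|token| token.lemma().to_string())
        .collect();
    println!("{lemmas:?}");
}
```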