The core of tokenizers
, written in Rust.
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
A Tokenizer works as a pipeline, it processes some raw text as input and outputs an Encoding
.
The various steps of the pipeline are:
Normalizer
: in charge of normalizing the text. Common examples of normalization are
the unicode normalization standards, such as NFD
or NFKC
.PreTokenizer
: in charge of creating initial words splits in the text. The most common way of
splitting text is simply on whitespace.Model
: in charge of doing the actual tokenization. An example of a Model
would be
BPE
or WordPiece
.PostProcessor
: in charge of post-processing the Encoding
to add anything relevant
that, for example, a language model would need, such as special tokens.```Rust use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput}; use tokenizers::models::bpe::BPE;
fn main() -> Result<()> { let bpebuilder = BPE::fromfiles("./path/to/vocab.json", "./path/to/merges.txt")?; let bpe = bpebuilder .dropout(0.1) .unktoken("[UNK]".into()) .build()?;
let mut tokenizer = Tokenizer::new(Box::new(bpe));
let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()))?;
println!("{:?}", encoding.get_tokens());
Ok(())
} ```
RAYON_RS_NUM_CPUS
environment variable. As an example setting RAYON_RS_NUM_CPUS=4
will allocate a maximum of 4 threads.
Please note this behavior may evolve in the future