The core of tokenizers, written in Rust. Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

What is a Tokenizer?

A Tokenizer works as a pipeline: it processes some raw text as input and outputs an Encoding. The various steps of the pipeline are:

  1. The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC.
  2. The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
  3. The Model: in charge of doing the actual tokenization. Examples of a Model are BPE and WordPiece.
  4. The PostProcessor: in charge of post-processing the Encoding to add anything relevant that, for example, a language model would need, such as special tokens.
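The steps above can be sketched as a toy pipeline in plain Rust. The helper functions below are illustrative stand-ins, not the crate's API: lowercasing stands in for a real Normalizer, a whitespace split for a PreTokenizer, and a vocabulary lookup for a Model.

```rust
use std::collections::HashMap;

// Normalizer stand-in: here just lowercasing (a real one might apply NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

// PreTokenizer stand-in: split the normalized text on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|s| s.to_string()).collect()
}

// Model stand-in: map each word to a vocabulary id, falling back to an unknown id.
fn tokenize(words: &[String], vocab: &HashMap<String, u32>, unk_id: u32) -> Vec<u32> {
    words.iter().map(|w| *vocab.get(w).copied().as_ref().unwrap_or(&unk_id)).collect()
}

fn main() {
    let vocab: HashMap<String, u32> =
        [("hey".to_string(), 0), ("there!".to_string(), 1)].into_iter().collect();

    // Run the three stages in order, as the real pipeline does.
    let ids = tokenize(&pre_tokenize(&normalize("Hey there!")), &vocab, 99);
    println!("{:?}", ids); // prints [0, 1]
}
```

A real Tokenizer composes these stages the same way, with a PostProcessor then adding things like special tokens to the resulting Encoding.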

Loading a pretrained tokenizer from the Hub

```rust
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```

Deserialization and tokenization example

```rust
use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    let bpe_builder = BPE::from_file("./path/to/vocab.json", "./path/to/merges.txt");
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(bpe);

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```

Training and serialization example

```rust
use tokenizers::decoders::DecoderWrapper;
use tokenizers::models::bpe::{BpeTrainerBuilder, BPE};
use tokenizers::normalizers::{strip::Strip, unicode::NFC, utils::Sequence, NormalizerWrapper};
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::pre_tokenizers::PreTokenizerWrapper;
use tokenizers::processors::PostProcessorWrapper;
use tokenizers::{AddedToken, Model, Result, TokenizerBuilder};

use std::path::Path;

fn main() -> Result<()> {
    let vocab_size: usize = 100;

    let mut trainer = BpeTrainerBuilder::new()
        .show_progress(true)
        .vocab_size(vocab_size)
        .min_frequency(0)
        .special_tokens(vec![
            AddedToken::from(String::from("<s>"), true),
            AddedToken::from(String::from("<pad>"), true),
            AddedToken::from(String::from("</s>"), true),
            AddedToken::from(String::from("<unk>"), true),
            AddedToken::from(String::from("<mask>"), true),
        ])
        .build();

    let mut tokenizer = TokenizerBuilder::new()
        .with_model(BPE::default())
        .with_normalizer(Some(Sequence::new(vec![
            Strip::new(true, true).into(),
            NFC.into(),
        ])))
        .with_pre_tokenizer(Some(ByteLevel::default()))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()))
        .build()?;

    let pretty = false;
    tokenizer
        .train_from_files(
            &mut trainer,
            vec!["path/to/vocab.txt".to_string()],
        )?
        .save("tokenizer.json", pretty)?;

    Ok(())
}
```

Additional information

Features

progressbar: The progress bar visualization, enabled by default. It may need to be disabled when compiling for targets that are not supported by the termios dependency of the indicatif progress bar.
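If the progress bar prevents compilation for such a target, one way to turn it off is to opt out of the crate's default features in Cargo.toml. This is a sketch, assuming the progressbar feature described above is part of the default feature set; the version shown is a placeholder to be replaced with the one you actually depend on.

```toml
[dependencies]
# default-features = false disables default features, including the progress bar;
# pin "version" to the release you actually use.
tokenizers = { version = "*", default-features = false }
```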