Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library.
It includes a broad range of tokenizers for state-of-the-art transformers architectures, including:
- Sentence Piece (unigram model)
- Sentence Piece (BPE model)
- BERT
- ALBERT
- DistilBERT
- RoBERTa
- GPT
- GPT2
- ProphetNet
- CTRL
- Pegasus
- MBart50
- M2M100
The wordpiece-based tokenizers include both single-threaded and multi-threaded processing. The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers.

Using the tokenizers requires manually downloading the files each tokenizer needs (vocabulary and, where applicable, merge files). These can be found in the Transformers library.
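To illustrate why a shared cache pairs naturally with single-threaded BPE: merging is word-local, so the result for a repeated word can be memoized, but the mutable cache would need synchronization across threads. The sketch below is illustrative only (the merge ranks and function names are hypothetical, not the crate's API):

```rust
use std::collections::HashMap;

/// Minimal BPE sketch: repeatedly merge the adjacent pair of symbols
/// with the lowest merge rank until no ranked pair remains.
/// Not the crate's implementation.
fn bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest (highest-priority) rank.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|r| (*r, i)))
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..i + 2, [merged]);
            }
            None => return parts,
        }
    }
}

/// Memoized wrapper: repeated words are served from the cache,
/// which is why the BPE tokenizers stay single-threaded.
fn bpe_cached(
    word: &str,
    ranks: &HashMap<(String, String), usize>,
    cache: &mut HashMap<String, Vec<String>>,
) -> Vec<String> {
    if let Some(hit) = cache.get(word) {
        return hit.clone();
    }
    let out = bpe(word, ranks);
    cache.insert(word.to_string(), out.clone());
    out
}

fn main() {
    // Hypothetical merge table: "l"+"o" first, then "lo"+"w".
    let ranks: HashMap<(String, String), usize> = [
        (("l".to_string(), "o".to_string()), 0),
        (("lo".to_string(), "w".to_string()), 1),
    ]
    .into_iter()
    .collect();
    let mut cache = HashMap::new();
    println!("{:?}", bpe_cached("low", &ranks, &mut cache)); // ["low"]
    println!("{:?}", bpe_cached("low", &ranks, &mut cache)); // served from cache
}
```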
```rust
use std::sync::Arc;

// vocab_path points to a locally downloaded BERT vocabulary file
let vocab = Arc::new(rust_tokenizers::BertVocab::from_file(&vocab_path));

let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab.clone());

println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1, None, 128, &TruncationStrategy::LongestFirst, 0));
```
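Under the hood, a BERT-style wordpiece tokenizer splits each word by greedy longest-match-first lookup against the vocabulary, prefixing continuation pieces with `##`. The following self-contained sketch shows the idea; it is not the crate's implementation, and the toy vocabulary is made up:

```rust
use std::collections::HashSet;

/// Greedy longest-match-first wordpiece split, as used by BERT-style
/// vocabularies. Continuation pieces carry a "##" prefix; a word with
/// no valid segmentation maps to the unknown token.
/// Illustrative sketch only, not the crate's implementation.
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining substring first, shrinking until a vocab hit.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece);
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No piece matched at this position: the whole word is unknown.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    // Toy vocabulary for illustration.
    let vocab: HashSet<&str> = ["token", "##ized", "##s", "un"].into_iter().collect();
    println!("{:?}", wordpiece("tokenized", &vocab, "[UNK]")); // ["token", "##ized"]
    println!("{:?}", wordpiece("xyz", &vocab, "[UNK]")); // ["[UNK]"]
}
```

The real tokenizer additionally handles basic tokenization (whitespace, punctuation, casing) before the wordpiece step, and maps pieces to the integer ids that `encode` returns.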