tiktoken-rs
Ready-made tokenizer library for working with GPT and tiktoken
Install the crate with cargo:
cargo add tiktoken-rs
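Then call the API from your Rust code:
use tiktoken_rs::tiktoken::p50k_base;
fn main() {
    // Load the p50k_base encoding.
    let bpe = p50k_base().unwrap();
    // Encode the text, allowing special tokens, and count the resulting tokens.
    let tokens = bpe.encode_with_special_tokens("This is an example");
    println!("Token count: {}", tokens.len());
}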
tiktoken supports the following encodings used by OpenAI models:
| Encoding name | OpenAI models |
|-------------------------|-----------------------------------------------------|
| cl100k_base | ChatGPT models, text-embedding-ada-002 |
| p50k_base | Code models, text-davinci-002, text-davinci-003 |
| p50k_edit | Edit models like text-davinci-edit-001, code-davinci-edit-001 |
| r50k_base (or gpt2) | GPT-3 models like davinci |
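The table can be read as a lookup from model name to encoding name. Here is a minimal sketch of that lookup, assuming a hypothetical helper `encoding_for_model` that is not part of the tiktoken-rs API and that covers only the models named explicitly in the table:

```rust
/// Hypothetical helper (not part of tiktoken-rs): maps an OpenAI model name
/// to the encoding name listed in the table. Only models named explicitly
/// in the table are covered; anything else returns `None`.
fn encoding_for_model(model: &str) -> Option<&'static str> {
    match model {
        "text-embedding-ada-002" => Some("cl100k_base"),
        "text-davinci-002" | "text-davinci-003" => Some("p50k_base"),
        "text-davinci-edit-001" | "code-davinci-edit-001" => Some("p50k_edit"),
        "davinci" => Some("r50k_base"),
        _ => None,
    }
}

fn main() {
    let models = ["text-davinci-003", "code-davinci-edit-001", "davinci", "some-other-model"];
    for model in models.iter() {
        match encoding_for_model(model) {
            Some(encoding) => println!("{} -> {}", model, encoding),
            None => println!("{} -> not listed in the table", model),
        }
    }
}
```

The returned name can then be used to pick the matching constructor, such as `p50k_base()` from the usage snippet. Returning `None` for unlisted models keeps the sketch from guessing an encoding the table does not state.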
See the examples in the repo for use cases. For more context on the different tokenizers, see the OpenAI Cookbook.
If you encounter any bugs or have any suggestions for improvements, please open an issue on the repository.
.tiktoken files.
This project is licensed under the MIT License.