tiktoken-rs


Rust library for tokenizing text with OpenAI models using tiktoken.

This library provides a set of ready-made tokenizers for working with GPT, tiktoken, and related OpenAI models. Use cases include tokenizing text inputs and counting the tokens they contain.

This library is built on top of the tiktoken library and adds features and enhancements that make it easier to use from Rust code.

Examples

For full working examples for all supported features, see the examples directory in the repository.

Usage

1. Install this library locally with cargo:

```sh
cargo add tiktoken-rs
```

Then, in your Rust code, call the API:

Counting token length

```rust
use another_tiktoken_rs::p50k_base;

// Load the p50k_base encoding and tokenize a sentence.
let bpe = p50k_base().unwrap();
let tokens = bpe.encode_with_special_tokens("This is a sentence with spaces");
println!("Token count: {}", tokens.len());
```

Counting max_tokens parameter for a chat completion request

```rust
use another_tiktoken_rs::{get_chat_completion_max_tokens, ChatCompletionRequestMessage};

// The conversation that will be sent with the request.
let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: "user".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
];
// Tokens still available for the completion once the messages are counted.
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);
```
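Conceptually, the `max_tokens` value is the model's context window minus the tokens already used by the prompt. The following is a minimal sketch of that arithmetic using only the standard library. The helper name `remaining_tokens` is hypothetical and not part of this crate, the prompt token count is passed in rather than computed, and the 8,192-token window is assumed here as the context size of the original gpt-4 model.

```rust
/// Hypothetical helper (not part of this crate): tokens left for the completion.
fn remaining_tokens(context_size: usize, prompt_tokens: usize) -> usize {
    // saturating_sub returns 0 instead of underflowing when the prompt
    // is larger than the context window.
    context_size.saturating_sub(prompt_tokens)
}

fn main() {
    // Assumed values for illustration: 8,192-token window, 30-token prompt.
    let max_tokens = remaining_tokens(8192, 30);
    println!("max_tokens: {}", max_tokens);
}
```

Using a saturating subtraction means an oversized prompt yields zero rather than a panic, which the caller can treat as "no room left for a completion".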

Counting max_tokens parameter for a chat completion request with async-openai

You need to enable the async-openai feature in your Cargo.toml file.

```rust
use another_tiktoken_rs::async_openai::get_chat_completion_max_tokens;
use async_openai::types::{ChatCompletionRequestMessage, Role};

// The same conversation, built with the async-openai message type.
let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: Role::User,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
];
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);
```

tiktoken supports these encodings used by OpenAI models:

| Encoding name           | OpenAI models                                                             |
| ----------------------- | ------------------------------------------------------------------------- |
| cl100k_base             | ChatGPT models, text-embedding-ada-002                                    |
| p50k_base               | Code models, text-davinci-002, text-davinci-003                           |
| p50k_edit               | Use for edit models like text-davinci-edit-001, code-davinci-edit-001     |
| r50k_base (or gpt2)     | GPT-3 models like davinci                                                 |
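The table can be read as a lookup from model name to encoding name. Here is a minimal sketch of that lookup using only the standard library. The function `encoding_for_model` is hypothetical, written for illustration rather than taken from this crate's API, and it covers only the models the table lists by name.

```rust
/// Hypothetical lookup mirroring the table; not this crate's API.
fn encoding_for_model(model: &str) -> Option<&'static str> {
    match model {
        "text-embedding-ada-002" => Some("cl100k_base"),
        "text-davinci-002" | "text-davinci-003" => Some("p50k_base"),
        "text-davinci-edit-001" | "code-davinci-edit-001" => Some("p50k_edit"),
        "davinci" => Some("r50k_base"),
        // Models the table does not list by name are left unresolved.
        _ => None,
    }
}

fn main() {
    for model in ["text-davinci-003", "davinci", "unknown-model"] {
        match encoding_for_model(model) {
            Some(enc) => println!("{} -> {}", model, enc),
            None => println!("{} -> (not in table)", model),
        }
    }
}
```

Returning `Option` keeps the unknown-model case explicit instead of silently falling back to a default encoding.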

See the examples in the repo for use cases. For more context on the different tokenizers, see the OpenAI Cookbook.

Encountered any bugs?

If you encounter any bugs or have any suggestions for improvements, please open an issue on the repository.

Acknowledgements

Thanks to @spolu for the original code and the .tiktoken files.

License

This project is licensed under the MIT License.