Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.
This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.
```rust
use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk
let max_characters = 1000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);
```
To size chunks by tokens from the Hugging Face `tokenizers` crate, activate the `tokenizers` feature.
```rust
use text_splitter::TextSplitter;
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);
```
To size chunks by tokens from the `tiktoken-rs` crate, activate the `tiktoken-rs` feature.
```rust
use text_splitter::TextSplitter;
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tiktoken_rs::cl100k_base;

let tokenizer = cl100k_base().unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);
```
You can also specify the chunk capacity as a range. A chunk is returned as soon as its length falls within the range. A chunk shorter than the `start` value may still be returned when adding the next piece of text would have made it larger than the `end` capacity.
```rust
use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk. Will fill up the
// chunk until it is somewhere in this range.
let max_characters = 500..2000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default().with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);
```
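To make the range behaviour concrete, here is a minimal sketch that uses only the standard library. It is not the crate's implementation: `fill_chunks` is a hypothetical helper that greedily joins pieces that have already been split (words, for instance), measures size in characters, and assumes every individual piece is shorter than the end of the range.

```rust
use std::ops::Range;

/// Hypothetical helper, not part of text_splitter.
/// Joins `pieces` with single spaces and closes a chunk as soon as its
/// character count falls inside `capacity` (the end bound is exclusive).
fn fill_chunks(pieces: &[&str], capacity: Range<usize>) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();

    for piece in pieces {
        let current_len = current.chars().count();
        // Size of the piece plus one joining space if a chunk is in progress.
        let extra = piece.chars().count() + usize::from(!current.is_empty());

        // Adding this piece would reach or pass the end of the range, so
        // emit what we have, even if it is still shorter than `capacity.start`.
        if !current.is_empty() && current_len + extra >= capacity.end {
            chunks.push(std::mem::take(&mut current));
        }

        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(piece);

        // The chunk now sits inside the range: return it.
        if capacity.contains(&current.chars().count()) {
            chunks.push(std::mem::take(&mut current));
        }
    }

    // Whatever is left over becomes the final chunk.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a range of `6..10`, the pieces `["aaaa", "bbbbbbbbb", "cc"]` come back unmerged: `"aaaa"` is emitted below the start value because joining it with the nine-character piece would overshoot the end of the range.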
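To preserve as much semantic meaning within a chunk as possible, a recursive approach is used, starting at larger semantic units and, if a unit is too large, breaking it up at the next largest unit. In outline: the text is split at a given level; sections that fit within the chunk size are merged with their neighbors to fill the chunk as far as possible; sections that do not fit are split at the next level, and the process repeats.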
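The recursion can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the crate's code: `split_recursive` and `LEVELS` are hypothetical names, literal separators stand in for the semantic levels, size is counted in characters, and separators that fall on a chunk boundary are dropped.

```rust
/// Hypothetical separators standing in for semantic levels, largest first.
/// The real crate uses Unicode-aware boundaries rather than literal strings.
const LEVELS: [&str; 3] = ["\n\n", "\n", " "];

/// Splits `text` into chunks of at most `max_chars` characters, trying the
/// separator at `level` first and recursing into the next level only for
/// sections that are still too large.
fn split_recursive(text: &str, max_chars: usize, level: usize) -> Vec<String> {
    if text.chars().count() <= max_chars {
        return vec![text.to_string()];
    }

    // No separators left: fall back to fixed runs of whole characters.
    let Some(sep) = LEVELS.get(level).copied() else {
        let chars: Vec<char> = text.chars().collect();
        return chars
            .chunks(max_chars.max(1))
            .map(|run| run.iter().collect::<String>())
            .collect();
    };

    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for section in text.split(sep).filter(|s| !s.is_empty()) {
        let section_len = section.chars().count();

        if section_len > max_chars {
            // Too large for one chunk: flush, then split at the next level.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            chunks.extend(split_recursive(section, max_chars, level + 1));
            continue;
        }

        // Merge neighboring sections while the result still fits.
        let joined_len = current.chars().count() + sep.chars().count() + section_len;
        if !current.is_empty() && joined_len > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str(sep);
        }
        current.push_str(section);
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Calling `split_recursive("alpha beta\n\ngamma delta epsilon", 12, 0)` keeps the first paragraph whole, because it fits, and only breaks the second paragraph at word level.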
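The boundaries used to split the text if using the top-level `chunks` method, in descending length:

1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
2. Unicode Sentence Boundaries
3. Unicode Word Boundaries
4. Unicode Grapheme Cluster Boundaries
5. Characters

Splitting doesn't occur below the character level, otherwise you could get partial bytes of a `char`, which may not be a valid Unicode `str`.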
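The character-level floor matters because a Rust `str` is UTF-8, so one `char` can occupy several bytes. The sketch below shows a cut that always lands on a `char` boundary; `prefix_by_chars` is a hypothetical helper written for this illustration, not a function from the crate.

```rust
/// Hypothetical helper: returns the longest prefix of `text` that holds at
/// most `max_chars` characters, always cutting on a `char` boundary.
fn prefix_by_chars(text: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset at which each character starts,
    // so the offset of character number `max_chars` is a safe place to cut.
    match text.char_indices().nth(max_chars) {
        Some((byte_index, _)) => &text[..byte_index],
        None => text, // Fewer than `max_chars` characters: keep everything.
    }
}
```

In `"héllo"` the `é` takes two bytes, so byte offset 2 falls inside it: `"héllo".is_char_boundary(2)` is `false` and `"héllo".get(..2)` is `None`, while `prefix_by_chars("héllo", 2)` yields `"hé"`.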
Note on sentences: There are many methods for determining sentence breaks, with varying degrees of accuracy, and many of them require ML models. Rather than trying to find the perfect sentence breaks, we rely on the Unicode method of sentence boundaries, which in most cases is good enough for finding a decent semantic breaking point if a paragraph is too large, and avoids the performance penalties of many other methods.
This crate was inspired by LangChain's TextSplitter. Looking into that implementation, however, there was potential for better performance as well as better semantic chunking.
A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.