tongrams: Tons of N-grams

tongrams is a crate for indexing and querying large language models in compressed space, using the data structures presented in the following papers:

This is a Rust port of the tongrams C++ library.

What it can do

Features

Installation

To use tongrams, depend on it in your Cargo manifest:

```toml
# Cargo.toml

[dependencies]
tongrams = "0.1"
```

Input data format

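The file format of N-gram counts files is the same as that used in the original tongrams library, a modified Google format: the first line gives the number of grams in the file, and each following line holds one gram and its count, separated by a horizontal tab.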

    <number_of_grams>
    <gram1><TAB><count1>
    <gram2><TAB><count2>
    <gram3><TAB><count3>
    ...

For example,

    61516
    the // parent	1
    the function is	22
    the function a	4
    the function to	1
    the function and	1
    ...
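To make the layout concrete, here is a minimal sketch of a reader for this format that uses only the standard library. The function `parse_ngram_counts` is a hypothetical helper written for illustration and is not part of the tongrams API. It assumes uncompressed text input (a `.gz` file would have to be decompressed first) and that tokens within a gram are separated by single spaces.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper (not part of tongrams): reads a counts file made of
/// a header line with the number of grams, followed by one
/// `<gram><TAB><count>` record per line.
fn parse_ngram_counts<R: BufRead>(reader: R) -> io::Result<Vec<(Vec<String>, u64)>> {
    let invalid = |msg: String| io::Error::new(io::ErrorKind::InvalidData, msg);
    let mut lines = reader.lines();

    // Header: the number of grams stored in this file.
    let header = lines
        .next()
        .ok_or_else(|| invalid("missing header line".to_string()))??;
    let expected: usize = header
        .trim()
        .parse()
        .map_err(|_| invalid(format!("bad header: {:?}", header)))?;

    let mut records: Vec<(Vec<String>, u64)> = Vec::with_capacity(expected);
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue; // tolerate blank lines such as a trailing newline
        }
        // Split on the last tab: the gram is on the left, the count on the right.
        let (gram, count) = line
            .rsplit_once('\t')
            .ok_or_else(|| invalid(format!("missing tab: {:?}", line)))?;
        let count: u64 = count
            .trim()
            .parse()
            .map_err(|_| invalid(format!("bad count: {:?}", line)))?;
        // Assumption: tokens inside a gram are separated by single spaces.
        let tokens: Vec<String> = gram.split(' ').map(str::to_string).collect();
        records.push((tokens, count));
    }

    // The header must agree with the number of records actually read.
    if records.len() != expected {
        return Err(invalid(format!(
            "header says {} grams, found {}",
            expected,
            records.len()
        )));
    }
    Ok(records)
}
```

Any `BufRead` source can be passed in, for example a `BufReader` over a file or a byte slice in a test. The sketch rejects files whose header disagrees with the number of records, which is a cheap way to catch truncated input.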

Examples

The following code uses datasets in test_data at the root of this repository.

```rust
use tongrams::EliasFanoTrieCountLm;

// File names of N-grams.
let filenames = vec![
    "../test_data/1-grams.sorted.gz",
    "../test_data/2-grams.sorted.gz",
    "../test_data/3-grams.sorted.gz",
];

// Builds the language model from n-gram counts files.
let lm = EliasFanoTrieCountLm::from_gz_files(&filenames).unwrap();

// Creates the instance for lookup.
let mut lookuper = lm.lookuper();

// Gets the count of a query N-gram written in a space-separated string.
assert_eq!(lookuper.with_str("vector"), Some(182));
assert_eq!(lookuper.with_str("in order"), Some(47));
assert_eq!(lookuper.with_str("the same memory"), Some(8));
assert_eq!(lookuper.with_str("vector is array"), None);

// Gets the count of a query N-gram formed by a string array.
assert_eq!(lookuper.with_tokens(&["vector"]), Some(182));
assert_eq!(lookuper.with_tokens(&["in", "order"]), Some(47));
assert_eq!(lookuper.with_tokens(&["the", "same", "memory"]), Some(8));
assert_eq!(lookuper.with_tokens(&["vector", "is", "array"]), None);

// Serializes the index into a writable stream.
let mut data = vec![];
lm.serialize_into(&mut data).unwrap();

// Deserializes the index from a readable stream.
let other = EliasFanoTrieCountLm::deserialize_from(&data[..]).unwrap();
assert_eq!(lm.num_orders(), other.num_orders());
assert_eq!(lm.num_grams(), other.num_grams());
```

Licensing

This library is free software provided under the MIT License.