`tongrams`: Tons of N-grams

`tongrams` is a crate to index and query large N-gram language models in compressed space, using the data structures presented in the following papers:
Giulio Ermanno Pibiri and Rossano Venturini, Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
Giulio Ermanno Pibiri and Rossano Venturini, Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.
This is a Rust port of the `tongrams` C++ library.
Store N-gram language models with frequency counts.
Look up N-grams to get the frequency counts.
Compressed language model. `tongrams-rs` can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N=1..5) in `test_data` are stored in only 2.6 bytes per gram.
Time and memory efficiency. `tongrams-rs` employs the Elias-Fano Trie, which encodes a trie of N-grams using Elias-Fano codes, enabling fast lookups in compressed space.
Pure Rust. `tongrams-rs` is written only in Rust and can be easily plugged into your Rust code.
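To give a feel for the Elias-Fano codes mentioned above, here is a minimal sketch of Elias-Fano encoding for a sorted integer sequence. It is not the crate's implementation: the `EliasFano` type and its methods are hypothetical names, the bits are stored unpacked for readability, and lookup uses a linear scan where a real index would use a constant-time select structure.

```rust
/// Minimal Elias-Fano encoding of a sorted sequence of integers.
/// Hypothetical illustration: these names are not part of the tongrams API.
pub struct EliasFano {
    low_bits: u32,    // number of low bits kept verbatim per value
    lows: Vec<u64>,   // low parts (one u64 each here; a real index packs them)
    highs: Vec<bool>, // high parts in unary: element i sets bit (high_i + i)
}

impl EliasFano {
    /// Encodes `values`, which must be sorted and smaller than `universe`.
    pub fn new(values: &[u64], universe: u64) -> Self {
        assert!(values.windows(2).all(|w| w[0] <= w[1]), "values must be sorted");
        assert!(values.iter().all(|&v| v < universe), "values must be < universe");
        // Use floor(log2(universe / n)) low bits, or 0 if the ratio is below 2.
        let ratio = universe / values.len().max(1) as u64;
        let low_bits = if ratio > 1 { 63 - ratio.leading_zeros() } else { 0 };
        let mask = (1u64 << low_bits) - 1;
        // The last element sets the highest bit, at index (high_last + n - 1).
        let len = values
            .last()
            .map_or(0, |&v| (v >> low_bits) as usize + values.len());
        let mut highs = vec![false; len];
        let mut lows = Vec::with_capacity(values.len());
        for (i, &v) in values.iter().enumerate() {
            lows.push(v & mask);
            highs[(v >> low_bits) as usize + i] = true;
        }
        Self { low_bits, lows, highs }
    }

    /// Returns the i-th value by finding the i-th set bit in `highs`.
    pub fn get(&self, i: usize) -> Option<u64> {
        let mut seen = 0;
        for (pos, &bit) in self.highs.iter().enumerate() {
            if bit {
                if seen == i {
                    // The number of zeros before this bit is the high part.
                    let high = (pos - i) as u64;
                    return Some((high << self.low_bits) | self.lows[i]);
                }
                seen += 1;
            }
        }
        None
    }

    /// Size of the encoding in bits if `lows` and `highs` were bit-packed.
    pub fn size_in_bits(&self) -> usize {
        self.lows.len() * self.low_bits as usize + self.highs.len()
    }
}
```

For the eight values 3, 4, 7, 13, 14, 15, 21, 43 with a universe of 44, this sketch keeps 2 low bits per value plus 18 high bits, 34 bits in total, compared with 8 × 6 = 48 bits for fixed-width 6-bit codes.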
To use `tongrams`, depend on it in your Cargo manifest:
```toml
[dependencies]
tongrams = "0.1"
```
The file format of N-gram counts files is the same as that used in `tongrams`, a modified Google format, where `<number_of_grams>` indicates the number of N-grams in the file, the tokens within each `<gram>` are separated by a space (e.g., `the same time`), and `<gram>` and its count `<count>` are separated by a horizontal tab.
<number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...
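As an illustration of this layout, the following sketch reads a counts file from any buffered reader using only the standard library. The `parse_counts` and `invalid` helpers are hypothetical and not part of the `tongrams` API, and the sketch assumes plain (already decompressed) text input.

```rust
use std::io::{self, BufRead};

/// Builds an `InvalidData` error with the given message.
fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Parses `<number_of_grams>` followed by `<gram><TAB><count>` lines.
/// Hypothetical helper for illustration; not part of the tongrams API.
pub fn parse_counts<R: BufRead>(reader: R) -> io::Result<Vec<(String, u64)>> {
    let mut lines = reader.lines();

    // The first line holds the number of N-grams that follow.
    let header = lines.next().ok_or_else(|| invalid("missing header line"))??;
    let expected: usize = header
        .trim()
        .parse()
        .map_err(|_| invalid("header is not a number"))?;

    let mut records = Vec::new();
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue; // tolerate blank lines
        }
        // The gram and its count are separated by a single horizontal tab.
        let (gram, count) = line
            .split_once('\t')
            .ok_or_else(|| invalid("missing tab separator"))?;
        let count: u64 = count
            .trim()
            .parse()
            .map_err(|_| invalid("count is not a number"))?;
        records.push((gram.to_string(), count));
    }

    // Check the header against the number of records actually read.
    if records.len() != expected {
        return Err(invalid("record count does not match header"));
    }
    Ok(records)
}
```

Because `&[u8]` implements `BufRead`, the function can be tried on an in-memory string such as `"2\nthe\t10\nthe same\t4\n"` (made-up counts) by passing `input.as_bytes()`.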
The following code uses datasets in `test_data` at the root of this repository.
```rust
use tongrams::EliasFanoTrieCountLm;

// File names of N-grams.
let filenames = vec![
    "../test_data/1-grams.sorted.gz",
    "../test_data/2-grams.sorted.gz",
    "../test_data/3-grams.sorted.gz",
];

// Builds the language model from n-gram counts files.
let lm = EliasFanoTrieCountLm::from_gz_files(&filenames).unwrap();

// Creates the instance for lookup.
let mut lookuper = lm.lookuper();

// Gets the count of a query N-gram written in a space-separated string.
assert_eq!(lookuper.with_str("vector"), Some(182));
assert_eq!(lookuper.with_str("in order"), Some(47));
assert_eq!(lookuper.with_str("the same memory"), Some(8));
assert_eq!(lookuper.with_str("vector is array"), None);

// Serializes the index into a writable stream.
let mut data = vec![];
lm.serialize_into(&mut data).unwrap();

// Deserializes the index from a readable stream.
let other = EliasFanoTrieCountLm::deserialize_from(&data[..]).unwrap();
assert_eq!(lm.num_orders(), other.num_orders());
assert_eq!(lm.num_grams(), other.num_grams());
```
This library is free software provided under the MIT License.