LT FM-Index
lt-fm-index
is library for locate and count nucleotide sequence (ATGC) string.
lt-fm-index
using k-mer lookup table (As you noticed, LT stands for lookup table).
Description
- Fm-index is a data structure used for pattern matching.
- K-mer lookup table(KLT) is precalculated count table containing all kmer occurrences.
- With KLT, you can find the first k-mer pattern at once.
Currently, only the genetic sequence (ATGC) can be used.
Features
Fm-index using KLT with specified k-mer size.
- Suffix array compression with sampling ratio.
- BWT and suffix array are generated using
libdivsufsort
library.
- BWT(burrow wheeler transformed) string and occurrence array (OA) are aligned in one block of 64 strings.
- Aligned BWT&OA block encodes 1-byte character in 6-bits.
There are two main functions.
- count: Count the number of patterns in the text
- locate: Locate pattern index in text (KLT can be specified to enable or disable)
Future work
IO
- Input text can be
slice
Example
```rust
use ltfmindex::{Config, FmIndex};
let text = b"CTCCGTACACCTGTTTCGTATCGGA".tovec();
let config = Config::new()
.setkmerlookuptable(8)
.setsuffixarraysamplingratio(4);
let fmindex = FmIndex::new(&config, text);
let pattern = b"TA".tovec();
// count
let count = fmindex.count(&pattern);
asserteq!(count, 2);
// locate without k-mer lookup table
let locations = fmindex.locate(&pattern);
asserteq!(locations, vec![5,18]);
// locate with k-mer lookup table
let locations = fmindex.locatewithklt(&pattern);
asserteq!(locations, vec![5,18]);
```
Crate docs
Reference
- Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
- Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
- Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
- Yuta Mori.
libdivsufsort