🎼🧬 `lightmotif`

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. Such matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often visualized as sequence logos.
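To make the idea concrete, here is a minimal, unoptimized sketch of motif scanning in plain Rust. It does not use the `lightmotif` API: the `encode`, `scan`, and `argmax` helpers and the 3-column matrix are made up for illustration, and the A, C, G, T column order is an assumption.

```rust
/// Map a nucleotide to a column index (A, C, G, T); `None` for anything else.
fn encode(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Score every window of `seq` against `pssm`, where `pssm[j][k]` is the
/// score of symbol `k` at motif position `j`. Windows containing an
/// unknown symbol score negative infinity.
fn scan(seq: &[u8], pssm: &[[f32; 4]]) -> Vec<f32> {
    let m = pssm.len();
    if m == 0 || seq.len() < m {
        return Vec::new();
    }
    seq.windows(m)
        .map(|window| {
            window
                .iter()
                .zip(pssm)
                .map(|(&b, col)| encode(b).map_or(f32::NEG_INFINITY, |k| col[k]))
                .sum::<f32>()
        })
        .collect()
}

/// Index of the highest score (the last one in case of ties).
fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

fn main() {
    // Hypothetical matrix favouring the motif "TAG".
    let pssm: [[f32; 4]; 3] = [
        [-1.0, -1.0, -1.0, 2.0],
        [2.0, -1.0, -1.0, -1.0],
        [-1.0, -1.0, 2.0, -1.0],
    ];
    let scores = scan(b"CCTAGC", &pssm);
    println!("scores = {:?}, best = {:?}", scores, argmax(&scores));
}
```

Each window costs one table look-up per motif position, which is the inner loop that an accelerated implementation would aim to vectorize.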

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2].
- Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
- Vectorized matrix row look-up using permute instructions of AVX2.
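The symbol encoding and striped layout can be sketched in plain Rust as follows. This is an illustration of one common striping convention (position `i` stored at row `i % rows`, column `i / rows`), not necessarily the exact memory layout used by the crate; the `encode` and `stripe` helpers, the symbol order, and the padding value `4` are assumptions.

```rust
/// Encode DNA symbols as small integers usable as table indices.
/// Unknown symbols map to 4; this alphabet order is an assumption.
fn encode(seq: &str) -> Vec<u8> {
    seq.bytes()
        .map(|b| match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => 4,
        })
        .collect()
}

/// Lay out `encoded` in `C` columns so that row `r` holds positions
/// `r`, `r + rows`, `r + 2 * rows`, ... Unused cells are padded with 4.
fn stripe<const C: usize>(encoded: &[u8]) -> Vec<[u8; C]> {
    assert!(C > 0, "at least one column is required");
    let rows = (encoded.len() + C - 1) / C;
    let mut matrix = vec![[4u8; C]; rows];
    for (i, &sym) in encoded.iter().enumerate() {
        matrix[i % rows][i / rows] = sym;
    }
    matrix
}

fn main() {
    let striped = stripe::<4>(&encode("ACGTACGTAC"));
    for row in &striped {
        println!("{:?}", row);
    }
}
```

With this layout, a single row holds symbols from `C` distant positions of the sequence, so one vector operation on a row can advance `C` independent windows at once.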

This is the Rust version; a Python package is available as well.

💡 Example

```rust
use lightmotif::*;
use typenum::U32;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(&[
    EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(),
    EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(),
]).unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped::<U32>();
striped.configure(&pssm);

// Use a pipeline to compute scores for every position of the matrix.
let pli = Pipeline::generic();
let scores = pli.score(&striped, &pssm);

// Scores can be extracted into a Vec, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// The highest scoring position can be searched with a pipeline as well.
let best = pli.best_position(&scores).unwrap();
assert_eq!(best, 18);
```
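The count-to-score conversion in the example above (pseudocounts first, then scoring against background frequencies) can be read as the textbook log-odds computation sketched below. This is a plain-Rust illustration, not the crate's implementation: the `column_scores` helper, the 4-symbol alphabet, and the use of base-2 logarithms are assumptions.

```rust
/// Convert one column of raw counts into log2-odds scores.
/// `counts[k]` is how often symbol `k` was seen at this motif position.
fn column_scores(counts: [u32; 4], pseudocount: f32, background: [f32; 4]) -> [f32; 4] {
    // Total after adding the pseudocount to every symbol.
    let total: f32 = counts.iter().map(|&c| c as f32 + pseudocount).sum();
    let mut scores = [0.0f32; 4];
    for k in 0..4 {
        let freq = (counts[k] as f32 + pseudocount) / total;
        scores[k] = (freq / background[k]).log2();
    }
    scores
}

fn main() {
    // A column where both motif sequences carry a G (order A, C, G, T),
    // scored against uniform background frequencies.
    let scores = column_scores([0, 0, 2, 0], 0.1, [0.25; 4]);
    println!("{:?}", scores);
}
```

The pseudocount keeps unseen symbols from getting a frequency of zero, which would otherwise produce a score of negative infinity.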

Not specifying a vector type will cause the `Pipeline` to use the best vector type available based on the selected target features. To explicitly use the AVX2, SSSE3, or generic implementation, use `Pipeline<Dna, __m256i>`, `Pipeline<Dna, __m128i>`, or `Pipeline<Dna, u8>` respectively.
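As a rough illustration of what selection based on target features means, the sketch below uses the standard `cfg!` macro to report which instruction set was enabled at compile time. The `compiled_backend` function is hypothetical and only mirrors the AVX2, SSSE3, generic ordering described above; it is not how the crate itself is written.

```rust
/// Name of the widest backend enabled at compile time (hypothetical helper).
fn compiled_backend() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "ssse3") {
        "ssse3"
    } else {
        "generic"
    }
}

fn main() {
    // Building with `RUSTFLAGS="-C target-cpu=native"` enables the
    // features supported by the build machine.
    println!("compiled backend: {}", compiled_backend());
}
```

Because `cfg!` is resolved by the compiler, the branches that are not selected cost nothing at run time.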

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[4] and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, compiled with `--target-cpu=native`.

- Score every position of the genome with the motif weight matrix:

      running 3 tests
      test bench_avx2    ... bench:   6,948,169 ns/iter (+/- 16,477) = 668 MB/s
      test bench_ssse3   ... bench:  29,079,674 ns/iter (+/- 875,880) = 159 MB/s
      test bench_generic ... bench: 331,656,134 ns/iter (+/- 5,310,490) = 13 MB/s

- Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in `bio::pattern_matching::pssm`):

      test bench_avx2    ... bench:      49,259 ns/iter (+/- 1,489) = 203 MB/s
      test bench_bio     ... bench:   1,440,705 ns/iter (+/- 5,291) = 6 MB/s
      test bench_generic ... bench:     706,361 ns/iter (+/- 1,726) = 14 MB/s
      test bench_sssee   ... bench:      94,152 ns/iter (+/- 36) = 106 MB/s
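The relation between the `ns/iter` and `MB/s` columns can be checked with a little arithmetic. The sketch below assumes the 10kb sequence is exactly 10,000 bytes (one byte per symbol), which is not stated above; the `throughput_mb_s` and `speedup` helpers are hypothetical names.

```rust
/// Throughput in MB/s for `bytes` processed in `ns_per_iter` nanoseconds.
/// bytes/ns * 1e9 ns/s / 1e6 B/MB simplifies to bytes/ns * 1000.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> f64 {
    bytes as f64 / ns_per_iter as f64 * 1_000.0
}

/// How many times faster `candidate_ns` is than `baseline_ns`.
fn speedup(baseline_ns: u64, candidate_ns: u64) -> f64 {
    baseline_ns as f64 / candidate_ns as f64
}

fn main() {
    let bytes = 10_000; // assumed size of the 10kb sequence
    println!("avx2: {:.1} MB/s", throughput_mb_s(bytes, 49_259));
    println!("bio:  {:.1} MB/s", throughput_mb_s(bytes, 1_440_705));
    println!("avx2 vs bio: {:.1}x", speedup(1_440_705, 49_259));
}
```

Under that assumption, the computed values are consistent with the reported figures once rounded down to whole numbers, and the AVX2 pipeline comes out roughly 29 times faster than the `bio` implementation on this benchmark.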

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

[1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
[2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
[3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
[4] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.
