lightmotif
A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often viewed as sequence logos.
The `lightmotif` library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:
permute instructions of AVX2.

This is the Rust version; a Python package is available as well.
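For reference, the computation that a position weight matrix scan performs can be sketched in plain Rust, without any acceleration. This is only an illustration, not lightmotif's implementation: the function names (`encode`, `scoring_matrix`, `scan`, `best_position`), the A/C/G/T column order, and the choice of base-2 log-odds against a uniform background are all assumptions made for the example.

```rust
/// Map a DNA symbol to a column index (assumed order: A, C, G, T).
fn encode(symbol: u8) -> Option<usize> {
    match symbol {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Build a log-odds scoring matrix (one row per motif position) from
/// equal-length motif instances, using a pseudocount and a uniform background.
fn scoring_matrix(motifs: &[&str], pseudocount: f32) -> Vec<[f32; 4]> {
    assert!(!motifs.is_empty(), "at least one motif instance is required");
    let width = motifs[0].len();
    let mut matrix = vec![[0.0f32; 4]; width];
    for motif in motifs {
        assert_eq!(motif.len(), width, "motif instances must share one length");
        for (row, &symbol) in matrix.iter_mut().zip(motif.as_bytes()) {
            let column = encode(symbol).expect("unexpected symbol in motif");
            row[column] += 1.0;
        }
    }
    // Turn counts into frequencies, then into log2 odds against 0.25.
    let total = motifs.len() as f32 + 4.0 * pseudocount;
    for row in matrix.iter_mut() {
        for cell in row.iter_mut() {
            let frequency = (*cell + pseudocount) / total;
            *cell = (frequency / 0.25).log2();
        }
    }
    matrix
}

/// Score every window of the sequence by summing one matrix cell per
/// motif position. The matrix must have at least one row.
fn scan(sequence: &str, matrix: &[[f32; 4]]) -> Vec<f32> {
    let indices: Vec<usize> = sequence
        .bytes()
        .map(|s| encode(s).expect("unexpected symbol in sequence"))
        .collect();
    indices
        .windows(matrix.len())
        .map(|window| {
            window
                .iter()
                .zip(matrix)
                .map(|(&column, row)| row[column])
                .sum::<f32>()
        })
        .collect()
}

/// Index of the highest-scoring window, or `None` if there are no scores.
fn best_position(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index)
}
```

Each window costs one table look-up and one addition per motif position, which is the work that an accelerated implementation spreads across vector lanes.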
```rust
use lightmotif::*;
use typenum::U32;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped::

// Use a pipeline to compute scores for every position of the matrix.
let pli = Pipeline::generic();
let scores = pli.score(&striped, &pssm);

// Scores can be extracted into a Vec

// The highest scoring position can be searched with a pipeline as well.
let best = pli.best_position(&scores).unwrap();
assert_eq!(best, 18);
```
This example uses the *generic* pipeline, which is not platform accelerated. To use the much faster AVX2 code, create an AVX2 pipeline with `Pipeline::avx2` instead: this returns a `Result` which is `Ok` if AVX2 is supported on the local platform.
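Choosing between an accelerated and a generic code path comes down to checking CPU features at run time. The sketch below shows one way to do that with the standard library's `is_x86_feature_detected!` macro. The `Backend` enum and `detect_backend` function are hypothetical names for this example and are not part of lightmotif; how the crate performs its own check is not shown here.

```rust
/// Hypothetical label for the code path to use (not a lightmotif type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Sse2,
    Generic,
}

/// Pick the fastest backend that the current CPU reports as supported.
fn detect_backend() -> Backend {
    // The detection macro only exists on x86 targets, so guard it with cfg.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    // Any other architecture, or an x86 CPU without these extensions.
    Backend::Generic
}
```

A caller would match on the returned value once at start-up and fall back to the generic path when nothing faster is available, which mirrors handling the `Err` case of a fallible constructor.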
Both benchmarks use the MX000001 motif from PRODORIC[4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running @1.10GHz, compiled with `--target-cpu=native`.
Score every position of the genome with the motif weight matrix:
```console
running 3 tests
test bench_avx2    ... bench:   5,795,415 ns/iter (+/- 43,021) = 800 MB/s
test bench_sse2    ... bench:  30,405,655 ns/iter (+/- 184,109) = 152 MB/s
test bench_generic ... bench: 315,272,609 ns/iter (+/- 1,682,900) = 14 MB/s
```
Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in `bio::pattern_matching::pssm`):
```console
test bench_avx2    ... bench:      15,725 ns/iter (+/- 21) = 635 MB/s
test bench_sse2    ... bench:      67,190 ns/iter (+/- 118) = 148 MB/s
test bench_generic ... bench:     711,948 ns/iter (+/- 3,386) = 14 MB/s
test bench_bio     ... bench:   1,423,256 ns/iter (+/- 24,119) = 7 MB/s
```
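The throughput column follows directly from the time per iteration and the input size. As a rough check, the sketch below recomputes it, assuming the 10kb sequence is exactly 10,000 bytes (the exact length is not stated here); `throughput_mb_s` and `speedup` are helper names made up for this example.

```rust
/// Throughput in MB/s (1 MB = 1,000,000 bytes) for `bytes` processed
/// in `ns_per_iter` nanoseconds.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> f64 {
    // bytes per nanosecond equals GB/s; multiply by 1000 to get MB/s.
    bytes as f64 * 1e3 / ns_per_iter as f64
}

/// How many times faster the `fast` timing is than the `slow` one.
fn speedup(slow_ns: u64, fast_ns: u64) -> f64 {
    slow_ns as f64 / fast_ns as f64
}
```

With these assumptions, `throughput_mb_s(10_000, 15_725)` comes to just under 636 MB/s, in line with the 635 MB/s reported for `bench_avx2`, and `speedup(1_423_256, 15_725)` puts the AVX2 pipeline at roughly 90 times faster than `bench_bio` on this input.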
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the open-source MIT license.
This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.