lightmotif
A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. Position weight matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. They are often visualized as sequence logos.
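To make the scanning idea concrete, here is a minimal pure-Python sketch of scoring every window of a sequence against a log-odds matrix. It is not how lightmotif is implemented; the helper names `build_pssm` and `scan`, the uniform background, and the pseudocount of 0.1 are assumptions chosen for illustration.

```python
import math


def build_pssm(sites, alphabet="ACGT", pseudocount=0.1):
    """Build a log-odds matrix (one dict per motif position) from aligned sites."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for i in range(len(sites[0])):
        # Start every symbol at the pseudocount, then add the observed counts.
        counts = {symbol: pseudocount for symbol in alphabet}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pssm.append(
            {s: math.log2(counts[s] / total / background) for s in alphabet}
        )
    return pssm


def scan(pssm, sequence):
    """Return the score of every window of motif length along the sequence."""
    m = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(m))
        for i in range(len(sequence) - m + 1)
    ]


sites = ["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"]
pssm = build_pssm(sites)
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
scores = scan(pssm, seq)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"{len(scores)} windows scored, best hit at position {best}")
```

This naive loop performs one table look-up per motif position per window, which is the work that an accelerated implementation spreads across vector lanes.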
The lightmotif library provides a Python module to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:
permute instructions of AVX2.
lightmotif can be installed directly from PyPI, which hosts some pre-built wheels for most mainstream platforms, as well as the code required to compile from source with Rust:
```console
$ pip install lightmotif
```
If you have to compile the package from source, all the required Rust libraries are vendored in the source distribution, and a Rust compiler will be set up automatically if there is none on the host machine.
The motif interface should be mostly compatible with the `Bio.motifs` module from Biopython. The notable difference is that the `calculate` method of PSSM objects expects a striped sequence rather than a plain sequence.
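```python
import lightmotif

# Create a motif from a list of aligned binding sites
motif = lightmotif.create(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])

# Normalize the counts into a weight matrix, then derive the scoring matrix
pwm = motif.counts.normalize(0.1)
pssm = pwm.log_odds()

# Encode the target sequence and convert it to the striped layout
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
encoded = lightmotif.EncodedSequence(seq)
striped = encoded.stripe()

# Score the striped sequence with the PSSM
scores = pssm.calculate(striped)
```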
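The exact memory layout behind a striped sequence is not described here. As a rough intuition only, the sketch below assumes a column-major arrangement of the kind commonly used in SIMD sequence analysis: the sequence is cut into `lanes` contiguous chunks, and each row gathers one symbol from every chunk. The function name `stripe`, the lane count of 4, and the `N` padding symbol are all hypothetical and are not part of the lightmotif API.

```python
def stripe(sequence, lanes=4, pad="N"):
    """Arrange a sequence into rows of `lanes` symbols in column-major order.

    Hypothetical illustration: position p ends up in row p % rows,
    lane p // rows, so each row holds symbols spaced `rows` apart.
    """
    rows = -(-len(sequence) // lanes)  # ceiling division
    padded = sequence.ljust(rows * lanes, pad)
    return [
        "".join(padded[lane * rows + row] for lane in range(lanes))
        for row in range(rows)
    ]


def unstripe(rows, length):
    """Invert `stripe`: read lane by lane and drop the padding."""
    lanes = len(rows[0])
    flat = "".join(rows[r][lane] for lane in range(lanes) for r in range(len(rows)))
    return flat[:length]


striped_rows = stripe("ATGTCCCAACAA", lanes=4)
for row in striped_rows:
    print(row)
```

Under this assumed layout, moving from one row to the next advances every lane by one sequence position, so a single vector operation on a row can contribute to several window scores at once.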
Benchmarks use the MX000001 motif from PRODORIC [4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, compiled with `--target-cpu=native`.
```console
lightmotif (avx2):       26,528,740 ns/iter (+/- 14,817,953) = 166.9 MiB/s
lightmotif (generic):   654,599,309 ns/iter (+/- 81,292,868) =   6.8 MiB/s
Bio.motifs:             526,309,061 ns/iter (+/- 45,603,991) =   8.4 MiB/s
```
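The relative speed of each backend can be read off these timings with a little arithmetic. The short script below only reuses the numbers reported above; taking Bio.motifs as the baseline is an arbitrary choice made for this illustration.

```python
# Timings copied from the benchmark output (nanoseconds per iteration).
timings_ns = {
    "lightmotif (avx2)": 26_528_740,
    "lightmotif (generic)": 654_599_309,
    "Bio.motifs": 526_309_061,
}

# Speed of each implementation relative to Bio.motifs (higher is faster).
baseline = timings_ns["Bio.motifs"]
speedups = {name: baseline / ns for name, ns in timings_ns.items()}

# Input size implied by the reported AVX2 throughput of 166.9 MiB/s.
implied_bytes = 166.9 * 2**20 * timings_ns["lightmotif (avx2)"] / 1e9

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x relative to Bio.motifs")
print(f"implied input size: {implied_bytes / 1e6:.2f} MB")
```

The AVX2 backend comes out roughly 20 times faster than Bio.motifs, while the generic backend is somewhat slower than it. The implied input size of about 4.6 MB is consistent with the length of an Escherichia coli K12 genome.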
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the open-source MIT license.
This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.