🎼🧬 lightmotif

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. Such matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often viewed as sequence logos:

MX000274.svg
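The scoring scheme described above can be sketched in plain Python. This is only an illustration of the general technique, not the implementation used by lightmotif: the helper names `build_pssm` and `scan` are hypothetical, and the sketch assumes a DNA alphabet, base-2 log-odds, and uniform background frequencies.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet for this sketch


def build_pssm(sequences, pseudocount=0.1, background=None):
    """Build a log-odds matrix (one dict per motif column) from aligned sequences."""
    if background is None:
        # Uniform background frequencies by default.
        background = {s: 1.0 / len(ALPHABET) for s in ALPHABET}
    pssm = []
    for j in range(len(sequences[0])):
        # Start every count at the pseudocount so no frequency is ever zero.
        counts = {s: pseudocount for s in ALPHABET}
        for seq in sequences:
            counts[seq[j]] += 1
        total = sum(counts.values())
        pssm.append(
            {s: math.log2(counts[s] / total / background[s]) for s in ALPHABET}
        )
    return pssm


def scan(pssm, sequence):
    """Return the score of every window of len(pssm) symbols in the sequence."""
    m = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(m))
        for i in range(len(sequence) - m + 1)
    ]


pssm = build_pssm(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])
target = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
scores = scan(pssm, target)

# Index of the highest-scoring window in the target sequence.
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

Each window is scored independently, so this naive loop costs one table look-up per motif column per position; it is this inner loop that an accelerated implementation would aim to speed up.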

The lightmotif library provides a Python module to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences.

🔧 Installing

lightmotif can be installed directly from PyPI, which hosts pre-built wheels for most mainstream platforms, as well as the code required to compile from source with Rust:

```console
$ pip install lightmotif
```

If you have to compile the package from source, all the required Rust libraries are vendored in the source distribution, and a Rust compiler will be set up automatically if there is none on the host machine.

💡 Example

The motif interface should be mostly compatible with the Bio.motifs module from Biopython. The notable difference is that the calculate method of PSSM objects expects a striped sequence instead.

```python
import lightmotif

# Create a count matrix from an iterable of sequences
motif = lightmotif.create(["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"])

# Create a PSSM with 0.1 pseudocounts and uniform background frequencies
pwm = motif.counts.normalize(0.1)
pssm = pwm.log_odds()

# Encode the target sequence into a striped matrix
seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG"
encoded = lightmotif.EncodedSequence(seq)
striped = encoded.stripe()

# Compute scores using the fastest backend implementation for the host machine
scores = pssm.calculate(striped)
```
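To give an intuition for what a striped sequence is, the sketch below lays a sequence out in a small matrix with a fixed number of columns, so that each row gathers positions that can be processed side by side. This is an assumption about the general idea only: the `stripe` function, the column count of 4, and the `N` padding symbol are hypothetical choices for illustration, and the actual layout used by lightmotif may differ.

```python
def stripe(sequence, columns=4, pad="N"):
    """Arrange a sequence column-wise into a matrix with a fixed column count.

    Hypothetical layout for illustration: position i of the sequence is
    stored at row (i % rows) and column (i // rows).
    """
    rows = -(-len(sequence) // columns)  # ceiling division
    padded = sequence.ljust(rows * columns, pad)
    return [
        [padded[c * rows + r] for c in range(columns)]
        for r in range(rows)
    ]


# Each printed row holds one symbol from each of the 4 contiguous chunks.
for row in stripe("ACGTACGTAC"):
    print("".join(row))
```

With such a layout, a single row look-up touches several distant positions of the sequence at once, which is the property a vectorized backend can take advantage of.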

⏱️ Benchmarks

Benchmarks use the MX000001 motif from PRODORIC[4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, with the code compiled with --target-cpu=native.

```console
lightmotif (avx2):      26,528,740 ns/iter (+/- 14,817,953) = 166.9 MiB/s
lightmotif (generic):  654,599,309 ns/iter (+/- 81,292,868) = 6.8 MiB/s
Bio.motifs:            526,309,061 ns/iter (+/- 45,603,991) = 8.4 MiB/s
```
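The throughput figures follow from the time per iteration and the size of the scanned sequence. The short check below reproduces them under one assumption that is not stated above: that the genome is the 4,641,652 bp chromosome of E. coli K-12 MG1655, stored as one byte per base.

```python
# Assumed input size: E. coli K-12 MG1655 chromosome, one byte per base.
GENOME_LENGTH = 4_641_652
MIB = 1024 ** 2

# Timings copied from the benchmark output, in nanoseconds per iteration.
timings_ns = {
    "lightmotif (avx2)": 26_528_740,
    "lightmotif (generic)": 654_599_309,
    "Bio.motifs": 526_309_061,
}


def throughput_mib_s(ns_per_iter, n_bytes=GENOME_LENGTH):
    """Convert a time per iteration into MiB processed per second."""
    return n_bytes / MIB / (ns_per_iter * 1e-9)


for name, ns in timings_ns.items():
    print(f"{name}: {throughput_mib_s(ns):.1f} MiB/s")

# Relative speed of the AVX2 backend compared to Bio.motifs.
speedup = timings_ns["Bio.motifs"] / timings_ns["lightmotif (avx2)"]
print(f"avx2 vs Bio.motifs: {speedup:.1f}x")
```

Under that assumption, the AVX2 backend comes out at roughly 20 times the speed of Bio.motifs on this benchmark.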

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References