🎼🧬 lightmotif

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.


🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. These matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often viewed as sequence logos:

(Figure: MX000274.svg, an example sequence logo.)
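To make the scoring idea concrete, here is a minimal sketch of position weight matrix scanning in plain Rust, using only the standard library. It is not the lightmotif implementation and uses none of its API: the helpers `index`, `build_pssm` and `scan` are hypothetical names introduced for illustration, and the sketch assumes a four-letter DNA alphabet, base-2 log-odds scores, and a uniform background of 0.25 per nucleotide.

```rust
/// Map a nucleotide to a column index (A, C, G, T); `None` for any other symbol.
fn index(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Build a log-odds matrix (one row per motif position) from aligned sites.
/// Assumption: uniform background frequency of 0.25 for every nucleotide.
fn build_pssm(sites: &[&str], pseudocount: f32) -> Vec<[f32; 4]> {
    let len = sites[0].len();
    assert!(sites.iter().all(|s| s.len() == len), "sites must have equal length");
    // Start every cell at the pseudocount, then add the observed counts.
    let mut counts = vec![[pseudocount; 4]; len];
    for site in sites {
        for (pos, base) in site.bytes().enumerate() {
            if let Some(i) = index(base) {
                counts[pos][i] += 1.0;
            }
        }
    }
    // Convert counts to frequencies, then to log2 odds against the background.
    counts
        .iter()
        .map(|row| {
            let total: f32 = row.iter().sum();
            let mut scores = [0.0f32; 4];
            for i in 0..4 {
                scores[i] = (row[i] / total / 0.25).log2();
            }
            scores
        })
        .collect()
}

/// Score every window of `seq` that has the same length as the matrix.
/// Symbols outside A, C, G, T make the window score negative infinity.
fn scan(seq: &str, pssm: &[[f32; 4]]) -> Vec<f32> {
    seq.as_bytes()
        .windows(pssm.len())
        .map(|window| {
            window
                .iter()
                .zip(pssm)
                .map(|(&base, row)| index(base).map_or(f32::NEG_INFINITY, |i| row[i]))
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    let pssm = build_pssm(&["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"], 0.1);
    let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
    let scores = scan(seq, &pssm);
    // Pick the best-scoring window start.
    let best = scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap();
    println!("best window starts at {} with score {:.3}", best, scores[best]);
}
```

This naive scan visits one window at a time; the library's speed comes from the techniques it combines on top of this basic scoring scheme.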

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

Other crates from the ecosystem provide additional features if needed:

This is the Rust version; a Python package is available as well.

💡 Example

```rust
use lightmotif::*;
use typenum::U32;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(&[
    EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(),
    EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(),
]).unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped::<U32>();
striped.configure(&pssm);

// Use a pipeline to compute scores for every position of the matrix.
let pli = Pipeline::generic();
let scores = pli.score(&striped, &pssm);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// The highest scoring position can be searched with a pipeline as well.
let best = pli.argmax(&scores).unwrap();
assert_eq!(best, 18);
```

This example uses the *generic* pipeline, which is not platform accelerated. To use the much faster AVX2 code, create an AVX2 pipeline with `Pipeline::avx2` instead: this returns a `Result` which is `Ok` if AVX2 is supported on the local platform.
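The "try the accelerated backend, fall back to the generic one" pattern can be sketched with the standard library alone. The snippet below is an illustration, not lightmotif code: `Backend`, `has_avx2`, `try_avx2` and `best_backend` are hypothetical names, and it assumes runtime detection through `std::is_x86_feature_detected!`, which may differ from what the library does internally.

```rust
/// Hypothetical backend tag, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Generic,
}

/// Runtime AVX2 detection on x86 and x86-64 targets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// Other architectures never have AVX2.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx2() -> bool {
    false
}

/// Return `Ok` only when the running CPU supports AVX2.
fn try_avx2() -> Result<Backend, &'static str> {
    if has_avx2() {
        Ok(Backend::Avx2)
    } else {
        Err("AVX2 is not supported on this platform")
    }
}

/// Prefer the accelerated backend and fall back to the generic one.
fn best_backend() -> Backend {
    try_avx2().unwrap_or(Backend::Generic)
}

fn main() {
    println!("selected backend: {:?}", best_backend());
}
```

Returning a `Result` rather than panicking lets the caller decide at runtime whether to fall back, so the same binary can run on machines with and without AVX2.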

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[5] and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, compiled with --target-cpu=native.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References