```rust
use bio_seq::prelude::*;

let seq = dna!("ATACGATCGATCGATCGATCCGT");

// iterate over the 8-mers of the reverse complement
for kmer in seq.revcomp().kmers::<8>() {
    println!("{}", kmer);
}
```
The IUPAC nucleotide ambiguity codes naturally encode a set of bases for each position:
```rust
use bio_seq::prelude::*;

let seq = iupac!("AGCTNNCAGTCGACGTATGTA");
let pattern = Seq::

for slice in seq.windows(pattern.len()) {
    if pattern.contains(slice) {
        println!("{} matches pattern", slice);
    }
}
```
The primary design goal of this crate is to make translating between biological sequence types safe and convenient:
```rust
// debruijn sequence of all 3-mers:
let seq: Seq<Dna> =
    dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");
let aminos: Seq<Amino> = Seq::from_iter(seq.kmers().map(|kmer| kmer.into()));
assert_eq!(
    aminos,
    amino!("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK")
);
```
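To make the conversion above concrete, here is a minimal standard-library sketch that translates every overlapping 3-mer of a DNA string using the standard genetic code. It does not use bio-seq; `CODON_TABLE`, `base_index`, and `translate_windows` are hypothetical names chosen for illustration, and the crate's own lookup may be implemented differently.

```rust
/// Standard genetic code, indexed by 16*b1 + 4*b2 + b3
/// with the base order T=0, C=1, A=2, G=3 (an assumption of this sketch).
const CODON_TABLE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Map a nucleotide to its position in the T, C, A, G ordering.
fn base_index(base: u8) -> Option<usize> {
    match base {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translate every overlapping 3-mer of `dna` into one amino acid letter.
/// Returns `None` if the input contains anything other than A, C, G, T.
fn translate_windows(dna: &str) -> Option<String> {
    dna.as_bytes()
        .windows(3)
        .map(|codon| {
            let index = 16 * base_index(codon[0])?
                + 4 * base_index(codon[1])?
                + base_index(codon[2])?;
            Some(CODON_TABLE[index] as char)
        })
        .collect()
}
```

Applied to the 66-base sequence above, `translate_windows` yields the same 64-letter amino acid string, one letter per overlapping 3-mer.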
The `Codec` trait describes the coding/decoding process for the characters of a biological sequence. This trait can be derived procedurally. There are three built-in codecs:
`codec::Dna`
Using the lexicographically ordered 2-bit representation

`codec::Iupac`
IUPAC nucleotide ambiguity codes are represented with 4 bits. This supports membership resolution with bitwise operations. Logical `or` is the union:

```rust
assert_eq!(iupac!("AS-GYTNA") | iupac!("ANTGCAT-"), iupac!("ANTGYWNA"));
```

Logical `and` is the intersection of two iupac sequences:

```rust
assert_eq!(iupac!("ACGTSWKM") & iupac!("WKMSTNNA"), iupac!("A----WKA"));
```

`codec::Amino`
Amino acid sequences are represented with 6 bits. The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.
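Returning to the IUPAC codec, the set semantics of the 4-bit encoding can be sketched with the standard library alone. The bit assignment below (A=1, C=2, G=4, T=8) and the names `SYMBOLS`, `iupac_bits`, `union`, and `intersection` are assumptions of this sketch, not the crate's actual layout or API; the point is only that each code is a bitset of bases, so `|` and `&` act as set union and intersection.

```rust
/// IUPAC symbols indexed by their 4-bit set, assuming A=1, C=2, G=4, T=8.
/// Index 0 is the empty set (gap) and index 15 is N (any base).
const SYMBOLS: &[u8; 16] = b"-ACMGRSVTWYHKDBN";

/// Look up the 4-bit set for an IUPAC symbol.
fn iupac_bits(symbol: u8) -> Option<u8> {
    SYMBOLS.iter().position(|&s| s == symbol).map(|i| i as u8)
}

/// Combine two equal-length IUPAC strings position by position.
fn zip_with(a: &str, b: &str, op: impl Fn(u8, u8) -> u8) -> Option<String> {
    if a.len() != b.len() {
        return None;
    }
    a.bytes()
        .zip(b.bytes())
        .map(|(x, y)| {
            // Mask to 4 bits so the table lookup can never go out of range.
            let bits = op(iupac_bits(x)?, iupac_bits(y)?) & 0x0F;
            Some(SYMBOLS[bits as usize] as char)
        })
        .collect()
}

/// Set union of the bases allowed at each position.
fn union(a: &str, b: &str) -> Option<String> {
    zip_with(a, b, |x, y| x | y)
}

/// Set intersection of the bases allowed at each position.
fn intersection(a: &str, b: &str) -> Option<String> {
    zip_with(a, b, |x, y| x & y)
}
```

With these definitions, `union("AS-GYTNA", "ANTGCAT-")` gives `ANTGYWNA` and `intersection("ACGTSWKM", "WKMSTNNA")` gives `A----WKA`, mirroring the two assertions above.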
Strings of encoded characters are packed into `Seq`s. Slicing, chunking, and windowing return `SeqSlice`s. `Seq<A: Codec>`/`&SeqSlice<A: Codec>` are analogous to `String`/`&str`.
All data is stored little-endian. This affects the order in which sequences map to the integers ("colexicographic" order):

```rust
for i in 0..=15 {
    println!("{}: {}", i, Kmer::<Dna, 5>::from(i));
}
```

```
0: AAAAA
1: CAAAA
2: GAAAA
3: TAAAA
4: ACAAA
5: CCAAA
6: GCAAA
7: TCAAA
8: AGAAA
9: CGAAA
10: GGAAA
11: TGAAA
12: ATAAA
13: CTAAA
14: GTAAA
15: TTAAA
```
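The mapping in the listing above can be reproduced without the crate. This sketch assumes 2 bits per base with `A=0, C=1, G=2, T=3` and the first base stored in the lowest bits; `kmer_from_index` is a hypothetical helper name, not part of bio-seq.

```rust
/// Decode `index` into a k-mer string, reading 2 bits per base
/// starting from the least significant bits (little-endian).
/// Assumes 2 * k does not exceed the bit width of `usize`.
fn kmer_from_index(index: usize, k: usize) -> String {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    (0..k)
        .map(|i| BASES[(index >> (2 * i)) & 0b11])
        .collect()
}
```

Because the first base occupies the lowest bits, incrementing the integer changes the leftmost character first, which is why `1` decodes to `CAAAA` rather than `AAAAC`.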
Kmers are sequences of a fixed size that can fit into a register. They are implemented with const generics.
For dense encodings, a lookup table can be populated and indexed in constant time with the `usize` representation:
```rust
fn kmer_histogram
    for kmer in seq.kmers::<K>() {
        histo[usize::from(kmer)] += 1;
    }

    histo
}
```
This example builds a histogram of kmer occurrences.
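The same dense-indexing idea can be sketched with only the standard library. The version below assumes a plain ASCII `A/C/G/T` input and the little-endian 2-bit packing described earlier; `base_bits` and `dense_histogram` are hypothetical names, and this is not the crate's implementation.

```rust
/// 2-bit code for a nucleotide, ordered A < C < G < T.
fn base_bits(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Count every k-mer of `seq` in a table with 4^k bins.
/// Returns `None` for k == 0 or for input outside A, C, G, T.
/// Only practical for small k, since the table grows as 4^k.
fn dense_histogram(seq: &[u8], k: usize) -> Option<Vec<usize>> {
    if k == 0 {
        return None;
    }
    let mut histo = vec![0usize; 1 << (2 * k)];
    for window in seq.windows(k) {
        // Pack the window into an integer, first base in the lowest bits.
        let mut index = 0usize;
        for (i, &base) in window.iter().enumerate() {
            index |= base_bits(base)? << (2 * i);
        }
        histo[index] += 1;
    }
    Some(histo)
}
```

Each k-mer's packed integer is used directly as the bin index, so no hashing or comparison is needed per lookup.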
The `Hash` trait is implemented for `Kmer`s.
Depending on the application, it may be permissible to superimpose the forward and reverse complements of a kmer:
```rust
let k = kmer!("ACGTGACGT");
let canonical = k ^ k.revcomp(); // TODO: implement ReverseComplement for Kmer
```
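Since the snippet above is still marked TODO, here is a hedged standard-library sketch of what the superposition could look like on a packed 2-bit k-mer. It assumes `A=0, C=1, G=2, T=3` with the first base in the lowest bits, so complementing a base is `3 - code`; `pack`, `revcomp`, and `superimpose` are hypothetical names for illustration only.

```rust
/// Pack an ACGT string (at most 32 bases) into 2 bits per base,
/// first base in the lowest bits.
fn pack(kmer: &str) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut bits = 0u64;
    for (i, base) in kmer.bytes().enumerate() {
        let code: u64 = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        bits |= code << (2 * i);
    }
    Some(bits)
}

/// Reverse complement of a packed k-mer of length `k` (k <= 32).
fn revcomp(bits: u64, k: u32) -> u64 {
    let mut out = 0u64;
    for i in 0..k {
        let base = (bits >> (2 * i)) & 0b11;
        // Complement the base (A<->T, C<->G) and mirror its position.
        out |= (3 - base) << (2 * (k - 1 - i));
    }
    out
}

/// XOR a k-mer with its reverse complement, so both strands give the same value.
fn superimpose(bits: u64, k: u32) -> u64 {
    bits ^ revcomp(bits, k)
}
```

Because `revcomp` is its own inverse, a k-mer and its reverse complement always superimpose to the same value. The mapping is not injective, though: any k-mer equal to its own reverse complement maps to zero, which is one reason this is only "permissible depending on the application".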
The 2-bit representation of nucleotides is ordered `A < C < G < T`. Sequences and kmers are stored in little-endian and are ordered "colexicographically". This means that `AAAA < CAAA < GAAA < ... < AAAC < ... < TTTT`
```rust
fn minimise(seq: Seq<Dna>) -> Option<Kmer::<Dna, 8>> {
    seq.kmers().min()
}
```
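To show how the colexicographic order changes which k-mer counts as the minimum, here is a standard-library sketch. It assumes the same little-endian 2-bit packing as above; `colex_key` and `minimiser` are hypothetical names and do not reflect the crate's `Ord` implementation beyond the ordering the text describes.

```rust
/// Integer key whose natural order is the colexicographic order of the k-mer:
/// the first base sits in the lowest bits, so the last base is most significant.
fn colex_key(kmer: &[u8]) -> Option<u64> {
    let mut key = 0u64;
    for (i, base) in kmer.iter().enumerate() {
        let code: u64 = match *base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        key |= code << (2 * i);
    }
    Some(key)
}

/// Return the smallest k-mer of `seq` under colexicographic order.
/// Returns `None` if k is 0 or above 32, the sequence is shorter than k,
/// or the input contains anything other than A, C, G, T.
fn minimiser(seq: &str, k: usize) -> Option<&str> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut best: Option<(u64, usize)> = None;
    for (start, window) in seq.as_bytes().windows(k).enumerate() {
        let key = colex_key(window)?;
        if best.map_or(true, |(min_key, _)| key < min_key) {
            best = Some((key, start));
        }
    }
    best.map(|(_, start)| &seq[start..start + k])
}
```

For `"ACCA"` with `k = 2` the result is `CA`, not the lexicographic minimum `AC`, because the final base is compared first.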
```rust
for ckmer in seq.window(8).map(|kmer| hash(kmer ^ kmer.revcomp())) {
    // TODO: example
    ...
}
```
Sequence coding/decoding is derived from the variant names and discriminants of enum types:
```rust
use bio_seq_derive::Codec;
use bio_seq::codec::Codec;

pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl From
```
The `width` attribute specifies how many bits the encoding requires per symbol. The maximum supported is 8. If this attribute isn't specified then the optimal width will be chosen.
* `Iupac` from `Dna`; `Seq<Iupac>` from `Seq<Dna>`
* `Amino` from `Kmer<3>`; `Seq<Amino>` from `Seq<Dna>`
  * Sequence length not a multiple of 3 is an error
* `Seq<Iupac>` from `Amino`; `Seq<Iupac>` from `Seq<Amino>` (TODO)
* `Vec<Seq<Dna>>` from `Seq<Iupac>`: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)
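
The last conversion is marked TODO, so the following is only a sketch of the idea using plain strings and the standard library: each IUPAC code stands for a set of bases, and the expansion is the Cartesian product of those sets. `bases_for` and `expand` are hypothetical names, and the gap symbol `-` is treated as unsupported here by assumption.

```rust
/// The concrete bases an IUPAC nucleotide code stands for.
/// The gap symbol and unknown characters are rejected in this sketch.
fn bases_for(code: u8) -> Option<&'static str> {
    Some(match code {
        b'A' => "A",
        b'C' => "C",
        b'G' => "G",
        b'T' => "T",
        b'R' => "AG",
        b'Y' => "CT",
        b'S' => "CG",
        b'W' => "AT",
        b'K' => "GT",
        b'M' => "AC",
        b'B' => "CGT",
        b'D' => "AGT",
        b'H' => "ACT",
        b'V' => "ACG",
        b'N' => "ACGT",
        _ => return None,
    })
}

/// Expand an IUPAC string into every DNA string of the same length it can represent.
fn expand(iupac: &str) -> Option<Vec<String>> {
    let mut expansions = vec![String::new()];
    for code in iupac.bytes() {
        let bases = bases_for(code)?;
        let mut next = Vec::with_capacity(expansions.len() * bases.len());
        // Extend every partial sequence with every base allowed at this position.
        for prefix in &expansions {
            for base in bases.chars() {
                let mut extended = prefix.clone();
                extended.push(base);
                next.push(extended);
            }
        }
        expansions = next;
    }
    Some(expansions)
}
```

The number of results is the product of the set sizes at each position, so a run of `N`s grows the output by a factor of four per base.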