```rust
use bio_seq::*;

let seq = dna!("ACTGCTAGCA");

for kmer in seq.kmers::<8>() {
    println!("{}", kmer);
}
```
`bio_seq::Dna`
: DNA uses the lexicographically ordered 2-bit representation
`bio_seq::Iupac`
: IUPAC nucleotide ambiguity codes are represented with 4 bits
```
  A C G T
S 0 1 1 0
```

```rust
assert_eq!(
    format!("{}", iupac!("AS-GYTNA") | iupac!("ANTGCAT-")),
    "ANTGYWNA"
);
assert_eq!(
    format!("{}", iupac!("ACGTSWKM") & iupac!("WKMSTNNA")),
    "A----WKA"
);
```
The `Iupac` struct implements `From<Dna>`.
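
To make the bitwise behaviour concrete, here is a minimal standalone sketch using only the standard library. It assumes one bit per base in A, C, G, T order, from most to least significant, which is consistent with the `S` row shown earlier. The `IUPAC` table and the `encode`, `combine`, `union` and `intersection` helpers are hypothetical names for illustration, not part of `bio_seq`.

```rust
/// IUPAC symbols indexed by their 4-bit mask, with bits ordered A, C, G, T
/// from most to least significant (an assumption of this sketch).
const IUPAC: &[u8; 16] = b"-TGKCYSBAWRDMHVN";

/// Look up the 4-bit mask of an IUPAC symbol, e.g. b'S' -> 0b0110.
fn encode(symbol: u8) -> Option<u8> {
    IUPAC.iter().position(|&s| s == symbol).map(|i| i as u8)
}

/// Combine two equal-length IUPAC strings symbol by symbol with `op`.
/// Returns `None` on a length mismatch or an unknown symbol.
fn combine(a: &str, b: &str, op: fn(u8, u8) -> u8) -> Option<String> {
    if a.len() != b.len() {
        return None;
    }
    a.bytes()
        .zip(b.bytes())
        .map(|(x, y)| {
            let bits = op(encode(x)?, encode(y)?) & 0b1111;
            Some(IUPAC[bits as usize] as char)
        })
        .collect()
}

/// Union: a base is allowed if either input allows it.
fn union(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x | y)
}

/// Intersection: a base is allowed only if both inputs allow it.
fn intersection(a: &str, b: &str) -> Option<String> {
    combine(a, b, |x, y| x & y)
}
```

Under these assumptions, `union("AS-GYTNA", "ANTGCAT-")` gives `ANTGYWNA` and `intersection("ACGTSWKM", "WKMSTNNA")` gives `A----WKA`, matching the `iupac!` assertions.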
`bio_seq::Amino`
: Amino acid sequences are represented with 6 bits.
The representation of amino acids is designed to be easy to coerce from sequences of 2-bit encoded DNA.
TODO: deal with alternate (e.g. mammalian mitochondrial) translation codes
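
One way to see why 6 bits pair naturally with 2-bit DNA: a codon of three bases is exactly 3 × 2 = 6 bits, so it can index a 64-entry table directly. The sketch below is a standalone illustration under that assumption, using the standard genetic code. It is not the crate's actual `Amino` encoding, and `STANDARD_CODE`, `base_bits` and `translate` are hypothetical names. An alternate translation code would amount to swapping the table.

```rust
/// Standard genetic code indexed by a 6-bit codon value, where each base
/// contributes 2 bits (A=0, C=1, G=2, T=3) and the first base is the most
/// significant. This layout is an assumption of the sketch.
const STANDARD_CODE: &[u8; 64] =
    b"KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF";

/// 2-bit code for a DNA base, or `None` for any other symbol.
fn base_bits(base: u8) -> Option<usize> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Translate DNA text into one-letter amino acid codes ('*' marks a stop).
/// Input whose length is not a multiple of 3 is rejected.
fn translate(dna: &str) -> Result<String, String> {
    if dna.len() % 3 != 0 {
        return Err(format!("length {} is not a multiple of 3", dna.len()));
    }
    dna.as_bytes()
        .chunks(3)
        .map(|codon| -> Result<char, String> {
            // Build the 6-bit table index from three 2-bit bases.
            let mut index = 0usize;
            for &base in codon {
                let bits = base_bits(base)
                    .ok_or_else(|| format!("invalid base {:?}", base as char))?;
                index = (index << 2) | bits;
            }
            Ok(STANDARD_CODE[index] as char)
        })
        .collect()
}
```

For example, `translate("ATGGCCTGGTAA")` returns `Ok("MAW*")`, while `translate("ATGG")` returns an error.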
Kmers are sequences with a fixed size. These are implemented with const generics.
`K * Codec::WIDTH` must fit in a `usize` (i.e. 64 bits on a 64-bit target). For larger Kmers use `bigk::Kmer` (TODO).
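
The size constraint can be expressed with const generics. The following is a minimal sketch, not the crate's implementation: `fits_in_usize`, `max_k` and the `Packed` type are hypothetical names, and `WIDTH` stands in for the per-symbol bit width of a codec.

```rust
/// True when `k` symbols of `width` bits each fit in one `usize`.
const fn fits_in_usize(k: usize, width: usize) -> bool {
    match k.checked_mul(width) {
        Some(bits) => bits <= usize::BITS as usize,
        None => false, // the product itself overflowed
    }
}

/// Largest `k` that can be packed into a `usize` at `width` bits per symbol.
/// `width` must be non-zero.
const fn max_k(width: usize) -> usize {
    usize::BITS as usize / width
}

/// Hypothetical fixed-size container that carries the check as an
/// associated constant, evaluated once per instantiation.
#[allow(dead_code)]
struct Packed<const K: usize, const WIDTH: usize>(usize);

impl<const K: usize, const WIDTH: usize> Packed<K, WIDTH> {
    const FITS: bool = fits_in_usize(K, WIDTH);
}
```

On a 64-bit target this gives `max_k(2) == 32` for 2-bit DNA, `max_k(4) == 16` for 4-bit IUPAC codes and `max_k(6) == 10` for 6-bit amino acids.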
The 2-bit representation of DNA sequences is lexicographically ordered:
```rust
// find the lexicographically minimum 8-mer
fn minimise(seq: Seq<Dna>) -> Option<Kmer::<8>> {
    seq.kmers::<8>().min()
}
```
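
As a standalone illustration of what such a minimiser computes, the sketch below works on plain text with the standard library only. It assumes the same A < C < G < T code order and compares windows symbol by symbol. `code` and `min_kmer` are hypothetical helpers, not `bio_seq` functions.

```rust
/// 2-bit code for a DNA base. The codes increase in the order A, C, G, T,
/// so comparing codes agrees with comparing the letters alphabetically.
fn code(base: u8) -> Option<u8> {
    match base {
        b'A' => Some(0b00),
        b'C' => Some(0b01),
        b'G' => Some(0b10),
        b'T' => Some(0b11),
        _ => None,
    }
}

/// Return the lexicographically smallest `k`-mer of `seq`, compared by 2-bit
/// code. Returns `None` if `k` is 0, `seq` is shorter than `k`, or `seq`
/// contains a symbol other than A, C, G, T.
fn min_kmer(seq: &str, k: usize) -> Option<&str> {
    if k == 0 {
        return None;
    }
    // Encode every base, bailing out on the first invalid symbol.
    let codes: Vec<u8> = seq.bytes().map(code).collect::<Option<_>>()?;
    let last_start = codes.len().checked_sub(k)?;
    // Slices of codes compare lexicographically; ties keep the first window.
    let start = (0..=last_start).min_by_key(|&i| &codes[i..i + k])?;
    Some(&seq[start..start + k])
}
```

For `"ACTGCTAGCA"` and `k = 8` the candidate windows are `ACTGCTAG`, `CTGCTAGC` and `TGCTAGCA`, so `min_kmer` returns `Some("ACTGCTAG")`.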
Alphabet coding/decoding is derived from the variant names and discriminants of enum types:
```rust
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}
```
The `width` attribute specifies how many bits the encoding requires per symbol.
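
To show what deriving a codec from variant names and discriminants amounts to, here is a hand-written sketch of the kind of code such a derivation could produce. It is an assumption-laden illustration rather than the crate's generated code: the `WIDTH` constant mirrors the `Codec::WIDTH` used in the Kmer size constraint, and the method names `to_bits`, `from_bits`, `from_char` and `to_char` are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum Dna {
    A = 0b00,
    C = 0b01,
    G = 0b10,
    T = 0b11,
}

impl Dna {
    /// Bits needed per symbol: four variants fit in 2 bits.
    pub const WIDTH: u32 = 2;

    /// Encode: the discriminant is the bit pattern.
    pub fn to_bits(self) -> u8 {
        self as u8
    }

    /// Decode a bit pattern back into a variant, if one matches.
    pub fn from_bits(bits: u8) -> Option<Self> {
        match bits {
            0b00 => Some(Dna::A),
            0b01 => Some(Dna::C),
            0b10 => Some(Dna::G),
            0b11 => Some(Dna::T),
            _ => None,
        }
    }

    /// Parse a symbol by matching it against the variant names.
    pub fn from_char(c: char) -> Option<Self> {
        match c {
            'A' => Some(Dna::A),
            'C' => Some(Dna::C),
            'G' => Some(Dna::G),
            'T' => Some(Dna::T),
            _ => None,
        }
    }

    /// Print a symbol using its variant name.
    pub fn to_char(self) -> char {
        match self {
            Dna::A => 'A',
            Dna::C => 'C',
            Dna::G => 'G',
            Dna::T => 'T',
        }
    }
}
```

Because the discriminants increase in the order A, C, G, T, the derived `Ord` on this enum agrees with alphabetical order of the symbols.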
Kmers are stored as `usize`s with the least significant bit first.
```rust
dna!("C") == 0b01 // not 0b0100_0000
dna!("CT") == 0b11_01
```
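
The layout can be reproduced with a few lines of plain Rust. This is a minimal sketch of least-significant-first packing under the 2-bit encoding described here. `pack` and `unpack` are hypothetical helpers, not the `dna!` macro or any `bio_seq` API.

```rust
/// Pack DNA bases into a `usize`, 2 bits per base, with the first base in
/// the least significant bits. Returns `None` if the sequence does not fit
/// in one `usize` or contains a symbol other than A, C, G, T.
fn pack(seq: &str) -> Option<usize> {
    const WIDTH: usize = 2;
    if seq.len() * WIDTH > usize::BITS as usize {
        return None;
    }
    let mut bits = 0usize;
    for (i, base) in seq.bytes().enumerate() {
        let code: usize = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // The i-th base occupies bits [2i, 2i + 1].
        bits |= code << (i * WIDTH);
    }
    Some(bits)
}

/// Inverse of `pack` for a sequence of `len` bases.
fn unpack(mut bits: usize, len: usize) -> String {
    (0..len)
        .map(|_| {
            let base = b"ACGT"[bits & 0b11] as char;
            bits >>= 2;
            base
        })
        .collect()
}
```

Here `pack("C")` is `Some(0b01)` and `pack("CT")` is `Some(0b11_01)`: the second base `T` (`0b11`) lands in the higher pair of bits.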
`From` and `Into` conversions:

* `Iupac` from `Dna`; `Seq<Iupac>` from `Seq<Dna>`
* `Amino` from `Kmer<3>`; `Seq<Amino>` from `Seq<Dna>` (TODO)
  * A sequence length that is not a multiple of 3 is an error
* `Seq<Iupac>` from `Amino`; `Seq<Iupac>` from `Seq<Amino>` (TODO)
* `Vec<Seq<Dna>>` from `Seq<Iupac>`: A sequence of IUPAC codes can generate a list of DNA sequences of the same length. (TODO)

TODO: find out if `Kmer<Dna, K>` -> `Kmer<Amino, K/3>` is possible
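
The expansion of IUPAC codes into concrete DNA sequences is marked TODO, so the following is only a sketch of one way it could work, written against plain strings with the standard library. `bases` and `expand` are hypothetical helpers, and rejecting gaps (`-`) is a choice made for this sketch.

```rust
/// Bases allowed by an IUPAC symbol, in A, C, G, T order.
/// Gaps ('-') and unknown symbols are rejected in this sketch.
fn bases(symbol: u8) -> Option<&'static str> {
    Some(match symbol {
        b'A' => "A",
        b'C' => "C",
        b'G' => "G",
        b'T' => "T",
        b'R' => "AG",
        b'Y' => "CT",
        b'S' => "CG",
        b'W' => "AT",
        b'K' => "GT",
        b'M' => "AC",
        b'B' => "CGT",
        b'D' => "AGT",
        b'H' => "ACT",
        b'V' => "ACG",
        b'N' => "ACGT",
        _ => return None,
    })
}

/// Expand an IUPAC string into every DNA string of the same length that it
/// can stand for, by extending each partial result with each allowed base.
fn expand(iupac: &str) -> Option<Vec<String>> {
    let mut out = vec![String::new()];
    for symbol in iupac.bytes() {
        let choices = bases(symbol)?;
        out = out
            .iter()
            .flat_map(|prefix| choices.chars().map(move |b| format!("{}{}", prefix, b)))
            .collect();
    }
    Some(out)
}
```

For example, `expand("ARY")` yields `AAC`, `AAT`, `AGC` and `AGT`. The result grows multiplicatively with each ambiguous symbol (a run of `N`s gives 4 per position), so a real conversion would probably want an iterator rather than a materialised `Vec`.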
`rust-bio`: meant to replace `Text`/`TextSlice`