Efficient 2bit file reader, implemented in pure Rust.
The 2bit file format is used to store genomic sequences on disk. It allows for fast access to specific parts of the genome.
This crate is inspired by py2bit and aims to offer similar functionality with no C dependency, no external crate dependencies, and great performance. It follows version 0 of the 2bit specification.
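For readers new to the format, here is a minimal sketch of how 2bit packs nucleotides, based on the UCSC format description: four bases per byte, first base in the two most significant bits, with T = 00, C = 01, A = 10, G = 11. The function name `unpack_bases` is hypothetical and is not part of this crate's API; the crate's actual decoder may be organized differently.

```rust
/// Decode `n_bases` nucleotides from 2bit-packed bytes.
/// Hypothetical helper for illustration only (not this crate's API).
/// Panics if `packed` holds fewer than `n_bases` bases.
fn unpack_bases(packed: &[u8], n_bases: usize) -> String {
    // Index = 2-bit code: T = 00, C = 01, A = 10, G = 11.
    const BASES: [char; 4] = ['T', 'C', 'A', 'G'];
    (0..n_bases)
        .map(|i| {
            let byte = packed[i / 4];
            // The first base of each byte sits in the top two bits.
            let shift = 6 - 2 * (i % 4);
            BASES[((byte >> shift) & 0b11) as usize]
        })
        .collect()
}
```

For instance, the byte `0b1001_1100` unpacks to `ACGT`.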
```rust
use twobit::TwoBitFile;

let mut tb = TwoBitFile::open("assets/foo.2bit")?;
assert_eq!(tb.chrom_names(), &["chr1", "chr2"]);
assert_eq!(tb.chrom_sizes(), &[150, 100]);
let expected_seq = "NNACGTACGTACGTAGCTAGCTGATC";
assert_eq!(tb.read_sequence("chr1", 48..74)?, expected_seq);
```
All sequence-related methods expect a range argument; pass `..` (an unbounded range) to query the entire sequence:
```rust
assert_eq!(tb.read_sequence("chr1", ..)?.len(), 150);
```
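A range argument of this kind is typically accepted through the standard `RangeBounds` trait. The sketch below shows one way a method could turn any range expression into concrete start and end positions. The helper `resolve_range` is hypothetical, and clamping out-of-range bounds to the sequence length is an assumption; the crate may instead return an error.

```rust
use std::ops::{Bound, Range, RangeBounds};

/// Resolve `..`, `5..`, `..10`, `..=9` or `48..74` against a sequence of
/// length `len`. Hypothetical helper; clamping is an assumed policy.
fn resolve_range<R: RangeBounds<usize>>(range: R, len: usize) -> Range<usize> {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e.saturating_add(1),
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len,
    };
    // Never run past the end of the sequence, and never let start exceed end.
    let end = end.min(len);
    start.min(end)..end
}
```

With this helper, `resolve_range(.., 150)` yields `0..150`, which is why an unbounded range covers the whole chromosome.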
Files can be fully cached in memory in order to provide fast random access and avoid any IO operations when decoding:
```rust
let mut tb_mem = TwoBitFile::open_and_read("assets/foo.2bit")?;
let expected_seq = tb.read_sequence("chr1", ..)?;
assert_eq!(tb_mem.read_sequence("chr1", ..)?, expected_seq);
```
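The general idea behind in-memory caching can be sketched with the standard library alone: code written against `Read + Seek` works unchanged whether the source is a file on disk or a `Cursor` over a byte buffer. Both function names below are hypothetical, and how this crate structures its reader internally is an assumption.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Read `len` bytes starting at `offset` from any seekable source.
/// Hypothetical helper for illustration only.
fn read_at<R: Read + Seek>(src: &mut R, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    src.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    src.read_exact(&mut buf)?;
    Ok(buf)
}

/// Load a whole source into memory so that later reads never touch the disk.
fn cache_in_memory<R: Read>(mut src: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut bytes = Vec::new();
    src.read_to_end(&mut bytes)?;
    Ok(Cursor::new(bytes))
}
```

Passing a `std::fs::File` to `cache_in_memory` costs one sequential read up front; every subsequent `read_at` call is then a plain memory copy.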
2bit files offer two types of masks: N masks (aka hard masks) for unknown or arbitrary nucleotides, and soft masks for lower-case nucleotides (e.g. "t" instead of "T").
Hard masks are always enabled; soft masks are disabled by default, but can be enabled manually:
```rust
let mut tb_soft = tb.enable_softmask(true);
let expected_seq = "NNACGTACGTACGTagctagctGATC";
assert_eq!(tb_soft.read_sequence("chr1", 48..74)?, expected_seq);
```
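To illustrate how the two mask types interact, here is a minimal sketch that applies mask blocks to an already decoded, upper-case sequence. The helper `apply_masks` and the representation of mask blocks as half-open `Range<usize>` values are assumptions made for this example and do not reflect the crate's internals.

```rust
use std::ops::Range;

/// Apply hard (N) masks and, optionally, soft masks to a decoded sequence.
/// `offset` is the chromosome position of `seq[0]`.
/// Hypothetical helper; block lists are assumed to be half-open ranges.
fn apply_masks(
    seq: &mut [u8],
    offset: usize,
    n_blocks: &[Range<usize>],
    soft_blocks: &[Range<usize>],
    softmask: bool,
) {
    let end = offset + seq.len();
    // Soft masks are optional: lower-case the bases inside each block.
    if softmask {
        for b in soft_blocks {
            let (s, e) = (b.start.max(offset), b.end.min(end));
            if s < e {
                seq[s - offset..e - offset].make_ascii_lowercase();
            }
        }
    }
    // Hard masks are always applied and overwrite the bases with 'N'.
    for b in n_blocks {
        let (s, e) = (b.start.max(offset), b.end.min(end));
        if s < e {
            seq[s - offset..e - offset].fill(b'N');
        }
    }
}
```

Applying the hard masks last means an N block always wins over a soft-mask block that overlaps it.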