This library provides (yet another) parser and writer for the sequence formats FASTA and FASTQ.
Features:
The FASTA parser can read and write multi-line files and allows iterating over the sequence lines without any allocation or copying. The FASTQ parser does not support records whose sequence or quality spans multiple lines.
Simple example: reads FASTA sequences from STDIN and writes those that are long enough to STDOUT; for the others, it prints a message instead.

```rust
use seq_io::fasta::{Reader, Record};
use std::io;

fn main() {
    let mut reader = Reader::new(io::stdin());
    let mut stdout = io::stdout();

    while let Some(result) = reader.next() {
        let record = result.unwrap();
        // determine sequence length without having to allocate
        // the whole sequence
        let seqlen = record.seq_lines().fold(0, |l, seq| l + seq.len());
        if seqlen > 100 {
            // write the record with lines wrapped at 80 characters
            record.write_wrap(&mut stdout, 80).unwrap();
        } else {
            println!("{} is only {} long", record.id().unwrap(), seqlen);
        }
    }
}
```
Records borrow their data directly from the internal buffered reader unless `to_owned()` is called, which creates an owned record. The reader from the fastq-rs crate takes the same borrowing approach.
By default, the buffer grows automatically if a record is too large to fit. The growth strategy can be configured, and it is also possible to set a size limit.
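To illustrate the idea of a configurable growth strategy with a size limit, here is a minimal sketch that uses only the standard library. The names `DoubleUntilLimit`, `grow_to` and `ensure_capacity` are hypothetical and chosen for illustration; they are not taken from this crate's API, and the crate's actual configuration mechanism may look different.

```rust
/// Hypothetical growth policy: double the buffer size until a hard limit is hit.
struct DoubleUntilLimit {
    limit: usize,
}

impl DoubleUntilLimit {
    /// Returns the next buffer size, or `None` if growing would exceed the limit.
    fn grow_to(&self, current_size: usize) -> Option<usize> {
        let new_size = current_size.checked_mul(2)?;
        if new_size <= self.limit {
            Some(new_size)
        } else {
            None
        }
    }
}

/// Grows `buf` until it can hold `needed` bytes.
/// Fails (leaving `buf` untouched) if the policy refuses to grow further.
fn ensure_capacity(
    buf: &mut Vec<u8>,
    needed: usize,
    policy: &DoubleUntilLimit,
) -> Result<(), String> {
    let mut size = buf.len().max(1);
    while size < needed {
        size = policy.grow_to(size).ok_or_else(|| {
            format!("record needs {} bytes, limit is {}", needed, policy.limit)
        })?;
    }
    buf.resize(size, 0);
    Ok(())
}
```

With a limit of 1024 bytes, a 64-byte buffer asked to hold a 200-byte record would be doubled twice to 256 bytes, while a 5000-byte record would be rejected with an error rather than growing without bound.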
Note: Make sure to add `lto = true` to the release profile in `Cargo.toml`, because calls to functions of the underlying buffered reader (`buf_redux`) are not inlined otherwise.
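As a configuration fragment, the setting mentioned in the note would typically be placed in the project's `Cargo.toml` like this (any other settings in the release profile stay as they are):

```toml
[profile.release]
lto = true
```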
The `parallel` module contains functions for sending FASTQ/FASTA records to a thread pool where expensive calculations are done. Sequences are processed in batches (`RecordSet`) because sending across channels has a performance impact. FASTA/FASTQ records can be accessed both in the 'worker' function and (after processing) in a function running in the main thread.
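The batching idea can be sketched with the standard library alone. The example below does not use this crate's `parallel` module; `process_in_batches` and `gc_count` are hypothetical names, plain `Vec<u8>` sequences stand in for records, and a GC count stands in for the expensive calculation. It assumes `n_threads >= 1` and `batch_size >= 1`.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for an expensive per-record calculation: counts G and C bases.
fn gc_count(seq: &[u8]) -> usize {
    seq.iter()
        .filter(|&&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count()
}

/// Sends records to `n_threads` workers in batches of `batch_size` and
/// returns the per-record results in input order.
fn process_in_batches(records: Vec<Vec<u8>>, n_threads: usize, batch_size: usize) -> Vec<usize> {
    let (batch_tx, batch_rx) = mpsc::channel::<(usize, Vec<Vec<u8>>)>();
    let (result_tx, result_rx) = mpsc::channel::<(usize, Vec<usize>)>();
    // std's Receiver cannot be cloned, so the workers share it behind a mutex.
    let batch_rx = Arc::new(Mutex::new(batch_rx));

    let workers: Vec<_> = (0..n_threads)
        .map(|_| {
            let batch_rx = Arc::clone(&batch_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while receiving, not while computing.
                let msg = batch_rx.lock().unwrap().recv();
                match msg {
                    Ok((idx, batch)) => {
                        // 'Worker' function: runs on the whole batch at once.
                        let counts = batch.iter().map(|s| gc_count(s)).collect();
                        result_tx.send((idx, counts)).unwrap();
                    }
                    Err(_) => break, // channel closed: no more batches
                }
            })
        })
        .collect();
    drop(result_tx); // only the workers' clones remain

    // One channel message per batch instead of one per record.
    for (idx, chunk) in records.chunks(batch_size).enumerate() {
        batch_tx.send((idx, chunk.to_vec())).unwrap();
    }
    drop(batch_tx); // signals the workers to stop

    // Back in the main thread: collect the results and restore input order.
    let mut results: Vec<(usize, Vec<usize>)> = result_rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().flat_map(|(_, counts)| counts).collect()
}
```

Calling `process_in_batches(records, 2, 2)` on three records distributes two batches over two worker threads. Larger batches mean fewer channel messages per record, which is the trade-off the paragraph above describes.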
The FASTQ reader from this crate performs similarly to the fastq-rs reader. The rust-bio readers are slower due to allocations, copying, and UTF-8 validity checks.
All comparisons were run on a set of 100,000 auto-generated, synthetic sequences of uniform length (500 bp) loaded into memory. The parsers from this crate (`seq_io`) are compared with fastq-rs (`fastq_rs`) and Rust-Bio (`bio`).
The bars represent the throughput in GB/s. The error bars show the +/- deviation inferred from the deviations reported by `cargo bench`, that is, `(max_time - min_time) / 2` per iteration.
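The two quantities described above can be written down as a small sketch. The function names are hypothetical, times are assumed to be in nanoseconds per iteration (the unit `cargo bench` reports), and how the deviation was mapped onto the throughput axis of the plots is not stated here, so that step is left out.

```rust
/// Throughput in GB/s. Bytes per nanosecond is numerically equal to GB/s,
/// since 1 GB = 10^9 bytes and 1 s = 10^9 ns.
fn throughput_gb_per_s(bytes_per_iter: u64, ns_per_iter: f64) -> f64 {
    bytes_per_iter as f64 / ns_per_iter
}

/// Half of the spread between the slowest and the fastest time,
/// i.e. (max_time - min_time) / 2, used as the +/- deviation.
fn half_range(max_time_ns: f64, min_time_ns: f64) -> f64 {
    (max_time_ns - min_time_ns) / 2.0
}
```

For instance, with made-up numbers, parsing 50,000,000 bytes in 25,000,000 ns per iteration corresponds to 2 GB/s.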
Benchmarks were run on a Mac Pro (Mid 2010, 2.8 GHz Quad-Core Intel Xeon, OS X 10.13) using Rust 1.23 nightly.
FASTA
FASTQ
Explanation of labels:
`read_record_set()` (involves some copying).