Extract reads from a FASTQ file based on taxonomic classification via Kraken2.
Written in Rust.
I recently wanted to extract reads from a medium-ish sized (6 GB) FASTQ file (~5.5 million reads), based on taxonomic classifications. For that I used the great KrakenTools. However, it took a while to both parse the Kraken2 output file and extract/write the matching reads. Having wanted to experiment with Rust for a while, I was inspired to re-implement the extract_kraken_reads.py script in Rust as a learning exercise.
This is currently an early implementation (and my first Rust programme!), with plans to expand functionality.
Extract reads from a fastq file based on a taxonomic id.
Supports plain fastq files and gzip inputs.
For more detail see benchmarks.
Download the latest release.
Alternatively, build from source:
Clone the repository:
bash
git clone https://github.com/Sam-Sims/krakenxtract
Install Rust/cargo:
To install, please refer to the Rust documentation: docs
Build and add to path:
bash
cd krakenxtract
cargo build --release
export PATH=$PATH:$(pwd)/target/release
All executables will be in the directory krakenxtract/target/release.
bash
kraken-extract --kraken <kraken_output> --fastq <fastq_file> --taxid <taxonomic_id> --output <output_file>
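Conceptually, a command like the one above does two things: it reads the Kraken2 output to find which reads were assigned the requested taxid, then it copies the matching records out of the FASTQ file. The following is only a minimal sketch of that idea, not the code used by this tool. It assumes Kraken2's standard tab-separated output (status, read ID, numeric taxid, length, LCA mapping) and simple four-line FASTQ records held in memory; the function names `matching_read_ids` and `filter_fastq` are hypothetical.

```rust
use std::collections::HashSet;

/// Collect the IDs of reads whose assigned taxid equals `target`.
/// Assumes Kraken2's standard output columns:
/// status (C/U), read ID, taxid, length, LCA mapping.
fn matching_read_ids(kraken_output: &str, target: u32) -> HashSet<String> {
    kraken_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split('\t');
            let _status = fields.next()?;
            let read_id = fields.next()?;
            // Lines with a missing or non-numeric taxid are skipped.
            let taxid: u32 = fields.next()?.trim().parse().ok()?;
            (taxid == target).then(|| read_id.to_string())
        })
        .collect()
}

/// Keep only the four-line FASTQ records whose read ID is in `ids`.
fn filter_fastq(fastq: &str, ids: &HashSet<String>) -> String {
    let lines: Vec<&str> = fastq.lines().collect();
    let mut out = String::new();
    for record in lines.chunks(4) {
        // A header looks like "@read1 optional description";
        // the read ID is the first whitespace-separated token.
        let id = record[0]
            .trim_start_matches('@')
            .split_whitespace()
            .next()
            .unwrap_or("");
        if ids.contains(id) {
            for line in record {
                out.push_str(line);
                out.push('\n');
            }
        }
    }
    out
}
```

A real implementation would stream both files rather than hold them in memory, and would need to handle gzip input; both are left out here to keep the sketch short.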
-k, --kraken <KRAKEN_OUTPUT>
-t, --taxid <TAXID>
-r, --report <REPORT_OUTPUT>
-f, --fastq <FASTQ_FILE>
-o, --output <OUTPUT_LOCATION>
--compression-mode <COMPRESSION> [default: fast]
--parents
--children
--no-compress
--exclude
-h, --help Print help
-V, --version Print version
--parents: This will extract all the reads classified at all taxa between the root and the specified --taxid
--children: This will extract all the reads classified as descendants or subtaxa of --taxid (including the taxid)
--compression-mode: This defines the compression mode of the output fastq.gz file - fast / default / best
--no-compress: This will output a plaintext fastq file
--exclude: This will output every read except those matching the taxid. Works with --parents and --children
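The selection behaviour described for --parents, --children and --exclude can be pictured as set operations on a taxonomy tree. The sketch below is an illustration under stated assumptions, not this tool's implementation: it assumes the taxonomy has already been loaded (for example from a Kraken report) into a hypothetical `parent_of` map from each taxid to its parent taxid, and that the map describes a tree with no cycles. The helper names `ancestors`, `descendants` and `keep_read` are made up for this example.

```rust
use std::collections::{HashMap, HashSet};

/// Taxids on the path from `taxid` up to the root, nearest parent first.
/// The starting taxid itself is not included. Assumes `parent_of` has no cycles.
fn ancestors(parent_of: &HashMap<u32, u32>, taxid: u32) -> Vec<u32> {
    let mut path = Vec::new();
    let mut current = taxid;
    while let Some(&parent) = parent_of.get(&current) {
        if parent == current {
            break; // a root recorded as its own parent
        }
        path.push(parent);
        current = parent;
    }
    path
}

/// `taxid` plus every taxid below it in the tree.
fn descendants(parent_of: &HashMap<u32, u32>, taxid: u32) -> HashSet<u32> {
    let mut found = HashSet::from([taxid]);
    // Sweep the map until a full pass adds nothing new.
    let mut grew = true;
    while grew {
        grew = false;
        for (&child, &parent) in parent_of {
            if found.contains(&parent) && found.insert(child) {
                grew = true;
            }
        }
    }
    found
}

/// Decide whether a read is written out. With `exclude` set,
/// the selection is inverted: everything outside `selected` is kept.
fn keep_read(read_taxid: u32, selected: &HashSet<u32>, exclude: bool) -> bool {
    selected.contains(&read_taxid) != exclude
}
```

With this layout, a --children style selection is `descendants(..)`, a --parents style selection is the target taxid plus `ancestors(..)`, and --exclude only changes the final `keep_read` test, which is why it can combine with either of the other two.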
--include-parents and --include-children arguments
--append
--compression-mode <fast/default/best>
gz
--no-compress flag to output a standard, plaintext fastq file
--exclude to exclude specified reads. Works with --children and --parents
gz files or plain files
--compression arg to select compression type
zlib-ng to speed up gzip handling
--children and --parents to save children and parents based on kraken report