Extract reads from a FASTQ file based on taxonomic classification via Kraken2.
Heavily inspired by the great KrakenTools.
I had been wanting to experiment with Rust for a while, so this is essentially a re-implementation of the extract_kraken_reads.py script in Rust.
The main motivation was to speed up extracting a large number of reads from large FASTQ files while keeping the output compressed (if needed), and to learn Rust!
Extract reads from a fastq file based on a taxonomic id
Supports single or paired-end fastq files
Supports gzip inputs and outputs.
For more detail see benchmarks
Precompiled
Github release: 0.3.0
Cargo
Requires cargo
cargo install krakenxtract
Build from source
To install the Rust toolchain, please refer to the Rust documentation: docs
git clone https://github.com/Sam-Sims/krakenxtract
cd krakenxtract
cargo build --release
export PATH=$PATH:$(pwd)/target/release
All executables will be in the directory krakenxtract/target/release.
krakenXtract -k <kraken_output> -i <fastq_file> -t <taxonomic_id> -o <output_file>
Or, if you have paired-end Illumina reads:
krakenXtract -k <kraken_output> -i <R1_fastq_file> -i <R2_fastq_file> -t <taxonomic_id> -o <R1_output_file> -o <R2_output_file>
If you want to extract all children of a taxon:
krakenXtract -k <kraken_output> -r <kraken_report> -i <fastq_file> -t <taxonomic_id> --children -o <output_file>
-i, --input
This option will specify the input files containing the reads you want to extract from. They can be compressed (gzip, bzip, lzma, zstd). Paired-end reads can be specified by:
Using --input twice: -i <R1_fastq_file> -i <R2_fastq_file>
Using --input once but passing both files: -i <R1_fastq_file> <R2_fastq_file>
This means that bash wildcard expansion works: -i *.fastq
-o, --output
This option will specify the output files containing the extracted reads. The order of the output files is assumed to be the same as the input.
By default the compression will be inferred from the output file extension for supported file types (gzip, bzip, lzma and zstd). If the output type cannot be inferred, plaintext will be output.
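The extension-based inference described above could look roughly like the sketch below. It is written in Python for illustration only and is not the tool's Rust implementation. The function name `infer_compression` and the exact extension table (`.gz`, `.bz2`, `.xz`, `.lzma`, `.zst`) are assumptions; the extensions the tool actually recognises may differ.

```python
from pathlib import Path

# Assumed mapping from file extension to compression format.
# The tool's real table may recognise a different set of extensions.
EXTENSION_TO_FORMAT = {
    ".gz": "gzip",
    ".bz2": "bzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".zst": "zstd",
}


def infer_compression(output_path: str) -> str:
    """Return the format implied by the final extension, or "none" for plaintext."""
    suffix = Path(output_path).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix, "none")
```

Under these assumptions, `infer_compression("reads_R1.fastq.gz")` gives `"gzip"`, while an unrecognised name such as `reads.fastq` falls back to `"none"`, matching the plaintext fallback described above.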
-k, --kraken
This option will specify the path to the Kraken2 output containing taxonomic classification of read IDs.
-t, --taxid
This option will specify the taxon ID for reads you want to extract.
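To make the relationship between --kraken and --taxid concrete, here is a small Python sketch (not the tool's own code) that collects the read IDs assigned to one taxon. It assumes the default tab-separated Kraken2 output, where the second column is the read ID and the third is the assigned taxonomy ID. The function name `read_ids_for_taxid` and the sample lines are made up for illustration.

```python
def read_ids_for_taxid(kraken_lines, taxid):
    """Collect the IDs of reads whose assigned taxonomy ID equals `taxid`."""
    wanted = str(taxid)
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, assigned_taxid = fields[1], fields[2]
        if assigned_taxid == wanted:
            ids.add(read_id)
    return ids


# Made-up example lines in the default Kraken2 output layout:
# classified flag, read ID, taxonomy ID, read length, LCA mapping.
SAMPLE_KRAKEN_OUTPUT = [
    "C\tread1\t562\t150\t562:116\n",
    "U\tread2\t0\t150\t0:116\n",
    "C\tread3\t573\t150\t573:116\n",
    "C\tread4\t562\t150\t562:116\n",
]
```

With the sample above, asking for taxon 562 selects `read1` and `read4`.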
-O, --output-type
This option will manually set the compression mode used for the output file and will override the type inferred from the output path.
Valid values are:
gz to output gzip
bz to output bzip
lzma to output lzma
zstd to output zstd
none to not apply compression
-l, --level
This option will set the compression level to use when compressing the output. It should be a value between 1 and 9, where 1 is the fastest but produces the largest files and 9 is the slowest but produces the smallest files. The default is 6; when speed matters most, 2 is a good trade-off between speed and file size.
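The speed and size trade-off can be explored with Python's standard gzip module, which uses the same 1 to 9 scale. This is only an illustration of what the level means, not how the tool compresses its output. The helper name `compress_fastq` is hypothetical, and its default of 6 simply mirrors the default stated above.

```python
import gzip


def compress_fastq(data: bytes, level: int = 6) -> bytes:
    """Gzip-compress `data` at a level between 1 (fastest) and 9 (smallest)."""
    if not 1 <= level <= 9:
        raise ValueError("compression level must be between 1 and 9")
    return gzip.compress(data, compresslevel=level)


# A tiny made-up FASTQ record, repeated to give the compressor something to do.
record = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n" * 1000
fast = compress_fastq(record, level=2)
small = compress_fastq(record, level=9)
```

Comparing `len(fast)` and `len(small)` on your own data, alongside the time each call takes, is a quick way to choose a level.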
--output-fasta
This option will output a fasta file, with read ids as headers.
-r, --report
This option specifies the path to the report file generated by Kraken2. If you want to use --parents or --children then this argument is required.
--parents
This will extract reads classified at all taxa between the root and the specified --taxid.
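One way to find the taxa between the root and a given taxon is to walk the Kraken2 report and track the current path by indentation. The Python sketch below is an illustration, not the tool's implementation. It assumes the standard six-column report, where the fifth column is the taxonomy ID and the name in the sixth column is indented by two spaces per level. The function name `lineage_to` is hypothetical, and the sample report is simplified with made-up counts.

```python
def lineage_to(report_lines, target_taxid):
    """Return taxonomy IDs from the root down to `target_taxid`, or [] if absent."""
    path = []
    for line in report_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        taxid = int(fields[4])
        name = fields[5]
        # Two leading spaces in the name column per taxonomic level.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        # Drop anything at this depth or deeper, then append the current taxon.
        path = path[:depth] + [taxid]
        if taxid == target_taxid:
            return path
    return []


# Simplified report with made-up counts:
# percent, clade reads, direct reads, rank, taxid, indented name.
SAMPLE_REPORT = [
    "10.00\t10\t10\tU\t0\tunclassified",
    "90.00\t90\t5\tR\t1\troot",
    "85.00\t85\t0\tD\t2\t  Bacteria",
    "60.00\t60\t10\tF\t543\t    Enterobacteriaceae",
    "40.00\t40\t40\tS\t562\t      Escherichia coli",
    "10.00\t10\t10\tS\t573\t      Klebsiella pneumoniae",
    "5.00\t5\t5\tD\t10239\t  Viruses",
]
```

Here `lineage_to(SAMPLE_REPORT, 573)` yields the path root, Bacteria, Enterobacteriaceae, Klebsiella pneumoniae as taxonomy IDs. Whether the root itself is included in the extraction is a choice this sketch does not settle.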
--children
This will extract all the reads classified as descendants or subtaxa of --taxid (including the taxid itself).
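Descendants can be read off the Kraken2 report in a single pass, because every line that follows a taxon with deeper indentation belongs to its subtree. The Python sketch below illustrates the idea and is not the tool's own code. It assumes the standard six-column report with the taxonomy ID in the fifth column and two spaces of name indentation per level. `descendants_of` is a hypothetical name and the sample report uses made-up counts.

```python
def descendants_of(report_lines, target_taxid):
    """Return `target_taxid` followed by every taxon nested beneath it in the report."""
    found = []
    target_depth = None
    for line in report_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        taxid = int(fields[4])
        name = fields[5]
        depth = (len(name) - len(name.lstrip(" "))) // 2
        if target_depth is None:
            if taxid == target_taxid:
                target_depth = depth
                found.append(taxid)
        elif depth > target_depth:
            found.append(taxid)  # still inside the target's subtree
        else:
            break  # back at the target's level or above: subtree finished
    return found


# Simplified report with made-up counts.
SAMPLE_REPORT = [
    "10.00\t10\t10\tU\t0\tunclassified",
    "90.00\t90\t5\tR\t1\troot",
    "85.00\t85\t0\tD\t2\t  Bacteria",
    "60.00\t60\t10\tF\t543\t    Enterobacteriaceae",
    "40.00\t40\t40\tS\t562\t      Escherichia coli",
    "10.00\t10\t10\tS\t573\t      Klebsiella pneumoniae",
    "5.00\t5\t5\tD\t10239\t  Viruses",
]
```

For example, `descendants_of(SAMPLE_REPORT, 543)` returns the family itself plus both species beneath it, and stops before Viruses.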
--exclude
This will output every read except those matching the taxid. Works with --parents and --children
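Conceptually, --exclude only inverts the membership test applied to each read. The Python sketch below shows that idea on four-line FASTQ records held in a list. It is an illustration rather than the tool's implementation, `filter_fastq` is a hypothetical name, and it assumes well-formed records whose ID is the first whitespace-delimited token after the leading `@`.

```python
def filter_fastq(fastq_lines, read_ids, exclude=False):
    """Keep records whose ID is in `read_ids`, or the complement if `exclude` is set."""
    kept = []
    # Step through the input one four-line record at a time.
    for start in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[start:start + 4]
        read_id = record[0][1:].split()[0]
        if (read_id in read_ids) != exclude:
            kept.extend(record)
    return kept


# Two made-up records.
SAMPLE_FASTQ = [
    "@read1 extra description", "ACGT", "+", "IIII",
    "@read2", "GGCC", "+", "IIII",
]
```

Calling `filter_fastq(SAMPLE_FASTQ, {"read1"})` keeps only `read1`, while passing `exclude=True` keeps only `read2`. The same set of IDs can come from a single taxid or from a taxid expanded with its parents or children.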
--include-parents and --include-children arguments
--append
--compression-mode
gz
--output-fasta
--no-compress flag to output a standard, plaintext fastq file
--exclude to exclude specified reads. Works with --children and --parents
gz files or plain files
--compression arg to select compression type
zlib-ng to speed up gzip handling
--children and --parents to save children and parents based on kraken report