This is beta software; use at your own risk.
A Rust implementation of UMI-based PCR deduplication, using the same directional adjacency method as UMI-tools but with a constant-time Hamming distance implementation.
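As a rough illustration of what a constant-time Hamming distance can look like, here is a minimal Rust sketch assuming UMIs of at most 32 bases packed two bits per base into a `u64` (the encoding and function names are illustrative, not rumi's actual internals): once packed, the distance is an XOR plus a popcount, independent of UMI length.

```rust
/// Pack a UMI of up to 32 ACGT bases into a u64, two bits per base.
/// (Illustrative helper, not rumi's actual API.)
fn encode_umi(umi: &[u8]) -> Option<u64> {
    if umi.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in umi {
        let bits = match base {
            b'A' => 0u64,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // ambiguous bases not handled in this sketch
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

/// Hamming distance between two equal-length, 2-bit packed UMIs.
/// XOR marks the differing base pairs; folding each pair's two bits onto
/// the low bit and popcounting gives the number of mismatched positions
/// in a few word operations, regardless of UMI length.
fn hamming(a: u64, b: u64) -> u32 {
    let diff = a ^ b;
    ((diff | (diff >> 1)) & 0x5555_5555_5555_5555).count_ones()
}

fn main() {
    let a = encode_umi(b"ACGTACGT").unwrap();
    let b = encode_umi(b"ACGTACGA").unwrap();
    assert_eq!(hamming(a, b), 1);
}
```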
For now this relies on the Rust toolchain. There are excellent docs on how to set that up.
```bash
cargo install rumi
```
```bash
$ rumi --help
rumi-dedup 0.1.0
Seth Stadick <sstadick@gmail.com>
Deduplicate reads based on umis

USAGE:
    rumi [FLAGS] [OPTIONS]

FLAGS:
        --group_only           Don't deduplicate reads, just group them, giving them a group id, and print them.
                               Rules for filtering out unpaired reads, etc, will still be applied.
    -h, --help                 Prints help information
        --ignore_splice_pos    If two reads have the same start pos, and contain a splice site, they will be grouped
                               together, instead of further splitting them based on the splice site
        --is_paired            Input is paired end. Read pairs with unmapped read1 will be ignored.
        --umi_in_read_id       The UMI is located in the read id after the last '_'. Otherwise use the RX tag.
    -V, --version              Prints version information

OPTIONS:
    -o, --output
    -u, --umi_tag <umi_tag>    The tag holding the umi information. [default: RX]

ARGS:
```
I have not sat down and done any serious benchmarking yet. Anecdotally, this is at least 4x faster than umi_tools on small datasets. There is still a lot of low-hanging fruit in terms of optimizations to apply, though.
I would fully expect this implementation to be capable of at least a 10x performance boost once it has been smoothed out. The large advantage this has over umi_tools is that it can take advantage of multiple cores. umi_tools has already shifted a large amount of its work into C code, so just having a compiled language isn't a huge advantage on its own.
First pass: collect all reads into a dict keyed on position, tracking metrics like UMI frequency and the extracted UMIs while building it. Then iterate over that dict and deduplicate at each position.
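To make that two-pass shape concrete, here is a heavily simplified Rust sketch (not rumi's actual code; the `Read` struct, function names, and the use of the rayon crate are all illustrative): reads are bucketed by position in a first pass, then each bucket is deduplicated independently, which is also what makes spreading positions across cores straightforward. The grouping below is a greedy approximation of the directional-adjacency rule (a UMI is absorbed by a representative when they are within Hamming distance 1 and the representative's count is at least 2 × count − 1); the full algorithm follows chains of adjacent UMIs rather than only direct links to a representative.

```rust
use std::collections::HashMap;

use rayon::prelude::*; // assumed dependency, used here for per-position parallelism

/// Minimal stand-in for an aligned read: its start position and UMI.
#[derive(Debug, Clone)]
struct Read {
    pos: i64,
    umi: String,
}

/// Per-position Hamming distance between equal-length UMI strings.
fn hamming(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count()
}

/// First pass: bucket reads by position and count UMI frequencies per bucket.
fn collect_by_position(reads: Vec<Read>) -> HashMap<i64, HashMap<String, u64>> {
    let mut by_pos: HashMap<i64, HashMap<String, u64>> = HashMap::new();
    for read in reads {
        *by_pos.entry(read.pos).or_default().entry(read.umi).or_insert(0) += 1;
    }
    by_pos
}

/// Second pass (per position): keep one representative UMI per directional group.
fn dedup_position(counts: &HashMap<String, u64>) -> Vec<String> {
    // Visit UMIs from most to least frequent so potential parents come first.
    let mut umis: Vec<(String, u64)> = counts.iter().map(|(u, c)| (u.clone(), *c)).collect();
    umis.sort_by(|a, b| b.1.cmp(&a.1));

    let mut kept: Vec<String> = Vec::new();
    'outer: for (umi, count) in umis {
        for parent in &kept {
            // Directional rule: within 1 mismatch and count(parent) >= 2 * count - 1.
            if hamming(&umi, parent) <= 1 && counts[parent] >= 2 * count - 1 {
                continue 'outer; // absorbed into an existing representative
            }
        }
        kept.push(umi);
    }
    kept
}

fn main() {
    let reads = vec![
        Read { pos: 100, umi: "ACGT".into() },
        Read { pos: 100, umi: "ACGT".into() },
        Read { pos: 100, umi: "ACGA".into() }, // likely a PCR/sequencing error of ACGT
        Read { pos: 250, umi: "TTTT".into() },
    ];

    let by_pos = collect_by_position(reads);

    // Positions are independent, so they can be processed in parallel.
    let deduped: Vec<(i64, Vec<String>)> = by_pos
        .par_iter()
        .map(|(pos, counts)| (*pos, dedup_position(counts)))
        .collect();

    for (pos, umis) in deduped {
        println!("pos {pos}: {} unique molecules {umis:?}", umis.len());
    }
}
```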