
This is beta software; use at your own risk.

# rumi

A Rust implementation of UMI-based PCR deduplication, using the same directional adjacency method as UMI-tools, but with a constant-time Hamming distance implementation.
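As a rough illustration of the constant-time Hamming distance idea (a sketch of the general technique, not necessarily the crate's exact code): pack each UMI into a `u64` at 2 bits per base, so that for UMIs up to 32 bases the distance is a fixed handful of word-level bit operations instead of a per-character loop.

```rust
// Sketch only: encode an ACGT UMI as 2 bits per base in a u64.
fn encode(umi: &str) -> u64 {
    umi.bytes().fold(0u64, |acc, b| {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => 0, // ambiguous bases treated as 'A' in this toy version
        };
        (acc << 2) | code
    })
}

// Two bases differ iff their 2-bit codes differ, i.e. the XOR has at least
// one bit set in that pair. OR each pair's bits together, mask the low bit
// of every pair, and count the set bits.
fn hamming(a: u64, b: u64) -> u32 {
    let x = a ^ b;
    (((x >> 1) | x) & 0x5555_5555_5555_5555u64).count_ones()
}
```

The whole distance computation is a XOR, a shift, two masks, and a popcount, regardless of UMI length (up to the 32-base capacity of a `u64`).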

## Install

For now this relies on the Rust toolchain. There are excellent docs on how to set that up.

```bash
cargo install rumi
```

## Usage

```bash
$ rumi --help
rumi-dedup 0.1.0
Seth Stadick <sstadick@gmail.com>
Deduplicate reads based on umis

USAGE:
    rumi [FLAGS] [OPTIONS] <input> --output <output> --umi_tag <umi_tag>

FLAGS:
        --group_only           Don't deduplicate reads, just group them, give them a group id, and print them.
                               Rules for filtering out unpaired reads, etc, will still be applied.
    -h, --help                 Prints help information
        --ignore_splice_pos    If two reads have the same start pos, and contain a splice site, they will be
                               grouped together, instead of further splitting them based on the splice site
        --is_paired            Input is paired end. Read pairs with unmapped read1 will be ignored.
        --umi_in_read_id       The UMI is located in the read id after the last '_'. Otherwise use the RX tag.
    -V, --version              Prints version information

OPTIONS:
    -c, --allowed_count_factor <allowed_count_factor>      The factor to multiply the count of a umi by when
                                                           determining whether or not to group it with other
                                                           umis within allowed_read_dist. include umi_b as
                                                           adjacent to umi_a if:
                                                           umi_a.counts >= allowed_count_factor * umi_b.counts
                                                           [default: 2]
    -n, --allowed_network_depth <allowed_network_depth>    The number of nodes deep to go when creating a group.
                                                           If allowed_read_dist is 1, then an
                                                           allowed_network_depth of 2 will enable getting all
                                                           umis with hamming distance of 2 from the current umi.
                                                           [default: 2]
    -d, --allowed_read_dist <allowed_read_dist>            The distance between umis that will allow them to be
                                                           counted as adjacent. [default: 1]
    -o, --output <output>                                  Output bam file. Use - if stdout [default: -]
    -u, --umi_tag <umi_tag>                                The tag holding the umi information. [default: RX]

ARGS:
    <input>    Input bam file. Use - if stdin [default: -]
```
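The directional grouping behind `--allowed_count_factor`, `--allowed_read_dist`, and `--allowed_network_depth` can be sketched roughly like this (a simplified illustration, not the crate's actual code; a plain per-character Hamming distance stands in for the optimized one):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Simple stand-in for the optimized Hamming distance.
fn hamming(a: &str, b: &str) -> u32 {
    a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count() as u32
}

/// Starting from `root`, collect UMIs reachable via the directional rule:
/// umi_b joins umi_a's group if they are within `allowed_read_dist` and
/// umi_a.counts >= allowed_count_factor * umi_b.counts, walking at most
/// `allowed_network_depth` steps from the root.
fn group(
    root: &str,
    counts: &HashMap<String, u64>,
    allowed_read_dist: u32,
    allowed_count_factor: u64,
    allowed_network_depth: u32,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::from([root.to_string()]);
    let mut queue = VecDeque::from([(root.to_string(), 0u32)]);
    while let Some((umi_a, depth)) = queue.pop_front() {
        if depth == allowed_network_depth {
            continue;
        }
        for (umi_b, &count_b) in counts {
            if seen.contains(umi_b) {
                continue;
            }
            let adjacent = hamming(&umi_a, umi_b) <= allowed_read_dist
                && counts[&umi_a] >= allowed_count_factor * count_b;
            if adjacent {
                seen.insert(umi_b.clone());
                queue.push_back((umi_b.clone(), depth + 1));
            }
        }
    }
    seen
}
```

With the defaults (`allowed_read_dist` 1, `allowed_network_depth` 2), a high-count root can absorb a distance-1 neighbor, and that neighbor can in turn absorb its own distance-1 neighbors, yielding groups up to distance 2 from the root.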

## Performance

I have not sat down and done any serious benchmarking yet. Anecdotally, this is at least 4x faster than umi_tools on small datasets. There is still a lot of low-hanging fruit in terms of optimizations to apply, though.

I fully expect this implementation to be capable of at least a 10x performance boost once it has been smoothed out. The large advantage it has over umi_tools is that it can take advantage of multiple cores. umi_tools has already shifted a large amount of its work into C code, so just being a compiled language isn't a huge advantage on its own.

## Known differences from umi_tools

## TODO

## Prior Art

## Notes

First pass: collect all reads into a dict keyed on position, tracking metrics like UMI frequency and the extracted UMIs while building it. Then iterate over that dict and deduplicate at each position.
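That first pass might look roughly like this (a toy sketch; `Read` and its fields are illustrative stand-ins for a real BAM record):

```rust
use std::collections::HashMap;

// Illustrative stand-in for a BAM record: just a position and an
// extracted UMI.
#[derive(Clone)]
struct Read {
    pos: i64,
    umi: String,
}

// Everything gathered at one position during the first pass.
struct Bundle {
    reads: Vec<Read>,
    umi_freq: HashMap<String, u64>,
}

/// Bucket reads by position while counting UMI frequencies. The second
/// pass (not shown) iterates over the map and deduplicates each bundle
/// using the frequencies gathered here.
fn first_pass(reads: &[Read]) -> HashMap<i64, Bundle> {
    let mut by_pos: HashMap<i64, Bundle> = HashMap::new();
    for r in reads {
        let bundle = by_pos.entry(r.pos).or_insert_with(|| Bundle {
            reads: Vec::new(),
            umi_freq: HashMap::new(),
        });
        *bundle.umi_freq.entry(r.umi.clone()).or_insert(0) += 1;
        bundle.reads.push(r.clone());
    }
    by_pos
}
```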

### Diffs in example.bam (from umi_tools)