rasusa

Randomly subsample sequencing reads to a specified coverage.

Motivation
Install
Usage
Contributing

Motivation

I couldn't find a tool for subsampling reads that met my requirements. All the strategies I could find fell short as they either just wanted a number or percentage of reads to subsample to or, if they did subsample to a coverage, they assume all reads are the same size (i.e Illumina). As I mostly work with long-read data this posed a problem if I wanted to subsample a file to certain coverage, as length of reads was never taken into account.
rasusa addresses this shortcoming.

A workaround I had been using for a while was using filtlong. It was simple enough, I just figure out the number of bases I need to achieve a (theoretical) coverage for my sample. Say I have a fastq from an E. coli sample with 5 million reads and I want to subset it to 50x coverage. I just need to multiply the expected size of the sample's genome, 4.6 million base pairs, by the coverage I want and I have my target bases - 230 million base pairs. In filtlong, I can do the following

sh target=230000000 filtlong --target_bases "$target" reads.fq > reads.50x.fq

However, this is technically not the intended function of filtlong; it's a quality filtering tool. What you get in the end is a subset of the "highest scoring" reads at a (theoretical) coverage of 50x. Depending on your circumstances, this might be what you want. However, you bias yourself towards the best/longest reads in the dataset - not a fair representation of your dataset as a whole. There is also the possibility of favouring regions of the genome that produce longer/higher quality reads. De Maio et al. even found that by randomly subsampling nanopore reads you achieve better genome assemblies than if you had filtered.

So, depending on your circumstances, an unbiased subsample of your reads might be what you need. And if this is the case, rasusa has you covered.

Install

Some of these installation options require the rust toolchain, which is extremely easy to set up. However, if you do not wish to install rust then there are a number of options available.

`cargo`

Prerequisite: rust toolchain

```sh

not implemented yet

```

`conda`

Prerequisite: conda

```sh

not implemented yet

```

`singularity`

Prerequisite: singularity

```sh

not implemented yet

```

`docker`

Prerequisite: docker

```sh

not implemented yet

```

`homebrew`

Prerequisite: homebrew

```sh

not implemented yet

```

Release binaries

These binaries do not require that you have the rust toolchain installed.

Currently, there are two pre-compiled binaries available: - Linux kernel x86_64-unknown-linux-musl (works on most Linux distributions I tested) - OSX kernel x86_64-apple-darwin (works for any post-2007 Mac)

```sh

not implemented yet

```

If these binaries do not work on your system please raise an issue and I will potentially add some additional kernels.

Build locally

Prerequisite: rust toolchain

```sh git clone https://github.com/mbhall88/rasusa.git cd rasusa cargo build --release target/release/rasusa --help

if you want to check everything is working ok

cargo test --all ```

Usage

Basic usage

sh rasusa --input in.fq --coverage 30 --genome-size 4.6mb

The above command will output the subsampled file to stdout. For more details on the above options, and additional options, see below.

Required parameters

There are three required options to run rasusa.

Input

`-i`, `--input`

This option specifies the file containing the reads you would like to subsample. The file must be valid fasta or fastq format and can be compressed (with a tool such as gzip).

Note: The file format is (lazily) determined from the file name. File suffixes that are deemed valid are:

fasta: .fa, .fasta, .fa.gz, .fasta.gz
fastq: .fq, .fastq, .fq.gz, .fastq.gz

If there is a naming convention you feel is missing, please raise an issue.

Coverage

`-c`, `--coverage`

This option is used to determine the minimum coverage to subsample the reads to. It can be specified as an integer (100), a decimal/float (100.0), or either of the previous suffixed with an 'x' (100x).
When using a float, the coverage will be rounded down to the nearest integer (i.e. 100.7 becomes 100).

Note: Due to the method for determining how many bases are required to achieve the desired coverage, the actual coverage, in the end, could be slightly higher than requested. For example, if the last included read is very long. The log messages should inform you of the actual coverage in the end.

Genome size

`-g`, `--genome-size`

The genome size of the input is also required. It is used to determine how many bases are necessary to achieve the desired coverage. This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways. As a plain old integer (1600), or with a metric suffix (1.6kb). All metric suffixes can have an optional 'b' suffix and be lower, upper, or mixed case. So 'Kb', 'kb' and 'k' would all be inferred as 'kilo'. Valid metric suffixes include:

Base (b) - multiplies by 1
Kilo (k) - multiplies by 1,000
Mega (m) - multiplies by 1,000,000
Giga (g) - multiplies by 1,000,000,000
Tera (t) - multiplies by 1,000,000,000,000

Optional parameters

Output

`-o`, `--output`

By default, rasusa will output the subsampled file to stdout. If you would prefer to specify an output file path, then use this option. If you add the compression suffix .gz to the file path then the output will be compressed for you.
Note: The output will always be in the same format as the input. You cannot pass fastq as input and ask for fasta as output. If this poses a problem for you, please raise an issue, and I will implement this feature if/when possible.

Random seed

`-s`, `--seed`

This option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. The seed is an integer, and by default it is not set, meaning the operating system will seed the random subsampler. You should only pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads.

Verbosity

`-v`

Adding this optional flag will make the logging more verbose. By default, logging will produce messages considered "info" or above (see here for more details). If verbosity is switched on, you will additionally get "debug" level logging messages.

Full usage

```text $ rasusa --help

rasusa 0.1.0 Randomly subsample reads to a specified coverage.

USAGE: rasusa [FLAGS] [OPTIONS] --coverage --genome-size --input

FLAGS: -h, --help Prints help information -V, --version Prints version information -v Switch on verbosity.

OPTIONS: -c, --coverage The desired coverage to sub-sample the reads to. -g, --genome-size Size of the genome to calculate coverage with respect to. i.e 4.3kb, 7Tb, 9000, 4.1MB etc. -i, --input The fast{a,q} file to sub-sample. -o, --output Output file, stdout if not present. -s, --seed Random seed to use. ```

Contributing

If you would like to help improve rasusa you are very welcome!

For changes to be accepted, they must pass the CI and coverage checks. These include:

Code is formatted with rustfmt. This can be done by running cargo fmt in the project directory.
There are no compiler errors/warnings. You can check this by running cargo clippy --all-features --all-targets -- -D warnings
Code coverage has not reduced. If you want to check coverage before pushing changes, I use kcov.