Some Kmer counting utilities

This package (currently in development) provides the following tools:

The package has a Julia companion providing interactive access to dumped statistics or interactive inspection of sequences of bases and qualities.

Kmer Compression and Counting

The bases are presently encoded on 2 bits.
Kmers can be stored in 32-bit or 64-bit words, thus providing a compressed representation of up to 32 bases with the 2-bit alphabet.
Kmers and compressed Kmers are represented by the traits KmerT and CompressedKmerT respectively. A kmer is identified with its reverse complement in the counting methods.
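
As an illustration of the 2-bit packing and of the identification of a kmer with its reverse complement, here is a minimal sketch. It does not use the traits KmerT or CompressedKmerT; the function names, the code assignment (a=0, c=1, g=2, t=3) and the choice of the smaller packed word as canonical form are assumptions made for the example, not the package's actual API.

```rust
/// Encode one base on 2 bits. The assignment a=0, c=1, g=2, t=3 is an assumption.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a kmer of at most 32 bases into a 64-bit word.
/// Returns None if the kmer is too long or contains a base other than a, c, g, t.
fn pack_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut word = 0u64;
    for &b in seq {
        word = (word << 2) | encode_base(b)?;
    }
    Some(word)
}

/// Reverse complement of a packed kmer of length k (k <= 32).
fn reverse_complement(word: u64, k: usize) -> u64 {
    let mut rc = 0u64;
    let mut w = word;
    for _ in 0..k {
        // With the codes above, the complement of code x is 3 - x.
        rc = (rc << 2) | (3 - (w & 3));
        w >>= 2;
    }
    rc
}

/// Hypothetical canonical form: the smaller of a kmer and its reverse complement.
fn canonical(word: u64, k: usize) -> u64 {
    word.min(reverse_complement(word, k))
}
```

With this convention, a kmer and its reverse complement (for example AAAC and GTTT) map to the same canonical word, so they are counted together.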

Kmer counting is multi-threaded and filters unique kmers in a cuckoo filter to save memory. Unique kmers are dumped in a separate file with their coordinates (sequence and position in sequence). Kmers occurring more than once, stored in a Bloom filter, are dumped in another file with their multiplicity. See module kmercount.
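
The shape of the two outputs can be sketched as follows. This is a simplified, single-threaded illustration that swaps the cuckoo and Bloom filters for a plain HashMap and skips reverse-complement identification; the function name split_kmers and the returned tuple layout are hypothetical and do not reflect the module kmercount.

```rust
use std::collections::HashMap;

/// Returns (unique, multiple):
/// - unique kmers with their coordinates (sequence index, position in sequence),
/// - kmers seen more than once with their multiplicity.
fn split_kmers<'a>(
    seqs: &[&'a [u8]],
    k: usize,
) -> (Vec<(&'a [u8], usize, usize)>, Vec<(&'a [u8], u32)>) {
    // kmer -> (count, sequence of first occurrence, position of first occurrence)
    let mut seen: HashMap<&'a [u8], (u32, usize, usize)> = HashMap::new();
    for (s, &seq) in seqs.iter().enumerate() {
        if k == 0 || seq.len() < k {
            continue;
        }
        for (pos, kmer) in seq.windows(k).enumerate() {
            seen.entry(kmer).or_insert((0, s, pos)).0 += 1;
        }
    }
    let mut unique = Vec::new();
    let mut multiple = Vec::new();
    for (kmer, (count, s, pos)) in seen {
        if count == 1 {
            unique.push((kmer, s, pos));
        } else {
            multiple.push((kmer, count));
        }
    }
    // Sort so that the output order does not depend on hash map iteration order.
    unique.sort();
    multiple.sort();
    (unique, multiple)
}
```

An exact hash map keeps every kmer in memory; the point of the filters mentioned above is precisely to avoid that cost for kmers seen only once.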

Hashing and Sketching of data

Similarity between sequences can be estimated by counting common Kmers between sequences with minhash, superminhash and the probability Jaccard Index.
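
To make the idea concrete, here is a minimal sketch of the classical minhash estimate of the Jaccard index between the kmer sets of two sequences. It uses the standard library hasher with an integer seed per hash function; the names minhash_sketch and jaccard_estimate are assumptions for the example, and superminhash and the probability Jaccard index are not covered.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// For each of nb_hash seeded hash functions, keep the minimum hash value
/// over all kmers of the sequence.
fn minhash_sketch(seq: &[u8], k: usize, nb_hash: usize) -> Vec<u64> {
    let mut sketch = vec![u64::MAX; nb_hash];
    if k == 0 {
        return sketch;
    }
    for kmer in seq.windows(k) {
        for (seed, slot) in sketch.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            kmer.hash(&mut h);
            *slot = (*slot).min(h.finish());
        }
    }
    sketch
}

/// Fraction of sketch slots on which the two sketches agree.
/// This estimates the Jaccard index of the two kmer sets.
fn jaccard_estimate(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "sketches must have the same size");
    if a.is_empty() {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}
```

Two identical sequences give an estimate of 1, and the accuracy for intermediate similarities improves as nb_hash grows.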

Some other standard tools, such as:

A minimal module rnautils

This module provides an uncompressed representation of Amino Acid sequences along with generation of compressed Kmer (up to a size of 25 amino acids).
This module is, in its present state, minimal. Its main objective is to provide sketching of AA sequences in the same way as DNA sequences.

Some basic statistics on sequences

  1. Read length distributions.
    A file giving the number of reads as a function of length.

  2. Base distributions.
    A matrix (100, 4) giving, for row i and column j in (1,2,3,4), the number of reads in which the base corresponding to column j (in the order a, c, g, t) occurs at percentage i of the read.

This file can be reloaded by the Julia package Genomics (cf. BaseDistribution.jl).
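
A minimal sketch of how such a base distribution matrix could be filled is given below. It assumes that "percentage i" refers to the relative position in the read, so that each read contributes exactly one base per row; the function name base_distribution is hypothetical, and rows and columns are 0-based here whereas the description above is 1-based.

```rust
/// Returns a 100 x 4 matrix: entry [i][j] counts the reads whose base at
/// relative position i percent is the j-th base in the order a, c, g, t.
fn base_distribution(reads: &[&[u8]]) -> Vec<[u64; 4]> {
    let mut counts = vec![[0u64; 4]; 100];
    for read in reads {
        if read.is_empty() {
            continue;
        }
        for (i, row) in counts.iter_mut().enumerate() {
            // Position in the read corresponding to percentage i (always < read.len()).
            let pos = i * read.len() / 100;
            let col = match read[pos].to_ascii_uppercase() {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => continue, // bases outside a, c, g, t are ignored in this sketch
            };
            row[col] += 1;
        }
    }
    counts
}
```

With this convention every row sums to the number of non-empty reads whose base at that relative position is one of a, c, g, t.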

Quality

Qualities are re-mapped to values in [0..7], so that they need only 3 bits of storage, and are stored in a wavelet matrix. The mapping is non-uniform and maps the range [0x25,0x37] to [1,6].
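
The following sketch shows what a non-uniform remapping of this kind may look like. Only the endpoints come from the description above ([0x25,0x37] goes to [1,6]); the bin boundaries inside that range, and the choice to send lower values to 0 and higher values to 7, are hypothetical and chosen purely for illustration.

```rust
/// Hypothetical inclusive upper bounds of the bins mapped to 1..=6.
/// The first bin holds 4 quality values, the others 3, so the bins are not uniform.
const BIN_UPPER: [u8; 6] = [0x28, 0x2B, 0x2E, 0x31, 0x34, 0x37];

/// Remap a raw quality byte to a value in 0..=7 (3 bits).
fn remap_quality(q: u8) -> u8 {
    if q < 0x25 {
        return 0; // assumption: everything below the range goes to 0
    }
    for (i, &upper) in BIN_UPPER.iter().enumerate() {
        if q <= upper {
            return i as u8 + 1;
        }
    }
    7 // assumption: everything above the range goes to 7
}
```
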
The quality part of the data is stored in a process serving quality requests, described below:

Quality Server

The server is launched on the server machine by the command:
qualityloader -f filename [-p portnum] [--wavelet]

The server listens on port 4766 by default. The option "--wavelet" asks for wavelet compression.