skc

skc is a simple tool for finding shared k-mer content between two genomes.

Installation

Prebuilt binary

``` curl -sSL skc.mbh.sh | sh

or with wget

wget -nv -O - skc.mbh.sh | sh ```

You can also pass options to the script like so

```text $ curl -sSL skc.mbh.sh | sh -s -- --help install.sh [option]

Fetch and install the latest version of skc, if skc is already installed it will be updated to the latest version.

Options -V, --verbose Enable verbose output for the installer

    -f, -y, --force, --yes
            Skip the confirmation prompt during installation

    -p, --platform
            Override the platform identified by the installer

    -b, --bin-dir
            Override the bin installation directory [default: /usr/local/bin]

    -a, --arch
            Override the architecture identified by the installer [default: x86_64]

    -B, --base-url
            Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]

    -h, --help
            Display this help message

```

Cargo

text cargo install skc

Conda

text conda install skc

Local

text cargo build --release ./target/release/skc --help

Usage

Check for shared 16-mers between the HIV-1 genome and the Mycobacterium tuberculosis genome.

```text $ skc -k 16 NC001802.1.fa NC000962.3.fa [2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target [2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query

4233642782 tcount=1 qcount=1 tpos=NC001802.1:739 qpos=NC000962.3:4008106 TGCAGAACATCCAGGG 4237062597 tcount=1 qcount=1 tpos=NC001802.1:8415 qpos=NC000962.3:629482 CCAGCAGCAGATAGGG ```

So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout - use the -o option to write them to file.

Fasta description

Example: >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106

The ID (4233642782) is the 64-bit integer representation of the k-mer's value in bit-space ( see Daniel Liu's brilliant cute-nucleotides repository for more information). tcount and qcount are the number of times the k-mer is present in the target and query genomes, respectively. tpos and qpos are the (1-based) k-mer starting position(s) within the target and query contigs - these will be comma-seperated if the k-mer occurs multiple times.

Usage help

```text $ skc --help Shared k-mer content between two genomes

Usage: skc [OPTIONS]

Arguments: Target sequence

      Can be compressed with gzip, bzip2, xz, or zstd

Query sequence

      Can be compressed with gzip, bzip2, xz, or zstd

Options: -k, --kmer Size of k-mers (max. 32)

      [default: 21]

-o, --output Output filepath(s); stdout if not present

-O, --output-type u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd

      Output compression format is automatically guessed from the filename extension. This option is used to override that

      [default: u]

-l, --compress-level Compression level to use if compressing output

      [default: 6]

-h, --help Print help (see a summary with '-h')

-V, --version Print version ```

Caveats

Alternate tools

skc does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I basically wrote it as I either struggled to install some alternate tools, they were clunky/verbose, or it was laborious to get shared k-mers out of the results (e.g. can only search one k-mer at a time or have to run many different subcommands). Here is a (non-exhaustive) list of other tools that can be used to get shared k-mer content

Acknowledgements

Daniel Liu's brilliant cute-nucleotides repository is used to (rapidly) convert k-mers into 64-bit integers.