`skc` is a simple tool for finding shared k-mer content between two genomes.
```
curl -sSL skc.mbh.sh | sh
# or with wget
wget -nv -O - skc.mbh.sh | sh
```
You can also pass options to the script like so:
```text
$ curl -sSL skc.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of skc, if skc is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]

        -h, --help
                Display this help message
```
Install with `cargo`:

```text
cargo install skc
```

Install with `conda`:

```text
conda install skc
```

Or build from source:

```text
cargo build --release
./target/release/skc --help
```

Check for shared 16-mers between the HIV-1 genome and the Mycobacterium tuberculosis genome.
```text
$ skc -k 16 NC001802.1.fa NC000962.3.fa
[2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target
[2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query
>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
TGCAGAACATCCAGGG
>4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
CCAGCAGCAGATAGGG
```

So we can see there are two shared 16-mers between the genomes. By default, the shared k-mers are written to stdout; use the `-o` option to write them to a file.
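
To make concrete what "shared k-mers" means here, the following is a naive Python sketch, not `skc`'s implementation. It indexes every k-mer of one sequence (the target) in memory and then scans the other (the query). The function names `kmer_positions` and `shared_kmers` are made up for this illustration. The sketch compares k-mers as plain strings on one strand only, which is an assumption rather than a statement about how `skc` treats strands.

```python
from collections import defaultdict


def kmer_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in seq to its 1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i : i + k]].append(i + 1)
    return dict(positions)


def shared_kmers(target: str, query: str, k: int) -> dict[str, tuple[list[int], list[int]]]:
    """Return {kmer: (target_positions, query_positions)} for k-mers found in both."""
    # Only the target is indexed up front; the query is scanned window by window.
    target_index = kmer_positions(target, k)
    query_hits = defaultdict(list)
    for i in range(len(query) - k + 1):
        kmer = query[i : i + k]
        if kmer in target_index:
            query_hits[kmer].append(i + 1)
    return {kmer: (target_index[kmer], qpos) for kmer, qpos in query_hits.items()}


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    for kmer, (tpos, qpos) in shared_kmers("ACGTAC", "TTACGT", 3).items():
        print(kmer, f"tcount={len(tpos)}", f"qcount={len(qpos)}", tpos, qpos)
```

Because only the target's k-mers are indexed, memory in this sketch grows with the number of unique k-mers in the target, not the query.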
Example: `>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106`

The ID (`4233642782`) is the 64-bit integer representation of the k-mer's value in bit-space (see Daniel Liu's brilliant cute-nucleotides repository for more information). `tcount` and `qcount` are the number of times the k-mer is present in the target and query genomes, respectively. `tpos` and `qpos` are the (1-based) k-mer starting position(s) within the target and query contigs; these will be comma-separated if the k-mer occurs multiple times.
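
If you want to post-process the output, a header like the one above can be split into its fields with a few lines of Python. This is a minimal sketch: `parse_header` is a hypothetical helper, not part of `skc`, and it assumes that when a k-mer occurs multiple times each comma-separated entry is a full `contig:position` pair.

```python
def parse_header(header: str) -> dict:
    """Parse one FASTA header line from skc output into a dict of fields."""
    fields = header.lstrip(">").split()
    record = {"id": int(fields[0])}  # integer representation of the k-mer
    for field in fields[1:]:
        key, value = field.split("=", 1)
        if key in ("tcount", "qcount"):
            record[key] = int(value)
        else:
            # tpos/qpos: assumed to be comma-separated "contig:position" entries
            hits = []
            for entry in value.split(","):
                contig, pos = entry.rsplit(":", 1)
                hits.append((contig, int(pos)))
            record[key] = hits
    return record


if __name__ == "__main__":
    line = ">4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106"
    print(parse_header(line))
```

Positions are kept 1-based, as in the output; subtract 1 before using them as Python string indices.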
```text
$ skc --help
Shared k-mer content between two genomes

Usage: skc [OPTIONS] <TARGET> <QUERY>

Arguments:
  <TARGET>
          Can be compressed with gzip, bzip2, xz, or zstd

  <QUERY>
          Can be compressed with gzip, bzip2, xz, or zstd

Options:
  -k, --kmer
          [default: 21]

  -o, --output

  -O, --output-type
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd

          Output compression format is automatically guessed from the filename extension. This option is used to override that

          [default: u]

  -l, --compress-level
          [default: 6]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```

Make the first genome (`<TARGET>`) the smallest genome. This is to reduce memory usage, as all unique k-mers (well, their `u64` value) for this genome will be held in memory.

`skc` does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I basically wrote it because I either struggled to install some alternate tools, they were clunky/verbose, or it was laborious to get shared k-mers out of the results (e.g. can only search one k-mer at a time or have to run many different subcommands). Here is a (non-exhaustive) list of other tools that can be used to get shared k-mer content:

Daniel Liu's brilliant cute-nucleotides repository is used to (rapidly) convert k-mers into 64-bit integers.
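
To give a feel for what such a k-mer to integer conversion looks like, here is a small Python sketch of a 2-bit packing. The mapping (A=0, C=1, T=2, G=3, with the first base in the lowest two bits) was inferred only from the two example records in this README, whose IDs it reproduces. It is not taken from the cute-nucleotides source, so treat it as an assumption. `kmer_to_int` and `int_to_kmer` are illustrative names.

```python
# Assumed 2-bit codes, inferred from the example records in this README.
CODES = {"A": 0, "C": 1, "T": 2, "G": 3}
BASES = "ACTG"  # index in this string == 2-bit code


def kmer_to_int(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer, first base in the lowest bits."""
    if len(kmer) > 32:
        raise ValueError("a 64-bit integer holds at most 32 bases at 2 bits each")
    value = 0
    for i, base in enumerate(kmer.upper()):
        value |= CODES[base] << (2 * i)
    return value


def int_to_kmer(value: int, k: int) -> str:
    """Unpack an integer produced by kmer_to_int back into a k-mer of length k."""
    return "".join(BASES[(value >> (2 * i)) & 3] for i in range(k))


if __name__ == "__main__":
    print(kmer_to_int("TGCAGAACATCCAGGG"))  # 4233642782
    print(kmer_to_int("CCAGCAGCAGATAGGG"))  # 4237062597
    print(int_to_kmer(4233642782, 16))      # TGCAGAACATCCAGGG
```

At two bits per base, a 64-bit integer has room for at most 32 bases. Note that decoding needs `k` as well as the integer, since leading `A`s contribute only zero bits.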