tidk is a toolkit to identify and visualise telomeric repeats for the Darwin Tree of Life genomes. tidk works especially well on chromosomal genomes, but also works well on PacBio HiFi reads (see the telomeric repeat database for many examples). The tool has a few modules, which may be useful to anyone investigating telomeric repeat sequences in a genome.
- `explore` tries to find the telomeric repeat unit in the genome.
- `find` and `search` are essentially the same. They identify a repeat sequence in windows across the genome. `find` uses an in-built table of telomeric repeats; in `search` you supply your own.
- `plot` does what it says on the tin, and plots the CSV output of `find` or `search` as an SVG.
- `min` returns the lexicographically minimal string of one or more input DNA strings, or of a file of DNA sequences.
- `trim` trims fasta sequences using a supplied base repeat string.

The easiest way to install is through conda:
```bash
conda install -c bioconda tidk
```
Otherwise...
As with other Rust projects, you have to compile it yourself. Download Rust, clone this repo, and then run:

`cargo build --release`

Compiling takes anywhere from 1-6 minutes from fresh (tested on the farm). The executable will be at `./target/release/tidk`.
```
TIDK 0.1.5
Max Brown mb39@sanger.ac.uk
A Telomere Identification Toolkit.

USAGE:
    tidk [SUBCOMMAND]

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    explore    Use a search of all substrings of length k to query a genome for a telomere sequence.
    find       Supply the name of a clade your organsim belongs to, and this submodule will find all telomeric repeat matches for that clade.
    help       Print this message or the help of the given subcommand(s)
    min        Emit the canonical lexicographically minimal DNA string.
    plot       SVG plot of CSV generated from search or find.
    search     Search the input genome with a specific telomeric repeat search string.
    trim       Trim a specific telomeric repeat from the input reads and yield reads oriented at the telomere start.
```
`tidk explore` will identify all sequences of length k that repeat at least twice throughout a genome. Repeats that occur in high numbers toward the beginning or end of sequences are likely candidates for telomeric repeats. The reported repeats are the lexicographically minimal of all possible string rotations of the telomeric repeat, in both forward and reverse complement forms.
It outputs either a csv or bedgraph of potential telomeric repeats and their locations, in addition to a text file of the potential telomeric repeat sequences, and how often they are found in the genome.
For example:
`tidk explore --fasta fastas/iyBomHort1_1.20210303.curated_primary.fa --minimum 5 --maximum 12 -o test_dist -t 500`

searches the genome for repeats from length 5 to length 12 sequentially (this could potentially be made concurrent) on the freshly minted Bombus hortorum genome.
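
The basic idea behind this kind of search (collect every substring of length k and keep those seen at least twice) can be sketched in Rust roughly as follows. This is only an illustration under stated assumptions, not tidk's actual implementation: the function name `repeated_kmers` is hypothetical, it works on a single in-memory ASCII sequence, and it leaves out both the canonical-rotation step and the reporting of repeat locations.

```rust
use std::collections::HashMap;

/// Count every substring of length `k` in `seq` and keep those seen at least twice.
/// Hypothetical helper for illustration only; not part of tidk.
fn repeated_kmers(seq: &str, k: usize) -> HashMap<&str, usize> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    if k == 0 || seq.len() < k {
        return counts;
    }
    // Slide a window of width k along the sequence (ASCII DNA assumed).
    for start in 0..=seq.len() - k {
        *counts.entry(&seq[start..start + k]).or_insert(0) += 1;
    }
    // Drop k-mers that occur only once.
    counts.retain(|_, n| *n >= 2);
    counts
}

fn main() {
    // Toy sequence: three copies of TTAGG followed by unrelated bases.
    let seq = "TTAGGTTAGGTTAGGACGT";
    for k in 5..=6 {
        let mut hits: Vec<_> = repeated_kmers(seq, k).into_iter().collect();
        hits.sort();
        println!("k={k}: {hits:?}");
    }
}
```

Note that every rotation of the repeat unit (TTAGG, TAGGT, AGGTT, ...) shows up as its own k-mer here, which is why collapsing rotations to a single canonical form is useful before reporting.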
```
tidk-explore 0.1.5
Use a search of all substrings of length k to query a genome for a telomere sequence.

USAGE:
    tidk explore [OPTIONS] --fasta

OPTIONS:
    -d, --distance
```
`tidk find` takes an input clade, looks up the known telomeric repeat (or repeats) for that clade, and searches the genome for it. It uses the telomeric repeat database. As more telomeric repeats are found and added, the dictionary of sequences used will grow (perhaps there is a more elegant way to parse the command line input?).
```
tidk-find 0.1.5
Supply the name of a clade your organsim belongs to, and this submodule will find all telomeric repeat matches for that clade.

USAGE:
    tidk find [OPTIONS]

OPTIONS:
    -c, --clade
```
`tidk search` will search the genome for an input string. If you know the telomeric repeat of your sequenced organism, this will hopefully find it.
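
To make the idea of searching for a repeat in windows across a sequence concrete, here is a minimal Rust sketch. It is not taken from tidk: the function names, the fixed non-overlapping windows, and the choice to count the reverse complement separately are all assumptions made for illustration. Matches that straddle a window boundary are not counted in this simplified version.

```rust
/// Reverse-complement an ASCII DNA string (non-ACGT bytes become 'N').
fn revcomp(s: &str) -> String {
    s.bytes()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => 'T',
            b'C' => 'G',
            b'G' => 'C',
            b'T' => 'A',
            _ => 'N',
        })
        .collect()
}

/// Count (possibly overlapping) occurrences of `pat` in `text`.
fn count_occurrences(text: &str, pat: &str) -> usize {
    if pat.is_empty() || text.len() < pat.len() {
        return 0;
    }
    text.as_bytes()
        .windows(pat.len())
        .filter(|w| *w == pat.as_bytes())
        .count()
}

/// For each non-overlapping window of `window` bases, return
/// (window start, forward-strand count, reverse-complement count).
/// Hypothetical helper; the window size is an illustrative parameter.
fn windowed_counts(seq: &str, repeat: &str, window: usize) -> Vec<(usize, usize, usize)> {
    let mut out = Vec::new();
    if window == 0 {
        return out;
    }
    let rc = revcomp(repeat);
    let mut start = 0;
    while start < seq.len() {
        let end = (start + window).min(seq.len());
        let chunk = &seq[start..end];
        out.push((start, count_occurrences(chunk, repeat), count_occurrences(chunk, &rc)));
        start = end;
    }
    out
}

fn main() {
    // Toy sequence: forward repeats at the start, reverse-complement repeats at the end.
    let seq = "TTAGGTTAGGACGTACGTACCCTAACCTAA";
    for (start, fwd, rev) in windowed_counts(seq, "TTAGG", 10) {
        println!("{start}\t{fwd}\t{rev}");
    }
}
```

In the toy sequence the forward repeat is concentrated in the first window and its reverse complement in the last, which is the kind of pattern expected at the two ends of a chromosome.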
```
tidk-search 0.1.5
Search the input genome with a specific telomeric repeat search string.

USAGE:
    tidk search [OPTIONS] --fasta

OPTIONS:
    -e, --extension
```
`tidk plot` will plot a CSV from the output of `tidk search`. Plotting for `tidk find` (i.e. extending to multiple telomeric repeat sequences in the same CSV) is in progress.
```
tidk-plot 0.1.5
SVG plot of CSV generated from search or find.

USAGE:
    tidk plot [OPTIONS] --csv

OPTIONS:
    -c, --csv
```
As an example on the ol' Square Spot Rustic Xestia xanthographa:
```bash
tidk find -f fastas/ilXesXant11.20201023.curatedprimary.fa -c lepidoptera -o Xes

tidk plot -c finder/Xestelomericrepeat_windows.csv -o ilXes -h 120 -w 800
```
`tidk min` returns the lexicographically minimal DNA string given an input. This is useful for comparing repeats in a uniform way with this tool. Surprisingly, it works quickly even on chromosome-length DNA strings, not that you would want to do that...
Examples:
- `tidk min AATGCG`, or process multiple strings with `tidk min AATGCG AAGGTTC GGTTAAT`.
- `tidk min -f input.fa` processes an input fasta, and outputs fasta. Otherwise `tidk min -f input.txt` reads lines and outputs.
- `echo "AATTGC" | tidk min`: pipes work.
- `cat input.fasta | tidk min -x` outputs a fasta. Otherwise `cat input.txt | tidk min` reads lines and outputs. Note that `cat` here is redundant and creates extra work; it just shows the piping in action.
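
For readers unfamiliar with the idea of a canonical, lexicographically minimal DNA string, the following Rust sketch shows one way to compute it: take the smallest rotation of the input and of its reverse complement, and return the smaller of the two. It assumes the same definition given for `tidk explore` (all rotations, forward and reverse complement) and is not tidk's actual code. The function names are hypothetical, and the naive all-rotations comparison is chosen for clarity rather than speed.

```rust
/// Reverse-complement an ASCII DNA string (non-ACGT bytes become 'N').
fn revcomp(s: &str) -> String {
    s.bytes()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => 'T',
            b'C' => 'G',
            b'G' => 'C',
            b'T' => 'A',
            _ => 'N',
        })
        .collect()
}

/// Lexicographically smallest rotation of `s`, found by comparing all rotations.
fn min_rotation(s: &str) -> String {
    let n = s.len();
    // Every rotation of s is a length-n slice of s concatenated with itself.
    let doubled = format!("{s}{s}");
    (0..n)
        .map(|i| &doubled[i..i + n])
        .min()
        .unwrap_or("")
        .to_string()
}

/// Canonical form: the smaller of the minimal rotations of the forward
/// string and of its reverse complement (assumed definition, see lead-in).
fn canonical(s: &str) -> String {
    let upper = s.to_ascii_uppercase();
    let fwd = min_rotation(&upper);
    let rev = min_rotation(&revcomp(&upper));
    fwd.min(rev)
}

fn main() {
    for repeat in ["TTAGGG", "GGTTAG", "CCCTAA"] {
        // All three are rotations or reverse complements of the same unit.
        println!("{repeat} -> {}", canonical(repeat));
    }
}
```

With this definition, TTAGGG, GGTTAG and CCCTAA all reduce to the same string, so repeats reported in different rotations or on different strands can be compared directly.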
```
tidk-min 0.1.5
Emit the canonical lexicographically minimal DNA string.

USAGE:
    tidk min [OPTIONS] [DNA string]...

ARGS:

OPTIONS:
    -f, --file
```
`tidk trim` is a Rust port of https://github.com/pgonzale60/telomeric-trim. Thanks Pablo!
```
tidk-trim 0.1.5
Trim a specific telomeric repeat from the input reads and yield reads oriented at the telomere start.

USAGE:
    tidk trim [OPTIONS] --fasta

OPTIONS:
    -f, --fasta
```