gfatk

A command line utility to explore, extract, and linearise plant mitochondrial assemblies. The Graphical Fragment Assembly (GFA) files used to refine the code in this repository were almost exclusively generated by the assembly program <a href="https://github.com/maickrau/MBG">MBG</a>. See the testing section below for caveats.

Install

Grab from the releases (Mac & Linux only):

```bash
# for mac
curl -L "https://github.com/tolkit/gfatk/releases/download/0.2.2/gfatkmac0.2.2" > gfatk && chmod +x gfatk

# and linux (ubuntu)
curl -L "https://github.com/tolkit/gfatk/releases/download/0.2.2/gfatkubuntu0.2.2" > gfatk && chmod +x gfatk
```

Or build from source.

```bash
# e.g. get rustup!
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# get directly from crates.io
# currently this is the latest available version 0.2.3XX
cargo install gfatk

# or clone this repo!
git clone https://github.com/tolkit/gfatk
# cd!
cd gfatk
# build!
cargo build --release
# or install into your path!
cargo install --path .
```

Features

The features of the toolkit reflect only their usefulness in debugging, visualising, and linearising GFAs, especially those of plant mitochondrial genome assemblies output by <a href="https://github.com/maickrau/MBG">MBG</a>. These genomes are usually small (up to 2 Mb), and in many cases have circular or branching paths.

Current help:

```
Explore and linearise (plant organellar) GFA files.

Usage: gfatk [COMMAND]

Commands:
  overlap         Extract overlaps from a GFA.
  extract         Extract subgraph from a GFA, given a segment name.
  linear          Force a linear representation of the graph.
  fasta           Extract a fasta file. Almost as simple as: awk '/^S/{print ">"$2"\n"$3}'.
  stats           Some stats about the input GFA.
  extract-mito    Extract the mitochondria from a GFA.
  extract-chloro  Extract the plastid from a GFA.
  dot             Return the dot representation of a GFA.
  trim            Trim a GFA to remove nodes of degree < 4 (i.e. only has one neighbour).
  path            Supply an input path to evaluate a linear representation of. Input must be a text file of a single comma separated line with node ID's and
                  orientations. E.g. 1+,2-,3+
  rename          Rename the segment ID's of a GFA.
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
```
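The `fasta` entry in the help text compares the subcommand to an awk one-liner. As a rough illustration of what that one-liner does (not of how gfatk itself is implemented), the sketch below runs it on a small GFA written to a temporary file. The function name `segments_to_fasta` and the demo file are hypothetical, introduced only for this example.

```bash
# Hypothetical demo input: a header and three tab-separated segment (S) lines
gfa=$(mktemp)
printf 'H\tVN:Z:1.0\n'                 >  "$gfa"
printf 'S\t11\tACCTT\tll:f:30.0\n'     >> "$gfa"
printf 'S\t12\tTCAAGG\tll:f:60.0\n'    >> "$gfa"
printf 'S\t13\tCTTGATT\tll:f:30.0\n'   >> "$gfa"

# The one-liner quoted in the help text, wrapped in a function:
# for every segment line, print the name as a FASTA header, then the sequence
segments_to_fasta() {
    awk '/^S/{print ">"$2"\n"$3}' "$1"
}

segments_to_fasta "$gfa"
# >11
# ACCTT
# >12
# TCAAGG
# >13
# CTTGATT

rm -f "$gfa"
```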

The help text above describes each subcommand only briefly and does not show all of its options. Run `gfatk help <subcommand>` for more information on each subcommand.

Many of these commands can be chained in a pipeline, e.g. `gfatk extract-chloro in.gfa | gfatk linear > out.fa`.
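The `path` entry in the help text describes its input as a text file holding a single comma-separated line of node IDs with orientations, e.g. `1+,2-,3+`. A minimal sketch of a pre-flight check for such a file follows. `check_path_file` is a hypothetical helper, not part of gfatk, and its rules (exactly one line, every element ending in `+` or `-`) are an assumption based only on that description.

```bash
# Hypothetical helper: succeed only if FILE holds exactly one line of
# comma-separated <node ID><orientation> elements, e.g. 1+,2-,3+
check_path_file() {
    awk -F',' '
        NR == 1 {
            # an empty line has no elements at all
            if (NF == 0) bad = 1
            # each element needs an ID followed by + or -
            for (i = 1; i <= NF; i++)
                if ($i !~ /^.+[+-]$/) bad = 1
        }
        # anything beyond the first line breaks the single-line rule
        NR > 1 { bad = 1 }
        END { exit ((NR != 1 || bad) ? 1 : 0) }
    ' "$1"
}

# Demo on a temporary file containing the example path from the help text
path_file=$(mktemp)
printf '1+,2-,3+\n' > "$path_file"
if check_path_file "$path_file"; then
    echo "path file looks well-formed"
fi
rm -f "$path_file"
```

The check is deliberately loose about what a node ID may contain, since segment names vary between assemblers.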

Examples and docs

A couple of more detailed examples can be seen in the `examples` directory, which has its own README.md file. To view the auto-generated documentation of the binary itself, including details of all underlying functions, see:

API documentation

Requirements and testing

Some unit tests are now provided in the tests directory. To run these (you'll need Rust):

`cargo test --release`

For full functionality of the toolkit, two tags are required: node coverage and edge coverage. Other functionality will fail if the CIGAR string is not purely an overlap, i.e. in the format `<integer>M`. Only GFA version 1 is supported. Only header (H), segment (S), and link (L) lines are required. P lines are used in `gfatk path --all <GFA>`.

```
H	VN:Z:1.0
S	11	ACCTT	ll:f:30.0 <- this tag indicates node/segment coverage (here it's 30.0)
S	12	TCAAGG	ll:f:60.0
S	13	CTTGATT	ll:f:30.0
L	11	+	12	-	4M	ec:i:1 <- this tag indicates edge coverage (here it's 1)
L	12	-	13	+	5M	ec:i:1
L	11	+	13	+	3M	ec:i:1
L	12	+	11	-	4M	ec:i:1
L	13	-	12	+	5M	ec:i:1
L	13	-	11	-	3M <- simple overlap on the CIGAR string (overlap == 3)	ec:i:1
```
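The requirements above can be checked before running the toolkit. The sketch below is a hypothetical pre-flight check, not part of gfatk. It assumes tab-separated GFA version 1 lines and takes the tag names `ll:f` (node coverage) and `ec:i` (edge coverage) from the example above, so adjust them if your assembler writes different tags.

```bash
# Hypothetical helper: print one message per line that misses a requirement,
# and exit non-zero if any problem was found
check_gfa() {
    awk -F'\t' '
        # Segment lines: optional tags start at field 4
        $1 == "S" {
            found = 0
            for (i = 4; i <= NF; i++) if ($i ~ /^ll:f:/) found = 1
            if (!found) { print "line " NR ": segment " $2 " has no ll:f tag"; bad = 1 }
        }
        # Link lines: field 6 is the CIGAR, optional tags start at field 7
        $1 == "L" {
            if ($6 !~ /^[0-9]+M$/) { print "line " NR ": CIGAR " $6 " is not <integer>M"; bad = 1 }
            found = 0
            for (i = 7; i <= NF; i++) if ($i ~ /^ec:i:/) found = 1
            if (!found) { print "line " NR ": link has no ec:i tag"; bad = 1 }
        }
        END { exit (bad + 0) }
    ' "$1"
}

# Demo on a temporary GFA with two deliberate problems
gfa=$(mktemp)
printf 'S\t11\tACCTT\tll:f:30.0\n'         >  "$gfa"
printf 'S\t12\tTCAAGG\n'                   >> "$gfa"   # no coverage tag
printf 'L\t11\t+\t12\t-\t2M1I\tec:i:1\n'   >> "$gfa"   # CIGAR is not a pure overlap
check_gfa "$gfa" || echo "GFA needs fixing before use"
rm -f "$gfa"
```

Run it as `check_gfa in.gfa`. The `<-` annotations in the example above are explanatory only and would need removing before the file passes this check.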

Thanks

Many thanks to the developers of MBG, and partners in the Tree of Life program, and beyond: