By David Cook Wildlife Photography - originally posted to Flickr as Galah (Eolophus roseicapillus), CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=8388233
Galah aims to be a more scalable metagenome-assembled genome (MAG) dereplication method. That is, it clusters microbial genomes by their average nucleotide identity (ANI) and chooses a single member of each cluster as the representative.
Galah uses a greedy clustering approach to speed up genome dereplication relative to tools such as dRep, particularly when there are many closely related genomes (i.e. >95% ANI). The generated cluster representatives have two properties. If the ANI threshold is set to 99%, then:
If CheckM genome qualities were specified, then the clusters have an additional property:
If CheckM qualities are not used, then:
The overall greedy clustering approach was largely inspired by the work of Donovan Parks, as described in this publication.
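To make the idea of greedy clustering concrete, here is a minimal Python sketch. It is not Galah's implementation: the function name `greedy_dereplicate`, the `ani` callable and the toy ANI values are all hypothetical, and it assumes the genomes are visited in some priority order (for example best quality first, or simply input order) and that a genome joins the first representative it matches.

```python
from typing import Callable, Dict, List, Sequence


def greedy_dereplicate(
    genomes: Sequence[str],
    ani: Callable[[str, str], float],
    threshold: float = 99.0,
) -> Dict[str, List[str]]:
    """Map each representative to the members of its cluster.

    Assumption: `genomes` is already sorted so that preferred genomes
    come first; the first genome of each cluster is its representative.
    """
    clusters: Dict[str, List[str]] = {}
    for genome in genomes:
        for rep in clusters:
            # Join the first representative that is close enough.
            if ani(rep, genome) >= threshold:
                clusters[rep].append(genome)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters


# Hypothetical pairwise ANI values, for illustration only.
toy_ani = {
    frozenset(("a", "b")): 99.5,
    frozenset(("a", "c")): 96.0,
    frozenset(("b", "c")): 96.2,
}


def lookup_ani(x: str, y: str) -> float:
    return toy_ani[frozenset((x, y))]


clusters = greedy_dereplicate(["a", "b", "c"], lookup_ani)
print(clusters)
```

In this sketch each genome is only ever compared against the current representatives, never against other cluster members, which is where a greedy scheme saves comparisons when many genomes are closely related.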
Galah is not currently available on bioconda, though it can be (or soon will be) installed and used indirectly through CoverM, which is available on bioconda.
Currently Galah can only be installed following the development instructions below. Hopefully soon it will be available on crates.io.
To run an unreleased version of Galah, after installing Rust:
git clone https://github.com/wwood/galah
cd galah
cargo run -- cluster ...etc...
Galah relies on these third-party tools, which must be installed separately.
For clustering a set of genomes at 99% ANI:
galah cluster --genome-fasta-files /path/to/genome1.fna /path/to/genome2.fna >clusters
There are several other options for specifying genomes, ANI cutoffs, etc. See
galah cluster --help
for more information.
Similar to dRep, Galah operates in two stages. In the first, a fast pre-clustering ANI estimate is calculated with dashing between each pair of genomes. In the second, FastANI is run only on genome pairs whose pre-clustering ANI is greater than the pre-clustering threshold, since only those pairs are considered as potentially belonging to the same cluster. By default, the pre-clustering ANI threshold is 95% and the final ANI threshold is 99%.
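The two-stage screening can be sketched in Python as follows. This is an illustration under stated assumptions, not Galah's code: `fast_ani_estimate` and `precise_ani` are hypothetical stand-ins for the dashing and FastANI steps, the toy values are invented, and treating the final threshold as inclusive is an assumption.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def two_stage_pairs(
    genomes: Sequence[str],
    fast_ani_estimate: Callable[[str, str], float],
    precise_ani: Callable[[str, str], float],
    precluster_threshold: float = 95.0,
    final_threshold: float = 99.0,
) -> Tuple[List[Tuple[str, str]], int]:
    """Return the pairs passing both stages and the number of precise comparisons made."""
    confirmed: List[Tuple[str, str]] = []
    precise_calls = 0
    for a, b in combinations(genomes, 2):
        # Stage 1: cheap estimate; skip pairs not above the pre-clustering threshold.
        if fast_ani_estimate(a, b) <= precluster_threshold:
            continue
        # Stage 2: the expensive comparison runs only on surviving pairs.
        precise_calls += 1
        if precise_ani(a, b) >= final_threshold:
            confirmed.append((a, b))
    return confirmed, precise_calls


# Hypothetical values, for illustration only.
estimates = {("g1", "g2"): 98.7, ("g1", "g3"): 90.0, ("g2", "g3"): 96.0}
# ("g1", "g3") is deliberately absent: stage 1 must filter it out.
precise = {("g1", "g2"): 99.3, ("g2", "g3"): 97.1}

confirmed, precise_calls = two_stage_pairs(
    ["g1", "g2", "g3"],
    lambda a, b: estimates[(a, b)],
    lambda a, b: precise[(a, b)],
)
print(confirmed, precise_calls)
```

Here three pairs exist but only two reach the precise comparison, and only one of those meets the final threshold.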
Galah is made available under GPL3+. See LICENSE.txt for details. Copyright Ben Woodcroft.
Developed by Ben Woodcroft at the Australian Centre for Ecogenomics.