A Rust classifier based on ProbMinHash and HNSW for microbial genomes

ARCHAEA stands for: A Rust Classifier based on Hierarchical Navigable SW graphs, et al. It was later renamed GSearch, which stands for Genomic Search.

This package (currently in development) computes ProbMinHash signatures of bacterial and archaeal (or viral and fungal) genomes. It stores the id and ProbMinHash signature of each genome in an Hnsw structure, which is then searched for the nearest neighbours of new request genomes.

This package is developed by Jean-Pierre Both (https://github.com/jean-pierreBoth) for the software part and Jianshu Zhao (https://github.com/jianshu93) for the genomics part.

Sketching of genomes/tohnsw

Sketching and database construction are done by the module tohnsw.

The sketching of reference genomes can take some time (one or two hours for 50000 bacterial genomes from NCBI, with parameters giving a correct quality of sketching). The result is stored in two structures:

- A Hnsw structure storing the rank of the data processed and the corresponding sketches.
- A Dictionary associating each rank to a fasta id and fasta filename.

The Hnsw structure is dumped in hnswdump.hnsw.graph and hnswdump.hnsw.data. The Dictionary is dumped in a json file, seqdict.json.
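As a quick sanity check before running requests, one might verify that a database directory contains the three dump files named above. The sketch below is not part of the package: the function name `check_gsearch_db` is hypothetical, and it assumes all three files sit directly in the directory passed as its argument.

```bash
# Hypothetical helper (not part of the package): check that a directory
# holds the three files that tohnsw is described as dumping.
check_gsearch_db() {
    local dir="$1"
    local missing=0
    local f
    for f in hnswdump.hnsw.graph hnswdump.hnsw.data seqdict.json; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "database looks complete"
    fi
    # Exit status is 0 when all files are present, 1 otherwise.
    return "$missing"
}
```

Called as `check_gsearch_db ./some_db_dir`, it prints one `missing: ...` line per absent file and returns a non-zero status if any file is missing.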

Requests

Requests are handled by the module request. It reloads the dumped hnsw and seqdict files, takes a list of fasta files containing requests and, for each fasta file, dumps the requested number of nearest neighbours.

Usage

```bash
# build database given genome file directory, fna.gz is expected. L for nt and .faa or .faa.gz for --aa.
# Limit for k is 32 (15 does not work due to compression), for s is 65535 (u16) and for n is 255 (u8)
tohnsw -d dbdirnt -s 12000 -k 16 --ef 1600 -n 128
tohnsw -d dbdiraa -s 12000 -k 7 --ef 1600 -n 128 --aa

# request neighbours for each genome (fna, fasta, faa etc. are supported) in querydirnt or aa using pre-built database:
wget http://enve-omics.ce.gatech.edu/data/publicgsearch/GTDBr207hnswgraph.tar.gz
tar xzvf ./GTDBr207hnswgraph.tar.gz
cd ./GTDBr207hnswgraph/nucl

# request neighbors for nt genomes
request -b ./ -d querydirnt -n 50

# request neighbors for aa genomes (predicted by Prodigal or FragGeneScanRs)
cd ./GTDBr207hnswgraph/prot
request -b ./ -d querydir_aa -n 50 --aa

# request neighbors for aa universal gene (extracted by hmmer according to hmm files provided)
cd ./GTDBr207hnswgraph/universal
request -b ./ -d querydiruniversalaa -n 50 --aa
```
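
The limits quoted in the usage comments can be checked before launching a long sketching run. The following is only a sketch: `check_tohnsw_params` is a hypothetical helper, not a command shipped with the package, and the lower bound of 1 for each value is an assumption, since the text only states upper limits.

```bash
# Hypothetical pre-flight check for the limits quoted in the usage comments:
# k <= 32 and k != 15, s <= 65535 (u16), n <= 255 (u8).
# The lower bound of 1 is an assumption, not stated in the documentation.
check_tohnsw_params() {
    local k="$1" s="$2" n="$3"
    local v
    # Reject empty or non-numeric arguments first.
    for v in "$k" "$s" "$n"; do
        case "$v" in
            ''|*[!0-9]*) echo "not an integer: $v"; return 1 ;;
        esac
    done
    if [ "$k" -lt 1 ] || [ "$k" -gt 32 ] || [ "$k" -eq 15 ]; then
        echo "invalid k: $k"
        return 1
    fi
    if [ "$s" -lt 1 ] || [ "$s" -gt 65535 ]; then
        echo "invalid s: $s"
        return 1
    fi
    if [ "$n" -lt 1 ] || [ "$n" -gt 255 ]; then
        echo "invalid n: $n"
        return 1
    fi
    echo "ok"
}
```

It could guard an invocation, for example `check_tohnsw_params 16 12000 128 && tohnsw -d dbdirnt -s 12000 -k 16 --ef 1600 -n 128`.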

Dependencies, features and Installation

features

Simple case for install:

Pre-built binaries are available on the release page (https://github.com/jean-pierreBoth/archaea/releases/tag/v1.0) for major platforms. If you want to install or compile it yourself:

```bash
# A simple installation, with annembed enabled would be:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install archaea --features="annembed_intel-mkl"

# on MacOS, which requires dynamic library link:
cargo build --release --features="annembed_openblas-system"

# or on intel using openblas instead of intel-mkl:
cargo build --release --features="annembed_openblas-system" --features="hnsw_rs/simdeez_f"

# Then install FragGeneScanRs:
cargo install --git https://gitlab.com/Jianshu_Zhao/fraggenescanrs
```
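
After installation it can be useful to confirm that the binaries used in the Usage section are reachable on `PATH`. The sketch below assumes the installed executables are named `tohnsw` and `request`, as in the usage examples; `missing_tools` is a hypothetical helper name, not something the package provides.

```bash
# Hypothetical helper: print each named command that is not found on PATH.
# Returns 0 when every command is found, 1 otherwise.
missing_tools() {
    local tool
    local rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `missing_tools tohnsw request` prints nothing when both commands are found, and otherwise lists the ones that still need to be installed or added to `PATH`.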

Alternatively it is possible to modify the features section in Cargo.toml. Just fill in the default you want.

Some hints in case of problems (including installing/compiling on ARM CPUs) are given here

Pre-built databases

We provide pre-built genome/proteome database graph files for bacteria/archaea, viruses and fungi. Proteome databases are based on the genes of each genome, predicted by FragGeneScanRs (https://gitlab.com/JianshuZhao/fraggenescanrs) for bacteria/archaea/viruses and by GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/licensedownload.cgi) for fungi.