ARCHAEA stands for: A Rust Classifier based on Hierarchical Navigable SW graphs, et al.
This package (currently in development) computes ProbMinHash signatures of bacterial and archaeal genomes and stores each genome id together with its signature in an Hnsw structure, which is then searched for the nearest neighbours of new request genomes.
This package is developed by Jean-Pierre Both (https://github.com/jean-pierreBoth) for the software part and Jianshu Zhao (https://github.com/jianshu93) for the genomics part.
Pre-compiled binaries are available on the release page for major platforms. On Linux-based systems there are no other dependencies, but the system-level gfortran must be at least gfortran@5, which means gcc 8.3 or above. If you want to compile from source, see below:
Clone hnsw-rs and probminhash, or get them from crates.io:
git clone https://github.com/jean-pierreBoth/hnswlib-rs
git clone https://github.com/jean-pierreBoth/probminhash
Clone kmerutils, which is not on crates.io:
Clone ARCHAEA, which is not on crates.io:
Clone annembed:
Three libraries are required to compile successfully: zeromq, libsodium and openblas (openblas is needed only for the optional annembed_f feature).
```bash
# Linux (Debian/Ubuntu)
sudo apt-get install libzmq-dev libsodium-dev libopenblas-dev
# macOS (Homebrew)
brew install zeromq
brew install libsodium
brew install openblas
cd archaea
# build with the optional annembed_f feature (needs openblas), or without it
cargo build --release --features annembed_f
cargo build --release
# if zeromq is installed under a conda prefix, point the build at it
LIBZMQLIBDIR=~/miniconda3/lib LIBZMQINCLUDEDIR=~/miniconda3/include cargo build --release --features annembed_f
LIBZMQLIBDIR=~/miniconda3/lib LIBZMQINCLUDEDIR=~/miniconda3/include cargo build --release
```
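Before running the build commands above, it can help to confirm that the three libraries are visible on the system. The sketch below assumes that `pkg-config` is installed and that the libraries register under the module names `libzmq`, `libsodium` and `openblas`; both points are assumptions, not requirements of this package. A library installed where `pkg-config` does not look (for example a conda prefix) will be reported as missing even though it is present.

```bash
# Print every pkg-config module from the argument list that cannot be found.
# The module names used below are assumptions; adjust them to your system.
missing_libs() {
    for lib in "$@"; do
        pkg-config --exists "$lib" 2>/dev/null || printf '%s\n' "$lib"
    done
}

missing=$(missing_libs libzmq libsodium openblas)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line
    echo "not found by pkg-config:" $missing
else
    echo "zeromq, libsodium and openblas were all found"
fi
```

If only `openblas` is reported, the build without `--features annembed_f` should still be possible, since openblas is needed only for that feature.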
Nightly Rust must be used:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default nightly
```
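To confirm that the nightly toolchain is the one actually in use, the output of `rustc --version` can be inspected. This is a minimal sketch: the helper name `is_nightly` is hypothetical, and it relies on nightly builds carrying a `-nightly` suffix in their version string.

```bash
# Succeeds when a `rustc --version` string names a nightly toolchain,
# e.g. "rustc 1.78.0-nightly (<hash> <date>)".
is_nightly() {
    case "$1" in
        *-nightly*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query rustc when it is installed.
if command -v rustc >/dev/null 2>&1; then
    if is_nightly "$(rustc --version)"; then
        echo "nightly toolchain is active"
    else
        echo "active toolchain is not nightly; run: rustup default nightly"
    fi
fi
```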
Sketching and database construction are done by the tohnsw module.
Sketching the reference genomes can take some time (one to two hours for 50000 bacterial genomes from NCBI, with parameters that give a correct sketching quality). The result is stored in two structures:
- An Hnsw structure storing the rank of the data processed and the corresponding sketches.
- A Dictionary associating each rank with a fasta id and a fasta filename.

The Hnsw structure is dumped in hnswdump.hnsw.graph and hnswdump.hnsw.data. The Dictionary is dumped in a json file, seqdict.json.
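Since later steps depend on all three dump files, a quick check that a directory contains them can catch an interrupted run. The following is a sketch only: the helper name `check_db_dir` is hypothetical, and it tests just that each file exists and is non-empty, not that its contents are valid.

```bash
# Report whether a directory holds the three files named above.
# Returns 0 when all are present and non-empty, 1 otherwise.
check_db_dir() {
    dir=$1
    rc=0
    for f in hnswdump.hnsw.graph hnswdump.hnsw.data seqdict.json; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the current directory.
check_db_dir . || echo "database in . is incomplete"
```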
Requests are handled by the request module. It reloads the dumped hnsw and seqdict files, takes a list of fasta files containing the requests and, for each fasta file, dumps the requested number of nearest neighbours.
The classify module is used to assign taxonomy information from the requested neighbours to the query genomes. Average nucleotide identity (ANI) will be calculated.
```bash
# build HMMER from its h3-arm branch, with easel from its develop branch
git clone -b h3-arm https://github.com/EddyRivasLab/hmmer h3-arm
cd h3-arm
git clone -b develop https://github.com/EddyRivasLab/easel
autoconf
./configure
make -j 8
sudo make install
hmmsearch -h
```
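```bash
# build the nucleotide database, then the amino acid database (--aa)
tohnsw -d dbdirnt -s 12000 -k 21 --ef 1600 -n 128
tohnsw -d dbdiraa -s 24000 -k 7 --ef 1600 -n 128 --aa
# request neighbours for the query genomes
request -b ./ -d querydirnt -n 50
request -b ./ -d querydiraa -n 50 --aa
```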
We provide pre-built genome/proteome database graph files for bacteria/archaea, viruses and fungi. The proteome databases are based on the genes of each genome, predicted by Prodigal (https://github.com/hyattpd/Prodigal) for bacteria/archaea/viruses and by GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/licensedownload.cgi) for fungi. The bacteria/archaea genomes come from the newest version of the GTDB database (https://gtdb.ecogenomic.org), which defines a bacterial species at 95% ANI. Note that ARCHAEA can also be run on higher-resolution species databases, for example at 99% ANI. The virus database is based on the newest version of the JGI IMG/VR database (https://genome.jgi.doe.gov/portal/IMGVR/IMG_VR.home.html), which also defines a virus OTU (vOTU) at 95% ANI. The fungi database is based on the entire set of RefSeq fungal genomes, which we dereplicated, defining a fungal species at 99% ANI. All three pre-built databases are available here: