ARCHAEA stands for: A Rust Classifier based on Hierarchical Navigable SW graphs, et al.
This package (currently in development) computes ProbMinHash signatures of bacterial and archaeal genomes and stores the genome ids and their signatures in an Hnsw structure, so that new query genomes can be searched against them.
This package is developed by Jean-Pierre Both (https://github.com/jean-pierreBoth) for the software part and Jianshu Zhao (https://github.com/jianshu93) for the genomics part.
Pre-compiled binaries are available on the release page for major platforms.
Install via conda (recommended):
On a Linux server where you do not have sudo privileges (install miniconda3 first):
conda activate
conda install zeromq
Install rustup on Linux:
conda install -c milesgranger rustup
Install rustup on macOS:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default nightly
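Set the zeromq library paths to your miniconda installation path (derived here from the location of the conda executable; change it if yours differs):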
a=$(which conda)
LIBZMQ_LIB_DIR=${a%/*/*}/lib LIBZMQ_INCLUDE_DIR=${a%/*/*}/include cargo install archaea
cargo install --git https://gitlab.com/Jianshu_Zhao/fraggenescanrs
conda install hmmer
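The `${a%/*/*}` expansion in the install command above may look cryptic. It removes the last two path components from the conda executable path, which leaves the installation prefix. A minimal sketch, assuming a hypothetical path `/home/user/miniconda3/bin/conda` (on a real system the value comes from `which conda`):

```bash
# Hypothetical path for illustration; normally set with: a=$(which conda)
a=/home/user/miniconda3/bin/conda
# ${a%/*/*} strips the shortest suffix matching "/*/*", here "/bin/conda"
prefix=${a%/*/*}
echo "library dir: $prefix/lib"
echo "include dir: $prefix/include"
```

If conda is installed somewhere unusual (for example behind a shell function or wrapper script), print the two directories first and check that they contain the zeromq library and headers before running cargo.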
Install from source:
hnsw-rs, probminhash and kmerutils can be retrieved from crates.io or cloned with:
git clone https://github.com/jean-pierreBoth/hnswlib-rs
git clone https://github.com/jean-pierreBoth/probminhash
git clone https://github.com/jean-pierreBoth/kmerutils
An optional dependency is provided as a feature: the crate annembed, which gives some statistics on the constructed Hnsw graph (and will provide some visualization of the data).
It is activated by the feature annembed_f.
By default archaea uses annembed with openblas-static (which compiles and statically links OpenBLAS), but you can change this in Cargo.toml to intel-mkl-static (which downloads intel-mkl-src for you).
openblas-static requires gcc, gfortran and make.
annembed can be downloaded from crates.io or cloned with git clone https://github.com/jean-pierreBoth/annembed.
Clone ARCHAEA, which is not yet on crates.io:
git clone https://github.com/jean-pierreBoth/archaea
Three libraries are required to compile successfully: zeromq, libsodium and openblas (the last one only for the annembed_f feature).
```bash
# Debian/Ubuntu
sudo apt-get install libzmq-dev libsodium-dev libopenblas-dev
# macOS
brew install zeromq libsodium
brew install openblas
cd archaea
# build with annembed support
cargo build --release --features annembed_f
# or build without it
cargo build --release
# if zeromq was installed through miniconda, point the build at it
LIBZMQ_LIB_DIR=~/miniconda3/lib LIBZMQ_INCLUDE_DIR=~/miniconda3/include cargo build --release --features annembed_f
LIBZMQ_LIB_DIR=~/miniconda3/lib LIBZMQ_INCLUDE_DIR=~/miniconda3/include cargo build --release
```
Nightly Rust must be used:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default nightly
```
Sketching and database construction are done by the module tohnsw.
The sketching of reference genomes can take some time (one or two hours for 50000 bacterial genomes from NCBI, with parameters giving a correct quality of sketching). The result is stored in 2 structures:
- An Hnsw structure storing the rank of each data item processed and the corresponding sketch.
- A dictionary associating each rank to a fasta id and fasta filename.

The Hnsw structure is dumped in hnswdump.hnsw.graph and hnswdump.hnsw.data. The dictionary is dumped in a json file, seqdict.json.
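Since a request needs all three dump files, it can be useful to confirm they are present before querying. The helper below is a sketch only: `check_dump` is a hypothetical name, not part of the package, and it only tests that the files named above exist in a directory, not that their contents are valid.

```bash
# Hypothetical helper (not part of ARCHAEA): verify that a directory
# contains the three files written by tohnsw.
check_dump() {
    dir=$1
    for f in hnswdump.hnsw.graph hnswdump.hnsw.data seqdict.json; do
        if [ ! -f "$dir/$f" ]; then
            # report the first missing file and signal failure
            echo "missing: $dir/$f" >&2
            return 1
        fi
    done
    echo "all dump files present in $dir"
}
```

Calling `check_dump ./` in the directory where tohnsw was run prints a confirmation, or names the first missing file and returns a non-zero status.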
For requests, the module request is used. It reloads the dumped files (the hnsw and seqdict files), takes a list of fasta files containing the requests and, for each fasta file, dumps the requested number of nearest neighbours.
The last step involves a homology search using hmmer, which can be installed directly using conda or brew. If you are using the Apple M1 (ARM/aarch64) architecture, this is how you can get native support for hmmer:
```bash
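# clone the h3-arm branch of hmmer into a directory named h3-arm
git clone -b h3-arm https://github.com/EddyRivasLab/hmmer h3-arm
cd h3-arm
# hmmer builds against easel (develop branch), cloned inside the hmmer tree
git clone -b develop https://github.com/EddyRivasLab/easel
autoconf
./configure
make -j 8
sudo make install
hmmsearch -h
```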
```bash
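# build the database from nucleotide genomes, then from amino acid sequences
tohnsw -d dbdirnt -s 12000 -k 21 --ef 1600 -n 128
tohnsw -d dbdiraa -s 24000 -k 7 --ef 1600 -n 128 --aa
# query the database dumped in the current directory
request -b ./ -d querydirnt -n 50
request -b ./ -d querydiraa -n 50 --aa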
```
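The nucleotide and amino acid runs above differ only in a few options, so a small wrapper can keep the two parameter sets together. The sketch below is an assumption-laden illustration: `tohnsw_cmd` is a hypothetical function, the option values are copied from the example above rather than being recommended defaults, and it only prints the command instead of running it.

```bash
# Hypothetical wrapper (not part of ARCHAEA): print the tohnsw command
# for a sequence type ("nt" or "aa") and a database directory.
# Option values are those of the example above, not tuned defaults.
tohnsw_cmd() {
    case "$1" in
        nt) echo "tohnsw -d $2 -s 12000 -k 21 --ef 1600 -n 128" ;;
        aa) echo "tohnsw -d $2 -s 24000 -k 7 --ef 1600 -n 128 --aa" ;;
        *)  echo "unknown sequence type: $1" >&2; return 1 ;;
    esac
}

tohnsw_cmd nt dbdirnt
tohnsw_cmd aa dbdiraa
```

Printing first makes it easy to review the exact command line before launching a run that may take hours.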
We provide pre-built genome/proteome database graph files for bacteria/archaea, viruses and fungi. The proteome databases are based on the genes of each genome, predicted by FragGeneScanRs (https://gitlab.com/Jianshu_Zhao/fraggenescanrs) for bacteria/archaea/viruses and by GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/licensedownload.cgi) for fungi. The bacteria/archaea genomes come from the newest version of the GTDB database (https://gtdb.ecogenomic.org), which defines a bacterial species at 95% ANI. Note that ARCHAEA can also be run on species databases of even higher resolution, such as 99% ANI. The virus database is based on the newest version of the JGI IMG/VR database (https://genome.jgi.doe.gov/portal/IMGVR/IMGVR.home.html), which also defines a virus OTU (vOTU) at 95% ANI. The fungi database is based on the entire set of RefSeq fungal genomes, which we dereplicated, defining a fungal species at 99% ANI. All three pre-built databases are available here: