ARCHAEA stands for A Rust Classifier based on Hierarchical Navigable SW graphs, et al. It was later renamed GSearch, which stands for Genomic Search.
This package (currently in development) computes the probminhash signature of bacterial and archaeal (or viral and fungal) genomes and stores each genome's id and probminhash signature in a Hnsw structure, which is then searched for the nearest neighbours of new request genomes.
This package is developed by Jean-Pierre Both (https://github.com/jean-pierreBoth) for the software part and Jianshu Zhao (https://github.com/jianshu93) for the genomics part.
Sketching and database construction are done by the module tohnsw.
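To give an idea of what a sketch is, here is a minimal illustration using plain MinHash over k-mers. It is not the probminhash algorithm used by this package (which also takes k-mer multiplicity into account) and it is not the code of tohnsw. The function names and the parameters `k` and `sketch_size` are assumptions made for this example only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one k-mer together with a seed, so that each sketch slot
/// behaves like a different hash function.
fn seeded_hash(kmer: &[u8], seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// Plain MinHash (illustration only, not probminhash): for each slot,
/// keep the smallest seeded hash value seen over all k-mers of `seq`.
fn minhash_sketch(seq: &[u8], k: usize, sketch_size: usize) -> Vec<u64> {
    let mut sketch = vec![u64::MAX; sketch_size];
    if k == 0 || seq.len() < k {
        // No k-mer can be extracted: return the untouched sketch.
        return sketch;
    }
    for kmer in seq.windows(k) {
        for (slot, min) in sketch.iter_mut().enumerate() {
            let h = seeded_hash(kmer, slot as u64);
            if h < *min {
                *min = h;
            }
        }
    }
    sketch
}

/// Distance between two sketches: the fraction of slots that differ.
/// Similar k-mer sets share many minima and get a small distance.
fn sketch_distance(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "sketches must have the same size");
    if a.is_empty() {
        return 0.0;
    }
    let differing = a.iter().zip(b).filter(|(x, y)| x != y).count();
    differing as f64 / a.len() as f64
}
```

Two identical sequences give a distance of 0, while sequences with no k-mer in common give a distance close to 1. Fixed-size sketches of this kind are what makes it practical to index many genomes in a nearest-neighbour structure.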
Sketching the reference genomes can take some time (one to two hours for ~65,000 bacterial genomes from NCBI, with parameters giving a correct quality of sketching). The result is stored in 2 structures:
- A Hnsw structure storing the rank of each processed genome and the corresponding sketch.
- A Dictionary associating each rank with a fasta id and fasta filename.

The Hnsw structure is dumped in hnswdump.hnsw.graph and hnswdump.hnsw.data. The Dictionary is dumped in a json file, seqdict.json.
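The role of the Dictionary can be pictured with a small sketch: the Hnsw structure only knows ranks, and the Dictionary translates a rank back to a fasta id and filename. The type and method names below (`SeqEntry`, `SeqDict`, `push`, `get`) are hypothetical and do not reflect the actual layout of seqdict.json.

```rust
/// One dictionary entry: the fasta id and the file the sequence came from.
#[derive(Debug, Clone, PartialEq)]
struct SeqEntry {
    fasta_id: String,
    filename: String,
}

/// Dictionary indexed by rank: the rank of a genome is simply
/// its position in `entries` (an assumption of this sketch).
#[derive(Debug, Default)]
struct SeqDict {
    entries: Vec<SeqEntry>,
}

impl SeqDict {
    /// Register a genome and return the rank assigned to it.
    fn push(&mut self, fasta_id: &str, filename: &str) -> usize {
        self.entries.push(SeqEntry {
            fasta_id: fasta_id.to_string(),
            filename: filename.to_string(),
        });
        self.entries.len() - 1
    }

    /// Translate a rank returned by a search back to its entry.
    fn get(&self, rank: usize) -> Option<&SeqEntry> {
        self.entries.get(rank)
    }
}
```

Keeping only ranks in the search structure and the names in a separate dictionary keeps the indexed data compact, which is presumably why the two are dumped to separate files.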
For requests, the module request is used. It reloads the dumped files (Hnsw and seqdict), takes a list of fasta files containing requests and, for each fasta file, dumps the requested number of nearest neighbours.
```bash
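# Build the nucleotide database, then the amino acid database
tohnsw -d dbdirnt -s 12000 -k 16 --ef 1600 -n 128
tohnsw -d dbdiraa -s 12000 -k 7 --ef 1600 -n 128 --aa

# Download and unpack the pre-built database
wget http://enve-omics.ce.gatech.edu/data/publicgsearch/GTDBr207hnswgraph.tar.gz
tar xzvf ./GTDBr207hnswgraph.tar.gz

# Request at nucleotide level
cd ./GTDBr207hnswgraph/nucl
request -b ./ -d querydirnt -n 50

# Request at amino acid level (start from the directory where the archive was unpacked)
cd ./GTDBr207hnswgraph/prot
request -b ./ -d querydir_aa -n 50 --aa

# Request at universal gene level (start from the directory where the archive was unpacked)
cd ./GTDBr207hnswgraph/universal
request -b ./ -d querydiruniversalaa -n 50 --aa
```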
hnswrs relies on the crate simdeez to accelerate distance computation. On Intel you can build hnswrs with the feature simdeez_f.
annembed relies on openblas, so you must choose between the features "annembed_openblas-static", "annembed_openblas-system" or "annembed_intel-mkl". You may need to install gcc, gfortran and make.
kmerutils provides a feature "withzmq". This feature can be used to store compressed qualities on a server and run requests. It is not necessary in this crate.
Pre-built binaries are available on the release page (https://github.com/jean-pierreBoth/archaea/releases/tag/v1.0) for major platforms. If you want to install or compile it yourself:
```bash
# Install Rust, then install archaea with intel-mkl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install archaea --features="annembed_intel-mkl"

# macOS (Homebrew): install openblas, then install archaea with the system openblas
brew install openblas xz
echo 'export LDFLAGS="-L/usr/local/opt/openblas/lib"' >> ~/.bash_profile
echo 'export CPPFLAGS="-I/usr/local/opt/openblas/include"' >> ~/.bash_profile
echo 'export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"' >> ~/.bash_profile
cargo install archaea --features="annembed_openblas-system"

# Install directly from the GitHub repository
cargo install archaea --features="annembed_intel-mkl" --git https://github.com/jean-pierreBoth/archaea

# Or build from source
git clone https://github.com/jean-pierreBoth/archaea
cd archaea
# with openblas linked statically
cargo build --release --features="annembed_openblas-static"
# with the system openblas
cargo build --release --features="annembed_openblas-system"
# with the system openblas and simdeez (Intel)
cargo build --release --features="annembed_openblas-system" --features="hnswrs/simdeez_f"

# Install FragGeneScanRs, used for gene prediction
cargo install --git https://gitlab.com/Jianshu_Zhao/fraggenescanrs
```
Alternatively it is possible to modify the features section in Cargo.toml. Just fill in the default you want.
Archaea.answer is the default output file, written in your current directory. For each genome in the query_dir, it lists the requested N nearest genomes, sorted by distance (smallest to largest). If a query genome does not appear in the output file, no nearest genome was found for it in the database at this level (nt or aa), or it is too distant from its best hit in the database. You may then try the amino acid level or the universal gene level.
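The ordering of the answers can be illustrated with a small sketch that selects the N nearest neighbours from a list of distances. It is a brute-force illustration, not the Hnsw search performed by request, and the cut-off `max_dist` is a hypothetical stand-in: this README does not specify how the tool decides that a query is too distant from its best hit.

```rust
/// Return up to `n` (rank, distance) pairs sorted by increasing distance.
/// `distances[rank]` is the distance from the query to the genome of that rank.
/// Neighbours farther than `max_dist` (hypothetical cut-off) are dropped,
/// so the result can be empty when the query has no close relative.
fn nearest_neighbours(distances: &[f64], n: usize, max_dist: f64) -> Vec<(usize, f64)> {
    let mut hits: Vec<(usize, f64)> = distances
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, d)| d <= max_dist)
        .collect();
    // Smallest distance first, as in the output file.
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(n);
    hits
}
```

An empty result in this sketch corresponds to a query genome that is missing from the output file, which is the case where moving to the amino acid or universal gene level may help.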
We provide pre-built genome/proteome database graph files for bacteria/archaea, viruses and fungi. The proteome databases are based on the genes of each genome, predicted by FragGeneScanRs (https://gitlab.com/Jianshu_Zhao/fraggenescanrs) for bacteria/archaea/viruses and by GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/licensedownload.cgi) for fungi.