gsearch is the new name of the crate archaea. It stands for genomic search.
This package (currently in development) compute probminhash signature of bacteria and archaea (or virus and fungi) genomes and stores the id of bacteria and probminhash signature in a Hnsw structure for searching of new request genomes.
This package is developped by Jean-Pierre Both jpboth for the software part and Jianshu Zhao for the genomics part. We also created a mirror of this repo on GitLab and Gitee, just in case Github service is not available in some region, e.g, China.
The objective is to use the Jaccard index as an accurate proxy of mutation rate or Average Nucleitide Identity(ANI). To achieve this we use sketching.
We generate kmers along sequences and sketch the kmer distribution encountered in a file. Then final sketch is stored in a Hnsw database See hnsw.
The sketching and database is done by the subcommand tohnsw.
The Jaccard index come in 2 flavours:
1. The probability Jaccard index that takes into account the Kmer multiplicity. It is defined by :
$$J{P(A,B)}=\sum{d\in D} \frac{1}{\sum{d'\in D} \max (\frac{\omega{A}(d')}{\omega{A}(d)},\frac{\omega{B}(d')}{\omega{B}(d)})}$$
where $\omega{A}(d)$ is the multiplicity of $d$ in A
(see Moulton-Jiang-arxiv).
In this case for J_p we use the probminhash algorithm as implemented in probminhash
2. The unweighted (simple) Jaccard index defined by :
$$Jaccard(A,B)=\frac{A \cap B}{A \cup B}$$
In this case for J we use the SuperMinHash or the SetSketch (based on hyperloglog) method, also implemented in probminhash mentioned above.
The estimated Jaccard-like index is used to build HNSW graph database, which is implemented in crate hnswlib-rs.
The sketching of reference genomes can take some time (less than one hours for ~65,000 bacterial genomes of GTDB for parameters giving a correct quality of sketching, or 3 to 4 hours for entire NCBI/RefSeq prokaryotic genomes. ~318K). Result is stored in 2 structures:
The Hnsw structure is dumped in hnswdump.hnsw.graph and hnswdump.hnsw.data The Dictionary is dumped in a json file seqdict.json
For requests the subcommand request is being used. It reloads the dumped files, hnsw and seqdict related takes a list of fasta files containing requests and for each fasta file dumps the asked number of nearest neighbours.
Pre-built binaries will be available on release page binaries for major platforms (no need to install but just download and make it executable). We recommend you use the linux one (gsearch-linux-x86-64.zip) for linux system in this release page for convenience because the only dependency is GCC (Recent Linux version does not allow static compiling of GCC libraries like libc.so.6). For macOS, we recommend the binary mac-binaries for corresponding platform (x86-64 or arm64).
bash
conda config --add channels bioconda
conda install gsearch -c bioconda
Otherwise it is possible to install/compile by yourself (see install section)
```bash
wget https://github.com/jean-pierreBoth/gsearch/releases/download/0.1.1/gsearch-linux-x86-64.zip --no-check-certificate unzip gsearch-linux-x86-64.zip
wget https://github.com/jean-pierreBoth/gsearch/releases/download/0.1.1/gsearch-darwin-x86-64.zip --no-check-certificate unzip gsearch-darwin-x86-64.zip
wget https://github.com/jean-pierreBoth/gsearch/releases/download/0.1.1/gsearch-darwin-aarch64.zip --no-check-certificate unzip gsearch-darwin-aarch64.zip
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" brew install openblas xz
mv ./gsearch /usr/local/bin/
gsearch -h
sudo spctl --master-disable
```
```bash gsearch -h *** initializing logger ****
Approximate nearest neighbour search for microbial genomes based on minhash metric
Usage: gsearch [OPTIONS] [COMMAND]
Commands: tohnsw Build HNSW graph database from a collection of database genomes based on MinHash-like metric add Add new genome files to a pre-built HNSW graph database request Request nearest neighbors of query genomes against a pre-built HNSW graph database/index ann Approximate Nearest Neighbor Embedding using UMAP-like algorithm help Print this message or the help of the given subcommand(s)
Options:
--pio
```
We then give here an example of utilization with prebuilt databases.
```bash
wget http://enve-omics.ce.gatech.edu/data/publicgsearch/GTDBv207v2023.tar.gz tar xzvf ./GTDBv207_v2023.tar.gz
wget https://github.com/jean-pierreBoth/gsearch/releases/download/v0.0.12/testdata.tar.gz --no-check-certificate tar xzvf ./testdata.tar.gz
cd ./GTDB/nucl tar -xzvf k16s12000n128_ef1600.prob.tar.gz
gsearch request -b ./k16s12000n128ef1600canonical -r ../../testdata/querydir_nt -n 50
cd ./GTDB/prot gsearch request -b ./k7s12000n128ef1600gsearch -r ../../testdata/querydir_aa -n 50
cd ./GTDB/universal gsearch request -b ./k5n128s1800ef1600universalprob -r ../../testdata/querydiruniversal_aa -n 50
gsearch tohnsw -d dbdirnt -s 12000 -k 16 --ef 1600 -n 128 --algo prob gsearch tohnsw -d dbdiraa -s 12000 -k 7 --ef 1600 -n 128 --aa --algo prob
gsearch add -b ./k16s12000n128ef1600canonical -n dbdirnt (new genomes directory)
cd ./GTDB/prot gsearch add -b ./k7s12000n128ef1600gsearch -n dbdirnt (new genomes directory in AA format predicted by prodigal/FragGeneScanRs)
```
gsearch.answers is the default output file in your current directory.
For each of your genome in the query_dir, there will be requested N nearest genomes found and sorted by distance (smallest to largest).
If one genome in the query does not exist in the output file, meaning at this level (nt or aa), there is no such nearest genomes in the database (or distant away from the best hit in the database), you may then go to amino acid level or universal gene level.
hnswrs relies on the crate simdeez to accelerate distance computation. On intel you can build hnswrs with the feature simdeez_f
annembed relies on openblas so you must choose between the features "annembedopenblas-static" , "annembedopenblas-system" or "annembed_intel-mkl". You may need to install gcc, gfortran and make. This can be done using the --features option as explained below, or by modifying the features section in Cargo.toml. In that case just fill in the default you want.
bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
bash
cargo install gsearch --features="annembed_intel-mkl"
or with a system installed openblas:
bash
cargo install gsearch --features="annembed_openblas-system"
bash
brew install openblas xz
echo 'export LDFLAGS="-L/usr/local/opt/openblas/lib"' >> ~/.bash_profile
echo 'export CPPFLAGS="-I/usr/local/opt/openblas/include"' >> ~/.bash_profile
echo 'export PKG_CONFIG_PATH="/usr/local/opt/openblas/lib/pkgconfig"' >> ~/.bash_profile
source ~/.bash_profile
cargo install gsearch --features="annembed_openblas-system"
bash
cargo install gsearch --features="annembed_openblas-system" --features="hnsw_rs/simdeez_f"
bash
cargo install gsearch --features="annembed_intel-mkl" --git https://github.com/jean-pierreBoth/gsearch
```bash git clone https://github.com/jean-pierreBoth/gsearch cd gsearch
cargo build --release --features="annembed_openblas-static"
cargo build --release --features="annembed_openblas-system" ```
Html documentation can be generated by running (example for someone using the "annembed_openblas-system" feature):
bash
cargo doc --features="annembed_openblas-system" --no-deps --open
bash
cargo install --git https://gitlab.com/Jianshu_Zhao/fraggenescanrs
We provide pre-built genome/proteome database graph file for bacteria/archaea, virus and fungi. Proteome database are based on genes for each genome, predicted by FragGeneScanRs (https://gitlab.com/Jianshu_Zhao/fraggenescanrs) for bacteria/archaea/virus and GeneMark-ES version 2 (http://exon.gatech.edu/GeneMark/license_download.cgi) for fungi.