Mehari

Mehari is a software package for annotating VCF files with variant effects/consequences. It uses hgvs-rs to project genomic variants onto transcripts and proteins, which underpins its prediction quality.

Other popular tools offering variant effect/consequence prediction include:

Mehari offers predictions that aim to mirror VariantValidator, the gold standard for HGVS variant descriptions. Further, it is written in the Rust programming language and can be used as a library in users' own Rust software.

Supported Sequence Variant Frequency Databases

Mehari can import public sequence variant frequency databases. The supported set differs slightly between GRCh37 and GRCh38.

GRCh37

GRCh38

Internal Notes

```
rm -rf /tmp/out ; cargo run -- db create seqvar-freqs \
    --path-output-db /tmp/out \
    --genome-release grch38 \
    --path-helix-mtdb ~/Downloads/HelixMTdb_20200327.vcf.gz \
    --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz \
    --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf \
    --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf \
    --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf \
    --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf \
    --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf \
    --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf \
    --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf \
    --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf

rm -rf /tmp/out ; cargo run -- db create seqvar-freqs \
    --path-output-db /tmp/out \
    --genome-release grch37 \
    --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz \
    --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf \
    --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf \
    --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf \
    --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf \
    --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf \
    --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf \
    --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2
```
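The commands above repeat one flag per input file. As a rough sketch of how such a list could be assembled from a directory rather than typed out, the snippet below collects one `--path-gnomad-exomes-auto` flag per matching file and prints the resulting command without running it. The temporary directory and the empty placeholder files are stand-ins invented for this illustration; only the flag names are taken from the commands above, and a real run may need the other flags as well.

```shell
# Hypothetical stand-in for tests/data/db/create/seqvar_freqs/12-38,
# filled with empty placeholder files.
data_dir=$(mktemp -d)
touch "$data_dir/gnomad.exomes.r2.1.1.sites.chr1.vcf" \
      "$data_dir/gnomad.exomes.r2.1.1.sites.chr2.vcf"

# Collect one --path-gnomad-exomes-auto flag per matching file
# in the positional parameters.
set --
for vcf in "$data_dir"/gnomad.exomes.*.vcf; do
  set -- "$@" --path-gnomad-exomes-auto "$vcf"
done
arg_count=$#

# Print the command instead of executing it.
echo cargo run -- db create seqvar-freqs \
  --path-output-db /tmp/out \
  --genome-release grch38 \
  "$@"
```

Using the positional parameters keeps the sketch portable to plain `sh`; in bash, an array would serve the same purpose.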

```
prepare()
{
    in=$1
    out=$2

    zcat "$in" \
    | head -n 5000 \
    | grep ^# \
    > "$out"

    zcat "$in" \
    | grep -v ^# \
    | head -n 3 \
    >> "$out"
}

base=/data/sshfs/data/gpfs-1/groups/cubi/work/projects/2021-07-20_varfish-db-downloader-holtgrewe/varfish-db-downloader/

mkdir -p tests/data/db/create/seqvar_freqs/{12,xy}-{37,38}

# 37 exomes

prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf

# 37 genomes

prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf

# 38 exomes

prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf

# 38 genomes

prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf
```
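To see what the `prepare` function keeps, it may help to run the same logic on a tiny synthetic input. The sketch below builds a made-up gzip-compressed VCF with two header lines and five records, then subsets it. The function name `subset_vcf`, the temporary paths, and the record contents are assumptions for illustration only; `gzip -dc` is used in place of `zcat` for portability.

```shell
# Hypothetical demo input: a tiny gzip-compressed VCF in a temp directory.
tmpdir=$(mktemp -d)

# Two header lines followed by five made-up records.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for pos in 100 200 300 400 500; do
    printf '1\t%s\t.\tA\tG\t.\tPASS\t.\n' "$pos"
  done
} | gzip -c > "$tmpdir/in.vcf.gz"

# Same two steps as prepare(): header lines found within the first
# 5000 lines, then the first three records.
subset_vcf() {
  src=$1
  dst=$2
  gzip -dc "$src" | head -n 5000 | grep '^#' > "$dst"
  gzip -dc "$src" | grep -v '^#' | head -n 3 >> "$dst"
}

subset_vcf "$tmpdir/in.vcf.gz" "$tmpdir/out.vcf"

# Count what ended up in the subset.
header_count=$(grep -c '^#' "$tmpdir/out.vcf")
record_count=$(grep -vc '^#' "$tmpdir/out.vcf")
echo "headers=$header_count records=$record_count"
```

With this input the subset holds both header lines and the first three records, which is the shape of the small test files under `tests/data/db/create/seqvar_freqs`.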

Building tx database

```
cd hgvs-rs-data

seqrepo --root-directory seqrepo-data/master init

mkdir -p mirror/ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot
cd !$
wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.files.installed
parallel -j 16 'wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/{}' ::: $(cut -f 2 human.files.installed | grep fna)
cd -

mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna
cd !$
wget https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/ncrna
cd !$
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.ncrna.fa.gz
cd -

seqrepo --root-directory seqrepo-data/master load -n NCBI $(find mirror/ftp.ncbi.nih.gov -name '*.fna.gz' | sort)
seqrepo --root-directory seqrepo-data/master load -n ENSEMBL $(find mirror/ftp.ensembl.org -name '*.fa.gz' | sort)

cd ../mehari

cargo run --release -- \
    -v \
    db create txs \
    --path-out /tmp/txs-out.bin.zst \
    --path-cdot-json ../cdot-0.2.12.ensembl.grch37_grch38.json.gz \
    --path-cdot-json ../cdot-0.2.12.refseq.grch37_grch38.json.gz \
    --path-seqrepo-instance ../hgvs-rs-data/seqrepo-data/master/master
```
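The RefSeq download step picks its files with `cut -f 2 human.files.installed | grep fna`. As a small illustration of what that pipeline selects, the sketch below runs it on a made-up listing. The listing contents are invented for this example and only follow the layout that `cut -f 2` implies (tab-separated, file name in the second column); the real file may contain different entries.

```shell
# Made-up listing in the layout that `cut -f 2` implies:
# tab-separated, file name in the second column.
listing=$(mktemp)
printf '1\thuman.1.rna.fna.gz\n'     >  "$listing"
printf '1\thuman.1.protein.faa.gz\n' >> "$listing"
printf '1\thuman.2.rna.fna.gz\n'     >> "$listing"

# Keep only the nucleotide FASTA entries, as in the download step.
selected=$(cut -f 2 "$listing" | grep fna)
echo "$selected"
```

Only the two `fna` entries remain, so the protein `faa` file in this listing would not be passed to `wget`.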

Development Setup

You will need a recent version of protoc, e.g.:

```
bash utils/install-protoc.sh
export PATH=$PATH:$HOME/.local/share/protoc/bin
```
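After adjusting `PATH`, a quick check along these lines can confirm whether the shell finds `protoc`. This is only a sketch; the variable name `protoc_status` is made up for the example, and no particular minimum version is assumed.

```shell
# Report whether protoc is reachable on PATH and, if so, which version.
if command -v protoc >/dev/null 2>&1; then
  protoc_status="found: $(protoc --version)"
else
  protoc_status="missing"
fi
echo "protoc $protoc_status"
```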