tidyVCF

tidyVCF is a small tool that converts VCF files to tidy tab- or comma-separated tables, ideal for downstream analysis with R's tidyverse or Julia's DataFrames ecosystems. tidyVCF is written in pure Rust, relying on the noodles-vcf crate written by @zaeleus and contributors.

Install

Cargo

cargo install tidyvcf

Pre-built binaries

TBD.

Usage

Basic usage

The default output is TSV; pass -c for CSV output:

tidyvcf -i test.vcf -c -o test.csv

Using pipes to deal with compression:

zcat test.vcf.gz | tidyvcf | gzip > test.tsv.gz

Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or FORMAT fields, one for each sample. By default, tidyvcf joins sample names to the names of the format fields with the underscore ('_') character - S1_GT S1_DP S2_GT S2_DP....

The -j/--sample-delim option changes the delimiter placed between the sample name and the FORMAT field name:

tidyvcf -i test.vcf -j '~' -o test.tsv

This wide layout, with one set of columns per sample, violates the tidy data principle of one observation per row. To avoid this, samples can be stacked into rows, at the cost of repeating the static and INFO columns for each sample.

Stacking samples:

tidyvcf -i test.vcf --stack -o test_stacked.tsv
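
To make the difference between the two layouts concrete, here is a minimal Python sketch. It is not tidyvcf's implementation: the helper names, the example record, the `chrom`/`pos` column names, and the `sample` column in the stacked output are all assumptions made for illustration, and the real output columns may be named differently.

```python
# Illustrative only: mimics the wide (default) and stacked layouts on one record.
# Helper names, column names and values are hypothetical, not tidyvcf's API.

def wide_columns(samples, format_keys, delim="_"):
    """Column names of the wide layout: one column per sample/FORMAT pair."""
    return [f"{s}{delim}{k}" for s in samples for k in format_keys]


def stack_samples(wide_row, samples, format_keys, delim="_"):
    """Turn one wide row into one row per sample, repeating the shared columns."""
    per_sample = set(wide_columns(samples, format_keys, delim))
    shared = {k: v for k, v in wide_row.items() if k not in per_sample}
    rows = []
    for s in samples:
        row = dict(shared)      # static and INFO columns are copied into every row
        row["sample"] = s       # assumed name for the sample column
        for k in format_keys:
            row[k] = wide_row[f"{s}{delim}{k}"]
        rows.append(row)
    return rows


wide_row = {
    "chrom": "1", "pos": 12345, "info_DP": 30,
    "S1_GT": "0/1", "S1_DP": 14,
    "S2_GT": "1/1", "S2_DP": 16,
}
stacked = stack_samples(wide_row, ["S1", "S2"], ["GT", "DP"])
for row in stacked:
    print(row)
```

The single wide row becomes two rows, one per sample, each carrying its own GT and DP plus a copy of the shared columns. That repetition is the cost mentioned above.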

Info prefix

To avoid clashes in field names between INFO and FORMAT columns, INFO field names are prefixed with the string "info_" by default - this behaviour can be adjusted with the -p/--info-prefix option:

tidyvcf -i test.vcf -p 'i' -c -o test.csv
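
The kind of clash the prefix guards against can be shown with a short Python sketch. This is an illustration under assumptions, not tidyvcf's code: `header_names` and `duplicates` are hypothetical helpers, and the sketch assumes the prefix is simply prepended to each INFO key.

```python
# Illustrative only: why INFO names get a prefix. Not tidyvcf's implementation.

def header_names(info_keys, format_keys, info_prefix="info_"):
    """Build column names for INFO and FORMAT keys, prefixing only the INFO ones."""
    return [f"{info_prefix}{k}" for k in info_keys] + list(format_keys)


def duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    seen, dups = set(), []
    for n in names:
        if n in seen and n not in dups:
            dups.append(n)
        seen.add(n)
    return dups


info_keys = ["DP", "AF"]      # DP is a reserved INFO key in the VCF specification...
format_keys = ["GT", "DP"]    # ...and also a reserved FORMAT key

print(duplicates(header_names(info_keys, format_keys, info_prefix="")))  # ['DP']
print(duplicates(header_names(info_keys, format_keys)))                  # []
```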

VEP CSQ INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the -v option to extract those fields into individual columns:

tidyvcf -i vep.vcf.gz -v -o vep.tsv

Note: Only the first annotated transcript of each record is split into columns. Any remaining transcripts are bundled, unsplit, into an additional column named CSQ_other_transcripts.
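
For readers unfamiliar with the CSQ layout, the following Python sketch shows the general idea of the split described in the note. VEP writes one pipe-delimited entry per transcript, separates entries with commas, and lists the field names after "Format: " in the CSQ header description. Everything else here is an assumption: the helper names, the shortened three-field format, and the made-up values are illustrative, and the names tidyvcf gives to the split columns may differ.

```python
# Illustrative only: splitting a VEP CSQ value as described in the note above.
# Helper names and example values are hypothetical; real CSQ entries have many more fields.

def csq_field_names(description):
    """Extract the pipe-separated field names that follow 'Format: '."""
    return description.split("Format: ", 1)[1].strip().split("|")


def split_csq(csq_value, field_names):
    """Split the first transcript into named fields; keep the rest unsplit."""
    first, _, others = csq_value.partition(",")
    row = dict(zip(field_names, first.split("|")))
    row["CSQ_other_transcripts"] = others
    return row


description = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL"
csq = "T|missense_variant|GENE1,T|upstream_gene_variant|GENE2"

names = csq_field_names(description)
print(split_csq(csq, names))
```

If other transcripts matter for your analysis, the CSQ_other_transcripts column can be split downstream with the same comma and pipe delimiters.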

Comparison with other software

| Feature                                | tidyVCF    | rbt vcf-to-txt                          | bcftools -f        | gatk VariantsToTable |
|----------------------------------------|------------|-----------------------------------------|--------------------|----------------------|
| include all fields                     | by default | manually specified; currently no FILTER | manually specified | manually specified   |
| long format                            | --stack    | ❌                                      | ❌                 | ❌                   |
| pipeable                               | ✓          | ✓                                       | ✓                  | ❌                   |
| compressed input without external tool | ✓          | ❌                                      | ✓                  | ?                    |
| bcf input                              | ❌         | ❌                                      | ✓                  | ?                    |