tidyVCF

tidyVCF is a small tool that converts VCF files into tidy tab- or comma-separated tables, ideal for downstream analysis with R's tidyverse or Julia's DataFrames ecosystem. All fields are included by default, which keeps the command line simple. tidyVCF is written in pure Rust, relying on the excellent noodles-vcf crate developed by @zaeleus and contributors.

Note: The tool works for me, but it isn't ready for production use yet. It is built on a fairly experimental API, it lacks proper testing, and it is brittle: it does not handle many of the VCF variants found in the wild, and it errors out ungracefully on even minor spec violations.

Install

Cargo

cargo install tidyvcf

Pre-built binaries

TBD.

Usage

Basic usage

Use -c/--csv for CSV output; the default is TSV:

tidyvcf -i test.vcf -c -o test.csv

BGZF compressed VCFs are detected by file extension and handled automatically:

tidyvcf -i test.vcf.gz -o test.tsv

When reading compressed data from stdin there is no file extension to inspect, so pass the --bgzip flag:

cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
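The difference between the two cases can be sketched in Python. This is only an illustration of extension-based detection, not tidyvcf's actual code: the helper name `open_vcf_text`, the `force_bgzip` parameter, and the set of recognised extensions are all assumptions. It relies on the fact that BGZF files are valid multi-member gzip files, which Python's `gzip` module can read.

```python
import gzip
import sys
from pathlib import Path


def open_vcf_text(path=None, force_bgzip=False):
    """Return a text-mode handle for a plain or BGZF-compressed VCF.

    Hypothetical helper for illustration. `path=None` means "read stdin".
    """
    if path is None:
        # A stream has no filename to inspect, so the caller has to say
        # whether it is compressed (the role --bgzip plays on the CLI).
        if force_bgzip:
            return gzip.open(sys.stdin.buffer, "rt")
        return sys.stdin
    # Assumed extension list; the tool's own list may differ.
    compressed = force_bgzip or Path(path).suffix in {".gz", ".bgz"}
    return gzip.open(path, "rt") if compressed else open(path, "rt")
```

With a named file the extension decides; with a pipe only the explicit flag can.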

Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or FORMAT fields, one for each sample. By default, tidyvcf produces one column per sample-field combination (the 'cartesian' layout), joining each sample name to each FORMAT field name with an underscore ('_'): S1_GT, S1_DP, S2_GT, S2_DP, and so on.

The --format-delim option allows changing the sample-format field delimiter:

tidyvcf -i test.vcf --format-delim '~' -o test.tsv
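As a rough illustration of how these column names come about, here is a minimal Python sketch. The function `wide_format_columns` is hypothetical and only mimics the naming scheme described above; it is not taken from tidyvcf.

```python
def wide_format_columns(samples, format_keys, delim="_"):
    """Column names for the wide layout: one per (sample, FORMAT key) pair."""
    return [f"{sample}{delim}{key}" for sample in samples for key in format_keys]


print(wide_format_columns(["S1", "S2"], ["GT", "DP"]))
# ['S1_GT', 'S1_DP', 'S2_GT', 'S2_DP']
print(wide_format_columns(["S1", "S2"], ["GT", "DP"], delim="~"))
# ['S1~GT', 'S1~DP', 'S2~GT', 'S2~DP']
```

One plausible reason to pick a different delimiter: if sample names themselves contain underscores, a name like tumour_1_GT cannot be split back into sample and field unambiguously, whereas tumour_1~GT can.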

This wide layout violates the tidy data principle that each observation gets its own row. To avoid this, we can stack samples into rows, at the cost of repeating the static and INFO columns for each sample.

Stacking samples:

tidyvcf -i test.vcf --stack -o test_stacked.tsv
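Conceptually, stacking turns one wide record into one row per sample. The Python sketch below shows the idea under stated assumptions: `stack_record` is a hypothetical helper, and the name of the added `sample` column is a guess, so check the header of the real output.

```python
def stack_record(static, info, per_sample):
    """Turn one wide record into one row per sample.

    static:     fixed VCF columns, e.g. {"CHROM": "1", "POS": 100}
    info:       INFO columns, already prefixed
    per_sample: {sample name: {FORMAT key: value}}
    """
    rows = []
    for sample, fields in per_sample.items():
        # Static and INFO values are repeated on every row; only the
        # sample name and its FORMAT values change.
        rows.append({**static, **info, "sample": sample, **fields})
    return rows


rows = stack_record(
    {"CHROM": "1", "POS": 100},
    {"info_DP": 50},
    {"S1": {"GT": "0/1", "DP": 20}, "S2": {"GT": "0/0", "DP": 30}},
)
for row in rows:
    print(row)
```

Each FORMAT key now appears as a single column (GT, DP) rather than once per sample, which is what makes the table easy to group and filter by sample.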

Info prefix

To avoid clashes in field names between INFO and FORMAT columns, INFO field names are prefixed with the string "info_" by default - this behaviour can be adjusted with the --info-prefix option:

tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
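To see why a prefix is needed, consider DP, which VCFs commonly define both as an INFO field and as a FORMAT field. The short Python sketch below uses a hypothetical helper, `prefix_info_columns`, to show the clash disappearing once the prefix is applied.

```python
def prefix_info_columns(info_keys, prefix="info_"):
    """Prepend a prefix to every INFO key to keep it distinct from FORMAT keys."""
    return [prefix + key for key in info_keys]


info_keys = ["DP", "AF"]
format_cols = ["GT", "DP"]  # FORMAT columns, e.g. in a stacked table

print(set(info_keys) & set(format_cols))                       # {'DP'}: a clash
print(set(prefix_info_columns(info_keys)) & set(format_cols))  # set(): no clash
```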

VEP `CSQ` INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the -v/--vep-fields flag to extract those fields into individual columns:

tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv

By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example CSQ/VAF and FMT/VAF) - this string can be customised with the --vep-prefix option:

tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv

Note: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named CSQ_other_transcripts.
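The splitting can be sketched in Python as follows. VEP lists the sub-field names after "Format:" in the CSQ header description, separates transcripts with commas, and separates sub-fields with pipes. The functions `csq_field_names` and `split_csq` are hypothetical illustrations of the behaviour described above, not tidyvcf's implementation, and the example annotation values are made up.

```python
import re


def csq_field_names(description):
    """Extract sub-field names from a CSQ header description ('... Format: A|B|C')."""
    match = re.search(r"Format:\s*(\S+)", description)
    if match is None:
        raise ValueError("no 'Format:' section in CSQ description")
    return match.group(1).strip('"').split("|")


def split_csq(csq_value, names, prefix="vep_"):
    """Split the first transcript into columns; keep the rest unsplit."""
    first, _, rest = csq_value.partition(",")
    row = {prefix + name: value for name, value in zip(names, first.split("|"))}
    row["CSQ_other_transcripts"] = rest
    return row


desc = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL"
names = csq_field_names(desc)
csq = "T|missense_variant|MODERATE|BRCA2,T|intron_variant|MODIFIER|BRCA2"
print(split_csq(csq, names))
```

Here the first transcript ends up in vep_Allele, vep_Consequence, vep_IMPACT and vep_SYMBOL, while the second stays as a raw string in CSQ_other_transcripts.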

Spec Non-Compliant VCFs

The noodles Rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers because they do not adhere to the spec. tidyvcf comes with a -l/--lenient option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by octopus, for example. Feel free to raise an issue if this option doesn't help with other spec-non-compliant-but-basically-fine VCFs.

Comparison with other software

| Feature                                | tidyVCF    | rbt vcf-to-txt                                | bcftools -f            | gatk VariantsToTable   |
|----------------------------------------|------------|-----------------------------------------------|------------------------|------------------------|
| include all fields                     | by default | individually specified; currently no FILTER   | individually specified | individually specified |
| include a subset of fields             | ❌         | individually specified; currently no FILTER   | individually specified | individually specified |
| long format                            | --stack    | ❌                                            | ❌                     | ❌                     |
| pipeable                               | ✓          | ✓                                             | ✓                      | ❌                     |
| compressed input without external tool | ✓          | ❌                                            | ✓                      | ?                      |
| bcf input                              | ❌         | ❌                                            | ✓                      | ?                      |