tidyVCF is a small tool to convert VCF files to tidy tab- or comma-separated tables, ideal for downstream analysis with R's tidyverse or Julia's DataFrames ecosystems. tidyVCF is written in pure Rust, relying on the noodles-vcf crate written by @zaeleus and contributors.
cargo install tidyvcf
TBD.
CSV output with `-c`, default is TSV:
tidyvcf -i test.vcf -c -o test.csv
Using pipes to deal with compression:
zcat test.vcf.gz | tidyvcf | gzip > test.tsv.gz
It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or `FORMAT` fields, one for each sample. By default, `tidyvcf` joins sample names to the names of the format fields with the underscore ('_') character: `S1_GT S1_DP S2_GT S2_DP...`.
The `-j`/`--sample-delim` option allows changing the sample-format field delimiter:
tidyvcf -i test.vcf -j '~' -o test.tsv
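The naming scheme described above can be illustrated with a minimal Python sketch. This is not tidyvcf's own code (the tool is written in Rust); the function `sample_format_columns` and its parameters are hypothetical and only mirror the documented behaviour of joining each sample name to each format field with a delimiter.

```python
def sample_format_columns(samples, format_fields, delim="_"):
    """Build wide-format column names, sample by sample (hypothetical helper)."""
    return [
        f"{sample}{delim}{field}"
        for sample in samples
        for field in format_fields
    ]


# Default underscore delimiter, then a custom one as with -j '~'.
default_cols = sample_format_columns(["S1", "S2"], ["GT", "DP"])
tilde_cols = sample_format_columns(["S1", "S2"], ["GT", "DP"], delim="~")

print(default_cols)  # ['S1_GT', 'S1_DP', 'S2_GT', 'S2_DP']
print(tilde_cols)    # ['S1~GT', 'S1~DP', 'S2~GT', 'S2~DP']
```

A custom delimiter is mainly useful when sample names themselves contain underscores, since the column names can then still be split unambiguously downstream.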
This wide layout violates the tidy data principle, because a single row holds observations for several samples. To avoid this, samples can be stacked into rows, at the cost of repeating the static and `INFO` columns for each sample.
Stacking samples:
tidyvcf -i test.vcf --stack -o test_stacked.tsv
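To make the difference between the wide and stacked layouts concrete, here is a small Python sketch of the reshaping. It is an illustration under stated assumptions rather than tidyvcf's implementation: the helper `stack_samples`, the column names `chrom`, `pos` and `sample`, and the example values are all hypothetical, and the actual output columns may be named differently.

```python
def stack_samples(row, static_cols, samples, format_fields, delim="_"):
    """Turn one wide row into one row per sample, repeating the static columns."""
    stacked = []
    for sample in samples:
        # Static and INFO columns are copied unchanged into every sample row.
        long_row = {col: row[col] for col in static_cols}
        long_row["sample"] = sample
        # Per-sample FORMAT values lose their sample prefix.
        for field in format_fields:
            long_row[field] = row[f"{sample}{delim}{field}"]
        stacked.append(long_row)
    return stacked


wide = {
    "chrom": "1", "pos": 100, "info_DP": 30,
    "S1_GT": "0/1", "S1_DP": 14,
    "S2_GT": "1/1", "S2_DP": 16,
}
long_rows = stack_samples(wide, ["chrom", "pos", "info_DP"], ["S1", "S2"], ["GT", "DP"])
for long_row in long_rows:
    print(long_row)
```

Each variant now occupies one row per sample, which is the shape that grouped summaries and plotting libraries typically expect.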
To avoid clashes in field names between `INFO` and `FORMAT` columns, `INFO` field names are prefixed with the string "info_" by default. This behaviour can be adjusted with the `-p`/`--info-prefix` option:
tidyvcf -i test.vcf -p 'i' -c -o test.csv
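The reason for the prefix is easiest to see with a field such as `DP`, which commonly exists both as an INFO field and as a FORMAT field. The Python sketch below is only an illustration: `prefix_info_fields` is a hypothetical helper, and it concatenates the prefix verbatim, which is an assumption about how a custom prefix is applied.

```python
def prefix_info_fields(info_fields, prefix="info_"):
    """Prepend a prefix to every INFO field name (hypothetical helper)."""
    return [f"{prefix}{name}" for name in info_fields]


info_fields = ["DP", "AF"]
format_fields = ["GT", "DP"]

# Without a prefix, DP would appear twice in a stacked table.
clashes = sorted(set(info_fields) & set(format_fields))
prefixed = prefix_info_fields(info_fields)
remaining = sorted(set(prefixed) & set(format_fields))

print(clashes)    # ['DP']
print(prefixed)   # ['info_DP', 'info_AF']
print(remaining)  # []
```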
CSQ INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the `-v` option to extract those fields into individual columns:
tidyvcf -i vep.vcf.gz -v -o vep.tsv
Note: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named `CSQ_other_transcripts`.
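For readers unfamiliar with the CSQ layout: VEP writes one comma-separated entry per transcript, with pipe-separated sub-fields whose order is declared in the VCF header ("Format: Allele|Consequence|..."). The Python sketch below shows the kind of split described in the note. It is not tidyvcf's implementation; the helper `split_csq`, the three-field header and the example value are hypothetical, and the real column names may carry a prefix.

```python
def split_csq(csq_value, csq_fields):
    """Split the first transcript of a CSQ value into named columns.

    Remaining transcripts are kept unsplit in CSQ_other_transcripts.
    """
    first, _, others = csq_value.partition(",")
    columns = dict(zip(csq_fields, first.split("|")))
    columns["CSQ_other_transcripts"] = others
    return columns


# Sub-field order as it would be declared in the CSQ header line (shortened here).
csq_fields = "Allele|Consequence|SYMBOL".split("|")
csq_value = "T|missense_variant|BRCA2,T|intron_variant|BRCA2"

row = split_csq(csq_value, csq_fields)
print(row)
```

If you need every transcript as its own row, the `CSQ_other_transcripts` column can be split downstream on commas and then on pipes in the same way.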
| Feature | `tidyVCF` | `rbt vcf-to-txt` | `bcftools -f` | `gatk VariantsToTable` |
|----------------------------------------|------------|-------------------------------------------|--------------------|------------------------|
| include all fields | by default | manually specified; currently no `FILTER` | manually specified | manually specified |
| long format | `--stack` | ❌ | ❌ | ❌ |
| pipeable | ✓ | ✓ | ✓ | ❌ |
| compressed input without external tool | ✓ | ❌ | ✓ | ? |
| bcf input | ❌ | ❌ | ✓ | ? |