=tidyVCF= is a small tool to convert VCF files to tidy tab- or comma-separated tables, ideal for downstream analysis with R's =tidyverse= or Julia's =DataFrames= ecosystems. All fields are included by default, keeping the command line simple. =tidyVCF= is written in pure Rust, relying on the excellent =noodles-vcf= crate developed by [[https://github.com/zaeleus][@zaeleus]] and contributors.
Warning: /built on an unstable API, lacking proper testing, and brittle: it errors on minor VCF spec violations/.
* Install
** Cargo
cargo install tidyvcf
** Pre-built binaries
TBD.
** Docker
docker pull registry.gitlab.com/jdm204/tidyvcf:latest
Use =-c= / =--csv= for CSV output; the default is TSV:
tidyvcf -i test.vcf -c -o test.csv
BGZF compressed VCFs are detected by file extension and handled automatically:
tidyvcf -i test.vcf.gz -o test.tsv
If dealing with compressed data from =stdin=, use the =--bgzip= flag:
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
To write compressed output, use the =.gz= extension for the =--output= file or pass the =-z= / =--out-gz= flag:
tidyvcf -i test.vcf.gz --csv -o test.csv.gz
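Compressed output can be read downstream without decompressing it first. The following is a minimal Python sketch, assuming the compressed output is gzip-compatible so that standard gzip readers can open it. The helper =read_table= and the file names are illustrative and not part of =tidyvcf=.

```python
import csv
import gzip


def read_table(path, delimiter="\t"):
    """Read a (possibly gzip-compressed) delimited table into a list of dicts."""
    # Choose the opener from the file extension (assumption: .gz means gzip).
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For the command above, this would be called as =read_table("test.csv.gz", delimiter=",")=. All values come back as strings, so numeric columns need converting before analysis.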
** Multiple samples: stacked or cartesian
It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or =FORMAT= fields, one for each sample. By default, =tidyvcf= joins sample names to the names of the format fields with the underscore ('_') character: =S1_GT S1_DP S2_GT S2_DP...=.
The =--format-delim= option allows changing the sample-format field delimiter:
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
Spreading samples across columns like this violates the [[https://r4ds.had.co.nz/tidy-data.html][tidy data]] principle that each observation gets its own row. To avoid this, we can stack samples into rows, at the cost of repeating the static and =INFO= columns for each sample.
Stacking samples:
tidyvcf -i test.vcf --stack -o test_stacked.tsv
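To make the two layouts concrete, here is a minimal Python sketch that reshapes one wide row into stacked rows. It illustrates the idea only and is not how =tidyvcf= is implemented. The function name =stack_samples=, the =sample= column name, and the example values are assumptions made for this example.

```python
def stack_samples(row, samples, delim="_"):
    """Turn one wide row into one row per sample.

    Columns named <sample><delim><field> become <field>, and a 'sample'
    column (hypothetical name) records which sample the row belongs to.
    """
    prefixes = {s: s + delim for s in samples}
    # Columns not owned by any sample are repeated on every stacked row.
    shared = {
        k: v for k, v in row.items()
        if not any(k.startswith(p) for p in prefixes.values())
    }
    stacked = []
    for sample, prefix in prefixes.items():
        new_row = dict(shared, sample=sample)
        for key, value in row.items():
            if key.startswith(prefix):
                new_row[key[len(prefix):]] = value
        stacked.append(new_row)
    return stacked


# Made-up record with two samples in the wide layout.
wide = {"chrom": "chr1", "pos": 100, "info_DP": 50,
        "S1_GT": "0/1", "S1_DP": 20, "S2_GT": "1/1", "S2_DP": 30}
for stacked_row in stack_samples(wide, ["S1", "S2"]):
    print(stacked_row)
```

Each stacked row repeats =chrom=, =pos= and =info_DP=, while =GT= and =DP= hold the values for a single sample. The example also shows why a prefix on =INFO= columns is useful: without it, the record-level =DP= and the per-sample =DP= would share a name.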
** Info prefix
To avoid clashes in field names between =INFO= and =FORMAT= columns, =INFO= field names are prefixed with the string "info_" by default---this behaviour can be adjusted with the =--info-prefix= option:
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
** VEP =CSQ= INFO field splitting
If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the =-v= / =--vep-fields= flag to extract those fields into individual columns:
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example =CSQ/VAF= and =FMT/VAF=)---this string can be customised with the =--vep-prefix= option:
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
/Note/: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named =CSQ_other_transcripts=.
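The splitting described here can be sketched in a few lines of Python. VEP records the field layout in the =CSQ= header description after the text "Format: ", separates fields with =|= and separates transcripts with commas. This is a sketch of the idea under those assumptions, not =tidyvcf='s actual code. The function names, the =other_transcripts= key, and the example annotation are hypothetical.

```python
def csq_field_names(description):
    """Extract the pipe-separated field names from the CSQ header description."""
    # VEP writes the layout after the literal text "Format: ".
    _, _, layout = description.partition("Format: ")
    return layout.strip().strip('"').split("|")


def split_csq(csq_value, names, prefix="vep_"):
    """Split the first transcript into named columns; keep the rest unsplit."""
    first, _, rest = csq_value.partition(",")
    columns = {prefix + n: v for n, v in zip(names, first.split("|"))}
    # Remaining transcripts stay as one raw string (key name is illustrative).
    columns["other_transcripts"] = rest
    return columns


# Shortened, made-up header description and CSQ value for illustration.
description = ("Consequence annotations from Ensembl VEP. "
               "Format: Allele|Consequence|SYMBOL")
names = csq_field_names(description)
row = split_csq("T|missense_variant|TP53,T|intron_variant|WRAP53", names)
```

Here =row= holds one prefixed column per VEP field for the first transcript, plus the second transcript as a single unsplit string.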
** Spec Non-Compliant VCFs
The =noodles= Rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers because they do not adhere to the spec. =tidyvcf= comes with a =-l= / =--lenient= option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion. Currently, this option is sufficient to convert VCFs produced by =octopus=, for example. Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.
** In a Snakemake Workflow
Here is a sample rule using a container. Note that =snakemake= must be invoked with =--use-singularity= in order to run rules in containers.
rule tidyvcf:
    input: "some.vcf",
    output: "some.tsv",
    params: "--lenient -v"
    container: "docker://registry.gitlab.com/jdm204/tidyvcf:latest",
    shell: "tidyvcf -i {input} -o {output} {params}"