tidyVCF is a small tool to convert VCF files to tidy tab- or comma-separated tables, ideal for downstream analysis with R's tidyverse or Julia's DataFrames ecosystems. All fields are included by default, keeping the command line simple. tidyVCF is written in pure Rust, relying on the excellent noodles-vcf crate developed by @zaeleus and contributors.
Note: The tool works for me, but it isn't ready for production use yet. It's built on a fairly experimental API, it lacks proper testing, and it's quite brittle: it does not handle many of the VCF flavours found in the wild, and it errors out gracelessly at even the most minor spec violations.
cargo install tidyvcf
TBD.
CSV output with -c/--csv (default is TSV):
tidyvcf -i test.vcf -c -o test.csv
BGZF compressed VCFs are detected by file extension and handled automatically:
tidyvcf -i test.vcf.gz -o test.tsv
If dealing with compressed data from stdin, use the --bgzip flag:
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
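The input-handling rule described above can be sketched in a few lines. This is an illustration in Python, not tidyvcf's actual Rust code: the function name `should_decompress` and the set of recognised extensions are assumptions made for the example.

```python
from typing import Optional

# Assumed set of extensions treated as BGZF-compressed; the real list may differ.
COMPRESSED_EXTENSIONS = (".gz", ".bgz")


def should_decompress(path: Optional[str], bgzip_flag: bool = False) -> bool:
    """Decide whether the input should be read through a BGZF decoder.

    `path` is None when reading from stdin. In that case there is no file
    extension to inspect, so only the explicit flag can enable decoding.
    """
    if bgzip_flag:
        return True
    if path is None:
        return False
    return path.lower().endswith(COMPRESSED_EXTENSIONS)
```

The sketch shows why stdin needs the explicit flag: a stream carries no file name to inspect.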
It is common to perform variant calling on several related samples together, which yields VCFs with multiple sets of 'genotype' or FORMAT fields, one for each sample. By default, tidyvcf joins sample names to the names of the format fields with the underscore ('_') character: S1_GT S1_DP S2_GT S2_DP...
The --format-delim option allows changing the sample-format field delimiter:
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
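As a rough illustration of the naming scheme, the following Python sketch builds the per-sample column names from a list of samples and FORMAT fields. The function name `wide_format_columns` is hypothetical, and the sketch only mirrors the naming pattern described above, not the tool's internals.

```python
from typing import List


def wide_format_columns(samples: List[str], format_fields: List[str],
                        delim: str = "_") -> List[str]:
    """Build one column name per (sample, FORMAT field) pair."""
    return [f"{sample}{delim}{field}"
            for sample in samples
            for field in format_fields]


print(wide_format_columns(["S1", "S2"], ["GT", "DP"]))
# ['S1_GT', 'S1_DP', 'S2_GT', 'S2_DP']
print(wide_format_columns(["S1", "S2"], ["GT", "DP"], delim="~"))
# ['S1~GT', 'S1~DP', 'S2~GT', 'S2~DP']
```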
This wide layout violates the tidy data principle, because each sample's values sit in separate columns rather than in separate rows. To avoid this, we can stack samples into rows, at the cost of repeating the static and INFO columns for each sample.
Stacking samples:
tidyvcf -i test.vcf --stack -o test_stacked.tsv
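To make the reshaping concrete, here is a minimal Python sketch that turns one wide row into one row per sample. It is an assumption-laden illustration: the `sample` column name, the `chrom`/`pos` column names, and the function `stack_samples` are all hypothetical and may not match what tidyvcf actually emits.

```python
from typing import Dict, List


def stack_samples(row: Dict[str, str], samples: List[str],
                  delim: str = "_") -> List[Dict[str, str]]:
    """Turn one wide row into one row per sample.

    Columns named "<sample><delim><field>" become "<field>"; every other
    column (static and INFO) is copied into each output row.
    """
    prefixes = {s: f"{s}{delim}" for s in samples}
    shared = {k: v for k, v in row.items()
              if not any(k.startswith(p) for p in prefixes.values())}
    stacked = []
    for sample, prefix in prefixes.items():
        out = dict(shared)
        out["sample"] = sample  # hypothetical column name
        for key, value in row.items():
            if key.startswith(prefix):
                out[key[len(prefix):]] = value
        stacked.append(out)
    return stacked


# Hypothetical wide row with two samples.
wide = {"chrom": "chr1", "pos": "100", "info_DP": "30",
        "S1_GT": "0/1", "S1_DP": "14", "S2_GT": "1/1", "S2_DP": "16"}
rows = stack_samples(wide, ["S1", "S2"])
print(rows[0])
# {'chrom': 'chr1', 'pos': '100', 'info_DP': '30', 'sample': 'S1', 'GT': '0/1', 'DP': '14'}
```

Note how the shared columns are duplicated in every output row; that duplication is the cost mentioned above.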
To avoid clashes in field names between INFO and FORMAT columns, INFO field names are prefixed with the string "info_" by default. This behaviour can be adjusted with the --info-prefix option:
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
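The kind of clash the prefix guards against can be shown with a small Python sketch. It assumes the prefix is prepended verbatim to each INFO field name, which the text above suggests but does not state outright; `prefix_info_fields` is a hypothetical helper.

```python
from typing import List


def prefix_info_fields(info_fields: List[str], prefix: str = "info_") -> List[str]:
    """Prepend the prefix verbatim to every INFO field name (assumed behaviour)."""
    return [f"{prefix}{name}" for name in info_fields]


info_fields = ["DP", "AF"]
format_fields = ["GT", "DP"]

# Without a prefix, "DP" would name both an INFO and a FORMAT column.
clashes = set(info_fields) & set(format_fields)
print(clashes)
# {'DP'}

print(prefix_info_fields(info_fields))
# ['info_DP', 'info_AF']
print(prefix_info_fields(info_fields, prefix="i"))
# ['iDP', 'iAF']
```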
CSQ INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you can use the -v/--vep-fields flag to extract those fields into individual columns:
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
By default, the output VEP column names are prefixed with "vep_" to avoid name collisions (for example CSQ/VAF and FMT/VAF). This string can be customised with the --vep-prefix option:
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
Note: Only the first annotated transcript for a record is split; the others are bundled unsplit into an additional column named CSQ_other_transcripts.
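The splitting described in this section can be sketched in Python as follows. The sketch relies on the usual VEP conventions: sub-field names are listed after "Format:" in the CSQ header description, transcripts are comma-separated, and sub-fields are pipe-separated. The helper names and the example header and value are made up for illustration, and tidyvcf's own implementation may differ in detail.

```python
import re
from typing import Dict, List


def csq_field_names(header_line: str) -> List[str]:
    """Extract the pipe-separated sub-field names from the CSQ INFO header."""
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format: ...' list found in CSQ header")
    return match.group(1).strip().split("|")


def split_csq(value: str, names: List[str], prefix: str = "vep_") -> Dict[str, str]:
    """Split the first transcript into columns; keep the rest unsplit."""
    first, _, rest = value.partition(",")
    parts = first.split("|")
    if len(parts) != len(names):
        raise ValueError("CSQ entry does not match the header's field list")
    columns = {f"{prefix}{name}": part for name, part in zip(names, parts)}
    columns["CSQ_other_transcripts"] = rest
    return columns


# Made-up header and CSQ value with two annotated transcripts.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">')
value = "A|missense_variant|MODERATE|BRCA2,A|intron_variant|MODIFIER|BRCA2"

names = csq_field_names(header)
columns = split_csq(value, names)
print(columns["vep_Consequence"])
# missense_variant
print(columns["CSQ_other_transcripts"])
# A|intron_variant|MODIFIER|BRCA2
```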
The noodles Rust library emphasises correctness in an ecosystem where that hasn't always been standard, so in practice it rejects many VCFs produced by variant callers because they do not adhere to the spec.
tidyvcf comes with a -l/--lenient option that tries to fix spec non-compliant headers using hardcoded replacement rules before conversion.
Currently, this option is sufficient to convert VCFs produced by octopus, for example.
Feel free to raise an issue if this option doesn't help for other spec-non-compliant-but-basically-fine VCFs.
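The general idea of header fixing with hardcoded replacement rules can be sketched in Python. The single rule shown here (quoting an unquoted Description value) is invented purely for illustration and is not taken from tidyvcf; the rules the tool actually applies are not listed in this README.

```python
import re

# Hypothetical rule list, for illustration only. Each entry is a
# (compiled pattern, replacement) pair applied to header lines.
HEADER_RULES = [
    # Wrap an unquoted Description value in double quotes.
    (re.compile(r'Description=([^",>][^>]*)>$'), r'Description="\1">'),
]


def fix_header_line(line: str) -> str:
    """Apply every replacement rule to meta-information (##...) lines only."""
    if not line.startswith("##"):
        return line
    for pattern, replacement in HEADER_RULES:
        line = pattern.sub(replacement, line)
    return line


broken = "##INFO=<ID=XX,Number=1,Type=String,Description=Some text>"
print(fix_header_line(broken))
# ##INFO=<ID=XX,Number=1,Type=String,Description="Some text">
```

Lines that already comply, and all non-header lines, pass through unchanged.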
| Feature                                | tidyVCF    | rbt vcf-to-txt                                | bcftools -f            | gatk VariantsToTable   |
|----------------------------------------|------------|-----------------------------------------------|------------------------|------------------------|
| include all fields                     | by default | individually specified; currently no FILTER   | individually specified | individually specified |
| include a subset of fields             | ❌         | individually specified; currently no FILTER   | individually specified | individually specified |
| long format                            | --stack    | ❌                                            | ❌                     | ❌                     |
| pipeable                               | ✓          | ✓                                             | ✓                      | ❌                     |
| compressed input without external tool | ✓          | ❌                                            | ✓                      | ?                      |
| bcf input                              | ❌         | ❌                                            | ✓                      | ?                      |