Process the UniParc XML file (`uniparc_all.xml.gz`) downloaded from the UniProt website into CSV files that can be loaded into a relational database.
Uncompressed XML data can be piped into `uniparc_xml_parser` in order to generate the output files:
```bash
$ curl -sS ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/uniparc_all.xml.gz \
    | zcat \
    | uniparc_xml_parser
```
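
If `uniparc_all.xml.gz` has already been downloaded, the same pipeline can be run against the local copy. This minimal variant of the command above assumes the parser writes its output files to the current working directory:

```bash
$ zcat uniparc_all.xml.gz | uniparc_xml_parser
```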
The output is a set of CSV (or more specifically TSV) files:
```bash
$ ls
component.tsv  gene_name.tsv  ncbi_taxonomy_id.tsv  protein_name.tsv  uniparc.tsv               xref.tsv
domain.tsv     ncbi_gi.tsv    pdb_chain.tsv         proteome_id.tsv   uniprot_kb_accession.tsv
```
The generated CSV files conform to the following schema:
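
As a rough sketch of the final loading step, the TSV files can be bulk-imported into a relational database, here SQLite via its `.import` command. The table names are assumed to mirror the file names, and the snippet assumes each file carries a header row; adjust the import settings if the parser emits headerless output.

```bash
# Load every TSV file into a SQLite database, one table per file.
for f in *.tsv; do
    sqlite3 uniparc.db <<EOF
.mode tabs
.import ${f} ${f%.tsv}
EOF
done
```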
Parsing 10,000 XML entries takes around 30 seconds (the process is mostly IO-bound):
```bash
$ time bash -c "zcat uniparc_top_10k.xml.gz | uniparc_xml_parser > /dev/null"

real    0m33.925s
user    0m36.800s
sys     0m1.892s
```
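
For benchmarking on a subset, a small test file such as `uniparc_top_10k.xml.gz` can be produced by keeping the XML prologue plus the first 10,000 `<entry>` elements and re-closing the root element. This is only a sketch; it assumes the root element is `<uniparc>` and that each `<entry>` start tag begins on its own line:

```bash
# Keep everything up to (but not including) the 10,001st <entry> element,
# then close the root element and recompress.
zcat uniparc_all.xml.gz \
    | awk '/<entry/ { n++ } n > 10000 { exit } { print }' \
    > uniparc_top_10k.xml
echo '</uniparc>' >> uniparc_top_10k.xml
gzip uniparc_top_10k.xml
```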
The actual `uniparc_all.xml.gz` file has around 373,914,570 elements.
Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?

Processing `uniparc_all.xml.gz` as a single stream makes it easier to create an incremental unique index column (e.g. `xref.xref_id`): each row gets its identifier from a single counter as it is written, which would not be possible if independent workers produced the files in parallel.