Process the UniParc XML file (`uniparc_all.xml.gz`) downloaded from the UniProt website into CSV files that can be loaded into a relational database.
Parsing 1 million lines takes about 5.5 seconds:
```
$ mkdir uniparc
$ time bash -c "zcat tests/uniparc1mil.xml.gz | uniparcxml_parser >/dev/null"

real    0m5.564s
user    0m5.528s
sys     0m0.132s
```
The actual `uniparc_all.xml.gz` file contains about 5 billion lines.
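For a rough sense of scale, the timing above can be extrapolated to the full file. This is only a back-of-the-envelope sketch: it assumes throughput stays linear in the number of lines and that the 5 billion figure is approximate. The function name `estimate_hours` is made up for this illustration.

```python
def estimate_hours(total_lines, sample_lines=1_000_000, sample_seconds=5.564):
    """Extrapolate wall-clock time from the 1-million-line benchmark.

    Assumes (hypothetically) that parsing time scales linearly with line count.
    """
    seconds = total_lines / sample_lines * sample_seconds
    return seconds / 3600


if __name__ == "__main__":
    # About 5 billion lines, as stated for uniparc_all.xml.gz.
    print(f"{estimate_hours(5_000_000_000):.1f} hours")
```

Under those assumptions, a single sequential pass over the full file takes on the order of several hours.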
Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?

Processing `uniparc_all.xml.gz` as a single file makes it easier to create an incremental unique index column (e.g. `UniparcXRef.idx`, `Property.idx`, etc.).
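The following Python sketch illustrates the general idea of assigning incremental index values during one sequential pass. It is not the parser's actual implementation. The sample XML is a simplified, hypothetical stand-in for the real UniParc schema, and the function name and CSV column layout are assumptions made for this example.

```python
import csv
import io
import itertools
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; the real UniParc XML has more fields.
SAMPLE_XML = b"""<uniparc>
  <entry>
    <accession>UPI0000000001</accession>
    <dbReference type="EMBL" id="A1">
      <property type="protein_name" value="p1"/>
    </dbReference>
    <dbReference type="EMBL" id="A2"/>
  </entry>
  <entry>
    <accession>UPI0000000002</accession>
    <dbReference type="RefSeq" id="B1">
      <property type="gene_name" value="g1"/>
      <property type="NCBI_taxonomy_id" value="9606"/>
    </dbReference>
  </entry>
</uniparc>"""


def parse_with_running_idx(stream, xref_out, prop_out):
    """Stream entries once, numbering xref and property rows as they appear."""
    # One counter per output table; a single pass keeps them gap-free and unique.
    xref_idx = itertools.count(1)
    prop_idx = itertools.count(1)
    xref_writer = csv.writer(xref_out, lineterminator="\n")
    prop_writer = csv.writer(prop_out, lineterminator="\n")

    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "entry":
            continue
        accession = elem.findtext("accession")
        for xref in elem.findall("dbReference"):
            idx = next(xref_idx)
            xref_writer.writerow([idx, accession, xref.get("type"), xref.get("id")])
            for prop in xref.findall("property"):
                # Each property row points back at its parent xref via idx.
                prop_writer.writerow(
                    [next(prop_idx), idx, prop.get("type"), prop.get("value")]
                )
        elem.clear()  # release the finished entry to keep memory use flat


if __name__ == "__main__":
    xrefs, props = io.StringIO(), io.StringIO()
    parse_with_running_idx(io.BytesIO(SAMPLE_XML), xrefs, props)
    print(xrefs.getvalue(), props.getvalue(), sep="\n")
```

With several workers processing separate chunks, each would need a pre-assigned index range or a renumbering step afterwards. A single pass avoids that coordination.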