UniParc XML parser

Introduction
Usage
Table schema
Installation
- Binaries
- Cargo
- Conda
Output files
- Parquet
- Google BigQuery
Benchmarks
Roadmap
FAQ (Frequently Asked Questions)
FUQ (Frequently Used Queries)

Introduction

Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.

Usage

Uncompressed XML data can be piped into uniparc_xml_parser in order to

bash $ curl -sS ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/uniparc_all.xml.gz \ | zcat \ | uniparc_xml_parser

The output is a set of CSV (or more specifically TSV) files:

bash $ ls -rw-r--r-- 1 user group 174G Feb 9 13:52 xref.tsv -rw-r--r-- 1 user group 149G Feb 9 13:52 domain.tsv -rw-r--r-- 1 user group 138G Feb 9 13:52 uniparc.tsv -rw-r--r-- 1 user group 107G Feb 9 13:52 protein_name.tsv -rw-r--r-- 1 user group 99G Feb 9 13:52 ncbi_taxonomy_id.tsv -rw-r--r-- 1 user group 74G Feb 9 20:13 uniparc.parquet -rw-r--r-- 1 user group 64G Feb 9 13:52 gene_name.tsv -rw-r--r-- 1 user group 39G Feb 9 13:52 component.tsv -rw-r--r-- 1 user group 32G Feb 9 13:52 proteome_id.tsv -rw-r--r-- 1 user group 15G Feb 9 13:52 ncbi_gi.tsv -rw-r--r-- 1 user group 21M Feb 9 13:52 pdb_chain.tsv -rw-r--r-- 1 user group 12M Feb 9 13:52 uniprot_kb_accession.tsv -rw-r--r-- 1 user group 656K Feb 9 04:04 uniprot_kb_accession.parquet

Table schema

The generated CSV files conform to the following schema:

Installation

Binaries

Linux binaries are available here: https://gitlab.com/ostrokach/uniparcxmlparser/-/packages.

Cargo

Use cargo to compile and install uniparc_xml_parser for your target platform:

bash cargo install uniparc_xml_parser

Conda

Use conda to install precompiled binaries:

bash conda install -c ostrokach-forge uniparc_xml_parser

Output files

Parquet

Parquet files containing the processed data are available at the following URL and are updated monthly: http://uniparc.data.proteinsolver.org/.

Google BigQuery

The data can also be queried directly using Google BigQuery: https://console.cloud.google.com/bigquery?project=ostrokach-data&p=ostrokach-data&page=dataset&d=uniparc.

Benchmarks

Parsing 10,000 XML entires takes around 30 seconds (the process is mostly IO-bound):

```bash $ time bash -c "zcat uniparctop10k.xml.gz | uniparcxmlparser >/dev/null"

real 0m33.925s user 0m36.800s sys 0m1.892s ```

The actual uniparc_all.xml.gz file has around 373,914,570 elements.

Roadmap

[ ] Keep everything in bytes all the way until output.

FAQ (Frequently Asked Questions)

Why not split uniparc_all.xml.gz into multiple small files and process them in parallel?

Splitting the file requires reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?
Having a single process which parses uniparc_all.xml.gz makes it easier to create an incremental unique index column (e.g. xref.xref_id).

FUQ (Frequently Used Queries)

TODO