ATG

ATG is a library and standalone CLI tool to handle and convert different data formats used in genomics and transcriptomics. The library provides convenient APIs to parse GTF and RefGene data and work with the resulting transcripts for all kinds of downstream analyses.

The binary can be used to convert between GTF and RefGene data, generate bed files for transcripts or generate the nucleotide sequence as Fasta.

The main purpose is exactly that: converting between the GTF and RefGene file formats. Surprisingly, there are not many tools that do this properly. Even atg does not handle all edge cases of GTF, but I tried to handle as many as possible. In addition, transcripts can also be written in bed format or as fasta.

The project started only because I wanted to learn Rust. You will see that some sections have really bad code, while others are somewhat better. Overall, I'm still very new to Rust and I'm sure I fell into many traps and used lots of unidiomatic code. I'm happy to receive any feedback and suggestions for improvement.

The library is still in its infancy, but it works and handles what it is supposed to do. The current API is probably going to change a lot in future updates, so be careful when using atg in production or other critical workflows.

Usage

ATG command line tool

Install

cargo

The easiest way to install ATG is to use cargo (if you have cargo and rust installed):

```bash
cargo install atg
```

Pre-built binaries

You can download pre-built binaries for Linux and Mac (M1) from GitHub. Save the downloaded binary in a folder in your PATH, or create a symlink to it named atg.

From source

You can also build ATG from source (if you have the rust toolchains installed):

```bash
git clone https://github.com/anergictcell/atg.git
cd atg
cargo build --release
```

Run

Convert a GTF file to a RefGene file:

```bash
atg --from gtf --to refgene --input /path/to/input.gtf --output /path/to/output.refgene
```

Convert RefGene to GTF:

```bash
atg --from refgene --to gtf --input /path/to/input.refgene --output /path/to/output.gtf
```

Convert RefGene to bed:

```bash
atg --from refgene --to bed --input /path/to/input.refgene --output /path/to/output.bed
```

Supported `--output` formats

gtf

Output in GTF format.

```text
chr9 ncbiRefSeq.2021-05-17 transcript 74526555 74600974 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2"; gene_name "C9orf85";
chr9 ncbiRefSeq.2021-05-17 exon 74526555 74526752 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2"; exon_number "1"; exon_id "NM_001365057.2.1"; gene_name "C9orf85";
chr9 ncbiRefSeq.2021-05-17 5UTR 74526555 74526650 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2"; exon_number "1"; exon_id "NM_001365057.2.1"; gene_name "C9orf85";
chr9 ncbiRefSeq.2021-05-17 CDS 74526651 74526752 . + 0 gene_id "C9orf85"; transcript_id "NM_001365057.2"; exon_number "1"; exon_id "NM_001365057.2.1"; gene_name "C9orf85";
chr9 ncbiRefSeq.2021-05-17 exon 74561922 74562028 . + . gene_id "C9orf85"; transcript_id "NM_001365057.2"; exon_number "2"; exon_id "NM_001365057.2.2"; gene_name "C9orf85";
chr9 ncbiRefSeq.2021-05-17 CDS 74561922 74562026 . + 0 gene_id "C9orf85"; transcript_id "NM_001365057.2"; exon_number "2"; exon_id "NM_001365057.2.2"; gene_name "C9orf85";
...
```

You can specify the value of the source column manually using the --gtf-source/-g option. It defaults to atg.
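To make the attribute column in the GTF example above more concrete, here is a minimal sketch of how such a `key "value";` list could be split into a map. It uses only the Rust standard library. The function name `parse_gtf_attributes` is made up for this illustration and is not part of the atg API, and the sketch ignores GTF edge cases such as semicolons inside quoted values.

```rust
use std::collections::HashMap;

/// Splits a GTF attribute column (`key "value"; key "value";`) into a map.
/// Hypothetical helper for illustration only, not how atg parses GTF.
fn parse_gtf_attributes(column: &str) -> HashMap<String, String> {
    column
        .split(';')
        .filter_map(|pair| {
            // Each attribute looks like `key "value"`; split at the first space.
            // Empty fragments (e.g. after the trailing `;`) are skipped.
            let (key, value) = pair.trim().split_once(' ')?;
            let value = value.trim().trim_matches('"');
            Some((key.to_string(), value.to_string()))
        })
        .collect()
}
```

Called with the attribute column of the first exon row above, this would yield entries such as `gene_id` and `exon_number` with the quotes removed.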

refgene

Output in the refGene format, as used by some UCSC and NCBI RefSeq services.

```text
0 NM_001101.5 chr7 - 5566778 5570232 5567378 5569288 6 5566778,5567634,5567911,5568791,5569165,5570154, 5567522,5567816,5568350,5569031,5569294,5570232, 0 ACTB cmpl cmpl 0,1,0,0,0,-1,
0 NM_001203247.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543561,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0 NM_001203248.2 chr7 - 148504474 148581383 148504737 148544390 20 148504474,148506162,148506401,148507424,148508716,148511050,148512005,148512597,148513775,148514313,148514968,148516687,148523560,148524255,148525831,148526819,148529725,148543588,148544273,148581255, 148504798,148506247,148506482,148507506,148508812,148511229,148512131,148512638,148513870,148514483,148515209,148516779,148523724,148524358,148525972,148526940,148529842,148543690,148544397,148581383, 0 EZH2 cmpl cmpl 2,1,1,0,0,1,1,2,0,1,0,1,2,1,1,0,0,0,0,-1,
0 NM_001354750.2 chr11 + 113930432 114127487 113934022 114121277 7 113930432,113933932,114027058,114057673,114112888,114117919,114121047, 113930864,113935290,114027156,114057760,114113059,114118087,114127487, 0 ZBTB16 cmpl cmpl -1,0,2,1,1,1,1,
```
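Each refGene row above has sixteen columns, which in the UCSC table layout are bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat and exonFrames. The following sketch shows how the exon coordinates of one row could be extracted with the standard library alone. The struct `RefGeneExons` and both functions are hypothetical names for this example and do not reflect atg's own data model; the sketch also assumes that no column is empty, so splitting on whitespace is safe.

```rust
/// Exon coordinates of one refGene row.
/// Hypothetical struct for illustration, not atg's transcript model.
#[derive(Debug, PartialEq)]
struct RefGeneExons {
    name: String,
    chrom: String,
    strand: char,
    /// (exonStart, exonEnd) pairs, exactly as given in the file.
    exons: Vec<(u64, u64)>,
}

/// Parses a comma-separated coordinate list with a trailing comma.
fn parse_coordinates(list: &str) -> Option<Vec<u64>> {
    list.split(',')
        .filter(|item| !item.is_empty())
        .map(|item| item.parse::<u64>().ok())
        .collect()
}

/// Extracts name, chromosome, strand and exons from one refGene row.
/// Returns None if the row is malformed.
fn parse_refgene_exons(line: &str) -> Option<RefGeneExons> {
    let cols: Vec<&str> = line.split_whitespace().collect();
    if cols.len() < 16 {
        return None;
    }
    let exon_count: usize = cols[8].parse().ok()?;
    let starts = parse_coordinates(cols[9])?;
    let ends = parse_coordinates(cols[10])?;
    // exonCount must agree with both coordinate lists.
    if starts.len() != exon_count || ends.len() != exon_count {
        return None;
    }
    Some(RefGeneExons {
        name: cols[1].to_string(),
        chrom: cols[2].to_string(),
        strand: cols[3].chars().next()?,
        exons: starts.into_iter().zip(ends).collect(),
    })
}
```

For the ACTB row above, this would return six exon pairs on the minus strand of chr7.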

fasta

Writes the cDNA sequence of all transcripts into one file. Please note that the sequence is stranded.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

This output allows different --fasta-format options:

- transcript: The full transcript sequence (from the genomic start to end position, including introns)
- exons: The cDNA sequence of the processed transcript, i.e. the sequence of all exons, including non-coding exons.
- cds (default): The CDS of the transcript

```text
>NM_007298.3 BRCA1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGC
TATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAAC
CTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA
CTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGA
TATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTG
...
>NM_001365057.2 C9orf85
ATGAGCTCCCAGAAAGGCAACGTGGCTCGTTCCAGACCTCAGAAGCACCA
GAATACGTTTAGCTTCAAAAATGACAAGTTCGATAAAAGTGTGCAGACCA
AGAAAATTAATGCAAAACTTCATGATGGAGTATGTCAGCGCTGTAAAGAA
GTTCTTGAGTGGCGTGTAAAATACAGCAAATACAAACCATTATCAAAACC
TAAAAAGTGA
...
```
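The note that the sequence is stranded is easiest to see in code. Assuming it means that transcripts on the minus strand are written as the reverse complement of the reference genome sequence (the usual convention, but an assumption here), the operation can be sketched as follows. The function `reverse_complement` is written for this example and is not taken from atg.

```rust
/// Returns the reverse complement of a DNA sequence.
/// Illustrative sketch only; N and any other symbols are passed through unchanged.
fn reverse_complement(seq: &str) -> String {
    seq.chars()
        .rev()
        .map(|base| match base {
            'A' => 'T',
            'T' => 'A',
            'C' => 'G',
            'G' => 'C',
            'a' => 't',
            't' => 'a',
            'c' => 'g',
            'g' => 'c',
            other => other,
        })
        .collect()
}
```

Under that assumption, a transcript on the plus strand such as NM_001365057.2 would be read straight from the reference, while one on the minus strand such as NM_007298.3 would go through a step like this before being written.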

fasta-split

Like fasta above, but with one file for each transcript. Instead of an output file, you must specify an output directory. ATG will save each transcript as <Transcript_name>.fasta, e.g. NM_001365057.2.fasta.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

feature-sequence

cDNA sequence of each feature (5' UTR, CDS, 3' UTR), each in a separate row.

This target format requires a reference genome fasta file that must be specified using --reference/-r.

```text
BRCA1 NM_007298.3 chr17 41196311 41197694 - 3UTR CTGCAGCCAGCCAC...
BRCA1 NM_007298.3 chr17 41197694 41197819 - CDS CAATTGGGCAGATGTGTG...
BRCA1 NM_007298.3 chr17 41199659 41199720 - CDS GGTGTCCACCCAATTGTG...
BRCA1 NM_007298.3 chr17 41201137 41201211 - CDS ATCAACTGGAATGGATGG...
BRCA1 NM_007298.3 chr17 41203079 41203134 - CDS ATCTTCAGGGGGCTAGAA...
BRCA1 NM_007298.3 chr17 41209068 41209152 - CDS CATGATTTTGAAGTCAGA...
BRCA1 NM_007298.3 chr17 41215349 41215390 - CDS GGGTGACCCAGTCTATTA...
BRCA1 NM_007298.3 chr17 41215890 41215968 - CDS ATGCTGAGTTTGTGTGTG...
BRCA1 NM_007298.3 chr17 41219624 41219712 - CDS ATGCTCGTGTACAAGTTT...
BRCA1 NM_007298.3 chr17 41222944 41223255 - CDS AGGGAACCCCTTACCTGG...
C9orf85 NM_001365057.2 chr9 74526555 74526650 + 5UTR ATTGACAGAA...
C9orf85 NM_001365057.2 chr9 74526651 74526752 + CDS ATGAGCTCCCAGAA...
C9orf85 NM_001365057.2 chr9 74561922 74562028 + CDS AAAATTAATGCAAA...
C9orf85 NM_001365057.2 chr9 74597573 74597573 + CDS A
C9orf85 NM_001365057.2 chr9 74597574 74600974 + 3UTR TGGAGTCTCC...
```

raw

This is mainly useful for debugging, as it gives a quick glimpse into the Exons and CDS coordinates of the transcripts.

bin

Save Transcripts in ATG binary format for faster re-reading.

Tips

Reading GTF files is rather slow due to the complexity of the format. If you need to read data from GTF repeatedly, I recommend generating a RefGene or binary file once and using it as input for subsequent steps.

You can change the verbosity by adding -v (show info messages), -vv (debug) or -vvv (trace).

On most Linux systems, you can use --input /dev/stdin and/or --output /dev/stdout to pipe into and out of atg.

Of course, all commands also have shorthand parameters:

- -f, --from
- -t, --to
- -i, --input
- -o, --output

ATG as library

The library API is mostly documented inline, and the documentation is available on docs.rs.

Examples

Convert GTF to RefGene

```no_run
use atg::gtf::Reader;
use atg::refgene::Writer;
use atg::models::{TranscriptRead, TranscriptWrite};

// Open the GTF input file
let mut reader = Reader::from_file("path/to/input.gtf")
    .unwrap_or_else(|_| panic!("Error opening input file."));

// Create the RefGene output file
let mut writer = Writer::from_file("path/to/output.refgene")
    .unwrap_or_else(|_| panic!("Unable to open output file"));

// Parse all transcripts from the GTF file
let transcripts = reader.transcripts()
    .unwrap_or_else(|err| panic!("Error parsing GTF: {}", err));

// Write the transcripts in RefGene format
match writer.write_transcripts(&transcripts) {
    Ok(_) => println!("Success"),
    Err(err) => panic!("Error writing RefGene file: {}", err)
};
```

ToDo / Next tasks

- [x] Add to crates.io
- [x] Bed module to generate bed files with exons and introns
- [ ] Compare transcripts from two different inputs
- [x] Add fasta reading for nt and aa sequence outputs
- [x] Binary data format

Known issues

GTF parsing

- [ ] NM_001371720.1 has two book-ended exons (155160639-155161619 || 155161620-155162101). During input parsing, book-ended features are merged into one exon.
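
To illustrate why book-ended exons end up as a single exon, here is a minimal sketch of an interval merge that joins features which overlap or touch. This is not atg's actual parsing code; the function `merge_book_ended` is a hypothetical stand-in, and it assumes 1-based, inclusive coordinates as used in GTF.

```rust
/// Merges features that overlap or are directly adjacent ("book-ended").
/// Coordinates are 1-based, inclusive (start, end) pairs.
/// Illustrative sketch only, not the implementation used by atg.
fn merge_book_ended(mut features: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    features.sort_unstable();
    let mut merged: Vec<(u64, u64)> = Vec::with_capacity(features.len());
    for (start, end) in features {
        if let Some(previous) = merged.last_mut() {
            // A feature starting at or before `previous end + 1` leaves no gap,
            // so it extends the previous feature instead of starting a new one.
            if start <= previous.1 + 1 {
                previous.1 = previous.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}
```

With the two exons of NM_001371720.1 as input, a merge like this returns one exon spanning 155160639-155162101, which matches the behaviour described in the issue.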
