bamcalib utility

bamcalib is a small rust program to normalize calibrated chip-seq data using a normalization procedure build upon the original method of Hu et al..

Installation

You need cargo installed

Then run the following command:

sh cargo install --git http://gitbio.ens-lyon.fr/LBMC/Bernard/bamcalib.git

You can also use the lbmc/bamcalib:0.1.2 Docker image

sh docker run -it lbmc/bamcalib:0.1.2 bamcalib

Usage

Example

sh bamcalib \ -i IP_11_sorted.bam \ -w T_11_sorted.bam \ -bigwig-ip IP_11.bigwig \ -bigwig-WCE T_11.bigwig \ -t 8

The bamcalib expect the following parameters:

Optional Arguments

Method

Normalization

Hu et al. proposed the following normalization procedure to get the normalized coverage in the IP sample ($\text{norm}IP\left(t\right)$) from:

math \text{norm}IP\left(t\right) = IP_{x}\left(t\right) \frac{10^9 \sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{\sum_{t^{'}=1}^{T_c}IP_{c}\left(t^{'}\right)\sum_{t^{'}=1}^{T_x}WCE_{x}\left(t^{'}\right)}

In a reference genome of size $T_x$ (ignoring the chromosome segmentation), and in a calibration genome of size $T_c$.

To be able to better analyze the coverage information at repetitive regions of the gnome, we propose to extend the previous normalization to normalize the signal nucleotide by nucleotide and work with the following quantity:

math \text{ratio}IP\left(t\right) = \frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}

with:

math \text{norm}WCE\left(t\right) = WCE_{x}\left(t\right) \frac{10^9}{\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)} \alpha

We then find $\alpha$ such that:

math E\left(\text{norm}IP\left(t\right)\right)= E\left(\frac{\text{norm}IP\left(t\right)}{\text{norm}WCE\left(t\right)}\right)

which gives

math \alpha = \frac{\sum_{t^{'}=1}^{T_c}WCE_{c}\left(t^{'}\right)}{10^9} \frac{\sum_{t=1}^{T_x}\frac{IP_{x}\left(t\right)}{WCE_{x}\left(t\right)}}{\sum_{t=1}^{T_x}IP_{x}\left(t\right)}

Not that all the computation is made at the nucleotide level and not at the read level as in Hu et al., this is also why we use $10^9$ as a scaling factor instead of $10^6$.

With this method, we retain the interesting properties of Hu et al. normalization on the average read density between samples (i.e., we can compare two different samples in a quantitative way) and we account for the local bias of read density observed in the WCE samples (differential chromatin accessibility, repetition, low mappability region, etc.).

Genome coverage density

To compute the coverage density $Xy(t)$ with $X \in \left{IP, WCE\right}$ and $y \in \left{c, x\right}$ we count the number of read $r(t)$ overlapping with position $t$. For properly paired reads (with a mate read on the same chromosome and with a starting position ending after the end of the read) we also count a density of 1 between the end of the first reads and the start of his mate read $g(t)$. $Xy(t) = r(t) + g(t)$.

Some fragment can be artificially long, therefore, we compute a robust mean $\mu$ of the gap size, between two reads of a pair, by removing the 0.1 upper and lower value of fragment length. Fragment that has a size higher than $\phi^{-1}(0.95, /mu, 1.0)$ are set to end at the $\phi^{-1}(0.95, /mu, 1.0)$ value, with $\phi()$ the Normal CDF function.

Some fragment can be shorter than the read length in this case we don't count the overlapping reads region as a coverage of 2 fragment but as a coverage of 1 fragment.