Alphabeta (but it's fast)

This repository contains a fast implementation of the AlphaBeta algorithm first proposed by Yadollah Shahryary, Frank Johannes and Rashmi Hazarika. The original R implementation is accessible on GitHub here. I matched the parameters of the original program, so the old documentation is still useful.

Additionally, this repository contains a program for creating metaprofiles of (epi)genetic data. The two are connected, so output from the metaprofile program can automatically be fed into AlphaBeta.

How to use

Dependencies

The program depends on OpenBLAS for fast matrix calculations. On Linux, you can install it with your package manager; on MacOS, you can use Homebrew. On Windows, you can download prebuilt binaries from the OpenBLAS website.

Original Installation instructions here.

Debian/Ubuntu:

```bash
sudo apt update
sudo apt install libopenblas-dev
```

If you don't have sudo rights, ask your system administrator.

MacOS:

```bash
brew install openblas
```

Windows:

Download the prebuilt binaries from the OpenBLAS website (Big .zip button). For building from source, see this.

Building from source

If you are somewhat familiar with coding and git, I'd recommend this approach: You'll need to install Rust and git first. Then, you can clone this repository and build the program yourself:

```bash
git clone https://github.com/constantingoeldel/alphabeta-rs.git
cd alphabeta-rs
cargo install --path .
```

This will install the programs on your system. You can then ensure everything works by running:

```bash
alphabeta --help
```

and

```bash
metaprofile --help
```

If you receive an error message about libopenblas, you will need to run the code with cargo instead (I don't really understand this issue):

```bash
cargo run --release --bin alphabeta
```

or

```bash
cargo run --release --bin metaprofile
```
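If you run into the libopenblas issue regularly, a small wrapper can choose between the installed binary and cargo for you. This is only a sketch: the function name `run_alphabeta` is made up for this example, it assumes you call it from inside the cloned alphabeta-rs directory, and it treats a failing `alphabeta --help` as the sign that the installed binary is unusable.

```bash
# Hypothetical wrapper, not part of this repository.
# Uses the installed binary if it starts correctly, otherwise falls back to cargo.
run_alphabeta() {
    if command -v alphabeta >/dev/null 2>&1 && alphabeta --help >/dev/null 2>&1; then
        alphabeta "$@"
    else
        # Everything after "--" is passed on to the program instead of to cargo
        cargo run --release --bin alphabeta -- "$@"
    fi
}

# Usage (same arguments as the alphabeta binary):
# run_alphabeta --edges ./data/edgelist.txt --nodes ./data/nodelist.txt --output ./data/output
```

The same pattern works for metaprofile by swapping the binary name.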

Updating

If you want to use a new version of the program, either download the new binaries from the same source or run:

```bash
cd alphabeta-rs
git pull
cargo install --path .
```

Parameters

Alphabeta

```bash
Usage: alphabeta [OPTIONS] --edges --nodes --output

Options:
  -i, --iterations
          Number of iterations to run for Nelder-Mead optimization, even 100 is enough [default: 1000]
  -e, --edges
          Relative or absolute path to an edgelist, see /data for an example
  -n, --nodes
          Relative or absolute path to a nodelist, see /data for an example
  -p, --posterior-max-filter
          Minimum posterior probability for a singe basepair read to be included in the estimation [default: 0.99]
  -o, --output
          Relative or absolute path to an output directory, must exist, EXISTING FILES WILL BE OVERWRITTEN
  -h, --help
          Print help
  -V, --version
          Print version
```
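Because the output directory must exist and existing files are overwritten, it can help to prepare the directory before each run. A minimal sketch: the helper name `prepare_output_dir` is hypothetical, and the path is just the one from the example data.

```bash
# Hypothetical helper, not part of this repository.
# Creates the output directory if needed and warns if it already contains files.
prepare_output_dir() {
    out_dir="$1"
    # mkdir -p creates missing parent directories and succeeds if the directory already exists
    mkdir -p "$out_dir"
    # ls -A lists all entries except . and .., so any output means the directory is not empty
    if [ -n "$(ls -A "$out_dir")" ]; then
        echo "warning: $out_dir is not empty, existing files may be overwritten" >&2
    fi
}

prepare_output_dir ./data/output
# Afterwards, run alphabeta with --output ./data/output
```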

Metaprofile

```bash
Usage: metaprofile [OPTIONS] --methylome --genome --output-dir

Options:
  -m, --methylome
          Path to directory containing the methlyome files from which to extract the CG-sites
  -g, --genome
          Path of the annotation file containing information about beginning and end of gbM-genes
  -w, --window-size
          Size of the window in percent of the gbM-gene length or in basepair number if --absolute is supplied [default: 5]
  -s, --window-step
          Size of the step between the start of each window. Default value is window-size, so no overlapp happens
  -o, --output-dir
          Path of the directory where extracted segments shall be stored
  -a, --absolute
          Use absolute length in base-pairs for window size instead of percentage of gene length
  -c, --cutoff
          Number of basepairs to include upstream and downstream of gene [default: 2048]
  -i, --invert
          Invert strands, to switch from 5' to 3' and vice versa
      --db
          Use a Postgres database to do everything
  -e, --edges
          Provide an edgefile
  -n, --nodes
          Provide a nodefile - paths will be updated to match the output directory
      --alphabeta
          Also run AlphaBeta on every window after extraction, results will be stored in the same directory as the segments
      --name
          Name of the run to be used when storing the result in Postgres [default: "Instant { tvsec: 36502, tvnsec: 792133216 }"]
  -f, --force
          Overwrite existing content in output directory? If false (default) it will reuse existing windows
      --cutoff-gene-length
          Let the cutoff be the gene length instead of a fixed number. So if the gene is 1000 bp long, the cutoff will be 1000 bp instead of 2048 bp (the default). This option takes preference over the cutoff option
  -h, --help
          Print help
  -V, --version
          Print version
```

About window step and size:

Size determines the "length" of each window. For example, with --window-size 5, each window spans 5% of the length of the gene it is in. If you supply --absolute, the size is interpreted as a number of basepairs instead of a percentage, so 5 bp.

Step determines the distance between the starts of consecutive windows. If you supply --window-step 1 and --window-size 5, the first window goes from 0% to 5%, the second from 1% to 6%, and so on. If you supply --window-step 5 and --window-size 5, the first window goes from 0% to 5%, the second from 5% to 10%, and so on. In the latter case, you can also omit the step parameter, as it defaults to the same value as size.
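To get a feeling for how size and step interact, the windows can be listed with a few lines of shell. This is only an illustration of the rule described above, not the code metaprofile uses: the function `list_windows` is made up, it handles whole-number percentages only, it assumes the last window must end at or before 100% of the gene, and it ignores the upstream and downstream regions added by --cutoff.

```bash
# Hypothetical helper, not part of metaprofile.
# Prints the windows inside a gene for a given size and step, both in percent.
list_windows() {
    size="$1"
    step="${2:-$1}"   # step defaults to size, so windows do not overlap
    # A non-positive size or step would never finish or make no sense
    if [ "$size" -le 0 ] || [ "$step" -le 0 ]; then
        echo "size and step must be positive integers" >&2
        return 1
    fi
    start=0
    # Assumption: stop once a window would extend past the end of the gene
    while [ $((start + size)) -le 100 ]; do
        echo "${start}% - $((start + size))%"
        start=$((start + step))
    done
}

list_windows 5 1 | head -n 3   # overlapping windows: 0% - 5%, 1% - 6%, 2% - 7%
list_windows 5 | head -n 3     # step omitted: 0% - 5%, 5% - 10%, 10% - 15%
```

Under these assumptions, a step of 1 with a size of 5 gives 96 windows per gene, while omitting the step gives 20.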

Examples

Run alphabeta

```bash
alphabeta \
--edges ./data/edgelist.txt \
--nodes ./data/nodelist.txt \
--output ./data/output
```

Create a metaprofile and feed it into AlphaBeta

```bash
metaprofile \
--methylome ../methylome/within_gbM_genes/ \
--genome ../methylome/gbM_gene_anotation_extract_Arabidopsis.bed \
--output-dir /mnt/extStorage/workingDir/./windows/wt \
--edges ../methylome/edgelist.txt \
--nodes ../methylome/nodelist.txt \
--alphabeta \
--name wildtype \
--window-step 1 --window-size 5
```