fasta_windows

Written for Darwin Tree of Life chromosomal level genome assemblies. The executable takes a fasta formatted file and calculates some statistics of interest in windows:

Output files can be visualised using fwplot or grouped using fwgroup.

Download

The easiest way to get fasta_windows is through conda/bioconda.

bash conda create -n fasta_windows -c bioconda fasta_windows

Usage

``` Fasta windows 0.2.3 Max Brown mb39@sanger.ac.uk Quickly compute statistics over a fasta file in windows.

USAGE: fasta_windows [FLAGS] [OPTIONS] --fasta --output

FLAGS: -d, --description Add an extra column to _windows.tsv output with fasta header descriptions. -h, --help Prints help information -m, --masked Consider only uppercase nucleotides in the calculations. -V, --version Prints version information

OPTIONS: -f, --fasta The input fasta file. -o, --output Output filename for the TSV's (without extension). -w, --window_size Integer size of window for statistics to be computed over. [default: 1000] ```

Building

Building requires Rust.

```bash git clone https://github.com/tolkit/fastawindows cd fastawindows cargo build --release

./target/release/fasta_windows is the executable

show help

./target/release/fasta_windows --help ```

The default window size is 1kb.

Output

Output is now a tsv with bed-like format in the first three columns:

ID start end GC_prop GC_skew Shannon_entropy Prop_Gs Prop_Cs Prop_As Prop_Ts Prop_Ns Dinucleotide_Shannon_false Trinucleotide_Shannon_false Tetranucleotide_Shannon_false SUPER_1 0 1000 0.452 -0.270 1.929 0.165 0.287 0.361 0.187 0 2.646 3.929 5.134 SUPER_1 1000 2000 0.34 -0.335 1.896 0.113 0.227 0.346 0.314 0 2.617 3.872 5.015 SUPER_1 2000 3000 0.388 -0.912 1.627 0.017 0.371 0.407 0.205 0 1.858 2.049 2.096 SUPER_1 3000 4000 0.634 -0.167 1.933 0.264 0.37 0.199 0.167 0 2.671 3.980 5.215 SUPER_1 4000 5000 0.591 -0.184 1.954 0.241 0.35 0.236 0.173 0 2.701 4.020 5.232 SUPER_1 5000 6000 0.599 -0.229 1.948 0.231 0.368 0.212 0.189 0 2.679 3.991 5.209 SUPER_1 6000 7000 0.596 -0.164 1.961 0.249 0.347 0.214 0.19 0 2.694 3.994 5.206 SUPER_1 7000 8000 0.602 -0.193 1.950 0.243 0.359 0.178 0.22 0 2.672 3.974 5.184 SUPER_1 8000 9000 0.453 -0.214 1.977 0.178 0.275 0.292 0.255 0 2.725 4.031 5.237

Also output (non-optional at the moment), are three more TSV's, which are the arrays of di/tri/tetranucleotide frequencies in each window. These files are large, especially as tetranucleotide frequencies will contain 4e4 columns. The kmers are sorted lexicographically from left -> right (AA(AA) to TT(TT)).

e.g. for dinucleotide frequencies:

ID start end AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT SUPER_1 0 1000 122 120 45 73 134 68 39 46 50 55 45 15 54 44 36 53 SUPER_1 1000 2000 140 83 32 90 85 54 22 66 30 25 19 39 91 65 40 118 SUPER_1 2000 3000 216 181 4 5 4 181 5 181 3 8 3 3 183 1 516 SUPER_1 3000 4000 40 61 54 44 80 137 86 66 54 99 76 35 24 73 48 22 SUPER_1 4000 5000 55 68 75 38 88 138 66 57 58 78 59 46 35 65 41 32 SUPER_1 5000 6000 32 71 63 46 85 137 71 75 65 66 65 34 30 94 31 34 SUPER_1 6000 7000 47 62 63 42 91 132 60 64 58 84 74 32 18 69 51 52 SUPER_1 7000 8000 29 49 64 35 67 143 52 97 58 82 72 31 24 85 55 56 SUPER_1 8000 9000 114 67 43 68 63 86 52 73 51 49 43 35 64 73 40 78 SUPER_1 9000 10000 97 97 44 63 72 95 50 67 46 44 33 46 85 49 42 69

Comments, updates & bugs

As of version 0.2.2, I've removed canonical kmers as an option; it was really computationally expensive and I couldn't think of a way to efficienty add it in. End users that wish this are pointed in the direction of fw_group, which will at some point soon provide this functionality.

The masked (-m) flag only affects GC content, GC proportion, GC skew, proportion of G's, C's, A's, T's, N's. Kmers are coerced to uppercase automatically. Shannon index counts only uppercase nucleotides.

Please use, test, and let me know if there are any bugs or features you want implemented. Either raise an issue, or email me (see email in usage).