SEGUL is an ultrafast and memory-efficient command-line (CLI) application for working with sequence alignments, a task typically done using interpreted programming languages such as Python, R, or Perl. It is designed to handle genomic datasets but is just as capable with Sanger datasets. In our test using a dataset with 4060 UCE loci, for instance, SEGUL was >40x faster at alignment concatenation than an app written using the Biopython library, while using 3x less RAM.
Features:
Supported sequence formats:
All of the formats are supported in both interleaved and sequential layouts. The app supports both DNA and amino acid sequences.
Supported partition formats:
The Nexus partition can be written as a charset block embedded in a Nexus-formatted sequence file or in a separate file.
Documentation: GitHub Wiki
Citation:
Handika, H. and Esselstyn, J. A. In prep. SEGUL: An ultrafast, memory efficient, and cross-platform alignment manipulation tool for phylogenomics.
The app may work on any platform that Rust supports. Below is a list of the operating systems that we tested and on which the app is guaranteed to work:
:warning: SEGUL's modern terminal output requires a terminal application that supports UTF-8 encoding. On macOS and native Linux, the default terminal should support UTF-8 out of the box. For Windows (including WSL) users, we recommend Windows Terminal to ensure consistent terminal output. On Windows 10, Windows Terminal requires a separate installation. It will be the default terminal for Windows 11 when it arrives in Fall 2021.
For a quick installation, we provide pre-compiled binaries on the release page. For WSL, either the ManyLinux or the Linux binary should work. On our test system, the ManyLinux binary is a little faster. For a native Linux OS, first check your GLIBC version:
Bash
ldd --version
If your system GLIBC is >=2.18, use the Linux binary. If it is lower, use the ManyLinux binary. Installation is the same as for any other single-executable command-line app, such as the phylogenetic programs IQ-TREE or RAxML. You only need to make sure the path to the app is registered in your PATH environment variable, so that the app can be called from anywhere on your system (see instructions). If you still have issues running the program, try installing it with the package manager. That installation method optimizes the compiled binary for your system (see below).
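The GLIBC rule above can be scripted. The following is a minimal sketch and not part of SEGUL: `pick_binary` is a hypothetical helper name, and reading the version from the last field of the first line of `ldd --version` is an assumption that holds on common glibc-based distributions but not, for example, on musl-based ones. The major and minor parts are compared as integers so that 2.4 counts as lower than 2.18.

```bash
# pick_binary VERSION: print which pre-compiled binary the rule above selects.
# Hypothetical helper for illustration; it is not shipped with SEGUL.
pick_binary() {
    major=${1%%.*}      # text before the first dot, e.g. "2"
    rest=${1#*.}        # text after the first dot, e.g. "31" or "31.1"
    minor=${rest%%.*}   # drop any patch component
    # Compare the parts as integers, not as decimal fractions.
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 18 ]; }; then
        echo "Linux"
    else
        echo "ManyLinux"
    fi
}

# Assumption: the version is the last field of the first line of the output.
glibc_version=$(ldd --version 2>/dev/null | head -n 1 | awk '{print $NF}')
case "$glibc_version" in
    "" | *[!0-9.]*)
        echo "Could not detect the GLIBC version; check 'ldd --version' manually" ;;
    *)
        echo "GLIBC $glibc_version: use the $(pick_binary "$glibc_version") binary" ;;
esac
```

If the detection message appears, fall back to reading the `ldd --version` output by eye and applying the same rule.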
The Rust package manager is called Cargo. Cargo is easy to install (and to uninstall) and will help you manage the app (see details in the installation instructions). Installing SEGUL through Cargo is similar to installing it from source code, except that it only uses the stable version of the code. The source code is hosted on crates.io. The badge at the top of this README shows the latest version of the app available on crates.io.
After you have Cargo installed on your computer, on a Linux system (including WSL), first install the C development toolkit: `build-essential` for Debian-based distributions (Debian, Ubuntu, PopOS, Linux Mint, etc.) or its equivalent on other Linux distributions:
Bash
sudo apt install build-essential
On Windows, you only need to install the GNU compiler toolchain available using Rustup. Rustup is installed automatically when you install Cargo. To install the toolchain:
```Bash
rustup toolchain install stable-x86_64-pc-windows-gnu
rustup default stable-x86_64-pc-windows-gnu
```
Then, install SEGUL:
Bash
cargo install segul
You could also install SEGUL from the GitHub repository. Learn more about SEGUL installation here.
The app command structure is similar to git, gh-cli, or any other app that uses subcommands. The app file name is `segul` for Linux/macOS/WSL and `segul.exe` for Windows.
Bash
[THE-PROGRAM-FILENAME] <SUBCOMMAND> [OPTIONS] <VALUES> <A-FLAG-IF-APPLICABLE>
To list the available subcommands:
Bash
segul --help
To list the available options and flags for each subcommand:
Bash
segul <SUBCOMMAND> --help
Learn more about SEGUL command structure and expected behaviors for each argument here.
SEGUL can convert a single sequence file or multiple sequence files in a directory.
To convert a single file:
Bash
segul convert --input [path-to-your-repository] --input-format [sequence-format-keyword] --output-format [sequence-format-keyword]
To convert files in a directory:
Bash
segul convert --dir [path-to-your-repository] --input-format [sequence-format-keyword] --output-format [sequence-format-keyword]
To concatenate all alignments in a directory:
Bash
segul concat --dir [a-path-to-a-directory] --input-format [sequence-format-keyword]
To generate sequence summary statistics of alignments in a directory:
Bash
segul summary --dir [a-path-to-a-directory] --input-format [sequence-format-keyword]
SEGUL provides multiple filtering parameters.
Bash
segul filter --dir [a-path-to-a-directory] --input-format [sequence-format-keyword] <parameters>
For example, to filter based on taxon completeness:
Bash
segul filter --dir [a-path-to-a-directory] --input-format [sequence-format-keyword] --percent [percentages-of-minimal-taxa]
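To get a feel for the data before choosing a threshold, taxon completeness can be estimated with standard shell tools. The following is a rough sketch under stated assumptions: taxon completeness is taken to mean the share of all taxa in the dataset that are present in an alignment, the files are sequential FASTA files ending in `.fasta`, and `taxon_completeness` is a hypothetical helper name. SEGUL's exact definition and the value format expected by `--percent` are not specified here, so check `segul filter --help` before translating these numbers into a parameter.

```bash
# taxon_completeness DIR: print "<file name> <percent of taxa present>" for
# each .fasta file in DIR. Hypothetical helper for illustration; not part of SEGUL.
taxon_completeness() {
    dir=$1
    # Count the distinct sequence IDs (FASTA header lines) across all files.
    total=$(cat "$dir"/*.fasta | grep '^>' | sort -u | wc -l | tr -d ' ')
    # Stop early if no sequences were found, to avoid dividing by zero.
    [ "$total" -gt 0 ] || return 1
    for f in "$dir"/*.fasta; do
        # Count the sequences in this alignment.
        n=$(grep -c '^>' "$f")
        # Integer percentage, rounded down.
        echo "$(basename "$f") $((100 * n / total))"
    done
}
```

Calling `taxon_completeness` on an alignment directory prints one line per file, which makes it easy to see how many loci a given cutoff would keep.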
Other available parameters are multiple minimal taxon completeness values (`--npercent`), alignment length (`--len`), the minimum number of parsimony informative sites (`--pinf`), and the minimum percentage of parsimony informative sites (`--percent-inf`).
By default, the app copies files that match the parameter to a new folder. If you would like to concatenate the results instead, pass the `--concat` flag. All the options available for the concat function above are also available for concatenating filtered alignments.
You can also extract sequences from a collection of alignments by supplying a list of IDs directly on the command line or in a text file. The app looks for exact matches. You can also use a regular expression to search for matching IDs.
To extract sequences by entering the IDs on the command line:
bash
segul extract --dir [path-to-alignment-directory] --input-format [sequence-format-keyword] --id [id_1] [id_2] [id_3]
You can specify as many IDs as you like. For a long list of IDs, however, it may be better to supply them in a text file. The file should contain only the IDs, one per line:
bash
sequence_1
sequence_2
sequence_3
sequence_4
Then the command will be:
bash
segul extract --dir [path-to-alignment-directory] --input-format [sequence-format-keyword] --file [path-to-text-file]
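An ID file in the format described above can be generated from an existing FASTA file instead of being typed by hand. This is a minimal sketch: `write_id_list` is a hypothetical helper name, and it assumes the sequence ID is the first whitespace-separated word of each FASTA header, which may not match how SEGUL reads headers that contain spaces.

```bash
# write_id_list FASTA OUT: write each unique sequence ID found in FASTA to OUT,
# one ID per line. Hypothetical helper for illustration; not part of SEGUL.
write_id_list() {
    # Keep header lines, strip the leading ">", keep only the first word,
    # then sort and remove duplicates.
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u > "$2"
}
```

Edit the resulting file to keep only the IDs you want, then pass its path to the `--file` option.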
To use a regular expression:
bash
segul extract -d gblock_trimmed_80p/ -f nexus --re="regex-syntax"
The app uses the Rust regex library to parse regular expressions. The syntax is similar to Perl regular expressions (find out more here).
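Before running an extraction, it can help to preview which IDs a pattern would select from a plain list of IDs. The sketch below uses `grep`, not SEGUL, so it is only an approximation: `grep -E` implements POSIX extended regular expressions rather than the Rust regex syntax. Simple patterns built from anchors, character classes, and alternation behave the same in both, but Perl-style shorthand such as `\d` is not portable in `grep -E`. The helper names are hypothetical. Note that `grep -E` matches anywhere in a line; whether SEGUL anchors the pattern is not stated here, so write `^` and `$` explicitly when you mean a whole-ID match.

```bash
# preview_exact ID FILE: print the lines of FILE that are exactly ID
# (fixed string, whole line). Hypothetical helper; not part of SEGUL.
preview_exact() {
    grep -Fx -- "$1" "$2"
}

# preview_regex PATTERN FILE: print the lines of FILE that match the
# extended regular expression PATTERN. Hypothetical helper; not part of SEGUL.
preview_regex() {
    grep -E -- "$1" "$2"
}
```

For example, with a list containing `sequence_1` and `sequence_10`, an exact search for `sequence_1` selects only the first, while the pattern `^sequence_[0-9]+$` selects both.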
Supported NCBI Genetic Code Tables:
| Table No | Genetic Code |
| -------- | ------------ |
| 1 | The Standard Code |
| 2 | The Vertebrate Mitochondrial Code |
| 3 | The Yeast Mitochondrial Code |
| 4 | The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code |
| 5 | The Invertebrate Mitochondrial Code |
| 6 | The Ciliate, Dasycladacean and Hexamita Nuclear Code |
| 9 | The Echinoderm and Flatworm Mitochondrial Code |
| 10 | The Euplotid Nuclear Code |
| 11 | The Bacterial, Archaeal and Plant Plastid Code |
| 12 | The Alternative Yeast Nuclear Code |
| 13 | The Ascidian Mitochondrial Code |
| 14 | The Alternative Flatworm Mitochondrial Code |
| 16 | Chlorophycean Mitochondrial Code |
| 21 | Trematode Mitochondrial Code |
| 22 | Scenedesmus obliquus Mitochondrial Code |
| 23 | Thraustochytrium Mitochondrial Code |
| 24 | Rhabdopleuridae Mitochondrial Code |
| 25 | Candidate Division SR1 and Gracilibacteria Code |
| 26 | Pachysolen tannophilus Nuclear Code |
| 29 | Mesodinium Nuclear Code |
| 30 | Peritrich Nuclear Code |
| 33 | Cephalodiscidae Mitochondrial UAA-Tyr Code |
To translate DNA alignments to amino acid alignments:
Bash
segul translate -d [path-to-alignment-files] -f [sequence-format-keyword]
By default, the app uses the standard code table (NCBI Table 1). To set the translation table, use the `--table` option. For example, to translate DNA sequences using NCBI Table 2 (vertebrate mtDNA):
Bash
segul translate -d loci/ -f fasta --table 2
You can also set the reading frame using the `--rf` option:
Bash
segul translate -d loci/ -f fasta --table 2 --rf 2
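To see what a reading frame changes, the sketch below splits a DNA string into codons starting at a given position. It illustrates the general concept only and assumes that a reading frame of N means translation starts at the Nth base (1-based); how SEGUL's `--rf` option treats a trailing incomplete codon is not described here. `codons_in_frame` is a hypothetical helper name.

```bash
# codons_in_frame SEQUENCE FRAME: print the codons read from SEQUENCE starting
# at 1-based position FRAME, separated by spaces. A trailing group of fewer
# than three bases is printed as-is. Hypothetical helper; not part of SEGUL.
codons_in_frame() {
    # cut drops the bases before FRAME, fold breaks the rest into groups of
    # three, and paste joins the groups back onto one line.
    printf '%s\n' "$1" | cut -c"$2"- | fold -w 3 | paste -sd ' ' -
}
```

With the sequence `ATGGCCTAA`, frame 1 yields the codons `ATG GCC TAA`, while frame 2 yields `TGG CCT AA`, so the same alignment translates to different amino acids depending on the frame.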
To show all the table options, use the `--show-tables` flag:
Bash
segul translate --show-tables
Across the app functions, most generic arguments are also available in a short form to save typing. For example, below we use short arguments to concatenate alignments in a directory named `nexus-alignments`:
Bash
segul concat -d nexus-alignments -f nexus
By default, SEGUL checks whether the sequences contain only valid IUPAC characters, and it assumes DNA characters. If your input is amino acid sequences, use the `--datatype aa` option to set the input data type to amino acid. For example, to concatenate amino acid sequences in a directory named `nexus-alignments`:
Bash
segul concat --dir nexus-alignments --input-format nexus --datatype aa
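If a run stops on the character check, a quick scan can show which lines contain unexpected characters. The sketch below is independent of SEGUL and rests on assumptions: the accepted DNA alphabet is taken to be the IUPAC nucleotide codes (A, C, G, T, R, Y, S, W, K, M, B, D, H, V, N) in either case plus `-` and `?` for gaps and missing data, and the input is a FASTA file. SEGUL's own accepted set may differ. `find_invalid_dna` is a hypothetical helper name.

```bash
# find_invalid_dna FASTA: print "<line number>:<line>" for every sequence line
# containing a character outside the assumed DNA alphabet.
# Returns a non-zero status when nothing is found (standard grep behavior).
# Hypothetical helper for illustration; not part of SEGUL.
find_invalid_dna() {
    # Blank out header lines (starting with ">") instead of deleting them,
    # so the reported line numbers still match the original file.
    sed 's/^>.*//' "$1" | grep -n '[^ACGTRYSWKMBDHVNacgtryswkmbdhvn?-]'
}
```

An amino acid alignment will produce many hits here, since residues such as E, F, I, L, P, and Q are not nucleotide codes, which is a hint that `--datatype aa` is needed.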
Learn more about using SEGUL here.
The app outputs are the resulting files from each task and a log file. Most information printed to the terminal is also written to the log file. Unlike the terminal output, which we try to keep clean and limited to the most important information, the log file also contains the dates, times, and log level status. Each time you run the app, if the log file (named `segul.log`) exists in the same directory, the app appends the log output to it. Rename this file or move it to a different folder if you would like to keep a separate log file for each task.
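Renaming the log before each task is easy to automate. The following is a small sketch; `archive_log` and the timestamped file name pattern are illustrative choices, not SEGUL features.

```bash
# archive_log [DIR]: if DIR (default: the current directory) contains
# segul.log, rename it to segul-<timestamp>.log so that the next run starts
# a fresh log file. Hypothetical helper for illustration; not part of SEGUL.
archive_log() {
    dir=${1:-.}
    if [ -f "$dir/segul.log" ]; then
        stamp=$(date +%Y%m%d-%H%M%S)
        mv "$dir/segul.log" "$dir/segul-$stamp.log"
    fi
}
```

Call `archive_log` in the working directory before starting a new task; if no log file exists yet, it does nothing.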
For other resulting files, the app does not overwrite existing files with the same names. It checks whether such a file exists and asks whether you would like to remove it.