SEGUL is an ultrafast and memory-efficient command-line (CLI) application for working with sequence alignments, a task that is typically done using interpreted programming languages such as Python, R, or Perl. It is designed to handle genomic datasets but is just as capable with Sanger datasets. In our test using a dataset with 4060 UCE loci, for instance, SEGUL was >40x faster at alignment concatenation than an app written using the Biopython library, while using 3x less RAM.
Available features:
Planned features:
Supported sequence formats:
All of the formats are supported in both interleaved and sequential versions. The app supports both DNA and amino acid sequences.
Supported partition formats:
charset
in the app)
The app is now in active development. Our goal is to provide as many functionalities as possible for alignment manipulation tasks.
The app may work on any platform that Rust supports. Below is a list of operating systems that we have tested and on which the app is guaranteed to work:
If you already use Rust apps and are familiar with the toolchain, the best option is to install the app using cargo. In addition to cargo, on Linux systems (including WSL) it only requires the C-development toolkit, `build-essential` or its equivalent in other Linux distributions.
If you are new to command-line applications, installing through cargo is also the easiest route (see details in the installation instructions). After you have cargo installed on your computer, installing SEGUL is one command away:
Bash
cargo install segul
You can also use the pre-compiled binary available on the release page. The installation is similar to that of any other single-executable command-line app, such as the phylogenetic programs IQ-TREE and RAxML. You only need to make sure the path to the app is registered in your environment variable so that the app can be called from anywhere in your system (see instructions).
The app command structure is similar to git, gh-cli, or any other app that uses subcommands. The app file name will be `segul` for Linux/MacOS/WSL and `segul.exe` for Windows.
Bash
[THE-PROGRAM-FILENAME] <SUBCOMMAND> [OPTIONS] <VALUES> <A-FLAG-IF-APPLICABLE>
To check for available subcommands:
Bash
segul --help
To check for available options and flags for each subcommand:
Bash
segul <SUBCOMMAND> --help
For example, to concatenate all the alignments in a directory named `nexus-alignments`:
Bash
segul concat --dir nexus-alignments --input-format nexus
The same command using short options:
Bash
segul concat -d nexus-alignments -f nexus
The app outputs are the resulting files from each task and a log file. Most information that is printed to the terminal is also written to the log file. We try to keep the terminal output clean and show only the most important information, whereas the log file also contains the dates, times, and log level status. Each time you run the app, it appends the log output to the same log file (named `segul.log`) if the file exists in the same directory. Rename this file or move it to a different folder if you would like to keep a separate log file for each task.
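Because the app appends to `segul.log`, one way to keep a separate log per task is to rename the existing log before each run. The helper below is a minimal sketch and is not part of SEGUL; the function name `archive_segul_log` and the timestamped naming scheme are assumptions made for illustration.

```bash
# Hypothetical helper (not part of SEGUL): rename an existing segul.log
# so that the next run starts a fresh log file.
archive_segul_log() {
    local dir="${1:-.}"          # directory where segul was run (default: current)
    local log="$dir/segul.log"   # log file name as stated in this README
    if [ -f "$log" ]; then
        # The timestamp suffix (e.g. 20240131-154501) is an assumed naming scheme.
        mv "$log" "$dir/segul-$(date +%Y%m%d-%H%M%S).log"
    fi
}
```

For example, calling `archive_segul_log .` before `segul concat -d nexus-alignments -f nexus` would leave the previous log under a timestamped name and let the new run write its own `segul.log`.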
:warning: Unlike the log file, the app will overwrite other existing output files with the same names. Be careful when specifying output file names. Future updates will prevent this.
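Until the app prevents overwriting, a small shell guard can stop a command from running when an output path already exists. This is a sketch under stated assumptions: `check_output_free` is a hypothetical helper, not a SEGUL feature, and you need to pass the exact file or directory names you expect the app to create.

```bash
# Hypothetical guard (not part of SEGUL): fail if any given path already exists.
check_output_free() {
    local path
    for path in "$@"; do
        if [ -e "$path" ]; then
            echo "Refusing to continue: '$path' already exists" >&2
            return 1
        fi
    done
    return 0
}
```

It can be chained in front of a command, for instance `check_output_free new_sequences && segul convert -d sequences/ -f fasta -F phylip -o new_sequences`, so that the conversion only starts when `new_sequences` does not exist yet.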
We want the installation to be as flexible as possible. We offer three ways to install the app. Each of the options has pros and cons.
The pre-compiled binary is available on the release page. The typical workflow is as follows:
See specific details below:
First, copy the link to the zip file on the release page. We provide two versions of the app for Linux. The zip file labeled HPC is compiled on Red Hat Enterprise Linux Server 7.9 (kernel version 3.10). If you are running the app on an HPC cluster, you should use this version. The other version (labeled Linux only) is compiled on Ubuntu 20.04 LTS (kernel version 5.8). You should use it if you are using WSL or a more up-to-date native Linux distribution. Simply put, if you encounter a GLIBC error, try the HPC version. If the issue persists, try installing the app using cargo.
For MacOS, the executable is available for an Intel Mac. If you are using Apple silicon Macs (Apple M1), we recommend installing it using cargo.
Here, we use the version 0.3.1 as an example. You should replace the link with the most up to date version available in the release page.
```Bash
wget https://github.com/hhandika/segul/releases/download/v0.3.1/segul-MacOS-x86_64.zip
```
Bash
unzip segul-MacOS-x86_64.zip
Bash
chmod +x segul
If you would like the binary executable for all users:
Bash
chmod a+x segul
The next step is putting the binary in a folder registered in your path variable. It is best to avoid registering too many paths in your environment variable, as doing so will slow down your terminal startup. If you have already used a single-executable CLI app, chances are you already have a folder registered in your path variable. Copy the SEGUL executable to that folder. Then, try calling SEGUL from anywhere in your system:
Bash
segul --version
It should show the SEGUL version number.
If you would like to set up a folder in your environment variable, take a look at the simple-qc installation instructions.
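As a rough illustration of what registering a folder involves on Linux or MacOS, the sketch below creates a folder and adds it to `PATH` for the current terminal session only. The function name `register_bin_dir` is hypothetical, and `$HOME/.local/bin` in the usage note is just an assumed location; any folder you control works.

```bash
# Hypothetical helper: create a folder and put it on PATH for this session.
register_bin_dir() {
    local bin_dir="$1"
    mkdir -p "$bin_dir" || return 1
    case ":$PATH:" in
        *":$bin_dir:"*) ;;                       # already on PATH, nothing to do
        *) PATH="$bin_dir:$PATH"; export PATH ;; # prepend so it is found first
    esac
}
```

A typical use would be `register_bin_dir "$HOME/.local/bin"` followed by `cp segul "$HOME/.local/bin/"`. To make the change permanent, the equivalent `export PATH=...` line is usually added to a shell startup file such as `~/.bashrc` or `~/.zshrc`.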
On Windows, the installation procedure is similar to that on MacOS or Linux. After downloading the zip file for Windows and extracting it, set up an environment variable that points to the folder where you will put the executable. In Windows, this is usually done using the GUI. Follow this guide from Stack Overflow to set up the environment variable. After setup, copy the `segul.exe` file to the folder.
Installing through cargo is the recommended option. Cargo will compile the app, manage its dependencies, and fine-tune it for your specific hardware. It also makes updating the app easy.
First, download and install the Rust compiler toolchain. It requires Rust version 1.5 or higher. Then, check that the toolchain installation was successful:
Bash
cargo --version
It should show the cargo version number. Then, install the app:
Bash
cargo install segul
If you encounter a compilation issue (this usually happens on Linux or Windows), you may need to install the C-development toolkit. For Debian-based Linux distributions, such as Debian, Ubuntu, and PopOS, the easiest way is to install `build-essential`:
Bash
sudo apt install build-essential
For OpenSUSE:
Bash
zypper install -t pattern devel_basis
For Fedora:
Bash
sudo dnf groupinstall "Development Tools" "Development Libraries"
For Windows, you only need to install the GNU toolchain for Rust. The installation should be straightforward using rustup. Rustup comes as part of the Rust compiler toolchain, so it should already be available on your system once you have installed cargo.
Bash
rustup toolchain install stable-x86_64-pc-windows-gnu
Then set the GNU toolchain as the default compiler:
Bash
rustup default stable-x86_64-pc-windows-gnu
Try to install SEGUL again:
Bash
cargo install segul
You can also install the development version from the GitHub repository. You will need the Rust compiler toolchain; the setup procedure is the same as for installing the app using cargo. To install the development version on any supported platform:
Bash
cargo install --git https://github.com/hhandika/segul.git
You should have SEGUL ready to use.
It is equivalent to:
```Bash
git clone https://github.com/hhandika/segul
cd segul/
cargo build --release
```
The difference is that, for the latter, the executable will be in the `segul` repository: `/target/release/segul`. Copy the `segul` binary to a folder registered in your environment path.
Then, try to call SEGUL:
Bash
segul --version
If you install the app using cargo, updating the app is the same as installing it:
Bash
cargo install segul
Cargo will check whether the version of the app on your computer differs from the version in the Rust package repository (crates.io) and will install the newer version if one is available. The same procedure applies to installing from the GitHub repository:
Bash
cargo install --git https://github.com/hhandika/segul.git
If you used the pre-compiled binary, replace the old binary with the newer version manually.
Uninstalling the app is also easy if you installed it using cargo:
Bash
cargo uninstall segul
The Rust toolchain, including cargo, can be uninstalled easily too:
Bash
rustup self uninstall
Remove the app manually if you use the pre-compiled binary.
```Bash
USAGE:
    segul

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    concat     Concatenates alignments
    convert    Converts sequence formats
    filter     Filter alignments with specified min taxon completeness, alignment length, or parsimony informative sites
    help       Prints this message or the help of the given subcommand(s)
    id         Gets sample ids from multiple alignments
    summary    Gets alignment summary stats
```
- `-i` or `--input`: Use for a single file input. Only available for the convert and summary subcommands.
- `-d` or `--dir`: Use if your input is a path to a directory. The directory input requires users to specify the input format. Available for all subcommands.
- `-c` or `--wildcard`: Use if your input is wildcards. This is more flexible than the other two input options and can accept multiple values. Available for all subcommands.
Arguments: `-f` or `--input-format`
Availabilities: all subcommands
This option is used to specify the input format. For a single file input (`-i` or `--input`) and a wildcard input (`-c` or `--wildcard`), this is not required.
Input format options (all in lowercase):
- `auto` (default)
- `nexus`
- `phylip`
- `fasta`
Arguments: `-o` or `--output`
Availabilities: all subcommands
For a single-output task, such as converting a single file or concatenating alignments, the value is the name of the output file. For a multiple-output task, such as converting multiple files to a different format, the value is the name of the output directory, and the app will use the input file name for each output file.
By default, the app writes to the current working directory.
Arguments: `-F` or `--output-format`
Availabilities: all subcommands
By default, the output format is `nexus`. Use this option to specify a different output format. Below are the available output formats.
Sequential formats:
nexus
phylip
fasta
Interleaved formats:
fasta-int
nexus-int
phylip-int
Argument: `--datatype`
Availabilities: all subcommands
The app supports both DNA and amino acid sequences. By default, the data type is set to DNA sequences. If your input files contain amino acid sequences, you will need to change the data type to `aa`. When a data type is specified, the app checks whether your sequence files contain only IUPAC characters. Except when computing summary statistics, you can set the data type to `ignore` to skip the IUPAC character check. This usually speeds up the computation by about 40%. Use this option only when you are sure your sequences contain only IUPAC characters.
To summarize, available data types:
aa
dna
ignore
Arguments: `-p` or `--part`
Availabilities: concat and filter subcommands
This option is used to specify the partition format. By default, the format is nexus. Available options:
- `charset` (embedded in a nexus sequence)
- `nexus`
- `raxml`
Arguments: `-i` or `--interval`
Availability: summary subcommand
This option specifies the percentage decrement interval for computing data matrix completeness in the summary statistics. Available intervals: `1`, `2`, `5`, `10`.
Only available for the filter subcommand. Available options:
- `-l` or `--len`: To filter alignments based on minimal alignment length.
- `--percent`: To filter based on the percentage of data matrix completeness.
- `--npercent`: The same as `--percent`, but accepts multiple values. This option allows you to create collections of alignments with different data matrix completeness in a single command.
- `--pinf`: To filter based on the number of parsimony informative sites.
- `--ntax`: To define the total number of taxa. By default, the app determines the number of taxa in all the alignments based on the number of unique IDs.
Other flags:
- `--codon`: Use to set the partition format to the codon model. Available in the concat and filter subcommands (if you choose to concatenate the result).
- `--concat`: Available for the filter subcommand. If set, the app will concatenate the filtered alignments instead of copying the files.
- `--sort`: Available for the convert subcommand to sort the sequences based on their IDs in alphabetical order.
- `-h` or `--help`: To display help information.
- `--version`: To display the app version information.
SEGUL can convert from a single file input (`-i` or `--input`), a directory input (`-d` or `--dir`), or a wildcard input (`-c` or `--wildcard`).
Bash
segul convert --input [path-to-your-repository] --input-format [choose-one]
In short format:
Bash
segul convert -i [path-to-your-repository] -f [sequence-format]
By default, it converts to nexus. To choose a different output format, use the output format option (`-F` or `--output-format`):
Bash
segul convert --input [path-to-your-repository] --input-format [sequence-format] --output-format [sequence-format]
In short format, notice the uppercase 'F' for the output format:
Bash
segul convert -i [path-to-your-repository] -f [sequence-format] -F [sequence-format]
You can also skip specifying the input format and the app will infer it based on the file extension:
Bash
segul convert -i [path-to-your-repository]
By default, the app will use the input file name for the output. To specify the output name, use the `-o` or `--output` option. There is no need to include the extension in the output name.
Using the `--sort` flag, you can also sort the sequences based on their IDs in alphabetical order.
For example, to convert a file named `sequence.fasta` to the phylip format and sort the result:
Bash
segul convert -i sequence.fasta -f fasta -F phylip -o new_sequence --sort
The conversion command for a directory input is similar to that for a single file. Unlike the single file input, the app requires you to specify the input format and the output name. The output name will be used as the directory name for the output files, whereas each output file name will be the same as its input file name.
Bash
segul convert --dir [path-to-your-repository] --input-format [choose-one] --output [your-output-dir-name]
In short format:
Bash
segul convert -d [path-to-your-repository] -f [sequence-format] -o [your-output-dir-name]
For example, suppose we want to convert all the fasta files in the directory below to the phylip format and name the output directory `new_sequences`:
Bash
sequences/
├── seq_1.fasta
├── seq_2.fasta
└── seq_3.fasta
The command will be:
Bash
segul convert -d sequences/ -f fasta -F phylip -o new_sequences
The resulting directory will be:
Bash
new_sequences/
├── seq_1.phy
├── seq_2.phy
└── seq_3.phy
All the options for a single file or a directory input are also available for a wildcard input. The app can also infer the input format. Unlike the other input options, the wildcard can take multiple values. This allows you to batch convert files in different folders. The output will be in a single directory. You are required to specify the output name, which will be used as the name of the output directory.
Bash
segul convert -c [wildcard-1] [wildcard-2] [wildcard-3] -f [sequence-format] -o [your-output-dir-name]
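On Linux and MacOS, an unquoted wildcard is normally expanded by the shell before the app receives it, so it can help to check how many files your patterns match before a batch conversion. The function below is a hypothetical helper, not part of SEGUL; the name `count_matches` and the example paths in the usage note are assumptions.

```bash
# Hypothetical pre-flight check: count how many of the given paths exist.
# An unmatched pattern is usually passed through literally, so it counts as 0.
count_matches() {
    local n=0 f
    for f in "$@"; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

For instance, `count_matches batch-1/*.fasta batch-2/*.fasta` prints the number of files that the same two wildcards would hand to `segul convert -c`.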
The `concat` subcommand concatenates multiple alignments and writes the partition settings for the resulting files. The input options are `-d` or `--dir` and `-c` or `--wildcard`. To specify the partition format, use the `-p` or `--part` option. You can also write the partition in a codon model format by using the `--codon` flag.
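The options above might be combined as in the sketch below. It only uses flags described in this README, but the directory name `nexus-alignments` and the output name `concat-result` are placeholders, and the `run` wrapper is a hypothetical convenience that prints the command instead of executing it unless `DRY_RUN=0` is set.

```bash
# Hypothetical wrapper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Concatenate nexus alignments and write a RaXML-style codon-model partition.
# Directory and output names are placeholders for illustration.
run segul concat --dir nexus-alignments --input-format nexus --part raxml --codon --output concat-result
```

Printing the command first gives you a chance to confirm the output name before anything is written, which is useful given that the app overwrites existing output files.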