strif

Installation

Download binaries

Binaries for the tool can be found under the "Releases" tab.

Cargo

Usage

Sequence-graph alignment

To generate a sequence-graph alignment of your sample to STR loci, use ExpansionHunter. The tool will produce a .realigned.bam file for each sample. Instructions for running ExpansionHunter can be found in the ExpansionHunter documentation.

Extracting repeat sequences

To extract repeat sequences from an ExpansionHunter BAMlet (.realigned.bam), run the following command. If the output is not specified, the output will be saved in the same directory as the BAMlet with a .repeat_seqs.tsv suffix.

strif extract <BAMLET> [OUTPUT]
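When several samples need processing, the command above can be scripted. The following is a minimal sketch, assuming that strif is on the PATH and that all BAMlets sit in one directory with the .realigned.bam suffix mentioned above. The helper names build_extract_commands and run_all are hypothetical and not part of strif.

```python
import subprocess
from pathlib import Path


def build_extract_commands(bamlet_dir):
    """Return one `strif extract <BAMLET>` argument list per BAMlet in bamlet_dir."""
    bamlets = sorted(Path(bamlet_dir).glob("*.realigned.bam"))
    # OUTPUT is omitted, so strif chooses its default output location.
    return [["strif", "extract", str(bamlet)] for bamlet in bamlets]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all(build_extract_commands("bamlets/")) would then process every BAMlet in a (hypothetical) bamlets/ directory. Building the argument lists separately from running them makes it easy to inspect the commands before anything is executed.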

Profiling STR interruptions

To profile STR interruptions from extracted repeat sequences (the output of strif extract), run the following command. The STR catalog needs to be in the same format as these catalogs. If the output path is not specified, the output will be saved in the same directory as the repeat sequences file with a .strif_profile.tsv suffix.

strif profile [OPTIONS] <REPEAT_SEQS> <STR_CATALOG> [OUTPUT] [OUTPUT_ALIGNMENTS]

Options

-z    Output visual alignments. Default is false

-f, --filter <FILTER>    Filter locus IDs using a regular expression. Defaults to None. This is useful for filtering out loci that are not of interest

-A <MATCH_SCORE>    [default: 1]

-B <MISMATCH_PENALTY>    [default: 8]

-O <GAP_OPEN_PENALTY>    [default: 10]

-E <GAP_EXTEND_PENALTY>    [default: 1]
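To build intuition for how the four scoring options interact, here is a small worked sketch. It assumes a common affine-gap convention in which a gap of length k costs GAP_OPEN_PENALTY + k * GAP_EXTEND_PENALTY. This document does not state which convention strif uses, so the numbers are illustrative only, and the function alignment_score is hypothetical.

```python
def alignment_score(matches, mismatches, gap_lengths,
                    match_score=1, mismatch_penalty=8,
                    gap_open_penalty=10, gap_extend_penalty=1):
    """Score an alignment under an assumed affine-gap model.

    Assumption (not confirmed for strif): a gap of length k costs
    gap_open_penalty + k * gap_extend_penalty.
    """
    score = matches * match_score - mismatches * mismatch_penalty
    for k in gap_lengths:
        score -= gap_open_penalty + k * gap_extend_penalty
    return score


# 30 matching bases and one mismatch with the default parameters.
print(alignment_score(matches=30, mismatches=1, gap_lengths=[]))   # 22
# The same 30 matches with a single 3-base gap instead.
print(alignment_score(matches=30, mismatches=0, gap_lengths=[3]))  # 17
```

Under these assumptions and the default values, a single mismatch (cost 8) is cheaper than even a one-base gap (cost 11), so raising -B or lowering -O shifts that balance toward gaps.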

Merging STR interruption profiles

To merge STR interruption profiles from multiple samples, run the following command. If the output path is not specified, the output will be saved in the same directory as the manifest file with a .merged_profiles.tsv suffix.

strif merge [OPTIONS] <MANIFEST> <READ_DEPTHS> [OUTPUT]

-f, --filter <FILTER>    Filter locus IDs using a regular expression. Defaults to None. This is useful for filtering out loci that are not of interest

-m, --min-read-count <MIN_READ_COUNT>    Minimum read count to include in the merged profile. Defaults to 1. This is useful for filtering out loci with low coverage [default: 1]

-l, --read-length <READ_LENGTH>    The sequencing read length. Used for normalizing the interruption counts [default: 150]

-h, --help
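The --filter option takes a regular expression over locus IDs. The sketch below previews which IDs a pattern would keep. It assumes that a pattern may match anywhere in the ID unless anchored, and that the pattern syntax used is common to Python's re module and the regex engine inside strif. Both are assumptions, and the locus IDs are examples only.

```python
import re


def preview_filter(pattern, locus_ids):
    """Return the locus IDs a pattern would keep (assumed search semantics)."""
    regex = re.compile(pattern)
    return [locus_id for locus_id in locus_ids if regex.search(locus_id)]


example_ids = ["HTT", "ATXN1", "ATXN3", "FMR1"]
print(preview_filter(r"^ATXN", example_ids))         # ['ATXN1', 'ATXN3']
print(preview_filter(r"^(HTT|FMR1)$", example_ids))  # ['HTT', 'FMR1']
```

Anchoring the pattern with ^ and $ avoids accidentally keeping loci whose IDs merely contain the text of interest.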

Prioritizing interruptions

To find interruptions that display a significant difference between case and control samples, you can use prioritize.py in the scripts directory.

The prioritization script expects Sample IDs to be formatted as follows: <INDIVIDUAL>_<case/control>. If a paired test is run using the -t option, then it is expected that each individual has exactly one case and one control file.

python prioritize.py <merged_profile> <output_file> <sig_output_file>
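Because the script relies on the sample ID convention described above, it can help to validate IDs beforehand. This is a minimal sketch, not code from prioritize.py. It assumes the group label is the literal lowercase text case or control after the last underscore, so that individual names may themselves contain underscores. The function names are hypothetical.

```python
from collections import defaultdict


def split_sample_id(sample_id):
    """Split '<INDIVIDUAL>_<case/control>' into (individual, group)."""
    # Assumption: the group label follows the LAST underscore.
    individual, sep, group = sample_id.rpartition("_")
    if not sep or not individual or group not in ("case", "control"):
        raise ValueError(f"unexpected sample ID: {sample_id!r}")
    return individual, group


def unpaired_individuals(sample_ids):
    """Return individuals lacking exactly one case and one control sample."""
    groups = defaultdict(list)
    for sample_id in sample_ids:
        individual, group = split_sample_id(sample_id)
        groups[individual].append(group)
    return sorted(ind for ind, found in groups.items()
                  if sorted(found) != ["case", "control"])


print(split_sample_id("patient_01_case"))                       # ('patient_01', 'case')
print(unpaired_individuals(["A_case", "A_control", "B_case"]))  # ['B']
```

An empty list from unpaired_individuals suggests the sample set satisfies the pairing requirement of the -t option.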

Note: Currently, the script does not perform multiple hypothesis test correction. It is strongly recommended to independently perform this step.
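One common way to perform this correction is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a plain-Python illustration and is not part of prioritize.py. It takes a list of p-values; how those are read from the script's output files is left out because the column layout is not described here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(q, 4) for q in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

Interruptions whose adjusted value falls below the chosen false discovery rate can then be treated as significant. Other corrections, such as Bonferroni, may be more appropriate depending on the study design.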

Options

-n MIN_SAMPLES, --min-samples MIN_SAMPLES    Minimum number of samples per group (case or control)

-p P_VALUE_CUTOFF, --p-value-cutoff P_VALUE_CUTOFF    P-value cutoff

-t, --paired-test    Enable paired test

-c CHUNK_SIZE, --chunk-size CHUNK_SIZE    Chunk size for reading merged profile

--no-progress    Disable progress bars

Generating validation datasets

You can generate simulated repeat sequences to validate and test STRIF using generate_validation_sets.py in the scripts directory. The only argument is a path to a directory, such as datasets/, where the generated datasets will be created.

python generate_validation_sets.py <DATASET_DIR>

Calculating performance metrics

You can calculate metrics on the generated datasets using metrics.py in the scripts directory. The only argument is a path to a directory, such as datasets/, where the generated datasets were created.

python metrics.py <DATASET_DIR>

The script will output a file overall_stats.tsv in the dataset directory containing a summary of metrics on each dataset.
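The summary file can be inspected with any TSV reader. As a minimal sketch, assuming the file starts with a header row (its column names are not documented here, so none are hard-coded), the hypothetical helper read_overall_stats loads it into a list of dictionaries:

```python
import csv
from pathlib import Path


def read_overall_stats(dataset_dir):
    """Load overall_stats.tsv from dataset_dir as a list of row dicts."""
    stats_path = Path(dataset_dir) / "overall_stats.tsv"
    # Assumption: the first line of the file is a tab-separated header.
    with stats_path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, iterating over read_overall_stats("datasets") and printing each row shows one dictionary per dataset, keyed by whatever column names the file contains.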

Optimizing alignment parameters

You can find optimal alignment parameters for strif profile by running optimize.py in the scripts directory. The only argument is a path to a dataset, which can be any directory within the datasets directory. It is recommended to run this on datasets/comprehensive_train.

python optimize.py <DATASET_DIR>/<NAME_OF_DATASET>

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

See CONTRIBUTING.md.