Efficient calculation of pairwise phylogenetic distance matrices.
PhyloDM is a high-performance library that converts a phylogenetic tree into a pairwise distance matrix. It is designed to use minimal memory and takes less than a minute to process large trees (>20,000 taxa), whereas other libraries may take hours and use hundreds of GB of memory.
PhyloDM is written in Rust and is exposed to Python via PyO3. It can therefore be used from either Python or Rust; the documentation below covers Python. For the Rust documentation, see Crates.io.
Requires Python 3.7+
```shell
conda install -c bioconda phylodm
```
Pre-compiled binaries are packaged for most 64-bit Unix platforms. If you are running Python 3.7 or 3.8, you need to have Rust installed to compile the binaries.
```shell
python -m pip install phylodm
```
A pairwise distance matrix can be created from either a Newick file or a DendroPy tree.
```python
from phylodm import PhyloDM

# Preparation: write a small test tree to disk
with open('/tmp/newick.tree', 'w') as fh:
    fh.write('(A:4,(B:3,C:4):1);')

# Option 1: load from a Newick file
pdm = PhyloDM.load_from_newick_path('/tmp/newick.tree')

# Option 2: load from a DendroPy tree
import dendropy
tree = dendropy.Tree.get_from_path('/tmp/newick.tree', schema='newick')
pdm = PhyloDM.load_from_dendropy(tree)

# Calculate the pairwise distance matrix and get the taxon labels
dm = pdm.dm(norm=False)
labels = pdm.taxa()

# /------------[4]------------ A
# +
# |          /---------[3]--------- B
# \---[1]---+
#            \------------[4]------------- C
#
# labels = ('A', 'B', 'C')
#     dm = [[0. 8. 9.]
#           [8. 0. 7.]
#           [9. 7. 0.]]
```
The `dm` method generates a symmetric NumPy matrix, and the `taxa` method returns a tuple of keys in the matrix row/column order.
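```python
# Calculate the pairwise distance matrix and get the taxon labels
dm = pdm.dm(norm=False)
labels = pdm.taxa()

# /------------[4]------------ A
# +
# |          /---------[3]--------- B
# \---[1]---+
#            \------------[4]------------- C
#
# labels = ('A', 'B', 'C')
#     dm = [[0. 8. 9.]
#           [8. 0. 7.]
#           [9. 7. 0.]]

# The following two lookups are equivalent: the distance between A and B
dm[0, 1]  # 8
dm[labels.index('A'), labels.index('B')]  # 8
```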
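To make the values in the example matrix concrete, here is a small pure-Python sketch that recomputes them by summing branch lengths along the path between each pair of leaves. It is only an illustration and not how PhyloDM is implemented; the `PARENT` table and the helper names `path_to_root` and `patristic_distance` are hypothetical, written by hand for the tree `(A:4,(B:3,C:4):1);`.

```python
from itertools import combinations

# Hypothetical hand-built encoding of '(A:4,(B:3,C:4):1);':
# each node maps to (parent, length of the branch to that parent).
PARENT = {
    'A': ('root', 4.0),
    'B': ('inner', 3.0),
    'C': ('inner', 4.0),
    'inner': ('root', 1.0),
}


def path_to_root(node):
    """Return {ancestor: distance from `node`} for the node and its ancestors."""
    dists = {node: 0.0}
    total = 0.0
    while node in PARENT:
        node, length = PARENT[node]
        total += length
        dists[node] = total
    return dists


def patristic_distance(a, b):
    """Sum of branch lengths on the path between leaves `a` and `b`."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # With non-negative branch lengths, the lowest common ancestor is the
    # shared ancestor that minimises the combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


labels = ('A', 'B', 'C')
index = {name: i for i, name in enumerate(labels)}

# Fill a symmetric matrix in the same row/column order as `labels`.
dm = [[0.0] * len(labels) for _ in labels]
for a, b in combinations(labels, 2):
    d = patristic_distance(a, b)
    dm[index[a]][index[b]] = dm[index[b]][index[a]] = d
```

The A to B entry is 4 + 1 + 3 = 8, the A to C entry is 4 + 1 + 4 = 9, and the B to C entry is 3 + 4 = 7, matching the matrix shown in the example.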
If the `norm` argument of `dm` is set to `True`, then the data will be normalised by the sum of all edges in the tree.
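As a worked illustration on the example tree `(A:4,(B:3,C:4):1);`, the sketch below assumes that normalisation means dividing every matrix entry by the total branch length (4 + 3 + 4 + 1 = 12). It uses plain Python lists rather than PhyloDM itself, and the variable names are illustrative only.

```python
# Branch lengths of '(A:4,(B:3,C:4):1);'
edge_lengths = [4.0, 3.0, 4.0, 1.0]
total_length = sum(edge_lengths)

# Unnormalised distances from the example (rows/columns ordered A, B, C)
dm = [
    [0.0, 8.0, 9.0],
    [8.0, 0.0, 7.0],
    [9.0, 7.0, 0.0],
]

# Assumed normalisation: divide each entry by the sum of all edges
dm_norm = [[d / total_length for d in row] for row in dm]
```

Under this assumption, the A to B distance becomes 8 / 12, the A to C distance 9 / 12, and the B to C distance 7 / 12.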
Tests were executed using `scripts/performance/Snakefile` on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz.
For trees with a large number of taxa, PhyloDM is the better choice. For trees with a small number of taxa, DendroPy may be preferable because of the many features it provides.
When using PhyloDM on a tree with a large number of taxa, x, you can expect to use approximately:
* Memory (GB) = 1.4863970739600885e-08 x^2 + 1.730990617342909e-06 x + 0.014523447553823836
* Time (minutes) = 9.496032656158468e-10 x^2 - 3.7621666288523445e-06 x + 0.012201564275114034
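The two fitted curves above can be wrapped in small helper functions to estimate resources before running a job. This is a minimal sketch that assumes x is the number of taxa in the tree; the function names are hypothetical and are not part of the PhyloDM API.

```python
def estimate_memory_gb(n_taxa):
    """Estimated peak memory in GB, from the fitted quadratic above."""
    return (1.4863970739600885e-08 * n_taxa ** 2
            + 1.730990617342909e-06 * n_taxa
            + 0.014523447553823836)


def estimate_time_minutes(n_taxa):
    """Estimated run time in minutes, from the fitted quadratic above."""
    return (9.496032656158468e-10 * n_taxa ** 2
            - 3.7621666288523445e-06 * n_taxa
            + 0.012201564275114034)


# Example: a tree with 20,000 taxa
memory_gb = estimate_memory_gb(20_000)
time_minutes = estimate_time_minutes(20_000)
```

For 20,000 taxa these give roughly 6 GB of memory and about 0.3 minutes. The estimates come from benchmarks on the hardware listed above, so results on other machines may differ.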