Maximum Likelihood Analysis of Phylogenetic Data

Phylogeny: the evolutionary tree, or lines of descent, of living species.

One of the five demonstrations that Indiana University researchers contributed to the SC98 High Performance Networking and Computing Conference was an extended, numerically intensive bioinformatic computation carried out on 33 advanced computers geographically dispersed across three continents. The systems, linked by the vBNS, TransPAC, and APAN networks, were at Indiana University, the Institute of High Performance Computing at the National University of Singapore, and the Cooperative Research Centre for Advanced Computational Systems (ACSys CRC) in Australia.
The rapid accumulation of DNA sequence data has allowed statistical methods to be applied to a variety of biological questions. However, data have accumulated more rapidly than computing power. Researchers must often exclude relevant data to make analysis practicable, even though these exclusions may limit the scope and accuracy of the results.

Maximum likelihood methods of statistical inference were first developed by R.A. Fisher in the 1920s. Their theoretical application to phylogenetic analysis was developed by Felsenstein in the 1970s and early 1980s. Maximum likelihood methods of phylogenetic inference are superior to some other methods, particularly when the data set includes highly divergent sequences, which are desirable but increase the computational difficulty enormously. Parallel computing methods now make the analysis of such large data sets practical.
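The idea behind maximum likelihood can be illustrated on the smallest possible case: estimating the evolutionary distance between two aligned sequences. The sketch below is a minimal illustration, not the method used in this analysis. It assumes the Jukes-Cantor substitution model (chosen only because it is simple), and the function names and toy sequences are hypothetical. It picks the distance that makes the observed sequence differences most probable and compares it with the known closed-form Jukes-Cantor estimate.

```python
import math

def count_differences(seq_a, seq_b):
    """Count mismatched sites between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (expected substitutions per site)
    under the Jukes-Cantor model, given n_diff mismatches in n_sites."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_other = 0.25 - 0.25 * e  # probability of one specific different base
    # Each site also carries the stationary frequency 1/4 of its first base.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_other))

def ml_distance(n_sites, n_diff, d_max=2.0, grid_steps=20000):
    """Crude grid search for the distance with the highest likelihood."""
    candidates = (d_max * i / grid_steps for i in range(1, grid_steps + 1))
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

def jc69_closed_form(n_sites, n_diff):
    """Analytic Jukes-Cantor estimate, used here to check the grid search."""
    p = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    # Toy sequences invented for illustration only (2 mismatches in 20 sites).
    seq_a = "ACGTACGTACGTACGTACGT"
    seq_b = "ACGTACGAACGTACGTACCT"
    n_diff = count_differences(seq_a, seq_b)
    print("grid-search estimate:", ml_distance(len(seq_a), n_diff))
    print("closed-form estimate:", jc69_closed_form(len(seq_a), n_diff))
```

For a full tree the same principle applies, but the likelihood must be maximized over every branch length and compared across many candidate topologies, which is where the computational cost arises.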

Our analysis uses FastDNAml [Olsen et al. 1994, based on Felsenstein 1981], modified and extended to run on a heterogeneous and widely distributed parallel virtual machine. Starting from aligned, corresponding DNA sequences from a number of species, the program computes the likelihood of candidate phylogenetic trees, based upon experimental results concerning DNA replication modification rates. It explores all possible phylogenetic trees for a small initial set of species; it then adds further species one at a time and compares different arrangements, producing a sequence of estimated phylogenies. Varying the order in which species are introduced addresses the "local trapping problem," in which the search settles on a tree that is only locally optimal.
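
A short calculation shows why exhaustive search is feasible only for a small initial set of species, and why stepwise addition is so much cheaper. The sketch below is illustrative only: the function names are hypothetical, and the evaluation count ignores any rearrangement steps the program performs after each insertion. It relies on two standard facts: the number of unrooted bifurcating trees on n species is (2n-5)!!, and such a tree with k tips has 2k-3 branches at which a new species can be attached.

```python
import random

def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating topologies: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3, 5, ..., 2n - 5
        count *= k
    return count

def stepwise_addition_evaluations(n_taxa, n_start=3):
    """Trees scored when all topologies for n_start taxa are tried, and
    each later taxon is tried on every branch of the current tree.
    Rearrangements after each insertion are deliberately not counted."""
    exhaustive = num_unrooted_trees(n_start)
    insertions = sum(2 * k - 3 for k in range(n_start, n_taxa))
    return exhaustive + insertions

def random_addition_orders(taxa, n_orders, seed=0):
    """Shuffled orders of introduction; each order is one independent
    search, which is one way to reduce the risk of local trapping."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_orders):
        order = list(taxa)
        rng.shuffle(order)
        orders.append(order)
    return orders

if __name__ == "__main__":
    print(num_unrooted_trees(10))             # 2027025
    print(stepwise_addition_evaluations(10))  # 64
    # Placeholder taxon labels, not the study's data.
    for order in random_addition_orders(["A", "B", "C", "D", "E"], 3):
        print(order)
```

Because each order of introduction can be searched independently, runs of this kind divide naturally among many processors.
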
Two data sets were analyzed. In the first, contributed by collaborators at the BioInformatics Center at the National University of Singapore, cytoplasmic coat proteins [involved in intracellular membrane transport] were sequenced from human, rat, bovine, and yeast organisms. The second data set addressed the controversial phylogenetic placement of microsporidia [a parasite group including important human pathogens], and included representatives of most eukaryotic lineages [> 100 taxa]. Some genetic studies find microsporidia to be highly degenerate fungi, while others, based upon small subunit (SSU) rRNA, suggest an ancient eukaryotic lineage; resolving this question bears upon the reliability of SSU rRNA-based phylogenetic analysis.
Demonstrating computationally intensive analyses using a globally distributed collection of computational nodes paves the way for scientists connected by advanced networks to access remote servers in the worldwide computational grid, contribute key data sets, and collaborate with distant researchers. Our initial focus is on molecular biology in Indiana, Singapore, and Australia. Plans are underway to extend this partnership, addressing questions of performance analysis, virtual accounting schemes, and the development and expansion of the user community.


Participants were: David Hart, Don Berry, Eric Wernert, Craig Stewart, Will Fischer, Chris Parkinson, Jeff Palmer, Meena Sakharkar, Zhang Lou Xin, and Tan Tin Wee. Special thanks are due to Mary Papakhian and Dan Lauer of IU's Research and Technical Services group, Tan Chee Chiang of the NUS Computer Centre's Supercomputing & Visualisation Unit, and Markus Buchhorn of ACSys CRC.

More details: FastDNAML Phylogenetic Analysis of COP proteins
IU receives TRANSPAC award