Multilocus phylogenetic analysis with gene tree clustering

  • Computational Biomedicine
Annals of Operations Research

Abstract

Both theoretical and empirical evidence indicate that the phylogenetic trees of different genes (loci) do not have exactly matching topologies. Nonetheless, most genes do have related phylogenies, which implies that they form cohesive subsets (clusters). In this work, we discuss gene tree clustering, focusing on the normalized cut (Ncut) framework as a suitable method for phylogenetics. We show that this framework is both efficient and statistically accurate when clustering gene trees using the geodesic distance between them in the Billera–Holmes–Vogtmann tree space. We also conduct a computational study of the performance of different clustering methods, with and without preprocessing, under different distance metrics, and using a series of dimensionality reduction techniques. Our results with simulated data show that Ncut accurately clusters the set of gene trees, given a species tree under the coalescent process. Other observations from our computational study include the similar performance of Ncut and k-means under most dimensionality reduction schemes, the worse performance of hierarchical clustering, and the significantly better performance of the neighbor-joining method with the p-distance compared with the maximum-likelihood estimation method. Supplementary material, all code, and the data used in this work are freely available at http://polytopes.net/research/cluster/.
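
To make the clustering step concrete, the following is a minimal sketch of a two-way normalized cut computed through the standard spectral relaxation of Shi and Malik, applied to a precomputed pairwise distance matrix. It is an illustration only, not the code used in this work: the function name `ncut_bipartition`, the Gaussian similarity with bandwidth `sigma`, and the toy distances (absolute differences of points on a line, standing in for geodesic distances between gene trees) are all assumptions introduced here.

```python
import numpy as np

def ncut_bipartition(dist, sigma=1.0):
    """Split items into two clusters with the spectral relaxation of Ncut.

    dist  : (n, n) symmetric matrix of pairwise distances
    sigma : bandwidth of the (assumed) Gaussian similarity
    """
    dist = np.asarray(dist, dtype=float)
    # Assumption: turn distances into similarities with a Gaussian kernel.
    w = np.exp(-dist**2 / (2.0 * sigma**2))
    np.fill_diagonal(w, 0.0)  # no self-loops in the similarity graph
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, mapped back to the
    # generalized problem (D - W) y = lambda * D * y.
    y = d_inv_sqrt * eigvecs[:, 1]
    # Threshold the relaxed indicator vector at zero.
    return (y > 0).astype(int)

# Toy stand-in for a tree-distance matrix: two well-separated groups.
positions = np.array([0.0, 0.5, 1.0, 4.0, 4.5, 5.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
labels = ncut_bipartition(toy_dist, sigma=1.0)
print(labels)  # which group gets label 0 or 1 is arbitrary
```

In practice the distance matrix would hold pairwise distances between gene trees, and more than two clusters can be obtained by applying the split recursively or by using several eigenvectors; neither extension is shown here.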

Notes

  1. While this kernel is not necessarily positive definite for an arbitrary distance matrix D, in our analysis the Gram matrices with entries \(k(X_i,X_j)\) computed from the given data were positive definite.
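
The check described in this note can be sketched as follows. The kernel itself is not defined in this excerpt, so the sketch assumes a Gaussian-type kernel \(k(X_i,X_j) = \exp(-d_{ij}^2 / (2\sigma^2))\) on the distances; the helper names `gram_from_distances` and `is_positive_definite`, the bandwidth values, and the four-point star metric are hypothetical and chosen only for illustration. The example shows that, for a fixed non-Euclidean distance matrix, the resulting Gram matrix may or may not be positive definite depending on the bandwidth, which is why an empirical check on the data at hand is useful.

```python
import numpy as np

def gram_from_distances(dist, sigma=1.0):
    """Gram matrix of an assumed Gaussian-type kernel on pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    return np.exp(-dist**2 / (2.0 * sigma**2))

def is_positive_definite(gram, tol=1e-10):
    """True if the smallest eigenvalue of the symmetrized matrix exceeds tol."""
    gram = np.asarray(gram, dtype=float)
    sym = (gram + gram.T) / 2.0
    return bool(np.linalg.eigvalsh(sym).min() > tol)

# Path metric of a star with one center (index 0) and three leaves:
# center-leaf distance 1, leaf-leaf distance 2. This metric cannot be
# embedded isometrically in Euclidean space.
star_dist = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 2.0, 2.0],
    [1.0, 2.0, 0.0, 2.0],
    [1.0, 2.0, 2.0, 0.0],
])

print(is_positive_definite(gram_from_distances(star_dist, sigma=0.5)))  # True
print(is_positive_definite(gram_from_distances(star_dist, sigma=2.0)))  # False
```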

Acknowledgements

The authors would like to thank the editor and the anonymous referees for their useful comments for improving the manuscript.

Funding: K. F. and R. Y. were supported by JSPS KAKENHI 26540016. C. V. would also like to acknowledge support from ND EPSCoR NSF #1355466.

Author information

Corresponding author

Correspondence to Chrysafis Vogiatzis.

Additional information

JSPS KAKENHI 26540016 and ND EPSCoR NSF 1355466.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3824 KB)

About this article

Cite this article

Yoshida, R., Fukumizu, K. & Vogiatzis, C. Multilocus phylogenetic analysis with gene tree clustering. Ann Oper Res 276, 293–313 (2019). https://doi.org/10.1007/s10479-017-2456-9

  • DOI: https://doi.org/10.1007/s10479-017-2456-9
