Abstract
Among the large number of gene selection algorithms available in the literature, the rough set based maximum relevance-maximum significance (RSMRMS) algorithm has been shown to be successful for selecting a set of relevant and significant genes from microarray data. However, analysis of the functional diversity of a gene set is essential both to understand the role of genes in a particular disease and to evaluate the effectiveness of a gene selection algorithm. In this regard, a gene ontology based quantitative index, termed the degree of functional diversity (DoFD), is proposed to quantify the functional diversity of a set of genes selected by any gene selection algorithm. Moreover, a new gene selection algorithm is presented, judiciously integrating the merits of both DoFD and RSMRMS, to select relevant and significant genes that are also functionally diverse. The performance of the proposed gene selection algorithm, along with a comparison with other gene selection methods, is studied using the proposed DoFD and the predictive accuracy of the K-nearest neighbor rule and the support vector machine on six cancer and one arthritis microarray data sets. An important finding is that the proposed gene ontology based quantitative index can accurately evaluate the functional diversity of a set of genes. Also, the proposed gene selection algorithm is shown to be effective for selecting relevant, significant, and functionally diverse genes from microarray data.







Acknowledgments
The work was done when one of the authors, S. Paul, was a Senior Research Fellow of the Council of Scientific and Industrial Research, Government of India.
Cite this article
Paul, S., Maji, P. Gene ontology based quantitative index to select functionally diverse genes. Int. J. Mach. Learn. & Cyber. 5, 245–262 (2014). https://doi.org/10.1007/s13042-012-0133-5
DOI: https://doi.org/10.1007/s13042-012-0133-5