Abstract
Microarrays can measure the expression levels of thousands of genes simultaneously. Gene expression data from DNA microarrays are therefore characterized by many measured variables (genes) on only a few samples. One important application of gene expression data is sample classification. In statistical terms, the very large number of predictors relative to the small number of samples makes most classical “class prediction” methods inapplicable. This problem can generally be avoided by selecting only the relevant features, or by extracting from the original data new features that carry maximal information about the class label. In this paper, a new gene selection method based on independent variable group analysis (IVGA) is proposed. We first use the t-statistic to select a subset of genes from the original data. We then use IVGA to select, from this subset, the key genes for tumor classification. Finally, we use a support vector machine (SVM) to classify tumors based on the key genes selected by IVGA. To validate its effectiveness, the proposed method is applied to three different DNA microarray data sets. The prediction results show that the method is efficient and feasible.
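The first stage of the pipeline, filtering genes by a t-statistic, can be illustrated with a minimal sketch. The abstract does not say which form of the t-statistic is used, so Welch's unequal-variance form is assumed here. The function names (`t_statistic`, `rank_genes`), the `top_k` parameter, and the toy data are all hypothetical. The IVGA and SVM stages are not shown.

```python
import math
from statistics import mean, variance


def t_statistic(values_a, values_b):
    """Welch's two-sample t-statistic for one gene (assumed variant)."""
    diff = mean(values_a) - mean(values_b)
    denom = math.sqrt(
        variance(values_a) / len(values_a) + variance(values_b) / len(values_b)
    )
    if denom == 0.0:
        # No within-class spread: 0 if the means agree, otherwise +/- infinity.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / denom


def rank_genes(expression, labels, top_k):
    """Return the indices of the top_k genes ranked by |t|.

    expression: list of samples, each a list of per-gene expression values.
    labels: list of 0/1 class labels, one per sample.
    """
    n_genes = len(expression[0])
    scores = []
    for g in range(n_genes):
        class0 = [s[g] for s, y in zip(expression, labels) if y == 0]
        class1 = [s[g] for s, y in zip(expression, labels) if y == 1]
        scores.append((abs(t_statistic(class0, class1)), g))
    # Largest |t| first; ties broken by gene index for reproducibility.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in scores[:top_k]]


# Hypothetical toy data: 6 samples x 4 genes, 3 samples per class.
toy_expression = [
    [1.0, 5.0, 2.0, 7.1],
    [1.2, 5.1, 2.5, 6.9],
    [0.9, 4.9, 1.8, 7.0],
    [3.0, 5.0, 2.1, 7.2],
    [3.1, 5.2, 2.4, 6.8],
    [2.9, 4.8, 1.9, 7.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]
selected = rank_genes(toy_expression, toy_labels, top_k=2)
```

In the toy data only gene 0 differs clearly between the two classes, so it is ranked first. In the method described above, the retained subset would then be passed to IVGA for key gene selection rather than used directly.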




Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant Nos. 30700161 & 30900321, the Foundation for Young Scientist of Shandong Province, China under Grant No. 2008BS01010, and the LIESMARS Special Research Funding.
Cite this article
Zheng, CH., Chong, YW. & Wang, HQ. Gene selection using independent variable group analysis for tumor classification. Neural Comput & Applic 20, 161–170 (2011). https://doi.org/10.1007/s00521-010-0513-2
DOI: https://doi.org/10.1007/s00521-010-0513-2