Abstract
Text streams are continuous flows of high-dimensional text transmitted at high volume and high velocity. They are expected to be classified in real time, which is challenging because of the high dimensionality of the feature space. Applying feature selection algorithms is one way to reduce the feature space of text streams and improve the learning process. However, because text streams are potentially unbounded, their probability distribution is expected to change over time, a phenomenon known as concept drift. Concept drift affects the feature selection process through feature drift, in which the relevance of features also changes over time. This paper presents a comparative study of six feature selection methods for binary text stream classification, including scenarios with feature drift. We also propose the Online Feature Selection with Evolving Regularization (OFSER) algorithm, a modified version of the Online Feature Selection (OFS) algorithm that uses evolving regularization to dynamically penalize model complexity and thus reduce the impact of feature drift on feature selection. We conducted the experimental analysis on eleven real-world datasets commonly used for text classification. In some cases, OFSER achieved F1-scores up to 12.92% higher than the other algorithms. Results of the Iman–Davenport and Bergmann–Hommel tests show that OFSER is statistically superior to the Information Gain and Extremal Feature Selection algorithms in improving the base classifier's predictive power.
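The abstract names only the building blocks of OFSER (the OFS update combined with a regularization term that evolves with the stream); the full update rule appears in the paper body, not in this section. The sketch below is therefore a minimal, hypothetical illustration of the idea rather than the authors' method: an OFS-style mistake-driven learner that projects its weights onto an L2 ball, keeps only the B largest-magnitude weights, and adapts its regularization strength from the running error rate. The class name, parameters, and the error-rate-based lambda update are assumptions made for illustration.

```python
import numpy as np

class OFSStyleSelector:
    """Illustrative OFS-style online feature selection with an adaptive
    regularization parameter. Not the paper's OFSER; a hedged sketch only."""

    def __init__(self, n_features, num_selected, eta=0.2, lam=0.01):
        self.w = np.zeros(n_features)   # weight vector over all features
        self.B = num_selected           # number of features to keep
        self.eta = eta                  # learning rate
        self.lam = lam                  # regularization strength (adapted online)
        self.errors = 0
        self.seen = 0

    def predict(self, x):
        return 1 if np.dot(self.w, x) >= 0 else -1

    def partial_fit(self, x, y):
        """Process one instance (x, y) with y in {-1, +1}."""
        self.seen += 1
        if y * np.dot(self.w, x) <= 0:          # mistake-driven update
            self.errors += 1
            self.w += self.eta * y * x
            # Project onto the L2 ball of radius 1 / sqrt(lam)
            norm = np.linalg.norm(self.w)
            if norm > 0:
                self.w *= min(1.0, 1.0 / (np.sqrt(self.lam) * norm))
            # Truncate: keep only the B largest-magnitude weights
            if np.count_nonzero(self.w) > self.B:
                threshold = np.sort(np.abs(self.w))[-self.B]
                self.w[np.abs(self.w) < threshold] = 0.0
        # Hypothetical "evolving" regularization: penalize complexity more
        # when the running error rate is high (illustrative rule only).
        self.lam = 0.01 * (1.0 + self.errors / self.seen)

    def selected_features(self):
        return np.nonzero(self.w)[0]
```

As a usage sketch, feeding a stream of TF-IDF vectors one at a time through partial_fit keeps at most B active features at any moment, and selected_features() returns the indices currently considered relevant, which is the behavior an online feature selector needs under feature drift.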
References
de Assunção MD, da Silva Veith A, Buyya R (2018) Distributed data stream processing and edge computing: a survey on resource elasticity and future directions. J Netw Comput Appl 103(November 2017):1–17. https://doi.org/10.1016/j.jnca.2017.12.001
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Barddal JP, Gomes HM, Enembreck F (2015) Analyzing the impact of feature drifts in streaming learning. Neural information processing. Springer, Berlin, pp 21–28. https://doi.org/10.1007/978-3-319-26532-2_3
Barddal JP, Gomes HM, Enembreck F, Pfahringer B (2017) A survey on feature drift adaptation: definition, benchmark, challenges and future directions. J Syst Softw 127:278–294. https://doi.org/10.1016/j.jss.2016.07.005
Baumann P, Hochbaum DS, Yang YT (2019) A comparative study of the leading machine learning techniques and two new optimization algorithms. Eur J Oper Res 272:1041–1057. https://doi.org/10.1016/j.ejor.2018.07.009
Bergmann B, Hommel G (1988) Improvements of general multiple test procedures for redundant systems of hypotheses. In: Bauer P, Hommel G, Sonnemann E (eds) Multiple hypothesenprüfung / multiple hypotheses testing. Springer, Berlin, pp 100–115
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604
Bifet A, Kirkby R (2009) Data stream mining: a practical approach
Brenes DJ, Gayo-Avello D, Pérez-González K (2009) Survey and evaluation of query intent detection methods. In: Proceedings of the 2009 workshop on web search click data, ACM, New York, NY, USA, WSCD ’09, pp 1–7. https://doi.org/10.1145/1507509.1507510
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data. Springer Series in Statistics. Springer, Berlin. https://doi.org/10.1007/978-3-642-20192-9
Carvalho VR, Cohen WW (2006) Single-pass online learning: Performance, voting schemes and online feature selection. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’06, pp 548–553. https://doi.org/10.1145/1150402.1150466
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat Theory Methods 9(6):571–595. https://doi.org/10.1080/03610928008827904
Delany SJ, Cunningham P, Tsymbal A, Coyle L (2005) A case-based technique for tracking concept drift in spam filtering. In: Macintosh A, Ellis R, Allen T (eds) Applications and innovations in intelligent systems XII. Springer, London, pp 3–16
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput 1(1):3–18. https://doi.org/10.1016/j.swevo.2011.02.002
Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306. https://doi.org/10.1109/TIT.2006.871582
Fong S, Wong R, Vasilakos AV (2016) Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput 9(1):33–45. https://doi.org/10.1109/TSC.2015.2439695
Gama J, Sebastião R, Rodrigues PP (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):1–37. https://doi.org/10.1145/2523813
García S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9(December 2008):2677–2694
Gomes JB, Gaber MM, Sousa PAC, Menasalvas E (2014) Mining recurring concepts in a dynamic feature space. IEEE Trans Neural Netw Learn Syst 25(1):95–110. https://doi.org/10.1109/TNNLS.2013.2271915
Gradvohl ALS, Senger H, Arantes L, Sens P (2014) Comparing distributed online stream processing systems considering fault tolerance issues. J Emerg Technol Web Intell 6(2):174–179. https://doi.org/10.4304/jetwi.6.2.174-179
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(3):1157–1182
Han J, Kamber M, Pei J (2011) Data mining concepts and techniques, vol 3. Morgan Kaufmann, Burlington
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Jankowski D, Jackowski K, Cyganek B (2016) Learning decision trees from data streams with concept drift. Proced Comput Sci 80:1682–1691. https://doi.org/10.1016/j.procs.2016.05.508
Katakis I, Tsoumakas G, Banos E, Bassiliades N, Vlahavas I (2009) An adaptive personalized news dissemination system. J Intell Inf Syst 32(2):191–212. https://doi.org/10.1007/s10844-008-0053-8
Katakis I, Tsoumakas G, Vlahavas I (2008) An ensemble of classifiers for coping with recurring contexts in data streams. Anais da 18 ECAI: European conference on artificial intelligence. IOS Press, Amsterdam, pp 763–764
Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22(3):371–391. https://doi.org/10.1007/s10115-009-0206-2
Katakis I, Tsoumakas G, Vlahavas I (2005) On the utility of incremental feature selection for the classification of textual data streams. In: Bozanis P, Houstis EN (eds) Advances in informatics. Springer, Berlin, pp 338–348
Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
Moraes MB, Gradvohl ALS (2020) MOAFS: A Massive Online Analysis library for feature selection in data streams. J Open Source Software 5(45):1970. https://doi.org/10.21105/joss.01970
Méndez JR, Fdez-Riverola F, Díaz F, Iglesias EL, Corchado JM (2006) A comparative performance study of feature selection methods for the anti-spam filtering domain. In: Perner P (ed) Advances in data mining. Applications in medicine, web mining, marketing, image and signal mining. Springer, Berlin, pp 106–120
OpenML (2019) https://www.openml.org
Pearson K (1992) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Springer, New York, pp 11–28. https://doi.org/10.1007/978-1-4612-4380-9_2
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106. https://doi.org/10.1023/A:1022643204877
Ramírez-Gallego S, Krawczyk B, García S, Woźniak M, Herrera F (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239:39–57. https://doi.org/10.1016/j.neucom.2017.01.078
Tsymbal A, Pechenizkiy M, Cunningham P, Puuronen S (2008) Dynamic integration of classifiers for handling concept drift. Inf Fusion 9(1):56–68. https://doi.org/10.1016/j.inffus.2006.11.002
Wang S, Schlobach S, Klein M (2011) Concept drift and how to identify it. J Web Semant 9(3):247–265. https://doi.org/10.1016/j.websem.2011.05.003
Wang J, Zhao P, Hoi SC, Jin R (2014) Online feature selection and its applications. IEEE Trans Knowl Data Eng 26(3):698–710. https://doi.org/10.1109/TKDE.2013.32
Wang L, Shen H (2016) Improved data streams classification with fast unsupervised feature selection. In: 17th international conference on parallel and distributed computing, applications and technologies (PDCAT), IEEE, Guangzhou, China, pp 221–226. https://doi.org/10.1109/PDCAT.2016.056
Wu X, Yu K, Ding W, Wang H, Zhu X (2013) Online feature selection with streaming features. IEEE Trans Pattern Anal Mach Intell 35(5):1178–1192. https://doi.org/10.1109/TPAMI.2012.197
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’97, pp 412–420
Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: International conference on machine learning (ICML), pp 1–8
Yue L, Chen W, Li X, Zuo W, Yin M (2019) A survey of sentiment analysis in social media. Knowl Inf Syst 60(2):617–663. https://doi.org/10.1007/s10115-018-1236-4
Zhou P, Hu X, Li P, Wu X (2019) OFS-density: a novel online streaming feature selection method. Pattern Recogn 86:48–61. https://doi.org/10.1016/j.patcog.2018.08.009
Acknowledgements
The authors would like to acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil–Finance Code 001.
Cite this article
de Moraes, M.B., Gradvohl, A.L.S. A comparative study of feature selection methods for binary text streams classification. Evolving Systems 12, 997–1013 (2021). https://doi.org/10.1007/s12530-020-09357-y