Abstract
Synthetic minority oversampling methods have proven to be an efficient solution to imbalanced data classification, and different strategies have been proposed for generating synthetic minority samples. However, noisy samples, which may cause the minority and majority classes to overlap, have not yet been treated properly to reduce their influence on the performance of a classification model. This paper proposes a new method, named Importance-SMOTE, in which only the borderline and edge samples of the minority class are oversampled. Synthetic minority samples are generated in proportion to the importance of each minority sample, which is calculated from the composition and distribution of its nearest neighbors. The position of each synthetic sample is determined by the relative importance of the paired neighbors. The proposed method is expected to give a more precise estimate of the true decision surface and to reduce the influence of noisy samples. Experiments on various public imbalanced datasets and a real case study are used to demonstrate the effectiveness of the proposed method.
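The idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the importance score used here (the fraction of majority points among the k nearest neighbors), the rule that noisy samples (all neighbors majority) and safe samples (no majority neighbors) receive zero importance, and the function names `importance_scores` and `oversample` are all assumptions made for illustration. Edge-sample detection is omitted entirely.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance).

    Each candidate is a tuple whose first element is its coordinates.
    """
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]


def importance_scores(minority, majority, k=5):
    """Assumed importance: share of majority points among the k nearest neighbors.

    Samples surrounded only by majority points (treated as noise) and samples
    with no majority neighbor (treated as safe) get importance 0.
    """
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    scores = []
    for i, p in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]  # exclude the sample itself
        neighbours = k_nearest(p, others, k)
        n_maj = sum(label for _, label in neighbours)
        is_borderline = 0 < n_maj < len(neighbours)
        scores.append(n_maj / len(neighbours) if is_borderline else 0.0)
    return scores


def oversample(minority, majority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points from borderline samples only."""
    rng = random.Random(seed)
    scores = importance_scores(minority, majority, k)
    seeds = [(p, s) for p, s in zip(minority, scores) if s > 0]
    if len(seeds) < 2:
        return []  # no pair of borderline samples to interpolate between
    # Seeds are drawn with probability proportional to their importance.
    picks = rng.choices(range(len(seeds)), weights=[s for _, s in seeds], k=n_new)
    synthetic = []
    for i in picks:
        a, imp_a = seeds[i]
        partners = seeds[:i] + seeds[i + 1:]
        b, imp_b = rng.choice(k_nearest(a, partners, k))
        # The synthetic point sits closer to the more important of the pair.
        t = imp_b / (imp_a + imp_b)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Drawing seeds with importance-proportional probability gives a proportional allocation in expectation while always returning exactly `n_new` points; a deterministic per-sample quota would be an equally reasonable choice.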
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 52005027).
Funding
The author declares he has no financial interests.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants performed by any of the authors.
About this article
Cite this article
Liu, J. Importance-SMOTE: a synthetic minority oversampling method for noisy imbalanced data. Soft Comput 26, 1141–1163 (2022). https://doi.org/10.1007/s00500-021-06532-4
DOI: https://doi.org/10.1007/s00500-021-06532-4