Abstract
Many practical applications involve imbalanced data classification, in which the minority class suffers a degraded recognition rate. The primary causes are the scarcity of minority class samples and the intrinsically complex distribution of imbalanced datasets. The problem is more severe on small-sample datasets. To address both the small-sample and the class imbalance problems, a hybrid resampling method is proposed. The method combines an oversampling approach (synthetic minority oversampling technique, SMOTE) with a novel data cleaning approach (weighted edited nearest neighbor rule, WENN). First, SMOTE generates synthetic minority class examples by linear interpolation. Then, WENN detects and deletes unsafe majority and minority class examples using a weighted distance function and the k-nearest neighbor (kNN) rule. The weighted distance function scales up a commonly used distance by taking local imbalance and spatial sparsity into account. Extensive experiments on synthetic and real datasets show that the proposed SMOTE-WENN outperforms three state-of-the-art resampling methods.
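The two-stage pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `smote` and `enn_clean` and the parameter `scale` are hypothetical. The abstract does not give the form of the WENN weighted distance, so the sketch simply multiplies the Euclidean distance to each candidate neighbor by an assumed class-dependent factor and otherwise applies the plain edited nearest neighbor rule.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by linear interpolation
    between a minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority points.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))         # interpolation coefficient in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

def enn_clean(X, y, k=3, scale=None):
    """Delete examples whose label is not shared by a strict majority of
    their k nearest neighbors (edited nearest neighbor rule).

    scale: optional dict {class: factor}; the distance to any neighbor of
    that class is multiplied by the factor. This scaling is an assumption
    made for illustration, not the WENN distance from the paper."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if scale is not None:
        d = d * np.array([scale[c] for c in y])[None, :]
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    agree = (y[nn] == y[:, None]).sum(axis=1)
    keep = agree > k / 2
    return X[keep], y[keep]

# Usage: oversample the minority class, then clean the combined set.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(40, 2))
X_min = rng.normal(2.5, 0.5, size=(6, 2))
X_syn = smote(X_min, n_new=34, k=3, rng=1)
X_all = np.vstack([X_maj, X_min, X_syn])
y_all = np.array([0] * 40 + [1] * (6 + 34))
X_res, y_res = enn_clean(X_all, y_all, k=3)
```

Because each synthetic point is a convex combination of two minority points, it always lies on the segment between them. The cleaning step is applied to both classes, mirroring the statement that unsafe majority and minority examples are removed.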
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos: 61872203 and 61802212), the Shandong Provincial Natural Science Foundation (No: ZR2019BF017), Major Scientific and Technological Innovation Projects of Shandong Province (Nos: 2019JZZY010127, 2019JZZY010132 and 2019JZZY010201), Jinan City “20 universities” Funding Projects Introducing Innovation Team Program (No: 2019GXRC031), and the Project of Shandong Province Higher Educational Science and Technology Program (No: J18KA331).
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Cite this article
Guan, H., Zhang, Y., Xian, M. et al. SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell 51, 1394–1409 (2021). https://doi.org/10.1007/s10489-020-01852-8
DOI: https://doi.org/10.1007/s10489-020-01852-8