Abstract
Imbalanced classification problems are common in the real world. Because the classes differ greatly in size and the data distribution is complex, minority class samples are difficult to classify correctly. Oversampling techniques balance the data set by generating minority class samples; however, current clustering-based oversampling techniques are limited by hyperparameters, sample selection, and other issues that affect the final classification performance. In this paper, we propose an oversampling algorithm based on natural neighbor and density peaks clustering (ND-S). ND-S proceeds in three steps. First, the natural neighbor algorithm is used to find and filter out noise and outliers. Second, density peaks clustering is improved with a natural neighbor-based, nonparametric adaptive scheme; the improved method clusters all samples and retains only the clusters that meet the conditions. Finally, sampling weights are assigned to each cluster, and the minority class samples suitable for oversampling are selected by calculating the local sparsity of the samples; these samples are then used by the synthetic minority oversampling technique (SMOTE). Experiments on 18 imbalanced data sets show that ND-S is effective for the imbalanced classification problem and that its classification performance is generally better than that of 8 comparison algorithms.
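The last step of the pipeline hands the selected minority samples to SMOTE. As a rough illustration of that interpolation step only, the sketch below generates synthetic points on the line segments between a minority sample and one of its nearest minority neighbors. It is not the ND-S algorithm: the natural neighbor filtering, the density peaks clustering, the cluster weights, and the local sparsity selection are all omitted. The function name `smote_sketch` and the parameters `n_new`, `k`, and `seed` are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style interpolation (hypothetical helper, not ND-S).

    X_min : array of shape (n, d) holding minority class samples only.
    n_new : number of synthetic samples to generate.
    k     : number of nearest minority neighbors considered per sample.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # a sample has at most n - 1 other samples as neighbors

    # Pairwise Euclidean distances among minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    # For each synthetic point: pick a base sample, then one of its neighbors.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x_base + lam * (x_neighbor - x_base), lam in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class. In a weighted scheme such as the one the abstract describes, `base` would presumably be drawn non-uniformly rather than with `rng.integers`, but the exact weighting is not given here.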









Availability of data and materials
All data sets used in this paper come from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/index.php.
Funding
This work is supported by the scientific and technological innovation project of double-city economic circle construction in Chengdu-Chongqing area (No. KJCX2020024) and by Chongqing University Innovation Research Group funding (No. CXQT20015).
Author information
Contributions
MG wrote the main text and Jia Lu edited it. All authors reviewed the paper.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical Approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, M., Lu, J. ND-S: an oversampling algorithm based on natural neighbor and density peaks clustering. J Supercomput 79, 8668–8698 (2023). https://doi.org/10.1007/s11227-022-04965-8
DOI: https://doi.org/10.1007/s11227-022-04965-8