
ND-S: an oversampling algorithm based on natural neighbor and density peaks clustering

Published in The Journal of Supercomputing

Abstract

There are a large number of imbalanced classification problems in the real world. Because of the imbalance in the amount of data and the complex nature of its distribution, minority class samples are difficult to classify correctly. Oversampling techniques balance the data set by generating minority class samples; however, current clustering-based oversampling techniques are limited by hyperparameter sensitivity, sample selection, and other issues that degrade the final classification performance. In this paper, we propose an oversampling algorithm based on natural neighbors and density peaks clustering (ND-S). ND-S consists of three steps. First, the natural neighbor algorithm is used to find and filter noise and outliers. Second, density peaks clustering is made nonparametric and adaptive via natural neighbors; it clusters all samples and retains only the clusters that meet the selection conditions. Finally, a sampling weight is assigned to each retained cluster, and the minority class samples suitable for oversampling are selected, by computing each sample's local sparsity, for the synthetic minority oversampling technique (SMOTE). Experiments on 18 imbalanced data sets show that ND-S is effective for the imbalanced classification problem, and its classification performance is generally better than that of the eight comparison algorithms.
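The three-step pipeline described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `natural_neighbor_filter` condenses the natural-neighbor search into an adaptive in-degree test, the improved density peaks clustering step is omitted, and `smote` is plain SMOTE without the paper's cluster weights and local-sparsity selection. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def natural_neighbor_filter(X, max_r=10):
    """Step 1 (sketch): grow the neighborhood size r until the number of
    points that nobody counts among their r nearest neighbors stabilises;
    points still unchosen at that r are treated as noise/outliers."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)             # column 0 is the point itself
    in_degree = np.zeros(n, dtype=int)
    prev_orphans = -1
    for r in range(1, min(max_r, n - 1) + 1):
        np.add.at(in_degree, order[:, r], 1)     # each point votes for its r-th neighbor
        orphans = int((in_degree == 0).sum())
        if orphans == prev_orphans:              # neighborhood structure is stable
            break
        prev_orphans = orphans
    return in_degree > 0                         # True = keep, False = noise/outlier

def smote(X_min, n_new, k=5, rng=None):
    """Step 3 (sketch): plain SMOTE, interpolating between a random minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]  # k nearest, excluding self
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        j = nbrs[i, rng.integers(nbrs.shape[1])]
        synth[t] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return synth

# Illustrative usage: filter a small minority class, then rebalance it.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(90, 2))
X_min = rng.normal(3.0, 0.5, size=(10, 2))
keep = natural_neighbor_filter(X_min)
X_synth = smote(X_min[keep], n_new=len(X_maj) - int(keep.sum()))
```

Because each synthetic point is a convex combination of two minority samples, every generated point stays inside the minority class's bounding box, which is the property that makes SMOTE safer than random noise injection but also why the paper filters noise and restricts sampling to well-formed clusters first.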


Availability of data and materials

The data sets used in this paper are all from the UCI Machine Learning Repository and can be obtained from https://archive.ics.uci.edu/ml/index.php.


Funding

This work is supported by the scientific and technological innovation project of double-city economic circle construction in the Chengdu-Chongqing area (No. KJCX2020024) and by Chongqing University Innovation Research Group funding (No. CXQT20015).

Author information

Authors and Affiliations

Authors

Contributions

MG wrote the main text, and Jia Lu edited it. All authors reviewed the paper.

Corresponding author

Correspondence to Jia Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

This declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, M., Lu, J. ND-S: an oversampling algorithm based on natural neighbor and density peaks clustering. J Supercomput 79, 8668–8698 (2023). https://doi.org/10.1007/s11227-022-04965-8

