Importance-SMOTE: a synthetic minority oversampling method for noisy imbalanced data

  • Data analytics and machine learning
Soft Computing

Abstract

Synthetic minority oversampling methods have proven to be an efficient solution to imbalanced data classification problems, and different strategies have been proposed for generating synthetic minority samples. However, noisy samples, which may cause the minority and majority classes to overlap, have not yet been properly treated to reduce their influence on the performance of a classification model. This paper proposes a new method, named Importance-SMOTE, in which only borderline and edge samples in the minority class are oversampled. The number of synthetic samples generated from a minority sample is proportional to the importance of that sample, which is calculated according to the composition and distribution of its nearest neighbors. The positions of the synthetic minority samples are determined by the relative importance of the paired neighbors. The proposed method is expected to obtain a more precise estimate of the true decision surface and to reduce the influence of noisy samples. Experiments on various public imbalanced datasets and a real case study are used to demonstrate the effectiveness of the proposed method.
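
The abstract does not give the importance formula or the placement rule, so the following Python sketch only illustrates the general idea under stated assumptions; it is not the paper's algorithm. The function name `importance_oversample` and its parameters are hypothetical. It assumes that importance is the fraction of minority samples among a sample's k nearest neighbors, set to zero when that fraction is 0 (treated as likely noise) or 1 (treated as a safe interior sample). The distribution-based detection of edge samples mentioned above is not attempted. Seeds are drawn in proportion to importance, and each synthetic sample is placed on the segment between a seed and a neighboring minority sample, with an interpolation factor drawn from a Beta distribution whose mean shifts toward the more important endpoint.

```python
import numpy as np


def importance_oversample(X, y, n_new, minority_label=1, k=3, seed=0):
    """Illustrative importance-weighted oversampling (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)

    # k nearest neighbours of every minority sample among all samples,
    # with the sample itself excluded.
    dist = np.linalg.norm(X[min_idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(len(min_idx)), min_idx] = np.inf
    knn = np.argsort(dist, axis=1)[:, :k]
    frac_min = (y[knn] == minority_label).mean(axis=1)

    # ASSUMPTION: importance is the minority fraction of the neighbourhood;
    # all-majority (noise) and all-minority (safe) neighbourhoods get zero.
    importance = np.where((frac_min > 0) & (frac_min < 1), frac_min, 0.0)
    w_all = np.zeros(len(y))
    w_all[min_idx] = importance

    # ASSUMPTION: a seed may only pair with neighbours that are minority
    # samples with non-zero importance; seeds with no such partner are skipped.
    partners = [row[w_all[row] > 0] for row in knn]
    seed_w = np.array([w if len(p) else 0.0
                       for w, p in zip(importance, partners)])
    if n_new == 0 or seed_w.sum() == 0:
        return np.empty((0, X.shape[1])), importance

    # Seeds are drawn with probability proportional to their importance.
    seeds = rng.choice(len(min_idx), size=n_new, p=seed_w / seed_w.sum())
    synthetic = np.empty((n_new, X.shape[1]))
    for out_row, s in enumerate(seeds):
        i = min_idx[s]
        j = rng.choice(partners[s])
        # Mean of t is (1 + w_j) / (2 + w_i + w_j), so the synthetic point
        # tends toward whichever endpoint has the higher importance.
        t = rng.beta(1.0 + w_all[j], 1.0 + w_all[i])
        synthetic[out_row] = X[i] + t * (X[j] - X[i])
    return synthetic, importance
```

The function returns the synthetic samples together with the importance of each minority sample, in the order the minority samples appear in `X`. Under these assumptions, a minority sample surrounded only by majority samples receives zero importance and never contributes to a synthetic point.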

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 52005027).

Funding

The author declares he has no financial interests.

Author information

Corresponding author

Correspondence to Jie Liu.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 6, 7, 8, and 9.

Table 6 Experimental results with respect to F-Measure of KNN

Table 7 Experimental results with respect to F-Measure of CART

Table 8 Experimental results with respect to AUC(PRC) of KNN

Table 9 Experimental results with respect to AUC(PRC) of CART

About this article

Cite this article

Liu, J. Importance-SMOTE: a synthetic minority oversampling method for noisy imbalanced data. Soft Comput 26, 1141–1163 (2022). https://doi.org/10.1007/s00500-021-06532-4

  • DOI: https://doi.org/10.1007/s00500-021-06532-4
