
Dealing with heterogeneity in the context of distributed feature selection for classification

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Advances in information technologies have greatly contributed to the advent of ever-larger datasets. These datasets often come from distributed sites, and even when they do not, their sheer size usually prevents them from being handled in a centralized manner. A possible solution is to distribute the data over several processors and combine the partial results. We propose a methodology for distributing feature selection processes based on selecting relevant features and discarding irrelevant ones. This preprocessing step is essential for current high-dimensional datasets, since it reduces the input dimension. We pay particular attention to the problem of class imbalance, which occurs either because the original dataset is unbalanced or because the dataset becomes unbalanced after partitioning. Most existing works approach unbalanced scenarios through oversampling, whereas our proposal tests both over- and undersampling strategies. Experimental results demonstrate that our distributed approach obtains classification accuracy comparable to a centralized approach, while reducing computational time and dealing efficiently with data imbalance.
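
To make the workflow concrete, below is a minimal sketch of a distributed filter-style feature selection pipeline with per-partition rebalancing. It illustrates the general idea described in the abstract, not the paper's actual algorithm: the horizontal (by-sample) partitioning, random undersampling, correlation-based filter, voting merge, and all parameter names (n_partitions, k, vote_threshold) are assumptions chosen for the example.

```python
# Sketch of distributed feature selection with per-partition rebalancing.
# Illustrative only; not the authors' method.
import numpy as np

def undersample(X, y, rng):
    """Random undersampling: trim every class down to the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def filter_rank(X, y, k):
    """Toy filter: rank features by absolute correlation with the binary label."""
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    scores = np.abs((X_c * y_c[:, None]).mean(axis=0)
                    / (X_c.std(axis=0) * y.std() + 1e-12))
    return np.argsort(scores)[::-1][:k]

def distributed_selection(X, y, n_partitions=4, k=10, vote_threshold=2, seed=0):
    """Partition samples, rebalance and filter each partition, merge by voting."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1], dtype=int)
    for part in np.array_split(rng.permutation(len(y)), n_partitions):
        Xp, yp = undersample(X[part], y[part], rng)  # fix local class imbalance
        votes[filter_rank(Xp, yp, k)] += 1           # local filter selection
    return np.where(votes >= vote_threshold)[0]      # features kept by consensus

# Usage on synthetic unbalanced data: 3 informative features out of 50.
rng = np.random.default_rng(1)
y = (rng.random(600) < 0.2).astype(int)
X = rng.normal(size=(600, 50))
X[:, :3] += 2.0 * y[:, None]
print(distributed_selection(X, y))  # should include features 0, 1, 2
```

The same skeleton accommodates oversampling instead of undersampling by swapping out the rebalancing step, which is the kind of comparison the paper evaluates.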



Notes

  1. https://www.cs.waikato.ac.nz/ml/weka/.

  2. https://github.com/jlmorillo/Heterogeneity_distributed_features.

  3. http://archive.ics.uci.edu/ml/index.php.

  4. https://github.com/jlmorillo/Heterogeneity_distributed_features. In the Appendix tables for standard and microarray datasets, the following information is provided: mean and standard deviation (top row), maximum value (second row) and, in the bottom row, the combination that obtained the maximum value.


Author information


Corresponding author

Correspondence to Verónica Bolón-Canedo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research projects TIN2015-65069-C2-1-R and PID2019-109238GB-C22), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project ED431C 2018/34). Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) is gratefully acknowledged (research project ED431G 2019/01).

Appendix

See Appendix Tables 6, 7, 8, 9, 10, 11, 12, 13 and 14.

Table 6 Accuracy results for Musk2 unbalanced standard dataset in random distribution
Table 7 Accuracy results for Isolet balanced standard dataset in homogeneous distribution
Table 8 Accuracy results for Brain microarray dataset in random distribution
Table 9 Summary of accuracy results on standard datasets
Table 10 Summary of accuracy results on microarray datasets
Table 11 Summary of kappa results on standard datasets
Table 12 Summary of kappa results on microarray datasets
Table 13 Summary of results of filter time by packet on standard datasets
Table 14 Summary of results of filter time by packet on microarray datasets
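
Tables 11 and 12 summarize results for the kappa statistic alongside accuracy. For readers unfamiliar with it, here is a minimal sketch of Cohen's kappa computed from label vectors; this is the standard textbook definition, not code taken from the paper:

```python
# Cohen's kappa: agreement between predictions and true labels beyond chance.
import numpy as np

def cohens_kappa(y_true, y_pred):
    labels = np.unique(np.concatenate([y_true, y_pred]))
    n = len(y_true)
    # Confusion matrix C[i, j]: true label i, predicted label j.
    C = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        C[np.searchsorted(labels, t), np.searchsorted(labels, p)] += 1
    p_o = np.trace(C) / n                        # observed agreement
    p_e = (C.sum(0) * C.sum(1)).sum() / n**2     # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])))  # 0.5
```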


About this article


Cite this article

Morillo-Salas, J.L., Bolón-Canedo, V. & Alonso-Betanzos, A. Dealing with heterogeneity in the context of distributed feature selection for classification. Knowl Inf Syst 63, 233–276 (2021). https://doi.org/10.1007/s10115-020-01526-4

