A novel ensemble-based paradigm to process large-scale data

Trinh, Thanh; Le, HoangAnh; VuongThi, Nhung; HoangDuc, Hai; VuThi, KieuAnh

doi:10.1007/s11042-023-16624-y

Published: 02 September 2023

Volume 83, pages 26663–26685, (2024)

Multimedia Tools and Applications

Thanh Trinh (ORCID: 0000-0002-6973-9749)^1,2, HoangAnh Le^1,3,4, Nhung VuongThi^5, Hai HoangDuc^1,3 & KieuAnh VuThi^1

Abstract

Big data analytics is an emerging topic in academic and industrial engineering, and handling large-scale data is among its most prominent challenges. An effective processing model for large-scale data is therefore crucial. In this paper, we aim to improve classification accuracy and reduce execution time for large-scale data on a small cluster. To this end, we present a novel ensemble-based paradigm that consists of two steps: splitting large-scale data files and building ensemble models. First, two splitting methods are developed to partition large-scale data into small, non-overlapping data blocks. Then two ensemble-based methods, one bagging-based and one boosting-based, are proposed to achieve high accuracy with less execution time. Combining the two splitting methods with the two ensemble-based methods yields four predictive models that implement the paradigm. A series of experiments was conducted to evaluate the effectiveness of the paradigm under these four combinations. Overall, the paradigm with the boosting-based method achieves the best accuracy compared with existing methods: on a big data file with 10 million samples, boosting-based methods reach 91.6% accuracy, compared with 52% for the baseline model. The paradigm with the bagging-based method, however, takes the least execution time to yield results. This paper also demonstrates the effectiveness of a Spark computing cluster for large-scale data and points out a weakness of the RDD (Resilient Distributed Dataset).
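The split-then-ensemble idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two splitting methods work or which base classifier is used, so everything below is an assumption made for illustration: the random split, the nearest-centroid base learner, the majority vote, and the names `split_into_blocks`, `train_centroid_model`, `predict_centroid`, and `bagging_predict` are all hypothetical, and the boosting-based variant is not shown.

```python
import random
from collections import Counter


def split_into_blocks(rows, n_blocks, seed=0):
    """Shuffle the rows and deal them into n_blocks disjoint blocks.

    Assumption: a plain random split; the paper's splitting methods may differ.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Striding by n_blocks assigns every row to exactly one block.
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def train_centroid_model(block):
    """Toy base learner (a stand-in): the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def bagging_predict(models, features):
    """Combine the per-block models by majority vote."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic two-class data: class 0 near the origin, class 1 near (5, 5).
rows = [((0.1 * (i % 10), 0.1 * (i // 10)), 0) for i in range(100)]
rows += [((5 + 0.1 * (i % 10), 5 + 0.1 * (i // 10)), 1) for i in range(100)]

blocks = split_into_blocks(rows, n_blocks=4)
models = [train_centroid_model(block) for block in blocks]  # one model per block
print(bagging_predict(models, (0.2, 0.3)), bagging_predict(models, (5.4, 5.1)))
```

Because each block is processed independently, the per-block training step is the part that could be distributed across the nodes of a small cluster; this sketch runs it sequentially on one machine.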

Data Availability

Data available on request from the authors.

Author information

Authors and Affiliations

1. Faculty of Computer Science, Phenikaa University, Yen Nghia, Ha Dong, Hanoi, 12116, Vietnam
Thanh Trinh, HoangAnh Le, Hai HoangDuc & KieuAnh VuThi
2. Phenikaa Research and Technology Institute (PRATI), A&A Green Phoenix Group JSC, No. 167 Hoang Ngan, Trung Hoa, Cau Giay, Hanoi, 11313, Vietnam
Thanh Trinh
3. The Information Technology Center, Phenikaa University, Yen Nghia, Ha Dong, Hanoi, 12116, Vietnam
HoangAnh Le & Hai HoangDuc
4. Phenikaa Institute for Advanced Study (PIAS), Phenikaa University, Yen Nghia, Ha Dong, Hanoi, 12116, Vietnam
HoangAnh Le
5. Hanoi School of Business and Management, Vietnam National University, Xuan Thuy, Cau Giay, Hanoi, 11310, Vietnam
Nhung VuongThi

Corresponding author

Correspondence to Thanh Trinh.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Trinh, T., Le, H., VuongThi, N. et al. A novel ensemble-based paradigm to process large-scale data. Multimed Tools Appl 83, 26663–26685 (2024). https://doi.org/10.1007/s11042-023-16624-y

Received: 15 January 2023
Revised: 05 May 2023
Accepted: 21 August 2023
Published: 02 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16624-y