Abstract
Big data analytics is an emerging topic in academic and industrial engineering, and handling large-scale data is one of its central challenges. Designing an effective processing model for such data is therefore crucial. In this paper, we aim to improve classification accuracy and reduce execution time for large-scale data on a small cluster. To this end, we present a novel ensemble-based paradigm that consists of two stages: splitting large-scale data files and building ensemble models. First, two splitting methods are developed to partition large-scale data into small, non-overlapping data blocks. Second, we propose two ensemble-based methods, one bagging-based and one boosting-based, designed for high accuracy and low execution time. Combining the two splitting methods with the two ensemble-based methods yields four predictive models that implement the paradigm. A series of experiments was conducted to evaluate the effectiveness of the paradigm with these four combinations. Overall, the paradigm with the boosting-based method achieves the best accuracy compared with existing methods: on a big data file with 10 million samples, it reaches 91.6% accuracy, compared with 52% for the baseline model. The paradigm with the bagging-based method, however, takes the least execution time. This paper also demonstrates the effectiveness of a Spark computing cluster for large-scale data and points out a weakness of the RDD (Resilient Distributed Dataset).
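The general idea of splitting data into non-overlapping blocks and combining per-block models can be illustrated with a minimal sketch. This is not the authors' implementation: the random splitting rule, the nearest-centroid base learner, the toy data, and all function names are assumptions chosen for brevity. The sketch also aggregates by majority vote over disjoint blocks, which is in the spirit of bagging but does not use bootstrap resampling, and it runs on a single machine rather than on Spark.

```python
import random
from collections import Counter

def split_into_blocks(rows, n_blocks, seed=0):
    """Shuffle rows and deal them into n_blocks disjoint blocks (no overlap)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]

def train_centroid_model(block):
    """Hypothetical base learner: one mean vector (centroid) per class label."""
    sums, counts = {}, Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_one(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def vote(models, features):
    """Aggregate the per-block models by majority vote."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]

# Toy data (assumed): two well-separated Gaussian classes in two dimensions.
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(200)]
data += [([rng.gauss(5, 1), rng.gauss(5, 1)], 1) for _ in range(200)]

blocks = split_into_blocks(data, n_blocks=4)       # splitting stage
models = [train_centroid_model(b) for b in blocks]  # one model per block
```

Calling `vote(models, [0.2, -0.1])` then returns the ensemble's label for a new sample. Because each model sees only one block, the blocks can be trained independently, which is what makes this style of ensemble attractive on a small cluster.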














Data Availability
Data available on request from the authors.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Trinh, T., Le, H., VuongThi, N. et al. A novel ensemble-based paradigm to process large-scale data. Multimed Tools Appl 83, 26663–26685 (2024). https://doi.org/10.1007/s11042-023-16624-y
DOI: https://doi.org/10.1007/s11042-023-16624-y