A novel ensemble-based paradigm to process large-scale data

Multimedia Tools and Applications

Abstract

Big data analytics is an emerging topic in academic and industrial engineering, and processing large-scale data is among its most prominent challenges. Handling big data requires an effective large-scale data processing model. In this paper, we aim to improve classification accuracy and reduce execution time for large-scale data on a small cluster. To this end, we present a novel ensemble-based paradigm that consists of two steps: splitting large-scale data files and building ensemble models. Two splitting methods are first developed to partition large-scale data into small, non-overlapping data blocks. We then propose two ensemble-based methods, one bagging-based and one boosting-based, that target high accuracy and low execution time. The paradigm can therefore be implemented as four predictive models, each combining one of the two splitting methods with one of the two ensemble-based methods. A series of experiments was conducted to evaluate the effectiveness of the proposed paradigm with the four combinations. Overall, the paradigm with the boosting-based method achieves the best accuracy among the compared methods: on a big data file with 10 million samples, it reaches 91.6% accuracy, compared with 52% for the baseline model. The paradigm with the bagging-based method, however, takes the least execution time to yield results. This paper also demonstrates the effectiveness of a Spark computing cluster for large-scale data and points out the weakness of the RDD (Resilient Distributed Dataset).
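The split-then-ensemble idea described in the abstract can be illustrated with a minimal, single-machine sketch. The abstract does not specify the two splitting methods, the base learner, or how block-level models are combined, so everything named here is an assumption made for illustration: `split_sequential` and `split_random` are hypothetical splitters, `train_stump` is a stand-in base learner, and `vote` uses unweighted majority voting in the spirit of bagging. It is not the authors' implementation and does not use Spark.

```python
import random
from collections import Counter


def split_sequential(rows, n_blocks):
    """Cut rows into n_blocks contiguous, non-overlapping blocks (assumed splitter)."""
    size, extra = divmod(len(rows), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # The first `extra` blocks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        blocks.append(rows[start:end])
        start = end
    return blocks


def split_random(rows, n_blocks, seed=0):
    """Shuffle a copy of rows, then cut it into non-overlapping blocks (assumed splitter)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return split_sequential(shuffled, n_blocks)


def train_stump(block):
    """Fit a one-feature threshold classifier on one block (hypothetical base learner)."""
    best = None
    for x, _ in block:
        for sign in (1, -1):
            # Count training errors for the rule "predict 1 if sign * (v - x) >= 0".
            errors = sum((1 if sign * (xi - x) >= 0 else 0) != yi for xi, yi in block)
            if best is None or errors < best[0]:
                best = (errors, x, sign)
    _, threshold, sign = best
    return lambda v: 1 if sign * (v - threshold) >= 0 else 0


def vote(models, v):
    """Combine block-level models by unweighted majority vote."""
    return Counter(m(v) for m in models).most_common(1)[0][0]


# Toy data: one feature in [0, 1), label 1 when the feature exceeds 0.5.
rng = random.Random(42)
rows = [(x, 1 if x > 0.5 else 0) for x in (rng.random() for _ in range(1000))]

blocks = split_random(rows, n_blocks=5)        # non-overlapping data blocks
models = [train_stump(b) for b in blocks]      # one model per block
accuracy = sum(vote(models, x) == y for x, y in rows) / len(rows)
print(f"ensemble accuracy on toy data: {accuracy:.3f}")
```

Because each model sees only one small block, the blocks can be trained independently, which is what makes this pattern attractive on a small cluster. A boosting-style variant would instead train the block models in sequence and weight them by their errors.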


Data Availability

Data available on request from the authors.

Notes

  1. https://archive.ics.uci.edu/ml/datasets/HIGGS

  2. https://drive.google.com/drive/folders/1LWvV1YLENDwg-1k53ey39vB2_fwE2AhI?usp=sharing


Author information

Corresponding author

Correspondence to Thanh Trinh.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Trinh, T., Le, H., VuongThi, N. et al. A novel ensemble-based paradigm to process large-scale data. Multimed Tools Appl 83, 26663–26685 (2024). https://doi.org/10.1007/s11042-023-16624-y

  • DOI: https://doi.org/10.1007/s11042-023-16624-y
