Abstract
With the arrival of the big data era, data processing platforms such as Hadoop have emerged to meet the demand. However, Hadoop's storage layer, the Hadoop Distributed File System (HDFS), handles massive numbers of small files poorly: storing them increases the load on the entire cluster and reduces its efficiency. Yet the datasets that enable researchers to perform healthcare analytics, such as genomic and clinical data, are stored precisely as large collections of small files. The usual remedy is to merge small files and store the resulting large file, but previous methods do not exploit the size distribution of the files and so leave the merging result unoptimized. This article proposes a small-file merging method based on data-block balance. It optimizes the volume distribution of the merged file and effectively reduces the number of HDFS data blocks, thereby lowering the memory overhead and load of the cluster's master node (the NameNode) and enabling efficient data processing.
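The abstract does not spell out the merging algorithm itself, so the following is only a minimal sketch of the core idea: pack small files into merge groups whose total size stays within one HDFS block, so that each merged file fills its blocks as fully as possible and contributes as few blocks as possible. The first-fit-decreasing heuristic, the 128 MB block size, and all class and file names here are illustrative assumptions, not the paper's actual method.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative block-balanced merging: assign small files to merge
// groups so that each group's total size stays within one HDFS block.
public class BlockBalancedMerger {

    // Hypothetical HDFS block size of 128 MB.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    record SmallFile(String name, long size) {}

    // One merge group; its files would be concatenated into a single HDFS file.
    static class MergeGroup {
        final List<SmallFile> files = new ArrayList<>();
        long used = 0;

        boolean fits(SmallFile f) { return used + f.size() <= BLOCK_SIZE; }

        void add(SmallFile f) { files.add(f); used += f.size(); }
    }

    // First-fit-decreasing bin packing: sort files by descending size,
    // then place each file into the first group with enough room.
    static List<MergeGroup> pack(List<SmallFile> files) {
        files.sort(Comparator.comparingLong(SmallFile::size).reversed());
        List<MergeGroup> groups = new ArrayList<>();
        for (SmallFile f : files) {
            MergeGroup target = groups.stream()
                .filter(g -> g.fits(f)).findFirst().orElse(null);
            if (target == null) {
                target = new MergeGroup();
                groups.add(target);
            }
            target.add(f);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical healthcare files (names and sizes invented for the demo).
        List<SmallFile> input = new ArrayList<>(List.of(
            new SmallFile("scan_001.dcm", 90L * 1024 * 1024),
            new SmallFile("genome_chunk.vcf", 60L * 1024 * 1024),
            new SmallFile("labs.csv", 30L * 1024 * 1024),
            new SmallFile("notes.txt", 5L * 1024 * 1024)));
        List<MergeGroup> groups = pack(input);
        for (int i = 0; i < groups.size(); i++) {
            MergeGroup g = groups.get(i);
            System.out.printf("merged_%d: %d files, %.1f%% of one block%n",
                i, g.files.size(), 100.0 * g.used / BLOCK_SIZE);
        }
    }
}

With these sample sizes the packer produces two merged files (90 + 30 + 5 MB, and 60 MB), occupying two HDFS blocks instead of the four that the unmerged files would consume; fewer, fuller blocks mean fewer metadata entries in the NameNode's memory.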
Acknowledgments
This work was supported in part by the National Basic Research Program of China under Grant No. G2011CB302605, and in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61173145 and 61472108.
About this article
Cite this article
He, H., Du, Z., Zhang, W. et al. Optimization strategy of Hadoop small file storage for big data in healthcare. J Supercomput 72, 3696–3707 (2016). https://doi.org/10.1007/s11227-015-1462-4