Abstract
With the arrival of the big data era, data processing platforms such as Hadoop have emerged to meet the demand. However, Hadoop's storage layer, the Hadoop Distributed File System (HDFS), handles massive numbers of small files poorly: storing them increases the load on the entire cluster and reduces its efficiency. Yet the datasets that enable researchers to perform healthcare analytics, such as genomic and clinical data, are stored precisely as large collections of small files. The usual remedy is to merge small files and store the resulting large file, but previous methods do not exploit the size distribution of the files and so leave the merging result unoptimized. This article proposes a small-file merging method based on data-block balance. It optimizes the volume distribution of the merged file and effectively reduces the number of HDFS data blocks, thereby lowering the memory overhead and load of the cluster's master node (the NameNode) and enabling efficient data processing.
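The abstract does not spell out the merging algorithm itself, so the following is only a minimal sketch of the core idea: pack small files into merge groups whose total size stays within one HDFS block, so that each merged file fills its blocks as fully as possible and contributes as few blocks as possible. The first-fit-decreasing heuristic, the 128 MB block size, and all class and file names here are illustrative assumptions, not the paper's actual method.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative block-balanced merging: assign small files to merge
// groups so that each group's total size stays within one HDFS block.
public class BlockBalancedMerger {

    // Hypothetical HDFS block size of 128 MB.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    record SmallFile(String name, long size) {}

    // One merge group; its files would be concatenated into a single HDFS file.
    static class MergeGroup {
        final List<SmallFile> files = new ArrayList<>();
        long used = 0;

        boolean fits(SmallFile f) { return used + f.size() <= BLOCK_SIZE; }

        void add(SmallFile f) { files.add(f); used += f.size(); }
    }

    // First-fit-decreasing bin packing: sort files by descending size,
    // then place each file into the first group with enough room.
    static List<MergeGroup> pack(List<SmallFile> files) {
        files.sort(Comparator.comparingLong(SmallFile::size).reversed());
        List<MergeGroup> groups = new ArrayList<>();
        for (SmallFile f : files) {
            MergeGroup target = groups.stream()
                .filter(g -> g.fits(f)).findFirst().orElse(null);
            if (target == null) {
                target = new MergeGroup();
                groups.add(target);
            }
            target.add(f);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical healthcare files (names and sizes invented for the demo).
        List<SmallFile> input = new ArrayList<>(List.of(
            new SmallFile("scan_001.dcm", 90L * 1024 * 1024),
            new SmallFile("genome_chunk.vcf", 60L * 1024 * 1024),
            new SmallFile("labs.csv", 30L * 1024 * 1024),
            new SmallFile("notes.txt", 5L * 1024 * 1024)));
        List<MergeGroup> groups = pack(input);
        for (int i = 0; i < groups.size(); i++) {
            MergeGroup g = groups.get(i);
            System.out.printf("merged_%d: %d files, %.1f%% of one block%n",
                i, g.files.size(), 100.0 * g.used / BLOCK_SIZE);
        }
    }
}

With these sample sizes the packer produces two merged files (90 + 30 + 5 MB, and 60 MB), occupying two HDFS blocks instead of the four that the unmerged files would consume; fewer, fuller blocks mean fewer metadata entries in the NameNode's memory.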
Acknowledgments
This work was supported in part by the National Basic Research Program of China under Grant No. G2011CB302605, and in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61173145 and 61472108.
About this article
Cite this article
He, H., Du, Z., Zhang, W. et al. Optimization strategy of Hadoop small file storage for big data in healthcare. J Supercomput 72, 3696–3707 (2016). https://doi.org/10.1007/s11227-015-1462-4