Guha Roy 2017
Keywords—Big Data; Hadoop; HDFS; data storage; distributed storage

I. INTRODUCTION
The term Big Data [1] first came to light in 1990. Big Data as we know it today was first defined in 2001, after Doug Laney, an analyst with the Meta Group, published his work on the "3 V's" of Big Data [2]. In this review paper we discuss a brief history of big data, the challenges it posed, and how those challenges were handled. In 2013, according to IBM, 90% of all data had been created in the preceding two years; this figure alone gives an estimate of how fast the amount of data is growing. According to the International Telecommunication Union (Geneva, 22 July 2016), 3.9 billion people remain cut off from the internet globally. Thus, not only is the data generated by each user increasing, but there is also huge growth potential in the new users who are bound to come online sooner or later.

II. WHAT IS BIG DATA?
Data that cannot be captured, stored, or processed within a given or desired time frame because of its size, growth rate, complexity, etc. can be considered big data.
According to Doug Laney, an analyst with the Meta Group [2], big data can be classified along three dimensions: volume, velocity, and variety. Today, additional dimensions such as veracity and variability are also used to classify data as big data.

III. EVOLUTION OF STORAGE SYSTEM
In the paper "The evolution of storage systems" [3], Robert J. T. Morris and B. J. Truskowski explain how storage systems have evolved over five decades to meet changing customer needs. The development of the control unit, RAID (redundant array of independent disks) technologies, copy services, and basic storage management technologies is discussed in brief. The authors describe how the emergence of low-cost local area data networking has allowed the development of network-attached storage (NAS) and storage area network (SAN) technologies, and explain how block virtualization and SAN file systems are necessary to fully reap the benefits of these technologies. The authors also discuss how the recent trend in storage systems toward
API library: This provides the DAG and Runtime APIs and other client-side libraries to build applications.
Orchestration framework: This has been implemented as a YARN Application Master to execute the DAG in a Hadoop cluster via YARN.
Runtime library: This provides implementations of various inputs and outputs that can be used out of the box.
Tez-based implementations of Hive, Pig, and Spark show a considerable increase in performance. MapReduce can easily be written as a Tez-based application; the Tez project comes with a built-in implementation of MapReduce. Tez is deployed at some of the major technology companies, such as Yahoo, Microsoft Azure, and LinkedIn.

Future Trends
1. Data volume will continue to increase
The volume of data is expected to increase exponentially as technology becomes ever more easily accessible and available to more and more people. The spread of the internet will also play a major part.
2. Emergence of IoT systems
Internet of Things (IoT) systems will become an integrated part of our daily life, and all those connected devices will communicate with each other, generating a huge volume of log data as well as real-time data.
3. Integration of machine learning with searching algorithms
Since the 1990s, the volume of unstructured data has been increasing exponentially compared with structured data, and this trend will continue in the coming years. To analyze this type of data, machine learning will eventually become an integral part of data mining and analysis tools.
4. Business intelligence gathering
Today almost every big corporation uses some form of business intelligence derived from Big Data to drive its business. In the future, no company can afford to lose out on business intelligence if it plans to succeed.

References
[1] Peter J. Denning, "Saving All the Bits".
[2] "3D Data Management: Controlling Data Volume, Velocity, and Variety".
[3] "The evolution of storage systems".
[4] Ghemawat, S.; Gobioff, H.; Leung, S. T. (2003). "The Google file system".
[5] Dean, J.; Ghemawat, S.; "MapReduce: Simplified Data Processing on Large Clusters".
[6] https://spark.apache.org/research.
[7] Saha, B.; Shah, H.; Seth, S.; Vijayaraghavan, G.; Murthy, A.; Curino, C.; "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications".
[8] Melnik, S.; Gubarev, A.; Long, J. J.; Romer, G.; Shivakumar, S.; Tolton, M.; Vassilakis, T.; "Dremel: Interactive Analysis of Web-Scale Datasets".