
A Review on Big Data Theoretical and Application Approach: Past and Present

Rabel Guharoy
Asst. Prof., CSE Dept., UEM, Jaipur, India
rabelrock@gmail.com

Upasak Pal
Big Data Trainer, Ardent Computer Tech., Kolkata, India
upasak.zero@gmail.com

Sumit Kumar
Student, CSE Dept., UEM, Jaipur, India
Sk51055@gmail.com

Abstract—Machines that store and process data have been in the works for a long time; as the volume of data increased, so did the capacity of the machines to handle it. The notion of big data came to light during the early days of the internet, when machines not only stored and processed data but also generated huge volumes of information. We started to encounter datasets beyond the capacity of a single machine to store and process; such data were referred to as big data. As the internet era progressed and people started to use social media and mobile devices, different types of datasets were generated at very high velocity, beyond what conventional computing techniques could handle. Today, distributed and parallel processing techniques are widely used to process such datasets. In the future, personal and sensor data generation is expected to increase exponentially, and so will the need for real-time information; to cater to such needs, not only the machines but also the computation techniques must be further improved.

Keywords—Bigdata; Hadoop; HDFS; data storage; distributed storage

I. INTRODUCTION
The term Big Data [1] first came to light around 1990. Big Data as we know it today was first defined in 2001, when Doug Laney, an analyst with the Meta Group, published his note on the "3 V's" of Big Data [2]. In this review paper we discuss a brief history of big data, the challenges it posed, and how those challenges were handled. According to IBM, 90% of the world's data in 2013 had been created in the preceding two years; this figure alone gives an estimate of how fast the amount of data is growing. According to the International Telecommunication Union (Geneva, 22 July 2016), 3.9 billion people remain cut off from the internet globally. So not only is data generation per user increasing, there is also huge growth potential in adding new users, which is bound to happen sooner or later.

Fig. 1. Source: International Telecommunication Union

II. WHAT IS BIG DATA?

Data which cannot be captured, stored, or processed within a given/desired time frame, due to its size, growth rate, complexity, and so on, can be considered big data. According to Doug Laney, an analyst with the Meta Group [2], big data can be classified along three dimensions: volume, velocity, and variety. Today, further dimensions such as veracity and variability are also used to classify data as big data.
III. EVOLUTION OF STORAGE SYSTEMS

In "The Evolution of Storage Systems" [3], Robert J. T. Morris and B. J. Truskowski explain how storage systems have evolved over five decades to meet changing customer needs. The development of the control unit, RAID (redundant array of independent disks) technologies, copy services, and basic storage management technologies is discussed in brief. The authors describe how the emergence of low-cost local area data networking allowed the development of network-attached storage (NAS) and storage area network (SAN) technologies, and explain how block virtualization and SAN file systems are necessary to fully reap the benefits of these technologies. They also discuss how the recent trend in storage systems toward managing complexity, ease of use, and lowering the total cost of ownership has led to the development of autonomic storage, and they conclude with an assessment of the current state of the art by presenting a set of challenges driving research and development in storage systems.



IV. MAJOR CHALLENGES

Large-scale data analytics, once a technology exclusive to the large web companies, is nowadays available relatively easily and is often indispensable for most modern organizations. Doug Laney of the META Group first described the three "V's" of Big Data. Volume, Velocity, and Variety were the three dimensions across which a dataset could be classified:

Volume: when the size of a dataset grows beyond the capacity of a single machine to store and process it.

Velocity: when the rate of data transfer and generation goes beyond the capacity of a machine.

Variety: when the complexity of types, formats, and features becomes too much for conventional systems such as RDBMSs to handle.

Apart from these three "V's", there are now several more dimensions along which data can be classified as beyond the reach of a conventional machine or processing technique. To deal with such problems, several processing techniques have emerged, mostly based on distributed and parallel processing.
V. GOOGLE FILE SYSTEM AND MAP REDUCE

Google File System (GFS) [4] is a distributed file system based on a master-slave architecture. GFS was designed and implemented to meet the rapidly growing demands of Google's data processing needs. It shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability. Some of the principles that GFS follows are: i) failures are a common occurrence; in fact, it is assumed that at any particular point in time more than one node is non-operational; ii) large files, typically on the order of gigabytes, are common; iii) overwriting a file is avoided; instead of overwriting, new information is appended; iv) the consistency model is relaxed to make the system more flexible.

MapReduce [5] is a programming model which helps to process large datasets in a distributed environment. The basic principle behind MapReduce programming is that it takes key/value pairs as input and generates key/value pairs as output. There are two major phases: i) the Map phase takes a key/value pair as input and generates intermediate sets of key/value pairs; ii) the Reduce phase takes a key and the list of values belonging to that key from the intermediate results and applies transformations before generating the output.
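To make the key/value flow concrete, here is a minimal single-machine sketch of the MapReduce model as a word count, written in plain Java rather than against any framework. The class and method names are ours, purely for illustration, and the "shuffle" step that a real distributed runtime performs across the network is simulated with an in-memory grouping.

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-process sketch of the MapReduce model: the map phase
// turns each input record into intermediate (key, value) pairs, the framework
// groups pairs by key, and the reduce phase folds each group into one output.
public class MapReduceSketch {

    // Map phase: one input line -> intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: a key and all its intermediate values -> one output value.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data needs big machines",
                                     "machines generate big data");

        // Shuffle/group step: collect intermediate pairs by key.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```

Running the sketch prints each word with its total count, for example "big -> 3"; a distributed runtime does the same thing, but with the map and reduce calls spread over many nodes.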
VI. APACHE HADOOP

Hadoop is an open-source framework to store and process large datasets in a distributed environment. Hadoop is based on Java and was originally developed by Doug Cutting. The Hadoop Distributed File System (HDFS) and the MapReduce framework are the core components of Hadoop; they are derived from their Google counterparts, GFS and the MapReduce programming model, respectively. Being an open-source framework, Hadoop became synonymous with Big Data. Companies like Facebook and Yahoo built components which run on top of Hadoop's core components, such as Hive and Pig; because of these, interacting with the MapReduce framework became relatively easy.
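As a concrete illustration of Hadoop's core components working together, the sketch below closely follows the classic WordCount example that ships with Hadoop: the mapper emits (word, 1) pairs, the reducer sums them, and the job reads from and writes to HDFS paths supplied on the command line. It assumes the Hadoop client libraries are on the classpath and a cluster (or local mode) is configured; treat it as a sketch rather than a tuned production job.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line -> one (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce phase: (word, [1, 1, ...]) -> (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the reducer doubles as a combiner here: partial sums are computed on the map side, which cuts down the volume of data shuffled between the two phases.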
VII. IN-MEMORY PROCESSING

MapReduce helped us deal with big data, but it is a disk-oriented approach in which intermediate results are stored on disk. Disk space is easily available on commodity machines, but recently the memory capacity of commodity machines has also increased substantially, to the point where MapReduce does not utilize the full potential of the available memory; new processing techniques are being used to exploit it. For example, even though Google was the first to come up with the idea of MapReduce, since 2014 Google has no longer used MapReduce as its primary model for processing big data. In-memory processing has become the new trend in big data analysis; some of the popular in-memory processing engines are Tez, Impala, Drill, and Google Dremel.

• Dremel [8] is a scalable, interactive, ad-hoc query system for the analysis of read-only nested data. The two major components of Dremel are multi-level execution trees and a columnar data layout. Dremel's multi-level execution trees form a high-speed processing engine capable of running aggregation queries over trillion-row tables in seconds; the execution tree can scale across multiple nodes, typically numbering in the thousands, and supports interactive analysis of very large datasets over shared clusters of commodity machines. Some of the fields where Dremel is used by Google: analysis of crawled web documents, tracking install data for applications on the Android Market, crash reporting for Google products, and spam analysis.

• Apache Tez [7] is an open-source framework designed to build data-flow-driven processing runtimes. Tez was developed after Hadoop: Hadoop solved the problem of storing and capturing large datasets, but it relied on MapReduce for processing, which involved multiple intermediate stages and therefore multiple disk I/O operations. Tez tries to take matters a step further. It is a library to build data-flow-based runtimes/engines, and its APIs allow frameworks to clearly model the logical and physical semantics of their data flow graphs.

• The Tez API consists of three main components:

API library: provides the DAG and Runtime APIs and other client-side libraries to build applications.

Orchestration framework: implemented as a YARN Application Master to execute the DAG in a Hadoop cluster via YARN.

Runtime library: provides implementations of various inputs and outputs that can be used out of the box.

Tez has its own implementations of Hive, Pig, and Spark with considerable increases in their performance. A MapReduce job can easily be written as a Tez-based application; the Tez project comes with a built-in implementation of MapReduce. Tez is deployed at some of the major technology players such as Yahoo, Microsoft Azure, and LinkedIn.
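To see why avoiding intermediate disk I/O matters, the plain-Java sketch below (not the Tez API; every name in it is illustrative) contrasts a MapReduce-style two-stage job, which materializes stage one's output in a file for stage two to read back, with a data-flow style pipeline that streams records from one stage to the next entirely in memory.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Contrast of a disk-oriented two-stage job (MapReduce style) with an
// in-memory pipeline (data-flow style). Stage 1 normalizes records,
// stage 2 filters them; only the disk version materializes stage 1's output.
public class PipelineStyles {

    // MapReduce style: stage 1 writes its full intermediate result to disk,
    // and stage 2 reads it back, costing two extra rounds of disk I/O.
    static List<String> diskOriented(List<String> records) throws IOException {
        Path intermediate = Files.createTempFile("stage1-output", ".txt");
        Files.write(intermediate,
                records.stream().map(String::toLowerCase).collect(Collectors.toList()));
        try (Stream<String> lines = Files.lines(intermediate)) {
            return lines.filter(r -> r.contains("data")).collect(Collectors.toList());
        } finally {
            Files.deleteIfExists(intermediate);
        }
    }

    // Data-flow style: stage 2 consumes stage 1's output directly in memory.
    static List<String> inMemory(List<String> records) {
        return records.stream()
                .map(String::toLowerCase)        // stage 1
                .filter(r -> r.contains("data")) // stage 2, no intermediate file
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        List<String> records = List.of("Big Data", "Small Files", "Data Flow");
        System.out.println(diskOriented(records)); // [big data, data flow]
        System.out.println(inMemory(records));     // [big data, data flow]
    }
}
```

Both variants produce the same result; the difference is the extra round trip through the file system between stages, which is exactly the overhead that engines like Tez and Spark eliminate.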
VIII. FUTURE TRENDS

1. Data volume will continue to increase

The volume of data is expected to increase exponentially as technology becomes more easily accessible and available to more and more people. The continuing spread of the internet will also play a major part.

2. Emergence of IoT systems

Internet of Things (IoT) systems will become an integrated part of our daily life, and all those connected devices communicating with each other will generate a huge volume of logs as well as real-time data.

3. Integration of machine learning with search algorithms

Ever since the 1990s, the volume of unstructured data has been increasing exponentially compared to structured data, and this trend will continue in the upcoming years. To analyze this type of data, machine learning will eventually become an integral part of data mining and analysis tools.

4. Business intelligence gathering

Today almost every big corporation uses some form of business intelligence derived from big data to drive the company. In the future, no company that plans to succeed can afford to lose out on business intelligence.

REFERENCES

[1] P. J. Denning, "Saving All the Bits".
[2] D. Laney, "3D Data Management: Controlling Data Volume, Velocity, and Variety".
[3] R. J. T. Morris and B. J. Truskowski, "The Evolution of Storage Systems".
[4] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System", 2003.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters".
[6] https://spark.apache.org/research.
[7] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino, "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications".
[8] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets".

