
A Review on Big Data Theoretical and Application Approach: Past and Present

Rabel Guharoy
Asst. Prof., CSE Dept., UEM, Jaipur, India
rabelrock@gmail.com

Upasak Pal
Big Data Trainer, Ardent Computer Tech., Kolkata, India
upasak.zero@gmail.com

Sumit Kumar
Student, CSE Dept., UEM, Jaipur, India
Sk51055@gmail.com

Abstract—Machines that store and process data have been in the works for a long time; as the volume of data increased, so did the capacity of the machines to handle it. The notion of big data came to light during the early days of the internet, when machines not only stored and processed data but also generated huge volumes of information. We started to encounter datasets beyond the capacity of a single machine to store and process; such data were referred to as big data. As the internet era progressed and people started to use social media and mobile devices, different types of datasets were generated at very high velocity, beyond what conventional computing techniques could handle. Today, distributed and parallel processing techniques are widely used to process such datasets. In the future, personal and sensor data generation is expected to increase exponentially, and so will the need for real-time information; to cater to such needs, not only the machines but also the computation techniques must be further improved.

Keywords—Bigdata; Hadoop; HDFS; data storage; distributed storage

I. INTRODUCTION
The term Big Data [1] first came to light around 1990. Big Data as we know it today was first defined in 2001, when Doug Laney, an analyst with the Meta Group, published his note on the "3 V's" of Big Data [2]. In this review paper we discuss a brief history of big data, the challenges it posed, and how those challenges were handled. According to IBM, 90% of the world's data in 2013 had been created in the preceding two years; this figure alone gives an estimate of how fast the amount of data is growing. According to the International Telecommunication Union (Geneva, 22 July 2016), 3.9 billion people remain cut off from the internet globally. So not only is data generation per user increasing, there is also huge growth potential in adding new users, which is bound to happen sooner or later.

Fig. 1. Source: International Telecommunication Union

II. WHAT IS BIG DATA?

Data which cannot be captured, stored, or processed within a given/desired time frame, due to its size, growth rate, complexity, and so on, can be considered big data. According to Doug Laney, an analyst with the Meta Group [2], big data can be classified along three dimensions: volume, velocity, and variety. Today, further dimensions such as veracity and variability are also used to classify data as big data.
III. EVOLUTION OF STORAGE SYSTEMS

In "The Evolution of Storage Systems" [3], Robert J. T. Morris and B. J. Truskowski explain how storage systems have evolved over five decades to meet changing customer needs. The development of the control unit, RAID (redundant array of independent disks) technologies, copy services, and basic storage management technologies is discussed in brief. The authors describe how the emergence of low-cost local area data networking allowed the development of network-attached storage (NAS) and storage area network (SAN) technologies, and explain how block virtualization and SAN file systems are necessary to fully reap the benefits of these technologies. They also discuss how the recent trend in storage systems toward managing complexity, ease of use, and lowering the total cost of ownership has led to the development of autonomic storage, and they conclude with an assessment of the current state of the art by presenting a set of challenges driving research and development in storage systems.



IV. MAJOR CHALLENGES

Large-scale data analytics, once a technology exclusive to the large web companies, is nowadays available relatively easily and is often indispensable for most modern organizations. Doug Laney of the META Group first described the three "V's" of Big Data. Volume, Velocity, and Variety were the three dimensions across which a dataset could be classified:

Volume: when the size of a dataset grows beyond the capacity of a single machine to store and process it.

Velocity: when the rate of data transfer and generation goes beyond the capacity of a machine.

Variety: when the complexity of types, formats, and features becomes too much for conventional systems such as RDBMSs to handle.

Apart from these three "V's", there are now several more dimensions along which data can be classified as beyond the reach of a conventional machine or processing technique. To deal with such problems, several processing techniques have emerged, mostly based on distributed and parallel processing.
V. GOOGLE FILE SYSTEM AND MAP REDUCE

Google File System (GFS) [4] is a distributed file system based on a master-slave architecture. GFS was designed and implemented to meet the rapidly growing demands of Google's data processing needs. It shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability. Some of the principles that GFS follows are: i) failures are a common occurrence; in fact, it is assumed that at any particular point in time more than one node is non-operational; ii) large files, typically on the order of gigabytes, are common; iii) overwriting a file is avoided; instead of overwriting, new information is appended; iv) the consistency model is relaxed to make the system more flexible.

MapReduce [5] is a programming model which helps to process large datasets in a distributed environment. The basic principle behind MapReduce programming is that it takes key/value pairs as input and generates key/value pairs as output. There are two major phases: i) the Map phase takes a key/value pair as input and generates intermediate sets of key/value pairs; ii) the Reduce phase takes a key and the list of values belonging to that key from the intermediate results and applies transformations before generating the output.
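To make the key/value flow concrete, here is a minimal single-machine sketch of the MapReduce model as a word count, written in plain Java rather than against any framework. The class and method names are ours, purely for illustration, and the "shuffle" step that a real distributed runtime performs across the network is simulated with an in-memory grouping.

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-process sketch of the MapReduce model: the map phase
// turns each input record into intermediate (key, value) pairs, the framework
// groups pairs by key, and the reduce phase folds each group into one output.
public class MapReduceSketch {

    // Map phase: one input line -> intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: a key and all its intermediate values -> one output value.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data needs big machines",
                                     "machines generate big data");

        // Shuffle/group step: collect intermediate pairs by key.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```

Running the sketch prints each word with its total count, for example "big -> 3"; a distributed runtime does the same thing, but with the map and reduce calls spread over many nodes.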
VI. APACHE HADOOP

Hadoop is an open-source framework to store and process large datasets in a distributed environment. Hadoop is based on Java and was originally developed by Doug Cutting. The Hadoop Distributed File System (HDFS) and the MapReduce framework are the core components of Hadoop; they are derived from their Google counterparts, GFS and the MapReduce programming model, respectively. Being an open-source framework, Hadoop became synonymous with Big Data. Companies like Facebook and Yahoo built components which run on top of Hadoop's core components, such as Hive and Pig; because of these, interacting with the MapReduce framework became relatively easy.
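As a concrete illustration of Hadoop's core components working together, the sketch below closely follows the classic WordCount example that ships with Hadoop: the mapper emits (word, 1) pairs, the reducer sums them, and the job reads from and writes to HDFS paths supplied on the command line. It assumes the Hadoop client libraries are on the classpath and a cluster (or local mode) is configured; treat it as a sketch rather than a tuned production job.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line -> one (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce phase: (word, [1, 1, ...]) -> (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the reducer doubles as a combiner here: partial sums are computed on the map side, which cuts down the volume of data shuffled between the two phases.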
VII. IN-MEMORY PROCESSING

MapReduce helped us deal with big data, but it is a disk-oriented approach in which intermediate results are stored on disk. Disk space is easily available on commodity machines, but recently the memory capacity of commodity machines has also increased substantially, to the point where MapReduce does not utilize the full potential of the available memory; new processing techniques are being used to exploit it. For example, even though Google was the first to come up with the idea of MapReduce, since 2014 Google has no longer used MapReduce as its primary model for processing big data. In-memory processing has become the new trend in big data analysis; some of the popular in-memory processing engines are Tez, Impala, Drill, and Google Dremel.

• Dremel [8] is a scalable, interactive, ad-hoc query system for the analysis of read-only nested data. The two major components of Dremel are multi-level execution trees and a columnar data layout. Dremel's multi-level execution trees form a high-speed processing engine capable of running aggregation queries over trillion-row tables in seconds; the execution tree can scale across multiple nodes, typically numbering in the thousands, and supports interactive analysis of very large datasets over shared clusters of commodity machines. Some of the fields where Dremel is used by Google: analysis of crawled web documents, tracking install data for applications on the Android Market, crash reporting for Google products, and spam analysis.

• Apache Tez [7] is an open-source framework designed to build data-flow-driven processing runtimes. Tez was developed after Hadoop: Hadoop solved the problem of storing and capturing large datasets, but it relied on MapReduce for processing, which involved multiple intermediate stages and therefore multiple disk I/O operations. Tez tries to take matters a step further. It is a library to build data-flow-based runtimes/engines, and its APIs allow frameworks to clearly model the logical and physical semantics of their data flow graphs.

• The Tez API consists of three main components:

API library: provides the DAG and Runtime APIs and other client-side libraries to build applications.

Orchestration framework: implemented as a YARN Application Master to execute the DAG in a Hadoop cluster via YARN.

Runtime library: provides implementations of various inputs and outputs that can be used out of the box.

Tez has its own implementations of Hive, Pig, and Spark with considerable increases in their performance. A MapReduce job can easily be written as a Tez-based application; the Tez project comes with a built-in implementation of MapReduce. Tez is deployed at some of the major technology players such as Yahoo, Microsoft Azure, and LinkedIn.
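To see why avoiding intermediate disk I/O matters, the plain-Java sketch below (not the Tez API; every name in it is illustrative) contrasts a MapReduce-style two-stage job, which materializes stage one's output in a file for stage two to read back, with a data-flow style pipeline that streams records from one stage to the next entirely in memory.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Contrast of a disk-oriented two-stage job (MapReduce style) with an
// in-memory pipeline (data-flow style). Stage 1 normalizes records,
// stage 2 filters them; only the disk version materializes stage 1's output.
public class PipelineStyles {

    // MapReduce style: stage 1 writes its full intermediate result to disk,
    // and stage 2 reads it back, costing two extra rounds of disk I/O.
    static List<String> diskOriented(List<String> records) throws IOException {
        Path intermediate = Files.createTempFile("stage1-output", ".txt");
        Files.write(intermediate,
                records.stream().map(String::toLowerCase).collect(Collectors.toList()));
        try (Stream<String> lines = Files.lines(intermediate)) {
            return lines.filter(r -> r.contains("data")).collect(Collectors.toList());
        } finally {
            Files.deleteIfExists(intermediate);
        }
    }

    // Data-flow style: stage 2 consumes stage 1's output directly in memory.
    static List<String> inMemory(List<String> records) {
        return records.stream()
                .map(String::toLowerCase)        // stage 1
                .filter(r -> r.contains("data")) // stage 2, no intermediate file
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        List<String> records = List.of("Big Data", "Small Files", "Data Flow");
        System.out.println(diskOriented(records)); // [big data, data flow]
        System.out.println(inMemory(records));     // [big data, data flow]
    }
}
```

Both variants produce the same result; the difference is the extra round trip through the file system between stages, which is exactly the overhead that engines like Tez and Spark eliminate.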
VIII. FUTURE TRENDS

1. Data volume will continue to increase

The volume of data is expected to increase exponentially as technology becomes more easily accessible and available to more and more people. The continuing spread of the internet will also play a major part.

2. Emergence of IoT systems

Internet of Things (IoT) systems will become an integrated part of our daily life, and all those connected devices communicating with each other will generate a huge volume of logs as well as real-time data.

3. Integration of machine learning with search algorithms

Ever since the 1990s, the volume of unstructured data has been increasing exponentially compared to structured data, and this trend will continue in the upcoming years. To analyze this type of data, machine learning will eventually become an integral part of data mining and analysis tools.

4. Business intelligence gathering

Today almost every big corporation uses some form of business intelligence derived from big data to drive the company. In the future, no company that plans to succeed can afford to lose out on business intelligence.

REFERENCES

[1] P. J. Denning, "Saving All the Bits".
[2] D. Laney, "3D Data Management: Controlling Data Volume, Velocity, and Variety".
[3] R. J. T. Morris and B. J. Truskowski, "The Evolution of Storage Systems".
[4] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System", 2003.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters".
[6] https://spark.apache.org/research.
[7] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino, "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications".
[8] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets".

