
THE AGE OF BIG DATA

Every minute:
• 2.8 million social media posts
• 2.5 million website search queries
• 27.2 thousand review posts
• 100 hours of online videos
• 201 million emails sent
• 57 thousand pictures posted
• 50.7 thousand thoughts posted
• Big data is defined as collections of datasets whose volume,
velocity or variety is so large that it is difficult to store,
manage, process and analyze the data using traditional
databases and data processing tools.
• In recent years, there has been exponential growth in both structured and
unstructured data generated by information technology, industrial, healthcare,
Internet of Things (IoT), and other systems.
• According to an estimate by IBM, 2.5 quintillion bytes of data are created
every day.
Below are some key per-minute figures from one such report:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
• Big Data has the potential to power the next generation of smart
applications that leverage the power of data to make applications
intelligent.
• Applications of big data span a wide range of domains such as
web, retail and marketing, banking and finance, industrial,
healthcare, environmental, Internet of Things and cyber-
physical systems.
Big Data analytics deals with collection, storage, processing and
analysis of this massive scale data. Specialized tools and
frameworks are required for big data analysis when:
(1) The volume of data involved is so large that it is difficult to
store, process and analyze data on a single machine,
(2) The velocity of data is very high and the data needs to be
analyzed in real-time,
(3) There is a variety of data involved, which can be structured,
unstructured or semi-structured, and is collected from multiple
data sources.
• Big data analytics involves several steps, from data cleansing and
data munging (or wrangling) to data processing and visualization.
• Big data analytics life-cycle starts from the collection of data
from multiple data sources.
• Specialized tools and frameworks are required to ingest the
data from different sources into the big data analytics backend.
• The data is stored in specialized storage solutions (such as
distributed file systems and non-relational databases) which are
designed to scale.
Big data analytics is enabled by several technologies such as
cloud computing, distributed and parallel processing frameworks,
non-relational databases and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks including text, images, audio and
video data
• Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and
energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock markets data
• Transactional data generated by banking and financial applications
Characteristics of Big Data
1. Volume
Big data is a form of data whose volume is so large that it would not fit on a
single machine; therefore, specialized tools and frameworks are required to
store, process and analyze such data.
Ex: social media applications process billions of messages every day,
industrial and energy systems can generate terabytes of sensor data
every day, cab aggregation applications can process millions of transactions
in a day, etc.
Though there is no fixed threshold for the volume of data to be considered
big data, the term is typically used for massive-scale data that is difficult to
store, manage and process using traditional databases and data processing
architectures.
2. Velocity
• Velocity of data refers to how fast the data is generated.
• Data generated by certain sources can arrive at very high velocities, for
example, social media data or sensor data.
• Velocity is another important characteristic of big data and the reason for the
exponential growth of data.
• High velocity of data results in a very large volume of data accumulating in
a short span of time.
3. Variety
• Variety refers to the forms of the data.
• Big data comes in different forms such as structured, unstructured or
semi-structured, including text data, image, audio, video and sensor data.
• Big data systems need to be flexible enough to handle such a variety of
data.
• RDBMS deals with structured data.
• Nearly 70-80% of all data is either unstructured or semi-structured.
• Unstructured data – Facebook videos, audio, images, text messages
• Semi-structured data – log files
HADOOP
• 1990 – white paper; after 13 years:
• 2003 – GFS (Google File System)
• 2004 – MapReduce
• 2006-2007 – HDFS (Hadoop Distributed File System)
• 2007-2008 – Hadoop MapReduce

HDFS – store huge data sets
MapReduce – process data stored in HDFS
HADOOP
Hadoop is an open-source framework from the Apache Software Foundation
for storing and processing huge data sets on a cluster of commodity
hardware.

HDFS is a file system specially designed for storing huge data sets on a
cluster of commodity hardware with a streaming access pattern.

Write Once and Read Any number of times (WORA) is called the streaming
access pattern.

Basically, Hadoop accomplishes the following two tasks:

1. Massive data storage
2. Faster processing
HADOOP
Why Hadoop ?

Taking care of hardware failure cannot be made optional in Big Data
analytics; it has to be the rule. In the case of 1000 nodes, we need to
consider, say, 4000 disks, 8000 cores, 25 switches, 1000 NICs and 2000 RAM
modules (16 TB). The mean time between failures could be even less than a
day, since commodity hardware is used. There is a need for a fault-tolerant
store to guarantee reasonable availability.
HADOOP
Hadoop Goals

The main goals of Hadoop are:

1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerant: It is designed with a very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handles hardware failures: The resiliency of these clusters comes from the
software's ability to detect and handle failures at the application layer.
HADOOP
Hadoop Core Components

Hadoop consists of the following components:

1. Hadoop Common: This package provides file system and OS-level
abstractions. It contains libraries and utilities required by other Hadoop
modules.
2. HDFS: HDFS is a distributed file system that provides a limited interface
for managing the file system.
3. Hadoop MapReduce: MapReduce is the key algorithm that the Hadoop
MapReduce engine uses to distribute work around a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): It is
a resource management platform responsible for managing compute
resources in clusters and using them for scheduling users' applications.
HADOOP
Hadoop Common Package

• This consists of the necessary Java Archive (JAR) files and scripts needed to
start Hadoop.
• Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
• The standard start-up and shut-down scripts need secure shell (SSH) to be
set up between the nodes in the cluster.

• HDFS (storage) and MapReduce (processing) are the two core
components of Apache Hadoop.
• The storage system (HDFS) is not physically separate from the processing
system (MapReduce); they run on the same cluster nodes.
HADOOP
Hadoop Distributed File System (HDFS)

• HDFS is a distributed file system that provides a limited interface for
managing the file system, allowing it to scale and provide high throughput.
• HDFS creates multiple replicas of each data block and distributes them on
computers throughout a cluster to enable reliable and rapid access.
• When a file is loaded into HDFS, it is replicated and fragmented into
“blocks” of data which are stored across the cluster nodes; the cluster
nodes are also called DataNodes.
• The NameNode is responsible for the storage and management of metadata,
so that when MapReduce or another execution framework calls for the data,
the NameNode informs it where the needed data resides.
• The figure below shows NameNode and DataNode replication in the HDFS
architecture.
HADOOP

[Figure: NameNode and DataNode replication in the HDFS architecture]

1. HDFS creates multiple replicas of data blocks for reliability, placing them
on the computer nodes around the cluster.
2. Hadoop’s target is to run on clusters of the order of 10,000 nodes.
3. A file consists of many 64 MB blocks.
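
The interaction with HDFS described above can also be done programmatically. Below is a minimal sketch using the standard Hadoop FileSystem Java API; the NameNode address hdfs://localhost:9000 and the path /demo/file.txt are assumptions for illustration. It writes a small file and then asks the NameNode where the blocks of that file reside.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them for us.
        Path file = new Path("/demo/file.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```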
HADOOP
• Struts, Spring and Hibernate are frameworks in Java.
• In the same way, Hadoop is a framework, which is overseen by the
Apache Software Foundation.
• Much open-source software of this kind is hosted by the Apache
Software Foundation.
• You do not have to pay anything; you can simply download it for
free.
• Industry generally prefers working with open-source software
rather than commercial software.
• Hadoop is basically for storing and processing huge data sets; it is
not recommended for small data sets.
• For small data sets you should use local machines to store and
process the data and should not use Hadoop.
• Suppose you want to store 500 TB of data:
  Highly reliable hardware – costly
  Commodity hardware – cheap
• Hard disk internal structure: suppose your hard disk capacity is 500 GB
and it is divided into blocks of size 4 KB.
• Suppose in a 4 KB block I store a file of size 2 KB; then the remaining
2 KB is wasted.
• These block sizes are fixed by the file system, whether on Unix, Linux or
any other OS.
• On top of this OS / hard disk, we are going to install Hadoop / HDFS.
• The default block size of HDFS is 64 MB (one block of size 64 MB).
• Suppose I want to store a file of size 35 MB; the remaining space
= 64 − 35 = 29 MB.
• On a normal hard disk this remaining space within the block is wasted,
but in Hadoop it is not wasted.
• HDFS releases the remaining space for other files.
• If I uninstall Hadoop from the machine, the 64 MB block size
automatically reverts to the underlying 4 KB block size.

• For 1 GB, how many 64 MB blocks do we require? 1024 MB / 64 MB
= 16 blocks.
• For 500 GB we therefore require 500 × 16 = 8000 blocks of size 64 MB.
• The Hadoop administrator should take care of this allocation and
deallocation process.
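
As a quick sanity check on the arithmetic above, here is a tiny sketch (plain Java, no Hadoop APIs) that computes how many 64 MB blocks a file or disk of a given size needs; the method name blocksNeeded is illustrative.

```java
public class BlockCount {
    static final long BLOCK_SIZE_MB = 64; // HDFS default block size used in this chapter

    // Number of 64 MB blocks needed for a given size, rounding the last partial block up.
    static long blocksNeeded(long sizeMb) {
        return (sizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
    }

    public static void main(String[] args) {
        System.out.println("1 GB   -> " + blocksNeeded(1024) + " blocks");        // 16
        System.out.println("500 GB -> " + blocksNeeded(500L * 1024) + " blocks"); // 8000
        System.out.println("200 MB -> " + blocksNeeded(200) + " blocks");         // 4
    }
}
```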
• Hadoop 1.x has five services:

  Master services / nodes:
  • NameNode
  • Secondary NameNode
  • JobTracker

  Slave services / nodes:
  • DataNode
  • TaskTracker
• Master services can talk to each other, and slave services can talk to
each other.
• If NameNode is a master node then its corresponding slave node
is DataNode.
• If JobTracker is a master node then its corresponding slave node
is TaskTracker.
• NameNode can talk to DataNode and JobTracker can talk to
TaskTracker, but NameNode can not talk to TaskTracker.
• Suppose there is a client that needs to store and process huge
data.
• Since this client has huge data and needs to process it in less time,
he wants to put it into Hadoop.
• Let us suppose that the client has a file of size 200 MB.
• Hadoop is not meant for such a small size as 200 MB, but for ease of
understanding I am taking it as 200 MB.
• This file is stored on Hadoop split across a number of 64 MB blocks.
• Not even 0.000001 KB of data should be lost.
• For storing 200 MB of data we need 4 blocks: 3 × 64 MB = 192 MB,
plus one block of 8 MB.
• The total = 200 MB.
[Figure: The client asks the NameNode to store file.txt (200 MB). The
NameNode's metadata records the file as four parts – a.txt, b.txt, c.txt of
64 MB each and d.txt of 8 MB (3 × 64 MB blocks + 1 × 8 MB block) – and maps
each part to DataNodes: a.txt – 1, 2, 4; b.txt – 3, 5, 8; c.txt – 5, 6, 7;
d.txt – 7, 9, 10. The cluster consists of ten DataNodes (DN), each also
running a TaskTracker (TT).]
• Basically, the client does not have to worry about splitting the file into
64 MB blocks across the cluster.
• Hadoop takes care of this splitting: the 200 MB file becomes three 64 MB
blocks (a.txt, b.txt, c.txt) and one 8 MB block (d.txt).
• Even though d.txt is only 8 MB, it is stored in a 64 MB block; the
remaining 56 MB is used to store other files, so it is not wasted.
• If the client wants to store his 200 MB of data in HDFS, whom should he
contact? He should go to the NameNode.
• When the client's request reaches the NameNode, the NameNode takes care
of the metadata.
• The NameNode then sends a signal to the client saying: you can now store
the data on DataNodes 1, 3, 5, 7.
• The client then directly approaches DataNode 1 and stores the file
a.txt on it.
• As we know, the DataNodes are commodity hardware (cheap hardware)
with different storage capacities.
• If the data on a DataNode is lost, what will happen? Are we going to
lose the data?
• Definitely we are not going to lose the data.
• By default, HDFS keeps three replications of the same file,
i.e. a.txt.
• That means that for storing 200 MB of data we will occupy 600 MB
of space, as three replications of each file are kept.
• How can the NameNode know which file is stored in which
block?
• For this reason, every 3 seconds each DataNode sends a heartbeat
(and, periodically, a block report) to the NameNode.
• If the NameNode does not receive the heartbeat from a DataNode,
it assumes that the DataNode is dead.
• If the metadata is lost, Hadoop is of no use; for this reason the
metadata is stored on reliable hardware.
• This is called a single point of failure: if the NameNode is lost,
nothing will be accessible.

• Now the data is stored on different DataNodes, and it is time to
process that data.
• To process the data I need to write a program, in any programming
language such as Java, Python, Ruby, etc.
• Say we write a 10 KB program for the processing.
• Which is better:
sending the 10 KB program to the data, or fetching the data to the program?
• Of course, sending the program to the data is advisable; fetching
huge data to the program would create problems.
• Here the JobTracker comes into the picture.
• The JobTracker does not know which file is stored on which
DataNode.
• As we have said, master services talk to each other and slave
services talk to each other.
• The JobTracker gets the information about where the files are stored in
the DataNodes for processing.
• The JobTracker shares this information with the TaskTrackers.
• The JobTracker's job is to assign tasks to the TaskTrackers.
• The TaskTracker knows that the a.txt file is on the local DataNode,
i.e. on system no. 1.
• Applying the program to the local a.txt file is a process, and that
process is called a Map.
• The same process continues for b.txt, c.txt, d.txt.
• The files a.txt, b.txt, c.txt, d.txt are called input splits.
• The number of input splits equals the number of Mappers that will run,
as shown in the sketch below.
• The JobTracker takes care of this.
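
To make the Map step concrete, here is a minimal sketch of a classic word-count Mapper written against the Hadoop MapReduce Java API; the class name WordCountMapper and the word-count logic are illustrative examples, not taken from the slides. One Mapper instance runs per input split, ideally on the DataNode that holds that split.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper is run per input split (e.g. per 64 MB block of a.txt, b.txt, ...).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line within the split, value = the line itself.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1) for the reducers to sum
        }
    }
}
```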
HADOOP Ecosystem

Fig. Hadoop Ecosystem: on top of the Hadoop Distributed File System (HDFS)
sits the distributed processing layer (MapReduce); around them are data
transfer between Hadoop and RDBMS (Sqoop), scripting (Pig), query (Hive),
machine learning (Mahout), columnar database (HBase), coordination
(Zookeeper), and workflow and scheduling (Oozie).


• Apart from HDFS and MapReduce, the other components of the
Hadoop ecosystem are shown in the figure above.
• The main ecosystem components of Hadoop architecture are as
follows:
1. Apache HBase : Columnar (non-relational) database
2. Apache Hive : Data access and query
3. Apache HCatalog : Metadata Services
4. Apache Pig : Scripting Platform
5. Apache Mahout : Machine Learning Libraries for Data Mining
6. Apache Oozie : Workflow and scheduling services
7. Apache Zookeeper : Cluster coordination
8. Apache Sqoop : Data integration services
1. HBase
• It is an open-source, distributed, column-oriented store that sits on
top of HDFS.
• HBase is based on Google's Bigtable.
• It is organized by columns rather than rows.
• This increases the speed of execution of operations when they need
to be performed on similar values across massive data sets.
• Simply put, HBase is a NoSQL database.
• All data is stored in the form of (key, value) pairs.
• Both key and value are byte-array data.
• Data is stored under a column family (CF) together with a
timestamp (ts) value.
• Example:

Actual data (relational table):

  id    Name      salary
  101   Howard    10000
  102   Pthanil   11000
  103   Robert    10000

HBase data (row → column + cell):

  row 101 : column = cf:name, ts, value = 'Howard'
  row 102 : column = cf:name, ts, value = 'Pthanil'
  row 103 : column = cf:name, ts, value = 'Robert'

Column-oriented view:

  Id      101      102       103
  Name    Howard   Pthanil   Robert
  salary  10000    11000     10000
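
A minimal sketch of how a row like the one above could be written and read with the HBase Java client API. It assumes a running HBase instance and an already-created table named "employee" with a column family "cf" (both names are illustrative, e.g. created beforehand in the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Store row 101: column family "cf", qualifiers "name" and "salary".
            Put put = new Put(Bytes.toBytes("101"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Howard"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("salary"), Bytes.toBytes("10000"));
            table.put(put);

            // Read the row back; every cell also carries a timestamp (ts).
            Result result = table.get(new Get(Bytes.toBytes("101")));
            byte[] name = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println("cf:name = " + Bytes.toString(name));
        }
    }
}
```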
2. Hive
• Hive provides a warehouse structure over other Hadoop input
sources and SQL-like access to data in HDFS.
• Hive's query language, HiveQL, compiles to MapReduce and also
supports user-defined functions (UDFs).
• Hive's data model is based primarily on three related data
structures: tables, partitions and buckets.
• Tables correspond to HDFS directories that are divided into
partitions, which in turn can be divided into buckets.
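
A minimal sketch of querying Hive from Java over JDBC. It assumes a HiveServer2 endpoint at localhost:10000 and an existing "employee" table with name and salary columns (all of these are illustrative assumptions); behind the scenes, Hive compiles the HiveQL into MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; adjust host/port/database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // A HiveQL query; Hive compiles it to one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, salary FROM employee WHERE salary > 10000")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " : " + rs.getLong("salary"));
            }
        }
    }
}
```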
3. HCatalog
• HCatalog is a metadata and table storage service for HDFS.
• HCatalog’s goal is to simplify the User’s interaction with HDFS
data and enable sharing tools and execution platforms.
4. Pig
• Pig is a runtime environment that allows users to execute
MapReduce on a Hadoop cluster.
• Pig Latin is a high-level scripting language on the Pig platform.
• Like HiveQL in Hive, Pig Latin is a high-level language that
compiles to MapReduce.

• Pig is more flexible than Hive with respect to possible data formats,
due to its data model.
• Pig's data model is similar to the relational data model, but here
tuples can be nested.
• In Pig, tables are called bags.
• Pig also has a “map” data type, which is useful for representing
semi-structured data such as JSON or XML; see the sketch below.
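
A small sketch of running Pig Latin from Java via the PigServer API in local mode; the input file name employee.csv and its schema are illustrative assumptions. The same two Pig Latin statements could equally be typed into the Grunt shell.

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode: runs against the local file system, no Hadoop cluster needed.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load a comma-separated file into a bag of tuples and filter it.
        pig.registerQuery("emp = LOAD 'employee.csv' USING PigStorage(',') "
                + "AS (id:int, name:chararray, salary:long);");
        pig.registerQuery("high = FILTER emp BY salary > 10000;");

        // Iterate over the result of the 'high' alias.
        Iterator<Tuple> it = pig.openIterator("high");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```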
5. Sqoop
• Sqoop (“SQL-to-Hadoop”) is a tool which transfers data in both
directions between relational systems and HDFS or other Hadoop
data stores such as Hive or HBase.
• Sqoop can be used to import data from external structured
databases into HDFS or any other related systems such as Hive
or HBase.
• On the other hand, sqoop can also be used to extract data from
Hadoop and export it to external structured databases such as
relational databases and enterprise data warehouses.
6. Oozie
• Oozie is a job coordinator and workflow manager for jobs
executed in Hadoop.
• It is integrated with the rest of the Apache Hadoop stack.
• It supports several types of Hadoop jobs such as Java map-
reduce, Streaming map-reduce, Pig, Hive and Sqoop as well as
system-specific jobs such as Java programs and shell scripts.
• An Oozie workflow is a collection of actions and Hadoop jobs
arranged in a Directed Acyclic Graph (DAG), since tasks are
executed in a sequence and also are subject to certain constraints.
7. Mahout
• Mahout is a scalable machine-learning and data-mining library.
• There are currently four main groups of algorithms in Mahout:
• 1. Recommendations / collaborative filtering
• 2. Classification / categorization
• 3. Clustering
• 4. Frequent item-set mining / parallel frequent pattern mining
• Many machine-learning algorithms are non-scalable, i.e. given the
types of operations they perform, they cannot be executed as a
set of parallel processes.
• But the algorithms in the Mahout library can be executed in a
distributed fashion, and have been written for MapReduce.
8. Zookeeper
• Zookeeper is a distributed service with master and slave nodes
for storing and maintaining configuration information, naming,
providing distributed synchronization and providing group
services in memory on Zookeeper servers.
• Zookeeper allows distributed processes to coordinate with each
other through a shared hierarchical name space of data registers
called znodes.
• HBase depends on Zookeeper and runs a Zookeeper instance by
default.
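
A minimal sketch of using the ZooKeeper Java client to create and read a znode, illustrating the shared hierarchical name space mentioned above. The quorum address localhost:2181, the znode path /demo-config and its data are illustrative assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (a single local server is assumed here).
        // Production code would wait for the connection event before issuing requests.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event ->
                System.out.println("event: " + event.getState()));

        String path = "/demo-config";
        // Create a persistent znode holding a small piece of configuration data.
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back; Stat carries the znode's version and timestamps.
        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");
        zk.close();
    }
}
```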
