BDA Unit 1 Notes

The document provides an overview of Big Data and its significance, highlighting the importance of data analysis for smart decision-making and cost efficiency. It categorizes Big Data into structured, unstructured, and semi-structured types, and discusses the 3 V's: velocity, volume, and variety. Additionally, it explains the architecture and components of Google File System (GFS) and Hadoop Distributed File System (HDFS), emphasizing their roles in managing large data sets.


BIG DATA ANALYTICS
8CS4-01
UNIT-I

DATA: Anything recorded in the system; a collection of numbers, alphabets, or special symbols. Example: 10041990.
Information: Processed data. Example: 10/04/1990 is the birth date of MS Dhoni.
Big Data: A collection of large, complex, and diverse data sets that are difficult to manage and analyze using traditional data processing tools.
Examples: patient records, e-commerce sites (online shopping), transportation, etc.
WHY BIG DATA IS IMPORTANT

It is not enough to collect data; you also have to analyse it.
Low cost.
Less time taken.
Smart decision making.
TYPES OF BIG DATA

• Structured: Defined in a fixed format, generally in table form. Example: a table of employees in an organization.
• Unstructured: Has no predefined format. Examples: videos, images, presentations, e-mails, etc.
• Semi-structured: Not fixed in structure or schema, but contains some structural elements, such as tags or markers, to separate data elements (see the sketch below). Examples: JSON, XML files, etc.
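As a minimal sketch of a semi-structured record (the field names are illustrative only, not from any specific dataset), a JSON document uses keys and nesting as the structural markers without enforcing a fixed table schema:

    {
      "employee": {
        "name": "A. Kumar",
        "id": 10432,
        "skills": ["Hadoop", "Java"],
        "address": { "city": "Jaipur" }
      }
    }

A second record could add or omit fields (say, a "phone" key) without breaking the format, which is exactly what a fixed relational schema would not allow.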
THE V’S: CHARACTERISTICS OF BIG DATA

VELOCITY: How quickly data is generated.
VOLUME: The amount of data generated every second.
VARIETY: The type of data.
VERACITY: The trustworthiness of data, or the ambiguity (errors) in it.
VALUE: The profit obtained by processing raw data. Example: Instagram.
Velocity, volume, and variety are the classic 3 V's; veracity and value are commonly added extensions.
TRADITIONAL VS BIG DATA

Parameter        | Traditional data          | Big Data
-----------------+---------------------------+---------------------------------
Volume           | GB                        | TB or PB and more
Generation rate  | Low (per hour or per day) | High (every second)
Structure        | Structured                | Semi-structured or unstructured
Data source      | Centralised               | Fully distributed
Data integration | Easy                      | Difficult
Data storage     | RDBMS                     | HDFS or NoSQL
Access           | Interactive               | Real-time or batch
Data schema      | Static                    | Dynamic
Data updates     | Repeated read and write   | Write once, read many
GOOGLE FILE SYSTEM (GFS)

 Developed by Google Inc.
 A scalable distributed file system (DFS).
 GFS offers fault tolerance, dependability, scalability, availability, and performance to big networks of connected nodes.
 GFS is made up of a number of storage systems constructed from inexpensive commodity hardware parts.
 It manages two types of data: file metadata and file data.
 It uses chunks (64 MB each) to store data; see the worked example below. (GeeksforGeeks)
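As a worked example (a minimal Java sketch; the 64 MB chunk size and 3-way replication are the GFS defaults described in these notes), the number of chunks and stored replicas follows directly from the file size:

    public class ChunkMath {
        public static void main(String[] args) {
            long chunkSize = 64L * 1024 * 1024;       // GFS default chunk size: 64 MB
            int  replicas  = 3;                       // GFS default replication factor
            long fileSize  = 1L * 1024 * 1024 * 1024; // example: a 1 GB file

            // Ceiling division: a 1 GB file occupies 16 chunks of 64 MB each.
            long chunks = (fileSize + chunkSize - 1) / chunkSize;

            System.out.println("Chunks:   " + chunks);             // 16
            System.out.println("Replicas: " + chunks * replicas);  // 48 copies across chunk servers
        }
    }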
CHARACTERISTICS OF GFS

• Fault tolerance: GFS is designed to be highly available and fault tolerant, even when some servers are unavailable.

• Chunk replication: If a chunk server fails, its data can be re-replicated from other servers. Smaller transfer units reduce the amount of data that must be moved at once, which speeds up recovery times.

• Consistency model: GFS uses a relaxed consistency model adapted to the typical workload of its clients.

• Block copy placement: GFS uses block copy placement principles (data is copied from a source location to a destination location) to increase data reliability and availability.

• Large chunk size: GFS stores each chunk replica as a plain Linux file on a chunk server. The large chunk size reduces how often clients need to interact with the master.

• Master server: The GFS master server manages the overall system metadata, such as the namespace, access control information, and the mapping of files to chunks.

• Chunk servers: Chunk servers store and serve file data in 64 MB chunks.
COMPONENTS OF GFS

• GFS Clients: Computer programs or applications that request files. Requests may be made to access and modify already-existing files or to add new files to the system.

• GFS Master Server: Serves as the cluster's coordinator. It preserves a record of the cluster's actions in an operation log, and it keeps the metadata describing the chunks. The metadata tells the master server which files the chunks belong to and where they fit in the overall file.

• GFS Chunk Servers: The workhorses of GFS. They keep file chunks of 64 MB each. Chunk servers do not send chunks through the master server; instead, they deliver the requested chunks directly to the client. To ensure stability, GFS makes numerous copies of each chunk (three by default) and stores them on different chunk servers; each copy is referred to as a replica. The sketch below makes this read path concrete.
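Here is a self-contained Java sketch of the read path (all names are hypothetical; GFS itself is not open source, so this only simulates the protocol with in-memory maps): the client gets chunk locations from the master, then fetches the data directly from the chunk servers.

    import java.util.*;

    public class GfsReadSketch {
        // Master: holds only metadata (file -> ordered chunk IDs, chunk ID -> server).
        static Map<String, List<String>> fileToChunks = new HashMap<>();
        static Map<String, String> chunkToServer = new HashMap<>();
        // Chunk servers: hold the actual 64 MB payloads (tiny strings here).
        static Map<String, Map<String, String>> serverData = new HashMap<>();

        public static void main(String[] args) {
            // Populate: "/logs/a" is split into two chunks on two servers.
            fileToChunks.put("/logs/a", List.of("chunk-001", "chunk-002"));
            chunkToServer.put("chunk-001", "cs1");
            chunkToServer.put("chunk-002", "cs2");
            serverData.put("cs1", Map.of("chunk-001", "first 64 MB of data..."));
            serverData.put("cs2", Map.of("chunk-002", "next 64 MB of data..."));

            // Client read path: metadata from the master, data straight from chunk servers.
            for (String chunkId : fileToChunks.get("/logs/a")) {   // 1. ask master for chunk list
                String server = chunkToServer.get(chunkId);        // 2. master returns the location
                String data = serverData.get(server).get(chunkId); // 3. fetch data from chunk server
                System.out.println(server + " -> " + data);
            }
        }
    }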
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

[Diagram: Big Data has two concerns, processing (MapReduce) and storage (HDFS). The HDFS client interacts with the Name Node (master node), which is backed by a Secondary Node, while the Data Nodes hold the stored data.]
NAME NODE, SECONDARY NODE, DATA NODE (BUILDING BLOCKS OF HADOOP)

 NAME NODE (MASTER NODE): Manages metadata (namespace, file locations, etc.).
 SECONDARY NODE: Manages updates and operations (PA).
 DATA NODE: Stores data at different positions.
 By default, 3 replicas of the data are stored in the system (this enables fault tolerance). A client-side read example follows below.
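For a concrete client-side view, the real Hadoop Java API can be used as below (the Name Node address and file path are placeholders for illustration): opening a file consults the Name Node for block locations, and the returned stream then reads the blocks from the Data Nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address; point this at your cluster's Name Node.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            // Opening a file asks the Name Node for metadata (block locations);
            // the stream then reads the blocks from the Data Nodes.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
            }
        }
    }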
HADOOP FEATURES

• Written in Java.
• Developed by Doug Cutting and Michael J. Cafarella.
• Cutting's son was fascinated by a stuffed toy elephant called Hadoop, so Cutting named the framework Hadoop and made an elephant its symbol.
• A computing environment that stores input, processes it, and stores the results again.
• Robust design: it can keep working even when an individual node fails.
• Open-source framework: accessible to anyone, anywhere.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

[Diagram: the HDFS client talks to the Name Node (with a Secondary Node alongside); files are split into blocks (block size = 128 MB) that are stored across multiple Data Nodes.]
RACK AWARENESS

• Data is stored locally using a rack mechanism.
• This avoids unnecessary network time.
• The actual data is also stored remotely (replicas on other racks).
• FS Image: an image of the whole file system, i.e., the primary file holding complete information about all files.
• Edit logs: whatever operations are performed (save, update, delete) are recorded in the edit logs; replaying them over the FS Image reconstructs the current state (see the sketch below).
• Heartbeat message: every 3 seconds, each data node sends a heartbeat message to the name node to signal that it is alive.
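As a minimal sketch of the FS Image + edit log idea (hypothetical structures; the Name Node's real on-disk formats are more involved), replaying the logged operations over the last checkpoint rebuilds the current namespace:

    import java.util.*;

    public class EditLogReplaySketch {
        public static void main(String[] args) {
            // FS Image: last checkpoint of the namespace (file -> number of blocks).
            Map<String, Integer> fsImage = new HashMap<>(Map.of("/a.txt", 2, "/b.txt", 1));

            // Edit log: operations recorded since the checkpoint, in order.
            List<String[]> editLog = List.of(
                    new String[]{"save",   "/c.txt", "3"},
                    new String[]{"update", "/a.txt", "4"},
                    new String[]{"delete", "/b.txt", "0"});

            // Replaying the log over the image yields the current file system state.
            for (String[] op : editLog) {
                switch (op[0]) {
                    case "save", "update" -> fsImage.put(op[1], Integer.parseInt(op[2]));
                    case "delete"         -> fsImage.remove(op[1]);
                }
            }
            System.out.println(fsImage); // current state: /a.txt=4, /c.txt=3
        }
    }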
JOB TRACKER

• Provides the link between the client application and Hadoop.
• Every cluster (group of computer systems) has one Job Tracker.
• It basically manages:
 Which file will be processed first.
 Which file will be given to which data node.
 Monitoring tasks: if any task fails, it re-launches the task.
TASK TRACKER

• It is responsible for carrying out the assigned tasks.
• It communicates with the Job Tracker, reporting the progress of its tasks by sending a heartbeat message.
• If the Job Tracker does not receive a heartbeat message from a Task Tracker within the expected interval (every 3 seconds), it considers that Task Tracker to have failed and re-launches its tasks, as sketched below.
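A minimal Java sketch of this failure-detection idea (all names are hypothetical; the real Job Tracker protocol carries far more detail): track the last heartbeat time per Task Tracker and treat any tracker silent beyond the timeout as failed.

    import java.util.*;

    public class HeartbeatMonitorSketch {
        static final long TIMEOUT_MS = 3_000; // a heartbeat is expected every 3 seconds

        // Last heartbeat time (ms) per Task Tracker.
        static Map<String, Long> lastHeartbeat = new HashMap<>();

        static void heartbeat(String tracker, long now) {
            lastHeartbeat.put(tracker, now); // tracker reports liveness/progress
        }

        static void checkLiveness(long now) {
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) {
                    // Missed heartbeat: consider the tracker failed and re-launch its tasks.
                    System.out.println(e.getKey() + " assumed failed; re-launching its tasks");
                }
            }
        }

        public static void main(String[] args) {
            heartbeat("tt1", 0);
            heartbeat("tt2", 0);
            heartbeat("tt1", 5_000);  // tt1 keeps reporting; tt2 has gone silent
            checkLiveness(6_500);     // prints: tt2 assumed failed; re-launching its tasks
        }
    }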
ARCHITECTURE

[Diagram: the client submits work to the Job Tracker, which assigns it to several Task Trackers; each Task Tracker runs MapReduce tasks.]


COMMUNICATION PROCESS ARCHITECTURE OF JOB TRACKER, TASK TRACKER AND HDFS

[Diagram: the Job Tracker coordinates with the Name Node (and its Secondary Node); Task Trackers run on the Data Nodes, so tasks execute close to the data they process.]
INTRODUCING AND CONFIGURING A HADOOP CLUSTER

We need to configure several XML files.
 Hadoop settings: hadoop-default.xml.
 Configuration directory: under HADOOP_HOME.
 There are 3 modes in which Hadoop can be configured:
 Local (standalone) mode (the default)
 Pseudo-distributed mode
 Fully distributed mode
LOCAL (STANDALONE) MODE

• Runs on a single local node.
• None of the daemons (background processes that store and process data) run: no Name Node, Data Node, Secondary Name Node, Job Tracker, or Task Tracker.
• Hadoop works very fast in this mode.
• Hadoop is used in this mode only for testing, learning, and debugging.
PSEUDO-DISTRIBUTED MODE (SINGLE-NODE CLUSTER)

• Master and slave processes are handled by a single system.
• All the processes run independently of each other inside the cluster.
• Every daemon runs separately in its own JVM.
• It is used for development and debugging purposes.
FULLY DISTRIBUTED MODE (MULTI-NODE CLUSTER)

• It has many nodes; some operate as master nodes and some as slave nodes.
• It is used in production.
CONFIGURING XML FILES
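For a pseudo-distributed (single-node) setup, the two files usually edited are core-site.xml and hdfs-site.xml in the Hadoop configuration directory. A minimal sketch (the localhost address and port follow the standard Hadoop single-node tutorial; adjust them for your installation):

core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>            <!-- where clients find the Name Node -->
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>         <!-- 1 replica: only one Data Node exists -->
        <value>1</value>
      </property>
    </configuration>

In fully distributed mode the same fs.defaultFS property points at the real Name Node host, and dfs.replication is typically left at its default of 3.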
