BDA Unit 1 Notes

The document provides an overview of Big Data and its significance, highlighting the importance of data analysis for smart decision-making and cost efficiency. It categorizes Big Data into structured, unstructured, and semi-structured types, and discusses the 3 V's: velocity, volume, and variety. Additionally, it explains the architecture and components of Google File System (GFS) and Hadoop Distributed File System (HDFS), emphasizing their roles in managing large data sets.


BIG DATA ANALYTICS
8CS4-01
UNIT-I

DATA: Anything recorded in the system; a collection of numbers, alphabets, or special symbols. Example: 10041990.
Information: Processed data. Example: 10/04/1990 is the birth date of MS Dhoni.
Big Data: A collection of large, complex, and diverse data sets that are difficult to manage and analyze using traditional data processing tools.
Examples: patient records, e-commerce sites (online shopping), transportation, etc.
WHY BIG DATA IS IMPORTANT

It is not enough to collect data; you also have to analyse it.
Low cost.
Less time taken.
Smart decision making.
TYPES OF BIG DATA

• Structured: Defined in a fixed format, generally in table form. Example: a table of employees in an organization.
• Unstructured: Has no predefined format. Examples: videos, images, presentations, e-mails, etc.
• Semi-structured: Not fixed in structure or schema, but contains some structural elements, such as tags or markers, to separate data elements (see the sketch below). Examples: JSON, XML files, etc.
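As a minimal sketch of a semi-structured record (the field names are illustrative only, not from any specific dataset), a JSON document uses keys and nesting as the structural markers without enforcing a fixed table schema:

    {
      "employee": {
        "name": "A. Kumar",
        "id": 10432,
        "skills": ["Hadoop", "Java"],
        "address": { "city": "Jaipur" }
      }
    }

A second record could add or omit fields (say, a "phone" key) without breaking the format, which is exactly what a fixed relational schema would not allow.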
THE V’S: CHARACTERISTICS OF BIG DATA

VELOCITY: How quickly data is generated.
VOLUME: The amount of data generated every second.
VARIETY: The type of data.
VERACITY: The trustworthiness of data, or the ambiguity (errors) in it.
VALUE: The profit obtained by processing raw data. Example: Instagram.
Velocity, volume, and variety are the classic 3 V's; veracity and value are commonly added extensions.
TRADITIONAL VS BIG DATA

Parameter        | Traditional data          | Big Data
-----------------+---------------------------+---------------------------------
Volume           | GB                        | TB or PB and more
Generation rate  | Low (per hour or per day) | High (every second)
Structure        | Structured                | Semi-structured or unstructured
Data source      | Centralised               | Fully distributed
Data integration | Easy                      | Difficult
Data storage     | RDBMS                     | HDFS or NoSQL
Access           | Interactive               | Real-time or batch
Data schema      | Static                    | Dynamic
Data updates     | Repeated read and write   | Write once, read many
GOOGLE FILE SYSTEM (GFS)

 Developed by Google Inc.
 A scalable distributed file system (DFS).
 GFS offers fault tolerance, dependability, scalability, availability, and performance to big networks of connected nodes.
 GFS is made up of a number of storage systems constructed from inexpensive commodity hardware parts.
 It manages two types of data: file metadata and file data.
 It uses chunks (64 MB each) to store data; see the worked example below. (GeeksforGeeks)
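As a worked example (a minimal Java sketch; the 64 MB chunk size and 3-way replication are the GFS defaults described in these notes), the number of chunks and stored replicas follows directly from the file size:

    public class ChunkMath {
        public static void main(String[] args) {
            long chunkSize = 64L * 1024 * 1024;       // GFS default chunk size: 64 MB
            int  replicas  = 3;                       // GFS default replication factor
            long fileSize  = 1L * 1024 * 1024 * 1024; // example: a 1 GB file

            // Ceiling division: a 1 GB file occupies 16 chunks of 64 MB each.
            long chunks = (fileSize + chunkSize - 1) / chunkSize;

            System.out.println("Chunks:   " + chunks);             // 16
            System.out.println("Replicas: " + chunks * replicas);  // 48 copies across chunk servers
        }
    }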
CHARACTERISTICS OF GFS

• Fault tolerance: GFS is designed to be highly available and fault tolerant, even when some servers are unavailable.

• Chunk replication: If a chunk server fails, its data can be re-replicated from other servers. Smaller transfer units reduce the amount of data that must be moved at once, which speeds up recovery times.

• Consistency model: GFS uses a relaxed consistency model adapted to the typical workload of its clients.

• Block copy placement: GFS uses block copy placement principles (data is copied from a source location to a destination location) to increase data reliability and availability.

• Large chunk size: GFS stores each chunk replica as a plain Linux file on a chunk server. The large chunk size reduces how often clients need to interact with the master.

• Master server: The GFS master server manages the overall system metadata, such as the namespace, access control information, and the mapping of files to chunks.

• Chunk servers: Chunk servers store and serve file data in 64 MB chunks.
COMPONENTS OF GFS

• GFS Clients: Computer programs or applications that request files. Requests may be made to access and modify already-existing files or to add new files to the system.

• GFS Master Server: Serves as the cluster's coordinator. It preserves a record of the cluster's actions in an operation log, and it keeps the metadata describing the chunks. The metadata tells the master server which files the chunks belong to and where they fit in the overall file.

• GFS Chunk Servers: The workhorses of GFS. They keep file chunks of 64 MB each. Chunk servers do not send chunks through the master server; instead, they deliver the requested chunks directly to the client. To ensure stability, GFS makes numerous copies of each chunk (three by default) and stores them on different chunk servers; each copy is referred to as a replica. The sketch below makes this read path concrete.
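Here is a self-contained Java sketch of the read path (all names are hypothetical; GFS itself is not open source, so this only simulates the protocol with in-memory maps): the client gets chunk locations from the master, then fetches the data directly from the chunk servers.

    import java.util.*;

    public class GfsReadSketch {
        // Master: holds only metadata (file -> ordered chunk IDs, chunk ID -> server).
        static Map<String, List<String>> fileToChunks = new HashMap<>();
        static Map<String, String> chunkToServer = new HashMap<>();
        // Chunk servers: hold the actual 64 MB payloads (tiny strings here).
        static Map<String, Map<String, String>> serverData = new HashMap<>();

        public static void main(String[] args) {
            // Populate: "/logs/a" is split into two chunks on two servers.
            fileToChunks.put("/logs/a", List.of("chunk-001", "chunk-002"));
            chunkToServer.put("chunk-001", "cs1");
            chunkToServer.put("chunk-002", "cs2");
            serverData.put("cs1", Map.of("chunk-001", "first 64 MB of data..."));
            serverData.put("cs2", Map.of("chunk-002", "next 64 MB of data..."));

            // Client read path: metadata from the master, data straight from chunk servers.
            for (String chunkId : fileToChunks.get("/logs/a")) {   // 1. ask master for chunk list
                String server = chunkToServer.get(chunkId);        // 2. master returns the location
                String data = serverData.get(server).get(chunkId); // 3. fetch data from chunk server
                System.out.println(server + " -> " + data);
            }
        }
    }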
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

[Diagram: Big Data has two concerns, processing (MapReduce) and storage (HDFS). The HDFS client interacts with the Name Node (master node), which is backed by a Secondary Node, while the Data Nodes hold the stored data.]
NAME NODE, SECONDARY NODE, DATA NODE (BUILDING BLOCKS OF HADOOP)

 NAME NODE (MASTER NODE): Manages metadata (namespace, file locations, etc.).
 SECONDARY NODE: Manages updates and operations (PA).
 DATA NODE: Stores data at different positions.
 By default, 3 replicas of the data are stored in the system (this enables fault tolerance). A client-side read example follows below.
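For a concrete client-side view, the real Hadoop Java API can be used as below (the Name Node address and file path are placeholders for illustration): opening a file consults the Name Node for block locations, and the returned stream then reads the blocks from the Data Nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address; point this at your cluster's Name Node.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            // Opening a file asks the Name Node for metadata (block locations);
            // the stream then reads the blocks from the Data Nodes.
            try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
            }
        }
    }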
HADOOP FEATURES

• Written in Java.
• Developed by Doug Cutting and Michael J. Cafarella.
• Cutting's son was fascinated by a stuffed toy elephant called Hadoop, so Cutting named the framework Hadoop and made an elephant its symbol.
• A computing environment that stores input, processes it, and stores the results again.
• Robust design: it can keep working even when an individual node fails.
• Open-source framework: accessible to anyone, anywhere.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

[Diagram: the HDFS client talks to the Name Node (with a Secondary Node alongside); files are split into blocks (block size = 128 MB) that are stored across multiple Data Nodes.]
RACK AWARENESS

• Data is stored locally using a rack mechanism.
• This avoids unnecessary network time.
• The actual data is also stored remotely (replicas on other racks).
• FS Image: an image of the whole file system, i.e., the primary file holding complete information about all files.
• Edit logs: whatever operations are performed (save, update, delete) are recorded in the edit logs; replaying them over the FS Image reconstructs the current state (see the sketch below).
• Heartbeat message: every 3 seconds, each data node sends a heartbeat message to the name node to signal that it is alive.
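As a minimal sketch of the FS Image + edit log idea (hypothetical structures; the Name Node's real on-disk formats are more involved), replaying the logged operations over the last checkpoint rebuilds the current namespace:

    import java.util.*;

    public class EditLogReplaySketch {
        public static void main(String[] args) {
            // FS Image: last checkpoint of the namespace (file -> number of blocks).
            Map<String, Integer> fsImage = new HashMap<>(Map.of("/a.txt", 2, "/b.txt", 1));

            // Edit log: operations recorded since the checkpoint, in order.
            List<String[]> editLog = List.of(
                    new String[]{"save",   "/c.txt", "3"},
                    new String[]{"update", "/a.txt", "4"},
                    new String[]{"delete", "/b.txt", "0"});

            // Replaying the log over the image yields the current file system state.
            for (String[] op : editLog) {
                switch (op[0]) {
                    case "save", "update" -> fsImage.put(op[1], Integer.parseInt(op[2]));
                    case "delete"         -> fsImage.remove(op[1]);
                }
            }
            System.out.println(fsImage); // current state: /a.txt=4, /c.txt=3
        }
    }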
JOB TRACKER

• Provides the link between the client application and Hadoop.
• Every cluster (group of computer systems) has one Job Tracker.
• It basically manages:
 Which file will be processed first.
 Which file will be given to which data node.
 Monitoring tasks: if any task fails, it re-launches the task.
TASK TRACKER

• It is responsible for carrying out the assigned tasks.
• It communicates with the Job Tracker, reporting the progress of its tasks by sending a heartbeat message.
• If the Job Tracker does not receive a heartbeat message from a Task Tracker within the expected interval (every 3 seconds), it considers that Task Tracker to have failed and re-launches its tasks, as sketched below.
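A minimal Java sketch of this failure-detection idea (all names are hypothetical; the real Job Tracker protocol carries far more detail): track the last heartbeat time per Task Tracker and treat any tracker silent beyond the timeout as failed.

    import java.util.*;

    public class HeartbeatMonitorSketch {
        static final long TIMEOUT_MS = 3_000; // a heartbeat is expected every 3 seconds

        // Last heartbeat time (ms) per Task Tracker.
        static Map<String, Long> lastHeartbeat = new HashMap<>();

        static void heartbeat(String tracker, long now) {
            lastHeartbeat.put(tracker, now); // tracker reports liveness/progress
        }

        static void checkLiveness(long now) {
            for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
                if (now - e.getValue() > TIMEOUT_MS) {
                    // Missed heartbeat: consider the tracker failed and re-launch its tasks.
                    System.out.println(e.getKey() + " assumed failed; re-launching its tasks");
                }
            }
        }

        public static void main(String[] args) {
            heartbeat("tt1", 0);
            heartbeat("tt2", 0);
            heartbeat("tt1", 5_000);  // tt1 keeps reporting; tt2 has gone silent
            checkLiveness(6_500);     // prints: tt2 assumed failed; re-launching its tasks
        }
    }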
ARCHITECTURE

[Diagram: the client submits work to the Job Tracker, which assigns it to several Task Trackers; each Task Tracker runs MapReduce tasks.]


COMMUNICATION PROCESS ARCHITECTURE OF JOB TRACKER, TASK TRACKER AND HDFS

[Diagram: the Job Tracker coordinates with the Name Node (and its Secondary Node); Task Trackers run on the Data Nodes, so tasks execute close to the data they process.]
INTRODUCING AND CONFIGURING A HADOOP CLUSTER

We need to configure several XML files.
 Hadoop settings: hadoop-default.xml.
 Configuration directory: under HADOOP_HOME.
 There are 3 modes in which Hadoop can be configured:
 Local (standalone) mode (the default)
 Pseudo-distributed mode
 Fully distributed mode
LOCAL (STANDALONE) MODE

• Runs on a single local node.
• None of the daemons (background processes that store and process data) run: no Name Node, Data Node, Secondary Name Node, Job Tracker, or Task Tracker.
• Hadoop works very fast in this mode.
• Hadoop is used in this mode only for testing, learning, and debugging.
PSEUDO-DISTRIBUTED MODE (SINGLE-NODE CLUSTER)

• Master and slave processes are handled by a single system.
• All the processes run independently of each other inside the cluster.
• Every daemon runs separately in its own JVM.
• It is used for development and debugging purposes.
FULLY DISTRIBUTED MODE (MULTI-NODE CLUSTER)

• It has many nodes; some operate as master nodes and some as slave nodes.
• It is used in production.
CONFIGURING XML FILES
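For a pseudo-distributed (single-node) setup, the two files usually edited are core-site.xml and hdfs-site.xml in the Hadoop configuration directory. A minimal sketch (the localhost address and port follow the standard Hadoop single-node tutorial; adjust them for your installation):

core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>            <!-- where clients find the Name Node -->
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>         <!-- 1 replica: only one Data Node exists -->
        <value>1</value>
      </property>
    </configuration>

In fully distributed mode the same fs.defaultFS property points at the real Name Node host, and dfs.replication is typically left at its default of 3.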
