BDA Unit 1 Notes
BIG DATA ANALYTICS (8CS4-01)
UNIT-I
It is not enough to simply collect data; the data must also be analysed. Benefits of big data analytics:
• Low cost.
• Less time taken.
• Smarter decision making.
TYPES OF BIG DATA
• Structured data: data with a fixed schema (e.g. rows and columns in a relational database).
• Semi-structured data: data with some organisation but no rigid schema (e.g. XML, JSON).
• Unstructured data: data with no predefined format (e.g. text, images, video).
FEATURES OF GFS (GOOGLE FILE SYSTEM)
• Fault tolerance: GFS is designed to be highly available and fault tolerant, even when some servers are unavailable.
• Chunk replication: If a chunk server fails, the data can be replicated from other servers. Smaller chunks reduce the
amount of data that needs to be transferred, which speeds up recovery times.
• Block copy placement: GFS uses block copy placement principles (the data is copied from a source location to a
destination location) to increase data reliability and availability.
• Large chunk size: GFS stores each chunk replica as a plain Linux file on a chunk server. The large chunk size
reduces the need for clients to interact with the master.
• Master server: The GFS Master server manages the overall system metadata, such as the namespace, access control
information, and the mapping of files to chunks.
• Chunk servers: Chunk servers store and serve file data in 64 MB chunks.
COMPONENTS OF GFS
• GFS Clients: Computer programs or applications that request files. Requests may access or modify
already-existing files or add new files to the system.
• GFS Master Server: Acts as the cluster's coordinator. It preserves a record of the cluster's actions
in an operation log and keeps track of the metadata that describes the chunks. The metadata tells the
master server which file each chunk belongs to and where it fits within that file.
• GFS Chunk Servers: The workhorses of GFS. They store file chunks of 64 MB each. Chunk servers do
not send chunks through the master server; instead they deliver the requested chunks directly to the
client. To ensure reliability, GFS stores multiple copies of each chunk on different chunk servers;
the default is three copies, and each copy is called a replica. (A minimal sketch of this layout
follows this list.)
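The toy Java sketch below only illustrates the chunking-and-replication idea described above: a file is split into 64 MB chunks and each chunk is assigned three replica chunk servers. GFS is a proprietary Google system with no public API, so every class, server name and method here is hypothetical and purely illustrative.

    import java.util.*;

    // Hypothetical, simplified model of the GFS layout described above.
    public class GfsSketch {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks
        static final int REPLICAS = 3;                     // default replication factor

        public static void main(String[] args) {
            long fileSize = 200L * 1024 * 1024;            // a 200 MB example file
            List<String> chunkServers = Arrays.asList("cs1", "cs2", "cs3", "cs4", "cs5");

            // Master-style metadata: chunk index -> the chunk servers holding its replicas
            Map<Integer, List<String>> chunkToReplicas = new HashMap<>();
            int chunks = (int) Math.ceil((double) fileSize / CHUNK_SIZE);
            for (int i = 0; i < chunks; i++) {
                List<String> replicas = new ArrayList<>();
                for (int r = 0; r < REPLICAS; r++) {
                    // spread the three replicas over different chunk servers
                    replicas.add(chunkServers.get((i + r) % chunkServers.size()));
                }
                chunkToReplicas.put(i, replicas);
            }
            // The "master" only hands out this metadata; clients would then read
            // the chunk data directly from the chunk servers listed here.
            chunkToReplicas.forEach((idx, reps) ->
                    System.out.println("chunk " + idx + " -> " + reps));
        }
    }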
HDFS(HADOOP DISTRIBUTED FILE SYSTEM)
NAME NODE, SECONDARY NAME NODE, DATA NODE (BUILDING BLOCKS OF HADOOP)
NAME NODE (MASTER NODE): Manages the metadata (namespace, file locations, etc.).
SECONDARY NAME NODE: Periodically merges the Name node's edit log with the file system image (checkpointing of metadata updates).
DATA NODE: Stores the actual data blocks on different machines.
By default, 3 replicas of each data block are stored in the system (this enables fault tolerance). A small client-side example follows.
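As an illustration of how a client talks to the Name node and Data nodes, here is a minimal sketch using Hadoop's Java FileSystem API. The hdfs://localhost:9000 address and the file path are assumptions for the example; adjust them to your own cluster.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed Name node address; normally set in core-site.xml (fs.defaultFS).
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file: the Name node records the metadata, while the
            // Data nodes store the actual block replicas (3 by default).
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the same API.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }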
HADOOP FEATURES
• Written in Java.
• Developed by Doug Cutting and Michael J. Cafarella.
• Cutting's son had a stuffed toy elephant named Hadoop, so Cutting used that name for the
framework and an elephant as its symbol.
• A computing environment that stores the input, processes it and stores the results again.
• Robust design: the system keeps working even if an individual node fails.
• Open-source framework: freely accessible to anyone, anywhere.
HDFS(HADOOP DISTRIBUTED FILE SYSTEM)
[HDFS architecture diagram: a Client interacts with the Name node and Secondary Name node; the data itself is stored as blocks on multiple Data nodes. Default block size = 128 MB.]
RACK AWARENESS
The Name node keeps track of which rack each Data node sits in and spreads block replicas across different racks, so data stays available even if an entire rack fails. Both the Name node and the Job Tracker use this rack topology information (the Name node for replica placement, the Job Tracker for scheduling tasks close to their data). A configuration sketch follows.
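Rack awareness is typically enabled by pointing Hadoop at a topology script that maps each Data node to a rack id. The property name below is the standard one; the script path is only an assumed example.

    import org.apache.hadoop.conf.Configuration;

    public class RackAwarenessConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Standard property; the script maps a Data node's host/IP to a rack id
            // such as /rack1. The path below is only an example.
            conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
            System.out.println(conf.get("net.topology.script.file.name"));
        }
    }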
LOCAL MODE (STANDALONE MODE)
• None of the daemons (background processes that store and process data) run in this mode: no
Name node, Data node, Secondary Name node, Job Tracker or Task Tracker.
• Hadoop runs very fast in this mode.
• Hadoop is used in this mode only for testing, learning and debugging.
(A minimal configuration sketch for this mode follows.)
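For reference, these two settings characterise local (standalone) mode; they are in fact Hadoop's defaults, so the snippet below is only illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class LocalModeSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Local (standalone) mode: use the local file system instead of HDFS
            // and run MapReduce inside a single JVM with no cluster daemons.
            conf.set("fs.defaultFS", "file:///");
            conf.set("mapreduce.framework.name", "local");
            System.out.println(conf.get("fs.defaultFS") + ", "
                    + conf.get("mapreduce.framework.name"));
        }
    }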
PSEUDO DISTRIBUTED MODE (SINGLE NODE CLUSTER)
• All the daemons (Name node, Data node, Secondary Name node, Job Tracker, Task Tracker) run on a
single machine, each as a separate process, so one node acts as both master and slave.
• It is used for development and testing with a cluster-like setup.
FULLY DISTRIBUTED MODE (MULTI-NODE CLUSTER)
• It has many nodes; some of them operate as master nodes and some as slave nodes.
• It is used in production.
CONFIGURING XML FILES
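The main Hadoop configuration files are core-site.xml, hdfs-site.xml and mapred-site.xml (plus yarn-site.xml on YARN clusters); each holds property name/value pairs. As a sketch, the Java snippet below sets the most common of those properties programmatically; the values shown (Name node address, replication factor, block size) are assumptions for illustration and would normally live in the XML files themselves.

    import org.apache.hadoop.conf.Configuration;

    public class XmlConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // core-site.xml: where the Name node (default file system) lives.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            // hdfs-site.xml: replication factor and block size (128 MB).
            conf.set("dfs.replication", "3");
            conf.set("dfs.blocksize", "134217728");
            // mapred-site.xml: run MapReduce jobs on YARN.
            conf.set("mapreduce.framework.name", "yarn");

            System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
            System.out.println("dfs.replication = " + conf.get("dfs.replication"));
            System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize"));
        }
    }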