Big Data Unit-4

The document provides an overview of the Hadoop ecosystem, focusing on components like HDFS, YARN, and NoSQL databases. It discusses the high availability feature in Hadoop 2.x, which addresses the single point of failure issue by allowing multiple NameNodes, and introduces HDFS Federation for improved scalability. Additionally, it contrasts NoSQL databases with traditional RDBMS, highlighting their advantages and disadvantages in terms of scalability, availability, and data structure.

Uploaded by

Prashant Rai


Hadoop Ecosystem and YARN

* HDFS: Hadoop Distributed File System
* YARN: Yet Another Resource Negotiator
* MapReduce: Programming-based data processing
* Spark: In-memory data processing
* PIG, HIVE: Query-based processing of data services
* HBase: NoSQL database
* Mahout, Spark MLlib: Machine learning algorithm libraries
* Solr, Lucene: Searching and indexing
* Zookeeper: Managing the cluster
* Oozie: Job scheduling

NameNode high availability

* High Availability was a new feature added in Hadoop 2.x to solve the single point of failure problem in older versions of Hadoop.
* HDFS follows a master-slave architecture in which the NameNode is the master node and maintains the filesystem tree, so HDFS cannot be used without the NameNode.
* Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that NameNode failed, the cluster as a whole was out of service. The cluster stayed unavailable until the NameNode restarted or was brought up on a separate machine.
* Hadoop 2.0 overcomes this SPOF by supporting more than one NameNode. The HDFS NameNode High Availability architecture provides the option of running two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby.

The two NameNodes play the following roles:

* Active NameNode: handles all client operations in the cluster.
* Passive NameNode: a standby NameNode that holds similar data to the active NameNode. It acts as a slave and maintains enough state to provide a fast failover if necessary.
* If the active NameNode fails, the passive NameNode takes over all the responsibilities of the active node and the cluster continues to work.

Issues in maintaining consistency in an HDFS High Availability cluster:

* The active and standby NameNodes should always be in sync with each other, i.e. they should have the same metadata. This allows the Hadoop cluster to be restored to the namespace state it was in when it crashed, and it makes fast failover possible.
* Only one NameNode should be active at a time. Otherwise, two active NameNodes will lead to corruption of the data.

HDFS architecture

* Namespace: consists of files, blocks, and directories. This layer supports namespace-related filesystem operations such as creating, deleting, modifying, and listing files.
* Block Storage layer:
  * Block Management: provides DataNode cluster membership by handling registrations and periodic heartbeats. It also maintains the locations of blocks and handles replica placement.
  * Storage: the DataNode manages storage space by storing blocks on the local file system and providing read/write access.

This architecture allows only a single NameNode to maintain the filesystem namespace.

HDFS Federation architecture

The HDFS Federation feature introduced in Hadoop 2 enhances the existing HDFS architecture. It overcomes the HDFS architecture limitations (discussed above) by adding support for multiple NameNodes/namespaces to HDFS. Because more than one NameNode/namespace can be used, the namespace scales horizontally as NameNodes are added to the cluster.

In the HDFS Federation architecture, there are multiple NameNodes and DataNodes. Each NameNode has its own namespace and block pool. All the NameNodes use the DataNodes as common storage. Each DataNode registers with all the NameNodes in the cluster and stores blocks for all the block pools in the cluster.

* DataNodes also periodically send heartbeats and block reports to all the NameNodes in the cluster and handle the instructions from the NameNodes.
* There are multiple NameNodes, represented as NN1, NN2, ..., NNn.
* There are multiple namespaces, each managed by its respective NameNode.
* Each namespace has its own block pool.
* Each DataNode stores blocks for all the block pools in the cluster. For example, DataNode1 stores the blocks from Pool 1, Pool 2, and so on.

Summary

The HDFS Federation feature added in Hadoop 2.x provides support for multiple NameNodes/namespaces. This overcomes the isolation, scalability, and performance limitations of the prior HDFS architecture.

MRv2 (MapReduce 2)

MRv1 uses the JobTracker to create and assign tasks to data nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).

MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager.

YARN overcame the scalability shortcomings by splitting the responsibilities of the JobTracker into separate entities. The JobTracker takes care of both job scheduling and task progress monitoring; YARN separates these two roles into two independent daemons: a resource manager and an application master.

The resource manager is fixed and static. It tracks which nodes are free and which are busy in order to allocate resources for the Map and Reduce phases. The application master communicates with the resource manager.

YARN

[Figure: YARN, resource management and data processing]

Introduction

NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads. The data comes in all shapes and sizes: structured, semi-structured, and polymorphic.

Introduction: SQL vs NoSQL queries

NoSQL query:

    db.users.find(                    // collection
        { age: { $gt: 18 } },         // query criteria
        { name: 1, address: 1 }       // projection
    ).limit(5)                        // cursor modifier

SQL query:

    SELECT _id, name, address         -- projection
    FROM users                        -- table
    WHERE age > 18                    -- select criteria
    LIMIT 5                           -- cursor modifier

Advantages and Disadvantages

Advantages:

* High scalability:
  * NoSQL databases use sharding for horizontal scaling.
  * Vertical scaling is not that easy to implement, but horizontal scaling is easy to implement.
  * Examples of horizontally scaling databases are MongoDB, Cassandra, etc.
  * NoSQL can handle huge amounts of data because of this scalability: as the data grows, a NoSQL database scales itself to handle that data efficiently.
* High availability:
  * The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data replicates itself to the previous consistent state.

Disadvantages:

* Narrow focus:
  * NoSQL is mainly designed for storage and provides very little functionality beyond it. Relational databases are a better choice than NoSQL for transaction management.
* Open-source:
  * NoSQL databases are open source, and there is no reliable standard for NoSQL yet.
* GUI is not available.
* Large document size:
  * Some database systems such as MongoDB and CouchDB store data in JSON format. This means that documents are quite large (which matters for big data, network bandwidth, and speed), and descriptive key names actually hurt, since they increase the document size.

Types of NoSQL

* Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including strings, numbers, booleans, arrays, or objects.
* Key-value databases are a simpler type of database where each item contains keys and values.
* Wide-column stores store data in tables, rows, and dynamic columns.
* Graph databases store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.

RDBMS vs NoSQL databases

RDBMS:

* Structured and organized data
* Structured Query Language (SQL)
* Data and its relationships stored in separate tables
* Data Manipulation Language, Data Definition Language
* Tight consistency

NoSQL:

* No declarative query language
* No predefined schema
* Key-value pair storage, column store, document store, graph databases
* Eventual consistency rather than the ACID property
* Unstructured and unpredictable data
* CAP theorem
* Prioritizes high performance, high availability, and scalability
* BASE transaction
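The SQL and NoSQL queries compared in the Introduction can be tried side by side in a small sketch. It is only an illustration under stated assumptions: the sample users, the function names `find_documents` and `find_rows`, and the column types are all made up here, the SQL side uses Python's built-in `sqlite3` module rather than a full RDBMS, and the NoSQL side imitates the `find(...).limit(5)` call on plain dictionaries instead of talking to a real MongoDB server.

```python
import itertools
import sqlite3

# Hypothetical sample data; names and values are invented for illustration.
USERS = [
    {"_id": 1, "name": "Asha", "age": 25, "address": "Pune"},
    {"_id": 2, "name": "Ben", "age": 17, "address": "Leeds"},
    {"_id": 3, "name": "Chen", "age": 42, "address": "Wuhan"},
]


def find_documents(users, limit=5):
    """Imitate find({age: {$gt: 18}}, {name: 1, address: 1}).limit(5) on dicts."""
    # Query criteria: keep documents that have an age greater than 18.
    matched = (u for u in users if "age" in u and u["age"] > 18)
    # Projection: keep _id, name and address, skipping fields a document lacks.
    projected = (
        {key: u[key] for key in ("_id", "name", "address") if key in u}
        for u in matched
    )
    # Cursor modifier: stop after `limit` documents.
    return list(itertools.islice(projected, limit))


def find_rows(users, limit=5):
    """Run the equivalent SQL query against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    # Unlike the document side, the table needs a schema defined up front.
    conn.execute(
        "CREATE TABLE users "
        "(_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, address TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [(u["_id"], u["name"], u["age"], u["address"]) for u in users],
    )
    rows = conn.execute(
        "SELECT _id, name, address FROM users WHERE age > 18 LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return rows
```

Calling `find_documents(USERS)` returns a list of dictionaries, while `find_rows(USERS)` returns a list of tuples for the same users. The contrast mirrors the comparison above: the document side needs no predefined schema, whereas the table must be declared before any row is inserted.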
