Hadoop Ecosystem
Hadoop Ecosystem Components
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
HDFS
• Name Node
• Data Node
• The Name Node is the primary node that holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in a distributed environment, which is undoubtedly what makes Hadoop cost-effective.
YARN
• Yet Another Resource Negotiator: as the name suggests, YARN helps manage resources across clusters. In short, it performs resource planning and allocation for the Hadoop system.
1. Resource manager
2. Node manager
3. Application manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers allocate resources such as CPU, memory, and bandwidth on each machine and later report back to the Resource Manager.
MapReduce
• MapReduce works with two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thus organizes it in the form of a group. Map() generates the result as key-value pairs, which are later processed by the Reduce() method.
2. Reduce() aggregates the mapped key-value pairs, summarizing the intermediate results into a smaller set of output values, as sketched below.
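This flow can be sketched with plain Scala collections (a conceptual illustration only, not the Hadoop API; the input lines are made up):
// Conceptual sketch of the Map/Reduce flow in plain Scala (not the Hadoop API).
object MapReduceConcept {
  def main(args: Array[String]): Unit = {
    val lines = List("big data", "big cluster", "data node")   // made-up input

    // Map(): emit a (key, value) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle/sort: group the pairs by key.
    val grouped = mapped.groupBy(_._1)

    // Reduce(): sum the values for each key.
    val reduced = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)   // e.g. (big,2), (data,2), (cluster,1), (node,1)
  }
}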
PIG
• Pig was originally developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
• It is a platform for data flow structuring, processing, and analyzing large data sets.
• Pig does the job of executing the commands, and all the MapReduce activities are taken care of in the background. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specifically designed for this framework, which runs on the Pig Runtime.
• Pig helps to achieve ease of programming and optimization and thus is a core segment of the
Hadoop ecosystem.
HIVE
• Hive reads and writes large data sets using an SQL-like methodology and interface; its query language is known as Hive Query Language (HQL).
• It is highly scalable, as it allows both real-time and batch processing. Also, all SQL data types are supported by Hive, making query processing easier.
• Like other query-processing frameworks, Hive comes with two components: JDBC Drivers and the Hive Command Line.
• JDBC works with ODBC drivers to create data storage and connection permissions, while the HIVE
command line helps with query processing.
Mahout
• Mahout enables machine learning of a system or application. Machine learning, as the name
suggests, helps a system evolve based on certain patterns, user/environment interactions, or
algorithms.
• It provides various libraries and features such as collaborative filtering, clustering, and classification, which are nothing but machine learning concepts. It allows us to invoke algorithms according to our needs using its libraries.
Apache Spark
• Spark performs data processing in memory; it therefore consumes memory resources but is faster than MapReduce in terms of optimization.
• Spark is best suited for real-time data, while Hadoop MapReduce is best suited for structured data or batch processing; hence, both are used side by side in most companies (see the sketch below).
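A minimal Spark sketch in Scala illustrates this in-memory style; the file name app.log, the log format (host name as the first field), and the local[*] master are assumptions made so the snippet is self-contained:
import org.apache.spark.sql.SparkSession

// Count ERROR lines per host from a hypothetical log file, keeping the RDD in memory.
object ErrorsPerHost {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ErrorsPerHost")
      .master("local[*]")                 // local run; on a cluster YARN would manage resources
      .getOrCreate()

    val errorsPerHost = spark.sparkContext
      .textFile("app.log")                          // load lines into an RDD
      .filter(_.contains("ERROR"))                  // keep only error lines
      .map(line => (line.split(" ")(0), 1))         // assume the host is the first field
      .reduceByKey(_ + _)                           // sum counts per host
      .cache()                                      // cache in memory for repeated use

    errorsPerHost.take(10).foreach(println)
    spark.stop()
  }
}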
Apache HBase
• It is a NoSQL database that supports all kinds of data and is capable of handling anything within a Hadoop database. It provides Google BigTable-like capabilities for working with large data sets efficiently.
• Solr, Lucene: these are two services that perform the tasks of searching and indexing using Java libraries. Lucene is Java-based and also provides a spell-checking mechanism; Solr is built on top of Lucene.
• ZooKeeper: there was a huge problem with managing coordination and synchronization between Hadoop resources and components, which often led to inconsistency. ZooKeeper overcomes these problems by performing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie simply acts as a scheduler; it schedules jobs and binds them together. There are two kinds of Oozie jobs, Oozie workflow jobs and Oozie coordinator jobs. An Oozie workflow is a set of tasks that need to be executed sequentially, in an ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
Schedulers
FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more preference to applications that arrive earlier than to those that arrive later. It places the applications in a queue and executes them in the order of their submission (first in, first out).
• Here, irrespective of size and priority, the requests of the first application in the queue are allocated first. Only once the first application's request is satisfied is the next application in the queue served.
Capacity Scheduler
• The CapacityScheduler allows multiple users to share a Hadoop cluster securely and efficiently.
• It uses hierarchical queues: root (the cluster), parent (an organization), and leaf (where applications are submitted). Each queue gets a guaranteed portion of the cluster resources.
• Unused resources from one queue can be temporarily used by under-utilized queues (elastic sharing).
• It ensures fair resource usage by setting limits on users, applications, and queues, which prevents any single user or queue from over-consuming cluster resources, and it controls pending and initialized applications per user or queue for stability.
Fair Scheduler
• The Fair Scheduler assigns resources so that, over time, all running applications receive an equal (fair) share of the cluster's resources, rather than waiting in a strict submission order.
• NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing the resources available on a single node.
HDFS Federation is an enhancement of the Hadoop Distributed File System (HDFS) introduced in Hadoop 2.x.
Why it was needed: In earlier versions of HDFS, there was a single NameNode, which meant:
• A single point of failure
• Limited scalability (since one NameNode managed all metadata)
How HDFS Federation solves it:
• Allows multiple NameNodes and multiple namespaces.
• Each NameNode manages a part of the filesystem namespace.
• All NameNodes can work independently and in parallel.
Benefits:
• Better scalability
• Isolation of namespaces (e.g., different users/departments can use separate namespaces)
• No single point of failure (to some extent)
MRv2 (MapReduce Version 2)
MRv2 is the second version of the MapReduce framework, also called YARN-based
MapReduce.
Why it was introduced: MapReduce v1 had limitations:
• Tight coupling of resource management and job scheduling with the JobTracker.
• Limited scalability and flexibility.
How MRv2 works:
• Separates the resource management layer from the data processing layer.
• Introduces YARN for resource management.
• MapReduce becomes just one of many possible data processing frameworks running
on YARN.
YARN (Yet Another Resource Negotiator)
• YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has since evolved into what is known as a large-scale distributed operating system for Big Data processing.
Key Components:
• ResourceManager (RM): Central authority for resource management
• NodeManager (NM): Runs on each node, manages local resources
• ApplicationMaster (AM): Manages a specific application/job (e.g., a MapReduce job)
• Container: A slice of resources (CPU, RAM, etc.) given to tasks
Running MRv1 on YARN
To run a MapReduce v1 (MRv1) application on YARN, you'll use the yarn command. The
syntax is similar to the MRv1 framework, though application submission can also be done
with the hadoop command in MRv2.
• MRv1 jobs need to be rewritten or adapted to work with YARN.
• Use the new MapReduce API provided with Hadoop 2.0 (org.apache.hadoop.mapreduce instead of org.apache.hadoop.mapred), as illustrated in the sketch after this list.
• Internally, the MapReduce jobs are executed using the ApplicationMaster and containers
on YARN.
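For illustration, a word-count mapper written against the new API might look like the Scala sketch below (the class name is hypothetical, and a complete job would also need a Reducer and a driver that configures a Job and submits it to YARN):
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Mapper using the new org.apache.hadoop.mapreduce API (MRv2-friendly).
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)   // emit (word, 1) to the shuffle phase
    }
  }
}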
NoSQL Database
• NoSQL stands for "Not Only SQL".
Key features:
• Flexible Schema: you can store different types of data without defining a fixed schema.
• High Performance: optimized for high-speed reads/writes, especially with big data.
• Designed for Big Data: suitable for large-scale, distributed systems (e.g., social media, IoT, real-time analytics).
Types of NoSQL Databases
1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases
Types of NoSQL Databases
1. Document-Based Databases:
Store data as documents (e.g., JSON, BSON, XML) with a flexible schema.
Ideal for semi-structured data and object-oriented applications.
Documents are grouped in collections and support indexing for fast queries.
2. Key-Value Stores:
Data is stored as key-value pairs, like a dictionary.
Efficient for quick lookups using unique keys.
Best suited for caching, session management, and simple data retrieval.
3. Column-Oriented Databases:
Store data by columns rather than rows, optimizing read-heavy workloads.
Great for analytics and data warehousing.
Enables faster aggregation and compression over large datasets.
4. Graph-Based Databases:
Represent data as nodes and edges to model relationships.
Ideal for complex, interconnected data like social networks or recommendation engines.
Efficient at querying relationships and traversals.
MongoDB
• MongoDB is a document-oriented NoSQL database system that provides high scalability, flexibility, and performance.
• Unlike standard relational databases, MongoDB stores data in a JSON-like document structure.
• This makes it easy to work with dynamic and unstructured data. MongoDB is an open-source, cross-platform database system.
MongoDB Datatypes
• MongoDB documents are stored as BSON, which supports data types such as String, Integer, Double, Boolean, Array, Object (embedded document), Date, ObjectId, and Null.
Command: Description
• CREATE: creates a new table in the database and other objects in the database.
• To Delete:
• Click the 🗑 trash icon next to a document
• Confirm delete
MongoDB Query
• A MongoDB query is a request to the database to retrieve specific documents or data based on
certain conditions or criteria.
• It is similar to SQL queries in traditional relational databases, but MongoDB queries are written
using JavaScript-like syntax.
• The most common query operation in MongoDB is the find() method, which allows users to fetch
documents from a collection that match the given query criteria.
• Basic MongoDB Query Example:
db.collection_name.find({ field: value })
• Example:
db.articles.find({ author: "Aditya" })
• This query retrieves all documents from the articles collection where the author field is equal to "Aditya". A version issued from application code is sketched below.
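The sketch below assumes the official MongoDB Scala driver (org.mongodb.scala) on the classpath, a local MongoDB instance on the default port, and a hypothetical blog database:
import org.mongodb.scala.{Document, MongoClient, MongoCollection}
import org.mongodb.scala.model.Filters
import scala.concurrent.Await
import scala.concurrent.duration._

// Equivalent of: db.articles.find({ author: "Aditya" })
object FindArticles {
  def main(args: Array[String]): Unit = {
    val client = MongoClient()                                  // connects to mongodb://localhost:27017
    val articles: MongoCollection[Document] =
      client.getDatabase("blog").getCollection("articles")      // hypothetical database/collection

    val docs = Await.result(
      articles.find(Filters.equal("author", "Aditya")).toFuture(), 10.seconds)

    docs.foreach(doc => println(doc.toJson()))
    client.close()
  }
}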
Indexing
• An index in MongoDB supports efficient execution of queries; without indexes, MongoDB must scan every document in a collection to select the matching ones. Indexes are created with the createIndex() method, e.g. db.articles.createIndex({ author: 1 }).
Scala
Scala is both a functional programming language and an object-oriented programming language. Every variable and value used in Scala is implicitly stored as an object by default.
• Extensible Programming Language:
Scala can support multiple language constructs without the need for any Domain Specific Language (DSL) extensions, libraries, or APIs.
• Statically Typed Programming Language:
Scala is statically typed; it also provides a lightweight syntax for defining functions, supports higher-order functions, and allows functions to be nested.
• Interoperability:
The Scala compiler converts Scala code into Java bytecode, which is executed on the JVM, so Scala interoperates with Java.
Variables in Scala
Mutable Variables
These variables allow us to change their value after declaration. Mutable variables are defined using the var keyword. The first letter of a data type is written as a capital letter because, in Scala, data types are treated as objects.
• var b = "ABC"
• b = "PSIT Institute"
Immutable Variables
These variables do not allow you to change their value after declaration. Immutable variables are defined using the val keyword. The first letter of a data type is written as a capital letter because, in Scala, data types are treated as objects.
• val a = "hello world"
• a = "how are you"
Output:
a: String = hello world
<console>:25: error: reassignment to val
a = "how are you"
Class and Object in Scala
Classes and Objects are basic concepts of Object-Oriented Programming which revolve around real-life entities.
Class
• A class is a user-defined blueprint or prototype from which objects are created. In other words, a class combines fields and methods (member functions that define actions) into a single unit. Basically, in a class, a constructor is used for initializing new objects, fields are variables that provide the state of the class and its objects, and methods are used to implement the behavior of the class and its objects.
class Class_name{
// methods and fields
}
Objects
It is a basic unit of Object Oriented Programming and represents real-life entities. A typical Scala program creates many objects, which, as you know, interact by invoking methods. An object consists of:
• State: it is represented by the attributes (fields) of an object.
• Behavior: it is represented by the methods of an object.
• Identity: it gives a unique name to an object and enables one object to interact with other objects.
Declaring Objects (also called instantiating a class)
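As an illustration (the Student class and its fields are made-up names), the sketch below defines a class with a constructor, fields, and a method, and then declares (instantiates) an object of that class with the new keyword:
// A simple class with a primary constructor, fields, and a method.
class Student(val name: String, var marks: Int) {
  def display(): Unit = println(s"Name: $name, Marks: $marks")
}

object Demo {
  def main(args: Array[String]): Unit = {
    val s1 = new Student("Asha", 92)   // declaring (instantiating) an object
    s1.display()                       // Name: Asha, Marks: 92
    s1.marks = 95                      // a var field can be updated
    s1.display()                       // Name: Asha, Marks: 95
  }
}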
Operators in Scala
Operator, Description, Example:
• +  Addition  a + b
• -  Subtraction  a - b
• *  Multiplication  a * b
• /  Division  a / b
• ==  Equal to  a == b
• !=  Not equal to  a != b
• |  Bitwise OR  a | b
• =  Simple assignment  a = 10
• +  Unary plus  +a
• -  Unary minus  -a
• !  Logical negation  !a
• ~  Bitwise complement  ~a
Example
• val a = 10
• val b = 5
• println(a + b) // 15
• println(a > b) // true
• println((a > 0) && (b > 0)) // true
• println(~a) // -11 (bitwise NOT)
Data Types in Scala
• Int: 32-bit integer
• Long: 64-bit integer
• Float: 32-bit floating-point number
• Double: 64-bit floating-point number
• Char: single 16-bit character
• String: sequence of characters
• Boolean: either true or false
• Unit: represents no value (similar to void)
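A REPL-style snippet with example literals for these types (variable names are arbitrary):
val i: Int     = 42
val l: Long    = 10000000000L
val f: Float   = 2.5f
val d: Double  = 3.14
val c: Char    = 'A'
val s: String  = "Hadoop"
val b: Boolean = true
val u: Unit    = ()        // Unit means "no value", similar to void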
Inheritance
• Super Class: the class whose features are inherited is known as the superclass (or base class, or parent class).
• Sub Class: the class that inherits from another class is known as the subclass (or derived class, extended class, or child class). The subclass can add its own fields and methods in addition to the superclass fields and methods.
• Reusability: inheritance supports the concept of "reusability", i.e., when we want to create a new class and there is already a class that includes some of the code we want, we can derive our new class from the existing class. By doing this, we reuse the fields and methods of the existing class.
How to use inheritance in Scala
The keyword used for inheritance is extends.
Syntax:
class A {
  def show(): Unit = println("Hello from A")
}
class B extends A
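Reusing the classes above, a brief REPL-style sketch: B inherits show() from A, and a further hypothetical subclass C shows how an inherited method can be overridden:
val b = new B
b.show()                  // prints "Hello from A" (show() is inherited from A)

// A subclass can also override the inherited method:
class C extends A {
  override def show(): Unit = println("Hello from C")
}
new C().show()            // prints "Hello from C"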