
UNIT 4

Hadoop Ecosystem
Hadoop Ecosystem Components
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based data processing services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• Zookeeper: cluster management
• Oozie: job scheduling
HDFS

• HDFS is the primary or core component of the Hadoop ecosystem. It is responsible
for storing large datasets of structured or unstructured data across multiple nodes,
and it stores metadata in the form of log files.

• HDFS consists of two basic components, viz.

• Name Node

• Data Node

• A Name Node is the primary node that contains metadata (data about data) and
requires comparatively fewer resources than the Data Nodes, which store the actual
data. These Data Nodes run on commodity hardware in a distributed environment,
which is what makes Hadoop cost-effective.
YARN

• Yet Another Resource Negotiator: as the name suggests, YARN helps manage resources
across the cluster. In short, it performs resource planning and allocation for the Hadoop
system.

• It consists of three main components, viz.

1. Resource Manager

2. Node Manager

3. Application Manager

The Resource Manager has the privilege of allocating resources for applications in the
system, whereas Node Managers allocate resources such as CPU, memory, and
bandwidth on each machine and report back to the Resource Manager.
MapReduce

• Using distributed and parallel algorithms, MapReduce lets you carry the processing
logic to the data and helps you write applications that transform large datasets into
manageable ones.

• MapReduce uses two functions, i.e., Map() and Reduce(), whose tasks are:

1. Map() performs sorting and filtering of the data and organizes it into groups.
The map produces results as key-value pairs, which are later processed by the
Reduce() method.

2. Reduce(), as the name suggests, performs summarization by aggregating the
mapped data. Simply put, Reduce() takes the output generated by Map() as input
and combines those tuples into a smaller set of tuples.
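To make the two phases concrete, here is a minimal word-count sketch written with plain Scala collections. It is only a conceptual illustration of the Map() and Reduce() roles, not the Hadoop MapReduce API; the input lines and names are made up.

object WordCountSketch extends App {
  // Hypothetical input: each element stands for one line of a large dataset.
  val lines = Seq("big data with hadoop", "hadoop and spark", "big data tools")

  // Map phase: emit (word, 1) key-value pairs, then group them by key,
  // which mirrors the sort/shuffle Hadoop performs between the two phases.
  val mapped: Seq[(String, Int)] = lines.flatMap(_.split(" ")).map(word => (word, 1))
  val grouped: Map[String, Seq[Int]] = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

  // Reduce phase: aggregate the values for each key into a smaller set of tuples.
  val reduced: Map[String, Int] = grouped.map { case (word, counts) => (word, counts.sum) }

  reduced.foreach { case (word, count) => println(s"$word -> $count") }
}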
PIG

Pig was developed by Yahoo and works on Pig Latin, a query-based language similar to
SQL.
• It is a platform for structuring data flows and for processing and analyzing large data sets.

• Pig does the job of executing the commands, and all the MapReduce activities are taken care of in
the background. After processing, Pig stores the result in HDFS.

• The Pig Latin language is specifically designed for this framework, which runs on the Pig Runtime.

• Pig helps to achieve ease of programming and optimization and thus is a core segment of the
Hadoop ecosystem.
HIVE

• With the help of an SQL-like methodology and interface, Hive reads and writes large data sets. Its
query language is known as Hive Query Language (HQL).

• It is highly scalable as it allows both real-time and batch processing. Also, all SQL data types are
supported by Hive, making query processing easier.

• Like other query-processing frameworks, Hive comes with two components: JDBC Drivers and the Hive
Command Line.

• The JDBC and ODBC drivers handle data storage and connection permissions, while the Hive
command line helps with query processing.
Mahout

• Mahout enables machine learning of a system or application. Machine learning, as the name
suggests, helps a system evolve based on certain patterns, user/environment interactions, or
algorithms.

• It provides various libraries and features like collaborative filtering, clustering, and classification,
which are machine learning concepts. It allows us to invoke algorithms as per our
needs using its own libraries.
Apache Spark

• It is a platform that handles all process-intensive tasks such as batch processing,
real-time interactive or iterative processing, graph conversions, and visualization.

• It processes data in memory, which makes it faster than MapReduce in terms of
optimization, at the cost of higher memory consumption.

• Spark is best suited for real-time data, while Hadoop MapReduce is best suited for structured
data or batch processing. Hence both are used together in most companies.
Apache Hbase

• It is a NoSQL database that supports all kinds of data and is capable of handling
anything within a Hadoop database. It provides capabilities similar to Google's BigTable for working
with large data sets efficiently.

• When we need to search for or retrieve a small piece of data in a huge
database, the request must be processed within a short, fast time frame. At such times,
HBase comes in handy as it gives us a fault-tolerant way of storing and looking up that data.
Other Components:

• Solr, Lucene: These are two services that perform the task of searching and indexing using
Java libraries. Lucene is a Java library that also provides a spell-checking
mechanism, and Solr is built on top of Lucene.

• Zookeeper: There was a huge problem with managing coordination and synchronization
between Hadoop resources or components, which often led to inconsistency. Zookeeper
has overcome all the problems by performing synchronization, inter-component
communication, grouping, and maintenance.

• Oozie: Oozie simply acts as a scheduler, so it schedules jobs and binds them together. There
are two kinds of jobs, i.e., Oozie workflow and Oozie coordinator jobs. An Oozie workflow
is a set of tasks that need to be executed sequentially in an ordered manner. In contrast,
Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
Schedulers
FIFO Scheduler

First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives
more preference to applications submitted earlier than to those submitted later. It places the
applications in a queue and executes them in the order of their submission (first in, first
out).
• Here, irrespective of size and priority, the resources requested by the first application in the queue
are allocated first. Only once the first application's request is satisfied is the next
application in the queue served.
Capacity Scheduler

• CapacityScheduler allows multiple users to share a Hadoop cluster securely and efficiently.
• It uses hierarchical queues: root (cluster), parent (organization), and leaf (application
submission). Each queue gets a guaranteed portion of cluster resources.
• Unused resources from under-utilized queues can be temporarily used by other queues that
need them (elastic sharing).
• It ensures fair resource usage by setting limits on users, applications, and queues. This prevents
any single user or queue from over-consuming cluster resources and controls pending and
initialized applications per user or queue for stability.
Fair Scheduler

• FairScheduler enables dynamic resource sharing in Hadoop clusters without fixed
reservations.
• It ensures all running applications receive, on average, equal resources over time.
• By default, it schedules based on memory but can be configured for both memory and
CPU.
• A single app can use all cluster resources if it's the only one running; new apps get
resources as they arrive. It prevents starvation of long-lived apps and allows short apps to
finish quickly.
• It supports hierarchical queues similar to CapacityScheduler.
• It allows setting minimum resource shares for specific queues to guarantee access for
important apps. Excess resources from underused queues are shared among other apps.
Hadoop 2.0
• Apache Hadoop 2.0 represents a generational shift in the architecture of Apache Hadoop.
With YARN, Apache Hadoop is recast as a significantly more powerful platform.
• YARN is a re-architecture of Hadoop that allows multiple applications to run on the same
platform. With YARN, applications run "in" Hadoop, instead of "on" Hadoop.
The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker
and TaskTracker into separate entities. In Hadoop 2.0, the JobTracker and TaskTracker no
longer exist and have been replaced by three components:

• ResourceManager: a scheduler that allocates available resources in the cluster amongst
the competing applications.

• NodeManager: runs on each node in the cluster and takes direction from the
ResourceManager. It is responsible for managing resources available on a single node.

• ApplicationMaster: an instance of a framework-specific library, an ApplicationMaster runs
a specific YARN job and is responsible for negotiating resources from the
ResourceManager and also working with the NodeManager to execute and monitor
Containers.
HDFS Federation

HDFS Federation is an enhancement of the Hadoop Distributed File System (HDFS) introduced in Hadoop 2.x.
Why it was needed: In earlier versions of HDFS, there was a single NameNode, which meant:
• A single point of failure
• Limited scalability (since one NameNode managed all metadata)
How HDFS Federation solves it:
• Allows multiple NameNodes and multiple namespaces.
• Each NameNode manages a part of the filesystem namespace.
• All NameNodes can work independently and in parallel.
Benefits:
• Better scalability
• Isolation of namespaces (e.g., different users/departments can use separate namespaces)
• No single point of failure (to some extent)
MRv2 (MapReduce Version 2)

MRv2 is the second version of the MapReduce framework, also called YARN-based
MapReduce.
Why it was introduced: MapReduce v1 had limitations:
• Tight coupling of resource management and job scheduling with the JobTracker.
• Limited scalability and flexibility.
How MRv2 works:
• Separates the resource management layer from the data processing layer.
• Introduces YARN for resource management.
• MapReduce becomes just one of many possible data processing frameworks running
on YARN.
YARN (Yet Another Resource Negotiator)

• YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to
remove the bottleneck of the JobTracker that was present in Hadoop 1.0. YARN was
described as a "Redesigned Resource Manager" at the time of its launch, but it has
since evolved to be known as a large-scale distributed operating system used for Big Data
processing.
Key Components:
• ResourceManager (RM): Central authority for resource management
• NodeManager (NM): Runs on each node, manages local resources
• ApplicationMaster (AM): Manages a specific application/job (e.g., a MapReduce job)
• Container: A slice of resources (CPU, RAM, etc.) given to tasks
Running MRv1 on YARN

To run a MapReduce v1 (MRv1) application on YARN, you'll use the yarn command. The
syntax is similar to the MRv1 framework, though application submission can also be done
with the hadoop command in MRv2.
• MRv1 jobs need to be rewritten or adapted to work with YARN.
• Use the new MapReduce API provided with Hadoop 2.0 (org.apache.hadoop.mapreduce
instead of org.apache.hadoop.mapred).
• Internally, the MapReduce jobs are executed using the ApplicationMaster and containers
on YARN.
NoSQL Database
• NoSQL stands for "Not Only SQL".

• It refers to a group of non-relational
databases that store data in flexible, scalable
ways, unlike traditional relational databases
(RDBMS) like MySQL, Oracle, or SQL Server.
Key Characteristics of NoSQL Databases:

Feature Description

Non-relational Data is not stored in tables with rows and columns.

Flexible Schema You can store different types of data without defining a fixed schema.

Scalability Easy to scale horizontally (by adding more servers).

High Performance Optimized for high-speed reads/writes, especially with big data.

Designed for Big Data Suitable for large-scale, distributed systems (e.g., social media, IoT, real-time analytics).
Types of NoSQL Databases

NoSQL databases can be classified into four main
types, based on their data storage and retrieval
methods:

1. Document-based databases

2. Key-value stores

3. Column-oriented databases

4. Graph-based databases
Types of NoSQL Databases

1. Document-Based Databases:
Store data as documents (e.g., JSON, BSON, XML) with a flexible schema.
Ideal for semi-structured data and object-oriented applications.
Documents are grouped in collections and support indexing for fast queries.
2. Key-Value Stores:
Data is stored as key-value pairs, like a dictionary.
Efficient for quick lookups using unique keys.
Best suited for caching, session management, and simple data retrieval.
3. Column-Oriented Databases:
Store data by columns rather than rows, optimizing read-heavy workloads.
Great for analytics and data warehousing.
Enables faster aggregation and compression over large datasets.
4. Graph-Based Databases:
Represent data as nodes and edges to model relationships.
Ideal for complex, interconnected data like social networks or recommendation engines.
Efficient at querying relationships and traversals.
MongoDB

• MongoDB is a document-oriented NoSQL database system that provides high scalability,
flexibility, and performance.
• Unlike standard relational databases, MongoDB stores data in a JSON-like document
structure.
• This makes it easy to work with dynamic and unstructured data. MongoDB is an
open-source and cross-platform database system.
MongoDB Datatypes

Data Type Description


String Most commonly used data type
Integer Stores numeric values (32 or 64-bit)
Boolean true or false
Double Floating-point numbers
Object Embedded documents
Array Stores multiple values in a list
Date Stores current date/time in ISODate()
Null Represents null or no value
ObjectId A unique ID for each document
Binary Data Used to store binary data (e.g., files)
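To show several of these types in a single document, here is a hedged mongosh example; the students collection and all field values are hypothetical.

db.students.insertOne({
  name: "Amit",                                 // String
  age: 22,                                      // number (Double by default; NumberInt/NumberLong store integers)
  isEnrolled: true,                             // Boolean
  cgpa: 8.7,                                    // Double
  address: { city: "Kanpur", pin: "208001" },   // Object (embedded document)
  courses: ["DBMS", "Big Data"],                // Array
  admittedOn: new Date(),                       // Date (stored as ISODate)
  mentor: null                                  // Null
  // _id (ObjectId) is generated automatically if not supplied
})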
MongoDB Commands

Command Description

CREATE Creates a new collection (and, implicitly, the database) along with other objects such as indexes.

INSERT Inserts documents into a collection in an existing database.

DROP Deletes an entire collection or other specified objects in the database.

UPDATE Updates documents in a collection.
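A minimal mongosh sketch of these four operations, assuming a hypothetical testDB database and students collection:

// switch to (or implicitly create) the database
use testDB

db.createCollection("students")                                  // CREATE a collection
db.students.insertOne({ name: "Amit", age: 22 })                 // INSERT a document
db.students.updateOne({ name: "Amit" }, { $set: { age: 23 } })   // UPDATE a document
db.students.drop()                                               // DROP the whole collection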


MongoDB Atlas Setup & Database Creation – Summary (Completed Steps)
1. Create a MongoDB Atlas Account
• Visited https://www.mongodb.com/atlas
• Signed up and logged into the MongoDB Atlas dashboard
2. Created a Free Cluster
• Selected the Shared (Free) tier
• Named the cluster as <anyname> e.g. Cluster1
• Chose cloud provider and region
• Clicked “Create Cluster”
3. Created a Database User
• Set a username (shubhangisankhyadhar) and password
• Assigned Read and Write access to the user
4. Whitelisted IP Address
• Went to “Network Access”
• Clicked “Add IP Address”
• Selected “Add My Current IP” for secure access
5. Connected Using MongoDB Compass
• Chose “Connect using Compass” in Atlas
• Downloaded and installed MongoDB Compass
• Copied the connection string from Atlas (replaced <password> with actual password)
• Opened Compass → Selected New Connection → Pasted the connection string
• Clicked “Connect”
• ✅ Successfully connected to the cluster
6. Created a New Database
• In Compass, clicked “Create Database”
• Named it (e.g., testDB) and created the initial collection (e.g., testCollection)
Insert Document
• In Compass, click "Insert Document"
• Write the document:
{
  "name": "Amit",
  "age": 22,
  "course": "Computer Science"
}
Update Document

• To Update: Click the ✏ pen icon on a document
• Change the values (e.g., "age": 23)
• Click Update

Delete Document

• To Delete: Click the 🗑 trash icon next to a document
• Confirm delete
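The same update and delete can also be done from the mongosh shell; this is a small sketch assuming the sample document above was inserted into testCollection:

// Update: set Amit's age to 23
db.testCollection.updateOne({ name: "Amit" }, { $set: { age: 23 } })

// Delete: remove Amit's document
db.testCollection.deleteOne({ name: "Amit" })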
MongoDB Query

• A MongoDB query is a request to the database to retrieve specific documents or data based on
certain conditions or criteria.
• It is similar to SQL queries in traditional relational databases, but MongoDB queries are written
using JavaScript-like syntax.
• The most common query operation in MongoDB is the find() method, which allows users to fetch
documents from a collection that match the given query criteria.
• Basic MongoDB Query Example:
db.collection_name.find({ field: value })
• Example:
db.articles.find({ author: "Aditya" })
• This query will retrieve all documents from the articles collection where the author field is equal to
"Aditya".
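Queries can also combine conditions and projections. A hedged example, reusing the articles collection and assuming a hypothetical likes field:

// Find articles by "Aditya" with more than 100 likes,
// returning only the title field (plus _id, which is included by default)
db.articles.find(
  { author: "Aditya", likes: { $gt: 100 } },
  { title: 1 }
)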
Indexing

• Indexing in MongoDB is a crucial feature that enhances query processing efficiency.
Without indexing, MongoDB must scan every document in a collection to retrieve
the matching documents, which leads to slower query performance.
• Indexing in MongoDB is a technique that improves the speed and efficiency of queries.
• Indexing in MongoDB is a technique that improves the speed and efficiency of queries.
• Indexing improves the performance of:
• Find queries (db.collection.find()).
• Range queries (e.g., queries with <, >, <=, >= operators).
• Sorting (e.g., db.collection.find().sort()).
• Aggregation operations involving filtering, grouping, and sorting.
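A short hedged sketch of creating and using an index in mongosh, reusing the articles collection from the query example:

// Create an ascending index on the author field (1 = ascending, -1 = descending)
db.articles.createIndex({ author: 1 })

// These queries can now use the index instead of scanning every document
db.articles.find({ author: "Aditya" })
db.articles.find().sort({ author: 1 })

// Inspect the query plan to confirm an index scan (IXSCAN) is used
db.articles.find({ author: "Aditya" }).explain("executionStats")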
Capped Collection

• A capped collection is a fixed-size collection that operates like a circular buffer.
• When a capped collection reaches its maximum size, it overwrites the oldest documents to
make space for new ones.
• This behavior makes them ideal for applications like logging or tracking events where
maintaining a fixed-size history is beneficial.
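A hedged mongosh example of creating a capped collection; the collection name and limits are illustrative only:

// Capped collection limited to 5 MB and at most 5000 documents;
// once full, the oldest documents are overwritten first
db.createCollection("eventLog", { capped: true, size: 5 * 1024 * 1024, max: 5000 })

db.eventLog.insertOne({ level: "INFO", msg: "service started", at: new Date() })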
Apache Spark
SCALA
• Scala is a high-level, statically-typed, object-oriented and functional
programming language designed to be concise, elegant, and scalable.
• It runs on the Java Virtual Machine (JVM) and is interoperable with Java,
meaning you can use Java libraries directly in Scala programs.
• There is no concept of primitive data as everything is an object in Scala.
• It is a robust, high-caliber programming language that changed the
world of big data, and its performance is competitive with the
fastest existing programming languages.
• Scala was invented by Martin Odersky and his research team in the year 2003.
• Scala is a compiler-based, multi-paradigm programming language which
is compact, fast, and efficient. A major advantage of Scala is the JVM (Java
Virtual Machine): Scala code is first compiled by the Scala compiler into Java
bytecode, which is then executed on the Java Virtual Machine to produce the output.
Why Scala?
• Scala is capable of working with data that is stored in
a distributed fashion. It accesses all the available
resources and supports parallel data processing.
• Scala supports immutable data and has support for
higher-order functions.
• Scala is a JVM language designed to eliminate the
boilerplate code required in Java. It supports
many libraries and APIs which allow the
programmer to achieve less downtime.
• Scala supports multiple type constructs, which enable
the programmer to work with wrapper/container types
with ease.
Features of Scala
• Object-oriented Programming Language:

Scala is both a functional programming language and an object-oriented programming language. Every variable
and value used in Scala is implicitly saved as an object by default.
• Extensible Programming Language:

Scala can support multiple language constructs without the need for any Domain Specific
Language (DSL) extensions, libraries, or APIs.
• Statically Typed Programming Language:

Scala binds the data type to the variable for its entire scope.
• Functional Programming Language:

Scala provides a lightweight syntax for defining functions; it supports higher-order functions and allows functions
to be nested (see the sketch after this list).
• Interoperability:

Scala compiles the code using the Scala compiler, converts it into Java bytecode, and executes it on the JVM.
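A minimal sketch of the functional features listed above (higher-order and nested functions); the function names are made up for illustration.

object FunctionFeatures extends App {
  // A higher-order function: takes another function as a parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A nested function: defined inside another function
  def describe(n: Int): String = {
    def sign(m: Int): String = if (m >= 0) "non-negative" else "negative"
    s"$n is ${sign(n)}"
  }

  println(applyTwice(_ + 3, 10)) // 16
  println(describe(-5))          // -5 is negative
}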
Variables in Scala

Mutable Variables
These variables allow us to change the value after
the declaration of the variable. Mutable variables are
defined using the var keyword. The first letter of the
data type should be a capital letter, because in
Scala data types are treated as objects.
• var b = "ABC"
• b = "PSIT Institute"
Immutable Variables

These variables do not allow you to change the value after the declaration of the variable.
Immutable variables are defined using the val keyword. The first letter of the data type should
be a capital letter, because in Scala data types are treated as objects.
• val a = "hello world"
• a = "how are you"
Output:
a: String = hello world
<console>:25: error: reassignment to val
a = "how are you"
Class and Object in Scala
Classes and Objects are basic concepts of Object Oriented Programming which revolve
around the real-life entities.

Class
• A class is a user-defined blueprint or prototype from which objects are created. In other
words, a class combines fields and methods (member functions which define actions)
into a single unit. Basically, in a class, the constructor is used for initializing new objects, fields
are variables that provide the state of the class and its objects, and methods are used to
implement the behavior of the class and its objects.

class Class_name{
// methods and fields
}
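As a concrete illustration of this syntax, here is a small hypothetical Student class with constructor parameters, a field, and a method:

// Constructor parameters become fields when declared with val/var
class Student(val name: String, var age: Int) {
  val institute: String = "PSIT"   // field providing state

  def introduce(): String =        // method implementing behavior
    s"$name, age $age, studies at $institute"
}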
Objects
It is a basic unit of Object Oriented Programming and represents the real-life entities. A
typical Scala program creates many objects, which as you know, interact by invoking
methods. An object consists of :

• State: It is represented by the attributes of an object. It also reflects the properties of an
object.

• Behavior: It is represented by the methods of an object. It also reflects the response of an
object to other objects.

• Identity: It gives a unique name to an object and enables one object to interact with other
objects.
Declaring Objects (also called instantiating a class)

• When an object of a class is created, the class is said to be instantiated. All
the instances share the attributes and the behavior of the class, but the
values of those attributes, i.e. the state, are unique for each object. A
single class may have any number of instances.
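Continuing the hypothetical Student class sketched earlier, instantiation looks like this; each instance keeps its own state:

object StudentDemo extends App {
  val s1 = new Student("Amit", 22)    // first instance
  val s2 = new Student("Riya", 21)    // second instance with its own state

  s1.age = 23                         // changing s1 does not affect s2
  println(s1.introduce())             // Amit, age 23, studies at PSIT
  println(s2.introduce())             // Riya, age 21, studies at PSIT
}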
Operators in Scala

• In Scala, operators are actually just methods. When you write
something like a + b, Scala interprets it as a.+(b), meaning +
is a method called on a with b as the argument. Here's a
classification of operators in Scala:
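A two-line illustration that both forms are equivalent:

object OperatorAsMethod extends App {
  val a = 10; val b = 5
  println(a + b)    // infix operator syntax: 15
  println(a.+(b))   // explicit method-call syntax: also 15
}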
Arithmetic Operator
Operator Meaning Example

+ Addition a+b

- Subtraction a-b

* Multiplication a*b

/ Division a/b

% Modulus (remainder) a%b


Relational (Comparison) Operators
Operator Meaning Example

== Equal to a == b

!= Not equal to a != b

> Greater than a>b

< Less than a<b

>= Greater than or equal a >= b

<= Less than or equal a <= b


Logical Operators

Operator Meaning Example

&& Logical AND (both true) (a > 0) && (b > 0)

|| Logical OR (either true) (a > 0) || (b > 0)

! Logical NOT (negation) !(a > 0)


Bitwise Operators
Operator Meaning Example

& Bitwise AND a&b

| Bitwise OR a | b

^ Bitwise XOR a^b

~ Bitwise complement (NOT) ~a

<< Left shift a << 2

>> Right shift a >> 2

>>> Unsigned right shift a >>> 2


Assignment Operator
Operator Meaning Example

= Simple assignment a = 10

+= Add and assign a += 5

-= Subtract and assign a -= 5

*= Multiply and assign a *= 5

/= Divide and assign a /= 5

%= Modulus and assign a %= 5


Unary Operators
Operator Meaning Example

+ Unary plus +a

- Unary minus -a

! Logical negation !a

~ Bitwise complement ~a
Example

object OperatorExample extends App {
  val a = 10
  val b = 5

  println(a + b)              // 15
  println(a > b)              // true
  println((a > 0) && (b > 0)) // true
  println(~a)                 // bitwise NOT: -11
}
Data Types in SCALA
Data Type Description Example

Byte 8-bit signed integer val x: Byte = 10
Short 16-bit signed integer val y: Short = 1000
Int 32-bit signed integer val age: Int = 25
Long 64-bit signed integer val big: Long = 100000L
Float 32-bit floating point val pi: Float = 3.14f
Double 64-bit floating point (default) val g: Double = 9.8
Char 16-bit Unicode character val grade: Char = 'A'
Boolean true or false val isScalaFun: Boolean = true
Unit Represents no value (like void) def printHello(): Unit = println("Hi")
Null Reference to null for AnyRef types val x: String = null
Nothing Subtype of every other type (used for abnormal termination) e.g., throw new Exception("Oops") returns Nothing
Any Supertype of all types root of the Scala type hierarchy
AnyVal Supertype of all value types Int, Float, Boolean, etc.
AnyRef Supertype of all reference types like Java's Object
Example

object DataTypeDemo extends App {
  val a: Int = 100
  val b: Double = 25.6
  val name: String = "Scala"
  val isAwesome: Boolean = true
  val items: List[String] = List("pen", "book", "bag")

  println(s"$name is awesome? $isAwesome")
  println(s"First item: ${items.head}")
}
Inheritance

Inheritance is an important pillar of OOP (Object Oriented Programming). It is the mechanism in Scala by
which one class is allowed to inherit the features (fields and methods) of another class.
Important terminology:

• Super Class: The class whose features are inherited is known as the superclass (or a base class or a parent
class).

• Sub Class: The class that inherits the other class is known as the subclass (or a derived class, extended class,
or child class). The subclass can add its own fields and methods in addition to the superclass fields and
methods.

• Reusability: Inheritance supports the concept of "reusability", i.e. when we want to create a new class and
there is already a class that includes some of the code that we want, we can derive our new class from
the existing class. By doing this, we are reusing the fields and methods of the existing class.
How to use inheritance in Scala
The keyword used for inheritance is extends.
Syntax:

class child_class_name extends parent_class_name { // Methods and fields }


Example

class A {
  def show(): Unit = println("Hello from A")
}

class B extends A

object Test extends App {
  val obj = new B()
  obj.show() // Output: Hello from A
}
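As a hedged extension of this example (not part of the original slide), a subclass can also override inherited methods and add members of its own:

class C extends A {
  override def show(): Unit = println("Hello from C")   // overrides the inherited method
  def greet(): Unit = println("C adds its own method")   // new member in the subclass
}

object Test2 extends App {
  val obj = new C()
  obj.show()  // Hello from C
  obj.greet() // C adds its own method
}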
