Hadoop Ecosystem
Hadoop Ecosystem Components
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
HDFS
• Name Node
• Data Node
• The Name Node is the primary node that holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in a distributed environment, which is undoubtedly what makes Hadoop cost-effective.
YARN
• Yet Another Resource Negotiator: as the name suggests, YARN helps manage resources across clusters. In short, it performs resource planning and allocation for the Hadoop system.
1. Resource manager
2. Node manager
3. Application manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers allocate resources such as CPU, memory, and bandwidth on each machine and later report back to the Resource Manager.
MapReduce
• MapReduce works with two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thus organizes it in the form of a group. Map() generates the result as key-value pairs, which are later processed by the Reduce() method.
2. Reduce() aggregates the mapped key-value pairs, summarizing the intermediate results into a smaller set of output values, as sketched below.
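This flow can be sketched with plain Scala collections (a conceptual illustration only, not the Hadoop API; the input lines are made up):
// Conceptual sketch of the Map/Reduce flow in plain Scala (not the Hadoop API).
object MapReduceConcept {
  def main(args: Array[String]): Unit = {
    val lines = List("big data", "big cluster", "data node")   // made-up input

    // Map(): emit a (key, value) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle/sort: group the pairs by key.
    val grouped = mapped.groupBy(_._1)

    // Reduce(): sum the values for each key.
    val reduced = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)   // e.g. (big,2), (data,2), (cluster,1), (node,1)
  }
}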
PIG
• Pig was originally developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
• It is a platform for data flow structuring, processing, and analyzing large data sets.
• Pig does the job of executing the commands, and all the MapReduce activities are taken care of in the background. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specifically designed for this framework, which runs on the Pig Runtime.
• Pig helps to achieve ease of programming and optimization and thus is a core segment of the
Hadoop ecosystem.
HIVE
• Hive reads and writes large data sets using an SQL-like methodology and interface; its query language is known as Hive Query Language (HQL).
• It is highly scalable, as it allows both real-time and batch processing. Also, all SQL data types are supported by Hive, making query processing easier.
• Like other query-processing frameworks, Hive comes with two components: JDBC Drivers and the Hive Command Line.
• JDBC works with ODBC drivers to create data storage and connection permissions, while the HIVE
command line helps with query processing.
Mahout
• Mahout enables machine learning of a system or application. Machine learning, as the name
suggests, helps a system evolve based on certain patterns, user/environment interactions, or
algorithms.
• It provides various libraries and features such as collaborative filtering, clustering, and classification, which are nothing but machine learning concepts. It allows us to invoke algorithms according to our needs using its libraries.
Apache Spark
• Spark performs data processing in memory; it therefore consumes memory resources but is faster than MapReduce in terms of optimization.
• Spark is best suited for real-time data, while Hadoop MapReduce is best suited for structured data or batch processing; hence, both are used side by side in most companies (see the sketch below).
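A minimal Spark sketch in Scala illustrates this in-memory style; the file name app.log, the log format (host name as the first field), and the local[*] master are assumptions made so the snippet is self-contained:
import org.apache.spark.sql.SparkSession

// Count ERROR lines per host from a hypothetical log file, keeping the RDD in memory.
object ErrorsPerHost {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ErrorsPerHost")
      .master("local[*]")                 // local run; on a cluster YARN would manage resources
      .getOrCreate()

    val errorsPerHost = spark.sparkContext
      .textFile("app.log")                          // load lines into an RDD
      .filter(_.contains("ERROR"))                  // keep only error lines
      .map(line => (line.split(" ")(0), 1))         // assume the host is the first field
      .reduceByKey(_ + _)                           // sum counts per host
      .cache()                                      // cache in memory for repeated use

    errorsPerHost.take(10).foreach(println)
    spark.stop()
  }
}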
Apache HBase
• It is a NoSQL database that supports all kinds of data and is capable of handling anything within a Hadoop database. It provides Google BigTable-like capabilities for working with large data sets efficiently.
• Solr, Lucene: these are two services that perform the tasks of searching and indexing using Java libraries. Lucene is Java-based and also provides a spell-checking mechanism; Solr is built on top of Lucene.
• ZooKeeper: there was a huge problem with managing coordination and synchronization between Hadoop resources and components, which often led to inconsistency. ZooKeeper overcomes these problems by performing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie simply acts as a scheduler; it schedules jobs and binds them together. There are two kinds of Oozie jobs, Oozie workflow jobs and Oozie coordinator jobs. An Oozie workflow is a set of tasks that need to be executed sequentially, in an ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
Schedulers
FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more preference to applications that arrive earlier than to those that arrive later. It places the applications in a queue and executes them in the order of their submission (first in, first out).
• Here, irrespective of size and priority, the requests of the first application in the queue are allocated first. Only once the first application's request is satisfied is the next application in the queue served.
Capacity Scheduler
• The CapacityScheduler allows multiple users to share a Hadoop cluster securely and efficiently.
• It uses hierarchical queues: root (the cluster), parent (an organization), and leaf (where applications are submitted). Each queue gets a guaranteed portion of the cluster resources.
• Unused resources from one queue can be temporarily used by under-utilized queues (elastic sharing).
• It ensures fair resource usage by setting limits on users, applications, and queues, which prevents any single user or queue from over-consuming cluster resources, and it controls pending and initialized applications per user or queue for stability.
Fair Scheduler
• The Fair Scheduler assigns resources so that, over time, all running applications receive an equal (fair) share of the cluster's resources, rather than waiting in a strict submission order.
• NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing the resources available on a single node.
HDFS Federation is an enhancement of the Hadoop Distributed File System (HDFS) introduced in Hadoop 2.x.
Why it was needed: In earlier versions of HDFS, there was a single NameNode, which meant:
• A single point of failure
• Limited scalability (since one NameNode managed all metadata)
How HDFS Federation solves it:
• Allows multiple NameNodes and multiple namespaces.
• Each NameNode manages a part of the filesystem namespace.
• All NameNodes can work independently and in parallel.
Benefits:
• Better scalability
• Isolation of namespaces (e.g., different users/departments can use separate namespaces)
• No single point of failure (to some extent)
MRv2 (MapReduce Version 2)
MRv2 is the second version of the MapReduce framework, also called YARN-based
MapReduce.
Why it was introduced: MapReduce v1 had limitations:
• Tight coupling of resource management and job scheduling with the JobTracker.
• Limited scalability and flexibility.
How MRv2 works:
• Separates the resource management layer from the data processing layer.
• Introduces YARN for resource management.
• MapReduce becomes just one of many possible data processing frameworks running
on YARN.
YARN (Yet Another Resource Negotiator)
• YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has since evolved into what is known as a large-scale distributed operating system for Big Data processing.
Key Components:
• ResourceManager (RM): Central authority for resource management
• NodeManager (NM): Runs on each node, manages local resources
• ApplicationMaster (AM): Manages a specific application/job (e.g., a MapReduce job)
• Container: A slice of resources (CPU, RAM, etc.) given to tasks
Running MRv1 on YARN
To run a MapReduce v1 (MRv1) application on YARN, you'll use the yarn command. The
syntax is similar to the MRv1 framework, though application submission can also be done
with the hadoop command in MRv2.
• MRv1 jobs need to be rewritten or adapted to work with YARN.
• Use the new MapReduce API provided with Hadoop 2.0 (org.apache.hadoop.mapreduce instead of org.apache.hadoop.mapred), as illustrated in the sketch after this list.
• Internally, the MapReduce jobs are executed using the ApplicationMaster and containers
on YARN.
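For illustration, a word-count mapper written against the new API might look like the Scala sketch below (the class name is hypothetical, and a complete job would also need a Reducer and a driver that configures a Job and submits it to YARN):
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Mapper using the new org.apache.hadoop.mapreduce API (MRv2-friendly).
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)   // emit (word, 1) to the shuffle phase
    }
  }
}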
NoSQL Database
• NoSQL stands for "Not Only SQL".
Key features:
• Flexible Schema: you can store different types of data without defining a fixed schema.
• High Performance: optimized for high-speed reads/writes, especially with big data.
• Designed for Big Data: suitable for large-scale, distributed systems (e.g., social media, IoT, real-time analytics).
Types of NoSQL Databases
1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases
Types of NoSQL Databases
1. Document-Based Databases:
Store data as documents (e.g., JSON, BSON, XML) with a flexible schema.
Ideal for semi-structured data and object-oriented applications.
Documents are grouped in collections and support indexing for fast queries.
2. Key-Value Stores:
Data is stored as key-value pairs, like a dictionary.
Efficient for quick lookups using unique keys.
Best suited for caching, session management, and simple data retrieval.
3. Column-Oriented Databases:
Store data by columns rather than rows, optimizing read-heavy workloads.
Great for analytics and data warehousing.
Enables faster aggregation and compression over large datasets.
4. Graph-Based Databases:
Represent data as nodes and edges to model relationships.
Ideal for complex, interconnected data like social networks or recommendation engines.
Efficient at querying relationships and traversals.
MongoDB
• MongoDB is a document-oriented NoSQL database system that provides high scalability, flexibility, and performance.
• Unlike standard relational databases, MongoDB stores data in a JSON-like document structure.
• This makes it easy to work with dynamic and unstructured data. MongoDB is an open-source, cross-platform database system.
MongoDB Datatypes
• MongoDB documents are stored as BSON, which supports data types such as String, Integer, Double, Boolean, Array, Object (embedded document), Date, ObjectId, and Null.
Command: Description
• CREATE: creates a new table in the database and other objects in the database.
• To Delete:
• Click the 🗑 trash icon next to a document
• Confirm delete
MongoDB Query
• A MongoDB query is a request to the database to retrieve specific documents or data based on
certain conditions or criteria.
• It is similar to SQL queries in traditional relational databases, but MongoDB queries are written
using JavaScript-like syntax.
• The most common query operation in MongoDB is the find() method, which allows users to fetch
documents from a collection that match the given query criteria.
• Basic MongoDB Query Example:
db.collection_name.find({ field: value })
• Example:
db.articles.find({ author: "Aditya" })
• This query retrieves all documents from the articles collection where the author field is equal to "Aditya". A version issued from application code is sketched below.
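The sketch below assumes the official MongoDB Scala driver (org.mongodb.scala) on the classpath, a local MongoDB instance on the default port, and a hypothetical blog database:
import org.mongodb.scala.{Document, MongoClient, MongoCollection}
import org.mongodb.scala.model.Filters
import scala.concurrent.Await
import scala.concurrent.duration._

// Equivalent of: db.articles.find({ author: "Aditya" })
object FindArticles {
  def main(args: Array[String]): Unit = {
    val client = MongoClient()                                  // connects to mongodb://localhost:27017
    val articles: MongoCollection[Document] =
      client.getDatabase("blog").getCollection("articles")      // hypothetical database/collection

    val docs = Await.result(
      articles.find(Filters.equal("author", "Aditya")).toFuture(), 10.seconds)

    docs.foreach(doc => println(doc.toJson()))
    client.close()
  }
}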
Indexing
• An index in MongoDB supports efficient execution of queries; without indexes, MongoDB must scan every document in a collection to select the matching ones. Indexes are created with the createIndex() method, e.g. db.articles.createIndex({ author: 1 }).
Scala
Scala is both a functional programming language and an object-oriented programming language. Every variable and value used in Scala is implicitly stored as an object by default.
• Extensible Programming Language:
Scala can support multiple language constructs without the need for any Domain Specific Language (DSL) extensions, libraries, or APIs.
• Statically Typed Programming Language:
Scala is statically typed; it also provides a lightweight syntax for defining functions, supports higher-order functions, and allows functions to be nested.
• Interoperability:
The Scala compiler converts Scala code into Java bytecode, which is executed on the JVM, so Scala interoperates with Java.
Variables in Scala
Mutable Variables
These variables allow us to change their value after declaration. Mutable variables are defined using the var keyword. The first letter of a data type is written as a capital letter because, in Scala, data types are treated as objects.
• var b = "ABC"
• b = "PSIT Institute"
Immutable Variables
These variables do not allow you to change their value after declaration. Immutable variables are defined using the val keyword. The first letter of a data type is written as a capital letter because, in Scala, data types are treated as objects.
• val a = "hello world"
• a = "how are you"
Output:
a: String = hello world
<console>:25: error: reassignment to val
a = "how are you"
Class and Object in Scala
Classes and Objects are basic concepts of Object-Oriented Programming which revolve around real-life entities.
Class
• A class is a user-defined blueprint or prototype from which objects are created. In other words, a class combines fields and methods (member functions that define actions) into a single unit. Basically, in a class, a constructor is used for initializing new objects, fields are variables that provide the state of the class and its objects, and methods are used to implement the behavior of the class and its objects.
class Class_name{
// methods and fields
}
Objects
It is a basic unit of Object Oriented Programming and represents real-life entities. A typical Scala program creates many objects, which, as you know, interact by invoking methods. An object consists of:
• State: it is represented by the attributes (fields) of an object.
• Behavior: it is represented by the methods of an object.
• Identity: it gives a unique name to an object and enables one object to interact with other objects.
Declaring Objects (also called instantiating a class)
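As an illustration (the Student class and its fields are made-up names), the sketch below defines a class with a constructor, fields, and a method, and then declares (instantiates) an object of that class with the new keyword:
// A simple class with a primary constructor, fields, and a method.
class Student(val name: String, var marks: Int) {
  def display(): Unit = println(s"Name: $name, Marks: $marks")
}

object Demo {
  def main(args: Array[String]): Unit = {
    val s1 = new Student("Asha", 92)   // declaring (instantiating) an object
    s1.display()                       // Name: Asha, Marks: 92
    s1.marks = 95                      // a var field can be updated
    s1.display()                       // Name: Asha, Marks: 95
  }
}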
Operators in Scala
Operator, Description, Example:
• +  Addition  a + b
• -  Subtraction  a - b
• *  Multiplication  a * b
• /  Division  a / b
• ==  Equal to  a == b
• !=  Not equal to  a != b
• |  Bitwise OR  a | b
• =  Simple assignment  a = 10
• +  Unary plus  +a
• -  Unary minus  -a
• !  Logical negation  !a
• ~  Bitwise complement  ~a
Example
• val a = 10
• val b = 5
• println(a + b) // 15
• println(a > b) // true
• println((a > 0) && (b > 0)) // true
• println(~a) // -11 (bitwise NOT)
Data Types in Scala
• Int: 32-bit integer
• Long: 64-bit integer
• Float: 32-bit floating-point number
• Double: 64-bit floating-point number
• Char: single 16-bit character
• String: sequence of characters
• Boolean: either true or false
• Unit: represents no value (similar to void)
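A REPL-style snippet with example literals for these types (variable names are arbitrary):
val i: Int     = 42
val l: Long    = 10000000000L
val f: Float   = 2.5f
val d: Double  = 3.14
val c: Char    = 'A'
val s: String  = "Hadoop"
val b: Boolean = true
val u: Unit    = ()        // Unit means "no value", similar to void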
Inheritance
• Super Class: the class whose features are inherited is known as the superclass (or base class, or parent class).
• Sub Class: the class that inherits from another class is known as the subclass (or derived class, extended class, or child class). The subclass can add its own fields and methods in addition to the superclass fields and methods.
• Reusability: inheritance supports the concept of "reusability", i.e., when we want to create a new class and there is already a class that includes some of the code we want, we can derive our new class from the existing class. By doing this, we reuse the fields and methods of the existing class.
How to use inheritance in Scala
The keyword used for inheritance is extends.
Syntax:
class A {
  def show(): Unit = println("Hello from A")
}
class B extends A
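Reusing the classes above, a brief REPL-style sketch: B inherits show() from A, and a further hypothetical subclass C shows how an inherited method can be overridden:
val b = new B
b.show()                  // prints "Hello from A" (show() is inherited from A)

// A subclass can also override the inherited method:
class C extends A {
  override def show(): Unit = println("Hello from C")
}
new C().show()            // prints "Hello from C"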