Unit 4
Key Features:
• Resource Management: YARN dynamically allocates resources based on application demand.
• Fault Tolerance: Ensures job recovery in case of failures by tracking job states.
• Column-family Stores: Store data in columns rather than rows (e.g., Cassandra).
• Graph Databases: Store data as nodes and edges to represent relationships (e.g., Neo4j).
Advantages of NoSQL:
• Scalability: NoSQL databases are designed to scale horizontally, distributing data across
multiple machines or clusters.
• Flexibility: Schemaless design allows the easy addition or modification of fields without
downtime.
• High Performance: Optimized for fast read and write operations, especially for large
datasets.
• Availability: NoSQL databases are designed to remain operational even if some nodes or
data centers fail (CAP Theorem).
3. MongoDB
Introduction to MongoDB:
MongoDB is a widely used NoSQL document-oriented database that stores data in flexible, JSON-like
documents (BSON). It is designed to handle large volumes of data with high availability and
horizontal scalability. MongoDB is schema-less, meaning the structure of the data can vary from
document to document within the same collection.
Key Features:
• Documents: Each document is a self-contained unit that stores data in a key-value format.
Update:
db.collection.updateOne({name: "John"}, {$set: {age: 31}});
db.collection.updateMany({age: {$gt: 25}}, {$inc: {age: 1}});
Delete:
db.collection.deleteOne({name: "Alice"});
db.collection.deleteMany({age: {$lt: 30}});
Querying:
Find:
db.collection.find({age: {$gt: 25}});
db.collection.find({name: /John/});
Indexing:
Create Index:
db.collection.createIndex({name: 1});
Capped Collections:
Capped collections are fixed-size collections that automatically overwrite the oldest
documents when the space is filled.
Use cases: Logging, caching.
Example:
db.createCollection("logs", {capped: true, size: 100000});
4. Apache Spark
Installing Apache Spark:
Apache Spark is an open-source, distributed computing system designed for processing large
datasets. Because it keeps intermediate results in memory, it can run many workloads substantially
faster than Hadoop's disk-based MapReduce. Spark can run on top of Hadoop or standalone and
supports languages such as Scala, Python, and Java.
Key Features:
• RDD (Resilient Distributed Dataset): Spark’s core abstraction for distributed data. RDDs
allow parallel operations on large datasets.
• Spark SQL: A module for working with structured data and running SQL queries.
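RDD transformations follow the same functional style as Scala's standard collection operations. As a minimal sketch of that map/group/count style using plain Scala collections (no Spark dependency; with Spark, the pipeline would start from an RDD such as `sc.parallelize(...)` instead), the object and variable names here are illustrative:

```scala
object RddStyleSketch {
  def main(args: Array[String]): Unit = {
    // Word-count style pipeline over a local collection.
    val lines = Seq("spark is fast", "spark runs on the jvm")
    val counts = lines
      .flatMap(_.split(" "))                  // split lines into words
      .groupBy(identity)                      // group identical words together
      .map { case (w, ws) => (w, ws.size) }   // count occurrences of each word
    println(counts("spark")) // prints 2
  }
}
```

The same chain of flatMap/groupBy/map calls works on a Spark RDD or Dataset, which is why Scala collection fluency transfers directly to Spark programming.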
Execution Flow:
1. Job: A high-level action (e.g., collect, save) triggers a job.
2. Stage: Spark splits each job into stages at shuffle boundaries.
3. Task: Each stage is divided into tasks, one per data partition, which run in parallel on the cluster.
5. Scala
Introduction:
Scala is a statically typed, functional, and object-oriented programming language that runs on the
JVM (Java Virtual Machine). It is designed to be concise, elegant, and compatible with Java, making it
a popular choice for building scalable and high-performance systems, especially in big data
processing frameworks like Spark.
Key Features:
• Functional Programming: Scala supports higher-order functions, immutability, and pure
functions, making it suitable for writing highly modular code.
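To make the functional-programming point concrete, here is a minimal sketch (the function and value names are illustrative) of a higher-order function combined with immutable values:

```scala
object FunctionalSketch {
  def main(args: Array[String]): Unit = {
    // A higher-order function: it takes another function as a parameter.
    def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

    // An immutable value; the lambda passed in is a pure function.
    val result = applyTwice(n => n + 3, 10)
    println(result) // prints 16
  }
}
```

Because `applyTwice` depends only on its arguments and mutates nothing, it is trivially testable and safe to reuse, which is the modularity benefit the bullet above describes.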
Basic Constructs:
• Classes and Objects: Used to define data structures and methods.
• Pattern Matching: A powerful way to deconstruct data types and handle multiple conditions
in a concise manner.
Example Code:
// Define a class
// Create an instance
person.greet()
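Pattern matching, mentioned under Basic Constructs above, deserves its own example. A minimal sketch (the `describe` function is illustrative) showing type patterns, a guard, and a wildcard case:

```scala
object MatchSketch {
  def main(args: Array[String]): Unit = {
    // match deconstructs a value and handles each shape in one expression.
    def describe(x: Any): String = x match {
      case 0                    => "zero"
      case n: Int if n > 0      => "positive int"       // guard condition
      case s: String            => s"a string of length ${s.length}"
      case _                    => "something else"     // wildcard fallback
    }
    println(describe(42))      // prints positive int
    println(describe("scala")) // prints a string of length 5
  }
}
```

Each case is checked in order, so more specific patterns should appear before the wildcard.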