The document provides an overview of the Hadoop Ecosystem, including its core components like HDFS and YARN, which manage resources and job scheduling. It also introduces NoSQL databases, highlighting their types, advantages, and specifically details MongoDB's features and operations. Additionally, it covers Apache Spark's capabilities for large dataset processing and introduces Scala as a programming language suited for big data frameworks.


UNIT-4

1. Hadoop Ecosystem and YARN


Hadoop Ecosystem Overview:
The Hadoop Ecosystem is a suite of open-source software tools that facilitate the processing and
storage of large-scale datasets. At the core of the ecosystem is Hadoop Distributed File System
(HDFS), a distributed storage system that breaks large files into blocks, which are replicated across
multiple machines for fault tolerance. MapReduce is the computational model for parallel data
processing, allowing distributed execution on data stored in HDFS. Other key components of the
ecosystem include Hive (a data warehouse layer that provides a SQL-like query language for large-scale data analysis), HBase (a NoSQL database), Pig (a high-level data flow language), and Oozie (a workflow scheduler).

YARN (Yet Another Resource Negotiator):


YARN is a resource management layer that separates the job scheduling and resource management
functions of MapReduce into distinct components. It allows different processing frameworks
(MapReduce, Spark, Tez) to share a cluster and run concurrently. YARN’s core components are:

• ResourceManager: Manages resources across the cluster.

• NodeManager: Runs on each node and manages resources for containers.

• ApplicationMaster: Manages the lifecycle of a single job.

Key Features:
• Resource Management: YARN dynamically allocates resources based on application demand.

• Fault Tolerance: Ensures job recovery in case of failures by tracking job states.

• Cluster Utilization: YARN optimizes resource allocation for better utilization.

Diagram: YARN Architecture


2. NoSQL Databases
Introduction to NoSQL:
NoSQL (Not Only SQL) refers to a class of database management systems that do not use traditional
relational database models. They provide flexible schemas for storing unstructured or semi-
structured data, making them suitable for handling big data, high-velocity, and high-volume
workloads. NoSQL databases can be categorized into four major types:

• Document Stores: Store data as JSON or BSON documents (e.g., MongoDB).

• Key-Value Stores: Store data as key-value pairs (e.g., Redis).

• Column-Family Stores: Organize data into column families, grouping related columns together rather than using fixed relational rows (e.g., Cassandra).

• Graph Databases: Store data as nodes and edges to represent relationships (e.g., Neo4j).

Advantages of NoSQL:
• Scalability: NoSQL databases are designed to scale horizontally, distributing data across
multiple machines or clusters.

• Flexibility: Schemaless design allows the easy addition or modification of fields without
downtime.

• High Performance: Optimized for fast read and write operations, especially for large
datasets.

• Availability: NoSQL databases are designed to remain operational even if some nodes or data centers fail, often trading strict consistency for availability, as described by the CAP theorem.

3. MongoDB
Introduction to MongoDB:
MongoDB is a widely used NoSQL document-oriented database that stores data in flexible, JSON-like
documents (BSON). It is designed to handle large volumes of data with high availability and
horizontal scalability. MongoDB is schema-less, meaning the structure of the data can vary from
document to document within the same collection.

Key Features:
• Documents: Each document is a self-contained unit that stores data in a key-value format.

• Collections: Group of MongoDB documents that are stored together.

• Indexing: MongoDB supports secondary indexes to improve the performance of queries.

• Aggregation: Provides powerful aggregation features to perform operations like filtering, grouping, and sorting on data.

Common Operations:
• Create: Inserting documents into collections.

• Update: Modifying existing documents.

• Delete: Removing documents from a collection.

• Query: Retrieving documents that match certain criteria.

Diagram: MongoDB Architecture

Creating, Updating, and Deleting Documents:


Insert:
db.collection.insertOne({name: "John", age: 30});
db.collection.insertMany([{name: "Alice", age: 25}, {name: "Bob", age: 28}]);

Update:
db.collection.updateOne({name: "John"}, {$set: {age: 31}});
db.collection.updateMany({age: {$gt: 25}}, {$inc: {age: 1}});

Delete:
db.collection.deleteOne({name: "Alice"});
db.collection.deleteMany({age: {$lt: 30}});

Querying:
Find:
db.collection.find({age: {$gt: 25}});
db.collection.find({name: /John/});

Indexing:
Create Index:
db.collection.createIndex({name: 1});

Capped Collections:
Capped collections are fixed-size collections that automatically overwrite the oldest
documents when the space is filled.
Use cases: Logging, caching.
Example:
db.createCollection("logs", {capped: true, size: 100000});

4. Apache Spark
Introduction to Apache Spark:
Apache Spark is an open-source, distributed computing system designed for processing large
datasets. It performs in-memory computation and can run significantly faster than Hadoop's
MapReduce for many workloads. Spark can run on top of Hadoop or standalone and supports
languages like Scala, Python, and Java.

Key Features:
• RDD (Resilient Distributed Dataset): Spark’s core abstraction for distributed data. RDDs
allow parallel operations on large datasets.
• Spark SQL: A module for working with structured data and running SQL queries.

• Spark Streaming: Enables real-time stream processing.

• MLlib: A scalable machine learning library.

• GraphX: A library for graph processing.
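RDD transformations such as map, filter, and reduceByKey mirror operations on ordinary Scala collections. The sketch below uses a local Seq as a stand-in for an RDD so it can run without a cluster (in a real job, the data would come from a SparkContext, e.g. via sc.parallelize; the values here are made up for illustration):

```scala
// Sketch of RDD-style transformations using a local Scala collection as a
// stand-in for an RDD; no SparkContext or cluster is needed to follow along.
object RddStyleSketch {
  val data: Seq[Int] = Seq(1, 2, 3, 4, 5, 6)

  // map and filter behave just like their RDD counterparts.
  val squaresOfEvens: Seq[Int] = data.filter(_ % 2 == 0).map(n => n * n)

  // reduceByKey on an RDD of pairs corresponds to groupBy followed by a
  // per-key reduction on a local collection.
  val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))
  val reducedByKey: Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

Here `squaresOfEvens` evaluates to Seq(4, 16, 36), and `reducedByKey` sums the values per key, giving "a" -> 4 and "b" -> 2. On an RDD, the same chain of transformations would be evaluated lazily and distributed across partitions.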

Execution Flow:
1. Job: The top-level unit of work, triggered by an action (e.g., collect, save).

2. Stage: A set of tasks within a job that can run in parallel; stage boundaries fall at shuffle operations.

3. Task: A single unit of computation that operates on one partition of the data.

Diagram: Spark Architecture

5. Scala
Introduction:
Scala is a statically typed, functional, and object-oriented programming language that runs on the
JVM (Java Virtual Machine). It is designed to be concise, elegant, and compatible with Java, making it
a popular choice for building scalable and high-performance systems, especially in big data
processing frameworks like Spark.

Key Features:
• Functional Programming: Scala supports higher-order functions, immutability, and pure
functions, making it suitable for writing highly modular code.

• Object-Oriented: Everything in Scala is an object, and it supports concepts like inheritance, polymorphism, and encapsulation.
• Type Inference: Scala automatically infers the types of variables, reducing the need for
explicit type declarations.

Basic Constructs:
• Classes and Objects: Used to define data structures and methods.

• Pattern Matching: A powerful way to deconstruct data types and handle multiple conditions
in a concise manner.

• Higher-Order Functions: Functions that take other functions as parameters or return functions.

Example Code:

// Define a class
class Person(val name: String, val age: Int) {
  def greet(): Unit = println(s"Hello, my name is $name and I am $age years old.")
}

// Create an instance and call its method
val person = new Person("John", 30)
person.greet()
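The other two constructs listed above, pattern matching and higher-order functions, can be sketched just as briefly (the function names here are illustrative, not standard library APIs):

```scala
// Pattern matching: deconstruct a value and handle several cases concisely.
def describe(x: Any): String = x match {
  case 0               => "zero"
  case n: Int if n > 0 => "positive int"
  case s: String       => s"a string of length ${s.length}"
  case _               => "something else"
}

// Higher-order function: takes another function as a parameter.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
```

For example, `describe("hi")` returns "a string of length 2", and `applyTwice(_ + 3, 10)` applies the increment twice, returning 16.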

Diagram: Scala's Object-Oriented and Functional Model


+--------------------+       +------------------+
|       Object       |<----->|   Class/Traits   |
+--------------------+       +------------------+
          |                           |
          v                           v
+--------------------+       +------------------+
| Higher-Order Funcs |       |  Immutable Data  |
+--------------------+       +------------------+
