Unit 4
Key Features:
• Resource Management: YARN dynamically allocates resources based on application demand.
• Fault Tolerance: Ensures job recovery in case of failures by tracking job states.
• Column-family Stores: Store data in columns rather than rows (e.g., Cassandra).
• Graph Databases: Store data as nodes and edges to represent relationships (e.g., Neo4j).
Advantages of NoSQL:
• Scalability: NoSQL databases are designed to scale horizontally, distributing data across
multiple machines or clusters.
• Flexibility: Schemaless design allows the easy addition or modification of fields without
downtime.
• High Performance: Optimized for fast read and write operations, especially for large
datasets.
• Availability: NoSQL databases are designed to remain operational even if some nodes or
data centers fail (CAP Theorem).
3. MongoDB
Introduction to MongoDB:
MongoDB is a widely used NoSQL document-oriented database that stores data in flexible, JSON-like
documents (BSON). It is designed to handle large volumes of data with high availability and
horizontal scalability. MongoDB is schema-less, meaning the structure of the data can vary from
document to document within the same collection.
Key Features:
• Documents: Each document is a self-contained unit that stores data in a key-value format.
Update:
db.collection.updateOne({name: "John"}, {$set: {age: 31}});
db.collection.updateMany({age: {$gt: 25}}, {$inc: {age: 1}});
Delete:
db.collection.deleteOne({name: "Alice"});
db.collection.deleteMany({age: {$lt: 30}});
Querying:
Find:
db.collection.find({age: {$gt: 25}});
db.collection.find({name: /John/});
Indexing:
Create Index:
db.collection.createIndex({name: 1});
Capped Collections:
Capped collections are fixed-size collections that automatically overwrite the oldest
documents when the space is filled.
Use cases: Logging, caching.
Example:
db.createCollection("logs", {capped: true, size: 100000});
4. Apache Spark
Installing Apache Spark:
Apache Spark is an open-source, distributed computing system designed for processing large
datasets. Because it keeps intermediate results in memory, it can run many workloads substantially
faster than Hadoop's disk-based MapReduce. Spark can run on top of Hadoop or standalone and
supports languages such as Scala, Python, and Java.
Key Features:
• RDD (Resilient Distributed Dataset): Spark’s core abstraction for distributed data. RDDs
allow parallel operations on large datasets.
• Spark SQL: A module for working with structured data and running SQL queries.
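RDD transformations follow the same functional style as Scala's standard collection operations. As a minimal sketch of that map/group/count style using plain Scala collections (no Spark dependency; with Spark, the pipeline would start from an RDD such as `sc.parallelize(...)` instead), the object and variable names here are illustrative:

```scala
object RddStyleSketch {
  def main(args: Array[String]): Unit = {
    // Word-count style pipeline over a local collection.
    val lines = Seq("spark is fast", "spark runs on the jvm")
    val counts = lines
      .flatMap(_.split(" "))                  // split lines into words
      .groupBy(identity)                      // group identical words together
      .map { case (w, ws) => (w, ws.size) }   // count occurrences of each word
    println(counts("spark")) // prints 2
  }
}
```

The same chain of flatMap/groupBy/map calls works on a Spark RDD or Dataset, which is why Scala collection fluency transfers directly to Spark programming.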
Execution Flow:
1. Job: A high-level action (e.g., collect, save) triggers a job.
2. Stage: Spark splits each job into stages at shuffle boundaries.
3. Task: Each stage is divided into tasks, one per data partition, which run in parallel on the cluster.
5. Scala
Introduction:
Scala is a statically typed, functional, and object-oriented programming language that runs on the
JVM (Java Virtual Machine). It is designed to be concise, elegant, and compatible with Java, making it
a popular choice for building scalable and high-performance systems, especially in big data
processing frameworks like Spark.
Key Features:
• Functional Programming: Scala supports higher-order functions, immutability, and pure
functions, making it suitable for writing highly modular code.
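To make the functional-programming point concrete, here is a minimal sketch (the function and value names are illustrative) of a higher-order function combined with immutable values:

```scala
object FunctionalSketch {
  def main(args: Array[String]): Unit = {
    // A higher-order function: it takes another function as a parameter.
    def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

    // An immutable value; the lambda passed in is a pure function.
    val result = applyTwice(n => n + 3, 10)
    println(result) // prints 16
  }
}
```

Because `applyTwice` depends only on its arguments and mutates nothing, it is trivially testable and safe to reuse, which is the modularity benefit the bullet above describes.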
Basic Constructs:
• Classes and Objects: Used to define data structures and methods.
• Pattern Matching: A powerful way to deconstruct data types and handle multiple conditions
in a concise manner.
Example Code:
// Define a class
// Create an instance
person.greet()
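Pattern matching, mentioned under Basic Constructs above, deserves its own example. A minimal sketch (the `describe` function is illustrative) showing type patterns, a guard, and a wildcard case:

```scala
object MatchSketch {
  def main(args: Array[String]): Unit = {
    // match deconstructs a value and handles each shape in one expression.
    def describe(x: Any): String = x match {
      case 0                    => "zero"
      case n: Int if n > 0      => "positive int"       // guard condition
      case s: String            => s"a string of length ${s.length}"
      case _                    => "something else"     // wildcard fallback
    }
    println(describe(42))      // prints positive int
    println(describe("scala")) // prints a string of length 5
  }
}
```

Each case is checked in order, so more specific patterns should appear before the wildcard.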