BIG DATA
SYLLABUS: Hadoop Eco System and YARN: Hadoop ecosystem components, schedulers,
fair and capacity, Hadoop 2.0 New Features – Name Node high availability, HDFS
federation, MRv2, YARN, Running MRv1 in YARN. NoSQL Databases: Introduction to
NoSQL MongoDB: Introduction, data types, creating, updating and deleting documents,
querying, introduction to indexing, capped collections Spark: Installing spark, spark
applications, jobs, stages and tasks, Resilient Distributed Datasets, anatomy of a Spark
job run, Spark on YARN SCALA: Introduction, classes and objects, basic types and
operators, built-in control structures, functions and closures, inheritance.
Hadoop Ecosystem Components:
The Hadoop ecosystem includes several tools that work together to handle big data efficiently.
These tools are built around Hadoop's core components: HDFS (storage) and MapReduce
(processing).
Main components of the Hadoop ecosystem:
1.HDFS (Hadoop Distributed File System) – Used for storing large datasets in a distributed
manner.
2.MapReduce – Programming model for processing large data in parallel.
3.YARN (Yet Another Resource Negotiator) – Manages resources and schedules jobs.
4.Hive – SQL-like query language for data summarization and analysis.
5.Pig – High-level scripting language used with MapReduce.
6.HBase – A NoSQL database that runs on top of HDFS.
7.Sqoop – Used to transfer data between Hadoop and relational databases.
8.Flume – Collects and transports large amounts of streaming data into Hadoop.
9.Oozie – Workflow scheduler to manage Hadoop jobs.
10.Zookeeper – Coordinates and manages distributed applications.
11.Mahout – Machine learning library for building predictive models.
12.Avro – A data serialization system used for efficient data exchange.
Hadoop Schedulers:
Schedulers in Hadoop YARN decide how resources (CPU, memory) are allocated among various
jobs. They ensure multiple users can share the Hadoop cluster efficiently.
1. FIFO Scheduler (First-In-First-Out):
• Simple scheduler.
• Jobs are executed in the order they arrive.
• Not fair to small jobs: a long-running job can block shorter jobs queued behind it.
2. Fair Scheduler:
• Developed by Facebook.
• Divides resources equally among all running jobs.
• Ensures that small jobs are not stuck behind large ones.
• Jobs are grouped into pools, and each pool gets a fair share of resources.
3. Capacity Scheduler:
• Developed by Yahoo.
• Designed for large organizations with multiple users.
• Cluster resources are divided into queues, and each queue gets a configured capacity.
• Unused capacity in one queue can be used by others.
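The scheduler choice and the queue shares are set in Hadoop's configuration files. Below is a minimal
configuration sketch, assuming a YARN cluster with two hypothetical queues named prod and dev; the
property names follow the standard yarn-site.xml and capacity-scheduler.xml files.

<!-- yarn-site.xml: select the scheduler implementation -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <!-- for the Fair Scheduler, use ...scheduler.fair.FairScheduler instead -->
</property>

<!-- capacity-scheduler.xml: split cluster capacity between two hypothetical queues -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>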
Hadoop 2.0 – New Features
• Hadoop 2.0 brought major improvements over Hadoop 1.x. It solved scalability,
availability, and resource management issues. Below are the important new features:
1. NameNode High Availability (HA):
• In Hadoop 1.x, there was only one NameNode, so if it failed, the whole system would
stop (single point of failure).
• Hadoop 2.0 introduced two NameNodes: one active and one standby.
• If the active NameNode fails, the standby takes over automatically.
• This ensures that HDFS continues to work without downtime.
2. HDFS Federation:
• In earlier versions, there was only one NameNode, which could become a bottleneck
in large clusters.
• Federation allows multiple NameNodes, each managing part of the file system.
• It improves scalability and isolation, allowing different applications to use different
parts of the file system without conflict.
3. MRv2 (MapReduce Version 2):
• MRv2 is MapReduce re-implemented to run on top of YARN (Yet Another Resource
Negotiator).
• In Hadoop 1.x, the JobTracker handled both resource management and job
scheduling, which created scalability and performance issues.
• MRv2 separates these two responsibilities:
• The ResourceManager handles cluster-wide resource allocation.
• A per-application ApplicationMaster manages the lifecycle of each individual job.
4. YARN (Yet Another Resource Negotiator):
• YARN is the core of Hadoop 2.0.
• It allows running multiple applications (not just MapReduce), like Spark, Tez,
etc., on the same cluster.
• It improves resource utilization, scalability, and flexibility.
How to Run MapReduce Version 1 (MRv1) on YARN
1.YARN allows old MapReduce jobs to run – The jobs written for the older version of
MapReduce can still run on the new YARN system.
2.No need to change code – Existing MapReduce programs do not need to be
rewritten. They can work directly on YARN.
3.YARN handles job execution – YARN takes care of distributing the job and managing
the resources required to run it.
4.A dedicated manager runs old jobs – YARN starts a MapReduce ApplicationMaster for
each submitted job, which supports running MRv1 jobs in the new environment.
5.Same way of job submission – You submit the job in the same way as before, and
YARN will run it in the background.
6.Backward compatibility – YARN supports older applications so that users can
continue using their previous work without problems.
7.Good for migration – This is helpful for companies or users who are moving from
older versions of Hadoop to newer ones.
8.Benefit of new features – Even when using old jobs, you still get advantages like
better resource sharing and job scheduling from YARN.
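In practice, pointing existing MapReduce jobs at YARN is a configuration change rather than a code
change. A minimal sketch of the relevant mapred-site.xml setting is shown below; the job jar and the
usual hadoop jar submission command stay exactly as they were.

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>  <!-- run MapReduce jobs through YARN instead of the old JobTracker -->
</property>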
NoSQL Databases
Introduction:
• NoSQL stands for "Not Only SQL".
• It refers to a group of databases that do not use the traditional relational database
model.
• Designed to handle large volumes of structured, semi-structured, or unstructured
data.
• Useful in big data applications and real-time web apps.
Advantages of NoSQL:
1.Scalability – Easily handles large amounts of data and traffic by scaling horizontally.
2.Flexibility – No fixed schema; supports dynamic data types and structures.
3.High Performance – Faster read/write operations for large datasets.
4.Supports Big Data – Works well with distributed computing frameworks like Hadoop.
5.Easier for Developers – Matches modern programming paradigms (JSON, key-value).
Disadvantages of NoSQL:
1.Lack of Standardization – No uniform query language like SQL.
2.Limited Support for Complex Queries – Not ideal for multi-table joins.
3.Less Mature Tools – Compared to relational databases.
4.Consistency Issues – Often prefers availability and partition tolerance over
consistency (CAP Theorem).
5.Data Redundancy – Due to denormalization, same data may be repeated.
Types of NoSQL Databases (Explained in Detail)
1. Key-Value Stores
1. Data is stored as a pair of key and value, like a dictionary.
2. The key is unique, and the value can be anything (a string, number, JSON, etc.).
3. Very fast and efficient for lookups by key.
4. Best for: Caching, session management, simple data storage.
5. Examples: Redis, Riak, Amazon DynamoDB.
2. Document-Oriented Databases
1. Data is stored in documents (like JSON or XML), which are more flexible than rows and columns.
2. Each document is self-contained and can have different fields.
3. Easy to map to objects in code and update individual fields.
4. Best for: Content management, real-time analytics, product catalogs.
5. Examples: MongoDB, CouchDB.
3. Column-Oriented Databases
1. Stores data in columns instead of rows, making it efficient for reading specific fields across large datasets.
2. Great for analytical queries on big data.
3. Scales well across many machines.
4. Best for: Data warehousing, real-time analytics, logging.
5. Examples: Apache HBase, Cassandra.
4. Graph-Based Databases
1. Focuses on relationships between data using nodes and edges.
2. Very powerful for handling complex relationships like social networks, recommendation engines, etc.
3. Best for: Social networks, fraud detection, recommendation systems.
4. Examples: Neo4j, ArangoDB.
MongoDB:
MongoDB is a NoSQL, open-source, document-oriented database. It stores data in
JSON-like documents with dynamic schemas, meaning the structure of data can vary
across documents in a collection.
Features of MongoDB:
1.Schema-less – Collections do not require a predefined schema.
2.Document-Oriented Storage – Data is stored in BSON (Binary JSON) format, allowing
for embedded documents and arrays.
3.High Performance – Supports fast read and write operations.
4.Scalability – Supports horizontal scaling using sharding.
5.Replication – Ensures high availability with replica sets.
6.Indexing – Supports indexing on any field to improve query performance.
7.Aggregation – Provides a powerful aggregation framework for data processing and
analytics.
8.Flexibility – You can store structured, semi-structured, or unstructured data.
9.Cross-Platform – Works on Windows, Linux, and MacOS.
Common MongoDB Data Types:
1.String – Used for storing text.
2.Integer – Stores numeric values (32-bit or 64-bit).
3.Boolean – True or False values.
4.Double – Stores floating-point numbers.
5.Date – Stores date and time in UTC format.
6.Array – Stores multiple values in a single field.
7.Object/Embedded Document – Stores documents within documents.
8.Null – Represents a null or missing value.
9.ObjectId – A unique identifier for each document (auto-generated).
10.Binary Data – Used to store binary data such as images or files.
1. Creating Documents in MongoDB:
• Creating a document in MongoDB means adding new data to the database. A
document in MongoDB is a record, which is similar to a row in relational databases.
• Example: If you want to store information about a person, like their name, age, and
city, you create a document for that person. MongoDB will automatically store this
data in a collection (similar to a table in a relational database).
• Once created, this document is assigned an _id by MongoDB, which uniquely
identifies it in the collection.
2. Updating Documents in MongoDB:
• Updating means modifying the existing data in a document. You can update a specific
field in a document (e.g., change the person's age or city) without affecting other
fields.
• Example: Suppose you created a document for a person named "John" with age 29.
Later, if you need to change the age to 30, you can update just the age field in that
document. You can also update multiple documents at once if needed, such as
updating the status of everyone living in "New York."
• MongoDB provides flexibility to update documents based on conditions. For example,
you can choose to update only those documents that match certain criteria.
3. Deleting Documents in MongoDB:
• Deleting documents means removing data from the database. If a document is no longer needed or is
outdated, it can be deleted.
• Example: If you want to delete the document of a person named "John," you can remove that document from
the collection. MongoDB allows you to delete just one document or multiple documents at once. For example,
you can delete all people who live in "New York" if required.
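The create, update, and delete operations described above can be sketched in code. This is a minimal
example using the official MongoDB Scala driver (org.mongodb.scala); the connection string, database
name (demo), collection name (people), and field values are illustrative assumptions.

import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, Updates}
import scala.concurrent.Await
import scala.concurrent.duration._

object MongoCrudSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClient("mongodb://localhost:27017")   // assumed local MongoDB server
    val people: MongoCollection[Document] =
      client.getDatabase("demo").getCollection("people")

    // Create: insert a document; MongoDB assigns a unique _id automatically
    Await.result(
      people.insertOne(Document("name" -> "John", "age" -> 29, "city" -> "New York")).toFuture(),
      10.seconds)

    // Update: change only the age field of the document where name is "John"
    Await.result(
      people.updateOne(Filters.equal("name", "John"), Updates.set("age", 30)).toFuture(),
      10.seconds)

    // Delete: remove every document whose city is "New York"
    Await.result(
      people.deleteMany(Filters.equal("city", "New York")).toFuture(),
      10.seconds)

    client.close()
  }
}

The driver is asynchronous, which is why each operation is converted to a Future and awaited here;
this keeps the sketch easy to read but is not how a production application would normally block.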
Queries in MongoDB:
MongoDB allows you to retrieve data from the database using queries. A query is a way to search for documents
that match specific conditions. The basic idea is to find specific documents based on their field values.
1. Basic Queries: You can search for documents by specifying the field and value. For example, if you want to find
all users who are 25 years old, you would search for documents where the "age" field is equal to 25.
2. Conditional Queries: MongoDB lets you apply conditions to your queries. For example, if you want to find users
older than 30, you can use a condition that searches for documents where the "age" is greater than 30.
3. Logical Queries: You can combine different conditions using logical operators like "AND" and "OR". For
instance, if you want to find users who are older than 30 but live in "New York", you can combine these
conditions.
4. Sorting: MongoDB allows you to sort your query results. For example, if you want to sort users by their age, you
can choose to display the results either in ascending or descending order.
5. Limiting Results: You can limit the number of results returned by a query. For example, if you want to get only
the first 5 documents, you can apply a limit to the query.
6. Projection: You can specify which fields to display in the query result. For example, if you only want to display
the "name" and "age" fields, you can exclude all other fields from the results.
Indexing in MongoDB:
Indexing in MongoDB is used to improve the performance of queries. When you create an
index on a field, MongoDB creates a structure that makes it faster to find documents that
match a specific value for that field.
1.Single Field Index: This is the simplest type of index and is created on a single field. For
example, if you frequently search for users by their name, you can create an index on the
"name" field to speed up those queries.
2.Compound Index: A compound index is created on multiple fields. It allows queries that
filter by several fields to be executed faster. For instance, if you often search for users by
both their name and age, a compound index can improve performance.
3.Text Index: MongoDB allows you to create a text index for full-text search. This type of index
is useful when you need to search for documents that contain specific words or phrases
within a text field.
4.Geospatial Index: If your data involves geographical locations (latitude and longitude),
MongoDB provides special indexing options to efficiently handle these types of queries.
5.Multikey Index: When you store arrays in MongoDB documents, you can create a multikey
index. This type of index is useful for queries that need to search within array fields.
6.Hashed Index: This type of index is used for efficient equality queries. It is useful when you
need to search for documents based on exact matches to a field value.
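A sketch of creating some of these index types with the MongoDB Scala driver; the collection and field
names (people, name, age, bio, email) are hypothetical examples.

import org.mongodb.scala._
import org.mongodb.scala.model.Indexes
import scala.concurrent.Await
import scala.concurrent.duration._

object MongoIndexSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClient("mongodb://localhost:27017")
    val people = client.getDatabase("demo").getCollection("people")

    // helper: run one index-creation call and wait for it to finish
    def create(index: org.bson.conversions.Bson): String =
      Await.result(people.createIndex(index).toFuture(), 10.seconds)

    create(Indexes.ascending("name"))                        // single field index
    create(Indexes.compoundIndex(Indexes.ascending("name"),  // compound index
                                 Indexes.descending("age")))
    create(Indexes.text("bio"))                              // text index
    create(Indexes.hashed("email"))                          // hashed index

    client.close()
  }
}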
Benefits of Indexing:
• Faster Query Execution: Indexes make data retrieval quicker, as they allow
MongoDB to quickly locate relevant documents.
• Better Performance for Sorting: Sorting documents by a field with an index is
faster than sorting without one.
• Improved Read Efficiency: Indexes help MongoDB read data more efficiently,
especially with large datasets.
Limitations of Indexing:
• Space and Memory Usage: Indexes consume additional disk space and
memory. Having too many indexes can slow down performance.
• Impact on Write Operations: Every time a document is added, updated, or
deleted, MongoDB has to update the index, which can slow down write
operations.
• Maintenance: Indexes need to be maintained and updated regularly to ensure
optimal performance.
Capped Collections in MongoDB:
• A capped collection is a fixed-size collection.
• It automatically removes the oldest documents when the size limit is reached.
• Capped collections maintain the insertion order.
• They are ideal for use cases like logging or real-time data tracking.
• Capped collections provide high performance because individual documents cannot be
deleted and updates are not allowed to increase a document’s size.
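A sketch of creating a capped collection with the MongoDB Scala driver; the collection name (eventLog)
and the size and document limits are arbitrary example values.

import org.mongodb.scala._
import org.mongodb.scala.model.CreateCollectionOptions
import scala.concurrent.Await
import scala.concurrent.duration._

object CappedCollectionSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClient("mongodb://localhost:27017")
    val db     = client.getDatabase("demo")

    // A 1 MB capped collection holding at most 1000 documents:
    // once it is full, the oldest documents are overwritten automatically.
    val options = CreateCollectionOptions()
      .capped(true)
      .sizeInBytes(1024 * 1024)
      .maxDocuments(1000)

    Await.result(db.createCollection("eventLog", options).toFuture(), 10.seconds)
    client.close()
  }
}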
Spark: Installing Spark, Spark Applications, Jobs, Stages, and Tasks
1. Installing Spark:
To begin using Apache Spark, you need to install it on your system or set it up on a cluster. Here’s a general
overview of how to install Spark:
• Pre-requisites:
• Java: Spark runs on Java, so you must have Java installed (Java 8 or later).
• Scala: Spark is written in Scala; it provides APIs in Scala, Java, Python, and R.
• Hadoop (Optional): If you want to run Spark with Hadoop, you need to install Hadoop as well. If not, Spark
can also run in standalone mode.
• Installation Steps:
• Download Spark: Visit the official Apache Spark website and download the appropriate version (usually
the pre-built version for Hadoop).
• Extract the Spark Archive: After downloading, extract the archive to a desired location on your local
system.
• Configure Spark:
• Set up the environment variables (SPARK_HOME and PATH).
• You can configure Spark by editing the spark-defaults.conf file and setting options like the master URL,
memory settings, and other parameters.
• Run Spark: After installation and configuration, you can start Spark in local mode or connect it to a cluster (e.g.,
Hadoop YARN, Mesos).
• Standalone Mode: You can run Spark on your local machine (single-node mode).
• Cluster Mode: You can run Spark on a cluster by connecting to YARN or Mesos for distributed computing.
Spark Applications:
A Spark application is a complete program that uses Spark to process data. Every
Spark application has a driver program that runs the main code. The driver
coordinates the execution of the program and sends tasks to worker nodes.
• Driver Program: This controls the execution of the Spark job. It communicates
with the cluster manager to allocate resources and send tasks to worker nodes.
• Cluster Manager: It manages the distribution of tasks across nodes. It can be
Hadoop YARN, Mesos, or Spark’s built-in manager.
• Executors: Executors are the worker processes that run on worker nodes and
perform the tasks assigned to them by the driver.
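A minimal sketch of a self-contained Spark application in Scala (assuming the spark-core and spark-sql
libraries are on the classpath); the application name and the local master URL are illustrative.

import org.apache.spark.sql.SparkSession

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // This main method is the driver program: it creates the SparkSession,
    // defines the computation, and sends tasks to the executors.
    val spark = SparkSession.builder()
      .appName("FirstSparkApp")
      .master("local[*]")          // local mode; on a real cluster the master is set at submission time
      .getOrCreate()

    val sc   = spark.sparkContext
    val data = sc.parallelize(1 to 1000)   // a distributed dataset (RDD) split across executors
    println(s"Sum = ${data.sum()}")        // sum() is an action, executed by the executors

    spark.stop()
  }
}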
Jobs:
A Spark job is triggered when you perform an action, such as counting the number of
elements in a dataset or saving the data. A job represents a complete computation and
consists of multiple stages.
Triggering Jobs: Jobs are triggered by actions in Spark. For example, calling an action
like .collect() will trigger the execution of the job.
Stages in Jobs: When a job involves transformations that require data to be shuffled
across the cluster, Spark divides the job into multiple stages. Stages are separated by
operations that require a shuffle of data (e.g., groupBy or join).
Stages:
Stages are subsets of a job that can be executed independently. Spark divides jobs into
stages based on operations that involve shuffling data.
• Shuffling: Shuffling is the process of redistributing data across the cluster when a
stage involves wide dependencies (e.g., aggregating data from different nodes).
• Execution of Stages: Each stage runs tasks in parallel, and the results are passed to
the next stage. The execution is sequential, meaning Stage 2 will not start until Stage
1 is complete.
Tasks:
A task is the smallest unit of work in Spark, corresponding to a single partition of
the data.
• Parallelism: Tasks are executed in parallel across the different worker nodes in
the cluster. The number of tasks depends on how the data is partitioned.
• Task Execution: When a stage is ready to run, Spark creates tasks for each
partition of the data. For example, if you have 100 partitions, Spark will create
100 tasks to process them in parallel.
• Task Failures: If a task fails, Spark can retry the task on another node.
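The word-count sketch below ties jobs, stages, and tasks together; the input path and partition count
are hypothetical. One action produces one job, the shuffle in reduceByKey splits that job into two
stages, and each stage runs one task per partition.

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountSketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val lines  = sc.textFile("input.txt", 4)   // 4 partitions, so 4 tasks per stage (path is hypothetical)
    val counts = lines
      .flatMap(_.split("\\s+"))                // narrow transformation: stays in stage 1
      .map(word => (word, 1))                  // narrow transformation: stays in stage 1
      .reduceByKey(_ + _)                      // requires a shuffle: stage boundary, stage 2 begins here

    // collect() is an action: it triggers ONE job made of TWO stages,
    // and each stage runs one task per partition of its data.
    counts.collect().foreach(println)

    spark.stop()
  }
}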
Resilient Distributed Datasets (RDDs) in Spark:
Resilient Distributed Datasets (RDDs) are the fundamental data structure in Spark,
designed for distributed computing. They represent an immutable distributed
collection of objects that can be processed in parallel across a cluster. The key
features of RDDs are:
1.Fault Tolerance:
RDDs can recover from failures by keeping track of their lineage, which is a record of
operations performed on the data. If a partition of an RDD is lost, Spark can
recompute it using the lineage.
2.Parallel Processing:
RDDs allow Spark to process data in parallel across multiple machines in a cluster.
Each partition of an RDD can be processed independently by a task.
3.Immutability:
RDDs are immutable, meaning once created, they cannot be changed. Any
transformation on an RDD results in the creation of a new RDD.
4.Lazy Evaluation:
Spark does not compute RDDs immediately. Instead, it builds a Directed Acyclic Graph
(DAG) of transformations and computes RDDs only when an action is called.
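A short Scala sketch of these RDD properties; the dataset and partition count are arbitrary example
values.

import org.apache.spark.sql.SparkSession

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("RddBasicsSketch").master("local[2]").getOrCreate().sparkContext

    val nums    = sc.parallelize(1 to 10, 2)   // an RDD split into 2 partitions (parallel processing)
    val squares = nums.map(n => n * n)         // transformation: returns a NEW RDD; nums is never modified (immutability)

    // Nothing has been computed yet (lazy evaluation); Spark has only recorded the lineage:
    println(squares.toDebugString)

    // reduce() is an action, so the DAG is executed now; a lost partition could be
    // recomputed from the lineage above (fault tolerance).
    println(squares.reduce(_ + _))
  }
}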
Anatomy of a Spark Job Run:
When a Spark job is executed, it goes through several stages:
Job Submission: A user submits a job by invoking an action on an RDD, like
.collect() or .saveAsTextFile(). The job is submitted to the SparkContext, which coordinates
the execution.
Job Division into Stages: Spark divides the job into stages based on operations
that require data shuffling. Each stage is further divided into tasks, and tasks are
assigned to worker nodes for execution.
Task Scheduling: The scheduler places tasks on available worker nodes. Spark
uses a task scheduling mechanism that distributes the tasks across the cluster for
parallel execution.
Execution: The tasks are executed on the worker nodes. Data may be shuffled
between nodes if necessary (for operations like join or groupBy).
Result Collection: After all tasks are executed, the final results are collected and
returned to the driver program, or written to storage like HDFS.
Spark on YARN:
YARN (Yet Another Resource Negotiator) is a resource management layer for Hadoop that
allows Spark to run on top of Hadoop clusters. Here’s how Spark runs on YARN:
1.Resource Manager:
YARN’s ResourceManager manages cluster resources (CPU, memory) and schedules tasks for
Spark jobs. The ResourceManager ensures that Spark applications get the resources they
need for execution.
2.Application Master:
Spark runs in YARN by using an ApplicationMaster, which is responsible for negotiating
resources from the ResourceManager and tracking the execution of the application.
3.Execution on Worker Nodes:
Worker nodes in the Hadoop cluster run the tasks for Spark jobs. These nodes execute the
individual tasks, perform the computations, and send results back to the ApplicationMaster.
4.Data Locality:
YARN allows Spark to schedule tasks based on data locality, meaning that Spark tries to run
tasks on nodes that have the data already, reducing the need for network transfer.
5.Resource Allocation:
YARN dynamically allocates resources for Spark applications, adjusting resources based on
workload requirements, which improves resource utilization and job performance.
Introduction to Scala:
Scala is a high-level programming language that combines object-oriented and functional programming features.
It is designed to be concise, elegant, and expressive. Scala runs on the Java Virtual Machine (JVM), which
means it is compatible with Java and can make use of existing Java libraries. Scala is statically typed, meaning
that types are checked at compile-time, but it also supports type inference to reduce verbosity.
Classes and Objects:
• Classes: A class in Scala is a blueprint for creating objects. It defines the properties (variables) and behaviors
(methods) that the objects of that class will have.
• Objects: An object in Scala is a singleton, i.e., a single instance created by the language itself. It is used to
define methods and variables that do not belong to any specific instance of a class. The object is initialized
automatically the first time it is used, so its members can be accessed without creating an instance.
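A small sketch of a class and a singleton object in Scala; the names and values are illustrative.

// A class is a blueprint for objects; an object is a single shared instance.
class Person(val name: String, var age: Int) {    // properties (fields)
  def greet(): String = s"Hi, I am $name and I am $age years old"   // behaviour (method)
}

object MathUtil {                                  // singleton object: no instance needed
  def square(x: Int): Int = x * x
}

object ClassesAndObjectsDemo {
  def main(args: Array[String]): Unit = {
    val p = new Person("Asha", 21)   // create an instance of the class
    p.age = 22                       // a var field can be updated
    println(p.greet())
    println(MathUtil.square(5))      // call the object's method directly
  }
}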
Basic Types and Operators:
• Basic Data Types: Scala supports a range of basic types such as integers, floating-point numbers, characters,
and boolean values. Examples of basic types include:
• Int (Integer numbers)
• Double (Floating-point numbers)
• Char (Single characters)
• Boolean (True/False)
• String (Text)
• Operators: Scala supports several types of operators like:
• Arithmetic operators (e.g., +, -, *, /)
• Comparison operators (e.g., ==, !=, >, <)
• Logical operators (e.g., &&, ||)
• Assignment operators (e.g., =, +=, -=)
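A few of these types and operators in a short Scala snippet; the values are arbitrary.

object TypesAndOperatorsDemo {
  def main(args: Array[String]): Unit = {
    val count: Int       = 42                             // integer
    val price: Double    = 9.99                           // floating point
    val initial: Char    = 'S'                            // single character
    val label: String    = "items"                        // text
    val inStock: Boolean = count > 10 && price < 20.0     // comparison + logical operators

    var total = count * 2                                 // arithmetic operator
    total += 5                                            // assignment operator
    println(s"$label: $total, initial = $initial, inStock = $inStock")
  }
}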
Built-in Control Structures:
• If-Else Statements: These are used to make decisions based on conditions. It
checks if a condition is true or false and executes the appropriate block of code.
• For Loop: The for loop is used to repeat a block of code a specific number of
times. It can be used with ranges or collections (like lists).
• While Loop: The while loop executes a block of code as long as a condition is
true.
• Match Expression: Similar to a switch statement in other languages, the match
expression in Scala is used to compare a value against different patterns and
execute corresponding code. It is a more powerful version of the switch
statement.
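The four control structures above in one short Scala sketch; the marks value is arbitrary.

object ControlStructuresDemo {
  def main(args: Array[String]): Unit = {
    val marks = 72

    // if-else: in Scala it is an expression and returns a value
    val result = if (marks >= 40) "pass" else "fail"

    // for loop over a range
    for (attempt <- 1 to 3) println(s"attempt $attempt: $result")

    // while loop: repeats as long as the condition is true
    var countdown = 3
    while (countdown > 0) { println(countdown); countdown -= 1 }

    // match expression: pattern matching, a richer switch
    val grade = marks match {
      case m if m >= 75 => "A"
      case m if m >= 60 => "B"
      case m if m >= 40 => "C"
      case _            => "F"
    }
    println(s"grade = $grade")
  }
}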
Functions and Closures in Scala:
• Functions: A function in Scala is a block of code that takes inputs (parameters),
performs a task, and returns a result. Functions can be defined with a specific
name and can be called anywhere in the program. Scala allows defining
functions with or without parameters. Scala also supports anonymous
functions, which are functions without a name, often used for short tasks.
• Closures: A closure is a function that can capture and carry its environment
with it. This means that the function can access variables from the scope in
which it was created, even after that scope has ended. Closures are useful
when you need to store a function along with the values it depends on.
• Example of Closures:
If a function is defined inside another function, the inner function can access
variables from the outer function, even if the outer function has finished
executing. This is what makes the inner function a closure.
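A small Scala sketch of a named function, an anonymous function, and a closure; the names and values
are illustrative.

object FunctionsAndClosuresDemo {
  def main(args: Array[String]): Unit = {
    // a named function with parameters and a return value
    def add(a: Int, b: Int): Int = a + b

    // an anonymous function (no name), useful for short tasks
    val double = (x: Int) => x * 2

    // a closure: the returned function captures the variable rate from the
    // scope where it was created and keeps using it afterwards
    def makeMultiplier(rate: Int): Int => Int = (x: Int) => x * rate

    val triple = makeMultiplier(3)   // makeMultiplier has finished executing here...
    println(add(2, 3))               // 5
    println(double(10))              // 20
    println(triple(7))               // 21 ...but triple still remembers rate = 3
  }
}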
Inheritance in Scala:
Inheritance is a fundamental concept of object-oriented programming, where a class
can inherit properties and behaviors from another class. In Scala, one class can
extend another class using the extends keyword.
The class that is inherited from is called the superclass (or base class), and the class
that inherits is called the subclass (or derived class).
• Superclass: The class whose properties and methods are inherited by another class.
• Subclass: The class that inherits the properties and methods from another class.
In Scala, a class can extend only one class, which is called single inheritance. However,
Scala supports multiple traits, which allows a class to mix in multiple behaviors.
• Traits: A trait is similar to an interface in other programming languages but can also
contain method implementations. Traits are used to add behavior to classes. A class
can extend multiple traits in Scala.
Example:
If you have a superclass called Animal with properties like name and methods like
makeSound(), a subclass like Dog can extend Animal, inheriting those properties and
methods, and then possibly adding new behavior specific to Dog.
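The Animal/Dog example above, sketched in Scala with a trait mixed in; the method bodies are
illustrative.

class Animal(val name: String) {                 // superclass (base class)
  def makeSound(): String = "some sound"
}

trait Pet {                                      // trait: reusable behaviour with an implementation
  def play(): String = "playing fetch"
}

class Dog(name: String) extends Animal(name) with Pet {   // single inheritance + a mixed-in trait
  override def makeSound(): String = "Woof!"              // override inherited behaviour
  def fetch(): String = s"$name fetches the ball"         // new behaviour specific to Dog
}

object InheritanceDemo {
  def main(args: Array[String]): Unit = {
    val d = new Dog("Rex")
    println(d.makeSound())   // Woof!
    println(d.play())        // behaviour from the trait
    println(d.fetch())
  }
}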