BDA Unit 2
INTRODUCTION TO HADOOP
Syllabus:
Introduction to Hadoop: Introduction, Hadoop and its Ecosystem, Hadoop Distributed File
System, MapReduce Framework and Programming Model, Hadoop Yarn, Hadoop Ecosystem
Tools
Introduction to Apache Spark: The genesis of Spark, Hadoop at Yahoo and Spark early years,
What is Apache Spark, Unified Analytics, Apache Spark’s Distributed Execution, Spark
Application and Spark session, Spark Jobs, Spark stages, Spark tasks, Transformation, Actions
and Lazy Evaluation, Narrow and wide transformation, The Spark UI, Your first Standalone
application.
Reference: Text book-2, edureka, simplilearn
In essence, the ability to design, develop, and implement a big data application is directly
dependent on an awareness of the architecture of the underlying computing platform, both
from hardware and more importantly from a software perspective.
One commonality among the different appliances and frameworks is the adaptation of tools to
leverage the combination of collections of four key computing resources:
1. Processing capability:
– Often referred to as a CPU, processor, or node.
– Generally speaking, modern processing nodes often incorporate multiple cores
that are individual CPUs that share the node’s memory and are managed and
scheduled together, allowing multiple tasks to be run simultaneously; this is
known as multithreading.
2. Memory:
– This holds the data that the processing node is currently working on.
– Most single-node machines have a limit on the amount of memory.
3. Storage:
– Provides persistence of data—the place where datasets are loaded, and from
which the data is loaded into memory to be processed.
4. Network:
– This provides the “pipes” through which datasets are exchanged between
different processing and storage nodes.
A General Overview of High-Performance Architecture
• Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
• Specialty appliances may differ in the specifics of the configurations, as do software
appliances.
• However, the general architecture distinguishes the management of computing resources and
the management of the data across the network of storage nodes, as is seen in Figure below.
• In this configuration, a master job manager oversees the pool of processing nodes, assigns
tasks, and monitors the activity.
• At the same time, a storage manager oversees the data storage pool and distributes datasets
across the collection of storage resources.
• While there is no a priori requirement that data and processing tasks be collocated, it is beneficial from a performance perspective to ensure that the threads process data that is local, or close by, to minimize the cost of data access latency.
• To get a better understanding of the layering and interactions within a big data platform, we
will examine the Apache Hadoop stack.
Introduction to Hadoop
Why Hadoop?
b. Scalability
Hadoop also solves the scaling problem.
It focuses on horizontal scaling rather than vertical scaling: you can add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes.
This enhances performance dramatically.
c. Storing a variety of data
HDFS solves this problem. HDFS can store all kinds of data (structured, semi-structured, or unstructured).
It also follows the write-once, read-many model.
Because of this, you can write any kind of data once and read it multiple times to find insights.
d. Data Processing Speed
This is a major problem of big data.
To solve it, Hadoop moves the computation to the data instead of moving the data to the computation.
This principle is called data locality: the computational logic is sent to the cluster nodes (servers) that hold the data.
HDFS Architecture:
HDFS follows the master-slave architecture and it has the following elements.
o NameNode(Master)
o DataNode(Slave)
NameNode:
The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
The NameNode mainly stores the metadata, i.e. the data about the data. Metadata includes the transaction logs that keep track of user activity in the Hadoop cluster.
Metadata also includes file names, sizes, and location information (block numbers, block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode:
DataNodes work as slaves. DataNodes are mainly used for storing data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advisable for DataNodes to have a high storage capacity to hold a large number of file blocks.
Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128MB+128MB+128MB+16MB = 400MB. Hadoop is mainly configured for storing large-scale data (petabytes), which is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays block sizes of 128 MB to 256 MB are typically used in Hadoop.
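As a quick check of this arithmetic, the sketch below computes how a file of a given size would be split into HDFS blocks (a minimal sketch, assuming a 128 MB block size as in the example above; the sizes are illustrative):

# Minimal sketch: splitting a file into fixed-size HDFS blocks.
BLOCK_SIZE_MB = 128  # block size used in the example above

def split_into_blocks(file_size_mb):
    """Return the list of block sizes (in MB) for a file of the given size."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    remainder = file_size_mb % BLOCK_SIZE_MB
    blocks = [BLOCK_SIZE_MB] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(400))  # [128, 128, 128, 16] -> four blocks for a 400 MB file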
Replication in HDFS
Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its replication factor. As we saw with file blocks, HDFS stores data as blocks; at the same time, Hadoop is also configured to make copies of those file blocks.
By default, the replication factor in Hadoop is set to 3, which can be configured, meaning you can change it manually as per your requirements. In the example above we made 5 file blocks, so with 3 replicas (copies) of each file block, a total of 5 × 3 = 15 blocks are stored for backup purposes.
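The same arithmetic as a tiny sketch (the block count and the default replication factor are taken from the example above):

num_file_blocks = 5       # file blocks in the example above
replication_factor = 3    # HDFS default replication factor

# Total physical blocks stored = logical blocks x replication factor.
print(num_file_blocks * replication_factor)  # 15 blocks kept across the cluster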
This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that makes copies of file blocks for backup purposes; this is known as fault tolerance.
If DataNode 1 fails, the blocks 1, 2, and 4 present on DataNode 1 are still available to the user from other DataNodes (2 and 3 for block 1), (4 and 5 for block 2), and (2 and 5 for block 4).
b) MapReduce
• MapReduce is the data processing layer of Hadoop. It processes large structured and
unstructured data stored in HDFS. MapReduce is a Hadoop framework used for writing
applications that can process vast amounts of data on large clusters. It can also be called a
programming model in which we can process large datasets across computer clusters.
• MapReduce also processes a huge amount of data in parallel. It does this by dividing the job
(submitted job) into a set of independent tasks (sub-job).
• MapReduce works by breaking the processing into phases: Map (Splits & Mapping) and
Reduce (Shuffling, Reducing).
• Map: It is the first phase of processing, where we specify all the complex logic code.
• Reduce: It is the second phase of processing. Here we specify light-weight processing like aggregation/summation.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed. So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs. The output of a Mapper or map job (key-
value pairs) is input to the Reducer.
• The reducer receives the key-value pair from multiple map jobs. Then, the reducer aggregates
those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or
key-value pairs which is the final output.
The data goes through the following phases of MapReduce in Big Data
Input Splits:
An input to a MapReduce job in Big Data is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (input splits are described above) and prepare a list in the form of <word, frequency>
Shuffling
This phase consumes the output of Mapping phase. Its task is to consolidate the relevant
records from Mapping phase output. In our example, the same words are clubbed together
along with their respective frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from Shuffling phase and returns a single output value. In short, this phase summarizes
the complete dataset.
In our example, this phase aggregates the values from Shuffling phase i.e., calculates total
occurrences of each word.
First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
Then, we tokenize the words in each of the mappers and give a value (1) to each of the
tokens or words. The rationale behind giving a value equal to 1 is that every word, in itself,
will occur once.
Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs –
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
Now, each Reducer counts the values which are present in that list of values. As shown in
the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the
number of ones in the very list and gives the final output as – Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
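The same word-count flow can be sketched in Python as a pair of Hadoop Streaming scripts (a minimal sketch; the file names mapper.py and reducer.py are illustrative). Hadoop Streaming feeds each input split to the mapper on standard input and feeds the shuffled, sorted mapper output to the reducer, mirroring the Map, Shuffle, and Reduce phases described above.

# mapper.py -- Map phase: emit a <word, 1> pair for every word in the split
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce phase: input arrives sorted by key, so equal words are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # e.g. "Bear\t2"
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These scripts would typically be submitted through the Hadoop Streaming jar (passed as the mapper and reducer), but they can also be tested locally by piping a text file through mapper.py, sort, and reducer.py.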
Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
o In MapReduce, we are dividing the job among multiple nodes and each node works
with a part of the job simultaneously. So, MapReduce is based on Divide and Conquer
paradigm which helps us to process the data using different machines. As the data is
processed by multiple machines instead of a single machine in parallel, the time taken
to process the data gets reduced by a tremendous amount.
2. Data Locality:
o Instead of moving data to the processing unit, we are moving the processing unit to the
data in the MapReduce Framework.
o This allows us to have the following advantages:
It is very cost-effective to move the processing unit to the data.
The processing time is reduced as all the nodes are working with their part of
the data in parallel.
Every node gets a part of the data to process and therefore, there is no chance
of a node getting overburdened.
Disadvantages of MapReduce
c) YARN
Why YARN?
In Hadoop version 1.0 which is also referred to as MRV1 (MapReduce Version 1), MapReduce
performed both processing and resource management functions. It consisted of a Job Tracker
which was the single master. The Job Tracker allocated the resources, performed scheduling
and monitored the processing jobs. It assigned map and reduce tasks on a number of
subordinate processes called the Task Trackers. The Task Trackers periodically reported
their progress to the Job Tracker.
This design resulted in a scalability bottleneck due to the single Job Tracker. Also, the Hadoop framework was limited to the MapReduce processing paradigm only.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over
the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop
the ability to run non-MapReduce jobs within the Hadoop framework.
Yarn architecture
Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your
processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components as shown in figure below.
Yarn components:
1. Client
2. Resource Manager
3. Node Manager
4. Application Master
5. Container
3. Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the
node.
Its primary goal is to manage application containers assigned to it by the resource
manager.
It keeps up-to-date with the Resource Manager.
4. Application Master
An application is a single job submitted to the framework. Each such application has a
unique Application Master associated with it which is a framework specific entity.
It is the process that coordinates an application’s execution in the cluster and also
manages faults.
Its task is to negotiate resources from the Resource Manager and work with the Node
Manager to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the Resource
Manager, tracking their status and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health
and to update the record of its resource demands.
5. Container
It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
YARN containers are managed by a container launch context. This record contains a map
of environment variables, dependencies stored in a remotely accessible storage, security
tokens, payload for Node Manager services and the command necessary to create the
process.
It grants rights to an application to use a specific amount of resources (memory, CPU etc.)
on a specific host.
HBase
• HBase is another example of a nonrelational data management environment that distributes
massive datasets over the underlying Hadoop framework.
• HBase is derived from Google’s BigTable and is a column-oriented data layout that, when
layered on top of Hadoop, provides a fault-tolerant method for storing and manipulating large
data tables.
• HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
• HBase is not a relational database, and it does not support SQL queries.
• There are some basic operations in HBase (a Python sketch follows this list):
– Get (which accesses a specific row in the table),
– Put (which stores or updates a row in the table),
– Scan (which iterates over a collection of rows in the table), and
– Delete (which removes a row from the table).
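A minimal sketch of these operations from Python using the third-party happybase client (an assumption; it talks to HBase through the Thrift gateway, and the table name, column family, and row keys below are made up for illustration):

import happybase  # third-party HBase client (assumed installed); connects to the HBase Thrift server

connection = happybase.Connection('localhost')     # Thrift gateway host (illustrative)
table = connection.table('users')                  # hypothetical table with column family 'info'

table.put(b'row1', {b'info:name': b'Alice'})       # Put: store or update a row
print(table.row(b'row1'))                          # Get: access a specific row
for key, data in table.scan(row_prefix=b'row'):    # Scan: iterate over a collection of rows
    print(key, data)
table.delete(b'row1')                              # Delete: remove a row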
Apache Hive
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
• With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL datatypes are supported by Hive, making query processing easier.
• Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
• The JDBC and ODBC drivers establish the data storage permissions and connection, whereas the Hive command line helps in the processing of queries.
Features of Hive:
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
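As a hedged illustration, HQL can be submitted from Python through the third-party PyHive package (an assumption; the HiveServer2 host/port, database, and the sales table are made up for illustration):

from pyhive import hive  # third-party package (assumed installed); connects to HiveServer2

conn = hive.Connection(host='localhost', port=10000, database='default')  # illustrative defaults
cursor = conn.cursor()

# HQL reads like SQL; the table and columns here are hypothetical.
cursor.execute("SELECT state, COUNT(*) AS cnt FROM sales GROUP BY state")
for row in cursor.fetchall():
    print(row)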
Apache Pig
Pig was developed by Yahoo. It works with the Pig Latin language, a query-based language similar to SQL.
It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
Features of Pig:
1. Rich set of operators - It provides many operators to perform operations like join, sort, filter, etc.
2. Ease of programming - Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
3. Self-optimization - The tasks in Apache Pig are automatically converted into optimized MapReduce jobs, so programmers need to focus only on the semantics of the language.
4. Extensibility - Using the existing operators, users can develop their own functions to read, process, and write data.
5. UDFs - Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts.
6. Handles all kinds of data - Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
Apache Mahout
Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:
Recommendation
Classification
Clustering
Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became a
top level project of Apache.
Features of Mahout
The primitive features of Apache Mahout are listed below.
The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
Mahout lets applications analyze large sets of data effectively and quickly.
It includes several MapReduce-enabled clustering implementations such as k-means and fuzzy k-means.
Applications of Mahout
Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
Foursquare helps you in finding out places, food, and entertainment available in a
particular area. It uses the recommender engine of Mahout.
Twitter uses Mahout for user interest modelling.
Yahoo! uses Mahout for pattern mining.
Apache Sqoop
Sqoop was originally developed by Cloudera. Later it was further developed and maintained by Apache and termed Apache Sqoop. In April 2012, the Sqoop project was promoted to an Apache top-level project.
Apache Sqoop is an open source tool in the Hadoop ecosystem. It is mainly designed to transfer huge data sets between Hadoop and external data stores like relational databases and enterprise data warehouses.
The main functions of Apache sqoop are,
1. Import data
2. Export data
It imports data from relational databases (RDBMS) like MySQL, Oracle, PostgreSQL, and DB2 into Hadoop storage such as HDFS, Hive, and HBase.
It exports data from the Hadoop file system back to relational databases.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Features of Flume
Some of the notable features of Flume are as follows −
Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase)
efficiently.
Using Flume, we can get the data from multiple servers immediately into Hadoop.
Along with the log files, Flume is also used to import huge volumes of event data produced
by social networking sites like Facebook and Twitter, and e-commerce websites like
Amazon and Flipkart.
Flume supports a large set of sources and destinations types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
Flume can be scaled horizontally.
Apache Zookeeper
Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques. ZooKeeper is itself
a distributed application providing services for writing a distributed application.
The common services provided by ZooKeeper are as follows (a Python sketch follows this list) −
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for
nodes.
Configuration management − Latest and up-to-date configuration information of the system
for a joining node.
Cluster management − Joining / leaving of a node in a cluster and node status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while modifying it. This mechanism helps in automatic failure recovery when connecting to other distributed applications like Apache HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.
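A minimal sketch of two of these services (configuration management and locking) using the third-party kazoo Python client (an assumption; the ensemble address, znode paths, and data are made up for illustration):

from kazoo.client import KazooClient  # third-party ZooKeeper client (assumed installed)

zk = KazooClient(hosts='127.0.0.1:2181')   # ZooKeeper ensemble address (illustrative)
zk.start()

# Configuration management: keep shared configuration in a znode that joining nodes can read.
zk.ensure_path('/app/config')
zk.set('/app/config', b'replication=3')
data, stat = zk.get('/app/config')
print(data)

# Locking and synchronization: acquire a distributed lock before modifying shared data.
lock = zk.Lock('/app/lockpath', 'worker-1')
with lock:
    pass  # critical section protected by the lock

zk.stop()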
Apache Oozie
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel with each other.
One of the main advantages of Oozie is that it is tightly integrated with Hadoop stack
supporting various Hadoop jobs like Hive, Pig, Sqoop as well as system-specific jobs like Java
and Shell.
Oozie is an Open Source Java Web-Application available under Apache license 2.0. It is
responsible for triggering the workflow actions, which in turn uses the Hadoop execution
engine to actually execute the task.
Following three types of jobs are common in Oozie −
– Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs) to
specify a sequence of actions to be executed.
– Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data
availability.
– Oozie Bundle − These can be referred to as a package of multiple coordinator and
workflow jobs.
Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation.
Researchers at UC Berkeley who had previously worked on Hadoop MapReduce took on the
challenge (make Hadoop and MR simpler and faster) with a project they called Spark. They
acknowledged that MR was inefficient (or intractable) for interactive or iterative computing
jobs and a complex framework to learn. So from the onset they embraced the idea of making
Spark simpler, faster, and easier.
Apache Spark was initially developed in 2009 by Matei Zaharia in UC Berkeley's AMPLab. The first users of Spark were groups inside UC Berkeley, including machine learning researchers, who used Spark to monitor and predict traffic congestion in the San Francisco Bay Area.
Spark was open sourced in 2010 under a BSD license.
Spark became a project of the Apache Software Foundation in 2013 and is now one of the biggest projects of the Apache foundation.
• Apache Spark is a unified engine designed for large-scale distributed data processing, on
premises in data centers or in the cloud.
• Spark provides in-memory storage for intermediate computations, making it much faster than
Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning
(MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming)
for interacting with real-time data, and graph processing (GraphX).
Why Spark?
Apache Spark was developed to overcome the limitations of Hadoop MapReduce cluster
computing paradigm. Some of the drawbacks of Hadoop MapReduce are:
It supports only Java for application building.
Since most of the framework is written in Java, there are some security concerns; Java is heavily exploited by cybercriminals, and this may result in security breaches.
It is suited only for batch processing and does not support stream processing (real-time processing).
Hadoop MapReduce uses disk-based processing.
Apache Spark has many features which make it a great choice as a big data processing engine.
Many of these features establish the advantages of Apache Spark over other Big Data processing
engines. Let us look into details of some of the main features:
1. Speed
Spark has pursued the goal of speed in several ways.
• First, its internal implementation benefits immensely from the hardware industry’s
recent huge strides in improving the price and performance of CPUs and memory. Today’s
commodity servers come cheap, with hundreds of gigabytes of memory, multiple cores,
and the underlying Unix-based operating system taking advantage of efficient
multithreading and parallel processing. The framework is optimized to take advantage of
all of these factors.
• Second, Spark builds its query computations as a directed acyclic graph (DAG); its DAG
scheduler and query optimizer construct an efficient computational graph that can usually
be decomposed into tasks that are executed in parallel across workers on the cluster.
• Third, its physical execution engine, Tungsten, uses whole-stage code generation to
generate compact code for execution.
2. Ease of use
• Spark achieves simplicity by providing a fundamental abstraction of a simple logical data
structure called a Resilient Distributed Dataset (RDD) upon which all other higher-level
structured data abstractions, such as DataFrames and Datasets, are constructed.
• By providing a set of transformations and actions as operations, Spark offers a simple
programming model that you can use to build big data applications in familiar languages.
3. Modularity
• Spark operations can be applied across many types of workloads and expressed in any of
the supported programming languages: Scala, Java, Python, SQL, and R.
• Spark offers unified libraries with well-documented APIs that include the following
modules as core components: Spark SQL, Spark Structured Streaming, Spark MLlib, and
GraphX, combining all the workloads running under one engine.
• You can write a single Spark application that can do it all—no need for distinct engines for
disparate workloads, no need to learn separate APIs.
• With Spark, you get a unified processing engine for your workloads.
4. Extensibility
• Spark focuses on its fast, parallel computation engine rather than on storage.
• Unlike Apache Hadoop, which included both storage and compute, Spark decouples the
two. That means you can use Spark to read data stored in myriad sources—Apache
Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and more—
and process it all in memory.
• Spark’s DataFrameReaders and DataFrameWriters can also be extended to read data from
other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its
logical data abstraction, on which it can operate.
• The community of Spark developers maintains a list of third-party Spark packages as part
of the growing ecosystem (see Figure above).
• This rich ecosystem of packages includes Spark connectors for a variety of external data
sources, performance monitors, and more.
• Spark offers four distinct components as libraries for diverse workloads: Spark SQL,
Spark MLlib, Spark Structured Streaming, and GraphX.
• Each of these components is separate from Spark’s core fault-tolerant engine, in that
you use APIs to write your Spark application and Spark converts this into a DAG that is
executed by the core engine.
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems etc.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are
Spark’s main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes that can be
manipulated in parallel. RDD offers two types of operations: Transformation, Action.
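A minimal PySpark sketch of the RDD abstraction and its two kinds of operations (assuming a local SparkSession; the data is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])   # an RDD: items distributed across the local cores
doubled = rdd.map(lambda x: x * 2)      # transformation: returns a new RDD, nothing runs yet
print(doubled.collect())                # action: triggers execution -> [2, 4, 6, 8, 10]

spark.stop()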
Spark SQL
Spark SQL is Spark’s package for working with structured data.
We can read data stored in an RDBMS table or from file formats with structured data (CSV,
text, JSON, Avro, ORC, Parquet, etc.) and then construct permanent or temporary tables in
Spark.
Also, when using Spark’s Structured APIs in Java, Python, Scala, or R, you can combine SQL-
like queries to query the data just read into a Spark DataFrame.
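For instance, a small sketch (the file path, header option, and the people view with its columns are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a structured CSV file into a DataFrame.
df = spark.read.option("header", "true").csv("/data/people.csv")

# Register a temporary view and query it with SQL-like syntax.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()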
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include log files generated by production web servers, or queues of
messages containing status updates posted by users of a web service.
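A minimal sketch of the classic streaming word count over a socket source (it assumes text is being fed to localhost:9999, for example with nc -lk 9999; purely illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a live stream of lines from a socket (host/port are illustrative).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()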
Spark MLlib
Spark comes with a library containing common machine learning (ML) functionality, called
MLlib.
MLlib provides multiple types of machine learning algorithms, including classification,
regression, clustering, and collaborative filtering, as well as supporting functionality such as
model evaluation and data import etc.
All of these methods are designed to scale out across a cluster.
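A minimal sketch of fitting one such algorithm (k-means clustering) with the DataFrame-based pyspark.ml API; the toy data points are made up:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("MLlibKMeans").getOrCreate()

# Tiny made-up dataset: each row holds a feature vector.
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=2, seed=1)     # group the points into two clusters
model = kmeans.fit(df)           # training scales out across the cluster
print(model.clusterCenters())

spark.stop()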
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations.
It offers the standard graph algorithms for analysis, connections, and traversals, contributed by users in the community; the available algorithms include PageRank, Connected Components, and Triangle Counting.
• Spark is a distributed data processing engine with its components working collaboratively on
a cluster of machines.
• We shall understand how all the components of Spark’s distributed architecture work
together and communicate, and what deployment modes are available.
• The components of the architecture are shown in Figure 1-4.
• At a high level in the Spark architecture, a Spark application consists of a driver program
that is responsible for orchestrating parallel operations on the Spark cluster.
• The driver accesses the distributed components in the cluster—the Spark executors and
cluster manager—through a SparkSession.
Spark driver
• As the part of the Spark application responsible for instantiating a SparkSession, the Spark
driver has multiple roles:
– it communicates with the cluster manager;
– it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s
executors(JVMs); and
– it transforms all the Spark operations into DAG computations, schedules them, and
distributes their execution as tasks across the Spark executors.
– Once the resources are allocated, it communicates directly with the executors.
Step-1: Communicating with cluster manager to allocate resources
Step-2: Once the resources are allocated, it communicates directly with the executors
Spark Session
• Spark Session is a simplified entry point into Spark application.
• Spark Session is introduced in Spark 2.x.
• We can think spark session as a data structure where the driver maintains all the
information including the executor location and their status.
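Creating one explicitly in an application is short (a minimal sketch; the application name and the local master are illustrative, and on a real cluster the master is usually supplied when the job is submitted):

from pyspark.sql import SparkSession

# Build (or reuse) the single entry point to Spark functionality.
spark = (SparkSession
         .builder
         .appName("MyFirstApp")    # illustrative application name
         .master("local[*]")       # run locally on all cores; omit/override on a real cluster
         .getOrCreate())

print(spark.version)
spark.stop()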
Cluster manager
• The cluster manager is responsible for managing and allocating resources for the cluster of
nodes on which your Spark application runs.
• Currently, Spark supports four cluster managers: the built-in standalone cluster manager,
Apache Hadoop YARN, Apache Mesos, and Kubernetes.
Standalone – a simple cluster manager included with Spark that makes it easy to set up a
cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service
applications.
Apache Hadoop YARN – the resource manager in Hadoop 2.
Kubernetes – an open-source system for automating deployment, scaling, and management
of containerized applications.
Spark executor
Spark executors are the processes that perform the tasks assigned by the Spark driver.
Executors have one core responsibility: take the tasks assigned by the driver, run them, and
report back their state (success or failure) and results.
Each Spark Application has its own separate executor processes.
Deployment modes
An attractive feature of Spark is its support for myriad deployment modes, enabling Spark to
run in different configurations and environments.
The available deployment modes are: Local, Standalone, YARN (client), YARN (cluster), and Kubernetes.
Actual physical data is distributed across storage as partitions residing in either HDFS or
cloud storage (see Figure 1-5).
While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction—as a DataFrame in memory.
Though this is not always possible, each Spark executor is preferably allocated a task that
requires it to read the partition closest to it in the network, observing data locality.
To understand how the code is transformed and executed as tasks across the Spark executors, you should be familiar with the key concepts of a Spark application. Some important terms:
Application- It is a user program built on Spark using its APIs. It consists of a driver program and
executors on the cluster.
SparkSession- It is an object that provides a point of entry to interact with underlying Spark
functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark
driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession
object yourself.
Job-A parallel computation consisting of multiple tasks that gets spawned in response to a Spark
action (e.g., save(), collect()).
Stage-Each job gets divided into smaller sets of tasks called stages that depend on each other.
Task- A single unit of work or execution that will be sent to a Spark executor.
Spark Jobs:
During interactive sessions with Spark shells, the driver converts your Spark application into
one or more Spark jobs (Figure 2-3).
It then transforms each job into a DAG.
This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or
multiple Spark stages.
A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs.
In a Spark DAG, every edge is directed from an earlier point to a later one in the sequence. When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks.
Spark Stages:
As part of the DAG nodes, stages are created based on what operations can be performed
serially or in parallel (Figure 2-4).
Not all Spark operations can happen in a single stage, so they may be divided into multiple
stages.
Often stages are delineated on the operator’s computation boundaries, where they dictate
data transfer among Spark executors.
Spark Tasks:
Each stage is comprised of Spark tasks (a unit of execution), which are then federated across
each Spark executor; each task maps to a single core and works on a single partition of data
(Figure 2-5).
As such, an executor with 16 cores can have 16 or more tasks working on 16 or more
partitions in parallel, making the execution of Spark’s tasks exceedingly parallel.
Types of operations
Spark operations on distributed data can be classified into two types: transformations and
actions.
Transformations, as the name suggests, transform a Spark DataFrame into a new
DataFrame without altering the original data, giving it the property of immutability. Put
another way, an operation such as select( ) or filter( ) will not change the original
DataFrame; instead, it will return the transformed results of the operation as a new
DataFrame.
Actions are operations, also applied to a Spark DataFrame, that instruct Spark to perform the computation and send the result back to the driver; collect() is an example of an action.
Transformations-Lazy Evaluation:
Transformations are lazy in nature, i.e., they get executed only when we call an action.
All transformations are evaluated lazily. That is, their results are not computed immediately,
but they are recorded or remembered as a lineage. A recorded lineage allows Spark, at a later
time in its execution plan, to rearrange certain transformations, coalesce them, or optimize
transformations into stages for more efficient execution.
Lazy evaluation is Spark’s strategy for delaying execution until an action is invoked or data is
“touched” (read from or written to disk). An action triggers the lazy evaluation of all the
recorded transformations.
In Figure 2-6, all transformations T are recorded until the action A is invoked. Each
transformation T produces a new DataFrame.
While lazy evaluation allows Spark to optimize your queries by peeking into your chained
transformations, lineage and data immutability provide fault tolerance.
1. Narrow transformation – In a narrow transformation, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD. Narrow transformations are the result of operations such as map() and filter().
2. Wide transformation – In a wide transformation, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().
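A small PySpark sketch tying these ideas together (the DataFrame contents are made up): filter() is a narrow transformation, groupBy().count() involves a wide transformation (a shuffle), and nothing executes until the show() action is invoked.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

df = spark.createDataFrame(
    [("CA", "Red", 2), ("TX", "Blue", 5), ("CA", "Blue", 3)],
    ["State", "Color", "Count"])

filtered = df.filter(df.Count > 2)            # narrow transformation: no shuffle, recorded lazily
grouped = filtered.groupBy("State").count()   # wide transformation: needs a shuffle, still lazy

grouped.show()                                # action: triggers evaluation of the recorded lineage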
Let’s write a Spark program that reads a file with over 100,000 entries (where each row or line
has a <state, mnm_color, count>) and computes and aggregates the counts for each color and
state. These aggregated counts tell us the colors of M&Ms favored by students in each state. The
complete Python listing is provided below:
Let’s submit our first Spark job using the Python APIs (for an explanation of what the code does,
please read the inline comments in Example 2-1):
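A sketch of such a program follows (the column names State, Color, and Count, and the command-line argument for the data file, are assumptions consistent with the description above):

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mnmcount <mnm_dataset_file>", file=sys.stderr)
        sys.exit(-1)

    # Build a SparkSession for this application.
    spark = SparkSession.builder.appName("PythonMnMCount").getOrCreate()

    # Read the CSV file into a DataFrame, inferring the schema from the header row.
    mnm_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .load(sys.argv[1]))

    # Aggregate the counts for each state and color, ordered by the highest totals.
    count_mnm_df = (mnm_df
                    .select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .agg(count("Count").alias("Total"))
                    .orderBy("Total", ascending=False))

    count_mnm_df.show(n=60, truncate=False)
    print("Total Rows = %d" % count_mnm_df.count())

    spark.stop()

The script would typically be submitted to a cluster (or run locally) with spark-submit, passing the path to the data file as its single argument.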
QUESTIONS ON UNIT-2
1. Explain about the working of map reduce concept with a small example.
2. Explain the hadoop distributed file system architecture with a neat sketch.
3. Illustrate with example how files are stored in HDFS. Also, justify how HDFS is fault
tolerant.
4. Explain how map reduce jobs run on YARN.
5. Justify how Hadoop technology emerged as a solution to Big Data problems.
6. Explain in brief the core components of Hadoop ecosystem.
7. Illustrate with a word count example the working of map reduce concept.
8. Can MapReduce be used to solve any kind of computational problem? If not, explain the cases where MapReduce is not applicable. (Ans: MapReduce limitations)
9. How does the Hadoop MapReduce Data flow work for a word count program? Give an
example.
10. Explain the Yarn architecture with a neat sketch.
11. Compare batch processing and real time processing
12. Compare Hadoop and Spark
13. Explain the key features of Apache Spark.
14. Explain the different Cluster Managers in Apache Spark.
15. Explain the key features of Apache Spark.
16. Explain the various components in the Apache Spark ecosystem.
17. Explain the Apache Spark architecture.
18. Explain the Spark deployment modes.
19. Explain the following terms with respect to spark application
a. SparkApplication
b. SparkSession
c. Job
d. Stage
e. Task
20. Explain various operations on distributed data/RDD in spark.
21. Explain lazy evaluation, narrow transformation, and wide transformation.
22. Write a Spark program(in python) that reads a file with over 100,000 entries (where
each row or line has a <state, mnm_color, count>) and computes and aggregates the counts
for each color and state.