BDA Unit 2
INTRODUCTION TO HADOOP
Syllabus:
Introduction to Hadoop: Introduction, Hadoop and its Ecosystem, Hadoop Distributed File
System, MapReduce Framework and Programming Model, Hadoop Yarn, Hadoop Ecosystem
Tools
Introduction to Apache Spark: The genesis of Spark, Hadoop at Yahoo and Spark early years,
What is Apache Spark, Unified Analytics, Apache Spark’s Distributed Execution, Spark
Application and Spark session, Spark Jobs, Spark stages, Spark tasks, Transformation, Actions
and Lazy Evaluation, Narrow and wide transformation, The Spark UI, Your first Standalone
application.
Reference: Text book-2, edureka, simplilearn
In essence, the ability to design, develop, and implement a big data application is directly
dependent on an awareness of the architecture of the underlying computing platform, both
from hardware and more importantly from a software perspective.
One commonality among the different appliances and frameworks is the adaptation of tools to
leverage the combination of collections of four key computing resources:
1. Processing capability:
– Often referred to as a CPU, processor, or node.
– Generally speaking, modern processing nodes often incorporate multiple cores
that are individual CPUs that share the node’s memory and are managed and
scheduled together, allowing multiple tasks to be run simultaneously; this is
known as multithreading.
2. Memory:
– This holds the data that the processing node is currently working on.
– Most single-node machines have a limit on the amount of memory.
3. Storage:
– Provides persistence of data—the place where datasets are loaded, and from
which the data is loaded into memory to be processed.
4. Network:
– This provides the “pipes” through which datasets are exchanged between
different processing and storage nodes.
A General Overview of High-Performance Architecture
• Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
• Specialty appliances may differ in the specifics of the configurations, as do software
appliances.
• However, the general architecture distinguishes the management of computing resources and
the management of the data across the network of storage nodes, as is seen in Figure below.
• In this configuration, a master job manager oversees the pool of processing nodes, assigns
tasks, and monitors the activity.
• At the same time, a storage manager oversees the data storage pool and distributes datasets
across the collection of storage resources.
• While there is no a priori requirement that data and processing tasks be collocated, it is beneficial from a performance perspective to ensure that the threads process data that is local, or close by, to minimize the cost of data access latency.
• To get a better understanding of the layering and interactions within a big data platform, we
will examine the Apache Hadoop stack.
Introduction to Hadoop
Why Hadoop?
b. Scalability
Hadoop also solves the scaling problem.
It focuses on horizontal scaling rather than vertical scaling: you can add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes.
This enhances performance dramatically.
c. Storing a variety of data
HDFS solves this problem. HDFS can store all kinds of data (structured, semi-structured, or unstructured).
It also follows the write-once, read-many model.
Because of this, you can write any kind of data once and read it multiple times to find insights.
d. Data Processing Speed
This is a major problem of big data.
To solve it, Hadoop moves the computation to the data instead of moving the data to the computation.
This principle is called data locality: the computational logic is sent to the cluster nodes (servers) that hold the data.
HDFS Architecture:
HDFS follows the master-slave architecture and it has the following elements.
o NameNode(Master)
o DataNode(Slave)
NameNode:
The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
The NameNode mainly stores the metadata, i.e. the data about the data. Metadata includes the transaction logs that keep track of user activity in the Hadoop cluster.
Metadata also includes file names, sizes, and location information (block numbers, block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode:
DataNodes work as slaves. DataNodes are mainly used for storing data in a Hadoop cluster; the number of DataNodes can range from one to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advisable for DataNodes to have a high storage capacity to hold a large number of file blocks.
Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128MB+128MB+128MB+16MB = 400MB. Hadoop is mainly configured for storing large-scale data (petabytes), which is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays block sizes of 128 MB to 256 MB are typically used in Hadoop.
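As a quick check of this arithmetic, the sketch below computes how a file of a given size would be split into HDFS blocks (a minimal sketch, assuming a 128 MB block size as in the example above; the sizes are illustrative):

# Minimal sketch: splitting a file into fixed-size HDFS blocks.
BLOCK_SIZE_MB = 128  # block size used in the example above

def split_into_blocks(file_size_mb):
    """Return the list of block sizes (in MB) for a file of the given size."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    remainder = file_size_mb % BLOCK_SIZE_MB
    blocks = [BLOCK_SIZE_MB] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(400))  # [128, 128, 128, 16] -> four blocks for a 400 MB file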
Replication in HDFS
Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its replication factor. As we saw with file blocks, HDFS stores data as blocks; at the same time, Hadoop is also configured to make copies of those file blocks.
By default, the replication factor in Hadoop is set to 3, which can be configured, meaning you can change it manually as per your requirements. In the example above we made 5 file blocks, so with 3 replicas (copies) of each file block, a total of 5 × 3 = 15 blocks are stored for backup purposes.
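The same arithmetic as a tiny sketch (the block count and the default replication factor are taken from the example above):

num_file_blocks = 5       # file blocks in the example above
replication_factor = 3    # HDFS default replication factor

# Total physical blocks stored = logical blocks x replication factor.
print(num_file_blocks * replication_factor)  # 15 blocks kept across the cluster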
This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that makes copies of file blocks for backup purposes; this is known as fault tolerance.
If DataNode 1 fails, the blocks 1, 2, and 4 present on DataNode 1 are still available to the user from other DataNodes (2 and 3 for block 1), (4 and 5 for block 2), and (2 and 5 for block 4).
b) MapReduce
• MapReduce is the data processing layer of Hadoop. It processes large structured and
unstructured data stored in HDFS. MapReduce is a Hadoop framework used for writing
applications that can process vast amounts of data on large clusters. It can also be called a
programming model in which we can process large datasets across computer clusters.
• MapReduce also processes a huge amount of data in parallel. It does this by dividing the job
(submitted job) into a set of independent tasks (sub-job).
• MapReduce works by breaking the processing into phases: Map (Splits & Mapping) and
Reduce (Shuffling, Reducing).
• Map: It is the first phase of processing, where we specify all the complex logic code.
• Reduce: It is the second phase of processing. Here we specify light-weight processing like aggregation/summation.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed. So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs. The output of a Mapper or map job (key-
value pairs) is input to the Reducer.
• The reducer receives the key-value pair from multiple map jobs. Then, the reducer aggregates
those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or
key-value pairs which is the final output.
The data goes through the following phases of MapReduce in Big Data
Input Splits:
An input to a MapReduce job in Big Data is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (input splits are described above) and prepare a list in the form of <word, frequency>
Shuffling
This phase consumes the output of Mapping phase. Its task is to consolidate the relevant
records from Mapping phase output. In our example, the same words are clubbed together
along with their respective frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from Shuffling phase and returns a single output value. In short, this phase summarizes
the complete dataset.
In our example, this phase aggregates the values from Shuffling phase i.e., calculates total
occurrences of each word.
First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
Then, we tokenize the words in each of the mappers and give a value (1) to each of the
tokens or words. The rationale behind giving a value equal to 1 is that every word, in itself,
will occur once.
Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs –
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
Now, each Reducer counts the values which are present in that list of values. As shown in
the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the
number of ones in the very list and gives the final output as – Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
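The same word-count flow can be sketched in Python as a pair of Hadoop Streaming scripts (a minimal sketch; the file names mapper.py and reducer.py are illustrative). Hadoop Streaming feeds each input split to the mapper on standard input and feeds the shuffled, sorted mapper output to the reducer, mirroring the Map, Shuffle, and Reduce phases described above.

# mapper.py -- Map phase: emit a <word, 1> pair for every word in the split
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- Reduce phase: input arrives sorted by key, so equal words are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # e.g. "Bear\t2"
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These scripts would typically be submitted through the Hadoop Streaming jar (passed as the mapper and reducer), but they can also be tested locally by piping a text file through mapper.py, sort, and reducer.py.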
Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
o In MapReduce, we are dividing the job among multiple nodes and each node works
with a part of the job simultaneously. So, MapReduce is based on Divide and Conquer
paradigm which helps us to process the data using different machines. As the data is
processed by multiple machines instead of a single machine in parallel, the time taken
to process the data gets reduced by a tremendous amount.
2. Data Locality:
o Instead of moving data to the processing unit, we are moving the processing unit to the
data in the MapReduce Framework.
o This allows us to have the following advantages:
It is very cost-effective to move the processing unit to the data.
The processing time is reduced as all the nodes are working with their part of
the data in parallel.
Every node gets a part of the data to process and therefore, there is no chance
of a node getting overburdened.
Disadvantages of MapReduce
c) YARN
Why YARN?
In Hadoop version 1.0 which is also referred to as MRV1 (MapReduce Version 1), MapReduce
performed both processing and resource management functions. It consisted of a Job Tracker
which was the single master. The Job Tracker allocated the resources, performed scheduling
and monitored the processing jobs. It assigned map and reduce tasks on a number of
subordinate processes called the Task Trackers. The Task Trackers periodically reported
their progress to the Job Tracker.
This design resulted in a scalability bottleneck due to the single Job Tracker. Also, the Hadoop framework was limited to the MapReduce processing paradigm only.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over
the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop
the ability to run non-MapReduce jobs within the Hadoop framework.
Yarn architecture
Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your
processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components as shown in figure below.
Yarn components:
1. Client
2. Resource Manager
3. Node Manager
4. Application Master
5. Container
3. Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the
node.
Its primary goal is to manage application containers assigned to it by the resource
manager.
It keeps up-to-date with the Resource Manager.
4. Application Master
An application is a single job submitted to the framework. Each such application has a
unique Application Master associated with it which is a framework specific entity.
It is the process that coordinates an application’s execution in the cluster and also
manages faults.
Its task is to negotiate resources from the Resource Manager and work with the Node
Manager to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the Resource
Manager, tracking their status and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health
and to update the record of its resource demands.
5. Container
It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
YARN containers are managed by a container launch context. This record contains a map
of environment variables, dependencies stored in a remotely accessible storage, security
tokens, payload for Node Manager services and the command necessary to create the
process.
It grants rights to an application to use a specific amount of resources (memory, CPU etc.)
on a specific host.
HBase
• HBase is another example of a nonrelational data management environment that distributes
massive datasets over the underlying Hadoop framework.
• HBase is derived from Google’s BigTable and is a column-oriented data layout that, when
layered on top of Hadoop, provides a fault-tolerant method for storing and manipulating large
data tables.
• HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
• HBase is not a relational database, and it does not support SQL queries.
• There are some basic operations in HBase (a Python sketch follows this list):
– Get (which accesses a specific row in the table),
– Put (which stores or updates a row in the table),
– Scan (which iterates over a collection of rows in the table), and
– Delete (which removes a row from the table).
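A minimal sketch of these operations from Python using the third-party happybase client (an assumption; it talks to HBase through the Thrift gateway, and the table name, column family, and row keys below are made up for illustration):

import happybase  # third-party HBase client (assumed installed); connects to the HBase Thrift server

connection = happybase.Connection('localhost')     # Thrift gateway host (illustrative)
table = connection.table('users')                  # hypothetical table with column family 'info'

table.put(b'row1', {b'info:name': b'Alice'})       # Put: store or update a row
print(table.row(b'row1'))                          # Get: access a specific row
for key, data in table.scan(row_prefix=b'row'):    # Scan: iterate over a collection of rows
    print(key, data)
table.delete(b'row1')                              # Delete: remove a row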
Apache Hive
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
• With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL datatypes are supported by Hive, making query processing easier.
• Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
• The JDBC and ODBC drivers establish the data storage permissions and connection, whereas the Hive command line helps in the processing of queries.
Features of Hive:
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
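As a hedged illustration, HQL can be submitted from Python through the third-party PyHive package (an assumption; the HiveServer2 host/port, database, and the sales table are made up for illustration):

from pyhive import hive  # third-party package (assumed installed); connects to HiveServer2

conn = hive.Connection(host='localhost', port=10000, database='default')  # illustrative defaults
cursor = conn.cursor()

# HQL reads like SQL; the table and columns here are hypothetical.
cursor.execute("SELECT state, COUNT(*) AS cnt FROM sales GROUP BY state")
for row in cursor.fetchall():
    print(row)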
Apache Pig
Pig was developed by Yahoo. It works with the Pig Latin language, a query-based language similar to SQL.
It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
Features of Pig:
1. Rich set of operators - It provides many operators to perform operations like join, sort, filter, etc.
2. Ease of programming - Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
3. Self-optimization - The tasks in Apache Pig are automatically converted into optimized MapReduce jobs, so programmers need to focus only on the semantics of the language.
4. Extensibility - Using the existing operators, users can develop their own functions to read, process, and write data.
5. UDFs - Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts.
6. Handles all kinds of data - Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
Apache Mahout
Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:
Recommendation
Classification
Clustering
Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became a
top level project of Apache.
Features of Mahout
The primitive features of Apache Mahout are listed below.
The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.
Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
Mahout lets applications analyze large sets of data effectively and quickly.
It includes several MapReduce-enabled clustering implementations such as k-means and fuzzy k-means.
Applications of Mahout
Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
Foursquare helps you in finding out places, food, and entertainment available in a
particular area. It uses the recommender engine of Mahout.
Twitter uses Mahout for user interest modelling.
Yahoo! uses Mahout for pattern mining.
Apache Sqoop
Sqoop was originally developed by Cloudera. Later it was further developed and maintained by Apache and termed Apache Sqoop. In April 2012, the Sqoop project was promoted to an Apache top-level project.
Apache Sqoop is an open source tool in the Hadoop ecosystem. It is mainly designed to transfer huge data sets between Hadoop and external data stores like relational databases and enterprise data warehouses.
The main functions of Apache sqoop are,
1. Import data
2. Export data
It imports data from relational databases (RDBMS) like MySQL, Oracle, PostgreSQL, and DB2 into Hadoop storage such as HDFS, Hive, and HBase.
It exports data from the Hadoop file system back to relational databases.
Apache Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Features of Flume
Some of the notable features of Flume are as follows −
Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase)
efficiently.
Using Flume, we can get the data from multiple servers immediately into Hadoop.
Along with the log files, Flume is also used to import huge volumes of event data produced
by social networking sites like Facebook and Twitter, and e-commerce websites like
Amazon and Flipkart.
Flume supports a large set of sources and destinations types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
Flume can be scaled horizontally.
Apache Zookeeper
Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate between
themselves and maintain shared data with robust synchronization techniques. ZooKeeper is itself
a distributed application providing services for writing a distributed application.
The common services provided by ZooKeeper are as follows (a Python sketch follows this list) −
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for
nodes.
Configuration management − Latest and up-to-date configuration information of the system
for a joining node.
Cluster management − Joining / leaving of a node in a cluster and node status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while modifying it. This mechanism helps in automatic failure recovery when connecting to other distributed applications like Apache HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.
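A minimal sketch of two of these services (configuration management and locking) using the third-party kazoo Python client (an assumption; the ensemble address, znode paths, and data are made up for illustration):

from kazoo.client import KazooClient  # third-party ZooKeeper client (assumed installed)

zk = KazooClient(hosts='127.0.0.1:2181')   # ZooKeeper ensemble address (illustrative)
zk.start()

# Configuration management: keep shared configuration in a znode that joining nodes can read.
zk.ensure_path('/app/config')
zk.set('/app/config', b'replication=3')
data, stat = zk.get('/app/config')
print(data)

# Locking and synchronization: acquire a distributed lock before modifying shared data.
lock = zk.Lock('/app/lockpath', 'worker-1')
with lock:
    pass  # critical section protected by the lock

zk.stop()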
Apache Oozie
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel with each other.
One of the main advantages of Oozie is that it is tightly integrated with Hadoop stack
supporting various Hadoop jobs like Hive, Pig, Sqoop as well as system-specific jobs like Java
and Shell.
Oozie is an Open Source Java Web-Application available under Apache license 2.0. It is
responsible for triggering the workflow actions, which in turn uses the Hadoop execution
engine to actually execute the task.
Following three types of jobs are common in Oozie −
– Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs) to
specify a sequence of actions to be executed.
– Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data
availability.
– Oozie Bundle − These can be referred to as a package of multiple coordinator and
workflow jobs.
Apache Spark is an open-source cluster computing framework for real-time processing. It is one of the most successful projects in the Apache Software Foundation.
Researchers at UC Berkeley who had previously worked on Hadoop MapReduce took on the
challenge (make Hadoop and MR simpler and faster) with a project they called Spark. They
acknowledged that MR was inefficient (or intractable) for interactive or iterative computing
jobs and a complex framework to learn. So from the onset they embraced the idea of making
Spark simpler, faster, and easier.
Apache Spark was initially developed in 2009 by Matei Zaharia in UC Berkeley's AMPLab. The first users of Spark were groups inside UC Berkeley, including machine learning researchers, who used Spark to monitor and predict traffic congestion in the San Francisco Bay Area.
Spark was open sourced in 2010 under a BSD license.
Spark became a project of the Apache Software Foundation in 2013 and is now one of the biggest projects of the Apache foundation.
• Apache Spark is a unified engine designed for large-scale distributed data processing, on
premises in data centers or in the cloud.
• Spark provides in-memory storage for intermediate computations, making it much faster than
Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning
(MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming)
for interacting with real-time data, and graph processing (GraphX).
Why Spark?
Apache Spark was developed to overcome the limitations of Hadoop MapReduce cluster
computing paradigm. Some of the drawbacks of Hadoop MapReduce are:
It supports only Java for application building.
Since most of the framework is written in Java, there are some security concerns; Java is heavily exploited by cybercriminals, and this may result in security breaches.
It is suited only for batch processing and does not support stream processing (real-time processing).
Hadoop MapReduce uses disk-based processing.
Apache Spark has many features which make it a great choice as a big data processing engine.
Many of these features establish the advantages of Apache Spark over other Big Data processing
engines. Let us look into details of some of the main features:
1. Speed
Spark has pursued the goal of speed in several ways.
• First, its internal implementation benefits immensely from the hardware industry’s
recent huge strides in improving the price and performance of CPUs and memory. Today’s
commodity servers come cheap, with hundreds of gigabytes of memory, multiple cores,
and the underlying Unix-based operating system taking advantage of efficient
multithreading and parallel processing. The framework is optimized to take advantage of
all of these factors.
• Second, Spark builds its query computations as a directed acyclic graph (DAG); its DAG
scheduler and query optimizer construct an efficient computational graph that can usually
be decomposed into tasks that are executed in parallel across workers on the cluster.
• Third, its physical execution engine, Tungsten, uses whole-stage code generation to
generate compact code for execution.
2. Ease of use
• Spark achieves simplicity by providing a fundamental abstraction of a simple logical data
structure called a Resilient Distributed Dataset (RDD) upon which all other higher-level
structured data abstractions, such as DataFrames and Datasets, are constructed.
• By providing a set of transformations and actions as operations, Spark offers a simple
programming model that you can use to build big data applications in familiar languages.
3. Modularity
• Spark operations can be applied across many types of workloads and expressed in any of
the supported programming languages: Scala, Java, Python, SQL, and R.
• Spark offers unified libraries with well-documented APIs that include the following
modules as core components: Spark SQL, Spark Structured Streaming, Spark MLlib, and
GraphX, combining all the workloads running under one engine.
• You can write a single Spark application that can do it all—no need for distinct engines for
disparate workloads, no need to learn separate APIs.
• With Spark, you get a unified processing engine for your workloads.
4. Extensibility
• Spark focuses on its fast, parallel computation engine rather than on storage.
• Unlike Apache Hadoop, which included both storage and compute, Spark decouples the
two. That means you can use Spark to read data stored in myriad sources—Apache
Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and more—
and process it all in memory.
• Spark’s DataFrameReaders and DataFrameWriters can also be extended to read data from
other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its
logical data abstraction, on which it can operate.
• The community of Spark developers maintains a list of third-party Spark packages as part
of the growing ecosystem (see Figure above).
• This rich ecosystem of packages includes Spark connectors for a variety of external data
sources, performance monitors, and more.
• Spark offers four distinct components as libraries for diverse workloads: Spark SQL,
Spark MLlib, Spark Structured Streaming, and GraphX.
• Each of these components is separate from Spark’s core fault-tolerant engine, in that
you use APIs to write your Spark application and Spark converts this into a DAG that is
executed by the core engine.
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems etc.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are
Spark’s main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes that can be
manipulated in parallel. RDD offers two types of operations: Transformation, Action.
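A minimal PySpark sketch of the RDD abstraction and its two kinds of operations (assuming a local SparkSession; the data is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])   # an RDD: items distributed across the local cores
doubled = rdd.map(lambda x: x * 2)      # transformation: returns a new RDD, nothing runs yet
print(doubled.collect())                # action: triggers execution -> [2, 4, 6, 8, 10]

spark.stop()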
Spark SQL
Spark SQL is Spark’s package for working with structured data.
We can read data stored in an RDBMS table or from file formats with structured data (CSV,
text, JSON, Avro, ORC, Parquet, etc.) and then construct permanent or temporary tables in
Spark.
Also, when using Spark’s Structured APIs in Java, Python, Scala, or R, you can combine SQL-
like queries to query the data just read into a Spark DataFrame.
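For instance, a small sketch (the file path, header option, and the people view with its columns are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read a structured CSV file into a DataFrame.
df = spark.read.option("header", "true").csv("/data/people.csv")

# Register a temporary view and query it with SQL-like syntax.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()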
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include log files generated by production web servers, or queues of
messages containing status updates posted by users of a web service.
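A minimal sketch of the classic streaming word count over a socket source (it assumes text is being fed to localhost:9999, for example with nc -lk 9999; purely illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a live stream of lines from a socket (host/port are illustrative).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()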
Spark MLlib
Spark comes with a library containing common machine learning (ML) functionality, called
MLlib.
MLlib provides multiple types of machine learning algorithms, including classification,
regression, clustering, and collaborative filtering, as well as supporting functionality such as
model evaluation and data import etc.
All of these methods are designed to scale out across a cluster.
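A minimal sketch of fitting one such algorithm (k-means clustering) with the DataFrame-based pyspark.ml API; the toy data points are made up:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("MLlibKMeans").getOrCreate()

# Tiny made-up dataset: each row holds a feature vector.
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=2, seed=1)     # group the points into two clusters
model = kmeans.fit(df)           # training scales out across the cluster
print(model.clusterCenters())

spark.stop()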
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations.
It offers the standard graph algorithms for analysis, connections, and traversals, contributed by users in the community; the available algorithms include PageRank, Connected Components, and Triangle Counting.
• Spark is a distributed data processing engine with its components working collaboratively on
a cluster of machines.
• We shall understand how all the components of Spark’s distributed architecture work
together and communicate, and what deployment modes are available.
• The components of the architecture are shown in Figure 1-4.
• At a high level in the Spark architecture, a Spark application consists of a driver program
that is responsible for orchestrating parallel operations on the Spark cluster.
• The driver accesses the distributed components in the cluster—the Spark executors and
cluster manager—through a SparkSession.
Spark driver
• As the part of the Spark application responsible for instantiating a SparkSession, the Spark
driver has multiple roles:
– it communicates with the cluster manager;
– it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s
executors(JVMs); and
– it transforms all the Spark operations into DAG computations, schedules them, and
distributes their execution as tasks across the Spark executors.
– Once the resources are allocated, it communicates directly with the executors.
Step-1: Communicating with cluster manager to allocate resources
Step-2: Once the resources are allocated, it communicates directly with the executors
Spark Session
• Spark Session is a simplified entry point into Spark application.
• Spark Session is introduced in Spark 2.x.
• We can think spark session as a data structure where the driver maintains all the
information including the executor location and their status.
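Creating one explicitly in an application is short (a minimal sketch; the application name and the local master are illustrative, and on a real cluster the master is usually supplied when the job is submitted):

from pyspark.sql import SparkSession

# Build (or reuse) the single entry point to Spark functionality.
spark = (SparkSession
         .builder
         .appName("MyFirstApp")    # illustrative application name
         .master("local[*]")       # run locally on all cores; omit/override on a real cluster
         .getOrCreate())

print(spark.version)
spark.stop()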
Cluster manager
• The cluster manager is responsible for managing and allocating resources for the cluster of
nodes on which your Spark application runs.
• Currently, Spark supports four cluster managers: the built-in standalone cluster manager,
Apache Hadoop YARN, Apache Mesos, and Kubernetes.
Standalone – a simple cluster manager included with Spark that makes it easy to set up a
cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service
applications.
Apache Hadoop YARN – the resource manager in Hadoop 2.
Kubernetes – an open-source system for automating deployment, scaling, and management
of containerized applications.
Spark executor
Spark executors are the processes that perform the tasks assigned by the Spark driver.
Executors have one core responsibility: take the tasks assigned by the driver, run them, and
report back their state (success or failure) and results.
Each Spark Application has its own separate executor processes.
Deployment modes
An attractive feature of Spark is its support for myriad deployment modes, enabling Spark to
run in different configurations and environments.
The available deployment modes are: Local, Standalone, YARN (client), YARN (cluster), and Kubernetes.
Actual physical data is distributed across storage as partitions residing in either HDFS or
cloud storage (see Figure 1-5).
While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction—as a DataFrame in memory.
Though this is not always possible, each Spark executor is preferably allocated a task that
requires it to read the partition closest to it in the network, observing data locality.
To understand how the code is transformed and executed as tasks across the Spark executors, you should be familiar with the key concepts of a Spark application. Some important terms:
Application- It is a user program built on Spark using its APIs. It consists of a driver program and
executors on the cluster.
SparkSession- It is an object that provides a point of entry to interact with underlying Spark
functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark
driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession
object yourself.
Job-A parallel computation consisting of multiple tasks that gets spawned in response to a Spark
action (e.g., save(), collect()).
Stage-Each job gets divided into smaller sets of tasks called stages that depend on each other.
Task- A single unit of work or execution that will be sent to a Spark executor.
Spark Jobs:
During interactive sessions with Spark shells, the driver converts your Spark application into
one or more Spark jobs (Figure 2-3).
It then transforms each job into a DAG.
This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or
multiple Spark stages.
A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs.
In a Spark DAG, every edge is directed from an earlier point to a later one in the sequence. When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks.
Spark Stages:
As part of the DAG nodes, stages are created based on what operations can be performed
serially or in parallel (Figure 2-4).
Not all Spark operations can happen in a single stage, so they may be divided into multiple
stages.
Often stages are delineated on the operator’s computation boundaries, where they dictate
data transfer among Spark executors.
Spark Tasks:
Each stage is comprised of Spark tasks (a unit of execution), which are then federated across
each Spark executor; each task maps to a single core and works on a single partition of data
(Figure 2-5).
As such, an executor with 16 cores can have 16 or more tasks working on 16 or more
partitions in parallel, making the execution of Spark’s tasks exceedingly parallel.
Types of operations
Spark operations on distributed data can be classified into two types: transformations and
actions.
Transformations, as the name suggests, transform a Spark DataFrame into a new
DataFrame without altering the original data, giving it the property of immutability. Put
another way, an operation such as select( ) or filter( ) will not change the original
DataFrame; instead, it will return the transformed results of the operation as a new
DataFrame.
Actions are operations, also applied to a Spark DataFrame, that instruct Spark to perform the computation and send the result back to the driver; collect() is an example of an action.
Transformations-Lazy Evaluation:
Transformations are lazy in nature, i.e., they get executed only when we call an action.
All transformations are evaluated lazily. That is, their results are not computed immediately,
but they are recorded or remembered as a lineage. A recorded lineage allows Spark, at a later
time in its execution plan, to rearrange certain transformations, coalesce them, or optimize
transformations into stages for more efficient execution.
Lazy evaluation is Spark’s strategy for delaying execution until an action is invoked or data is
“touched” (read from or written to disk). An action triggers the lazy evaluation of all the
recorded transformations.
In Figure 2-6, all transformations T are recorded until the action A is invoked. Each
transformation T produces a new DataFrame.
While lazy evaluation allows Spark to optimize your queries by peeking into your chained
transformations, lineage and data immutability provide fault tolerance.
1. Narrow transformation – In a narrow transformation, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD. Narrow transformations are the result of operations such as map() and filter().
2. Wide transformation – In a wide transformation, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().
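A small PySpark sketch tying these ideas together (the DataFrame contents are made up): filter() is a narrow transformation, groupBy().count() involves a wide transformation (a shuffle), and nothing executes until the show() action is invoked.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

df = spark.createDataFrame(
    [("CA", "Red", 2), ("TX", "Blue", 5), ("CA", "Blue", 3)],
    ["State", "Color", "Count"])

filtered = df.filter(df.Count > 2)            # narrow transformation: no shuffle, recorded lazily
grouped = filtered.groupBy("State").count()   # wide transformation: needs a shuffle, still lazy

grouped.show()                                # action: triggers evaluation of the recorded lineage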
Let’s write a Spark program that reads a file with over 100,000 entries (where each row or line
has a <state, mnm_color, count>) and computes and aggregates the counts for each color and
state. These aggregated counts tell us the colors of M&Ms favored by students in each state. The
complete Python listing is provided below:
Let’s submit our first Spark job using the Python APIs (for an explanation of what the code does,
please read the inline comments in Example 2-1):
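A sketch of such a program follows (the column names State, Color, and Count, and the command-line argument for the data file, are assumptions consistent with the description above):

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mnmcount <mnm_dataset_file>", file=sys.stderr)
        sys.exit(-1)

    # Build a SparkSession for this application.
    spark = SparkSession.builder.appName("PythonMnMCount").getOrCreate()

    # Read the CSV file into a DataFrame, inferring the schema from the header row.
    mnm_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .load(sys.argv[1]))

    # Aggregate the counts for each state and color, ordered by the highest totals.
    count_mnm_df = (mnm_df
                    .select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .agg(count("Count").alias("Total"))
                    .orderBy("Total", ascending=False))

    count_mnm_df.show(n=60, truncate=False)
    print("Total Rows = %d" % count_mnm_df.count())

    spark.stop()

The script would typically be submitted to a cluster (or run locally) with spark-submit, passing the path to the data file as its single argument.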
QUESTIONS ON UNIT-2
1. Explain about the working of map reduce concept with a small example.
2. Explain the hadoop distributed file system architecture with a neat sketch.
3. Illustrate with example how files are stored in HDFS. Also, justify how HDFS is fault
tolerant.
4. Explain how map reduce jobs run on YARN.
5. Justify how Hadoop technology emerged as a solution to Big Data problems.
6. Explain in brief the core components of Hadoop ecosystem.
7. Illustrate with a word count example the working of map reduce concept.
8. Can MapReduce be used to solve any kind of computational problem? If not, explain the cases where MapReduce is not applicable. (Ans: MapReduce limitations)
9. How does the Hadoop MapReduce Data flow work for a word count program? Give an
example.
10. Explain the Yarn architecture with a neat sketch.
11. Compare batch processing and real time processing
12. Compare Hadoop and Spark
13. Explain the key features of Apache Spark.
14. Explain the different Cluster Managers in Apache Spark.
15. Explain the key features of Apache Spark.
16. Explain the various components in the Apache Spark ecosystem.
17. Explain the Apache Spark architecture.
18. Explain the Spark deployment modes.
19. Explain the following terms with respect to spark application
a. SparkApplication
b. SparkSession
c. Job
d. Stage
e. Task
20. Explain various operations on distributed data/RDD in spark.
21. Explain lazy evaluation, narrow transformation, and wide transformation.
22. Write a Spark program(in python) that reads a file with over 100,000 entries (where
each row or line has a <state, mnm_color, count>) and computes and aggregates the counts
for each color and state.