Unit 6 Spark

Apache Spark is an open-source distributed processing system designed for big data workloads, featuring components like Spark SQL, Spark Streaming, MLlib, and GraphX. It operates on a master-slave architecture with Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs) for efficient data processing. Spark supports multiple programming languages, provides advanced analytics, and is significantly faster than traditional MapReduce systems.

Uploaded by

animebhai2004

TYBBA (CA)

Unit : 6 Spark

Important Questions:

1. Define : Spark

Ans : Apache Spark is an open-source distributed processing system used for big
data workloads. It is a cluster computing framework designed for fast computation.

2. List components of spark

Ans : 1. Spark Core 2. Spark SQL 3. Spark Streaming 4. MLlib 5. GraphX

3. Which language is not supported by Spark?

Ans : Spark does not have built-in support for COBOL.

4. Which language is supported by Spark?

Ans : Spark supports languages like Scala, Java, Python and R

5. List features of Spark

Ans : 1. Speed 2. Multiple Language Support 3. Multiple Platform Support 4. Advanced Analytics

6. Define : RDD

Ans : RDD stands for Resilient Distributed Dataset, the fundamental data structure of
Apache Spark. It is an immutable collection of objects that is partitioned and computed
on different nodes of the cluster.

Long Questions

1. How is Apache Spark different from Map Reduce?

Spark | MapReduce

Up to 100 times faster than MapReduce | Faster than traditional systems
Written in Scala | Written in Java
Supports batch, real-time, iterative and graph processing | Supports only batch processing
Code is compact and easy to write | Code is complex and lengthy
Caching enhances system performance | Does not support caching of data

2. How does spark work? Explain with the help of its architecture?

Ans :

Spark follows a master-slave architecture. A cluster consists of a single master and
multiple slaves.

The Spark architecture depends upon two abstractions:

o Resilient Distributed Dataset (RDD)

o Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

A Resilient Distributed Dataset is a group of data items that can be stored in memory
on worker nodes. Here,

o Resilient: Restore the data on failure.

o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph is a finite directed graph that describes a sequence of computations on
data. Each node is an RDD partition, and each edge is a transformation applied to the data. Here,
"graph" refers to the navigation, whereas "directed" and "acyclic" refer to how it is done: every
edge points one way, and no path leads back to a node already visited.
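
The idea that transformations only add edges to a graph, and that nothing is computed until a result is requested, can be sketched in plain Python. This is a toy model and not Spark's API: the class LazyDataset and its methods are hypothetical names chosen for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records its lineage instead of computing eagerly."""

    def __init__(self, source, parent=None, op=None, fn=None):
        self.source = source   # only set on the root dataset
        self.parent = parent   # edge back to the dataset this one derives from
        self.op = op           # name of the transformation ("map" or "filter")
        self.fn = fn           # function applied by the transformation

    def map(self, fn):
        # A transformation returns a new node; nothing is computed yet.
        return LazyDataset(None, parent=self, op="map", fn=fn)

    def filter(self, fn):
        return LazyDataset(None, parent=self, op="filter", fn=fn)

    def lineage(self):
        """Return the transformation names from the root to this node."""
        node, ops = self, []
        while node.parent is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

    def collect(self):
        """Action: walk the lineage and compute the result."""
        if self.parent is None:
            return list(self.source)
        data = self.parent.collect()
        if self.op == "map":
            return [self.fn(x) for x in data]
        return [x for x in data if self.fn(x)]


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.lineage())   # the recorded graph
print(evens_squared.collect())   # computation happens only here
```

Because each node remembers how it was derived, a lost result could in principle be rebuilt by replaying the lineage, which is the intuition behind the "resilient" part of the name.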

Driver Program
The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of the SparkContext is to coordinate a Spark
application, which runs as an independent set of processes on a cluster.

To run on a cluster, the SparkContext connects to one of several types of cluster manager and
then performs the following tasks:

o It acquires executors on nodes in the cluster.

o Then, it sends your application code to the executors. Here, the application code can be
defined by JAR or Python files passed to the SparkContext.
o At last, the SparkContext sends tasks to the executors to run.
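
The three steps above can be imitated on a single machine. The sketch below is only an analogy and assumes nothing about Spark internals: a thread pool stands in for the executors, and the function names (split_into_partitions, task, run_job) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """The unit of work sent to one executor: sum the squares of a partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    partitions = split_into_partitions(data, num_executors)
    # The thread pool plays the role of the executors acquired on the cluster.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    # The driver combines the per-partition results into the final answer.
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)
```

In a real cluster the executors are separate processes on worker nodes, so the application code has to be shipped to them first; here the threads simply share the driver's code.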

Cluster Manager
o The role of the cluster manager is to allocate resources across applications. Spark can
run on several kinds of cluster.
o Supported cluster managers include Hadoop YARN, Apache Mesos and the
Standalone Scheduler.
o The Standalone Scheduler is Spark's own cluster manager, which makes it possible to install
Spark on an empty set of machines.

Worker Node
o The worker node is a slave node.
o Its role is to run the application code in the cluster.

Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or disk storage across them.
o It reads data from and writes data to external sources.
o Every application has its own executors.

Task
o A unit of work that will be sent to one executor.

3. Explain the components of Spark.

Ans :

1. Spark Core
2. Spark SQL
3. Spark Streaming
4. MLlib (Machine learning library)
5. GraphX
6. SparkR

Apache Spark consists of the Spark Core engine, Spark SQL, Spark Streaming, MLlib, GraphX and
SparkR. Spark Core can be used along with any of the other five components. It is not necessary
to use all the Spark components together; depending on the use case and application, any one or
more of them can be used along with Spark Core.

Let us look at each of these components in detail.

Spark Core: Spark Core is the heart of the Apache Spark framework. It provides the execution
engine for the Spark platform, which the other components are built on and use as required.
Spark Core provides in-memory computing and the ability to reference datasets stored in external
storage systems. It is responsible for all the basic I/O functions, scheduling, monitoring, etc.
Fault recovery and effective memory management are its other important functions.

Spark Core uses a special data structure called the RDD. Data sharing in distributed processing
systems like MapReduce requires the data from intermediate steps to be written to and then read
back from permanent storage such as HDFS or S3, which makes it slow because of the serialization
and deserialization involved in each I/O step. RDDs overcome this: they are in-memory and
fault-tolerant and can be shared across different tasks within the same Spark process. An RDD is
an immutable, partitioned collection and can contain objects of any type: Python, Scala, Java or
user-defined class objects. RDDs can be created either by transforming an existing RDD or by
loading data from external sources such as HDFS or HBase.
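
The benefit of keeping an intermediate result in memory, rather than rebuilding it for every use, can be illustrated with a small plain-Python sketch. It is not Spark code: the class CachedDataset and its compute_calls counter are hypothetical and exist only to make the saving visible.

```python
class CachedDataset:
    """Toy dataset whose result can be kept in memory after the first computation."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._cache = None
        self._use_cache = False
        self.compute_calls = 0    # counts how often the data was actually rebuilt

    def cache(self):
        """Mark the dataset so that its result is kept after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cache is not None:
            return self._cache    # served from memory, no recomputation
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cache = result
        return result


uncached = CachedDataset(lambda: [x * 2 for x in range(5)])
uncached.collect()
uncached.collect()

cached = CachedDataset(lambda: [x * 2 for x in range(5)]).cache()
cached.collect()
cached.collect()

print(uncached.compute_calls, cached.compute_calls)
```

The uncached dataset is rebuilt on every use, while the cached one is built once; with iterative algorithms that reuse the same data many times, that difference is where most of the speed-up comes from.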

Spark SQL: Spark SQL is the successor to Shark, an early interactive SQL system on Hadoop.
Shark was built on top of the Hive codebase and achieved its performance improvement by
swapping out the physical execution engine of Hive. But due to the limitations of Hive, Shark
was not able to achieve the performance it was supposed to. So the Shark project was stopped, and
Spark SQL was built on top of the Spark Core engine, using the knowledge gained from Shark, to
leverage the power of Spark. Reynold Xin, one of the Spark SQL code maintainers, has written more
about Shark in a blog post.

Spark SQL is so named because it works with data in a similar fashion to SQL. Its stated aim is
to meet the SQL-92 standard. The gist is that it allows developers to write declarative code,
letting the engine use as much of the data and stored structure (RDDs) as it can to optimize the
resulting distributed query behind the scenes. The goal is to let the user worry less about the
distributed nature of the system and focus on the business use case. Users can perform extract,
transform and load (ETL) operations on data from a variety of sources in different formats such
as JSON, Parquet or Hive, and then execute ad-hoc queries using Spark SQL.

The DataFrame is the main abstraction in Spark SQL. A DataFrame is a distributed collection of
data organized into named columns. In earlier versions of Spark SQL, DataFrames were referred to
as SchemaRDDs. The DataFrame API integrates with Spark's procedural code, giving tight
integration between procedural and relational processing. It evaluates operations lazily, which
makes relational optimizations possible and improves the overall data processing workflow. All
relational functionality in Spark can be accessed through the SQLContext or HiveContext (unified
as SparkSession from Spark 2.0 onward).

Catalyst, an extensible optimizer, is at the core of Spark SQL. It is an optimization framework
embedded in Scala that helps developers improve their productivity and the performance of the
queries they write. Using Catalyst, Spark developers can concisely specify complex relational
optimizations and query transformations in a few lines of code by making the best use of Scala's
powerful programming constructs such as pattern matching and runtime metaprogramming. Catalyst
also eases the process of adding optimization rules, data sources and data types for machine
learning domains.
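
A rule-based optimizer of this kind rewrites a query's expression tree by matching patterns and replacing them with cheaper equivalents. The sketch below shows one such rule, constant folding, in plain Python. Catalyst itself is written in Scala and uses Scala pattern matching, so this is only an analogy; the classes Literal, Column and Add and the function fold_constants are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite rule: replace Add(Literal, Literal) with a single Literal, bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and columns are leaves and are returned unchanged.
    return expr


# x + (1 + 2) is rewritten to x + 3 before the query would run
tree = Add(Column("x"), Add(Literal(1), Literal(2)))
optimized = fold_constants(tree)
print(optimized)
```

The rule is a few lines long because the tree is immutable and the match is purely structural, which is the property the paragraph above attributes to writing optimizations with pattern matching.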

Spark Streaming: This Spark library is primarily maintained by Tathagata Das, with help from
Matei Zaharia. As the name suggests, this library is for streaming data. It is a very popular
Spark library because it applies Spark's big data processing power to data that arrives
continuously. Spark Streaming can process gigabytes of streaming data per second, and this
combination of big and fast data has a lot of potential. Spark Streaming is used for analyzing a
continuous stream of data. A common example is processing log data from a website or server.

Technically, Spark Streaming is not really streaming. What it does is break the incoming data
into individual chunks, each of which it processes as a small RDD. So it does not process data
byte by byte as it comes in, but at a fixed interval of time, such as every second or every two
seconds. Strictly speaking, Spark Streaming is therefore not real-time but near real-time, or
micro-batching, which suffices for the vast majority of applications.
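
Micro-batching as described here amounts to assigning each incoming record to a fixed-length time window and then processing each window as one small batch. A minimal plain-Python sketch follows; the function name micro_batches and the sample events are made up for illustration and are not part of Spark.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-length time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into the window that starts at a multiple of `interval`.
        batch_start = int(timestamp // interval) * interval
        batches[batch_start].append(value)
    # Return the batches in time order; each one would be processed as a unit.
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
for start, batch in micro_batches(events, interval=1):
    print(start, batch)
```

An event arriving at 0.2 seconds is not handled until its one-second window closes, which is why the approach is near real-time rather than real-time.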

Spark Streaming can be configured to talk to a variety of data sources. We can simply listen to a
port that has data being sent to it, or we can connect to data sources such as Amazon Kinesis,
Kafka, Flume, etc. Connectors are available to connect Spark to these sources. A strength of
Spark Streaming is its reliability. It has a concept called "checkpointing", which stores state
to disk periodically, and depending on the kind of data source or receiver in use, it can pick up
data from the point of failure. This is a robust mechanism for handling failures such as disk
failure or node failure. Spark Streaming can provide exactly-once processing guarantees,
depending on the source and sink, and helps recover lost work without extra code or additional
configuration.

Just as Spark SQL has the DataFrame/Dataset built on top of the RDD, Spark Streaming has the
DStream. A DStream is a sequence of RDDs that represents the entire stream of data. Most of the
built-in functions available on RDDs, such as flatMap and map, can also be applied to a DStream.
A DStream can also be broken into its individual RDDs and processed one chunk at a time. Spark
developers can reuse the same code for stream and batch processing and can also integrate
streaming data with historical data.

MLlib: Today many companies focus on building customer-centric data products and services, which
need machine learning to produce predictive insights, recommendations and personalized results.
Data scientists can solve these problems using popular languages like Python and R, but they
spend a lot of time building and supporting infrastructure for these languages. Spark has
built-in support for machine learning and data science at massive scale on clusters. It is called
MLlib, which stands for Machine Learning Library.

MLlib is a low-level machine learning library. It can be called from the Java, Scala and Python
programming languages. It is simple to use, scalable and easy to integrate with other tools and
frameworks. MLlib eases the development and deployment of scalable machine learning pipelines.
Machine learning is a subject in itself and cannot be covered in detail here, but these are
some of the important features and capabilities that MLlib offers:

o Linear regression, logistic regression
o Support Vector Machines
o Naive Bayes classifier
o K-Means clustering
o Decision trees
o Recommendations using Alternating Least Squares
o Basic statistics: chi-squared test, Pearson's or Spearman correlation, min, max, mean, variance
o Feature extraction: Term Frequency/Inverse Document Frequency (TF-IDF), useful for search
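
To make the last item in the list concrete, here is a small single-machine sketch of TF-IDF in plain Python. It does not call MLlib. The function name tf_idf and the sample documents are hypothetical, and the smoothed IDF formula log((N + 1) / (df + 1)) is one common variant that is assumed here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Weight = term count in the document multiplied by the term's IDF.
    return [{t: count * idf[t] for t, count in Counter(tokens).items()}
            for tokens in tokenized]


docs = ["spark is fast", "spark runs on clusters", "hadoop runs mapreduce"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document the word "fast" receives a higher weight than "spark", because "spark" also occurs in another document and is therefore less useful for telling the documents apart. That is what makes the measure useful for search.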

GraphX: For graphs and graph-parallel processing, Apache Spark provides another API called
GraphX. A graph here does not mean a chart, line graph or bar graph, but a graph in the computer
science sense. A social network is an example: each vertex represents an individual user, and
users are connected to each other by edges that represent the relationships between them.

GraphX is useful for obtaining overall information about a graph: for example, it can count how
many triangles appear in the graph and apply the PageRank algorithm to it. It can measure things
like "connectedness", degree distribution, average path length and other high-level properties
of a graph. It can also join graphs together and transform graphs quickly, and it supports the
Pregel API for traversing a graph. Spark GraphX provides the Resilient Distributed Graph (RDG),
an abstraction over Spark RDDs. Data scientists use the RDG API to perform graph operations
through various computational primitives. Just as RDDs have basic operations such as map and
filter, property graphs have a set of basic operators. These operators take UDFs (user-defined
functions) and produce new graphs with transformed properties and structure.
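
PageRank, mentioned above, can be sketched on a tiny social graph in plain Python. This is a single-machine illustration and not the GraphX API: the function name pagerank, the default damping factor of 0.85, the iteration count and the sample edges are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with the rank spread evenly over all vertices.
    ranks = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_ranks = {v: (1.0 - damping) / len(vertices) for v in vertices}
        for src, targets in out_links.items():
            if not targets:
                # A vertex with no outgoing edges shares its rank with everyone.
                share = damping * ranks[src] / len(vertices)
                for v in vertices:
                    new_ranks[v] += share
            else:
                # Otherwise the rank is split equally among the vertices it points to.
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks


edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

Here "bob" ends up with the highest rank because two users point to him, while "carol", whom nobody points to, keeps only the small baseline share.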

SparkR: The R programming language is widely used by data scientists because of its simplicity
and its ability to run complex algorithms. But R has the limitation that its data processing
capacity is restricted to a single node, which makes it unusable for processing huge amounts of
data. SparkR, an R package in Apache Spark, solves this problem. SparkR provides a data frame
implementation that supports operations such as selection, filtering and aggregation on large
distributed datasets. SparkR also supports distributed machine learning using Spark MLlib.
