Unit 4

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on
Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation,
including interactive queries and stream processing. The main feature of Spark is its in-memory cluster
computing, which increases the processing speed of an application.
Hadoop is an open source framework that has the Hadoop Distributed File System (HDFS) as storage, YARN as
a way of managing computing resources used by different applications, and an implementation of the MapReduce
programming model as an execution engine. In a typical Hadoop deployment, other execution engines such as
Spark, Tez, and Presto are also deployed.

Evolution of Apache Spark

Spark started in 2009 as one of Hadoop's sub-projects, developed in UC Berkeley's AMPLab by Matei Zaharia. It was
open-sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013; Apache Spark
has been a top-level Apache project since February 2014.

Features of Apache Spark

Apache Spark has the following features.


• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk. It achieves this by reducing the number of read/write
operations to disk and by storing intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also offers around 80 high-level operators
for interactive querying.
• Advanced Analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL
queries, streaming data, machine learning (ML), and graph algorithms.

Components of Spark

The following illustration depicts the different components of Spark.

Apache Spark Core


Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built.
It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which
provides support for structured and semi-structured data.
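SchemaRDD has since evolved into the DataFrame API. As a minimal sketch (assuming a local SparkSession and a small in-memory dataset, both chosen here only for illustration), querying structured data through Spark SQL looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create a DataFrame (the modern successor of SchemaRDD) from an in-memory list
people = spark.createDataFrame(
    [("Alice", 29), ("Bob", 35), ("Carol", 41)],
    ["name", "age"],
)

# Register it as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()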
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data
in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
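A minimal sketch of this mini-batch model, using the classic DStream API and assuming a text source on localhost:9999 (for example, one started with nc -lk 9999), is shown below:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)   # 1-second mini-batches

# Each mini-batch of lines becomes an RDD on which transformations run
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()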
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework built on top of Spark, taking advantage of the distributed,
memory-based Spark architecture. According to benchmarks run by the MLlib developers against Alternating Least
Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache
Mahout (before Mahout gained a Spark interface).
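As an illustration of MLlib's distributed algorithms, here is a minimal sketch of the ALS recommender mentioned above, using the DataFrame-based pyspark.ml API and a tiny made-up ratings set (the data and parameter values are purely illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

# Tiny made-up (user, item, rating) dataset purely for illustration
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0), (1, 2, 5.0), (2, 1, 1.0), (2, 2, 4.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 2 items for every user
model.recommendForAllUsers(2).show(truncate=False)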
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph
computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized
runtime for this abstraction.

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed
collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different
nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic
operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can
be operated on in parallel.
There are two ways to create RDDs − by parallelizing an existing collection in your driver program, or by referencing
a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering
a Hadoop InputFormat.
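Both creation paths look roughly like this in PySpark (the HDFS path below is only a placeholder and is commented out so the snippet runs locally):

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDCreation")

# 1. Parallelize an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.reduce(lambda a, b: a + b))   # 15

# 2. Reference a dataset in an external storage system (placeholder HDFS path)
# lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
# print(lines.count())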
Spark uses the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss
how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm
on a cluster. It allows users to write parallel computations using a set of high-level operators, without having to
worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two
MapReduce jobs) is to write it to an external stable storage system such as HDFS. Although this framework
provides numerous abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow
in MapReduce due to replication, serialization, and disk I/O. Regarding the storage system, most Hadoop
applications spend more than 90% of their time doing HDFS read/write operations.

Iterative Operations on MapReduce

Multi-stage applications reuse intermediate results across multiple computations. The following illustration
explains how the current framework works while performing iterative operations on MapReduce. This incurs
substantial overheads due to data replication, disk I/O, and serialization, which makes the system slow.

Interactive Operations on MapReduce


Users run ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which
can dominate application execution time.
The following illustration explains how the current framework works while performing interactive queries on
MapReduce.

Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop
applications spend more than 90% of their time doing HDFS read/write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of
Spark is Resilient Distributed Datasets (RDD), which support in-memory processing. This means Spark
stores the state of memory as an object across jobs, and that object is shareable between those jobs. Data sharing
in memory is 10 to 100 times faster than sharing over the network or disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

The illustration given below shows iterative operations on Spark RDD. Intermediate results are stored in
distributed memory instead of stable storage (disk), which makes the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), those
results are stored on disk.

Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDD. If different queries are run repeatedly on the same set
of data, that data can be kept in memory for better execution times.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may
also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster
access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across
multiple nodes.
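A small sketch of the difference between recomputation and persistence, assuming the standard pyspark.StorageLevel constants (the data here is arbitrary):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "PersistExample")

squares = sc.parallelize(range(1, 1_000_000)).map(lambda x: x * x)

# Without persist(), each action below would recompute the map() from scratch.
squares.persist(StorageLevel.MEMORY_ONLY)      # or simply squares.cache()

print(squares.count())   # first action materialises and caches the RDD
print(squares.sum())     # second action reuses the cached partitions

# StorageLevel.MEMORY_AND_DISK would spill partitions to disk when RAM is
# insufficient, corresponding to persisting RDDs on disk as described above.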

Programming Languages for Apache Spark

It has been observed so often that people or organizations don’t focus on selecting the right language before
working on any project. However, there are certain criteria to look into before going ahead like a perfect blend
of data, right implementation, accuracy, data models, and so on. The point is working on spark gives you some
benefits and opens doors for many different coders like Java, on the other hand, people who are sticking with
Python might have to face some pull-offs.

1. Scala
Since we're talking about Scala, how can we forget Spark? Apache Spark was written primarily in Scala, so every
function maps well for its developers; Scala is arguably the best go-to language for Apache Spark. It was designed
by Martin Odersky in 2001. Although it is not an old-school language, Scala has gained enormous popularity in a
very short span of time. Scala is a hybrid programming language, which means it supports both functional and
object-oriented programming. In some ways it can be seen as a next-level Java, so it can be a good fit for those
who have prior knowledge of Java. Now, let's dig a bit deeper to see what else makes Scala special when used
with Spark:
• It can defeat its rivals when it comes to performance; Scala offers supreme speed in both
processing and analyzing data.
• It enables developers to write clean designs for Spark applications and is a statically typed
language.
• Due to its adaptability, it can even work on real-time data, and its processing is very quick.
• With the help of Scala, it is possible and much easier to build big data applications despite their
complexity.
2. Python
Python is one of the most popular languages so far in the field of data science among data scientists around the
world, and it was first introduced by Guido van Rossum in 1991. If you go by the stats, it currently holds the top
spot in popularity; it was initially designed as a response to the ABC programming language and is today widely
used as a general-purpose language in the big data world. Today, almost every data analysis, machine learning,
data mining, and manipulation library is heavily used through this language. It carries good standard libraries with
simple syntax. Besides this, Python also offers some more features which you should look into before moving ahead:
• If you search the internet, you might find many other supported languages for Apache Spark,
but Python is considered the easiest to understand, and creating schemas, interacting with a local
file system, or calling a REST API is much easier to do with Python while working in Spark.
• It is an interpreted language, which means its code is converted to bytecode that is then
executed by the Python virtual machine.
• Working with Python is much easier for programmers who already know SQL or R.
• Python offers an extensive set of libraries, including string processing, Unicode, and internet
protocols (HTTP, FTP, SMTP, etc.), and it runs easily on different operating systems such as Linux,
Windows, and macOS.
We've seen both programming languages one by one, along with their features. Now let us take a quick
comparison of the two languages for better clarity.
Quick Comparison (Python vs Scala): Which one should you pick while working with Apache Spark?
1. If we talk about programming complexity, working with Python is much easier: being an
interpreted language, Python code can simply be edited in a text editor and run again, whereas
with Scala this is harder, since the code must be recompiled before the edited version can be
executed.

2. Talking about execution speed, Scala offers superior speed compared to Python. This is
because Scala is derived from Java and runs on the JVM (Java Virtual Machine), which enables
it to work seamlessly with Spark.

3. Being a simple, open-source, general-purpose programming language, Python offers simple syntax and less
coding; Scala, on the other hand, is a functional language with a lot of functions and
features, which makes it harder to work with.

4. For a large project, Scala's static typing makes it a perfect fit for type checking at compile
time, whereas Python, being dynamically typed, is not as scalable and fits best with smaller
projects.

5. As discussed above, Apache Spark is written in Scala because of its scalability on the
JVM, so Scala offers access to all the latest Spark features first. That is not the whole story,
though; it all depends on your requirements. If, say, you need better graphical visualization for
your project, then PySpark is the better choice, and Scala cannot replace it for that purpose.

Conclusion

Choosing the best language for Apache Spark is not that difficult, since only a handful of key languages are available.
If you are familiar with Java, then working with Scala can be a perfect fit for you; on the other
hand, if you want to keep things simple with less complexity, then Python is the answer. In the end, it all
depends on your prior knowledge and on where you will apply the language in a project. We have
tried to sort things out by classifying the features and comparing the languages face to face, but the best thing you
can do is create a list of issues, scoring them from usability to learning curve; once you are
done, you will surely get the answer for picking the right programming language for Apache Spark. Also,
Java could be considered while working with Apache Spark.

Libraries

Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing
(Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX). Beyond these libraries,
there are hundreds of open source external libraries ranging from connectors for various storage systems to
machine learning algorithms. One index of external libraries is available at spark-packages.org.

QUESTION: DIFFERENCE BETWEEN SPARK AND HADOOP

PYSPARK

PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark
capabilities; using PySpark we can run applications in parallel on a distributed cluster
(multiple nodes).

In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical
processing engine for large-scale, powerful distributed data processing and machine learning
applications.

Spark is basically written in Scala, and later, due to its industry adoption, its Python API, PySpark,
was released using Py4J. Py4J is a Java library that is integrated within PySpark and
allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need
Java installed along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the
machine learning community), which comes with a lot of useful tools like the Spyder IDE and Jupyter
Notebook to run PySpark applications.
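A minimal end-to-end PySpark application (assuming Spark, Java, and the pyspark package are installed; names such as "HelloPySpark" are arbitrary) might look like this:

from pyspark.sql import SparkSession

# The SparkSession is the entry point of a PySpark application;
# behind the scenes Py4J bridges these Python calls to the JVM.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("HelloPySpark") \
    .getOrCreate()

sc = spark.sparkContext

# Distribute a small collection and run a parallel transformation and action
data = sc.parallelize(["spark", "makes", "distributed", "computing", "simple"])
lengths = data.map(lambda word: (word, len(word)))
print(lengths.collect())

spark.stop()

Such a script can be run directly with the Python interpreter or submitted to a cluster with spark-submit.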
KAFKA

In Big Data, an enormous volume of data is used. Regarding data, we have two main challenges: the first challenge
is how to collect a large volume of data, and the second challenge is how to analyze the collected data. To overcome
these challenges, you need a messaging system.
Kafka is designed for distributed high-throughput systems. Kafka tends to work very well as a replacement for a
more traditional message broker. In comparison to other messaging systems, Kafka has better throughput, built-
in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message-
processing applications.

What is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high
volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline
and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to
prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with
Apache Storm and Spark for real-time streaming data analysis.
Benefits
Following are a few benefits of Kafka −
• Reliability − Kafka is distributed, partitioned, replicated and fault tolerant.
• Scalability − The Kafka messaging system scales easily without downtime.
• Durability − Kafka uses a distributed commit log, which means messages are persisted on disk as fast
as possible, hence it is durable.
• Performance − Kafka has high throughput for both publishing and subscribing messages. It
maintains stable performance even when many TB of messages are stored.
Kafka is very fast and guarantees zero downtime and zero data loss.
Take a look at the following illustration. It shows the cluster diagram of Kafka.

The following table describes each of the components shown in the above diagram.

S.No − Components and Description

1. Broker
The Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are
stateless, so they use ZooKeeper to maintain their cluster state. One Kafka broker instance
can handle hundreds of thousands of reads and writes per second, and each broker can handle
TB of messages without performance impact. Kafka broker leader election can be done by
ZooKeeper.

2. ZooKeeper
ZooKeeper is used for managing and coordinating Kafka brokers. The ZooKeeper service is mainly
used to notify producers and consumers about the presence of a new broker in the Kafka system
or the failure of a broker in the Kafka system. Based on the notification received from ZooKeeper
regarding the presence or failure of a broker, producers and consumers decide how to proceed and start
coordinating their work with some other broker.

3. Producers
Producers push data to brokers. When a new broker is started, all the producers search for it and
automatically send messages to that new broker. A Kafka producer does not wait for
acknowledgements from the broker and sends messages as fast as the broker can handle.

4. Consumers
Since Kafka brokers are stateless, the consumer has to keep track of how many
messages have been consumed by using the partition offset. If the consumer acknowledges a
particular message offset, it implies that the consumer has consumed all prior messages. The
consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to
consume. Consumers can rewind or skip to any point in a partition simply by supplying an
offset value. The consumer offset value is notified by ZooKeeper.

So far, we have discussed the core concepts of Kafka. Let us now throw some light on the workflow of Kafka.
Kafka is simply a collection of topics split into one or more partitions. A Kafka partition is a linearly ordered
sequence of messages, where each message is identified by its index (called an offset). All the data in a Kafka
cluster is the disjoint union of these partitions. Incoming messages are written at the end of a partition and messages
are read sequentially by consumers. Durability is provided by replicating messages to different brokers.
Kafka provides both a pub-sub and a queue-based messaging system in a fast, reliable, persisted, fault-tolerant,
zero-downtime manner. In both cases, producers simply send messages to a topic, and consumers can choose either
type of messaging system depending on their needs. Let us follow the steps in the next section to understand
how the consumer can choose the messaging system of its choice.

Workflow of Pub-Sub Messaging

Following is the step-wise workflow of Pub-Sub Messaging (a minimal producer/consumer sketch in Python follows the list) −

• Producers send messages to a topic at regular intervals.
• The Kafka broker stores all messages in the partitions configured for that particular topic. It ensures
the messages are shared equally between partitions: if the producer sends two messages and
there are two partitions, Kafka will store one message in the first partition and the second
message in the second partition.
• The consumer subscribes to a specific topic.
• Once the consumer subscribes to a topic, Kafka provides the current offset of the topic to the
consumer and also saves the offset in the ZooKeeper ensemble.
• The consumer requests new messages from Kafka at a regular interval (for example, every 100 ms).
• Once Kafka receives messages from producers, it forwards these messages to the consumers.
• The consumer receives the message and processes it.
• Once the messages are processed, the consumer sends an acknowledgement to the Kafka broker.
• Once Kafka receives an acknowledgement, it changes the offset to the new value and updates it
in ZooKeeper. Since offsets are maintained in ZooKeeper, the consumer can read the next
message correctly even during server outages.
• The above flow repeats until the consumer stops requesting messages.
• The consumer has the option to rewind or skip to the desired offset of a topic at any time and read all
the subsequent messages.
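A minimal sketch of this flow in Python, using the third-party kafka-python client (an assumption; the document does not prescribe a client library) against a broker assumed to be running on localhost:9092 with an arbitrary topic name:

from kafka import KafkaProducer, KafkaConsumer

# Producer: send a few messages to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("demo-topic", value=f"message-{i}".encode("utf-8"))
producer.flush()

# Consumer: subscribe to the topic and pull new messages from the broker
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the partition
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for record in consumer:
    print(record.partition, record.offset, record.value.decode("utf-8"))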

Workflow of Queue Messaging / Consumer Group

In a queue messaging system, instead of a single consumer, a group of consumers having the same Group ID
subscribes to a topic. In simple terms, consumers subscribing to a topic with the same Group ID are considered a
single group, and the messages are shared among them. Let us check the actual workflow of this system (a sketch
of a group consumer follows the list).
• Producers send messages to a topic at regular intervals.
• Kafka stores all messages in the partitions configured for that particular topic, similar to the
earlier scenario.
• A single consumer subscribes to a specific topic, say Topic-01, with Group ID Group-1.
• Kafka interacts with the consumer in the same way as in Pub-Sub Messaging until a new consumer
subscribes to the same topic, Topic-01, with the same Group ID, Group-1.
• Once the new consumer arrives, Kafka switches its operation to share mode and shares the data
between the two consumers. This sharing goes on until the number of consumers reaches the
number of partitions configured for that particular topic.
• Once the number of consumers exceeds the number of partitions, a new consumer will not
receive any further messages until one of the existing consumers unsubscribes. This scenario
arises because each consumer in Kafka is assigned a minimum of one partition; once
all the partitions are assigned to the existing consumers, new consumers have to wait.
• This feature is also called a Consumer Group. In this way, Kafka provides the best of
both systems in a very simple and efficient manner.
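A consumer joins a group simply by passing the same group_id. Running the following sketch twice (again using the third-party kafka-python client, as assumed above) starts two consumers in Group-1 that share the partitions of Topic-01 between them:

from kafka import KafkaConsumer

# Consumers started with the same group_id form one consumer group;
# Kafka assigns each of them a disjoint subset of the topic's partitions.
consumer = KafkaConsumer(
    "Topic-01",
    bootstrap_servers="localhost:9092",
    group_id="Group-1",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} "
          f"value={record.value.decode('utf-8')}")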

Role of ZooKeeper

A critical dependency of Apache Kafka is Apache ZooKeeper, which is a distributed configuration and
synchronization service. ZooKeeper serves as the coordination interface between the Kafka brokers and
consumers. The Kafka servers share information via a ZooKeeper cluster. Kafka stores basic metadata in
ZooKeeper, such as information about topics, brokers, consumer offsets (queue readers) and so on.
Since all the critical information is stored in ZooKeeper, and ZooKeeper normally replicates this data across its ensemble,
the failure of a Kafka broker or a ZooKeeper node does not affect the state of the Kafka cluster. Kafka will restore the state once
ZooKeeper restarts. This gives zero downtime for Kafka. Leader election among Kafka brokers is also
done using ZooKeeper in the event of leader failure.
