Apache Spark 101 for Data Engineering


What is Apache Spark and why it is so popular among Data Engineers?
DARSHIL | DATA ENGINEER
APR 26, 2024


We all know this, right?

90% of the world's data was generated in just the last two years. The 120 zettabytes generated in 2023 are expected to grow by roughly 50% by 2025, hitting 181 zettabytes.


In the early 2000s, the amount of data being generated grew exponentially with the use of the internet, social media, and other digital technologies. Organizations found themselves facing a massive volume of data that was very hard to process. To address this challenge, the concept of Big Data emerged.


Big Data refers to extremely large and complex data sets that are difficult to process
using traditional methods. Organizations across the world wanted to process this
massive volume of data and derive useful insights from it.

Here's where Hadoop comes into the picture.

In 2006, a group of engineers at Yahoo developed a special software framework called Hadoop. They were inspired by Google's MapReduce and Google File System technology. Hadoop introduced a new way of data processing called distributed processing.

Instead of relying on a single machine, we can use multiple computers to get the final
result.

Think of it like teamwork: each machine in a cluster will get some part of the data to
process. They will work simultaneously on all of this data, and in the end, we will
combine the output to get the final result.

There are two key components of Hadoop.

1. Hadoop Distributed File System (HDFS): This is like the giant storage system for
keeping our dataset. It divides our data into multiple chunks and stores all of this
data across different computers.

2. MapReduce: This is a super smart way of processing all of this data together. MapReduce helps in processing all of this data in parallel. So, you can divide your data into multiple chunks and process them together, similar to a team of friends working to solve a very large puzzle. Each person in the team gets a part of the puzzle to solve, and in the end, we put everything together to get the final result (see the toy sketch below).

Together, HDFS and MapReduce allowed organizations to store and process very large volumes of data.
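To make the idea concrete, here is a toy sketch of the MapReduce pattern in plain Python (just the concept, not Hadoop itself): each "mapper" counts words in its own chunk of text, and a "reducer" merges the partial counts into the final answer.

from collections import Counter

chunks = [
    "big data is big",
    "data moves fast",
]

# Map phase: each chunk is processed independently (in Hadoop, these
# would run in parallel on different machines).
partial_counts = [Counter(chunk.split()) for chunk in chunks]

# Reduce phase: merge the partial results into the final answer.
total = Counter()
for counts in partial_counts:
    total.update(counts)

print(total)  # Counter({'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1})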

But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations.

1. One of the biggest problems with Hadoop was that it relied on storing data on disk, which made things much slower. Every time we run a job, it writes data to disk, reads it back, processes it, and then writes the results to disk again. This made data processing a lot slower.

2. Another issue with Hadoop was that it processed data only in batches. This means we had to wait for one job to complete before submitting another. It was like waiting for the whole group of friends to finish their puzzle pieces individually before putting them together.

Organizations needed to process all of this data faster, and in real time. Here's where Apache Spark comes into the picture.

In 2009, researchers at the University of California, Berkeley, developed Apache Spark as a research project. The main reason behind the development of Apache Spark was to address the limitations of Hadoop. This is where they introduced the powerful concept called the RDD (Resilient Distributed Dataset).

The RDD is the backbone of Apache Spark. It allows data to be stored in memory, enabling faster data access and processing. Instead of repeatedly reading and writing data from disk, Spark keeps the data in memory and processes it there.

Memory here means RAM (Random Access Memory). This in-memory processing is what makes Spark up to 100 times faster than Hadoop for some workloads.
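As a minimal PySpark sketch of what that means in practice (assuming a local Spark installation), you can pin an RDD in memory with cache() so that repeated computations reuse it instead of going back to the source:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute the numbers 0..999,999 across the cluster as an RDD
# and keep the partitions in memory after the first computation.
numbers = sc.parallelize(range(1_000_000)).cache()

# Both actions below reuse the cached, in-memory data.
print(numbers.count())  # 1000000
print(numbers.sum())    # 499999500000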

Additionally, Spark lets you write code in various programming languages such as Python, Java, and Scala, so you can easily start writing Spark applications in your preferred language and process your data on a large scale.

Apache Spark became very popular because it was fast and could handle large amounts of data efficiently.

The main components of Apache Spark:


Spark Core: Manages basic data processing tasks across multiple machines.

Spark SQL: Allows you to run SQL queries directly on datasets.

Spark Streaming: Facilitates real-time data processing.


MLlib: A machine learning library for running large-scale machine learning models.

With all of these components working together, Apache Spark became a powerful tool
for processing and analyzing Big Data.

Nowadays, at almost any data-driven company, you will see Apache Spark being used to process Big Data.

Apache Spark Architecture

Think of a computer: a standalone machine is generally used to watch movies, play games, or anything else. But you can't do that on a single computer when you want to process Big Data at scale. You need multiple computers working together on individual tasks so that you can combine the output at the end and get the desired result.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 7/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

You can't just take ten computers and start processing your Big Data. You need a
proper framework to coordinate work across all of these different machines, and
Apache Spark does exactly that.

Apache Spark manages and coordinates the execution of tasks on data across a cluster of computers. It has something called a cluster manager. When we write a job in Spark, it is called a Spark application. Whenever we run one, the request goes to the cluster manager, which grants resources to the application so that it can complete its work.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 8/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

Apache Spark coordinates tasks across multiple computers using a cluster manager, which allocates resources to the various applications running on the cluster. The framework includes two critical components:

Driver Process: Acts as the manager, overseeing the application's operations.

Executor Processes: Perform the actual data processing tasks as directed by the driver.

The driver processes are like a boss, and the executor processes are like workers.

The main job of the driver process is to keep track of all the information about the Apache Spark application. It responds to commands and input from the user.

So, whenever we submit anything, the driver process makes sure it moves through the Apache Spark application properly. It analyzes the work that needs to be done, divides the work into smaller tasks, and assigns these tasks to executor processes.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 9/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

So, it is the boss, or manager, making sure everything works properly.

The driver process is the heart of the Apache Spark application because it makes sure everything runs smoothly and secures the right resources based on the input that we provide.

Executor processes are the ones that do the work. They execute the code assigned
by the driver process and report back the progress and result of the computation.
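To make this concrete, here is an illustrative sketch (the values are placeholders; the right numbers depend on your cluster) of telling Spark how much memory and how many cores the driver and executors should get:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.driver.memory", "2g")    # resources for the "boss"
    .config("spark.executor.memory", "4g")  # memory for each "worker"
    .config("spark.executor.cores", "2")    # cores per executor
    .getOrCreate()
)

In practice these are often passed at submit time (for example via spark-submit) instead, since some settings, such as driver memory, must be fixed before the driver starts.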

How Apache Spark executes the code


When we write our code in Apache Spark, the first thing we need to do is create the Spark session, which makes the connection with the cluster manager. You can create a Spark session from any of these languages: Python, Scala, or Java. No matter which language you use, the first thing you create in a Spark application is the Spark session.

Step 1: Setting Up the Spark Session


The Spark session is the entry point for programming Spark applications. It lets you
communicate with Spark clusters and is necessary for working with RDDs, DataFrames,
and Datasets.
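In PySpark, a minimal sketch looks like this (the app name is arbitrary):

from pyspark.sql import SparkSession

# The SparkSession is the single entry point to all Spark functionality.
spark = SparkSession.builder.appName("Spark101").getOrCreate()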

Step 2: Creating a DataFrame

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&t… 10/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

We'll create a DataFrame containing a range of numbers from 0 to 999, which we'll use
for further transformations.
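For example (spark.range(1000) produces the numbers 0 through 999 as a single-column DataFrame):

# Create a DataFrame of the numbers 0..999 in a column named "number".
my_range = spark.range(1000).toDF("number")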

Step 3: Applying Transformations


Let's apply a transformation that keeps only the even numbers in the DataFrame. Remember, transformations in Spark are lazily evaluated, which means no computation happens until an action is called.
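A sketch, continuing from the DataFrame above:

# A transformation: nothing is computed yet; Spark only records the plan.
divis_by_2 = my_range.where("number % 2 = 0")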

Step 4: Executing Actions


Now we will use an action to trigger the computation of the transformed data. We'll
count the number of even numbers found in our DataFrame.
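count() is an action, so this is the point where Spark actually executes the plan:

# An action: triggers execution of the recorded transformations.
print(divis_by_2.count())  # 500 (the even numbers between 0 and 999)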

Step 5: Running SQL Queries


To showcase Spark SQL, let's create a temporary view using our DataFrame and run a
SQL query against it.
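For example:

# Register the DataFrame as a temporary view so SQL queries can see it.
my_range.createOrReplaceTempView("numbers")

# The SQL query compiles to the same underlying plan as the DataFrame code.
spark.sql("SELECT COUNT(*) AS even_count FROM numbers WHERE number % 2 = 0").show()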

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&t… 11/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

That's it! You just ran your first Spark code: you created a SparkSession and a DataFrame, applied a transformation and an action, and ran a SQL query on top of it.

This is just a quick guide; there are many more things you need to understand to learn Spark properly:

Structured API

Lower Level API

Running Spark Code in Production

Databricks and Delta Lake

Lakehouse Architecture


Distributed Shared Variables

and many more…

If you are interested in learning Apache Spark in depth, with two end-to-end projects on AWS and Azure, then check out my course, Apache Spark with Databricks for Data Engineers.

Don't forget to subscribe to the DataVidhya newsletter for high-quality content in the data field.

