Apache Spark 101 for Data Engineering
90% of the world’s data was generated in just the last 2 years. The 120 zettabytes generated in 2023 are expected to grow to 181 zettabytes by 2025, an increase of roughly 50%.
In the early 2000s, the amount of data being generated exploded exponentially with
the use of the internet, social media, and various digital technologies. Organizations
found themselves facing a massive volume of data that was very hard to process. To
address this challenge, the concept of Big Data emerged.
Big Data refers to extremely large and complex data sets that are difficult to process
using traditional methods. Organizations across the world wanted to process this
massive volume of data and derive useful insights from it.
Instead of relying on a single machine, we can use multiple computers to get the final
result.
Think of it like teamwork: each machine in a cluster will get some part of the data to
process. They will work simultaneously on all of this data, and in the end, we will
combine the output to get the final result.
1. Hadoop Distributed File System (HDFS): This is like the giant storage system for
keeping our dataset. It divides our data into multiple chunks and stores all of this
data across different computers.
2. MapReduce: This is a clever way of processing all of that data in parallel. You divide your data into multiple chunks and process them simultaneously, similar to a team of friends
working to solve a very large puzzle. Each person in the team gets a part of the puzzle to solve, and in the end, we put everything together to get the final result (see the sketch after this list).
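To make the map, shuffle, and reduce steps concrete, here is a toy word count in plain Python (not the actual Hadoop API, just an illustration of the idea):

```python
from collections import defaultdict

documents = ["big data is big", "data moves fast"]

# Map: each "mapper" turns its chunk of input into (key, value) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all values by key, as the framework does between phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: each "reducer" combines the values for one key.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In real Hadoop, the map and reduce steps run on different machines and the framework handles the shuffle; the logic, though, is exactly this.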
But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations.
1. One of the biggest problems with Hadoop was that it relied on storing data on disk. Every time we run a job, it reads the data from disk, processes it, and writes the result back to disk. All of this disk I/O made data processing a lot slower.
2. Another issue with Hadoop was that it processed data only in batches, meaning we had to wait for one job to complete before submitting another. It was like waiting for every friend in the group to finish their part of the puzzle before putting anything together.
So, a faster way of processing all of this data, ideally in real time, was needed. Here's where Apache Spark comes into the picture.
Apache Spark was created to address the limitations of Hadoop, and this is where its creators introduced the powerful concept called RDD (Resilient Distributed Dataset).
RDD is the backbone of Apache Spark. It allows data to be stored in memory, which enables faster data access and processing. Instead of repeatedly reading and writing data to disk, Spark keeps the data in memory and processes it there.
Memory here means RAM (Random Access Memory), the fast working memory inside each computer. This in-memory processing is what can make Spark up to 100 times faster than Hadoop.
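As a minimal sketch of that idea in PySpark (the dataset and variable names here are illustrative), an RDD can be built from a local collection and cached in memory so that later actions reuse it instead of recomputing from scratch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD by distributing a local collection across the cluster.
rdd = sc.parallelize(range(1_000_000))

# Transformations are lazy; nothing has executed yet.
squares = rdd.map(lambda x: x * x)

# cache() asks Spark to keep the RDD in memory after it is first computed.
squares.cache()

print(squares.count())  # first action: computes and caches the RDD
print(squares.take(5))  # reuses the in-memory data, no recomputation
```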
Additionally, Spark lets you write code in various programming languages such as Python, Java, and Scala, so you can write Spark applications in your preferred language and process your data at large scale.
Apache Spark became very popular because it was fast and could handle huge amounts of data efficiently.
With all of these components (Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX) working together, Apache Spark became a powerful tool for processing and analyzing Big Data.
Nowadays, in any company, you will see Apache Spark being used to process Big Data.
You can't just take ten computers and start processing your Big Data. You need a
proper framework to coordinate work across all of these different machines, and
Apache Spark does exactly that.
Apache Spark manages and coordinates the execution of tasks on data across a cluster of computers using something called a cluster manager. The code we write in Spark is called a Spark application. Whenever we run one, the request goes to the cluster manager, which grants resources to the application so that our work can be completed.
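As a minimal sketch (the master URL and application name below are illustrative), a PySpark application tells Spark which cluster manager to connect to when it builds its SparkSession:

```python
from pyspark.sql import SparkSession

# Entry point of a Spark application. "local[4]" runs Spark in-process with
# 4 worker threads; on a real cluster this would point at a cluster manager,
# e.g. "yarn" or "spark://<host>:7077".
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("spark-101-demo")  # illustrative name
    .getOrCreate()
)
```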
Apache Spark coordinates tasks across multiple computers using a cluster manager,
which allocates resources to various applications within Spark. The framework
includes two critical components:
The driver process is like a boss, and the executor processes are like workers.
The main job of the driver process is to keep track of all the information about the Apache Spark application, and it responds to commands and input from the user. So, whenever we submit anything, the driver process makes sure it moves through the Spark application properly: it analyzes the work that needs to be done, divides the work into smaller tasks, and assigns these tasks to executor processes.
So, it is the boss or manager making sure everything works properly. The driver process is the heart of the Apache Spark application: it keeps everything running smoothly and allocates the right resources based on the input we provide.
Executor processes are the ones that do the work. They execute the code assigned
by the driver process and report back the progress and result of the computation.
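A rough way to see this division of labor from code (a sketch, assuming a SparkSession like the one above): each partition of a dataset becomes roughly one task that the driver schedules onto the executors.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks-demo").getOrCreate()

# Split 1,000 numbers across 8 partitions; each partition becomes (roughly)
# one task per stage that the driver hands to an executor.
df = spark.range(1000, numPartitions=8)
print(df.rdd.getNumPartitions())  # 8
```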
We'll create a DataFrame containing a range of numbers from 0 to 999, which we'll use
for further transformations.
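The original code screenshots don't carry over here, but a minimal PySpark sketch of those steps (variable and view names are illustrative) would look something like this:

```python
from pyspark.sql import SparkSession

# SparkSession: the entry point of a Spark application.
spark = SparkSession.builder.appName("spark-101").getOrCreate()

# DataFrame: a range of numbers from 0 to 999 in a column named "id".
df = spark.range(1000)

# Transformation: keep only the even numbers (lazy -- nothing runs yet).
evens = df.where(df["id"] % 2 == 0)

# Action: triggers the actual computation.
print(evens.count())  # 500

# SQL: register the DataFrame as a temp view and query it.
df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS even_count FROM numbers WHERE id % 2 = 0").show()
```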
That’s it: you just ran your first Spark code, in which you created a SparkSession and a DataFrame, applied a transformation and an action, and ran a SQL query on top of it all.
This is just a quick guide; there are many more things you'll need to understand to learn Spark properly:
Structured API
Lakehouse Architecture
If you are interested in learning Apache Spark in depth, with 2 end-to-end projects on AWS and Azure like this,
then check out my course, Apache Spark with Databricks for Data Engineers.
Don’t forget to subscribe to the DataVidhya newsletter for high-quality content in the data field.