Apache Spark 101 for Data Engineering


What is Apache Spark and why it is so popular among Data Engineers?
DARSHIL | DATA ENGINEER
APR 26, 2024


We all know this, right?

90% of the world's data was generated in just the last two years. The 120 zettabytes generated in 2023 are expected to grow by roughly 50% by 2025, hitting 181 zettabytes.


In the early 2000s, the amount of data being generated grew exponentially with the use of the internet, social media, and other digital technologies. Organizations found themselves facing a massive volume of data that was very hard to process. To address this challenge, the concept of Big Data emerged.


Big Data refers to extremely large and complex data sets that are difficult to process
using traditional methods. Organizations across the world wanted to process this
massive volume of data and derive useful insights from it.

Here's where Hadoop comes into the picture.

In 2006, a group of engineers at Yahoo developed a special software framework called Hadoop. They were inspired by Google's MapReduce and Google File System technology. Hadoop introduced a new way of data processing called distributed processing.

Instead of relying on a single machine, we can use multiple computers to get the final
result.

Think of it like teamwork: each machine in a cluster will get some part of the data to
process. They will work simultaneously on all of this data, and in the end, we will
combine the output to get the final result.

There are two key components of Hadoop.

1. Hadoop Distributed File System (HDFS): This is like the giant storage system for
keeping our dataset. It divides our data into multiple chunks and stores all of this
data across different computers.

2. MapReduce: This is a super smart way of processing all of this data together. MapReduce helps in processing all of this data in parallel. So, you can divide your data into multiple chunks and process them together, similar to a team of friends working to solve a very large puzzle. Each person in the team gets a part of the puzzle to solve, and in the end, we put everything together to get the final result (see the toy sketch below).

Together, HDFS and MapReduce allowed organizations to store and process very large volumes of data.
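To make the idea concrete, here is a toy sketch of the MapReduce pattern in plain Python (just the concept, not Hadoop itself): each "mapper" counts words in its own chunk of text, and a "reducer" merges the partial counts into the final answer.

from collections import Counter

chunks = [
    "big data is big",
    "data moves fast",
]

# Map phase: each chunk is processed independently (in Hadoop, these
# would run in parallel on different machines).
partial_counts = [Counter(chunk.split()) for chunk in chunks]

# Reduce phase: merge the partial results into the final answer.
total = Counter()
for counts in partial_counts:
    total.update(counts)

print(total)  # Counter({'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1})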

But here's the thing: although Hadoop was very good at handling Big Data, it had a few limitations.

1. One of the biggest problems with Hadoop was that it relied on storing data on disk, which made things much slower. Every time we run a job, it writes data to disk, reads it back, processes it, and then writes the results to disk again. This made data processing a lot slower.

2. Another issue with Hadoop was that it processed data only in batches. This means we had to wait for one job to complete before submitting another. It was like waiting for the whole group of friends to finish their puzzle pieces individually before putting them together.

Organizations needed to process all of this data faster, and in real time. Here's where Apache Spark comes into the picture.

In 2009, researchers at the University of California, Berkeley, developed Apache Spark as a research project. The main reason behind the development of Apache Spark was to address the limitations of Hadoop. This is where they introduced the powerful concept called the RDD (Resilient Distributed Dataset).

The RDD is the backbone of Apache Spark. It allows data to be stored in memory, enabling faster data access and processing. Instead of repeatedly reading and writing data from disk, Spark keeps the data in memory and processes it there.

Memory here means RAM (Random Access Memory). This in-memory processing is what makes Spark up to 100 times faster than Hadoop for some workloads.
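As a minimal PySpark sketch of what that means in practice (assuming a local Spark installation), you can pin an RDD in memory with cache() so that repeated computations reuse it instead of going back to the source:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute the numbers 0..999,999 across the cluster as an RDD
# and keep the partitions in memory after the first computation.
numbers = sc.parallelize(range(1_000_000)).cache()

# Both actions below reuse the cached, in-memory data.
print(numbers.count())  # 1000000
print(numbers.sum())    # 499999500000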

Additionally, Spark lets you write code in various programming languages such as Python, Java, and Scala, so you can easily start writing Spark applications in your preferred language and process your data on a large scale.

Apache Spark became very popular because it was fast and could handle large amounts of data efficiently.

The main components of Apache Spark:


Spark Core: Manages basic data processing tasks across multiple machines.

Spark SQL: Allows you to run SQL queries directly on datasets.

Spark Streaming: Facilitates real-time data processing.


MLlib: A machine learning library for running large-scale machine learning models.

With all of these components working together, Apache Spark became a powerful tool
for processing and analyzing Big Data.

Nowadays, at almost any data-driven company, you will see Apache Spark being used to process Big Data.

Apache Spark Architecture

Think of a computer: a standalone machine is generally used to watch movies, play games, or anything else. But you can't do that on a single computer when you want to process Big Data at scale. You need multiple computers working together on individual tasks so that you can combine the output at the end and get the desired result.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 7/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

You can't just take ten computers and start processing your Big Data. You need a
proper framework to coordinate work across all of these different machines, and
Apache Spark does exactly that.

Apache Spark manages and coordinates the execution of tasks on data across a cluster of computers. It has something called a cluster manager. When we write a job in Spark, it is called a Spark application. Whenever we run one, the request goes to the cluster manager, which grants resources to the application so that it can complete its work.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 8/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

Apache Spark coordinates tasks across multiple computers using a cluster manager, which allocates resources to the various applications running on the cluster. The framework includes two critical components:

Driver Process: Acts as the manager, overseeing the application's operations.

Executor Processes: Perform the actual data processing tasks as directed by the driver.

The driver processes are like a boss, and the executor processes are like workers.

The main job of the driver process is to keep track of all the information about the Apache Spark application. It responds to commands and input from the user.

So, whenever we submit anything, the driver process makes sure it moves through the Apache Spark application properly. It analyzes the work that needs to be done, divides the work into smaller tasks, and assigns these tasks to executor processes.

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&tr… 9/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

So, it is the boss, or manager, making sure everything works properly.

The driver process is the heart of the Apache Spark application because it makes sure everything runs smoothly and secures the right resources based on the input that we provide.

Executor processes are the ones that do the work. They execute the code assigned
by the driver process and report back the progress and result of the computation.
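To make this concrete, here is an illustrative sketch (the values are placeholders; the right numbers depend on your cluster) of telling Spark how much memory and how many cores the driver and executors should get:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.driver.memory", "2g")    # resources for the "boss"
    .config("spark.executor.memory", "4g")  # memory for each "worker"
    .config("spark.executor.cores", "2")    # cores per executor
    .getOrCreate()
)

In practice these are often passed at submit time (for example via spark-submit) instead, since some settings, such as driver memory, must be fixed before the driver starts.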

How Apache Spark executes the code


When we write our code in Apache Spark, the first thing we need to do is create the Spark session, which makes the connection with the cluster manager. You can create a Spark session from any of these languages: Python, Scala, or Java. No matter which language you use, the first thing you create in a Spark application is the Spark session.

Step 1: Setting Up the Spark Session


The Spark session is the entry point for programming Spark applications. It lets you
communicate with Spark clusters and is necessary for working with RDDs, DataFrames,
and Datasets.
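In PySpark, a minimal sketch looks like this (the app name is arbitrary):

from pyspark.sql import SparkSession

# The SparkSession is the single entry point to all Spark functionality.
spark = SparkSession.builder.appName("Spark101").getOrCreate()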

Step 2: Creating a DataFrame

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&t… 10/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

We'll create a DataFrame containing a range of numbers from 0 to 999, which we'll use
for further transformations.
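For example (spark.range(1000) produces the numbers 0 through 999 as a single-column DataFrame):

# Create a DataFrame of the numbers 0..999 in a column named "number".
my_range = spark.range(1000).toDF("number")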

Step 3: Applying Transformations


Let's apply a transformation that keeps only the even numbers in the DataFrame. Remember, transformations in Spark are lazily evaluated, which means no computation happens until an action is called.
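A sketch, continuing from the DataFrame above:

# A transformation: nothing is computed yet; Spark only records the plan.
divis_by_2 = my_range.where("number % 2 = 0")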

Step 4: Executing Actions


Now we will use an action to trigger the computation of the transformed data. We'll
count the number of even numbers found in our DataFrame.
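count() is an action, so this is the point where Spark actually executes the plan:

# An action: triggers execution of the recorded transformations.
print(divis_by_2.count())  # 500 (the even numbers between 0 and 999)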

Step 5: Running SQL Queries


To showcase Spark SQL, let's create a temporary view using our DataFrame and run a
SQL query against it.
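For example:

# Register the DataFrame as a temporary view so SQL queries can see it.
my_range.createOrReplaceTempView("numbers")

# The SQL query compiles to the same underlying plan as the DataFrame code.
spark.sql("SELECT COUNT(*) AS even_count FROM numbers WHERE number % 2 = 0").show()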

https://datavidhya.substack.com/p/apache-spark-101-for-data-engineering?r=1z3i2v&utm_campaign=post&utm_medium=web&t… 11/15
11/9/24, 21:52 Apache Spark 101 for Data Engineering

That's it! You just ran your first Spark code: you created a SparkSession and a DataFrame, applied a transformation and an action, and ran a SQL query on top of it.

This is just a quick guide; there are many more things you need to understand to learn Spark properly:

Structured API

Lower Level API

Running Spark Code in Production

Databricks and Delta Lake

Lakehouse Architecture


Distributed Shared Variables

and many more…

If you are interested in learning Apache Spark in depth, with two end-to-end projects on AWS and Azure, then check out my course, Apache Spark with Databricks for Data Engineers.

Don't forget to subscribe to the DataVidhya newsletter for high-quality content in the data field.

