Updated Data Engineering Syllabus 1
Module 1: SQL and Database Understanding
Topic / Duration (Minutes)
Understanding of transactional databases (MySQL, PostgreSQL), ACID properties, Basic DDL, DML, DCL 60
ER diagrams 30
Transaction, Concurrency Control 20
Indexing 15
Types of Keys, Join Operations, Group By, Case When Statements, Nested Queries 60
Triggers 15
Stored Procedures 20
Views and Materialized views 20
Window functions (Rank, Dense Rank, Row Number, Lag, Over Clause, Partition By, Sum, Avg, Min, Max, First Value, Last Value) 60
Running Sum or Average Related Queries (Rows Preceding, Unbounded) 30
Prepare SQL queries to practise different SQL
concepts
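A minimal practice sketch for the running sum and moving average topics, using Python's built-in sqlite3 module so that no database server is needed. The table name daily_sales, its columns, and the sample rows are hypothetical, and the window syntax assumes the bundled SQLite is version 3.25 or newer. The same query shape should carry over to MySQL 8+ and PostgreSQL.

```python
import sqlite3

# Hypothetical sample data: (sale_day, amount)
rows = [("2024-01-01", 100), ("2024-01-02", 50), ("2024-01-03", 25)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_day TEXT, amount INTEGER)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

query = """
SELECT sale_day,
       amount,
       -- running total: every row from the start up to the current row
       SUM(amount) OVER (
           ORDER BY sale_day
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total,
       -- moving average over the previous row and the current row
       AVG(amount) OVER (
           ORDER BY sale_day
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS moving_avg_2
FROM daily_sales
ORDER BY sale_day
"""
running = conn.execute(query).fetchall()

for row in running:
    print(row)
```

Changing the frame clause (for example to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) is a quick way to see how the window bounds affect each result.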
Module 2: Python
Basic Data Structures (List, Tuple, Dictionary), Conditional operations, Looping, Functions, Lambda Functions, List Comprehension, Command line arguments 120
Regex 45
Pandas Library 60
NumPy Library (Moderate Level) 45
JSON, CSV, Datetime, Boto3, Requests Libraries 60
MySQL Connector 30
Average-level coding questions in Python
Read CSV files stored in an S3 bucket using the Boto3 library, create a DataFrame using the pandas library, and perform different operations available in pandas on the created DataFrames 60
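One possible starting point for the S3 assignment is sketched below. It assumes pandas is installed and that the caller passes in an S3 client created with boto3.client("s3"); the function names read_csv_from_s3 and summarize are hypothetical helpers, not part of either library.

```python
import io

import pandas as pd


def read_csv_from_s3(s3_client, bucket, key):
    """Download one CSV object from S3 and parse it into a pandas DataFrame.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()  # raw bytes of the object
    return pd.read_csv(io.BytesIO(body))


def summarize(df, group_col, value_col):
    """Example operations: drop rows missing the value, then group and sum."""
    cleaned = df.dropna(subset=[value_col])
    return cleaned.groupby(group_col)[value_col].sum().sort_index()
```

With real credentials configured, a call such as read_csv_from_s3(boto3.client("s3"), "my-bucket", "data/orders.csv") would return the DataFrame; the bucket and key shown here are placeholders.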
Module 3: Big Data Terminologies
Technical understanding of Distributed Computation & Storage 15
Structured, Unstructured, Semi Structured Data 10
File Formats: CSV, JSON, Parquet, AVRO, ORC 10
Horizontal Vs Vertical Scaling 10
File Compression Techniques 10
Understanding of theoretical concepts mentioned
in the topics
Module 4: Data Warehousing
Facts 15
OLAP 10
Dimensions 15
Star Schema 15
Snowflake Schema 10
Data Model Types 15
Data Integrity 15
Metadata 10
Slowly Changing Dimensions 10
Data Warehouse Design Questions (e.g., Design Amazon's Data Warehouse) 30
Understanding of all theoretical concepts
Design a data warehouse for an e-commerce platform 45
Module 5: Hadoop
Complete Architecture 45
Map-Reduce Functioning 30
HDFS 45
YARN 45
Blocks, Splits, Maps, Data Spilling, Heartbeats, Data Replication, FS Image, Checkpointing, High Availability 45
Hadoop Daemons (NameNode, DataNode, Secondary NameNode, Standby NameNode) 30
Set up Hadoop in pseudo-distributed mode on your machine, store a large text file on HDFS, and write Map-Reduce code to count the frequency of each word 45
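Before writing the actual Hadoop job, the word-count logic can be rehearsed in plain Python. The sketch below is only an in-process simulation of the map, shuffle, and reduce phases; it does not use Hadoop, and all function names are hypothetical. A real job would typically be written in Java against the MapReduce API, or as separate mapper and reducer scripts run through Hadoop Streaming.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the quick fox", "The lazy dog and the fox"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Keeping the mapper and reducer as separate functions mirrors how the work is split across tasks in a cluster, which makes the later port to a real job more direct.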
Module 6: Apache Hive
Hive Installation 15
Query Syntax 60
Bulk Data Load 20
Internal Vs External Tables 20
Static & Dynamic Partitioning 20
Map Side Join 20
Hive SerDe 20
UDFs in Hive 20
Bucketing 15
Query Optimization 15
Set up Hive on a local machine
Create internal and external tables using data stored in HDFS
Perform a bulk load with dynamic partitioning
Use a Hive SerDe to create tables in Hive for JSON data
Write and apply a UDF in Hive to flatten a nested JSON file 120
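The flattening step of the last task can be prototyped outside Hive first. Hive UDFs are usually written in Java, so the Python below is only a sketch of the flattening logic itself; the function name flatten and the dotted key convention are assumptions. Logic like this could also be wrapped in a script and invoked through Hive's TRANSFORM clause.

```python
import json


def flatten(value, prefix="", sep="."):
    """Recursively flatten nested dicts and lists into a single-level dict.

    Nested keys are joined with `sep`; list items use their index as the key.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, child_prefix, sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            child_prefix = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(child, child_prefix, sep))
    else:
        # Leaf value (string, number, boolean, or null)
        flat[prefix] = value
    return flat


if __name__ == "__main__":
    record = json.loads('{"id": 1, "user": {"name": "a", "tags": ["x", "y"]}}')
    print(flatten(record))
```

Each flattened key can then map to one column of the target Hive table.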
Module 7: Tools Built on Top of Hadoop - Ecosystem
Apache Pig 60
Apache Sqoop 30
HBase (NoSQL Database) 60
Apache Flume 30
Apache Airflow 60
Use Sqoop to transfer data incrementally from a MySQL database to HDFS, then use HBase to create a table on the file stored in HDFS 60
Module 8: Scala
Basic Data Structures, Looping, If-Else, Conditional Statements, Pattern Matching 20
Functions, Higher Order Functions 30
Scala OOPs Concepts 45
Scala Traits 15
Scala Access Modifiers 20
Exception Handling 20
Scala Collections 45
Multithreading 30
Average-level coding questions in Scala
Module 9: Spark
Complete Architecture 60
Spark Core, Spark SQL 60
DataFrames 60
Datasets 15
RDDs 30
Spark Read/Write operations 20
Lineage Graph, Lazy Evaluation 20
Actions, Transformations, Optimized Joins, Broadcast Variables, Accumulators 60
Understanding of Spark UI, Stages, Tasks 20
Spark Submit Command Options 20
Job optimization techniques 30
Spark Catalyst Optimizer 15
Static and Dynamic Resource allocation 30
Understanding Memory Usage in Spark
a) Cache & Persist 30
b) Java Serializer vs Kryo Serializer
Set up Spark in local mode
Write a Spark application to read a CSV file and apply transformations using Spark Core/Spark SQL
Understand the different parameters of the spark-submit command and different optimization techniques 120
Module 10: Kafka
Producer 30
Consumer 30
Kafka Cluster, Cluster Setup, Brokers 60
Topics, Partitioning, Offset, Polling, Data Replication, Data Retention 60
Consumer Group 30
ZooKeeper 20
Create a real-time data pipeline using MySQL as the source for an incremental data stream, Apache Kafka as the messaging queue, and Spark Streaming for data transformation. Store the transformed real-time data in any NoSQL database for analytical queries 105
Module 11: Cloud Essentials & Fundamentals of Azure
Different Azure Services (IaaS, PaaS, SaaS) 45
Azure Managed Identity and Active Directory management 30
Azure Network Security Group with different deployment models (Public, Private, Hybrid) 30
Microsoft Azure Key Vault 20
Azure Monitor with cost calculator 20
Azure CLI Commands 45
Azure Virtual Machine 30
Module 12: Big Data in Azure
Azure Blob/Queue/Table Storage 20
Azure SQL Database 30
Azure Data Lake Gen1/Gen2 30
Azure Synapse Analytics 60
Azure Cosmos DB 60
Azure Data Factory (Data Pipeline) 60
Azure Event Hubs 30
Azure Databricks (Data Processing) 30
Azure Scaling and Monitoring 30
Create a Lambda function that copies files from a source Azure blob to a destination database; access to the blob should be IAM-role based. Schedule the Lambda using a CloudWatch rule. 60
Set up a virtual machine instance and write a shell script to read a text file from an Azure blob; use CLI commands for the file transfer. 90
Connect a Databricks notebook to Azure cloud to run PySpark code for processing.