Lec 01 - DATA 101 Sp24 - Welcome To Data Engineering!
[Enrollment & Logistics] Enrollment is Ongoing; Room can only take 50
● This class can accommodate only 75 students across INFO 258 (for grads) and DATA 101 (for
undergrads)
○ “Skeleton crew” staff of 2 (compared to 8 in Fa23!)
● INFO 258: Originally 22 enrolled → Requested enrollment of 30
○ (Thank you if you replied to my email about prerequisites.)
○ Remaining students will be admitted if there’s room
● DATA 101: Originally 25 enrolled → Requested enrollment of 45, so 20 more students will be enrolled off the waitlist; remaining students will be admitted if there’s room
● Other enrollment concerns:
○ Limited capacity, so no CE students will be admitted
○ No auditing to keep staff workload low.
○ Room capacity is 50 (lectures will be broadcast on Zoom and recorded), so if we go past capacity, we will have to turn you away to avoid violating fire codes
All enrollment questions are handled by Data Science Undergraduate Studies (ds-enrollments@berkeley.edu) and I School Enrollment Staff (studentaffairs@ischool.berkeley.edu), not by the instructors. Please email them with any questions, concerns, or requests for exceptions.
A note about INFO 258/DATA 101
You’ll hear the two used interchangeably; I will often end up saying “DATA 101” since it’s a simpler number.
For all practical purposes these are the same class, modulo:
● The grad/undergrad distinction
● One extra project for the grad students
● Grading done separately for grads and undergrads
Intro - Aditya Parameswaran
Data Engineering: What? Why?
Course Trajectory
Course Logistics
Data Engineering: What? Why?
Lecture 01, Data 101 Spring 2024
Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean “rectangular” shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.
Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean “rectangular” shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by “popular” Machine Learning, e.g., leaderboards and Kaggle competitions
● A valuable component, but sadly, missing the complete picture!
Now with Data Engineering
Data Science today involves Data Engineering: A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Happens on a large set of messy (often non-rectangular) dynamic and large datasets
● Happens across teams and across the organization
● The team generating the data may not be the same team(s) consuming it
● The objectives are often rather unclear and ill-defined
● A prerequisite (and typically, precursor) to real-world data science & ML
● A lot of data engineering needs to happen to support the conventional view!
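To make "non-rectangular" concrete, here is a minimal Python sketch of one small data engineering step: cleaning and flattening nested records into rectangular rows. It is an illustration only, not material from the lecture; the sample records, the field names, and the `flatten_orders` helper are all hypothetical.

```python
import json

# Hypothetical raw feed: nested, inconsistent, and not rectangular.
RAW = """
[
  {"user": "ana", "orders": [{"item": "book", "price": "12.50"}, {"item": "pen", "price": "1.20"}]},
  {"user": "  Bo ", "orders": []},
  {"user": "cy", "orders": [{"item": "lamp", "price": null}]}
]
"""


def flatten_orders(records):
    """Turn nested user/orders records into flat (user, item, price) rows."""
    rows = []
    for record in records:
        # Cleaning: normalize whitespace and case in user names.
        user = record["user"].strip().lower()
        for order in record["orders"]:
            price = order["price"]
            rows.append({
                "user": user,
                "item": order["item"],
                # Transforming: parse price strings; keep missing values as None.
                "price": float(price) if price is not None else None,
            })
    return rows


rows = flatten_orders(json.loads(RAW))
for row in rows:
    print(row)
```

Every output row has the same three columns, which is the "rectangular" shape the conventional view takes for granted. A user with no orders produces no rows in this sketch; whether to keep such users is exactly the kind of ill-defined decision the slide alludes to.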
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
Essential idea: ML is the easy part (perhaps even more so, given LLMs!) → but can’t be done without data
engineering and data engineers
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
Data Engineering is Essential in ML/AI
New role: Machine Learning Engineer
● In this class, you will learn systems and the infrastructure that enables these techniques.
● You’ll start thinking about efficiency, especially on large datasets.
● Various “plumbing analogies”:
data pipelines, data flows, …
Data engineering is as essential as plumbing!
● When it works well, you don’t realize it exists.
● When it doesn’t, you’ll really know.
All these Data Systems!!!
Course Trajectory
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meet best practices! We will sample from both!
Code-centric vs. Query-centric
Code-centric:
● Main storage API is files
○ AWS S3, Azure File Storage, Google Cloud Storage, HDFS, …
● Libraries in general-purpose programming languages, lots of separation
○ Spark (Scala/Java) for batch processing
○ Ad hoc code (Python/pandas) for exploration
○ Metadata tracked in a separate store
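The code-centric versus query-centric contrast can be sketched with one small aggregation written both ways. This is a minimal illustration using only the Python standard library, not code from the course. The query-centric half is an assumption about what that label means here: a declarative SQL query run by a database (SQLite, chosen only because it ships with Python). The data and the function names are hypothetical.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical file contents; in a code-centric stack this might sit in object storage.
CSV_TEXT = """city,amount
Berkeley,10
Oakland,5
Berkeley,7
"""


def totals_code_centric(csv_text):
    """Code-centric: read the file and aggregate in application code."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["city"]] += int(row["amount"])
    return dict(totals)


def totals_query_centric(csv_text):
    """Query-centric: load the data into a database and state the result declaratively."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
    rows = [(r["city"], int(r["amount"])) for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # The query says what result is wanted; the database decides how to compute it.
    result = dict(conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city"))
    conn.close()
    return result


code_result = totals_code_centric(CSV_TEXT)
query_result = totals_query_centric(CSV_TEXT)
print(code_result)
print(query_result)
```

Both functions return the same totals. The difference is where the logic lives: in the first, the program spells out the loop over the file; in the second, the loop is the database's job and the program only supplies the query.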
Roots and Foundations
(Note: topics are grouped by theme; they are not to scale and not in the order of the class schedule.)
Course Logistics
Syllabus Walkthrough
https://data101.org/sp24/syllabus
Beginning-of-Semester Logistics
Discussion Sections
Start this Thursday, 1/25!
Not recorded, but there will be a live Zoom link, and handouts/solutions will be posted.
Again, in-person attendance will be capped at room capacity.
Office Hours
● This week: just my office hours (held this morning).
● TA office hours start next week.
We are in this class together!
Some of the content will be half-baked & experimental! Please bear with the hiccups.
● i.e., not taught in typical "database" classes.
○ Wherever possible we will emphasize underlying concepts…
…but some of what we say will also be practical advice.
● This is the first time I’m teaching a lecture-based class since Spring 2021 (when we piloted Data 101!)