Lec 01 - DATA 101 Sp24 - Welcome To Data Engineering!

This document provides an overview and introduction for a data engineering lecture. It discusses: - Enrollment details for the course, which can accommodate 75 students total across graduate and undergraduate sections. - An introduction to the instructor, Aditya Parameswaran, who has a PhD from Stanford and focuses his research on building scalable data tools. - A definition of data engineering as a set of activities including collecting, organizing, and processing data, which is often a prerequisite for data science work. - Reasons for learning data engineering, including that it occupies most time in data science projects, there are more data engineer jobs than data scientist jobs, and it is essential for enabling machine learning applications.


LECTURE 01

Welcome to Data Engineering!


(INFO 258/DATA 101)
January 22, 2024

Data 101, Spring 2024 @ UC Berkeley


Aditya Parameswaran https://data101.org/sp24

1
[Enrollment & Logistics] Enrollment is Ongoing; Room can only take 50
● This class can accommodate only 75 students across INFO 258 (for grads) and DATA 101 (for
undergrads)
○ “Skeleton crew” staff of 2 (compared to 8 in Fa23!)
● INFO258: Originally 22 enrolled → Requested enrollment of 30
○ (Thank you if you replied to my email about prerequisites).
○ Remaining will be admitted if there’s room
● DATA101: Originally 25 enrolled → Requested enrollment of 45, so 20 more students will be
enrolled off the waitlist; remaining if there’s room
● Other enrollment concerns:
○ Limited capacity, so no CE students will be admitted
○ No auditing to keep staff workload low.
○ Room capacity is 50 (but lectures will be broadcast on Zoom and recorded), so if we go past
capacity, we will have to turn you away to avoid violating fire codes

All enrollment questions are handled by Data Science Undergraduate Studies (ds-enrollments@berkeley.edu) and I
School Enrollment Staff (studentaffairs@ischool.berkeley.edu), not instructors. Please email them with any
questions, concerns, or requests for exceptions.
2
A note about INFO 258/DATA 101
You’ll hear the two used interchangeably - often I might end up using “DATA 101” since it’s a simpler
number

For all practical purposes these are the same class, modulo:
● The grad/undergrad distinction
● One extra project for the grad students
● Grading done separately for grads and undergrads

3
Intro - Aditya Parameswaran

● Ph.D. in Computer Science from Stanford University (2013).


● Postdoc at MIT (2014).
● Assistant/Associate Prof. since 2014; at UC Berkeley 2019-now.

● Building better (more scalable, usable, intelligent) data tools


○ Develop (and open-source) tools & write papers about them
○ Tools include: spreadsheet & visualization systems, data science
libraries, computational notebooks
○ Open-source tools downloaded millions of times
○ Even sometimes start companies!

● Second time teaching this class!


○ Joe H and I originally developed this class in 2021 (offered 2x)
○ It’s a very new course! Lots of experts at Cal!
● Random facts about me:
○ I have three creatures who ensure I don’t get quality sleep at night:
two foster cats, and my hyperactive toddler
Our wonderful Spring 2024 Course Staff

Natalie Chan Mackenzie Moffit

5
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Data Engineering:
What? Why?
Lecture 01, Data 101 Spring 2024

6
Data Science: The Conventional View
Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean “rectangular” shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.

● From Data 100


● Also the view reinforced by “popular”
Machine Learning, e.g., leaderboards and
Kaggle competitions
● A valuable component, but sadly,
missing the complete picture!

7
Data Science: The Conventional View, Now with Data Engineering

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean “rectangular” shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by “popular” Machine Learning, e.g., leaderboards and Kaggle competitions
● A valuable component, but sadly, missing the complete picture!

Data Science today involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Happens on a large set of messy (often non-rectangular), dynamic, and large datasets
● Happens across teams and across the organization
● The team generating the data may not be the same team(s) consuming it
● The objectives are often rather unclear and ill-defined
● A prerequisite (and typically, precursor) to real-world data science & ML
● A lot of data engineering needs to happen to support the conventional view!

Data systems are tools that support data engineering.
8
The Data Science Industry Now
…once these junior people get to the market, they come in with an unrealistic set of expectations about what data science work will look like. Everyone thinks they’re going to be doing machine learning, deep learning, …

This is not their fault; this is what data science curriculums [sic] and the tech media emphasize….

The reality is that “data science” has never been as much about machine learning as it has about cleaning, shaping data, and moving it from place to place.

Vicky Boykis, 2019 [blog].

I personally like 2 more than 1!
9
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.

● Most of the time spent in real-world data science projects involves data engineering.
● Often underappreciated compared to other
activities, e.g., ML.

10
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.

● Most of the time spent in real-world data science projects involves data engineering.
● Often underappreciated compared to other activities, e.g., ML.
● Data engineering activities, e.g., cleaning, moving, and processing data, occupy the majority of time in data science.

11
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

“… 70% more open roles at companies in data engineering as compared to data science. As we train the next generation of data and ML practitioners, let’s place more emphasis on engineering skills.”
Mihail Eric, Jan 2021 [blog].

“Data engineer” has emerged as a new specialized job category:
● Data scientist: Uses various techniques in statistics & ML to process & analyze data.
● Data engineer: Develops a robust and scalable set of data processing tools/platforms.

We’re not going to be too dogmatic about these distinctions, but it’s worth knowing what industry envisions.
12
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

Even bolder claim: data science roles may disappear!?!
“Many data science teams have not delivered results
that can be measured in ROI by executives.”
Forbes, Feb 2019. [blog]
Many teams have struggled because they can do “ML”
but can’t do data engineering to get to “ML”
“For complex data engineering tasks, you need five data engineers for every one data scientist.”

Essential idea: ML is the easy part (perhaps even more so, given LLMs!) → but can’t be done without data
engineering and data engineers

13
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

Even when doing ML, the vast majority of an ML-powered system is not “ML code.”
In most cases, “ML code” corresponds to calls to standard libraries, e.g., scikit-learn, PyTorch, TensorFlow, etc.

The hard part is getting the data into the format and quality that these ML libraries expect!
Sculley et al., SE4ML 2014 [google research].
14
Data Engineering is Essential in ML/AI

Monica Rogati, 2017 [blog].

Stuff you need to do first! A lot of this is data engineering.
In fact, for any sort of data-driven decision-making (ML/AI or not) you will need these skills.

15
Data Engineering is Essential in ML/AI

Monica Rogati, 2017 [blog].

“More often than not, companies are not ready for AI. Maybe they hired their first data scientist to less-than-stellar outcomes, or maybe data literacy is not central to their culture. But the most common scenario is that they have not yet built the infrastructure to implement (and reap the benefits of) the most basic data science algorithms and operations, much less ML.”

“However, under the strong influence of the current AI hype, people try to plug in data that’s dirty & full of gaps, that spans years while changing in format and meaning, that’s not understood yet, that’s structured in ways that don’t make sense, and expect those tools to magically handle it.”

16
New role: Machine Learning Engineer

Tomasz Dudek, 2018 [blog].

“ML Engineer”: a specialization of data engineer focused on operationalizing ML.

“A need for a person that would reunite two warring parties. One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running. Somebody taking data scientists’ code and making it more effective and scalable. ... Explaining the reasons behind architectural ideas to the devops team.”
17
Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms…
…but you are likely less familiar with systems.

● In this class, you will learn systems and the infrastructure that enables these techniques.
● You’ll start thinking about efficiency, especially on large datasets.
● Various “plumbing analogies”: data pipelines, data flows, …
Data engineering is as essential as plumbing!
● When it works well, you don’t realize it exists.
● When it doesn’t, you’ll really know.
18
All these Data Systems!!!

2023 MAD (ML/AI/Data) Landscape: blog, interactive
19


2023 MAD (ML/AI/Data) Landscape
Data systems is a difficult subject! There are many, many data
systems – too many for us to cover.

● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use what type of system.

2023 MAD (ML/AI/Data) Landscape: blog, interactive
20


The Bottom Line
Data engineering is an essential ingredient of real-world data science projects.

A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
The backbone, plumbing, or infrastructure that supports data science.

Understanding these skills will help you…:


● Apply skills from intro data science classes to messy, large real-world datasets;
● Get your datasets to the point where you can apply AI/ML;
● Explore new, sought-after, & specialized roles, e.g., data engineer/ML engineer;
● Make informed decisions within the vast and confusing landscape of data systems; and
● Start worrying about efficiency :-)
21
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Course Trajectory
Lecture 01, Data 101 Spring 2024

22
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices! Will sample from both!

[Figure: logos of data systems and companies, with captions:]
● Stanford; Founded 2003, IPO, Acq. 2019 (Salesforce)
● MIT; Founded 2005, Acq. 2011 (HPE)
● Founded 2022 based on DuckDB from CWI, $50M of funding
● Founded 1996, one of the most popular open-source databases, with many startups & established co. offerings
● Founded 2013, Acq. 2022 (Alteryx)
● Founded 2013 based on Apache Spark; one of the hottest pre-IPO startups
● Founded 2019, $60+M raised
● Founded 2021, Acq. 2023 (Snowflake)
23
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices!

Two general foundational approaches:

Code-centric
Main Storage API is files
● AWS S3, Azure File Storage, Google Cloud Storage, HDFS, …
Libraries in general-purpose programming languages, lots of separation
● Spark (Scala/Java) for batch processing
● Ad hoc code (Python/pandas) for exploration
● Metadata tracked in a separate store

Query-centric
?

24
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices!

Two general foundational approaches:


Code-centric
Main Storage API is files
● AWS S3, Azure File Storage, Google Cloud Storage, HDFS, …
Libraries in general-purpose programming languages, lots of separation
● Spark (Scala/Java) for batch processing
● Ad hoc code (Python/pandas) for exploration
● Metadata tracked in a separate store

Query-centric
Main Storage API is tables
● Snowflake, BigQuery, Redshift, Azure Synapse
● Teradata (founded 1979, still relevant!!)
One language/paradigm for (almost) everything
● Batch: SQL
● Interactive: SQL
● Metadata auto-tracked in database
● Other bytestream data stored in files (e.g. S3)
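
To make the contrast concrete, here is a minimal sketch (not from the lecture) of one tiny task done both ways. It assumes Python's built-in csv and sqlite3 modules as stand-ins for a file store and a database; the file name, the `enrollment` table, and its column names are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

# Illustrative data: requested enrollment per course listing.
rows = [("INFO 258", 30), ("DATA 101", 45)]

# Code-centric: the storage API is a file, and the schema lives in our code.
path = os.path.join(tempfile.mkdtemp(), "enrollment.csv")  # hypothetical file
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="") as f:
    # We have to remember that the second field is an integer;
    # the file itself does not record that.
    file_total = sum(int(seats) for _course, seats in csv.reader(f))

# Query-centric: the storage API is a table, and the database tracks the schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (course TEXT, seats INTEGER)")
conn.executemany("INSERT INTO enrollment VALUES (?, ?)", rows)
table_total = conn.execute("SELECT SUM(seats) FROM enrollment").fetchone()[0]

# The metadata can be read back from the database itself.
schema = [(name, col_type)
          for _cid, name, col_type, *_rest
          in conn.execute("PRAGMA table_info(enrollment)")]

print(file_total, table_total, schema)
```

Both paths compute the same total; the difference is where the knowledge about column names and types lives.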
25
Our approach: Query-centric but Open-Minded
Based on formal (theory) query languages: Relational Algebra and Relational Calculus
● Decades of research and development
● RA: procedural, RC: declarative (describe outputs, not algorithms)

Structured Query Language (SQL): A domain-specific language for data
● Same language for batch (e.g., transformation) and interactive (e.g., queries)
● Declarative complement to general-purpose languages (which are often imperative)
○ Abstraction: No “overfitting” of code to the task at hand
○ Huge plus for cloud environment: dynamically changing workloads, hardware, data
● Even code-centric libraries increasingly include SQL-like interfaces (e.g., SparkSQL)
○ The pendulum has swung back in favor of SQL from the NoSQL movement of the 00s
● Decades of extensions, tools, and support

…but nothing’s perfect!
● In practice, for both exploration and data engineering, you will need extra tools beyond SQL.
● But most recent open-source alternatives are similar.

We’ll teach you the concepts using PostgreSQL (a flavor of SQL). In practice, you’ll be able to apply these concepts to new tools.
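
The procedural/declarative distinction above can be sketched as follows. This is an illustration under stated assumptions, not course material: the `datasets` relation and the `select`/`project` helpers are hypothetical, and Python's built-in sqlite3 stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical relation: one dict per tuple.
datasets = [
    {"name": "clicks", "n_rows": 5000, "owner": "web"},
    {"name": "orders", "n_rows": 120, "owner": "sales"},
    {"name": "visits", "n_rows": 9000, "owner": "web"},
]

def select(relation, predicate):
    """Relational-algebra selection: keep tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    """Relational-algebra projection: keep only the named attributes.

    List-based sketch: true relational algebra works on sets,
    so duplicates would also be eliminated.
    """
    return [{a: t[a] for a in attrs} for t in relation]

# Procedural style: we pick the operator order ourselves (select, then project).
ra_result = project(select(datasets, lambda t: t["n_rows"] > 1000), ["name"])

# Declarative style: we describe the output and the engine chooses the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT, n_rows INTEGER, owner TEXT)")
conn.executemany(
    "INSERT INTO datasets VALUES (:name, :n_rows, :owner)", datasets
)
sql_result = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM datasets WHERE n_rows > 1000 ORDER BY name"
    )
]

print([t["name"] for t in ra_result], sql_result)
```

The SQL version says nothing about evaluation order, which is what leaves the engine free to adapt as workloads, hardware, or data change.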
26
Class Journey
Relational Model and Algebra
Advanced SQL queries (views, subqueries, window functions, …) Project 1
DML, DDL
Referential integrity, index selection, performance tuning Project 2
Data transformation and preparation + Project 5
Data wrangling and cleaning Project 3
Non-relational data models (Tensors, Spreadsheets, etc.)
Semistructured data (and mongoDB) Project 4

Important data engineering topics: ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing, security and privacy, data pipelines, …

(note: topics are grouped by theme and not to scale with + not in order of the class schedule)
27
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Course Logistics
Lecture 01, Data 101 Spring 2024

28
Syllabus Walkthrough

https://data101.org/sp24/syllabus

29
Beginning-of-Semester Logistics
Discussion Sections
Start this Thursday 1/25!
Not recorded, but there will be a live Zoom link, and
handouts/solutions will be posted.
Again, “in person” will be capped at room capacity

Office Hours
● Just my office hours (was this AM)
● TA office hours start next week.

30
We are in this class together!
Some of the content will be half-baked & experimental! Please bear with the hiccups.
● i.e., not taught in typical "database" classes.
○ Wherever possible we will emphasize underlying concepts…
…but some of what we say will also be practical advice.
● First time I’m teaching a lecture-based class since Spring 2021 (when we piloted Data 101!)

With all of that said:
● You will be evaluated generously.
● Our goal is for you to learn the material, not to stress you out.
● If you are feeling lost, please reach out. It is much better to do so than to violate our trust.
● This is especially true given the experimental nature of the class, our small staff size, and the state of the world.

Use the Extenuating Circumstances form!

We welcome feedback at any time about the course. Contact course staff at data101@berkeley.edu or stop by office hours.
31
