Data Management for Machine Learning
Course Description
Data Models and Query Languages: Relational, Object-Relational, NoSQL data models; Declarative (SQL)
and Imperative (MapReduce) Querying; Data Encoding: Evolution, Formats, Models of dataflow; Machine
learning workflow; Data management challenges in ML workflow; Data Pipelines and patterns; Data Pipeline
Stages: Data extraction, ingestion, cleaning, wrangling, versioning, transformation, exploration, feature
management; Modern Data Infrastructure: Diverse data sources, Cloud data warehouses and lakes, Data
Ingestion tools, Data transformation and modelling tools, Workflow orchestration platforms; ML model
metadata and Registry, ML Observability, Data privacy and anonymity.
Course Objectives
CO1 Introduction to the data models, storage systems and query languages used in data management,
with emphasis on machine learning aspects
CO2 Guidance on the architecture of modern data platforms and on the types and usage of data pipelines
CO3 Hands-on exposure to the common techniques and tools used by data engineers to build,
test, deploy and automate machine learning pipelines
CO4 Exposure to the industry best practices for dealing with data privacy, metadata and
observability
Text Book(s)
T1 Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley
T2 Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd
Underwood
Learning Outcomes:
LO1 Understand the necessity, position and role of the data management components that appear in
modern data stacks
LO2 Recognize the patterns, challenges and possible solutions associated with data ingestion,
flow, storage and processing on data platforms
LO3 Gain experience in designing and handling the dataflow in a machine learning pipeline by
means of state-of-the-art tools
LO4 Apply the acquired data management concepts and practices to a real-world
machine learning workflow, addressing the model metadata, privacy and monitoring aspects
Glossary of Terms
Contact Hour (CH): an hour-long live session with students, conducted either in a physical
classroom or enabled through technology.
In this model of instruction, instructor-led sessions will total 32 CH.
Module Summary
No. Content of the Module
M1 Foundations of data management
Detailed Structure
Post CS LE Lab 1
Post CS LE Lab 3, 4
Lab Topic
1 Design and implement simple data flows involving various data formats (Virtual Labs)
Modes of data flow:
a) Through databases – use SQL or a custom program to read from and write to databases
b) Through REST/RPC – synchronous mechanism for data exchange
c) Through message brokers / queues – asynchronous mechanism for data exchange
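As a rough illustration of modes (a) and (c), the sketch below moves the same records first through a database and then through a queue. It is not the prescribed lab solution. It assumes an in-memory SQLite database in place of a real database server and Python's `queue.Queue` in place of a real message broker. The table name `readings`, the record fields and both function names are hypothetical. Mode (b) is left out because it needs a running HTTP or RPC server.

```python
import json
import queue
import sqlite3
import threading


def database_flow(records):
    """Dataflow through a database: one step writes rows, a later step reads them back."""
    # Assumption: an in-memory SQLite database stands in for a real database server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings (sensor, value) VALUES (?, ?)",
        [(r["sensor"], r["value"]) for r in records],
    )
    conn.commit()
    # The reader only depends on the table schema, not on the writer's code.
    rows = conn.execute(
        "SELECT sensor, value FROM readings ORDER BY sensor"
    ).fetchall()
    conn.close()
    return [{"sensor": sensor, "value": value} for sensor, value in rows]


def queue_flow(records):
    """Asynchronous dataflow: a producer enqueues JSON messages, a consumer thread drains them."""
    # Assumption: queue.Queue stands in for a message broker.
    broker = queue.Queue()
    received = []

    def consumer():
        while True:
            message = broker.get()
            if message is None:  # sentinel: no more messages
                break
            received.append(json.loads(message))

    worker = threading.Thread(target=consumer)
    worker.start()
    for record in records:
        # The producer encodes and enqueues without waiting for the consumer.
        broker.put(json.dumps(record))
    broker.put(None)
    worker.join()
    return received


if __name__ == "__main__":
    sample = [{"sensor": "b", "value": 2.5}, {"sensor": "a", "value": 1.0}]
    print(database_flow(sample))
    print(queue_flow(sample))
```

In both cases the writer and the reader share only an encoding (a table schema or a JSON message), which is why the choice of data format matters for each mode.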
a) Projects
b) Experiments
c) Model metadata
d) Model tracking / logging
e) Model Registry
Evaluation Scheme:
It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the online lectures, and take all the prescribed evaluation components such
as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.