
Data Management for Machine Learning

The document outlines the course design for 'Data Management for Machine Learning' at Birla Institute of Technology & Science, Pilani, detailing its objectives, content, and evaluation scheme. It covers various data models, querying languages, and modern data infrastructure, while providing hands-on experience with data pipelines and machine learning workflows. The course includes a modular structure with specific learning outcomes and evaluation components, emphasizing the importance of data management in machine learning.


BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES

Part A: Course Design

Course Title      Data Management for Machine Learning
Course No(s)      DSE* ZG529 / AIML* ZG529
Credit Units      4
Content Authors   Pravin Y Pawar
Version           1.1

Course Description

Data Models and Query Languages: Relational, Object-Relational, NoSQL data models; Declarative (SQL)
and Imperative (MapReduce) Querying; Data Encoding: Evolution, Formats, Models of dataflow; Machine
learning workflow; Data management challenges in ML workflow; Data Pipelines and patterns; Data Pipeline
Stages: Data extraction, ingestion, cleaning, wrangling, versioning, transformation, exploration, feature
management; Modern Data Infrastructure: Diverse data sources, Cloud data warehouses and lakes, Data
Ingestion tools, Data transformation and modelling tools, Workflow orchestration platforms; ML model
metadata and Registry, ML Observability, Data privacy and anonymity.

Course Objectives

The course aims at providing:

CO1 An introduction to the data models, storage systems and query languages used in data management,
with emphasis on machine learning aspects

CO2 Guidance on the architecture of modern data platforms and on the usage and types of data pipelines

CO3 Hands-on exposure to the common techniques and tools used by data engineers to build,
test, deploy and automate machine learning pipelines

CO4 Exposure to the industry best practices essential for dealing with data privacy, metadata and
observability

Text Book(s)

T1 Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Reis and Housley

T2 Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd
Underwood

Reference Book(s) & other resources

R1 Designing Data-Intensive Applications by Martin Kleppmann

R2 Data Pipelines Pocket Reference by Densmore


R3 Building Machine Learning Pipelines by Hapke, Nelson

Learning Outcomes:

Students will be able to:

LO1 Understand the necessity, position and role of the data management components that appear in
modern data stacks

LO2 Recognize the patterns, challenges and possible solutions associated with data ingestion,
flow, storage and processing on data platforms

LO3 Gain experience in designing and handling the dataflow in a machine learning pipeline by
means of state-of-the-art tools

LO4 Apply the acquired conceptual data management knowledge and practices to a real-world
machine learning workflow, addressing the model metadata, privacy and monitoring aspects

Part B: Course Handout

Academic Term     II Semester 2022-2023
Course Title      Data Management for Machine Learning
Course No         DSE* ZG529 / AIML* ZG529
Lead Instructor   Pravin Y Pawar

Glossary of Terms

Module             M    A module is a standalone quantum of designed content. A typical course is
                        delivered using a string of modules. M2 means module 2.

Contact Hour       CH   Contact Hour (CH) stands for an hour-long live session with students,
                        conducted either in a physical classroom or enabled through technology.
                        In this model of instruction, instructor-led sessions will be for 32 CH.

Recorded Lecture   RL   RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
                        student through an online portal. A given RL unfolds as a sequence of
                        video segments interleaved with exercises.

Lab Exercises      LE   Lab exercises associated with various modules

Self-Study         SS   Specific content assigned for self-study

Homework           HW   Specific problems/design/lab exercises assigned as homework


Modular Structure

Module Summary

No.   Content of the Module
M1    Foundations of data management
M2    Modern Data Platform
M3    Data Management in ML Workflow
M4    Advanced Topic in Data Management

Detailed Structure

M1: Foundations of data management

Contact Session 1-2

Session   Type        Description/Plan                       Reference
1         CH1, CH2    - Data Management Principles           T2
                      - Data Management Components
2         CH3, CH4    - Data Models and Query Languages      R1
                      - Data Encoding                        T1
Post CS   LE          - Lab 1

M2: Modern Data Platform

Contact Session 3-4

Session   Type        Description/Plan                           Reference
3         CH5, CH6    - Data Architectures                       T1
                      - Modern Data Stack
                      - Data Pipelines and patterns
4         CH7, CH8    - Data Storage                             T1
                      - Data Science Infrastructure
                      - Serving Data for Analytics and ML
Post CS   LE          - Lab 2

M3: Data Management in ML Workflow

Contact Session 5-12

Session   Type          Description/Plan                                 Reference
5         CH9, CH10     ML Workflow/lifecycle                            T2, R3
                        - Data Pipeline vs ML Pipeline
                        - ML Pipeline Stages
                        - Training / Serving pipeline
                        - Data management challenges in ML workflow
6         CH11, CH12    Data Collection / Ingestion                      T1
                        - Diverse data sources
                        - Data generation in source systems
                        - Batch Ingestion
                        - Message and Stream Ingestion
                        - Ingestion strategies
7         CH13, CH14    Data Validation                                  R3
                        - Common problems with data
                        - Data skew and drift
                        - Bias and Fairness
                        - Data leakage
                        - Data validation approaches
8         CH15, CH16    Analytics Engineering                            Instructor-supplied material
                        - Data Integration
                        - Data Transformation
                        - Data Partitioning
                        - Data Versioning
                        - Test data management challenges
9         CH17, CH18    Data Analysis                                    Instructor-supplied material
                        - Types of Analytics
                        - Data Exploration and Visualizations
                        - Data Cubes and OLAP
                        - Data Cube Operations
                        - Data Cubes and ML
10        CH19, CH20    Feature Preparations                             T2
                        - Feature life cycle
                        - Data Annotation / labeling
                        - Data augmentation and Data Synthesis
                        - Common Feature Engineering Operations
                        - Feature Importance
                        - Feature Generalization
                        - Feature Stores
11-12     CH21-CH24     ML Experimentation & Metadata                    Instructor-supplied material
                        - Model training & experimentation
                        - Model Analysis & Validation
                        - ML Metadata Store
                        - Dataset, Feature, Label, Pipeline metadata
                        - ML Experiment Tracking data
                        - ML model metadata and Registry
Post CS   LE            - Lab 3, 4

M4: Advanced Topic in Data Management

Contact Session 13-16

Session   Type          Description/Plan                                   Reference
13        CH25, CH26    Distributed Data Processing
                        - Big Data Analytics
                        - Technologies for big data processing
                        - Distributed and Parallel data processing
                        - In-memory data processing
                        - Hadoop, Spark, Kafka as exemplar architecture
14        CH27, CH28    Data Privacy and anonymity                         T2, Instructor-supplied material
                        - Data privacy issues
                        - Differential privacy
                        - Anonymization
                        - Methods to preserve privacy
                        - Federated learning
                        - Encrypted ML
15        CH29, CH30    Data Observability                                 T2, Instructor-supplied material
                        - Data Observability
                        - Data downtime
                        - Five pillars
                        - Tools selection
16        CH31, CH32    ML Monitoring & Observability                      T2, Instructor-supplied material
                        - Causes of ML System failure
                        - Data Distribution Shifts
                        - Problems with ML Production Monitoring
                        - ML-specific metric
                        - Monitoring Toolbox
                        - Monitoring vs Observability
Post CS   SS            - To be identified

Experiential Learning Component

Lab   Topic

1     Design and implement simple data flows involving various data formats      Virtual Labs
      Modes of data flows
      a) Through Databases – use SQL / Custom Program to read/write into databases
      b) Through REST/RPC – Synchronous mechanism for data exchange
      c) Through Message Brokers / Queues – Asynchronous mechanism for data exchange

2     Build a Modern Data Stack                                                   Virtual Labs
      Components
      a) A fully managed ELT data pipeline
      b) A cloud-based columnar warehouse or data lake as a destination
      c) A data transformation tool
      d) A business intelligence or data visualization platform

3     Manage Machine Learning Model Metadata using MLFlow / Neptune               Virtual Labs
      Components
      a) Projects
      b) Experiments
      c) Model metadata
      d) Model tracking / logging
      e) Model Registry

4     Construct a Machine Learning Pipeline with Data Versioning Tool             Virtual Labs
      Components
      a) Data Pipeline
      b) Data Versioning Tool
      c) Feature Store
      d) ML Pipeline
      e) Prediction Service

Evaluation Scheme:

Legend: EC = Evaluation Component; AN = After Noon Session; FN = Fore Noon Session

No     Name                                          Type          Duration   Weight   Day, Date, Session, Time
EC-1   Experiential learning Assignment / Quiz -I    Take Home     15 days    10%      TBA
       Experiential learning Assignment-II           Take Home     15 days    20%      TBA
EC-2   Mid-Semester Test                             Closed Book   2 hours    30%      Per programme schedule
EC-3   Comprehensive Exam                            Open Book     3 hours    40%      Per programme schedule

Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 7


Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)

Important links and information:


Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the latest
announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the Elearn portal.
Evaluation Guidelines:
1. EC-1 consists of two assignments. Announcements will be made available on the portal in a timely
manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed or bound) is
permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all
exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam which will be made available on the
Elearn portal. The Make-Up Test/Exam will be conducted only at selected exam centres on the dates to
be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the online lectures, and take all the prescribed evaluation components such
as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.
