Data Management for Machine Learning

The document outlines the course design for 'Data Management for Machine Learning' at Birla Institute of Technology & Science, Pilani, detailing its objectives, content, and evaluation scheme. It covers various data models, querying languages, and modern data infrastructure, while providing hands-on experience with data pipelines and machine learning workflows. The course includes a modular structure with specific learning outcomes and evaluation components, emphasizing the importance of data management in machine learning.

Uploaded by

geetapillai1963

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES

Part A: Course Design

Course Title Data Management for Machine Learning

Course No(s) DSE* ZG529 / AIML* ZG529
Credit Units 4
Content Authors Pravin Y Pawar
Version 1.1

Course Description

Data Models and Query Languages: Relational, Object-Relational, NoSQL data models; Declarative (SQL)
and Imperative (MapReduce) Querying; Data Encoding: Evolution, Formats, Models of dataflow; Machine
learning workflow; Data management challenges in ML workflow; Data Pipelines and patterns; Data Pipeline
Stages: Data extraction, ingestion, cleaning, wrangling, versioning, transformation, exploration, feature
management; Modern Data Infrastructure: Diverse data sources, Cloud data warehouses and lakes, Data
Ingestion tools, Data transformation and modelling tools, Workflow orchestration platforms; ML model
metadata and Registry, ML Observability, Data privacy and anonymity.
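
The contrast between declarative (SQL) and imperative (MapReduce) querying named above can be illustrated with a small sketch. It is not part of the official course material: the purchase records, the table name, and both function names are hypothetical, and the MapReduce version only imitates the map, shuffle, and reduce phases in a single process rather than on a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, amount) purchase records.
ROWS = [("asha", 120.0), ("ravi", 80.0), ("asha", 30.0),
        ("meera", 55.0), ("ravi", 20.0)]


def total_per_user_sql(rows):
    """Declarative: state WHAT is wanted; the engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE purchases (user TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT user, SUM(amount) FROM purchases GROUP BY user"
        )
        return dict(cur.fetchall())
    finally:
        conn.close()


def total_per_user_mapreduce(rows):
    """Imperative, MapReduce-style: spell out the map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per input record.
    mapped = [(user, amount) for user, amount in rows]
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: fold each group's values into a single result.
    return {key: sum(values) for key, values in groups.items()}
```

Both functions return the same per-user totals for `ROWS`; the difference lies in who decides the execution plan, the database engine or the programmer.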

Course Objectives

The course aims at providing:

CO1 Introduction to the data models, storage systems and query languages used in data management,
with emphasis on machine learning aspects

CO2 Guidance on the architecture of modern data platforms and on the types and usage of data pipelines

CO3 Hands-on exposure to the common techniques and tools used by data engineers to build,
test, deploy and automate machine learning pipelines

CO4 Exposure to the industry best practices essential to deal with data privacy, metadata and
observability

Text Book(s)

T1 Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Reis and Housley

T2 Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd
Underwood

Reference Book(s) & other resources

R1 Designing Data-Intensive Applications by Martin Kleppmann

R2 Data Pipelines Pocket Reference by Densmore

R3 Building Machine Learning Pipelines by Hapke, Nelson

Learning Outcomes:

Students will be able to:

LO1 Understand the necessity, position and role of data management components appearing in the
modern data stacks

LO2 Recognize the patterns, challenges and possible solutions associated with data ingestion,
flow, storage and processing on data platforms

LO3 Gain experience in designing and handling the dataflow in a machine learning pipeline by
means of state-of-the-art tools

LO4 Apply the acquired conceptual data management knowledge and practices to a real-world
machine learning workflow, addressing model metadata, privacy and monitoring aspects

Part B: Course Handout

Academic Term II Semester 2022-2023

Course Title Data Management for Machine Learning
Course No DSE* ZG529 / AIML* ZG529
Lead Instructor Pravin Y Pawar

Glossary of Terms

Module            M   A module is a standalone quantum of designed content. A typical course is
                      delivered using a string of modules. M2 means module 2.

Contact Hour      CH  A Contact Hour (CH) is an hour-long live session with students,
                      conducted either in a physical classroom or enabled through technology.
                      In this model of instruction, instructor-led sessions will be for 32 CH.

Recorded Lecture  RL  RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
                      student through an online portal. A given RL unfolds as a sequence of
                      video segments interleaved with exercises.

Lab Exercises     LE  Lab exercises associated with various modules

Self-Study        SS  Specific content assigned for self-study

Homework          HW  Specific problems/design/lab exercises assigned as homework

Modular Structure

Module Summary
No. Content of the Module
M1 Foundations of data management

M2 Modern Data Platform

M3 Data Management in ML Workflow

M4 Advanced Topic in Data Management

Detailed Structure

M1: Foundations of data management

Contact Session 1-2

Session  Type      Description/Plan                      Reference
1        CH1, CH2  • Data Management Principles          T2
                   • Data Management Components
2        CH3, CH4  • Data Models and Query Languages     R1
                   • Data Encoding                       T1
Post CS  LE        • Lab 1

M2: Modern Data Platform

Contact Session 3-4

Session  Type      Description/Plan                      Reference
3        CH5, CH6  • Data Architectures                  T1
                   • Modern Data Stack
                   • Data Pipelines and patterns
4        CH7, CH8  • Data Storage                        T1
                   • Data Science Infrastructure
                   • Serving Data for Analytics and ML
Post CS  LE        • Lab 2

M3: Data Management in ML Workflow

Contact Session 5-12

Session  Type         Description/Plan                                Reference
5        CH9, CH10    ML Workflow/lifecycle                           T2, R3
                      • Data Pipeline vs ML Pipeline
                      • ML Pipeline Stages
                      • Training / Serving pipeline
                      • Data management challenges in ML workflow

6        CH11, CH12   Data Collection / Ingestion                     T1
                      • Diverse data sources
                      • Data generation in source systems
                      • Batch Ingestion
                      • Message and Stream Ingestion
                      • Ingestion strategies

7        CH13, CH14   Data Validation                                 R3
                      • Common problems with data
                      • Data skew and drift
                      • Bias and Fairness
                      • Data leakage
                      • Data validation approaches

8        CH15, CH16   Analytics Engineering                           Instructor-supplied material
                      • Data Integration
                      • Data Transformation
                      • Data Partitioning
                      • Data Versioning
                      • Test data management challenges

9        CH17, CH18   Data Analysis                                   Instructor-supplied material
                      • Types of Analytics
                      • Data Exploration and Visualizations
                      • Data Cubes and OLAP
                      • Data Cube Operations
                      • Data Cubes and ML

10       CH19, CH20   Feature Preparations                            T2
                      • Feature life cycle
                      • Data Annotation / labeling
                      • Data augmentation and Data Synthesis
                      • Common Feature Engineering Operations
                      • Feature Importance
                      • Feature Generalization
                      • Feature Stores

11-12    CH21-CH24    ML Experimentation & Metadata                   Instructor-supplied material
                      • Model training & experimentation
                      • Model Analysis & Validation
                      • ML Metadata Store
                      • Dataset, Feature, Label, Pipeline metadata
                      • ML Experiment Tracking data
                      • ML model metadata and Registry

Post CS  LE           • Lab 3, 4

M4: Advanced Topic in Data Management

Contact Session 13-16

Session  Type         Description/Plan                                Reference
13       CH25, CH26   Distributed Data Processing
                      • Big Data Analytics
                      • Technologies for big data processing
                      • Distributed and Parallel data processing
                      • In-memory data processing
                      • Hadoop, Spark, Kafka as exemplar architecture

14       CH27, CH28   Data Privacy and anonymity                      T2, Instructor-supplied material
                      • Data privacy issues
                      • Differential privacy
                      • Anonymization
                      • Methods to preserve privacy
                      • Federated learning
                      • Encrypted ML

15       CH29, CH30   Data Observability                              T2, Instructor-supplied material
                      • Data Observability
                      • Data downtime
                      • Five pillars
                      • Tools selection

16       CH31, CH32   ML Monitoring & Observability                   T2, Instructor-supplied material
                      • Causes of ML System failure
                      • Data Distribution Shifts
                      • Problems with ML Production Monitoring
                      • ML-specific metric
                      • Monitoring Toolbox
                      • Monitoring vs Observability

Post CS  SS           • To be identified
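
As an illustration of the "Differential privacy" topic in Session 14, the following is a minimal sketch of the Laplace mechanism applied to a counting query. It is not taken from the course material: the function names, the `epsilon` parameter name, and the sample usage are assumptions made for this example, and a production system would rely on a vetted privacy library instead.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the items in `values` that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one record changes
    the count by at most 1), so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A call such as `private_count(ages, lambda a: a > 30, epsilon=0.5)` returns the true count plus random noise; a smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy.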

Experiential Learning Component

Lab  Topic

1    Design and implement simple data flows involving various data formats (Virtual Labs)
     Modes of data flows:
     a) Through databases: use SQL or a custom program to read from and write to databases
     b) Through REST/RPC: synchronous mechanism for data exchange
     c) Through message brokers / queues: asynchronous mechanism for data exchange

2    Build a Modern Data Stack (Virtual Labs)
     Components:
     a) A fully managed ELT data pipeline
     b) A cloud-based columnar warehouse or data lake as a destination
     c) A data transformation tool
     d) A business intelligence or data visualization platform

3    Manage Machine Learning Model Metadata using MLflow / Neptune (Virtual Labs)
     Components:
     a) Projects
     b) Experiments
     c) Model metadata
     d) Model tracking / logging
     e) Model Registry

4    Construct a Machine Learning Pipeline with Data Versioning Tool (Virtual Labs)
     Components:
     a) Data Pipeline
     b) Data Versioning Tool
     c) Feature Store
     d) ML Pipeline
     e) Prediction Service
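
Two of the data-flow modes listed for Lab 1 can be sketched with the Python standard library alone. This is an illustrative outline under stated assumptions, not the official lab solution: the `readings` table, the record fields, and both function names are hypothetical, an in-memory SQLite database stands in for a real database server, and `queue.Queue` stands in for a real message broker. The REST/RPC mode is omitted because it needs a running server.

```python
import json
import queue
import sqlite3
import threading

_STOP = None  # sentinel telling the consumer that no more messages will arrive


def flow_through_database(records):
    """Mode (a): the writer stores rows with SQL; the reader queries them later."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        conn.executemany(
            "INSERT INTO readings VALUES (:sensor, :value)", records
        )
        rows = conn.execute(
            "SELECT sensor, value FROM readings ORDER BY sensor"
        ).fetchall()
        return [{"sensor": s, "value": v} for s, v in rows]
    finally:
        conn.close()


def flow_through_queue(records):
    """Mode (c): the producer enqueues JSON messages without waiting for the consumer."""
    broker = queue.Queue()
    received = []

    def consumer():
        # Runs in its own thread and decodes messages as they arrive.
        while True:
            message = broker.get()
            if message is _STOP:
                break
            received.append(json.loads(message))

    worker = threading.Thread(target=consumer)
    worker.start()
    for record in records:
        broker.put(json.dumps(record))  # encode each record as JSON before sending
    broker.put(_STOP)
    worker.join()
    return received
```

In the database mode the reader sees whatever the query returns at read time, while in the queue mode each message is delivered once, in the order it was sent; the lab's real tools will differ in detail.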

Evaluation Scheme:

Legend: EC = Evaluation Component; AN = After Noon Session; FN = Fore Noon Session

No    Name                                        Type         Duration  Weight  Day, Date, Session, Time
EC-1  Experiential learning Assignment / Quiz-I   Take Home    15 days   10%     TBA
      Experiential learning Assignment-II         Take Home    15 days   20%     TBA
EC-2  Mid-Semester Test                           Closed Book  2 hours   30%     Per programme schedule
EC-3  Comprehensive Exam                          Open Book    3 hours   40%     Per programme schedule

Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 7

Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)

Important links and information:

Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the latest
announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the Elearn portal.
Evaluation Guidelines:
1. EC1 consists of two assignments. Announcements will be made available on the portal, in a timely
manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed or bound) is
permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all
exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam which will be made available on the
Elearn portal. The Make-Up Test/Exam will be conducted only at selected exam centres on the dates to
be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the online lectures, and take all the prescribed evaluation components such
as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.
