2025 DM4ML Assign1

The assignment focuses on designing and implementing an end-to-end data management pipeline for a machine learning project aimed at predicting customer churn. It outlines tasks including problem formulation, data ingestion, storage, validation, preparation, transformation, feature management, model building, and pipeline orchestration, with specific deliverables for each stage. The assignment emphasizes best practices in data management and requires comprehensive documentation and a demonstration of the pipeline workflow.

Uploaded by

geetapillai1963

DATA MANAGEMENT FOR MACHINE LEARNING – ASSIGNMENT I

Submission Date: 10th March 2025, 11:50 PM

Weightage: 20%

Title: End-to-End Data Management Pipeline for Machine Learning

Objective:

• Design, implement, and orchestrate a complete data management pipeline for a machine learning project, addressing all stages from problem formulation to pipeline orchestration.

Business Context:

Customer churn occurs when an existing customer stops using a company’s services or purchasing its products, effectively ending their relationship with the company. While certain types of churn, such as those resulting from unavoidable circumstances like death, are considered non-addressable, this discussion focuses on addressable churn: scenarios where intervention could prevent customer loss.

Churn poses significant challenges for businesses, leading to revenue declines and increased pressure on teams to compensate for the loss. One approach to offset churn is acquiring new customers, but this is both costly and difficult. High customer acquisition costs further strain the company’s overall revenue. Additionally, churn has indirect effects: former customers often turn to competitors, potentially influencing other loyal customers to follow suit.

A recent research note from PwC highlights the gravity of this issue:

“Financial institutions will lose 24% of revenue in the next 3-5 years, mainly due to customer churn to new fintech companies.”

Given these challenges, reducing customer churn has become a critical business strategy for most organizations. Even when it’s not explicitly a strategic objective, retaining existing customers is always in the company’s best interest.

You are working as a Data Engineer for a startup specializing in predictive analytics. The company aims to build a robust automated pipeline to process customer data collected from multiple sources (e.g., web logs, transactional systems, and third-party APIs) for a machine learning model that predicts customer churn. Your task is to design and implement this pipeline while adhering to best practices for data management.

Tasks:

1. Problem Formulation

• Clearly define the business problem
• Identify key business objectives
• List the key data sources and their attributes
• Define the expected outputs from the pipeline:
  o Clean datasets for exploratory data analysis (EDA)
  o Transformed features for machine learning
  o A deployable model to predict customer churn
• Set measurable evaluation metrics
• Deliverables:
  o A PDF/Markdown document with the business problem, objectives, data sources, pipeline outputs, and evaluation metrics

2. Data Ingestion

• Identify at least two data sources (e.g., CSV files, REST APIs, database queries)
• Design scripts for data ingestion, ensuring:
  o Automatic fetching of data periodically (e.g., daily or hourly)
  o Error handling for failed ingestion attempts
  o Logging for monitoring ingestion jobs
• Deliverables:
  o Python scripts for ingestion (e.g., using pandas, requests, etc.)
  o A log file showing successful ingestion runs
  o Screenshots of ingested data stored in raw format

3. Raw Data Storage

• Store ingested data in a data lake or storage system (e.g., AWS S3, Google Cloud Storage, HDFS, or a local filesystem)
• Design an efficient folder/bucket structure:
  o Partition data by source, type, and timestamp
• Deliverables:
  o Folder/bucket structure documentation
  o Python code demonstrating the upload of raw data to the storage system
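One possible partitioning scheme, shown here against a local filesystem, is sketched below. The `source=/type=/year=/month=/day=` layout, the helper names, and the sample file are assumptions made for illustration; the same key structure could be used as an object prefix in a cloud bucket.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def raw_partition_path(root, source, data_type, ts):
    """Build <root>/source=<source>/type=<type>/year=YYYY/month=MM/day=DD."""
    return (
        Path(root)
        / f"source={source}"
        / f"type={data_type}"
        / f"year={ts:%Y}"
        / f"month={ts:%m}"
        / f"day={ts:%d}"
    )


def upload_raw(local_file, root, source, data_type, ts):
    """Copy a file into its partition folder and return the destination path."""
    target_dir = raw_partition_path(root, source, data_type, ts)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Prefix the file name with the run time so repeated runs on one day do not collide.
    target = target_dir / f"{ts:%H%M%S}_{Path(local_file).name}"
    shutil.copy2(local_file, target)
    return target


# Hypothetical sample data written to a temporary directory standing in for the lake.
lake_root = Path(tempfile.mkdtemp())
sample = lake_root / "transactions.csv"
sample.write_text("customer_id,amount\n1,20.5\n")
run_ts = datetime(2025, 3, 1, 6, 30, 0, tzinfo=timezone.utc)
stored = upload_raw(sample, lake_root / "raw", "billing_db", "transactions", run_ts)
```

Partitioning by date keeps each ingestion run isolated, which makes it straightforward to reprocess a single day without touching the rest of the raw zone.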

4. Data Validation

• Implement data validation checks to ensure data quality:
  o Check for missing or inconsistent data
  o Validate data types, formats, and ranges
  o Identify duplicates or anomalies
• Generate a comprehensive data quality report
• Deliverables:
  o A Python script for automated validation (e.g., using pandas, great_expectations, or pydeequ)
  o Sample data quality report in PDF or CSV format, summarizing issues and resolutions
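The three kinds of checks listed above (missing values, type and range validation, duplicates) can be illustrated with a dependency-free sketch. The column names, ranges, and sample rows are invented for the example; a submission would more likely express the same rules with pandas, great_expectations, or pydeequ.

```python
from collections import Counter


def validate(rows, required, numeric_ranges, key):
    """Return a list of (row_index, column, issue) tuples for a list of dict rows."""
    issues = []
    seen = Counter()
    for i, row in enumerate(rows):
        # Missing-value check on required columns.
        for col in required:
            if row.get(col) in (None, ""):
                issues.append((i, col, "missing"))
        # Type and range check on numeric columns.
        for col, (low, high) in numeric_ranges.items():
            value = row.get(col)
            if value in (None, ""):
                continue  # empty values are handled by the missing-value check
            try:
                number = float(value)
            except (TypeError, ValueError):
                issues.append((i, col, "not numeric"))
                continue
            if not low <= number <= high:
                issues.append((i, col, "out of range"))
        # Duplicate check on the key column.
        seen[row.get(key)] += 1
        if seen[row.get(key)] > 1:
            issues.append((i, key, "duplicate key"))
    return issues


# Hypothetical sample rows, as they might come out of csv.DictReader.
sample_rows = [
    {"customer_id": "1", "age": "34", "monthly_spend": "20.5"},
    {"customer_id": "2", "age": "", "monthly_spend": "15.0"},
    {"customer_id": "2", "age": "210", "monthly_spend": "abc"},
]
report = validate(
    sample_rows,
    required=["customer_id", "age"],
    numeric_ranges={"age": (0, 120), "monthly_spend": (0, 10_000)},
    key="customer_id",
)
summary = Counter(issue for _, _, issue in report)
```

The `report` list can be written out with the `csv` module to produce the data quality report, and `summary` gives the per-issue counts for its header.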

5. Data Preparation

• Clean and preprocess the raw data:
  o Handle missing values (e.g., imputation or removal)
  o Standardize or normalize numerical attributes
  o Encode categorical variables using one-hot encoding or label encoding
• Perform EDA to identify trends, distributions, and outliers
• Deliverables:
  o Jupyter notebook/Python script showcasing the data preparation process
  o Visualizations and summary statistics (e.g., histograms, box plots)
  o A clean dataset ready for transformations
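To make the three preprocessing steps concrete, here is a minimal sketch of mean imputation, standardization, and one-hot encoding in plain Python. The sample values are made up, and the choice of mean imputation and population standard deviation is an assumption for the example; pandas or scikit-learn would normally be used for the same operations.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return (sorted categories, one indicator row per input value)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical columns from a customer table.
spend = [10.0, None, 30.0, 20.0]
spend_filled = impute_mean(spend)
spend_scaled = standardize(spend_filled)
plans, plan_matrix = one_hot(["basic", "pro", "basic", "free"])
```

When a model is trained later, the imputation value, mean, and standard deviation should be computed on the training split only and reused for validation and inference data, otherwise information leaks across splits.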

6. Data Transformation and Storage

• Perform transformations for feature engineering:
  o Create aggregated features (e.g., total spend per customer)
  o Derive new features (e.g., customer tenure, activity frequency)
  o Scale and normalize features where necessary
• Store the transformed data in a relational database or a data warehouse
• Deliverables:
  o SQL schema design or database setup script
  o Sample queries to retrieve transformed data
  o A summary of the transformation logic applied
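A small sketch of how aggregated and derived features might be computed and stored, using SQLite from the Python standard library. The table names, column names, sample transactions, and the `as_of` reference date are all assumptions for illustration, and tenure is approximated here as days since the first transaction.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (
        customer_id INTEGER NOT NULL,
        txn_date    TEXT    NOT NULL,  -- ISO 8601 date
        amount      REAL    NOT NULL
    );
    CREATE TABLE customer_features (
        customer_id INTEGER PRIMARY KEY,
        total_spend REAL    NOT NULL,
        txn_count   INTEGER NOT NULL,
        tenure_days INTEGER NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2025-01-01", 20.0), (1, "2025-02-01", 35.0), (2, "2025-02-15", 12.5)],
)

# Aggregate per customer: total spend, activity count, and tenure as of a fixed date.
as_of = date(2025, 3, 1).isoformat()
conn.execute(
    """
    INSERT INTO customer_features
    SELECT customer_id,
           SUM(amount),
           COUNT(*),
           CAST(julianday(?) - julianday(MIN(txn_date)) AS INTEGER)
    FROM transactions
    GROUP BY customer_id
    """,
    (as_of,),
)

# Sample query to retrieve the transformed data.
features = conn.execute(
    "SELECT customer_id, total_spend, txn_count, tenure_days "
    "FROM customer_features ORDER BY customer_id"
).fetchall()
```

Passing the reference date as a parameter, rather than using the current date inside the query, keeps the feature values reproducible when the transformation is re-run.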

7. Feature Store

• Implement a feature store to manage engineered features:
  o Define metadata for each feature (e.g., description, source, version)
  o Use a feature store tool (e.g., Feast) or a custom solution
• Automate feature retrieval for training and inference
• Deliverables:
  o Feature store configuration/code
  o Sample API or query demonstrating feature retrieval
  o Documentation of feature metadata and versions
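For groups choosing the custom route, the idea of a feature registry with metadata and versions can be sketched as follows. This is a toy in-memory design, not the Feast API; `FeatureMeta`, `FeatureStore`, and the sample features are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureMeta:
    name: str
    description: str
    source: str
    version: int


@dataclass
class FeatureStore:
    """In-memory registry keyed by (feature name, version)."""

    _meta: dict = field(default_factory=dict)
    _values: dict = field(default_factory=dict)

    def register(self, meta, values_by_customer):
        key = (meta.name, meta.version)
        if key in self._meta:
            raise ValueError(f"{meta.name} v{meta.version} is already registered")
        self._meta[key] = meta
        self._values[key] = dict(values_by_customer)

    def latest_version(self, name):
        versions = [v for (n, v) in self._meta if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions)

    def get_features(self, customer_id, names):
        """Return the latest version of each requested feature for one customer."""
        row = {}
        for name in names:
            key = (name, self.latest_version(name))
            row[name] = self._values[key].get(customer_id)
        return row


store = FeatureStore()
store.register(FeatureMeta("total_spend", "Sum of transaction amounts", "billing_db", 1), {1: 55.0, 2: 12.5})
store.register(FeatureMeta("total_spend", "Sum of amounts, refunds excluded", "billing_db", 2), {1: 50.0, 2: 12.5})
store.register(FeatureMeta("tenure_days", "Days since first transaction", "billing_db", 1), {1: 59, 2: 14})
vector = store.get_features(1, ["total_spend", "tenure_days"])
```

Using the same `get_features` call for both training and inference is the point of the design: it prevents the two code paths from computing a feature in slightly different ways.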

8. Data Versioning

• Use version control for raw and transformed datasets to ensure reproducibility:
  o Track changes in data using tools like DVC, Git LFS, or a custom tagging system
  o Store version metadata (e.g., source, timestamp, change log)
• Deliverables:
  o DVC/Git repository showing dataset versions
  o Documentation of the versioning strategy and workflow
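As one example of what a custom tagging system could look like, the sketch below records a content hash plus the version metadata named above in a JSON manifest. The manifest layout, the function names, and the sample file are assumptions for illustration; DVC and Git LFS provide comparable tracking out of the box.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Hash the file contents in chunks so large datasets do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_version(dataset_path, manifest_path, source, change_log):
    """Append a version entry unless the content hash matches the latest entry."""
    manifest_path = Path(manifest_path)
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    checksum = file_sha256(dataset_path)
    if entries and entries[-1]["sha256"] == checksum:
        return entries[-1]  # unchanged data, no new version
    entry = {
        "version": len(entries) + 1,
        "sha256": checksum,
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change_log": change_log,
    }
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry


# Hypothetical dataset in a temporary working directory.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "customers.csv"
manifest = workdir / "versions.json"

data_file.write_text("customer_id,plan\n1,basic\n")
v1 = record_version(data_file, manifest, "crm_export", "initial load")
same = record_version(data_file, manifest, "crm_export", "re-run, no change")
data_file.write_text("customer_id,plan\n1,basic\n2,pro\n")
v2 = record_version(data_file, manifest, "crm_export", "added customer 2")
```

The manifest itself is small enough to commit to Git, while the dataset files can live in the raw or transformed storage zones.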

9. Model Building

• Train a machine learning model to predict customer churn using the prepared features:
  o Use a framework like scikit-learn or TensorFlow
  o Experiment with multiple algorithms (e.g., logistic regression, random forest)
  o Evaluate model performance using metrics such as accuracy, precision, recall, and F1 score
• Save the trained model using a versioning tool (e.g., MLflow)
• Deliverables:
  o Python script for model training and evaluation
  o Model performance report
  o A versioned, saved model file (e.g., .pkl, .h5)
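The four evaluation metrics named above follow directly from the confusion-matrix counts. The sketch below computes them by hand for a made-up set of labels and predictions, treating churn (label 1) as the positive class; scikit-learn offers equivalent functions, and this version is only meant to show what each metric measures.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted churners, how many churned
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual churners, how many were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels (1 = churned) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Churn datasets are often imbalanced, so accuracy alone can look good for a model that never predicts churn; precision and recall on the churn class usually say more about its usefulness.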

10. Orchestrating the Data Pipeline

• Automate the entire pipeline using an orchestration tool (e.g., Apache Airflow, Prefect, or Kubeflow):
  o Define a Directed Acyclic Graph (DAG) for pipeline tasks
  o Ensure task dependencies are well-defined (e.g., ingestion → validation → preparation)
  o Monitor pipeline runs and handle failures gracefully
• Deliverables:
  o Pipeline DAG/script showcasing task automation
  o Screenshots of successful pipeline runs in the orchestration tool
  o Logs or monitoring dashboard screenshots
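To show what a DAG with well-defined dependencies and graceful failure handling means in practice, here is a toy runner built on `graphlib.TopologicalSorter` from the standard library (Python 3.9+). It is a conceptual sketch rather than a substitute for Airflow, Prefect, or Kubeflow; `run_pipeline`, `step`, and the four task names are hypothetical.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order; skip anything downstream of a failure.

    tasks: {name: callable}; dependencies: {name: set of upstream names}.
    Returns {name: "success" | "failed" | "skipped"}.
    """
    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"
            log.warning("task=%s skipped (upstream did not succeed)", name)
            continue
        try:
            tasks[name]()
            status[name] = "success"
            log.info("task=%s succeeded", name)
        except Exception:
            status[name] = "failed"
            log.exception("task=%s failed", name)
    return status


executed = []


def step(name, fail=False):
    """Build a placeholder task that records its name or raises on demand."""
    def _run():
        if fail:
            raise ValueError(f"{name} found bad data")
        executed.append(name)
    return _run


dag = {
    "ingestion": set(),
    "validation": {"ingestion"},
    "preparation": {"validation"},
    "training": {"preparation"},
}
ok = run_pipeline({n: step(n) for n in dag}, dag)
first_order = list(executed)
broken = run_pipeline({n: step(n, fail=(n == "validation")) for n in dag}, dag)
```

`TopologicalSorter` raises `CycleError` if the dependency graph contains a cycle, which is the same guarantee an orchestration tool enforces when it validates a DAG.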

Additional Instructions:

• Ensure modularity in your codebase, with separate scripts for each stage.
• Use proper logging and error handling in all scripts.
• Provide detailed documentation, including:
  o Explanation of the pipeline design.
  o Challenges faced and solutions implemented.
• Submit a short video (5–10 minutes) demonstrating your pipeline workflow.

Submission Requirements:

• Source Code: Organized into folders by stage.
• Documentation: Markdown or PDF format.
• Video Walkthrough: Demonstrating the pipeline.
• Final Deliverables: Compressed .zip file with all code, data, and documentation.

General Notes:

• Although specific tools, products, and platforms are mentioned as examples in the tasks, you are free to choose and justify a toolchain of your preference, provided it aligns with the objectives, expectations, and deliverables of the assignment.
• Refer to the document used while registering the groups. In case of discrepancies, write to me separately (copying all your group members) with the subject line "Cluster DM4ML Group <your_group_number>". Email: pravin.pawar@pilani.bits-pilani.ac.in
• Only one member of the group needs to upload the file using the LMS. No submission over email will be considered.
• Make sure that you upload the file well ahead of the deadline. Several groups have faced issues when submitting at the last moment.
• Note: As this is a group assignment, only one submission is expected from each group. Do not upload the solution on an individual basis. If this is observed, the penalty (25% reduction) will be applied.
