2025 DM4ML Assign1
Weightage: 20%
Objective:
Business Context:
A recent research note from PwC highlights the gravity of this issue:
“Financial institutions will lose 24% of revenue in the next 3-5 years,
mainly due to customer churn to new fintech companies.”
Tasks:
1. Problem Formulation
2. Data Ingestion
Identify at least two data sources (e.g., CSV files, REST APIs, database
queries)
Design scripts for data ingestion, ensuring:
o Automatic fetching of data periodically (e.g., daily or hourly)
o Error handling for failed ingestion attempts
o Logging for monitoring ingestion jobs
Deliverables:
o Python scripts for ingestion (e.g., using pandas, requests, etc.)
o A log file showing successful ingestion runs
o Screenshots of ingested data stored in raw format
3. Raw Data Storage
Store ingested data in a data lake or storage system (e.g., AWS S3, Google
Cloud Storage, HDFS, or a local filesystem)
Design an efficient folder/bucket structure:
o Partition data by source, type, and timestamp
Deliverables:
o Folder/bucket structure documentation
o Python code demonstrating the upload of raw data to the storage
system
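One possible way to partition by source, type, and timestamp is sketched below for a local filesystem. The key=value folder naming, the helper names raw_partition_path and upload_raw, and the sample source crm_csv are assumptions made for illustration; the same relative path could serve as an object key in a cloud bucket.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def raw_partition_path(root: Path, source: str, data_type: str, ts: datetime) -> Path:
    """Build raw/source=<source>/type=<type>/date=<YYYY-MM-DD>/hour=<HH>."""
    return (
        root
        / "raw"
        / f"source={source}"
        / f"type={data_type}"
        / f"date={ts:%Y-%m-%d}"
        / f"hour={ts:%H}"
    )


def upload_raw(root: Path, source: str, data_type: str, ts: datetime,
               filename: str, payload: bytes) -> Path:
    """Write the raw payload unchanged into its partition folder."""
    target_dir = raw_partition_path(root, source, data_type, ts)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target


# Demo against a temporary directory standing in for the data lake root.
root = Path(tempfile.mkdtemp())
ts = datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc)
stored = upload_raw(root, "crm_csv", "customers", ts,
                    "customers.csv", b"customer_id,churned\n1,0\n")
print(stored.relative_to(root).as_posix())
# raw/source=crm_csv/type=customers/date=2025-01-15/hour=09/customers.csv
```

Putting the timestamp last keeps all files from one source and type together, which makes it straightforward to reprocess a single day or hour.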
4. Data Validation
5. Data Preparation
7. Feature Store
8. Data Versioning
9. Model Building
Train a machine learning model to predict customer churn using the prepared
features:
o Use a framework like scikit-learn or TensorFlow
o Experiment with multiple algorithms (e.g., logistic regression, random
forest)
o Evaluate model performance using metrics such as accuracy,
precision, recall, and F1 score
Save the trained model using a versioning tool (e.g., MLflow)
Deliverables:
o Python script for model training and evaluation
o Model performance report
o A versioned, saved model file (e.g., .pkl, .h5)
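To make the four evaluation metrics concrete, the sketch below computes them from scratch for a binary churn label (1 = churned). The function name classification_metrics and the sample labels are made up for illustration; in an actual submission, scikit-learn's accuracy_score, precision_score, recall_score, and f1_score in sklearn.metrics would normally be used instead.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because churners are usually a minority of customers, precision, recall, and F1 tend to be more informative than accuracy alone when comparing algorithms.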
10. Pipeline Orchestration
Automate the entire pipeline using an orchestration tool (e.g., Apache Airflow,
Prefect, or Kubeflow):
o Define a Directed Acyclic Graph (DAG) for pipeline tasks.
o Ensure task dependencies are well-defined (e.g., ingestion → validation
→ preparation).
o Monitor pipeline runs and handle failures gracefully.
Deliverables:
o Pipeline DAG/script showcasing task automation
o Screenshots of successful pipeline runs in the orchestration tool
o Logs or monitoring dashboard screenshots
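The idea of a DAG with explicit dependencies and graceful failure handling can be illustrated without any orchestration tool. The sketch below is a stand-in built on Python's standard-library graphlib module (Python 3.9+), not Airflow, Prefect, or Kubeflow code; the run_pipeline function, the task names, and the simulated failure are all hypothetical.

```python
import logging
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

logger = logging.getLogger("pipeline")


def run_pipeline(tasks: Dict[str, Callable[[], None]],
                 dependencies: Dict[str, Set[str]]) -> Dict[str, str]:
    """Run tasks in dependency order; skip any task whose upstream did not succeed."""
    # static_order() yields every task after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    status: Dict[str, str] = {}
    for name in order:
        upstream = dependencies.get(name, set())
        if any(status.get(dep) != "success" for dep in upstream):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:  # record the failure instead of crashing the run
            logger.error("task=%s failed: %s", name, exc)
            status[name] = "failed"
    return status


def ok() -> None:
    pass


def broken() -> None:
    raise ValueError("simulated preparation error")


# Each key depends on the tasks in its set: ingestion -> validation -> preparation -> training.
dependencies = {
    "validation": {"ingestion"},
    "preparation": {"validation"},
    "training": {"preparation"},
}
tasks = {"ingestion": ok, "validation": ok, "preparation": broken, "training": ok}

status = run_pipeline(tasks, dependencies)
print(status)
```

A real orchestration tool adds scheduling, retries, and a monitoring UI on top of this same dependency-ordering idea.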
Additional Instructions:
Submission Requirements:
General Notes:
Make sure that you upload the file well ahead of the deadline.
Several groups have faced issues when submitting at the last
moment.
Note - As this is a group assignment, only one submission is expected
from each group. Do not upload the solution individually. If
individual submissions are observed, a penalty (25% reduction)
will be applied.