Phase 2
1. Problem Statement
The healthcare industry faces significant challenges in early disease detection and
personalized treatment. Traditional diagnostic methods often rely on reactive approaches,
leading to delayed interventions and higher costs. This project aims to leverage AI and
machine learning to predict diseases early by analyzing patient data such as medical
history, lifestyle factors, and biometric measurements. By transitioning from reactive to
proactive healthcare, we can improve patient outcomes, reduce treatment costs, and
optimize resource allocation.
2. Project Objectives
The primary goal is to develop an AI-powered system that predicts diseases (e.g.,
diabetes, cardiovascular diseases) based on patient data. Key objectives include:
- Identifying patterns and risk factors in patient data that correlate with specific diseases.
- Building predictive models to assess disease likelihood and recommend preventive
measures.
- Providing actionable insights to healthcare providers for early intervention.
- Ensuring the model is interpretable and scalable for real-world deployment.
3. Project Workflow
Data Collection
- EHRs, Wearables, Surveys
- Lab results, Demographics
Data Cleaning
- Missing values
- Outlier removal
- Standardization
Feature Selection
- Statistical tests
- Domain knowledge
- Feature importance
Insight Extraction
- SHAP value analysis
- Key risk factor identification
- Patient stratification
Visualization
- Interactive dashboards
- Risk prediction charts
- Trend analysis graphs
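The "Statistical tests" step under Feature Selection can be illustrated with a simple filter that ranks numeric features by their correlation with the disease label. This is only a minimal sketch under stated assumptions: Pearson correlation is used here as a stand-in for whichever test the project applies, and the column names (glucose, age, shoe_size) and all values are made up for illustration, not taken from the project data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Sort (name, score) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: six patients, three candidate features.
columns = {
    "glucose": [85, 90, 150, 160, 95, 170],
    "age": [30, 45, 50, 62, 38, 58],
    "shoe_size": [42, 38, 41, 39, 42, 40],
}
has_diabetes = [0, 0, 1, 1, 0, 1]  # hypothetical binary target

ranking = rank_features(columns, has_diabetes)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A filter like this only captures linear association with the label, so in practice it would be combined with domain knowledge and model-based feature importance, as listed in the workflow.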
4. Data Description
• Dataset Source: Public datasets (e.g., Kaggle, UCI ML Repository) or synthetic data
mimicking real-world patient records.
• Data Type: Structured tabular data (e.g., CSV files).
• Number of Rows and Columns: 1,00 rows × 12 columns
• Dataset Nature: Static (data does not change in real time)
Key Fields Relevant to the Problem:
• Patient_ID, Age, Gender
• Medical history (e.g., past diagnoses, family history)
• Biometrics (e.g., blood pressure, cholesterol levels)
• Lifestyle factors (e.g., smoking, exercise habits)
• Target variable: Disease diagnosis (binary/multi-class)
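As an illustration of how such a tabular file might be read into typed records, here is a minimal sketch using only the Python standard library. Patient_ID, Age, and Gender come from the field list above; the remaining column names (Systolic_BP, Cholesterol, Smoker, Diagnosis) and the two sample rows are hypothetical placeholders, since the exact 12 columns are not listed.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class PatientRecord:
    patient_id: str
    age: int
    gender: str
    systolic_bp: float  # hypothetical biometric column
    cholesterol: float  # hypothetical biometric column
    smoker: bool        # hypothetical lifestyle column
    diagnosis: int      # hypothetical binary target: 1 = disease present

def load_records(handle):
    """Parse CSV rows from a file-like object into PatientRecord objects."""
    records = []
    for row in csv.DictReader(handle):
        records.append(PatientRecord(
            patient_id=row["Patient_ID"],
            age=int(row["Age"]),
            gender=row["Gender"],
            systolic_bp=float(row["Systolic_BP"]),
            cholesterol=float(row["Cholesterol"]),
            smoker=row["Smoker"].strip().lower() == "yes",
            diagnosis=int(row["Diagnosis"]),
        ))
    return records

# Made-up sample rows standing in for a real CSV file.
SAMPLE_CSV = """Patient_ID,Age,Gender,Systolic_BP,Cholesterol,Smoker,Diagnosis
P001,54,F,138,210.5,yes,1
P002,37,M,118,172.0,no,0
"""

records = load_records(io.StringIO(SAMPLE_CSV))
print(records[0])
```

Converting each column to an explicit type at load time surfaces malformed values early, before they reach cleaning or modeling.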
5. Data Preprocessing
To ensure accurate analysis, we performed the following data cleaning and preparation
steps:
• Removing Duplicates:
Each patient is uniquely identified using a Patient_ID. Duplicates are removed to
avoid bias in model training and disease prediction outcomes.
• Handling Outliers:
Interquartile Range (IQR) and Z-score methods are used to detect anomalies in
lab results (e.g., extremely high cholesterol).
• Transformations:
• Creating New Fields: New fields like Efficiency_Score =
Performance_Score / Monthly_Hours_Worked were created to better reflect
productivity.
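The duplicate-removal and outlier-detection steps above can be sketched as follows. This is an illustrative sketch, not the project's actual code: the Cholesterol column name, the sample values, and the cut-offs (1.5 × IQR and a z-score of 3) are assumptions chosen because they are common defaults.

```python
import statistics

def drop_duplicates(rows, key="Patient_ID"):
    """Keep only the first row seen for each patient identifier."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up readings; P001 appears twice and P012 has an implausible value.
raw = [("P001", 175), ("P002", 180), ("P001", 175), ("P003", 185),
       ("P004", 190), ("P005", 195), ("P006", 200), ("P007", 205),
       ("P008", 210), ("P009", 188), ("P010", 192), ("P011", 198),
       ("P012", 520)]
rows = [{"Patient_ID": pid, "Cholesterol": chol} for pid, chol in raw]

unique_rows = drop_duplicates(rows)
cholesterol = [row["Cholesterol"] for row in unique_rows]
print(len(rows), "rows before,", len(unique_rows), "after de-duplication")
print("IQR outliers:", iqr_outliers(cholesterol))
print("Z-score outliers:", zscore_outliers(cholesterol))
```

Note that the z-score method is insensitive on very small samples, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less affected by this.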
• Univariate Analysis:
• Bivariate/Multivariate Analysis:
• Key Insights:
- Lifestyle factors (e.g., sedentary habits) correlate with higher diabetes risk.
• Libraries Used:
• Optional Tools:
Name Contribution