
Appendix: Complete Code Implementation

This appendix contains all the Python code implementations used in the comparative analysis of
classification, regression, and clustering on healthcare datasets. The code is organized by task
and includes complete implementations with proper imports, data preprocessing, model training,
evaluation, and visualization functions.

A.1 Required Libraries and Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_absolute_error, mean_squared_error,
                             silhouette_score, davies_bouldin_score,
                             confusion_matrix, roc_curve, auc)
from sklearn.decomposition import PCA
import seaborn as sns

A.2 Classification Task: Breast Cancer Diagnosis

A.2.1 Basic Classification Implementation

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load breast cancer dataset (features X, labels y)
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                     test_size=0.3,
                                                     random_state=0)

# Initialize models
logreg = LogisticRegression(max_iter=10000)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)
svm_clf = SVC(kernel='rbf', probability=True, random_state=0)

# Train models
logreg.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
svm_clf.fit(X_train, y_train)
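
As a quick sanity check before the full evaluation in A.2.2, each fitted classifier can report its held-out accuracy through scikit-learn's built-in score method. This is a minimal sketch and assumes the three models above have already been fitted on X_train and y_train:

# Quick sanity check: held-out accuracy only (precision, recall, and ROC
# analysis are covered in A.2.2)
for name, model in [('Logistic Regression', logreg),
                    ('Random Forest', rf_clf),
                    ('SVM (RBF)', svm_clf)]:
    print(f"{name} test accuracy: {model.score(X_test, y_test):.3f}")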

A.2.2 Complete Classification Implementation with Evaluation

# Load dataset
breast_cancer = load_breast_cancer()
X_bc, y_bc = breast_cancer.data, breast_cancer.target

# Train-test split with stratification
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, test_size=0.3, stratify=y_bc, random_state=0)

# Standardize features
scaler_bc = StandardScaler()
X_train_bc_scaled = scaler_bc.fit_transform(X_train_bc)
X_test_bc_scaled = scaler_bc.transform(X_test_bc)

# Initialize and train classification models
models_clf = {
    'Logistic Regression': LogisticRegression(max_iter=10000, random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=0)
}

# Train models and make predictions
clf_results = {}
for name, model in models_clf.items():
    model.fit(X_train_bc_scaled, y_train_bc)
    y_pred = model.predict(X_test_bc_scaled)
    y_prob = (model.predict_proba(X_test_bc_scaled)[:, 1]
              if hasattr(model, 'predict_proba') else None)

    clf_results[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_prob,
        'accuracy': accuracy_score(y_test_bc, y_pred),
        'precision': precision_score(y_test_bc, y_pred),
        'recall': recall_score(y_test_bc, y_pred)
    }

# Print classification results
print("=" * 60)
print("CLASSIFICATION RESULTS - BREAST CANCER DIAGNOSIS")
print("=" * 60)
for name, results in clf_results.items():
    print(f"{name}:")
    print(f"  Accuracy: {results['accuracy']:.3f}")
    print(f"  Precision: {results['precision']:.3f}")
    print(f"  Recall: {results['recall']:.3f}")
    print()

A.3 Regression Task: Diabetes Progression Prediction

A.3.1 Basic Regression Implementation

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

# (Assume X_train, X_test, y_train, y_test are prepared and features scaled)
linreg = LinearRegression().fit(X_train, y_train)
rf_reg = RandomForestRegressor(random_state=0).fit(X_train, y_train)
svr = SVR().fit(X_train, y_train)

# Predict on test set
y_pred_lin = linreg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)
y_pred_svr = svr.predict(X_test)

# Evaluate errors
print("Linear MAE:", mean_absolute_error(y_test, y_pred_lin))
print("Linear RMSE:", mean_squared_error(y_test, y_pred_lin, squared=False))

A.3.2 Complete Regression Implementation with Evaluation

# Load dataset
diabetes = load_diabetes()
X_db, y_db = diabetes.data, diabetes.target

# Train-test split
X_train_db, X_test_db, y_train_db, y_test_db = train_test_split(
    X_db, y_db, test_size=0.2, random_state=0)

# Standardize features
scaler_db = StandardScaler()
X_train_db_scaled = scaler_db.fit_transform(X_train_db)
X_test_db_scaled = scaler_db.transform(X_test_db)

# Initialize and train regression models
models_reg = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=100, random_state=0),
    'SVR (RBF)': SVR(kernel='rbf')
}

# Train models and make predictions
reg_results = {}
for name, model in models_reg.items():
    model.fit(X_train_db_scaled, y_train_db)
    y_pred = model.predict(X_test_db_scaled)

    reg_results[name] = {
        'model': model,
        'predictions': y_pred,
        'mae': mean_absolute_error(y_test_db, y_pred),
        'rmse': mean_squared_error(y_test_db, y_pred, squared=False)
    }

# Print regression results
print("=" * 60)
print("REGRESSION RESULTS - DIABETES PROGRESSION")
print("=" * 60)
for name, results in reg_results.items():
    print(f"{name}:")
    print(f"  MAE: {results['mae']:.2f}")
    print(f"  RMSE: {results['rmse']:.2f}")
    print()

A.4 Clustering Task: Unsupervised Patient Stratification

A.4.1 Basic Clustering Implementation

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Standardize features (X from the WDBC dataset; labels are not used)
X_scaled = StandardScaler().fit_transform(X)

# Run clustering algorithms
kmeans = KMeans(n_clusters=2, random_state=0).fit(X_scaled)
agg = AgglomerativeClustering(n_clusters=2).fit(X_scaled)
dbscan = DBSCAN(eps=2.0, min_samples=5).fit(X_scaled)

# Get cluster labels
labels_km = kmeans.labels_
labels_ag = agg.labels_
labels_db = dbscan.labels_

# Compute evaluation metrics
print("K-Means Silhouette:", silhouette_score(X_scaled, labels_km))
print("K-Means DBI:", davies_bouldin_score(X_scaled, labels_km))

A.4.2 Complete Clustering Implementation with Evaluation

# Use breast cancer data without labels for clustering
X_cluster = StandardScaler().fit_transform(X_bc)

# Initialize clustering models
models_cluster = {
    'K-Means': KMeans(n_clusters=2, random_state=0),
    'Agglomerative': AgglomerativeClustering(n_clusters=2),
    'DBSCAN': DBSCAN(eps=2.0, min_samples=5)
}

# Perform clustering and evaluate
cluster_results = {}
for name, model in models_cluster.items():
    labels = model.fit_predict(X_cluster)

    # Handle case where DBSCAN might produce -1 labels (noise)
    if len(np.unique(labels)) > 1 and min(labels) >= 0:
        silhouette = silhouette_score(X_cluster, labels)
        dbi = davies_bouldin_score(X_cluster, labels)
    else:
        silhouette = -1  # Invalid clustering
        dbi = float('inf')

    cluster_results[name] = {
        'model': model,
        'labels': labels,
        'n_clusters': len(np.unique(labels[labels >= 0])),
        'silhouette': silhouette,
        'dbi': dbi
    }

# Print clustering results
print("=" * 60)
print("CLUSTERING RESULTS - UNSUPERVISED PATIENT STRATIFICATION")
print("=" * 60)
for name, results in cluster_results.items():
    print(f"{name}:")
    print(f"  Number of clusters: {results['n_clusters']}")
    print(f"  Silhouette Score: {results['silhouette']:.3f}")
    print(f"  Davies-Bouldin Index: {results['dbi']:.3f}")
    print()
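
When DBSCAN labels some points as noise (label -1), the loop above falls back to placeholder metric values. One alternative, shown here only as a sketch and not used for the reported results, is to score the non-noise points so DBSCAN can still be compared on the samples it actually clusters:

# Optional: evaluate DBSCAN on non-noise points only (-1 marks noise).
# This sketch is for exploration; the results printed above do not use it.
labels_dbscan = cluster_results['DBSCAN']['labels']
mask = labels_dbscan != -1
if len(np.unique(labels_dbscan[mask])) > 1:
    print("DBSCAN (non-noise) Silhouette:",
          silhouette_score(X_cluster[mask], labels_dbscan[mask]))
    print("DBSCAN (non-noise) DBI:",
          davies_bouldin_score(X_cluster[mask], labels_dbscan[mask]))
else:
    print("DBSCAN found fewer than two clusters among non-noise points.")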

A.5 Visualization Functions

A.5.1 Classification Visualizations

def plot_confusion_matrix(y_true, y_pred, model_name):
    """Plot confusion matrix for classification results"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

def plot_roc_curves(y_true, models_dict):
    """Plot ROC curves for multiple classification models"""
    plt.figure(figsize=(8, 6))
    for name, results in models_dict.items():
        if results['probabilities'] is not None:
            fpr, tpr, _ = roc_curve(y_true, results['probabilities'])
            auc_score = auc(fpr, tpr)
            plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')
    plt.plot([0, 1], [0, 1], 'k--', label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Classification Models')
    plt.legend()
    plt.grid(True)
    plt.show()

A.5.2 Regression Visualizations

def plot_regression_predictions(y_true, y_pred, model_name):
    """Plot actual vs predicted values for regression"""
    plt.figure(figsize=(8, 6))
    plt.scatter(y_true, y_pred, alpha=0.6)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title(f'Actual vs Predicted - {model_name}')
    plt.show()

A.5.3 Clustering Visualizations

def plot_clustering_pca(X, labels, model_name):
    """Plot clustering results in 2D PCA space"""
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
    plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
    plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
    plt.title(f'Clustering Results - {model_name}')
    plt.colorbar(scatter)
    plt.show()

A.6 Complete Integrated Implementation

# =============================================================================
# COMPLETE MACHINE LEARNING PIPELINE FOR HEALTHCARE DATA ANALYSIS
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_absolute_error, mean_squared_error,
                             silhouette_score, davies_bouldin_score,
                             confusion_matrix, roc_curve, auc)
from sklearn.decomposition import PCA
import seaborn as sns

def main():
    """
    Main function to execute all three machine learning tasks:
    1. Classification: Breast Cancer Diagnosis
    2. Regression: Diabetes Progression Prediction
    3. Clustering: Unsupervised Patient Stratification
    """
    print("Starting Healthcare Machine Learning Analysis...")
    print("=" * 70)

    # Task 1: Classification
    print("\nTask 1: Breast Cancer Classification")
    print("-" * 40)
    classification_task()

    # Task 2: Regression
    print("\nTask 2: Diabetes Progression Regression")
    print("-" * 40)
    regression_task()

    # Task 3: Clustering
    print("\nTask 3: Unsupervised Patient Clustering")
    print("-" * 40)
    clustering_task()

    print("\n" + "=" * 70)
    print("Analysis Complete!")

def classification_task():
    """Execute breast cancer classification task"""
    # Implementation as shown in A.2.2
    # [Complete code from section A.2.2 goes here]
    pass

def regression_task():
    """Execute diabetes progression regression task"""
    # Implementation as shown in A.3.2
    # [Complete code from section A.3.2 goes here]
    pass

def clustering_task():
    """Execute unsupervised clustering task"""
    # Implementation as shown in A.4.2
    # [Complete code from section A.4.2 goes here]
    pass

# Example usage of visualization functions
def generate_all_visualizations():
    """Generate all visualizations for the research paper"""
    # Example calls (uncomment to use):
    # plot_confusion_matrix(y_test_bc, clf_results['Random Forest']['predictions'], 'Random Forest')
    # plot_roc_curves(y_test_bc, clf_results)
    # plot_regression_predictions(y_test_db, reg_results['Linear Regression']['predictions'], 'Linear Regression')
    # plot_clustering_pca(X_cluster, cluster_results['K-Means']['labels'], 'K-Means')
    pass

if __name__ == "__main__":
    main()

A.7 Usage Instructions

To run the complete analysis, execute the following steps:
1. Install required packages:

pip install scikit-learn numpy pandas matplotlib seaborn

2. Run individual tasks:

# For classification only
classification_task()

# For regression only
regression_task()

# For clustering only
clustering_task()

3. Generate visualizations:

# Generate all plots and figures
generate_all_visualizations()

4. Run complete pipeline:

# Execute all tasks in sequence
main()

This appendix provides all the necessary code to reproduce the results presented in the main
research paper. The implementations follow scikit-learn best practices and include proper data
preprocessing, model training, evaluation, and visualization components essential for
comprehensive machine learning analysis in healthcare applications.
