1) Download the binary classification dataset for...
NAME: PRATHAM
ROLL NO: 23126039
OBJECTIVE: The assignment covers the following key aspects:
Model Development: Build Logistic Regression, SVM, and KNN models.
Performance Comparison: Compare the performance of these models.
Dataset: Use the personal loan default prediction dataset.
Hyperparameter Tuning & Regularization: Explore hyperparameter tuning and regularization for each model.
Model Evaluation: Use the F1 score as the primary evaluation metric.
Model Selection: Select the best model based on the evaluation results.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
from sklearn.model_selection import train_test_split
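The printout does not show the remaining imports or the line that loads the dataset into df; a minimal sketch, assuming the CSV sits on the mounted Drive (the file path below is a placeholder, not the original one):
# Remaining imports used later in the notebook
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
# Load the dataset from the mounted Drive (hypothetical path)
df = pd.read_csv('/content/drive/MyDrive/loan_data.csv')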
# Dataset Explanation
print(df.head())
print(df.info())
print(df.describe())
# Explanation:
# The dataset contains information about personal loans, including:
# - person_age: Age of the borrower.
# - person_gender: Gender of the borrower.
# - person_education: Education level of the borrower.
# - person_income: Annual income of the borrower.
# - person_emp_exp: Employment experience in years.
# - person_home_ownership: Home ownership status.
# - loan_amnt: Loan amount.
# - loan_intent: Purpose of the loan.
# - loan_int_rate: Interest rate of the loan.
# - loan_percent_income: Loan amount as a percentage of income.
# - cb_person_cred_hist_length: Credit history length.
# - credit_score: Credit score of the borrower.
# - previous_loan_defaults_on_file: Whether the borrower has previous loan defaults on file.
# - loan_status: Loan default status (0 = No default, 1 = Default). This is the target variable.
       credit_score   loan_status
count  45000.000000  45000.000000
mean     632.608756      0.222222
std       50.435865      0.415744
min      390.000000      0.000000
25%      601.000000      0.000000
50%      640.000000      0.000000
75%      670.000000      0.000000
max      850.000000      1.000000
# Train-Test Split
X = df.drop('loan_status', axis=1)
y = df['loan_status']
# The split call is not visible in the printout; an 80/20 split matches the shapes below
# (random_state=42 is assumed, consistent with the other cells)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
Train set shape: (36000, 13) (36000,)
Test set shape: (9000, 13) (9000,)
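The describe output above shows a mean loan_status of about 0.22, i.e. roughly 22% defaults, so the classes are imbalanced and F1 is a sensible headline metric. A quick check (sketch):
# Class distribution of the target (roughly 78% non-default, 22% default)
print(y.value_counts(normalize=True))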
# Preprocessing
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
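The cell that builds and fits logistic_pipeline is not visible in the printout; a minimal sketch, mirroring the pipeline pattern used for the other models (the liblinear solver and random_state=42 are assumptions based on the regularization section below):
logistic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])
logistic_pipeline.fit(X_train, y_train)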
# Predictions
y_train_pred = logistic_pipeline.predict(X_train)
y_test_pred = logistic_pipeline.predict(X_test)
# F1 Score
train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)
print(f"Train F1 Score: {train_f1}")
print(f"Test F1 Score: {test_f1}")
4. Regularization
# The l1_pipeline cell is not visible in the printout; it is assumed to mirror l2_pipeline with penalty='l1'
l1_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l1', random_state=42))
])
l2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])
l1_pipeline.fit(X_train, y_train)
l2_pipeline.fit(X_train, y_train)
l1_test_pred = l1_pipeline.predict(X_test)
l2_test_pred = l2_pipeline.predict(X_test)
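The printout does not show the F1 comparison for the two penalties; a minimal sketch of how it would be computed:
l1_test_f1 = f1_score(y_test, l1_test_pred)
l2_test_f1 = f1_score(y_test, l2_test_pred)
print(f"L1 (Lasso) Test F1 Score: {l1_test_f1}")
print(f"L2 (Ridge) Test F1 Score: {l2_test_f1}")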
results = []
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
for C in C_values:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='liblinear', C=C, random_state=42))
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    results.append({'C': C, 'Test F1 Score': f1})
results_df = pd.DataFrame(results)
print(results_df)
         C  Test F1 Score
0    0.001       0.714126
1    0.010       0.751487
2    0.100       0.757252
3    1.000       0.758393
4   10.000       0.757814
5  100.000       0.757814
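The manual loop above is a grid search over C. As an aside (not part of the original notebook), the same sweep could be expressed with GridSearchCV, scoring directly on F1 via cross-validation; a sketch:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
    Pipeline(steps=[('preprocessor', preprocessor),
                    ('classifier', LogisticRegression(solver='liblinear', random_state=42))]),
    param_grid={'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring='f1',
    cv=5
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)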
# inbuilt_pipeline is not shown in the printout; assumed to use LogisticRegression with default settings
inbuilt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
inbuilt_pipeline.fit(X_train, y_train)
inbuilt_test_pred = inbuilt_pipeline.predict(X_test)
inbuilt_test_f1 = f1_score(y_test, inbuilt_test_pred)
print(f"Inbuilt Logistic Regression Test F1 Score: {inbuilt_test_f1}")
# The deviation is likely due to the different default solver and regularization settings used by the inbuilt model compared to the tuned pipelines above.
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42))
])
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
print(f"SVM Test F1 Score: {svm_test_f1}")
svm_results = []
C_values_svm = [0.1, 1, 10, 100]
for C in C_values_svm:
    svm_pipeline_tuned = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', SVC(C=C, random_state=42))
    ])
    svm_pipeline_tuned.fit(X_train, y_train)
    y_pred_svm = svm_pipeline_tuned.predict(X_test)
    f1_svm = f1_score(y_test, y_pred_svm)
    svm_results.append({'C': C, 'Test F1 Score': f1_svm})
svm_results_df = pd.DataFrame(svm_results)
print(svm_results_df)
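The sweep above only varies C with the default RBF kernel. As an extension (not in the original notebook), the kernel could be tuned as well; a sketch, noting that each SVC fit on 36,000 rows is fairly slow:
for kernel in ['linear', 'rbf', 'poly']:
    svm_kernel_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', SVC(kernel=kernel, random_state=42))
    ])
    svm_kernel_pipeline.fit(X_train, y_train)
    print(kernel, f1_score(y_test, svm_kernel_pipeline.predict(X_test)))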
8. KNN Implementation
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
print(f"KNN Test F1 Score: {knn_test_f1}")
# Assuming X_train, X_test, y_train, y_test, and preprocessor are already defined from previous steps
knn_results = []
neighbors = [3, 5, 7, 9]
distance_metrics = ['euclidean', 'manhattan', 'minkowski']
for n in neighbors:
    for metric in distance_metrics:
        knn_pipeline_tuned = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', KNeighborsClassifier(n_neighbors=n, metric=metric))
        ])
        knn_pipeline_tuned.fit(X_train, y_train)
        y_pred_knn = knn_pipeline_tuned.predict(X_test)
        f1_knn = f1_score(y_test, y_pred_knn)
        knn_results.append({'Neighbors': n, 'Distance Metric': metric, 'Test F1 Score': f1_knn})
knn_results_df = pd.DataFrame(knn_results)
print(knn_results_df)
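The KNN results table is not reproduced in the printout; either way, the best configuration can be read off the DataFrame directly, for example (sketch):
# Highest-F1 combination of n_neighbors and distance metric
best_knn = knn_results_df.sort_values('Test F1 Score', ascending=False).iloc[0]
print(best_knn)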
# 10. Conclusion
svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)
knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)
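A minimal sketch of printing the final side-by-side comparison, reusing the test_f1, svm_test_f1, and knn_test_f1 variables computed above:
print(f"Logistic Regression Test F1: {test_f1}")
print(f"SVM Test F1:                 {svm_test_f1}")
print(f"KNN Test F1:                 {knn_test_f1}")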
# Conclusion:
# Based on the F1 scores, we can compare the performance of the three models:
# - Logistic Regression: [logistic_test_f1 value]
# - Support Vector Machine (SVM): [svm_test_f1 value]
# - K-Nearest Neighbors (KNN): [knn_test_f1 value]
# Based on the results obtained, the best-performing models for this dataset are usually SVM or Logistic Regression; the KNN pipeline performed comparatively worse.
# Generally, SVM provided the highest F1 score in most of the cases. Logistic Regression also provided good scores and is much faster to train.
The SVM (Support Vector Machine) model has the highest F1 score (0.8013716697441309), making it the best-performing model among the three.