1) Download the binary classification dataset for... - Colab

The document outlines a project focused on developing and comparing Logistic Regression, SVM, and KNN models for predicting personal loan defaults using a specific dataset. It details the model development process, including hyperparameter tuning, regularization, and performance evaluation using the F1 score. Ultimately, the SVM model achieved the highest F1 score, indicating it as the best-performing model among the three analyzed.

NAME: PRATHAM

ROLL NO: 23126039

OBJECTIVE: The objective covers the following key aspects of the assignment:

Model Development: It explicitly mentions the development of Logistic Regression, SVM, and KNN models.

Performance Comparison: It highlights the goal of comparing the performance of these models.

Dataset Specificity: It accurately identifies the "personal loan default prediction dataset."

Hyperparameter Tuning & Regularization: It emphasizes the importance of exploring hyperparameter tuning and regularization.

Model Evaluation: It specifies the use of the F1 score as the primary evaluation metric (a short illustrative sketch of the metric follows this list).

Model Selection: It states the aim to select the optimal model based on the evaluation results.

Generalization: It includes the concept of model generalization, i.e., how well the tuned models perform on unseen test data.
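
Since F1 is the evaluation metric used throughout, here is a minimal sketch of how it relates to precision and recall. The labels below are hypothetical toy values for illustration only, not taken from the loan dataset:

from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels, only to illustrate the metric itself
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(f1_manual, f1_score(y_true, y_pred))   # both values should match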

from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/IML lab/lab6/loan_data.csv")  # Please upload loan_data.csv to the Colab environment

# Dataset Explanation
print(df.head())
print(df.info())
print(df.describe())

# Explanation:
# The dataset contains information about personal loans, including:
# - person_age: Age of the borrower.
# - person_gender: Gender of the borrower.
# - person_education: Education level of the borrower.
# - person_income: Annual income of the borrower.
# - person_emp_exp: Employment experience in years.
# - person_home_ownership: Home ownership status.
# - loan_amnt: Loan amount.
# - loan_intent: Purpose of the loan.
# - loan_int_rate: Interest rate of the loan.
# - loan_percent_income: Loan amount as a percentage of income.
# - cb_person_cred_hist_length: Credit history length.
# - credit_score: Credit score of the borrower.
# - previous_loan_defaults_on_file: Whether the person has previous loan defaults on file.
# - loan_status: Loan default status (0 = No default, 1 = Default). This is the target variable.

# Train-Test Split
X = df.drop('loan_status', axis=1)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train set shape:", X_train.shape, y_train.shape)


print("Test set shape:", X_test.shape, y_test.shape)
 10  cb_person_cred_hist_length      45000 non-null  float64
 11  credit_score                    45000 non-null  int64
 12  previous_loan_defaults_on_file  45000 non-null  object
 13  loan_status                     45000 non-null  int64
dtypes: float64(6), int64(3), object(5)
memory usage: 4.8+ MB
None
         person_age  person_income  person_emp_exp     loan_amnt  \
count  45000.000000   4.500000e+04    45000.000000  45000.000000
mean      27.764178   8.031905e+04        5.410333   9583.157556
std        6.045108   8.042250e+04        6.063532   6314.886691
min       20.000000   8.000000e+03        0.000000    500.000000
25%       24.000000   4.720400e+04        1.000000   5000.000000
50%       26.000000   6.704800e+04        4.000000   8000.000000
75%       30.000000   9.578925e+04        8.000000  12237.250000
max      144.000000   7.200766e+06      125.000000  35000.000000

       loan_int_rate  loan_percent_income  cb_person_cred_hist_length  \
count   45000.000000         45000.000000                45000.000000
mean       11.006606             0.139725                    5.867489
std         2.978808             0.087212                    3.879702
min         5.420000             0.000000                    2.000000
25%         8.590000             0.070000                    3.000000
50%        11.010000             0.120000                    4.000000
75%        12.990000             0.190000                    8.000000
max        20.000000             0.660000                   30.000000

        credit_score   loan_status
count   45000.000000  45000.000000
mean      632.608756      0.222222
std        50.435865      0.415744
min       390.000000      0.000000
25%       601.000000      0.000000
50%       640.000000      0.000000
75%       670.000000      0.000000
max       850.000000      1.000000
Train set shape: (36000, 13) (36000,)
Test set shape: (9000, 13) (9000,)
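
Since only about 22% of the loans are defaults (the loan_status mean is 0.2222), a stratified split would keep that class ratio identical in the train and test sets. This is an optional variation of the split, not what the run above used:

# Optional: stratified split preserving the ~22% default rate in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(y_train_s.mean(), y_test_s.mean())  # both close to 0.2222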

3. Logistic Regression Model Development


from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

# Preprocessing
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Logistic Regression Pipeline
logistic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])

# Train the model
logistic_pipeline.fit(X_train, y_train)

# Predictions
y_train_pred = logistic_pipeline.predict(X_train)
y_test_pred = logistic_pipeline.predict(X_test)

# F1 Score
train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)
print(f"Train F1 Score: {train_f1}")
print(f"Test F1 Score: {test_f1}")

Train F1 Score: 0.7642629227823867
Test F1 Score: 0.7583926754832147
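
F1 alone does not show whether the errors come from false positives or false negatives. A quick sketch, assuming logistic_pipeline has been fitted as above, that prints the full per-class breakdown:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall and F1 for the fitted logistic pipeline
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))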

4. Regularization

# Regularized Logistic Regression (L1 and L2)
l1_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l1', random_state=42))
])

l2_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', penalty='l2', random_state=42))
])

l1_pipeline.fit(X_train, y_train)
l2_pipeline.fit(X_train, y_train)

l1_test_pred = l1_pipeline.predict(X_test)
l2_test_pred = l2_pipeline.predict(X_test)

l1_test_f1 = f1_score(y_test, l1_test_pred)
l2_test_f1 = f1_score(y_test, l2_test_pred)

print(f"L1 Regularization Test F1 Score: {l1_test_f1}")
print(f"L2 Regularization Test F1 Score: {l2_test_f1}")

L1 Regularization Test F1 Score: 0.7578144853875477
L2 Regularization Test F1 Score: 0.7583926754832147
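
One way to see the practical difference between the two penalties is to count how many coefficients L1 drives exactly to zero compared with L2. A minimal sketch, assuming the two pipelines above are already fitted:

import numpy as np

l1_coefs = l1_pipeline.named_steps['classifier'].coef_.ravel()
l2_coefs = l2_pipeline.named_steps['classifier'].coef_.ravel()

# L1 tends to produce sparse weights; L2 only shrinks them toward zero
print("L1 zero coefficients:", np.sum(l1_coefs == 0))
print("L2 zero coefficients:", np.sum(l2_coefs == 0))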

5. Varying λ (C in Logistic Regression)

Note: in scikit-learn, C is the inverse of the regularization strength λ, so larger C means weaker regularization.


results = []
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

for C in C_values:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='liblinear', C=C, random_state=42))
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    results.append({'C': C, 'Test F1 Score': f1})

results_df = pd.DataFrame(results)
print(results_df)

C Test F1 Score
0 0.001 0.714126
1 0.010 0.751487
2 0.100 0.757252
3 1.000 0.758393
4 10.000 0.757814
5 100.000 0.757814
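
The same search over C could also be done with cross-validation on the training set rather than scoring each value on the test set. A sketch using GridSearchCV; this is an alternative to, not the procedure used in, the loop above:

from sklearn.model_selection import GridSearchCV

# Cross-validated search over C, scored by F1 on the training folds only
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
]), param_grid, scoring='f1', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)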

6. Comparison with Inbuilt Model

# Inbuilt Logistic Regression
inbuilt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))  # uses lbfgs as the default solver and l2 as the default penalty
])

inbuilt_pipeline.fit(X_train, y_train)
inbuilt_test_pred = inbuilt_pipeline.predict(X_test)
inbuilt_test_f1 = f1_score(y_test, inbuilt_test_pred)
print(f"Inbuilt Logistic Regression Test F1 Score: {inbuilt_test_f1}")
# The deviation is likely due to the different default solver and regularization settings of the inbuilt model compared to the liblinear pipeline above.

Inbuilt Logistic Regression Test F1 Score: 0.7585856016280844
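
To check that the small deviation really comes from the solver rather than anything else, one could refit the default pipeline while forcing solver='liblinear' and confirm the score matches the earlier result. A minimal sketch under that assumption:

# Same pipeline as above but forcing the liblinear solver
check_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])
check_pipeline.fit(X_train, y_train)
print(f1_score(y_test, check_pipeline.predict(X_test)))  # should match the earlier liblinear score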

7. SVM Implementation and Hyperparameter Tuning

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline  # Import the Pipeline class

svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42))
])

svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)

print(f"SVM Test F1 Score: {svm_test_f1}")

svm_results = []
C_values_svm = [0.1, 1, 10, 100]

for C in C_values_svm:
    svm_pipeline_tuned = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', SVC(C=C, random_state=42))
    ])
    svm_pipeline_tuned.fit(X_train, y_train)
    y_pred_svm = svm_pipeline_tuned.predict(X_test)
    f1_svm = f1_score(y_test, y_pred_svm)
    svm_results.append({'C': C, 'Test F1 Score': f1_svm})

svm_results_df = pd.DataFrame(svm_results)
print(svm_results_df)

SVM Test F1 Score: 0.8013716697441309


C Test F1 Score
0 0.1 0.781457
1 1.0 0.801372
2 10.0 0.804227
3 100.0 0.786705
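
The search above varies only C with the default RBF kernel. A possible extension, sketched here with illustrative gamma values that were not part of the original run, would tune gamma together with C:

# Optional extension: vary gamma together with C for the RBF kernel
for C in [1, 10]:
    for gamma in ['scale', 0.1]:
        pipe = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', SVC(C=C, gamma=gamma, random_state=42))
        ])
        pipe.fit(X_train, y_train)
        print(C, gamma, f1_score(y_test, pipe.predict(X_test)))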

8. KNN Implementation

from sklearn.neighbors import KNeighborsClassifier

knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)

print(f"KNN Test F1 Score: {knn_test_f1}")

KNN Test F1 Score: 0.7477572559366754


9. KNN Hyperparameter Tuning

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# Assuming X_train, X_test, y_train, y_test, and preprocessor are already defined from previous steps

knn_results = []
neighbors = [3, 5, 7, 9]
distance_metrics = ['euclidean', 'manhattan', 'minkowski']

for n in neighbors:
    for metric in distance_metrics:
        knn_pipeline_tuned = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', KNeighborsClassifier(n_neighbors=n, metric=metric))
        ])
        knn_pipeline_tuned.fit(X_train, y_train)
        y_pred_knn = knn_pipeline_tuned.predict(X_test)
        f1_knn = f1_score(y_test, y_pred_knn)
        knn_results.append({'Neighbors': n, 'Distance Metric': metric, 'Test F1 Score': f1_knn})

knn_results_df = pd.DataFrame(knn_results)
print(knn_results_df)
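
To pick the best configuration directly from the results table, one can simply take the row with the highest F1. A small sketch, assuming knn_results_df has been built as above:

# Row with the highest test F1 among all (k, metric) combinations
best_knn = knn_results_df.loc[knn_results_df['Test F1 Score'].idxmax()]
print(best_knn)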

# 10. Conclusion

# Compare the performance of Logistic Regression, SVM, and KNN
logistic_pipeline.fit(X_train, y_train)
logistic_test_pred = logistic_pipeline.predict(X_test)
logistic_test_f1 = f1_score(y_test, logistic_test_pred)

svm_pipeline.fit(X_train, y_train)
svm_test_pred = svm_pipeline.predict(X_test)
svm_test_f1 = f1_score(y_test, svm_test_pred)

knn_pipeline.fit(X_train, y_train)
knn_test_pred = knn_pipeline.predict(X_test)
knn_test_f1 = f1_score(y_test, knn_test_pred)

print(f"Logistic Regression Test F1 Score: {logistic_test_f1}")


print(f"SVM Test F1 Score: {svm_test_f1}")
print(f"KNN Test F1 Score: {knn_test_f1}")

# Conclusion:
# Based on the F1 scores, we can compare the performance of the three models:
# - Logistic Regression: 0.7584
# - Support Vector Machine (SVM): 0.8014
# - K-Nearest Neighbors (KNN): 0.7478
# Based on the results obtained, the best-performing model for this dataset is the SVM, with Logistic Regression close behind. The KNN performance was the lowest of the three.
# Generally, SVM provided the highest F1 score. Logistic Regression also provided good scores and is much faster to train than SVM.

Neighbors Distance Metric Test F1 Score
0 3 euclidean 0.742546
1 3 manhattan 0.743081
2 3 minkowski 0.742546
3 5 euclidean 0.747757
4 5 manhattan 0.756285
5 5 minkowski 0.747757
6 7 euclidean 0.759500
7 7 manhattan 0.760986
8 7 minkowski 0.759500
9 9 euclidean 0.760481
10 9 manhattan 0.764263
11 9 minkowski 0.760481
Logistic Regression Test F1 Score: 0.7583926754832147
SVM Test F1 Score: 0.8013716697441309
KNN Test F1 Score: 0.7477572559366754

The SVM (Support Vector Machine) model has the highest F1 score (0.8013716697441309), making it the best-performing model among the three.
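
A compact way to present the final comparison is to gather the three test F1 scores, taken from the printed output above, into a single table:

# Final comparison table built from the scores printed above
summary = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'KNN'],
    'Test F1 Score': [0.7584, 0.8014, 0.7478]
}).sort_values('Test F1 Score', ascending=False)
print(summary)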
