The document details a data analysis workflow on a loan approval dataset, covering data cleaning, correlation analysis, and model training. It fits linear and logistic regression models to predict loan status from features such as loan amount and CIBIL score, noting that the printed regression coefficients round to zero because of the features' large scales. The results indicate a moderate fit for the linear model, with naive Bayes outperforming logistic regression on accuracy, precision, recall, and F1.


import pandas as pd
from sklearn.model_selection import train_test_split

# Load the loan approval dataset
df1 = pd.read_csv('/Users/sumanhait/Desktop/loan_approval_dataset.csv')
print(df1)

      loan_id  no_of_dependents     education self_employed  income_annum  \
0           1                 2      Graduate            No       9600000
1           2                 0  Not Graduate           Yes       4100000
2           3                 3      Graduate            No       9100000
3           4                 3      Graduate            No       8200000
4           5                 5  Not Graduate           Yes       9800000
...       ...               ...           ...           ...           ...
4264     4265                 5      Graduate           Yes       1000000
4265     4266                 0  Not Graduate           Yes       3300000
4266     4267                 2  Not Graduate            No       6500000
4267     4268                 1  Not Graduate            No       4100000
4268     4269                 1      Graduate            No       9200000

      loan_amount  loan_term  cibil_score  residential_assets_value  \
0        29900000         12          778                    2400000
1        12200000          8          417                    2700000
2        29700000         20          506                    7100000
3        30700000          8          467                   18200000
4        24200000         20          382                   12400000
...           ...        ...          ...                        ...
4264      2300000         12          317                    2800000
4265     11300000         20          559                    4200000
4266     23900000         18          457                    1200000
4267     12800000          8          780                    8200000
4268     29700000         10          607                   17800000

      commercial_assets_value  luxury_assets_value  bank_asset_value  \
0                    17600000             22700000           8000000
1                     2200000              8800000           3300000
2                     4500000             33300000          12800000
3                     3300000             23300000           7900000
4                     8200000             29400000           5000000
...                       ...                  ...               ...
4264                   500000              3300000            800000
4265                  2900000             11000000           1900000
4266                 12400000             18100000           7300000
4267                   700000             14100000           5800000
4268                 11800000             35700000          12000000

      loan_status
0        Approved
1        Rejected
2        Rejected
3        Rejected
4        Rejected
...           ...
4264     Rejected
4265     Approved
4266     Rejected
4267     Approved
4268     Approved

[4269 rows x 13 columns]

print(df1.columns)

Index(['loan_id', ' no_of_dependents', ' education', ' self_employed',
       ' income_annum', ' loan_amount', ' loan_term', ' cibil_score',
       ' residential_assets_value', ' commercial_assets_value',
       ' luxury_assets_value', ' bank_asset_value', ' loan_status'],
      dtype='object')

df1.duplicated().any()  # check whether any duplicate rows exist

False

has_na = df1.isna().any()  # check each column for missing (NA) values
print(has_na)

loan_id False
no_of_dependents False
education False
self_employed False
income_annum False
loan_amount False
loan_term False
cibil_score False
residential_assets_value False
commercial_assets_value False
luxury_assets_value False
bank_asset_value False
loan_status False
dtype: bool

df1.columns

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')

# Check for zero values in the annual income column
zero_income = (df1['income_annum'] == 0).any()

if zero_income:
    print("There are zero values in the annual income column.")
else:
    print("There are no zero values in the annual income column.")

There are no zero values in the annual income column.
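The same sanity check extends naturally to the other numeric columns the commentary below mentions (CIBIL score and loan amount). A minimal sketch, assuming the stripped column names used elsewhere in this notebook; the loop itself is not part of the original analysis:

# Hypothetical extension: scan several columns for zero values at once
for col in ['income_annum', 'cibil_score', 'loan_amount']:
    n_zeros = (df1[col] == 0).sum()
    print(f"{col}: {n_zeros} zero value(s)")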

# Normalize column names and string cells, then encode the categoricals
df1.columns = df1.columns.str.strip()
df1 = df1.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df1['loan_status'] = df1['loan_status'].replace({'Approved': 1, 'Rejected': 0})
df1['education'] = df1['education'].replace({'Not Graduate': 0, 'Graduate': 1})
df1['self_employed'] = df1['self_employed'].replace({'No': 0, 'Yes': 1})
df1

      loan_id  no_of_dependents  education  self_employed  income_annum  \
0           1                 2          1              0       9600000
1           2                 0          0              1       4100000
2           3                 3          1              0       9100000
3           4                 3          1              0       8200000
4           5                 5          0              1       9800000
...       ...               ...        ...            ...           ...
4264     4265                 5          1              1       1000000
4265     4266                 0          0              1       3300000
4266     4267                 2          0              0       6500000
4267     4268                 1          0              0       4100000
4268     4269                 1          1              0       9200000

      loan_amount  loan_term  cibil_score  residential_assets_value  \
0        29900000         12          778                    2400000
1        12200000          8          417                    2700000
2        29700000         20          506                    7100000
3        30700000          8          467                   18200000
4        24200000         20          382                   12400000
...           ...        ...          ...                        ...
4264      2300000         12          317                    2800000
4265     11300000         20          559                    4200000
4266     23900000         18          457                    1200000
4267     12800000          8          780                    8200000
4268     29700000         10          607                   17800000

      commercial_assets_value  luxury_assets_value  bank_asset_value  \
0                    17600000             22700000           8000000
1                     2200000              8800000           3300000
2                     4500000             33300000          12800000
3                     3300000             23300000           7900000
4                     8200000             29400000           5000000
...                       ...                  ...               ...
4264                   500000              3300000            800000
4265                  2900000             11000000           1900000
4266                 12400000             18100000           7300000
4267                   700000             14100000           5800000
4268                 11800000             35700000          12000000

      loan_status
0               1
1               0
2               0
3               0
4               0
...           ...
4264            0
4265            1
4266            0
4267            1
4268            1

[4269 rows x 13 columns]

'''First I imported the required libraries. I then listed the column
names and found leading blank spaces in them. Next I checked for
duplicate rows, for zero values in the annual income column (the same
check applies to cibil_score and loan_amount), and for NA values. I
then stripped the whitespace and converted the major categorical
columns, such as loan_status and self_employed, to numeric values.'''
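The cleaning steps above can also be gathered into one helper so they run in a fixed order on a fresh copy of the data. A minimal sketch of the same pipeline; the function name clean_loans is hypothetical, not part of the original notebook:

def clean_loans(df):
    # Strip whitespace from column names and string cells
    df = df.copy()
    df.columns = df.columns.str.strip()
    df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
    # Encode the categorical columns exactly as in the cells above
    df['loan_status'] = df['loan_status'].replace({'Approved': 1, 'Rejected': 0})
    df['education'] = df['education'].replace({'Not Graduate': 0, 'Graduate': 1})
    df['self_employed'] = df['self_employed'].replace({'No': 0, 'Yes': 1})
    return df

df1 = clean_loans(pd.read_csv('/Users/sumanhait/Desktop/loan_approval_dataset.csv'))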

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df1 is already defined
numerical_features = df1.select_dtypes(include=['number']).columns
numerical_df = df1[numerical_features]
corr_matrix = numerical_df.corr()

plt.figure(figsize=(12, 10))  # Adjust the figure size
sns.heatmap(corr_matrix, annot=True, cmap='RdBu', fmt=".2f",
            annot_kws={"size": 10})
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotate x labels
plt.yticks(fontsize=10)  # Set y labels font size
plt.tight_layout()  # Adjust layout to fit everything
plt.show()

'''
This is the correlation matrix of all numeric values in this dataset.
loan_status is positively correlated only with cibil_score and
loan_amount.
'''
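To read that claim off the numbers directly rather than from the heatmap colors, the loan_status column of the matrix can be sorted. A small sketch reusing the corr_matrix computed above:

# Print correlations with loan_status, strongest first
print(corr_matrix['loan_status'].drop('loan_status').sort_values(ascending=False))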
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

x = df1[['loan_amount', 'cibil_score']]
y = df1['loan_status']

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=42)

# Define and fit the model
model = LinearRegression()
model.fit(x_train, y_train)

# Get coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

# Print the equation
print("Equation:")
print('loan_status =', intercept, end=" ")
for i, coef in enumerate(coefficients):
    print(f"+ {coef:.2f} * {x.columns[i]}", end=" ")
print()

# Make predictions
y_pred = model.predict(x_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Equation:
loan_status = -0.7122471905071295 + 0.00 * loan_amount + 0.00 * cibil_score
Mean Squared Error: 0.09640870084539979
R-squared: 0.5874846987599525

'''The coefficients print as 0.00 only because loan_amount and
cibil_score are on very large scales, so the per-unit effect on
loan_status is tiny when rounded to two decimals; it does not follow
that the features have no effect. The model's Mean Squared Error (MSE)
is 0.096, the average squared difference between actual and predicted
values. An R-squared of 0.59 means the model explains 59% of the
variance in loan_status, indicating a moderate fit.'''
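One way to make the coefficients readable is to standardize the features first, so each coefficient is the effect of a one-standard-deviation change. A minimal sketch using scikit-learn's StandardScaler; model_std is a hypothetical name, and this step is not part of the original analysis:

from sklearn.preprocessing import StandardScaler

# Standardize features so coefficient magnitudes are directly comparable
scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)

model_std = LinearRegression()
model_std.fit(x_train_std, y_train)
for name, coef in zip(x.columns, model_std.coef_):
    print(f"{name}: {coef:.4f}")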

import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)

# Plot the perfect prediction line
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r-')

# Label the axes
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Show the plot
plt.title('Actual vs Predicted Values')
plt.grid()
plt.show()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

# Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)
y_pred_logistic = logistic_model.predict(x_test)

# Evaluate Logistic Regression
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
precision_logistic = precision_score(y_test, y_pred_logistic)
recall_logistic = recall_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)

print("\nLogistic Regression:")
print("Accuracy:", accuracy_logistic)
print("Precision:", precision_logistic)
print("Recall:", recall_logistic)
print("F1 Score:", f1_logistic)

Logistic Regression:
Accuracy: 0.6932084309133489
Precision: 0.6856368563685636
Recall: 0.9440298507462687
F1 Score: 0.7943485086342229

'''Logistic regression reaches about 69% accuracy on the test set. Its
recall of 0.94 shows it catches most approved loans, but a precision of
0.69 means roughly three in ten predicted approvals are actually
rejections, so the model leans toward predicting approval. The F1 score
of 0.79 balances the two.'''
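To see where those errors fall, the logistic model's confusion matrix can be printed directly. A small sketch with scikit-learn, reusing y_pred_logistic from above (this printout is not in the original notebook):

from sklearn.metrics import confusion_matrix

# Rows are actual classes (0 = Rejected, 1 = Approved), columns are predictions
print(confusion_matrix(y_test, y_pred_logistic))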

# Naïve Bayes
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)
y_pred_nb = nb_model.predict(x_test)

# Evaluate Naïve Bayes
accuracy_nb = accuracy_score(y_test, y_pred_nb)
precision_nb = precision_score(y_test, y_pred_nb)
recall_nb = recall_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb)

print("\nNaïve Bayes:")
print("Accuracy:", accuracy_nb)
print("Precision:", precision_nb)
print("Recall:", recall_nb)
print("F1 Score:", f1_nb)
Naïve Bayes:
Accuracy: 0.7740046838407494
Precision: 0.7460545193687231
Recall: 0.9701492537313433
F1 Score: 0.8434712084347121

'''Naïve Bayes outperforms logistic regression on every metric here:
accuracy 0.77 vs 0.69, precision 0.75 vs 0.69, recall 0.97 vs 0.94, and
F1 0.84 vs 0.79. On these two features the generative model separates
approved and rejected loans more cleanly.'''
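The comparison is easier to scan as a small table. A sketch that simply collects the metrics already computed above into a DataFrame:

# Collect the metrics computed above into one comparison table
metrics = pd.DataFrame({
    'Logistic Regression': [accuracy_logistic, precision_logistic,
                            recall_logistic, f1_logistic],
    'Naïve Bayes': [accuracy_nb, precision_nb, recall_nb, f1_nb],
}, index=['Accuracy', 'Precision', 'Recall', 'F1'])
print(metrics.round(3))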

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Confusion Matrix for Naïve Bayes
cm_nb = confusion_matrix(y_test, y_pred_nb)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_nb, annot=True, fmt='d', cmap='Blues')
plt.title('Naïve Bayes Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# ROC Curve for Logistic Regression
fpr_logistic, tpr_logistic, _ = roc_curve(
    y_test, logistic_model.predict_proba(x_test)[:, 1])
roc_auc_logistic = auc(fpr_logistic, tpr_logistic)

plt.figure(figsize=(6, 4))
plt.plot(fpr_logistic, tpr_logistic, color='blue',
         label=f'Logistic Regression (AUC = {roc_auc_logistic:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()

# ROC Curve for Naïve Bayes
fpr_nb, tpr_nb, _ = roc_curve(y_test, nb_model.predict_proba(x_test)[:, 1])
roc_auc_nb = auc(fpr_nb, tpr_nb)

plt.figure(figsize=(6, 4))
plt.plot(fpr_nb, tpr_nb, color='green',
         label=f'Naïve Bayes (AUC = {roc_auc_nb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()
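Plotting both curves on one set of axes makes the AUC comparison direct. A small sketch reusing the arrays computed above (this combined plot is not in the original notebook):

# Overlay both ROC curves for a direct comparison
plt.figure(figsize=(6, 4))
plt.plot(fpr_logistic, tpr_logistic, color='blue',
         label=f'Logistic Regression (AUC = {roc_auc_logistic:.2f})')
plt.plot(fpr_nb, tpr_nb, color='green',
         label=f'Naïve Bayes (AUC = {roc_auc_nb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title('ROC Curves: Logistic Regression vs Naïve Bayes')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()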
