0% found this document useful (0 votes)

25 views16 pages

Machine Learning Paper BD

Uploaded by

Lenynquiroga

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views16 pages

Machine Learning Paper BD

Uploaded by

Lenynquiroga

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Predicting Possible Loan Default Using Machine Learning

A D VA NC E D C LA S S I F I C AT I O N D AT A S E T S LI BRA RI E S M A C HI NE LE A RNI NG PYT HO N

Introduction

Developing a prediction model for loan default involves collecting historical loan data, preprocessing it by
handling missing values and encoding variables, and selecting relevant features like credit scores and
employment history. Machine learning algorithms such as XGBoost in Python are then trained on this data
to predict default risk. Model performance is evaluated using metrics like accuracy and precision, and the
model’s predictions are used to assess risk and inform decision-making, such as adjusting loan terms or
rejecting high-risk applications. Overall, Python’s machine learning libraries enable the development of
effective prediction models for risk assessment and management in lending.

Learning Outcomes

Understanding Loan Default Prediction: Gain insight into the importance of loan default prediction in
financial risk assessment and decision-making.
Data Preprocessing Techniques: Learn essential data preprocessing steps such as handling missing
values, encoding categorical variables, and feature selection.
Application of Machine Learning Algorithms: Understand the application of machine learning
algorithms like XGBoost and Random Forest for loan default prediction in Python.
Performance Evaluation Metrics: Learn to evaluate model performance using metrics like accuracy,
precision, recall, F1-score, and AUC in binary classification tasks.

This ar ticle was published as a par t of the Data Science Blogathon.

Table of contents
Types of Default
Why Do People Borrow, and Why Do Lenders Exist?
Understanding the Dataset
Analyzing Numerical Columns
Analyzing Categorical Features
Data Analysis
Encoding
Splitting the Data into Train and Test Splits
Random Forest Classifier
Frequently Asked Questions

Types of Default

A secured debt default can happen when a borrower fails to make payments on a mortgage loan secured
by property or a business loan secured by assets. Similarly, corporate bond default occurs when a
company can’t meet coupon payments. Unsecured debt defaults, like credit card debt, also occur,
impacting the borrower’s credit and future borrowing capacities. These scenarios are essential
considerations in financial modeling, evaluation metrics, and learning methods, including linear regression
and deep learning algorithms.

Why Do People Borrow, and Why Do Lenders Exist?

Debt serves as a crucial resource for individuals and businesses, enabling them to afford significant
investments like homes and vehicles. However, while loans can offer financial opportunities, they also
pose significant risks.

Lending plays a pivotal role in driving economic growth, supporting both individuals and enterprises
worldwide. With economies becoming more interconnected, the demand for capital has surged, leading to
a substantial increase in retail, SME, and commercial borrowers. While this trend has boosted revenues for
many financial institutions, challenges have emerged.

In recent years, there has been a noticeable uptick in loan defaults, impacting the profitability and risk
management strategies of lenders. This trend underscores the importance of effective loan management,
supported by sophisticated techniques such as support vector machines and gradient-based models, to
assess loan amounts, repayment probabilities, and overall risk profiles accurately.

Let us work with a sample dataset to see how predicting the loan default works.

The Data

For an organization aiming to predict default on consumer lending products, leveraging historical client
behavior data is crucial. By analyzing past patterns, they can identify risky and low-risk consumers,
enabling them to optimize their lending decisions for future clients.

Utilizing advanced techniques like boosting, they can enhance the predictive power of their models,
identifying subtle patterns and signals indicative of default risk. This approach allows for the development
of robust predictive models tailored to the organization’s specific lending context.
Moreover, thorough validation processes ensure the reliability and accuracy of these models, validating
their performance on diverse datasets and ensuring their effectiveness in real-world scenarios. By
continuously refining and validating their predictive models, organizations can make informed lending
decisions, mitigating risks and maximizing returns.

In countries like China, where the lending landscape is rapidly evolving, such predictive analytics
capabilities are particularly valuable. With the growing complexity of consumer behavior and financial
transactions, leveraging data-driven insights becomes indispensable for effective risk management and
decision-making in the lending sector.

The data contains demographic features of each customer and a target variable showing whether they will
default on the loan or not.

First, we impor t the libraries and load the dataset.

import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline sns.set_theme(style = "darkgrid")

Now, we read the data.

data = pd.read_csv("/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv") data.head()

Output:

All the dataset columns are not visible here, but I will share the link to the notebook, so please check it
from there.

Understanding the Dataset

First, we start with understanding the data set and how is the data distributed.

rows, columns = data.shape print('Rows:', rows) print('Columns:', columns)

Output:

Rows: 252000
Columns: 13

So, we see that the data is 252000 rows, that is 252000 data points and 13 columns, that is 13 features.
Out of13 features, 12 are input features and 1 is output feature.
Now we check the data types and other information.

data.info()

Output:

RangeIndex: 252000 entries, 0 to 251999 Data columns (total 13 columns) # Column Non-Null Count Dtype --- ---

--- -------------- ----- 0 Id 252000 non-null int64 1 Income 252000 non-null int64 2 Age 252000 non-null

int64 3 Experience 252000 non-null int64 4 Married/Single 252000 non-null object 5 House_Ownership 252000

non-null object 6 Car_Ownership 252000 non-null object 7 Profession 252000 non-null object 8 CITY 252000 non-

null object 9 STATE 252000 non-null object 10 CURRENT_JOB_YRS 252000 non-null int64 11 CURRENT_HOUSE_YRS

252000 non-null int64 12 Risk_Flag 252000 non-null int64 dtypes: int64(7), object(6) memory usage: 25.0+ MB

So, we see that half the features are numeric and half are string, so they are probably categorical features.

Numerical data is the representation of measurable quantities of a phenomenon. We call numerical data
“quantitative data” in data science because it describes the quantity of the object it represents.

Categorical data refers to the properties of a phenomenon that can be named. This involves describing the
names or qualities of objects with words. Categorical data is referred to as “qualitative data” in data
science since it describes the quality of the entity it represents.

Let us check if there are any missing values in the data.

data.isnull().sum()

Output:

Id 0 Income 0 Age 0 Experience 0 Married/Single 0 House_Ownership 0 Car_Ownership 0 Profession 0 CITY 0 STATE

0 CURRENT_JOB_YRS 0 CURRENT_HOUSE_YRS 0 Risk_Flag 0 dtype: int64

So, there is no missing or empty data here.

Let us check the data column names.

data.columns

Output:

Index(['Id', 'Income', 'Age', 'Experience', 'Married/Single', 'House_Ownership', 'Car_Ownership',

'Profession', 'CITY', 'STATE', 'CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS', 'Risk_Flag'], dtype='object')

So, we get the names of the data features.

Analyzing Numerical Columns

First, we start with the analysis of numerical data.

data.describe()
Output:

Now, we check the data distribution.

data.hist( figsize = (22, 20) ) plt.show()

Output:
Now, we check the count of the target variable.

data["Risk_Flag"].value_counts()

Output:

0 221004 1 30996 Name: Risk_Flag, dtype: int64

Only a small part of the target variable consists of people who default on loans.

Now, we plot the correlation plot.

fig, ax = plt.subplots( figsize = (12,8) ) corr_matrix = data.corr() corr_heatmap = sns.heatmap( corr_matrix,

cmap = "flare", annot=True, ax=ax, annot_kws={"size": 14}) plt.show()

Output:
Analyzing Categorical Features

Now, we proceed with the analysis of categorical features.

First, we define a function to create the plots.

def categorical_valcount_hist(feature): print(data[feature].value_counts()) fig, ax = plt.subplots( figsize =

(6,6) ) sns.countplot(x=feature, ax=ax, data=data) plt.show()

First, we check the count of married people vs single people.

categorical_valcount_hist("Married/Single")

Output:
So, the majority of the people are single.

Now, we check the count of house ownership.

categorical_valcount_hist("House_Ownership")

Output:

Now, let us check the count of states.

print( "Total categories in STATE:", len( data["STATE"].unique() ) ) print() print(

data["STATE"].value_counts() )
Output:

Total categories in STATE: 29 Uttar_Pradesh 28400 Maharashtra 25562 Andhra_Pradesh 25297 West_Bengal 23483
Bihar 19780 Tamil_Nadu 16537 Madhya_Pradesh 14122 Karnataka 11855 Gujarat 11408 Rajasthan 9174 Jharkhand 8965

Haryana 7890 Telangana 7524 Assam 7062 Kerala 5805 Delhi 5490 Punjab 4720 Odisha 4658 Chhattisgarh 3834
Uttarakhand 1874 Jammu_and_Kashmir 1780 Puducherry 1433 Mizoram 849 Manipur 849 Himachal_Pradesh 833 Tripura

809 Uttar_Pradesh[5] 743 Chandigarh 656 Sikkim 608 Name: STATE dtype: int64

Now, we check the count of professions.

print( "Total categories in Profession:", len( data["Profession"].unique() ) ) print()

data["Profession"].value_counts()

Output:

Total categories in Profession: 51 Physician 5957 Statistician 5806 Web_designer 5397 Psychologist 5390

Computer_hardware_engineer 5372 Drafter 5359 Magistrate 5357 Fashion_Designer 5304 Air_traffic_controller

5281 Comedian 5259 Industrial_Engineer 5250 Mechanical_engineer 5217 Chemical_engineer 5205 Technical_writer

5195 Hotel_Manager 5178 Financial_Analyst 5167 Graphic_Designer 5166 Flight_attendant 5128

Biomedical_Engineer 5127 Secretary 5061 Software_Developer 5053 Petroleum_Engineer 5041 Police_officer 5035

Computer_operator 4990 Politician 4944 Microbiologist 4881 Technician 4864 Artist 4861 Lawyer 4818 Consultant
4808 Dentist 4782 Scientist 4781 Surgeon 4772 Aviator 4758 Technology_specialist 4737 Design_Engineer 4729

Surveyor 4714 Geologist 4672 Analyst 4668 Army_officer 4661 Architect 4657 Chef 4635 Librarian 4628
Civil_engineer 4616 Designer 4598 Economist 4573 Firefighter 4507 Chartered_Accountant 4493 Civil_servant

4413 Official 4087 Engineer 4048 Name: Profession dtype: int64

Data Analysis

Now, we start with understanding the relationship between the different data features.

sns.boxplot(x ="Risk_Flag",y="Income" ,data = data)

Output:

Now, we see the relationship between the flag variable and age.
sns.boxplot(x ="Risk_Flag",y="Age" ,data = data)

Output:

sns.boxplot(x ="Risk_Flag",y="Experience" ,data = data)

Output:

sns.boxplot(x ="Risk_Flag",y="CURRENT_JOB_YRS" ,data = data)

Output:
sns.boxplot(x ="Risk_Flag",y="CURRENT_HOUSE_YRS" ,data = data)

Output:

fig, ax = plt.subplots( figsize = (8,6) ) sns.countplot(x='Car_Ownership', hue='Risk_Flag', ax=ax, data=data)

Output:
fig, ax = plt.subplots( figsize = (8,6) ) sns.countplot( x='Married/Single', hue='Risk_Flag', data=data )

Output:

fig, ax = plt.subplots( figsize = (10,8) ) sns.boxplot(x = "Risk_Flag", y = "CURRENT_JOB_YRS",

hue='House_Ownership', data = data)

Output:
Encoding

Data preparation is a required process in the field of data science before moving on to modelling. In the
data preparation process, we must complete a number of tasks. One of these critical responsibilities is the
encoding of categorical data. We all know that most data in real life contains categorical string values, and
most machine learning models handle only integer values or other understandable formats. In essence, all
models execute mathematical operations that various tools and methodologies can perform.

Encoding categorical data is the process of turning categorical data into integer format so that data with
transformed categorical values may be fed into models to increase prediction accuracy.

We will apply encoding to the categorical features.

from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder import

category_encoders as ce

label_encoder = LabelEncoder() for col in ['Married/Single','Car_Ownership']: data[col] =

label_encoder.fit_transform( data[col] )

onehot_encoder = OneHotEncoder(sparse = False) data['House_Ownership'] =

onehot_encoder.fit_transform(data['House_Ownership'].values.reshape(-1, 1) )

high_card_features = ['Profession', 'CITY', 'STATE'] count_encoder = ce.CountEncoder() # Transform the

features, rename the columns with the _count suffix, and join to dataframe count_encoded =
count_encoder.fit_transform( data[high_card_features] ) data = data.join(count_encoded.add_suffix("_count"))

data= data.drop(labels=['Profession', 'CITY', 'STATE'], axis=1)

After the feature engineering part is complete, we shall split the data into training and testing sets.

Splitting the Data into Train and Test Splits

The train-test split is used to measure the performance of machine learning models relevant to prediction-
based Algorithms/Applications. This approach is a quick and simple procedure that allows us to compare
our own machine learning model outcomes to machine results. By default, the Test set is made up of 30%
of the real data, whereas the Training set is made up of 70% of the actual data.

To assess how effectively our machine learning model works, we must divide a dataset into training and
testing sets. The train set is used to train the Machine Learning model, and its statistics are known. The
second set is known as the test data set, and it is only utilized for predictions.

It is an impor tant par t of the ML chain.

x = data.drop("Risk_Flag", axis=1) y = data["Risk_Flag"]

from sklearn.model_selection import train_test_split x_train, x_test, y_train, y_test = train_test_split(x,

y, test_size = 0.2, stratify = y, random_state = 7)

We have taken the test size to be 20% of the entire data.

Random Forest Classifier

Tree-based algorithms, such as random forests, play a crucial role in loan default prediction and credit risk
assessment. These algorithms are adept at handling both classification and regression tasks, making
them valuable in analyzing loan applications. By generating predictions based on training samples, they
offer high accuracy and stability, crucial for identifying potential defaulters.

In the context of loan default prediction, tree-based algorithms help minimize false negatives and false
positives, ensuring robust risk assessment. While individual decision trees may overfit training data,
random forests mitigate this issue by averaging predictions from multiple trees, resulting in improved
prediction accuracy.

In academic research, studies exploring the efficacy of tree-based algorithms in loan default prediction can
be found in reputable journals. Authors often provide DOIs for their work, facilitating citation and further
research in this area. Additionally, comparisons between tree-based models and logistic regression models
may offer insights into the strengths and limitations of each approach in credit risk assessment.

Now, we train the model and perform the predictions.

from sklearn.ensemble import RandomForestClassifier from imblearn.over_sampling import SMOTE from

imblearn.pipeline import Pipeline

rf_clf = RandomForestClassifier(criterion='gini', bootstrap=True, random_state=100) smote_sampler =

SMOTE(random_state=9) pipeline = Pipeline(steps = [['smote', smote_sampler], ['classifier', rf_clf]])

pipeline.fit(x_train, y_train) y_pred = pipeline.predict(x_test)

Now, we check the accuracy scores.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score,
roc_auc_score print("-------------------------TEST SCORES-----------------------") print(f"Recall: {

round(recall_score(y_test, y_pred)*100, 4) }") print(f"Precision: { round(precision_score(y_test,

y_pred)*100, 4) }") print(f"F1-Score: { round(f1_score(y_test, y_pred)*100, 4) }") print(f"Accuracy score: {
round(accuracy_score(y_test, y_pred)*100, 4) }") print(f"AUC Score: { round(roc_auc_score(y_test,
y_pred)*100, 4) }")

Output:

-------------------------TEST SCORES----------------------- Recall: 54.1378 Precision: 54.3306 F1-Score:

54.234 Accuracy score: 88.7619 AUC Score: 73.8778

The accuracy scores might not be up to the mark, but this is the overall process of predicting loan default.

Code: Here

Conclusion

In summary predicting loan default involves thorough exploratory data analysis (EDA) to understand
dataset characteristics. Utilizing Python libraries and techniques like boosting, random forest classifiers,
and logistic regression, classification models are developed, leveraging artificial intelligence algorithms.
Evaluation metrics such as accuracy, precision, recall, F1-score, and AUC assess model performance in
binary classification tasks. International conferences on data science and AI foster collaboration and
innovation in risk assessment and management. This integrated approach enables effective prediction of
loan default, crucial for financial sector risk management.

Key Take Away

The Random Forest approach is appropriate for classification and regression tasks on datasets with
many entries and features that are likely to have missing values when we need a highly accurate result
while avoiding overfitting.
Furthermore, the random forest provides relative feature significance, enabling you to select the most
important features. It is more interpretable than neural network models but less interpretable than
decision trees.
In the case of categorical features, we need to perform encoding so that the ML algorithm can process
them.
Predicting Loan Default is highly dependent on the demographics of the people, people with lower
income are more likely to default on loans.

We are able to successfully perform the classification task using Random Forest Classifier. Hope you liked
my article on predicting loan default.

Frequently Asked Questions

Q1. Why is loan default prediction impor tant?

A. Loan default prediction is crucial for financial institutions to assess the risk associated with lending
money to individuals or businesses. By accurately predicting the likelihood of default, lenders can make
informed decisions regarding loan approval, interest rates, and loan terms, ultimately minimizing potential
losses and maintaining a healthy loan portfolio.

Q2. What is the best model to predict loan default?

A. There isn’t a universally “best” model for predicting loan defaults as it depends on various factors such
as the nature of the dataset, the available features, and the specific requirements of the lender. However,
commonly used models for loan default prediction include logistic regression, decision trees, random
forests, gradient boosting machines, and neural networks.

Q3. What is the probability of default prediction?

A. Probability of default prediction refers to estimating the likelihood or probability that a borrower will fail
to meet their loan obligations. This prediction is typically expressed as a numerical value ranging from 0 to
1, where 0 indicates low risk (unlikely to default) and 1 indicates high risk (likely to default). It serves as a
quantitative measure for assessing credit risk and informing lending decisions.

Q4. What is the loan default prediction dataset?

The loan default prediction dataset typically consists of historical loan data including various borrower
attributes such as credit score, income, employment status, debt-to-income ratio, loan amount, loan term,
and repayment history.

The media shown in this ar ticle is not owned by Analytics Vidhya and are used at the Author’s discretion.

Article Url - https://www.analyticsvidhya.com/blog/2022/04/predicting-possible-loan-default-using-

machine-learning/

Prateek Majumder
Prateek is a final year engineering student from Institute of Engineering and Management, Kolkata. He
likes to code, study about analytics and Data Science and watch Science Fiction movies. His favourite
Sci-Fi franchise is Star Wars. He is also an active Kaggler and part of many student communities in
College.

Credit Risk Modeling Using Python
No ratings yet
Credit Risk Modeling Using Python
133 pages
2024list - Registered Lending Companies 31 August 2024
No ratings yet
2024list - Registered Lending Companies 31 August 2024
99 pages
Default of Credit Card Clients
No ratings yet
Default of Credit Card Clients
33 pages
Questions and Answers
No ratings yet
Questions and Answers
19 pages
Finance and Risk Analytics Project PDF
No ratings yet
Finance and Risk Analytics Project PDF
94 pages
KCS Rules
100% (1)
KCS Rules
33 pages
Credit Card Default Prediction PRESENTATION
No ratings yet
Credit Card Default Prediction PRESENTATION
12 pages
EDA Credit Assignment Shakti - PDF
No ratings yet
EDA Credit Assignment Shakti - PDF
51 pages
Reasons To Pass A Guide To Making Fewer and Better Investments Ralph Birchmeier PDF Download
No ratings yet
Reasons To Pass A Guide To Making Fewer and Better Investments Ralph Birchmeier PDF Download
76 pages
Maths
No ratings yet
Maths
21 pages
Data Science Real World Applications
100% (1)
Data Science Real World Applications
19 pages
Banking Project Final
No ratings yet
Banking Project Final
38 pages
Cluster Credit Risk R PDF
No ratings yet
Cluster Credit Risk R PDF
13 pages
Vehicle Loan Fraud Prediction Using Data Science and Machine Learning Techniques
No ratings yet
Vehicle Loan Fraud Prediction Using Data Science and Machine Learning Techniques
4 pages
Finance and Risk Analytics Project Sai Vinayak Sanam PDF
No ratings yet
Finance and Risk Analytics Project Sai Vinayak Sanam PDF
99 pages
Soham Vani - CENTRAL BANK OF INDIA - SIP REPORT
No ratings yet
Soham Vani - CENTRAL BANK OF INDIA - SIP REPORT
59 pages
Final Project Title and Abstract Group-3
No ratings yet
Final Project Title and Abstract Group-3
5 pages
Van Hollen Parent Plus Letter
No ratings yet
Van Hollen Parent Plus Letter
5 pages
Djc2A: Banking Theory Law and Practice: Syllabus
No ratings yet
Djc2A: Banking Theory Law and Practice: Syllabus
94 pages
Credit Card Default Prediction: Final Project Report
No ratings yet
Credit Card Default Prediction: Final Project Report
28 pages
Capstone Presentation Final
No ratings yet
Capstone Presentation Final
14 pages
NMLS Cheat Sheet
100% (1)
NMLS Cheat Sheet
29 pages
Business Analytics
No ratings yet
Business Analytics
56 pages
Standard Bank Home Loan Prediction
No ratings yet
Standard Bank Home Loan Prediction
11 pages
FinTech Group Project
No ratings yet
FinTech Group Project
28 pages
Credit Loan Default Prediction
No ratings yet
Credit Loan Default Prediction
22 pages
Ads 9
No ratings yet
Ads 9
8 pages
Capstone Project
No ratings yet
Capstone Project
33 pages
DATA ANALYSIS AND DATA SCIENCE: Unlock Insights and Drive Innovation with Advanced Analytical Techniques (2024 Guide)
From Everand
DATA ANALYSIS AND DATA SCIENCE: Unlock Insights and Drive Innovation with Advanced Analytical Techniques (2024 Guide)
WINTON CLEM
No ratings yet
Summary and Context
No ratings yet
Summary and Context
51 pages
Decision Tree Assignment
No ratings yet
Decision Tree Assignment
7 pages
Edafinal 1
No ratings yet
Edafinal 1
32 pages
Ai It HW MST Prac
No ratings yet
Ai It HW MST Prac
14 pages
Coser Al. Crisan Albu (T)
No ratings yet
Coser Al. Crisan Albu (T)
17 pages
Final Report
No ratings yet
Final Report
69 pages
EasyChair Preprint 8693
No ratings yet
EasyChair Preprint 8693
22 pages
Credit EDA Case Study
No ratings yet
Credit EDA Case Study
42 pages
Business Report FRA-Extended Project
No ratings yet
Business Report FRA-Extended Project
22 pages
PA v0.7
No ratings yet
PA v0.7
15 pages
Reading Material - Module-5 - Introduction To Special Topics
No ratings yet
Reading Material - Module-5 - Introduction To Special Topics
27 pages
Data Mining Approach
No ratings yet
Data Mining Approach
4 pages
Capstone Project Report v1 - Abhishek Bihani
No ratings yet
Capstone Project Report v1 - Abhishek Bihani
16 pages
Next Level Deep Machine Learning: Complete Tips and Tricks to Deep Machine Learning
From Everand
Next Level Deep Machine Learning: Complete Tips and Tricks to Deep Machine Learning
Joe Grant
No ratings yet
Kami Export - Nyla Scott - CALCULATE - Using A Mortgage Calculator
No ratings yet
Kami Export - Nyla Scott - CALCULATE - Using A Mortgage Calculator
3 pages
Loan Status Prediction
No ratings yet
Loan Status Prediction
23 pages
Progress Report 2
No ratings yet
Progress Report 2
10 pages
Vechile Loan Defaulter
No ratings yet
Vechile Loan Defaulter
23 pages
Quadexp IDS Project
No ratings yet
Quadexp IDS Project
22 pages
EDA Group Case Study
No ratings yet
EDA Group Case Study
33 pages
1 PB
No ratings yet
1 PB
13 pages
Final Project Making Predictions From Data-Course 2: October 6, 2020
No ratings yet
Final Project Making Predictions From Data-Course 2: October 6, 2020
20 pages
Credit Default Project 23124001
No ratings yet
Credit Default Project 23124001
13 pages
An Kit
No ratings yet
An Kit
12 pages
Introduction to Data Analytics
From Everand
Introduction to Data Analytics
Dan Martin
No ratings yet
CHARLES KARISA - CreditReport
No ratings yet
CHARLES KARISA - CreditReport
2 pages
Group Statements of CashFlow
100% (2)
Group Statements of CashFlow
24 pages
Assessment of Default Risk Factors in The Disbursement of Home Loans
No ratings yet
Assessment of Default Risk Factors in The Disbursement of Home Loans
13 pages
1 - Understanding - The - Problem - and - The - Data - Ipynb - Colaboratory
No ratings yet
1 - Understanding - The - Problem - and - The - Data - Ipynb - Colaboratory
9 pages
Kritika Sejwal 24MCI10023 ML Lab Project Report
No ratings yet
Kritika Sejwal 24MCI10023 ML Lab Project Report
10 pages
Project Report - Lendingclub - FINAL
No ratings yet
Project Report - Lendingclub - FINAL
24 pages
PA v0.25
No ratings yet
PA v0.25
18 pages
Azki Task Solution - Afshin Amiri
No ratings yet
Azki Task Solution - Afshin Amiri
7 pages
Xtreme Boosting Machine
No ratings yet
Xtreme Boosting Machine
5 pages
Shekhar Ibhrampurkar Summer Internship Project Report
No ratings yet
Shekhar Ibhrampurkar Summer Internship Project Report
93 pages
Realestate Annual Handbook 2018
No ratings yet
Realestate Annual Handbook 2018
65 pages
PA v0.21
No ratings yet
PA v0.21
17 pages
Co-Brand Credit Card Questionnaire
100% (2)
Co-Brand Credit Card Questionnaire
3 pages
Credit Defaulter Classifier 1659348484
No ratings yet
Credit Defaulter Classifier 1659348484
7 pages
Investment EXIM RZV
No ratings yet
Investment EXIM RZV
70 pages
Python Code For Loan Default Prediction
No ratings yet
Python Code For Loan Default Prediction
4 pages
Group 5 Dseb64a Report
No ratings yet
Group 5 Dseb64a Report
10 pages
SSRN Id3769854
No ratings yet
SSRN Id3769854
8 pages
2021 10 19 - Statement
No ratings yet
2021 10 19 - Statement
3 pages
Files All
No ratings yet
Files All
23 pages
I. Parties Are Bound by Terms of The Agreement. If Agreement Permits Repossession of Vehicle, The Financier Has A Right of Repossession
No ratings yet
I. Parties Are Bound by Terms of The Agreement. If Agreement Permits Repossession of Vehicle, The Financier Has A Right of Repossession
63 pages
Sample Bcom Project PDF
No ratings yet
Sample Bcom Project PDF
13 pages
Javier v. Osmena, G.R. No. 9984, March 23, 1916 (34 Phil. 336)
No ratings yet
Javier v. Osmena, G.R. No. 9984, March 23, 1916 (34 Phil. 336)
6 pages
Loan Syndication
No ratings yet
Loan Syndication
30 pages
Perto: Mortgage Professional Australia
No ratings yet
Perto: Mortgage Professional Australia
10 pages
EF3e Int Filetest 2a
100% (1)
EF3e Int Filetest 2a
6 pages
II MBA (B&I) - 3rd Semester Syllabus
No ratings yet
II MBA (B&I) - 3rd Semester Syllabus
7 pages
Afa 4 7 PDF
No ratings yet
Afa 4 7 PDF
3 pages
The Oxford Handbook of Banking 3nbsped 0198824637 9780198824633
100% (3)
The Oxford Handbook of Banking 3nbsped 0198824637 9780198824633
1,309 pages
Credit Risk Analysis
No ratings yet
Credit Risk Analysis
6 pages
February 2017 Subject: Financial Management 2 FM: 100 Program: BBA Semester V PM: 50 Subject Code: BBK2233 Time: 3hrs
No ratings yet
February 2017 Subject: Financial Management 2 FM: 100 Program: BBA Semester V PM: 50 Subject Code: BBK2233 Time: 3hrs
3 pages
Loan Risk Prediction Using User Transaction Information
No ratings yet
Loan Risk Prediction Using User Transaction Information
3 pages
PNB VS Ca
No ratings yet
PNB VS Ca
2 pages
Puerto Azul Land
No ratings yet
Puerto Azul Land
1 page
How To Configure Credit Management in SAP?
No ratings yet
How To Configure Credit Management in SAP?
14 pages
Mary Ann Rodriguez
No ratings yet
Mary Ann Rodriguez
15 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Machine Learning Paper BD

Uploaded by

Machine Learning Paper BD

Uploaded by

Predicting Possible Loan Default Using Machine Learning

A D VA NC E D C LA S S I F I C AT I O N D AT A S E T S LI BRA RI E S M A C HI NE LE A RNI NG PYT HO N

This ar ticle was published as a par t of the Data Science Blogathon.

Why Do People Borrow, and Why Do Lenders Exist?

First, we impor t the libraries and load the dataset.

Now, we read the data.

data = pd.read_csv("/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv") data.head()

Understanding the Dataset

rows, columns = data.shape print('Rows:', rows) print('Columns:', columns)

Let us check if there are any missing values in the data.

Id 0 Income 0 Age 0 Experience 0 Married/Single 0 House_Ownership 0 Car_Ownership 0 Profession 0 CITY 0 STATE

0 CURRENT_JOB_YRS 0 CURRENT_HOUSE_YRS 0 Risk_Flag 0 dtype: int64

So, there is no missing or empty data here.

Let us check the data column names.

Index(['Id', 'Income', 'Age', 'Experience', 'Married/Single', 'House_Ownership', 'Car_Ownership',

'Profession', 'CITY', 'STATE', 'CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS', 'Risk_Flag'], dtype='object')

So, we get the names of the data features.

Analyzing Numerical Columns

First, we start with the analysis of numerical data.

Now, we check the data distribution.

data.hist( figsize = (22, 20) ) plt.show()

0 221004 1 30996 Name: Risk_Flag, dtype: int64

Now, we plot the correlation plot.

fig, ax = plt.subplots( figsize = (12,8) ) corr_matrix = data.corr() corr_heatmap = sns.heatmap( corr_matrix,

Now, we proceed with the analysis of categorical features.

First, we define a function to create the plots.

def categorical_valcount_hist(feature): print(data[feature].value_counts()) fig, ax = plt.subplots( figsize =

First, we check the count of married people vs single people.

Now, we check the count of house ownership.

Now, let us check the count of states.

print( "Total categories in STATE:", len( data["STATE"].unique() ) ) print() print(

Now, we check the count of professions.

print( "Total categories in Profession:", len( data["Profession"].unique() ) ) print()

Computer_hardware_engineer 5372 Drafter 5359 Magistrate 5357 Fashion_Designer 5304 Air_traffic_controller

5195 Hotel_Manager 5178 Financial_Analyst 5167 Graphic_Designer 5166 Flight_attendant 5128

4413 Official 4087 Engineer 4048 Name: Profession dtype: int64

sns.boxplot(x ="Risk_Flag",y="Income" ,data = data)

sns.boxplot(x ="Risk_Flag",y="Experience" ,data = data)

sns.boxplot(x ="Risk_Flag",y="CURRENT_JOB_YRS" ,data = data)

fig, ax = plt.subplots( figsize = (8,6) ) sns.countplot(x='Car_Ownership', hue='Risk_Flag', ax=ax, data=data)

fig, ax = plt.subplots( figsize = (10,8) ) sns.boxplot(x = "Risk_Flag", y = "CURRENT_JOB_YRS",

We will apply encoding to the categorical features.

from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder import

label_encoder = LabelEncoder() for col in ['Married/Single','Car_Ownership']: data[col] =

onehot_encoder = OneHotEncoder(sparse = False) data['House_Ownership'] =

high_card_features = ['Profession', 'CITY', 'STATE'] count_encoder = ce.CountEncoder() # Transform the

data= data.drop(labels=['Profession', 'CITY', 'STATE'], axis=1)

Splitting the Data into Train and Test Splits

It is an impor tant par t of the ML chain.

x = data.drop("Risk_Flag", axis=1) y = data["Risk_Flag"]

from sklearn.model_selection import train_test_split x_train, x_test, y_train, y_test = train_test_split(x,

We have taken the test size to be 20% of the entire data.

Random Forest Classifier

Now, we train the model and perform the predictions.

from sklearn.ensemble import RandomForestClassifier from imblearn.over_sampling import SMOTE from

rf_clf = RandomForestClassifier(criterion='gini', bootstrap=True, random_state=100) smote_sampler =

SMOTE(random_state=9) pipeline = Pipeline(steps = [['smote', smote_sampler], ['classifier', rf_clf]])

Now, we check the accuracy scores.

round(recall_score(y_test, y_pred)*100, 4) }") print(f"Precision: { round(precision_score(y_test,

-------------------------TEST SCORES----------------------- Recall: 54.1378 Precision: 54.3306 F1-Score:

Key Take Away

Frequently Asked Questions

Q1. Why is loan default prediction impor tant?

Q2. What is the best model to predict loan default?

Q3. What is the probability of default prediction?

Q4. What is the loan default prediction dataset?

Article Url - https://www.analyticsvidhya.com/blog/2022/04/predicting-possible-loan-default-using-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.