
Final Group Project

Group #1

Price College of Business, University of Oklahoma

FIN 4433-001 FinTech & Applications

Mandy Chan, MBA

Started on April 12th, 2025

Last Modified April 27th, 2025


Introduction

Credit risk, the probability that a borrower fails to meet contractual obligations, remains a core concern for lenders. Recent advances in machine learning, particularly gradient-boosting methods, allow institutions to model non-linear feature interactions and improve early detection of high-risk applicants. Replicating peer-reviewed code from public repositories aligns with best practices in open science and accelerates innovation in FinTech education.

The Challenges of Credit Risk Analysis before FinTech

Before automated scoring systems and networked databases became commonplace, credit-risk assessment was hampered by a chronic lack of reliable information. Borrower data were scattered across individual bank branches and local credit bureaus, often locked away in paper files. Because lenders could not instantly access a customer's full payment history, they made decisions with glaring information gaps, increasing the odds of both approving high-risk applicants and turning away creditworthy ones.

In the absence of large, consistent datasets, underwriting leaned heavily on human judgment. Loan officers relied on "character" interviews, letters from employers, or the applicant's reputation in the community. Although personal insight sometimes revealed nuances that numbers missed, it also introduced subjectivity and bias. Lending standards varied from branch to branch, and discriminatory practices could creep in unnoticed, exposing institutions to compliance and reputational risks.

Quantitative tools were rudimentary. Basic ratio analysis, such as comparing debt to income or assets to liabilities, was calculated by hand or on simple spreadsheets. Multivariate statistical models were rare outside the largest banks, largely because the computing power and skilled analysts needed to build them were expensive. As a result, lenders struggled to price risk accurately: interest rates and reserve cushions were either set too high, discouraging good borrowers, or too low, encouraging future defaults.

Operationally, the entire process was slow and costly. Collecting pay stubs, tax returns, and bank references required physical mail or in-person visits, so underwriting a single file could take days. High manual workloads limited a lender's ability to scale and made it difficult to handle spikes in application volume. Customers often waited weeks for funding decisions, eroding satisfaction and driving some to faster competitors.

Even after a loan was approved, monitoring remained largely static. Accounts were reviewed only at fixed intervals or once payments became delinquent. Because portfolio data were not linked in real time to economic indicators such as unemployment or regional downturns, credit quality could deteriorate rapidly before management noticed. By the time remedial action was taken, losses were often much larger than they would have been with earlier warning signals.

Some Applications of FinTech in Credit Risk Analysis

Modern credit-scoring engines

The most visible use of data analysis in credit risk is building credit-scoring models that predict the likelihood a new or existing borrower will repay on time. By feeding statistical or machine-learning algorithms with thousands of historical loan records (payment histories, utilization ratios, income stability, and even alternative data such as utility-bill payments or mobile-phone top-ups), lenders translate raw attributes into a single score or probability of default (PD). Because the score is automated and continuously recalibrated, it lets banks offer instant credit decisions, set credit-line limits, and comply with "fair-lending" regulations that demand objective, data-driven rules rather than subjective judgment.
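
As a minimal sketch of such a scoring engine, the snippet below trains a gradient-boosting classifier to output a PD per applicant. The file name loans.csv and the default column are hypothetical placeholders, not artifacts of the notebook discussed later.

```python
# A minimal PD-scoring sketch: gradient boosting on historical loan records.
# "loans.csv" and the "default" label are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

loans = pd.read_csv("loans.csv")          # payment histories, utilization, income, ...
X = loans.drop(columns=["default"])       # numeric features only, for simplicity
y = loans["default"]                      # 1 = defaulted, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
pd_scores = model.predict_proba(X_test)[:, 1]   # probability of default per applicant
print("Holdout AUC:", roc_auc_score(y_test, pd_scores))
```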

Limit assignment and dynamic line management


Once an account is opened, lenders still need to decide how much exposure to grant. Data analysis powers optimization frameworks that set initial credit limits and periodically adjust them up or down. By linking historical utilization patterns, macroeconomic indicators, and forward-looking loss forecasts, banks can expand limits for customers who show rising incomes and responsible usage while trimming or freezing limits for those who exhibit early warning signs such as maxing out, higher minimum payments, or deteriorating external scores. This fine-tuning balances revenue growth against loss containment more effectively than static "one-size-fits-all" policies.
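
To make the idea concrete, here is a deliberately simplified, rule-based sketch of dynamic line management; the thresholds and multipliers are invented for illustration, whereas production systems derive them from optimization models.

```python
# A toy dynamic-limit rule. Thresholds and multipliers are illustrative only;
# real line-management frameworks tune them with optimization models.
def adjust_limit(current_limit: float, utilization: float, pd_score: float) -> float:
    """Return an updated credit limit from utilization and a model PD score."""
    if pd_score > 0.10 or utilization > 0.95:      # early warning signs: trim exposure
        return round(current_limit * 0.8, 2)
    if pd_score < 0.02 and utilization < 0.50:     # low risk, responsible usage: expand
        return round(current_limit * 1.2, 2)
    return current_limit                           # otherwise hold the line steady

print(adjust_limit(5_000, utilization=0.30, pd_score=0.01))   # -> 6000.0
```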

Responsible-lending and model-explainability analytics

Modern credit-risk teams pair machine-learning power with fairness and explainability techniques to meet regulatory and ethical standards. Tools such as Shapley values and counterfactual analysis quantify how each input feature influences an individual prediction, enabling clear adverse-action notices to consumers who are declined. Bias-detection algorithms scan for disparate impact across protected classes, prompting data scientists to retrain models or add constraints that preserve accuracy while reducing unfair outcomes. This analytical layer turns raw predictive power into transparent, compliant, and socially responsible credit decisions.
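
As a hedged illustration, the snippet below computes Shapley-value contributions for a single applicant, assuming the open-source shap package and reusing the hypothetical model and X_test objects from the PD-scoring sketch above.

```python
# Per-applicant Shapley explanations, assuming the third-party `shap` package
# and the hypothetical `model` / `X_test` from the PD-scoring sketch above.
import shap

explainer = shap.TreeExplainer(model)          # fast explainer for tree ensembles
shap_values = explainer.shap_values(X_test)    # one contribution per feature per row

# Rank the features that most influenced applicant 0's score, e.g. to draft
# an adverse-action notice for a declined consumer.
drivers = sorted(zip(X_test.columns, shap_values[0]), key=lambda t: -abs(t[1]))
for feature, contribution in drivers[:3]:
    print(f"{feature}: {contribution:+.4f}")
```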

Stress testing and portfolio loss forecasting

Regulators and risk committees require lenders to show how their consumer portfolios will perform under adverse economic scenarios. Analysts build panel datasets that marry internal loan-level performance with unemployment rates, interest-rate paths, and regional house-price indices. Econometric or machine-learning models then project charge-offs and net losses under baseline, adverse, and severely adverse scenarios. These forecasts drive capital-adequacy planning and loan-loss-reserve calculations (CECL/IFRS 9), and they inform management on whether to tighten underwriting standards or raise pricing before a downturn hits.
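
A toy version of scenario loss forecasting follows; the portfolio rows and PD multipliers are invented, and the expected-loss formula (PD x LGD x balance) is the standard simplification rather than a full econometric projection.

```python
# Toy scenario loss forecast: expected loss = PD x LGD x balance per account.
# Portfolio values and PD multipliers are invented for illustration.
import pandas as pd

portfolio = pd.DataFrame({
    "balance": [12_000, 4_500, 30_000],   # exposure at default
    "pd":      [0.02, 0.05, 0.01],        # baseline probability of default
    "lgd":     [0.60, 0.90, 0.40],        # loss given default
})

scenarios = {"baseline": 1.0, "adverse": 1.8, "severely_adverse": 3.0}
for name, multiplier in scenarios.items():
    stressed_pd = (portfolio["pd"] * multiplier).clip(upper=1.0)
    expected_loss = (stressed_pd * portfolio["lgd"] * portfolio["balance"]).sum()
    print(f"{name}: expected loss ${expected_loss:,.0f}")
```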

Example Notebook of a FinTech Application in Credit Risk Analysis (with Walkthrough)

Business Problem

This app predicts whether an applicant will be approved for a credit card. Because every hard inquiry negatively affects a credit score, the app instead estimates the probability of approval without triggering one, letting prospective applicants gauge their chances before formally applying.

Method

The methodology used for this project includes (1) exploratory data analysis, (2) bivariate analysis, and (3) multivariate correlation.

Summary of the Notebook


This notebook serves as the exploratory-data-analysis (EDA) stage for a credit-card approval project. Its goal is to answer two early-stage questions:
1. What does each variable look like on its own?
Univariate analysis cells profile every feature (categorical and numeric) through value-frequency tables, histograms, box plots, pie charts, and summary statistics. These visuals expose skew, outliers, dominant categories, and missing-value patterns so that later cleaning and encoding choices are evidence-based.
2. How does a single feature interact with the target or with another feature?
In the bivariate analysis section, the notebook contrasts each variable against the binary label Is high risk (default flag). Side-by-side box plots, risk-segmented bar charts, and grouped means reveal which characteristics, such as short employment history or certain dwelling types, are disproportionately represented among bad applicants. These insights help shortlist promising predictors and highlight relationships (e.g., high-risk applicants having older accounts yet shorter job tenure) that modeling should capture.

Taken together, the univariate and bivariate explorations give stakeholders an intuitive, data-driven picture of applicant demographics, financial attributes, and early risk signals before the project moves on to multivariate modeling and machine-learning steps.

Notebook Walkthrough

0. Import the required packages

The first code cell, titled "0. import the necessary packages," is the notebook's staging area: it loads every library required for the credit-card-approval project so that later sections can focus entirely on data cleaning, modeling, and evaluation. General-purpose data wrangling is handled by NumPy and pandas, while exploratory assistants such as missingno and pandas-profiling make it easy to visualize gaps or anomalous values in the dataset. Matplotlib and Seaborn, reinforced by scikit-plot and Yellowbrick, give the author a full palette of plotting utilities, from quick histograms to ROC curves and feature-importance bars, rendered inline through the %matplotlib inline magic.

Statistical testing and path management come next. SciPy's statistical tools (for example, chi-square tests) support hypothesis checks on categorical variables, and pathlib.Path offers OS-agnostic file handling. A comprehensive slice of scikit-learn components then enters: splitters such as train_test_split and cross_val_score, preprocessing helpers like ColumnTransformer, OneHotEncoder, and MinMaxScaler, plus virtually every mainstream classification algorithm, from logistic regression and support-vector machines through decision-tree ensembles and neural networks. Calibration, cross-validation, permutation-based feature importance, and rich reporting utilities (classification_report, ConfusionMatrixDisplay, ROC functions) are also pulled in so the notebook can judge model quality on balanced grounds.

1. Import and process data


In [2] – Load the two raw data files
The cell reads application_record.csv into cc_data_full_data, which holds the applicant-level
features, and credit_record.csv into credit_status, which contains month-by-month repayment
information for those same customers. Bringing both tables into memory is the foundation for
every transformation that follows.
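A sketch of this load step, using the file and variable names given in the walkthrough (the relative paths are assumptions):

```python
# Load the two raw tables named in the walkthrough (paths are assumed).
import pandas as pd

cc_data_full_data = pd.read_csv("application_record.csv")   # applicant-level features
credit_status = pd.read_csv("credit_record.csv")            # month-by-month repayment history
```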
In [3] – Engineer risk labels and merge account age
First, it determines each borrower’s oldest account by grouping credit_status and taking the
minimum MONTHS_BALANCE, then merges that “Account age” back onto the main
application data. Next, it flags any serious delinquency: statuses “2”, “3”, “4”, or “5” are marked
“Yes” in a temporary dep_value column. By re-aggregating credit_status at the customer level,
the cell collapses multiple monthly rows into a single “Yes/No” high-risk indicator, merges it
into cc_data_full_data, converts the text label to numeric (1 = high risk, 0 = low risk), and drops
the helper column. A chained-assignment warning is also suppressed for neatness.
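A condensed sketch of that labeling logic is below; it assumes both tables share an ID key and that credit_status has a STATUS column, which matches the walkthrough's description but may differ in detail from the original cell.

```python
# Sketch of the In [3] logic: account age from the minimum MONTHS_BALANCE,
# plus a customer-level high-risk flag for any month with status 2-5.
# Assumes both tables share an "ID" key and credit_status has "STATUS".
account_age = (credit_status.groupby("ID")["MONTHS_BALANCE"]
               .min().rename("Account age").reset_index())
cc_data_full_data = cc_data_full_data.merge(account_age, on="ID", how="left")

credit_status["dep_value"] = credit_status["STATUS"].isin(["2", "3", "4", "5"])
is_high_risk = (credit_status.groupby("ID")["dep_value"]
                .any().astype(int).rename("Is high risk").reset_index())
cc_data_full_data = cc_data_full_data.merge(is_high_risk, on="ID", how="left")
credit_status = credit_status.drop(columns=["dep_value"])   # drop the helper column
```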
In [4] – Make column names human-readable
To improve clarity, this cell renames cryptic bureau codes such as CODE_GENDER or
DAYS_BIRTH into plain English equivalents like “Gender” and “Age.” It also renames the
previously added “Account age,” so the entire DataFrame now reads like a business-friendly
table.
In [5] – Define a reusable train/test split function
A small helper called data_split wraps train_test_split, taking the DataFrame and a test-size
fraction (here 0.2) and returning reset-index copies of the train and test subsets. Encapsulating
the logic keeps later code tidy and reproducible.
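The helper likely resembles the following sketch (the fixed random_state is an assumption added for reproducibility; the walkthrough does not specify one):

```python
# Sketch of the data_split helper described above. The random_state is an
# assumption added for reproducibility; the notebook may not fix one.
from sklearn.model_selection import train_test_split

def data_split(df, test_size):
    """Split a DataFrame and return reset-index copies of train and test."""
    train, test = train_test_split(df, test_size=test_size, random_state=42)
    return train.reset_index(drop=True), test.reset_index(drop=True)

cc_train_original, cc_test_original = data_split(cc_data_full_data, test_size=0.2)
```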
In [6] – Create the working train and test sets
Using the function above, the full application data is split 80%/20% into cc_train_original and cc_test_original. From this point onward, modeling work proceeds on the training set while performance will later be checked on the held-out test set.
In [7] – Quick sanity check: training-set shape
Simply prints cc_train_original.shape, letting the analyst verify that roughly 80% of the original rows landed in the training partition.
In [8] – Quick sanity check: test-set shape
Likewise prints cc_test_original.shape, confirming that the remaining 20% of records are in the test split.
In [9] – Persist the training data to disk
Saves cc_train_original as dataset/train.csv so that downstream notebooks or production
pipelines can load the identical training sample without rerunning the earlier preprocessing steps.
In [10] – Persist the test data to disk
Does the same for cc_test_original, writing it to dataset/test.csv for future out-of-sample
evaluation or model-comparison experiments.
In [11] – Protect the raw splits with working copies
Creates cc_train_copy and cc_test_copy, duplicates of the saved splits. Subsequent cleaning,
encoding, or feature-engineering operations can now proceed on the copies, while the pristine
originals remain untouched as a reference point.

2. Basic analysis of the dataset


In [12] — Build and save an automated EDA report
This cell runs Pandas Profiling on the cleaned training set (cc_train_copy). The ProfileReport
object scans every column, computes descriptive statistics, correlation matrices, and missing-
value charts, and then renders them into a self-contained HTML file. A Path check ensures the
report isn’t regenerated if it already exists; otherwise it is written to
pandas_profile_file/income_class_profile.html. The result is a point-and-click exploratory
dashboard that can be opened in any browser for a deep dive into feature distributions and data-
quality issues.
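The cell plausibly looks like the sketch below, assuming the pandas-profiling package (since republished as ydata-profiling) and the output path quoted in the walkthrough:

```python
# Sketch of the automated EDA report, assuming the pandas-profiling package
# (published today as ydata-profiling) and the walkthrough's output path.
from pathlib import Path
from pandas_profiling import ProfileReport

report_path = Path("pandas_profile_file/income_class_profile.html")
if not report_path.exists():                  # skip regeneration if it already exists
    report_path.parent.mkdir(parents=True, exist_ok=True)
    ProfileReport(cc_train_copy, title="Credit applicants profile").to_file(report_path)
```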
In [13] — Quick visual peek at the data
Calling cc_data_full_data.head() displays the first five rows of the full, feature-engineered
application table. This gives the analyst a sanity check that the earlier merges and renaming
produced sensible, human-readable columns and that the new “Is high risk” target is present.
In [14] — Structural overview of the DataFrame
cc_data_full_data.info() prints each column’s dtype, the number of non-null observations, and
overall memory usage. It confirms that missing values have been handled as expected and shows
which variables are numeric vs. object (categorical), guiding later encoding and scaling
decisions.
In [15] / Out [15] — Numeric summary statistics
cc_data_full_data.describe() returns a table (shown as Out [15]) of count, mean, standard
deviation, and the 25th, 50th, and 75th percentiles for every numeric feature—including income,
age, employment length, and account age. Analysts use this snapshot to spot unreasonable
ranges, skewed distributions, or potential outliers before moving on to modeling.
3. Define the functions used to explore each feature/pillar
In [18] – Value-count helper
This cell adds a utility called value_cnt_norm_cal. Given a DataFrame and a column name, the
function returns a tidy two-column table that shows the absolute Count of each distinct value and
its Frequency (%) expressed as a percentage. It is the workhorse behind most categorical plots
and summary printouts that follow, sparing the author from rewriting the same value_counts() /
normalization logic over and over.
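Based on that description, the helper is likely close to this sketch:

```python
# Sketch of value_cnt_norm_cal: absolute counts plus percentage frequencies
# for one column, returned as a tidy two-column table.
import pandas as pd

def value_cnt_norm_cal(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    counts = df[feature].value_counts()
    freqs = df[feature].value_counts(normalize=True) * 100
    return pd.concat([counts, freqs], axis=1, keys=["Count", "Frequency (%)"])

# Example usage: value_cnt_norm_cal(cc_train_copy, "Education level")
```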
In [19] – Quick feature profiler
gen_info_feat is a Swiss-army knife for on-the-fly exploration of a single feature. Using Python 3.10's match … case syntax, the function tailors its behavior to each variable: for Age it converts the raw negative "days" into positive years before printing descriptive stats and plotting a histogram; for categorical fields such as Education level or Dwelling it prints the value-frequency table from In [18] and shows a bar chart; for numeric money variables it draws box plots and histograms with scientific notation turned off. In short, one call delivers a concise textual and visual profile of whichever column is under investigation.
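A compressed sketch of that dispatch pattern (Python 3.10 or newer; it reuses the value_cnt_norm_cal sketch above, and the branches shown simplify the behavior described):

```python
# Simplified sketch of gen_info_feat's match/case dispatch (Python >= 3.10).
import matplotlib.pyplot as plt
import seaborn as sns

def gen_info_feat(df, feature):
    match feature:
        case "Age":
            years = -df[feature] / 365.25            # raw values are negative days
            print(years.describe())
            sns.histplot(years, kde=True)
        case "Education level" | "Dwelling":
            print(value_cnt_norm_cal(df, feature))   # helper from In [18]
            df[feature].value_counts().plot(kind="bar")
        case _:
            print(df[feature].describe())
            sns.histplot(df[feature])
    plt.show()
```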
In [20] – Pie-chart generator
create_pie_plot builds an "at-a-glance" pie chart for selected categorical attributes (e.g., Dwelling, Education level). It first grabs the percentage distribution via the helper from In [18], then feeds those percentages into plt.pie, formats the legend, enforces equal aspect so the circle isn't distorted, and titles the figure. Because credit datasets often have imbalanced classes, seeing the relative share of, say, "Rented" vs "Owned" housing in one picture can be more intuitive than a bar chart.
In [21] – Bar-chart generator
Complementing the pie routine, create_bar_plot produces vertical bar charts that show raw
counts for high-cardinality or business-critical categoricals—marital status, dwelling type, job
title, employment status, education level, and others. Tick labels are rotated and right-justified
for readability, and the same function falls back to a generic branch (case _:) so it will sensibly
plot any categorical column passed to it.
In [22] – Box-plot generator
create_box_plot focuses on the spread and outliers of numeric features. It again branches on the
feature name so that each variable is rendered with units users understand (e.g., Age converted to
years, Employment length converted from negative days to positive years, and incomes shown
with thousands separators). For discrete counts such as number of children it sets integer y-ticks,
while for money values it disables scientific notation. The result is a clean, vertically oriented
boxplot that highlights skewness and extreme observations.
In [23] – Histogram generator
create_hist_plot provides a complementary look at distribution shape. Like the box-plot helper, it
converts and formats special variables (Age, Income, Employment length) before calling
sns.histplot, overlays a kernel-density estimate, and allows the caller to specify the number of
bins. This is the go-to tool for diagnosing normality, skew, or multimodality in any numeric
column.
In [24] – High-risk vs low-risk boxplot
low_high_risk_box_plot dives deeper by splitting the numeric variable of interest (currently Age
or Income) into two groups—borrowers flagged Is high risk = 1 vs those flagged 0. It prints the
mean of each group for quick reference, then draws a side-by-side boxplot so analysts can see if,
for example, higher incomes coincide with fewer defaults or if older applicants have a different
risk profile.
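A sketch of the risk-segmented boxplot, assuming the renamed columns from In [4] ("Is high risk" plus a numeric feature such as "Income"):

```python
# Sketch of low_high_risk_box_plot: compare a numeric feature across the
# two risk groups. Column names follow the walkthrough's renamed schema.
import matplotlib.pyplot as plt
import seaborn as sns

def low_high_risk_box_plot(df, feature):
    print(df.groupby("Is high risk")[feature].mean())   # group means for reference
    sns.boxplot(data=df, x="Is high risk", y=feature)
    plt.title(f"{feature} for low-risk (0) vs high-risk (1) applicants")
    plt.show()
```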
In [25] – High-risk vs low-risk bar chart
Finally, low_high_risk_bar_plot serves the categorical analogue: it groups the data by a chosen
categorical feature, sums the Is high risk indicator to count how many risky customers sit in each
category, sorts those counts in descending order, prints the underlying dictionary, and renders a
bar chart. This immediately spotlights, say, which employment statuses or dwelling types
harbour the largest share of delinquent applicants, guiding subsequent feature engineering or
policy rules.
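And a matching sketch of the categorical counterpart:

```python
# Sketch of low_high_risk_bar_plot: count risky customers per category of a
# chosen feature, print the underlying numbers, and plot them.
import matplotlib.pyplot as plt

def low_high_risk_bar_plot(df, feature):
    risky_counts = (df.groupby(feature)["Is high risk"]
                      .sum()
                      .sort_values(ascending=False))
    print(risky_counts.to_dict())                       # underlying dictionary
    risky_counts.plot(kind="bar", title=f"High-risk count by {feature}")
    plt.show()
```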
Together, cells 18-25 equip the notebook with a full exploratory-data-analysis toolkit—tables,
pies, bars, histograms, and risk-segmented boxplots—that can be invoked repeatedly without
cluttering the main narrative of the credit-risk project.
4. Run Univariate Analysis
Core variables:

• Gender
• Age
• Marital status
• Children count
• Income
• Employment status
• Employment length
• Education level
• Has a property (yes/no)
• Account age
5. Bivariate Analysis (Correlation Test)
• Correlation of Age vs Features
• Correlation of Income vs Features
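One way such a correlation test can be run, as a sketch, is with SciPy's point-biserial correlation between a numeric feature and the binary default flag; the notebook's exact test may differ.

```python
# Sketch of a bivariate correlation test: point-biserial correlation between
# a numeric feature and the binary "Is high risk" flag. The notebook's exact
# test may differ; column names follow the walkthrough.
from scipy import stats

r, p_value = stats.pointbiserialr(cc_train_copy["Is high risk"], cc_train_copy["Age"])
print(f"Age vs default flag: r = {r:.3f}, p = {p_value:.3f}")
```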
Key Findings of this Notebook

A representative customer in this dataset is a woman around 40 years old who is married or cohabiting and has no children. She has worked for roughly five years, earns about $157k annually, and finished secondary school. While she does not own a car, she does possess residential real estate (such as a house or flat), and her credit account has been open for about 26 months.

Statistical tests indicate that neither age nor income shows a meaningful correlation with the default flag. Borrowers classified as high-risk generally have shorter job tenures and longer-standing accounts, yet they make up less than two percent of all observations. In contrast, the bulk of applicants are aged 20–45 and hold accounts that have been active for 25 months or less.

Implications for the Future

The disciplined, pillar-by-pillar EDA framework used in this notebook lays a foundation for explainable, regulator-ready credit-risk models. Because every variable is first profiled on its own, then contrasted with the default flag, analysts can trace exactly why a feature is included, how it behaves across segments, and whether it introduces bias. Embedding those checks as reusable functions means the same diagnostics can run automatically when new data arrive or when a model drifts, turning what is often a one-off exploratory step into a living set of controls that satisfy both internal risk committees and external auditors.

Looking ahead, the modular design also accelerates feature expansion and alternative-data experimentation. New signals (for example, utility-payment history or mobile-device metadata) can be dropped into the univariate/bivariate template and instantly subjected to the same scrutiny as traditional bureau fields. That consistency shortens the path from raw idea to production model while guarding against "black-box" pitfalls. As lenders move toward real-time approvals and dynamic credit limits, a methodology that couples rapid exploration with transparent governance will be crucial for scaling machine-learning risk engines without sacrificing trust or compliance.

Acknowledgement

The notebook used for this report is a derived and simplified version of the project "Credit-card-approval-prediction-classification" by Stern Semasuka (username: @semasuka) on GitHub.
