FinTech Group Project
Group #1
Credit risk, the probability that a borrower fails to meet contractual obligations, remains a
core concern for lenders. Recent advances in machine learning, particularly gradient-boosting
methods, allow institutions to model non-linear feature interactions and improve early detection
of high-risk applicants. Replicating peer-reviewed code from public repositories aligns with best
practices for reproducible, transparent research.
Before automated scoring systems and networked databases became commonplace, credit-
risk assessment was hampered by a chronic lack of reliable information. Borrower data were
scattered across individual bank branches and local credit bureaus, often locked away in paper
files. Because lenders could not instantly access a customer’s full payment history, they made
decisions with glaring information gaps, increasing the odds of both approving high-risk applicants
and rejecting creditworthy ones. Assessment therefore leaned heavily on personal
judgment. Loan officers relied on “character” interviews, letters from employers, or the applicant’s
reputation in the community. Although personal insight sometimes revealed nuances that numbers
missed, it also introduced subjectivity and bias. Lending standards varied from branch to branch,
and discriminatory practices could creep in unnoticed, exposing institutions to compliance and
reputational risks.
Quantitative tools were rudimentary. Basic ratio analysis, such as comparing debt to
income, offered only a coarse view of repayment capacity, and
statistical models were rare outside the largest banks, largely because the computing power and
skilled analysts needed to build them were expensive. As a result, lenders struggled to price
risk accurately: interest rates and reserve cushions were either set too high, discouraging good
borrowers, or too low, leaving the institution exposed to unexpected losses.
Operationally, the entire process was slow and costly. Collecting pay stubs, tax returns, and
bank references required physical mail or in-person visits, so underwriting a single file could take
days. High manual workloads limited a lender’s ability to scale and made it difficult to handle
spikes in application volume. Customers often waited weeks for funding decisions, eroding
satisfaction and loyalty.
Even after a loan was approved, monitoring remained largely static. Accounts were
reviewed only at fixed intervals or once payments became delinquent. Because portfolio data were
not linked in real time to economic indicators such as unemployment or regional downturns, credit
quality could deteriorate rapidly before management noticed. By the time remedial action was
taken, losses were often much larger than they would have been with earlier warning signals.
The most visible use of data analysis in credit risk is building credit-scoring models that
predict the likelihood a new or existing borrower will repay on time. By feeding statistical or
machine-learning models with attributes such as payment history, credit-utilization ratios, income
stability, and even alternative data such as utility-bill payments or mobile-phone top-ups, lenders
translate raw attributes into a single score or probability of default (PD).
Because the score is automated and continuously recalibrated, it lets banks offer instant credit
decisions, set credit-line limits, and comply with “fair-lending” regulations that demand objective,
consistent criteria. The same analysis powers optimization frameworks that set initial credit limits
and periodically adjust them. By combining behavioral scores with forward-
looking loss forecasts, banks can expand limits for customers who show rising incomes and
responsible usage while trimming or freezing limits for those who exhibit early warning signs such
as rising utilization or missed payments. This dynamic approach
balances revenue growth against loss containment more effectively than static “one-size-fits-all”
policies.
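To make the scoring step concrete, the sketch below (not the project's own pipeline; the file and column names are hypothetical) shows how a gradient-boosting classifier can turn applicant attributes into a probability of default.

    # Hedged sketch: fitting a gradient-boosting model that outputs a probability
    # of default (PD). The CSV path and column names are illustrative only.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("applications.csv")                      # hypothetical file
    X = df[["income", "employment_length", "account_age", "children_count"]]
    y = df["is_high_risk"]                                    # 1 = high risk / default

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = GradientBoostingClassifier().fit(X_train, y_train)
    pd_scores = model.predict_proba(X_test)[:, 1]             # probability of default
    print("AUC:", roc_auc_score(y_test, pd_scores))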
Modern credit-risk teams pair machine-learning power with fairness and explainability
techniques to meet regulatory and ethical standards. Tools such as Shapley values and
counterfactual analysis quantify how each input feature influences an individual prediction,
enabling clear adverse-action notices to consumers who are declined. Bias-detection algorithms
scan for disparate impact across protected classes, prompting data scientists to retrain models or
add constraints that preserve accuracy while reducing unfair outcomes. This analytical layer turns
raw predictive power into transparent, compliant, and socially responsible credit decisions.
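A minimal sketch of how Shapley values could be computed for individual predictions is shown below; it assumes the shap library and reuses the hypothetical model and X_test objects from the earlier scoring sketch, so it is an illustration rather than the notebook's code.

    # Hedged sketch: per-applicant feature attributions with SHAP for a
    # tree-based model (reuses the hypothetical model/X_test from above).
    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Contribution of each feature to the first applicant's prediction; these
    # values can back a plain-language adverse-action notice.
    for feature, value in zip(X_test.columns, shap_values[0]):
        print(f"{feature}: {value:+.3f}")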
Regulators and risk committees require lenders to show how their consumer portfolios will
perform under adverse economic scenarios. Analysts build panel data sets that marry internal loan-
level performance with unemployment rates, interest-rate paths, and regional house-price indices.
Models fitted to these data then project default and loss rates under baseline,
adverse, and severely adverse scenarios. These forecasts drive capital-adequacy planning and loan-
loss-reserve calculations (CECL/IFRS 9), and inform management on whether to tighten
underwriting standards or rebalance the portfolio.
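As a rough illustration of that workflow (all file and column names are hypothetical), a panel could be assembled and scored under each scenario roughly as follows.

    # Hedged sketch: joining loan-level history to macro data, fitting a simple
    # PD model, and projecting it over scenario paths. Names are illustrative.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    loans = pd.read_csv("loan_performance.csv")    # loan_id, region, quarter, defaulted, ...
    macro = pd.read_csv("macro_history.csv")       # quarter, region, unemployment, rate, hpi
    hist = loans.merge(macro, on=["quarter", "region"])

    features = ["unemployment", "rate", "hpi"]
    pd_model = LogisticRegression(max_iter=1000).fit(hist[features], hist["defaulted"])

    # Score forward-looking baseline / adverse / severely adverse paths.
    scenarios = pd.read_csv("macro_scenarios.csv") # scenario, quarter, region, same features
    scenarios["pd"] = pd_model.predict_proba(scenarios[features])[:, 1]
    print(scenarios.groupby("scenario")["pd"].mean())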
Business Problem
This app predicts whether an applicant will be approved for a credit card. Because every hard
inquiry lowers an applicant's credit score, submitting a formal application just to test the outcome
is costly. The app instead estimates the probability of approval up front, letting applicants gauge
their chances without triggering an inquiry or affecting their credit score.
Method
The methodology used for this project centres on exploratory data analysis (EDA), organised
around two questions.
1. What does each feature look like on its own?
Univariate analysis cells profile every feature (categorical and numeric) through value-
frequency tables, histograms, box-plots, pie-charts, and summary statistics. These visuals
expose skew, outliers, dominant categories, and missing-value patterns so that later
cleaning and modelling steps can address them.
2. How does a single feature interact with the target or with another feature?
In the bivariate analysis section, the notebook contrasts each variable against the binary
label Is high risk (default flag). Side-by-side box-plots, risk-segmented bar charts, and
correlation tests surface patterns (for example, high-risk applicants
having older accounts yet shorter job tenure) that modelling should capture.
Taken together, the univariate and bivariate explorations give stakeholders an intuitive, data-
driven picture of applicant demographics, financial attributes, and early risk signals before the
modelling stage begins.
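A minimal sketch of that univariate/bivariate pattern (with hypothetical file and column names, not the notebook's exact code) looks like this.

    # Hedged sketch of the univariate/bivariate EDA pattern described above.
    # The file name and column names are illustrative.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("application_record.csv")

    # Univariate: frequency table for a categorical feature, histogram for a numeric one.
    print(df["education_level"].value_counts(dropna=False))
    df["income"].plot(kind="hist", bins=30, title="Income distribution")
    plt.show()

    # Bivariate: contrast a numeric feature against the binary default flag.
    sns.boxplot(data=df, x="is_high_risk", y="employment_length")
    plt.title("Employment length by risk label")
    plt.show()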
Notebook Walkthrough
The first code cell, titled “0. import the necessary packages,” is the notebook’s staging
area: it loads every library required for the credit-card-approval project so that later sections
can focus entirely on data cleaning, modeling, and evaluation. General-purpose data-wrangling
libraries handle loading and manipulation, while
missingno and pandas-profiling make it easy to visualize gaps or anomalous values in the
dataset. Matplotlib and Seaborn, reinforced by scikit-plot and Yellowbrick, give the author
a full palette of plotting utilities, from quick histograms to ROC curves and feature-importance
charts. Statistical testing and path management come next: SciPy's statistical tools (for
correlation and hypothesis tests) sit alongside path utilities. Finally, the modeling imports cover
a wide range of scikit-learn classifiers, from support-
vector machines through decision-tree ensembles and neural networks, along with calibration,
cross-validation, and evaluation helpers.
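An import cell of the kind described might look like the sketch below; the original notebook's exact imports and their order may differ.

    # Hedged sketch of a "0. import the necessary packages" cell; the actual
    # notebook may load a different or larger set of libraries.
    import numpy as np
    import pandas as pd
    import missingno as msno
    from pandas_profiling import ProfileReport   # newer releases: ydata_profiling
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scikitplot as skplt
    from yellowbrick.classifier import ROCAUC
    from scipy import stats
    from pathlib import Path
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import cross_val_score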
The dataset analysed in the notebook contains the following applicant features:
• Gender
• Age
• Marital status
• Children count
• Has a property (yes/no)
• Income
• Employment status
• Employment length
• Education level
• Account age
4. Univariate Analysis
• Gender
• Age
• Marital Status
• Income
• Employment Status
• Educational Level
• Employment Length
• Property Ownership
• Account Owner Length
5. Bivariate Analysis (Correlation Test)
• Correlation of Age vs Features
• Correlation of Income vs Features
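A hedged sketch of such a correlation check (illustrative file and column names, not the notebook's exact code):

    # Hedged sketch of the bivariate correlation test: Pearson correlation of
    # Age and Income against the other numeric features. Names are illustrative.
    import pandas as pd

    df = pd.read_csv("application_record.csv")
    corr = df.select_dtypes("number").corr()
    print(corr["age"].sort_values(ascending=False))
    print(corr["income"].sort_values(ascending=False))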
Key Findings of this Notebook
A representative customer in this dataset is a woman around 40 years old
who is married or cohabiting and has no children. She has worked for roughly five years, earns
about $157 k annually, and finished secondary school. While she does not own a car, she does
possess residential real estate (such as a house or flat) and her credit account has been open for
about 26 months.
Statistical tests indicate that neither age nor income shows a meaningful correlation with
the default flag. Borrowers classified as high-risk generally have shorter job tenures and longer-
standing accounts, yet they make up less than two percent of all observations. In contrast, the bulk
of applicants are aged 20–45 and hold accounts that have been active for 25 months or less.
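Figures like these can be reproduced with a short check of the kind sketched below (the file and column names are hypothetical, and the snippet is not taken from the notebook).

    # Hedged sketch: share of high-risk borrowers and their (weak) correlation
    # with age and income. File and column names are illustrative.
    import pandas as pd

    df = pd.read_csv("application_record.csv")
    print(df["is_high_risk"].value_counts(normalize=True))     # expect < 2% positives
    print(df[["age", "income"]].corrwith(df["is_high_risk"]))  # expect values near zero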
The disciplined, pillar-by-pillar EDA framework used in this notebook lays a foundation
for explainable, regulator-ready credit-risk models. Because every variable is first profiled on its
own, then contrasted with the default flag, analysts can trace exactly why a feature is included,
how it behaves across segments, and whether it introduces bias. Embedding those checks as
reusable functions means the same diagnostics can run automatically when new data arrive or
when a model drifts, turning what is often a one-off exploratory step into a living set of controls.
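As a sketch of what one such reusable check might look like (a hypothetical helper, not taken from the notebook):

    # Hedged sketch of a reusable per-feature diagnostic that can be re-run
    # whenever new data arrive; the column handling is illustrative.
    import pandas as pd

    def profile_feature(df: pd.DataFrame, col: str, target: str = "is_high_risk") -> dict:
        """Return basic univariate stats plus a simple target-association signal."""
        series = df[col]
        report = {"missing_share": series.isna().mean(), "n_unique": series.nunique()}
        if pd.api.types.is_numeric_dtype(series):
            report["mean"] = series.mean()
            report["corr_with_target"] = series.corr(df[target])
        else:
            report["top_category_share"] = series.value_counts(normalize=True).iloc[0]
        return report

Running the same function over every column each time a fresh extract arrives is one way to turn the EDA into the living controls described above.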
Looking ahead, the modular design also accelerates feature expansion and alternative-data
adoption: new attributes, such as the utility-bill or mobile-phone payment records mentioned earlier,
can be dropped into the univariate/bivariate template and instantly subjected to the same scrutiny
as traditional bureau fields. That consistency shortens the path from raw idea to production model
while guarding against “black-box” pitfalls. As lenders move toward real-time approvals and
dynamic credit limits, a methodology that couples rapid exploration with transparent governance
will be crucial for scaling machine-learning risk engines without sacrificing trust or compliance.
Acknowledgement
The notebook used for this report is a derived and simplified version of a publicly available project
from GitHub.