Company Bankruptcy Detection
1. Problem Statement
2. Feature Selection
3. Exploratory Data Analysis
4. Applying the Models
5. Model Selection and Validation
➢ Problem Statement
● For this project, we aim to build a model that captures bankruptcy patterns
among companies. The model is intended to provide early warning signs of
financial distress in corporations.
➢ Data Pipeline
● Cleaning the Data: The data was checked for null values and categorical
columns, and a primary statistical inspection was performed (a minimal sketch follows).
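A minimal sketch of this first inspection step, assuming the dataset is loaded into a pandas DataFrame; the CSV file name used here is hypothetical:

```python
# Sketch of the primary inspection step; the CSV file name is a placeholder.
import pandas as pd

df = pd.read_csv("company_bankruptcy.csv")          # hypothetical file name
print(df.isnull().sum().sum())                      # total number of null values
print(df.select_dtypes(include="object").columns)   # any categorical columns
print(df.describe().T.head())                       # basic statistical summary
```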
➢ Data Pipeline
● The data were collected from the Taiwan Economic Journal for the years
1999 to 2009. Company bankruptcy was defined based on the business
regulations of the Taiwan Stock Exchange.
➢ Data Summary
● Operating Expense Rate: The operating expense rate shows the
efficiency of a company's management by comparing the total operating
expense (OPEX) of a company to net sales.
Important Feature Selection Techniques
➢ Important Feature Selection Techniques
● Quasi-Constant Method: Quasi-constant features are those that show the same
value for the great majority of observations in the dataset. In general, these
features provide little, if any, information that allows a machine learning
model to discriminate or predict the target.
Using the quasi-constant method, 31 columns were selected (see the sketch below).
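A minimal sketch of quasi-constant filtering, here using scikit-learn's VarianceThreshold; the 0.01 threshold and the variable name X are assumptions, not the exact setup used:

```python
# Quasi-constant filtering sketch; the threshold value is an assumed illustration.
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)      # drop features that barely vary
selector.fit(X)                                   # X = feature DataFrame
quasi_cols = X.columns[selector.get_support()]    # columns that survive the filter
print(len(quasi_cols), "columns kept")
```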
➢ Important Feature Selection Techniques
● Information Gain: Information gain, or mutual information, measures how much
information the presence or absence of a feature contributes to making the
correct prediction on the target.
Using the information gain method, 30 columns were selected (see the sketch below).
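A minimal sketch of mutual-information-based selection; SelectKBest with k=30 mirrors the "30 columns" figure above, but the exact procedure is an assumption:

```python
# Information gain (mutual information) selection sketch.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=30)
selector.fit(X, y)                            # X = features, y = bankruptcy flag
ig_cols = X.columns[selector.get_support()]   # the 30 highest-scoring columns
```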
Exploratory Data Analysis
Primary EDA: Understanding the effect of the value
of a feature on the decision
➢ Insights
● Bankruptcy is more likely if the value of features such as ROA(A) is low.
● The value of features such as Accounts Receivable Turnover shows little effect
on the bankruptcy outcome.
● For higher values of features like Retained Earnings to Total Assets, a company is
more likely to stay afloat.
Primary EDA: Checking the Imbalance
➢ Insights
● The target variable is highly imbalanced: fewer than 3.5% of the instances in
the dataset belong to bankrupt companies (a quick check follows).
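A one-line check of this imbalance, assuming the target column is named "Bankrupt?":

```python
# Class balance check; "Bankrupt?" is the assumed target column name.
print(df["Bankrupt?"].value_counts(normalize=True))   # share of bankrupt vs. healthy firms
```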
Primary EDA: Detecting and Capping Outliers
(Plots: feature distributions before vs. after capping)
● Outliers were capped at the 80th percentile value on the upper side and at the
20th percentile value on the lower side (see the sketch below).
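A minimal sketch of the capping step with pandas, clipping each column at its 20th and 80th percentiles:

```python
# Cap outliers column-wise at the 20th and 80th percentiles.
lower = X.quantile(0.20)
upper = X.quantile(0.80)
X_capped = X.clip(lower=lower, upper=upper, axis=1)
```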
Primary EDA: Understanding the correlation between
features and target variable
Primary EDA: Understanding the correlation between
features and target variable. (Contd.)
➢ Insights
● After going through all the columns, we found that none of the features displayed
a high correlation (> 0.5) with the target variable, bankruptcy (see the sketch below).
● The most negatively correlated feature was 'Net Income to Total Assets', at -0.32.
● The most positively correlated feature was 'Debt Ratio', at +0.25.
● Conclusion: The dataset offers little modelling power for linear algorithms. Since
there is a high imbalance between the classes, this looks like an anomaly detection
problem, so we next try an anomaly detection algorithm.
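A minimal sketch of the correlation check against the target; "Bankrupt?" is the assumed target column name:

```python
# Correlation of every feature with the target variable.
corr = df.corr()["Bankrupt?"].drop("Bankrupt?")
print(corr.sort_values().head())   # most negatively correlated features
print(corr.sort_values().tail())   # most positively correlated features
```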
EDA: Using an Anomaly Detection Algorithm to Further Explore the Data
➢ Insights
● Isolation Forest was used to detect anomalies in the dataset (see the sketch
below). Since the results were not up to the mark, we visualize the dataset to
investigate the problem.
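A minimal sketch of the Isolation Forest experiment; setting contamination near the observed bankruptcy rate (~3.5%) is an assumption about the configuration:

```python
# Isolation Forest anomaly detection sketch.
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.035, random_state=42)
pred = iso.fit_predict(X)                 # -1 = anomaly, 1 = normal
anomaly_flag = (pred == -1).astype(int)   # compare against the bankruptcy label
```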
EDA: Data Visualization in 2 and 3 Dimensions Using PCA
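A minimal sketch of the 2-D projection; scaling before PCA and the plotting details are assumptions made for illustration:

```python
# 2-D PCA projection of the feature space, coloured by the bankruptcy label.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="coolwarm")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```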
Applying Models
Applying Model: 1. Features Selected Using VIF
● Highest recall on the test set is obtained from Logistic Regression, Easy
Ensemble and SVM.
● Highest test precision is obtained with KNN.
● XGBoost has the best F1-score on the test set, at 43% (a VIF selection sketch follows).
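A minimal sketch of VIF-based feature selection with statsmodels; the cutoff of 10 is a common rule of thumb and an assumption, not necessarily the value used here:

```python
# VIF-based filtering sketch; VIF < 10 is an assumed cutoff.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
X_vif = X[vif[vif < 10].index]   # keep features with acceptable multicollinearity
```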
Applying Model: 2. Features Selected Using VIF & p-values (Logit Function)
● Highest recall on the test set is obtained from Easy Ensemble, at 87.2%.
● Highest test precision is obtained with KNN, at 66%.
● XGBoost has the best F1-score on the test set, at 44%.
Applying Model: 3. Features Selected Using VIF & L1 Regularization
● Highest recall on the test set is obtained from Easy Ensemble, at 92%.
● Highest test precision is obtained from KNN, at 42%.
● Random Forest has the best F1-score on the test set, at 41%.
Applying Model: 4. Features Selected Using p-values with the OLS Model
● Highest recall on the test set is obtained from Easy Ensemble, at 85.4%.
● Highest test precision is obtained from KNN, at 71.4%.
● Gaussian Naive Bayes has the best F1-score on the test set, at 51% (an OLS p-value sketch follows).
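A minimal sketch of p-value-based selection with a statsmodels OLS fit; the 0.05 significance cutoff is an assumption:

```python
# p-value-based selection with OLS; p < 0.05 is an assumed cutoff.
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
pvals = ols.pvalues.drop("const")
X_ols = X[pvals[pvals < 0.05].index]   # keep statistically significant features
```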
Applying Model: 5. Using the Quasi-Constant Method
● Highest recall on the test set is obtained from Logistic Regression, at 89%.
● Highest test precision is obtained from KNN, at 62.5%.
● XGBoost has the best F1-score on the test set, at 47.4%.
Applying Model: 6. Using Information Gain
● Highest recall on the test set is obtained from Easy Ensemble, at 90%.
● Highest test precision is obtained from KNN, at 66.5%.
● XGBoost has the best F1-score on the test set, at 41%.
Applying Model: 7. Using the Random Forest Method
● Highest recall on the test set is obtained from Easy Ensemble, at 94%.
● Highest test precision is obtained from KNN, at 50.5%.
● Gaussian Naive Bayes has the best F1-score on the test set, at 40% (a feature-importance sketch follows).
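A minimal sketch of Random Forest importance-based selection; keeping the top 20 features is an assumption about how many columns were retained:

```python
# Random Forest feature-importance selection sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
X_rf = X[importances.nlargest(20).index]   # the top-20 cut is illustrative
```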
Applying Model: 8. Using the Lasso Regularization Method
● Highest recall on the test set is obtained from Easy Ensemble, at 83%.
● Highest test precision is obtained from KNN, at 50%.
● XGBoost has the best F1-score on the test set, at 45% (an L1 selection sketch follows).
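A minimal sketch of L1 (Lasso) based selection via SelectFromModel; using an L1-penalised logistic regression here is an assumption about the exact setup:

```python
# L1-regularisation feature selection sketch.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
selector = SelectFromModel(l1_model).fit(X, y)
X_l1 = X.loc[:, selector.get_support()]   # keep features with non-zero coefficients
```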
Applying Model: 9. Using SMOTE with Quasi-Constant Selection
● Highest recall on the test set is obtained from Logistic Regression and SVM, at 82%.
● Highest test precision is obtained from Easy Ensemble, at 31%.
● XGBoost has the best F1-score on the test set, at 39% (a SMOTE sketch follows).
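A minimal sketch of oversampling with SMOTE from imbalanced-learn, applied only to the training split; the split parameters are assumptions:

```python
# SMOTE oversampling sketch; applied to the training data only.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```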
➢ Conclusion
● If recall is critical to the use case, then the Easy Ensemble model should be
used with Random Forest feature selection, which obtains a 94% recall.
● If precision is critical to the use case, then the KNN model should be used
with the p-values-with-OLS feature selection method, which obtains a precision of
71.4% (a sketch of the recall-oriented setup follows after this list).
● Deadlines felt a little strained, but it all worked out for the best.
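A minimal sketch of the recall-oriented recommendation: an EasyEnsembleClassifier from imbalanced-learn trained on the Random-Forest-selected features; the hyperparameters and variable names are assumptions:

```python
# Easy Ensemble sketch for the high-recall setup; X_train/X_test are assumed to
# hold the Random-Forest-selected features (variable names are illustrative).
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)
print(classification_report(y_test, ee.predict(X_test)))
```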
Q&A