Company Bankruptcy Detection

This document discusses a team's efforts to predict company bankruptcy using a dataset from Taiwan. The team aimed to select important features, conduct exploratory data analysis, and apply machine learning models to capture bankruptcy patterns. Their analysis showed the target variable was highly imbalanced, with less than 3.5% of companies being bankrupt. Various feature selection techniques identified important predictors. Visualization revealed classes were overlapping, indicating bankruptcy was not an anomaly. The best models for predicting bankruptcy were Easy Ensemble and XGBoost, achieving recalls of 87-92% on test data.


Capstone Project - 2

Team Space: Company Bankruptcy Prediction
Team Members:
Saubhagya Verma
Harsh Mudgil
Tawheed Yousuf
Harshal Pawar
Jimmi Kumar
Sai Krishna Reddy Palle
Catching the Doom,
Before it Happens

1. Problem Statement
2. Feature Selection
3. Exploratory Data Analysis
4. Applying the Models
5. Model Selection and Validation
➢ Problem Statement

● Prediction of bankruptcy is of increasing interest to firms that stand to
lose money because of unpaid debts. Since computers can store huge datasets
pertaining to bankruptcy, making accurate predictions from them beforehand
is becoming increasingly important.

● A highly imbalanced dataset for predicting the financial state of
companies in Taiwan was provided. Company bankruptcy was defined based on
the business regulations of the Taiwan Stock Exchange.

● For this project, we aimed to build a model that captures the bankruptcy
patterns among companies in the industry. This model will provide early
signs of financial downturn in these corporations.
➢ Data Pipeline
● Cleaning the Data: The data were checked for null values and categorical
values, and a primary inspection was performed.

● Feature Selection: Techniques such as VIF, p-values, L1 regularization
and Information Gain were used to select important features.

● EDA: Exploratory analysis was performed to review skewness in the data,
outliers and correlation patterns.

● Model Testing: Combinations of different models and feature selection
techniques were used to determine the optimal results.
➢ Data Description

● The data were collected from the Taiwan Economic Journal for the years
1999 to 2009. Company bankruptcy was defined based on the business
regulations of the Taiwan Stock Exchange.

● The dataset consisted of 96 columns, mainly continuous features, across
6,819 rows.
➢ Data Summary
● Operating Expense Rate: The operating expense rate shows the efficiency
of a company's management by comparing the total operating expense (OPEX)
of a company to net sales.

● Research and Development Expense Rate: Research and development (R&D)
expenses are associated directly with the research and development of a
company's goods or services and any intellectual property generated in the
process.

● Interest-Bearing Debt Interest Rate: The interest-bearing debt ratio is
significant because it gives a window into the financial health of a
company. The interest-bearing debt ratio, or debt-to-equity ratio, is
calculated by dividing the total long-term, interest-bearing debt of the
company by the equity value.
➢ Data Summary

● Tax Rate: A tax rate is the percentage at which an individual or
corporation is taxed.

● Revenue per Share: The amount of revenue over common shares outstanding.
It answers the question: how much of total sales does each share have a
claim to? Increasing revenue per share (RPS) over time is a good sign,
because it means each share now has a claim to more revenue.

● Total Asset Growth Rate: The total asset growth rate is defined as the
year-over-year percentage change in total assets.
Important Feature Selection Techniques
➢ Important Feature Selection Techniques
● Quasi-Constant Method: Quasi-constant features are those that show the
same value for the great majority of the observations in the dataset. In
general, these features provide little if any information that allows a
machine learning model to discriminate or predict the target.
Using the Quasi-Constant Method, 31 columns were selected.

● Lasso Regression: In linear model regularization, the penalty is applied
to the coefficients that multiply each of the predictors. Lasso (L1) has
the property that it can shrink some coefficients to exactly zero; those
features can then be removed from the model.
Using the Lasso Regression Method and VIF scores, 19 columns were selected.
(A minimal code sketch of both techniques follows below.)
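The sketch below shows one way these two selection steps could be implemented with pandas and scikit-learn. It is a minimal illustration, not the project's exact code: the file name, the target column name 'Bankrupt?', the 0.99 quasi-constant cutoff and the regularization strength C are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

df = pd.read_csv("bankruptcy.csv")      # hypothetical file name
y = df["Bankrupt?"]                     # assumed target column name
X = df.drop(columns=["Bankrupt?"])

# Quasi-constant filter: drop features whose most frequent value covers
# more than 99% of the rows (the threshold is an assumption).
dominant_share = X.apply(lambda col: col.value_counts(normalize=True).iloc[0])
quasi_constant = dominant_share[dominant_share > 0.99].index
X_reduced = X.drop(columns=quasi_constant)

# L1 (Lasso-style) selection: an L1-penalized logistic regression shrinks
# some coefficients to exactly zero; SelectFromModel keeps the rest.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
selector.fit(StandardScaler().fit_transform(X_reduced), y)
lasso_features = X_reduced.columns[selector.get_support()]
print(len(lasso_features), "features kept by L1 selection")
```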
➢ Important Feature Selection Techniques
● Information Gain: Information gain, or mutual information, measures how
much information the presence/absence of a feature contributes to making
the correct prediction on the target.
Using the Information Gain Method, 30 columns were selected.

● Random Forest: Random forests consist of several hundred decision trees
(typically 400 to 1,200), each of them built over a random sample of the
observations in the dataset and a random subset of the features. Features
that are selected at the top of the trees are in general more important
than features that are selected at the end nodes, as the top splits
generally lead to bigger information gains.
Using Random Forest importances, 30 columns were selected. (Both techniques
are sketched in code below.)
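A minimal sketch of both rankings with scikit-learn, assuming X and y from the earlier sketch; keeping the top 30 features mirrors the counts reported above, while the estimator settings are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Mutual information (information gain) between each feature and the target.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
top_mi = mi.sort_values(ascending=False).head(30).index

# Impurity-based feature importances from a random forest.
rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_rf = importances.sort_values(ascending=False).head(30).index

print("Top features by mutual information:", list(top_mi[:5]), "...")
print("Top features by random forest:", list(top_rf[:5]), "...")
```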
➢ Other Feature Selection Techniques

1. Using VIF, 71 columns were selected.
2. Using VIF and p-values (logit), 14 columns were selected.
3. Using p-values (OLS), 32 columns were selected.
4. Using Lasso alone, 24 columns were selected.

A sketch of the VIF computation follows below.
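The VIF scores referred to above could be computed with statsmodels as in the sketch below, assuming X from the earlier sketch; the cutoff of 10 for flagging multicollinearity is a common rule of thumb and an assumption here, not a value stated in the slides.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column so the VIFs are computed against a fitted constant.
X_const = sm.add_constant(X)

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
).drop("const")

# Features with VIF above ~10 are commonly treated as highly collinear
# (the threshold is an assumption) and can be dropped iteratively.
low_vif_features = vif[vif < 10].index
print(len(low_vif_features), "features with VIF < 10")
```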
Exploratory Data Analysis
Primary EDA: Understanding the effect of the value of a feature on the
decision

➢ Insights
● Bankruptcy is more likely if the value of features such as ROA(A) is low.
● The value of features such as Accounts Receivable Turnover is less likely
to influence bankruptcy.
● For higher values of features like Retained Earnings to Total Assets, a
company is likely to stay afloat.
Primary EDA: Checking the Imbalance

➢ Insights
● The target variable is highly imbalanced: less than 3.5% of the instances
in the data set belong to bankrupt companies. (See the sketch below.)
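A quick way to confirm the imbalance and to split the data while preserving it, assuming the df, X and y objects from the earlier sketches; the 80/20 split ratio is an assumption of this sketch.

```python
from sklearn.model_selection import train_test_split

# Share of each class in the target; the bankrupt class is expected to be
# under 3.5% of the rows.
print(df["Bankrupt?"].value_counts(normalize=True))

# A stratified split keeps the same (im)balance in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```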
Primary EDA: Detecting and Capping Outliers
[Figure: feature distributions before and after capping]

● Outliers were capped at the 80th percentile value on the upper side and
at the 20th percentile value on the lower side. (See the sketch below.)
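One way to implement this capping with pandas, assuming the X_train / X_test frames from the split above; computing the quantiles on the training set only is our assumption about good practice, not something stated in the slides.

```python
# Per-feature 20th and 80th percentiles, estimated on the training data only,
# then used to clip both splits to those bounds.
lower = X_train.quantile(0.20)
upper = X_train.quantile(0.80)

X_train_capped = X_train.clip(lower=lower, upper=upper, axis=1)
X_test_capped = X_test.clip(lower=lower, upper=upper, axis=1)
```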
Primary EDA: Understanding the correlation between features and the target
variable
Primary EDA: Understanding the correlation between features and the target
variable (contd.)

➢ Insights
● After going through all the columns, we found that none of the features
displayed a high correlation (> 0.5) with the target variable, bankruptcy.
● The most negatively correlated feature was 'Net Income to Total Assets',
at -0.32.
● The most positively correlated feature was 'Debt Ratio', at +0.25.
● Conclusion: Our dataset doesn't provide much modelling power with respect
to linear algorithms. Since there is a high imbalance in the classes, this
looks like an anomaly detection problem, so let us try an anomaly detection
algorithm. (A correlation check is sketched below.)
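The correlation figures above could be reproduced along these lines, assuming the df frame from the first sketch and the target column name 'Bankrupt?'.

```python
# Pearson correlation of every feature with the target, sorted.
corr_with_target = (
    df.corr()["Bankrupt?"]
    .drop("Bankrupt?")
    .sort_values()
)
print(corr_with_target.head())   # most negatively correlated features
print(corr_with_target.tail())   # most positively correlated features
```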
EDA: Using an anomaly detection algorithm to further explore the data

➢ Insights
● Isolation Forest was used to detect anomalies in the dataset. Since the
results were not up to the mark, let us visualize the dataset to
investigate the problem. (A minimal Isolation Forest sketch follows.)
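A minimal sketch of the Isolation Forest check, assuming the capped training data from the earlier sketches; setting the contamination rate to roughly the observed bankruptcy share (~3.5%) is an assumption of this sketch.

```python
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

# Fit an unsupervised Isolation Forest on the training features.
iso = IsolationForest(contamination=0.035, random_state=0)
anomaly_flag = iso.fit_predict(X_train_capped)   # -1 = anomaly, 1 = normal

# Treat predicted anomalies as "bankrupt" and compare against the labels.
pred = (anomaly_flag == -1).astype(int)
print(classification_report(y_train, pred, digits=3))
```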
EDA: Data visualization in 2 and 3 dimensions using PCA

● The classes are intertwined and overlapping. Thus, bankruptcy isn't an
anomaly; bankrupt instances simply happen to be sparse in the data set.
(A 2-D PCA sketch follows.)
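The 2-D projection could be produced along these lines, assuming the capped training data from earlier; standardizing before PCA and the plotting choices are assumptions of this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project the standardized features onto the first two principal components.
Z = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X_train_capped)
)

mask = (y_train == 1).to_numpy()   # bankrupt instances
plt.scatter(Z[~mask, 0], Z[~mask, 1], s=5, alpha=0.3, label="not bankrupt")
plt.scatter(Z[mask, 0], Z[mask, 1], s=15, c="red", label="bankrupt")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.show()
```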
➢ Conclusion

● The problem at hand is not an anomaly detection problem. It is a
classification problem with a highly imbalanced dataset.
● We will use different combinations of feature selection techniques,
classification models and resampling techniques to reach a solution.
Applying Models

Applying Model: 1. Features Selected Using VIF.

● Highest Recall on the test set is obtained from Logistic Regression, Easy
Ensemble and SVM.
● Highest test precision is obtained with KNN.
● XGBoost has best F1-score at 43%, on the test set. (A sketch of the
training and evaluation loop is given below.)
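The results on this and the following slides come from fitting several classifiers on each feature subset and comparing recall, precision and F1 on the held-out test set. The sketch below shows one way such a loop could look, using EasyEnsembleClassifier from imbalanced-learn and XGBClassifier from xgboost; the hyperparameters, the lasso_features subset and the capped train/test frames from earlier sketches are illustrative assumptions, not the project's exact setup.

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
from xgboost import XGBClassifier

cols = list(lasso_features)   # any of the selected feature subsets

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "KNN": KNeighborsClassifier(),
    "Easy Ensemble": EasyEnsembleClassifier(n_estimators=10, random_state=0),
    "XGBoost": XGBClassifier(
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
        eval_metric="logloss",
        random_state=0,
    ),
}

for name, model in models.items():
    model.fit(X_train_capped[cols], y_train)
    pred = model.predict(X_test_capped[cols])
    print(
        f"{name}: recall={recall_score(y_test, pred):.2f} "
        f"precision={precision_score(y_test, pred, zero_division=0):.2f} "
        f"f1={f1_score(y_test, pred):.2f}"
    )
```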
Applying Model: 2. Features Selected Using VIF & p-values (logit function)

● Highest Recall on the test set is obtained from Easy Ensemble at 87.2%
● Highest test precision is obtained with KNN at 66%
● XGBoost has best F1-score at 44%, on the test set.
Applying Model: 3. Features Selected Using VIF & L1 Regularization

● Highest Recall on the test set is obtained from Easy Ensemble at 92%
● Highest test precision is obtained from KNN at 42%
● Random Forest has best F1-score at 41%, on the test set.
Applying Model: 4. Features Selected Using p-values with the OLS Model

● Highest Recall on the test set is obtained from Easy Ensemble at 85.4%
● Highest test precision is obtained from KNN at 71.4%
● Gaussian Naive Bayes has best F1-score at 51%, on the test set.
Applying Model: 5. Using Quasi Constant Method

● Highest Recall on the test set is obtained from Logistic Regression at 89%
● Highest test precision is obtained from KNN at 62.5%
● XGBoost has best F1-score at 47.4%, on the test set.
Applying Model: 6. Using Information Gain

● Highest Recall on the test set is obtained from Easy Ensemble at 90%
● Highest test precision is obtained from KNN at 66.5%
● XGBoost has best F1-score at 41%, on the test set.
Applying Model: 7. Using Random Forest Method

● Highest Recall on the test set is obtained from Easy Ensemble at 94%
● Highest test precision is obtained from KNN at 50.5%
● Gaussian Naive Bayes has best F1-score at 40%, on the test set.
Applying Model: 8. Lasso Regularization Method

● Highest Recall on the test set is obtained from Easy Ensemble at 83%
● Highest test precision is obtained from KNN at 50%
● XGBoost has best F1-score at 45%, on the test set.
Applying Model: 9. Using SMOTE with Quasi Selection

● Highest Recall on the test set is obtained from Logistic Regression & SVM at
82%
● Highest test precision is obtained from Easy Ensemble at 31%
● XGBoost has best F1-score at 39%, on the test set. (A SMOTE pipeline
sketch follows below.)
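This last experiment combined SMOTE oversampling with the quasi-constant feature subset. A minimal sketch using the imbalanced-learn pipeline is shown below, so that oversampling is applied only to the training data; the Logistic Regression estimator and the reuse of the capped train/test frames are assumptions of this sketch.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Oversample the minority class with SMOTE, then fit a classifier.
# Putting SMOTE inside the pipeline ensures the test data is never resampled.
smote_pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

smote_pipeline.fit(X_train_capped, y_train)
print(classification_report(y_test, smote_pipeline.predict(X_test_capped), digits=3))
```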
➢ Conclusion
● If Recall is critical to the use case, then the Easy Ensemble model
should be used with Random Forest feature selection, to obtain a 94% Recall
value.

● If Precision is critical to the use case, then the KNN model should be
used with the p-values (OLS) feature selection method, to obtain a
Precision of 71.4%.

● If overall performance is critical to the use case, then the Gaussian
Naive Bayes model with the p-values (OLS) feature selection method can be
used, to obtain an F1 score of 51%.

● We can use a combination of high-recall models and high-precision models
to cross-validate and obtain an ideal solution.
➢ Challenges
● Balancing the trade-off between recall and precision was a challenge.

● Feature selection and choosing the right technique.

● Exploring literature and resources to understand the problem and to find
the solution was a little exhausting.

● Deadlines felt a little strained, but it all worked out for the best.
QnA
