Company Bankruptcy Detection
1. Problem Statement
2. Feature Selection
3. Exploratory Data Analysis
4. Applying the Models
5. Model Selection and Validation
➢ Problem Statement
● For this project, we aim to build a model that captures bankruptcy patterns
among companies. The model is intended to provide early warning signs of
financial distress in corporations.
➢ Data Pipeline
● Cleaning the Data: The data was checked for null values and categorical
columns, and a primary statistical inspection was performed (a minimal sketch follows).
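A minimal sketch of this first inspection step, assuming the dataset is loaded into a pandas DataFrame; the CSV file name used here is hypothetical:

```python
# Sketch of the primary inspection step; the CSV file name is a placeholder.
import pandas as pd

df = pd.read_csv("company_bankruptcy.csv")          # hypothetical file name
print(df.isnull().sum().sum())                      # total number of null values
print(df.select_dtypes(include="object").columns)   # any categorical columns
print(df.describe().T.head())                       # basic statistical summary
```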
➢ Data Pipeline
● The data were collected from the Taiwan Economic Journal for the years
1999 to 2009. Company bankruptcy was defined based on the business
regulations of the Taiwan Stock Exchange.
➢ Data Summary
● Operating Expense Rate: The operating expense rate shows the
efficiency of a company's management by comparing the total operating
expense (OPEX) of a company to net sales.
Important Feature Selection Techniques
➢ Important Feature Selection Techniques
● Quasi-Constant Method: Quasi-constant features are those that show the same
value for the great majority of observations in the dataset. In general, these
features provide little, if any, information that allows a machine learning
model to discriminate or predict the target.
Using the quasi-constant method, 31 columns were selected (see the sketch below).
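A minimal sketch of quasi-constant filtering, here using scikit-learn's VarianceThreshold; the 0.01 threshold and the variable name X are assumptions, not the exact setup used:

```python
# Quasi-constant filtering sketch; the threshold value is an assumed illustration.
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)      # drop features that barely vary
selector.fit(X)                                   # X = feature DataFrame
quasi_cols = X.columns[selector.get_support()]    # columns that survive the filter
print(len(quasi_cols), "columns kept")
```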
➢ Important Feature Selection Techniques
● Information Gain: Information gain, or mutual information, measures how much
information the presence or absence of a feature contributes to making the
correct prediction on the target.
Using the information gain method, 30 columns were selected (see the sketch below).
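A minimal sketch of mutual-information-based selection; SelectKBest with k=30 mirrors the "30 columns" figure above, but the exact procedure is an assumption:

```python
# Information gain (mutual information) selection sketch.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=30)
selector.fit(X, y)                            # X = features, y = bankruptcy flag
ig_cols = X.columns[selector.get_support()]   # the 30 highest-scoring columns
```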
Exploratory Data Analysis
Primary EDA: Understanding the effect of the value
of a feature on the decision
➢ Insights
● Bankruptcy is more likely if the value of features such as ROA(A) is low.
● The value of features such as Accounts Receivable Turnover shows little effect
on the bankruptcy outcome.
● For higher values of features like Retained Earnings to Total Assets, a company is
more likely to stay afloat.
Primary EDA: Checking the Imbalance
➢ Insights
● The target variable is highly imbalanced: fewer than 3.5% of the instances in
the dataset belong to bankrupt companies (a quick check follows).
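A one-line check of this imbalance, assuming the target column is named "Bankrupt?":

```python
# Class balance check; "Bankrupt?" is the assumed target column name.
print(df["Bankrupt?"].value_counts(normalize=True))   # share of bankrupt vs. healthy firms
```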
Primary EDA: Detecting and Capping Outliers
(Plots: feature distributions before vs. after capping)
● Outliers were capped at the 80th percentile value on the upper side and at the
20th percentile value on the lower side (see the sketch below).
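A minimal sketch of the capping step with pandas, clipping each column at its 20th and 80th percentiles:

```python
# Cap outliers column-wise at the 20th and 80th percentiles.
lower = X.quantile(0.20)
upper = X.quantile(0.80)
X_capped = X.clip(lower=lower, upper=upper, axis=1)
```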
Primary EDA: Understanding the correlation between
features and target variable
Primary EDA: Understanding the correlation between
features and target variable. (Contd.)
➢ Insights
● After going through all the columns, we found that none of the features displayed
a high correlation (> 0.5) with the target variable, bankruptcy (see the sketch below).
● The most negatively correlated feature was 'Net Income to Total Assets', at -0.32.
● The most positively correlated feature was 'Debt Ratio', at +0.25.
● Conclusion: The dataset offers little modelling power for linear algorithms. Since
there is a high imbalance between the classes, this looks like an anomaly detection
problem, so we next try an anomaly detection algorithm.
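A minimal sketch of the correlation check against the target; "Bankrupt?" is the assumed target column name:

```python
# Correlation of every feature with the target variable.
corr = df.corr()["Bankrupt?"].drop("Bankrupt?")
print(corr.sort_values().head())   # most negatively correlated features
print(corr.sort_values().tail())   # most positively correlated features
```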
EDA: Using an Anomaly Detection Algorithm to Further Explore the Data
➢ Insights
● Isolation Forest was used to detect anomalies in the dataset (see the sketch
below). Since the results were not up to the mark, we visualize the dataset to
investigate the problem.
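A minimal sketch of the Isolation Forest experiment; setting contamination near the observed bankruptcy rate (~3.5%) is an assumption about the configuration:

```python
# Isolation Forest anomaly detection sketch.
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.035, random_state=42)
pred = iso.fit_predict(X)                 # -1 = anomaly, 1 = normal
anomaly_flag = (pred == -1).astype(int)   # compare against the bankruptcy label
```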
EDA: Data Visualization in 2 and 3 Dimensions Using PCA
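A minimal sketch of the 2-D projection; scaling before PCA and the plotting details are assumptions made for illustration:

```python
# 2-D PCA projection of the feature space, coloured by the bankruptcy label.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="coolwarm")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```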
Applying Models
Applying Model: 1. Features Selected Using VIF
● Highest recall on the test set is obtained from Logistic Regression, Easy
Ensemble and SVM.
● Highest test precision is obtained with KNN.
● XGBoost has the best F1-score on the test set, at 43% (a VIF selection sketch follows).
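A minimal sketch of VIF-based feature selection with statsmodels; the cutoff of 10 is a common rule of thumb and an assumption, not necessarily the value used here:

```python
# VIF-based filtering sketch; VIF < 10 is an assumed cutoff.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
X_vif = X[vif[vif < 10].index]   # keep features with acceptable multicollinearity
```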
Applying Model: 2. Features Selected Using VIF & p-values (Logit Function)
● Highest recall on the test set is obtained from Easy Ensemble, at 87.2%.
● Highest test precision is obtained with KNN, at 66%.
● XGBoost has the best F1-score on the test set, at 44%.
Applying Model: 3. Features Selected Using VIF & L1 Regularization
● Highest recall on the test set is obtained from Easy Ensemble, at 92%.
● Highest test precision is obtained from KNN, at 42%.
● Random Forest has the best F1-score on the test set, at 41%.
Applying Model: 4. Features Selected Using p-values with the OLS Model
● Highest recall on the test set is obtained from Easy Ensemble, at 85.4%.
● Highest test precision is obtained from KNN, at 71.4%.
● Gaussian Naive Bayes has the best F1-score on the test set, at 51% (an OLS p-value sketch follows).
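A minimal sketch of p-value-based selection with a statsmodels OLS fit; the 0.05 significance cutoff is an assumption:

```python
# p-value-based selection with OLS; p < 0.05 is an assumed cutoff.
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
pvals = ols.pvalues.drop("const")
X_ols = X[pvals[pvals < 0.05].index]   # keep statistically significant features
```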
Applying Model: 5. Using the Quasi-Constant Method
● Highest recall on the test set is obtained from Logistic Regression, at 89%.
● Highest test precision is obtained from KNN, at 62.5%.
● XGBoost has the best F1-score on the test set, at 47.4%.
Applying Model: 6. Using Information Gain
● Highest recall on the test set is obtained from Easy Ensemble, at 90%.
● Highest test precision is obtained from KNN, at 66.5%.
● XGBoost has the best F1-score on the test set, at 41%.
Applying Model: 7. Using the Random Forest Method
● Highest recall on the test set is obtained from Easy Ensemble, at 94%.
● Highest test precision is obtained from KNN, at 50.5%.
● Gaussian Naive Bayes has the best F1-score on the test set, at 40% (a feature-importance sketch follows).
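A minimal sketch of Random Forest importance-based selection; keeping the top 20 features is an assumption about how many columns were retained:

```python
# Random Forest feature-importance selection sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
X_rf = X[importances.nlargest(20).index]   # the top-20 cut is illustrative
```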
Applying Model: 8. Using the Lasso Regularization Method
● Highest recall on the test set is obtained from Easy Ensemble, at 83%.
● Highest test precision is obtained from KNN, at 50%.
● XGBoost has the best F1-score on the test set, at 45% (an L1 selection sketch follows).
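A minimal sketch of L1 (Lasso) based selection via SelectFromModel; using an L1-penalised logistic regression here is an assumption about the exact setup:

```python
# L1-regularisation feature selection sketch.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
selector = SelectFromModel(l1_model).fit(X, y)
X_l1 = X.loc[:, selector.get_support()]   # keep features with non-zero coefficients
```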
Applying Model: 9. Using SMOTE with Quasi-Constant Selection
● Highest recall on the test set is obtained from Logistic Regression and SVM, at 82%.
● Highest test precision is obtained from Easy Ensemble, at 31%.
● XGBoost has the best F1-score on the test set, at 39% (a SMOTE sketch follows).
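A minimal sketch of oversampling with SMOTE from imbalanced-learn, applied only to the training split; the split parameters are assumptions:

```python
# SMOTE oversampling sketch; applied to the training data only.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```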
➢ Conclusion
● If recall is critical to the use case, then the Easy Ensemble model should be
used with Random Forest feature selection, which obtains a 94% recall.
● If precision is critical to the use case, then the KNN model should be used
with the p-values-with-OLS feature selection method, which obtains a precision of
71.4% (a sketch of the recall-oriented setup follows after this list).
● Deadlines felt a little strained, but it all worked out for the best.
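A minimal sketch of the recall-oriented recommendation: an EasyEnsembleClassifier from imbalanced-learn trained on the Random-Forest-selected features; the hyperparameters and variable names are assumptions:

```python
# Easy Ensemble sketch for the high-recall setup; X_train/X_test are assumed to
# hold the Random-Forest-selected features (variable names are illustrative).
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report

ee = EasyEnsembleClassifier(n_estimators=10, random_state=42)
ee.fit(X_train, y_train)
print(classification_report(y_test, ee.predict(X_test)))
```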
Q&A