
Ee 708 Report

This project report presents a hybrid framework for predicting company bankruptcy using a combination of Gaussian Naive Bayes (GNB) and Deep Neural Network (DNN) models. The approach includes extensive exploratory data analysis, data preprocessing techniques like SMOTE for class imbalance, and feature selection through ANOVA, achieving a test accuracy of 97.23% and an F1-score of 0.51. The results demonstrate the effectiveness of the ensemble model in accurately predicting bankruptcy despite significant class imbalance.


EE708 Course Project Report, Indian Institute of Technology Kanpur

Company Bankruptcy Prediction: A Hybrid Framework Leveraging Combined Probabilities from Gaussian and Neural Networks

Ch V Sai Koushik (230312, skoushik23@iitk.ac.in)
Chilamakuri Kundan Sai (230330, ckundans23@iitk.ac.in)
Challa Kethan (230317, kethanc23@iitk.ac.in)
Srijani Gadupudi (231033, srijanig23@iitk.ac.in)
Macha Mohana Harika (230612, mharika23@iitk.ac.in)

I. INTRODUCTION

Predicting a company's bankruptcy has become a critical task in today's uncertain economic conditions. This project aims to build a robust machine learning model that predicts bankruptcy from a range of factors that indicate the financial status of the company.

This report is structured as follows: we begin with Exploratory Data Analysis (EDA) to examine the dataset and uncover key patterns and correlations; next, we discuss Data Preprocessing, detailing how the data is cleaned, normalized, and balanced; then, we describe the Model Architecture, outlining the machine learning models used for bankruptcy prediction and the rationale behind their design; and finally, we evaluate the models using Performance Metrics such as accuracy, precision, recall, and F1 score to assess their practical effectiveness.

II. EXPLORATORY DATA ANALYSIS (EDA)

Exploratory Data Analysis (EDA) and feature engineering involve several key steps to understand and preprocess the dataset effectively. The primary goal was to understand the data's characteristics, identify key features, and prepare the dataset for model training.

The dataset used for bankruptcy prediction comprises 5,455 rows, each corresponding to a company, and 96 financial features. The target variable distribution revealed a significant class imbalance, with 5,301 non-bankrupt companies (97.2%) and 154 bankrupt companies (2.8%). This imbalance was considered in subsequent model development.

In a preliminary analysis, extreme values were identified in several numerical features, potentially indicating data errors. Features with over 800 erroneous entries were discarded, while features with fewer than 200 errors underwent median imputation.

To mitigate multicollinearity and enhance model interpretability, a feature correlation analysis was conducted. Pairs of features with a Pearson correlation coefficient exceeding 0.9 were examined, and in each pair the feature with the weaker correlation to the target variable was removed. This process reduced the number of features from 86 to 62. A detailed correlation heatmap (Fig. 1) illustrates the relationships among the retained features.

Figure 1: Feature Correlation Heatmap of 62 features

III. DATA PREPROCESSING

A. Analysis of Variance (ANOVA):

Analysis of Variance (ANOVA) was then used to select the most relevant features for predicting bankruptcy. ANOVA F-scores were calculated for each feature to assess its significance.

$$F\ \text{score} = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

$$\text{Variance between groups} = \sum_{i=1}^{k} N_i (\mu_i - \mu)^2$$

$$\text{Variance within groups} = \sum_{i=1}^{k} \sigma_i^2$$

Here $k$ is the number of classes, $N_i$ the number of samples in class $i$, $\mu_i$ the mean of the feature in class $i$, $\mu$ the overall mean, and $\sigma_i^2$ the variance of the feature within class $i$.

A high F-score signifies substantial variation in feature values across target variable classes. The SelectKBest method, utilizing the f_classif scoring function, was employed to identify the top 30 features. These selected features ensured that the model concentrated on those exhibiting the most significant differences in behavior between the two classes.
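The ANOVA-based ranking described in this section can be illustrated with a minimal, dependency-free sketch. It uses the standard one-way ANOVA form, in which the between-group and within-group sums of squares are divided by their degrees of freedom (k - 1 and N - k); this normalization is an assumption on our part, since the report states the ratio without it. The helper names `anova_f_score` and `select_k_best` and the toy feature columns are hypothetical and are not taken from the project code.

```python
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature against class labels.

    Assumes at least two classes and non-zero within-class spread.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n = len(values)   # total number of samples
    k = len(groups)   # number of classes
    grand_mean = fmean(values)

    # Between-group sum of squares: N_i * (mu_i - mu)^2 summed over classes.
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: squared deviations from each class mean.
    ss_within = 0.0
    for g in groups.values():
        class_mean = fmean(g)
        ss_within += sum((v - class_mean) ** 2 for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest F-scores."""
    scores = {name: anova_f_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (made up): one feature separates the classes, one does not.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "debt_ratio": [0.20, 0.30, 0.25, 0.80, 0.90, 0.85],
    "noise": [1.0, 2.0, 3.0, 1.1, 2.1, 2.9],
}
top = select_k_best(columns, labels, k=1)
print(top)
```

In the toy data the class means of `debt_ratio` are far apart relative to the spread within each class, so it receives a much larger F-score than `noise` and is the feature retained.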

B. Oversampling using SMOTE:

The dataset exhibited a significant class imbalance, with 5,301 non-bankrupt (97.2%) and 154 bankrupt (2.8%) companies. To address this, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training data.

Following an 80-20 train-test split, the training set contained 4,241 non-bankrupt companies and 123 bankrupt companies. SMOTE was used to generate synthetic samples for the minority class, balancing the training set to 4,241 instances in each class. This oversampling was restricted to the training data to prevent biasing the test set. SMOTE operates by selecting a minority class sample, identifying its k-nearest neighbors, and generating a new synthetic data point through linear interpolation between the selected sample and one of its neighbors, thereby increasing the representation of the minority class.

C. Standardisation:

To ensure uniform feature scaling, StandardScaler was applied, transforming all features to have a mean of 0 and a standard deviation of 1. This prevents dominance by features with larger magnitudes.

IV. MODEL ARCHITECTURE

We developed a hybrid bankruptcy prediction model by ensembling a Deep Neural Network (DNN) and a Gaussian Naive Bayes (GNB) classifier. The objective was to leverage the probabilistic nature of GNB alongside the deep feature learning capabilities of the DNN to improve performance.

The first model, the DNN, was developed to capture complex, non-linear relationships between financial indicators. It consists of an input layer, three hidden layers, and an output layer. The input layer receives the 30 selected features. The first hidden layer comprises 256 neurons, utilizing the ReLU activation function, batch normalization, and dropout (50%) to prevent overfitting. The second hidden layer has 128 neurons, also incorporating batch normalization and dropout (50%). The third hidden layer refines the feature representations further with 64 neurons and a reduced dropout (40%). The output layer contains a single neuron with a sigmoid activation function, outputting a probability score for bankruptcy classification.

The DNN model was compiled with the Adam optimizer (learning rate = 0.0005) and the binary cross-entropy loss function. It was trained for 200 epochs with a batch size of 64 and a 20% validation split.

GNB applies Bayes' theorem, assuming that the features are independent and normally distributed given the class label. It computes the posterior probability for each class using the prior probability and the likelihood, where the likelihood of continuous features is modeled using a normal distribution. The model predicts the class with the highest posterior probability.

Initially, employing only the DNN with feature selection yielded a maximum F1-score of 0.46. By integrating GNB into the ensemble, the model effectively leveraged probabilistic classification, improving the F1-score to 0.51.

To leverage both models, we applied an ensemble approach using soft voting. The probability outputs from the DNN and GNB models were averaged to compute the final bankruptcy probability. Instead of using the default classification threshold of 0.5, we fine-tuned the threshold by evaluating F1-scores across multiple threshold values (between 0.30 and 0.60). The threshold that maximized the F1-score (0.45) was selected for final predictions.

V. PERFORMANCE METRICS OF THE MODEL

The Gaussian Naive Bayes and DNN ensemble model reached 97.23% accuracy on the test set. The remaining metrics are given in the confusion matrix (Fig. 2) and the classification report (Fig. 3).

Figure 2: Confusion Matrix

Figure 3: Classification Report

VI. CONCLUSION

This study introduced a hybrid bankruptcy prediction framework that integrates a Deep Neural Network (DNN) with a Gaussian Naive Bayes (GNB) classifier. Rigorous exploratory data analysis and feature engineering, including SMOTE for class imbalance, ANOVA-based feature selection, and standardization, ensured robust data preprocessing.

The ensemble, which combines deep feature learning and probabilistic inference through soft voting and a fine-tuned decision threshold, achieved a test accuracy of 97.23% and an F1-score of 0.51 while balancing the precision-recall trade-off. This result is noteworthy given the extreme class imbalance, with a class ratio of roughly 35:1. These results underscore the framework's potential for accurate bankruptcy prediction, laying the groundwork for future enhancements in feature selection and ensemble strategies.
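As a supplementary illustration of the SMOTE interpolation step described in Section III-B, the following minimal sketch generates synthetic minority samples using only the Python standard library. It is not the implementation used in the project, which is not shown in the report; the function names `smote_sample` and `oversample`, the choice of Euclidean distance, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and a neighbor.

    Requires at least two minority samples.
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest minority neighbors of the chosen sample (Euclidean distance).
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


def oversample(minority, target_size, k=5, seed=0):
    """Append synthetic points until the minority class has target_size rows."""
    rng = random.Random(seed)
    synthetic = [
        smote_sample(minority, k, rng)
        for _ in range(target_size - len(minority))
    ]
    return minority + synthetic


# Toy minority class (made up): four points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = oversample(minority, target_size=10, k=2)
print(len(balanced))
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. As in the report, such a routine would be applied to the training split only.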

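The soft-voting and threshold-selection procedure described in Section IV can likewise be sketched in a few lines. The probabilities below are made-up toy values, not outputs of the trained DNN or GNB models, so the selected threshold differs from the 0.45 reported above. The helper names are hypothetical. The report does not state which data split was used for tuning; a held-out validation split is the usual choice, since tuning on the test set would bias the reported score.

```python
def f1_score(y_true, y_pred):
    """F1-score of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def soft_vote(p_dnn, p_gnb):
    """Average the two models' bankruptcy probabilities."""
    return [(a + b) / 2 for a, b in zip(p_dnn, p_gnb)]


def best_threshold(y_true, probs, thresholds):
    """Return the (threshold, F1) pair with the highest F1-score."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy labels and model probabilities (made up for illustration).
y_true = [0, 0, 0, 0, 1, 1]
p_dnn = [0.10, 0.42, 0.20, 0.56, 0.62, 0.36]
p_gnb = [0.06, 0.30, 0.32, 0.48, 0.90, 0.52]

probs = soft_vote(p_dnn, p_gnb)
thresholds = [round(0.30 + 0.05 * i, 2) for i in range(7)]  # 0.30 to 0.60
threshold, f1 = best_threshold(y_true, probs, thresholds)
print(threshold, f1)
```

On this toy data the second bankrupt company receives an averaged probability below 0.5, so the default threshold misses it, while a lower threshold recovers it at the cost of one false positive. This is the kind of trade-off the F1-based search is meant to resolve.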
