Medical Insurance Cost Prediction Report Full
Submitted By
KIRAN
P03NK21S0705
2022-2023
BANGALORE UNIVERSITY
JNANA BHARATHI CAMPUS, BANGALORE-560056
CERTIFICATE
This is to certify that KIRAN, Register No: P03NK21S0705, has satisfactorily
completed the project work entitled “MEDICAL INSURANCE COST PREDICTION”
in partial fulfillment of the requirements for the award of the degree Master of Computer
Application (MCA) awarded by Bangalore University for the year 2022-23.
Examiners:
1)
2)
DECLARATION
I would like to express my heartfelt thanks to our HOD, Dr. MURALIDHARA B.L,
for extending all the facilities to carry out this project.
I would like to express my special thanks to my project guide, Dr. M.T.
Somashekara, who spent his precious time guiding and encouraging me
throughout the development of the project.
Last but not least, I am grateful to my parents, my friends, and all the people who
have helped me directly or indirectly to make this project a success.
KIRAN
ABSTRACT
Medical costs are one of the most common recurring expenses in a person's life. It
is generally known that a person's lifestyle and numerous physical factors determine
the diseases or disorders they may get, and that these conditions determine medical
expenses. According to several studies, smoking, age, and BMI are among the most
significant factors that lead to greater personal medical expenditures. The goal of
this study is to examine and identify a link between personal medical
costs and other characteristics. Then, by generating linear regression models and
comparing them using ANOVA, we use the significant traits as predictors to forecast
medical expenditures. In our research, we found that smoking, age, and a higher
BMI all have a significant association with higher medical expenditures, showing that
they are key contributors to the charges, and that the regression can predict the
charges with more than 75% accuracy. According to the World Health Organization,
personal medical and healthcare spending is growing faster than the global economy.
This rise in spending has been related to a variety of factors, the most prominent of
which are smoking, ageing, and higher BMI. Using insurance data from diverse
persons with variables such as smoking, age, number of children, region, and BMI, we
hope to uncover a link between medical expenditures and other parameters.
LIST OF ABBREVIATIONS
Abbreviation Expansion
LR Linear regression
ML Machine learning
1.1 OVERVIEW
The expense of health care is rising every day. There is a need to
forecast health costs as the number of novel viruses infecting humans
grows. This form of forecasting aids governments in making
health-related decisions. People are also aware of the significance of
health-care spending. Machine learning is a field that touches all aspects of
life, and machine learning models are used in the health-care system
for a variety of health-related applications. In this study, we conducted a
predictive analysis of medical health insurance expenses. We create
a model to forecast a person's medical insurance costs based on
attributes such as gender. The dataset comes from Kaggle and comprises 1338 rows of
data with the following attributes: age, gender, smoker, BMI, children,
region, and insurance charges. Medical information and expenditures
billed by health insurance companies are included in the data. To
forecast medical expenses, we applied a variety of regression techniques
to this dataset. The Python programming language was used to
implement the project.
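As a minimal sketch of what loading such a dataset could look like, the snippet below parses a few CSV rows into typed records using only the Python standard library. The column names (`age`, `gender`, `bmi`, `children`, `smoker`, `region`, `charges`) follow the attributes listed above but are assumptions about the exact headers, and the three sample rows are invented for illustration; they are not taken from the Kaggle file.

```python
import csv
import io

# Hypothetical sample rows; values are invented for illustration only.
SAMPLE_CSV = """age,gender,bmi,children,smoker,region,charges
19,female,28.0,0,yes,southwest,16000.0
18,male,34.0,1,no,southeast,1700.0
28,male,33.0,3,no,southeast,4400.0
"""

INT_COLUMNS = ("age", "children")
FLOAT_COLUMNS = ("bmi", "charges")


def load_records(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        records.append(row)
    return records


records = load_records(SAMPLE_CSV)
print(len(records), "rows; columns:", list(records[0]))
```

With the real file, the same function could be applied to the file contents read from disk; the full dataset described above would yield 1338 records.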
Department of Computer Science and Application
cure ailments. It deals with a subject that was formerly supposed to be
fatal. In any event, the expense of her therapy is so high that it is
Chapter 1 gives an overview of the project, covering its motivation,
problem statement, and purpose.
Chapter 2 is the literature review of the papers that were studied to
understand the project in more detail, with an analysis of the systems
they describe.
Chapter 3 discusses the project goals, the existing system and its
flaws, the proposed system, and the benefits of the proposed system.
Chapter 4 presents the project description, including the architecture
diagram, sequence diagram, use case diagram, and activity diagram.
Chapter 5 covers the requirements, namely the hardware and software
requirements specification and the technologies used in the project.
Chapter 6 gives the module description. It covers the testing stages,
such as unit testing and integration testing, the coding of the parts
implemented in the system, the results, and the breadboard testing
that helps verify the correct working of the model.
Chapter 7 covers the user interface and the system, and explains the
project in detail.
Chapter 8 presents the results obtained and the testing.
Chapter 9 concludes the project, summarizing what was done and
describing future enhancements and possible extensions for later
development. This is the organization of the thesis; it gives a brief
understanding of each chapter.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Machine learning is a technique that allows computers to learn from
past data and make predictions on new samples. Machine learning models may
be used in any sector, and medical records are no exception.
For many years, the medical industry has used
such models in various settings. Many studies have used machine learning
approaches to forecast medical costs. B. Nithya et al. [1], in Predictive
Analytics in Health Care, used a variety of supervised and
unsupervised models for predictive analysis. They also claimed that machine learning tools
and techniques are crucial in health-care sectors, and that they are
exclusively employed in the detection and prognosis of various
malignancies. Ahuja Tike et al. [2] applied hierarchical decision trees
to medical price prediction systems. Their experiments showed
that the price prediction system achieves high accuracy. Moran et al.
[3] used linear regression techniques to predict Intensive Care
Unit (ICU) costs, using patient demographics, DRG
(Diagnosis Related Group), length of stay in the hospital, and a few
others as features. Gregory et al. [4] applied various regression
models to analyze medical costs in the health care system. They
mainly concentrated on reducing the bias in the cost estimates to
achieve good results. Dimitris Bertsimas et al. [5] applied different data
mining techniques, which provided accurate predictions of medical
costs and represent a powerful tool for the prediction of healthcare
costs.
2.3 Predicting Days in Hospital Using Health Insurance
Claims (2020, Yang Xie, C.W. Chang, Sandra Neubauer)
Identifying and managing the patients most at risk within the health care
system is vital for governments, hospitals, and health insurers, but they
use different metrics for identifying the patients they perceive to be
most at risk [1]. Hospitals focus on readmission rate [2], [3] and
cumulative risk of death during hospitalization [4]. Accurately
predicting these indicators could assist in allocating limited resources
and thus improve the hospital’s operational efficiency. Health insurers
are mostly concerned with insurance risk, because they agree to
reimburse health-related services in exchange for a fixed monthly
premium. A poor risk measure could result in exceeding a financial
budget. Therefore, one of the most obvious goals for health insurers is
to
identification of subpopulations at higher hospitalization risk could
improve current underwriting processes and pricing methodology.
2.4 Summary
This chapter reviewed various projects related to medical expense
prediction using machine learning. Each paper contributed something new
to the existing systems and explored different kinds of relationships in
the data. Building on this survey, we propose a new design for the
existing system. This concludes the literature survey.
CHAPTER 3
PROJECT DESCRIPTION
This system will help people save time. Because no time is wasted,
the user will be satisfied. The project essentially follows the
WATERFALL MODEL, which asserts that the stages are structured in a
linear manner. First and foremost, a feasibility assessment was
completed. After that, the requirements analysis and project planning
can commence. If an existing system requires changes or the
installation of a new module, an analysis of the existing system can be
used as a starting point. After the requirements analysis is
complete, the design phase begins, followed by the coding phase.
Testing is carried out after the programming is completed.
Requirements analysis, project planning, system integration, and testing
are the activities conducted in this approach to a software
development project. The linear sequence of these activities is crucial in
this case. Each phase ends, and the output of one phase becomes the
input of the next.
3.2 Existing System
• In sample sizes ranging from small to large, statistical approaches
(e.g., linear regression) suffer due to the spike at zero and the skewed
distribution of health care expenses with a strong right-hand tail.
3.3 Proposed System
• Medical information and expenditures billed by health insurance
companies are included in the data. It has 1338 rows of information
with the following columns: age, gender, BMI, illnesses, smokers,
and insurance costs.
CHAPTER 4
SYSTEM DESIGN
[Figure: system architecture diagram. Blocks: Price Prediction System; proposed medical prediction systems; different techniques used by other people to predict prices; dataset information and basic architecture; different Machine Learning techniques; data pre-processing; Linear Regression implementation; Results.]
4.2 Sequence Diagram
A sequence diagram depicts object interactions in chronological order.
It illustrates the objects and classes involved in the scenario, as well as the sequence
of messages exchanged between them to carry out the scenario's
functionality. In the Logical View of the system under development,
sequence diagrams are often associated with use case realizations.
4.3 Use case Diagram
Use case diagrams reflect the user's engagement with the system at the
most basic level by depicting the relationships between the user and
the many use cases in which the user participates. Use case diagrams
may be used to depict a variety of system users and use cases, and are
frequently accompanied by other diagrams. External entities that
engage with the system are referred to as actors. Circles or ellipses are
used to depict the use cases.
[Figure: use case diagram. Actors: User, System. Use cases: Open interface; Import libraries; Load datasets; Pre-processing; Result.]
CHAPTER 5
PROJECT REQUIREMENTS
5.2 Technologies Used
A software requirements specification (SRS), often known as a
software specification, is a precise description of the behaviour of
the system in development. It provides a collection of scenarios that
define all of the software's interactions with the user. In addition to
use cases, the SRS also covers non-functional requirements. Non-functional
requirements are constraints on a system's design or execution (e.g.,
performance requirements, quality standards, or design limitations).
A system requirements specification is a structured set of information that contains all of
the system's requirements. Business analysts, sometimes known as systems
analysts, are in charge of researching customer and stakeholder
business demands in order to uncover and provide solutions to
business problems. There are three components that must be included
in the project.
Business requirements define what must be supplied in order to
generate value.
CHAPTER 6
MODULE DESCRIPTION
6.1 Modules
There are 5 modules in this project, which are explained in detail
below:
1) Dataset Information
2) Selection of features from the dataset
3) Data preprocessing
6.2 Module 1
Dataset Information:
The input dataset for our proposed system combines
two different datasets. A set of columns from inpatient Medicare payment data
and a column of Zillow data will be included in our final
input dataset. In the following part, we'll look at these columns in
further depth. The first dataset, the Medicare payment dataset, contains
hospital-level payments to about 30,000 hospitals for the 100 most often
billed Diagnosis Related Groups (DRGs).
The top one hundred DRGs account for 60% of total inpatient
Medicare payments and for 7 million
discharges. Each row in the payment dataset comprises 10 columns,
as seen in the preceding section. Each of these columns denotes a
feature in machine learning terms. A feature is a significant attribute
that influences the prediction variable under consideration. Every
problem under investigation has a collection of independent
features that aid in the construction of an accurate machine
learning model.
6.3 Module 2
6.4 Module 3
Data pre-processing:
The data that will be used to solve the problem is one of the most
significant aspects of machine learning problems. Data preparation
accounts for around sixty to seventy percent of the overall time spent
on a typical machine learning project. To get successful
outcomes, it is critical to have the proper data for the problem at hand.
In general, data preparation consists of selecting features and
pre-processing those features. After selecting features from
a vast quantity of data, the next step is to pre-process those
features, because the data is not usable in its raw form. The goal of
pre-processing in this case is to make the features appropriate for the machine
learning model we'll use. If the features are set up correctly, the
model can produce better results. In addition, the data formats required by
various models vary. There are machine learning models
Data Integration
Data integration means identifying the data sources that will be
needed for processing and combining their information into one dataset. This
stage is critical for any system that requires a large amount of data
processing to tackle the problem at hand. In industry, there are
several tools for integrating and combining data from various sources.
Such tools can be used in situations where the amount of
data is large and there are several data sources.
Data Cleaning:
Many impurities can be found in raw data. These impurities can
have an impact on the final result, especially in
machine learning problems. As a result, after the data has been
integrated, it must be cleaned. Impurities such as incorrect entries,
irrelevant data, and inconsistent data are detected during data
cleaning and eliminated from the records. Data cleaning
may be accomplished using a variety of methods. Using automated
tools, manual inspection, and writing scripts to programmatically
clean the data according to our needs are some of the data cleaning
strategies.
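The scripted cleaning strategy mentioned above can be sketched roughly as follows. This is only an illustration using the Python standard library: the function name `clean_records`, the field names, and the plausibility rules (age between 0 and 120, positive BMI, non-negative charges) are assumptions made for this example, and the rows are invented.

```python
def clean_records(records):
    """Drop rows with missing values, implausible values, or exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        # Missing entries: None or empty string in any field.
        if any(value is None or value == "" for value in row.values()):
            continue
        # Incorrect entries: plausibility rules assumed for this sketch.
        if not (0 < row["age"] < 120) or row["bmi"] <= 0 or row["charges"] < 0:
            continue
        # Duplicate entries: identical field/value pairs seen before.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Invented example rows.
raw = [
    {"age": 30, "bmi": 25.0, "smoker": "no", "charges": 4000.0},
    {"age": 30, "bmi": 25.0, "smoker": "no", "charges": 4000.0},    # duplicate
    {"age": 45, "bmi": None, "smoker": "yes", "charges": 21000.0},  # missing BMI
    {"age": -5, "bmi": 22.0, "smoker": "no", "charges": 3000.0},    # invalid age
    {"age": 52, "bmi": 31.5, "smoker": "yes", "charges": 24000.0},
]
cleaned = clean_records(raw)
print(len(cleaned), "rows kept out of", len(raw))
```

Dropping rows is the simplest policy; imputing missing values would be an alternative when data is scarce.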
Data Transformation:
After data integration and data cleaning, the next step is to
convert the cleaned, integrated data into the format the system
requires. In most cases, data conversion entails transforming the source
data into the format needed by the target system.
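For a regression model, one common transformation is turning categorical columns into numeric indicator columns (one-hot encoding). The sketch below shows the idea with the standard library only; the helper `one_hot_encode` and the sample rows are hypothetical, and whether the project encoded its data exactly this way is an assumption.

```python
def one_hot_encode(records, column, drop_first=True):
    """Replace a categorical column with 0/1 indicator columns."""
    categories = sorted({row[column] for row in records})
    if drop_first:
        # The first category becomes the implicit baseline.
        categories = categories[1:]
    encoded = []
    for row in records:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Invented example rows.
rows = [
    {"age": 19, "smoker": "yes", "region": "southwest"},
    {"age": 18, "smoker": "no", "region": "southeast"},
    {"age": 28, "smoker": "no", "region": "northwest"},
]
rows = one_hot_encode(rows, "smoker")
rows = one_hot_encode(rows, "region")
print(rows[0])
```

Dropping the first category avoids a redundant column that would be perfectly predictable from the others, which matters for the multicollinearity assumption of linear regression.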
CHAPTER 7
IMPLEMENTATION
Once we find the best values of θ1 and θ2 (the intercept and the slope
of the line), we get the best-fit line. So when we use our model for
prediction, it will predict the value of y for the input value of x.
How can we update the values of θ1 and θ2 to achieve the best-fit
line?
Cost Function (J):
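The formula itself is not reproduced here. A common choice, assumed in this sketch, is the mean squared error: writing the line as y = θ1 + θ2·x, the cost is J(θ1, θ2) = (1/n) Σ (ŷ_i − y_i)², where ŷ_i is the predicted value. Gradient descent then repeatedly nudges θ1 and θ2 in the direction that lowers J. The learning rate, step count, and data below are illustrative choices, not values from the project.

```python
def predict(theta1, theta2, x):
    """Line model: intercept theta1 plus slope theta2 times x."""
    return theta1 + theta2 * x


def cost(theta1, theta2, xs, ys):
    """Mean squared error J between predictions and true values."""
    n = len(xs)
    return sum((predict(theta1, theta2, x) - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimise J by moving both parameters against the gradient."""
    theta1, theta2 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(theta1, theta2, x) - y for x, y in zip(xs, ys)]
        grad1 = 2 * sum(errors) / n                              # dJ/dtheta1
        grad2 = 2 * sum(e * x for e, x in zip(errors, xs)) / n   # dJ/dtheta2
        theta1 -= learning_rate * grad1
        theta2 -= learning_rate * grad2
    return theta1, theta2


# Synthetic points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta1, theta2 = gradient_descent(xs, ys)
print(round(theta1, 4), round(theta2, 4), cost(theta1, theta2, xs, ys))
```

The learning rate has to be small enough for the updates to converge; for unscaled features such as age and charges, scaling the inputs first usually makes this easier.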
Linear regression in machine learning
Linear regression is one of the most widely used and straightforward
machine learning techniques. It is a statistical predictive analytics
technique. Linear regression is used to predict sales, salary, age,
product price, and other continuous, real, or numeric variables.
As the name indicates, linear regression models a linear
relationship between a dependent variable and one or more independent
variables. Because linear regression models a linear relationship, it
shows how the dependent variable's value varies as the independent
variable's value changes. In a linear regression model, the
relationship between the variables is represented by a sloped straight
line. Consider the diagram below.
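For a single independent variable, the straight line can be obtained directly with the ordinary least squares formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope·x̄. The sketch below applies them to invented, exactly linear data; the variable names and numbers are illustrative only and are not results from the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data in which charges rise by 250 per year of age.
ages = [20, 30, 40, 50, 60]
charges = [2000.0, 4500.0, 7000.0, 9500.0, 12000.0]

intercept, slope = fit_line(ages, charges)
predicted = intercept + slope * 35  # prediction for a 35-year-old
print(intercept, slope, predicted)
```

The slope reads directly as "change in the dependent variable per unit change in the independent variable", which is the relationship described above.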
Y = Dependent variable (target variable)
X = Independent variable (predictor variable)
ε = Random error
Line of linear regression
linear.
Assumptions of Linear Regression
Linear regression relies on the following assumptions. These are
formal checks that you should run while developing your linear
regression model to ensure that you get the best possible results
from your dataset.
A linear relationship between the features and target:
In linear regression, the dependent and independent variables are
assumed to have a linear relationship.
Small or no multicollinearity between the features:
Multicollinearity refers to a high correlation between
independent variables. Due to multicollinearity, determining the true
relationship between the predictors and the target variable can be
challenging. As a consequence, the model assumes little or no
multicollinearity among the features or independent variables.
Homoscedasticity Assumption:
Homoscedasticity arises when the variance of the error term is the same
for all values of the independent variables. With homoscedasticity,
there should be no discernible pattern in the distribution of the data
in a scatter plot.
Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal
distribution. If the error terms are not normally distributed, the
confidence intervals will be too wide or too narrow, making the
coefficients difficult to estimate.
A Q-Q plot may be used to verify this. The errors are normally
distributed if the plot shows a straight line with no deviations.
No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms.
The model's accuracy will be greatly diminished if the error terms are
correlated. Autocorrelation arises when there is dependence between
the residual errors.
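One widely used numeric check for the last assumption, not named in this report, is the Durbin-Watson statistic computed on the residuals in their observation order: DW = Σ(e_t − e_{t−1})² / Σ e_t². Values near 2 suggest little first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The residual sequences below are invented purely to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in observation order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that flip sign every step: negative autocorrelation (DW above 2).
alternating = [1.0, -1.0, 1.0, -1.0]
# Residuals that stay positive, then stay negative: positive autocorrelation (DW below 2).
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(dw_alternating, dw_trending)
```

For cross-sectional data such as one row per insured person, row order is usually arbitrary, so this check is most informative when the observations have a natural ordering.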
7.2 Implementation steps:
Step 1: Start
Step 2: Open interface
Step 3: Upload data sets
Step 4: Pre-processing
Step 5: Training and testing
Step 6: Enter constraints and submit
Step 7: Result
Step 8: Stop
7.3 Implementation procedure:
The process of our project is as follows. First, the user opens the files
in Jupyter Notebook. After opening them, the libraries are imported
and the user loads the datasets. The system then
calculates the average charges for each disease. Next, the
dataset is preprocessed. The model is first trained on the
dataset, and then testing is performed.
The user then enters constraints such as the disease, smoker or non-smoker status, and age.
After the constraints are submitted, the result is generated. The
result is the average cost and the insurance amount that can be claimed.
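The "average charges per group" step in this procedure can be sketched as a simple group-by computation. The helper `average_charges_by` is hypothetical, the rows are invented, and `smoker` is used as the grouping column only as a stand-in, since the exact name of the disease column is not given here.

```python
from collections import defaultdict


def average_charges_by(records, key):
    """Return the mean of 'charges' for each distinct value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in records:
        totals[row[key]] += row["charges"]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Invented example rows; 'smoker' stands in for the grouping column.
records = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "yes", "charges": 20000.0},
    {"smoker": "no", "charges": 4000.0},
    {"smoker": "no", "charges": 6000.0},
]
averages = average_charges_by(records, "smoker")
print(averages)
```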
CHAPTER 8
RESULT ANALYSIS
Fig 8.1 Dataset charges
This section deals with the result analysis of the project, that is, the
performance of the model, for which we used the linear regression algorithm.
In exploratory data analysis, we computed the correlation between
the features and used a heat map to show it, from which we make the
following observations. Strong correlation between charges and
smoker_yes. Weak correlation between charges and age. Weak correlation
between charges and BMI. Weak correlation between BMI and
region_southeast. Since the values for the weak correlations are less than
0.5, we can term them insignificant and drop them, as shown in the
figure below.
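The thresholding rule described above (keep a feature only if its absolute correlation with charges is at least 0.5) can be sketched with a hand-written Pearson correlation. All numbers below are made up for illustration and do not reproduce the project's heat map; the function and column names are likewise assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def select_features(features, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold."""
    correlations = {name: pearson(values, target) for name, values in features.items()}
    kept = [name for name, r in correlations.items() if abs(r) >= threshold]
    return kept, correlations


# Invented data: smokers have much higher charges; children is nearly unrelated.
features = {
    "smoker_yes": [1, 0, 0, 1, 0, 1],
    "children": [0, 2, 1, 3, 0, 1],
}
charges = [30000.0, 4000.0, 5000.0, 28000.0, 3500.0, 32000.0]

kept, correlations = select_features(features, charges)
print(kept, {name: round(r, 2) for name, r in correlations.items()})
```

Pearson correlation only measures linear association with the target taken one feature at a time, so a feature that falls under the threshold can still contribute in combination with others.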
Fig 8.2 correlation between features
We then predict the charges with the help of the other features. In
model prediction, our basic linear regression model for
predicting the cost of treatment looks good. The closely matching
results on the training and test data suggest that the model is not
overfitting and is reasonably accurate.
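A comparison of training and test performance can be sketched by holding out part of the data and computing the coefficient of determination (R²) on both parts. The data here is synthetic (a linear trend plus random noise with a fixed seed), the 80/20 split ratio is an assumption, and a single feature is used to keep the example short.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination of the fitted line on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ages = [float(a) for a in range(18, 65)]
charges = [250.0 * a + rng.gauss(0.0, 200.0) for a in ages]  # synthetic data

pairs = list(zip(ages, charges))
rng.shuffle(pairs)
split = int(0.8 * len(pairs))  # assumed 80/20 train/test split
train, test = pairs[:split], pairs[split:]

train_x, train_y = zip(*train)
test_x, test_y = zip(*test)
intercept, slope = fit_line(train_x, train_y)  # fit on training data only
r2_train = r_squared(train_x, train_y, intercept, slope)
r2_test = r_squared(test_x, test_y, intercept, slope)
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

A test score far below the training score would point to overfitting; similar scores indicate that the fitted relationship carries over to unseen rows.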
Fig 8.3 Heatmap
Fig 8.5 Output
The above figure shows the graph and the medical expenses estimated
for the disease, and it also shows the insurance that
can be claimed by the patient. The patient can determine the cost
he/she will incur for the disease from which he/she is suffering,
and how much can be claimed
for that disease, based on the graph and expenditures.
CHAPTER 9
CONCLUSION AND FUTURE WORK
9.1 Conclusion
We have looked at the fundamentals of the linear regression model, how
to use it to forecast charges, and how to compare predicted and actual
outcomes. This report should give the reader a
basic understanding of how a linear regression model works. For
estimating medical expenditures, we proposed a machine learning
approach. We applied the Linear Regression technique and
observed that age and BMI are features that determine the dependent
variable. Out of all experiments, this model gave the best result.
REFERENCES
APPENDIX A
SAMPLE SCREEN SHOT
APPENDIX B
SAMPLE CODE