DSML Project Report - Group05
Life Expectancy
Data Analysis
Introduction/Business Understanding
Although many past studies have examined factors affecting life expectancy using demographic
variables, income composition and mortality rates, the effect of immunization and the human
development index was not taken into account. In addition, some of the past research applied
multiple linear regression to a data set covering only one year for all the countries. This
motivates us to address both gaps by formulating a regression model based on a mixed effects
model and multiple linear regression, using data from 2000 to 2015 for all the countries.
Important immunizations such as Hepatitis B, Polio and Diphtheria will also be considered. In a
nutshell, this study will focus on immunization factors, mortality factors, economic factors,
social factors and other health-related factors. Since the observations in this dataset are based
on different countries, it will be easier for a country to determine the predicting factor that
contributes to a lower value of life expectancy. This will help in suggesting to a country which
area should be given importance in order
under World Health Organization (WHO) keeps track of the health status as well as many other
related factors for all countries. The datasets are made available to the public for the purpose
of health data analysis. The dataset related to life expectancy and health factors for 193
countries has been collected from the same WHO data repository website, and its corresponding
economic data was collected from the United Nations website. Among all categories of
health-related factors, only those critical factors that are more representative were chosen. It
has been observed that in the past 15 years there has been huge development in the health sector,
resulting in an improvement in human mortality rates, especially in developing nations, in
comparison to the past 30 years. Therefore, in this project we have considered data from 2000 to
2015 for 193 countries for further analysis. The individual data files have been merged into a
single dataset. Initial visual inspection of the data showed some missing values. As the datasets
were from WHO, we found
Inspiration
The dataset aims to answer the following key questions:
• Do the various predicting factors that were chosen initially really affect life
expectancy? Which predicting variables actually affect life expectancy?
• Should a country having a lower life expectancy value (<65) increase its healthcare
• How do infant and adult mortality rates affect life expectancy?
• Does life expectancy have a positive or negative correlation with eating habits, lifestyle,
• Does life expectancy have a positive or negative relationship with drinking alcohol?
For all the analysis done throughout the report, we use the CRISP-DM methodology. CRISP-DM
stands for Cross-Industry Standard Process for Data Mining. It is an industry-proven method
that guides our data mining process. The model consists of six phases that systematically
describe the data mining process and its implementation. The six phases are Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
Data Understanding
Variables
To prepare the data for modelling, the dataset was loaded into a Python workspace in Jupyter,
and its features were then inspected using the ‘info’ command in Python Pandas. The data
consisted of 19 columns, and each column had 1649 entries. All the columns were of numerical
data type, either int or float; only the Customer ID column was of object data type.
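The inspection step described above can be sketched as follows. This is a minimal illustration only: the small DataFrame and its column names are hypothetical stand-ins, not the actual 19-column dataset used in this report.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the merged dataset; the column
# names and values are illustrative assumptions, not the report's data.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "Life expectancy": [59.9, 65.0, 81.1, 81.5],
    "Adult Mortality": [271, 263, 60, 59],
})

# DataFrame.info() reports column names, non-null counts and dtypes.
# Writing to a buffer lets us keep the text as well as print it.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# The same facts are available programmatically.
n_rows, n_cols = df.shape
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()
print(n_rows, n_cols, non_numeric_columns)
```

Listing the non-numeric columns this way shows at a glance which features would need encoding before they can be used in a regression model.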
Processing Data
Since there were no missing values, there was no need for further data processing. The only
variable that needed to be encoded was Status.
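A minimal sketch of these two checks is given below. The toy data is hypothetical, and the use of `pd.get_dummies` with `drop_first=True` is an assumption about how Status could be encoded, not a record of the exact call used in this project.

```python
import pandas as pd

# Hypothetical toy data; only the Status column name comes from the report.
df = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developed"],
    "Life expectancy": [59.9, 81.1, 65.0, 81.5],
})

# Count missing values per column and in total.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)

# One-hot encode Status. With drop_first=True the alphabetically first
# level ("Developed") is dropped, leaving one 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Status"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a redundant indicator column, which would otherwise be perfectly collinear with the intercept in a linear regression.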
We trained the model on 70% of the dataset using random_state = 42 and tested it on the
remaining 30%.
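The split can be sketched with scikit-learn's `train_test_split`. The synthetic data and the plain `LinearRegression` model below are placeholders chosen for illustration; only the 70/30 ratio and `random_state = 42` come from the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption): 100 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 30% of the rows for testing; random_state fixes the shuffle
# so the same split is reproduced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Placeholder model: fit on the training part, score on the held-out part.
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(len(X_train), len(X_test), round(test_r2, 3))
```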
The significant variables for the same are:
Columns to be removed because of large VIF (variance inflation factor) values.
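For reference, the VIF of a column is 1 / (1 - R^2), where R^2 comes from regressing that column on all the other predictors. The sketch below computes it directly with NumPy on synthetic data. The helper name `variance_inflation_factors` is hypothetical, and the threshold of 10 is a common rule of thumb assumed here, not a value stated in this report.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic example (assumption): column c is nearly a copy of column a,
# while column b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))

# Flag columns above the assumed rule-of-thumb threshold of 10.
high_vif_columns = [i for i, v in enumerate(vifs) if v > 10]
print(vifs, high_vif_columns)
```

In practice the columns are usually removed one at a time, recomputing the VIFs after each removal, because dropping one collinear column lowers the VIF of its partners.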
Ridge Regression
Lasso Regression
Elnet Regression
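The three regularized models named above can be fitted and compared along the following lines. This is a sketch under stated assumptions: the data is synthetic, and the `alpha` and `l1_ratio` values are arbitrary illustrations rather than the hyperparameters used in this project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption): two of the five features
# have a true coefficient of zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Illustrative hyperparameters only; in practice they would be tuned.
models = {
    "Ridge": Ridge(alpha=1.0),                     # L2 penalty
    "Lasso": Lasso(alpha=0.1),                     # L1 penalty
    "Elnet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mix of L1 and L2
}

# Standardize features so the penalty treats all coefficients alike,
# then record the R^2 score of each model on the held-out data.
scores = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Scoring all three models on the same held-out split keeps the comparison fair, since each model sees exactly the same training and test rows.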
Data Visualization
Conclusion
Ridge and Elnet regression fared better than Lasso regression. Therefore, we can use these
two models and reject Lasso regression. Based on the regression outputs, the variables
affecting life expectancy are Status Developing, HIV/AIDS, Alcohol, GDP, Adult Mortality
and BMI.