DSML Project Report - Group05

This document provides a report on analyzing factors that affect life expectancy using data from 193 countries between 2000 and 2015. It discusses the inspiration and business understanding behind the project, describes the dataset and variables, and outlines the data processing steps including handling missing data and encoding categories. Regression models like Ridge, Lasso, and Elastic Net were applied and compared to identify significant factors influencing life expectancy. Key factors found to have effects include development status, HIV/AIDS prevalence, alcohol consumption, GDP, adult mortality, and BMI. Visualizations are also included to aid in conclusions.

Uploaded by

deepak raj

DSML Project Report

Life Expectancy
Data Analysis

Submitted By: Group - 05


MBA20217 | ADITYA CHAUHAN
MBA20234 | NITESH KUMAR SINGH
MBA20263 | DHRUV JINDAL
MBA20269 | SHIVAM MISHRA
MBA20291 | DEEPAK RAJ
MBA20334 | DEEPESH
MBA20354 | SOHAIL
Contents
Introduction/Business Understanding
About the Dataset
Inspiration
Data Understanding
Processing Data
Ridge Regression
Lasso Regression
Elnet Regression
Data Visualization
Conclusion

Introduction/Business Understanding

Many past studies have examined the factors affecting life expectancy using demographic variables, income composition and mortality rates. However, the effects of immunization and the human development index were not considered. In addition, some of that research applied multiple linear regression to a data set covering a single year for all countries. This motivates us to address both gaps by formulating a regression model, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria are also considered. In a nutshell, this study focuses on immunization, mortality, economic, social and other health-related factors. Since the observations in this dataset are organized by country, it will be easier for a country to determine which predicting factor is contributing to a lower life expectancy. This will help suggest which area a country should prioritize in order to efficiently improve the life expectancy of its population.

About the Dataset


The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status and many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in an improvement in human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project we have considered data from 2000 to 2015 for 193 countries for further analysis.

The individual data files were merged into a single dataset. Initial visual inspection of the data showed some missing values, which had to be identified. As the datasets were from WHO, we found no evident errors.

Inspiration
The dataset aims to answer the following key questions:

• Do the predicting factors that were chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?

• Should a country with a lower life expectancy value (<65) increase its healthcare expenditure to improve its average lifespan?

• How do infant and adult mortality rates affect life expectancy?

• Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

• What is the impact of schooling on the lifespan of humans?

• Does life expectancy have a positive or negative relationship with drinking alcohol?

• Do densely populated countries tend to have lower life expectancy?

• What is the impact of immunization coverage on life expectancy?

For all the analysis in this report, we use the CRISP-DM methodology. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an industry-proven method that guides the data mining process. The model consists of six phases that systematically describe the data mining process and its implementation: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.

Data Understanding
Variables

To prepare the data for modelling, the dataset was loaded into a Python workspace in Jupyter, and the features of the dataset were inspected using the ‘info’ command in Python Pandas. The data consisted of 19 columns, each with 1649 entries. All columns in the dataset were of a numerical data type, either int or float. Only the Customer ID column was of object data type.
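The inspection step described above could look roughly like the following. This is a minimal sketch, not the report's actual notebook: the inline CSV, its column names and its values are invented stand-ins for the real file, and the helper name `load_and_inspect` is hypothetical.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real file; names and values are invented.
SAMPLE_CSV = """Country,Year,Status,Life expectancy,Adult Mortality,Alcohol
A,2015,Developing,65.0,263,0.01
A,2014,Developing,59.9,271,0.01
B,2015,Developed,82.8,59,9.71
"""


def load_and_inspect(source):
    """Read a CSV and return the frame plus the names of non-numeric columns."""
    df = pd.read_csv(source)
    df.info()  # prints column names, non-null counts and dtypes
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return df, non_numeric


df, non_numeric = load_and_inspect(io.StringIO(SAMPLE_CSV))
print("rows, columns:", df.shape)
print("non-numeric columns:", non_numeric)
```

With the real dataset, the path to the merged file would be passed to `load_and_inspect` in place of the in-memory sample.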

Processing Data
Since the dataset had no missing values, no further data processing was needed. The only variable that needed to be encoded was Status.
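The missing-value check and the encoding of Status can be sketched as follows. The report does not say which encoder was used; one-hot encoding with `pandas.get_dummies` is an assumption here, chosen because dropping the first category leaves a single `Status_Developing` indicator, which matches the "Status Developing" term in the final equation. The sample rows are invented.

```python
import pandas as pd

# Invented sample rows; the real dataset has many more columns.
df = pd.DataFrame(
    {
        "Status": ["Developing", "Developed", "Developing"],
        "Life expectancy": [65.0, 82.8, 59.9],
    }
)

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# One-hot encode Status and drop the first category ("Developed"),
# leaving a 0/1 indicator column named Status_Developing.
encoded = pd.get_dummies(df, columns=["Status"], drop_first=True, dtype=int)
print(encoded)
```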

We trained the model on 70% of the dataset, using random_state = 42, and tested it on the remaining 30%.
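A 70/30 split with random_state = 42 is typically done with scikit-learn's `train_test_split`. The sketch below assumes that function was used and runs on synthetic arrays, since the report's feature matrix is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the report's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print("train rows:", X_train.shape[0], "test rows:", X_test.shape[0])
```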

After running the regression, the output is as follows:
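As an illustration of the regression step, the sketch below fits an ordinary least squares model with scikit-learn on synthetic, noise-free data, so the recovered intercept and coefficients can be checked against the known values. The library choice and the data are assumptions; a significance table with p-values, as the report's output appears to have contained, would need a statistics package such as statsmodels and is not shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with known parameters (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_intercept = 70.0
true_coefs = np.array([-2.0, 0.5])
y = true_intercept + X @ true_coefs

# Fit ordinary least squares and report the estimated parameters.
model = LinearRegression().fit(X, y)
print("intercept:", round(float(model.intercept_), 4))
print("coefficients:", np.round(model.coef_, 4))
print("R^2:", round(float(model.score(X, y)), 4))
```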

The significant variables for this model are:

Columns to be removed because of large VIF values:
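The variance inflation factor (VIF) of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other predictors. The sketch below computes it directly with NumPy on synthetic data in which one column is nearly a copy of another. The function name `variance_inflation_factors` is hypothetical, and the report does not state which cutoff it treated as "large"; 5 or 10 are common rules of thumb.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X, where VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic example: x3 is almost a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In this example the first and third columns get large VIFs because each is almost fully explained by the other, while the independent second column stays close to 1.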

Significant variables after running the regression again:

Ridge Regression

Lasso Regression

Elnet Regression
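The three regularized models named in the headings above can be fitted and compared along these lines. This is a sketch on synthetic data: the penalty strengths (`alpha`), the `l1_ratio`, and the use of test-set R² as the comparison metric are all assumptions, since the report's settings and outputs are not reproduced in the text.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, two of which have no effect on the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Hyperparameter values are illustrative, not the report's.
models = {
    "Ridge": Ridge(alpha=0.1),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test R^2 = {scores[name]:.4f}")
```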

Data Visualization

Conclusion
Ridge and Elnet regression fare better than Lasso regression. Therefore, we retain those two models and reject Lasso regression. Based on the regression outputs, the variables affecting Life Expectancy are Status Developing, HIV/AIDS, Alcohol, GDP, Adult Mortality and BMI.

Life Expectancy = 70.2734 - 2.1297*Status Developing - 0.4423*HIV/AIDS - 0.0249*Adult Mortality + 0.2610*Alcohol + 0.1208*BMI + 0.0001*GDP
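To show how the fitted equation is applied, the sketch below wraps it in a small function. The coefficients are taken from the equation above; the function name, the argument names and the input values are hypothetical, and the inputs must be in the same units as the dataset columns, which the report does not spell out.

```python
def predict_life_expectancy(
    status_developing, hiv_aids, adult_mortality, alcohol, bmi, gdp
):
    """Evaluate the report's fitted equation for one observation.

    status_developing is 1 for a developing country and 0 otherwise.
    """
    return (
        70.2734
        - 2.1297 * status_developing
        - 0.4423 * hiv_aids
        - 0.0249 * adult_mortality
        + 0.2610 * alcohol
        + 0.1208 * bmi
        + 0.0001 * gdp
    )


# Hypothetical input values, chosen only to exercise the formula.
example = predict_life_expectancy(
    status_developing=1,
    hiv_aids=0.1,
    adult_mortality=200,
    alcohol=2.0,
    bmi=25.0,
    gdp=1500,
)
print(round(example, 2))
```

Holding everything else fixed, switching `status_developing` from 0 to 1 lowers the prediction by 2.1297 years, which is the direct reading of that coefficient.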
