
SciEco

Chapter 3 - Missing Data


SciEco - Science for Economics
Lecturer: Le Thanh Ha

Page 1 / 58

Agenda
1. Raw Data Problems
2. Introduction to Missing Data
3. Mechanisms of Missingness
4. Methods for Handling Missing Data


Raw Data Problems


What is data preprocessing?


The process of transforming raw data to improve its quality, thereby improving the quality of the analysis results.


GIGO - Garbage in, garbage out


Data Cleaning
Handling missing data
Handling noisy data
Identifying outliers


1. Missing data


A value is missing or invalid at a given data point


2. Mechanisms of Missingness
MCAR - Missing Completely At Random
MAR - Missing At Random

MNAR - Missing Not At Random

MCAR - Missing Completely At Random

Missing values are randomly distributed
The probability of being missing is the same for every observation
Missingness is not related to any variable in the data set

MAR - Missing At Random

The missing values are not randomly distributed across all observations
The probability of being missing differs between observations
Missingness is related to another variable in the data set

MNAR - Missing Not At Random

The missing values are not randomly distributed across all observations
Missingness is not related to other variables
It is due to the intention of the informant
There is no basis for prediction

MAR, MCAR, MNAR

Identify the mechanism that causes the missing data:

MCAR and MAR:
Plot the histogram, examine the distribution
t-test, Little's test
MNAR: difficult to identify; in this chapter, by default, we treat it as MAR for easier handling.
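As a sketch of the t-test approach, one can compare the distribution of an observed variable between rows where another variable is missing and rows where it is not. The built-in airquality dataset (where Ozone has missing values) serves as an example here; the choice of Solar.R as the comparison variable is an illustrative assumption:

```r
# Compare Solar.R between rows where Ozone is missing and rows where
# it is observed; a significant difference suggests the missingness
# is related to Solar.R (MAR rather than MCAR)
miss <- is.na(airquality$Ozone)
t.test(airquality$Solar.R[miss], airquality$Solar.R[!miss])
```

A large p-value is consistent with MCAR with respect to Solar.R; Little's test examines all variables jointly.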


R Practice - Identify the Missingness Mechanism


attenu dataset

R Practice - Identify the Missingness Mechanism

1. Check whether missing values exist in the dataset:
anyNA(data)
is.na(data)

2. Count the number of missing values:
sum(is.na(data))
table(is.na(data))

3. Count the number of missing values by column:
colSums(is.na(data))
sum(is.na(data$col))

4. By row:
rowSums(is.na(data))
complete.cases() or cci()
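These commands can be tried on the attenu dataset, which ships with base R and contains missing values in the station column (a minimal sketch):

```r
# Inspect missingness in the built-in attenu dataset
anyNA(attenu)                  # TRUE: missing values exist
sum(is.na(attenu))             # total number of missing cells
colSums(is.na(attenu))         # missing count per column (only station has NAs)
head(rowSums(is.na(attenu)))   # missing count for the first rows
sum(!complete.cases(attenu))   # number of incomplete observations
```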

R practice - Using visdat package

vis_dat(obj)

vis_miss(obj)

ggplotly(vis_dat(obj)) # interactive plot

R practice - Using aggr() in VIM package

aggr(obj, ...options)

summary(aggr(obj, ...options))

R practice - Using matrixplot() in VIM package

matrixplot(obj)

R practice - Little’s test

Little's test (naniar package)


Hypothesis:
H0 : The missingness mechanism is MCAR

H1 : The missingness mechanism is not MCAR


Using chi-square test (Little's test)

mcar_test(obj)


3. Handling missing data


3.1. Remove Missing Data

3.2. Imputation Method

3.1. Remove Missing Data
Pros: Easy to do
Cons:
Loses information when the number of missing values is large
Increases the variance of the variables
Increases bias when the missingness is systematic

Listwise Deletion

Delete Column

Pairwise Deletion

R Practice

Listwise Deletion: Delete all observations with missing values:

na.omit(object)

or:

drop_na(object) # tidyr

R Practice

Pairwise Deletion: Delete observations with missing values in specific columns (package tidyr):

drop_na(object, col)

Delete Column: Delete all columns that contain missing values:

object[colSums(is.na(object)) == 0]
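On the built-in airquality dataset (153 observations, with NAs in Ozone and Solar.R), the deletion strategies can be compared directly; this is a base-R sketch, with drop_na() coming from tidyr as above:

```r
# Compare deletion strategies on the built-in airquality dataset
dim(airquality)                                    # 153 rows, 6 columns
nrow(na.omit(airquality))                          # listwise deletion keeps complete rows only
ncol(airquality[colSums(is.na(airquality)) == 0])  # column deletion keeps NA-free columns only
```

Listwise deletion discards every row with any NA, while column deletion discards Ozone and Solar.R entirely; which loss is worse depends on how much of each dimension carries the analysis.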

3.2. Imputation Method

⚠️ Note
Do not apply imputation arbitrarily, as it will affect the later analysis results

Impute Methods:
Replace with a value (fixed, mean, median, etc.)
Using regression model
Other machine learning methods: Decision Tree, KNN, ...

3.2. Imputation Method

Impute Methods

Single Imputation Multiple Imputation

Single imputation

Each missing value is replaced once with a single value
→ When to use: low percentage of missing data, few observations

Multiple imputation
Practice R

Some commonly used imputation methods:

Replace with a fixed value (mean, mode, median)
Replace with a random value
Using a regression model
Based on probabilistic analysis
Machine learning algorithms (KNN, Decision Tree, Random Forest, Boosting)

Replace with fixed value (mean, mode, median)

Mean, median, or mode of the data or a subset of the data

Or a fixed value (e.g. 0)

Replace with fixed value (mean, mode, median)

Pros: the mean/median/mode of the variable is preserved after filling

Cons:
Reduces the variance and standard deviation
→ narrows the confidence intervals
Distorts the correlation between variables
Generally the "worst" of the imputation methods
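The variance shrinkage can be verified on a toy vector (a minimal base-R sketch):

```r
# Mean imputation preserves the mean but shrinks the variance
x <- c(2, 4, NA, 8, NA, 6)
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x_imp) == mean(x, na.rm = TRUE)  # TRUE: the mean is preserved
var(x_imp) < var(x, na.rm = TRUE)     # TRUE: the variance shrinks
```

The imputed values sit exactly on the mean, so they add nothing to the sum of squared deviations while inflating the sample size, which is where the shrinkage comes from.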

Replace with random value

1. Hot-deck
Randomly choose the value of another similar observation in the population (or sample) to fill in
2. Cold-deck
Randomly select values from another dataset that has the same data attributes
Features:
Commonly used in survey research
The original distribution of the data is not guaranteed to be preserved
Better than replacing with a fixed value because it preserves the "natural" character of the data
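A minimal base-R sketch of the hot-deck idea (the practice slides use the ready-made hotdeck() from VIM): each NA is filled with a value drawn at random from the observed values of the same variable.

```r
# Minimal hot-deck sketch: fill each NA with a randomly drawn
# observed value ("donor") from the same column
set.seed(1)
x <- c(10, NA, 30, NA, 50)
donors <- x[!is.na(x)]
x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
x   # no NAs remain; every filled value is one of the observed ones
```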

Linear Regression

1. Simple Regression Model

ŷ_i = β̂_0 + β̂_1 x_i

Pros: Uses the relationship with other variables, so the imputed values are consistent with the observed trends

Linear Regression

1. Simple Regression Model


Cons:
Reduce the variance
Increases the correlation
coefficient between variables

Linear Regression

2. Stochastic Regression

ŷ_i = β̂_0 + β̂_1 x_i + ε_i, with ε_i ~ N(0, σ̂²)

Linear Regression

2. Stochastic Regression

Pros:
Does not distort the correlation between variables
Preserves the distribution of the data
Cons:
The predicted value can fall outside the feasible range
Limited when the data exhibit heteroscedasticity

Compare 3 methods


Practice R - Single Imputation

Practice R - Replace with fixed value

Option 1: use the command mutate()

mutate(
obj,
col = case_when(
is.na(col) ~ (mean(col, na.rm=TRUE)),
TRUE ~ col
)
)

Option 2: use package mice

complete(mice(obj, method = "mean"))

Practice R - Replace with random value

Hotdeck
Option 1: use package mice

complete(mice(obj, method = "sample"))

Option 2: use package VIM

hotdeck(obj)

Practice R - Simple Linear Regression

Build a simple linear regression model:

m <- lm(y ~ x, data = obj)


summary(m)

Fill in the missing values with the predicted values from the regression model:

obj %>%
mutate(
new_x = case_when(
is.na(x) ~ predict(m, .),
TRUE ~ x
)
)

Practice R - Simple Linear Regression

Linear regression using package mice

complete(
mice(obj, method = "norm.predict")
)

Stochastic Regression using package mice

complete(
mice(obj, method = "norm.nob")
)

Practice R - Machine Learning

Decision Tree

complete(
mice(obj, method = "cart")
)

Random Forest

complete(
mice(obj, method = "rf")
)


Practice R: Multiple imputation


Video: Dealing With Missing Data - Multiple Imputation

Practice R - Multiple imputation

Some parameters and functions used with the multiple imputation method

For the function mice():

m = 5: the number of imputations (m > 1 gives multiple imputation)
method: the imputation method to use
predictorMatrix: the predictor matrix; create it before running mice()
Use the quickpred() function to quickly generate a predictor matrix from the variables that are highly correlated with the missing variable
Practice R - Multiple imputation

Some parameters and functions used with the multiple imputation method

For the function complete():

action = 1L: returns the dataset from the 1st imputation

Pooling analysis results:

pool(): pools results with Rubin's (1987) method
pool.syn(): pools results with Reiter's (2003) method
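Putting the pieces together, a sketch of the full multiple-imputation workflow on the built-in airquality dataset (requires the mice package; method = "pmm", mice's default predictive mean matching, and the lm() analysis model are illustrative assumptions, not part of the slides):

```r
library(mice)

# Predictor matrix from variables correlated with the missing ones
pred <- quickpred(airquality)

# m = 5 imputations -> multiple imputation
imp <- mice(airquality, m = 5, method = "pmm",
            predictorMatrix = pred, seed = 1, printFlag = FALSE)

complete(imp, action = 1L)   # the dataset from the 1st imputation

# Fit the analysis model on each imputed dataset, then pool (Rubin, 1987)
fit <- with(imp, lm(Ozone ~ Solar.R + Wind))
summary(pool(fit))
```

The pooled estimates combine the five fits and their between-imputation variance, which is what single imputation cannot provide.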


Summary - which method to choose? 🧐


Homework
Run the commands above on the dataset application_missing2.csv:

Identify the missing data mechanism
Handle the missing data with an appropriate method
Submit the exercise results as an RPubs link

Reference
1. https://rpubs.com/nnthieu/301064
2. https://bozenne.github.io/doc/2019-10-22-multipleImputation/post-multipleImputation.pdf

