
SciEco

Chapter 3 - Missing Data


SciEco - Science for Economics
Lecturer: Le Thanh Ha

Page 1 / 58

Agenda
1. Raw Data Problems
2. Introduction to Missing Data
3. Mechanisms of Missingness
4. Methods for Handling Missing Data


Raw Data Problems


What is data preprocessing?


The process of transforming raw data to improve its quality, thereby improving the quality of the analysis results.


GIGO - Garbage in, garbage out


Data Cleaning
Handling missing data
Handling noisy data
Identifying outliers


1. Missing data


A value is missing or invalid at a given data point


2. Mechanisms of Missingness
MCAR - Missing Completely At Random
MAR - Missing At Random

MNAR - Missing Not At Random

MCAR - Missing Completely At Random

Missing values are randomly distributed
The probability of being missing is the same for every observation
Missingness is not related to any variable in the data set

MAR - Missing At Random

The missing values are not randomly distributed across all observations
The probability of being missing differs between observations
Missingness is related to another variable in the data set

MNAR - Missing Not At Random

The missing values are not randomly distributed across all observations
Missingness is not related to other variables
It is due to the intention of the informant
There is no basis for prediction

MAR, MCAR, MNAR

Identify the mechanism that causes the missing data:

MCAR and MAR:
Plot the histogram, examine the distribution
t-test, Little's test
MNAR: difficult to identify; in this chapter, by default, we treat it as MAR for easier handling.
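As a sketch of the t-test approach, one can compare the distribution of an observed variable between rows where another variable is missing and rows where it is not. The built-in airquality dataset (where Ozone has missing values) serves as an example here; the choice of Solar.R as the comparison variable is an illustrative assumption:

```r
# Compare Solar.R between rows where Ozone is missing and rows where
# it is observed; a significant difference suggests the missingness
# is related to Solar.R (MAR rather than MCAR)
miss <- is.na(airquality$Ozone)
t.test(airquality$Solar.R[miss], airquality$Solar.R[!miss])
```

A large p-value is consistent with MCAR with respect to Solar.R; Little's test examines all variables jointly.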


R Practice - Identify the Missingness Mechanism


attenu dataset

R Practice - Identify the Missingness Mechanism

1. Check whether missing values exist in the dataset:
anyNA(data)
is.na(data)

2. Count the number of missing values:
sum(is.na(data))
table(is.na(data))

3. Count the number of missing values by column:
colSums(is.na(data))
sum(is.na(data$col))

4. By row:
rowSums(is.na(data))
complete.cases() or cci()
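These commands can be tried on the attenu dataset, which ships with base R and contains missing values in the station column (a minimal sketch):

```r
# Inspect missingness in the built-in attenu dataset
anyNA(attenu)                  # TRUE: missing values exist
sum(is.na(attenu))             # total number of missing cells
colSums(is.na(attenu))         # missing count per column (only station has NAs)
head(rowSums(is.na(attenu)))   # missing count for the first rows
sum(!complete.cases(attenu))   # number of incomplete observations
```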

R practice - Using visdat package

vis_dat(obj)

vis_miss(obj)

ggplotly(vis_dat(obj)) # interactive plot

R practice - Using aggr() in VIM package

aggr(obj, ...options)

summary(aggr(obj, ...options))

R practice - Using matrixplot() in VIM package

matrixplot(obj)

R practice - Little’s test

Little's test (naniar package)


Hypothesis:
H0 : The missingness mechanism is MCAR

H1 : The missingness mechanism is not MCAR


Using chi-square test (Little's test)

mcar_test(obj)


3. Handling missing data


3.1. Remove Missing Data

3.2. Imputation Method

3.1. Remove Missing Data
Pros: Easy to do
Cons:
Loses information when the number of missing values is large
Increases the variance of the variables
Increases bias when the missingness is systematic

Listwise Deletion

Delete Column

Pairwise Deletion

R Practice

Listwise Deletion: Delete all observations with missing values:

na.omit(object)

or:

drop_na(object) # tidyr

R Practice

Pairwise Deletion: Delete observations with missing values in specific columns (package tidyr):

drop_na(object, col)

Delete Column: Delete all columns that contain missing values:

object[colSums(is.na(object)) == 0]
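On the built-in airquality dataset (153 observations, with NAs in Ozone and Solar.R), the deletion strategies can be compared directly; this is a base-R sketch, with drop_na() coming from tidyr as above:

```r
# Compare deletion strategies on the built-in airquality dataset
dim(airquality)                                    # 153 rows, 6 columns
nrow(na.omit(airquality))                          # listwise deletion keeps complete rows only
ncol(airquality[colSums(is.na(airquality)) == 0])  # column deletion keeps NA-free columns only
```

Listwise deletion discards every row with any NA, while column deletion discards Ozone and Solar.R entirely; which loss is worse depends on how much of each dimension carries the analysis.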

3.2. Imputation Method

⚠️ Note
Do not apply imputation arbitrarily, as it will affect the later analysis results

Impute Methods:
Replace with a value (fixed, mean, median, etc.)
Using regression model
Other machine learning methods: Decision Tree, KNN, ...

3.2. Imputation Method

Impute Methods

Single Imputation Multiple Imputation

Single imputation

Each missing value is replaced once with a single value
→ When to use: low percentage of missing data, few observations

Multiple imputation
Practice R

Some commonly used imputation methods:

Replace with a fixed value (mean, mode, median)
Replace with a random value
Using a regression model
Based on probabilistic analysis
Machine learning algorithms (KNN, Decision Tree, Random Forest, Boosting)

Replace with fixed value (mean, mode, median)

Mean, median, or mode of the data or a subset of the data

Or a fixed value (e.g. 0)

Replace with fixed value (mean, mode, median)

Pros: the mean/median/mode of the variable is preserved after filling

Cons:
Reduces the variance and standard deviation
→ narrows the confidence intervals
Distorts the correlation between variables
Generally the "worst" of the imputation methods
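The variance shrinkage can be verified on a toy vector (a minimal base-R sketch):

```r
# Mean imputation preserves the mean but shrinks the variance
x <- c(2, 4, NA, 8, NA, 6)
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x_imp) == mean(x, na.rm = TRUE)  # TRUE: the mean is preserved
var(x_imp) < var(x, na.rm = TRUE)     # TRUE: the variance shrinks
```

The imputed values sit exactly on the mean, so they add nothing to the sum of squared deviations while inflating the sample size, which is where the shrinkage comes from.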

Replace with random value

1. Hot-deck
Randomly choose the value of another similar observation in the population (or sample) to fill in
2. Cold-deck
Randomly select values from another dataset that has the same data attributes
Features:
Commonly used in survey research
The original distribution of the data is not guaranteed to be preserved
Better than replacing with a fixed value because it preserves the "natural" character of the data
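A minimal base-R sketch of the hot-deck idea (the practice slides use the ready-made hotdeck() from VIM): each NA is filled with a value drawn at random from the observed values of the same variable.

```r
# Minimal hot-deck sketch: fill each NA with a randomly drawn
# observed value ("donor") from the same column
set.seed(1)
x <- c(10, NA, 30, NA, 50)
donors <- x[!is.na(x)]
x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
x   # no NAs remain; every filled value is one of the observed ones
```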

Linear Regression

1. Simple Regression Model

ŷ_i = β̂_0 + β̂_1 x_i

Pros: Uses the relationship with other variables, so the imputed values are consistent with the observed trends

Linear Regression

1. Simple Regression Model


Cons:
Reduce the variance
Increases the correlation
coefficient between variables

Linear Regression

2. Stochastic Regression

ŷ_i = β̂_0 + β̂_1 x_i + ε_i, with ε_i ~ N(0, σ̂²)

Linear Regression

2. Stochastic Regression

Pros:
Does not distort the correlation between variables
Preserves the distribution of the data
Cons:
The predicted value can fall outside the feasible range
Limited when the data exhibit heteroscedasticity

Compare 3 methods


Practice R - Single Imputation

Practice R - Replace with fixed value

Option 1: use the command mutate()

mutate(
obj,
col = case_when(
is.na(col) ~ (mean(col, na.rm=TRUE)),
TRUE ~ col
)
)

Option 2: use package mice

complete(mice(obj, method = "mean"))

Practice R - Replace with random value

Hotdeck
Option 1: use package mice

complete(mice(obj, method = "sample"))

Option 2: use package VIM

hotdeck(obj)

Practice R - Simple Linear Regression

Build a simple linear regression model:

m <- lm(y ~ x, data = obj)


summary(m)

Fill in the missing values with the predicted values from the regression model:

obj %>%
mutate(
new_x = case_when(
is.na(x) ~ predict(m, .),
TRUE ~ x
)
)

Practice R - Simple Linear Regression

Linear regression using package mice

complete(
mice(obj, method = "norm.predict")
)

Stochastic Regression using package mice

complete(
mice(obj, method = "norm.nob")
)

Practice R - Machine Learning

Decision Tree

complete(
mice(obj, method = "cart")
)

Random Forest

complete(
mice(obj, method = "rf")
)


Practice R: Multiple imputation


Video: Dealing With Missing Data - Multiple Imputation

Practice R - Multiple imputation

Some parameters and functions used with the multiple imputation method

For the function mice():

m = 5: the number of imputations (m > 1 gives multiple imputation)
method: the imputation method to use
predictorMatrix: the predictor matrix; create it before running mice()
Use the quickpred() function to quickly generate a predictor matrix from the variables that are highly correlated with the missing variable
Practice R - Multiple imputation

Some parameters and functions used with the multiple imputation method

For the function complete():

action = 1L: returns the dataset from the 1st imputation

Pooling analysis results:

pool(): pools results with Rubin's (1987) method
pool.syn(): pools results with Reiter's (2003) method
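Putting the pieces together, a sketch of the full multiple-imputation workflow on the built-in airquality dataset (requires the mice package; method = "pmm", mice's default predictive mean matching, and the lm() analysis model are illustrative assumptions, not part of the slides):

```r
library(mice)

# Predictor matrix from variables correlated with the missing ones
pred <- quickpred(airquality)

# m = 5 imputations -> multiple imputation
imp <- mice(airquality, m = 5, method = "pmm",
            predictorMatrix = pred, seed = 1, printFlag = FALSE)

complete(imp, action = 1L)   # the dataset from the 1st imputation

# Fit the analysis model on each imputed dataset, then pool (Rubin, 1987)
fit <- with(imp, lm(Ozone ~ Solar.R + Wind))
summary(pool(fit))
```

The pooled estimates combine the five fits and their between-imputation variance, which is what single imputation cannot provide.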


Summary - which method to choose? 🧐


Homework
Run the commands above on the dataset application_missing2.csv:

Identify the missing data mechanism
Handle the missing data with an appropriate method
Submit the exercise results as an RPubs link

Reference
1. https://rpubs.com/nnthieu/301064
2. https://bozenne.github.io/doc/2019-10-22-multipleImputation/post-multipleImputation.pdf

