Chapter 3
SciEco
Agenda
1. Raw Data Problems
2. Introduction to Missing Data
3. Mechanisms of Missingness
4. Methods for Handling Missing Data
Data Cleaning
Handling Missing data
Handling Noisy data
Identify Outliers
1. Missing data
2. Mechanisms of Missingness
MCAR - Missing Completely At Random
MAR - Missing At Random
MNAR - Missing Not At Random
MCAR - Missing Completely At Random
MAR - Missing At Random
MNAR - Missing Not At Random
MAR, MCAR, MNAR
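The three mechanisms can be illustrated with a small simulation. This is a minimal sketch in base R on made-up data: the variables `age` and `income`, the sample size, and the coefficients in the missingness probabilities are all assumptions chosen for illustration only.

```r
set.seed(42)
n <- 1000
age    <- round(runif(n, 20, 70))
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of being missing depends only on the observed variable age
p_mar <- plogis(-4 + 0.06 * age)
income_mar <- ifelse(runif(n) < p_mar, NA, income)

# MNAR: the chance of being missing depends on the (unobserved) income itself
p_mnar <- plogis(-8 + 0.15 * income)
income_mnar <- ifelse(runif(n) < p_mnar, NA, income)

# Compare the mean of the observed values under each mechanism
means <- sapply(
  list(full = income, mcar = income_mcar, mar = income_mar, mnar = income_mnar),
  mean, na.rm = TRUE
)
print(round(means, 2))
```

Under MCAR the observed mean stays close to the full-data mean, while under MNAR the observed mean is pulled down because high incomes are more likely to be missing.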
R Practice - Identify the Missingness Mechanism
is.na(data)              # TRUE where a value is missing
sum(is.na(data$col))     # number of missing values in one column
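The same idea extends to a per-column overview. The sketch below uses only base R; the data frame `data` and its columns are invented for illustration.

```r
# Small made-up data frame with missing values in every column
data <- data.frame(
  age    = c(25, NA, 41, 33, NA),
  income = c(50, 62, NA, 48, 55),
  city   = c("A", "B", "B", NA, "A")
)

n_missing     <- colSums(is.na(data))                 # count per column
pct_missing   <- round(100 * colMeans(is.na(data)), 1) # percentage per column
complete_rows <- sum(complete.cases(data))            # rows with no NA at all

print(n_missing)
print(pct_missing)
print(complete_rows)
```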
R practice - Using visdat package
vis_dat(obj)
vis_miss(obj)
R practice - Using aggr() in VIM package
aggr(obj, ...options)
summary(aggr(obj, ...option))
R practice - Using matrixplot() in VIM package
matrixplot(obj)
R practice - Little's MCAR test
mcar_test(obj) # naniar
3.1. Remove Missing Data
Pros: Easy to do
Cons:
Loses information when the number of missing values is large
Increases the variance of estimates, because the sample becomes smaller
Introduces bias when data are missing systematically (not MCAR)
Listwise Deletion
Delete Column
Pairwise Deletion
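The difference between listwise and pairwise deletion can be seen with a correlation matrix. This is a minimal base R sketch on an invented data frame `df`. Listwise deletion keeps only rows with no missing value, while pairwise deletion uses, for each pair of variables, every row where both are observed.

```r
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, NA, 5.8, 8.3, 9.9, 12.2),
  z = c(1.5, 2.7, NA, NA, 9.4, 10.1)
)

# Listwise deletion: drop every row that contains at least one NA
listwise <- na.omit(df)
cor_listwise <- cor(df, use = "complete.obs")

# Pairwise deletion: each correlation uses all rows where both variables exist
cor_pairwise <- cor(df, use = "pairwise.complete.obs")

# Number of rows actually used for each pair of variables
observed <- 1 * (!is.na(df))
n_pairs  <- crossprod(observed)

print(nrow(listwise))
print(n_pairs)
```

Pairwise deletion keeps more data, but each entry of the matrix is then based on a different subset of rows.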
R Practice
na.omit(object)   # base R: drop every row that contains an NA
or:
drop_na(object)   # tidyr: same result
R Practice
drop_na(object, col)   # tidyr: drop rows where col is NA
object[, colSums(is.na(object)) == 0, drop = FALSE]   # keep only columns with no NA
3.2. Imputation Method
⚠️ Note
Do not impute arbitrarily: the choice of method affects all later results
Imputation methods:
Replace with a single value (fixed value, mean, median, etc.)
Predict the missing value with a regression model
Other machine learning methods: Decision Tree, KNN, ...
Single imputation
Multiple imputation
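In multiple imputation the analysis is run on each of the m completed datasets and the results are combined with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The helper `pool_rubin()` below is a hypothetical name written for this sketch, and the input numbers are invented.

```r
# Combine m estimates and their variances (squared standard errors)
pool_rubin <- function(estimates, variances) {
  m     <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, within = u_bar, between = b,
    total_var = total, se = sqrt(total))
}

# Made-up results from m = 3 imputed datasets
res <- pool_rubin(estimates = c(1.0, 1.2, 0.8),
                  variances = c(0.04, 0.05, 0.03))
print(round(res, 4))
```

The pooled standard error is larger than the standard error from any single completed dataset, which reflects the extra uncertainty caused by the missing values.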
Practice R
Replace with fixed value (mean, mode, median)
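A minimal base R sketch of this idea follows. The helper `impute_fixed()` is a hypothetical name introduced here, and the example vectors are invented. It also shows a known side effect: filling every gap with the same value shrinks the variance of the variable.

```r
# Fill NAs in x with the mean, median, or mode of the observed values
impute_fixed <- function(x, stat = c("mean", "median", "mode")) {
  stat <- match.arg(stat)
  observed <- x[!is.na(x)]
  fill <- switch(stat,
    mean   = mean(observed),
    median = median(observed),
    mode   = {
      values <- unique(observed)
      values[which.max(tabulate(match(observed, values)))]
    }
  )
  x[is.na(x)] <- fill
  x
}

income <- c(10, 12, NA, 14, 100, NA)
city   <- c("A", "B", "B", NA)

income_mean   <- impute_fixed(income, "mean")    # sensitive to the outlier 100
income_median <- impute_fixed(income, "median")  # robust to the outlier
city_mode     <- impute_fixed(city, "mode")      # works for categorical data

print(income_mean)
print(income_median)
print(city_mode)
print(c(var_observed = var(income, na.rm = TRUE), var_imputed = var(income_mean)))
```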
Replace with random value
1. Hot-deck
Randomly choose the value of another, similar observation in the
same dataset (population or sample) to fill the gap
2. Cold-deck
Randomly select a value from another dataset that has the same
data attributes
Features
Commonly used in survey research
The original distribution of the data is not guaranteed to be preserved
Often preferable to replacing with a fixed value, because the imputed
values are real observed values and the data stay "natural"
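A simple random hot-deck can be sketched in a few lines of base R. The helper `hot_deck_random()` is a hypothetical name for this illustration and the data are invented. Restricting donors to observations in the same group is one simple way to approximate "similar" observations.

```r
# Replace each NA with a randomly drawn observed value (a "donor")
hot_deck_random <- function(x) {
  missing <- is.na(x)
  donors  <- x[!missing]
  if (length(donors) == 0 || !any(missing)) return(x)  # nothing to do
  picks <- sample.int(length(donors), sum(missing), replace = TRUE)
  x[missing] <- donors[picks]
  x
}

set.seed(1)
score <- c(3, 5, NA, 4, NA, 5, 2, NA)
group <- c("a", "a", "a", "b", "b", "b", "b", "b")

# Donors taken from the whole sample
filled <- hot_deck_random(score)

# Donors taken only from the same group
filled_by_group <- ave(score, group, FUN = hot_deck_random)

print(filled)
print(filled_by_group)
```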
Linear Regression
2. Stochastic Regression
Pros:
Does not break the correlation between variables
Preserves the distribution of the data
Cons:
The predicted value can fall outside the plausible range
Limited when applied to heteroscedastic data
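The difference between deterministic and stochastic regression imputation can be sketched in base R. The simulated data, the number of deleted values, and the model `y ~ x` are assumptions made for this illustration. The stochastic version adds a random residual, drawn with the residual standard deviation of the fitted model, to each prediction.

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 2)

# Delete 60 values of y at random
df <- data.frame(x = x, y = y)
df$y[sample(n, 60)] <- NA
miss <- is.na(df$y)

# Fit on the complete cases (lm drops rows with NA by default)
m    <- lm(y ~ x, data = df)
pred <- predict(m, newdata = df[miss, , drop = FALSE])

# Deterministic regression imputation: use the prediction as is
y_det <- df$y
y_det[miss] <- pred

# Stochastic regression imputation: prediction plus random noise
y_sto <- df$y
y_sto[miss] <- pred + rnorm(sum(miss), mean = 0, sd = sigma(m))

print(c(sd_observed      = sd(df$y, na.rm = TRUE),
        sd_deterministic = sd(y_det),
        sd_stochastic    = sd(y_sto)))
```

With deterministic imputation all imputed points lie exactly on the regression line, so the spread of `y` is understated. The added noise is what lets the stochastic version keep a spread closer to that of the observed data.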
Compare 3 methods
Practice R - Replace with fixed value
# requires dplyr
mutate(
  obj,
  col = case_when(
    is.na(col) ~ mean(col, na.rm = TRUE),  # fill NA with the column mean
    TRUE ~ col                             # keep observed values
  )
)
Practice R - Replace with random value
Hot-deck
Option 1: use package VIM
hotdeck(obj)
Practice R - Simple Linear Regression
Fill in the missing values with the predicted values from the regression model:
# m is a regression model for x fitted beforehand (assumed here, e.g. with lm())
obj %>%
  mutate(
    new_x = case_when(
      is.na(x) ~ predict(m, .),
      TRUE ~ x
    )
  )
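Practice R - Simple Linear Regression
# Deterministic regression: impute with the predicted value
complete(
  mice(obj, method = "norm.predict")
)
# Stochastic regression: predicted value plus random noise
complete(
  mice(obj, method = "norm.nob")
)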
Practice R - Machine Learning
Decision Tree
complete(
  mice(obj, method = "cart")
)
Random Forest
complete(
  mice(obj, method = "rf")
)
Practice R - Multiple imputation
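A typical multiple imputation workflow with the mice package has three steps: impute, analyse each completed dataset, and pool. The sketch below assumes that mice is installed and uses the `nhanes` example data shipped with it. The choice of m = 5, the `pmm` method, and the model `bmi ~ age + chl` are illustrative assumptions, not recommendations.

```r
library(mice)

# Step 1: create m = 5 imputed datasets
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Step 2: fit the same model on each completed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Step 3: pool the five sets of estimates with Rubin's rules
pooled <- pool(fit)
print(summary(pooled))

# A single completed dataset can still be extracted for inspection
first_completed <- complete(imp, 1)
```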
Summary - which method to choose?
Homework
Apply the commands above to the dataset application_missing2.csv
Reference
1. https://rpubs.com/nnthieu/301064
2. https://bozenne.github.io/doc/2019-10-22-multipleImputation/post-multipleImputation.pdf