Corollary: if we are sure there is nothing in the error term correlated with the explanatory variable, we can interpret the coefficient causally.
Exploratory data analysis:
o library(GGally) : ggpairs
o ggpairs(data_set)
o for selecting variables in ggpairs: data_set %>% select(var1, var2, var3, …) %>% ggpairs()
o library(mosaic)
o inspect(data_set)
Editing data:
o library(dplyr) : select
o new_df <- select(data_set, var1, var2, var3, …)
o adding a new column: mutate(data_set, new_var = var * 2) (or any other function)
o filtering data: filter(data_set, filtering conditions)
Graphing to see the relation:
o library(ggplot2)
o scatterplot: ggplot(data_set, aes(x = x_var, y = y_var)) + geom_point()
Classic linear regression:
o model1 <- lm(y_var ~ x_var, data = data_set)
o summary(model1); use the multiple R-squared to report.
To compare models:
o library(stargazer)
o stargazer(model1, model2, type = 'text', report = 'vc*sp')
Omitted Variable Bias:
o only occurs if x2 and y are related AND x2 and x1 are related
o the sign of the bias is the product of the sign of the x2–x1 correlation and the sign of the x2–y correlation
Interpreting categorical variables:
o all of the coefficients R shows are interpreted against the reference category, the one that is not shown.
Analyzing models without the effect of heteroskedasticity:
o make your model (model1)
o use summary to get R-squared: summary(model1)
o use a coefficient test with robust standard errors:
  library(sandwich)
  library(lmtest)
  coeftest(model1, vcov = vcovHC, type = 'HC1')
Interpreting interaction terms (class 6):
o if one of the terms is a dummy variable, work through scenarios or eyeball it.
o to see the result of different combinations use:
  library(margins)
  margins(model1, variables = 'var1', at = list(dummy = c(0, 1)))
Quadratic models:
o use when the ggpairs output implies a quadratic relation between variables
o both the linear and quadratic terms being significant means that you need a quadratic specification
o model1 <- lm(y_var ~ x_var + I(x_var^2), data = data_set)
o interpreting the coefficient of the quadratic term: take the partial derivative to see where the change in y with respect to x_var slows down.
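The quadratic recipe above can be sketched on R's built-in mtcars data (the variables mpg and hp are just a convenient illustration, not from the course):

```r
# Quadratic fit: fuel efficiency (mpg) as a function of horsepower (hp)
model_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)
summary(model_quad)  # check whether both hp and I(hp^2) are significant

# Partial derivative: d(mpg)/d(hp) = b1 + 2 * b2 * hp
# The slope flattens out (or reverses) at hp = -b1 / (2 * b2)
b <- coef(model_quad)
turning_point <- -b["hp"] / (2 * b["I(hp^2)"])
turning_point
```

Note that the squaring must happen inside I(): writing I(x_var)^2 outside the wrapper does not add a quadratic term to the formula.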
Logarithmic models:
o use to interpret effects in percentage terms, and to make large outliers less problematic.
o effective when a variable is significantly right-skewed but strictly positive.
o use log models only if the variable's values are greater than zero.
o log-log models: both changes are in percentages. Interpretation: a 1% change in x_var is associated with a coefficient% change in y_var.
o log-linear models: interpretation: each additional unit of x_var is associated with a coefficient*100 % change in y_var.
o linear-log models: interpretation: a 1% change in x_var is associated with a coefficient*0.01 change in y_var.
Logistic Regression:
o use when the y-variable is binary or categorical
o model1 <- glm(y_var ~ x_var, family = binomial(link = 'logit'), data = data_set)
o summary(model1)
o interpretation: exp(coef(model1)) gives odds ratios
  for an odds ratio < 1: increasing x_var by one unit makes the odds of y_var (1 - exp(coef))*100 % lower, on average, than they would be at the same x_var, cet. par.
  for an odds ratio > 1: increasing x_var by one unit makes the odds of y_var (exp(coef) - 1)*100 % higher than they would have been if x_var did not increase, cet. par.
o make predictions:
  data_set <- data_set %>% mutate(predictions = predict(model1, data_set, type = 'response'))
o create scenarios:
  library(tidyr)
  scenarios <- expand_grid(x_var1 = seq(val1, val2, val3), x_var2 = seq(val1, val2))
  scenarios <- scenarios %>% mutate(prediction = predict(model1, scenarios, type = 'response'))
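The logistic workflow above, from model to scenario predictions, can be sketched end to end on the built-in mtcars data (am is a 0/1 transmission dummy; the variable choice is illustrative only):

```r
library(dplyr)
library(tidyr)

# Binary outcome: manual transmission (am = 1) as a function of car weight
model_logit <- glm(am ~ wt, family = binomial(link = 'logit'), data = mtcars)
summary(model_logit)

# Odds ratios: exp(coef) for wt is below 1, so each extra 1000 lbs of
# weight lowers the odds of a manual transmission, cet. par.
exp(coef(model_logit))

# Scenarios: predicted probability of a manual transmission over a weight grid
scenarios <- expand_grid(wt = seq(1.5, 5.5, by = 0.5))
scenarios <- scenarios %>%
  mutate(prediction = predict(model_logit, scenarios, type = 'response'))
scenarios
```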
Fixed Effects Models:
o use for panel data, when there are entity-level effects you want to hold constant.
o EDA: ggplot(data_set, aes(x = x_var, y = y_var, color = categorical_var)) + geom_line()
o some variables do not change instantly (e.g. police numbers); get lagged data for those:
  library(dplyr)
  data_set <- data_set %>%
    group_by(categorical_var) %>%
    mutate(lag_var = dplyr::lag(var, order_by = ordering_var)) %>%   # ordering_var is e.g. time
    ungroup()
o pooled model: a regression with every data point in it, no grouping or filtering; function: lm
o fixed effects models are used to eliminate differences between units over the time of the study, such as differences in average income across states
o creating the model holding one variable constant:
  library(plm)
  model1 <- plm(y_var ~ x_var, data = data_set, index = 'categorical_var', model = 'within')
o creating the model holding two variables constant:
  model1 <- plm(y_var ~ x_var, data = data_set, index = c('entity_var', 'time_var'), model = 'within', effect = 'twoways')
o checking for time variation and individual variation: pvar(data_set, index = c('entity_var', 'time_var'))
o interpreting fixed effects models: coeftest(model1, vcov = vcovHC, type = 'HC1')
Dif-in-Dif:
o key: there has to be a treatment group and a control group selected at random
o the parallel trends assumption must hold
o model1 <- lm(y_var ~ x_var * treatment_var + control, data = data_set)
o interpreting dif-in-dif models: coeftest(model1, vcov = vcovHC, type = 'HC1')
Regression Discontinuity:
o key: there is a threshold that determines whether you are in the treatment group or not.
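The fixed effects steps above can be sketched on the Grunfeld investment panel that ships with plm (the dataset and variables are an illustration, not from the course):

```r
library(plm)
library(lmtest)
library(sandwich)

# Classic firm-year panel: investment (inv) vs firm value and capital stock
data("Grunfeld", package = "plm")

# Entity fixed effects only (each firm gets its own intercept)
fe_entity <- plm(inv ~ value + capital, data = Grunfeld,
                 index = c("firm", "year"), model = "within")

# Entity AND time fixed effects
fe_twoway <- plm(inv ~ value + capital, data = Grunfeld,
                 index = c("firm", "year"), model = "within", effect = "twoways")

# Check the panel has both individual and time variation
pvar(Grunfeld, index = c("firm", "year"))

# Robust inference, as in the notes
coeftest(fe_entity, vcov = vcovHC, type = "HC1")
```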
o the threshold variable is called the assignment variable
o EDA: make a scatterplot, color the treatment group, add lines of best fit, and add a vertical line at the cutoff:
  data_set %>% ggplot(aes(x = x_var, y = y_var, color = (assignment condition, e.g. age < 21))) + geom_point() + geom_smooth(method = 'lm', se = FALSE) + geom_vline(xintercept = cutoff_value)
  if there is a shift (a discontinuity), it is an indication that this model is a good match
o making the model:
  without centering the variable: model1 <- lm(y_var ~ treatment_var + control, data = data_set)
  with centering the variable: model1 <- lm(y_var ~ I(x_var - cutoff_value) * treatment_var + control, data = data_set)
o interpreting models: coeftest(model1, vcov = vcovHC, type = 'HC1')
  in both cases you interpret the treatment variable, not the interaction term
Instrumental variable (2SLS):
o an instrumental variable needs to be correlated with x_var but must affect y_var only through x_var (i.e. be uncorrelated with the error term)
o use it when there is a lottery or other random treatment assignment but no guarantee the treatment group actually got the treatment (non-compliance); it recovers the effect of the treatment on y_var despite that.
o creating the model:
  library(ivreg)
  model1 <- ivreg::ivreg(y_var ~ x_var + control1 + control2 | instrument_var + control1 + control2, data = data_set)
o interpreting results: summary(model1, vcov. = vcovHC)
o a good instrumental variable:
  one instrument: weak instruments test; a p-value near 0 means the instrument is not weak
  more than one instrument: Sargan test; a small p-value means at least one of the instruments is not exogenous
  R-squared doesn't matter because it becomes unreliable in 2SLS
Graphing & Visualization:
o graphing for logistic scenarios:
  scenarios %>% ggplot(aes(x = x_var, y = prediction, color = as.factor(categorical_var))) + geom_point() + geom_line() + facet_wrap(~x_var, ncol = 5)
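The 2SLS logic above can be sketched on simulated data with non-compliance; every variable here (z, d, y, u) is invented for the illustration:

```r
library(ivreg)
library(sandwich)

set.seed(1)
n <- 1000
z <- rbinom(n, 1, 0.5)                      # random lottery assignment (the instrument)
u <- rnorm(n)                               # unobserved confounder
d <- rbinom(n, 1, plogis(-1 + 2 * z + u))   # actual take-up: depends on z AND on u
y <- 1 + 2 * d + u + rnorm(n)               # true treatment effect is 2

ols  <- lm(y ~ d)        # biased upward: d is correlated with u
tsls <- ivreg(y ~ d | z) # instrumenting d with the lottery z

summary(tsls, vcov. = vcovHC)
coef(tsls)["d"]          # should land near the true effect of 2
```

The instrument z is valid here by construction: it shifts take-up d but enters y only through d, which is exactly the exogeneity condition stated above.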