Project Stat
Omitted variable bias is the bias in the OLS estimator that arises when the regressor, X,
is correlated with an omitted variable. For omitted variable bias to occur, two conditions must be
fulfilled: (1) the omitted variable is correlated with the included regressor X, and (2) the omitted
variable is a determinant of the dependent variable Y.
If the car's maintenance history is omitted from the analysis, it can lead to biased and inconsistent
estimates of the other coefficients in the model, particularly those related to variables correlated with
maintenance history, such as car age. Because maintenance history is correlated with both the dependent
variable (car resale value) and the included predictors, omitting it causes the model to overestimate or
underestimate the impact of those predictors. One artificial instrumental variable we can use is the
Number of Service Visits, which represents the frequency of a car's service visits; a higher number of
service visits might be indicative of a more meticulous maintenance history. A simulation of this bias is
sketched below.
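As a minimal sketch of this problem, the following Python snippet (hypothetical variable names and made-up parameters, not the project data) simulates a setting where maintenance quality is correlated with car age and affects price, then fits the model with and without maintenance to show the bias on the age coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process: older cars tend to have worse
# maintenance, and both age and maintenance affect resale price.
age = rng.uniform(1, 15, n)
maintenance = 10 - 0.5 * age + rng.normal(0, 1, n)    # correlated with age
price = 30_000 - 1_000 * age + 800 * maintenance + rng.normal(0, 2_000, n)

# Full model includes maintenance: the age coefficient is close to -1000.
X_full = sm.add_constant(np.column_stack([age, maintenance]))
full = sm.OLS(price, X_full).fit()

# Short model omits maintenance: its effect loads onto the correlated
# age regressor, biasing the age coefficient toward roughly -1400.
short = sm.OLS(price, sm.add_constant(age)).fit()

print("age coefficient, full model: ", full.params[1])
print("age coefficient, short model:", short.params[1])
```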
2. Heteroskedasticity Problem:
Discuss how heteroskedasticity may affect the accuracy of the regression model. Focus on the
relationship between car mileage and car price, considering the potential presence of
heteroskedasticity. Explain how heteroskedasticity can affect the model's assumptions and inferences.
Heteroskedasticity refers to the situation in which the variability of the errors (residuals) in a regression
model is not constant across all levels of the independent variable(s). In the context of car mileage and car
price, heteroskedasticity might occur if the variance of the errors in predicting car prices is not consistent
across all levels of mileage; for example, there might be more variability in prediction errors for high-
mileage cars than for low-mileage cars. This violates the assumption of homoskedasticity, which holds
that the error variance is constant across all levels of the independent variable(s). Under
heteroskedasticity the OLS coefficient estimates remain unbiased, but the usual standard errors of those
estimates are biased. Standard errors are used to calculate confidence intervals and hypothesis tests, so if
they are incorrect due to heteroskedasticity, inferences about the statistical significance of the variables
become inaccurate. A common diagnostic and remedy are sketched below.
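As a minimal, hypothetical sketch (simulated data; none of these names come from the project dataset), the snippet below generates mileage-dependent error variance, detects it with the Breusch-Pagan test, and refits with heteroskedasticity-robust (HC1) standard errors, which leave the coefficients unchanged but correct the inference.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical setup: error spread grows with mileage (heteroskedasticity).
mileage = rng.uniform(5_000, 150_000, n)
noise = rng.normal(0, 0.02 * mileage)              # scale rises with mileage
price = 25_000 - 0.10 * mileage + noise

X = sm.add_constant(mileage)
ols = sm.OLS(price, X).fit()

# Breusch-Pagan test: a small p-value signals heteroskedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3g}")

# Refit with robust (HC1) standard errors: same coefficients, valid inference.
robust = sm.OLS(price, X).fit(cov_type="HC1")
print("classical SE on mileage:", ols.bse[1])
print("robust    SE on mileage:", robust.bse[1])
```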
Polynomial terms involve raising independent variables to powers higher than one, such as squares (x^2),
cubes (x^3), and so on. In the context of a regression model, these terms are added to capture non-linear
relationships between the independent and dependent variables. In our case, the model already fits the
data essentially perfectly with a linear relationship, so there is no need to capture a non-linear one.
Generally, polynomial terms give the model the flexibility to fit more complex curves, which can improve
its ability to represent relationships that a linear model cannot adequately capture. However, while
polynomial terms can enhance model fit, they carry a risk of overfitting: the model captures noise in the
data rather than the true underlying patterns, which leads to poor generalization on new, unseen data. A
sketch of adding and testing a quadratic term follows.
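As a minimal sketch under a similarly hypothetical setup (simulated, truly linear data), the snippet below adds a squared mileage term and checks its t-test; an insignificant quadratic coefficient and a nearly unchanged R-squared support keeping the linear specification.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical setup: price is truly linear in mileage (thousands of miles).
mileage_k = rng.uniform(5, 150, n)
price = 25_000 - 100 * mileage_k + rng.normal(0, 1_500, n)

# Linear specification vs. one with an added quadratic term.
X_lin = sm.add_constant(mileage_k)
X_quad = sm.add_constant(np.column_stack([mileage_k, mileage_k**2]))

lin = sm.OLS(price, X_lin).fit()
quad = sm.OLS(price, X_quad).fit()

# With a truly linear relationship, the squared term should be insignificant
# and R-squared should barely move.
print(f"quadratic term p-value: {quad.pvalues[2]:.3f}")
print(f"R^2 linear: {lin.rsquared:.4f}   R^2 quadratic: {quad.rsquared:.4f}")
```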