Data Mining Project Presentation - JAG
01 Introduction: Dataset; Problem of Interest; Concerns
02 Data Cleaning: Missing data; Variable Distribution assessment; Dummy variables transformation
03 Data Mining Techniques/Algorithms: Variable filtering; Linear regression Techniques; Non-linear Regression
04 Conclusion: Model comparison; Interpretation; Takeaways - Application
Introduction
Our data describes houses in Ames, Iowa: 80 independent variables plus the house sale price (www.kaggle.com).
Data Cleaning
Variable Filtering
● Pros:
○ Simple fitting procedure
○ Gives a sparse model (feature selection)
○ Assesses all possible subsets of variables
○ Presents the best candidate for a least-squares model with q variables
● Cons:
○ Takes a long time to process large models; computationally expensive
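A minimal sketch of this exhaustive (best-subset) search, for illustration only. The slides' outputs (log(lambda), ntree/mtry) suggest the original work was done in R; these sketches use scikit-learn instead, and the names X_train / y_train for the cleaned, dummy-encoded Ames predictors and (log) sale price are assumptions, not objects from the project.

```python
# Best-subset selection sketch: try every q-variable least-squares model
# and keep the one with the lowest training RSS.
# X_train (pandas DataFrame of predictors) and y_train (log sale price)
# are assumed to exist -- hypothetical names.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X_train, y_train, q):
    best_rss, best_vars = np.inf, None
    for subset in combinations(X_train.columns, q):
        cols = list(subset)
        fit = LinearRegression().fit(X_train[cols], y_train)
        rss = float(np.sum((y_train - fit.predict(X_train[cols])) ** 2))
        if rss < best_rss:
            best_rss, best_vars = rss, cols
    return best_vars, best_rss

# The loop runs C(p, q) times; with p = 80 predictors this is exactly the
# "computationally expensive" drawback listed above.
```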
Principal Component Regression
Creates new components from linear combinations of original variables such
that they capture as much variability in the predictors as possible
● Pros:
○ Reduces data dimension
○ When the number of components is small,
overfitting can be avoided
● Cons:
○ Does not yield feature selection
○ The first M principal components, though they may best explain the predictors, are not necessarily the best at predicting the response
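A minimal PCR sketch under the same assumptions (X_train/X_test/y_train/y_test are hypothetical names; the number of components would normally be chosen by cross-validation).

```python
# Principal Component Regression sketch: standardize, project onto the
# first M principal components, then fit ordinary least squares.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# M = 10 is an arbitrary illustration value, not the project's choice.
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
pcr.fit(X_train, y_train)
print("PCR test MSE:", mean_squared_error(y_test, pcr.predict(X_test)))
```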
Partial Least Squares (PLS) Regression
Like PCR, but the components are chosen using the response as well as the predictors (supervised dimension reduction).
● Pros:
○ All the pros of PCR
○ The supervised dimension reduction can reduce bias
● Cons:
○ Does not yield feature selection
○ The supervised dimension reduction can increase variance => often does not perform much better than PCR
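A minimal PLS sketch under the same assumptions (variable names and n_components=10 are hypothetical choices).

```python
# Partial Least Squares sketch: supervised dimension reduction, i.e. the
# components are built to covary with the response, then regressed on.
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

pls = PLSRegression(n_components=10, scale=True)  # 10 is an arbitrary choice
pls.fit(X_train, y_train)
print("PLS test MSE:", mean_squared_error(y_test, pls.predict(X_test).ravel()))
```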
Lasso
● Cons:
○ Interpretability - why does it select certain variables and not others?
○ Complicated model-fitting procedure (hard to do without statistical software)
Best log(lambda) = -5.978623
23 predictors in the best model
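A minimal lasso sketch with a cross-validated penalty. The slides' log(lambda) = -5.98 reads like output from R's glmnet, whose lambda scaling differs from scikit-learn's alpha; the variable names are assumptions.

```python
# Lasso sketch: the L1 penalty shrinks some coefficients exactly to zero,
# which is where the "23 predictors in the best model" comes from.
import numpy as np
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=10).fit(X_train, y_train)   # picks the penalty by 10-fold CV
print("best log(lambda):", np.log(lasso.alpha_))
print("predictors kept:", int(np.sum(lasso.coef_ != 0)))
```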
Ridge
Test MSE: 0.0191, 28 predictors
Pros
● Can create flexible models that do not rely on
hierarchies, as opposed to forward and
backward subset selection
● Gives better performance than Lasso if all
variables are significant
Cons
● Does not eliminate any variables (as opposed
to Lasso)
● Can also lead to high variance due to no
variable reduction (high flexibility)
Best log(lambda) = -3.35042
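A minimal ridge sketch under the same assumptions (the penalty grid and variable names are illustrative, not the project's).

```python
# Ridge sketch: the L2 penalty shrinks coefficients toward zero but never to
# exactly zero, so all 28 predictors stay in the model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

alphas = np.logspace(-4, 2, 100)                      # candidate penalty grid
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
print("best log(lambda):", np.log(ridge.alpha_))
print("ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
```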
Non-linear Regression Techniques
● K-Nearest Neighbors
● Regression Tree
● Bagging & Random Forest
● Boosting
K-nearest neighbors
Test MSE: 0.0264, k=16
Pros
● Non-parametric, more flexible
● Offers a more accurate model if the true shape
is non-linear
● Simple fitting process
Cons
● Rarely outperforms parametric approaches
● Does not work well with high dimensions
● Difficult to identify importance of variables
● Sensitive to noisy data, missing values and
outliers
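A minimal KNN regression sketch (k=16 matches the slide; the scaling step and variable names are assumptions).

```python
# k-nearest-neighbours regression sketch: predict each house's price as the
# average of its 16 nearest neighbours in (standardized) predictor space.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=16))
knn.fit(X_train, y_train)
print("KNN test MSE:", mean_squared_error(y_test, knn.predict(X_test)))
```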
Regression Tree
Test MSE: 0.0443
Pros
● Interpretability & visual representation
● Accommodates numerical and categorical features
● Little data preprocessing required
● Feature selection happens automatically
Cons
● Inflexible: the model cannot be adjusted dynamically
● Unstable
● Prone to overfitting, which can be mitigated by:
○ Limiting tree depth
○ Requiring a minimal # of objects in leaves
○ Tree pruning
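A minimal regression-tree sketch showing the three overfitting controls listed above (all parameter values and variable names are illustrative assumptions).

```python
# Regression tree sketch with the overfitting controls from the slide:
# depth limit, minimum leaf size, and cost-complexity pruning.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree = DecisionTreeRegressor(max_depth=6,          # limit tree depth
                             min_samples_leaf=10,  # minimal # of objects in leaves
                             ccp_alpha=1e-4)       # cost-complexity (tree) pruning
tree.fit(X_train, y_train)
print("tree test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```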
Bagging and Random Forest
Test MSE: 0.0194 -- ntree=500, mtry=28 (bagging)
Test MSE: 0.0207 -- ntree=25, mtry=28 (bagging)
Test MSE: 0.0200 -- ntree=25, mtry=20 (RF)
Pros
● Impressive versatility
● Parallelizable
● Robust to outliers and nonlinear data
● Low bias, moderate variance
Cons
● Complexity
● High computational resource requirement
● Can overfit -- mitigated by tuning hyperparameters
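A minimal sketch contrasting bagging and a random forest; the ntree/mtry values mirror the slide (scikit-learn calls them n_estimators/max_features), and the variable names are assumptions.

```python
# Bagging is a random forest with mtry = p (every predictor considered at
# every split); the random forest below samples 20 of the predictors instead.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

bag = RandomForestRegressor(n_estimators=500, max_features=None,  # ntree=500, mtry=p
                            n_jobs=-1, random_state=1)
rf = RandomForestRegressor(n_estimators=25, max_features=20,      # ntree=25, mtry=20
                           n_jobs=-1, random_state=1)
for name, model in [("bagging", bag), ("random forest", rf)]:
    model.fit(X_train, y_train)
    print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```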
Boosting
Pros
● Easy to read and interpret
● Resilient method that curbs over-fitting
Cons
● Sensitive to outliers
● Difficult to scale up
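A minimal gradient boosting sketch (the slides' best method); the hyperparameter values and variable names are illustrative assumptions, not the ones actually used.

```python
# Gradient boosting sketch: fit shallow trees sequentially, each one
# correcting the residuals of the current ensemble.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, random_state=1)
gbm.fit(X_train, y_train)
print("boosting test MSE:", mean_squared_error(y_test, gbm.predict(X_test)))
```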
Model Comparison - Test MSE
Most important variables from the Boosting model
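A short sketch of how a variable-importance ranking like this one can be read off a fitted boosting model (gbm and X_train are the hypothetical objects from the sketches above).

```python
# Rank predictors by the fitted boosting model's importance scores.
import pandas as pd

importance = pd.Series(gbm.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))
```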
Conclusions
● Best method: Gradient Boosting
● Performance: prediction accuracy of 86% on average
● Most important variables
○ OverallQual: Overall material and finish quality
○ GrLivArea: Above grade (ground) living area square feet
○ TotalBsmtSF: Total square feet of basement area
○ YearBuilt: Original construction date
● Surprises: no location indicator among the top variables; high importance of garage-related features
● Improvement: handle the high dimensionality directly next time, rather than relying on variable filtering
Thanks!