Data Mining Project Presentation - JAG
01 Introduction: Dataset; Problem of Interest; Concerns
02 Data Cleaning: Missing data; Variable Distribution assessment; Dummy variables transformation
03 Data Mining Techniques/Algorithms: Variable filtering; Linear regression Techniques; Non-linear Regression
04 Conclusion: Model comparison; Interpretation; Takeaways - Application
Introduction
Our data describes houses in Ames, Iowa: 80 independent variables plus the house sale price (www.kaggle.com).
Data Cleaning
Variable Filtering
● Pros:
○ Simple fitting procedure
○ Gives a sparse model (feature selection)
○ Assesses all possible subsets of variables
○ Presents the best candidate for a least-squares model with q variables
● Cons:
○ Takes a long time to process large models; computationally expensive
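A minimal sketch of this exhaustive (best-subset) search, for illustration only. The slides' outputs (log(lambda), ntree/mtry) suggest the original work was done in R; these sketches use scikit-learn instead, and the names X_train / y_train for the cleaned, dummy-encoded Ames predictors and (log) sale price are assumptions, not objects from the project.

```python
# Best-subset selection sketch: try every q-variable least-squares model
# and keep the one with the lowest training RSS.
# X_train (pandas DataFrame of predictors) and y_train (log sale price)
# are assumed to exist -- hypothetical names.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X_train, y_train, q):
    best_rss, best_vars = np.inf, None
    for subset in combinations(X_train.columns, q):
        cols = list(subset)
        fit = LinearRegression().fit(X_train[cols], y_train)
        rss = float(np.sum((y_train - fit.predict(X_train[cols])) ** 2))
        if rss < best_rss:
            best_rss, best_vars = rss, cols
    return best_vars, best_rss

# The loop runs C(p, q) times; with p = 80 predictors this is exactly the
# "computationally expensive" drawback listed above.
```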
Principal Component Regression
Creates new components from linear combinations of original variables such
that they capture as much variability in the predictors as possible
● Pros:
○ Reduces data dimension
○ When the number of components is small,
overfitting can be avoided
● Cons:
○ Does not yield feature selection
○ The first M principal components, though they may best explain the predictors, are not necessarily the best at predicting the response
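A minimal PCR sketch under the same assumptions (X_train/X_test/y_train/y_test are hypothetical names; the number of components would normally be chosen by cross-validation).

```python
# Principal Component Regression sketch: standardize, project onto the
# first M principal components, then fit ordinary least squares.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# M = 10 is an arbitrary illustration value, not the project's choice.
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
pcr.fit(X_train, y_train)
print("PCR test MSE:", mean_squared_error(y_test, pcr.predict(X_test)))
```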
Partial Least Squares (PLS) Regression
Like PCR, but the components are chosen using the response as well as the predictors (supervised dimension reduction).
● Pros:
○ All the pros of PCR
○ The supervised dimension reduction can reduce bias
● Cons:
○ Does not yield feature selection
○ The supervised dimension reduction can increase variance => often does not perform much better than PCR
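A minimal PLS sketch under the same assumptions (variable names and n_components=10 are hypothetical choices).

```python
# Partial Least Squares sketch: supervised dimension reduction, i.e. the
# components are built to covary with the response, then regressed on.
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

pls = PLSRegression(n_components=10, scale=True)  # 10 is an arbitrary choice
pls.fit(X_train, y_train)
print("PLS test MSE:", mean_squared_error(y_test, pls.predict(X_test).ravel()))
```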
Lasso
● Cons:
○ Interpretability - why does it select certain variables and not others?
○ Complicated model-fitting procedure (hard to do without statistical software)
Best log(lambda) = -5.978623
23 predictors in the best model
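A minimal lasso sketch with a cross-validated penalty. The slides' log(lambda) = -5.98 reads like output from R's glmnet, whose lambda scaling differs from scikit-learn's alpha; the variable names are assumptions.

```python
# Lasso sketch: the L1 penalty shrinks some coefficients exactly to zero,
# which is where the "23 predictors in the best model" comes from.
import numpy as np
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=10).fit(X_train, y_train)   # picks the penalty by 10-fold CV
print("best log(lambda):", np.log(lasso.alpha_))
print("predictors kept:", int(np.sum(lasso.coef_ != 0)))
```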
Ridge
Test MSE: 0.0191, 28 predictors
Pros
● Can create flexible models that do not rely on
hierarchies, as opposed to forward and
backward subset selection
● Gives better performance than Lasso if all
variables are significant
Cons
● Does not eliminate any variables (as opposed
to Lasso)
● Can also lead to high variance due to no
variable reduction (high flexibility)
Best log(lambda) = -3.35042
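A minimal ridge sketch under the same assumptions (the penalty grid and variable names are illustrative, not the project's).

```python
# Ridge sketch: the L2 penalty shrinks coefficients toward zero but never to
# exactly zero, so all 28 predictors stay in the model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

alphas = np.logspace(-4, 2, 100)                      # candidate penalty grid
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
print("best log(lambda):", np.log(ridge.alpha_))
print("ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
```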
Non-linear Regression Techniques
● K-Nearest Neighbors
● Regression Tree
● Bagging & Random Forest
● Boosting
K-nearest neighbors
Test MSE: 0.0264, k=16
Pros
● Non-parametric, more flexible
● Offers a more accurate model if the true shape
is non-linear
● Simple fitting process
Cons
● Rarely outperforms parametric approaches
● Does not work well with high dimensions
● Difficult to identify importance of variables
● Sensitive to noisy data, missing values and
outliers
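A minimal KNN regression sketch (k=16 matches the slide; the scaling step and variable names are assumptions).

```python
# k-nearest-neighbours regression sketch: predict each house's price as the
# average of its 16 nearest neighbours in (standardized) predictor space.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=16))
knn.fit(X_train, y_train)
print("KNN test MSE:", mean_squared_error(y_test, knn.predict(X_test)))
```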
Regression Tree
Test MSE: 0.0443
Pros
● Interpretability & visual representation
● Accommodates numerical and categorical features
● Little data preprocessing required
● Feature selection happens automatically
Cons
● Inflexible: the model cannot be adjusted dynamically
● Unstable
● Prone to overfitting, which can be mitigated by:
○ Limiting tree depth
○ Requiring a minimal # of objects in leaves
○ Tree pruning
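A minimal regression-tree sketch showing the three overfitting controls listed above (all parameter values and variable names are illustrative assumptions).

```python
# Regression tree sketch with the overfitting controls from the slide:
# depth limit, minimum leaf size, and cost-complexity pruning.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree = DecisionTreeRegressor(max_depth=6,          # limit tree depth
                             min_samples_leaf=10,  # minimal # of objects in leaves
                             ccp_alpha=1e-4)       # cost-complexity (tree) pruning
tree.fit(X_train, y_train)
print("tree test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```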
Bagging and Random Forest
Test MSE: 0.0194 -- ntree=500, mtry=28 (bagging)
Test MSE: 0.0207 -- ntree=25, mtry=28 (bagging)
Test MSE: 0.0200 -- ntree=25, mtry=20 (RF)
Pros
● Impressive versatility
● Parallelizable
● Robust to outliers and nonlinear data
● Low bias, moderate variance
Cons
● Complexity
● High computational resource requirement
● Can overfit -- mitigated by tuning hyperparameters
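A minimal sketch contrasting bagging and a random forest; the ntree/mtry values mirror the slide (scikit-learn calls them n_estimators/max_features), and the variable names are assumptions.

```python
# Bagging is a random forest with mtry = p (every predictor considered at
# every split); the random forest below samples 20 of the predictors instead.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

bag = RandomForestRegressor(n_estimators=500, max_features=None,  # ntree=500, mtry=p
                            n_jobs=-1, random_state=1)
rf = RandomForestRegressor(n_estimators=25, max_features=20,      # ntree=25, mtry=20
                           n_jobs=-1, random_state=1)
for name, model in [("bagging", bag), ("random forest", rf)]:
    model.fit(X_train, y_train)
    print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```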
Boosting
Pros
● Easy to read and interpret
● Resilient method that curbs over-fitting
Cons
● Sensitive to outliers
● Difficult to scale up
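A minimal gradient boosting sketch (the slides' best method); the hyperparameter values and variable names are illustrative assumptions, not the ones actually used.

```python
# Gradient boosting sketch: fit shallow trees sequentially, each one
# correcting the residuals of the current ensemble.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, random_state=1)
gbm.fit(X_train, y_train)
print("boosting test MSE:", mean_squared_error(y_test, gbm.predict(X_test)))
```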
Model Comparison - Test MSE
Most important variables from the Boosting model
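A short sketch of how a variable-importance ranking like this one can be read off a fitted boosting model (gbm and X_train are the hypothetical objects from the sketches above).

```python
# Rank predictors by the fitted boosting model's importance scores.
import pandas as pd

importance = pd.Series(gbm.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))
```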
Conclusions
● Best method: Gradient Boosting
● Performance: prediction accuracy of 86% on average
● Most important variables
○ OverallQual: Overall material and finish quality
○ GrLivArea: Above grade (ground) living area square feet
○ TotalBsmtSF: Total square feet of basement area
○ YearBuilt: Original construction date
● Surprises: no location indicator among the top variables; high importance of garage-related features
● Improvement: handle the high dimensionality directly next time, rather than relying on variable filtering
Thanks!