
Regression in PySpark

The document outlines the process of performing regression analysis using PySpark, including data partitioning into training and test datasets, model estimation using various regression algorithms, and prediction evaluation using metrics such as MAE, MSE, and RMSE. It emphasizes the importance of model accuracy and the need for hyper-parameter tuning to select the best model. The document also provides specific examples of calculations for predicted sales and error metrics for different districts.


Regression in PySpark

Bharti Motwani
Step 1: Split complete data into training and test datasets

Data Partitioning

Complete data:

District  ADV   INCOME  SALES
1         9.5   39.0    145.1
2         10.1  50.5    128.3
3         9.4   55.6    121.3
4         11.6  45.0    134.4
5         10.3  49.6    106.5
6         9.5   44.3    111.5
7         11.2  47.4    132.7
8         9.0   60.0    126.9
9         11.0  62.4    151.0
10        8.4   45.7    123.3

Training dataset (70%):

District  ADV   INCOME  SALES
2         10.1  50.5    128.3
3         9.4   55.6    121.3
4         11.6  45.0    134.4
6         9.5   44.3    111.5
7         11.2  47.4    132.7
8         9.0   60.0    126.9
10        8.4   45.7    123.3

Test dataset (30%):

District  ADV   INCOME  SALES
1         9.5   39.0    145.1
5         10.3  49.6    106.5
9         11.0  62.4    151.0


Step 2: Estimate a model on the training dataset

Regression models in PySpark are created from the pyspark.ml.regression module.

Regression algorithms and their corresponding PySpark classes:

Linear Regression: LinearRegression

Decision Tree: DecisionTreeRegressor

Random Forest: RandomForestRegressor

Gradient-Boosted Trees: GBTRegressor

Factorization Machines: FMRegressor

Step 2 (continued): Fitting a linear regression to the 70% training dataset (districts 2, 3, 4, 6, 7, 8 and 10) yields the estimated model:

Ŷ = 53.17 + 5.11(X_ADV) + 0.44(X_INC) + ε
Step 3: Predict using the test dataset

Applying the estimated model Ŷ = 53.17 + 5.11(X_ADV) + 0.44(X_INC) to the test rows, e.g. for district 1:

Ŷ1 = 53.17 + (5.11 * 9.5) + (0.44 * 39) = 118.875

District  ADV   INCOME  SALES  Predicted SALES
1         9.5   39.0    145.1  118.875
5         10.3  49.6    106.5  127.627
9         11.0  62.4    151.0  136.836
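The hand calculations can be checked directly. In PySpark itself the predictions would come from model.transform on the assembled test data; the small helper below only reproduces the arithmetic with the slide's coefficients:

```python
# Recompute the predictions from the estimated coefficients
# (b0 = 53.17, b_ADV = 5.11, b_INC = 0.44).
def predict_sales(adv, income, b0=53.17, b_adv=5.11, b_inc=0.44):
    return b0 + b_adv * adv + b_inc * income

print(round(predict_sales(9.5, 39.0), 3))   # district 1 -> 118.875
print(round(predict_sales(10.3, 49.6), 3))  # district 5 -> 127.627
print(round(predict_sales(11.0, 62.4), 3))  # district 9 -> 136.836
```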
Step 4: Evaluate the model using metrics of accuracy/error

GOAL: To have high predictive accuracy when applied to new records.

Regression models in PySpark are evaluated using RegressionEvaluator() from the pyspark.ml.evaluation module. The metricName argument of RegressionEvaluator() selects among the following metrics.

METRICS:
• MAE: the mean of the absolute errors
• MSE: the mean of the squared errors (SSE / number of observations)
• RMSE: the square root of the MSE (values are in the original units of measurement)
• R2: the proportion of variance in the dependent variable explained by the independent variables
Step 4 (continued): Evaluate the model using metrics of accuracy/error

e1 = 145.1 – 118.875 = 26.225

District  ADV   INCOME  SALES  Predicted SALES  Error    Absolute Error
1         9.5   39.0    145.1  118.875          26.225   26.225
5         10.3  49.6    106.5  127.627          -21.127  21.127
9         11.0  62.4    151.0  136.836          14.164   14.164

Metric: Mean Absolute Error

MAE = (26.225 + 21.127 + 14.164)/3 = 20.51

The average absolute error in prediction is $20,510 (SALES is measured in thousands of dollars).
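The MAE arithmetic from the table can be reproduced in plain Python:

```python
# Actual and predicted SALES for the three test districts.
actual = [145.1, 106.5, 151.0]
predicted = [118.875, 127.627, 136.836]

# Signed errors, then the mean of their absolute values.
errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
print(round(mae, 2))  # 20.51
```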


Metric: Mean Squared Error

e1² = 26.225 × 26.225 = 687.75

District  ADV   INCOME  SALES  Predicted SALES  Error    Squared Error
1         9.5   39.0    145.1  118.875          26.225   687.75
5         10.3  49.6    106.5  127.627          -21.127  446.35
9         11.0  62.4    151.0  136.836          14.164   200.62

MSE = (687.75 + 446.35 + 200.62)/3 = 444.91

The average squared error in prediction is 444.91 (squared units are not easily interpreted).
Metric: Root Mean Squared Error

MSE = (687.75 + 446.35 + 200.62)/3 = 444.91

RMSE = √444.91 = 21.09

The root of the average squared error in prediction is $21,090.
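The MSE and RMSE steps above can likewise be reproduced in plain Python:

```python
import math

# Signed prediction errors from the table above.
errors = [26.225, -21.127, 14.164]

# Mean of the squared errors, then its square root.
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)
print(round(mse, 2))   # 444.91
print(round(rmse, 2))  # 21.09
```

Squaring makes every term positive, which is why MSE and RMSE weight large errors more heavily than MAE does.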
Model Evaluation

Metric R2: the proportion of variance in the dependent variable explained by the independent variables.

• It is good to explain the maximum variance; hence a better model has a higher R2.

MAE    MSE     RMSE
20.51  444.91  21.09

• Lower is better for all three error metrics.
• MAE and RMSE are in the original units of measurement.
• RMSE penalizes large errors more than MAE.

Is this a "good" model?

Hard to tell: we need a point of comparison.
Develop more regression models and compare.
Step 5: Creating and Selecting the Best Model

GOAL: To create the best model using hyper-parameter tuning techniques.

Hyper-parameter tuning in PySpark is done using the following techniques from the pyspark.ml.tuning module.

TECHNIQUES:
• CrossValidator
• TrainValidationSplit
• ParamGridBuilder
