
Regression in PySpark

The document outlines the process of performing regression analysis using PySpark, including data partitioning into training and test datasets, model estimation using various regression algorithms, and prediction evaluation using metrics such as MAE, MSE, and RMSE. It emphasizes the importance of model accuracy and the need for hyper-parameter tuning to select the best model. The document also provides specific examples of calculations for predicted sales and error metrics for different districts.


Regression in PySpark

Bharti Motwani
Step 1: Split complete data into training and test datasets

Data Partitioning

Complete data:

District  ADV   INCOME  SALES
1         9.5   39.0    145.1
2         10.1  50.5    128.3
3         9.4   55.6    121.3
4         11.6  45.0    134.4
5         10.3  49.6    106.5
6         9.5   44.3    111.5
7         11.2  47.4    132.7
8         9.0   60.0    126.9
9         11.0  62.4    151.0
10        8.4   45.7    123.3

Training dataset (70%):

District  ADV   INCOME  SALES
2         10.1  50.5    128.3
3         9.4   55.6    121.3
4         11.6  45.0    134.4
6         9.5   44.3    111.5
7         11.2  47.4    132.7
8         9.0   60.0    126.9
10        8.4   45.7    123.3

Test dataset (30%):

District  ADV   INCOME  SALES
1         9.5   39.0    145.1
5         10.3  49.6    106.5
9         11.0  62.4    151.0


Step 2: Estimate a model on the training dataset

Regression models in PySpark are created from the pyspark.ml.regression module.

Regression algorithms and their corresponding PySpark classes:

Linear Regression: LinearRegression

Decision Tree: DecisionTreeRegressor

Random Forest: RandomForestRegressor

Gradient-Boosted Trees: GBTRegressor

Factorization Machines: FMRegressor

Step 2 (continued): Fitting a linear regression to the 70% training dataset (districts 2, 3, 4, 6, 7, 8 and 10) yields the estimated model:

Ŷ = 53.17 + 5.11(X_ADV) + 0.44(X_INC) + ε
Step 3: Predict using the test dataset

Applying the estimated model Ŷ = 53.17 + 5.11(X_ADV) + 0.44(X_INC) to the test rows, e.g. for district 1:

Ŷ1 = 53.17 + (5.11 * 9.5) + (0.44 * 39) = 118.875

District  ADV   INCOME  SALES  Predicted SALES
1         9.5   39.0    145.1  118.875
5         10.3  49.6    106.5  127.627
9         11.0  62.4    151.0  136.836
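The hand calculations can be checked directly. In PySpark itself the predictions would come from model.transform on the assembled test data; the small helper below only reproduces the arithmetic with the slide's coefficients:

```python
# Recompute the predictions from the estimated coefficients
# (b0 = 53.17, b_ADV = 5.11, b_INC = 0.44).
def predict_sales(adv, income, b0=53.17, b_adv=5.11, b_inc=0.44):
    return b0 + b_adv * adv + b_inc * income

print(round(predict_sales(9.5, 39.0), 3))   # district 1 -> 118.875
print(round(predict_sales(10.3, 49.6), 3))  # district 5 -> 127.627
print(round(predict_sales(11.0, 62.4), 3))  # district 9 -> 136.836
```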
Step 4: Evaluate the model using metrics of accuracy/error

GOAL: To have high predictive accuracy when applied to new records.

Regression models in PySpark are evaluated using RegressionEvaluator() from the pyspark.ml.evaluation module. The metricName argument of RegressionEvaluator() selects among the following metrics.

METRICS:
• MAE: the mean of the absolute errors
• MSE: the mean of the squared errors (SSE / number of observations)
• RMSE: the square root of the MSE (values are in the original units of measurement)
• R2: the proportion of variance in the dependent variable explained by the independent variables
Step 4 (continued): Evaluate the model using metrics of accuracy/error

e1 = 145.1 – 118.875 = 26.225

District  ADV   INCOME  SALES  Predicted SALES  Error    Absolute Error
1         9.5   39.0    145.1  118.875          26.225   26.225
5         10.3  49.6    106.5  127.627          -21.127  21.127
9         11.0  62.4    151.0  136.836          14.164   14.164

Metric: Mean Absolute Error

MAE = (26.225 + 21.127 + 14.164)/3 = 20.51

The average absolute error in prediction is $20,510 (SALES is measured in thousands of dollars).
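The MAE arithmetic from the table can be reproduced in plain Python:

```python
# Actual and predicted SALES for the three test districts.
actual = [145.1, 106.5, 151.0]
predicted = [118.875, 127.627, 136.836]

# Signed errors, then the mean of their absolute values.
errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
print(round(mae, 2))  # 20.51
```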


Metric: Mean Squared Error

e1² = 26.225 × 26.225 = 687.75

District  ADV   INCOME  SALES  Predicted SALES  Error    Squared Error
1         9.5   39.0    145.1  118.875          26.225   687.75
5         10.3  49.6    106.5  127.627          -21.127  446.35
9         11.0  62.4    151.0  136.836          14.164   200.62

MSE = (687.75 + 446.35 + 200.62)/3 = 444.91

The average squared error in prediction is 444.91 (squared units are not easily interpreted).
Metric: Root Mean Squared Error

MSE = (687.75 + 446.35 + 200.62)/3 = 444.91

RMSE = √444.91 = 21.09

The root of the average squared error in prediction is $21,090.
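The MSE and RMSE steps above can likewise be reproduced in plain Python:

```python
import math

# Signed prediction errors from the table above.
errors = [26.225, -21.127, 14.164]

# Mean of the squared errors, then its square root.
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)
print(round(mse, 2))   # 444.91
print(round(rmse, 2))  # 21.09
```

Squaring makes every term positive, which is why MSE and RMSE weight large errors more heavily than MAE does.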
Model Evaluation

Metric R2: the proportion of variance in the dependent variable explained by the independent variables.

• It is good to explain the maximum variance; hence a better model has a higher R2.

MAE    MSE     RMSE
20.51  444.91  21.09

• Lower is better for all three error metrics.
• MAE and RMSE are in the original units of measurement.
• RMSE penalizes large errors more than MAE.

Is this a "good" model?

Hard to tell: we need a point of comparison.
Develop more regression models and compare.
Step 5: Creating and Selecting the Best Model

GOAL: To create the best model using hyper-parameter tuning techniques.

Hyper-parameter tuning in PySpark is done using the following techniques from the pyspark.ml.tuning module.

TECHNIQUES:
• CrossValidator
• TrainValidationSplit
• ParamGridBuilder
