Regression in PySpark
Bharti Motwani
Step 1: Split complete data into training and test dataset
District ADV INCOME SALES
- 10.1 50.5 128.3
- 9.4 55.6 121.3
4 11.6 45 134.4
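A minimal sketch of a reproducible split, in plain Python so it runs without a Spark session. The district ids 1 to 10, the 30% test fraction, and the seed are assumptions for illustration; the slide does not state how the split was made. In PySpark the analogous call on a DataFrame is `randomSplit`.

```python
import random


def train_test_split_ids(ids, test_fraction=0.3, seed=42):
    """Shuffle ids reproducibly and split them into (train, test) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    test = sorted(shuffled[:n_test])
    train = sorted(shuffled[n_test:])
    return train, test


# Hypothetical district ids (the total number of districts is assumed).
train_ids, test_ids = train_test_split_ids(range(1, 11))
print("train:", train_ids)
print("test:", test_ids)
```

Fixing the seed keeps the same districts in the test set across runs, so error metrics from different models stay comparable.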
Step 2: Estimate a model on the training dataset
Ŷ = 53.17 + 5.11(X_ADV) + 0.44(X_INC)
For District 1:
Ŷ1 = 53.17 + (5.11 × 9.5) + (0.44 × 39)
= 118.875
District ADV INCOME SALES Predicted SALES
1 9.5 39.0 145.1 118.875
5 10.3 49.6 106.5 127.627
9 11 62.4 151 136.836
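The predicted values in the table can be reproduced by plugging each district's ADV and INCOME into the fitted equation. The sketch below does this in plain Python; the function and variable names are illustrative, and the coefficients are the ones shown on the slide.

```python
# Coefficients of the fitted model shown on the slide.
INTERCEPT = 53.17
COEF_ADV = 5.11
COEF_INCOME = 0.44


def predict_sales(adv, income):
    """Apply the fitted linear model to one observation."""
    return INTERCEPT + COEF_ADV * adv + COEF_INCOME * income


# (district, ADV, INCOME) for the three districts in the table.
rows = [(1, 9.5, 39.0), (5, 10.3, 49.6), (9, 11.0, 62.4)]

predictions = {d: round(predict_sales(adv, inc), 3) for d, adv, inc in rows}
for district, value in predictions.items():
    print(district, value)
```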
Step 4: Evaluate the model using metrics of accuracy/error
METRICS:
• MAE: the mean of the absolute errors
• MSE: the mean of the squared errors (SSE / number of observations)
• RMSE: the square root of the MSE (values are in the original units of measurement)
• R²: the proportion of variance in the dependent variable explained by the independent variables
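Written as formulas, the four metrics look as follows. This uses the standard definitions, with y_i the actual value, ŷ_i the predicted value, ȳ the mean of the actual values, and n the number of observations; the symbols are notation chosen here, not taken from the slide.

```latex
\begin{align*}
e_i &= y_i - \hat{y}_i \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n} e_i^{2} = \frac{\mathrm{SSE}}{n} \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^{2} &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
\end{align*}
```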
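e1 = 145.1 - 118.875
= 26.225
District ADV INCOME SALES Predicted SALES Error Absolute Error
1 9.5 39.0 145.1 118.875 26.225 26.225
5 10.3 49.6 106.5 127.627 -21.127 21.127
9 11 62.4 151 136.836 14.164 14.164
e1² = 26.225 × 26.225
= 687.75
District ADV INCOME SALES Predicted SALES Error Squared Error
1 9.5 39.0 145.1 118.875 26.225 687.75
5 10.3 49.6 106.5 127.627 -21.127 446.35
9 11 62.4 151 136.836 14.164 200.62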
MSE = (687.75 + 446.35 + 200.62)/3
= 444.91
RMSE = √444.91
= 21.09
The root of the average squared error in prediction is 21.09, that is, $21,090 when SALES is measured in thousands of dollars.
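The error metrics can be checked with a few lines of plain Python. This is a sketch that uses the actual and predicted SALES for Districts 1, 5, and 9; the variable names are illustrative. In PySpark, `RegressionEvaluator` reports the same metrics on a DataFrame of predictions.

```python
import math

# Actual and predicted SALES for Districts 1, 5 and 9.
actual = [145.1, 106.5, 151.0]
predicted = [118.875, 127.627, 136.836]

# Error for each district: actual minus predicted.
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # back in the units of SALES

print(round(mae, 2), round(mse, 2), round(rmse, 2))
```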
Model Evaluation
Metric R²: the proportion of variance in the dependent variable explained by the independent variables
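A minimal sketch of how R² is computed, assuming the usual definition of one minus the ratio of the residual sum of squares to the total sum of squares. The function name is illustrative. On a small hold-out sample R² can come out negative, meaning the model predicts worse than the sample mean; that is what happens for the three districts used here (about -0.14).

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - ss_res / ss_tot


# Actual and predicted SALES for Districts 1, 5 and 9.
actual = [145.1, 106.5, 151.0]
predicted = [118.875, 127.627, 136.836]

r2 = r_squared(actual, predicted)
print(round(r2, 4))
```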
TECHNIQUES:
• CrossValidator
• TrainValidationSplit
• ParamGridBuilder
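The three tuning utilities are typically combined as sketched below: `ParamGridBuilder` lists the candidate hyperparameters, and `CrossValidator` fits the estimator on each candidate and keeps the one with the best evaluator score. The column names follow the tables above; the grid values, fold count, seed, and function name are assumptions for illustration, and calling the function requires an active SparkSession.

```python
def build_cross_validator(num_folds=3, seed=42):
    """Return an unfitted CrossValidator for a linear regression on SALES."""
    # Imports sit inside the function so this file loads without pyspark.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Combine the two predictor columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(inputCols=["ADV", "INCOME"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="SALES")
    pipeline = Pipeline(stages=[assembler, lr])

    # Candidate hyperparameters (illustrative values only).
    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build()
    )

    # Candidates are ranked by RMSE on the held-out fold.
    evaluator = RegressionEvaluator(
        labelCol="SALES", predictionCol="prediction", metricName="rmse"
    )

    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
        seed=seed,
    )
```

With a training DataFrame `train_df` (hypothetical name), `build_cross_validator().fit(train_df)` returns a fitted model whose `bestModel` can then be applied to the test data with `transform`. `TrainValidationSplit` accepts the same estimator, grid, and evaluator, but uses a single split controlled by `trainRatio` instead of `numFolds`, which is cheaper but noisier on small datasets.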