Statistical Analysis and Forecasting of Solar Energy (Intra-State)
Contents
1 Introduction
  1.1 Why Forecasting?
  1.2 Terms Associated with Solar Power
2 Preprocessing
  2.1 Dataset
  2.2 Obtaining Daily and Weekly Values and Handling Missing Data
3 Descriptive Statistics
  3.1 Correlation
  3.2 Plotting the Data
  3.3 Distribution Fitting
    3.3.1 Kolmogorov–Smirnov Test
    3.3.2 Distribution Fit Plot
7 Conclusions
References
Appendices
A Code
  A.1 Getting Daily Data
  A.2 Getting Weekly Data
  A.3 Plotting Data
  A.4 KS-Test
  A.5 Plotting Best Fit Beta Distribution
  A.6 ADF Test
  A.7 KPSS Test
  A.8 Time Series Decomposition
  A.9 MAPE/MAE Calculation
  A.10 AR/MA/ARMA Models
  A.11 SARIMA Models
B Other Results
1 Introduction
Solar energy is an essential source of renewable energy in the modern world. Solar power in India is a fast-developing industry, as India receives an abundant amount of sunlight throughout the year. The country's installed solar capacity was 36.9 GW as of 30 November 2020. Rajasthan is one of India's most solar-developed states, with its total photovoltaic capacity reaching 2289 MW by the end of June 2018 [7].
• Diffuse Horizontal Irradiance (DHI) represents solar radiation that does not arrive on a direct path from the sun, but has been scattered by clouds and particles in the atmosphere and comes equally from all directions.
• Direct Normal Irradiance (DNI) is the solar radiation received per unit area by a surface held perpendicular to the rays coming in a straight line from the sun.
• The solar zenith angle (Z) is the angle between the sun's rays and the vertical.
• Global Horizontal Irradiance (GHI) is the total amount of shortwave radiation received from above by a surface parallel to the ground. The following relation holds between GHI, DNI and DHI:

GHI = DHI + DNI × cos(Z)
2 Preprocessing
2.1 Dataset
The given dataset contains hourly information collected over a period of 15 years (2000–2014) at 5 regions in Rajasthan. Information about the following attributes is available in the dataset:
• Date and Time of measurement.
• DHI and Clearsky DHI
• DNI and Clearsky DNI
• GHI and Clearsky GHI
• Dew Point
• Temperature
• Pressure
• Relative Humidity
• Solar Zenith Angle
• Wind Speed
In this project, we are going to use only GHI values for forecasting.
2.2 Obtaining Daily and Weekly Values and Handling Missing Data
Daily and weekly GHI values were obtained by summing over the hourly values. The other variables were discarded as they were not important for forecasting; they were used only while analysing the correlations, in which case the hourly values were used directly.
Data from region-5 for the year 2011 was missing. We used averaging to handle these missing values. As
there is no specific trend in the data (explained later), we took the mean of values (from region-5) for a
particular day from all other years and put it in place of the missing value for that day in 2011. This was
done for all of the days in 2011.
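As a sketch, the aggregation and the day-of-year averaging described above can be done with pandas. The data below is synthetic and the names (`daily`, `filled`) are illustrative, not the project's actual code:

```python
import numpy as np
import pandas as pd

# Synthetic daily GHI totals standing in for region-5 (values are made up).
idx = pd.date_range("2000-01-01", "2014-12-31", freq="D")
rs = np.random.default_rng(0)
daily = pd.Series(rs.uniform(4000, 8000, len(idx)), index=idx, name="GHI")
weekly = daily.resample("W").sum()  # weekly totals from daily totals

# Simulate the missing year, then fill each missing day with the mean of
# the same calendar day over all other years, as described above.
daily.loc["2011"] = np.nan
others = daily[daily.index.year != 2011]
clim = others.groupby([others.index.month, others.index.day]).mean()
mask = daily.isna()
filled = daily.copy()
filled[mask] = [clim.loc[(d.month, d.day)] for d in filled.index[mask]]
```

This keeps every non-missing value untouched and only substitutes the cross-year day means for 2011.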
3 Descriptive Statistics
3.1 Correlation
We obtained the feature correlation maps for each region. The plot for region-1 is shown in Fig 1; plots for the rest of the regions can be seen in the Appendix.
It can be observed that GHI has a strongly positive correlation with the DNI and DHI values, which is clear from the relation between these variables mentioned before. GHI is negatively correlated with the zenith angle, which is expected since cosine is a decreasing function in the first quadrant. A moderately positive correlation with temperature is explainable: higher temperatures are, to a certain extent, likely to be caused by a higher amount of solar radiation, which also corresponds to higher GHI values.
The GHI values are made up of both DNI and DHI values and are a good measure relating to power output.
Hence, GHI forecasts can be useful for forecasting solar power output.
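The signs discussed above are easy to check numerically. The sketch below uses synthetic columns (the column names follow the dataset description; the values are made up) and computes the same correlation matrix the heatmaps visualize:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one region's hourly data.
rs = np.random.default_rng(1)
z = rs.uniform(0, np.pi / 2, 1000)      # zenith angle (radians)
dni = rs.uniform(200, 900, 1000)
dhi = rs.uniform(50, 300, 1000)
df = pd.DataFrame({
    "DHI": dhi,
    "DNI": dni,
    "GHI": dhi + dni * np.cos(z),       # GHI = DHI + DNI cos(Z)
    "Solar Zenith Angle": np.degrees(z),
})

# Pearson correlation matrix; a heatmap of `corr` (e.g. with seaborn)
# reproduces Fig 1-style plots.
corr = df.corr()
print(corr["GHI"].round(2))
```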
3.2 Plotting the Data
Looking at Fig 2b, we can observe that there is no trend in the weekly data (from region-1) large enough to be visible to the eye. It is possible that a very small trend exists, but we will test for the existence of a trend in a later section. It is, however, very clear that some kind of seasonality is in play: the GHI values at instants of time separated by approximately 52 weeks are very close.
As was the case for weekly data, we cannot observe any significant trend in the plot for daily data shown in Fig 2a. We can also observe the existence of seasonality in this figure.
Figure 2: (a) Plot of Daily Data from Rajasthan-1; (b) Plot of Weekly Data from Rajasthan-1
From both of the plots, we can make the rough inference that the data is seasonal and that GHI values repeat after every gap of one year. However, concrete tests need to be conducted to verify the stationarity (absence of trend) of the data.
Plots for rest of the regions can be seen in the Appendix.
Table 1: p-values obtained while performing KS-Test on weekly data
4 Tests for Stationarity
Stationarity in statistics means that the statistical properties of a time series, such as its mean, variance and autocovariance, do not vary with time. Two tests are commonly used to check a time series for stationarity:
• Augmented Dickey Fuller (ADF)
• Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
As the p-values are less than 0.01, we can reject the null hypothesis for each of the regions (for both daily and weekly data). Thus, it is likely that the series from all regions do not have a unit root.
However, non-stationarity can be caused by other factors too, so we conduct another test to confirm that our series are stationary.
We can see from Table 4 that the test statistic was less than the 1% critical value for each of the 5 regions
(for both daily and weekly data). So we cannot reject the null hypothesis for any of the series.
4.3 Conclusion about Stationarity
From the combination of ADF and KPSS tests, four cases can arise:
• Case 1: Both tests conclude that the series is stationary.
In this case, we can conclude that the series is stationary.
From Fig 4, it can be seen that there is seasonality in the daily series and that no uniform trend exists.
Similar inferences can be drawn about weekly data from Fig 5.
Time Series Decomposition plots for other regions can be seen in the Appendix.
Figure 5: Time Series Decomposition of Weekly Data from Rajasthan-1
Figure 6: Rajasthan-1 Weekly ACF Plot
• The maximum significant lag came out to be around 20 in the PACF plots for all the daily GHI datasets.
• The maximum significant lag came out to be around 10 in the PACF plots for all the weekly GHI datasets.
                                Daily datasets (all regions)   Weekly datasets (all regions)
Number of training datapoints              4378                            625
Number of testing datapoints               1091                            156
• Grid Search
For hyperparameter evaluation, we used the ACF plots to determine the maximum value of q to evaluate (from the number of significant lags) and the PACF plots to determine the maximum value of p in the same way. We then used a grid search to find the values of p and q that minimize the AIC of the model in every scenario, i.e., p for AR (where q = 0), q for MA (where p = 0), and (p, q) for ARMA, where both p and q can take any value.
After determining the appropriate values of the hyperparameters p and q, we needed to determine the value of d for the ARIMA model. Since our data is stationary, there is no need for differencing, and hence d = 0 should give the best results. Upon performing a grid search over (p, d, q), the AIC was minimized at d = 0, which verifies our claim.
Due to the computational complexity of the SARIMA model, hyperparameter optimization could not be performed for it.
6.4 Step 3: Models for forecasting
6.4.1 Autoregressive (AR) models
The autoregressive models implicitly assume that future values depend on past values and predict in accordance with a linear relationship between them. The equation describing this model is:

X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t

where φ_1, ..., φ_p are the parameters of the model, c is a constant, and ε_t is white noise [2]. In other words, the past values are multiplied by the model parameters and summed together with a constant.
The results for AR models with tuned hyperparameters are shown in Tables 6 and 7.
Region        Hyperparameter (p)    MAPE      MAE
Rajasthan-1          11            5.499%    2152.397
Rajasthan-2           9            7.071%    2598.998
Rajasthan-3           9            7.069%    2597.446
Rajasthan-4           9            6.938%    2561.409
Rajasthan-5           9            5.640%    2125.523

Region        Hyperparameter (p)    MAPE      MAE
Rajasthan-1          17             8.012%   408.652
Rajasthan-2          17            11.443%   504.634
Rajasthan-3          17            11.518%   506.659
Rajasthan-4          17            11.112%   498.878
Rajasthan-5          17             8.613%   412.370
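The MAPE and MAE columns in these tables can be computed as follows (a hand-rolled sketch in the spirit of appendix A.9, whose listing is not reproduced in this excerpt):

```python
import numpy as np

def mape(actual, predicted):
    # Mean absolute percentage error, in percent.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def mae(actual, predicted):
    # Mean absolute error, in the units of the series.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

print(mape([100, 200], [90, 220]), mae([100, 200], [90, 220]))  # 10.0 15.0
```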
6.4.2 Moving Average (MA) models
The moving average models analyze the data points by generating a series of averages of subsets of the data, which mitigates the impact of random short-term fluctuations. The equation describing this model is:

X_t = μ + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}

where μ is the mean of the series, θ_1, ..., θ_q are the parameters of the model and ε_t, ε_{t−1}, ..., ε_{t−q} are white-noise error terms [5]. The value of q is called the order of the MA model.
Thus, a moving-average model is conceptually a linear regression of the current value of the series against current and previous (observed) white-noise error terms or random shocks. The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution with location zero and constant scale.
The results of MA models with tuned hyperparameters are shown in Tables 8 and 9.
Region        Hyperparameter (q)    MAPE      MAE
Rajasthan-1          18             5.502%   2137.089
Rajasthan-2          15             7.191%   2615.586
Rajasthan-3          15             7.224%   2632.450
Rajasthan-4          15             7.125%   2606.015
Rajasthan-5          14             5.880%   2191.394

Region        Hyperparameter (q)    MAPE      MAE
Rajasthan-1          36            10.458%   452.188
Rajasthan-2          39            13.696%   543.777
Rajasthan-3          39            12.236%   546.145
Rajasthan-4          39            11.825%   537.259
Rajasthan-5          38             9.220%   444.579
6.4.3 Autoregressive Moving Average (ARMA) models
The autoregressive moving average model combines the above two approaches to generate a model that can describe a weakly stationary time series in terms of two polynomials, one with p autoregressive terms and the other with q moving-average terms. The equation describing this model is:

X_t = c + ε_t + Σ_{i=1}^{p} φ_i X_{t−i} + Σ_{i=1}^{q} θ_i ε_{t−i}

where φ_1, ..., φ_p are the coefficients of the autoregressive polynomial, c is a constant, θ_1, ..., θ_q are the coefficients of the moving-average polynomial and ε_t, ε_{t−1}, ..., ε_{t−q} are white-noise error terms [3]. The results of ARMA models with tuned hyperparameters are shown in Tables 10 and 11.
Region        Hyperparameters (p,q)    MAPE      MAE
Rajasthan-1         (6,1)              5.451%   2165.528
Rajasthan-2         (11,6)             6.812%   2530.356
Rajasthan-3         (5,6)              6.728%   2508.208
Rajasthan-4         (11,5)             6.628%   2475.360
Rajasthan-5         (3,2)              5.623%   2116.391

Region        Hyperparameters (p,q)    MAPE      MAE
Rajasthan-1         (1,27)             7.951%   405.901
Rajasthan-2         (15,4)            11.026%   489.336
Rajasthan-3         (1,29)            11.315%   497.593
Rajasthan-4         (9,15)            10.736%   485.587
Rajasthan-5         (17,8)             8.166%   393.479
problem. Each ARIMA model uses three hyperparameters (p,d,q), of which the meanings of p and q are
similar to the ARMA model and d represents the number of times the data needs to be differenced to produce
a stationary output.
As our series was already stationary, no differencing was needed; thus d = 0. But for d = 0, ARIMA models are equivalent to ARMA models. Hence, we have not shown the results for ARIMA models separately, as they are the same as those of the ARMA models.
Region        Hyperparameters (p,d,q)(P,D,Q,m)    MAPE      MAE
Rajasthan-1        (1,0,1)(1,1,1,52)              5.927%   2397.875
Rajasthan-2        (1,0,1)(1,1,1,52)              6.453%   2468.270
Rajasthan-3        (1,0,1)(1,1,1,52)              6.504%   2512.564
Rajasthan-4        (1,0,1)(1,1,1,52)              6.152%   2357.052
Rajasthan-5        (1,0,1)(1,1,1,52)              4.859%   1916.109
Figure 12: Rajasthan-1 SARIMA Model Weekly Forecasts
7 Conclusions
• Weekly series: For region-1, a highly tuned ARMA model gave the best result (lowest MAPE value). For the other regions, however, a basic SARIMA model gave the best results.
• Daily series: Highly tuned ARMA models gave the best results for all regions.
• Weekly forecasting was found to be more accurate (with every model) than daily forecasting, because daily data tends to have much more random variation than weekly data.
• If faster computers were available, the hyperparameters of the SARIMA model could be optimized to give even better results; SARIMA could also be trained on daily data in that case.
References
[1] Autocorrelation. Wikipedia, The Free Encyclopedia.
[2] Autoregressive model. Wikipedia, The Free Encyclopedia.
[3] Autoregressive–moving-average model. Wikipedia, The Free Encyclopedia.
[4] Kolmogorov–Smirnov goodness-of-fit test. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm.
[5] Moving-average model. Wikipedia, The Free Encyclopedia.
[6] Partial autocorrelation function. Wikipedia, The Free Encyclopedia.
[7] Solar power in India. Wikipedia, The Free Encyclopedia.
[8] Christopher Baum. Tests for stationarity of a time series. Stata Technical Bulletin, 10(57), 2011.
[9] Jamal Fattah, Latifa Ezzine, Zineb Aman, Haj El Moussami, and Abdeslam Lachhab. Forecasting of demand using ARIMA model. International Journal of Engineering Business Management, 10, 2018.
Appendices
A Code
A.1 Getting Daily Data
import pandas as pd

frames = []
dy = 0  # running day index across all years
for i in range(0, 15):
    try:
        df = pd.read_csv("15396_26.65_71.65_" + str(2000 + i) + ".csv", header=2)
    except FileNotFoundError:
        dy = dy + 365  # skip a missing year (e.g. region-5, 2011)
        continue
    df.drop(df.columns.difference(['GHI', 'Year', 'Month', 'Day']), axis=1, inplace=True)
    yr = 2000 + i
    for mon in range(1, 13):
        for day in range(1, 32):
            daydata = df[(df['Year'] == yr) & (df['Month'] == mon) & (df['Day'] == day)]
            if len(daydata) == 0:
                continue
            sumdata = daydata.sum(axis=0)  # daily GHI = sum of hourly values
            sumdata['Year'] = int(yr)
            sumdata['Month'] = int(mon)
            sumdata['Day'] = int(day)
            sumdata['DayCode'] = int(dy)
            frames.append(sumdata.to_frame().transpose())
            dy = dy + 1
dailyData = pd.concat(frames)
A.4 KS-Test
from scipy.stats import kstest
import scipy.stats as st

# Assume df contains the preprocessed daily/weekly data.
# df['GHI'] has the GHI values.
data = df['GHI'].values
dist_names = ['weibull_min', 'weibull_max', 'norm', 'gamma', 'expon', 'lognorm', 'beta']
params = {}
for distnm in dist_names:
    dist = getattr(st, distnm)
    param = dist.fit(data)  # maximum-likelihood fit of the candidate distribution
    params[distnm] = param
    out = kstest(data, distnm, args=param)
    print(distnm + ': p-value = ' + str(out.pvalue))
A.8 Time Series Decomposition

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Assume df contains the preprocessed daily/weekly data.
# df['GHI'] has the GHI values; period is 365 for daily data, 52 for weekly.
X = df['GHI'].values
ans = seasonal_decompose(X, model='additive', period=365)
ans.plot()
plt.show()
A.11 SARIMA Models

import matplotlib.pyplot as plt
import statsmodels.api as sm
# Metric imports assumed here for completeness; the first lines of this
# listing are not reproduced, and A.9 defines equivalent helpers.
from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error

# X holds the GHI values and size the train/test split point,
# as set up in the earlier (elided) lines of the listing.
# Getting the test set.
test = X[size:len(X)]
# Setting hyperparameters.
p, d, q = 1, 0, 1
m = 52
P, D, Q = 1, 1, 1
# Creating model.
model = sm.tsa.statespace.SARIMAX(X[0:size], order=(p, d, q), seasonal_order=(P, D, Q, m))
# Fitting model.
results = model.fit()
# Forecasting.
predictions = results.forecast(len(test))

plt.figure(figsize=(12, 8))
plt.plot(df['Code'][size:], X[size:], label='Actual Value')
plt.plot(range(size, len(X)), predictions, label='Forecast')
plt.legend()
plt.show()

mape = mean_absolute_percentage_error(test, predictions)
mae = mean_absolute_error(test, predictions)
B Other Results
(a) Daily Data (b) Weekly Data
Figure 17: Time Series Decomposition of Daily Data from Rajasthan-2
(a) Weekly Data (b) Daily Data
Figure 22: Correlation Plot of Rajasthan-3
Figure 24: Beta-Distribution Fit for Weekly Data from Rajasthan-3
(a) Weekly Data (b) Daily Data
(a) Weekly Data (b) Daily Data
Figure 31: Correlation Plot of Rajasthan-4
Figure 33: Beta-Distribution Fit for Weekly Data from Rajasthan-4
(a) Weekly Data (b) Daily Data
(a) Weekly Data (b) Daily Data
Figure 40: Correlation Plot of Rajasthan-5
Figure 42: Beta-Distribution Fit for Weekly Data from Rajasthan-5
(a) Weekly Data (b) Daily Data
(a) Weekly Data (b) Daily Data