
Statistical Analysis and Forecasting of Solar Energy (Intra-State)

Group 1 (Bernoulli Group)


30 April, 2021

Contents

1 Introduction
  1.1 Why Forecasting?
  1.2 Terms Associated with Solar Power

2 Preprocessing
  2.1 Dataset
  2.2 Obtaining Daily and Weekly Values and Handling Missing Data

3 Descriptive Statistics
  3.1 Correlation
  3.2 Plotting the Data
  3.3 Distribution Fitting
    3.3.1 Kolmogorov-Smirnov Test
    3.3.2 Distribution Fit Plot

4 Tests for Stationarity
  4.1 Augmented Dickey-Fuller (ADF) Test
  4.2 Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test
  4.3 Conclusion about Stationarity

5 Time Series Decomposition

6 Time Series Forecasting
  6.1 Important Concepts
    6.1.1 Autocorrelation Function (ACF)
    6.1.2 Partial Autocorrelation Function (PACF)
    6.1.3 Grid Search
  6.2 Step 1: Data Pre-processing
  6.3 Step 2: Hyperparameter Evaluation for Each Model
  6.4 Step 3: Models for Forecasting
    6.4.1 Autoregressive (AR) Models
    6.4.2 Moving Average (MA) Models
    6.4.3 Autoregressive Moving Average (ARMA) Models
    6.4.4 Autoregressive Integrated Moving Average (ARIMA) Models
    6.4.5 Seasonal Autoregressive Integrated Moving Average (SARIMA) Models

7 Conclusions

References

Appendices

A Code
  A.1 Getting Daily Data
  A.2 Getting Weekly Data
  A.3 Plotting Data
  A.4 KS-Test
  A.5 Plotting Best Fit Beta Distribution
  A.6 ADF Test
  A.7 KPSS Test
  A.8 Time Series Decomposition
  A.9 MAPE/MAE Calculation
  A.10 AR/MA/ARMA Models
  A.11 SARIMA Models

B Other Results
1 Introduction
Solar energy is an essential source of renewable energy in the modern world. Solar power in India is a fast-developing industry, as India receives an abundant amount of sunlight throughout the year. The country's solar installed capacity was 36.9 GW as of 30 November 2020. Rajasthan is one of India's most solar-developed states, with its total photovoltaic capacity reaching 2289 MW by the end of June 2018 [7].

1.1 Why Forecasting?


The solar power output of a plant depends on various uncontrollable variables that affect the amount of sunlight falling on the solar panels.
Short-term forecasts are valuable for grid operators making decisions about grid operation, as well as for electricity market operators making decisions related to supply and demand. Long-term forecasts are useful for energy producers, for example to negotiate contracts with financial entities or with the utilities that distribute the generated energy.
Thus, accurate forecasting is required so that resources can be utilized in a way that generates higher power output.

1.2 Terms Associated with Solar Power


• Direct Normal Irradiance (DNI) is the amount of solar radiation received per unit area by a surface that is always held perpendicular (normal) to the rays coming in a straight line from the direction of the sun at its current position in the sky.

• Diffuse Horizontal Irradiance (DHI) represents solar radiation that does not arrive on a direct path from the sun but has been scattered by clouds and particles in the atmosphere, arriving equally from all directions.

• The Solar Zenith Angle (Z) is the angle between the sun's rays and the vertical.

• Global Horizontal Irradiance (GHI) is the total amount of shortwave radiation received from above by a surface parallel to the ground. The following relation holds between GHI, DNI and DHI:

GHI = DNI * cos(Z) + DHI
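
As a quick numerical check of this relation (with hypothetical irradiance values, not taken from the dataset):

import math

# Hypothetical values for illustration only (irradiances in W/m^2).
dni = 800.0    # Direct Normal Irradiance
dhi = 100.0    # Diffuse Horizontal Irradiance
zenith = 30.0  # solar zenith angle Z, in degrees

# GHI = DNI * cos(Z) + DHI
ghi = dni * math.cos(math.radians(zenith)) + dhi
print(round(ghi, 1))  # 792.8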

2 Preprocessing
2.1 Dataset
The given dataset contains hourly information collected over a period of 15 years (2000-2014) at 5 regions in Rajasthan. Information about the following attributes is available in the dataset:
• Date and Time of measurement.
• DHI and Clearsky DHI
• DNI and Clearsky DNI
• GHI and Clearsky GHI
• Dew Point
• Temperature
• Pressure
• Relative Humidity
• Solar Zenith Angle
• Wind Speed

In this project, we are going to use only GHI values for forecasting.

2.2 Obtaining Daily and Weekly Values and Handling Missing Data
Daily and weekly GHI values were obtained by summing the hourly values. The other variables were discarded, as they were not needed for forecasting; they were used only while analysing the correlations, for which the hourly values were used directly.
Data from region-5 for the year 2011 was missing. We used averaging to handle these missing values: as there is no specific trend in the data (explained later), we took the mean of the region-5 values for a particular calendar day across all other years and used it in place of the missing value for that day in 2011. This was done for every day of 2011.
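
A minimal sketch of this imputation is given below, assuming dailyData holds the region-5 daily values with 'Year', 'Month', 'Day' and 'GHI' columns (as produced by the code in Appendix A.1) and that the 2011 rows are absent:

import pandas as pd

# Mean GHI for each calendar day, averaged over all years other than 2011.
other_years = dailyData[dailyData['Year'] != 2011]
day_means = other_years.groupby(['Month', 'Day'], as_index=False)['GHI'].mean()
day_means['Year'] = 2011

# Append the imputed 2011 rows and restore chronological order.
dailyData = pd.concat([dailyData, day_means], ignore_index=True)
dailyData = dailyData.sort_values(['Year', 'Month', 'Day']).reset_index(drop=True)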

3 Descriptive Statistics
3.1 Correlation
We obtained the feature correlation maps for each region. The plot for region-1 is shown in Fig 1. Plots for the rest of the regions can be seen in Appendix B.

Figure 1: Correlation Plot of Rajasthan-1

It can be observed that GHI has a strong positive correlation with the DNI and DHI values, which is clear from the relation between these variables mentioned before. GHI is negatively correlated with the Zenith Angle, which follows since cosine is a decreasing function in the first quadrant. The moderately positive correlation with temperature is also explainable: higher temperatures are, to a certain extent, likely to be caused by higher amounts of solar radiation and therefore tend to coincide with higher GHI values.
The GHI values incorporate both the DNI and DHI values and are a good proxy for power output. Hence, GHI forecasts can be useful for forecasting solar power output.
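
The correlation maps themselves are not generated by the appendix code; a minimal sketch using pandas and matplotlib is given below, assuming df holds the hourly attributes listed in Section 2.1:

import matplotlib.pyplot as plt

# Pairwise correlation between all attributes (DHI, DNI, GHI, Temperature, ...).
corr = df.corr()

plt.figure(figsize=(10, 8))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Feature Correlation Map')
plt.tight_layout()
plt.show()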

3.2 Plotting the Data
Looking at Fig 2b, we can observe that there is no trend in the weekly data (from region-1) large enough to be visible to the eye. It is possible that a very small trend does exist; we test for the existence of a trend in a later section. It is, however, very clear that some kind of seasonality is in play: the GHI values at instants of time separated by approximately 52 weeks are very close.
As was the case for the weekly data, we cannot observe any significant trend in the plot of the daily data shown in Fig 2a. We can also observe the existence of seasonality in this figure.

(a) Plot of Daily Data from Rajasthan-1 (b) Plot of Weekly Data from Rajasthan-1

Figure 2: Rajasthan-1 Data Plots

From both of the plots, we can make a rough inference that the data is seasonal, with GHI values repeating after every gap of one year. However, concrete tests need to be conducted to verify the stationarity (absence of trend) of the data.
Plots for the rest of the regions can be seen in Appendix B.

3.3 Distribution Fitting


The probability distribution of a time series describes the probability that an observation falls into a specified range of values. Distribution fitting is the process of identifying the curve that best fits the series of data points.
Distribution fits are often used as an aid in visualization and to summarize the relationships among variables. We first used the KS test to check whether our series could have been drawn from any of the commonly known distributions and then, where successful, plotted the distribution fits.

3.3.1 Kolmogorov-Smirnov Test


The Kolmogorov-Smirnov test is used to decide whether a sample comes from a population with a specific distribution. It is based on the empirical cumulative distribution function (ECDF). Given N ordered data points Y_1, Y_2, ..., Y_N, the ECDF is defined as

E_N = n(i)/N

where n(i) is the number of points less than Y_i, and the Y_i are ordered from smallest to largest value [4]. The Kolmogorov-Smirnov test is defined by:

H0: The data follows the specified distribution.
Ha: The data does not follow the specified distribution.
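
For concreteness, a minimal sketch of the ECDF underlying the test (the test itself is run via scipy in Appendix A.4); here the ECDF steps up by 1/N at each ordered sample point:

import numpy as np

def ecdf(data):
    # Order the sample Y_1 <= ... <= Y_N; E_N rises by 1/N at each point.
    y = np.sort(data)
    e = np.arange(1, len(y) + 1) / len(y)
    return y, e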
On conducting the KS test, we obtained the results shown in Tables 1 and 2.

Table 1: p-values obtained while performing the KS test on weekly data

Region        Weibull-Min   Weibull-Max   Normal    Gamma     Exponential   Lognormal   Beta
Rajasthan-1   0             0             1.6e-5    8.2e-6    3.6e-27       0           0.043
Rajasthan-2   0             0             0.0012    0.0013    1.1e-38       0           0.085
Rajasthan-3   0             0             0.0024    0.0025    5.3e-42       0.0016      0.118
Rajasthan-4   0             0             0.001     0.0010    2.43e-42      0           0.077
Rajasthan-5   0             0             0.0016    0.0016    1.21e-41      0           0.09

Table 2: p-values obtained while performing the KS test on daily data

Region        Weibull-Min   Weibull-Max   Normal    Gamma     Exponential   Lognormal   Beta
Rajasthan-1   0             0             2.5e-42   8.7e-47   0             1.6e-48     1.4e-22
Rajasthan-2   0             0             1.3e-25   3.0e-26   0             5.8e-28     8.6e-17
Rajasthan-3   0             0             2.9e-24   2.5e-27   0             6.1e-25     1.9e-16
Rajasthan-4   0             0             1.9e-23   1.3e-24   0             4.2e-23     8.4e-17
Rajasthan-5   0             0             3.2e-25   3.8e-26   0             4.3e-26     7.4e-15

3.3.2 Distribution Fit Plot


Table 2 shows that the p-values for the daily data are all smaller than 0.01, so at the 1% significance level we can reject the null hypothesis. Thus, the daily data from each region is not derived from any of the above-listed distributions.
On the other hand, Table 1 shows that the p-value for the weekly data from each region tested against the beta distribution is greater than 0.01. Thus, we cannot reject the null hypothesis there. Fig 3 shows the best-fit beta-distribution plot for the weekly data from Rajasthan-1. Best-fit plots for the weekly data from the other regions can be seen in Appendix B.

Figure 3: Beta-Distribution Fit for Weekly Data from Rajasthan-1

4 Tests for Stationarity
Stationarity in statistics means that the statistical properties of a time series, such as its mean, variance and covariance, do not vary with time. Two tests are normally used to check a time series for stationarity:
• Augmented Dickey Fuller (ADF)
• Kwiatkowski-Phillips-Schmidt-Shin (KPSS)

4.1 Augmented Dickey-Fuller (ADF) Test

One of the most common causes of non-stationarity is a unit root: a stochastic trend in the time series. (The mathematics of unit roots is too involved to cover here.)
The ADF test checks for the presence of a unit root. The hypotheses for this test are as follows:

H0: The series has a unit root.

Ha: The series has no unit root.

The results obtained on conducting the ADF test are shown in Table 3.

Table 3: p-values obtained from the ADF test

Region        p-value (Daily)   p-value (Weekly)
Rajasthan-1   0.000003          3.678e-16
Rajasthan-2   0.000003          4.191e-15
Rajasthan-3   0.000002          8.002e-15
Rajasthan-4   0.000001          7.694e-15
Rajasthan-5   0.000016          1.805e-13

As the p-values are less than 0.01, we can reject the null hypothesis for each of the regions (for both daily and weekly data). Thus, it is likely that none of the series (from any region) has a unit root.
However, non-stationarity can be caused by other factors too, so we conduct another test to confirm that our series are stationary.

4.2 Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

The hypotheses of the KPSS test are:

H0: The series is stationary.

Ha: The series has a unit root (it is not stationary).

The 1% critical value for this test is known to be 0.739. On conducting this test, we get the following results:

Table 4: Test statistics obtained from the KPSS test

Region        Test Statistic (Daily)   Test Statistic (Weekly)
Rajasthan-1   0.02131                  0.01142
Rajasthan-2   0.03364                  0.01881
Rajasthan-3   0.04004                  0.02247
Rajasthan-4   0.03354                  0.018912
Rajasthan-5   0.03283                  0.01858

We can see from Table 4 that the test statistic was less than the 1% critical value for each of the 5 regions
(for both daily and weekly data). So we cannot reject the null hypothesis for any of the series.

4.3 Conclusion about Stationarity
From the combination of the ADF and KPSS tests, four cases can arise:
• Case 1: Both tests conclude that the series is stationary. In this case, we can conclude that the series is stationary.

• Case 2: KPSS indicates stationarity and ADF does not. Here, the conclusion would be that the series is trend stationary; the trend needs to be removed to make the series strictly stationary.

• Case 3: ADF indicates stationarity and KPSS does not. This indicates that the series is difference stationary; differencing needs to be done to make it stationary.

• Case 4: Both tests conclude that the series is not stationary. In this case, we can conclude that the series is not stationary.

For our data, both tests arrive at Case 1, so we can conclude that all of the series (daily and weekly, for each region) are stationary.

5 Time Series Decomposition


Time series decomposition involves breaking a series down into trend, seasonal and noise components. The additive model was used here, as the magnitude of the seasonal fluctuation does not vary much.
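In the additive model, an observation y_t is expressed as the sum of these components,

y_t = T_t + S_t + R_t

where T_t is the trend, S_t the seasonal component and R_t the residual (noise).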

Figure 4: Time Series Decomposition of Daily Data from Rajasthan-1

From Fig 4, it can be seen that there is a seasonality in the daily series and no uniform trend exists.
Similar inferences can be drawn about weekly data from Fig 5.
Time Series Decomposition plots for the other regions can be seen in Appendix B.

Figure 5: Time Series Decomposition of Weekly Data from Rajasthan-1

6 Time Series Forecasting


There are several time series models that we have considered for forecasting. These include:
• AR (p)
• MA (q)
• ARMA (p, q)
• ARIMA (p, d, q)
• SARIMA (p, d, q) (P, D, Q, m)
Our approach involves comparing the performance of all of these models on various evaluation criteria, including accuracy and error measures such as the Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
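For reference, with actual values y_i, forecasts ŷ_i and n test points, these error measures are defined as:

MAPE = (100/n) Σ |y_i − ŷ_i| / |y_i|
MAE = (1/n) Σ |y_i − ŷ_i|
MSE = (1/n) Σ (y_i − ŷ_i)²
RMSE = √MSE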

6.1 Important Concepts


6.1.1 Autocorrelation Function (ACF)
An ACF plot is a bar chart of the coefficients of correlation between a time series and its lagged values. The ACF explains how the present value of a given time series is correlated with its past (1-unit past, 2-unit past, ..., n-unit past) values. In the ACF plot, the y-axis shows the correlation coefficient and the x-axis the number of lags. If y(t), y(t-1), ..., y(t-n) are the values of a time series at times t, t-1, ..., t-n, then the lag-1 value is the correlation coefficient between y(t) and y(t-1), the lag-2 value is the correlation coefficient between y(t) and y(t-2), and so on.
Formally, let {X_t} be a random process and t be any point in time. Then X_t is the value (or realization) produced by a given run of the process at time t. Suppose that the process has mean µ_t and variance σ_t² at time t, for each t. Then the autocorrelation function between times t1 and t2 is:

R_XX(t1, t2) = E[X_t1 X_t2]

The last ACF coefficient greater than the specified threshold value marks the maximum significant lag, which also represents the upper limit of the hyperparameter q [1].
ACF plots for weekly and daily data from region-1 are shown in Figs 6 and 7.
• The maximum significant lag came out to be around 1120 in the ACF plots for all the daily GHI datasets.
• The maximum significant lag came out to be around 160 in the ACF plots for all the weekly GHI datasets.

Figure 6: Rajasthan-1 Weekly ACF Plot

Figure 7: Rajasthan-1 Daily ACF Plot

6.1.2 Partial Autocorrelation Function (PACF)

The partial autocorrelation function explains the partial correlation between the series and its lagged values. In simple terms, PACF can be explained using a linear regression in which we predict y(t) from y(t-1), y(t-2) and y(t-3): the PACF correlates the "parts" of y(t) and y(t-3) that are not predicted by y(t-1) and y(t-2).
Given a time series z_t, the partial autocorrelation of lag k, denoted α(k), is the autocorrelation between z_t and z_{t+k} with the linear dependence of z_t on z_{t+1} through z_{t+k-1} removed; equivalently, it is the autocorrelation between z_t and z_{t+k} that is not accounted for by lags 1 through k-1, inclusive:

α(1) = corr(z_{t+1}, z_t), for k = 1,
α(k) = corr(z_{t+k} − P_{t,k}(z_{t+k}), z_t − P_{t,k}(z_t)), for k > 1,

where P_{t,k}(x) is the orthogonal projection of x onto the linear subspace of Hilbert space spanned by z_{t+1}, ..., z_{t+k-1} [6].
PACF plots for daily and weekly data from region-1 are shown in Fig 8.
The last PACF coefficient greater than the specified threshold value marks the maximum significant lag, which also represents the upper limit of the hyperparameter p. However, since PACF values do not necessarily follow a diminishing pattern and the threshold does not follow a strictly increasing pattern, there is no exact way to determine the maximum significant lag just by looking at the graph.

(a) Weekly Data (b) Daily Data

Figure 8: Rajasthan-1 PACF Plots

• The maximum significant lag came out to be around 20 in the PACF plots for all the daily GHI datasets.
• The maximum significant lag came out to be around 10 in the PACF plots for all the weekly GHI datasets.
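
Plots like Figs 6-8 can be generated directly with statsmodels; a minimal sketch, assuming df['GHI'] holds the weekly GHI series as in the appendix code:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = df['GHI'].values
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(series, lags=170, ax=axes[0])   # lag range chosen to cover the weekly plot
plot_pacf(series, lags=40, ax=axes[1])
plt.show()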

6.1.3 Grid Search


Grid search is a method that determines the best values of the trend hyperparameters p, d and q and the seasonal hyperparameters P, D, Q and m, such that the models (AR(p), MA(q), ARMA(p, q), ARIMA(p, d, q) and SARIMA(p, d, q)(P, D, Q, m)) give minimum loss. Grid search works on the principle of minimizing the AIC values of the models formed using a given set of values of (p, d, q)(P, D, Q, m); the model with the minimum AIC value is the one with the appropriate hyperparameters.
Grid search is usually performed over ranges of the hyperparameters p and q that are initially obtained from the PACF and ACF plots, respectively.
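
A minimal sketch of such a grid search over (p, q), with d fixed at 0, is shown below. It uses the same ARIMA interface as the appendix code; the search ranges are illustrative (their upper limits would come from the PACF and ACF plots), and train is assumed to hold the training portion of a GHI series:

import itertools
from statsmodels.tsa.arima_model import ARIMA

best_aic, best_order = float('inf'), None
for p, q in itertools.product(range(0, 12), range(0, 19)):
    try:
        results = ARIMA(train, order=(p, 0, q)).fit(disp=0)
    except Exception:
        continue  # some (p, q) combinations fail to converge
    if results.aic < best_aic:
        best_aic, best_order = results.aic, (p, 0, q)
print('Best (p, d, q):', best_order, 'with AIC', best_aic)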

6.2 Step 1: Data Pre-processing


The daily and weekly data series were each split in the ratio 8:2 into a training dataset and a testing dataset. The final errors are computed on the test data.
After this pre-processing step, we had 10 datasets (5 daily and 5 weekly), each split into training and test data. The number of data points in each is:

Table 5: Preprocessing Details

                                Daily datasets (all regions)   Weekly datasets (all regions)
Number of training datapoints   4378                           625
Number of testing datapoints    1091                           156

6.3 Step 2: Hyperparameter Evaluation for each model


We trained several time series models for forecasting. However, each model requires some hyperparameters to be evaluated and provided to it for proper training with the least possible loss before it can be used for forecasting. To determine these hyperparameters, we used the following methods:
• ACF plots
• PACF plots

• Grid Search
For our hyperparameter evaluation, we used the ACF plots to determine the maximum value of q worth evaluating (from the number of significant lags in the plot), and the PACF plots to determine the maximum value of p worth evaluating in the same way. Then, we used grid search to determine the values of p and q that minimize the AIC of the models in every scenario: p for AR (where q = 0), q for MA (where p = 0), and (p, q) for ARMA (where both p and q can take any value).
After determining the appropriate values of the hyperparameters p and q, we needed to determine the value of d for the ARIMA model. Since our data is stationary, there is no need for differencing, and hence d = 0 should give the best results. Upon performing a grid search over the hyperparameters (p, d, q), the AIC was indeed minimized at d = 0, which verifies our claim.
Due to the computational complexity of the SARIMA model, hyperparameter optimization could not be performed for it.

6.4 Step 3: Models for forecasting
6.4.1 Autoregressive (AR) models
The autoregressive models implicitly assume that future values are based on past values, and predict in accordance with a relationship between them. The equation describing this model is:

X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t

where φ_1, ..., φ_p are the parameters of the model, c is a constant, and ε_t is white noise [2]: the past values are multiplied by the parameters and added together with a constant.
The results for AR models with tuned hyperparameters are shown in Tables 6 and 7.

Table 6: AR Model Results for Weekly Data

Region        Hyperparameter (p)   MAPE     MAE
Rajasthan-1   11                   5.499%   2152.397
Rajasthan-2   9                    7.071%   2598.998
Rajasthan-3   9                    7.069%   2597.446
Rajasthan-4   9                    6.938%   2561.409
Rajasthan-5   9                    5.640%   2125.523

Table 7: AR Model Results for Daily Data

Region        Hyperparameter (p)   MAPE      MAE
Rajasthan-1   17                   8.012%    408.652
Rajasthan-2   17                   11.443%   504.634
Rajasthan-3   17                   11.518%   506.659
Rajasthan-4   17                   11.112%   498.878
Rajasthan-5   17                   8.613%    412.370

(a) Weekly Data (b) Daily Data

Figure 9: Rajasthan-1 AR Model Forecasts

6.4.2 Moving Average (MA) models
The moving average models analyze the data points by generating a series of averages of subsets of the data, mitigating the impact of random short-term fluctuations. The equation describing this model is:

X_t = µ + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}

where µ is the mean of the series, θ_1, ..., θ_q are the parameters of the model and ε_t, ε_{t−1}, ..., ε_{t−q} are white noise error terms [5]. The value of q is called the order of the MA model.
Thus, a moving average model is conceptually a linear regression of the current value of the series against current and previous (observed) white noise error terms, or random shocks. The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution with location zero and constant scale.
The results of MA models with tuned hyperparameters are shown in Tables 8 and 9.

Table 8: MA Model Results for Weekly Data

Region        Hyperparameter (q)   MAPE     MAE
Rajasthan-1   18                   5.502%   2137.089
Rajasthan-2   15                   7.191%   2615.586
Rajasthan-3   15                   7.224%   2632.450
Rajasthan-4   15                   7.125%   2606.015
Rajasthan-5   14                   5.880%   2191.394

Table 9: MA Model Results for Daily Data

Region        Hyperparameter (q)   MAPE      MAE
Rajasthan-1   36                   10.458%   452.188
Rajasthan-2   39                   13.696%   543.777
Rajasthan-3   39                   12.236%   546.145
Rajasthan-4   39                   11.825%   537.259
Rajasthan-5   38                   9.220%    444.579

(a) Weekly Data (b) Daily Data

Figure 10: Rajasthan-1 MA Model Forecasts

6.4.3 Autoregressive Moving Average (ARMA) models
The autoregressive moving average model combines the above two approaches to describe a weakly stationary time series in terms of two polynomials: one with p autoregressive terms and the other with q moving average terms. The equation describing this model is:

X_t = c + ε_t + Σ_{i=1}^{p} φ_i X_{t−i} + Σ_{i=1}^{q} θ_i ε_{t−i}

where φ_1, ..., φ_p are the coefficients of the autoregressive polynomial, c is a constant, θ_1, ..., θ_q are the coefficients of the moving average polynomial and ε_t, ε_{t−1}, ..., ε_{t−q} are white noise error terms [3]. The results of ARMA models with tuned hyperparameters are shown in Tables 10 and 11.

Table 10: ARMA Model Results for Weekly Data

Region        Hyperparameters (p,q)   MAPE     MAE
Rajasthan-1   (6,1)                   5.451%   2165.528
Rajasthan-2   (11,6)                  6.812%   2530.356
Rajasthan-3   (5,6)                   6.728%   2508.208
Rajasthan-4   (11,5)                  6.628%   2475.360
Rajasthan-5   (3,2)                   5.623%   2116.391

Table 11: ARMA Model Results for Daily Data

Region        Hyperparameters (p,q)   MAPE      MAE
Rajasthan-1   (1,27)                  7.951%    405.901
Rajasthan-2   (15,4)                  11.026%   489.336
Rajasthan-3   (1,29)                  11.315%   497.593
Rajasthan-4   (9,15)                  10.736%   485.587
Rajasthan-5   (17,8)                  8.166%    393.479

(a) Weekly Data (b) Daily Data

Figure 11: Rajasthan-1 ARMA Model Forecasts

6.4.4 Autoregressive Integrated Moving Average (ARIMA) models


One shortcoming of the ARMA model is that it cannot handle non-stationary data. The autoregressive integrated moving average model adds an extra feature, differencing, to the ARMA model, thus overcoming this problem. Each ARIMA model uses three hyperparameters (p, d, q), of which p and q have meanings similar to those in the ARMA model, while d represents the number of times the data needs to be differenced to produce a stationary output.
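
As a small illustration of differencing (on made-up numbers, not the GHI data), d applications of np.diff produce the d-times-differenced series:

import numpy as np

X = np.array([10.0, 12.0, 15.0, 14.0])
X_diff = np.diff(X, n=1)  # first differences: [ 2.  3. -1.]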
As our series was already stationary, no differencing was needed; thus d = 0. But for d = 0, ARIMA models are equivalent to ARMA models. Hence, we have not shown the results for ARIMA models separately, as they are the same as those of the ARMA models.

6.4.5 Seasonal Autoregressive Integrated Moving Average (SARIMA) models


The Seasonal Autoregressive Integrated Moving Average model is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component. It adds three seasonal hyperparameters (P, D, Q) and a seasonality hyperparameter m to the three hyperparameters of ARIMA.
Thus, each SARIMA model is characterized by the hyperparameters (p,d,q)(P,D,Q,m), where p, d and q have meanings similar to the ARIMA model and the other hyperparameters are used as follows:

• P: seasonal autoregressive order


• D: seasonal difference order
• Q: seasonal moving average order

• m: number of time steps in seasonal data


As the SARIMA model is computationally expensive, we could not build it for the daily data. Hyperparameter optimization through grid search was not computationally feasible even for the weekly data, so we used a model with fixed small hyperparameters there. The results of the SARIMA model are as follows:

Table 12: SARIMA Model Results for Weekly Data

Region        Hyperparameters (p,d,q)(P,D,Q,m)   MAPE     MAE
Rajasthan-1   (1,0,1)(1,1,1,52)                  5.927%   2397.875
Rajasthan-2   (1,0,1)(1,1,1,52)                  6.453%   2468.270
Rajasthan-3   (1,0,1)(1,1,1,52)                  6.504%   2512.564
Rajasthan-4   (1,0,1)(1,1,1,52)                  6.152%   2357.052
Rajasthan-5   (1,0,1)(1,1,1,52)                  4.859%   1916.109

Figure 12: Rajasthan-1 SARIMA Model Weekly Forecasts

7 Conclusions
• Weekly series: For region-1, a highly tuned ARMA model gave the best result (lowest MAPE). For the other regions, however, a basic SARIMA model gave the best results.

• Daily series: Highly tuned ARMA models gave the best results for all regions.
• Weekly forecasting was found to be more accurate (with any model) than daily forecasting, because daily data has much more random variation than weekly data.
• Given faster computers, the hyperparameters of the SARIMA model could be optimized to give even better results, and SARIMA could also be trained on the daily data.

References
[1] Autocorrelation. Wikipedia, The Free Encyclopedia.
[2] Autoregressive model. Wikipedia, The Free Encyclopedia.
[3] Autoregressive-moving-average model. Wikipedia, The Free Encyclopedia.
[4] Kolmogorov-Smirnov goodness-of-fit test. NIST/SEMATECH e-Handbook of Statistical Methods. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm.
[5] Moving-average model. Wikipedia, The Free Encyclopedia.
[6] Partial autocorrelation function. Wikipedia, The Free Encyclopedia.
[7] Solar power in India. Wikipedia, The Free Encyclopedia.
[8] Christopher Baum. Tests for stationarity of a time series. Stata Technical Bulletin, 10(57), 2011.
[9] Jamal Fattah, Latifa Ezzine, Zineb Aman, Haj El Moussami and Abdeslam Lachhab. Forecasting of demand using ARIMA model. International Journal of Engineering Business Management, 10, 2018.

Appendices
A Code
A.1 Getting Daily Data
import pandas as pd

frames = []
dy = 0
for i in range(0, 15):
    try:
        df = pd.read_csv("15396_26.65_71.65_" + str(2000 + i) + ".csv", header=2)
    except FileNotFoundError:
        # Data for this year is missing; skip it but keep the day codes aligned.
        dy = dy + 365
        continue
    # Keep only the columns needed for daily aggregation.
    df.drop(df.columns.difference(['GHI', 'Year', 'Month', 'Day']), axis=1, inplace=True)
    yr = 2000 + i
    for mon in range(1, 13):
        for day in range(1, 32):
            daydata = df[(df['Year'] == yr) & (df['Month'] == mon) & (df['Day'] == day)]
            if len(daydata) == 0:
                continue
            # Sum the hourly values to get the daily total.
            sumdata = daydata.sum(axis=0)
            sumdata['Year'] = int(yr)
            sumdata['Month'] = int(mon)
            sumdata['Day'] = int(day)
            sumdata['DayCode'] = int(dy)
            frames.append(sumdata.to_frame().transpose())
            dy = dy + 1
dailyData = pd.concat(frames)

A.2 Getting Weekly Data


import pandas as pd

week = 0
frames = []
# Assume df contains the daily data created by the previous code.
df.drop(df.columns.difference(['GHI']), axis=1, inplace=True)
while len(df) >= 7:
    # Sum each block of 7 consecutive days to get the weekly total.
    first7 = df.iloc[0:7, :]
    sumweek = first7.sum(axis=0)
    sumweek['Week'] = week
    week = week + 1
    frames.append(sumweek.to_frame().transpose())
    df = df.iloc[7:, :]
weeklyData = pd.concat(frames)

A.3 Plotting Data


import matplotlib.pyplot as plt

# Assume df contains the preprocessed daily/weekly data:
# df['GHI'] has the GHI values,
# df['Code'] has the coded (starting from 0) day/week values.
Y = df['GHI'].values
X = df['Code'].values
plt.plot(X, Y)
plt.xlabel("Code")
plt.ylabel("GHI")
plt.show()

A.4 KS-Test

from scipy.stats import kstest
import scipy.stats as st

# Assume df contains the preprocessed daily/weekly data;
# df['GHI'] has the GHI values.
data = df['GHI'].values
dist_names = ['weibull_min', 'weibull_max', 'norm', 'gamma', 'expon', 'lognorm', 'beta']
for distnm in dist_names:
    dist = getattr(st, distnm)
    # Fit the distribution to the data, then test the fit with the KS test.
    param = dist.fit(data)
    out = kstest(data, distnm, args=param)
    print(distnm + ': p-value = ' + str(out.pvalue))

A.5 Plotting Best Fit Beta Distribution


import scipy.stats as st
import matplotlib.pyplot as plt

# Assume df contains the preprocessed weekly data;
# df['GHI'] has the GHI values.
data = df['GHI']
dist_name = 'beta'
dist = getattr(st, dist_name)
# Fit the beta distribution and evaluate its PDF over the data range.
param = dist.fit(data)
x = range(0, 55000)
fitted_data = dist.pdf(x, *param)
plt.figure(figsize=(12, 8))
plt.hist(data, density=True, bins=25)
plt.plot(x, fitted_data, label=dist_name)
plt.legend()
plt.show()

A.6 ADF Test


import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Assume df contains the preprocessed daily/weekly data;
# df['GHI'] has the GHI values.
timeseries = df['GHI'].values
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used',
                                         'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)

A.7 KPSS Test


import pandas as pd
from statsmodels.tsa.stattools import kpss

# Assume df contains the preprocessed daily/weekly data;
# df['GHI'] has the GHI values.
timeseries = df['GHI'].values
kpsstest = kpss(timeseries, regression='c', nlags="auto")
kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic', 'p-value', 'Lags Used'])
for key, value in kpsstest[3].items():
    kpss_output['Critical Value (%s)' % key] = value
print(kpss_output)

A.8 Time Series Decomposition

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Assume df contains the preprocessed daily/weekly data;
# df['GHI'] has the GHI values.
X = df['GHI'].values
# period=365 is used for the daily data (e.g. period=52 for weekly data).
ans = seasonal_decompose(X, model='additive', period=365)
ans.plot()
plt.show()

A.9 MAPE/MAE Calculation


import numpy as np

def mean_absolute_percentage_error(true, pred):
    true = np.array(true)
    pred = np.array(pred)
    return np.mean(np.abs((true - pred) / true)) * 100

def mean_absolute_error(true, pred):
    true = np.array(true)
    pred = np.array(pred)
    return np.mean(np.abs(true - pred))

A.10 AR/MA/ARMA Models


import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARIMA
# Note: statsmodels.tsa.arima_model is deprecated in newer statsmodels;
# statsmodels.tsa.arima.model.ARIMA provides the equivalent interface.

# Assume df contains the preprocessed daily/weekly data:
# df['GHI'] has the GHI values,
# df['Code'] has the coded (starting from 0) day/week values.
X = df['GHI'].values
# Calculating size of training set.
size = int(len(X) * 0.8)
# Getting the test set.
test = X[size:len(X)]
# Setting hyperparameters (q = 0 gives a pure AR model, p = 0 a pure MA model).
p, d, q = 1, 0, 27
# Creating model.
model = ARIMA(X[0:size], order=(p, d, q))
# Fitting model.
results = model.fit()
# Forecasting; this interface returns (forecast, stderr, conf_int).
predictions = results.forecast(len(test))[0]

# Plotting forecasts.
plt.figure(figsize=(12, 8))
plt.plot(df['Code'][size:], X[size:], label='Actual Value')
plt.plot(range(size, len(X)), predictions, label='Forecast')
plt.legend()
plt.show()

# Calculating MAPE and MAE (functions defined in Appendix A.9).
mape = mean_absolute_percentage_error(test, predictions)
mae = mean_absolute_error(test, predictions)

A.11 SARIMA Models


import matplotlib.pyplot as plt
import statsmodels.api as sm

# Assume df contains the preprocessed weekly data:
# df['GHI'] has the GHI values,
# df['Code'] has the coded (starting from 0) week values.
X = df['GHI'].values
# Calculating size of training set.
size = int(len(X) * 0.8)
# Getting the test set.
test = X[size:len(X)]
# Setting hyperparameters.
p, d, q = 1, 0, 1
m = 52
P, D, Q = 1, 1, 1
# Creating model.
model = sm.tsa.statespace.SARIMAX(X[0:size], order=(p, d, q), seasonal_order=(P, D, Q, m))
# Fitting model.
results = model.fit()
# Forecasting.
predictions = results.forecast(len(test))

# Plotting forecasts.
plt.figure(figsize=(12, 8))
plt.plot(df['Code'][size:], X[size:], label='Actual Value')
plt.plot(range(size, len(X)), predictions, label='Forecast')
plt.legend()
plt.show()

# Calculating MAPE and MAE (functions defined in Appendix A.9).
mape = mean_absolute_percentage_error(test, predictions)
mae = mean_absolute_error(test, predictions)

B Other Results

Figure 13: Correlation Plot of Rajasthan-2

(a) Daily Data (b) Weekly Data

Figure 14: Rajasthan-2 Data Plots

Figure 15: Beta-Distribution Fit for Weekly Data from Rajasthan-2

Figure 16: Time Series Decomposition of Weekly Data from Rajasthan-2

Figure 17: Time Series Decomposition of Daily Data from Rajasthan-2

(a) Weekly Data (b) Daily Data

Figure 18: Rajasthan-2 AR Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 19: Rajasthan-2 MA Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 20: Rajasthan-2 ARMA Model Forecasts

Figure 21: Rajasthan-2 SARIMA Model Weekly Forecasts

Figure 22: Correlation Plot of Rajasthan-3

(a) Daily Data (b) Weekly Data

Figure 23: Rajasthan-3 Data Plots

Figure 24: Beta-Distribution Fit for Weekly Data from Rajasthan-3

Figure 25: Time Series Decomposition of Weekly Data from Rajasthan-3

Figure 26: Time Series Decomposition of Daily Data from Rajasthan-3

(a) Weekly Data (b) Daily Data

Figure 27: Rajasthan-3 AR Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 28: Rajasthan-3 MA Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 29: Rajasthan-3 ARMA Model Forecasts

Figure 30: Rajasthan-3 SARIMA Model Weekly Forecasts

Figure 31: Correlation Plot of Rajasthan-4

(a) Daily Data (b) Weekly Data

Figure 32: Rajasthan-4 Data Plots

Figure 33: Beta-Distribution Fit for Weekly Data from Rajasthan-4

Figure 34: Time Series Decomposition of Weekly Data from Rajasthan-4

Figure 35: Time Series Decomposition of Daily Data from Rajasthan-4

(a) Weekly Data (b) Daily Data

Figure 36: Rajasthan-4 AR Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 37: Rajasthan-4 MA Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 38: Rajasthan-4 ARMA Model Forecasts

Figure 39: Rajasthan-4 SARIMA Model Weekly Forecasts

Figure 40: Correlation Plot of Rajasthan-5

(a) Daily Data (b) Weekly Data

Figure 41: Rajasthan-5 Data Plots

Figure 42: Beta-Distribution Fit for Weekly Data from Rajasthan-5

Figure 43: Time Series Decomposition of Weekly Data from Rajasthan-5

Figure 44: Time Series Decomposition of Daily Data from Rajasthan-5

(a) Weekly Data (b) Daily Data

Figure 45: Rajasthan-5 AR Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 46: Rajasthan-5 MA Model Forecasts

(a) Weekly Data (b) Daily Data

Figure 47: Rajasthan-5 ARMA Model Forecasts

Figure 48: Rajasthan-5 SARIMA Model Weekly Forecasts

