# Case Study: Crude Oil Production Forecasting
## Context
The world economy relies heavily on hydrocarbons, particularly oil, to provide the energy required for transportation and other industries. Crude oil production is considered one of the most important indicators of the global economy. This dependence on oil, combined with its finite nature, poses complex problems, including the estimation of future production patterns.
Crude oil production forecasting is an important input into decision-making and investment scenario evaluation, both of which are crucial for oil-producing countries. Governments and businesses spend considerable time and resources on production forecasts that help identify opportunities and decide on the best way forward.
## Objective
In this case study, we will analyze historical oil production data from 1992 to 2018 for a single country and forecast its future production. We will build time series forecasting models using the AR, MA, ARMA, and ARIMA approaches.
## Data Dictionary
The dataset that we will be using is ‘Crude Oil Production by Country’. This dataset contains the
yearly oil production of 222 countries, but for simplicity, we will use only one country to forecast
its future oil production.
We also need to install the pmdarima library to successfully execute the last cell of this case study. Once the installation cell below runs successfully, restart the kernel or the Jupyter Notebook before importing the libraries. The installation cell only needs to be run once.
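The installation cell itself is not included in this export; a minimal version (assuming pip is available inside the notebook environment) would be:

[ ]: # Installing pmdarima (needs to be run only once)
!pip install pmdarima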
[ ]: # Version check
import statsmodels
statsmodels.__version__
[ ]: '0.13.2'
[ ]: # Importing the libraries required for data manipulation, visualization, and modeling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics import tsaplots
from statsmodels.tsa.arima.model import ARIMA
Let’s load the dataset
[ ]: data = pd.read_csv('Crude+Oil+Production+by+Country.csv')
data.head()
[ ]: Country 1992 1993 1994 1995 1996 1997 1998 1999 2000 … \
0 United States 7171 6847 6662 6560 6465 6451 6252 5881 5822 …
1 Saudi Arabia 8332 8198 8120 8231 8218 8362 8389 7833 8404 …
2 Russia 7632 6730 6135 5995 5850 5920 5854 6079 6479 …
3 Canada 1605 1679 1746 1805 1837 1922 1981 1907 1977 …
4 Iraq 425 512 553 560 579 1155 2150 2508 2571 …
2018
0 10962.0
1 10425.0
2 10759.0
3 4264.0
4 4613.0
[5 rows x 28 columns]
Since there are observations for 222 countries, we have 222 different time series. We will select only one of them for forecasting in this project.
Below, we load the time series for a single country, the United States. This is an arbitrary choice to start with; you can choose any other country and re-run the notebook to see how the model parameters (p, d, and q) change from country to country.
[ ]: # Using loc with index = 0 to fetch the data for United States from the original dataset
united_states = data.loc[0]

# Dropping the variable Country, as we only need the time and production information to build the model
united_states = pd.DataFrame(united_states).drop(['Country'])

# Renaming the column and converting the index to yearly timestamps (to match the output below)
united_states.columns = ['OIL PRODUCTION']
united_states.index = pd.to_datetime(united_states.index, format = '%Y')
united_states.index.name = 'YEAR'

# Converting the data type for variable OIL PRODUCTION to integer
united_states['OIL PRODUCTION'] = united_states['OIL PRODUCTION'].astype(int)

# Checking the time series crude oil production data for United States
united_states.head()
[ ]: OIL PRODUCTION
YEAR
1992-01-01 7171
1993-01-01 6847
1994-01-01 6662
1995-01-01 6560
1996-01-01 6465
[ ]: # Plotting the crude oil production time series for United States
united_states.plot(figsize = (16, 6))
plt.show()
• The above plot shows that the oil production of the United States was declining from the early 1990s to the mid-2000s but has been increasing almost steadily since then.
• The higher oil production may be driven by a growing population and, hence, increasing demand for transportation and other needs.
Let's now decompose the above time series into its components, i.e., trend, seasonality, and white noise. Since this data has a yearly frequency, we do not expect any seasonal pattern after decomposing the time series.
The seasonal_decompose function decomposes the time series into trend, seasonal, and white noise components using moving averages. It first estimates the trend and removes it from the series; the average of this de-trended series for each period is returned as the seasonal component. Whatever remains after removing the trend and seasonal components is the residual, or white noise, component.
[ ]: # Decomposing the time series into trend, seasonal, and residual components
decomposition = sm.tsa.seasonal_decompose(united_states)

# Plotting each component on a separate axis
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize = (16, 8))
decomposition.trend.plot(ax = ax1)
decomposition.seasonal.plot(ax = ax2)
decomposition.resid.plot(ax = ax3)
[ ]: <AxesSubplot: xlabel='YEAR'>
As we can see from the above plot, the seasonal and residual components are zero, as this time
series has a yearly frequency. Check out this link to see what a time series decomposition plot looks
like for a time series with seasonal patterns.
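The cell that creates the train-test split is not shown in this export. Based on the rest of the notebook (the model summaries use a training sample from 1992 to 2012, and the test forecasts start in 2013), a minimal split consistent with what follows could be:

[ ]: # Splitting the series: observations up to 2012 form the training set, 2013 onwards the test set
train_data = united_states.loc[:'2012-01-01']
test_data = united_states.loc['2013-01-01':]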
Now, let’s visualize the train and the test data in the same plot
[ ]: # Creating a subplot space
fig, ax = plt.subplots(figsize = (16, 6))

# Plotting the train and the test data on the same axis
train_data.plot(ax = ax, label = 'train')
test_data.plot(ax = ax, label = 'test')

# Showing the point in time which divides the original data into train and test
plt.axvline(x = '2012-01-01', color = 'black', linestyle = '--')
plt.show()
[ ]: # Importing ADF test from statsmodels package
from statsmodels.tsa.stattools import adfuller

# Implementing the ADF test on the training data and printing the test statistic,
# the p-value, and the critical values
result = adfuller(train_data['OIL PRODUCTION'])
print(result[0])
print(result[1])
print(result[4])

-0.5829098523091656
0.8747971281795592
{'1%': -4.01203360058309, '5%': -3.1041838775510207, '10%': -2.6909873469387753}
Here, the p-value is around 0.87, which is higher than 0.05. Hence, we fail to reject the null hypothesis of the ADF test (that the series has a unit root), and we conclude that the time series is non-stationary. We can also confirm this by comparing the ADF statistic above with the critical values and by visually inspecting the time series.
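Since we will repeat this test on several differenced versions of the series below, a small helper of this kind (hypothetical, not part of the original notebook) keeps the checks consistent:

[ ]: # Hypothetical helper: run the ADF test on a series and print a stationarity verdict
def check_stationarity(series, alpha = 0.05):
    stat, p_value = adfuller(series)[:2]
    verdict = 'stationary' if p_value < alpha else 'non-stationary'
    print('ADF Statistic: {:.4f}, p-value: {:.4f} -> {}'.format(stat, p_value, verdict))

check_stationarity(train_data['OIL PRODUCTION'])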
[ ]: # Plotting the training data and printing the ADF test results
fig, ax = plt.subplots(figsize = (16, 6))
train_data.plot(ax = ax)
plt.show()

print('ADF Statistic:', result[0])
print('p-value:', result[1])

ADF Statistic: -0.5829098523091656
p-value: 0.8747971281795592
Let's now take the first-order difference of the data and check whether it becomes stationary.
[ ]: # Taking the 1st order differencing of the time series
train_data_stationary = train_data.diff().dropna()

# Implementing ADF test on the first order differenced time series data
result = adfuller(train_data_stationary['OIL PRODUCTION'])

# Plotting the differenced series and printing the p-value
fig, ax = plt.subplots(figsize = (16, 6))
train_data_stationary.plot(ax = ax)
plt.show()
print('p-value:', result[1])
The first-order differenced series is still not stationary, so let's take the second-order difference.

[ ]: # Taking the 2nd order differencing of the time series
train_data_stationary = train_data.diff().diff().dropna()

# Implementing ADF test on the second order differenced time series data
result = adfuller(train_data_stationary['OIL PRODUCTION'])

# Plotting the differenced series and printing the p-value
fig, ax = plt.subplots(figsize = (16, 6))
train_data_stationary.plot(ax = ax)
plt.show()
print('p-value:', result[1])
The second-order differenced series is also not stationary yet, so let's difference the series once more (as noted later in the notebook, the series becomes stationary after triple differencing).

[ ]: # Taking the 3rd order differencing of the time series
train_data_stationary = train_data.diff().diff().diff().dropna()

# Implementing ADF test on the third order differenced time series data
result = adfuller(train_data_stationary['OIL PRODUCTION'])

# Plotting the differenced series and printing the p-value
fig, ax = plt.subplots(figsize = (16, 6))
train_data_stationary.plot(ax = ax)
plt.show()
print('p-value:', result[1])
[ ]: # Creating and plotting the ACF and PACF charts starting from lag = 1 till lag = 8
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16, 4))
tsaplots.plot_acf(train_data_stationary, zero = False, ax = ax1, lags = 8)
tsaplots.plot_pacf(train_data_stationary, zero = False, ax = ax2, lags = 8)
plt.show()
From the above plots, this stationary time series does not appear to follow a pure AR or a pure MA model: neither the ACF nor the PACF clearly tails off or cuts off after a particular lag, which suggests an ARMA or ARIMA model. The PACF does seem to cut off at lag 2, but we cannot be sure because the spike is very close to the significance boundary. So, to find the optimal values of p, d, and q, we need to perform a hyper-parameter search (a minimal sketch of such a search is shown after the list below).
Below, we will try several different modeling techniques on this time series:
- AR (p)
- MA (q)
- ARMA (p, q)
- ARIMA (p, d, q)

and then check which one performs best.
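As an illustration of what such a hyper-parameter search can look like (this cell is a sketch and not part of the original notebook; it assumes train_data_stationary from above and ranks candidate orders by AIC only):

[ ]: # Illustrative sketch: small grid search over (p, q) on the differenced series, ranked by AIC
import itertools

aic_scores = {}
for p, q in itertools.product(range(4), range(4)):
    try:
        fitted = ARIMA(train_data_stationary, order = (p, 0, q)).fit()
        aic_scores[(p, q)] = fitted.aic
    except Exception:
        continue  # skip orders that fail to fit

# Showing the five orders with the lowest (best) AIC
print(sorted(aic_scores.items(), key = lambda kv: kv[1])[:5])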
## AR Modeling
We will now build several AR models with different lag orders (p) and check whether an AR model is a good fit. A generalized equation for the AR(p) model is shown below.
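$$y_t = a_1 y_{t-1} + a_2 y_{t-2} + \dots + a_p y_{t-p} + \epsilon_t$$

where $a_i$ are the autoregressive coefficients and $\epsilon_t$ is white noise.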
[ ]: # We are using the ARIMA function to build the AR models, so we pass the stationary
# (differenced) time series obtained above. We keep the q parameter as 0 so that the
# model acts as a pure AR model.

# Creating AR models with p = 1, 2, 3, and 4
ar_1_model = ARIMA(train_data_stationary, order = (1, 0, 0))
ar_2_model = ARIMA(train_data_stationary, order = (2, 0, 0))
ar_3_model = ARIMA(train_data_stationary, order = (3, 0, 0))
ar_4_model = ARIMA(train_data_stationary, order = (4, 0, 0))

# Fitting all the AR models
ar_1_results = ar_1_model.fit()
ar_2_results = ar_2_model.fit()
ar_3_results = ar_3_model.fit()
ar_4_results = ar_4_model.fit()
Since we passed the differenced (stationary) time series while fitting the above AR models, the forecasts we get will also be on that differenced scale. Therefore, to get the forecasts on the original scale, we need to inverse transform them. The function below performs that inverse transformation.
[ ]: # Helper function to inverse transform the forecasts and plot them against the train and test data
def plot_predicted_output(results, ax):
    # Taking the cumulative sum of the forecasted (differenced) values and adding the last element
    # of the training data to the forecasted values to get back to the original scale
    predictions = np.cumsum(np.cumsum(results.predict(start = 19, end = 25))) + train_data.iloc[-1][0]
    # Computing the AIC and RMSE metrics for the model and printing them in the title of the plot
    rmse = np.sqrt(np.mean((test_data['OIL PRODUCTION'].values - predictions.values[:len(test_data)]) ** 2))
    ax.set_title('AIC: {:.2f}, RMSE: {:.2f}'.format(results.aic, rmse))
    # Plotting the train, test, and forecasted values on the given axis
    train_data.plot(ax = ax, label = 'train')
    test_data.plot(ax = ax, label = 'test')
    predictions.plot(ax = ax, label = 'predicted')
    ax.legend()
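For reference, the two metrics shown in the plot titles are the Akaike Information Criterion (AIC) and the root mean squared error (RMSE):

$$\text{AIC} = 2k - 2\ln(\hat{L}), \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $k$ is the number of estimated parameters, $\hat{L}$ is the maximized likelihood, and the sum runs over the $n$ test observations $y_i$ and their forecasts $\hat{y}_i$.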
Now, let's plot the forecasted values from all four models and compare the model outputs.
[ ]: # Creating a 2 x 2 subplot space for the four AR models
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize = (16, 10))

# Plotting the forecasted values along with train and test for all the models
plot_predicted_output(ar_1_results, ax1)
plot_predicted_output(ar_2_results, ax2)
plot_predicted_output(ar_3_results, ax3)
plot_predicted_output(ar_4_results, ax4)
plt.show()
As we can see from the above results, the AIC values of the four models we have developed are approximately the same. The RMSE, however, is lowest for the AR(4), i.e., ARIMA(4, 0, 0), model, and it is significantly lower than for the other three models. Based on this analysis, AR(4), or ARIMA(4, 0, 0), looks like the best model if we only want to use the AR component while modeling.
Let's now check the model summary of this AR(4), or ARIMA(4, 0, 0), model.
[ ]: ar_4_results.summary()
[ ]: <class 'statsmodels.iolib.summary.Summary'>
"""
SARIMAX Results
==============================================================================
Dep. Variable: OIL PRODUCTION No. Observations: 18
Model: ARIMA(4, 0, 0) Log Likelihood -120.495
Date: Sat, 29 Oct 2022 AIC 252.990
Time: 13:34:04 BIC 258.332
Sample: 01-01-1995 HQIC 253.727
- 01-01-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 8.0306 17.221 0.466 0.641 -25.722 41.783
ar.L1 -0.8563 0.371 -2.308 0.021 -1.583 -0.129
ar.L2 -1.0208 0.534 -1.911 0.056 -2.067 0.026
ar.L3 -0.3271 0.410 -0.799 0.425 -1.130 0.476
ar.L4 -0.4350 0.340 -1.280 0.201 -1.101 0.231
sigma2 3.249e+04 2.03e+04 1.600 0.110 -7311.526 7.23e+04
===================================================================================
Ljung-Box (L1) (Q):                   0.57   Jarque-Bera (JB):                  0.37
Prob(Q):                              0.45   Prob(JB):                          0.83
Heteroskedasticity (H):               2.22   Skew:                              0.10
Prob(H) (two-sided):                  0.36   Kurtosis:                          2.33
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
"""
## MA Modeling
Now, we will build several MA models at different lags and check whether an MA model is a better fit than the AR models we have built so far. A generalized equation for the MA(q) model is shown below.
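$$y_t = m_1 \epsilon_{t-1} + m_2 \epsilon_{t-2} + \dots + m_q \epsilon_{t-q} + \epsilon_t$$

where $m_i$ are the moving-average coefficients and $\epsilon_t$ is white noise.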
[ ]: # We are using the ARIMA function to build the MA models, so we pass the stationary
# (differenced) time series obtained above. We keep the p parameter as 0 so that the
# model acts as a pure MA model.

# Creating MA models with q = 1, 2, 3, and 4
ma_1_model = ARIMA(train_data_stationary, order = (0, 0, 1))
ma_2_model = ARIMA(train_data_stationary, order = (0, 0, 2))
ma_3_model = ARIMA(train_data_stationary, order = (0, 0, 3))
ma_4_model = ARIMA(train_data_stationary, order = (0, 0, 4))

# Fitting all the MA models
ma_1_results = ma_1_model.fit()
ma_2_results = ma_2_model.fit()
ma_3_results = ma_3_model.fit()
ma_4_results = ma_4_model.fit()
[ ]: # Creating a 2 x 2 subplot space for the four MA models
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize = (16, 10))

# Plotting the forecasted values along with train and test for all the models
plot_predicted_output(ma_1_results, ax1)
plot_predicted_output(ma_2_results, ax2)
plot_predicted_output(ma_3_results, ax3)
plot_predicted_output(ma_4_results, ax4)
plt.show()
As we can see from the above plots, all the models developed so far again have comparable AIC values, but the RMSE is significantly lower for the MA(2) model than for all the other models. So, the best model we have obtained using MA modeling is MA(2), or ARIMA(0, 0, 2). (Note that the cutoff at lag 2 we observed earlier was in the PACF, which points to the AR order; it is a cutoff in the ACF that would indicate an MA order of 2.)
Let's analyze the model summary for MA(2), or ARIMA(0, 0, 2), below.
[ ]: ma_2_results.summary()
[ ]: <class 'statsmodels.iolib.summary.Summary'>
"""
SARIMAX Results
==============================================================================
Dep. Variable: OIL PRODUCTION No. Observations: 18
Model: ARIMA(0, 0, 2) Log Likelihood -122.361
Date: Sat, 29 Oct 2022 AIC 252.721
Time: 13:34:04 BIC 256.283
Sample: 01-01-1995 HQIC 253.212
- 01-01-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 10.8987 12.355 0.882 0.378 -13.317 35.115
ma.L1 -1.6981 3.184 -0.533 0.594 -7.939 4.543
ma.L2 0.9760 3.592 0.272 0.786 -6.064 8.015
sigma2 3.458e+04 1.1e+05 0.315 0.752 -1.8e+05 2.49e+05
===================================================================================
Ljung-Box (L1) (Q):                   0.38   Jarque-Bera (JB):                  0.84
Prob(Q):                              0.54   Prob(JB):                          0.66
Heteroskedasticity (H):               1.59   Skew:                              0.51
Prob(H) (two-sided):                  0.59   Kurtosis:                          3.30
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
"""
## ARMA Modeling

Next, we will build several ARMA models with different combinations of the p and q parameters on the differenced time series data, and we will evaluate those models based on AIC and RMSE. Below is a generalized equation for the ARMA model.
$$y_t = a_1 y_{t-1} + \dots + a_p y_{t-p} + m_1 \epsilon_{t-1} + \dots + m_q \epsilon_{t-q} + \epsilon_t$$
[ ]: # We are using the ARIMA function here, so we pass the stationary (differenced) time series obtained above

# Creating ARMA models with different combinations of the p and q parameters
ar_2_ma_1_model = ARIMA(train_data_stationary, order = (2, 0, 1))
ar_2_ma_2_model = ARIMA(train_data_stationary, order = (2, 0, 2))
ar_3_ma_2_model = ARIMA(train_data_stationary, order = (3, 0, 2))
ar_2_ma_3_model = ARIMA(train_data_stationary, order = (2, 0, 3))

# Fitting all the ARMA models
ar_2_ma_1_results = ar_2_ma_1_model.fit()
ar_2_ma_2_results = ar_2_ma_2_model.fit()
ar_3_ma_2_results = ar_3_ma_2_model.fit()
ar_2_ma_3_results = ar_2_ma_3_model.fit()
[ ]: # Creating a 2 x 2 subplot space for the four ARMA models
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize = (16, 10))

# Plotting the forecasted values along with train and test for all the models
plot_predicted_output(ar_2_ma_1_results, ax1)
plot_predicted_output(ar_2_ma_2_results, ax2)
plot_predicted_output(ar_3_ma_2_results, ax3)
plot_predicted_output(ar_2_ma_3_results, ax4)
plt.show()
As we can see from the above plots, all the models developed so far again have comparable AIC, but for one specific model, ARIMA(2, 0, 1), the RMSE is significantly lower than for the models developed above. It is also evident from the plots that the forecasted values from ARIMA(2, 0, 1) are closer to the test data than those from the other models.
Let's analyze the summary for the ARIMA(2, 0, 1) model.
[ ]: ar_2_ma_1_results.summary()
[ ]: <class 'statsmodels.iolib.summary.Summary'>
"""
SARIMAX Results
==============================================================================
Dep. Variable: OIL PRODUCTION No. Observations: 18
Model: ARIMA(2, 0, 1) Log Likelihood -122.097
Date: Sat, 29 Oct 2022 AIC 254.195
Time: 13:34:05 BIC 258.647
Sample: 01-01-1995 HQIC 254.809
- 01-01-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 8.4986 35.232 0.241 0.809 -60.555 77.553
ar.L1 -1.2899 0.178 -7.251 0.000 -1.639 -0.941
ar.L2 -0.8183 0.250 -3.274 0.001 -1.308 -0.328
ma.L1 0.9929 5.337 0.186 0.852 -9.468 11.454
sigma2 3.572e+04 1.83e+05 0.196 0.845 -3.22e+05 3.94e+05
===================================================================================
Ljung-Box (L1) (Q):                   1.47   Jarque-Bera (JB):                  0.37
Prob(Q):                              0.23   Prob(JB):                          0.83
Heteroskedasticity (H):               1.75   Skew:                              0.12
Prob(H) (two-sided):                  0.51   Kurtosis:                          2.34
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
"""
## ARIMA Modeling

[ ]: # Converting the training data to float to avoid data type issues while fitting the ARIMA models
train_data = train_data.astype('float32')
We are using the ARIMA function here, so we do not need to pass a stationary time series. We can simply pass the original (undifferenced) series and set the parameter d = 3, as we already know that the original time series becomes stationary after triple differencing.
[ ]: # Creating ARIMA models with different combinations of p and q, keeping d = 3
ar_2_d_3_ma_1_model = ARIMA(train_data, order = (2, 3, 1))
ar_1_d_3_ma_2_model = ARIMA(train_data, order = (1, 3, 2))
ar_2_d_3_ma_2_model = ARIMA(train_data, order = (2, 3, 2))
ar_3_d_3_ma_2_model = ARIMA(train_data, order = (3, 3, 2))

# Fitting all the models implemented above
ar_2_d_3_ma_1_results = ar_2_d_3_ma_1_model.fit()
ar_1_d_3_ma_2_results = ar_1_d_3_ma_2_model.fit()
ar_2_d_3_ma_2_results = ar_2_d_3_ma_2_model.fit()
ar_3_d_3_ma_2_results = ar_3_d_3_ma_2_model.fit()
Before we plot the forecasted values, we need to update the plot_predicted_output() function: because the ARIMA model handles the differencing internally, its predictions are already on the original scale, so we no longer need the cumulative-sum operations to inverse transform the predicted values.
[ ]: # Updated helper function: the ARIMA forecasts are already on the original scale
def plot_predicted_output_new(results, ax):
    # Forecasting over the test period directly on the original scale (no inverse transformation needed)
    predictions = results.predict(start = len(train_data), end = len(train_data) + len(test_data) - 1)
    # Computing the AIC and RMSE metrics for the model and printing them in the title of the plot
    rmse = np.sqrt(np.mean((test_data['OIL PRODUCTION'].values - predictions.values) ** 2))
    ax.set_title('AIC: {:.2f}, RMSE: {:.2f}'.format(results.aic, rmse))
    # Plotting the train, test, and forecasted values on the given axis
    train_data.plot(ax = ax, label = 'train')
    test_data.plot(ax = ax, label = 'test')
    predictions.plot(ax = ax, label = 'predicted')
    ax.legend()
[ ]: # Creating a 2 x 2 subplot space for the four ARIMA models
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize = (16, 10))

# Plotting the forecasted values along with train and test for all the models
plot_predicted_output_new(ar_2_d_3_ma_1_results, ax1)
plot_predicted_output_new(ar_1_d_3_ma_2_results, ax2)
plot_predicted_output_new(ar_2_d_3_ma_2_results, ax3)
plot_predicted_output_new(ar_3_d_3_ma_2_results, ax4)
plt.show()
From the above analysis, we can see that ARIMA(2, 3, 2) is the best of these models: its AIC is comparable to that of the other models and its RMSE is lower than all of them.
Let's analyze the model summary for ARIMA(2, 3, 2).
[ ]: ar_2_d_3_ma_2_results.summary()
[ ]: <class 'statsmodels.iolib.summary.Summary'>
"""
SARIMAX Results
==============================================================================
Dep. Variable: OIL PRODUCTION No. Observations: 21
Model: ARIMA(2, 3, 2) Log Likelihood -120.713
Date: Sat, 29 Oct 2022 AIC 251.427
Time: 13:34:06 BIC 255.879
Sample: 01-01-1992 HQIC 252.041
- 01-01-2012
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.9701 0.276 -3.518 0.000 -1.511 -0.430
ar.L2 -0.5351 0.318 -1.685 0.092 -1.158 0.087
ma.L1 0.2168 7.403 0.029 0.977 -14.293 14.726
ma.L2 -0.7778 6.047 -0.129 0.898 -12.630 11.074
sigma2 2.577e+04 1.97e+05 0.131 0.896 -3.61e+05 4.12e+05
===================================================================================
Ljung-Box (L1) (Q):                   0.67   Jarque-Bera (JB):                  0.91
Prob(Q):                              0.41   Prob(JB):                          0.64
Heteroskedasticity (H):               1.85   Skew:                             -0.10
Prob(H) (two-sided):                  0.47   Kurtosis:                          1.92
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
"""
Now that we have identified the best parameters (p, d, and q) for our data, let's train the model with the same parameters on the full data for the United States and get the forecasts for the next 7 years, i.e., from 2019-01-01 to 2025-01-01.
[ ]: # Fitting the ARIMA(2, 3, 2) model on the full data for United States
final_model = ARIMA(united_states.astype('float32'), order = (2, 3, 2))
final_model_results = final_model.fit()

# Forecasting the next 7 years and plotting them along with the actual data
forecast = final_model_results.predict(start = '2019-01-01', end = '2025-01-01')
united_states.plot(label = 'actual', figsize = (16, 6))
forecast.plot(label = 'predicted')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
• The above plot shows that the model is able to identify the trend in the data and forecast the values accordingly.
• The forecast indicates that, based on the historical data, oil production is going to keep increasing for the United States.
## Conclusion
• We have built different types of models, searching for the optimal parameters of each, and compared all the models based on the evaluation metrics AIC and RMSE.
• The AIC is approximately the same for all the models, i.e., there is no significant difference in their AIC values. However, we do see significant differences between some of the models in terms of RMSE. So, for the current data, the choice of model depends mainly on RMSE.
• Overall, the ARIMA(2, 3, 2) model has given the best results, and we have used it to forecast the oil production of the United States.
[ ]: # Using auto_arima from pmdarima to search for the best order automatically
# (the exact arguments used originally are not shown; d = 3 is passed here to match the reported best model)
import pmdarima as pm

auto_arima_model = pm.auto_arima(united_states['OIL PRODUCTION'], d = 3, seasonal = False, trace = True)
print(auto_arima_model.summary())
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The auto-arima search reports the best model as ARIMA(0, 3, 1)(0, 0, 0)[0], which is different from what we chose earlier. There are two important points to remember here:
• In this best model, the last four parameters are zeros. Those are the parameters responsible for capturing seasonality in the time series. Since this time series has a yearly frequency, it is expected not to show any seasonal pattern.
• Also, auto-arima minimizes the AIC of the model rather than the RMSE. So, we need to compute the RMSE of these models manually to check whether the model has an acceptable RMSE, as sketched below. The best model from auto-arima might not have a good or acceptable RMSE score.
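A minimal sketch of such a manual RMSE check (not part of the original notebook; it assumes the train_data and test_data split from above, re-fits auto_arima on the training data only, and uses d = 3 and seasonal = False as illustrative arguments) could look like this:

[ ]: # Illustrative sketch: fit auto_arima on the training data, forecast the test period,
# and compute the RMSE of the forecasts by hand
auto_model = pm.auto_arima(train_data['OIL PRODUCTION'], d = 3, seasonal = False)
auto_forecast = auto_model.predict(n_periods = len(test_data))
rmse = np.sqrt(np.mean((test_data['OIL PRODUCTION'].values - np.asarray(auto_forecast)) ** 2))
print('Test RMSE of the auto-arima model:', round(rmse, 2))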
We can also plot and analyze the model diagnostics for the residuals, as shown below. If the residuals are normally distributed and uncorrelated with each other, then we have a good model.
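The diagnostics cell itself is not shown in this export; a minimal sketch using the plot_diagnostics method of the fitted statsmodels results object (final_model_results from above) would be:

[ ]: # Plotting the standard residual diagnostics: standardized residuals over time, their
# density, a Q-Q plot, and the residual correlogram (ACF)
final_model_results.plot_diagnostics(figsize = (16, 8))
plt.show()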
Observations:
• Top left: The residual errors seem to fluctuate around a mean of zero and have an approximately uniform variance.
• Top right: The density plot suggests that the distribution of the residuals is very close to a standard normal distribution.
• Bottom left: All the dots should fall in line with the red line; any significant deviation would imply that the distribution of the residuals is skewed.
• Bottom right: The ACF plot shows that the residual errors are not autocorrelated, as no lag other than 0 is significant. Any autocorrelation would imply that there is some pattern in the residual errors that is not explained by the model.