In the previous post we introduced time series and discussed stationarity, seasonality, and trend. Now we turn to forecasting methods such as ARIMA and SARIMA.
ARIMA stands for Auto Regressive Integrated Moving Average. We can use this model for non-seasonal data. The ARIMA model combines two parts, the autoregressive (AR) model and the moving average (MA) model, and has three parameters to consider:
ARIMA (p, d, q)
p = the number of lags (if p = 3, we use the previous three periods of the time series in the autoregressive part of the computation); p helps adjust the line fitted to forecast the series
d = the order of differencing. In the ARIMA model we transform the time series into a stationary one (a series without trend or seasonality) using differencing; d is the number of differencing passes needed to make the series stationary
q = the order of the moving average part, i.e. the lag of the error component, where the error component is the part of the time series not explained by trend or seasonality
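For intuition, here is a minimal sketch of what the d parameter does, using a made-up five-point series with a trend:
import pandas as pd
# A series with an upward trend; differencing removes the trend
s = pd.Series([10, 12, 15, 19, 24])
print(s.diff())         # first difference (d = 1): NaN, 2, 3, 4, 5
print(s.diff().diff())  # second difference (d = 2): NaN, NaN, 1, 1, 1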
ARIMA works when the series has no seasonal element; if there is one, we can reach for another model – SARIMA. SARIMA is similar to ARIMA; we just add a few parameters to account for seasonality:
SARIMA (p, d, q) (P, D, Q) m
p – order of the autoregressive part
d – degree of differencing
q – order of the moving average part
m – the number of periods in each season
(P, D, Q) – the (p, d, q) terms for the seasonal part of the time series
The model notation therefore lists the orders of the AR (p), I (d) and MA (q) components as parameters of the ARIMA function, plus the seasonal-level AR (P), I (D), MA (Q) and m parameters.
Example:
SARIMA (3,1,0) (1,1,0) 12
- The parameter m determines the lags used by P, D and Q. For example, m = 12 for monthly data suggests an annual seasonal cycle.
- P = 1 means the first seasonal lag is included in the model, i.e. t-(m×1) = t-12; with P = 2, the two most recent seasonal lags are used, t-(m×1) and t-(m×2).
- D = 1 calculates the first-order seasonal difference, and Q = 1 uses first-order seasonal errors in the model (i.e. a seasonal moving average).
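For reference, this notation maps one-to-one onto the order and seasonal_order arguments of the SARIMAX class from statsmodels, which we use later in this post. A minimal sketch for the example above, where series stands in for any univariate time series:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA(3,1,0)(1,1,0)12 -> order=(p, d, q), seasonal_order=(P, D, Q, m)
model = SARIMAX(series, order=(3, 1, 0), seasonal_order=(1, 1, 0, 12))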
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20, 6
# Load the sales data, parse the date column, and index by date
data = pd.read_csv(r'C:\Users\VFW863\Documents\drugstore.csv')
data['date'] = pd.to_datetime(data['date'])
data = data.set_index('date'); data.tail(3)
plt.plot(data)
Can you see the seasonal pattern? This matters when deciding which type of model to use: if there is no seasonality, the ordinary ARIMA model will do. Our series, however, clearly contains a seasonal component, so we will use the SARIMA model.
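One quick way to check is to decompose the series into trend, seasonal, and residual components; a sketch using statsmodels' seasonal_decompose, where period=12 assumes monthly data:
from statsmodels.tsa.seasonal import seasonal_decompose
# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(data, model='additive', period=12)
decomposition.plot()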
Before we move on to the model itself, we split the data into training and testing sets and transform the series to be stationary, i.e. we apply differencing:
train = 0.75
split = round(len(data) * train)
# First 75% of the observations for training, the rest for testing
training, testing = data[0:split], data[split:]
# Log-transform to stabilize the variance, then take the first difference
training_log = np.log(training)
training_roz = training_log.diff(periods=1).values[1:].ravel()  # drop the leading NaN; ravel() keeps the array 1-D
plt.plot(training_roz)
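To verify that the differenced series really is stationary, we can run an augmented Dickey-Fuller test (a quick sketch; the null hypothesis is non-stationarity, so a p-value below 0.05 suggests the series is stationary):
from statsmodels.tsa.stattools import adfuller
# ADF test: null hypothesis = the series has a unit root (non-stationary)
adf_stat, p_value = adfuller(training_roz)[:2]
print('ADF statistic: %.3f, p-value: %.3f' % (adf_stat, p_value))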
To determine the model parameters we can use two plots: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Autocorrelation is the correlation of a time series with its own past values. In an ACF plot, the lag number is on the X axis and the correlation coefficient (ranging from -1 to 1) on the Y axis, so the plot shows how strongly the values of the series are correlated with their lagged values. Typically an ARIMA model uses an AR term, a MA term, or both, and the ACF plot helps us decide which of these terms our time series needs.
If there is positive autocorrelation at lag 1, we lean towards an AR model
If there is negative autocorrelation at lag 1, we lean towards a MA model
Seasonal differencing accounts for seasons: it takes the difference between the current value and the value in the previous season. For monthly data, for example, the difference is the value in May 2018 minus the value in May 2017.
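In pandas a seasonal difference is simply a difference with a longer lag; a sketch for monthly data:
# Seasonal difference: each value minus the value 12 months earlier
seasonal_diff = data.diff(12).dropna()
plt.plot(seasonal_diff)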
In a purely seasonal AR model, the ACF decays slowly while the PACF cuts off to zero; seasonal AR terms are used when the seasonal autocorrelation is positive. In a purely seasonal MA model, the ACF cuts off to zero, and seasonal MA terms are used when the seasonal autocorrelation is negative.
After plotting the ACF, we move on to the partial autocorrelation function (PACF). Partial autocorrelation summarizes the relationship between an observation and the observations at earlier time steps, with the effect of the intervening observations removed. The PACF therefore describes only the direct relationship between an observation and its lag.
We will use AR terms in the model when:
- the ACF plot shows autocorrelation decaying gradually towards zero
- the PACF plot cuts off quickly towards zero
- the ACF of the stationary series is positive at lag 1
We will use MA terms in the model when:
- the series is negatively autocorrelated at lag 1
- the ACF cuts off sharply after a few lags
- the PACF decays gradually
Confidence intervals are drawn as a cone. We can limit the number of lags on the X axis to 50 to make the plots easier to read. To estimate the value of p we look at the PACF plot, and to estimate q we look at the ACF plot.
from statsmodels.tsa.stattools import acf, pacf
# Autocorrelation and partial autocorrelation values up to 50 lags
lag_acf = acf(training_roz, nlags=50, fft=False)
lag_pacf = pacf(training_roz, nlags=50, method='ols')
fig = plt.figure(figsize=(15,5))
# ACF on the left, with approximate 95% confidence bounds at +/- 1.96/sqrt(n)
ax1 = fig.add_subplot(121)
plt.axhline(y=-1.96/np.sqrt(len(training_roz)), linestyle='--', color='red')
plt.axhline(y=1.96/np.sqrt(len(training_roz)), linestyle='--', color='red')
plt.stem(lag_acf)
# PACF on the right, with the same bounds
ax2 = fig.add_subplot(122)
plt.axhline(y=-1.96/np.sqrt(len(training_roz)), linestyle='--', color='red')
plt.axhline(y=1.96/np.sqrt(len(training_roz)), linestyle='--', color='red')
plt.stem(lag_pacf)
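As an aside, statsmodels also ships ready-made ACF/PACF plots that draw the confidence cone automatically; a minimal equivalent of the plots above:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
plot_acf(training_roz, lags=50, ax=ax1)   # ACF with its confidence cone
plot_pacf(training_roz, lags=50, ax=ax2)  # PACF with its confidence cone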
From the plots above we can read off p (the AR term) and q (the MA term) for our model. Looking at the values in both plots, a good starting point is p = 1 and q = 1 (the first lags to cross the red significance lines). The parameter d depends on the differencing our data needs; since the raw series is not stationary, d must be at least 1, and we take d = 1. Looking further, both plots show a large spike at lag 12 on the X axis, which suggests a yearly season in monthly data, so m = 12. The spike is positive, so P = 1 and Q = 0, and we apply no seasonal differencing, so D = 0.
The parameters for our model are:
SARIMA (1,1,1) (1,0,0)[12]
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA(1,1,1)(1,0,0)12 fitted on the raw training data
# (SARIMAX performs the d = 1 differencing internally)
model = SARIMAX(training, order=(1,1,1), seasonal_order=(1,0,0,12), enforce_stationarity=False, enforce_invertibility=False)
model_fit = model.fit(disp=False)
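Before forecasting, it is worth glancing at the fitted coefficients and their p-values, which statsmodels exposes via the summary() method:
print(model_fit.summary())  # coefficient estimates, standard errors, AIC/BIC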
Now we can forecast pharmacy sales for the coming years:
K = len(testing)
forecast = model_fit.forecast(K)  # forecast as many periods as we held out
from sklearn.metrics import mean_squared_error
plt.figure(figsize=(15,5))
plt.plot(forecast, 'red')
plt.plot(data, 'b')
# RMSE is the square root of the mean squared error
plt.title('RMSE: %.2f' % np.sqrt(mean_squared_error(testing, forecast)))
plt.axvline(x=data.index[split], color='black')  # mark the train/test split
Our forecast matches the data for the first two years (2012-2014); after that we can see differences, namely that the forecasted sales are lower than the actual ones. Among other things, this gap may be caused by our choice of SARIMA parameters.
Our parameter combination need not be optimal; to test other values, we can change the parameters manually or use a grid search, as sketched below.
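A minimal sketch of such a grid search, scoring each candidate by AIC (lower is better); the search ranges and the fixed m = 12 are assumptions, not the only reasonable choices:
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Candidate values for (p, d, q) and the seasonal (P, D, Q); m fixed at 12
pdq = list(itertools.product(range(0, 2), repeat=3))
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in itertools.product(range(0, 2), repeat=3)]
best_aic, best_params = float('inf'), None
for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            fit = SARIMAX(training, order=order, seasonal_order=seasonal_order,
                          enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
            if fit.aic < best_aic:
                best_aic, best_params = fit.aic, (order, seasonal_order)
        except Exception:
            continue  # skip combinations that fail to converge
print('Best parameters by AIC:', best_params, 'AIC: %.2f' % best_aic)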
Generally, we can divide time series modeling into stages:
1 – Check stationarity: if a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA / SARIMA for forecasting.
2 – Difference: if the time series is not stationary, make it stationary by differencing. Take the first difference, then check stationarity again; difference as many times as needed, and remember to check for seasonal differences as well.
3 – Split the data into training and testing sets.
4 – Select the AR and MA terms: use the ACF and PACF to decide whether to include AR terms, MA terms, or both.
5 – Build the model and set the number of periods to forecast, N.
6 – Validate the model: compare the predicted values with the actual values.