A time series is a collection of data collected at intervals. They are analyzed to identify a long-term trend to forecast the future or to perform some other form of analysis. Such data set depends on time, so the assumption of linear regression with independent observations will not work here, some series show seasonality, i.e. specific behavior of data in a specific time period (e.g. increased sales of Christmas trees before Christmas).
Machine learning methods can be used to classify and predict problems with time series. Before we move on to machine learning methods, let’s look at the basic methods of predicting time series that focus on linear relationships, and to use these methods, you need to properly prepare the data.
Typically, most analysis methods assume that the time series data contains a systematic element (typically several components) and random noise (error), which complicates the detection of regular components. Therefore, most methods involve different noise filtering methods or some steps must be taken during data pre-processing.
Most of the regular components fall into two main classes – trend or seasonal component. A trend is a general systematic linear or non-linear term that may change over time. The seasonal component repeats from time to time. Both of these types of regular components are usually presented in series simultaneously. For example, sales may increase year on year, but there is a seasonal component that reflects a significant increase in sales in December and a decline in August. In addition to the trend and seasonality, there is also noise and a cyclical component, which differs from seasonality mainly as a long-term effect is also not easily predictable in terms of time – it is also often part of the trend.
Each of these components may have a different meaning in the model depending on the time series, we distinguish two main models: additive and multiplicative.
- Additive has the formula: Z (t) = T (t) + C (t) + S (t) + R (t)
- Multiplicative: Z (t) = T (t) x C (t) x S (t) x R (t)
The main difference between these models is the rate of growth.
In this post we will look at some elements of the time series: stationarity, trend and seasonality, and in the next we will model and forecast data.
At the beginning, we import libraries and data (monthly pharmacy sales). Our data is basic and includes only dates and sales value:
import pandas as pd import numpy as np import matplotlib.pylab as plt %matplotlib inline from matplotlib.pylab import rcParams rcParams['figure.figsize'] = 20, 6
data = pd.read_csv(r'C:\Users\VFW863\Documents\drugstore.csv'); data.head()
date object value int64 dtype: object
We need to change date as date type:
data['date'] = pd.to_datetime(data['date']); data.head()
date datetime64[ns] value int64 dtype: object
data = data.set_index('date'); data.head(3)
A time series can be stationary if its statistical properties, such as mean and variance, remain constant over time. Most models operate under the assumption that the series is stationary. Intuitively, we can suggest that if a given set has a particular behavior over time, there is a very high probability that the same will happen in the future. Also, theories related to stationary series are more mature and easier to implement compared to non-stationary series.
Classic methods of time series analysis and forecasting relate to transforming non-stationary time series data by identifying and removing trends and removing stationary effects.
Data processing before modeling:
- is the data stationary? some models do not require stationary data therefore further processing will depend on the model to be used
To achieve stationarity:
- noise removal – Time series data almost always contain a random noise component. The purpose of denoising methods is to filter and remove unwanted noise using smoothing, e.g .: moving average (each value in the series is replaced by simple or weighted averages of adjacent values), Median filter – similar to the moving average, but values are replaced by the median Local regression filter – value are replaced with a smoothed curve with values fitted by the least squares method
- estimating the trend and seasonality and removing these components, this can be achieved by differentiation and / or decomposition
- removal of anomalies
There are many methods to check whether the time series are stationary or non-stationary:
- Visually: you can review the time series charts of the data and visually see if there are any obvious trends or seasonality
- Summary Statistics: You can split the time series into two (or more) partitions and compare the mean and variance of each group. If they differ, and the difference is statistically significant, the time series is probably not stationary.
- Statistical tests: With the help of statistical tests, we can check whether expectations regarding stationarity have been met or violated.
The easiest and fastest way to check the trend and seasonality is on the chart:
[<matplotlib.lines.Line2D at 0x214e9aa2888>]
The chart shows that there is an upward trend with some seasonal variations. This is not always so obvious and a visual check alone may not be enough. Let’s try other methods to check stationarity, such as: moving average (check if it changes over time) or Dickey-Fuller test (statistical test checking stationarity).
For starters, we’ll use the Dickey-Fuller test with a clean data set without any transformation:
Let’s set the null hypothesis H0: the data set is non-stationary (i.e. the data is time dependent)
HA: dataset is stationary (no trend or seasonality)
from statsmodels.tsa.stattools import adfuller X = data.value result = adfuller(X) print('ADF Statistic: %f' % result) print('p-value: %f' % result) print('Critical values:') for key, value in result.items(): print('\t%s: %.3f' % (key, value))
ADF Statistic: 0.044556 p-value: 0.962065 Critical values: 1%: -3.452 5%: -2.871 10%: -2.572
P-value> 0.05, we accept the null hypothesis, i.e. the data is non-stationary
We will try to bring the data set closer to stationarity. Trend and seasonality are the two main causes of non-stationarity, so we will try to estimate the trend and seasonality in the series and remove them. Some methods may work well for one occasion and others may not. First, we’ll reduce the trend by transformation. In our case, it is clear that there is a significant positive trend (increasing over time) and so we will apply a transformation that penalizes higher values more than lower values. These can be functions such as: log, square root, cubic root, etc.
We transform the dataset to make the distribution of values more linear and better meet the expectations of the statistical test:
data_log = np.log(data) plt.plot(data_log)
[<matplotlib.lines.Line2D at 0x214ece708c8>]
Let’s check the stationarity again, this time of the changed data set:
X = data_log.value result = adfuller(X) print('ADF Statistic: %f' % result) print('p-value: %f' % result) print('Critical values:') for key, value in result.items(): print('\t%s: %.3f' % (key, value))
ADF Statistic: -2.274305 p-value: 0.180389 Critical values: 1%: -3.452 5%: -2.871 10%: -2.572
P-value> 0.05 so the data is still non-stationary, but we are closer to the goal.
Several data transformation methods:
Aggregation – Taking an average over a period such as monthly / weekly averages
Smoothing – getting moving averages
Polynomial fit – regression model fit
We will use a moving average for data smoothing:
moving_avg = data_log.rolling(window=12).mean()
plt.plot(data_log) plt.plot(moving_avg, color='r')
[<matplotlib.lines.Line2D at 0x214ef12b688>]
The red line shows the moving average. We assumed window = 12 because of our monthly data – we take the average of the last 12 values, a moving average is not defined for the first 11 values, we can remove them:
data_log_moving_avg_diff = data_log - moving_avg data_log_moving_avg_diff.dropna(inplace=True)
Let’s check the stationarity of the data (moving average):
X = data_log_moving_avg_diff.value result = adfuller(X) print('ADF Statistic: %f' % result) print('p-value: %f' % result) print('Critical values:') for key, value in result.items(): print('\t%s: %.3f' % (key, value))
ADF Statistic: -3.170050 p-value: 0.021785 Critical values: 1%: -3.453 5%: -2.871 10%: -2.572
P <0.05 so the data are stationary
The test value is less than the critical value of 5%, so we can say with 95% certainty that the data is stationary.
In some cases, it is better to use a weighted moving average instead of the usual moving average, where the latest data is given more weight than the original. With the function ewm (). Mean () we can get our weighted average. The parameters of this function will change depending on the priority.
exp_avg = data_log.ewm(halflife=12).mean() plt.plot(data_log) plt.plot(exp_avg, color='red')
[<matplotlib.lines.Line2D at 0x214ef102648>]
Let’s check what stationarity looks like now:
X = exp_avg.value result = adfuller(X) print('ADF Statistic: %f' % result) print('p-value: %f' % result) print('Critical values:') for key, value in result.items(): print('\t%s: %.3f' % (key, value))
ADF Statistic: -2.585344 p-value: 0.096064 Critical values: 1%: -3.452 5%: -2.871 10%: -2.572
For our data set, the basic moving average performed better than the weighted average.
When we finished dealing with the estimation of the trend, it should be removed, it will be easier for us to examine the seasonality. To remove a trend, you can subtract the trend that was calculated above from the original set.
Another way to remove this trend is „differentiation”, in which we look at the difference between consecutive data points (called „first-order differentiation” – analyzing the difference between one data point and the one before it).
data_log_diff = data_log - data_log.shift() plt.plot(data_log_diff)
[<matplotlib.lines.Line2D at 0x214ef1820c8>]
Thanks to this method of differentiation, we got rid of the main trend.
Another approach is decomposition where both trend and seasonality are modeled separately and the remainder of the series is returned. The purpose of time series decomposition is to estimate and extract the deterministic parts of the trend series and seasonality in the hope that the remaining data, i.e. theoretically random variable, will turn out to be a stationary random process.
from statsmodels.tsa.seasonal import seasonal_decompose decomposition = seasonal_decompose(data_log) trend = decomposition.trend seasonal = decomposition.seasonal residual = decomposition.resid plt.subplot(411) plt.plot(data_log, label='Data') plt.legend(loc='best') plt.subplot(412) plt.plot(trend, label='Trend') plt.legend(loc='best') plt.subplot(413) plt.plot(seasonal,label='Seasonality') plt.legend(loc='best') plt.subplot(414) plt.plot(residual, label='Residuals') plt.legend(loc='best') plt.tight_layout()
Trend and seasonality are separated from the data, and we can model the rest. Let’s check their stationarity:
data_log_decompose = residual data_log_decompose.dropna(inplace=True) X = data_log_decompose result = adfuller(X) print('ADF Statistic: %f' % result) print('p-value: %f' % result) print('Critical values:') for key, value in result.items(): print('\t%s: %.3f' % (key, value))
ADF Statistic: -6.861840 p-value: 0.000000 Critical values: 1%: -3.453 5%: -2.871 10%: -2.572
The test value is much less than 1%, so this set is almost perfectly stationary.