Understanding Autocorrelation in Time Series Analysis

Understanding how autocorrelation works is essential for beginners to make their journey in time series analysis easier.

KSV Muralidhar
Towards Data Science


Photo by Chris Liverani on Unsplash

I began my time series analysis journey two years ago. Initially, I learned through YouTube videos, where I came across autocorrelation, a basic concept of time series analysis. According to a few videos, autocorrelation is the relationship/correlation of a time series with its previous versions in time. A few others said that since we need two variables to compute correlation, but a time series has only one, we compute the correlation of the time series with a ‘k’th lagged version of itself.

What is ‘k’th lag?

This was the first question that came to my mind. As I proceeded, I learned that the ‘k’th lag of a time series (y) is a version of it that is ‘k’ periods behind in time, i.e. y(t-k). A time series with lag (k=1) is a version of the original time series that is 1 period behind in time, i.e. y(t-1).
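To make the idea of a lag concrete, here is a minimal sketch in plain Python. The helper name ‘lag’ and the price values are made up for illustration; they are not from the videos or from the Nifty data used later.

```python
def lag(series, k):
    """Return the k-th lagged version of series.

    Position t of the result holds y(t-k); the first k positions have no
    earlier value to point to, so they are filled with None.
    """
    if not 0 <= k <= len(series):
        raise ValueError("k must satisfy 0 <= k <= len(series)")
    return [None] * k + list(series[:len(series) - k])


prices = [10.0, 11.0, 13.0, 12.0, 15.0]  # made-up values for illustration

# Print each y(t) next to y(t-1); the first row has no lagged value.
for t, (today, yesterday) in enumerate(zip(prices, lag(prices, 1))):
    print(f"t={t}  y(t)={today}  y(t-1)={yesterday}")
```

Notice that the lagged series has k undefined positions at the start, which is where the question about unequal lengths in the next section comes from.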

Most of those videos used daily stock market prices as an example to explain time series analysis. They explained that the autocorrelation of the stock prices is the correlation of the current price with the price ‘k’ periods behind in time. So, the autocorrelation with lag (k=1) is the correlation between today’s price y(t) and yesterday’s price y(t-1). Similarly, for k=2, the autocorrelation is computed between y(t) and y(t-2).

Now comes the main question

How can we compute the correlation (or even covariance, for that matter) of today’s price with yesterday’s price? Correlation can only be computed between variables that each have multiple values. If I try to compute the correlation between two single values, I’d get ‘NaN’. Also, the two variables must have the same length (number of values), so I can’t even compute the correlation between y(starting point to t) and y(starting point to t-1).

Then, I started searching for a theoretical explanation of autocorrelation and came across the formula of autocorrelation as shown below.

Image by author
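The formula image is not available in this text version. Going by the description in the next section, and assuming the standard sample autocorrelation (the one ‘statsmodels’ computes by default), it can be written as follows. The symbols r_k (autocorrelation at lag k), N (number of observations) and \bar{y} (mean of the original time series) are my notation here and may differ from the original image.

```latex
r_k = \frac{\sum_{t=k+1}^{N} \left(y_t - \bar{y}\right)\left(y_{t-k} - \bar{y}\right)}
           {\sum_{t=1}^{N} \left(y_t - \bar{y}\right)^2}
```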

Understanding the formula

  • The formula of autocorrelation is similar to (but not exactly the same as) that of correlation.
  • The numerator is similar to covariance between the current and lagged versions of the time series (but doesn’t have ‘N-1’ as denominator). A closer examination of the two components of the numerator shows that the mean of the original time series, mean(y), is being subtracted from them, not mean(y(t)) and mean(y(t-k)), respectively. This makes the numerator of the formula a bit different from covariance.
  • The denominator is similar to the square of standard deviation (a.k.a. variance) of the original time series (but doesn’t have ‘N-1’ as denominator).

Let’s answer the question ‘How do we compute autocorrelation?’ by implementing it in Python

We’ll use the Nifty (an Indian stock index tracking 50 stocks) closing price data from 17 September, 2007 to 30 July, 2021. The data is downloaded as a csv from Yahoo Finance. We’ll first prepare the data for time series analysis.
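The data preparation code is not reproduced here, so the following is only a rough sketch of how it could look with pandas. It assumes the usual Yahoo Finance column layout (Date, Open, High, Low, Close, Adj Close, Volume); the function name ‘load_close_series’ is hypothetical, and the sample rows are made up so that the snippet runs without the real file.

```python
import io

import pandas as pd


def load_close_series(csv_source, date_col="Date", price_col="Close"):
    """Read a Yahoo-Finance-style CSV and return closing prices as a Series.

    csv_source may be a file path or a file-like object.
    Column names are assumptions about the CSV layout.
    """
    df = pd.read_csv(csv_source, parse_dates=[date_col], index_col=date_col)
    # Sort by date and drop days that have no recorded closing price.
    return df[price_col].sort_index().dropna()


# Made-up rows in the assumed layout; these are NOT real Nifty prices.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2007-09-18,101,103,100,102.0,102.0,0\n"
    "2007-09-17,100,102,99,101.0,101.0,0\n"
    "2007-09-19,102,104,101,,,0\n"
)
nifty_demo = load_close_series(sample_csv)
print(nifty_demo)
```

With the real download, the same function would be called with the path of the saved CSV file instead of the in-memory sample.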

Nifty time series plot (Image by author)

We’ll define a function called ‘autocorr’ that takes a time series array and a lag value ‘k’ as inputs and returns the autocorrelation (acf) for that single lag. This function will be nested inside another function called ‘my_auto_corr’, which returns the acf for every lag from 0 to k by calling ‘autocorr’ once per lag value.
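The original code is not reproduced here, so below is a minimal sketch of what these two functions could look like, using NumPy. The function and variable names follow the ones mentioned in this article (‘autocorr’, ‘my_auto_corr’, ‘numerator_p1’, ‘numerator_p2’, ‘denominator’); everything else, including the input validation, is my assumption.

```python
import numpy as np


def my_auto_corr(series, nlags):
    """Return the autocorrelation coefficients for lags 0, 1, ..., nlags."""
    # Accept lists, arrays, Series or one-column data frames.
    y = np.asarray(series, dtype=float).ravel()
    if not 0 <= nlags < len(y):
        raise ValueError("nlags must satisfy 0 <= nlags < len(series)")

    def autocorr(y, k):
        """Autocorrelation of y at a single lag k."""
        y_mean = y.mean()                          # mean of the ORIGINAL series
        numerator_p1 = y[k:] - y_mean              # y(t)   - mean(y)
        numerator_p2 = y[:len(y) - k] - y_mean     # y(t-k) - mean(y)
        denominator = np.sum((y - y_mean) ** 2)    # variance-like, no N-1
        return np.sum(numerator_p1 * numerator_p2) / denominator

    return np.array([autocorr(y, k) for k in range(nlags + 1)])


# Tiny made-up series, just to show the call.
print(my_auto_corr([1, 2, 3, 4, 5], nlags=2))
```

The lag-0 value is always 1, because the numerator and the denominator are then the same sum.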

Let’s call the ‘my_auto_corr’ function by passing the ‘nifty’ time series data frame and nlags=10 as arguments. We’ll also compare the output of ‘my_auto_corr’ function with that of ‘acf’ method of ‘statsmodels’.
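As a rough sketch of such a comparison, the snippet below puts a compact version of the formula next to ‘acf’ from ‘statsmodels.tsa.stattools’. The helper names ‘manual_acf’ and ‘max_abs_diff_vs_statsmodels’ are hypothetical, and ‘statsmodels’ is imported inside the function so that the manual part still runs if the library is not installed.

```python
import numpy as np


def manual_acf(y, nlags):
    """Compact form of the autocorrelation formula for lags 0..nlags."""
    y = np.asarray(y, dtype=float).ravel()
    d = y - y.mean()  # deviations from the mean of the original series
    numerators = [np.sum(d[k:] * d[:len(d) - k]) for k in range(nlags + 1)]
    return np.array(numerators) / np.sum(d ** 2)


def max_abs_diff_vs_statsmodels(y, nlags=10):
    """Largest absolute gap between manual_acf and statsmodels' acf."""
    # Imported lazily: this function needs statsmodels to be installed.
    from statsmodels.tsa.stattools import acf

    y = np.asarray(y, dtype=float).ravel()
    return float(np.max(np.abs(manual_acf(y, nlags) - acf(y, nlags=nlags))))


# Manual part only, on a tiny made-up series.
print(manual_acf([2, 4, 6, 8, 10], nlags=2))
```

Calling max_abs_diff_vs_statsmodels(nifty, nlags=10) should give a value very close to zero if the two computations agree, allowing for small floating-point differences.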

Image by author

The results of ‘my_auto_corr’ are the same as those of the ‘acf’ method of ‘statsmodels’. Let’s once again look at the formula of autocorrelation that we saw earlier and try to understand it.

Image by author
  • The denominator is pretty straightforward: it is similar to the variance of the original time series, but doesn’t have ‘N-1’ in the denominator. It is denoted by the ‘denominator’ variable in the code.
  • As discussed earlier, the numerator is similar to the covariance between the current and lagged versions of the time series (without N-1 as denominator). Let’s understand how to compute the numerator.
Image by author
  • The brown rectangle represents y(t) in the first part of the numerator. The mean of the original time series, mean(y), is subtracted from it. The first part is denoted by ‘numerator_p1’ in the code and y(t)-mean(y) in the formula. y(t) is fixed at the bottom and its top moves down by 1 for every unit increase in the lag (k).
  • Similarly, the green rectangle represents y(t-k) in the second part of the numerator. The mean of the original time series, mean(y), is subtracted from it as well. The second part is denoted by ‘numerator_p2’ in the code and y(t-k)-mean(y) in the formula. y(t-k) is fixed at the top and its bottom moves up by 1 for every unit increase in the lag (k).
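The two rectangles can be sketched as array slices. This is only an illustration on a made-up series; the function name ‘numerator_windows’ and the variable names ‘brown’ and ‘green’ are mine, chosen to match the colours in the image.

```python
import numpy as np


def numerator_windows(y, k):
    """Return the y(t) and y(t-k) windows used in the numerator for lag k."""
    y = np.asarray(y, dtype=float)
    if not 0 <= k < len(y):
        raise ValueError("k must satisfy 0 <= k < len(y)")
    brown = y[k:]            # y(t): bottom fixed, top moves down as k grows
    green = y[:len(y) - k]   # y(t-k): top fixed, bottom moves up as k grows
    return brown, green


y = np.arange(1.0, 7.0)  # made-up series: 1, 2, 3, 4, 5, 6
for k in range(3):
    brown, green = numerator_windows(y, k)
    print(f"k={k}  y(t)={brown}  y(t-k)={green}")
```

Both windows always have N-k values, which is why they can be multiplied element by element even though the full series and its lagged version seemed to have different lengths.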

However, as we saw earlier, the numerator of the formula is not exactly the same as covariance, and the denominator is only similar to the variance of the original time series, since it doesn’t have ‘N-1’ in its denominator. Hence, computing the covariance of the brown and green rectangles and dividing it by the variance of the original time series doesn’t give us the autocorrelation coefficient.
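A small sketch on a made-up series illustrates the difference. The function names are hypothetical; ‘np.cov’ and ‘np.var’ are used with the usual ‘N-1’ sample definitions, which is an assumption about what ‘covariance’ and ‘variance’ mean here.

```python
import numpy as np


def acf_at_lag(y, k):
    """Autocorrelation at lag k, following the formula in this article."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()  # deviations from the mean of the original series
    return np.sum(d[k:] * d[:len(d) - k]) / np.sum(d ** 2)


def cov_over_var_at_lag(y, k):
    """Sample covariance of the two windows divided by the sample variance.

    Each window is centred on its OWN mean here, unlike in acf_at_lag.
    Needs k >= 0 and at least two values left in each window.
    """
    y = np.asarray(y, dtype=float)
    brown, green = y[k:], y[:len(y) - k]
    return np.cov(brown, green)[0, 1] / np.var(y, ddof=1)


y = [1, 2, 3, 4, 5]  # made-up series
print("formula        :", acf_at_lag(y, 1))
print("cov / variance :", cov_over_var_at_lag(y, 1))
```

The two numbers differ because the windows are centred on different means and because the ‘N-1’ style divisors do not cancel out.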

Breaking down the autocorrelation formula into fragments and implementing it in Python helped us understand it better. We saw how the covariance-like numerator is calculated between the current and the lagged versions of the time series. It is important to know what’s under the hood to understand a concept better, be it a machine learning algorithm or a concept in statistics.

Know more about my work at https://ksvmuralidhar.in/

