# Lecture 17: Time Series

AM207: Pavlos Protopapas, Harvard University

April 1, 2014


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import seaborn as sns
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsaplots
from IPython.display import Image


## 1. Foundational concepts in Time Series

### The Non-IID case

So far, we have concentrated on IID sets of data points. This assumption allowed us to express the likelihood function as the product over all data points of the probability distribution for each data point.

This is not always an assumption that holds. In particular, sequential data is a case where the assumption does not hold: for example rain today might portend more rain tomorrow. Other examples are nucleotide base pairs along a strand of DNA or the sequence of characters in an English sentence.

Indeed, you are very familiar with one such sequence: this is the chain obtained from doing MCMC. We do all kinds of manipulations there to remove any correlations to get random samples from a particular distribution.

### What is a Stochastic Process?

Indeed, more generally, lets define a stochastic process. A Stochastic Process is a a collection of random variables ordered in time. It might be rain on successive days or the times in a 24 hour period when a baby was born. The variables may be iid or not. If they are not iid, it is precisely the dependence that we are interested on.

What we'd like to focus on here are processes evolving in time according to probabilistic laws, and deal only with well defined time lags: ie unique sampling frequency. Furthermore, we will concentrate on discrete univariate time series (for now).

### Ensemble

Our key assumption will be to regard the series as ONE realization of stochastic process from an ensemble of time series. Our notation $\E{X}$ is the usual frequentist $\EE{ensemble}{X}$ and similar for variance.

We want to make inferences about the process.

First of all we note that, without loss of generality, we can use the product rule to express the joint distribution for a sequence of observations thus:

$$P(x_1, x_2,...,x_n) = \prod_{n=1}^{N} P(x_n | x_1 , x_2 , ... , x_{n-1} )$$

This is the distribution we'd like to learn. Indeed, specifically, we are interested in the ensemble quantities

$$\mu_n = \E{x_n}$$

$$\sigma_n^2 = E[(x_n − \mu_n)^2] =\V{x_n}$$

and covariance

$$\gamma(x_{n},x_{n+k}) = E[(x_n − \mu_{n})(x_{n+k} − \mu_{n+k})]$$.

### Stationarity

It is useful to distinguish between stationary and non-stationary sequential distributions. In the stationary case, the data evolves in time, but the distribution from which it is generated remains the same. For the more complex non-stationary situation, the generative distribution itself is evolving with time. Here we shall focus on the stationary case.

We define Strongly stationary: $\P{x_{1:k}} = \P{x_{t:t+k-1}} \forall \,\,\, k, t$: joint distribution of blocks of length $k$ is time-invariant. Implies $\mu_t = \mu$, $\gamma(x_{t},x_{t+k-1})=\gamma(k)$, the autocovariance at lag $k$. That is, the autocovariance only depends on the lag.

It may seem surprising to suggest the notion of a distribution where the distribution is the same for such blocks, or more specifically, at all times. However, remember that many processes such as economies have equilibrium distributions far from the initial conditions.

We also define Loosely stationary: $\E{x_1} = \E{x_t} = \mu \,\,\, \forall t$ and $\gamma(x_{t},x_{t+k-1})=\gamma(k)$. There is no restriction on joint distributions.

$$\mu_t = µ, \,\,\, t = 1, 2,...$$ $$\sigma^2_t = σ^2, \,\,\, t = 1, 2,...$$ $$\gamma(t,s) = \gamma(t−s), \,\,\, t \neq s,...$$

Loose is obviously implied by strong.

It is only for gaussian processes that Loose implies Strong. We'll come back to gaussian processes in a later lecture.

Which of the following Time Series is stationary?

In [2]:
Image("images/stationary.png")

Out[2]:
(a) Dow Jones index on 292 consecutive days; (b) Daily change in Dow Jones index on 292 consecutive days; (c) Annual number of strikes in the US; (d) Monthly sales of new one-family houses sold in the US; (e) Price of a dozen eggs in the US (constant dollars); (f) Monthly total of pigs slaughtered in Victoria, Australia; (g) Annual total of lynx trapped in the McKenzie River district of north-west Canada; (h) Monthly Australian beer production; (i) Monthly Australian electricity production.

(b and g are stationary)

### Autocorrelation

Typically, in time series, there is some decay of correlation: $x_n$ and $x_{n+k}$ become more and more nearly independent as $k \to \infty$. We study this by the covariance or autocovariance defined above, or equivalently by the autocorrelation:

$$\rho(k) =\frac{\gamma(k)}{\gamma{(0)}}=\frac{E[(x_t-\mu)(x_{t+k}-\mu)]}{E[(x_t-\mu)(x_{t}-\mu)]}$$

A plot of $\rho_k$ against $k$ is know as the autocorrelogram or auto-correlation function and is often a good guide to the properties of the series. In summary second order stationarity implies that mean, variance and the autocorrelogram are independent of time.

Autocorrelation is important, firstly, because everybody knows about it. Secondly, in the rather special case of Gaussian processes, it really does tell us everything we need to know (more on this in the lecture on Gaussian Processes). Third, in the somewhat less special case of linear prediction, it tells us everything we need to know. Finally, it plays an important role in understanding ergodicity, as we shall soon see.

### Markov

We are interested in $P(x_{t+1} \vert x_{1:t})$. However, for finite correlation times, we lose memory after a while. This leads us to consider Markov models in which we assume that future predictions are independent of all but the most recent observations. That is, the process is Markovian of order $m$ steps. (In class we have so far dealt with $m = 1$.)

$$P(x_n | x_1 , x_2 , ... , x_{n-1} ) = P(x_n | x_{n-m}, x_{n-m+1},...,x_{n-1})$$

In [3]:
Image("images/m2o.png")

Out[3]:

Such a process is a weakening of both i.i.d and deterministic (current state $x_t$ fixes entire future trajectory) dependence. Indeed such a chain only fixes distribution over trajectories. Current states are the only channel through which the past affects the future.

Although such models are tractable, they are also severely limited. We can obtain a more general framework, while still retaining tractability, by the introduction of latent variables. We shall study such "state space" models in the next lecture, concentrating on the limited direct Markov models in this lecture.

### Ergodicity

Using a single measurement at each time to estimate values of the unknown parameters is only valid if the process is ergodic. Ergodicity is a mathematical concept. In essence it means that observations which are suﬃciently far apart in time are uncorrelated so that adding new observations gives extra information. In other words if we can show that for long enough series the average, covariance converges to the ensemble average and variance then we can use a single time series to learn everything (everything here is the average and covariance).

With iid data, the ultimate basis of all our statistical inference is the law of large numbers, which tells that as $n \to \infty$,

$$\frac{1}{n} \sum_{i=1}^{n} x_i \to \E{x},$$

where the average is once again over the ensemble. For the non iid sequential case, as $n \to \infty$:

$$\E{\ol{x_n}} = \E{\frac{1}{n} \sum_{i=1}^{n} x_i }= \frac{1}{n} \sum_{i=1}^{n} \E{x_i} = \E{x} = \mu$$

where $\ol{x_n} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the average of the first $n$ variables in a particular time (itself a random variable).

We can further show that, as $n \to \infty$:

$$\V{\ol{x_n}} \to 0 .$$

To prove this we need the condition $\gamma(\tau) \to 0$ for all $\tau$ greater than some finite lag $m$. i.e. the correlation time is finite

This can be written as:

$$\sum_{k=0}^{\infty} | \gamma(k) | = \gamma(0)\tau \lt \infty$$

\begin{eqnarray} \V{\ol{x}_n} &=& \V{\frac{1}{n}\sum_{t=1}^{n} x_t}\\ &=& \frac{1}{n^2} \left[ \sum_{t=1}^{n} \V{x_t} +2 \sum_{t=1}^{n}\sum_{s=t+1}^{n} \gamma(x_t , x_s) \right]\\ &=& \frac{1}{n^2} \left[ n\gamma(0) + 2 \sum_{t=1}^{n}\sum_{s=t+1}^{n} \gamma(s-t)\right]\\ & \le & \frac{1}{n^2} \left[ n\gamma(0) + 2 \sum_{t=1}^{n}\sum_{s=t+1}^{n} | \gamma(s-t) |\right]\\ & \le & \frac{1}{n^2} \left[ n\gamma(0) + 2 \sum_{t=1}^{n}\sum_{h=1}^{n} | \gamma(h) |\right]\\ & \le & \frac{1}{n^2} \left[ n\gamma(0) + 2 \sum_{t=1}^{n}\sum_{h=1}^{\infty} | \gamma(h) |\right]\\ & = & \frac{\gamma(0)(1+2\tau)}{n} \end{eqnarray}

Thus the variance goes to 0 as $n \to \infty$.

$\ol{x_n} \to \E{x}$ as $n \to \infty$

Since $\E{\ol{x_n}} \to \E{x_1}$ and the variance $\V{\ol{x}_n} \to 0$ as $n \to \infty$, hence $\ol{x_n} \to \E{x_1}$ as $n \to \infty$ exactly, just as in the iid case.

What does this mean?

A single long time series, if stationary with finite correlation time, becomes representative of the whole data generating process, just like sampling an iid distribution represents the entire population. Thus the ergodic theorem tells us that we CAN learn about the stochastic process from empirical data.

One can extend the ergodic theorem to time averages and ensemble expectations of well behaved functions and functions of blocks $f(X_{t:t+k})$: thus it applies to joint and marginal probability distributions.

If $x_i$ were iid, or even just uncorrelated,

$$\V{\ol{x}_n} = \frac{\gamma(0)}{n}$$

exactly. Thus, our bound on the variance is larger by a factor of $(1 + 2\tau)$, which translates to an effective sample size:

$$ESS = \frac{n}{(1 + 2\tau)}$$

of independent data points. Notice to the similarity to the formula for effective sample size in the lecture on Gibbs sampling, which indeed can be obtained from one of the steps in this derivation.

### Autoregression

For continuous variables $x$, instead of trying to estimate the whole conditional distribution of $x_t$ , we can just look at its conditional expectation. This is a autoregression: $\E{x_t|x_{t-p:t-1}} = r(x_{1:p})$: if we think the process is Markov of order p, then of course there is no point in conditioning on more than p steps of the past when doing an autoregression. The correlation drop off which inclined us towards Markov approximations also make limited-order autoregressions attractive.

We can do linear, kernel, spline, additive..you name it...regression. We'll stick to linear as it is simple, fast to compute, and fast to converge.

### Linear Autoregression

Linear regression if when we use linear-Gaussian conditional distributions in which each $x_n$ has a Gaussian distribution whose mean is a linear function of its markov 'feeders'. If we want to predict $x_t$ as a linear combination of the last $p$ observations $x_{t-1},x_{t-2},...,x_{t-p}$, then the ideal coefficients $\vec{\beta}$ are

$$\vec{\beta} = \left ( \V{x_{t-p:t-1}} \right ) ^ {-1} \gamma (x_{t-p:t-1}, x_t)$$

where $\V{x_{t-p:t-1}}$ is the variance-covariance matrix of that block, and $\gamma$ is a matrix of covariances.

Assuming stationarity, $\V{x_t}$ is constant in t, and so the common factor of the over-all variance goes away, and $\vec{\beta}$ could be written entirely in terms of the correlation function $\rho$.

Ergodicity then allows us to estimate these coefficients by using time averages instead of ensemble averages.

## 2. An exploration of models and stationarity

### Purely Random Process

A purely random process is one in which $x_t = z_t$, where $\{z_t\}$ are iid. We could choose any distribution.

Lets choose $\N{0}{\sigma^2}$, also called white noise.

In [4]:
N=1000
#sigma=1
ts=np.random.randn(1000)
tsaplots.plot_acf(ts, lags=10);


Obviously, since the variables are iid, there is no autocorrelation.

Power spectrum (to be explained at the lab on Friday)

In [5]:
def powspec(samps):
ps = np.abs(np.fft.fft(samps))**2
time_step = 1. / 30. #assume Hertz
freqs = np.fft.fftfreq(samps.size, time_step)
idx = np.argsort(freqs)
return freqs[idx], ps[idx]

def plot_powspec(samps, title):
f,p = powspec(ts)
plt.plot(f,p)
plt.xlabel("frequency (Hz)")
plt.ylabel("Power")
plt.title(title+ " Power Spectrum")

plot_powspec(ts, "White Noise")


### Random Walk Model

Also known as Brownian motion, this is one of the simplest time series:

$$x_t = x_{t-1} + z_t$$

Note that since:

$$\E{z} = 0$$ $$\V{z} = \sigma_z^2$$

Then $$\V{x_t} = t\, \sigma_z^2$$ this series is NOT stationary: a particle under brownian motion can get arbitrarily far from its start.

We can introduce the difference operator

$$Dx_t = x_t - x_{t-1} = z_t$$

The series given by applying the difference operator $D$ IS stationary. We saw this in the figure above where the DJI index itself isnt stationary, but its difference is.

Let us also introduce the backpropagation operator $B$:

$$x_{t-1} = Bx_t$$

Then we can write:

\begin{eqnarray} (1-B)x_t & = & z_t\\ x_t & = & \frac{z_t}{1-B} \end{eqnarray}

In [6]:
N=1000
z=np.random.randn(N)
x=np.zeros(N)
for i in np.arange(1,N,1):
x[i] = x[i-1]+z[i]
plt.plot(x)

Out[6]:
[<matplotlib.lines.Line2D at 0x10937af10>]

In [7]:
tsaplots.plot_acf(x, lags=40);


As you might expect, the autocorrelation does not go down in any reasonable amount of time. Indeed we can use the autocorrelation as a test for stationarity (except in pathological cases).

In [8]:
plot_powspec(x, "Random Walk")


### AR(1) model

This is the simplest autoregressive model: a markov chain of order 1, and $x_t$ be a white noise

$$x_t = \alpha x_{t-1} + z_t$$

$$x_t = z_t + \alpha ( \alpha x_{t-2} +z_{t-1})$$ $$= z_t + \alpha z_{t-1} + \alpha^2 x_{t-2}$$ $$...$$ $$= z_t + \alpha z_{t-1} + \alpha^2 z_{t-2} + \ldots$$

and if $\vert \alpha \vert < 1$ $$\E{x_t} = 0$$

which is independent of t

The random walk is a special case of an AR(1) model with $\alpha_1=1$.

Since $x_t = \frac{1}{1 - \alpha B} z_t$.

If we expand $\frac{1}{1 - \alpha B}$ in a series, we have:

$$\frac{1}{1 - \alpha B} = 1 + \alpha B +\alpha^2 B^2 + ...$$

and thus we also have:

$$x_t = z_t + \alpha z_{t-1} + \alpha^2 z_{t-2} + ...$$

The covariance is given by

$$\begin{eqnarray} \gamma(k) &=& \E{ x_t x_{t+k}} \\ & = & \sum_{0} \alpha^i \alpha^{k+i} \sigma_z^2 \\ & = & \alpha^{k} \sigma^2_z \sum \alpha^{2i} & =& \sigma^2_z \frac{\alpha^k}{1-\alpha^2} \end{eqnarray}$$

For the last step we used the geometric series formula.

The autocorrelation is $\rho{k} = \frac{\gamma(k)}{\gamma(0)}$ and therefore

$$\rho(k) = \alpha^k$$

In [9]:
def ts_gen_ar1(size, sigma, alpha1):
e=sigma*np.random.randn(size)
x=np.zeros(size)
for i in np.arange(1,size,1):
x[i] = alpha1*x[i-1] + e[i]
#print x
return x


AR(1): $x_t = 0.3x_{t-1} + z_t$: STATIONARY

In [10]:
ts1 = ts_gen_ar1(5000, 1., 0.3)
plt.plot(ts1);

In [11]:
tsaplots.plot_acf(ts1, lags=40);


AR(1): $x_t = -0.3x_{t-1} + z_t$: STATIONARY

In [12]:
ts1b = ts_gen_ar1(5000, 1., -0.3)
plt.plot(ts1b);


See the negative terms that come from the $\alpha_1^k$ term in the ACF.

AR(1): $x_t = 1.1x_{t-1} + z_t$: NON-STATIONARY

In [13]:
ts2 = ts_gen_ar1(1000, 1., 1.1)

In [14]:
plt.plot(ts2)

Out[14]:
[<matplotlib.lines.Line2D at 0x109364f50>]

In [15]:
tsaplots.plot_acf(ts2, lags=40);

In [16]:
tsaplots.plot_acf(ts1b, lags=40);


### MA(1) and MA(q) Models

Before we move on to higher order autoregressive models, lets briefly consider a class of models called Moving-Average, or MA models. The reason to do this is the technology we develop here, in conjunction with the notation developed above, will help us understand the structure of AR models.

The general MA(q) model is:

$$x_t = z_t + \beta_1 z_{t-1} + ... + \beta_q z_{t-q}$$

In terms of the back-propogation operator B we can thus write:

$$\MA{B} = 1 + \beta_1 B + \beta_2 B^2 + ... + \beta_q B^q$$

and thus

$$x_t = \MA{B} z_t$$

We do not observe $z_t$, so this is not regression in normal sense. Notice that $x_t$ can be thought of as weighted moving averages of the past few forecast errors.

Obviously, for MA(q): $\E{x_t} =0$, and $\V{x_t} = \sigma_z^2 \sum_{i=0}^q \beta_i^2$.

The MA(q) model is stationary for all finite q(indeed, for the gaussian case described above, the model is strongly stationary).

The covariance for the MA(q) model is 0 for $k > q$. This is as, there are no squared $z_t$ noise terms in the products of type $x_{t}x_{t-k}$ where $k \gt q$ (i.e. more than q-separated $x_t$). Since the noise itself is IID, the covariance is then 0. For $k \le q$:

$$\gamma(k) = \sigma_Z^2 \sum_{i=0}^{q-k} \beta_i \beta_{i+k}$$

Thus

$$\rho_k = \frac{\sum_{i=0}^{q-k} \beta_i \beta_{i+k}}{\sum_{i=0}^q \beta_i^2}$$

for $k \le q$ and 0 otherwise.

Let us briefly consider a MA(1) model:

$$x_t = z_t + \beta_1 z_{t-1}$$

In [17]:
N=1000
x=np.zeros(N)
z=np.random.randn(N)
beta1=0.5
for i in np.arange(1,N,1):
x[i] = z[i] + beta1*z[i-1]
plt.plot(x);

In [18]:
tsaplots.plot_acf(x, lags=10);


Clearly, the autocorrelation goes to 0 after one lag.

We will consider the MA(q) in more detail soon, but lets get back to autoregressive models.

### The AR(2) Model

$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + z_t$$

Just like in the AR(1) case, we can use the $\alpha$s to study the stationarity of the AR(2) series.

We write this as a function of the $B$ operators

$$(1-\alpha_1 B - \alpha_2 B^2) \, x_t = z_t$$

We'll call such a polynomial on the left side $\AR{B}$ and thus:

$$\AR{B} x_t = z_t$$

Lets write this as: $$x_t = \psi(B) z_t = (1+\psi_1 B + \psi_2 B^2 + \ldots) z_t = (1 + \sum_{i=1}^{\infty} \psi_i B^i) z_t$$

Thus

$$(\AR{B})^{-1} = (1 + \sum_{i=1}^{\infty} \psi_i B^i) z_t$$

or

$$(1-\alpha_1 B - \alpha_2 B^2) (1+\psi_1 B + \psi_2 B^2+ \ldots) = 1$$

Equating coefficients to zero we thus get an infinite series of equations that we can use to relate to an infinite order MA process (see below). For now we want to know when this is stationary (a finite order MA process is stationary, but there are no guarantees on an infinite order).

(Note that the infinite order geometric series we got from the previous AR(1) expansion is just an example of a simple infinite order MA series :-))

Lets look at the roots of $$(1- \alpha_1 B - \alpha_2 B^2)$$

we can write the equation as

$$(1-g_1 B) (1-g_2 B) =0$$

The roots are $1/g_1$ and $1/g_2$.

The process is stationary (to see this fix all the other factors and expand any one) if $\vert g_1 \vert < 1$ and $\vert g_2 \vert < 1$ which you can show (solve the quadratic equation!) impose the following condition on $\alpha_1$ and $\alpha_2$.

\begin{eqnarray} \alpha_1 + \alpha_2 &<&1 \\ -\alpha_1 + \alpha_2 &<&1 \\ -1< \alpha_2 &<&1 \\ \end{eqnarray}

In [19]:
Image("ar2st.png")

Out[19]:

### Generalize to AR(p)

We can generalize to AR(P) models. Here we have:

$$x_t = \alpha_1 X_{t-1} + ... + \alpha_p X_{t-p} + Z_t$$

Such series arise in many econometric situations: prediction os inflation, imports, exports, etc: wherever there is some memory in the process.

### Prediction or Forecasting

We forecast by plugging in the values of the coefficients and computing future values of the time series. The first thing to do is just visual inspection to see if we have a reasonable fit.

In [41]:
data=open("./lc_78.5855.788.B.mjd").readlines()
data = [e.strip().split() for e in data[3:]]
t = np.array([float(e[0]) for e in data])
f = np.array([float(e[1]) for e in data])
e = np.array([float(e[2]) for e in data])

In [42]:
step = (t.max() - t.min())/t.shape[0]
times = np.arange(t.min(), t.max(), step)
fluxes = np.interp(times,t,f)
fluxes = fluxes - fluxes.mean()
plt.plot(times, fluxes);

In [43]:
tsaplots.plot_acf(fluxes, lags=40);

In [44]:
tsaplots.plot_pacf(fluxes, lags=40);

In [45]:
fluxes_train=fluxes[:700]
fluxes_test=fluxes[700:]


Fit and check the AIC's for the best model. It looks like a whole bunch of AR(p) processes will do:

In [46]:
aics=np.zeros(15)
orders=range(15)
fits=[]
for j in orders:
fit=sm.tsa.ARMA(fluxes_train, (j,0)).fit(method="mle")
fits.append(fit)
aics[j]=fit.aic
plt.plot(orders, aics);

In [47]:
aics

Out[47]:
array([ -650.86107276, -1460.6788049 , -1518.21965581, -1534.32271782,
-1542.73270875, -1545.03290121, -1548.41346636, -1550.59929528,
-1555.13511839, -1557.92829601, -1557.34941916, -1557.37222612,
-1555.82265655, -1554.68637896, -1555.50610144])


We test the gaussianity of the residuals: the departure is mainly in the tail.

In [48]:
from statsmodels.graphics.api import qqplot
qqplot(fits[10].resid,line='q');

In [49]:
fits[10].summary()

Out[49]:
Dep. Variable: No. Observations: y 700 ARMA(10, 0) 790.675 mle 0.078 Fri, 04 Apr 2014 -1557.349 12:23:13 -1502.736 0 -1536.238
coef std err z P>|z| [95.0% Conf. Int.] 0.0171 0.055 0.308 0.758 -0.092 0.126 0.4860 0.038 12.878 0.000 0.412 0.560 0.1425 0.042 3.400 0.001 0.060 0.225 0.0634 0.042 1.502 0.134 -0.019 0.146 0.0529 0.042 1.252 0.211 -0.030 0.136 0.0102 0.042 0.241 0.810 -0.073 0.093 0.0262 0.042 0.620 0.535 -0.057 0.109 0.0143 0.042 0.338 0.735 -0.068 0.097 0.0493 0.042 1.171 0.242 -0.033 0.132 0.0607 0.042 1.455 0.146 -0.021 0.143 0.0449 0.038 1.193 0.233 -0.029 0.119
1.0166 -0.0000j 1.0166 -0.0000 1.0159 -0.7065j 1.2374 -0.0967 1.0159 +0.7065j 1.2374 0.0967 0.3546 -1.3041j 1.3514 -0.2077 0.3546 +1.3041j 1.3514 0.2077 -0.5328 -1.3667j 1.4669 -0.3092 -0.5328 +1.3667j 1.4669 0.3092 -1.4852 -0.0000j 1.4852 -0.5000 -1.2792 -0.9019j 1.5652 -0.4023 -1.2792 +0.9019j 1.5652 0.4023

We now engage in the predictive process:

In [50]:
index_rem=np.arange(700,900,1)
predicts=fits[10].predict(start=700, end=899)
in_predicts = fits[10].predict()

In [51]:
predicts.shape, index_rem.shape, in_predicts.shape

Out[51]:
((200,), (200,), (700,))

In [52]:
plt.plot(fluxes)
plt.plot(in_predicts)
plt.plot(index_rem, predicts)

Out[52]:
[<matplotlib.lines.Line2D at 0x109326090>]


We visually see that we get the trend roughly right. We must estimate the error bars, which we can do by bootstrapping or other means.

In general, to objectively evaluate our forecast, we must specify a cost function, like a mean squared error or a mean absolute error, on the test set. The MLE (or for AR(p), the least squares) process automatically does this for us on the training set, but this does not mean we have generalized well. However, this is why we use the AIC (or its cousin the BIC), or in a bayesian context, the evidence to pick the order (and parameters) of the model. statsmodels provides us a function for this for AR:

In [53]:
sm.tsa.AR(fluxes_train).select_order(ic='aic', trend='nc',maxlag=50)

Out[53]:
28


Note though that the relative flatness of the aic all the way out would suggest that we be parsimonious and choose a model of a much smaller order. As usual, dont leave your brain at home.

## 5. Putting it all together, the Box Jenkins procedure

2. Plot the sample ACF and PACF.
3. Difference the data, if needed.
4. Plot the sample ACF and PACF of the differenced series.
5. Specify and fit an ARIMA/ARMA/AR/MA model.
6. Check goodness of fit.
7. Generate forecasts.

Lets do this with a example included with MATLAB: the log quarterly Australian Consumer Price Index (CPI) measured from 1972 to 1991.

In [54]:
df=pd.read_csv("./AUSCPIALLQINMEI.csv")

Out[54]:
DATE VALUE
0 1955-01-01 6.763788
1 1955-04-01 6.867846
2 1955-07-01 6.867846
3 1955-10-01 6.971904
4 1956-01-01 6.971904
In [55]:
df.DATE = pd.to_datetime(df.DATE)

Out[55]:
DATE VALUE
0 1955-01-01 00:00:00 6.763788
1 1955-04-01 00:00:00 6.867846
2 1955-07-01 00:00:00 6.867846
3 1955-10-01 00:00:00 6.971904
4 1956-01-01 00:00:00 6.971904
In [56]:
dfsub=df[(df.DATE> '1971') & (df.DATE < '1990')]

Out[56]:
DATE VALUE
66 1971-07-01 00:00:00 10.926119
67 1971-10-01 00:00:00 11.134235
68 1972-01-01 00:00:00 11.238293
69 1972-04-01 00:00:00 11.342352
70 1972-07-01 00:00:00 11.550468
In [57]:
dfsub=dfsub.set_index('DATE')

Out[57]:
VALUE
DATE
1971-07-01 10.926119
1971-10-01 11.134235
1972-01-01 11.238293
1972-04-01 11.342352
1972-07-01 11.550468
In [58]:
cpi=dfsub.VALUE
cpi

Out[58]:
DATE
1971-07-01    10.926119
1971-10-01    11.134235
1972-01-01    11.238293
1972-04-01    11.342352
1972-07-01    11.550468
1972-10-01    11.654527
1973-01-01    11.862643
1973-04-01    12.278876
1973-07-01    12.695109
1973-10-01    13.111342
1974-01-01    13.527575
1974-04-01    14.047867
1974-07-01    14.776275
1974-10-01    15.296566
1975-01-01    15.920916
...
1986-10-01    46.201873
1987-01-01    47.138398
1987-04-01    47.866805
1987-07-01    48.699272
1987-10-01    49.531738
1988-01-01    50.364204
1988-04-01    51.300728
1988-07-01    52.237253
1988-10-01    53.277836
1989-01-01    53.798127
1989-04-01    55.150884
1989-07-01    56.399584
1989-10-01    57.440166
1990-01-01    58.480749
1990-04-01    59.417274
Name: VALUE, Length: 76, dtype: float64

In [59]:
lcpi=np.log(cpi)


In [60]:
lcpi.plot()

Out[60]:
<matplotlib.axes.AxesSubplot at 0x1093a8150>


Plot the sample autocorrelation function (ACF) and partial autocorrelation function (PACF) for the CPI series.

In [61]:
tsaplots.plot_acf(lcpi, lags=40);

In [62]:
tsaplots.plot_pacf(lcpi, lags=40);


This suggests that we have some AR(2) business going on, at the very least. Lets remove the trend, and replot the ACF and PACF:

In [63]:
dcpi=lcpi.diff()[1:]
dcpi

Out[63]:
DATE
1971-10-01    0.018868
1972-01-01    0.009302
1972-04-01    0.009217
1972-07-01    0.018182
1972-10-01    0.008969
1973-01-01    0.017700
1973-04-01    0.034486
1973-07-01    0.033336
1973-10-01    0.032261
1974-01-01    0.031253
1974-04-01    0.037740
1974-07-01    0.050552
1974-10-01    0.034606
1975-01-01    0.040005
1975-04-01    0.032157
...
1986-10-01    0.027399
1987-01-01    0.020068
1987-04-01    0.015334
1987-07-01    0.017242
1987-10-01    0.016950
1988-01-01    0.016667
1988-04-01    0.018424
1988-07-01    0.018091
1988-10-01    0.019725
1989-01-01    0.009718
1989-04-01    0.024834
1989-07-01    0.022389
1989-10-01    0.018282
1990-01-01    0.017954
1990-04-01    0.015887
Name: VALUE, Length: 75, dtype: float64

In [64]:
dcpi.plot()

Out[64]:
<matplotlib.axes.AxesSubplot at 0x109d324d0>

In [65]:
tsaplots.plot_acf(dcpi, lags=40);