# Kalman Filter (01) – S&P 500 and Dow Jones Linear Regression  ### Linear Regression

Let’s get some Kalman filter basics and start playing around with it.

There is a long history about least-squares filtering. Like a linear regression model, we can fit the data with a linear function that minimizes the mean-square error. Python pandas can easily do the job:

[sourcecode language=”python” light=”true” wraplines=”false” collapse=”false”]
import datetime as dt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from statsmodels.api import OLS

# Get Adjusted Close Prices of SPY (S&amp;P 500 ETF) and DIA (Dow Jones)
df_SPY = DataReader(‘SPY’, ‘yahoo’, dt.datetime(2000,1,1), dt.datetime(2015,1,1))
df_DIA = DataReader(‘DIA’, ‘yahoo’, dt.datetime(2000,1,1), dt.datetime(2015,1,1))
df_data.columns = [‘SPY’, ‘DIA’]

# Linear Regression with pandas
result = smf.ols(‘DIA ~ SPY’, df_data).fit()
result.summary()
[/sourcecode]

To get some visual of the regression,

[sourcecode language=”python” light=”true” wraplines=”false” collapse=”false”]
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
df_data[:252*8].plot(ax=ax, kind=’scatter’, x=’SPY’, y=’DIA’, figsize=(16,12), color=(0.5, 0.5, 0.5))
df_data[252*8:].plot(ax=ax, kind=’scatter’, x=’SPY’, y=’DIA’, figsize=(16,12), color=(0.7, 0.5, 0.5))

intercept, slope = result.params
x = np.array([df_data.SPY.min(), df_data.SPY.max()])
y = intercept + slope * x
plt.plot(x, y, ‘r-‘)
[/sourcecode]

First observation you might have is: there are two clusters. The light grey indicate the cluster for the SPY-DIA dots in the years from 2000 to 2008; the light purple dots indicate the cluster for years from 2008 to 2015. So the regressed line parameters should not be constant.

R.E. Kalman in 60s published his groundbreaking paper to accommodate the time-varying linear models. Also, the framework he proposed can handle noisy measurements (prices) which might distort the fitting result. Nowaday Kalman filter becomes a popular tool for signal processing.

To start, let’s introduce the notation. Let $(\alpha_k, \beta_k)$ be a sequence of the true intercepts and slopes of the SPY-DIA regression and assume $W_k \sim \text{Normal}(\cdot, Q_k)$ are the *prior* knowledge associated with $\alpha_k$ and $\beta_k$. It is what we believe the time series of $\alpha$ and $\beta$ distribution should behave at time $k$ following $A_k$ linear transformation. So, from time $k$ to time $k + 1$ we have $\left[ \begin{array}{c} \alpha_{k+1} \\ \beta_{k+1} \end{array} \right] = A_k \times \left[ \begin{array}{c} \alpha_{k} \\ \beta_{k} \end{array} \right] + \left[ \begin{array}{c} W_{k,\alpha} \\ W_{k,\beta} \end{array} \right]$

We assume $W_k$ follows a normal distribution. If you believe the parameters should be constant over time, then we simply assume $A_k$ is an identity matrix and $W_k \sim \text{Normal}(0,0)$, which is $0$ mean, $0$ variance. Now, if $\text{SPY}_k$ and $\text{DIA}_k$ represent the prices at time $k$, we plug $\alpha$ and $\beta$ back to the price relationship: $\text{DIA}_k = \beta_{k} \times \text{SPY}_k + \alpha_k + V_k$

where $V_k$ represents the measurement error of the linear regression (becasue the relatinship is never perfactly linear and subjected to many room of interpretation). We also assume $V_k$ follows the normal distribution.

Now, let’s try if we can recreate the oridinary linear regression result. We will use pykalman module.

[sourcecode language=”python” light=”true” wraplines=”false” collapse=”false”]
import pykalman

from pykalman import KalmanFilter

# Transition Matrix
A = np.array([[1, 0], [0, 1]])
# Transition Covariance
Q = np.zeros(A.shape)

# Convert the Price Data to 3-Dimensional Array. Add an Additional Column for Intercept Term
array_SPY = np.expand_dims(df_data[‘SPY’].values, 1)
array_SPY = np.append(array_SPY, np.ones(array_SPY.shape), axis=1)
array_SPY = np.expand_dims(array_SPY, 1)

# Create a Kalman Filter Object
kf = KalmanFilter(n_dim_obs=1, n_dim_state=2, transition_matrices=A, transition_covariance=Q, observation_matrices=array_SPY)
kf = kf.em(df_data[‘DIA’].values)

# Linear Regression Outputs
alpha = kf.initial_state_mean
beta = kf.initial_state_mean

# Linear Regression Figure
fig, ax = plt.subplots(1, 1)
df_data[:252*8].plot(ax=ax, kind=’scatter’, x=’SPY’, y=’DIA’, figsize=(16,12), color=(0.5, 0.5, 0.5))
df_data[252*8:].plot(ax=ax, kind=’scatter’, x=’SPY’, y=’DIA’, figsize=(16,12), color=(0.7, 0.5, 0.5))
x = np.array([df_data.SPY.min(), df_data.SPY.max()])
y = alpha + beta * x
plt.plot(x, y, ‘r-‘)
[/sourcecode]

### What’s Next?

In light of the increasing need for more complicated models to capture prices dynamics, intuition on advanced development is connected with simple cases. Today we set foot in Kalman filter and successfully replicate an ordinary least-squares regression result. Even with noisy price data, Kalman filter has done its first task by smoothing out the signals.

Next time we shall start some fundamental pairs trading with Kalman filter. We will consider the nature of streaming financial data feed and build a workable long-short strategy on S&P 500 and Dow Jones ETFs. We will show how Kalman filter generates divergent signals, and how additional latent factors can improve the performance.