Introduction to Hidden Markov Models

We present Markov Chains and the Hidden Markov Model.

Markov Chains

Let us first give a brief introduction to Markov Chains, a type of a random process. We begin with a few “states” for the chain, {S₁,…,Sₖ}; For instance, if our chain represents the daily weather, we can have {Snow,Rain,Sunshine}. The property a process (Xₜ)ₜ should have to be a Markov Chain is:

In words, the probability of begin in a state j depends only on the previous state, and not on what happened before.

Markov Chains are often described by graph with transition probabilities, i.e, the probability of moving to state j from state i, which are denoted by pᵢ,ⱼ. Let’s look at the following example:

Markov chain example

The chain has three states; For instance, the tradition probability between Snow and Rain is 0.3, that is — if it was snowing yesterday, there is a 30% chance it will rain today. The transition probabilities can be summarized in a matrix:

Notice that the sum of each row equals 1 (think why). Such a matrix is called a Stochastic Matrix. The (i,j) is defined as pᵢ,ⱼ -the transition probability between i and j.

Fact: if we take a power of the matrix, Pᵏ, the (i,j) entry represents the probability to arrive from state i to state j at k steps.

In many cases we are given a vector of initial probabilities q=(q₁,…,qₖ) to be at each state at time t=0. Thus, the probability to be at state i at time t will be equal to the i-th entry of the vector Pq.

For instance, if today the probabilities of snow, rain and sunshine are 0,0.2,0.8, then the probability it will rain in 100 days is calculated as follows:

The 2nd entry equals ≈ 0.44.

Hidden Markov Model

In a Hidden Markov Model (HMM), we have an invisible Markov chain (which we cannot observe), and each state generates in random one of k observations, which are visible to us.

Let’s look at an example. Suppose we have the Markov Chain from the previous part, with three states (snow, rain and sunshine), P – the transition probability matrix and q — the initial probabilities. This is the invisible Markov Chain — suppose we are home and cannot see the weather. We can, however, feel the temperature inside our room, and suppose there are two possible observations: hot and cold, where:

Basic Example

As a first example, we apply the HMM to calculate the probability that it feels cold for two consecutive days. In these two days, there are 3*3=9 options for the underlying Markov states. Let us give an example for the probability computation of one of these 9 options:

Summing up all options gives the desired probability.

Finding Hidden States — Viterbi Algorithm

In some cases we are given a series of observation, and want to find the most probable corresponding hidden states.

A brute force solution would take exponential time (like the calculations above); A more efficient approach is called the Viterbi Algorithm; its main idea is as follows: we are given a sequence of observations o₁,…,o . For each state i and t=1,…,T, we define

That is, the maximum probability of a path which ends at time t at the state i, given our observations. The main observation here is that by the Markov property, the if the most likely path that ends with i at time t, equals to some i* at time t−1, then i* is the value of the last state of the most likely path which ends at time t−1. This gives us the following forward recursion:

here, αⱼ(oₜ) denotes the probability to have oₜ when the hidden Markov state is .

Here is an example. Let us generate a sequence of 14 days, in each 1 denotes hot temperature and 0 denotes called. We will use the algorithm to find the most likely weather forecast of these two weeks.

import numpy as np
import pandas as pd

obs_map = {'Cold':0, 'Hot':1}
obs = np.array([1,1,0,1,0,0,1,0,1,1,0,0,0,1])

inv_obs_map = dict((v,k) for k, v in obs_map.items())
obs_seq = [inv_obs_map[v] for v in list(obs)]

print("Simulated Observations:n",pd.DataFrame(np.column_stack([obs, obs_seq]),columns=['Obs_code', 'Obs_seq']) )

pi = [0.6,0.4] # initial probabilities vector
states = ['Cold', 'Hot']
hidden_states = ['Snow', 'Rain', 'Sunshine']
pi = [0, 0.2, 0.8]
state_space = pd.Series(pi, index=hidden_states, name='states')
a_df = pd.DataFrame(columns=hidden_states, index=hidden_states)
a_df.loc[hidden_states[0]] = [0.3, 0.3, 0.4]
a_df.loc[hidden_states[1]] = [0.1, 0.45, 0.45]
a_df.loc[hidden_states[2]] = [0.2, 0.3, 0.5]
print("n HMM matrix:n", a_df)
a = a_df.values

observable_states = states
b_df = pd.DataFrame(columns=observable_states, index=hidden_states)
b_df.loc[hidden_states[0]] = [1,0]
b_df.loc[hidden_states[1]] = [0.8,0.2]
b_df.loc[hidden_states[2]] = [0.3,0.7]
print("n Observable layer matrix:n",b_df)
b = b_df.values

We get:

Simulated Observations:
Obs_code Obs_seq
0 1 Hot
1 1 Hot
2 0 Cold
3 1 Hot
4 0 Cold
5 0 Cold
6 1 Hot
7 0 Cold
8 1 Hot
9 1 Hot
10 0 Cold
11 0 Cold
12 0 Cold
13 1 Hot

HMM matrix:
Snow Rain Sunshine
Snow 0.3 0.3 0.4
Rain 0.1 0.45 0.45
Sunshine 0.2 0.3 0.5

Observable layer matrix:
Cold Hot
Snow 1 0
Rain 0.8 0.2
Sunshine 0.3 0.7

Now let’s use the algorithm:

path, delta, phi = viterbi(pi, a, b, obs)
state_map = {0:'Snow', 1:'Rain', 2:'Sunshine'}
state_path = [state_map[v] for v in path]
pd.DataFrame().assign(Observation=obs_seq).assign(Best_Path=state_path)

We get:

Start Walk Forward

s=0 and t=1: phi[0, 1] = 2.0
s=1 and t=1: phi[1, 1] = 2.0
s=2 and t=1: phi[2, 1] = 2.0
s=0 and t=2: phi[0, 2] = 2.0
s=1 and t=2: phi[1, 2] = 2.0
s=2 and t=2: phi[2, 2] = 2.0
s=0 and t=3: phi[0, 3] = 0.0
s=1 and t=3: phi[1, 3] = 1.0
s=2 and t=3: phi[2, 3] = 1.0
s=0 and t=4: phi[0, 4] = 2.0
s=1 and t=4: phi[1, 4] = 2.0
s=2 and t=4: phi[2, 4] = 2.0
s=0 and t=5: phi[0, 5] = 0.0
s=1 and t=5: phi[1, 5] = 1.0
s=2 and t=5: phi[2, 5] = 1.0
s=0 and t=6: phi[0, 6] = 0.0
s=1 and t=6: phi[1, 6] = 1.0
s=2 and t=6: phi[2, 6] = 1.0
s=0 and t=7: phi[0, 7] = 2.0
s=1 and t=7: phi[1, 7] = 2.0
s=2 and t=7: phi[2, 7] = 2.0
s=0 and t=8: phi[0, 8] = 0.0
s=1 and t=8: phi[1, 8] = 1.0
s=2 and t=8: phi[2, 8] = 1.0
s=0 and t=9: phi[0, 9] = 2.0
s=1 and t=9: phi[1, 9] = 2.0
s=2 and t=9: phi[2, 9] = 2.0
s=0 and t=10: phi[0, 10] = 2.0
s=1 and t=10: phi[1, 10] = 2.0
s=2 and t=10: phi[2, 10] = 2.0
s=0 and t=11: phi[0, 11] = 0.0
s=1 and t=11: phi[1, 11] = 1.0
s=2 and t=11: phi[2, 11] = 1.0
s=0 and t=12: phi[0, 12] = 0.0
s=1 and t=12: phi[1, 12] = 1.0
s=2 and t=12: phi[2, 12] = 1.0
s=0 and t=13: phi[0, 13] = 0.0
s=1 and t=13: phi[1, 13] = 1.0
s=2 and t=13: phi[2, 13] = 1.0
--------------------------------------------------
Start Backtrace

path[12] = 1
path[11] = 1
path[10] = 1
path[9] = 2
path[8] = 2
path[7] = 1
path[6] = 2
path[5] = 1
path[4] = 1
path[3] = 2
path[2] = 1
path[1] = 2
path[0] = 2

Which leads to the output:

We used the following implementation, based on [2]:

def viterbi(pi, a, b, obs):

nStates = np.shape(b)[0]
T = np.shape(obs)[0]

# init blank path
path = path = np.zeros(T,dtype=int)
# delta --> highest probability of any path that reaches state i
delta = np.zeros((nStates, T))
# phi --> argmax by time step for each state
phi = np.zeros((nStates, T))

# init delta and phi
delta[:, 0] = pi * b[:, obs[0]]
phi[:, 0] = 0

print('nStart Walk Forwardn')
# the forward algorithm extension
for t in range(1, T):
for s in range(nStates):
delta[s, t] = np.max(delta[:, t-1] * a[:, s]) * b[s, obs[t]]
phi[s, t] = np.argmax(delta[:, t-1] * a[:, s])
print('s={s} and t={t}: phi[{s}, {t}] = {phi}'.format(s=s, t=t, phi=phi[s, t]))

# find optimal path
print('-'*50)
print('Start Backtracen')
path[T-1] = np.argmax(delta[:, T-1])
for t in range(T-2, -1, -1):
path[t] = phi[path[t+1], [t+1]]
print('path[{}] = {}'.format(t, path[t]))

return path, delta, phi

Learning and the Baum-Welch Algorithm

A similar approach to the one above can be used for parameter learning of the HMM model. We have some dataset, and we want to find the parameters which fit the HMM model best. The Baum-Welch Algorithm is an iterative process which finds a (local) maximum of the probability of the observations P(O|M), where M denotes the model (with the parameters we want to fit). Since we know P(M|O) by the model, we can use a Bayesian approach to find P(M|O) and converge to an optimum.

HMM have various applications, from character recognition to financial forecasts (detecting regimes in markets).

References

[1] https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_hmm.pdf

[2] http://www.blackarbs.com/blog/introduction-hidden-markov-models-python-networkx-sklearn/2/9/2017