
In this post we will introduce the problem of finding the Viterbi path of triplet Markov chains. Then we will describe one way to approximate the path using methods from variational inference.

Triplet Markov chains and Viterbi path

Let \(\{Z_t\}_t\) be a Markov chain with an initial probability distribution \(\pi\) and a (homogeneous) transition matrix \(p\). A triplet Markov chain is one whose state splits as \(Z_t = (U_t,X_t,Y_t)\). The goal of introducing the extra variables is the following: we can fix a data variable \(\textbf{y} \in \mathcal{Y}^T\) and do inference about the variable \(\textbf{X} \in \mathcal{X}^T\) we are interested in, while incorporating the effects of a nuisance variable \(\textbf{U} \in \mathcal{U}^T\) into our model. Triplet Markov chains generalise pairwise Markov chains; both classes have been studied by Jüri Lember at the University of Tartu. It is easy to see that these models also generalise Markov chains and hidden Markov models.

Let us fix the observations \(\textbf{y} = (y_t)_{t=1}^T\). We can describe the process \(\{U_t,X_t | \textbf{y} \}_t\) as a discrete inhomogeneous pairwise Markov chain with an initial distribution \(\pi\) (now over \(\mathcal{U} \times \mathcal{X}\)) and a transition matrix at timestep \(t\) denoted by \(p_t\). We can introduce the Viterbi path problem as finding the argmax \(\textbf{x}^*\) of the following problem \begin{equation} \max_{\textbf{x}} \sum_{\textbf{u}} \pi(u_1, x_1) \prod_{t=2}^T p_t(u_t, x_t | u_{t-1}, x_{t-1}). \tag{1} \label{eq:test} \end{equation}

This is known to be an NP-hard problem, so we introduce a method to help us approximate the Viterbi path.
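To see where the hardness comes from, here is a minimal brute-force sketch of problem (1) in Python (the sizes, distributions and names are all hypothetical, chosen only for illustration): for every candidate path \(\textbf{x}\) we must sum over every nuisance path \(\textbf{u}\), so the direct approach costs on the order of \(|\mathcal{X}|^T \cdot |\mathcal{U}|^T\) operations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny problem: |U| = |X| = 2 and T = 4 timesteps.
U, X, T = 2, 2, 4

# Random initial distribution pi(u1, x1) and transitions p_t(u, x | u', x').
pi = rng.random((U, X))
pi /= pi.sum()
p = rng.random((T, U, X, U, X))           # p[t, u_prev, x_prev, u, x]
p /= p.sum(axis=(3, 4), keepdims=True)    # normalise over the next state (u, x)

def objective(x_path):
    """Value of (1) for a fixed x-path: sum over all u-paths of the joint."""
    total = 0.0
    for u_path in itertools.product(range(U), repeat=T):
        prob = pi[u_path[0], x_path[0]]
        for t in range(1, T):
            prob *= p[t, u_path[t - 1], x_path[t - 1], u_path[t], x_path[t]]
        total += prob
    return total

# Brute-force maximisation over all |X|^T candidate paths -- exponential in T.
best = max(itertools.product(range(X), repeat=T), key=objective)
```

Already at modest \(T\) this enumeration is infeasible, which is exactly what motivates an approximation.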

Variational inference on discrete probability distributions

As concluded in the previous chapter, we are interested in discrete probability distributions. This makes our life a bit easier than in the settings usually studied in the literature.

We approximate the Viterbi path by creating a variational distribution \(q\) that approximates the true distribution \(p\). Let us clarify which \(p\) we are talking about. Our wish is to minimise the following Kullback-Leibler divergence on distributions over \(\mathcal{U}^T \times \mathcal{X}^T\)

\[D\left[ q( \textbf{u}, \textbf{x}) \| p(\textbf{u},\textbf{x} | \textbf{y}) \right].\]

The problem in our case is again made easier, because the distribution

\[p(\cdot, \cdot | \textbf{y})\]

is tractable. When it is not, one instead introduces the evidence lower bound \(L\), or ELBO, and maximises \(L\) rather than directly minimising the KL divergence.

It is well known that the \(q\) minimising the KL divergence is in fact \(p\) itself. But this still leaves us with the NP-hard problem. So we choose to add some constraints on \(q\) that make finding the Viterbi path tractable. The two constraints we look at are

  • \(q_{\text{BP}}(\textbf{u},\textbf{x}) = q(\textbf{u})q(\textbf{x})\) inducing the belief propagation (BP) algorithm
  • \(q_{\text{VMP}}(\textbf{u},\textbf{x}) = \prod_{t=1}^T q_t(u_t,x_t)\) inducing the variational message passing (VMP) algorithm.

These algorithm names are motivated by a paper from Karl Friston.

To recap, we are finding some distributions \(q_{\text{BP}}, q_{\text{VMP}}\) that closely match the posterior (in terms of the KL divergence) and then we find the Viterbi path using these variational distributions resulting in a Viterbi path approximation.

But how would one obtain the distributions that have a minimal KL divergence under these constraints? Let us derive the optimal update rules for the VMP algorithm. This turns out to be quite similar to the EM algorithm.

Let the initialisation be such that we have distributions \(q_1^{(0)},\ldots,q_T^{(0)}\) that have a finite KL divergence. We fix some \(t\) and find \(q^{(1)}_t\) that minimises the KL divergence while having

\[q_{\setminus t} := q_1^{(0)} \times \ldots \times q_{t-1}^{(0)} \times q_{t+1}^{(0)} \times \ldots \times q_T^{(0)}\]

be held fixed. This gives an iterative approach where at each step we solve the problem

\[\min_{q_t} D\left[ q_t \times q_{\setminus t} \| p \right].\]

Implementation of the variational update rule

It turns out that the solution to the previous problem is

\[q_t^{(1)}(u_t,x_t) = \frac{1}{Z_t} \exp \left[ \sum_{ \textbf{u}_{\setminus t}, \textbf{x}_{\setminus t}} q_{\setminus t}(\textbf{u}_{\setminus t}, \textbf{x}_{\setminus t}) \ln p(\textbf{u}, \textbf{x} | \textbf{y}) \right]\]

or

\[q_t^{(1)}(u_t,x_t) = \frac{1}{Z_t} \exp \left[ \sum_{u_{t-1},x_{t-1} } q_{t-1}^{(0)} (u_{t-1}, x_{t-1}) \ln p_t(u_t,x_t | u_{t-1},x_{t-1}) + \sum_{u_{t+1}, x_{t+1} } q_{t+1}^{(0)}(u_{t+1},x_{t+1}) \ln p_{t+1}(u_{t+1},x_{t+1} | u_t, x_t) \right]\] (for \(1 < t < T\); at the boundary timesteps only the existing neighbour term remains, with \(\pi\) entering at \(t = 1\)),

where \(Z_t\) is the normalising constant that makes \(q_t\) a probability distribution. We then repeat this update for all other \(t\), where some of the factors of the product measure \(q_{\setminus t}\) are the already updated measures.
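As a concrete illustration, here is a small Python sketch of this coordinate update (the chain, its sizes and all names are hypothetical; the pair \((u_t, x_t)\) is flattened into a single state index for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small chain: joint states s = (u, x) with |U| = |X| = 2, T = 5.
U, X, T = 2, 2, 5
S = U * X                                   # work with flattened (u, x) pairs

pi = rng.random(S)
pi /= pi.sum()                              # initial distribution over (u1, x1)
p = rng.random((T, S, S))                   # p[t, s_prev, s_next] = p_t(s_t | s_{t-1})
p /= p.sum(axis=2, keepdims=True)

def vmp_update(q, t):
    """One coordinate update: the new q_t with the neighbouring factors fixed."""
    log_qt = np.zeros(S)
    if t == 0:
        log_qt += np.log(pi)                # initial distribution enters at t = 1
    else:
        log_qt += q[t - 1] @ np.log(p[t])   # E_{q_{t-1}}[ln p_t(s_t | s_{t-1})]
    if t < T - 1:
        log_qt += np.log(p[t + 1]) @ q[t + 1]  # E_{q_{t+1}}[ln p_{t+1}(s_{t+1} | s_t)]
    qt = np.exp(log_qt - log_qt.max())      # subtract max for numerical stability
    return qt / qt.sum()                    # the Z_t normalisation

# Initialise with uniform factors and sweep until the factors stop moving.
q = np.full((T, S), 1.0 / S)
for sweep in range(100):
    for t in range(T):
        q[t] = vmp_update(q, t)
```

Each sweep touches every \(q_t\) once; in this toy setting a handful of sweeps already brings the factors very close to a fixed point.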

Moreover, it can be shown that this iterative approach converges to the global minimum, i.e., among such product measures we converge to

\[q_{1}^{(\infty)} \times \ldots \times q_{T}^{(\infty)}\]

that achieves the minimal KL divergence. In practice this convergence happens quite quickly too. This can mainly be attributed to the fact that the KL form \(D \left[ q \| p \right]\) has good properties and leads to tractable update rules.

The belief propagation algorithm has a similar update rule, but whereas the VMP algorithm only has to include \(q_{t-1}\) and \(q_{t+1}\) to calculate the optimal \(q_t\), the BP algorithm needs to include information over all timesteps. This comes from the fact that, when updating \(q(\textbf{u})\), finding the necessary marginals \(q(x_{t-1},x_t)\) requires running the forward-backward algorithm.

For the VMP algorithm the Viterbi path approximation is simply \(\textbf{x}^* = (x_t^*)_{t=1}^T\), where \(x_t^*\) achieves the maximum in the following problem

\[\max_{x_t} \sum_{u_t} q_t(u_t,x_t).\]
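In code this extraction is essentially a one-liner; a Python sketch with made-up "converged" factors (all sizes and values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical converged VMP factors: q[t] is a (|U|, |X|) table for q_t(u_t, x_t).
U, X, T = 2, 3, 4
q = rng.random((T, U, X))
q /= q.sum(axis=(1, 2), keepdims=True)

# Marginalise the nuisance variable u_t out of every factor, then take the
# per-timestep argmax: the fully factorised q makes the maximisation separable.
x_star = q.sum(axis=1).argmax(axis=1)
```

Because \(q_{\text{VMP}}\) factorises over timesteps, each \(x_t^*\) can be read off independently, with no dynamic programming needed.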

For the BP algorithm we obtain the approximation by applying the Viterbi algorithm to the distribution \(q(\textbf{x})\).

Why is this interesting?

This method can be very useful when the underlying distribution of the inhomogeneous pairwise Markov model is similar to either the \(q_{\text{BP}}\) or the \(q_{\text{VMP}}\) distribution. The algorithm incorporates information from all timesteps and is quite fast. The VMP algorithm can be implemented with little effort using the RxInfer library in Julia.

This method is of particular interest to me, because of the possibility to slice and dice the hidden variables. In my Master’s thesis I assumed \(\mathcal{U}\) and \(\mathcal{X}\) to be fixed, but it is also possible to choose how to separate a dictionary. This might lead us to a method that helps us construct models that are sparse and explainable.
