The following is a summary of HMMs from the book *Speech and Language Processing*. I feel I have never understood HMMs this thoroughly before. This book explains everything in speech and natural language processing in plain, accessible terms. It is a must-read for studying this area. Highly recommended.

HMMs and MEMMs are both sequence classifiers. A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence.

Hidden Markov Models

Markov chain

We view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state.

Markov Assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$

A Markov chain is specified by the following components:

  • $Q = q_1 q_2 \dots q_N$ A set of $N$ states

  • $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$ A transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \; \forall i$

  • $q_0, q_F$ A special start state and end state which are not associated with observations

A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world. A hidden Markov model allows us to talk about both observed events and hidden events that we think of as causal factors in our probabilistic model.
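The sequence probability a Markov chain supports can be sketched in a few lines of Python. The two-state chain and all the probabilities below are made-up illustrative values, not taken from the book:

```python
# A minimal first-order Markov chain sketch. The two states and every
# probability below are assumed illustrative values.
HOT, COLD = 0, 1

A = [[0.7, 0.3],   # A[i][j]: probability of moving from state i to state j
     [0.4, 0.6]]   # (each row sums to 1)

pi = [0.5, 0.5]    # stands in for the a_{0j} transitions out of the start state q_0

def sequence_probability(states, A, pi):
    """P(q_1 ... q_T) under the first-order Markov assumption."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]   # each state depends only on the previous state
    return p

# P(HOT, HOT, COLD) = 0.5 * 0.7 * 0.3 = 0.105
print(sequence_probability([HOT, HOT, COLD], A, pi))
```

Because of the Markov assumption, the whole joint probability factors into a start probability times one transition probability per step.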

Hidden Markov Models

An HMM is specified by the following components:

  • $Q = q_1 q_2 \dots q_N$ A set of $N$ states
  • $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$ A transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \; \forall i$
  • $O = o_1 o_2 \dots o_T$ A sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \dots, v_V$
  • $B = b_i(o_t)$ A sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
  • $q_0, q_F$ A special start state and end state which are not associated with observations, together with transition probabilities $a_{01} a_{02} \dots a_{0n}$ out of the start state and $a_{1F} a_{2F} \dots a_{nF}$ into the end state.

A first-order Hidden Markov Model instantiates two simplifying assumptions:

  • Markov Assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$. As with a first-order Markov chain, the probability of a particular state depends only on the previous state.
  • Output Independence Assumption: $P(o_i \mid q_1 \dots q_i, \dots, q_T, o_1, \dots, o_i, \dots, o_T) = P(o_i \mid q_i)$. The probability of an output observation $o_i$ depends only on the state that produced the observation, $q_i$, and not on any other states or any other observations.

Types of HMM:

  • Fully-connected or ergodic HMM: there is a non-zero probability of transitioning between any two states
  • Bakis HMM: many of the transitions between states have zero probability and the state transitions proceed from left to right

Hidden Markov Models are characterized by three fundamental problems:

  • Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$
  • Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$
  • Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$

Computing Likelihood: the Forward Algorithm

An efficient ($O(N^2 T)$) algorithm called the forward algorithm is a kind of dynamic programming. It computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.

Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$:

$$\alpha_t(j) = P(o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

  • $\alpha_{t-1}(i)$ The previous forward path probability from the previous time step
  • $a_{ij}$ The transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ The state observation likelihood of the observation symbol $o_t$ given the current state $j$

Formal definition of the forward algorithm

  1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$
  2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$
  3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
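The three steps above translate directly into code. The sketch below uses a toy two-state HMM whose probabilities are all assumed values; the start probabilities stand in for $a_{0j}$, and the end-state transitions $a_{iF}$ are set to 1 for simplicity:

```python
# A sketch of the forward algorithm. The toy two-state HMM and all of
# its probabilities are assumed illustrative values, not the book's.
N = 2                                    # number of hidden states
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij: transition probabilities
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o): rows = states, columns = symbols
pi = [0.8, 0.2]                          # a_{0j}: transitions out of the start state q_0
eta = [1.0, 1.0]                         # a_{iF}: transitions into the end state q_F

def forward(obs):
    """Return P(O | lambda) by summing over all hidden state paths."""
    # Initialization: alpha_1(j) = a_{0j} * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_{iF}
    return sum(alpha[i] * eta[i] for i in range(N))

print(forward([2, 0, 2]))   # observations given as column indices into B
```

Each trellis column costs $O(N^2)$ work and there are $T$ columns, giving the $O(N^2 T)$ cost mentioned above, versus $O(N^T)$ for brute-force enumeration of paths.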

HMM Learning Notes for ASR

Decoding: the Viterbi Algorithm

For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.

Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \dots q_T$

The most common decoding algorithm for HMMs is the Viterbi algorithm. Like the forward algorithm, Viterbi is a kind of dynamic programming and makes use of a dynamic programming trellis.

Each cell of the Viterbi trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \dots, q_{t-1}$, given the automaton $\lambda$:

$$v_t(j) = \max_{q_0, q_1, \dots, q_{t-1}} P(q_0, q_1, \dots, q_{t-1}, o_1, o_2, \dots, o_t, q_t = j \mid \lambda) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

  • $v_{t-1}(i)$ The previous Viterbi path probability from the previous time step
  • $a_{ij}$ The transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ The state observation likelihood of the observation symbol $o_t$ given the current state $j$

Note that the Viterbi algorithm is identical to the Forward algorithm except that it takes the max over the previous path probabilities where the forward algorithm takes the sum. The Viterbi algorithm also has back pointers, which will compute the best state sequence by keeping track of the path of hidden states that led to each state, and then at the end tracing back the best path to the beginning (the Viterbi backtrace).

Formal definition of the Viterbi algorithm

  1. Initialization:

    $v_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$

    $bt_1(j) = 0$

  2. Recursion (recall states $0$ and $q_F$ are non-emitting)

    $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$

    $bt_t(j) = \arg\max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$

  3. Termination:

    The best score: $P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{i,F}$

    The start of backtrace: $q_T^* = bt_T(q_F) = \arg\max_{i=1}^{N} v_T(i)\, a_{i,F}$
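The formal definition above can be sketched as follows, again on a toy two-state HMM with assumed probabilities (the same made-up values are not from the book):

```python
# A sketch of the Viterbi algorithm with backpointers. All numbers in
# this toy two-state HMM are assumed illustrative values.
N = 2
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o)
pi = [0.8, 0.2]                          # a_{0j}
eta = [1.0, 1.0]                         # a_{iF}

def viterbi(obs):
    """Return (best path probability, most probable state sequence)."""
    # Initialization: v_1(j) = a_{0j} * b_j(o_1), bt_1(j) = 0
    v = [pi[j] * B[j][obs[0]] for j in range(N)]
    backptr = []
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        v_new, bt = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: v[i] * A[i][j])
            bt.append(best_i)                       # bt_t(j) = argmax_i ...
            v_new.append(v[best_i] * A[best_i][j] * B[j][o])
        v, backptr = v_new, backptr + [bt]
    # Termination: best final state, then trace the backpointers
    last = max(range(N), key=lambda i: v[i] * eta[i])
    path = [last]
    for bt in reversed(backptr):
        path.append(bt[path[-1]])
    path.reverse()
    return v[last] * eta[last], path

print(viterbi([2, 0, 2]))
```

As the text notes, replacing the `max`/`argmax` over previous cells with a sum recovers the forward algorithm.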

TRAINING HMMs: The Forward-Backward Algorithm

Learning: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.

The standard algorithm for HMM training is the forward-backward or Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm. The algorithm will let us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM.

Let us begin by considering the much simpler case of training a Markov chain rather than an HMM. Since the states in a Markov chain are observed and it has no emission probabilities $B$, we can view a Markov chain as a degenerate HMM where all the $b$ probabilities are 1.0 for the observed symbol and 0 for all other symbols. Thus the only probabilities we need to train are in the transition probability matrix $A$.

We get the maximum likelihood estimate of the probability $a_{ij}$ of a particular transition between states $i$ and $j$ by counting the number of times the transition was taken, which we could call $C(i \rightarrow j)$, and then normalizing by the total count of all the times we took any transition from state $i$:

$$a_{ij} = \frac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$$
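This count-and-normalize maximum likelihood estimate can be sketched directly; the observed state sequence below is a made-up example:

```python
# MLE of transition probabilities from a fully observed state sequence.
# The example sequence is an assumed illustrative value.
from collections import Counter

def mle_transitions(states, n_states):
    """Estimate a_ij = C(i -> j) / sum_q C(i -> q) from observed states."""
    counts = Counter(zip(states, states[1:]))   # count each (i, j) transition
    A_hat = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        total = sum(counts[(i, q)] for q in range(n_states))
        if total:                               # leave the row at 0 if state i never occurs
            for j in range(n_states):
                A_hat[i][j] = counts[(i, j)] / total
    return A_hat

# Transitions observed: 0->0 twice, 0->1 once, 1->0 once
print(mle_transitions([0, 0, 0, 1, 0], 2))
```

This works only because every state is observed; the rest of the section deals with the hidden-state case.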

For an HMM we cannot compute these counts directly from an observation sequence, since we don't know which path of states was taken through the machine for a given input.

The Baum-Welch algorithm uses two neat intuitions to solve this problem.

  1. The first idea is to iteratively estimate the counts. We will start with an estimate for the transition and observation probabilities, and then use these estimated probabilities to derive better and better probabilities.
  2. The second idea is that we get our estimated probabilities by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.

Backward probability

The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given the automaton $\lambda$):

$$\beta_t(i) = P(o_{t+1}, o_{t+2} \dots o_T \mid q_t = i, \lambda)$$

Formal definition of the backward algorithm

  1. Initialization: $\beta_T(i) = a_{i,F}, \;\; 1 \leq i \leq N$

  2. Recursion (again since states $0$ and $q_F$ are non-emitting):

    $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,\; 1 \leq t < T$

  3. Termination:

    $P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
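The backward recursion mirrors the forward one but runs right to left. The sketch below reuses the same kind of toy two-state HMM with assumed probabilities; by the termination formula, the value it returns is $P(O \mid \lambda)$, the same quantity the forward algorithm computes:

```python
# A sketch of the backward algorithm. Every probability in this toy
# two-state HMM is an assumed illustrative value.
N = 2
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o)
pi = [0.8, 0.2]                          # a_{0j}
eta = [1.0, 1.0]                         # a_{iF}

def backward(obs):
    """Return the beta trellis and P(O | lambda) from the termination step."""
    T = len(obs)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: beta_T(i) = a_{iF}
    beta[T - 1] = eta[:]
    # Recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # Termination: P(O | lambda) = sum_j a_{0j} * b_j(o_1) * beta_1(j)
    prob = sum(pi[j] * B[j][obs[0]] * beta[0][j] for j in range(N))
    return beta, prob

beta, prob = backward([2, 0, 2])
print(prob)
```

Comparing this termination value against the forward recursion on the same parameters is a useful sanity check, since both compute $P(O \mid \lambda)$.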

We are now ready to understand how the forward and backward probabilities can help us compute the transition probability $a_{ij}$ and the observation probability $b_i(o_t)$ from an observation sequence, even though the actual path taken through the machine is hidden.

Transition Probability Matrix

Let's begin by showing how to estimate $\hat{a}_{ij}$:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

How do we compute the numerator? Here is the intuition. Suppose we had some estimate of the probability that a given transition $i \rightarrow j$ was taken at a particular point in time $t$ in the observation sequence. If we knew this probability for each particular time $t$, we could sum over all times $t$ to estimate the total count for the transition $i \rightarrow j$.

Formally, let's define the probability $\xi_t$ as the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the observation sequence and, of course, the model:

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$$

In detail:

$$\left.\begin{matrix} \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) \\ \text{not-quite-}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \\ P(O \mid \lambda) = \alpha_T(N) = \beta_T(1) = \sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j) \\ \text{laws of probability:}\; P(Q \mid O, \lambda) = \frac{P(Q, O \mid \lambda)}{P(O \mid \lambda)} \end{matrix}\right\} \Rightarrow \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$$

The expected number of transitions from state $i$ to state $j$ is then the sum over all $t$ of $\xi$, so here is the final formula for $\hat{a}_{ij}$:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i, j)}$$

Observation Probability Matrix

This is the probability of a given symbol $v_k$ from the observation vocabulary $V$, given a state $j$: $\hat{b}_j(v_k)$.

$$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$$

For this we will need to know the probability of being in state $j$ at time $t$, which we call $\gamma_t(j)$:

$$\gamma_t(j) = P(q_t = j \mid O, \lambda) = \frac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$

We are ready to compute $b$. For the numerator, we sum $\gamma_t(j)$ for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$ that we are interested in. For the denominator, we sum $\gamma_t(j)$ over all time steps $t$. The result is the percentage of the times that we were in state $j$ and saw symbol $v_k$:

$$\hat{b}_j(v_k) = \frac{\sum_{t=1 \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

We now have ways to re-estimate the transition probabilities $A$ and observation probabilities $B$ from an observation sequence $O$, assuming that we already have a previous estimate of $A$ and $B$.

The Forward-Backward algorithm

The forward-backward algorithm starts with some initial estimate of the HMM parameters $\lambda = (A, B)$ and then iteratively runs two steps. Like other cases of the EM algorithm, these are the expectation step (E-step) and the maximization step (M-step).

In the E-step, we compute the expected state occupancy count $\gamma$ and the expected state transition count $\xi$ from the earlier $A$ and $B$ probabilities. In the M-step, we use $\gamma$ and $\xi$ to recompute new $A$ and $B$ probabilities.
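Putting the pieces together, one E-step/M-step iteration can be sketched as below on a toy two-state, two-symbol HMM. All initial parameters and the observation sequence are assumed values, and the special end state $q_F$ is ignored for simplicity (termination is handled by summing the final $\alpha$ column):

```python
# One Baum-Welch (forward-backward) iteration on a toy HMM. All
# parameters and observations are assumed illustrative values; the
# special end state q_F is ignored for simplicity.
N, M = 2, 2                          # number of states, vocabulary size
A = [[0.6, 0.4], [0.5, 0.5]]         # initial a_ij estimates
B = [[0.7, 0.3], [0.4, 0.6]]         # initial b_j(v_k) estimates
pi = [0.5, 0.5]                      # initial start probabilities
obs = [0, 1, 0, 0, 1]                # observation symbol indices
T = len(obs)

# Forward trellis alpha[t][j]
alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
for o in obs[1:]:
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                  for j in range(N)])

# Backward trellis beta[t][i], with beta_T(i) = 1
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                         for j in range(N))

p_O = sum(alpha[T - 1][j] for j in range(N))   # P(O | lambda)

# E-step: expected occupancy gamma_t(j) and expected transitions xi_t(i, j)
gamma = [[alpha[t][j] * beta[t][j] / p_O for j in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_O
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# M-step: re-estimate a_ij and b_j(v_k) from the expected counts
A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(N)] for i in range(N)]
B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
          sum(gamma[t][j] for t in range(T))
          for k in range(M)] for j in range(N)]

print(A_new)
print(B_new)
```

Running this E/M pair repeatedly, with `A_new`/`B_new` fed back in as the new estimates, is the full training loop; each iteration can only increase (or leave unchanged) the likelihood $P(O \mid \lambda)$.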
