The following is a summary of HMMs from the book *Speech and Language Processing*. I feel I have never understood HMMs this thoroughly before. This book explains everything in speech and natural language processing in plain, accessible terms. It is a must-read for studying this area. Highly recommended.

HMMs and MEMMs are both sequence classifiers. A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence.

Hidden Markov Models

Markov chain

We view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state.

Markov Assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$

A Markov chain is specified by the following components:

  • $Q = q_1 q_2 \dots q_N$ A set of $N$ states

  • $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$ A transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \; \forall i$

  • $q_0, q_F$ A special start state and end state which are not associated with observations

A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world. A hidden Markov model allows us to talk about both observed events and hidden events that we think of as causal factors in our probabilistic model.
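The sequence probability a Markov chain supports can be sketched in a few lines of Python. The two-state chain and all the probabilities below are made-up illustrative values, not taken from the book:

```python
# A minimal first-order Markov chain sketch. The two states and every
# probability below are assumed illustrative values.
HOT, COLD = 0, 1

A = [[0.7, 0.3],   # A[i][j]: probability of moving from state i to state j
     [0.4, 0.6]]   # (each row sums to 1)

pi = [0.5, 0.5]    # stands in for the a_{0j} transitions out of the start state q_0

def sequence_probability(states, A, pi):
    """P(q_1 ... q_T) under the first-order Markov assumption."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]   # each state depends only on the previous state
    return p

# P(HOT, HOT, COLD) = 0.5 * 0.7 * 0.3 = 0.105
print(sequence_probability([HOT, HOT, COLD], A, pi))
```

Because of the Markov assumption, the whole joint probability factors into a start probability times one transition probability per step.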

Hidden Markov Models

An HMM is specified by the following components:

  • $Q = q_1 q_2 \dots q_N$ A set of $N$ states
  • $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$ A transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \; \forall i$
  • $O = o_1 o_2 \dots o_T$ A sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \dots, v_V$
  • $B = b_i(o_t)$ A sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
  • $q_0, q_F$ A special start state and end state which are not associated with observations, together with transition probabilities $a_{01} a_{02} \dots a_{0n}$ out of the start state and $a_{1F} a_{2F} \dots a_{nF}$ into the end state.

A first-order Hidden Markov Model instantiates two simplifying assumptions:

  • Markov Assumption: $P(q_i \mid q_1 \dots q_{i-1}) = P(q_i \mid q_{i-1})$. As with a first-order Markov chain, the probability of a particular state depends only on the previous state.
  • Output Independence Assumption: $P(o_i \mid q_1 \dots q_i, \dots, q_T, o_1, \dots, o_i, \dots, o_T) = P(o_i \mid q_i)$. The probability of an output observation $o_i$ depends only on the state that produced the observation, $q_i$, and not on any other states or any other observations.

Types of HMM:

  • Fully-connected or ergodic HMM: there is a non-zero probability of transitioning between any two states
  • Bakis HMM: many of the transitions between states have zero probability and the state transitions proceed from left to right

Hidden Markov Models are characterized by three fundamental problems:

  • Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$
  • Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$
  • Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$

Computing Likelihood: the Forward Algorithm

An efficient ($O(N^2 T)$) algorithm called the forward algorithm is a kind of dynamic programming. It computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.

Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$:

$$\alpha_t(j) = P(o_1, o_2 \dots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

  • $\alpha_{t-1}(i)$ The previous forward path probability from the previous time step
  • $a_{ij}$ The transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ The state observation likelihood of the observation symbol $o_t$ given the current state $j$

Formal definition of the forward algorithm

  1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$
  2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$
  3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
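The three steps above translate directly into code. The sketch below uses a toy two-state HMM whose probabilities are all assumed values; the start probabilities stand in for $a_{0j}$, and the end-state transitions $a_{iF}$ are set to 1 for simplicity:

```python
# A sketch of the forward algorithm. The toy two-state HMM and all of
# its probabilities are assumed illustrative values, not the book's.
N = 2                                    # number of hidden states
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij: transition probabilities
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o): rows = states, columns = symbols
pi = [0.8, 0.2]                          # a_{0j}: transitions out of the start state q_0
eta = [1.0, 1.0]                         # a_{iF}: transitions into the end state q_F

def forward(obs):
    """Return P(O | lambda) by summing over all hidden state paths."""
    # Initialization: alpha_1(j) = a_{0j} * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_{iF}
    return sum(alpha[i] * eta[i] for i in range(N))

print(forward([2, 0, 2]))   # observations given as column indices into B
```

Each trellis column costs $O(N^2)$ work and there are $T$ columns, giving the $O(N^2 T)$ cost mentioned above, versus $O(N^T)$ for brute-force enumeration of paths.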

HMM Learning Notes for ASR

Decoding: the Viterbi Algorithm

For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.

Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \dots q_T$

The most common decoding algorithm for HMMs is the Viterbi algorithm. Like the forward algorithm, Viterbi is a kind of dynamic programming and makes use of a dynamic programming trellis.

Each cell of the Viterbi trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \dots, q_{t-1}$, given the automaton $\lambda$:

$$v_t(j) = \max_{q_0, q_1, \dots, q_{t-1}} P(q_0, q_1, \dots, q_{t-1}, o_1, o_2, \dots, o_t, q_t = j \mid \lambda) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

  • $v_{t-1}(i)$ The previous Viterbi path probability from the previous time step
  • $a_{ij}$ The transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ The state observation likelihood of the observation symbol $o_t$ given the current state $j$

Note that the Viterbi algorithm is identical to the Forward algorithm except that it takes the max over the previous path probabilities where the forward algorithm takes the sum. The Viterbi algorithm also has back pointers, which will compute the best state sequence by keeping track of the path of hidden states that led to each state, and then at the end tracing back the best path to the beginning (the Viterbi backtrace).

Formal definition of the Viterbi algorithm

  1. Initialization:

    $v_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$

    $bt_1(j) = 0$

  2. Recursion (recall states $0$ and $q_F$ are non-emitting)

    $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$

    $bt_t(j) = \arg\max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \;\; 1 \leq j \leq N,\; 1 < t \leq T$

  3. Termination:

    The best score: $P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{i,F}$

    The start of backtrace: $q_T^* = bt_T(q_F) = \arg\max_{i=1}^{N} v_T(i)\, a_{i,F}$
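The formal definition above can be sketched as follows, again on a toy two-state HMM with assumed probabilities (the same made-up values are not from the book):

```python
# A sketch of the Viterbi algorithm with backpointers. All numbers in
# this toy two-state HMM are assumed illustrative values.
N = 2
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o)
pi = [0.8, 0.2]                          # a_{0j}
eta = [1.0, 1.0]                         # a_{iF}

def viterbi(obs):
    """Return (best path probability, most probable state sequence)."""
    # Initialization: v_1(j) = a_{0j} * b_j(o_1), bt_1(j) = 0
    v = [pi[j] * B[j][obs[0]] for j in range(N)]
    backptr = []
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        v_new, bt = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: v[i] * A[i][j])
            bt.append(best_i)                       # bt_t(j) = argmax_i ...
            v_new.append(v[best_i] * A[best_i][j] * B[j][o])
        v, backptr = v_new, backptr + [bt]
    # Termination: best final state, then trace the backpointers
    last = max(range(N), key=lambda i: v[i] * eta[i])
    path = [last]
    for bt in reversed(backptr):
        path.append(bt[path[-1]])
    path.reverse()
    return v[last] * eta[last], path

print(viterbi([2, 0, 2]))
```

As the text notes, replacing the `max`/`argmax` over previous cells with a sum recovers the forward algorithm.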

TRAINING HMMs: The Forward-Backward Algorithm

Learning: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.

The standard algorithm for HMM training is the forward-backward or Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm. The algorithm will let us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM.

Let us begin by considering the much simpler case of training a Markov chain rather than an HMM. Since the states in a Markov chain are observed and it has no emission probabilities $B$, we can view a Markov chain as a degenerate HMM where all the $b$ probabilities are 1.0 for the observed symbol and 0 for all other symbols. Thus the only probabilities we need to train are in the transition probability matrix $A$.

We get the maximum likelihood estimate of the probability $a_{ij}$ of a particular transition between states $i$ and $j$ by counting the number of times the transition was taken, which we could call $C(i \rightarrow j)$, and then normalizing by the total count of all the times we took any transition from state $i$:

$$a_{ij} = \frac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$$
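This count-and-normalize maximum likelihood estimate can be sketched directly; the observed state sequence below is a made-up example:

```python
# MLE of transition probabilities from a fully observed state sequence.
# The example sequence is an assumed illustrative value.
from collections import Counter

def mle_transitions(states, n_states):
    """Estimate a_ij = C(i -> j) / sum_q C(i -> q) from observed states."""
    counts = Counter(zip(states, states[1:]))   # count each (i, j) transition
    A_hat = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        total = sum(counts[(i, q)] for q in range(n_states))
        if total:                               # leave the row at 0 if state i never occurs
            for j in range(n_states):
                A_hat[i][j] = counts[(i, j)] / total
    return A_hat

# Transitions observed: 0->0 twice, 0->1 once, 1->0 once
print(mle_transitions([0, 0, 0, 1, 0], 2))
```

This works only because every state is observed; the rest of the section deals with the hidden-state case.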

For an HMM we cannot compute these counts directly from an observation sequence, since we don't know which path of states was taken through the machine for a given input.

The Baum-Welch algorithm uses two neat intuitions to solve this problem.

  1. The first idea is to iteratively estimate the counts. We will start with an estimate for the transition and observation probabilities, and then use these estimated probabilities to derive better and better probabilities.
  2. The second idea is that we get our estimated probabilities by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.

Backward probability

The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given the automaton $\lambda$):

$$\beta_t(i) = P(o_{t+1}, o_{t+2} \dots o_T \mid q_t = i, \lambda)$$

Formal definition of the backward algorithm

  1. Initialization: $\beta_T(i) = a_{i,F}, \;\; 1 \leq i \leq N$

  2. Recursion (again since states $0$ and $q_F$ are non-emitting):

    $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,\; 1 \leq t < T$

  3. Termination:

    $P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
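The backward recursion mirrors the forward one but runs right to left. The sketch below reuses the same kind of toy two-state HMM with assumed probabilities; by the termination formula, the value it returns is $P(O \mid \lambda)$, the same quantity the forward algorithm computes:

```python
# A sketch of the backward algorithm. Every probability in this toy
# two-state HMM is an assumed illustrative value.
N = 2
A = [[0.7, 0.3], [0.4, 0.6]]             # a_ij
B = [[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]]   # b_j(o)
pi = [0.8, 0.2]                          # a_{0j}
eta = [1.0, 1.0]                         # a_{iF}

def backward(obs):
    """Return the beta trellis and P(O | lambda) from the termination step."""
    T = len(obs)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: beta_T(i) = a_{iF}
    beta[T - 1] = eta[:]
    # Recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # Termination: P(O | lambda) = sum_j a_{0j} * b_j(o_1) * beta_1(j)
    prob = sum(pi[j] * B[j][obs[0]] * beta[0][j] for j in range(N))
    return beta, prob

beta, prob = backward([2, 0, 2])
print(prob)
```

Comparing this termination value against the forward recursion on the same parameters is a useful sanity check, since both compute $P(O \mid \lambda)$.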

We are now ready to understand how the forward and backward probabilities can help us compute the transition probability $a_{ij}$ and the observation probability $b_i(o_t)$ from an observation sequence, even though the actual path taken through the machine is hidden.

Transition Probability Matrix

Let's begin by showing how to estimate $\hat{a}_{ij}$:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

How do we compute the numerator? Here is the intuition. Suppose we had some estimate of the probability that a given transition $i \rightarrow j$ was taken at a particular point in time $t$ in the observation sequence. If we knew this probability for each particular time $t$, we could sum over all times $t$ to estimate the total count for the transition $i \rightarrow j$.

Formally, let's define the probability $\xi_t$ as the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the observation sequence and, of course, the model:

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$$

In detail:

$$\left.\begin{matrix} \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) \\ \text{not-quite-}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \\ P(O \mid \lambda) = \alpha_T(N) = \beta_T(1) = \sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j) \\ \text{laws of probability:}\; P(Q \mid O, \lambda) = \frac{P(Q, O \mid \lambda)}{P(O \mid \lambda)} \end{matrix}\right\} \Rightarrow \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$$

The expected number of transitions from state $i$ to state $j$ is then the sum over all $t$ of $\xi$, so here is the final formula for $\hat{a}_{ij}$:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i, j)}$$

Observation Probability Matrix

This is the probability of a given symbol $v_k$ from the observation vocabulary $V$, given a state $j$: $\hat{b}_j(v_k)$.

$$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$$

For this we will need to know the probability of being in state $j$ at time $t$, which we call $\gamma_t(j)$:

$$\gamma_t(j) = P(q_t = j \mid O, \lambda) = \frac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$

We are ready to compute $b$. For the numerator, we sum $\gamma_t(j)$ for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$ that we are interested in. For the denominator, we sum $\gamma_t(j)$ over all time steps $t$. The result is the percentage of the times that we were in state $j$ and saw symbol $v_k$:

$$\hat{b}_j(v_k) = \frac{\sum_{t=1 \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

We now have ways to re-estimate the transition probabilities $A$ and observation probabilities $B$ from an observation sequence $O$, assuming that we already have a previous estimate of $A$ and $B$.

The Forward-Backward algorithm

The forward-backward algorithm starts with some initial estimate of the HMM parameters $\lambda = (A, B)$ and then iteratively runs two steps. Like other cases of the EM algorithm, these are the expectation step (E-step) and the maximization step (M-step).

In the E-step, we compute the expected state occupancy count $\gamma$ and the expected state transition count $\xi$ from the earlier $A$ and $B$ probabilities. In the M-step, we use $\gamma$ and $\xi$ to recompute new $A$ and $B$ probabilities.
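Putting the pieces together, one E-step/M-step iteration can be sketched as below on a toy two-state, two-symbol HMM. All initial parameters and the observation sequence are assumed values, and the special end state $q_F$ is ignored for simplicity (termination is handled by summing the final $\alpha$ column):

```python
# One Baum-Welch (forward-backward) iteration on a toy HMM. All
# parameters and observations are assumed illustrative values; the
# special end state q_F is ignored for simplicity.
N, M = 2, 2                          # number of states, vocabulary size
A = [[0.6, 0.4], [0.5, 0.5]]         # initial a_ij estimates
B = [[0.7, 0.3], [0.4, 0.6]]         # initial b_j(v_k) estimates
pi = [0.5, 0.5]                      # initial start probabilities
obs = [0, 1, 0, 0, 1]                # observation symbol indices
T = len(obs)

# Forward trellis alpha[t][j]
alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
for o in obs[1:]:
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][o]
                  for j in range(N)])

# Backward trellis beta[t][i], with beta_T(i) = 1
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                         for j in range(N))

p_O = sum(alpha[T - 1][j] for j in range(N))   # P(O | lambda)

# E-step: expected occupancy gamma_t(j) and expected transitions xi_t(i, j)
gamma = [[alpha[t][j] * beta[t][j] / p_O for j in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_O
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# M-step: re-estimate a_ij and b_j(v_k) from the expected counts
A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(N)] for i in range(N)]
B_new = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
          sum(gamma[t][j] for t in range(T))
          for k in range(M)] for j in range(N)]

print(A_new)
print(B_new)
```

Running this E/M pair repeatedly, with `A_new`/`B_new` fed back in as the new estimates, is the full training loop; each iteration can only increase (or leave unchanged) the likelihood $P(O \mid \lambda)$.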
