Training Algorithms for Hidden Markov Models using Entropy Based Distance Functions

Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)

Bibtex Metadata Paper

Authors

Yoram Singer, Manfred K. K. Warmuth

Abstract

We present new algorithms for parameter estimation of HMMs. By adapting a framework used for supervised learning, we construct iterative algorithms that maximize the likelihood of the observations while also attempting to stay "close" to the current estimated parameters. We use a bound on the relative entropy between the two HMMs as a distance mea(cid:173) sure between them. The result is new iterative training algorithms which are similar to the EM (Baum-Welch) algorithm for training HMMs. The proposed algorithms are composed of a step similar to the expectation step of Baum-Welch and a new update of the parameters which replaces the maximization (re-estimation) step. The algorithm takes only negligi(cid:173) bly more time per iteration and an approximated version uses the same expectation step as Baum-Welch. We evaluate experimentally the new algorithms on synthetic and natural speech pronunciation data. For sparse models, i.e. models with relatively small number of non-zero parameters, the proposed algorithms require significantly fewer iterations.

1 Preliminaries We use the numbers from 0 to N to name the states of an HMM. State 0 is a special initial state and state N is a special final state. Any state sequence, denoted by s, starts with the initial state but never returns to it and ends in the final state. Observations symbols are also numbers in {I, ... , M} and observation sequences are denoted by x. A discrete output hidden Markov model (HMM) is parameterized by two matrices A and B. The first matrix is of dimension [N, N] and ai,j (0:5: i :5: N - 1,1 :5: j :5: N) denotes the probability of moving from state i to state j. The second matrix is of dimension [N + 1, M] and bi ,k is the probability of outputting symbol k at state i. The set of parameters of an HMM is denoted by 0 = (A, B). (The initial state distribution vector is represented by the first row of A.) An HMM is a probabilistic generator of sequences. It starts in the initial state O. It then iteratively does the following until the final state is reached. If i is the current state then a next state j is chosen according to the transition probabilities out of the current state (row i of matrix A). After arriving at state j a symbol is output according to the output probabilities of that state (row j of matrix B). Let P(x, slO) denote the probability (likelihood) that an HMM 0 generates the observation sequence x on the path s starting at state 0 and ending at state N: P(x, sllsl = Ixl + 1, So = 0, slSI = N, 0) ~ I1~~ll as._t,s.bs.,x •. For the sake of brevity we omit the conditions on s and x. Throughout the paper we assume that the HMMs are absorbing, that is from every state there is a path to the final state with a

642

Y. Singer and M. K. Warmuth

non-zero probability. Similar parameter estimation algorithms can be derived for ergodic HMMs. Absorbing HMMs induce a probability over all state-observation sequences, i.e. Ex,s P(x, s18) = 1. The likelihood of an observation sequence x is obtained by summing over all possible hidden paths (state sequences), P(xI8) = Es P(x, sI8). To obtain the likelihood for a set X of observations we simply mUltiply the likelihood values for the individual sequences. We seek an HMM 8 that maximizes the likelihood for a given set of observations X, or equivalently, maximizes the log-likelihood, LL(XI8) = r:h EXEX In P(xI8). To simplify our notation we denote the generic parameter in 8 by Oi, where i ranges from 1 to the total number of parameters in A and B (There might be less if some are clamped to zero). We denote the total number of parameters of 8 by I and leave the (fixed) correspondence between the Oi and the entries of A and B unspecified. The indices are naturally partitioned into classes corresponding to the rows of the matrices. We denote by [i] the class of parameters to which Oi belongs and by O[i) the vector of all OJ S.t. j E [i]. If j E [i] then both Oi and OJ are parameters from the same row of one of the two matrices. Whenever it is clear from the context, we will use [i] to denote both a class of parameters and the row number (i.e. state) associated with the class. We now can rewrite P(x, s18) as nf=l O~'(X,S), where ni(x, s) is the number of times parameter i is used along the path s with observation sequence x. (Note that this value does not depend on the actual parameters 8.) We next compute partial derivatives ofthe likelihood and the log-likelihood using this notation.