{"title": "Factorial Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 472, "page_last": 478, "abstract": null, "full_text": "Factorial Hidden Markov Models \n\nZoubin Ghahramani \nzoubin@psyche.mit.edu \n\nDepartment of Computer Science \n\nUniversity of Toronto \nToronto, ON M5S 1A4 \n\nCanada \n\nMichael I. Jordan \njordan@psyche.mit.edu \n\nDepartment of Brain & Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nUSA \n\nAbstract \n\nWe present a framework for learning in hidden Markov models with \ndistributed state representations. Within this framework , we de(cid:173)\nrive a learning algorithm based on the Expectation-Maximization \n(EM) procedure for maximum likelihood estimation. Analogous to \nthe standard Baum-Welch update rules, the M-step of our algo(cid:173)\nrithm is exact and can be solved analytically. However, due to the \ncombinatorial nature of the hidden state representation, the exact \nE-step is intractable. A simple and tractable mean field approxima(cid:173)\ntion is derived. Empirical results on a set of problems suggest that \nboth the mean field approximation and Gibbs sampling are viable \nalternatives to the computationally expensive exact algorithm. \n\n1 \n\nIntroduction \n\nA problem of fundamental interest to machine learning is time series modeling. Due \nto the simplicity and efficiency of its parameter estimation algorithm, the hidden \nMarkov model (HMM) has emerged as one of the basic statistical tools for modeling \ndiscrete time series, finding widespread application in the areas of speech recogni(cid:173)\ntion (Rabiner and Juang, 1986) and computational molecular biology (Baldi et al. , \n1994). An HMM is essentially a mixture model, encoding information about the \nhistory of a time series in the value of a single multinomial variable (the hidden \nstate). 
This multinomial assumption allows an efficient parameter estimation algorithm to be derived (the Baum-Welch algorithm). However, it also severely limits the representational capacity of HMMs. For example, to represent 30 bits of information about the history of a time sequence, an HMM would need 2^30 distinct states. On the other hand, an HMM with a distributed state representation could achieve the same task with 30 binary units (Williams and Hinton, 1991). This paper addresses the problem of deriving efficient learning algorithms for hidden Markov models with distributed state representations.

The need for distributed state representations in HMMs can be motivated in two ways. First, such representations allow the state space to be decomposed into features that naturally decouple the dynamics of a single process generating the time series. Second, distributed state representations simplify the task of modeling time series generated by the interaction of multiple independent processes. For example, a speech signal generated by the superposition of multiple simultaneous speakers can potentially be modeled with such an architecture.

Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed state representations and proposed a solution based on deterministic Boltzmann learning. The approach presented in this paper is similar to Williams and Hinton's in that it is also based on a statistical mechanical formulation of hidden Markov models. However, our learning algorithm is quite different in that it makes use of the special structure of HMMs with distributed state representations, resulting in a more efficient learning procedure. Anticipating the results in section 2, this learning algorithm both obviates the need for the two-phase procedure of Boltzmann machines and has an exact M-step.
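The Baum-Welch efficiency referred to above rests on the HMM forward recursion, which scores an observation sequence in time linear in its length. As background, here is a minimal forward-algorithm sketch for a hypothetical two-state HMM; all parameter values are invented for illustration and are not taken from the paper:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM (illustrative numbers only).
P = np.array([[0.9, 0.1],    # transition matrix: P[j, l] = P(state j | previous state l)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state prior
B = np.array([[0.7, 0.3],    # emission matrix: B[o, j] = P(observation o | state j)
              [0.3, 0.7]])

def forward_log_likelihood(obs):
    """Forward recursion: alpha[j] holds P(y^1..y^t, s^t = j) after step t."""
    alpha = pi * B[obs[0]]
    for o in obs[1:]:
        # One recursion step: propagate through the transition matrix,
        # then weight by the emission probability of the new observation.
        alpha = B[o] * (P @ alpha)
    return np.log(alpha.sum())

print(forward_log_likelihood([0, 0, 1, 1]))
```

Baum-Welch runs this forward pass (together with a matching backward pass) in its E-step, which is what the factorial extension below makes combinatorial.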
A different approach comes from Saul and Jordan (1995), who derived a set of rules for computing the gradients required for learning in HMMs with distributed state spaces. However, their methods can only be applied to a limited class of architectures.

2 Factorial hidden Markov models

Hidden Markov models are a generalization of mixture models. At any time step, the probability density over the observables defined by an HMM is a mixture of the densities defined by each state in the underlying Markov model. Temporal dependencies are introduced by specifying that the prior probability of the state at time t depends on the state at time t-1 through a transition matrix, P (Figure 1a). Another generalization of mixture models, the cooperative vector quantizer (CVQ; Hinton and Zemel, 1994), provides a natural formalism for distributed state representations in HMMs. Whereas in simple mixture models each data point must be accounted for by a single mixture component, in CVQs each data point is accounted for by the combination of contributions from many mixture components, one from each separate vector quantizer. The total probability density modeled by a CVQ is also a mixture model; however, this mixture density is assumed to factorize into a product of densities, each density associated with one of the vector quantizers. Thus, the CVQ is a mixture model with distributed representations for the mixture components.

Factorial hidden Markov models^1 combine the state transition structure of HMMs with the distributed representations of CVQs (Figure 1b). Each of the d underlying Markov models has a discrete state s_i^t at time t and transition probability matrix P_i. As in the CVQ, the states are mutually exclusive within each vector quantizer and we assume real-valued outputs.
The sequence of observable output vectors is generated from a normal distribution with mean given by the weighted combination of the states of the underlying Markov models:

y^t \sim \mathcal{N}\left( \sum_{i=1}^{d} W_i s_i^t, \; C \right),

where C is a common covariance matrix. The k-valued states s_i are represented as discrete column vectors with a 1 in one position and 0 everywhere else; the mean of the observable is therefore a combination of columns from each of the W_i matrices.

^1 We refer to HMMs with distributed state as factorial HMMs as the features of the distributed state factorize the total state representation.

Figure 1. a) Hidden Markov model. b) Factorial hidden Markov model.

We capture the above probability model by defining the energy of a sequence of T states and observations, \{(s^t, y^t)\}_{t=1}^{T}, which we abbreviate to \{s, y\}, as:

\mathcal{H}(\{s,y\}) = \frac{1}{2} \sum_{t=1}^{T} \left[ y^t - \sum_{i=1}^{d} W_i s_i^t \right]' C^{-1} \left[ y^t - \sum_{i=1}^{d} W_i s_i^t \right] - \sum_{t=2}^{T} \sum_{i=1}^{d} s_i^{t\,\prime} A_i s_i^{t-1},   (1)

where [A_i]_{jl} = \log P(s_i^t = j \mid s_i^{t-1} = l) such that \sum_{j=1}^{k} e^{[A_i]_{jl}} = 1, and ' denotes matrix transpose. Priors for the initial state, s^1, are introduced by setting the second term in (1) at t = 1 to -\sum_{i=1}^{d} s_i^{1\,\prime} \log \pi_i. The probability model is defined from this energy by the Boltzmann distribution

P(\{s,y\}) = \frac{1}{Z} \exp\{ -\mathcal{H}(\{s,y\}) \}.   (2)

Note that, as in the CVQ (Ghahramani, 1995), the unclamped partition function

Z = \int d\{y\} \sum_{\{s\}} \exp\{ -\mathcal{H}(\{s,y\}) \}

evaluates to a constant, independent of the parameters. This can be shown by first integrating over the Gaussian variables, removing all dependency on \{y\}, and then summing over the states using the constraint on e^{[A_i]_{jl}}.

The EM algorithm for Factorial HMMs

As in HMMs, the parameters of a factorial HMM can be estimated via the EM (Baum-Welch) algorithm.
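The generative process just described — d independent Markov chains whose one-hot states are linearly combined into the mean of a Gaussian observation — can be sketched as follows. The dimensions, random parameters, and uniform initial-state prior are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, obs_dim = 3, 2, 5, 4   # number of chains, states per chain, sequence length, output dim

# One k x k transition matrix P_i and one obs_dim x k output matrix W_i per chain.
Ps = [rng.dirichlet(np.ones(k), size=k).T for _ in range(d)]   # columns sum to 1
Ws = [rng.normal(size=(obs_dim, k)) for _ in range(d)]
C = 0.1 * np.eye(obs_dim)                                      # shared output covariance

def one_hot(j, k):
    v = np.zeros(k)
    v[j] = 1.0
    return v

states = [rng.integers(k) for _ in range(d)]   # initial states (uniform prior, for illustration)
ys = []
for t in range(T):
    # Mean of the observation is the weighted combination sum_i W_i s_i^t of one-hot states.
    mean = sum(W @ one_hot(s, k) for W, s in zip(Ws, states))
    ys.append(rng.multivariate_normal(mean, C))
    # Each chain steps independently through its own transition matrix P_i.
    states = [rng.choice(k, p=P[:, s]) for P, s in zip(Ps, states)]

Y = np.array(ys)
print(Y.shape)  # (T, obs_dim)
```

Note that although the chains evolve independently, they become coupled a posteriori once observations are conditioned on, which is what makes the exact E-step combinatorial.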
This procedure iterates between computing probabilities over the hidden states given the current parameters (E-step), and using these probabilities to maximize the expected log likelihood of the parameters (M-step).

Using the likelihood (2), the expected log likelihood of the parameters is

Q(\phi^{\text{new}} \mid \phi) = \langle -\mathcal{H}(\{s,y\}) - \log Z \rangle_c,   (3)

where