{"title": "Approximate Learning of Dynamic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 396, "page_last": 402, "abstract": null, "full_text": "Approximate Learning of Dynamic Models \n\nXavier Boyen \n\nComputer Science Dept. 1 A \nStanford, CA 94305-9010 \n\nxb@cs.stanford.edu \n\nDaphne Koller \n\nComputer Science Dept. 1 A \nStanford, CA 94305-9010 \n\nkoller@cs.stanford.edu \n\nAbstract \n\nInference is a key component in learning probabilistic models from par(cid:173)\ntially observable data. When learning temporal models, each of the \nmany inference phases requires a traversal over an entire long data se(cid:173)\nquence; furthermore, the data structures manipulated are exponentially \nlarge, making this process computationally expensive. In [2], we describe \nan approximate inference algorithm for monitoring stochastic processes, \nand prove bounds on its approximation error. In this paper, we apply this \nalgorithm as an approximate forward propagation step in an EM algorithm \nfor learning temporal Bayesian networks. We provide a related approxi(cid:173)\nmation for the backward step, and prove error bounds for the combined \nalgorithm. We show empirically that, for a real-life domain, EM using \nour inference algorithm is much faster than EM using exact inference, \nwith almost no degradation in quality of the learned model. We extend \nour analysis to the online learning task, showing a bound on the error \nresulting from restricting attention to a small window of observations. \nWe present an online EM learning algorithm for dynamic systems, and \nshow that it learns much faster than standard offline EM. \n\n1 Introduction \n\nIn many real-life situations, we are faced with the task of inducing the dynamics of a \ncomplex stochastic process from limited observations about its state over time. 
Until now, hidden Markov models (HMMs) [12] have played the largest role as a representation for learning models of stochastic processes. Recently, however, there has been increasing use of more structured models of stochastic processes, such as factorial HMMs [8] or dynamic Bayesian networks (DBNs) [4]. Such structured decomposed representations allow complex processes over a large number of states to be encoded using a much smaller number of parameters, thereby allowing better generalization from limited data [8, 7, 13]. Furthermore, the natural structure of such processes makes it easier for a human expert to incorporate prior knowledge about the domain structure into the model, thereby improving its inductive bias. \n\nBoth parameter and structure learning algorithms for dynamic models [12, 7] use probabilistic inference as a crucial component. An inference routine is called multiple times in order to \"fill in\" missing data with its expected value according to the current hypothesis; the resulting expected sufficient statistics are then used to construct a new hypothesis. The inference step is thus invoked many times, each call iterating over the entire sequence. This behavior is problematic in two important respects. First, in many settings, we may not have access to the entire sequence in advance. Second, the various structured representations of stochastic processes do not admit an effective inference procedure: the messages propagated by exact inference algorithms include an entry for each possible state of the system, and the number of states is exponential in the size of our model, rendering this type of computation infeasible in all but the smallest of problems. In this paper, we describe and analyze an approach that helps us address both of these problems. 
\n\nIn [2], we proposed a new approach to approximate inference in stochastic processes, where approximate distributions that admit compact representation are maintained and propagated. Our approach can achieve exponential savings over exact inference for DBNs. We showed empirically that, for a practical DBN [6], our approach results in a factor 15-20 reduction in running time at only a small cost in accuracy. We also proved that the accumulated error arising from the repeated approximations remains bounded indefinitely over time. This result relied on an analysis showing that transition through a stochastic process is a contraction for relative entropy (KL-divergence) [3]. \n\nHere, we apply this approach to the parameter learning task. This application is not completely straightforward, since our algorithm of [2] and the associated analysis only applied to the forward propagation of messages, whereas the inference used in learning algorithms requires propagation of information from the entire sequence. In this paper, we provide an analysis of the error accumulated by an approximate inference process in the backward propagation phase of inference. This analysis is quite different from the contraction analysis for the forward phase. We combine these two results to prove bounds on the error of the expected sufficient statistics relayed to the learning algorithm at each stage. We then present empirical results for a practical DBN, illustrating the performance of this approximate learning algorithm. We show that speedups of 15-20 can be obtained easily, with no discernible loss in the quality of the learned hypothesis. \n\nOur theoretical analysis also suggests a way of dealing with the problematic need to reason about the entire sequence of temporal observations at once. Our contraction results show that it is legitimate to ignore observations that are very far in the future. 
Thus, we can compute a very accurate approximation to the backward message by considering only a small window of observations in the future. This idea leads to an efficient online learning algorithm. We show that it converges to a good hypothesis much faster than the standard offline EM algorithm, even in settings favorable to the latter. \n\n2 Preliminaries \n\nA model for a dynamic system is specified as a tuple (B, Θ) where B represents the qualitative structure of the model, and Θ the appropriate parameterization. In a DBN, the instantaneous state of a process is specified in terms of a set of variables X1, ..., Xn. Here, B encodes a network fragment which specifies, for each time-t variable Xk(t), the set of parents Parents(Xk(t)); an example fragment is shown in Figure 1(a). The parameters Θ define for each Xk(t) a conditional probability table P[Xk(t) | Parents(Xk(t))]. For simplicity, we assume that the variables are partitioned into state variables, which are never observed, and observation variables, which are always observed. We also assume that the observation variables at time t depend only on state variables at time t. We use T to denote the transition matrix over the state variables in the stochastic process; i.e., Ti,j is the transition probability from state si to state sj. Note that this concept is well-defined even for a DBN, although in that case, the matrix is represented implicitly via the other parameters. We use O to denote the observation matrix; i.e., Oi,j is the probability of observing response rj in state si. \n\nOur goal is to learn the model for a stochastic process from partially observable data. To simplify our discussion, we focus on the problem of learning parameters for a known structure using the EM (Expectation Maximization) algorithm [5]; most of our discussion applies equally to other contexts (e.g., [7]). 
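To illustrate the remark above that the transition matrix T is well-defined for a DBN even though it is represented implicitly via the per-variable parameters, the following sketch (hypothetical code, not from the paper) assembles the explicit T for a toy DBN in which each time-t variable may depend on the entire time t-1 slice:

```python
import itertools
import numpy as np

def dbn_transition_matrix(cpts, card):
    # cpts[k] maps the full time t-1 state (a tuple) to a distribution
    # over variable X_k at time t; variables at time t are conditionally
    # independent given the previous slice, so T factors as a product.
    states = list(itertools.product(*[range(c) for c in card]))
    T = np.zeros((len(states), len(states)))
    for i, s in enumerate(states):
        for j, s_next in enumerate(states):
            p = 1.0
            for k, v in enumerate(s_next):
                p *= cpts[k][s][v]
            T[i, j] = p
    return T
```

With n binary state variables the joint space has 2^n states, which is precisely why the algorithms below avoid ever building T explicitly.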
EM is an iterative procedure that searches over the space of parameter vectors for one which is a local maximum of the likelihood function, i.e., the probability of the observed data D given Θ. We describe the EM algorithm for the task of learning HMMs; the extension to DBNs is straightforward. The EM algorithm starts with some initial (often random) parameter vector Θ, which specifies a current estimate of the transition and observation matrices of the process, T and O. The EM algorithm computes the expected sufficient statistics (ESS) for D, using T and O to compute the expectation. In the case of HMMs, the ESS are an average, over t, of the joint distributions ψ(t) over the variables at time t-1 and the variables at time t. A new parameter vector Θ' can then be computed from the ESS by a simple maximum likelihood step. These two steps are iterated until an appropriate stopping condition is met. \n\nThe ψ(t) for the entire sequence can be computed by a simple forward-backward algorithm. Let r(t) be the response observed at time t, and let Or(t) be its likelihood vector (Or(t)(i) = Oi,r(t)). The forward messages α(t) are propagated as α(t) ∝ (α(t-1) · T) × Or(t), where × denotes the componentwise product. The backward messages β(t) are propagated as β(t) ∝ T · (β(t+1) × Or(t+1)). The estimated belief at time t is now simply α(t) × β(t) (suitably renormalized); similarly, the joint belief ψ(t) is given by ψ(t)(i,j) ∝ α(t-1)(i) · Ti,j · Or(t)(j) · β(t)(j). This message passing algorithm has an obvious extension to DBNs. Unfortunately, it is feasible only for very small DBNs. Essentially, the messages passed in this algorithm have an entry for every possible state at time t; in a DBN, the number of states is exponential in the number of state variables, rendering such an explicit representation infeasible in most cases. 
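The α/β recursions and the pairwise joints ψ(t) above can be sketched for a plain HMM as follows (an illustrative implementation with normalization at every step; variable names follow the text):

```python
import numpy as np

def forward_backward_ess(T, O, obs, pi):
    # T[i, j]: transition prob s_i -> s_j; O[i, j]: prob of response r_j in s_i.
    n, L = T.shape[0], len(obs)
    alpha = np.zeros((L, n))
    beta = np.ones((L, n))
    # Forward: alpha^(t) ∝ (alpha^(t-1) · T) × O_{r(t)}
    alpha[0] = pi * O[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, L):
        a = (alpha[t - 1] @ T) * O[:, obs[t]]
        alpha[t] = a / a.sum()
    # Backward: beta^(t) ∝ T · (beta^(t+1) × O_{r(t+1)})
    for t in range(L - 2, -1, -1):
        b = T @ (beta[t + 1] * O[:, obs[t + 1]])
        beta[t] = b / b.sum()
    # Pairwise joints psi^(t)(i,j) ∝ alpha^(t-1)(i) T(i,j) O(j, r(t)) beta^(t)(j)
    psi = np.zeros((n, n))
    for t in range(1, L):
        j = np.outer(alpha[t - 1], beta[t] * O[:, obs[t]]) * T
        psi += j / j.sum()
    return psi / (L - 1)   # ESS: average over t of psi^(t)
```

The maximization step then simply row-normalizes such statistics to obtain the new parameter estimates.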
Furthermore, even highly structured processes do not admit a more compact representation of these messages [8, 2]. \n\n3 Belief state approximation \n\nIn [2], we described a new approach to approximate inference in dynamic systems, which avoids the problem of explicitly maintaining distributions over large spaces. We maintain our belief state (distribution over the current state) using some computationally tractable representation of a distribution. We propagate the time t approximate belief state through the transition model and condition it on our evidence at time t+1. We then approximate the resulting time t+1 distribution using one that admits a compact representation, allowing the algorithm to continue. We also showed that the errors arising from the repeated approximation do not accumulate unboundedly, as the stochasticity of the process attenuates their effect. \n\nIn particular, for DBNs we considered belief state approximations where certain subsets of less correlated variables are grouped into distinct clusters which are approximated as being independent. In this case, the approximation at each step consists of a simple projection onto the relevant marginals, which are used as a factored representation of the time t+1 approximate belief state. This algorithm can be implemented efficiently using the clique tree algorithm [10]. To compute α(t+1) from α(t), we generate a clique tree over these two time slices of the DBN, ensuring that both the time t and time t+1 clusters appear as a subset of some clique. We then incorporate α(t) into the time t cliques; α(t+1) is obtained by calibrating the tree (doing inference) and reading off the relevant marginals from the tree (α(t+1) is implicitly defined as their product). \n\nThese results are directly applicable to the learning task, as the belief state is the forward message in the forward-backward algorithm. 
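A minimal sketch of this projection step, for the fully factored case where every state variable is its own cluster (toy code over an explicit joint; a practical DBN implementation would use the clique tree machinery instead):

```python
import numpy as np

def bk_project(belief, shape):
    # Project a joint belief over several state variables onto its
    # per-variable marginals and return their product (the factored
    # approximation); `shape` gives the cardinality of each variable.
    b = belief.reshape(shape)
    marginals = []
    for axis in range(len(shape)):
        other = tuple(i for i in range(len(shape)) if i != axis)
        marginals.append(b.sum(axis=other))
    approx = marginals[0]
    for m in marginals[1:]:
        approx = np.multiply.outer(approx, m)
    return approx.reshape(-1)

def approx_forward_step(alpha, T, O_col, shape):
    # Propagate the factored belief through the transition model,
    # condition on the evidence likelihood O_col, then re-project.
    a = (alpha @ T) * O_col
    a /= a.sum()
    return bk_project(a, shape)
```

A belief that is already a product of its marginals is a fixed point of the projection, which is why the approximation error comes only from correlations between clusters.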
Thus, we can apply this approach to the forward step, with the guarantee that the approximation will not lead to a big difference in the ESS. However, this technique does not resolve our computational problems, as the backward propagation phase is as expensive as the forward phase. We can apply the same idea to the backward propagation, i.e., we maintain and propagate a compactly represented approximate backward message β̂(t). The implementation of this idea is a simple extension of our algorithm for forward messages: to compute β̂(t) from β̂(t+1), we simply incorporate β̂(t+1) into our clique tree over these two time slices, then read off the relevant marginals for computing β̂(t). \n\nHowever, extending the analysis is not as straightforward: the techniques of [2] do not directly yield relative error bounds for the backward message. Furthermore, even if we have bounds on the relative entropy error of both the forward and backward messages, bounds for the error of the ψ(t) do not follow. The solution turns out to use an alternative notion of distance, which combines additively under Bayesian updating, albeit at the cost of weaker contraction rates. \n\nDefinition 1 Let p and p̂ be two positive vectors of the same dimension. Their projective distance is defined as DProj[p, p̂] = max over i, i' of ln[(pi · p̂i') / (pi' · p̂i)]. \n\nWe note that the projective distance is a (weak) upper bound on the relative entropy. \n\nBased on the results of [1], we show that the projective distance contracts when messages are propagated through the stochastic transition matrix, in either direction. Of course, the rate of contraction depends on ergodicity properties of the matrix: \n\nLemma 2 Let k = min over {i, j, i', j' : Ti,j · Ti',j' ≠ 0} of sqrt((Ti,j' · Ti',j) / (Ti,j · Ti',j')), and define κT = 2k / (1 + k). Then DProj[α(t), α̂(t)] ≤ (1 - κT) · 
DProj[α(t-1), α̂(t-1)], and DProj[β(t), β̂(t)] ≤ (1 - κT) · DProj[β(t+1), β̂(t+1)]. \n\nWe can now show that, if our approximations do not introduce too large an error, then the expected sufficient statistics will remain close to their correct value. \n\nTheorem 3 Let S be the ESS computed via exact inference, and let Ŝ be its approximation. If the forward (backward) approximation step is guaranteed to introduce at most ε (δ) projective error, then DProj[S, Ŝ] ≤ (ε + δ) / κT. Therefore DKL[S || Ŝ] ≤ (ε + δ) / κT. \n\nNote that even small fluctuations in the sufficient statistics can cause the EM algorithm to reach a different local maximum. Thus, we cannot analytically compare the quality of the resulting algorithms. However, as our experimental results show, there is no divergence between exact EM and approximate EM in practice. \n\nWe tested our algorithms on the task of learning the parameters for the BAT network shown in Figure 1(a), used for traffic monitoring [6]. The training set was a fixed sequence of 1000 slices, generated from the correct network distribution. Our test metric was the average log-likelihood (per slice) of a fixed test sequence of 50 slices. All experiments were conducted using three different random starting points for the parameters (the same in all the experiments). We ran EM with different types of structural approximations, and evaluated the quality of the model after each iteration of the algorithm. \n\nFigure 1: (a) The BAT DBN. (b) Structural approximations for batch EM. \n\n
We used four different structural approximations: (i) exact propagation; (ii) a 5+5 clustering of the ten state variables; (iii) a 3+2+4+1 clustering; (iv) each variable in a separate cluster. The results for one random starting point are shown on Figure 1(b). As we can see, the impact of (even severe) structural approximation on learning accuracy is negligible. In all of the runs, the approximate algorithm tracked the exact one very closely, and the largest difference in the peak log-likelihood was at most 0.04. This phenomenon is rather remarkable, especially in view of the substantial savings caused by the approximations: on a Sun Ultra II, the computational cost of learning was 138 min/iteration in the exact case, vs. 6 min/iteration for the 5+5 clustering, and less than 5 min/iteration for the other two. \n\n4 Online learning \n\nOur analysis also gives us the tools to address another important problem with learning dynamic models: the need to reason about the entire temporal sequence at once. One consequence of our contraction result is that the effect of approximations done far away in the sequence decays exponentially with the time difference. In particular, the effect of an approximation which ignores observations that are far in the future is also limited. Therefore, if we do inference for a time slice based on a small window of observations into the future, the result should still be fairly accurate. More precisely, assume that we are at time t and are considering a window of size w. We can view the uniform message as a very bad approximation to β(t+w). But as we propagate this approximate backward message from t+w to t, the error will decay exponentially with w. \n\nBased on these insights, we experimented with various online algorithms that use a small window approximation. 
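The windowing argument above can be checked numerically on a small HMM: initialize the backward message at t+w to uniform, propagate it back, and measure the error in the projective distance of Definition 1 (illustrative code; the HMM and window sizes are arbitrary choices, not the paper's experimental setup):

```python
import numpy as np

def d_proj(p, q):
    # Projective distance: max over i, i' of ln[(p_i q_i') / (p_i' q_i)].
    r = np.log(p) - np.log(q)
    return r.max() - r.min()

def windowed_beta(T, O, obs, t, w):
    # Backward message at time t from a w-step lookahead window, starting
    # from a uniform ("very bad") approximation at time t+w.
    beta = np.ones(T.shape[0]) / T.shape[0]
    for s in range(min(t + w, len(obs) - 1), t, -1):
        beta = T @ (beta * O[:, obs[s]])
        beta /= beta.sum()
    return beta
```

For an ergodic transition matrix, the error of `windowed_beta` relative to a long-window reference shrinks geometrically in w, matching the contraction rate 1 - κT of Lemma 2.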
Our online algorithms are based on the approach of [11], in which ESS are updated with an exponential decay every few data cases; the parameters are then updated correspondingly. The main problem with frequent parameter updates in the online setting is that they require a recomputation of the messages computed using the old parameters. For long sequences, the computational cost of such a scheme would be prohibitive. In our algorithms, we simply leave the forward messages unchanged, under the assumption that the most recent time slices used parameters that are very close to the new ones. Our contraction result tells us that the use of old parameters far back in the sequence has a negligible effect on the message. We tried several schemes for the update of the backward messages. In the dynamic-1000 approach, we use a backward message computed over 1000 slices, with the closer messages recomputed very frequently as the parameters are changed, based on cached messages that used older parameters. The 8 closest messages are updated every parameter update, the next 16 every other update, etc. This approach is the closest realistic alternative to a full update of backward messages. \n\nFigure 2: Temporal approximations for (a) batch setting; (b) online setting. \n\n
In the static-1000 approach, we use a very long window (1000 slices), but do not recompute messages; when the window ends, we use the current parameters to compute the messages for the entire next window. In the static-4 approach, we do the same, but use a very short window of 4 slices. Finally, in the static-0 approach, there is no lookahead at all; only the past and present evidence is used to compute the joint beliefs. The latter case is often used (e.g., in the context of Kalman filters [9]) for online learning of the process parameters. To minimize the computational burden, all tests were conducted using the 5+5 structural approximation. The running times for the various algorithms are: 0.4 sec/slice for batch EM; 1.4 for dynamic-1000; 0.5 for static-1000 and for static-4; and 0.3 for static-0. \n\nWe evaluated these temporal approximations both in an online and in a batch setting. In the batch experiments, we used the same 1000-step sequence used above. The results are shown in Figure 2(a). We see that the dynamic-1000 algorithm reaches the same quality model as standard batch EM, but converges sooner. As in [11], the difference is due to the frequent update of the sufficient statistics based on more accurate parameters. More interestingly, we see that the static-4 algorithm, which uses a lookahead of only 4, also reaches the same accuracy. Thus, our approximation, which ignores evidence far in the future, is a good one, even for a very weak notion of \"far\". By contrast, we see that the quality reached by the static-0 approach is significantly lower: the sufficient statistics used by the EM algorithm in this case are consistently worse, as they ignore all future evidence. Thus, in this network, a window of size 4 is as good as full forward-backward, whereas one of size 0 is clearly worse. Our online learning experiments, shown in Figure 2(b), used a single long sequence of 40,000 slices. 
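One static-w style update can be sketched by combining a windowed backward message with the exponentially decayed ESS of [11] (hypothetical code: the variable names, the decay constant, and the plain-HMM setting are illustrative, not the implementation used in the experiments):

```python
import numpy as np

def online_em_step(T, O, alphas, obs, t, w, ess, decay=0.99):
    # Windowed backward message: uniform at t+w, propagated back to t;
    # the stored forward messages `alphas` are left unchanged.
    beta = np.ones(T.shape[0]) / T.shape[0]
    for s in range(min(t + w, len(obs) - 1), t, -1):
        beta = T @ (beta * O[:, obs[s]])
        beta /= beta.sum()
    # Joint belief psi^(t)(i,j) ∝ alpha^(t-1)(i) T(i,j) O(j, r(t)) beta(j).
    joint = np.outer(alphas[t - 1], beta * O[:, obs[t]]) * T
    joint /= joint.sum()
    # Exponentially decayed ESS update in the spirit of [11].
    return decay * ess + (1.0 - decay) * joint
```

A new transition estimate is then obtained by row-normalizing the accumulated statistics after every few such updates.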
Again, we see that the static-4 approach is almost indistinguishable in terms of accuracy from the dynamic-1000 approach, and that both converge more rapidly than the static-1000 algorithm. Thus, frequent updates over short windows are better than infrequent updates over longer ones. Finally, we see again that the static-0 algorithm converges to a hypothesis of much lower quality. Thus, even a very short window allows rapid convergence to the \"best possible\" answer, but a window of size 0 does not. \n\n5 Conclusion and extensions \n\nIn this paper, we suggested the use of simple structural approximations in the inference algorithm used in an E-step. Our results suggest that even severe structural approximations have almost negligible effects on the accuracy of learning. The advantages of approximate inference in the learning setting are even more pronounced than in the inference task [2], as the small errors caused by approximation are negligible compared to the larger ones induced by the learning process. Our techniques provide a new and simple approach for learning structured models of complex dynamic systems, with the resulting advantages of generalization and the ability to incorporate prior knowledge. We also presented a new algorithm for the online learning task, showing that we can learn high-quality models using a very small time window of future observations. \n\nThe work most comparable to ours is the variational approach to approximate inference applied to learning factorial HMMs [8]. While we have not done a direct empirical comparison, it seems likely that the variational approach would work better for densely connected models, whereas our approach would dominate for structured models such as the one in our experiments. Indeed, for this model, our algorithms track exact EM so closely that any significant improvement in accuracy is unlikely. 
Our algorithm is also simpler and easier to implement. Most importantly, it is applicable to the task of online learning. \n\nThe most obvious extension to our results is an integration of our ideas with structure learning algorithms for DBNs [7]. We believe that the resulting algorithm will be able to learn structured models for real-life complex systems. \n\nAcknowledgements. We thank Tim Huang for providing us with the BAT network, and Nir Friedman and Leonid Gurvits for useful discussions. This research was supported by ARO under the MURI program \"Integrated Approach to Intelligent Systems\", and by DARPA contract DACA 76-93-C-0025 under subcontract to IET, Inc. \n\nReferences \n\n[1] M. Artzrouni and X. Li. A note on the coefficient of ergodicity of a column-allowable nonnegative matrix. Linear Algebra and its Applications, 214:93-101, 1995. \n\n[2] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, pages 33-42, 1998. \n\n[3] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991. \n\n[4] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Comp. Int., 5(3), 1989. \n\n[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38, 1977. \n\n[6] J. Forbes, T. Huang, K. Kanazawa, and S.J. Russell. The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI, 1995. \n\n[7] N. Friedman, K. Murphy, and S.J. Russell. Learning the structure of dynamic probabilistic networks. In Proc. UAI, pages 139-147, 1998. \n\n[8] Z. Ghahramani and M.I. Jordan. Factorial hidden Markov models. In NIPS 8, 1996. \n\n[9] R.E. Kalman. A new approach to linear filtering and prediction problems. J. of Basic Engineering, 82:34-45, 1960. \n\n[10] S.L. Lauritzen and D.J. Spiegelhalter. 
Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Stat. Soc., B 50, 1988. \n\n[11] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998. \n\n[12] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing, 1986. \n\n[13] G. Zweig and S.J. Russell. Speech recognition with dynamic Bayesian networks. In Proc. AAAI, pages 173-180, 1998. \n", "award": [], "sourceid": 1588, "authors": [{"given_name": "Xavier", "family_name": "Boyen", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}