{"title": "Dynamic Bayesian Networks with Deterministic Latent Tables", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 736, "abstract": null, "full_text": "Dynamic Bayesian Networks with Deterministic Latent Tables\n\nDavid Barber\n\nInstitute for Adaptive and Neural Computation\nEdinburgh University\n5 Forrest Hill, Edinburgh, EH1 2QL, U.K.\ndbarber@anc.ed.ac.uk\n\nAbstract\n\nThe application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason, either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model.\n\n1 Introduction\n\nDynamic Bayesian Networks are a powerful framework for temporal data models with widespread application in time series analysis [10, 2, 5]. A time series of length T is a sequence of observation vectors V = {v(1), v(2), ..., v(T)}, where v_i(t) represents the state of visible variable i at time t. For example, in a speech application V may represent a vector of cepstral coefficients through time, the aim being to classify the sequence as belonging to a particular phoneme [2, 9]. The power in the Dynamic Bayesian Network is the assumption that the observations may be generated by some latent (hidden) process that cannot be directly experimentally observed. 
The basic structure of these models is shown in fig(1)[a], where network states are only dependent on a short time history of previous states (the Markov assumption). Representing the hidden variable sequence by H = {h(1), h(2), ..., h(T)}, the joint distribution of a first order Dynamic Bayesian Network is\n\np(V, H) = p(v(1)) p(h(1)|v(1)) ∏_{t=1}^{T−1} p(v(t+1)|v(t), h(t)) p(h(t+1)|v(t), v(t+1), h(t))\n\nThis is a Hidden Markov Model (HMM) with additional connections from visible to hidden units [9]. The usage of such models is varied, but here we shall concentrate on unsupervised sequence learning. That is, given a set of training sequences\n\nFigure 1: (a) A first order Dynamic Bayesian Network containing a sequence of hidden (latent) variables h(1), h(2), ..., h(T) and a sequence of visible (observable) variables v(1), v(2), ..., v(T). In general, all conditional probability tables are stochastic; that is, more than one state can be realised. (b) Conditioning on the visible units forms an undirected chain in the hidden space. Hidden unit inference is achieved by propagating information along both directions of the chain to ensure normalisation.\n\nV^1, ..., V^P we aim to capture the essential features of the underlying dynamical process that generated the data. 
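The factorised joint distribution above can be sampled ancestrally, drawing each variable from its conditional table given its parents. The following sketch is our own illustration, not from the paper: the two-state tables are hypothetical, generated at random purely to make the factorisation concrete.

```python
import numpy as np

# Hypothetical two-state illustration of ancestral sampling from the
# joint p(V,H) = p(v(1)) p(h(1)|v(1)) prod_t p(v(t+1)|v(t),h(t)) p(h(t+1)|v(t),v(t+1),h(t)).
# All table values are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
S = 2
p_v1 = np.array([0.5, 0.5])                      # p(v(1))
p_h1 = rng.dirichlet(np.ones(S), size=S)         # p(h(1)|v(1))
p_v = rng.dirichlet(np.ones(S), size=(S, S))     # p(v(t+1)|v(t),h(t))
p_h = rng.dirichlet(np.ones(S), size=(S, S, S))  # p(h(t+1)|v(t),v(t+1),h(t))

def sample_sequence(T):
    # draw (V, H) by sampling each factor of the joint in turn
    v = [rng.choice(S, p=p_v1)]
    h = [rng.choice(S, p=p_h1[v[0]])]
    for t in range(T - 1):
        v.append(rng.choice(S, p=p_v[v[t], h[t]]))
        h.append(rng.choice(S, p=p_h[v[t], v[t + 1], h[t]]))
    return v, h

V_seq, H_seq = sample_sequence(10)
```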
Denoting the parameters of the model by Θ, learning can be achieved using the EM algorithm, which maximises a lower bound on the likelihood of a set of observed sequences by the procedure [5]:\n\nΘ^new = argmax_Θ ∑_{μ=1}^{P} ∑_{H^μ} p(H^μ|V^μ, Θ^old) log p(H^μ, V^μ|Θ).   (1)\n\nThis procedure contains expectations with respect to the distribution p(H|V); that is, to do learning, we need to infer the hidden unit distribution conditional on the visible variables. p(H|V) is represented by the undirected clique graph, fig(1)[b], in which each node represents a function (dependent on the clamped visible units) of the hidden variables it contains, with p(H|V) being the product of these clique potentials. In order to do inference on such a graph, in general, it is necessary to carry out a message passing type procedure in which messages are first passed one way along the undirected graph, and then back, as in the forward-backward algorithm for HMMs [5]. Only when messages have been passed along both directions of all links can the normalised conditional hidden unit distribution be numerically determined. The complexity of calculating messages is dominated by marginalisation of the clique functions over a hidden vector h(t). In the case of discrete hidden units with S states, this complexity is of the order S^2, and the total complexity of inference is then O(T S^2). For continuous hidden units, the analogous marginalisation requires integration of a clique function over a hidden vector. If the clique function is very low dimensional, this may be feasible. However, in high dimensions this is typically intractable unless the clique functions are of a very specific form, such as Gaussians. This motivates the Kalman filter model [5], in which all conditional probability tables are Gaussian with means determined by a linear combination of previous states. 
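The two-way message passing on the hidden chain can be sketched as follows. This is our own illustration: the pairwise potentials are hypothetical random tables standing in for the clamped-visible clique functions, and each message costs O(S^2) operations, giving O(T S^2) overall.

```python
import numpy as np

# Two-way message passing on an undirected chain of T hidden variables
# with S states, as in fig(1)[b].  psi[t][i, j] is a hypothetical
# pairwise potential between h(t) and h(t+1).
rng = np.random.default_rng(1)
T, S = 6, 3
psi = rng.random((T - 1, S, S))

alpha = [np.ones(S)]             # forward messages
for t in range(T - 1):
    alpha.append(alpha[-1] @ psi[t])

beta = [np.ones(S)]              # backward messages
for t in reversed(range(T - 1)):
    beta.insert(0, psi[t] @ beta[0])

# normalised marginals p(h(t)|V): available only after both passes
marginals = [a * b / (a @ b) for a, b in zip(alpha, beta)]
```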
There have been several attempts to generalise the Kalman filter to include non-linear/non-Gaussian conditional probability tables, but most rely on approximate integration methods based on sampling [3], perturbation, or variational type methods [5].\n\nIn this paper we take a different approach. We consider specially constrained networks which, when conditioned on the visible variables, render the hidden unit distribution trivial. The aim is then to be able to consider non-Gaussian and non-linear conditional probability tables (CPTs), and hence richer dynamics in the hidden space.\n\nFigure 2: (a) A first order Dynamic Bayesian Network with deterministic hidden CPTs (represented by diamonds); that is, the hidden node is certainly in a single state, determined by its parents. (b) An input-output HMM with deterministic hidden variables. (c) Conditioning on the visible variables forms a directed chain in the hidden space which is deterministic. Hidden unit inference can be achieved by forward propagation alone. (d) Integrating out hidden variables gives a cascade style directed visible graph, shown here for only four time steps.\n\n2 Deterministic Latent Variables\n\nThe deterministic latent CPT case, fig(2)[a], defines conditional probabilities\n\np(h(t+1)|v(t+1), v(t), h(t)) = δ(h(t+1) − f(v(t+1), v(t), h(t); θ_h))   (2)\n\nwhere δ(x) represents the Dirac delta function for continuous hidden variables, and the Kronecker delta for discrete hidden variables. The vector function f parameterises the CPT, itself having parameters θ_h. 
Whilst the restriction to deterministic CPTs appears severe, the model retains some attractive features: the marginal p(V) is non-Markovian, coupling all the variables in the sequence, see fig(2)[d]; the marginal p(H) is stochastic, whilst hidden unit inference is deterministic, as illustrated in fig(2)[c]. Although not considered explicitly here, input-output HMMs [7], see fig(2)[b], are easily dealt with by a trivial modification of this framework.\n\nFor learning, we can dispense with the EM algorithm and calculate the log likelihood of a single training sequence V directly,\n\nL(θ_v, θ_h|V) = log p(v(1)|θ_v) + ∑_{t=1}^{T−1} log p(v(t+1)|v(t), h(t), θ_v)   (3)\n\nwhere the hidden unit values are calculated recursively using\n\nh(t+1) = f(v(t+1), v(t), h(t), θ_h)   (4)\n\nThe adjustable parameters of the hidden and visible CPTs are represented by θ_h and θ_v respectively. The case of training multiple independently generated sequences V^μ, μ = 1, ..., P is straightforward and has likelihood ∑_μ L(θ_v, θ_h|V^μ). To maximise the log-likelihood, it is useful to evaluate the derivatives with respect to the model parameters. These can be calculated as follows:\n\ndL/dθ_v = ∂ log p(v(1)|θ_v)/∂θ_v + ∑_{t=1}^{T−1} ∂ log p(v(t+1)|v(t), h(t), θ_v)/∂θ_v   (5)\n\ndL/dθ_h = ∑_{t=1}^{T−1} ∂ log p(v(t+1)|v(t), h(t), θ_v)/∂h(t) · dh(t)/dθ_h   (6)\n\ndh(t)/dθ_h = ∂f(t)/∂θ_h + ∂f(t)/∂h(t−1) · dh(t−1)/dθ_h   (7)\n\nwhere f(t) ≡ f(v(t), v(t−1), h(t−1), θ_h). Hence the derivatives can be calculated by deterministic forward propagation of errors, and highly complex functions f and CPTs p(v(t+1)|v(t), h(t)) may be used. 
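As a concrete, deliberately simplified scalar instance of the forward error propagation in eqs (3)-(7), the sketch below accumulates dL/da alongside the hidden recursion. The particular choices here are ours, not the paper's: h(t+1) = tanh(a·h(t) + b·v(t)), and a unit-variance Gaussian visible CPT with mean c·h(t). A central finite difference checks the propagated gradient.

```python
import numpy as np

# Scalar illustration of eqs (3)-(7); model choices are hypothetical.
def loglik_and_grad_a(v, a, b, c):
    h, dh_da = 0.0, 0.0          # h(1) and dh(1)/da
    L, dL_da = 0.0, 0.0
    for t in range(len(v) - 1):
        r = v[t + 1] - c * h     # residual of the Gaussian CPT
        L += -0.5 * r ** 2       # log p(v(t+1)|v(t),h(t)) + const, eq (3)
        dL_da += r * c * dh_da   # eq (6): d log p / dh(t) times dh(t)/da
        pre = a * h + b * v[t]
        dh_da = (1.0 - np.tanh(pre) ** 2) * (h + a * dh_da)  # eq (7)
        h = np.tanh(pre)         # eq (4)
    return L, dL_da

v = [0.3, -0.1, 0.5, 0.2]
L, g = loglik_and_grad_a(v, a=0.7, b=0.4, c=1.1)

# sanity check against a central finite difference
eps = 1e-6
Lp, _ = loglik_and_grad_a(v, a=0.7 + eps, b=0.4, c=1.1)
Lm, _ = loglik_and_grad_a(v, a=0.7 - eps, b=0.4, c=1.1)
assert abs(g - (Lp - Lm) / (2 * eps)) < 1e-6
```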
Whilst the training of such networks resembles back-propagation in neural networks [1, 6], the models have a stochastic interpretation and retain the benefits inherited from probability theory, including the possibility of a Bayesian treatment.\n\n3 A Discrete Visible Illustration\n\nTo make the above framework more explicit, we consider the case of continuous hidden units and discrete, binary visible units, v_i(t) ∈ {0, 1}. In particular, we restrict attention to the model:\n\np(v(t+1)|v(t), h(t)) = ∏_{i=1}^{V} σ((2v_i(t+1) − 1) ∑_j w_ij φ_j(t)),   h_i(t+1) = ∑_j u_ij ψ_j(t)\n\nwhere σ(x) = 1/(1 + e^{−x}), and φ_j(t) and ψ_j(t) represent fixed functions of the network state (h(t), v(t)). Normalisation is ensured since 1 − σ(x) = σ(−x). This model generalises a recurrent stochastic heteroassociative Hopfield network [4] to include deterministic hidden units dependent on past network states.\n\nThe derivatives of the log likelihood are given by:\n\ndL/dw_ij = ∑_t (1 − σ_i(t)) (2v_i(t+1) − 1) φ_j(t),   dL/du_ij = ∑_{t,k,l} (1 − σ_k(t)) (2v_k(t+1) − 1) w_kl φ'_l(t) dh_l(t)/du_ij\n\nwhere σ_i(t) ≡ σ((2v_i(t+1) − 1) ∑_j w_ij φ_j(t)), φ'_l(t) ≡ dφ_l(t)/dt, and the hidden unit derivatives are found from the recursions\n\ndh_l(t+1)/du_ij = ∑_k u_lk dψ_k(t)/du_ij + δ_il ψ_j(t),   dψ_k(t)/du_ij = ∑_m ∂ψ_k(t)/∂h_m(t) · dh_m(t)/du_ij\n\nWe considered a network with the simple linear type influences Ψ(t) ≡ Φ(t) ≡ (h(t); v(t)), and restricted connectivity W = (A 0; 0 B), U = (C 0; 0 D), where the\n\nFigure 3: (a) A temporal slice of the network. 
(b) The training sequence consists of a random set of vectors (V = 3) over T = 10 time steps. (c) The reconstruction using H = 7 hidden units. The initial state v(t = 1) for the recalled sequence was set to the correct initial training value, albeit with one of the values flipped. Note how the dynamics learned is an attractor for the original sequence.\n\nparameters to learn are the matrices A, B, C, D. A slice of the network is illustrated in fig(3)[a]. We can easily iterate the hidden states in this case to give\n\nh(t+1) = Ah(t) + Bv(t) = A^t h(1) + ∑_{t'=0}^{t−1} A^{t'} B v(t − t')\n\nwhich demonstrates how the hidden state depends on the full past history of the observations. We trained the network using 3 visible units and 7 hidden units to maximise the likelihood of the binary sequence in fig(3)[b]. Note that this sequence contains repeated patterns and therefore could not be recalled perfectly with a model which does not contain hidden units. We tested whether the learned model had captured the dynamics of the training sequence by initialising the network in the first visible state in the training sequence, but with one of the values flipped. The network then generated the following hidden and visible states recursively, as plotted in fig(3)[c]. The learned network is an attractor with the training sequence as a stable point, demonstrating that such models are capable of learning attractor recurrent networks more powerful than those without hidden units. Learning is very fast in such networks, and we have successfully applied these models to cases of several hundred hidden and visible unit dimensions.\n\n3.1 Recall Capacity\n\nWhat effect do the hidden units have on the ability of Hopfield networks to recall sequences? 
By recall, we mean that a training sequence is correctly generated by the network given that only the initial state of the training sequence is presented to the trained network. For the analysis here, we consider the retrieval dynamics to be completely deterministic. Thus, if we concatenate both hidden h(t) and visible variables v(t) into the vector x(t) and consider the deterministic hidden function f(y) ≡ thresh(y), which is 1 if y > 0 and zero otherwise, then\n\nx_i(t+1) = thresh(∑_j M_ij x_j(t)).   (8)\n\nHere M_ij are the elements of the weight matrix representing the transitions from time t to time t+1. A desired sequence ~x(1), ..., ~x(T) can be recalled correctly if we can find a matrix M and real numbers ε_i(t) such that\n\nM [~x(1), ..., ~x(T−1)] = [ε(2), ..., ε(T)]\n\nwhere the ε_i(t) are arbitrary real numbers for which thresh(ε_i(t)) = ~x_i(t). This system of linear equations can be solved if the matrix [~x(1), ..., ~x(T−1)] has rank T − 1. The use of hidden units therefore increases the length of temporal sequences that we can store by forming, during learning, appropriate hidden representations h(t) such that the vectors (h(2); v(2)), ..., (h(T); v(T)) form a linearly independent set. Such vectors are clearly possible to generate if the matrix U is full rank. Thus recall can be achieved if (V + H) ≥ T − 1.\n\nThe reader might consider forming, from a set of linearly dependent patterns v(1), ..., v(T), a linearly independent set by injecting the patterns into a higher dimensional space, v(t) → ^v(t), using a non-linear mapping. This would appear to dispense with the need to use hidden units. 
However, if the same pattern in the training set is repeated at different times in the sequence (as in fig(3)[b]), no matter how complex this non-linear mapping, the resulting vectors ^v(1), ..., ^v(T) will be linearly dependent. This demonstrates that hidden units not only solve the linear dependence problem for non-repeated patterns, they also solve it for repeated patterns. They are therefore capable of sequence disambiguation, since the hidden unit representations formed are dependent on the full history of the visible units.\n\n4 A Continuous Visible Illustration\n\nTo illustrate the use of the framework with continuous visible variables, we consider the simple Gaussian visible CPT model\n\np(v(t+1)|v(t), h(t)) = exp(−[v(t+1) − g(Ah(t) − Bv(t))]² / (2σ²)) / (2πσ²)^{V/2},   h(t+1) = f(Ch(t) + Dv(t))   (9)\n\nwhere the functions f and g are in general non-linear functions of their arguments. In the case that f(x) ≡ x and g(x) ≡ x, this model is a special case of the Kalman filter [5]. Training of these models by learning A, B, C, D (σ² was set to 0.02 throughout) is straightforward using the forward error propagation techniques outlined earlier in section (2).\n\n4.1 Classifying Japanese vowels\n\nThis UCI machine learning test problem consists of a set of multi-dimensional time series. Nine speakers uttered two Japanese vowels /ae/ successively to form discrete time series with 12 LPC cepstral coefficients. Each utterance forms a time series V whose length is in the range T = 7 to T = 29, and each vector v(t) of the time series contains 12 cepstral coefficients. The training data consists of 30 training utterances for each of the 9 speakers. The test data contains 370 time series, each uttered by one of the nine speakers. 
The task is to assign each of the test utterances to the correct speaker.\n\nWe used the special settings f(x) ≡ x and g(x) ≡ x to see if such a simple network would be able to perform well. We split the training data into a 2/3 train and a 1/3 validation part, then trained a set of 10 models for each of the 9 speakers, with hidden unit dimensions taking the values H = 1, 2, ..., 10, and using 20 training iterations of conjugate gradient learning [1]. For simplicity, we used the same number of hidden units for each of the nine speaker models. To classify a test utterance, we chose the speaker model which had the highest likelihood of generating the test utterance, using an error of 0 if the utterance was assigned to the correct speaker and an error of 1 otherwise. The errors on the validation set for these 10 models\n\nFigure 4: (Left) Five sequences from the model v(t) = sin(2(t−1) + ε_1(t)) + 0.1 ε_2(t). (Right) Five sequences from the model v(t) = sin(5(t−1) + ε_3(t)) + 0.1 ε_4(t), where the ε_i(t) are zero mean unit variance Gaussian noise samples. These were combined to form a training set of 10 unlabelled sequences. 
We performed unsupervised learning by fitting a two component mixture model. The posterior probability p(i = 1|V^μ) of the 5 sequences on the left belonging to class 1 is (from above) 0.99, 0.99, 0.83, 0.99, 0.96, and for the 5 sequences on the right the posterior probability of belonging to class 2 is (from above) 0.95, 0.99, 0.97, 0.97, 0.95, in accord with the data generating process.\n\nwere 6, 6, 3, 5, 5, 5, 4, 5, 6, 3. Based on these validation results, we retrained a model with H = 3 hidden units on all available training data. On the final independent test set, the model achieved an accuracy of 97.3%. This compares favourably with the 96.2% reported for training using a continuous-output HMM with 5 (discrete) hidden states [8]. Although our model is not powerful enough to reconstruct the training data, it does learn sufficient information from the data to be able to make reliable classifications. This problem serves to illustrate that such simple models can perform well. An interesting alternative training method, not explored here, would be to use discriminative learning [7]. Also not explored here is the possibility of using Bayesian methods to set the number of hidden dimensions.\n\n5 Mixture Models\n\nSince our models are probabilistic, we can apply standard statistical generalisations to them, including using them as part of an M component mixture model\n\np(V|Θ) = ∑_{i=1}^{M} p(V|Θ_i, i) p(i)   (10)\n\nwhere p(i) denotes the prior mixing coefficients for model i, and each time series component model is represented by p(V|Θ_i, i). 
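For numerical stability, the mixture likelihood (10) and the component posteriors p(i|V) are best evaluated in log space. The sketch below is our own illustration: the per-component log-likelihood values are hypothetical numbers standing in for log p(V|Θ_i, i) as computed by each component model.

```python
import numpy as np

# Log-space evaluation of the mixture likelihood (10) and of the
# component posteriors p(i|V).  Inputs are hypothetical.
def mixture_posterior(log_lik, prior):
    joint = log_lik + np.log(prior)      # log p(V, i|Theta)
    log_pV = np.logaddexp.reduce(joint)  # log-sum-exp over components i
    return log_pV, np.exp(joint - log_pV)

log_lik = np.array([-120.4, -118.9])     # illustrative log p(V|Theta_i, i)
prior = np.array([0.5, 0.5])             # p(i)
log_pV, post = mixture_posterior(log_lik, prior)
```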
Training mixture models by maximum likelihood on a set of sequences V^1, ..., V^P is straightforward using the standard EM recursions [1]:\n\np^new(i) = ∑_{μ=1}^{P} p(V^μ|i, Θ_i^old) p^old(i) / ∑_{i'=1}^{M} ∑_{μ=1}^{P} p(V^μ|i', Θ_{i'}^old) p^old(i')   (11)\n\nΘ_i^new = argmax_{Θ_i} ∑_{μ=1}^{P} p(i|V^μ, Θ^old) log p(V^μ|i, Θ_i)   (12)\n\nTo illustrate this on a simple example, we trained a mixture model with component models of the form described in section (4). The data is a series of 10 one dimensional (V = 1) time series, each of length T = 40. Two distinct models were used to generate the 10 training sequences, see fig(4). We fitted a two component mixture model using mixture components of the form (9) (with linear functions f and g), each model having H = 3 hidden units. After training, the model priors were found to be roughly equal, 0.49 and 0.51, and it was satisfying to find that the separation of the unlabelled training sequences is entirely consistent with the data generation process, see fig(4). An interesting observation is that, whilst the true data generating process is governed by effectively stochastic hidden transitions, the deterministic hidden model still performs admirably.\n\n6 Discussion\n\nWe have considered a class of models for temporal sequence processing which are a specially constrained version of Dynamic Bayesian Networks. The constraint was chosen to ensure that inference would be trivial, even in high dimensional continuous hidden/latent spaces. Highly complex dynamics may therefore be postulated for the hidden space transitions, and also for the hidden to visible transitions. However, unlike traditional neural networks, the models remain probabilistic (generative) models, and hence the full machinery of Bayesian inference is applicable to this class of models. 
Indeed, whilst not explored here, model selection issues, such as assessing the relevant hidden unit dimension, are greatly facilitated in this class of models. The potential use of this class of models is therefore widespread. An area we are currently investigating is using these models for fast inference and learning in Independent Component Analysis and related areas. In the case that the hidden unit dynamics is known to be highly stochastic, this class of models is arguably less appropriate. However, stochastic hidden dynamics is often used in cases where one believes that the true hidden dynamics is too complex to model effectively (or, rather, to deal with computationally), and one uses noise to \u2018cover\u2019 for the lack of complexity in the assumed hidden dynamics. The models outlined here provide an alternative in the case that a potentially complex hidden dynamics form can be assumed, and may also still provide a reasonable solution even in cases where the underlying hidden dynamics is stochastic. This class of models is therefore a potential route to computationally tractable, yet powerful time series models.\n\nReferences\n\n[1] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.\n[2] H.A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer, 1994.\n[3] A. Doucet, N. de Freitas, and N.J. Gordon, Sequential Monte Carlo Methods in Practice, Springer, 2001.\n[4] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.\n[5] M.I. Jordan, Learning in Graphical Models, MIT Press, 1998.\n[6] J.F. Kolen and S.C. Kramer, Dynamic Recurrent Networks, IEEE Press, 2001.\n[7] A. Krogh and S.K. Riis, Hidden Neural Networks, Neural Computation 11 (1999), 541-563.\n[8] M. Kudo, J. Toyama, and M. 
Shimbo, Multidimensional Curve Classification Using Passing-Through Regions, Pattern Recognition Letters 20 (1999), no. 11-13, 1103-1111.\n[9] L.R. Rabiner and B.H. Juang, An Introduction to Hidden Markov Models, IEEE Transactions on Acoustics, Speech, and Signal Processing 3 (1986), no. 1, 4-16.\n[10] M. West and J. Harrison, Bayesian Forecasting and Dynamic Models, Springer, 1999.\n", "award": [], "sourceid": 2343, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}]}