{"title": "Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles", "book": "Advances in Neural Information Processing Systems", "page_first": 661, "page_last": 667, "abstract": null, "full_text": "Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles \n\nMartin J. Wainwright, Erik B. Sudderth, and Alan S. Willsky \n\nLaboratory for Information and Decision Systems \nDepartment of Electrical Engineering and Computer Science \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n{mjwain, esuddert, willsky}@mit.edu \n\nAbstract \n\nWe present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity. \n\n1 Introduction \n\nGraphical models are an invaluable tool for defining and manipulating probability distributions. In modeling stochastic processes with graphical models, two basic problems arise: (i) specifying a class of graphs with which to model or approximate the process; and (ii) determining efficient techniques for statistical inference. In fact, there exists a fundamental tradeoff between the expressive power of a graph and the tractability of statistical inference. 
At one extreme are tree-structured graphs: although they lead to highly efficient algorithms for estimation [1, 2], their modeling power is often limited. The addition of edges to the graph tends to increase modeling power, but also introduces loops that necessitate the use of more sophisticated and costly techniques for estimation. \n\nIn areas like coding theory, artificial intelligence, and speech processing [3, 1], graphical models typically involve discrete-valued random variables. However, in domains such as image processing, control, and oceanography [2, 4, 5], it is often more appropriate to consider random variables with a continuous distribution. In this context, Gaussian processes on graphs are of great practical significance. Moreover, the Gaussian case provides a valuable setting for developing an understanding of estimation algorithms [6, 7]. \n\nThe focus of this paper is the estimation and modeling of Gaussian processes defined on graphs with cycles. We first develop an estimation algorithm that is based on exploiting trees embedded within the loopy graph. Given a set of noisy measurements, this embedded trees (ET) algorithm computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the ET algorithm also computes exact error covariances at each node. In many applications, these error statistics are as important as the conditional means. We then demonstrate by example that relative to tree models, graphs with a small number of loops can lead to substantial improvements in modeling fidelity without a significant increase in estimation complexity. \n\n2 Linear estimation fundamentals \n\n2.1 Problem formulation \n\nConsider a Gaussian stochastic process x ~ N(0, P) that is Markov with respect to an undirected graph g. Each node in g corresponds to a subvector x_i of x. 
\nWe will refer to x_i as the state variable for the ith node, and its length as the state dimension. By the Hammersley-Clifford Theorem [8], P^{-1} inherits a sparse structure from g. If it is partitioned into blocks according to the state dimensions, the (i, j)th block can be nonzero only if there is an edge between nodes i and j. Let y = Cx + v, v ~ N(0, R), be a set of noisy observations. Without loss of generality, we assume that the subvectors y_i of the observations are conditionally independent given the state x. For estimation purposes, we are interested in p(x_i | y), the marginal distribution of the state at each node conditioned on the noisy observations. Standard formulas exist for the computation of p(x | y) ~ N(x_hat, P_hat): \n\nx_hat = P_hat C^T R^{-1} y,    P_hat = [P^{-1} + C^T R^{-1} C]^{-1}    (1) \n\nThe conditional error covariances P_hat_i are the block diagonal elements of the full error covariance P_hat, where the block sizes are equal to the state dimensions. \n\n2.2 Exploiting graph structure \n\nWhen g is tree-structured, both the conditional means and error covariances can be computed by a direct and very efficient O(d^3 N) algorithm [2]. Here d is the maximal state dimension at any node, and N is the total number of nodes. This algorithm is a generalization of classic Kalman smoothing algorithms for time series, and involves passing means and covariances to and from a node chosen as the root. For graphs with cycles, calculating the full error covariance P_hat by brute force matrix inversion would, in principle, provide the conditional means and error variances. Since the computational complexity of matrix inversion is O([dN]^3), this proposal is not practically feasible in many applications, such as image processing, where N may be on the order of 10^5. This motivates the development of iterative techniques for linear estimation on graphs with cycles. 
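The estimation formulas in equation (1) can be exercised directly on a small problem. The following sketch (in Python with NumPy) uses a 3-node chain with scalar states; the particular matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy instance of equation (1): scalar states on a 3-node chain.
# P_inv, C, R, and y are illustrative choices, not taken from the paper.
P_inv = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])   # sparse prior inverse covariance
C = np.eye(3)                          # each node observed directly
R = 0.5 * np.eye(3)                    # observation noise covariance
y = np.array([1.0, 0.0, -1.0])

# P_hat = [P^{-1} + C^T R^{-1} C]^{-1},  x_hat = P_hat C^T R^{-1} y
P_hat = np.linalg.inv(P_inv + C.T @ np.linalg.inv(R) @ C)
x_hat = P_hat @ C.T @ np.linalg.inv(R) @ y

# The conditional error variances are the diagonal (blocks) of P_hat.
print(x_hat)
print(np.diag(P_hat))
```

For scalar states the diagonal of P_hat gives the per-node error variances; with vector states one would extract diagonal blocks instead.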
\n\nRecently, two groups [6, 7] have analyzed Pearl's belief propagation [1] in application to Gaussian processes defined on loopy graphs. For Gaussians on trees, belief propagation produces results equivalent to the Kalman smoother of Chou et al. [2]. For graphs with cycles, these groups showed that when belief propagation converges, it computes the correct conditional means, but that error covariances are incorrect. The complexity per iteration of belief propagation on loopy graphs is O(d^3 N), where one iteration corresponds to updating each message once. \n\nFigure 1. Embedded trees produced by two different cutting matrices K_t for a nearest-neighbor grid (observation nodes not shown); the embedded trees satisfy P_tree(1)^{-1} = P^{-1} + K_1 and P_tree(2)^{-1} = P^{-1} + K_2. \n\nIt is important to note that conditional means can be efficiently calculated using techniques from numerical linear algebra [9]. In particular, it can be seen from equation (1) that computing the conditional mean x_hat is equivalent to computing the product of a matrix inverse and a vector. Given the sparsity of P^{-1}, iterative techniques like conjugate gradient [9] can be used to compute the mean with associated cost O(dN) per iteration. However, like belief propagation, such techniques compute only the means and not the error covariances. \n\n3 Embedded trees algorithm \n\n3.1 Calculation of means \n\nIn this section, we present an iterative algorithm for computing both the conditional means and error covariances of a Gaussian process defined on any graph. Central to the algorithm is the operation of cutting edges from a loopy graph to reveal an embedded tree. Standard tree algorithms [2] can be used to exactly solve the modified problem, and the results are used in a subsequent iteration. For a Gaussian process on a graph, the operation of removing edges corresponds to modifying the inverse covariance matrix. 
Specifically, we apply a matrix splitting \n\nP^{-1} + C^T R^{-1} C = P_tree(t)^{-1} - K_t + C^T R^{-1} C \n\nwhere K_t is a symmetric cutting matrix chosen to ensure that P_tree(t)^{-1} corresponds to a valid tree-structured inverse covariance matrix. This matrix splitting allows us to consider defining a sequence of iterates {x_hat^n} by the recursion: \n\n[P_tree(t(n))^{-1} + C^T R^{-1} C] x_hat^n = K_t(n) x_hat^{n-1} + C^T R^{-1} y \n\nHere t(n) indexes the embedded tree used in the nth iteration. For example, Figure 1 shows two of the many spanning trees embedded in a nearest-neighbor grid. When the matrix (P_tree(t(n))^{-1} + C^T R^{-1} C) is positive definite, it is possible to solve for the next iterate x_hat^n in terms of the data y and the previous iterate. Thus, given some starting point x_hat^0, we can generate a sequence of iterates {x_hat^n} by the recursion \n\nx_hat^n = M_t(n)^{-1} [K_t(n) x_hat^{n-1} + C^T R^{-1} y]    (2) \n\nwhere M_t(n) := (P_tree(t(n))^{-1} + C^T R^{-1} C). By comparing equation (2) to equation (1), it can be seen that computing the nth iterate corresponds to a linear-Gaussian problem, which can be solved efficiently and directly with standard tree algorithms [2]. \n\n3.2 Convergence of means \n\nBefore stating some convergence results, recall that for any matrix A, the spectral radius is defined as rho(A) := max_lambda |lambda|, where lambda ranges over the eigenvalues of A. \n\nProposition 1. Let x_hat be the conditional mean of the original problem on the loopy graph, and consider the sequence of iterates {x_hat^n} generated by equation (2). Then for any starting point, x_hat is the unique fixed point of the recursion, and the error e^n := x_hat^n - x_hat obeys the dynamics \n\ne^n = [prod_{j=1}^{n} M_t(j)^{-1} K_t(j)] e^0    (3) \n\nIn a typical implementation of the algorithm, one cycles through the embedded trees in some fixed order, say t = 1, ..., T. In this case, the convergence of the algorithm can be analyzed in terms of the product matrix A := prod_{j=1}^{T} M_j^{-1} K_j. 
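The recursion (2) can be sketched on a toy graph. The following Python/NumPy example uses a 4-node single cycle, cuts one edge to form an embedded chain, and iterates (2); a dense linear solve stands in for the O(d^3 N) tree solver, and all numerical values (edge weights, diagonal loading of the cutting matrix, observations) are illustrative assumptions.

```python
import numpy as np

# Embedded-trees recursion (2) on a 4-node single-cycle graph.
# A dense solve stands in for the tree solver; all matrices are
# illustrative assumptions, not values from the paper.
N = 4
P_inv = 2.0 * np.eye(N)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:   # edges of the cycle
    P_inv[i, j] = P_inv[j, i] = -0.4
C, R_inv = np.eye(N), 2.0 * np.eye(N)
y = np.array([1.0, -0.5, 0.0, 2.0])

# Cut the (3, 0) edge: P_tree^{-1} = P^{-1} + K must have a zero there.
K = np.zeros((N, N))
K[3, 0] = K[0, 3] = 0.4      # cancels the -0.4 coupling on the cut edge
K[0, 0] = K[3, 3] = 0.4      # diagonal loading keeps K positive semidefinite
M = (P_inv + K) + C.T @ R_inv @ C    # tree-structured system matrix M_t

x = np.zeros(N)                       # starting point x_hat^0 = 0
for _ in range(50):                   # x^n = M^{-1} [K x^{n-1} + C^T R^{-1} y]
    x = np.linalg.solve(M, K @ x + C.T @ R_inv @ y)

x_exact = np.linalg.solve(P_inv + C.T @ R_inv @ C, C.T @ R_inv @ y)
print(np.max(np.abs(x - x_exact)))    # should be near machine precision
```

Because K is positive semidefinite here, Theorem 1 below guarantees geometric convergence of this single-tree iteration.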
\nProposition 2. Convergence of the ET algorithm is governed by the spectral radius of A. In particular, if rho(A) > 1, then the algorithm will not converge, whereas if rho(A) < 1, then (x_hat^n - x_hat) -> 0 geometrically at rate gamma := rho(A)^{1/T}. \n\nNote that the cutting matrices K must be chosen in order to guarantee not only that P_tree^{-1} is tree-structured but also that M := (P_tree^{-1} + C^T R^{-1} C) is positive definite. The following theorem, adapted from results in [10], gives conditions guaranteeing the validity and convergence of the ET algorithm when cutting to a single tree. \n\nTheorem 1. Define Q := P^{-1} + C^T R^{-1} C, and M := Q + K. Suppose the cutting matrix K is symmetric and positive semidefinite. Then we are guaranteed that rho(M^{-1} K) < 1. In particular, we have the bounds: \n\nlambda_max(K) / [lambda_max(K) + lambda_max(Q)] <= rho(M^{-1} K) <= lambda_max(K) / [lambda_max(K) + lambda_min(Q)]    (4) \n\nIt should be noted that the conditions of this theorem are sufficient, but by no means necessary, to guarantee convergence of the ET algorithm. In particular, we find that indefinite cutting matrices often lead to faster convergence. Furthermore, Theorem 1 does not address the superior performance typically achieved by cycling through several embedded trees. Gaining a deeper theoretical understanding of these phenomena is an interesting open question. \n\n3.3 Calculation of error covariances \n\nAlthough there exist a variety of iterative algorithms for computing the conditional mean of a linear-Gaussian problem, none of these methods correctly compute error covariances at each node. We show here that the ET algorithm can efficiently compute these covariances in an iterative fashion. For many applications (e.g., oceanography [5]), these error statistics are as important as the estimates. 
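The bounds (4) of Theorem 1 are easy to verify numerically. The sketch below builds a random symmetric positive definite Q and a random low-rank positive semidefinite K (both purely illustrative stand-ins for the matrices in the theorem) and checks that rho(M^{-1} K) falls between the two bounds and below 1.

```python
import numpy as np

# Numerical check of the eigenvalue bounds (4) in Theorem 1.
# Q and K are random illustrative matrices, not values from the paper.
rng = np.random.default_rng(0)
N = 6
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)      # symmetric positive definite
B = rng.standard_normal((N, 2))
K = B @ B.T                       # symmetric positive semidefinite, low rank
M = Q + K

rho = max(abs(np.linalg.eigvals(np.linalg.solve(M, K))))
lam_K = np.linalg.eigvalsh(K).max()
lam_Q = np.linalg.eigvalsh(Q)
lower = lam_K / (lam_K + lam_Q.max())
upper = lam_K / (lam_K + lam_Q.min())
print(lower, rho, upper)          # lower <= rho <= upper < 1
```

Since M^{-1} K is similar to the symmetric matrix M^{-1/2} K M^{-1/2}, its eigenvalues are real and nonnegative, which is why the spectral radius is well behaved here.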
\nWe assume for simplicity in notation that x_hat^0 = 0, and then expand equation (2) to yield that for any iteration x_hat^n = [F^n + M_t(n)^{-1}] C^T R^{-1} y, where the matrix F^n satisfies the recursion \n\nF^n = M_t(n)^{-1} K_t(n) [F^{n-1} + M_t(n-1)^{-1}]    (5) \n\nwith the initial condition F^1 = 0. It is straightforward to show that whenever the recursion for the conditional means in equation (2) converges, then the matrix sequence {F^n + M_t(n)^{-1}} converges to the full error covariance P_hat. Moreover, the cutting matrices K are typically of low rank, say O(E), where E is the number of cut edges. On this basis, it can be shown that each F^n can be decomposed as a sum of O(E) rank-1 matrices. Directly updating this low-rank decomposition of F^n from that of F^{n-1} requires O(d^3 E^2 N) operations. However, an efficient restructuring of this update requires only O(d^3 E N) operations [11]. The diagonal blocks of the low-rank representation may be easily extracted and added to the diagonal blocks of M_t(n)^{-1}, which are computed by standard tree smoothers. All together, we may obtain these error variances in O(d^3 E N) operations per iteration. Thus, the computation of error variances will be particularly efficient for graphs where the number of edges E that must be cut is small compared to the total number of nodes N. \n\nFigure 2. (a) Convergence rates of conjugate gradient, embedded trees, and belief propagation for computing conditional means (normalized l2 error). (b) Convergence rate of the ET algorithm for computing error variances. 
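The covariance recursion (5) can also be sketched on a toy problem. The example below reuses a 4-node cycle with one cut edge and a single embedded tree, iterating (5) with dense matrices in place of the low-rank updates; the graph, edge weights, and noise levels are illustrative assumptions.

```python
import numpy as np

# Error-covariance recursion (5) with a single embedded tree, using
# dense matrices in place of the low-rank updates. The 4-node cycle
# and its cut edge are illustrative assumptions.
N = 4
P_inv = 2.0 * np.eye(N)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:   # edges of the cycle
    P_inv[i, j] = P_inv[j, i] = -0.4
K = np.zeros((N, N))
K[3, 0] = K[0, 3] = 0.4       # cancel the (3, 0) edge: P_tree^{-1} = P^{-1} + K
K[0, 0] = K[3, 3] = 0.4       # diagonal loading keeps K positive semidefinite
R_inv = 2.0 * np.eye(N)        # taking C = I for simplicity
M_inv = np.linalg.inv(P_inv + K + R_inv)

F = np.zeros((N, N))           # initial condition F^1 = 0
for _ in range(60):            # F^n = M^{-1} K [F^{n-1} + M^{-1}]
    F = M_inv @ K @ (F + M_inv)

P_hat = np.linalg.inv(P_inv + R_inv)          # exact error covariance
print(np.max(np.abs((F + M_inv) - P_hat)))    # should be near machine precision
```

With a single tree the fixed point is F = (M - K)^{-1} K M^{-1}, so F + M^{-1} = (M - K)^{-1} = P_hat, matching the claim following equation (5).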
\n\n3.4 Results \n\nWe have applied the algorithm to a variety of graphs, ranging from graphs with single loops to densely connected MRFs on grids. Figure 2(a) compares the rates of convergence for three algorithms: conjugate gradient (CG), embedded trees (ET), and belief propagation (BP) on a 20 x 20 nearest-neighbor grid. The ET algorithm employed two embedded trees analogous to those shown in Figure 1. We find that CG is usually fastest, and can exhibit supergeometric convergence. In accordance with Proposition 2, the ET algorithm converges geometrically. Either BP or ET can be made to converge faster, depending on the choice of clique potentials. However, we have not experimented with optimizing the performance of ET by adaptively choosing edges to cut. Figure 2(b) shows that in contrast to CG and BP, the ET algorithm can also be used to compute the conditional error variances, where the convergence rate is again geometric. \n\n4 Modeling using graphs with cycles \n\n4.1 Issues in model design \n\nA variety of graphical structures may be used to approximate a given stochastic process. For example, perhaps the simplest model for a 1-D time series is a Markov chain, as shown in Figure 3(a). However, a high-order model may be required to adequately capture long-range correlations. The associated increase in state dimension leads to inefficient estimation. Figure 3(b) shows an alternative model structure. Here, additional \"coarse scale\" nodes have been added to the graph which are not directly linked to any measurements. These nodes are auxiliary variables created to explain the \"fine scale\" stochastic process of interest. If properly designed, the resulting tree structure will capture long-range correlations without the increase in state dimension of a higher-order Markov model. In previous work, our group has developed efficient algorithms for estimation and stochastic realization using such multiscale tree models [2, 4, 5, 12]. The gains provided by multiscale models are especially impressive when quadtrees are used to approximate two-dimensional Markov random fields. While statistical inference on MRFs is notoriously difficult, estimation on quadtrees remains extremely efficient. \n\nFigure 3. (a) Markov chain. (b) Multiscale tree model (auxiliary nodes, fine scale nodes, and observations). (c) Tree augmented by an extra edge. (d) Desired covariance P. (e) Error |P - P_tree| between the desired covariance and the realized tree covariance. (f) Error |P - P_loop| between the desired covariance and the covariance realized with the loopy graph. \n\nThe most significant weakness of tree models is boundary artifacts. That is, leaf nodes that are adjacent in the original process may be widely separated in the tree structure (see Figure 3(b)). As a result, dependencies between these nodes may be inadequately modeled, causing blocky discontinuities. Increasing the state dimension d of the hidden nodes will reduce blockiness, but will also reduce estimation efficiency, which is O(d^3 N) in total. One potential solution is to add edges between pairs of fine scale nodes where tree artifacts are likely to arise, as shown in Figure 3(c). Such edges should be able to account for short-range dependency neglected by a tree model. Furthermore, optimal inference for such \"near-tree\" models using the ET algorithm will still be extremely efficient. \n\n4.2 Application to multiscale modeling \n\nConsider a one-dimensional process of length 32 with exact covariance P shown in Figure 3(d). 
We approximate this process using two different graphical models, a multiscale tree and a \"near-tree\" containing an additional edge between two fine-scale nodes across a tree boundary (see Figure 3(c)). In both models, the state dimension at each node is constrained to be 2; therefore, the finest scale contains 16 nodes to model all 32 process points. Figure 3(e) shows the absolute error |P - P_tree| for the tree model, where realization was performed by the scale-recursive algorithm presented in [12]. The tree model matches the desired process statistics relatively well except at the center, where the tree structure causes a boundary artifact. Figure 3(f) shows the absolute error |P - P_loop| for a graph obtained by adding a single edge across the largest fine-scale tree boundary. The addition reduces the peak error by 60%, a substantial gain in modeling fidelity. If the ET algorithm is implemented by cutting to two different embedded trees, it converges extremely rapidly with rate gamma = 0.11. \n\n5 Discussion \n\nThis paper makes contributions to both the estimation and modeling of Gaussian processes on graphs. First, we developed the embedded trees algorithm for estimation of Gaussian processes on arbitrary graphs. In contrast to other techniques, our algorithm computes both means and error covariances. Even on densely connected graphs, our algorithm is comparable to or better than other techniques for computing means. The error covariance computation is especially efficient for graphs in which cutting a small number of edges reveals an embedded tree. In this context, we have shown that modeling with sparsely connected loopy graphs can lead to substantial gains in modeling fidelity, with a minor increase in estimation complexity. \n\nFrom the results of this paper arise a number of fundamental questions about the trade-off between modeling fidelity and estimation complexity. 
In order to address these questions, we are currently working to develop tighter bounds on the convergence rate of the algorithm, and also considering techniques for optimally selecting edges to be removed. On the modeling side, we are expanding on previous work for trees [12] in order to develop a theory of stochastic realization for processes on graphs with cycles. Lastly, although the current paper has focused on Gaussian processes, similar concepts can be developed for discrete-valued processes. \n\nAcknowledgments \n\nThis work was partially funded by ONR grant N00014-00-1-0089 and AFOSR grant F49620-98-1-0349; M.W. was supported by an NSERC 1967 fellowship, and E.S. by an NDSEG fellowship. \n\nReferences \n\n[1] J. Pearl. Probabilistic reasoning in intelligent systems. Morgan Kaufmann, 1988. \n[2] K. Chou, A. Willsky, and R. Nikoukhah. Multiscale systems, Kalman filters, and Riccati equations. IEEE Trans. AC, 39(3):479-492, March 1994. \n[3] R. G. Gallager. Low-density parity check codes. MIT Press, Cambridge, MA, 1963. \n[4] M. Luettgen, W. Karl, and A. Willsky. Efficient multiscale regularization with application to optical flow. IEEE Trans. Im. Proc., 3(1):41-64, Jan. 1994. \n[5] P. Fieguth, W. Karl, A. Willsky, and C. Wunsch. Multiresolution optimal interpolation of satellite altimetry. IEEE Trans. Geo. Rem., 33(2):280-292, March 1995. \n[6] P. Rusmevichientong and B. Van Roy. An analysis of turbo decoding with Gaussian densities. In NIPS 12, pages 575-581. MIT Press, 2000. \n[7] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In NIPS 12, pages 673-679. MIT Press, 2000. \n[8] J. Besag. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Stat. Soc. Series B, 36:192-236, 1974. \n[9] J. W. Demmel. Applied numerical linear algebra. SIAM, Philadelphia, 1997. \n[10] O. Axelsson. 
Bounds of eigenvalues of preconditioned matrices. SIAM J. Matrix Anal. Appl., 13:847-862, July 1992. \n[11] E. Sudderth, M. Wainwright, and A. Willsky. Embedded trees for modeling and estimation of Gaussian processes on graphs with cycles. In preparation, Dec. 2000. \n[12] A. Frakt and A. Willsky. Computationally efficient stochastic realization for internal multiscale autoregressive models. Mult. Sys. and Sig. Proc. To appear. \n", "award": [], "sourceid": 1857, "authors": [{"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": null}]}