{"title": "Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 249, "page_last": 256, "abstract": "We consider the estimation problem in Gaussian graphical models with arbitrary structure. We analyze the Embedded Trees algorithm, which solves a sequence of problems on tractable subgraphs thereby leading to the solution of the estimation problem on an intractable graph. Our analysis is based on the recently developed walk-sum interpretation of Gaussian estimation. We show that non-stationary iterations of the Embedded Trees algorithm using any sequence of subgraphs converge in walk-summable models. Based on walk-sum calculations, we develop adaptive methods that optimize the choice of subgraphs used at each iteration with a view to achieving maximum reduction in error. These adaptive procedures provide a significant speedup in convergence over stationary iterative methods, and also appear to converge in a larger class of models.", "full_text": "Adaptive Embedded Subgraph Algorithms using\n\nWalk-Sum Analysis\n\nVenkat Chandrasekaran, Jason K. Johnson, and Alan S. Willsky\n\nDepartment of Electrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nvenkatc@mit.edu, jasonj@mit.edu, willsky@mit.edu\n\nAbstract\n\nWe consider the estimation problem in Gaussian graphical models with arbitrary\nstructure. We analyze the Embedded Trees algorithm, which solves a sequence of\nproblems on tractable subgraphs thereby leading to the solution of the estimation\nproblem on an intractable graph. Our analysis is based on the recently developed\nwalk-sum interpretation of Gaussian estimation. We show that non-stationary it-\nerations of the Embedded Trees algorithm using any sequence of subgraphs con-\nverge in walk-summable models. 
Based on walk-sum calculations, we develop\nadaptive methods that optimize the choice of subgraphs used at each iteration with\na view to achieving maximum reduction in error. These adaptive procedures pro-\nvide a signi\ufb01cant speedup in convergence over stationary iterative methods, and\nalso appear to converge in a larger class of models.\n\n1 Introduction\n\nStochastic processes de\ufb01ned on graphs offer a compact representation for the Markov structure in a\nlarge collection of random variables. We consider the class of Gaussian processes de\ufb01ned on graphs,\nor Gaussian graphical models, which are used to model natural phenomena in many large-scale ap-\nplications [1, 2]. In such models, the estimation problem can be solved by directly inverting the\ninformation matrix. However, the resulting complexity is cubic in the number of variables, thus\nbeing prohibitively complex in applications involving hundreds of thousands of variables. Algo-\nrithms such as Belief Propagation and the junction-tree method are effective for computing exact\nestimates in graphical models that are tree-structured or have low treewidth [3], but for graphs with\nhigh treewidth the junction-tree approach is intractable.\n\nWe describe a rich class of iterative algorithms for estimation in Gaussian graphical models with\narbitrary structure. Speci\ufb01cally, we discuss the Embedded Trees (ET) iteration [4] that solves a\nsequence of estimation problems on trees, or more generally tractable subgraphs, leading to the so-\nlution of the original problem on the intractable graph. We analyze non-stationary iterations of the\nET algorithm that perform inference calculations on an arbitrary sequence of subgraphs. Our anal-\nysis is based on the recently developed walk-sum interpretation of inference in Gaussian graphical\nmodels [5]. We show that in the broad class of so-called walk-summable models, the ET algorithm\nconverges for any arbitrary sequence of subgraphs used. 
The walk-summability of a model is easily\ntested [5, 6], thus providing a simple suf\ufb01cient condition for the convergence of such non-stationary\nalgorithms. Previous convergence results [6, 7] analyzed stationary or \u201ccyclo-stationary\u201d iterations\nthat use the same subgraph at each iteration or cycle through a \ufb01xed sequence of subgraphs. The\nfocus of this paper is on analyzing, and developing algorithms based on, arbitrary non-stationary\niterations that use any (non-cyclic) sequence of subgraphs, and the recently developed concept of\nwalk-sums appears to be critical to this analysis.\n\n1\n\n\fGiven this great \ufb02exibility in choosing successive iterative steps, we develop algorithms that adap-\ntively optimize the choice of subgraphs to achieve maximum reduction in error. These algorithms\ntake advantage of walk-sum calculations, which are useful in showing that our methods minimize\nan upper-bound on the error at each iteration. We develop two procedures to adaptively choose sub-\ngraphs. The \ufb01rst method \ufb01nds the best tree at each iteration by solving an appropriately formulated\nmaximum-weight spanning tree problem, with the weight of each edge being a function of the par-\ntial correlation coef\ufb01cient of the edge and the residual errors at the nodes that compose the edge.\nThe second method, building on this \ufb01rst method, adds extra edges in a greedy manner to the tree\nresulting from the \ufb01rst method to form a thin hypertree. Simulation results demonstrate that these\nnon-stationary algorithms provide a signi\ufb01cant speedup in convergence over stationary and cyclic\niterative methods. Since the class of walk-summable models is broad (including attractive models,\ndiagonally dominant models, and so-called pairwise-normalizable models), our methods provide a\nconvergent, computationally attractive method for inference. 
We also provide empirical evidence to show that our adaptive methods (with a minor modification) converge in many non-walk-summable models when stationary iterations diverge. The estimation problem in Gaussian graphical models involves solving a linear system with a sparse, symmetric, positive-definite matrix. Such systems are commonly encountered in other areas of machine learning and signal processing as well [8, 9]. Therefore, our methods are broadly applicable beyond estimation in Gaussian models.\n\n
Some of the results presented here appear in more detail in a longer paper [10], which provides complete proofs as well as a detailed description of walk-sum diagrams that give a graphical interpretation of our algorithms (we show an example in this paper). The report also considers problems involving communication \u201cfailure\u201d between nodes for distributed sensor network applications.\n\n
2 Background\n\n
Let G = (V, E) be a graph with vertices V and edges E that link pairs of vertices together; E is a subset of the set of all unordered pairs of vertices. Consider a Gaussian distribution in information form [5], p(x) \u221d exp{\u2212(1/2) xT Jx + hT x}, where J\u22121 is the covariance matrix and J\u22121h is the mean. The matrix J, also called the information matrix, is sparse according to graph G, i.e. Js,t = Jt,s = 0 if and only if {s, t} \u2209 E. Thus, G represents the graph with respect to which p(x) is Markov, i.e. p(x) satisfies the conditional independencies implied by the separators of G. The Gaussian mean estimation problem reduces to solving the following linear system of equations:\n\n
Jx = h, (1)\n\n
where x is the mean vector. Convergent iterations that compute the mean can also be used in turn to compute variances using a variety of methods [4, 11]. Thus, we focus on the problem of estimating the mean at each node.
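As a small concrete illustration of (1), a hypothetical 3-node chain model (the matrix and vector below are ours, purely for illustration, with J normalized to unit diagonal) can be solved directly:

```python
import numpy as np

# Hypothetical 3-node chain model: J is sparse w.r.t. the chain 1-2-3
# and normalized to have 1's along the diagonal.
J = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
h = np.array([1.0, 0.0, -1.0])

# The mean is the solution of the linear system (1): J x = h.
x = np.linalg.solve(J, h)

# Sanity check: x indeed satisfies J x = h.
assert np.allclose(J @ x, h)
```

For large sparse models a direct solve like this costs too much, which is precisely what motivates the iterative subgraph methods of the paper.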
Throughout the rest of this paper, we assume that J is normalized to have 1\u2019s along the diagonal.\u00b9 Such a re-scaling does not affect the convergence results in this paper, and our analysis and algorithms can be easily generalized to the un-normalized case [10].\n\n
2.1 Walk-sums\n\n
We give a brief overview of the walk-sum framework developed in [5]. Let J = I \u2212 R. The off-diagonal entries of the matrix R have the same sparsity structure as that of J, and consequently that of the graph G. For Gaussian processes defined on graphs, element Rs,t corresponds to the conditional correlation coefficient between the variables at vertices s and t conditioned on knowledge of all the other variables (also known as the partial correlation coefficient [5]). A walk is a sequence of vertices {wi}_{i=0}^{\u2113} such that each step {wi, wi+1} \u2208 E, 0 \u2264 i \u2264 \u2113 \u2212 1, with no restriction on crossing the same vertex or traversing the same edge multiple times. The weight of a walk is the product of the edge-wise partial correlation coefficients of the edges composing the walk: \u03c6(w) \u225c \u220f_{i=0}^{\u2113\u22121} R_{wi,wi+1}. We then have that (R\u2113)s,t is the sum of the weights of all length-\u2113 walks from s to t in G. With this point of view, we can interpret J\u22121 as follows:\n\n
(J\u22121)s,t = ((I \u2212 R)\u22121)s,t = \u2211_{\u2113=0}^{\u221e} (R\u2113)s,t = \u2211_{\u2113=0}^{\u221e} \u03c6(s \u2113\u2192 t), (2)\n\n
where \u03c6(s \u2113\u2192 t) represents the sum of the weights of all the length-\u2113 walks from s to t (the set of all such walks is finite). Thus, (J\u22121)s,t is the length-ordered sum over all walks in G from s to t.\n\n
\u00b9This can be achieved by performing the transformation \u02dcJ \u2190 D^{\u22121/2} J D^{\u22121/2}, where D is a diagonal matrix containing the diagonal entries of J.
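The length-ordered series (2) can be checked numerically. The sketch below uses a hypothetical 3-node model of our own (any R whose spectral radius is below one would do) and accumulates walk-sums length by length, comparing against the direct inverse:

```python
import numpy as np

# Hypothetical model; the series (2) converges when the spectral radius of R is < 1.
R = np.array([[0.0, 0.2, 0.1],
              [0.2, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
J = np.eye(3) - R
assert max(abs(np.linalg.eigvals(R))) < 1  # convergence condition for (2)

# Accumulate walk-sums length by length: partial = sum_{l=0}^{L} R^l.
partial = np.zeros_like(R)
term = np.eye(3)          # R^0: the zero-length walks
for _ in range(200):
    partial += term
    term = term @ R       # extend every walk by one more step

# The truncated series matches (I - R)^{-1} = J^{-1}.
assert np.allclose(partial, np.linalg.inv(J), atol=1e-8)
```

Each entry of `term` at step l sums the weights of all length-l walks between the corresponding vertex pair, which is exactly the (R^l)_{s,t} interpretation used in (2).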
This, however, is a very specific way to compute the inverse that converges if the spectral radius \u03c1(R) < 1. Other algorithms may compute walks according to different orders (rather than length-based orders). To analyze arbitrary algorithms that submit to a walk-sum interpretation, the following concept of walk-summability was developed in [5]. A model is said to be walk-summable if for each pair of vertices s, t \u2208 V, the absolute sum over all walks from s to t in G converges:\n\n
\u00af\u03c6(s \u2192 t) \u225c \u2211_{w\u2208W(s\u2192t)} |\u03c6(w)| < \u221e. (3)\n\n
Here, W(s \u2192 t) represents the set of all walks from s to t, and \u00af\u03c6(s \u2192 t) denotes the absolute walk-sum\u00b2 over this set. Based on the absolute convergence condition, walk-summability implies that walk-sums over a countable set of walks in G can be computed in any order. As a result, we have the following interpretation in walk-summable models:\n\n
(J\u22121)s,t = \u03c6(s \u2192 t), (4)\n\n
xt = (J\u22121h)t = \u2211_{s\u2208V} hs \u03c6(s \u2192 t) \u225c \u03c6(h; \u2217 \u2192 t), (5)\n\n
where the wildcard character \u2217 denotes a union over all vertices in V, and \u03c6(h; W) denotes a re-weighting of each walk in W by the corresponding h value at the starting node. Note that in (4) we relax the constraint that the sum is ordered by length, and do not explicitly specify an ordering on the walks (such as in (2)). In words, (J\u22121)s,t is the walk-sum over the set of all walks from s to t, and xt is the walk-sum over all walks ending at t, re-weighted by h.\n\n
As shown in [5], the walk-summability of a model is equivalent to \u03c1(\u00afR) < 1, where \u00afR denotes the matrix of the absolute values of the elements of R. Also, a broad class of models are walk-summable, including diagonally-dominant models, so-called pairwise normalizable models, and models for which the underlying graph G is non-frustrated, i.e. each cycle has an even number of negative partial correlation coefficients. Walk-summability implies that a model is valid, i.e. has positive-definite information/covariance.\n\n
Concatenation of walks  We briefly describe the concatenation operation for walks and walk-sets, which plays a key role in walk-sum analysis. Let u = u0 \u00b7\u00b7\u00b7 uend and v = vstart v1 \u00b7\u00b7\u00b7 v\u2113(v) be walks with uend = vstart. The concatenation of these walks is defined to be u \u00b7 v \u225c u0 \u00b7\u00b7\u00b7 uend v1 \u00b7\u00b7\u00b7 v\u2113(v). Now consider a walk-set U with all walks ending at uend and another walk-set V with all walks beginning at vstart. If uend = vstart, then the concatenation of U and V is defined:\n\n
U \u2297 V \u225c {u \u00b7 v : u \u2208 U, v \u2208 V}.\n\n
2.2 Embedded Trees algorithm\n\n
We describe the Embedded Trees iteration that performs a sequence of updates on trees, or more generally tractable subgraphs, leading to the solution of (1) on an intractable graph. Each iteration involves an inference calculation on a subgraph of all the variables V. Let (V, S) be some subgraph of G, i.e. S \u2282 E (see examples in Figure 1). Let J be split according to S as J = JS \u2212 KS, so that the entries of J corresponding to edges in S are assigned to JS, and those corresponding to E\\S are part of KS. The diagonal entries of J are all part of JS; thus, KS has zeroes along the diagonal.\u00b3 Based on this splitting, we can transform (1) to JS x = KS x + h, which suggests a natural recursion: JS bx(n) = KS bx(n\u22121) + h. If JS is invertible, and it is tractable to apply J\u22121S to a vector, then ET offers an effective method to solve (1) (assuming \u03c1(J\u22121S KS) < 1). If the subgraph used changes with each iteration, then we obtain the following non-stationary ET iteration:\n\n
bx(n) = J\u22121Sn (KSn bx(n\u22121) + h), (6)\n\n
where {Sn}\u221e n=1 is any arbitrary sequence of subgraphs.
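A minimal dense-matrix sketch of the non-stationary iteration (6): the model below is a hypothetical walk-summable 4-cycle of our own (in practice JS would be solved with a tractable tree solver, not a dense one), and at each iteration a randomly chosen cycle edge is dropped so that every Sn is a spanning tree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical walk-summable model on a 4-cycle: unit diagonal, rho(|R|) = 0.6 < 1.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
J = np.eye(n)
for (u, v) in edges:
    J[u, v] = J[v, u] = -0.3
h = np.ones(n)
x_exact = np.linalg.solve(J, h)

def split(J, S):
    """Split J = J_S - K_S: J_S keeps the diagonal and the edges in S."""
    J_S = np.diag(np.diag(J))
    for (u, v) in S:
        J_S[u, v] = J[u, v]
        J_S[v, u] = J[v, u]
    return J_S, J_S - J

# Non-stationary ET iteration (6): an arbitrary subgraph at every iteration
# (here, the cycle minus one random edge, i.e. a spanning tree).
x = np.zeros(n)
for _ in range(100):
    drop = rng.integers(len(edges))
    S = [e for i, e in enumerate(edges) if i != drop]
    J_S, K_S = split(J, S)
    x = np.linalg.solve(J_S, K_S @ x + h)

assert np.allclose(x, x_exact, atol=1e-8)
```

Because the model is walk-summable, the iterates converge to J\u22121h no matter which tree is picked at each step, which is the content of Theorem 1 below.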
An important degree of freedom is the choice of the subgraph Sn at iteration n, which forms the focus of Section 4 of this paper. In [10] we also consider a more general class of algorithms that update subsets of variables at each iteration.\n\n
\u00b2We generally denote the walk-sum of the set W(\u223c) by \u03c6(\u223c).\n
\u00b3KS can have non-zero diagonal in general, but we only consider the zero diagonal case here.\n\n
Figure 1: (Left) G and three embedded trees S1, S2, S3; (Right) Corresponding walk-sum diagram.\n\n
3 Walk-Sum Analysis and Convergence of the Embedded Trees algorithm\n\n
In this section, we provide a walk-sum interpretation for the ET algorithm. Using this analysis, we show that the non-stationary ET iteration (6) converges in walk-summable models for an arbitrary choice of subgraphs {Sn}\u221e n=1. Before proceeding with the analysis, we point out that one potential complication with the ET algorithm is that the matrix JS corresponding to some subgraph S may be indefinite or singular, even if the original model J is positive-definite. Importantly, such a problem never arises in walk-summable models: JS is positive-definite for every subgraph S if J is walk-summable. This is easily seen because walks in the subgraph S are a subset of the walks in G, and thus if absolute walk-sums in G are well-defined, then so are absolute walk-sums in S. Therefore, JS is walk-summable, and hence, positive-definite.\n\n
Consider the following recursively defined set of walks for s, t \u2208 V:\n\n
Wn(s \u2192 t) = [ \u222a_{u,v\u2208V} Wn\u22121(s \u2192 u) \u2297 W(u E\\Sn(1)\u2212\u2192 v) \u2297 W(v Sn\u2212\u2192 t) ] \u222a W(s Sn\u2212\u2192 t)\n
= Wn\u22121(s \u2192 \u2217) \u2297 W(\u2217 E\\Sn(1)\u2212\u2192 \u2022) \u2297 W(\u2022 Sn\u2212\u2192 t) \u222a W(s Sn\u2212\u2192 t), (7)\n\n
with W0(s \u2192 t) = \u2205.
Here, \u2217 and \u2022 are used as wildcard characters (a union over all elements in V), and \u2297 denotes concatenation of walk-sets as described previously. The set Wn\u22121(s \u2192 \u2217) denotes walks that start at node s computed at the previous iteration. The middle term W(\u2217 E\\Sn(1)\u2212\u2192 \u2022) denotes a length-1 walk (called a hop) across an edge in E\\Sn. Finally, W(\u2022 Sn\u2212\u2192 t) denotes walks in Sn that end at node t. Thus, the first term in (7) refers to previously computed walks starting at s, which hop across an edge in E\\Sn, and then finally propagate only in Sn (ending at t). The second term W(s Sn\u2212\u2192 t) denotes walks from s to t that only live within Sn. The following proposition (proved in [10]) shows that the walks contained in these walk-sets are precisely those computed by the ET algorithm at iteration n. For simplicity, we denote \u03c6(Wn(s \u2192 t)) by \u03c6n(s \u2192 t).\n\n
Proposition 1  Let bx(n) be the estimate at iteration n in the ET algorithm (6) with initial guess bx(0) = 0. Then, bx(n)t = \u03c6n(h; \u2217 \u2192 t) = \u2211_{s\u2208V} hs \u03c6n(s \u2192 t) in walk-summable models.\n\n
We note that the classic Gauss-Jacobi algorithm [6], a stationary iteration with JS = I and KS = R, can be interpreted as a walk-sum algorithm: bx(n)t in this method computes all walks up to length n ending at t. Figure 1 gives an example of a walk-sum diagram, which provides a graphical representation of the walks accumulated by the walk-sets (7). The diagram is the three-level graph on the right, and corresponds to an ET iteration based on the subgraphs S1, S2, S3 of the 3 \u00d7 3 grid G (on the left). Each level n in the diagram consists of the subgraph Sn used at iteration n (solid edges), and information from the previous level (iteration) n \u2212 1 is transmitted through the dashed edges in E\\Sn. The directed nature of these dashed edges is critical as they capture the one-directional flow of computations from iteration to iteration, while the undirected edges within each level capture the inference computation at each iteration. Consider a node v at level n of the diagram. Walks in the diagram that start at any node and end at v at level n, re-weighted by h, are exactly the walks computed by the ET algorithm in bx(n)v. For more examples of such diagrams, see [10].\n\n
Given this walk-sum interpretation of the ET algorithm, we can analyze the walk-sets (7) to prove the convergence of ET in walk-summable models by showing that the walk-sets eventually contain all the walks required for the computation of J\u22121h in (5). We have the following convergence theorem, for which we only provide a brief sketch of the complete proof [10].\n\n
Theorem 1  Let bx(n) be the estimate at iteration n in the ET algorithm (6) with initial guess bx(0) = 0. Then, bx(n) \u2192 J\u22121h element-wise as n \u2192 \u221e in walk-summable models.\n\n
Proof outline: Proving this statement is done in the following stages.\n
Validity: The walks in Wn are valid walks in G, i.e. Wn(s \u2192 t) \u2286 W(s \u2192 t).\n
Nesting: The walk-sets Wn(s \u2192 t) are nested, i.e. Wn\u22121(s \u2192 t) \u2286 Wn(s \u2192 t), \u2200n.\n
Completeness: Let w \u2208 W(s \u2192 t). There exists an N > 0 such that w \u2208 WN(s \u2192 t). Using the nesting property, we conclude that for all n \u2265 N, w \u2208 Wn(s \u2192 t).\n
These steps combined together allow us to conclude that \u03c6n(s \u2192 t) \u2192 \u03c6(s \u2192 t) as n \u2192 \u221e.
This conclusion relies on the fact that \u03c6(Wn) \u2192 \u03c6(\u222anWn) as n \u2192 \u221e for a sequence of nested walk-sets Wn\u22121 \u2286 Wn in walk-summable models, which is a consequence of the sum-partition theorem for absolutely summable series [5, 10, 12]. Given the walk-sum interpretation from Proposition 1, one can check that bx(n) \u2192 J\u22121h element-wise as n \u2192 \u221e. \u25a1\n\n
Thus, the ET algorithm converges to the correct solution of (1) in walk-summable models for any sequence of subgraphs with bx(0) = 0. It is then straightforward to show that convergence can be achieved for any initial guess [10]. Note that we have taken advantage of the absolute convergence property in walk-summable models (3) by not focusing on the order in which walks are computed, but only that they are eventually computed. In [10], we prove that walk-summability is also a necessary condition for the complete flexibility in the choice of subgraphs \u2014 there exists at least one sequence of subgraphs that results in a divergent ET iteration in non-walk-summable models.\n\n
4 Adaptive algorithms\n\n
Let e(n) = x \u2212 bx(n) be the error at iteration n and let h(n) = Je(n) = h \u2212 Jbx(n) be the corresponding residual error (which is tractable to compute). We begin by describing an algorithm to choose the \u201cnext-best\u201d tree Sn in the ET iteration (6). The error at iteration n can be re-written as follows:\n\n
e(n) = (J\u22121 \u2212 J\u22121Sn) h(n\u22121).\n\n
Thus, we have the walk-sum interpretation e(n)t = \u03c6(h(n\u22121); \u2217 G\\Sn\u2212\u2192 t), where G\\Sn denotes walks that do not live entirely within Sn. Using this expression for the error, we have the following bound that is tight for attractive models (Rs,t \u2265 0 for all s, t \u2208 V) and non-negative h(n\u22121):\n\n
\u2016e(n)\u2016_{\u21131} = \u2211_{t\u2208V} |\u03c6(h(n\u22121); \u2217 G\\Sn\u2212\u2192 t)| \u2264 \u00af\u03c6(|h(n\u22121)|; G\\Sn) = \u00af\u03c6(|h(n\u22121)|; G) \u2212 \u00af\u03c6(|h(n\u22121)|; Sn). (8)\n\n
Hence, minimizing the error at iteration n corresponds to finding the tree Sn that maximizes the second term \u00af\u03c6(|h(n\u22121)|; Sn). This leads us to the following maximum walk-sum tree problem:\n\n
arg max_{Sn a tree} \u00af\u03c6(|h(n\u22121)|; Sn) (9)\n\n
Finding the optimal such tree is combinatorially complex. Therefore, we develop a relaxation that minimizes a looser upper bound than (8). Specifically, consider an edge {u, v} and all the walks that live on this single edge, W({u, v}) = {uv, vu, uvu, vuv, uvuv, vuvu, . . .}. One can check that the contribution based on these single-edge walks can be computed as:\n\n
\u03c3u,v = \u2211_{w\u2208W({u,v})} \u00af\u03c6(|h(n\u22121)|; w) = (|h(n\u22121)u| + |h(n\u22121)v|) |Ru,v| / (1 \u2212 |Ru,v|). (10)\n\n
This weight provides a measure of the error-reduction capacity of edge {u, v} by itself at iteration n. These single-edge walks for edges in Sn are a subset of all the walks in Sn, and consequently provide a lower-bound on \u00af\u03c6(|h(n\u22121)|; Sn). Therefore, the maximization\n\n
arg max_{Sn a tree} \u2211_{{u,v}\u2208Sn} \u03c3u,v (11)\n\n
Figure 2: Grayscale images of residual errors in an 8 \u00d7 8 grid at successive iterations, and corresponding trees chosen by adaptive method.\n\n
Figure 3: Grayscale images of residual errors in an 8 \u00d7 8 grid at successive iterations, and corresponding hypertrees chosen by adaptive method.\n\n
This relaxed problem can be solved\nef\ufb01ciently using a maximum-weight spanning tree algorithm that has complexity O(|E| log log |V |)\nfor sparse graphs [13].\n\nGiven the maximum-weight spanning tree of the graph, a natural extension is to build a thin hyper-\ntree by adding extra \u201cstrong\u201d edges to the tree, subject to the constraint that the resulting graph has\nlow treewidth. Unfortunately, to do so optimally is an NP-hard optimization problem [14]. Hence,\nwe settle on a simple greedy algorithm. For each edge not included in the tree, in order of decreas-\ning edge weight, we add the edge to the graph if two conditions are met: \ufb01rst, we are able to easily\nverify that the treewidth stays less than M, and second, the length of the unique path in Sn between\nthe endpoints is less than L. In order to bound the tree width, we maintain a counter at each node\nof the total number of added edges that result in a path through that node. Comparing to another\nmethod for constructing junction trees from spanning trees [15], one can check that the maximum\nnode count is an upper-bound on the treewidth. We note that by using an appropriate directed repre-\nsentation of Sn relative to an arbitrary root, it is simple to identify the path between two nodes with\ncomplexity linear in path length (< L).4 Hence, the additional complexity of this greedy algorithm\nover that of the tree-selection procedure described previously is O(L|E|).\nIn Figure 2 and Figure 3 we present a simple demonstration of the tree and hypertree selection\nprocedures respectively, and the corresponding change in error achieved. 
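The tree-selection step described above can be sketched as follows. The edge weights implement (10); plain Kruskal with union-find stands in for the faster spanning-tree algorithm of [13]; the model numbers and residuals are hypothetical:

```python
import numpy as np

def edge_weight(h_res, R, u, v):
    """Single-edge walk-sum weight from (10):
    sigma_{u,v} = (|h_u| + |h_v|) |R_{u,v}| / (1 - |R_{u,v}|)."""
    r = abs(R[u, v])
    return (abs(h_res[u]) + abs(h_res[v])) * r / (1.0 - r)

def max_weight_spanning_tree(n_nodes, edges, weights):
    """Kruskal's algorithm; a simple stand-in for the faster method cited in the paper."""
    parent = list(range(n_nodes))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    tree = []
    for _, (u, v) in sorted(zip(weights, edges), reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding the edge creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Hypothetical residuals on a 4-cycle; the R entries are illustrative.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
R = np.zeros((4, 4))
for (u, v) in edges:
    R[u, v] = R[v, u] = 0.3
h_res = np.array([1.0, 0.5, 0.1, 0.1])  # residual errors h^(n-1) at the nodes
weights = [edge_weight(h_res, R, u, v) for (u, v) in edges]
tree = max_weight_spanning_tree(4, edges, weights)
assert len(tree) == 3 and (0, 1) in tree
```

As in the paper's Figure 2 discussion, the edge incident on the nodes with the largest residuals receives the largest weight and is kept in the tree.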
The grayscale images\nrepresent the residual errors at the nodes of an 8 \u00d7 8 grid similar to G in Figure 1 (with white\nrepresenting 1 and black representing 0), and the graphs beside them show the trees/hypertrees\nchosen based on these residual errors using the methods described above (the grid edge partial\ncorrelation coef\ufb01cients are the same for all edges). Notice that the \ufb01rst tree in Figure 2 tries to\ninclude as many edges as possible that are incident on the nodes with high residual error. Such\nedges are useful for capturing walks ending at the high-error nodes, which contribute to the set of\nwalks in (5). The \ufb01rst hypertree in Figure 3 actually includes all the edges incident on the high-\nerror nodes. The residual errors after inference on these subgraphs are shown next in Figure 2 and\nFigure 3. As expected, the hypertree seems to achieve greater reduction in error compared to the\nspanning tree. Again, at this iteration, the subgraphs chosen by our methods adapt based on the\nerrors at the various nodes.\n\n5 Experimental illustration\n\n5.1 Walk-summable models\n\nWe test the adaptive algorithms on densely connected nearest-neighbor grid-structured models (sim-\nilar to G in Figure 1). 
We generate random grid models \u2014 the grid edge partial correlation coefficients are chosen uniformly from [\u22121, 1] and R is scaled so that \u03c1(\u00afR) = 0.99. The vector h is chosen to be the all-ones vector. The table on the left in Figure 4 shows the average number of iterations required by various algorithms to reduce the normalized residual error \u2016h(n)\u2016\u2082 / \u2016h(0)\u2016\u2082 by a factor of 10\u22126. The average was computed based on 100 randomly generated 75 \u00d7 75 grid models. The plot in Figure 4 shows the decrease in the normalized residual error as a function of the number of iterations on a randomly generated 511 \u00d7 511 grid model. All these models are poorly conditioned because they are barely walk-summable (\u03c1(\u00afR) = 0.99). The stationary one-tree iteration uses a tree similar to S1 in Figure 1, and the two-tree iteration alternates between trees similar to S1 and S3 in Figure 1 [4]. The adaptive hypertree method uses M = 6 and L = 8. We also note that in practice the per-iteration costs of the adaptive tree and hypertree algorithms are roughly comparable.\n\n
\u2074One sets two pointers into the tree starting from any two nodes and then iteratively walks up the tree, always advancing from the point that is deeper in the tree, until the nearest ancestor of the two nodes is reached.\n\n
Figure 4: (Left) Average number of iterations required for the normalized residual to reduce by a factor of 10\u22126 over 100 randomly generated 75 \u00d7 75 grid models; (Center) Convergence plot for a randomly generated 511 \u00d7 511 grid model; (Right) Convergence range in terms of partial correlation for 16-node cyclic model with edges to neighbors two steps away.\n\n
Figure 5: (Left) 16-node graphical model; (Right) two embedded spanning trees T1, T2.\n\n
These results show that our adaptive algorithms demonstrate significantly superior convergence properties compared to stationary methods, thus providing a convergent, computationally attractive method for estimation in walk-summable models. Our methods are applicable beyond Gaussian estimation to other problems that require solution of linear systems based on sparse, symmetric, positive-definite matrices. Several recent papers that develop machine learning algorithms are based on solving such systems of equations [8, 9]; in fact, both of these papers involve linear systems based on diagonally-dominant matrices, which are walk-summable.\n\n
5.2 Non-walk-summable models\n\n
Next, we give empirical evidence that our adaptive methods provide convergence over a broader range of models than stationary iterations. One potential complication in non-walk-summable models is that the subgraph models chosen by the stationary and adaptive algorithms may be indefinite or singular even though J is positive-definite. In order to avoid this problem in the adaptive ET algorithm, the trees Sn chosen at each iteration must be valid (i.e., have positive-definite JSn). A simple modification to the maximum-weight spanning tree algorithm achieves this goal \u2014 we add an extra condition to the algorithm to test for diagonal dominance of the chosen tree model (as all symmetric, diagonally-dominant models are positive definite [6]). That is, at each step of the maximum-weight spanning tree algorithm, we only add an edge if it does not create a cycle and maintains a diagonally-dominant tractable subgraph model. Consider the 16-node model on the left in Figure 5. Let all the edge partial correlation coefficients be r.
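The diagonal-dominance condition described above amounts to a simple per-edge check; in the sketch below the function name, the strict inequality, and the toy model are our own choices, not from the paper:

```python
import numpy as np

def keeps_diag_dominance(J, chosen, new_edge):
    """Would adding new_edge to the chosen edge set leave the subgraph model
    J_S (strictly) diagonally dominant?  J is assumed to have unit diagonal,
    as in the paper; strictness guarantees positive definiteness."""
    row_sum = np.zeros(J.shape[0])
    for (u, v) in chosen + [new_edge]:
        row_sum[u] += abs(J[u, v])
        row_sum[v] += abs(J[v, u])
    return bool(np.all(row_sum < np.diag(J)))

# Toy 3-node chain with partial correlations of magnitude 0.6 (hypothetical):
J = np.array([[1.0, -0.6, 0.0],
              [-0.6, 1.0, -0.6],
              [0.0, -0.6, 1.0]])
ok_first = keeps_diag_dominance(J, [], (0, 1))         # 0.6 < 1 at both endpoints
ok_second = keeps_diag_dominance(J, [(0, 1)], (1, 2))  # node 1 would reach 1.2
assert ok_first and not ok_second
```

In the modified spanning-tree algorithm, an edge that passes the usual no-cycle test but fails this check is simply skipped.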
The range of r for which J is\npositive-de\ufb01nite is roughly (\u22120.46, 0.25), and the range for which the model is walk-summable is\n(\u22120.25, 0.25) (in this range all the algorithms, both stationary and adaptive, converge). For the one-\ntree iteration we use tree T1, and for the two-tree iteration we alternate between trees T1 and T2 (see\n\n7\n\n\fFigure 5). As the table on the right in Figure 4 demonstrates, the adaptive tree algorithm without the\ndiagonal-dominance (DD) check provides convergence over a much broader range of models than\nthe one-tree and two-tree iterations, but not for all valid models. However, the modi\ufb01ed adaptive\ntree algorithm with the DD check appears to converge almost up to the validity threshold. We have\nalso observed such behavior empirically in many other (though not all) non-walk-summable models\nwhere the adaptive ET algorithm with the DD condition converges while stationary methods diverge.\nThus, our adaptive methods, compared to stationary iterations, not only provide faster convergence\nrates in walk-summable models but also converge for a broader class of models.\n\n6 Discussion\n\nWe analyze non-stationary iterations of the ET algorithm that use any sequence of subgraphs for\nestimation in Gaussian graphical models. Our analysis is based on the recently developed walk-sum\ninterpretation of inference in Gaussian models, and we show that the ET algorithm converges for\nany sequence of subgraphs used in walk-summable models. These convergence results motivate\nthe development of methods to choose subgraphs adaptively at each iteration to achieve maximum\nreduction in error. The adaptive procedures are based on walk-sum calculations, and minimize an\nupper bound on the error at each iteration. Our simulation results show that the adaptive algorithms\nprovide a signi\ufb01cant speedup in convergence over stationary methods. 
Moreover, these adaptive\nmethods also seem to provide convergence over a broader class of models than stationary algorithms.\n\nOur adaptive algorithms are greedy in that they only choose the \u201cnext-best\u201d subgraph. An interest-\ning question is to develop tractable methods to compute the next K best subgraphs jointly to achieve\nmaximum reduction in error after K iterations. The experiment with non-walk-summable models\nsuggests that walk-sum analysis could be useful to provide convergent algorithms for non-walk-\nsummable models, perhaps with restrictions on the order in which walk-sums are computed. Fi-\nnally, subgraph preconditioners have been shown to improve the convergence rate of the conjugate-\ngradient method; using walk-sum analysis to select such preconditioners is of clear interest.\n\nReferences\n[1] M. Luettgen, W. Carl, and A. Willsky. Ef\ufb01cient multiscale regularization with application to optical \ufb02ow.\n\nIEEE Transactions on Image Processing, 3(1):41\u201364, Jan. 1994.\n\n[2] P. Rusmevichientong and B. Van Roy. An Analysis of Turbo Decoding with Gaussian densities.\n\nAdvances in Neural Information Processing Systems 12, 2000.\n\nIn\n\n[3] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kauffman, San Mateo, CA, 1988.\n[4] E. Sudderth, M. Wainwright, and A. Willsky. Embedded Trees: Estimation of Gaussian processes on\n\ngraphs with cycles. IEEE Transactions on Signal Processing, 52(11):3136\u20133150, Nov. 2004.\n\n[5] D. Malioutov, J. Johnson, and A. Willsky. Walk-Sums and Belief Propagation in Gaussian Graphical\n\nModels. Journal of Machine Learning Research, 7:2031\u20132064, Oct. 2006.\n\n[6] R. Varga. Matrix Iterative Analysis. Springer-Verlag, New York, 2000.\n[7] R. Bru, F. Pedroche, and D. Szyld. Overlapping Additive and Multiplicative Schwarz iterations for H-\n\nmatrices. Linear Algebra and its Applications, 393:91\u2013105, Dec. 2004.\n\n[8] D. Zhou, J. Huang, and B. Scholkopf. 
Learning from Labeled and Unlabeled data on a directed graph. In\n\nProceedings of the 22nd International Conference on Machine Learning, 2005.\n\n[9] D. Zhou and C. Burges. Spectral Clustering and Transductive Learning with multiple views. In Proceed-\n\nings of the 24th International Conference on Machine Learning, 2007.\n\n[10] V. Chandrasekaran, J. Johnson, and A. Willsky. Estimation in Gaussian Graphical Models using Tractable\n\nSubgraphs: A Walk-Sum Analysis. To appear in IEEE Transactions on Signal Processing.\n\n[11] D. Malioutov, J. Johnson, and A. Willsky. GMRF variance approximation using spliced wavelet bases. In\n\nIEEE International Conference on Acoustics, Speech and Signal Processing, 2007.\n\n[12] R. Godement. Analysis I: Convergence, Elementary Functions. Springer-Verlag, New York, 2004.\n[13] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.\n[14] N. Srebro. Maximum Likelihood Markov Networks: An Algorithmic Approach. Master\u2019s thesis, Mas-\n\nsachusetts Institute of Technology, 2000.\n\n[15] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transac-\n\ntions on Information Theory, 47(2):498\u2013519, Feb. 2001.\n\n8\n\n\f", "award": [], "sourceid": 539, "authors": [{"given_name": "Venkat", "family_name": "Chandrasekaran", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": null}, {"given_name": "Jason", "family_name": "Johnson", "institution": null}]}