{"title": "Efficient Exact Inference in Planar Ising Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1424, "abstract": "We present polynomial-time algorithms for the exact computation of lowest- energy states, worst margin violators, partition functions, and marginals in binary undirected graphical models. Our approach provides an interesting alternative to the well-known graph cut paradigm in that it does not impose any submodularity constraints; instead we require planarity to establish a correspondence with perfect matchings in an expanded dual graph. Maximum-margin parameter estimation for a boundary detection task shows our approach to be efﬁcient and effective.", "full_text": "Ef\ufb01cient Exact Inference in Planar Ising Models\n\nNicol N. Schraudolph\n\nnips@schraudolph.org\n\nDmitry Kamenetsky\n\ndkamen@cecs.anu.edu.au\n\nNational ICT Australia, Locked Bag 8001, Canberra ACT 2601, Australia\n& RSISE, Australian National University, Canberra ACT 0200, Australia\n\nAbstract\n\nWe give polynomial-time algorithms for the exact computation of lowest-energy\nstates, worst margin violators, partition functions, and marginals in certain binary\nundirected graphical models. Our approach provides an interesting alternative to\nthe well-known graph cut paradigm in that it does not impose any submodularity\nconstraints; instead we require planarity to establish a correspondence with perfect\nmatchings in an expanded dual graph. Maximum-margin parameter estimation for\na boundary detection task shows our approach to be ef\ufb01cient and effective. 
A C++ implementation is available from http://nic.schraudolph.org/isinf/.

1 Introduction

Undirected graphical models are a popular tool in machine learning; they represent real-valued energy functions of the form

E′(y) := Σ_{i∈V} E′_i(y_i) + Σ_{(i,j)∈E} E′_{ij}(y_i, y_j),   (1)

where the terms in the first sum range over the nodes V = {1, 2, . . . , n}, and those in the second sum over the edges E ⊆ V × V of an undirected graph G(V, E).

The junction tree decomposition provides an efficient framework for exact statistical inference in graphs that are (or can be turned into) trees of small cliques. The resulting algorithms, however, are exponential in the clique size, i.e., the treewidth of the original graph. This is prohibitively large for many graphs of practical interest; for instance, it grows as O(n) for an n × n square lattice. Many approximate inference techniques have been developed to deal with such graphs, such as pseudo-likelihood, mean field approximation, loopy belief propagation, and tree reweighting.

1.1 The Ising Model

Efficient exact inference is possible in certain graphical models with binary node labels. Here we focus on Ising models, whose energy functions have the form E : {0, 1}^n → R with

E(y) := Σ_{(i,j)∈E} [y_i ≠ y_j] E_{ij},   (2)

where [·] denotes the indicator function, i.e., the cost E_{ij} is incurred only in those states y where y_i and y_j disagree. Compared to the general model (1) for binary nodes, (2) imposes two additional restrictions: zero node energies, and edge energies in the form of disagreement costs. At first glance these constraints look severe; for instance, such systems must obey the symmetry E(y) = E(¬y), where ¬ denotes Boolean negation (ones' complement).
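The disagreement-cost energy (2) and its label symmetry are easy to make concrete; the following minimal sketch (toy graph and costs, chosen purely for illustration) evaluates E(y) and checks E(y) = E(¬y):

```python
# Minimal sketch of the Ising energy (2): only edges whose endpoints
# disagree contribute their (unconstrained-sign) cost E_ij.
def ising_energy(y, edges):
    """E(y) = sum of E_ij over edges (i, j) with y[i] != y[j]."""
    return sum(E for (i, j, E) in edges if y[i] != y[j])

# Illustrative triangle graph: entries are (i, j, E_ij).
edges = [(0, 1, 1.5), (1, 2, -0.5), (0, 2, 2.0)]

y = [0, 1, 1]
neg_y = [1 - yi for yi in y]          # Boolean negation of every label
# Label symmetry: negating all labels flips no disagreement indicator.
assert ising_energy(y, edges) == ising_energy(neg_y, edges)
```

Because only the disagreement pattern enters (2), every state and its complement share the same energy, which is exactly the symmetry the bias-node construction below breaks.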
It is well known, however, that adding a single node makes the Ising model (2) as expressive as the general model (1) for binary variables:

Theorem 1 Every energy function of the form (1) over n binary variables is equivalent to an Ising energy function of the form (2) over n + 1 variables, with the additional variable held constant.

[Figure 1: Equivalent Ising model (with disagreement costs) for a given (a) node energy E′_i, (b) edge energy E′_{ij} in a binary graphical model; (c) equivalent submodular model if E_{ij} > 0 and E_{0i} > 0 but E_{0j} < 0; (d) equivalent directed model of Kolmogorov and Zabih [1], Fig. 2d.]

Proof by construction: Two energy functions are equivalent if they differ only by a constant. Without loss of generality, denote the additional variable y_0 and hold it constant at y_0 := 0. Given an energy function of the form (1), construct an Ising model with disagreement costs as follows:

1. For each node energy function E′_i(y_i), add a disagreement cost term E_{0i} := E′_i(1) − E′_i(0);
2.
For each edge energy function E′_{ij}(y_i, y_j), add the three disagreement cost terms

E_{ij} := ½ [(E′_{ij}(0,1) + E′_{ij}(1,0)) − (E′_{ij}(0,0) + E′_{ij}(1,1))],
E_{0i} := E′_{ij}(1,0) − E′_{ij}(0,0) − E_{ij}, and
E_{0j} := E′_{ij}(0,1) − E′_{ij}(0,0) − E_{ij}.   (3)

Summing the above terms, the total bias of node i (i.e., its disagreement cost with the bias node) is

E_{0i} = E′_i(1) − E′_i(0) + Σ_{j:(i,j)∈E} [E′_{ij}(1,0) − E′_{ij}(0,0) − E_{ij}].   (4)

This construction defines an Ising model whose energy in every configuration y is shifted, relative to that of the general model we started with, by the same constant amount, namely E′(0):

∀y ∈ {0,1}^n : E([0; y]) = E′(y) − Σ_{i∈V} E′_i(0) − Σ_{(i,j)∈E} E′_{ij}(0,0) = E′(y) − E′(0).   (5)

The two models' energy functions are therefore equivalent.

Note how in the above construction the label symmetry E(y) = E(¬y) of the plain Ising model (2) is conveniently broken by the introduction of a bias node, through the convention that y_0 := 0.

1.2 Energy Minimization via Graph Cuts

Definition 2 The cut C of a binary graphical model G(V, E) induced by state y ∈ {0,1}^n is the set C(y) := {(i, j) ∈ E : y_i ≠ y_j}; its weight |C(y)| is the sum of the weights of its edges.

Any given state y partitions the nodes of a binary graphical model into two sets: those labeled '0', and those labeled '1'. The corresponding graph cut is the set of edges crossing the partition; since only they contribute disagreement costs to the Ising model (2), we have ∀y : |C(y)| = E(y).
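The construction of Theorem 1 is mechanical enough to verify exhaustively on a small model. The sketch below (random node and edge energies, purely illustrative) builds the disagreement costs and checks that the Ising energy with y_0 := 0 equals E′(y) − E′(0) for every state:

```python
import itertools
import random

random.seed(0)
n = 3
# Illustrative general binary model (1): node energies E'_i(.) and
# edge energies E'_ij(., .) with arbitrary (random) values.
nodeE = {i: [random.uniform(-1, 1) for _ in range(2)] for i in range(1, n + 1)}
edge_list = [(1, 2), (2, 3), (1, 3)]
edgeE = {e: {(a, b): random.uniform(-1, 1) for a in (0, 1) for b in (0, 1)}
         for e in edge_list}

# Theorem 1 construction: disagreement costs, with node 0 as the bias node.
cost = {}
bias = {i: nodeE[i][1] - nodeE[i][0] for i in range(1, n + 1)}
for (i, j), E in edgeE.items():
    Eij = 0.5 * ((E[0, 1] + E[1, 0]) - (E[0, 0] + E[1, 1]))
    cost[i, j] = Eij
    bias[i] += E[1, 0] - E[0, 0] - Eij
    bias[j] += E[0, 1] - E[0, 0] - Eij
for i in bias:
    cost[0, i] = bias[i]            # total bias = edge to the bias node

def general_E(y):                   # energy (1); y maps nodes 1..n to {0,1}
    return (sum(nodeE[i][y[i]] for i in nodeE)
            + sum(E[y[i], y[j]] for (i, j), E in edgeE.items()))

def ising_E(y):                     # energy (2), bias node held at y0 := 0
    y = {**y, 0: 0}
    return sum(c for (i, j), c in cost.items() if y[i] != y[j])

zero = {i: 0 for i in range(1, n + 1)}
for bits in itertools.product((0, 1), repeat=n):
    y = dict(zip(range(1, n + 1), bits))
    # Shift identity: the two energies differ by the constant E'(0).
    assert abs(ising_E(y) - (general_E(y) - general_E(zero))) < 1e-9
```

The exhaustive loop over all 2^n states confirms the constant shift, so minimizing either energy yields the same optimal labeling.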
The lowest-energy state of an Ising model therefore induces its minimum-weight cut.

Minimum-weight cuts can be computed in polynomial time in graphs whose edge weights are all non-negative. Introducing one more node, with the constraint y_{n+1} := 1, allows us to construct an equivalent energy function by replacing each negatively weighted bias edge E_{0i} < 0 with an edge to the new node n + 1 with the positive weight E_{i,n+1} := −E_{0i} > 0 (Figure 1c). This still leaves us with the requirement that all non-bias edges be non-negative. This submodularity constraint implies that agreement between nodes must be locally preferable to disagreement, a severe limitation.

Graph cuts have been widely used in machine learning to find lowest-energy configurations, in particular in image processing. Our construction (Figure 1c) differs from that of Kolmogorov and Zabih [1] (Figure 1d) in that we do not employ the notion of directed edges. (In directed graphs, the weight of a cut is the sum of the weights of only those edges crossing the cut in a given direction.)

Figure 2: (a) a non-plane drawing of a planar graph; (b) a plane drawing of the same graph; (c) a different plane drawing of the same graph, with the same planar embedding as (b); (d) a plane drawing of the same graph with a different planar embedding.

2 Planarity

Unlike graph cut methods, the inference algorithms we describe below do not depend on submodularity; instead they require that the model graph be planar, and that a planar embedding be provided.

2.1 Embedding Planar Graphs

Definition 3 Let G(V, E) be an undirected, connected graph. For each vertex i ∈ V, let E_i denote the set of edges in E incident upon i, considered as being oriented away from i, and let π_i be a cyclic permutation of E_i.
A rotation system for G is a set of permutations Π = {π_i : i ∈ V}.

Rotation systems [2] directly correspond to topological graph embeddings in orientable surfaces:

Theorem 4 (White and Beineke [2], p. 22f) Each rotation system determines an embedding of G in some orientable surface S such that ∀i ∈ V, any edge (i, j) ∈ E_i is followed by π_i(i, j) in (say) clockwise orientation, and such that the faces F of the embedding, given by the orbits of the mapping (i, j) → π_j(j, i), are 2-cells (topological disks).

Note that while in graph visualisation "embedding" is often used as a synonym for "drawing", in modern topological graph theory it stands for "rotation system". We adopt the latter usage, which views embeddings as equivalence classes of graph drawings characterized by identical cyclic ordering of the edges incident upon each vertex. For instance, π_4(4,5) = (4,3) in Figures 2b and 2c (same embedding) but π_4(4,5) = (4,1) in Figure 2d (different embedding). A sample face in Figures 2b–2d is given by the orbit (4,1) → π_1(1,4) = (1,2) → π_2(2,1) = (2,4) → π_4(4,2) = (4,1).

The genus g of the embedding surface S can be determined from the Euler characteristic

|V| − |E| + |F| = 2 − 2g,   (6)

where |F| is found by counting the orbits of the rotation system, as described in Theorem 4. Since planar graphs are exactly those that can be embedded on a surface of genus g = 0 (a topological sphere), we arrive at a purely combinatorial definition of planarity:

Definition 5 A graph G(V, E) is planar iff it has a rotation system Π producing exactly 2 + |E| − |V| orbits. Such a system is called a planar embedding of G, and G(V, E, Π) is called a plane graph.

Our inference algorithms require a plane graph as input.
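Definition 5 gives a direct recipe for testing whether a given embedding is planar: count the orbits of the mapping (i, j) → π_j(j, i) and compare against 2 + |E| − |V|. A minimal sketch (the K4 rotation system below is an assumed example, read off a plane drawing with vertex 0 inside triangle 1-2-3):

```python
# Count the faces of a rotation system via the orbit rule (i,j) -> pi_j(j,i)
# from Theorem 4, then apply Definition 5: planar iff |F| = 2 + |E| - |V|.
def count_faces(rotation):
    """rotation[i] is the clockwise cyclic list of neighbours of vertex i."""
    darts = {(i, j) for i in rotation for j in rotation[i]}   # oriented edges
    faces = 0
    while darts:
        faces += 1
        start = dart = next(iter(darts))      # trace one face orbit
        while True:
            darts.remove(dart)
            i, j = dart
            nbrs = rotation[j]                # successor of (j, i) around j
            dart = (j, nbrs[(nbrs.index(i) + 1) % len(nbrs)])
            if dart == start:
                break
    return faces

# K4 with a planar rotation system: vertex 0 drawn inside triangle 1,2,3.
rot = {0: [1, 3, 2], 1: [3, 0, 2], 2: [1, 0, 3], 3: [2, 0, 1]}
V, E = 4, 6
F = count_faces(rot)
assert F == 2 + E - V        # genus 0: this embedding is planar
```

This only certifies one particular embedding; deciding planarity of an arbitrary graph, and producing an embedding, is what the linear-time Boyer-Myrvold algorithm [3] mentioned below provides.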
In certain domains (e.g., when working with geographic information) a plane drawing of the graph (from which the corresponding embedding is readily determined) may be available. Where it is not, we employ the algorithm of Boyer and Myrvold [3] which, given any connected graph G as input, produces in linear time either a planar embedding for G or a proof that G is non-planar. Source code for this step is freely available [3, 4].

2.2 The Planarity Constraint

In Section 1.1 we mapped a general binary graphical model to an Ising model with an additional bias node; now we require that this Ising model be planar. What does that imply for the original, general model? If all nodes of the graph are to be connected to the bias node without violating planarity, the graph has to be outerplanar, i.e., have a planar embedding in which all its nodes lie on the external face, a very severe restriction.

Figure 3: Possible cuts (bold blue dashes) of a square face of the model graph (dashed) and complementary perfect matchings (bold red lines) of its expanded dual (solid lines).

The situation improves, however, if only a subset B ⊂ V of nodes have non-zero bias (4): then the graph only has to be B-outerplanar, i.e., have a planar embedding in which all nodes in B lie on the same face. In image processing, for instance, where it is common to operate on a square grid of pixels, we can permit bias for all nodes on the perimeter of the grid. In general, a planar embedding which maximizes a weighted sum over the nodes bordering a given face can be found in linear time [5]; by setting node weights to some measure of bias (such as E²_{0i}) we can efficiently obtain the planar Ising model closest (in that measure) to any given planar binary graphical model.

In contrast to submodularity, B-outerplanarity is a structural constraint.
This has the advantage that once a model obeying the constraint is selected, inference (e.g., parameter estimation) can proceed via unconstrained methods (e.g., optimization). Finally, we note that all our algorithms can be extended to work for non-planar graphs as well. They then take time exponential in the genus of the embedding, though still polynomial in the size of the graph; for graphs of low genus this may well be preferable to current approximate methods.

3 Computing Optimal States via Maximum-Weight Perfect Matching

The relationship between the states of a planar Ising model and perfect matchings ("dimer coverings" to physicists) was first discovered by Kasteleyn [6] and Fisher [7]. Globerson and Jaakkola [8] presented a more direct construction for triangulated graphs, which we generalize here.

3.1 The Expanded Dual Graph

Definition 6 The dual G*(F, E) of an embedded graph G(V, E, Π) has a vertex for each face of G, with edges connecting vertices corresponding to faces that are adjacent (i.e., share an edge) in G.

Each edge of the dual crosses exactly one edge of the original graph; due to this one-to-one relationship we will consider the dual to have the same set of edges E (with the same energies) as the original.

We now expand the dual graph by replacing each node with a q-clique, where q is the degree of the node, as shown in Figure 3 for q = 4. The additional edges internal to each q-clique are given zero energy so as to leave the model unaffected. For large q the introduction of these O(q²) internal edges slows down subsequent computations (solid line in Figure 4, left); this can be avoided by subdividing the offending q-gonal face with chords (which are also given zero energy) before constructing the dual.
Our implementation performs best when "octangulating" the graph, i.e., splitting octagons off all faces with q > 13; this is more efficient than a full triangulation (Figure 4, left).

3.2 Complementary Perfect Matchings

Definition 7 A perfect matching of a graph G(V, E) is a subset M ⊆ E of edges wherein exactly one edge is incident upon each vertex in V; its weight |M| is the sum of the weights of its edges.

Theorem 8 For every cut C of an embedded graph G(V, E, Π) there exists at least one (if G is triangulated: exactly one) perfect matching M of its expanded dual complementary to C, i.e., E\\M = C.

Proof sketch Consider the complement E\\C of the cut as a partial matching of the expanded dual. By definition, C intersects any cycle of G, and therefore also the perimeters of G's faces F, in an even number of edges. In each clique of the expanded dual, C's complement thus leaves an even number of nodes unmatched; M can therefore be completed using only edges interior to the cliques. In a 3-clique, there is only one way to do this, so M is unique if G is triangulated.

In other words, there exists a surjection from perfect matchings in the expanded dual of G to cuts in G. Furthermore, since we have given edges interior to the cliques of the expanded dual zero energy, every perfect matching M complementary to a cut C of our Ising model (2) obeys the relation

|M| + |C| = Σ_{(i,j)∈E} E_{ij} = const.   (7)

This means that instead of a minimum-weight cut in a graph we can look for a maximum-weight perfect matching in its expanded dual. But will that matching always be complementary to a cut?

Theorem 9 Every perfect matching M of the expanded dual of a plane graph G(V, E, Π) is complementary to a cut C of G, i.e., E\\M = C.

Proof sketch In each clique of the expanded dual, an even number of nodes is matched by edges interior to the clique.
The complement E\\M of the matching in G thus contains an even number of edges around the perimeter of each face of the embedding. By induction over faces, this holds for every contractible (on the embedding surface) cycle of G. Because a plane is simply connected, all cycles in a plane graph are contractible; thus E\\M is a cut.

This is where planarity matters: surfaces of non-zero genus are not simply connected, and thus non-plane graphs may contain non-contractible cycles; our construction does not guarantee that the complement E\\M of a perfect matching of the expanded dual contains an even number of edges along such cycles. For planar graphs, however, the above theorems allow us to leverage known polynomial-time algorithms for perfect matchings into inference methods for Ising models.

3.3 The Lowest-Energy (MAP or Ground) State

The blossom-shrinking algorithm [9, 10] is a sophisticated method to efficiently compute the maximum-weight perfect matching of a graph. It can be implemented to run in as little as O(|E||V| log |V|) time. Although the Blossom IV code we are using [11] is asymptotically less efficient, at O(|E||V|²), we have found it to be very fast in practice (Figure 4, left).

We can now efficiently compute the lowest-energy state of a planar Ising model as follows: find a planar embedding of the model graph (Section 2.1), construct its expanded dual (Section 3.1), and run the blossom-shrinking algorithm on that to compute its maximum-weight perfect matching. Its complement in the original model is the minimum-weight graph cut (Section 3.2). We can identify the state which induces this cut via a depth-first graph traversal that labels nodes as it encounters them, starting by labeling the bias node y_0 := 0; this is shown below as Algorithm 1.

Algorithm 1 Find State from Corresponding Graph Cut

procedure dfs_state(i ∈ {0, 1, 2, . . .
n}, s ∈ {0, 1}):
    if y_i = unknown then
        y_i := s;
        ∀(i, j) ∈ E_i : if (i, j) ∈ C then dfs_state(j, ¬s) else dfs_state(j, s);
    else assert y_i = s;

Input: Ising model graph G(V, E); graph cut C(y) ⊆ E
Main: 1. ∀i ∈ {0, 1, 2, . . . , n} : y_i := unknown;  2. dfs_state(0, 0);
Output: state vector y

3.4 The Worst Margin Violator

Maximum-margin parameter estimation in graphical models involves determining the worst margin violator, i.e., the state that minimizes, relative to a given target state y*, the margin energy

M(y|y*) := E(y) − d(y|y*),   (8)

where d(·|·) is a measure of divergence in state space. If d(·|·) is the weighted Hamming distance

d(y|y*) := Σ_{(i,j)∈E} [[y_i ≠ y_j] ≠ [y*_i ≠ y*_j]] v_{ij},   (9)

Figure 4: Cost of inference on a ring graph, plotted against ring size. Left & center: CPU time on Apple MacBook with 2.2 GHz Intel Core2 Duo processor; right: storage size. Left: MAP state via Blossom IV [11] on original, triangulated, and octangulated ring; center & right: marginal probabilities, full matrix K (double precision, no prefactoring) vs.
prefactored half-Kasteleyn bitmatrix H.

where the v_{ij} ≥ 0 are constant weighting factors (in the simplest case: all ones) on the edges of our Ising model, then it is easily verified that the margin energy (8) is implemented (up to a shift that depends only on y*) by an isomorphic Ising model with disagreement costs

E_{ij} + (2 [y*_i ≠ y*_j] − 1) v_{ij}.   (10)

We can thus use our algorithm of Section 3.3 to efficiently find the worst margin violator, argmin_y M(y|y*), for maximum-margin parameter estimation.

4 Computing the Partition Function and Marginal Probabilities¹

A Markov random field (MRF) over our Ising model (2) models the distribution

P(y) = (1/Z) e^{−E(y)}, where Z := Σ_y e^{−E(y)}   (11)

is the MRF's partition function. As it involves a summation over exponentially many states y, calculating the partition function is generally intractable. For planar graphs, however, the generating function for perfect matchings can be calculated in polynomial time via the determinant of a skew-symmetric matrix [6, 7]. Due to the close relationship with graph cuts (Section 3.2) we can calculate Z in (11) likewise. Elaborating on work of Globerson and Jaakkola [8], we first convert the Ising model graph into a Boolean "half-Kasteleyn" matrix H:

1. plane triangulate the embedded graph so as to make the relationship between cuts and complementary perfect matchings a bijection (cf. Section 3.2);
2. orient the edges of the graph such that the in-degree of every node is odd;
3. construct the Boolean half-Kasteleyn matrix H from the oriented graph;
4.
prefactor the triangulation edges (added in Step 1) out of H.

Our Step 2 simplifies equivalent operations in previous constructions [6–8], Step 3 differs in that it only sets unit (i.e., +1) entries in a Boolean matrix, and Step 4 can dramatically reduce the size of H for compact storage (as a bit matrix) and faster subsequent computations (Figure 4, center & right).

For a given set of disagreement edge costs E_k, k ∈ {1, 2, . . . , |E|} on that graph, we then build from H and the E_k the skew-symmetric, real-valued Kasteleyn matrix K:

1. K := H;
2. ∀k ∈ {1, 2, . . . , |E|} : K_{2k−1,2k} := K_{2k−1,2k} + e^{E_k};
3. K := K − K^T.

The partition function for perfect matchings is √|K| [6–8], so we factor K and use (7) to compute the log partition function for (11) as ln Z = ½ ln|K| − Σ_{k∈E} E_k. Its derivative yields the marginal probability of disagreement on the kth edge, and is computed via the inverse of K:

P(k ∈ C) := −∂ ln Z/∂E_k = 1 − (1/(2|K|)) ∂|K|/∂E_k = 1 − ½ tr(K⁻¹ ∂K/∂E_k) = 1 + K⁻¹_{2k−1,2k} K_{2k−1,2k}.   (12)

¹ We only have space for a high-level overview here; see [12] for full details.

Figure 5: Boundary detection by maximum-margin training of planar Ising grids; from left to right: Ising model (100 × 100 grid), original image, noisy mask, and MAP segmentation of the Ising grid.

5 Maximum Likelihood vs. Maximum Margin CRF Parameter Estimation

Our algorithms can be applied to regularized parameter estimation in conditional random fields (CRFs).
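On graphs small enough to enumerate, the identities of Section 4 can be sanity-checked by brute force. The sketch below (toy graph and costs, illustrative; it enumerates states rather than performing the Kasteleyn computation) verifies numerically that the edge marginal equals −∂ ln Z/∂E_k, the identity behind (12):

```python
import itertools
import math

# Toy Ising model (2): illustrative disagreement costs on a triangle.
edges = {(0, 1): 0.3, (1, 2): -0.7, (0, 2): 1.1}
n = 3

def energy(y, costs):
    """Ising energy (2): sum of costs on disagreement edges."""
    return sum(c for (i, j), c in costs.items() if y[i] != y[j])

def log_Z(costs):
    """ln Z from (11), by explicit enumeration of all 2^n states."""
    return math.log(sum(math.exp(-energy(y, costs))
                        for y in itertools.product((0, 1), repeat=n)))

def marginal(k, costs):
    """P(k in C): probability that edge k is cut under the MRF (11)."""
    lZ = log_Z(costs)
    return sum(math.exp(-energy(y, costs) - lZ)
               for y in itertools.product((0, 1), repeat=n)
               if y[k[0]] != y[k[1]])

# Check P(k in C) = -d ln Z / d E_k by finite differences.
eps = 1e-6
for k in edges:
    bumped = dict(edges)
    bumped[k] += eps
    fd = -(log_Z(bumped) - log_Z(edges)) / eps
    assert abs(fd - marginal(k, edges)) < 1e-4
```

The Kasteleyn machinery above computes exactly these quantities, but via a determinant and a matrix inverse instead of the exponential enumeration used here.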
In a linear planar Ising CRF, the disagreement costs E_{ij} in (2) are computed as inner products between features (sufficient statistics) x of the modeled data and corresponding parameters θ of the model, and (11) is used to model the conditional distribution P(y|x, θ). Maximum likelihood (ML) parameter estimation then seeks to minimize with respect to θ the L2-regularized negative log likelihood

L_ML(θ) := ½ λ‖θ‖² − ln P(y*|x, θ) = ½ λ‖θ‖² + E(y*|x, θ) + ln Z(θ|x)   (13)

of a given target labeling y*,² with regularization parameter λ. This is a smooth, convex objective that can be optimized via batch or online implementations of gradient methods such as LBFGS [13]; the gradient of the log partition function in (13) is obtained by computing the marginals (12). For maximum margin (MM) parameter estimation [14] we instead minimize

L_MM(θ) := ½ λ‖θ‖² + E(y*|x, θ) − min_y M(y|y*, x, θ)
         = ½ λ‖θ‖² + E(y*|x, θ) − E(ŷ|x, θ) + d(ŷ|y*),   (14)

where ŷ := argmin_y M(y|y*, x, θ) is the worst margin violator, i.e., the state that minimizes the margin energy (8). L_MM(θ) is convex but non-smooth; we can minimize it via bundle methods such as the BT bundle trust algorithm [15], making use of the convenient lower bound ∀θ : L_MM(θ) ≥ 0.

To demonstrate the scalability of planar Ising models, we designed a simple boundary detection task based on images from the GrabCut ground truth image segmentation database [16].
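For intuition, the objective (14) can be evaluated by brute force on a tiny model, treating the disagreement costs themselves as the parameters (a simplification; the paper uses feature-weight parameters and finds ŷ via the matching algorithm of Section 3.3). The toy graph, target labeling, and costs below are illustrative:

```python
import itertools

# Brute-force evaluation of the max-margin objective (14) on a toy model.
n = 3
edges = [(0, 1), (1, 2), (0, 2)]
y_star = (0, 1, 1)                        # target labeling y*
v = {e: 1.0 for e in edges}               # Hamming weights (9), all ones

def E(y, costs):                          # Ising energy (2)
    return sum(costs[e] for e in edges if y[e[0]] != y[e[1]])

def d(y):                                 # weighted Hamming distance (9)
    return sum(v[i, j] for (i, j) in edges
               if (y[i] != y[j]) != (y_star[i] != y_star[j]))

def L_MM(costs, lam=1.0):
    # Worst margin violator: argmin_y M(y|y*) with M = E(y) - d(y|y*).
    y_hat = min(itertools.product((0, 1), repeat=n),
                key=lambda y: E(y, costs) - d(y))
    reg = 0.5 * lam * sum(c * c for c in costs.values())
    return reg + E(y_star, costs) - E(y_hat, costs) + d(y_hat)

costs = {(0, 1): 0.4, (1, 2): -0.2, (0, 2): 0.9}
assert L_MM(costs) >= 0.0    # lower bound: y = y* makes the hinge term zero
```

The lower bound holds because choosing y = y* inside the min gives M(y*|y*) = E(y*), so the non-regularization part of (14) is always non-negative.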
We took 100 × 100 pixel subregions of images that depicted a segmentation boundary, and corrupted the segmentation mask with pink noise, produced by convolving a white noise image (all pixels i.i.d. uniformly random) with a Gaussian density with one pixel standard deviation. We then employed a planar Ising model to recover the original boundary, namely a 100 × 100 square grid with one additional edge pegged to a high energy, encoding prior knowledge that two opposing corners of the grid depict different regions (Figure 5, left). The energy of the other edges was E_{ij} := ⟨[1, |x_i − x_j|], θ⟩, where x_i is the pixel intensity at node i. We did not employ a bias node for this task, and simply set λ = 1.

Note that this is a huge model: 10 000 nodes and 19 801 edges. Computing the partition function or marginals would require inverting a Kasteleyn matrix with over 1.5 · 10⁹ entries; minimizing (13) is therefore computationally infeasible for us. Computing a ground state via the algorithm described in Section 3, by contrast, takes only 0.3 seconds on an Apple MacBook with a 2.2 GHz Intel Core2 Duo processor. We can therefore efficiently minimize (14) to obtain the MM parameter vector θ*, then compute the CRF's MAP (i.e., ground) state for rapid prediction.

Figure 5 (right) shows how even for a signal-to-noise (S/N) ratio of 1:8, our approach is capable of recovering the original segmentation boundary quite well, with only 0.67% of nodes mislabeled here. For S/N ratios of 1:9 and lower the system was unable to locate the boundary; for S/N ratios of 1:7 and higher we obtained perfect reconstruction. Further experiments are reported in [12].

On smaller grids, ML parameter estimation and marginals for prediction become computationally feasible, if slower than the MM/MAP approach. This will allow direct comparison of ML vs. MM for parameter estimation, and MAP vs.
marginals for prediction, to our knowledge for the first time on graphs intractable for the junction tree approach, such as the grids often used in image processing.

² For notational clarity we suppress here the fact that we are usually modeling a collection of data items.

6 Discussion

We have proposed an alternative algorithmic framework for efficient exact inference in binary graphical models, which replaces the submodularity constraint of graph cut methods with a planarity constraint. Besides proving efficient and effective in first experiments, our approach opens up a number of interesting research directions to be explored:

Our algorithms can all be extended to nonplanar graphs, at a cost exponential in the genus of the embedding. We are currently developing these extensions, which may prove of great practical value for graphs that are "almost" planar; examples include road networks (where edge crossings arise from overpasses without on-ramps) and graphs describing the tertiary structure of proteins [17]. These algorithms also provide a foundation for the future development of efficient approximate inference methods for nonplanar Ising models.

Our method for calculating the ground state (Section 3) actually works for nonplanar graphs whose ground state does not contain frustrated non-contractible cycles. The QPBO graph cut method [18] finds ground states that do not contain any frustrated cycles, and otherwise yields a partial labeling. Can we likewise obtain a partial labeling of ground states with frustrated non-contractible cycles?

The existence of two distinct tractable frameworks for inference in binary graphical models implies a yet more powerful hybrid: consider a graph each of whose biconnected components is either planar or submodular.
As a whole, this graph may be neither planar nor submodular, yet efficient exact inference in it is clearly possible by applying the appropriate framework to each component. Can this hybrid approach be extended to cover less obvious situations?

References

[1] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[2] A. T. White and L. W. Beineke. Topological graph theory. In L. W. Beineke and R. J. Wilson, editors, Selected Topics in Graph Theory, chapter 2, pages 15–49. Academic Press, 1978.
[3] J. M. Boyer and W. J. Myrvold. On the cutting edge: Simplified O(n) planarity by edge addition. Journal of Graph Algorithms and Applications, 8(3):241–273, 2004. Reference implementation (C source code): http://jgaa.info/accepted/2004/BoyerMyrvold2004.8.3/planarity.zip
[4] A. Windsor. Planar graph functions for the Boost graph library. C++ source code, Boost file vault: http://boost-consulting.com/vault/index.php?directory=Algorithms/graph, 2007.
[5] C. Gutwenger and P. Mutzel. Graph embedding with minimum depth and maximum external face. In G. Liotta, editor, Graph Drawing 2003, volume 2912 of LNCS, pages 259–272. Springer Verlag, 2004.
[6] P. W. Kasteleyn. The statistics of dimers on a lattice: I. The number of dimer arrangements on a quadratic lattice. Physica, 27(12):1209–1225, 1961.
[7] M. E. Fisher. Statistical mechanics of dimers on a plane lattice. Physical Review, 124(6):1664–1672, 1961.
[8] A. Globerson and T. Jaakkola. Approximate inference using planar graph decomposition. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007. MIT Press.
[9] J. Edmonds. Maximum matching and a polyhedron with 0,1-vertices. Journal of Research of the National Bureau of Standards, 69B:125–130, 1965.
[10] J. Edmonds.
Paths, trees, and flowers. Canadian Journal of Mathematics, 17:449–467, 1965.
[11] W. Cook and A. Rohe. Computing minimum-weight perfect matchings. INFORMS Journal on Computing, 11(2):138–148, 1999. C source code: http://www.isye.gatech.edu/~wcook/blossom4
[12] N. N. Schraudolph and D. Kamenetsky. Efficient exact inference in planar Ising models. Technical Report 0810.4401, arXiv, 2008. http://aps.arxiv.org/abs/0810.4401
[13] S. V. N. Vishwanathan, N. N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proc. Intl. Conf. Machine Learning, pages 969–976, New York, NY, USA, 2006. ACM Press.
[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, 2004. MIT Press.
[15] H. Schramm and J. Zowe. A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM J. Optimization, 2:121–152, 1992.
[16] C. Rother, V. Kolmogorov, A. Blake, and M. Brown. GrabCut ground truth database, 2007. http://research.microsoft.com/vision/cambridge/i3l/segmentation/GrabCut.htm
[17] S. V. N. Vishwanathan, K. Borgwardt, and N. N. Schraudolph. Fast computation of graph kernels. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007. MIT Press.
[18] V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts: a review. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(7):1274–1279, 2007.
", "award": [], "sourceid": 401, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Dmitry", "family_name": "Kamenetsky", "institution": null}]}