{"title": "Marginals-to-Models Reducibility", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1051, "abstract": "We consider a number of classical and new computational problems regarding marginal distributions, and inference in models specifying a full joint distribution. We prove general and efficient reductions between a number of these problems, which demonstrate that algorithmic progress in inference automatically yields progress for \u201cpure data\u201d problems. Our main technique involves formulating the problems as linear programs, and proving that the dual separation oracle for the Ellipsoid Method is provided by the target problem. This technique may be of independent interest in probabilistic inference.", "full_text": "Marginals-to-Models Reducibility\n\nTim Roughgarden\nStanford University\n\ntim@cs.stanford.edu\n\nMichael Kearns\n\nUniversity of Pennsylvania\n\nmkearns@cis.upenn.edu\n\nAbstract\n\nWe consider a number of classical and new computational problems regarding\nmarginal distributions, and inference in models specifying a full joint distribution.\nWe prove general and ef\ufb01cient reductions between a number of these problems,\nwhich demonstrate that algorithmic progress in inference automatically yields\nprogress for \u201cpure data\u201d problems. Our main technique involves formulating the\nproblems as linear programs, and proving that the dual separation oracle required\nby the ellipsoid method is provided by the target problem. This technique may be\nof independent interest in probabilistic inference.\n\n1\n\nIntroduction\n\nThe movement between the speci\ufb01cation of \u201clocal\u201d marginals and models for complete joint distri-\nbutions is ingrained in the language and methods of modern probabilistic inference. 
For instance,\nin Bayesian networks, we begin with a (perhaps partial) speci\ufb01cation of local marginals or CPTs,\nwhich then allows us to construct a graphical model for the full joint distribution. In turn, this allows\nus to make inferences (perhaps conditioned on observed evidence) regarding marginals that were not\npart of the original speci\ufb01cation.\nIn many applications, the speci\ufb01cation of marginals is derived from some combination of (noisy)\nobserved data and (imperfect) domain expertise. As such, even before the passage to models for the\nfull joint distribution, there are a number of basic computational questions we might wish to ask of\ngiven marginals, such as whether they are consistent with any joint distribution, and if not, what the\nnearest consistent marginals are. These can be viewed as questions about the \u201cdata\u201d, as opposed to\ninferences made in models derived from the data.\nIn this paper, we prove a number of general, polynomial time reductions between such problems\nregarding data or marginals, and problems of inference in graphical models. By \u201cgeneral\u201d we mean\nthe reductions are not restricted to particular classes of graphs or algorithmic approaches, but show\nthat any computational progress on the target problem immediately transfers to progress on the\nsource problem. For example, one of our main results establishes that the problem of determining\nwhether given marginals, whose induced graph (the \u201cdata graph\u201d) falls within some class G, are\nconsistent with any joint distribution reduces to the problem of MAP inference in Markov networks\nfalling in the same class G. 
Thus, for instance, we immediately obtain that the tractability of MAP\ninference in trees or tree-like graphs yields an efficient algorithm for marginal consistency in tree\ndata graphs; and any future progress in MAP inference for other classes G will similarly transfer.\nConversely, our reductions can also be used to establish negative results. For instance, for any class\nG for which we can prove the intractability of marginal consistency, we can immediately infer the\nintractability of MAP inference as well.\n\nFigure 1: Summary of main results. Arrows indicate that the source problem can be reduced to the target\nproblem for any class of graphs G, and in polynomial time. Our main results are the left-to-right arrows from\nmarginals-based problems to Markov net inference problems.\n\nThere are a number of reasons to be interested in such problems regarding marginals. One, as\nwe have already suggested, is the fact that given marginals may not be consistent with any joint\ndistribution, due to noisy observations or faulty domain intuitions,1 and we may wish to know this\nbefore simply passing to a joint model that forces or assumes consistency. 
At the other extreme,\ngiven marginals may be consistent with many joint distributions, with potentially very different\nproperties.2 Rather than simply selecting one of these consistent distributions in which to perform\ninference (as would typically happen in the construction of a Markov or Bayes net), we may wish to\nreason over the entire class of consistent distributions, or optimize over it (for instance, choosing to\nmaximize or minimize independence).\nWe thus consider four natural algorithmic problems involving (partially) specified marginals:\n\n\u2022 CONSISTENCY: Is there any joint distribution consistent with given marginals?\n\u2022 CLOSEST CONSISTENCY: What are the consistent marginals closest to given inconsistent marginals?\n\u2022 SMALL SUPPORT: Of the consistent distributions with the closest marginals, can we compute one with support size polynomial in the data (i.e., number of given marginal values)?\n\u2022 MAX ENTROPY: What is the maximum entropy distribution closest to given marginals?\n\nThe consistency problem has been studied before as the membership problem for the marginal polytope (see Related Work); in the case of inconsistency, the closest consistency problem seeks the\nminimal perturbation to the data necessary to recover coherence.\nWhen there are many consistent distributions, which one should be singled out? While the maximum entropy distribution is a staple of probabilistic inference, it is not the only interesting answer.\nFor example, consider the three features \u201cvotes Republican\u201d, \u201csupports universal healthcare\u201d, and\n\u201csupports tougher gun control\u201d, and suppose the single marginals are 0.5, 0.5, 0.5. The maximum\nentropy distribution is uniform over the 8 possibilities. We might expect reality to hew closer to a\nsmall support distribution, perhaps even 50/50 over the two vectors 100 and 011. 
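As a quick sanity check on this example (an illustrative sketch, not part of the paper's formal development), a few lines of Python confirm that the uniform distribution and the 50/50 distribution over {100, 011} both realize the single marginals (0.5, 0.5, 0.5), while differing sharply in entropy and support size:

```python
from math import log2

# Two distributions over assignments to the three binary features
# (votes Republican, supports universal healthcare, supports tougher gun control).
uniform = {f"{a:03b}": 1 / 8 for a in range(8)}   # max-entropy candidate
small = {"100": 0.5, "011": 0.5}                  # small-support candidate

def single_marginals(dist):
    """P(X_i = 1) for each of the three features."""
    return [sum(p for a, p in dist.items() if a[i] == "1") for i in range(3)]

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Both distributions realize the given single marginals (0.5, 0.5, 0.5) ...
assert single_marginals(uniform) == [0.5, 0.5, 0.5]
assert single_marginals(small) == [0.5, 0.5, 0.5]
# ... but differ in entropy (3 bits vs. 1 bit) and support size (8 vs. 2).
print(entropy(uniform), len(uniform))  # 3.0 8
print(entropy(small), len(small))      # 1.0 2
```

Both candidates match the data exactly; they differ only in which secondary criterion (entropy versus support size) is optimized.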
The small support\nproblem can be informally viewed as attempting to minimize independence or randomization, and\nthus is a natural contrast to maximum entropy. It is also worth noting that small support distributions\narise naturally through the joint behavior of no-regret algorithms in game-theoretic settings [1].\nWe also consider two standard algorithmic inference problems on full joint distributions (models):\n\n\u2022 MAP INFERENCE: What is the MAP joint assignment in a given Markov network?\n\u2022 GENERALIZED PARTITION: What is the normalizing constant of a given Markov network, possibly after conditioning on the value of one vertex or edge?\n\nAll six of these problems are parameterized by a class of graphs G \u2014 for the four marginals problems, this is the graph induced by the given pairwise marginals, while for the models problems, it is\nthe graph of the given Markov network.\n\n1For a simple example, consider three random variables for which each pairwise marginal specifies that the\nsettings (0,1) and (1,0) each occurs with probability 1/2. The corresponding \u201cdata graph\u201d is a triangle. This\nrequires that each variable always disagrees with the other two, which is impossible.\n2For example, consider random variables X, Y, Z. Suppose the pairwise marginals for X and Y and for Y\nand Z specify that all four binary settings are equally likely. No pairwise marginals for X and Z are given, so\nthe data graph is a two-hop path. One consistent distribution flips a fair coin independently for each variable;\nbut another flips one coin for X, a second for Y, and sets Z = X. The former maximizes entropy while the\nlatter minimizes support size. 
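The impossibility argument of footnote 1 can be checked mechanically. The sketch below (illustrative code, not from the paper) enumerates all eight assignments and verifies that none has all three pairs disagreeing, so no distribution, whatever its weights, can realize the given pairwise marginals:

```python
from itertools import product

# Footnote 1: three binary variables; each pairwise marginal puts probability
# 1/2 on each of (0,1) and (1,0), i.e., P(X_i = X_j) = 0 for every pair i < j.
# A consistent joint distribution could only place mass on assignments in
# which all three pairs disagree -- and no such assignment exists.
pairs = [(0, 1), (0, 2), (1, 2)]
feasible = [a for a in product([0, 1], repeat=3)
            if all(a[i] != a[j] for i, j in pairs)]
print(feasible)  # [] -> the marginals are inconsistent
```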
All of our reductions are of the form \u201cfor every class G, if\nthere is a polynomial-time algorithm for solving inference problem B for (model) graphs in G, then\nthere is a polynomial-time algorithm for marginals problem A for (marginal) graphs in G\u201d \u2014 that\nis, A reduces to B. Our main results, which are summarized in Figure 1, can be stated informally as\nfollows:\n\n\u2022 CONSISTENCY reduces to MAP INFERENCE.\n\u2022 CLOSEST CONSISTENCY reduces to MAP INFERENCE.\n\u2022 SMALL SUPPORT reduces to MAP INFERENCE.\n\u2022 MAX ENTROPY reduces to GENERALIZED PARTITION.3\n\nWhile connections between some of these problems are known for speci\ufb01c classes of graphs \u2014\nmost notably in trees, where all of these problems are tractable and rely on common underlying al-\ngorithmic approaches such as dynamic programming \u2014 the novelty of our results is their generality,\nshowing that the above reductions hold for every class of graphs.\nAll of our reductions share a common and powerful technique:\nthe use of the ellipsoid method\nfor Linear Programming (LP), with the key step being the articulation of an appropriate separation\noracle. The \ufb01rst three problems we consider have a straightforward LP formulation which will\ntypically have a number of variables that is equal to the number of joint settings, and therefore\nexponential in the number of variables; for the MAX ENTROPY problem, there is an analogous\nconvex program formulation. Since our goal is to run in time polynomial in the input length (the\nnumber and size of given marginals), the straightforward LP formulation will not suf\ufb01ce. However,\nby passing to the dual LP, we instead obtain an LP with only a polynomial number of variables, but\nan exponential number of constraints that can be represented implicitly. 
For each of the reductions\nabove, we show that the required separation oracle for these implicit constraints is provided exactly\nby the corresponding inference problem (MAP INFERENCE or GENERALIZED PARTITION). We\nbelieve this technique may be of independent interest and have other applications in probabilistic\ninference.\nIt is perhaps surprising that in the study of problems strictly addressing properties of given marginals\n(which have received relatively little attention in the graphical models literature historically), prob-\nlems of inference in full joint models (which have received great attention) should arise so naturally\nand generally. For the marginal problems, our reductions (via the ellipsoid method) effectively\ncreate a series of \u201c\ufb01ctitious\u201d Markov networks such that the solutions to corresponding inference\nproblems (MAP INFERENCE and GENERALIZED PARTITION) indirectly lead to a solution to the\noriginal marginal problems.\nRelated Work: The literature on graphical models and probabilistic inference is rife with connec-\ntions between some of the problems we study here for speci\ufb01c classes of graphical models (such as\ntrees or otherwise sparse structures), and under speci\ufb01c algorithmic approaches (such as dynamic\nprogramming or message-passing algorithms more generally, and various forms of variational infer-\nence); see [2, 3, 4] for good overviews. In contrast, here we develop general and ef\ufb01cient reductions\nbetween marginal and inference problems that hold regardless of the graph structure or algorithmic\napproach; we are not aware of prior efforts in this vein. Some of the problems we consider are also\neither new or have been studied very little, such as CLOSEST CONSISTENCY and SMALL SUPPORT.\nThe CONSISTENCY problem has been studied before as the membership problem for the marginal\npolytope. 
In particular, [8] shows that finding the MAP assignment for Markov random fields with\npairwise potentials can be cast as an integer linear program over the marginal polytope \u2014 that is,\nalgorithms for the CONSISTENCY problem are useful subroutines for inference. Our work is the\nfirst to show a converse, that inference algorithms are useful subroutines for decision and optimization problems for the marginal polytope. Furthermore, previous polynomial-time solutions to the\nCONSISTENCY problem generally give a compact (polynomial-size) description of the marginal\npolytope. Our approach dodges this ambitious requirement, in that it only needs a polynomial-time\nseparation oracle (which, for this problem, turns out to be MAP inference). As there are many\ncombinatorial optimization problems with no compact LP formulation that admit polynomial-time\nellipsoid-based algorithms \u2014 like non-bipartite matching, with its exponentially many odd cycle\ninequalities \u2014 our approach provides a new way of identifying computationally tractable special\ncases of problems concerning marginals.\nThe previous works that are perhaps most closely related in spirit to our interests are [5] and [6, 7].\nThese works provide reductions of some form, but not ones that are both general (independent of\ngraph structure) and polynomial time. However, they do suggest both the possibility and interest in\nsuch stronger reductions.\n\n3The conceptual ideas in this reduction are well known. We include a formal treatment in the Appendix for\ncompleteness and to provide an analogy with our other reductions, which are our more novel contributions. 
The paper [5] discusses and provides heuristic reductions between MAP\nINFERENCE and GENERALIZED PARTITION.\nThe work in [6, 7] makes the point that maximizing entropy subject to an (approximate) consistency\ncondition yields a distribution that can be represented as a Markov network over the graph induced\nby the original data or marginals. As far as we are aware, however, there has been essentially no for-\nmal complexity analysis (i.e., worst-case polynomial-time guarantees) for algorithms that compute\nmax-entropy distributions.4\n\n2 Preliminaries\n\n2.1 Problem De\ufb01nitions\n\nFor clarity of exposition, we focus on the pairwise case in which every marginal involves at most\ntwo variables.5 Denote the underlying random variables by X1, . . . , Xn, which we assume have\nrange [k] = {0, 1, 2, . . . , k}. The input is at most one real-valued single marginal value \u00b5is for\nevery variable i \u2208 [n] and value s \u2208 [k], and at most one real-valued pairwise marginal value \u00b5ijst\nfor every ordered variable pair i, j \u2208 [n]\u00d7[n] with i < j and every pair s, t \u2208 [k]. Note that we allow\na marginal to be only partially speci\ufb01ed. The data graph induced by a set of marginals has one vertex\nper random variable Xi, and an undirected edge (i, j) if and only if at least one of the given pairwise\nmarginal values involves the variables Xi and Xj. Let M1 and M2 denote the sets of indices (i, s)\nand (i, j, s, t) of the given single and pairwise marginal values, and m = |M1| + |M2| the total\nnumber of marginal values. Let A = [k]n denote the space of all possible variable assignments.\nWe say that the given marginals \u00b5 are consistent if there exists a (joint) probability distribution\nconsistent with all of them (i.e., that induces the marginals \u00b5).\nWith these basic de\ufb01nitions, we can now give formal de\ufb01nitions for the marginals problems we\nconsider. 
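To make the notation concrete, here is a small sketch (with made-up index sets M1 and M2; the example values are hypothetical, only the construction follows the definitions above) of how the data graph and the count m are derived from the given marginal indices:

```python
# Build the data graph induced by a set of marginal indices.
# M1 holds (i, s) indices of single marginals; M2 holds (i, j, s, t) indices
# of pairwise marginals with i < j. (Example index sets are made up.)
M1 = {(0, 1), (1, 0), (2, 1)}
M2 = {(0, 1, 0, 1), (0, 1, 1, 0), (1, 2, 0, 0)}

vertices = ({i for (i, _) in M1}
            | {i for (i, _j, _, _) in M2}
            | {j for (_i, j, _, _) in M2})
# An undirected edge (i, j) exists iff some given pairwise marginal value
# involves the variables X_i and X_j.
edges = {(i, j) for (i, j, _, _) in M2}

m = len(M1) + len(M2)  # total number of given marginal values
print(sorted(vertices), sorted(edges), m)  # [0, 1, 2] [(0, 1), (1, 2)] 6
```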
Let G denote an arbitrary class of undirected graphs.\n\n\u2022 CONSISTENCY (G): Given marginals \u00b5 such that the induced data graph falls in G, are they consistent?\n\u2022 CLOSEST CONSISTENCY (G): Given (possibly inconsistent) marginals \u00b5 such that the\ninduced data graph falls in G, compute the consistent marginals \u03bd minimizing ||\u03bd \u2212 \u00b5||1.\n\u2022 SMALL SUPPORT (G): Given (consistent or inconsistent) marginals \u00b5 such that the induced data graph falls in G, compute a distribution that has a polynomial-size support and\nmarginals \u03bd that minimize ||\u03bd \u2212 \u00b5||1.\n\u2022 MAX ENTROPY (G): Given (consistent or inconsistent) marginals \u00b5 such that the induced\ndata graph falls in G, compute the maximum entropy distribution that has marginals \u03bd that\nminimize ||\u03bd \u2212 \u00b5||1.\n\n4There are two challenges to doing this. The first, which has been addressed in previous work, is to circumvent the exponential number of decision variables via a separation oracle. The second, which does not seem to\nhave been previously addressed, is to bound the diameter of the search space (i.e., the magnitude of the optimal\nLagrange variables). Proving this requires using special properties of the MAX ENTROPY problem, beyond\nmere convexity. We adapt recent techniques of [13] to provide the necessary argument.\n5All of our results generalize to the case of higher-order marginals in a straightforward manner.\n\nIt is important to emphasize that all of the problems above are \u201cmodel-free\u201d, in that we do not\nassume that the marginals are consistent with, or generated by, any particular model (such as a\nMarkov network). They are simply given marginals, or \u201cdata\u201d.\nFor each of these problems, our interest is in algorithms whose running time is polynomial in the\nsize of the input \u00b5. 
The prospects for this depend strongly on the class G, with tractability generally\nfollowing for \u201cnice\u201d classes such as tree or tree-like graphs, and intractability for the most general\ncases. Our contribution is in showing a strong connection between tractability for these marginals\nproblems and the following inference problems for any class G.\n\n\u2022 MAP INFERENCE (G): Given a Markov network whose graph falls in G, find the maximum a posteriori (MAP) or most probable joint assignment.6\n\u2022 GENERALIZED PARTITION: Given a Markov network whose graph falls in G, compute\nthe partition function or normalization constant for the full joint distribution, possibly after\nconditioning on the value of a single vertex or edge.7\n\n2.2 The Ellipsoid Method for Linear Programming\n\nOur algorithms for the CONSISTENCY, CLOSEST CONSISTENCY, and SMALL SUPPORT problems\nuse linear programming. There are a number of algorithms that solve explicitly described linear\nprograms in time polynomial in the description size. Our problems, however, pose an additional\nchallenge: the obvious linear programming formulation has size exponential in the parameters of\ninterest. To address this challenge, we turn to the ellipsoid method [9], which can solve in polynomial time linear programs that have an exponential number of implicitly described constraints,\nprovided there is a polynomial-time \u201cseparation oracle\u201d for these constraints. The ellipsoid method\nis discussed exhaustively in [10, 11]; we record in this section the facts necessary for our results.\n\nDefinition 2.1 (Separation Oracle) Let P = {x \u2208 Rn : aT1 x \u2264 b1, . . . , aTm x \u2264 bm} denote the\nfeasible region of m linear constraints in n dimensions. A separation oracle for P is an algorithm\nthat takes as input a vector x \u2208 Rn, and either (i) verifies that x \u2208 P; or (ii) returns a constraint i\nsuch that aTi x > bi. A polynomial-time separation oracle runs in time polynomial in n, the maximum\ndescription length of a single constraint, and the description length of the input x.\n\nOne obvious separation oracle is to simply check, given a candidate solution x, each of the m\nconstraints in turn. More interesting and relevant are constraint sets that have size exponential in the\ndimension n but admit a polynomial-time separation oracle.\n\nTheorem 2.2 (Convergence Guarantee of the Ellipsoid Method [9]) Suppose the set P = {x \u2208 Rn : aT1 x \u2264 b1, . . . , aTm x \u2264 bm} admits a polynomial-time separation oracle and cT x is a linear\nobjective function. Then, the ellipsoid method solves the optimization problem {max cT x : x \u2208 P}\nin time polynomial in n and the maximum description length of a single constraint or objective\nfunction. The method correctly detects if P = \u2205. Moreover, if P is non-empty and bounded, the\nellipsoid method returns a vertex of P.8\n\nTheorem 2.2 provides a general reduction from a problem to an intuitively easier one: if the problem\nof verifying membership in P can be solved in polynomial time, then the problem of optimizing an\narbitrary linear function over P can also be solved in polynomial time. This reduction is \u201cmany-to-one,\u201d meaning that the ellipsoid method invokes the separation oracle for P a large (but polynomial)\nnumber of times, each with a different candidate point x. See Appendix A.1 for a high-level description of the ellipsoid method and [10, 11] for a detailed treatment.\nThe ellipsoid method also applies to convex programming problems under some additional technical conditions. This is discussed in Appendix A.2 and applied to the MAX ENTROPY problem in\nAppendix A.3.\n\n6Formally, the input is a graph G = (V, E) with a log-potential log \u03c6i(s) and log \u03c6ij(s, t) for each vertex\ni \u2208 V and edge (i, j) \u2208 E, and each value s \u2208 [k] = {0, 1, 2, . . . , k} and pair s, t \u2208 [k] \u00d7 [k] of values. The\nMAP assignment maximizes P(a) := \u220fi\u2208V \u03c6i(ai) \u220f(i,j)\u2208E \u03c6ij(ai, aj) over all assignments a \u2208 [k]V.\n7Formally, given the log-potentials of a Markov network, compute \u2211a\u2208[k]n P(a); \u2211a : ai=s P(a) for a\ngiven i, s; or \u2211a : ai=s,aj=t P(a) for a given i, j, s, t.\n8A vertex is a point of P that satisfies with equality n linearly independent constraints.\n\n3 CONSISTENCY Reduces to MAP INFERENCE\n\nThe goal of this section is to reduce the CONSISTENCY problem for data graphs in the family G to\nthe MAP INFERENCE problem for networks in G.\n\nTheorem 3.1 (Main Result 1) Let G be a set of graphs. If the MAP INFERENCE (G) problem\ncan be solved in polynomial time, then the CONSISTENCY (G) problem can be solved in polynomial\ntime.\n\nWe begin with a straightforward linear programming formulation of the CONSISTENCY problem.\n\nLemma 3.2 (Linear Programming Formulation) An instance of the CONSISTENCY problem admits a consistent distribution if and only if the following linear program (P) has a solution:\n\n(P) max p 0\nsubject to:\n\u2211a\u2208A : ai=s pa = \u00b5is for all (i, s) \u2208 M1\n\u2211a\u2208A : ai=s,aj=t pa = \u00b5ijst for all (i, j, s, t) \u2208 M2\n\u2211a\u2208A pa = 1\npa \u2265 0 for all a \u2208 A.\n\nSolving (P) using the ellipsoid method (Theorem 2.2), or any other linear programming method,\nrequires time at least |A| = (k + 1)n, the number of decision variables. This is generally exponential\nin the size of the input, which is proportional to the number m of given marginal values.\nA ray of hope is provided by the fact that the number of constraints of the linear program in\nLemma 3.2 is equal to the number of marginal values. With an eye toward applying the ellipsoid\nmethod (Theorem 2.2), we consider the dual linear program. 
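For intuition about the target problem of the reduction, MAP inference on an instance small enough to enumerate can be done by brute force. The sketch below (with made-up log-potentials, not from the paper) maximizes the total log-potential of an assignment, which is exactly the quantity the separation oracle of this section needs to maximize:

```python
from itertools import product

# Brute-force MAP inference for a tiny pairwise Markov network, following
# footnote 6: maximize P(a) = prod_i phi_i(a_i) * prod_(i,j) phi_ij(a_i, a_j).
# Working with log-potentials y, this is argmax_a y(a) in the paper's notation.
# The log-potential values below are made up for illustration.
n, k = 3, 1                                               # values in [k] = {0, 1}
log_phi_vertex = {(0, 1): 0.5, (1, 0): 0.2, (2, 1): 0.1}  # y_is
log_phi_edge = {(0, 1, 1, 1): 1.0, (1, 2, 0, 1): 0.3}     # y_ijst

def y_of(a):
    """Total log-potential y(a) of assignment a (missing entries count as 0)."""
    return (sum(v for (i, s), v in log_phi_vertex.items() if a[i] == s)
            + sum(v for (i, j, s, t), v in log_phi_edge.items()
                  if a[i] == s and a[j] == t))

map_assignment = max(product(range(k + 1), repeat=n), key=y_of)
print(map_assignment, y_of(map_assignment))
```

In the reduction, the same maximization is applied to a candidate dual solution y, and the resulting assignment is reported as a violated constraint whenever y(a) exceeds the threshold \u2212z.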
We use the following notation. Given\na vector y indexed by M1 \u222a M2, we define\n\ny(a) = \u2211(i,s)\u2208M1 : ai=s yis + \u2211(i,j,s,t)\u2208M2 : ai=s,aj=t yijst (1)\n\nfor each assignment a \u2208 A, and\n\n\u00b5T y = \u2211(i,s)\u2208M1 \u00b5is yis + \u2211(i,j,s,t)\u2208M2 \u00b5ijst yijst. (2)\n\nStrong linear programming duality implies the following.\n\nLemma 3.3 (Dual Linear Programming Formulation) An instance of the CONSISTENCY problem admits a consistent distribution if and only if the optimal value of the following linear program (D) is 0:\n\n(D) max y,z \u00b5T y + z\nsubject to:\ny(a) + z \u2264 0 for all a \u2208 A\ny, z unrestricted.\n\nThe number of variables in (D) \u2014 one per constraint of the primal linear program \u2014 is polynomial\nin the size of the CONSISTENCY input.\nWhat use is the MAP INFERENCE problem for solving the CONSISTENCY problem? The next\nlemma forges the connection.\n\nLemma 3.4 (MAP Inference as a Separation Oracle) Let G be a set of graphs and suppose that\nthe MAP INFERENCE (G) problem can be solved in polynomial time. Consider an instance of the\nCONSISTENCY problem with a data graph in G, and a candidate solution y, z to the corresponding\ndual linear program (D). Then, there is a polynomial-time algorithm that checks whether or not\nthere is an assignment a \u2208 A that satisfies\n\n\u2211(i,s)\u2208M1 : ai=s yis + \u2211(i,j,s,t)\u2208M2 : ai=s,aj=t yijst > \u2212z, (3)\n\nand produces such an assignment if one exists.\n\nProof: The key idea is to interpret y as the log-potentials of a Markov network. Precisely, construct a\nMarkov network N as follows. The vertex set V and edge set E correspond to the random variables\nand edge set of the data graph of the CONSISTENCY instance. The potential function at a vertex i\nis defined as \u03c6i(s) = exp{yis} for each value s \u2208 [k]. The potential function at an edge (i, j)\nis defined as \u03c6ij(s, t) = exp{yijst} for (s, t) \u2208 [k] \u00d7 [k]. For a missing pair (i, s) \u2209 M1 or 4-tuple (i, j, s, t) \u2209 M2, we define the corresponding potential value \u03c6i(s) or \u03c6ij(s, t) to be 1. The\nunderlying graph of N is the same as the data graph of the given CONSISTENCY instance and hence\nis a member of G.\nIn the distribution induced by N, the probability of an assignment a \u2208 [k]n is, by definition, proportional to\n\n(\u220fi\u2208V : (i,ai)\u2208M1 exp{yiai}) (\u220f(i,j)\u2208E : (i,j,ai,aj)\u2208M2 exp{yijaiaj}) = exp{y(a)}.\n\nThat is, the MAP assignment for the Markov network N is the assignment that maximizes the left-hand side of (3).\nChecking if some assignment a \u2208 A satisfies (3) can thus be implemented as follows: compute the\nMAP assignment a* for N \u2014 by assumption, and since the graph of N lies in G, this can be done\nin polynomial time; return a* if it satisfies (3), and otherwise conclude that no assignment a \u2208 A\nsatisfies (3). \u25a1\n\nAll of the ingredients for the proof of Theorem 3.1 are now in place.\nProof of Theorem 3.1: Assume that there is a polynomial-time algorithm for the MAP INFERENCE\n(G) problem with the family G of graphs, and consider an instance of the CONSISTENCY problem\nwith data graph G \u2208 G. Deciding whether or not this instance has a consistent distribution is equivalent to solving the program (D) in Lemma 3.3. By Theorem 2.2, the ellipsoid method can be used\nto solve (D) in polynomial time, provided the constraint set admits a polynomial-time separation\noracle. Lemma 3.4 shows that the relevant separation oracle is equivalent to computing the MAP\nassignment of a Markov network with graph G \u2208 G. By assumption, the latter problem can be\nsolved in polynomial time. 
\u25a1\n\nWe defined the CONSISTENCY problem as a decision problem, where the answer is \u201cyes\u201d or \u201cno.\u201d\nFor instances that admit a consistent distribution, we can also ask for a succinct representation of a\ndistribution that witnesses the marginals\u2019 consistency. We next strengthen Theorem 3.1 by showing\nthat for consistent instances, under the same hypothesis, we can compute a small-support consistent\ndistribution in polynomial time. See Figure 2 for the high-level description of the algorithm.\n\n1. Solve the dual linear program (D) (Lemma 3.3) using the ellipsoid method (Theorem 2.2),\nusing the given polynomial-time algorithm for MAP INFERENCE (G) to implement the\nellipsoid separation oracle (see Lemma 3.4).\n2. If the dual (D) has a nonzero (and hence, unbounded) optimal objective function value,\nthen report \u201cno consistent distributions\u201d and halt.\n3. Explicitly form the reduced primal linear program (P-red), obtained from (P) by retaining\nonly the variables that correspond to the dual inequalities generated by the separation oracle\nin Step 1.\n4. Solve (P-red) using a polynomial-time linear programming algorithm that returns a vertex\nsolution, and return the result.\n\nFigure 2: High-level description of the polynomial-time reduction from CONSISTENCY (G) to MAP INFERENCE (G) (Steps 1 and 2) and postprocessing to extract a small-support distribution that witnesses consistent\nmarginals (Steps 3 and 4).\n\nTheorem 3.5 (Small-Support Witnesses) Let G be a set of graphs. If the MAP INFERENCE (G)\nproblem can be solved in polynomial time, then for every consistent instance of the CONSISTENCY\n(G) problem with m = |M1| + |M2| marginal values, a consistent distribution with support size at\nmost m + 1 can be computed in polynomial time.\n\nProof: Consider a consistent instance of CONSISTENCY with data graph G \u2208 G. The algorithm\nof Theorem 3.1 concludes by solving the dual linear program of Lemma 3.3 using the ellipsoid\nmethod. This method runs for a polynomial number K of iterations, and each iteration generates\none new inequality. At termination, the algorithm has identified a \u201creduced dual linear program\u201d, in\nwhich a set of only K out of the original (k + 1)n constraints is sufficient to prove the optimality of\nits solution. By strong duality, the corresponding \u201creduced primal linear program,\u201d obtained from\nthe linear program in Lemma 3.2 by retaining only the decision variables corresponding to the K\nreduced dual constraints, has optimal objective function value 0. In particular, this reduced primal\nlinear program is feasible.\nThe reduced primal linear program has a polynomial number of variables and constraints, so it can be\nsolved by the ellipsoid method (or any other polynomial-time method) to obtain a feasible point p.\nThe point p is an explicit description of a consistent distribution with support size at most K. To\nimprove the support size upper bound from K to m + 1, recall from Theorem 2.2 that p is a vertex\nof the feasible region, meaning it satisfies K linearly independent constraints of the reduced primal\nlinear program with equality. This linear program has at most one constraint for each of the m given\nmarginal values, at most one normalization constraint \u2211a\u2208A pa = 1, and non-negativity constraints.\nThus, at least K \u2212 m \u2212 1 of the constraints that p satisfies with equality are non-negativity constraints.\nEquivalently, it has at most m + 1 strictly positive entries. \u25a1\n\n4 CLOSEST CONSISTENCY, SMALL SUPPORT Reduce to MAP INFERENCE\n\nThis section considers the CLOSEST CONSISTENCY and SMALL SUPPORT problems. The input\nto these problems is the same as in the CONSISTENCY problem \u2014 single marginal values \u00b5is for\n(i, s) \u2208 M1 and pairwise marginal values \u00b5ijst for (i, j, s, t) \u2208 M2. The goal is to compute sets 
The goal is to compute sets\nof marginals {\u03bdis}M1 and {\u03bdijst}M2 that are consistent and, subject to this constraint, minimize the\n(cid:96)1 norm ||\u00b5 \u2212 \u03bd||1 with respect to the given marginals. An algorithm for the CLOSEST CONSIS-\nTENCY problem solves the CONSISTENCY problem as a special case, since a given set of marginals\nis consistent if and only if the corresponding CLOSEST CONSISTENCY problem has optimal objec-\ntive function value 0. Despite this greater generality, the CLOSEST CONSISTENCY problem also\nreduces in polynomial time to the MAP INFERENCE problem, as does the still more general SMALL\nSUPPORT problem.\nIf the MAP INFERENCE (G) problem\nTheorem 4.1 (Main Result 2) Let G be a set of graphs.\ncan be solved in polynomial time, then the CLOSEST CONSISTENCY (G) problem can be solved in\npolynomial time. Moreover, a distribution consistent with the optimal marginals with support size\nat most 3m + 1 can be computed in polynomial time, where m = |M1| + |M2| denotes the number\nof marginal values.\nThe formulation of the CLOSEST CONSISTENCY (G) problem has linear constraints \u2014 the same\nas those in Lemma 3.2, except with the given marginals \u00b5 replaced by the computed consistent\nmarginals \u03bd \u2014 but a nonlinear objective function ||\u00b5 \u2212 \u03bd||1. We can simulate the absolute value\nfunctions in the objective by adding a small number of variables and constraints. We provide details\nand the proof of Theorem 4.1 in Appendix A.4.\n\n8\n\n\fReferences\n[1] Nicolo Cesa-Bianchi and G\u00b4abor Lugosi. Prediction, Learning, and Games. Cambridge Uni-\n\nversity Press, 2006.\n\n[2] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT\n\nPress, 2009.\n\n[3] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational\n\ninference. Foundations and Trends in Machine Learning, 1(1), 2008.\n\n[4] S. Lauritzen. Graphical Models. 
Oxford University Press, 1996.\n[5] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. Proceedings of the 29th International Conference on Machine Learning, 2012.\n[6] J. K. Johnson, V. Chandrasekaran, and A. S. Willsky. Learning Markov structure by maximum\nentropy relaxation. In 11th International Conference on Artificial Intelligence and Statistics\n(AISTATS 2007), 2007.\n[7] V. Chandrasekaran, J. K. Johnson, and A. S. Willsky. Maximum entropy relaxation for graphical model selection given inconsistent statistics. In IEEE Statistical Signal Processing Workshop (SSP 2007), 2007.\n[8] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Neural Information\nProcessing Systems (NIPS), 2007.\n[9] L. G. Khachiyan. A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20(1):191\u2013194, 1979.\n[10] A. Ben-Tal and A. Nemirovski. Optimization III. Lecture notes, 2012.\n[11] M. Gr\u00f6tschel, L. Lov\u00e1sz, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer, 1988. Second Edition, 1993.\n[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge, 2004.\n[13] M. Singh and N. Vishnoi. Entropy, optimization and counting. arXiv, (1304.8108), 2013.\n", "award": [], "sourceid": 552, "authors": [{"given_name": "Tim", "family_name": "Roughgarden", "institution": "Stanford University"}, {"given_name": "Michael", "family_name": "Kearns", "institution": "University of Pennsylvania"}]}