Paper ID: 552
Title: Marginals-to-Models Reducibility
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper explores new problems in the theory of graphical models, and shows how to reduce these problems to MAP and marginal inference.

The basic problem the paper concerns itself with is consistency: given a set of marginals over some variables, is there a joint distribution consistent with these marginals? An example of a set of marginals that does not correspond to a valid joint distribution is any solution of an LP relaxation which is not tight. The paper shows how to pose the consistency problem as an LP with exponentially many variables (one per joint setting), and then how to dualize this LP and obtain a polynomial-time cutting-plane algorithm in which the oracle for finding the maximally violated constraint is equivalent to MAP inference in the graphical model induced by the marginals.
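
For concreteness, here is a minimal sketch of the LP and its Farkas-style dual as I understand them; the notation is mine, not necessarily the paper's. Write a_x for the 0/1 vector of indicator statistics of a joint assignment x (one entry per marginal coordinate) and p_x for its probability:

    (P)\quad \exists\, p \ge 0:\ \ \sum_x p_x = 1,\ \ \sum_x p_x a_x = \mu.

    (D)\quad \mu \ \text{is inconsistent} \iff \exists\,(y,z):\ \ y^\top \mu + z > 0 \ \text{and}\ y^\top a_x + z \le 0 \ \ \forall x.

Checking the exponentially many constraints of (D) amounts to computing \max_x y^\top a_x, i.e., MAP inference with potentials y, which is exactly the separation oracle the cutting-plane scheme needs.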

This idea extends easily to successively more complex settings: it can be used to find the closest (in total variation distance) consistent marginal vector to a given one, or to find a joint distribution with small support that is closest to a given marginal vector.
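
One standard way to keep the total-variation version an LP is to linearize the l1 objective; a minimal sketch, again in my notation:

    \min_{\nu,\,p,\,e}\ \mathbf{1}^\top e \quad \text{s.t.}\quad -e \le \mu - \nu \le e,\quad \nu = \sum_x p_x a_x,\quad \mathbf{1}^\top p = 1,\quad p \ge 0,

whose optimal value is the l1 (equivalently, scaled total variation) distance from \mu to the marginal polytope, and which admits the same dual/separation-oracle treatment.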

The paper also claims that MAX-ENTROPY, the problem of finding the maximum-entropy joint distribution consistent with a given set of marginals, is reducible to computing the partition function (i.e., doing marginal inference). However, this is known: finding the maximum-entropy distribution given some constraints is equivalent to maximizing the likelihood in a dual formulation of the problem, which is roughly the reduction presented in the appendix.

The paper appears to be correct: the proofs are simple and easy to follow, and so are the main arguments. However, no result in the paper is surprising, and it is not clear why these results are interesting, or how they can be useful in the future.

---

Post author response: the author response substantially clarified the motivation behind the paper, and made me lean more towards accepting it. I strongly recommend that the authors fold their motivating paragraphs into the main text, and restructure it so that it does not appear unfinished, as it does now.
Q2: Please summarize your review in 1-2 sentences
The paper explores some new problems in graphical models and shows how to reduce them to inference. However, the problems are not that interesting, nor are they obviously useful.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
*** UPDATE following author response ***

The reviewers feel the paper would be significantly strengthened by augmenting it with some of the additional motivation given in the author response.

The particular point that seems most promising to me is:

3. Our ellipsoid-based approach is novel and, we hope, will find further applications in the field. Previous polynomial-time solutions to CONSISTENCY generally give a compact (polynomial-size) description of the marginal polytope. Our approach dodges this ambitious requirement, in that it only needs a polynomial-time separation oracle (which, for this problem, turns out to be MAP inference). There are many combinatorial optimization problems with no compact LP formulation that admit polynomial-time ellipsoid-based algorithms (like nonbipartite matching, with its exponentially many odd cycle inequalities).

This makes it sound like there is a chance of finding new polynomial-time CONSISTENCY algorithms for a set of models where MAP is tractable but the marginal polytope does not have a compact representation. It would be great if the authors could spell this out.

---------------

The authors demonstrate polynomial-time reducibility from several
problems relating to consistency of pseudomarginals of a graphical
model to the problems of inference in the same graphical
model. Specifically, their main result is to show that testing whether
a given vector belongs to the marginal polytope of G is
polynomial-time reducible to the problem of MAP inference in a
graphical model over G. The reduction works by expressing the
consistency problem as an LP with exponentially many variables. The
dual LP, which has an exponential number of constraints, can be solved
in polynomial time by the ellipsoid method if the separation problem
can be solved in polynomial time. The required separation oracle is a
MAP inference problem over G, which shows the desired reduction. They
present related results that follow similar reasoning, e.g., for the
problem of finding the closest point in the marginal polytope to a
provided vector of pseudomarginals.

The paper is technically sound, but it seems incomplete---it ends
abruptly without actually presenting several of the results claimed in
the introduction. I get the sense the authors ran out of space, time,
or both. I took a look at the result in the appendix about MAX-ENTROPY
reducing to GENERALIZED-PARTITION. This looks completely standard to
me: (1) the dual of the maximum entropy problem (problem D-ME in the
paper) is that of maximizing likelihood in an exponential family, (2)
this is a convex problem, (3) it can be solved in polynomial time by
the ellipsoid algorithm if the log-likelihood function and its
gradient can be computed in polynomial time, (4) the log-likelihood
function is a linear function minus the log-partition function, so
computing it efficiently amounts to computing the log-partition
function efficiently. I don't see what the new contribution is.
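
Spelled out in formulas, the standard argument reads as follows (my
notation, with a_x the vector of indicator statistics of joint
assignment x; this presumably matches D-ME up to sign conventions):

    \max_{p \ge 0}\ -\sum_x p_x \log p_x \quad \text{s.t.}\quad \sum_x p_x a_x = \mu,\ \ \sum_x p_x = 1

    \Longleftrightarrow\quad \min_{\theta}\ \log Z(\theta) - \theta^\top \mu, \qquad Z(\theta) = \sum_x \exp(\theta^\top a_x),

with gradient \nabla_\theta [\log Z(\theta) - \theta^\top \mu] =
\mathbb{E}_\theta[a_x] - \mu, so evaluating the dual objective and its
gradient is exactly a (log-)partition computation.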

Overall, the authors could do a much better job situating their work
in relation to previous work. My initial reaction was to guess that
these results would follow from existing results about the theory of
exponential family models (cf. Wainwright and Jordan [2]), which
already describes very strong connections between marginals and
models (e.g., a bijection between marginals and models in a minimal
exponential family). I no longer think the main result (Theorem 3.1)
follows from that literature, but it would help this paper
considerably if the authors could point out the connections and
differences. The related work does a very poor job of this
currently---it simply points to three standard books/monographs about
graphical models.

There are some specific connections to prior work that should be
mentioned. In particular, the CONSISTENCY problem is exactly that of
testing membership in the marginal polytope of G. The authors should
mention this explicitly, as much is known about the marginal polytope,
and the membership problem which they study has appeared before, and
is known to be NP-complete:

David Sontag and Tommi Jaakkola, On Iteratively Constraining the
Marginal Polytope for Approximate Inference and MAP, Tech. report,
2007

The paper is reasonably clear, but imprecise in a few places,
especially surrounding the problem definitions. In particular, the
definition of SMALL-SUPPORT is very informal and probably not correct
as stated. "Polynomial-size" can only be judged asymptotically; in
this case, where you have a specific input instance in hand, you
probably need to require that the support size is bounded by a
specific function of the input size. Also, the MAX-ENTROPY problem as
stated seems to want to simultaneously maximize entropy and minimize
the L1 distance from \mu to the marginal polytope. I doubt this is
possible. Perhaps you mean to maximize entropy subject to the
constraint that \nu is a minimizer of the L1 distance?

Regarding novelty and significance, there is really one primary
observation in this paper: that MAP inference is the dual separation
oracle for the LP formulation of the CONSISTENCY problem (Theorem
3.1). As far as I know, this is new. Several other results
(CLOSEST-CONSISTENCY and SMALL-SUPPORT) are slight variants that
follow from standard LP tricks. The main result is intriguing, but it
is hard to see its significance. The reductions don't point to new
algorithms or complexity results for problems that people care about
(i.e., CONSISTENCY does not arise in practice, and it does not imply
new hardness results for MAP inference). The paper would be
strengthened considerably if the authors could point to some
application of the results.

Minor comments
* The characterization of Markov networks and Bayesian networks on
line 31 seems like it is equating potentials and CPTs with "local
marginals". I don't agree with this characterization---potentials
are not marginals. The marginal inference problem is all about
converting from the potentials to marginals (e.g. [3]). I suggest
clarifying the wording here.
* Figure 1: the text is very hard to read.
* Line 121: the number of LP variables is *equal* to the number of
joint settings (which is exponential in the number of random
variables)
* Line 162: This wording/notation suggests that the marginal is the
scalar \mu_{is}. I suggest refining the notation (e.g., (\mu_{is})_{s
\in [k]}) to be more precise, which will clarify that the marginal is
a vector. The same comment applies to \mu_{ijst}.
* Line 167: "Let M_1 and M_2 denote the indices (i,s) and
(i,j,s,t)". This is a bit vague. Can you clarify that M_1 and M_2
are *sets* of indices.
* Line 394: Do you mean CLOSEST-CONSISTENCY instead of CONSISTENCY
here?
* Line 401: mathematical programming --> mathematical program
Q2: Please summarize your review in 1-2 sentences
The paper makes an interesting observation: MAP inference is the dual
separation oracle of the LP for testing membership in the marginal
polytope. However, the significance of this result is unclear, and the
paper is incomplete: it ends before presenting several of the main
results, and does not do a very good job relating to prior work.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers the well-known-to-be-difficult problem of determining whether a given set of marginal distributions is "consistent" in the sense that there is some joint distribution that corresponds to them. (The local polytope and such enforce a set of necessary conditions only.) Conceptually, what is done is very simple and (to my eyes) clever.

The main idea is that one can set up a naive linear programming problem of optimizing an (exponentially large) joint distribution under a null objective, subject to a set of (linear) marginalization constraints. This is totally impractical, but since the set of marginalization constraints is polynomial, if one takes the dual of this LP, one obtains a problem with a polynomial number of variables but exponentially many constraints. The crucial observation that drives the paper is that the problem of finding a violated constraint in this dual LP is equivalent to a MAP inference problem. Thus, one can solve consistency by starting with a relaxed dual LP and repeatedly solving MAP problems to add constraints. (Of course, MAP is NP-hard in the worst case, but there are many algorithms that work well in practice.) The paper further argues that this can be cast as an instance of the Ellipsoid algorithm, meaning that each generated constraint reduces the state space enough that only a polynomial number of constraints will be needed, assuming the problem is consistent. (I think I understand correctly that it could take exponential time to prove inconsistency -- the paper should be slightly more explicit on this point.)
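
To make the loop concrete, here is a sketch of the cutting-plane scheme as I read it. This is illustrative only, not the paper's implementation: map_oracle is a hypothetical routine returning the indicator-statistics vector of the MAP assignment for given potentials, and the box bounds are a crude stand-in for the ellipsoid method's control of the search region.

    import numpy as np
    from scipy.optimize import linprog

    def consistent(mu, map_oracle, max_iters=1000, tol=1e-8):
        """Cutting-plane test of whether pseudomarginals mu lie in the
        marginal polytope: mu is inconsistent iff some (y, z) satisfies
        y.mu + z > 0 while y.a_x + z <= 0 for every joint assignment x."""
        d = len(mu)
        cuts = []  # indicator-statistics vectors a_x found so far
        for _ in range(max_iters):
            # Maximize y.mu + z subject to the known cuts (linprog minimizes).
            c = -np.concatenate([mu, [1.0]])
            A_ub = np.array([np.append(a, 1.0) for a in cuts]) if cuts else None
            b_ub = np.zeros(len(cuts)) if cuts else None
            bounds = [(-1.0, 1.0)] * d + [(-float(d), float(d))]
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
            y, z = res.x[:d], res.x[d]
            if -res.fun <= tol:           # no Farkas certificate exists
                return True               # mu is consistent
            a = np.asarray(map_oracle(y), dtype=float)  # most violated constraint
            if y @ a + z <= tol:          # candidate certificate beats every x
                return False              # mu is inconsistent
            cuts.append(a)                # add the MAP-generated cut, re-solve
        raise RuntimeError("iteration limit reached")

Each pass adds one MAP-generated cut; the termination guarantees are exactly what the ellipsoid analysis in the paper is meant to supply.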

Based on this observation, the Ellipsoid algorithm can solve consistency, and small modifications to the linear program can be introduced (with an l1 / total variation objective) to find the *closest* set of consistent marginals.

One question I do wonder about is whether the MAX-ENTROPY -> Partition theorem should really be considered part of the paper. I know it is in the additional material, but the main text doesn't even *state* the result.

small comment:
- missing a bar on line 396

Edit after author response:

I am in agreement with the other reviewers that the author response did a better job of spelling out the motivation than the paper itself. Thus, I'd encourage putting a similar discussion into the introduction of the paper, which may lead to the paper having an impact on a larger audience.
Q2: Please summarize your review in 1-2 sentences
Interesting theoretical paper on an understudied problem, a clear accept.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all three reviewers for their helpful and fair remarks. We apologize for the length of this rebuttal :) but we do believe it directly addresses the questions raised by the reviewers. The main concerns seem to center on the significance of our results and their relationship to what is already known. We agree that we should clarify these points, and in the event of acceptance we would do so in the specific ways outlined below.

Why the (CLOSEST) CONSISTENCY problem and our solution to it are interesting:

1. As one reviewer rightfully points out, the CONSISTENCY problem has already been well studied by the community, as the membership problem for the marginal polytope. For example, in the suggested Sontag-Jaakkola (2007) paper, the authors explain that "Finding the MAP assignment for MRFs with pairwise potentials can be cast as an integer linear program over the marginal polytope." That is, algorithms for the CONSISTENCY problem are useful subroutines for inference.

2. As far as we know, our work is the first to show a converse, that inference algorithms are useful subroutines for decision and optimization problems for the marginal polytope.

3. Our ellipsoid-based approach is novel and, we hope, will find further applications in the field. Previous polynomial-time solutions to CONSISTENCY generally give a compact (polynomial-size) description of the marginal polytope. Our approach dodges this ambitious requirement, in that it only needs a polynomial-time separation oracle (which, for this problem, turns out to be MAP inference). There are many combinatorial optimization problems with no compact LP formulation that admit polynomial-time ellipsoid-based algorithms (like nonbipartite matching, with its exponentially many odd cycle inequalities).

4. The (CLOSEST) CONSISTENCY problem is also interesting in its own right. Is given (noisy) data consistent? If not, what is the minimal perturbation to the data necessary to recover coherence? If one or more marginals is missing, what range of values is consistent with the known marginals?

Why the SMALL SUPPORT problem is interesting:

1. When there are many consistent distributions, which one should be singled out? There are, of course, well-known reasons to prefer the maximum-entropy distribution. But this is not the only approach. For example, consider the three features "voted Republican in 2012", "supports Obamacare", and "supports tougher gun control", and suppose the single marginals are .5, .5, .5. The max-entropy distribution is uniform over the 8 possibilities. We might expect reality to hew closer to a small-support distribution, perhaps even 50/50 over the two vectors 100 and 011.
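
As a quick numerical sanity check of this example (illustrative code only):

    import numpy as np

    # Two "representative individuals": 100 (voted Republican only)
    # and 011 (supports Obamacare and tougher gun control).
    support = np.array([[1, 0, 0],
                        [0, 1, 1]])
    weights = np.array([0.5, 0.5])
    print(weights @ support)  # [0.5 0.5 0.5] -- matches the stated marginals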

2. In general, it is interesting to compute a small set of "representative individuals" (100 and 011 in the example above) such that the data set is consistent with a suitable distribution over these individuals. This is precisely the SMALL SUPPORT problem.

3. The SMALL SUPPORT problem can be viewed as a tractable relaxation of the problems of computing the distribution that minimizes the entropy or support size (which, as concave minimization and l_0 minimization problems, we strongly suspect are NP-hard, even for simple graph structures).

4. An algorithm for SMALL SUPPORT can be used as a heuristic for data visualization in the following way. Suppose marginals \mu are consistent. By varying the objective function in Lemma 3.2 (i.e., the direction of optimization), multiple small support distributions consistent with the data can be generated. These can be regarded as a small number of "extreme cases", to be examined by a domain expert.

On the MAX ENTROPY problem:

1. We do not claim our solution to the MAX ENTROPY problem as a primary contribution of the paper. We agree that the main concepts applied in the solution are well known.

2. As far as we know, however, there has been essentially no formal complexity analysis (i.e., worst-case polynomial-time guarantees) for algorithms that compute max-entropy distributions. Such guarantees do NOT follow automatically from convexity, for two reasons. The first, which has been addressed in previous work by the community, is the exponential number of decision variables (circumvented via a suitable separation oracle). The second, which does not seem to have been previously addressed, is the need to bound the diameter of the search space (e.g., if the optimal Lagrange variables are too big, it will take too long to find them). Bounding the search space requires using special properties of the MAX ENTROPY problem, and the last two paragraphs of page 11 outline the necessary argument.

3. An additional motivation for including the MAX ENTROPY discussion in the paper is to provide an analogy to help interpret our main results, e.g. "marginal polytope membership reduces to (multiple invocations of) MAP inference, in a similar sense to how max-entropy computations reduce to generalized partition computations".