
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents an interesting method for dealing with certain restrictions on the space of label assignments in structured prediction settings by learning model potential functions with constraints that will enforce them.
This is in contrast with prevalent structured prediction methods that employ a search (i.e., dynamic programming) over the set of possible label assignments.
Specifically, implication and mutual exclusion constraints are considered and incorporated by constructing a quadratic expansion of the original feature function.
My questions are mainly on the expressivity of the model that results from these additional constraints.
If the computational concerns of Section 5 did not exist, should optimizing the larger set of parameters provide equivalent performance with the typical dynamic programming search structured model?
"Some preprocessing was performed on the label relations to ensure consistency with our assumptions" (line 377)
Please explain; this seems potentially worrisome depending on what modifications of the data were required to match assumptions.
It seems that implication and mutual exclusion relationships could also be leveraged to drastically reduce the time complexity of dynamic programming if appropriately integrated. For example, ordering the inference from the "leaves" of implication trees.
Would this give comparable test time performance?
Or could you describe the degree of sophistication of the inference algorithm if it is already doing so?
Additionally, if I understand the label hierarchy setting correctly, it seems that the potentials from all ancestors in the label hierarchy could be summed and then the problem reduces to multiclass classification rather than a combinatorial search.
Is there other structure that prevents this?
Q2: Please summarize your review in 1-2 sentences
This work tackles structured prediction with structured constraints on the label space by learning potentials that enforce these constraints rather than requiring a combinatorial search.
This is a novel and interesting idea, though I have some concerns about the experimental evaluations that should be addressed.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a specific approach to multilabel prediction with subsumption and mutual-exclusion constraints on categories.
First, a model is formulated as a quadratic parametrization, and then linear constraints are added which are proven to enforce the desired subsumption and mutual-exclusion constraints through the model without having to do expensive inference at test time.
Experimental results are shown on three text datasets.
Quality: The idea is a nice one, and multilabel inference with such constraints is argued to have given state of the art results in [10], which appears to be the main precursor to this submission.
The mathematical approach is not an obvious one, and seems to have nice computational properties.
The paper seems a bit rushed, with somewhat limited experimental results (one of the main numbers is missing, and apparently wasn't ready by the deadline), and no model selection performed for the regularization parameter.
This leaves a question open as to whether the additional constraints could lead to a slight difference in regularization between methods that can affect the results.
Also, would the difference in results widen (their method is faster, but less accurate than the competitor) with a different setting of the C parameter?
Clarity: The paper could be clearer in two main directions. (i) The presentation seems initially to focus on very general structured output prediction, but then the actual methods are relevant only for multilabel problems with a specific form of constraints.
(ii) The paper seems a bit rushed, and many technical proofs that might otherwise be relegated to supplementary material are included in the main paper.
Originality: The idea is a nice one, and seems to advance [10] in a nontrivial way.
Significance: The approach effectively trades model expressiveness for efficiency, but in a way that seems to perform reasonably.
This is a reasonable thing to do in the accuracy/computation tradeoff curve.
line 286: "establishes is that"
========================================
Post-rebuttal update: the rebuttal, additional experiments, and the explanation that the slowness was in part due to a poorly chosen generic solver have convinced me to increase the score.
It seems that previous shortcomings were due to the paper being a bit rushed last minute, but there has been some improvement through the rebuttal.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a method for fast (trivial) inference in a specific structured multilabel setting by enforcing well-designed model constraints during learning.
This reduces the expressiveness of the model very slightly (shown in experimental results) for the benefit of much faster test time inference.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents an approach to multilabel prediction. In contrast to previous approaches that conduct inference to incorporate constraints between labels, the paper proposes to directly embed the implication and mutual-exclusion constraints in the model. Although it seems to me the current approach is restricted to certain types of relations between labels, this direction is pretty interesting.
Quality and clarity: this paper is well-written and well-motivated. Experimental results support the claim of the paper.
Originality: There are lots of papers about multilabel classification, and I'm not familiar with many of them. However, to my knowledge, the approach introduced in the paper is new and interesting.
Significance: This paper proposes an interesting direction for dealing with multilabel classification problems, although the current presentation seems restricted to certain types of multilabel problems (see below).
Comments/questions.
- I wonder if the proposed method is general and can be applied to other constraints based on any first-order logic between labels.
- One of the main goals of the paper is to improve speed at test time. However, I wonder if the proposed algorithm is efficient at training time as well; could the authors also report the training-time performance of the proposed methods?
Q2: Please summarize your review in 1-2 sentences
This paper presents an approach that directly embeds the constraints between labels in the model. This leads to a multilabel classifier that is very efficient in the test phase, while preserving good test performance. Overall, the paper is interesting and well-written. Experiments on problems with many labels show the effectiveness of the proposed method.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper deals with a particular task of structured prediction: binary variables, a score function consisting of unary potentials only, and structure consisting of feasibility constraints (logical implication and mutual exclusion). The paper attempts to get rid of the costly inference at the test and train stages by posing special constraints on the model parameters such that a trivial inference procedure automatically satisfies the feasibility constraints. The proposed approach is compatible with multiple structured losses; e.g., conditional log-likelihood (2) and structured large margin (3) are mentioned. To the best of my knowledge the idea is original and seems interesting. The paper is relatively clear, although sometimes it is hard to follow the technical details (I have several major comments/questions about clarity/correctness, see below). The idea of embedding inference might be quite influential (lead to structured prediction methods without inference) even given that the direct applications of the paper are quite narrow.
Major:
1) Results of Sections 3.2 and 3.3 (key results of the paper) are based on Lemma 2, which has no proof. Lemma 2 consists of claims (9) and (10). I agree that (9) is analogous to Lemma 1 and therefore clear. Equation (10), however, appears for the first time and in my opinion seems nontrivial. If it does not hold (under the assumptions of Lemma 2) then Sections 3.2-3.3 are incorrect.
2) One of the contributions of the paper (mentioned in the abstract) is "equivalence to maximum margin structured output prediction". I do not see how this follows from the results of the paper. In particular, Proposition 6 claims that if the score function satisfies constraints (5) and (6) then the large-margin objective can be computed without performing inference (r.h.s. of (22)). But a general claim of equivalence should, in my opinion, sound something like this: "A minimizer of objective (3) (which can be simplified via (22) under all the assumptions) under model (8) with theta_PAQ positive semidefinite and constraints (11), (12), (16) (imposing (5) and (6)) at the same time minimizes objective (3) under model (8) with theta_PAQ positive semidefinite (constraints satisfied by inference)." I'd really like the authors to clarify what is meant by "equivalence".
3) In the experiments it is not specified how the hyperparameters of the methods were chosen. In particular, the choice of the weight of the regularizer r(theta) could significantly influence the results.
Minor:
Line 153: I do not understand why s_k(x) \geq 0 is a substantial requirement.
Lines 159 and 319: The notation "argmax_{k \in {0,1}} f(k) = 1" is a bit confusing, i.e., it is not clear what happens if f(1) = f(0). I suggest writing "1 \in argmax_{k \in {0,1}} f(k)".
Line 295: The notation "t_{ik}" (two indices) is not introduced. Being slightly more careful would help the reader.
*Comments after the response: Thanks for proving Lemma 2 and clarifying Proposition 6.
Please incorporate these comments into the paper. In particular, the phrase from the abstract "We demonstrate this idea for multilabel prediction under subsumption and mutual exclusion constraints, where equivalence to maximum margin structured output prediction is established." sounds like an overclaim.
Q2: Please summarize your review in 1-2 sentences
This paper seems to be a solid contribution but I see some significant issues with clarity/correctness. If these issues are fixed I will vote for the acceptance.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thanks for the helpful comments!
Reviewer 1:
Regarding expressiveness: Given equivalent parameterizations
(ignoring the computational concerns of Sec 5) the standard dynamic
programming formulation can be seen as a form of relaxation of the
constraint embedding approach: the standard approach does not require the
learned parameterization to strictly enforce the constraints (instead it
relies on the DP to enforce them post hoc), whereas the proposed approach
requires the learned parameterization to enforce all
constraints.
We explained the preprocessing method mentioned in
Line 377 in Lines 392-3: we used the same assumptions as [10] and added
constraints in the exact same manner as [10]. All competitors used the
same data to ensure a fair comparison.
Unfortunately, it is not
straightforward to leverage HEX constraints to reduce the time complexity
of dynamic programming. The HEX graphs are usually not tree structured,
since all siblings are mutually exclusive. In general, for exact
inference, a junction tree needs to be constructed and belief propagation
conducted over the cliques. In fact, exploiting the specific structure of
HEX constraints to achieve an efficient dynamic program was the main
result of [10]. [10] showed that the number of legal states in a HEX
graph is usually linear (as opposed to exponential in the general
case), so the inference time could be significantly reduced. We have
exploited all of the algorithmic techniques outlined in [10] to make our
inference as efficient as possible in the comparison.
Generally a
multilabel problem cannot be reduced to a multiclass problem, as
suggested, without an exponential explosion in the number of classes.
Even with the hierarchical constraint, the problem we consider is still
multilabel at the leaves, for which there are still an exponential number
of possible combinations.
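As an aside, the legal-state counting discussed above can be illustrated by brute force. The following minimal sketch uses a hypothetical three-label hierarchy (the labels and constraints are made up for illustration, not taken from the paper or [10]):

```python
from itertools import product

# Hypothetical HEX-style constraints: "dog" and "cat" each imply "animal";
# "dog" and "cat" are mutually exclusive.
labels = ["animal", "dog", "cat"]
implications = [("dog", "animal"), ("cat", "animal")]  # child -> parent
exclusions = [("dog", "cat")]

def is_legal(assignment):
    """Check a 0/1 label assignment against implication and exclusion constraints."""
    a = dict(zip(labels, assignment))
    if any(a[c] and not a[p] for c, p in implications):
        return False  # a child is on while its parent is off
    if any(a[u] and a[v] for u, v in exclusions):
        return False  # two mutually exclusive labels are both on
    return True

legal = [s for s in product([0, 1], repeat=len(labels)) if is_legal(s)]
print(len(legal), "legal states out of", 2 ** len(labels))
```

Here the constraints prune the state space from 8 assignments down to 4; with deeper hierarchies the gap between the legal-state count and the full 2^n grows rapidly, which is what makes the specialized inference in [10] feasible.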
Reviewer 2:
Unfortunately, the
generic nonsmooth solver we were using [25] required quadratic space and
was running out of memory (> 32G) for the standard linear
representation used by the inference method on the Reuters data set. We
have since replaced the solver and obtain a test error of 29.3% and test
time of 5.30s for the n/a entries in the left and right tables in Table 2
respectively.
The results are quite robust to the choice of
regularization parameter. For example, for the WIPO data set, the range
of regularization parameter choices 0.01, 0.1, 1.0, 10.0 produced no
change in the test accuracies.
Reviewer 3:
1) Indeed (10)
holds under the assumptions of Lemma 2. Proof: If $\theta_{PAQ} \succeq 0$, then we will have that $Q \succeq 0$ (any principal submatrix of a PSD matrix is PSD). So we can write $Q = V^T V$ for some $V$. Then the LHS of (10) can be equivalently written as $(\psi_k - \psi_l)^T Q (\psi_k - \psi_l) = (\psi_k - \psi_l)^T V^T V (\psi_k - \psi_l) = \|V\psi_k - V\psi_l\|^2$.
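The factorization step in this argument can be sanity-checked numerically. A minimal sketch with illustrative random values (the dimensions are arbitrary assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random PSD matrix Q = V^T V, mirroring the factorization in the proof.
V = rng.standard_normal((4, 4))
Q = V.T @ V

# Two arbitrary feature vectors standing in for psi_k and psi_l.
psi_k = rng.standard_normal(4)
psi_l = rng.standard_normal(4)

d = psi_k - psi_l
quad = d @ Q @ d                                    # (psi_k - psi_l)^T Q (psi_k - psi_l)
norm = np.linalg.norm(V @ psi_k - V @ psi_l) ** 2   # ||V psi_k - V psi_l||^2

print("quadratic form equals squared norm:", np.isclose(quad, norm))
```

The two quantities agree up to floating-point tolerance, and the squared-norm form makes the nonnegativity of the quadratic form immediate.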
2) Proposition 6 established that for
any score model satisfying (5) and (6), regardless of parameterization,
the objective value obtained with inference (LHS of (27)) is equal to the
decomposed objective value obtained without inference (RHS of (27)).
Therefore, to be precise, the notion of equivalence is a conditional one:
if the constraints (5) and (6) are satisfied, then minimizing the standard
structured output loss (LHS of (27)) is equivalent to minimizing the
decomposed loss (RHS of (27)). But, in general, a minimizer of the LHS of
(27) need not satisfy the constraints (5) and (6).
3) In the
submission, we merely set the regularization parameter to 1.0. The
results are quite robust to this choice (above).
Line 153: To
ensure, say, $y_i \Rightarrow y_k$ in (5), one needs to enforce $s_k(x)
\ge \delta, \forall x$; unfortunately enforcing $s_k(x) \ge 0$ is provably
impossible with a non-vacuous bilinear score model.
Lines 159 (259?) and 319: Thanks.
Line 295: Thanks, $t_{ik}$ denotes the $k$-th true label of the $i$-th training example.
Reviewer 4:
(5) has
no typographical error: The condition in (5) asserts that if $s_1$ is
close to (but still below) zero by a margin, then $s_2$ must be above zero
(by a margin). This condition is needed to establish the relationship to
standard large margin training given in Section 4.
Reviewer 5:
The logics we consider are generally weaker than first-order logic. Using (8) under the constraint that theta_PAQ is positive semidefinite, we can show that constraints at least as logically expressive as the diagrammatic language of $d$-dimensional Euler diagrams can be
expressed; see Section 1.1.1 of: Gem Stapleton, "Reasoning with
constraint diagrams", PhD Dissertation, University of Brighton,
2004.
Reviewer 6:
We have used the HEX constraints proposed
in [10] as a clear and concrete demonstration of the idea. More general
types of constraints can easily be expressed, such as Horn clauses, and
constraints expressible by Euler diagrams (as noted
above). 
