Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Overview
========
This paper proposes an
algorithm for learning general structured predictors (e.g., nonlinear).
This is done by replacing the structured hinge loss with its smooth dual
LP relaxation and observing that optimizing over classifiers reduces to a
logistic regression task. Therefore, the learning problem can be extended
to cases where this optimization over the class of predictors can be
solved efficiently. Specifically, the paper shows how this enables
learning predictors like decision trees and multilayer perceptrons in
addition to the common linear classifiers.
Pros
====
* The observation made by the authors about the reduction of the learning objective to a logistic regression problem seems novel and interesting.
* The paper is mostly clearly written and seems technically sound.
* The usefulness of the new formulation is demonstrated through experiments, and the empirical results seem convincing enough.
Cons
====
* The description of the experimental setting could probably be improved.
* There seems to be some tweaking of parameters involved in the experiments (e.g., the gradient descent step size). It would be good to comment on the sensitivity of the results to the choice of parameters.
Comments
========
* Eq. 2: can this form include regularization?
* Alg. 1: In the linear case the objective can be optimized stochastically (as pointed out in the discussion). It may be interesting to see if/when this can be done for the other types of classifiers.
* Experiments: epsilon=1 was chosen, but it would be interesting to see how different values affect the results. This is mentioned in the discussion, but seems very easy to just try empirically and report.
* Since logistic regression is solved for each factor independently, it is possible to combine different types of predictors in a single application, as shown in Table 1. It may also be interesting to optimize over several types of predictors and choose the best one for each factor.
* Another relevant citation: "Efficient Training for Pairwise or Higher Order CRFs via Dual Decomposition", N. Komodakis, CVPR 2011.
Typos
=====
* Line 36: phi should probably have subscript alpha.
* Line 104: the nonnegativity constraint has "\geq 1" instead of "\geq 0".
* Line 158: "coordinate ascent" should be "coordinate descent", as this is a minimization problem.
* Line 266: mu(y_alpha) should be mu_alpha(y_alpha).
* Eq. 8 and Line 269: x_beta should be y_beta.
* Line 297: f(x,y_alpha) should be f_alpha(x,y_alpha).
Q2: Please summarize your review in 1-2 sentences
This is an interesting paper which extends previous
work on learning structured predictors [9, 16] to allow several nonlinear
predictors. The basic theoretical observation seems to be novel and
valuable, and the execution meets NIPS standards.

Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
The authors address the issue of learning unary and
pairwise parameters for discrete valued random fields in a structured
output SVM setting. In order to enable learning with nonlinear function
classes, the authors remove the standard l2 regularization and replace it
with an entropy regularization that decomposes over the potentials of the
random field. This formulation results in a saddle-point problem (typical
for structured prediction) in which parameters of an LP relaxation are
optimized along with parameters of the (nonlinear) discriminant function.
The authors use a message passing formulation, and show that with their
decomposable loss, and decomposable regularizer, the problem decomposes
into problems that each resemble logistic regression. Any solver for a
nonlinear function class can now be applied. Experimental results are
given on relatively simple denoising and horse segmentation datasets
showing that potentials modeled with a multilayer perceptron performed
better than other linear and nonlinear methods.
Quality: The
derivation is principled and complete, with sufficient references to the
optimization literature. The idea is interesting, and (given a source code
release) could result in various applications of nonlinear function
classes in learning the parameters of graphical models. The quality of the
experimental validation is somewhat weak, however, given that the datasets
are quite simple, and results only compare with variants of the proposed
algorithm. Plenty of citations are given for competing methods, e.g. [18],
which is very closely related to the proposed approach in its goal of
enabling nonlinear function classes for random field models.
Clarity: The article is well written, and clear throughout.
Originality: [18] have previously proposed a random forest
framework for learning the parameters of a random field model. This work
is more general and uses a different formulation to arrive at its
objective function.
Significance: This work is quite general
in the functions that can be used, though a number of limiting assumptions
are made (decomposable loss, margin rescaling, only consider discrete
labelings). It is an interesting approach nevertheless.
Additional comments:
Sec 3: Limiting assumptions about the problem setting are first introduced here:
* Eq (3) only considers margin rescaling.
* Around line 93: \Delta decomposes over \alpha, e.g. Hamming distance.
* Around line 94: only discrete labelings are considered, which is fine but not fully general in structured prediction.
Decomposable loss and discrete labelings should probably already be mentioned in the introduction.

Line 306: typo in equation.
Q2: Please summarize your review in 1-2 sentences
An interesting approach to nonlinear functions for
random field models. Some limiting assumptions and somewhat weak
experiments.

Submitted by Assigned_Reviewer_7
Q1: Comments to author(s).
Summary:
This paper shows that parameter learning in structured prediction problems can be reduced to independent logistic regression problems for each of the factors. This requires (a) writing an upper bound on the predictor loss via the standard margin-rescaled loss-augmented inference (LAI); (b) getting an upper bound on this upper bound by relaxing the LAI integer programming problem to an LP; (c) smoothing the LP with a standard entropy term (still an upper bound); (d) converting the loss minimization from a min-max problem to a joint minimization over parameters and "messages" (Lagrangian multipliers) by writing the dual of the LAI; (e) finally, using standard results about the conjugate of entropy to formulate the coordinate step over the parameters (given fixed messages) as a logistic regression training problem.
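To make step (e) concrete, the per-factor coordinate step can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' code: the function name, the array shapes, the feature/message representation, and the temperature `eps` are all assumptions. With the messages held fixed, each factor's subproblem is a smoothed-max (log-sum-exp) loss over that factor's labelings, i.e., exactly a logistic-regression-style objective that any off-the-shelf solver for the chosen function class could minimize.

```python
import numpy as np

def factor_logreg_loss(theta, feats, messages, y_true, eps=1.0):
    """Illustrative per-factor subproblem with messages fixed.

    feats: (K, D) array, one feature row per candidate labeling y_k of the factor
    theta: (D,) parameters of the (here linear, for simplicity) discriminant
    messages: (K,) fixed dual/message contribution for each labeling
    y_true: index of the ground-truth labeling
    eps: smoothing temperature; eps -> 0 recovers the unsmoothed hinge-style max
    """
    # Score of each candidate labeling: discriminant value plus message term.
    scores = feats @ theta + messages
    # Numerically stable eps-scaled log-sum-exp (the smoothed max over labelings).
    z = scores / eps
    m = np.max(z)
    smoothed_max = eps * (m + np.log(np.sum(np.exp(z - m))))
    # Logistic-regression-style loss: smoothed max minus the true labeling's score.
    return smoothed_max - scores[y_true]
```

Since log-sum-exp upper-bounds the max, this loss is always nonnegative, and shrinking `eps` makes it approach the unsmoothed margin loss for the factor.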
The paper shows a very nice conceptual connection between
structured learning and LR. The strongest appeal (at least to me) is the
ability to train nonlinear structured predictors.
The only
(minor) criticism of the paper is that the experiments are toyish and do
not fully demonstrate the impact of the proposed work. It's nice to see
that the errors on Weizmann Horses for Boosting & MLP are lower than
linear, but it would have been great to see results on more challenging
datasets. Having said that, the paper contains novel high-quality work and
should be accepted.
The manuscript is pretty clear, although the
main novel section (Section 7) feels a little rushed and should be
expanded.
Minor comments:
* L88: this is margin-rescaled, not slack-rescaled.
* L101: M is typically used for the marginal polytope. Please use L for the local polytope.
* L215: Although intuitive, please define the (Wx)_y notation before using it.
* L301: There should not be an eps on the RHS; argmax is unaffected by scaling.
Q2: Please summarize your review in 1-2 sentences
Overall, high-quality work containing a novel
conceptual connection that will be broadly useful to people working in
structured prediction.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thanks to all for the helpful reviews, in which there
is little to disagree with. In particular, all commented that the
experiments would be improved by systematically varying epsilon, and
considering larger/more realistic datasets. Such results can certainly be
included in the final paper (albeit possibly in the appendix, for lack
of space).
