Paper ID: | 6585 |
---|---|

Title: | Structured Prediction with Projection Oracles |

Post-feedback update: Thanks for your update. Your additional explanations and results will help improve the paper, and I definitely think this work is strong and should be accepted. -------------------------------------------------------------------------------------------------------- Originality: This paper draw upon a lot of previous work studying losses within structured prediction, all of which is cited throughout in appropriate places. The framework itself is new, and the authors make it very clear how prior work fits into the framework as special cases. At the same time, a good case is made for why this framework is useful to have and how it can be better to use than prior losses. Quality: this paper makes a compelling case for the framework it introduces. It explains how previously developed structured prediction losses fit into it as well as providing examples for additional losses that can be derived and how they might be better to due to their ability to be computed more cheaply. A consistency bound is proven, indicating that using this framework is a sensible thing to do. A tradeoff between computational cost and statistical efficiency is mentioned at several points throughout the paper; however, the experimental results do not reflect on these tradeoffs, as runtime comparisons for both training and inference are not included. The experiments do adequately demonstrate, though, that the use of some of the losses introduced by this framework that are not subsumed by older frameworks can lead to improved models. Clarity: Overall, the flow of ideas in this paper is quite clear. Background material is described in sufficient detail for the reader to understand necessary concepts and how they relate to what is presented here. Section 3.2 is very helpful for understanding examples of the sorts of losses that can be constructed using this framework. There are a few minor typos (for example, in line 101, the first Psi function definition is missing its argument), but these do not significantly impede the clarity of the work. Significance: the provided empirical results demonstrate that practitioners of structured prediction should consider thinking about developing losses using this framework, as they may lead to training models with better performance. Because no information is provided on the relative runtimes of the compared approaches, the tradeoffs between model performance and training/inference time are not as clear as they could be.

Overview: ------------- This paper proposes and studies a general framework for generating surrogate loss functions for structured prediction problems. In particular, the loss is defined in terms of Bregman projections, and some examples are given for when the projection (or a relaxation thereof) can be computed efficiently. Surrogates in this family are shown to be calibrated wrt the true loss. This paper generalizes SparseMAP [26] and Fenchel-Young losses [6] to Bregman projections. It seems to make an interesting contribution to the field of structured prediction, by proposing a family of calibrated surrogate losses. Limitations of the paper include questionable practicality of the approach (due to expensive training and inference), and proximity to previous work [6,26] which is generalized (the calibration results here seem novel). Comments/questions: ----------------------------- Since computational-statistical trade-off is central to the paper, consider reporting runtimes in experiments so the reader can get a sense of what this looks like for particular problem instances. Further, I think a simple structured-prediction baseline (pick your favorite, maybe SPENs?) is missing in the experiments in order to get a sense of how the proposed framework compares to existing approaches in terms of accuracy and complexity (the only comparison is between variants of the proposed method). At the end of section 4 results are described for projected averaged SGD, but experimental results are reported for L-BFGS. What is the reason for this discrepancy? For multilable classification, can you comment on calibration of the losses used here w.r.t. the F1 measure? Might be good to explicitly include somewhere the general training algorithm (minimize the surrogate S). Haven’t seen it anywhere. Minors: ---------- - Line 19: not sure I understand the statement saying structured objects are more difficult than mere vectors. It seems like the structure should actually help learning since otherwise every vector output is equally possible. - Line 78: not tractable => intractable - Line 101: “\Psi = 1/2||u||” => “\Psi(u) = 1/2||u||” - Can you please explain the inequalities in line 114? Psi and Omega are functions, so does it mean that this holds for any input? - Line 133: “we will we” - Line 198: would be good to mention a couple of common losses here in addition to referring to the appendix. - Lines 268, 275: better mention that Tables 3, 5, 6 are in appendix B.

Originality: The idea is not entirely new. L106 gives an interesting explanation of F-Y loss, which seems to be a novel perspective. Basically, the paper focuses on a special case of the F-Y loss [6] for structured prediction; the consistency analysis of excess Bayes surrogate risk is based on the work of [27, 29] using calibration function. Quality: All proofs look good to me (after a quick check). The experimental part of this paper is relatively weak: it only shows the proposed F-Y losses work for selective examples. However, there is no comparison with prior works. Classical baselines such as SSVM, CRF, SparseMAP etc should be compared. Clarity: The paper is generally well written with a clear structure. Significance: Convex surrogates such as F-Y loss should be interesting to structured prediction community. This paper may attract more people (potentially deep learning folks) exploring better surrogates/pipelines for structured prediction.