NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7492
Title: Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Reviewer 1

Post-feedback update: Thank you for your update. Your additional results will strengthen this paper, and I still think it should be accepted.
-------------------------------------------------------------------------------------------------------------

Originality: The ideas presented in this paper represent somewhat of a synthesis of other ideas. Specifically, it combines the basic overall framework for SPEN training using a reward signal introduced by [1] with the idea of adding in random search to find reward scoring violations, which has been used in the past by various papers (which are cited appropriately in this work). However, this exact combination is novel.

Quality: The motivation behind using random search to augment the generation of labels to use for training the model is sound and verified empirically. Numerous appropriate baselines are included, ranging from beam search-type approaches to more directly comparable approaches such as [1], and the introduced approach outperforms all of them. There are additional results presented that further reinforce some of the ideas and motivations introduced when describing the model: specifically, that some problems use reward signals that are somewhat uninformative in many regions of the search space, and that using solely gradient-based approaches to find new training points can cause the model to get "stuck" in local optima. It would have been interesting to see experiments in a semi-supervised setting to see how much this approach can augment training using a limited amount of fully-labeled training data, but the content in this paper is sufficient to be interesting on its own.

Clarity: The ideas are presented clearly and logically, and there are no problems in understanding the problem, the motivations behind the solution, and how the solution addresses shortcomings of other approaches. The experiments are described in adequate detail and the results are easy to understand.

Significance: The ability to utilize a reward function for training instead of full supervision is appealing, since getting full training labels can be much more expensive than being able to provide a reward function. The presented results indicate that this approach can provide significant improvements over competing approaches that do not use full supervision and thus is worthwhile to use.

[1] Rooshenas, A., Kamath, A., and McCallum, A. Training structured prediction energy networks with indirect supervision. NAACL: HLT, 2018.

Reviewer 2

===== Update following rebuttal =====
The additional experiments strengthen the submission so I am updating my score to 6. I think there should be a more serious discussion of the accuracy-vs-computation tradeoff for the truncated randomized search (number of steps, margin requirement, etc.). Unfortunately, this point was not addressed in the rebuttal.
=====

Overview: This paper proposes an approach called Search-Guided SPENs for training SPENs from reward functions defined over a structured output space. The main idea is to refine the prediction of the energy network by searching for another output that improves its reward. If the search is not too expensive, this can speed up training and improve performance. Rather than gradient-based search (previously suggested in R-SPENs), which can get stuck, a truncated randomized search with reward values is proposed here. Experiments suggest that the proposed approach can indeed improve performance on several tasks with relatively cheap computation. Overall, the approach is presented clearly and seems to improve over previous work in experiments. However, I feel that some aspects which are left as future work, such as results in the fully- and semi-supervised settings and an investigation of the effect of the search procedure, should actually be included in this work.

Detailed comments: The effectiveness of the randomized search procedure seems central to this approach, but it is discussed only in Appendix B. It seems like this should get more attention and be included in the main text. Also, it seems interesting to explore smarter search procedures that exploit domain knowledge and compare them to the randomized one in terms of the computation-accuracy trade-off. This is mentioned as future work, but feels central to understanding the merits of the proposed approach.

Experiments:
* In the experiments, it would be interesting to add a comparison to a fully supervised baseline that does use ground-truth labels (e.g., vanilla SPENs) in order to get a sense of performance gaps with reward-based training, whenever ground-truth labels are available. In multilabel classification this is especially relevant since the reward function actually depends on the true labels.
* An interesting question is whether the proposed approach can improve over supervised training (e.g., vanilla SPENs) in fully-supervised and semi-supervised settings, and not just learn a predefined reward function, but this is not addressed in the paper.
* How is the trade-off parameter alpha between energy and reward in eq (2) chosen? Presumably some tuning is required for this hyperparameter, which should be considered in the training time comparison.
* Are the results reported with respect to some ground truth or with respect to the reward function? I was expecting to see both in Table 1. This distinction can help in understanding the source of errors (optimization vs. mismatch between reward and performance measure).

Minor:
Line 120: notice that y_n may not exist if y_s is already near optimal. This is handled later (line 135), but only as a failure of the search procedure, and not because of near-optimality. It would be good to clarify.
Line 229: “thus we this”
Line 301: “an invalid programs”
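To make the accuracy-vs-computation knobs raised in this review concrete (number of steps, margin requirement), here is a minimal sketch of a truncated randomized search. It is an illustration under stated assumptions and not the authors' implementation: outputs are binary label vectors, a search step flips one randomly chosen label, the reward is an F1 score against a hypothetical target, and all names (`f1_reward`, `truncated_random_search`, `margin`, `max_steps`) are invented for the example.

```python
import random


def f1_reward(y, target):
    """F1 score of a binary label vector y against a target (assumed reward)."""
    tp = sum(1 for a, b in zip(y, target) if a == 1 and b == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(y)
    recall = tp / sum(target)
    return 2 * precision * recall / (precision + recall)


def truncated_random_search(y_start, reward_fn, margin=0.05, max_steps=20, seed=0):
    """Look for an output whose reward beats reward_fn(y_start) by more than margin.

    Returns (candidate, reward) on success, or (None, base_reward) when the
    step budget runs out, e.g. because y_start is already near optimal.
    """
    rng = random.Random(seed)
    base = reward_fn(y_start)
    current, current_reward = list(y_start), base
    for _ in range(max_steps):
        # Propose a neighbour by flipping one randomly chosen label.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]
        r = reward_fn(candidate)
        if r > base + margin:
            return candidate, r
        # Walk across reward plateaus: keep moves that do not lose reward.
        if r >= current_reward:
            current, current_reward = candidate, r
    return None, base


if __name__ == "__main__":
    target = [1, 0, 1, 0, 0, 1]
    reward = lambda y: f1_reward(y, target)
    print(truncated_random_search([0] * 6, reward, max_steps=50))
    # Starting from the optimum, no better output exists, so the search fails.
    print(truncated_random_search(target, reward, max_steps=50))
```

In this sketch, a larger `max_steps` raises the chance of finding a violating pair at higher cost per training example, while a larger `margin` demands a bigger reward gap and makes failures more frequent. The second call shows the case flagged under line 120: when the starting output is already optimal, the search returns nothing for a reason other than a poor search.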

Reviewer 3

============= Update after rebuttal ================
I have read the other reviews and the authors' rebuttal. I appreciate the additional experiments presented by the authors and will be upgrading my recommendation to a 7 to reflect this. I still think more of these additional experiments and larger-scale experiments would increase the paper's significance by a lot, but I think it does pass the bar for publication in its current state.
=============================================

The paper introduces a new twist on the rank-based approach to training structured prediction energy networks via light supervision (where light supervision means that the learning signal for the energy function learned by the model comes from enforcing that the energy levels are consistent with the levels of a reward function). Instead of picking random samples guided by the energy function, which will often offer the same reward (since this function is mostly uninformative and has wide plateaus), the samples are obtained first through gradient-based inference, and then via local search on the reward function itself, to make sure that there is a difference in value between the two samples. The algorithm is run on three small-scale structured prediction datasets and is shown to outperform the previous rank-based SPEN training algorithm.

Originality: The main contribution (i.e. the new sampling of the datapoints) is a new twist on an existing algorithm. As such it's not very original, though novel. The related work is extensively cited and the delineation with the contributions of the paper is well done.

Clarity: The paper is very well written and easy to read. While probably a bit verbose, it explains in detail the SPEN models, their training, the newly proposed training, and its relation to previous work. There are a couple of surprising claims though, which are worth noting (see details below).

Quality: While the algorithm is applied on three tasks, all of them are very small scale. One cannot really evaluate the promise of the approach on such datasets, so in this sense the paper might not be quite ready yet. Otherwise the paper is technically sound.

Significance: Again, while the innovation is fairly minor, it might result in big improvements empirically, but one cannot readily verify this for lack of a large-scale task in the experimental section.

Question: This is beyond the scope of the paper (although it would make for a nice addition and help strengthen its originality). In the setup considered, we only use the light supervision of the reward function. On a fully labeled dataset, it is possible to compute many rewards based on the ground truth labels. Would you expect that training with both the supervised loss and the lighter supervision of these additional rewards would work better than training with the supervised loss alone?

All told, this paper is on the fence as regards acceptance. It is very clear and of good quality, but might still be improved with larger-scale experiments.

Details:
l157-159: the claim is a bit surprising, considering gradient descent is notoriously prone to converging to poor stationary points as opposed to the global optimum.
l200 & l202: the citation should read Daumé and not Daumé III. The way to do it in a bib file is the following: author = {Daum\'e, III, Hal and ...}
l215: the claim that the models do not have access to the ground truth is misleading, at least in the case of multi-label classification, where the reward is a direct function of the labels.