{"title": "Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13522, "page_last": 13532, "abstract": "In structured output prediction tasks, labeling ground-truth training output is often expensive. However, for many tasks, even when the true output is unknown, we can evaluate predictions using a scalar reward function, which may be easily assembled from human knowledge or non-differentiable pipelines. But searching through the entire output space to find the best output with respect to this reward function is typically intractable. In this paper, we instead use efficient truncated randomized search in this reward function to train structured prediction energy networks (SPENs), which provide efficient test-time inference using gradient-based search on a smooth, learned representation of the score landscape, and have previously yielded state-of-the-art results in structured prediction. In particular, this truncated randomized search in the reward function yields previously unknown local improvements, providing effective supervision to SPENs, avoiding their traditional need for labeled training data.", "full_text": "Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Amirmohammad Rooshenas, Dongxu Zhang, Gopal Sharma, and Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{pedram,dongxuzhang,gopalsharma,mccallum}@cs.umass.edu
Amherst, MA 01003

Abstract

In structured output prediction tasks, labeling ground-truth training output is often expensive. However, for many tasks, even when the true output is unknown, we can evaluate predictions using a scalar reward function, which may be easily assembled from human knowledge or non-differentiable pipelines.
But searching\nthrough the entire output space to \ufb01nd the best output with respect to this reward\nfunction is typically intractable. In this paper, we instead use ef\ufb01cient truncated\nrandomized search in this reward function to train structured prediction energy\nnetworks (SPENs), which provide ef\ufb01cient test-time inference using gradient-\nbased search on a smooth, learned representation of the score landscape, and have\npreviously yielded state-of-the-art results in structured prediction. In particular,\nthis truncated randomized search in the reward function yields previously unknown\nlocal improvements, providing effective supervision to SPENs, avoiding their\ntraditional need for labeled training data.\n\n1\n\nIntroduction\n\nStructured output prediction tasks are common in computer vision, natural language processing,\nrobotics, and computational biology. The goal is to \ufb01nd a function from an input vector x to multiple\ncoordinated output variables y. For example, such coordination can represent constrained structures,\nsuch as natural language parse trees, foreground-background pixel maps in images, or intertwined\nbinary labels in multi-label classi\ufb01cation.\nStructured prediction energy networks (SPENs) (Belanger & McCallum, 2016) are a type of energy-\nbased model (LeCun et al., 2006) in which inference is done by gradient descent. SPENs learn an\nenergy landscape E(x, y) on pairs of input x and structured outputs y. In a successfully trained\nSPEN, an input x yields an energy landscape over structured outputs such that the lowest energy\noccurs at the target structured output y\u21e4. 
Therefore, we can infer the target output by \ufb01nding the\nminimum of energy function E conditioned on input x: y\u21e4 = argminy E(x, y).\nTraditional supervised training of SPENs requires knowledge of the target structured output in\norder to learn the energy landscape, however such labeled examples are expensive to collect in\nmany tasks, which suggests the use of other cheaply acquirable supervision. For example, Mann\nand McCallum (2010) use labeled features instead of labeled output, or Ganchev et al. (2010) use\nconstraints on posterior distributions of output variables, however both directly add constraints as\nfeatures, requiring the constraints to be decomposable and also be compatible with the underlying\nmodel\u2019s factorization to avoid intractable inference.\nAlternatively, scalar reward functions are another widely used source of supervision, mostly in\nreinforcement learning (RL), where the environment evaluates a sequence of actions with a scalar\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\freward value. RL has been used for direct-loss minimization in sequence labeling, where the reward\nfunction is the task-loss between a predicted output and target output (Bahdanau et al., 2017; Maes\net al., 2009), or where it is the result of evaluating a non-differentiable pipeline over the predicted\noutput (Sharma et al., 2018). In these settings, the reward function is often non-differentiable or has\nlow-quality continuous relaxation (or surrogate) making end-to-end training inaccurate with respect\nto the task-loss.\nInterestingly, we can also rely on easily accessible human domain-knowledge to develop such reward\nfunctions, as one can easily express output constraints to evaluate structured outputs (e.g., predicted\noutputs get penalized if they violate the constraints). 
For example, in dependency parsing each\nsentence should have a verb, and thus parse outputs without a verb can be assigned a low score.\nMore recently, Rooshenas et al. (2018) introduce a method to use such reward functions to supervise\nthe training of SPENs by leveraging rank-based training and SampleRank (Rohanimanesh et al.,\n2011). Rank-based training shapes the energy landscape such that the energy ranking of alternative y\npairs are consistent with their score ranking from the reward function. The key question is how to\nsample the pairs of ys for ranking. We don\u2019t want to train on all pairs, because we will waste energy\nnetwork representational capacity on ranking many unimportant pairs irrelevant to inference; (nor\ncould we tractably train on all pairs if we wanted to). We do, however, want to train on pairs that are in\nregions of output space that are misleading for gradient-based inference when it traverses the energy\nlandscape to return the target. Previous methods have sampled pairs guided by the thus-far-learned\nenergy function, but the \ufb02awed, preliminarily-trained energy function is a weak guide on its own.\nMoreover, reward functions often include many wide plateaus containing most of the sample pairs,\nespecially at early stages of training, thus not providing any supervision signal.\nIn this paper we present a new method providing ef\ufb01cient, light-supervision of SPENs with margin-\nbased training. We describe a new method of obtaining training pairs using a combination of the\nmodel\u2019s energy function and the reward function. In particular, at training time we run the test-time\nenergy-gradient inference procedure to obtain the \ufb01rst element of the pair; then we obtain the second\nelement using randomized search driven by the reward function to \ufb01nd a local true improvement\nover the \ufb01rst. 
Using this search-guided approach we have successfully performed lightly-supervised training of SPENs with reward functions and improved accuracy over previous state-of-the-art baselines.

2 Structured Prediction Energy Networks

A SPEN parametrizes the energy function Ew(y, x) using deep neural networks over input x and output variables y, where w denotes the parameters of the deep neural networks. SPENs rely on parameter learning for finding the correlation among variables, which is significantly more efficient than learning the structure of factor graphs. One can still add task-specific bias to the learned structure by designing the general shape of the energy function. For example, Belanger and McCallum (2016) separate the energy function into global and local terms. The role of the local terms is to capture the dependency between input x and each individual output variable yi, while the global term aims to capture long-range dependencies among output variables. Gygli et al. (2017) define a convolutional neural network over the joint input and output.

Inference in SPENs is defined as finding argmin_{y ∈ Y} Ew(y, x) for a given input x. Structured outputs are represented using discrete variables, however, which makes inference an NP-hard combinatorial optimization problem. SPENs achieve efficient approximate inference by relaxing each discrete variable to a probability simplex over the possible outcomes of that variable. In this relaxation, the vertices of a simplex represent the exact values.
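For intuition, this relaxed gradient-based inference can be sketched in a few lines. This is a toy illustration with a linear energy, not the authors' implementation; the softmax-logit parameterization matches the exponentiated-gradient form discussed next:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relaxed_inference(energy_grad, n_vars, n_states, steps=100, eta=0.1, rng=None):
    """Gradient-based SPEN inference on the simplex relaxation.

    Each discrete variable y_i is relaxed to a distribution y_i = softmax(I_i);
    we descend in the logits I using dE/dy, so every y_i stays a valid
    probability distribution after each update.
    """
    rng = np.random.default_rng(rng)
    logits = 0.01 * rng.standard_normal((n_vars, n_states))
    for _ in range(steps):
        y = softmax(logits)                    # current relaxed output
        logits = logits - eta * energy_grad(y)  # update I_i with dE/dy_i
    return softmax(logits).argmax(axis=-1)      # snap to simplex vertices

# Toy linear energy E(y) = -sum_i <c_i, y_i>, minimized at the argmax of c.
c = np.array([[0.2, 1.0, -0.5],
              [1.5, 0.1,  0.3]])
pred = relaxed_inference(lambda y: -c, n_vars=2, n_states=3)  # -> [1, 0]
```

For a real SPEN, `energy_grad` would be the gradient of a learned network with respect to the relaxed outputs; here it is a fixed linear function purely for illustration.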
The simplex relaxation reduces the combinatorial optimization to a continuous constrained optimization that can be optimized numerically using either projected gradient descent or exponentiated gradient descent, both of which return a valid probability distribution for each variable after every update iteration.

Practically, we found that exponentiated gradient descent, with updates of the form y_i^{t+1} = (1/Z_i^t) · y_i^t · exp(-η ∂E/∂y_i) (where Z_i^t is the partition function of the unnormalized distribution over the values of variable i at iteration t), improves the performance of inference regarding convergence and finds better outputs. This is in agreement with similar results reported by Belanger et al. (2017) and Hoang et al. (2017). Exponentiated gradient descent is equivalent to defining y_i = Softmax(I_i), where I_i is the vector of logits corresponding to variable y_i, and taking gradient-descent steps in I_i but with gradients with respect to y_i (Kivinen & Warmuth, 1997): I_i^{t+1} = I_i^t - η ∂E/∂y_i.

Multiple algorithms have been introduced for training SPENs, including structural SVM (Belanger & McCallum, 2016), value-matching (Gygli et al., 2017), end-to-end training (Belanger, 2017), and rank-based training (Rooshenas et al., 2018). Given an input, structural SVM training requires the energy of the target structured output to be lower than the energy of the loss-augmented predicted output. Value-matching (Gygli et al., 2017), on the other hand, matches the value of the energy for adversarially selected structured outputs and annotated target structured outputs (thus strongly supervised, not lightly supervised) with their task-loss values. Therefore, given a successfully trained energy function, inference would return the structured output that minimizes the task-loss. End-to-end training (Belanger et al., 2017) directly minimizes a differentiable surrogate task-loss between predicted and target structured outputs.
Finally, rank-based training shapes the energy landscape such\nthat the structured outputs have the same ranking in the energy function and a given reward function.\nWhile structural SVM, value-matching, and end-to-end training require annotated target structured\noutputs, rank-based training can be used in domains where we have only light supervision in the\nform of reward function R(x, y) (which evaluates input x and predicted structured output y to a\nscalar reward value). Rank-based training collects training pairs from a gradient-descent trajectory on\nenergy function. However, these training trajectories may not lead to relevant pairwise rank violations\n(informative constraints that are necessary for training (Huang et al., 2012)) if the current model does\nnot navigate to regions with high reward. This problem is more prevalent if the reward function has\nplateaus over a considerable number of possible outputs\u2014for example, when the violation of strong\nconstraints results in constant values that conceal partial rewards. These plateaus happen in domains\nwhere the structured output is a set of instructions such as a SQL query, and the reward function\nevaluates the structured outputs based on their execution results.\nThis paper introduces a new search-guided training method for SPENs that addresses the above\nproblem, while preserving the ability to learn from light supervision. 
As described in detail below, in our method the gathering of informative training pairs is guided not only by gradient descent on the thus-far-learned energy function, but augmented by truncated randomized search informed by the reward function, discovering places where the reward training signal disagrees with the learned energy function.

3 Search-Guided Training

Search-guided training of SPENs relies on a randomized search procedure S(x, ys) which takes the input x and a starting point ys and returns a successor point yn such that

R(x, yn) > R(x, ys) + δ,   (1)

where δ > 0 is the search margin that controls the complexity of the search operator. For large δ, the search operator requires more exploration to satisfy Eq. 1, while the returned successor point yn is closer to the true output that maximizes the reward function, thus providing a stronger supervision signal. Smaller values of δ, on the other hand, require less exploration, but provide a weaker supervision signal (see Appendix B for a comparison of reward margin values). Of course, many randomized search procedures are possible, simple and complex.

In the experiments of this paper we find that a simple randomized search works well: we start from the gradient-descent inference output, iteratively select a random output variable, and uniformly sample a new state for the selected variable; if the reward increases by more than the margin, we return the new sample; if the reward increases by less than the margin, we similarly change an additional randomly selected variable; if the reward decreases, we undo the change and begin the sampling again. (If readily available, domain knowledge could be injected into the search to better explore the reward function; this is a target of future work.) We truncate the randomized search by bounding the number of times that it can query the reward function to evaluate structured outputs for each input x at every training step.
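The simple truncated search just described might be sketched as follows (the function and variable names, and the toy reward at the end, are ours, not the paper's):

```python
import random

def truncated_search(reward, y_start, n_states, delta=0.05, budget=200, seed=0):
    """Sketch of the truncated randomized search S(x, y_start).

    Returns some y with reward(y) > reward(y_start) + delta, or None if
    the reward-query budget is exhausted first (truncation).
    """
    rng = random.Random(seed)
    base = reward(y_start)                   # reward of the inference output
    y = list(y_start)
    for _ in range(budget):                  # bound on reward-function queries
        i = rng.randrange(len(y))
        old = y[i]
        y[i] = rng.randrange(n_states)       # uniformly resample one variable
        r = reward(y)
        if r > base + delta:                 # improvement beyond the margin
            return y
        if r < base:                         # reward decreased: undo the change
            y[i] = old
        # otherwise keep the change and go on to mutate another variable
    return None                              # truncated: no local improvement found

# Toy reward: fraction of positions agreeing with a hidden target sequence.
target = [2, 0, 1, 2]
reward = lambda y: sum(a == b for a, b in zip(y, target)) / len(target)
improved = truncated_search(reward, [0, 0, 0, 0], n_states=3)
```

In the paper's setting `reward` would be the (possibly non-differentiable) task reward R(x, ·) and `y_start` the output of gradient-descent inference; a `None` return corresponds to skipping that example for the current iteration.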
As a result, the search procedure may not be able to find a local improvement (this also may happen if ys is already near-optimal), in which case we simply ignore that training example in the current training iteration. Note that the next time that we visit an ignored example, the inference procedure may provide a better starting point, or the truncated randomized search may find a local improvement. In practice we observe that, as training continues, the truncated randomized search finds local improvements for every training point (see Appendix C).

Figure 1: Search-guided training: the solid and dashed lines show a schematic landscape of the energy and reward functions, respectively. The blue circles indexed by yi represent the gradient-descent inference trajectory with five iterations over the energy function. Dashed arrows represent the mapping between the energy and reward functions, while the solid arrows show the direction of updates.

Intuitively, we are sampling ŷ from the energy function E(y, x) by adding Gaussian noise (with standard deviation σ) to the gradient descent on the logits, I_i^{t+1} = I_i^t - η ∂E/∂y_i + N(0, σ), which is similar to using Langevin dynamics for sampling from a Boltzmann distribution.

Via the search procedure, we find some S(x, ŷ) that is a better solution than ŷ with respect to the reward function. Therefore, we have to train the SPEN model such that, conditioned on x, gradient-descent inference returns S(x, ŷ), thus guiding the model toward predicting a better output at each step. Figure 1 depicts an example of such a scenario.

For the gradient-descent inference to find ŷn = S(x, ŷ), the energy of (x, ŷn) must be lower than the energy of (x, ŷ) by a margin M. We define the margin using the scaled difference of their rewards:

M(x, ŷ, ŷn) = α (R(x, ŷn) - R(x, ŷ)),   (2)

where α > 1 is a task-dependent scalar. Now, we define at most one constraint for each training example x:

ξw(x) = M(x, ŷ, ŷn) - Ew(x, ŷ) + Ew(x, ŷn) ≤ 0.   (3)

As a result, our objective is to minimize the magnitude of the violations, regularized by the L2 norm:

min_w Σ_{x ∈ D} max(ξw(x), 0) + c ||w||²,   (4)

where c is the regularization hyper-parameter. Algorithm 1 shows the search-guided training.

Algorithm 1 Search-guided training of SPENs
  D: unlabeled mini-batch of training data
  R(·, ·): reward function
  Ew(·, ·): input SPEN
  repeat
    L ← 0
    for each x in D do
      ŷ ← sample from Ew(y, x)
      ŷn ← S(x, ŷ)    // search in reward function R starting from ŷ
      ξw(x) ← M(x, ŷ, ŷn) - Ew(x, ŷ) + Ew(x, ŷn)
      L ← L + max(ξw(x), 0)
    end for
    L ← L + c ||w||²
    w ← w - λ ∇w L    // λ is the learning rate
  until convergence

4 Related Work

Peng et al. (2017) introduce maximum margin reward networks (MMRNs), which also use indirect supervision from reward functions for margin-based training. Our work has two main advantages over MMRNs: first, MMRNs use search-based inference, while SPENs provide efficient gradient-descent inference. Search-based inference, such as beam search, is more likely to find a poor locally optimal structured output rather than the most likely one, especially when the output space is very large.
Second, SG-SPENs gradually train the energy function to output better predictions by contrasting the predicted output with a local improvement of that output found using search, while MMRNs use search-based inference twice: once for finding the global optimum, which may not be accessible, and again for loss-augmented inference. Their method therefore depends heavily on finding the best points using search, while SG-SPEN only requires search to find more accessible local improvements.

Learning to search (Chang et al., 2015) also explores learning from a reward function for structured prediction tasks where the output structure is a sequence. The training algorithm includes a roll-in and a roll-out policy. It uses the so-far-learned policy to fill in some steps, then randomly picks one action, and fills out the rest of the sequence with a roll-out policy that is a mix of a reference policy and the learned policy. Finally, it observes the reward of the whole sequence and constructs a cost-augmented tuple for the randomly selected action to train the policy network using a cost-sensitive classifier. In the absence of ground-truth labels, the reference policy can be replaced by a sub-optimal policy or the learned policy. In the latter case, the training algorithm reduces to reinforcement learning. Although it is possible to use search as the sub-optimal policy, we believe that in the absence of ground-truth labels, our policy gradient baselines are a good representative of the algorithms in this category.

For some tasks, it is possible to define a differentiable reward function, so we can directly train the prediction model using end-to-end training. For example, Stewart and Ermon (2017) train a neural network using a differentiable reward function that guides the training based on the physics of moving objects.
However, differentiable reward functions are rare, limiting their\napplicability in practice.\nGeneralized expectation (GE) (Mann & McCallum, 2010), posterior regularization (Ganchev et al.,\n2010) and constraint-driven learning (Chang et al., 2007), learning from measurements (Liang et al.,\n2009), have been introduced to learn from a set of constraints and labeled features. Recently, Hu et\nal. (2016) use posterior regularization to distill the human domain-knowledge described as \ufb01rst-order\nlogic into neural networks. However, these methods cannot learn from the common case of black box\nreward functions, such as the ones that we use in our experiments below on citation \ufb01eld extraction\nand shape parsing.\nChang et al. (2010) de\ufb01ne a companion problem for a structured prediction problem (e.g., if the\npart-of-speech tags are legitimate for the given input sentence or not) supposing the acquisition of\nannotated data for the companion problem is cheap. Jointly optimizing the original problem and the\ncompanion problem reduces the required number of annotated data for the original problem since the\ncompanion problem would restrict the feasible structured output space.\nFinally, there exists a body of work using reward functions to train structured prediction models\nwith reward functions de\ufb01ned as task-loss (Norouzi et al., 2016; Bahdanau et al., 2017; Ranzato\net al., 2016), in which they access ground-truth labels to compute the task loss, pretraining the policy\nnetwork, or training the critic. These approaches bene\ufb01t from mixing strong supervision with the\nsupervision from the reward function (task-loss), while reward functions for training SG-SPENs\ndo not assume the accessibility of ground-truth labels. Moreover, when the action space is very\nlarge and the reward function includes plateaus, training policy networks without pretraining with\nsupervised data is very dif\ufb01cult. Daum\u00e9 et al. 
(2018) address the issue of sparse rewards by learning a decomposition of the reward signal; however, they still assume access to a reference policy pre-trained on supervised data for the structured prediction problems. In Daumé et al. (2018), the reward function is also the task-loss. SG-SPEN addresses these problems differently: first, it effectively trains SPENs that provide joint inference, so it does not require partial rewards. Second, the randomized search can easily avoid the plateaus in the reward function, which is essential for learning at the early stages. Our policy gradient baselines are a strong representative of the reinforcement learning algorithms for structured prediction problems without any assumption about the ground-truth labels.

5 Experiments

We have conducted training of SPENs in three settings with different reward functions: 1) multi-label classification, with the reward function defined as the F1 score between predicted labels and target labels; 2) citation-field extraction, with a human-written reward function; and 3) shape parsing, with a task-specific reward function. Except for the oracle reward function that we used for multi-label classification, the reward functions for citation-field extraction and shape parsing do not have access to any labeled data. In none of our experiments do the models have access to any labeled data (for a comparison to fully-supervised models see Appendix A).

5.1 Multi-label Classification

We first evaluate the ability of search-guided training of SPENs, SG-SPEN, to learn from light supervision provided by truncated randomized search.
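In this setting the oracle reward is simply the F1 score between the predicted and true label sets; a minimal sketch (ours, phrased over label sets rather than the dataset's binary vectors):

```python
def f1_reward(pred_labels, true_labels):
    """Oracle reward for multi-label outputs: F1 between the predicted
    and true label sets. The model only sees this scalar, never which
    individual labels are right or wrong."""
    pred, true = set(pred_labels), set(true_labels)
    overlap = len(pred & true)
    if not pred or not true or overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)

# e.g. f1_reward({"cat", "dog"}, {"dog", "bird"}) == 0.5
```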
We consider the task of multi-label classi\ufb01cation on\nBibtex dataset with 159 labels and 1839 input variables and Bookmarks dataset with 208 labels and\n2150 input variables.\nWe de\ufb01ne the reward function as the F1 distance between the true label set and the predicted set\nat training time, and none of the methods have access to the true labels directly, which makes this\nscenario different from fully-supervised training.\nWe also trained R-SPEN (Rooshenas et al., 2018) and DVN (value-matching training of\nSPENs) (Gygli et al., 2017) with the same oracle reward function and energy function. In this\ncase, DVN matches the energy value with the value of the reward function at different structured\noutput points generated by the gradient-descent inference. Similar to SG-SPEN, R-SPEN and DVN\ndo not have direct access to the ground-truth. In general, DVNs require access to ground-truth\nlabels to generate adversarial examples that are located in a vicinity of ground-truth labels, and this\nrestriction signi\ufb01cantly hurts the performance of DVNs. In order to alleviate this problem, we also\nadd Gaussian noise to gradient-descent inference in DVN, so it matches the energy of samples from\nthe energy function with their reward values, giving it the means to better explore the energy function\nin the absence of ground-truth labels. See Appendix D for more details on this experimental setup.\nTable 1.B shows the performance of SG-SPEN, R-SPEN, and DVN on this task. We observed that\nR-SPEN has dif\ufb01culty \ufb01nding violations (optimization constraints) as training progresses. 
This is attributable to the fact that R-SPEN only explores the regions of the reward function reached by samples from the gradient-descent trajectory on the energy function, so if the gradient-descent inference is confined within local regions, R-SPEN cannot generate informative constraints.

5.2 Citation Field Extraction

Citation-field extraction is a structured prediction task in which the structured output is a sequence of tags, such as Author, Editor, Title, and Date, that distinguishes the segments of a citation text. We used the Cora citation dataset (Seymore et al., 1999), including 100 labeled examples as the validation set and another 100 labeled examples for the test set. We discarded the labels of 300 examples in the training data and added to them another 700 unlabeled citation texts acquired from the web.

The citation texts, including the validation set, test set, and unlabeled data, have a maximum length of 118 tokens, which can be labeled with one of 13 possible tags. We fixed the length of the input data by padding all citation texts to the maximum citation length in the dataset. We report token-level accuracy measured on non-pad tokens.

Our knowledge-based reward function is equivalent to that of Rooshenas et al. (2018): it takes the input citation text and the predicted tags and evaluates the consistency of the prediction with about 50 given rules describing human domain-knowledge about citation text.

We compare SG-SPEN with R-SPEN (Rooshenas et al., 2018), iterative beam search with random initialization, policy gradient methods (PG) (Williams, 1992), generalized expectation (GE) (Mann & McCallum, 2010), and MMRN (Peng et al., 2017).
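To make the flavor of such a rule-based reward concrete, here is a hedged sketch; the three rules below are hypothetical illustrations of our own, not the roughly 50 rules actually used:

```python
import re

def citation_reward(tokens, tags):
    """Sketch of a knowledge-based reward for citation-field extraction:
    score a predicted tag sequence by hand-written consistency rules.
    These rules are illustrative only, not the paper's actual rule set."""
    score = 0.0
    # Rule 1: four-digit year-like tokens should be tagged as Date.
    score += sum(1.0 for tok, tag in zip(tokens, tags)
                 if re.fullmatch(r"(19|20)\d\d", tok) and tag == "Date")
    # Rule 2: a citation should contain a title segment.
    if tags.count("Title") >= 1:
        score += 1.0
    # Rule 3: penalize Author tags appearing after the Date.
    if "Date" in tags and "Author" in tags:
        if max(i for i, t in enumerate(tags) if t == "Author") > tags.index("Date"):
            score -= 1.0
    return score

tokens = ["Seymore", ",", "K.", "1999", "Learning", "hidden", "Markov", "models"]
tags   = ["Author", "Author", "Author", "Date", "Title", "Title", "Title", "Title"]
```

Note that a reward assembled from such rules is noisy and typically non-differentiable, which is exactly the regime the search-guided training targets.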
Appendix E includes a detailed description of baselines and hyper-parameters.

Table 1: The comparison of SG-SPEN and other baselines using A) token-level accuracy for the citation-field extraction task, B) F1 score for the multi-label classification task, and C) intersection over union (IOU) for the shape-parser task.

A) Citation-field extraction

  Method                    Accuracy    Inference time (sec.)
  GE                        37.3%       -
  Iterative Beam Search
    K=1                     30.5%       159
    K=2                     35.7%       850
    K=5                     39.3%       2,892
    K=10                    39.0%       6,654
  PG
    EMA baseline            54.5%       < 1
    Parametric baseline     47.9%       < 1
  MMRN                      39.5%       < 1
  DVN                       29.6%       < 1
  R-SPEN                    48.3%       < 1
  SG-SPEN                   57.1%       < 1

B) Multi-label classification

  Method     Bibtex    Bookmarks
  DVN        42.2      34.1
  R-SPEN     40.1      30.6
  SG-SPEN    44.0      38.4

C) Shape parsing

  Method                  IOU       Inference time (sec.)
  Iterative Beam Search
    K=5                   24.6%     3,882
    K=10                  30.0%     15,537
    K=20                  43.1%     38,977
  Neural shape parser     32.4%     < 1
  SG-SPEN                 56.3%     < 1

Figure 2: The input image (left) and the parse that generates the input image (right): a binary tree combining the primitives c(32,32,28), c(32,32,24), and t(32,32,20) with the - and + operations. The first two parameters of each shape give its center location, and the third parameter is its scale. A valid program sequence can be generated by a post-order traversal of the binary shape parse.

5.2.1 Results and Discussion

We report the token-level accuracy of SG-SPEN and the other baselines in Table 1.A. SG-SPEN achieves the highest performance on this task with 57.1% token-level accuracy. As we expect, R-SPEN's accuracy is lower than SG-SPEN's, as it introduces many irrelevant constraints into the optimization. Iterative beam search with a beam size of ten gets about 39.0% accuracy; however, inference takes more than a minute per test example on a 10-core CPU.
We noticed that using exhaustive search through a noisy and incomplete reward function may not improve accuracy despite finding structured outputs with higher scores. DVN struggles in the presence of an inaccurate reward function, since it tries to match the energy values with the reward values for the structured outputs generated by gradient-descent inference. More importantly, DVNs learn best if they can evaluate the reward function on relaxed continuous structured outputs, which is not available for the human-written reward function in this scenario. MMRN also has trouble finding the best path using greedy beam search because of local optima in the reward function, but SG-SPEN and PG, which are powered by randomized operations for exploring the reward function, are more successful on this task.

5.2.2 Semi-Supervised Setting

We study the citation-field extraction task in the semi-supervised setting with 1000 unlabeled and 5, 10, and 50 labeled data points. SG-SPEN can be extended to the semi-supervised setting by using the ground-truth label instead of the output of the search whenever it is available. Similarly, for R-SPEN, we can evaluate the rank-based objective using a pair of the model's prediction and the ground-truth output when available. For DVNs, if the ground-truth label is available, we use adversarial sampling as suggested by Gygli et al. (2017). We also report the result of PG training with the EMA baseline when the model is pre-trained with the labeled data.

Table 2: Semi-supervised setting for the citation-field extraction task.

  No.   GE     PG     DVN    R-SPEN   SG-SPEN   SG-SPEN-sup   DVN-sup
  5     54.7   55.6   50.5   55.0     65.5      53.0          57.4
  10    57.9   67.7   60.6   65.5     71.7      62.4          61.9
  50    68.0   76.5   67.7   81.5     82.9      81.6          81.4

Figure 3: The test reward value of SG-SPEN's outputs trained in the supervised setting and semi-supervised settings with five labeled data points.
We report the performance of GE based on Mann & McCallum (2010). We also report the results of SG-SPENs when they are trained only on the labeled data using the citation reward function (SG-SPEN-sup). Since the citation reward function is based on domain knowledge and is noisy, DVNs struggle to match the energy values with the noisy rewards, so we also trained DVNs with token-level accuracy (not available for the unlabeled data) as the reward function (DVN-sup) for reference.

SG-SPEN's performance is better than the other baselines in the presence of limited labeled data. However, since the training objectives of R-SPEN and SG-SPEN are similar for the labeled data (both use a rank-based objective), as we increase the number of labeled data points, their performance becomes closer. DVNs also benefit from the labeled data, but they are very sensitive to noisy reward functions (see DVN and DVN-sup in Table 2). To better understand the behavior of SG-SPEN in the semi-supervised setting, we compare the reward value on test data during training with five labeled data points in the fully-supervised and semi-supervised settings (see Figure 3). The unlabeled data helps SG-SPEN generalize better to unseen data.

5.3 Shape Parsing

Shape parsing, from the computer graphics literature, aims at parsing an input shape (2D image or 3D shape) into its structured elements as sequential instructions (a program). These programs are in the form of binary operations applied to basic shape primitives (see Figure 2). However, for an input shape, predicting the program that can generate it is a challenging task because of the combinatorially large output program space.

We apply our proposed SG-SPEN algorithm to the shape parsing task to show its superior performance in inducing programs for an input shape, without explicit supervision.
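The only training signal in this task is a scalar intersection-over-union (IOU) reward between the input image and the image rendered by executing the predicted program, as described below. A minimal sketch of such a reward (rendering itself is assumed to happen elsewhere):

```python
import numpy as np

def iou_reward(target_img, pred_img):
    """Reward for shape parsing: IOU between the input binary image and
    the binary image rendered from the predicted program. An invalid
    program renders nothing, so it receives zero reward."""
    t = target_img.astype(bool)
    p = pred_img.astype(bool)
    union = np.logical_or(t, p).sum()
    if union == 0:
        return 0.0
    return np.logical_and(t, p).sum() / union

# Two toy 64x64 "renders": top half vs. a middle band.
a = np.zeros((64, 64), dtype=np.uint8); a[:32, :] = 1
b = np.zeros((64, 64), dtype=np.uint8); b[16:48, :] = 1
```

This reward is non-differentiable with respect to the program tokens, and it is exactly zero on the large set of invalid programs, which is the plateau problem the truncated randomized search is designed to escape.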
Here we only consider programs of length five, each consisting of two operations applied to three shape primitives: circles, triangles, and rectangles, parameterized by their center and scale, for a total of 396 distinct shapes. Every program is therefore a sequence of five tags, where each tag can take 399 possible values: three operation types and 396 shapes. Executing a valid program produces a 64×64 binary image (Figure 2).
For the shape parsing task, we construct the reward function as the intersection over union (IOU) between the given input image and the image constructed from the predicted output program. This reward function is not differentiable, as it requires executing the predicted program to generate the final image. This is a difficult problem: first, the output space is very large, and second, many programs in the output space are invalid, so the reward function assigns them zero reward.
We generated 2,000 image-program pairs based on Sharma et al. (2018): 1,400 pairs for training, 300 for validation, and 300 for testing. We discard the programs for the training data.
We compare SG-SPEN with R-SPEN, DVN, and iterative beam search with beam sizes of five, ten, and twenty. We also apply the neural shape parser proposed by Sharma et al. (2018) for learning from unlabeled data. See Appendix F for more details on this experiment.

5.3.1 Results and discussion
R-SPEN is not able to learn in this scenario because the samples from the energy function are often invalid programs, from which R-SPEN cannot produce informative optimization constraints. In other words, most of the sampled pairs are invalid programs (with zero reward) and thus have the same ranking with respect to the reward function, so they are not useful for updating the energy landscape to guide gradient-descent inference toward better predictions.
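The IOU reward described above can be sketched directly on binary images. In this minimal sketch, `render_program` is a hypothetical stand-in for the (non-differentiable) program executor, which returns a 64×64 binary image or `None` for an invalid program:

```python
import numpy as np

# Sketch of the IOU reward between the input image and the image rendered
# from a predicted program; render_program is a hypothetical executor.

def iou_reward(target, program, render_program):
    canvas = render_program(program)
    if canvas is None:                  # invalid program: zero reward
        return 0.0
    inter = np.logical_and(target, canvas).sum()
    union = np.logical_or(target, canvas).sum()
    return float(inter) / union if union > 0 else 0.0

# Toy usage with a fake renderer that always draws the same filled square.
def fake_render(program):
    img = np.zeros((64, 64), dtype=bool)
    img[16:48, 16:48] = True
    return img

target = np.zeros((64, 64), dtype=bool)
target[16:48, 16:48] = True
print(iou_reward(target, ["rect", "rect", "union"], fake_render))  # prints 1.0
```

Because invalid programs receive exactly zero reward, large regions of the output space are flat under this reward, which is the failure mode discussed for R-SPEN and DVN.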
DVN suffers from the same problem: without access to ground-truth data, the structured outputs generated by gradient-descent inference are often invalid programs, and matching the energy values of invalid programs does not help shape the energy landscape.
The results on this task are shown in Table 1.C (excluding the unsuccessful training of DVN and R-SPEN). SG-SPEN performs much better than the neural shape parser for two reasons: first, the neural shape parser is trained from scratch with policy gradients and no explicit supervision, which makes it difficult to find a valid program in the large program space. Second, rewards are only provided at the end, with no provision for intermediate rewards. In contrast, SG-SPEN makes use of intermediate rewards by searching for better program instructions that increase the IOU score. SG-SPEN quickly picks up informative constraints without explicit ground-truth program supervision (see Appendix C). The other advantage of SG-SPEN over the neural shape parser in this task is its ability to encode long-range dependencies, which enables it to learn the valid-program constraints quickly once the search operator reveals a valid program.
SG-SPEN also achieves higher performance than iterative beam search, although in this scenario, with an exact reward function, iterative beam search with larger beam sizes would eventually attain better IOU, at the cost of significantly longer inference time.

6 Conclusion

We introduce SG-SPEN to enable training of SPENs using supervision provided by reward functions, including human-written functions or complex non-differentiable pipelines. The key ingredients of our training algorithm are sampling from the energy function and then sampling from the reward function through truncated randomized search, which together generate informative optimization constraints.
These constraints gradually guide gradient-descent inference toward finding better predictions according to the reward function. We show that SG-SPEN trains models that achieve better performance than previous methods, such as learning from a reward function with policy gradient methods. Our method also enjoys a simpler training algorithm and a rich representation over output variables. In addition, SG-SPEN facilitates using task-specific domain knowledge to reduce the search output space, which is critical for complex tasks with enormous output spaces. In future work we will explore the use of easily-expressed domain knowledge for further guiding search in lightly supervised learning.

Acknowledgments

We would like to thank David Belanger, Michael Boratko, and the anonymous reviewers for their constructive comments and discussions.
This research was funded by DARPA grant FA8750-17-C-0106. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

References

Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. In Proceedings of the International Conference on Learning Representations, 2017.

Belanger, D. Deep energy-based models for structured prediction. Ph.D. Dissertation, 2017.

Belanger, D. and McCallum, A. Structured prediction energy networks. In Proceedings of the International Conference on Machine Learning, 2016.

Belanger, D., Yang, B., and McCallum, A. End-to-end learning for structured prediction energy networks. In Proceedings of the International Conference on Machine Learning, 2017.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé, III, H., and Langford, J. Learning to search better than your teacher.
In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2058–2066, Lille, France, 07–09 Jul 2015. PMLR.

Chang, M.-W., Ratinov, L., and Roth, D. Guiding semi-supervision with constraint-driven learning. In ACL, pp. 280–287, 2007.

Chang, M.-W., Srikumar, V., Goldwasser, D., and Roth, D. Structured output learning with indirect supervision. In Proceedings of the International Conference on Machine Learning, 2010.

Daumé, III, H., Langford, J., and Sharaf, A. Residual loss prediction: Reinforcement learning with no incremental feedback. In Proceedings of the International Conference on Learning Representations, 2018.

Ganchev, K., Gillenwater, J., Taskar, B., et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.

Gygli, M., Norouzi, M., and Angelova, A. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the International Conference on Machine Learning, 2017.

Hoang, C. D. V., Haffari, G., and Cohn, T. Towards decoding as continuous optimisation in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 146–156, 2017.

Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. Deep neural networks with massive learned knowledge. In EMNLP, pp. 1670–1679, 2016.

Huang, L., Fayong, S., and Guo, Y. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 142–151. Association for Computational Linguistics, 2012.

Kivinen, J. and Warmuth, M. K.
Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Liang, P., Jordan, M. I., and Klein, D. Learning from measurements in exponential families. In Proceedings of the International Conference on Machine Learning, pp. 641–648, 2009.

Maes, F., Denoyer, L., and Gallinari, P. Structured prediction with reinforcement learning. Machine Learning, 77(2-3):271, 2009.

Mann, G. S. and McCallum, A. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11(Feb):955–984, 2010.

Norouzi, M., Bengio, S., Jaitly, N., Schuster, M., Wu, Y., Schuurmans, D., et al. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pp. 1723–1731, 2016.

Peng, H., Chang, M.-W., and Yih, W.-t. Maximum margin reward networks for learning from explicit and implicit supervision. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2368–2378, 2017.

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations, 2016.

Rohanimanesh, K., Bellare, K., Culotta, A., McCallum, A., and Wick, M. L. SampleRank: Training factor graphs with atomic gradients. In Proceedings of the International Conference on Machine Learning, pp. 777–784, 2011.

Rooshenas, A., Kamath, A., and McCallum, A. Training structured prediction energy networks with indirect supervision.
In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.

Seymore, K., McCallum, A., and Rosenfeld, R. Learning hidden Markov model structure for information extraction. In AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 37–42, 1999.

Sharma, G., Goyal, R., Liu, D., Kalogerakis, E., and Maji, S. CSGNet: Neural shape parser for constructive solid geometry. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Stewart, R. and Ermon, S. Label-free supervision of neural networks with physics and domain knowledge. In AAAI, pp. 2576–2582, 2017.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.