{"title": "Graph Structured Prediction Energy Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8690, "page_last": 8701, "abstract": "For joint inference over multiple variables, a variety of structured prediction techniques have been developed to model correlations among variables and thereby improve predictions. However, many classical approaches suffer from one of two primary drawbacks: they either lack the ability to model high-order correlations among variables while maintaining computationally tractable inference, or they do not allow to explicitly model known correlations. To address this shortcoming, we introduce \u2018Graph Structured Prediction Energy Networks,\u2019 for which we develop inference techniques that allow to both model explicit local and implicit higher-order correlations while maintaining tractability of inference. We apply the proposed method to tasks from the natural language processing and computer vision domain and demonstrate its general utility.", "full_text": "Graph Structured Prediction Energy Networks\n\nColin Graber\n\ncgraber2@illinois.edu\n\nAlexander Schwing\naschwing@illinois.edu\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nChampaign, IL\n\nAbstract\n\nFor joint inference over multiple variables, a variety of structured prediction tech-\nniques have been developed to model correlations among variables and thereby\nimprove predictions. However, many classical approaches suffer from one of two\nprimary drawbacks: they either lack the ability to model high-order correlations\namong variables while maintaining computationally tractable inference, or they\ndo not allow to explicitly model known correlations. 
To address this shortcoming, we introduce 'Graph Structured Prediction Energy Networks,' for which we develop inference techniques that allow both explicit local and implicit higher-order correlations to be modeled while maintaining tractability of inference. We apply the proposed method to tasks from the natural language processing and computer vision domains and demonstrate its general utility.

1 Introduction

Many machine learning tasks involve joint prediction of a set of variables. For instance, semantic image segmentation infers the class label for every pixel in an image. To address joint prediction, it is common to use deep nets which model probability distributions independently over the variables (e.g., the pixels). The downside: correlations between different variables are not modeled explicitly. A number of techniques, such as Structured SVMs [1], Max-Margin Markov Nets [2] and Deep Structured Models [3, 4], directly model relations between output variables. However, modeling the correlations between a large number of variables is computationally expensive and therefore generally impractical. As an attempt to address some of the drawbacks of classical high-order structured prediction techniques, Structured Prediction Energy Networks (SPENs) were introduced [5, 6]. SPENs assign a score to an entire prediction, which allows them to harness global structure. Additionally, because these models do not represent structure explicitly, complex relations between variables can be learned while maintaining tractability of inference. However, SPENs have their own set of downsides: Belanger and McCallum [5] mention, and we can confirm, that it is easy to overfit SPENs to the training data.
Additionally, the inference techniques developed for SPENs do not enforce structural constraints among output variables, hence they cannot support structured scores and discrete losses. An attempt to combine locally structured scores with joint prediction was introduced very recently by Graber et al. [7]. However, Graber et al. [7] require the score function to take a specific, restricted form, and inference is formulated as a difficult-to-solve saddle-point optimization problem.

To address these concerns, we develop a new model which we refer to as 'Graph Structured Prediction Energy Network' (GSPEN). Specifically, GSPENs combine the capabilities of classical structured prediction models and SPENs: they can explicitly model local structure when it is known or assumed, while learning an unknown or more global structure implicitly. Additionally, the proposed GSPEN formulation generalizes the approach by Graber et al. [7]. Concretely, inference in GSPENs is a maximization of a generally non-concave function w.r.t. structural constraints, for which we develop two inference algorithms.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We show the utility of GSPENs by comparing to related techniques on several tasks: optical character recognition, image tagging, multilabel classification, and named entity recognition. In general, we show that GSPENs are able to outperform other models. Our implementation is available at https://github.com/cgraber/GSPEN.

2 Background

Let x ∈ X represent the input provided to a model, such as a sentence or an image. In this work, we consider tasks where the outputs take the form y = (y_1, ..., y_K) ∈ Y := ∏_{k=1}^K Y_k, i.e., they are vectors where the k-th variable's domain is the discrete and finite set Y_k = {1, ..., |Y_k|}.
In general, the number of variables K which are part of the configuration y can depend on the observation x. However, for readability only, we assume all y ∈ Y contain K entries, i.e., we drop the dependence of the output space Y on the input x.

All models we consider consist of a function F(x, y; w), which assigns a score to a given configuration y conditioned on input x and is parameterized by weights w. Provided an input x, the inference problem requires finding the configuration ŷ that maximizes this score, i.e., ŷ := argmax_{y∈Y} F(x, y; w).

To find the parameters w of the function F(x, y; w), it is common to use a Structured Support Vector Machine (SSVM) (a.k.a. Max-Margin Markov Network) objective [1, 2]: given a multiset {(x^i, y^i)}_{i=1}^N of data points, each comprised of an input x^i and the corresponding ground-truth configuration y^i, an SSVM attempts to find weights w which maximize the margin between the scores assigned to the ground-truth configuration y^i and the inference prediction:

min_w Σ_{(x^i, y^i)} [ max_{ŷ∈Y} { F(x^i, ŷ; w) + L(ŷ, y^i) } − F(x^i, y^i; w) ].    (1)

Hereby, L(ŷ, y^i) is a task-specific and often discrete loss, such as the Hamming loss, which steers the model towards learning a margin between correct and incorrect outputs. Due to the addition of the task-specific loss L(ŷ, y^i) to the model score F(x^i, ŷ; w), we often refer to the maximization task within Eq. (1) as loss-augmented inference. The procedure to solve loss-augmented inference depends on the considered model, which we discuss next.

Unstructured Models.
Unstructured models, such as feed-forward deep nets, assign a score to each label of variable y_k which is part of the configuration y, irrespective of the label choice of other variables. Hence, the final score function F is the sum of K individual scores f_k(x, y_k; w), one for each variable:

F(x, y; w) := Σ_{k=1}^K f_k(x, y_k; w).    (2)

Because the scores for each output variable do not depend on the scores assigned to other output variables, the inference assignment is determined efficiently by independently finding the maximum score for each variable y_k. The same is true for loss-augmented inference, assuming that the loss decomposes into a sum of independent terms as well.

Classical Structured Models. Classical structured models incorporate dependencies between variables by considering functions that take more than one output space variable y_k as input, i.e., each function depends on a subset r ⊆ {1, ..., K} of the output variables. We refer to the subset of variables via y_r = (y_k)_{k∈r} and use f_r to denote the corresponding function. The overall score for a configuration y is a sum of these functions, i.e.,

F(x, y; w) := Σ_{r∈R} f_r(x, y_r; w).    (3)

Hereby, R is a set containing all of the variable subsets which are required to compute F. The variable subset relations between functions f_r, i.e., the structure, are often visualized using factor graphs or, more generally, Hasse diagrams.

This formulation allows relations between variables to be modeled explicitly, but it comes at the price of more complex inference, which is NP-hard [8] in general. A number of approximations to this problem have been developed and utilized successfully (see Sec. 5 for more details), but the complexity of these methods scales with the size of the largest region r. For this reason, these models commonly consider only unary and pairwise regions, i.e., regions with one or two variables.

Inference, i.e., maximization of the score, is equivalent to the integer linear program

max_{p∈M} Σ_{r∈R} Σ_{y_r∈Y_r} p_r(y_r) f_r(x, y_r; w),    (4)

where each p_r represents a marginal probability vector for region r and M represents the set of p_r whose marginal distributions are globally consistent, which is often called the marginal polytope. Adding an entropy term over the probabilities to the inference objective transforms the problem from maximum a-posteriori (MAP) to marginal inference, and pushes the predictions to be more uniform [9, 10]. When combined with the learning procedure specified above, this entropy provides learning with the additional interpretation of maximum likelihood estimation [9]. The training objective then also fits into the framework of Fenchel-Young losses [11].

For computational reasons, it is common to relax the marginal polytope M to the local polytope M_L, which is the set of all probability vectors that marginalize consistently for the factors present in the graph [9]. Since the resulting marginals are no longer globally consistent, i.e., they are no longer guaranteed to arise from a single joint distribution, we write the predictions for each region using b_r(y_r) instead of p_r(y_r) and refer to them using the term "beliefs." Additionally, the entropy term is approximated using fractional entropies [12] such that it only depends on the factors in the graph, in which case it takes the form

H_R(b) := Σ_{r∈R} Σ_{y_r∈Y_r} −b_r(y_r) log b_r(y_r).

Structured Prediction Energy Networks. Structured Prediction Energy Networks (SPENs) [5] were motivated by the desire to represent interactions between larger sets of output variables without incurring a high computational cost.
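To make the preceding background concrete, the following toy sketch instantiates the factor-sum score of Eq. (3), exact MAP inference by enumeration (for small problems, Eq. (4) is maximized at a vertex of the marginal polytope, i.e., a one-hot configuration), and the fractional entropy H_R(b). All factor values, shapes, and names here are illustrative stand-ins, not the paper's models:

```python
import itertools
import numpy as np

# Toy model: K=3 variables with 2 labels each, unary regions {0},{1},{2}
# and pairwise chain regions {0,1},{1,2}. Factor scores are random stand-ins.
rng = np.random.default_rng(0)
K, L = 3, 2
unary = {k: rng.normal(size=L) for k in range(K)}                   # f_k(x, y_k; w)
pair = {(k, k + 1): rng.normal(size=(L, L)) for k in range(K - 1)}  # f_r(x, y_r; w)

def score(y):
    """F(x, y; w) = sum of factor scores, cf. Eq. (3)."""
    s = sum(unary[k][y[k]] for k in range(K))
    s += sum(pair[r][y[r[0]], y[r[1]]] for r in pair)
    return s

# Exact MAP inference by enumerating all |Y| = L^K configurations.
y_map = max(itertools.product(range(L), repeat=K), key=score)

def fractional_entropy(beliefs):
    """H_R(b) = sum_r sum_{y_r} -b_r(y_r) log b_r(y_r)."""
    return sum(-(b * np.log(b)).sum() for b in beliefs.values())

# Uniform beliefs over the unary regions maximize this entropy term.
uniform = {k: np.full(L, 1 / L) for k in range(K)}
```

Enumeration is only viable for tiny problems; the approximations referenced in Sec. 5 replace it for realistic graphs.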
The SPEN score function takes the following form:

F(x, p_1, ..., p_K; w) := T(f̄(x; w), p_1, ..., p_K; w),    (5)

where f̄(x; w) is a learned feature representation of the input x, each p_k is a one-hot vector, and T is a function that takes these two terms and assigns a score. This representation of the labels, i.e., p_k, is used to facilitate gradient-based optimization during inference. More specifically, inference is formulated via the program:

max_{p_k∈Δ_k ∀k} T(f̄(x; w), p_1, ..., p_K; w),    (6)

where each p_k is constrained to lie in the |Y_k|-dimensional probability simplex Δ_k. This task can be solved using any constrained optimization method. However, for non-concave T the inference solution might only be approximate.

NLStruct. SPENs do not support score functions that contain a structured component. In response, Graber et al. [7] introduced NLStruct, which combines a classical structured score function with a nonlinear transformation applied on top of it to produce a final score. Given a set R as defined previously, the NLStruct score function takes the following form:

F(x, p_R; w) := T(f_R(x; w) ∘ p_R; w),    (7)

where f_R(x; w) := (f_r(x, y_r; w))_{r∈R, y_r∈Y_r} is a vectorized form of the score function for a classical structured model, p_R := (p_r(y_r))_{∀r∈R, ∀y_r∈Y_r} is a vector containing all marginals, '∘' is the Hadamard product, and T is a scalar-valued function.

For this model, inference is formulated as a constrained optimization problem, where Y_R := ∏_{r∈R} Y_r:

max_{y∈R^{|Y_R|}, p_R∈M} T(y; w)  s.t.  y = f_R(x; w) ∘ p_R.    (8)

Forming the Lagrangian of this program and rearranging leads to the saddle-point inference problem

min_λ max_y { T(y; w) − λ^T y } + max_{p_R∈M} λ^T (f_R(x; w) ∘ p_R).    (9)

Notably, the maximization over p_R is solved using techniques developed for classical structured models¹, and the saddle-point problem is optimized using the primal-dual algorithm of Chambolle and Pock [13], which alternates between updating λ, y, and p_R.

¹As mentioned, solving the maximization over p_R tractably might require relaxing the marginal polytope M to the local polytope M_L. For brevity, we will not repeat this fact whenever an inference problem of this form appears throughout the rest of this paper.

Algorithm 1 Frank-Wolfe Inference for GSPEN
1: Input: Initial set of predictions p_R; Input x; Factor graph R
2: for t = 1 ... T do
3:   g ⇐ ∇_{p_R} F(x, p_R; w)
4:   p̂_R ⇐ max_{p̂_R∈M} Σ_{r∈R, y_r∈Y_r} p̂_r(y_r) g_r(y_r)
5:   p_R ⇐ p_R + (1/t)(p̂_R − p_R)
6: end for
7: Return: p_R

Algorithm 2 Structured Entropic Mirror Descent Inference
1: Input: Initial set of predictions p_R; Input x; Factor graph R
2: for t = 1 ... T do
3:   g ⇐ ∇_{p_R} F(x, p_R; w)
4:   a ⇐ 1 + ln p_R + g/√t
5:   p_R ⇐ max_{p̂_R∈M} Σ_{r∈R, y_r∈Y_r} p̂_r(y_r) a_r(y_r) + H_R(p̂_R)
6: end for
7: Return: p_R

3 Graph Structured Prediction Energy Nets

Graph Structured Prediction Energy Networks (GSPENs) generalize all aforementioned models. They combine both a classical structured component as well as a SPEN-like component to score an entire set of predictions jointly. Additionally, the GSPEN score function is more general than that for NLStruct, and includes it as a special case.
After describing the formulation of both the score function and the inference problem (Sec. 3.1), we discuss two approaches to solving inference (Sec. 3.2 and Sec. 3.3) that we found to work well in practice. Unlike the methods described previously for NLStruct, these approaches do not require solving a saddle-point optimization problem.

3.1 GSPEN Model

The GSPEN score function is written as follows:

F(x, p_R; w) := T(f̄(x; w), p_R; w),

where the vector p_R := (p_r(y_r))_{r∈R, y_r∈Y_r} contains one marginal per region per assignment of values to that region. This formulation allows for the use of a structured score function while also allowing T to score an entire prediction jointly. Hence, it is a combination of classical structured models and SPENs. For instance, we can construct a GSPEN model by summing a classical structured model and a multilayer perceptron that scores an entire label vector, in which case the score function takes the form F(x, p_R; w) := Σ_{r∈R} Σ_{y_r∈Y_r} p_r(y_r) f_r(x, y_r; w) + MLP(p_R; w). Of course, this is one of many possible score functions that are supported by this formulation. Notably, we recover the NLStruct score function if we use T(f̄(x; w), p_R; w) = T′(f̄(x; w) ∘ p_R; w) and let f̄(x; w) = f_R(x; w).

Given this model, the inference problem is

max_{p_R∈M} T(f̄(x; w), p_R; w).    (10)

As for classical structured models, the probabilities are constrained to lie in the marginal polytope. In addition, we also consider a fractional entropy term over the predictions, leading to

max_{p_R∈M} T(f̄(x; w), p_R; w) + H_R(p_R).    (11)

In the classical setting, adding an entropy term relates to Fenchel duality [11]. However, the GSPEN inference objective does not take the correct form to use this reasoning.
We instead view this entropy as a regularizer for the predictions: it pushes predictions towards a uniform distribution, smoothing the inference objective, which we empirically observed to improve convergence. The results discussed below indicate that adding entropy leads to better-performing models. Also note that it is possible to add a similar entropy term to the SPEN inference objective, which is mentioned by Belanger and McCallum [5] and Belanger et al. [6].

For inference in GSPEN, SPEN procedures cannot be used since they do not maintain the additional constraints imposed by the graphical model, i.e., the marginal polytope M. We also cannot use the inference procedure developed for NLStruct, as the GSPEN score function does not take the same form. Therefore, in the following, we describe two inference algorithms that optimize the program while maintaining structural constraints.

3.2 Frank-Wolfe Inference

The Frank-Wolfe algorithm [14] is suitable because the objectives in Eqs. (10, 11) are non-linear while the constraints are linear. Specifically, using [14], we compute a linear approximation of the objective at the current iterate, maximize this linear approximation subject to the constraints of the original problem, and take a step towards this maximum.

min_w Σ_{(x^(i), p_R^(i))} [ max_{p̂_R∈M} { T(f̄(x^(i); w), p̂_R; w) + L(p̂_R, p_R^(i)) } − T(f̄(x^(i); w), p_R^(i); w) ]_+

Figure 1: The GSPEN learning formulation, consisting of a Structured SVM (SSVM) objective with loss-augmented inference. Note that each p_R^(i) is a one-hot representation of the labels y^i.

In Algorithm 1 we detail the steps to optimize Eq. (10).
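As an illustration, the following minimal numpy sketch runs the Frank-Wolfe loop of Algorithm 1 on a hypothetical model whose graph has only unary regions, so the marginal polytope is a product of simplices and the inner linear maximization has a closed form (one-hot at the per-variable argmax). The score function, its weights, and all shapes are stand-ins, not the paper's architecture, and the entropy variant is omitted:

```python
import numpy as np

# Toy GSPEN-style score: a structured linear term plus a joint quadratic term.
rng = np.random.default_rng(1)
K, L = 4, 3
f = rng.normal(size=(K, L))          # classical structured (unary) scores
W = rng.normal(size=(K * L, K * L))  # weights of a toy quadratic "T"-style term

def score(p):
    """F(x, p_R; w) = sum_r sum_{y_r} p_r(y_r) f_r(...) + joint term."""
    flat = p.ravel()
    return (p * f).sum() + 0.1 * flat @ W @ flat

def grad(p):
    """Gradient of the score w.r.t. the marginals p_R."""
    flat = p.ravel()
    return f + 0.1 * ((W + W.T) @ flat).reshape(K, L)

p = np.full((K, L), 1.0 / L)  # initialize at uniform marginals
for t in range(1, 51):
    g = grad(p)
    # Inner step: maximize the linear approximation over the polytope.
    # With unary regions only, the maximizer is one-hot at argmax per variable.
    vertex = np.zeros_like(p)
    vertex[np.arange(K), g.argmax(axis=1)] = 1.0
    p = p + (vertex - p) / t  # Frank-Wolfe update with step size 1/t

y_hat = p.argmax(axis=1)  # decode a configuration from the final marginals
```

For general factor graphs, the closed-form inner step above is replaced by a classical structured inference routine, as described next.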
In every iteration, we first calculate the gradient of the score function F with respect to the marginals/beliefs using the current prediction as input. We denote this gradient using g = ∇_{p_R} T(f̄(x; w), p_R; w). The gradient of T depends on the specific function used and is computed via backpropagation. If entropy is part of the objective, an additional term of −ln(p_R) − 1 is added to this gradient.

Next, we find the maximizing beliefs, which is equivalent to inference for classical structured prediction: the constraint space is identical and the objective is a linear function of the marginals/beliefs. Hence, we solve this inner optimization using one of a number of techniques referenced in Sec. 5.

Convergence guarantees for Frank-Wolfe have been proven when the overall objective is concave, continuously differentiable, and has bounded curvature [15], which is the case when T has these properties with respect to the marginals. This is true even when the inner optimization is only solved approximately, which is often the case due to standard approximations used for structured inference. When T is non-concave, convergence can still be guaranteed, but only to a local optimum [16]. Note that entropy has unbounded curvature; therefore, its inclusion in the objective precludes convergence guarantees. Other variants of the Frank-Wolfe algorithm exist which improve convergence in certain cases [17, 18]. We defer a study of these properties to future work.

3.3 Structured Entropic Mirror Descent

Mirror descent, another constrained optimization algorithm, is analogous to projected subgradient descent, albeit using a more general distance than the Euclidean one [19]. This algorithm has been used in the past to solve inference for SPENs, using entropy as the link function ψ and normalizing over each coordinate independently [5]. We similarly use entropy in our case.
However, the additional constraints in the form of the polytope M require special care.

We summarize the structured entropic mirror descent inference for the proposed model in Algorithm 2. Each iteration of mirror descent updates the current prediction p_R and dual vector a in two steps: (1) a is updated based on the current prediction p_R. Using ψ(p_R) = −H_R(p_R) as the link function, this update step takes the form a = 1 + ln p_R + (1/√t) ∇_{p_R} T(f̄(x; w), p_R; w). As mentioned previously, the gradient of T can be computed using backpropagation; (2) p_R is updated by computing the maximizing argument of the Fenchel conjugate of the link function ψ* evaluated at a. More specifically, p_R is updated via

p_R = argmax_{p̂_R∈M} Σ_{r∈R} Σ_{y_r∈Y_r} p̂_r(y_r) a_r(y_r) + H_R(p̂_R),    (12)

which is identical to classical structured prediction.

When the inference objective is concave and Lipschitz continuous (i.e., when T has these properties), this algorithm has also been proven to converge [19]. Unfortunately, we are not aware of any convergence results if the inner optimization problem is solved approximately and if the objective is not concave. In practice, though, we did not observe any convergence issues during experimentation.

3.4 Learning GSPEN Models

GSPENs assign a score to an input x and a prediction p. An SSVM learning objective is applicable, which maximizes the margin between the scores assigned to the correct prediction and the inferred result. The full SSVM learning objective with added loss-augmented inference is summarized in Fig. 1. The learning procedure consists of computing the highest-scoring prediction using one of the inference procedures described in Sec. 3.2 and Sec.
3.3 for each example in a mini-batch and then\nupdating the weights of the model towards making better predictions.\n\n5\n\n\f68.56 s\n\n\u2013\n\u2013\n\u2013\n\n\u2013\n\nStruct SPEN NLStruct GSPEN\n8.41 s\n18.85 s 30.49 s 208.96 s 171.65 s\n13.87 s\n0.36 s 11.75 s\n234.33 s\n6.05 s 94.44 s\n29.16 s\n99.83 s\n\n4 Experiments\nTo demonstrate the utility of our model and to\ncompare inference and learning settings, we\nreport results on the tasks of optical charac-\nOCR (size 1000) 0.40 s 0.60 s\nter recognition (OCR), image tagging, multil-\nTagging\nabel classi\ufb01cation, and named entity recogni-\nBibtex\ntion (NER). For each experiment, we use the\nBookmarks\nfollowing baselines: Unary is an unstructured\nNER\nmodel that does not explicitly model the cor-\nTable 1: Average time to compute inference objective\nrelations between output variables in any way.\nand complete a weight update for one pass through the\nStruct is a classical deep structured model us-\ntraining data. We show all models trained for this work.\ning neural network potentials. We follow the\ninference and learning formulation of [3], where inference consists of a message passing algorithm\nderived using block coordinate descent on a relaxation of the inference problem. SPEN and NLStruct\nrepresent the formulations discussed in Sec. 2. Finally, GSPEN represents Graph Structured Predic-\ntion Energy Networks, described in Sec. 3. For GSPENs, the inner structured inference problems\nare solved using the same algorithm as for Struct. To compare the run-time of these approaches,\nTable 1 gives the average epoch compute time (i.e., time to compute the inference objective and\nupdate model weights) during training for our models for each task. 
In general, GSPEN training\nwas more ef\ufb01cient with respect to time than NLStruct but, expectedly, more expensive than SPEN.\nAdditional experimental details, including hyper-parameter settings, are provided in Appendix A.2.\n4.1 Optical Character Recognition (OCR)\nFor the OCR experiments, we generate data by\nselecting a list of 50 common 5-letter English\nwords, such as \u2018close,\u2019 \u2018other,\u2019 and \u2018world.\u2019 To\ncreate each data point, we choose a word from\nthis list and render each letter as a 28x28 pixel\nimage by selecting a random image of the letter\nfrom the Chars74k dataset [20], randomly shift-\ning, scaling, rotating, and interpolating with a\nrandom background image patch. A different\npool of backgrounds and letter images was used\nfor the training, validation, and test splits of the\ndata. The task is to identify the words given 5 ordered images. We create three versions of this dataset\nusing different interpolation factors of \u03b1 \u2208 {0.3, 0.5, 0.7}, where each pixel in the \ufb01nal image is\ncomputed as \u03b1xbackground + (1 \u2212 \u03b1)xletter. See Fig. 2 for a sample from each dataset. This process\nwas deliberately designed to ensure that information about the structure of the problem (i.e., which\nwords exist in the data) is a strong signal, while the signal provided by each individual letter image\ncan be adjusted. The training, validation, and test set sizes for each dataset are 10,000, 2,000, and\n2,000, respectively. During training we vary the training data to be either 200, 1k or 10k.\nTo study the inference algorithm, we train four different GSPEN models on the dataset containing\n1000 training points and using \u03b1 = 0.5. Each model uses either Frank-Wolfe or Mirror Descent\nand included/excluded the entropy term. To maintain tractability of inference, we \ufb01x a maximum\niteration count for each model. We additionally investigate the effect of this maximum count on \ufb01nal\nperformance. 
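The interpolation used to generate these images can be sketched as follows; the array contents are illustrative, and only the blending formula comes from the dataset description above:

```python
import numpy as np

def blend(letter: np.ndarray, background: np.ndarray, alpha: float) -> np.ndarray:
    """Composite a letter image onto a background patch, per the dataset
    construction: each pixel is alpha * background + (1 - alpha) * letter.
    Larger alpha makes the letter harder to recognize."""
    assert letter.shape == background.shape
    return alpha * background + (1.0 - alpha) * letter

# Illustrative 28x28 grayscale images with values in [0, 1].
rng = np.random.default_rng(0)
letter = rng.random((28, 28))
background = rng.random((28, 28))
hard = blend(letter, background, alpha=0.7)  # mostly background
easy = blend(letter, background, alpha=0.3)  # mostly letter
```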
Additionally, we run this experiment by initializing from two different Struct models,\none being trained using entropy during inference and one being trained without entropy. The results\nfor this set of experiments are shown in Fig. 3a. Most con\ufb01gurations perform similarly across the\nnumber of iterations, indicating these choices are suf\ufb01cient for convergence. When initializing from\nthe models trained without entropy, we observe that both Frank-Wolfe without entropy and Mirror\nDescent with entropy performed comparably. When initializing from a model trained with entropy,\nthe use of mirror descent with entropy led to much better results.\nThe results for all values of \u03b1 using a train dataset size of 1000 are presented in Fig. 3b, and results\nfor all train dataset sizes with \u03b1 = 0.5 are presented in Fig. 3c. We observe that, in all cases, GSPEN\noutperforms all baselines. The degree to which GSPEN outperforms other models depends most\non the amount of train data: with a suf\ufb01ciently large amount of data, SPEN and GSPEN perform\ncomparably. However, when less data is provided, GSPEN performance does not drop as sharply as\nthat of SPEN initially. It is also worth noting that GSPEN outperformed NLStruct by a large margin.\n\nFigure 2: OCR sample data points with different inter-\npolation factors \u03b1.\n\n6\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 3: Experimental results on OCR data. The dashed lines in (a) represent models trained from Struct\nwithout entropy, while solid lines represent models trained from Struct with entropy.\n\nThe NLStruct model is less stable due to its saddle-point formulation. Therefore it is much harder to\nobtain good performance with this model.\n\nImage Tagging\n\n4.2\nNext, we evaluate on the MIRFLICKR25k dataset [21], which consists of 25,000 images taken\nfrom Flickr. Each image is assigned a subset of 24 possible tags. 
The train/val/test sets for these\nexperiments consist of 10,000/5,000/10,000 images, respectively.\nWe compare to NLStruct and SPEN. We initialize the structured portion of our GSPEN model\nusing the pre-trained DeepStruct model described by Graber et al. [7], which consists of unary\npotentials produced from an AlexNet architecture [22] and linear pairwise potentials of the form\nfi,j(yi, yj, W ) = Wi,j,xi,xj , i.e., containing one weight per pair in the graph per assignment of\nvalues to that pair. A fully-connected pairwise graph was used. The T function for our GSPEN\nmodel consists of a 2-layer MLP with 130 hidden units. It takes as input a concatenation of the unary\npotentials generated by the AlexNet model and the current prediction. Additionally, we train a SPEN\nmodel with the same number of layers and hidden units. We used Frank-Wolfe without entropy for\nboth SPEN and GSPEN inference.\nThe results are shown in Fig. 4. GSPEN obtains\nsimilar test performance to the NLStruct model,\nand both outperform SPEN. However, the NL-\nStruct model was run for 100 iterations during in-\nference without reaching \u2018convergence\u2019 (change\nof objective smaller than threshold), while the\nGSPEN model required an average of 69 iter-\nations to converge at training time and 52 iter-\nations to converge at test time. Our approach\nhas the advantage of requiring fewer variables to\nmaintain during inference and requiring fewer\niterations of inference to converge. The \ufb01nal\ntest losses for SPEN, NLStruct and GSPEN are\n2.158, 2.037, and 2.029, respectively.\n\nFigure 4: Results for image tagging.\n\n4.3 Multilabel Classi\ufb01cation\nWe use the Bibtex and Bookmarks multilabel datasets [23]. They consist of binary-valued input feature\nvectors, each of which is assigned some subset of 159/208 possible labels for Bibtex/Bookmarks,\nrespectively. We train unary and SPEN models with architectures identical to [5] and [24] but add\ndropout layers. 
In addition, we further regularize the unary model by flipping each bit of the input vectors with probability 0.01 when sampling mini-batches. For Struct and GSPEN, we generate a graph by first finding the label variable that is active in most training data label vectors and adding edges connecting every other variable to this most active one. Pairwise potentials are generated by passing the input vector through a 2-layer MLP with 1k hidden units. The GSPEN model is trained by starting from the SPEN model, fixing its parameters, and training the pairwise potentials.

The results are in Table 2 alongside those taken from [5] and [24]. We found the Unary models to perform similarly to or better than previous best results. Both SPEN and Struct are able to improve upon these Unary results. GSPEN outperforms all configurations, suggesting that the contributions of the SPEN component and the Struct component to the score function are complementary.

Table 2: Multilabel classification results for all models. All entries represent macro F1 scores. The top results are taken from the cited publications.

           Bibtex              Bookmarks
           Validation  Test    Validation  Test
SPEN [5]   –           42.2    –           34.4
DVN [24]   –           44.7    –           37.1
Unary      43.3        44.1    38.4        37.4
Struct     45.8        46.1    39.7        38.9
SPEN       46.6        46.5    40.2        39.2
GSPEN      47.5        48.6    41.2        40.7

Table 3: Named Entity Recognition results for all models. All entries represent F1 scores averaged over five trials.

              Avg. Val.       Avg. Test
Struct [26]   94.88 ± 0.18    91.37 ± 0.04
+ GSPEN       94.97 ± 0.16    91.51 ± 0.17
Struct [27]   95.88 ± 0.10    92.79 ± 0.08
+ GSPEN       95.96 ± 0.08    92.69 ± 0.17

4.4 NER
We also assess suitability for Named Entity Recognition (NER) using the English portion of the CoNLL 2003 shared task [25]. To demonstrate the applicability of GSPEN for this task, we transformed two separate models, specifically the ones presented by Ma and Hovy [26] and Akbik et al. [27], into GSPENs by taking their respective score functions and adding a component that jointly scores an entire set of predictions. In each case, we first train six instances of the structured model using different random initializations and drop the model that performs the worst on validation data. We then train the GSPEN model, initializing the structured component from these pre-trained models. The final average performance is presented in Table 3, and individual trial information can be found in Table 4 in the appendix. When comparing to the model described by Ma and Hovy [26], GSPEN improves the final test performance in four out of the five trials, and GSPEN has a higher overall average performance across both validation and test data. Compared to Akbik et al. [27], on average GSPEN's validation score was higher, but it performed slightly worse at test time. Overall, these results demonstrate that it is straightforward to augment a task-specific structured model with an additional prediction scoring function, which can lead to improved final task performance.

5 Related Work
A variety of techniques have been developed to model structure among output variables, originating from the seminal works of [1, 2, 28]. These works focus on extending linear classification, both probabilistic and non-probabilistic, to model the correlation among output variables.
Generally
speaking, scores representing predictions both for individual output variables and for combinations
of output variables are used. A plethora of techniques have been developed to solve inference for
problems of this form, e.g., [9, 12, 29–62]. As exact inference for general structures is NP-hard [8],
early work focused on tractable exact inference. However, due to interest in modeling problems
with intractable structure, many approaches have been studied for learning with approximate
inference [63–70]. More recent work has also investigated the role of different types of prediction
regularization, with Niculae et al. [10] replacing the often-used entropy regularization with an L2
norm and Blondel et al. [11] casting both as special cases of a Fenchel-Young loss framework.
To model both non-linearity and structure, deep learning and structured prediction techniques were
combined. Initially, local, per-variable score functions were learned with deep nets, and correlations
among output variables were learned in a separate second stage [71, 72]. Later work simplified this
process, learning both local score functions and variable correlations jointly [3, 4, 73–75].
Structured Prediction Energy Networks (SPENs), introduced by Belanger and McCallum [5], take
a different approach to modeling structure. Instead of explicitly specifying a structure a priori and
enumerating scores for every assignment of labels to regions, SPENs learn a function which assigns
a score to an input and a label. Inference uses gradient-based optimization to maximize the score
w.r.t. the label. Belanger et al. [6] extend this technique by unrolling inference in a manner inspired
by Domke [76]. Both approaches involve iterative inference procedures, which are slower than the
feed-forward prediction of deep nets. To improve inference speed, Tu and Gimpel [77] learn a neural
net to produce the same output as the gradient-based methods.
Deep Value Networks [24] follow
the same approach as Belanger and McCallum [5] but use a different objective that encourages the
score to equal the task loss of the prediction. None of these approaches permits the inclusion of known
structure; the proposed approach enables this.

Our approach is most similar to our earlier work [7], which combines explicitly-specified structured
potentials with a SPEN-like score function. The score function of our earlier work is a special
case of the one presented here. In fact, our earlier work required a classical structured prediction model
as an intermediate layer of the score function, an assumption we no longer make.
Additionally, in our earlier work we had to solve inference via a computationally challenging saddle-
point objective. Another related approach is described by Vilnis et al. [78], whose score function
is the sum of a classical structured score function and a (potentially non-convex) function of the
marginal probability vector pR. This is also a special case of the score function presented here.
Additionally, the inference algorithm they develop is based on regularized dual averaging [79] and
takes advantage of the structure of their specific score function; it is therefore not directly applicable to our
setting.

6 Conclusions

The developed GSPEN model combines the strengths of several prior approaches to structured
prediction. It allows machine learning practitioners to incorporate inductive bias, in the form
of known structure, into a model while implicitly capturing higher-order correlations among output
variables. The model formulation described here is more general than previous attempts to combine
explicit local and implicit global structure modeling, while not requiring inference to solve a saddle-
point problem.

Acknowledgments

This work is supported in part by NSF under Grant No.
1718221 and MRI #1725729, UIUC, Samsung,
3M, Cisco Systems Inc. (Gift Award CG 1377144) and Adobe. We thank NVIDIA for providing
GPUs used for this work and Cisco for access to the Arcetri cluster.

References
[1] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 2005.
[2] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.
[3] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning Deep Structured Models. In Proc. ICML, 2015.
[4] A. G. Schwing and R. Urtasun. Fully Connected Deep Structured Networks. arXiv preprint arXiv:1503.02351, 2015.
[5] D. Belanger and A. McCallum. Structured Prediction Energy Networks. In Proc. ICML, 2016.
[6] D. Belanger, B. Yang, and A. McCallum. End-to-End Learning for Structured Prediction Energy Networks. In Proc. ICML, 2017.
[7] C. Graber, O. Meshi, and A. Schwing. Deep Structured Prediction with Nonlinear Output Transformations. In Proc. NeurIPS, 2018.
[8] S. E. Shimony. Finding MAPs for Belief Networks is NP-hard. AI, 1994.
[9] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families and Variational Inference. FTML, 2008.
[10] V. Niculae, A. F. T. Martins, M. Blondel, and C. Cardie. SparseMAP: Differentiable Sparse Structured Inference. In Proc. ICML, 2018.
[11] M. Blondel, A. F. T. Martins, and V. Niculae. Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms. arXiv preprint arXiv:1805.09717, 2018.
[12] T. Heskes, K. Albers, and B. Kappen. Approximate Inference and Constrained Optimization. In Proc. UAI, 2003.
[13] A. Chambolle and T. Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. JMIV, 2011.
[14] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. NRLQ, 1956.
[15] M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proc. ICML, 2013.
[16] S. Lacoste-Julien. Convergence Rate of Frank-Wolfe for Non-Convex Objectives. arXiv preprint arXiv:1607.00345, 2016.
[17] R. G. Krishnan, S. Lacoste-Julien, and D. Sontag. Barrier Frank-Wolfe for Marginal Inference. In Proc. NIPS, 2015.
[18] S. Lacoste-Julien and M. Jaggi. On the Global Linear Convergence of Frank-Wolfe Optimization Variants. In Proc. NIPS, 2015.
[19] A. Beck and M. Teboulle. Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization. Oper. Res. Lett., 2003.
[20] T. de Campos, B. R. Babu, and M. Varma. Character Recognition in Natural Images. In Proc. VISAPP, 2009.
[21] M. J. Huiskes and M. S. Lew. The MIR Flickr Retrieval Evaluation. In Proc. ICMR, 2008.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.
[23] I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel Text Classification for Automated Tag Suggestion. ECML PKDD Discovery Challenge, 2008.
[24] M. Gygli, M. Norouzi, and A. Angelova. Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs. In Proc. ICML, 2017.
[25] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proc. NAACL-HLT, 2003.
[26] X. Ma and E. Hovy. End-to-End Sequence Labeling via Bi-Directional LSTM-CNNs-CRF. In Proc. ACL, 2016.
[27] A. Akbik, D. Blythe, and R. Vollgraf. Contextual String Embeddings for Sequence Labeling. In Proc. COLING, 2018.
[28] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.
[29] M. I. Schlesinger. Sintaksicheskiy analiz dvumernykh zritelnikh signalov v usloviyakh pomekh (Syntactic Analysis of Two-Dimensional Visual Signals in Noisy Conditions). Kibernetika, 1976.
[30] T. Werner. A Linear Programming Approach to Max-Sum Problem: A Review. PAMI, 2007.
[31] Y. Boykov, O. Veksler, and R. Zabih. Markov Random Fields with Efficient Approximations. In Proc. CVPR, 1998.
[32] Y. Boykov, O. Veksler, and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. PAMI, 2001.
[33] M. J. Wainwright and M. I. Jordan. Variational Inference in Graphical Models: The View from the Marginal Polytope. In Proc. Conf. on Control, Communication and Computing, 2003.
[34] A. Globerson and T. Jaakkola. Approximate Inference Using Planar Graph Decomposition. In Proc. NIPS, 2006.
[35] M. Welling. On the Choice of Regions for Generalized Belief Propagation. In Proc. UAI, 2004.
[36] D. Sontag, D. K. Choe, and Y. Li. Efficiently Searching for Frustrated Cycles in MAP Inference. In Proc. UAI, 2012.
[37] D. Batra, S. Nowozin, and P. Kohli. Tighter Relaxations for MAP-MRF Inference: A Local Primal-Dual Gap Based Separation Algorithm. In Proc. AISTATS, 2011.
[38] D. Sontag, T. Meltzer, A. Globerson, and T. Jaakkola. Tightening LP Relaxations for MAP Using Message Passing. In Proc. NIPS, 2008.
[39] D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. In Proc. NIPS, 2007.
[40] D. Sontag and T. Jaakkola. Tree Block Coordinate Descent for MAP in Graphical Models. In Proc. AISTATS, 2009.
[41] K. Murphy and Y. Weiss. Loopy Belief Propagation for Approximate Inference: An Empirical Study. In Proc. UAI, 1999.
[42] O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. Convexifying the Bethe Free Energy. In Proc. UAI, 2009.
[43] A. Globerson and T. Jaakkola. Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations. In Proc. NIPS, 2007.
[44] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A New Class of Upper Bounds on the Log Partition Function. Trans. Information Theory, 2005.
[45] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP Estimation via Agreement on Trees: Message-Passing and Linear Programming. Trans. Information Theory, 2005.
[46] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-Based Reparameterization Framework for Analysis of Sum-Product and Related Algorithms. Trans. Information Theory, 2003.
[47] T. Heskes. Convexity Arguments for Efficient Minimization of the Bethe and Kikuchi Free Energies. AI Research, 2006.
[48] T. Hazan and A. Shashua. Norm-Product Belief Propagation: Primal-Dual Message-Passing for LP-Relaxation and Approximate Inference. Trans. Information Theory, 2010.
[49] T. Hazan and A. Shashua. Convergent Message-Passing Algorithms for Inference Over General Graphs with Convex Free Energies. In Proc. UAI, 2008.
[50] C. Yanover, T. Meltzer, and Y. Weiss. Linear Programming Relaxations and Belief Propagation – An Empirical Study. JMLR, 2006.
[51] T. Meltzer, A. Globerson, and Y. Weiss. Convergent Message Passing Algorithms: A Unifying View. In Proc. UAI, 2009.
[52] Y. Weiss, C. Yanover, and T. Meltzer. MAP Estimation, Linear Programming and Belief Propagation with Convex Free Energies. In Proc. UAI, 2007.
[53] T. Heskes. Stable Fixed Points of Loopy Belief Propagation are Minima of the Bethe Free Energy. In Proc. NIPS, 2002.
[54] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized Belief Propagation. In Proc. NIPS, 2001.
[55] A. T. Ihler, J. W. Fisher, and A. S. Willsky. Message Errors in Belief Propagation. In Proc. NIPS, 2004.
[56] W. Wiegerinck and T. Heskes. Fractional Belief Propagation. In Proc. NIPS, 2003.
[57] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Dual MAP LP Relaxation Solvers Using Fenchel-Young Margins. In Proc. NIPS, 2012.
[58] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Parallel MAP LP Relaxation Solver Using the Frank-Wolfe Algorithm. In Proc. ICML, 2014.
[59] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[60] N. Komodakis, N. Paragios, and G. Tziritas. MRF Energy Minimization & Beyond via Dual Decomposition. PAMI, 2010.
[61] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and Strong: MAP Inference with Linear Convergence. In Proc. NIPS, 2015.
[62] O. Meshi and A. G. Schwing. Asynchronous Parallel Coordinate Minimization for MAP Inference. In Proc. NIPS, 2017.
[63] T. Finley and T. Joachims. Training Structural SVMs when Exact Inference is Intractable. In Proc. ICML, 2008.
[64] A. Kulesza and F. Pereira. Structured Learning with Approximate Inference. In Proc. NIPS, 2008.
[65] P. Pletscher, C. S. Ong, and J. M. Buhmann. Entropy and Margin Maximization for Structured Output Learning. In Proc. ECML PKDD, 2010.
[66] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
[67] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning Efficiently with Approximate Inference via Dual Losses. In Proc. ICML, 2010.
[68] N. Komodakis. Efficient Training for Pairwise or Higher Order CRFs via Dual Decomposition. In Proc. CVPR, 2011.
[69] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[70] O. Meshi, M. Mahdavi, A. Weller, and D. Sontag. Train and Test Tightness of LP Relaxations in Structured Prediction. In Proc. ICML, 2016.
[71] J. Alvarez, Y. LeCun, T. Gevers, and A. Lopez. Semantic Road Segmentation via Multi-Scale Ensembles of Learned Features. In Proc. ECCV, 2012.
[72] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In Proc. ICLR, 2015.
[73] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation. In Proc. NIPS, 2014.
[74] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. In Proc. ICCV, 2015.
[75] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Deeply Learning the Messages in Message Passing Inference. In Proc. NIPS, 2015.
[76] J. Domke. Generic Methods for Optimization-Based Modeling. In Proc. AISTATS, 2012.
[77] L. Tu and K. Gimpel. Learning Approximate Inference Networks for Structured Prediction. In Proc. ICLR, 2018.
[78] L. Vilnis, D. Belanger, D. Sheldon, and A. McCallum. Bethe Projections for Non-Local Inference. In Proc. UAI, 2015.
[79] L. Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. JMLR, 2010.
[80] J. Pennington, R. Socher, and C. Manning. GloVe: Global Vectors for Word Representation. In Proc. EMNLP, 2014.
[81] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural Architectures for Named Entity Recognition. In Proc. NAACL-HLT, 2016.
", "award": [], "sourceid": 4676, "authors": [{"given_name": "Colin", "family_name": "Graber", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Alexander", "family_name": "Schwing", "institution": "University of Illinois at Urbana-Champaign"}]}