{"title": "A Primal Dual Formulation For Deep Learning With Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 12157, "page_last": 12168, "abstract": "For several problems of interest, there are natural constraints which exist over the output label space. For example, for the joint task of NER and POS labeling, these constraints might specify that the NER label \u2018organization\u2019 is consistent only with the POS labels \u2018noun\u2019 and \u2018preposition\u2019. These constraints can be a great way of injecting prior knowledge into a deep learning model, thereby improving overall performance. In this paper, we present a constrained optimization formulation for training a deep network with a given set of hard constraints on output labels. Our novel approach first converts the label constraints into soft logic constraints over probability distributions outputted by the network. It then converts the constrained optimization problem into an alternating min-max optimization with Lagrangian variables defined for each constraint. Since the constraints are independent of the target labels, our framework easily generalizes to semi-supervised setting. We experiment on the tasks of Semantic Role Labeling (SRL), Named Entity Recognition (NER) tagging, and fine-grained entity typing and show that our constraints not only significantly reduce the number of constraint violations, but can also result in state-of-the-art performance", "full_text": "A Primal-Dual Formulation for Deep Learning with\n\nConstraints\n\nYatin Nandwani, Abhishek Pathak, Mausam and Parag Singla\n\n{yatin.nandwani,abhishek.pathak.cs115,mausam,parags}@cse.iitd.ac.in\n\nDepartment of Computer Science and Engineering\n\nIndian Institute of Technology Delhi\n\nAbstract\n\nFor several problems of interest, there are natural constraints which exist over the\noutput label space. 
For example, for the joint task of NER and POS labeling, these\nconstraints might specify that the NER label \u2018organization\u2019 is consistent only with\nthe POS labels \u2018noun\u2019 and \u2018preposition\u2019. These constraints can be a great way of\ninjecting prior knowledge into a deep learning model, thereby improving overall\nperformance. In this paper, we present a constrained optimization formulation for\ntraining a deep network with a given set of hard constraints on output labels. Our\nnovel approach \ufb01rst converts the label constraints into soft logic constraints over\nprobability distributions outputted by the network. It then converts the constrained\noptimization problem into an alternating min-max optimization with Lagrangian\nvariables de\ufb01ned for each constraint. Since the constraints are independent of\nthe target labels, our framework easily generalizes to semi-supervised setting.\nWe experiment on the tasks of Semantic Role Labeling (SRL), Named Entity\nRecognition (NER) tagging, and \ufb01ne-grained entity typing and show that our\nconstraints not only signi\ufb01cantly reduce the number of constraint violations, but\ncan also result in state-of-the-art performance.\n\n1\n\nIntroduction\n\nDeep neural models have become the state of the art in many domains including vision, NLP and\nspeech processing. In the vanilla setting, they are trained end to end from data and without additional\nknowledge about the task (other than neural architecture and loss function). However, for many\nproblems of interest (e.g., structured prediction or multi-task learning), there is a set of natural\nconstraints which need to be satis\ufb01ed over the output variables. For example, for the task of NER and\nPOS labeling, the constraint might specify that a word which is given the NER label \u2018institution\u2019 must\nhave the POS label \u2018noun\u2019 or a \u2018preposition\u2019. 
Or in 3D human pose estimation from a single view, one may impose symmetry constraints, like equal length of the two arms, equal distance of the shoulders from the spine, etc. (Márquez-Neila et al. [2017]). These constraints can be seen as additional background knowledge made available by the domain experts. Incorporating these constraints into a model can presumably regularize the output space, resulting in improved predictions.
One line of work trains the neural models without this knowledge, but imposes constraints at inference time (e.g., Lee et al. [2019]). We argue in this paper that this is bound to be sub-optimal, since the original training of the network was done oblivious to the constraints. Though the deep network could, in principle, learn these directly from the data, in practice this is true only when the available training data is large. Our experiments reveal a large number of constraint violations from such unconstrained models when trained in low data settings. Rather, modeling the constraints explicitly during training gives a strong prior to the model – it not only reduces constraint violations but can also result in significantly improved predictions by making the training constraint-aware.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we present a principled solution to the problem of learning a deep network with a given set of hard constraints on output labels. We first formulate this as a constrained optimization problem wherein we optimize the original learning objective (e.g., cross entropy) subject to the constraints being satisfied for each example in the training data. When the network makes predictions in the form of probabilities over the output variables, we can rewrite the constraints on output labels as constraints on the probabilities output by the network, using soft logic (Bröcheler et al. [2010], Novák [1987]).
This rewrite can be seen as imposing constraints over the set of allowed distributions over the output space, i.e., allowing only those distributions that satisfy the constraints. We then convert this problem into a Lagrangian formulation, with one Lagrange variable per constraint. We solve it using an alternating min-max optimization. Though the resulting problem can be highly non-convex non-concave, convergence guarantees to a local min-max point (in the limit) follow from the theory of min-max optimization (Jin et al. [2019]). Since our constraints are specified over the predicted variables (no target values are involved), our formulation easily extends to the semi-supervised setting, where the unlabeled data only contributes to the constraint terms in the formulation.
We note that there have been a few recent attempts at adding constraints during training time, but with significant differences from our work. These include the work by Xu et al. [2018], Mehta et al. [2018] and Diligenti et al. [2017b]. While these existing approaches model the constraints as soft and incorporate a constraint violation penalty directly in the loss term, our constraints are modeled as hard, and we resort to a full Lagrangian-based optimization. The existing work can require an exponential sum to be computed (Xu et al. [2018]), require an additional constraint violation penalty as input with no explicit convergence guarantees (Mehta et al. [2018]), or require a specific functional form for the constraints (Diligenti et al. [2017b]). In contrast, our formulation is tractable because we do not compute an exponential sum; it requires no additional constraint violation term, converges to a stationary point of the objective function, and does not assume a specific functional form for the constraints.
We detail these differences in the related work.
We experiment on three different NLP tasks: (a) semantic role labeling, (b) NER tagging, and (c) fine-grained entity typing over a hierarchical label space. Our experiments clearly demonstrate that, in the low data setting, our constraint-based learning not only reduces the number of violations, but also results in significantly improved prediction accuracy compared to unconstrained baselines as well as vanilla post-processing of constraints at inference time. Semi-supervised learning results in further improvements using our formulation. Furthermore, in some cases, our approach altogether eliminates the need for post-processing of constraints, since they have already been learned by the neural model. For two of the tasks, we obtain state-of-the-art results for small data sizes. On NER, our constrained learning completely eliminates the need for further post-processing with constraints, saving on precious inference time.
Our contributions in this paper can be summarized as follows: (1) We present a principled approach for incorporating domain knowledge in the form of hard constraints. We present a Lagrangian-based formulation for learning with constraints in a deep network. Our constraints make use of soft rules to deal with logical operators. (2) We employ a min-max based optimization to solve our constrained formulation. To the best of our knowledge, we are the first to use a min-max based formulation for learning with constraints specified over output variables in deep networks. Convergence to local min-max points (Jin et al.
[2019]) in the limit follows from the theory of min-max optimization. (3) We present experimental results on three different tasks demonstrating the effectiveness of our approach, while achieving state-of-the-art results in two of the domains and also significantly reducing the number of constraint violations in each case.

2 Related Work

Use of hard (or soft) constraints in machine learning models predates modern deep learning. Much of this work is concerned with constrained inference. Constrained conditional models use integer linear programming to perform inference with global hard constraints (Roth and Yih [2005], Chang et al. [2013]). Other approaches have used dual decomposition to solve locally decomposable constraints (Rush and Collins [2012]). A recent attempt incorporates constraints during inference by incrementally adjusting the learned weights of the network, forcing the probability of the currently predicted non-satisfying state towards zero (Lee et al. [2019]).
Our focus in this work is learning with constraints. Posterior regularization (Ganchev et al. [2010]) adapts the learned distribution (post facto) so as to satisfy structural constraints over latent variables in expectation. Chen and Deng [2013] have employed a primal-dual based formulation for optimizing with constraints in deep models, but their constraints are specified over the weights in a recurrent neural network and are only concerned with imparting stability to the overall learning algorithm. In deep learning models, one of the ways to regularize the output space is through a CRF layer (Koller and Friedman [2009]) at the end of a deep network (Lample et al. [2016]). This has met with partial success in vision (Knöbelreiter et al. [2017]) as well as in NLP, with some state-of-the-art models deploying this either as a post-processing step or jointly integrated with training (Huang et al. [2015], Chen et al.
[2018]).\nThere have been some recent attempts to explicitly incorporate constraints over the output space\nduring training of a deep network. Hu et al. [2016] perform posterior regularization over the weights\nbeing learned at each step, so that the resultant distribution satis\ufb01es a given set of logical rules\n(constraints). This rule regularized network (teacher) is then used to guide the learning of the original\nnetwork (student) which balances between optimizing the likelihood based objective and mimicking\nthe teacher network. Their work is different from ours in two main aspects. (1) The imposition of\nconstraints in their algorithm is only indirect by mimicking the rule regularized network. In contrast,\nwe optimize for satisfying the constraints directly. (2) They model constraints as soft whereas we\ndeal with hard constraints. Further, they could achieve limited success in their experiments.\nXu et al. [2018] incorporate constraints by forcing the probability of states violating the constraints\nto zero. They model this as a soft constraint by incorporating a constraint violation penalty in the\nloss function. More importantly, they need to (pre-)compile a circuit for every constraint in order to\ncompute the sum over all the non-satisfying states. Computing these circuits can be NP hard in many\ncases leading to intractability. In contrast, we model constraints as hard through the use of Lagrangian.\nOur formulation disallows the non-satisfying states directly through the use of constraints, and does\nnot require an exponential sum to be computed.\nMehta et al. [2018] present an approach for learning with constraints and demonstrate the effectiveness\nof their method for the speci\ufb01c task of Semantic Role Labeling (SRL). They model the constraints by\nadding another term to the loss which penalizes the currently predicted state for violating the constraint.\nThey require an additional constraint violation penalty from the designer. 
Their approach can be seen as a local search in the weight space, so that the resultant weights result in a satisfying assignment. They do not have a global metric which optimizes the weights for satisfying the constraints (e.g., forcing the non-satisfying states to zero probability), and it is not clear whether their algorithm will converge in the limit. In contrast, we can provide convergence guarantees to min-max points of the objective, and we experiment on a variety of NLP tasks.
Diligenti et al. [2017b] propose an approach for learning with constraints, where the constraints are specified as logical formulas. Though their approach seems similar to ours at the outset, there are some important differences. Unlike us, they do not work with a full Lagrangian formulation. Their approach simply modifies the loss and cannot handle hard constraints. Further, they require the constraint function to be of a specific form (i.e., functionals in the range [0, 1]), and present experiments only on a single task. Our work makes no such assumption about the form of the constraints, and we experiment on a variety of tasks.

3 Constrained Learning of Neural Models

Consider learning a neural network over a set of training examples given as {x^(i), y^(i)}_{i=1}^m. Each x^(i) ∈ R^n represents an n-dimensional feature vector in a real-valued space. Each y^(i) ∈ V^r represents an r-dimensional target (label) vector, where each element of the vector takes values from a discrete (or continuous) valued space denoted by V. Note that r may be input-dependent, e.g., in sequence labeling tasks. Given the parameters w of the network, let l_w(ŷ^(i), y^(i)) denote the loss obtained by predicting ŷ^(i) when the target is y^(i). For instance, this could be the cross-entropy loss when the labels are discrete.
The goal of learning is to find a set of parameters w* of the network such that the average loss L(w) = (1/m) Σ_{i=1}^{m} l_w(ŷ^(i), y^(i)) between the network outputs ŷ^(i) and the target values y^(i) is minimized, i.e., w* = argmin_w L(w).
In this work, we are interested in a scenario where we are additionally provided with a set of (hard) constraints which hold over the output label space. We assume that these constraints are provided to us by the domain experts and are available in the form of background knowledge. Our goal is to incorporate this background knowledge to learn a more robust and generalizable model. Our formulation is based on constructing a Lagrangian, which tries to minimize the original objective subject to the given constraints. We solve our problem using an alternating optimization over a max-min formulation.

3.1 A Lagrangian-based Formulation

Let us assume we are given a set of K constraints as {C_1(ŷ), C_2(ŷ), ..., C_K(ŷ)}. We will use index k to vary over the constraint set. Each constraint is a function of the predicted values ŷ on a given example x. Since each of the network outputs is in turn directly a function of the weights w of the network (for a given x), for ease of notation, we will simply write the constraint set as {C_1(w), C_2(w), ..., C_K(w)}.
Note that the dependence on the input vector x is implicit in this notation. Further, without loss of generality, we will assume that each of our constraints C_k(w) is expressed as an inequality constraint over an appropriately defined function f_k(w), i.e., C_k(w) : {f_k(w) ≤ 0}. This can also model constraints of the form f_k(w) = 0, by replacing them with the two inequality constraints f_k(w) ≤ 0 and −f_k(w) ≤ 0. When dealing with constraints over the set of m training examples, we incorporate the dependence on the i-th example by including the index in the superscript, i.e., we denote the k-th constraint over the i-th example as C^i_k(w). We are now ready to define our constrained formulation:

argmin_w L(w)  subject to  f^i_k(w) ≤ 0;  ∀ 1 ≤ i ≤ m; ∀ 1 ≤ k ≤ K.   (1)

One problem with the above formulation is that it has O(mK) constraints. In particular, the number of constraints grows linearly with the number of examples, which may become unwieldy. We use the following trick to reduce the number of constraints. Since we are only interested in eliminating the states that do not satisfy the constraints, we can in fact ignore the value the function f^i_k(w) takes when the corresponding constraint is satisfied. Accordingly, we define the hinge function H : R → R as: H(c) = c for c ≥ 0, and H(c) = 0 for c < 0. We equivalently replace each constraint of the form f^i_k(w) ≤ 0 by H(f^i_k(w)) = 0 without changing the original formulation. Intuitively, H(f^i_k(w)) can be thought of as the loss incurred when the corresponding constraint is not satisfied; this loss is zero when the constraint is satisfied. This transformation will be useful in the next step, when we combine together instances of a single type of constraint applied to different examples in the training set. In the new formulation, our primal objective becomes:

argmin_w L(w)  subject to  H(f^i_k(w)) = 0;  ∀ 1 ≤ i ≤ m; ∀ 1 ≤ k ≤ K.   (2)

Clearly, H(f^i_k(w)) ≥ 0, ∀i, k by definition. Therefore, a necessary and sufficient condition to enforce ∀i : H(f^i_k(w)) = 0 is Σ_i H(f^i_k(w)) = 0. This is true for all k. Defining h_k(w) = Σ_i H(f^i_k(w)), we can therefore write our primal objective in Equation 1 as:

argmin_w L(w)  subject to  h_k(w) = 0;  ∀ 1 ≤ k ≤ K.   (3)

A standard way of solving the optimization problem described in Equation 3 is to find a stationary point of the corresponding Lagrangian L:

L(w; Λ) = L(w) + Σ_{k=1}^{K} λ_k h_k(w)   (4)

Here, Λ = {λ_k}_{k=1}^{K} denotes the K-sized vector of Lagrange multipliers. Note that since h_k(w) is always non-negative, the constraint h_k(w) = 0 is equivalent to h_k(w) ≤ 0. Hence the Lagrange multipliers are always non-negative. Our optimization problem in the primal can be written as:

min_w max_Λ L(w, Λ)   (5)

Instead of solving the primal in (5), we often solve its corresponding dual:

max_Λ min_w L(w, Λ)   (6)

We make two comments on our formulation. First, the use of the hinge function achieves two objectives: (a) no penalty is paid when constraints are satisfied, and (b) the number of dual variables is reduced from O(mK) to O(K), making the formulation scalable. Second, our formulation can handle arbitrary constraints as long as they are differentiable.
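As a concrete illustration, the hinge aggregation h_k(w) and the Lagrangian of Equation 4 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the array shapes and function names are illustrative:

```python
import numpy as np

def hinge(c):
    # H(c) = c for c >= 0, and 0 for c < 0, applied elementwise
    return np.maximum(c, 0.0)

def lagrangian(task_loss, f_values, lam):
    """Compute L(w; Lambda) = L(w) + sum_k lambda_k * h_k(w), where
    h_k(w) = sum_i H(f_k^i(w)) aggregates the k-th constraint over
    all m examples (Equations 2-4)."""
    # f_values: (m, K) array holding the constraint values f_k^i(w)
    h = hinge(f_values).sum(axis=0)   # shape (K,): one h_k per constraint type
    return task_loss + float(lam @ h), h

# toy example: m = 3 examples, K = 2 constraint types
f = np.array([[-0.2, 0.5],
              [ 0.1, -0.3],
              [-0.4, 0.2]])
lam = np.array([1.0, 2.0])
total, h = lagrangian(1.5, f, lam)   # h ~ [0.1, 0.7], total ~ 1.5 + 0.1 + 1.4 = 3.0
```

Note how satisfied constraints (negative f values) contribute nothing after the hinge, so the penalty is paid only for violations, and there is one dual variable per constraint type rather than per (example, constraint) pair.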
Note that even simple forms of constraints (such as linear) over the output variables typically represent highly non-linear functions of the network's weights.

Constraint C      | g(w): Choice 1              | g(w): Choice 2
y_j = v           | P_w(y_j = v)                |
¬C_1              | 1 − g_1(w)                  |
C_1 ∨ C_2         | min(g_1(w) + g_2(w), 1)     | max(g_1(w), g_2(w))
C_1 ∧ C_2         | max(g_1(w) + g_2(w) − 1, 0) | min(g_1(w), g_2(w))

Table 1: g(w), g_1(w) and g_2(w) are soft value functions for C, C_1 and C_2, respectively.

3.2 Constraint Language for Discrete Output Spaces

A common learning scenario for many problems is when each element of the target y belongs to a discrete space. In such cases, each y is given as a vector (y_1, y_2, ..., y_r), where each y_j ∈ V and V represents a set of discrete values. The network output is then represented as an r-dimensional vector, where each element of the vector represents a probability distribution over V, i.e., P_w(y_j|x), ∀j, 1 ≤ j ≤ r. One of the common loss functions for discrete spaces is the cross-entropy loss, though other loss functions can also be used. At prediction time, given a new test example x, we output the vector of values which have the highest probability for each element y_j in the output space, i.e., argmax_{y_j} P_w(y_j|x), ∀j, 1 ≤ j ≤ r. In this section, we lay out the details of a language which can handle logical constraints specified over discrete output spaces as described above. Our formulation is based on soft logic used earlier in the literature (Bröcheler et al. [2010]), and represents constraints in the form of inequalities f_k(w) ≤ 0, where w are the network weights.
Our constraints are defined as logical expressions over the values v that each y_j can take. A constraint C can take the following form: (a) C : 1{y_j = v}; (b) C = ¬C_1; (c) C = C_1 ∨ C_2; (d) C = C_1 ∧ C_2. Here, C_1, C_2 denote constraints constructed recursively using the above rules.
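The conversions of Table 1 are straightforward to implement. Below is a minimal Python sketch (illustrative, not the authors' code); we label the sum/difference forms as Choice 1 and the max/min forms as Choice 2, following our reading of the table:

```python
def soft_not(g1):
    # g(not C1) = 1 - g1
    return 1.0 - g1

def soft_or(g1, g2, choice=1):
    # Choice 1: min(g1 + g2, 1); Choice 2: max(g1, g2)
    return min(g1 + g2, 1.0) if choice == 1 else max(g1, g2)

def soft_and(g1, g2, choice=1):
    # Choice 1: max(g1 + g2 - 1, 0); Choice 2: min(g1, g2)
    return max(g1 + g2 - 1.0, 0.0) if choice == 1 else min(g1, g2)

def soft_implies(g1, g2, choice=1):
    # C1 -> C2 is rewritten as (not C1) or C2
    return soft_or(soft_not(g1), g2, choice)

def constraint_f(g):
    # the constraint g(w) = 1 becomes f(w) = 1 - g(w) <= 0
    return 1.0 - g
```

For instance, with g_1 = 0.3 and g_2 = 0.4, soft_or gives 0.7 under Choice 1 and 0.4 under Choice 2; a fully satisfied constraint (g = 1) yields f = 0.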
The first expression (a) can be thought of as an atomic constraint, and the rest are constructed by applying logical operators over existing constraint(s). Note that C_1 → C_2 can be written as ¬C_1 ∨ C_2. Given a logical constraint C over the values output by a network with parameters w, we construct a function g(w) ∈ [0, 1] which denotes the soft value of the corresponding logical expression. Table 1 describes the conversion from a logical expression to the corresponding (soft) value. Finally, given a constraint C and the associated function g(w), the corresponding constraint can be written as g(w) = 1. Since g(w) ∈ [0, 1], this is equivalent to g(w) ≥ 1, or f(w) = 1 − g(w) ≤ 0. We note that since all our constraints are over variables with probability distributions defined over them, introducing soft logic does not make the constraints any softer; it only gives a way to combine the underlying probability values.

4 Training

Supervised: We solve the dual optimization problem described in Equation 6 by alternating gradient descent (ascent) steps over w and Λ, respectively. The gradients of the Lagrangian L with respect to w and λ_k are given as:

∇_w L(w; Λ) = ∇_w L(w) + Σ_{k=1}^{K} λ_k ∇_w h_k(w);   ∂L(w; Λ)/∂λ_k = h_k(w), ∀k.   (7)

Non-differentiability due to the hinge function in h_k can be handled by using sub-gradients. Correspondingly, the parameter update equations can be written as:

w^(t1+1) ← w^(t1) − α_w ∇_w L(w; Λ);   Λ^(t+1) ← Λ^(t) + α_Λ ∇_Λ L(w; Λ)   (8)

Algorithm 1 presents the pseudocode for our learning algorithm. Initially, w is updated for a warmup number of iterations with each λ_k = 0 (i.e., no constraints).
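The update equations above can be exercised on a toy problem. The sketch below is illustrative only (the quadratic objective, constraint, and step sizes are our own, not from the paper): it minimizes (w − 2)² subject to w ≤ 1, alternating a descent step on w with an ascent step on λ, and handles the hinge's non-differentiability with a subgradient:

```python
import numpy as np

def hinge(c):
    return np.maximum(c, 0.0)

def gda_step(w, lam, alpha_w=0.05, alpha_lam=0.05):
    # objective: L(w) = (w - 2)^2; constraint: f(w) = w - 1 <= 0
    f = w - 1.0
    active = 1.0 if f > 0 else 0.0            # subgradient of H at f(w)
    grad_w = 2.0 * (w - 2.0) + lam * active   # Eq. 7: gradient wrt w
    w = w - alpha_w * grad_w                  # Eq. 8: descent step on w
    lam = lam + alpha_lam * hinge(w - 1.0)    # Eq. 8: ascent step on lambda
    return w, lam

w, lam = 0.0, 0.0
for _ in range(2000):
    w, lam = gda_step(w, lam)
# w settles near the constraint boundary w = 1, with lambda near 2
```

This keeps a 1:1 update schedule for brevity; Algorithm 1 additionally grows the number of inner w steps by d per Λ update, which is what drives the ratio of effective learning rates to infinity.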
Then, we perform the following in succession: for every one update of the Λ parameters, we update the w parameters for l steps, where l grows as an arithmetic progression in increments of d. Intuitively, this ensures that the ratio of the effective learning rates for w updates and Λ updates goes to infinity with an increasing number of Λ updates, as l → ∞. For convergence, we resort to the theory of min-max optimization presented by Jin et al. [2019]. Their key result states that for a min-max optimization problem, alternating gradient ascent (descent) over the max (min) variables converges to a local min-max point (the analogue of a local minimum in the single-variable case) if the ratio of the learning rates of the inner and outer variables goes to ∞ in the limit. A significant advantage of our formulation is that in practice the inner loop can often involve the application of algorithms such as AdaDelta or RMSProp, which perform gradient descent but may not give us direct control over the learning rate for the w parameters. Our step-based update still ensures that the effective ratio of learning rates goes to infinity. We state this formally in our next theorem (see supplement for a proof).
Theorem 1. Algorithm 1 converges to a local min-max point of L(w; Λ) for any d ≥ 1.
Semi-supervised: Our framework can be easily extended to the case of semi-supervised learning. Since we do not have the target value y for unlabeled examples, we cannot compute the loss (cross-entropy term) in the expression for L(w; Λ) in Equation 4, and hence the contribution of unlabeled examples to this term is ignored. On the other hand, the second term in the expression for L(w; Λ) (corresponding to the constraints) does not depend on the target values y.
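Concretely, the semi-supervised objective only masks the cross-entropy term for unlabeled examples; the constraint term is computed for every example. A minimal NumPy sketch (illustrative names and shapes, not the authors' code):

```python
import numpy as np

def semi_supervised_objective(ce_losses, f_values, lam, labeled_mask):
    """Labeled examples contribute cross-entropy + constraint terms;
    unlabeled examples contribute only the constraint terms, since
    h_k(w) does not depend on the targets y."""
    # ce_losses: (m,) per-example losses (unlabeled entries are ignored)
    # f_values:  (m, K) constraint values f_k^i(w); labeled_mask: (m,) bools
    supervised = ce_losses[labeled_mask].mean() if labeled_mask.any() else 0.0
    h = np.maximum(f_values, 0.0).sum(axis=0)   # hinge-aggregated over all rows
    return supervised + float(lam @ h)

ce = np.array([0.5, 0.7, 0.0])                  # third example is unlabeled
f = np.array([[-1.0, 0.2], [0.1, -0.5], [0.3, 0.4]])
mask = np.array([True, True, False])
obj = semi_supervised_objective(ce, f, np.array([1.0, 1.0]), mask)
# 0.6 (mean CE over labeled) + 0.4 + 0.6 (constraint terms over all rows) = 1.6
```

The unlabeled row still contributes its constraint violations (0.3 and 0.4 here), which is exactly the regularization effect described above.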
Therefore, for unlabeled examples, we can take this contribution into account by computing this term just as in the case of labeled examples. As demonstrated by our experiments, this simple idea of using unlabeled data only for enforcing the constraints can act as a strong regularizer and result in significantly improved models, especially when there is a small amount of labeled data available for training. This has also been observed in earlier work (Xu et al. [2018]; Mehta et al. [2018]).

Algorithm 1 Training of a Deep Net with Constraints. Hyperparameters: warmup, d, β, α⁰_Λ, α_w
1: Initialize: w randomly; λ_k = 0, ∀k = 1 . . . K
2: for warmup iterations do
3:     Update w: take an SGD step wrt w on L(w; Λ) on a mini-batch
4: Initialize: l = 1; t = 1; t_1 = 1; α_Λ = α⁰_Λ
5: while not converged do
6:     Update Λ: take an SGA step wrt Λ on L(w; Λ) on a mini-batch
7:     Increment t = t + 1
8:     for l steps do
9:         Update w: take an SGD step wrt w on L(w; Λ) on a mini-batch
10:        Increment t_1 = t_1 + 1
11:    Update l = l + d
12:    Set learning rate: α_Λ = α⁰_Λ · 1/(1 + βt)

5 Experiments

The goal of our experiments is to answer three questions. (1) Does constrained training help in learning more accurate models, especially in the low data setting? (2) Does constrained training result in models with better constraint satisfaction at prediction time? (3) What is the impact of semi-supervision? We perform experiments on three different NLP benchmarks, which we describe next. The specific details of the software environments and hyperparameters are mentioned in the supplement.

5.1 Semantic Role Labeling (SRL)

Given a sentence with a predicate (verb), the goal of SRL is to extract and label the arguments for it, determining who did what to whom, when, where, etc.
In the SRL literature, there is a long history of using linguistic and structural constraints in inference (e.g., Punyakanok et al. [2008]). We assess the value of constraints in learning more robust neural models.
Dataset & Baseline Model: We use the English Ontonotes 5.0 dataset1 in the CoNLL 2011/12 shared task format (Pradhan et al. [2012]) as the training data. The labeling task is modeled as sequence labeling using the BIOUL encoding. The baseline model (B) uses a deep bidirectional LSTM, initialized with ELMo+GloVe embeddings.2
Constraints: We impose two types of constraints. (1) Syntactic Constraints: let SY = {(a, b) | a < b} be the set of syntactic spans of a sentence in its syntactic parse tree. Let y^{B_l}_j and y^{L_l}_j be the indicator variables corresponding to the beginning and end (last) tag of argument label l at the j-th word. Then the syntactic constraints can be written as:

y^{B_l}_a ⟹ ∨_{j ∈ {b : (a,b) ∈ SY}} y^{L_l}_j   and   y^{L_l}_b ⟹ ∨_{j ∈ {a : (a,b) ∈ SY}} y^{B_l}_j,   ∀ a, b, l.

These constraints are similar to those used by Mehta et al. [2018], albeit in a different formulation.

1http://cemantix.org/data/ontonotes.html
2implemented in https://allennlp.org/models#semantic-role-labeling

Scenario | F1 Score (1% / 5% / 10% Data) | Total Constraint Violations (1% / 5% / 10% Data)
B        | 62.99 / 72.64 / 76.04         | 14,857 / 9,708 / 7,704
CL       | 66.21 / 74.27 / 77.19         | 9,406 / 7,461 / 5,836
B+CI     | 67.90 / 75.96 / 78.63         | 5,737 / 4,247 / 3,654
CL+CI    | 68.71 / 76.51 / 78.72         | 5,039 / 3,963 / 3,476

Table 2: Effect of constrained learning on SRL, with and without constrained inference (CI).

(2) Transition Constraints: The BIOUL encoding naturally defines the valid transitions for a sequence, e.g., Ll must be preceded by Bl or Il.
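To make such transition constraints differentiable, each implication can be converted into a soft penalty via the semantics of Table 1. The sketch below is a hedged illustration using the max/min ("Choice 2") forms; the tag indices and names are ours, not from the paper's code:

```python
def transition_penalty(p_j, p_next, valid_next):
    """Soft penalty for constraints of the form: tag t at position j implies
    one of the tags in V_t at position j+1 (i.e., y_j^t => OR_{u in V_t} y_{j+1}^u).
    p_j, p_next: per-position probability distributions over tags.
    valid_next: dict mapping tag index t -> list of allowed next-tag indices V_t.
    The implication A => B is evaluated as max(1 - g(A), g(B)); penalty f = 1 - g."""
    total = 0.0
    for t, allowed in valid_next.items():
        g_head = p_j[t]                            # soft value of y_j^t
        g_body = max(p_next[u] for u in allowed)   # soft value of the disjunction
        g = max(1.0 - g_head, g_body)              # soft value of the implication
        total += 1.0 - g                           # f = 1 - g, zero when satisfied
    return total

# e.g. suppose tag 0 must be followed by tag 1
p_j, p_next = [0.9, 0.1], [0.8, 0.2]
penalty = transition_penalty(p_j, p_next, {0: [1]})  # max(0.1, 0.2) = 0.2 -> penalty 0.8
```

When the model is confident in a valid transition (e.g., p_next = [0.0, 1.0]), the penalty drops to zero, so only violating predictions are pushed on during training.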
For a given tag t, let V_t be the set of valid tags for the next word. Then the transition constraints enforce that: ∀ j, t : y^t_j ⟹ ∨_{u ∈ V_t} y^u_{j+1}.
Methodology: We compare against two different models, the baseline (B) and the baseline augmented with Viterbi decoding (B+CI). This constrained decoding enforces the transition constraints at inference time. We name the constrained learning versions of these algorithms CL and CL+CI, respectively. Note that for test instances, the syntactic spans SY are not available. We use the standard train/dev/test split and the official Perl script to compute span-based F1-scores. We train with 1%, 5% and 10% of the training data, selected randomly.
Results: Table 2 presents our results. We observe significant F1 gains of constrained learning (CL) over the baseline B, supporting the hypothesis that constraints can help in learning more robust models. We find that constrained learning with constrained decoding (CL+CI) consistently performs the best, even though the marginal improvements over B+CI are smaller.3 We also note that the benefit of constrained learning decreases as training data increases, suggesting that this approach is most useful in low data settings. In addition to F1-scores, we also report the total number of constraint violations and find that constrained learning consistently makes significantly fewer violations. At the same time, we note that substantial violations remain. This is not entirely surprising, since learning span constraints without known spans is akin to learning a significant aspect of the syntactic parsing task, making the learning task much harder.

5.2 Named Entity Recognition (NER)

The task corresponds to assigning a tag to each word from a given set of NER tags, e.g., 'location', 'person', etc.
In addition, we also assume that the (training) dataset is labeled with Part-of-Speech (POS) tags for each word. This information is readily available for many datasets. We treat POS tagging as an auxiliary task in the standard multi-task learning (MTL) framework.
Dataset & Baseline Model: We use the publicly available GMB4 dataset (Bos et al. [2017]) in our experiments. It contains about 62 thousand sentences, 24 different NER tags and 43 different POS tags. We randomly split it into 60/20/20 train/dev/test sets, respectively. After removing the hierarchy among NER tags (e.g., mapping 'person-title' and 'person-family-name' to a single 'person'), we are left with 9 high-level NER tags. We use the BIO encoding in our modeling. Our baseline model (B) is a BiLSTM that is set up in an MTL framework for predicting both NER and POS. For both tasks, we use a single BiLSTM layer whose parameters are shared between the two tasks.
Constraints: We encode our prior linguistic knowledge about the relationships between NER and POS as constraints: for any NER tag t_e, we have an allowed set of POS tags T_p(t_e). If a word takes an NER tag t_e, then its POS tag must come from the set T_p(t_e), i.e., y^{NER}_j = t_e ⟹ y^{POS}_j ∈ T_p(t_e). Here, y^{NER}_j and y^{POS}_j are the output variables corresponding to the NER and POS tags of the j-th word, respectively. We give the full details of our constraints in the supplement.
Methodology: We compare the following models: (1) B: Baseline, (2) CI: Constrained Inference, (3) CL: Constrained Learning, and (4) SCL: Semi-supervised Constrained Learning. B is the base model, CI does regular training with constrained inference using dual decomposition (Rush and Collins [2012]), CL is our model doing constrained training (supervised), and SCL does constrained training using semi-supervised data.
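As an illustration of how such an NER-to-POS implication can be softened, the sketch below uses the sum-based ("Choice 1") semantics of Table 1; since the POS values are mutually exclusive, the disjunction over T_p(t_e) becomes a clipped sum of probabilities. The names and tag sets here are illustrative, not the paper's actual mapping:

```python
def ner_pos_penalty(p_ner, p_pos, allowed_pos):
    """Soft penalty for: y_j^NER = t_e  =>  y_j^POS in T_p(t_e).
    p_ner, p_pos: probability distributions over NER and POS tags for one word.
    allowed_pos: dict mapping NER tag index t_e -> allowed POS indices T_p(t_e)."""
    total = 0.0
    for t_e, pos_set in allowed_pos.items():
        g_head = p_ner[t_e]
        g_body = min(sum(p_pos[u] for u in pos_set), 1.0)   # soft disjunction
        g = min(1.0 - g_head + g_body, 1.0)                 # implication: min(1 - a + b, 1)
        total += 1.0 - g                                    # zero when satisfied
    return total

# e.g. hypothetical NER tag 0 allows only POS tag 0
penalty = ner_pos_penalty([0.8, 0.2], [0.5, 0.3, 0.2], {0: [0]})
# g = min(1 - 0.8 + 0.5, 1) = 0.7, so the penalty is 0.3
```

The penalty is large exactly when the network is confident in the NER tag but puts little probability mass on any compatible POS tag.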
In order to test the performance of our model in the low data setting, we randomly select data subsets of sizes {400, 800, 1600, 6400, 12800, 25600, 37206} and use them for training each model. In each case, the data not used for training is used as unlabeled data for SCL (after removing the labels). The reported results are averaged over 10 different randomly selected samples for each training size.

3Our F1-scores are not directly comparable with those reported in Mehta et al. [2018], since their exact training splits (or code) are unavailable. Overall, our gains due to constrained learning are similar to theirs.

4https://gmb.let.rug.nl/data.php

Figure 1: NER: Comparison of different training techniques. (a) Avg. gain in F1 score over baseline; (b) avg. number of constraint violations. B: Baseline; CL: Constrained Learning; SCL: Semi-supervised Constrained Learning; CI: Constrained Inference.

Results: Figure 1a compares the performance of the four models. We plot the baseline model at zero, and plot the performance of all other models relative to the baseline (see supplement for absolute numbers and standard deviations). There is a good gain in F1-score when learning with constraints, with most gain obtained for smaller training sizes. Semi-supervision results in significant additional gains. Figure 1b plots the number of constraint violations with varying training size. For CL and SCL, this number is close to 0 throughout. Counterintuitively, the violations increase monotonically for CI. This is because with less training data, learning is very shallow, resulting in the 'Other' prediction most of the time, and the constraints are trivially satisfied. As the learned model becomes more complex, CI finds it harder to satisfy the constraints without hurting performance. We do early stopping of dual decomposition based on dev set performance.
This results in decent F1 but high constraint violations. If run till convergence, CI results in all constraints being satisfied, but with performance lower than the baseline. CL does not suffer from this phenomenon due to constraint-aware learning.

We also experiment with using CI (constrained inference) on top of CL and SCL; this results in no additional gains, since most constraints are already being satisfied. This highlights that constrained learning may sometimes obviate the need for constrained inference, which can lead to a large reduction in test time. For instance, CI has test times 3-15 times that of CL, depending on the testing batch size.

5.3 Fine-Grained Entity Typing

This is a multi-label classification problem. Given a set M of textual mentions of an entity e, we are interested in finding all the types that the mentions in M belong to (Yao et al. [2013]; Verga et al. [2017]; Murty et al. [2018]). Note that the labels here are entity types.

Dataset & Baseline Model: We work with TypeNet5 (Murty et al. [2017]), a publicly available dataset of hierarchical entity types for extremely fine-grained entity typing. It has been curated by mapping Freebase (Bollacker et al. [2008]) types into the Wordnet (Miller [1995]) hierarchy. The dataset contains over 1,900 types, placed in a hierarchy of average depth 7.8. It also provides a corpus of textual mentions extracted from Wikipedia articles. It contains 344,246 entities mapped to 1,081 types arranged in the type hierarchy. For the baseline (B), we use the state-of-the-art model proposed for this task by Murty et al. [2018].6 Each mention m is represented by an encoding computed using a CNN over the sentence, and each type is represented using an embedding vector. The two are combined to get a similarity score. Scores coming from different mentions in a set are pooled to get a final score for each entity.
To exploit the hierarchical structure, an additional loss term (H) encourages entities close in the hierarchy to get similar embedding vectors. This can be thought of as imposing a soft constraint on the entity types. We compare with both these versions in our experiments.

5https://github.com/iesl/TypeNet
6https://github.com/MurtyShikhar/Hierarchical-Typing

Scenario |       MAP Scores (%)        |   # Constraint Violations
         | 5% Data | 10% Data | 100% Data | 5% Data | 10% Data | 100% Data
B        |  68.62  |  69.21   |  70.47    | 22,715  | 21,451   | 22,359
B+H      |  68.71  |  69.31   |  71.77    | 22,928  | 21,157   | 24,650
CL       |  80.13  |  81.36   |  82.80    |     25  |     45   |     12
SCL      |  82.22  |  83.81   |    –      |     41  |     26   |     –

Table 3: TypeNet: MAP scores (in %) and number of constraint violations for different training sizes.

Constraints: We enforce two types of constraints in our model. (a) Type Inclusion: given two types ti and tj such that ti is an ancestor of tj in the type hierarchy, for any entity, if ti is selected as a possible entity type, then tj should also be selected. I.e., yti ⇒ ytj, where yti and ytj are indicator variables for the corresponding types being selected. This results in 1,891 constraints. (b) Type Exclusion: pairs of types ti and tj (e.g., 'library' and 'camera') that should not co-occur for any entity. I.e., yti ⇒ ¬ytj. This results in a total of about 555,000 constraints.

Methodology: We compare four different models: (a) B: Baseline, (b) B+H: Baseline with hierarchically constrained embeddings, (c) CL: constrained learning, and (d) SCL: constrained learning with semi-supervision. Our constrained learning models are learned on top of the vanilla baseline (and do not make use of hierarchical embeddings).
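For training, the inclusion and exclusion constraints can be relaxed into differentiable violation scores, each paired with its own Lagrange multiplier λk that is raised by dual ascent whenever its constraint is violated. This is only a rough sketch of the mechanism — the Łukasiewicz-style relaxations and the projected-ascent update below are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def constraint_violations(p, inclusion, exclusion):
    """Soft constraint violations for one entity's type probabilities p (shape (K,)).

    inclusion: pairs (i, j) encoding y_ti => y_tj;
               soft violation max(0, p_i - p_j).
    exclusion: pairs (i, j) encoding y_ti => not y_tj;
               soft violation max(0, p_i + p_j - 1) (Lukasiewicz-style).
    Returns one non-negative violation value g_k per constraint.
    """
    g = [max(0.0, p[i] - p[j]) for i, j in inclusion]
    g += [max(0.0, p[i] + p[j] - 1.0) for i, j in exclusion]
    return np.array(g)

def dual_ascent_step(lam, g, eta=0.1):
    """One dual update: raise lambda_k where constraint k is violated,
    projecting back onto lambda_k >= 0."""
    return np.maximum(0.0, lam + eta * g)
```

During training, the primal step would minimize the task loss plus sum_k lambda_k * g_k over the network parameters, alternating with this dual step; multipliers for constraints the model already satisfies stay at zero, while violated constraints get progressively larger weight.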
We use the original splits of 90%, 5% and 5% for training, validation and testing, respectively (Murty et al. [2018]). We compare the performance of the four models when training on (1) 5% of the data, (2) 10% of the data, and (3) the full training set. The smaller training subsets are chosen randomly. As earlier, any unused data in the training fold is used for semi-supervision (after removing labels).

Results: Table 3 presents our comparison results. We note that our baseline results are significantly higher than those reported in Murty et al. [2018]. We believe this is because they did not train the model until convergence; running till convergence results in significantly higher numbers. After this additional training, the relative advantage of the B+H model over B reported in their paper is lost. Our constrained model (CL) gives up to an 11-point increase in performance at both 5% and 10% of the data. With semi-supervision, this gain hovers in the range of 12-14 points. There is a three-orders-of-magnitude drop in the number of constraint violations when using constrained learning. Interestingly, CL performance with 100% data is slightly worse than semi-supervision with 10% data. We hypothesize that the reason is noise in the training data, in terms of either missing or incorrect labels: there are 634,544 type inclusion constraint violations in the training data containing 294,781 entities. As a result, the additional noisy data is likely hurting performance.

We also compare our constrained learning against Diligenti et al. [2017a]'s approach of using soft constraints, where the violation penalty is multiplied by a constant λ and added to the original loss. When using the best value of the λ parameter, we find that both methods perform similarly. However, the performance of Diligenti et al.'s method requires an extensive search over λ, and varies significantly with its value.
On the other hand, our formulation can implicitly discover the optimal λk values (we have one for every constraint) by way of our Lagrangian formulation. This obviates the need for an explicit search for λ, which can be expensive. In fact, setting the constant λ to be the average of the λk's from our formulation gives its best score.

6 Conclusion and Future Work

In this paper, we have proposed a primal-dual approach for solving the problem of learning with hard constraints in deep learning models. While earlier work has modeled the constraints as soft, incorporating a penalty in the loss term, we instead directly optimize the hard constraints using a Lagrangian-based formulation. We show that our algorithm converges to local min-max points of the objective. For the case of discrete output spaces, we also present a constraint language using soft logic. Experiments on three different NLP tasks show the effectiveness of our approach compared to non-constrained baselines, as well as constrained inference, achieving state-of-the-art results in two of the domains. In one of the domains, our approach completely eliminates the need for expensive constrained inference. Directions for future work include learning constraints automatically, and experimenting on non-NLP tasks. We have made all our code publicly available at https://github.com/dair-iitd/dl-with-constraints for future research.

Acknowledgements

We thank the IIT Delhi HPC facility7 for computational resources, which allows us to run experiments at large scale. We thank Guy Van den Broeck, Yitao Liang, Sanket Mehta, Shikhar Murty, Dan Roth, Alexander Rush and Vivek Srikumar for useful discussions on the work. We also thank Deepanshu Jindal for proofreading our code. Mausam is supported by grants from Google, Bloomberg and 1MG. Parag Singla is supported by the DARPA Explainable Artificial Intelligence (XAI) Program with number N66001-17-2-4032.
Both Mausam and Parag Singla are supported by the Visvesvaraya Young Faculty Fellowships by Govt. of India and IBM SUR awards. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies.

References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA, 2008. ACM.

Johan Bos, Valerio Basile, Kilian Evang, Noortje Venhuizen, and Johannes Bjerva. The Groningen Meaning Bank. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation, volume 2, pages 463–496. Springer, 2017.

Matthias Bröcheler, Lilyana Mihalkova, and Lise Getoor. Probabilistic similarity logic. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, pages 73–82, 2010.

Kai-Wei Chang, Rajhans Samdani, and Dan Roth. A constrained latent variable model for coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 601–612, 2013.

Jianshu Chen and Li Deng. A primal-dual method for training recurrent neural networks constrained by the echo-state property. arXiv preprint arXiv:1311.6091, 2013.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach.
Intell., 40(4):834–848, 2018.

Michelangelo Diligenti, Marco Gori, and Claudio Saccà. Semantic-based regularization for learning and inference. Artif. Intell., 244:143–165, 2017.

Michelangelo Diligenti, Soumali Roychowdhury, and Marco Gori. Integrating prior knowledge into deep learning. In 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017, Cancun, Mexico, December 18-21, 2017, pages 920–923, 2017.

Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.

Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. CoRR, abs/1902.00618, 2019.

7http://supercomputing.iitd.ac.in

Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid CNN-CRF models for stereo. In CVPR, 2017.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition.
In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 260–270, 2016.

Jay Yoon Lee, Sanket Vaibhav Mehta, Michael Wick, Jean-Baptiste Tristan, and Jaime G. Carbonell. Gradient-based inference for networks with output constraints. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 4147–4154, 2019.

Pablo Márquez-Neila, Mathieu Salzmann, and Pascal Fua. Imposing hard constraints on deep networks: Promises and limitations. CoRR, abs/1706.02025, 2017.

Sanket Vaibhav Mehta, Jay Yoon Lee, and Jaime G. Carbonell. Towards semi-supervised learning for deep semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4958–4963, 2018.

George A. Miller. Wordnet: A lexical database for English. Commun. ACM, 38(11):39–41, November 1995.

Shikhar Murty, Patrick Verga, Luke Vilnis, and Andrew McCallum. Finer grained entity typing with TypeNet. arXiv preprint arXiv:1711.05795, 2017.

Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic, and Andrew McCallum. Hierarchical losses and new resources for fine-grained entity typing and linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 97–109, 2018.

Vilém Novák. First-order fuzzy logic.
Studia Logica, 46(1):87–109, 1987.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, CoNLL '12, pages 1–40, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

Vasin Punyakanok, Dan Roth, and Wen-tau Yih. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287, 2008.

Dan Roth and Wen-tau Yih. Integer linear programming inference for conditional random fields. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 736–743, 2005.

Alexander M. Rush and Michael Collins. A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. J. Artif. Intell. Res., 45:305–362, 2012.

Patrick Verga, Arvind Neelakantan, and Andrew McCallum. Generalizing to unseen entities and entity pairs with row-less universal schema. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 613–622, 2017.

Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5498–5507, 2018.

Limin Yao, Sebastian Riedel, and Andrew McCallum.
Universal schema for entity type prediction. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC@CIKM 13, San Francisco, California, USA, October 27-28, 2013, pages 79–84, 2013.