{"title": "A Learning Error Analysis for Structured Prediction with Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 6129, "page_last": 6139, "abstract": "In this work, we try to understand the differences between exact and approximate inference algorithms in structured prediction. We compare the estimation and approximation error of both underestimate and overestimate models. The result shows that, from the perspective of learning errors, performances of approximate inference could be as good as exact inference. The error analyses also suggest a new margin for existing learning algorithms. Empirical evaluations on text classification, sequential labelling and dependency parsing witness the success of approximate inference and the benefit of the proposed margin.", "full_text": "A Learning Error Analysis for Structured Prediction\n\nwith Approximate Inference\n\nYuanbin Wu1, 2, Man Lan1, 2, Shiliang Sun1, Qi Zhang3, Xuanjing Huang3\n\n1School of Computer Science and Software Engineering, East China Normal University\n\n{ybwu, mlan, slsun}@cs.ecnu.edu.cn, {qz, xjhuang}@fudan.edu.cn\n\n2Shanghai Key Laboratory of Multidimensional Information Processing\n\n3School of Computer Science, Fudan University\n\nAbstract\n\nIn this work, we try to understand the differences between exact and approximate\ninference algorithms in structured prediction. We compare the estimation and ap-\nproximation error of both underestimate (e.g., greedy search) and overestimate\n(e.g., linear relaxation of integer programming) models. The result shows that,\nfrom the perspective of learning errors, performances of approximate inference\ncould be as good as exact inference. The error analyses also suggest a new mar-\ngin for existing learning algorithms. 
Empirical evaluations on text classi\ufb01cation,\nsequential labelling and dependency parsing witness the success of approximate\ninference and the bene\ufb01t of the proposed margin.\n\nIntroduction\n\n1\nGiven an input x 2 X , structured prediction is the task of recovering a structure y = h(x) 2 Y,\nwhere Y is a set of combinatorial objects such as sequences (sequential labelling) and trees (syntactic\nparsing). Usually, the computation of h(x) needs an inference (decoding) procedure to \ufb01nd an\noptimal y:\n\nh(x) = arg max\n\ny2Y score(x; y):\n\nSolving the \u201carg max\u201d operation is essential for training and testing structured prediction models,\nand it is also one of the most time-consuming parts due to its combinatorial natural. In practice, the\ninference problem often reduces to combinatorial optimization or integer programming problems,\nwhich are intractable in many cases. In order to accelerate models, faster approximate inference\nmethods are usually applied. Examples include underestimation algorithms which output structures\nwith suboptimal scores (e.g., greedy search, max-product belief propagation), and overestimation al-\ngorithms which output structures in a larger output space (e.g., linear relaxation of integer program-\nming). Understanding the trade-offs between computational ef\ufb01ciency and statistical performance is\nimportant for designing effective structured prediction models [Chandrasekaran and Jordan, 2013].\nPrior work [Kulesza and Pereira, 2007] shows that approximate inference may not be suf\ufb01cient\nfor learning a good statistical model, even with rigorous approximation guarantees. However, the\nsuccessful application of various approximate inference algorithms motivates a deeper exploration\nof the topic. For example, the recent work [Globerson et al., 2015] shows that an approximate\ninference can achieve optimal results on grid graphs. 
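The "arg max" inference and an underestimation algorithm can be illustrated on a toy chain model. The sketch below (all scores are hypothetical, not from the paper) decodes a three-position binary sequence exactly by enumeration and with a left-to-right greedy search; the greedy score can only underestimate the exact one.

```python
import itertools

# Toy first-order sequence model: score(x, y) is a sum of unary and pairwise
# terms. All numbers are hypothetical; exact inference enumerates the output
# space, greedy search is a simple underestimation algorithm.
UNARY = [[1.0, 0.2], [0.1, 0.8], [0.5, 0.4]]  # position x label scores
PAIR = {(0, 0): 0.5, (0, 1): -0.3, (1, 0): -0.3, (1, 1): 0.7}

def score(y):
    return (sum(UNARY[k][l] for k, l in enumerate(y))
            + sum(PAIR[y[k], y[k + 1]] for k in range(len(y) - 1)))

def exact_inference():
    # The "arg max" above, by brute force over all 2^3 sequences.
    return max(itertools.product([0, 1], repeat=3), key=score)

def greedy_inference():
    # Underestimation: commit to labels left to right, ignoring future pairwise terms.
    y = []
    for k in range(3):
        y.append(max([0, 1], key=lambda l: UNARY[k][l] + (PAIR[y[-1], l] if y else 0.0)))
    return tuple(y)

y_exact, y_greedy = exact_inference(), greedy_inference()
assert score(y_greedy) <= score(y_exact)  # greedy can only underestimate
```

On this instance greedy commits to (0, 0, 0) while the exact arg max is (1, 1, 1): the myopic first decision locks the search out of the higher-scoring region.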
In this work, instead of focusing on speci\ufb01c\nmodels and algorithms, we try to analyze general estimation and approximation errors for structured\nprediction with approximate inference.\nRecall that given a hypothesis space H, a learning algorithm A receives a set of training samples\nS = f(xi; yi)gm\ni=1 which are i.i.d. according to a distribution D on the space X (cid:2) Y, and returns a\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fhypothesis A(S) 2 H. Let e(h) = EDl(y; h(x)) be the risk of a hypothesis h on X (cid:2) Y (l is a loss\n(cid:3)\nfunction), and h\n\n= arg minh2He(h). Applying algorithm A will suffer two types of error:\n\n|{z}\ne(A(S)) = e(h\n(cid:3)\n\n}\nThe estimation error measures how close A(S) is to the best possible h\n(cid:3); the approximation error\nmeasures whether H is suitable for D, which only depends on the hypothesis space. Our main\ntheoretical results are:\n\napproximation\n\n+ e(A(S)) (cid:0) e(h\n(cid:3)\n\n)\n\n)\n\n|\n\n{z\n\nestimation\n\n(cid:15) For the estimation error, we show that, comparing with exact inference, overestimate infer-\nence always has larger estimation error, while underestimate inference can probably have\nsmaller error. The results are based on the PAC-Bayes framework [McAllester, 2007] for\nstructured prediction models.\n(cid:15) For the approximation error, we \ufb01nd that the errors of underestimate and exact inference\nare not comparable. On the other side, overestimate inference algorithms have a smaller\napproximation error than exact inference.\n\nThe results may explain the success of exact inference: it makes a good balance between the two\nerrors. They also suggest that the learning performances of approximate inference can still be im-\nproved. 
Our contributions on empirical algorithms are two-fold.\nFirst, following the PAC-Bayes error bounds, we propose to use a new margin (De\ufb01nition 3) when\nworking with approximate algorithms.\nIt introduces a model parameter which can be tuned for\ndifferent inference algorithms. We investigate three widely used structured prediction models with\nthe new margin (structural SVM, structured perceptron and online passive-aggressive algorithm).\nSecond, we evaluate the algorithms on three NLP tasks: multi-class text classi\ufb01cation (a special\ncase of structured prediction), sequential labelling (chunking, POS tagging, word segmentation)\nand high-order non-projective dependency parsing. Results show that the proposed algorithms can\nbene\ufb01t each structured prediction task.\n\n2 Related Work\n\nThe \ufb01rst learning error analysis of structured prediction was given in [Collins, 2001]. The bounds\ndepend on the number of candidate outputs of samples, which grow exponentially with the size\nof a sample. To tighten the result, Taskar et al. [2003] provided an improved covering number\nargument, where the dependency on the output space size is replaced by the l2 norm of feature\nvectors, and London et al. [2013] showed that when the data exhibits weak dependence within each\nstructure (collective stability), the bound\u2019s dependency on structure size could be improved. A\nconcise analysis based on the PAC-Bayes framework was given in [McAllester, 2007]. It enjoys the\nadvantages of Taskar et al.\u2019s bound and has a simpler derivation. Besides the structured hinge loss,\nthe PAC-Bayes framework was also applied to derive generalization bounds (and consistent results)\nfor ramp and probit surrogate loss functions [McAllester and Keshet, 2011], and loss functions based\non Gibbs decoders [Honorio and Jaakkola, 2016]. Recently, Cortes et al. 
[2016] proposed a new\nhypothesis space complexity measurement (factor graph complexity) by extending the Rademacher\ncomplexity, and they can get tighter bounds than [Taskar et al., 2003].\nFor approximate inference algorithms, theoretical results have been given for different learning sce-\nnarios, such as the cutting plane algorithm of structured SVMs [Finley and Joachims, 2008, Wang\nand Shawe-Taylor, 2009], subgradient descent [Martins et al., 2009], approximate inference via\ndual loss [Meshi et al., 2010], pseudo-max approach [Sontag et al., 2010], local learning with de-\ncomposed substructures [Samdani and Roth, 2012], perceptron [Huang et al., 2012], and amortized\ninference [Kundu et al., 2013, Chang et al., 2015]. Different from previous works, we try to give\na general analysis of approximate inference algorithms which is independent of speci\ufb01c learning\nalgorithms.\nThe concept of algorithmically separable is de\ufb01ned in [Kulesza and Pereira, 2007], it showed that\nwithout understanding combinations of learning and inference, the learning model could fail. Two\nrecent works gave theoretical analyses on approximate inference showing that they could also obtain\n\n2\n\n\fpromising performances: Globerson et al. [2015] showed that for a generative 2D grid models, a two-\nstep approximate inference algorithm achieves optimal learning error. Meshi et al. [2016] showed\nthat approximation based on LP relaxations are often tight in practice.\nThe PAC-Bayes approach was initiated by [McAllester, 1999]. Variants of the theory include\nSeeger\u2019s bound [Seeger, 2002], Catoni\u2019s bound [Catoni, 2007] and the works [Langford and Shawe-\nTaylor, 2002, Germain et al., 2009] on linear classi\ufb01ers.\n\n3 Learning Error Analyses\n\nWe will focus on structured prediction with linear discriminant functions. 
De\ufb01ne exact inference\n\nh(x; w) = arg max\n\ny2Y w\n\n\u22ba\n\n(cid:8)(x; y);\n\nwhere (cid:8)(x; y) 2 Rd is the feature vector, and w is the parameter vector in Rd. We consider two\ntypes of approximate inference algorithms, namely underestimate approximation and overestimate\napproximation [Finley and Joachims, 2008] 1.\nDe\ufb01nition 1. Given a w, h-(x; w) is an underestimate approximation of h(x; w) on a sample x if\n\n\u22ba\n\n(cid:3)\n\n) (cid:20) w\n\n\u22ba\n\n(cid:8)(x; y-) (cid:20) w\n\n\u22ba\n\n(cid:8)(x; y\n\n(cid:26)w\n= h(x; w); y- = h-(x; w) 2 Y. Similarly, h+(x; w) is an overestimate\n\n(cid:8)(x; y\n\n)\n\n(cid:3)\n\nfor some (cid:26) > 0, where y\napproximation of h(x; w) on sample x if\n\n(cid:3)\n\n(cid:3)\n\n\u22ba\n\nw\n\n(cid:8)(x; y\n\n(cid:3)\n\n) (cid:20) w\n\n\u22ba\n\n(cid:8)(x; y+) (cid:20) (cid:26)w\n\n\u22ba\n\n(cid:8)(x; y\n\n)\n\nfor some (cid:26) > 0, where y+ = h+(x; w) 2 (cid:22)Y and Y (cid:18) (cid:22)Y.\nLet H;H-;H+ be hypothesis spaces containing h, h- and h+ respectively: H = fh((cid:1); w)jw 2\nRdg, H- = fh-((cid:1); w)j8x 2 X ; h-((cid:1); w) is an underestimationg, and H+ = fh+((cid:1); w)j8x 2\nX ; h+((cid:1); w) is an overestimationg. Let l(y; ^y) 2 [0; 1] be a structured loss function on Y (cid:2) Y and\nI((cid:1)) be a 0-1 valued function which equals 1 if the argument is true, 0 otherwise.\n\n3.1 Estimation Error\n\nOur analysis of the estimation error for approximate inference is based on the PAC-Bayes results\nfor exact inference [McAllester, 2007]. PAC-Bayes is a framework for analyzing hypothesis h((cid:1); w)\n\u2032 according to some\nwith stochastic parameters: given an input x, \ufb01rst randomly select a parameter w\ndistribution Q(w\nL(Q;D; h((cid:1); w)) = ED;Q(w\u2032jw)l(y; h(x; w\n\n\u2032jw), and then make a prediction using h(x; w\n\n)); L(Q; S; h((cid:1); w)) =\n\nEQ(w\u2032jw)l(yi; h(xi; w\n\nm\u2211\n\n). 
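Definition 1 can be checked mechanically on a single sample. The snippet below is a minimal sketch with hypothetical feature vectors (nothing here comes from the paper's experiments): an output is a rho-underestimation when its score lies between rho times the optimal score and the optimal score itself.

```python
# A mechanical check of Definition 1 on one sample with hypothetical feature
# vectors: y_hat is a rho-underestimation of the exact arg max y* whenever
#   rho * w.Phi(x, y*) <= w.Phi(x, y_hat) <= w.Phi(x, y*).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_underestimate(w, phi_star, phi_hat, rho):
    s_star, s_hat = dot(w, phi_star), dot(w, phi_hat)
    return rho * s_star <= s_hat <= s_star

w = [1.0, 2.0]
phi_star = [3.0, 1.0]  # Phi(x, y*) of the exact arg max, score 5.0
phi_hat = [2.0, 1.2]   # Phi(x, y_hat) of an approximate output, score 4.4
# y_hat satisfies the definition for any rho up to 4.4 / 5.0 = 0.88.
assert is_underestimate(w, phi_star, phi_hat, rho=0.8)
assert not is_underestimate(w, phi_star, phi_hat, rho=0.95)
```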
De\ufb01ne\n\n\u2032\n\n\u2032\n\n\u2032\n\n)):\n\n1\nm\n\ni=1\n\n\u221a\n\nGiven some prior distribution P (w) on the model parameter w, the following PAC-Bayes Theorem\n[McAllester, 2003] gives an estimation error bound of h(x; w).\nLemma 2 (PAC-Bayes Theorem). Given a w, for any distribution D over X (cid:2) Y, loss function\nl(y; ^y) 2 [0; 1], prior distribution P (w) over w, and (cid:14) 2 [0; 1], we have with probability at least\n1 (cid:0) (cid:14) (over the sample set S), the following holds for all posterior distribution Q(w\n\n\u2032jw):\n\nDKL(Q\u2225P ) + ln m\n\nL(Q;D; h((cid:1); w)) (cid:20) L(Q; S; h((cid:1); w)) +\nwhere DKL(Q\u2225P ) is the KL divergence between Q and P .\n1De\ufb01nition 1 slightly generalizes \u201cundergenerating\u201d and \u201covergenerating\u201d in [Finley and Joachims, 2008].\nInstead of requiring (cid:26) > 0, the \u201cundergenerating\u201d there only considers (cid:26) 2 (0; 1), and \u201covergenerating\u201d only\nconsiders (cid:26) > 1. Although their de\ufb01nition is more intuitive (i.e., the meaning of \u201cover\u201d and \u201cunder\u201d is more\n(cid:3)\n) > 0 for all x and w, which limits the size of hypothesis space. Finally,\nclear), it implicitly assumes w\n\u22ba\nby adding a bias term, we could make w\n) + b > 0 for all x, and obtain the same de\ufb01nitions in [Finley\nand Joachims, 2008].\n\n2(m (cid:0) 1)\n\n(cid:8)(x; y\n\n(cid:8)(x; y\n\n(cid:3)\n\n\u22ba\n\n;\n\n(cid:14)\n\n3\n\n\f(cid:3)\n\n; y-; w) (cid:20) 0 for underestimation, and m(cid:26)(x; y\n\nDe\ufb01nition 3. For (cid:26) > 0, we extend the de\ufb01nition of margin as m(cid:26)(x; y; ^y; w) \u225c w\nwhere \u2206(cid:26)(x; y; ^y) \u225c (cid:26)(cid:8)(x; y) (cid:0) (cid:8)(x; ^y).\nClearly, m(cid:26)(x; y\nThe following theorem gives an analysis of the estimation error for approximate inference. The\nproof (in the supplementary) is based on Theorem 2 of [McAllester, 2007], with emphasis on the\napproximation rate (cid:26).\nTheorem 4. 
For a training set S = f(xi; yi)gm\n(xi; w) is a (cid:26)i-approximation of\nh(xi; w) on xi for all w. Denote (cid:26) = maxi (cid:26)i and Mi = maxy \u2225(cid:8)(xi; y)\u22251. Then, for any D,\nl(y; ^y) 2 [0; 1] and (cid:14) 2 [0; 1], with probability at least 1 (cid:0) (cid:14), the following upper bound holds.\n\n; y+; w) (cid:21) 0 for overestimation.\n\n\u2032\ni=1, assume h\n\n\u2206(cid:26)(x; y; ^y),\n\n(cid:3)\n\n\u22ba\n\n\u221a\n\n\u2225w\u22252\nm\n\n+\n\n(1 + (cid:26))2\u2225w\u22252 ln 2m(cid:21)S\n2(m (cid:0) 1)\n\n\u2225w\u22252 + ln m\n\n(cid:14)\n\n;\n\n(1)\n\nL(Q;D; h\n\nL(w; S) =\n\n\u2032\n\n((cid:1); w)) (cid:20) L(w; S) +\nm\u2211\nm\u2211\n\ni=1\n\n8>><>>: 1\n\nm\n\n1\nm\n\nmaxy l(yi; y)I(m(cid:26)i(xi; y\n\nmaxy l(yi; y)I(m(cid:26)i(xi; y\n\ni ; y; w) (cid:20) Mi)\n(cid:3)\ni ; y; w) (cid:21) (cid:0)Mi)\n(cid:3)\n\n\u2032\nif h\n\u2032\nif h\n\n((cid:1); w) 2 H-\n((cid:1); w) 2 H+\n\ni=1\n\n2 ln 2m(cid:21)S\n\nwhere y\n(cid:26))\n\n\u2032jw) is Gaussian with identity covariance matrix and mean (1 +\n(cid:3)\ni = h(xi; w), Q(w\n\u2225w\u22252 w, (cid:21)S is the maximum number of non-zero features among samples in S: (cid:21)S =\n\n\u221a\nmaxi;y \u2225(cid:8)(xi; y)\u22250.\nWe compare the bound in Theorem 4 for two hypotheses h1; h2 with approximation rate (cid:26)1;i; (cid:26)2;i\n(cid:3)\n\u22ba\non sample xi. Without loss of generality, we assume w\ni ) > 0 and (cid:26)1;i > (cid:26)2;i.\nIn the case of underestimation, since fyjm(cid:26)1;i(xi; y\ni ; y; w) (cid:20)\ni ; y; w) (cid:20) Mig (cid:18) fyjm(cid:26)2;i(xi; y\n(cid:3)\n(cid:3)\nMig, L(w; S) of h1 is smaller than that of h2, but h1 has a larger square root term. Thus, it is\npossible that underestimate approximation has a less estimation error than the exact inference. On\nthe other hand, for overestimation, both L(w; S) and the square root term of h1 are larger than those\nof h2. 
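The extended margin of Definition 3 is easy to compute directly; the sketch below (feature vectors are toy assumptions) shows that when w.Phi(x, y) > 0, shrinking rho shrinks the margin, which enlarges the set {y : m_rho(x, y*, y; w) <= M} entering L(w, S) for underestimation, as in the comparison above.

```python
# Definition 3's extended margin on hypothetical vectors:
# m_rho(x, y, y_hat; w) = w . (rho * Phi(x, y) - Phi(x, y_hat)).
def margin(w, phi_y, phi_yhat, rho):
    return sum(wi * (rho * a - b) for wi, a, b in zip(w, phi_y, phi_yhat))

w = [1.0, -0.5]
phi_y = [2.0, 1.0]     # features of the gold structure (toy values)
phi_yhat = [1.0, 2.0]  # features of a competing structure

m1 = margin(w, phi_y, phi_yhat, rho=1.0)      # the usual margin, here 1.5
m_half = margin(w, phi_y, phi_yhat, rho=0.5)  # here 0.75
# With w.Phi(x, y) > 0, a smaller rho yields a smaller margin, so more
# structures fall inside the indicator set of Theorem 4's empirical term.
assert m_half < m1
```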
It means that the more overestimation an inference algorithm makes, the larger estimation\nerror it may suffer.\n((cid:1); w) attains approximation rate (cid:26)i on xi for all possible w. This assump-\n\u2032\nTheorem 4 requires that h\ntion could be restrictive for including many approximate inference algorithms. We will try to relax\nthe requirement of Theorem 4 using the following measurement on stability of inference algorithms.\nDe\ufb01nition 5. h(x; w) is (cid:28)-stable on a sample x with respect to a norm \u2225 (cid:1) \u2225 if for any w\n\n(cid:8)(xi; y\n\n\u2032\n\n\u22ba\n\njw\n\n(cid:8)(x; y) (cid:0) w\n\n\u2032\u22ba\njw\u22ba(cid:8)(x; y)j\n\n\u2032\n\n)j\n\n(cid:8)(x; y\n\n(cid:20) (cid:28)\n\n\u2032\u2225\n\n\u2225w (cid:0) w\n\u2225w\u2225\n\n;\n\n\u2032\n\n\u2032\n\n= h(x; w\n\nwhere y = h(x; w); y\n((cid:1); w)\n\u2032\nTheorem 6. Assume that h\n(xi; w) is a (cid:26)i-approximation of h(xi; w) on the sample xi, and h\nis (cid:28)-stable on S with respect to \u2225 (cid:1) \u22251. Then with the same symbols in Theorem 4, L(Q;D; h\n((cid:1); w))\n\u2032\nis upper bounded by\n\n).\n\n\u2032\n\n\u221a\n\nL(w; S) +\n\n\u2225w\u22252\nm\n\n+\n\n(1 + 2(cid:26) + (cid:28) )2\u2225w\u22252 ln 2m(cid:21)S\n\n\u2225w\u22252 + ln m\n\n(cid:14)\n\n2(m (cid:0) 1)\n\n:\n\n\u2032 according to the de\ufb01nition of (cid:28). However, upper\nNote that we still need to consider all possible w\nbounds of (cid:28) could be derived for some approximate inference algorithms. As an example, we discuss\nthe linear programming relaxation (LP-relaxation) of integer linear programming, which covers a\nbroad range of approximate inference algorithms. The (cid:28)-stability of LP-relaxation can be obtained\nfrom perturbation theory of linear programming [Renegar, 1994, 1995].\nTheorem 7 (Proposition 2.5 of [Renegar, 1995]). For a feasible linear programming\n\n\u22ba\n\nmax : w\n\nz\n\ns.t. 
Az (cid:20) b; z (cid:21) 0;\n\n4\n\n\f1\n\n\u2032\n\n1\n\n\u2032\n\n1\n\n1\n\n\u2032\n\n2\n\n3\n\n3\n\n\u2032\n\n2\n\n1\n\n2\n\n(a)\n\n1\n\n(b)\n\n\u2032\n\n3\n\n\u2032\n\n1\n\n\u2032\n\n3\n\n2\n\n(c)\n\n2\n(a)\n\n\u2032\n\n2\n\n3\n\n1\n\n3\n\n3\n\n2\n\n(d)\n\n\u2032\n\n3\n\n(e)\n\n(f)\n\n2\n(d)\n\n\u2032\n\n2\n\n(b)\n\u2032\n\n2\n\n(e)\n\n\u2032\n\n1\n\n\u2032\n\n1\n\n3\n\n\u2032\n\n1\n\n\u2032\n\n3\n\n\u2032\n\n3\n\n1\n\n\u2032\n\n3\n\n\u2032\n\n2\n\n2\n(c)\n\n(f)\n\nFigure 1: An example of exact inference with\nless approximation error than underestimate\ninference (i.e., e(h) < e(h-))\n\nFigure 2: An example of underestimate infer-\nence with less approximation error than exact\ninference (i.e., e(h-) < e(h)).\n\nlet ^z, ^z\n\n\u2032 be solutions of the LP w.r.t. w and w\n\n\u2032. Then\n\n\u22ba\n\njw\n\n^z (cid:0) w\n\n\u2032\u22ba\n\n^z\n\n\u2032j (cid:20) max(\u2225b\u22251;jw\n\n\u22ba\n\n^zj)\n\n\u2225w (cid:0) w\n\n\u2032\u22251;\n\nwhere d is the l1 distance from A; b to the dual infeasible LP (\u2225A; b\u22251 = maxi;j;kfjAijj;jbkjg):\nd = inff(cid:14)j\u2225\u2206A; \u2206b\u22251 < (cid:14) ) the dual problem of the LP with(A + \u2206A; b + \u2206b) is infeasibleg:\n\nd\n\n3.2 Approximation Error\n\nIn this section, we compare the approximation error of models with different inference algorithms.\nThe discussions are based on the following de\ufb01nition (De\ufb01nition 1.1 of [Daniely et al., 2012]).\nDe\ufb01nition 8. For hypothesis spaces H;H\u2032, we say H essentially contains H\u2032 if for any h\n\u2032 2 H\u2032,\nthere is an h 2 H satisfying e(h) (cid:20) e(h\n) for all D, where e(h) = EDl(y; h(x)). In other words,\n\u2032\nfor any distribution D, the approximation error of H is at most the error of H\u2032.\nOur main result is that there exist cases that approximation errors of exact and underestimate infer-\nence are not comparable, in the sense that neither H contains H-, nor H- contains H. 
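Definition 5's tau-stability, used in Theorem 6 above, can also be probed empirically for the simplest exact inference, a maximum over an explicitly enumerated candidate set. The features, the perturbation scale, and the choice of the l2 norm below are all assumptions for illustration.

```python
import random

# Empirically probing Definition 5's tau-stability for exhaustive exact
# inference over a small candidate set (hypothetical features, l2 norm):
# the relative change of the optimal score under a perturbation of w is
# bounded by tau times the relative change of w.
CANDS = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-0.5, 0.5]]  # Phi(x, y) for each y

def best_score(w):
    return max(w[0] * p[0] + w[1] * p[1] for p in CANDS)

def tau_ratio(w, w2):
    rel_score = abs(best_score(w) - best_score(w2)) / abs(best_score(w))
    rel_w = (sum((a - b) ** 2 for a, b in zip(w, w2)) ** 0.5
             / sum(a * a for a in w) ** 0.5)
    return rel_score / rel_w

random.seed(0)
w = [2.0, 1.0]
ratios = [tau_ratio(w, [a + random.uniform(-0.1, 0.1) for a in w])
          for _ in range(1000)]
tau_hat = max(ratios)  # empirical lower bound on tau at this x
# Cauchy-Schwarz bounds tau here by ||w|| * max_y ||Phi(x, y)|| / |best_score(w)|.
assert 0.0 < tau_hat < 1.2
```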
2\nTo see that approximation errors could be non-comparable, we consider an approximate inference\nalgorithm h- which always outputs the second best y for a given w. The two examples in Figure 1\nand 2 demonstrate that it is both possible that e(h) < e(h-) and e(h-) < e(h). The following are the\ndetails.\n\u2032g. Sample x has three possible\nWe consider an input space containing two samples X = fx; x\n\u2032 also has three possible y,\noutput structures, which are named with 1; 2; 3 respectively. Sample x\n\u2032 be 1 and 1\n\u2032. For sample x, feature\n\u2032. Let the correct output of x and x\n\u2032\nwhich are named with 1\nvectors (cid:8)(x; 1); (cid:8)(x; 2); (cid:8)(x; 3) 2 R2 are points on the unit circle and form a equilateral triangle\n) are vertices of \u25b3(1\n\u25b3(1; 2; 3). Similarly, feature vectors (cid:8)(x\n\u2032\n\u2032\n). The\n; 3\nparameter space of w is the unit circle (since inference results only depend on directions of w).\nGiven a w, the exact inference h(x; w) choose the y whose (cid:8)(x; y) has the largest projection on w\n(i.e., h(x; w) = arg maxy2f1;2;3g w\n; y)), and\nh-(x; w) choose the y whose (cid:8)(x; y) has the second largest projection on w.\n\n; w) = arg maxy2f1\u2032;2\u2032;3\u2032g w\n\n(cid:8)(x; y) and h(x\n\n); (cid:8)(x\n\n\u2032\n\n\u2032\n; 2\n\n\u2032\n\n\u2032\n; 3\n\n\u2032\n\n\u2032\n; 1\n\n); (cid:8)(x\n\n; 2\n\n; 3\n\n\u2032\n\n; 2\n\n\u22ba\n\n\u22ba\n\n\u2032\n\n(cid:8)(x\n\n\u2032\n\n\u2032\n\n2 Note that there exist two paradigms for handling intractability of inference problems. The \ufb01rst one is\nto develop approximate inference algorithms for the exact problem, which is our focus here. Another one\nis to develop approximate problems with tractable exact inference algorithms. For example, in probabilistic\ngraphical models, one can add conditional independent assumptions to get a simpli\ufb01ed model with ef\ufb01cient\ninference algorithms. 
In the second paradigm, it is clear that approximate models are less expressive than the\nexact model, thus the approximation error of them are always larger. Our result, however, shows that it is\npossible to have underestimate inference of the original problem with smaller approximation error.\n\n5\n\n\f\u2032\n\n; w) = 1\n\n; w) = 1\n\n\u2032, which means that approximation error of exact inference H is 0.\n\u2032\n\nWe \ufb01rst show that it is possible e(h) < e(h-). In Figure 1, (a) shows that for sample x, any w in\nthe gray arc can make the output of exact inference correct (i.e., h(x; w) = 1). Similarly, in (b),\n\u2032. (c) shows that the two gray arcs in (a) and (b) are\nany w in the gray arc guarantees h(x\noverlapping on the dark arc. For any w in the dark arc, the exact inference has correct outputs on\nboth x and x\nAt the same time, in (d) of Figure 1, gray arcs contain w which makes the underestimate inference\n\u2032. (f) shows\ncorrect on sample x (i.e., h-(x; w) = 1), gray arcs in (e) are w with h-(x\nthe gray arcs in (d) and (e) are not overlapping, which means it is impossible to \ufb01nd a w such that\n\u2032. Thus the approximation error of underestimate inference H- is\nh-((cid:1); w) is correct on both x and x\nstrictly larger than 0, and we have e(h) < e(h-).\nSimilarly, in Figure 2, (a), (b), (c) show that we are able to choose w such that the underestimate\n\u2032, which implies the approximation error of underestimation H-\ninference is correct both on x and x\nequals 0. On the other hand, (d), (e), (f) shows that the approximation error of exact inference H is\nstrictly larger than 0, and we have e(h-) < e(h).\nFollowing the two \ufb01gures, we can illustrate that when (cid:8)(x; y) are vertices of convex regular n-gons,\nit is both possible that e(h) < e(h-) and e(h-) < e(h), where h- is an underestimation outputting the\nk-th best y. 
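The constructions of Figures 1 and 2 can be replayed numerically: features sit at the vertices of equilateral triangles on the unit circle, h- returns the second-best vertex, and a scan over directions w confirms that either decoder can be the one admitting a consistent w, depending on the relative rotation of the two triangles. The specific rotations below are our own choices for illustration, not taken from the figures.

```python
import math

# Replaying the Figure 1 / Figure 2 construction: Phi(x, y) for the three
# outputs of each sample are vertices of an equilateral triangle on the unit
# circle; h picks the largest projection on w, h- (second best) the runner-up.
def phi(angles):
    return [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

def decode(w, feats, rank):
    order = sorted(range(3), key=lambda i: w[0] * feats[i][0] + w[1] * feats[i][1],
                   reverse=True)
    return order[rank]  # rank 0: exact arg max, rank 1: second best

def correct_on_both(feats1, feats2, rank):
    # Is there a direction w making the decoder output vertex 0 on both samples?
    for t in range(720):  # scan directions in half-degree steps
        w = (math.cos(math.radians(t / 2)), math.sin(math.radians(t / 2)))
        if decode(w, feats1, rank) == 0 and decode(w, feats2, rank) == 0:
            return True
    return False

tri = [90, 210, 330]
x1 = phi(tri)
x2_a = phi([a + 90 for a in tri])    # rotation where exact succeeds, second best cannot
x2_b = phi([a + 150 for a in tri])   # rotation where second best succeeds, exact cannot
assert correct_on_both(x1, x2_a, rank=0) and not correct_on_both(x1, x2_a, rank=1)
assert correct_on_both(x1, x2_b, rank=1) and not correct_on_both(x1, x2_b, rank=0)
```

The scan mirrors the overlapping and non-overlapping arcs of the two figures: each decoder "wins" a union of arcs of directions, and the two rotations make those arcs intersect for one decoder but not the other.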
In fact, when we consider the \u201cworst\u201d approximation which outputs y with the smallest\nscore, its approximation error equals to the exact inference since h(x; w) = h-(x;(cid:0)w). Thus, we\nwould like to think that the geometry structures of (cid:8)(x; y) could be complex enough to make both\nexact and underestimate inference ef\ufb01cient.\nTo summarize, the examples suggest that underestimation algorithms give us a different family of\npredictors. For some data distribution, the underestimation family can have a better predictor than\nthe exact inference family.\nFinally, for the case of overestimate approximation, we can show that H+ contains H using Theorem\n1 of [Kulesza and Pereira, 2007].\nTheorem 9. For (cid:26) > 1, if the loss function l satis\ufb01es l(y1; y2) (cid:20) l(y1; y3) + l(y3; y2), then H+\ncontains H.\n\n4 Training with the New Margin\n\nTheorems 4 and 6 suggest that we could learn the model parameter w by minimizing a non-convex\nobjective L(w; S) + \u2225w\u22252. The L(w; S) term is related to the size of the set fyjm(cid:26)(xi; y\ni ; y; w) (cid:20)\n(cid:3)\nMig, which can be controlled by margin m(cid:26)2(xi; yi; y-\ni). Speci\ufb01cally, for underestimation,\n\nm(cid:26)(xi; y\n\ni ; y; w)(cid:21) (cid:26)w\n(cid:3)\n(cid:21) (cid:26)w\n\n\u22ba\n\u22ba\n\n(cid:8)(xi; yi) (cid:0) w\n(cid:8)(xi; yi) (cid:0) (cid:26)\n\n(cid:8)(xi; y) (cid:21) (cid:26)w\n\u22ba\n(cid:0)1w\n\n(cid:8)(xi; y-\n\n\u22ba\n\n\u22ba\n\ni) = (cid:26)\n\n\u22ba\n\n(cid:8)(xi; yi) (cid:0) w\n(cid:0)1m(cid:26)2 (xi; yi; y-\n\n(cid:8)(xi; y\ni; w); 8y:\n\n(cid:3)\ni )\n\n\u2211\n\ni) (replacing exact y\n\ni (cid:24)i; s.t. m(cid:26)2(xi; yi; y-\n\ni ; w) > 1 ) m1(xi; yi; y\n\n(cid:3)\ni with the approximate y-\njjwjj2 + C\n\ni), the lower L(w; S). Thus, when working with approxi-\nIt implies that the larger m(cid:26)2 (xi; yi; y-\nmate inference, we can apply m(cid:26)2(xi; yi; y-\ni) in existing maximum margin frameworks instead of\nm1(xi; yi; y-\ni). 
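The chain of inequalities above can be sanity-checked numerically: with an underestimate decoder of rate rho (Definition 1), m_rho(x, y*, y; w) >= rho^{-1} m_{rho^2}(x, y_gold, y-; w) for every y. Everything in the snippet (features, weights, gold output) is a toy assumption.

```python
# Numeric check of Section 4's inequality chain on toy data.
feats = {"a": [2.0, 1.0], "b": [1.5, 1.2], "c": [0.5, 0.3]}  # Phi(x, y), positive scores
w = [1.0, 1.0]

def score(y):
    return sum(wi * fi for wi, fi in zip(w, feats[y]))

def m(rho, y, y_hat):
    # Definition 3: m_rho = w . (rho * Phi(x, y) - Phi(x, y_hat))
    return rho * score(y) - score(y_hat)

y_star = max(feats, key=score)        # exact arg max: "a", score 3.0
y_minus = "b"                         # an approximate output, score 2.7
rho = score(y_minus) / score(y_star)  # tightest underestimation rate: 0.9
y_gold = "b"                          # hypothetical gold structure
for y in feats:
    # The chain only needs y* to be the arg max and Definition 1's lower bound.
    assert m(rho, y_star, y) >= m(rho * rho, y_gold, y_minus) / rho - 1e-9
```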
For example, the structural SVM in\ni; w) > 1 (cid:0) (cid:24)i.\n[Finley and Joachims, 2008] becomes min : 1\n2\nIntuitively, m(cid:26)2 aims to improve learning process by including more information about inference\nalgorithms. For overestimation, we don\u2019t have similar lower bounds as underestimation, but since\n(cid:0)1, we can apply the margin m(cid:26) as an approxima-\nm(cid:26)(xi; yi; y+\ntion of m1.\nIn practice, since it is hard to obtain (cid:26) for inference algorithms (even it is possible, as (cid:26) must consider\nthe worst case of all possible x, a tight (cid:26) maybe inef\ufb01cient on individual samples), we treat it as an\nalgorithm parameter which can be heuristically determined either by prior knowledge or by tuning\non development data. We leave the study of how to estimate (cid:26) systematically for future work.\nFor empirical evaluation, we examine structural SVM with cutting plane learning algorithm [Finley\nand Joachims, 2008], and we also adapt two wildly used online structured learning algorithms with\nm(cid:26): structured perceptron [Collins, 2002] (Algorithm 3) and online passive aggressive algorithm\n(PA) [Crammer et al., 2006] (Algorithm 4). The mistake bounds of the two algorithms are similar to\nbounds with exact inference algorithms (given in the supplementary).\n\n(cid:3)\ni ; w) > (cid:26)\n\n6\n\n\f1: w0 = 0\n2: for t = 0 to T do\nt = h-(xt; wt)\ny-\n3:\n\u0338= yt then\nif y-\n4:\nwt+1 = wt +(cid:26)(cid:8)(xt; yt)(cid:0)(cid:8)(xt; y-\nt\n5:\nt)\nend if\n6:\n7: end for\n8: return w = wT\nFigure 3: Structured perceptron with m(cid:26).\n\n1: w0 = 0\n2: for t = 0 to T do\nif m(cid:26)(xt; yt; y-\n3:\nwt+1 = arg minw: \u2225w (cid:0) wt\u22252\n4:\ns.t. 
m(cid:26)(xt; yt; y-\n5:\nend if\n6:\n7: end for\n8: return w = wT\n\nt; w) < 1 then\nt; w) (cid:21) 1\n\nFigure 4: Online PA with m(cid:26).\n\n5 Experiments\n\nWe present experiments on three natural language processing tasks: multi-class text classi\ufb01cation,\nsequential labelling and dependency parsing. For text classi\ufb01cation, we compare with the vanilla\nstructural SVM. For sequential labelling, we consider three tasks (phrase chunking (chu), POS\ntagging (pos) and Chinese word segmentation (cws)) and the perceptron training. For dependency\nparsing, we focus on the second order non-projective parser and the PA algorithm. For each task,\nwe focus on underestimate inference algorithms.\n\n5.1 Multi-class classi\ufb01cation\n\nMulti-class classi\ufb01cation is a special case of structured prediction. It has a limited number of class\nlabels and a simple exact inference algorithm (i.e., by enumerating labels). To evaluate the proposed\nmargin constraints, we consider toy approximate algorithms which output the kth best class label.\nWe report results on the 20 newsgroups corpus\n(18000 documents, 20 classes). The meta data\nis removed (headers, footers and quotes), and\nfeature vectors are simple tf-idf vectors. We\ntake 20% of the training set as development\nset for tuning (cid:26) (grid search in [0, 2] with step\nsize 0.05). The implementation is adapted from\nSVMmulticlass 3.\nFrom the results (Figure 5) we \ufb01nd that, com-\nparing with the vanilla structural SVM, the pro-\nposed margin constraints are able to improve\nerror rates for different inference algorithms.\nAnd, as k becomes larger, the improvement be-\ncomes more signi\ufb01cant. This property might be\nattractive since algorithms with loose approxi-\nmation rates are common in practical use. 
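The perceptron update of Figure 3 can be sketched for exactly this multi-class setting, where the kth-best label plays the role of approximate inference. The data, joint feature map and decoder below are toy assumptions rather than the paper's experimental setup.

```python
# A sketch of Figure 3 (structured perceptron with the m_rho margin) for
# multi-class classification, the special case of Section 5.1.
def phi(x, y, n_classes):
    # Joint feature map: copy x into the block of class y.
    v = [0.0] * (len(x) * n_classes)
    v[y * len(x):(y + 1) * len(x)] = x
    return v

def kth_best(w, x, n_classes, k):
    # Toy underestimate inference: the label with the kth largest score.
    order = sorted(range(n_classes),
                   key=lambda y: sum(a * b for a, b in zip(w, phi(x, y, n_classes))),
                   reverse=True)
    return order[k - 1]

def train(data, n_classes, rho, k, epochs=20):
    w = [0.0] * (len(data[0][0]) * n_classes)
    for _ in range(epochs):
        for x, y in data:
            y_hat = kth_best(w, x, n_classes, k)
            if y_hat != y:
                # Figure 3's update: w += rho * Phi(x, y) - Phi(x, y_hat)
                for i, (a, b) in enumerate(zip(phi(x, y, n_classes),
                                               phi(x, y_hat, n_classes))):
                    w[i] += rho * a - b
    return w

data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
w = train(data, n_classes=3, rho=0.9, k=1)
assert all(kth_best(w, x, 3, 1) == y for x, y in data)
```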
An-\nother observation is that, as k becomes larger,\nthe best parameter (cid:26) decreases in general.\nIt\nshows that the tuned parameter can re\ufb02ect the\nde\ufb01nition of approximate rate (Defnition 1).\n\nFigure 5: Results on text classi\ufb01cation. Blue\npoints are error rates for different k, and red points\nare (cid:26) achieving the best error rates on the devel-\nopment set. The red dot line is the least square\nlinear \ufb01tting of red points. The model parameter\nC = 104.\n\n5.2 Sequential Labelling\nIn sequential labelling, we predict sequences y = y1y2; : : : ; yK, where yk 2 Y is a label (e.g., POS\n(cid:8)(x; yk; yk(cid:0)1).\n\u22ba\nK\ntag). We consider the \ufb01rst order Markov assumption: h(x) = arg maxy\nk=1 w\nThe inference problem is tractable using O(KY 2) dynamic programming (Viterbi).\nWe examine a simple and fast greedy iterative decoder (\u201cgid\u201d), which is also known as the iterative\nconditional modes [Besag, 1986]. The algorithm \ufb02ips each label yk of y in a greedy way: for \ufb01xed\nyk(cid:0)1 and yk+1, it \ufb01nds a yk that makes the largest increase of the decoding objective function. The\n\n\u2211\n\n3http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html\n\n7\n\n2345678k405060708090ErrorratesVanillaSVMSVMwithm\u03c10.400.450.500.550.600.650.700.75\u03c1\fFigure 6: Results of sequential labelling tasks with Algorithm 3. The x-axis represents the random\nselection parameters u. The y-axis represents label accuracy.\n\nalgorithm passes the sequence multiple times and stops when no yk can be changed. It is faster in\npractice (speedup of 18x on POS tagging, 1:5x on word segmentation), requires less memory (O(1)\nspace complexity), and can obtain a reasonable performance.\nWe use the CoNLL 2000 dataset [Sang and Buchholz, 2000] for chunking and POS tagging,\nSIGHAN 2005 bake-off corpus (pku and msr) [Emerson, 2005] for word segmentation. We use\nAlgorithm 3 with 20 iterations and learning step 1. 
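The two decoders compared here, Viterbi and the greedy iterative decoder, can be sketched on a toy chain with random scores (the scores and sizes are assumptions, not the tasks' feature models); Viterbi matches brute-force enumeration while "gid" can only match or underestimate it.

```python
import itertools
import random

# Sketches of Section 5.2's decoders on a toy first-order chain with random
# (hypothetical) scores: exact Viterbi, and the greedy iterative decoder
# "gid" (iterated conditional modes), which flips one label at a time.
random.seed(1)
K, Y = 6, 3  # sequence length, label set size
unary = [[random.random() for _ in range(Y)] for _ in range(K)]
pair = [[random.random() for _ in range(Y)] for _ in range(Y)]

def total(y):
    return (sum(unary[k][y[k]] for k in range(K))
            + sum(pair[y[k - 1]][y[k]] for k in range(1, K)))

def viterbi():
    # Standard O(K * Y^2) dynamic program with back-pointers.
    dp, back = [unary[0][:]], []
    for k in range(1, K):
        row, brow = [], []
        for y in range(Y):
            p = max(range(Y), key=lambda q: dp[-1][q] + pair[q][y])
            row.append(dp[-1][p] + pair[p][y] + unary[k][y])
            brow.append(p)
        dp.append(row)
        back.append(brow)
    y = max(range(Y), key=lambda v: dp[-1][v])
    seq = [y]
    for brow in reversed(back):
        y = brow[y]
        seq.append(y)
    return list(reversed(seq))

def gid(y):
    # Flip each label greedily with its neighbours fixed, pass the sequence
    # repeatedly, and stop when no flip improves the score.
    changed = True
    while changed:
        changed = False
        for k in range(K):
            best = max(range(Y), key=lambda l: total(y[:k] + [l] + y[k + 1:]))
            if best != y[k]:
                y[k], changed = best, True
    return y

y_exact = viterbi()
y_greedy = gid([0] * K)
assert total(y_greedy) <= total(y_exact)
```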
We adopt standard feature sets in all tasks.\nTo test (cid:26) on more inference algorithms, we will apply a simple random selection strategy to gen-\nerate a bunch of in-between inference algorithms: when decoding an example, we select \u201cViterbi\u201d\nwith probability u, \u201cgid\u201d with probability 1 (cid:0) u. Heuristically, by varying u, we obtain inference\nalgorithms with different expected approximation rates.\nFigure 6 shows the results of (cid:26) (cid:20) 1 4. We can have following observations:\n(cid:15) At u = 0 (i.e., inference with \u201cgid\u201d), models with (cid:26) < 1 are signi\ufb01cantly better than (cid:26) = 1 on\npos and cws (p < 0:01 using z-test for proportions). Furthermore, on pos and cws, the best \u201cgid\u201d\nresults with parameters (cid:26) < 1 are competitive to the standard perceptron with exact inference (i.e.,\n(cid:26) = 1; u = 1). Thus, it is possible for approximate inference to be both fast and good.\n(cid:15) For 0 < u < 1, we can \ufb01nd that curves of (cid:26) < 1 are above the curve of (cid:26) = 1 in many cases. The\nlargest gap is 0:2% on chu, 0:6% on pos and 2% on cws. Thus, the learning parameter (cid:26) can also\nprovide performance gains for the combined inference algorithms.\n(cid:15) For u = 1 (i.e., using the \u201cViterbi\u201d), it is interesting to see that in pos, (cid:26) < 1 still outperforms\n(cid:26) = 1 by a large margin. We suspect that the (cid:26) parameter might also help the structured perceptron\nconverging to a better solution.\n\n5.3 Dependency Parsing\n\nOur third experiment is high order non-projective dependency parsing, for which the exact inference\nis intractable. We follows the approximate inference in MSTParser [McDonald and Pereira, 2006]\n5. 
The algorithm first finds the best high order projective tree using an O(n^3) dynamic programming [Eisner, 1996], and then heuristically introduces non-projective edges on the projective tree.

We use the online PA in Algorithm 4 with the above two-phase approximate inference algorithm. The parser is trained and tested on the 5 languages in the CoNLL 2007 shared task [Nivre et al., 2007] that have more than 20% non-projective sentences. Features are identical to the default MSTParser settings.^6

Table 1 lists the results with different ρ. It shows that on all languages, tuning the parameter helps to improve the parsing accuracy. As a reference, we also include results of the first order models. On Basque and Greek, the performance gains from ρ are comparable to the gains from introducing second order features, but the improvements on Czech, Hungarian and Turkish are limited. We also find that, different from text classification and sequential labelling, both ρ > 1 and ρ < 1 can obtain optimal scores. Thus, with the feature configuration of MSTParser, the value of w⊤Φ(x, y*) may not always be positive during the online learning process, and it reflects the fact that the feature space of parsing problems is usually more complex.

4 We also test models with ρ > 1, which underperform ρ < 1 in general. Details are in the supplementary material.
5 http://sourceforge.net/projects/mstparser/
6 Features in MSTParser are less powerful than the state-of-the-art, but we keep them for an easier implementation and comparison.
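The second phase of this two-phase inference can be illustrated by a greedy hill-climbing over head attachments. The sketch below is our own simplification, not MSTParser's actual routine: it assumes arc-factored scores `score[h][d]` (MSTParser evaluates changes under the full high order scoring), and the names are hypothetical.

```python
def creates_cycle(heads, dep, new_head):
    # Attaching dep under new_head cycles iff dep is an ancestor of new_head.
    node = new_head
    while node != 0:            # index 0 is the artificial root
        if node == dep:
            return True
        node = heads[node]
    return False


def hill_climb(heads, score, max_passes=10):
    """Greedily re-attach single words to raise the total arc score.

    heads[d] is the current head of word d (heads[0] = 0 for the root);
    score[h][d] is the score of the arc h -> d.  A word is moved only
    when the move strictly improves the total score and keeps the
    structure a tree, so non-projective arcs may be introduced.
    """
    heads = list(heads)
    n = len(heads)
    for _ in range(max_passes):
        changed = False
        for d in range(1, n):
            best_h, best_gain = heads[d], 0.0
            for h in range(n):
                if h == d or h == heads[d] or creates_cycle(heads, d, h):
                    continue
                gain = score[h][d] - score[heads[d]][d]
                if gain > best_gain:
                    best_h, best_gain = h, gain
            if best_h != heads[d]:
                heads[d] = best_h
                changed = True
        if not changed:
            break
    return heads
```

Starting from the projective tree of the first phase guarantees the output never scores below it, which makes the overall procedure an underestimation inference in the paper's terminology.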
Finally, setting a global ρ for different training samples could be coarse (so we only get improvements in a small neighborhood of 1), and how to estimate ρ for an individual x is an important piece of future work.

Setting         Basque  Czech  Greek  Hungarian  Turkish
1st Order         79.4   82.1   81.1       79.9     85.0
ρ = 1             79.8   82.8   81.7       81.7     85.5
ρ = 1 − 10^−3     79.7   83.0   81.3       81.1     85.2
ρ = 1 − 10^−4     80.3   82.9   82.2       81.8     85.7
ρ = 1 + 10^−3     79.4   82.3   81.5       80.7     85.6
ρ = 1 + 10^−4     79.6   83.0   82.5       81.6     85.4

Table 1: Results of the second order dependency parsing with parameter ρ. We report the unlabelled attachment score (UAS), which is the percentage of words with correct parents.

6 Conclusion

We analyzed the learning errors of structured prediction models with approximate inference. For the estimation error, we gave a PAC-Bayes analysis for underestimation and overestimation inference algorithms. For the approximation error, we showed the incomparability between exact and underestimate inference. The experiments on three NLP tasks with the newly proposed learning algorithms showed encouraging performances. In future work, we plan to explore more adaptive methods for estimating the approximation rate ρ and combining inference algorithms.

Acknowledgements

The authors wish to thank all reviewers for their helpful comments and suggestions. The corresponding authors are Man Lan and Shiliang Sun. This research is (partially) supported by NSFC (61402175, 61532011), STCSM (15ZR1410700) and the Shanghai Key Laboratory of Trustworthy Computing (07dz22304201604). Yuanbin Wu is supported by a Microsoft Research Asia Collaborative Research Program.

References

Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48(3):259–302, 1986.

Olivier Catoni.
PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture Notes–Monograph Series. IMS, 2007.

Venkat Chandrasekaran and Michael I. Jordan. Computational and statistical tradeoffs via convex relaxation. In Proc. of the National Academy of Sciences, volume 110, 2013.

Kai-Wei Chang, Shyam Upadhyay, Gourab Kundu, and Dan Roth. Structural learning with amortized inference. In Proc. of AAAI, pages 2525–2531, 2015.

Michael Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Proc. of the Seventh International Workshop on Parsing Technologies, 2001.

Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP, pages 1–8, 2002.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured prediction theory based on factor graph complexity. In NIPS, pages 2514–2522, 2016.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In NIPS, pages 494–502, 2012.

Jason M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proc. of COLING, 1996.

Thomas Emerson. The second international Chinese word segmentation bakeoff. In the Second SIGHAN Workshop on Chinese Language Processing, pages 123–133, 2005.

Thomas Finley and Thorsten Joachims. Training structural SVMs when exact inference is intractable. In Proc. of ICML, pages 304–311, 2008.

Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proc.
of ICML, pages 353–360, 2009.

Amir Globerson, Tim Roughgarden, David Sontag, and Cafer Yildirim. How hard is inference for structured prediction? In Proc. of ICML, pages 2181–2190, 2015.

Jean Honorio and Tommi S. Jaakkola. Structured prediction: From Gaussian perturbations to linear-time principled algorithms. In Proc. of UAI, 2016.

Liang Huang, Suphan Fayong, and Yang Guo. Structured perceptron with inexact search. In Proc. of HLT-NAACL, pages 142–151, 2012.

Alex Kulesza and Fernando Pereira. Structured learning with approximate inference. In NIPS, pages 785–792, 2007.

Gourab Kundu, Vivek Srikumar, and Dan Roth. Margin-based decomposed amortized inference. In Proc. of ACL, pages 905–913, 2013.

John Langford and John Shawe-Taylor. PAC-Bayes & margins. In NIPS, pages 423–430, 2002.

Ben London, Bert Huang, Ben Taskar, and Lise Getoor. Collective stability in structured prediction: Generalization from one example. In Proc. of ICML, pages 828–836, 2013.

André F. T. Martins, Noah A. Smith, and Eric P. Xing. Polyhedral outer approximations with application to natural language parsing. In Proc. of ICML, pages 713–720, 2009.

David McAllester. Generalization bounds and consistency for structured labeling. Chapter in Predicting Structured Data. MIT Press, 2007.

David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

David A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

David A. McAllester and Joseph Keshet. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pages 2205–2212, 2011.

Ryan McDonald and Fernando Pereira. Online learning of approximate dependency parsing algorithms. In Proc. of EACL, 2006.

Ofer Meshi, David Sontag, Tommi S. Jaakkola, and Amir Globerson.
Learning efficiently with approximate inference via dual losses. In Proc. of ICML, pages 783–790, 2010.

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, and David Sontag. Train and test tightness of LP relaxations in structured prediction. In Proc. of ICML, 2016.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. The CoNLL 2007 shared task on dependency parsing. In Proc. of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, 2007.

James Renegar. Some perturbation theory for linear programming. Mathematical Programming, 65:73–91, 1994.

James Renegar. Incorporating condition measures into the complexity theory of linear programming. SIAM Journal on Optimization, 5(3):506–524, 1995.

Rajhans Samdani and Dan Roth. Efficient decomposed learning for structured prediction. In Proc. of ICML, 2012.

Erik F. Tjong Kim Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of CoNLL and LLL, 2000.

Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. JMLR, 3:233–269, 2002.

David Sontag, Ofer Meshi, Tommi S. Jaakkola, and Amir Globerson. More data means less inference: A pseudo-max approach to structured learning. In NIPS, pages 2181–2189, 2010.

Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS, pages 25–32, 2003.

Zhuoran Wang and John Shawe-Taylor. Large-margin structured prediction via linear programming. In Proc.
of AISTATS, pages 599–606, 2009.