{"title": "Structured Prediction Theory Based on Factor Graph Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 2514, "page_last": 2522, "abstract": "We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, \\emph{factor graph complexity}, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, and a sparsity measure for features and graphs. Our proof techniques include generalizations of Talagrand's contraction lemma that can be of independent interest. We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to devise two new algorithms, \\emph{Voted Conditional Random Field} (VCRF) and \\emph{Voted Structured Boosting} (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. 
We also report the results of experiments with VCRF on several datasets to validate our theory.", "full_text": "Structured Prediction Theory Based on Factor Graph Complexity

Corinna Cortes
Google Research
New York, NY 10011
corinna@google.com

Vitaly Kuznetsov
Google Research
New York, NY 10011
vitaly@cims.nyu.edu

Mehryar Mohri
Courant Institute and Google
New York, NY 10012
mohri@cims.nyu.edu

Scott Yang
Courant Institute
New York, NY 10012
yangs@cims.nyu.edu

Abstract

We present a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a very wide family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs. Our proof techniques include generalizations of Talagrand's contraction lemma that can be of independent interest.
We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees.
We also report the results of experiments with VCRF on several datasets to validate our theory.

1 Introduction

Structured prediction covers a broad family of important learning problems. These include key tasks in natural language processing such as part-of-speech tagging, parsing, machine translation, and named-entity recognition, important areas in computer vision such as image segmentation and object recognition, and also crucial areas in speech processing such as pronunciation modeling and speech recognition.
In all these problems, the output space admits some structure. This may be a sequence of tags as in part-of-speech tagging, a parse tree as in context-free parsing, an acyclic graph as in dependency parsing, or labels of image segments as in object detection. Another property common to these tasks is that, in each case, the natural loss function admits a decomposition along the output substructures. As an example, the loss function may be the Hamming loss as in part-of-speech tagging, or it may be the edit-distance, which is widely used in natural language and speech processing.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The output structure and corresponding loss function make these problems significantly different from the (unstructured) binary classification problems extensively studied in learning theory. In recent years, a number of different algorithms have been designed for structured prediction, including Conditional Random Field (CRF) [Lafferty et al., 2001], StructSVM [Tsochantaridis et al., 2005], Maximum-Margin Markov Network (M3N) [Taskar et al., 2003], a kernel-regression algorithm [Cortes et al., 2007], and search-based approaches such as [Daumé III et al., 2009, Doppa et al., 2014, Lam et al., 2015, Chang et al., 2015, Ross et al., 2011].
More recently, deep learning techniques have also been developed for tasks including part-of-speech tagging [Jurafsky and Martin, 2009, Vinyals et al., 2015a], named-entity recognition [Nadeau and Sekine, 2007], machine translation [Zhang et al., 2008], image segmentation [Lucchi et al., 2013], and image annotation [Vinyals et al., 2015b].
However, in contrast to the plethora of algorithms, there have been relatively few studies devoted to the theoretical understanding of structured prediction [Bakir et al., 2007]. Existing learning guarantees hold primarily for simple losses such as the Hamming loss [Taskar et al., 2003, Cortes et al., 2014, Collins, 2001] and do not cover other natural losses such as the edit-distance. They also typically only apply to specific factor graph models. The main exception is the work of McAllester [2007], which provides PAC-Bayesian guarantees for arbitrary losses, though only in the special case of randomized algorithms using linear (count-based) hypotheses.
This paper presents a general theoretical analysis of structured prediction with a series of new results. We give new data-dependent margin guarantees for structured prediction for a broad family of loss functions and a general family of hypotheses, with an arbitrary factor graph decomposition. These are the tightest margin bounds known for both standard multi-class and general structured prediction problems. For special cases studied in the past, our learning bounds match or improve upon the previously best bounds (see Section 3.3). In particular, our bounds improve upon those of Taskar et al. [2003].
Our guarantees are expressed in terms of a data-dependent complexity measure, factor graph complexity, which we show can be estimated from data and bounded in terms of familiar quantities for several commonly used hypothesis sets, along with a sparsity measure for features and graphs.
We further extend our theory by leveraging the principle of Voted Risk Minimization (VRM) and show that learning is possible even with complex factor graphs. We present new learning bounds for this advanced setting, which we use to design two new algorithms, Voted Conditional Random Field (VCRF) and Voted Structured Boosting (StructBoost). These algorithms can make use of complex features and factor graphs and yet benefit from favorable learning guarantees. As a proof of concept validating our theory, we also report the results of experiments with VCRF on several datasets.
The paper is organized as follows. In Section 2, we introduce the notation and definitions relevant to our discussion of structured prediction. In Section 3, we derive a series of new learning guarantees for structured prediction, which are then used to prove the VRM principle in Section 4. Section 5 develops the algorithmic framework that is directly based on our theory. In Section 6, we provide some preliminary experimental results that serve as a proof of concept for our theory.

2 Preliminaries

Let X denote the input space and Y the output space. In structured prediction, the output space may be a set of sequences, images, graphs, parse trees, lists, or some other (typically discrete) objects admitting some possibly overlapping structure. Thus, we assume that the output structure can be decomposed into l substructures. For example, these may be positions along a sequence, so that the output space Y is decomposable along these substructures: Y = Y_1 × ··· × Y_l.
Here, Y_k is the set of possible labels (or classes) that can be assigned to substructure k.

Loss functions. We denote by L: Y × Y → R_+ a loss function measuring the dissimilarity of two elements of the output space Y. We will assume that the loss function L is definite, that is, L(y, y') = 0 iff y = y'. This assumption holds for all loss functions commonly used in structured prediction. A key aspect of structured prediction is that the loss function can be decomposed along the substructures Y_k. As an example, L may be the Hamming loss defined by
$$L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$$
for all y = (y_1, ..., y_l) and y' = (y'_1, ..., y'_l), with y_k, y'_k ∈ Y_k. In the common case where Y is a set of sequences defined over a finite alphabet, L may be the edit-distance, which is widely used in natural language and speech processing applications, with possibly different costs associated with insertions, deletions and substitutions. L may also be a loss based on the negative inner product of the vectors of n-gram counts of two sequences, or its negative logarithm. Such losses have been used to approximate the BLEU score loss in machine translation. There are other losses defined in computational biology based on various string-similarity measures. Our theoretical analysis is general and applies to arbitrary bounded and definite loss functions.

[Figure 1: Example of factor graphs. (a) Pairwise Markov network decomposition: h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3). (b) Other decomposition: h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3).]

Scoring functions and factor graphs. We will adopt the common approach in structured prediction where predictions are based on a scoring function mapping X × Y to R. Let H be a family of scoring functions.
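As a minimal sketch (not from the paper's code), the normalized Hamming loss above decomposes along substructures as follows:

```python
def hamming_loss(y, y_prime):
    """Normalized Hamming loss L(y, y') = (1/l) * sum_k 1[y_k != y'_k]."""
    assert len(y) == len(y_prime), "sequences must have the same length l"
    return sum(yk != ypk for yk, ypk in zip(y, y_prime)) / len(y)
```

Note that this loss is definite in the sense used above: it is zero exactly when the two sequences coincide.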
For any h ∈ H, we denote by h the predictor defined by h: for any x ∈ X, h(x) = argmax_{y∈Y} h(x, y).
Furthermore, we will assume, as is standard in structured prediction, that each function h ∈ H can be decomposed as a sum. We will consider the most general case for such decompositions, which can be made explicit using the notion of factor graphs.¹ A factor graph G is a tuple G = (V, F, E), where V is a set of variable nodes, F a set of factor nodes, and E a set of undirected edges between a variable node and a factor node. In our context, V can be identified with the set of substructure indices, that is V = {1, ..., l}.
For any factor node f, denote by N(f) ⊆ V the set of variable nodes connected to f via an edge and define Y_f as the substructure set cross-product Y_f = ∏_{k∈N(f)} Y_k. Then, h admits the following decomposition as a sum of functions h_f, each taking as argument an element of the input space x ∈ X and an element of Y_f, y_f ∈ Y_f:
$$h(x, y) = \sum_{f \in F} h_f(x, y_f). \tag{1}$$
Figure 1 illustrates this definition with two different decompositions. More generally, we will consider the setting in which a factor graph may depend on a particular example (x_i, y_i): G(x_i, y_i) = G_i = ([l_i], F_i, E_i). A special case of this setting is, for example, when the size l_i (or length) of each example is allowed to vary and where the number of possible labels |Y| is potentially infinite.
We present other examples of such hypothesis sets and their decomposition in Section 3, where we discuss our learning guarantees. Note that such hypothesis sets H with an additive decomposition are those commonly used in most structured prediction algorithms [Tsochantaridis et al., 2005, Taskar et al., 2003, Lafferty et al., 2001]. This is largely motivated by the computational requirement for efficient training and inference.
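The additive decomposition of (1) and the induced argmax predictor can be sketched as follows; the factor layout mirrors Figure 1(a), but the toy scoring functions and the exhaustive argmax are illustrative assumptions, not the paper's implementation (real structured predictors use dynamic programming rather than enumeration):

```python
from itertools import product

def make_scorer(factors):
    """factors: list of (indices, score_fn), where score_fn(x, y_sub) -> float
    plays the role of h_f(x, y_f) and indices gives N(f)."""
    def h(x, y):
        return sum(fn(x, tuple(y[k] for k in idx)) for idx, fn in factors)
    return h

def predict(h, x, label_sets):
    """Exhaustive argmax over Y = Y_1 x ... x Y_l (tractable only for tiny Y)."""
    return max(product(*label_sets), key=lambda y: h(x, y))

# Pairwise chain decomposition as in Figure 1(a): factors over (y1, y2), (y2, y3).
f1 = ((0, 1), lambda x, ys: 1.0 if ys[0] == ys[1] else 0.0)
f2 = ((1, 2), lambda x, ys: 1.0 if ys == (x, x) else 0.0)
h = make_scorer([f1, f2])
```

For long sequences the same decomposition is what makes efficient inference (e.g. Viterbi-style dynamic programming) possible.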
Our results, while very general, further provide a statistical learning motivation for such decompositions.

Learning scenario. We consider the familiar supervised learning scenario where the training and test points are drawn i.i.d. according to some distribution D over X × Y. We will further adopt the standard definitions of margin, generalization error and empirical error. The margin ρ_h(x, y) of a hypothesis h for a labeled example (x, y) ∈ X × Y is defined by
$$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y'). \tag{2}$$
Let S = ((x_1, y_1), ..., (x_m, y_m)) be a training sample of size m drawn from D^m. We denote by R(h) the generalization error and by R̂_S(h) the empirical error of h over S:
$$R(h) = \mathop{\mathbb{E}}_{(x,y)\sim D}[L(h(x), y)] \quad \text{and} \quad \widehat{R}_S(h) = \mathop{\mathbb{E}}_{(x,y)\sim S}[L(h(x), y)], \tag{3}$$
where h(x) = argmax_y h(x, y) and where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S. The learning problem consists of using the sample S to select a hypothesis h ∈ H with small expected loss R(h).

¹Factor graphs are typically used to indicate the factorization of a probabilistic model. We are not assuming probabilistic models, but they would also be captured by our general framework: h would then be −log of a probability.

Observe that the definiteness of the loss function implies, for all x ∈ X, the following equality:
$$L(h(x), y) = L(h(x), y)\, 1_{\rho_h(x,y) \leq 0}. \tag{4}$$
We will later use this identity in the derivation of surrogate loss functions.

3 General learning bounds for structured prediction

In this section, we present new learning guarantees for structured prediction. Our analysis is general and applies to the broad family of definite and bounded loss functions described in the previous section. It is also general in the sense that it applies to general hypothesis sets and not just sub-families of linear functions.
For linear hypotheses, we will give a more refined analysis that holds for arbitrary norm-p regularized hypothesis sets.
The theoretical analysis of structured prediction is more complex than that of classification since, by definition, it depends on the properties of the loss function and the factor graph. These attributes capture the combinatorial properties of the problem, which must be exploited since the total number of labels is often exponential in the size of that graph. To tackle this problem, we first introduce a new complexity tool.

3.1 Complexity measure

A key ingredient of our analysis is a new data-dependent notion of complexity that extends the classical Rademacher complexity. We define the empirical factor graph Rademacher complexity R̂^G_S(H) of a hypothesis set H for a sample S = (x_1, ..., x_m) and factor graph G as follows:
$$\widehat{\mathfrak{R}}^G_S(H) = \frac{1}{m} \mathop{\mathbb{E}}_{\epsilon}\Bigg[\sup_{h \in H} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} \sqrt{|F_i|}\; \epsilon_{i,f,y}\, h_f(x_i, y)\Bigg],$$
where ε = (ε_{i,f,y})_{i∈[m], f∈F_i, y∈Y_f} and where the ε_{i,f,y} are independent Rademacher random variables uniformly distributed over {±1}. The factor graph Rademacher complexity of H for a factor graph G is defined as the expectation: R^G_m(H) = E_{S∼D^m}[R̂^G_S(H)]. It can be shown that the empirical factor graph Rademacher complexity is concentrated around its mean (Lemma 8). The factor graph Rademacher complexity is a natural extension of the standard Rademacher complexity to vector-valued hypothesis sets (with one coordinate per factor in our case). For binary classification, the factor graph and standard Rademacher complexities coincide. Otherwise, the factor graph complexity can be upper bounded in terms of the standard one. As with the standard Rademacher complexity, the factor graph Rademacher complexity of a hypothesis set can be estimated from data in many cases.
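To make the definition concrete, here is a schematic Monte Carlo estimate of the empirical factor graph Rademacher complexity for a small finite hypothesis set. Everything in it is an illustrative assumption: hypotheses are tables of per-factor scores (so h_f is taken to be independent of x for brevity), and the sup over H is a max over the finite list:

```python
import random

def fg_rademacher(samples, hypotheses, label_sets, n_draws=200, seed=0):
    """samples: list of factor-id lists, one per example i (its F_i);
    hypotheses: list of dicts h with h[(f, y)] ~ h_f(x_i, y);
    label_sets: dict f -> list of labels y in Y_f."""
    rng = random.Random(seed)
    m = len(samples)
    total = 0.0
    for _ in range(n_draws):
        # One draw of the Rademacher variables eps_{i,f,y} in {-1, +1}.
        eps = {(i, f, y): rng.choice((-1.0, 1.0))
               for i, factors in enumerate(samples)
               for f in factors for y in label_sets[f]}
        # sup over h of sum_i sum_f sum_y sqrt(|F_i|) * eps * h_f
        best = max(
            sum((len(factors) ** 0.5) * eps[(i, f, y)] * h[(f, y)]
                for i, factors in enumerate(samples)
                for f in factors for y in label_sets[f])
            for h in hypotheses)
        total += best
    return total / (n_draws * m)
```

The average over draws approximates the expectation over ε in the definition.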
In some important cases, it also admits explicit upper bounds similar to those for the standard Rademacher complexity, but with an additional dependence on the factor graph quantities. We will prove this for several families of functions which are commonly used in structured prediction (Theorem 2).

3.2 Generalization bounds

In this section, we present new margin bounds for structured prediction based on the factor graph Rademacher complexity of H. Our results hold both for the additive and the multiplicative empirical margin losses defined below:
$$\widehat{R}^{\text{add}}_{S,\rho}(h) = \mathop{\mathbb{E}}_{(x,y)\sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y) - \tfrac{1}{\rho}\big[h(x, y) - h(x, y')\big]\Big)\bigg] \tag{5}$$
$$\widehat{R}^{\text{mult}}_{S,\rho}(h) = \mathop{\mathbb{E}}_{(x,y)\sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y)\big(1 - \tfrac{1}{\rho}[h(x, y) - h(x, y')]\big)\Big)\bigg]. \tag{6}$$
Here, Φ*(r) = min(M, max(0, r)) for all r, with M = max_{y,y'} L(y, y'). As we show in Section 5, convex upper bounds on R̂^add_{S,ρ}(h) and R̂^mult_{S,ρ}(h) directly lead to many existing structured prediction algorithms. The following is our general data-dependent margin bound for structured prediction.

Theorem 1. Fix ρ > 0. For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, the following holds for all h ∈ H:
$$R(h) \leq \widehat{R}^{\text{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho}\, \mathfrak{R}^G_m(H) + M\sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$R(h) \leq \widehat{R}^{\text{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\,M}{\rho}\, \mathfrak{R}^G_m(H) + M\sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
The full proof of Theorem 1 is given in Appendix A.
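A small sketch of the clipped additive and multiplicative empirical margin losses of (5)–(6) for a single example; the toy scores and loss values are assumptions for illustration:

```python
def phi_star(r, M):
    """Phi*(r) = min(M, max(0, r)), with M the largest loss value."""
    return min(M, max(0.0, r))

def empirical_margin_losses(scores, y, loss, rho, M):
    """scores: dict mapping each candidate y' to h(x, y');
    loss(y', y) -> L(y', y). Returns the additive and multiplicative losses."""
    rivals = [yp for yp in scores if yp != y]
    add = phi_star(max(loss(yp, y) - (scores[y] - scores[yp]) / rho
                       for yp in rivals), M)
    mult = phi_star(max(loss(yp, y) * (1.0 - (scores[y] - scores[yp]) / rho)
                        for yp in rivals), M)
    return add, mult
```

When the margin h(x, y) − h(x, y') exceeds ρ for every rival y', both losses vanish; when it is negative, both are clipped at M.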
It is based on a new contraction lemma (Lemma 5) generalizing Talagrand's lemma that can be of independent interest.² We also present a more refined contraction lemma (Lemma 6) that can be used to improve the bounds of Theorem 1. Theorem 1 is the first data-dependent generalization guarantee for structured prediction with general loss functions, general hypothesis sets, and arbitrary factor graphs for both multiplicative and additive margins. We also present a version of this result with empirical complexities as Theorem 7 in the supplementary material. We will compare these guarantees to known special cases below.
The margin bounds above can be extended to hold uniformly over ρ ∈ (0, 1], at the price of an additional term of the form √((log log₂(2/ρ))/m) in the bound, using known techniques (see for example [Mohri et al., 2012]).
The hypothesis set used by convex structured prediction algorithms such as StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003] or Conditional Random Field (CRF) [Lafferty et al., 2001] is that of linear functions. More precisely, let Ψ be a feature mapping from X × Y to R^N such that Ψ(x, y) = Σ_{f∈F} Ψ_f(x, y_f). For any p, define H_p as follows:
$$H_p = \{(x, y) \mapsto w \cdot \Psi(x, y) \colon w \in \mathbb{R}^N, \|w\|_p \leq \Lambda_p\}.$$
Then, R̂^G_m(H_p) can be efficiently estimated using random sampling and solving LP programs. Moreover, one can obtain explicit upper bounds on R̂^G_m(H_p). To simplify our presentation, we will consider the case p = 1, 2, but our results can be extended to arbitrary p ≥ 1 and, more generally, to arbitrary group norms.

Theorem 2. For any sample S = (x_1, . . .
, x_m), the following upper bounds hold for the empirical factor graph complexity of H_1 and H_2:
$$\widehat{\mathfrak{R}}^G_S(H_1) \leq \frac{\Lambda_1 r_\infty}{m} \sqrt{s \log(2N)}, \qquad \widehat{\mathfrak{R}}^G_S(H_2) \leq \frac{\Lambda_2 r_2}{m} \sqrt{\sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|},$$
where r_∞ = max_{i,f,y} ‖Ψ_f(x_i, y)‖_∞, r_2 = max_{i,f,y} ‖Ψ_f(x_i, y)‖_2, and where s is a sparsity factor defined by s = max_{j∈[1,N]} Σ_{i=1}^m Σ_{f∈F_i} Σ_{y∈Y_f} |F_i| 1_{Ψ_{f,j}(x_i,y)≠0}.

Plugging these factor graph complexity upper bounds into Theorem 1 immediately yields explicit data-dependent structured prediction learning guarantees for linear hypotheses with general loss functions and arbitrary factor graphs (see Corollary 10). Observe that, in the worst case, the sparsity factor can be bounded as follows:
$$s \leq \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i| \leq \sum_{i=1}^{m} |F_i|^2 d_i \leq m \max_i |F_i|^2 d_i,$$
where d_i = max_{f∈F_i} |Y_f|. Thus, the factor graph Rademacher complexity of linear hypotheses in H_1 scales as O(√(log(N) max_i |F_i|² d_i / m)). An important observation is that |F_i| and d_i depend on the observed sample. This shows that the expected size of the factor graph is crucial for learning in this scenario. This should be contrasted with other existing structured prediction guarantees that we discuss below, which assume a fixed upper bound on the size of the factor graph. Note that our result shows that learning is possible even with an infinite set Y. To the best of our knowledge, this is the first learning guarantee for learning with infinitely many classes.

²A result similar to Lemma 5 has also been recently proven independently in [Maurer, 2016].

Our learning guarantee for H_1 can additionally benefit from the sparsity of the feature mapping and observed data. In particular, in many applications, Ψ_{f,j} is a binary indicator function that is non-zero for a single (x, y) ∈ X × Y_f. For instance, in NLP, Ψ_{f,j} may indicate an occurrence of a certain n-gram in the input x_i and output y_i.
In this case, s = Σ_{i=1}^m |F_i|² ≤ m max_i |F_i|², and the complexity term is only in O(max_i |F_i| √(log(N)/m)), where N may depend linearly on d_i.

3.3 Special cases and comparisons

Markov networks. For the pairwise Markov networks with a fixed number of substructures l studied by Taskar et al. [2003], our equivalent factor graph admits l nodes, |F_i| = l, and the maximum size of Y_f is d_i = k² if each substructure of a pair can be assigned one of k classes. Thus, if we apply Corollary 10 with the Hamming distance as our loss function and divide the bound through by l, to normalize the loss to the interval [0, 1] as in [Taskar et al., 2003], we obtain the following explicit form of our guarantee for an additive empirical margin loss, for all h ∈ H_2:
$$R(h) \leq \widehat{R}^{\text{add}}_{S,\rho}(h) + \frac{4\Lambda_2 r_2}{\rho}\sqrt{\frac{2k^2}{m}} + 3\sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
This bound can be further improved by eliminating the dependency on k using an extension of our contraction Lemma 5 to ‖·‖_{∞,2} (see Lemma 6). The complexity term of Taskar et al. [2003] is bounded by a quantity that varies as Õ(√(Λ₂² r₂² q² / m)), where q is the maximal out-degree of a factor graph. Our bound has the same dependence on these key quantities, but with no logarithmic term in our case. Note that, unlike the result of Taskar et al. [2003], our bound also holds for general loss functions and different p-norm regularizers. Moreover, our result for a multiplicative empirical margin loss is new, even in this special case.
Multi-class classification. For standard (unstructured) multi-class classification, we have |F_i| = 1 and d_i = c, where c is the number of classes. In that case, for linear hypotheses with norm-2 regularization, the complexity term of our bound varies as O(Λ₂ r₂ √(c/(ρ²m))) (Corollary 11). This improves upon the best known general margin bounds of Kuznetsov et al. [2014], who provide a guarantee that scales linearly with the number of classes instead.
Moreover, in the special case where an individual w_y is learned for each class y ∈ [c], we retrieve the recent favorable bounds given by Lei et al. [2015], albeit with a somewhat simpler formulation. In that case, for any (x, y), all components of the feature vector Ψ(x, y) are zero, except (perhaps) for the N components corresponding to class y, where N is the dimension of w_y. In view of that, for example for a group-norm ‖·‖_{2,1}-regularization, the complexity term of our bound varies as O(Λ r √((log c)/(ρ²m))), which matches the results of Lei et al. [2015] with a logarithmic dependency on c (ignoring some complex exponents of log c in their case). Additionally, note that unlike existing multi-class learning guarantees, our results hold for arbitrary loss functions. See Corollary 12 for further details. Our sparsity-based bounds can also be used to give bounds with logarithmic dependence on the number of classes when the features only take values in {0, 1}. Finally, using Lemma 6 instead of Lemma 5, the dependency on the number of classes can be further improved.
We conclude this section by observing that, since our guarantees are expressed in terms of the average size of the factor graph over a given sample, this invites us to search for a hypothesis set H and predictor h ∈ H such that the tradeoff between the empirical size of the factor graph and the empirical error is optimal. In the next section, we will make use of the recently developed principle of Voted Risk Minimization (VRM) [Cortes et al., 2015] to reach this objective.

4 Voted Risk Minimization

In many structured prediction applications such as natural language processing and computer vision, one may wish to exploit very rich features. However, the use of rich families of hypotheses could lead to overfitting.
In this section, we show that it may be possible to use rich families in conjunction with simpler families, provided that fewer complex hypotheses are used (or that they are used with less mixture weight). We achieve this goal by deriving learning guarantees for ensembles of structured prediction rules that explicitly account for the differing complexities between families. This will motivate the algorithms that we present in Section 5.
Assume that we are given p families H_1, ..., H_p of functions mapping from X × Y to R. Define the ensemble family F = conv(∪_{k=1}^p H_k), that is, the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) is in the simplex and where, for each t ∈ [1, T], h_t is in H_{k_t} for some k_t ∈ [1, p]. We further assume that R^G_m(H_1) ≤ R^G_m(H_2) ≤ ... ≤ R^G_m(H_p). As an example, the H_k may be ordered by the size of the corresponding factor graphs.
The main result of this section is a generalization of the VRM theory to the structured prediction setting. The learning guarantees that we present are in terms of upper bounds on R̂^add_{S,ρ}(h) and R̂^mult_{S,ρ}(h), which are defined as follows for all τ ≥ 0:
$$\widehat{R}^{\text{add}}_{S,\rho,\tau}(h) = \mathop{\mathbb{E}}_{(x,y)\sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y) + \tau - \tfrac{1}{\rho}\big[h(x, y) - h(x, y')\big]\Big)\bigg] \tag{7}$$
$$\widehat{R}^{\text{mult}}_{S,\rho,\tau}(h) = \mathop{\mathbb{E}}_{(x,y)\sim S}\bigg[\Phi^*\Big(\max_{y' \neq y} L(y', y)\big(1 + \tau - \tfrac{1}{\rho}[h(x, y) - h(x, y')]\big)\Big)\bigg]. \tag{8}$$
Here, τ can be interpreted as a margin term that acts in conjunction with ρ. For simplicity, we assume in this section that |Y| = c < +∞.
Theorem 3. Fix ρ > 0.
For any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m, each of the following inequalities holds for all f ∈ F:
$$R(f) - \widehat{R}^{\text{add}}_{S,\rho,1}(f) \leq \frac{4\sqrt{2}}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + C(\rho, M, c, m, p),$$
$$R(f) - \widehat{R}^{\text{mult}}_{S,\rho,1}(f) \leq \frac{4\sqrt{2}\,M}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + C(\rho, M, c, m, p),$$
where
$$C(\rho, M, c, m, p) = \frac{2M}{\rho}\sqrt{\frac{\log p}{m}} + 3M\sqrt{\Big\lceil \frac{4}{\rho^2} \log\Big(\frac{c^2\rho^2 m}{4 \log p}\Big)\Big\rceil \frac{\log p}{m} + \frac{\log \frac{2}{\delta}}{2m}}.$$
The proof of this theorem crucially depends on the theory we developed in Section 3 and is given in Appendix A. As with Theorem 1, we also present a version of this result with empirical complexities as Theorem 14 in the supplementary material. The explicit dependence of this bound on the parameter vector α suggests that learning even with highly complex hypothesis sets could be possible so long as the complexity term, which is a weighted average of the factor graph complexities, is not too large. The theorem provides a quantitative way of determining the mixture weights that should be apportioned to each family. Furthermore, the dependency on the number of distinct feature map families H_k is very mild and therefore suggests that a large number of families can be used. These properties will be useful for motivating new algorithms for structured prediction.

5 Algorithms

In this section, we derive several algorithms for structured prediction based on the VRM principle discussed in Section 4. We first give general convex upper bounds (Section 5.1) on the structured prediction loss which recover as special cases the loss functions used in StructSVM [Tsochantaridis et al., 2005], Max-Margin Markov Networks (M3N) [Taskar et al., 2003], and Conditional Random Field (CRF) [Lafferty et al., 2001].
Next, we introduce a new algorithm, Voted Conditional Random Field (VCRF), in Section 5.2, with accompanying experiments as proof of concept. We also present another algorithm, Voted StructBoost (VStructBoost), in Appendix C.

5.1 General framework for convex surrogate losses

Given (x, y) ∈ X × Y, the mapping h ↦ L(h(x), y) is typically not a convex function of h, which leads to computationally hard optimization problems. This motivates the use of convex surrogate losses. We first introduce a general formulation of surrogate losses for structured prediction problems.
Lemma 4. For any u ∈ R_+, let Φ_u: R → R be an upper bound on v ↦ u 1_{v≤0}. Then, the following upper bound holds for any h ∈ H and (x, y) ∈ X × Y:
$$L(h(x), y) \leq \max_{y' \neq y} \Phi_{L(y',y)}\big(h(x, y) - h(x, y')\big). \tag{9}$$
The proof is given in Appendix A. This result defines a general framework that enables us to straightforwardly recover many of the most common state-of-the-art structured prediction algorithms via suitable choices of Φ_u(v): (a) for Φ_u(v) = max(0, u(1 − v)), the right-hand side of (9) coincides with the surrogate loss defining StructSVM [Tsochantaridis et al., 2005]; (b) for Φ_u(v) = max(0, u − v), it coincides with the surrogate loss defining Max-Margin Markov Networks (M3N) [Taskar et al., 2003] when using for L the Hamming loss; and (c) for Φ_u(v) = log(1 + e^{u−v}), it coincides with the surrogate loss defining the Conditional Random Field (CRF) [Lafferty et al., 2001].
Moreover, alternative choices of Φ_u(v) can help define new algorithms. In particular, we will refer to the algorithm based on the surrogate loss defined by Φ_u(v) = u e^{−v} as StructBoost, in reference to the exponential loss used in AdaBoost. Another related alternative is based on the choice Φ_u(v) = e^{u−v}. See Appendix C for further details on this algorithm.
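The choices of Φ_u(v) listed above can be written down directly; the sketch below also makes it easy to spot-check numerically that each one upper-bounds v ↦ u·1[v ≤ 0]:

```python
import math

def struct_svm(u, v):
    """(a) StructSVM-style hinge: Phi_u(v) = max(0, u * (1 - v))."""
    return max(0.0, u * (1.0 - v))

def m3n(u, v):
    """(b) M3N-style shifted hinge: Phi_u(v) = max(0, u - v)."""
    return max(0.0, u - v)

def crf(u, v):
    """(c) CRF-style logistic: Phi_u(v) = log(1 + exp(u - v))."""
    return math.log(1.0 + math.exp(u - v))

def structboost(u, v):
    """StructBoost-style exponential: Phi_u(v) = u * exp(-v)."""
    return u * math.exp(-v)
```

For v ≤ 0 each surrogate is at least u, and for v > 0 the bound u·1[v ≤ 0] = 0 holds trivially since all four functions are nonnegative.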
In fact, for each Φ_u(v) described above, the corresponding convex surrogate is an upper bound on either the multiplicative or the additive margin loss introduced in Section 3. Therefore, each of these algorithms seeks a hypothesis that minimizes the generalization bounds presented in Section 3. To the best of our knowledge, this interpretation of these well-known structured prediction algorithms is also new. In what follows, we derive new structured prediction algorithms that minimize the finer generalization bounds presented in Section 4.

5.2 Voted Conditional Random Field (VCRF)

We first consider the convex surrogate loss based on Φ_u(v) = log(1 + e^{u−v}), which corresponds to the loss defining CRF models. Using the monotonicity of the logarithm and upper bounding the maximum by a sum gives the following upper bound on the surrogate loss:
$$\max_{y' \neq y} \log\big(1 + e^{L(y,y') - w \cdot (\Psi(x,y) - \Psi(x,y'))}\big) \leq \log\Big(\sum_{y' \in \mathcal{Y}} e^{L(y,y') - w \cdot (\Psi(x,y) - \Psi(x,y'))}\Big),$$
which, combined with the VRM principle, leads to the following optimization problem:
$$\min_{w}\; \frac{1}{m} \sum_{i=1}^{m} \log\Big(\sum_{y \in \mathcal{Y}} e^{L(y,y_i) - w \cdot (\Psi(x_i,y_i) - \Psi(x_i,y))}\Big) + \sum_{k=1}^{p} (\lambda r_k + \beta) \|w_k\|_1, \tag{10}$$
where r_k = r_∞ |F^{(k)}| √(log N). We refer to the learning algorithm based on the optimization problem (10) as VCRF. Note that for λ = 0, (10) coincides with the objective function of L1-regularized CRF. Observe that we can also directly use max_{y'≠y} log(1 + e^{L(y,y') − w·δΨ(x,y,y')}), where δΨ(x, y, y') = Ψ(x, y) − Ψ(x, y'), or its upper bound Σ_{y'≠y} log(1 + e^{L(y,y') − w·δΨ(x,y,y')}), as a convex surrogate. We can similarly derive an L2-regularization formulation of the VCRF algorithm.
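A schematic evaluation of the VCRF objective in (10) for a tiny, fully enumerable output space; the feature map, the loss, the block structure of w, and the values standing in for r_k are illustrative assumptions, not the paper's experimental setup:

```python
import math

def vcrf_objective(data, outputs, Phi, L, w, blocks, lam, beta, r):
    """data: list of (x, y) pairs; outputs: the full (tiny) output space Y;
    Phi(x, y) -> feature vector; L(y', y) -> loss; blocks: k -> feature indices
    of the k-th family; r[k] stands in for r_k = r_inf * |F^(k)| * sqrt(log N)."""
    m = len(data)
    emp = 0.0
    for x, y in data:
        # log-sum-exp term of Eq. (10) for one training example
        emp += math.log(sum(
            math.exp(L(yp, y)
                     - sum(wj * (pj - qj)
                           for wj, pj, qj in zip(w, Phi(x, y), Phi(x, yp))))
            for yp in outputs))
    # per-family L1 penalty (lam * r_k + beta) * ||w_k||_1
    reg = sum((lam * r[k] + beta) * sum(abs(w[j]) for j in idx)
              for k, idx in blocks.items())
    return emp / m + reg
```

In practice the sum over Y is never enumerated; it is computed by dynamic programming over the factor graph, which is what makes CRF-style training tractable.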
In Appendix D, we describe efficient algorithms for solving the VCRF and VStructBoost optimization problems.

6 Experiments

In Appendix B, we corroborate our theory by reporting experimental results suggesting that the VCRF algorithm can outperform the CRF algorithm on a number of part-of-speech (POS) datasets.

7 Conclusion

We presented a general theoretical analysis of structured prediction. Our data-dependent margin guarantees for structured prediction can be used to guide the design of new algorithms or to derive guarantees for existing ones. Their explicit dependency on the properties of the factor graph and on feature sparsity can help shed new light on the role played by the graph and the features in generalization. Our extension of the VRM theory to structured prediction provides a new analysis of generalization when using a very rich set of features, which is common in applications such as natural language processing, and leads to new algorithms, VCRF and VStructBoost. Our experimental results for VCRF serve as a proof of concept and motivate more extensive empirical studies of these algorithms.

Acknowledgments

This work was partly funded by NSF CCF-1535987 and IIS-1618662, and NSF GRFP DGE-1342536.

References

G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.

K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In ICML, 2015.

M. Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Proceedings of IWPT, 2001.

C. Cortes, M. Mohri, and J. Weston. A General Regression Framework for Learning String-to-String Mappings. In Predicting Structured Data. MIT Press, 2007.

C. Cortes, V. Kuznetsov, and M. Mohri. Ensemble methods for structured prediction.
In ICML, 2014.

C. Cortes, P. Goyal, V. Kuznetsov, and M. Mohri. Kernel extraction via voted risk minimization. JMLR, 2015.

H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1):1317–1350, 2014.

D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009.

V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In NIPS, 2014.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. HC-Search for structured prediction in computer vision. In CVPR, 2015.

Y. Lei, Ü. D. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In NIPS, 2015.

A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In CVPR, 2013.

A. Maurer. A vector-contraction inequality for Rademacher complexities. In ALT, 2016.

D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3–26, January 2007.

S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun.
Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, Dec. 2005.

O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In NIPS, 2015a.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015b.

D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In IJCNLP, 2008.