{"title": "Learning Beam Search Policies via Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10652, "page_last": 10661, "abstract": "Beam search is widely used for approximate decoding in structured prediction problems. Models often use a beam at test time but ignore its existence at train time, and therefore do not explicitly learn how to use the beam. We develop an unifying meta-algorithm for learning beam search policies using imitation learning. In our setting, the beam is part of the model and not just an artifact of approximate decoding. Our meta-algorithm captures existing learning algorithms and suggests new ones. It also lets us show novel no-regret guarantees for learning beam search policies.", "full_text": "Learning Beam Search Policies via Imitation\n\nLearning\n\nRenato Negrinho1\n\nMatthew R. Gormley1\n\nGeoffrey J. Gordon1,2\n\n1Machine Learning Department, Carnegie Mellon University\n\n2Microsoft Research\n\n{negrinho,mgormley,ggordon}@cs.cmu.edu\n\nAbstract\n\nBeam search is widely used for approximate decoding in structured prediction\nproblems. Models often use a beam at test time but ignore its existence at train\ntime, and therefore do not explicitly learn how to use the beam. We develop an\nunifying meta-algorithm for learning beam search policies using imitation learning.\nIn our setting, the beam is part of the model, and not just an artifact of approximate\ndecoding. Our meta-algorithm captures existing learning algorithms and suggests\nnew ones. It also lets us show novel no-regret guarantees for learning beam search\npolicies.\n\n1\n\nIntroduction\n\nBeam search is the dominant method for approximate decoding in structured prediction tasks such\nas machine translation [1], speech recognition [2], image captioning [3], and syntactic parsing [4].\nMost models that use beam search at test time ignore the beam at train time and instead are learned\nvia methods like likelihood maximization. 
They therefore suffer from two issues that we jointly\naddress in this work: (1) learning ignores the existence of the beam and (2) learning uses only oracle\ntrajectories. These issues lead to mismatches between the train and test settings that negatively\naffect performance. Our work addresses these two issues simultaneously by using imitation learning\nto develop novel beam-aware algorithms with no-regret guarantees. Our analysis is inspired by\nDAgger [5].\nBeam-aware learning algorithms use beam search at both train and test time. These contrast with\ncommon two-stage learning algorithms that, \ufb01rst, at train time, learn a probabilistic model via\nmaximum likelihood, and then, at test time, use beam search for approximate decoding. The insight\nbehind beam-aware algorithms is that, if the model uses beam search at test time, then the model\nshould be learned using beam search at train time. Resulting beam-aware methods run beam search\nat train time (i.e., roll-in) to collect losses that are then used to update the model parameters. The \ufb01rst\nproposed beam-aware algorithms are perceptron-based, updating the parameters either when the best\nhypothesis does not score \ufb01rst in the beam [6], or when it falls out of the beam [7].\nWhile there is substantial prior work on beam-aware algorithms, none of the existing algorithms\nexpose the learned model to its own consecutive mistakes at train time. When rolling in with the\nlearned model, if a transition leads to a beam without the correct hypothesis, existing algorithms\neither stop [6, 8, 9] or reset to a beam with the correct hypothesis [7, 10, 11].1 Additionally, existing\nbeam-aware algorithms either do not have theoretical guarantees or only have perceptron-style\nguarantees [10]. 
We are the \ufb01rst to prove no-regret guarantees for an algorithm to learn beam search\npolicies.\n\n1[12] take a different approach by training with a differentiable approximation of beam search, but decode\n\nwith the standard (non-differentiable) search algorithm at test time.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fImitation learning algorithms, such as DAgger [5], leverage the ability to query an oracle at train time\nto learn a model that is competitive (in the no-regret sense) to the best model in hindsight. Existing\nimitation learning algorithms such as SEARN [13], DAgger [5]2, AggreVaTe [15], and LOLS [16],\nexecute the learned model at train time to collect data that is then labeled by the oracle and used for\nretraining. Nonetheless, these methods do not take the beam into account at train time, and therefore\ndo not learn to use the beam effectively at test time.\nWe propose a new approach to learn beam search policies using imitation learning that addresses\nthese two issues. We formulate the problem as learning a policy to traverse the combinatorial search\nspace of beams. The learned policy is induced via a scoring function: the neighbors of the elements\nof a beam are scored and the top k are used to form the successor beam. We learn a scoring function\nto match the ranking induced by the oracle costs of the neighbors. We introduce training losses\nthat capture this insight, among which are variants of the weighted all pairs loss [17] and existing\nbeam-aware losses. As the losses we propose are differentiable with respect to the scores, our scoring\nfunction can be learned using modern online optimization algorithms, e.g. Adam [18].\nIn some problems (e.g., sequence labeling and syntactic parsing) we have the ability to compute\noracle completions and oracle completion costs for non-optimal partial outputs. 
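For example, in sequence labeling under Hamming loss, the best completion of a non-optimal partial output simply copies the ground-truth suffix, so the oracle completion cost counts only the mistakes already committed in the prefix. A minimal sketch of this computation (the helper names are ours, not the paper's):

```python
def hamming_cost(y_hat, y):
    """Task cost: number of positions where a complete prediction disagrees."""
    assert len(y_hat) == len(y)
    return sum(a != b for a, b in zip(y_hat, y))

def oracle_completion_cost(prefix, y):
    """Oracle completion cost of a partial output under Hamming loss.

    The best completion copies the ground-truth suffix, so only mistakes
    already made in the prefix contribute to the cost.
    """
    return sum(a != b for a, b in zip(prefix, y[: len(prefix)]))

# A partial labeling with one mistake so far has completion cost 1:
y = ["N", "V", "D", "N"]
print(oracle_completion_cost(["N", "N"], y))  # 1
```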
Within our imitation\nlearning framework, we can use this ability to compute oracle completion costs for the neighbors\nof the elements of a beam at train time to induce an oracle that allows us to continue collecting\nsupervision after the best hypothesis falls out of the beam. Using this oracle information, we are able\nto propose a DAgger-like beam-aware algorithm with no-regret guarantees.\nWe describe our novel learning algorithm as an instantiation of a meta-algorithm for learning beam\nsearch policies. This meta-algorithm sheds light into key design decisions that lead to more performant\nalgorithms, e.g., the introduction of better training losses. Our meta-algorithm captures much of\nthe existing literature on beam-aware methods (e.g., [7, 8]), allowing a clearer understanding of and\ncomparison to existing approaches, for example, by emphasizing that they arise from speci\ufb01c choices\nof training loss function and data collection strategy, and by proving novel regret guarantees for them.\nOur contributions are: an algorithm for learning beam search policies (Section 4.2) with accompanying\nregret guarantees (Section 5), a meta-algorithm that captures much of the existing literature (Section 4),\nand new theoretical results for the early update [6] and LaSO [7] algorithms (Section 5.3).\n\n2 Preliminaries\n\nStructured Prediction as Learning to Search We consider structured prediction in the learning\nto search framework [13, 5]. Input-output training pairs D = {(x1, y1), . . . , (xm, ym)} are drawn\naccording to a data generating distribution D jointly over an input space X and an output space Y.\nFor each input x \u2208 X , there is an underlying search space Gx = (Vx, Ex) encoded as a directed\ngraph with nodes Vx and edges Ex. 
Each output y \u2208 Yx is encoded as a terminal node in Gx, where\nYx \u2286 Y is the set of valid structured outputs for x.\nIn this paper, we deal with stochastic policies \u03c0 : Vx \u2192 \u2206(Vx), where \u2206(Vx) is the set of probability\ndistributions over nodes in Vx. (For convenience and brevity of presentation, we make our policies\ndeterministic later in the paper through the introduction of a tie-breaking total order over the elements\nof Vx, but our arguments and theoretical results hold more generally.) The goal is to learn a stochastic\npolicy \u03c0(\u00b7, x, \u03b8) : Vx \u2192 \u2206(Vx) parametrized by \u03b8 \u2208 \u0398 \u2286 Rp that traverses the induced search\nspaces, generating outputs with small expected cost; i.e., ideally, we would want to minimize\n\nc(\u03b8) = E(x,y)\u223cDE\u02c6y\u223c\u03c0(\u00b7,x,\u03b8)cx,y(\u02c6y),\n\n(1)\nwhere cx,y : Yx \u2192 R is the cost function comparing the ground-truth labeling y to the predicted\nlabeling \u02c6y. We are not able to optimize directly the loss in Equation (1), but we are able to \ufb01nd\na mixture of policies \u03b81, . . . , \u03b8m, where \u03b8t \u2208 \u0398 for all t \u2208 [m], that is competitive with the best\npolicy in \u0398 in the distribution of trajectories induced by the mixture of \u03b81, . . . , \u03b8m. We use notation\n\u02c6y \u223c \u03c0(\u00b7, x, \u03b8) to mean that \u02c6y is generated by sampling a trajectory v1, . . . , vh on Gx by executing\npolicy \u03c0(\u00b7, x, \u03b8), and returning the labeling \u02c6y \u2208 Y associated with terminal node vh \u2208 T . The search\nspaces, cost functions and policies depend on x \u2208 X or (x, y) \u2208 X \u00d7 Y\u2014in the sequel, we omit\nindexing by example for conciseness.\n\n2 Scheduled sampling [14] is an instantiation of DAgger.\n\n2\n\n\fSearch Space, Cost, and Policies Each example (x, y) \u2208 X \u00d7 Y induces a search space G =\n(V, E) and a cost function c : Y \u2192 R. 
For all v ∈ V, we introduce its set of neighbors Nv = {v′ ∈ V | (v, v′) ∈ E}. We identify a single initial node v(0) ∈ V. We define the set of terminal nodes T = {v ∈ V | Nv = ∅}. We assume without loss of generality that all nodes are reachable from v(0) and that all nodes have paths to terminal nodes. For clarity of exposition, we assume that G is a tree-structured directed graph where all terminal nodes are at distance h from the root v(0). We describe in Appendix A how to convert a directed graph search space to a tree-structured one with all terminals at the same depth.

Each terminal node v ∈ T corresponds to a complete output y ∈ Y, which can be compared to the ground-truth y∗ ∈ Y via a cost function c : T → R of interest (e.g., Hamming loss in sequence labeling or negative BLEU score [19] in machine translation). We define the optimal completion cost function c∗ : V → R, which computes the cost of the best terminal node reachable from v ∈ V as c∗(v) = min_{v′∈Tv} c(v′), where Tv is the set of terminal nodes reachable from v.

The definition of c∗ : V → R naturally gives rise to an oracle policy π∗(·, c∗) : V → ∆(V). At v ∈ V, π∗(v, c∗) can be any fixed distribution (e.g., uniform or one-hot) over arg min_{v′∈Nv} c∗(v′). For any state v ∈ V, executing π∗(·, c∗) until arriving at a terminal node achieves the lowest possible cost for completions of v.

At v ∈ V, a greedy policy π : V → ∆(V) induced by a scoring function s : V → R computes a fixed distribution π(v, θ) over arg max_{v′∈Nv} s(v′, θ). When multiple elements are tied with the same highest score, we can choose an arbitrary distribution over them. 
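These definitions can be illustrated on a tiny hypothetical tree-structured search space; `edges`, `terminal_cost`, and tie-breaking by node name are illustrative assumptions, not part of the paper:

```python
# Tiny tree-structured search space: node -> list of children.
# Leaves are the terminal nodes T, each carrying a task cost c(v).
edges = {"root": ["a", "b"], "a": ["t1", "t2"], "b": ["t3", "t4"]}
terminal_cost = {"t1": 3.0, "t2": 1.0, "t3": 0.0, "t4": 2.0}

def neighbors(v):
    return edges.get(v, [])

def optimal_completion_cost(v):
    """c*(v): cost of the best terminal node reachable from v (recursive)."""
    if not neighbors(v):  # terminal node
        return terminal_cost[v]
    return min(optimal_completion_cost(u) for u in neighbors(v))

def greedy_step(v, score):
    """Greedy policy induced by a scoring function; ties broken by node name."""
    return max(neighbors(v), key=lambda u: (score(u), u))

def oracle_step(v):
    """Oracle policy: the greedy policy induced by the score -c*."""
    return greedy_step(v, lambda u: -optimal_completion_cost(u))

# Rolling out the oracle policy reaches the lowest-cost terminal.
v = "root"
while neighbors(v):
    v = oracle_step(v)
print(v, terminal_cost[v])  # t3 0.0
```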
If there is a single highest scoring element, the policy is deterministic. In this paper, we assume the existence of a total order over the elements of V that is used for breaking ties induced by a scoring function. The tie-breaking total order allows us to talk about a particular unique ordering, even when ties occur. The oracle policy π∗(·, c∗) : V → ∆(V) can be thought of as being induced by the scoring function −c∗ : V → R.

3 Beam search

Algorithm 1 Beam Search
1: function BEAMSEARCH(G, k, θ)
2:   b ← {v(0)} ≡ b(0)
3:   while BEST(b, 1, s(·, θ)) ∉ T do
4:     b ← POLICY(G, b, k, s(·, θ))
5:   return BEST(b, 1, s(·, θ))
6: function POLICY(G, b, k, f)
7:   Let Ab = ∪_{v∈b} Nv
8:   return BEST(Ab, k, f)
9: function BEST(A, k, f)
10:   Let A = {v1, . . . , vn} be ordered such that f(v1) ≥ ··· ≥ f(vn)
11:   Let k′ = min(k, n)
12:   return v1, . . . , vk′

Beam Search Space Given a search space G, we construct its beam search space Gk = (Vk, Ek), where k ∈ N is the maximum beam capacity. Vk is the set of possible beams that can be formed along the search process, and Ek is the set of possible beam transitions. Nodes b ∈ Vk correspond to nonempty sets of nodes of V with size upper bounded by k, i.e., b = {v1, . . . , v|b|} with 1 ≤ |b| ≤ k and vi ∈ V for all i ∈ [|b|]. The initial beam state b(0) ∈ Vk is the singleton set with the initial state v(0) ∈ V. Terminal nodes in Tk are singleton sets with a single terminal node v ∈ T. For b ∈ Vk, we define Ab = ∪_{v∈b} Nv, i.e., the union of the neighborhoods of the elements in b.

Algorithm 1 describes the beam search variant used in our paper. In this paper, all elements in the beam are simultaneously expanded when transitioning. It is possible to define different beam search space variants, e.g., by considering different expansion strategies or by handling terminals differently (in the case where terminals can be at different depths).3 The arguments developed in this paper can be extended to those variants in a straightforward manner.

3 [13] mention the possibility of encoding complex search algorithms by defining derived search spaces.

Beam Costs We define the cost of a beam to be the cost of its lowest cost element, i.e., we have c∗ : Vk → R and, for b ∈ Vk, c∗(b) = min_{v∈b} c∗(v). We define the beam transition cost function c : Ek → R to be c(b, b′) = c∗(b′) − c∗(b), for (b, b′) ∈ Ek, i.e., the difference in cost between the lowest cost element in b′ and the lowest cost element in b. A cost increase occurs on a transition (b, b′) ∈ Ek if c∗(b′) > c∗(b), or equivalently, c(b, b′) > 0, i.e., b′ dropped all the lowest cost neighbors of the elements of b. For all b ∈ Vk, we define N∗_b = {b′ ∈ Nb | c(b, b′) = 0}, i.e., the set of beams neighboring b that do not lead to cost increases. We will significantly overload notation, but usage will be clear from context and argument types, e.g., when referring to c∗ : V → R and c∗ : Vk → R.

Algorithm 2 Meta-algorithm
1: function LEARN(D, θ1, k)
2:   for each t ∈ [|D|] do
3:     Induce G using xt
4:     Induce s(·, θt) : V → R using G and θt
5:     Induce c∗ : V → R using (xt, yt)
6:     b1:j ← BEAMTRAJECTORY(G, c∗, s(·, θt), k)
7:     Incur losses ℓ(·, b1), . . . , ℓ(·, bj−1)
8:     Compute θt+1 using Σ_{i=1}^{j−1} ℓ(·, bi), e.g., by SGD or Adam
9:   return best θt on validation
10: function BEAMTRAJECTORY(G, c∗, f, k)
11:   b1 ← {v(0)} ≡ b(0)
12:   j = 1
13:   while BEST(bj, 1, f) ∉ T do
14:     if strategy is oracle then
15:       bj+1 ← POLICY(G, bj, k, −c∗)
16:     else
17:       bj+1 ← POLICY(G, bj, k, f)
18:       if c∗(bj+1) > c∗(bj) then
19:         if strategy is stop then
20:           break
21:         if strategy is reset then
22:           bj+1 ← POLICY(G, bj, 1, −c∗)
23:     j ← j + 1
24:   return b1:j

Beam Policies Let π : Vk → ∆(Vk) be a policy induced by a scoring function f : V → R. To sample b′ ∼ π(b) for a beam b ∈ Vk, form Ab, and compute scores f(v) for all v ∈ Ab; let v1, . . . , vn be the elements of Ab ordered such that f(v1) ≥ . . . ≥ f(vn); if v1 ∈ T, b′ = {v1}; if v1 ∉ T, let b′ pick the k top-most elements from Ab \ T. At b ∈ Vk, if there are many orderings that sort the scores of the elements of Ab, we can choose a single one deterministically or sample one stochastically; if there is a single such ordering, the policy π : Vk → ∆(Vk) is deterministic at b.

For each x ∈ X, at train time, we have access to the optimal completion cost function c∗ : V → R, which induces the oracle policy π∗(·, c∗) : Vk → ∆(Vk). At a beam b, a successor beam b′ ∈ Nb is optimal if c∗(b′) = c∗(b), i.e., at least one neighbor with the smallest possible cost was included in b′. The oracle policy π∗(·, c∗) : Vk → ∆(Vk) can be seen as using the scoring function −c∗ : Vk → R to transition in the beam search space Gk.

4 Meta-Algorithm

Our goal is to learn a policy π(·, θ) : Vk → ∆(Vk) induced by a scoring function s(·, θ) : V → R that achieves small expected cumulative transition cost along the induced trajectories. Algorithm 2 presents our meta-algorithm in detail. Instantiating our meta-algorithm requires choosing both a surrogate training loss function (Section 4.1) and a data collection strategy (Section 4.2). Table 1 shows how existing algorithms can be obtained as instances of our meta-algorithm with specific choices of loss function, data collection strategy, and beam size.

4.1 Surrogate Losses

Insight In the beam search space, a prediction ŷ ∈ Yx for x ∈ X is generated by running π(·, θ) on Gk. This yields a beam trajectory b1:h, where b1 = b(0) and bh ∈ Tk. 
We have

c(θ) = E_{(x,y)∼D} E_{ŷ∼π(·,θ)} c(ŷ) = E_{(x,y)∼D} E_{b1:h∼π(·,θ)} c∗(bh).  (2)

The term c∗(bh) can be written in a telescoping manner as

c∗(bh) = c∗(b1) + Σ_{i=1}^{h−1} c(bi, bi+1).  (3)

As c∗(b1) depends on an example (x, y) ∈ X × Y, but not on the parameters θ ∈ Θ, the set of minimizers of c : Θ → R is the same as the set of minimizers of

c′(θ) = E_{(x,y)∼D} E_{b1:h∼π(·,θ)} ( Σ_{i=1}^{h−1} c(bi, bi+1) ).  (4)

It is not easy to minimize the cost function in Equation (4) as, for example, c(b, ·) : Vk → R is combinatorial. To address this issue, we observe the following by using linearity of expectation and the law of iterated expectations to decouple the term in the sum over the trajectory:

E_{b1:h∼π(·,θ)} ( Σ_{i=1}^{h−1} c(bi, bi+1) ) = Σ_{i=1}^{h−1} E_{bi∼d_{θ,i}} E_{bi+1∼π(bi,θ)} c(bi, bi+1) = E_{b1:h∼π(·,θ)} ( Σ_{i=1}^{h−1} E_{b′∼π(bi,θ)} c(bi, b′) ),  (5)

where d_{θ,i} denotes the distribution over beams in Vk that results from following π(·, θ) on Gk for i steps. 
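The telescoping decomposition in Equation (3) can be sanity-checked numerically on any beam trajectory; the beam costs below are hypothetical:

```python
# Hypothetical beam costs c*(b_1), ..., c*(b_h) along one trajectory.
beam_costs = [0.0, 0.0, 1.0, 1.0, 3.0]  # cost increases at steps 2 and 4

# Beam transition costs c(b_i, b_{i+1}) = c*(b_{i+1}) - c*(b_i).
transition_costs = [b - a for a, b in zip(beam_costs, beam_costs[1:])]

# Telescoping: the final cost is the initial cost plus the summed increases,
# so minimizing cumulative transition cost minimizes the final cost.
assert beam_costs[-1] == beam_costs[0] + sum(transition_costs)
print(transition_costs)  # [0.0, 1.0, 0.0, 2.0]
```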
We now replace E_{b′∼π(b,·)} c(b, b′) : Θ → R by a surrogate loss function ℓ(·, b) : Θ → R that is differentiable with respect to the parameters θ ∈ Θ, and where ℓ(θ, b) is a surrogate loss for the expected cost increase incurred by following policy π(·, θ) at beam b for one step.

Elements in Ab should be scored in a way that allows the best elements to be kept in the beam. Different surrogate losses arise depending on which elements we concern ourselves with, e.g., all the top k elements in Ab or simply one of the best elements in Ab. Surrogate losses are then large when the scores lead to discarding desired elements of Ab, and small when the scores lead to comfortably keeping the desired elements of Ab.

Surrogate Loss Functions The following additional notation allows us to define losses precisely. Let Ab = {v1, . . . , vn} be an arbitrary ordering of the neighbors of the elements in b. Let c = c1, . . . , cn be the corresponding costs, where ci = c∗(vi) for all i ∈ [n], and s = s1, . . . , sn be the corresponding scores, where si = s(vi, θ) for all i ∈ [n]. Let σ∗ : [n] → [n] be a permutation such that c_{σ∗(1)} ≤ . . . ≤ c_{σ∗(n)}, i.e., v_{σ∗(1)}, . . . , v_{σ∗(n)} are ordered in increasing order of cost. Note that c∗(b) = c_{σ∗(1)}. Similarly, let σ̂ : [n] → [n] be a permutation such that s_{σ̂(1)} ≥ . . . ≥ s_{σ̂(n)}, i.e., v_{σ̂(1)}, . . . , v_{σ̂(n)} are ordered in decreasing order of score. We assume unique σ∗ : [n] → [n] and σ̂ : [n] → [n] to simplify the presentation of the loss functions (uniqueness can be guaranteed via the tie-breaking total order on V). 
In this case, at b ∈ Vk, the successor beam b′ ∈ Nb is uniquely determined by the scores of the elements of Ab.

For each (x, y) ∈ X × Y, the corresponding cost function c∗ : V → R is independent of the parameters θ ∈ Θ. We define a loss function ℓ(·, b) : Θ → R at a beam b ∈ Vk in terms of the oracle costs of the elements of Ab. We now introduce some well-motivated surrogate loss functions. Perceptron and large-margin inspired losses have been used in early update [6], LaSO [7], and BSO [11]. We also introduce two log losses.

perceptron (first) Penalizes the lowest cost element in Ab not being put at the top of the beam. When applied on the first cost increase, this is equivalent to an “early update” [6].

ℓ(s, c) = max(0, s_{σ̂(1)} − s_{σ∗(1)}).  (6)

perceptron (last) Penalizes the lowest cost element in Ab falling out of the beam.

ℓ(s, c) = max(0, s_{σ̂(k)} − s_{σ∗(1)}).  (7)

margin (last) Prefers the lowest cost element to be scored higher than the last element in the beam by a margin. This yields updates that are similar but not identical to the approximate large-margin variant of LaSO [7].

ℓ(s, c) = max(0, 1 + s_{σ̂(k)} − s_{σ∗(1)}).  (8)

cost-sensitive margin (last) Weights the margin loss by the cost difference between the lowest cost element and the last element in the beam. 
When applied on a LaSO-style cost increase, this is equivalent to the BSO update of [11].

ℓ(s, c) = (c_{σ̂(k)} − c_{σ∗(1)}) max(0, 1 + s_{σ̂(k)} − s_{σ∗(1)}).  (9)

upper bound Convex upper bound to the expected beam transition cost, E_{b′∼π(b,·)} c(b, b′) : Θ → R, where b′ is induced by the scores s ∈ Rn.

ℓ(s, c) = max(0, δ_{k+1}, . . . , δ_n),  (10)

where δ_j = (c_{σ∗(j)} − c_{σ∗(1)})(s_{σ∗(j)} − s_{σ∗(1)} + 1) for j ∈ {k + 1, . . . , n}. Intuitively, this loss imposes a cost-weighted margin between the best neighbor v_{σ∗(1)} ∈ Ab and the neighbors v_{σ∗(k+1)}, . . . , v_{σ∗(n)} ∈ Ab that ought not to be included in the best successor beam b′. We prove in Appendix B that this loss is a convex upper bound for the expected beam transition cost.

log loss (beam) Normalizes only over the top k neighbors of a beam according to the scores s.

ℓ(s, c) = −s_{σ∗(1)} + log ( Σ_{i∈I} exp(si) ),  (11)

where I = {σ∗(1), σ̂(1), . . . , σ̂(k)}. The normalization is only over the correct element v_{σ∗(1)} and the elements included in the beam. The set of indices I ⊆ [n] encodes the fact that the score vector s ∈ Rn may not place v_{σ∗(1)} in the top k, and therefore it has to also be included in that case. 
This loss is used in [9], albeit introduced differently.

log loss (neighbors) Normalizes over all elements in Ab.

ℓ(s, c) = −s_{σ∗(1)} + log ( Σ_{i=1}^{n} exp(si) ).  (12)

Discussion The losses presented here directly capture the purpose of using a beam for prediction: ensuring that the best hypothesis stays in the beam, i.e., that, at b ∈ Vk, v_{σ∗(1)} ∈ Ab is scored sufficiently high to be included in the successor beam b′ ∈ Nb. If full cost information is not accessible, i.e., if we are not able to evaluate c∗ : V → R for arbitrary elements in V, it is still possible to use a subset of these losses, provided that we are able to identify the lowest cost element among the neighbors of a beam, i.e., for all b ∈ Vk, an element v ∈ Ab such that c∗(v) = c∗(b).

While certain losses do not appear beam-aware (e.g., those in Equation (6) and Equation (12)), it is important to keep in mind that all losses are collected by executing a policy on the beam search space Gk. Given a beam b ∈ Vk, the score vector s ∈ Rn and cost vector c ∈ Rn are defined for the elements of Ab. The losses incurred depend on the specific beams visited. Losses in Equation (6), (10), and (12) are convex. The remaining losses are non-convex. For k = 1, we recover well-known losses, e.g., the loss in Equation (12) becomes a simple log loss over the neighbors of a single node, which is precisely the loss used in typical log-likelihood maximization models; the loss in Equation (7) becomes a perceptron loss. In Appendix C we discuss convexity considerations for different types of losses. In Appendix D, we present additional losses and expand on their connections to existing work.

4.2 Data Collection Strategy

Our meta-algorithm requires choosing a train time policy π : Vk → ∆(Vk) to traverse the beam search space Gk to collect supervision. 
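Before detailing data collection, a few of the Section 4.1 surrogate losses can be made concrete. The sketch below operates on raw score and cost vectors; the function names, the example vectors, and tie-breaking via Python's stable sort are our own assumptions, not the paper's implementation:

```python
import math

def perceptron_first(s, c):
    """Eq. (6): the lowest-cost neighbor should have the highest score."""
    best = min(range(len(c)), key=lambda i: c[i])  # index sigma*(1)
    return max(0.0, max(s) - s[best])

def perceptron_last(s, c, k):
    """Eq. (7): the lowest-cost neighbor should beat the k-th highest score."""
    best = min(range(len(c)), key=lambda i: c[i])
    kth = sorted(s, reverse=True)[k - 1]           # s_{sigma_hat(k)}
    return max(0.0, kth - s[best])

def margin_last(s, c, k):
    """Eq. (8): same comparison as perceptron (last), with a margin of 1."""
    best = min(range(len(c)), key=lambda i: c[i])
    kth = sorted(s, reverse=True)[k - 1]
    return max(0.0, 1.0 + kth - s[best])

def log_loss_neighbors(s, c):
    """Eq. (12): softmax log loss over all neighbors of the beam."""
    best = min(range(len(c)), key=lambda i: c[i])
    return -s[best] + math.log(sum(math.exp(x) for x in s))

# Scores that keep the lowest-cost neighbor comfortably inside a beam of
# size 2 incur no perceptron (last) loss:
s, c = [2.0, 3.0, 0.5], [0.0, 1.0, 2.0]
print(perceptron_last(s, c, k=2))  # 0.0
```

Since these losses are (sub)differentiable in the scores, in practice they would be written with an autodiff framework so the scoring function's parameters can be updated with an online optimizer such as Adam.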
Sampling a trajectory to collect training supervision is done\nby BEAMTRAJECTORY in Algorithm 2.\noracle Our simplest policy follows the oracle policy \u03c0\u2217 : Vk \u2192 \u2206(Vk) induced by the optimal\ncompletion cost function c\u2217 : V \u2192 R (as in Section 3). Using the terminology of Algorithm 1, we\ncan write \u03c0\u2217(b, c\u2217) = POLICY(G, b, k,\u2212c\u2217). This policy transitions using the negated sorted costs\nof the elements in Ab as scores.\nThe oracle policy does not address the distribution mismatch problem. At test time, the learned policy\nwill make mistakes and visit beams for which it has not collected supervision at train time, leading to\nerror compounding. Imitation learning tells us that it is necessary to collect supervision at train time\nwith the learned policy to avoid error compounding at test time [5].\nWe now present data collection strategies that use the learned policy. For brevity, we only cover the\ncase where the learned policy is always used (except when the transition leads to a cost-increase), and\nleave the discussion of additional possibilities (e.g., probabilistic interpolation of learned and oracle\npolicies) to Appendix E.3. When an edge (b, b(cid:48)) \u2208 Ek incurring cost increase is traversed, different\nstrategies are possible:\n\n6\n\n\fTable 1: Existing and novel beam-aware algorithms as instances of our meta-algorithm. 
Our theoretical guarantees require the existence of a deterministic no-regret online learning algorithm for the resulting problem.

Algorithm                 | data collection | surrogate loss               | k
log-likelihood            | oracle          | log loss (neighbors)         | 1
DAGGER [5]                | continue        | log loss (neighbors)         | 1
early update [6]          | stop            | perceptron (first)           | > 1
LaSO (perceptron) [7]     | reset           | perceptron (first)           | > 1
LaSO (large-margin) [7]   | reset           | margin (last)                | > 1
BSO [11]                  | reset           | cost-sensitive margin (last) | > 1
globally normalized [9]   | stop            | log loss (beam)              | > 1
Ours                      | continue        | [choose a surrogate loss]    | > 1

stop Stop collecting the beam trajectory. The last beam in the trajectory is b′, i.e., the beam on which we arrive in the transition that led to a cost increase. This data collection strategy is used in structured perceptron training with early update [6].

reset Reset the beam to contain only the best state as defined by the optimal completion cost function: b′ = BEST(b, 1, −c∗). In the subsequent steps of the policy, the beam grows back to size k. LaSO [7] uses this data collection strategy. Similarly to the oracle data collection strategy, rather than committing to a specific b′ ∈ N∗_b, we can sample b′ ∼ π∗(b, c∗) where π∗(b, c∗) is any distribution over N∗_b. The reset data collection strategy collects beam trajectories where the oracle policy π is executed conditionally, i.e., when the roll-in policy π(·, θt) would lead to a cost increase.

continue We can ignore the cost increase and continue following policy πt. This is the strategy taken by DAgger [5]. The continue data collection strategy has not been considered in the beam-aware setting, and therefore it is a novel contribution of our work. 
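A minimal sketch of BEAMTRAJECTORY (Algorithm 2) with the four strategies, on a hypothetical toy tree; the node ids, costs, and scores are made up, terminal handling is omitted, and Algorithm 2's while-loop is simplified to a fixed horizon:

```python
def best(candidates, k, f):
    """Top-k elements of `candidates` under scoring function f (BEST, Alg. 1)."""
    return sorted(candidates, key=f, reverse=True)[: min(k, len(candidates))]

def policy_step(neighbors, beam, k, f):
    """Expand every beam element and keep the top k (POLICY, Alg. 1)."""
    pool = [u for v in beam for u in neighbors(v)]
    return best(pool, k, f)

def beam_trajectory(neighbors, c_star, score, k, strategy, horizon):
    """Roll in, handling cost increases according to the chosen strategy."""
    beam = [0]            # initial beam {v(0)}
    traj = [beam]
    for _ in range(horizon):
        if strategy == "oracle":
            nxt = policy_step(neighbors, beam, k, lambda v: -c_star(v))
        else:
            nxt = policy_step(neighbors, beam, k, score)
            if min(map(c_star, nxt)) > min(map(c_star, beam)):  # cost increase
                if strategy == "stop":
                    traj.append(nxt)  # keep the offending beam, then stop
                    return traj
                if strategy == "reset":  # shrink back to the oracle-best node
                    nxt = policy_step(neighbors, beam, 1, lambda v: -c_star(v))
                # "continue": follow the learned policy through its mistakes
        beam = nxt
        traj.append(beam)
    return traj

# Toy binary tree: children of v are 2v+1 and 2v+2; leaves 3..6 carry costs.
neighbors = lambda v: [2 * v + 1, 2 * v + 2] if v < 3 else []
cost = {3: 1.0, 4: 0.0, 5: 2.0, 6: 3.0}
def c_star(v):
    ns = neighbors(v)
    return cost[v] if not ns else min(c_star(u) for u in ns)
score = lambda v: {1: 0.0, 2: 1.0, 5: 0.9, 6: 0.1}.get(v, 0.0)

# A misleading score function sends the learned policy down the bad branch;
# "continue" keeps rolling in anyway, collecting supervision at every beam.
print(beam_trajectory(neighbors, c_star, score, 1, "continue", 2))  # [[0], [2], [5]]
```

With the same inputs, "stop" ends the trajectory at the offending beam [2], while "reset" recovers via the oracle and ends at the zero-cost terminal beam [4].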
Our stronger theoretical guarantees apply to this case.

5 Theoretical Guarantees

We state regret guarantees for learning beam search policies using the continue, reset, or stop data collection strategies. One of the main contributions of our work is framing the problem of learning beam search policies in a way that allows us to obtain meaningful regret guarantees. Detailed proofs are provided in Appendix E. We begin by analyzing the continue collection strategy. As we will see, regret guarantees are stronger for continue than for stop or reset.

No-regret online learning algorithms have an important role in the proofs of our guarantees. Let ℓ1, . . . , ℓm be a sequence of loss functions with ℓt : Θ → R for all t ∈ [m]. Let θ1, . . . , θm be a sequence of iterates with θt ∈ Θ for all t ∈ [m]. The loss function ℓt can be chosen according to an arbitrary rule (e.g., adversarially). The online learning algorithm chooses the iterate θt. Both ℓt and θt are chosen online, as functions of loss functions ℓ1, . . . , ℓt−1 and iterates θ1, . . . , θt−1.

Definition 1. An online learning algorithm is no-regret if for any sequence of functions ℓ1, . . . , ℓm chosen according to the conditions above we have

(1/m) Σ_{t=1}^{m} ℓt(θt) − min_{θ∈Θ} (1/m) Σ_{t=1}^{m} ℓt(θ) = γm,  (13)

where γm goes to zero as m goes to infinity.

Many no-regret online learning algorithms, especially for convex loss functions, have been proposed in the literature, e.g., [20, 21, 22]. Our proofs of the theoretical guarantees require the no-regret online learning algorithm to be deterministic, i.e., θt to be a deterministic rule of previously observed iterates θ1, . . . , θt−1 and loss functions ℓ1, . . .
, (cid:96)t\u22121, for all t \u2208 [m]. Online gradient descent [20] is\nan example of such an algorithm.\n\n7\n\n\fIn Theorem 1, we prove no-regret guarantees for the case where the no-regret online algorithm is\npresented with explicit expectations for the loss incurred by a beam search policy. In Theorem 2,\nwe upper bound the expected cost incurred by a beam search policy as a function of its expected\nloss. This result holds in cases where, at each beam, the surrogate loss is an upper bound on the\nexpected cost increase at that beam. In Theorem 3, we use Azuma-Hoeffding to prove no-regret\nhigh probability bounds for the case where we only have access to empirical expectations of the loss\nincurred by a policy, rather than explicit expectations. In Theorem 4, we extend Theorem 3 for the\ncase where the data collection policy is different from the policy that we are evaluating. These results\nallow us to give regret guarantees that depend on how frequently is the data collection policy different\nfrom the policy that we are evaluating.\nIn this section we simply state the results of the theorems alongside some discussion. All proofs\nare presented in detail in Appendix E. Our analysis closely follows that of DAgger [5], although\nthe results need to be interpreted in the beam search setting. Our regret guarantees for beam-aware\nalgorithms with different data collection strategies are novel.\n\n5.1 No-Regret Guarantees with Explicit Expectations\n\n(cid:17)\n\n.\n\ni=1 (cid:96)(\u03b8, bi)\n\n(cid:80)m\n\n1\nm\n\n(cid:16)(cid:80)h\u22121\n\n(cid:80)m\nIf the sequence \u03b81, . . . , \u03b8m\nt=1 (cid:96)(\u03b8t, \u03b8t) \u2212\n\nThe sequence of functions (cid:96)1, . . . , (cid:96)m can be chosen in a way that applying a no-regret online learning\nalgorithm to generate the sequence of policies \u03b81, . . . , \u03b8m leads to no-regret guarantees for the\nperformance of the mixture of \u03b81, . . . , \u03b8m. 
The adversary presents the no-regret online learning algorithm with $\ell_t = \ell(\cdot, \theta_t)$ at time $t \in [m]$. The adversary is able to play $\ell(\cdot, \theta_t)$ because it can anticipate $\theta_t$: it knows the deterministic rule used by the no-regret online learning algorithm to pick iterates. Paraphrasing Theorem 1, on the distribution of trajectories induced by the uniform stochastic mixture of $\theta_1, \ldots, \theta_m$, the best policy in $\Theta$ for this distribution performs as well (in the limit) as the uniform mixture of $\theta_1, \ldots, \theta_m$.

Theorem 1. Let $\ell(\theta, \theta') = \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \mathbb{E}_{b_{1:h} \sim \pi(\cdot, \theta')} \left( \sum_{i=1}^{h-1} \ell(\theta, b_i) \right)$. If the sequence $\theta_1, \ldots, \theta_m$ is chosen by a deterministic no-regret online learning algorithm, we have
$$\frac{1}{m} \sum_{t=1}^m \ell(\theta_t, \theta_t) - \min_{\theta \in \Theta} \frac{1}{m} \sum_{t=1}^m \ell(\theta, \theta_t) = \gamma_m,$$
where $\gamma_m$ goes to zero as $m$ goes to infinity.

Furthermore, if for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ the surrogate loss $\ell(\cdot, b) : \Theta \to \mathbb{R}$ is an upper bound on the expected cost increase $\mathbb{E}_{b' \sim \pi(b, \cdot)} c(b, b') : \Theta \to \mathbb{R}$ for all $b \in V_k$, we can transform the surrogate loss no-regret guarantees into performance guarantees in terms of $c : \mathcal{Y} \to \mathbb{R}$. Theorem 2 tells us that if the best policy in $\Theta$ along the trajectories induced by the mixture of $\theta_1, \ldots, \theta_m$ incurs small surrogate loss, then the expected cost resulting from labeling examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ sampled from $\mathcal{D}$ with the uniform mixture of $\theta_1, \ldots, \theta_m$ is also small. It is possible to transform the results about the uniform mixture of $\theta_1, \ldots, \theta_m$ into results about the best policy among $\theta_1, \ldots, \theta_m$, e.g., following the arguments of [23], but for brevity we do not present them in this paper. Proofs of Theorem 1 and Theorem 2 are in Appendix E.1.

Theorem 2. Let all the conditions in Definition 1 be satisfied. Additionally, let $c(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \mathbb{E}_{b_{1:h} \sim \pi(\cdot, \theta)} \left( c^*(b_1) + \sum_{i=1}^{h-1} c(b_i, b_{i+1}) \right) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \mathbb{E}_{b_{1:h} \sim \pi(\cdot, \theta)} c^*(b_h)$. Let $\ell(\cdot, b) : \Theta \to \mathbb{R}$ be an upper bound on $\mathbb{E}_{b' \sim \pi(b, \cdot)} c(b, b') : \Theta \to \mathbb{R}$, for all $b \in V_k$. Then
$$\frac{1}{m} \sum_{t=1}^m c(\theta_t) \le \mathbb{E}_{(x,y) \sim \mathcal{D}} c^*(b_1) + \min_{\theta \in \Theta} \frac{1}{m} \sum_{t=1}^m \ell(\theta, \theta_t) + \gamma_m,$$
where $\gamma_m$ goes to zero as $m$ goes to infinity.

5.2 Finite Sample Analysis

Theorem 1 and Theorem 2 are for the case where the adversary presents explicit expectations, i.e., the loss function at time $t \in [m]$ is $\ell_t(\cdot) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \mathbb{E}_{b_{1:h} \sim \pi(\cdot, \theta_t)} \left( \sum_{i=1}^{h-1} \ell(\cdot, b_i) \right)$. In practice we most likely only have access to a sample estimator $\hat{\ell}(\cdot, \theta_t) : \Theta \to \mathbb{R}$ of the true expectation: we first sample an example $(x_t, y_t) \sim \mathcal{D}$, sample a trajectory $b_{1:h}$ according to $\pi(\cdot, \theta_t)$, and obtain $\hat{\ell}(\cdot, \theta_t) = \sum_{i=1}^{h-1} \ell(\cdot, b_i)$. We prove high probability no-regret guarantees for this case. Theorem 3 tells us that the population surrogate loss of the mixture of policies $\theta_1, \ldots, \theta_m$ is, with high probability, not much larger than its empirical surrogate loss. Combining this result with Theorem 1 and Theorem 2 allows us to give finite sample high probability results for the performance of the mixture of policies $\theta_1, \ldots, \theta_m$. The proof of Theorem 3 is found in Appendix E.2.

Theorem 3. Let $\hat{\ell}(\cdot, \theta') = \sum_{i=1}^{h-1} \ell(\cdot, b_i)$, generated by sampling $(x, y)$ from $\mathcal{D}$ (which induces the corresponding beam search space $G_k$ and cost functions) and sampling a beam trajectory using $\pi(\cdot, \theta')$. Let $\left| \sum_{i=1}^{h-1} \ell(\theta, b_i) \right| \le u$ for a constant $u \in \mathbb{R}$, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, beam trajectories $b_{1:h}$, and $\theta \in \Theta$. If the iterates are chosen by a no-regret online learning algorithm based on the sequence of losses $\ell_t = \hat{\ell}(\cdot, \theta_t) : \Theta \to \mathbb{R}$, for $t \in [m]$, then
$$\mathbb{P}\left( \frac{1}{m} \sum_{t=1}^m \ell(\theta_t, \theta_t) \le \frac{1}{m} \sum_{t=1}^m \hat{\ell}(\theta_t, \theta_t) + \eta(\delta, m) \right) \ge 1 - \delta,$$
where $\delta \in (0, 1]$ and $\eta(\delta, m) = u \sqrt{2 \log(1/\delta) / m}$.

5.3 Finite Sample Analysis for Arbitrary Data Collection Policies

All the results stated so far are for the continue data collection strategy where, at time $t \in [m]$, the whole trajectory $b_{1:h}$ is collected using the current policy $\pi(\cdot, \theta_t)$. Stop and reset data collection strategies do not necessarily collect the full trajectory under $\pi(\cdot, \theta_t)$. If the data collection policy $\pi' : V_k \to \Delta(V_k)$ differs from the learned policy, the analysis can be adapted by accounting for the difference in the distributions of trajectories induced by the learned policy and the data collection policy. The insight is that $\sum_{i=1}^{h-1} \ell(\theta, b_i)$ only depends on $b_{1:h-1}$, so if no cost increases occur in this portion of the trajectory, we are effectively sampling the trajectory using $\pi(\cdot, \theta)$ when using the stop and reset data collection strategies.

Prior work presented only perceptron-style results for these settings [6, 7]; we are the first to present regret guarantees. Our guarantee depends on the probability with which $b_{1:h-1}$ is collected solely with $\pi(\cdot, \theta)$.
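The deviation term $\eta(\delta, m) = u\sqrt{2\log(1/\delta)/m}$ can be sanity-checked numerically. The sketch below is ours, and the bounded losses are simulated stand-ins rather than the paper's surrogate losses: it draws $m$ losses bounded by $u$ and verifies that the empirical average exceeds the population mean by more than $\eta(\delta, m)$ in at most a $\delta$ fraction of trials.

```python
import numpy as np

rng = np.random.default_rng(1)
u, m, delta = 1.0, 2000, 0.05
eta = u * np.sqrt(2.0 * np.log(1.0 / delta) / m)   # Azuma-Hoeffding deviation term

trials, failures = 1000, 0
for _ in range(trials):
    # Stand-in for m bounded per-trajectory losses with |loss| <= u and mean 0.
    losses = rng.uniform(-u, u, size=m)
    if losses.mean() - 0.0 > eta:                  # one-sided deviation event
        failures += 1

failure_rate = failures / trials
print(f"eta = {eta:.4f}, failure rate = {failure_rate:.3f} (bound: {delta})")
```

The observed failure rate is typically far below $\delta$, since the Hoeffding-style bound is worst-case over all distributions with the given support.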
We state the finite sample analysis result for the case where these probabilities are not known explicitly but can be estimated. The proof of Theorem 4 is found in Appendix E.3.

Theorem 4. Let $\pi_t : V_k \to \Delta(V_k)$ be the data collection policy for example $t \in [m]$, which uses either the stop or reset data collection strategies. Let $\hat{\alpha}(\theta_t)$ be the empirical estimate of the probability of $\pi(\cdot, \theta_t)$ incurring at least one cost increase up to time $h - 1$. Then
$$\mathbb{P}\left( \frac{1}{m} \sum_{t=1}^m \ell(\theta_t, \theta_t) \le \frac{1}{m} \sum_{t=1}^m \hat{\ell}(\theta_t, \pi_t) + u \left( 1 - \frac{1}{m} \sum_{t=1}^m \hat{\alpha}(\theta_t) \right) + 2\eta(\delta, m) \right) \ge 1 - \delta,$$
where $\delta \in (0, 1]$ and $\eta(\delta, m) = u \sqrt{2 \log(1/\delta) / m}$.

If the probability of stopping or resetting goes to zero as $m$ goes to infinity, then the term capturing the discrepancy between the distributions induced by $\pi(\cdot, \theta_t)$ and $\pi_t$ vanishes, and we recover a guarantee similar to Theorem 3. If the probability of stopping or resetting does not go to zero, it is still possible to provide regret guarantees for the performance of this algorithm, but now with a term that does not vanish with increasing $m$. These regret guarantees for the different data collection strategies are novel.

6 Conclusion

We propose a framework for learning beam search policies using imitation learning. We provide regret guarantees for both new and existing algorithms for learning beam search policies. One of the main contributions is formulating the learning of beam search policies in the learning to search framework. Policies for beam search are induced via a scoring function.
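The three data collection strategies compared throughout this section can be summarized as a schematic roll-in loop. This is our illustrative sketch, not the paper's code, and all names (`roll_in`, `policy_step`, `oracle_reset`, `cost_increase`) are hypothetical: continue keeps following the learned policy after a mistake, stop truncates the trajectory, and reset substitutes a beam containing the correct hypothesis.

```python
def roll_in(policy_step, oracle_reset, cost_increase, b1, h, strategy):
    """Collect a beam trajectory b_1:h under continue/stop/reset (schematic)."""
    traj, b = [b1], b1
    for _ in range(h - 1):
        b_next = policy_step(b)
        if cost_increase(b, b_next):       # correct hypothesis fell out of the beam
            if strategy == "stop":
                break                      # truncate the trajectory, as in [6, 8, 9]
            if strategy == "reset":
                b_next = oracle_reset(b)   # restore the correct hypothesis, as in [7, 10, 11]
            # "continue": keep the mistaken transition and learn from it
        traj.append(b_next)
        b = b_next
    return traj

# Toy instantiation: "beams" are ints, and every multiple of 3 is a bad beam.
step = lambda b: b + 1
reset = lambda b: b + 2                    # oracle transition skips over the bad beam
bad = lambda b, b_next: b_next % 3 == 0

print(roll_in(step, reset, bad, 1, 6, "continue"))  # full trajectory, mistakes kept
print(roll_in(step, reset, bad, 1, 6, "stop"))      # truncated at the first mistake
print(roll_in(step, reset, bad, 1, 6, "reset"))     # bad beams replaced by the oracle
```

Only the continue strategy exposes the learner to its own consecutive mistakes, which is what enables the stronger guarantees of Theorem 3 relative to Theorem 4.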
The intuition is that the best neighbors in a beam should be scored sufficiently high, allowing them to be kept in the beam when transitioning using these scores. Based on this insight, we motivate different surrogate loss functions for learning scoring functions. We recover existing algorithms in the literature through specific choices of loss function and data collection strategy. Our work is the first to provide a beam-aware algorithm with no-regret guarantees.

Acknowledgments

The authors would like to thank Ruslan Salakhutdinov, Akshay Krishnamurthy, Wen Sun, Christoph Dann, and Kin Olivares for helpful discussions and detailed reviews.

References

[1] Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to sequence learning with neural networks. NeurIPS, 2014.

[2] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. ICASSP, 2013.

[3] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CVPR, 2015.

[4] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. Structured training for neural network transition-based parsing. ACL, 2015.

[5] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011.

[6] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. ACL, 2004.

[7] Hal Daumé and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. ICML, 2005.

[8] Liang Huang, Suphan Fayong, and Yang Guo. Structured perceptron with inexact search. NAACL, 2012.

[9] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. ACL, 2016.

[10] Yuehua Xu and Alan Fern. On learning linear ranking functions for beam search. ICML, 2007.

[11] Sam Wiseman and Alexander Rush. Sequence-to-sequence learning as beam-search optimization. ACL, 2016.

[12] Kartik Goyal, Graham Neubig, Chris Dyer, and Taylor Berg-Kirkpatrick. A continuous relaxation of beam search for end-to-end training of neural sequence models. AAAI, 2018.

[13] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.

[14] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. NeurIPS, 2015.

[15] Stéphane Ross and Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

[16] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, and John Langford. Learning to search better than your teacher. ICML, 2015.

[17] Alina Beygelzimer, John Langford, and Bianca Zadrozny. Machine learning techniques: reductions between prediction quality metrics. Performance Modeling and Engineering, 2008.

[18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.

[19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. ACL, 2002.

[20] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

[21] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 2005.

[22] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2016.

[23] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 2004.

[24] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. NeurIPS, 2003.

[25] Kevin Gimpel and Noah Smith. Softmax-margin CRFs: Training log-linear models with cost functions. ACL, 2010.