{"title": "Stochastic Structured Prediction under Bandit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 1489, "page_last": 1497, "abstract": "Stochastic structured prediction under bandit feedback follows a learning protocol where on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analyze them as stochastic first-order methods. We present an experimental evaluation on problems of natural language processing over exponential output spaces, and compare convergence speed across different objectives under the practical criterion of optimal task performance on development data and the optimization-theoretic criterion of minimal squared gradient norm. Best results under both criteria are obtained for a non-convex objective for pairwise preference learning under bandit feedback.", "full_text": "Stochastic Structured Prediction\n\nunder Bandit Feedback\n\nArtem Sokolov(cid:5),\u2217, Julia Kreutzer\u2217, Christopher Lo\u2020,\u2217, Stefan Riezler\u2021,\u2217\n\n\u2217Computational Linguistics & \u2021IWR, Heidelberg University, Germany\n\n{sokolov,kreutzer,riezler}@cl.uni-heidelberg.de\n\n\u2020Department of Mathematics, Tufts University, Boston, MA, USA\n\nchris.aa.lo@gmail.com\n\n(cid:5)Amazon Development Center, Berlin, Germany\n\nAbstract\n\nStochastic structured prediction under bandit feedback follows a learning protocol\nwhere on each of a sequence of iterations, the learner receives an input, predicts\nan output structure, and receives partial feedback in form of a task loss evaluation\nof the predicted structure. 
We present applications of this learning scenario to\nconvex and non-convex objectives for structured prediction and analyze them as\nstochastic \ufb01rst-order methods. We present an experimental evaluation on problems\nof natural language processing over exponential output spaces, and compare con-\nvergence speed across different objectives under the practical criterion of optimal\ntask performance on development data and the optimization-theoretic criterion of\nminimal squared gradient norm. Best results under both criteria are obtained for a\nnon-convex objective for pairwise preference learning under bandit feedback.\n\n1\n\nIntroduction\n\nWe present algorithms for stochastic structured prediction under bandit feedback that obey the\nfollowing learning protocol: On each of a sequence of iterations, the learner receives an input,\npredicts an output structure, and receives partial feedback in form of a task loss evaluation of the\npredicted structure. In contrast to the full-information batch learning scenario, the gradient cannot\nbe averaged over the complete input set. Furthermore, in contrast to standard stochastic learning,\nthe correct output structure is not revealed to the learner. We present algorithms that use this\nfeedback to \u201cbanditize\u201d expected loss minimization approaches to structured prediction [18, 25]. The\nalgorithms follow the structure of performing simultaneous exploration/exploitation by sampling\noutput structures from a log-linear probability model, receiving feedback to the sampled structure, and\nconducting an update in the negative direction of an unbiased estimate of the gradient of the respective\nfull information objective. The algorithms apply to situations where learning proceeds online on a\nsequence of inputs for which gold standard structures are not available, but feedback to predicted\nstructures can be elicited from users. 
A practical example is interactive machine translation where instead of human-generated reference translations only translation quality judgments on predicted translations are used for learning [20]. The example of machine translation showcases the complexity of the problem: For each input sentence, we receive feedback for only a single predicted translation out of a space that is exponential in sentence length, while the goal is to learn to predict the translation with the smallest loss under a complex evaluation metric.
[19] showed that partial feedback is indeed sufficient for optimization of feature-rich linear structured prediction over large output spaces in various natural language processing (NLP) tasks. Their experiments follow the standard online-to-batch conversion practice in NLP applications where the model with optimal task performance on development data is selected for final evaluation on a test set. The contribution of our paper is to analyze these algorithms as stochastic first-order (SFO) methods in the framework of [7] and investigate the connection of optimization for task performance with optimization-theoretic concepts of convergence.
Our analysis starts with revisiting the approach to stochastic optimization of a non-convex expected loss criterion presented by [20]. The iteration complexity of stochastic optimization of a non-convex objective J(wt) can be analyzed in the framework of [7] as O(1/ε²) in terms of the number of iterations needed to reach an accuracy of ε for the criterion E[‖∇J(wt)‖²] ≤ ε. [19] attempt to improve convergence speed by introducing a cross-entropy objective that can be seen as a (strong) convexification of the expected loss objective.

∗ The work for this paper was done while the authors were at Heidelberg University.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
The best known iteration complexity for strongly convex stochastic optimization is O(1/ε) for the suboptimality criterion E[J(wt)] − J(w∗) ≤ ε.
Lastly, we analyze the pairwise preference learning algorithm introduced by [19]. This algorithm can also be analyzed as an SFO method for non-convex optimization. To our knowledge, this is the first SFO approach to stochastic learning from pairwise comparison feedback, while related approaches fall into the area of gradient-free stochastic zeroth-order (SZO) approaches [24, 1, 7, 4]. The convergence rate of SZO methods depends on the dimensionality d of the function to be evaluated; for example, the non-convex SZO algorithm of [7] has an iteration complexity of O(d/ε²). SFO algorithms do not depend on d, which is crucial if the dimensionality of the feature space is large, as is common in structured prediction.
Furthermore, we present a comparison of empirical and theoretical convergence criteria for the NLP tasks of machine translation and noun-phrase chunking. We compare the empirical convergence criterion of optimal task performance on development data with the theoretically motivated criterion of minimal squared gradient norm. We find that pairwise preference learning converges fastest on both tasks under both criteria. Given the standard analysis of asymptotic complexity bounds, this result is surprising. An explanation can be given by measuring the variance and Lipschitz constant of the stochastic gradient, which are smallest for pairwise preference learning and largest for cross-entropy minimization by several orders of magnitude.
This offsets the possible gains in asymptotic\nconvergence rates for strongly convex stochastic optimization, and makes pairwise preference learning\nan attractive method for fast optimization in practical interactive scenarios.\n\n2 Related Work\n\nThe methods presented in this paper are related to various other machine learning problems where\npredictions over large output spaces have to be learned from partial information.\nReinforcement learning has the goal of maximizing the expected reward for choosing an action at\na given state in a Markov Decision Process (MDP) model, where unknown rewards are received at\neach state, or once at the \ufb01nal state. The algorithms in this paper can be seen as one-state MDPs with\ncontext where choosing an action corresponds to predicting a structured output. Most closely related\nare recent applications of policy gradient methods to exponential output spaces in NLP problems\n[22, 3, 15]. Similar to our expected loss minimization approaches, these approaches are based on\nnon-convex models, however, convergence rates are rarely a focus in the reinforcement learning\nliterature. One focus of our paper is to present an analysis of asymptotic convergence and convergence\nrates of non-convex stochastic \ufb01rst-order methods.\nContextual one-state MDPs are also known as contextual bandits [11, 13] which operate in a scenario\nof maximizing the expected reward for selecting an arm of a multi-armed slot machine. Similar to\nour case, the feedback is partial, and the models consist of a single state. While bandit learning\nis mostly formalized as online regret minimization with respect to the best \ufb01xed arm in hindsight,\nwe characterize our approach in an asymptotic convergence framework. Furthermore, our high-\ndimensional models predict structures over exponential output spaces. 
Since we aim to train these models in interaction with real users, we focus on the ease of elicitability of the feedback and on speed of convergence. In the spectrum of stochastic versus adversarial bandits, our approach is semi-adversarial in making stochastic assumptions on inputs, but not on rewards [12].
Pairwise preference learning has been studied in the full information supervised setting [8, 10, 6] where given preference pairs are assumed. Work on stochastic pairwise learning has been formalized as derivative-free stochastic zeroth-order optimization [24, 1, 7, 4]. To our knowledge, our approach to pairwise preference learning from partial feedback is the first SFO approach to learning from pairwise preferences in form of relative task loss evaluations.

Algorithm 1 Bandit Structured Prediction
1: Input: sequence of learning rates γt
2: Initialize w0
3: for t = 0, . . . , T do
4:   Observe xt
5:   Sample ỹt ∼ pwt(y|xt)
6:   Obtain feedback Δ(ỹt)
7:   wt+1 = wt − γt st
8: Choose a solution ŵ from the list {w0, . . . , wT}

3 Expected Loss Minimization for Structured Prediction

[18, 25] introduce the expected loss criterion for structured prediction as the minimization of the expectation of a given task loss function with respect to the conditional distribution over structured outputs. Let X be a structured input space, let Y(x) be the set of possible output structures for input x, and let Δy : Y → [0, 1] quantify the loss Δy(y′) suffered for predicting y′ instead of the gold standard structure y.
In the full information setting, for a given (empirical) data distribution p(x, y), the learning problem is defined as

min_{w∈Rd} E_{p(x,y)pw(y′|x)}[Δy(y′)] = min_{w∈Rd} Σ_{x,y} p(x, y) Σ_{y′∈Y(x)} Δy(y′) pw(y′|x),   (1)

where

pw(y|x) = exp(w⊤φ(x, y))/Zw(x)   (2)

is a Gibbs distribution with joint feature representation φ : X × Y → Rd, weight vector w ∈ Rd, and normalization constant Zw(x). Despite being a highly non-convex optimization problem, positive results have been obtained by gradient-based optimization with respect to

∇E_{p(x,y)pw(y′|x)}[Δy(y′)] = E_{p(x,y)pw(y′|x)}[Δy(y′)(φ(x, y′) − E_{pw(y′|x)}[φ(x, y′)])].   (3)

Unlike in the full information scenario, in structured learning under bandit feedback the gold standard output structure y with respect to which the objective function is evaluated is not revealed to the learner. Thus we can neither evaluate the task loss Δ nor calculate the gradient (3) as in the full information case. A solution to this problem is to pass the evaluation of the loss function to the user, i.e., we access the loss directly through user feedback without assuming existence of a fixed reference y. In the following, we will drop the subscript referring to the gold standard structure in the definition of Δ to indicate that the feedback is in general independent of gold standard outputs. In particular, we allow Δ to be equal to 0 for several outputs.

4 Stochastic Structured Prediction under Partial Feedback

Algorithm Structure. Algorithm 1 shows the structure of the methods analyzed in this paper. It assumes a sequence of input structures xt, t = 0, . . . , T that are generated by a fixed, unknown distribution p(x) (line 4).
For each randomly chosen input, an output ỹt is sampled from a Gibbs model to perform simultaneous exploitation (use the current best estimate) / exploration (get new information) on output structures (line 5). Then, feedback Δ(ỹt) to the predicted structure is obtained (line 6). An update is performed by taking a step in the negative direction of the stochastic gradient st, at a rate γt (line 7). As a post-optimization step, a solution ŵ is chosen from the list of vectors wt ∈ {w0, . . . , wT} (line 8).
Given Algorithm 1, we can formalize the notion of "banditization" of objective functions by presenting different instantiations of the vector st, and showing them to be unbiased estimates of the gradients of corresponding full information objectives.

Expected Loss Minimization. [20] presented an algorithm that minimizes the following expected loss objective. It is non-convex for the specific instantiations in this paper:

E_{p(x)pw(y|x)}[Δ(y)] = Σ_x p(x) Σ_{y∈Y(x)} Δ(y) pw(y|x).   (4)

The vector st used in their algorithm can be seen as a stochastic gradient of this objective, i.e., an evaluation of the full gradient at a randomly chosen input xt and output ỹt:

st = Δ(ỹt)(φ(xt, ỹt) − E_{pwt(y|xt)}[φ(xt, y)]).   (5)

Instantiating st in Algorithm 1 to the stochastic gradient in equation (5) yields an update that compares the sampled feature vector to the average feature vector, and performs a step into the opposite direction of this difference, the more so the higher the loss of the sampled structure is. In the following, we refer to the algorithm for expected loss minimization defined by the update (5) as Algorithm EL.

Pairwise Preference Learning. Decomposing complex problems into a series of pairwise comparisons has been shown to be advantageous for human decision making [23].
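For a small, explicitly enumerated candidate set, the Algorithm EL update defined by the stochastic gradient (5) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the explicit feature matrix and loss array are hypothetical stand-ins for the exponential output spaces and the user feedback oracle used in the paper.

```python
import numpy as np

def gibbs(w, feats):
    """p_w(y|x) ∝ exp(w·φ(x,y)) over an explicit list of candidates."""
    scores = feats @ w
    p = np.exp(scores - scores.max())   # shift for numerical stability
    return p / p.sum()

def el_update(w, feats, loss, gamma, rng):
    """One Algorithm-EL step: sample ỹ ~ p_w, observe bandit feedback Δ(ỹ),
    step along −s_t = −Δ(ỹ)(φ(x,ỹ) − E_{p_w}[φ])  (eq. 5)."""
    p = gibbs(w, feats)
    i = rng.choice(len(p), p=p)                 # explore/exploit by sampling
    s = loss[i] * (feats[i] - p @ feats)        # unbiased gradient estimate of (4)
    return w - gamma * s
```

Iterating this update on a toy problem drives probability mass toward the lowest-loss candidate, even though the learner only ever sees the loss of the one structure it sampled.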
For the example of machine translation, this means that instead of requiring numerical assessments of translation quality from human users, only a relative preference judgement on a pair of translations needs to be elicited. This idea is formalized in [19] as an expected loss objective with respect to a conditional distribution of pairs of structured outputs. Let P(x) = {⟨yi, yj⟩ | yi, yj ∈ Y(x)} denote the set of output pairs for an input x, and let Δ(⟨yi, yj⟩) : P(x) → [0, 1] denote a task loss function that specifies a dispreference of yi compared to yj. In the experiments reported in this paper, we simulate two types of pairwise feedback. Firstly, continuous pairwise feedback is computed as

Δ(⟨yi, yj⟩) = Δ(yi) − Δ(yj) if Δ(yi) > Δ(yj), and 0 otherwise.   (6)

A binary feedback function is computed as

Δ(⟨yi, yj⟩) = 1 if Δ(yi) > Δ(yj), and 0 otherwise.   (7)

Furthermore, we assume a feature representation φ(x, ⟨yi, yj⟩) = φ(x, yi) − φ(x, yj) and a Gibbs model on pairs of output structures

pw(⟨yi, yj⟩|x) = e^{w⊤(φ(x,yi)−φ(x,yj))} / Σ_{⟨yi,yj⟩∈P(x)} e^{w⊤(φ(x,yi)−φ(x,yj))} = pw(yi|x) p−w(yj|x).   (8)

The factorization of this model into the product pw(yi|x)p−w(yj|x) allows efficient sampling and calculation of expectations.
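The factorization in (8) and the simulated feedback functions (6) and (7) can be checked numerically on a toy candidate set. This is a small sketch under the definitions above; the random feature matrix is a hypothetical stand-in.

```python
import numpy as np

def gibbs(w, feats):
    """p_w(y|x) ∝ exp(w·φ(x,y)) over an explicit candidate list."""
    s = feats @ w
    p = np.exp(s - s.max())
    return p / p.sum()

def pair_gibbs_direct(w, feats):
    """p_w(⟨yi,yj⟩|x) of eq. (8), normalized over all ordered pairs."""
    s = feats @ w
    scores = s[:, None] - s[None, :]        # w·(φ(x,yi) − φ(x,yj))
    p = np.exp(scores - scores.max())
    return p / p.sum()

def pair_gibbs_factorized(w, feats):
    """Same distribution via the product p_w(yi|x) · p_{−w}(yj|x)."""
    return np.outer(gibbs(w, feats), gibbs(-w, feats))

def pair_feedback(d_i, d_j, binary=False):
    """Simulated pairwise feedback, eqs. (6)/(7): nonzero only if yi is worse."""
    if d_i > d_j:
        return 1.0 if binary else d_i - d_j
    return 0.0
```

Sampling a pair thus reduces to two independent draws, yi from pw(·|x) and yj from p−w(·|x), rather than a draw from the quadratically-sized pair space.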
Instantiating objective (4) to the case of pairs of output structures defines the following objective that is again non-convex in the use cases in this paper:

E_{p(x)pw(⟨yi,yj⟩|x)}[Δ(⟨yi, yj⟩)] = Σ_x p(x) Σ_{⟨yi,yj⟩∈P(x)} Δ(⟨yi, yj⟩) pw(⟨yi, yj⟩|x).   (9)

Learning from partial feedback on pairwise preferences will ensure that the model finds a ranking function that assigns low probabilities to discordant pairs with respect to the observed preference pairs. Stronger assumptions on the learned ranking can be made if asymmetry and transitivity of the observed ordering of pairs is required.² An algorithm for pairwise preference learning can be defined by instantiating Algorithm 1 to sampling output pairs ⟨ỹi, ỹj⟩t, receiving feedback Δ(⟨ỹi, ỹj⟩t), and performing a stochastic gradient update using

st = Δ(⟨ỹi, ỹj⟩t)(φ(xt, ⟨ỹi, ỹj⟩t) − E_{pwt(⟨yi,yj⟩|xt)}[φ(xt, ⟨yi, yj⟩)]).   (10)

The algorithms for pairwise preference ranking defined by update (10) are referred to as Algorithms PR(bin) and PR(cont), depending on the use of binary or continuous feedback.

²See [2] for an overview of bandit learning from consistent and inconsistent pairwise comparisons.

Cross-Entropy Minimization. The standard theory of stochastic optimization predicts considerable improvements in convergence speed depending on the functional form of the objective. This motivated the formalization of a convex upper bound on expected normalized loss in [19].
If a normalized gain function ḡ(y) = g(y)/Zg(x) is used, where Zg(x) = Σ_{y∈Y(x)} g(y) and g = 1 − Δ, the objective can be seen as the cross-entropy of model pw(y|x) with respect to ḡ(y):

E_{p(x)ḡ(y)}[− log pw(y|x)] = −Σ_x p(x) Σ_{y∈Y(x)} ḡ(y) log pw(y|x).   (11)

For a proper probability distribution ḡ(y), an application of Jensen's inequality to the convex negative logarithm function shows that objective (11) is a convex upper bound on objective (4). However, normalizing the gain function is prohibitive in a partial feedback setting since it would require eliciting user feedback for each structure in the output space. [19] thus work with an unnormalized gain function g(y) that preserves convexity. This can be seen by rewriting the objective as the sum of a linear and a convex function in w:

E_{p(x)g(y)}[− log pw(y|x)] = −Σ_x p(x) Σ_{y∈Y(x)} g(y) w⊤φ(x, y) + Σ_x p(x) (log Σ_{y∈Y(x)} exp(w⊤φ(x, y))) α(x),   (12)

where α(x) = Σ_{y∈Y(x)} g(y) is a constant factor not depending on w. Instantiating Algorithm 1 to the following stochastic gradient st of this objective yields an algorithm for cross-entropy minimization:

st = (g(ỹt)/pwt(ỹt|xt)) (−φ(xt, ỹt) + E_{pwt(y|xt)}[φ(xt, y)]).   (13)

Note that the ability to sample structures from pwt(ỹt|xt) comes at the price of having to normalize st by 1/pwt(ỹt|xt). While minimization of this objective will assign high probabilities to structures with high gain, as desired, each update is affected by a probability that changes over time and is unreliable when training is started. This further increases the variance already present in stochastic optimization.
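On an explicit candidate list, the cross-entropy stochastic gradient (13) can be sketched as below, together with a numerical check that its expectation under pwt matches the full gradient of objective (12) for a single input. This is a minimal illustration; the gain values and feature matrix are hypothetical stand-ins.

```python
import numpy as np

def gibbs(w, feats):
    """p_w(y|x) ∝ exp(w·φ(x,y)) over an explicit candidate list."""
    s = feats @ w
    p = np.exp(s - s.max())
    return p / p.sum()

def ce_stochastic_gradient(w, feats, gain, i):
    """Eq. (13) for sampled candidate i: the importance weight
    g(ỹ)/p_w(ỹ|x) corrects for sampling from p_w instead of g."""
    p = gibbs(w, feats)
    return (gain[i] / p[i]) * (-feats[i] + p @ feats)
```

Averaging (13) over all candidates weighted by their sampling probability recovers −Σ_y g(y)φ(y) + α(x)·E_{p_w}[φ], the gradient of (12) for one input, which makes the unbiasedness condition (15) of the following section concrete.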
We deal with this problem by clipping too small sampling probabilities to p̂wt(ỹt|xt) = max{pwt(ỹt|xt), k} for a constant k [9]. The algorithm for cross-entropy minimization based on the stochastic gradient (13) is referred to as Algorithm CE in the following.

5 Convergence Analysis

To analyze convergence, we describe Algorithms EL, PR, and CE as stochastic first-order (SFO) methods in the framework of [7]. We assume lower bounded, differentiable objective functions J(w) with Lipschitz continuous gradient ∇J(w) satisfying

‖∇J(w + w′) − ∇J(w)‖ ≤ L‖w′‖ ∀w, w′, ∃L ≥ 0.   (14)

For an iterative process of the form wt+1 = wt − γt st, the conditions to be met concern unbiasedness of the gradient estimate

E[st] = ∇J(wt), ∀t ≥ 0,   (15)

and boundedness of the variance of the stochastic gradient

E[‖st − ∇J(wt)‖²] ≤ σ², ∀t ≥ 0.   (16)

Condition (15) is met for all three Algorithms by taking expectations over all sources of randomness, i.e., over random inputs and output structures. Assuming ‖φ(x, y)‖ ≤ R, Δ(y) ∈ [0, 1] and g(y) ∈ [0, 1] for all x, y, and since the ratio g(ỹt)/p̂wt(ỹt|xt) is bounded, the variance in condition (16) is bounded. Note that the analysis of [7] justifies the use of constant learning rates γt = γ, t = 0, . . . , T. Convergence speed can be quantified in terms of the number of iterations needed to reach an accuracy of ε for a gradient-based criterion E[‖∇J(wt)‖²] ≤ ε. For stochastic optimization of non-convex objectives, the iteration complexity with respect to this criterion is analyzed as O(1/ε²) in [7].
This complexity result applies to our Algorithms EL and PR.
The iteration complexity of stochastic optimization of (strongly) convex objectives has been analyzed as at best O(1/ε) for the suboptimality criterion E[J(wt)] − J(w∗) ≤ ε for decreasing learning rates [14].³ Strong convexity of objective (12) can be achieved easily by adding an ℓ2 regularizer (λ/2)‖w‖² with constant λ > 0. Algorithm CE is then modified to use the regularized update rule wt+1 = wt − γt (st + (λ/T) wt).
This standard analysis shows two interesting points: First, Algorithms EL and PR can be analyzed as SFO methods where the latter only requires relative preference feedback for learning, while enjoying an iteration complexity that does not depend on the dimensionality of the function as in gradient-free stochastic zeroth-order (SZO) approaches. Second, the standard asymptotic complexity bound of O(1/ε²) for non-convex stochastic optimization hides the constants L and σ² in which iteration complexity increases linearly. As we will show, these constants have a substantial influence, possibly offsetting the advantages in asymptotic convergence speed of Algorithm CE.

6 Experiments

Measuring Numerical Convergence and Task Loss Performance. In the following, we will present an experimental evaluation for two complex structured prediction tasks from the area of NLP, namely statistical machine translation and noun phrase chunking. Both tasks involve dynamic programming over exponential output spaces, large sparse feature spaces, and non-linear non-decomposable task loss metrics. Training for both tasks was done by simulating bandit feedback by evaluating Δ against gold standard structures which are never revealed to the learner.
We compare the empirical convergence criterion of optimal task performance on development data with numerical results on theoretically motivated convergence criteria.
For the purpose of measuring convergence with respect to optimal task performance, we report an evaluation of convergence speed on a fixed set of unseen data as performed in [19]. This instantiates the selection criterion in line (8) in Algorithm 1 to an evaluation of the respective task loss function Δ(ŷwt(x)) under MAP prediction ŷw(x) = arg max_{y∈Y(x)} pw(y|x) on the development data. This corresponds to the standard practice of online-to-batch conversion where the model selected on the development data is used for final evaluation on a further unseen test set. For bandit structured prediction algorithms, final results are averaged over three runs with different random seeds.
For the purpose of obtaining numerical results on convergence speed, we compute estimates of the expected squared gradient norm E[‖∇J(wt)‖²], the Lipschitz constant L and the variance σ² in which the asymptotic bounds on iteration complexity grow linearly.⁴ We estimate the squared gradient norm by the squared norm of the stochastic gradient ‖sT‖² at a fixed time horizon T. The Lipschitz constant L in equation (14) is estimated by max_{i,j} ‖si − sj‖/‖wi − wj‖ for 500 pairs wi and wj randomly drawn from the weights produced during training. The variance σ² in equation (16) is computed as the empirical variance of the stochastic gradient, taken at regular intervals after each epoch of size D, yielding the estimate (1/K) Σ_{k=1}^{K} ‖skD − (1/K) Σ_{k=1}^{K} skD‖², where K = ⌊T/D⌋. All estimates include multiplication of the stochastic gradient with the learning rate.
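The estimators just described can be sketched as follows (a minimal sketch under the stated definitions; the inputs are assumed to be lists of weight vectors and the corresponding stochastic gradients logged during training):

```python
import numpy as np

def lipschitz_estimate(ws, ss, n_pairs=500, seed=0):
    """Estimate L of eq. (14) as max_{i,j} ‖s_i − s_j‖ / ‖w_i − w_j‖
    over randomly drawn pairs of logged (weight, gradient) snapshots."""
    ws, ss = np.asarray(ws, float), np.asarray(ss, float)
    rng = np.random.default_rng(seed)
    best = 0.0
    for i, j in rng.integers(0, len(ws), size=(n_pairs, 2)):
        dw = np.linalg.norm(ws[i] - ws[j])
        if dw > 0:
            best = max(best, np.linalg.norm(ss[i] - ss[j]) / dw)
    return best

def variance_estimate(epoch_grads):
    """Empirical variance of eq. (16) from gradients taken once per epoch:
    (1/K) Σ_k ‖s_{kD} − (1/K) Σ_k s_{kD}‖²."""
    g = np.asarray(epoch_grads, float)
    return float(np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1)))
```

Both quantities are computed on gradients already multiplied by the learning rate, matching the estimates reported in Table 2.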
For comparability of results across different algorithms, we use the same T and the same constant learning rates for all algorithms.⁵

³For constant learning rates, [21] show even faster convergence in the search phase of strongly-convex stochastic optimization.
⁴For example, these constants appear as O(L/ε + Lσ²/ε²) in the complexity bound for non-convex stochastic optimization of [7].
⁵Note that the squared gradient norm upper bounds the suboptimality criterion s.t. ‖∇J(wt)‖² ≥ 2λ[J(wt) − J(w∗)] for strongly convex functions. Together with the use of constant learning rates this means that we measure convergence to a point near an optimum for strongly convex objectives.

Task      Algorithm  Iterations  Score         γ     λ     k
SMT       CE         281k        0.271±0.001   1e-6  1e-6  5e-3
SMT       EL         370k        0.267±8e-6    1e-5  –     –
SMT       PR(bin)    115k        0.273±0.0005  1e-4  –     –
Chunking  CE         5.9M        0.891±0.005   1e-6  1e-6  1e-2
Chunking  EL         7.5M        0.923±0.002   1e-4  –     –
Chunking  PR(cont)   4.7M        0.914±0.002   1e-4  –     –

Table 1: Test set evaluation for stochastic learning under bandit feedback from [19], for chunking under F1-score, and for machine translation under BLEU. Higher is better for both scores. Results for stochastic learners are averaged over three runs of each algorithm, with standard deviation shown in subscripts. The meta-parameter settings were determined on dev sets for constant learning rate γ, clipping constant k, and ℓ2 regularization constant λ.

Statistical Machine Translation. In this experiment, an interactive machine translation scenario is simulated where a given machine translation system is adapted to user style and domain based on feedback to predicted translations. Domain adaptation from Europarl to NewsCommentary domains using the data provided at the WMT 2007 shared task is performed for French-to-English translation.⁶ The MT experiments are based on the synchronous context-free grammar decoder cdec [5]. The models use a standard set of dense and lexicalized sparse features, including an out-of and an in-domain language model. The out-of-domain baseline model has around 200k active features. The pre-processing, data splits, feature sets and tuning strategies are described in detail in [19]. The difference in the task loss evaluation between out-of-domain (BLEU: 0.2651) and in-domain (BLEU: 0.2831) models gives the range of possible improvements (1.8 BLEU points) for bandit learning. Learning under bandit feedback starts at the learned weights of the out-of-domain median models. It uses parallel in-domain data (news-commentary, 40,444 sentences) to simulate bandit feedback, by evaluating the sampled translation against the reference using as loss function Δ a smoothed per-sentence 1 − BLEU (zero n-gram counts being replaced with 0.01). After each update, the hypergraph is re-decoded and all hypotheses are re-ranked. Training is distributed across 38 shards using a multitask-based feature selection algorithm [17].

⁶http://www.statmt.org/wmt07/shared-task.html

Noun-phrase Chunking. The experimental setting for chunking is the same as in [19]. Following [16], conditional random fields (CRF) are applied to the noun phrase chunking task on the CoNLL-2000 dataset⁷. The implemented set of feature templates is a simplified version of [16] and leads to around 2M active features. Training under full information with a log-likelihood objective yields 0.935 F1. In contrast to machine translation, training with bandit feedback starts from w0 = 0, not from a pre-trained model.

Task Loss Evaluation.
Table 1 lists the results of the task loss evaluation for machine translation and chunking as performed in [19], together with the optimal meta-parameters and the number of iterations needed to find an optimal result on the development set. Note that the pairwise feedback type (cont or bin) is treated as a meta-parameter for Algorithm PR in our simulation experiment. We found that bin is preferable for machine translation and cont for chunking in order to obtain the highest task scores.
For machine translation, all bandit learning runs show significant improvements in BLEU score over the out-of-domain baseline. Early stopping by task performance on the development set led to the selection of algorithm PR(bin) at a number of iterations that is by a factor of 2-4 smaller compared to Algorithms EL and CE.
For the chunking experiment, the F1-score results obtained for bandit learning are close to the full-information baseline. The number of iterations needed to find an optimal result on the development set is smallest for Algorithm PR(cont), compared to Algorithms EL and CE. However, the best F1-score is obtained by Algorithm EL.
Numerical Convergence Results. Estimates of E[‖∇J(wt)‖²], L and σ² for three runs of each algorithm and task with different random seeds are listed in Table 2.
For machine translation, at time horizon T, the estimated squared gradient norm for Algorithm PR is several orders of magnitude smaller than for Algorithms EL and CE. Furthermore, the estimated Lipschitz constant L and the estimated variance σ² are smallest for Algorithm PR.
⁷http://www.cnts.ua.ac.be/conll2000/chunking/

Task      Algorithm  T          ‖sT‖²              L             σ²
SMT       CE         767,000    3.04±0.02          0.54±0.3      35±6
SMT       EL         767,000    0.02±0.03          1.63±0.67     3.13e-4±3.60e-6
SMT       PR(bin)    767,000    2.88e-4±3.40e-6    0.08±0.01     3.79e-5±9.50e-8
SMT       PR(cont)   767,000    1.03e-8±2.91e-10   0.10±5.70e-3  1.78e-7±1.45e-10
Chunking  CE         3,174,400  4.20±0.71          1.60±0.11     4.88±0.07
Chunking  EL         3,174,400  1.21e-3±1.1e-4     1.16±0.31     0.01±9.51e-5
Chunking  PR(bin)    3,174,400  7.71e-4±2.53e-4    1.33±0.24     4.44e-3±2.66e-5
Chunking  PR(cont)   3,174,400  5.99e-3±7.24e-4    1.11±0.30     0.03±4.68e-4

Table 2: Estimates of squared gradient norm ‖sT‖², Lipschitz constant L and variance σ² of the stochastic gradient (including multiplication with the learning rate) for fixed time horizon T and constant learning rates γ = 1e−6 for SMT and for chunking. The clipping and regularization parameters for CE are set as in Table 1 for machine translation, except that for chunking CE uses λ = 1e−5. Results are averaged over three runs of each algorithm, with standard deviation shown in subscripts.

Since the iteration complexity increases linearly with respect to these terms, smaller constants L and σ² and a smaller value of the estimate E[‖∇J(wt)‖²] at the same number of iterations indicate fastest convergence for Algorithm PR. This theoretically motivated result is consistent with the practical convergence criterion of early stopping on development data: Algorithm PR, which yields the smallest squared gradient norm at time horizon T, also needs the smallest number of iterations to achieve optimal performance on the development set.
In the case of machine translation, Algorithm PR even achieves the nominally best BLEU score on test data.

For the chunking experiment, after T iterations, the estimated squared gradient norm and both constants L and σ² for Algorithm PR are several orders of magnitude smaller than for Algorithm CE, but similar to the results for Algorithm EL. The corresponding iteration counts determined by early stopping on development data show an improvement of Algorithm PR over Algorithms CE and EL, although by a smaller factor than in the machine translation experiment.

Note that for comparability across algorithms, the same constant learning rates were used in all runs. However, we obtained similar relations between the algorithms when using the meta-parameter settings chosen on development data as shown in Table 1. Furthermore, the above tendencies hold for both settings of the meta-parameter (bin or cont) of Algorithm PR.

7 Conclusion

We presented learning objectives and algorithms for stochastic structured prediction under bandit feedback. The presented methods “banditize” well-known approaches to probabilistic structured prediction such as expected loss minimization, pairwise preference ranking, and cross-entropy minimization. We presented a comparison of practical convergence criteria based on early stopping with theoretically motivated convergence criteria based on the squared gradient norm. Our experimental results showed fastest convergence speed under both criteria for pairwise preference learning. Our numerical evaluation showed smallest variance for pairwise preference learning, which possibly explains fastest convergence despite the underlying non-convex objective.
Furthermore, since this algorithm requires only easily obtainable relative preference feedback for learning, it is an attractive choice for practical interactive learning scenarios.

Acknowledgments.

This research was supported in part by the German research foundation (DFG), and in part by a research cooperation grant with the Amazon Development Center Germany.

References

[1] Agarwal, A., Dekel, O., and Xiao, L. (2010). Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT.

[2] Busa-Fekete, R. and Hüllermeier, E. (2014). A survey of preference-based online learning with bandit algorithms. In ALT.

[3] Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daume, H., and Langford, J. (2015). Learning to search better than your teacher. In ICML.

[4] Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806.

[5] Dyer, C., Lopez, A., Ganitkevitch, J., Weese, J., Ture, F., Blunsom, P., Setiawan, H., Eidelman, V., and Resnik, P. (2010). cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In ACL Demo.

[6] Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. JMLR, 4:933–969.

[7] Ghadimi, S. and Lan, G. (2012). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. on Optimization, 4(23):2342–2368.

[8] Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132.

[9] Ionides, E. L. (2008). Truncated importance sampling. J. of Comp. and Graph. Stat., 17(2):295–311.

[10] Joachims, T. (2002).
Optimizing search engines using clickthrough data. In KDD.

[11] Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS.

[12] Lazaric, A. and Munos, R. (2012). Learning with stochastic inputs and adversarial outputs. Journal of Computer and System Sciences, 78:1516–1537.

[13] Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In WWW.

[14] Polyak, B. T. (1987). Introduction to Optimization. Optimization Software, Inc., New York.

[15] Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence level training with recurrent neural networks. In ICLR.

[16] Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random fields. In NAACL.

[17] Simianer, P., Riezler, S., and Dyer, C. (2012). Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In ACL.

[18] Smith, N. A. (2011). Linguistic Structure Prediction. Morgan and Claypool.

[19] Sokolov, A., Kreutzer, J., Lo, C., and Riezler, S. (2016). Learning structured predictors from bandit feedback for interactive NLP. In ACL.

[20] Sokolov, A., Riezler, S., and Urvoy, T. (2015). Bandit structured prediction for learning from user feedback in statistical machine translation. In MT Summit XV.

[21] Solodov, M. V. (1998). Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11:23–35.

[22] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS.

[23] Thurstone, L. L. (1927). A law of comparative judgement. Psychological Review, 34:278–286.

[24] Yue, Y. and Joachims, T. (2009). Interactively optimizing information retrieval systems as a dueling bandits problem.
In ICML.

[25] Yuille, A. and He, X. (2012). Probabilistic models of vision and max-margin methods. Frontiers of Electrical and Electronic Engineering, 7(1):94–106.