{"title": "Learned Prioritization for Trading Off Accuracy and Speed", "book": "Advances in Neural Information Processing Systems", "page_first": 1331, "page_last": 1339, "abstract": "Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs (for particular problems and datasets).  We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing \\cite{kay-1986}. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the ``teacher'' is far too good to successfully imitate with our inexpensive features.  Moreover, it is not specifically tuned for the known reward function.  We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with only a few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.", "full_text": "Learned Prioritization for\n\nTrading Off Accuracy and Speed\u2217\n\nJiarong Jiang\u2217\n\nAdam Teichert\u2020\n\u2217Department of Computer Science\n\nUniversity of Maryland\nCollege Park, MD 20742\n\n{jiarong,hal}@umiacs.umd.edu\n\nHal Daum\u00b4e III\u2217\nJason Eisner\u2020\n\u2020Department of Computer Science\n\nJohns Hopkins University\n\nBaltimore, MD 21218\n\n{teichert,eisner}@jhu.edu\n\nAbstract\n\nUsers want inference to be both fast and accurate, but quality often comes at\nthe cost of speed. The \ufb01eld has experimented with approximate inference algo-\nrithms that make different speed-accuracy tradeoffs (for particular problems and\ndatasets). We aim to explore this space automatically, focusing here on the case of\nagenda-based syntactic parsing [12]. 
Unfortunately, off-the-shelf reinforcement\nlearning techniques fail to learn good policies: the state space is simply too large\nto explore naively. An attempt to counteract this by applying imitation learning\nalgorithms also fails: the \u201cteacher\u201d follows a far better policy than anything in our\nlearner\u2019s policy space, free of the speed-accuracy tradeoff that arises when oracle\ninformation is unavailable, and thus largely insensitive to the known reward\nfunction. We propose a hybrid reinforcement/apprenticeship learning algorithm\nthat learns to speed up an initial policy, trading off accuracy for speed according\nto various settings of a speed term in the loss function.\n\n1 Introduction\nThe nominal goal of predictive inference is to achieve high accuracy. Unfortunately, high accuracy\noften comes at the price of slow computation. In practice one wants a \u201creasonable\u201d tradeoff between\naccuracy and speed. But the definition of \u201creasonable\u201d varies with the application. Our goal is to\noptimize a system with respect to a user-specified speed/accuracy tradeoff, on a user-specified data\ndistribution. We formalize our problem in terms of learning priority functions for generic inference\nalgorithms (Section 2).\nMuch research in natural language processing (NLP) has been dedicated to finding speedups for exact\nor approximate computation in a wide range of inference problems including sequence tagging,\nconstituent parsing, dependency parsing, and machine translation. Many of the speedup strategies in\nthe literature can be expressed as pruning or prioritization heuristics. Prioritization heuristics govern\nthe order in which search actions are taken while pruning heuristics explicitly dictate whether\nparticular actions should be taken at all. 
Examples of prioritization include A\u2217 [13] and Hierarchical\nA\u2217 [19] heuristics, which, in the case of agenda-based parsing, prioritize parse actions so as\nto reduce work while maintaining the guarantee that the most likely parse is found. Alternatively,\ncoarse-to-fine pruning [21], classifier-based pruning [23, 22], beam-width prediction [3], etc. can\nresult in even faster inference if a small amount of search error can be tolerated.\nUnfortunately, deciding which techniques to use for a specific setting can be difficult: it is\nimpractical to \u201ctry everything.\u201d In the same way that statistical learning has dramatically improved\nthe accuracy of NLP applications, we seek to develop statistical learning technology that can dramatically\nimprove their speed while maintaining tolerable accuracy. By combining reinforcement\nlearning and imitation learning methods, we develop an algorithm that can successfully learn such\na tradeoff in the context of constituency parsing. Although this paper focuses on parsing, we expect\nthe approach to transfer to prioritization in other agenda-based algorithms, such as machine\ntranslation and residual belief propagation. We give a broader discussion of this setting in [8].\n\n\u2217This material is based upon work supported by the National Science Foundation under Grant No. 0964681.\n\n2 Priority-based Inference\nInference algorithms in NLP (e.g. parsers, taggers, or translation systems) as well as more broadly\nin artificial intelligence (e.g., planners) often rely on prioritized exploration. 
For concreteness, we\ndescribe inference in the context of parsing, though it is well known that this setting captures all the\nessential structure of a much larger family of \u201cdeductive inference\u201d problems [12, 9].\n\n2.1 Prioritized Parsing\nGiven a probabilistic context-free grammar, one approach to inferring the best parse tree for a given\nsentence is to build the tree from the bottom up by dynamic programming, as in CKY [29]. When a\nprospective constituent such as \u201cNP from 3 to 8\u201d is built, its Viterbi inside score is the log-probability\nof the best known subparse that matches that description.1\nA standard extension of the CKY algorithm [12] uses an agenda\u2014a priority queue of constituents\nbuilt so far\u2014to decide which constituent is most promising to extend next, as detailed in section 2.2\nbelow. The success of the inference algorithm in terms of speed and accuracy hinges on its ability to\nprioritize \u201cgood\u201d actions before \u201cbad\u201d actions. In our context, a constituent is \u201cgood\u201d if it somehow\nleads to a high accuracy solution, quickly.\nRunning Example 1. Either CKY or an agenda-based parser that prioritizes by Viterbi inside score will\nfind the highest-scoring parse. This achieves a percentage accuracy of 93.3, given the very large grammar\nand experimental conditions described in Section 6. However, the agenda-based parser is over an order of\nmagnitude faster than CKY (wall clock time) because it stops as soon as it finds a parse, without building\nfurther constituents. With mild pruning according to Viterbi inside score, the accuracy remains 93.3 and the\nspeed triples. With more aggressive pruning, the accuracy drops to 92.0 and the speed triples again.\nOur goal is to learn a prioritization function that satisfies this condition. 
In order to operationalize\nthis approach, we need to define the test-time objective function we wish to optimize; we choose a\nsimple linear interpolation of accuracy and speed:\n\n\\[ \\text{quality} = \\text{accuracy} - \\lambda \\times \\text{time} \\tag{1} \\]\n\nwhere we can choose a \u03bb that reflects our true preferences. The goal of \u03bb is to encode \u201chow much\nmore time am I willing to spend to achieve an additional unit of accuracy?\u201d In this paper, we consider\na very simple notion of time: the number of constituents popped from/pushed into the agenda during\ninference, halting inference as soon as the parser pops its first complete parse.\nWhen considering how to optimize the expectation of Eq (1) over test data, several challenges\npresent themselves. First, this is a sequential decision process: the parsing decisions made at a\ngiven time may affect both the availability and goodness of future decisions. Second, the parser\u2019s\ntotal runtime and accuracy on a sentence are unknown until parsing is complete, making this an\ninstance of delayed reward. These considerations lead us to formulate this problem as a Markov\nDecision Process (MDP), a well-studied model of decision processes.\n\n2.2 Inference as a Markov Decision Process\nA Markov Decision Process (MDP) is a formalization of a memoryless search process. An MDP\nconsists of a state space S, an action space A, and a transition function T. An agent in an MDP\nobserves the current state s \u2208 S and chooses an action a \u2208 A. The environment responds by\ntransitioning to a state s' \u2208 S, sampled from the transition distribution T(s' | s, a). The agent then\nobserves its new state and chooses a new action. 
An agent\u2019s policy \u03c0 describes how the (memoryless)\nagent chooses an action based on its current state, where \u03c0 is either a deterministic function of\nthe state (i.e., \u03c0(s) \u21a6 a) or a stochastic distribution over actions (i.e., \u03c0(a | s)).\nFor parsing, the state is the full current chart and agenda (and is astronomically large: roughly 10^17\nstates for average sentences). The agent controls which item (constituent) to \u201cpop\u201d from the agenda.\nThe initial state has an agenda consisting of all single-word constituents, and an empty chart of\npreviously popped constituents. Possible actions correspond to items currently on the agenda. When\nthe agent chooses to pop item y, the environment deterministically adds y to the chart, combines y\nas licensed by the grammar with adjacent items z in the chart, and places each resulting new item x\non the agenda. (Duplicates in the chart or agenda are merged: the one of highest Viterbi inside score\nis kept.) The only stochasticity is the initial draw of a new sentence to be parsed.\n\n1 E.g., the maximum log-probability of generating some tree whose fringe is the substring spanning words\n(3,8], given that NP (noun phrase) is the root nonterminal. This is the total log-probability of rules in the tree.\n\nWe are interested in learning a deterministic policy that always pops the highest-priority available\naction. Thus, learning a policy corresponds to learning a priority function. We define the priority\nof action a in state s as the dot product of a feature vector \u03c6(a, s) with the weight vector \u03b8; our\nfeatures are described in Section 2.3. 
Formally, our policy is\n\n\\[ \\pi_\\theta(s) = \\arg\\max_a \\; \\theta \\cdot \\phi(a, s) \\tag{2} \\]\n\nAn admissible policy in the sense of A\u2217 search [13] would guarantee that we always return the parse\nof highest Viterbi inside score\u2014but we do not require this, instead aiming to optimize Eq (1).\n\n2.3 Features for Prioritized Parsing\nWe use the following simple features to prioritize a possible constituent: (1) Viterbi inside score; (2)\nconstituent touches start of sentence; (3) constituent touches end of sentence; (4) constituent length;\n(5) constituent length / sentence length; (6) log p(constituent label | prev. word POS tag) and\nlog p(constituent label | next word POS tag), where the part-of-speech (POS) tag of w is taken to be\narg max_t p(w | t) under the\ngrammar; (7) 12 features indicating whether the constituent\u2019s {preceding, following, initial} word\nstarts with an {uppercase, lowercase, number, symbol} character; (8) the 5 most positive and 5 most\nnegative punctuation features from [14], which consider the placement of punctuation marks within\nthe constituent.\nThe log-probability features (1), (6) are inspired by work on figures of merit for agenda-based parsing\n[4], while case and punctuation patterns (7), (8) are inspired by structure-free parsing [14].\n\n3 Reinforcement Learning\n\nReinforcement learning (RL) provides a generic solution to solving learning problems with delayed\nreward [25]. 
The reward function takes a state of the world s and an agent\u2019s chosen action a and\nreturns a real value r that indicates the \u201cimmediate reward\u201d the agent receives for taking that action.\nIn general the reward function may be stochastic, but in our case, it is deterministic: r(s, a) \u2208 R.\nThe reward function we consider is:\n\n\\[ r(s, a) = \\begin{cases} \\mathrm{acc}(a) - \\lambda \\cdot \\mathrm{time}(s) & \\text{if } a \\text{ is a full parse tree} \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{3} \\]\n\nHere, acc(a) measures the accuracy of the full parse tree popped by the action a (against a gold\nstandard) and time(s) is a user-defined measure of time. In words, when the parser completes\nparsing, it receives reward given by Eq (1); at all other times, it receives no reward.\n\n3.1 Boltzmann Exploration\nAt test time, the transition between states is deterministic: our policy always chooses the action a\nthat has highest priority in the current state s. However, during training, we promote exploration of\npolicy space by running with stochastic policies \u03c0\u03b8(a | s). Thus, there is some chance of popping a\nlower-priority action, to find out if it is useful and should be given higher-priority. In particular, we\nuse Boltzmann exploration to construct a stochastic policy with a Gibbs distribution. Our policy is:\n\n\\[ \\pi_\\theta(a \\mid s) = \\frac{1}{Z(s)} \\exp\\left[ \\frac{1}{\\mathrm{temp}} \\, \\theta \\cdot \\phi(a, s) \\right] \\tag{4} \\]\n\nwith Z(s) as the appropriate normalizing constant.\nThat is, the log-likelihood of action a at state s is an affine function of its priority. The temperature\ntemp controls the amount of exploration. As temp \u2192 0, \u03c0\u03b8 approaches the deterministic policy\nin Eq (2); as temp \u2192 \u221e, \u03c0\u03b8 approaches the uniform distribution over available actions. 
During\ntraining, temp can be decreased to shift from exploration to exploitation.\nA trajectory \u03c4 is the complete sequence of state/action/reward triples from parsing a single sentence.\nAs is common, we denote \u03c4 = \u27e8s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_T, a_T, r_T\u27e9, where: s_0 is the starting state;\na_t is chosen by the agent by \u03c0\u03b8(a_t | s_t); r_t = r(s_t, a_t); and s_{t+1} is drawn by the environment from\nT(s_{t+1} | s_t, a_t), deterministically in our case. At a given temperature, the weight vector \u03b8 gives rise\nto a distribution over trajectories and hence to an expected total reward:\n\n\\[ R = E_{\\tau \\sim \\pi_\\theta}[R(\\tau)] = E_{\\tau \\sim \\pi_\\theta}\\left[ \\sum_{t=0}^{T} r_t \\right] \\tag{5} \\]\n\nwhere \u03c4 is a random trajectory chosen by policy \u03c0\u03b8, and r_t is the reward at step t of \u03c4.\n\n3.2 Policy Gradient\nGiven our features, we wish to find parameters that yield the highest possible expected reward. We\ncarry out this optimization using a stochastic gradient ascent algorithm known as policy gradient\n[27, 26]. This operates by taking steps in the direction of \u2207\u03b8R:\n\n\\[ \\nabla_\\theta E_\\tau[R(\\tau)] = E_\\tau\\left[ \\frac{\\nabla_\\theta p_\\theta(\\tau)}{p_\\theta(\\tau)} R(\\tau) \\right] = E_\\tau\\left[ R(\\tau) \\nabla_\\theta \\log p_\\theta(\\tau) \\right] = E_\\tau\\left[ R(\\tau) \\sum_{t=0}^{T} \\nabla_\\theta \\log \\pi_\\theta(a_t \\mid s_t) \\right] \\tag{6} \\]\n\nThe expectation can be approximated by sampling trajectories. 
It also requires computing the gradient\nof each policy decision, which, by Eq (4), is:\n\n\\[ \\nabla_\\theta \\log \\pi_\\theta(a_t \\mid s_t) = \\frac{1}{\\mathrm{temp}} \\left( \\phi(a_t, s_t) - \\sum_{a' \\in A} \\pi_\\theta(a' \\mid s_t) \\, \\phi(a', s_t) \\right) \\tag{7} \\]\n\nCombining Eq (6) and Eq (7) gives the form of the gradient with respect to a single trajectory. 
The\npolicy gradient algorithm samples one trajectory (or several) according to the current \u03c0\u03b8, and then\ntakes a gradient step according to Eq (6). This increases the probability of actions on high-reward\ntrajectories more than actions on low-reward trajectories.\nRunning Example 2. The baseline system from Running Example 1 always returns the target parse (the\ncomplete parse with maximum Viterbi inside score). This achieves an accuracy of 93.3 (percent recall) and\nspeed of 1.5 mpops (million pops) on training data. Unfortunately, running policy gradient from this starting\npoint degrades speed and accuracy. Training is not practically feasible: even the \ufb01rst pass over 100 training\nsentences (sampling 5 trajectories per sentence) takes over a day.\n\n3.3 Analysis\nOne might wonder why policy gradient performed so poorly on this problem. One hypothesis is that it is the\nfault of stochastic gradient descent: the optimization problem was too hard or our step sizes were chosen poorly.\nTo address this, we attempted an experiment where we added a \u201ccheating\u201d feature to the model, which had a\nvalue of one for constituents that should be in the \ufb01nal parse, and zero otherwise. Under almost every condition,\npolicy gradient was able to learn a near-optimal policy by placing high weight on this cheating feature.\nAn alternative hypothesis is over\ufb01tting to the training data. However, we were unable to achieve signi\ufb01cantly\nhigher accuracy even when evaluating on our training data\u2014indeed, even for a single train/test sentence.\nThe main dif\ufb01culty with policy gradient is credit assignment: it has no way to determine which actions were\n\u201cresponsible\u201d for a trajectory\u2019s reward. Without causal reasoning, we need to sample many trajectories in order\nto distinguish which actions are reliably associated with higher-reward. 
This is a significant problem for us,\nsince the average trajectory length of an A\u2217_0 parser on a 15 word sentence is about 30,000 steps, only about 40\nof which (less than 0.15%) are actually needed to successfully complete the parse optimally.\n\n3.4 Reward Shaping\nA classic approach to attenuating the credit assignment problem when one has some knowledge about the\ndomain is reward shaping [10]. The goal of reward shaping is to heuristically associate portions of the total\nreward with specific time steps, and to favor actions that are observed to be soon followed by a reward, on the\nassumption that they caused that reward.\nIf speed is measured by the number of popped items and accuracy is measured by labeled constituent recall of\nthe first-popped complete parse (compared to the gold-standard parse), one natural way to shape rewards is to\ngive an immediate penalty for the time incurred in performing the action while giving an immediate positive\nreward for actions that build constituents of the gold parse. Since only some of the correct constituents built\nmay actually make it into the returned tree, we can correct for having \u201cincorrectly\u201d rewarded the others by\npenalizing the final action. Thus, the shaped reward:\n\n\\[ \\tilde{r}(s, a) = \\begin{cases} 1 - \\Delta(s, a) - \\lambda & \\text{if } a \\text{ pops a complete parse (causing the parser to halt and return } a \\text{)} \\\\ 1 - \\lambda & \\text{if } a \\text{ pops a labeled constituent that appears in the gold parse} \\\\ -\\lambda & \\text{otherwise} \\end{cases} \\tag{8} \\]\n\n\u03bb is from Eq (1), penalizing the runtime of each step. 1 rewards a correct constituent. The correction \u2206(s, a)\nis the number of correct constituents popped into the chart of s that were not in the first-popped parse a. It is\neasy to see that for any trajectory ending in a complete parse, the total shaped and unshaped rewards along a\ntrajectory are equal (i.e. 
r(\u03c4) = r\u0303(\u03c4)).\nWe now modify the total reward to use temporal discounting. Let 0 \u2264 \u03b3 \u2264 1 be a discount factor. When\nrewards are discounted over time, the policy gradient becomes\n\n\\[ E_{\\tau \\sim \\pi_\\theta}[\\tilde{R}_\\gamma(\\tau)] = E_{\\tau \\sim \\pi_\\theta}\\left[ \\sum_{t=0}^{T} \\left( \\sum_{t'=t}^{T} \\gamma^{t'-t} \\tilde{r}_{t'} \\right) \\nabla_\\theta \\log \\pi_\\theta(a_t \\mid s_t) \\right] \\tag{9} \\]\n\nwhere r\u0303_{t'} = r\u0303(s_{t'}, a_{t'}). When \u03b3 = 1, the gradient of the above turns out to be equivalent to Eq (6) [20, section\n3.1], and therefore following the gradient is equivalent to policy gradient. When \u03b3 = 0, the parser gets only\nimmediate reward\u2014and in general, a small \u03b3 assigns the credit for a local reward r\u0303_{t'} mainly to actions a_t at\nclosely preceding times.\nThis gradient step can now achieve some credit assignment. If an action is on a good trajectory but occurs after\nmost of the useful actions (pops of correct constituents), then it does not receive credit for those previously\noccurring actions. However, if it occurs before useful actions, it still does receive credit because we do not\nknow (without additional simulation) whether it was a necessary step toward those actions.\nRunning Example 3. Reward shaping helps significantly, but not enough to be competitive. As the parser\nspeeds up, training is about 10 times faster than before. The best setting (\u03b3 = 0, \u03bb = 10^{\u22126}) achieves an\naccuracy in the mid-70\u2019s with only about 0.2 mpops. No settings were able to achieve higher accuracy.\n\n4 Apprenticeship Learning\n\nIn reinforcement learning, an agent interacts with an environment and attempts to learn to maximize its reward\nby repeating actions that led to high reward in the past. 
In apprenticeship learning, we assume access to a\ncollection of trajectories taken by an optimal policy and attempt to learn to mimic those trajectories. The\nlearner\u2019s only goal is to behave like the teacher at every step: it does not have any notion of reward. In contrast,\nthe related task of inverse reinforcement learning/optimal control [17, 11] attempts to infer a reward function\nfrom the teacher\u2019s optimal behavior.\nMany algorithms exist for apprenticeship learning. Some of them work by \ufb01rst executing inverse reinforcement\nlearning [11, 17] to induce a reward function and then feeding this reward function into an off-the-shelf rein-\nforcement learning algorithm like policy gradient to learn an approximately optimal agent [1]. Alternatively,\none can directly learn to mimic an optimal demonstrator, without going through the side task of trying to induce\nits reward function [7, 24].\n\n4.1 Oracle Actions\nWith a teacher to help guide the learning process, we would like to explore more intelligently than Boltzmann\nexploration, in particular, focusing on high-reward regions of policy space. We introduce oracle actions as a\nguidance for areas to explore.\nIdeally, oracle actions should lead to a maximum-reward tree. In training, we will identify oracle actions to be\nthose that build items in the maximum likelihood parse consistent with the gold parse. When multiple oracle\nactions are available on the agenda, we will break ties according to the priority assigned by the current policy\n(i.e., choose the oracle action that it currently likes best).\n\n4.2 Apprenticeship Learning via Classi\ufb01cation\nGiven a notion of oracle actions, a straightforward approach to policy learning is to simply train a classi\ufb01er to\nfollow the oracle\u2014a popular approach in incremental parsing [6, 5]. 
Indeed, this serves as the initial iteration\nof the state-of-the-art apprenticeship learning algorithm, DAGGER [24].\nWe train a classifier as follows. Trajectories are generated by following oracle actions, breaking ties using the\ninitial policy (Viterbi inside score) when multiple oracle actions are available. These trajectories are incredibly\nshort (roughly double the number of words in the sentence). At each step in the trajectory, (s_t, a_t), a classification\nexample is generated, where the action taken by the oracle (a_t) is considered the correct class and all\nother available actions are considered incorrect. The classifier that we train on these examples is a maximum\nentropy classifier, so it has exactly the same form as the Boltzmann exploration model (Eq (4)) but without the\ntemperature control. In fact, the gradient of this classifier (Eq (10)) is nearly identical to the policy gradient\n(Eq (6)) except that \u03c4 is distributed differently and the total reward R(\u03c4) does not appear: instead of mimicking\nhigh-reward trajectories we now try to mimic oracle trajectories.\n\n\\[ E_{\\tau \\sim \\pi^*}\\left[ \\sum_{t=0}^{T} \\left( \\phi(a_t, s_t) - \\sum_{a' \\in A} \\pi_\\theta(a' \\mid s_t) \\, \\phi(a', s_t) \\right) \\right] \\tag{10} \\]\n\nwhere \u03c0\u2217 denotes the oracle policy so a_t is the oracle action. The potential benefit of the classifier-based\napproach over policy gradient with shaped rewards is increased credit assignment. In policy gradient with\nreward shaping, an action gets credit for all future reward (though no past reward). In the classifier-based\napproach, it gets credit for exactly whether or not it builds an item that is in the true parse.\nRunning Example 4. The classifier-based approach performs only marginally better than policy gradient\nwith shaped rewards. 
The best accuracy we can obtain is 76.5 with 0.19 mpops.\nTo execute the DAGGER algorithm, we would continue in the next iteration by following the trajectories learned\nby the classifier and generating new classification examples on those states. Unfortunately, this is not computationally\nfeasible due to the poor quality of the policy learned in the first iteration. Attempting to follow\nthe learned policy essentially tries to build all possible constituents licensed by the grammar, which can be\nprohibitively expensive. We will remedy this in section 5.\n\n4.3 What\u2019s Wrong With Apprenticeship Learning\nAn obvious practical issue with the classifier-based approach is that it trains the classifier only at states visited\nby the oracle. This leads to the well-known problem that it is unable to learn to recover from past errors\n[2, 28, 7, 24]. Even though our current feature set depends only on the action and not on the state, making\naction scores independent of the current state, there is still an issue since the set of actions to choose from does\ndepend on the state. That is, the classifier is trained to discriminate only among the small set of agenda items\navailable on the oracle trajectory (which are always combinations of correct constituents). But the action sets\nthe parser faces at test time are much larger and more diverse.\nAn additional objection to classifiers is that not all errors are created equal. Some incorrect actions are more\nexpensive than others, if they create constituents that can be combined in many locally-attractive ways and\nhence slow the parser down or result in errors. Our classification problem does not distinguish among incorrect\nactions. 
The SEARN algorithm [7] would distinguish them by explicitly evaluating the future reward of each\npossible action (instead of using a teacher) and incorporating this into the classi\ufb01cation problem. But explicit\nevaluation is computationally infeasible in our setting (at each time step, it must roll out a full future trajectory\nfor each possible action from the agenda). Policy gradient provides another approach by observing which\nactions are good or bad across many random trajectories, but recall that we found it impractical as well. We do\nnot further address this problem in this paper, but in [8] we suggested explicit causality analysis.\nA \ufb01nal issue has to do with the nature of the oracle. Recall that the oracle is \u201csupposed to\u201d choose optimal\nactions for the given reward. Also recall that our oracle always picks correct constituents. There seems to be\na contradiction here: our oracle action selector ignores \u03bb, the tradeoff between accuracy and speed, and only\nfocuses on accuracy. This happens because for any reasonable setting of \u03bb, the optimal thing to do is always to\njust build the correct tree without building any extra constituents. Only for very large values of \u03bb is it optimal to\ndo anything else, and for such values of \u03bb, the learned model will have hugely negative reward. This means that\nunder the apprenticeship learning setting, we are actually never going to be able to learn to trade off accuracy\nand speed: as far as the oracle is concerned, you can have both! The tradeoff only appears because our model\ncannot come remotely close to mimicking the oracle.\n5 Oracle-Infused Policy Gradient\nThe failure of both standard reinforcement learning algorithms and standard apprenticeship learning algorithms\non our problem leads us to develop a new approach. We start with the policy gradient algorithm (Section 3.2)\nand use ideas from apprenticeship learning to improve it. 
Our formulation preserves the reinforcement learning\nflavor of our overall setting, which involves delayed reward for a known reward function.\nOur approach is specifically designed for the non-deterministic nature of the agenda-based parsing setting [8]:\nonce some action a becomes available (appears on the agenda), it never goes away until it is taken. This makes\nthe notion of \u201cinterleaving\u201d oracle actions with policy actions both feasible and sensible. Like policy gradient,\nwe draw trajectories from a policy and take gradient steps that favor actions with high reward under reward\nshaping. Like SEARN and DAGGER, we begin by exploring the space around the optimal policy and slowly\nexplore out from there.\nTo achieve this, we define the notion of an oracle-infused policy. Let \u03c0 be an arbitrary policy and let \u03b4 \u2208 [0, 1].\nWe define the oracle-infused policy \u03c0^+_\u03b4 as follows:\n\n\\[ \\pi^+_\\delta(a \\mid s) = \\delta \\, \\pi^*(a \\mid s) + (1 - \\delta) \\, \\pi(a \\mid s) \\tag{11} \\]\n\nIn other words, when choosing an action, \u03c0^+_\u03b4 explores the policy space with probability 1 \u2212 \u03b4 (according to its\ncurrent model), but with probability \u03b4, we force it to take an oracle action.\nOur algorithm takes policy gradient steps with reward shaping (Eqs (9) and (7)), but with respect to trajectories\ndrawn from \u03c0^+_\u03b4 rather than \u03c0. If \u03b4 = 0, it reduces to policy gradient, with reward shaping if \u03b3 < 1 and\nimmediate reward if \u03b3 = 0. For \u03b4 = 1, the \u03b3 = 0 case reduces to the classifier-based approach with \u03c0\u2217 (which\nin turn breaks ties by choosing the best action under \u03c0).\nSimilar to DAGGER and SEARN, we do not stay at \u03b4 = 1, but wean our learner off the oracle supervision as\nit starts to find a good policy \u03c0 that imitates the classifier reasonably well. 
We use \u03b4 = 0.8^epoch, where epoch\nis the total number of passes made through the training set at that point (so \u03b4 = 0.8^0 = 1 on the initial pass).\nOver time, \u03b4 \u2192 0, so that eventually we are training the policy to do well on the same distribution of states\nthat it will pass through at test time (as in policy gradient). With intermediate values of \u03b4 (and \u03b3 \u2248 1), an\niteration behaves similarly to an iteration of SEARN, except that it \u201crolls out\u201d the consequences of an action\nchosen randomly from (11) instead of evaluating all possible actions in parallel.\nRunning Example 5. Oracle-infusion gives a competitive speed and accuracy tradeoff. A typical result is 91.2\nwith 0.68 mpops.\n\n6 Experiments\n\nAll of our experiments (including those discussed earlier) are based on the Wall Street Journal portion of\nthe Penn Treebank [15]. We use a probabilistic context-free grammar with 370,396 rules\u2014enough to make\nthe baseline system accurate but slow. We obtained it as a latent-variable grammar [16] using 5 split-merge\niterations [21] on sections 2\u201320 of the Treebank, reserving section 22 for learning the parameters of our policy.\nAll approaches to trading off speed and accuracy are trained on section 22; in particular, for the running example\nand Section 6.2, the same 100 sentences of at most 15 words from that section were used for training and\ntest. We measure accuracy in terms of labeled recall (including preterminals) and measure speed in terms of\nthe number of pops from the agenda. The limitation to relatively short sentences is purely for improved\nefficiency at training time.\n\n6.1 Baseline Approaches\n\nOur baseline approaches trade off speed and accuracy not by learning to prioritize, but by varying the pruning\nlevel \u2206. 
A constituent is pruned if its Viterbi inside score is more than \u2206 worse than that of some other\nconstituent that covers the same substring.\nOur baselines are: (HA\u2217) a Hierarchical A\u2217 parser [18] with the same pruning threshold at each hierarchy\nlevel; (A\u2217\u2080) an A\u2217 parser with a 0 heuristic function plus pruning; (IDA\u2217\u2080) an iterative deepening A\u2217 algorithm,\nin which a failure to find any parse causes us to increase \u2206 and try again with less aggressive pruning (note\nthat this is not the traditional meaning of IDA*); and (CTF) the default coarse-to-fine parser in the Berkeley\nparser [21]. Several of these algorithms can make multiple passes, in which case the runtime (number of pops)\nis assessed cumulatively.\n\n6.2 Learned Prioritization Approaches\n\nWe explored four variants of our oracle-infused policy gradient with \u03bb = 10^\u22126. Figure 1 shows the results on\nthe 100 training sentences. The \u201c-\u201d tests are the degenerate case of \u03b4 = 1, or apprenticeship learning\n(section 4.2), while the \u201c+\u201d tests use \u03b4 = 0.8^epoch as recommended in section 5. Temperature matters for the\n\u201c+\u201d tests and we use temp = 1. We performed stochastic gradient descent for 25 passes over the data,\nsampling 5 trajectories in a row for each sentence (when \u03b4 < 1 so that trajectories are random).\n\nModel               # of pops   Recall   F1\nA\u2217\u2080 (no pruning)   1496080     93.34    93.19\nD-                  686641      56.35    58.74\nI-                  187403      76.48    76.92\nD+                  1275292     84.17    83.38\nI+                  682540      91.16    91.33\nFigure 1: Performance on 100 sentences.\n\nWe can see that the classifier-based approaches \u201c-\u201d perform poorly: when training trajectories consist of only\noracle actions, learning is severely biased. Yet we saw in section 3.2 that without any help from the oracle\nactions, we suffer from such large variance in the training trajectories that performance degrades rapidly and\nlearning does not converge even after days of training. Our \u201coracle-infused\u201d compromise \u201c+\u201d uses some oracle\nactions: after several passes through the data, the parser learns to make good decisions without help from the\noracle.\n\nFigure 2: Pareto frontiers: Our I+ parser at different values of \u03bb, against the baselines at different\npruning levels.\n\nThe other axis of variation is that the \u201cD\u201d tests (delayed reward) use \u03b3 = 1, while the \u201cI\u201d tests (immediate\nreward) use \u03b3 = 0. Note that I+ attempts a form of credit assignment and works better than D+ (see footnote 2). We were\nnot able to get better results with intermediate values of \u03b3, presumably because this crudely assigns credit for\na reward (correct constituent) to the actions that closely preceded it, whereas in our agenda-based parser, the\nactions related to the causes of the reward (correct subconstituents) may have happened much earlier [8].\n\n6.3 Pareto Frontier\n\nOur final evaluation is on the held-out test set (length-limited sentences from Section 23). A 5-split grammar\ntrained on sections 2\u201321 is used. Given our previous results in Figure 1, we only consider the I+ model:\nimmediate reward with oracle infusion. To investigate trading off speed and accuracy, we learn and then evaluate a\npolicy for each of several settings of the tradeoff parameter \u03bb. We train our policy using sentences of at most\n15 words from Section 22 and evaluate the learned policy on the held-out data (from Section 23). 
We measure\naccuracy as labeled constituent recall and evaluate speed in terms of the number of pops (or pushes) performed\non the agenda.\nFigure 2 shows the baselines at different pruning thresholds as well as the performance of our policies trained\nusing I+ for \u03bb \u2208 {10^\u22123, 10^\u22124, . . . , 10^\u22128}, using agenda pops as the measure of time. I+ is about 3 times as\nfast as unpruned A\u2217\u2080 at the cost of about a 1% drop in accuracy (F-score from 94.58 to 93.56). Thus, I+ achieves\nthe same accuracy as the pruned version of A\u2217\u2080 while still being twice as fast. I+ also improves upon HA\u2217 and\nIDA\u2217\u2080 with respect to speed, at 60% of the pops. I+ always does better than the coarse-to-fine parser (CTF) in\nterms of both speed and accuracy, though using the number of agenda pops as our measure of speed puts both\nof our hierarchical baselines at a disadvantage.\nWe also ran experiments using the number of agenda pushes as a more accurate measure of time, again sweeping\nover settings of \u03bb. Since our reward shaping was crafted with agenda pops in mind, perhaps it is not surprising\nthat learning performs relatively poorly in this setting. Still, we do manage to learn to trade off speed and\naccuracy. With a 1% drop in recall (F-score from 94.58 to 93.54), we speed up from A\u2217\u2080 by a factor of 4 (from\naround 8 billion pushes to 2 billion). Note that known pruning methods could also be employed in conjunction\nwith learned prioritization.\n\n7 Conclusions and Future Work\n\nIn this paper, we considered the application of both reinforcement learning and apprenticeship learning to\nprioritize search in a way that is sensitive to a user-defined tradeoff between speed and accuracy. We found\nthat a novel oracle-infused variant of the policy gradient algorithm for reinforcement learning is effective for\nlearning a fast and accurate parser with only a simple set of features. 
In addition, we uncovered many properties\nof this problem that separate it from more standard learning scenarios, and designed experiments to determine\nthe reasons off-the-shelf learning algorithms fail.\nAn important avenue for future work is to consider better credit assignment. We are also very interested in\ndesigning richer feature sets, including \u201cdynamic\u201d features that depend on both the action and the state of the\nchart and agenda. One role for dynamic features is to decide when to halt. The parser might decide to continue\nworking past the first complete parse, or give up (returning a partial or default parse) before any complete parse\nis found.\n\nFootnote 2: The D- and I- approaches are quite similar to each other. Both train on oracle trajectories where all actions\nreceive a reward of 1 \u2212 \u03bb, and simply try to make these oracle actions probable. However, D- trains more\naggressively on long trajectories, since (9) implies that it weights a given training action by T \u2212 t + 1, the\nnumber of future actions on that trajectory. The difference between D+ and I+ is more interesting because the\ntrajectory includes non-oracle actions as well.\n\n[Figure 2 plot: Change of recall and # of pops. Axes: Recall (0.82 to 0.96) and # of pops (0 to 3 \u00d7 10^7). Series: I+, A\u2217\u2080, IDA\u2217\u2080, CTF, HA\u2217.]\n\nReferences\n[1] Pieter Abbeel and Andrew Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.\n[2] J. Andrew Bagnell. Robust supervised learning. In AAAI, 2005.\n[3] Nathan Bodenstab, Aaron Dunlop, Keith Hall, and Brian Roark. Beam-width prediction for efficient CYK parsing. In ACL, 2011.\n[4] Sharon A. Caraballo and Eugene Charniak. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24(2):275\u2013298, 1998.\n[5] Eugene Charniak. Top-down nearly-context-sensitive parsing. In EMNLP, 2010.\n[6] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. 
In ACL, 2004.\n[7] Hal Daum\u00e9 III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297\u2013325, 2009.\n[8] Jason Eisner and Hal Daum\u00e9 III. Learning speed-accuracy tradeoffs in nondeterministic inference algorithms. In COST: NIPS Workshop on Computational Trade-offs in Statistical Learning, 2011.\n[9] Joshua Goodman. Semiring parsing. Computational Linguistics, 25(4):573\u2013605, December 1999.\n[10] V. Gullapalli and A. G. Barto. Shaping as a method for accelerating reinforcement learning. In Proceedings of the IEEE International Symposium on Intelligent Control, 1992.\n[11] R. Kalman. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5:558\u2013563, 1968.\n[12] Martin Kay. Algorithm schemata and data structures in syntactic processing. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Readings in Natural Language Processing, pages 35\u201370. Kaufmann, 1986. First published (1980) as Xerox PARC TR CSL-80-12.\n[13] Dan Klein and Chris Manning. A* parsing: Fast exact Viterbi parse selection. In NAACL/HLT, 2003.\n[14] Percy Liang, Hal Daum\u00e9 III, and Dan Klein. Structure compilation: Trading structure for features. In ICML, Helsinki, Finland, 2008.\n[15] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):330, 1993.\n[16] Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. Probabilistic CFG with latent annotations. In ACL, 2005.\n[17] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.\n[18] A. Pauls and D. Klein. Hierarchical search for parsing. In NAACL/HLT, pages 557\u2013565. Association for Computational Linguistics, 2009.\n[19] A. Pauls and D. Klein. 
Hierarchical A* parsing with bridge outside scores. In ACL, pages 348\u2013352. Association for Computational Linguistics, 2010.\n[20] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 2008.\n[21] S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In NAACL/HLT, pages 404\u2013411, 2007.\n[22] B. Roark, K. Hollingshead, and N. Bodenstab. Finite-state chart constraints for reduced complexity context-free parsing pipelines. Computational Linguistics, Early Access:1\u201335, 2012.\n[23] Brian Roark and Kristy Hollingshead. Classifying chart cells for quadratic complexity context-free inference. In COLING, pages 745\u2013752, Manchester, UK, August 2008. Coling 2008 Organizing Committee.\n[24] Stephane Ross, Geoff J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AI-Stats, 2011.\n[25] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n[26] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057\u20131063. MIT Press, 2000.\n[27] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(23), 1992.\n[28] Yuehua Xu and Alan Fern. On learning linear ranking functions for beam search. In ICML, pages 1047\u20131054, 2007.\n[29] D. H. Younger. Recognition and parsing of context-free languages in time n^3. 
Information and Control, 10(2):189\u2013208, February 1967.", "award": [], "sourceid": 650, "authors": [{"given_name": "Jiarong", "family_name": "Jiang", "institution": null}, {"given_name": "Adam", "family_name": "Teichert", "institution": null}, {"given_name": "Jason", "family_name": "Eisner", "institution": null}, {"given_name": "Hal", "family_name": "Daume", "institution": null}]}