{"title": "A Credit Assignment Compiler for Joint Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1713, "abstract": "Many machine learning applications involve jointly predicting multiple mutually dependent output variables. Learning to search is a family of methods where the complex decision problem is cast into a sequence of decisions via a search space. Although these methods have shown promise both in theory and in practice, implementing them has been burdensomely awkward. In this paper, we show the search space can be defined by an arbitrary imperative program, turning learning to search into a credit assignment compiler. Altogether with the algorithmic improvements for the compiler, we radically reduce the complexity of programming and the running time. We demonstrate the feasibility of our approach on multiple joint prediction tasks. In all cases, we obtain accuracies as high as alternative approaches, at drastically reduced execution and programming time.", "full_text": "A Credit Assignment Compiler for Joint Prediction\n\nKai-Wei Chang\n\nUniversity of Virginia\nkw@kwchang.net\n\nHe He\n\nUniversity of Maryland\nhhe@cs.umd.edu\n\nHal Daum\u00e9 III\n\nUniversity of Maryland\n\nme@hal3.name\n\nJohn Langford\n\nMicrosoft Research\n\njcl@microsoft.com\n\nStephane Ross\n\nGoogle\n\nstephaneross@google.com\n\nAbstract\n\nMany machine learning applications involve jointly predicting multiple mutually\ndependent output variables. Learning to search is a family of methods where the\ncomplex decision problem is cast into a sequence of decisions via a search space.\nAlthough these methods have shown promise both in theory and in practice, im-\nplementing them has been burdensomely awkward. In this paper, we show the\nsearch space can be de\ufb01ned by an arbitrary imperative program, turning learning\nto search into a credit assignment compiler. 
Altogether with the algorithmic im-\nprovements for the compiler, we radically reduce the complexity of programming\nand the running time. We demonstrate the feasibility of our approach on multi-\nple joint prediction tasks. In all cases, we obtain accuracies as high as alternative\napproaches, at drastically reduced execution and programming time.\n\n1\n\nIntroduction\n\nMany applications require a predictor to make coherent decisions. As an example, consider recog-\nnizing a handwritten word where each character might be recognized in turn to understand the word.\nHere, it is commonly observed that exposing information from related predictions (i.e. adjacent\nletters) aids individual predictions. Furthermore, optimizing a joint loss function can improve the\ngracefulness of error recovery. Despite these advantages, it is empirically common to build inde-\npendent predictors, in settings where joint prediction naturally applies, because they are simpler to\nimplement and faster to run. Can we make joint prediction algorithms as easy and fast to program\nand compute while maintaining their theoretical bene\ufb01ts?\nMethods making a sequence of sub-decisions have been proposed for handling complex joint pre-\ndictions in a variety of applications, including sequence tagging [30], dependency parsing (known as\ntransition-based method) [35], machine translation [18], and co-reference resolution [44]. Recently,\ngeneral search-based joint prediction approaches (e.g., [10, 12, 14, 22, 41]) have been investigated.\nThe key issue of these search-based approaches is credit assignment: when something goes wrong\ndo you blame the \ufb01rst, second, or third prediction? 
Existing methods typically take one of two strategies:

• The system ignores the possibility that a previous prediction may have been wrong, that different errors may incur different costs, or that train-time and test-time predictions differ.
• The system uses handcrafted credit assignment heuristics to cope with errors that the underlying algorithm makes and with the long-term outcomes of decisions.

Both approaches may lead to statistical inconsistency: when features are not rich enough for perfect prediction, the machine learning may converge sub-optimally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Algorithm 1 MYRUN(X) % for sequence tagging, X: input sequence, Y: output
A sample user-defined function, where PREDICT and LOSS are library functions (see text). The credit assignment compiler translates the code and data into model updates. More examples are in the appendices.
1: Y ← []
2: for t = 1 to LEN(X) do
3:   ref ← X[t].true_label
4:   Y[t] ← PREDICT(x=examples[t], y=ref, tag=t, condition=[1:t-1])
5: LOSS(number of Y[t] ≠ X[t].true_label)
6: return Y

In contrast, learning to search approaches [5, 11, 40] automatically handle the credit assignment problem by decomposing the production of the joint output into an explicit search space (states, actions, etc.) and learning a control policy that takes actions in this search space. These approaches have formal correctness guarantees which differ qualitatively from those of models such as Conditional Random Fields [28] and structured SVMs [46, 47].
Despite these good properties, none of these methods has been widely adopted, because specifying a search space as a finite state machine is awkward and naive implementations do not fully demonstrate the ability of these methods.
In this paper, we cast learning to search as a credit assignment compiler with a new programming abstraction for representing a search space. Together with several algorithmic improvements, this radically reduces both the complexity of programming1 and the running time. The programming interface has the following advantages:

• The same decoding function (see Alg. 1 for an example) is used for training and prediction, so a developer need only code the desired test-time behavior and gets training “for free”. This simple implementation prevents common train/test asynchrony bugs.
• The compiler automatically ensures the model learns to avoid compounding errors and makes a sequence of coherent decisions.
• The library functions are in a reduction stack, so as base classifiers and learning to search approaches improve, so does joint prediction performance.

We implement the credit assignment compiler in Vowpal Wabbit (http://hunch.net/~vw/), a fast online learning library, and show that the credit assignment compiler achieves outstanding empirical performance both in accuracy and in speed on several application tasks. This provides strong, simple baselines for future research and demonstrates that the compiler approach to solving complex prediction problems may be of broad interest. Detailed experimental settings are in the appendices.

2 Programmable Learning to Search

We first describe the proposed programmable joint prediction paradigm. Algorithm 1 shows sample code for a part of speech tagger (or generic sequence labeler) under Hamming loss. The algorithm takes as input a sequence of examples (e.g., words) and predicts the label of each element in turn.
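To make the two-function interface concrete, here is a minimal Python sketch of such a decoding function. The `predict` and `loss` helpers stand in for the library's PREDICT and LOSS calls and are simulated with a fixed toy policy; all names here are illustrative, not the actual Vowpal Wabbit API:

```python
# Illustrative sketch of the Algorithm 1 decoding loop (not the real library API).
# predict() consults the current policy; the reference label and the tags are
# extra information that only matters during training.

def my_run(examples, predict, loss):
    """Sequence tagger: predict each label in turn, then declare Hamming loss."""
    y = []
    for t, ex in enumerate(examples):
        ref = ex["true_label"]  # reference ("advice") label, ignored at test time
        y.append(predict(x=ex["features"], y=ref, tag=t, condition=list(range(t))))
    loss(sum(yt != ex["true_label"] for yt, ex in zip(y, examples)))
    return y

# A trivial stand-in policy that uppercases its input, for demonstration only.
def toy_predict(x, y, tag, condition):
    return x.upper()

collected = []
tags = my_run(
    [{"features": "a", "true_label": "A"}, {"features": "b", "true_label": "X"}],
    toy_predict,
    collected.append,
)
print(tags)       # ['A', 'B']
print(collected)  # [1]  -- one Hamming error
```

The same `my_run` is used at training and test time; only the implementations passed in for `predict` and `loss` change.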
The ith prediction depends on previous predictions.2 It uses two underlying library functions, PREDICT(...) and LOSS(...). The function PREDICT(...) returns individual predictions based on x, while LOSS(...) allows the declaration of an arbitrary loss over the joint set of predictions. The LOSS(...) function and the reference y passed to PREDICT(...) are only used in the training phase and have no effect in the test phase. Surprisingly, this single library interface is sufficient for both testing and training, when augmented to include label “advice” from a training set as a reference decision (via the parameter y). This means that a developer only has to specify the desired test-time behavior and gets training with minor additional decoration. The underlying system works as a credit assignment compiler, translating the user-specified decoding function and labeled data into updates of the learning model.
How can you learn a good PREDICT function given just an imperative program like Algorithm 1? In the following, we show that it is essential to run the MYRUN(...) function (e.g., Algorithm 1) many times, “trying out” different versions of PREDICT(...) to learn one that yields low LOSS(...). We begin with formal definitions of joint prediction and a search space.

1With library support, developing a new task often requires only a few lines of code.
2In this example, we use the library’s support for generating implicit features based on previous predictions.

The system begins at the start state S and chooses the middle action twice according to the rollin policy. At state R it considers both the chosen action (middle) and one-step deviations from that action (top and bottom). Each of these deviations is completed using the rollout policy until reaching an end state, at which point the loss is collected.
Here, we learn that deviating to the top action (instead of middle) at state R decreases the loss by 0.2.

Figure 1: A search space implicitly defined by an imperative program.

The definition of a TDOLR program:
• Always terminates.
• Takes as input any relevant feature information X.
• Makes zero or more calls to an oracle O : X′ → Y, which provides a discrete outcome.
• Reports a loss L on termination.

Algorithm 2 TDOLR(X)
1: s ← a
2: while s ∉ E do
3:   Compute xs from X and s
4:   s ← O(xs)
5: return LOSS(s)

Figure 2: Left: the definition; right: a TDOLR program simulating the search space.

Joint Prediction. Joint prediction aims to induce a function f such that for any X ∈ X (the input space), f produces an output f(X) = Y ∈ Y(X) in a (possibly input-dependent) space Y(X). The output Y can often be decomposed into smaller pieces (e.g., y1, y2, . . .), which are tied together by features, by a loss function, and/or by statistical dependence. There is a task-specific loss function ℓ : Y × Y → R≥0, where ℓ(Y∗, Ŷ) tells us how bad it is to predict Ŷ when the truth is Y∗.

Search Space. In our framework, the joint variable Ŷ is produced incrementally by traversing a search space, which is defined by states s ∈ S and a mapping A : S → 2^S defining the set of valid next states.3 One of the states is a unique start state S, while some of the others are end states e ∈ E. Each end state corresponds to some output variable Ye.
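The traversal just described can be sketched as a small simulator in the spirit of Algorithm 2; the toy chain search space and its state encoding below are assumptions made for illustration only:

```python
# Minimal sketch of a TDOLR program simulating a search space.
# Toy chain: a -> b -> (c | d); c and d are end states with different losses.
# The oracle O plays the role of the policy choosing among valid next states.

SEARCH_SPACE = {"a": ["b"], "b": ["c", "d"]}   # A: valid next states per state
END_LOSSES = {"c": 0.0, "d": 0.8}              # loss declared at end states E

def tdolr(x, oracle, start="a"):
    s = start
    while s not in END_LOSSES:           # while s is not an end state
        xs = (x, s)                      # compute features xs from X and s
        s = oracle(xs, SEARCH_SPACE[s])  # oracle picks the next state
    return END_LOSSES[s]                 # report LOSS on termination

# Two stand-in policies: take the first vs. the last available action.
first_action = lambda xs, actions: actions[0]
last_action = lambda xs, actions: actions[-1]
print(tdolr("input", first_action))  # 0.0
print(tdolr("input", last_action))   # 0.8
```

Different oracles trace different paths to different end states, so the declared loss directly reflects the quality of the policy's choices.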
The goal of learning is to find a function f : Xs → S that uses the features of an input state (xs) to choose the next state so as to minimize the loss ℓ(Y∗, Ye) on a holdout test set.4 Following reinforcement learning terminology, we call such a function a policy and call the learned function f a learned policy πf.

Turning a Search Space into an Imperative Program. Surprisingly, a search space can be represented by a class of imperative programs, called Terminal Discrete Oracle Loss Reporting (TDOLR) programs. The formal definition of TDOLR is listed in Figure 2. Without loss of generality, we assume the number of choices is fixed in a search space, and the following theorem holds:
Theorem 1. For every TDOLR program there exists an equivalent search space, and for every search space there exists an equivalent TDOLR program.

Proof. A search space is defined by (A, E, S, l). We first show there is a TDOLR program, Algorithm 2, which can simulate the search space. This algorithm does a straightforward execution of the search space, followed by reporting of the loss on termination. This proves the second claim.
For the first claim, we need to define (A, E, S, l) given a TDOLR program such that the search space can simulate the TDOLR program. At any point in the execution of the TDOLR program, we define an equivalent state s = (O(X1), ..., O(Xn)), where n is the number of calls to the oracle so far. We define the start state a as the zero-length sequence, and we define E as the set of states after which the TDOLR program terminates.

3Comprehensive strategies for defining search spaces have been discussed [14].
The theoretical properties do not depend on which search space definition is used.
4Note that we use X and Y to represent the joint input and output, and use x and y to represent the input and output of the functions f and PREDICT.

For each s ∈ E we define l(s) as the loss reported on termination. This search space manifestly outputs the same loss as the TDOLR program.

The practical implication of this theorem is that instead of specifying search spaces, we can specify a TDOLR program (e.g., Algorithm 1), reducing the programming complexity of joint prediction.

Algorithm 3 LEARN(X, F)
1: T, ex, cache ← 0, [], []
2: define PREDICT(x, y) := { T++ ; ex[T-1] ← x ; cache[T-1] ← F(x, y, rollin) ; return cache[T-1] }
3: define LOSS(l) := no-op
4: MYRUN(X) % MYRUN(X) is a user-defined TDOLR program (e.g., Algorithm 1).
5: for t0 = 1 to T do
6:   losses, t ← ⟨0, 0, . . . , 0⟩, 0
7:   for a0 = 1 to A(ex[t0]) do
8:     define PREDICT(x, y) := { t++ ; return cache[t-1] if t < t0 ; a0 if t = t0 ; F(x, y, rollout) if t > t0 }
9:     define LOSS(val) := { losses[a0] += val }
10:    MYRUN(X)
11:  Online update with the cost-sensitive example (ex[t0], losses)

3 Credit Assignment Compiler for Training a Joint Predictor

Now we show how a credit assignment compiler turns a TDOLR program and training data into model updates. In the training phase, the supervised signals are used in two places: 1) to define the loss function, and 2) to construct a reference policy π∗. The reference policy returns, at any prediction point, a “suggestion” as to a good next state.5 The general strategy is, for some number of epochs and for each example (X, Y) in the training data, to do the following:

1. Execute MYRUN(...)
on X with a rollin policy to obtain a trajectory of actions ⃗a and loss ℓ0.
2. Many times:
   (a) For some (or all) time steps t ≤ |⃗a|
   (b) For some (or all) alternative actions a′t ≠ at (at is the action taken by ⃗a at time step t)
   (c) Execute MYRUN(...) on X, with PREDICT returning a1:t−1 initially, then a′t, then acting according to a rollout policy, to obtain a new loss ℓt,a′t
   (d) Compare the overall losses ℓt,at and ℓt,a′t to construct a classification/regression example that demonstrates how much better or worse a′t is than at in this context.
3. Update the learned policy.

The rollin and rollout policies can be the reference π∗, the current classifier πf, or a mixture of the two. By varying them and the manner in which classification/regression examples are created, this general framework can mimic algorithms like SEARN [11], DAGGER [41], AGGREVATE [40], and LOLS [5].6
The full learning algorithm (for a single joint input X) is depicted in Algorithm 3.7 In lines 1–4, a rollin pass of MYRUN is executed. MYRUN can generally be any TDOLR program, as discussed above (e.g., Alg. 1). In this pass, predictions are made according to the current policy F, flagged as rollin (this is to enable support of arbitrary rollin and rollout policies). Furthermore, the examples (feature vectors) encountered during prediction are stored in ex, indexed by their position in the sequence (T), and the rollin predictions are cached in the variable cache (see Sec. 4).
The algorithm then initiates one-step deviations from this rollin trajectory. For every time step (line 5), we generate a single cost-sensitive classification example; its features are ex[t0], and there are A(ex[t0]) possible labels (= actions). For each action (line 7), we compute the cost of that action by executing MYRUN again (line 10) with a “tweaked” PREDICT which returns the cached predictions at steps before t0, returns the perturbed action a0 at t0, and at future time steps calls F for rollouts. The LOSS function accumulates the loss for the query action. Finally, a cost-sensitive classification example is generated (line 11) and fed into an online learning algorithm.

5Some papers assume the reference policy is optimal. An optimal policy always chooses the best next state, assuming it gets to make all future decisions as well.
6E.g., rollin in LOLS is πf and rollout is a stochastic interpolation of πf and the oracle π∗ constructed from y.
7This algorithm is awkward because standard computational systems have a single stack. We have elected to give MYRUN control of the stack to ease the implementation of joint prediction tasks. Consequently, the learning algorithm does not have access to the machine stack and must be implemented as a state machine.

4 Optimizing the Credit Assignment Compiler

We present two algorithmic improvements which make training orders of magnitude faster.

Optimization 1: Memoization. The primary computational cost of Alg. 3 is making predictions, namely calling the underlying classifier in Step 10. In order to avoid redundant predictions, we cache previous predictions. The challenge is knowing when two predictions will be identical, faster than actually computing the prediction. To accomplish this, the user may decorate calls to the PREDICT function with tags. For a graphical model, a tag is effectively the “name” of a particular variable in the graphical model. For a sequence labeling problem, the tag for a given position might simply be its index. When calling PREDICT, the user specifies both the tag of the current prediction and the tags of all previous predictions on which the current prediction depends.
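A tag-keyed cache of this kind can be sketched as follows. This is a simplification for illustration; the data structure inside the actual library may differ:

```python
# Sketch of Optimization 1: memoize predictions keyed by
# (tag, condition tags, predictions previously made for those tags).
# This is safe only under the user's guarantee that identical conditioning
# predictions imply an identical current prediction.

cache = {}
calls = {"n": 0}

def expensive_predict(x):
    calls["n"] += 1  # count real classifier invocations
    return x % 3     # stand-in for the real (expensive) classifier

def predict_memoized(x, tag, condition, history):
    key = (tag, tuple(condition), tuple(history[c] for c in condition))
    if key not in cache:
        cache[key] = expensive_predict(x)
    return cache[key]

history = {0: 1}  # prediction already made for tag 0
a = predict_memoized(7, tag=1, condition=[0], history=history)
b = predict_memoized(7, tag=1, condition=[0], history=history)  # cache hit
print(a, b, calls["n"])  # 1 1 1
```

The cache key never inspects the feature vector itself, which is exactly why the lookup is cheaper than recomputing the prediction.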
The user is guaranteeing that if the predictions for all the tags in the dependent variables are the same, then the prediction for the current example is the same.
Under this assumption, we store a cache that maps triples of ⟨tag, condition tags, condition predictions⟩ to ⟨current prediction⟩. The added overhead of maintaining this data structure is tiny in comparison to making repeated predictions on the same features. In line 11 the learned policy changes, which makes correctness subtle. For data mixing algorithms (like DAgger), this potentially changes Fi, implying the memoized predictions may no longer be up to date. Thus this optimization is acceptable only if the policy does not change much. We evaluate this empirically in Section 5.3.

Optimization 2: Forced Path Collapse. The second optimization is a heuristic that only makes rollout predictions for a constant number of steps (e.g., 2 or 4). The intuition is that optimizing against a truly long-term reward may be impossible if the features available at the current time t0 do not enable the underlying learner to distinguish between the outcomes of decisions far in the future. The optimization stops rollouts after some fixed number of rollout steps.
This intuitive reasoning is correct, except for accumulating LOSS(...). If LOSS(...) is only declared at the end of MYRUN, then we must execute T − t0 time steps, making (possibly memoized) predictions. However, for many problems it is possible to declare loss early, as with Hamming loss (= the number of incorrect predictions). There is no need to wait until the end of the sequence to declare a per-sequence loss: one can declare it after every prediction and have the total loss accumulate (hence the “+=” on line 9). We generalize this notion slightly to that of a history-independent loss:
Definition 1 (History-independent loss).
A loss function is history-independent at state s0 if, for any final state e reachable from s0 and for any sequence s0 s1 s2 . . . si = e, it holds that LOSS(e) = A(s0) + B(s1 s2 . . . si), where B does not depend on any state before s1.

For example, Hamming loss is history-independent: A(s0) corresponds to the loss through s0 and B(s1 . . . si) is the loss after s0.8 When the loss function being optimized is history-independent, we allow LOSS(...) to be declared early for this optimization. In addition, for tasks like transition-based dependency parsing, although LOSS(...) does not decompose over actions, the expected cost per action can be computed directly from the gold labels [19], so the array losses can be specified directly.

Speed Up. We analyze the time complexity of the sequence tagging task. Suppose that the cost of calling the policy is d and each state has k actions.9 Without any speed enhancements, each execution of MYRUN takes O(T) time, and we execute it Tk + 1 times, yielding an overall complexity of O(kT^2 d) per joint example. For comparison, structured SVMs or CRFs with first order Markov

8Any loss function that decomposes over the structure, as required by structured SVMs, is guaranteed to also be history-independent; the reverse is not true. Furthermore, when structured SVMs are run with a non-decomposable loss function, their runtime becomes exponential in t. When our approach is used with a loss function that is not history-independent, our runtime increases by a factor of t.
9Because the policy is a multiclass classifier, d might hide a factor of k or log k.

Figure 3: Training time (minutes) versus test accuracy for POS and NER. Different points correspond to different termination criteria for training.
The rightmost figure uses default hyperparameters and the two left figures use hyperparameters that were tuned (for accuracy) on the holdout data. Results for NER with default parameters are in the appendix. The x-axis is in log scale.

dependencies run in O(k^2 T) time. When both memoization and forced path collapse are in effect, the complexity of training drops to O(Tkd), similar to independent prediction. In particular, if the ith prediction only depends on the (i−1)th prediction, then at most Tk unique predictions are made.10

5 System Performance

We present two sets of experiments. In the first set, we compare the credit assignment compiler with existing libraries on two sequence tagging problems: part of speech tagging (POS) on the Wall Street Journal portion of the Penn Treebank, and a sequence chunking problem, named entity recognition (NER), using the standard Begin-In-Out encoding on the CoNLL 2003 dataset. In the second set of experiments, we demonstrate that a simple dependency parser built with our approach achieves strong results compared with systems of similar complexity. The parser is evaluated on the standard WSJ (English, Stanford-style labels) and CTB (Chinese) datasets and on the CoNLL-X datasets for 10 other languages.11 Our approach is implemented using the Vowpal Wabbit [29] toolkit on top of a cost-sensitive classifier [3] trained with online updates [15, 24, 42].
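The cost-sensitive examples (ex[t0], losses) produced by Algorithm 3 can be consumed by a simple reduction such as one online regressor per action. The sketch below is illustrative only; the reductions in the actual library [3] are more sophisticated:

```python
# Sketch of a cost-sensitive one-against-all reduction: one online
# least-squares regressor per action predicts that action's cost, and
# prediction picks the action with the smallest predicted cost.

class CostSensitiveOAA:
    def __init__(self, n_actions, n_features, lr=0.5):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr

    def _score(self, a, x):
        return sum(wi * xi for wi, xi in zip(self.w[a], x))

    def update(self, x, costs):
        # one squared-loss gradient step per action's cost regressor
        for a, cost in enumerate(costs):
            err = self._score(a, x) - cost
            self.w[a] = [wi - self.lr * err * xi for wi, xi in zip(self.w[a], x)]

    def predict(self, x):
        return min(range(len(self.w)), key=lambda a: self._score(a, x))

learner = CostSensitiveOAA(n_actions=2, n_features=1)
for _ in range(20):
    learner.update([1.0], costs=[0.9, 0.1])  # action 1 is consistently cheaper
print(learner.predict([1.0]))  # 1
```

Each (features, losses) pair from a one-step deviation becomes one call to `update`, so training cost scales with the number of deviations explored.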
Details of dataset statistics,\nexperimental settings, additional results on other applications, and pseudocode are in the appendix.\n\n5.1 Sequence Tagging Tasks\n\nWe compare our system with freely available systems, including CRF++ [27], CRF SGD [4], Struc-\ntured Perceptron [9], Structured SVM [23], Structured SVM (DEMI-DCD) [6], and an unstructured\nbaseline (OAA) predicting each label independently, using one-against-all classi\ufb01cation [3]12.\nFor each system, we consider two situations, either the default hyperparameters or the tuned\nhyperparameters that achieved the best performance on holdout data. We report both conditions\nto give a sense of how sensitive each approach is to the setting of hyperparameters (the amount of\nhyperparameter tuning directly affects effective training time). We use the built-in feature template\nof CRF++ to generate features and use them for other systems. The templates included neighboring\nwords and, in the case of NER, neighboring POS tags. The CRF++ templates generate 630k unique\nfeatures for the training data. However, because L2S is also able to generate features from its own\ntemplates, we also provide results for L2S (ft) in which it uses its own feature template generation.\n\nTraining time.\nIn Figure 3, we show trade-offs between training time (x-axis, log scaled) and\nprediction accuracy (y-axis) for the aforementioned six systems. For POS tagging, the independent\nclassi\ufb01er is the fastest (trains in less than one minute) but its performance peaks at 95% accuracy.\nThree other approaches are in roughly the same time/accuracy trade-off: L2s, L2S (ft) and Structured\nPerceptron. CRF SGD takes about twice as long. 
DEMI-DCD (taking a half hour) and CRF++ (taking over five hours) are not competitive. Structured SVM runs out of memory before achieving competitive performance (likely due to too many constraints). For NER the story is a bit different. The independent classifiers are not competitive. Here, the two variants of L2S totally dominate. In this case, Structured Perceptron is no longer competitive and is essentially dominated by CRF SGD. The only system coming close to L2S’s performance is DEMI-DCD, although its performance flattens out after a few minutes.13 The trends in the runs with default hyperparameters show behavior similar to the tuned runs, though some of the competing approaches suffer significantly in prediction performance. Structured Perceptron has no hyperparameters.

10We use tied randomness [34] to ensure that for any time step, the same policy is called.
11PTB and CTB are prepared following [8], and CoNLL-X is from the CoNLL shared task 2006.
12Structured Perceptron and Structured SVM (DEMI-DCD) are implemented in Illinois-SL [7]. DEMI-DCD is a multi-core dual approach, while Structured SVM uses cutting planes.

Parser  AR    BU    CH    CZ+   DA    DU+   JA+   PO+   SL+   SW    PTB    CTB
DYNA    75.3  89.8  88.7  81.5  87.9  74.2  92.1  88.9  78.5  88.9  90.3   80.0
SNN     67.4* 88.1  87.3  78.2  83.0  75.3  89.5  83.2* 63.6* 85.7  91.8#  83.9#
L2S     78.2  92.0  89.8  84.8  89.8  79.2  91.8  90.6  82.2  89.7  91.9   85.1
BEST    79.3  92.0  93.2  87.3  90.6  83.6  93.2  91.4  83.2  89.5  94.4#  87.2#

Table 1: UAS on PTB, CTB and CoNLL-X. BEST: the best known result in CoNLL-X or the best published result (CTB, PTB) using arbitrary features and resources. See details and additional results in the text and in the appendix.15

Test Time. In addition to training time, one might care about test-time behavior.
On NER, prediction speeds were 5.3k tokens/second (DEMI-DCD and Structured Perceptron), 20k (CRF SGD and Structured SVM), 100k (CRF++), 220k (L2S (ft)), and 285k (L2S). Although CRF SGD and Structured Perceptron fared well in terms of training time, their test-time behavior is suboptimal. When the number of labels increases from 9 (NER) to 45 (POS), the relative advantage of L2S increases further. The speed of L2S is about halved, while for the others it drops by as much as a factor of 8 due to the O(k) vs. O(k^2) dependence on the label set size.

5.2 Dependency Parsing

To demonstrate how the credit assignment compiler handles predictions with complex dependencies, we implement an arc-eager transition-based dependency parser [35]. At each state, it takes one of the four actions {Shift, Reduce, Left, Right} based on a simple neural network with one hidden layer of size 5, and in the end generates a dependency parse for the sentence. The rollin policy is the current (learned) policy. The probability of executing the reference policy (a dynamic oracle) [19] for rollouts decreases over each round. We compare our model with two recent greedy transition-based parsers implemented by the original authors: the dynamic oracle parser (DYNA) [19] and the Stanford neural network parser (SNN) [8]. We also present the best results in CoNLL-X and the best published results for CTB and PTB. Performance is evaluated by unlabeled attachment score (UAS). Punctuation is excluded.
Table 1 shows the results. Our implementation, with only ~300 lines of C++ code, is competitive with DYNA and SNN, which are specifically designed for parsing. Remarkably, our system achieves strong performance on CoNLL-X without tuning any hyperparameters, even beating heavily tuned systems participating in the challenge on one dataset. The best system to date on PTB [2] uses global normalization, more complex neural network layers, and k-best POS tags.
Similarly, the best system for CTB [16] uses stack LSTM architectures tailored for dependency parsing.

5.3 Empirical Evaluation of Optimizations

13We also tried giving CRF SGD the features computed by L2S (ft) on both POS and NER. On POS, its accuracy improved to 96.5 with essentially the same speed. On NER, its performance decreased.
15(*) SNN makes assumptions about the structure of languages and hence obtains substantially worse performance on languages with multi-root trees. (+) Languages containing more than 1% non-projective arcs, where a transition-based parser (e.g., L2S) likely underperforms a graph-based parser (BEST) due to its model assumptions. (#) Numbers reported in the published papers [8, 16, 2].

              NER            POS
             LOLS   Searn   LOLS   Searn
No Opts       96s   3739s   123s   4255s
Mem.          75s   1142s    85s   1215s
Col.@4+Mem.   71s   1059s    75s   1104s
Col.@2+Mem.   69s   1038s    71s   1074s

Figure 4: The table on the left shows the effect of Collapse (Col.) and Memoization (Mem.). The figure on the right shows the speed-up obtained for different history lengths and mixing rates of the rollout policy. Larger α corresponds to more predictions being required when training the model.

In Section 4, we discussed two approaches for computational improvements. Memoization avoids re-predicting on the same input multiple times, while path collapse stops rollouts at a particular point in time. The effect of the different optimizations depends greatly on the underlying learning algorithm.
For example, DAgger does not do rollouts at all, so no ef\ufb01ciency is gained by either\noptimization.16 The affected algorithms are LOLS (with mixed rollouts) and Searn.\nFigure 4 shows the effect of these optimizations on the best NER and POS systems we trained\nwithout using external resources.\nIn the left table, we can see that memoization alone reduces\noverall training runtime by about 25% on NER and about 70% on POS, essentially because the\noverhead for the classi\ufb01er on POS tagging is so much higher (45 labels versus 9). When rollouts\nare terminated early, the speed increases are much more modest, essentially because memoization\nis already accounting for much of these gains. In all cases, the \ufb01nal performance of the predictors\nis within statistical signi\ufb01cance of each other (p-value of 0.95, paired sign test), except for Col-\nlapse@2+Memoization on NER, where the performance decrease is only insigni\ufb01cant at the 0.90\nlevel. The right \ufb01gure demonstrates that when \u03b1 increases, more prediction is required during the\ntraining time, and the speedup increases from a factor of 1 (no change) to a factor of as much as 9.\nHowever, as the history length increases, the speedup is more modest due to low cache hits.\n\n6 Related Work\n\nSeveral algorithms are similar to learning to search approaches, including the incremental structured\nperceptron [10, 22], HC-Search [13, 14], and others [12, 38, 45, 48, 49]. Some \ufb01t this framework.\nProbabilistic programming [21] has been an active area of research. These approaches have a differ-\nent goal: Providing a \ufb02exible framework for specifying graphical models and performing inference\nin those models. The credit assignment compiler instead allows a developer to learn to make co-\nherent decisions for joint prediction (\u201clearning to search\u201d). We also differ by not designing a new\nprogramming language. 
Instead, we have a two-function library, which makes adoption and integration into existing code bases much easier.

The closest work to ours is Factorie [31]. Factorie is essentially an embedded language for writing factor graphs, compiled into Scala to run efficiently.17 Similarly, Infer.NET [33], Markov Logic Networks (MLNs) [39], and Probabilistic Soft Logic (PSL) [25] concisely construct and use probabilistic graphical models. BLOG [32] falls in the same category, though with a very different focus. Similarly, Dyna [17] is a related declarative language for specifying probabilistic dynamic programs, and Saul [26] is a declarative language embedded in Scala that deals with joint prediction via integer linear programming. All of these examples have picked particular aspects of the probabilistic modeling framework to focus on. Beyond these examples, there are several approaches that essentially "reinvent" an existing programming language to support probabilistic reasoning at the first-order level. IBAL [36] derives from OCaml; Church [20] derives from LISP. IBAL uses a (highly optimized) form of variable elimination for inference that takes strong advantage of the structure of the program; Church uses MCMC techniques, coupled with a different type of structural reasoning, to improve efficiency.

Acknowledgements Part of this work was carried out while Kai-Wei, Hal and Stephane were visiting Microsoft Research. Hal and He are also supported by NSF grant IIS-1320538. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.
The authors thank anonymous reviewers for their comments.

16 Training speed is only degraded by about 0.5% with optimizations on, demonstrating negligible overhead.
17 Factorie-based implementations of simple tasks are still less efficient than systems like CRF SGD.

[Figure 4, right panel: "Effect of Caching on Training Efficiency" — speedup factor for training versus α (log10), which controls the mixing rate, shown for history lengths 1–4.]

References

[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.
[2] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. Globally normalized transition-based neural networks. arXiv, 2016.
[3] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In ICML, pages 49–56, 2005.
[4] L. Bottou. crfsgd project, 2011. http://leon.bottou.org/projects/sgd.
[5] K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In ICML, 2015.
[6] K.-W. Chang, V. Srikumar, and D. Roth. Multi-core structural SVM training. In ECML, 2013.
[7] K.-W. Chang, S. Upadhyay, M.-W. Chang, V. Srikumar, and D. Roth. IllinoisSL: A JAVA library for structured prediction. arXiv, 2015.
[8] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.
[9] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
[10] M. Collins and B. Roark. Incremental parsing with the perceptron algorithm. In ACL, 2004.
[11] H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning Journal, 2009.
[12] H. Daumé III and D. Marcu.
Learning as search optimization: Approximate large margin methods for structured prediction. In ICML, 2005.
[13] J. R. Doppa, A. Fern, and P. Tadepalli. Output space search for structured prediction. In ICML, 2012.
[14] J. R. Doppa, A. Fern, and P. Tadepalli. HC-Search: A learning framework for search-based structured prediction. JAIR, 50, 2014.
[15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
[16] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith. Transition-based dependency parsing with stack long short-term memory. In ACL, 2015.
[17] J. Eisner, E. Goldlust, and N. A. Smith. Compiling comp ling: Practical weighted dynamic programming and the Dyna language. In EMNLP, 2005.
[18] U. Germann, M. Jahr, K. Knight, D. Marcu, and K. Yamada. Fast decoding and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127–143, 2003.
[19] Y. Goldberg and J. Nivre. Training deterministic parsers with non-deterministic oracles. Transactions of the ACL, 1, 2013.
[20] N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, and J. Tenenbaum. Church: a language for generative models. In UAI, 2008.
[21] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In International Conference on Software Engineering (ICSE, FOSE track), 2014.
[22] L. Huang, S. Fayong, and Y. Guo. Structured perceptron with inexact search. In NAACL, 2012.
[23] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning Journal, 2009.
[24] N. Karampatziakis and J. Langford. Online importance weight aware updates. In UAI, 2011.
[25] A. Kimmig, S. Bach, M. Broecheler, B. Huang, and L. Getoor. A short introduction to probabilistic soft logic. In NIPS Workshop on Probabilistic Programming, 2012.
[26] P. Kordjamshidi, D. Roth, and H. Wu.
Saul: Towards declarative learning based programming. In IJCAI, 2015.
[27] T. Kudo. CRF++ project, 2005. http://crfpp.googlecode.com.
[28] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
[29] J. Langford, A. Strehl, and L. Li. Vowpal wabbit, 2007. http://hunch.net/~vw.
[30] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000.
[31] A. McCallum, K. Schultz, and S. Singh. FACTORIE: probabilistic programming via imperatively defined factor graphs. In NIPS, 2009.
[32] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: probabilistic models with unknown objects. Statistical relational learning, 2007.
[33] T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4. Microsoft Research Cambridge, 2010.
[34] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In UAI, pages 406–415, 2000.
[35] J. Nivre. An efficient algorithm for projective dependency parsing. In IWPT, pages 149–160, 2003.
[36] A. Pfeffer. IBAL: A probabilistic rational programming language. In IJCAI, 2001.
[37] L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In CoNLL, 2009.
[38] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt. Boosting structured prediction for imitation learning. In NIPS, 2007.
[39] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2), 2006.
[40] S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv:1406.5979, 2014.
[41] S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AI-Stats, 2011.
[42] S. Ross, P. Mineiro, and J. Langford. Normalized online learning.
In UAI, 2013.
[43] D. Roth and S. W. Yih. Global inference for entity and relation identification via a linear programming formulation. In Introduction to Statistical Relational Learning. MIT Press, 2007.
[44] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544, 2001.
[45] U. Syed and R. E. Schapire. A reduction from apprenticeship learning to classification. In NIPS, 2011.
[46] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[47] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[48] Y. Xu and A. Fern. On learning linear ranking functions for beam search. In ICML, pages 1047–1054, 2007.
[49] Y. Xu, A. Fern, and S. W. Yoon. Discriminative learning of beam-search heuristics for planning. In IJCAI, pages 2041–2046, 2007.