{"title": "On-the-fly Operation Batching in Dynamic Computation Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 3971, "page_last": 3981, "abstract": "Dynamic neural networks toolkits such as PyTorch, DyNet, and Chainer offer more flexibility for implementing models that cope with data of varying dimensions and structure, relative to toolkits that operate on statically declared computations (e.g., TensorFlow, CNTK, and Theano). However, existing toolkits - both static and dynamic - require that the developer organize the computations into the batches necessary for exploiting high-performance data-parallel algorithms and hardware. This batching task is generally difficult, but it becomes a major hurdle as architectures become complex. In this paper, we present an algorithm, and its implementation in the DyNet toolkit, for automatically batching operations. Developers simply write minibatch computations as aggregations of single instance computations, and the batching algorithm seamlessly executes them, on the fly, in computationally efficient batches. On a variety of tasks, we obtain throughput similar to manual batches, as well as comparable speedups over single-instance learning on architectures that are impractical to batch manually.", "full_text": "On-the-\ufb02y Operation Batching\nin Dynamic Computation Graphs\n\nGraham Neubig\u21e4\n\nLanguage Technologies Institute\n\nCarnegie Mellon University\n\ngneubig@cs.cmu.edu\n\nYoav Goldberg\u21e4\n\nComputer Science Department\n\nBar-Ilan University\nyogo@cs.biu.ac.il\n\nChris Dyer\nDeepMind\n\ncdyer@google.com\n\nAbstract\n\nDynamic neural network toolkits such as PyTorch, DyNet, and Chainer offer more\n\ufb02exibility for implementing models that cope with data of varying dimensions and\nstructure, relative to toolkits that operate on statically declared computations (e.g.,\nTensorFlow, CNTK, and Theano). 
However, existing toolkits\u2014both static and\ndynamic\u2014require that the developer organize the computations into the batches\nnecessary for exploiting high-performance algorithms and hardware. This batching\ntask is generally dif\ufb01cult, but it becomes a major hurdle as architectures become\ncomplex. In this paper, we present an algorithm, and its implementation in the\nDyNet toolkit, for automatically batching operations. Developers simply write\nminibatch computations as aggregations of single instance computations, and the\nbatching algorithm seamlessly executes them, on the \ufb02y, using computationally\nef\ufb01cient batched operations. On a variety of tasks, we obtain throughput similar to\nthat obtained with manual batches, as well as comparable speedups over single-\ninstance learning on architectures that are impractical to batch manually.2\n\n1\n\nIntroduction\n\nModern CPUs and GPUs evaluate batches of arithmetic operations signi\ufb01cantly faster than the\nsequential evaluation of the same operations. For example, performing elementwise operations takes\nnearly the same amount of time on the GPU whether operating on tens or on thousands of elements,\nand multiplying a few hundred different vectors by the same matrix is signi\ufb01cantly slower than\nexecuting a single (equivalent) matrix\u2013matrix product using an optimized GEMM implementation on\neither a GPU or a CPU. Thus, careful grouping of operations into batches that can execute ef\ufb01ciently\nin parallel is crucial for making the most of available hardware resources.\nToday, developers who write code to train neural networks are responsible for crafting most of this\nbatch handling by hand. 
In some cases this is easy: when inputs and outputs are naturally represented as fixed sized tensors (e.g., images of a fixed size such as those in the MNIST and CIFAR datasets, or regression problems on fixed sized vector inputs), and the computations required to process each instance are instance-invariant and expressible as standard operations on tensors (e.g., a series of matrix multiplications, convolutions, and elementwise nonlinearities), a suitably flexible tensor library

*Authors contributed equally.
2 The proposed algorithm is implemented in DyNet (http://dynet.io/), and can be activated by using the "--dynet-autobatch 1" command line flag.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1 diagram: on the left, per-instance RNN chains over inputs x_t^(i) producing individual losses L^(i); on the right, the batched version with padded input matrices X_1..X_4, masks m_1..m_4, and batched RNN steps.]

Figure 1: Two computation graphs for computing the loss on a minibatch of three training instances consisting of a sequence of input vectors paired with a fixed sized output vector. On the left is a "conceptual" computation graph which shows the operations associated with computing the losses individually for each sequence and then aggregating them. The same computation is executed by the right-hand ("batched") computation graph: it aggregates the inputs in order to make better use of modern processors. This comes with a price in complexity: the variable length of the sequences requires padding and masking operations. 
Our aim is for the user to specify the conceptual\ncomputation on the left, and let the framework take care of its ef\ufb01cient execution.\n\nthat provides ef\ufb01cient implementations of higher-order generalizations of low-order operations makes\nmanual batching straightforward. For example, by adding a leading or trailing dimension to the\ntensors representing inputs and outputs, multiple instances can be straightforwardly represented in a\nsingle data structure. In other words: in this scenario, the developer conceives of and writes code for\nthe computation on an individual instance, packs several instances into a tensor as a \u201cminibatch\u201d, and\nthe library handles executing these ef\ufb01ciently in parallel.\nUnfortunately, this idealized scenario breaks when working with more complex architectures. Deep\nlearning is increasingly being applied to problems whose inputs, outputs and intermediate representa-\ntions do not \ufb01t easily into \ufb01xed sized tensors. For example, images vary in size and sequences in\nlength; data may be structured as trees [29] or graphs [4, 17, 27], or the model may select its own\ncomputation conditional on the input [16, 28, 33]. In all these cases, while the desired computation\nis easy enough to write for a single instance, organizing the computational operations so that they\nmake optimally ef\ufb01cient use of the hardware is nontrivial. Indeed, many papers that operate on data\nstructures more complicated than sequences have avoided batching entirely [8, 15, 25]. In fact, until\nlast year [7, 20], all published work on recursive (i.e., tree-structured) neural networks appears to\nhave used single instance training.\nThe premise of this work is that operation batching should not be the responsibility of the user,\nbut instead should be a service provided by the framework. 
The user should only be responsible for specifying a large enough computation so that batching is possible (i.e., summing the losses of several instances, such as one sees in the left side of Figure 1), and the framework should take care of the lower-level details of operation batching, much like optimizing compilers or JIT optimizers in interpreted languages do.3
We take a large step towards this goal by introducing an efficient algorithm, and a corresponding implementation, for automatic batching in dynamically declared computation graphs.4 Our method relies on separating the graph construction from its execution, using operator overloading and lazy evaluation (§2). 

3 This is in contrast to other existing options for automatic batching such as TensorFlow Fold, which require the user to learn an additional domain-specific language to turn computation into a format conducive to automatic batching [19].

4 Computation graphs (often represented in a form called a Wengert list) are the data structures used to structure the evaluation of expressions and use reverse mode automatic differentiation to compute their derivatives [3]. Broadly, learning frameworks use two strategies to construct these: static and dynamic. In static toolkits (e.g., Theano [6], Tensorflow [1]) the computation graph is defined once and compiled, and then examples are fed into the same graph. In contrast, dynamic toolkits (e.g., DyNet [21], Chainer [32], PyTorch [http://pytorch.org]) construct the computation graph for each training instance (or minibatch) as the forward computation is executed. While dynamic declaration means that each minibatch can have its own computational architecture, the user is still responsible for batching operations themselves.

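The separation of graph construction from execution via operator overloading and lazy evaluation can be illustrated with a minimal pure-Python sketch. The `Node` class and `forward` function below are invented for illustration and are not DyNet's actual API; they only show how overloaded operators can record a graph that is evaluated later, all at once.

```python
# Toy illustration (not the DyNet API) of lazy graph construction:
# arithmetic operators record nodes; nothing is computed until forward().

class Node:
    def __init__(self, op, inputs, value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return Node("add", [self, other])   # record the op, compute nothing

    def __mul__(self, other):
        return Node("mul", [self, other])

def forward(node):
    """Evaluate the recorded graph only when a value is requested."""
    if node.op == "input":
        return node.value
    vals = [forward(i) for i in node.inputs]
    return vals[0] + vals[1] if node.op == "add" else vals[0] * vals[1]

x = Node("input", [], 3.0)
y = Node("input", [], 4.0)
loss = x * y + x          # builds a small graph; no arithmetic has run yet
print(forward(loss))      # 15.0 -- all evaluation happens here
```

Because whole graph chunks accumulate before any evaluation runs, a scheduler inserted between construction and `forward` gets to see many pending operations at once, which is exactly the window the batching heuristic below exploits.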
Once this separation is in place, we propose a fast batching heuristic that can be\nperformed in real time, for each training instance (or minibatch), between the graph construction\nand its execution (\u00a73). We extend the DyNet toolkit [21] with this capability. From the end-user\u2019s\nperspective, the result is a simple mechanism for exploiting ef\ufb01cient data-parallel algorithms in\nnetworks that would be cumbersome to batch by hand. The user simply de\ufb01nes the computation\nindependently for each instance in the batch (using standard Python or C++ language constructs),\nand the framework takes care of the rest. Experiments show that our algorithm compares favorably\nto manually batched code, that signi\ufb01cant speed improvements are possible on architectures with\nno straightforward manual batching design, and that we obtain better performance than TensorFlow\nFold [19], an alternative framework built to simulate dynamic graph de\ufb01nition and automatic batching\non top of TensorFlow (\u00a74).\n\n2 Batching: Conception vs. Ef\ufb01cient Implementation\n\nTo illustrate the challenges with batching, consider the problem of predicting a real-valued vector\nconditional on a sequence of input vectors (this example is chosen for its simplicity; experiments are\nconducted on more standard tasks). We assume that an input sequence of vectors is read sequentially\nby an RNN, and then the \ufb01nal state is used to make a prediction; the training loss is the Euclidean\ndistance between the prediction and target. 
We compare two algorithms for computing this loss: a naïve, but developer-friendly one (whose computation graph is shown in the left part of Figure 1), which reflects how one conceives of what a batch loss computation is; and a computationally efficient, but more conceptually complex, version that batches up the computations so they are executed in parallel across the sequences (the right part of Figure 1).

Naïve (developer-friendly) batched implementation  The left part of Figure 1 shows the computations that must be executed to compute the losses associated with three (b = 3) training instances, implemented naïvely. Pseudo-code for constructing the graph for each of the RNNs on the left using a dynamic declaration framework is as follows:

function RNN-REGRESSION-LOSS(x_{1:n}, y; (W, U, b, c) = θ)
    h_0 = 0                                   ▷ Initial state of the RNN; h_t ∈ R^d.
    for t ∈ 1, 2, . . . , n do
        h_t = tanh(W[h_{t−1}; x_t] + b)
    ŷ = U h_n + c
    L = ||ŷ − y||_2^2
    return L

Note that the code does not compute any value, but constructs a symbolic graph describing the computation. This can then be integrated into a batched training procedure:

function TRAIN-BATCH-NAIVE(T = {(x^(i)_{1:n^(i)}, y^(i))}_{i=1}^b; θ)
    NEW-GRAPH()
    for i ∈ 1, 2, . . . , b do                ▷ Naïvely loop over elements of batch.
        L^(i) = RNN-REGRESSION-LOSS(x^(i)_{1:n^(i)}, y^(i); θ)   ▷ Single instance loss.
    L = Σ_i L^(i)                             ▷ Aggregate losses for all elements in batch.
    FORWARD(L)
    ∂L/∂θ = BACKWARD(L)
    θ = θ − η ∂L/∂θ

This code is simple to understand, uses basic flow control present in any programming language and simple mathematical operations. 
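For concreteness, the same per-instance computation and naive summed-loss loop can be sketched in eager pure Python (toy dimensions, no computation graph or autodiff; the helper names are invented for illustration, and a framework version would record these operations symbolically instead of evaluating them immediately):

```python
import math

# Eager pure-Python sketch of RNN-REGRESSION-LOSS and the naive batch loop.
# Parameter names follow the pseudo-code; dimensions are toy-sized (d = 2).

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_regression_loss(xs, y, W, b, U, c):
    h = [0.0] * len(b)                                    # h_0 = 0
    for x in xs:                                          # h_t = tanh(W [h_{t-1}; x_t] + b)
        h = [math.tanh(z + bi) for z, bi in zip(matvec(W, h + x), b)]
    y_hat = [z + ci for z, ci in zip(matvec(U, h), c)]    # U h_n + c
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))    # squared Euclidean distance

# Naive batch: loop over instances (of different lengths!) and sum the losses.
d = 2
W = [[0.1] * (2 * d) for _ in range(d)]; b = [0.0] * d
U = [[0.1] * d for _ in range(d)];       c = [0.0] * d
batch = [([[1.0, 0.0]], [0.5, 0.5]),                  # length-1 sequence
         ([[0.0, 1.0], [1.0, 1.0]], [0.2, 0.8])]      # length-2 sequence
total = sum(rnn_regression_loss(xs, y, W, b, U, c) for xs, y in batch)
```

Note that nothing in this loop cares that the two sequences have different lengths, which is precisely what makes the naive form easy to write.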
Unfortunately, executing it will generally be quite inefficient, since in the resulting computation graph each operation is performed sequentially without exploiting the fact that similar operations are being performed across the training instances.

Efficient manually batched implementation  To make good use of efficient data-parallel algorithms and hardware, it is necessary to batch up the operations so that the sequences are processed in parallel. The standard way to achieve this is by aggregating the inputs and outputs, altering the code as follows:

function RNN-REGRESSION-BATCH-LOSS(X_{1:n_max}, Y, n^(1:b); (W, U, b, c) = θ)
    M = 0                                     ▷ Build loss mask; M ∈ R^{b×n_max}.
    for i ∈ 1, 2, . . . , b do
        M[i, n^(i)] = 1                       ▷ Position where the final symbol in sequence i occurs.
    H_0 = 0                                   ▷ Initial states of the RNN (one per instance); H_t ∈ R^{d×b}.
    for t ∈ 1, 2, . . . , n_max do
        H_t = tanh(W[H_{t−1}; X_t] + b)       ▷ Addition broadcasts b over columns.
        Ŷ_t = U H_t + c                       ▷ Addition broadcasts c over columns.
        L_t = ||(Ŷ_t − Y) ⊙ (1 m_t^⊤)||_F^2  ▷ Compute masked losses (m_t is the tth column of M).
    L = Σ_t L_t
    return L

function TRAIN-BATCH-MANUAL(T = {(x^(i)_{1:n^(i)}, y^(i))}_{i=1}^b; θ)
    n_max = max_i n^(i)
    for t ∈ 1, 2, . . . , n_max do            ▷ Build sequence of batch input matrices.
        X_t = 0 ∈ R^{d×b}
        for i ∈ 1, 2, . . . , b do
            X_{t,[·,i]} = x^(i)_t if t ≤ n^(i), otherwise 0    ▷ The ith column of X_t.
    Y = [y^(1) y^(2) · · · y^(b)]             ▷ Build batch of output targets.
    NEW-GRAPH()                               ▷ Now that inputs are constructed, create graph, evaluate loss and gradient.
    L = RNN-REGRESSION-BATCH-LOSS(X_{1:n_max}, Y, n^(1:b); θ)
    FORWARD(L)
    ∂L/∂θ = BACKWARD(L)
    θ = θ − η ∂L/∂θ

This code computes the same value as the naïve implementation and does so more efficiently, but it is significantly more complicated. 
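The input aggregation in TRAIN-BATCH-MANUAL hinges on padding ragged sequences to a common length and masking the loss. A minimal pure-Python sketch (toy scalar inputs per time step, names invented for illustration):

```python
# Pad b ragged sequences to n_max and build a mask that selects only each
# sequence's final time step, mirroring the X_t and M of the pseudo-code.

seqs = [[0.3, 0.1, 0.4], [0.5], [0.2, 0.6]]          # b = 3, lengths 3, 1, 2
b, n_max = len(seqs), max(len(s) for s in seqs)

# X[t][i]: input for instance i at step t, zero-padded past the sequence end.
X = [[seqs[i][t] if t < len(seqs[i]) else 0.0 for i in range(b)]
     for t in range(n_max)]

# M[t][i] = 1 only where the final symbol of sequence i occurs; multiplying
# the per-step losses by M zeroes out contributions from padded positions.
M = [[1.0 if t == len(seqs[i]) - 1 else 0.0 for i in range(b)]
     for t in range(n_max)]
```

Even in this tiny example, an off-by-one in the mask silently corrupts the loss rather than crashing, which is the kind of bug the next paragraph alludes to.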
Because the sequences processed by RNNs will generally be of different lengths (which is precisely why RNNs are useful!), it is necessary to pad the input representation with dummy values, and also to mask out the resulting losses at the right times. While these techniques are part of the inventory of skills that a good ML engineer has, they increase the difficulty of implementation and the probability that bugs will be present in the code.

Implementation comparison  The naïve algorithm has two advantages over manual batching. First, it is easy to implement: the way we conceive of a model is the way it is implemented, and errors with padding, masking, and batching are avoided. Second, the naïve algorithm aggregates any single-instance loss, whereas manual batching efforts are generally problem specific. For these reasons, one should strongly prefer the first algorithm; however, for efficiency reasons, batching matters. In the next section we turn to the problem of how to efficiently execute naïve computation graphs so that they can take advantage of efficient batched implementations of operations. This provides the best of both worlds to developers: code is easy to write, but execution is fast.

3 An Algorithm for On-the-fly Batching

Manual batching, discussed in the previous section, mostly operates by aggregating input instances and feeding them through a network. In RNNs, this means aggregating inputs that share a time step. This often requires padding and masking, as input sizes may differ. It also restricts the kinds of operations that can be batched. In contrast, our method identifies and aggregates computation graph nodes that can be executed in a batched fashion for a given graph. 
This reduces the need for workarounds such as padding and masking, allows for seamless, efficient execution even in architectures which are hard to conceptualize in the input-centric paradigm, and allows for the identification of batching opportunities that may not be apparent from an input-centric view.
Our batching procedure operates in three steps: (1) graph definition, (2) operation batching, and (3) computation. Here, steps (1) and (3) are shared with standard execution of computation graphs, while (2) corresponds to our proposed method.

3.1 Graph Definition

First, we define the graph that represents the computation that we want to perform. From the user's perspective, this is done by simply performing the computation they are interested in, such as that defined in the RNN-REGRESSION-LOSS function from the previous example. While it is common for dynamic graph frameworks to interleave the graph definition and its forward execution, we separate these parts by using lazy evaluation: we only perform forward evaluation when a resulting value is requested by the user through a call to the FORWARD function. The graph can be further extended after a call to FORWARD, and further calls will lazily evaluate the delta of the computation. This allows the accumulation of large graph chunks before executing forward computations, providing ample opportunities for operation batching.

3.2 Operation Batching

Next, given a computation graph, such as the one on the left side of Figure 1, our proposed algorithm converts it into a graph where operations that can be executed together are batched together. This is done in the two-step process described below.

Computing compatibility groups  We first partition the nodes into compatibility groups, where nodes in the same group have the potential for batching. 
This is done by associating each node with\na signature such that nodes that share the same signature are guaranteed to be able to be executed\nin a single operation if their inputs are ready. Signatures vary depending on the operation the node\nrepresents. For example, in nodes representing element-wise operations, all nodes with the same\noperation can be batched together, so the signature is simply the operation name (tanh, log, ...). In\nnodes where dimensions or other information is also relevant to whether the operations can be batched,\nthis information is also included in the signature. For example, a node that picks a slice of the input\nmatrix will also be dependent on the matrix size and range to slice, so the signature will look something\nlike slice-400x500-100:200. In some other cases (e.g. a parameterized matrix multiply) we may\nremember the speci\ufb01c node ID of one of the inputs (e.g. node123 representing the matrix multiply\nparameters) while generalizing across other inputs (e.g. data or hidden state vectors on the right-hand\nside), resulting in a signature that would look something like matmul-node123-400x1. A more\nthorough discussion is given in Appendix A.\n\nDetermining execution order A computation graph is essentially a job dependency graph where\neach node depends on its input (and by proxy the input of other preceding nodes on the path to\nits inputs). Our goal is to select an execution order in which (1) each node is executed after its\ndependencies; and (2) nodes that have the same signature and do not depend on each other are\nscheduled for execution on the same step (and will be executed in a single batched operation).\nFinding an optimal execution order that maximizes the amount of batching in the general case is\nNP hard [24]. We discuss two heuristic strategies for identifying execution orders that satisfy these\nrequirements.\nDepth-based batching is used as a method for automatic batching in TensorFlow Fold [19]. 
This is done by calculating the depth of each node in the original computation graph, defined as the maximum length of a path from a leaf node to the node itself, and batching together nodes that have an identical depth and signature. By construction, nodes of the same depth are not dependent on each other, as all nodes have a higher depth than their inputs, and thus this batching strategy is guaranteed to satisfy condition (1) above. However, this strategy will also miss some good batching opportunities. For example, the loss function calculations in Figure 1 are of different depths due to the different-length sequences, and similar problems will occur in recurrent neural network language models, tree-structured neural networks, and a myriad of other situations.
Agenda-based batching is a method we propose that does not depend solely on depth. The core of this method is an agenda that tracks "available" nodes that have no unresolved dependencies. For each node, a count of its unresolved dependencies is maintained; this is initialized to the number of inputs to the node. The agenda is initialized by adding nodes that have no incoming inputs (and thus no unresolved dependencies). At each iteration, we select a node from the agenda together with all of the available nodes with the same signature, and group them into a single batch operation. These nodes are then removed from the agenda, and the dependency counters of all of their successors are decremented. Any new zero-dependency nodes are added to the agenda. This process is repeated until all nodes have been processed.
How do we prioritize between multiple available nodes in the agenda? Intuitively, we want to avoid prematurely executing nodes if there is a potential for more nodes of the same signature to be added to the agenda at a later point, resulting in better batching. 
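The agenda procedure just described (signatures, dependency counters, and a depth-based priority) can be sketched as a small pure-Python scheduler. This is illustrative only, not DyNet's implementation; the node/signature encoding and the `schedule` function are invented for the sketch, which prioritizes the signature with the lowest average depth at each step:

```python
from collections import defaultdict

def schedule(nodes, deps):
    """nodes: {id: signature}; deps: {id: [input ids]}. Returns batched steps."""
    depth = {}
    def get_depth(n):                       # longest path from a leaf to n
        if n not in depth:
            depth[n] = 1 + max((get_depth(i) for i in deps[n]), default=-1)
        return depth[n]
    by_sig = defaultdict(list)
    for n in nodes:
        by_sig[nodes[n]].append(get_depth(n))
    avg = {s: sum(d) / len(d) for s, d in by_sig.items()}

    remaining = {n: len(deps[n]) for n in nodes}    # unresolved-dependency counts
    succs = defaultdict(list)
    for n in nodes:
        for i in deps[n]:
            succs[i].append(n)
    agenda = {n for n in nodes if remaining[n] == 0}
    batches = []
    while agenda:
        # Pick the available signature with the lowest average depth, and run
        # every available node sharing it as one batched operation.
        sig = min({nodes[n] for n in agenda}, key=lambda s: avg[s])
        batch = sorted(n for n in agenda if nodes[n] == sig)
        batches.append((sig, batch))
        agenda -= set(batch)
        for n in batch:
            for m in succs[n]:
                remaining[m] -= 1
                if remaining[m] == 0:
                    agenda.add(m)
    return batches

# Two RNN chains of different lengths, each ending in a loss node.
nodes = {"a1": "rnn", "a2": "rnn", "la": "loss", "b1": "rnn", "lb": "loss"}
deps = {"a1": [], "a2": ["a1"], "la": ["a2"], "b1": [], "lb": ["b1"]}
batches = schedule(nodes, deps)
```

Here the two loss nodes end up in one batch even though their depths differ (2 vs. 1), whereas grouping strictly by depth would execute them separately.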
A good example of this from our running example in Figure 1 is the loss-calculating nodes, which will be added to the agenda at different points due to becoming calculable after different numbers of RNN time steps. To capture this intuition, we introduce a heuristic method for prioritizing nodes based on the average depth of all nodes with their signature, such that nodes with a lower average depth will be executed earlier. In general (with some exceptions), this tends to prioritize nodes that occur in earlier parts of the graph, which will result in the nodes in the later parts of the graph, such as these loss calculations, being executed later and hopefully batched together.[5]
Finally, this non-trivial batching procedure must be executed quickly so that overhead due to batch scheduling calculations doesn't cancel out the efficiency gains from operation batching. To ensure this, we perform a number of optimizations in the implementation, which we detail in Appendix B.

3.3 Forward-backward Graph Execution and Update

Once we have determined an execution order (including batching decisions), we perform the calculations of the values themselves. In standard computation graphs, forward computation is done in topological order to calculate the function itself, and backward calculation is done in reverse topological order to calculate gradients. In our automatically batched evaluation, the calculation is largely similar, with two exceptions:

Single→batch node conversion  First, it is necessary to convert single nodes into a batched node, which also requires modification of the underlying operations, such as converting multiple matrix-vector operations Wh_i into a single matrix-matrix operation WH. 
This is done internally in the library, while the user-facing API maintains the original unbatched computation graph structure, making this process invisible to the user.

Ensuring contiguous memory  To ensure that operations can be executed as a batch, the inputs to the operations (e.g. the various vectors h_t^(i)) must be arranged in contiguous memory (e.g. a matrix H_t). In some cases, it is necessary to perform a memory copy to arrange these inputs into contiguous memory, but in other cases the inputs are already contiguous and in the correct order, and in these cases we can omit the memory copy and use the inputs as-is.[6]

4 Experiments

In this section we describe our experiments, designed to answer three main questions: (1) in situations where manual batching is easy, how close can the proposed method approach the efficiency of a program that uses hand-crafted manual batching, and how do the depth-based and agenda-based approaches compare (§4.1)? (2) in situations where manual batching is less easy, is the proposed method capable of obtaining significant improvements in efficiency (§4.2)? (3) how does the proposed method compare to TensorFlow Fold, an existing method for batching variably structured networks within a static declaration framework (§4.3)?

4.1 Synthetic Experiments

Our first experiments stress-test our proposed algorithm in an ideal case for manual batching. Specifically, we train a model on a bi-directional LSTM sequence labeler [12, 23], on synthetic data where every sequence to be labeled is the same length (40). Because of this, manual batching is easy, as we don't have to do any padding or adjustment for sentences of different lengths. The network takes as input a size-200 embedding vector from a vocabulary of size 1000, has 2 layers of 256-hidden-node LSTMs in either direction, then predicts a label from one of 300 classes. 
The batch size is 64.[7]
Within this setting we test various batching configurations: without or with manual mini-batching, where we explicitly batch the word vector lookup, LSTM update, and loss calculation for each time step; and without on-the-fly batching (NOAUTO), with depth-based autobatching (BYDEPTH), or with agenda-based autobatching (BYAGENDA). We measure the speed of each method in milliseconds per sentence (ms/sent) and also break down the percentage of computation time spent in (1) forward graph creation/on-the-fly batching, (2) forward computation, (3) backward graph creation, (4) backward computation, and (5) parameter update. The results can be found in Figure 2. 

[5] Even given this prioritization method it is still possible to have ties, in which case we break ties by calculating "cheap" operations (e.g. tanh and other elementwise ops) before "heavy" ones (e.g. matrix multiplies).
[6] The implication of this is that batched computation will take up to twice as much memory as unbatched computation, but in practice the memory usage is much less than this. Like manually batched computation, memory usage can be controlled by adjusting the batch size appropriately so it fits in memory.
[7] Experiments were run on a single Tesla K80 GPU or Intel Xeon 2.30GHz E5-2686v4 CPU. To control for variance in execution time, we perform three runs and report the fastest. We do not report accuracy numbers, as the functions calculated and thus accuracies are the same regardless of batching strategy.

Figure 2: Computation time for forward/backward graph construction or computation, as well as parameter update for a BiLSTM tagger without or with manual batching, and without, with depth-based, or with agenda-based automatic batching.

First, comparing the first row with the second two, we can see that the proposed on-the-fly batching strategy drastically reduces computation time per sentence, with BYAGENDA reducing per-sentence computation time from 193ms to 16.9ms on CPU and 54.6ms to 5.03ms on GPU, resulting in an approximately 11-fold increase in sentences processed per second (5.17→59.3 on CPU and 18.3→198 on GPU). BYAGENDA is faster than BYDEPTH by about 15–30%, demonstrating that our more sophisticated agenda-based strategy is indeed more effective at batching together operations.
Next, compared to manual batching without automatic batching (the fourth row), we can see that fully automatic batching with no manual batching is competitive, but slightly slower. The speed decrease is attributed to the increased overhead for computation graph construction and batch scheduling. However, even in this extremely idealized scenario where manual batching will be most competitive, the difference is relatively small (1.27× on CPU and 1.76× on GPU) compared to the extreme difference between the case of using no batching at all. Given that automatic batching has other major advantages such as ease of implementation, it may be an attractive alternative even in situations where manual batching is relatively easy.
Finally, if we compare the fourth and fifth/sixth rows, we can see that on GPU, even with manual batching, automatic batching still provides gains in computational efficiency, processing sentences up to 1.1 times faster than without automatic batching. The reason for this can be attributed to the fact that our BiLSTM implementation performs manual batching across sentences, but not across time steps within the sentence. In contrast, the auto-batching procedure was able to batch the word embedding lookup and softmax operations across time-steps as well, reducing the number of GPU calls and increasing speed. 
This was not the case for CPU, as there is less to be gained from batching these less expensive operations.

4.2 Experiments on Difficult-to-batch Tasks

Next, we extend our experiments to cases that are increasingly more difficult to manually batch. We use realistic dimension sizes for the corresponding tasks, and batches of size b = 64. Exact dimensions and further details on training settings are in Appendix C.

Table 1: Sentences/second on various training tasks for increasingly challenging batching scenarios.

Task                  CPU                          GPU
                      NoAuto  ByDepth  ByAgenda    NoAuto  ByDepth  ByAgenda
BiLSTM                16.8    139      156         56.2    337      367
BiLSTM w/ char        15.7    93.8     132         43.2    183      275
TreeLSTM              50.2    348      357         76.5    672      661
Transition-Parsing    16.8    61.0     61.2        33.0    89.5     90.1

BiLSTM: This is similar to the ideal case in the previous section, but trained on actual variable length sequences.

BiLSTM w/char: This is the same as the BiLSTM tagger above, except that we use an additional BiLSTM over characters to calculate the embeddings of rare words. These sorts of character-based embeddings have been shown to allow the model to generalize better [18], but they also make batching operations more difficult, as we now have a variable-length encoding step that may or may not occur for each of the words in the input.

Tree-structured LSTMs: This is the Tree-LSTM model of [31]. Here, each instance is a tree rather than a sequence, and the network structure follows the tree structures. 
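Tree-structured computation exposes batching opportunities even within a single instance; the following toy pure-Python sketch (names invented for illustration) shows that one small tree already contains several independent composition operations sharing a signature, which an autobatcher can group but which manual code would have to discover by hand:

```python
# A small binary tree of leaves: every internal node is one "compose" op
# (e.g., a TreeLSTM cell) with the same signature.

tree = (((1, 2), 3), ((4, 5), (6, 7)))

def internal_nodes(t):
    """Collect internal nodes bottom-up; each is one compose operation."""
    if not isinstance(t, tuple):
        return []
    return internal_nodes(t[0]) + internal_nodes(t[1]) + [t]

ops = internal_nodes(tree)      # 6 same-signature compose operations

# Ops whose children are both leaves have no mutual dependencies, so they
# could all execute as a single batched operation in the first step.
first_wave = [t for t in ops
              if not isinstance(t[0], tuple) and not isinstance(t[1], tuple)]
```

Here `(1, 2)`, `(4, 5)`, and `(6, 7)` sit at different positions and (across real trees) different depths, yet form one batch; hand-writing this grouping for every tree shape in a corpus is exactly the burden described next.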
As discussed in the introduction, this architecture is notoriously hard to manually batch.

Transition-based Dependency Parsing: The most challenging case we evaluate is that of a transition-based system, such as a transition-based parser with LSTM-based feature-extraction [8, 9, 13] and exploration-based training [2, 5, 10]. Here, a sequence is encoded using an LSTM (or a bi-LSTM), followed by a series of predictions. Each prediction is based on a subset of the encoded vectors, and the vectors that participate in each prediction, as well as the loss, are determined by the outcomes of the previous predictions. Here, batching is harder yet, as the nature of the computation interleaves sampling from the model and training, and requires calling FORWARD at each step, leaving the automatic batcher very little room to play with. However, with only a small change to the computation, we can run b different parsers "in parallel", and potentially share the computation across the different systems in a given time-step. Concretely, we use a modified version of the BIST parser [14].

From the results in Table 1, we can see that in all cases automatic batching gives healthy improvements in computation time: 3.6–9.2× on the CPU, and 2.7–8.6× on GPU. Furthermore, the agenda-based heuristic is generally more effective than the depth-based one.

4.3 Comparison to TensorFlow Fold

We compare against the TensorFlow Fold reference implementation of the Stanford Sentiment Treebank regression task [30], using the same TreeLSTM architecture [31]. Figure 3 shows how many trees are processed per second by TensorFlow Fold (excluding both evaluation of the dev set and static graph construction/optimization) on GPU and CPU relative to the performance of the BYAGENDA algorithm in DyNet (including graph construction time). The DyNet performance is better across the board when stratified by hardware type. 
Furthermore, DyNet has greater throughput on CPU than TensorFlow Fold on GPU until batch sizes exceed 64. Additionally, we find that with single-instance training, DyNet's sequential evaluation processes 46.7 trees/second on CPU, whereas autobatching processes 93.6 trees/second. This demonstrates that in complex architectures like TreeLSTMs, there are opportunities to batch operations even inside a single training instance, and these are exploited by our batching algorithm. In addition, it should be noted that the DyNet implementation has the advantage of being much more straightforward, relying on simple Python data structures and flow control to represent and traverse the trees, whereas the Fold implementation requires implementing the traversal and composition logic in a domain-specific functional programming language (described in Section 3 of Looks et al. [19]).

Figure 3: Comparison of runtime performance between TensorFlow Fold and DyNet with autobatching on TreeLSTMs (trees/sec).

5 Related Work

Optimization of static algorithms is widely studied, and plays an important role in the numerical libraries used in machine learning. Our work is rather different, since the code/workload (as represented by the computation graph) is dynamically specified and must be executed rapidly, which precludes sophisticated static analysis. However, we review some of the important related work here.

Automatic graph optimization and selection of kernels for static computation graphs is used in a variety of toolkits, including TensorFlow [1] and Theano [6]. Dynamic creation of optimally sized minibatches (similar to our strategy, except that the computation graph is assumed to be static) that make good use of hardware resources has also been proposed for optimizing convolutional architectures [11].
The static nature of the computation makes these tools closer to optimizing compilers than to the efficient interpreters required to cope with the dynamic workloads encountered when dealing with dynamically structured computations.

Related to this is the general technique of automatic vectorization, which is a mainstay of optimizing compilers. Recent work has begun to explore vectorization in the context of interpreted code which cannot be compiled [26]. Our autobatching variant of DyNet similarly provides vectorized primitives that can be selected dynamically.

Further afield, the problem of scheduling with batching decisions has been widely studied in operations research since at least the 1950s (for a recent survey, see [24]). Although the OR work deals with similar problems (e.g., scheduling work on machines that can process a "family" of related items with minimal marginal cost over a single item), the standard algorithms from this field (which are often based on polynomial-time dynamic programs or approximations to NP-hard search problems) are too computationally demanding to execute in the inner loop of a learning algorithm.

6 Conclusion

Deep learning research relies on empirical exploration of architectures. The rapid pace of innovation we have seen in the last several years has been enabled largely by tools that have automated the error-prone aspects of engineering, such as writing code that computes gradients. However, our contention is that operation batching is increasingly becoming another aspect of model coding that is error prone and amenable to automation.

Our solution is a framework that lets programmers express computations naturally and relies on a smart yet lightweight interpreter to figure out how to execute the operations efficiently.
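The interpreter's job can be illustrated with a minimal, self-contained sketch in plain Python. The names below (Node, affine, evaluate) are illustrative inventions, not DyNet's API; the sketch uses scalars and a depth-based signature in the spirit of the BYDEPTH heuristic, whereas real signatures also account for tensor dimensions and each group runs as a single batched kernel call rather than a Python loop:

```python
# Toy sketch of on-the-fly operation batching (illustrative only, not DyNet's API).
from collections import defaultdict

class Node:
    """A lazily recorded operation; nothing runs until evaluate() is called."""
    def __init__(self, op, inputs, value=None):
        self.op, self.inputs, self.value = op, inputs, value

def const(v):
    return Node("const", [], value=v)

def affine(w, x, b):
    """Single-instance computation w*x + b (scalars for brevity), recorded lazily."""
    return Node("affine", [w, x, b])

def evaluate(outputs):
    """Depth-based batching: unevaluated nodes sharing a (depth, op) signature
    are grouped and executed together as one simulated 'batched' kernel call."""
    depth = {}
    def d(n):
        if n not in depth:
            depth[n] = 0 if not n.inputs else 1 + max(d(i) for i in n.inputs)
        return depth[n]
    for out in outputs:
        d(out)
    groups = defaultdict(list)           # signature -> nodes to run together
    for n in depth:
        if n.value is None:
            groups[(depth[n], n.op)].append(n)
    batch_calls = 0
    for (_, op), nodes in sorted(groups.items()):
        batch_calls += 1                 # one kernel launch per signature group
        if op == "affine":
            for n in nodes:              # stand-in for a single batched GEMM
                w, x, b = (i.value for i in n.inputs)
                n.value = w * x + b
    return batch_calls

# Three independent instances, written as plain single-instance code.
w, b = const(2.0), const(1.0)
ys = [affine(w, const(x), b) for x in (1.0, 2.0, 3.0)]
calls = evaluate(ys)   # the three affines share one signature -> one batched call
```

The point of the sketch is the division of labor: the user-facing code on the last three lines never mentions batches, and the grouping decision is deferred to evaluation time, when the full set of pending operations is visible.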
Our hope is that this will facilitate the creation of new classes of models that better cope with the complexities of real-world data.

Acknowledgements: The work of YG is supported by the Israeli Science Foundation (grant number 1555/15) and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. Training with exploration improves a greedy stack LSTM parser. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2005–2010, November 2016.

[3] Michael Bartholomew-Biggs, Steven Brown, Bruce Christianson, and Laurence Dixon. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124:171–190, 2000.

[4] Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems (NIPS), 2016.

[5] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099, 2015.

[6] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.

[7] Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding.
In Annual Conference of the Association for Computational Linguistics (ACL), pages 1466–1477, 2016.

[8] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Annual Conference of the Association for Computational Linguistics (ACL), pages 334–343, 2015.

[9] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 199–209, 2016.

[10] Yoav Goldberg and Joakim Nivre. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414, 2013.

[11] Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher Ré. Caffe con troll: Shallow ideas to speed up deep learning. In Proceedings of the Fourth Workshop on Data analytics at sCale (DanaC 2015), 2015.

[12] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.

[13] Eliyahu Kiperwasser and Yoav Goldberg. Easy-first dependency parsing with hierarchical tree LSTMs. Transactions of the Association for Computational Linguistics, 4:445–461, 2016.

[14] Eliyahu Kiperwasser and Yoav Goldberg. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327, 2016.

[15] Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Matthias, Ariya Rastrow, and Björn Hoffmeister. LatticeRNN: Recurrent neural networks over lattices. In Proc. INTERSPEECH, 2016.

[16] Chengtao Li, Daniel Tarlow, Alexander L. Gaunt, Marc Brockschmidt, and Nate Kushman. Neural program lattices.
In International Conference on Learning Representations (ICLR), 2017.

[17] Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph LSTM. In Proc. ECCV, 2016.

[18] Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. Finding function in form: Compositional character models for open vocabulary word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1520–1530, 2015.

[19] Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. In International Conference on Learning Representations (ICLR), 2017.

[20] Gilles Louppe, Kyunghyun Cho, Cyril Becot, and Kyle Cranmer. QCD-aware recursive neural networks for jet physics. arXiv:1702.00748, 2017.

[21] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017.

[22] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175, 2012.

[23] Barbara Plank, Anders Søgaard, and Yoav Goldberg. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Annual Conference of the Association for Computational Linguistics (ACL), pages 412–418, 2016.

[24] Chris N. Potts and Mikhail Y. Kovalyov. Scheduling with batching: A review.
European Journal of Operational Research, 120(2):228–249, 2000.

[25] Scott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR), 2016.

[26] Erven Rohou, Kevin Williams, and David Yuste. Vectorization technology to improve interpreter performance. ACM Transactions on Architecture and Code Optimization, 9(4), 2013.

[27] Franco Scarselli, Marco Gori, Ah Chung Tsoi, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[28] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.

[29] Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), pages 129–136, 2011.

[30] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

[31] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Annual Conference of the Association for Computational Linguistics (ACL), 2015.

[32] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.

[33] Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling.
Learning to compose words into sentences with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.