{"title": "Neural Arithmetic Logic Units", "book": "Advances in Neural Information Processing Systems", "page_first": 8035, "page_last": 8044, "abstract": "Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.", "full_text": "Neural Arithmetic Logic Units\n\nAndrew Trask\u2020\u2021\n\nFelix Hill\u2020\n\nChris Dyer\u2020\n\nScott Reed\u2020\nPhil Blunsom\u2020\u2021\n\nJack Rae\u2020\u266d\n\n\u2020DeepMind\n\n\u2021University of Oxford\n\n\u266dUniversity College London\n\n{atrask,felixhill,reedscot,jwrae,cdyer,pblunsom}@google.com\n\nAbstract\n\nNeural networks can learn to represent and manipulate numerical information, but\nthey seldom generalize well outside of the range of numerical values encountered\nduring training. To encourage more systematic numerical extrapolation, we propose\nan architecture that represents numerical quantities as linear activations which are\nmanipulated using primitive arithmetic operators, controlled by learned gates. 
We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.

1 Introduction

The ability to represent and manipulate numerical quantities is apparent in the behavior of many species, from insects to mammals to humans, suggesting that basic quantitative reasoning is a general component of intelligence [5, 7].

While neural networks can successfully represent and manipulate numerical quantities given an appropriate learning signal, the behavior that they learn does not generally exhibit systematic generalization [6, 20]. Specifically, one frequently observes failures when quantities that lie outside the numerical range used during training are encountered at test time, even when the target function is simple (e.g., it depends only on aggregating counts or linear extrapolation). This failure pattern indicates that the learned behavior is better characterized by memorization than by systematic abstraction. Whether the input distribution shifts that trigger extrapolation failures are of practical concern depends on the environments where the trained models will operate. However, considerable evidence exists showing that animals as simple as bees demonstrate systematic numerical extrapolation [7], suggesting that systematicity in reasoning about numerical quantities is ecologically advantageous.

In this paper, we develop a new module that can be used in conjunction with standard neural network architectures (e.g., LSTMs or convnets) but which is biased to learn systematic numerical computation. Our strategy is to represent numerical quantities as individual neurons without a nonlinearity. To these single-value neurons, we apply operators that are capable of representing simple functions (e.g., +, −, ×, etc.). These operators are controlled by parameters which determine the inputs and operations used to create each output. However, despite this combinatorial character, they are differentiable, making it possible to learn them with backpropagation [24].

We experiment across a variety of task domains (synthetic, image, text, and code), learning signals (supervised and reinforcement learning), and structures (feed-forward and recurrent). We find that our proposed model can learn functions over representations that capture the underlying numerical nature of the data and generalize to numbers that are several orders of magnitude larger than those observed during training. We also observe that our module exhibits a superior numeracy bias relative to linear layers, even when no extrapolation is required. In one case, our model reduces the error of a state-of-the-art image counting network by 54%. Notably, the only modification we made over the previous state of the art was the replacement of its last linear layer with our model.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Numerical Extrapolation Failures in Neural Networks

To illustrate the failure of systematicity in standard networks, we show the behavior of various MLPs trained to learn the scalar identity function, which is the most straightforward systematic relationship possible.
The notion that neural networks struggle to learn identity relations is not new [14]. We show this because, even though many of the architectures evaluated below could theoretically represent the identity function, they typically fail to acquire it.

In Figure 1, we show the nature of this failure (experimental details and more detailed results in Appendix A). We train an autoencoder to take a scalar value as input (e.g., the number 3), encode the value within its hidden layers (distributed representations), then reconstruct the input value as a linear combination of the last hidden layer (3 again). Each autoencoder we train is identical in its parameterization (3 hidden layers of size 8), tuning (10,000 iterations, learning rate of 0.01, squared error loss), and initialization, differing only in the choice of nonlinearity on the hidden layers. For each point in Figure 1, we train 100 models to encode numbers between −5 and 5 and average their ability to encode numbers between −20 and 20.

We see that even over a basic task using a simple architecture, all nonlinear functions fail to learn to represent numbers outside of the range seen during training. The severity of this failure directly corresponds to the degree of non-linearity within the chosen activation function. Some activations learn to be highly linear (such as PReLU), which reduces error somewhat, but sharply non-linear functions such as sigmoid and tanh fail consistently. Thus, despite the fact that neural networks are capable of representing functions that extrapolate, in practice we find that they fail to learn to do so.

Figure 1: MLPs learn the identity function only for the range of values they are trained on.
The mean error ramps up severely both below and above the range of numbers seen during training.

2 The Neural Accumulator & Neural Arithmetic Logic Unit

Here we propose two models that are able to learn to represent and manipulate numbers in a systematic way. The first supports the ability to accumulate quantities additively, a desirable inductive bias for linear extrapolation. This model forms the basis for a second model, which supports multiplicative extrapolation. The second model also illustrates how an inductive bias for arbitrary arithmetic functions can be effectively incorporated into an end-to-end model.

Our first model is the neural accumulator (NAC), which is a special case of a linear (affine) layer whose transformation matrix W consists just of −1's, 0's, and 1's; that is, its outputs are additions or subtractions (rather than arbitrary rescalings) of rows in the input vector. This prevents the layer from changing the scale of the representations of the numbers when mapping the input to the output, meaning that they are consistent throughout the model, no matter how many operations are chained together. We improve the inductive bias of a simple linear layer by encouraging 0's, 1's, and −1's within W in the following way.

Since a hard constraint enforcing that every element of W be one of {−1, 0, 1} would make learning hard, we propose a continuous and differentiable parameterization of W in terms of unconstrained parameters: W = tanh(Ŵ) ⊙ σ(M̂) (where σ corresponds to the sigmoid function). This form is convenient for learning with gradient descent and produces matrices whose elements are guaranteed

(a) Neural Accumulator (NAC)  (b) Neural Arithmetic Logic Unit (NALU)

Figure 2: The Neural Accumulator (NAC) is a linear transformation of its inputs.
The transformation matrix is the elementwise product of tanh(Ŵ) and σ(M̂). The Neural Arithmetic Logic Unit (NALU) uses two NACs with tied weights to enable addition/subtraction (smaller purple cell) and multiplication/division (larger purple cell), controlled by a gate (orange cell).

to be in [−1, 1] and biased to be close to −1, 0, or 1.¹ The model contains no bias vector, and no squashing nonlinearity is applied to the output.

While addition and subtraction enable many useful systematic generalizations, a similarly robust ability to learn more complex mathematical functions, such as multiplication, may be desirable. Figure 2 describes such a cell, the neural arithmetic logic unit (NALU), which learns a weighted sum between two subcells, one capable of addition and subtraction and the other capable of multiplication, division, and power functions such as √x. Importantly, the NALU demonstrates how the NAC can be extended with gate-controlled sub-operations, facilitating end-to-end learning of new classes of numerical functions. As with the NAC, there is the same bias against learning to rescale during the mapping from input to output.

The NALU consists of two NAC cells (the purple cells) interpolated by a learned sigmoidal gate g (the orange cell), such that if the add/subtract subcell's output is applied with a weight of 1 (on), the multiply/divide subcell's is 0 (off), and vice versa. The first NAC (the smaller purple subcell) computes the accumulation vector a, which stores the results of the NALU's addition/subtraction operations; it is computed identically to the original NAC (i.e., a = Wx).
The second NAC (the larger purple subcell) operates in log space and is therefore capable of learning to multiply and divide, storing its results in m:

NAC:   a = Wx,                      W = tanh(Ŵ) ⊙ σ(M̂)
NALU:  y = g ⊙ a + (1 − g) ⊙ m,     m = exp(W log(|x| + ε)),   g = σ(Gx)

where ε prevents log 0. Altogether, this cell can learn arithmetic functions consisting of multiplication, addition, subtraction, division, and power functions in a way that extrapolates to numbers outside of the range observed during training.

3 Related Work

Numerical reasoning is central to many problems in intelligence and by extension is an important topic in deep learning [5]. A widely studied task is counting objects in images [2, 4, 25, 31, 33]. These models generally take one of two approaches: 1) using a deep neural network to segment individual instances of a particular object and explicitly counting them in a post-processing step, or 2) learning end-to-end to predict object counts via a regression loss. Our work is more closely related to the second strategy.

Other work more explicitly attempts to model numerical representations and arithmetic functions within the context of learning to execute small snippets of code [32, 23]. Learning to count within a bounded range has also been included in various question-answering tasks, notably the bAbI tasks [29], and many models successfully learn to do so [1, 18, 12]. However, to our knowledge, no tasks of this kind explicitly require counting beyond the range observed during training.

¹ The stable points {−1, 0, 1} correspond to the saturation points of either σ or tanh.

One can also view our work as advocating a new context for linear activations within deep neural networks.
This is related to recent architectural innovations such as ResNets [14], Highway Networks [26], and DenseNet [15], which also advocate for linear connections to reduce exploding/vanishing gradients and promote a better learning bias. Such connections improved performance, albeit with additional computational overhead due to the increased depth of the resulting architectures.

Our work is also in line with a broader theme in machine learning which seeks to identify, in the form of behavior-governing equations, the underlying structure of systems that extrapolate well to unseen parts of the space [3]. This is a strong trend in recent neural network literature concerning the systematic representation of concepts within recurrent memory, allowing for functions over these concepts to extrapolate to sequences longer than observed during training. The question of whether and how recurrent networks generalize to sequences longer than they encountered in training has been of enduring interest, especially since well-formed sentences in human languages are apparently unbounded in length, but are learned from a limited sample [9, 19, 28]. Recent work has also focused on augmenting LSTMs with systematic external memory modules, allowing them to generalize operations such as sorting [30, 11, 13], again with special interest in generalization to sequences longer than observed during training through systematic abstraction.

Finally, the cognitive and neural bases of numerical reasoning in humans and animals have been intensively studied; for a popular overview see Dehaene [5]. Our models are reminiscent of theories which posit that magnitudes are represented as continuous quantities manipulated by accumulation operations [?], and, in particular, our single-neuron representation of number recalls Gelman and Gallistel's posited "numerons"—individual neurons that represent numbers [8].
However, across many species, continuous quantities appear to be represented using an approximate representation where acuity decreases with magnitude [22], quite different from our model's constant precision.

4 Experiments

The experiments in this paper test numeric reasoning and extrapolation in a variety of settings. We study the explicit learning of simple arithmetic functions directly from numerical input, and indirectly from image data. We consider temporal domains: the translation of text to integer values, and the evaluation of computer programs containing conditional logic and arithmetic. These supervised tasks are supplemented with a reinforcement learning task which implicitly involves counting to keep track of time. We conclude with the previously studied MNIST parity task, where we obtain state-of-the-art prediction accuracy and provide an ablation study to understand which components of the NALU provide the most benefit.

4.1 Simple Function Learning Tasks

In these initial synthetic experiments, we demonstrate the ability of NACs and NALUs to learn to select relevant inputs and apply different arithmetic functions to them, which are the key functions they are designed to solve (below we will use these as components in more complex architectures). We have two task variants: one where the inputs are presented all at once as a single vector (the static tasks) and a second where inputs are presented sequentially over time (the recurrent tasks). Inputs are randomly generated, and for the target, two values (a and b) are computed as a sum over regular parts of the input. An operation (e.g., a × b) over a and b then provides the training (or evaluation) target. The model is trained end-to-end by minimizing the squared loss, and evaluation looks at performance of the model on held-out values from within the training range (interpolation) or on values from outside of the training range (extrapolation).
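To make the cells under test concrete, here is a minimal NumPy sketch of a single NALU forward pass as defined in Section 2. The parameters are hand-set (not learned) to implement two-input multiplication; the layer sizes and the specific values are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def sigmoid(z):
    # clip for numerical stability with large-magnitude gate pre-activations
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def nalu(x, W_hat, M_hat, G, eps=1e-7):
    """One NALU layer: y = g * a + (1 - g) * m (Section 2)."""
    W = np.tanh(W_hat) * sigmoid(M_hat)      # entries biased toward {-1, 0, 1}
    a = x @ W                                # additive path (the NAC)
    m = np.exp(np.log(np.abs(x) + eps) @ W)  # multiplicative path, in log space
    g = sigmoid(x @ G)                       # gate interpolating the two paths
    return g * a + (1 - g) * m

# Hand-set parameters: W ~ [[1], [1]] with the gate driven to 0 select the
# multiplicative path, so y = exp(log x1 + log x2) = x1 * x2.
W_hat = np.full((2, 1), 20.0)   # tanh(20) ~ 1
M_hat = np.full((2, 1), 20.0)   # sigmoid(20) ~ 1
G = np.full((2, 1), -2.0)       # gate ~ 0 for these positive inputs

x = np.array([[3.0, 7.0],        # inside a hypothetical training range
              [120.0, 45.0]])    # far outside it
y = nalu(x, W_hat, M_hat, G)     # y ~ [[21.], [5400.]]
```

Because the weights saturate at exactly −1, 0, or 1, the computation is identical at 3 × 7 and at 120 × 45; this is the mechanism behind the extrapolation results that follow.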
Experimental details and more detailed results are in Appendix B.

On the static task, for baseline models we compare the NAC and NALU to MLPs with a variety of standard nonlinearities as well as a linear model. We report the baseline with the best median held-out performance, which is the Relu6 activation [17]. Results using additional nonlinearities are also in Appendix B. For the recurrent task, we report the performance of an LSTM and the best-performing RNN variant from among several common architectures, an RNN with ReLU activations (additional recurrent baselines are also in Appendix B).

                         Static Task (test)             Recurrent Task (test)
                     Relu6   None    NAC   NALU      LSTM    ReLU     NAC   NALU
Interpolation
  a + b                0.2    0.0    0.0    0.0       0.0     0.0     0.0    0.0
  a − b                0.0    0.0    0.0    0.0       0.0     0.0     0.0    0.0
  a × b                3.2   20.9   21.4    0.0       0.0     0.0     1.5    0.0
  a / b                4.2   35.0   37.1    5.3       0.0     0.0     1.2    0.0
  a²                   0.7    4.3   22.4    0.0       0.0     0.0     2.3    0.0
  √a                   0.5    2.2    3.6    0.0       0.0     0.0     2.1    0.0
Extrapolation
  a + b               42.6    0.0    0.0    0.0      96.1    85.5     0.0    0.0
  a − b               29.0    0.0    0.0    0.0      97.0    70.9     0.0    0.0
  a × b               10.1   29.5   33.3    0.0      98.2    97.9    88.4    0.0
  a / b               37.2   52.3   61.3    0.7      95.6   863.5    >999   >999
  a²                  47.0   25.1   53.3    0.0      98.0    98.0   123.7    0.0
  √a                  10.3   20.0   16.4    0.0      95.8    34.1    >999    0.0

Table 1: Interpolation and extrapolation error rates for static and recurrent tasks. Scores are scaled relative to a randomly initialized model for each task such that 100.0 is equivalent to random, 0.0 is perfect accuracy, and >100 is worse than a randomly initialized model. Raw scores in Appendix B.

Table 1 summarizes the results and shows that while several standard architectures succeed at these tasks in the interpolation case, none of them succeed at extrapolation.
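The normalization behind the scores in Table 1 can be stated directly; `model_mse` and `random_mse` here are hypothetical names for the held-out squared error of the trained model and of a randomly initialized model of the same architecture:

```python
def scaled_score(model_mse, random_mse):
    """Scale held-out error so that 100.0 matches a randomly initialized
    model, 0.0 is a perfect fit, and >100 is worse than random (Table 1)."""
    return 100.0 * model_mse / random_mse
```

Under this scaling, a model that merely matches the random baseline scores 100.0 regardless of the task's raw error magnitude, which makes the interpolation and extrapolation columns comparable across operations.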
However, in both interpolation and extrapolation, the NAC succeeds at modeling addition and subtraction, whereas the more flexible NALU succeeds at multiplicative operations as well (except for division in the recurrent task²).

4.2 MNIST Counting and Arithmetic Tasks

In the previous synthetic task, both inputs and outputs were provided in a generalization-ready representation (as floating point numbers), and only the internal operations and representations had to be learned in a way that generalized. In this experiment, we test whether backpropagation can learn a representation of non-numeric inputs to NACs/NALUs.

In these tasks, a recurrent network is fed a series of 10 randomly chosen MNIST digits, and at the end of the series it must output a numerical value about the series it observed.³ In the MNIST Digit Counting task, the model must learn to count how many images of each type it has seen (a 10-way regression), and in the MNIST Digit Addition task, it must learn to compute the sum of the digits it observed (a linear regression). Each training series is formed using images from the MNIST digit training set, and each testing series from the MNIST test set. Evaluation occurs over held-out sequences of length 10 (interpolation) and two extrapolation lengths: 100 and 1000. Although no direct supervision of the convnet is provided, we estimate how well it has learned to distinguish digits by passing in test sequences of length 1 (also from the MNIST test dataset) and estimating the accuracy based on the count/sum. Parameters are initialized randomly and trained by backpropagating the mean squared error against the target count vector or the sum.

Table 2 shows the results for both tasks. As we saw before, standard architectures succeed on held-out sequences of the interpolation length, but they completely fail at extrapolation.
Notably, the RNN-tanh and RNN-ReLU models also fail to generalize to sequences shorter than those seen during training. However, the NAC and NALU both extrapolate and interpolate well.

                       MNIST Digit Counting Test           MNIST Digit Addition Test
              Classification  Mean Absolute Error   Classification  Mean Absolute Error
Seq Len              1         10    100    1000           1         10     100     1000
LSTM             98.29%      0.79   18.2   198.5         0.0%      14.8   800.8   8811.6
GRU              99.02%      0.73   18.0   198.3         0.0%      1.75   771.4   8775.2
RNN-tanh         38.91%      1.49   18.4   198.7         0.0%      2.98    20.4    200.7
RNN-ReLU          9.80%      0.66   39.8  1.4e10        88.18%     19.1   182.1   1171.0
NAC              99.23%      0.12   0.76    3.32         97.6%     1.42    7.88     57.3
NALU              97.6%      0.17   0.93    4.18         77.7%     5.11    26.8    248.9

Table 2: Accuracy of the MNIST Counting & Addition tasks for series of length 1, 10, 100, and 1000.

² Division is much more challenging to extrapolate. While our models constrain their weights using nonlinearities, they can still represent numbers that are very small. Division allows such small numbers to appear in the denominator, greatly amplifying even small drifts in extrapolation ability.

³ The input to the recurrent networks is the output of the convnet in https://github.com/pytorch/examples/tree/master/mnist.

4.3 Language to Number Translation Tasks

Neural networks have also been quite successful in working with natural language inputs, and LSTM-based models are state-of-the-art in many tasks [10, 27, 16]. However, much like other numerical input, it is not clear whether representations of number words are learned in a systematic way. To test this, we created a new translation task which translates a text number expression (e.g., five hundred and fifteen) into a scalar representation (515).

We trained and tested using numbers from 0 to 1000.
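The mapping being learned can be generated programmatically. This English number-speller for 0–1000 is illustrative only (the paper's exact dataset construction and tokenization are described in its Appendix); it reproduces the phrasing used in the task's examples:

```python
# Spell out integers the way the translation task's examples do,
# e.g. 515 -> "five hundred and fifteen", 27 -> "twenty seven".
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0..1000 without hyphens, matching the task's token style."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rem = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rem] if rem else "")
    if n < 1000:
        hundreds, rem = divmod(n, 100)
        head = ONES[hundreds] + " hundred"
        return head + (" and " + number_to_words(rem) if rem else "")
    return "one thousand"

# (text expression, scalar target) pairs over the full interval
pairs = [(number_to_words(n), float(n)) for n in range(1001)]
```

Each expression is then tokenized into its words before being fed to the embedding layer described next.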
The training set consists of the numbers 0–19 in addition to a random sample from the rest of the interval, adjusted to make sure that each unique token is present at least once in the training set. There are 169 examples in the training set, 200 in validation, and 631 in the test set. All networks trained on this dataset start with a token embedding layer, followed by encoding through an LSTM, and then a linear layer, NAC, or NALU.

Model           Train MAE   Validation MAE   Test MAE
LSTM                0.003             29.9       29.4
LSTM + NAC           80.0            114.1      114.3
LSTM + NALU          0.12             0.39       0.41

Table 3: Mean absolute error (MAE) comparison on translating number strings to scalars. LSTM + NAC/NALU means a single LSTM layer followed by a NAC or NALU, respectively.

We observed that both baseline LSTM variants overfit severely to the 169 training set numbers and generalize poorly. The LSTM + NAC performs poorly on both training and test sets. The LSTM + NALU achieves the best generalization performance by a wide margin, suggesting that the multiplier is important for this task.

We show in Figure 3 the intermediate states of the NALU on randomly selected test examples. Without supervision, the model learns to track sensible estimates of the unknown number up to the current token. This allows the network to handle tokens it has never seen before in isolation, such as eighty, since it saw eighty one, eighty four, and eighty seven during training.⁴ The and token can be exploited to form addition expressions (see the last example), even though these were not seen in training.

Figure 3: Intermediate NALU predictions on previously unseen queries.
  "three hundred and thirty four"  →  3.05, 299.9, 301.3, 330.1, 334
  "seven hundred and two"          →  6.98, 699.9, 701.3, 702.2
  "eighty eight"                   →  79.6, 88
  "twenty seven and eighty"        →  18.2, 27.0, 29.1, 106.1

⁴ Note the slight accumulation for the word and, owing to the spurious correlation between the use of the word and a subsequent increase in the target value.

4.4 Program Evaluation

Evaluating a program requires the control of several logical and arithmetic operations and internal book-keeping of intermediate values. We consider the two program evaluation tasks defined in [32]. The first consists of simply adding two large integers, and the latter involves evaluating programs containing several operations (if statements, +, −). We focus on extrapolation: can the network learn a solution that generalizes to larger value ranges? We investigate this by training with two-digit input integers pulled uniformly from [0, 100) and evaluating on random integers with three and four digits.

Following the setup of [32], we report the percentage of matching digits between the rounded prediction from the model and the target integer; however, we handle numeric input differently. Instead of passing the integers character by character, we pass the full integer value at a single time step and regress the output with an RMSE loss. Our model setup consists of a NALU that is "configured by" an LSTM; that is, its parameters Ŵ, M̂, and Ĝ are learned functions of the LSTM output h_t at each timestep. Thus, the LSTM learns to control the NALU, dependent upon the operations seen.

(a) Train (2 digits)  (b) Validation (3 digits)  (c) Test (4 digits)

Figure 4: Simple program evaluation with extrapolation to larger values.
All models are averaged over 10 independent runs; 2σ confidence bands are displayed.

We compare to three popular RNNs (UGRNN, LSTM, and DNC) and observe in both addition (Supplementary Figure 6) and program evaluation (Figure 4) that all models are able to solve the task on a fixed input domain; however, only the NALU is able to extrapolate to larger numbers. In this case we see that extrapolation is stable even when the domain is increased by two orders of magnitude.

4.5 Learning to Track Time in a Grid-World Environment

In all experiments thus far, our models have been trained to make numeric predictions. However, as discussed in the introduction, systematic numeric computation appears to underlie a diverse range of (natural) intelligent behaviors. In this task, we test whether a NAC can be used "internally" by an RL-trained agent to develop more systematic generalization to quantitative changes in its environment. We developed a simple grid-world environment task in which an agent is given a time (specified as a real value) and receives a reward if it arrives at a particular location at (and not before) that time. As illustrated in Figure 5, each episode in this task begins (t = 0) with the agent and a single target red square randomly positioned in a 5 × 5 grid-world. At each timestep, the agent receives as input a 56 × 56 pixel representation of the state of the (entire) world, and must select a single discrete action from {UP, DOWN, LEFT, RIGHT, PASS}.
At the start of the episode, the agent also receives a numeric instruction T, which communicates the exact time the agent must arrive at its destination.

Figure 5: (above) Frames from the gridworld time tracking task (t = 0, t = 12, t = 13; command: 13). The agent (gray) must move to the destination (red) at a specified time. (below) Average episode reward as a function of the magnitude of the command stimulus: the NAC improves the extrapolation ability learned by A3C agents for the dating task.

To achieve the maximum episode reward m, the agent must select actions and move around so as to first step onto the red square precisely when t = T. Training episodes end either when the agent reaches the red square or after timing out (t = L). We first trained a conventional A3C agent [21] with a recurrent (LSTM) core memory, modified so that the instruction T was communicated to the agent via an additional input unit concatenated to the output of the agent's convnet visual module before being passed to the agent's LSTM core memory. We also trained a second variant of the same architecture where the instruction was passed both directly to the LSTM memory and through a NAC and back into the LSTM. Both agents were trained on episodes where T ∼ U{5, 12} (eight being the lowest value of T such that reaching the target destination when t = T is always possible). Both agents quickly learned to master the training episodes.
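The episode's timing logic can be sketched as follows. Only the maximum-reward condition (arriving exactly at t = T) and the termination rules are taken from the task description; the graded penalty for mistimed arrivals is an assumption introduced for illustration, since the exact reward shaping is not specified in this excerpt.

```python
# Sketch of one episode's outcome in the time-tracking grid-world.
# ASSUMPTION: reward decays linearly with the timing miss |arrival - T|;
# only r = m at arrival_time == T is stated in the task description.
def episode_reward(arrival_time, T, m=1.0, timeout=50):
    """arrival_time: first step at which the agent enters the target square,
    or None if it never does before the timeout (t = L in the text)."""
    if arrival_time is None or arrival_time > timeout:
        return 0.0      # never reached the target square before timing out
    if arrival_time == T:
        return m        # stepped onto the target exactly on time
    return max(0.0, m - abs(arrival_time - T) / T)  # assumed graded penalty
```

Under this sketch, a baseline agent that always arrives at t = 12 earns m only when the command is 12 and progressively less for larger commands, which qualitatively matches the failure mode described below.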
However, as shown in Figure 5, the agent with the NAC performed well on the task for T ≤ 19, whereas performance of the standard A3C agent deteriorated for T > 13.

It is instructive to also consider why both agents eventually fail. As would be predicted by consideration of the extrapolation error observed in previous models, for stimuli greater than 12 the baseline agent behaves as if the stimulus were still 12, arriving at the destination at t = 12 (too early) and thus receiving incrementally less reward with larger stimuli. In contrast, for stimuli greater than 20, the agent with the NAC never arrives at the destination. Note that in order to develop an agent that could plausibly follow both numerical and non-numeric (e.g., linguistic or iconic) instructions, the instruction stimulus was passed both directly to the agent's core LSTM and first through the NAC. We hypothesize that the more limited extrapolation (in terms of orders of magnitude) of the NAC here, relative to its other uses in this paper, was caused by the model still using the LSTM to encode numeracy to some degree.

4.6 MNIST Parity Prediction Task & Ablation Study

Thus far, we have emphasized the extrapolation successes; however, our results indicate that the NAC layer often performs extremely well at interpolation. In our final task, the MNIST parity task [25], we look explicitly at interpolation. Also, in this task, neither the input nor the output is directly provided as a number, but it implicitly involves reasoning about numeric quantities. In these experiments, the NAC or its variants replace the last linear layer in the model proposed by Seguí et al. [25], where it connects the output of the convnet to the prediction softmax layer. Since the original model had an affine layer here, and a NAC is a constrained affine layer, we look systematically at the importance of each constraint.
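The ablated output layers compared in Table 4 differ only in how the final affine map is parameterized. As a sketch (h stands for the convnet features; the function names are illustrative, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Table 4's ablation: start from an affine layer W_hat @ h + b, then drop
# the bias and/or squash the weights, ending at the full NAC.
def affine(h, W_hat, b):                 # baseline: W_hat x + b
    return W_hat @ h + b

def sigmoid_weights(h, W_hat):           # sigma(W_hat) x, no bias
    return sigmoid(W_hat) @ h

def tanh_weights(h, W_hat):              # tanh(W_hat) x, no bias
    return np.tanh(W_hat) @ h

def nac(h, W_hat, M_hat):                # NAC: (tanh(W_hat) * sigma(M_hat)) x
    return (np.tanh(W_hat) * sigmoid(M_hat)) @ h
```

Only the NAC variant biases every weight toward {−1, 0, 1}; the intermediate variants isolate the contributions of dropping the bias and of each squashing function.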
Table 4 summarizes the performance of the variant models. As we see, removing the bias and applying nonlinearities to the weights significantly increases the accuracy of the end-to-end model, even though the majority of the parameters are not in the NAC itself. The NAC reduces the error of the previous best results by 54%.

Layer Configuration                        Test Acc.
Seguí et al. [25]: Ŵx + b                     85.1
Ours: Ŵx + b                                  88.1
σ(Ŵ)x + b                                     60.0
tanh(Ŵ)x + b                                  87.6
Ŵx + 0                                        91.4
σ(Ŵ)x + 0                                     62.5
tanh(Ŵ)x + 0                                  88.7
NAC: (tanh(Ŵ) ⊙ σ(M̂))x + 0                    93.1

Table 4: An ablation study between an affine layer and a NAC on the MNIST parity task.

5 Conclusions

Current approaches to modeling numeracy in neural networks fall short because numerical representations fail to generalize outside of the range observed during training. We have shown how the NAC and NALU can be applied to rectify these shortcomings across a wide variety of domains, facilitating both numerical representations and functions on numerical representations that generalize outside of the range observed during training. However, it is unlikely that the NAC or NALU will be the perfect solution for every task. Rather, they exemplify a general design strategy for creating models that have biases intended for a target class of functions. This design strategy is enabled by the single-neuron number representation we propose, which allows arbitrary (differentiable) numerical functions to be added to the module and controlled via learned gates, as the NALU exemplifies with addition/subtraction versus multiplication/division.

6 Acknowledgements

We thank Ed Grefenstette for suggesting the name Neural ALU and for valuable discussion involving numerical tasks. We also thank Steven Clark, whose presentation on neural counting work directly inspired this one.
Finally, we also thank Karl Moritz Hermann, John Hale, Richard Evans, David Saxton, and Angeliki Lazaridou for valuable comments and discussion.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[2] Carlos Arteta, Victor Lempitsky, J. Alison Noble, and Andrew Zisserman. Interactive object counting. In European Conference on Computer Vision, pages 504–518, 2014.

[3] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

[4] Antoni B. Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proc. CVPR, pages 1–7. IEEE, 2008.

[5] Stanislas Dehaene. The Number Sense: How the Mind Creates Mathematics. Oxford University Press, 2011.

[6] Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1–2):3–71, 1988.

[7] C. Randy Gallistel. Finding numbers in the brain. Philosophical Transactions of the Royal Society B, 373, 2017.

[8] Rochel Gelman and C. Randy Gallistel. The Child's Understanding of Number. Harvard University Press, 1978.

[9] Felix A. Gers and Jürgen Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.

[10] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pages 6645–6649. IEEE, 2013.

[11] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014. URL http://arxiv.org/abs/1410.5401.

[12] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.

[13] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Proc. NIPS, 2015.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.

[15] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016. URL http://arxiv.org/abs/1608.06993.

[16] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[17] Alex Krizhevsky and Geoffrey Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.

[18] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In Proc. ICML, pages 1378–1387, 2016.

[19] Brendan Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proc. ICML, 2018.

[20] Gary F. Marcus. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press, 2003.

[21] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. ICML, pages 1928–1937, 2016.

[22] Manuela Piazza, Véronique Izard, Philippe Pinel, Denis Le Bihan, and Stanislas Dehaene. Tuning curves for approximate numerosity in the human intraparietal sulcus. Neuron, 44:547–555, 2004.

[23] Scott E. Reed and Nando de Freitas. Neural programmer-interpreters. In Proc. ICLR, 2016.

[24] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[25] Santi Seguí, Oriol Pujol, and Jordi Vitrià. Learning to count with deep object features. CoRR, abs/1505.08082, 2015. URL http://arxiv.org/abs/1505.08082.

[26] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

[27] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104–3112, 2014.

[28] Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proc. ACL, 2018.

[29] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015. URL http://arxiv.org/abs/1502.05698.

[30] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proc. ICLR, 2015.

[31] Weidi Xie, J. Alison Noble, and Andrew Zisserman. Microscopy cell counting and detection with fully convolutional regression networks. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 6(3):283–292, 2018.

[32] Wojciech Zaremba and Ilya Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014. URL http://arxiv.org/abs/1410.4615.

[33] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proc. CVPR, pages 833–841. IEEE, 2015.