{"title": "Deep Learning without Weight Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 976, "page_last": 984, "abstract": "Current algorithms for deep learning probably cannot run in the brain because\nthey rely on weight transport, where forward-path neurons transmit their synaptic\nweights to a feedback path, in a way that is likely impossible biologically. An algorithm called feedback alignment achieves deep learning without weight transport by using random feedback weights, but it performs poorly on hard visual-recognition tasks. Here we describe two mechanisms \u2014 a neural circuit called a weight mirror and a modification of an algorithm proposed by Kolen and Pollack in 1994 \u2014 both of which let the feedback path learn appropriate synaptic weights quickly and accurately even in large networks, without weight transport or complex wiring. Tested on the ImageNet visual-recognition task, these mechanisms outperform both feedback alignment and the newer sign-symmetry method, and nearly match backprop, the standard algorithm of deep learning, which uses weight transport.", "full_text": "Deep Learning without Weight Transport\n\nMohamed Akrout\n\nUniversity of Toronto, Triage\n\nCollin Wilson\n\nUniversity of Toronto\n\nPeter C. Humphreys\n\nDeepMind\n\nTimothy Lillicrap\n\nDeepMind, University College London\n\nDouglas Tweed\n\nUniversity of Toronto, York University\n\nAbstract\n\nCurrent algorithms for deep learning probably cannot run in the brain because\nthey rely on weight transport, where forward-path neurons transmit their synaptic\nweights to a feedback path, in a way that is likely impossible biologically. An algo-\nrithm called feedback alignment achieves deep learning without weight transport by\nusing random feedback weights, but it performs poorly on hard visual-recognition\ntasks. 
Here we describe two mechanisms — a neural circuit called a weight mirror and a modification of an algorithm proposed by Kolen and Pollack in 1994 — both of which let the feedback path learn appropriate synaptic weights quickly and accurately even in large networks, without weight transport or complex wiring. Tested on the ImageNet visual-recognition task, these mechanisms learn almost as well as backprop (the standard algorithm of deep learning, which uses weight transport) and they outperform feedback alignment and another, more-recent transport-free algorithm, the sign-symmetry method.

1 Introduction

The algorithms of deep learning were devised to run on computers, yet in many ways they seem suitable for brains as well; for instance, they use multilayer networks of processing units, each with many inputs and a single output, like networks of neurons. But current algorithms can't quite work in the brain because they rely on the error-backpropagation algorithm, or backprop, which uses weight transport: each unit multiplies its incoming signals by numbers called weights, and some units transmit their weights to other units. In the brain, it is the synapses that perform this weighting, but there is no known pathway by which they can transmit their weights to other neurons or to other synapses in the same neuron [1, 2].

Lillicrap et al. [3] offered a solution in the form of feedback alignment, a mechanism that lets deep networks learn without weight transport, and they reported good results on several tasks. But Bartunov et al. [4] and Moskovitz et al. [5] have found that feedback alignment does not scale to hard visual-recognition problems such as ImageNet [6].

Xiao et al. 
[7] achieved good performance on ImageNet using a sign-symmetry algorithm in which only the signs of the forward and feedback weights, not necessarily their values, must correspond, and they suggested a mechanism by which that correspondence might be set up during brain development. Krotov and Hopfield [8] and Guerguiev et al. [9] have explored other approaches to deep learning without weight transport, but so far only in smaller networks and tasks.

Here we propose two different approaches that learn ImageNet about as well as backprop does, with no need to initialize forward and feedback matrices so their signs agree. We describe a circuit called a weight mirror and a version of an algorithm proposed by Kolen and Pollack in 1994 [10], both of which let initially random feedback weights learn appropriate values without weight transport.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

There are of course other questions about the biological implications of deep-learning algorithms, some of which we touch on in Appendix C, but in this paper our main concern is with weight transport.

2 The weight-transport problem

In a typical deep-learning network, some signals flow along a forward path through multiple layers of processing units from the input layer to the output, while other signals flow back from the output layer along a feedback path. Forward-path signals perform inference (e.g. they try to infer what objects are depicted in a visual input) while the feedback path conveys error signals that guide learning. In the forward path, signals flow according to the equation

y_{l+1} = φ(W_{l+1} y_l + b_{l+1})    (1)

Here y_l is the output signal of layer l, i.e. a vector whose i-th element is the activity of unit i in layer l. 
Equation (1) shows how the next layer l+1 processes its input y_l: it multiplies y_l by the forward weight matrix W_{l+1}, adds a bias vector b_{l+1}, and puts the sum through an activation function φ. Interpreted as parts of a real neuronal network in the brain, the y's might be vectors of neuronal firing rates, or some function of those rates, W_{l+1} might be arrays of synaptic weights, and b_{l+1} and φ bias currents and nonlinearities in the neurons.

In the feedback path, error signals δ flow through the network from its output layer according to the error-backpropagation [11] or backprop equation:

δ_l = φ′(y_l) W^T_{l+1} δ_{l+1}    (2)

Here φ′ is the derivative of the activation function φ from equation (1), which can be computed from y_l. So feedback signals pass layer by layer through weights W^T_l. Interpreted as a structure in the brain, the feedback path might be another set of neurons, distinct from those in the forward path, or the same set of neurons might carry inference signals in one direction and errors in the other [12, 13]. Either way, we have the problem that the same weight matrix W_l appears in the forward equation (1) and then again, transposed, in the feedback equation (2), whereas in the brain, the synapses in the forward and feedback paths are physically distinct, with no known way to coordinate themselves so one set is always the transpose of the other [1, 2].

3 Feedback alignment

In feedback alignment, the problem is avoided by replacing the transposed W_l's in the feedback path by random, fixed (non-learning) weight matrices B_l,

δ_l = φ′(y_l) B_{l+1} δ_{l+1}    (3)

These feedback signals δ drive learning in the forward weights W by the rule

ΔW_{l+1} = −η_W δ_{l+1} y^T_l    (4)

where η_W is a learning-rate factor. 
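As a concrete illustration, equations (1), (3), and (4) can be run as-is on a toy problem. The following is a minimal sketch with assumed network sizes, task, and learning rate (not the paper's setup), and with biases omitted:

```python
import numpy as np

# Feedback alignment on a toy regression task: a fixed random matrix B2
# replaces the transpose W2^T in the feedback path.
rng = np.random.default_rng(0)
phi = np.tanh
dphi = lambda y: 1.0 - y ** 2              # tanh derivative, written in terms of y

n_in, n_hid, n_out = 8, 16, 4
W1 = 0.5 * rng.standard_normal((n_hid, n_in))
W2 = 0.5 * rng.standard_normal((n_out, n_hid))
B2 = 0.5 * rng.standard_normal((n_hid, n_out))   # fixed random feedback matrix
T = rng.standard_normal((n_out, n_in))           # target linear map to imitate

def mse(X):
    return float(np.mean((W2 @ phi(W1 @ X) - T @ X) ** 2))

X_eval = rng.standard_normal((n_in, 256))
loss_before = mse(X_eval)

eta = 0.02
for _ in range(3000):
    x = rng.standard_normal((n_in, 1))
    y1 = phi(W1 @ x)                       # equation (1), hidden layer
    d2 = W2 @ y1 - T @ x                   # output error, the top-level delta
    d1 = dphi(y1) * (B2 @ d2)              # equation (3): B2 replaces W2^T
    W2 -= eta * d2 @ y1.T                  # equation (4)
    W1 -= eta * d1 @ x.T                   # equation (4), one layer down

loss_after = mse(X_eval)
```

Because B2 stays fixed, the hidden layer learns only insofar as the forward weights come to agree with the feedback weights, which is the alignment effect described in the text.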
As shown in [3], equations (1), (3), and (4) together drive the forward matrices W_l to become roughly proportional to transposes of the feedback matrices B_l. That rough transposition makes equation (3) similar enough to the backprop equation (2) that the network can learn simple tasks as well as backprop does.

Can feedback alignment be augmented to handle harder tasks? One approach is to adjust the feedback weights B_l as well as the forward weights W_l, to improve their agreement. Here we show two mechanisms by which that adjustment can be achieved quickly and accurately in large networks without weight transport.

4 Weight mirrors

4.1 Learning the transpose

The aim here is to adjust an initially random matrix B so it becomes proportional to the transpose of another matrix W without weight transport, i.e. given only the input and output vectors x and y = Wx (for this explanation, we neglect the activation function φ). We observe that E[x y^T] = E[x x^T W^T] = E[x x^T] W^T. In the simplest case, if the elements of x are independent and zero-mean with equal variance σ², it follows that E[x y^T] = σ² W^T. Therefore we can push B steadily in the direction σ² W^T using this transposing rule,

ΔB = η_B x y^T    (5)

So B integrates a signal that is proportional to W^T on average. Over time, this integration may cause the matrix norm ‖B‖ to increase, but if we add a mechanism to keep the norm small — such as weight decay or synaptic scaling [14–16] — then the initial, random values in B shrink away, and B converges to a scalar multiple of W^T (see Appendix A for an account of this learning rule in terms of gradient descent).

4.2 A circuit for transposition

Figure 1 shows one way the learning rule (5) might be implemented in a neural network. 
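Before turning to the circuit, the transposing rule (5) plus weight decay can be checked numerically. A minimal sketch with illustrative sizes, taking η_B equal to the decay rate so that (with unit input variance) the fixed point is W^T itself:

```python
import numpy as np

# Push B in the direction x y^T (rule (5)) while weight decay shrinks the
# initial random values; B should converge toward a positive multiple of W^T.
rng = np.random.default_rng(1)
W = rng.standard_normal((5, 10))
B = rng.standard_normal((10, 5))            # random start, shaped like W^T
eta_B = decay = 0.002                       # with sigma^2 = 1, fixed point is W^T

for _ in range(20000):
    x = rng.standard_normal((10, 1))        # zero-mean, unit-variance input
    y = W @ x                               # output (activation phi neglected)
    B += eta_B * (x @ y.T) - decay * B      # rule (5) plus weight decay

# Angle between B and W^T, both flattened to vectors; small means B is
# close to a positive scalar multiple of W^T.
cos = float(B.ravel() @ W.T.ravel() /
            (np.linalg.norm(B) * np.linalg.norm(W)))
angle_deg = float(np.degrees(np.arccos(min(cos, 1.0))))
```

The residual angle comes from the sampling noise in x y^T; smaller learning rates (with more steps) shrink it further.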
This network alternates between two modes: an engaged mode, where it receives sensory inputs and adjusts its forward weights to improve its inference, and a mirror mode, where its neurons discharge noisily and adjust the feedback weights so they mimic the forward ones. Biologically, these two modes may correspond to wakefulness and sleep, or simply to practicing a task and then setting it aside for a moment.

Figure 1: Network modes for weight mirroring. Both panels show the same two-layer section of a network. In both modes, the three neurons in layer l of the forward path send their output signal y_l through the weight array W_{l+1} (and other processing shown in equation (1)) to yield the next-layer signal y_{l+1}. And in the feedback path, the two neurons in layer l+1 send their signal δ_{l+1} through weight array B_{l+1} to yield δ_l, as in (3). The figure omits the biases b, nonlinearities φ, and, in the top panel, the projections that convey y_l to the δ_l cells, allowing them to compute the factor φ′(y_l) in equation (3). a) In engaged mode, cross-projections convey the feedback signals δ to the forward-path cells, so they can adjust the forward weights W using learning rule (4). b) In mirror mode, one layer of forward cells, say layer l, fires noisily. Its signal y_l still passes through W_{l+1} to yield y_{l+1}, but now the blue cross-projections control firing in the feedback path, so δ_l = y_l and δ_{l+1} = y_{l+1}, and the δ_l neurons adjust the feedback weights B_{l+1} using learning rule (7). 
We call the circuit y_l, y_{l+1}, δ_{l+1}, δ_l a weight mirror because it makes the weight array B_{l+1} resemble W^T_{l+1}.

In mirror mode, the forward-path neurons in each layer l, carrying the signal y_l, project strongly to layer l of the feedback path — strongly enough that each signal δ_l of the feedback path faithfully mimics y_l, i.e.

δ_l = y_l    (6)

Also in mirror mode, those forward-path signals y_l are noisy. Multiple layers may fire at once, but the process is simpler to explain in the case where they take turns, with just one layer l driving forward-path activity at any one time. In that case, all the cells of layer l fire randomly and independently, so their output signal y_l has zero mean and equal variance σ². That signal passes through forward weight matrix W_{l+1} and activation function φ to yield y_{l+1} = φ(W_{l+1} y_l + b_{l+1}). By equation (6), signals y_l and y_{l+1} are transmitted to the feedback path. Then the layer-l feedback cells adjust their weights B_{l+1} by Hebbian learning,

ΔB_{l+1} = η_B δ_l δ^T_{l+1}    (7)

This circuitry and learning rule together constitute the weight mirror.

4.3 Why it works

To see that (7) approximates the transposing rule (5), notice first that

δ_l δ^T_{l+1} = y_l y^T_{l+1} = y_l φ(W_{l+1} y_l + b_{l+1})^T    (8)

We will assume, for now, that the variance σ² of y_l is small enough that W_{l+1} y_l + b_{l+1} stays in a roughly affine range of φ, and that the diagonal elements of the derivative matrix φ′(b_{l+1}) are all roughly similar to each other, so the matrix is approximately of the form φ′_s I, where φ′_s is a positive scalar and I is the identity. Then

φ(W_{l+1} y_l + b_{l+1}) ≈ φ′(b_{l+1}) W_{l+1} y_l + φ(b_{l+1}) ≈ φ′_s W_{l+1} y_l + φ(b_{l+1})    (9)

Therefore

δ_l δ^T_{l+1} ≈ y_l [ y^T_l W^T_{l+1} φ′_s + φ(b_{l+1})^T ]    (10)

and so

E[ΔB_{l+1}] ≈ η_B ( E[y_l y^T_l] W^T_{l+1} φ′_s + E[y_l] φ(b_{l+1})^T ) = η_B E[y_l y^T_l] W^T_{l+1} φ′_s = η_B σ² φ′_s W^T_{l+1}    (11)

Hence the weight matrix B_{l+1} integrates a teaching signal (7) which approximates, on average, a positive scalar multiple of W^T_{l+1}. As in (5), this integration may drive up the matrix norm ‖B_{l+1}‖, but if we add a mechanism such as weight decay to keep the norm small [15, 16] then B_{l+1} evolves toward a reasonable-sized positive multiple of W^T_{l+1}.

We get a stronger result if we suppose that neurons are capable of bias-blocking — of closing off their bias currents when in mirror mode, or preventing their influence on the axon hillock. Then

E[ΔB_{l+1}] ≈ η_B σ² φ′(0) W^T_{l+1}    (12)

So again, B_{l+1} comes to approximate a positive scalar multiple of W^T_{l+1}, so long as φ has a positive derivative around 0, but we no longer need to assume that φ′(b_{l+1}) ≈ φ′_s I.

In one respect the weight mirror resembles difference target propagation [4], because both mechanisms shape the feedback path layer by layer, but target propagation learns layer-wise autoencoders (though see [17]), and uses feedback weights to propagate targets rather than gradients.

5 The Kolen-Pollack algorithm

5.1 Convergence through weight decay

Kolen and Pollack [10] observed that we don't have to transport weights if we can transport changes in weights. 
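That observation is easy to verify numerically: two unequal scalar weights given identical adjustments and identical weight decay converge geometrically, whatever the adjustments are. A minimal sketch with arbitrary random adjustments and illustrative values:

```python
import numpy as np

# Two synapses start unequal but receive the same adjustment A(t) and the
# same decay factor lam at every step; their difference shrinks by a factor
# (1 - lam) per step, independently of the adjustments themselves.
rng = np.random.default_rng(2)
W, B = 1.5, -0.7
lam = 0.1
for t in range(200):
    A = rng.standard_normal()       # arbitrary shared adjustment
    W += A - lam * W
    B += A - lam * B
final_gap = abs(W - B)              # roughly (1 - lam)**200 * 2.2, i.e. tiny
```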
Consider two synapses, W in the forward path and B in the feedback path (written without boldface because for now we are considering individual synapses, not matrices). Suppose W and B are initially unequal, but at each time step t they undergo identical adjustments A(t) and apply identical weight-decay factors λ, so

ΔW(t) = A(t) − λ W(t)    (13)

and

ΔB(t) = A(t) − λ B(t)    (14)

Then W(t+1) − B(t+1) = W(t) + ΔW(t) − B(t) − ΔB(t) = W(t) − B(t) − λ[W(t) − B(t)] = (1 − λ)[W(t) − B(t)] = (1 − λ)^{t+1} [W(0) − B(0)], and so with time, if 0 < λ < 1, W and B will converge.

But biologically, it is no more feasible to transport weight changes than weights, and Kolen and Pollack do not say how their algorithm might run in the brain. Their flow diagram (Figure 2 in their paper) is not at all biological: it shows weight changes being calculated at one locus and then traveling to distinct synapses in the forward and feedback paths. In the brain, changes to different synapses are almost certainly calculated separately, within the synapses themselves. But it is possible to implement Kolen and Pollack's method in a network without transporting weights or weight changes.

5.2 A circuit for Kolen-Pollack learning

The standard, forward-path learning rule (4) says that the matrix W_{l+1} adjusts itself based on a product of its input vector y_l and a teaching vector δ_{l+1}. 
More specifically, each synapse W_{l+1,ij} adjusts itself based on its own scalar input y_{l,j} and the scalar teaching signal δ_{l+1,i} sent to its neuron from the feedback path.

We propose a reciprocal arrangement, where synapses in the feedback path adjust themselves based on their own inputs and cell-specific, scalar teaching signals from the forward path,

ΔB_{l+1} = −η_B y_l δ^T_{l+1} − λ B_{l+1}    (15)

If learning rates and weight decay agree in the forward and feedback paths, we get

ΔW_{l+1} = −η_W δ_{l+1} y^T_l − λ W_{l+1}    (16)

and

ΔB_{l+1} = −η_W y_l δ^T_{l+1} − λ B_{l+1}    (17)

i.e.

ΔB^T_{l+1} = −η_W δ_{l+1} y^T_l − λ B^T_{l+1}    (18)

In this network (drawn in Figure 2), the only variables transmitted between cells are the activity vectors y_l and δ_{l+1}, and each synapse computes its own adjustment locally, but (16) and (18) have the form of the Kolen-Pollack equations (13) and (14), and therefore the forward and feedback weight matrices converge to transposes of each other.

Figure 2: Reciprocal network for Kolen-Pollack learning. There is a single mode of operation. Gold-colored cross-projections convey feedback signals δ to forward-path cells, so they can adjust the forward weights W using learning rule (16). Blue cross-projections convey the signals y to the feedback cells, so they can adjust the feedback weights B using (17).

We have released a Python version of the proprietary TensorFlow/TPU code for the weight mirror and the KP reciprocal network that we used in our tests; see github.com/makrout/Deep-Learning-without-Weight-Transport.

6 Experiments

We compared our weight-mirror and Kolen-Pollack networks to backprop, plain feedback alignment, and the sign-symmetry method [5, 7]. 
For easier comparison with recent papers on biologically-motivated algorithms [4, 5, 7], we used the same types of networks they did, with convolution [18], batch normalization (BatchNorm) [19], and rectified linear units (ReLUs) without bias-blocking. In most experiments, we used a ResNet block variant where signals were normalized by BatchNorm after the ReLU nonlinearity, rather than before (see Appendix D.3). More brain-like implementations would have to replace BatchNorm with some kind of synaptic scaling [15, 16], ReLU with a bounded function such as rectified tanh, and convolution with non-weight-sharing local connections.

Figure 3: ImageNet results. a) With ResNet-18 architecture, the weight-mirror network (— WM) and Kolen-Pollack (— KP) outperformed plain feedback alignment (— FA) and the sign-symmetry algorithm (— SS), and nearly matched backprop (— BP). b) With the larger ResNet-50 architecture, results were similar.

Run on the ImageNet visual-recognition task [6] with the ResNet-18 network (Figure 3a), weight mirrors managed a final top-1 test error of 30.2(7)%, and Kolen-Pollack reached 29.2(4)%, versus 97.4(2)% for plain feedback alignment, 39.2(4)% for sign-symmetry, and 30.1(4)% for backprop. With ResNet-50 (Figure 3b), the scores were: weight mirrors 23.4(5)%, Kolen-Pollack 23.9(7)%, feedback alignment 98.9(1)%, sign-symmetry 33.8(3)%, and backprop 22.9(4)%. (Digits in parentheses are standard errors.)

Sign-symmetry did better in other experiments where batch normalization was applied before the ReLU nonlinearity. In those runs, it achieved top-1 test errors of 37.8(4)% with ResNet-18 (close to the 37.91% reported in [7] for the same network) and 32.6(6)% with ResNet-50 (see Appendix D.1 for details of our hyperparameter selection, and Appendix D.3 for a figure of the best result attained by sign-symmetry on our tests). 
The same change in BatchNorm made little difference to the other four methods — backprop, feedback alignment, Kolen-Pollack, and the weight mirror.

Weight mirroring kept the forward and feedback matrices in agreement throughout training, as shown in Figure 4. One way to measure this agreement is by matrix angles: in each layer of the networks, we took the feedback matrix B_l and the transpose of the forward matrix, W^T_l, and reshaped them into vectors. With backprop, the angle between those vectors was of course always 0. With weight mirrors (Figure 4a), the angle stayed < 12° in all layers, and < 6° later in the run for all layers except the final one. That final layer was fully connected, and therefore its W_l received more inputs than those of the other, convolutional layers, making its W^T_l harder to deduce. For closer alignment, we would have needed longer mirroring with more examples.

The matrix angles grew between epochs 2 and 10 and then held steady at relatively high levels till epoch 32 because during this period the learning rate η_W was large (see Appendix D.1), and mirroring didn't keep the B_l's matched to the fast-changing W^T_l's. That problem could also have been solved with more mirroring, but it did no harm because at epoch 32, η_W shrank by 90%, and from then on, the B_l's and W^T_l's stayed better aligned.

We also computed the δ angles between the feedback vectors δ_l computed by the weight-mirror network (using B_l's) and those that would have been computed by backprop (using W^T_l's). Weight mirrors kept these angles < 25° in all layers (Figure 4b), with worse alignment farther upstream, because δ angles depend on the accumulated small discrepancies between all the B_l's and W^T_l's in all downstream layers.

Figure 4: Agreement of forward and feedback matrices in the ResNet-50 from Figure 3b. a) Weight mirrors kept the angles between the matrices B_l and W^T_l small in all layers, from the input layer (—) to the output (—). b) Feedback vectors δ_l computed by the weight-mirror network were also well aligned with those that would have been computed by backprop. c, d) The Kolen-Pollack network kept the matrix and δ angles even smaller. e, f) The sign-symmetry method was less accurate.

The Kolen-Pollack network was even more accurate, bringing the matrix and δ angles to near zero within 20 epochs and holding them there, as shown in Figures 4c and 4d.

The sign-symmetry method aligned matrices and δ's less accurately (Figures 4e and 4f), while with feedback alignment (not shown), both angles stayed > 80° for most layers in both the ResNet-18 and ResNet-50 architectures.

7 Discussion

Both the weight mirror and the Kolen-Pollack network outperformed feedback alignment and the sign-symmetry algorithm, and both kept pace, at least roughly, with backprop. 
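As an aside, the matrix-angle metric used in Figure 4 takes only a few lines to compute. A minimal sketch (the example matrix is arbitrary):

```python
import numpy as np

# Angle between a feedback matrix B and the transposed forward matrix W^T,
# after reshaping both into vectors; 0 degrees means B is a positive
# scalar multiple of W^T.
def matrix_angle_deg(B, W):
    b, w = B.ravel(), W.T.ravel()
    cos = b @ w / (np.linalg.norm(b) * np.linalg.norm(w))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

W = np.arange(1.0, 7.0).reshape(2, 3)        # arbitrary 2x3 forward matrix
perfect = matrix_angle_deg(3.0 * W.T, W)     # positive scaling leaves angle near 0
```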
Kolen-Pollack has some advantages over weight mirrors, as it doesn't call for separate modes of operation and needn't proceed layer by layer. Conversely, weight mirrors don't need sensory input but learn from noise, so they could tune feedback paths in sleep or in utero. And while KP kept matrix and δ angles smaller than WM did in Figure 4, that may not be the case in all learning tasks. With KP, the matrix B converges to W^T at a rate that depends on λ, the weight-decay factor in equation (17). A big λ speeds up alignment, but may hamper learning, and at present we have no proof that a good balance can always be found between λ and learning rate η_W. In this respect, WM may be more versatile than KP, because if mirroring ever fails to yield small enough angles, we can simply do more mirroring, e.g. in sleep. More tests are needed to assess the two mechanisms' aptitude for different tasks, their sensitivity to hyperparameters, and their effectiveness in non-convolutional networks and other architectures.

Both methods may have applications outside biology, because the brain is not the only computing device that lacks weight transport. Abstractly, the issue is that the brain represents information in two different forms: some is coded in action potentials, which are energetically expensive but rapidly transmissible to other parts of the brain, while other information is stored in synaptic weights, which are cheap and compact but localized — they influence the transmissible signals but are not themselves transmitted. Similar issues arise in certain kinds of technology, such as application-specific integrated circuits (ASICs). Here as in the brain, mechanisms like weight mirroring and Kolen-Pollack could allow forward and feedback weights to live locally, saving time and energy [20–22].

References

[1] Stephen Grossberg. 
Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.

[2] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.

[3] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.

[4] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9390–9400, 2018.

[5] Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep convolutional networks. arXiv preprint arXiv:1812.06488, 2018.

[6] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[7] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning algorithms can scale to large datasets. arXiv preprint arXiv:1811.03567, 2018.

[8] Dmitry Krotov and John Hopfield. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences, 116(16):7723–7731, 2019.

[9] Jordan Guerguiev, Konrad Kording, and Blake Richards. Spike-based causal inference for weight alignment. arXiv preprint arXiv:1910.01689, 2019.

[10] John F Kolen and Jordan B Pollack. Backpropagation without weight transport. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 3, pages 1375–1380. IEEE, 1994.

[11] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 
Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[12] Robert Urbanczik and Walter Senn. Learning by the dendritic prediction of somatic spiking. Neuron, 81(3):521–528, 2014.

[13] Jordan Guergiuev, Timothy P Lillicrap, and Blake A Richards. Biologically feasible deep learning with segregated dendrites. arXiv preprint arXiv:1610.00161, 2016.

[14] Nathan Intrator and Leon N Cooper. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5(1):3–17, 1992.

[15] Gina G Turrigiano. The self-tuning neuron: synaptic scaling of excitatory synapses. Cell, 135(3):422–435, 2008.

[16] Dhrubajyoti Chowdhury and Johannes W Hell. Homeostatic synaptic scaling: Molecular regulators of synaptic AMPA-type glutamate receptors. F1000Research, 7, 2018.

[17] Daniel Kunin, Jonathan M Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. arXiv preprint arXiv:1901.08168, 2019.

[18] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[20] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, volume 44, pages 367–379. IEEE Press, 2016.

[21] Hyoukjun Kwon, Michael Pellauer, and Tushar Krishna. MAESTRO: An open-source infrastructure for modeling dataflows within deep learning accelerators. 
arXiv preprint arXiv:1805.02566, 2018.

[22] Brian Crafton, Abhinav Parihar, Evan Gebhardt, and Arijit Raychowdhury. Direct feedback alignment with sparse connections for local learning. arXiv preprint arXiv:1903.02083, 2019.

[23] Andrey Gushchin and Ao Tang. Total wiring length minimization of C. elegans neural network: a constrained optimization approach. PLoS ONE, 10(12):e0145029, 2015.

[24] Marion Langen, Egemen Agi, Dylan J Altschuler, Lani F Wu, Steven J Altschuler, and Peter Robin Hiesinger. The developmental rules of neural superposition in Drosophila. Cell, 162(1):120–133, 2015.

[25] SL Palay and V Chan-Palay. Cerebellar Cortex. Springer-Verlag, Berlin, Heidelberg, New York, 1974.

[26] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.

[27] Richard Naud and Henning Sprekeler. Sparse bursts optimize information transmission in a multiplexed neural code. Proceedings of the National Academy of Sciences, 115(27):E6329–E6338, 2018.

[28] Chris Eliasmith and Charles H Anderson. Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. MIT Press, 2004.

[29] RJ Leigh and DS Zee. The Neurology of Eye Movements. Oxford University Press, New York, 3rd edition, 1999.

[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[31] Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, et al. TF-Replicator: Distributed machine learning for researchers. arXiv preprint arXiv:1902.00465, 2019.

[32] Ilya Sutskever, James Martens, George E Dahl, and Geoffrey E Hinton. On the importance of initialization and momentum in deep learning. 
In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1139–1147, 2013.
", "award": [], "sourceid": 553, "authors": [{"given_name": "Mohamed", "family_name": "Akrout", "institution": "University of Toronto"}, {"given_name": "Collin", "family_name": "Wilson", "institution": "University of Toronto"}, {"given_name": "Peter", "family_name": "Humphreys", "institution": "Deepmind"}, {"given_name": "Timothy", "family_name": "Lillicrap", "institution": "DeepMind & UCL"}, {"given_name": "Douglas", "family_name": "Tweed", "institution": "University of Toronto"}]}