{"title": "Direct Feedback Alignment Provides Learning in Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1037, "page_last": 1045, "abstract": "Artificial neural networks are most commonly trained with the back-propagation algorithm, where the gradient for learning is provided by back-propagating the error, layer by layer, from the output layer to the hidden layers. A recently discovered method called feedback-alignment shows that the weights used for propagating the error backward don't have to be symmetric with the weights used for propagation the activation forward. In fact, random feedback weights work evenly well, because the network learns how to make the feedback useful. In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error back-propagation. The method is a step towards biologically plausible machine learning because the error signal is almost local, and no symmetric or reciprocal weights are required. Experiments show that the test performance on MNIST and CIFAR is almost as good as those obtained with back-propagation for fully connected networks. If combined with dropout, the method achieves 1.45% error on the permutation invariant MNIST task.", "full_text": "Direct Feedback Alignment Provides Learning in\n\nDeep Neural Networks\n\nArild N\u00f8kland\n\nTrondheim, Norway\n\narild.nokland@gmail.com\n\nAbstract\n\nArti\ufb01cial neural networks are most commonly trained with the back-propagation\nalgorithm, where the gradient for learning is provided by back-propagating the error,\nlayer by layer, from the output layer to the hidden layers. 
A recently discovered\nmethod called feedback-alignment shows that the weights used for propagating the\nerror backward do not have to be symmetric with the weights used for propagating\nthe activation forward. In fact, random feedback weights work equally well, because\nthe network learns how to make the feedback useful. In this work, the feedback\nalignment principle is used for training hidden layers more independently from\nthe rest of the network, and from a zero initial condition. The error is propagated\nthrough fixed random feedback connections directly from the output layer to each\nhidden layer. This simple method is able to achieve zero training error even in\nconvolutional networks and very deep networks, completely without error back-propagation. The method is a step towards biologically plausible machine learning\nbecause the error signal is almost local, and no symmetric or reciprocal weights\nare required. Experiments show that the test performance on MNIST and CIFAR\nis almost as good as that obtained with back-propagation for fully connected\nnetworks. If combined with dropout, the method achieves 1.45% error on the\npermutation invariant MNIST task.\n\n1 Introduction\n\nFor supervised learning, the back-propagation algorithm (BP), see [2], has achieved great success in\ntraining deep neural networks. As of today, this method has few real competitors due to its simplicity\nand proven performance, although some alternatives do exist.\nBoltzmann machine learning in its different variants is a family of biologically inspired methods for training\nneural networks, see [6], [10] and [5]. These methods use only locally available signals for adjusting the weights,\nand they can be combined with BP fine-tuning to obtain good discriminative performance.\nContrastive Hebbian Learning (CHL) is similar to Boltzmann machine learning, but can be used\nin deterministic feed-forward networks. 
In the case of weak symmetric feedback-connections it\nresembles BP [16].\nRecently, target-propagation (TP) was introduced as a biologically plausible training method, where\neach layer is trained to reconstruct the layer below [7]. This method does not require symmetric\nweights and propagates target values instead of gradients backward.\nA novel training principle called feedback-alignment (FA) was recently introduced [9]. The authors\nshow that the feedback weights used to back-propagate the gradient do not have to be symmetric with\nthe feed-forward weights. The network learns how to use fixed random feedback weights in order to\nreduce the error. Essentially, the network learns how to learn, and that is a puzzling result.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nBack-propagation with asymmetric weights was also explored in [8]. One of the conclusions from\nthis work is that the weight symmetry constraint can be significantly relaxed while still retaining\nstrong performance.\nThe back-propagation algorithm is not biologically plausible for several reasons. First, it requires\nsymmetric weights. Second, it requires separate phases for inference and learning. Third, the learning\nsignals are not local, but have to be propagated backward, layer-by-layer, from the output units. This\nrequires the error derivative to be transported as a second signal through the network. To\ntransport this signal, the derivative of the non-linearities has to be known.\nAll the mentioned methods require the error to travel backward through reciprocal connections. This is\nbiologically plausible in the sense that cortical areas are known to be reciprocally connected [3]. The\nquestion is how an error signal is relayed through an area to reach more distant areas. For BP and FA\nthe error signal is represented as a second signal in the neurons participating in the forward pass. 
For\nTP the error is represented as a change in the activation in the same neurons. Consider the possibility\nthat the error in the relay layer is represented by neurons not participating in the forward pass. For\nlower layers, this implies that the feedback path becomes disconnected from the forward path, and\nthe layer is no longer reciprocally connected to the layer above.\nThe question arises whether a neuron can receive a teaching signal also through disconnected feedback\npaths. This work shows experimentally that directly connected feedback paths from the output layer\nto neurons earlier in the pathway are sufficient to enable error-driven learning in a deep network. The\nrequirements are that the feedback is random and that the whole network is adapted. The concept is\nquite different from back-propagation, but the result is very similar. Both methods seem to produce\nfeatures that make classification easier for the layers above.\nFigure 1c) and d) show the novel feedback path configurations that are further explored in this work.\nThe methods are based on the feedback alignment principle and are named \"direct feedback-alignment\"\n(DFA) and \"indirect feedback-alignment\" (IFA).\n\nFigure 1: Overview of different error transportation configurations. Grey arrows indicate activation\npaths and black arrows indicate error paths. Weights that are adapted during learning are denoted as\nWi, and weights that are fixed and random are denoted as Bi. a) Back-propagation. b) Feedback-alignment. c) Direct feedback-alignment. d) Indirect feedback-alignment.\n\n2 Method\n\nLet (x, y) be mini-batches of input-output vectors that we want the network to learn. For simplicity,\nassume that the network has only two hidden layers as in Figure 1, and that the target output y is\nscaled between 0 and 1. 
Let the rows in W_i denote the weights connecting the layer below to a\nunit in hidden layer i, and let b_i be a column vector with biases for the units in hidden layer i. The\nactivations in the network are then calculated as\n\na_1 = W_1 x + b_1,   h_1 = f(a_1)   (1)\na_2 = W_2 h_1 + b_2,   h_2 = f(a_2)   (2)\na_y = W_3 h_2 + b_3,   ŷ = f_y(a_y)   (3)\n\nwhere f() is the non-linearity used in hidden layers and f_y() the non-linearity used in the output\nlayer. If we choose a logistic activation function in the output layer and a binary cross-entropy loss\nfunction, the loss for a mini-batch with size N and the gradient at the output layer e are calculated as\n\nJ = -(1/N) Σ_{m,n} [ y_{mn} log ŷ_{mn} + (1 - y_{mn}) log(1 - ŷ_{mn}) ]   (4)\ne = δa_y = ∂J/∂a_y = ŷ - y   (5)\n\nwhere m and n are output unit and mini-batch indexes. For BP, the gradients for hidden layers are\ncalculated as\n\nδa_2 = ∂J/∂a_2 = (W_3^T e) ⊙ f'(a_2),   δa_1 = ∂J/∂a_1 = (W_2^T δa_2) ⊙ f'(a_1)   (6)\n\nwhere ⊙ is an element-wise multiplication operator and f'() is the derivative of the non-linearity.\nThis gradient is also called steepest descent, because it directly minimizes the loss function given the\nlinearized version of the network. For FA, the hidden layer update directions are calculated as\n\nδa_2 = (B_2 e) ⊙ f'(a_2),   δa_1 = (B_1 δa_2) ⊙ f'(a_1)   (7)\n\nwhere B_i is a fixed random weight matrix with appropriate dimension. For DFA, the hidden layer\nupdate directions are calculated as\n\nδa_2 = (B_2 e) ⊙ f'(a_2),   δa_1 = (B_1 e) ⊙ f'(a_1)   (8)\n\nwhere B_i is a fixed random weight matrix with appropriate dimension. If all hidden layers have the\nsame number of neurons, B_i can be chosen identical for all hidden layers. 
For IFA, the hidden layer\nupdate directions are calculated as\n\nδa_2 = (W_2 δa_1) ⊙ f'(a_2),   δa_1 = (B_1 e) ⊙ f'(a_1)   (9)\n\nwhere B_1 is a fixed random weight matrix with appropriate dimension. Ignoring the learning rate, the\nweight updates for all methods are calculated as\n\nδW_1 = -δa_1 x^T,   δW_2 = -δa_2 h_1^T,   δW_3 = -e h_2^T   (10)\n\n3 Theoretical results\n\nBP provides a gradient that points in the direction of steepest descent in the loss function landscape.\nFA provides a different update direction, but experimental results indicate that the method is able\nto reduce the error to zero in networks with non-linear hidden units. This is surprising because the\nprinciple is distinctly different from steepest descent. For BP, the feedback weights are the transpose of\nthe forward weights. For FA the feedback weights are fixed, but if the forward weights are adapted,\nthey will approximately align with the pseudoinverse of the feedback weights in order to make the\nfeedback useful [9].\nThe feedback-alignment paper [9] proves that fixed random feedback asymptotically reduces the\nerror to zero. The conditions for this to happen are freely restated in the following. 1) The network is\nlinear with one hidden layer. 2) The input data have zero mean and standard deviation one. 3) The\nfeedback matrix B satisfies B⁺B = I, where B⁺ is the Moore-Penrose pseudo-inverse of B. 4) The\nforward weights are initialized to zero. 5) The output layer weights are adapted to minimize the error.\nLet's call this novel principle the feedback alignment principle.\nIt is not clear how the feedback alignment principle can be applied to a network with several non-linear hidden layers. 
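As a minimal sketch (not the paper's experimental code), the forward pass (1)-(3), the output error (5), the DFA update directions (8), and the weight updates (10) can be written directly in NumPy. The toy task (learning OR and AND of two binary inputs), the layer sizes, the learning rate, and the uniform feedback ranges are illustrative assumptions; the forward weights start at zero, which the paper's tanh experiments allow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: two binary inputs, targets [x1 OR x2, x1 AND x2] in {0, 1}.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T   # shape (2, 4)
y = np.array([[0, 0], [1, 0], [1, 0], [1, 1]], dtype=float).T   # shape (2, 4)

n_in, n_h, n_out = 2, 20, 2
# Forward weights start at zero; tanh hidden units permit this (Section 3).
W1, b1 = np.zeros((n_h, n_in)),  np.zeros((n_h, 1))
W2, b2 = np.zeros((n_h, n_h)),   np.zeros((n_h, 1))
W3, b3 = np.zeros((n_out, n_h)), np.zeros((n_out, 1))
# Fixed random direct feedback matrices, one per hidden layer, eq. (8).
B1 = rng.uniform(-1, 1, (n_h, n_out))
B2 = rng.uniform(-1, 1, (n_h, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, N = 0.2, x.shape[1]
for _ in range(4000):
    a1 = W1 @ x + b1;  h1 = np.tanh(a1)        # eq. (1)
    a2 = W2 @ h1 + b2; h2 = np.tanh(a2)        # eq. (2)
    ay = W3 @ h2 + b3; yhat = sigmoid(ay)      # eq. (3)

    e = yhat - y                               # eq. (5)
    da2 = (B2 @ e) * (1.0 - h2 ** 2)           # eq. (8): the error goes
    da1 = (B1 @ e) * (1.0 - h1 ** 2)           # directly to each hidden layer

    # eq. (10), averaged over the mini-batch; biases use the same deltas.
    W3 -= lr * (e @ h2.T) / N;   b3 -= lr * e.mean(axis=1, keepdims=True)
    W2 -= lr * (da2 @ h1.T) / N; b2 -= lr * da2.mean(axis=1, keepdims=True)
    W1 -= lr * (da1 @ x.T) / N;  b1 -= lr * da1.mean(axis=1, keepdims=True)

# Binary cross-entropy, eq. (4); chance level on this task is about 0.69.
yhat_c = np.clip(yhat, 1e-7, 1 - 1e-7)         # numerical safety only
loss = -np.mean(y * np.log(yhat_c) + (1 - y) * np.log(1 - yhat_c))
```

Note that no error is back-propagated through W_2 or W_3 when forming da1; only the fixed matrix B_1 and the local derivative f'(a_1) are used.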
The experiments in [9] show that more layers can be added if the error is\nback-propagated layer-by-layer from the output.\nThe following theorem points at a mechanism that can explain the feedback alignment principle.\nThe mechanism explains how an asymmetric feedback path can provide learning by aligning the\nback-propagated and forward propagated gradients with its own, under the assumption of constant\nupdate directions for each data point.\n\nTheorem 1. Given 2 hidden layers k and k + 1 in a feed-forward neural network where k connects\nto k + 1. Let h_k and h_{k+1} be the hidden layer activations. Let the functional dependency between the\nlayers be h_{k+1} = f(a_{k+1}), where a_{k+1} = W h_k + b. Here W is a weight matrix, b is a bias vector\nand f() is a non-linearity. Let the layers be updated according to the non-zero update directions\nδh_k and δh_{k+1}, where δh_k/‖δh_k‖ and δh_{k+1}/‖δh_{k+1}‖ are constant for each data point. The negative update\ndirections will minimize the following layer-wise criterion\n\nK = K_k + K_{k+1} = δh_k^T h_k / ‖δh_k‖ + δh_{k+1}^T h_{k+1} / ‖δh_{k+1}‖   (11)\n\nMinimizing K will maximize the gradient maximizing the alignment criterion\n\nL = L_k + L_{k+1} = δh_k^T c_k / ‖δh_k‖ + δh_{k+1}^T c_{k+1} / ‖δh_{k+1}‖   (12)\n\nwhere\n\nc_k = (∂h_{k+1}/∂h_k)^T δh_{k+1} = W^T (δh_{k+1} ⊙ f'(a_{k+1}))   (13)\nc_{k+1} = (∂h_{k+1}/∂h_k^T) δh_k = (W δh_k) ⊙ f'(a_{k+1})   (14)\n\nIf L_k > 0, then -δh_k is a descending direction in order to minimize K_{k+1}.\n\nProof. Let i be any of the layers k or k + 1. 
The prescribed update -δh_i is the steepest descent\ndirection in order to minimize K_i, because by using the product rule and the fact that any partial\nderivative of δh_i/‖δh_i‖ with respect to h_i is zero, we get\n\n-∂K_i/∂h_i = -∂/∂h_i [ δh_i^T h_i / ‖δh_i‖ ] = -∂/∂h_i [ δh_i/‖δh_i‖ ] h_i - (∂h_i/∂h_i) δh_i/‖δh_i‖ = -0 h_i - δh_i/‖δh_i‖ = -α_i δh_i   (15)\n\nHere α_i = 1/‖δh_i‖ is a positive scalar because δh_i is non-zero. Let δa_i be defined as δa_i = (∂h_i/∂a_i) δh_i = δh_i ⊙ f'(a_i), where a_i is the input to layer i. Using the product rule again, the gradients maximizing\nL_k and L_{k+1} are\n\n∂L_i/∂c_i = ∂/∂c_i [ δh_i^T c_i / ‖δh_i‖ ] = 0 c_i + δh_i/‖δh_i‖ = α_i δh_i   (16)\n∂L_{k+1}/∂W = (∂c_{k+1}/∂W) ∂L_{k+1}/∂c_{k+1} = α_{k+1} (δh_{k+1} ⊙ f'(a_{k+1})) δh_k^T = α_{k+1} δa_{k+1} δh_k^T   (17)\n∂L_k/∂W = (∂c_k/∂W^T) (∂L_k/∂c_k)^T = (δh_{k+1} ⊙ f'(a_{k+1})) α_k δh_k^T = α_k δa_{k+1} δh_k^T   (18)\n\nIgnoring the magnitude of the gradients we have ∂L/∂W = ∂L_{k+1}/∂W = ∂L_k/∂W. If we project h_i onto δh_i we\ncan write h_i = (h_i^T δh_i / ‖δh_i‖²) δh_i + h_{i,res} = α_i K_i δh_i + h_{i,res}. 
For W, the prescribed update is\n\nδW = -δh_{k+1} ∂h_{k+1}/∂W = -(δh_{k+1} ⊙ f'(a_{k+1})) h_k^T = -δa_{k+1} h_k^T = -δa_{k+1} (α_k K_k δh_k + h_{k,res})^T = -α_k K_k δa_{k+1} δh_k^T - δa_{k+1} h_{k,res}^T = -K_k ∂L_k/∂W - δa_{k+1} h_{k,res}^T   (19)\n\nWe can indirectly maximize L_k and L_{k+1} by maximizing the component of ∂L_k/∂W in δW, that is, by minimizing\nK_k. The gradient to minimize K_k is the prescribed update -δh_k.\nL_k > 0 implies that the angle β between δh_k and the back-propagated gradient c_k is within 90°,\nbecause cos(β) = c_k^T δh_k / (‖c_k‖ ‖δh_k‖) = L_k / ‖c_k‖ > 0 ⇒ |β| < 90°. L_k > 0 also implies that c_k is\nnon-zero and thus descending. Then δh_k will point in a descending direction, because a vector within\n90° of the steepest descending direction will also point in a descending direction.\n\nIt is important to note that the theorem does not say that the training will converge or reduce any error\nto zero, but if the fake gradient is successful in reducing K, then this gradient will also include a\ngrowing component that tries to increase the alignment criterion L.\nThe theorem can be applied to the output layer and the last hidden layer in a neural network. To\nachieve error-driven learning, we have to close the feedback loop. Then we get the update directions\nδh_{k+1} = ∂J/∂a_y = e and δh_k = G_k(e), where G_k(e) is a feedback path connecting the output to the\nhidden layer. The prescribed update will directly minimize the loss J given h_k. If L_k turns positive,\nthe feedback will provide an update direction δh_k = G_k(e) that reduces the same loss. The theorem\ncan be applied successively to deeper layers. 
For each layer i, the weight matrix W_i is updated to\nminimize K_{i+1} in the layer above, and at the same time indirectly make its own update direction\nδh_i = G_i(e) useful.\nTheorem 1 suggests that a large class of asymmetric feedback paths can provide a descending gradient\ndirection for a hidden layer, as long as L_i > 0 on average. Choosing feedback paths G_i(e) that visit\nevery layer on the way backward, with weights fixed and random, gives us the FA method. Choosing\ndirect feedback paths G_i(e) = B_i e, with B_i fixed and random, gives us the DFA method. Choosing\na direct feedback path G_1(e) = B_1 e connecting to the first hidden layer, and then visiting every\nlayer on the way forward, gives us the IFA method. The experimental section shows that learning is\npossible even with indirect feedback like this.\nDirect random feedback δh_i = G_i(e) = B_i e has the advantage that δh_i is non-zero for all non-zero e.\nThis is because a random matrix B_i will have full rank with a probability very close to 1. A non-zero\nδh_i is a requirement in order to achieve L_i > 0. Keeping the feedback static will ensure that this\nproperty is preserved during training. In addition, static feedback can make it easier to maximize L_i\nbecause the direction of δh_i is more constant. If the cross-entropy loss is used, and the output target\nvalues are 0 or 1, then the sign of the error e_j for a given sample j will not change. This means that\nthe quantity B_i sign(e_j) will be constant during training, because both B_i and sign(e_j) are constant.\nIf the task is to classify, the quantity will in addition be constant for all samples within a class. Direct\nrandom feedback will also provide an update direction δh_i with a magnitude that only varies with the\nmagnitude of the error e.\nIf the forward weights are initialized to zero, then L_i = 0, because the back-propagated error is\nzero. 
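The full-rank claim above is easy to check numerically. A small sketch (the sizes are made up for illustration, not taken from the paper): a fixed uniform random feedback matrix B has full column rank with probability essentially 1, so the direct feedback δh = B e is non-zero for every non-zero error e.

```python
import numpy as np

rng = np.random.default_rng(1)

n_hidden, n_out = 20, 10
B = rng.uniform(-1, 1, (n_hidden, n_out))   # fixed random direct feedback

# A continuous random matrix has full column rank with probability 1 ...
rank = np.linalg.matrix_rank(B)
print(rank)                                  # 10

# ... so the DFA direction B @ e cannot vanish for a non-zero error e.
e = rng.normal(size=n_out)
delta_h = B @ e
print(np.linalg.norm(delta_h) > 0)           # True
```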
This seems like a good starting point when using asymmetric feedback, because the first update\nsteps have the possibility to quickly turn this quantity positive. A zero initial condition is, however, not\na requirement for asymmetric feedback to work. One of the experiments will show that even when\nstarting from a bad initial condition, direct random and static feedback is able to turn this quantity\npositive and reduce the training error to zero.\nFor FA and BP, the hidden layer growth is bounded by the layers above. If the layers above saturate,\nthe hidden layer update δh_i becomes zero. For DFA, the hidden layer update δh_i will be non-zero as\nlong as the error e is non-zero. To limit the growth, a squashing non-linearity like the hyperbolic tangent\nor the logistic sigmoid seems appropriate. If we add a tanh non-linearity to the hidden layer, the hidden\nactivation is bounded within [-1, 1]. With zero initial weights, h_i will be zero for all data points. The\ntanh non-linearity will not limit the initial growth in any direction. The experimental results indicate\nthat this non-linearity is well suited for DFA.\nIf the hyperbolic tangent non-linearity is used in the hidden layer, the forward weights can be\ninitialized to zero. The rectified linear activation function (ReLU) will not work with zero initial\nweights, because the error derivative for such a unit is zero when the bias and incoming weights are\nall zero.\n\n4 Experimental results\n\nTo investigate if DFA learns useful features in the hidden layers, a 3x400 tanh network was trained\non MNIST with both BP and DFA. The input test images and resulting features were visualized using\nt-SNE [15], see Figure 3. Both methods learn features that make it easier to discriminate between\nthe classes. 
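The zero-initialization point can be made concrete with a few lines (a minimal sketch with arbitrary sizes, not the paper's code): at a zero initial condition every pre-activation is 0, where tanh'(0) = 1 lets the feedback B e pass through, while the ReLU derivative is 0, so the DFA direction δa = (B e) ⊙ f'(a) vanishes and learning cannot start.

```python
import numpy as np

rng = np.random.default_rng(2)

# With all-zero weights and biases, every hidden pre-activation is zero.
a = np.zeros(5)
e = np.array([0.5, -0.5])            # some non-zero output error
B = rng.uniform(-1, 1, (5, 2))       # fixed random direct feedback

tanh_deriv = 1.0 - np.tanh(a) ** 2   # tanh'(0) = 1 everywhere
relu_deriv = (a > 0).astype(float)   # ReLU subgradient: 0 at a = 0

da_tanh = (B @ e) * tanh_deriv       # non-zero: learning can start
da_relu = (B @ e) * relu_deriv       # all zeros: ReLU is stuck at zero init
```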
At the third hidden layer, the clusters are well separated, except for some stray points.\nThe visible improvement in separation from the input to the first hidden layer indicates that DFA is\nable to learn useful features also in deeper hidden layers.\n\nFigure 2: Left: Error curves for a network pre-trained with a frozen first hidden layer. Right: Error\ncurves for normal training of a 2x800 tanh network on MNIST.\n\nFigure 3: t-SNE visualization of MNIST input and features. Different colors correspond to different\nclasses. The top row shows features obtained with BP, the bottom row shows features obtained with\nDFA. From left to right: input images, first hidden layer features, second hidden layer features and\nthird hidden layer features.\n\nFurthermore, another experiment was performed to see if DFA is able to learn useful hidden\nrepresentations in deeper layers. A 3x50 tanh network was trained on MNIST. The first hidden layer\nwas fixed to random weights, but the 2 hidden layers above were trained with BP for 50 epochs. At\nthis point, the training error was about 5%. Then, the first hidden layer was unfrozen and training\ncontinued with BP. The training error decreased to 0% in about 50 epochs. The last step was repeated,\nbut this time the unfrozen layer was trained with DFA. As expected because of the different update\ndirections, the error first increased, then decreased to 0% after about 50 epochs. The error curves are\npresented in Figure 2 (left). Even though the update direction provided by DFA is different from the\nback-propagated gradient, the resulting hidden representation reduces the error in a similar way.\nSeveral feed-forward networks were trained on MNIST and CIFAR to compare the performance\nof DFA with FA and BP. The experiments were performed with the binary cross-entropy loss and\noptimized with RMSprop [14]. 
For the MNIST dropout experiments, the learning rate, its decay, and the\ntraining time were chosen based on a validation set. For all other experiments, the learning rate was\nroughly optimized for BP and then used for all methods. The learning rate was constant for each\ndataset. Training was stopped when the training error reached 0.01% or the number of epochs reached\n300. A mini-batch size of 64 was used. No momentum or weight decay was used. The input data\nwas scaled to be between 0 and 1, but for the convolutional networks, the data was whitened. For\nFA and DFA, the weights and biases were initialized to zero, except for the ReLU networks. For BP\nand/or ReLU, the initial weights and biases were sampled from a uniform distribution in the range\n[-1/√fanin, 1/√fanin]. The random feedback weights were sampled from a uniform distribution\nin the range [-1/√fanout, 1/√fanout].\n\nMODEL              | BP                   | FA                   | DFA\n7x240 Tanh         | 2.16 ± 0.13%         | 2.20 ± 0.13% (0.02%) | 2.32 ± 0.15% (0.03%)\n100x240 Tanh       |                      |                      | 3.92 ± 0.09% (0.12%)\n1x800 Tanh         | 1.59 ± 0.04%         | 1.68 ± 0.05%         | 1.68 ± 0.05%\n2x800 Tanh         | 1.60 ± 0.06%         | 1.64 ± 0.03%         | 1.74 ± 0.08%\n3x800 Tanh         | 1.75 ± 0.05%         | 1.66 ± 0.09%         | 1.70 ± 0.04%\n4x800 Tanh         | 1.92 ± 0.11%         | 1.70 ± 0.04%         | 1.83 ± 0.07% (0.02%)\n2x800 Logistic     | 1.67 ± 0.03%         | 1.82 ± 0.10%         | 1.75 ± 0.04%\n2x800 ReLU         | 1.48 ± 0.06%         | 1.74 ± 0.10%         | 1.70 ± 0.06%\n2x800 Tanh + DO    | 1.26 ± 0.03% (0.18%) | 1.53 ± 0.03% (0.18%) | 1.45 ± 0.07% (0.24%)\n2x800 Tanh + ADV   | 1.01 ± 0.08%         | 1.14 ± 0.03%         | 1.02 ± 0.05% (0.12%)\n\nTable 1: MNIST test error for back-propagation (BP), feedback-alignment (FA) and direct feedback-alignment (DFA). Training error in brackets when higher than 0.01%. Empty fields indicate no\nconvergence.\n\nThe results on MNIST are summarized in Table 1. 
For adversarial regularization (ADV), the\nnetworks were trained on adversarial examples generated by the \"fast-sign-method\" [4]. For dropout\nregularization (DO) [12], a dropout probability of 0.1 was used in the input layer and 0.5 elsewhere.\nFor the 7x240 network, target propagation achieved an error of 1.94% [7]. The results for all\nthree methods are very similar. Only DFA was able to train the deepest network with the simple\ninitialization used. The best result for DFA matches the best result for BP.\n\nMODEL             | BP                  | FA                  | DFA\n1x1000 Tanh       | 45.1 ± 0.7% (2.5%)  | 46.4 ± 0.4% (3.2%)  | 46.4 ± 0.4% (3.2%)\n3x1000 Tanh       | 45.1 ± 0.3% (0.2%)  | 47.0 ± 2.2% (0.3%)  | 47.4 ± 0.8% (2.3%)\n3x1000 Tanh + DO  | 42.2 ± 0.2% (36.7%) | 46.9 ± 0.3% (48.9%) | 42.9 ± 0.2% (37.6%)\nCONV Tanh         | 22.5 ± 0.4%         | 27.1 ± 0.8% (0.9%)  | 26.9 ± 0.5% (0.2%)\n\nTable 2: CIFAR-10 test error for back-propagation (BP), feedback-alignment (FA) and direct feedback-alignment (DFA). Training error in brackets when higher than 0.1%.\n\nThe results on CIFAR-10 are summarized in Table 2. For the convolutional network the error was\ninjected after the max-pooling layers. The model was identical to the one used in the dropout paper\n[12], except for the non-linearity. For the 3x1000 network, target propagation achieved an error of\n49.29% [7]. For the dropout experiment, the gap between BP and DFA is only 0.7%. FA does not\nseem to improve with dropout. 
For the convolutional network, DFA and FA are worse than BP.\n\nMODEL             | BP                  | FA                  | DFA\n1x1000 Tanh       | 71.7 ± 0.2% (38.7%) | 73.8 ± 0.3% (37.5%) | 73.8 ± 0.3% (37.5%)\n3x1000 Tanh       | 72.0 ± 0.3% (0.2%)  | 75.3 ± 0.1% (0.5%)  | 75.9 ± 0.2% (3.1%)\n3x1000 Tanh + DO  | 69.8 ± 0.1% (66.8%) | 75.3 ± 0.2% (77.2%) | 73.1 ± 0.1% (69.8%)\nCONV Tanh         | 51.7 ± 0.2%         | 60.5 ± 0.3%         | 59.0 ± 0.3%\n\nTable 3: CIFAR-100 test error for back-propagation (BP), feedback-alignment (FA) and direct\nfeedback-alignment (DFA). Training error in brackets when higher than 0.1%.\n\nThe results on CIFAR-100 are summarized in Table 3. DFA improves with dropout, while FA does\nnot. For the convolutional network, DFA and FA are worse than BP.\nThe above experiments were performed to verify the DFA method. The feedback loops used are the\nshortest possible, but other loops can also provide learning. An experiment was performed on MNIST\nto see if a single feedback loop like in Figure 1d) was able to train a deep network with 4 hidden\nlayers of 100 neurons each. The feedback was connected to the first hidden layer, and all hidden\nlayers above were trained with the update direction forward-propagated through this loop. Starting\nfrom a random initialization, the training error was reduced to 0%, and the test error was reduced to 3.9%.\n\n5 Discussion\n\nThe experimental results indicate that DFA is able to fit the training data as well as BP and FA.\nThe performance on the test set is similar to FA but lags a little behind BP. For the convolutional\nnetwork, BP is clearly the best performer. Adding regularization seems to help more for DFA than\nfor FA.\nOnly DFA was successful in training a network with 100 hidden layers. With proper weight initialization, BP is able to train very deep networks as well [13][11]. The reason why BP fails to converge here\nis probably the very simple initialization scheme used. 
Proper initialization might help FA in a\nsimilar way, but this was not investigated further.\nThe DFA training procedure has a lot in common with supervised layer-wise pre-training of a deep\nnetwork, but with an important difference. If all layers are trained simultaneously, it is the error at the\ntop of the deep network that drives the learning, not the error in a shallow pre-training network.\nIf the network above a target hidden layer is not adapted, FA and DFA will not give an improvement\nin the loss. This is in contrast to BP, which is able to decrease the error even in this case, because the\nfeedback depends on the weights and layers above.\nDFA demonstrates a novel application of the feedback alignment principle. The brain may or may not\nimplement this kind of feedback, but it is a step towards a better understanding of mechanisms that\ncan provide error-driven learning in the brain. DFA shows that learning is possible in feedback loops\nwhere the forward and feedback paths are disconnected. This introduces a large flexibility in how the\nerror signal might be transmitted. A neuron might receive its error signals via a post-synaptic neuron\n(BP, CHL), via a reciprocally connected neuron (FA, TP), directly from a pre-synaptic neuron (DFA),\nor indirectly from an error source located several synapses away earlier in the informational pathway\n(IFA).\nDisconnected feedback paths can lead to more biologically plausible machine learning. If the feedback\nsignal is added to the hidden layers before the non-linearity, the derivative of the non-linearity does\nnot have to be known. The learning rule becomes local, because the weight update only depends on\nthe pre-synaptic activity and the temporal derivative of the post-synaptic activity. Learning is not a\nseparate phase, but is performed at the end of an extended forward pass. 
The error signal is not a second\nsignal in the neurons participating in the forward pass, but a separate signal relayed by other neurons.\nThe local update rule can be linked to Spike-Timing-Dependent Plasticity (STDP), believed to govern\nsynaptic weight updates in the brain, see [1].\nDisconnected feedback paths have great similarities with controllers used in dynamical control loops.\nThe purpose of the feedback is to provide a change in the state that reduces the output error. For a\ndynamical control loop, the change is added to the state and propagated forward to the output. For a\nneural network, the change is used to update the weights.\n\n6 Conclusion\n\nA biologically plausible training method based on the feedback alignment principle is presented for\ntraining neural networks with error feedback rather than error back-propagation. In this method,\nneither symmetric weights nor reciprocal connections are required. The error paths are short and\nenable training of very deep networks. The training signals are local or available at most one synapse\naway. No careful weight initialization is required.\nThe method was able to fit the training set in all experiments performed on MNIST, CIFAR-10 and\nCIFAR-100. The performance on the test sets lags a little behind back-propagation.\nMost importantly, this work suggests that the restriction enforced by back-propagation and feedback-alignment, that the backward pass has to visit every neuron from the forward pass, can be discarded.\nLearning is possible even when the feedback path is disconnected from the forward path.\n\nReferences\n[1] Yoshua Bengio, Dong-Hyun Lee, Jörg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards\nbiologically plausible deep learning. CoRR, abs/1502.04156, 2015.\n\n[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error\npropagation. Nature, 323:533–536, 1986.\n\n[3] Charles D Gilbert and Wu Li. 
Top-down in\ufb02uences on visual processing. Nature Reviews\n\nNeuroscience, 14(5):350\u2013363, 2013.\n\n[4] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adver-\n\nsarial examples. CoRR, abs/1412.6572, 2014.\n\n[5] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep\n\nbelief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[6] Geoffrey E. Hinton and Terrence J. Sejnowski. Optimal Perceptual Inference. In Proceedings\n\nof the IEEE Conference on Computer Vision and Pattern Recognition, 1983.\n\n[7] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propaga-\ntion. In ECML/PKDD (1), Machine Learning and Knowledge Discovery in Databases, pages\n498\u2013515. Springer International Publishing, 2015.\n\n[8] Qianli Liao, Joel Z. Leibo, and Tomaso A. Poggio. How important is weight symmetry in\n\nbackpropagation? CoRR, abs/1510.05067, 2015.\n\n[9] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random\n\nfeedback weights support learning in deep neural networks. CoRR, abs/1411.0247, 2014.\n\n[10] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep boltzmann machines. In Proceedings of\nthe Twelfth International Conference on Arti\ufb01cial Intelligence and Statistics, AISTATS 2009,\nvolume 5 of JMLR Proceedings, pages 448\u2013455. JMLR.org, 2009.\n\n[11] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear\n\ndynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.\n\n[12] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-\nnov. Dropout: a simple way to prevent neural networks from over\ufb01tting. Journal of Machine\nLearning Research, 15(1):1929\u20131958, 2014.\n\n[13] David Sussillo. Random walks: Training very deep nonlinear feed-forward networks with smart\n\ninitialization. CoRR, abs/1412.6558, 2014.\n\n[14] T. 
Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of\n\nits recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2012.\n\n[15] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-sne. Journal\n\nof Machine Learning Research, 9:2579\u20132605, 2008.\n\n[16] Xiaohui Xie and H. Sebastian Seung. Equivalence of backpropagation and contrastive hebbian\n\nlearning in a layered network. Neural Computation, 15(2):441\u2013454, 2003.\n\n9\n\n\f", "award": [], "sourceid": 600, "authors": [{"given_name": "Arild", "family_name": "N\u00f8kland", "institution": "None"}]}