{"title": "GEMINI: Gradient Estimation Through Matrix Inversion After Noise Injection", "book": "Advances in Neural Information Processing Systems", "page_first": 141, "page_last": 148, "abstract": null, "full_text": "GEMINI: GRADIENT ESTIMATION \nTHROUGH MATRIX INVERSION \n\nAFTER NOISE INJECTION \n\n141 \n\nYann Le Cun 1 Conrad C. Galland and Geoffrey E. Hinton \n\nDepartment of Computer Science \n\nUniversity of Toronto \n10 King's College Rd \n\nToronto, Ontario M5S 1A4 \n\nCanada \n\nABSTRACT \n\nLearning procedures that measure how random perturbations of unit ac(cid:173)\ntivities correlate with changes in reinforcement are inefficient but simple \nto implement in hardware. Procedures like back-propagation (Rumelhart, \nHinton and Williams, 1986) which compute how changes in activities af(cid:173)\nfect the output error are much more efficient, but require more complex \nhardware. GEMINI is a hybrid procedure for multilayer networks, which \nshares many of the implementation advantages of correlational reinforce(cid:173)\nment procedures but is more efficient. GEMINI injects noise only at the \nfirst hidden layer and measures the resultant effect on the output error. \nA linear network associated with each hidden layer iteratively inverts the \nmatrix which relates the noise to the error change, thereby obtaining \nthe error-derivatives. No back-propagation is involved, thus allowing un(cid:173)\nknown non-linearities in the system. Two simulations demonstrate the \neffectiveness of GEMINI. \n\nOVERVIEW \n\nReinforcement learning procedures typically measure the effects of changes in lo(cid:173)\ncal variables on a global reinforcement signal in order to determine sensible weight \nchanges. This measurement does not require the connections to be used backwards \n(as in back-propagation), but it is inefficient when more than a few units are in(cid:173)\nvolved. 
Either the units must be perturbed one at a time, or, if they are perturbed simultaneously, the noise from all the other units must be averaged away over a large number of samples in order to achieve a reasonable signal-to-noise ratio. So reinforcement learning is much less efficient than back-propagation (BP) but much easier to implement in hardware.\n\n1 First author's present address: Room 4G-332, AT&T Bell Laboratories, Crawfords Corner Rd, Holmdel, NJ 07733.\n\nGEMINI is a hybrid procedure which retains many of the implementation advantages of reinforcement learning but eliminates some of the inefficiency. GEMINI uses the squared difference between the desired and actual output vectors as a reinforcement signal. It injects random noise at the first hidden layer only, causing correlated noise at later layers. If the noise is sufficiently small, the resultant change in the reinforcement signal is a linear function of the noise vector at any given layer. A matrix inversion procedure implemented separately at each hidden layer then determines how small changes in the activities of units in the layer affect the reinforcement signal. This matrix inversion gives a much more accurate estimate of the error-derivatives than simply averaging away the effects of noise and, unlike the averaging approach, it can be used when the noise is correlated.\n\nThe matrix inversion at each layer can be performed iteratively by a local linear network that \"learns\" to predict the change in reinforcement from the noise vector at that layer. For each input vector, one ordinary forward pass is performed, followed by a number of forward passes each with a small amount of noise added to the total inputs of the first hidden layer. 
After each forward pass, one iteration of an LMS training procedure is run at each hidden layer in order to improve the estimate of the error-derivatives in that layer. The number of iterations required is comparable to the width of the largest hidden layer. In order to avoid singularities in the matrix inversion procedure, it is necessary for each layer to have fewer units than the preceding one.\n\nIn this hybrid approach, the computations that relate the perturbation vectors to the reinforcement signal are all local to a layer. There is no detailed back-propagation of information, so GEMINI is more amenable to optical or electronic implementations than BP. The additional time needed to run the gradient-estimating inner loop may be offset by the fact that only forward propagation is required, so this can be made very efficient (e.g. by using analog or optical hardware).\n\nTECHNIQUES FOR GRADIENT ESTIMATION\n\nThe most obvious way to measure the derivative of the cost function w.r.t. the weights is to perturb the weights one at a time, for each input vector, and to measure the effect that each weight perturbation has on the cost function, C. The advantage of this technique is that it makes very few assumptions about the way the network computes its output.\n\nIt is possible to use far fewer perturbations (Barto and Anandan, 1985) if we are using \"quasi-linear\" units in which the output, y_i, of unit i is a smooth non-linear function, f, of its total input, x_i, and the total input is a linear function of the incoming weights, w_ij, and the activities, y_j, of units in the layer below:\n\nx_i = Σ_j w_ij y_j\n\nInstead of perturbing the weights, we perturb the total input, x_i, received by each unit, in order to measure ∂C/∂x_i. 
Once this derivative is known, it is easy to derive ∂C/∂w_ij for each of the unit's incoming weights by performing a simple local computation:\n\n∂C/∂w_ij = (∂C/∂x_i) y_j\n\nIf the units are perturbed one at a time, we can approximate ∂C/∂x_i by\n\n∂C/∂x_i ≈ δC/δx_i\n\nwhere δC is the variation of the cost function induced by a perturbation δx_i of the total input to unit i. This method is more efficient than perturbing the weights directly, but it still requires as many forward passes as there are hidden units.\n\nReducing the number of perturbations required\n\nIf the network has a layered, feed-forward architecture, the state of any single layer completely determines the output. This makes it possible to reduce the number of required perturbations and forward passes still further. Perturbing units in the first hidden layer will induce perturbations at the following layers, and we can use these induced perturbations to compute the gradients for these layers. However, since many of the units in a typical hidden layer will be perturbed simultaneously, and since these induced perturbations will generally be correlated, it is necessary to do some local computation within each layer in order to solve the credit assignment problem of deciding how much of the change in the final cost function to attribute to each of the simultaneous perturbations within the layer. This local computation is relatively simple. Let x(k) be the vector of total inputs to units in layer k. Let δx_t(k) be the perturbation vector of layer k at time t. It does not matter for the following analysis whether the perturbations are directly caused (in the first hidden layer) or are induced. For a given state of the network, to first order we have:\n\nδC_t ≈ g_k · δx_t(k), where g_k = ∂C/∂x(k)\n\nTo compute the gradient w.r.t. layer k we must solve the following system for g_k:\n\nδC_t = g_k · δx_t(k),  t = 1...P\n\nwhere P is the number of perturbations. 
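The per-layer system above can be illustrated numerically. The sketch below is not the paper's code: the smooth cost function, the layer width, and the noise scale are all invented. It perturbs the total inputs of one "layer", records the resulting cost changes, and solves the system δC_t = g · δx_t for g in the least-squares sense.

```python
import numpy as np

# Minimal illustration of the per-layer linear system.  The smooth cost
# function, the layer width n, and the noise scale are all invented.
rng = np.random.default_rng(0)
n = 8
w = rng.normal(size=n)

def cost(x):
    # any smooth scalar function of the layer's total inputs will do
    return (np.tanh(x) @ w - 0.5) ** 2

x0 = rng.normal(size=n)          # state after the unperturbed forward pass
C0 = cost(x0)

P, sigma = 64, 1e-4              # perturbations are small, so dC is ~linear
dX = sigma * rng.normal(size=(P, n))
dC = np.array([cost(x0 + dx) - C0 for dx in dX])

# Solve dC_t = g . dx_t for g in the least-squares sense.
g, *_ = np.linalg.lstsq(dX, dC, rcond=None)

# Compare with the analytic gradient dC/dx at x0.
y0 = np.tanh(x0)
g_true = 2 * (y0 @ w - 0.5) * w * (1 - y0 ** 2)
print(np.max(np.abs(g - g_true)))  # tiny: the inversion recovers the gradient
```

Batch least squares is used here for brevity; the paper solves the same system iteratively with an LMS unit per layer.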
Unless P is equal to the number of units in layer k and the perturbation vectors are linearly independent, this system will be over- or under-determined. In some network architectures it is impossible to induce n_l linearly independent perturbation vectors in a hidden layer l containing n_l units. This happens when one of the preceding hidden layers, k, contains fewer units, because the perturbation vectors induced by a layer with n_k units on the following layer generate at most n_k independent directions. So to avoid having to solve an under-determined system, we require \"convergent\" networks in which each hidden layer has no more units than the preceding layer.\n\nUsing a Special Unit to Allocate Credit within a Layer\n\nInstead of directly solving for the ∂C/∂x_i within each layer, we can solve the same system iteratively by minimizing:\n\nE = Σ_t (δC_t − g_k · δx_t(k))²\n\nFigure 1: A GEMINI network.\n\nThis can be done by a special unit whose inputs are the perturbations of layer k and whose desired output is the resulting perturbation of the cost function δC (figure 1). When the LMS algorithm is used, the weight vector g_k of this special unit converges to the gradient of C with respect to the vector of total inputs x(k). If the components of the perturbation vector are uncorrelated, the convergence will be fast and the number of iterations required should be of the order of the number of units in the layer. Each time a new input vector is presented to the main network, the \"inner-loop\" minimization process that estimates the ∂C/∂x_i must be re-initialized by setting the weights of the special units to zero or by reloading approximately correct weights from a table that associates estimates of the ∂C/∂x_i with each input vector.\n\nSummary of the GEMINI Algorithm\n\n1. 
Present an input pattern and compute the network state by forward propagation.\n\n2. Present a desired output and evaluate the cost function.\n\n3. Re-initialize the weights of the special units.\n\n4. Repeat until convergence:\n(a) Perturb the first hidden layer and propagate forward.\n(b) Measure the induced perturbations in other layers and the output cost function.\n(c) At each layer, apply one step of the LMS rule on the special units to minimize the error between the predicted cost variation and the actual variation.\n\n5. Use the weights of the special units (the estimates of ∂C/∂x_i) to compute the weight changes of the main network.\n\n6. Update the weights of the main network.\n\nA TEST EXAMPLE: CHARACTER RECOGNITION\n\nThe GEMINI procedure was tested on a simple classification task using a network with two hidden layers. The input layer represented a 16 by 16 binary image of a handwritten digit. The first hidden layer was an 8 by 8 array of units that were locally connected to the input layer in the following way: each hidden unit connected to a 3 by 3 \"receptive field\" of input units, and the centers of these receptive fields were spaced two \"pixels\" apart horizontally and vertically. To avoid boundary effects we used wraparound, which is unrealistic for real image processing. The second hidden layer was a 4 by 4 array of units, each of which was connected to a 5 by 5 receptive field in the previous hidden layer. The centers of these receptive fields were spaced two pixels apart. Finally, the output layer contained 10 units, one for each digit, and was fully connected to the second hidden layer. The network contained 1226 weights and biases.\n\nThe sigmoid function used at each node was of the form f(x) = s tanh(mx) with m = 2/3 and s = 1.716; thus f was odd and had the property that f(1) = 1 (LeCun, 1987). 
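For concreteness, the scaled sigmoid can be checked directly; this is just the formula from the text, written out:

```python
import math

# The node non-linearity from the paper: f(x) = s*tanh(m*x) with m = 2/3
# and s = 1.716, chosen so that f is odd and f(1) is (very nearly) 1.
s, m = 1.716, 2.0 / 3.0

def f(x):
    return s * math.tanh(m * x)

print(f(1.0))             # very close to 1
print(f(-2.0) + f(2.0))   # odd symmetry: f(-x) = -f(x)
```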
The training set was composed of 6 handwritten exemplars of each of the 10 digits. It should be emphasized that this task is simple (it is linearly separable), and the network has considerably more weights than are required for this problem.\n\nExperiments were performed with 64 perturbations in the gradient estimation inner loop. Therefore, assuming that the perturbation vectors were linearly independent, the linear system associated with the first hidden layer was not underconstrained.2 Since a stochastic gradient procedure was used with a single sweep through the training set, the solution was only a rough approximation, though convergence was facilitated by the fact that the components of the perturbations were statistically independent.\n\nThe linear systems associated with the second hidden layer and the output layer were almost certainly overconstrained,3 so we expected to obtain a better estimate of the gradient for these layers than for the first one. The perturbations injected at the first hidden layer were independent random numbers with a zero-mean gaussian distribution and standard deviation of 0.1.\n\nThe minimization procedure used for gradient estimation was not a pure LMS, but a pseudo-newton method that used a diagonal approximation to the matrix of second derivatives, which scales the learning rates for each link independently (Le Cun, 1987; Becker and Le Cun, 1988). In our case, the update rule for a gradient estimate coefficient was\n\ng_i ← g_i + (η/σ_i²) (δC − g · δx) δx_i\n\nwhere σ_i² is an estimate of the variance of the perturbation for unit i. In the simulations η was equal to 0.02 for the first hidden layer, 0.03 for the second hidden layer, and 0.05 for the output layer. 
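One standard form of such a variance-scaled LMS step is sketched below. This is an illustration consistent with the surrounding description, not the paper's code: the toy linear target, the layer width, and the iteration count are invented.

```python
import numpy as np

def estimate_gradient(sample, n, iters, eta, var):
    """Special-unit inner loop: one LMS step per noisy forward pass.

    sample() returns (dx, dC): a perturbation vector and the measured
    change in cost.  The step is scaled by 1/var, an estimate of the
    perturbation variance, in the spirit of the diagonal pseudo-newton
    scaling described in the text.
    """
    g = np.zeros(n)                  # special-unit weights, reset per pattern
    for _ in range(iters):
        dx, dC = sample()
        err = dC - g @ dx            # prediction error for this perturbation
        g += (eta / var) * err * dx  # variance-scaled LMS step
    return g

# Toy check: here the cost change really is linear, dC = g_true . dx,
# so the special unit should recover g_true.
rng = np.random.default_rng(1)
n, sigma = 8, 0.1

g_true = rng.normal(size=n)

def sample():
    dx = sigma * rng.normal(size=n)
    return dx, g_true @ dx

g_est = estimate_gradient(sample, n, iters=500, eta=0.05, var=sigma**2)
print(np.max(np.abs(g_est - g_true)))  # near zero
```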
Although there was no real need for it, the gradient associated with the output units was estimated using GEMINI so that we could evaluate the accuracy of gradient estimates far away from the noise-injection layer. The learning rates for the main network had different values for each unit and were equal to 0.1 divided by the fan-in of the unit.\n\n2 It may have been overconstrained, since the actual relation between the perturbation and the variation of the cost function is usually non-linear for finite perturbations.\n\n3 This depends on the degeneracy of the weight matrices.\n\nFigure 2: The mean squared error as a function of the number of sweeps through the training set for GEMINI (top curve) and BP (bottom curve).\n\nFigure 2 shows the relative learning rates of BP and GEMINI. The two runs were started from the same initial conditions. Although the learning curve for GEMINI is consistently above the one for BP and is more irregular, the rate of decrease of the two curves is similar. The 60 patterns are all correctly classified after 10 passes through the training set for regular BP, and after 11 passes for GEMINI. In the experiments, the direction of the estimated gradient for a single pattern was within about 20 degrees of the true gradient for the output layer and the second hidden layer, and within 50 degrees for the first hidden layer. Even with such inaccuracies in the gradient direction, the procedure still converged at a reasonable rate.\n\nLEARNING TO CONTROL A SIMPLE ROBOT ARM\n\nIn contrast to the digit recognition task, the robot arm control task considered here is particularly suited to the GEMINI procedure because it contains a non-linearity which is unknown to the network. 
In this simulation, a network with 2 input units, a first hidden layer with 8 units, a second with 4 units, and an output layer with 2 units is used to control a simulated arm with two angular degrees of freedom. The problem is to train the network to receive x, y coordinates encoded on the two input units and produce two angles encoded on the output units which would place the end of the arm on the desired input point (figure 3). The units use the same input-output function as in the digit recognition example.\n\nFigure 3: (a) The network trained with the GEMINI procedure, and (b) the 2-D arm controlled by the network.\n\nEach point in the training set is successively applied to the inputs and the resultant output angles determined. The training points are chosen so that the code for the output angles exploits most of the sigmoid input-output curve while avoiding the extreme ends. The \"unknown\" non-linearity is essentially the robot arm, which takes the joint angles as input and then \"outputs\" the resulting hand coordinates by positioning itself accordingly. The cost function, C, is taken as the square of the Euclidean distance from this point to the desired point. In the simulation, this distance is determined using the appropriate trigonometric relations:\n\nC = (x − a_1 cos θ_1 − a_2 cos(θ_1 + θ_2))² + (y − a_1 sin θ_1 − a_2 sin(θ_1 + θ_2))²\n\nwhere a_1 and a_2 are the lengths of the two components of the arm. Although this non-linearity is not actually unknown, analytical derivative calculation can be difficult in many real world applications, and so it is interesting to explore the possibility of a control system that can learn without it. 
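The trigonometric relations for a two-joint planar arm can be written out directly. In the sketch below the link lengths are placeholders, since the paper does not give a_1 and a_2:

```python
import math

def arm_cost(theta1, theta2, x, y, a1=1.0, a2=1.0):
    """Squared Euclidean distance between the hand of a two-joint planar
    arm and the target point (x, y).  Link lengths a1, a2 are placeholder
    values, not taken from the paper."""
    hx = a1 * math.cos(theta1) + a2 * math.cos(theta1 + theta2)
    hy = a1 * math.sin(theta1) + a2 * math.sin(theta1 + theta2)
    return (x - hx) ** 2 + (y - hy) ** 2

# With both joints at angle 0, the fully extended arm reaches (2, 0).
print(arm_cost(0.0, 0.0, 2.0, 0.0))  # 0.0
```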
\n\nIt is found that the minimum number of iterations of the LMS inner loop search needed to obtain good estimates of the gradients, when compared to values calculated by back-propagation, is between 2 and 3 times the number of units in the first hidden layer (figure 4). For this particular kind of problem, the process can be sped up significantly by using the following two modifications. First, the same training vector can be applied to the inputs and the weights changed repeatedly until the actual output is within a certain radius of the desired output. The gradient estimates are kept between these weight updates, thereby reducing the number of inner loop iterations needed at each step.\n\nFigure 4: Gradients of the units in all non-input layers, determined (a) by the GEMINI procedure after 24 iterations of the gradient estimating inner loop, and (b) through analytical calculation. The size of the black and white squares indicates the magnitude of negative and positive error gradients respectively.\n\nThe second modification requires that the arm be made to move continuously through 2-D space by using an appropriately ordered training set. The state of the network changes slowly as a result, leading to a slowly varying gradient. Thus, if the gradient estimate is not reset between successive input vectors, it can track the real gradient, allowing the number of iterations per gradient estimate to be reduced to as little as 5 in this particular network.\n\nThe results of simulations using training sets of closely spaced points in the first quadrant show that GEMINI is capable of training this network to correctly orient the simulated arm, with significantly improved learning efficiency when the above two modifications are employed. Details of these simulation results and the parameters used to obtain them are given in (Galland, Hinton, and Le Cun, 1989). 
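Putting the pieces together, one full GEMINI gradient-estimation pass might look like the following sketch. The network, weights, input, and target are all invented, and batch least squares stands in for the iterative LMS inner loop: noise is injected only at the hidden layer, and the induced perturbations at the output layer are reused to estimate that layer's gradient as well.

```python
import numpy as np

# Toy network (all sizes invented): 4 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.5, size=(4, 4))
W2 = rng.normal(scale=0.5, size=(2, 4))
inp = rng.normal(size=4)
target = np.array([0.5, -0.5])

def forward(noise=None):
    x1 = W1 @ inp
    if noise is not None:
        x1 = x1 + noise          # perturb total inputs of the hidden layer only
    x2 = W2 @ np.tanh(x1)
    cost = np.sum((np.tanh(x2) - target) ** 2)
    return x1, x2, cost

x1_0, x2_0, C0 = forward()       # one clean forward pass

P, sigma = 16, 1e-3
dX1, dX2, dC = [], [], []
for _ in range(P):               # noisy forward passes
    x1, x2, C = forward(sigma * rng.normal(size=4))
    dX1.append(x1 - x1_0)        # injected perturbations
    dX2.append(x2 - x2_0)        # induced, correlated perturbations
    dC.append(C - C0)
g1, *_ = np.linalg.lstsq(np.array(dX1), np.array(dC), rcond=None)
g2, *_ = np.linalg.lstsq(np.array(dX2), np.array(dC), rcond=None)

# Hand-derived true gradients, used here only to check the estimates.
y1, y2 = np.tanh(x1_0), np.tanh(x2_0)
g2_true = 2 * (y2 - target) * (1 - y2 ** 2)
g1_true = (W2.T @ g2_true) * (1 - y1 ** 2)

def angle(a, b):
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

print(angle(g1, g1_true), angle(g2, g2_true))  # both close to 0 degrees
```

Note that the output layer has fewer units than the hidden layer, so the induced perturbations still span the output layer's input space, as the "convergent" network condition requires.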
\n\nAcknowledgements\n\nThis research was funded by grants from the Ontario Information Technology Research Center, the Fyssen Foundation, and the National Science and Engineering Research Council. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\nA. G. Barto and P. Anandan (1985) Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-375.\n\nS. Becker and Y. Le Cun (1988) Improving the convergence of back-propagation learning with second order methods. In Touretzky, D. S., Hinton, G. E. and Sejnowski, T. J., editors, Proceedings of the 1988 Connectionist Summer School, Morgan Kaufmann: Los Altos, CA.\n\nC. C. Galland, G. E. Hinton and Y. Le Cun (1989) Technical Report, in preparation.\n\nY. Le Cun (1987) Modeles Connexionnistes de l'Apprentissage. Doctoral thesis, University of Paris 6.\n\nD. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning internal representations by back-propagating errors. Nature, 323, 533-536.\n", "award": [], "sourceid": 122, "authors": [{"given_name": "Yann", "family_name": "Le Cun", "institution": null}, {"given_name": "Conrad", "family_name": "Galland", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}