{"title": "The CHIR Algorithm for Feed Forward Networks with Binary Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 516, "page_last": 523, "abstract": null, "full_text": "516 \n\nGrossman \n\nThe CHIR Algorithm for Feed Forward Networks with Binary Weights \n\nTal Grossman \n\nDepartment of Electronics \n\nWeizmann Institute of Science \n\nRehovot 76100 Israel \n\nABSTRACT \n\nA new learning algorithm, Learning by Choice of Internal Representations (CHIR), was recently introduced. Whereas many algorithms reduce the learning process to minimizing a cost function over the weights, our method treats the internal representations as the fundamental entities to be determined. The algorithm applies a search procedure in the space of internal representations, and a cooperative adaptation of the weights (e.g. by using the perceptron learning rule). Since the introduction of its basic, single output version, the CHIR algorithm was generalized to train any feed forward network of binary neurons. Here we present the generalized version of the CHIR algorithm, and further demonstrate its versatility by describing how it can be modified in order to train networks with binary (\u00b11) weights. Preliminary tests of this binary version on the random teacher problem are also reported. \n\nI. INTRODUCTION \n\nLearning by Choice of Internal Representations (CHIR) was recently introduced [1,11] as a training method for feed forward networks of binary units. \n\nInternal representations are defined as the states taken by the hidden units of a network when patterns (e.g. from the training set) are presented to the input layer of the network. The CHIR algorithm views the internal representations associated with various inputs as the basic independent variables of the learning process. 
\nOnce such representations are formed, the weights can be found by simple and local learning procedures such as the Perceptron Learning Rule (PLR) [2]. Hence the problem of learning becomes one of searching for proper internal representations, rather than of minimizing a cost function by varying the values of the weights, which is the approach used by back propagation (see, however, [3],[4] where \"back propagation of desired states\" is described). This basic idea, of viewing the internal representations as the fundamental entities, has been used since by other groups [5-7]. Some of these works, and the main differences between them and our approach, are briefly discussed in [11]. One important difference is that the CHIR algorithm, as well as another similar algorithm, the MRII [8], try to solve the learning problem for a fixed architecture, and are not guaranteed to converge. Two other algorithms [5,6] always find a solution, but at the price of increasing the network size during learning in a manner that resembles similar algorithms developed earlier [9,10]. Another approach [7] is to use an error minimizing algorithm which treats the internal representations as well as the weights as the relevant variables of the search space. \n\nTo be more specific, consider first the single layer perceptron with its Perceptron Learning Rule (PLR) [2]. This simple network consists of N input (source) units j, and a single target unit i. This unit is a binary linear threshold unit, i.e. when the source units are set in any one of the \u03bc = 1,...,M patterns, i.e. S_j = \u03be_j^\u03bc, the state of unit i, S_i = \u00b11, is determined according to the rule \n\nS_i = sign( \u03a3_j W_ij S_j + \u03b8_i ) .   (1) \n\nHere W_ij is the (unidirectional) weight assigned to the connection from unit j to unit i; \u03b8_i is a local bias. 
For each of the M input patterns, we require that the target unit (determined using (1)) will take a preassigned value \u03be_i^\u03bc. Learning takes place in the course of a training session. Starting from any arbitrary initial guess for the weights, an input \u03bd is presented, resulting in the output taking some value S_i^\u03bd. Now modify every weight according to the rule \n\n\u0394W_ij = \u03b7 ( \u03be_i^\u03bd - S_i^\u03bd ) \u03be_j^\u03bd ,   (2) \n\nwhere \u03b7 > 0 is a step size parameter (\u03be_j = 1 is used to modify the bias \u03b8). Another input pattern is presented, and so on, until all inputs draw the correct output. The Perceptron convergence theorem states [2] that the PLR will find a solution (if one exists) in a finite number of steps. Nevertheless, one needs, for each unit, both the desired input and output states in order to apply the PLR. \n\nConsider now a two layer perceptron, with N input, H hidden and K output units (see Fig. 1). The elements of the network are binary linear threshold units i, whose states S_i = \u00b11 are determined according to (1). In a typical task for such a network, M specified output patterns, S_i^{out,\u03bc} = \u03be_i^{out,\u03bc}, are required in response to the \u03bc = 1,...,M input patterns. If a solution is found, it first maps each input onto an internal representation generated on the hidden layer, which, in turn, produces the correct output. Now imagine that we are not supplied with the weights that solve the problem; however, the correct internal representations are revealed. That is, we are given a table with M rows, one for each input. Every row has H bits \u03be_i^\u03bc, for i = 1,...,H, specifying the state of the hidden layer obtained in response to input pattern \u03bc. One can now view each hidden-layer cell i as the target of the PLR, with the N inputs viewed as source. 
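As a concrete illustration, the threshold rule (1) and the perceptron update (2) can be sketched in Python. This is an illustrative sketch, not code from the paper; the AND task, step size and sweep limit below are arbitrary choices:

```python
import numpy as np

def plr_train(patterns, targets, eta=0.5, max_sweeps=100, seed=0):
    """Train a single binary linear threshold unit with the PLR.

    patterns: (M, N) array of +/-1 source states; targets: (M,) desired +/-1 outputs.
    Returns (weights, bias, converged).
    """
    rng = np.random.default_rng(seed)
    M, N = patterns.shape
    w = rng.choice([-1.0, 1.0], size=N)              # arbitrary initial guess
    theta = 0.0
    for _ in range(max_sweeps):
        errors = 0
        for xi, t in zip(patterns, targets):
            s = 1.0 if w @ xi + theta >= 0 else -1.0  # Eq. (1)
            if s != t:
                errors += 1
                w += eta * (t - s) * xi               # Eq. (2)
                theta += eta * (t - s)                # bias update: xi_j = 1
        if errors == 0:            # all inputs draw the correct output
            return w, theta, True
    return w, theta, False

# Example: the linearly separable AND function in +/-1 coding.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
w, theta, ok = plr_train(X, y)
```

Because the AND targets are linearly separable, the convergence theorem guarantees that the sweep loop terminates with a correct set of weights.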
Given sufficient time, the PLR will converge to a set of weights W_ij, connecting input unit j to hidden unit i, so that indeed the input-hidden association that appears in column i of our table will be realized. In order to obtain the correct output, we apply the PLR in a learning process that uses the hidden layer as source and each output unit as a target, so as to realize the correct output. In general, however, one is not supplied with a correct table of internal representations. Finding such a table is the goal of our approach. \n\nFigure 1. A typical three layered feed forward network (two layered perceptron) with N input, H hidden and K output units. The unidirectional weight W_ij connects unit j to unit i. A layer index is implicitly included in each unit's index. \n\nDuring learning, the CHIR algorithm alternates between two phases: in one it generates the internal representations, and in the other it uses the updated representations in order to search for weights, using some single layer learning rule. This general scheme describes a large family of possible algorithms that use different ways to change the internal representations and update the weights. \n\nA simple algorithm based on this general scheme was introduced recently [1,11]. In section II we describe the multiple output version of CHIR [11]. In section III we present a way to modify the algorithm so it can train networks with binary weights, and the preliminary results of a few tests done on this new version. In the last section we briefly discuss our results and describe some future directions. \n\nII. 
THE CHIR ALGORITHM \n\nThe CHIR algorithm that we describe here implements the basic idea of learning by choice of internal representations by breaking the learning process into four distinct procedures that are repeated in a cyclic order: \n\n1. SETINREP: Generate a table of internal representations {\u03be_i^\u03bc} by presenting each input pattern from the training set and recording the states of the hidden units, using Eq. (1), with the existing couplings W_ij and \u03b8_i. \n\n2. LEARN23: The current table of internal representations is used as the training set, the hidden layer cells are used as source, and each output as the target unit of the PLR. If weights W_ij and \u03b8_i that produce the desired outputs are found, the problem has been solved. Otherwise stop after I_23 learning sweeps, and keep the current weights, to use in CHANGE INREP. \n\n3. CHANGE INREP: Generate a new table of internal representations, which reduces the error in the output: we present the table sequentially, row by row (pattern by pattern), to the hidden layer. If for pattern \u03bd the wrong output is obtained, the internal representation \u03be^\u03bd is changed. \n\nThis is done simply by choosing (at random) a hidden unit i, and checking the effect of flipping the sign of \u03be_i^\u03bd on the total output error, i.e. the number of wrong bits. If the output error is not increased, the flip is accepted and the table of internal representations is changed accordingly. Otherwise the flip is rejected and we try another unit. When we have more than one output unit, it might happen that an error in one output unit cannot be corrected without introducing an error in another unit. Therefore we allow only a pre-specified number of attempted flips, I_in, and go on to the next pattern even if the output error was not eliminated completely. This procedure ends with a modified, \"improved\" table which is our next guess of internal representations. 
Note that this new table does not necessarily yield a totally correct output for all the patterns. In such a case, the learning process will go on even if this new table is perfectly realized by the next stage, LEARN12. \n\n4. LEARN12: Present an input pattern; if the output is wrong, apply the PLR with the first layer serving as source, treating every hidden layer site separately as target. If input \u03bd does yield the correct output, we insert the current state of the hidden layer as the internal representation associated with pattern \u03bd, and no learning steps are taken. We sweep in this manner through the training set, modifying the weights W_ij (between input and hidden layer), the hidden-layer thresholds \u03b8_i, and, as explained above, the internal representations. If the network has achieved error-free performance for the entire training set, learning is completed. Otherwise, after I_12 training sweeps (or if the current internal representation is perfectly realized), abort the PLR stage, keeping the present values of W_ij, \u03b8_i, and start SETINREP again. The idea in trying to learn the current internal representation even if it does not yield the perfect output is that it can serve as a better input for the next LEARN23 stage. That way, in each learning cycle the algorithm tries to improve the overall performance of the network. \n\nThis algorithm can be further generalized for multi-layered feed forward networks by applying the CHANGE INREP and LEARN12 procedures to each of the hidden layers, one by one, from the last to the first hidden layer. \n\nThere are a few details that need to be added. \n\na) The \"impatience\" parameters: I_12 and I_23, which are rather arbitrary, are introduced to guarantee that the PLR stage is aborted if no solution is found, but they have to be large enough to allow the PLR to find a solution (if one exists) with sufficiently high probability. 
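The four procedures above can be sketched compactly in Python. This is a simplified illustration, not the paper's implementation: in particular, this LEARN12 runs the PLR on the whole table of target representations rather than pattern by pattern as described above, and the XOR task, hidden layer size and parameter values are arbitrary choices:

```python
import numpy as np

def act(W, b, s):
    # Eq. (1) for a whole layer: maps +/-1 source states to +/-1 unit states.
    return np.where(s @ W.T + b >= 0, 1.0, -1.0)

def plr_sweep(W, b, src, tgt, eta=0.5):
    # One PLR sweep over (M, n_src) sources and (M, n_tgt) targets, in place.
    wrong = 0
    for x, t in zip(src, tgt):
        err = t - act(W, b, x)       # nonzero only where a unit is wrong
        wrong += np.count_nonzero(err)
        W += eta * np.outer(err, x)  # Eq. (2), one row per target unit
        b += eta * err
    return wrong                     # number of wrong unit states seen

def chir(X, Y, H, I12=20, I23=10, I_in=10, max_cycles=500, seed=0):
    rng = np.random.default_rng(seed)
    M, N = X.shape
    K = Y.shape[1]
    W1 = rng.choice([-1.0, 1.0], (H, N)); b1 = np.zeros(H)
    W2 = rng.choice([-1.0, 1.0], (K, H)); b2 = np.zeros(K)
    for _ in range(max_cycles):
        R = act(W1, b1, X)                       # SETINREP
        for _ in range(I23):                     # LEARN23: hidden -> output
            if plr_sweep(W2, b2, R, Y) == 0:
                break
        if np.array_equal(act(W2, b2, act(W1, b1, X)), Y):
            return W1, b1, W2, b2                # solved
        for v in range(M):                       # CHANGE INREP
            for _ in range(I_in):
                e = np.count_nonzero(act(W2, b2, R[v]) != Y[v])
                if e == 0:
                    break
                i = rng.integers(H)
                R[v, i] *= -1                    # try a random single-bit flip
                if np.count_nonzero(act(W2, b2, R[v]) != Y[v]) > e:
                    R[v, i] *= -1                # reject: output error increased
        for _ in range(I12):                     # LEARN12: input -> new table
            if plr_sweep(W1, b1, X, R) == 0:
                break
    return None                                  # no solution within max_cycles

# Example: XOR (parity of two +/-1 inputs) with H = 3 hidden units.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
Y = np.array([[-1], [1], [1], [-1]], dtype=float)
net = None
for seed in range(5):            # the search is stochastic; retry a few seeds
    net = chir(X, Y, H=3, seed=seed)
    if net is not None:
        break
```

On a toy task like XOR the stochastic search typically succeeds within a few cycles; the impatience parameters I12 and I23 and the flip budget I_in play exactly the roles described in the text.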
Similar considerations are valid for the I_in parameter, the number of flip attempts allowed in the CHANGE INREP procedure. If this number is too small, the updated internal representations may not improve. If it is too large, the new internal representations might be too different from the previous ones, and therefore hard to learn. \n\nThe optimal values depend, in general, on the problem and the network size. Our experience indicates, however, that once a \"reasonable\" range of values is found, performance is fairly insensitive to the precise choice. In addition, a simple rule of thumb can always be applied: \"Whenever learning is getting hard, increase the parameters\". A detailed study of this issue is reported in [11]. \n\nb) The internal representations updating scheme: The CHANGE INREP procedure that is presented here (and studied in [11]) is probably the simplest and \"most primitive\" way to update the InRep table. The choice of the hidden units to be flipped is completely blind and relies only on the single bit of information about the improvement of the total output error. It may even happen that no change in the internal representation is made, although such a change is needed. This procedure can certainly be made more efficient, e.g. by probing the fields induced on all the hidden units to be flipped and then choosing one (or more) of them by applying a \"minimal disturbance\" principle as in [8]. Nevertheless it was shown [11] that even this simple algorithm works quite well. \n\nc) The weights updating schemes: In our experiments we have used the simple PLR with a fixed increment (\u03b7 = 1/2, \u0394W_ij = \u00b11) for weight learning. It has the advantage of allowing the use of discrete (or integer) weights. 
Nevertheless, it is just \na component that can be replaced by other, perhaps more sophisticated methods, in \norder to achieve, for example, better stability [12], or to take into account various \nconstraints on the weights, e.g. binary weights [13]. In the following section we \ndemonstrate how this can be done. \n\nIII. THE CHIR ALGORITHM FOR BINARY WEIGHTS \n\nIn this section we describe how the CHIR algorithm can be used in order to train \nfeed forward networks with binary weights. According to this strong constraint, all \nthe weights in the system (including the thresholds) can be either +1 or -1. The \nway to do it within the CHIR framework is simple: instead of applying the PLR \n(or any other single layer, real weights algorithm) for the updating of the weights, \n\n\fThe CHIR Algorithm for Feed Forward Networks with Binary Weights \n\n521 \n\nwe can use a binary perceptron learning rule. Several ways to solve the learning \nproblem in the binary weight perceptron were suggested recently [13]. The one that \nwe used in the experiments reported here is a modified version of the directed drift \nalgorithm introduced by Venkatesh [13]. Like the standard PLR, the directed drift \nalgorithm works on-line, namely, the patterns are presented one by one, the state of \na unit i is calculated according to (1), and whenever an error occurs the incoming \nweights are updated. When there is an error it means that \n\nNamely, the field hi = Ej Wiie.r ' (induced by the current pattern e.n is \"wrong\". \n\nIf so, there must be some weights that pull it to the wrong direction. These are the \nweights for which \n\n~'! hI! < 0 \n'-' , \n\nerWii{r < o. \n\nHere er is the desired output of unit i for pattern v. The updating of the weights \n\nis done simply by flipping (i.e. Wii ~ -Wij ) at random k of these weights. 
\nThe number of weights to be changed in each learning step, k, can be a pre-fixed parameter of the algorithm, or, as suggested by Venkatesh, can be decreased gradually during the learning process in a way similar to a cooling schedule (as in simulated annealing). What we do is to take k = |h|/2 + 1, making sure, as in relaxation algorithms, that just enough weights are flipped in order to obtain the desired target for the current pattern. This simple and local rule is now \"plugged\" into the LEARN12 and LEARN23 procedures instead of (2), and the initial weights are chosen to be +1 or -1 at random. \n\nWe tested the binary version of CHIR on the \"random teacher\" problem. In this problem a \"teacher network\" is created by choosing a random set of +1/-1 weights for the given architecture. The training set is then created by presenting M input patterns to the network and recording the resulting output as the desired output patterns. In what follows we took M = 2^N (exhaustive learning), and an N : N : 1 architecture. \n\nThe \"time\" parameter that we use for measuring performance is the number of sweeps through the training set of M patterns (\"epochs\") needed in order to find the solution, namely, how many times each pattern was presented to the network. In the experiments presented here, all possible input patterns were presented sequentially in a fixed order (within the perceptron learning sweeps). Therefore in each cycle of the algorithm there are I_12 + I_23 + 1 such sweeps. Note that according to our definition, a single sweep involves the updating of only one layer of weights or internal representations. For each network size, N, we created an ensemble of 50 independent runs, with different random teachers and starting with a different random choice of initial weights. \n\nWe calculate, as a performance measure, the following quantities: \n\na. The median number of sweeps, t_m. \n\nb. 
The \"inverse average rate\", T, as defined by Tesauro and Janssen in [14]. \n\n\f522 \n\nGrossman \n\nc. The success rate, S, i.e. \nsolution in less than the maximal number of training cycles [max specified. \n\nthe fraction of runs in which the algorithm finds a \n\nThe results,with the typical parameters, for N=3,4,5,6, are given in Table 1. \n\nTable 1. The Random Teacher problem with N:N:l architecture. \n\nN \n3 \n4 \n5 \n6 \n\nlt2 \n20 \n25 \n40 \n70 \n\n123 \n10 \n10 \n15 \n40 \n\nlin \n5 \n7 \n9 \n11 \n\n[max \n20 \n60 \n300 \n900 \n\nS \ntm \n14 \n1.00 \n1.00 \n87 \n430 \n1.00 \n15000 1100 0.71 \n\nT \n9 \n37 \n60 \n\nAs mentioned before, these are only preliminary results. No attempt was made \n\nto to optimize the learning parameters. \n\nIV. DISCUSSION \n\nWe presented a generalized version of the CHIR algorithm that is capable \nof training networks with multiple outputs and hidden layers. A way to modify \nthe basic alf$ortihm so it can be applied to networks with binary weights was also \nexplained and tested. The potential importance of such networks, e.g. in hardware \nimplementation, makes this modified version particularly interesting. \n\nAn appealing feature of the CHIR algorithm is the fact that it does not use \nany kind of \"global control\", that manipulates the internal representations (as is \nused for example in [5,6]). The mechanism by which the internal representations are \nchanged is local in the sense that the change is done for each unit and each pattern \nwithout conveying any information from other units or patterns (representations). \nMoreover, the feedback from the \"teacher\" to the system is only a single bit quantity, \nnamely, whether the output is getting worse or not (in contrast to BP, for example, \nwhere one informs each and every output unit about its individual error). 
\nOther advantages of our algorithm are the simplicity of the calculations, the need for only integer, or even binary, weights and binary units, and the good performance. It should be mentioned again that the CHIR training sweep involves far fewer computations than that of back-propagation. The price is the extra memory of M x H bits that is needed during the learning process in order to store the internal representations of all M training patterns. This feature is biologically implausible and may be practically limiting. We are developing a method that does not require such memory. The learning method that is currently studied for that purpose [15] is related to the MRII rule, which was recently presented by Widrow and Winter in [8]. It seems that further research will be needed in order to study the practical differences and the relative advantages of the CHIR and the MRII algorithms. \n\nAcknowledgements: I am grateful to Prof. Eytan Domany for many useful suggestions and comments. This research was partially supported by a grant from Minerva. \n\nReferences \n\n[1] Grossman T., Meir R. and Domany E., Complex Systems 2, 555 (1989). See also in D. Touretzky (ed.), Advances in Neural Information Processing Systems 1 (Morgan Kaufmann, San Mateo, 1989). \n\n[2] Minsky M. and Papert S., Perceptrons (MIT Press, Cambridge, 1988); Rosenblatt F., Principles of Neurodynamics (Spartan, New York, 1962). \n\n[3] Plaut D.C., Nowlan S.J. and Hinton G.E., Tech. Report CMU-CS-86-126, Carnegie-Mellon University (1986). \n\n[4] Le Cun Y., Proc. Cognitiva 85, 593 (1985). \n\n[5] Rujan P. and Marchand M., in Proc. of the First International Joint Conference on Neural Networks, Washington D.C., 1989, Vol. II, pp. 105, and to appear in Complex Systems. \n\n[6] Mezard M. and Nadal J.P., J. Phys. A 22, 2191 (1989). \n\n[7] Krogh A., Thorbergsson G.I. 
and Hertz J.A., in these Proceedings; Rohwer R., to appear in Proc. of DANIP, GMD Bonn, April 1989, J. Kinderman and A. Linden eds.; Saad D. and Merom E., preprint (1989). \n\n[8] Widrow B. and Winter R., Computer 21, No. 3, 25 (1988). \n\n[9] See e.g. Cameron S.H., IEEE TEC EC-13, 299 (1964); Hopcroft J.E. and Mattson R.L., IEEE TEC EC-14, 552 (1965). \n\n[10] Honavar V. and Uhr L., in Proc. of the 1988 Connectionist Models Summer School, Touretzky D., Hinton G. and Sejnowski T. eds. (Morgan Kaufmann, San Mateo, 1988). \n\n[11] Grossman T., to be published in Complex Systems (1990). \n\n[12] Krauth W. and Mezard M., J. Phys. A 20, L745 (1988). \n\n[13] Venkatesh S., preprint (1989); Amaldi E. and Nicolis S., J. Phys. France 50, 2333 (1989); Kohler H., Diederich S., Kinzel W. and Opper M., preprint (1989). \n\n[14] Tesauro G. and Janssen H., Complex Systems 2, 39 (1988). \n\n[15] Nabutovski D., unpublished. ", "award": [], "sourceid": 205, "authors": [{"given_name": "Tal", "family_name": "Grossman", "institution": null}]}