{"title": "Learning by Choice of Internal Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": null, "full_text": "73 \n\nLEARNING BY CHOICE \n\nOF INTERNAL REPRESENTATIONS \n\nTal Grossman, Ronny Meir and Eytan Domany \n\nDepartment of Electronics, Weizmann Institute of Science \n\nRehovot 76100 Israel \n\nABSTRACT \n\nWe introduce a learning algorithm for multilayer neural net(cid:173)\nworks composed of binary linear threshold elements. Whereas ex(cid:173)\nisting algorithms reduce the learning process to minimizing a cost \nfunction over the weights, our method treats the internal repre(cid:173)\nsentations as the fundamental entities to be determined. Once a \ncorrect set of internal representations is arrived at, the weights are \nfound by the local aild biologically plausible Perceptron Learning \nRule (PLR). We tested our learning algorithm on four problems: \nadjacency, symmetry, parity and combined symmetry-parity. \n\nI. INTRODUCTION \n\nConsider a network of binary linear threshold elements i, whose state Si = \u00b11 \n\nis determined according to the rule \n\nSi = sign(L WijSj + Oi) \n\n. \n\nj \n\n(1) \n\nHere Wij is the (unidirectional) weight assigned to the connection from unit j to \ni; 0i is a local bias. We focus our attention on feed-forward networks in which N \nunits of the input layer determine the states of H units of a hidden layer; these, in \nturn, feed one or more output elements. \n\nFor a typical A vs B classification task such a network needs a single output, \nwith sout = + 1 (or -1) when the input layer is set in a state that belongs to catego~y \nA (or B) of input space. The basic problem of learning is to find an algorithm, that \nproduces weights which enable the network to perform this task. In the absence \nof hidden units learning can be accomplished by the PLR [Rosenblatt 1962], which \nwe now briefly Jcscribe. Consider j = 1, ... 
, N source units and a single target unit i. When the source units are set in any one of μ = 1, ..., M patterns, i.e. Sj = ξj^μ, we require that the target unit (determined using (1)) takes preassigned values ξi^μ. Learning takes place in the course of a training session. Starting from an arbitrary initial guess for the weights, an input ν is presented, resulting in the output taking some value Si^ν. Now modify every weight according to the rule\n\nWij → Wij + η (1 - Si^ν ξi^ν) ξi^ν ξj^ν    (2)\n\nwhere η > 0 is a parameter (ξ0^ν = 1 is used to modify the bias θ). Another input pattern is presented, and so on, until all inputs draw the correct output. The Perceptron convergence theorem [Rosenblatt 1962, Minsky and Papert 1969] states that the PLR will find a solution (if one exists) in a finite number of steps. However, of the 2^(2^N) possible partitions of input space only a small subset (less than 2^(N^2)/N!) is linearly separable [Lewis and Coates 1967], and hence soluble by single-layer perceptrons. To get around this, hidden units are added. Once a single hidden layer (with a large enough number of units) is inserted between input and output, every classification problem has a solution. But for such architectures the PLR cannot be implemented; when the network errs, it is not clear which connection is to blame for the error, and what corrective action is to be taken.\n\nBack-propagation [Rumelhart et al 1986] circumvents this \"credit-assignment\" problem by dealing only with networks of continuous-valued units, whose response function is also continuous (sigmoid). \"Learning\" consists of a gradient-descent type minimization of a cost function that measures the deviation of actual outputs from the required ones, in the space of weights Wij, θi.
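The single-unit PLR of Eqs. (1)-(2) can be sketched in a few lines; this is an illustrative reconstruction, not the authors' program, and the function name, signature and stopping rule are our own:

```python
import random

def plr_train(patterns, targets, eta=0.1, max_sweeps=1000, seed=0):
    """Perceptron Learning Rule for a single +-1 target unit.

    patterns: list of +-1 input tuples; targets: list of +-1 labels.
    Returns (weights, bias, converged).  Names/defaults are illustrative;
    the initial ranges follow section II(c) of the paper.
    """
    rng = random.Random(seed)
    n = len(patterns[0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]   # initial weights
    theta = rng.uniform(-0.05, 0.05)                 # initial bias
    for _ in range(max_sweeps):
        errors = 0
        for x, t in zip(patterns, targets):
            field = sum(wj * xj for wj, xj in zip(w, x)) + theta
            s = 1 if field >= 0 else -1              # Eq. (1)
            if s != t:
                # Eq. (2): Wij -> Wij + eta * (1 - s*t) * t * xj
                for j in range(n):
                    w[j] += eta * (1 - s * t) * t * x[j]
                theta += eta * (1 - s * t) * t       # bias uses x_0 = 1
                errors += 1
        if errors == 0:
            return w, theta, True                    # all patterns correct
    return w, theta, False
```

On a linearly separable training set (e.g. the two-input OR predicate in ±1 coding) the loop terminates with `converged = True`, as the convergence theorem guarantees.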
A new version of BP, \"back propagation of desired states\", which bears some similarity to our algorithm, has recently been introduced [Plaut 1987]. See also Le Cun [1985] and Widrow and Winter [1988] for related methods.\n\nOur algorithm views the internal representations associated with various inputs as the basic independent variables of the learning process. This is a conceptually plausible assumption; in the course of learning, a biological or artificial system should form maps and representations of the external world. Once such representations are formed, the weights can be found by simple and local Hebbian learning rules such as the PLR. Hence the problem of learning becomes one of searching for proper internal representations, rather than one of minimization. Failure of the PLR to converge to a solution is used as an indication that the current guess of internal representations needs to be modified.\n\nII. THE ALGORITHM\n\nIf we know the internal representations (e.g. the states taken by the hidden layer when patterns from the training set are presented), the weights can be found by the PLR. This way the problem of learning becomes one of choosing proper internal representations, rather than of minimizing a cost function by varying the values of the weights. To demonstrate our approach, consider the classification problem with output values S^out,μ = ξ^out,μ required in response to μ = 1, ..., M input patterns. If a solution is found, it first maps each input onto an internal representation generated on the hidden layer, which, in turn, produces the correct output. Now imagine that we are not supplied with the weights that solve the problem; however, the correct internal representations are revealed. That is, we are given a table with M rows, one for each input. Every row has H bits ξi^(h,μ), for i = 1, ..., H, specifying the state of the hidden layer obtained in response to input pattern μ.
One can now view each hidden-layer cell i as the target cell of the PLR, with the N inputs viewed as source. Given sufficient time, the PLR will converge to a set of weights Wij, connecting input unit j to hidden unit i, so that indeed the input-output association that appears in column i of our table will be realized. In a similar fashion, the PLR will yield a set of weights Wi, in a learning process that uses the hidden layer as source and the output unit as target. Thus, in order to solve the problem of learning, all one needs is a search procedure in the space of possible internal representations, for a table that can be used to generate a solution. Updating of weights can be done in parallel for the two layers, using the current table of internal representations. In the present algorithm, however, the process is broken up into four distinct stages:\n\n1. SETINREP: Generate a table of internal representations {ξi^(h,μ)} by presenting each input pattern from the training set and calculating the state of the hidden layer, using Eq. (1a), with the existing couplings Wij and θj.\n\n2. LEARN23: The hidden-layer cells are used as source, and the output as the target unit of the PLR. The current table of internal representations is used as the training set; the PLR tries to find appropriate weights Wi and θ to obtain the desired outputs. If a solution is found, the problem has been solved. Otherwise stop after I23 learning sweeps, and keep the current weights, to use in INREP.\n\n3. INREP: Generate a new table of internal representations, which, when used in (1b), yields the correct outputs. This is done by presenting the table sequentially, row by row, to the hidden layer. If for row ν the wrong output is obtained, the internal representation ξ^(h,ν) is changed.
Having the wrong output means that the \"field\" produced by the hidden layer on the output unit, h^(out,ν) = Σj Wj ξj^(h,ν), is either too large or too small. We then randomly pick a site j of the hidden layer and try to flip the sign of ξj^(h,ν); if h^(out,ν) changes in the right direction, we replace the entry of our table, i.e.\n\nξj^(h,ν) → -ξj^(h,ν)\n\nWe keep picking sites and changing the internal representation of pattern ν until the correct output is generated. We always generate the correct output this way, provided Σj |Wj| > |ξ^out| (as is the case for our learning process in LEARN23). This procedure ends with a modified table, which is our next guess of internal representations.\n\n4. LEARN12: Apply the PLR with the first layer serving as source, treating every hidden-layer site separately as target. Actually, when an input from the training set is presented to the first layer, we first check whether the correct result is produced on the output unit of the network. If we get a wrong overall output, we use the PLR for every hidden unit i, modifying the weights incident on i according to (2), using column i of the table as the desired states of this unit. If input ν does yield the correct output, we insert the current state of the hidden layer as the internal representation associated with pattern ν, and no learning steps are taken. We sweep in this manner through the training set, modifying the weights Wij (between input and hidden layer), the hidden-layer thresholds θi, and, as explained above, the internal representations. If the network has achieved error-free performance for the entire training set, learning is completed. If no solution has been found after I12 sweeps of the training set, we abort the PLR stage, keep the present values of Wij, θj, and start SETINREP again.\n\nThis is a fairly complete account of our procedure (see also Grossman et al [1988]).
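The INREP flips at the heart of the procedure can, for a single output unit, be sketched as below. This is only an illustrative reconstruction: the function name, the safety cap on attempted flips, and the explicit form of the acceptance test (accept a flip exactly when it moves the output field toward the required sign) are our own choices.

```python
import random

def inrep(table, w_out, theta_out, targets, seed=0):
    """INREP stage, single output: flip hidden-layer bits of each wrong row
    until the output field h = sum_j w_out[j]*row[j] + theta_out has the
    correct sign.  `table` is an M x H list of +-1 rows, modified in place.
    Sketch only; succeeds whenever sum(|w_out|) exceeds |theta_out| + 1.
    """
    rng = random.Random(seed)
    H = len(w_out)
    for nu, target in enumerate(targets):
        for _ in range(100 * H):                    # safety cap for this sketch
            h = sum(w * e for w, e in zip(w_out, table[nu])) + theta_out
            if h != 0 and (1 if h > 0 else -1) == target:
                break                               # row nu now gives correct output
            j = rng.randrange(H)                    # pick a random hidden site
            delta = -2 * w_out[j] * table[nu][j]    # change in h if bit j is flipped
            if target * delta > 0:                  # field moves the right way
                table[nu][j] = -table[nu][j]
    return table
```

Each accepted flip moves h monotonically toward the target sign, and a site whose bit is already aligned with the target is never flipped back, so at most H flips are accepted per row.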
There are a few details that need to be added.\n\na) The \"impatience\" parameters: I12 and I23, which are rather arbitrary, are introduced to guarantee that the PLR stage is aborted if no solution is found. This is necessary since it is not clear that a solution exists for the weights, given the current table of internal representations. Thus, if the PLR stage does not converge within the time limit specified, a new table of internal representations is formed. The parameters have to be large enough to allow the PLR to find a solution (if one exists) with sufficiently high probability. On the other hand, too large values are wasteful, since they force the algorithm to execute a long search even when no solution exists. Therefore the best values of the impatience parameters can be determined by optimizing the performance of the network; our experience indicates, however, that once a \"reasonable\" range of values is found, performance is fairly insensitive to the precise choice.\n\nb) Integer weights: In the PLR correction step, as given by Eq. (2), the size of ΔW is constant. Therefore, when using binary units, it can be scaled to unity (by setting η = 0.5) and one can use integer Wij's without any loss of generality.\n\nc) Optimization: The algorithm described uses several parameters, which should be optimized to get the best performance. These parameters are: I12 and I23 - see section (a) above; Imax - a time limit, i.e. an upper bound on the total number of training sweeps; and the PLR training parameters - i.e. the increments of the weights and thresholds during the PLR stage. In the PLR we used values of η ≈ 0.1 [see Eq. (2)] for the weights, and η ≈ 0.05 for thresholds, whereas the initial (random) values of all weights were taken from the interval (-0.5, 0.5), and thresholds from (-0.05, 0.05). In the integer-weights program, described above, these parameters are not used.
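Point (b) can be made concrete with a one-step sketch: for ±1 units the PLR correction η(1 - Sξ)ξξ' is 0 on a correct output and ±2η on an error, so choosing η = 0.5 makes every update exactly ±1 and integer weights stay integer. The helper below is our illustrative reconstruction, not the original program:

```python
def plr_step_integer(w, theta, x, target):
    """One integer PLR correction for +-1 units.

    With eta = 0.5 the Eq. (2) increment eta*(1 - s*t)*t*x_j is 0 on a
    correct output and exactly +-1 on an error, so integer weights remain
    integer.  Returns the updated (w, theta); names are illustrative.
    """
    field = sum(wj * xj for wj, xj in zip(w, x)) + theta
    s = 1 if field >= 0 else -1
    if s != target:                                      # (1 - s*t) = 2 only on error
        w = [wj + target * xj for wj, xj in zip(w, x)]   # +-1 increments
        theta += target                                  # bias update, x_0 = 1
    return w, theta
```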
d) Treating Multiple Outputs: In the version of INREP described above, we keep flipping the internal representations until we find one that yields the correct output, i.e. zero error for the given pattern. This is not always possible when using more than one output unit. Instead, we can allow only a pre-specified number of attempted flips, Iin, and go on to the next pattern even if vanishing error was not achieved. In this modified version we also introduce a slightly different, and less \"restrictive\", criterion for accepting or rejecting a flip. Having chosen (at random) a hidden unit i, we check the effect of flipping the sign of ξi^(h,ν) on the total output error, i.e. the number of wrong bits (and not on the output field, as described above). If the output error is not increased, the flip is accepted and the table of internal representations is changed accordingly.\n\nThis modified algorithm is applicable to multiple-output networks. Results of preliminary experiments with this version are presented in the next section.\n\nIII. PERFORMANCE OF THE ALGORITHM\n\nThe \"time\" parameter that we use for measuring performance is the number of sweeps through the training set of M patterns needed in order to find the solution; namely, how many times each pattern was presented to the network. In each cycle of the algorithm there are I12 + I23 such sweeps. For each problem, and each parameter choice, an ensemble of many independent runs, each starting with a different random choice of initial weights, is created. In general, when applying a learning algorithm to a given problem, there are cases in which the algorithm fails to find a solution within the specified time limit (e.g. when BP gets stuck in a local minimum), and it is impossible to calculate the ensemble average of learning times.
Therefore we calculate, as a performance measure, either the median number of sweeps, tm, or the \"inverse average rate\", τ, as defined in Tesauro and Janssen [1988].\n\nThe first problem we studied is contiguity: the system has to determine whether the number of clumps (i.e. contiguous blocks) of +1's in the input is, say, equal to 2 or 3. This is called [Denker et al 1987] the \"2 versus 3\" clumps predicate. We used, as our training set, all inputs that have 2 or 3 clumps, with learning cycles parametrized by I12 = 20 and I23 = 5. Keeping N = 6 fixed, we varied H; 500 cases were used for each data point of Fig. 1.\n\nFigure 1. Median number of sweeps tm, needed to train a network of N = 6 input units, over an exhaustive training set, to solve the \"2 vs 3\" clumps predicate, plotted against the number of hidden units H. Results for back-propagation [Denker et al 1987] (x) and this work (◇) are shown.\n\nIn the next problem, symmetry, one requires S^out = 1 for reflection-symmetric inputs and -1 otherwise. This can be solved with H ≥ 2 hidden units. Fig. 2 presents, for H = 2, the median number of exhaustive training sweeps needed to solve the problem, vs input size N. At each point 500 cases were run, with I12 = 10 and I23 = 5. We always found a solution in less than 200 cycles.\n\nFigure 2. Median number of sweeps tm, needed to train networks on symmetry (with H = 2).\n\nIn the Parity problem one requires S^out = 1 for an even number of +1 bits in the input, and -1 otherwise. In order to compare the performance of our algorithm to that of BP, we studied the Parity problem using networks with an architecture of N:2N:1, as chosen by Tesauro and Janssen [1988].\n\nWe used the integer version of our algorithm, briefly described above.
In this version of the algorithm the weights and thresholds are integers, and the increment size, for both thresholds and weights, is unity. As an initial condition, we chose them to be +1 or -1 randomly. In the simulation of this version, all possible input patterns were presented sequentially in a fixed order (within the perceptron learning sweeps). The results are presented in Table 1. For all choices of the parameters (I12, I23) that are mentioned in the table, our success rate was 100%; namely, the algorithm didn't fail even once to find a solution in less than the maximal number of training cycles Imax specified in the table. Results for BP, τ(BP) (from Tesauro and Janssen 1988), are also given in the table. Note that BP does get caught in local minima, but the percentage of such occurrences was not reported.\n\nFor testing the multiple-output version of the algorithm we used the combined parity and symmetry problem; the network has two output units, both connected to all hidden units. The first output unit performs the parity predicate on the input, and the second performs the symmetry predicate. The network architecture was N:2N:2 and the results for N = 4, ..., 7 are given in Table 2. The choice of parameters is also given in that table.\n\nTable 1. Parity with N:2N:1 architecture.\n\nN | (I12, I23)    | Imax | tm   | τ(CHIR) | τ(BP)\n3 | (8,4)         | 10   | 3    | 3       | 39\n4 | (9,3) (6,6)   | 20   | 4    | 4       | 75\n5 | (12,4) (9,6)  | 40   | 6    | 8       | 130\n6 | (12,4) (10,5) | 120  | 19   | 29      | 310\n7 | (12,4) (15,5) | 240  | 30   | 290     | 800\n8 | (20,10)       | 900  | 150  | 2900    | 2000\n9 | (20,10)       | 900  | 1300 | 2400    | -\n\nTable 2. Parity and Symmetry with N:2N:2 architecture.\n\nN | I12 | I23 | Iin | Imax | tm   | τ\n4 | 12  | 8   | 7   | 40   | 33   | 50\n5 | 14  | 7   | 7   | 400  | 900  | 350\n6 | 18  | 9   | 7   | 900  | 5250 | 925\n7 | 40  | 20  | 7   | 900  | 6000 | 2640\n\nIV. 
DISCUSSION \n\nWe have presented a learning algorithm for two-Iayerperceptrons, that searches \nfor internal representations of the training set, and determines the weights by the \nlocal, Hebbian perceptron learning rule. Learning by choice of internal represen(cid:173)\ntation may turn out to be most useful in situations where the \"teacher\" has some \ninformation about the desired internal representations. \n\nWe demonstrated that our algorithm works well on four typical problems, and \nstudied the manner in which training time varies with network size. Comparisons \nwith back-propagation were also made. it should be noted that a training sweep \ninvolves much less computations than that of back-propagation. We also presented \na generalization of the algorithm to networks with multiple outputs, and found \nthat it functions well on various problems of the same kind as discussed above. It \nappears that the modification needed to deal with multiple outputs also enables us \nto solve the learning problem for network architectures with more than one hidden \nlayer. \n\n\f80 \n\nGrossman, Meir and Domany \n\nAt this point we can offer only very limited discussion of the interesting ques(cid:173)\n\ntion - why does our algorithm work at all? That is, how come it finds correct \ninternal representations (e.g. \"tables\") while these constitute only a small fraction \nof the total possible number (2H2N)? The main reason is that our procedure ac(cid:173)\ntually does not search this entire space of tables. This large space contains a small \nsubspace, T, of \"target tables\", i.e. those that can be obtained, for all possible \nchoices of w{j and OJ, by rule (1), in response to presentation of the input patterns. \nAnother small subspace S, is that of the tables that can potentially produce the \nrequired output. Solutions of the learning problem constitute the space T n S. 
Our algorithm iterates between T and S, also executing a \"walk\" (induced by the modification of the weights due to the PLR) within each.\n\nAn appealing feature of our algorithm is that it can be implemented in a manner that uses only integer-valued weights and thresholds. This discreteness makes the analysis of the behavior of the network much easier, since we know the exact number of bits used by the system in constructing its solution, and do not have to worry about round-off errors. From a technological point of view, it may also be more feasible to work with integer weights in hardware implementations.\n\nWe are extending this work in various directions. The present method needs, in the learning stage, MH bits of memory: internal representations of all M training patterns are stored. This feature is biologically implausible and may be technologically limiting; we are developing a method that does not require such memory. Other directions of current study include extensions to networks with continuous variables, and to networks with feed-back.\n\nReferences\n\nDenker J., Schwartz D., Wittner B., Solla S., Hopfield J.J., Howard R. and Jackel L. 1987, Complex Systems 1, 877-922\n\nGrossman T., Meir R. and Domany E. 1988, Complex Systems, in press\n\nHebb D.O. 1949, The Organization of Behavior. (Wiley, New York)\n\nLe Cun Y. 1985, Proc. Cognitiva 85, 593\n\nLewis P.M. and Coates C.L. 1967, Threshold Logic. (Wiley, New York)\n\nMinsky M. and Papert S. 1988, Perceptrons. (MIT Press, Cambridge)\n\nPlaut D.C., Nowlan S.J. and Hinton G.E. 1987, Tech. Report CMU-CS-86-126\n\nRosenblatt F. 1962, Principles of Neurodynamics. (Spartan, New York)\n\nRumelhart D.E., Hinton G.E. and Williams R.J. 1986, Nature 323, 533-536\n\nTesauro G. and Janssen H. 1988, Complex Systems 2, 39\n\nWidrow B. and Winter R. 
1988, Computer 21, No. 3, 25\n", "award": [], "sourceid": 118, "authors": [{"given_name": "Tal", "family_name": "Grossman", "institution": null}, {"given_name": "Ronny", "family_name": "Meir", "institution": null}, {"given_name": "Eytan", "family_name": "Domany", "institution": null}]}