{"title": "Discovering Discrete Distributed Representations with Iterative Competitive Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 627, "page_last": 634, "abstract": null, "full_text": "Discovering Discrete Distributed Representations with Iterative Competitive Learning \n\nMichael C. Mozer \n\nDepartment of Computer Science \nand Institute of Cognitive Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nCompetitive learning is an unsupervised algorithm that classifies input patterns into mutually exclusive clusters. In a neural net framework, each cluster is represented by a processing unit that competes with others in a winner-take-all pool for an input pattern. I present a simple extension to the algorithm that allows it to construct discrete, distributed representations. Discrete representations are useful because they are relatively easy to analyze and their information content can readily be measured. Distributed representations are useful because they explicitly encode similarity. The basic idea is to apply competitive learning iteratively to an input pattern, and after each stage to subtract from the input pattern the component that was captured in the representation at that stage. This component is simply the weight vector of the winning unit of the competitive pool. The subtraction procedure forces competitive pools at different stages to encode different aspects of the input. The algorithm is essentially the same as a traditional data compression technique known as multistep vector quantization, although the neural net perspective suggests potentially powerful extensions to that approach. 
\n\n1 INTRODUCTION \n\nCompetitive learning (Grossberg, 1976; Kohonen, 1982; Rumelhart & Zipser, 1985; von der Malsburg, 1973) is an unsupervised algorithm that classifies input patterns into mutually exclusive clusters. In a neural net framework, each cluster is represented by a processing unit that competes with others in a winner-take-all pool for each input pattern. Competitive learning thus constructs a local representation in which a single unit is activated in response to an input. I present a simple extension to the algorithm that allows it to construct discrete, distributed representations. Discrete representations are useful because they are relatively easy to analyze and their information content can readily be measured. Distributed representations are useful because they explicitly encode similarity. I begin by describing the standard competitive learning algorithm. \n\n2 COMPETITIVE LEARNING \n\nConsider a two-layer network with $\\alpha$ input units and $\\beta$ competitive units. Each competitive unit represents a different classification of the input. The competitive units are activated by the input units and are connected in a winner-take-all pool such that a single competitive unit becomes active. Formally, \n\n$$y_i = \\begin{cases} 1 & \\text{if } |w_i - x| \\le |w_j - x| \\text{ for all } j \\\\ 0 & \\text{otherwise,} \\end{cases}$$ \n\nwhere $y_i$ is the activity of competitive unit $i$, $x$ is the input activity vector, $w_i$ is the vector of connection strengths from the input units to competitive unit $i$, and $|\\cdot|$ denotes the L2 vector norm. The conventional weight update rule is: \n\n$$\\Delta w_i = \\epsilon \\, y_i (x - w_i),$$ \n\nwhere $\\epsilon$ is the step size. This algorithm moves each weight vector toward the center of a cluster of input patterns. \n\nThe algorithm attempts to develop the best possible representation of the input with only $\\beta$ discrete alternatives. 
This representation is simply the weight vector of the winning competitive unit, $w_{winner}$. What does it mean to develop the best representation? Following Durbin (1990), competitive learning can be viewed as performing gradient descent in the error measure \n\n$$E = - \\sum_{p=1}^{\\#\\text{items}} \\ln \\sum_{i=1}^{\\beta} e^{-|w_i - x(p)|^2 / T} \\qquad (1)$$ \n\nas $T \\to 0$, where $p$ is an index over patterns. $T$ is a parameter in a soft competitive learning model (Bridle, 1990; Rumelhart, in press) which specifies the degree of competition; the winner-take-all version of competitive learning is obtained at the limit of $T \\to 0$. \n\n3 EXTENDING COMPETITIVE LEARNING \n\nCompetitive learning constructs a local representation of the input. How might competitive learning be extended to construct distributed representations? One idea is to have several independent competitive pools, each of which may form its own partition of the input space. This often fails because all pools will discover the same partitioning if this partitioning is unequivocally better than others. Thus, we must force different pools to encode different components of the input. \n\nIn the one-pool competitive learning network, the component of the input not encoded is simply \n\n$$x' = x - w_{winner}.$$ \n\nIf competitive learning is reapplied with $x'$ instead of $x$, the algorithm is guaranteed to extract information not captured by the first pool of competitive units because this information has been subtracted out. This procedure can be invoked iteratively to capture different aspects of the input in an arbitrary number of competitive pools, hence the name iterative competitive learning or ICL. The same idea is at the heart of Sanger's (1989) and Hrycej's (1989) algorithms for performing principal components analysis. 
Whereas these algorithms discover continuous-valued feature dimensions, ICL is concerned with the discovery of discrete-valued features. Of course, the continuous features can be quantized to form discrete features, an idea that both Sanger and Hrycej explore, but there is a cost to this, as I elaborate later. \n\nTo formalize the ICL model, consider a network composed of an arbitrary number of stages (Figure 1). Each stage, $s$, consists of $\\alpha$ input units and $\\beta^{(s)}$ competitive units. Both the input and competitive units at a given stage feed activity to the input units at the next higher stage. The activity of the input units at stage 1, $x^{(1)}$, is given by the external input. At subsequent stages, $s$, \n\n$$x^{(s)} = x^{(s-1)} - [W^{(s-1)}]^T y^{(s-1)},$$ \n\nwhere $W$ and $y$ are as before with an additional index for the stage number. \n\nFigure 1: The Iterative Competitive Learning Model \n\nTo reconstruct the original input pattern from the activities of the competitive units, the components captured by the winning unit at each stage are simply summed together: \n\n$$x = \\sum_s [W^{(s)}]^T y^{(s)}. \\qquad (2)$$ \n\nA variant of ICL has been independently proposed by Ambros-Ingerson, Granger, and Lynch (1990). (I thank Todd Leen and Steve Rehfuss for bringing this work to my attention.) Their algorithm, inspired by a neurobiological model, is the same as ICL except for the competitive unit activation rule, which uses an inner product instead of a distance measure: \n\n$$y_i = \\begin{cases} 1 & \\text{if } x^T w_i \\ge 0 \\text{ and } x^T w_i \\ge x^T w_j \\text{ for all } j \\\\ 0 & \\text{otherwise.} \\end{cases}$$ 
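To make the stage recursion, the reconstruction rule (Equation 2), and the difference between the distance-based and inner-product activation rules concrete, here is a minimal NumPy sketch. The function names, the hand-picked weights, and the test vectors are illustrative assumptions and do not come from the original formulation; the sketch also assumes a threshold of zero in the inner-product rule.

```python
import numpy as np

def winner_by_distance(x, W):
    # ICL rule: the unit whose weight vector is nearest to x (L2 norm) wins.
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

def winner_by_inner_product(x, W):
    # Inner-product rule: the unit with the largest x.w wins, provided that
    # value is non-negative (the zero threshold is an assumption of this sketch).
    scores = W @ x
    i = int(np.argmax(scores))
    return i if scores[i] >= 0 else None

def icl_encode(x, stages):
    # stages: list of arrays, one per stage, each of shape (units, inputs).
    residual = np.asarray(x, dtype=float)
    code = []
    for W in stages:
        i = winner_by_distance(residual, W)
        code.append(i)
        residual = residual - W[i]  # pass the unencoded component onward
    return code

def icl_decode(code, stages):
    # Equation 2: sum the winning weight vectors over the stages.
    x_hat = np.zeros(stages[0].shape[1])
    for W, i in zip(stages, code):
        x_hat = x_hat + W[i]
    return x_hat

# Hand-picked weights for a two-stage, two-unit network (illustrative only).
stages = [np.array([[-1.0, 0.0], [1.0, 0.0]]),
          np.array([[0.0, -0.5], [0.0, 0.5]])]
code = icl_encode([1.0, -0.5], stages)
x_hat = icl_decode(code, stages)

# A case in which the two activation rules pick different winners.
W = np.array([[3.0, 0.0], [1.0, 1.0]])
x = np.array([1.0, 1.0])
d_win = winner_by_distance(x, W)
p_win = winner_by_inner_product(x, W)
```

With these weights the input (1, -.5) receives the code [1, 0] and is reconstructed exactly. In the second example the distance rule selects unit 1, whose weight vector equals the input, whereas the inner-product rule selects unit 0 because its inner product with the input is larger (3 versus 2).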
\n\nThe problem with the inner-product rule is that it is difficult to interpret what exactly the network is computing, e.g., what aspect of the input is captured by the winning unit, whether the input can be reconstructed from the resulting activity pattern, and what information is discarded. The ICL activation rule, in combination with the learning rule, has a clear computational justification by virtue of the underlying objective measure (Equation 1) that is being optimized. \n\nIt also turns out, much to my dismay, that ICL is virtually identical to a conventional technique in data compression known as multistep vector quantization (Gray, 1984). More on this later. \n\n4 A SIMPLE EXAMPLE \n\nConsider a set of four input patterns forming a rectangle in 2D space, located at (-1,-.5), (-1,.5), (1,-.5), and (1,.5), and an ICL network with two stages each containing two competitive units. The first stage discovers the primary dimension of variation, which lies along the x-axis. That is, the units develop weight vectors (-1,0) and (1,0). Removing this component from the input, the four points become (0,-.5), (0,.5), (0,-.5), (0,.5). Thus, the two points on the left side of the rectangle are collapsed together with the two points on the right side. The second stage of the network then discovers the secondary dimension of variation, which lies along the y-axis. \n\nThe response of the ICL network to each input pattern can be summarized by the set of competitive units, one per stage, that are activated. If the two units at each stage are numbered 0 and 1, four response patterns will be generated: {0,0}, {0,1}, {1,0}, {1,1}. Thus, ICL has discovered a two-bit code to represent the four inputs. The result will be the same if instead of just four inputs, the input environment consists of four clusters of points centered on the corners of the rectangle. 
In this case, the two-bit code will not describe each input uniquely, but it will distinguish the clusters. \n\n5 IMAGE COMPRESSION \n\nBecause ICL discovers compact and discrete codes, the algorithm should be useful for data and image compression. In such problems, a set of raw data must be transformed into a compact representation which can then be used to reconstruct the original data. ICL performs such a transformation, with the resulting code consisting of the competitive unit response pattern. The reconstruction is achieved by Equation 2. \n\nI experimented with a 600x460 pixel image having 8 bits of gray level information per pixel. ICL was trained on random 8x8 patches of the image for a total of 125,000 training trials. The network had 64 input units and 80 stages, each with two competitive units. The initial weights were random, selected from a Normal distribution with mean zero and standard deviation .0001. A fixed $\\epsilon$ of .01 was used. Figure 2 shows incoming connection strengths to the competitive units in the first nine stages. The connection strengths are depicted as an 8x8 grid of cells whose shading indicates the weight from the corresponding position in the image patch to the competitive unit. \n\nFigure 2: Input-to-Competitive Unit Connection Strengths at Stages 1-9 \n\nFollowing training, the image is compressed by dividing the image into nonoverlapping 8x8 patches, presenting each in turn to ICL, obtaining the compressed code, and then reconstructing the patch from the code. With an $s$ stage network and two units per stage, the compressed code contains $s$ bits. Thus, the number of bits per pixel in the compressed code is $s/(8 \\times 8)$. To obtain different levels of compression, the number of stages in ICL can be varied. 
Fortunately, this does not require retraining ICL because the features detected at each stage do not depend on the number of stages; the earlier stages capture the most significant variation in the input. Thus, if the network is trained with 80 stages, one can use just the first 32 to compress the image, achieving a .5 bit per pixel encoding. \n\nThe image used to train ICL was originally used in a neural net image compression study by Cottrell, Munro, and Zipser (1989). Their compression scheme used a three-layer back propagation autoencoder to map an image patch back to itself through a hidden layer. The hidden layer, with fewer units than the input layer, served as the encoding. Because hidden unit activities are continuous valued, it was necessary to quantize the activities. Using a standard measure of performance, the signal-to-noise ratio (the logarithm of the average energy relative to the average reconstruction error), ICL outperforms Cottrell et al.'s network (Table 1). \n\nThis result is not surprising. In the data compression literature, vector quantization approaches, which are similar to ICL, usually work better than transformation-based approaches (e.g., Cottrell et al., 1989; Sanger, 1989). The reason is that transformation-based approaches do not take quantization into account in the development of the code. That is, in transformation-based approaches, the training procedure, which discovers the code, and the quantization step, which turns this code into a form that can be used for digital data transmission or storage, are two distinct processes. In Cottrell et al.'s network, a hidden unit encoding is learned without considering the demands of quantization. There is no assurance that the quantized code will retain the information in the signal. In contrast, ICL takes quantization into account during training. 
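The patch-based compression procedure and the bits-per-pixel arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiment: the function names are hypothetical, the weights are random placeholders rather than trained values, the tiny test image is synthetic, and leftover rows or columns that do not fill a whole 8x8 patch are simply left at zero (the text does not say how image edges were handled).

```python
import numpy as np

PATCH = 8  # patch side length, as in the experiment described in the text

def bits_per_pixel(num_stages, units_per_stage=2, patch=PATCH):
    # Each stage contributes log2(units) bits to the code for one patch.
    return num_stages * np.log2(units_per_stage) / (patch * patch)

def encode_patch(x, stages, num_stages):
    # Use only the first num_stages stages; no retraining is involved.
    residual = np.asarray(x, dtype=float).ravel()
    code = []
    for W in stages[:num_stages]:
        i = int(np.argmin(np.linalg.norm(W - residual, axis=1)))
        code.append(i)
        residual = residual - W[i]
    return code

def decode_patch(code, stages):
    # Equation 2: sum the winning weight vectors, one per stage in the code.
    x_hat = np.zeros(stages[0].shape[1])
    for W, i in zip(stages, code):
        x_hat += W[i]
    return x_hat.reshape(PATCH, PATCH)

def compress_image(image, stages, num_stages):
    # Encode and immediately reconstruct each nonoverlapping patch.
    h, w = image.shape
    out = np.zeros((h, w))
    for r in range(0, h - h % PATCH, PATCH):
        for c in range(0, w - w % PATCH, PATCH):
            patch = image[r:r + PATCH, c:c + PATCH]
            code = encode_patch(patch, stages, num_stages)
            out[r:r + PATCH, c:c + PATCH] = decode_patch(code, stages)
    return out

# Placeholder weights: 80 stages, 2 units each, 64 inputs (NOT trained values).
rng = np.random.default_rng(0)
stages = [rng.normal(0.0, 1.0, size=(2, PATCH * PATCH)) for _ in range(80)]
image = rng.random((16, 24))

code = encode_patch(image[:PATCH, :PATCH], stages, 32)
out = compress_image(image, stages, 32)
rate_32 = bits_per_pixel(32)
rate_80 = bits_per_pixel(80)
```

Under these assumptions, truncating an 80-stage network to its first 32 stages gives a 32-bit code per patch, i.e. 32/64 = .5 bits per pixel, while using all 80 stages gives 1.25 bits per pixel, which matches the range of compression levels reported in Table 1.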
\n\nTable 1: Signal-to-Noise Ratio for Different Compression Levels \n\ncompression      Cottrell et al.   ICL \n1.25 bits/pixel  2.324             2.366 \n1 bit/pixel      2.170             2.270 \n.75 bits/pixel   1.746             2.146 \n.5 bits/pixel    not available     1.975 \n\n6 COMPARISON TO VECTOR QUANTIZATION APPROACHES \n\nAs I mentioned previously, ICL is essentially a neural net reformulation of a conventional data compression scheme called multistep vector quantization. However, adopting a neural net perspective suggests several promising variants of the approach. These variants result from viewing the encoding task as an optimization problem (i.e., finding weights that minimize Equation 1). I mention three variants: the first two are methods for finding the solution more efficiently and consistently, and the final one is a powerful extension to the algorithm that I believe has not yet been studied in the vector quantization literature. \n\n6.1 AVOIDING LOCAL OPTIMA \n\nAs Rumelhart and Zipser (1985) and others have noted, competitive learning experiences a serious problem from locally optimal solutions in which one competitive unit captures most or all of the input patterns while others capture none. To eliminate such situations, I have introduced a secondary error term whose purpose is to force the competitive units to win equally often: \n\n$$E_{sec} = \\sum_{i=1}^{\\beta} \\left( \\frac{1}{\\beta} - \\bar{y}_i \\right)^2,$$ \n\nwhere $\\bar{y}_i$ is the mean activity of competitive unit $i$ over trials. Based on the soft competitive learning model with $T > 0$, this yields the weight update rule \n\n$$\\Delta w_i = \\gamma \\, (x - w_i)(1 - \\beta \\bar{y}_i),$$ \n\nwhere $\\gamma$ is the step size. Because this constraint should not be part of the ultimate solution, $\\gamma$ must gradually be reduced to zero. In the image compression simulation, $\\gamma$ was set to .005 initially and was decreased by .0001 every 100 training trials. 
This is a more principled solution to the local optimum problem than the \"leaky learning\" idea suggested by Rumelhart and Zipser. It can also be seen as an alternative or supplement to the schemes proposed for selecting the initial code (weights) in the vector quantization literature. \n\n6.2 CONSTRAINTS ON THE WEIGHTS \n\nI have explored a further idea to increase the likelihood of converging on a good solution and to achieve more rapid convergence. The idea is based on two facts. First, in an optimal solution, the weight vector of a competitive unit should be the mean of the inputs captured by that unit. This gives rise to the second observation: beyond stage 1, the mean input, $\\bar{x}^{(s)}$, should be zero. \n\nIf the competitive pools contain two units, these facts lead to a strong constraint on the weights: \n\n$$0 = \\bar{x}^{(s)} = \\frac{\\sum_{p \\in PART_1} x^{(s)}(p) + \\sum_{p \\in PART_2} x^{(s)}(p)}{n_1 + n_2} = \\frac{n_1 w_1^{(s)} + n_2 w_2^{(s)}}{n_1 + n_2},$$ \n\nwhere $x^{(s)}(p)$ is the input vector in stage $s$ for pattern $p$, $PART_1$ and $PART_2$ are the two clusters of input patterns partitioned by the competitive units at stage $s-1$, and $n_1$ and $n_2$ are the number of elements in each cluster. \n\nThe consequence is that, in an optimal solution, \n\n$$w_1 = -\\frac{n_2}{n_1} w_2.$$ \n\n(This property is observed in Figure 2.) Constraining the weights in this manner, and performing gradient descent in the ratio $n_2/n_1$, as well as in the weight parameters themselves, the quality of the solution and the convergence rate are dramatically improved. \n\n6.3 GENERALIZING THE TRANSFORMATION BETWEEN STAGES \n\nAt each stage $s$, the winning competitive unit specifies a transformation of $x^{(s)}$ to obtain $x^{(s+1)}$. In ICL, this transformation is simply a translation. 
There is no reason why this could not be generalized to include rotation and dilation as well, i.e., \n\n$$x^{(s+1)} = T^{(s)}_{winner} x^{(s)},$$ \n\nwhere $T^{(s)}_{winner}$ is a transformation matrix that includes the translation specified by $w^{(s)}_{winner}$. (For this notation to be formally correct, $x$ must be augmented by an element having constant value 1 to allow for translations.) The rotation and dilation parameters can be learned via gradient descent search in the error measure given in Equation 1. Reconstruction involves inverting the sequence of transformations: \n\n$$x = [T^{(1)}_{winner}]^{-1} \\cdots [T^{(S)}_{winner}]^{-1} (0 \\; 0 \\; 0 \\cdots 1)^T,$$ \n\nwhere $S$ denotes the final stage. \n\nA simple example of a situation in which this generalized transformation can be useful is depicted in Figure 3. After subtracting out the component detected at stage 1, the two clusters may be rotated into alignment, allowing the second stage to capture the remaining variation in the input. Whether or not this extension proves useful has yet to be tested. However, the connectivity patterns in Figure 2 certainly suggest that factoring out variations in orientation might permit an even more compact representation of the input data. \n\nFigure 3: A Sample Input Space With Four Data Points \n\nAcknowledgements \n\nThis research was supported by NSF grant IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation. My thanks to Paul Smolensky for helpful comments on this work and to Gary Cottrell for providing the image data and associated software. \n\nReferences \n\nAmbros-Ingerson, J., Granger, G., & Lynch, G. (1990). Simulation of paleocortex performs hierarchical clustering. Science, 247, 1344-1348. \n\nBridle, J. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In D. S. 
Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 211-217). San Mateo, CA: Morgan Kaufmann. \n\nCottrell, G. W., Munro, P., & Zipser, D. (1989). Image compression by back propagation: An example of extensional programming. In N. Sharkey (Ed.), Models of cognition: A review of cognitive science (pp. 208-240). Norwood, NJ: Ablex. \n\nDurbin, R. (April, 1990). Principled competitive learning in both unsupervised and supervised networks. Poster presented at the conference on Neural Networks for Computing, Snowbird, Utah. \n\nGray, R. M. (1984). Vector quantization. IEEE ASSP Magazine, 4-29. \n\nGrossberg, S. (1976). Adaptive pattern classification and universal recoding. I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134. \n\nHrycej, T. (1989). Unsupervised learning by backward inhibition. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 170-175). Los Altos, CA: Morgan Kaufmann. \n\nKohonen, T. (1982). Clustering, taxonomy, and topological maps of patterns. In M. Lang (Ed.), Proceedings of the Sixth International Conference on Pattern Recognition (pp. 114-125). Silver Spring, MD: IEEE Computer Society Press. \n\nRumelhart, D. E. (in press). Connectionist processing and learning as statistical inference. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures, and applications. Hillsdale, NJ: Erlbaum. \n\nRumelhart, D. E., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75-112. \n\nSanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473. \n\nvon der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100. 
", "award": [], "sourceid": 338, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}]}