{"title": "Learning Continuous Attractors in Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 654, "page_last": 660, "abstract": "", "full_text": "Learning Continuous Attractors in \n\nRecurrent Networks \n\nH. Sebastian Seung \n\nBell Labs, Lucent Technologies \n\nMurray Hill, NJ 07974 \nseung~bell-labs.com \n\nAbstract \n\nOne approach to invariant object recognition employs a recurrent neu(cid:173)\nral network as an associative memory. In the standard depiction of the \nnetwork's state space, memories of objects are stored as attractive fixed \npoints of the dynamics. I argue for a modification of this picture: if an \nobject has a continuous family of instantiations, it should be represented \nby a continuous attractor. This idea is illustrated with a network that \nlearns to complete patterns. To perform the task of filling in missing in(cid:173)\nformation, the network develops a continuous attractor that models the \nmanifold from which the patterns are drawn. From a statistical view(cid:173)\npoint, the pattern completion task allows a formulation of unsupervised \nlearning in terms of regression rather than density estimation. \n\nA classic approach to invariant object recognition is to use a recurrent neural net(cid:173)\nwork as an associative memory[l]. In spite of the intuitive appeal and biological \nplausibility of this approach, it has largely been abandoned in practical applications. \nThis paper introduces two new concepts that could help resurrect it: object repre(cid:173)\nsentation by continuous attractors, and learning attractors by pattern completion. \nIn most models of associative memory, memories are stored as attractive fixed points \nat discrete locations in state space[l]. Discrete attractors may not be appropriate for \npatterns with continuous variability, like the images of a three-dimensional object \nfrom different viewpoints. 
When the instantiations of an object lie on a continuous pattern manifold, it is more appropriate to represent objects by attractive manifolds of fixed points, or continuous attractors. \nTo make this idea practical, it is important to find methods for learning attractors from examples. A naive method is to train the network to retain examples in short-term memory. This method is deficient because it does not prevent the network from storing spurious fixed points that are unrelated to the examples. A superior method is to train the network to restore examples that have been corrupted, so that it learns to complete patterns by filling in missing information. \n\nFigure 1: Representing objects by dynamical attractors. (a) Discrete attractors. (b) Continuous attractors. \n\nLearning by pattern completion can be understood from both dynamical and statistical perspectives. Since the completion task requires a large basin of attraction around each memory, spurious fixed points are suppressed. The completion task also leads to a formulation of unsupervised learning as the regression problem of estimating functional dependences between variables in the sensory input. \nDensity estimation, rather than regression, is the dominant formulation of unsupervised learning in stochastic neural networks like the Boltzmann machine[2]. Density estimation has the virtue of suppressing spurious fixed points automatically, but it also has the serious drawback of being intractable for many network architectures. Regression is a more tractable, but nonetheless powerful, alternative to density estimation. \nIn a number of recent neurobiological models, continuous attractors have been used to represent continuous quantities like eye position[3], direction of reaching[4], head direction[5], and orientation of a visual stimulus[6]. 
Along with these models, the present work is part of a new paradigm for neural computation based on continuous attractors. \n\n1 DISCRETE VERSUS CONTINUOUS ATTRACTORS \n\nFigure 1 depicts two ways of representing objects as attractors of a recurrent neural network dynamics. The standard way is to represent each object by an attractive fixed point[1], as in Figure 1a. Recall of a memory is triggered by a sensory input, which sets the initial conditions. The network dynamics converges to a fixed point, thus retrieving a memory. If different instantiations of one object lie in the same basin of attraction, they all trigger retrieval of the same memory, resulting in the many-to-one map required for invariant recognition. \nIn Figure 1b, each object is represented by a continuous manifold of fixed points. A one-dimensional manifold is shown, but generally the attractor should be multidimensional, and is parametrized by the instantiation or pose parameters of the object. For example, in visual object recognition, the coordinates would include the viewpoint from which the object is seen. \nThe reader should be cautioned that the term \"continuous attractor\" is an idealization and should not be taken too literally. In real networks, a continuous attractor is only approximated by a manifold in state space along which drift is very slow. This is illustrated by a simple example, a descent dynamics on a trough-shaped energy landscape[3]. If the bottom of the trough is perfectly level, it is a line of fixed points and an ideal continuous attractor of the dynamics. However, any slight imperfections cause slow drift along the line. This sort of approximate continuous attractor is what is found in real networks, including those trained by the learning algorithms to be discussed below. \n\nFigure 2: (a) Recurrent network. (b) Feedforward autoencoder. \n\n2 DYNAMICS OF MEMORY RETRIEVAL \n\nThe preceding discussion has motivated the idea of representing pattern manifolds by continuous attractors. This idea will be further developed with the simple network shown in Figure 2a, which consists of a visible layer x1 in R^n1 and a hidden layer x2 in R^n2. The architecture is recurrent, containing both bottom-up connections (the n2 x n1 matrix W21) and top-down connections (the n1 x n2 matrix W12). The vectors b1 and b2 represent the biases of the neurons. The neurons have a rectification nonlinearity [x]+ = max{x, 0}, which acts on vectors component by component. \nThere are many variants of recurrent network dynamics; a convenient choice is the following discrete-time version, in which updates of the hidden and visible layers alternate in time. After the visible layer is initialized with the input vector x1(0), the dynamics evolves as \n\nx2(t) = [b2 + W21 x1(t-1)]+ , \nx1(t) = [b1 + W12 x2(t)]+ .  (1) \n\nIf memories are stored as attractors, iteration of this dynamics can be regarded as memory retrieval. \nActivity circulates around the feedback loop between the two layers. One iteration of this loop is the map x1(t-1) -> x2(t) -> x1(t). This single iteration is equivalent to the feedforward architecture of Figure 2b. In the case where the hidden layer is smaller than the visible layer, this architecture is known as an autoencoder network[7]. Therefore the recurrent network dynamics (1) is equivalent to repeated iterations of the feedforward autoencoder. This is just the standard trick of unfolding the dynamics of a recurrent network in time, to yield an equivalent feedforward network with many layers[7]. 
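The alternating update (1) is simple to state concretely. The following numpy sketch implements the retrieval dynamics; the function names and the toy weight values are illustrative assumptions, not the trained network of the experiments below. \n\n```python
import numpy as np

def relu(x):
    # rectification nonlinearity [x]+ = max{x, 0}, applied componentwise
    return np.maximum(x, 0.0)

def retrieve(x1_init, W21, W12, b1, b2, T):
    # Iterate the alternating dynamics (1) for T time steps.
    # W21 is the n2 x n1 bottom-up matrix, W12 the n1 x n2 top-down matrix.
    x1 = x1_init
    for _ in range(T):
        x2 = relu(b2 + W21 @ x1)  # hidden layer update
        x1 = relu(b1 + W12 @ x2)  # visible layer update
    return x1, x2

# toy usage with small random weights (illustrative only)
rng = np.random.default_rng(0)
n1, n2 = 8, 4
W21 = 0.1 * rng.standard_normal((n2, n1))
W12 = 0.1 * rng.standard_normal((n1, n2))
b1, b2 = np.zeros(n1), np.zeros(n2)
x1, x2 = retrieve(rng.random(n1), W21, W12, b1, b2, T=2)
```
\nNote that because of the rectification, both layer states remain componentwise nonnegative throughout the iteration. 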
Because of the close relationship between the recurrent network of Figure 2a and the autoencoder of Figure 2b, it should not be surprising that learning algorithms for these two networks are also related, as will be explained below. \n\n3 LEARNING TO RETAIN PATTERNS \n\nLittle trace of an arbitrary input vector x1(0) remains after a few time steps of the dynamics (1). However, the network can retain some input vectors in short-term memory as \"reverberating\" patterns of activity. These correspond to fixed points of the dynamics (1); they are patterns that do not change as activity circulates around the feedback loop. \nThis suggests a formulation of learning as the optimization of the network's ability to retain examples in short-term memory. Then a suitable cost function is the squared difference |x1(T) - x1(0)|^2 between the example pattern x1(0) and the network's short-term memory x1(T) of it after T time steps. Gradient descent on this cost function can be done via backpropagation through time[7]. \nIf the network is trained with patterns drawn from a continuous family, then it can learn to perform the short-term memory task by developing a continuous attractor that lies near the examples it is trained on. When the hidden layer is smaller than the visible layer, the dimensionality of the attractor is limited by the size of the hidden layer. \nFor the case of a single time step (T = 1), training the recurrent network of Figure 2a to retain patterns is equivalent to training the autoencoder of Figure 2b by minimizing the squared difference between its input and output layers, averaged over the examples[8]. From the information-theoretic perspective, the small hidden layer in Figure 2b acts as a bottleneck between the input and output layers, forcing the autoencoder to learn an efficient encoding of the input. 
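As a concrete illustration of the retention cost, the numpy sketch below computes the mean squared difference between each example and its one-step short-term memory, which for T = 1 coincides with the reconstruction error of the autoencoder in Figure 2b. The helper names are assumptions for illustration; the paper minimizes this cost by backpropagation through time rather than by direct evaluation. \n\n```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def retention_cost(examples, W21, W12, b1, b2):
    # Mean squared retention error |x1(1) - x1(0)|^2 over the examples.
    # For a single time step (T = 1) this is exactly the reconstruction
    # error of the feedforward autoencoder in Figure 2b.
    total = 0.0
    for x0 in examples:
        x2 = relu(b2 + W21 @ x0)  # encode: visible -> hidden
        x1 = relu(b1 + W12 @ x2)  # decode: hidden -> visible
        total += np.sum((x1 - x0) ** 2)
    return total / len(examples)

# With equal-sized layers, identity weights, and nonnegative examples,
# the cost vanishes: this is the trivial solution discussed in Section 4.
n = 6
rng = np.random.default_rng(1)
examples = [rng.random(n) for _ in range(3)]
cost = retention_cost(examples, np.eye(n), np.eye(n), np.zeros(n), np.zeros(n))
```
\nThe identity-weight check in the usage line makes explicit why a bottleneck is needed: without one, zero cost is achieved by a network that has learned nothing. 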
\nFor the special case of a linear network, the nature of the learned encoding is understood completely. Then the input and output vectors are related by a simple matrix multiplication. The rank of the matrix is equal to the number of hidden units. The average distortion is minimized when this matrix becomes a projection operator onto the subspace spanned by the principal components of the examples[9]. From the dynamical perspective, the principal subspace is a continuous attractor of the dynamics (1). The linear network dynamics converges to this attractor in a single iteration, starting from any initial condition. Therefore we can interpret principal component analysis and its variants as methods of learning continuous attractors[10]. \n\n4 LEARNING TO COMPLETE PATTERNS \n\nLearning to retain patterns in short-term memory only works properly for architectures with a small hidden layer. The problem with a large hidden layer is evident when the hidden and visible layers are the same size, and the neurons are linear. Then the cost function for learning can be minimized by setting the weight matrices equal to the identity, W21 = W12 = I. For this trivial minimum, every input vector is a fixed point of the recurrent network (Figure 2a), and the equivalent feedforward network (Figure 2b) exactly realizes the identity map. Clearly these networks have not learned anything. \nTherefore in the case of a large hidden layer, learning to retain patterns is inadequate. Without the bottleneck in the architecture, there is no pressure on the feedforward network to learn an efficient encoding. Without constraints on the dimension of the attractor, the recurrent network develops spurious fixed points that have nothing to do with the examples. \nThese problems can be solved by a different formulation of learning based on the task of pattern completion. 
In the completion task of Figure 3a, the network is initialized with a corrupted version of an example. Learning is done by minimizing the completion error, which is the squared difference |x1(T) - d|^2 between the uncorrupted pattern d and the final visible vector x1(T). Gradient descent on completion error can be done with backpropagation through time[11]. \nThis new formulation of learning eliminates the trivial identity map solution mentioned above: while the identity network can retain any example, it cannot restore corrupted examples to their pristine form. The completion task forces the network to enlarge the basins of attraction of the stored memories, which suppresses spurious fixed points. It also forces the network to learn associations between variables in the sensory input. \n\nFigure 3: (a) Pattern retention versus completion. (b) Dynamics of pattern completion. \n\nFigure 4: (a) Locally connected architecture. (b) Receptive fields of hidden neurons. \n\n5 LOCALLY CONNECTED ARCHITECTURE \n\nExperiments were conducted with images of handwritten digits from the USPS database described in [12]. The example images were 16 x 16, with a gray scale ranging from 0 to 1. The network was trained on a specific digit class, with the goal of learning a single pattern manifold. Both the network architecture and the nature of the completion task were chosen to suit the topographic structure present in visual images. \nThe network architecture was given a topographic organization by constraining the synaptic connectivity to be local, as shown in Figure 4a. Both the visible and hidden layers of the network were 16 x 16. 
The visible layer represented an image, while the hidden layer was a topographic feature map. Each neuron had 5 x 5 receptive and projective fields, except for neurons near the edges, which had more restricted connectivity. \nIn the pattern completion task, example images were corrupted by zeroing the pixels inside a 9 x 9 patch chosen at a random location, as shown in Figure 3a. The location of the patch was randomized for each presentation of an example. The size of the patch was a substantial fraction of the 16 x 16 image, and much larger than the 5 x 5 receptive field size. This method of corrupting the examples gave the completion task a topographic nature, because it involved a set of spatially contiguous pixels. This topographic nature would have been lacking if the examples had been corrupted by, for example, the addition of spatially uncorrelated noise. \nFigure 3b illustrates the dynamics of pattern completion performed by a network trained on examples of the digit class \"two.\" The network is initialized with a corrupted example of a \"two.\" After the first iteration of the dynamics, the image is partially restored. The second iteration leads to superior restoration, with further sharpening of the image. The \"filling in\" phenomenon is also evident in the hidden layer. \nThe network was first trained on a retrieval dynamics of one iteration. The resulting biases and synaptic weights were then used as initial conditions for training on a retrieval dynamics of two iterations. The hidden layer developed into a topographic feature map suitable for representing images of the digit \"two.\" Figure 4b depicts the bottom-up receptive fields of the 256 hidden neurons. The top-down projective fields of these neurons were similar, but are not shown. 
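The corruption procedure described above can be sketched directly. In the numpy fragment below, the function names are illustrative; only the 16 x 16 image size and the 9 x 9 zeroed patch follow the text. \n\n```python
import numpy as np

def corrupt(image, patch=9, rng=None):
    # Zero the pixels inside a patch x patch block at a random location,
    # as in the completion task of Figure 3a.
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    r = rng.integers(0, h - patch + 1)  # top-left corner, kept in bounds
    c = rng.integers(0, w - patch + 1)
    out = image.copy()
    out[r:r + patch, c:c + patch] = 0.0
    return out

def completion_error(restored, uncorrupted):
    # squared difference |x1(T) - d|^2 used as the learning cost
    return np.sum((restored - uncorrupted) ** 2)

rng = np.random.default_rng(0)
img = rng.random((16, 16))       # stands in for a 16 x 16 digit image
corrupted = corrupt(img, rng=rng)
```
\nBecause the patch location is resampled on each call, every presentation of an example removes a different spatially contiguous block, which is what gives the task its topographic character. 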
\nThis feature map is distinct from others[13] because of its use of top-down and bottom-up connections in a feedback loop. The bottom-up connections analyze images into their constituent features, while the top-down connections synthesize images by composing features. The features in the top-down connections can be regarded as a \"vocabulary\" for synthesis of images. Since not all combinations of features are proper patterns, there must be some \"grammatical\" constraints on their combination. The network's ability to complete patterns suggests that some of these constraints are embedded in the dynamical equations of the network. Therefore the relaxation dynamics (1) can be regarded as a process of massively parallel constraint satisfaction. \n\n6 CONCLUSION \n\nI have argued that continuous attractors are a natural representation for pattern manifolds. One method of learning attractors is to train the network to retain examples in short-term memory. This method is equivalent to autoencoder learning, and does not work if the number of hidden units is large. A better method is to train the network to complete patterns. For a locally connected network, this method was demonstrated to learn a topographic feature map. The trained network is able to complete patterns, indicating that syntactic constraints on the combination of features are embedded in the network dynamics. \nEmpirical evidence that the network has indeed learned a continuous attractor is obtained by local linearization of the network dynamics (1). The linearized dynamics has many eigenvalues close to unity, indicating the existence of an approximate continuous attractor. Learning with an increased number of iterations in the retrieval dynamics should improve the quality of the approximation. \nThere is only one aspect of the learning algorithm that is specifically tailored for continuous attractors. 
This aspect is the limitation of the retrieval dynamics (1) to a few iterations, rather than iterating it all the way to a true fixed point. As mentioned earlier, a continuous attractor is only an idealization; in a real network it does not consist of true fixed points, but is just a manifold to which relaxation is fast and along which drift is slow. Adjusting the shape of this manifold is the goal of learning; the exact locations of the true fixed points are not relevant. \nThe use of a fast retrieval dynamics removes one long-standing objection to attractor neural networks, which is that true convergence to a fixed point takes too long. If all that is desired is fast relaxation to an approximate continuous attractor, attractor neural networks are not much slower than feedforward networks. \nIn the experiments discussed here, learning was done with backpropagation through time. Contrastive Hebbian learning[14] is a simpler alternative. Part of the image is held clamped, the missing values are filled in by convergence to a fixed point, and an anti-Hebbian update is made. Then the missing values are clamped at their correct values, the network converges to a new fixed point, and a Hebbian update is made. This procedure has the disadvantage of requiring true convergence to a fixed point, which can take many iterations. It also requires symmetric connections, which may be a representational handicap. \nThis paper addressed only the learning of a single attractor to represent a single pattern manifold. The problem of learning multiple attractors to represent multiple pattern classes will be discussed elsewhere, along with the extension to network architectures with many layers. \n\nAcknowledgments This work was supported by Bell Laboratories. I thank J. J. Hopfield, D. D. Lee, L. K. Saul, N. D. Socci, H. Sompolinsky, and D. W. Tank for helpful discussions. \n\nReferences \n[1] J. J. Hopfield. 
Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79:2554-2558, 1982. \n[2] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985. \n[3] H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93:13339-13344, 1996. \n[4] A. P. Georgopoulos, M. Taira, and A. Lukashin. Cognitive neurophysiology of the motor cortex. Science, 260:47-52, 1993. \n[5] K. Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. J. Neurosci., 16:2112-2126, 1996. \n[6] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92:3844-3848, 1995. \n[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, 1986. \n[8] G. W. Cottrell, P. Munro, and D. Zipser. Image compression by back propagation: an example of extensional programming. In N. E. Sharkey, editor, Models of cognition: a review of cognitive science. Ablex, Norwood, NJ, 1989. \n[9] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53-58, 1989. \n[10] H. S. Seung. Pattern analysis and synthesis in attractor neural networks. In K.-Y. M. Wong, I. King, and D.-Y. Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, Singapore, 1997. Springer-Verlag. \n[11] F.-S. Tsung and G. W. Cottrell. Phase-space learning. Adv. Neural Info. Proc. Syst., 7:481-488, 1995. \n[12] Y. LeCun et al. 
Learning algorithms for classification: a comparison on handwritten digit recognition. In J.-H. Oh, C. Kwon, and S. Cho, editors, Neural networks: the statistical mechanics perspective, pages 261-276, Singapore, 1995. World Scientific. \n[13] T. Kohonen. The self-organizing map. Proc. IEEE, 78:1464-1480, 1990. \n[14] J. J. Hopfield, D. I. Feinstein, and R. G. Palmer. \"Unlearning\" has a stabilizing effect in collective memories. Nature, 304:158-159, 1983. \n", "award": [], "sourceid": 1369, "authors": [{"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}