{"title": "Learning Lie Groups for Invariant Visual Perception", "book": "Advances in Neural Information Processing Systems", "page_first": 810, "page_last": 816, "abstract": null, "full_text": "Learning Lie Groups for Invariant Visual Perception* \n\nRajesh P. N. Rao and Daniel L. Ruderman \nSloan Center for Theoretical Neurobiology \nThe Salk Institute \nLa Jolla, CA 92037 \n{rao,ruderman}@salk.edu \n\nAbstract \n\nOne of the most important problems in visual perception is that of visual invariance: how are objects perceived to be the same despite undergoing transformations such as translations, rotations or scaling? In this paper, we describe a Bayesian method for learning invariances based on Lie group theory. We show that previous approaches based on first-order Taylor series expansions of inputs can be regarded as special cases of the Lie group approach, the latter being capable of handling in principle arbitrarily large transformations. Using a matrix-exponential based generative model of images, we derive an unsupervised algorithm for learning Lie group operators from input data containing infinitesimal transformations. The on-line unsupervised learning algorithm maximizes the posterior probability of generating the training data. We provide experimental results suggesting that the proposed method can learn Lie group operators for handling reasonably large 1-D translations and 2-D rotations. \n\n1  INTRODUCTION \nA fundamental problem faced by both biological and machine vision systems is the recognition of familiar objects and patterns in the presence of transformations such as translations, rotations and scaling. The importance of this problem was recognized early by visual scientists such as J. J. Gibson who hypothesized that \"constant perception depends on the ability of the individual to detect the invariants\" [6].  
Among computational neuroscientists, Pitts and McCulloch were perhaps the first to propose a method for perceptual invariance (\"knowing universals\") [12]. A number of other approaches have since been proposed [5, 7, 10], some relying on temporal sequences of input patterns undergoing transformations (e.g. [4]) and others relying on modifications to the distance metric for comparing input images to stored templates (e.g. [15]). \n\n*This research was supported by the Alfred P. Sloan Foundation. \n\nIn this paper, we describe a Bayesian method for learning invariances based on the notion of continuous transformations and Lie group theory. We show that previous approaches based on first-order Taylor series expansions of images [1, 14] can be regarded as special cases of the Lie group approach. Approaches based on first-order models can account only for small transformations due to their assumption of a linear generative model for the transformed images. The Lie approach on the other hand utilizes a matrix-exponential based generative model which can in principle handle arbitrarily large transformations once the correct transformation operators have been learned. Using Bayesian principles, we derive an on-line unsupervised algorithm for learning Lie group operators from input data containing infinitesimal transformations. Although Lie groups have previously been used in visual perception [2], computer vision [16] and image processing [9], the question of whether it is possible to learn these groups directly from input data has remained open.  
Our preliminary experimental results suggest that in the two examined cases of 1-D translations and 2-D rotations, the proposed method can learn the corresponding Lie group operators with a reasonably high degree of accuracy, allowing the use of these learned operators in transformation-invariant vision. \n\n2  CONTINUOUS TRANSFORMATIONS AND LIE GROUPS \nSuppose we have a point (in general, a vector) $I_0$ which is an element in a space F. Let $T I_0$ denote a transformation of the point $I_0$ to another point, say $I_1$. The transformation operator T is completely specified by its actions on all points in the space F. Suppose T belongs to a family of operators $\\mathcal{T}$. We will be interested in the cases where $\\mathcal{T}$ is a group, i.e. there exists a mapping $f : \\mathcal{T} \\times \\mathcal{T} \\to \\mathcal{T}$ from pairs of transformations to another transformation such that (a) f is associative, (b) there exists a unique identity transformation, and (c) for every $T \\in \\mathcal{T}$, there exists a unique inverse transformation of T. These properties seem reasonable to expect in general for transformations on images. \n\nContinuous transformations are those which can be made infinitesimally small. Due to their favorable properties as described below, we will be especially concerned with continuous transformation groups or Lie groups. Continuity is associated with both the transformation operators T and the group $\\mathcal{T}$. Each $T \\in \\mathcal{T}$ is assumed to implement a continuous mapping from $F \\to F$. To be concrete, suppose T is parameterized by a single real number x. Then, the group $\\mathcal{T}$ is continuous if the function $T(x) : \\mathbb{R} \\to \\mathcal{T}$ is continuous, i.e. any $T \\in \\mathcal{T}$ is the image of some $x \\in \\mathbb{R}$ and any continuous variation of x results in a continuous variation of T. Let T(0) be equivalent to the identity transformation. Then, as $x \\to 0$, the transformation T(x) gets arbitrarily close to identity.  
Its effect on $I_0$ can be written as (to first order in x): $T(x) I_0 \\approx (1 + xG) I_0$ for some matrix G which is known as the generator of the transformation group. A macroscopic transformation $I_1 = I(x) = T(x) I_0$ can be produced by chaining together a number of these infinitesimal transformations. For example, by dividing the parameter x into N equal parts and performing each transformation in turn, we obtain: \n\n$$I(x) = (1 + (x/N) G)^N I_0$$ (1) \n\nIn the limit $N \\to \\infty$, this expression reduces to the matrix exponential equation: \n\n$$I(x) = e^{xG} I_0$$ (2) \n\nwhere $I_0$ is the initial or \"reference\" input. Thus, each of the elements of our one-parameter Lie group can be written as: $T(x) = e^{xG}$. The generator G of the Lie group is related to the derivative of T(x) with respect to x: $\\frac{d}{dx} T = GT$. This suggests an alternate way of deriving Equation 2. Consider the Taylor series expansion of a transformed input I(x) in terms of a previous input I(0): \n\n$$I(x) = I(0) + \\frac{dI(0)}{dx} x + \\frac{d^2 I(0)}{dx^2} \\frac{x^2}{2} + \\cdots$$ (3) \n\nwhere x denotes the relative transformation between I(x) and I(0). Defining $\\frac{d}{dx} I = GI$ for some operator matrix G, we can rewrite Equation 3 as: $I(x) = e^{xG} I_0$ which is the same as Equation 2 with $I_0 = I(0)$. Thus, some previous approaches based on first-order Taylor series expansions [1, 14] can be viewed as special cases of the Lie group model. \n\n3  LEARNING LIE TRANSFORMATION GROUPS \n\nOur goal is to learn the generators G of particular Lie transformation groups directly from input data containing examples of infinitesimal transformations. Note that learning the generator of a transformation effectively allows us to remain invariant to that transformation (see below). 
We assume that during natural temporal sequences of images containing transformations, there are \"small\" image changes corresponding to deterministic sets of pixel changes that are independent of what the actual pixels are. The rearrangements themselves are universal as in, for example, image translations. The question we address is: can we learn the Lie group operator G given simply a series of \"before\" and \"after\" images? \n\nFigure 1: Network Architecture and Interpolation Function. (a) An implementation of the proposed approach to invariant vision involving two cooperating recurrent networks, one estimating transformations and the other estimating object features. The latter supplies the reference image I(0) to the transformation network. (b) A locally recurrent elaboration of the transformation network for implementing Equation 9. The network computes $e^{xG} I(0) = I(0) + \\sum_k (x^k G^k / k!) I(0)$. (c) The interpolation function Q used to generate training data (assuming periodic, band-limited signals). \n\nLet the $n \\times 1$ vector I(0) be the \"before\" image and I(x) the \"after\" image containing the infinitesimal transformation. Then, using results from the previous section, we can write the following stochastic generative model for images: \n\n$$I(x) = e^{xG} I(0) + n$$ (4) \n\nwhere n is assumed to be a zero-mean Gaussian white noise process with variance $\\sigma^2$. Since learning using this full exponential generative model is difficult due to multiple local minima, we restrict ourselves to transformations that are infinitesimal. The higher order terms then become negligible and we can rewrite the above equation in a more tractable form: \n\n$$\\Delta I = xGI(0) + n$$ (5) \n\nwhere $\\Delta I = I(x) - I(0)$ is the difference image. Note that although this model is linear, the generator G learned using infinitesimal transformations is the same matrix that is used in the exponential model. Thus, once learned, this matrix can be used to handle larger transformations as well (see experimental results). \n\nSuppose we are given M image pairs as data. We wish to find the $n \\times n$ matrix G and the transformations x which generated the data set. To do so, we take a Bayesian maximum a posteriori approach using Gaussian priors on x and G. The negative log of the posterior probability of generating the data is given by: \n\n$$E = -\\log P[G, x | I(x), I(0)] = \\frac{1}{2\\sigma^2} (\\Delta I - xGI(0))^T (\\Delta I - xGI(0)) + \\frac{1}{2\\sigma_x^2} x^2 + \\frac{1}{2} g^T C^{-1} g$$ (6) \n\nwhere $\\sigma_x^2$ is the variance of the zero-mean Gaussian prior on x, g is the $n^2 \\times 1$ vector form of G and C is the covariance matrix associated with the Gaussian prior on G. Extending this equation to multiple image data is accomplished straightforwardly by summing the data-driven term over the image pairs (we assume G is fixed for all images although the transformation x may vary). For the experiments, $\\sigma$, $\\sigma_x$ and C were chosen to be fixed scalar values but it may be possible to speed up learning and improve accuracy by choosing C based on some knowledge of what we expect for infinitesimal image transformations (for example, we may define each entry in C to be a function only of the distance between pixels associated with the entry and exploit the fact that C needs to be symmetric; the efficacy of this choice is currently under investigation). 
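As a rough illustration of Equation 6, the sketch below evaluates E for a single image pair in plain Python. It assumes the scalar prior used in the experiments, so the term involving C reduces to c_inv times the sum of squared entries of G. The names matvec, neg_log_posterior, sigma, sigma_x and c_inv are illustrative choices and do not come from the paper.

```python
def matvec(G, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(g * vj for g, vj in zip(row, v)) for row in G]


def neg_log_posterior(G, x, I0, Ix, sigma=1.0, sigma_x=1.0, c_inv=0.0001):
    # Assumption: scalar prior on G, i.e. C^{-1} is c_inv times the identity.
    GI0 = matvec(G, I0)
    # Residual of the first-order model: delta_I - x * G * I(0).
    resid = [(b - a) - x * g for a, b, g in zip(I0, Ix, GI0)]
    data_term = sum(r * r for r in resid) / (2.0 * sigma ** 2)
    prior_x = x * x / (2.0 * sigma_x ** 2)
    prior_G = 0.5 * c_inv * sum(g * g for row in G for g in row)
    return data_term + prior_x + prior_G
```

For several image pairs, only data_term would be summed over the pairs, since G is shared while x may differ from pair to pair.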
\n\nThe $n \\times n$ generator matrix G can be learned in an unsupervised manner by performing gradient descent on E, thereby maximizing the posterior probability of generating the data: \n\n$$\\dot{G} = -\\alpha \\frac{\\partial E}{\\partial G} = \\alpha (\\Delta I - xGI(0)) (xI(0))^T - \\alpha c(G)$$ (7) \n\nwhere $\\alpha$ is a positive constant that governs the learning rate and c(G) is the $n \\times n$ matrix form of the $n^2 \\times 1$ vector $C^{-1} g$. The learning rule for G above requires the value of x for the current image pair to be known. We can estimate x by performing gradient descent on E with respect to x (using a fixed previously learned value for G): \n\n$$\\dot{x} = -\\beta \\frac{\\partial E}{\\partial x} = \\beta (GI(0))^T (\\Delta I - xGI(0)) - \\frac{\\beta}{\\sigma_x^2} x$$ (8) \n\nThe learning process thus involves alternating between the fast estimation of x for the given image pair and the slower adaptation of the generator matrix G using this x. Figure 1 (a) depicts a possible network implementation of the proposed approach to invariant vision. The implementation, which is reminiscent of the division of labor between the dorsal and ventral streams in primate visual cortex [3], uses two parallel but cooperating networks, one estimating object identity and the other estimating object transformations. The object network is based on a standard linear generative model of the form: $I(0) = Ur + n_0$ where U is a matrix of learned object \"features\" and r is the feature vector for the object in I(0) (see, for example, [11, 13]). Perceptual constancy is achieved due to the fact that the estimate of object identity remains stable in the first network as the second network attempts to account for any transformations being induced in the image, appropriately conveying the type of transformation being induced in its estimate for x (see [14] for more details). 
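The alternation between Equations 7 and 8 can be sketched as two small update functions. This is a hedged illustration in plain Python, assuming simple Euler steps and a scalar prior on G (so that c(G) becomes c_inv times G). The function names, default step sizes and prior widths are hypothetical and are not taken from the paper.

```python
def matvec(G, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(g * vj for g, vj in zip(row, v)) for row in G]


def residual(G, x, I0, Ix):
    # delta_I - x * G * I(0): the error of the first-order model (Equation 5).
    GI0 = matvec(G, I0)
    return [(b - a) - x * g for a, b, g in zip(I0, Ix, GI0)]


def update_x(G, x, I0, Ix, beta=0.1, sigma_x=10.0):
    # One Euler step of Equation 8 with G held fixed.
    GI0 = matvec(G, I0)
    r = residual(G, x, I0, Ix)
    grad = sum(g * ri for g, ri in zip(GI0, r)) - x / sigma_x ** 2
    return x + beta * grad


def update_G(G, x, I0, Ix, alpha=0.1, c_inv=0.0001):
    # One Euler step of Equation 7 with x held fixed.
    # Assumption: scalar prior, so c(G) is c_inv * G.
    r = residual(G, x, I0, Ix)
    return [[G[i][j] + alpha * (r[i] * x * I0[j] - c_inv * G[i][j])
             for j in range(len(I0))]
            for i in range(len(r))]
```

In use, update_x would be iterated for each image pair before a single call to update_G, mirroring the fast estimation of x and the slower adaptation of G described above.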
\n\nThe estimation rule for x given above is based on a first-order model (Equation 5) and is therefore useful only for estimating small (infinitesimal) transformations. A more general rule for estimating larger transformations is obtained by performing gradient descent on the optimization function given by the matrix-exponential generative model (Equation 4): \n\n$$\\dot{x} = \\gamma (e^{xG} GI(0))^T (I(x) - e^{xG} I(0)) - \\frac{\\gamma}{\\sigma_x^2} x$$ (9) \n\nFigure 1 (b) shows a locally recurrent network implementation of the matrix exponential computation required by the above equation. \n\n4  EXPERIMENTAL RESULTS \nTraining Data and Interpolation Function. For the purpose of evaluating the algorithm, we generated synthetic training data by subjecting a randomly generated image (containing uniformly random pixel intensities) to a known transformation. Consider a given 1-D image I(0) with image pixels given by I(j), j = 1, ..., N. To be able to continuously transform I(0) sampled at discrete pixel locations by infinitesimal (sub-pixel) amounts, we need to employ an interpolation function. We make use of the Shannon-Whittaker theorem [8] stating that any band-limited signal I(j), with j being any real number, is uniquely specified by its sufficiently close equally spaced discrete samples. Assuming that our signal is periodic, i.e. I(j + N) = I(j) for all j, the Shannon-Whittaker theorem in one dimension can be written as: $I(j) = \\sum_{m=0}^{N-1} I(m) \\sum_{r=-\\infty}^{\\infty} \\mathrm{sinc}[\\pi (j - m - Nr)]$ where $\\mathrm{sinc}[x] = \\sin(x)/x$. After some algebraic manipulation and simplification, this can be reduced to: $I(j) = \\sum_{m=0}^{N-1} I(m) Q(j - m)$ where the interpolation function Q is given by: $Q(x) = (1/N)[1 + 2 \\sum_{p=1}^{N/2-1} \\cos(2\\pi px/N)]$. Figure 1 (c) shows this interpolation function. To translate I(0) by an infinitesimal amount $x \\in \\mathbb{R}$, we use: $I(j + x) = \\sum_{m=0}^{N-1} I(m) Q(j + x - m)$. Similarly, to rotate or translate 2-D images, we use the 2-D analog of the above. In addition to being able to generate images with known transformations, the interpolation function also allows one to derive an analytical expression for the Lie operator matrix directly from the derivative of Q. This allows us to evaluate the results of learning. Figure 2 (a) shows the analytically-derived G matrix for 1-D infinitesimal translations of 20-pixel images (bright pixels = positive values, dark = negative). Also shown alongside is one of the rows of G (row 10) representing the Lie operator centered on pixel 10. \n\nFigure 2: Learned Lie Operators for 1-D Translations. (a) Analytically-derived $20 \\times 20$ Lie operator matrix G, operator for the 10th pixel (10th row of G), and plot of real and imaginary parts of the eigenvalues of G. (b) Learned G matrix, 10th operator, and plot of eigenvalues of the learned matrix. \n\nLearning 1-D Translations. Figure 2 (b) shows the results of using Equation 7 and 50,000 training image pairs for learning the generator matrix for 1-D translations in 20-pixel images. The randomly generated first image of a training pair was translated left or right by 0.5 pixels ($C^{-1} = 0.0001$ and the learning rate $\\alpha = 0.4$ was decreased by a factor of 1.0001 after each training pair). Note that as expected for translations, the rows of the learned G matrix are identical except for a shift: the same differential operator (shown in Figure 2 (b)) is applied at each image location.  
A comparison of the eigenvalues of the learned matrix with those of the analytical matrix (Figure 2) suggests that the learning algorithm was able to learn a reasonably good approximation of the true generator matrix (to within an arbitrary multiplicative scaling factor). To further evaluate the learned matrix G, we ascertained whether G could be used to generate arbitrary translations of a given reference image using Equation 2. The results are encouraging as shown in Figure 3 (a), although we have noticed a tendency for the appearance of some artifacts in translated images if there is significant high-frequency content in the reference image. \n\nEstimating Large Transformations. The learned generator matrix can be used to estimate large translations in images using Equation 9. Unfortunately, the optimization function can contain local minima (Figure 3 (b)). The local minima however tend to be shallow and of approximately the same value, with a unique well-defined global minimum. We therefore searched for the global minimum by performing gradient descent with several equally spaced starting values and picked the minimum of the estimated values after convergence. Figure 3 (c) shows results of this estimation process. \n\nFigure 3: Generating and Estimating Large Transformations. (a) An original reference image I(0) was translated to varying degrees by using the learned generator matrix G and varying x in Equation 2 (panels shown for x = 1.5, 4.5, 7.5, 10.5, 13.5 and x = -1.5, -4.5, -7.5, -10.5, -13.5). (b) The negative log likelihood optimization function for the matrix-exponential generative model (Equation 4) which was used for estimating large translations. The globally minimum value for x was found by using gradient descent with multiple starting points. (c) Comparison of estimated translation values with actual values (in parentheses) for different pairs of reference (I(0)) and translated (I(x)) images, shown in the form of a table: x = 19.9780 (20), x = -1.9780 (-2), x = 8.9787 (9), x = -7.9805 (-8), x = 2.9805 (3), x = -18.9805 (-19), x = 15.9775 (16), x = 26.9776 (27), x = 4.9774 (5). \n\nLearning 2-D Rotations. We have also tested the learning algorithm in 2-D images using image plane rotations. Training image pairs were generated by infinitesimally rotating images with random pixel intensities 0.2 radians clockwise or counterclockwise. The learned operator matrix (for three different spatial scales) is shown in Figure 4 (a). The accuracy of these matrices was tested by using them in Equation 2 for various rotations x. As shown in Figure 4 (b) for the $5 \\times 5$ case, the learned matrix appears to be able to rotate a given reference image between -180\u00b0 and +180\u00b0 about an initial position (for the larger rotations, some minor artifacts appear near the edges). \n\n5  CONCLUSIONS \nOur results suggest that it is possible for an unsupervised network to learn visual invariances by learning operators (or generators) for the corresponding Lie transformation groups. An important issue is how local minima can be avoided during the estimation of large transformations. Apart from performing multiple searches, one possibility is to use coarse-to-fine techniques, where transformation estimates obtained at a coarse scale are used as starting points for estimating transformations at finer scales (see, for example, [1]). A second possibility is to use stochastic techniques that exploit the specialized structure of the optimization function (Figure 3 (b)).  
Besides these directions of research, we are also investigating the use of structured priors on the generator matrix G to improve learning accuracy and speed. A concurrent effort involves testing the approach on more realistic natural image sequences containing a richer variety of transformations.^1 \n\n^1 The generative model in the case of multiple transformations is given by: $I(x) = e^{\\sum_i x_i G_i} I(0) + n$ where $G_i$ is the generator for the ith type of transformation and $x_i$ is the value of that transformation in the input image. \n\nFigure 4: Learned Lie Operators for 2-D Rotations. (a) The initial and converged values of the Lie operator matrix for 2-D rotations at three different scales ($3 \\times 3$, $5 \\times 5$ and $9 \\times 9$). (b) Examples of arbitrary rotations of a $5 \\times 5$ reference image I(0) generated by using the learned Lie operator matrix (although only results for integer-valued x between -4 and 4 are shown, rotations can be generated for any real-valued x). \n\nReferences \n[1]  M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. In Proc. of the Fourth European Conference on Computer Vision (ECCV), pages 329-342, 1996. \n\n[2]  P. C. Dodwell. The Lie transformation group model of visual perception. Perception and Psychophysics, 34(1):1-16, 1983. \n\n[3]  D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1:1-47, 1991. \n\n[4]  P. Foldiak. Learning invariance from transformation sequences. Neural Computation, 3(2):194-200, 1991. \n\n[5]  K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. 
Biological Cybernetics, 36:193-202, 1980. \n\n[6]  J. J. Gibson. The Senses Considered as Perceptual Systems. Houghton-Mifflin, Boston, 1966. \n\n[7]  Y. LeCun, B. Boser, J. S. Denker, B. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989. \n\n[8]  R. J. Marks II. Introduction to Shannon Sampling and Interpolation Theory. New York: Springer-Verlag, 1991. \n\n[9]  K. Nordberg. Signal representation and processing using operator groups. Technical Report Linkoping Studies in Science and Technology, Dissertations No. 366, Department of Electrical Engineering, Linkoping University, 1994. \n\n[10]  B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A multiscale dynamic routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience, 2:45-62, 1995. \n\n[11]  B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996. \n\n[12]  W. Pitts and W. S. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9:127-147, 1947. \n\n[13]  R. P. N. Rao and D. H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4):721-763, 1997. \n\n[14]  R. P. N. Rao and D. H. Ballard. Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2):219-234, 1998. \n\n[15]  P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems V, pages 50-58, San Mateo, CA, 1993. Morgan Kaufmann Publishers. \n\n[16]  L. Van Gool, T. Moons, E. Pauwels, and A. 
Oosterlinck. Vision and Lie's approach to invariance. Image and Vision Computing, 13(4):259-277, 1995. \n", "award": [], "sourceid": 1584, "authors": [{"given_name": "Rajesh", "family_name": "Rao", "institution": null}, {"given_name": "Daniel", "family_name": "Ruderman", "institution": null}]}