{"title": "Linear Operator for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 452, "page_last": 459, "abstract": null, "full_text": "Linear Operator for Object Recognition \n\nRonen Bssri \n\nShimon Ullman\u00b7 \n\nM.I.T. Artificial Intelligence Laboratory \n\nand Department of Brain and Cognitive Science \n\n545 Technology Square \nCambridge, MA 02139 \n\nAbstract \n\nVisual object recognition involves the identification of images of 3-D ob(cid:173)\njects seen from arbitrary viewpoints. We suggest an approach to object \nrecognition in which a view is represented as a collection of points given \nby their location in the image. An object is modeled by a set of 2-D views \ntogether with the correspondence between the views. We show that any \nnovel view of the object can be expressed as a linear combination of the \nstored views. Consequently, we build a linear operator that distinguishes \nbetween views of a specific object and views of other objects. This opera(cid:173)\ntor can be implemented using neural network architectures with relatively \nsimple structures. \n\n1 \n\nIntroduction \n\nVisual object recognition involves the identification of images of 3-D objects seen \nfrom arbitrary viewpoints. In particular, objects often appear in images from previ(cid:173)\nously unseen viewpoints. In this paper we suggest an approach to object recognition \nin which rigid objects are recognized from arbitrary viewpoint. The method can be \nimplemented using neural network architectures with relatively simple structures. \nIn our approach a view is represented as a collection of points given by their loca(cid:173)\ntion in the image, An object is modeled by a small set of views together with the \ncorrespondence between these views. We show that any novel view of the object \n\n\u2022 Also, Weizmann Inst. of Science, Dept. 
of Applied Math., Rehovot 76100, Israel \n\ncan be expressed as a linear combination of the stored views. Consequently, we build a linear operator that distinguishes views of a specific object from views of other objects. This operator can be implemented by a neural network. \n\nThe method has several advantages. First, it handles rigid objects correctly, but is not restricted to such objects. Second, there is no need in this scheme to explicitly recover and represent the 3-D structure of objects. Third, the computations involved are often simpler than in previous schemes. \n\n2 Previous Approaches \n\nObject recognition involves a comparison of a viewed image against object models stored in memory. Many existing schemes for object recognition accomplish this task by performing a template comparison between the image and each of the models, often after compensating for certain variations due to the different positions and orientations in which the object is observed. Such an approach is called alignment (Ullman, 1989), and a similar approach is used in (Fischler & Bolles 1981, Lowe 1985, Faugeras & Hebert 1986, Chien & Aggarwal 1987, Huttenlocher & Ullman 1987, Thompson & Mundy 1987). \n\nThe majority of alignment schemes use object-centered representations to model the objects. In these models the 3-D structure of the objects is explicitly represented. The acquisition of models in these schemes therefore requires a separate process to recover the 3-D structure of the objects. \n\nA number of recent studies use 2-D viewer-centered representations for object recognition. Abu-Mostafa & Psaltis (1987), for instance, developed a neural network that continuously collects and stores the observed views of objects. When a new view is observed it is recognized if it is sufficiently similar to one of the previously seen views. 
The system is very limited in its ability to recognize objects from novel views. It does not use the information available from a collection of object views to extend the range of recognizable views beyond the range determined by each of the stored views separately. \n\nIn the scheme below we suggest a different kind of viewer-centered representation to model the objects. An object is modeled by a set of its observed images with the correspondence between points in the images. We show that only a small number of images is required to predict the appearance of the object from all possible viewpoints. These predictions are exact for rigid objects, but are not confined to such objects. We also suggest a neural network to implement the scheme. \n\nA similar representation was recently used by Poggio & Edelman (1990) to develop a network that recognizes objects using radial basis functions (RBFs). The approach presented here has several advantages over this approach. First, by using the linear combinations of the stored views rather than applying radial basis functions to them, we obtain exact predictions for the novel appearances of objects rather than an approximation. Moreover, a smaller number of views is required in our scheme to predict the appearance of objects from all possible views. For example, for a rigid object that introduces no self-occlusion (such as a wire object), predicting its appearance from all possible views requires only three views under the LC scheme and about sixty views under the RBF scheme. \n\n3 The Linear Combinations (LC) Scheme \n\nIn this section we introduce the Linear Combinations (LC) Scheme. Additional details about the scheme can be found in (Ullman & Basri, 1991). Our approach is based on the following observation. 
For many continuous transformations of interest in recognition, such as 3-D rotation, translation, and scaling, every possible view of a transforming object can be expressed as a linear combination of other views of the object. In other words, the set of possible images of an object undergoing rigid 3-D transformations and scaling is embedded in a linear space, spanned by a small number of 2-D images. \n\nWe start by showing that any image of an object undergoing rigid transformations followed by an orthographic projection can be expressed as a linear combination of a small number of views. The coefficients of this combination may differ for the x- and y-coordinates. That is, the intermediate view of the object may be given by two linear combinations, one for the x-coordinates and the other for the y-coordinates. In addition, certain functional restrictions may hold among the different coefficients. \n\nWe represent an image by two coordinate vectors, one containing the x-values of the object's points, and the other containing their y-values. In other words, an image P is described by x = (x1, ..., xn) and y = (y1, ..., yn), where every (xi, yi), 1 ≤ i ≤ n, is an image point. The order of the points in these vectors is preserved in all the different views of the same object; namely, if P and P' are two views of the same object, then (xi, yi) in P and (x'i, y'i) in P' are in correspondence (in other words, they are the projections of the same object point). \n\nClaim: The set of coordinate vectors of an object obtained from all different viewpoints is embedded in a 4-D linear space. (A proof is given in Appendix A.) \n\nFollowing this claim we can represent the entire space of views of an object by a basis that consists of any four linearly independent vectors taken from the space. In particular, we can construct a basis using familiar views of the object. 
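The claim can be checked numerically. The following sketch (ours, not part of the original paper; it assumes NumPy) builds a random rigid object, generates orthographic views under random rotation, scale, and translation, and verifies that each image coordinate vector lies in the span of {x, y, z, 1}:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rigid object: n points with object coordinate vectors x, y, z.
n = 20
x, y, z = rng.standard_normal((3, n))
ones = np.ones(n)

# Basis of the claimed 4-D space: the columns x, y, z, 1.
B = np.stack([x, y, z, ones], axis=1)            # shape (n, 4)

def random_view(rng):
    # Random orthogonal matrix (a rotation, possibly with a reflection),
    # random scale and translation, followed by orthographic projection.
    R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    s = rng.uniform(0.5, 2.0)
    tx, ty = rng.uniform(-1.0, 1.0, size=2)
    rotated = s * R @ np.stack([x, y, z])        # shape (3, n)
    return rotated[0] + tx, rotated[1] + ty      # image x- and y-vectors

# Each view's coordinate vectors are (numerically) exact members of span(B).
for _ in range(5):
    for v in random_view(rng):
        coef, *_ = np.linalg.lstsq(B, v, rcond=None)
        assert np.allclose(B @ coef, v)
```

The least-squares residual vanishes because each image vector is, by the proof in Appendix A, an exact linear combination of the four basis columns.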
Two images supply four such vectors and therefore are often sufficient to span the space. By considering the linear combinations of the model vectors we can reproduce any possible view of the object. \n\nIt is important to note that the set of views of a rigid object does not occupy the entire linear 4-D space. Rather, the coefficients of the linear combinations reproducing valid images satisfy, in addition, two quadratic constraints. (See Appendix A.) In order to verify that an object undergoes a rigid transformation (as opposed to a general 3-D affine transformation) the model must consist of at least three snapshots of the object. \n\nMany 3-D rigid objects are bounded by smooth curved surfaces. The contours of such objects change their position on the object whenever the viewing position is changed. The linear combinations scheme can be extended to handle these objects as well. In these cases the scheme gives accurate approximations to the appearance of the objects (Ullman & Basri, 1991). \n\nThe linear combination scheme assumes that the same object points are visible in the different views. When the views are sufficiently different, this will no longer hold, due to self-occlusion. To represent an object from all possible viewing directions (e.g., both \"front\" and \"back\"), a number of different models of this type will be required. This notion is similar to the use of different object aspects suggested by Koenderink & Van Doorn (1979). (Other aspects of occlusion are discussed in the next section.) \n\n4 Recognizing an Object Using the LC Scheme \n\nIn the previous section we have shown that the set of views of a rigid object is embedded in a linear space of a small dimension. In this section we define a linear operator that uses this property to recognize objects. We then show how this operator can be used in the recognition process. \n\nLet p1, ... 
, pk be the model views, and p be a novel view of the same object. According to the previous section there exist coefficients a1, ..., ak such that p = a1 p1 + ... + ak pk. Suppose L is a linear operator such that L pi = q for every 1 ≤ i ≤ k and some constant vector q; then L transforms p to q (up to a scale factor): Lp = (a1 + ... + ak) q. If in addition L transforms vectors outside the space spanned by the model to vectors other than q, then L distinguishes views of the object from views of other objects. The vector q then serves as a \"name\" for the object. It can either be the zero vector, in which case L transforms every novel view of the object to zero, or it can be a familiar view of the object, in which case L has an associative property; namely, it takes a novel view of an object and transforms it to a familiar view. A constructive definition of L is given in Appendix B. \n\nThe core of the recognition process we propose includes a neural network that implements the linear operator defined above. The input to this network is a coordinate vector created from the image, and the output is an indication of whether the image is in fact an instance of the modeled object. The operator can be implemented by a simple, one-layer neural network with only feedforward connections, of the type presented by Kohonen, Oja, & Lehtio (1981). It is interesting to note that this operator can be modified to recognize several models in parallel. \n\nTo apply this network to the image, the image should first be represented by its coordinate vectors. The construction of the coordinate vectors from the image can be implemented using cells with linear response properties, of the type of cells encoding eye positions found by Zipser & Andersen (1988). The positions obtained should be ordered according to the correspondence of the image points with the model points. 
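The constructive definition of L (Appendix B) can be illustrated with a small numerical sketch, again assuming NumPy; the random basis completion and the choice of name vector q below are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3                                  # vector length, number of model views

# Model coordinate vectors p1..pk as columns, assumed linearly independent.
P_model = rng.standard_normal((n, k))

# Complete them to a full basis p_{k+1}..p_n of R^n (random completion).
P_rest = rng.standard_normal((n, n - k))
P = np.hstack([P_model, P_rest])
assert np.linalg.matrix_rank(P) == n         # P is invertible

# "Name" vector q; choosing a familiar view gives the associative variant.
q = P_model[:, 0]
Q = np.hstack([np.tile(q[:, None], (1, k)), P_rest])

# Recognition matrix: L P = Q  =>  L = Q P^{-1}.
L = Q @ np.linalg.inv(P)

# Any linear combination of the model views is mapped to a multiple of q.
a = rng.standard_normal(k)
novel = P_model @ a
assert np.allclose(L @ novel, a.sum() * q)

# A generic vector outside the model space is not mapped onto the line of q.
other = rng.standard_normal(n)
Lo = L @ other
assert not np.allclose(Lo, (Lo @ q / (q @ q)) * q)
```

Because L fixes the complement vectors and collapses only the model vectors onto q, membership in the model's view space is read off from whether the output is a scalar multiple of q.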
Establishing the correspondence is a difficult task and an obstacle to most existing recognition schemes. The phenomenon of apparent motion (Marr & Ullman 1981) suggests, however, that the human visual system is capable of handling this problem. \n\nIn many cases objects seen in the image are partially occluded. Sometimes, also, some of the points cannot be located reliably. To handle these cases the linear operator should be modified to exclude the missing points. The computation of the updated operator from the original one involves computing a pseudo-inverse. A method to compute the pseudo-inverse of a matrix in real time using neural networks has been suggested by Yeates (1991). \n\n5 Summary \n\nWe have presented a method for recognizing 3-D objects from 2-D images. In this method, an object model is represented by the linear combinations of several 2-D views of the object. It has been shown that for objects undergoing rigid transformations the set of possible images of a given object is embedded in a linear space spanned by a small number of views. Rigid transformations can be distinguished from more general linear transformations of the object by testing certain constraints placed upon the coefficients of the linear combinations. The method applies to objects with sharp as well as smooth boundaries. \n\nWe have proposed a linear operator to map the different views of the same object into a common representation, and we have presented a simple neural network that implements this operator. In addition, we have suggested a scheme to handle occlusions and unreliable measurements. One difficulty in this scheme is that it requires finding the correspondence between the image and the model views. This problem is left for future research. \n\nThe linear combination scheme described above was implemented and applied to a number of objects. 
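As a toy illustration of the occlusion handling described in Section 4 (our own sketch, assuming NumPy; not the authors' implementation), the combination coefficients of a partially occluded view can be recovered from the visible points alone with a pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 4                                 # image points, model views

M = rng.standard_normal((n, k))              # columns: model coordinate vectors
a = rng.standard_normal(k)                   # true combination coefficients
view = M @ a                                 # full (unoccluded) novel view

visible = np.ones(n, dtype=bool)
visible[[3, 7, 9]] = False                   # suppose these points are occluded

# The pseudo-inverse of the reduced model recovers the coefficients from
# the visible points alone (here n - 3 >= k, so the recovery is exact).
a_hat = np.linalg.pinv(M[visible]) @ view[visible]
assert np.allclose(a_hat, a)

# The recovered coefficients predict the occluded points as well.
assert np.allclose(M @ a_hat, view)
```

The sketch only shows why dropping rows preserves the needed information; in the full scheme the pseudo-inverse of the reduced picture matrix would be used to rebuild the recognition operator itself.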
Figures 1 and 2 show the application of the linear combinations method to artificially created and real-life objects. The figures show a number of object models, their linear combinations, and the agreement between these linear combinations and actual images of the objects. Figure 3 shows the results of applying a linear operator with associative properties to artificial objects. It can be seen that whenever the operator is fed with a novel view of the object for which it was designed it returns a familiar view of the object. \n\nFigure 1: Top: three model pictures of a pyramid. Bottom: two of their linear combinations. \n\nFigure 2: Top: three model pictures of a VW car. Bottom: a linear combination of the three images (left), an actual edge image (middle), and the two images overlayed (right). \n\nFigure 3: Top: applying an associative pyramidal operator to a pyramid (left) returns a model view of the pyramid (right, compare with Figure 1 top left). Bottom: applying the same operator to a cube (left) returns an unfamiliar image (right). \n\nAppendix A \n\nIn this appendix we prove that the coordinate vectors of images of a rigid object lie in a 4-D linear space. We also show that the coefficients of the linear combinations that produce valid images of the object satisfy, in addition, two quadratic constraints. \n\nLet O be a set of object points, and let x = (x1, ..., xn), y = (y1, ..., yn), and z = (z1, ..., zn) such that (xi, yi, zi) is in O for every 1 ≤ i ≤ n. Let P be a view of the object, and let x̂ = (x̂1, ..., x̂n) and ŷ = (ŷ1, ..., ŷn) such that (x̂i, ŷi) is the position of (xi, yi, zi) in P. 
We call x, y, and z the coordinate vectors of O, and x̂ and ŷ the corresponding coordinate vectors in P. Assume P is obtained from O by applying a rotation matrix R, a scale factor s, and a translation vector (tx, ty), followed by an orthographic projection. \n\nClaim: There exist coefficients a1, a2, a3, a4 and b1, b2, b3, b4 such that: \n\nx̂ = a1 x + a2 y + a3 z + a4 1 \nŷ = b1 x + b2 y + b3 z + b4 1 \n\nwhere 1 = (1, ..., 1) in R^n. \n\nProof: Simply by assigning: \n\na1 = s r11, a2 = s r12, a3 = s r13, a4 = tx \nb1 = s r21, b2 = s r22, b3 = s r23, b4 = ty \n\nTherefore, x̂, ŷ lie in span{x, y, z, 1} regardless of the viewpoint from which x̂ and ŷ are taken. Notice that the set of views of a rigid object does not occupy the entire linear 4-D space. Rather, the coefficients satisfy, in addition, two quadratic constraints: \n\na1^2 + a2^2 + a3^2 = b1^2 + b2^2 + b3^2 \na1 b1 + a2 b2 + a3 b3 = 0 \n\nAppendix B \n\nA \"recognition matrix\" is defined as follows. Let {p1, ..., pk} be a set of k linearly independent vectors representing the model pictures. Let {pk+1, ..., pn} be a set of vectors such that {p1, ..., pn} are all linearly independent. We define the following matrices: \n\nP = (p1, ..., pk, pk+1, ..., pn) \nQ = (q, ..., q, pk+1, ..., pn) \n\nWe require that LP = Q; therefore: \n\nL = Q P^{-1} \n\nNote that since P is composed of n linearly independent vectors, the inverse matrix P^{-1} exists; therefore L can always be constructed. \n\nAcknowledgments \n\nWe wish to thank Yael Moses for commenting on the final version of this paper. This report describes research done at the Massachusetts Institute of Technology within the Artificial Intelligence Laboratory. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-85-K-0124. 
Ronen Basri is supported by the McDonnell-Pew and the Rothschild postdoctoral fellowships. \n\nReferences \n\nAbu-Mostafa, Y.S. & Psaltis, D., 1987. Optical neural computing. Scientific American, 256, 66-73. \n\nChien, C.H. & Aggarwal, J.K., 1987. Shape recognition from single silhouettes. Proc. of ICCV Conf. (London), 481-490. \n\nFaugeras, O.D. & Hebert, M., 1986. The representation, recognition and location of 3-D objects. Int. J. Robotics Research, 5(3), 27-52. \n\nFischler, M.A. & Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395. \n\nHuttenlocher, D.P. & Ullman, S., 1987. Object recognition using alignment. Proc. of ICCV Conf. (London), 102-111. \n\nKoenderink, J.J. & Van Doorn, A.J., 1979. The internal representation of solid shape with respect to vision. Biol. Cybernetics, 32, 211-216. \n\nKohonen, T., Oja, E., & Lehtio, P., 1981. Storage and processing of information in distributed associative memory systems. In Hinton, G.E. & Anderson, J.A. (Eds.), Parallel Models of Associative Memory. Hillsdale, NJ: Lawrence Erlbaum Associates, 105-143. \n\nLowe, D.G., 1985. Perceptual Organization and Visual Recognition. Boston: Kluwer Academic Publishing. \n\nMarr, D. & Ullman, S., 1981. Directional selectivity and its use in early visual processing. Proc. R. Soc. Lond. B, 211, 151-180. \n\nPoggio, T. & Edelman, S., 1990. A network that learns to recognize three-dimensional objects. Nature, 343, 263-266. \n\nThompson, D.W. & Mundy, J.L., 1987. Three dimensional model matching from an unconstrained viewpoint. Proc. IEEE Int. Conf. on Robotics and Automation, Raleigh, N.C., 208-220. \n\nUllman, S. & Basri, R., 1991. Recognition by linear combinations of models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(10), 992-1006. \n\nUllman, S., 1989. 
Aligning pictorial descriptions: An approach to object recognition. Cognition, 32(3), 193-254. Also: 1986, A.I. Memo 931, The Artificial Intelligence Lab., M.I.T. \n\nYeates, M.C., 1991. A neural network for computing the pseudo-inverse of a matrix and application to Kalman filtering. Tech. Report, California Institute of Technology. \n\nZipser, D. & Andersen, R.A., 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679-684. \n", "award": [], "sourceid": 515, "authors": [{"given_name": "Ronen", "family_name": "Basri", "institution": null}, {"given_name": "Shimon", "family_name": "Ullman", "institution": null}]}