{"title": "A Neural Network Approach for Three-Dimensional Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 306, "page_last": 312, "abstract": null, "full_text": "A Neural Network Approach for \n\nThree-Dimensional Object Recognition \n\nVolker Tresp \n\nSiemens AG, Central Research and Development \n\nOtto-Hahn-Ring 6, D-8000 M\u00fcnchen 83 \n\nGermany \n\nAbstract \n\nThe model-based neural vision system presented here determines the \nposition and identity of three-dimensional objects. Two stereo images of \na scene are described in terms of shape primitives (line segments derived \nfrom edges in the scenes) and their relational structure. A recurrent neural \nmatching network solves the correspondence problem by assigning \ncorresponding line segments in right and left stereo images. A 3-D relational \nscene description is then generated and matched by a second neural \nnetwork against models in a model base. The quality of the solutions and \nthe convergence speed were both improved by using mean field \napproximations. \n\n1 INTRODUCTION \n\nMany machine vision systems and, to a large extent, also the human visual \nsystem, are model based. The scenes are described in terms of shape primitives and \ntheir relational structure, and the vision system tries to find a match between the \nscene descriptions and 'familiar' objects in a model base. In many situations, such \nas robotics applications, the problem is intrinsically 3-D. Different approaches are \npossible. Poggio and Edelman (1990) describe a neural network that treats the 3-D \nobject recognition problem as a multivariate approximation problem. A certain \nnumber of 2-D views of the object are used to train a neural network to produce \nthe standard view of that object. After training, new perspective views can be \nrecognized. 
\nIn the approach presented here, the vision system tries to capture the true 3-D \nstructure of the scene. Two stereo views of a scene are used to generate a 3-D \ndescription of the scene which is then matched against models in a model base. The \nstereo correspondence problem and the model matching problem are solved by two \nrecurrent neural networks with very similar architectures. A neuron is assigned to \nevery possible match between primitives in the left and right images or, respectively, \nthe scene and the model base. The networks are designed to find the best matches \nby obeying certain uniqueness constraints. \n\nFigure 1: Match of primitive $p_\\alpha$ to $p_i$. [Diagram: two line segments in the model and two \nin the scene, with lengths 3 cm and 5 cm and an enclosed angle of 30 degrees, linked by \nmatch neurons.] \n\nFigure 2: Definitions of $r$, $q$, and $\\phi$ (left). The function $\\mu()$ (right). \n\nThe networks are robust against the uncertainties in the descriptions of both the \nstereo images and the 3-D scene (shadow lines, missing lines). Since a partial match \nis sufficient for a successful model identification, opaque and partially occluded \nobjects can be recognized. \n\n2 THE NETWORK ARCHITECTURE \n\nHere, a general model matching task is considered. The activity of a match neuron \n$m_{\\alpha i}$ (Figure 1) represents the certainty of a match between a primitive $p_\\alpha$ in the \nmodel base and $p_i$ in the scene description. The interactions between neurons can be \nderived from the network's energy function where the fixed points of the network \ncorrespond to the minima of the energy function. 
The first term in the energy \nfunction evaluates the match between the primitives \n\n$$E_P = -\\frac{1}{2} \\sum_{\\alpha i} \\kappa_{\\alpha i} m_{\\alpha i}.$$ (1) \n\nThe function $\\kappa_{\\alpha i}$ is zero if the type of primitive $p_\\alpha$ is not equal to the type of \nprimitive $p_i$. If both types are identical, $\\kappa_{\\alpha i}$ evaluates the agreement between \nparameters $p^p_\\alpha(k)$ and $p^p_i(k)$ which describe properties of the primitives. Here, \n$\\kappa_{\\alpha i} = \\mu(\\sum_k |p^p_\\alpha(k) - p^p_i(k)| / \\sigma^p_k)$ is maximum if the parameters of $p_\\alpha$ and $p_i$ match \n(Figures 1 and 2). \n\nThe evaluation of the match between the relations of primitives in the scene and \ndata base is performed by the energy term (Mjolsness, Gindi and Anandan, 1989) \n\n$$E_S = -\\frac{1}{2} \\sum_{\\alpha, \\beta, i, j} \\chi_{\\alpha\\beta, ij} m_{\\alpha i} m_{\\beta j}.$$ (2) \n\nThe function $\\chi_{\\alpha\\beta, ij} = \\mu(\\sum_k |p^r_{\\alpha\\beta}(k) - p^r_{ij}(k)| / \\sigma^r_k)$ is maximum if the relation between \n$p_\\alpha$ and $p_\\beta$ matches the relation between $p_i$ and $p_j$. \n\nThe constraint that a primitive in the scene should only match to one or no primitive \nin the model base (column constraint) is implemented by the additional penalty \nenergy term (Utans et al., 1989; Tresp and Gindi, 1990) \n\n$$E_C = \\sum_i \\left[ \\left( \\left( \\sum_\\alpha m_{\\alpha i} \\right) - 1 \\right)^2 \\sum_\\alpha m_{\\alpha i} \\right].$$ (3) \n\n$E_C$ is equal to zero only if, in every column, the sum over the activations of all neurons \nis equal to one or zero, and is positive otherwise. \n\n2.1 DYNAMIC EQUATIONS AND MEAN FIELD THEORY \n\n2.1.1 MFA1 \n\nThe neural network should make binary decisions, match or no match, but binary \nrecurrent networks get easily stuck in local minima. Bad local minima can \nbe avoided by using an annealing strategy but annealing is time-consuming when \nsimulated on a digital computer. Using a mean field approximation, one can \nobtain deterministic equations while retaining some of the advantages of the annealing \nprocess (Peterson and S\u00f6derberg, 1989). The network is interpreted as a system of \ninteracting units in thermal contact with a heat reservoir of temperature $T$. 
Such \na system minimizes the free energy $F = E - TS$ where $S$ is the entropy of the \nsystem. At $T = 0$ the energy $E$ is minimized. The mean value $v_{\\alpha i} = \\langle m_{\\alpha i} \\rangle$ of \na neuron becomes $v_{\\alpha i} = 1/(1 + e^{-u_{\\alpha i}/T})$ with $u_{\\alpha i} = -\\partial E/\\partial v_{\\alpha i}$. These equations \ncan be updated synchronously, asynchronously or solved iteratively by moving only \na small distance from the old value of $v_{\\alpha i}$ in the direction of the new mean field. \nAt high temperatures $T$, the system is in the trivial solution $v_{\\alpha i} \\approx 1/2 \\; \\forall \\alpha, i$ and \nthe activations of all neurons are in the linear region of the sigmoid function. The \nsystem can be described by linearized equations. The magnitudes of all eigenvalues \nof the corresponding transfer matrix are less than 1. At a critical temperature $T_c$, \nthe magnitude of at least one of the eigenvalues becomes greater than one and the \ntrivial solution becomes unstable. $T_c$ and favorable weights for the different terms \nin the energy function can be found by an eigenvalue analysis of the linearized \nequations (Peterson and S\u00f6derberg, 1989). \n\n2.1.2 MFA2 \n\nThe column constraint is satisfied by states with exactly one neuron or no neuron \n'on' in every column. If only these states are considered in the derivation of the \nmean field equations, one can obtain another set of mean field equations, \n$v_{\\alpha i} = e^{u_{\\alpha i}/T} / (1 + \\sum_\\beta e^{u_{\\beta i}/T})$ with $u_{\\alpha i} = -\\partial E/\\partial v_{\\alpha i}$. \n\nThe column constraint term (Equation 3) drops out of the energy function and \nthe energy surface is simplified. The high temperature fixed point corresponds to \n$v_{\\alpha i} = 1/(N + 1) \\; \\forall \\alpha, i$ where $N$ is the number of rows. \n\n3 THE CORRESPONDENCE PROBLEM \n\nTo solve the correspondence problem, corresponding lines in left and right images \nhave to be identified. 
A good assumption is that the appearance of an object in \nthe left image is a distorted and shifted version of the appearance of the object in \nthe right image with approximately the same scale and orientation. The machinery \njust developed can be applied if the left image is interpreted as the scene and the \nright image as the model. \n\nFigure 3 shows two stereo images of a simple scene and the segmentation of left and \nright images into line segments which are the only primitives in this application. \nLines correspond to the edges, structure and contours of the objects and shadow \nlines. The length of a line segment $p^p_i(1) = l_i$ is the descriptive parameter attached \nto each line segment $p_i$. Relations between line segments are only considered if they \nare in a local neighborhood: $\\chi_{\\alpha\\beta, ij}$ is equal to zero unless both a) $p_\\alpha$ is attached to \nline segment $p_\\beta$ and b) line segment $p_i$ is attached to line segment $p_j$. Otherwise, \n$\\chi_{\\alpha\\beta, ij} = \\mu(|\\phi_{\\alpha\\beta} - \\phi_{ij}| / \\sigma^r_1 + |r_{\\alpha\\beta} - r_{ij}| / \\sigma^r_2 + |q_{\\alpha\\beta} - q_{ij}| / \\sigma^r_3)$ where $p^r_{ij}(1) = \\phi_{ij}$ is \nthe angle between line segments, $p^r_{ij}(2) = r_{ij}$ the logarithm of the ratio of their \nlengths and $p^r_{ij}(3) = q_{ij}$ the attachment point (Shumaker et al., 1989) (Figure 2). \n\nHere, we have two uniqueness constraints: at most one neuron should be \nactive in each column and in each row. The row constraint is enforced by an energy \nterm equivalent to $E_C$: $E_R = \\sum_\\alpha [((\\sum_i m_{\\alpha i}) - 1)^2 \\sum_i m_{\\alpha i}]$. \n\n4 DESCRIPTION OF THE 3-D OBJECT STRUCTURE \n\nFrom the last section, we know which endpoints in the left image correspond to \nendpoints in the right image. If $D$ is the separation of both (in parallel mounted) \ncameras, $f$ the focal length of the cameras, and $x_l, y_l, x_r, y_r$ the coordinates of a \nparticular point in left and right images, the 3-D position of the point in camera coordinates \n$x, y, z$ becomes $z = Df/(x_r - x_l)$, $y = z y_r / f$, $x = z x_r / f + D/2$. 
This information \nis used to generate the 3-D description of the visible portion of the objects in the \nscene. \n\nKnowing the true 3-D position of the endpoints of the line segments, the system \nconcludes that the chair and the wardrobe are two distinct and spatially separated \nobjects and that line segments 12 and 13 in the right image and 12 in the left image \nare not connected to either the chair or the wardrobe. On the other hand, it is not \nobvious that the shadow lines under the wardrobe are not part of the wardrobe. \n\nFigure 3: Stereo images (left and right) of a scene and segmented images. The stereo matching \nnetwork matched all line segments that are present in both images correctly. \n\n5 MATCHING OBJECTS AND MODELS \n\nThe scene description now must be matched with stored models describing the \ncomplete 3-D structures of the models in the data base. The model description \nmight be constructed by either explicitly measuring the dimensions of the models \nor by incrementally assembling the 3-D structure from several stereo views of the \nmodels. Descriptive parameters are the (true 3-D) length of line segments $l$, the \n(true 3-D) angles $\\phi$ between line segments and the (true 3-D) attachment points \n$q$. The knowledge about the 3-D structure allows a segmentation of the scene into \ndifferent objects and the row constraint is only applied to neurons relating to the \nsame object $O$ in the scene: $E_R' = \\sum_O \\sum_\\alpha [((\\sum_{i \\in O} m_{\\alpha i}) - 1)^2 \\sum_{i \\in O} m_{\\alpha i}]$. \n\nFigure 4 shows the network after convergence. Except for the occluded leg, all line \nsegments belonging to the chair could be matched correctly. All non-occluded line \nsegments of the wardrobe could be matched correctly except for its left front leg. \nThe shadow lines in the image did not find a match. 
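
The per-object row constraint of this section can be made concrete with a small sketch. It is only an illustration in Python, not the implementation used in the experiments; the function name row_penalty, the nested-list layout v[a][i] (model primitive a, scene primitive i) and the use of the mean activations v in both factors of the penalty are assumptions made here.

```python
def row_penalty(v, objects):
    '''Per-object row penalty for a match matrix v[a][i] (hypothetical helper).

    v[a][i] is the activation of the neuron matching model primitive a
    to scene primitive i; objects is a list of lists of scene indices,
    one inner list per segmented scene object.
    '''
    total = 0.0
    for members in objects:        # one scene object O at a time
        for row in v:              # one model primitive a at a time
            s = sum(row[i] for i in members)
            # Zero when the row sum inside this object is 0 or 1.
            total += (s - 1.0) ** 2 * s
    return total


# Two model primitives, three scene primitives. Scene primitives 0 and 1
# form one object, scene primitive 2 forms another.
v = [[1.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
print(row_penalty(v, [[0, 1], [2]]))   # 0.0
print(row_penalty(v, [[0, 1, 2]]))     # 2.0
```

In this toy case the first model primitive is matched once in each of two scene objects, which costs nothing when the scene is segmented, whereas treating the whole scene as a single object penalizes the double match.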
\n\n6 3-D POSITION \n\nIn many applications, one is also interested in determining the positions of the \nrecognized objects in camera coordinates. In general, the transformation between \nan object in a standard frame of reference $X_o = (x_o, y_o, z_o)$ and the transformed \nframe of reference $X_s = (x_s, y_s, z_s)$ can be described by $X_s = R X_o$, where $R$ is a \n$4 \\times 4$ matrix describing a rotation followed by a translation (the coordinate vectors \nare extended by a constant fourth component equal to 1). $R$ can be calculated if \n$X_o$ and $X_s$ are known for at least 4 points using, for example, the pseudo inverse or \nan ADALINE. Knowing the coefficients of $R$, the object position can be calculated. \nIf an ADALINE is used, the error after convergence is a measure of the consistency \nof the transformation. A large error can be used as an indication that either a \nwrong model was matched, or certain primitives were misclassified. \n\nFigure 4: 3-D matching network. \n\n7 DISCUSSION \n\nBoth MFA1 and MFA2 were used in the experiments. The same solutions were \nfound in general, but due to the simpler energy surface, MFA2 allowed greater time \nsteps and therefore converged 5 to 10 times faster. \n\nFor more complex scenes, a hierarchical system could be considered. In the first \nstep, simple objects such as squares, rectangles, and circles would be identified. 
\nThese would then form the primitives in a second stage which would then recognize \ncomplete objects. It might also be possible to combine these two matching nets \ninto one hierarchical net similar to the networks described by Mjolsness, Gindi and \nAnandan (1989). \n\nAcknowledgements \n\nI would like to acknowledge the contributions of Gene Gindi, Eric Mjolsness and \nJoachim Utans of Yale University to the design of the matching network. I thank \nChristian Evers for helping me to acquire the images. \n\nReferences \n\nEric Mjolsness, Gene Gindi, P. Anandan. Neural Optimization in Model Matching \nand Perceptual Organization. Neural Computation 1, pp. 218-229, 1989. \n\nCarsten Peterson, Bo S\u00f6derberg. A new method for mapping optimization problems \nonto neural networks. International Journal of Neural Systems, Vol. 1, No. 1, pp. \n3-22, 1989. \n\nT. Poggio, S. Edelman. A Network That Learns to Recognize Three-Dimensional \nObjects. Nature, No. 6255, pp. 263-266, January 1990. \n\nGrant Shumaker, Gene Gindi, Eric Mjolsness, P. Anandan. Stickville: A Neural \nNet for Object Recognition via Graph Matching. Tech. Report No. 8908, Yale \nUniversity, 1989. \n\nVolker Tresp, Gene Gindi. Invariant Object Recognition by Inexact Subgraph \nMatching with Applications in Industrial Part Recognition. International Neural \nNetwork Conference, Paris, pp. 95-98, 1990. \n\nJoachim Utans, Gene Gindi, Eric Mjolsness, P. Anandan. Neural Networks for Object \nRecognition within Compositional Hierarchies, Initial Experiments. Tech. Report \nNo. 8903, Yale University, 1989. \n", "award": [], "sourceid": 378, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}]}