{"title": "Feature Correspondence: A Markov Chain Monte Carlo Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 852, "page_last": 858, "abstract": null, "full_text": "Feature Correspondence: A Markov Chain Monte Carlo Approach \n\nFrank Dellaert, Steven M. Seitz, Sebastian Thrun, and Charles Thorpe \n\nDepartment of Computer Science & Robotics Institute \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\n{dellaert,seitz,thrun,cet}@cs.cmu.edu \n\nAbstract \n\nWhen trying to recover 3D structure from a set of images, the \nmost difficult problem is establishing the correspondence between \nthe measurements. Most existing approaches assume that features \ncan be tracked across frames, whereas methods that exploit rigidity \nconstraints to facilitate matching do so only under restricted camera \nmotion. In this paper we propose a Bayesian approach that \navoids the brittleness associated with singling out one \"best\" correspondence, \nand instead consider the distribution over all possible \ncorrespondences. We treat both a fully Bayesian approach that \nyields a posterior distribution, and a MAP approach that makes \nuse of EM to maximize this posterior. We show how Markov chain \nMonte Carlo methods can be used to implement these techniques \nin practice, and present experimental results on real data. \n\n1 Introduction \n\nStructure from motion (SFM) addresses the problem of simultaneously recovering \ncamera pose and a three-dimensional model from a collection of images. This problem \nhas received considerable attention in the computer vision community [1, 2, 3]. \nMethods that can robustly reconstruct the 3D structure of environments have a \npotentially large impact in many areas of societal importance, such as architecture, \nentertainment, space exploration and mobile robotics. 
\n\nA fundamental problem in SFM is data association, i.e., the question of determining \ncorrespondence between features observed in different images. This problem has \nbeen referred to as the most difficult part of structure recovery [4], and is particularly \nchallenging if the images have been taken from widely separated viewpoints. \nVirtually all existing approaches assume that either the correspondence is known a \npriori, or that features can be tracked from frame to frame [1, 2]. Methods based \non the robust recovery of epipolar geometry [3, 4] can cope with larger inter-frame \ndisplacements, but still depend on the ability to identify a set of initial correspondences \nto seed the robust matching process. In this paper, we are interested in \ncases where individual camera images are recorded from vastly different viewpoints, \nwhich renders existing SFM approaches inapplicable. Traditional approaches for \nestablishing correspondence between sets of 2D features [5, 6, 7] are of limited use \nin this domain, as the projected 3D structure can look very different in each image. \n\nThis paper proposes a Bayesian approach to data association. Instead of considering \na single correspondence only (which we conjecture to be brittle), our approach \nconsiders whole distributions over correspondences. As a result, our approach is \nmore robust, and from a Bayesian perspective it is also sound. Unfortunately, no \nclosed-form solution exists for calculating these distributions conditioned on the \ncamera images. Therefore, we propose to use the Metropolis-Hastings algorithm, a \npopular Markov chain Monte Carlo (MCMC) method, to sample from the posterior. \n\nIn particular, we propose two different algorithms. The first method, discussed in \nSection 2, is mathematically more powerful but computationally expensive. 
It uses \nMCMC to sample from the joint distribution over both correspondences and three-dimensional \nscene structure. While this approach is mathematically elegant from a \nBayesian point of view, we have so far only been able to obtain results for simple, \nartificial domains. Thus, to cope with large-scale data sets, we propose in Section \n3 a maximum a posteriori (MAP) approach using the Expectation-Maximization \n(EM) algorithm to maximize the posterior. Here we use MCMC sampling only for \nthe data association problem. Simulated annealing is used to reduce the danger of \ngetting stuck in local minima. Experimental results obtained in realistic domains \nand presented in Section 4 suggest that this approach works well in the general \nSFM case, and that it scales favorably to complex computer vision problems. \n\nThe idea of using MCMC for data association has been used before by [8] in the \ncontext of a traffic surveillance application. However, their approach is not directly \napplicable to SFM, as the computer vision domain is characterized by a large number \nof local minima. Our paper goes beyond theirs in two important aspects: First, we \ndevelop a framework for MCMC sampling over both the data association and the \nmodel, and second, we apply annealing to smooth the posterior so as to reduce the \nchance to get stuck in local minima. In a previous paper [9] we have discussed the \nidea of using EM for SFM, but without the unifying framework presented below. \n\n2 A Fully Bayesian Approach using MCMC \n\nBelow we derive the general approach for MCMC sampling from the joint posterior \nover data association and models. We only show results for a simple example from \npose estimation, as this approach is computationally very demanding. An EM \napproach based on the general principles described here, but applicable to larger-scale \nproblems, will be described in the next section. 
\n\n2.1 Structure from Motion \n\nThe structure from motion problem is this: given a set of images of a scene, taken \nfrom different viewpoints, recover the 3D structure of the scene along with the camera \nparameters. In the feature-based approach to SFM, we consider the situation in \nwhich a set of $N$ 3D features $x_j$ is viewed by a set of $m$ cameras with parameters $m_i$. \nAs input data we are given the set of 2D measurements $u_{ik}$ in the images, where \n$k \\in \\{1..K_i\\}$ and $K_i$ is the number of measurements in the $i$-th image. To model \ncorrespondence information, we introduce for each measurement $u_{ik}$ the indicator \nvariable $j_{ik}$, indicating that $u_{ik}$ is a measurement of the $j_{ik}$-th feature $x_{j_{ik}}$. \n\nThe choice of feature type and camera model determines the measurement function \n$h(m_i, x_j)$, predicting the measurement $u_{ik}$ given $m_i$ and $x_j$ (with $j = j_{ik}$): \n\n$$u_{ik} = h(m_i, x_j) + n$$ \n\nwhere $n$ is the measurement noise. Without loss of generality, let us consider the \ncase in which the features $x_j$ are 3D points and the measurements $u_{ik}$ are points in \nthe 2D image. In this case the measurement function can be written as a 3D rigid \ndisplacement followed by a projection: \n\n$$h(m_i, x_j) = \\Phi[R_i(x_j - t_i)] \\qquad (1)$$ \n\nwhere $R_i$ and $t_i$ are the rotation matrix and translation of the $i$-th camera, respectively, \nand $\\Phi : \\mathbb{R}^3 \\to \\mathbb{R}^2$ is the camera projection model. \n\n2.2 Deriving the Posterior \n\nWhereas previous methods single out a single \"best\" correspondence across images, \nin a Bayesian framework we are interested in characterizing our knowledge about the \nunknowns conditioned on the data only, averaging over all possible correspondences. \nThus, we are interested in the posterior distribution $P(\\theta \\mid U)$, where $\\theta$ collects the \nunknown model parameters $m_i$ and $x_j$. In the case of unknown correspondence, we \nneed to sum over all possible assignments $J = \\{j_{ik}\\}$ to obtain \n\n$$P(\\theta \\mid U) = \\sum_J P(J, \\theta \\mid U) \\propto P(\\theta) \\sum_J P(U \\mid J, \\theta) P(J \\mid \\theta) \\qquad (2)$$ \n\nwhere we have applied Bayes law and the chain rule. 
Let us assume for now that \nthere are no occlusions or spurious measurements, so that $K_i = N$ and $J$ is a set of \n$m$ permutations $J_i$ of the indices $1..N$. Then, assuming i.i.d. normally distributed \nnoise on the measurements, each term in (2) can be calculated using \n\n$$P(U \\mid J, \\theta) P(J \\mid \\theta) \\propto P(U \\mid J, \\theta) = \\prod_{i=1}^{m} \\prod_{k=1}^{N} \\mathcal{N}(u_{ik}; h(m_i, x_{j_{ik}}), \\sigma) \\qquad (3)$$ \n\nif each $J_i$ is a permutation, and 0 otherwise. Here $\\mathcal{N}(\\cdot; \\mu, \\sigma)$ denotes the normal \ndistribution with mean $\\mu$ and standard deviation $\\sigma$. The first identity in (3) holds \nif we assume each of the $N!$ possible permutations to be equally likely a priori. \n\n2.3 Sampling from the Posterior using MCMC \n\nUnfortunately, direct computation of the total posterior distribution $P(\\theta \\mid U)$ in \n(2) is intractable in general, because the number of correspondence assignments \n$J$ is combinatorial in the number of features and images. As a solution to this \ncomputational challenge we propose to instead sample from $P(\\theta \\mid U)$. Sampling \ndirectly from $P(\\theta \\mid U)$ is equally difficult, but if we can obtain a sample $\\{(\\theta^{(r)}, J^{(r)})\\}$ \nfrom the joint distribution $P(\\theta, J \\mid U)$, we can simply discard the correspondence \npart $J^{(r)}$ to obtain a sample $\\{\\theta^{(r)}\\}$ from the marginal distribution $P(\\theta \\mid U)$. \n\nTo sample from the joint distribution $P(\\theta, J \\mid U)$ we propose to use MCMC sampling, \nin particular the Metropolis-Hastings algorithm [10]. This method involves \nsimulating a Markov chain whose equilibrium distribution is the desired posterior \ndistribution $P(\\theta, J \\mid U)$. Defining $X \\triangleq (J, \\theta)$, the algorithm is: \n\n1. Start with a random initial state $X^{(0)}$. \n2. Propose a new state $X'$ using a chosen proposal density $Q(X'; X^{(r)})$. \n3. Compute the ratio \n\n$$a = \\frac{P(X' \\mid U)}{P(X^{(r)} \\mid U)} \\frac{Q(X^{(r)}; X')}{Q(X'; X^{(r)})} \\qquad (4)$$ \n\n4. Accept $X'$ as $X^{(r+1)}$ with probability $\\min(a, 1)$, otherwise $X^{(r+1)} = X^{(r)}$. 
\n\nFigure 1: Left: A 2D model shape, defined by the 6 feature points $x_j$. Right: Transformed \nshape (by a simple rotation) and 6 noisy measurements $u_k$ of the transformed features. \nThe true rotation is 70 degrees, noise is zero-mean Gaussian. \n\nThe sequence of tuples $(\\theta^{(r)}, J^{(r)})$ thus generated will be a sample from $P(\\theta, J \\mid U)$, if \nthe sampler is run sufficiently long. To calculate the acceptance ratio $a$, we assume \nthat the noise on the feature measurements is normally distributed and isotropic. \nUsing Bayes law and eq. (3), we can then rewrite $a$ from (4) as \n\n$$a = \\frac{\\prod_{i=1}^{m} \\prod_{k=1}^{K_i} \\mathcal{N}(u_{ik}; h(m'_i, x'_{j'_{ik}}), \\sigma)}{\\prod_{i=1}^{m} \\prod_{k=1}^{K_i} \\mathcal{N}(u_{ik}; h(m_i^{(r)}, x^{(r)}_{j^{(r)}_{ik}}), \\sigma)} \\frac{Q(X^{(r)}; X')}{Q(X'; X^{(r)})}$$ \n\nSimplifying the notation by defining $h_{ik}^{(r)} \\triangleq h(m_i^{(r)}, x^{(r)}_{j^{(r)}_{ik}})$, we obtain \n\n$$a = \\frac{Q(X^{(r)}; X')}{Q(X'; X^{(r)})} \\exp\\left[ \\frac{1}{2\\sigma^2} \\sum_{i,k} \\left( \\|u_{ik} - h_{ik}^{(r)}\\|^2 - \\|u_{ik} - h'_{ik}\\|^2 \\right) \\right] \\qquad (5)$$ \n\nThe proposal density $Q(\\cdot; \\cdot)$ is application dependent, and an example is given below. \n\n2.4 Example: A 2D Pose Estimation Problem \n\nTo illustrate this method, we present a simple example from pose estimation. Assume \nwe have a 2D model shape, given in the form of a set of 2D points $x_j$, as shown \nin Figure 1. We observe an image of this shape which has undergone a rotation \n$\\theta$ to be estimated. This rotated shape is shown at right in the figure, along with \n6 noisy measurements $u_k$ on the feature points. In Figure 2 at left we show the \nposterior distribution over the rotation parameter, given the measurements from \nFigure 1 and with known correspondence. In this case, the posterior is unimodal. \nIn the case of unknown correspondence, the posterior conditioned on the data alone \nis shown at right in Figure 2 and is a mixture of 6!=720 functions of the form (3), \nwith 6 equally likely modes induced by the symmetry of the model shape. 
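
As a minimal sketch of the unknown-correspondence posterior in this example, the following Python code evaluates the mixture over all 6! = 720 permutations on a grid of rotation angles. It is not the authors' implementation: the regular-hexagon model shape, the noise level sigma = 0.05, the flat prior on the rotation, the 1-degree grid, and all function names are assumptions chosen for illustration.

```python
import itertools

import numpy as np


def rotate(points, theta):
    # Rotate an (N, 2) array of 2D points counter-clockwise by theta radians.
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return points @ rotation.T


def posterior_unknown_correspondence(u, x, thetas, sigma):
    # Posterior over the rotation on a grid, summing over all N! permutations J.
    # Assumes a flat prior on theta and equally likely permutations, so each
    # permutation contributes a product of isotropic Gaussians as in (3).
    n = len(x)
    perms = np.array(list(itertools.permutations(range(n))))  # shape (N!, N)
    rows = np.arange(n)
    log_post = np.empty(len(thetas))
    for t, theta in enumerate(thetas):
        predicted = rotate(x, theta)
        # d2[k, j] = squared distance between measurement k and predicted feature j
        d2 = ((u[:, None, :] - predicted[None, :, :]) ** 2).sum(axis=-1)
        # One log-likelihood (up to an additive constant) per permutation
        log_terms = -d2[rows, perms].sum(axis=1) / (2.0 * sigma ** 2)
        peak = log_terms.max()
        log_post[t] = peak + np.log(np.exp(log_terms - peak).sum())  # log-sum-exp
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def circular_peaks(p, min_rel_height=0.5):
    # Indices of strict local maxima on a circular grid above a relative height.
    mask = (p > np.roll(p, 1)) & (p > np.roll(p, -1)) & (p > min_rel_height * p.max())
    return np.flatnonzero(mask)


def make_hexagon_problem(true_deg=70.0, sigma=0.05, seed=0):
    # Hypothetical test problem: a regular hexagon rotated by true_deg plus noise,
    # with the measurement order shuffled so the correspondence is unknown.
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(6) / 6.0
    x = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    u = rotate(x, np.deg2rad(true_deg)) + rng.normal(0.0, sigma, size=x.shape)
    return x, u[rng.permutation(6)]


if __name__ == '__main__':
    sigma = 0.05
    x, u = make_hexagon_problem(sigma=sigma)
    thetas = np.deg2rad(np.arange(360))  # 1-degree grid over the full circle
    post = posterior_unknown_correspondence(u, x, thetas, sigma)
    print('posterior modes (degrees):', circular_peaks(post))
```

For a shape with 6-fold symmetry, one would expect this grid posterior to show six equally high modes spaced 60 degrees apart, since rotating the model by 60 degrees only permutes its points and the sum runs over all permutations. Brute-force enumeration is only feasible for very small N, which is the motivation for sampling instead.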
\n\nIn order to perform MCMC sampling, we implement the proposal step by choosing \nrandomly between two strategies. (a) In a \"small perturbation\" we keep the correspondence \nassignment $J$ but add a small amount of noise to $\\theta$. This serves to explore \nthe values of $\\theta$ within a mode of the posterior probability. (b) In a \"long jump\", we \ncompletely randomize both $\\theta$ and $J$. This provides a way to jump between probability \nmodes. Note that $Q(X^{(r)}; X') / Q(X'; X^{(r)}) = 1$ for this proposal density. The \nresult of the sampling procedure is shown as a histogram of the rotation parameter \n$\\theta$ in Figure 3. The histogram is a non-parametric approximation to the analytic \nposterior shown in Figure 2. The figure shows the results of running a sampler \nfor 100,000 steps, the first 1000 of which were discarded as a transient. Note that \neven for this simple example, there is still considerable correlation in the sample \nof 100,000 states as evidenced by the uneven mass in each of the 6 analytically \npredicted modes. \n\nFigure 2: (Left) The posterior distribution over rotation $\\theta$ with known correspondence, \nand (Right) with unknown correspondence, a mixture with 720 components. \n\nFigure 3: Histogram for the values of $\\theta$ obtained in one MCMC run, for the situation in \nFigure 1. The MCMC sampler was run for 100,000 steps. \n\n3 Maximum a Posteriori Estimation using MCEM \n\nAs illustrated above, sampling from the joint probability over assignments $J$ and \nparameters $\\theta$ using MCMC can be very expensive. However, if only a maximum a \nposteriori (MAP) estimate is needed, sampling over the joint space can be avoided \nby means of the EM algorithm. To obtain the MAP estimate, we need to maximize \n$P(\\theta \\mid U)$ as given by (2). This is intractable in general because of the combinatorial \nnumber of terms. 
The EM algorithm provides a tractable alternative to maximizing \n$P(\\theta \\mid U)$, using the correspondence $J$ as a hidden variable [11]. It iterates over: \n\nE-step: Calculate the expected log-posterior $Q^t(\\theta)$: \n\n$$Q^t(\\theta) \\triangleq E_{\\theta^t}\\{\\log P(\\theta \\mid U, J) \\mid U\\} = \\sum_J P(J \\mid U, \\theta^t) \\log P(\\theta \\mid U, J) \\qquad (6)$$ \n\nwhere the expectation is taken with respect to the posterior distribution $P(J \\mid U, \\theta^t)$ \nover all possible correspondence assignments $J$ given the measurement data $U$ and \na current guess $\\theta^t$ for the parameters. \n\nM-step: Re-estimate $\\theta^{t+1}$ by maximizing $Q^t(\\theta)$, i.e., $\\theta^{t+1} = \\arg\\max_\\theta Q^t(\\theta)$. \n\nInstead of calculating $Q^t(\\theta)$ exactly using (6), which again involves summing over a \ncombinatorial number of terms, we can replace it by a Monte Carlo approximation: \n\n$$Q^t(\\theta) \\approx \\frac{1}{R} \\sum_{r=1}^{R} \\log P(\\theta \\mid U, J^{(r)}) \\qquad (7)$$ \n\nwhere $\\{J^{(r)}\\}$ is a sample from $P(J \\mid U, \\theta^t)$ obtained by MCMC sampling. Formally \nthis can be justified in the context of a Monte Carlo EM or MCEM, a version \nof the EM algorithm where the E-step is executed by a Monte-Carlo process [11]. \nThe sampling proceeds as in the previous section, using the Metropolis-Hastings \nalgorithm, but now with a fixed parameter $\\theta = \\theta^t$. Note that at each iteration the \nestimate $\\theta^t$ changes and we sample from a different posterior distribution $P(J \\mid U, \\theta^t)$. \n\nFigure 4: Three out of 11 cube images. Although the images were originally taken as a \nsequence in time, the ordering of the images is irrelevant to our method. \n\nFigure 5: Starting from random structure (t=0) we recover gross 3D structure in the very \nfirst iteration (t=1). As the annealing parameter $\\sigma$ is gradually decreased, successively \nfiner details are resolved (iterations 1, 10, 20, and 100 are shown). Panel labels: t=0, $\\sigma$=0.0; \nt=1, $\\sigma$=25.1; t=10, $\\sigma$=18.7; t=20, $\\sigma$=13.5; t=100, $\\sigma$=1.0. \n\nIn practice it is important to add annealing to this basic EM scheme, to avoid \ngetting stuck in local minima. 
In simulated annealing we artificially increase the \nnoise parameter $\\sigma$ for the early iterations, gradually decreasing it to its correct value. \nThis has two beneficial consequences. First, the posterior distribution $P(J \\mid U, \\theta^t)$ \nis less peaked when $\\sigma$ is high, allowing the MCMC sampler to explore the space of \nassignments $J$ more easily. Second, the expected log-posterior $Q^t(\\theta)$ is smoother \nand has fewer local maxima for higher values of $\\sigma$. \n\n4 Results \n\nTo validate our approach we have conducted a number of experiments, one of which \nis presented here. The input data in this experiment consisted of 55 manually \nselected measurements in each of 11 input images, three of which are shown in \nFigure 4. Note that features are not tracked from frame to frame and the images \ncan be presented in arbitrary order. To initialize, the 11 cameras $m_i$ are all placed \nat the origin, looking towards the 55 model points $x_j$, which themselves are normally \ndistributed at unit distance from the cameras. We used an orthographic projection \nmodel. The EM algorithm was run for 100 iterations, and the sampler for 10000 \nsteps per image. For this data set the algorithm took about a minute to complete \non a standard PC. \n\nThe algorithm converges consistently and fast to an estimate for the structure and \nmotion where the correct correspondence is the most probable one, and where all \nassignments in the different images agree with each other. A typical run of the \nalgorithm is shown in Figure 5, where we have shown a wireframe model of the \nrecovered structure at several points during the run. There are two important \npoints to note: (a) the gross structure is recovered in the very first iteration, starting \nfrom random initial structure, and (b) finer details of the structure are gradually \nresolved as the annealing parameter $\\sigma$ is decreased. 
The estimate for the structure \nafter convergence is almost identical to the one found by the factorization method \n[1] when this is provided with the correct correspondence. \n\n5 Conclusions and Future Directions \n\nIn this paper we presented a theoretically sound method to deal with ambiguous \nfeature correspondence, and have shown how Markov chain Monte Carlo sampling \ncan be used to obtain practical algorithms. We have detailed this for two cases: (1) \nobtaining a posterior distribution over the parameters $\\theta$, and (2) obtaining a MAP \nestimate by means of EM. In future work, we would like to apply these methods \nin other domains where data association plays a central role. In particular, in \nthe highly active area of mobile robot mapping, the data association problem is \ncurrently a major obstacle to building large-scale maps [12, 13]. We conjecture that \nour approach is equally applicable to the robotic mapping problem, and can lead \nto qualitatively new solutions in that domain. \n\nReferences \n\n[1] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: \na factorization method. Int. J. of Computer Vision, 9(2):137-154, Nov. 1992. \n\n[2] R.I. Hartley. Euclidean reconstruction from uncalibrated views. In Application of \nInvariance in Computer Vision, pages 237-256, 1994. \n\n[3] P.A. Beardsley, P.H.S. Torr, and A. Zisserman. 3D model acquisition from extended \nimage sequences. In Eur. Conf. on Computer Vision (ECCV), pages II:683-695, 1996. \n\n[4] P. Torr, A. Fitzgibbon, and A. Zisserman. Maintaining multiple motion model hypotheses \nover many views to recover matching and structure. In Int. Conf. on Computer \nVision (ICCV), pages 485-491, 1998. \n\n[5] G.L. Scott and H.C. Longuet-Higgins. An algorithm for associating the features of \ntwo images. Proceedings of Royal Society of London, B-244:21-26, 1991. \n\n[6] L.S. Shapiro and J.M. Brady. 
Feature-based correspondence: An eigenvector approach. \nImage and Vision Computing, 10(5):283-288, June 1992. \n\n[7] S. Gold, A. Rangarajan, C. Lu, S. Pappu, and E. Mjolsness. New algorithms for 2D \nand 3D point matching. Pattern Recognition, 31(8):1019-1031, 1998. \n\n[8] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many \nsensors. In Int. Joint Conf. on Artificial Intelligence (IJCAI), Stockholm, 1999. \n\n[9] F. Dellaert, S.M. Seitz, C.E. Thorpe, and S. Thrun. Structure from motion without \ncorrespondence. In IEEE Conf. on Computer Vision and Pattern Recognition \n(CVPR), June 2000. \n\n[10] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors. Markov chain Monte \nCarlo in practice. Chapman and Hall, 1996. \n\n[11] M.A. Tanner. Tools for Statistical Inference. Springer, 1996. \n\n[12] J.J. Leonard and H.J.S. Feder. A computationally efficient method for large-scale \nconcurrent mapping and localization. In Proceedings of the Ninth International Symposium \non Robotics Research, Salt Lake City, Utah, 1999. \n\n[13] J.A. Castellanos and J.D. Tard\u00f3s. Mobile Robot Localization and Map Building: A \nMultisensor Fusion Approach. Kluwer Academic Publishers, Boston, MA, 2000. \n", "award": [], "sourceid": 1838, "authors": [{"given_name": "Frank", "family_name": "Dellaert", "institution": null}, {"given_name": "Steven", "family_name": "Seitz", "institution": null}, {"given_name": "Sebastian", "family_name": "Thrun", "institution": null}, {"given_name": "Charles", "family_name": "Thorpe", "institution": null}]}