{"title": "Recovering Perspective Pose with a Dual Step EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 780, "page_last": 786, "abstract": null, "full_text": "Recovering Perspective Pose with a Dual Step EM Algorithm\n\nAndrew D.J. Cross and Edwin R. Hancock\n\nDepartment of Computer Science,\nUniversity of York,\nYork, YO1 5DD, UK.\n\nAbstract\n\nThis paper describes a new approach to extracting 3D perspective structure from 2D point-sets. The novel feature is to unify the tasks of estimating transformation geometry and identifying point-correspondence matches. Unification is realised by constructing a mixture model over the bi-partite graph representing the correspondence match and by effecting optimisation using the EM algorithm. According to our EM framework, the probabilities of structural correspondence gate contributions to the expected likelihood function used to estimate maximum likelihood perspective pose parameters. This provides a means of rejecting structural outliers.\n\n1 Introduction\n\nThe estimation of transformational geometry is key to many problems of computer vision and robotics [10]. Broadly speaking, the aim is to recover a matrix representation of the transformation between image and world co-ordinate systems. Estimating the matrix requires a set of correspondence matches between features in the two co-ordinate systems [11]. Posed in this way, there is a basic chicken-and-egg problem. Before good correspondences can be estimated, there need to be reasonable bounds on the transformational geometry. Yet this geometry is, after all, the ultimate goal of computation. This problem is usually overcome by invoking constraints to bootstrap the estimation of feasible correspondence matches [5, 8]. 
\n\nOne of the most popular ideas is to use the epipolar constraint to prune the space of potential correspondences [5]. One of the drawbacks of this pruning strategy is that residual outliers may lead to ill-conditioned or singular parameter matrices [11].\n\nThe aim in this paper is to pose the two problems of estimating transformation geometry and locating correspondence matches using an architecture that is reminiscent of the hierarchical mixture of experts algorithm [6]. Specifically, we use a bi-partite graph to represent the current configuration of correspondence matches. This graphical structure provides an architecture that can be used to gate contributions to the likelihood function for the geometric parameters using structural constraints. Correspondence matches and transformation parameters are estimated by applying the EM algorithm to the gated likelihood function. In this way we arrive at dual maximisation steps. Maximum likelihood parameters are found by minimising the structurally gated squared residuals between features in the two images being matched. Correspondence matches are updated so as to maximise the a posteriori probability of the observed structural configuration on the bi-partite association graph.\n\nWe provide a practical illustration in the domain of computer vision, aimed at matching images of floppy disks under severe perspective foreshortening. However, it is important to stress that using a graphical model to provide structural constraints on parameter estimation is of generic importance. Although the EM algorithm has been used to extract affine and Euclidean parameters from point-sets [4] or line-sets [9], there has been no attempt to impose structural constraints on the correspondence matches. 
Viewed from the perspective of graphical template matching [1, 7], our EM algorithm allows an explicit deformational model to be imposed on a set of feature points. Since the method delivers statistical estimates for both the transformation parameters and their associated covariance matrix, it offers significant advantages in terms of its adaptive capabilities.\n\n2 Perspective Geometry\n\nOur basic aim is to recover the perspective transformation parameters which bring a set of model or fiducial points into correspondence with their counterparts in a set of image data. Each point in the image data is represented by an augmented vector of co-ordinates $\\mathbf{w}_i = (x_i, y_i, 1)^T$, where $i$ is the point index. The available set of image points is denoted by $\\mathbf{w} = \\{\\mathbf{w}_i, \\forall i \\in \\mathcal{D}\\}$, where $\\mathcal{D}$ is the point index-set. The fiducial points constituting the model are similarly represented by the set of augmented co-ordinate vectors $\\mathbf{z} = \\{\\mathbf{z}_j, \\forall j \\in \\mathcal{M}\\}$. Here $\\mathcal{M}$ is the index-set for the model feature-points and the $\\mathbf{z}_j$ represent the corresponding image co-ordinates.\n\nPerspective geometry is distinguished from the simpler Euclidean (translation, rotation and scaling) and affine (the addition of shear) cases by the presence of significant foreshortening. We represent the perspective transformation by the parameter matrix\n\n$$\\Phi^{(n)} = \\begin{pmatrix} \\phi^{(n)}_{1,1} & \\phi^{(n)}_{1,2} & \\phi^{(n)}_{1,3} \\\\ \\phi^{(n)}_{2,1} & \\phi^{(n)}_{2,2} & \\phi^{(n)}_{2,3} \\\\ \\phi^{(n)}_{3,1} & \\phi^{(n)}_{3,2} & 1 \\end{pmatrix} \\qquad (1)$$\n\nUsing homogeneous co-ordinates, the transformation between model and data is $\\mathbf{z}^{(n)}_j = (\\mathbf{z}_j^T \\Psi^{(n)})^{-1} \\Phi^{(n)} \\mathbf{z}_j$, where $\\Psi^{(n)} = (\\phi^{(n)}_{3,1}, \\phi^{(n)}_{3,2}, 1)^T$ is a column-vector formed from the elements in the bottom row of the transformation matrix.\n\n3 Relational Constraints\n\nOne of our goals in this paper is to exploit structural constraints to improve the recovery of perspective parameters from sets of feature points. 
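A minimal NumPy sketch of the perspective mapping of Section 2 follows, assuming the eight free entries of Φ are packed row by row into a flat parameter vector; the helper names make_phi and project are illustrative and do not come from the paper. It applies Φ to each augmented model point and divides by z_j^T Ψ, so the third co-ordinate of every prediction returns to 1.

```python
import numpy as np


def make_phi(params):
    # Assemble the 3x3 perspective matrix of equation (1) from its eight free
    # entries (row by row); the bottom-right element is fixed to 1.
    p = np.asarray(params, dtype=float)
    return np.array([[p[0], p[1], p[2]],
                     [p[3], p[4], p[5]],
                     [p[6], p[7], 1.0]])


def project(phi, z):
    # z holds augmented model points (x, y, 1) as rows, shape (m, 3).
    # Psi is the bottom row of Phi, so z @ psi gives z_j^T Psi for every j.
    psi = phi[2, :]
    scale = z @ psi
    # Apply Phi to each point and renormalise by the homogeneous scale.
    return (z @ phi.T) / scale[:, None]


# Hypothetical example: a mild perspective warp of three model points.
z = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
print(project(make_phi([1, 0, 2, 0, 1, 3, 0.5, 0]), z))
# -> rows (2, 3, 1), (2, 2, 1) and (2, 4, 1)
```

Because Ψ is simply the bottom row of Φ, the division needs no separate parameter; setting φ31 = φ32 = 0 makes every scale equal to 1 and recovers the affine case.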
We abstract the process as bi-partite graph matching. Because of its well-documented robustness to noise and change of viewpoint, we adopt the Delaunay triangulation as our basic representation of image structure [3]. We establish Delaunay triangulations on the data and the model by seeding Voronoi tessellations from the feature-points.\n\nThe process of Delaunay triangulation generates relational graphs from the two sets of point-features. More formally, the point-sets are the nodes of a data graph $G_D = \\{\\mathcal{D}, E_D\\}$ and a model graph $G_M = \\{\\mathcal{M}, E_M\\}$. Here $E_D \\subseteq \\mathcal{D} \\times \\mathcal{D}$ and $E_M \\subseteq \\mathcal{M} \\times \\mathcal{M}$ are the edge-sets of the data and model graphs. Key to our matching process is the idea of using the edge-structure of Delaunay graphs to constrain the correspondence matches between the two point-sets. This correspondence matching is denoted by the function $f : \\mathcal{D} \\to \\mathcal{M}$ from the nodes of the data-graph to those of the model graph. According to this notation, the statement $f^{(n)}(i) = j$ indicates that there is a match between the node $i \\in \\mathcal{D}$ of the data-graph and the node $j \\in \\mathcal{M}$ of the model graph at iteration $n$ of the algorithm. We use the binary indicator\n\n$$s^{(n)}_{i,j} = \\begin{cases} 1 & \\text{if } f^{(n)}(i) = j \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad (2)$$\n\nto represent the configuration of correspondence matches.\n\nWe exploit the structure of the Delaunay graphs to compute the consistency of match using the Bayesian framework for relational graph-matching recently reported by Wilson and Hancock [12]. Suffice it to say that the consistency of a configuration of matches residing on the neighbourhood $R_i = \\{i\\} \\cup \\{k ; (i,k) \\in E_D\\}$ of the node $i$ in the data-graph and its counterpart $S_j = \\{j\\} \\cup \\{l ; (j,l) \\in E_M\\}$ for the node $j$ in the model-graph is gauged by Hamming distance. The Hamming distance $H(i,j)$ counts the number of matches on the data-graph neighbourhood $R_i$ that are inconsistently matched onto the model-graph neighbourhood $S_j$. 
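The neighbourhoods R_i and S_j and the Hamming distance H(i, j) can be sketched in plain Python as below. This is a simplified reading rather than the exact composition R_i • S_j of Wilson and Hancock [12]: it assumes the edges arrive as index pairs (for instance taken from the simplices attribute of scipy.spatial.Delaunay), stores the current matches f as a list, and counts the neighbours k of i whose current match lies outside S_j. The node i itself is skipped because its match is the hypothesis f(i) = j being scored. The function names are hypothetical.

```python
def neighbourhoods(n_nodes, edges):
    # R_i = {i} plus the graph neighbours of i, built from an undirected edge
    # list such as the edges of a Delaunay triangulation.
    nbrs = [{i} for i in range(n_nodes)]
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def hamming(i, j, match, data_nbrs, model_nbrs):
    # Count neighbours k of data node i whose current match match[k] falls
    # outside the model neighbourhood S_j (a simplified H(i, j)).
    return sum(1 for k in data_nbrs[i]
               if k != i and match[k] not in model_nbrs[j])


# Hypothetical example: identical four-node graphs matched by the identity.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
R = neighbourhoods(4, edges)
S = neighbourhoods(4, edges)
match = [0, 1, 2, 3]
print(hamming(0, 0, match, R, S))  # -> 0, both neighbours of 0 land in S_0
print(hamming(0, 3, match, R, S))  # -> 1, neighbour 1 is not in S_3 = {2, 3}
```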
According to Wilson and Hancock [12], the structural probability for the correspondence match $f(i) = j$ at iteration $n$ of the algorithm is given by\n\n$$\\zeta^{(n)}_{i,j} = \\frac{\\exp[-\\beta H(i,j)]}{\\sum_{j' \\in \\mathcal{M}} \\exp[-\\beta H(i,j')]} \\qquad (3)$$\n\nIn the above expression, the Hamming distance is given by $H(i,j) = \\sum_{(k,l) \\in R_i \\bullet S_j} (1 - s^{(n)}_{k,l})$, where the symbol $\\bullet$ denotes the composition of the data-graph relation $R_i$ and the model-graph relation $S_j$. The exponential constant $\\beta = \\ln \\frac{1 - P_e}{P_e}$ is related to the uniform probability of structural matching errors $P_e$. This probability is set to reflect the overlap of the two point-sets. In the work reported here we set\n\n$$P_e = \\frac{2\\,\\big||\\mathcal{M}| - |\\mathcal{D}|\\big|}{|\\mathcal{M}| + |\\mathcal{D}|}.$$\n\n4 The EM Algorithm\n\nOur aim is to extract perspective pose parameters and correspondence matches from the two point-sets using the EM algorithm. According to the original work of Dempster, Laird and Rubin [2], the expected likelihood function is computed by weighting the current log-probability density by the a posteriori measurement probabilities computed from the preceding maximum likelihood parameters. Jordan and Jacobs [6] augment the process with a graphical model which effectively gates contributions to the expected log-likelihood function. Here we provide a variant of this idea in which the bi-partite graph representing the correspondence matches gates the log-likelihood function for the perspective pose parameters.\n\n4.1 Mixture Model\n\nOur basic aim is to jointly maximise the data-likelihood $p(\\mathbf{w}|\\mathbf{z}, f, \\Phi)$ over the space of correspondence matches $f$ and the matrix of perspective parameters $\\Phi$. 
To commence our development, we assume observational independence and factorise the conditional measurement density over the set of data-items\n\n$$p(\\mathbf{w}|\\mathbf{z}, f, \\Phi) = \\prod_{i \\in \\mathcal{D}} p(\\mathbf{w}_i|\\mathbf{z}, f, \\Phi) \\qquad (4)$$\n\nIn order to apply the apparatus of the EM algorithm to maximising $p(\\mathbf{w}|\\mathbf{z}, f, \\Phi)$ with respect to $f$ and $\\Phi$, we must establish a mixture model over the space of correspondence matches. Accordingly, we expand over the space of match indicator variables using the law of total probability. In other words,\n\n$$p(\\mathbf{w}_i|\\mathbf{z}, f, \\Phi) = \\sum_{s_{i,j} \\in f} p(\\mathbf{w}_i, s_{i,j}|\\mathbf{z}, f, \\Phi) \\qquad (5)$$\n\nIn order to develop a tractable likelihood function, we apply the chain rule of conditional probability. In addition, we use the indicator variables to control the switching of the conditional measurement densities via exponentiation. In other words, we assume $p(\\mathbf{w}_i|s_{i,j}, \\mathbf{z}_j, \\Phi) = p(\\mathbf{w}_i|\\mathbf{z}_j, \\Phi)^{s_{i,j}}$.\n\nWith this simplification, the mixture model for the correspondence matching process leads to the following expression for the expected likelihood function\n\n$$Q(f^{(n+1)}, \\Phi^{(n+1)}|f^{(n)}, \\Phi^{(n)}) = \\sum_{i \\in \\mathcal{D}} \\sum_{j \\in \\mathcal{M}} P(s_{i,j}|\\mathbf{w}, \\mathbf{z}, f^{(n)}, \\Phi^{(n)})\\, s^{(n)}_{i,j} \\ln p(\\mathbf{w}_i|\\mathbf{z}_j, \\Phi^{(n+1)}) \\qquad (6)$$\n\nTo further simplify matters, we make a mean-field approximation and replace $s^{(n)}_{i,j}$ by its average value, i.e. we make use of the fact that $E(s^{(n)}_{i,j}) = \\zeta^{(n)}_{i,j}$. In this way the structural matching probabilities gate contributions to the expected likelihood function. This mean-field approximation alleviates problems associated with local optima, which are likely to occur if the likelihood function is discretised by gating with $s_{i,j}$.\n\n4.2 Expectation\n\nUsing the Bayes rule, we can re-write the a posteriori measurement probabilities in terms of the components of the conditional measurement densities appearing in the mixture model in equation (5)\n\n$$P(s_{i,j}|\\mathbf{w}, \\mathbf{z}, f^{(n)}, \\Phi^{(n)}) = \\frac{\\zeta^{(n)}_{i,j}\\, p(\\mathbf{w}_i|\\mathbf{z}_j, \\Phi^{(n)})}{\\sum_{j' \\in \\mathcal{M}} \\zeta^{(n)}_{i,j'}\\, p(\\mathbf{w}_i|\\mathbf{z}_{j'}, \\Phi^{(n)})} \\qquad (7)$$\n\nIn order to proceed with the development of a point registration process, we require a model for the conditional measurement densities, i.e. $p(\\mathbf{w}_i|\\mathbf{z}_j, \\Phi^{(n)})$. Here we assume that the required model can be specified in terms of a multivariate Gaussian distribution. The random variables appearing in these distributions are the error residuals for the position predictions of the $j$th model point delivered by the current estimated transformation parameters. Accordingly we write\n\n$$p(\\mathbf{w}_i|\\mathbf{z}_j, \\Phi^{(n)}) = \\frac{1}{2\\pi\\sqrt{|\\Sigma|}} \\exp\\left[-\\frac{1}{2}(\\mathbf{w}_i - \\mathbf{z}^{(n)}_j)^T \\Sigma^{-1} (\\mathbf{w}_i - \\mathbf{z}^{(n)}_j)\\right] \\qquad (8)$$\n\nIn the above expression, $\\Sigma$ is the variance-covariance matrix for the vector of error-residuals $\\epsilon_{i,j}(\\Phi^{(n)}) = \\mathbf{w}_i - \\mathbf{z}^{(n)}_j$ between the components of the predicted measurement vectors $\\mathbf{z}^{(n)}_j$ and their counterparts in the data, i.e. $\\mathbf{w}_i$. Formally, the matrix is related to the expectation of the outer-product of the error-residuals, i.e. $\\Sigma = E[\\epsilon_{i,j}(\\Phi^{(n)})\\, \\epsilon_{i,j}(\\Phi^{(n)})^T]$.\n\n4.3 Maximisation\n\nThe maximisation step of our matching algorithm is based on two coupled update processes. The first of these aims to locate maximum a posteriori probability correspondence matches. The second class of update operation is concerned with locating maximum likelihood transformation parameters. We effect the coupling by allowing information flow between the two processes. Correspondences located by maximum a posteriori graph-matching are used to constrain the recovery of maximum likelihood transformation parameters. A posteriori measurement probabilities computed from the updated transformation parameters are used to refine the correspondence matches.
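One dual-step iteration can be sketched in NumPy as follows, under several assumptions that are ours rather than the paper's: data and predicted points are handled as 2-D co-ordinates, Σ is a fixed 2x2 covariance, the Hamming distances H(i, j) are passed in as a precomputed matrix (a full implementation would recompute them from the current matches on every pass), and the Levenberg-Marquardt step is a bare-bones loop with a forward-difference Jacobian rather than a library routine. The sketch evaluates equation (3) for ζ, the Gaussian densities of equation (8) and the posteriors of equation (7), then applies the MAP correspondence update and the gated least-squares refit of Φ that equations (9) and (10) formalise. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def predict(params, z):
    # Perspective map of Section 2. params packs the eight free entries of Phi
    # row by row; z holds augmented model points (x, y, 1) as rows.
    phi = np.append(np.asarray(params, dtype=float), 1.0).reshape(3, 3)
    h = z @ phi.T
    return h[:, :2] / h[:, 2:3]


def structural_prob(H, beta):
    # Equation (3): softmax of -beta * H(i, j) over model nodes j, one row per data node i.
    a = -beta * np.asarray(H, dtype=float)
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def gaussian(w, pred, sigma):
    # Equation (8) for 2-D residuals between every data point and every prediction.
    r = w[:, None, :] - pred[None, :, :]
    d2 = np.einsum('nmi,ij,nmj->nm', r, np.linalg.inv(sigma), r)
    return np.exp(-0.5 * d2) / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))


def e_step(w, z, params, H, beta, sigma):
    # Equation (7): posterior proportional to zeta_ij * p(w_i | z_j, Phi).
    zeta = structural_prob(H, beta)
    num = zeta * gaussian(w, predict(params, z), sigma)
    # The tiny constant only guards against division by zero on underflow.
    return zeta, num / (num.sum(axis=1, keepdims=True) + 1e-300)


def gated_residuals(params, w, z, weights, L):
    # sqrt(weight_ij) * L^T (w_i - z_j); the squared norm is the sum in equation (10).
    r = w[:, None, :] - predict(params, z)[None, :, :]
    return (np.sqrt(weights)[:, :, None] * (r @ L)).ravel()


def lm_fit(params, w, z, weights, sigma, iters=50, lam=1e-3, h=1e-6):
    # Bare-bones Levenberg-Marquardt with a forward-difference Jacobian.
    L = np.linalg.cholesky(np.linalg.inv(sigma))
    p = np.asarray(params, dtype=float)
    r = gated_residuals(p, w, z, weights, L)
    for _ in range(iters):
        J = np.empty((r.size, p.size))
        for k in range(p.size):
            dp = np.zeros_like(p)
            dp[k] = h
            J[:, k] = (gated_residuals(p + dp, w, z, weights, L) - r) / h
        A = J.T @ J
        # Large lam gives short, gradient-like steps; small lam approaches Gauss-Newton.
        step = np.linalg.solve(A + lam * np.diag(np.diag(A) + 1e-12), -J.T @ r)
        r_new = gated_residuals(p + step, w, z, weights, L)
        if r_new @ r_new < r @ r:
            p, r, lam = p + step, r_new, lam * 0.1
        else:
            lam *= 10.0
    return p


def em_iteration(w, z, params, H, beta, sigma):
    # One dual-step pass: E-step, MAP matches (9), gated refit of Phi (10).
    zeta, post = e_step(w, z, params, H, beta, sigma)
    gate = post * zeta
    match = np.argmax(gate, axis=1)
    return match, lm_fit(params, w, z, gate, sigma)


# Hypothetical synthetic check: five model points warped by a known Phi.
z = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1], [0.5, 0.2, 1]], dtype=float)
true_params = np.array([1.1, 0.1, 0.2, -0.05, 0.9, 0.1, 0.05, 0.02])
w = predict(true_params, z)
H = 5.0 * (1.0 - np.eye(5))  # hypothetical H(i, j) favouring the identity matching
start = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
match, fitted = em_iteration(w, z, start, H, 2.0, np.eye(2))
print(match, np.round(fitted, 3))
```

The hand-written loop is kept explicit so that the switch between gradient-like and Gauss-Newton steps stays visible; applying scipy.optimize.least_squares with method='lm' to gated_residuals would be a natural refinement.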
\n\nIn terms of the indicator variables, the configuration of maximum a posteriori probability correspondence matches is updated as follows\n\n$$f^{(n+1)}(i) = \\arg\\max_{j \\in \\mathcal{M}} P(\\mathbf{z}_j|\\mathbf{w}_i, \\Phi^{(n)}) \\frac{\\exp\\left[-\\beta \\sum_{(k,l) \\in R_i \\bullet S_j} (1 - s^{(n)}_{k,l})\\right]}{\\sum_{j' \\in \\mathcal{M}} \\exp\\left[-\\beta \\sum_{(k,l) \\in R_i \\bullet S_{j'}} (1 - s^{(n)}_{k,l})\\right]} \\qquad (9)$$\n\nThe maximum likelihood transformation parameters satisfy the condition\n\n$$\\Phi^{(n+1)} = \\arg\\min_{\\Phi^{(n+1)}} \\sum_{i \\in \\mathcal{D}} \\sum_{j \\in \\mathcal{M}} P(\\mathbf{z}_j|\\mathbf{w}_i, \\Phi^{(n)})\\, \\zeta^{(n)}_{i,j}\\, (\\mathbf{w}_i - \\mathbf{z}^{(n+1)}_j)^T \\Sigma^{-1} (\\mathbf{w}_i - \\mathbf{z}^{(n+1)}_j) \\qquad (10)$$\n\nIn the case of perspective geometry, where we have used homogeneous co-ordinates, the saddle-point equations do not admit a closed-form linear solution. Instead, we solve the non-linear optimisation problem using the Levenberg-Marquardt technique. This non-linear optimisation technique offers a compromise between the steepest descent and inverse Hessian methods: the inverse Hessian method is used when close to the optimum, while steepest descent is used far from it.\n\n5 Experiments\n\nThe real-world evaluation of our matching method is concerned with recognising planar objects in different 3D poses. The object used in this study is a 3.5 inch floppy disk which is placed on a desktop. The scene is viewed with a low-quality SGI IndyCam. The feature points used to triangulate the object are corners. Since the imaging process is not accurately modelled by a perspective transformation under pin-hole optics, the example provides a challenging test of our matching process.\n\nOur experiments are illustrated in Figure 1. The first two columns show the views under match. In the first example (the upper row of Figure 1) we are concerned with matching when there is a significant difference in perspective foreshortening. 
In the example shown in the lower row of Figure 1, there is a rotation of the object in addition to the foreshortening. The images in the third column are the initial matching configurations. Here the perspective parameter matrix has been selected at random. The fourth column in Figure 1 shows the final matching configuration after the EM algorithm has converged. In both cases the final registration is accurate. The algorithm appears to be capable of recovering good matches even when the initial pose estimate is poor.\n\nFigure 1: Images Under Match, Initial and Final Configurations.\n\nWe now turn to measuring the sensitivity of our method. In order to illustrate the benefits offered by the structural gating process, we compare its performance with a conventional least-squares parameter estimation process. Figure 2 shows a comparison of the two algorithms for a problem involving a point-set of 20 nodes. Here we show the RMS error as a function of the number of points which have correct correspondence matches. The break-even point occurs when 8 nodes are initially matched correctly and there are 12 errors. Once the number of initially correct correspondences exceeds 8, the EM method consistently outperforms the least-squares estimation.\n\n6 Conclusions\n\nOur main contributions in this paper are twofold. The theoretical contribution has been to develop a mixture model that allows a graphical structure to constrain the estimation of maximum likelihood model parameters. The second contribution is a practical one, and involves the application of the mixture model to the estimation of perspective pose parameters. There are a number of ways in which the ideas developed in this paper can be extended. For instance, the framework is readily extensible to the recognition of more complex non-planar objects.\n\nFigure 2: Structural Sensitivity. (Plot legend: LSF standard, LSF structural.)\n\nReferences\n\n[1] Y. Amit and A. Kong, \"Graphical Templates for Model Registration\", IEEE PAMI, 18, pp. 225-236, 1996.\n\n[2] A.P. Dempster, N.M. Laird and D.B. Rubin, \"Maximum-likelihood from incomplete data via the EM algorithm\", J. Royal Statistical Soc. Ser. B (Methodological), 39, pp. 1-38, 1977.\n\n[3] O.D. Faugeras, E. Le Bras-Mehlman and J-D. Boissonnat, \"Representing Stereo Data with the Delaunay Triangulation\", Artificial Intelligence, 44, pp. 41-87, 1990.\n\n[4] S. Gold, A. Rangarajan and E. Mjolsness, \"Learning with pre-knowledge: Clustering with point and graph-matching distance measures\", Neural Computation, 8, pp. 787-804, 1996.\n\n[5] R.I. Hartley, \"Projective Reconstruction and Invariants from Multiple Images\", IEEE PAMI, 16, pp. 1036-1041, 1994.\n\n[6] M.I. Jordan and R.A. Jacobs, \"Hierarchical Mixtures of Experts and the EM Algorithm\", Neural Computation, 6, pp. 181-214, 1994.\n\n[7] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Wurtz and W. Konen, \"Distortion-invariant object-recognition in a dynamic link architecture\", IEEE Transactions on Computers, 42, pp. 300-311, 1993.\n\n[8] D.P. McReynolds and D.G. Lowe, \"Rigidity Checking of 3D Point Correspondences under Perspective Projection\", IEEE PAMI, 18, pp. 1174-1185, 1996.\n\n[9] S. Moss and E.R. Hancock, \"Registering Incomplete Radar Images with the EM Algorithm\", Image and Vision Computing, 15, pp. 637-648, 1997.\n\n[10] D. Oberkampf, D.F. DeMenthon and L.S. Davis, \"Iterative Pose Estimation using Coplanar Feature Points\", Computer Vision and Image Understanding, 63, pp. 495-511, 1996.\n\n[11] P. Torr, A. Zisserman and S.J. 
Maybank, \"Robust Detection of Degenerate Configurations for the Fundamental Matrix\", Proceedings of the Fifth International Conference on Computer Vision, pp. 1037-1042, 1995.\n\n[12] R.C. Wilson and E.R. Hancock, \"Structural Matching by Discrete Relaxation\", IEEE PAMI, 19, pp. 634-648, 1997.", "award": [], "sourceid": 1331, "authors": [{"given_name": "Andrew", "family_name": "Cross", "institution": null}, {"given_name": "Edwin", "family_name": "Hancock", "institution": null}]}