{"title": "Permutation Diffusion Maps (PDM) with Application to the Image Association Problem in Computer Vision", "book": "Advances in Neural Information Processing Systems", "page_first": 541, "page_last": 549, "abstract": "Consistently matching keypoints across images, and the related problem of finding clusters of nearby images, are critical components of various tasks in Computer Vision, including Structure from Motion (SfM). Unfortunately, occlusion and large repetitive structures tend to mislead most currently used matching algorithms, leading to characteristic pathologies in the final output. In this paper we introduce a new method, Permutations Diffusion Maps (PDM), to solve the matching problem, as well as a related new affinity measure, derived using ideas from harmonic analysis on the symmetric group. We show that just by using it as a preprocessing step to existing SfM pipelines, PDM can greatly improve reconstruction quality on difficult datasets.", "full_text": "Permutation Diffusion Maps with Application to the\n\nImage Association Problem in Computer Vision\n\nDeepti Pachauriy, Risi Kondorx, Gautam Sargury, Vikas Singhzy\nyDept. of Computer Sciences, University of Wisconsin\u2013Madison\n\nzDept. of Biostatistics & Medical Informatics, University of Wisconsin\u2013Madison\nxDept. of Computer Science and Dept. of Statistics, The University of Chicago\n\npachauri@cs.wisc.edu risi@uchicago.edu gautam@cs.wisc.edu\n\nvsingh@biostat.wisc.edu\n\nAbstract\n\nConsistently matching keypoints across images, and the related problem of \ufb01nd-\ning clusters of nearby images, are critical components of various tasks in Com-\nputer Vision, including Structure from Motion (SfM). Unfortunately, occlusion\nand large repetitive structures tend to mislead most currently used matching algo-\nrithms, leading to characteristic pathologies in the \ufb01nal output. 
In this paper we\npropose a new method, Permutations Diffusion Maps (PDM), and a related new\naf\ufb01nity measure, Permutation Diffusion Af\ufb01nity (PDA), to solve this problem.\nPDM is inspired by Vector Diffusion Maps, recently introduced by Singer and\nWu, and uses ideas from the theory of Fourier analysis on the symmetric group.\nWe show that when dealing with dif\ufb01cult datasets, using PDM as a preprocessing\nstep to existing SfM pipelines can signi\ufb01cantly improve results.\n\n1 Introduction\n\nStructure from motion (SfM) is the task of jointly reconstructing 3D scenes and camera poses from\na set of images. Keypoints or features extracted from each image provide correspondences between\npairs of images, making it possible to estimate the relative camera pose. This gives rise to an\nassociation graph in which two images are connected by an edge if they share a suf\ufb01cient number of\ncorresponding keypoints, and the edge itself is labeled by the estimated matching between the two\nsets of keypoints. Starting with these putative image to image associations, one typically uses the so-\ncalled bundle adjustment procedure to simultaneously solve for the global camera pose parameters\nand 3-D scene locations, incrementally minimizing the sum of squares of the re-projection error.\nDespite their popularity, large scale bundle adjustment methods have well known limitations. In\nparticular, due to the highly nonlinear nature of the objective function, they can get stuck in bad lo-\ncal minima. Therefore, starting with a good initial matching (i.e., an informative image association\ngraph) is critical. 
Several papers have studied this behavior in detail [1], and conclude that if one\nstarts the numerical optimization from an incorrect \u201cseed\u201d (i.e., a subgraph of the image associations),\nthe downstream optimization is unlikely to ever recover.\nSimilar challenges arise in other fields, ranging from machine learning [2] to computational biology.\nFor instance, consider the de novo genome assembly problem in computational biology [3]. The\ngoal here is to reconstruct the original DNA sequence from fragments without a reference genome.\nBecause the genome may have many repeated structures, the alignment problem becomes very hard.\nIn general, reconstruction algorithms start with two maximally overlapping sequences, and proceed\nby selecting subsequent fragments using a process not unlike bundle adjustment, prone to similar\nissues with local minima [4]. In both cases it would be preferable to have a model that reasons globally\nover all pairwise information. In this paper, to make our presentation as concrete as possible,\nwe restrict ourselves to describing such an algorithm in the context of Structure from Motion, while\nunderstanding that the underlying ideas apply more generally.\nSeveral authors [5, 6, 7] have recently described situations in large scale structure from motion\nwhere setting up a good image association graph is difficult, and consequently a direct application\nof bundle adjustment yields unsatisfactory results. One such situation is when the scene depicted\nin the images involves a large number of duplicate structures (Figure 1). 
The preprocessing step in a standard pipeline will match visual features and set up the associations\naccordingly, but a key underlying assumption in most (if not all) approaches is that we observe only\na single instance of any structure. This assumption is problematic when scenes have repeating\narchitectural components or recurring patterns, such as windows, bricks, and so on.\nIn Figure 1a views that look exactly the same do not necessarily represent the same physical structure.\nSome (or all) points in one image are actually occluded in the other image. Typical SfM\nmethods will not work well when initialized with such image associations, regardless of which type\nof solver we use. In our example, the resulting reconstruction will be folded (Figure 1b). In other\ncases [5], we get errors ranging from phantom walls to severely superimposed structures yielding\nnonsensical reconstructions.\n\nFigure 1: HOUSE sequence. (a) Representative images. (b) Folded reconstruction by traditional SfM pipeline [8, 9].\n\nRelated Work. The issue described above is variously known in the literature as the SfM disambiguation\nproblem or the data/image association problem in structure from motion. Some of\nthe strategies that have been proposed to mitigate it impose additional conditions, such as in\n[10, 11, 12, 13, 14, 15], but this also breaks down in the presence of large coherent sets of incorrectly\nmatched pairs. One creative solution in recent work is to use metadata alongside images.\n\u201cGeotags\u201d or GIS data when available have been shown to be very effective in deriving a better\ninitialization for bundle adjustment or as a post-processing step to stitch together different components\nof a reconstruction. 
In [6], the authors suggest using image timestamps to impose a natural\nassociation among images, which is valuable when the images are acquired by a single camera in a\ntemporal sequence but difficult to deploy otherwise. Separate from the metadata approach, in controlled\nscenes with relatively less occlusion, missing correspondences yield important local cues to\ninfer potentially incorrect image pairs [6, 7]. Very recently, [5] formalized the intuition that incorrect\nfeature correspondences result in anomalous structures in the so-called visibility graph of the\nfeatures. By looking at a measure of local track quality (from local clustering), one can reason about\nwhich associations are likely to be erroneous. This works well when the number of points is very\nlarge, but the authors of [5] acknowledge that for datasets like those shown in Fig. 1, it may not help\nmuch.\nIn contrast to the above approaches, a number of recent algorithms for the association (or disambiguation)\nproblem argue for global geometric reasoning. In [16], the authors used the number\nof point correspondences as a measure of certainty, which was then globally optimized to find a\nmaximum-weight set of consistent pairwise associations. The authors in [17] seek consistency of\nepipolar geometry constraints for triplets, whereas [18] expands it over larger consistent cliques.\nThe procedure in [16] takes into account loops of associations concurrently with a minimal spanning\ntree over image to image matches. In summary, the bulk of prior work suggests that locally based\nstatistics over chained transformations will run into problems if the inconsistencies are more global\nin nature. 
However, even if the objectives used are global, approximate inference is not known to be\nrobust to coherent noise, which is exactly what we face in the presence of duplicate structures [19].\n\nThis paper.\nIf we take the idea of reasoning globally about association consistency using triples\nor higher order loops to an extreme, it implies deriving the likelihood of a speci\ufb01c image to image\nassociation conditioned on all other associations. This joint likelihood does not factor and explicit\nenumeration quickly becomes intractable. Our approach will make the group structure of image to\n\n2\n\n\fimage relationships explicit. Similarly to prior approaches, we will also operate on the association\ngraph derived from image pairs but with a key distinguishing feature. The association relationships\nwill now be denoted in terms of a \u2018certi\ufb01cate\u2019, that is, the transformation which justi\ufb01es the rela-\ntionship. The transformation may denote the pose parameters derived from the correspondences or\nthe matching (between features) itself. Other options are possible \u2014 as long as this transformation\nis a group action from one set to the other. If so, we can carry over the intuition of consistency over\nlarger cliques of images desired in existing works and rewrite those ideas as invariance properties of\nfunctions de\ufb01ned on the group. In particular, when the transformation is a matching, each edge in the\ngraph is a permutation, i.e., a member of the symmetric group Sn, and a generalization of the Lapla-\ncian related to the representation theory of Sn encodes the associations. In this regard, the present\npaper owes the most to the literature of synchronization problems, speci\ufb01cally [20][21][22][23][24].\nThe key contribution of this paper is to show that the global inference desired in many existing\nworks falls out nicely as a diffusion process using such a Laplacian. 
We show promising results\ndemonstrating that for various difficult datasets with large repetitive patterns, results from a simple\ndecomposition procedure are, in fact, competitive with those obtained using sophisticated optimization\nschemes with/without metadata. Finally, we note that the proposed algorithm can either be used\nstandalone to derive meaningful inputs to a bundle adjustment procedure or as a preprocessing step\nto other approaches (especially ones that incorporate timestamps and/or GPS data).\n\n2 Synchronization by Vector Diffusion\nConsider a collection of $m$ images $\\{I_1, I_2, \\ldots, I_m\\}$ of the same object or scene taken from different\nviewpoints and possibly under different conditions, and assume that in each image $I_i$, a keypoint\ndetector has detected $n$ landmarks (keypoints) $\\{x^i_1, x^i_2, \\ldots, x^i_n\\}$. Given two images $I_i$ and $I_j$, the\nlandmark matching problem consists of finding pairs of landmarks $x^i_p \\sim x^j_q$ (with $x^i_p$ coming from\nimage $I_i$ and $x^j_q$ coming from $I_j$) which correspond to the same underlying physical feature.\nAssuming that both images contain exactly the same $n$ landmarks, the matching between $I_i$ and\n$I_j$ may be described by the unique permutation $\\tau_{ji} : \\{1, 2, \\ldots, n\\} \\to \\{1, 2, \\ldots, n\\}$ under which\n$x^i_p \\sim x^j_{\\tau_{ji}(p)}$. Typically, local image features, such as SIFT descriptors, can provide an initial guess\nfor each $\\tau_{ji}$, but by itself each of these individual image-to-image matchings is highly error prone,\nespecially in the presence of occlusion and repetitive structures. A major clue to correcting these\nerrors is the constraint that matchings must be consistent, i.e., if $\\tau_{ji}$ tells us that $x^i_p$ corresponds to\n$x^j_q$, and $\\tau_{kj}$ tells us that $x^j_q$ corresponds to $x^k_r$, then the permutation $\\tau_{ki}$ between $I_i$ and $I_k$ should\nassign $x^i_p$ to $x^k_r$. 
Mathematically, this is a reflection of the fact that, defining the product of two\npermutations $\\sigma_1$ and $\\sigma_2$ in the usual way as\n$$\\sigma_3 = \\sigma_2 \\sigma_1 \\iff \\sigma_3(i) = \\sigma_2(\\sigma_1(i)), \\qquad i = 1, 2, \\ldots, n,$$\nthe $n!$ different permutations of $\\{1, 2, \\ldots, n\\}$ form a group. This group is called the symmetric\ngroup of degree $n$ and is denoted $S_n$. In group theoretical notation the consistency conditions reduce\nto requiring that given any three images $I_i$, $I_j$ and $I_k$, the relative matchings between them must\nsatisfy $\\tau_{kj} \\tau_{ji} = \\tau_{ki}$. An equivalent condition is that it must be possible to associate to each $I_i$ a \u201cbase\npermutation\u201d $\\sigma_i$ so that $\\tau_{ji} = \\sigma_j \\sigma_i^{-1}$ for any $(i, j)$ pair. Thus, the problem of finding a consistent set\nof $\\tau_{ji}$\u2019s is reduced to finding the $m$ base permutations $\\sigma_1, \\ldots, \\sigma_m$.\nProblems of this general form, where given some (finite or continuous) group $G$, one must estimate\na matrix $(g_{ji})_{j,i=1}^{m}$ of group elements obeying $g_{kj} g_{ji} = g_{ki}$, are called synchronization problems.\nStarting with the seminal work of Singer et al. [20][21] on synchronization over the rotation group\nfor aligning images in cryo-EM, followed by synchronization over the Euclidean group [25], and\nmost recently synchronization over $S_n$ for matching landmarks [23][24], such problems have recently\ngenerated a lot of interest. Some of the newest and most promising approaches involve\nsemi-definite programming [15][24][26].\nIn the context of synchronizing three dimensional rotations for cryo-EM, Singer and Wu [22] proposed\na particularly elegant formalism, called Vector Diffusion Maps, which conceives of synchronization\nas diffusing the base rotation $Q_i$ from each image to its neighbors. 
However, unlike in\nordinary diffusion, as $Q_i$ diffuses to $I_j$, the observed relative rotation $O_{ji}$ of $I_j$ to $I_i$ changes $Q_i$\nto $O_{ji} Q_i$. If all the $(O_{ji})_{i,j}$ observations were perfectly synchronized, then no matter what path\n$i \\to i_1 \\to i_2 \\to \\ldots \\to j$ we took from $i$ to $j$, the resulting rotation $O_{j,i_p} \\cdots O_{i_2,i_1} O_{i_1,i} Q_i$ would\nbe the same. However, if some (in many practical cases, the majority) of the $O_{ji}$\u2019s are incorrect,\nthen different paths from one vertex to another contribute different rotations, which one then needs\nto average in some appropriate sense.\nA natural choice for the loss that describes the extent to which the $Q_1, \\ldots, Q_m$ imputed base rotations\n(playing the role of the $\\sigma_i$\u2019s in the permutation case) satisfy the $O_{ji}$ observations is\n$$E(Q_1, \\ldots, Q_m) = \\frac{1}{2} \\sum_{i,j=1}^{m} w_{ij} \\| Q_j - O_{ji} Q_i \\|_{\\mathrm{Frob}}^2 = \\frac{1}{2} \\sum_{i,j=1}^{m} w_{ij} \\| Q_j Q_i^\\top - O_{ji} \\|_{\\mathrm{Frob}}^2, \\qquad (1)$$\nwhere the $w_{ij}$ edge weight describes our confidence in rotation $O_{ji}$. A crucial observation is that this\nloss can be rewritten in the form $E(Q_1, \\ldots, Q_m) = V^\\top L V$, where\n$$V = \\begin{pmatrix} Q_1 \\\\ \\vdots \\\\ Q_m \\end{pmatrix}, \\qquad L = \\begin{pmatrix} d_1 I & -w_{1,2} O_{1,2} & \\cdots & -w_{1,m} O_{1,m} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ -w_{m,1} O_{m,1} & -w_{m,2} O_{m,2} & \\cdots & d_m I \\end{pmatrix}, \\qquad (2)$$\nand $d_i = \\sum_{j \\neq i} w_{ij}$. 
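The identity behind (2) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names (random_rotation, build_laplacian, loss) and the random test data are assumptions made for illustration. Since V^T L V is itself a 3 x 3 matrix, the scalar loss (1) is compared here against its trace.

```python
import numpy as np

def random_rotation(rng):
    # QR of a Gaussian matrix gives an orthogonal matrix; flip one column if det = -1.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def build_laplacian(O, w):
    # O[j][i] is the observed 3x3 rotation O_ji; w is a symmetric m x m weight array.
    m = w.shape[0]
    L = np.zeros((3 * m, 3 * m))
    for j in range(m):
        for i in range(m):
            rows, cols = slice(3 * j, 3 * j + 3), slice(3 * i, 3 * i + 3)
            if i == j:
                L[rows, cols] = (w[j].sum() - w[j, j]) * np.eye(3)  # d_j I
            else:
                L[rows, cols] = -w[j, i] * O[j][i]
    return L

def loss(Q, O, w):
    # Loss (1), skipping the i == j terms, which vanish when O_ii = I.
    m = len(Q)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += 0.5 * w[i, j] * np.linalg.norm(Q[j] - O[j][i] @ Q[i], 'fro') ** 2
    return total

rng = np.random.default_rng(0)
m = 4
Q = [random_rotation(rng) for _ in range(m)]
# Arbitrary (unsynchronized) observations with O_ij = O_ji^T.
O = [[np.eye(3) for _ in range(m)] for _ in range(m)]
for j in range(m):
    for i in range(j):
        O[j][i] = random_rotation(rng)
        O[i][j] = O[j][i].T
w = rng.uniform(0.1, 1.0, (m, m))
w = (w + w.T) / 2

V = np.vstack(Q)
L = build_laplacian(O, w)
lhs = loss(Q, O, w)
rhs = np.trace(V.T @ L @ V)
print(lhs, rhs)
```

The two printed values should agree up to floating point error for any choice of rotations, which is what makes the eigenvector relaxation possible.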
Note that since $w_{ij} = w_{ji}$ and $O_{ij} = O_{ji}^{-1} = O_{ji}^\\top$, the matrix $L$ is symmetric.\nFurthermore, the above is exactly analogous to the way in which in spectral graph theory (see,\ne.g., [27]) the functional $E(f) = \\frac{1}{2} \\sum_{i,j} w_{ij} (f(i) - f(j))^2$ describing the \u201csmoothness\u201d (with respect\nto the graph topology) of a function $f$ defined on the vertices of a graph can be written as $f^\\top L f$ in\nterms of the usual graph Laplacian\n$$L_{ij} = \\begin{cases} -w_{ij} & i \\neq j \\\\ \\sum_{k \\neq i} w_{ik} & i = j. \\end{cases}$$\nAs is well known, constraining $f$ to have unit norm and excluding the subspace of constant functions,\nthe function minimizing $E(f)$ is the eigenvector of $L$ with (second) smallest eigenvalue. Analogously,\nin synchronizing rotations, the steady state of the diffusion system, which minimizes (1),\ncan be computed by forming the $3m \\times 3$ dimensional matrix $V$ from the 3 lowest non-zero eigenvalue\neigenvectors of $L$, and appropriately rounding each $3 \\times 3$ block $V_i$ of $V$ to the nearest orthogonal matrix\n$Q_i$. The resulting array $(Q_j Q_i^\\top)_{i,j}$ of imputed relative rotations is guaranteed to be consistent,\nand minimizes the loss (1).\n\n3 Permutation Diffusion\n\nIts elegance notwithstanding, the vector diffusion formalism of the previous section seems ill suited\nto our present purposes of improving the SfM pipeline for two reasons: (1) synchronizing over\n$S_n$, which is a finite group, seems much harder than synchronizing over the continuous group of\nrotations; (2) rather than getting an actual synchronized array of matchings, what is critical to SfM\nis to estimate the association graph that captures the extent to which any two images are related to\none another. 
The main contribution of the present paper is to show that both of these problems have\nnatural solutions in the formalism of group representations.\nOur first key observation (already alluded to in [21]) is that the critical step of rewriting the loss\n(1) in terms of the Laplacian (2) does not depend on any special properties of the rotation group\nother than the facts that (a) rotation matrices are unitary (in fact, orthogonal), and (b) if we follow one\nrotation by another, their matrices simply multiply. In general, for any group $G$, a complex valued\nfunction $\\rho : G \\to \\mathbb{C}^{d_\\rho \\times d_\\rho}$ which satisfies $\\rho(g_2 g_1) = \\rho(g_2) \\rho(g_1)$ is called a representation of $G$. The\nrepresentation is unitary if $\\rho(g^{-1}) = (\\rho(g))^{-1} = \\rho(g)^\\dagger$, where $M^\\dagger$ denotes the Hermitian conjugate\n(conjugate transpose) of $M$. Thus, we have the following proposition.\nProposition 1. Let $G$ be any compact group with identity $e$ and $\\rho : G \\to \\mathbb{C}^{d_\\rho \\times d_\\rho}$ be a unitary\nrepresentation of $G$. Then given an array of possibly noisy and unsynchronized group elements,\n$(g_{ji})_{i,j}$, and corresponding positive confidence weights $(w_{ji})_{i,j}$, the synchronization loss (assuming\n$g_{ii} = e$ for all $i$)\n$$E(h_1, \\ldots, h_m) = \\frac{1}{2} \\sum_{i,j=1}^{m} w_{ji} \\| \\rho(h_j h_i^{-1}) - \\rho(g_{ji}) \\|_{\\mathrm{Frob}}^2, \\qquad h_1, \\ldots, h_m \\in G,$$\ncan be written in the form $E(h_1, \\ldots, h_m) = V^\\dagger L V$, where\n$$V = \\begin{pmatrix} \\rho(h_1) \\\\ \\vdots \\\\ \\rho(h_m) \\end{pmatrix}, \\qquad L = \\begin{pmatrix} d_1 I & -w_{1,2} \\rho(g_{1,2}) & \\cdots & -w_{1,m} \\rho(g_{1,m}) \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ -w_{m,1} \\rho(g_{m,1}) & -w_{m,2} \\rho(g_{m,2}) & \\cdots & d_m I \\end{pmatrix}. \\qquad (3)$$\nTo synchronize matchings between images using this proposition, one plugs in the appropriate unitary\nrepresentation of the symmetric group. 
The simplest choice is the so-called defining representation,\nwhose elements are the familiar permutation matrices\n$$\\rho_{\\mathrm{def}}(\\sigma) = P(\\sigma), \\qquad [P(\\sigma)]_{q,p} = \\begin{cases} 1 & \\sigma(p) = q \\\\ 0 & \\text{otherwise,} \\end{cases}$$\nsince the corresponding loss function is\n$$E(\\sigma_1, \\ldots, \\sigma_m) = \\frac{1}{2} \\sum_{i,j=1}^{m} w_{ji} \\| P(\\sigma_j \\sigma_i^{-1}) - P(\\tau_{ji}) \\|_{\\mathrm{Frob}}^2. \\qquad (4)$$\nThe squared Frobenius norm in this expression simply counts (up to a factor of two) the number of mismatches between\nthe observed but noisy permutation $\\tau_{ji}$ and the inferred permutation $\\sigma_j \\sigma_i^{-1}$. For this choice of $\\rho$,\nletting $P_i := P(\\sigma_i)$ and $P^{\\mathrm{obs}}_{ji} := P(\\tau_{ji})$, $E(\\sigma_1, \\ldots, \\sigma_m) = V^\\top L V$, with\n$$V = \\begin{pmatrix} P_1 \\\\ \\vdots \\\\ P_m \\end{pmatrix}, \\qquad L = \\begin{pmatrix} d_1 I & -w_{1,2} P^{\\mathrm{obs}}_{1,2} & \\cdots & -w_{1,m} P^{\\mathrm{obs}}_{1,m} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ -w_{m,1} P^{\\mathrm{obs}}_{m,1} & -w_{m,2} P^{\\mathrm{obs}}_{m,2} & \\cdots & d_m I \\end{pmatrix}. \\qquad (5)$$\nConsequently, just as in the rotation case, synchronization over $S_n$ can be solved by forming $V$ from\nthe first $d_{\\rho_{\\mathrm{def}}} = n$ lowest eigenvectors of $L$, and extracting each $P_i$ from its $i$\u2019th $n \\times n$ block, $V_i$.\nHere we must take a little care because unless the $\\tau_{ji}$\u2019s are already synchronized, it is not a priori\nguaranteed that the resulting block will be a valid permutation matrix. Therefore, analogously to the\nprocedure described in [23], we first multiply $V_i$ by $V_1^\\top$, and then use a linear assignment procedure\nto find the permutation $\\hat{\\sigma}_i$ whose permutation matrix is closest to $V_i V_1^\\top$. 
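The procedure just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, permutations are stored as index arrays with sigma[p] = sigma(p), and SciPy's linear_sum_assignment is swapped in as the linear assignment solver. The demonstration uses noise-free matchings only, where exact recovery (relative to the first image) is expected.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def perm_matrix(sigma):
    # [P(sigma)]_{q,p} = 1 if sigma(p) = q, and 0 otherwise.
    n = len(sigma)
    P = np.zeros((n, n))
    P[sigma, np.arange(n)] = 1.0
    return P

def synchronize_permutations(tau, w):
    # tau[j][i] holds the observed matching tau_ji as an index array;
    # w is a symmetric m x m array of positive confidence weights.
    m, n = len(tau), len(tau[0][0])
    L = np.zeros((m * n, m * n))
    for j in range(m):
        for i in range(m):
            rows, cols = slice(j * n, (j + 1) * n), slice(i * n, (i + 1) * n)
            if i == j:
                L[rows, cols] = (w[j].sum() - w[j, j]) * np.eye(n)  # d_j I
            else:
                L[rows, cols] = -w[j, i] * perm_matrix(tau[j][i])
    # eigh returns eigenvalues in ascending order; keep the n lowest eigenvectors.
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, :n]
    V1 = V[:n]
    estimates = []
    for i in range(m):
        M = V[i * n:(i + 1) * n] @ V1.T
        # The closest permutation matrix to M maximizes the sum of the selected entries.
        q, p = linear_sum_assignment(-M)
        sigma = np.empty(n, dtype=int)
        sigma[p] = q
        estimates.append(sigma)
    return estimates

rng = np.random.default_rng(0)
m, n = 6, 5
truth = [rng.permutation(n) for _ in range(m)]
# Noise-free observations tau_ji = sigma_j sigma_i^{-1} (argsort inverts a permutation).
tau = [[truth[j][np.argsort(truth[i])] for i in range(m)] for j in range(m)]
w = np.ones((m, m))
est = synchronize_permutations(tau, w)
recovered = all(
    np.array_equal(est[j][np.argsort(est[i])], tau[j][i])
    for j in range(m) for i in range(m)
)
print(recovered)
```

Because each estimate is expressed relative to the first image, only the products est[j] est[i]^{-1} are meaningful; with noisy or partial matchings the same code runs but exact recovery is no longer guaranteed.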
The resulting algorithm we\ncall Synchronization by Permutation Diffusion.\n\n4 Uncertain Matches and Permutation Diffusion Affinity\n\nThe limitation of our framework, as described so far, is the assumption that each keypoint in each\nimage will have a single counterpart in every other image that the local matching procedure with\nsome error can identify. In realistic scenarios this is far from satisfied, due to occlusion, repetitive\nstructures, and noisy detections. Most algorithms, including [24] and [23], deal with the problem\nsimply by turning the $P_{ij}$ block in (5) into a weighted sum of all possible permutations. For example,\nif landmarks number $1, \\ldots, 20$ are present in both images, but landmarks $21, \\ldots, 40$ are not, then the\n$P_{ij}$ block in (5) will have a corresponding $20 \\times 20$ block of all ones, rescaled by a factor of $1/20$.\nThis approach effectively amounts to replacing $\\tau_{ji}$ by an appropriate distribution $t_{ji}(\\tau)$ over matchings.\nCorrespondingly, when we form $V$ from the first $d_\\rho$ eigenvectors of $L$, each resulting $V_i$ block\nwill stand for a distribution $p_i(\\sigma)$, rather than a single base permutation $\\sigma_i$. Moreover, if some set\nof $k$ landmarks $U = (u_1, \\ldots, u_k)$ are occluded in $I_i$, then $t_{ij}$ (for any $j$) will be agnostic to their\nassignment, and consequently $p_i$ will be invariant to what is mapped to $u_1, \\ldots, u_k$. Let $\\sigma \\sim_U \\sigma'$ denote\nthe relation that two permutations $\\sigma$ and $\\sigma'$ differ only in what numbers they map to $u_1, \\ldots, u_k$,\nbut fully agree on what they assign to any landmark not in $U$ (i.e., $\\sigma(i) = \\sigma'(i)$ whenever $\\sigma(i) \\notin U$). Clearly,\n$\\sim_U$ is an equivalence relation on $S_n$, and it is not difficult to see that letting $\\mu_U$ be some reference\npermutation that maps $1 \\mapsto u_1, \\ldots, k \\mapsto$ 
$u_k$, and $S_k$ be the subgroup of permutations that permute\n$1, 2, \\ldots, k$ amongst themselves but leave $k + 1, \\ldots, n$ fixed, the equivalence classes of $\\sim_U$ are the\nsets\n$$\\mu_U S_k \\nu := \\{ \\mu_U \\gamma \\nu \\mid \\gamma \\in S_k \\}, \\qquad \\nu \\in S_n. \\qquad (6)$$\nThese sets are called (two-sided) $S_k$\u2013cosets. Note that while $|S_n| = n!$, there are only $n!/k!$ distinct\nequivalence classes, so not all possible values of $\\nu$ yield a distinct coset.\nWhat is important is that uncertainty in the synchronization process with respect to a given set of\nlandmarks $\\{u_1, \\ldots, u_k\\}$ (typically due to occlusion) has a clear algebraic signature, namely the\ninferred $p_i$ being constant on each of the cosets in (6). Conversely, if we find that $p_i$ is constant\non these cosets, that is a strong indication that $u_1, \\ldots, u_k$ are occluded, which is an important clue\nto estimating $I_i$\u2019s viewpoint, sometimes even more informative than the synchronized matchings\nthemselves.\nThe invariance structure of $p_i$ is most easily detected from its so-called autocorrelation function\n$$a_i(\\sigma) = \\sum_{\\omega \\in S_n} p_i(\\sigma \\omega) \\, p_i(\\omega). \\qquad (7)$$\nClearly, (7) attains its maximum at the identity permutation, where $a_i(e) = \\sum_{\\omega \\in S_n} p_i(\\omega)^2$. However,\nwhen $p_i$ has invariances, the same maximum will be attained over a wider plateau of permutations.\nNote, in particular, that $\\omega$ and $\\sigma \\omega$ always fall in the same $\\mu_U S_k \\nu$ coset when $\\sigma \\in \\mu_U S_k \\mu_U^{-1}$.\nTherefore, if $p_i$ happens to be a function that is constant on $\\mu_U S_k \\nu$ cosets, then any $\\sigma \\in \\mu_U S_k \\mu_U^{-1}$\nwill maximize $a_i(\\sigma)$.\nOf course, in synchronization problems $p_i$ is not directly accessible to us, rather we only have access\nto the weighted sum $\\hat{p}_i(\\rho) := \\sum_{\\sigma \\in S_n} p_i(\\sigma) \\, \\rho(\\sigma) = V_i V_1^\\top$. 
Recent years have seen the emergence of\na number of applications of a generalized notion of Fourier transformation on the symmetric group,\nwhich, given a function $f : S_n \\to \\mathbb{R}$, is defined\n$$\\hat{f}(\\lambda) = \\sum_{\\sigma \\in S_n} f(\\sigma) \\, \\rho_\\lambda(\\sigma), \\qquad \\lambda \\vdash n,$$\nwhere the $\\rho_\\lambda$ are special, so-called irreducible, representations of $S_n$, indexed by the integer\npartitions $\\lambda$ of $n$. Due to space restrictions, we leave the details of this construction to the literature, see,\ne.g., [28, 29, 30]. Suffice to say that while $\\hat{p}_i(\\rho)$ is not exactly a Fourier component of $p_i$, it can be\nexpressed as a direct sum of Fourier components\n$$\\hat{p}_i(\\rho) = C^\\dagger \\Big[ \\bigoplus_{\\lambda \\in \\Lambda} \\hat{p}_i(\\lambda) \\Big] C$$\nfor some unitary matrix $C$ that is effectively just a basis transform. One of the properties of\nthe Fourier transform is that if $h$ is the cross-correlation of two functions $f$ and $g$ (i.e., $h(\\sigma) =\n\\sum_{\\mu \\in S_n} f(\\sigma \\mu) \\, g(\\mu)$), then $\\hat{h}(\\lambda) = \\hat{f}(\\lambda) \\, \\hat{g}(\\lambda)^\\dagger$. Consequently, assuming that $V_1$ has been normalized\nto ensure that $V_1^\\top V_1 = I$, and using the fact that in our setting all matrices are real,\n$$\\hat{a}_i(\\rho) := C^\\dagger \\Big[ \\bigoplus_{\\lambda \\in \\Lambda} \\hat{a}_i(\\lambda) \\Big] C = C^\\dagger \\Big[ \\bigoplus_{\\lambda \\in \\Lambda} \\hat{p}_i(\\lambda) \\, \\hat{p}_i(\\lambda)^\\dagger \\Big] C = (V_i V_1^\\top) (V_i V_1^\\top)^\\top = V_i V_i^\\top$$\nis an easily computable matrix that captures essentially all the coset invariance structure encoded in\nthe inferred distribution $p_i$.\nTo compute an affinity score between two images $I_i$ and $I_j$ reflecting how many occluded landmarks\nthey share, it remains to compare their coset invariance structures, for example, by computing\n$\\big( \\sum_{\\sigma \\in S_n} a_i(\\sigma) \\, a_j(\\sigma) \\big)^{1/2}$. 
Omitting certain multiplicative constants arising in the inverse Fourier\ntransform, again using the correlation theorem, one finds that this reduces to\n$$\\Pi(i, j) = \\mathrm{tr} \\big( V_i V_i^\\top V_j V_j^\\top \\big)^{1/2}, \\qquad (8)$$\nwhich we call Permutation Diffusion Affinity (PDA). Remarkably, PDA is closely related to the\nnotion of diffusion similarity derived in [22] for rotations, using entirely different, differential geometric\ntools. Our experiments show that PDA is surprisingly informative about the actual distance\nbetween image viewpoints in physical space, and, easy as it is to compute, can greatly improve the\nperformance of the SfM pipeline.\n\n5 Experiments\n\nOur experiments focus on challenging image association problems from the literature, where geometric\nambiguities due to large duplicate structures are present in up to 50% of the matches, so\neven sophisticated SfM pipelines run into difficulties [6]. Rather than replacing the standard SfM\npipeline with Permutations Diffusion Maps (PDM) altogether, our general approach is to use PDM\nas a preprocessing step to compute (8) for every image pair, and then feed these PDA scores into the\nSfM pipeline to improve its performance. More information on the experiments, including videos\nof 3D reconstructions, and an additional experiment on scene summarization rather than SfM [31],\ncan be found on the project website: http://pages.cs.wisc.edu/~pachauri/pdm/.\n\nIn the SfM experiments we used PDM to generate an image match matrix which is then fed to a\nstate-of-the-art SfM pipeline for 3D reconstruction [8, 9]. The baseline was a Bundle Adjustment\nprocedure which uses visual features for matching and has a built-in heuristic outlier removal module.\nSeveral other papers have used a similar comparison [6]. For each dataset, SIFT was used\nto detect and characterize landmarks [32, 33]. 
We compute putative pairwise matchings $(\\tau_{ij})_{i,j=1}^{m}$\nby solving $\\binom{m}{2}$ independent linear assignments [34] based on their SIFT features. The permutation\nmatrix representation is used for putative matchings $(\\tau_{ij})_{i,j=1}^{m}$ as in (5). Here, $n$ is relatively large,\non the order of 1000. Ideally, $n$ is the total number of distinct keypoints in the 3D scene, but this is\nnot directly observable, so we set $n$ to be the maximum number of keypoints detected in any single\nimage in the dataset. The eigenvector based procedure then computes a weighted affinity matrix. We used a\nbinary match matrix as the input to an SfM library [8, 9]. Note that we only provide this library the\nimage association hypotheses, leaving all other modules unchanged. With (potentially) good image\nassociation information, the SfM modules can sample landmarks more densely and perform bundle\nadjustment, leaving everything else unchanged. The baseline 3D reconstruction is performed using\nthe same SfM pipeline without intervention.\nThe \u201cHOUSE\u201d sequence has three instances of similar looking houses (Figure 1). The diffusion\nprocess accumulates evidence and eventually provides strongly connected images in the data association\nmatrix (Figure 2a). Warm colors correspond to high affinity between pairs of images. The\nbinary match matrix was obtained by applying a threshold on the weighted matrix (Figure 2b). We\nused this matrix to define the image matching for feature tracks. This means that features are only\nmatched between images that are connected in our matching matrix. The SfM pipeline was given\nthese image matches as hypotheses to explain how the images are \u201cconnected\u201d. The resulting\nreconstruction correctly gives three houses (Figure 2c). In contrast, the same SfM pipeline when\nallowed to track features automatically with an outlier removal heuristic resulted in a folded reconstruction\n(Figure 1b). 
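The step from eigenvector blocks to the weighted and binary match matrices can be sketched as follows. This is a minimal illustration of (8), not the code used in the experiments: the function names are hypothetical, the matrix V is random stand-in data rather than eigenvectors of a real Laplacian, and the median threshold is an arbitrary assumption, since no threshold value is specified here. The sketch relies on the identity tr(V_i V_i^T V_j V_j^T) = ||V_i^T V_j||^2 in the Frobenius norm.

```python
import numpy as np

def permutation_diffusion_affinity(V, m, n):
    # Split the (m*n) x d matrix into m blocks V_i of shape n x d.
    blocks = V.reshape(m, n, -1)
    # cross[i, j] = V_i^T V_j, a d x d matrix for every image pair.
    cross = np.einsum('iab,jac->ijbc', blocks, blocks)
    # tr(V_i V_i^T V_j V_j^T) equals the squared Frobenius norm of V_i^T V_j.
    return np.sqrt((cross ** 2).sum(axis=(2, 3)))

def binary_match_matrix(affinity, threshold):
    # Keep only image pairs whose affinity reaches the (assumed) threshold.
    match = affinity >= threshold
    np.fill_diagonal(match, False)
    return match

rng = np.random.default_rng(0)
m, n, d = 5, 4, 4
V = rng.standard_normal((m * n, d))  # stand-in for the lowest eigenvectors of L
affinity = permutation_diffusion_affinity(V, m, n)
match = binary_match_matrix(affinity, threshold=np.median(affinity))

# Direct evaluation of (8) for one pair, as a cross-check.
Vi, Vj = V[1 * n:2 * n], V[3 * n:4 * n]
direct = np.sqrt(np.trace(Vi @ Vi.T @ Vj @ Vj.T))
print(affinity[1, 3], direct)
```

Working with the d x d products V_i^T V_j avoids forming the n x n matrices V_i V_i^T, which matters when n is on the order of 1000.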
One may ask if more specialized heuristics will do better, such as time stamps,\nas suggested in [6]. However, experimental results in [5] and elsewhere strongly suggest that these\ndatasets still remain challenging.\n\nFigure 2: House sequence: (a) Weighted image association matrix. (b) Binary image match matrix. (c) PDM\ndense reconstruction.\n\nThe \u201cCUP\u201d dataset has multiple images of a 180 degree symmetric cup from all sides (Figure 3a).\nPDM reveals a strongly connected component along the diagonal for this dataset, shown in warm\ncolors in Figure 3b. Our global reasoning over the space of permutations substantially mitigates\ncoherent errors. The binary match matrix was obtained by thresholding the weighted matrix (Figure 3c).\nAs is evident from the reconstructions, the baseline method only reconstructs a \u201chalf cup\u201d.\nDue to the structural ambiguity, it also concludes that the cup has two handles (Figure 4b). In contrast,\nthe PDM reconstruction gives a perfect reconstruction of the full cup with a single handle\n(Figure 4a).\n\nFigure 3: (a) Representative images from CUP dataset. (b) Weighted data association matrix. (c) Binary data\nassociation matrix.\n\nFigure 4: CUP dataset. (a) PDM dense reconstruction. (b) Baseline dense reconstruction.\n\nThe \u201cOAT\u201d dataset contains two instances of a red oat box, one on the left of a box of \u201cWheat\nThins\u201d, and another on the right (Figure 5a). The PDM weighted match matrix and binary match\nmatrix successfully discover strongly connected components (Figures 5b, 5c). The baseline method\nconfused the two oat boxes as one, and reconstructs only a single box (Figure 6b). Moreover, the\nstructural ambiguity splits the Wheat Thins box into two pieces. On the other hand, PDM gives a nice\nreconstruction of the two oat boxes with the entire Wheat Thins box in the middle (Figure 6a). 
Several\nmore experiments (with videos) can be found on the project website.\n\nFigure 5: (a) Representative images from OAT dataset. (b) Weighted data association matrix. (c) Binary data\nassociation matrix.\n\nFigure 6: OAT dataset. (a) PDM dense reconstruction. (b) Baseline dense reconstruction.\n\n6 Conclusions\n\nInspired by the Vector Diffusion formalism of [22], we have proposed a new algorithm called Per-\nmutation Diffusion Maps for solving permutation synchronization problems, and an associated new\naf\ufb01nity measure called Permutation Diffusion Af\ufb01nity (PDA). Experiments show that the latter, in\nparticular, can signi\ufb01cantly improve the quality of Structure from Motion reconstructions of dif-\n\ufb01cult scenes. Interestingly, PDA has an interpretation in terms of the inner product between two\nautocorrelation functions expressed in Fourier space, which, we believe, is a new approach to de-\ntecting hidden symmetries, with many potential applications even outside the realm of permutation\nproblems.\n\nAcknowledgments\n\nThis work was supported in part by NSF\u20131320344, NSF\u20131320755, and funds from the University\nof Wisconsin Graduate School. We thank Charles Dyer, Li Zhang, Amit Singer and Qixing Huang\nfor helpful discussions and suggestions.\n\nReferences\n\n[1] D. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, 2011.\n[2] A. Nguyen, M. Ben-Chen, K. Welnicka, Y. Ye, and L. Guibas. An optimization approach to improving collections of shape maps. In Computer Graphics Forum, volume 30, 2011.\n[3] R. Li, H. Zhu, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20, 2010.\n[4] M. Pop, S. L. Salzberg, and M. Shumway. Genome sequence assembly: Algorithms and issues. IEEE Computer, 35, 2002.\n[5] K. Wilson and N. Snavely. 
Network principles for SfM: Disambiguating repeated structures with local context. In ICCV, 2013.\n[6] R. Roberts, S. Sinha, R. Szeliski, and D. Steedly. Structure from motion for scenes with large duplicate structures. In CVPR, 2011.\n[7] N. Jiang, P. Tan, and L. F. Cheong. Seeing double without confusion: Structure-from-motion in highly ambiguous scenes. In CVPR, 2012.\n[8] C. Wu. Towards linear-time incremental structure from motion. In 3DTV-Conference, International Conference on, 2013.\n[9] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In CVPR, 2011.\n[10] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or \u201chow do I organize my holiday snaps?\u201d. In ECCV, 2002.\n[11] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In ACM Transactions on Graphics (TOG), volume 25, 2006.\n[12] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR, 2007.\n[13] M. Havlena, A. Torii, J. Knopp, and T. Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR, 2009.\n[14] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. In Trends and Topics in Computer Vision, 2012.\n[15] O. Ozyesil, A. Singer, and R. Basri. Camera motion estimation by convex programming. CoRR, 2013.\n[16] O. Enqvist, F. Kahl, and C. Olsson. Non-sequential structure from motion. In ICCV Workshops, 2011.\n[17] C. Zach, A. Irschara, and H. Bischof. What can missing correspondences tell us about 3D structure and motion? In CVPR, 2008.\n[18] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, 2010.\n[19] V. M. Govindu. Robustness in motion averaging. In Computer Vision\u2013ACCV 2006, pages 457\u2013466. Springer, 2006.\n[20] A. Singer and Y. 
Shkolnisky. Three-dimensional structure determination from common lines in cryo-EM by eigenvectors and semide\ufb01nite programming. SIAM Journal on Imaging Sciences, 4, 2011.\n[21] A. Singer. Angular synchronization by eigenvectors and semide\ufb01nite programming. Applied and Computational Harmonic Analysis, 30, 2011.\n[22] A. Singer and H.-T. Wu. Vector diffusion maps and the connection Laplacian. Communications on Pure and Applied Mathematics, 2011.\n[23] D. Pachauri, R. Kondor, and V. Singh. Solving the multi-way matching problem by permutation synchronization. In NIPS, 2013.\n[24] Q. Huang and L. Guibas. Consistent shape maps via semide\ufb01nite programming. Computer Graphics Forum, 2013.\n[25] M. Cucuringu, Y. Lipman, and A. Singer. Sensor network localization by eigenvector synchronization over the Euclidean group. ACM Transactions on Sensor Networks (TOSN), 8, 2012.\n[26] Y. Chen, L. Guibas, and Q. Huang. Near-optimal joint object matching via convex relaxation. In ICML, 2014.\n[27] F. R. K. Chung. Spectral graph theory (CBMS regional conference series in mathematics, No. 92). American Mathematical Society, 1996.\n[28] J. Huang, C. Guestrin, and L. Guibas. Fourier theoretic probabilistic inference over permutations. JMLR, 2009.\n[29] R. Kondor. A Fourier space algorithm for solving quadratic assignment problems. In SODA, 2010.\n[30] D. Rockmore, P. Kostelec, W. Hordijk, and P. F. Stadler. Fast Fourier transforms for \ufb01tness landscapes. Appl. and Comp. Harmonic Anal., 2002.\n[31] S. Zhu, L. Zhang, and B. M. Smith. Model evolution: An incremental approach to non-rigid structure from motion. In CVPR, 2010.\n[32] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.\n[33] K. Mikolajczyk and C. Schmid. Scale & af\ufb01ne invariant interest point detectors. IJCV, 60, 2004.\n[34] H. W. Kuhn. The Hungarian method for the assignment problem. 
Naval Research Logistics Quarterly, 2, 1955.\n", "award": [], "sourceid": 360, "authors": [{"given_name": "Deepti", "family_name": "Pachauri", "institution": "UW-Madison"}, {"given_name": "Risi", "family_name": "Kondor", "institution": "University of Chicago"}, {"given_name": "Gautam", "family_name": "Sargur", "institution": "University of Wisconsin Madison"}, {"given_name": "Vikas", "family_name": "Singh", "institution": "UW-Madison"}]}