{"title": "Modelling and unsupervised learning of symmetric deformable object categories", "book": "Advances in Neural Information Processing Systems", "page_first": 8178, "page_last": 8189, "abstract": "We propose a new approach to model and learn, without manual supervision, the symmetries of natural objects, such as faces or flowers, given only images as input. It is well known that objects that have a symmetric structure do not usually result in symmetric images due to articulation and perspective effects. This is often tackled by seeking the intrinsic symmetries of the underlying 3D shape, which is very difficult to do when the latter cannot be recovered reliably from data. We show that, if only raw images are given, it is possible to look instead for symmetries in the space of object deformations. We can then learn symmetries from an unstructured collection of images of the object as an extension of the recently-introduced object frame representation, modified so that object symmetries reduce to the obvious symmetry groups in the normalized space. We also show that our formulation provides an explanation of the ambiguities that arise in recovering the pose of symmetric objects from their shape or images and we provide a way of discounting such ambiguities in learning.", "full_text": "Modelling and unsupervised learning of\nsymmetric deformable object categories\n\nJames Thewlis1\n\nHakan Bilen2\n\nAndrea Vedaldi1\n\n1 Visual Geometry Group\n\nUniversity of Oxford\n\n{jdt,vedaldi}@robots.ox.ac.uk\n\n2 School of Informatics\nUniversity of Edinburgh\n\nhbilen@ed.ac.uk\n\nAbstract\n\nWe propose a new approach to model and learn, without manual supervision, the\nsymmetries of natural objects, such as faces or \ufb02owers, given only images as input.\nIt is well known that objects that have a symmetric structure do not usually result in\nsymmetric images due to articulation and perspective effects. 
This is often tackled by seeking the intrinsic symmetries of the underlying 3D shape, which is very difficult to do when the latter cannot be recovered reliably from data. We show that, if only raw images are given, it is possible to look instead for symmetries in the space of object deformations. We can then learn symmetries from an unstructured collection of images of the object as an extension of the recently-introduced object frame representation, modified so that object symmetries reduce to the obvious symmetry groups in the normalized space. We also show that our formulation provides an explanation of the ambiguities that arise in recovering the pose of symmetric objects from their shape or images and we provide a way of discounting such ambiguities in learning.

1 Introduction

Most natural objects are symmetric: mammals have a bilateral symmetry, a glass is rotationally symmetric, many flowers have a radial symmetry, etc. While such symmetries are easy to understand for a human, it remains surprisingly challenging to develop algorithms that can reliably detect the symmetries of visual objects in images. The key difficulty is that objects that are structurally symmetric do not generally result in symmetric images; in fact, the latter occurs only when the object is imaged under special viewpoints and, for deformable objects, in special poses (Leonardo's Vitruvian Man illustrates this point).
The standard approach to characterizing symmetries in objects is to look not at their images, but at their 3D shape; if the latter is available, then symmetries can be recovered by analysing the intrinsic geometry of the shape. 
However, often only images of the objects are available, and reconstructing an accurate 3D shape from them can be very challenging, especially if the object is deformable.
In this paper, we thus seek a new approach to learn, without supervision and from raw images alone, the symmetries of deformable object categories. This may sound difficult, since even characterising the basic geometry of natural objects without external supervision remains largely an open problem. Nevertheless, we show that it is possible to extend the method of [38], which was recently introduced to learn the "topology" of object categories, to do exactly this.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Symmetric object frame for human (left) and cat (right) faces (test set). Our method learns a viewpoint- and identity-invariant geometric embedding which captures the symmetry of natural objects (in this case bilateral) without manual supervision. Top: input images with the axis of symmetry superimposed (shown in green). Middle: dense embedding mapped to colours. Bottom: image pixels mapped to the 3D representation space with the reflection plane (green).

There are three key enabling factors in our approach. First, we do not consider symmetries of a single object or 3D shape in isolation; instead, we seek symmetries shared by all the instances of the objects in a given category, imaged under different viewing conditions and deformations. Second, rather than considering the common concept of intrinsic symmetries, we propose to look at symmetries not of 3D shapes, but of the space of their deformations (section 4). Third, we show that the normalized object frame of [38] can be learned in such a way that the deformation symmetries are represented by the obvious symmetry groups in the object frame. 
The latter also results in a constraint that can be easily added to the self-supervised formulation of [38] to learn symmetries in practice (section 3).
We start by deriving our formulation for the special case of bilateral symmetries (section 3). Then, we propose a theory of symmetric deformation spaces (section 4) that generalises the method to other symmetry groups. An important step in this generalization is to characterise the ambiguities that symmetries induce in recovering the pose of an object from an image of it, or from its 3D shape; such ambiguities may not occur with bilateral symmetries.
The resulting approach is the first that, to our knowledge, can learn the symmetries of object categories given only raw images as input, without manual annotations. For demonstration, we show that this approach can learn the bilateral symmetry in human and pet faces (fig. 1) as well as in synthetic 3D objects (section 6). To assess the method, we look at how well the resulting representation can detect pairs of symmetric object landmarks (e.g. left and right eyes) even when the object does not appear symmetric.
We also investigate the problem of symmetry-induced ambiguities in learning the geometry of natural objects. For objects such as animals that have a bilateral symmetry, it is generally possible to uniquely identify their left and right sides and thus recover their pose uniquely. On the other hand, for objects such as flowers that may have a radial symmetry, it is generally impossible to say which way is "up", creating an ambiguity in pose recovery. Our framework clarifies why and when this occurs and suggests how to modify the learning formulation to mitigate the effect of such ambiguities (sections 4 and 6.2).

2 Related work

Cross-instance object matching. 
Our method is also related to techniques that find dense correspondences between different object instances by matching their SIFT features [25], establishing region correspondences [14, 15], and matching the internal representations of neural networks [24]. In addition, dense correspondences have been generalized from image pairs to an arbitrary number of images by Learned-Miller [20]. More recently, RSA [32], Collection Flow [18] and Mobahi et al. [28] show that a collection of images can be projected into a lower-dimensional subspace before performing a joint alignment among the projected images. Novotny et al. [30] train a neural network with image labels that learns to automatically discover semantically meaningful parts across animals.
Unsupervised learning of object structure. Supervised visual object characterization [6, 11, 21, 8, 10] is a well-established problem in computer vision, successfully applied to facial landmark detection and human body pose estimation. Unsupervised methods include Spatial Transformer Networks [16], which learn to transform images to improve image classification, and WarpNet [17] and geometric matching networks [34], which learn to match object pairs by estimating relative transformations between them. In contrast to ours, these methods do not learn a canonical object geometry and only provide a relative mapping from one object to another. More closely related to ours, Thewlis et al. [39, 38] propose to characterize object structure via detecting landmarks [39] or dense labels [38] that are

Figure 2: Left: an object category consisting of two poses π, π′ with bilateral symmetry. Middle: the non-rigid deformation t = π′ ∘ π⁻¹ transporting one pose into the other. Right: construction of t = mπm⁻¹π⁻¹ by applying the reflection operator m both in Euclidean space and in representation space S². 
This also shows that the symmetric pose π′ = mπm⁻¹ is the "conjugate" of π.

consistent with object deformations and viewpoint changes. In fact, our method builds on [38] and also learns a dense geometric embedding for objects; however, it does so using a different supervision principle, symmetry.
Symmetry. Computational symmetry [22] has a long history in the sciences and has played an essential role in several important discoveries, including the theory of relativity [29] and the double-helix structure of DNA [42]. Symmetry is shown to help grouping [19] and recognition [41] in human perception. There is a vast body of computer vision literature dedicated to finding symmetries in images [26] and in two-dimensional [1] and three-dimensional shapes [37]. Other axes of variation among symmetry detection methods are whether we seek transformations that map the whole [33] or part of an object [12] to itself, and whether distances are measured in the extrinsic Euclidean space [1] or with respect to an intrinsic metric of the surface [33]. In addition to symmetry detection, symmetry is also used as prior information to improve object localization [4], text spotting [47], pose estimation [44] and 3D reconstruction [35]. Symmetry constraints have been used to find objects in 3D point clouds [9, 40]. Symmetrization [27] can be used to warp meshes to a symmetric pose. Symmetry cues can be used in segmentation [3, 5]. [2] learns representations that respect a group structure learned from data symmetries.

3 Self-supervised learning of bilateral symmetries

In this section, we extend the approach of [38] to learn the bilateral symmetry of an object category.
Object frame. 
The key idea of [38] is to study 3D objects not via 3D reconstruction, which is challenging, but by characterizing the correspondences between different 3D shapes of the object, up to pose or intra-class variations.
In this model, an object category is a space Π of homeomorphisms π : S² → R³ that embed the sphere S² into R³. Each possible shape of the object is obtained as the (mathematical) image S = π[S²] under a corresponding function π ∈ Π, which we therefore call a pose of the object (different poses may result in the same shape). The correspondence between a pair of shapes S = π[S²] and S′ = π′[S²] is then given by π′ ∘ π⁻¹, which is a bijective deformation of S into S′.
Next, we study how poses relate to images of the object. A (color) image is a function x : Ω → R³ mapping pixels u ∈ Ω to colors x_u. Suppose that x is the image of the object under pose π; then, a point z ∈ S² on the sphere projects to a point πz ∈ R³ on the object surface S, and the latter projects to a pixel u = Proj(πz) ∈ Ω, where Proj is the camera projection operator.
The idea of [38] is to learn a function ψ_u(x) that "reverses" this process and, given a pixel u in image x, recovers the corresponding point z on the sphere (so that ∀u : u = Proj(πψ_u(x))). The intuition is that z identifies a certain object landmark (e.g. the corner of the left eye in a face) and that the function ψ_u(x) recovers which landmark lands at a certain pixel u.
The function ψ_u(x) is learned by considering pairs of images x and x′ = tx related by a known 2D deformation t : Ω → Ω (where the warped image tx is given by (tx)_u = x_{t⁻¹u}). In this manner, pixels u and u′ = tu are images of the same object landmark and therefore must project on the same sphere point. 
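The warping convention above, (tx)_u = x_{t⁻¹u}, can be illustrated with a toy sketch. This is not the paper's implementation; the warp here is a simple integer translation of the pixel grid, and all names are illustrative.

```python
import numpy as np

def warp_image(x, t_inv):
    """Return tx, where (tx)[u] = x[t_inv(u)] (zero outside the grid)."""
    H, W = x.shape
    tx = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            pi, pj = t_inv((i, j))
            if 0 <= pi < H and 0 <= pj < W:
                tx[i, j] = x[pi, pj]
    return tx

t = lambda u: (u[0] + 2, u[1] + 3)        # forward warp u -> tu
t_inv = lambda u: (u[0] - 2, u[1] - 3)    # its inverse

x = np.zeros((8, 8))
x[1, 1] = 1.0                             # a "landmark" at pixel u = (1, 1)
tx = warp_image(x, t_inv)

# The landmark reappears at u' = tu = (3, 4): this matched pair (u, tu) is
# exactly what the constraint on psi exploits.
assert tx[t((1, 1))] == 1.0
```

Note that the image is warped by *inverse* mapping (each output pixel pulls from t⁻¹u), which is why (tx)_u = x_{t⁻¹u} guarantees that the content at u in x lands at tu in tx.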
In formulas, and ignoring visibility effects and other complications, the learned function must satisfy the invariance constraint:

∀u ∈ Ω : ψ_u(x) = ψ_{tu}(tx).   (1)

In practice, triplets (x, x′, t) are obtained by randomly sampling 2D warps t, assuming that the latter approximate warps that could arise from an actual pose change π′ ∘ π⁻¹. In this manner, knowledge of t is automatic and the method can be used in an unsupervised setting.

Symmetric object frame. So far the object frame has been used to learn correspondences between different object poses; here, we show that it can be used to establish auto-correspondences in order to model object symmetries as well.
Consider in particular an object that has a bilateral symmetry. This symmetry is generated by a reflection operator, say the function m : R³ → R³ that flips the first axis:

m : R³ → R³,   (p₁, p₂, p₃) ↦ (−p₁, p₂, p₃).   (2)

If S is a shape of a bilaterally-symmetric object, no matter how we align S to the symmetry plane, in general m[S] ≠ S due to object deformations. However, we can expect m[S] to still be a valid shape for the object. Consider the example of fig. 2 of a person with his/her right hand raised; if we apply m to this shape, we obtain the shape of a person with the left hand raised, which is valid.
However, reasoning about shapes is insufficient to apply the object frame model; we require instead to work with correspondences, encoded by poses. 
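A minimal numeric sketch of the reflection operator m of eq. (2): it is an involution (m = m⁻¹) and maps a shape S to a mirrored shape m[S] that in general differs from S. The point set standing in for S is purely illustrative.

```python
import numpy as np

# Eq. (2): m flips the first axis of R^3
m = np.diag([-1.0, 1.0, 1.0])

# An asymmetric "shape": a small point set standing in for S
S = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0]])

mS = S @ m.T          # m[S]: every point reflected across the plane p1 = 0

assert np.allclose(m @ m, np.eye(3))   # m is an involution, m = m^{-1}
assert not np.allclose(mS, S)          # in general m[S] != S
assert np.allclose(mS @ m.T, S)        # but m[m[S]] = S
```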
Unfortunately, even though m[S] is a valid shape, m is not a valid correspondence, as it flips the left and right sides of a person, which is not a "physical" deformation (why this is important will be clearer later; intuitively, it is the reason why we can tell our left hand from our right by looking).
Our key intuition is that we can learn the pose representation in such a way that the correct correspondences are trivially expressible there. Namely, assume that m applied to the sphere amounts to swapping each left landmark of the object with its corresponding right counterpart. The correct deformation t that maps the "right arm raised" pose to the "left arm raised" pose can now be found by applying m first in the normalized object frame (to swap left and right sides while leaving the shape unchanged) and then again in 3D space (undoing the swap while actually deforming the shape). This two-step process is visualised in fig. 2, right.
This derivation is captured by a simple change to constraint (1), encoding equivariance rather than invariance w.r.t. the warp m:

∀u ∈ Ω : mψ_u(x) = ψ_{mu}(mx).   (3)

We will show that this simple variant of eq. (1) can be used to learn a representation of the bilateral symmetry of the object category.
Learning formulation. We follow [38] and learn the model ψ_u(x) by considering a dataset of images x of a certain object category, modelling the function ψ_u(x) by a convolutional neural network, and formulating learning as a Siamese configuration, combining constraints (3) and (1) into a single loss. 
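The two-step construction above can be checked algebraically on toy linear poses: with conjugate pose π′ = mπm⁻¹, the deformation t = mπm⁻¹π⁻¹ transports the shape π[S²] onto π′[S²]. Real poses are nonlinear embeddings of the sphere; the matrices below only verify the algebra, and all values are illustrative.

```python
import numpy as np

m = np.diag([-1.0, 1.0, 1.0])             # reflection operator of eq. (2)

theta = 0.4
pi = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])          # a toy pose (here a rotation)

pi_prime = m @ pi @ m                     # conjugate pose (using m = m^{-1})
t = m @ pi @ m @ np.linalg.inv(pi)        # t = m pi m^{-1} pi^{-1}

z = np.array([0.6, 0.8, 0.0])             # a sphere point (a "landmark")
# t moves the point pi(z) of the first shape to pi'(z) of the symmetric shape
assert np.allclose(t @ (pi @ z), pi_prime @ z)
```

The assertion is exactly the statement that applying m once in the object frame and once in 3D space yields the correct correspondence between a pose and its conjugate.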
To avoid learning the trivial solution where ψ_u(x) is the constant function, the constraints are extended to capture not just invariance/equivariance but also distinctiveness (namely, equalities (3) and (1) should not hold if u is replaced with a different pixel v in the left-hand side). Following [38], this is captured probabilistically by the loss:

L(x, m, t) = ∫_Ω ‖v − mtu‖₂^γ p(v|u) dv du,   p(v|u) = exp⟨mψ_u(x), ψ_v(mtx)⟩ / ∫ exp⟨mψ_u(x), ψ_w(mtx)⟩ dw.   (4)

The probability p(v|u) represents the model's belief that pixel u in image x matches pixel v in image mtx based on the learned embedding function; the latter is relaxed to span R³ rather than only S², to allow the length of the embedding vectors to encode the belief strength (as shorter vectors result in flatter distributions p(v|u)). For unsupervised training, warps t ∼ T are randomly sampled from a fixed distribution T as in [38], whereas m is set to be either the identity or the reflection along the first axis with 50% probability.

4 Theory

In the previous section, we have given a formulation for learning the bilateral symmetry of an object category, relying mostly on an intuitive derivation. In this section, we develop the underlying theory in a more rigorous manner (proofs can be found in the supplementary material), while clarifying three important points: how to model symmetries other than the bilateral one; why symmetries such as radial ones result in ambiguities in establishing correspondences, and why this is usually not the case for the bilateral symmetry; and what can be done to handle such ambiguities in the learning formulation when they arise.

Figure 3: Left: a set Π = {π₀, . . . , π₃} of four poses with rotational symmetry group H = {h^k, k = 0, 1, 2, 3} where h is a rotation by π/2. 
Note that none of the shapes is symmetric; rather, the object, which stays "upright", can deform in four symmetric ways. The shape of the object is then sufficient to recover the pose uniquely. Middle: closure of the pose space Π by rotations G = H. Now the pose can be recovered from shapes only up to the symmetry group H. Right: an equilateral triangle is represented by a pose π₀ invariant to conjugation by 60-degree rotations (which are the "ordinary" extrinsic symmetries of this object).

Symmetric pose spaces. A symmetry of a shape S ⊂ R³ is often defined as an isometry¹ h : R³ → R³ that leaves the set invariant, i.e. h[S] = S. This definition is not very useful when dealing with symmetric but deformable objects, as it works only for special poses (cf. the Vitruvian Man); we require instead a definition of symmetry that is not pose dependent. A common approach is to define intrinsic symmetries [33] as maps h : S → S that preserve the geodesic distance d_S defined on the surface of the object (i.e. ∀p, q ∈ S : d_S(hp, hq) = d_S(p, q)). This works because the geodesic distance captures the intrinsic geometry of the shape, which is pose invariant (but elastic shape deformations are still a problem); however, using this definition requires accurately reconstructing the 3D shape of objects from images, which is very challenging.
In order to sidestep this difficulty, we propose to study the symmetry not of the 3D shapes of objects, but rather of the space of their deformations. As discussed in section 3, such deformations are captured as a whole by the pose space Π. We define the symmetries of the pose space Π as the subset of linear isometries that leave Π unchanged via conjugation:

H(Π) = {h ∈ O(3) : ∀π ∈ Π : hπh⁻¹ ∈ Π ∧ h⁻¹πh ∈ Π}.

For example, in fig. 
2 we have obtained the "left hand raised" pose π′ from the "right hand raised" pose via conjugation π′ = mπm⁻¹ by the reflection m (note that m = m⁻¹).

Lemma 1. The set H(Π) is a subgroup of O(3).

The symmetry group H(Π) partitions Π into equivalence classes of symmetric poses: two poses π and π′ are symmetric, denoted π ∼_{H(Π)} π′, if, and only if, π′ = hπh⁻¹ for an h ∈ H(Π). In fact:

Lemma 2. π ∼_{H(Π)} π′ is an equivalence relation on the space of poses Π.

Figure 3 shows an example of an object Π = {h^k π₀ h^{−k}, k = 0, 1, 2, 3} with four rotationally-symmetric poses, where h is a clockwise rotation of 90 degrees.

Motion-induced ambiguities. In the example of fig. 3, the object is pinned at the origin of R³ and cannot rotate (it can only be "upright"); in order to allow it to move around, we can extend the pose space to Π′ = GΠ by applying further transformations to the poses. For example, choosing G = SE(3) to be the Euclidean group allows the object to move rigidly; fig. 3-middle shows an example in which G = H(Π) is the same group of four rotations as before, so the object is still pinned at the origin but not necessarily upright.
Motions are important because they induce ambiguities in pose recovery. We formalise this concept next. First, we note that, if G contains H(Π), extending Π by G preserves all the symmetries:

Lemma 3. If H(Π) ⊂ G, then H(Π) ⊂ H(GΠ).

Second, consider being given a shape S (intended as a subset of R³) and being tasked with recovering the pose π ∈ Π that generates S = π[S²]. Motions make this recovery ambiguous:

Lemma 4. 
Let the pose space Π be closed under a transformation group G, in the sense that GΠ = Π. Then, if pose π ∈ Π is a solution of the equation S = π[S²] and if h ∈ H(Π) ∩ G, then πh⁻¹ is another pose that solves the same equation.

¹ I.e. ∀p, q ∈ R³ : d(hp, hq) = d(p, q).

Lemma 4 does not necessarily provide a complete characterization of all the ambiguities in identifying the pose π from the shape S; rather, it captures the ambiguities arising from the symmetry of the object and its ability to move around in a certain manner. Nevertheless, it is possible for specific poses to result in further ambiguities (e.g. consider a pose that deforms an object into a sphere).
In order to use the lemma to characterise ambiguities in pose recovery, given a pose space Π one must still find the space of possible motions G. We can take the latter to be the maximal subgroup G* ⊂ SE(3) of rigid motions under which Π is closed².

4.1 Bilateral symmetry

Bilateral symmetries are generated by the reflection operator m of eq. (2): a pose space Π has bilateral symmetry if H(Π) = {1, m}, which induces pairs of symmetric poses π′ = mπm⁻¹ as in fig. 2.
Even if the poses Π are closed under rigid motions (i.e. G*Π = Π where G* = SE(3)), in this case there is generally no ambiguity in recovering the object pose from its shape S. The reason is that in lemma 4 one has G* ∩ H(Π) = {1}, due to the fact that all transformations in G* are orientation-preserving whereas m is not. This explains why it is possible to still distinguish left from right sides in most bilaterally-symmetric objects despite symmetries and motions. However, this is not the case for other types of symmetries, such as radial.
Symmetry plane. 
Note that, given a pair of symmetric poses (π, π′), π′ = mπm⁻¹, the correspondences between the underlying 3D shapes are given by the map m_π : S → m[S], p ↦ (mπm⁻¹π⁻¹)(p). For example, in fig. 2 this map sends the raised left hand of a person to the lowered left hand in the symmetric pose. Of particular interest are the points where m_π coincides with m, as they are on the "plane of symmetry". In fact, let p = π(z); then:

m_π(p) = m(p) ⇒ mπm⁻¹π⁻¹(p) = m(p) ⇒ m⁻¹(z) = z ⇒ z = (0, z₂, z₃).   (5)

4.2 Extrinsic symmetries

Our formulation captures the standard notion of extrinsic (standard) symmetries as well. If H(S) = {h ∈ O(3) : h[S] = S} are the extrinsic symmetries of a geometric shape S (say, a regular pyramid), we can parametrize S using a single pose Π = {π₀} that: (i) generates the shape (S = π₀[S²]) and (ii) has the same symmetries as the latter (H(Π) = H(S)).
In this case, the pose π₀ is self-conjugate, in the sense that π₀ = hπ₀h⁻¹ for all h ∈ H(Π). Furthermore, given S it is obviously possible to recover the pose uniquely (since there is only one element in Π); however, as before, ambiguities arise by augmenting poses via rigid motions G = SE(3). In this case, due to lemma 4, if gπ₀ is a possible pose of S, so must be gπ₀h⁻¹. We can rewrite the latter as (gh⁻¹)(hπ₀h⁻¹) = (gh⁻¹)π₀, which shows that the ambiguous poses are obtained via selected rigid motions gh⁻¹ of the reference pose π₀.

5 Learning with ambiguities

In section 3 we have explained how the learning formulation of [38] can be extended in order to learn objects with a bilateral symmetry. 
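Lemma 4 can be illustrated numerically with the four-element rotation group of fig. 3: if S = π[S²] and h is a symmetry in H(Π) ∩ G, then πh⁻¹ generates the same shape, so the pose cannot be identified from S alone. The pose below is a toy linear map and the sphere is sampled by an h-orbit; both are illustrative, not the paper's setup.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

h = rot_z(np.pi / 2)                # generator of H = {h^k, k = 0..3}
pi_pose = np.diag([2.0, 1.0, 0.5])  # a toy pose deforming the sphere

# Sample sphere points closed under the action of h (an h-orbit)
z0 = np.array([1.0, 0.0, 0.0])
orbit = [np.linalg.matrix_power(h, k) @ z0 for k in range(4)]

S1 = {tuple(np.round(pi_pose @ z, 6) + 0.0) for z in orbit}
S2 = {tuple(np.round(pi_pose @ np.linalg.inv(h) @ z, 6) + 0.0) for z in orbit}

# Both poses generate the same shape: the pose is ambiguous given S
assert S1 == S2
```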
The latter is an example where symmetries do not induce ambiguities in the recovery of the object's pose (the reason is given in section 4.1). Now we consider the case in which symmetries induce a genuine ambiguity in pose recovery.
Recall that ambiguities arise from a non-empty intersection of object symmetries H(Π) and object motions G* (section 4). A typical example may be an object with a finite rotational symmetry group (fig. 3). In this case, it is not possible to recover the object pose uniquely from an image, which in turn suggests that ψ_u(x) cannot be learned using the formulation of section 3.

² Being maximal means that G*Π = Π and that GΠ = Π ⇒ G ⊂ G*. The maximal group can be constructed as G* = ⟨G ⊂ SE(3) : GΠ = Π⟩, where ⊂ denotes a subgroup and ⟨·⟩ the generated subgroup. This definition is well posed: the generated group G* contains all the other subgroups G, so it is maximal; furthermore, G*Π = Π because, for any pose π ∈ Π and finite combination of other group elements, g₁^{n₁} · · · g_k^{n_k} π ∈ Π.

Method              Eyes   Mouth
[38]                23.29  15.27
[38] & plane est.    5.17   5.38
Ours                 3.21   3.47

(a) Pixel error when using the reflected descriptor from the left eye or left mouth corner to locate its counterpart on the right side of the face, across 200 images from CelebA (MAFL test subset).
(b) Visualisation of fig. 4a. +: ground truth. ◦,•: [38] with no learned symmetry. ◦,•: [38] with mirroring around the plane estimated using annotations. ◦,•: our method. Here ◦ and • denote eye and mouth respectively.
(c) Difference between our method (left) and [38] (right). We learn an axis-aligned frame symmetric around a plane (green); [38] has arbitrary rotation and no guaranteed symmetry plane. 
But we can estimate a plane using annotations (cyan).

Figure 4: Comparing object frames.

Figure 5: Bilateral symmetry of animal faces. The discovered plane of symmetry is shown in green. Top: inputs. Middle: colour mapping. Bottom: embedding (sphere) space.

We propose to address this problem by relaxing loss (4) in order to discount the ambiguity as follows:

L_{H(Π)}(x, t) = min_{h ∈ H(Π)} ∫_Ω ‖v − tu‖₂^γ p_h(v|u) dv du,   p_h(v|u) = exp⟨hψ_u(x), ψ_v(tx)⟩ / ∫ exp⟨hψ_u(x), ψ_w(tx)⟩ dw.   (6)

This loss allows ψ_u(x) to estimate the embedding vector z ∈ S² (or z ∈ R³) up to an unknown transformation h.

6 Experiments

We now validate our formulation empirically. To ensure a fair comparison to [38], who introduced the learning formulation (4) that our approach extends, we use the same network architecture and hyperparameter values (e.g. γ = 0.5 in eq. (4)). We show that our extension successfully recovers the symmetric structure of bilateral objects (section 6.1) as well as allowing us to manage ambiguities arising from symmetries when learning such structures (section 6.2).

6.1 Learning objects with bilateral symmetry

In this section, we apply the learning formulation (4) to objects with a bilateral symmetry. Due to the structure imposed on the embedding function by eq. (3), we expect the symmetry plane of the object to be mapped to the plane z₁ = 0 in the embedding space (section 4.1). Once the model is learned, this locus can be projected back to an image for visualisation and qualitative assessment. We also test quantitatively the accuracy of the learned geometric embedding in localising object landmarks and their symmetric counterparts.
Faces. We evaluate the proposed formulation on faces of humans and animals, which have limited out-of-plane rotations. 
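The relaxed loss of eq. (6) can be sketched in a discretized form: embeddings live at N pixels, H is a finite group acting linearly on embedding vectors, and the loss keeps the best (minimum) alignment over h ∈ H. This is only an illustrative sketch, not the paper's implementation; all shapes and helper names are assumptions.

```python
import numpy as np

def match_probs(e_src, e_tgt):
    """p_h(v|u): softmax over target pixels v of <h psi_u(x), psi_v(tx)>."""
    logits = e_src @ e_tgt.T                          # (N, N) inner products
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def relaxed_loss(e_x, e_tx, tu, v, H_group, gamma=0.5):
    """min_h sum_{u,v} ||v - tu||_2^gamma p_h(v|u) (integrals -> sums)."""
    d = np.linalg.norm(tu[:, None, :] - v[None, :, :], axis=-1) ** gamma
    return min(float((match_probs(e_x @ h.T, e_tx) * d).sum()) for h in H_group)

rng = np.random.default_rng(1)
v = rng.random((12, 2))            # target pixel coordinates
tu = v.copy()                      # take t = identity for this check
e = rng.standard_normal((12, 3))   # toy embeddings, psi_u(x) = psi_u(tx)

I3, m = np.eye(3), np.diag([-1.0, 1.0, 1.0])
# Minimizing over a larger symmetry group can only decrease the loss
assert relaxed_loss(e, e, tu, v, [I3, m]) <= relaxed_loss(e, e, tu, v, [I3]) + 1e-9
```

The final assertion is the point of the relaxation: adding elements to H(Π) never penalizes the model, it only discounts the pose ambiguity the group encodes.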
For humans we use the CelebA [23] face dataset, with over 200K images. We use an identical setup to [38, 39], training on 162K images and employing the MAFL [46] subset of 1000 images as a validation set. For cats we use the Cat Head dataset [45], with 8609 training images. We also combine multiple animals in the same training set, using the Animal Faces dataset [36] (20 animal classes, about 100 images per class). We exclude birds and elephants, since these images have a significantly different appearance, and add additional cat, dog and human faces [45, 31, 23] (but keep roughly the same distribution of animal classes per batch as in the original dataset).

In all cases, we do not use any manual annotation; instead, we use learning formulation (4) with the same synthetic transformations t ∼ T as [38]. Additionally, with 50% probability we also apply a left-to-right flip m to both the image and the embedding space, as prescribed by eq. (4).
Results (figs. 1 and 5) show that our method, like [38], learns a geometric embedding of the object invariant to viewpoint and intra-category changes. In addition, our new formulation localises the intrinsic bilateral symmetry plane in the face images and maps it to a plane of reflection in the embedding space. We note that images are embedded symmetrically with respect to the plane (shown in green in fig. 1, bottom row). The plane can also be projected back to the image and, as predicted by eq. (5), corresponds to our intuitive notion of the symmetry plane in faces (fig. 1, top row). 
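The flip applied during training can be sketched as follows: m acts on the image as a left-right flip of the pixels and on the embedding by negating the first coordinate (eq. (2)). The toy embedding below is constructed by hand so that the equivariance constraint (3), mψ_u(x) = ψ_{mu}(mx), holds exactly; in the paper ψ is a CNN and (3) is only encouraged by the loss. All names are illustrative.

```python
import numpy as np

def psi(x):
    """Toy 3-channel embedding map over an H x W image."""
    H, W = x.shape
    e = np.zeros((3, H, W))
    e[0] = np.linspace(-1.0, 1.0, W)[None, :]  # odd channel: flips sign under m
    e[1] = x                                    # carries the image content
    e[2] = 1.0                                  # constant channel
    return e

x = np.random.default_rng(0).random((4, 6))
e, e_flip = psi(x), psi(x[:, ::-1])             # psi(x) and psi(mx)

lhs = np.stack([-e[0], e[1], e[2]])             # m psi_u(x) at every pixel u
rhs = e_flip[:, :, ::-1]                        # psi_{mu}(mx), mu = mirrored u
assert np.allclose(lhs, rhs)                    # constraint (3) holds everywhere
```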
Importantly, symmetry here is a statistical concept that applies to the category as a whole; specific face instances need not be nor appear symmetric. The latter in particular means that faces need not be imaged fronto-parallel for the method to capture their symmetry.
To evaluate the learned symmetry quantitatively, we use manual annotations (eyes, mouth corners) to verify whether the representation can transport landmarks to their symmetric counterparts. In particular, we take landmarks on the left side of the face (e.g. the left eye), use m (eq. (3)) to mirror their embedding vectors, backproject those to the image, and compare the resulting positions to the ground-truth symmetric landmark locations (e.g. the right eye). We report the measured pixel error in fig. 4a. As a baseline, we replace our embedding function with the one from [38], which results in much higher error. This is, however, expected, as the mapping m has no particular meaning in this embedding space; for a fairer comparison, we then explicitly estimate an ad-hoc plane of symmetry defined by the nose, the mean of the eyes, and the mean of the mouth corners, using 200 training images. This still gives higher error than our method, showing that enforcing symmetry during training leads to a better representation of symmetric objects.
In terms of the accuracy of the geometric embedding as such, we evaluate simply matching annotations between different images and obtain similar error to the embedding of [38] (ours 2.60, theirs 2.59 pixel error on 200 pairs of faces, and both 1.63 error when the second image is a warped version of the first). Hence representing symmetries does not harm geometric accuracy.
We also examine the influence of the synthetic warp intensity: in fig. 6 we train for 5 epochs, scaling the original control-point parameters by a factor; the results indicate that we are around the sweet spot and that unnatural, excessive warping is harmful.
Synthetic 3D car model. 
A challenging problem is capturing bilateral symmetry across out-of-plane rotations. We use a 3D car, animated with random motion [13] for 30K frames. The heading follows a random walk, eventually rotating 360° out of plane; translation, pitch and roll are sinusoidal. The back of the car is red, to easily distinguish it from the front. We use consecutive frames for training, with the ground-truth optical flow used for t and an image size of 75 × 75. The loss ignores pixels with flow smaller than 0.001, preventing confusion with the solid background. Figure 8 depicts examples from this dataset. Unlike CelebA, the cars are rendered from significantly different views, but our method can still localize the bilateral axis accurately.

Synthetic robot arm model. We trained our model on videos of a left-right pair of robotic arms, extending the setup of [38] to a system of two arms. Figure 7 shows the discovered symmetry by joining corresponding points in a few video frames. Note that symmetries are learned automatically from raw videos and ground-truth optical flow alone. Note also that none of the images is symmetric in the trivial left-right flip sense, due to the object deformations.

Figure 6: Varying warp intensity (symmetry error in pixels, for eyes and mouth, as a function of the warp factor).

Figure 7: Symmetry in a pair of toy robotic arms.

Figure 8: Bilateral symmetry on synthetic car images. Top: input images with the axis of symmetry superimposed (shown in green). Bottom: image pixels mapped to 3D with the reflection plane (green).

Figure 9: Rotational symmetry on the protein. Top: frames, with the found center of symmetry in red. Middle: colorized object frame; a different colouring is assigned to each leg despite the ambiguity. Bottom: embedding in 3D, which learns to be symmetric around an axis (red).
Last column: without the relaxed loss.

6.2 Rotational symmetry

We create an example based on 3-fold rotational symmetry in nature: the clathrin protein [43]. We use the protein mesh3 and animate it as a soft body in a physics engine [13, 7], generating 200 sequences of 400 frames each. For each sequence we vary the camera rotation, lighting, mesh smoothing and position. The protein is anchored at its centre, and we vary the gravity vector to produce varied motion.

We train using the relaxed loss in eq. (6), where H(Π) corresponds to rotating our sphere by 0°, 120° or 240°. The mapping then need only be learned up to this rotational ambiguity. As shown in fig. 9, this maps the protein images onto a canonical position with rotational symmetry around the chosen axis, whereas without the relaxed loss the object frame is neither aligned nor symmetrical.

We also show results for rotational symmetry in real images, using the flower class Stapelia from ImageNet (fig. 10), which has 5-fold rotational symmetry.

7 Conclusions

In this paper we have developed a new model of the symmetries of deformable object categories. The main advantage of this approach is that it is flexible and robust enough to support learning symmetric objects in an unsupervised manner, from raw images, despite variable viewpoint, deformations, and intra-class variations. We have also characterised the ambiguities in pose recovery caused by symmetries and developed a learning formulation that can discount them. Our contributions have been validated empirically, showing that we can learn to represent symmetries robustly on a variety of object categories, while retaining the accuracy of the learned geometric embedding compared to previous approaches.

3https://www.rcsb.org/structure/3LVG

Figure 10: Rotational symmetry on a Stapelia flower. Superimposed in green: the projection into the image of a set of half-planes 72° apart in the sphere space.
In red: the predicted axis of rotational symmetry.

Acknowledgments: This work acknowledges the support of the AIMS CDT (EPSRC EP/L015897/1) and ERC 638009-IDIU. We thank Almut Sophia Koepke for feedback and corrections.

References

[1] Helmut Alt, Kurt Mehlhorn, Hubert Wagener, and Emo Welzl. Congruence, similarity, and symmetries of geometric objects. Discrete & Computational Geometry, 3(3):237-256, 1988.

[2] Fabio Anselmi, Georgios Evangelopoulos, Lorenzo Rosasco, and Tomaso Poggio. Symmetry-adapted representation learning. Pattern Recognition, 86:201-208, 2019.

[3] Shai Bagon, Oren Boiman, and Michal Irani. What is a good image segment? A unified approach to segment extraction. In Proc. ECCV, pages 30-44. Springer, 2008.

[4] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with posterior regularization. In Proc. BMVC, pages 1-12, 2014.

[5] Oren Boiman and Michal Irani. Similarity by composition. In Proc. NeurIPS, pages 177-184, 2007.

[6] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: their training and application. CVIU, 1995.

[7] Erwin Coumans. Bullet physics engine. Open source software: http://bulletphysics.org, 2010.

[8] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.

[9] Aleksandrs Ecins, Cornelia Fermüller, and Yiannis Aloimonos. Cluttered scene segmentation using the symmetry constraint. In Proc. ICRA, pages 2271-2278. IEEE, 2016.

[10] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part based models. PAMI, 2010.

[11] Rob Fergus, Pietro Perona, and Andrew Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc.
CVPR, 2003.

[12] Ran Gal and Daniel Cohen-Or. Salient geometric features for partial shape matching and similarity. ACM Transactions on Graphics (TOG), 25(1):130-150, 2006.

[13] Mike Goslin and Mark R. Mine. The Panda3D graphics engine. Computer, 37(10):112-114, 2004.

[14] Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce. Proposal flow. In Proc. CVPR, pages 3475-3484, 2016.

[15] Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, and Jean Ponce. SCNet: Learning semantic correspondence. In Proc. ICCV, 2017.

[16] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Proc. NeurIPS, 2015.

[17] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly supervised matching for single-view reconstruction. In Proc. CVPR, 2016.

[18] Ira Kemelmacher-Shlizerman and Steven M. Seitz. Collection flow. In Proc. CVPR, 2012.

[19] Kurt Koffka. Principles of Gestalt Psychology, volume 44. Routledge, 2013.

[20] Erik G. Learned-Miller. Data driven image models through continuous joint alignment. PAMI, 2006.

[21] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Combined object categorization and segmentation with an implicit shape model. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.

[22] Yanxi Liu, Hagit Hel-Or, Craig S. Kaplan, Luc Van Gool, et al. Computational symmetry in computer vision and computer graphics. Foundations and Trends® in Computer Graphics and Vision, 5(1-2):1-195, 2010.

[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. ICCV, 2015.

[24] Jonathan L. Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In Proc. NeurIPS, pages 1601-1609, 2014.

[25] David G. Lowe.
Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.

[26] Giovanni Marola. On the detection of the axes of symmetry of symmetric and almost symmetric planar images. PAMI, 11(1):104-108, 1989.

[27] Niloy J. Mitra, Leonidas J. Guibas, and Mark Pauly. Symmetrization. In ACM Transactions on Graphics (TOG), volume 26, page 63. ACM, 2007.

[28] Hossein Mobahi, Ce Liu, and William T. Freeman. A compositional model for low-dimensional image set representation. In Proc. CVPR, 2014.

[29] Gregory L. Naber. The Geometry of Minkowski Spacetime: An Introduction to the Mathematics of the Special Theory of Relativity, volume 92. Springer Science & Business Media, 2012.

[30] D. Novotny, D. Larlus, and A. Vedaldi. Learning 3D object categories by looking around them. In Proc. ICCV, 2017.

[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In Proc. CVPR, 2012.

[32] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. PAMI, 34(11):2233-2246, 2012.

[33] Dan Raviv, Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Full and partial symmetries of non-rigid shapes. IJCV, 89(1):18-39, 2010.

[34] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.

[35] Ilan Shimshoni, Yael Moses, and Michael Lindenbaum. Shape reconstruction of 3D bilaterally symmetric surfaces. IJCV, 39(2):97-110, 2000.

[36] Zhangzhang Si and Song-Chun Zhu. Learning hybrid image templates (HIT) by information projection. PAMI.

[37] Changming Sun and Jamie Sherrah. 3D symmetry detection using the extended Gaussian image. PAMI, 19(2):164-168, 1997.

[38] J. Thewlis, H. Bilen, and A. Vedaldi.
Unsupervised learning of object frames by dense equivariant image labelling. In Proc. NeurIPS, 2017.

[39] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proc. ICCV, 2017.

[40] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In Proc. ICCV, pages 1824-1831, 2005.

[41] Thomas Vetter and Tomaso Poggio. Linear object classes and image synthesis from a single example image. PAMI, 19(7):733-742, 1997.

[42] James D. Watson, Francis H. C. Crick, et al. Molecular structure of nucleic acids. Nature, 171(4356):737-738, 1953.

[43] Jeremy D. Wilbur, Peter K. Hwang, Joel A. Ybe, Michael Lane, Benjamin D. Sellers, Matthew P. Jacobson, Robert J. Fletterick, and Frances M. Brodsky. Conformation switching of clathrin light chain regulates clathrin lattice assembly. Developmental Cell, 18(5):854-861, 2010.

[44] Heng Yang and Ioannis Patras. Mirror, mirror on the wall, tell me, is the error small? In Proc. CVPR, pages 4685-4693, 2015.

[45] Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection - how to effectively exploit shape and texture features. In Proc. ECCV, 2008.

[46] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning deep representation for face alignment with auxiliary attributes. PAMI, 2016.

[47] Zheng Zhang, Wei Shen, Cong Yao, and Xiang Bai. Symmetry-based text line detection in natural scenes. In Proc. CVPR, pages 2558-2567, 2015.