{"title": "Unsupervised Learning of Object Landmarks through Conditional Image Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 4016, "page_last": 4027, "abstract": "We propose a method for learning landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision. We cast this as the problem of generating images that combine the appearance of the object as seen in a first example image with the geometry of the object as seen in a second example image, where the two examples differ by a viewpoint change and/or an object deformation. In order to factorize appearance and geometry, we introduce a tight bottleneck in the geometry-extraction process that selects and distils geometry-related features. Compared to standard image generation problems, which often use generative adversarial networks, our generation task is conditioned on both appearance and geometry and thus is significantly less ambiguous, to the point that adopting a simple perceptual loss formulation is sufficient. We demonstrate that our approach can learn object landmarks from synthetic image deformations or videos, all without manual supervision, while outperforming state-of-the-art unsupervised landmark detectors. We further show that our method is applicable to a large variety of datasets - faces, people, 3D objects, and digits - without any modifications.", "full_text": "Unsupervised Learning of Object Landmarks\n\nthrough Conditional Image Generation\n\nTomas Jakab1\u2217\n\nAnkush Gupta1\u2217\n\nHakan Bilen2\n\nAndrea Vedaldi1\n\n1 Visual Geometry Group\n\nUniversity of Oxford\n\n{tomj,ankush,vedaldi}@robots.ox.ac.uk\n\n2 School of Informatics\nUniversity of Edinburgh\n\nhbilen@ed.ac.uk\n\nAbstract\n\nWe propose a method for learning landmark detectors for visual objects (such as\nthe eyes and the nose in a face) without any manual supervision. 
We cast this as the\nproblem of generating images that combine the appearance of the object as seen in\na \ufb01rst example image with the geometry of the object as seen in a second example\nimage, where the two examples differ by a viewpoint change and/or an object\ndeformation. In order to factorize appearance and geometry, we introduce a tight\nbottleneck in the geometry-extraction process that selects and distils geometry-\nrelated features. Compared to standard image generation problems, which often\nuse generative adversarial networks, our generation task is conditioned on both\nappearance and geometry and thus is signi\ufb01cantly less ambiguous, to the point\nthat adopting a simple perceptual loss formulation is suf\ufb01cient. We demonstrate\nthat our approach can learn object landmarks from synthetic image deformations\nor videos, all without manual supervision, while outperforming state-of-the-art\nunsupervised landmark detectors. We further show that our method is applicable to\na large variety of datasets \u2014 faces, people, 3D objects, and digits \u2014 without any\nmodi\ufb01cations.\n\n1\n\nIntroduction\n\nThere is a growing interest in developing machine learning methods that have little or no dependence\non manual supervision. In this paper, we consider in particular the problem of learning, without\nexternal annotations, detectors for the landmarks of object categories, such as the nose, the eyes, and\nthe mouth of a face, or the hands, shoulders, and head of a human body.\nOur approach learns landmarks by looking at images of deformable objects that differ by acquisition\ntime and/or viewpoint. Such pairs may be extracted from video sequences or can be generated by\nrandomly perturbing still images. Videos have been used before for self-supervision, often in the\ncontext of future frame prediction, where the goal is to generate future video frames by observing\none or more past frames. 
A key difficulty in such approaches is the high degree of ambiguity that exists in predicting the motion of objects from past observations. In order to eliminate this ambiguity, we propose instead to condition generation on two images, a source (past) image and a target (future) image. The goal of the learned model is to reproduce the target image, given the source and target images as input. Clearly, without further constraints, this task is trivial. Thus, we pass the target through a tight bottleneck meant to distil the geometry of the object (fig. 1). We do so by constraining the resulting representation to encode spatial locations, as may be obtained by an object landmark detector. The source image and the encoded target image are then passed to a generator network which reconstructs the target. Minimising the reconstruction error encourages the model to learn landmark-like representations because landmarks can be used to encode the geometry of the object, which changes between source and target, while the appearance of the object, which is constant, can be obtained from the source image alone.

∗equal contribution.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Model Architecture. Given a pair of source and target images (x, x'), the pose-regressor Φ extracts K heatmaps from x', which are then marginalized to estimate coordinates of keypoints, to limit the information flow. 2D Gaussians (y') are rendered from these keypoints and stacked along with the image features extracted from x, to reconstruct the target as Ψ(x, y') = x̂'. By restricting the information flow our model learns semantically meaningful keypoints, without any annotations.

The key advantage of our method, compared to other works for unsupervised learning of landmarks, is the simplicity and generality of the formulation, which allows it to work well on data far more complex than previously used in unsupervised learning of object landmarks, e.g. landmarks for the highly-articulated human body. In particular, unlike methods such as [45, 44, 55], we show that our method can learn from synthetically-generated image deformations as well as raw videos, as it does not require access to information about correspondences, optical flow, or transformations between images.

Furthermore, while image generation has been used extensively in unsupervised learning, especially in the context of (variational) auto-encoders [22] and Generative Adversarial Networks (GANs [13]; see section 2), our approach has a key advantage over such methods. Namely, conditioning on both source and target images simplifies the generation task considerably, making it much easier to learn the generator network [18]. The ensuing simplification means that we can adopt the direct approach of minimizing a perceptual loss as in [10], without resorting to more complex techniques like GANs. Empirically, we show that this still results in excellent image generation results and that, more importantly, semantically consistent landmark detectors are learned without manual supervision (section 4). 
Project code and details are available at: http://www.robots.ox.ac.uk/\n~vgg/research/unsupervised_landmarks/\n\n2 Related work\n\nThe recent approaches of [45, 44] learn to extract landmarks based on the principles of equivariance\nand distinctiveness.\nIn contrast to our work, these methods are not generative. Further, they\nrely on known correspondences between images obtained either through optical \ufb02ow or synthetic\ntransformations, and hence, cannot leverage video data directly. Since the principle of equivariance is\northogonal to our approach, it can be incorporated as an additional cue in our method.\nUnsupervised learning of representations has traditionally been achieved using auto-encoders and\nrestricted Boltzmann machines [14, 47, 15]. InfoGAN [6] uses GANs to disentangle factors in the\ndata by imposing a certain structure in the latent space. Our approach also works by imposing a latent\nstructure, but using a conditional-encoder instead of an auto-encoder.\nLearning representations using conditional image generation via a bottleneck was demonstrated\nby Xue et al. [52] in variational auto-encoders, and by Whitney et al. [50] using a discrete gating\nmechanism to combine representations of successive video frames. Denton et al. [8] factor the pose\nand identity in videos through an adversarial loss on the pose embeddings. We instead design our\nbottleneck to explicitly shape the features to resemble the output of a landmark detector, without any\nadversarial training. Villegas et al. [46] also generate future frames by extracting a representation of\nappearance and human pose, but, differently from us, require ground-truth pose annotations. Our\nmethod essentially inverts their analogy network [36] to output landmarks given the source and target\nimage pairs.\n\n2\n\n-1+1-1\fSeveral other generative methods [42, 40, 37, 48, 32] focus on video extrapolation. Srivastava et\nal. 
[40] employ Long Short Term Memory (LSTM) [16] networks to encode video sequences into\n\ufb01xed-length representation and decode it to reconstruct the input sequence. Vondrick et al. [48]\npropose a GAN for videos, also with a spatio-temporal convolutional architecture that disentangles\nforeground and background to generate realistic frames. Video Pixel Networks [20] estimate the\ndiscrete joint distribution of the pixel values in a video by encoding different modalities such as time,\nspace and colour information. In contrast, we learn a structured embedding that explicitly encodes\nthe spatial location of object landmarks.\nA series of concurrent works propose similar methods for unsupervised learning of object structure.\nShu et al. [38] learn to factor a single object-category-speci\ufb01c image into an appearance template in a\ncanonical coordinate system, and a deformation \ufb01eld which warps the template to reconstruct the input,\nas in an auto-encoder. They encourage this factorisation by controlling the size of the embeddings.\nSimilarly, Wiles et al. [51] learn a dense deformation \ufb01eld for faces but obtain the template from a\nsecond related image, as in our method. Suwajanakorn et al. [43] learn 3D-keypoints for objects\nfrom two images which differ by a known 3D transformation, by enforcing equivariance [45]. Finally,\nthe method of Zhang et al. [55] shares several similarities with ours, in that they also use image\ngeneration with the goal of learning landmarks. However, their method is based on generating a\nsingle image from itself using landmark-transported features. This, we show is insuf\ufb01cient to learn\ngeometry and requires, as they do, to also incorporate the principle of equivariance [45]. 
This is a key difference from our method: ours results in a much simpler system that does not require knowledge of the optical flow or correspondences between images, and can learn from raw videos directly.

3 Method

Let x, x' ∈ X = R^{H×W×C} be two images of an object, for example extracted as frames in a video sequence, or synthetically generated by randomly deforming x into x'. We call x the source image and x' the target image, and we use Ω to denote the image domain, namely the H×W lattice.
We are interested in learning a function Φ(x) = y ∈ Y that captures the "structure" of the object in the image as a set of K object landmarks. As a first approximation, assume that y = (u_1, . . . , u_K) ∈ Ω^K = Y are K coordinates u_k ∈ Ω, one per landmark.
In order to learn the map Φ in an unsupervised manner, we consider the problem of conditional image generation. Namely, we wish to learn a generator function

Ψ : X × Y → X,   (x, y') ↦ x'

such that the target image x' = Ψ(x, Φ(x')) is reconstructed from the source image x and the representation y' = Φ(x') of the target image. In practice, we learn both functions Φ and Ψ jointly to minimise the expected reconstruction loss min_{Ψ,Φ} E_{x,x'} [L(x', Ψ(x, Φ(x')))]. Note that, if we do not restrict the form of Y, then a trivial solution to this problem is to learn identity mappings by setting y' = Φ(x') = x' and Ψ(x, y') = y'. However, given that y' has the "form" of a set of landmark detections, the model is strongly encouraged to learn those. 
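As a toy illustration of why the representation y' must be restricted, the following minimal NumPy sketch shows the reconstruction objective and its degenerate identity solution; `phi_identity` and `psi_copy` are hypothetical stand-ins, not the paper's networks:

```python
import numpy as np

def reconstruction_loss(x_src, x_tgt, phi, psi):
    """Reconstruction error L(x', psi(x, phi(x'))) for one source/target pair."""
    x_hat = psi(x_src, phi(x_tgt))
    return float(np.mean((x_tgt - x_hat) ** 2))

# Degenerate solution when y' = phi(x') is unconstrained: phi passes the
# target straight through and psi ignores the source image entirely.
phi_identity = lambda x: x
psi_copy = lambda x_src, y: y

x_src = np.random.rand(64, 64, 3)
x_tgt = np.random.rand(64, 64, 3)
assert reconstruction_loss(x_src, x_tgt, phi_identity, psi_copy) == 0.0
```

The loss is exactly zero without any geometry being learned, which is why the bottleneck described next is needed.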
This is explained next.

3.1 Heatmaps bottleneck

In order for the model Φ(x) to learn to extract keypoint-like structures from the image, we terminate the network Φ with a layer that forces the output to be akin to a set of K keypoint detections. This is done in three steps. First, K heatmaps S_u(x; k), u ∈ Ω are generated, one for each keypoint k = 1, . . . , K. These heatmaps are obtained in parallel as the channels of an R^{H×W×K} tensor using a standard convolutional neural network architecture. Second, each heatmap is renormalised to a probability distribution via a (spatial) softmax and condensed to a point by computing the (spatial) expected value of the latter:

u*_k(x) = (Σ_{u∈Ω} u e^{S_u(x;k)}) / (Σ_{u∈Ω} e^{S_u(x;k)}).    (1)

Third, each heatmap is replaced with a Gaussian-like function centred at u*_k with a small fixed standard deviation σ:

Φ_u(x; k) = exp(−‖u − u*_k(x)‖² / (2σ²)).    (2)

Figure 2: Unsupervised Landmarks. [left]: CelebA images showing the synthetically transformed source x and target x' images, the reconstructed target Ψ(x, Φ(x')), and the unsupervised landmarks Φ(x'). [middle]: The same for video frames from VoxCeleb. [right]: Two example images with selected (8 out of 10) landmarks u_k overlaid and their corresponding 2D score maps S_u(x; k) (see section 3.1; brighter pixels indicate higher confidence).

The end result is a new tensor y = Φ(x) ∈ R^{H×W×K} that encodes as Gaussian heatmaps the location of K maxima. 
Since it is possible to recover the landmark locations exactly from these heatmaps, this representation is equivalent to the one considered above (2D coordinates); however, it is more useful as an input to a generator network, as discussed later.
One may wonder whether this construction can be simplified by removing steps two and three and simply considering S(x) (possibly after re-normalisation) as the output of the encoder Φ(x). The answer is that these steps, and especially eq. (1), ensure that very little information from x is retained, which, as suggested above, is key to avoiding degenerate solutions. Converting back to Gaussian landmarks in eq. (2), instead of just retaining 2D coordinates, ensures that the representation is still utilisable by the generator network.
Separable implementation. In practice, we consider a separable variant of eq. (1) for computational efficiency. Namely, let u = (u_1, u_2) be the two components of each pixel coordinate and write Ω = Ω_1 × Ω_2. Then we set

S_{u_i}(x; k) = Σ_{u_j ∈ Ω_j} S_{(u_1,u_2)}(x; k),    u*_{ik}(x) = (Σ_{u_i ∈ Ω_i} u_i e^{S_{u_i}(x;k)}) / (Σ_{u_i ∈ Ω_i} e^{S_{u_i}(x;k)}),

where i = 1, 2 and j = 2, 1 respectively. Figure 2 visualizes the source x, target x' and generated Ψ(x, Φ(x')) images, as well as x' overlaid with the locations of the unsupervised landmarks Φ(x'). It also shows the heatmaps S_u(x; k) and marginalized separable softmax distributions on the top and left of each heatmap for K = 10 keypoints.

3.2 Generator network using a perceptual loss

The goal of the generator network x̂' = Ψ(x, y') is to map the source image x and the distilled version y' of the target image x' to a reconstruction of the latter. Thus the generator network is optimised to minimise a reconstruction error L(x', x̂'). 
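A minimal NumPy sketch of the bottleneck of eqs. (1)–(2): a spatial softmax condensed to expected coordinates, then re-rendered as fixed-σ Gaussian maps. This is an illustrative re-implementation under stated simplifications (non-separable softmax, integer pixel grid), not the paper's code:

```python
import numpy as np

def softargmax_2d(S):
    """S: (H, W, K) score maps -> (K, 2) expected (row, col) coordinates, eq. (1)."""
    H, W, K = S.shape
    S = S.reshape(H * W, K)
    p = np.exp(S - S.max(axis=0))            # numerically stable spatial softmax
    p /= p.sum(axis=0)
    rows, cols = np.mgrid[0:H, 0:W]
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1)  # (H*W, 2)
    return p.T @ grid                        # (K, 2) expected coordinates u*_k

def render_gaussians(coords, H, W, sigma=1.0):
    """coords: (K, 2) -> (H, W, K) Gaussian heatmaps centred at u*_k, eq. (2)."""
    rows, cols = np.mgrid[0:H, 0:W]
    maps = [np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
            for r, c in coords]
    return np.stack(maps, axis=-1)

# A sharply peaked score map collapses to (approximately) its argmax location.
S = np.zeros((16, 16, 1)); S[5, 9, 0] = 50.0
u = softargmax_2d(S)                 # approximately (5, 9)
y = render_gaussians(u, 16, 16, sigma=1.0)
```

Because the expectation in `softargmax_2d` is differentiable, gradients from the reconstruction loss can flow back through the coordinates to the score maps, which is what makes the bottleneck trainable end-to-end.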
The design of the reconstruction error is important for good performance. Nowadays the standard practice is to learn such a loss function using adversarial techniques, as exemplified in numerous variants of GANs. However, since the goal here is not generative modelling, but rather to induce a representation y' of the object geometry for reconstructing a specific target image (as in an auto-encoder), a simpler method may suffice.
Inspired by the excellent results for photo-realistic image synthesis of [4], we resort here to the "content representation" or "perceptual" loss used successfully for various generative networks [12, 1, 9, 19, 27, 30, 31]. The perceptual loss compares activations extracted from multiple layers of a deep network for both the reference and the generated images, instead of only the raw pixel values. We define the loss as L(x', x̂') = Σ_l α_l ‖Γ_l(x') − Γ_l(x̂')‖²₂, where Γ(x) is an off-the-shelf pre-trained neural network, for example VGG-19 [39], and Γ_l denotes the output of the l-th sub-network (obtained by chopping Γ at layer l). As our goal is purely-unsupervised learning, we pre-train this network using a self-supervised approach, namely colorising grayscale images [25].

n             Thewlis [45]   Ours selfsup.
1             10.82          12.89 ± 3.21
5             9.25           8.16 ± 0.96
†10           8.49           7.19 ± 0.45
100           —              4.29 ± 0.34
500           —              2.83 ± 0.06
1000          —              2.73 ± 0.03
5000          —              2.60 ± 0.00
All (19,000)  7.15           2.58 ± N/A

Figure 3: Sample Efficiency for Supervised Regression on MAFL. [left]: Supervised linear regression of 5 keypoints (bottom-row) from 10 unsupervised (top-row) on MAFL test set. 
Centres of the white dots correspond to the ground-truth locations, while the dark ones are the predictions. Both unsupervised and supervised landmarks show a good degree of equivariance with respect to head rotation (columns 2, 4) and invariance to headwear or eyewear (columns 1, 3). [right]: MSE (±σ), normalised by inter-ocular distance (in %), on the MAFL test-set for a varying number (n) of supervised samples from the MAFL training set used for learning the regressor from 30 unsupervised landmarks. †: we outperform the previous state-of-the-art [45] with only 10 labelled examples.

We also test using a VGG-19 model pre-trained for image classification on ImageNet. All other networks are trained from scratch. The parameters α_l > 0, l = 1, . . . , n are scalars that balance the terms. We use a linear combination of the reconstruction error for the 'input', 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2' and 'conv5_2' layers of VGG-19; {α_l} are updated online during training to normalise the expected contribution from each layer as in [4]. However, we use the ℓ2 norm instead of their ℓ1, as it worked better for us.

4 Experiments

In section 4.1 we provide the details of the landmark detection and generator networks; a common architecture is used across all datasets. Next, we evaluate landmark detection accuracy on faces (section 4.2) and human bodies (section 4.3). In section 4.4 we analyse the invariance of the learned landmarks to various nuisance factors, and finally in section 4.5 we study the factorised representation of object style and geometry in the generator.

4.1 Model details

Landmark detection network. The landmark detector ingests the image x' to produce K landmark heatmaps y'. It is composed of sequential blocks consisting of two convolutional layers each. 
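Returning briefly to the loss of section 3.2, a minimal NumPy sketch of the multi-layer perceptual comparison; the `feature_layers` below are hypothetical stand-ins for the truncated sub-networks Γ_l (in the paper, layers of VGG-19), and the average-pooling "layer" is illustrative only:

```python
import numpy as np

def perceptual_loss(x, x_hat, feature_layers, alphas):
    """Sum over layers of alpha_l * ||Gamma_l(x) - Gamma_l(x_hat)||_2^2.
    `feature_layers` is a list of callables standing in for truncated
    sub-networks Gamma_l of a pre-trained network."""
    total = 0.0
    for gamma_l, alpha_l in zip(feature_layers, alphas):
        diff = gamma_l(x) - gamma_l(x_hat)
        total += alpha_l * float(np.sum(diff ** 2))
    return total

# Hypothetical stand-in "layers": identity (the 'input' term) and a 2x2 pool.
def avg_pool2(x):
    H, W, C = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

layers = [lambda x: x, avg_pool2]
x = np.random.rand(8, 8, 3)
assert perceptual_loss(x, x, layers, alphas=[1.0, 1.0]) == 0.0
```

The per-layer weights α_l play the role of the balancing scalars described above; in the paper they are renormalised online during training.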
All the layers use 3×3 filters, except the first one, which uses 7×7. Each block doubles the number of feature channels relative to the previous block, starting with 32 channels in the first one. The first layer in each block, except in the first block, downsamples the input tensor using a stride-2 convolution. The spatial size of the final output, which holds the heatmaps, is set to 16×16. Thus, due to downsampling, a network with n − 3 blocks (n ≥ 4) takes an input image of resolution H×W = 2^n×2^n and produces a 16×16×(32 · 2^{n−3}) tensor. A final 1×1 convolutional layer maps this tensor to a 16×16×K tensor, with one channel per landmark. As described in section 3.1, these K feature channels are then used to render 16×16×K 2D-Gaussian maps y' (with σ = 0.1).
Image generation network. The image generator takes as input the image x and the landmarks y' = Φ(x') extracted from the second image in order to reconstruct the latter. This is achieved in two steps: first, the image x is encoded as a feature tensor z ∈ R^{16×16×C} using a convolutional network with exactly the same architecture as the landmark detection network except for the final 1×1 convolutional layer, which is omitted; next, the features z and the landmarks y' are stacked together (along the channel dimension) and fed to a regressor that reconstructs the target frame x'. The regressor also comprises sequential blocks with two convolutional layers each. The input to each successive block, except the first one, is upsampled two times through bilinear interpolation, while the number of feature channels is halved; the first block starts with 256 channels, and a minimum of 32 channels is maintained until a tensor with the same spatial dimensions as x' is obtained. 
A final convolutional layer regresses the three RGB channels with no non-linearity. All layers use 3×3 filters and each block has two layers, similarly to the landmark network.
All the weights are initialised with random Gaussian noise (σ = 0.01), and optimised using Adam [21] with a weight decay of 5 · 10^−4. The learning rate is set to 10^−2, and lowered by a factor of 10 once the training error stops decreasing; the ℓ2-norm of the gradients is bounded to 1.0.

BBC Pose Accuracy (%) at d = 6 pixels

Method               Head    Wrsts   Elbws   Shldrs   Avg.
Pfister et al. [35]  98.00   88.45   77.10   93.50    88.01
Charles et al. [3]   95.40   72.95   68.70   90.30    79.90
Chen et al. [5]      76.8    47.9    65.9    66.5     64.1
Pfister et al. [34]  74.90   53.05   46.00   71.40    59.40
Yang et al. [53]     63.40   53.70   49.20   46.10    51.63
Ours (selfsup.)      81.10   49.05   53.05   70.10    60.79
Ours                 76.10   56.50   70.70   74.30    68.44

Figure 4: Learning Human Pose. 50 unsupervised keypoints are learnt on the BBC Pose dataset. Annotations (empty circles in the images) for 7 keypoints are provided, corresponding to the head, wrists, elbows and shoulders. Solid circles represent the predicted positions; in [fig-top] these are the raw discovered keypoints which correspond maximally to each annotation; in [fig-bottom] these are regressed (linearly) from the discovered keypoints. [table]: Comparison against supervised methods; the %-age of points within d = 6 pixels of ground-truth is reported. [top-row]: accuracy-vs-distance d, for each body-part; [top-row-rightmost]: average accuracy for a varying number of supervised samples used for regression.

4.2 Learning facial landmarks

Setup. We explore extracting source-target image pairs (x, x') using either (1) synthetic transformations, or (2) videos. 
In the first case, the pairs are obtained as (x, x') = (g1 x0, g2 x0) by applying two random thin-plate-spline (TPS) [11, 49] warps g1, g2 to a given sample image x0. We use the 200k CelebA [24] images after resizing them to 128×128 resolution. The dataset provides annotations for 5 facial landmarks — eyes, nose and mouth corners — which we do not use for training. Following [45] we exclude the images in the MAFL [57] test-set from the training split and generate synthetically-deformed pairs as in [45, 55], but the transformations themselves are not required for training. We discount the reconstruction loss in the regions of the warped image which lie outside the original image to avoid modelling irrelevant boundary artefacts.
In the second case, (x, x') are two frames sampled from a video. We consider VoxCeleb [28], a large dataset of face tracks, consisting of 1251 celebrities speaking over 100k English language utterances. We use the standard training split and remove any overlapping identities which appear in the test sets of MAFL and AFLW. Pairs of frames from the same video, but possibly belonging to different utterances, are randomly sampled for training. By using video data for training our models we eliminate the need for engineering synthetic data.

Figure 5: Unsupervised Landmarks on Human3.6M. [left]: an example quadruplet source-target-reconstruction-keypoint (left to right) from Human3.6M. [right]: learned keypoints on a test video sequence. 
The landmarks consistently track the legs, arms, torso and head across frames.

Qualitative results. Figure 2 shows the learned heatmaps and source-target-reconstruction-keypoints quadruplets ⟨x, x', Ψ(x, Φ(x')), Φ(x')⟩ for synthetic transformations and videos. We note that the method extracts keypoints which consistently track facial features across deformation and identity changes (e.g., the green circle tracks the lower chin, and the light blue square lies between the eyes). The regressed semantic keypoints on the MAFL test set are visualised in fig. 3, where they are localised with high accuracy. Further, the target image x' is also reconstructed accurately.
Quantitative results. We follow [45, 44] and use unsupervised keypoints learnt on CelebA and VoxCeleb to regress manually-annotated keypoints in the MAFL and AFLW [23] test sets. We freeze the parameters of the unsupervised detector network (Φ) and learn a linear regressor (without bias) from our unsupervised keypoints to 5 manually-labelled ones from the respective training sets. Model selection is done using a 10% validation split of the training data.
We report results in terms of standard MSE normalised by the inter-ocular distance expressed as a percentage [57], and show a few regressed keypoints in fig. 3. Before evaluating on AFLW, we finetune our networks pre-trained on CelebA or VoxCeleb on the AFLW training set. We do not use any labels during finetuning.
Sample efficiency. Figure 3 reports the performance of detectors trained on CelebA as a function of the number n of supervised examples used to translate from unsupervised to supervised keypoints. We note that n = 10 is already sufficient for results comparable to the previous state-of-the-art (SoA) method of Thewlis et al. [45], and that performance almost saturates at n = 500 (vs. 19,000 available training samples).
Vs. SoA. Table 1 compares our regression results to the SoA. We experiment with regressing from K = {10, 30, 50} unsupervised landmarks, using the self-supervised and the supervised perceptual loss networks; the number of samples n used for regression is maxed out (= 19,000) to be consistent with previous works. On both the MAFL and AFLW datasets, at 2.58% and 6.31% error respectively (for K = 30), we significantly outperform all the supervised and unsupervised methods. Notably, we perform better than the concurrent work of Zhang et al. [55] (MAFL: 3.16%; AFLW: 6.58%), while using a simpler method. When synthetic warps are removed from [55], so that the equivariance constraint cannot be employed, our method is significantly better (2.58% vs 8.42% on MAFL). We are also significantly better than many SoA supervised detectors [54, 41, 57] using only n = 100 supervised training examples, which shows that the approach is very effective at exploiting the unlabelled data. Finally, training with VoxCeleb video frames degrades the performance due to the domain gap; including a bias in the linear regressor improves the performance.

Table 1: Comparison with state-of-the-art on MAFL and AFLW. K is the number of unsupervised landmarks. †: train a 2-layer MLP instead of a linear regressor. ‡: use the larger VoxCeleb2 [7] dataset for unsupervised training, and include a bias term in their regressor (through batch-normalization). Normalised %-MSE is reported (see fig. 3).

Method                        K    MAFL    AFLW
Unsupervised / self-supervised
Thewlis [45]                  30   7.15    10.53
Thewlis [45]                  50   6.67    8.80
Thewlis [44] (frames)         –    5.83    –
Shu † [38]                    –    5.45    –
Zhang [55]                    10   3.46    7.01
Zhang [55] w/ equiv.          30   3.16    6.58
Zhang [55] w/o equiv.         30   8.42    –
Wiles ‡ [51]                  –    3.44    –
Supervised
RCPR [2]                      –    –       11.60
CFAN [54]                     –    15.84   10.94
Cascaded CNN [41]             –    9.73    8.97
TCDCN [57]                    –    7.95    7.65
RAR [41]                      –    –       7.23
MTCNN [56]                    –    5.39    6.90
Ours, training set: CelebA
loss-net: selfsup.            10   3.19    6.86
loss-net: selfsup.            30   2.58    6.31
loss-net: selfsup.            50   2.54    6.33
loss-net: sup.                10   3.32    6.99
loss-net: sup.                30   2.63    6.39
loss-net: sup.                50   2.59    6.35
Ours, training set: VoxCeleb
loss-net: selfsup.            30   3.94    6.75
loss-net: selfsup. w/ bias    30   3.63    –
loss-net: sup.                30   4.01    7.10

[left]
fc-layer (d)   10      20      60      ours (K=30)
MAFL           20.60   21.94   28.96   2.58

[right]
loss           ℓ1     adv.+ℓ1   ℓ2     adv.+ℓ2   content (ours)
MAFL (K=30)    3.64   3.62      2.84   2.80      2.58

Table 2: Ablation Study. [left]: Replacing the keypoint bottleneck with a low d-dimensional, d = {10, 20, 60}, fully-connected (fc) layer leads to significantly worse landmark detection performance (%-MSE) on the MAFL dataset. [right]: Replacing the content loss with ℓ1, ℓ2 losses on the images, optionally paired with an adversarial loss (adv.), also degrades the performance.

Figure 6: Invariant Localisation. Unsupervised keypoints discovered on the smallNORB test set for the car and airplane categories. Out of 20 learned keypoints, we show the most geometrically stable ones: they are invariant to pose, shape, and illumination. 
[b–c]: elevation-vs-azimuth; [a, d]: shape-vs-illumination (y-axis-vs-x-axis).

Ablation study. In table 2 we present two ablation studies, first on the keypoint bottleneck, and second where we compare against adversarial and other image-reconstruction losses. For both settings, we take the best performing model configuration for facial landmark detection on the MAFL dataset.
Keypoint bottleneck. The keypoint bottleneck has two functions: (1) it provides a differentiable and distributed representation of the location of landmarks, and (2) it restricts the information from the target image to spatial locations only. When the bottleneck is replaced with a generic low-dimensional fully-connected layer (as in a conventional auto-encoder), the performance degrades significantly. This is because the continuous vector embedding is not encouraged to encode geometry explicitly.
Reconstruction loss. We replace our content/perceptual loss with ℓ1 and ℓ2 losses on generated pixels; the losses are also optionally paired with an adversarial term [13] to encourage verisimilitude as in [18]. All of these alternatives lead to worse landmark detection performance (table 2). While GANs are useful for aligning image distributions, in our setting we reconstruct a specific target image (similar to an auto-encoder). For this task, it is enough to use a simple content/perceptual loss.

4.3 Learning human body landmarks

Setup. Articulated limbs make landmark localisation on the human body significantly more challenging than on faces. We consider two video datasets, BBC-Pose [3] and Human3.6M [17]. BBC-Pose comprises 20 one-hour-long videos of sign-language signers with varied appearance and dynamic background; the test set includes 1000 frames. The frames are annotated with 7 keypoints corresponding to the head, wrists, elbows, and shoulders which, as for faces, we use only for quantitative evaluation, not for training. 
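The linear keypoint-regression protocol used for evaluation (here and in section 4.2) can be sketched as follows. This is a NumPy toy with synthetic stand-in data, where the annotated points happen to be an exact linear function of the unsupervised ones; the error metric is the inter-ocular-normalised %-MSE described in section 4.2:

```python
import numpy as np

def fit_keypoint_regressor(U, V):
    """Least-squares linear map (no bias) from K unsupervised keypoints to
    annotated ones. U: (n, 2K) flattened unsupervised coords, V: (n, 10)."""
    W, *_ = np.linalg.lstsq(U, V, rcond=None)
    return W  # (2K, 10)

def normalised_error(V_pred, V_gt, iod):
    """Mean keypoint error as a percentage of the inter-ocular distance."""
    err = np.linalg.norm(V_pred.reshape(-1, 5, 2) - V_gt.reshape(-1, 5, 2), axis=-1)
    return 100.0 * float(np.mean(err / iod[:, None]))

rng = np.random.default_rng(0)
U = rng.random((100, 60))       # n=100 images, K=30 unsupervised landmarks
A = rng.random((60, 10))        # toy ground truth: exactly linear in U
V = U @ A
W = fit_keypoint_regressor(U, V)
iod = np.full(100, 40.0)        # per-image inter-ocular distance in pixels
assert normalised_error(U @ W, V, iod) < 1e-6
```

In the paper the detector Φ is frozen and only this small linear map is fitted, which is why very few annotated samples suffice.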
The Human3.6M dataset contains videos of 11 actors in various poses, shot from multiple viewpoints. Image pairs are extracted by randomly sampling frames from the same video sequence, with the additional constraint for Human3.6M that the time difference stays within the range of 3–30 frames. Loose crops around the subjects are extracted using the provided annotations and resized to 128×128 pixels. Detectors for K = 20 and K = 50 keypoints are trained on Human3.6M and BBC-Pose respectively.

Qualitative results. Figure 4 shows raw unsupervised keypoints and the regressed semantic ones on the BBC-Pose dataset. For each annotated keypoint, a maximally matching unsupervised keypoint is identified by solving a bipartite linear assignment with mean distance as the cost. Regressed keypoints consistently track the annotated points. Figure 5 shows ⟨x, x′, Ψ(x, Φ(x′)), Φ(x′)⟩ quadruplets, as for faces, as well as the discovered keypoints. All the keypoints lie on top of the human actors, and consistently track the body across identities and poses. However, the model cannot tell the frontal and dorsal sides of the human body apart, possibly due to weak cues in the images and the absence of explicit constraints enforcing such consistency.

Quantitative results. Figure 4 compares the accuracy of localising the 7 keypoints on BBC-Pose against supervised methods, for both self-supervised and supervised perceptual loss networks. The accuracy is computed as the percentage of points within a specified pixel distance d. In this case, the top two supervised methods are better than our unsupervised approach, but we outperform [33, 53] using 1k training samples (vs. 10k); furthermore, methods such as [35] are specialised for videos and

Figure 7: Disentangling Style and Geometry. Image generation conditioned on spatial keypoints induces disentanglement of representations for style and geometry in the generator.
The source image (x) imparts style (e.g. colour, texture), while the target image (x′) influences the geometry (e.g. shape, pose). Here, during inference, x [middle] is sampled to have a different style than x′ [top], although during training, image pairs with consistent style were sampled. The generated images [bottom] borrow their style from x, and their geometry from x′. (a) SVHN Digits: the foreground and background colours are swapped. (b) AFLW Faces: the pose of the style image x is made consistent with x′. (c) Human3.6M: the background, hat, and shoes are retained from x, while the pose is borrowed from x′. All images are sampled from the respective test sets, never seen during training.

leverage temporal smoothness. Training using the supervised perceptual loss is understandably better than using the self-supervised one. Performance is particularly good on parts such as the elbow.

4.4 Learning 3D object landmarks: pose, shape, and illumination invariance

We train our unsupervised keypoint detectors on the SmallNORB [26] dataset, comprising 5 object categories with 10 object instances each, imaged from regularly spaced viewpoints and under different illumination conditions. We train category-specific detectors for K = 20 keypoints using image pairs from neighbouring viewpoints and show results in fig. 6 for car and airplane (see the supplementary material for visualisations of other object categories). The keypoints most invariant to various factors are visualised. These landmarks are especially robust to changes in illumination and elevation angle. They are also invariant to smaller changes in azimuth (±80°), but fail to generalise beyond that. Most interestingly, they localise structurally similar regions even when there is a large change in object shape (e.g. fig.
6-(d)); such landmarks could thus be leveraged for viewpoint-invariant semantic matching.

4.5 Disentangling appearance and geometry

In fig. 7 we show that our method can be interpreted as disentangling appearance from geometry. Generator and keypoint networks are trained on SVHN digits [29], AFLW faces, and Human3.6M people. The generator network is capable of retaining the geometry of an image and substituting the style with that of any other image in the dataset, including unrelated image pairs never seen during training. For example, in the third column we re-render the number 3 by mixing its geometry with the appearance of the number 5. This generalises significantly beyond the training examples, which only consist of pairs of digits sampled from the same house number instance, sharing a common style.

5 Conclusions

In this paper we have shown that a simple network trained for conditional image generation can be utilised to induce object landmark detectors without any manual supervision. On faces, our method outperforms previous unsupervised as well as supervised methods for landmark detection. The method also extends to much more challenging data, such as detecting landmarks of people, and to diverse data, such as 3D objects and digits.

Acknowledgements. We are grateful for the support provided by EPSRC AIMS CDT, ERC 638009-IDIU, and the Clarendon Fund scholarship. We would like to thank James Thewlis for suggestions and support with code and data, and David Novotný and Triantafyllos Afouras for helpful advice.

References

[1] J. Bruna, P. Sprechmann, and Y. LeCun. Super-resolution with deep convolutional sufficient statistics. In Proc. ICLR, 2016.

[2] X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In Proc. ICCV, pages 1513–1520, 2013.

[3] J. Charles, T. Pfister, D. Magee, D. Hogg, and A.
Zisserman. Domain adaptation for upper body pose tracking in signed TV broadcasts. In Proc. BMVC, 2013.

[4] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In Proc. ICCV, 2017.

[5] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proc. NIPS, 2014.

[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proc. NIPS, pages 2172–2180, 2016.

[7] J. S. Chung, A. Nagrani, and A. Zisserman. VoxCeleb2: Deep speaker recognition. In INTERSPEECH, 2018.

[8] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Proc. NIPS, 2017.

[9] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Proc. NIPS, 2016.

[10] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Proc. NIPS, pages 658–666, 2016.

[11] J. Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, 1977.

[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proc. CVPR, 2016.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, 2014.

[14] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[15] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.

[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 2014.

[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.

[19] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, 2016.

[20] N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

[21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[23] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshops, 2011.

[24] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. ICCV, 2015.

[25] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In Proc. ECCV, 2016.

[26] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. CVPR, 2004.

[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, 2017.

[28] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. In INTERSPEECH, 2017.

[29] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng.
Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[30] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Proc. NIPS, 2016.

[31] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proc. CVPR, 2017.

[32] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2015.

[33] T. Pfister, J. Charles, and A. Zisserman. Large-scale learning of sign language by watching TV (using co-occurrences). In Proc. BMVC, 2013.

[34] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Proc. ACCV, 2014.

[35] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In Proc. ICCV, 2015.

[36] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Proc. NIPS, 2015.

[37] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Proc. NIPS, pages 217–225, 2016.

[38] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proc. ECCV, 2018.

[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[40] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Proc. ICML, pages 843–852, 2015.

[41] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Proc.
CVPR, 2013.

[42] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Proc. NIPS, pages 1601–1608, 2009.

[43] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In Proc. NIPS, 2018.

[44] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised object learning from dense invariant image labelling. In Proc. NIPS, 2017.

[45] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proc. ICCV, 2017.

[46] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831, 2017.

[47] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proc. ICML, pages 1096–1103, 2008.

[48] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Proc. NIPS, pages 613–621, 2016.

[49] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.

[50] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum. Understanding visual concepts with continuation learning. In ICLR Workshop, 2016.

[51] O. Wiles, A. S. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. In Proc. BMVC, 2018.

[52] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Proc. NIPS, 2016.

[53] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR, 2011.

[54] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In Proc. ECCV, 2014.

[55] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee.
Unsupervised discovery of object landmarks as structural representations. In Proc. CVPR, 2018.

[56] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In Proc. ECCV, pages 94–108, 2014.

[57] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. PAMI, 2016.