{"title": "Object landmark discovery through unsupervised adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 13520, "page_last": 13531, "abstract": "This paper proposes a method to ease the unsupervised learning of object landmark detectors. Similarly to previous methods, our approach is fully unsupervised in the sense that it does not require or make any use of annotated landmarks for the target object category. Contrary to previous works, we do however assume that a landmark detector, which has already learned a structured representation for a given object category in a fully supervised manner, is available. Under this setting, our main idea boils down to adapting the given pre-trained network to the target object categories in a fully unsupervised manner. To this end, our method uses the pre-trained network as a core which remains frozen and does not get updated during training, and learns, in an unsupervised manner, only a projection matrix to perform the adaptation to the target categories. By building upon an existing structured representation learned in a supervised manner, the optimization problem solved by our method is much more constrained, with significantly fewer parameters to learn, which seems to be important for the case of unsupervised learning. We show that our method surpasses fully unsupervised techniques trained from scratch as well as a strong baseline based on fine-tuning, and produces state-of-the-art results on several datasets. Code can be found at tiny.cc/GitHub-Unsupervised", "full_text": "Object landmark discovery through unsupervised adaptation\n\nEnrique Sanchez1, Georgios Tzimiropoulos1,2\n1 Samsung AI Centre, Cambridge, UK\n2 Computer Vision Lab, University of Nottingham, UK\n{e.lozano, georgios.t}@samsung.com, yorgos.tzimiropoulos@nottingham.ac.uk\n\nAbstract\n\nThis paper proposes a method to ease the unsupervised learning of object landmark detectors. 
Similarly to previous methods, our approach is fully unsupervised in the sense that it does not require or make any use of annotated landmarks for the target object category. Contrary to previous works, we do however assume that a landmark detector, which has already learned a structured representation for a given object category in a fully supervised manner, is available. Under this setting, our main idea boils down to adapting the given pre-trained network to the target object categories in a fully unsupervised manner. To this end, our method uses the pre-trained network as a core which remains frozen and does not get updated during training, and learns, in an unsupervised manner, only a projection matrix to perform the adaptation to the target categories. By building upon an existing structured representation learned in a supervised manner, the optimization problem solved by our method is much more constrained, with significantly fewer parameters to learn, which seems to be important for the case of unsupervised learning. We show that our method surpasses fully unsupervised techniques trained from scratch as well as a strong baseline based on fine-tuning, and produces state-of-the-art results on several datasets. Code can be found at tiny.cc/GitHub-Unsupervised.\n\n1 Introduction\n\nWe wish to learn to detect landmarks (also known as keypoints) on examples of a given object category, such as human and animal faces and bodies, shoes, or cars. Landmarks are important in object shape perception and help establish correspondence across different viewpoints or different instances of that object category. Landmark detection has traditionally been approached in machine learning in a fully supervised manner: for each object category, a set of pre-defined landmarks is manually annotated on (typically) several thousand object images, and then a neural network is trained to predict these landmarks by minimizing an L2 loss. 
Thanks to recent advances in training\ndeep neural nets, supervised methods have been shown to produce good results even for the most\ndif\ufb01cult datasets [2, 1, 37, 22]. This paper attempts to address the more challenging setting which\ndoes not assume the existence of manually annotated landmarks (an extremely laborious task), making\nour approach effortlessly applicable to any object category.\nUnsupervised learning of object landmarks is a challenging learning problem for at least 4 reasons: 1)\nLandmarks are by nature ambiguous; there may exist very different landmark con\ufb01gurations even for\nsimple objects like the human face. 2) Landmarks, although represented by simple x,y coordinates,\nconvey high-level semantic information about objects parts, which is hard to learn without manual\nsupervision. 3) Landmarks must be consistently detected across large changes of viewpoints and\nappearance. 4) Discovered landmarks must not only be stable with viewpoint change but also fully\ncapture the shape of deformable objects like for the case of the human face and body.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Training an object landmark detector: a) Supervised: using heatmap regression, one can learn an\nobject landmark detector from annotated images. b) Unsupervised: using conditional image generation, one\ncan discover the structure of object landmarks. c) Unsupervised, proposed: Our approach uses the \u201cknowledge\u201d\nlearned by a network trained in a supervised way for an object category X to learn how to discover landmarks\nfor a completely different object category Y in a fully unsupervised way. 
Our approach learns only a small\nfraction of parameters (shown in blue colour) for performing the unsupervised adaptation by solving a more\nconstrained optimization problem which seems to be bene\ufb01cial for the case of unsupervised learning.\n\nOur method departs from recent methods for unsupervised learning of object landmarks [35, 34, 43,\n13] by approaching the problem from an unexplored direction, that of domain adaptation: because\nlandmark annotations do exist for a few object categories, it is natural to attempt to overcome the\naforementioned learning challenges from a domain adaptation perspective, in particular, by attempting\nto adapt a pre-trained landmark detection network, trained on some object category, to a new target\nobject category. Although the pre-trained network is trained in a fully supervised manner, the\nadaptation to the target object category is done in a fully unsupervised manner. We show that this\nis feasible by borrowing ideas from incremental learning [28, 25, 26] which, to our knowledge, are\napplied for the \ufb01rst time for the case of unsupervised learning and dense prediction.\nIn particular, our method uses the pre-trained network as a core which remains frozen and does not\nget updated during training, and just learns, in an unsupervised manner, only a projection matrix to\nperform the adaptation to the target categories. We show that such a simple approach signi\ufb01cantly\nfacilitates the process of unsupervised learning, resulting in signi\ufb01cantly more accurately localized\nand stable landmarks compared to unsupervised methods trained entirely from scratch. 
We also show that our method significantly outperforms a strong baseline based on fine-tuning the pre-trained network on the target domain, which we attribute to the fact that the optimization problem solved in our case is much more constrained, with fewer degrees of freedom and significantly fewer parameters, making our approach more appropriate for the case of learning without labels. As a second advantage, our method adds only a small fraction of parameters to the pre-trained network (about 10%), making our approach able to efficiently handle a potentially large number of object categories.\n\n2 Related Work\n\nUnsupervised landmark detection: While there are some works on learning correspondence [36, 5], geometric matching [15, 27], or unsupervised learning of representations [33, 38, 6], very little effort has been put towards explicitly building object landmark detectors with no annotations, i.e. in an unsupervised way. While [34] used the concept of the equivariance constraint [17] to learn image deformations, their proposed approach regresses label heatmaps, which can later be mapped to specific keypoint locations. In order to explicitly learn a low-dimensional representation of the object geometry (i.e. the landmarks), [35] proposed to use a softargmax layer [39] which maps the label heatmaps to a vector of K keypoints. The objective in [35] is directly formulated over the produced landmarks, and accounts for the "equivariant error", as well as the diversity of the generated landmarks. Recently, [13] proposed a generative approach, which maps the output of the softargmax layer to a new Gaussian-like set of heatmaps. This combination of softargmax and heatmap generation is referred to as a tight bottleneck. The new heatmaps are used to reconstruct the input image from a deformed version of it. 
The bottleneck forces the landmark detector to discover a set of meaningful and stable points that can be used by the decoder to reconstruct the input image. Concurrently with [13], Zhang et al. [43] proposed a generative method that uses an autoencoder to learn the landmark locations. Although their method is based on an encoder-decoder framework, it also relies on the explicit use of both equivariance and separation constraints. The method of [33] learns 3D landmark detectors from images related by known 3D transformations, again by applying equivariance. Finally, it is worth mentioning other recent works on generative models that, rather than directly predicting a set of landmarks, focus on disentangling shape and appearance in an autoencoder framework [31, 30, 19]. In [31, 30], the learning is formulated as an autoencoder that generates a dense warp map and the appearance in a reference frame. While the warps can be mapped to specific keypoints, their discovery is not the target goal. We note that our method departs from the aforementioned works by approaching the problem from the unexplored direction of unsupervised domain adaptation.\nIncremental learning combines ideas from multi-task learning, domain adaptation, and transfer learning, where the goal is to learn a set of unrelated tasks in a sequential manner via knowledge sharing [28, 25, 26]. The simplest approach to this learning paradigm is fine-tuning [10], which often results in what is known as "catastrophic forgetting" [7], whereby the network "forgets" a previously learned task when learning a new one. To overcome this limitation, some works have proposed to add knowledge incrementally. An example is progressive networks [29], which add a submodule, coined an adapter, for each new task to be learned. 
In [25], the adapters are conceived as 1×1 filters that are applied sequentially to the output features of task-agnostic layers, while in [26], the adapters are applied in a parallel way. An alternative (and equivalent) approach is to directly apply a projection over the different dimensions of the weight tensor. This approach was proposed in [28], where the learnable set of weights reduces to square matrices, which are projected onto the core set of weights before applying the task-specific filters. In this work, we borrow ideas from incremental learning to train a core network in a supervised manner and then adapt it to a completely new object category in a fully unsupervised way. Contrary to all the aforementioned methods, our aim in this work is not to avoid "catastrophic forgetting" in supervised learning, but to show that such an approach seems to be very beneficial for solving the optimization problems considered in unsupervised learning.\n\n3 Method\n\nOur aim is to train a network for landmark detection on data from domain Y, representing an arbitrary object category, without any landmark annotations. We coin this network the target domain network. To do so, our method firstly trains a network for landmark detection on data from domain X, representing some other object category, in a supervised manner. We call this network the core network. The core network is then adapted to give rise to the target domain network in a fully unsupervised manner through incremental learning. Note that the core and target networks have the same architecture; in particular, they are both instances of the hourglass network [22], which is the method of choice for supervised landmark localization [22, 3].\nLearning the core network: Let us denote by Ψ_θX the core network, where θX denotes the set of weights of Ψ. 
In a convenient abuse of notation, we will denote by X = {x ∈ R^(C×W×H)} the set of training images belonging to a specific object category X (human poses in our case). For each x ∈ X there is a corresponding set of landmark annotations a ∈ R^(K×2), capturing the structured representation of the depicted object. The network Ψ_θX is trained to predict the target set of keypoints in unseen images through heatmap regression. In particular, each landmark is represented by a heatmap H_k ∈ R^(W_h×H_h), k = 1, ..., K, produced by placing a Gaussian function at the corresponding landmark location a_k = (u_k, v_k), i.e. H_k(u, v; x) = exp(−σ^(−2) ||(u, v) − (u_k, v_k)||^2). The network parameters θX are learned by minimizing the mean squared error between the heatmaps produced by the network and the ground-truth heatmaps, i.e. the learning is formulated as:\n\nθX = arg min_θ Σ_{x∈X} ||H(x) − Ψ_θ(x)||^2.\n\nFor a new image x, the landmarks' locations are estimated by applying the arg max operator to the produced heatmaps, that is â = arg max Ψ_θX(x).\nLearning the target domain network: Let us denote by Ψ_θY the target domain network, where θY denotes the set of weights of Ψ. Because there are no annotations available for the domain Y, one could use any of the frameworks of [35, 34, 43, 13] to learn θY in an unsupervised manner from scratch. Instead, we propose to firstly re-parametrize the weights of each convolutional layer θ_{Y,L} as:\n\nθ_{Y,L} = φ(W_L, θ_{X,L}),    (1)\n\nwhere W_L is a projection matrix. The weights θ_{X,L} are kept frozen, i.e. they are not updated via back-propagation when training with data from Y. 
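As a concrete sketch of the heatmap-regression objective above, the Gaussian targets and the mean-squared-error loss can be written in a few lines (a framework-agnostic NumPy illustration; in the paper the detector is an hourglass network trained in PyTorch, and all helper names here are our own):

```python
import numpy as np

def gaussian_heatmaps(landmarks, h, w, sigma=1.0):
    """Render one Gaussian heatmap per landmark -> array of shape (K, h, w)."""
    us, vs = np.meshgrid(np.arange(w), np.arange(h))  # (h, w) coordinate grids
    maps = []
    for (u, v) in landmarks:
        d2 = (us - u) ** 2 + (vs - v) ** 2            # squared distance to (u_k, v_k)
        maps.append(np.exp(-d2 / sigma ** 2))          # exp(-sigma^-2 * ||.||^2)
    return np.stack(maps)

def heatmap_mse(pred, target):
    """Mean squared error between predicted and ground-truth heatmaps."""
    return float(np.mean((pred - target) ** 2))

K, H, W = 3, 32, 32
ann = np.array([[8.0, 8.0], [16.0, 20.0], [24.0, 10.0]])  # (u_k, v_k) annotations
target = gaussian_heatmaps(ann, H, W)

# Each map peaks exactly at its landmark, and a perfect prediction has zero loss.
peak = tuple(int(i) for i in np.unravel_index(np.argmax(target[0]), (H, W)))
assert peak == (8, 8)                    # (row v, col u) of the arg max
assert heatmap_mse(target, target) == 0.0
```

In the paper's implementation the heatmap resolution is 32 × 32 and σ is set to √0.5 (see Sec. 4.1).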
For simplicity, we choose φ to be the linear function.\nSpecifically, for the case of convolutional layers, the weights θ_{X,L} are tensors ∈ R^(Co×Ci×k×k), where k represents the filter size (e.g. k = 3), Co is the number of output channels, and Ci the number of input channels. We propose to learn a set of weights W_L ∈ R^(Co×Co) that are used to map the weights θ_{X,L} to a new set of parameters θ_{Y,L} = W_L ×_1 θ_{X,L}, where ×_n refers to the n-mode product of tensors. This new set of weights θ_{Y,L} has the same dimensionality as θ_{X,L}, and can therefore be used directly within the same hourglass architecture. That is to say, leaving θ_{X,L} fixed, we learn, for each convolutional layer, a projection matrix W_L over the output channels that maps θ_{X,L} into the set of weights for the target object category Y.\nRather than directly learning θY as in [13, 43], we propose to learn the projection matrices W in a fully unsupervised way, through solving the auxiliary task of conditional image generation. In particular, we would like to learn a generator network Υ that takes as input a deformed version y′ of an image y ∈ Y, as well as the landmarks produced by Ψ_θY, and tries to reconstruct the image y. Specifically, we want our landmark detector Ψ_θY and generator Υ to minimize a reconstruction loss:\n\nmin_{θΥ, W} L(y, Υ(y′, Ψ_θY(y))).\n\nAs pointed out in [13], the above formulation does not ensure that the output of Ψ_θY will have the form of heatmaps from which meaningful landmarks can be obtained through the arg max operator. To alleviate this, the output of Ψ_θY is firstly converted into K × 2 landmarks, from which a new set of heatmaps is derived using the above-mentioned Gaussian-like formulation. 
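The 1-mode product re-parametrization can be sketched as follows (our own NumPy illustration; `einsum` plays the role of the ×₁ product, and the identity initialization mirrors the initialization of the projection layers described later in this section):

```python
import numpy as np

co, ci, k = 8, 8, 3

# Frozen core weights theta_X for one conv layer: (Co, Ci, k, k).
theta_x = np.random.default_rng(0).normal(size=(co, ci, k, k))

# Learnable projection W_L (Co x Co), initialized to the identity so that
# training starts from the core network's behaviour.
w = np.eye(co)

# 1-mode product: mix the core filters along the output-channel axis.
# theta_Y[p, i, :, :] = sum_o W[p, o] * theta_X[o, i, :, :]
theta_y = np.einsum('po,oikl->pikl', w, theta_x)

assert theta_y.shape == theta_x.shape   # same dims -> reusable in the same architecture
assert np.allclose(theta_y, theta_x)    # identity init reproduces the core weights

# Per-layer learnable parameters drop from Co*Ci*k^2 to Co^2:
# for Ci = Co and k = 3, a factor-of-9 reduction.
print(co * ci * k * k, co * co)  # -> 576 64
```

Only `w` would receive gradients during adaptation; `theta_x` stays frozen, which is exactly what shrinks the optimization problem.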
The Gaussian heatmaps are then used by the generator Υ to perform the image-to-image translation task. To overcome the non-differentiability of the arg max operator, the landmarks are obtained through a softargmax operator [39]. This way, the reconstruction loss can be differentiated through both Υ and Ψ_θY.\nBesides facilitating the learning process, our approach offers significant memory savings. For each convolutional layer, the number of parameters to be learned reduces from Co × Ci × k^2 to Co^2. For a set-up of Ci = Co channels, with kernel size k = 3, the total number of parameters to train reduces by a factor of 9. The hourglass used in this paper has roughly 6M parameters. When using the incremental learning approach described herein, the number of learnable parameters reduces to roughly 0.5M.\nProposed vs. fine-tuning: An alternative to our method consists of directly fine-tuning the pre-trained network on the target domain. While fine-tuning improves upon training the network from scratch, we observed that this option is still significantly more prone to producing unstable landmarks than our method. We attribute the improvement obtained by our method to the fact that the core network is not updated during training on domain Y, and hence the optimization problem is much more constrained, with fewer degrees of freedom and significantly fewer parameters to learn (about 10% compared to fine-tuning). While fine-tuning has proven very effective for the case of supervised learning, the aforementioned properties of our method make it more appropriate for the case of learning without labels.\nTraining: The training of the network is done using the reconstruction loss defined above. This loss can be differentiated w.r.t. both the parameters of the image encoder-decoder and the projection matrices W. 
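The differentiable bottleneck hinges on the softargmax operator; a minimal sketch (our own NumPy illustration of the operator from [39], with β as the temperature; the actual implementation in [39] may differ in details):

```python
import numpy as np

def softargmax(heatmap, beta=10.0):
    """Differentiable expected (u, v) under a softmax over heatmap scores."""
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))   # numerically stable softmax weights
    p /= p.sum()
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    return np.array([np.sum(p * us), np.sum(p * vs)])  # expected coordinates

# A sharp unimodal heatmap: softargmax should land (almost) on its arg max.
h, w = 32, 32
us, vs = np.meshgrid(np.arange(w), np.arange(h))
hm = np.exp(-((us - 12.0) ** 2 + (vs - 20.0) ** 2) / 2.0)

u, v = softargmax(hm, beta=25.0)
assert abs(u - 12.0) < 1e-2 and abs(v - 20.0) < 1e-2
```

Because the output is an expectation rather than an index, gradients flow from the regenerated Gaussian heatmaps back into the detector, which is what makes the reconstruction loss end-to-end trainable.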
Similarly to [13], we use a reconstruction loss based on a pixel loss and a perceptual loss [14]. The perceptual loss enforces the features of the generated images to be similar to those of the real images when forwarded through a VGG-19 [32] network. It is computed as the l1-norm of the difference between the features Φ^l_VGG computed at layers l = {relu1_2, relu2_2, relu3_3, relu4_3} from the input and generated images. Our total loss is defined as the sum of the pixel reconstruction loss and the perceptual loss:\n\nL(y, y′) = ||y − Υ(y′; Ψ_θY(y))||^2 + Σ_l ||Φ^l_VGG(y) − Φ^l_VGG(Υ(y′; Ψ_θY(y)))||_1.\n\nThe batch-norm layers are initialized from the learned parameters θX and are fine-tuned while learning the second network on Y. The projection layers are initialized with the identity matrix. Finally, in order to allow the number of points to be (possibly) different for X and Y, the very last layer of the network, which maps convolutional features to heatmaps, is made domain specific and trained from scratch.\n\n4 Experiments\n\nThis section describes the experimental set-up carried out to validate the proposed approach (Sec. 4.1), as well as the obtained results (Sec. 4.2).\n\n4.1 Implementation details\n\nLandmark detector: It is based on the hourglass architecture proposed in [22]. It receives an RGB image of size 128 × 128, and applies a set of spatial downsampling and residual blocks [8] to produce a set of K heatmaps. Besides the convolutional blocks, the network comprises batch-norm [11] and ReLU layers. The output spatial resolution is 32 × 32, which is converted into a K × 2 matrix of coordinates with a softargmax layer (β = 10). The coordinates are mapped back to heatmaps using σ = √0.5. In all of our experiments, K is set to 10 points.\nImage encoder-decoder: The generator is adapted from the architecture used for numerous tasks such as neural style transfer [14], image-to-image translation [12, 46], and face synthesis [24, 20, 9]. It receives an input image y′ of size 128 × 128, and firstly applies two spatial downsampling convolutions, bringing the number of features up to 256. The heatmaps produced by Ψ_θY(y) are then concatenated to the downsampled tensor, and passed through a set of 6 residual blocks. Finally, two spatial upsampling blocks bring the spatial resolution back to the image size.\nCore network pre-training: For our method, the landmark detector is firstly pre-trained on the task of human pose estimation. In particular, the network is trained to detect K = 16 keypoints, corresponding to the human body joints, on the MPII training set [2]. The network is trained for 110 epochs, yielding a validation performance of PCKh = 79%. To study the impact of the quality of the core network on performance (see Sec. 4.2, Additional experiments), we also tested different checkpoints, corresponding to the weights obtained after the 1st, 5th, and 10th epoch of the training process. These models yielded a validation PCKh of 22.95%, 55.03%, and 57.67%, respectively.\nTraining: We generate the pairs (y, y′) by applying random similarity transformations (scaling, rotation, translation) to the input image. We used the Adam optimizer [16], with (β1, β2) = (0, 0.9), and a batch size of 48 samples. The model is trained for 80 epochs, each consisting of 2,500 iterations, with a learning rate decay of 0.1 every 30 epochs. All networks are implemented in PyTorch [23].\nDatabases: For training the object landmark detectors in an unsupervised way, we used the CelebA [18], the UT-Zappos50k [41, 40], and the Cats Head [42] datasets. 
For CelebA, we excluded the subset of 1,000 images corresponding to the MAFL dataset [44], and used the remaining ~200k images for training. For UT-Zappos50k, we used 49.5k and 500 images to train and test, respectively [35, 43]. Finally, for the Cats Head dataset, we used four subfolders to train the network (~6,250 images), and three to test it (3,750 images). To perform a quantitative evaluation of our proposed approach, we used the MAFL [44], AFLW [21], and LS3D [3] datasets. For MAFL, we used the official train/test partitions. For AFLW, we used the same partitions as in [13]. For LS3D, we used the partitions defined in [3]. It has to be noted that the LS3D dataset is annotated with 3D points. Similarly to [13], we extracted loose (random) crops around the target objects using the provided annotations. We did not use the provided landmarks for training our models.\nModels: We trained three different models: 1) a network trained directly on each database from scratch, 2) a fine-tuned network trained by fine-tuning the weights of the pre-trained network, and 3) our proposed unsupervised domain adaptation approach. Note that the network trained from scratch is our in-house implementation of [13], while the fine-tuned network is also described for the first time in this work.\n\nTable 1: Comparison with state-of-the-art on MAFL and AFLW. For the sake of clarity, we only compare against methods reporting results for K = 10 landmarks. †: K = 10, uses the VGG-16 for perceptual loss. 
††: K = 10, uses a pre-trained network for perceptual loss.\n\n| Method | MAFL | AFLW |\n| --- | --- | --- |\n| Unsupervised | | |\n| Thewlis [35] (K = 30) | 7.15 | - |\n| Jakab [13]† | 3.32 | 6.99 |\n| Jakab [13]†† | 3.19 | 6.86 |\n| Zhang [43] (K = 10) | 3.46 | 7.01 |\n| Shu [31] | 5.45 | - |\n| Sahasrabudhe [30] | 6.01 | - |\n| Supervised | | |\n| TCDCN [45] | 7.95 | 7.65 |\n| MTCNN [44] | 5.39 | 6.90 |\n| Ours | | |\n| Baseline | 5.00 | 7.65 |\n| Finetune | 3.91 | 6.79 |\n| Proposed | 3.99 | 6.69 |\n\n| | nim | MAFL Scr. | MAFL F.T. | MAFL Prop. | AFLW Scr. | AFLW F.T. | AFLW Prop. | LS3D Scr. | LS3D F.T. | LS3D Prop. |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| Forward | 1 | 26.76 | 16.76 | 18.70 | 17.88 | 15.40 | 16.08 | 94.02 | 75.76 | 78.62 |\n| Forward | 5 | 18.32 | 9.71 | 8.77 | 16.88 | 13.38 | 12.33 | 70.48 | 45.11 | 43.57 |\n| Forward | 10 | 12.12 | 7.45 | 7.13 | 14.62 | 11.59 | 11.09 | 61.31 | 39.26 | 39.37 |\n| Forward | 100 | 5.75 | 4.62 | 4.53 | 9.02 | 8.24 | 7.64 | 40.03 | 28.24 | 29.32 |\n| Forward | 500 | 5.28 | 4.12 | 4.13 | 8.09 | 7.19 | 7.20 | 34.35 | 25.55 | 27.18 |\n| Forward | 1000 | 5.18 | 4.02 | 4.16 | 7.90 | 7.04 | 6.91 | 33.76 | 25.25 | 26.95 |\n| Forward | 5000 | 5.04 | 3.98 | 4.05 | 7.67 | 6.81 | 6.73 | 33.25 | 24.75 | 26.50 |\n| Forward | All | 5.00 | 3.91 | 3.99 | 7.65 | 6.79 | 6.69 | 33.15 | 24.79 | 26.41 |\n| Backward | 1 | 30.64 | 12.50 | 12.30 | 37.23 | 18.92 | 17.47 | 26.11 | 13.96 | 12.31 |\n| Backward | 5 | 26.39 | 8.58 | 7.22 | 35.36 | 16.46 | 14.55 | 25.47 | 10.24 | 8.72 |\n| Backward | 10 | 22.99 | 7.41 | 6.01 | 32.49 | 15.09 | 12.24 | 20.43 | 8.92 | 7.83 |\n| Backward | 100 | 18.86 | 5.23 | 4.23 | 26.36 | 11.72 | 9.69 | 15.25 | 6.32 | 6.08 |\n| Backward | 500 | 18.05 | 4.70 | 3.82 | 25.80 | 11.30 | 9.29 | 14.81 | 5.96 | 5.66 |\n| Backward | 1000 | 17.82 | 4.60 | 3.74 | 25.60 | 11.23 | 9.25 | 14.59 | 5.91 | 5.55 |\n| Backward | 5000 | 17.68 | 4.47 | 3.59 | 25.50 | 11.14 | 9.19 | 14.51 | 5.85 | 5.47 |\n| Backward | All | 17.57 | 4.43 | 3.55 | 25.50 | 11.14 | 9.19 | 14.45 | 5.81 | 5.44 |\n\nTable 2: Errors on MAFL, AFLW, and LS3D datasets for the forward (top) and backward (bottom) cases. Scr., F.T., and Prop. stand for Scratch, Fine-tuned, and Proposed, respectively. 
Despite the good performance of all methods in the forward case, the backward errors clearly show that our method produces the most stable landmarks.\n\n4.2 Evaluation\n\nExisting approaches [13, 43, 35] are assessed quantitatively by measuring the error they produce on annotated datasets. To this end, a linear regressor is learned from the discovered landmarks to a set of manually annotated points on some training set annotated in the same way as the evaluation dataset. Unfortunately, such a metric does not help measure the stability of the discovered landmarks. In practice, we found that not all discovered landmarks are stable, despite loosely contributing to reducing the error when used to train the regressor.\nIn order to dig deeper into the stability of our proposed approach, in addition to the aforementioned metric (herein referred to as forward), we also measure the error produced by a regressor trained in the reverse direction, i.e. from the set of annotated landmarks to the discovered ones. We will refer to this case as backward. A method that yields good results in the forward case but poor results in the backward case will most likely detect a low number of stable landmarks. Similarly, if a method yields low error in the backward case but high error in the forward case, it will likely have converged to a fixed set of points, independent of the input image. Moreover, we further quantify the stability of landmarks through geometric consistency, by measuring the point-to-point distance between a rotated version of the landmarks detected on a given image and the landmarks detected on the rotated version of that image.\nForward (Unsupervised → Supervised): Following [13, 43, 35], we learn a linear regressor from the discovered landmarks to the 5 manually annotated keypoints in the training partitions of MAFL [44] and AFLW [21]. 
We report the Mean Square Error (MSE), normalized by the inter-ocular distance. Contrary to previous works, we did not re-train our network on AFLW before evaluating it on that dataset. We compare our results against those produced by state-of-the-art methods in Table 1. We can observe that our in-house implementation of [13], although not matching the performance of the original implementation, is competitive, confirming the strength of our baselines and implementations. The bulk of our results for the forward case are shown in Table 2 (top). Following recent works, we report the results by varying the number of images (nim) used to train the regressor. For both MAFL and AFLW, we can see that our method surpasses the trained-from-scratch and fine-tuned networks in all configurations. For LS3D, all methods produce large errors, illustrating that there is still a gap to fill in order to make the unsupervised learning of landmark detectors robust to 3D rotations.\n\nFigure 2: Qualitative evaluation of landmark consistency. Each image is transformed using a random similarity transformation (the same for each method). Our method consistently produces the most stable points. See paragraph Landmark consistency for a detailed discussion.\n\nBackward (Supervised → Unsupervised): For the backward experiment, a regressor is learned from the manually annotated keypoints to the landmarks produced by each method. For MAFL and LS3D, this regressor maps the 68 annotated points to the 10 discovered points, while for AFLW the correspondence is from 5 to 10. Table 2 (bottom) reports the MSE normalized by the inter-ocular distance. 
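The forward evaluation protocol can be sketched as follows (our own illustration: a least-squares linear regressor mapping one landmark set to the other, with errors normalized by the inter-ocular distance; all names, shapes, and the synthetic data are assumptions):

```python
import numpy as np

def fit_linear_regressor(src, dst):
    """Least-squares map from src landmarks (N, Ks*2) to dst (N, Kd*2), with bias."""
    a = np.hstack([src, np.ones((len(src), 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(a, dst, rcond=None)
    return w

def normalized_error(pred, gt, iod):
    """Mean point-to-point error, normalized by the per-image inter-ocular distance."""
    k = gt.shape[1] // 2
    d = np.linalg.norm(pred.reshape(-1, k, 2) - gt.reshape(-1, k, 2), axis=2)
    return float(np.mean(d / iod[:, None]))

rng = np.random.default_rng(0)
n, k_disc, k_ann = 200, 10, 5
disc = rng.normal(size=(n, k_disc * 2))              # "discovered" landmarks
true_map = rng.normal(size=(k_disc * 2 + 1, k_ann * 2))
ann = np.hstack([disc, np.ones((n, 1))]) @ true_map  # annotated points, here an exact
                                                     # linear function of the discovered ones
iod = np.full(n, 60.0)                               # inter-ocular distance per image

# Forward: discovered -> annotated. Perfectly predictable points give ~zero error.
w = fit_linear_regressor(disc, ann)
pred = np.hstack([disc, np.ones((n, 1))]) @ w
assert normalized_error(pred, ann, iod) < 1e-8
```

The backward protocol simply swaps the roles of `disc` and `ann`, which is why it exposes landmarks that the forward fit can quietly ignore.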
As our results show, our method significantly outperforms the fine-tuned network on all datasets and configurations. Especially for the MAFL and AFLW datasets, and because the errors for the forward case were also small, these results clearly show that our method produces much more stable points than the fine-tuned network. Note that the trained-from-scratch network produces very large backward errors, indicating that some of the detected points were completely unstable.\nLandmark consistency: The consistency of the discovered landmarks is quantified by measuring the error per point e_i = ||Ψ^i_θY(A(y)) − A(Ψ^i_θY(y))||, where A refers to a random similarity transformation. We report the error per point for each method and dataset in Table 3. Again, it is clear that our method produces by far the most stable landmarks for all datasets. Qualitative results for this experiment, illustrating the stability of the detected landmarks under geometric transformations for all methods, can be seen in Fig. 2. It can be seen that, for example for the case of faces, up to three landmarks are strongly inconsistent for the trained-from-scratch model (orange, green, and white). For the fine-tuned model, we find two very unstable landmarks (dark blue and white). For our approach, the most unstable landmark is the one depicted in black. However, as shown in Table 3, the error for this landmark is much lower than that of the most unstable landmarks of the fine-tuned and from-scratch networks. For the Cats dataset, our approach again produces by far the most stable landmarks.\nQualitative evaluation: We show some further examples of the points discovered by our method for all the object categories used in our experiments in Fig. 3. 
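The consistency metric above can be sketched as follows (our own illustration; A is a 2-D similarity transform, and a perfectly equivariant detector incurs zero error):

```python
import numpy as np

def similarity(points, scale, theta, t):
    """Apply a 2-D similarity transform A (scale, rotation theta, translation t) to (K, 2) points."""
    c, s = np.cos(theta), np.sin(theta)
    r = scale * np.array([[c, -s], [s, c]])
    return points @ r.T + t

def consistency_error(pts_on_transformed, pts_original, scale, theta, t):
    """Per-point error e_i = ||psi(A(y)) - A(psi(y))|| for one image."""
    return np.linalg.norm(
        pts_on_transformed - similarity(pts_original, scale, theta, t), axis=1)

pts = np.array([[10.0, 20.0], [40.0, 15.0], [25.0, 30.0]])   # psi(y): detections on y
scale, theta, t = 1.1, np.deg2rad(15), np.array([3.0, -2.0])

# A perfectly equivariant detector fires at A(pts) on the transformed image:
detected = similarity(pts, scale, theta, t)
e = consistency_error(detected, pts, scale, theta, t)
assert np.allclose(e, 0.0)

# A detector that drifts by 2 px on one point incurs exactly that per-point error:
detected[1] += np.array([0.0, 2.0])
e = consistency_error(detected, pts, scale, theta, t)
assert abs(e[1] - 2.0) < 1e-9 and e[0] < 1e-9
```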
While for the faces and cats datasets we can draw the same conclusions as above, for the shoes dataset we observe that all methods showed consistently good performance (our method is still the most accurate according to Table 3). We attribute this to the fact that the shoes are pre-segmented, which turns out to facilitate the training process. However, such a setting is unrealistic for most real-world datasets. More qualitative examples can be found for each dataset in the Supplementary Material.\nAdditional experiments: In addition to the aforementioned experiments, we present two extra studies to illustrate the effect of different training components and design choices on the performance of our proposed approach.\nQuality of core: Our proposed approach relies on having a strong core network to perform the adaptation. In order to validate this assumption, we repeated the unsupervised training, using as core the saved checkpoints of early epochs of our human pose estimator network (see Sec. 4.1). The forward, backward, and consistency errors for the AFLW database are shown in Tables 4 and 5 (top, # Epoch). The results support the need for a strong core network for a reliable adaptation.\nNumber of training images: The small number of parameters to be learned suggests that our unsupervised adaptation method could be robust when training with limited data. To validate this, we chose a random subset of 10, 100, and 1000 images to train the target network. The results of each model are shown in Tables 4 and 5 (bottom, # Images). When training with 10 images, we observe that the network is prone to collapsing to a single point. However, the results for 1000 images show evidence that our method can be quite effective in the case of limited training data. 
Combining our approach with few-shot learning is left as interesting future work.

                      1      2      3      4      5      6      7      8      9     10    Avg.
MAFL   Scratch      1.08   1.20   1.34   1.36   1.38   1.76   3.98  16.51  27.44  35.03   9.11
       Finetune     1.11   1.36   1.39   1.39   1.68   1.83   2.79   3.58   5.59   7.51   2.82
       Proposed     0.96   1.09   1.19   1.34   1.45   1.58   1.80   1.92   3.65   4.09   1.91
AFLW   Scratch      1.45   1.78   1.83   1.85   1.95   2.54   8.46  21.62  31.30  39.37  11.20
       Finetune     1.86   1.93   1.95   2.16   2.18   2.53   5.34   7.30   8.30   9.66   4.32
       Proposed     1.46   1.47   1.47   1.54   1.65   1.66   1.92   2.07   4.97   6.99   2.52
LS3D   Scratch      3.40   4.11   4.48   4.54   5.18   5.71   6.70  19.72  32.04  38.36  12.42
       Finetune     2.93   3.19   3.26   3.59   3.71   4.38   5.14   5.56   7.29   9.49   4.85
       Proposed     2.36   2.48   3.01   3.02   3.55   3.59   3.71   4.83   6.97   7.08   4.06
Shoes  Scratch      1.57   1.65   2.19   2.56   2.79   2.92   3.03   3.05   3.28   4.92   2.80
       Finetune     1.22   1.35   1.42   1.47   1.82   2.03   2.38   2.51   4.21   4.30   2.27
       Proposed     1.07   1.48   1.74   1.80   1.94   2.28   2.30   2.41   2.91   3.49   2.14
Cat    Scratch      1.27   1.44   1.61   1.82   2.30   3.37   3.46   4.44  27.13  28.11   7.50
       Finetune     1.27   1.48   1.81   1.82   1.82   1.84   1.89   5.48   5.93   7.14   3.05
       Proposed     1.00   1.01   1.25   1.60   1.65   1.79   3.57   3.60   3.64   5.29   2.44

Table 3: Consistency errors on MAFL, AFLW, LS3D, UT-Zappos50k and Cats Head datasets.

Figure 3: Qualitative results on AFLW, Shoes, and Cats datasets. Our method produces the most stable landmarks, which is visually more evident for the faces and cats datasets.
See text for detailed discussion.

                  1      2      3      4      5      6      7      8      9     10    Avg.
# Epoch
  1             1.53   1.77   2.91   5.25   8.66  10.19  11.63  26.69  37.14  45.12  15.09
  5             1.60   1.64   1.69   1.70   1.83   2.11   6.97  39.36  41.06  45.28  14.32
  10            1.43   1.57   1.64   1.76   1.81   1.83   1.99   2.42   7.98  30.58   5.30
  110           1.46   1.47   1.47   1.54   1.65   1.66   1.92   2.07   4.97   6.99   2.52
# Images
  10            1.47   2.02   2.02   2.15   2.26   2.54   2.73   3.17   6.32   6.40   3.11
  100           1.93   1.97   2.05   2.21   2.63   2.65   3.45   3.69   6.77   7.02   3.44
  1000          1.55   1.56   2.00   2.22   2.36   2.52   3.44   4.54   9.06  10.82   4.01
  All           1.46   1.47   1.47   1.54   1.65   1.66   1.92   2.07   4.97   6.99   2.52

Table 4: Consistency errors of our method produced by varying the quality of the core network (top), and the number of images used for training (bottom).

Further investigating finetuning: While our experiments clearly show that our approach significantly improves upon finetuning, the results of the latter indicate that it also constitutes an effective approach to unsupervised adaptation. To further explore this direction, we also studied (a) finetuning only the last layers of the core network, leaving the rest of the network frozen, and (b) finetuning after having learned the projection matrix. In both cases, we found that the results were slightly worse than the finetuning results reported in Table 2.
Generalization: The experiments throughout this paper used as core a powerful network trained on MPII [2], which has learned rich features that can serve as a basis for adaptation. In addition, we also investigated whether our approach works in the inverse direction, e.g. by training a core network for facial landmark localization and adapting it to detect body landmarks. In Fig.
4, we show some qualitative results of a network trained to discover 10 points on the BBC-Pose dataset [4], using as core a network trained to detect 68 landmarks on the 300W-LP dataset [47]. A detailed evaluation of each method for this challenging setting can be found in the Supplementary Material.

5 Conclusions

In this paper, we showed that the unsupervised discovery of object landmarks can benefit significantly from inheriting the knowledge acquired by a network trained in a fully supervised way for a different object category. Using an incremental domain adaptation approach, we showed how to transfer the knowledge of a network trained in a supervised way for the task of human pose estimation in order to learn to discover landmarks on faces, shoes, and cats in an unsupervised way. When trained under the same conditions, our experiments showed a consistent improvement of our method over training the model from scratch, as well as over fine-tuning the model on the target object category.

     # Epoch     1      5     10    110
Fwd  # im 10   16.27  16.45  12.60  11.09
     # im 100  11.53   9.43   8.40   7.64
     # im All   9.68   7.76   6.93   6.69
Bwd  # im 10   43.75  40.75  17.35  10.05
     # im 100  34.69  33.96  14.44   9.69
     # im All  33.72  32.86  13.58   9.19

     # Images   10    100    1000   All
Fwd  # im 10   16.55  11.31  11.25  11.09
     # im 100  13.21   7.64   8.34   7.64
     # im All  12.01   7.31   6.86   6.69
Bwd  # im 10   16.55  14.58  15.32  10.05
     # im 100   8.22  11.54  12.16   9.69
     # im All   7.25  10.77  11.30   9.19

Table 5: Forward and backward errors of our method produced by varying the quality of the core network (top), and the number of images used for training (bottom).

Figure 4: Examples of 10 discovered landmarks on BBC-Pose (test).

References

[1] Mykhaylo Andriluka, Umar Iqbal, Anton Milan, Eldar Insafutdinov, Leonid Pishchulin, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking.
In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[3] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, 2017.

[4] J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman. Domain adaptation for upper body pose tracking in signed TV broadcasts. In British Machine Vision Conference, 2013.

[5] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, 2016.

[6] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, 2017.

[7] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[9] Yibo Hu, Xiang Wu, Bing Yu, Ran He, and Zhenan Sun. Pose-guided photorealistic face rotation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[10] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros.
Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[13] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems, 2018.

[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.

[15] Angjoo Kanazawa, David W Jacobs, and Manmohan Chandraker. WarpNet: Weakly supervised matching for single-view reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[17] Karel Lenc and Andrea Vedaldi. Learning covariant feature detectors. In European Conference on Computer Vision Workshop, 2016.

[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.

[19] Dominik Lorenz, Leonard Bereska, Timo Milbich, and Björn Ommer. Unsupervised part-based disentangling of object shape and appearance. arXiv preprint arXiv:1903.06946, 2019.

[20] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, 2017.

[21] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation.
In European Conference on Computer Vision, 2016.

[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[24] Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision, 2018.

[25] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.

[26] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[27] Ignacio Rocco, Relja Arandjelovic, and Josef Sivic. Convolutional neural network architecture for geometric matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[28] Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[29] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[30] Mihir Sahasrabudhe, Zhixin Shu, Edward Bartrum, Riza Alp Guler, Dimitris Samaras, and Iasonas Kokkinos. Lifting autoencoders: Unsupervised learning of a fully-disentangled 3d morphable model using deep non-rigid structure from motion. arXiv preprint arXiv:1904.11960, 2019.

[31] Zhixin Shu, Mihir Sahasrabudhe, Riza Alp Guler, Dimitris Samaras, Nikos Paragios, and Iasonas Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance.
In European Conference on Computer Vision, 2018.

[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[33] Supasorn Suwajanakorn, Noah Snavely, Jonathan J Tompson, and Mohammad Norouzi. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems, 2018.

[34] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In Advances in Neural Information Processing Systems, 2017.

[35] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In International Conference on Computer Vision, 2017.

[36] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. arXiv preprint arXiv:1903.07593, 2019.

[37] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.

[38] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.

[39] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned invariant feature transform. In European Conference on Computer Vision, 2016.

[40] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[41] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In International Conference on Computer Vision, 2017.

[42] Weiwei Zhang, Jian Sun, and Xiaoou Tang.
Cat head detection - how to effectively exploit shape and texture features. In European Conference on Computer Vision, 2008.

[43] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised discovery of object landmarks as structural representations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[44] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, 2014.

[45] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):918–930, 2016.

[46] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2017.

[47] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across large poses: A 3d solution. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.