{"title": "Recurrent Transformer Networks for Semantic Correspondence", "book": "Advances in Neural Information Processing Systems", "page_first": 6126, "page_last": 6136, "abstract": "We present recurrent transformer networks (RTNs) for obtaining dense correspondences between semantically similar images. Our networks accomplish this through an iterative process of estimating spatial transformations between the input images and using these transformations to generate aligned convolutional activations. By directly estimating the transformations between an image pair, rather than employing spatial transformer networks to independently normalize each individual image, we show that greater accuracy can be achieved. This process is conducted in a recursive manner to refine both the transformation estimates and the feature representations. In addition, a technique is presented for weakly-supervised training of RTNs that is based on a proposed classification loss. With RTNs, state-of-the-art performance is attained on several benchmarks for semantic correspondence.", "full_text": "Recurrent Transformer Networks for Semantic\n\nCorrespondence\n\nSeungryong Kim1, Stephen Lin2, Sangryul Jeon1, Dongbo Min3, and Kwanghoon Sohn1,\u2217\n\n1Yonsei University, Seoul, South Korea, 2Microsoft Research, Beijing, China,\n\ndbmin@ewha.ac.kr\n\u2217Corresponding author\n\n3Ewha Womans University, Seoul, South Korea\n\n{srkim89,cheonjsr,khsohn}@yonsei.ac.kr, stevelin@microsoft.com,\n\nAbstract\n\nWe present recurrent transformer networks (RTNs) for obtaining dense correspon-\ndences between semantically similar images. Our networks accomplish this through\nan iterative process of estimating spatial transformations between the input images\nand using these transformations to generate aligned convolutional activations. 
By\ndirectly estimating the transformations between an image pair, rather than employ-\ning spatial transformer networks to independently normalize each individual image,\nwe show that greater accuracy can be achieved. This process is conducted in a\nrecursive manner to re\ufb01ne both the transformation estimates and the feature repre-\nsentations. In addition, a technique is presented for weakly-supervised training of\nRTNs that is based on a proposed classi\ufb01cation loss. With RTNs, state-of-the-art\nperformance is attained on several benchmarks for semantic correspondence.\n\n1\n\nIntroduction\n\nEstablishing dense correspondences across semantically similar images can facilitate a variety of\ncomputer vision applications including non-parametric scene parsing, semantic segmentation, object\ndetection, and image editing [25, 22, 19]. In this semantic correspondence task, the images resemble\neach other in content but differ in object appearance and con\ufb01guration, as exempli\ufb01ed in the images\nwith different car models in Fig. 1(a-b). Unlike the dense correspondence computed for estimating\ndepth [34] or optical \ufb02ow [4], semantic correspondence poses additional challenges due to intra-class\nappearance and shape variations among different instances from the same object or scene category.\nTo address these challenges, state-of-the-art methods generally extract deep convolutional neural\nnetwork (CNN) based descriptors [5, 44, 18], which provide some robustness to appearance variations,\nand then perform a regularization step to estimate spatially regularized geometric \ufb01elds. The most\nrecent techniques handle geometric deformations in addition to appearance variations within deep\nCNNs. 
These methods can generally be classi\ufb01ed into two categories, namely methods for geometric\ninvariance in the feature extraction step, e.g., spatial transformer networks (STNs) [15, 5, 19], and\nmethods for geometric invariance in the regularization step, e.g., geometric matching networks [30,\n31]. The STN-based methods infer geometric deformation \ufb01elds within a deep network and transform\nthe convolutional activations to provide geometric-invariant features [5, 41, 19]. While this approach\nhas shown geometric invariance to some extent, we conjecture that directly estimating the geometric\ndeformations between a pair of input images would be more robust and precise than learning to\ntransform each individual image to a geometric-invariant feature representation. This direct estimation\napproach is used by geometric matching-based techniques [30, 31], which recover a matching model\ndirectly through deep networks. Drawbacks of these methods include that globally-varying geometric\n\ufb01elds are inferred, and only \ufb01xed, untransformed versions of the features are used.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\n(e)\n\n(f)\n\nFigure 1: Visualization of results from RTNs: (a) source image; (b) target image; (c), (d) warped\nsource and target images using dense correspondences from RTNs; (e), (f) pseudo ground-truth\ntransformations in [36]. Our RTNs learn to infer transformations without ground-truth supervision.\n\nIn this paper, we present recurrent transformer networks (RTNs) for overcoming the aforementioned\nlimitations of current semantic correspondence techniques. As illustrated in Fig. 
2, the key idea\nof RTNs is to directly estimate the geometric transformation \ufb01elds between two input images, like\nwhat is done by geometric matching-based approaches [30, 31], but also apply the estimated \ufb01eld to\ntransform the convolutional activations of one of the images, similar to STN-based methods [15, 5, 19].\nWe additionally formulate the RTNs to recursively estimate the geometric transformations, which are\nused for iterative geometric alignment of feature activations. In this way, regularization is enhanced\nthrough recursive re\ufb01nement, while feature extraction is likewise iteratively re\ufb01ned according to the\ngeometric transformations as well as jointly learned with the regularization. Moreover, the networks\nare learned in a weakly-supervised manner via a proposed classi\ufb01cation loss de\ufb01ned between the\nsource image features and the geometrically-aligned target image features, such that the correct\ntransformation is identi\ufb01ed by the highest matching score while other transformations are considered\nas negative examples.\nThe presented approach is evaluated on several common benchmarks and examined in an ablation\nstudy. The experimental results show that this model outperforms the latest weakly-supervised and\neven supervised methods for semantic correspondence.\n\n2 Related Work\n\nSemantic Correspondence To elevate matching quality, most conventional methods for semantic\ncorrespondence focus on improving regularization techniques while employing handcrafted features\nsuch as SIFT [27]. Liu et al. [25] pioneered the idea of dense correspondence across different scenes,\nand proposed SIFT \ufb02ow. Inspired by this, methods have been presented based on deformable spatial\npyramids (DSP) [17], object-aware hierarchical graphs [39], exemplar LDA [3], joint image set\nalignment [45], and joint co-segmentation [36]. 
As all of these techniques use handcrafted descriptors\nand regularization methods, they lack robustness to geometric deformations.\nRecently, deep CNN-based methods have been used in semantic correspondence as their descriptors\nprovide some degree of invariance to appearance and shape variations. Among them are techniques\nthat utilize a 3-D CAD model for supervision [44], employ fully convolutional feature learning [5],\nlearn \ufb01lters with geometrically consistent responses across different object instances [28], learn\nnetworks using dense equivariant image labelling [37], exploit local self-similarity within a fully\nconvolutional network [18, 19], and estimate correspondences using object proposals [7, 8, 38].\nHowever, none of these methods is able to handle non-rigid geometric variations, and most of\nthem are formulated with handcrafted regularization. More recently, Han et al. [9] formulated the\nregularization into the CNN but do not deal explicitly with the signi\ufb01cant geometric variations\nencountered in semantic correspondence.\n\nSpatial Invariance Some methods aim to alleviate spatial variation problems in semantic corre-\nspondence through extensions of SIFT \ufb02ow, including scale-less SIFT \ufb02ow (SLS) [11], scale-space\nSIFT \ufb02ow (SSF) [29], and generalized DSP (GDSP) [13]. A generalized PatchMatch algorithm [1]\nwas proposed for ef\ufb01cient matching that leverages a randomized search scheme. It was utilized by\nHaCohen et al. [6] in a non-rigid dense correspondence (NRDC) algorithm. Spatial invariance to scale\nand rotation is provided by DAISY \ufb01lter \ufb02ow (DFF) [40]. While these aforementioned techniques\nprovide some degree of geometric invariance, none of them can deal with af\ufb01ne transformations over\nan image. Recently, Kim et al. 
[20, 21] proposed the discrete-continuous transformation matching (DCTM) framework, where dense affine transformation fields are inferred using a hand-designed energy function and regularization.\n\nFigure 2: Intuition of RTNs: (a) methods for geometric invariance in the feature extraction step, e.g., STN-based methods [5, 19], (b) methods for geometric invariance in the regularization step, e.g., geometric matching-based methods [30, 31], and (c) RTNs, which weave the advantages of both existing STN-based methods and geometric matching techniques, by recursively estimating geometric transformation residuals using geometry-aligned feature activations.\n\nTo deal with geometric variations within CNNs, STNs [15] offer a way to provide geometric invariance by warping features through a global transformation. Inspired by STNs, Lin et al. [23] proposed inverse compositional STNs (IC-STNs), which replace the feature warping with transformation parameter propagation. Kanazawa et al. [16] presented WarpNet, which predicts a warp for establishing correspondences. Rocco et al. [30, 31] proposed a CNN architecture for estimating a geometric matching model for semantic correspondence. However, they estimate only globally-varying geometric fields, thus leading to limited performance in dealing with locally-varying geometric deformations. To deal with locally-varying geometric variations, some methods such as UCN-spatial transformer (UCN-ST) [5] and convolutional affine transformer-FCSS (CAT-FCSS) [19] employ STNs [15] at the pixel level. Similarly, Yi et al. [41] proposed the learned invariant feature transform (LIFT) to learn sparse, locally-varying geometric fields, inspired by [42]. 
However, these methods determine geometric fields by accounting for the source and target images independently, rather than jointly, which limits their prediction ability.\n\n3 Background\n\nLet us denote semantically similar source and target images as I^s and I^t, respectively. The objective is to establish a correspondence field f_i = [u_i, v_i]^T between the two images that is defined for each pixel i = [i_x, i_y]^T in I^s. Formally, this involves first extracting handcrafted or deep features, denoted by D^s_i and D^t_i, from I^s and I^t within local receptive fields, and then estimating the correspondence field f_i of the source image by maximizing the feature similarity between D^s_i and D^t_{i+f_i} over a set of transformations using handcrafted or deep geometric regularization models. Several approaches [25, 18] assume the transformation to be a 2-D translation with negligible variation within local receptive fields. As a result, they often fail to handle complicated deformations caused by scale, rotation, or skew that may exist among object instances. For greater geometric invariance, recent approaches [20, 21] have modeled the deformations as an affine transformation field represented by a 2 \u00d7 3 matrix\n\nT_i = [A_i, f_i],    (1)\n\nwhich maps pixel i to i' = i + f_i. Specifically, they maximize the similarity between the source D^s_i and target D^t_{i'}(A_i), where D(A_i) represents the feature extracted from spatially-varying local receptive fields transformed by a 2 \u00d7 2 matrix A_i [5, 19]. For simplicity, we denote D^t(T_i) = D^t_{i+f_i}(A_i).\n\nApproaches for geometric invariance in semantic correspondence can generally be classified into two categories. The first group infers the geometric fields in the feature extraction step by minimizing a matching objective function [5, 19], as exemplified in Fig. 2(a). 
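To make the affine parameterization of Eq. (1) concrete, the following minimal NumPy sketch (function and variable names are illustrative, not from the paper) maps a source pixel i and a set of local receptive-field offsets into the target image under T_i = [A_i, f_i]: the center moves to i' = i + f_i, and each local offset is transformed by the 2 x 2 matrix A_i.

```python
import numpy as np

def warp_receptive_field(i, A, f, offsets):
    """Map local receptive-field offsets around source pixel i into the target
    image under T_i = [A_i, f_i] (Eq. (1)): the center moves to i' = i + f_i,
    and each local offset is transformed by the 2x2 matrix A_i."""
    center = np.asarray(i, dtype=float) + f       # i' = i + f_i
    return center + offsets @ np.asarray(A).T     # spatially transformed receptive field

# With A_i = I and a pure translation f_i, this reduces to the 2-D translation
# model of earlier approaches; a general A_i adds scale, rotation, or skew.
i = np.array([10.0, 20.0])
f = np.array([3.0, -1.0])
A = np.eye(2)
offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
coords = warp_receptive_field(i, A, f, offsets)   # center maps to [13., 19.]
```
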
Concretely, A_i is learned without a ground-truth A*_i by minimizing the difference between D^s_i and D^t_{i+f_i}(A_i) according to a ground-truth flow field f*_i. This enables explicit feature learning, which aims to minimize/maximize convolutional activation differences between matching/non-matching pixel pairs [5, 19]. However, ground-truth flow fields f*_i are still needed for learning the networks, and the geometric fields A_i are predicted based only on the source or target feature, without jointly considering the source and target, thus limiting performance.\n\nFigure 3: Network configuration of RTNs, consisting of a feature extraction network and a geometric matching network in a recurrent structure.\n\nThe second group estimates a geometric matching model directly through deep networks by considering the source and target features simultaneously [30, 31]. These methods formulate the geometric matching networks by mimicking conventional RANSAC-like methods [14] through feature extraction and geometric matching steps. As illustrated in Fig. 2(b), the geometric fields T_i are predicted in a feed-forward network from extracted source features D^s_i and target features D^t_i. By learning to extract source and target features and predict geometric fields in an end-to-end manner, more robust geometric fields can be estimated compared to existing STN-based methods that consider source or target features independently, as shown in [31]. A major limitation of these learning-based methods is the lack of ground-truth geometric fields T*_i between source and target images. 
To alleviate this problem, some methods use self-supervision, such as synthetic transformations [30], or weak supervision, such as soft-inlier maximization [31], but these approaches constrain only the global geometric field. Moreover, these methods utilize feature descriptors extracted from the original upright images, rather than from geometrically transformed images, which limits their capability to represent severe geometric variations.\n\n4 Recurrent Transformer Networks\n\n4.1 Motivation and Overview\n\nIn this section, we describe the formulation of recurrent transformer networks (RTNs). The objective of our networks is to learn and infer locally-varying affine deformation fields T_i in an end-to-end and weakly-supervised fashion, using only image pairs without ground-truth transformations T*_i. Toward this end, we present an effective and efficient integration of the two existing approaches for geometric invariance, i.e., STN-based feature extraction networks [5, 19] and geometric matching networks [30, 31], that includes a novel weakly-supervised loss function tailored to our objective. Specifically, the final geometric field is recursively estimated by deforming the activations of the feature extraction networks according to the intermediate output of the geometric matching networks, in contrast to existing geometric matching approaches, which consider only fixed, upright versions of the features [30, 31]. At the same time, our method outperforms STN-based approaches [5, 19] by using a deep CNN-based geometric matching network instead of handcrafted matching criteria. 
Our recurrent geometric matching approach intelligently weaves the advantages of both existing STN-based methods and geometric matching techniques, by recursively estimating geometric transformation residuals using geometry-aligned feature activations.\n\nConcretely, our networks are split into two parts, as shown in Fig. 3: a feature extraction network to extract source D^s_i and target D^t(T_i) features, and a geometric matching network to infer the geometric fields T_i. To learn these networks in a weakly-supervised manner, we formulate a novel classification loss, defined without ground-truth T*_i, based on the assumption that the transformation which maximizes the similarity of the source features D^s_i and transformed target features D^t(T_i) at a pixel i should be correct, while the matching scores of other transformation candidates should be minimized.\n\nFigure 4: Visualization of the search window N_i in RTNs (e.g., |N_i| = 5 \u00d7 5): source images with the search window at (a) stride 4, (c) stride 2, and (e) stride 1, and target images with (b), (d), (f) the transformed points for (a), (c), (e), respectively. As the iterations evolve, the dilation strides are reduced to capture more precise matching details.\n\n4.2 Feature Extraction Network\n\nTo extract convolutional features for the source D^s and target D^t, the input source and target images (I^s, I^t) are first passed through fully-convolutional feature extraction networks with shared parameters W_F such that D_i = F(I_i | W_F), and the feature for each pixel then undergoes L2 normalization. In the recurrent formulation, at each iteration the target features D^t can be extracted according to T_i such that D^t(T_i) = F(I^t(T_i) | W_F). 
However, extracting each feature by transforming local receptive fields within the target image I^t according to T_i for each pixel i, and then passing it through the networks, would be time-consuming when iterating the networks. Instead, we employ a strategy similar to UCN-ST [5] and CAT-FCSS [19]: we first extract the convolutional features of the entire image I^t by passing it through the networks except for the last convolutional layer, then compute D^t(T_i) by transforming the resultant convolutional features, and finally pass the result through the last convolution with stride to combine the transformed activations independently [5, 19]. It should be noted that any convolutional features [35, 12, 19] could be used in this framework. In our experiments, we used CAT-FCSS [19], and sampled activations after pooling layers, such as 'conv4-4' for VGGNet [35] and 'conv4-23' for ResNet [12].\n\n4.3 Recurrent Geometric Matching Network\n\nConstraint Correlation Volume To predict the geometric fields from the two convolutional features D^s and D^t, we first compute a correlation volume with respect to translational motion only [30, 31] and then pass it to subsequent convolutional layers to determine dense affine transformation fields. As shown in [31], this two-step approach reliably prunes incorrect matches. Specifically, the similarity between two extracted features is computed as the cosine similarity with L2 normalization:\n\nC(D^s_i, D^t(T_j)) = <D^s_i, D^t(T_j)> / sqrt(sum_l <D^s_i, D^t(T_l)>^2),    (2)\n\nwhere j, l \u2208 N_i for the search window N_i of pixel i. Compared to [30, 31], which consider all possible samples within an image, the constraint correlation volume defined within N_i reduces the matching ambiguity and computational time. However, due to the limited search window range, it may not cover large geometric variations. 
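The window-normalized correlation of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration (names hypothetical), assuming the per-pixel features have already been L2-normalized by the feature extraction network; here dt_window stacks the target features D^t(T_l) for the |N_i| candidate positions.

```python
import numpy as np

def constrained_correlation(ds_i, dt_window):
    """Constrained correlation of Eq. (2): inner products between the source
    feature at pixel i and the target features over the search window N_i,
    followed by an L2 normalization across the window."""
    scores = dt_window @ ds_i                     # <D^s_i, D^t(T_l)> for l in N_i
    return scores / np.sqrt(np.sum(scores ** 2))  # window-wise L2 normalization

rng = np.random.default_rng(0)
ds_i = rng.standard_normal(64)
ds_i /= np.linalg.norm(ds_i)                      # per-pixel L2-normalized feature
dt_window = rng.standard_normal((25, 64))         # e.g. a 5x5 search window
dt_window /= np.linalg.norm(dt_window, axis=1, keepdims=True)
corr = constrained_correlation(ds_i, dt_window)
```

The second normalization, taken over the window rather than the feature dimension, makes the |N_i| scores at pixel i a unit vector, as in Eq. (2).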
To alleviate this limitation, inspired by [43], we utilize dilation techniques, such that the local neighborhood N_i is enlarged with a stride larger than 1 pixel, and this dilation is reduced as the iterations progress, as exemplified in Fig. 4.\n\nRecurrent Geometry Estimation Based on this matching similarity, the recurrent geometry estimation network with parameters W_G iteratively estimates the residual between the previous and current geometric transformation fields as\n\nT^k_i - T^{k-1}_i = F(C(D^s_i, D^t(T^{k-1}_i)) | W_G),    (3)\n\nwhere T^k_i denotes the transformation field at the k-th iteration. The final geometric fields are then estimated in a recurrent manner as follows:\n\nT_i = T^0_i + sum_{k \u2208 {1,...,K_max}} F(C(D^s_i, D^t(T^{k-1}_i)) | W_G),    (4)\n\nwhere K_max denotes the maximum number of iterations and T^0_i is an initial geometric field. Unlike [30, 31], which estimate a global affine or thin-plate spline transformation field, we formulate encoder-decoder networks as in [32] to estimate locally-varying geometric fields. Moreover, our networks are formulated in a fully-convolutional manner, so source and target inputs of any size can be processed, in contrast to [30, 31], which can take inputs of only a fixed size.\n\nFigure 5: Convergence of RTNs: (a) source image; (b) target image; iterative evolution of warped images (c), (d), (e), and (f) after iterations 1, 2, 3, and 4. In the recurrent formulation of RTNs, the predicted transformation field becomes progressively more accurate through iterative estimation.\n\nIteratively inferring affine transformation residuals boosts matching precision and facilitates convergence. Moreover, inferring residuals instead of carrying the input information through the network has been shown to improve network optimization [12]. As shown in Fig. 
5, the predicted transformation field becomes progressively more accurate through iterative estimation.\n\n4.4 Weakly-supervised Learning\n\nA major challenge of semantic correspondence with deep CNNs is the lack of ground-truth correspondence maps for training. Obtaining such ground-truth data through manual annotation is labor-intensive and may be degraded by subjectivity [36, 7, 8]. To learn the networks using only weak supervision in the form of image pairs, we formulate the loss function based on the intuition that the matching score between the source feature D^s_i at each pixel i and the target feature D^t(T_i) should be maximized, while keeping the scores of other transformation candidates low. This can be treated as a classification problem, in that the network can learn a geometric field as a hidden variable that maximizes the scores for matchable T_i while minimizing the scores for non-matchable transformation candidates. The optimal fields T_i can be learned with a classification loss [19] in a weakly-supervised manner by minimizing the energy function\n\nL(D^s_i, D^t(T)) = - sum_{j \u2208 M_i} p*_j log(p(D^s_i, D^t(T_j))),    (5)\n\nwhere the function p(D^s_i, D^t(T_j)) is a softmax probability defined as\n\np(D^s_i, D^t(T_j)) = exp(C(D^s_i, D^t(T_j))) / sum_{l \u2208 M_i} exp(C(D^s_i, D^t(T_l))),    (6)\n\nwith p*_j denoting a class label defined as 1 if j = i, and 0 otherwise, for j \u2208 M_i in the search window M_i, such that the center point i within M_i becomes a positive sample while the other points are negative samples.\n\nWith this loss function, the derivatives \u2202L/\u2202D^s and \u2202L/\u2202D^t(T) of the loss function L with respect to the features D^s and D^t(T) can be back-propagated into the feature extraction networks F(\u00b7|W_F). Explicit feature learning in this manner with the classification loss has been shown to be reliable [19]. Likewise, the 
derivatives \u2202L/\u2202D^t(T) and \u2202D^t(T)/\u2202T of the loss function L with respect to the geometric fields T can be back-propagated into the geometric matching networks F(\u00b7|W_G) to learn these networks without ground truth T*. It should be noted that our loss function is conceptually similar to [31], in that it is formulated with source and target features in a weakly-supervised manner. While [31] utilizes only positive samples in learning the feature extraction networks, our method considers both positive and negative samples to enhance network training.\n\nFigure 7: Qualitative results on the TSS benchmark [36]: (a) source image, (b) target image, (c) DCTM [18], (d) SCNet [9], (e) GMat. w/Inl. [31], and (f) RTNs. The source images are warped to the target images using correspondences.\n\n5 Experimental Results and Discussion\n\n5.1 Experimental Settings\n\nIn the following, we comprehensively evaluated our RTNs through comparisons to state-of-the-art methods for semantic correspondence, including SF [25], DSP [17], Zhou et al. [44], Taniai et al. [36], PF [7], SCNet [9], DCTM [18], geometric matching (GMat.) [30], and GMat. w/Inl. [31], as well as employing the SIFT flow optimizer1 together with UCN-ST [5], FCSS [18], and CAT-FCSS [19]. Performance was measured on the TSS dataset [36], PF-WILLOW dataset [7], and PF-PASCAL dataset [8]. In Sec. 5.2, we first analyze the effects of the components within RTNs, and then evaluate matching results on various benchmarks with quantitative measures in Sec. 5.3.\n\n5.2 Ablation Study\n\nTo validate the components within RTNs, we evaluated the matching accuracy for different numbers of iterations, with various window sizes of N_i, for different backbone feature extraction networks such as VGGNet [35], CAT-FCSS [19], and ResNet [12], and with pretrained or learned backbone networks. 
For quantitative assessment, we examined the matching accuracy on the TSS benchmark [36], as described in the following section. As shown in Fig. 6, RTNs w/ResNet [12] converge in 3\u22125 iterations. Enlarging the window size of N_i improves the matching accuracy up to 9\u00d79, at the cost of longer training and testing times, but still larger window sizes reduce matching accuracy due to greater matching ambiguity. Note that M_i = N_i. Table 1 shows that among the many state-of-the-art feature extraction networks, ResNet [12] exhibits the best performance for our approach. As shown in the comparisons between pretrained and learned backbone networks, learning the feature extraction networks jointly with the geometric matching networks boosts matching accuracy, as similarly seen in [31].\n\nFigure 6: Convergence analysis of RTNs w/ResNet [12] for various numbers of iterations and search window sizes on the TSS benchmark [36].\n\n5.3 Matching Results\n\nTSS Benchmark We evaluated RTNs on the TSS benchmark [36], which consists of 400 image pairs divided into three groups: FG3DCar [24], JODS [33], and PASCAL [10]. As in [18, 20], flow accuracy was measured by computing the proportion of foreground pixels with an absolute flow endpoint error smaller than a threshold of T = 5, after resizing each image so that its larger dimension is 100 pixels.\n\n1For these experiments, we utilized the hierarchical dual-layer belief propagation of SIFT flow [25] together with alternative dense descriptors.\n\nMethods | Feature | Regular. | Superv. | FG3D. | JODS | PASC. | Avg.\nSF [25] | SIFT | SF | - | 0.632 | 0.509 | 0.360 | 0.500\nDSP [17] | SIFT | DSP | - | 0.487 | 0.465 | 0.382 | 0.445\nTaniai et al. [36] | HOG | TSS | - | 0.830 | 0.595 | 0.483 | 0.636\nPF [7] | HOG | LOM | - | 0.786 | 0.653 | 0.531 | 0.657\nDCTM [18] | CAT-FCSS\u2020 | DCTM | - | 0.891 | 0.721 | 0.610 | 0.740\nUCN-ST [5] | UCN-ST | SF | Sup. | 0.853 | 0.672 | 0.511 | 0.679\nFCSS [18, 19] | FCSS | SF | Weak. | 0.832 | 0.662 | 0.512 | 0.668\nFCSS [18, 19] | CAT-FCSS | SF | Weak. | 0.858 | 0.680 | 0.522 | 0.687\nSCNet [9] | VGGNet | AG | Sup. | 0.764 | 0.600 | 0.463 | 0.609\nSCNet [9] | VGGNet | AG+ | Sup. | 0.776 | 0.608 | 0.474 | 0.619\nGMat. [30] | VGGNet | GMat. | Self. | 0.835 | 0.656 | 0.527 | 0.673\nGMat. [30] | ResNet | GMat. | Self. | 0.886 | 0.758 | 0.560 | 0.735\nGMat. w/Inl. [31] | ResNet | GMat. | Weak. | 0.892 | 0.758 | 0.562 | 0.737\nRTNs | VGGNet\u2020 | R-GMat. | Weak. | 0.875 | 0.736 | 0.586 | 0.732\nRTNs | VGGNet | R-GMat. | Weak. | 0.893 | 0.762 | 0.591 | 0.749\nRTNs | CAT-FCSS | R-GMat. | Weak. | 0.889 | 0.775 | 0.611 | 0.758\nRTNs | ResNet | R-GMat. | Weak. | 0.901 | 0.782 | 0.633 | 0.772\n\nTable 1: Matching accuracy compared to state-of-the-art correspondence techniques (with feature, regularization, and supervision) on the TSS benchmark [36]. \u2020 denotes a pre-trained feature.\n\nFigure 8: Qualitative results on the PF-WILLOW benchmark [7]: (a) source image, (b) target image, (c) UCN-ST [5], (d) SCNet [9], (e) GMat. w/Inl. [31], and (f) RTNs. The source images are warped to the target images using correspondences.\n\nTable 1 compares the matching accuracy of RTNs to state-of-the-art correspondence techniques, and Fig. 7 shows qualitative results. Compared to handcrafted methods [25, 17, 36, 7], most CNN-based methods perform better. In particular, methods that use STN-based feature transformations, namely UCN-ST [5] and CAT-FCSS [19], show improved ability to deal with geometric variations. In comparison to the geometric matching-based methods GMat. [30] and GMat. w/Inl. [31], RTNs, consisting of feature extraction with ResNet and recurrent geometric matching modules, provide higher performance. 
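The flow-accuracy protocol used in this comparison can be sketched as follows; this is a minimal illustration (function name hypothetical), not the evaluation code used in the paper. The threshold T = 5 applies after each image is resized so that its larger dimension is 100 pixels.

```python
import numpy as np

def flow_accuracy(flow, flow_gt, fg_mask, tau=5.0):
    """TSS-style flow accuracy: fraction of foreground pixels whose flow
    endpoint error ||f_i - f*_i||_2 is below the threshold tau."""
    err = np.linalg.norm(flow - flow_gt, axis=-1)   # per-pixel endpoint error
    return float(np.mean(err[fg_mask] < tau))

# Toy example: a 2x2 flow field with all pixels marked as foreground.
flow = np.array([[[0.0, 0.0], [10.0, 0.0]],
                 [[1.0, 1.0], [0.0, 3.0]]])
flow_gt = np.zeros((2, 2, 2))
fg = np.ones((2, 2), dtype=bool)
acc = flow_accuracy(flow, flow_gt, fg)   # errors 0, 10, sqrt(2), 3 -> 3/4 correct
```
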
RTNs additionally outperform existing CNN-based methods trained with supervision of flow fields. It should be noted that GMat. w/Inl. [31] was learned with the initial network parameters set through self-supervised learning as in [30]; RTNs instead start from fully-randomized parameters in the geometric matching networks.\n\nPF-WILLOW Benchmark We also evaluated our method on the PF-WILLOW benchmark [7], which includes 10 object sub-classes with 10 keypoint annotations for each image. For the evaluation metric, we use the probability of correct keypoint (PCK) between flow-warped keypoints and the ground truth [26, 7], as in the experiments of [18, 9, 19]. Table 2 compares the PCK values of RTNs to state-of-the-art correspondence techniques, and Fig. 8 shows qualitative results. Our RTNs exhibit performance competitive with the state-of-the-art correspondence techniques, including the latest weakly-supervised and even supervised methods for semantic correspondence. Since RTNs estimate locally-varying geometric fields, they provide more precise localization, as shown in the results for \u03b1 = 0.05, in comparison to existing geometric matching networks [30, 31], which estimate only globally-varying geometric fields.\n\nMethods | PF-WILLOW [7] \u03b1 = 0.05 | \u03b1 = 0.1 | \u03b1 = 0.15 | PF-PASCAL [8] \u03b1 = 0.05 | \u03b1 = 0.1 | \u03b1 = 0.15\nPF [7] | 0.284 | 0.568 | 0.682 | 0.314 | 0.625 | 0.795\nDCTM [18] | 0.381 | 0.610 | 0.721 | 0.342 | 0.696 | 0.802\nUCN-ST [5] | 0.241 | 0.540 | 0.665 | 0.299 | 0.556 | 0.740\nCAT-FCSS [19] | 0.362 | 0.546 | 0.692 | 0.336 | 0.689 | 0.792\nSCNet [9] | 0.386 | 0.704 | 0.853 | 0.362 | 0.722 | 0.820\nGMat. [30] | 0.369 | 0.692 | 0.778 | 0.410 | 0.695 | 0.804\nGMat. w/Inl. [31] | 0.370 | 0.702 | 0.799 | 0.490 | 0.748 | 0.840\nRTNs w/VGGNet | 0.402 | 0.707 | 0.842 | 0.506 | 0.743 | 0.836\nRTNs w/ResNet | 0.413 | 0.719 | 0.862 | 0.552 | 0.759 | 0.852\n\nTable 2: Matching accuracy compared to state-of-the-art correspondence techniques on the PF-WILLOW benchmark [7] and PF-PASCAL benchmark [8].\n\nFigure 9: Qualitative results on the PF-PASCAL benchmark [8]: (a) source image, (b) target image, (c) CAT-FCSS w/SF [19], (d) SCNet [9], (e) GMat. w/Inl. [31], and (f) RTNs. The source images are warped to the target images using correspondences.\n\nPF-PASCAL Benchmark Lastly, we evaluated our method on the PF-PASCAL benchmark [8], which contains 1,351 image pairs over 20 object categories with PASCAL keypoint annotations [2]. Following the split in [9, 31], we used 700 training pairs, 300 validation pairs, and 300 testing pairs. For the evaluation metric, we use the PCK between flow-warped keypoints and the ground truth, as done in the experiments of [9]. Table 2 summarizes the PCK values, and Fig. 9 shows qualitative results. 
Similar to the experiments on the PF-WILLOW benchmark [7], CNN-based methods [9, 30, 31], including our RTNs, yield better performance, with RTNs providing the highest matching accuracy.

6 Conclusion

We presented RTNs, which learn to infer locally-varying geometric fields for semantic correspondence in an end-to-end and weakly-supervised fashion. The key idea of this approach is to utilize and iteratively refine the transformations and convolutional activations via geometric matching between the input image pair. In addition, a technique was presented for weakly-supervised training of RTNs. A direction for further study is to examine how the semantic correspondence of RTNs could benefit single-image 3-D reconstruction and instance-level object detection and segmentation.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2017M3C4A7069370).

References

[1] C. Barnes, E. Shechtman, D. B. Goldman, and A. Finkelstein. The generalized PatchMatch correspondence algorithm. In: ECCV, 2010.

[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV, 2009.

[3] H. Bristow, J. Valmadre, and S. Lucey. Dense semantic correspondence where every pixel is a classifier. In: ICCV, 2015.

[4] D. Butler, J. Wulff, G. Stanley, and M. Black. A naturalistic open source movie for optical flow evaluation. In: ECCV, 2012.

[5] C. B. Choy, Y. Gwak, and S. Savarese. Universal correspondence network. In: NIPS, 2016.

[6] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski. Non-rigid dense correspondence with applications for image enhancement. In: SIGGRAPH, 2011.

[7] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow. In: CVPR, 2016.

[8] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow: Semantic correspondences from object proposals. IEEE Trans. PAMI, 2017.

[9] K. Han, R. S. Rezende, B. Ham, K. Wong, M. Cho, C. Schmid, and J. Ponce. SCNet: Learning semantic correspondence. In: ICCV, 2017.

[10] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In: ICCV, 2011.

[11] T. Hassner, V. Mayzels, and L. Zelnik-Manor. On SIFTs and their scales. In: CVPR, 2012.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In: CVPR, 2016.

[13] J. Hur, H. Lim, C. Park, and S. C. Ahn. Generalized deformable spatial pyramid: Geometry-preserving dense correspondence estimation. In: CVPR, 2015.

[14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In: CVPR, 2007.

[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In: NIPS, 2015.

[16] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly supervised matching for single-view reconstruction. In: CVPR, 2016.

[17] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In: CVPR, 2013.

[18] S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn. FCSS: Fully convolutional self-similarity for dense semantic correspondence. In: CVPR, 2017.

[19] S. Kim, D. Min, B. Ham, S. Lin, and K. Sohn. FCSS: Fully convolutional self-similarity for dense semantic correspondence. IEEE Trans. PAMI, 2018.

[20] S. Kim, D. Min, S. Lin, and K. Sohn. DCTM: Discrete-continuous transformation matching for semantic flow. In: ICCV, 2017.

[21] S. Kim, D. Min, S. Lin, and K. Sohn. Discrete-continuous transformation matching for dense semantic correspondence. IEEE Trans. PAMI, 2018.

[22] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. In: SIGGRAPH, 2017.

[23] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. In: CVPR, 2017.

[24] Y. L. Lin, V. I. Morariu, W. Hsu, and L. S. Davis. Jointly optimizing 3D model fitting and fine-grained classification. In: ECCV, 2014.

[25] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans. PAMI, 33(5):815–830, 2011.

[26] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In: NIPS, 2014.

[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[28] D. Novotny, D. Larlus, and A. Vedaldi. AnchorNet: A weakly supervised network to learn geometry-sensitive features for semantic matching. In: CVPR, 2017.

[29] W. Qiu, X. Wang, X. Bai, A. Yuille, and Z. Tu. Scale-space SIFT flow. In: WACV, 2014.

[30] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In: CVPR, 2017.

[31] I. Rocco, R. Arandjelović, and J. Sivic. End-to-end weakly-supervised semantic alignment. In: CVPR, 2018.

[32] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI, 2015.

[33] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In: CVPR, 2013.

[34] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1):7–42, 2002.

[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In: ICLR, 2015.

[36] T. Taniai, S. N. Sinha, and Y. Sato. Joint recovery of dense correspondence and cosegmentation in two images. In: CVPR, 2016.

[37] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In: NIPS, 2017.

[38] N. Ufer and B. Ommer. Deep semantic feature matching. In: CVPR, 2017.

[39] F. Yang, X. Li, H. Cheng, J. Li, and L. Chen. Object-aware dense semantic correspondence. In: CVPR, 2017.

[40] H. Yang, W. Y. Lin, and J. Lu. DAISY filter flow: A generalized discrete approach to dense correspondences. In: CVPR, 2014.

[41] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In: ECCV, 2016.

[42] K. M. Yi, Y. Verdie, P. Fua, and V. Lepetit. Learning to assign orientations to feature points. In: CVPR, 2016.

[43] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In: ICLR, 2016.

[44] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In: CVPR, 2016.

[45] T. Zhou, Y. J. Lee, S. X. Yu, and A. A. Efros. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In: CVPR, 2015.