{"title": "Universal Correspondence Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2414, "page_last": 2422, "abstract": "We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair and faster testing with $O(n)$ feedforward passes for n keypoints, instead of $O(n^2)$ for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on KITTI, PASCAL and CUB-2011 datasets demonstrate the significant advantages of our features over prior works that use either hand-constructed or learned features.", "full_text": "Universal Correspondence Network\n\nChristopher B. Choy\nStanford University\n\nchrischoy@ai.stanford.edu\n\nJunYoung Gwak\nStanford University\n\njgwak@ai.stanford.edu\n\nSilvio Savarese\n\nStanford University\n\nssilvio@stanford.edu\n\nManmohan Chandraker\n\nNEC Laboratories America, Inc.\n\nmanu@nec-labs.com\n\nAbstract\n\nWe present a deep learning framework for accurate visual correspondences and\ndemonstrate its effectiveness for both geometric and semantic matching, spanning\nacross rigid motions to intra-class shape or appearance variations. 
In contrast\nto previous CNN-based approaches that optimize a surrogate patch similarity\nobjective, we use deep metric learning to directly learn a feature space that preserves\neither geometric or semantic similarity. Our fully convolutional architecture, along\nwith a novel correspondence contrastive loss allows faster training by effective\nreuse of computations, accurate gradient computation through the use of thousands\nof examples per image pair and faster testing with O(n) feed forward passes for\nn keypoints, instead of O(n2) for typical patch similarity methods. We propose\na convolutional spatial transformer to mimic patch normalization in traditional\nfeatures like SIFT, which is shown to dramatically boost accuracy for semantic\ncorrespondences across intra-class shape variations. Extensive experiments on\nKITTI, PASCAL, and CUB-2011 datasets demonstrate the signi\ufb01cant advantages\nof our features over prior works that use either hand-constructed or learned features.\n\n1\n\nIntroduction\n\nCorrespondence estimation is the workhorse that drives several fundamental problems in computer\nvision, such as 3D reconstruction, image retrieval or object recognition. Applications such as\nstructure from motion or panorama stitching that demand sub-pixel accuracy rely on sparse keypoint\nmatches using descriptors like SIFT [22]. In other cases, dense correspondences in the form of stereo\ndisparities, optical \ufb02ow or dense trajectories are used for applications such as surface reconstruction,\ntracking, video analysis or stabilization. In yet other scenarios, correspondences are sought not\nbetween projections of the same 3D point in different images, but between semantic analogs across\ndifferent instances within a category, such as beaks of different birds or headlights of cars. 
Thus, in its most general form, the notion of visual correspondence estimation spans the range from low-level feature matching to high-level object or scene understanding.

Traditionally, correspondence estimation relies on hand-designed features or domain-specific priors. In recent years, there has been increasing interest in leveraging the power of convolutional neural networks (CNNs) to estimate visual correspondences. For example, a Siamese network may take a pair of image patches and output their similarity [1, 34, 35]. Intermediate convolution layer activations from such CNNs are also usable as generic features.

However, such intermediate activations are not optimized for the visual correspondence task. These features are trained for a surrogate objective function (patch similarity) and do not necessarily form a metric space for visual correspondence; thus, a metric operation such as distance has no explicit interpretation. In addition, patch similarity is inherently inefficient, since features have to be extracted even for overlapping regions within patches. Further, it requires O(n^2) feed-forward passes to compare each of n patches with n other patches in a different image.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Various types of correspondence problems have traditionally required different specialized methods: for example, SIFT or SURF for sparse structure from motion, DAISY or DSP for dense matching, SIFT Flow or FlowWeb for semantic matching. The Universal Correspondence Network accurately and efficiently learns a metric space for geometric correspondences, dense trajectories or semantic correspondences.

In contrast, we present the Universal Correspondence Network (UCN), a CNN-based generic discriminative framework that learns both geometric and semantic visual correspondences.
Unlike many previous CNNs for patch similarity, we use deep metric learning to directly learn the mapping, or feature, that preserves similarity (either geometric or semantic) for generic correspondences. The mapping is thus invariant to projective transformations, intra-class shape or appearance variations, or any other variations that are irrelevant to the considered similarity. We propose a novel correspondence contrastive loss that allows faster training by efficiently sharing computations and effectively encoding neighborhood relations in feature space. At test time, correspondence reduces to a nearest neighbor search in feature space, which is more efficient than evaluating pairwise patch similarities.

The UCN is fully convolutional, allowing efficient generation of dense features. We propose an on-the-fly active hard-negative mining strategy for faster training. In addition, we propose a novel adaptation of the spatial transformer [13], called the convolutional spatial transformer, designed to make our features invariant to particular families of transformations. By learning the optimal feature space that compensates for affine transformations, the convolutional spatial transformer imparts the ability to mimic patch normalization of descriptors such as SIFT. Figure 1 illustrates our framework.

The capabilities of the UCN are compared to a few important prior approaches in Table 1. Empirically, the correspondences obtained from the UCN are denser and more accurate than most prior approaches specialized for a particular task.
We demonstrate this experimentally by showing state-of-the-art performance on sparse SFM on KITTI, as well as dense geometric or semantic correspondences on both rigid and non-rigid bodies in the KITTI, PASCAL and CUB datasets.

To summarize, we propose a novel end-to-end system that optimizes a general correspondence objective, independent of domain, with the following main contributions:
• Deep metric learning with an efficient correspondence contrastive loss for learning a feature representation that is optimized for the given correspondence task.
• Fully convolutional network for dense and efficient feature extraction, along with fast active hard negative mining.
• Fully convolutional spatial transformer for patch normalization.
• State-of-the-art correspondences across sparse SFM, dense matching and semantic matching, encompassing rigid bodies, non-rigid bodies and intra-class shape or appearance variations.

2 Related Works

Correspondences Visual features form basic building blocks for many computer vision applications. Carefully designed features and kernel methods have influenced many fields such as structure

Figure 2: System overview: The network is fully convolutional, consisting of a series of convolutions, pooling, nonlinearities and a convolutional spatial transformer, followed by channel-wise L2 normalization and correspondence contrastive loss. As inputs, the network takes a pair of images and coordinates of corresponding points in these images (blue: positive, red: negative). Features that correspond to the positive points (from both images) are trained to be closer to each other, while features that correspond to negative points are trained to be a certain margin apart.
Before the last L2 normalization and after the FCNN, we placed a convolutional spatial\ntransformer to normalize patches or take larger context into account.\n\nFeatures\nSIFT [22]\nDAISY [28]\nConv4 [21]\nDeepMatching [25]\nPatch-CNN [34]\nLIFT [33]\nOurs\n\nDense\n\nGeometric Corr.\n\nSemantic Corr.\n\nTrainable\n\nEf\ufb01cient Metric Space\n\n\u0017\n\u0013\n\u0013\n\u0013\n\u0013\n\u0017\n\u0013\n\n\u0013\n\u0013\n\u0017\n\u0013\n\u0013\n\u0013\n\u0013\n\n\u0017\n\u0017\n\u0013\n\u0017\n\u0017\n\u0017\n\u0013\n\n\u0017\n\u0017\n\u0013\n\u0017\n\u0013\n\u0013\n\u0013\n\n\u0013\n\u0013\n\u0013\n\u0017\n\u0017\n\u0013\n\u0013\n\n\u0017\n\u0017\n\u0017\n\u0013\n\u0017\n\u0013\n\u0013\n\nTable 1: Comparison of prior state-of-the-art methods with UCN (ours). The UCN generates dense and accurate\ncorrespondences for either geometric or semantic correspondence tasks. The UCN directly learns the feature\nspace to achieve high accuracy and has distinct ef\ufb01ciency advantages, as discussed in Section 3.\n\nfrom motion, object recognition and image classi\ufb01cation. Several hand-designed features, such as\nSIFT, HOG, SURF and DAISY have found widespread applications [22, 3, 28, 8].\nRecently, many CNN-based similarity measures have been proposed. A Siamese network is used in\n[34] to measure patch similarity. A driving dataset is used to train a CNN for patch similarity in [1],\nwhile [35] also uses a Siamese network for measuring patch similarity for stereo matching. A CNN\npretrained on ImageNet is analyzed for visual and semantic correspondence in [21]. Correspondences\nare learned in [16] across both appearance and a global shape deformation by exploiting relationships\nin \ufb01ne-grained datasets. In contrast, we learn a metric space in which metric operations have direct\ninterpretations, rather than optimizing the network for patch similarity and using the intermediate\nfeatures. 
For this, we implement a fully convolutional architecture with a correspondence contrastive loss that allows faster training and testing, and propose a convolutional spatial transformer for local patch normalization.

Metric learning using neural networks Neural networks are used in [5] for learning a mapping where the Euclidean distance in the space preserves semantic distance. The loss function for learning a similarity metric using Siamese networks is subsequently formalized by [7, 12]. Recently, a triplet loss is used by [29] for fine-grained image ranking, while the triplet loss is also used for face recognition and clustering in [26]. Mini-batches are used for efficiently training the network in [27].

CNN invariances and spatial transformations A CNN is invariant to some types of transformations, such as translation and scale, due to its convolution and pooling layers. However, explicitly handling such invariances in the form of data augmentation or explicit network structure yields higher accuracy in many tasks [17, 15, 13]. Recently, a spatial transformer network was proposed in [13] to learn how to zoom in, rotate, or apply arbitrary transformations to an object of interest.

Fully convolutional neural network Fully connected layers are converted into 1 × 1 convolutional filters in [20] to propose a fully convolutional framework for segmentation. Changing a regular CNN to a fully convolutional network for detection leads to speed and accuracy gains in [11]. Similar to these works, we gain the efficiency of a fully convolutional architecture through reusing activations for overlapping regions. Further, since the number of training instances is much larger than the number of images in a batch, variance in the gradient is reduced, leading to faster training and convergence.

Methods | # examples per image pair | # feed-forwards per test
Siamese Network | 1 | O(N^2)
Triplet Loss | 2 | O(N)
Contrastive Loss | 1 | O(N)
Corres. Contrast. Loss | > 10^3 | O(N)

Table 2: Comparisons between metric learning methods for visual correspondence. Feature learning allows faster test times. Correspondence contrastive loss allows us to use many more correspondences in one pair of images than other methods.

Figure 3: Correspondence contrastive loss takes three inputs: two dense features extracted from images and a correspondence table for positive and negative pairs.

3 Universal Correspondence Network

We now present the details of our framework. Recall that the Universal Correspondence Network is trained to directly learn a mapping that preserves similarity instead of relying on surrogate features. We discuss the fully convolutional nature of the architecture, a novel correspondence contrastive loss for faster training and testing, active hard negative mining, as well as the convolutional spatial transformer that enables patch normalization.

Fully Convolutional Feature Learning To speed up training and use resources efficiently, we implement fully convolutional feature learning, which has several benefits. First, the network can reuse some of the activations computed for overlapping regions. Second, we can train several thousand correspondences for each image pair, which provides the network an accurate gradient for faster learning. Third, hard negative mining is efficient and straightforward, as discussed subsequently. Fourth, unlike patch-based methods, it can be used to extract dense features efficiently from images of arbitrary sizes.

During testing, the fully convolutional network is faster as well. Patch-similarity-based networks such as [1, 34, 35] require O(n^2) feed-forward passes, where n is the number of keypoints in each image, as compared to only O(n) for our network.
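The efficiency argument can be made concrete: one forward pass per image yields features for every pixel, so matching n keypoints reduces to a nearest-neighbor search over gathered feature vectors. A minimal numpy sketch (illustrative only; in practice the feature maps come from the trained network, and all function names here are ours):

```python
import numpy as np

def match_keypoints(feat1, feat2, kps1, kps2):
    """Match keypoints by nearest-neighbor search in a learned feature space.

    feat1, feat2: (H, W, C) dense feature maps -- one network forward pass
                  per image, so n keypoints cost O(n) feature lookups
                  rather than O(n^2) pairwise patch evaluations.
    kps1, kps2:   (n, 2) and (m, 2) integer (x, y) keypoint coordinates.
    Returns the index in kps2 of the nearest neighbor of each keypoint in kps1.
    """
    f1 = feat1[kps1[:, 1], kps1[:, 0]].astype(float)  # (n, C) gathered features
    f2 = feat2[kps2[:, 1], kps2[:, 0]].astype(float)  # (m, C)
    # Channel-wise L2 normalization, so Euclidean distance is meaningful
    f1 /= np.linalg.norm(f1, axis=1, keepdims=True)
    f2 /= np.linalg.norm(f2, axis=1, keepdims=True)
    dists = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=2)  # (n, m)
    return dists.argmin(axis=1)
```

A Siamese patch-similarity network would instead have to run one forward pass per candidate pair, which is the O(n^2) cost the text refers to.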
We note that extracting intermediate layer activations as a surrogate mapping is a comparatively suboptimal choice, since those activations are not directly trained on the visual correspondence task.

Correspondence Contrastive Loss Learning a metric space for visual correspondence requires corresponding points (in different views) to be mapped to neighboring points in the feature space. To encode these constraints, we propose a generalization of the contrastive loss [7, 12], called the correspondence contrastive loss. Let F_I(x) denote the feature in image I at location x = (x, y). The loss function takes features from images I and I′, at coordinates x and x′, respectively (see Figure 3). If the coordinates x and x′ correspond to the same 3D point, we use the pair as a positive pair that are encouraged to be close in the feature space; otherwise, as a negative pair that are encouraged to be at least a margin m apart. We denote s = 1 for a positive pair and s = 0 for a negative pair. The full correspondence contrastive loss is given by

L = 1/(2N) Σ_i^N [ s_i ||F_I(x_i) − F_I′(x′_i)||² + (1 − s_i) max(0, m − ||F_I(x_i) − F_I′(x′_i)||)² ]   (1)

For each image pair, we sample correspondences from the training set. For instance, for the KITTI dataset, if we use each laser scan point, we can train with up to 100k points in a single image pair. In practice, however, we used 3k correspondences to limit memory consumption. This allows more accurate gradient computations than the traditional contrastive loss, which yields one example per image pair. We again note that the number of feed-forward passes at test time is O(n), compared to O(n^2) for Siamese network variants [1, 35, 34]. Table 2 summarizes the advantages of a fully convolutional architecture with the correspondence contrastive loss.

Hard Negative Mining The correspondence contrastive loss in Eq.
(1) consists of two terms. The first term minimizes the distance between positive pairs, and the second term pushes negative pairs to be at least a margin m away from each other. Thus, the second term is only active when the distance between the features F_I(x_i) and F_I′(x′_i) is smaller than the margin m. This boundary defines the

Figure 4: (a) SIFT normalizes for rotation and scaling. (b) The spatial transformer takes the whole image as an input to estimate a transformation. (c) Our convolutional spatial transformer applies an independent transformation to features.

metric space, so it is crucial to find the negatives that violate the constraint and train the network to push them away. However, random negative pairs do not contribute to training, since they are generally far from each other in the embedding space.

Instead, we actively mine the negative pairs that violate the constraints the most, to dramatically speed up training. We extract features from the first image and find the nearest neighbor in the second image. If the location is far from the ground truth correspondence location, we use the pair as a negative. We compute the nearest neighbor for all ground truth points in the first image. Such a mining process is time consuming, since it requires O(mn) comparisons for m and n feature points in the two images, respectively. Our experiments use a few thousand points for n, with m being all the features on the second image, which can be as large as 22,000. We use a GPU implementation to speed up the K-NN search [10] and embed it as a Caffe layer to actively mine hard negatives on-the-fly.

Convolutional Spatial Transformer CNNs are known to handle some degree of scale and rotation invariance.
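The loss of Eq. (1), together with the active hard-negative mining step just described, can be sketched in a few lines (an illustrative numpy re-implementation, not the authors' Caffe layers; all names are ours):

```python
import numpy as np

def corr_contrastive_loss(fI, fIp, s, m=1.0):
    """Correspondence contrastive loss, Eq. (1).

    fI, fIp: (N, C) features sampled at paired coordinates x_i and x'_i.
    s:       (N,) indicator, 1 for positive pairs, 0 for negative pairs.
    """
    d = np.linalg.norm(fI - fIp, axis=1)           # ||F_I(x_i) - F_I'(x'_i)||
    pos = s * d ** 2                               # pull positives together
    neg = (1 - s) * np.maximum(0.0, m - d) ** 2    # push negatives past margin m
    return (pos + neg).sum() / (2 * len(s))

def mine_hard_negatives(f1, feats2, xy2, gt_xy, radius=16):
    """For each feature of image 1, find its nearest neighbor among all
    features of image 2; keep it as a hard negative if its location is
    farther than `radius` pixels from the ground-truth correspondence.

    f1:     (N, C) features at ground-truth points of image 1.
    feats2: (M, C) all dense features of image 2, at pixel coords xy2 (M, 2).
    gt_xy:  (N, 2) ground-truth correspondence locations in image 2.
    """
    d = np.linalg.norm(f1[:, None, :] - feats2[None, :, :], axis=2)  # (N, M)
    nn = d.argmin(axis=1)                          # nearest neighbor per point
    violates = np.linalg.norm(xy2[nn] - gt_xy, axis=1) > radius
    return nn, violates                            # use nn[violates] as negatives
```

Note that only negatives whose nearest neighbor lands within the margin produce a nonzero gradient, which is why mining the violators matters.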
However, explicitly handling spatial transformations using data augmentation or a special network structure has been shown to be more successful in many tasks [13, 15, 16, 17]. For visual correspondence, finding the right scale and rotation is crucial, and is traditionally achieved through patch normalization [23, 22]. A series of simple convolutions and poolings cannot mimic such complex spatial transformations.

To mimic patch normalization, we borrow the idea of the spatial transformer layer [13]. However, instead of a global image transformation, each keypoint in the image can undergo an independent transformation. Thus, we propose a convolutional version that generates the transformed activations, called the convolutional spatial transformer. As demonstrated in our experiments, this is especially important for correspondences across large intra-class shape variations.

The proposed transformer takes its input from a lower layer and, for each output feature, applies an independent spatial transformation. The transformation parameters are also extracted convolutionally. Since each activation goes through an independent transformation, the transformed activations are placed inside a larger activation map without overlap, and then go through a subsequent strided convolution that combines them independently. The stride has to be equal to the spatial transformer kernel size. Figure 4 illustrates the convolutional spatial transformer module.

4 Experiments

We use the Caffe [14] package for implementation. Since it does not support the new layers we propose, we implement the correspondence contrastive loss layer, the convolutional spatial transformer layer, the K-NN layer based on [10], and the channel-wise L2 normalization layer. We use neither a flattening layer nor fully connected layers, making the network fully convolutional and generating features at every fourth pixel.
For accurate localization, we then extract features densely using bilinear interpolation to mitigate quantization error for sparse correspondences. Please refer to the supplementary materials for the network implementation details and visualization.

For each experimental setup, we train and test three variations of the network. First, the network with hard negative mining and the spatial transformer (Ours-HN-ST). Second, the same network without the spatial transformer (Ours-HN). Third, the same network without the spatial transformer and hard negative mining, instead providing random negative samples that are at least a certain number of pixels away from the ground truth correspondence location (Ours-RN).

method | SIFT-NN [22] | HOG-NN [8] | SIFT-flow [19] | DaisyFF [31] | DSP [18] | DM best (1/2) [25] | Ours-HN | Ours-HN-ST
MPI-Sintel | 68.4 | 71.2 | 89.0 | 87.3 | 85.3 | 89.2 | 91.5 | 90.7
KITTI | 48.9 | 53.7 | 67.3 | 79.6 | 58.0 | 85.6 | 86.5 | 83.4

Table 3: Matching performance PCK@10px on KITTI Flow 2015 [24] and MPI-Sintel [6]. Note that DaisyFF, DSP, DM use global optimization, whereas we only use the raw correspondences from nearest neighbor matches.

Figure 5: Comparison of PCK performance on the KITTI raw dataset. (a) PCK performance of the densely extracted feature nearest neighbor. (b) PCK performance for keypoint feature nearest neighbor and the dense CNN feature nearest neighbor.

Figure 6: Visualization of nearest neighbor (NN) matches on KITTI images. (a) From top to bottom: first and second images, and FAST keypoints and dense keypoints on the first image. (b) NN of SIFT matches on second image. (c) NN of dense DAISY matches on second image. (d) NN of our dense UCN matches on second image.
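The bilinear interpolation step mentioned at the start of this section is simple to state: features are computed on a stride-4 grid, and sub-pixel locations are read off by interpolating the four surrounding cells. A sketch (numpy; the function name and stride default are ours, following the stated stride of four pixels):

```python
import numpy as np

def bilinear_sample(feat, x, y, stride=4):
    """Sample a dense feature map at a sub-pixel image location.

    feat: (H, W, C) feature map computed at every `stride`-th image pixel.
    x, y: continuous image coordinates.
    Interpolating between the four surrounding feature cells mitigates the
    quantization error of the strided feature grid.
    """
    fx, fy = x / stride, y / stride                    # image -> grid coords
    x0, y0 = int(np.floor(fx)), int(np.floor(fy))
    x1 = min(x0 + 1, feat.shape[1] - 1)                # clamp to map border
    y1 = min(y0 + 1, feat.shape[0] - 1)
    wx, wy = fx - x0, fy - y0                          # interpolation weights
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1] +
            (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])
```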
With this configuration of networks, we verify the effectiveness of each component of the Universal Correspondence Network.

Datasets and Metrics We evaluate our UCN on three different tasks: geometric correspondence, semantic correspondence, and accuracy of correspondences for camera localization. For geometric correspondence (matching images of the same 3D point in different views), we use two optical flow datasets, from the KITTI 2015 Flow benchmark and the MPI Sintel dataset, and split each of their training sets into a training and a validation set. The exact splits are available on the project website. For semantic correspondence (finding the same functional part in different instances), we use the PASCAL-Berkeley dataset with keypoint annotations [9, 4] and a subset used by FlowWeb [36]. We also compare against prior state-of-the-art on the Caltech-UCSD Bird dataset [30]. To test the accuracy of correspondences for camera motion estimation, we use the raw KITTI driving sequences, which include Velodyne scans, GPS and IMU measurements. Velodyne points are projected into successive frames to establish correspondences, and any points on moving objects are removed.

To measure performance, we use the percentage of correct keypoints (PCK) metric [21, 36, 16] (or, equivalently, "accuracy@T" [25]). We extract features densely or on a set of sparse keypoints (for semantic correspondence) from a query image and find the nearest neighboring feature in the second image as the predicted correspondence. The correspondence is classified as correct if the predicted keypoint is closer than T pixels to the ground truth (in short, PCK@T). Unlike many prior works, we do not apply any post-processing, such as global optimization with an MRF.
This is to capture the performance of raw correspondences from the UCN, which already surpasses previous methods.

Geometric Correspondence We pick 1000 random correspondences in each KITTI or MPI Sintel image during training. We consider a correspondence a hard negative if the nearest neighbor in the feature space is more than 16 pixels away from the ground truth correspondence. We used the same architecture and training scheme for both datasets. Following convention [25], we measure PCK at a 10 pixel threshold and compare with the state-of-the-art methods in Table 3. SIFT-flow [19], DaisyFF [31], DSP [18], and DM best [25] use additional global optimization to generate more accurate correspondences.

 | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean
conv4 flow | 28.2 | 34.1 | 20.4 | 17.1 | 50.6 | 36.7 | 20.9 | 19.6 | 15.7 | 25.4 | 12.7 | 18.7 | 25.9 | 23.1 | 21.4 | 40.2 | 21.1 | 14.5 | 18.3 | 33.3 | 24.9
SIFT flow | 27.6 | 30.8 | 19.9 | 17.5 | 49.4 | 36.4 | 20.7 | 16.0 | 16.1 | 25.0 | 16.1 | 16.3 | 27.7 | 28.3 | 20.2 | 36.4 | 20.5 | 17.2 | 19.9 | 32.9 | 24.7
NN transfer | 18.3 | 24.8 | 14.5 | 15.4 | 48.1 | 27.6 | 16.0 | 11.1 | 12.0 | 16.8 | 15.7 | 12.7 | 20.2 | 18.5 | 18.7 | 33.4 | 14.0 | 15.5 | 14.6 | 30.0 | 19.9
Ours RN | 31.5 | 19.6 | 30.1 | 23.0 | 53.5 | 36.7 | 34.0 | 33.7 | 22.2 | 28.1 | 12.8 | 33.9 | 29.9 | 23.4 | 38.4 | 39.8 | 38.6 | 17.6 | 28.4 | 60.2 | 36.0
Ours HN | 36.0 | 26.5 | 31.9 | 31.3 | 56.4 | 38.2 | 36.2 | 34.0 | 25.5 | 31.7 | 18.1 | 35.7 | 32.1 | 24.8 | 41.4 | 46.0 | 45.3 | 15.4 | 28.2 | 65.3 | 38.6
Ours HN-ST | 37.7 | 30.1 | 42.0 | 31.7 | 62.6 | 35.4 | 38.0 | 41.7 | 27.5 | 34.0 | 17.3 | 41.9 | 38.0 | 24.4 | 47.1 | 52.5 | 47.5 | 18.5 | 40.2 | 70.5 | 44.0

Table 4: Per-class PCK on the PASCAL-Berkeley correspondence dataset [4] (α = 0.1, L = max(w, h)).

Figure 7: Qualitative semantic correspondence results on PASCAL [9] correspondences with Berkeley keypoint annotation [4] and the Caltech-UCSD Bird dataset [30] (columns: Query, VGG conv4_3 NN, Ground Truth, Ours HN-ST).
On the other hand, even our raw correspondences outperform all the state-of-the-art methods. We note that the spatial transformer does not improve performance in this case, likely due to overfitting on a smaller training set. As we show in the next experiments, its benefits are more apparent with a larger-scale dataset and greater shape variations. Note that although we used stereo datasets to generate a large number of correspondences, the result is not directly comparable to stereo methods without global optimization and epipolar geometry to filter out noise and incorporate edges.

We also used the KITTI raw sequences to generate a large number of correspondences, splitting different sequences into train and test sets. The details of the split are in the supplementary material. We plot PCK at different thresholds for various methods with densely extracted features on the larger KITTI raw dataset in Figure 5a. The accuracy of our features outperforms all traditional features, including SIFT [22], DAISY [28] and KAZE [2]. Due to dense extraction at the original image scale without rotation, SIFT does not perform well. So, we also extract all features except ours sparsely on SIFT keypoints and plot PCK curves in Figure 5b. All the prior methods improve (SIFT dramatically so), but our UCN features still perform significantly better even with dense extraction. Also note the improved performance of the convolutional spatial transformer. PCK curves for geometric correspondences on individual semantic classes, such as road or car, are in the supplementary material.

Semantic Correspondence The UCN can also learn semantic correspondences invariant to intra-class appearance or shape variations. We train independently on the PASCAL dataset [9] with various annotations [4, 36] and on the CUB dataset [30], with the same network architecture.

We again use PCK as the metric [32].
To account for variable image size, we consider a predicted keypoint to be correctly matched if it lies within a Euclidean distance α·L of the ground truth keypoint, where L is the size of the image and 0 < α < 1 is a variable we control. For comparison, our definition of L varies depending on the baseline. Since intra-class correspondence alignment is a difficult task, preceding works use either geometric [18] or learned [16] spatial priors. However, even our raw correspondences, without spatial priors, achieve stronger results than previous works.

As shown in Tables 4 and 5, our approach outperforms that of Long et al. [21] by a large margin on the PASCAL dataset with Berkeley keypoint annotation, for most classes and also overall. Note that our result is purely from nearest neighbor matching, while [21] uses global optimization too.

mean | α = 0.1 | α = 0.05 | α = 0.025
conv4 flow [21] | 24.9 | 11.8 | 4.08
SIFT flow | 24.7 | 10.9 | 3.55
fc7 NN | 19.9 | 7.8 | 2.35
ours-RN | 36.0 | 21.0 | 11.5
ours-HN | 38.6 | 23.2 | 13.1
ours-HN-ST | 44.0 | 25.9 | 14.4

Table 5: Mean PCK on the PASCAL-Berkeley correspondence dataset [4] (L = max(w, h)). Even without any global optimization, our nearest neighbor search outperforms all methods by a large margin.

Figure 8: PCK on the CUB dataset [30], compared with various other approaches including WarpNet [16] (L = √(w² + h²)).

Features | SIFT [22] | DAISY [28] | SURF [3] | KAZE [2] | Agrawal et al. [1] | Ours-HN | Ours-HN-ST
Ang. Dev. (deg) | 0.307 | 0.309 | 0.344 | 0.312 | 0.394 | 0.317 | 0.325
Trans. Dev. (deg) | 4.749 | 4.516 | 5.790 | 4.584 | 9.293 | 4.147 | 4.728

Table 6: Essential matrix decomposition performance using various features. Performance is measured as the angular deviation from the ground truth rotation and the angle between the predicted translation and the ground truth translation. All features generate very accurate estimates.
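The PCK metric used throughout can be stated in a few lines (illustrative numpy; the function name is ours):

```python
import numpy as np

def pck(pred, gt, alpha=0.1, L=None, thresh=None):
    """Percentage of correct keypoints.

    A predicted keypoint is correct if it lies closer than `thresh` pixels
    to the ground truth (PCK@T); for semantic matching the threshold is
    alpha * L, with L an image-size normalizer such as max(w, h).
    pred, gt: (N, 2) keypoint coordinate arrays.
    """
    if thresh is None:
        thresh = alpha * L
    err = np.linalg.norm(pred - gt, axis=1)   # per-keypoint pixel error
    return (err < thresh).mean()
```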
We also train and test the UCN on the CUB dataset [30], using the same cleaned test subset as WarpNet [16]. As shown in Figure 8, we outperform WarpNet by a large margin. However, please note that WarpNet is an unsupervised method. Please see Figure 7 for qualitative matches. Results on the FlowWeb datasets are in the supplementary material, with similar trends.

Finally, we observe a significant performance improvement obtained through use of the convolutional spatial transformer, on both the PASCAL and CUB datasets. This shows the utility of estimating an optimal patch normalization in the presence of large shape deformations.

Camera Motion Estimation We use the KITTI raw sequences to get more training examples for this task. To augment the data, we randomly crop and mirror the images, and to make effective use of our fully convolutional structure, we use large images to train thousands of correspondences at once.

We establish correspondences with nearest neighbor matching, use RANSAC to estimate the essential matrix, and decompose it to obtain the camera motion. Among the four candidate rotations, we choose the one with the most inliers as the estimate R_pred, whose angular deviation with respect to the ground truth R_gt is reported as θ = arccos((Tr(R_pred^T R_gt) − 1)/2). Since translation may only be estimated up to scale, we report the angular deviation between unit vectors along the estimated and ground truth translation from GPS-IMU.

In Table 6, we list decomposition errors for various features. Note that sparse features such as SIFT are designed to perform well in this setting, but our dense UCN features are still quite competitive.
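The two error measures reported in Table 6 follow directly from the stated formulas; an illustrative numpy sketch (function names are ours):

```python
import numpy as np

def rotation_deviation_deg(R_pred, R_gt):
    """Angular deviation theta = arccos((Tr(R_pred^T R_gt) - 1) / 2)."""
    c = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards rounding

def translation_deviation_deg(t_pred, t_gt):
    """Angle between unit vectors along the two translations (an essential
    matrix recovers translation only up to scale)."""
    c = t_pred @ t_gt / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```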
Note that intermediate features such as [1] learn to optimize patch similarity; thus, our UCN significantly outperforms them, since it is trained directly on the correspondence task.

5 Conclusion

We have proposed a novel deep metric learning approach to visual correspondence, which is shown to be advantageous over approaches that optimize a surrogate patch similarity objective. We propose several innovations, such as a correspondence contrastive loss in a fully convolutional architecture, on-the-fly active hard negative mining and a convolutional spatial transformer. These lend capabilities such as more efficient training, accurate gradient computations, faster testing and local patch normalization, which lead to improved speed or accuracy. We demonstrate in experiments that our features perform better than prior state-of-the-art on both geometric and semantic correspondence tasks, even without using any spatial priors or global optimization. In future work, we will explore applications to rigid and non-rigid motion or shape estimation, as well as applying global optimization towards applications such as optical flow or dense stereo.

Acknowledgments

This work was part of C. Choy's internship at NEC Labs. We acknowledge the support of Korea Foundation of Advanced Studies, Toyota Award #122282, ONR N00014-13-1-0761, and MURI WF911NF-15-1-0479.

References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. In ICCV, 2015.
[2] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In ECCV, 2012.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 2008.
[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D pose annotations. In ICCV, 2009.
[5] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a Siamese time delay neural network. In NIPS, 1994.
[6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[7] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, June 2005.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.
[10] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In ICIP, 2010.
[11] R. Girshick. Fast R-CNN. ArXiv e-prints, Apr. 2015.
[12] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[13] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. NIPS, 2015.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[16] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly Supervised Matching for Single-view Reconstruction. ArXiv e-prints, Apr. 2016.
[17] A. Kanazawa, A. Sharma, and D. Jacobs. Locally Scale-invariant Convolutional Neural Network. In Deep Learning and Representation Learning Workshop: NIPS, 2014.
[18] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In CVPR. IEEE, 2013.
[19] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its applications. PAMI, 33(5), May 2011.
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
[21] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[23] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
[24] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[25] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical Deformable Dense Matching. Oct. 2015.
[26] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[27] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[28] E. Tola, V. Lepetit, and P. Fua. DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo.
[29] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[30] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[31] H. Yang, W. Y. Lin, and J. Lu. DAISY filter flow: A generalized approach to discrete dense correspondences.
[32] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI, 2013.
[33] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
[34] S. Zagoruyko and N. Komodakis.
Learning to Compare Image Patches via Convolutional Neural Networks.\n\nPAMI, 2010.\n\nIn CVPR, 2014.\n\nCVPR, 2015.\n\n[35] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a CNN. In CVPR, 2015.\n[36] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent,\n\npixel-wise correspondences. In CVPR, June 2015.\n\n9\n\n\f", "award": [], "sourceid": 1253, "authors": [{"given_name": "Christopher", "family_name": "Choy", "institution": "Stanford University"}, {"given_name": "JunYoung", "family_name": "Gwak", "institution": "Stanford University"}, {"given_name": "Silvio", "family_name": "Savarese", "institution": "Stanford University"}, {"given_name": "Manmohan", "family_name": "Chandraker", "institution": "NEC Labs America"}]}