{"title": "R2D2: Repeatable and Reliable Detector and Descriptor", "book": "Advances in Neural Information Processing Systems", "page_first": 12405, "page_last": 12415, "abstract": "Interest point detection and local feature description are fundamental steps in many computer vision applications. Classical approaches are based on a detect-then-describe paradigm where separate handcrafted methods are used to first identify repeatable keypoints and then represent them with a local descriptor. Neural networks trained with metric learning losses have recently caught up with these techniques, focusing on learning repeatable saliency maps for keypoint detection or learning descriptors at the detected keypoint locations. In this work, we argue that repeatable regions are not necessarily discriminative and can therefore lead to the selection of suboptimal keypoints. Furthermore, we claim that descriptors should be learned only in regions for which matching can be performed with high confidence. We thus propose to jointly learn keypoint detection and description together with a predictor of the local descriptor discriminativeness. This allows us to avoid ambiguous areas, thus leading to reliable keypoint detection and description. Our detection-and-description approach simultaneously outputs sparse, repeatable and reliable keypoints that outperform state-of-the-art detectors and descriptors on the HPatches dataset and on the recent Aachen Day-Night localization benchmark.", "full_text": "R2D2: Repeatable and Reliable Detector and Descriptor

Jerome Revaud    Philippe Weinzaepfel    César De Souza    Martin Humenberger

NAVER LABS Europe
firstname.lastname@naverlabs.com

Abstract

Interest point detection and local feature description are fundamental steps in many computer vision applications.
Classical approaches are based on a detect-then-describe paradigm where separate handcrafted methods are used to first identify repeatable keypoints and then represent them with a local descriptor. Neural networks trained with metric learning losses have recently caught up with these techniques, focusing on learning repeatable saliency maps for keypoint detection or learning descriptors at the detected keypoint locations. In this work, we argue that repeatable regions are not necessarily discriminative and can therefore lead to the selection of suboptimal keypoints. Furthermore, we claim that descriptors should be learned only in regions for which matching can be performed with high confidence. We thus propose to jointly learn keypoint detection and description together with a predictor of the local descriptor discriminativeness. This allows us to avoid ambiguous areas, thus leading to reliable keypoint detection and description. Our detection-and-description approach simultaneously outputs sparse, repeatable and reliable keypoints that outperform state-of-the-art detectors and descriptors on the HPatches dataset and on the recent Aachen Day-Night localization benchmark.

1 Introduction

Accurately finding and describing similar points of interest (keypoints) across images is crucial in many applications such as large-scale visual localization [45, 55], object detection [7], pose estimation [31], Structure-from-Motion (SfM) [49] and 3D reconstruction [21]. In these applications, extracted keypoints should be sparse, repeatable and discriminative in order to maximize the matching accuracy with a low memory footprint.
Classical approaches are based on a two-stage pipeline that first detects keypoints [17, 26, 27, 28] and then computes a local descriptor for each keypoint [4, 24].
Specifically, the role of the keypoint detector is to find scale-space locations that are covariant with respect to camera viewpoint changes and invariant with respect to photometric transformations. A number of handcrafted keypoints have been shown to work well in practice, such as corners [17] or blobs [24, 26, 27]. As for the description, various schemes based on histograms of local gradients [4, 6, 23, 42], whose most well known instance is SIFT [24], were proposed and are still widely used.
Despite this apparent success, this paradigm was recently challenged by several data-driven approaches aiming to replace the handcrafted parts [16, 25, 29, 32, 34, 48, 57, 58, 59, 62, 64]. Arguably, handcrafted methods are limited by the a priori knowledge researchers have about the tasks at hand. The point is thus to let a deep network automatically discover which feature extraction process and representation are most suited to the data. The few attempts at learning keypoint detectors [9, 11, 34, 48, 62] have only focused on repeatability. On the other hand, metric learning techniques applied to learning robust local descriptors [25, 32, 57, 58] have recently outperformed traditional descriptors, including SIFT [20]. They are trained on the repeatable locations provided by the detector, which may harm the performance in regions that are repeatable but where accurate matching is not possible.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Toy examples to illustrate the key difference between repeatability (2nd column) and reliability (3rd column) for a given image. Repeatable regions in the first image are only located near the black triangle, however, all patches containing it are equally reliable. In contrast, all squares in the checkerboard pattern are salient hence repeatable, but are not discriminative due to self-similarity.
Figure 1 shows such an example with a checkerboard image: every corner\nor blob is repeatable but matching cannot be performed due to the repetitiveness of the pattern. In\nnatural images, common textures such as the tree leafage, skyscraper windows or sea waves can be\nsalient but hard to match because of their repetitiveness and unstable nature.\nIn this work, we claim that detection and description are inseparably tangled since good keypoints\nshould not only be repeatable but should also be reliable for matching. We thus propose to jointly\nlearn the descriptor reliability seamlessly with the detection and description processes. Our method\nestimates a con\ufb01dence map for each of these two aspects and selects only keypoints which are\nboth repeatable and reliable. More precisely, our network outputs dense local descriptors (one for\neach pixel) as well as two associated repeatability and reliability con\ufb01dence maps. The two maps\nrespectively aim to predict if a keypoint is repeatable and if its descriptor is discriminative, i.e., if it\ncan be accurately matched with high con\ufb01dence. Our keypoints thus correspond to locations that\nmaximize both con\ufb01dence maps.\nTo train the keypoint detector, we employ a novel unsupervised loss that encourages repeatability,\nsparsity and a uniform coverage of the image. As for the local descriptor, we introduce a new loss\nto learn reliable local descriptors while speci\ufb01cally targeting image regions that are meaningful for\nmatching. It is trained with a listwise ranking loss based on a differentiable Average Precision (AP)\nmetric, hereby leveraging recent advances in metric learning [5, 20, 38]. We jointly learn an estimator\nof the descriptor reliability to predict which patches can be matched with a high AP, i.e., that are both\ndiscriminative, robust and in the end that can be accurately matched. 
Experiment results show that\nour elegant formulation of joint detector and descriptor selects keypoints which are both repeatable\nand reliable, leading to state-of-the-art results on the HPatches and Aachen datasets. Our code and\nmodels are available at https://github.com/naver/r2d2.\n\n2 Related work\n\nLocal feature extraction and description have received a continuous in\ufb02ux of attention in the past\nseveral years (cf. surveys in [8, 13, 43, 60]). We focus here on the learning methods only.\n\nLearned descriptors. Most deep feature matching methods have focused on learning the descriptor\ncomponent, applied either on a sparse set of keypoints [3, 25, 29, 53, 54] detected using standard\nhandcrafted methods or densely over the image [12, 32, 47, 56]. The descriptor is usually trained\nusing a metric learning loss that seeks to maximize the similarity of descriptors corresponding to\nthe same patches and minimize it otherwise [1, 16, 25, 57, 58]. To this aim, the triplet loss [14, 52]\nand the contrastive loss [37] have been widely used: they process two or three patches at a time,\nsteadily optimizing the global objective based on local comparisons. Another type of loss, labeled as\nglobal in opposition, have been recently proposed by He et al. [20]. Inspired by advances in listwise\nlosses [19, 61], it consists in a differentiable approximation of the Average-Precision (AP), a standard\nranking metric evaluating the global ranking, which is directly optimized during training. It was\nshown to produce state-of-the-art results in patch and image matching [5, 20, 38]. 
Our approach also optimizes the AP but has several advantages over [20]: (a) the detector is trained jointly with the descriptor, alleviating the drawbacks of sparse handcrafted keypoint detectors; (b) our approach is fully convolutional, outputting dense patch descriptors for an input image instead of being applied patch by patch; (c) our novel AP-based loss jointly learns patch descriptors and an estimate of their reliability, allowing in turn the network to minimize its effort on undistinctive regions.

Learned detectors. The first approach to rely on machine learning for keypoint detection was FAST [41]. Later, Di et al. [10] learn to mimic the output of handcrafted detectors with a compact neural network. In [22], handcrafted and learned filters are combined to detect repeatable keypoints. These two approaches still rely on some handcrafted detectors or filters, while ours is trained end-to-end. QuadNet [48] is an unsupervised approach based on the idea that the ranking of keypoint saliencies is preserved by natural image transformations. In the same spirit, [63] additionally encourages peakiness of the saliency map for keypoint detection on textures. In this paper, we employ a simpler unsupervised formulation that locally enforces the similarity of the saliency maps.

Jointly learned descriptor and detector. In the seminal LIFT approach, Yi et al. [62] introduced a pipeline where keypoints are detected and cropped regions are then fed to a second network to estimate the orientation before going through a third network to perform description. Recently, the SuperPoint approach by DeTone et al. [9] tackles keypoint detection as a supervised task learned from artificially generated training images containing basic structures like corners and edges.
After\nlearning the keypoint detector, a deep descriptor is trained using a second network branch, sharing\nmost of the computation. In contrast, our approach learns both of them jointly from scratch and\nwithout introducing any arti\ufb01cial bias in the keypoint detector. Noh et al. [32] proposed DELF, an\napproach targeted for image retrieval that learns local features as a by-product of a classi\ufb01cation\nloss coupled with an attention mechanism trained using a large-scale dataset of landmark images.\nIn comparison, our approach is unsupervised and trained with relatively little data. More similar to\nour approach, Mishkin et al. [30] recently leverage deep learning to jointly enhance an af\ufb01ne regions\ndetector and local descriptors. Nevertheless, their approach is rooted on a handcrafted keypoint\ndetector that generates seeds for the af\ufb01ne regions, thus not truly learning keypoint detection.\nMore recently, D2-Net [11] uses a single CNN for joint detection and description that share all\nweights; the detection being based on local maxima across the channels and the spatial dimensions of\nthe feature maps. Similarly, Ono et al. [34] train a network from pairs of matching images with a\ncomplicated asymmetric gradient backpropagation scheme for the detection and a triplet loss for the\nlocal descriptor. Compared to these works, we highlight for the \ufb01rst time the importance of treating\nrepeatability and reliability as separate entities represented by their own respective score maps. Our\nnovel AP-based reliability loss allows us to estimate patch reliability according to the AP metric while\nsimultaneously optimizing for the descriptor. In a single batch, each patch is typically compared to\nthousands of other patches. In contrast to Hartmann et al. [18] that predicts reliability given \ufb01xed\ndescriptors, our novel loss tightly couples descriptors and reliability estimates. 
This capability cannot be achieved with the standard contrastive and triplet losses used in prior work. Overall, being able to train a keypoint detector from scratch while jointly predicting reliable descriptors is made possible by our novel losses, which are unlike any of the ones used in [9, 11, 20, 34, 48].

3 Joint learning of reliable and repeatable detectors and descriptors

The proposed approach, referred to as R2D2, aims to predict a set of sparse locations of an input image I that are repeatable and reliable for the purpose of local feature matching. In contrast to classical approaches, we make an explicit distinction between repeatability and reliability. As shown in Figure 1, they are in fact two complementary aspects that must be predicted separately.
We thus propose to train a fully-convolutional network (FCN) that predicts 3 outputs for an image I of size H × W. The first one is a 3D tensor X ∈ R^{H×W×D} that corresponds to a set of dense D-dimensional descriptors, one per pixel. The second one is a heatmap S ∈ [0, 1]^{H×W} whose goal is to provide sparse yet repeatable keypoint locations. To achieve sparsity, we only extract keypoints at locations corresponding to local maxima in S. The third output is an associated reliability map R ∈ [0, 1]^{H×W} that indicates the estimated reliability of descriptor X_ij, i.e., the likelihood that it is good for matching, at each pixel (i, j) with i ∈ {1, . . . , W} and j ∈ {1, . . . , H}.
The network architecture is shown in Figure 2. The backbone is an L2-Net [57], with two minor differences: (a) subsampling is replaced by dilated convolutions in order to preserve the input resolution at all stages, and (b) the last 8 × 8 convolutional layer is replaced by 3 successive 2 × 2 convolutional layers. We found that this latter modification reduces the number of weights by a factor of 5 for a similar accuracy.
The 128-dimensional output tensor serves as input to: (a) an ℓ2-normalization layer to obtain the per-pixel patch descriptors X, (b) an element-wise square operation followed by an additional 1 × 1 convolutional layer and a softmax to obtain the repeatability map S, and (c) an identical second branch to obtain the reliability map R.

Figure 2: Overview of our network for jointly learning repeatable and reliable matches.

3.1 Learning repeatability

As observed in previous works [9, 62], keypoint repeatability is a problem that cannot be tackled by standard supervised training. In fact, using supervision essentially boils down in this case to imitating an existing detector rather than discovering potentially better keypoints. We thus treat repeatability as a self-supervised task and train the network such that the positions of local maxima in S are covariant to natural image transformations like viewpoint or illumination changes.
Let I and I′ be two images of the same scene and let U ∈ R^{H×W×2} be the ground-truth correspondences between them. In other words, if the pixel (i, j) in the first image I corresponds to pixel (i′, j′) in I′, then U_ij = (i′, j′). In practice, U can be estimated using existing optical flow or stereo matching if I and I′ are natural images, or can be obtained exactly if I′ was synthetically generated with a known transformation, e.g., a homography [9], see Section 3.3. Let S and S′ be the repeatability maps of I and I′ respectively, and let S′_U be S′ warped according to U. Ultimately, we want to enforce the fact that all local maxima in S correspond to the ones in S′_U. Our key idea is to maximize the cosine similarity, denoted as cosim in the following, between S and S′_U. When cosim(S, S′_U) is maximized, the two heatmaps are indeed identical and their maxima correspond exactly.
While this is true in ideal conditions, in practice, local occlusions, warp artifacts or border effects make this approach unrealistic. Therefore we reformulate this idea locally, i.e., we average the cosine similarity over many small patches. We define the set of overlapping patches P = {p} that contains all N × N patches in {1, . . . , W} × {1, . . . , H} and define the loss as:

    L_cosim(I, I′, U) = 1 − (1/|P|) \sum_{p ∈ P} cosim(S[p], S′_U[p]),    (1)

where S[p] ∈ R^{N²} denotes the flattened N × N patch p extracted from S, and likewise for S′_U[p]. Note that L_cosim can be minimized trivially by having S and S′_U constant. To avoid this, we employ a second loss function that aims to maximize the local peakiness of the repeatability map:

    L_peaky(I) = 1 − (1/|P|) \sum_{p ∈ P} ( \max_{(i,j) ∈ p} S_ij − \mathrm{mean}_{(i,j) ∈ p} S_ij ).    (2)

Interestingly, this allows us to choose the spatial frequency of local maxima by varying the patch size N, see Section 4.2. Finally, the resulting repeatability loss is composed as a weighted sum of the first loss and the second loss applied to both images:

    L_rep(I, I′, U) = L_cosim(I, I′, U) + (1/2) (L_peaky(I) + L_peaky(I′)).    (3)

3.2 Learning reliability

In addition to the repeatability map S, our network also computes dense local descriptors as well as a heatmap R that predicts the individual reliability R_ij of each descriptor X_ij. The goal is to let the network learn to choose between making descriptors as discriminative as possible or, conversely, sparing its efforts on uninformative regions like the sky or the ground.
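Stepping back to the repeatability loss, Eqs. (1)–(3) can be written down compactly. Below is a minimal NumPy sketch, under stated assumptions: the warping of S′ by U is done elsewhere, patches are taken densely with stride 1, and the helper names (`cosim`, `patches`) are illustrative, not from the released code:

```python
import numpy as np

def cosim(a, b):
    # Cosine similarity between two flattened patches.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def patches(S, N):
    # All overlapping N x N patches of a heatmap, flattened (stride 1).
    H, W = S.shape
    return [S[i:i + N, j:j + N].ravel()
            for i in range(H - N + 1) for j in range(W - N + 1)]

def L_cosim(S, S_warped, N):
    # Eq. (1): one minus the average local cosine similarity.
    sims = [cosim(p, q) for p, q in zip(patches(S, N), patches(S_warped, N))]
    return 1.0 - float(np.mean(sims))

def L_peaky(S, N):
    # Eq. (2): one minus the average (max - mean) peakiness over patches.
    vals = [p.max() - p.mean() for p in patches(S, N)]
    return 1.0 - float(np.mean(vals))

def L_rep(S, S_warped, N):
    # Eq. (3): cosine term plus the peakiness term on both heatmaps.
    return L_cosim(S, S_warped, N) + 0.5 * (L_peaky(S, N) + L_peaky(S_warped, N))
```

The sketch makes the trivial-solution argument concrete: a constant heatmap drives L_cosim to zero but leaves L_peaky at its maximum of 1, which is why both terms are needed.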
To that end, we propose a loss that is minimized when the network can successfully predict the actual descriptor reliability.

As in previous works [1, 16, 25, 57, 58], we cast descriptor matching as a metric learning problem. More specifically, each pixel (i, j) from the first image I is the center of an M × M patch p_ij with descriptor X_ij that we can compare to the descriptors {X′_uv} of all other patches in the second image I′. Knowing the ground-truth correspondence mapping U, we estimate the reliability of patch p_ij using the Average Precision (AP), a standard ranking metric. We ideally want patch descriptors to be as reliable as they can be, i.e., we want to maximize the AP for all patches. We therefore follow He et al. [20] and optimize a differentiable approximation of the AP, denoted as \widetilde{AP}. Training then consists in maximizing the AP computed for each of the B patches {p_ij} in the batch:

    L_AP = (1/B) \sum_{ij} [ 1 − \widetilde{AP}(p_ij) ].    (4)

Local descriptors are extracted at each pixel, but not all locations are equally interesting.
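For reference, the exact (non-differentiable) Average Precision that \widetilde{AP} approximates can be computed for one query patch as follows. This is a minimal sketch, not the approximation of He et al. [20] itself: `scores` are assumed to be descriptor similarities between the query and candidate patches, and `labels` mark the true correspondence(s):

```python
import numpy as np

def average_precision(scores, labels):
    """Exact AP for one query: rank candidates by descending score,
    then average the precision measured at each positive's rank."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels, dtype=bool)[order]
    if not labels.any():
        return 0.0
    hits = np.cumsum(labels)                  # positives seen so far
    ranks = np.arange(1, len(labels) + 1)     # 1-based ranks
    precisions = hits / ranks
    return float(precisions[labels].mean())
```

Eq. (4) then simply averages 1 − AP over the B query patches of a batch; the listwise approximation makes this quantity differentiable so it can be optimized directly.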
In particular, uniform regions or elongated 1D patterns are known to lack the distinctiveness necessary for accurate matching [15]. More interestingly, even well-textured regions are also known to be unreliable due to their unstable nature, such as tree leafages or ocean waves. It thus becomes clear that optimizing the patch descriptor even in such image regions can hinder performance. We therefore propose to enhance the AP loss to spare the network from wasting its efforts on undistinctive regions:

    L_{AP,R} = (1/B) \sum_{ij} [ 1 − \widetilde{AP}(p_ij) R_ij + κ (1 − R_ij) ],    (5)

where κ ∈ [0, 1] is a hyperparameter that represents the AP threshold above which a patch is considered reliable. We found that κ = 0.5 yields good results in practice and we use this value in the rest of the paper. Figure 3 shows the loss L_{AP,R} for a given patch p_ij as a function of \widetilde{AP}(p_ij) and R_ij. For reliable patches (i.e., AP > κ), the loss encourages maximizing the AP. Conversely, when AP < κ, the loss encourages the reliability to be low. This way, learning converges to a region where there is almost no gradient (at R_ij ≈ 0), hence having barely any effect on descriptors that belong to undistinctive image regions. Note that a similar idea of jointly training the descriptor and an associated confidence was recently proposed in [33], but using a triplet loss, which prevents the use of an interpretable threshold κ as in our case.

Figure 3: Visualization of our proposed loss L_{AP,R}.

3.3 Inference and training details

Runtime. At test time, we run the trained network multiple times on the input image at different scales, starting at L = 1024 pixels and downsampling by 2^{1/4} each time until L < 256 pixels, where L denotes the largest dimension of the image. For each scale, we find local maxima in S and gather descriptors from X at the corresponding locations.
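The multi-scale extraction just described can be sketched as follows. This is a simplified NumPy sketch under stated assumptions: sizes are rounded to the nearest pixel, a scale is processed while the longest side stays at or above 256 px, and the local-maximum test uses a plain 3 × 3 neighborhood with borders ignored:

```python
import numpy as np

def scale_pyramid(longest_side, start=1024, stop=256):
    """Sizes of the longest image side used at test time: start at 1024 px
    and divide by 2**(1/4) each time until the size drops below 256 px."""
    sizes, k = [], 0
    L = float(min(longest_side, start))
    while L / 2 ** (k / 4) >= stop:
        sizes.append(int(round(L / 2 ** (k / 4))))
        k += 1
    return sizes

def local_maxima(S, threshold=0.0):
    """Positions (i, j) where S is a strict local maximum of its 3x3
    neighborhood; keypoint descriptors are then gathered from X there."""
    H, W = S.shape
    out = []
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            patch = S[i - 1:i + 2, j - 1:j + 2]
            if S[i, j] > threshold and S[i, j] == patch.max() and (patch == S[i, j]).sum() == 1:
                out.append((i, j))
    return out
```

For a 1024-px image this schedule yields nine scales (1024, 861, 724, ..., 256), i.e., each pass shrinks the image by about 16%.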
Finally, we keep a shortlist of the best K descriptors\nover all scales where the score of descriptor X ij is computed as SijRij, i.e. requiring both repeatable\nand reliable keypoints. In practice, processing a 1M pixel image on a Tesla P100-SXM2 GPU takes\nabout 0.5s to extract keypoints at a single scale (full image) and 1s for all scales.\nTraining data. We use three sources of data to train our method: (a) distractors from a retrieval\ndataset [36] (i.e., random web images), from which we build synthetic image pairs by applying random\ntransformations (homography and color jittering), (b) images from the Aachen dataset [44, 46], using\nthe same strategy to build synthetic pairs, and (c) pairs of nearby views from the Aachen dataset\nwhere we obtain a pseudo ground-truth using optical \ufb02ow (see below). All sources are represented\napproximately equally (about 4000 images each) and we study their importance in Section 4.4. Note\nthat we do not use any image from the HPatches evaluation dataset [2] during training.\nGround-truth correspondences. To generate dense ground-truth correspondences between two\nimages of the same scene, we leverage existing matching techniques. As in previous works [11, 34],\nwe use points veri\ufb01ed by Structure-from-Motion that we enhance by designing a pipeline based on\noptical \ufb02ow tools to reliably extract dense correspondences. As a \ufb01rst step, we run a SfM pipeline [49]\nthat outputs a list of 3D points and a 6D camera pose for each image. For each image pair with\nsuf\ufb01cient overlap (i.e., with some common 3D points), we then compute the fundamental matrix.\nNext, we compute high-quality dense correspondences using EpicFlow [39]. 
We enhance it by adding epipolar constraints in DeepMatching [40], the first step of EpicFlow that produces semi-sparse matches. In addition, we also predict a mask where the flow is reliable, as optical flow is defined at every pixel, even in occluded areas. We post-process the output of DeepMatching by computing a graph of connected consistent neighbors, and keeping only matches belonging to large connected components (at least 50 matches). The mask is defined using a thresholded kernel density estimator on the verified matches.

(a) input image; (b) repeatability heatmap S for N = 64; (c) repeatability heatmap S for N = 32; (d) repeatability heatmap S for N = 16; (e) repeatability heatmap S for N = 8; (f) repeatability heatmap S for N = 4.
Figure 4: Sample repeatability heatmaps obtained when training the repeatability loss L_rep from Eq. (3) with different patch sizes N. Red and green colors denote low and high values, respectively.

Training parameters. We optimize the network using Adam for 25 epochs with a fixed learning rate of 0.0001, a weight decay of 0.0005 and a batch size of 8 pairs of images cropped to 192 × 192.
Sampling issues for the AP loss. To have a setup as realistic as possible given hardware constraints, we subsample "query" patches in the first image on a regular grid of 8 × 8 pixels. To handle the inherent imperfection of the optical flow, we define a single positive per query patch p_ij in the second image as the one with the most similar descriptor within a radius of 3 pixels from the ground-truth position U_ij. Negatives are defined as patches more than 5 pixels away from U_ij and are sampled on an 8 × 8 regular grid.

4 Experiments

4.1 Datasets and metrics

We evaluate our method on the full image sequences of the HPatches dataset [2].
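The positive/negative sampling rule for the AP loss (Section 3.3) can be sketched as follows. This is a simplified sketch: candidates are labeled purely by their distance to the ground-truth position U_ij, while the descriptor-based choice of the single positive and the 8 × 8 sampling grids are omitted; the function name is illustrative:

```python
import numpy as np

def label_candidates(gt_xy, candidates_xy, pos_radius=3.0, neg_min_dist=5.0):
    """Label candidate positions in the second image for one query patch:
    1 = potential positive (within pos_radius of the ground truth U_ij),
    0 = negative (farther than neg_min_dist),
    -1 = ignored (in between, since the pseudo ground-truth flow is imperfect)."""
    d = np.linalg.norm(np.asarray(candidates_xy, float) - np.asarray(gt_xy, float), axis=1)
    return np.where(d <= pos_radius, 1, np.where(d > neg_min_dist, 0, -1))
```

The 3-to-5-pixel "ignored" band is the design choice that absorbs small optical-flow errors: a candidate there is neither rewarded nor penalized.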
The HPatches\ndataset contains 116 scenes where the \ufb01rst image is taken as a reference and subsequent images in a\nsequence are used to form pairs with increasing dif\ufb01culty. This dataset can also be further separated\ninto 57 sequences containing large changes in illumination and 59 with large changes in viewpoint.\nRepeatability. Following [27], we compute the repeatability score for a pair of images as the number\nof point correspondences found between the two images divided by the minimum number of keypoint\ndetections in the image pair. We report the average score over all image pairs.\nMatching score (M-score). We follow the de\ufb01nitions given in [9, 62]. The matching score is the\naverage ratio between ground-truth correspondences that can be recovered by the whole pipeline and\nthe total number of estimated features within the shared viewpoint region when matching points from\nthe \ufb01rst image to the second and the second image to the \ufb01rst one.\nMean Matching Accuracy (MMA). We use the same de\ufb01nition as in [11] where the matching\naccuracy is the average percentage of correct matches in an image pair considering multiple pixel\nerror thresholds. When reporting the MMA, i.e. the average score for each threshold over all image\npairs, we exclude as in [11] a few image sequences having an excessive resolution. Furthermore, we\nalso report the MMA@3, i.e. the MMA for a speci\ufb01c error threshold of 3 pixels.\n\n4.2 Parameter study\nImpact of N. We \ufb01rst evaluate the impact of the patch size N used in the repeatability loss Lrep,\nsee Equation 3. 
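The repeatability score defined above (following [27]) can be sketched as follows. This is a simplified sketch under stated assumptions: keypoints from the second image are taken as already warped into the first image's frame, and a correspondence is counted whenever a keypoint has a counterpart within `eps` pixels (the exact correspondence criterion of [27] is region-overlap based):

```python
import numpy as np

def repeatability(kpts1, kpts2_warped, eps=3.0):
    """Number of correspondences found between the two keypoint sets,
    divided by the minimum number of detections in the image pair."""
    k1 = np.asarray(kpts1, float)
    k2 = np.asarray(kpts2_warped, float)
    if len(k1) == 0 or len(k2) == 0:
        return 0.0
    # Pairwise distances between every keypoint of image 1 and image 2.
    d = np.linalg.norm(k1[:, None, :] - k2[None, :, :], axis=2)
    matched = int((d.min(axis=1) <= eps).sum())
    return float(matched) / min(len(k1), len(k2))
```

Normalizing by the smaller detection count keeps the score comparable when the two images yield different numbers of keypoints.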
It essentially controls the number of keypoints, as the loss ideally encourages the network to output a single local maximum per window of size N × N. Figure 4 shows different repeatability maps S obtained from the same input image with various N. When N is large, our method outputs few highly-repeatable keypoints, and conversely for smaller values of N. Note that the network even learns to populate empty regions like the sky with a grid-like pattern when N is small, while it avoids them when N is large. We also plot the MMA@3 and the M-score on the HPatches dataset in Figure 5 for various N as a function of the number of retained keypoints K per image. Models trained with a large N outperform those with a lower N when the number of retained keypoints K is low, since these keypoints have a higher quality. When keeping more keypoints, poor local maxima start to get selected for these models (e.g., in the sky or the river in Figure 4) and the matching performance drops. However, having numerous keypoints is important for many applications such as visual localization, because it augments the chance that at least a few of them will be correctly matched despite occlusions or other noise sources. There is therefore a trade-off between the number of keypoints and the matching performance.

Figure 5: MMA@3 and M-score for different patch sizes N on the HPatches dataset, as a function of the number of retained keypoints K per image.

Repeatability | Reliability | Keypoint score | MMA@3         | M-score
 —            | ✓           | R_ij           | 0.588 ± 0.010 | 0.361 ± 0.011
 ✓            | —           | S_ij           | 0.639 ± 0.034 | 0.432 ± 0.033
 ✓            | ✓           | R_ij S_ij      | 0.688 ± 0.009 | 0.470 ± 0.011

Table 1: Ablative study on HPatches. We report the M-score and the MMA at a 3px error threshold for our method, as well as for our approach without repeatability (top row) or without reliability (middle row).
In the following experiments, and\nunless stated otherwise, we use N = 16 and K = 5000.\nImpact of separate reliability and repeatability. Our main contribution is to show that separately\npredicting repeatability and reliability is key to improve the \ufb01nal matching performance. Table 1\nreports the performance aggregated over 5 independent runs when (a) removing the repeatability map,\nin which case keypoints are de\ufb01ned by maxima of the reliability map, or (b) removing the reliability\nmap and loss, i.e., only using the AP loss formulation of Equation 4. In both cases, the performance\ndrops in terms of MMA@3 and M-score. This highlights that repeatability is not well correlated with\nthe descriptor reliability, and shows the importance of estimating the reliability of descriptors. In the\nfollowing, we select an \u201caverage\u201d model (with 0.686 MMA@3px) for all subsequent experiments.\nFigure 6 shows the repeatability and reliability heatmaps obtained for a few images. Our network\ntrained with reliability loss is able to eliminate regions that cannot be accurately matched, such as the\nsky 6(a,d) or repetitive patterns arti\ufb01cially printed on top of the pepper photography 6(c). Note that\nthe network has never seen the arti\ufb01cial patterns in 6(c) during training but is still able to reject them.\nMore complex patterns are also discarded, such as the river in 6(a), the paved ground in 6(d), various\n1-D structures in 6(a,d) or the central white building with repetitive structures in 6(a). Even though\nthe reliability appears to be high in these regions, it is in fact slightly inferior, resulting in keypoints\nbeing scored lower which are therefore not retained in the top-K \ufb01nal output (top row of Figure 6).\nSingle-scale experiments. To assess the importance of the multi-scale feature extraction (Sec-\ntion 3.3), we evaluate our model at a single-scale (full image size). 
We obtain 0.651 MMA@3px compared to 0.686 MMA@3px in the multi-scale setting.

4.3 Comparison with the state of the art

We now compare our approach to state-of-the-art detectors and descriptors on HPatches.

[Figure: MMA@3 (left) and M-score (right) as a function of the number of keypoints per image K, for N ∈ {4, 8, 16, 32, 64}.]

Figure 6: For one given input image (1st row), we show the repeatability (2nd row) and reliability heatmaps (3rd row), extracted at a single scale and overlaid onto the original image. The reliability heatmap's color scale is enhanced for the sake of visualization. Top-scoring keypoints are shown as green crosses in the first image. They tend to avoid uniform and repetitive patterns (sky, ground, ...).

Table 2: Comparison with QuadNet [48] and a handcrafted Difference of Gaussians (DoG) in terms of detector repeatability on the Oxford dataset, with a varying number of keypoints K ∈ {300, 600, 1200, 2400, 3000}. Sequences cover viewpoint perspective (VP: graf, wall), zoom and rotation (Z+R: bark, boat), luminosity (L: leuven), blur (B: bikes, trees) and JPEG compression (JPEG: ubc).

Detector repeatability. We first evaluate our approach in terms of repeatability. Following [48], we report the repeatability on the Oxford dataset [28], a subset of HPatches, for which the transformations applied to the sequences are known and include JPEG compression (JPEG), blur (Blur), zoom and rotation (Z+R), luminosity (L), and viewpoint perspective (VP). Table 2 shows a comparison with QuadNet [48] and the handcrafted Difference of Gaussians (DoG) used in SIFT [24] on this dataset when varying the number of interest points. Overall, our approach significantly outperforms these two methods, in particular for a high number of interest points. This demonstrates the excellent repeatability of our detector. Note that training on the Aachen dataset may obviously help for street views. Our approach performs well even in the cases of blur or rotation (bark, boat), although we did not train the network for such challenging cases.
Mean Matching Accuracy. We next compare the mean matching accuracy on HPatches with DELF [32], SuperPoint [9], LF-Net [34], mono- and multi-scale D2-Net [11], HardNet++ descriptors with HesAffNet regions [29, 30] (HAN + HN++) and a handcrafted Hessian affine detector with RootSIFT descriptors [35]. Figure 7 shows the results for illumination and viewpoint changes as well as the overall performance.
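The mean matching accuracy (MMA) reported here measures, for each image pair, the fraction of putative matches whose reprojection error under the ground-truth homography falls below a pixel threshold. A minimal sketch (the `mma` helper and its signature are our illustration, not the benchmark code):

```python
import numpy as np

def mma(kpts1, kpts2, H, thresholds=(1, 2, 3)):
    """Mean matching accuracy: fraction of putative matches whose
    reprojection error under the ground-truth homography H is below
    each pixel threshold. kpts1, kpts2 are N x 2 arrays of matched
    (x, y) coordinates; H is the 3x3 homography mapping image 1 to 2."""
    # Project keypoints from image 1 into image 2 (homogeneous coordinates).
    pts = np.hstack([kpts1, np.ones((len(kpts1), 1))]) @ H.T
    proj = pts[:, :2] / pts[:, 2:3]
    err = np.linalg.norm(proj - kpts2, axis=1)
    return {t: float((err <= t).mean()) for t in thresholds}

# Toy check with the identity homography: three perfect matches and one
# correspondence displaced by 5 px, which fails every threshold up to 3 px.
k1 = np.array([[10.0, 10.0], [20.0, 30.0], [40.0, 5.0], [8.0, 50.0]])
k2 = k1.copy()
k2[3] += 5.0  # one bad correspondence
print(mma(k1, k2, np.eye(3)))  # -> {1: 0.75, 2: 0.75, 3: 0.75}
```

The curves in Figure 7 are obtained by averaging this per-pair accuracy over the dataset for each threshold.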
R2D2 significantly outperforms the state of the art at nearly all error thresholds. The only exception is DELF in the case of illumination changes, which can be explained by its fixed grid of keypoints, while this subset contains no spatial changes. Interestingly, our method significantly outperforms joint detector-and-descriptors such as D2-Net [11], in particular at low error thresholds, showing that our keypoints benefit from our joint training with repeatability and reliability.

Figure 7: Comparison with the state of the art in terms of MMA on the HPatches dataset.

Figure 8: Sample results using reciprocal nearest matching. Correct and incorrect correspondences are shown as green dots and red crosses, respectively.

Method            #kpts    dim   #weights   0.5m, 2°   1m, 5°   5m, 10°
RootSIFT [24]      11K     128        -       33.7       52.0      65.3
HAN+HN [30]        11K     128      2 M       37.8       54.1      75.5
SuperPoint [9]      7K     256    1.3 M       42.8       57.1      75.5
DELF (new) [32]    11K    1024      9 M       39.8       61.2      85.7
D2-Net [11]        19K     512     15 M       44.9       66.3      88.8
R2D2, N = 16        5K     128    0.5 M       45.9       65.3      86.7
R2D2, N = 8        10K     128    1.0 M       45.9       66.3      88.8

Table 3: Comparison to the state of the art on the Aachen Day-Night dataset [44] for the visual localization task. The last row is obtained with an increased number of channels.

Training data        HPatches        Aachen Day-Night
W  A  S  F           MMA@3px    0.5m, 2°   1m, 5°   5m, 10°
✓                     0.669       43.9       61.2      77.6
✓  ✓                  0.689       42.9       60.2      78.6
✓  ✓  ✓               0.667       42.9       61.2      84.7
✓  ✓     ✓            0.686       43.9       63.3      86.7
✓  ✓  ✓  ✓            0.719       45.9       65.3      86.7

Table 4: Ablation study for the training data. W=web images; A=Aachen-day images; S=Aachen-day-night pairs from automatic style transfer; F=Aachen-day real image pairs.
For W,A,S we use random homographies; for F, optical flow.

Matching score. At an error threshold of 3 pixels, we obtain an M-score of 0.453, compared to 0.335 for LF-Net [34] and 0.288 for SIFT [24], demonstrating the benefit of our matching approach.
Qualitative results. Figure 8 shows two examples with a drastic change of viewpoint (left) and illumination (right). Our matches cover the entire image and most of them are correct (green dots).

4.4 Applications to visual localization

We additionally provide results for the visual localization task on the Aachen Day-Night dataset [44], as in D2-Net [11]. This corresponds to a realistic application scenario beyond traditional matching metrics. The goal is to find the camera poses of night images (not included in training), given images taken during the day in the same area with their known poses. We follow the "Visual Localization Benchmark" guidelines: we use a pre-defined visual localization pipeline based on COLMAP [50, 51], with our matches as input. They serve to reconstruct an SfM model in which the test images are registered. Reported metrics are the percentages of successfully localized images within 3 error thresholds.
For the localization task, we include an additional source of data, denoted as S, comprising night images automatically obtained from daytime Aachen images by applying style transfer. In Table 3, we compare our approach to the state of the art on the Aachen Day-Night localization task. Our approach outperforms all competing approaches at the time of submission. Table 4 shows the impact of the different sources of training data, with N = 16 and K = 5000 kpts/img (same settings as the last row but one in Table 3).
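The M-score and the matches shown in Figure 8 rely on reciprocal (mutual) nearest-neighbor matching of the 128-D descriptors. A brute-force sketch of this cross-check (the `mutual_nn_matches` helper is our illustration, assuming L2-normalized descriptors):

```python
import numpy as np

def mutual_nn_matches(desc1, desc2):
    """Reciprocal nearest-neighbor matching: keep pair (i, j) only if
    desc2[j] is the nearest neighbor of desc1[i] AND vice versa.
    desc1, desc2 are N1 x D and N2 x D L2-normalized descriptors."""
    # For unit-norm descriptors, maximizing the dot product is
    # equivalent to minimizing the Euclidean distance.
    sim = desc1 @ desc2.T
    nn12 = sim.argmax(axis=1)        # best match in image 2 for each i
    nn21 = sim.argmax(axis=0)        # best match in image 1 for each j
    ids = np.arange(len(desc1))
    mutual = nn21[nn12[ids]] == ids  # reciprocal cross-check
    return np.stack([ids[mutual], nn12[mutual]], axis=1)

# Toy example: 128-D descriptors where rows 0 and 1 swap order in image 2.
rng = np.random.default_rng(1)
d1 = rng.normal(size=(3, 128))
d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
d2 = d1[[1, 0, 2]] + 0.01 * rng.normal(size=(3, 128))
d2 /= np.linalg.norm(d2, axis=1, keepdims=True)
print(mutual_nn_matches(d1, d2))  # -> [[0 1] [1 0] [2 2]]
```

The cross-check discards one-sided matches, which is what keeps most of the correspondences in Figure 8 correct at the cost of match density.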
We first note that training only with random web images and random homographies already yields high performance on both tasks: state-of-the-art on HPatches, and significantly better than SIFT, HAN, and SuperPoint for the localization task, showing the excellent generalization capability of our method. Adding other data sources leads to small performance gains. We point out that our network architecture is significantly smaller than other networks (up to 15× fewer weights) while also generating far fewer keypoints per image. Our keypoint descriptors are also much more compact (128-D only) compared to SuperPoint [9], DELF [32] and D2-Net [11] (resp. 256-, 1024- and 512-dimensional descriptors).

References

[1] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. arXiv preprint arXiv:1601.05030, 2016. 2, 3.2
[2] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017. 3.3, 4.1
[3] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, 2016. 2
[4] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006. 1
[5] F. Cakir, K. He, X. Xia, B. Kulis, and S. Sclaroff. Deep metric learning to rank. In CVPR, 2019. 1, 2
[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In ECCV, 2010. 1
[7] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004. 1
[8] G. Csurka, C. R. Dance, and M. Humenberger. From handcrafted to deep local invariant features. arXiv preprint arXiv:1807.10254, 2018. 2
[9] D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPR, 2018. 1, 2, 3.1, 4.1, 4.3, 4.3, 4.4
[10] P. Di Febbo, C. Dal Mutto, K. Tieu, and S. Mattoccia. KCNN: Extremely-efficient hardware keypoint detection with a compact convolutional neural network. In CVPR, 2018. 2
[11] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler. D2-Net: A trainable CNN for joint detection and description of local features. In CVPR, 2019. 1, 2, 3.3, 4.1, 4.3, 4.3, 4.4
[12] M. E. Fathy, Q.-H. Tran, M. Zeeshan Zia, P. Vernaza, and M. Chandraker. Hierarchical metric learning and matching for 2D and 3D geometric correspondences. In ECCV, 2018. 2
[13] S. Gauglitz, T. Höllerer, and M. Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. IJCV, 2011. 2
[14] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016. 2
[15] K. Grauman and B. Leibe. Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2011. 3.2
[16] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015. 1, 2, 3.2
[17] C. G. Harris, M. Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, 1988. 1
[18] W. Hartmann, M. Havlena, and K. Schindler. Predicting matchability. In CVPR, 2014. 2
[19] K. He, F. Cakir, S. A. Bargal, and S. Sclaroff. Hashing as tie-aware learning to rank. In CVPR, 2018. 2
[20] K. He, Y. Lu, and S. Sclaroff. Local descriptors optimized for average precision. In CVPR, 2018. 1, 1, 2, 2, 3.2
[21] J. Heinly, J. L. Schonberger, E. Dunn, and J.-M. Frahm. Reconstructing the world* in six days* (as captured by the Yahoo 100 million image dataset). In CVPR, 2015. 1
[22] A. B. Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk. Key.Net: Keypoint detection by handcrafted and learned CNN filters. arXiv preprint arXiv:1904.00889, 2019. 2
[23] S. Leutenegger, M. Chli, and R. Siegwart. BRISK: Binary robust invariant scalable keypoints. In ICCV, 2011. 1
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 1, 4.3, 4.3, 4.3
[25] Z. Luo, T. Shen, L. Zhou, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan. ContextDesc: Local descriptor augmentation with cross-modality context. In CVPR, 2019. 1, 2, 3.2
[26] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 2004. 1
[27] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. IJCV, 2004. 1, 4.1
[28] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 2005. 1, 4.3
[29] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Working hard to know your neighbor's margins: Local descriptor learning loss. In NIPS, 2017. 1, 2, 4.3
[30] D. Mishkin, F. Radenovic, and J. Matas. Repeatability is not enough: Learning affine regions via discriminability. In ECCV, 2018. 2, 4.3, 4.3
[31] J. Nath Kundu, R. MV, A. Ganeshan, and R. Venkatesh Babu. Object pose estimation from monocular image using multi-view keypoint correspondence. In ECCV, 2018. 1
[32] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han. Large-scale image retrieval with attentive deep local features. In CVPR, 2017. 1, 2, 2, 4.3, 4.3, 4.4
[33] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Self-supervised learning of geometrically stable features through probabilistic introspection. In CVPR, 2018. 3.2
[34] Y. Ono, E. Trulls, P. Fua, and K. M. Yi. LF-Net: Learning local features from images. In NIPS, 2018. 1, 2, 3.3, 4.3, 4.3
[35] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In CVPR, 2009. 4.3
[36] F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018. 3.3
[37] F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016. 2
[38] J. Revaud, J. Almazan, R. S. de Rezende, and C. R. de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019. 1, 2
[39] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015. 3.3
[40] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical deformable dense matching. IJCV, 2016. 3.3
[41] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In ICCV, 2005. 2
[42] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011. 1
[43] E. Salahat and M. Qasaimeh. Recent advances in features extraction and description algorithms: A comprehensive survey. In ICIT, 2017. 2
[44] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla. Benchmarking 6DOF outdoor visual localization in changing conditions. In CVPR, 2018. 3.3, 3, 4.4
[45] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are large-scale 3D models really necessary for accurate visual localization? In CVPR, 2017. 1
[46] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt. Image retrieval for image-based localization revisited. In BMVC, 2012. 3.3
[47] N. Savinov, L. Ladicky, and M. Pollefeys. Matching neural paths: Transfer from recognition to correspondence search. In NIPS, 2017. 2
[48] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys. Quad-networks: Unsupervised learning to rank for interest point detection. In CVPR, 2017. 1, 2, 2, 2, 4.3
[49] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016. 1, 3.3
[50] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016. 4.4
[51] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016. 4.4
[52] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015. 2
[53] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015. 2
[54] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE Trans. on PAMI, 2014. 2
[55] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson. City-scale localization for cameras with known vertical direction. IEEE Trans. on PAMI, 2016. 1
[56] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018. 2
[57] Y. Tian, B. Fan, and F. Wu. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In CVPR, 2017. 1, 2, 3, 3.2
[58] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas. SOSNet: Second order similarity regularization for local descriptor learning. In CVPR, 2019. 1, 2, 3.2
[59] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In NIPS, 2012. 1
[60] T. Tuytelaars, K. Mikolajczyk, et al. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 2008. 2
[61] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In NIPS, 2016. 2
[62] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In ECCV, 2016. 1, 2, 3.1, 4.1
[63] L. Zhang and S. Rusinkiewicz. Learning to detect features in texture images. In CVPR, 2018. 2
[64] M. Zieba, P. Semberecki, T. El-Gaaly, and T. Trzcinski. BinGAN: Learning compact binary descriptors with a regularized GAN. In NIPS, 2018. 1