{"title": "GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs", "book": "Advances in Neural Information Processing Systems", "page_first": 6992, "page_last": 7003, "abstract": "Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. An approach for transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, the feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of the transformations. Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.", "full_text": "GIFT: Learning Transformation-Invariant\nDense Visual Descriptors via Group CNNs\n\nYuan Liu Zehong Shen Zhixuan Lin Sida Peng Hujun Bao\u2217 Xiaowei Zhou\u2217\nState Key Lab of CAD&CG, ZJU-Sensetime Joint Lab of 3D Vision, Zhejiang University\n\nAbstract\n\nFinding local correspondences between images with different viewpoints requires\nlocal descriptors that are robust against geometric transformations. An approach\nfor transformation invariance is to integrate out the transformations by pooling the\nfeatures extracted from transformed versions of an image. 
However, the feature\npooling may sacri\ufb01ce the distinctiveness of the resulting descriptors. In this paper,\nwe introduce a novel visual descriptor named Group Invariant Feature Transform\n(GIFT), which is both discriminative and robust to geometric transformations. The\nkey idea is that the features extracted from the transformed versions of an image\ncan be viewed as a function de\ufb01ned on the group of the transformations. Instead of\nfeature pooling, we use group convolutions to exploit underlying structures of the\nextracted features on the group, resulting in descriptors that are both discriminative\nand provably invariant to the group of transformations. Extensive experiments show\nthat GIFT outperforms state-of-the-art methods on several benchmark datasets and\npractically improves the performance of relative pose estimation.\n\n1\n\nIntroduction\n\nEstablishing local feature correspondences between images is a fundamental problem in many\ncomputer vision tasks such as structure from motion [21], visual localization [17], SLAM [43], image\nstitching [5] and image retrieval [47]. Finding reliable correspondences requires image descriptors\nthat effectively encode distinctive image patterns while being invariant to geometric and photometric\nimage transformations caused by viewpoint and illumination changes.\nTo achieve the invariance to viewpoints, traditional methods [36, 37] use patch detectors [33, 39]\nto extract transformation covariant local patches which are then normalized for transformation\ninvariance. Then, invariant descriptors can be extracted on the detected local patches. However,\na typical image may have very few pixels for which viewpoint covariant patches can be reliably\ndetected [22]. Also, \u201chand-crafted\" detectors such as DoG [36] and Af\ufb01ne-Harris [39] are sensitive to\nimage artifacts and lighting conditions. 
Reliably detecting covariant regions is still an open problem\n[29, 11] and a performance bottleneck in the traditional pipeline of correspondence estimation.\nInstead of relying on a sparse set of covariant patches, some recent works [20, 7, 11] propose to\nextract dense descriptors by feeding the whole image into a convolutional neural network (CNN) and\nconstructing pixel-wise descriptors from the feature maps of the CNN. However, the CNN-based\ndescriptors are usually sensitive to viewpoint changes as convolutions are inherently not invariant\nto geometric transformations. While augmenting training data with warped images improves the\nrobustness of learned features, the invariance is not guaranteed and a larger network is typically\nrequired to \ufb01t the augmented datasets.\nIn order to explicitly improve invariance to geometric transformations, some works [60, 55, 22] resort\nto integrating out the transformations by pooling the features extracted from transformed versions of\n\n\u2217Corresponding authors: {xzhou,bao}@cad.zju.edu.cn. Project page: https://zju3dv.github.io/GIFT.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthe original images. But the distinctiveness of extracted features may degenerate due to the pooling\noperation.\nIn this paper, we propose a novel CNN-based dense descriptor, named Group Invariant Feature\nTransform (GIFT), which is both discriminative and invariant to a group of transformations. The key\nidea is that, if an image is regarded as a function de\ufb01ned on the translation group, the CNN features\nextracted from multiple transformed images can be treated as a function de\ufb01ned on the transformation\ngroup. Analogous to local image patterns, such features on the group also have discriminative patterns,\nwhich are neglected by the previous methods that use pooling for invariance. 
We argue that exploiting the underlying structures of the group features is essential for building discriminative descriptors. It can be theoretically demonstrated that transforming the input image with any element in the group results in a permutation of the group features. Such a permutation preserves the local structures of the group features. Thus, we propose to use group convolutions to encode the local structures of the group features, resulting in feature representations that are not only discriminative but also equivariant to the transformations in the group. Finally, the intermediate representations are bilinearly pooled to obtain provably invariant descriptors. This transformation-invariant dense descriptor simplifies correspondence estimation, as detecting covariant patches can be avoided. Without the need for patch detectors, the proposed descriptor can be combined with any interest point detector for sparse feature matching or even with a uniformly sampled grid for dense matching.\nWe evaluate the performance of GIFT on the HPSequences [1, 30] dataset and the SUN3D [59] dataset for correspondence estimation. The results show that GIFT outperforms both traditional descriptors and recent learned descriptors. We further demonstrate the robustness of GIFT to extremely large scale and orientation changes on several new datasets. The current unoptimized implementation of GIFT runs at ∼15 fps on a GTX 1080 Ti GPU, which is sufficiently fast for practical applications.\n\n2 Related work\n\nExisting pipelines for feature matching usually rely on a feature detector and a feature descriptor. Feature detectors [33, 36, 39] detect local patches which are covariant to the geometric transformations brought by viewpoint changes. Then, invariant descriptors can be extracted on the normalized local patches via traditional patch descriptors [36, 6, 48, 3] or deep metric learning based patch descriptors [63, 19, 40, 52, 37, 2, 18, 23, 64]. 
The robustness of detectors can be guaranteed theoretically, e.g., by the scale-space theory [34]. However, a typical image often has very few pixels for which viewpoint covariant patches can be reliably detected [22]. The scarcity of reliably detected patches becomes a performance bottleneck in the traditional pipeline of correspondence estimation. Some recent works [45, 61, 29, 41, 64, 12, 62] try to learn such viewpoint covariant patch detectors with CNNs. However, the definition of a canonical scale or orientation is ambiguous, and detecting a consistent scale or orientation for every pixel remains challenging.\nTo alleviate the dependency on detectors, A-SIFT [42] warps original image patches by affine transformations and exhaustively searches for the best match. Some other methods [60, 22, 55, 13] follow similar pipelines but pool the features extracted from these transformed patches to obtain invariant descriptors. GIFT also transforms images, but instead of using feature pooling, it applies group convolutions to further exploit the underlying structures of the features extracted from the group of transformed images, which retains the distinctiveness of the resulting descriptors.\nFeature map based descriptor. Descriptors can also be directly extracted from the feature maps of CNNs [20, 11, 7]. However, CNNs are not naturally invariant to geometric transformations. The common strategy to make CNNs invariant to geometric transformations is to augment the training data with such transformations. However, data augmentation cannot guarantee invariance on unseen data. The Universal Correspondence Network (UCN) [7] uses a convolutional spatial transformer [26] in the network to normalize the local patches to a canonical shape. However, learning an invariant spatial transformer is as difficult as learning a viewpoint covariant detector. 
Our method also uses\nCNNs to extract features on transformed images but applies subsequent group convolutions to\nconstruct transformation-invariant descriptors.\nEquivariant or invariant CNNs. Some recent works [8, 38, 28, 10, 58, 15, 27, 9, 28, 57, 38, 24, 14,\n16, 4] design special architectures to make CNNs equivariant to speci\ufb01c transformations. The most\nrelated work is the Group Equivariant CNN [8] which uses group convolution and subgroup pooling\n\n2\n\n\fFigure 1: Pipeline. The input image is warped with different transformations and fed into a vanilla\nCNN to extract group features. Then the group features for each interest point are further processed\nby two group CNNs and a bilinear pooling operator to obtain \ufb01nal GIFT descriptors.\nto learn equivariant feature representations. It applies group convolutions directly on a large group\nwhich is the product of the translation group and the geometric transformation group. In contrast,\nGIFT uses a vanilla CNN to process images, which can be regarded as features de\ufb01ned on the\ntranslation group, and separate group CNNs to process the features on the geometric transformation\ngroup, which results in a more ef\ufb01cient model than the original Group Equivariant CNN.\n\n3 Method\n\nPreliminary. Assuming the observed 3D surfaces are smooth, the transformation between corre-\nsponding image patches under different viewpoints is approximately in the af\ufb01ne group. In this paper,\nwe only consider its subgroup G which consists of rotations and scaling. The key intermediate feature\nrepresentation in the pipeline of GIFT is a map f : G \u2192 Rn from the group G to a feature space Rn,\nwhich is referred to as group feature.\nOverview. As illustrated in Fig. 1, the proposed method consists of two modules: group feature\nextraction and group feature embedding. 
The group feature extraction module takes an image I as input, warps the image with a grid of sampled elements in G, separately feeds the warped images through a vanilla CNN, and outputs a set of feature maps where each feature map corresponds to an element in G. For any interest point p in the image, a feature vector can be extracted from each feature map. The feature vectors corresponding to p in all the feature maps form a group feature f0 : G → R^n0. Next, the group feature embedding module embeds the group feature f0 of every interest point into two features fl,α and fl,β by two group CNNs, both of which have l group convolution layers. Finally, fl,α and fl,β are pooled by a bilinear pooling operator [32] to obtain a GIFT descriptor d.\n\n3.1 Group feature extraction\n\nGiven an input image I and a point p = (x, y) on the image, this module aims to extract a transformation-equivariant group feature f0 : G → R^n0 at this point p. To get the feature vector f0(g) for a specific transformation g ∈ G, we begin by transforming the input image I with g. Then, we process the transformed image Tg ∘ I with a vanilla CNN η. The output feature map is denoted by η(Tg ∘ I). Since the image is transformed, the corresponding location of p on the output feature map also changes into Tg(p). We use the feature vector located at Tg(p) on the feature map η(Tg ∘ I) as the value of f0(g). Because the coordinates of Tg(p) may not be integers, we apply bilinear interpolation φ to get the feature vector there. The whole process can be expressed by\n\nf0(g) = φ(η(Tg ∘ I), Tg(p)). (1)\n\nThe extracted group feature f0 is equivariant to transformations in the group, as illustrated in Fig. 2.\nLemma 1. The group feature of a point p in an image I extracted by Eq. (1) is denoted by f. If the image is transformed by an element h ∈ G and the group feature extracted at the corresponding point Th(p) in this transformed image is denoted by f′, then for any g ∈ G, f′(g) = φ(η(Tg ∘ Th ∘ I), Tg(Th(p))) = φ(η(Tgh ∘ I), Tgh(p)) = f(gh), which means that a transformation of the input image results in a permutation of the group feature.\n\nLemma 1 provides a novel and strict criterion for matching two feature points. Traditional methods usually detect a canonical scale and orientation for an interest point in each view and match points across views by descriptors extracted at the canonical scale and orientation. This can be interpreted as: if two points are matched, then there exist a g ∈ G and a g′ ∈ G such that f(g) = f′(g′). However, the canonical g and g′ are ambiguous and hard to detect reliably. Lemma 1 shows that, if two points are matched, then there exists an h ∈ G such that for all g ∈ G, f′(g) = f(gh). In other words, the group features of two matched points are related by a permutation. This provides a strict matching criterion between two group features. Even though h can hardly be determined when extracting descriptors, the permutation caused by h preserves the structures of group features and only changes their locations. 
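The permutation structure of Lemma 1 can be checked numerically on a toy discretization of the group. The sketch below is our own illustration, not the authors' code: the group is the 4-element rotation group C4, and the "CNN feature" is replaced by a simple rotation-covariant statistic (quadrant means). Transforming the image by h = r then shows up as a cyclic shift of the group-feature rows.

```python
import numpy as np

def extract_group_feature(image):
    # Group feature on the 4-element rotation group C4: row g holds a
    # feature of the image rotated by g*90 degrees (a stand-in for the
    # CNN features eta(T_g . I) sampled at the transformed point).
    feats = []
    for g in range(4):
        rot = np.rot90(image, k=g)
        h, w = rot.shape
        feats.append([rot[:h//2, :w//2].mean(), rot[:h//2, w//2:].mean(),
                      rot[h//2:, :w//2].mean(), rot[h//2:, w//2:].mean()])
    return np.asarray(feats)  # shape (4, n)

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))

f = extract_group_feature(img)                  # group feature of the image
f_prime = extract_group_feature(np.rot90(img))  # image transformed by h = r

# Lemma 1: f'(g) = f(gh), i.e. a cyclic permutation of the rows.
assert np.allclose(f_prime, np.roll(f, shift=-1, axis=0))
```

The same check fails for generic pooled statistics of a single view, which is exactly why the permutation, rather than any single canonical element, is the stable object to match on.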
Encoding the local structures of group features allows us to construct distinctive and transformation-invariant descriptors.\n\nFigure 2: The scaling and rotation of an image (left) result in the permutation of the feature maps defined on the scaling and rotation group (right). The red arrows illustrate the directions of the permutation.\n\n3.2 Group convolution layer\n\nAfter group feature extraction, we apply the discrete group convolution originally proposed in [8] to encode the local structures of group features, which is defined as\n\n[fl(g)]i = σ( ∑_{h∈H} f^T_{l−1}(hg) Wi(h) + bi ), (2)\n\nwhere fl and fl−1 are the group features of layer l and layer l − 1 respectively, [·]i means the i-th dimension of the vector, g and h are elements of the group, H ⊂ G is a set of transformations around the identity transformation, W are learnable parameters defined on H, bi is a bias term and σ is a non-linear activation function. If G is the 2D translation group, the group convolution in Eq. (2) becomes the conventional 2D convolution. Similar to conventional CNNs, which are able to encode local patterns of images, group CNNs are able to encode local structures of group features. For more discussion of the relationship between the group convolution and the conventional convolution, please refer to [8].\nThe group convolution actually preserves the equivariance of group features:\nLemma 2. In Eq. (2), if fl−1 is equivariant to transformations in G as stated in Lemma 1, then fl is also equivariant to transformations in G.\n\nThe proof of Lemma 2 is in the supplementary material. 
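A minimal NumPy sketch of the discrete group convolution in Eq. (2) over a sampled scale–rotation grid (our own illustration; layer sizes are arbitrary). For simplicity the rotation axis wraps around cyclically, which makes the equivariance of Lemma 2 exact in the sketch; the paper samples only part of the rotation group, so equivariance holds only approximately near the boundary, as discussed in Section 3.4.

```python
import numpy as np

def group_conv(f_prev, W, b, ns=5, nr=5):
    """Discrete group convolution (Eq. 2) on a scale-rotation grid.
    f_prev: (ns, nr, c_in) group feature of the previous layer.
    W:      (3, 3, c_in, c_out) weights on H = {s^ds r^dr : ds, dr in {-1,0,1}}.
    b:      (c_out,) bias. Rotation wraps cyclically; scale is zero-padded."""
    c_out = W.shape[-1]
    f_new = np.zeros((ns, nr, c_out))
    for i in range(ns):          # scale index of g
        for j in range(nr):      # rotation index of g
            acc = b.copy()
            for ds in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    si = i + ds
                    if not 0 <= si < ns:
                        continue               # no feature beyond scale range
                    rj = (j + dr) % nr         # rotations wrap around
                    acc = acc + f_prev[si, rj] @ W[ds + 1, dr + 1]
            f_new[i, j] = np.maximum(acc, 0.0)  # ReLU nonlinearity
    return f_new

rng = np.random.default_rng(0)
f0 = rng.standard_normal((5, 5, 8))       # toy input group feature
W = rng.standard_normal((3, 3, 8, 16)) * 0.1
b = np.zeros(16)
f1 = group_conv(f0, W, b)                 # (5, 5, 16)
# Lemma 2 in the sketch: rotating the input permutes the output identically.
assert np.allclose(group_conv(np.roll(f0, 1, axis=1), W, b),
                   np.roll(f1, 1, axis=1))
```

The structure mirrors a 3×3 spatial convolution, except the "neighborhood" is a set of group elements around the identity rather than pixel offsets.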
With Lemma 2, we can stack multiple group convolution layers to construct group CNNs which are able to encode local structures of group features while maintaining the equivariance property.\n\n3.3 Group bilinear pooling\n\nIn GIFT, we actually construct two group CNNs α and β, both of which consist of l group convolution layers, to process the input group feature f0. The outputs of the two group CNNs are denoted by fl,α and fl,β, respectively. Finally, we obtain the GIFT descriptor d by applying the bilinear pooling operator [32] to fl,α and fl,β, which can be described as\n\nd_{i,j} = ∫_G [fl,α(g)]i [fl,β(g)]j dg, (3)\n\nwhere d_{i,j} is an element of the feature vector d. Based on Lemma 1 and Lemma 2, we can prove the invariance of GIFT as stated in Proposition 1. The proof is given in the supplementary material.\nProposition 1. Let d denote the GIFT descriptor of an interest point in an image. If the image is transformed by any transformation h ∈ G and the GIFT descriptor extracted at the corresponding point in the transformed image is denoted by d′, then d′ = d.\nIn fact, many pooling operators such as average pooling and max pooling can achieve such invariance. We adopt bilinear pooling for two reasons. First, it collects second-order statistics of features and thus produces more informative descriptors. Second, it can be shown that the statistics used in many previous methods for invariant descriptors [22, 55, 60] can be written as special forms of bilinear pooling, as proved in the supplementary material. So the proposed GIFT is a more generalized form compared to these methods.\n\n3.4 Implementation details\n\nSampling from the group. Due to limited computational resources, we sample a range of elements in G to compute group features. 
We sample evenly in the scale group S and the rotation group R separately. The unit transformations are defined as 1/4 downsampling and 45-degree clockwise rotation, denoted by s and r, respectively. Then, the sampled elements of the group G form a grid {(s^i, r^j) | i, j ∈ Z}. Considering computational complexity, we choose ns = 5 scales ranging from s^0 to s^4 and nr = 5 orientations ranging from r^{−2} to r^{2}. In this case, the group feature of an interest point is a tensor f ∈ R^{ns×nr×n}, where n is the dimension of the feature space.\nDue to the discrete sampling, Lemma 1 and Lemma 2 do not rigorously hold near the boundary of the selected range. But empirical results show that this boundary effect does not noticeably affect the final matching performance if the scale and rotation changes are in a reasonable range.\nBilinear pooling. The integral in Eq. (3) is approximated by a summation over the sampled group elements. Suppose the output group features of the two group CNNs are denoted by fl,α ∈ R^{ns×nr×nα} and fl,β ∈ R^{ns×nr×nβ}, respectively, and reshaped into two matrices f̃l,α ∈ R^{ng×nα} and f̃l,β ∈ R^{ng×nβ}, where ng = ns × nr. Then, the GIFT descriptor d ∈ R^{nα×nβ} can be written as\n\nd = f̃^T_{l,α} f̃_{l,β}. (4)\n\nNetwork architecture. The vanilla CNN has four convolution layers and an average pooling layer to enlarge receptive fields. In the vanilla CNN, we use instance normalization [53] instead of batch normalization [25]. The output feature dimension n0 of the vanilla CNN is 32. In both group CNNs, H defined in Eq. (2) is {r, r^{−1}, s, s^{−1}, rs, rs^{−1}, r^{−1}s, r^{−1}s^{−1}, e}, where e is the identity transformation. ReLU [44] is used as the nonlinear activation function. 
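The discrete bilinear pooling of Eq. (4) is a single matrix product. A sketch with the paper's dimensions (ns = nr = 5, nα = 8, nβ = 16, giving a 128-dimensional descriptor); the random inputs stand in for real group-CNN outputs, and the final assertion illustrates why pooling over the group axis absorbs the permutation of Lemma 1:

```python
import numpy as np

ns, nr, n_alpha, n_beta = 5, 5, 8, 16
rng = np.random.default_rng(0)
f_alpha = rng.standard_normal((ns, nr, n_alpha))  # stand-in for group CNN alpha
f_beta = rng.standard_normal((ns, nr, n_beta))    # stand-in for group CNN beta

# Reshape to (ng, n) matrices and take the bilinear product (Eq. 4).
fa = f_alpha.reshape(ns * nr, n_alpha)
fb = f_beta.reshape(ns * nr, n_beta)
d = (fa.T @ fb).reshape(-1)        # 8 x 16 = 128-dimensional descriptor
d = d / np.linalg.norm(d)          # L2-normalize so that ||d||_2 = 1
assert d.shape == (128,)

# Any joint permutation of the group axis leaves the descriptor unchanged,
# which is the discrete version of Proposition 1.
perm = rng.permutation(ns * nr)
d_perm = (fa[perm].T @ fb[perm]).reshape(-1)
d_perm = d_perm / np.linalg.norm(d_perm)
assert np.allclose(d, d_perm)
```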
The number of group convolution layers is l = 1 in the ablation studies and l = 6 in the subsequent comparisons to state-of-the-art methods. The output feature dimensions nα and nβ of the two group CNNs are 8 and 16 respectively, which results in a 128-dimensional descriptor after bilinear pooling. The output descriptors are L2-normalized so that ‖d‖2 = 1.\nLoss function. The model is trained by minimizing a triplet loss [49] defined by\n\nℓ = max(‖da − dp‖2 − ‖da − dn‖2 + γ, 0), (5)\n\nwhere da, dp and dn are the descriptors of an anchor point in an image, its true match in the other image, and a false match selected by hard negative mining, respectively. The margin γ is set to 0.5 in all experiments. The hard negative mining is a modified version of that proposed in [7].\n\n4 Experiments\n\n4.1 Datasets and Metrics\n\nHPSequences [1, 30] is a dataset that contains 580 image pairs for evaluation, which can be divided into two splits, namely Illum-HP and View-HP. Illum-HP contains only illumination changes while View-HP contains mainly viewpoint changes. The viewpoint changes in View-HP cause homography transformations because all observed objects are planar.\nSUN3D [59] is a dataset that contains 500 image pairs of indoor scenes. The observed objects are not planar, so it introduces self-occlusion and perspective distortion, which are commonly considered challenges in correspondence estimation.\nES-* and ER-*. To fully evaluate the correspondence estimation performance under extreme scale and orientation changes, we create extreme scale (ES) and extreme rotation (ER) datasets by artificially scaling and rotating the images in HPSequences and SUN3D. For a pair of images, we manually add large orientation or scale changes to the second image. The range of the rotation angle is [−π, π]. 
The range of the scaling factor is [2.83, 4] ∪ [0.25, 0.354]. Examples are shown in Fig. 3.\n\nMVS dataset [51] contains six image sequences of outdoor scenes. All images have accurate ground-truth camera poses, which are used to evaluate the descriptors for relative pose estimation.\nTraining data. The proposed GIFT is trained on a synthetic dataset. We randomly sample images from MS-COCO [31] and warp the images with reasonable homographies defined in Superpoint [11] to construct image pairs for training. When evaluating on the task of relative pose estimation, we further finetune GIFT on the GL3D [50] dataset, which contains real image pairs with ground truth correspondences given by a standard Structure-from-Motion (SfM) pipeline.\nMetrics. To quantify the performance of correspondence estimation, we use the Percentage of Correctly Matched Keypoints (PCK) [35, 65], which is defined as the ratio between the number of correct matches and the total number of interest points. All matches are found by nearest-neighbor search. A matched point is declared correct if it is within five pixels of the ground truth location. To evaluate relative pose estimation, we use the rotation error as the metric, which is defined as the angle of R_err = R_pr · R^T_gt in the axis-angle form, where R_pr is the estimated rotation and R_gt is the ground truth rotation. All testing images are resized to 480×360 in all experiments.\n\nSuperpoint [11]+GIFT | Superpoint [11] | DoG [36]+GIFT | DoG [36]+GeoDesc [37]\n\nFigure 3: Visualization of estimated correspondences on HPSequences (first two rows), ER-HP (middle two rows) and ES-HP (last two rows). 
The first two columns use keypoints detected by Superpoint [11] and the last two columns use keypoints detected by DoG [36].\n\n4.2 Ablation study\n\nWe conduct ablation studies on HPSequences, ES-HP and ER-HP in three aspects, namely comparison to baseline models, choice of pooling operators and different numbers of group convolution layers. In all ablation studies, we use the keypoints detected by Superpoint [11] as interest points for evaluation.\n\nTable 1: PCK of different baseline models and GIFT-1.\nDataset | VCNN | GFC | GAS | GIFT-1\nIllum-HP | 59.15 | 60.63 | 59.2 | 59.61\nView-HP | 62.5 | 61.7 | 62.2 | 63.71\nES-HP | 14.9 | 16.58 | 18.28 | 21.74\nER-HP | 26.89 | 28.86 | 30.72 | 39.68\n\nTable 2: PCK of models using different pooling operators.\nDataset | avg | max | subspace | bilinear\nIllum-HP | 57.72 | 54.31 | 47.21 | 59.61\nView-HP | 62.52 | 58.16 | 49.36 | 63.71\nES-HP | 19.08 | 19.37 | 14.85 | 21.74\nER-HP | 36.15 | 32.57 | 29.12 | 39.68\n\nWe denote the proposed method by GIFT-l, where l means the number of group convolution layers. The architectures of the compared models can be found in the supplementary material. All tested models are trained with the same loss function and training data.\nBaseline models. We consider three baseline models which all produce 128-dimensional descriptors, namely the Vanilla CNN (VCNN), the Group Fully Connected network (GFC) and the Group Attention Selection network (GAS). VCNN has four vanilla convolution layers with three average pooling layers and outputs a 128-channel feature map. Descriptors are directly interpolated from the output feature map. GFC and GAS have the same group feature extraction module as GIFT. GFC replaces the group CNN in GIFT-1 with a two-layer fully connected network. GAS is similar to the model proposed in [54], which tries to learn attention weights by CNNs to select a scale for each keypoint. GAS first transforms the input group features to 128 dimensions with a 1 × 1 convolution layer. 
Then, it applies a two-layer fully connected network to the input group feature to produce nr × ns attention weights. Finally, GAS uses the average of the 128-dimensional embedded group features, weighted by the attention weights, as descriptors.\nTable 1 summarizes the results of the proposed method and the baseline models. The proposed method achieves the best performance on all datasets except Illum-HP. The Illum-HP dataset contains no viewpoint changes, which means that there is no permutation between the group features of two matched points. In this case, the GFC model, which directly compares the elements of two group features, achieves better performance. Compared to the baseline models, the significant improvements of GIFT-1 on ES-HP and ER-HP demonstrate the benefit of the proposed method in dealing with large scale and orientation changes.\nPooling operators. To illustrate the necessity of bilinear pooling, we test three other commonly used pooling operators, namely average pooling, max pooling and subspace pooling [22, 55, 56]. For all these models, we apply the same group feature extraction module as GIFT. For average pooling and max pooling, the input group feature is fed into group CNNs to produce 128-dimensional group features, which are subsequently pooled with average pooling or max pooling to construct descriptors. For subspace pooling, we use a group CNN to produce a feature map with 16 channels, which results in 256-dimensional descriptors after subspace pooling. Results are listed in Table 2, which shows that bilinear pooling outperforms all the other pooling operators.\nNumber of group convolution layers. To further demonstrate the effect of group convolution layers, we test different numbers of group convolution layers. All models use the same vanilla CNN but different group CNNs with 1, 3 or 6 group convolution layers. The results in Table 3 show that the performance increases with the number of group convolution layers. 
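All PCK numbers in these ablations come from nearest-neighbor descriptor matching with the 5-pixel correctness threshold described in Section 4.1. A minimal evaluation sketch (our own, run here on synthetic data, not the authors' evaluation code):

```python
import numpy as np

def pck(desc1, desc2, kp2, gt_kp2, thresh=5.0):
    """PCK: fraction of image-1 points whose nearest-neighbor match in
    image 2 lies within `thresh` pixels of the ground-truth location.
    desc1: (m, d) descriptors in image 1; desc2: (n, d) in image 2;
    kp2: (n, 2) keypoint coordinates in image 2;
    gt_kp2: (m, 2) ground-truth correspondence of each image-1 point."""
    # Pairwise Euclidean distances between the two descriptor sets.
    dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn = np.argmin(dists, axis=1)              # nearest-neighbor search
    err = np.linalg.norm(kp2[nn] - gt_kp2, axis=-1)
    return float(np.mean(err < thresh))

# Toy check: near-identical descriptors at the ground-truth keypoints
# should match perfectly.
rng = np.random.default_rng(0)
d1 = rng.standard_normal((10, 4))
kp2 = rng.uniform(0, 100, size=(10, 2))
score = pck(d1, d1 + 0.01 * rng.standard_normal((10, 4)), kp2, kp2)
assert score == 1.0
```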
In subsequent experiments, we use GIFT-6 as the default model and denote it by GIFT for short.\n\nTable 3: PCK of GIFT using different numbers of group convolution layers.\nDataset | GIFT-1 | GIFT-3 | GIFT-6\nIllum-HP | 59.61 | 61.33 | 62.49\nView-HP | 63.71 | 64.91 | 67.15\nES-HP | 21.74 | 23.9 | 27.29\nER-HP | 39.68 | 43.37 | 48.93\n\n4.3 Comparison with state-of-the-art methods\n\nTable 4: PCK of GIFT and the state-of-the-art methods on HPSequences, SUN3D and the extreme rotation (ER-) and scaling (ES-) datasets.\nDataset | SP+GIFT | SP [11] | DoG+GIFT | DoG+SIFT [36] | DoG+GeoDesc [37] | LF-Net+GIFT | LF-Net [45]\nIllum-HP | 62.49 | 61.13 | 56.58 | 28.38 | 34.41 | 52.17 | 34.55\nView-HP | 67.15 | 53.66 | 62.53 | 34.33 | 42.75 | 15.93 | 1.22\nSUN3D | 27.32 | 26.4 | 19.97 | 15.2 | 14.53 | 21.73 | 12.93\nES-HP | 27.29 | 12.16 | 22.07 | 18.25 | 19.63 | 7.89 | 0.3\nER-HP | 48.93 | 24.77 | 44.44 | 29.39 | 37.36 | 12.50 | 0.05\nES-SUN3D | 12.37 | 5.94 | 7.40 | 4.09 | 3.42 | 7.61 | 0.55\nER-SUN3D | 22.29 | 14.01 | 15.77 | 15.16 | 15.39 | 15.98 | 10.59\n\nReference | GIFT | VCNN | Daisy [47]\n\nFigure 4: Visualization of estimated dense correspondences. Matched points are drawn with the same color in the reference and query images. Only correctly estimated correspondences are drawn.\n\nWe compare the proposed GIFT with three state-of-the-art methods, namely Superpoint [11], GeoDesc [37] and LF-Net [45]. For all methods, we use their released pretrained models for comparison. Superpoint [11] localizes keypoints and interpolates the descriptors of these keypoints directly on a feature map of a vanilla CNN. GeoDesc [37] is a state-of-the-art patch descriptor which is usually combined with the DoG detector for correspondence estimation. LF-Net [45] provides a complete pipeline of feature detection and description. The detector network of LF-Net not only localizes keypoints but also estimates their scales and orientations. 
Then the local patches are fed into the descriptor network to generate descriptors. For a fair comparison, we use the same keypoints as the compared method for evaluation. Results are summarized in Table 4, which shows that GIFT outperforms all the other state-of-the-art methods. Qualitative results are shown in Fig. 3.\nTo further validate the robustness of GIFT to scaling and rotation, we add synthetic scaling and rotation to the images in HPatches and report the matching performance under different scalings and rotations. The results are plotted in Fig. 5, which shows that the PCK of GIFT drops slowly with increasing scaling and rotation.\n\nFigure 5: PCKs on the HPatches dataset as scaling and rotation increase. GIFT-SP uses Superpoint as the detector while GIFT-DoG uses DoG as the detector.\n\n4.4 Performance for dense correspondence estimation\n\nWe also evaluate GIFT on the task of dense correspondence estimation on HPSequences, ES-HP and ER-HP. The quantitative results are listed in Table 5 and qualitative results are shown in Fig. 4. The proposed GIFT outperforms the baseline Vanilla CNN and the traditional method Daisy [47], which demonstrates the ability of GIFT for dense correspondence estimation.\n\nTable 5: PCK of dense correspondence estimation.\nDataset | GIFT | VCNN | Daisy [47]\nIllum-HP | 27.82 | 17.08 | 26.96\nView-HP | 37.92 | 19.6 | 32.92\nES-HP | 12.52 | 1.05 | 4.64\nER-HP | 26.61 | 5.69 | 14.02\n\n4.5 Performance for relative pose estimation\n\nWe also evaluate GIFT on the task of relative pose estimation of image pairs on the MVS dataset [51]. For a pair of images, we estimate the relative pose of the cameras by matching descriptors and computing the essential matrix. Since the estimated translations are up-to-scale, we only evaluate the estimated rotations using the rotation error metric described in Section 4.1. We further finetune GIFT on the outdoor GL3D dataset [50] and denote the finetuned model by GIFT-F. 
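The rotation-error metric of Section 4.1, the angle of R_err = R_pr · R_gt^T in axis-angle form, can be computed as follows (a sketch of the metric itself, not the authors' evaluation code):

```python
import numpy as np

def rotation_error_deg(R_pr, R_gt):
    """Angle (degrees) of R_err = R_pr @ R_gt.T in axis-angle form."""
    R_err = R_pr @ R_gt.T
    # For a rotation matrix, trace(R) = 1 + 2*cos(theta); clip guards
    # against values slightly outside [-1, 1] from round-off.
    cos_theta = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def rot_z(deg):
    # Rotation about the z-axis by `deg` degrees, for the sanity check below.
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

# Two z-rotations 10 degrees apart should give a 10-degree error.
assert abs(rotation_error_deg(rot_z(30.0), rot_z(20.0)) - 10.0) < 1e-6
```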
The results are listed in Table 6. GIFT-F outperforms all the other methods on most sequences, which demonstrates the applicability of GIFT to real computer vision tasks.

Detector         | DoG [36]                    | Superpoint [11]
Descriptor       | SIFT [36]   GIFT    GIFT-F  | Superpoint [11]   GIFT    GIFT-F
Herz-Jesus-P8    | 0.662       0.656   0.582   | 1.072             0.848   0.942
Herz-Jesus-P25   | 5.296       4.968   2.756   | 2.87              4.484   2.891
Fountain-P11     | 0.587       0.821   1.268   | 1.071             1.331   1.046
Entry-P10        | 3.844       1.368   1.259   | 1.076             1.915   1.059
Castle-P30       | 2.706       3.431   1.741   | 1.588             1.526   1.501
Castle-P19       | 3.018       1.887   1.991   | 1.814             1.739   1.500
Average          | 2.686       2.189   1.600   | 1.583             1.974   1.490

Table 6: Rotation error (°) of relative pose estimation on the MVS dataset [51].

4.6 Running time

Given a 480×360 image and 1024 randomly distributed interest points in the image, the PyTorch [46] implementation of GIFT-6 takes about 65.2 ms on a desktop with an Intel i7 3.7GHz CPU and a GTX 1080 Ti GPU. Specifically, it takes 32.5 ms for image warping, 27.5 ms for processing all warped images with the vanilla CNN and 5.2 ms for group feature embedding by the group CNNs.

5 Conclusion

We introduced a novel dense descriptor named GIFT with provable invariance to a certain group of transformations. We showed that the group features, which are extracted from the transformed versions of an image, contain structures that are stable under the transformations and discriminative among different interest points. We adopted group CNNs to encode such structures and applied bilinear pooling to construct transformation-invariant descriptors. We reported state-of-the-art performance on the task of correspondence estimation on the HPSequence dataset, the SUN3D dataset and several new datasets with extreme scale and orientation changes.

Acknowledgement. The authors would like to acknowledge support from NSFC (No.
61806176), Fundamental Research Funds for the Central Universities and ZJU-SenseTime Joint Lab of 3D Vision.

References

[1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017.

[2] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, 2016.

[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV, 2006.

[4] Erik J Bekkers, Maxime W Lafarge, Mitko Veta, Koen AJ Eppenhof, Josien PW Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In MICCAI, 2018.

[5] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. IJCV, 2007.

[6] Michael Calonder, Vincent Lepetit, Mustafa Ozuysal, Tomasz Trzcinski, Christoph Strecha, and Pascal Fua. Brief: Computing a local binary descriptor very fast. T-PAMI, 34(7):1281–1298, 2012.

[7] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In NeurIPS, 2016.

[8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016.

[9] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. In ICLR, 2018.

[10] Taco S Cohen and Max Welling. Steerable cnns. In ICLR, 2017.

[11] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshops, 2018.

[12] Nehal Doiphode, Rahul Mitra, Shuaib Ahmed, and Arjun Jain. An improved learning framework for covariant local feature detection. CoRR, abs/1811.00438, 2018.

[13] Jingming Dong and Stefano Soatto. Domain-size pooling in local descriptors: Dsp-sift.
In CVPR, 2015.

[14] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning so(3) equivariant representations with spherical cnns. In ECCV, 2018.

[15] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. In ICLR, 2018.

[16] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant multi-view networks. arXiv preprint arXiv:1904.00993, 2019.

[17] David Filliat. A visual bag of words method for interactive qualitative localization and mapping. In ICRA, 2007.

[18] Vijay Kumar B G, Gustavo Carneiro, and Ian Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In CVPR, 2015.

[19] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.

[20] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.

[21] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.

[22] Tal Hassner, Viki Mayzels, and Lihi Zelnik-Manor. On sifts and their scales. In CVPR, 2012.

[23] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average precision. In CVPR, 2018.

[24] Joao F Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In ICML, 2017.

[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[26] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NeurIPS, 2015.

[27] Kai Sheng Tai, Peter Bailis, and Gregory Valiant.
Equivariant transformer networks. In ICML, 2019.

[28] Renata Khasanova and Pascal Frossard. Graph-based isometry invariant representation learning. In ICML, 2017.

[29] Karel Lenc and Andrea Vedaldi. Learning covariant feature detectors. In ECCV, 2016.

[30] Karel Lenc and Andrea Vedaldi. Large scale evaluation of local image feature detectors on homography datasets. In BMVC, 2018.

[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

[32] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In ICCV, 2015.

[33] Tony Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79–116, 1998.

[34] Tony Lindeberg. Scale-space theory in computer vision, volume 256. Springer Science & Business Media, 2013.

[35] Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In NeurIPS, 2014.

[36] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[37] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. Geodesc: Learning local descriptors by integrating geometry constraints. In ECCV, 2018.

[38] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In ICCV, 2017.

[39] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.

[40] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor's margins: Local descriptor learning loss. In NeurIPS, 2017.

[41] Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Repeatability is not enough: Learning affine regions via discriminability. In ECCV, 2018.

[42] Jean-Michel Morel and Guoshen Yu.
Asift: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences, 2(2):438–469, 2009.

[43] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE Trans. Robot., 31(5):1147–1163, 2015.

[44] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.

[45] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: Learning local features from images. In NeurIPS, 2018.

[46] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NeurIPS Workshops, 2017.

[47] James Philbin, Michael Isard, Josef Sivic, and Andrew Zisserman. Descriptor learning for efficient retrieval. In ECCV, 2010.

[48] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R Bradski. Orb: An efficient alternative to sift or surf. In ICCV, 2011.

[49] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.

[50] Tianwei Shen, Zixin Luo, Lei Zhou, Runze Zhang, Siyu Zhu, Tian Fang, and Long Quan. Matchable image retrieval by learning from surface reconstruction. In ACCV, 2018.

[51] Christoph Strecha, Wolfgang Von Hansen, Luc Van Gool, Pascal Fua, and Ulrich Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, 2008.

[52] Yurun Tian, Bin Fan, and Fuchao Wu. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In CVPR, 2017.

[53] Dmitry Ulyanov, Andrea Vedaldi, and Victor S Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

[54] Shenlong Wang, Linjie Luo, Ning Zhang, and Jia Li.
Autoscaler: Scale-attention networks for visual correspondence. In BMVC, 2017.

[55] Zhenhua Wang, Bin Fan, and Fuchao Wu. Affine subspace representation for feature description. In ECCV, 2014.

[56] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng. Kernelized subspace pooling for deep local descriptors. In CVPR, 2018.

[57] Maurice Weiler, Fred A Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In CVPR, 2018.

[58] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.

[59] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. Sun3d: A database of big spaces reconstructed using sfm and object labels. In ICCV, 2013.

[60] Tsun-Yi Yang, Yen-Yu Lin, and Yung-Yu Chuang. Accumulated stability voting: A robust descriptor from descriptors of multiple scales. In CVPR, 2016.

[61] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV, 2016.

[62] Kwang Moo Yi, Yannick Verdie, Pascal Fua, and Vincent Lepetit. Learning to assign orientations to feature points. In CVPR, 2016.

[63] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.

[64] Xu Zhang, Felix X Yu, Svebor Karaman, and Shih-Fu Chang. Learning discriminative and transformation covariant local feature detectors. In CVPR, 2017.

[65] Tinghui Zhou, Yong Jae Lee, Stella X Yu, and Alyosha A Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences.
In CVPR, 2015.