{"title": "Learning Image Descriptors with the Boosting-Trick", "book": "Advances in Neural Information Processing Systems", "page_first": 269, "page_last": 277, "abstract": "In this paper we apply boosting to learn complex non-linear local visual feature representations, drawing inspiration from its successful application to visual object detection. The main goal of local feature descriptors is to distinctively represent a salient image region while remaining invariant to viewpoint and illumination changes. This representation can be improved using machine learning, however, past approaches have been mostly limited to learning linear feature mappings in either the original input or a kernelized input feature space. While kernelized methods have proven somewhat effective for learning non-linear local feature descriptors, they rely heavily on the choice of an appropriate kernel function whose selection is often difficult and non-intuitive. We propose to use the boosting-trick to obtain a non-linear mapping of the input to a high-dimensional feature space. The non-linear feature mapping obtained with the boosting-trick is highly intuitive. We employ gradient-based weak learners resulting in a learned descriptor that closely resembles the well-known SIFT. As demonstrated in our experiments, the resulting descriptor can be learned directly from intensity patches achieving state-of-the-art performance.", "full_text": "Learning Image Descriptors with the Boosting-Trick\n\nTomasz Trzcinski, Mario Christoudias, Vincent Lepetit and Pascal Fua\n\nCVLab, EPFL, Lausanne, Switzerland\nfirstname.lastname@epfl.ch\n\nAbstract\n\nIn this paper we apply boosting to learn complex non-linear local visual feature\nrepresentations, drawing inspiration from its successful application to visual ob-\nject detection. The main goal of local feature descriptors is to distinctively repre-\nsent a salient image region while remaining invariant to viewpoint and illumina-\ntion changes. 
This representation can be improved using machine learning; however, past approaches have been mostly limited to learning linear feature mappings in either the original input or a kernelized input feature space. While kernelized methods have proven somewhat effective for learning non-linear local feature descriptors, they rely heavily on the choice of an appropriate kernel function whose selection is often difficult and non-intuitive. We propose to use the boosting-trick to obtain a non-linear mapping of the input to a high-dimensional feature space. The non-linear feature mapping obtained with the boosting-trick is highly intuitive. We employ gradient-based weak learners resulting in a learned descriptor that closely resembles the well-known SIFT. As demonstrated in our experiments, the resulting descriptor can be learned directly from intensity patches achieving state-of-the-art performance.

1 Introduction

Representing salient image regions in a way that is invariant to unwanted image transformations is a crucial Computer Vision task. Well-known local feature descriptors, such as the Scale Invariant Feature Transform (SIFT) [1] or Speeded Up Robust Features (SURF) [2], address this problem by using a set of hand-crafted filters and non-linear operations. These descriptors have become prevalent, even though they are not truly invariant with respect to various viewpoint and illumination changes, which limits their applicability.

In an effort to address these limitations, a fair amount of work has focused on learning local feature descriptors [3, 4, 5] that leverage labeled training image patches to learn invariant feature representations based on local image statistics.
Although significant progress has been made, these approaches are either built on top of hand-crafted representations [5] or still require significant parameter tuning as in [4], which relies on a non-analytical objective that is difficult to optimize.

Learning an invariant feature representation is strongly related to learning an appropriate similarity measure or metric over intensity patches that is invariant to unwanted image transformations, and work on descriptor learning has predominantly focused in this area [3, 6, 5]. Methods for metric learning that have been applied to image data have largely focused on learning a linear feature mapping in either the original input or a kernelized input feature space [7, 8]. This includes previous boosting-based metric learning methods that thus far have been limited to learning linear feature transformations [3, 7, 9]. In this way, non-linearities are modeled using a predefined similarity or kernel function that implicitly maps the input features to a high-dimensional feature space where the transformation is assumed to be linear. While these methods have proven somewhat effective for learning non-linear local feature mappings, choosing an appropriate kernel function is often non-intuitive and remains a challenging and largely open problem. Additionally, kernel methods involve an optimization whose problem complexity grows quadratically with the number of training examples, making them difficult to apply to large problems that are typical of local descriptor learning.

In this paper, we apply boosting to learn complex non-linear local visual feature representations, drawing inspiration from its successful application to visual object detection [10]. Image patch appearance is modeled using local non-linear filters evaluated within the image patch that are effectively selected with boosting.
Analogous to the kernel-trick, our approach can be seen as applying a\nboosting-trick [11] to obtain a non-linear mapping of the input to a high-dimensional feature space.\nUnlike kernel methods, the boosting-trick allows for the de\ufb01nition of intuitive non-linear feature\nmappings. Also, our learning approach scales linearly with the number of training examples making\nit more easily amenable to large scale problems and results in highly accurate descriptor matching.\nWe build upon [3] that also relies on boosting to compute a descriptor, and show how we can use it\nas a way to ef\ufb01ciently select features, from which we compute a compact representation. We also\nreplace the simple weak learners of [3] by non-linear \ufb01lters more adapted to the problem. In par-\nticular, we employ image gradient-based weak learners similar to [12] that share a close connection\nwith the non-linear \ufb01lters used in proven image descriptors such as SIFT and Histogram-of-Oriented\nGradients (HOG) [13]. Our approach can be seen as a generalization of these methods cast within\na principled learning framework. As seen in our experiments, our descriptor can be learned di-\nrectly from intensity patches and results in state-of-the-art performance rivaling its hand-designed\nequivalents.\nTo evaluate our approach we consider the image patch dataset of [4] containing several hundreds\nof thousands of image patches under varying viewpoint and illumination conditions. As baselines\nwe compare against leading contemporary hand-designed and learned local feature descriptors [1,\n2, 3, 5]. We demonstrate the effectiveness of our approach on this challenging dataset, signi\ufb01cantly\noutperforming the baseline methods.\n2 Related work\nMachine learning has been applied to improve both matching ef\ufb01ciency and accuracy of image\ndescriptors [3, 4, 5, 8, 14, 15]. Feature hashing methods improve the storage and computational\nrequirements of image-based features [16, 14, 15]. 
Salakhutdinov and Hinton [16, 17] develop a semantic hashing approach based on Restricted Boltzmann Machines (RBMs) applied to binary images of digits. Similarly, Weiss et al. [14] present a spectral hashing approach that learns compact binary codes for efficient image indexing and matching. Kulis and Darrell [15] extend this idea to explicitly minimize the error between the original Euclidean and computed Hamming distances. Many of these approaches presume a given distance or similarity measure over a pre-defined input feature space. Although they result in efficient description and indexing, in many cases they are limited to the matching accuracy of the original input space. In contrast, our approach learns a non-linear feature mapping that is specifically optimized to result in highly accurate descriptor matching.

Methods for metric learning learn feature spaces tailored to a particular matching task [5, 8]. These methods assume the presence of annotated label pairs or triplets that encode the desired proximity relationships of the learned feature embedding. Jain et al. [8] learn a Mahalanobis distance metric defined using either the original input or a kernelized input feature space applied to image classification and matching. Alternatively, Strecha et al. [5] employ Linear Discriminant Analysis to learn a linear feature mapping from binary-labeled example pairs. Both of these methods are closely related, offering different optimization strategies for learning a Mahalanobis-based distance metric. While these methods improve matching accuracy through a learned feature space, they require the presence of a pre-selected kernel function to encode non-linearities. Such approaches are well suited for certain image indexing and classification tasks where task-specific kernel functions have been proposed (e.g., [18]).
However, they are less applicable to local image feature matching, for which the appropriate choice of kernel function is less understood.

Boosting has also been applied to learning Mahalanobis-based distance metrics involving high-dimensional input spaces, overcoming the large computational complexity of conventional positive semi-definite (PSD) solvers based on the interior point method [7, 9]. Shen et al. [19] proposed a PSD solver using column generation techniques based on AdaBoost, which was later extended to involve closed-form iterative updates [7]. More recently, Bi et al. [9] devised a similar method exhibiting even further improvements in computational complexity, with application to bio-medical imagery. While these methods also use boosting to learn a feature mapping, they have emphasized computational efficiency and only considered linear feature embeddings. Our approach exhibits similar computational advantages, but has the ability to learn non-linear feature mappings beyond what these methods have proposed.

Similar to our work, Brown et al. [4] also consider different feature pooling and selection strategies of gradient-based features, resulting in a descriptor which is both short and discriminant. In [4], however, they optimize over the combination of hand-crafted blocks and their parameters. The criterion they consider, the area below the ROC curve, is not analytical and thus difficult to optimize, and does not generalize well. In contrast, we provide a generic learning framework for finding such representations. Moreover, the form of our descriptor is much simpler. Simultaneous to this work, similar ideas were explored in [20, 21].
While these approaches assume a sub-sampled or coarse set of pooling regions to maintain tractability, we allow for the discovery of more generic pooling configurations with boosting.

Our work on boosted feature learning can be traced back to the work of Dollár et al. [22], who apply boosting across a range of different features for pedestrian detection. Our approach is probably most similar to the boosted Similarity Sensitive Coding (SSC) method of Shakhnarovich [3] that learns a boosted similarity function from a family of weak learners, a method that was later extended in [23] to be used with a Hamming distance. In [3], only linear projection based weak learners were considered. Also, Boosted SSC can often yield fairly high-dimensional embeddings. Our approach can be seen as an extension of Boosted SSC to form low-dimensional feature mappings. We also show that the image gradient-based weak learners of [24] are well adapted to the problem. As seen in our experiments, our approach significantly outperforms Boosted SSC when applied to image intensity patches.

3 Method

Given an image intensity patch $\mathbf{x} \in \mathbb{R}^D$ we look for a descriptor of $\mathbf{x}$ as a non-linear mapping $H(\mathbf{x})$ into the space spanned by $\{h_i\}_{i=1}^M$, a collection of thresholded non-linear response functions $h_i(\mathbf{x}) : \mathbb{R}^D \rightarrow \{-1, 1\}$.
The number of response functions $M$ is generally large and possibly infinite.

This mapping can be learned by minimizing the exponential loss with respect to a desired similarity function $f(\mathbf{x}, \mathbf{y})$ defined over image patch pairs

$$L = \sum_{i=1}^{N} \exp(-l_i f(\mathbf{x}_i, \mathbf{y}_i)) \quad (1)$$

where $\mathbf{x}_i, \mathbf{y}_i \in \mathbb{R}^D$ are training intensity patches and $l_i \in \{-1, 1\}$ is a label indicating whether it is a similar (+1) or dissimilar (-1) pair.

The Boosted SSC method proposed in [3] considers a similarity function defined by a simple weighted sum of thresholded response functions

$$f(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{M} \alpha_i h_i(\mathbf{x}) h_i(\mathbf{y}) . \quad (2)$$

This defines a weighted hash function with the importance of each dimension $i$ given by $\alpha_i$. Substituting this expression into Equation (1) gives

$$L_{SSC} = \sum_{i=1}^{N} \exp\Big(-l_i \sum_{j=1}^{M} \alpha_j h_j(\mathbf{x}_i) h_j(\mathbf{y}_i)\Big) . \quad (3)$$

In practice $M$ is large and in general the number of possible $h_i$'s can be infinite, making the explicit optimization of $L_{SSC}$ difficult; this constitutes a problem for which boosting is particularly well suited [25]. Although boosting is a greedy optimization scheme, it is a provably effective method for constructing a highly accurate predictor from a collection of weak predictors $h_i$.

Similar to the kernel trick, the resulting boosting-trick also maps each observation to a high-dimensional feature space; however, it computes an explicit mapping for which the $\alpha_i$'s that define $f(\mathbf{x}, \mathbf{y})$ are assumed to be sparse [11]. In fact, Rosset et al. [26] have shown that under certain settings boosting can be interpreted as imposing an $L_1$ sparsity constraint over the response function weights $\alpha_i$.
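A minimal sketch of this greedy optimization, assuming a tiny pool of candidate weak learners and standard AdaBoost-style reweighting of the pair labels (all function and variable names here are illustrative, not taken from the authors' implementation):

```python
import numpy as np

def boosted_ssc(X, Y, labels, candidate_learners, rounds):
    """Greedily minimize the exponential pair loss
    L = sum_i exp(-l_i * sum_j a_j h_j(x_i) h_j(y_i)) (Eqs. 1-3).
    `candidate_learners` is a list of functions h: patch -> {-1, +1}."""
    N = len(labels)
    w = np.ones(N) / N                      # per-pair weights
    selected, alphas = [], []
    for _ in range(rounds):
        best_h, best_err, best_agree = None, np.inf, None
        for h in candidate_learners:
            # agreement of the weak hash on each pair: h(x)h(y) in {-1, +1}
            agree = np.array([h(x) * h(y) for x, y in zip(X, Y)])
            err = np.sum(w * (agree != labels))    # weighted error
            if err < best_err:
                best_h, best_err, best_agree = h, err, agree
        alpha = 0.5 * np.log((1 - best_err) / max(best_err, 1e-12))
        w *= np.exp(-alpha * labels * best_agree)  # exponential-loss reweighting
        w /= w.sum()
        selected.append(best_h)
        alphas.append(alpha)

    def f(x, y):  # learned similarity, Eq. (2)
        return sum(a * h(x) * h(y) for a, h in zip(alphas, selected))
    return f
```

Here the patches are stand-ins (any objects the weak learners accept); in the paper the $h_i$ are the gradient-based learners of Section 3.3.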
As will be seen below, unlike the kernel trick, this allows for the definition of high-dimensional embeddings well suited to the descriptor matching task whose features have an intuitive explanation.

Boosted SSC employs linear response weak predictors based on a linear projection of the input. In contrast, we consider non-linear response functions more suitable for the descriptor matching task, as discussed in Section 3.3. In addition, the greedy optimization can often yield embeddings that, although accurate, are fairly redundant and inefficient.

In what follows, we present our approach for learning compact boosted feature descriptors called Low-Dimensional Boosted Gradient Maps (L-BGM). First, we present a modified similarity function well suited for learning low-dimensional, discriminative embeddings with boosting. Next, we show how we can factorize the learned embedding to form a compact feature descriptor. Finally, the gradient-based weak learners utilized by our approach are detailed.

3.1 Similarity measure

To mitigate the potentially redundant embeddings found by boosting we propose an alternative similarity function that models the correlation between weak response functions,

$$f_{LBGM}(\mathbf{x}, \mathbf{y}) = \sum_{i,j} \alpha_{i,j} h_i(\mathbf{x}) h_j(\mathbf{y}) = \mathbf{h}(\mathbf{x})^T A \mathbf{h}(\mathbf{y}), \quad (4)$$

where $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \cdots, h_M(\mathbf{x})]$ and $A$ is an $M \times M$ matrix of coefficients $\alpha_{i,j}$. This similarity measure is a generalization of Equation (2). In particular, $f_{LBGM}$ is equivalent to the Boosted SSC similarity measure in the restricted case of a diagonal $A$.

Substituting the above expression into Equation (1) gives

$$L_{LBGM} = \sum_{k=1}^{N} \exp\Big(-l_k \sum_{i,j} \alpha_{i,j} h_i(\mathbf{x}_k) h_j(\mathbf{y}_k)\Big) . \quad (5)$$

Although it can be shown that $L_{LBGM}$ can be jointly optimized for $A$ and the $h_i$'s using boosting, this involves a fairly complex procedure.
Instead, we propose a two-step learning strategy whereby we first apply AdaBoost to find the $h_i$'s as in [3]. As shown by our experiments, this provides an effective way to select relevant $h_i$'s. We then apply stochastic gradient descent to find an optimal weighting over the selected features that minimizes $L_{LBGM}$.

More formally, let $P$ be the number of relevant response functions found with AdaBoost, with $P \ll M$. We define $A_P \in \mathbb{R}^{P \times P}$ to be the sub-matrix corresponding to the non-zero entries of $A$, explicitly optimized by our approach. Note that as the loss function is convex in $A$, $A_P$ can be found optimally with respect to the selected $h_i$'s. In addition, we constrain $\alpha_{i,j} = \alpha_{j,i}$ during optimization, restricting the solution to the set of symmetric $P \times P$ matrices and yielding a symmetric similarity measure $f_{LBGM}$. We also experimented with more restrictive forms of regularization, e.g., constraining $A_P$ to be positive semi-definite; however, this is more costly and gave similar results.

We use a simple implementation of stochastic gradient descent with a constant step size, initialized using the diagonal matrix found by Boosted SSC, and iterate until convergence or a maximum number of iterations is reached. Note that because the weak learners are binary, we can precompute the exponential terms involved in the derivatives for all the data samples, as they are constant with respect to $A_P$. This significantly speeds up the optimization process.

3.2 Embedding factorization

The similarity function of Equation (4) defines an implicit feature mapping over example pairs.
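The second stage of the two-step strategy of Section 3.1 can be sketched as follows: an illustrative Python implementation of the stochastic gradient step on $L_{LBGM}$ over a symmetric $A_P$, assuming the binary weak responses have already been precomputed (initialization, step size, and names are our choices, not the authors'):

```python
import numpy as np

def sgd_correlation_matrix(Hx, Hy, labels, steps=2000, lr=0.01, rng=None):
    """Given precomputed binary weak responses Hx, Hy (N x P, entries +/-1)
    for the P learners selected by AdaBoost, minimize
    L = sum_k exp(-l_k h(x_k)^T A h(y_k)) over symmetric A (Eq. 5)
    by stochastic gradient descent with a constant step size."""
    rng = rng or np.random.default_rng(0)
    N, P = Hx.shape
    A = np.eye(P) * 0.01                 # stand-in for the Boosted SSC diagonal init
    for _ in range(steps):
        k = rng.integers(N)
        s = Hx[k] @ A @ Hy[k]            # f_LBGM(x_k, y_k)
        # per-sample gradient: dL/dA = -l_k exp(-l_k s) h(x_k) h(y_k)^T
        G = -labels[k] * np.exp(-labels[k] * s) * np.outer(Hx[k], Hy[k])
        A -= lr * 0.5 * (G + G.T)        # symmetrized update keeps A = A^T
    return A
```

In the paper the per-pair exponential terms are precomputed once, since the binary responses never change during this stage; the sketch above recomputes them for clarity.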
We now show how the $A_P$ matrix in $f_{LBGM}$ can be factorized to result in compact feature descriptors computed independently over each input.

Assuming $A_P$ to be a symmetric $P \times P$ matrix, it can be factorized into the following form,

$$A_P = B W B^T = \sum_{k=1}^{d} w_k \mathbf{b}_k \mathbf{b}_k^T \quad (6)$$

where $W = \mathrm{diag}([w_1, \cdots, w_d])$, $w_k \in \{-1, 1\}$, $B = [\mathbf{b}_1, \cdots, \mathbf{b}_d]$, $\mathbf{b}_k \in \mathbb{R}^P$, and $d \leq P$. Equation (4) can then be re-expressed as

$$f_{LBGM}(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{d} w_k \Big( \sum_{i=1}^{P} b_{k,i} h_i(\mathbf{x}) \Big) \Big( \sum_{j=1}^{P} b_{k,j} h_j(\mathbf{y}) \Big) . \quad (7)$$

This factorization defines a signed inner product between the embedded feature vectors and provides increased efficiency with respect to the original similarity measure.¹ For $d < P$ (i.e., the effective rank of $A_P$ is $d < P$) the factorization represents a smoothed version of $A_P$ discarding the low-energy dimensions that typically correlate with noise, leading to further performance improvements. The final embedding found with our approach is therefore

$$H_{LBGM}(\mathbf{x}) = B^T \mathbf{h}(\mathbf{x}) , \quad (8)$$

and $H_{LBGM}(\mathbf{x}) : \mathbb{R}^D \rightarrow \mathbb{R}^d$.

The projection matrix $B$ defines a discriminative dimensionality reduction optimized with respect to the exponential loss objective of Equation (5). As seen in our experiments, in the case of redundant $h_i$ this results in a considerable feature compression, also offering a more compact description than the original input patch.

Figure 1: A specialized configuration of weak response functions $\phi$ corresponding to a regular gridding within the image patch. In addition, assuming a Gaussian weighting of the $\alpha$'s results in a descriptor that closely resembles SIFT [1] and is one of the many solutions afforded by our learning framework.

3.3 Weak learners

The boosting-trick allows for a variety of non-linear embeddings parameterized by the chosen weak learner family.
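One concrete way to realize the factorization of Section 3.2 is through an eigendecomposition of the symmetric $A_P$, keeping the $d$ largest-magnitude eigenvalues. A hedged sketch under that assumption (the paper does not prescribe a specific factorization algorithm):

```python
import numpy as np

def factorize_embedding(A_P, d):
    """Factorize a symmetric A_P as B W B^T (Eq. 6), keeping the d
    largest-magnitude eigenvalues.  Returns the projection B (P x d) and
    signs w so that f_LBGM(x, y) ~= (B^T h(x))^T diag(w) (B^T h(y))."""
    vals, vecs = np.linalg.eigh(A_P)           # A_P = V diag(vals) V^T
    order = np.argsort(-np.abs(vals))[:d]      # keep top-d energy, drop "noise" dims
    w = np.sign(vals[order])                   # the +/-1 weights of W
    B = vecs[:, order] * np.sqrt(np.abs(vals[order]))  # fold magnitudes into B
    return B, w

def embed(B, h):
    """H_LBGM(x) = B^T h(x) (Eq. 8); h is the P-vector of weak responses."""
    return B.T @ h
```

For $d = P$ the signed inner product of the embeddings reproduces $\mathbf{h}(\mathbf{x})^T A_P \mathbf{h}(\mathbf{y})$ exactly; for $d < P$ it is the smoothed, low-rank version described above.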
We employ the gradient-based response functions of [12] to form our feature descriptor. In [12], the usefulness of these features was demonstrated for visual object detection. In what follows, we extend these features to the descriptor matching task, illustrating their close connection with the well-known SIFT descriptor.

Following the notation of [12], our weak learners are defined as

$$h(\mathbf{x}; R, e, T) = \begin{cases} 1 & \text{if } \phi_{R,e}(\mathbf{x}) \leq T \\ -1 & \text{otherwise} \end{cases} , \quad (9)$$

where

$$\phi_{R,e}(\mathbf{x}) = \sum_{m \in R} \xi_e(\mathbf{x}, m) \Big/ \sum_{e_k \in \Phi,\, m \in R} \xi_{e_k}(\mathbf{x}, m) , \quad (10)$$

with $\xi_e(\mathbf{x}, m)$ being the gradient energy along an orientation $e$ at location $m$ within $\mathbf{x}$, and $R$ defining a rectangular extent within the patch. The gradient energy is computed based on the dot product between $e$ and the gradient orientation at pixel $m$ [12]. The orientation $e$ ranges between $[-\pi, \pi]$ and is quantized to take values $\Phi = \{0, \frac{2\pi}{q}, \frac{4\pi}{q}, \cdots, (q-1)\frac{2\pi}{q}\}$ with $q$ the number of quantization bins.

¹ Matching two sets of descriptors each of size $N$ is $O(N^2 P^2)$ under the original measure and $O(N P d + N^2 d)$ given the factorization, resulting in significant savings for reasonably sized $N$ and $P$, and $d \ll P$.

Figure 2: Learned spatial weighting obtained with Boosted Gradient Maps (BGM) trained on (a) Liberty, (b) Notre Dame and (c) Yosemite datasets. The learned weighting closely resembles the Gaussian weighting employed by SIFT (white circles indicate $\sigma/2$ and $\sigma$ used by SIFT).
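A simplified sketch of these gradient-based weak learners: here we approximate the per-orientation gradient energy $\xi_e$ by the positive part of the projection of the image gradient onto each quantized orientation, which is one plausible reading of [12] rather than the authors' exact computation:

```python
import numpy as np

def gradient_energy(patch, q=8):
    """Per-pixel gradient energy along q quantized orientations,
    a stand-in for xi_e(x, m): the (positive part of the) projection
    of the gradient on the unit vector of each orientation bin."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                  # gradient orientation in [-pi, pi]
    bins = 2 * np.pi * np.arange(q) / q       # quantized orientations Phi
    return np.stack([mag * np.maximum(np.cos(ang - b), 0.0) for b in bins])

def phi(patch, rect, e, q=8):
    """phi_{R,e}(x) (Eq. 10): energy along orientation bin e pooled over the
    rectangle R, normalized by the total energy in R over all orientations."""
    y0, y1, x0, x1 = rect
    xi = gradient_energy(patch, q)[:, y0:y1, x0:x1]
    return xi[e].sum() / max(xi.sum(), 1e-12)

def weak_learner(patch, rect, e, T, q=8):
    """h(x; R, e, T) (Eq. 9): threshold the pooled, normalized energy."""
    return 1 if phi(patch, rect, e, q) <= T else -1
```

Boosting then searches over the parameters $(R, e, T)$ of this family; a SIFT-like descriptor corresponds to one particular grid of rectangles and orientations.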
As noted in [12], this representation can be computed efficiently using integral images.

The non-linear gradient response functions $\phi_{R,e}$ along with their thresholds $T$ define the parameterization of the weak learner family optimized with our approach. Consider the specialized configuration illustrated in Figure 1. This corresponds to a selection of weak learners whose $R$ and $e$ values are parameterized such that they lie along a regular grid, equally sampling each edge orientation within each grid cell. In addition, if we assume a Gaussian weighting centered about the patch, the resulting descriptor closely resembles SIFT² [1]. In fact, this configuration and weighting corresponds to one of the many solutions afforded by our approach. In [4], they note the importance of allowing for alternative pooling and feature selection strategies, both of which are effectively optimized within our framework. As seen in our experiments, this results in a significant performance gain over hand-designed SIFT.

4 Results

In this section, we first present an overview of our evaluation framework. We then show the results obtained using Boosted SSC combined with the gradient-based weak learners described in Sec. 3.3. We continue with the results generated when applying the factorized embedding of the matrix $A$. Finally, we present a comparison of our final descriptor with the state of the art.

4.1 Evaluation framework

We evaluate the performance of our methods using three publicly available datasets: Liberty, Notre Dame and Yosemite [4]. Each of them contains over 400k scale- and rotation-normalized 64 × 64 patches. These patches are sampled around interest points detected using Difference of Gaussians, and the correspondences between patches are found using a multi-view stereo algorithm.
The datasets created this way exhibit substantial perspective distortion and various lighting conditions. The ground truth available for each of these datasets describes 100k, 200k and 500k pairs of patches, where 50% correspond to match pairs, and 50% to non-match pairs. In our evaluation, we separately consider each dataset for training and use the held-out datasets for testing. We report the results of the evaluation in terms of ROC curves and 95% error rate, as is done in [4].

4.2 Boosted Gradient Maps

To show the performance boost we get by using gradient-based weak learners in our boosting scheme, we plot the results for the original Boosted SSC method [3], which relies on thresholded pixel intensities as weak learners, and for the same method using gradient-based weak learners instead (referred to as Boosted Gradient Maps (BGM)), with $q = 24$ quantized orientation bins used throughout our experiments. As we can see in Fig. 3(a), a 128-dimensional Boosted SSC descriptor can be easily outperformed by a 32-dimensional BGM descriptor. When comparing descriptors of the same dimensionality, the improvement measured in terms of 95% error rate reaches over 50%. Furthermore, it is worth noticing that with 128 dimensions BGM performs similarly to SIFT, and when we increase the dimensionality to 512, it outperforms SIFT by 14% in terms of 95% error rate. When comparing the 256-dimensional SIFT (obtained by increasing the granularity of the orientation bins) with the 256-dimensional BGM, the extended SIFT descriptor performs much worse

² SIFT additionally normalizes each descriptor to be unit norm; however, the underlying representation is otherwise quite similar.

Figure 3: (a) Boosted SSC using thresholded pixel intensities in comparison with our Boosted Gradient Maps (BGM) approach. (b) Results after optimization of the correlation matrix $A$.
Per-\nformance is evaluated with respect to factorization dimensionality d. In parentheses: the number of\ndimensions and the 95% error rate.\n\n(34.22% error rate vs 15.99% for BGM-256). This indicates that boosting with a similar number of\nnon-linear classi\ufb01ers adds to the performance, and proves how well tuned the SIFT descriptor is.\nVisualizations of the learned weighting obtained with BGM trained on Liberty, Notre Dame and\nYosemite datasets are displayed in Figure 2. To plot the visualizations we sum the \u03b1\u2019s across\norientations within the rectangular regions of the corresponding weak learners. Note that although\nthere are some differences, interestingly this weighting closely resembles the Gaussian weighting\nemployed by SIFT.\n\n4.3 Low-Dimensional Boosted Gradient Maps\nTo further improve performance, we optimize over the correlation matrix of the weak learners\u2019 re-\nsponses, as explained in Sec. 3.1, and apply the embedding from Sec. 3.2. The results of this method\nare shown in Fig. 3(b). In these experiments, we learn our L-BGM descriptor using the responses of\n512 gradient-based weak learners selected with boosting. We \ufb01rst optimize over the weak learners\u2019\ncorrelation matrix which is constrained to be diagonal. This corresponds to a global optimization\nof the weights of the weak learners. The resulting 32-dimensional L-BGM-Diag descriptor per-\nforms only slightly better than the corresponding 32-dimensional BGM. Interestingly, the additional\ndegrees of freedom obtained by optimizing over the full correlation matrix boost the results sig-\nni\ufb01cantly and allow us to outperform SIFT with as few as 32 dimensions. When we compare our\n128-dimensional descriptor, i.e., the descriptor of the same length as SIFT, we observe 15% im-\nprovement in terms of 95% error rate. 
However, when we increase the descriptor length from 256 to\n512 we can see a slight performance drop since we begin to include the \u201cnoisy\u201d dimensions of our\nembedding which correspond to the eigenvalues of low magnitude, a trend typical to many dimen-\nsionality reduction techniques. Hence, as our \ufb01nal descriptor, we select the 64-dimensional L-BGM\ndescriptor, as it provides a decent trade-off between performance and descriptor length.\nFigure 3(b) also shows the results obtained by applying PCA on the responses of 512 gradient-based\nweak learners (BGM-PCA). The descriptor generated this way performs similarly to SIFT, however\nour method still provides better results even for the same dimensionality, which shows the advantage\nin optimizing the exponential loss of Eq. 5.\n\n4.4 Comparison with the state of the art\nHere we compare our approach against the following baselines: sum of squared differences of pixel\nintensities (SSD), the state-of-the-art SIFT descriptor [1], SURF descriptor [2], binary LDAHash\ndescriptor [5], a real-valued descriptor computed by applying LDE projections on bias-gain normal-\nized patches (LDA-int) [4] and the original Boosted SSC [3]. We have also tested recent binary\ndescriptors such as BRIEF [27], ORB [28] or BRISK [29], however, they performed much worse\nthan the baselines presented in the paper. For SIFT, we use the publicly available implementation of\nA. Vedaldi [30]. For SURF and LDAHash, we use the implementation available from the websites\nof the authors. For the other methods, we use our own implementation. For LDA-int we choose\nthe dimensionality which was reported to perform the best on a given dataset according to [4]. 
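The 95% error rate used throughout these comparisons can be computed as in the following illustrative helper, assuming higher similarity scores indicate match pairs (our convention; the authors' evaluation code may differ in details):

```python
import numpy as np

def error_rate_at_95(scores, labels):
    """False-positive rate at the threshold where 95% of the true match
    pairs are accepted -- the '95% error rate' reported on the
    Liberty / Notre Dame / Yosemite pair lists."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    pos = np.sort(scores[labels == 1])[::-1]       # match-pair scores, descending
    thr = pos[int(np.ceil(0.95 * len(pos))) - 1]   # accepts >= 95% of matches
    neg = scores[labels != 1]
    return np.mean(neg >= thr)                     # fraction of false positives
```

When distances rather than similarities are used, negating the scores before calling the helper gives the same quantity.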
For Boosted SSC, we use 128 dimensions as this obtained the best performance.

[Figure 3 plots, Train: Liberty (200k), Test: Notre Dame (100k). Legend (a): SIFT (128, 28.09%), Boosted SSC (128, 72.95%), BGM (32, 37.03%), BGM (64, 29.60%), BGM (128, 21.93%), BGM (256, 15.99%), BGM (512, 14.36%). Legend (b): SIFT (128, 28.09%), Boosted SSC (128, 72.95%), BGM-PCA (32, 25.73%), L-BGM-Diag (32, 34.71%), L-BGM (32, 16.20%), L-BGM (64, 14.15%), L-BGM (128, 13.76%), L-BGM (256, 13.38%), L-BGM (512, 16.33%).]

Figure 4: Comparison to the state of the art. In parentheses: the number of dimensions and the 95% error rate. Our L-BGM approach outperforms SIFT by up to 18% in terms of 95% error rate using half as many dimensions.

In Fig. 4 we plot the recognition curves for all the baselines and our method. BGM and L-BGM outperform the baseline methods across all FP rates. The maximal performance boost is obtained by using our 64-dimensional L-BGM descriptor, which results in an up to 18% improvement in terms of 95% error rate with respect to the state-of-the-art SIFT descriptor. Descriptors derived from patch intensities, i.e. SSD, Boosted SSC and LDA-int, perform much worse than the gradient-based ones. Finally, our BGM and L-BGM descriptors far outperform SIFT, which relies on hand-crafted filters applied to gradient maps. Moreover, with BGM and L-BGM we are able to reduce the 95% error rate by over 3 times with respect to the other state-of-the-art descriptors, namely SURF and LDAHash. We have computed the results for all the configurations of training and testing datasets without observing any significant differences, thus we show here only a representative set of the curves.
More results can be found in the supplementary material.

Interestingly, the results we obtain are comparable with the "best of the best" results reported in [4]. However, since the code for their compact descriptors is not publicly available, we can only compare the performance in terms of the 95% error rates. Only the composite descriptors of [4] provide some advantage over our compact L-BGM, as their average 95% error rate is 2% lower than that of L-BGM. Nevertheless, we outperform their non-parametric descriptors by 12% and perform slightly better than the parametric ones, while using descriptors that are an order of magnitude shorter. This comparison indicates that even though our approach does not require any complex pipeline optimization or parameter tuning, we perform similarly to the finely optimized descriptors presented in [4].

5 Conclusions

In this paper we presented a new method for learning image descriptors using Low-Dimensional Boosted Gradient Maps (L-BGM). L-BGM offers an attractive alternative to traditional descriptor learning techniques that model non-linearities based on the kernel-trick, relying on a pre-specified kernel function whose selection can be difficult and unintuitive. In contrast, we have shown that for the descriptor matching problem the boosting-trick leads to non-linear feature mappings whose features have an intuitive explanation. We demonstrated the use of gradient-based weak learner functions for learning descriptors within our framework, illustrating their close connection with the well-known SIFT descriptor. A discriminative embedding technique was also presented, yielding fairly compact and discriminative feature descriptors compared to the baseline methods. We evaluated our approach on benchmark datasets where L-BGM was shown to outperform leading contemporary hand-designed and learned feature descriptors.
Unlike previous approaches, our L-BGM descriptor can be learned directly from raw intensity patches, achieving state-of-the-art performance. Interesting avenues of future work include the exploration of other weak learner families for descriptor learning, e.g., SURF-like Haar features, and extensions to binary feature embeddings.

Acknowledgments

We would like to thank Karim Ali for sharing his feature code and for his insightful feedback and discussions.

[Figure 4 plots: True Positive Rate vs. False Positive Rate; (a) Train: Notre Dame (200k), Test: Liberty (100k); (b) Train: Yosemite (200k), Test: Notre Dame (100k); comparing SSD, SIFT, SURF, LDAHash, LDA-int, Boosted SSC, BGM and L-BGM.]

References

[1] Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60(2) (2004) 91-110

[2] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. In: ECCV'06

[3] Shakhnarovich, G.: Learning Task-Specific Similarity. PhD thesis, MIT (2006)

[4] Brown, M., Hua, G., Winder, S.: Discriminative Learning of Local Image Descriptors. PAMI (2011)

[5] Strecha, C., Bronstein, A., Bronstein, M., Fua, P.: LDAHash: Improved Matching with Smaller Descriptors. PAMI 34(1) (2012)

[6] Kulis, B., Jain, P., Grauman, K.: Fast Similarity Search for Learned Metrics. PAMI (2009) 2143-2157

[7] Shen, C., Kim, J., Wang, L., van den Hengel, A.: Positive Semidefinite Metric Learning with Boosting. In: NIPS. (2009)

[8] Jain, P., Kulis, B., Davis, J., Dhillon, I.: Metric and Kernel Learning using a Linear Transformation.
JMLR (2012)

[9] Bi, J., Wu, D., Lu, L., Liu, M., Tao, Y., Wolf, M.: AdaBoost on Low-Rank PSD Matrices for Metric Learning. In: CVPR. (2011)

[10] Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: CVPR'01

[11] Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., Tseng, B.: Boosted Multi-Task Learning. Machine Learning (2010)

[12] Ali, K., Fleuret, F., Hasler, D., Fua, P.: A Real-Time Deformable Detector. PAMI 34(2) (2012) 225-239

[13] Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR'05

[14] Weiss, Y., Torralba, A., Fergus, R.: Spectral Hashing. NIPS 21 (2009) 1753-1760

[15] Kulis, B., Darrell, T.: Learning to Hash with Binary Reconstructive Embeddings. In: NIPS'09

[16] Salakhutdinov, R., Hinton, G.: Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In: International Conference on Artificial Intelligence and Statistics. (2007)

[17] Salakhutdinov, R., Hinton, G.: Semantic Hashing. International Journal of Approximate Reasoning (2009)

[18] Grauman, K., Darrell, T.: The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In: ICCV'05

[19] Shen, C., Welsh, A., Wang, L.: PSDBoost: Matrix Generation Linear Programming for Positive Semidefinite Matrices Learning. In: NIPS. (2008)

[20] Jia, Y., Huang, C., Darrell, T.: Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features. In: CVPR'12

[21] Simonyan, K., Vedaldi, A., Zisserman, A.: Descriptor Learning Using Convex Optimisation. In: ECCV'12

[22] Dollár, P., Tu, Z., Perona, P., Belongie, S.: Integral Channel Features. In: BMVC'09

[23] Torralba, A., Fergus, R., Weiss, Y.: Small Codes and Large Databases for Recognition. In: CVPR'08

[24] Ali, K., Fleuret, F., Hasler, D., Fua, P.: A Real-Time Deformable Detector.
PAMI (2011)

[25] Freund, Y., Schapire, R.: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In: European Conference on Computational Learning Theory. (1995)

[26] Rosset, S., Zhu, J., Hastie, T.: Boosting as a Regularized Path to a Maximum Margin Classifier. JMLR (2004)

[27] Calonder, M., Lepetit, V., Ozuysal, M., Trzcinski, T., Strecha, C., Fua, P.: BRIEF: Computing a Local Binary Descriptor Very Fast. PAMI 34(7) (2012) 1281-1298

[28] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An Efficient Alternative to SIFT or SURF. In: ICCV'11

[29] Leutenegger, S., Chli, M., Siegwart, R.: BRISK: Binary Robust Invariant Scalable Keypoints. In: ICCV'11

[30] Vedaldi, A.: http://www.vlfeat.org/~vedaldi/code/siftpp.html