{"title": "Heavy-tailed Distances for Gradient Based Image Descriptors", "book": "Advances in Neural Information Processing Systems", "page_first": 397, "page_last": 405, "abstract": "Many applications in computer vision measure the similarity between images or image patches based on some statistics such as oriented gradients. These are often modeled implicitly or explicitly with a Gaussian noise assumption, leading to the use of the Euclidean distance when comparing image descriptors. In this paper, we show that the statistics of gradient based image descriptors often follow a heavy-tailed distribution, which undermines any principled motivation for the use of Euclidean distances. We advocate for the use of a distance measure based on the likelihood ratio test with appropriate probabilistic models that fit the empirical data distribution. We instantiate this similarity measure with the Gamma-compound-Laplace distribution, and show significant improvement over existing distance measures in the application of SIFT feature matching, at relatively low computational cost.", "full_text": "Heavy-tailed Distances for\n\nGradient Based Image Descriptors\n\nYangqing Jia and Trevor Darrell\n\nUC Berkeley EECS and ICSI\n\n{jiayq,trevor}@eecs.berkeley.edu\n\nAbstract\n\nMany applications in computer vision measure the similarity between images or\nimage patches based on some statistics such as oriented gradients. These are of-\nten modeled implicitly or explicitly with a Gaussian noise assumption, leading to\nthe use of the Euclidean distance when comparing image descriptors. In this pa-\nper, we show that the statistics of gradient based image descriptors often follow\na heavy-tailed distribution, which undermines any principled motivation for the\nuse of Euclidean distances. We advocate for the use of a distance measure based\non the likelihood ratio test with appropriate probabilistic models that \ufb01t the em-\npirical data distribution. 
We instantiate this similarity measure with the Gamma-\ncompound-Laplace distribution, and show signi\ufb01cant improvement over existing\ndistance measures in the application of SIFT feature matching, at relatively low\ncomputational cost.\n\n1\n\nIntroduction\n\nA particularly effective image representation has developed in recent years, formed by computing\nthe statistics of oriented gradients quantized into various spatial and orientation selective bins. SIFT\n[14], HOG [6], and GIST [17] have been shown to have extraordinary descriptiveness on both in-\nstance and category recognition tasks, and have been designed with invariances to many common\nnuisance parameters. Signi\ufb01cant motivation for these architectures arises from biology, where mod-\nels of early visual processing similarly integrate statistics over orientation selective units [21, 18].\nTwo camps have developed in recent years regarding how such descriptors should be compared. The\n\ufb01rst advocates comparison of raw descriptors. Early works [6] considered the distance of patches to\na database from labeled images; this idea was reformulated as a probabilistic classi\ufb01er in the NBNN\ntechnique [4], which has surprisingly strong performance across a range of conditions. Ef\ufb01cient\napproximations based on hashing [22, 12] or tree-based data structures [14, 16] or their combination\n[19] have been commonly applied, but do not change the underlying ideal distance measure.\nThe other approach is perhaps the more dominant contemporary paradigm, and explores a quantized-\nprototype approach where descriptors are characterized in terms of the closest prototype, e.g., in\na vector quantization scheme. Recently, hard quantization and/or Euclidean-based reconstruction\ntechniques have been shown inferior to sparse coding methods, which employ a sparsity prior to\nform a dictionary of prototypes. 
A series of recent publications has proposed prototype formation\nmethods based on various sparsity-inducing priors, most commonly the L1 prior [15], as\nwell as schemes for sharing structure in an ensemble-sparse fashion across tasks or conditions [10]. It\nis informative that sparse coding methods also have a foundation as models for computational visual\nneuroscience [18].\nVirtually all these methods use the Euclidean distance when comparing image descriptors against\nthe prototypes or the reconstructions, which is implicitly or explicitly derived from a Gaussian noise\nassumption on image descriptors. In this paper, we ask whether this assumption holds, and further, whether\n\n1\n\n\f(a) Histogram\n\n(b) Matching Patches\n\nFigure 1: (a) The histogram of the difference between SIFT features of matching image patches from\nthe Photo Tourism dataset. (b) A typical example of matching patches. The obstruction (wooden\nbranch) in the bottom patch leads to a sparse change to the histogram of oriented gradients (the two\nred bars).\n\nthere is a distance measure that better fits the distribution of real-world image descriptors. We\nbegin by investigating the statistics of oriented gradient based descriptors, focusing on the well\nknown Photo Tourism database [25] of SIFT descriptors for simplicity. We evaluate\nthe statistics of corresponding patches, and see that the distribution is heavy-tailed and decidedly non-Gaussian,\nundermining any principled motivation for the use of Euclidean distances.\nWe consider generative factors that explain why this may be so, and derive a heavy-tailed distribution (that we\ncall the gamma-compound-Laplace distribution) in a Bayesian fashion, which empirically fits\ngradient based descriptors well. 
Based on this, we propose a principled approach that uses the\nlikelihood ratio test to measure the similarity between data points under any arbitrary parameterized\ndistribution, which includes the previously adopted Gaussian and exponential family distributions\nas special cases. In particular, we prove that for the heavy-tailed distribution we propose, the\ncorresponding similarity measure leads to a distance metric, theoretically justifying its use as a\nsimilarity measurement between image patches.\nThe contribution of this paper is two-fold. We believe ours is the first work to systematically examine the distribution of the noise in terms of oriented gradients for corresponding keypoints in\nnatural scenes. In addition, the likelihood ratio distance measure establishes a principled connection\nbetween the distribution of data and various distance measures in general, allowing us to choose\nthe appropriate distance measure that corresponds to the true underlying distribution in an application. Our method serves as a building block in both nearest-neighbor distance computation (e.g.\nNBNN [4]) and codebook learning (e.g. vector quantization and sparse coding), where the Euclidean\ndistance measure can be replaced by our distance measure for better performance.\nIt is important to note that in both paradigms listed above \u2013 nearest-neighbor distance computation\nand codebook learning \u2013 discriminative variants and structured approaches exist that can optimize a\ndistance measure or codebook based on a given task. Learning a distance measure that incorporates\nboth the data distribution and task-dependent information is the subject of future work.\n\n2 Statistics of Local Image Descriptors\n\nIn this section, we focus on examining the statistics of local image descriptors, using the SIFT\nfeature [14] as an example. Classical feature matching and clustering methods on SIFT features\nuse the Euclidean distance to compare two descriptors. 
In a probabilistic perspective, this implies\na Gaussian noise model for SIFT: given a feature prototype \u00b5 (which could be the prototype in\nfeature matching, or a cluster center in clustering), the probability that an observation x matches the\nprototype can be evaluated by the Gaussian probability\n\np(x|\u00b5) \u221d exp(\u2212\u2016x \u2212 \u00b5\u2016_2^2 / (2\u03c3^2)),\n\n(1)\n\n2\n\n\fFigure 2: The probability values of the GCL, Laplace and Gaussian distributions via ML estimation,\ncompared against the empirical distribution of local image descriptor noises. The figure is in log\nscale and curves are normalized for better comparison. For details about the data, see Section 4.\n\nwhere \u03c3 is the standard deviation of the noise. Such a Gaussian noise model has been explicitly or\nimplicitly assumed in most algorithms including vector quantization, sparse coding (on the reconstruction error), etc.\nDespite the popular use of the Euclidean distance, the distribution of the noise between matching SIFT\npatches does not follow a Gaussian distribution: as shown in Figure 1(a), the distribution is highly\nkurtotic and heavy tailed, indicating that the Euclidean distance may not be ideal.\nThe reason why the Gaussian distribution may not be a good model for the noise of local image\ndescriptors can be better understood from the generative procedure of the SIFT features. Figure\n1(b) shows a typical case of matching patches: one patch contains a partially obstructing object\nwhile the other does not. The resulting histogram differs only in a sparse subset of the oriented\ngradients. 
Further, research on the V1 receptive field [18] suggests that natural images are formed\nfrom localized, oriented, bandpass patterns, implying that changing the weight of one such building\npattern may tend to change only one or a few dimensions of the binned oriented gradients, instead\nof imposing an isotropic Gaussian change on the whole feature.\n\n2.1 A Heavy-tailed Distribution for Image Descriptors\n\nWe first explore distributions that fit such a heavy-tailed property. A common approach to cope with\nheavy tails is to use the L1 distance, which corresponds to the Laplace distribution\n\np(x|\u00b5; \u03bb) \u221d (\u03bb/2) exp(\u2212\u03bb|x \u2212 \u00b5|).\n\n(2)\n\nHowever, the tail of the noise distribution is often still heavier than the Laplace distribution: empirically, we find the kurtosis of the SIFT noise distribution to be larger than 7 for most dimensions,\nwhile the kurtosis of the Laplace distribution is only 3. Inspired by hierarchical Bayesian models\n[11], instead of fixing the \u03bb value in the Laplace distribution, we introduce a conjugate Gamma prior\nover \u03bb modeled by hyperparameters {\u03b1, \u03b2}, and compute the probability of x given the prototype \u00b5\nby integrating over \u03bb:\n\np(x|\u00b5; \u03b1, \u03b2) = \u222b (\u03bb/2) e^{\u2212\u03bb|x\u2212\u00b5|} \u00b7 (\u03b2^\u03b1/\u0393(\u03b1)) \u03bb^{\u03b1\u22121} e^{\u2212\u03b2\u03bb} d\u03bb = (\u03b1/2) \u03b2^\u03b1 (|x \u2212 \u00b5| + \u03b2)^{\u2212\u03b1\u22121}.\n\n(3)\n\nThis leads to a heavier tail than the Laplace distribution. We call Equation (3) the Gamma-compound-Laplace (GCL) distribution, in which the hyperparameters \u03b1 and \u03b2 control the shape\nof the tail. Figure 2 shows the empirical distribution of the SIFT noise and the maximum likelihood\nfitting of various models. It can be observed that the GCL distribution enables us to fit the heavy-tailed empirical distribution better than other distributions. We note that similar approaches have\nbeen exploited in the compressive sensing context [9], and have been shown to perform better than using\nthe Laplace distribution as the sparse prior in applications such as signal recovery.\nFurther, we note that the statistics of a wide range of other natural image descriptors beyond SIFT\nfeatures are known to be highly non-Gaussian and have heavy tails [24]. Examples of these include\n\n3\n\n\f
In general, we\nassume that the data is generated by a parameterized probability distribution p(x|\u03b8), where \u03b8 is the\nvector of parameters. A null hypothesis is stated by restricting the parameter \u03b8 in a speci\ufb01c subset\n\u03980, which is nested in a more general parameter space \u0398. To test if the restricted null hypothesis\n\ufb01ts a set of observations X , a natural choice is to use the ratio of the maximized likelihood of the\nrestricted model to the more general model:\n\n\u039b(X ) = L(\u02c6\u03b80;X )/L(\u02c6\u03b8;X ),\n\n(4)\nwhere L(\u03b8;X ) is the likelihood function, \u02c6\u03b80 is the maximum likelihood estimate of the parameter\nwithin the restricted subset \u03980, and \u02c6\u03b8 is the maximum likelihood estimate under the general case.\nIt is easily veri\ufb01able that \u039b(X ) always lies in the range [0, 1], as the maximum likelihood estimate\nof the general case would always \ufb01t at least as well as the restricted case, and that the likelihood\nis always a nonnegative value. The likelihood ratio test is then de\ufb01ned as a statistical test that\nrejects the null hypothesis when the statistic \u039b(X ) is smaller than a certain threshold \u03b1, such as the\nPearson\u2019s chi-square test [7] for categorical data.\nInstead of producing a binary decision, we propose to use the score directly as the generative sim-\nilarity measure between two single data points. Speci\ufb01cally, we assume that each data point x is\ngenerated from a parameterized distribution p(x|\u00b5) with unknown prototype \u00b5. 
Thus, the statement\n\u201ctwo data points x and y are similar\u201d can be reasonably represented by the null hypothesis that the\ntwo data points are generated from the same prototype \u00b5, leading to the probability\n\nq0(x, y|\u00b5xy) = p(x|\u00b5xy) p(y|\u00b5xy).\n\n(5)\n\nThis restricted model is further nested in the more general model that generates the two data points\nfrom two possibly different prototypes:\n\nq(x, y|\u00b5x, \u00b5y) = p(x|\u00b5x) p(y|\u00b5y),\n\n(6)\n\nwhere \u00b5x and \u00b5y are not necessarily equal.\nThe similarity between the two data points x and y is then defined by the likelihood ratio statistic\nbetween the null hypothesis of equality and the alternative hypothesis of inequality over prototypes:\n\ns(x, y) = ( p(x|\u02c6\u00b5xy) p(y|\u02c6\u00b5xy) ) / ( p(x|\u02c6\u00b5x) p(y|\u02c6\u00b5y) ),\n\n(7)\n\nwhere \u02c6\u00b5x, \u02c6\u00b5y and \u02c6\u00b5xy are the maximum likelihood estimates of the prototype based on x, y, and\n{x, y} respectively. We call (7) the likelihood ratio similarity between x and y, which provides\nus information from a generative perspective: two similar data points, such as two patches of the\nsame real-world location, are more likely to be generated from the same underlying distribution, and\nthus have a large likelihood ratio value. In the following parts of the paper, we define the likelihood\nratio distance between x and y as the square root of the negative logarithm of the similarity:\n\nd(x, y) = \u221a(\u2212log s(x, y)).\n\n(8)\n\nIt is worth pointing out that, for arbitrary distributions p(x), d(x, y) is not necessarily a distance\nmetric, as the triangle inequality may not hold. However, for heavy-tailed distributions, we have\nthe following sufficient condition in the 1-dimensional case:\n\n4\n\n\fTheorem 3.1. If the distribution p(x|\u00b5) can be written as p(x|\u00b5) = exp(\u2212f(x \u2212 \u00b5)) b(x), where\nf(t) is a non-constant quasiconvex function w.r.t. 
t that satisfies f''(t) \u2264 0, \u2200t \u2208 R\\{0}, then the\ndistance defined in Equation (8) is a metric.\n\nProof. First we point out the following lemmas:\n\nLemma 3.2. If a function d(x, y) defined on X \u00d7 X \u2192 R is a distance metric, then \u221ad(x, y) is also\na distance metric.\nLemma 3.3. If the function f(t) is defined as in Theorem 3.1, then we have:\n(1) the minimizer \u02c6\u00b5xy = arg min_\u00b5 f(x \u2212 \u00b5) + f(y \u2212 \u00b5) is either x or y;\n(2) the function g(t) = min(f(t), f(\u2212t)) \u2212 f(0) is monotonically increasing and concave in R+ \u222a {0}, and g(0) = 0.\nWith Lemma 3.3, it is easily verifiable that d^2(x, y) = g(|x \u2212 y|). Then, via the subadditivity of g(\u00b7),\nwe can reach a result stronger than Theorem 3.1: that d^2(x, y) is itself a distance metric. Thus, d(x, y) is\nalso a distance metric based on Lemma 3.2. Note that we keep the square root here in conformity\nwith classical distance metrics, which we will discuss in the later parts of the paper. Detailed proofs\nof the theorem and lemmas can be found in the supplementary material.\nAs an extreme case, when f''(t) = 0 (t \u2260 0), the distance defined above is the square root of the\n(scaled) L1 distance.\n\n3.1 Distance for the GCL distribution\n\nWe use the GCL distribution parameterized by the prototype \u00b5 with fixed hyperparameters (\u03b1, \u03b2)\nas the SIFT noise model, which leads to the following GCL distance between dimensions of SIFT\npatches1:\n\nd^2(x, y) = (\u03b1 + 1)(log(|x \u2212 y| + \u03b2) \u2212 log \u03b2).\n\n(9)\n\nThe distance between two patches is then defined as the sum of per-dimension distances. Intuitively,\nwhile the Euclidean distance grows linearly w.r.t. the difference between the coordinates, the GCL\ndistance grows logarithmically, suppressing the effect of overly large differences. 
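To make Equation (9) concrete, the following is a minimal sketch of the GCL distance between two descriptors. It assumes one natural reading of the per-dimension combination, namely that the squared per-dimension distances d^2 are summed before taking the square root (as with classical distance metrics); the default hyperparameter values are illustrative, not the ones learned in the paper.

```python
import numpy as np

def gcl_distance(x, y, alpha=1.0, beta=0.1):
    """GCL likelihood-ratio distance of Eq. (9), combined over dimensions.

    Per dimension, d^2(x_i, y_i) = (alpha + 1) * (log(|x_i - y_i| + beta) - log(beta)).
    Here the per-dimension squared distances are summed and the square root is
    taken. alpha and beta are illustrative placeholder values.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    d2 = (alpha + 1.0) * (np.log(np.abs(x - y) + beta) - np.log(beta))
    return np.sqrt(d2.sum())
```

Because each per-dimension d^2 is subadditive in |x_i - y_i| (Theorem 3.1), the combined quantity satisfies the metric axioms, which can be checked numerically on random triples.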
Further, we have\nthe following theoretical justification, which is a direct result of Theorem 3.1:\nProposition 3.4. The distance d(x, y) defined in (9) is a metric.\n\n3.2 Hyperparameter Estimation for GCL\n\nIn the following, we discuss how to estimate the hyperparameters \u03b1 and \u03b2 in the GCL distribution.\nAssuming that we are given a set of one-dimensional data D = {x1, x2, \u00b7\u00b7\u00b7 , xn} that follows the\nGCL distribution, we estimate the hyperparameters by maximizing the log likelihood\n\nl(\u03b1, \u03b2; D) = \u03a3_{i=1}^{n} ( log(\u03b1/2) + \u03b1 log \u03b2 \u2212 (\u03b1 + 1) log(|x_i| + \u03b2) ).\n\n(10)\n\nThe ML estimation does not have a closed-form solution, so we adopt an alternating optimization and\niteratively update \u03b1 and \u03b2 until convergence. Updating \u03b1 with fixed \u03b2 can be achieved by computing\n\n\u03b1 \u2190 n ( \u03a3_{i=1}^{n} log(|x_i| + \u03b2) \u2212 n log \u03b2 )^{\u22121}.\n\n(11)\n\nUpdating \u03b2 can be done via the Newton-Raphson method \u03b2 \u2190 \u03b2 \u2212 l'(\u03b2)/l''(\u03b2), where\n\nl'(\u03b2) = n\u03b1/\u03b2 \u2212 \u03a3_{i=1}^{n} (\u03b1 + 1)/(|x_i| + \u03b2),   l''(\u03b2) = \u03a3_{i=1}^{n} (\u03b1 + 1)/(|x_i| + \u03b2)^2 \u2212 n\u03b1/\u03b2^2.\n\n(12)\n\n1For more than two data points X = {x_i}, it is generally difficult to find the maximum likelihood estimate\nof \u00b5 as the likelihood is nonconvex. However, with two data points x and y, it is trivial to see that \u00b5 = x\nand \u00b5 = y are the two global optima of the likelihood L(\u00b5; {x, y}), both leading to the same distance\nrepresentation in (9).\n\n5\n\n\f3.3 Relation to Existing Measures\n\nThe likelihood ratio distance is related to several existing methods. 
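The alternating updates of Section 3.2 (the closed-form \u03b1 update of Eq. 11 and Newton-Raphson steps on \u03b2 via Eq. 12) can be sketched as follows. This is a minimal sketch: the initial values, iteration count, and the positivity guard on \u03b2 are illustrative choices, not part of the original algorithm.

```python
import numpy as np

def fit_gcl(x, alpha=1.0, beta=1.0, n_iter=100):
    """Alternating ML estimation of the GCL hyperparameters (Section 3.2).

    x holds one-dimensional noise samples (descriptor differences, mu = 0).
    alpha is updated in closed form (Eq. 11); beta by one Newton-Raphson step
    on l'(beta) per iteration (Eq. 12). Initialization is illustrative.
    """
    a = np.abs(np.asarray(x, float))
    n = a.size
    for _ in range(n_iter):
        # Closed-form alpha update for fixed beta (Eq. 11).
        alpha = n / (np.log(a + beta).sum() - n * np.log(beta))
        # Newton-Raphson step on beta for fixed alpha (Eq. 12).
        lp = n * alpha / beta - (alpha + 1.0) * (1.0 / (a + beta)).sum()
        lpp = (alpha + 1.0) * (1.0 / (a + beta) ** 2).sum() - n * alpha / beta ** 2
        beta = max(beta - lp / lpp, 1e-8)  # guard: keep beta positive
    return alpha, beta
```

On synthetic data drawn from the GCL generative model (\u03bb from a Gamma prior, then Laplace noise with rate \u03bb), the updates recover the hyperparameters to within sampling error.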
In particular, we show that under\nexponential family distributions, it leads to several widely used distance measures.\nThe exponential family distribution has drawn much attention in recent years. Here we focus on\nthe regular exponential family, where the distribution of data x can be written in the following form:\n\np(x) = exp(\u2212dB(x, \u00b5)) b(x),\n\n(13)\n\nwhere \u00b5 is the mean in the exponential family sense, and dB is the regular Bregman divergence\ncorresponding to the distribution [2]. When applying the likelihood ratio distance on this distribution,\nwe obtain the distance\n\nd(x, y) = \u221a( dB(x, \u02c6\u00b5xy) + dB(y, \u02c6\u00b5xy) ),\n\n(14)\n\nsince \u02c6\u00b5x \u2261 x and dB(x, x) \u2261 0 for any x. We note that this is the square root of the Jensen-Bregman\ndivergence and is known to be a distance metric [1]. Several popular distances can be derived in this\nway. In the two most common cases, the Gaussian distribution leads to the Euclidean distance,\nand the multinomial distribution leads to the square root of the Jensen-Shannon divergence, whose\nfirst-order approximation is the \u03c7-squared distance. More generally, for (non-regular) Bregman\ndivergences defined as dB(x, \u00b5) = F(x) \u2212 F(\u00b5) \u2212 (x \u2212 \u00b5)F'(\u00b5) with an arbitrary smooth\nfunction F, the condition under which the square root of the corresponding Jensen-Bregman divergence\nis a metric has been discussed in [5].\nWhile the exponential family embraces a set of mathematically elegant distributions whose properties are well known, it fails to capture the heavy-tailed property of various natural image statistics,\nas the tail of the sufficient statistics is exponentially bounded by definition. The likelihood ratio\ndistance with heavy-tailed distributions serves as a principled extension of several popular distance\nmetrics based on the exponential family distribution. 
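As a quick numerical check of the Gaussian special case above, the following sketch evaluates the likelihood ratio distance of Equation (8) for a 1-D Gaussian noise model with fixed \u03c3 and recovers the (scaled) Euclidean distance |x \u2212 y|/(2\u03c3). The grid search stands in for the generic maximum likelihood step (for the Gaussian the shared-prototype estimate is simply the midpoint); all names are illustrative.

```python
import numpy as np

def lr_distance_gaussian(x, y, sigma=1.0):
    """Likelihood-ratio distance (Eq. 8) for 1-D Gaussian noise, fixed sigma.

    The shared-prototype ML estimate is found by a grid search to mimic the
    generic construction; the separate-prototype fit is exact (mu_x = x,
    mu_y = y), so its negative log-likelihood contribution is zero.
    """
    def neg_loglik(mu):  # -log p(x|mu) - log p(y|mu), up to constants
        return ((x - mu) ** 2 + (y - mu) ** 2) / (2.0 * sigma ** 2)
    mus = np.linspace(min(x, y), max(x, y), 100001)  # 1-D search grid
    joint = neg_loglik(mus).min()                    # shared-prototype fit
    separate = 0.0                                   # exact per-point fit
    return np.sqrt(joint - separate)
```

For the Gaussian this reduces analytically to |x \u2212 y|/(2\u03c3), i.e. a rescaled Euclidean distance, matching the reduction stated in the text.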
Further, there are principled approaches that\nconnect distances with kernels [1], upon which kernel methods such as support vector machines may\nbe built with possible heavy-tailed property of the data taken into consideration.\nThe idea of computing the similarity between data points based on certain scores has also been seen\nin the one-shot learning context [26] that uses the average prediction score taking one data point\nas training and the other as testing, and vice versa. Our method shares similar merit, but with a\ngenerative probabilistic interpretation. Integration of our method with discriminative information or\nlatent application-dependent structures is one future direction.\n\n4 Experiments\n\nIn this section, we apply the GCL distance to the problem of local image patch similarity mea-\nsure using the SIFT feature, a common building block of many applications such as stereo vision,\nstructure from motion, photo tourism, and bag-of-words image classi\ufb01cation.\n\n4.1 The Photo Tourism Dataset\n\nWe used the Photo Tourism dataset [25] to evaluate different similarity measures of the SIFT feature.\nThe dataset contains local image patches extracted from three scenes namely Notredame, Trevi\nand Halfdome, re\ufb02ecting different natural scenarios. Each set contains approximately 30,000\nground-truth 3D points, with each point containing a bag of 2d image patches of size 64 \u00d7 64\ncorresponding to the 3D point. To the best of our knowledge, this is the largest local image patch\ndatabase with ground-truth correspondences. Figure 3 shows a typical subset of patches from the\ndataset.\nThe SIFT features are computed using the code in [13]. Speci\ufb01cally, two different normalization\nschemes are tested: the l2 scheme simply normalizes each feature to be of length 1, and the thres\nscheme further thresholds the histogram at 0.2, and rescales the resulting feature to length 1. 
The\nlatter is the classical hand-tuned normalization designed in the original SIFT paper, and can be seen\nas a heuristic approach to suppress the effect of heavy tails.\nFollowing the experimental setting of [25], we also introduce random jitter effects to the raw patches\nbefore SIFT feature extraction by warping each image by the following random warping parame-\n\n6\n\n\fFigure 3: An example of the Photo Tourism dataset. From top to bottom patches are sampled from\nNotredame, Trevi and Halfdome respectively. Within each row, every adjacent two patches forms a\nmatching pair.\n\n(a) trevi\n\n(b) notredame\n\n(c) halfdome\n\nFigure 4: The mean precision-recall curve over 20 independent runs. In the \ufb01gure, solid lines are\nexperiments using features that are normalized in the l2 scheme, and dashed lines using features\nnormalized in the thres scheme. Best viewed in color.\n\nters: position shift, rotation and scale with standard deviations of 0.4 pixels, 11 degrees and 0.12\noctaves respectively. Such jitter effects represent the noise we may encounter in real feature detec-\ntion and localization [25], and allows us to test the robustness of different distance measures. For\ncompleteness, the data without jitter effects are also tested and the results reported.\n\n4.2 Testing Protocol\n\nThe testing protocol is as follows: 10,000 matching pairs and 10,000 non-matching pairs are ran-\ndomly sampled from the dataset, and we classify each pair to be matching or non-matching based on\nthe distance computed from different testing metrics. The precision-recall (PR) curve is computed,\nand two values, namely the average precision (AP) computed as the area under the PR curve and\nthe false positive rate at 95% recall (95%-FPR) are reported to compare different distance measures.\nTo test the statistical signi\ufb01cance, we carry out 20 independent runs and report the mean and stan-\ndard deviation in the paper. 
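For reference, the two feature normalization schemes of Section 4.1 can be sketched as follows. This is a minimal sketch: the 0.2 threshold follows the original SIFT paper as described in the text, and the function names are illustrative.

```python
import numpy as np

def normalize_l2(f):
    """The l2 scheme: rescale the descriptor to unit Euclidean length."""
    f = np.asarray(f, float)
    return f / np.linalg.norm(f)

def normalize_thres(f, t=0.2):
    """The thres scheme: l2-normalize, clip each bin at t, then renormalize.

    Note that after renormalization individual entries may again slightly
    exceed t; the clipping is applied only once.
    """
    f = np.minimum(normalize_l2(f), t)
    return f / np.linalg.norm(f)
```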
We focus on comparing distance measures that presume the data to lie\nin a vector space. Five different distance measures are compared, namely the L2 distance, the L1\ndistance, the symmetrized KL divergence, the \u03c72 distance, and the GCL distance.\nThe hyperparameters of the GCL distance measure are learned by randomly sampling 50,000 matching pairs from the set Notredame, and performing hyperparameter estimation as described in Section 3.2. They are then fixed and used universally for all other experiments without re-estimation.\nAs a final note, the code for the experiments in the paper will be released to the public for repeatability.\n\n4.3 Experimental Results\n\nFigure 4 shows the average precision-recall curves for all the distances on the three datasets respectively. The numerical results on the data with jitter effects are summarized in Table 1, with\nstatistically significant values shown in bold. Table 2 shows the 99% FPR on the data without jitter\neffects2. We refer to the supplementary materials for other results in the no-jitter case due to space\nconstraints. Notice that the observed trends and conclusions from the experiments with jitter effects\nare also confirmed in those without jitter effects.\nThe GCL distance outperforms other base distance measures in all the experiments. 
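The evaluation protocol of Section 4.2 (AP as the area under the precision-recall curve, and the false positive rate at a fixed recall level) can be sketched as follows. This is a minimal sketch with illustrative names; ties and degenerate inputs are not handled.

```python
import numpy as np

def ap_and_fpr(dist, labels, recall_level=0.95):
    """Evaluate a distance measure on matching (1) / non-matching (0) pairs.

    Pairs are ranked by increasing distance; AP is the area under the
    precision-recall curve, and FPR is the false positive rate at the first
    threshold reaching the given recall.
    """
    dist, labels = np.asarray(dist, float), np.asarray(labels, int)
    order = np.argsort(dist)                  # most confident matches first
    tp = np.cumsum(labels[order])
    fp = np.cumsum(1 - labels[order])
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    ap = np.sum(precision * np.diff(np.concatenate(([0.0], recall))))
    k = np.searchsorted(recall, recall_level)  # first index reaching the level
    fpr = fp[k] / (labels == 0).sum()
    return ap, fpr
```

On a perfectly separated toy set (all matching pairs closer than all non-matching pairs), AP is 1 and the FPR at 95% recall is 0, as expected.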
2As the accuracy for the no jitter effects case is much higher in general, 99% FPR is reported instead of\n95% FPR as in the jitter effect case.\n\n7\n\n\f              L2           L1           SymmKL       \u03c72           GCL\nAP\ntrevi-l2      96.61\u00b10.16   98.08\u00b10.10   97.40\u00b10.12   97.69\u00b10.11   98.33\u00b10.09\ntrevi-thres   97.23\u00b10.12   98.05\u00b10.10   97.40\u00b10.11   97.71\u00b10.11   98.21\u00b10.10\nnotre-l2      95.90\u00b10.14   97.83\u00b10.10   96.96\u00b10.12   97.31\u00b10.11   98.19\u00b10.10\nnotre-thres   96.76\u00b10.13   97.84\u00b10.10   97.05\u00b10.12   97.39\u00b10.11   98.07\u00b10.10\nhalfd-l2      94.51\u00b10.16   96.75\u00b10.11   94.87\u00b10.15   95.42\u00b10.14   98.19\u00b10.10\nhalfd-thres   95.55\u00b10.14   96.90\u00b10.11   95.08\u00b10.16   95.64\u00b10.14   97.21\u00b10.10\n95%-FPR\ntrevi-l2      23.61\u00b11.14   12.71\u00b10.83   17.58\u00b10.96   15.85\u00b10.74   10.52\u00b10.73\ntrevi-thres   19.23\u00b10.84   13.08\u00b10.91   17.57\u00b10.98   15.66\u00b10.77   11.21\u00b10.71\nnotre-l2      26.43\u00b11.03   14.27\u00b11.09   19.56\u00b11.00   17.70\u00b11.08   11.58\u00b11.00\nnotre-thres   21.88\u00b11.21   14.49\u00b11.25   19.07\u00b11.11   17.38\u00b10.95   12.09\u00b11.11\nhalfd-l2      36.34\u00b10.98   24.11\u00b11.13   34.55\u00b10.96   31.62\u00b11.09   19.76\u00b11.03\nhalfd-thres   31.44\u00b11.20   23.14\u00b10.13   33.71\u00b11.05   30.56\u00b11.13   20.74\u00b11.16\n\nTable 1: The average precision (above) and the false positive rate at 95% recall (below) of different\ndistance measures on the Photo Tourism datasets, with random jitter effects. A larger AP score and\na smaller FPR score are desired. The l2 and thres in the leftmost column indicate the two different\nfeature normalization schemes.\n\n99%-FPR       L2           L1           SymmKL       \u03c72           GCL\ntrevi-l2      11.36\u00b11.65   3.44\u00b10.75    8.02\u00b11.04    8.02\u00b11.08    2.42\u00b10.58\ntrevi-thres   7.14\u00b11.31    3.24\u00b10.69    7.93\u00b11.11    5.06\u00b10.97    2.23\u00b10.48\nnotre-l2      19.69\u00b11.93   6.09\u00b10.72    14.81\u00b11.66   9.40\u00b11.04    4.16\u00b10.57\nnotre-thres   11.9\u00b11.19    5.17\u00b10.58    13.11\u00b11.39   8.24\u00b11.12    3.72\u00b10.56\nhalfd-l2      44.55\u00b19.42   34.01\u00b12.10   43.51\u00b11.07   40.53\u00b11.12   26.06\u00b12.25\nhalfd-thres   40.58\u00b11.63   32.30\u00b12.28   42.51\u00b11.22   39.28\u00b11.49   26.36\u00b12.50\n\nTable 2: The false positive rate at 99% recall of different distance measures on the Photo Tourism\ndatasets without jitter effects.\n\nNotice that the hyperparameters learned from the notredame set perform well on the other two\ndatasets as well, indicating that they capture the general statistics of the SIFT feature, instead of\ndataset-dependent statistics. Also, the thresholding and renormalization of SIFT features does provide a significant\nimprovement for the Euclidean distance, but its effect is less significant for other distances. In fact,\nthe hard thresholding may introduce artificial noise to the data, counterbalancing the positive effect\nof reducing the tail, especially when the distance measure is already able to cope with heavy tails.\nWe argue that the key factor leading to the performance improvement is the modeling of the heavy-tailed\nproperty of the data, rather than other factors. For instance, the Laplace distribution has a heavier\ntail than the distributions corresponding to the other base distance measures, and a better performance of the\ncorresponding L1 distance over those distance measures is observed, showing a positive correlation\nbetween tail heaviness and performance. 
Notice that the tails of distributions assumed by the baseline\ndistances are still exponentially bounded, and performance is further increased by introducing\nheavy-tailed distributions such as the GCL distribution in our experiment.\n\n5 Conclusion\n\nWhile visual representations based on oriented gradients have been shown to be effective in many applications, scant attention has been paid to the issue of the heavy-tailed nature of their distributions,\nundermining the use of distance measures based on exponentially bounded distributions. In this paper, we advocate the use of distance measures that are derived from heavy-tailed distributions, where\nthe derivation can be done in a principled manner using the log likelihood ratio test. In particular,\nwe examine the distribution of local image descriptors, and propose the Gamma-compound-Laplace\n(GCL) distribution and the corresponding distance for image descriptor matching. Experimental\nresults have shown that this yields more accurate feature matching than existing baseline distance\nmeasures.\n\n8\n\n\fReferences\n[1] A Agarwal and H Daume III. Generative kernels for exponential families. In AISTATS, 2011.\n[2] A Banerjee, S Merugu, I Dhillon, and J Ghosh. Clustering with Bregman divergences. JMLR, 6:1705\u20131749, 2005.\n[3] JT Barron and J Malik. High-frequency shape and albedo from shading using natural image statistics. In CVPR, 2011.\n[4] O Boiman, E Shechtman, and M Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.\n[5] P Chen, Y Chen, and M Rao. Metrics defined by Bregman divergences. Communications in Mathematical Sciences, 6(4):915\u2013926, 2008.\n[6] N Dalal. Histograms of oriented gradients for human detection. In CVPR, 2005.\n[7] AC Davison. Statistical Models. Cambridge Univ Press, 2003.\n[8] J Huang, AB Lee, and D Mumford. Statistics of range images. In CVPR, 2000.\n[9] S Ji, Y Xue, and L Carin. 
Bayesian compressive sensing. IEEE Trans. Signal Processing, 56(6):2346\u20132356, 2008.\n[10] Y Jia, M Salzmann, and T Darrell. Factorized latent spaces with structured sparsity. In NIPS, 2010.\n[11] D Koller and N Friedman. Probabilistic Graphical Models. MIT Press, 2009.\n[12] B Kulis and T Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.\n[13] S Lazebnik, C Schmid, and J Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.\n[14] D Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91\u2013110, 2004.\n[15] J Mairal, F Bach, J Ponce, and G Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19\u201360, 2010.\n[16] AW Moore. The anchors hierarchy: using the triangle inequality to survive high dimensional data. In UAI, 2000.\n[17] A Oliva and A Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145\u2013175, 2001.\n[18] B Olshausen. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607\u2013609, 1996.\n[19] M Ozuysal and P Fua. Fast keypoint recognition in ten lines of code. In CVPR, 2007.\n[20] J Portilla, V Strela, MJ Wainwright, and EP Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Processing, 12(11):1338\u20131351, 2003.\n[21] M Riesenhuber and T Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019\u20131025, 1999.\n[22] G Shakhnarovich, P Viola, and T Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.\n[23] EP Simoncelli. Statistical models for images: compression, restoration and synthesis. In Asilomar Conference on Signals, Systems & Computers, 1997.\n[24] Y Weiss and WT Freeman. What makes a good model of natural images? 
In CVPR, 2007.\n[25] S Winder and M Brown. Learning local image descriptors. In CVPR, 2007.\n[26] L Wolf, T Hassner, and Y Taigman. The one-shot similarity kernel. In ICCV, 2009.\n\n9\n\n\f", "award": [], "sourceid": 302, "authors": [{"given_name": "Yangqing", "family_name": "Jia", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}