{"title": "Quantized Kernel Learning for Feature Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "Matching local visual features is a crucial problem in computer vision and its accuracy greatly depends on the choice of similarity measure. As it is generally very difficult to design by hand a similarity or a kernel perfectly adapted to the data of interest, learning it automatically with as few assumptions as possible is preferable. However, available techniques for kernel learning suffer from several limitations, such as restrictive parametrization or scalability. In this paper, we introduce a simple and flexible family of non-linear kernels which we refer to as Quantized Kernels (QK). QKs are arbitrary kernels in the index space of a data quantizer, i.e., piecewise constant similarities in the original feature space. Quantization allows us to compress features and keep the learning tractable. As a result, we obtain state-of-the-art matching performance on a standard benchmark dataset with just a few bits to represent each feature dimension. QKs also have explicit non-linear, low-dimensional feature mappings that grant access to Euclidean geometry for uncompressed features.", "full_text": "Quantized Kernel Learning for Feature Matching\n\nDanfeng Qin (ETH Zürich)\nXuanli Chen (TU Munich)\nMatthieu Guillaumin (ETH Zürich)\nLuc Van Gool (ETH Zürich)\n{qind, guillaumin, vangool}@vision.ee.ethz.ch, xuanli.chen@tum.de\n\nAbstract\n\nMatching local visual features is a crucial problem in computer vision and its accuracy greatly depends on the choice of similarity measure. As it is generally very difficult to design by hand a similarity or a kernel perfectly adapted to the data of interest, learning it automatically with as few assumptions as possible is preferable. 
However, available techniques for kernel learning suffer from several limitations, such as restrictive parametrization or scalability. In this paper, we introduce a simple and flexible family of non-linear kernels which we refer to as Quantized Kernels (QK). QKs are arbitrary kernels in the index space of a data quantizer, i.e., piecewise constant similarities in the original feature space. Quantization allows us to compress features and keep the learning tractable. As a result, we obtain state-of-the-art matching performance on a standard benchmark dataset with just a few bits to represent each feature dimension. QKs also have explicit non-linear, low-dimensional feature mappings that grant access to Euclidean geometry for uncompressed features.\n\n1 Introduction\n\nMatching local visual features is a core problem in computer vision with a vast range of applications such as image registration [28], image alignment and stitching [6] and structure-from-motion [1]. To cope with the geometric transformations and photometric distortions that images exhibit, many robust feature descriptors have been proposed. In particular, histograms of oriented gradients such as SIFT [15] have proved successful in many of the above tasks. Despite these results, they are inherently limited by their design choices. Hence, we have witnessed an increasing amount of work focusing on automatically learning visual descriptors from data via discriminative embeddings [11, 4] or hyper-parameter optimization [5, 21, 23, 22].\n\nA dual aspect of visual description is the measure of visual (dis-)similarity, which is responsible for deciding whether a pair of features matches or not. In image registration, retrieval and 3D reconstruction, for instance, nearest neighbor search builds on such measures to establish point correspondences. Thus, the choice of similarity or kernel impacts the performance of a system as much as the choice of visual features [2, 16, 18]. 
Designing a good similarity measure for matching is difficult, and commonly used kernels such as the linear, intersection, χ2 and RBF kernels are not ideal as their inherent properties (e.g., stationarity, homogeneity) may not fit the data well.\n\nExisting techniques for automatically learning similarity measures suffer from different limitations. Metric learning approaches [25] learn to project the data to a lower-dimensional and more discriminative space where the Euclidean geometry can be used. However, these methods are inherently linear. Multiple Kernel Learning (MKL) [3] is able to combine multiple base kernels in an optimal way, but its complexity limits the amount of data that can be used and forces the user to pre-select or design a small number of kernels that are likely to perform well. Additionally, the resulting kernel may not be easily represented in a reasonably small Euclidean space. This is problematic, as many efficient algorithms (e.g., approximate nearest neighbor techniques) heavily rely on Euclidean geometry and have non-intuitive behavior in higher dimensions.\n\nIn this paper, we introduce a simple yet powerful family of kernels, Quantized Kernels (QK), which (a) model non-linearities and heterogeneities in the data, (b) lead to compact representations that can be easily decompressed into a reasonably-sized Euclidean space and (c) are efficient to learn so that large-scale data can be exploited. In essence, we build on the fact that vector quantizers project data into a finite set of N elements, the index space, and on the simple observation that kernels on finite sets are fully specified by the N×N Gram matrix of these elements (the kernel matrix), which we propose to learn directly. Thus, QKs are piecewise constant but otherwise arbitrary, making them very flexible. 
Since the learnt kernel matrices are positive semi-definite, we directly obtain the corresponding explicit feature mappings and exploit their potential low-rankness.\n\nIn the remainder of the paper, we first further discuss related work (Sec. 2), then present QKs in detail (Sec. 3). As important contributions, we show how to efficiently learn the quantizer and the kernel matrix so as to maximize the matching performance (Sec. 3.2), using an exact linear-time inference subroutine (Sec. 3.3), and devise practical techniques for users to incorporate knowledge about the structure of the data (Sec. 3.4) and reduce the number of parameters of the system. Our experiments in Sec. 4 show that our kernels yield state-of-the-art performance on a standard feature matching benchmark and improve over kernels used in the literature for several descriptors, including one based on metric learning. Our compressed features are very compact, using only 1 to 4 bits per dimension of the original features. For instance, on SIFT descriptors, our QK yields about 10% improvement on matching compared to the dot product, while compressing features by a factor of 8.\n\n2 Related work\n\nOur work relates to a vast literature on kernel selection and tuning, and on descriptor, similarity, distance and kernel learning. We present a selection of such works below.\n\nBasic kernels and kernel tuning. A common approach for choosing a kernel is to pick one from the literature: dot product, Gaussian RBF, intersection [16], χ2, Hellinger, etc. These generic kernels have been extensively studied [24] and have properties such as homogeneity or stationarity. These properties may be inadequate for the data of interest and thus the kernels will not yield optimal performance. Efficient yet approximate versions of such kernels [9, 20, 24] are similarly inadequate.\n\nDescriptor learning. Early work on descriptor learning improved SIFT by exploring its parameter space [26]. 
Later, automatic parameter selection was proposed with a non-convex objective [5]. Recently, significant improvements in local description for matching have been obtained by optimizing feature encoding [4] and descriptor pooling [21, 23]. These works maximize the matching performance directly via convex optimization [21] or boosting [23]. As we show in our experiments, our approach improves matching even for such optimized descriptors.\n\nDistance, similarity and kernel learning. Mahalanobis metrics (e.g., [25]) are probably the most widely used family of (dis-)similarities in supervised settings. They extend the Euclidean metric by accounting for correlations between input dimensions and are equivalent to projecting data to a new, potentially smaller, Euclidean space. Learning the projection improves discrimination and compresses feature vectors, but the projection is inherently linear.1 There are several attempts to learn more powerful non-linear kernels from data. Multiple Kernel Learning (MKL) [3] operates on a parametric family of kernels: it learns a convex combination of a few base kernels so as to maximize classification accuracy. Recent advances now allow combining thousands of kernels in MKL [17] or exploit specialized families of kernels to derive faster algorithms [19]. In that work, the authors combine binary base kernels based on randomized indicator functions but restrict them to XNOR-like kernels. Our QK framework can also be seen as an efficient and robust MKL on a specific family of binary base kernels. However, our binary base kernels originate from more general quantizations: they correspond to their regions of constantness. As a consequence, the resulting optimization problem is also more involved and thus calls for approximate solutions.\n\nIn parallel to MKL approaches, Non-Parametric Kernel Learning (NPKL) [10] has emerged as a flexible kernel learning alternative. 
Without any assumption on the form of the kernel, these methods aim at learning the Gram matrix of the data directly. The optimization problem is a semi-definite program whose size is quadratic in the number of samples. Scalability is therefore an issue, and approximation techniques must be used to compute the kernel on unobserved data. Like NPKL, we learn the values of the kernel matrix directly. However, we do it in the index space instead of the original space. Hence, we restrict our family of kernels to piecewise constant ones2, but, contrary to NPKL, the complexity of the problems we solve does not grow with the number of data points but with the refinement of the quantization, and our kernels trivially generalize to unobserved inputs.\n\n1Metric learning can be kernelized, but then one has to choose the kernel.\n\n3 Quantized kernels\n\nIn this section, we present the framework of quantized kernels (QK). We start in Sec. 3.1 by defining QKs and looking at some of their properties. We then present in Sec. 3.2 a general alternating learning algorithm. A key step is to optimize the quantizer itself. We present in Sec. 3.3 our scheme for quantization optimization for a single dimensional feature and show how to generalize it to higher dimensions in Sec. 3.4.\n\n3.1 Definition and properties\n\nFormally, quantized kernels QK^D_N are the set of kernels k_q on R^D × R^D such that:\n\nk_q(x, y) = K(q(x), q(y)) ∀x, y ∈ R^D, for some q : R^D → {1, . . . , N} and some K ∈ R^{N×N} with K ⪰ 0, (1)\n\nwhere q is a quantization function which projects x ∈ R^D to the finite index space {1, . . . , N}, and K ⪰ 0 denotes that K is a positive semi-definite (PSD) matrix. As discussed above, quantized kernels are an efficient parametrization of piecewise constant functions, where q defines the regions of constantness. Moreover, the N × N matrix K is unique for a given choice of k_q, as it simply accounts for the N(N+1)/2 possible values of the kernel and is the Gram matrix of the N elements of the index space. We can also see q as a 1-of-N coding feature map φ_q, such that:\n\nk_q(x, y) = K(q(x), q(y)) = φ_q(x)^T K φ_q(y). (2)\n\nThe components of the matrix K fully parametrize the family of quantized kernels based on q, and K is a PSD matrix if and only if k_q is a PSD kernel. An explicit feature mapping of k_q is easily computed from the Cholesky decomposition of the PSD matrix K = P^T P:\n\nk_q(x, y) = φ_q(x)^T K φ_q(y) = ⟨ψ^P_q(x), ψ^P_q(y)⟩, (3)\n\nwhere ψ^P_q(x) = P φ_q(x). It is of particular interest to limit the rank N' ≤ N of K, and hence the number of rows in P. In their compressed form, vectors require only log2(N) bits of memory for storing q(x) and they can be decompressed in R^{N'} using P φ_q(x). Not only is this decompressed vector smaller than one based on φ_q, but it is also associated with the Euclidean geometry rather than the kernel one. This allows the exploitation of the large literature of efficient methods specialized to Euclidean spaces.\n\n3.2 Learning quantized kernels\n\nIn this section, we describe a general alternating algorithm to learn a quantized kernel k_q for feature matching. This problem can be formulated as quadruple-wise constraints of the following form:\n\nk_q(x, y) > k_q(u, v) ∀(x, y) ∈ P, ∀(u, v) ∈ N, (4)\n\nwhere P denotes the set of positive feature pairs, and N is the negative one. The positive set contains feature pairs that should be visually matched, while the negative pairs are mismatches.\n\nWe adopt a large-margin formulation of the above constraints using the trace-norm regularization ‖·‖* on K, which is the tightest convex surrogate to low-rank regularization [8]. 
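As a concrete illustration of the definitions in Sec. 3.1 (Eqs. (1)-(3)), the following sketch (our own illustration, not the authors' code; a random PSD matrix stands in for a learnt kernel matrix) evaluates a quantized kernel through the kernel matrix, through the 1-of-N feature map, and through the explicit map obtained from a Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Interval quantizer q: R -> {0, ..., N-1} via fixed boundaries (0-based indices).
boundaries = np.array([-1.0, 0.0, 1.0])          # N = 4 intervals
q = lambda x: int(np.searchsorted(boundaries, x))

# Stand-in PSD kernel matrix K (in the paper, K is learnt).
A = rng.standard_normal((4, 4))
K = A.T @ A + 1e-9 * np.eye(4)                   # PSD (in fact PD) by construction
P = np.linalg.cholesky(K).T                      # K = P^T P

def k_q(x, y):                                   # Eq. (1): k_q(x, y) = K(q(x), q(y))
    return K[q(x), q(y)]

def phi(x):                                      # 1-of-N coding feature map
    e = np.zeros(4)
    e[q(x)] = 1.0
    return e

def psi(x):                                      # explicit map psi(x) = P phi(x), Eq. (3)
    return P @ phi(x)

x, y = 0.3, -2.5
assert np.isclose(k_q(x, y), phi(x) @ K @ phi(y))    # Eq. (2)
assert np.isclose(k_q(x, y), psi(x) @ psi(y))        # Eq. (3)
```

In compressed form only the index q(x) needs to be stored (here log2(4) = 2 bits), and psi(x) decompresses it into a small Euclidean space.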
Using M training pairs {(x_j, y_j)}_{j=1...M}, we obtain the following optimization problem:\n\nargmin_{K⪰0, q∈Q^D_N} E(K, q) = (λ/2) ‖K‖* + Σ_{j=1}^{M} max(0, 1 − l_j φ_q(x_j)^T K φ_q(y_j)), (5)\n\nwhere Q^D_N denotes the set of quantizers q : R^D → {1, . . . , N}, and the pair label l_j ∈ {−1, 1} denotes whether the feature pair (x_j, y_j) is in N or P, respectively. The parameter λ controls the trade-off between the regularization and the empirical loss. Solving Eq. (5) directly is intractable. We thus propose to alternate between the optimization of K and q. We describe the former below, and the latter in the next section.\n\n2As any continuous function on an interval is the uniform limit of a series of piecewise constant functions, this assumption does not inherently limit the flexibility of the family.\n\nOptimizing K with fixed q. When fixing q in Eq. (5), the objective function becomes convex in K but is not differentiable, so we resort to stochastic sub-gradient descent for optimization. Similar to [21], we use Regularised Dual Averaging (RDA) [27] to optimize K iteratively. At iteration t + 1, the kernel matrix K_{t+1} is updated with the following rule:\n\nK_{t+1} = Π( −(√t/γ) (Ḡ_t + λI) ), (6)\n\nwhere γ > 0 and Ḡ_t = (1/t) Σ_{t'=1}^{t} G_{t'} is the rolling average of the subgradients G_{t'} of the loss computed at step t' from one sample pair, I is the identity matrix, and Π is the projection onto the PSD cone.\n\n3.3 Interval quantization optimization for a single dimension\n\nTo optimize an objective like Eq. (5) when K is fixed, we must consider how to design and parametrize the elements of Q^D_N. 
In this work, we adopt interval quantizers, and in this section we assume D = 1, i.e., we restrict the study of quantization to R.\n\nInterval quantizers. An interval quantizer q over R is defined by a set of N + 1 boundaries b_i ∈ R with b_0 = −∞, b_N = ∞ and q(x) = i if and only if b_{i−1} < x ≤ b_i. Importantly, interval quantizers are monotonic, x ≤ y ⇒ q(x) ≤ q(y), and boundaries b_i can be set to any value between max_{q(x)=i} x (included) and min_{q(x)=i+1} x (excluded). Therefore, Eq. (5) can be viewed as a data labelling problem, where each value x_j or y_j takes a label in [1, N], with a monotonicity constraint. Thus, let us now consider the graph (V, E) where the nodes V = {v_t}_{t=1...2M} represent the list of all x_j and y_j in sorted order and the edges E = {(v_s, v_t)} connect all pairs (x_j, y_j). Then Eq. (5) with fixed K is equivalent to the following discrete pairwise energy minimization problem:\n\nargmin_{q ∈ [1,N]^{2M}} E'(q) = Σ_{(s,t)∈E} E_{st}(q(v_s), q(v_t)) + Σ_{t=2}^{2M} C_t(q(v_{t−1}), q(v_t)), (7)\n\nwhere E_{st}(q(v_s), q(v_t)) = E_j(q(x_j), q(y_j)) = max(0, 1 − l_j K(q(x_j), q(y_j))) and C_t is ∞ for q(v_t) < q(v_{t−1}) and 0 otherwise (i.e., it encodes the monotonicity of q in the sorted list of v_t).\n\nThe optimization of Eq. (7) is an NP-hard problem, as the energies E_{st} are arbitrary and the graph does not have a bounded treewidth in general. Hence, we iterate the individual optimization of each of the boundaries using an exact linear-time algorithm, which we present below.\n\nExact linear-time optimization of a binary interval quantizer. We now consider solving equations of the form of Eq. (7) for the binary label case (N = 2). The main observation is that the monotonicity constraint means that labels are 1 until a certain node t and then 2 from node t + 1, and this switch can occur only once on the entire sequence, where v_t ≤ b_1 < v_{t+1}. 
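This single-boundary search can be implemented exactly in one sweep; here is a small self-contained sketch (our own illustration, not the authors' released code). After sorting, flipping one node into the lower bin changes the energy of only the pair incident to that node, which is what makes the sweep linear in the number of edges:

```python
import numpy as np

def hinge(l, k):
    # pairwise loss max(0, 1 - l * K(q(x), q(y)))
    return max(0.0, 1.0 - l * k)

def optimize_binary_boundary(pairs, labels, K):
    """Exact search for the boundary b1 of a binary interval quantizer.

    pairs  : list of (x_j, y_j) scalar feature pairs
    labels : +1 for positive pairs, -1 for negative pairs
    K      : 2x2 kernel matrix (bins indexed 0 and 1 here)
    Returns (b1, best_energy).
    """
    M = len(pairs)
    vals = np.array([v for p in pairs for v in p])             # 2M node values
    partner = np.arange(2 * M).reshape(M, 2)[:, ::-1].ravel()  # other endpoint of each node
    pair_of = np.repeat(np.arange(M), 2)                       # node -> pair index
    order = np.argsort(vals, kind="stable")                    # sweep nodes left to right

    bins = np.ones(2 * M, dtype=int)                           # start: all nodes in the upper bin
    E = sum(hinge(labels[j], K[1, 1]) for j in range(M))       # energy of that labelling
    best_E, best_t = E, -1                                     # best_t == -1 means b1 = -inf
    for t, node in enumerate(order):
        j, other = pair_of[node], partner[node]
        E -= hinge(labels[j], K[bins[other], bins[node]])      # only pair j changes (cf. Eq. (8))
        bins[node] = 0                                         # flip node into the lower bin
        E += hinge(labels[j], K[bins[other], bins[node]])
        if E < best_E:
            best_E, best_t = E, t
    if best_t < 0:
        return -np.inf, best_E
    right = vals[order[best_t + 1]] if best_t + 1 < 2 * M else vals[order[best_t]] + 1.0
    return (vals[order[best_t]] + right) / 2.0, best_E         # midpoint for generalization
```

For instance, with a kernel that rewards same-bin pairs, two positive pairs clustered at low and high values and one negative pair straddling them are separated perfectly by the returned boundary.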
This means that there are only 2M + 1 possible labellings, and we can order them from (1, . . . , 1), (1, . . . , 1, 2) to (2, . . . , 2). A naïve algorithm consists in computing the 2M + 1 energies explicitly. Since each energy computation is linear in the number of edges, this results in a quadratic complexity overall.\n\nA linear-time algorithm exists. It stems from the observation that the energies of two consecutive labellings (e.g., switching the label of v_t from 1 to 2) differ only by a constant number of terms:\n\nE(q(v_{t−1}) = 1, q(v_t) = 2, q(v_{t+1}) = 2) = E(q(v_{t−1}) = 1, q(v_t) = 1, q(v_{t+1}) = 2) + C_t(1, 2) − C_t(1, 1) + C_{t+1}(2, 2) − C_{t+1}(1, 2) + E_{st}(q(v_s), 2) − E_{st}(q(v_s), 1), (8)\n\nwhere, w.l.o.g., we have assumed (s, t) ∈ E. After finding the optimal labelling, i.e., finding the label change (v_t, v_{t+1}), we set b_1 = (v_t + v_{t+1})/2 to obtain the best possible generalization.\n\nFinite spaces. When the input feature space has a finite number of different values (e.g., x ∈ [1, T]), then we can use linear-time sorting and merge all nodes with equal value in Eq. (7): this results in considering at most T + 1 labellings, which is potentially much smaller than 2M + 1.\n\nExtension to the multilabel case. Optimizing a single boundary b_i of a multilabel interval quantization is essentially the same binary problem as above, where we limit the optimization to the values currently assigned to i and i + 1 and keep the other assignments q fixed. We use unary terms E_j(·, q(y_j)) or E_j(q(x_j), ·) to model half-fixed pairs where only x_j or only y_j is being relabelled, respectively.\n\n3.4 Learning higher dimensional quantized kernels\n\nWe now want to generalize interval quantizers to higher dimensions. This is readily feasible via product quantization [13], using interval quantizers for each individual dimension.\n\nInterval product quantization. 
An interval product quantizer q(x) : R^D → {1, . . . , N} is of the form q(x) = (q_1(x_1), . . . , q_D(x_D)), where q_1, . . . , q_D are interval quantizers with N_1, . . . , N_D bins respectively, i.e., N = Π_{d=1}^{D} N_d. The learning algorithm devised above trivially generalizes to interval product quantization by fixing all but one boundary of a single component quantizer q_d. However, learning K ∈ R^{N×N} when N is very large becomes problematic: not only does RDA scale unfavourably, but the lack of training data will eventually lead to severe overfitting. To address these issues, we devise below variants of QKs that have practical advantages for robust learning.\n\nAdditive quantized kernels (AQK). We can drastically reduce the number of parameters by restricting product quantized kernels to additive ones, which consists in decomposing over dimensions:\n\nk_q(x, y) = Σ_{d=1}^{D} k_{q_d}(x_d, y_d) = Σ_{d=1}^{D} φ_{q_d}(x_d)^T K_d φ_{q_d}(y_d) = φ_q(x)^T K φ_q(y), (9)\n\nwhere q_d ∈ Q^1_{N_d}, φ_{q_d} is the 1-of-N_d coding of dimension d, K_d is the N_d × N_d Gram matrix of dimension d, φ_q is the concatenation of the D mappings φ_{q_d}, and K is the (Σ_d N_d) × (Σ_d N_d) block-diagonal matrix with blocks K_1, . . . , K_D. The benefits of AQK are twofold. First, the explicit feature space is reduced from N = Π_d N_d to N' = Σ_d N_d. Second, the number of parameters to learn in K is now only Σ_d N_d² instead of N². The compression ratio is unchanged since log2(N) = Σ_d log2(N_d).\n\nTo learn K in Eq. (9), we simply set the off-block-diagonal elements of G_{t'} to zero in each iteration, and iteratively update K as described in Sec. 3.2. To optimize a product quantizer, we iterate the optimization of each 1d quantizer q_d following Sec. 3.3, while fixing q_c for c ≠ d. 
This leads to using the following energy E_j for a pair (x_j, y_j):\n\nE_{j,d}(q_d(x_{j,d}), q_d(y_{j,d})) = max(0, µ_{j,d} − l_j K_d(q_d(x_{j,d}), q_d(y_{j,d}))), (10)\n\nwhere µ_{j,d} = 1 − l_j Σ_{c≠d} K_c(q_c(x_{j,c}), q_c(y_{j,c})) acts as an adaptive margin.\n\nBlock quantized kernels (BQK). Although the additive assumption in AQK greatly reduces the number of parameters, it is also very restrictive, as it assumes independent data dimensions. A simple way to extend additive quantized kernels to model the inter-dependencies of dimensions is to allow the off-diagonal elements of K in Eq. (9) to be nonzero. As a trade-off between a block-diagonal (AQK) and a full matrix, in this work we also consider the grouping of the feature dimensions into B blocks, and only learn off-block-diagonal elements within each block, leading to Block Quantized Kernels (BQK). In this way, assuming N_d = n for all d, the number of parameters in K is B times smaller than for the full matrix. As a matter of fact, many features such as SIFT descriptors exhibit block structure. SIFT is composed of a 4×4 grid of 8 orientation bins. Components within the same spatial cell correlate more strongly than others and, thus, only modeling those jointly may prove sufficient. The optimization of K and q is straightforwardly adapted from the AQK case.\n\nAdditional parameter sharing. Commonly, the different dimensions of a descriptor are generated by the same procedure and hence share similar properties. This results in block matrices K_1, . . . , K_D in AQK that are quite similar as well. We propose to exploit this observation and share the kernel matrix for groups of dimensions, further reducing the number of parameters. Specifically, we cluster dimensions based on their variances into G equally sized groups and use a single block matrix for each group. 
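As an illustration of Eqs. (9)-(10), an additive quantized kernel and its adaptive margin can be written as follows (a sketch under our own toy setup; the quantizer boundaries and the matrices K_d would normally be learnt):

```python
import numpy as np

def interval_quantizer(boundaries):
    """Interval quantizer for one dimension: x -> bin index in {0, ..., N-1}."""
    b = np.asarray(boundaries)
    return lambda x: int(np.searchsorted(b, x))

def aqk(x, y, quantizers, Ks):
    """Additive quantized kernel, Eq. (9): sum of per-dimension quantized kernels."""
    return sum(K[q(xd), q(yd)] for q, K, xd, yd in zip(quantizers, Ks, x, y))

def adaptive_margin(x, y, l, d, quantizers, Ks):
    """mu_{j,d} of Eq. (10): the margin left over by all dimensions c != d."""
    rest = sum(K[q(xc), q(yc)]
               for c, (q, K, xc, yc) in enumerate(zip(quantizers, Ks, x, y))
               if c != d)
    return 1.0 - l * rest
```

Only the per-dimension bin indices need to be stored for a compressed feature, i.e., log2(N_d) bits for dimension d, as in the text.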
During optimization, dimensions sharing the same block matrix can conveniently be merged, i.e., φ_q(x) = [Σ_{d : K_d=K'_1} φ_{q_d}(x_d), . . . , Σ_{d : K_d=K'_G} φ_{q_d}(x_d)], and then K = diag(K'_1, . . . , K'_G) is learnt following the procedure already described for AQK. Notably, the quantizers themselves are not shared, so the kernel still adapts uniquely to every dimension of the data, and the optimization of the quantizers is not changed either. This parameter sharing strategy can be readily applied to BQK as well.\n\n4 Results\n\nWe now present our experimental results, starting with a description of our protocol. We then explore parameters and properties of our kernels (optimization of quantizers, explicit feature maps). Finally, we compare to the state of the art in performance and compactness.\n\nTable 1: Impact of quantization optimization for different quantization strategies.\n\n            Initial  Optimized\nUniform     24.84    21.68\nAdaptive    25.99    25.70\nAdaptive+   14.62    14.29\n\nFigure 1: Impact of N, the number of quantization intervals.\n\nFigure 2: Impact of G, the number of dimension groups.\n\nFigure 3: Our learned feature maps and additive quantized kernel of a single dimension. (a) shows the quantized kernel in index space, while (b) is in the original feature space for the first quantizer. (c,d) show the two corresponding feature maps, and (e,f) the related rank-1 kernels.\n\nDataset and evaluation protocol. We evaluate our method using the dataset of Brown et al. [5]. It contains three sets of patches extracted from Liberty, Notre Dame and Yosemite using the Difference of Gaussians (DoG) interest point detector. The patches are rectified with respect to the scale and dominant orientation, and pairwise correspondences are computed using a multi-view stereo algorithm. 
In our experiments, we use the standard evaluation protocol [5] and state-of-the-art descriptors: SIFT [15], PR-proj [21] and SQ-4-DAISY [4]. M = 500k feature pairs are used for training on each dataset, with as many positives as negatives. We report the false positive rate (FPR) at 95% recall on the test set of 100k pairs. A challenge for this dataset is the bias in local patch appearance for each set, so a key factor for performance is the ability to generalize and adapt across sets. Below, unless mentioned otherwise, AQKs are trained for SIFT on Yosemite and tested on Liberty.\n\nInterval quantization and optimization. We first study the influence of initialization and optimization on the generalization ability of the interval quantizers. For initialization, we have used two different schemes: a) Uniform quantization, i.e., quantization with equal intervals; b) Adaptive quantization, i.e., quantization with intervals containing an equal number of samples. In both cases, this allows us to learn a first kernel matrix, and we can then iterate with boundary optimization (Sec. 3.3). Typically, convergence is very fast (2-3 iterations) and takes less than 5 minutes in total (i.e., about 2s per feature dimension) with 1M nodes. We see in Table 1 that uniform binning outperforms the adaptive one and that further optimization benefits the uniform case more. This may seem paradoxical at first, but it is due to the train/test bias problem: intervals with equal numbers of samples are very different across sets, so refinements will not transfer well. Hence, following [7], we first normalize the features with respect to their rank, separately for the training and test sets. We refer to this process as Adaptive+. As Table 1 shows, not only does it bring a significant improvement, but further optimization of the quantization boundaries is also more beneficial than for the Adaptive case. 
In the following, we thus adopt this strategy.\n\nNumber of quantization intervals. In Fig. 1, we show the impact of the number of intervals N of the quantizer on the matching accuracy, using a single shared kernel submatrix (G = 1). This number balances the flexibility of the model and its compression ratio. As we can see, using too few intervals limits the performance of QK, and using too many eventually leads to overfitting. The best performance for SIFT is obtained with between 8 and 16 intervals.\n\nExplicit feature maps. Fig. 3a shows the additive quantized kernel learnt for SIFT with N = 8 and G = 1. Interestingly, the kernel has negative values far from the diagonal and positive values near the diagonal. This is typical of stationary kernels: when both features have similar values, they contribute more to the similarity. However, contrary to stationary kernels, the diagonal elements are far from constant. There is a mode on small values and another one on large values. The second one is stronger, i.e., the co-occurrence of large values yields greater similarity. This is consistent with the voting nature of SIFT descriptors, where strong feature presences are both rarer and more informative than their absences. The negative values far from the diagonal actually penalize inconsistent observations, thus confirming existing results [12]. Looking at the values in the original space in Fig. 
3b, we see that the quantizer has learnt that fine intervals are needed in the lower values, while larger ones are enough for larger values. This is consistent with previous observations that the contribution of large values in SIFT should not grow proportionally [2, 18, 14].\n\nTable 2: Performance of kernels on different datasets with different descriptors. AQK(N) denotes the additive quantized kernel with N quantization intervals. Following [6], we report the false positive rate (%) at 95% recall. The best results for each descriptor are in bold.\n\nDescriptor       Kernel          Dim.    Yos→Notredame  Yos→Liberty  Notredame→Yos  Notredame→Liberty  Mean\nSIFT [15]        Euclidean       128     24.02          31.34        27.96          31.34              28.66\nSIFT [15]        χ2              128     17.65          22.84        23.50          22.84              21.71\nSIFT [15]        AQK(8)          128     10.72          16.90        10.72          16.85              13.80\nSIFT [15]        AQK(8)          256     9.26           14.48        10.16          14.43              12.08\nSIFT [15]        BQK(8)          256     8.05           13.31        9.88           13.16              11.10\nSQ-4-DAISY [4]   Euclidean       1360    10.08          16.90        10.47          16.90              13.58\nSQ-4-DAISY [4]   χ2              1360    10.61          16.25        12.19          16.25              13.82\nSQ-4-DAISY [4]   SQ [4]          1360    8.42           15.58        9.25           15.58              12.21\nSQ-4-DAISY [4]   AQK(8)          ≤1813   4.96           9.41         5.60           9.77               7.43\nPR-proj [21]     Euclidean [21]  <64     7.11           14.82        10.54          12.88              11.34\nPR-proj [21]     AQK(16)         ≤102    5.41           10.90        7.65           10.54              8.63\n\nIn this experiment, the learnt kernel has rank 2. We show in Fig. 3c, 3d, 3e and 3f the corresponding feature mappings and their associated rank-1 kernels. 
The map for the largest eigenvalue (Fig. 3c) is monotonic but starts with negative values. This impacts the dot product significantly, and accounts for the above observation that negative similarities occur when inputs disagree. This rank-1 kernel cannot allot enough contribution to similar mid-range values. This is compensated by the second rank (Fig. 3f).\n\nNumber of groups. Fig. 2 shows the influence of the number of groups G on performance, for the three different descriptors (N = 8 for SIFT and SQ-4-DAISY, N = 16 for PR-proj). As for intervals, using more groups adds flexibility to the model, but as less data is available to learn each parameter, over-fitting will hurt performance. We choose G = 3 for the rest of the experiments.\n\nComparison to the state of the art. Table 2 reports the matching performance of different kernels using different descriptors, for all sets, as well as the dimensionality of the corresponding explicit feature maps. For all three descriptors and on all sets, our quantized kernels significantly and consistently outperform the best reported results in the literature. Indeed, compared to the Euclidean distance, AQK improves the mean error rate at 95% recall from 28.66% to 12.08% for SIFT, from 13.58% to 7.43% for SQ-4-DAISY and from 11.34% to 8.63% for PR-proj, and by about as much compared to the χ2 kernel. Note that PR-proj already integrates metric learning in its design ([21] thus recommends using the Euclidean distance): as a consequence, our experiments show that modelling non-linearities can bring significant improvements. When comparing to sparse quantization (SQ) with the Hamming distance as done in [4], the error is significantly reduced from 12.21% to 7.43%. 
This is a notable achievement considering that [4] is the previous state of the art.
The SIFT descriptor has a grid-block design which makes it particularly well suited to BQK. Hence, we also evaluated our BQK variant for that descriptor. With BQK(8), we observed a relative improvement of 8%, from 12.08% for AQK(8) to 11.10%.
We provide in Fig. 4 the ROC curves for the three descriptors when training on Yosemite and testing on Notre Dame and Liberty. These figures show that the improvement in recall is consistent over the full range of false positive rates. For further comparisons, our data and code are available online.3
Compactness of our kernels. In many applications of feature matching, the compactness of the descriptor is important. In Table 3, we compare to other methods by grouping them according to their memory footprint. As a reference, the best method reported in Table 2 (AQK(8) on SQ-4-DAISY) uses 4080 bits per descriptor. As expected, error rates increase as fewer bits are used, since the original features are significantly altered. Notably, QKs consistently yield the best performance in all groups. Even with a crude binary quantization of SQ-4-DAISY, our quantized kernel outperforms the state-of-the-art SQ of [4] by 3 to 4%. When considering the most compact encodings (≤ 64 bits), our AQK(2) does not improve over BinBoost [22], a descriptor designed for extreme compactness, or the product quantization (PQ [13]) encoding as used in [21]. This is because our current framework does not yet allow for joint compression of multiple dimensions.
Hence, it is unable to use less than 1 bit per original dimension, and is not optimal in that case.

3 See: http://www.vision.ee.ethz.ch/~qind/QuantizedKernel.html

Figure 4: ROC curves when evaluating on Notre Dame (top) and Liberty (bottom) after training on Yosemite.

Descriptor      | Encoding      | Memory (bits) | Notredame | Liberty | Yosemite | Liberty | Mean
----------------|---------------|---------------|-----------|---------|----------|---------|------
SQ-4-DAISY [4]  | SQ [4]        | 1360          | 8.42      | 15.58   | 9.25     | 15.58   | 12.21
SQ-4-DAISY [4]  | AQK(2)        | 1360          | 5.86      | 10.81   | 6.36     | 10.94   | 8.49
----------------|---------------|---------------|-----------|---------|----------|---------|------
SIFT [15]       | AQK(8)        | 384           | 9.26      | 14.48   | 10.16    | 14.43   | 12.08
PR-proj [21]    | Bin [21]      | 1024          | 7.09      | 15.15   | 8.5      | 12.16   | 10.73
PR-proj [21]    | AQK(16)       | <256          | 5.41      | 10.90   | 7.65     | 10.54   | 8.63
----------------|---------------|---------------|-----------|---------|----------|---------|------
SIFT [15]       | AQK(2)        | 128           | 14.62     | 19.72   | 15.65    | 19.45   | 17.36
PR-proj [21]    | Bin [21]      | 128           | 10.00     | 18.64   | 13.41    | 16.39   | 14.61
PR-proj [21]    | AQK(4)        | <128          | 7.18      | 13.02   | 10.29    | 13.18   | 10.92
----------------|---------------|---------------|-----------|---------|----------|---------|------
BinBoost [22]   | BinBoost [22] | 64            | 20.49     | 18.97   | 14.54    | 21.67   | 18.92
PR-proj [21]    | AQK(2)        | <64           | 22.24     | 19.38   | 14.80    | 20.59   | 19.26
PR-proj [21]    | PQ [21]       | 64            | 17.97     | 19.32   | 12.91    | 20.15   | 17.59
PR-proj [21]    | PCA+AQK(4)    | 64            | 17.60     | 14.44   | 10.74    | 17.46   | 15.06

Table 3: Performance comparison of different compact feature encodings, grouped by memory footprint. Numbers are false positive rates (%) at 95% recall; the Notredame and first Liberty columns are trained on Yosemite, the Yosemite and second Liberty columns on Notredame. The best results for each group are in bold.

To better understand the potential benefits of decorrelating features and joint compression in future work, we pre-processed the data with PCA, projecting to 32 dimensions and then using AQK(4).
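The memory footprints above follow directly from the quantizer: AQK(N) stores one index out of N intervals per encoded dimension, i.e. ceil(log2 N) bits. A quick sanity check of the reported sizes (the helper function is ours, for illustration only):

```python
from math import ceil, log2

def aqk_bits(n_dims, n_intervals):
    """Bits needed to store a descriptor encoded with AQK(n_intervals)."""
    return n_dims * ceil(log2(n_intervals))

print(aqk_bits(128, 8))   # SIFT + AQK(8): 384 bits
print(aqk_bits(128, 2))   # SIFT + AQK(2): 128 bits
print(aqk_bits(1360, 8))  # SQ-4-DAISY + AQK(8): 4080 bits, the reference above
print(aqk_bits(32, 4))    # PCA to 32 dimensions + AQK(4): 64 bits
```

These values reproduce the memory column of Table 3 and the 4080-bit reference figure.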
This simple procedure obtained state-of-the-art performance with a 15% error rate, now outperforming [22] and [21].
Although QKs yield very compact descriptors and achieve the best performance across many experimental setups, the computation of similarity values is slower than for competitors: in the binary case, we double the complexity of the Hamming distance because of the 2 × 2 table look-up.

5 Conclusion
In this paper, we have introduced the simple yet powerful family of quantized kernels (QK), and presented an efficient algorithm to learn its parameters, i.e., the kernel matrix and the quantization boundaries. Despite their apparent simplicity, QKs have numerous advantages: they are very flexible, can model non-linearities in the data, and provide explicit low-dimensional feature mappings that grant access to Euclidean geometry. Above all, they achieve state-of-the-art performance on the main visual feature matching benchmark. We believe that QKs have much potential for further improvement. In future work, we want to explore new learning algorithms to obtain higher compression ratios – e.g.
by jointly compressing feature dimensions – and automatically find the weight-sharing patterns that would further improve matching performance.

Acknowledgements

We gratefully thank the KIC-Climate project Modeling City Systems.

References
[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
[2] Relja Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[3] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning. ACM, 2004.
[4] Xavier Boix, Michael Gygli, Gemma Roig, and Luc Van Gool. Sparse quantization for patch description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[5] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43–57, 2011.
[6] Matthew Brown and David G. Lowe.
Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59–73, 2007.
[7] Thomas Dean, Mark A. Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan, and Jay Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[8] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, 2002.
[9] Yunchao Gong, Sanjiv Kumar, Vishal Verma, and Svetlana Lazebnik. Angular quantization-based binary codes for fast similarity search. In NIPS, pages 1196–1204, 2012.
[10] Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the International Conference on Machine Learning. ACM, 2007.
[11] Gang Hua, Matthew Brown, and Simon Winder. Discriminant embedding for local image descriptors. In ICCV, 2007.
[12] Hervé Jégou and Ondřej Chum. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In Computer Vision – ECCV 2012, pages 774–787. Springer, 2012.
[13] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
[14] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304–3311, 2010.
[15] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[16] Subhransu Maji, Alexander C. Berg, and Jitendra Malik. Efficient classification for additive kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):66–77, 2013.
[17] Francesco Orabona and Luo Jie. Ultra-fast optimization algorithm for sparse multi kernel learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 249–256, 2011.
[18] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV), 2010.
[19] Gemma Roig, Xavier Boix, and Luc Van Gool. Random binary mappings for kernel learning and efficient SVM. arXiv preprint arXiv:1307.5161, 2013.
[20] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In NIPS, volume 1, page 335. MIT Press, 2002.
[21] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[22] Tomasz Trzcinski, Mario Christoudias, Pascal Fua, and Vincent Lepetit. Boosting binary keypoint descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[23] Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, and Pascal Fua. Learning image descriptors with the boosting-trick. In NIPS, 2012.
[24] Andrea Vedaldi and Andrew Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
[25] Kilian Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NIPS), 2006.
[26] Simon A. J. Winder and Matthew Brown. Learning local image descriptors. In CVPR, 2007.
[27] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[28] Zheng Yi, Cao Zhiguo, and Xiao Yang. Multi-spectral remote image registration based on SIFT.
Electronics Letters, 44(2):107–108, 2008.