{"title": "Online Metric Learning and Fast Similarity Search", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": "Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates for approximate similarity search data structures. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.", "full_text": "Online Metric Learning and Fast Similarity Search\n\nPrateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman\n\nDepartment of Computer Sciences\n\nUniversity of Texas at Austin\n\nAustin, TX 78712\n\n{pjain,kulis,inderjit,grauman}@cs.utexas.edu\n\nAbstract\n\nMetric learning algorithms can provide useful distance functions for a variety\nof domains, and recent work has shown good accuracy for problems where the\nlearner can access all distance constraints at once. However, in many real appli-\ncations, constraints are only available incrementally, thus necessitating methods\nthat can perform online updates to the learned metric. 
Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.

1 Introduction

A number of recent techniques address the problem of metric learning, in which a distance function between data objects is learned based on given (or inferred) similarity constraints between examples [4, 7, 11, 16, 5, 15]. Such algorithms have been applied to a variety of real-world learning tasks, ranging from object recognition and human body pose estimation [5, 9], to digit recognition [7], and software support [4] applications. Most successful results have relied on having access to all constraints at the onset of the metric learning. However, in many real applications, the desired distance function may need to change gradually over time as additional information or constraints are received. For instance, in image search applications on the internet, online click-through data that is continually collected may impact the desired distance function. To address this need, recent work on online metric learning algorithms attempts to handle constraints that are received one at a time [13, 4].
Unfortunately, current methods suffer from a number of drawbacks, including speed, bound quality, and empirical performance.

Further complicating this scenario is the fact that fast retrieval methods must be in place on top of the learned metrics for many applications dealing with large-scale databases. For example, in image search applications, relevant images within very large collections must be quickly returned to the user, and constraints and user queries may often be intermingled across time. Thus a good online metric learner must also be able to support fast similarity search routines. This is problematic since existing methods (e.g., locality-sensitive hashing [6, 1] or kd-trees) assume a static distance function, and are expensive to update when the underlying distance function changes.

The goal of this work is to make metric learning practical for real-world learning tasks in which both constraints and queries must be handled efficiently in an online manner. To that end, we first develop an online metric learning algorithm that uses LogDet regularization and exact gradient descent. The new algorithm is inspired by the metric learning algorithm studied in [4]; however, while the loss bounds for the latter method are dependent on the input data, our loss bounds are independent of the sequence of constraints given to the algorithm. Furthermore, unlike the Pseudo-metric Online Learning Algorithm (POLA) [13], another recent online technique, our algorithm requires no eigenvector computation, making it considerably faster in practice. We further show how our algorithm can be integrated with large-scale approximate similarity search.
We devise a method to incrementally update locality-sensitive hash keys during the updates of the metric learner, making it possible to perform accurate sub-linear time nearest neighbor searches over the data in an online manner.

We compare our algorithm to related existing methods using a variety of standard data sets. We show that our method outperforms existing approaches, and even performs comparably to several offline metric learning algorithms. To evaluate our approach for indexing a large-scale database, we include experiments with a set of 300,000 image patches; our online algorithm effectively learns to compare patches, and our hashing construction allows accurate fast retrieval for online queries.

1.1 Related Work

A number of recent techniques consider the metric learning problem [16, 7, 11, 4, 5]. Most work deals with learning Mahalanobis distances in an offline manner, which often leads to expensive optimization algorithms. The POLA algorithm [13], on the other hand, is an approach for online learning of Mahalanobis metrics that optimizes a large-margin objective and has provable regret bounds, although eigenvector computation is required at each iteration to enforce positive definiteness, which can be slow in practice. The information-theoretic metric learning method of [4] includes an online variant that avoids eigenvector decomposition. However, because of the particular form of the online update, positive-definiteness still must be carefully enforced, which impacts bound quality and empirical performance, making it undesirable for both theoretical and practical purposes. In contrast, our proposed algorithm has strong bounds, requires no extra work for enforcing positive definiteness, and can be implemented efficiently. There are a number of existing online algorithms for other machine learning problems outside of metric learning, e.g.
[10, 2, 12].

Fast search methods are becoming increasingly necessary for machine learning tasks that must cope with large databases. Locality-sensitive hashing [6] is an effective technique that performs approximate nearest neighbor searches in time that is sub-linear in the size of the database. Most existing work has considered hash functions for Lp norms [3], inner product similarity [1], and other standard distances. While recent work has shown how to generate hash functions for (offline) learned Mahalanobis metrics [9], we are not aware of any existing technique that allows incremental updates to locality-sensitive hash keys for online database maintenance, as we propose in this work.

2 Online Metric Learning

In this section we introduce our model for online metric learning, develop an efficient algorithm to implement it, and prove regret bounds.

2.1 Formulation and Algorithm

As in several existing metric learning methods, we restrict ourselves to learning a Mahalanobis distance function over our input data, which is a distance function parameterized by a d x d positive definite matrix A. Given d-dimensional vectors u and v, the squared Mahalanobis distance between them is defined as

    d_A(u, v) = (u - v)^T A (u - v).

Positive definiteness of A assures that the distance function will return positive distances. We may equivalently view such distance functions as applying a linear transformation to the input data and computing the squared Euclidean distance in the transformed space; this may be seen by factorizing the matrix A = G^T G, and distributing G into the (u - v) terms. In general, one learns a Mahalanobis distance by learning the appropriate positive definite matrix A based on constraints over the distance function.
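As a quick illustration of this equivalence (a minimal NumPy sketch with made-up data; the function and variable names are ours, not from the paper), the squared Mahalanobis distance can be computed directly from A or, equivalently, as a squared Euclidean distance after the linear map G:

```python
import numpy as np

def mahalanobis_sq(u, v, A):
    """Squared Mahalanobis distance d_A(u, v) = (u - v)^T A (u - v)."""
    z = u - v
    return float(z @ A @ z)

rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
A = B.T @ B + np.eye(d)          # a positive definite parameter matrix
G = np.linalg.cholesky(A).T      # Cholesky gives A = L L^T, so G = L^T satisfies A = G^T G
u, v = rng.standard_normal(d), rng.standard_normal(d)

# d_A(u, v) equals the squared Euclidean distance between Gu and Gv
assert np.isclose(mahalanobis_sq(u, v, A), float(np.sum((G @ u - G @ v) ** 2)))
```

Any factorization A = G^T G works here; Cholesky is simply one convenient choice.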
These constraints are typically distance or similarity constraints that arise from supervised information—for example, the distance between two points in the same class should be "small". In contrast to offline approaches, which assume all constraints are provided up front, online algorithms assume that constraints are received one at a time. That is, we assume that at time step t, there exists a current distance function parameterized by A_t. A constraint is received, encoded by the triple (u_t, v_t, y_t), where y_t is the target distance between u_t and v_t (we restrict ourselves to distance constraints, though other constraints are possible). Using A_t, we first predict the distance ŷ_t = d_{A_t}(u_t, v_t) using our current distance function, and incur a loss ℓ(ŷ_t, y_t). Then we update our matrix from A_t to A_{t+1}. The goal is to minimize the sum of the losses over all time steps, i.e. L_A = Σ_t ℓ(ŷ_t, y_t). One common choice is the squared loss: ℓ(ŷ_t, y_t) = (1/2)(ŷ_t - y_t)^2. We also consider a variant of the model where the input is a quadruple (u_t, v_t, y_t, b_t), where b_t = 1 if we require that the distance between u_t and v_t be less than or equal to y_t, and b_t = -1 if we require that the distance between u_t and v_t be greater than or equal to y_t. In that case, the corresponding loss function is ℓ(ŷ_t, y_t, b_t) = max(0, (1/2) b_t(ŷ_t - y_t))^2.

A typical approach [10, 4, 13] for the above given online learning problem is to solve for A_{t+1} by minimizing a regularized loss at each step:

    A_{t+1} = argmin_{A ≻ 0}  D(A, A_t) + η ℓ(d_A(u_t, v_t), y_t),        (2.1)

where D(A, A_t) is a regularization function and η > 0 is the regularization parameter. As in [4], we use the LogDet divergence D_ld(A, A_t) as the regularization function.
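This regularizer, defined formally in the next paragraph as D_ld(A, A_t) = tr(A A_t^{-1}) - log det(A A_t^{-1}) - d, is easy to compute; below is a minimal NumPy sketch under our own naming (not code from the paper), with assertions checking two of its advertised properties: D_ld(A, A) = 0 and scale-invariance.

```python
import numpy as np

def logdet_div(A, B):
    """LogDet divergence D_ld(A, B) = tr(A B^-1) - log det(A B^-1) - d,
    for positive definite d x d matrices A and B."""
    d = A.shape[0]
    M = A @ np.linalg.inv(B)
    _, logdet = np.linalg.slogdet(M)   # det(M) > 0 when A, B are positive definite
    return float(np.trace(M) - logdet - d)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
A = X @ X.T + np.eye(3)
Y = rng.standard_normal((3, 3))
B = Y @ Y.T + np.eye(3)

assert abs(logdet_div(A, A)) < 1e-8                            # D_ld(A, A) = 0
assert np.isclose(logdet_div(2 * A, 2 * B), logdet_div(A, B))  # scale-invariance
```

As a Bregman matrix divergence, D_ld is nonnegative and zero only when its arguments coincide.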
It is defined over positive definite matrices and is given by D_ld(A, A_t) = tr(A A_t^{-1}) - log det(A A_t^{-1}) - d. This divergence has previously been shown to be useful in the context of metric learning [4]. It has a number of desirable properties for metric learning, including scale-invariance, automatic enforcement of positive definiteness, and a maximum-likelihood interpretation.

Existing approaches solve for A_{t+1} by approximating the gradient of the loss function, i.e. ℓ′(d_A(u_t, v_t), y_t) is approximated by ℓ′(d_{A_t}(u_t, v_t), y_t) [10, 13, 4]. While for some regularization functions (e.g. Frobenius divergence, von Neumann divergence) such a scheme works out well, for LogDet regularization it can lead to non-definite matrices for which the regularization function is not even defined. This results in a scheme that has to adapt the regularization parameter in order to maintain positive definiteness [4].

In contrast, our algorithm proceeds by exactly solving for the updated parameters A_{t+1} that minimize (2.1). Since we use the exact gradient, our analysis will become more involved; however, the resulting algorithm will have several advantages over existing methods for online metric learning. Using straightforward algebra and the Sherman-Morrison inverse formula, we can show that the resulting solution to the minimization of (2.1) is:

    A_{t+1} = A_t - [η(ȳ - y_t) A_t z_t z_t^T A_t] / [1 + η(ȳ - y_t) z_t^T A_t z_t],        (2.2)

where z_t = u_t - v_t and ȳ = d_{A_{t+1}}(u_t, v_t) = z_t^T A_{t+1} z_t. The detailed derivation will appear in a longer version. It is not immediately clear that this update can be applied, since ȳ is a function of A_{t+1}.
However, by multiplying the update in (2.2) on the left by z_t^T and on the right by z_t, and noting that ŷ_t = z_t^T A_t z_t, we obtain the following:

    ȳ = ŷ_t - [η(ȳ - y_t) ŷ_t^2] / [1 + η(ȳ - y_t) ŷ_t],   and so   ȳ = [η y_t ŷ_t - 1 + √((η y_t ŷ_t - 1)^2 + 4η ŷ_t^2)] / (2η ŷ_t).        (2.3)

We can solve directly for ȳ using this formula, and then plug this into the update (2.2). For the case when the input is a quadruple and the loss function is the squared hinge loss, we only perform the update (2.2) if the new constraint is violated.

It is possible to show that the resulting matrix A_{t+1} is positive definite; the proof appears in our longer version. The fact that this update maintains positive definiteness is a key advantage of our method over existing methods; POLA, for example, requires projection to the positive semidefinite cone via an eigendecomposition. The final loss bound in [4] depends on the regularization parameter η_t from each iteration and is in turn dependent on the sequence of constraints, an undesirable property for online algorithms. In contrast, by exactly minimizing the objective in (2.1), our algorithm's updates automatically maintain positive definiteness. This means that the regularization parameter η need not be changed according to the current constraint, and the resulting bounds (Section 2.2) and empirical performance are notably stronger.

We refer to our algorithm as LogDet Exact Gradient Online (LEGO), and use this name throughout to distinguish it from POLA [13] (which uses a Frobenius regularization) and the Information Theoretic Metric Learning (ITML)-Online algorithm [4] (which uses an approximation to the gradient).

2.2 Analysis

We now briefly analyze the regret bounds for our online metric learning algorithm.
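Before turning to the analysis, one full LEGO step from Section 2.1 can be summarized in code: solve (2.3) in closed form for ȳ, then apply the exact update (2.2). This is a minimal NumPy sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def lego_update(A, u, v, y, eta):
    """One LEGO step: given constraint (u, v, y), return A_{t+1} per Eqs. (2.2)-(2.3)."""
    z = u - v
    y_hat = float(z @ A @ z)                      # current prediction d_{A_t}(u, v)
    # Closed-form solution of (2.3) for y_bar = d_{A_{t+1}}(u, v):
    y_bar = (eta * y * y_hat - 1 +
             np.sqrt((eta * y * y_hat - 1) ** 2 + 4 * eta * y_hat ** 2)) / (2 * eta * y_hat)
    Az = A @ z
    # Rank-one update (2.2); the denominator keeps A_{t+1} positive definite
    return A - (eta * (y_bar - y) / (1 + eta * (y_bar - y) * y_hat)) * np.outer(Az, Az)

# Example: a single update pulls the predicted distance toward the target y
A = np.eye(2)
u, v, y, eta = np.array([1.0, 0.0]), np.array([0.0, 0.0]), 4.0, 1.0
A_new = lego_update(A, u, v, y, eta)
z = u - v
assert np.all(np.linalg.eigvalsh(A_new) > 0)           # still positive definite
assert abs(z @ A_new @ z - y) < abs(z @ A @ z - y)     # prediction moved toward y
```

For the quadruple input (u_t, v_t, y_t, b_t) with the squared hinge loss, the same update would simply be skipped whenever the constraint is already satisfied.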
Due to space issues, we do not present the full proofs; please see the longer version for further details.

To evaluate the online learner's quality, we want to compare the loss of the online algorithm (which has access to one constraint at a time in sequence) to the loss of the best possible offline algorithm (which has access to all constraints at once). Let d̂_t = d_{A*}(u_t, v_t) be the learned distance between points u_t and v_t with a fixed positive definite matrix A*, and let L_{A*} = Σ_t ℓ(d̂_t, y_t) be the loss suffered over all t time steps. Note that the loss L_{A*} is with respect to a single matrix A*, whereas L_A (Section 2.1) is with respect to a matrix that is being updated every time step. Let A* be the optimal offline solution, i.e. it minimizes the total loss incurred (L_{A*}). The goal is to demonstrate that the loss of the online algorithm L_A is competitive with the loss of any offline algorithm. To that end, we now show that L_A ≤ c_1 L_{A*} + c_2, where c_1 and c_2 are constants.

In the result below, we assume that the length of the data points is bounded: ||u||_2^2 ≤ R for all u. The following key lemma shows that we can bound the loss at each step of the algorithm:

Lemma 2.1. At each step t,

    (1/2) α_t (ŷ_t - y_t)^2 - (1/2) β_t (d_{A*}(u_t, v_t) - y_t)^2 ≤ D_ld(A*, A_t) - D_ld(A*, A_{t+1}),

where 0 ≤ α_t ≤ η / (1 + η(R/2 + √(R^2/4 + 1/η))^2), β_t = η, and A* is the optimal offline solution.

Proof.
See longer version.

Theorem 2.2.

    L_A ≤ (1 + η(R/2 + √(R^2/4 + 1/η))^2) L_{A*} + (1/η + (R/2 + √(R^2/4 + 1/η))^2) D_ld(A*, A_0),

where L_A = Σ_t ℓ(ŷ_t, y_t) is the loss incurred by the series of matrices A_t generated by Equation (2.3), A_0 ≻ 0 is the initial matrix, and A* is the optimal offline solution.

Proof. The bound is obtained by summing the loss at each step using Lemma 2.1:

    Σ_t ( (1/2) α_t (ŷ_t - y_t)^2 - (1/2) β_t (d_{A*}(u_t, v_t) - y_t)^2 ) ≤ Σ_t ( D_ld(A*, A_t) - D_ld(A*, A_{t+1}) ).

The result follows by plugging in the appropriate α_t and β_t, and observing that the right-hand side telescopes to D_ld(A*, A_0) - D_ld(A*, A_{T+1}) ≤ D_ld(A*, A_0) since D_ld(A*, A_{T+1}) ≥ 0.

For the squared hinge loss ℓ(ŷ_t, y_t, b_t) = max(0, b_t(ŷ_t - y_t))^2, the corresponding algorithm has the same bound.

The regularization parameter affects the tradeoff between L_{A*} and D_ld(A*, A_0): as η gets larger, the coefficient of L_{A*} grows while the coefficient of D_ld(A*, A_0) shrinks. In most scenarios, R is small; for example, in the case when R = 2 and η = 1, the bound is L_A ≤ (4 + 2√2) L_{A*} + (4 + 2√2) D_ld(A*, A_0). Furthermore, in the case when there exists an offline solution with zero error, i.e., L_{A*} = 0, then with a sufficiently large regularization parameter, we know that L_A ≤ 2R^2 D_ld(A*, A_0). This bound is analogous to the bound proven in Theorem 1 of the POLA method [13].
Note, however, that our bound is much more favorable to scaling of the optimal solution A*, since the bound of POLA has a ||A*||_F^2 term while our bound uses D_ld(A*, A_0): if we scale the optimal solution by c, then the D_ld(A*, A_0) term will scale by O(c), whereas ||A*||_F^2 will scale by O(c^2). Similarly, our bound is tighter than that provided by the ITML-Online algorithm since, in the ITML-Online algorithm, the regularization parameter η_t for step t is dependent on the input data. An adversary can always provide an input (u_t, v_t, y_t) so that the regularization parameter has to be decreased arbitrarily; that is, the need to maintain positive definiteness for each update can prevent ITML-Online from making progress towards an optimal metric.

In summary, we have proven a regret bound for the proposed LEGO algorithm, an online metric learning algorithm based on LogDet regularization and gradient descent. Our algorithm automatically enforces positive definiteness every iteration and is simple to implement. The bound is comparable to POLA's bound but is more favorable to scaling, and is stronger than ITML-Online's bound.

3 Fast Online Similarity Searches

In many applications, metric learning is used in conjunction with nearest-neighbor searching, and data structures to facilitate such searches are essential. For online metric learning to be practical for large-scale retrieval applications, we must be able to efficiently index the data as updates to the metric are performed. This poses a problem for most fast similarity searching algorithms, since each update to the online algorithm would require a costly update to their data structures.

Our goal is to avoid expensive naive updates, where all database items are re-inserted into the search structure.
We employ locality-sensitive hashing to enable fast queries; but rather than re-hash all database examples every time an online constraint alters the metric, we show how to incorporate a second level of hashing that determines which hash bits are changing during the metric learning updates. This allows us to avoid costly exhaustive updates to the hash keys, though occasional updating is required after substantial changes to the metric are accumulated.

3.1 Background: Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) [6, 1] produces a binary hash key H(u) = [h_1(u) h_2(u) ... h_b(u)] for every data point. Each individual bit h_i(u) is obtained by applying the locality-sensitive hash function h_i to input u. To allow sub-linear time approximate similarity search for a similarity function 'sim', a locality-sensitive hash function must satisfy the following property: Pr[h_i(u) = h_i(v)] = sim(u, v), where 'sim' returns values between 0 and 1. This means that the more similar examples are, the more likely they are to collide in the hash table.

An LSH function for the case when 'sim' is the inner product was developed in [1], in which a hash bit is the sign of an input's inner product with a random hyperplane. For Mahalanobis distances, the similarity function of interest is sim(u, v) = u^T A v. The hash function in [1] was extended to accommodate a Mahalanobis similarity function in [9]: A can be decomposed as G^T G, and the similarity function is then equivalently ũ^T ṽ, where ũ = Gu and ṽ = Gv. Hence, a valid LSH function for u^T A v is:

    h_{r,A}(u) = 1 if r^T G u ≥ 0, and 0 otherwise,        (3.1)

where r is the normal to a random hyperplane. To perform sub-linear time nearest neighbor searches, a hash key is produced for all n data points in our database.
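Generating these keys is a single matrix operation; here is a minimal NumPy sketch of (3.1) with our own naming and synthetic data (not the authors' code):

```python
import numpy as np

def hash_keys(X, G, R):
    """b-bit hash keys for the rows of X, per Eq. (3.1): bit i of x is 1 iff r_i^T G x >= 0.
    X is n x d, G is a factor with A = G^T G, and the rows of R are b random normals."""
    return (X @ G.T @ R.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, b, n = 8, 16, 100
G = rng.standard_normal((d, d))       # factor of the current metric, A = G^T G
R = rng.standard_normal((b, d))       # b random hyperplane normals r_1, ..., r_b
X = rng.standard_normal((n, d))       # database points, one per row
keys = hash_keys(X, G, R)             # n x b binary matrix of hash bits
```

Points whose transformed vectors Gx point in similar directions agree on most bits, so bucketing on these bits yields candidate neighbors without a linear scan; note that the bits are invariant to positive rescaling of a point, since only the sign of r^T Gx matters.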
Given a query, its hash key is formed, and then an appropriate data structure can be used to extract potential nearest neighbors (see [6, 1] for details). Typically, the methods search only O(n^{1/(1+ε)}) of the data points, where ε > 0, to retrieve the (1 + ε)-nearest neighbors with high probability.

3.2 Online Hashing Updates

The approach described thus far is not immediately amenable to online updates. We can imagine producing a series of LSH functions h_{r_1,A}, ..., h_{r_b,A}, and storing the corresponding hash keys for each data point in our database. However, the hash functions as given in (3.1) are dependent on the Mahalanobis distance; when we update our matrix A_t to A_{t+1}, the corresponding hash functions, parameterized by G_t, must also change. To update all hash keys in the database would require O(nd) time, which may be prohibitive. In the following we propose a more efficient approach.

Recall the update for A: A_{t+1} = A_t - [η(ȳ - y_t) A_t z_t z_t^T A_t] / [1 + η(ȳ - y_t) ŷ_t], which we will write as A_{t+1} = A_t + β_t A_t z_t z_t^T A_t, where β_t = -η(ȳ - y_t) / (1 + η(ȳ - y_t) ŷ_t). Let G_t^T G_t = A_t. Then A_{t+1} = G_t^T (I + β_t G_t z_t z_t^T G_t^T) G_t. The square root of I + β_t G_t z_t z_t^T G_t^T is I + α_t G_t z_t z_t^T G_t^T, where α_t = (√(1 + β_t z_t^T A_t z_t) - 1) / (z_t^T A_t z_t). As a result, G_{t+1} = G_t + α_t G_t z_t z_t^T A_t. The corresponding update to (3.1) is to find the sign of

    r^T G_{t+1} u = r^T G_t u + α_t r^T G_t z_t z_t^T A_t u.        (3.2)
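The factor update just described is a rank-one modification of G_t, costing O(d^2) per constraint rather than the O(nd) of re-hashing the whole database. A NumPy sketch under our own naming (the assertion checks that G_{t+1}^T G_{t+1} reproduces A_{t+1} = A_t + β_t A_t z_t z_t^T A_t):

```python
import numpy as np

def update_factor(G, A, z, beta):
    """Rank-one factor update: G_{t+1} = G_t + alpha_t G_t z z^T A_t, with
    alpha_t = (sqrt(1 + beta_t z^T A_t z) - 1) / (z^T A_t z)."""
    zAz = float(z @ A @ z)
    alpha = (np.sqrt(1 + beta * zAz) - 1) / zAz
    return G + alpha * np.outer(G @ z, A @ z)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = X @ X.T + np.eye(4)
G = np.linalg.cholesky(A).T            # A = G^T G
z = rng.standard_normal(4)
beta = 0.1                             # stand-in for beta_t from the metric update

G_new = update_factor(G, A, z, beta)
A_new = A + beta * np.outer(A @ z, A @ z)   # A_{t+1} = A_t + beta_t A_t z z^T A_t
assert np.allclose(G_new.T @ G_new, A_new)
```

Only hash bits whose sign flips between r^T G_t u and r^T G_{t+1} u require any attention.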
Suppose that the hash functions have been updated in full at some time step t_1 in the past. Now at time t, we want to determine which hash bits have flipped since t_1, or more precisely, which examples' product with some r^T G has changed from positive to negative, or vice versa. This amounts to determining all bits such that sign(r^T G_{t_1} u) ≠ sign(r^T G_t u), or equivalently, (r^T G_{t_1} u)(r^T G_t u) ≤ 0. Expanding the update given in (3.2), we can write r^T G_t u as r^T G_{t_1} u + Σ_{ℓ=t_1}^{t-1} α_ℓ r^T G_ℓ z_ℓ z_ℓ^T A_ℓ u. Therefore, finding the bits that have changed sign is equivalent to finding all u such that (r^T G_{t_1} u)^2 + (r^T G_{t_1} u)(Σ_{ℓ=t_1}^{t-1} α_ℓ r^T G_ℓ z_ℓ z_ℓ^T A_ℓ u) ≤ 0. We can use a second level of locality-sensitive hashing to approximately find all such u. Define a vector ū = [(r^T G_{t_1} u)^2; (r^T G_{t_1} u) u] and a "query" q̄ = [-1; -Σ_{ℓ=t_1}^{t-1} α_ℓ A_ℓ z_ℓ z_ℓ^T G_ℓ^T r]. Then the bits that have changed sign can be approximately identified by finding all examples ū such that q̄^T ū ≥ 0. In other words, we look for all ū that have a large inner product with q̄, which translates the problem to a similarity search problem. This may be solved approximately using the locality-sensitive hashing scheme given in [1] for inner product similarity. Note that finding ū for each r can be computationally expensive, so we search ū for only a randomly selected subset of the vectors r.

In summary, when performing online metric learning updates, instead of updating all the hash keys at every step (which costs O(nd)), we delay updating the hash keys and instead determine approximately which bits have changed in the stored entries in the hash table since the last update. When we
When we\nhave a nearest-neighbor query, we can quickly determine which bits have changed, and then use this\ninformation to \ufb01nd a query\u2019s approximate nearest neighbors using the current metric. Once many of\nthe bits have changed, we perform a full update to our hash functions.\n\nFinally, we note that the above can be extended to the case where computations are done in kernel\nspace. We omit details due to lack of space.\n4 Experimental Results\nIn this section we evaluate the proposed algorithm (LEGO) over a variety of data sets, and examine\nboth its online metric learning accuracy as well as the quality of its online similarity search updates.\nAs baselines, we consider the most relevant techniques from the literature: the online metric learners\nPOLA [13] and ITML-Online [4]. We also evaluate a baseline of\ufb02ine metric learner associated with\nour method. For all metric learners, we gauge improvements relative to the original (non-learned)\nEuclidean distance, and our classi\ufb01cation error is measured with the k-nearest neighbor algorithm.\nFirst we consider the same collection of UCI data sets used in [4]. For each data set, we provide the\nonline algorithms with 10,000 randomly-selected constraints, and generate their target distances as\nin [4]\u2014for same-class pairs, the target distance is set to be equal to the 5th percentile of all distances\nin the data, while for different-class pairs, the 95th percentile is used. To tune the regularization\nparameter \u03b7 for POLA and LEGO, we apply a pre-training phase using 1,000 constraints. (This is not\nrequired for ITML-Online, which automatically sets the regularization parameter at each iteration\nto guarantee positive de\ufb01niteness). The \ufb01nal metric (AT ) obtained by each online algorithm is used\nfor testing (T is the total number of time-steps). The left plot of Figure 1 shows the k-nn error rates\nfor all \ufb01ve data sets. 
LEGO outperforms the Euclidean baseline as well as the other online learners, and even approaches the accuracy of the offline method (see [4] for additional comparable offline learning results using [7, 15]). LEGO and ITML-Online have comparable running times. However, our approach has a significant speed advantage over POLA on these data sets: on average, learning with LEGO is 16.6 times faster, most likely due to the extra projection step required by POLA.

Next we evaluate our approach on a handwritten digit classification task, reproducing the experiment used to test POLA in [13]. We use the same settings given in that paper. Using the MNIST data set, we pose a binary classification problem between each pair of digits (45 problems in all). The training and test sets consist of 10,000 examples each. For each problem, 1,000 constraints are chosen and the final metric obtained is used for testing. The center plot of Figure 1 compares the test error between POLA and LEGO. Note that LEGO beats or matches POLA's test error in 33/45 (73.33%) of the classification problems. Based on the additional baselines provided in [13], this indicates that our approach also fares well compared to other offline metric learners on this data set.

We next consider a set of image patches from the Photo Tourism project [14], where user photos from Flickr are used to generate 3-d reconstructions of various tourist landmarks. Forming the reconstructions requires solving for the correspondence between local patches from multiple images of the same scene.
We use the publicly available data set that contains about 300,000 total patches from images of three landmarks1. Each patch has a dimensionality of 4096, so for efficiency we apply all algorithms in kernel space, and use a linear kernel. The goal is to learn a metric that measures the distance between image patches better than L2, so that patches of the same 3-d scene point will be matched together, and (ideally) others will not. Since the database is large, we can also use it to demonstrate our online hash table updates.

[Figure 1 plots omitted; left: k-NN error bars on Wine, Iris, Bal-Scale, Ionosphere, and Soybean; center: POLA error vs. LEGO error; right: true positives vs. false positives on the Photo Tourism data.]

Figure 1: Comparison with existing online metric learning methods. Left: On the UCI data sets, our method (LEGO) outperforms both the Euclidean distance baseline as well as existing metric learning methods, and even approaches the accuracy of the offline algorithm. Center: Comparison of errors for LEGO and POLA on 45 binary classification problems using the MNIST data; LEGO matches or outperforms POLA on 33 of the 45 total problems. Right: On the Photo Tourism data, our online algorithm significantly outperforms the L2 baseline and ITML-Online, does well relative to POLA, and nearly matches the accuracy of the offline method.
Following [8], we add random jitter (scaling, rotations, shifts) to all patches, and generate 50,000 patch constraints (50% matching and 50% non-matching patches) from a mix of the Trevi and Halfdome images. We test with 100,000 patch pairs from the Notre Dame portion of the data set, and measure accuracy with precision and recall.

The right plot of Figure 1 shows that LEGO and POLA are able to learn a distance function that significantly outperforms the baseline squared Euclidean distance. However, LEGO is more accurate than POLA, and again nearly matches the performance of the offline metric learning algorithm. On the other hand, the ITML-Online algorithm does not improve beyond the baseline. We attribute the poor accuracy of ITML-Online to its need to continually adjust the regularization parameter to maintain positive definiteness; in practice, this often leads to significant drops in the regularization parameter, which prevents the method from improving over the Euclidean baseline. In terms of training time, on this data LEGO is 1.42 times faster than POLA (on average over 10 runs).

Finally, we present results using our online metric learning algorithm together with our online hash table updates described in Section 3.2 for the Photo Tourism data. For our first experiment, we provide each method with 50,000 patch constraints, and then search for nearest neighbors for 10,000 test points sampled from the Notre Dame images. Figure 2 (left plot) shows the recall as a function of the number of patches retrieved for four variations: LEGO with a linear scan, LEGO with our LSH updates, the L2 baseline with a linear scan, and L2 with our LSH updates. The results show that the accuracy achieved by our LEGO+LSH algorithm is comparable to the LEGO+linear scan (and similarly, L2+LSH is comparable to L2+linear scan), thus validating the effectiveness of our online hashing scheme.
Moreover, LEGO+LSH needs to search only 10% of the database, which translates to an approximate speedup factor of 4.7 over the linear scan for this data set.

Next we show that LEGO+LSH performs accurate and efficient retrievals when constraints and queries are interleaved in any order. Such a scenario is useful in many applications: for example, an image retrieval system such as Flickr continually acquires new image tags from users (which could be mapped to similarity constraints), but must also continually support intermittent user queries. For the Photo Tourism setting, it would be useful in practice to allow new constraints indicating true-match patch pairs to stream in while users continually add photos that should participate in new 3-d reconstructions with the improved match distance functions. To experiment with this scenario, we randomly mix online additions of 50,000 constraints with 10,000 queries, and measure performance by the recall value for 300 retrieved nearest neighbor examples. We recompute the hash-bits for all database examples if we detect changes in more than 10% of the database examples. Figure 2 (right plot) compares the average recall value for each method after each query. As expected, as more constraints are provided, the LEGO-based accuracies all improve (in contrast to the static L2 baseline, as seen by the straight line in the plot). Our method achieves accuracy similar to both the linear scan method (LEGO Linear) and the naive LSH method in which the hash table is fully recomputed after every constraint update (LEGO Naive LSH).
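The rebuild rule used in this experiment (recompute all hash-bits once more than 10% of the database examples have changed) can be viewed as a simple deferred-maintenance policy. The bookkeeping class below is our own illustrative sketch, not the paper's implementation:

```python
class LazyLSHTable:
    """Defer full hash-key recomputation until enough keys are stale.

    Recomputing every key after each metric update would be prohibitively
    expensive, so stale keys are tolerated up to a fixed fraction of the
    database (10% in our experiments) and then rebuilt in one batch.
    """
    def __init__(self, n_database, rebuild_fraction=0.10):
        self.n = n_database
        self.rebuild_fraction = rebuild_fraction
        self.stale = set()   # indices whose hash keys are out of date
        self.rebuilds = 0

    def mark_stale(self, indices):
        """Record items affected by a metric update; rebuild if over budget."""
        self.stale.update(indices)
        if len(self.stale) > self.rebuild_fraction * self.n:
            self._rebuild_all()

    def _rebuild_all(self):
        # placeholder: recompute every database key under the current metric
        self.stale.clear()
        self.rebuilds += 1

table = LazyLSHTable(n_database=1000)
table.mark_stale(range(90))       # 9% stale: no rebuild yet
print(table.rebuilds)             # 0
table.mark_stale(range(90, 120))  # now 12% stale: triggers a batch rebuild
print(table.rebuilds)             # 1
```

Between rebuilds, queries are answered against partially stale keys, which is exactly the small accuracy/efficiency trade-off visible in Figure 2.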
The curves stack up appropriately given the levels of approximation: LEGO Linear yields the upper bound in terms of accuracy, LEGO Naive LSH with its exhaustive updates is slightly behind that, followed by our LEGO LSH with its partial and dynamic updates. In reward for this minor accuracy loss, however, our method provides a speedup factor of 3.8 over the naive LSH update scheme. (In this case the naive LSH scheme is actually slower than a linear scan, as updating the hash tables after every update incurs a large overhead cost.)

¹ http://phototour.cs.washington.edu/patches/default.htm

[Figure 2: Results with online hashing updates. The left plot shows the recall value for increasing numbers of nearest neighbors retrieved. 'LEGO LSH' denotes LEGO metric learning in conjunction with online searches using our LSH updates; 'LEGO Linear' denotes LEGO learning with linear scan searches; L2 denotes the baseline Euclidean distance. The right plot shows the average recall values for all methods at different time instances as more queries are made and more constraints are added. Our online similarity search updates make it possible to efficiently interleave online learning and querying. See text for details.]
For larger data sets, we can expect even larger speed improvements.

Conclusions: We have developed an online metric learning algorithm together with a method to perform online updates to fast similarity search structures, and have demonstrated their applicability and advantages on a variety of data sets. We have proven regret bounds for our online learner, which offers improved reliability over state-of-the-art methods in terms of both theoretical guarantees and empirical performance. A disadvantage of our algorithm is that the LSH parameters, e.g. ε and the number of hash-bits, need to be selected manually, and may depend on the final application. For future work, we hope to tune the LSH parameters automatically using a deeper theoretical analysis of our hash key updates in conjunction with the relevant statistics of the online similarity search task at hand.

Acknowledgments: This research was supported in part by NSF grant CCF-0431257, NSF-ITR award IIS-0325116, NSF grant IIS-0713142, NSF CAREER award 0747356, Microsoft Research, and the Henry Luce Foundation.

References
[1] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, 2002.
[2] L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, S. Wang, and T. Caelli. Implicit Online Learning with Kernels. In NIPS, 2006.
[3] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SOCG, 2004.
[4] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-Theoretic Metric Learning. In ICML, 2007.
[5] A. Frome, Y. Singer, and J. Malik. Image Retrieval and Classification Using Local Distance Functions. In NIPS, 2007.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In VLDB, 1999.
[7] A. Globerson and S. Roweis. Metric Learning by Collapsing Classes. In NIPS, 2005.
[8] G. Hua, M. Brown, and S. Winder. Discriminant Embedding for Local Image Descriptors.
In ICCV, 2007.
[9] P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. In CVPR, 2008.
[10] J. Kivinen and M. K. Warmuth. Exponentiated Gradient Versus Gradient Descent for Linear Predictors. Inf. Comput., 132(1):1–63, 1997.
[11] M. Schultz and T. Joachims. Learning a Distance Metric from Relative Comparisons. In NIPS, 2003.
[12] S. Shalev-Shwartz and Y. Singer. Online Learning meets Optimization in the Dual. In COLT, 2006.
[13] S. Shalev-Shwartz, Y. Singer, and A. Ng. Online and Batch Learning of Pseudo-metrics. In ICML, 2004.
[14] N. Snavely, S. Seitz, and R. Szeliski. Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH, pages 835–846, 2006.
[15] K. Weinberger, J. Blitzer, and L. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS, 2006.
[16] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance Metric Learning, with Application to Clustering with Side-Information. In NIPS, 2002.