{"title": "Positive Semidefinite Metric Learning with Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1651, "page_last": 1659, "abstract": "The learning of appropriate distance metrics is a critical problem in classification. In this work, we propose a boosting-based technique, termed BoostMetric, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. BoostMetric is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices.  BoostMetric thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints.  Experiments on various datasets show that the proposed algorithm compares favorably to those state-of-the-art methods in terms of classification accuracy and running time.", "full_text": "Positive Semide\ufb01nite Metric Learning with Boosting\n\nChunhua Shen\u2020\u2021, Junae Kim\u2020\u2021, Lei Wang\u2021, Anton van den Hengel\u00b6\n\n\u2020 NICTA Canberra Research Lab, Canberra, ACT 2601, Australia\u2217\n\u2021 Australian National University, Canberra, ACT 0200, Australia\n\n\u00b6 The University of Adelaide, Adelaide, SA 5005, Australia\n\nAbstract\n\nThe learning of appropriate distance metrics is a critical problem in image classi\ufb01-\ncation and retrieval. In this work, we propose a boosting-based technique, termed\nBOOSTMETRIC, for learning a Mahalanobis distance metric. One of the primary\ndif\ufb01culties in learning such a metric is to ensure that the Mahalanobis matrix re-\nmains positive semide\ufb01nite. 
Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. BOOSTMETRIC is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. BOOSTMETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to state-of-the-art methods in terms of classification accuracy and running time.\n\n1 Introduction\n\nLearning an appropriate distance metric with simple and efficient algorithms has long been a sought-after goal in image classification and retrieval [1\u20135]. Such distance metrics are essential to the effectiveness of many critical algorithms such as k-nearest neighbor (kNN), k-means clustering, and kernel regression. We show in this work how a Mahalanobis metric is learned from proximity comparisons among triples of training data. The Mahalanobis distance, a.k.a. Gaussian quadratic distance, is parameterized by a positive semidefinite (p.s.d.) matrix. Therefore, methods for learning a Mahalanobis distance typically result in constrained semidefinite programs. We discuss the problem setting as well as the difficulties of learning such a p.s.d. matrix. If we let $a_i$, $i = 1, 2, \\cdots$, represent a set of points in $\\mathbb{R}^D$, the training data consist of a set of constraints upon the relative distances between these points, $SS = \\{(a_i, a_j, a_k) \\mid \\mathrm{dist}_{ij} < \\mathrm{dist}_{ik}\\}$, where $\\mathrm{dist}_{ij}$ measures the distance between $a_i$ and $a_j$. We are interested in the case that dist computes the Mahalanobis distance. 
The Mahalanobis distance between two vectors takes the form:\n\n$\\|a_i - a_j\\|_X = \\sqrt{(a_i - a_j)^\\top X (a_i - a_j)}$, with $X \\succeq 0$, a p.s.d. matrix. It is equivalent to learning a projection matrix L with $X = LL^\\top$. Constraints such as those above often arise when it is known that $a_i$ and $a_j$ belong to the same class of data points while $a_i$ and $a_k$ belong to different classes. In some cases, these comparison constraints are much easier to obtain than either the class labels or distances between data elements. For example, in video content retrieval, faces extracted from successive frames at close locations can be safely assumed to belong to the same person, without requiring the individual to be identified. In web search, the results returned by a search engine are ranked according to relevance, an ordering which allows a natural conversion into a set of constraints.\n\n\u2217NICTA is funded through the Australian Government\u2019s Backing Australia\u2019s Ability initiative, in part through the Australian Research Council.\n\nThe requirement that X be p.s.d. has led to the development of a number of methods for learning a Mahalanobis distance which rely upon constrained semidefinite programming. This approach has a number of limitations, however, which we now discuss with reference to the problem of learning a p.s.d. matrix from a set of constraints upon pairwise-distance comparisons. Relevant work on this topic includes [3\u20138] amongst others.\n\nXing et al. [4] first proposed learning a Mahalanobis metric for clustering using convex optimization. The inputs are two sets: a similarity set and a dis-similarity set. The algorithm maximizes the distance between points in the dis-similarity set under the constraint that the distance between points in the similarity set is upper-bounded. 
Neighborhood component analysis (NCA) [6] and large margin nearest neighbor (LMNN) [7] learn a metric by maintaining consistency in the data\u2019s neighborhood and keeping a large margin at the boundaries of different classes. It has been shown in [7] that LMNN delivers state-of-the-art performance among most distance metric learning algorithms.\n\nThe work of LMNN [7] and PSDBoost [9] has directly inspired our work. Instead of using the hinge loss as in LMNN and PSDBoost, we use the exponential loss function in order to derive an AdaBoost-like optimization procedure. Hence, despite similar purposes, our algorithm differs essentially in the optimization. While the formulation of LMNN looks more similar to support vector machines (SVMs) and that of PSDBoost to LPBoost, our algorithm, termed BOOSTMETRIC, largely draws upon AdaBoost [10].\n\nIn many cases, it is difficult to find a global optimum in the projection matrix L [6]. Reformulation-linearization is a typical technique in convex optimization to relax and convexify the problem [11]. In metric learning, much existing work instead learns $X = LL^\\top$ to seek a global optimum, e.g., [4, 7, 12, 8]. The price is heavy computation and poor scalability: it is not trivial to preserve the semidefiniteness of X during the course of learning. Standard approaches like interior point Newton methods require the Hessian, which usually requires $O(D^4)$ resources (where D is the input dimension). This could be prohibitive for many real-world problems. Alternatively, a projected (sub-)gradient method is adopted in [7, 4, 8]. The disadvantages of this approach are: (1) not easy to implement; (2) many parameters involved; (3) slow convergence. PSDBoost [9] converts the particular semidefinite program in metric learning into a sequence of linear programs (LPs). 
At each iteration of PSDBoost, an LP needs to be solved as in LPBoost, which scales around $O(J^{3.5})$ with J the number of iterations (and therefore variables). As J increases, the scale of the LP becomes larger. Another problem is that PSDBoost needs to store all the weak learners (the rank-one matrices) during the optimization. When the input dimension D is large, the memory required is proportional to $JD^2$, which can be prohibitively large at a late iteration J. Our proposed algorithm solves both of these problems.\n\nBased on the observation from [9] that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices, we propose BOOSTMETRIC for learning a p.s.d. matrix. The weak learner of BOOSTMETRIC is a rank-one p.s.d. matrix as in PSDBoost. The proposed BOOSTMETRIC algorithm has the following desirable properties: (1) BOOSTMETRIC is efficient and scalable. Unlike most existing methods, no semidefinite programming is required. At each iteration, only the largest eigenvalue and its corresponding eigenvector are needed. (2) BOOSTMETRIC can accommodate various types of constraints. We demonstrate learning a Mahalanobis metric by proximity comparison constraints. (3) Like AdaBoost, BOOSTMETRIC does not have any parameter to tune. The user only needs to know when to stop. In contrast, both LMNN and PSDBoost have parameters to cross validate. Also like AdaBoost it is easy to implement. No sophisticated optimization techniques such as LP solvers are involved. Unlike PSDBoost, we do not need to store all the weak learners. The efficacy and efficiency of the proposed BOOSTMETRIC are demonstrated on various datasets.\n\nThroughout this paper, a matrix is denoted by a bold upper-case letter (X); a column vector is denoted by a bold lower-case letter (x). The ith row of X is denoted by $X_{i:}$ and the ith column by $X_{:i}$. $\\mathrm{Tr}(\\cdot)$ is the trace of a symmetric matrix and $\\langle X, Z \\rangle = \\mathrm{Tr}(XZ^\\top) = \\sum_{ij} X_{ij} Z_{ij}$ calculates the inner product of two matrices. 
An element-wise inequality between two vectors like $u \\le v$ means $u_i \\le v_i$ for all i. We use $X \\succeq 0$ to indicate that matrix X is positive semidefinite.\n\n2 Algorithms\n\n2.1 Distance Metric Learning\n\nAs discussed, the Mahalanobis metric is equivalent to linearly transforming the data by a projection matrix $L \\in \\mathbb{R}^{D \\times d}$ (usually $D \\ge d$) before calculating the standard Euclidean distance:\n\n$\\mathrm{dist}_{ij}^2 = \\|L^\\top a_i - L^\\top a_j\\|_2^2 = (a_i - a_j)^\\top LL^\\top (a_i - a_j) = (a_i - a_j)^\\top X (a_i - a_j)$. (1)\n\nAlthough one can learn L directly as many conventional approaches do, in this setting, non-convex constraints are involved, which make the problem difficult to solve. As we will show, in order to convexify these conditions, a new variable $X = LL^\\top$ is introduced instead. This technique has been used widely in convex optimization and machine learning such as [12]. If X = I, it reduces to the Euclidean distance. If X is diagonal, the problem corresponds to learning a metric in which the different features are given different weights, a.k.a. feature weighting.\n\nIn the framework of large-margin learning, we want to maximize the difference between $\\mathrm{dist}_{ik}$ and $\\mathrm{dist}_{ij}$. That is, we wish to make $\\mathrm{dist}_{ik}^2 - \\mathrm{dist}_{ij}^2 = (a_i - a_k)^\\top X (a_i - a_k) - (a_i - a_j)^\\top X (a_i - a_j)$ as large as possible under some regularization. To simplify notation, we rewrite the difference between $\\mathrm{dist}_{ik}^2$ and $\\mathrm{dist}_{ij}^2$ as $\\mathrm{dist}_{ik}^2 - \\mathrm{dist}_{ij}^2 = \\langle A_r, X \\rangle$, with\n\n$A_r = (a_i - a_k)(a_i - a_k)^\\top - (a_i - a_j)(a_i - a_j)^\\top$, (2)\n\n$r = 1, \\cdots, |SS|$, where $|SS|$ is the size of the set SS.\n\n2.2 Learning with Exponential Loss\n\nWe derive a general algorithm for p.s.d. matrix learning with exponential loss. Assume that we want to find a p.s.d. 
matrix $X \\succeq 0$ such that a set of constraints\n\n$\\langle A_r, X \\rangle > 0$, $r = 1, 2, \\cdots$,\n\nare satisfied as well as possible. These constraints need not all be strictly satisfied. We can define the margin $\\rho_r = \\langle A_r, X \\rangle$, $\\forall r$. By employing exponential loss, we want to optimize\n\n$\\min \\log \\big( \\sum_{r=1}^{|SS|} \\exp(-\\rho_r) \\big) + v \\mathrm{Tr}(X)$, s.t. $\\rho_r = \\langle A_r, X \\rangle$, $r = 1, \\cdots, |SS|$, $X \\succeq 0$. (P0)\n\nNote that: (1) We have worked on the logarithmic version of the sum of exponential loss. This transform does not change the original optimization problem of the sum of exponential loss because the logarithmic function is strictly monotonically increasing. (2) A regularization term Tr(X) has been applied. Without this regularization, one can always multiply X by an arbitrarily large factor to make the exponential loss approach zero in the case of all constraints being satisfied. This trace-norm regularization may also lead to low-rank solutions. (3) An auxiliary variable $\\rho_r$, $r = 1, \\cdots$, must be introduced for deriving a meaningful dual problem, as we show later.\n\nWe can decompose X into: $X = \\sum_{j=1}^{J} w_j Z_j$, with $w_j \\ge 0$, $\\mathrm{rank}(Z_j) = 1$ and $\\mathrm{Tr}(Z_j) = 1$, $\\forall j$. So\n\n$\\rho_r = \\langle A_r, X \\rangle = \\langle A_r, \\sum_{j=1}^{J} w_j Z_j \\rangle = \\sum_{j=1}^{J} w_j \\langle A_r, Z_j \\rangle = \\sum_{j=1}^{J} w_j H_{rj} = H_{r:} w$, $\\forall r$. (3)\n\nHere $H_{rj}$ is a shorthand for $H_{rj} = \\langle A_r, Z_j \\rangle$. Clearly $\\mathrm{Tr}(X) = \\sum_{j=1}^{J} w_j \\mathrm{Tr}(Z_j) = 1^\\top w$.\n\n2.3 The Lagrange Dual Problem\n\nWe now derive the Lagrange dual of the problem we are interested in. 
The original problem (P0) now becomes\n\n$\\min \\log \\big( \\sum_{r=1}^{|SS|} \\exp(-\\rho_r) \\big) + v 1^\\top w$, s.t. $\\rho_r = H_{r:} w$, $r = 1, \\cdots, |SS|$, $w \\ge 0$. (P1)\n\nIn order to derive its dual, we write its Lagrangian\n\n$L(w, \\rho, u, p) = \\log \\big( \\sum_{r=1}^{|SS|} \\exp(-\\rho_r) \\big) + v 1^\\top w + \\sum_{r=1}^{|SS|} u_r (\\rho_r - H_{r:} w) - p^\\top w$, (4)\n\nwith $p \\ge 0$. Here u and p are Lagrange multipliers. The dual problem is obtained by finding the saddle point of L; i.e., $\\sup_{u,p} \\inf_{w,\\rho} L$.\n\n$\\inf_{w,\\rho} L = \\inf_{\\rho} L_1 + \\inf_{w} L_2 = -\\sum_{r=1}^{|SS|} u_r \\log u_r$, with $L_1 = \\log \\big( \\sum_{r=1}^{|SS|} \\exp(-\\rho_r) \\big) + u^\\top \\rho$ and $L_2 = \\big( v 1^\\top - \\sum_{r=1}^{|SS|} u_r H_{r:} - p^\\top \\big) w$.\n\nThe infimum of $L_1$ is found by setting its first derivative to zero and we have:\n\n$\\inf_{\\rho} L_1 = -\\sum_r u_r \\log u_r$ if $u \\ge 0$, $1^\\top u = 1$, and $-\\infty$ otherwise.\n\nThe infimum is the Shannon entropy. $L_2$ is linear in w, so its infimum is finite only if the coefficient of w vanishes, in which case $L_2$ is 0. Since $p \\ge 0$, it leads to\n\n$\\sum_{r=1}^{|SS|} u_r H_{r:} \\le v 1^\\top$. (5)\n\nThe Lagrange dual problem of (P1) is an entropy maximization problem, which writes\n\n$\\max_u -\\sum_{r=1}^{|SS|} u_r \\log u_r$, s.t. $u \\ge 0$, $1^\\top u = 1$, and (5). (D1)\n\nWeak and strong duality hold under mild conditions [11]. That means, one can usually solve one problem from the other. The KKT conditions link the optimal solutions of these two problems. In our case, it is\n\n$u_r^\\star = \\exp(-\\rho_r^\\star) / \\sum_{k=1}^{|SS|} \\exp(-\\rho_k^\\star)$, $\\forall r$. (6)\n\nWhile it is possible to devise a totally-corrective column generation based optimization procedure for solving our problem as in the case of LPBoost [13], we are more interested in considering one-at-a-time coordinate-wise descent algorithms, as in the case of AdaBoost [10], which has the advantages of being (1) computationally efficient and (2) parameter free. Let us start from some basic knowledge of column generation because our coordinate descent strategy is inspired by column generation.\n\nIf we knew all the bases $Z_j$ ($j = 1, \\cdots, J$), and hence the entire matrix H, then either the primal (P1) or the dual (D1) could be trivially solved (at least in theory) because both are convex optimization problems. We can solve them in polynomial time. 
In particular, the primal problem is a convex minimization with simple nonnegativity constraints. Off-the-shelf software like L-BFGS-B [14] can be used for this purpose. Unfortunately, in practice, we cannot access all the bases: the number of possible Z\u2019s is infinite. In convex optimization, column generation is a technique that is designed for solving this difficulty.\n\nInstead of directly solving the primal problem (P1), we find the most violated constraint in the dual (D1) iteratively for the current solution and add this constraint to the optimization problem. For this purpose, we need to solve\n\n$\\hat{Z} = \\arg\\max_Z \\big\\{ \\sum_{r=1}^{|SS|} u_r \\langle A_r, Z \\rangle, \\text{ s.t. } Z \\in \\Omega_1 \\big\\}$. (7)\n\nHere $\\Omega_1$ is the set of trace-one rank-one matrices. We discuss how to efficiently solve (7) later. Now we move on to derive a coordinate descent optimization procedure.\n\n2.4 Coordinate Descent Optimization\n\nWe show how an AdaBoost-like optimization procedure can be derived for our metric learning problem. As in AdaBoost, we need to solve for the primal variables $w_j$ given all the weak learners up to iteration j.\n\nOptimizing for $w_j$ Since we are interested in the one-at-a-time coordinate-wise optimization, we keep $w_1, w_2, \\ldots, w_{j-1}$ fixed when solving for $w_j$. The cost function of the primal problem is (in the following derivation, we drop those terms irrelevant to the variable $w_j$)\n\n$C_p(w_j) = \\log \\big[ \\sum_{r=1}^{|SS|} \\exp(-\\rho_r^{j-1}) \\cdot \\exp(-H_{rj} w_j) \\big] + v w_j$.\n\nClearly, $C_p$ is convex in $w_j$ and hence there is only one minimum that is also globally optimal. The first derivative of $C_p$ w.r.t. 
$w_j$ vanishes at optimality, which results in\n\n$\\sum_{r=1}^{|SS|} (H_{rj} - v) u_r^{j-1} \\exp(-w_j H_{rj}) = 0$. (8)\n\nAlgorithm 1 Bisection search for $w_j$.\n\nInput: An interval $[w_l, w_u]$ known to contain the optimal value of $w_j$ and convergence tolerance $\\varepsilon > 0$.\nrepeat\n  \u00b7 $w_j = 0.5(w_l + w_u)$;\n  \u00b7 if l.h.s. of (8) > 0 then\n      $w_l = w_j$;\n    else\n      $w_u = w_j$.\nuntil $w_u - w_l < \\varepsilon$;\nOutput: $w_j$.\n\nIf $H_{rj}$ is discrete, such as {+1, \u22121} in standard AdaBoost, we can obtain a closed-form solution similar to AdaBoost. Unfortunately in our case, $H_{rj}$ can be any real value. We instead use bisection to search for the optimal $w_j$. The bisection method is one of the root-finding algorithms. It repeatedly divides an interval in half and then selects the subinterval in which a root exists. Bisection is simple and robust, although it is not the fastest algorithm for root-finding. Newton-type algorithms are also applicable here. Algorithm 1 gives the bisection procedure. We have utilized the fact that the l.h.s. of (8) must be positive at $w_l$. Otherwise no solution can be found. When $w_j = 0$, clearly the l.h.s. of (8) is positive.\n\nUpdating u The rule for updating the dual variable u can be easily obtained from (6). At iteration j, we have\n\n$u_r^j \\propto \\exp(-\\rho_r^j) \\propto u_r^{j-1} \\exp(-H_{rj} w_j)$, and $\\sum_{r=1}^{|SS|} u_r^j = 1$,\n\nderived from (6). So once $w_j$ is calculated, we can update u as\n\n$u_r^j = u_r^{j-1} \\exp(-H_{rj} w_j) / z$, $r = 1, \\ldots, |SS|$, (9)\n\nwhere z is a normalization factor so that $\\sum_{r=1}^{|SS|} u_r^j = 1$. This is exactly the same as AdaBoost.\n\n2.5 Base Learning Algorithm\n\nIn this section, we show that the optimization problem (7) can be exactly and efficiently solved using eigenvalue-decomposition (EVD). 
From $Z \\succeq 0$ and rank(Z) = 1, we know that Z has the format: $Z = \\xi\\xi^\\top$, $\\xi \\in \\mathbb{R}^D$; and Tr(Z) = 1 means $\\|\\xi\\|_2 = 1$. We have\n\n$\\sum_{r=1}^{|SS|} u_r \\langle A_r, Z \\rangle = \\langle \\sum_{r=1}^{|SS|} u_r A_r, Z \\rangle = \\xi^\\top \\big( \\sum_{r=1}^{|SS|} u_r A_r \\big) \\xi$.\n\nBy denoting\n\n$\\hat{A} = \\sum_{r=1}^{|SS|} u_r A_r$, (10)\n\nthe base learning optimization equals: $\\max_\\xi \\xi^\\top \\hat{A} \\xi$, s.t. $\\|\\xi\\|_2 = 1$. It is clear that the largest eigenvalue of $\\hat{A}$, $\\lambda_{\\max}(\\hat{A})$, and its corresponding eigenvector $\\xi_1$ give the solution to the above problem. Note that $\\hat{A}$ is symmetric. Also see [9] for details.\n\n$\\lambda_{\\max}(\\hat{A})$ is also used as one of the stopping criteria of the algorithm. From the condition (5), $\\lambda_{\\max}(\\hat{A}) < v$ means that we are not able to find a new base matrix $\\hat{Z}$ that violates (5), so the algorithm has converged. We summarize our main algorithmic results in Algorithm 2.\n\n3 Experiments\n\n3.1 Classification on Benchmark Datasets\n\nWe evaluate BOOSTMETRIC on 15 datasets of different sizes. Some of the datasets have very high dimensional inputs. We use PCA to decrease the dimensionality before training on these datasets (datasets 2-6). PCA pre-processing helps to eliminate noise and speed up computation. 
We have used USPS and MNIST handwritten digits, ORL face recognition datasets, Columbia University Image Library (COIL20)$^1$, and UCI machine learning datasets$^2$ (datasets 7-13), Twin Peaks and Helix. The last two are artificial datasets$^3$.\n\nAlgorithm 2 Positive semidefinite matrix learning with boosting.\n\nInput:\n  \u2022 Training set triplets $(a_i, a_j, a_k) \\in SS$; compute $A_r$, $r = 1, 2, \\cdots$, using (2).\n  \u2022 J: maximum number of iterations;\n  \u2022 (optional) regularization parameter v; we may simply set v to a very small value, e.g., $10^{-7}$.\nInitialize: $u_r^0 = 1/|SS|$, $r = 1, \\cdots, |SS|$;\nfor $j = 1, 2, \\cdots, J$ do\n  \u00b7 Find a new base $Z_j$ by finding the largest eigenvalue ($\\lambda_{\\max}(\\hat{A})$) and its eigenvector of $\\hat{A}$ in (10);\n  \u00b7 if $\\lambda_{\\max}(\\hat{A}) < v$ then break (converged);\n  \u00b7 Compute $w_j$ using Algorithm 1;\n  \u00b7 Update u to obtain $u_r^j$, $r = 1, \\cdots, |SS|$, using (9);\nOutput: The final p.s.d. matrix $X \\in \\mathbb{R}^{D \\times D}$, $X = \\sum_{j=1}^{J} w_j Z_j$.\n\nExperimental results are obtained by averaging over 10 runs (except USPS-1). We randomly split the datasets for each run. We have used the same mechanism to generate training triplets as described in [7]. Briefly, for each training point $a_i$, the k nearest neighbors that have the same label as $y_i$ (targets), as well as the k nearest neighbors that have different labels from $y_i$ (impostors), are found. We then construct triplets from $a_i$ and its corresponding targets and impostors. For all the datasets, we have set k = 3 except that k = 1 for datasets USPS-1, ORLFace-1 and ORLFace-2 due to their large size. We have compared our method against a few methods: Xing et al. [4], RCA [5], NCA [6] and LMNN [7]. LMNN is one of the state-of-the-art methods according to recent studies such as [15]. 
Also in Table 1, \u201cEuclidean\u201d is the baseline algorithm that uses the standard Euclidean distance. The codes for these compared algorithms are downloaded from the corresponding authors\u2019 websites. We have released our codes for BOOSTMETRIC at [16]. The experiment setting for LMNN follows [7]. For BOOSTMETRIC, we have set $v = 10^{-7}$ and the maximum number of iterations J = 500. From Table 1, we can conclude: (1) BOOSTMETRIC consistently improves kNN classification using Euclidean distance on most datasets. So learning a Mahalanobis metric based upon the large margin concept does lead to improvements in kNN classification. (2) BOOSTMETRIC outperforms other algorithms in most cases (on 11 out of 15 datasets). LMNN is the second best algorithm on these 15 datasets statistically. LMNN\u2019s results are consistent with those given in [7]. (3) Xing et al. [4] and NCA can only handle a few small datasets. In general they do not perform very well. A good initialization is important for NCA because NCA\u2019s cost function is non-convex, so NCA can only find a local optimum.\n\nInfluence of v Previously, we claimed that our algorithm is parameter-free like AdaBoost. However, we do have a parameter v in BOOSTMETRIC. Actually, AdaBoost simply sets v = 0. The coordinate-wise gradient descent optimization strategy of AdaBoost leads to an $\\ell_1$-norm regularized maximum margin classifier [17]. It is shown that AdaBoost minimizes its loss criterion with an $\\ell_1$ constraint on the coefficient vector. Given the similarity of the optimization of BOOSTMETRIC with AdaBoost, we conjecture that BOOSTMETRIC has the same property. Here we show empirically that as long as v is sufficiently small, the final performance is not affected by the value of v. We have set v from $10^{-8}$ to $10^{-4}$ and run BOOSTMETRIC on 3 UCI datasets. Table 2 reports the final 3NN classification error with different v. 
The results are nearly identical.\n\nTable 1: Test classification error rates (%) of a 3-nearest neighbor classifier on benchmark datasets. Results of NCA and Xing et al. [4] on large datasets are not available either because the algorithm does not converge or due to the out-of-memory problem.\n\ndataset | Euclidean | Xing et al. [4] | RCA | NCA | LMNN | BOOSTMETRIC\n1 USPS-1 | 5.18 | - | 32.71 | - | 7.51 | 2.96\n2 USPS-2 | 3.56 (0.28) | - | 5.57 (0.33) | - | 2.18 (0.27) | 1.99 (0.24)\n3 ORLFace-1 | 3.33 (1.47) | - | 5.75 (2.85) | 3.92 (2.01) | 6.67 (2.94) | 2.00 (1.05)\n4 ORLFace-2 | 5.33 (2.70) | - | 4.42 (2.08) | 3.75 (1.63) | 2.83 (1.77) | 3.00 (1.31)\n5 MNIST | 4.11 (0.43) | - | 4.31 (0.42) | - | 4.19 (0.49) | 4.09 (0.31)\n6 COIL20 | 0.19 (0.21) | - | 0.32 (0.29) | - | 2.41 (1.80) | 0.02 (0.07)\n7 Letters | 5.74 (0.24) | - | 5.06 (0.26) | - | 4.34 (0.36) | 3.54 (0.18)\n8 Wine | 26.23 (5.52) | 10.38 (4.81) | 2.26 (1.95) | 27.36 (6.31) | 5.47 (3.01) | 2.64 (1.59)\n9 Bal | 18.13 (1.79) | 11.12 (2.12) | 19.47 (2.39) | 4.81 (1.80) | 11.87 (2.14) | 8.93 (2.28)\n10 Iris | 2.22 (2.10) | 2.22 (2.10) | 3.11 (2.15) | 2.89 (2.58) | 2.89 (2.58) | 2.89 (2.78)\n11 Vehicle | 30.47 (2.41) | 28.66 (2.49) | 21.42 (2.46) | 22.61 (3.26) | 22.57 (2.16) | 19.17 (2.10)\n12 Breast-Cancer | 3.28 (1.06) | 3.63 (0.93) | 3.82 (1.15) | 4.31 (1.10) | 3.19 (1.43) | 2.45 (0.95)\n13 Diabetes | 27.43 (2.93) | 27.87 (2.71) | 26.48 (1.61) | 27.61 (1.55) | 26.78 (2.42) | 25.04 (2.25)\n14 Twin Peaks | 1.13 (0.09) | - | 1.02 (0.09) | - | 0.98 (0.11) | 0.14 (0.08)\n15 Helix | 0.60 (0.12) | - | 0.61 (0.11) | - | 0.61 (0.13) | 0.58 (0.12)\n\nTable 2: Test error (%) of a 3-nearest neighbor classifier with different values of the parameter v. Each experiment is run 10 times. 
We report the mean and variance. As expected, as long as v is sufficiently small, in a wide range it almost does not affect the final classification performance.\n\nv | $10^{-8}$ | $10^{-7}$ | $10^{-6}$ | $10^{-5}$ | $10^{-4}$\nBal | 8.98 (2.59) | 8.88 (2.52) | 8.88 (2.52) | 8.88 (2.52) | 8.93 (2.52)\nB-Cancer | 2.11 (0.69) | 2.11 (0.69) | 2.11 (0.69) | 2.11 (0.69) | 2.11 (0.69)\nDiabetes | 26.0 (1.33) | 26.0 (1.33) | 26.0 (1.33) | 26.0 (1.34) | 26.0 (1.46)\n\n1 http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php\n2 http://archive.ics.uci.edu/ml/\n3 http://boosting.googlecode.com/files/dataset1.tar.bz2\n\nComputational time As we discussed, one major issue in learning a Mahalanobis distance is heavy computational cost because of the semidefiniteness constraint. Our algorithm is generally fast. It involves matrix operations and an EVD for finding its largest eigenvalue and its corresponding eigenvector. The time complexity of this EVD is $O(D^2)$ with D the input dimensions. We compare our algorithm\u2019s running time with LMNN in Fig. 1 on the artificial dataset (concentric circles). We vary the input dimensions from 50 to 1000 and keep the number of triplets fixed to 250. Instead of using standard interior-point SDP solvers that do not scale well, LMNN heuristically combines sub-gradient descent in both the matrices L and X. At each iteration, X is projected back onto the p.s.d. cone using EVD. So a full EVD with time complexity $O(D^3)$ is needed. Note that LMNN is much faster than SDP solvers like CSDP [18]. As seen from Fig. 1, when the input dimensions are low, BOOSTMETRIC is comparable to LMNN. As expected, when the input dimensions become high, BOOSTMETRIC is significantly faster than LMNN. Note that our implementation is in Matlab. Improvements are expected if implemented in C/C++.\n\n3.2 Visual Object Categorization and Detection\n\nThe proposed BOOSTMETRIC and the LMNN are further compared on four classes of the Caltech-101 object recognition database [19], including Motorbikes (798 images), Airplanes (800), Faces (435), and Background-Google (520). 
For each image, a number of interest regions are identified by the Harris-affine detector [20] and the visual content in each region is characterized by the SIFT descriptor [21]. The total number of local descriptors extracted from the images of the four classes are about 134,000, 84,000, 57,000, and 293,000, respectively.\n\n[Figure 1 plot: CPU time per run (seconds) versus input dimensions (0 to 1000) for BoostMetric and LMNN.]\n\nFigure 1: Computation time of the proposed BOOSTMETRIC and the LMNN method versus the input data\u2019s dimensions on an artificial dataset. BOOSTMETRIC is faster than LMNN with large input dimensions because at each iteration BOOSTMETRIC only needs to calculate the largest eigenvector and LMNN needs a full eigen-decomposition.\n\n[Figure 2 plots: (left) test error of 3-nearest neighbor (%) for Euclidean, LMNN and BoostMetric with 100D and 200D codebooks; (right) test error of 3-nearest neighbor (%) versus number of triplets (1000 to 9000).]\n\nFigure 2: Test error (3-nearest neighbor) of BOOSTMETRIC on the Motorbikes vs. Airplanes datasets. The second figure shows the test error against the number of training triplets with a 100-word codebook. Test error of LMNN is 4.7% \u00b1 0.5% with 8631 triplets for training, which is worse than BOOSTMETRIC. For Euclidean distance, the error is much larger: 15% \u00b1 1%.\n\nThis experiment includes both object categorization (Motorbikes vs. Airplanes) and object detection (Faces vs. Background-Google) problems. To accumulate statistics, the images of two involved object classes are randomly split as 10 pairs of training/test subsets. 
Restricted to the images in a training subset (those in a test subset\nare only used for test), their local descriptors are clustered to form visual words by using k-means\nclustering. Each image is then represented by a histogram containing the number of occurrences of\neach visual word.\n\nMotorbikes vs. Airplanes This experiment discriminates the images of a motorbike from those\nof an airplane. In each of the 10 pairs of training/test subsets, there are 959 training images and\n639 test images. Two visual codebooks of size 100 and 200 are used, respectively. With the result-\ning histograms, the proposed BOOSTMETRIC and the LMNN are learned on a training subset and\nevaluated on the corresponding test subset. Their averaged classi\ufb01cation error rates are compared\nin Fig. 2 (left). For both visual codebooks, the proposed BOOSTMETRIC achieves lower error rates\nthan the LMNN and the Euclidean distance, demonstrating its superior performance. We also apply\na linear SVM classi\ufb01er with its regularization parameter carefully tuned by 5-fold cross-validation.\nIts error rates are 3.87% \u00b1 0.69% and 3.00% \u00b1 0.72% on the two visual codebooks, respectively. In\ncontrast, a 3NN with BOOSTMETRIC has error rates 3.63% \u00b1 0.68% and 2.96% \u00b1 0.59%. Hence,\nthe performance of the proposed BOOSTMETRIC is comparable to or even slightly better than the\nSVM classi\ufb01er. Also, Fig. 2 (right) plots the test error of the BOOSTMETRIC against the number of\ntriplets for training. The general trend is that more triplets lead to smaller errors.\n\nFaces vs. Background-Google This experiment uses the two object classes as a retrieval prob-\nlem. The target of retrieval is the face images. The images in the class of Background-Google are\nrandomly collected from the Internet and they are used to represent the non-target class. BOOST-\nMETRIC is \ufb01rst learned from a training subset and retrieval is conducted on the corresponding test\nsubset. 
In each of the 10 training/test subsets, there are 573 training images and 382 test images. Again, two visual codebooks, of size 100 and 200, are used. Each face image in a test subset is used as a query, and its distances from the other test images are calculated by BOOSTMETRIC, LMNN and the Euclidean distance. For each metric, the precision of the top 5, 10, 15 and 20 retrieved images is computed. The retrieval precision for each query is averaged over the test subset and then averaged over all 10 test subsets. BOOSTMETRIC consistently attains the highest values, which again verifies its advantages over LMNN and the Euclidean distance. With a codebook size of 200, very similar results are obtained. See [16] for the experimental results.\n\n4 Conclusion\n\nWe have presented a new algorithm, BOOSTMETRIC, to learn a positive semidefinite metric using boosting techniques. We have generalized AdaBoost in the sense that the weak learner of BOOSTMETRIC is a matrix, rather than a classifier. Our algorithm is simple and efficient. Experiments show that it performs better than a few state-of-the-art metric learning methods. We are currently incorporating the idea of on-line learning into BOOSTMETRIC so that it can handle even larger datasets.\n\nReferences\n\n[1] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell., 18(6):607\u2013616, 1996.\n\n[2] J. Yu, J. Amores, N. Sebe, P. Radeva, and Q. Tian. Distance learning for similarity estimation. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):451\u2013462, 2008.\n\n[3] B. Jian and B. C. Vemuri. Metric learning using Iwasawa decomposition. In Proc. IEEE Int. Conf. Comp. Vis., pages 1\u20136, Rio de Janeiro, Brazil, 2007. IEEE.\n\n[4] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Proc. Adv. Neural Inf. Process. 
Syst. MIT Press, 2002.\n\n[5] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. J. Mach. Learn. Res., 6:937\u2013965, 2005.\n\n[6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. In Proc. Adv. Neural Inf. Process. Syst. MIT Press, 2004.\n\n[7] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Proc. Adv. Neural Inf. Process. Syst., pages 1473\u20131480, 2005.\n\n[8] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Proc. Adv. Neural Inf. Process. Syst., 2005.\n\n[9] C. Shen, A. Welsh, and L. Wang. PSDBoost: Matrix-generation linear programming for positive semidefinite matrices learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Proc. Adv. Neural Inf. Process. Syst., pages 1473\u20131480, Vancouver, Canada, 2008.\n\n[10] R. E. Schapire. Theoretical views of boosting and applications. In Proc. Int. Conf. Algorithmic Learn. Theory, pages 13\u201325, London, UK, 1999. Springer-Verlag.\n\n[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[12] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. Int. J. Comp. Vis., 70(1):77\u201390, 2006.\n\n[13] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Mach. Learn., 46(1-3):225\u2013254, 2002.\n\n[14] C. Zhu, R. H. Byrd, and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans. Math. Softw., 23(4):550\u2013560, 1997.\n\n[15] L. Yang, R. Jin, L. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. Hoi, and M. Satyanarayanan. A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. IEEE Trans. 
Pattern Anal. Mach. Intell. IEEE Computer Society Digital Library, November 2008, http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.273.\n\n[16] http://code.google.com/p/boosting/.\n\n[17] S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res., 5:941\u2013973, 2004.\n\n[18] B. Borchers. CSDP, a C library for semidefinite programming. Optim. Methods and Softw., 11(1):613\u2013623, 1999.\n\n[19] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594\u2013611, April 2006.\n\n[20] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. Int. J. Comp. Vis., 60(1):63\u201386, 2004.\n\n[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comp. Vis., 60(2):91\u2013110, 2004.", "award": [], "sourceid": 629, "authors": [{"given_name": "Chunhua", "family_name": "Shen", "institution": null}, {"given_name": "Junae", "family_name": "Kim", "institution": null}, {"given_name": "Lei", "family_name": "Wang", "institution": null}, {"given_name": "Anton", "family_name": "Hengel", "institution": null}]}