{"title": "Regularized Distance Metric Learning: Theory and Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 862, "page_last": 870, "abstract": "In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning can be independent of the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.", "full_text": "Regularized Distance Metric Learning:\n\nTheory and Algorithm\n\nRong Jin1\n\nShijun Wang2\n\nYang Zhou1\n\n1Dept. of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824\n\n2Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD 20892\nrongjin@cse.msu.edu wangshi@cc.nih.gov zhouyang@msu.edu\n\nAbstract\n\nIn this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning can be independent of the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. 
Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.\n\n1 Introduction\n\nDistance metric learning is a fundamental problem in machine learning and pattern recognition. It is critical to many real-world applications, such as information retrieval, classification, and clustering [6, 7]. Numerous algorithms have been proposed and examined for distance metric learning. They are usually classified into two categories: unsupervised metric learning and supervised metric learning. Unsupervised distance metric learning, sometimes referred to as manifold learning, aims to learn an underlying low-dimensional manifold in which the distances between most pairs of data points are preserved. Example algorithms in this category include ISOMAP [13] and Locally Linear Embedding (LLE) [8]. Supervised metric learning attempts to learn distance metrics from side information such as labeled instances and pairwise constraints. It searches for the optimal distance metric that (a) keeps data points of the same classes close, and (b) keeps data points from different classes far apart. Example algorithms in this category include [17, 10, 15, 5, 14, 19, 4, 12, 16]. In this work, we focus on supervised distance metric learning.\n\nAlthough a large number of studies have been devoted to supervised distance metric learning (see the survey in [18] and references therein), few studies address the generalization error of distance metric learning. 
In this paper, we examine the generalization error of regularized distance metric learning. Following the idea of stability analysis [1], we show that with appropriate constraints, the generalization error of regularized distance metric learning is independent of the dimensionality of the data, making it suitable for handling high dimensional data. In addition, we present an online learning algorithm for regularized distance metric learning, and show its regret bound. Note that although online metric learning was studied in [9], our approach is advantageous in that (a) it is computationally more efficient in handling the SDP cone constraint, and (b) it has a proven regret bound, while [9] only shows a mistake bound for datasets that can be separated by a Mahalanobis distance. To verify the efficacy and efficiency of the proposed algorithm for regularized distance metric learning, we conduct experiments with data classification and face recognition. Our empirical results show that the proposed online algorithm is (1) effective for metric learning compared to the state-of-the-art methods, and (2) robust and efficient for high dimensional data.\n\n2 Regularized Distance Metric Learning\n\nLet D = {z_i = (x_i, y_i), i = 1, . . . , n} denote the labeled examples, where x_k = (x_k^1, . . . , x_k^d) ∈ R^d is a d-dimensional vector and y_i ∈ {1, 2, . . . , m} is the class label. In our study, we assume that the norm of any example is upper bounded by R, i.e., sup_x |x|_2 ≤ R. 
Let A ∈ S^{d×d} be the distance metric to be learned, where the distance between two data points x and x′ is calculated as |x − x′|_A^2 = (x − x′)^⊤ A (x − x′).\n\nFollowing the idea of maximum margin classifiers, we have the following framework for regularized distance metric learning:\n\nmin_A { (1/2)|A|_F^2 + (2C/(n(n−1))) Σ_{i<j} g(y_{i,j}[1 − |x_i − x_j|_A^2]) : A ⪰ 0, tr(A) ≤ η(d) }    (1)\n\nwhere\n\n• y_{i,j} is derived from the class labels y_i and y_j, i.e., y_{i,j} = 1 if y_i = y_j and −1 otherwise.\n• g(z) is the loss function. It outputs a small value when z is a large positive value, and a large value when z is a large negative value. We assume g(z) to be convex and Lipschitz continuous with Lipschitz constant L.\n• |A|_F^2 is the regularizer that measures the complexity of the distance metric A.\n• tr(A) ≤ η(d) is introduced to ensure a bounded domain for A. As will be revealed later, this constraint becomes active only when the constraint constant η(d) is sublinear in d, i.e., η(d) ~ O(d^p) with p < 1. We will also show how this constraint affects the generalization error of distance metric learning.\n\n3 Generalization Error\n\nLet A_D be the distance metric learned by the algorithm in (1) from the training examples D. Let I_D(A) denote the empirical loss, i.e.,\n\nI_D(A) = (2/(n(n−1))) Σ_{i<j} g(y_{i,j}[1 − |x_i − x_j|_A^2])    (2)\n\nFor the convenience of presentation, we also write g(y_{i,j}(1 − |x_i − x_j|_A^2)) = V(A, z_i, z_j) to highlight its dependence on A and the two examples z_i and z_j. 
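To make the objective concrete, the following sketch (plain NumPy) computes the Mahalanobis distance |x − x′|_A^2 and the empirical loss I_D(A) of (2). The hinge-type loss g(z) = max(0, 1 − z), i.e., margin b = 1, is our illustrative choice of a convex Lipschitz loss, and the function names are ours, not the paper's:

```python
import numpy as np

def mahalanobis_sq(A, x, xp):
    # |x - x'|_A^2 = (x - x')^T A (x - x')
    d = x - xp
    return float(d @ A @ d)

def empirical_loss(A, X, y, g=lambda z: max(0.0, 1.0 - z)):
    # I_D(A) = 2/(n(n-1)) * sum_{i<j} g( y_ij * [1 - |x_i - x_j|_A^2] )
    # where y_ij = +1 if the labels agree and -1 otherwise
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            y_ij = 1.0 if y[i] == y[j] else -1.0
            total += g(y_ij * (1.0 - mahalanobis_sq(A, X[i], X[j])))
    return 2.0 * total / (n * (n - 1))

# toy data: two nearby same-class points and one far point of another class
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 1])
loss = empirical_loss(np.eye(2), X, y)  # A = I recovers the Euclidean case
```

With A = I, only the same-class pair (at squared distance 1) incurs a hinge loss of 1, so I_D(I) = 2·1/(3·2) = 1/3; learning A would try to shrink that within-class distance while keeping the constraint A ⪰ 0.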
We denote by I(A) the loss of A over the true distribution, i.e.,\n\nI(A) = E_{(z_i, z_j)}[V(A, z_i, z_j)]    (3)\n\nGiven the empirical loss I_D(A) and the loss over the true distribution I(A), we define the estimation error as\n\nD_D = I(A_D) − I_D(A_D)    (4)\n\nTo characterize the behavior of the estimation error, we follow the analysis based on the stability of the algorithm [1]. The uniform stability of an algorithm measures how much its output changes when one of the training examples is replaced with another. More specifically, an algorithm A has uniform stability β if\n\n∀(D, z), ∀i,  sup_{u,v} |V(A_D, u, v) − V(A_{D^{i,z}}, u, v)| ≤ β    (5)\n\nwhere D^{i,z} stands for the new training set obtained by replacing z_i ∈ D with a new example z. We further write β = κ/n, as the uniform stability β behaves like O(1/n).\n\nThe advantage of using stability analysis for the generalization error of regularized distance metric learning is that it does not require independence assumptions: the example pairs (z_i, z_j) used for training distance metrics are not i.i.d. even though each z_i is, making it difficult to directly utilize the results from statistical learning theory.\n\nIn the analysis below, we first show how to derive the generalization error bound for regularized distance metric learning given the uniform stability β (or κ). We then derive the uniform stability constant for the regularized distance metric learning framework in (1).\n\n3.1 Generalization Error Bound for Given Uniform Stability\n\nAnalysis in this section follows closely [1], and we therefore omit the detailed proofs.\n\nOur analysis utilizes the McDiarmid inequality, which is stated as follows.\nTheorem 1. 
(McDiarmid Inequality) Given independent random variables {v_i}_{i=1}^l, v′_i, and a function F : v^l → R satisfying\n\nsup_{v_1, ..., v_l, v′_i} |F(v_1, . . . , v_l) − F(v_1, . . . , v_{i−1}, v′_i, v_{i+1}, . . . , v_l)| ≤ c_i,\n\nthe following statement holds:\n\nPr(|F(v_1, . . . , v_l) − E[F(v_1, . . . , v_l)]| > ε) ≤ 2 exp(−2ε^2 / Σ_{i=1}^l c_i^2)\n\nTo use the McDiarmid inequality, we first compute E(D_D).\nLemma 1. Given a distance metric learning algorithm A with uniform stability κ/n, we have the following inequality for E(D_D):\n\nE(D_D) ≤ 2κ/n    (6)\n\nwhere n is the number of training examples in D.\nThe result in the following lemma shows that the condition in the McDiarmid inequality holds.\nLemma 2. Let D be a collection of n randomly selected training examples, and D^{i,z} be the collection of examples that replaces z_i in D with example z. We have |D_D − D_{D^{i,z}}| bounded as follows:\n\n|D_D − D_{D^{i,z}}| ≤ (2κ + 8Lη(d) + 2g_0)/n    (7)\n\nwhere g_0 = sup_{z,z′} |V(0, z, z′)| measures the largest loss when the distance metric A is 0.\nCombining the results in Lemmas 1 and 2, we can now derive the bound for the generalization error by using the McDiarmid inequality.\nTheorem 2. Let D denote a collection of n randomly selected training examples, and A_D be the distance metric learned by the algorithm in (1) whose uniform stability is κ/n. With probability 1 − δ, we have the following bound for I(A_D):\n\nI(A_D) − I_D(A_D) ≤ 2κ/n + (2κ + 4Lη(d) + 2g_0) √(ln(2/δ)/(2n))\n\n3.2 Generalization Error for Regularized Distance Metric Learning\n\nFirst, we show that the supremum of tr(A_D) is O(d^{1/2}), which verifies that η(d) should be sublinear in d. This is summarized by the following proposition.\nProposition 1. 
The trace constraint in (1) will be activated only when\n\nη(d) ≤ √(2 d g_0 C)    (8)\n\nwhere g_0 = sup_{z,z′} |V(0, z, z′)|.\nProof. It follows directly from [tr(A_D)]^2 / d ≤ |A_D|_F^2 ≤ 2C sup_{z,z′} |V(0, z, z′)| = 2C g_0.    (9)\n\nTo bound the uniform stability, we need the following proposition.\nProposition 2. For any two distance metrics A and A′, the following inequality holds for any examples z_u and z_v:\n\n|V(A, z_u, z_v) − V(A′, z_u, z_v)| ≤ 4LR^2 |A − A′|_F    (10)\n\nThe above proposition follows directly from the facts that (a) V(A, z, z′) is Lipschitz continuous and (b) |x|_2 ≤ R for any example x. The following lemma bounds |A_D − A_{D′}|_F.\nLemma 3. Let D denote a collection of n randomly selected training examples, and z = (x, y) a randomly selected example. Let A_D be the distance metric learned by the algorithm in (1). We have\n\n|A_D − A_{D^{i,z}}|_F ≤ 8CLR^2 / n    (11)\n\nThe proof of the above lemma can be found in Appendix A.\n\nBy combining the results in Lemma 3 and Proposition 2, we have the following theorem for the stability of the Frobenius norm based regularizer.\nTheorem 3. The uniform stability for the algorithm in (1) using the Frobenius norm regularizer, denoted by β, is bounded as follows:\n\nβ = κ/n ≤ 32CL^2R^4 / n    (12)\n\nwhere κ = 32CL^2R^4.\n\nCombining Theorems 3 and 2, we have the following theorem for the generalization error of the distance metric learning algorithm in (1) using the Frobenius norm regularizer.\nTheorem 4. Let D be a collection of n randomly selected examples, and A_D be the distance metric learned by the algorithm in (1) with h(A) = |A|_F^2. 
With probability 1 − δ, we have the following bound for the true loss I(A_D), where A_D is learned from (1) using the Frobenius norm regularizer:\n\nI(A_D) − I_D(A_D) ≤ 32CL^2R^4/n + (32CL^2R^4 + 4L s(d) + 2g_0) √(ln(2/δ)/(2n))    (13)\n\nwhere s(d) = min(√(2 d g_0 C), η(d)).\n\nRemark The most important feature of the estimation error is that it converges in the order of O(s(d)/√n). By choosing η(d) to have a low dependence on d (i.e., η(d) ~ d^p with p ≪ 1), the proposed framework for regularized distance metric learning will be robust to high dimensional data. In the extreme case, by setting η(d) to be a constant, the estimation error becomes independent of the dimensionality of the data.\n\n4 Algorithm\n\nIn this section, we discuss an efficient algorithm for solving (1). We assume a hinge loss for g(z), i.e., g(z) = max(0, b − z), where b is the classification margin. To design an online learning algorithm for regularized distance metric learning, we follow the theory of gradient based online learning [2] by defining the potential function Φ(A) = |A|_F^2 / 2. Algorithm 1 shows the online learning algorithm.\n\nThe theorem below shows the regret bound for the online learning algorithm in Algorithm 1.\nTheorem 5. Let the online learning algorithm 1 run with learning rate λ > 0 on a sequence of pairs (x_t, x′_t) with labels y_t, t = 1, . . . , n. Assume |x|_2 ≤ R for all the training examples. 
Then, for all distance metrics M ∈ S_+^{d×d}, we have\n\nL̂_n ≤ (1/(1 − 8R^4λ/b)) (L_n(M) + (1/(2λ))|M|_F^2)\n\nwhere\n\nL_n(M) = Σ_{t=1}^n max(0, b − y_t(1 − |x_t − x′_t|_M^2)),    L̂_n = Σ_{t=1}^n max(0, b − y_t(1 − |x_t − x′_t|_{A_{t−1}}^2))\n\nAlgorithm 1 Online Learning Algorithm for Regularized Distance Metric Learning\n1: INPUT: predefined learning rate λ\n2: Initialize A_0 = 0\n3: for t = 1, . . . , T do\n4:   Receive a pair of training examples {(x^1_t, y^1_t), (x^2_t, y^2_t)}\n5:   Compute the class label y_t: y_t = +1 if y^1_t = y^2_t, and y_t = −1 otherwise.\n6:   if the training pair (x^1_t, x^2_t), y_t is classified correctly, i.e., y_t(1 − |x^1_t − x^2_t|_{A_{t−1}}^2) > 0, then\n7:     A_t = A_{t−1}.\n8:   else\n9:     A_t = π_{S_+}(A_{t−1} − λ y_t (x^1_t − x^2_t)(x^1_t − x^2_t)^⊤), where π_{S_+}(M) projects matrix M onto the SDP cone.\n10:  end if\n11: end for\n\nThe proof of this theorem can be found in Appendix B. Note that the above online learning algorithm requires computing π_{S_+}(M), i.e., projecting matrix M onto the SDP cone, which is expensive for high dimensional data. To address this challenge, first notice that M′ = π_{S_+}(M) is equivalent to the optimization problem M′ = arg min_{M′ ⪰ 0} |M′ − M|_F. We thus approximate A_t = π_{S_+}(A_{t−1} − λ y_t (x_t − x′_t)(x_t − x′_t)^⊤) with A_t = A_{t−1} − λ_t y_t (x_t − x′_t)(x_t − x′_t)^⊤, where λ_t is computed as follows:\n\nλ_t = arg min_{λ_t} { |λ_t − λ| : λ_t ∈ [0, λ], A_{t−1} − λ_t y_t (x_t − x′_t)(x_t − x′_t)^⊤ ⪰ 0 }    (14)\n\nThe following theorem shows the solution to the above optimization problem.\nTheorem 6. 
The optimal solution λ_t to the problem in (14) is expressed as\n\nλ_t = λ if y_t = −1;    λ_t = min(λ, [(x_t − x′_t)^⊤ A_{t−1}^{−1} (x_t − x′_t)]^{−1}) if y_t = +1\n\nThe proof of this theorem can be found in the supplementary materials. Finally, the quantity (x_t − x′_t)^⊤ A_{t−1}^{−1} (x_t − x′_t) can be computed by solving the optimization problem\n\nmax_u 2u^⊤(x_t − x′_t) − u^⊤ A_{t−1} u\n\nwhose optimal value can be computed efficiently using the conjugate gradient method [11].\n\nNote that compared to the online metric learning algorithm in [9], the proposed online learning algorithm for metric learning is advantageous in that (i) it is computationally more efficient by avoiding projecting a matrix onto the SDP cone, and (ii) it has a provable regret bound, while [9] only presents a mistake bound for separable datasets.\n\n5 Experiments\n\nWe conducted an extensive study to verify both the efficiency and the efficacy of the proposed algorithms for metric learning. For the convenience of discussion, we refer to the proposed online distance metric learning algorithm as online-reg. To examine the efficacy of the learned distance metric, we employed the k Nearest Neighbor (k-NN) classifier. Our hypothesis is that the better the distance metric is, the higher the classification accuracy of k-NN will be. 
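As a concrete illustration of this evaluation protocol, the sketch below (NumPy only; the helper names are ours) classifies test points with a k-NN majority vote under a learned Mahalanobis metric A. With A = I it reduces to Euclidean k-NN:

```python
import numpy as np

def mahalanobis_dists(A, X_test, X_train):
    # Pairwise squared distances d_A(x, x') = (x - x')^T A (x - x')
    diff = X_test[:, None, :] - X_train[None, :, :]      # (n_test, n_train, d)
    return np.einsum('ijk,kl,ijl->ij', diff, A, diff)

def knn_predict(A, X_train, y_train, X_test, k=3):
    # k-NN vote under the learned metric A (k = 3 as in the experiments)
    d = mahalanobis_dists(A, X_test, X_train)
    nn = np.argsort(d, axis=1)[:, :k]                    # k nearest neighbors
    votes = y_train[nn]
    # majority vote per test point
    return np.array([np.bincount(v).argmax() for v in votes])

# toy usage: two well-separated classes, A = I (Euclidean special case)
X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0., 0.5], [5., 5.5]])
pred = knn_predict(np.eye(2), X_train, y_train, X_test, k=3)  # -> [0, 1]
```

Replacing `np.eye(2)` with the PSD matrix produced by the online algorithm gives the k-NN accuracy used as the evaluation criterion throughout the experiments.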
We set k = 3 for k-NN in all the experiments, based on our experience.\n\nWe compare our algorithm to the following six state-of-the-art distance metric learning algorithms as baselines: (1) the Euclidean distance metric; (2) the Mahalanobis distance metric, which is computed as the inverse of the covariance matrix of the training samples, i.e., (Σ_{i=1}^n x_i x_i^⊤)^{−1}; (3) Xing's algorithm proposed in [17]; (4) LMNN, a distance metric learning algorithm based on the large margin nearest neighbor classifier [15]; (5) ITML, an information-theoretic metric learning approach based on [4]; and (6) Relevant Component Analysis (RCA) [10]. We set the maximum number of iterations for Xing's method to 10,000. The number of target neighbors in LMNN and the parameter γ in ITML were tuned by cross validation over the range from 10^−4 to 10^4. All the algorithms are implemented and run in Matlab. All the experiments are run on an AMD 2.8GHz processor machine with 8GB RAM, running Linux.\n\nTable 1: Classification error (%) of a k-NN (k = 3) classifier on the nine UCI data sets using seven different metrics. Standard deviation is included.\n\nDataset | Euclidean | Mahala | Xing | LMNN | ITML | RCA | Online-reg\n1 | 19.5 ± 2.2 | 18.8 ± 2.5 | 29.3 ± 17.2 | 13.8 ± 2.5 | 8.6 ± 1.7 | 17.4 ± 1.5 | 13.2 ± 2.2\n2 | 39.9 ± 2.3 | 6.7 ± 0.6 | 40.1 ± 2.6 | 3.6 ± 1.1 | 40.0 ± 2.3 | 3.8 ± 0.4 | 3.7 ± 1.2\n3 | 36.0 ± 2.0 | 42.1 ± 4.0 | 43.5 ± 12.5 | 33.1 ± 0.6 | 39.8 ± 3.3 | 41.6 ± 0.7 | 37.3 ± 4.1\n4 | 4.0 ± 1.7 | 10.4 ± 2.7 | 3.1 ± 2.0 | 3.9 ± 1.6 | 3.2 ± 1.6 | 2.9 ± 1.5 | 3.2 ± 1.3\n5 | 30.6 ± 1.9 | 29.1 ± 2.1 | 30.6 ± 1.9 | 29.6 ± 1.8 | 28.8 ± 2.1 | 28.6 ± 2.3 | 27.7 ± 1.3\n6 | 25.4 ± 4.2 | 18.4 ± 3.4 | 23.3 ± 3.4 | 15.2 ± 3.1 | 17.1 ± 4.1 | 13.9 ± 2.2 | 12.9 ± 2.2\n7 | 31.9 ± 2.8 | 10.0 ± 2.8 | 24.6 ± 7.5 | 4.5 ± 2.4 | 28.7 ± 3.7 | 1.8 ± 1.5 | 1.8 ± 1.1\n8 | 18.9 ± 0.5 | 37.3 ± 0.5 | 16.1 ± 0.6 | 18.4 ± 0.4 | 23.3 ± 1.3 | 30.6 ± 0.7 | 19.8 ± 0.6\n9 | 2.0 ± 0.4 | 6.1 ± 0.5 | 12.4 ± 0.8 | 1.6 ± 0.3 | 2.5 ± 0.4 | 2.8 ± 0.4 | 2.9 ± 0.4\n\nTable 2: p-values of the Wilcoxon signed-rank test of the 7 methods on the 9 datasets.\n\nMethods | Euclidean | Mahala | Xing | LMNN | ITML | RCA | Online-reg\nEuclidean | 1.000 | 0.734 | 0.641 | 0.004 | 0.496 | 0.301 | 0.129\nMahala | 0.734 | 1.000 | 0.301 | 0.008 | 0.570 | 0.004 | 0.004\nXing | 0.641 | 0.301 | 1.000 | 0.027 | 0.359 | 0.074 | 0.027\nLMNN | 0.004 | 0.008 | 0.027 | 1.000 | 0.129 | 0.496 | 0.734\nITML | 0.496 | 0.570 | 0.359 | 0.129 | 1.000 | 0.820 | 0.164\nRCA | 0.301 | 0.004 | 0.074 | 0.496 | 0.820 | 1.000 | 0.074\nOnline-reg | 0.129 | 0.004 | 0.027 | 0.734 | 0.164 | 0.074 | 1.000\n\n5.1 Experiment (I): Comparison to State-of-the-art Algorithms\n\nWe conducted experiments on data classification over the following nine datasets from the UCI repository: (1) balance-scale, with 3 classes, 4 features, and 625 instances; (2) breast-cancer, with 2 classes, 10 features, and 683 instances; (3) glass, with 6 classes, 9 features, and 214 instances; (4) iris, with 3 classes, 4 features, and 150 instances; (5) pima, with 2 classes, 8 features, and 768 instances; (6) segmentation, with 7 classes, 19 features, and 210 instances; (7) wine, with 3 classes, 13 features, and 178 instances; (8) waveform, with 3 classes, 21 features, and 5000 instances; (9) optdigits, with 10 classes, 64 features, and 3823 instances. For each dataset, we randomly select 50% of the samples for training, and use the remaining samples for testing. Table 1 shows the classification errors of all the metric learning methods over the 9 datasets averaged over 10 runs, together with the standard deviation. 
We observe that the proposed metric learning algorithm delivers performance comparable to the state-of-the-art methods. In particular, for almost all datasets, the classification accuracy of the proposed algorithm is close to that of LMNN, which yielded overall the best performance among the six baseline algorithms. This is consistent with the results of other studies, which show that LMNN is among the most effective algorithms for distance metric learning.\n\nTo further verify whether the proposed method performs statistically better than the baseline methods, we conduct a statistical test using the Wilcoxon signed-rank test [3]. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for the comparison of two related samples. It is known to be safer than the Student's t-test because it does not assume normal distributions. From Table 2, we find that regularized distance metric learning improves the classification accuracy significantly compared to the Mahalanobis distance, Xing's method, and RCA at significance level 0.1. 
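For reference, one such pairwise comparison can be reproduced along the following lines. This is a from-scratch sketch of the two-sided Wilcoxon signed-rank test using the normal approximation (in practice one would call `scipy.stats.wilcoxon`), applied to the per-dataset errors of online-reg and RCA from Table 1; because of the approximation and the handling of ties and zero differences, the p-value only roughly matches the exact values reported in Table 2:

```python
import numpy as np
from math import erf, sqrt

def wilcoxon_signed_rank(a, b):
    # Two-sided Wilcoxon signed-rank test, normal approximation.
    # Differences are rounded to avoid spurious float noise in ties;
    # zero differences are discarded (Wilcoxon's original treatment).
    d = np.round(np.asarray(a, float) - np.asarray(b, float), 6)
    d = d[d != 0]
    n = len(d)
    order = np.argsort(np.abs(d))
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    for v in np.unique(np.abs(d)):            # average ranks over ties
        mask = np.abs(d) == v
        ranks[mask] = ranks[mask].mean()
    w_plus = ranks[d > 0].sum()               # sum of positive-signed ranks
    mean = n * (n + 1) / 4.0
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return w_plus, p

# per-dataset classification errors (%) of online-reg and RCA from Table 1
online = [13.2, 3.7, 37.3, 3.2, 27.7, 12.9, 1.8, 19.8, 2.9]
rca    = [17.4, 3.8, 41.6, 2.9, 28.6, 13.9, 1.8, 30.6, 2.8]
w, p = wilcoxon_signed_rank(online, rca)      # p falls below the 0.1 level
```

The one tied pair (1.8 vs 1.8) is dropped, leaving eight signed differences; the resulting p-value is below 0.1, in line with the 0.074 reported for this pair of methods in Table 2.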
It performs slightly better than ITML and is comparable to LMNN.\n\nFigure 1: (a) Face recognition accuracy of kNN and (b) running time of the LMNN, ITML, RCA and online-reg algorithms on the "att-face" dataset with varying image sizes.\n\n5.2 Experiment (II): Results for High Dimensional Data\n\nTo evaluate the dependence of the regularized metric learning algorithm on data dimensionality, we tested it on the task of face recognition. The AT&T face database 1 is used in our study. It consists of grey-scale images of faces from 40 distinct subjects, with ten pictures per subject. For every subject, the images were taken at different times, with varying lighting conditions, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). The original size of each image is 112 × 92 pixels, with 256 grey levels per pixel.\n\nTo examine the sensitivity to data dimensionality, we vary the data dimension (i.e., the size of the images) by compressing the original images into different sizes with the image aspect ratio preserved. The image compression is achieved by bicubic interpolation (the output pixel value is a weighted average of the pixels in the nearest 4-by-4 neighborhood). For each subject, we randomly split its face images into a training set and a test set with ratio 4 : 6. 
A distance metric is learned from the collection of training face images, and is used by the kNN classifier (k = 3) to predict the subject ID of the test images. We conduct each experiment 10 times, and report the classification accuracy averaged over 40 subjects and 10 runs. Figure 1 (a) shows the average classification accuracy of the kNN classifier using the different distance metric learning algorithms. The running times of the different metric learning algorithms on the same dataset are shown in Figure 1 (b). Note that we exclude Xing's method from the comparison because of its extremely long computational time. We observed that with increasing image size (dimensionality), the regularized distance metric learning algorithm yields stable performance, indicating that it is resilient to high dimensional data. In contrast, for almost all the baseline methods except ITML, performance varied significantly as the size of the input image changed. Although ITML yields stable performance with respect to different image sizes, its high computational cost (Figure 1), arising from solving a Bregman optimization problem in each iteration, makes it unsuitable for high-dimensional data.\n\n6 Conclusion\n\nIn this paper, we analyze the generalization error of regularized distance metric learning. We show that with appropriate constraints, regularized distance metric learning can be robust to high dimensional data. We also present efficient learning algorithms for solving the related optimization problems. Empirical studies with face recognition and data classification show that the proposed approach is (i) robust and efficient for high dimensional data, and (ii) comparable to the state-of-the-art approaches for distance metric learning. 
In the future, we plan to investigate different regularizers and their effect on distance metric learning.\n\n1 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html\n\nACKNOWLEDGEMENTS\n\nThe work was supported in part by the National Science Foundation (IIS-0643494) and the U.S. Army Research Laboratory and the U.S. Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and ARO.\n\nAppendix A: Proof of Lemma 3\n\nProof. We introduce the Bregman divergence for the proof of this lemma. Given a convex function of matrices φ(X), the Bregman divergence between two matrices A and B is computed as follows:\n\nd_φ(A, B) = φ(B) − φ(A) − tr(∇φ(A)^⊤ (B − A))\n\nWe define convex functions N(X) and V_D(X) as follows:\n\nN(X) = |X|_F^2,    V_D(X) = (2/(n(n−1))) Σ_{i<j} V(X, z_i, z_j)\n\nand furthermore the convex function T_D(X) = N(X) + C V_D(X). We thus have\n\nd_N(A_D, A_{D^{i,z}}) + d_N(A_{D^{i,z}}, A_D) ≤ d_{T_D}(A_D, A_{D^{i,z}}) + d_{T_{D^{i,z}}}(A_{D^{i,z}}, A_D)\n = (C/(n(n−1))) Σ_{j≠i} [V(A_{D^{i,z}}, z_i, z_j) − V(A_{D^{i,z}}, z, z_j) + V(A_D, z, z_j) − V(A_D, z_i, z_j)]\n ≤ (8CLR^2/n) |A_D − A_{D^{i,z}}|_F\n\nThe first inequality follows from the fact that both N(X) and V_D(X) are convex in X. The second step holds because the matrices A_D and A_{D^{i,z}} minimize the objective functions T_D(X) and T_{D^{i,z}}(X), respectively, and therefore\n\ntr((A_{D^{i,z}} − A_D)^⊤ ∇T_D(A_D)) ≥ 0,    tr((A_D − A_{D^{i,z}})^⊤ ∇T_{D^{i,z}}(A_{D^{i,z}})) ≥ 0\n\nSince d_N(A, B) = |A − B|_F^2, we therefore have\n\n|A_D − A_{D^{i,z}}|_F^2 ≤ (8CLR^2/n) |A_D − A_{D^{i,z}}|_F,\n\nwhich leads to the result in the lemma.\n\nAppendix B: Proof of Theorem 5\n\nProof. 
We denote by A′_t = A_{t−1} − λ y_t (x_t − x′_t)(x_t − x′_t)^⊤ and A_t = π_{S_+}(A′_t). Following Theorem 11.1 and Theorem 11.4 of [2], we have\n\nL̂_n − L_n(M) ≤ (1/λ) D_{Φ*}(M, A_0) + (1/λ) Σ_{t=1}^n D_{Φ*}(A_{t−1}, A′_t)\n\nwhere\n\nD_{Φ*}(A, B) = (1/2)|A − B|_F^2,    Φ(A) = Φ*(A) = (1/2)|A|_F^2\n\nUsing the relation A′_t = A_{t−1} − λ y_t (x_t − x′_t)(x_t − x′_t)^⊤ and A_0 = 0, we have\n\nL̂_n − L_n(M) ≤ (1/(2λ))|M|_F^2 + (λ/2) Σ_{t=1}^n I[y_t(1 − |x_t − x′_t|_{A_{t−1}}^2) < 0] |x_t − x′_t|_2^4\n\nBy assuming |x|_2 ≤ R for any training example, we have |x_t − x′_t|_2^4 ≤ 16R^4. Since\n\nΣ_{t=1}^n I[y_t(1 − |x_t − x′_t|_{A_{t−1}}^2) < 0] |x_t − x′_t|_2^4 ≤ (16R^4/b) Σ_{t=1}^n max(0, b − y_t(1 − |x_t − x′_t|_{A_{t−1}}^2)) = (16R^4/b) L̂_n,\n\nwe thus have the result in the theorem.\n\nReferences\n\n[1] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.\n\n[2] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.\n\n[3] G. W. Corder and D. I. Foreman. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley, New Jersey, 2009.\n\n[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n\n[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, 2005.\n\n[6] Steven C.H. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval. 
In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.\n\n[7] Steven C.H. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning distance metrics with contextual constraints for image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.\n\n[8] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 2003.\n\n[9] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 94–101, 2004.\n\n[10] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of the Seventh European Conference on Computer Vision, volume 4, pages 776–792, 2002.\n\n[11] Jonathan R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.\n\n[12] Luo Si, Rong Jin, Steven C. H. Hoi, and Michael R. Lyu. Collaborative image retrieval via regularized metric learning. ACM Multimedia Systems Journal (MMSJ), 2006.\n\n[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.\n\n[14] I. W. Tsang, P. M. Cheung, and J. T. Kwok. Kernel relevant component analysis for distance metric learning. In IEEE International Joint Conference on Neural Networks (IJCNN), 2005.\n\n[15] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, 2005.\n\n[16] Lei Wu, Steven C.H. Hoi, Rong Jin, Jianke Zhu, and Nenghai Yu. Distance metric learning from uncertain side information with application to automated photo tagging. 
In Proceedings of ACM International Conference on Multimedia (MM), 2009.\n\n[17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, 2002.\n\n[18] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Michigan State University, Tech. Rep., 2006.\n\n[19] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), 2006.\n", "award": [], "sourceid": 326, "authors": [{"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Shijun", "family_name": "Wang", "institution": null}, {"given_name": "Yang", "family_name": "Zhou", "institution": null}]}