{"title": "Sample Complexity of Learning Mahalanobis Distance Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 2584, "page_last": 2592, "abstract": "Metric learning seeks a transformation of the feature space that enhances prediction quality for a given task. In this work we provide PAC-style sample complexity rates for supervised metric learning. We give matching lower- and upper-bounds showing that sample complexity scales with the representation dimension when no assumptions are made about the underlying data distribution. In addition, by leveraging the structure of the data distribution, we provide rates fine-tuned to a specific notion of the intrinsic complexity of a given dataset, allowing us to relax the dependence on representation dimension. We show both theoretically and empirically that augmenting the metric learning optimization criterion with a simple norm-based regularization is important and can help adapt to a dataset\u2019s intrinsic complexity yielding better generalization, thus partly explaining the empirical success of similar regularizations reported in previous works.", "full_text": "Sample Complexity of Learning Mahalanobis\n\nDistance Metrics\n\nNakul Verma\n\nJanelia Research Campus, HHMI\nverman@janelia.hhmi.org\n\nKristin Branson\n\nJanelia Research Campus, HHMI\n\nbransonk@janelia.hhmi.org\n\nAbstract\n\nMetric learning seeks a transformation of the feature space that enhances predic-\ntion quality for a given task. In this work we provide PAC-style sample complexity\nrates for supervised metric learning. We give matching lower- and upper-bounds\nshowing that sample complexity scales with the representation dimension when\nno assumptions are made about the underlying data distribution. 
In addition, by\nleveraging the structure of the data distribution, we provide rates \ufb01ne-tuned to a\nspeci\ufb01c notion of the intrinsic complexity of a given dataset, allowing us to relax\nthe dependence on representation dimension. We show both theoretically and em-\npirically that augmenting the metric learning optimization criterion with a simple\nnorm-based regularization is important and can help adapt to a dataset\u2019s intrin-\nsic complexity yielding better generalization, thus partly explaining the empirical\nsuccess of similar regularizations reported in previous works.\n\nIntroduction\n\n1\nIn many machine learning tasks, data is represented in a high-dimensional Euclidean space. The\nL2 distance in this space is then used to compare observations in methods such as clustering and\nnearest-neighbor classi\ufb01cation. Often, this distance is not ideal for the task at hand. For example,\nthe presence of uninformative or mutually correlated measurements arbitrarily in\ufb02ates the distances\nbetween pairs of observations. Metric learning has emerged as a powerful technique to learn a\nmetric in the representation space that emphasizes feature combinations that improve prediction\nwhile suppressing spurious measurements. This has been done by exploiting class labels [1, 2] or\nother forms of supervision [3] to \ufb01nd a Mahalanobis distance metric that respects these annotations.\nDespite the popularity of metric learning methods, few works have studied how problem complexity\nscales with key attributes of the dataset. In particular, how do we expect generalization error to\nscale\u2014both theoretically and practically\u2014as one varies the number of informative and uninforma-\ntive measurements, or changes the noise levels? In this work, we develop two general frameworks\nfor PAC-style analysis of supervised metric learning. The distance-based metric learning frame-\nwork uses class label information to derive distance constraints. 
The objective is to learn a metric\nthat yields smaller distances between examples from the same class than those from different classes.\nAlgorithms that optimize such distance-based objectives include Mahalanobis Metric for Clustering\n(MMC) [4], Large Margin Nearest Neighbor (LMNN) [1] and Information Theoretic Metric Learn-\ning (ITML) [2]. Instead of using distance comparisons as a proxy, however, one can also optimize\nfor a speci\ufb01c prediction task directly. The second framework, the classi\ufb01er-based metric learning\nframework, explicitly incorporates the hypotheses associated with the prediction task to learn effec-\ntive distance metrics. Examples in this regime include [5] and [6].\n\n1\n\n\fOur analysis shows that in both frameworks, the sample complexity scales with a dataset\u2019s represen-\ntation dimension (Theorems 1 and 3), and this dependence is necessary in the absence of assump-\ntions about the underlying data distribution (Theorems 2 and 4). By considering any Lipschitz loss,\nour results improve upon previous sample complexity results (see Section 6) and, for the \ufb01rst time,\nprovide matching lower bounds.\nIn light of our observation that data measurements often include uninformative or weakly informa-\ntive features, we expect a metric that yields good generalization performance to de-emphasize such\nfeatures and accentuate the relevant ones. We thus formalize the metric learning complexity of a\ngiven dataset in terms of the intrinsic complexity d of the optimal metric. For Mahalanobis metrics,\nwe characterize intrinsic complexity by the norm of the matrix representation of the metric. 
We\nre\ufb01ne our sample complexity results and show a dataset-dependent bound for both frameworks that\nrelaxes the dependence on representation dimension and instead scales with the dataset\u2019s intrinsic\nmetric learning complexity d (Theorem 7).\nBased on our dataset-dependent result, we propose a simple variation on the empirical risk min-\nimizing (ERM) algorithm that returns a metric (of complexity d) that jointly minimizes the ob-\nserved sample bias and the expected intra-class variance for metrics of \ufb01xed complexity d. This\nbias-variance balancing criterion can be viewed as a structural risk minimizing algorithm that pro-\nvides better generalization performance than an ERM algorithm and justi\ufb01es norm-regularization\nof weighting metrics in the optimization criteria for metric learning, partly explaining empirical\nsuccess of similar objectives [7, 8]. We experimentally validate how the basic principle of norm-\nregularization can help enhance the prediction quality even for existing metric learning algorithms\non benchmark datasets (Section 5). Our experiments highlight that norm-regularization indeed helps\nlearn weighting metrics that better adapt to the signal in data in high-noise regimes.\n\n2 Preliminaries\n\nIn this section, we de\ufb01ne our notation, and explicitly de\ufb01ne the distance-based and classi\ufb01er-based\nlearning frameworks. Given a D-dimensional representation space X = RD, we want to learn a\nweighting, or a metric1 M\u2217 on X that minimizes some notion of error on data drawn from a \ufb01xed\nunknown distribution D on X \u00d7 {0, 1}:\n\nM\u2217 := argminM\u2208M err(M,D),\n\nwhere M is the class of weighting metrics M := {M | M \u2208 RD\u00d7D, \u03c3max(M ) = 1} (we constrain\nthe maximum singular value \u03c3max to remove arbitrary scalings). 
For supervised metric learning,\nthis error is typically label-based and can be defined in two intuitive ways.\n\nFootnote 1: Note that we are looking at the linear form of the metric M; usually the corresponding quadratic form\nM^T M is discussed in the literature, which is necessarily positive semi-definite.\n\nThe distance-based framework prefers metrics M that bring data from the same class closer\ntogether than those from opposite classes. The corresponding distance-based error then measures how\nthe distances amongst data violate class labels:\n\nerr^\u03bb_dist(M, D) := E_{(x_1,y_1),(x_2,y_2)\u223cD} [ \u03c6^\u03bb( \u03c1_M(x_1, x_2), Y ) ],\n\nwhere \u03c6^\u03bb(\u03c1_M, Y) is a generic distance-based loss function that computes the degree of violation\nbetween weighted distance \u03c1_M(x_1, x_2) := \u2016M(x_1 \u2212 x_2)\u2016^2 and the label agreement Y := 1[y_1 = y_2]\nand penalizes it by factor \u03bb. For example, \u03c6 could penalize intra-class distances that are more than\nsome upper limit U and inter-class distances that are less than some lower limit L > U:\n\n\u03c6^\u03bb_{L,U}(\u03c1_M, Y) := min{1, \u03bb[\u03c1_M \u2212 U]_+} if Y = 1, and min{1, \u03bb[L \u2212 \u03c1_M]_+} otherwise,    (1)\n\nwhere [A]_+ := max{0, A}. MMC optimizes an efficiently computable variant of Eq. 
(1) by constraining\nthe aggregate intra-class distances while maximizing the aggregate inter-class distances.\nITML explicitly includes the upper and lower limits with an added regularization on the learned M\nto be close to a pre-specified metric of interest M_0.\nWhile we will discuss loss functions \u03c6 that handle distances between pairs of observations, it is easy\nto extend to relative distances among triplets:\n\n\u03c6^\u03bb_triple(\u03c1_M(x_1, x_2), \u03c1_M(x_1, x_3), (y_1, y_2, y_3)) := min{1, \u03bb[\u03c1_M(x_1, x_2) \u2212 \u03c1_M(x_1, x_3)]_+} if y_1 = y_2 \u2260 y_3, and 0 otherwise.\n\nLMNN is a popular variant, in which instead of looking at all triplets, it focuses on triplets in local\nneighborhoods, improving the quality of local distance comparisons.\n\nThe classifier-based framework prefers metrics M that directly improve the prediction quality for\na downstream task. Let H represent a real-valued hypothesis class associated with the prediction\ntask of interest (each h \u2208 H : X \u2192 [0, 1]), then the corresponding classifier-based error becomes:\n\nerr_hypoth(M, D) := inf_{h\u2208H} E_{(x,y)\u223cD} [ 1[ |h(M x) \u2212 y| \u2265 1/2 ] ].\n\nExample classifier-based methods include [5], which minimizes ranking errors for information\nretrieval, and [6], which incorporates network topology constraints for predicting network connectivity\nstructure.\n\n3 Metric Learning Sample Complexity: General Case\nIn any practical setting, we estimate the ideal weighting metric M\u2217 by minimizing the empirical\nversion of the error criterion from a finite size sample from D. Let S_m denote a sample of size\nm, and err(M, S_m) denote the corresponding empirical error. 
We can then define the empirical\nrisk minimizing metric based on m samples as M\u2217_m := argmin_M err(M, S_m), and compare its\ngeneralization performance to that of the theoretically optimal M\u2217, that is,\n\nerr(M\u2217_m, D) \u2212 err(M\u2217, D).    (2)\n\nDistance-Based Error Analysis. Given an i.i.d. sequence of observations z_1, z_2, . . . from\nD, z_i = (x_i, y_i), we can pair the observations together to form a paired sample (see footnote 2) S^pair_m =\n{(z_1, z_2), . . . , (z_{2m\u22121}, z_{2m})} = {(z_{1,i}, z_{2,i})}_{i=1}^m of size m, and define the sample-based distance\nerror induced by a metric M as\n\nerr^\u03bb_dist(M, S^pair_m) := (1/m) \u2211_{i=1}^m \u03c6^\u03bb( \u03c1_M(x_{1,i}, x_{2,i}), 1[y_{1,i} = y_{2,i}] ).\n\nThen for any B-bounded-support distribution D (that is, each (x, y) \u223c D, \u2016x\u2016 \u2264 B), we have the\nfollowing (see footnotes 3 and 4).\nTheorem 1 Let \u03c6^\u03bb be a distance-based loss function that is \u03bb-Lipschitz in the first argument. Then\nwith probability at least 1 \u2212 \u03b4 over an i.i.d. draw of 2m samples from an unknown B-bounded-support\ndistribution D paired as S^pair_m, we have\n\nsup_{M\u2208M} [ err^\u03bb_dist(M, D) \u2212 err^\u03bb_dist(M, S^pair_m) ] \u2264 O( \u03bbB^2 \u221a(D ln(1/\u03b4)/m) ).\n\nFootnote 2: While we pair 2m samples into m independent pairs, it is common to consider all O(m^2) possibly dependent\npairs. By exploiting independence we provide a simpler analysis yielding O(m^{\u22121/2}) sample complexity\nrates, which is similar to the dependent case.\n\nFootnote 3: We only present the results for paired comparisons; the results are easily extended to triplet comparisons.\nFootnote 4: All the supporting proofs are provided in Appendix A.\n\nThis implies a bound on our key quantity of interest, Eq. (2). 
To achieve estimation error rate \u03b5,\nm = \u2126((\u03bbB^2/\u03b5)^2 D ln(1/\u03b4)) samples are sufficient, showing that one never needs more than a\nnumber proportional to D examples to achieve the desired level of accuracy with high probability.\nSince many applications involve high-dimensional data, we next study if such a strong dependency\non D is necessary. It turns out that even for simple distance-based loss functions like \u03c6^\u03bb_{L,U} (cf. Eq.\n1), there are data distributions for which one cannot ensure good estimation error with fewer than\nlinear in D samples.\nTheorem 2 Let A be any algorithm that, given an i.i.d. sample S_m (of size m) from a fixed unknown\nbounded support distribution D, returns a weighting metric from M that minimizes the empirical\nerror with respect to distance-based loss function \u03c6^\u03bb_{L,U}. There exist \u03bb \u2265 0, 0 \u2264 U < L (indep. of\nD), s.t. for all 0 < \u03b5, \u03b4 < 1/64, there exists a bounded support distribution D, such that if m \u2264 (D+1)/(512\u03b5^2),\n\nP_{S_m}[ err^\u03bb_dist(A(S_m), D) \u2212 err^\u03bb_dist(M\u2217, D) > \u03b5 ] > \u03b4.\n\nWhile this strong dependence on D may seem discouraging, note that here we made no assumptions\nabout the underlying structure of the data distribution. One may be able to achieve a more\nrelaxed dependence on D in settings in which individual features contain varying amounts of useful\ninformation. This is explored in Section 4.\nClassifier-Based Error Analysis. In this setting, we consider an i.i.d. set of observations z_1, z_2, . . .\nfrom D to obtain the unpaired sample S_m = {z_i}_{i=1}^m of size m. To analyze the generalization ability\nof weighting metrics optimized w.r.t. underlying real-valued hypothesis class H, we must measure\nthe classification complexity of H. 
The scale-sensitive version of VC-dimension, the fat-shattering\ndimension, of a hypothesis class (denoted Fat_\u03b3(H)) encodes the right notion of classification\ncomplexity and provides a way to relate generalization error to the empirical error at a margin \u03b3 [9].\nIn the context of metric learning with respect to a fixed hypothesis class, define the empirical error\nat a margin \u03b3 as\n\nerr^\u03b3_hypoth(M, S_m) := inf_{h\u2208H} (1/m) \u2211_{(x_i,y_i)\u2208S_m} 1[Margin(h(M x_i), y_i) \u2264 \u03b3],\n\nwhere Margin(\u0177, y) := (2y \u2212 1)(\u0177 \u2212 1/2).\nTheorem 3 Let H be a \u03bb-Lipschitz base hypothesis class. Pick any 0 < \u03b3 \u2264 1/2, and let m \u2265\nFat_{\u03b3/16}(H) \u2265 1. Then with probability at least 1 \u2212 \u03b4 over an i.i.d. draw of m samples S_m from an\nunknown B-bounded-support distribution D (with \u03b5_0 := min{\u03b3/2, 1/(2\u03bbB)}),\n\nsup_{M\u2208M} [ err_hypoth(M, D) \u2212 err^\u03b3_hypoth(M, S_m) ] \u2264 O( \u221a( (1/m) ln(1/\u03b4) + (D^2/m) ln(D/\u03b5_0) + (Fat_{\u03b3/16}(H)/m) ln(m/\u03b3) ) ).\n\nAs before, this implies a bound on Eq. (2). To achieve estimation error rate \u03b5, m =\n\u2126((D^2 ln(\u03bbDB/\u03b3) + Fat_{\u03b3/16}(H) ln(1/(\u03b4\u03b3)))/\u03b5^2) samples suffice. Note that the task of finding an\noptimal metric only additively increases sample complexity over that of finding the optimal hypothesis\nfrom the underlying hypothesis class. In contrast to the distance-based framework (Theorem 1),\nhere we get a quadratic dependence on D. The following shows that a strong dependence on D is\nnecessary in the absence of assumptions on the data distribution and base hypothesis class.\nTheorem 4 Pick any 0 < \u03b3 < 1/8. 
Let H be a base hypothesis class of \u03bb-Lipschitz functions that is\nclosed under addition of constants (i.e., h \u2208 H \u21d2 h\u2032 \u2208 H, where h\u2032 : x \u21a6 h(x) + c, for all c)\ns.t. each h \u2208 H maps into the interval [1/2 \u2212 4\u03b3, 1/2 + 4\u03b3] after applying an appropriate threshold.\nThen for any metric learning algorithm A, and for any B \u2265 1, there exists \u03bb \u2265 0, for all 0 < \u03b5, \u03b4 <\n1/64, there exists a B-bounded-support distribution D s.t. if m ln^2 m < O( (D^2 + d) / (\u03b5^2 ln(1/\u03b3^2)) ), then\n\nP_{S_m\u223cD}[ err_hypoth(M\u2217, D) > err^\u03b3_hypoth(A(S_m), D) + \u03b5 ] > \u03b4,\n\nwhere d := Fat_{768\u03b3}(H) is the fat-shattering dimension of H at margin 768\u03b3.\n\n4 Sample Complexity for Data with Un- and Weakly Informative Features\n\nWe introduce the concept of the metric learning complexity of a given dataset. Our key observation\nis that a metric that yields good generalization performance should emphasize relevant features\nwhile suppressing the contribution of spurious features. Thus, a good metric reflects the quality of\nindividual feature measurements of data and their relative value for the learning task. We can leverage\nthis and define the metric learning complexity of a given dataset as the intrinsic complexity d\nof the weighting metric that yields the best generalization performance for that dataset (if multiple\nmetrics yield best performance, we select the one with minimum d). A natural way to characterize\nthe intrinsic complexity of a weighting metric M is via the norm of the matrix M. Using metric\nlearning complexity as our gauge for feature-set richness, we now refine our analysis in both canonical\nframeworks. 
We will first analyze sample complexity for norm-bounded metrics, then show how\nto automatically adapt to the intrinsic complexity of the unknown underlying data distribution.\n\n4.1 Distance-Based Refinement\n\nWe start with the following refinement of the distance-based metric learning sample complexity for\na class of Frobenius norm-bounded weighting metrics.\nLemma 5 Let M be any class of weighting metrics on the feature space X = R^D, and define\nd := sup_{M\u2208M} \u2016M^T M\u2016_F^2. Let \u03c6^\u03bb be any distance-based loss function that is \u03bb-Lipschitz in the first\nargument. Then with probability at least 1 \u2212 \u03b4 over an i.i.d. draw of 2m samples from an unknown\nB-bounded-support distribution D paired as S^pair_m, we have\n\nsup_{M\u2208M} [ err^\u03bb_dist(M, D) \u2212 err^\u03bb_dist(M, S^pair_m) ] \u2264 O( \u03bbB^2 \u221a(d ln(1/\u03b4)/m) ).\n\nObserve that if our dataset has a low metric learning complexity d \u226a D, then considering an appropriate\nclass of norm-bounded weighting metrics M can help sharpen the sample complexity result,\nyielding a dataset-dependent bound. Of course, a priori we do not know which class of metrics is\nappropriate; we discuss how to automatically adapt to the right complexity class in Section 4.3.\n\n4.2 Classifier-Based Refinement\n\nEffective data-dependent analysis of classifier-based metric learning requires accounting for potentially\ncomplex interactions between an arbitrary base hypothesis class and the distortion induced\nby a weighting metric to the unknown underlying data distribution. 
To make the analysis tractable\nwhile still keeping our base hypothesis class H general, we assume that H is a class of two-layer\nfeed-forward networks.5 Recall that for any smooth target function f\u2217, a two-layer feed-forward\nneural network (with appropriate number of hidden units and connection weights) can approximate\nf\u2217 arbitrarily well [10], so this class is \ufb02exible enough to include most reasonable target hypotheses.\nMore formally, de\ufb01ne the base hypothesis class of two-layer feed-forward neural network with K\n\u00b7 x) | (cid:107)w(cid:107)1 \u2264 1,(cid:107)vi(cid:107)1 \u2264 1}, where \u03c3\u03b3 : R \u2192\nhidden units as H2-net\n[\u22121, 1] is a smooth, strictly monotonic, \u03b3-Lipschitz activation function with \u03c3\u03b3(0) = 0. Then, for\ngeneralization error w.r.t. any classi\ufb01er-based \u03bb-Lipschitz loss function \u03c6\u03bb,\n\n:= {x (cid:55)\u2192(cid:80)K\n\ni=1 wi \u03c3\u03b3(vi\n\n\u03c3\u03b3\n\nE(x,y)\u223cD(cid:2)\u03c6\u03bb(cid:0)h(M x), y(cid:1)(cid:3),\n\nerr\u03bb\n\nhypoth(M, D) := inf\nh\u2208H2-net\n\u03c3\u03b3\n\nwe have the following.6\n\n5We only present the results for two-layer networks in Lemma 6; the results are easily extended to multi-\n6Since we know the functional form of the base hypothesis class H (i.e., a two layer feed-forward neural\n\nlayer feed-forward networks.\nnet), we can provide a more precise bound than leaving it as Fat(H).\n\n5\n\n\fLemma 6 Let M be any class of weighting metrics on the feature space X = RD, and de\ufb01ne\nd := supM\u2208M (cid:107)M TM(cid:107)2\n\u03c3\u03b3 be a two layer feed-forward neural network\nbase hypothesis class (as de\ufb01ned above) and \u03c6\u03bb be a classi\ufb01er-based loss function that \u03bb-Lipschitz\nin its \ufb01rst argument. Then with probability at least 1 \u2212 \u03b4 over an i.i.d. draw of m samples Sm from\nan unknown B-bounded support distribution D, we have\n\n. 
For any \u03b3 > 0, let H2-net\n\nF\n\nhypoth(M, Sm)(cid:3) \u2264 O\n\n(cid:16)\n\nB\u03bb\u03b3(cid:112)d ln(D/\u03b4)/m\n\n(cid:17)\n\n.\n\nhypoth(M,D)\u2212 err\u03bb\n\n(cid:2) err\u03bb\n\nsup\nM\u2208M\n\n4.3 Automatically Adapting to Intrinsic Complexity\n\nWhile Lemmas 5 and 6 provide a sample complexity bound tuned to the metric learning complexity\nof a given dataset, these results are not directly useful since one cannot select the correct norm-\nbounded class M a priori, as the underlying distribution D is unknown. Fortunately, by considering\nan appropriate sequence of norm-bounded classes of weighting metrics, we can provide a uniform\nbound that automatically adapts to the intrinsic complexity of the unknown underlying data distri-\nbution D.\nTheorem 7 De\ufb01ne Md := {M | (cid:107)M TM(cid:107)2\n\u2264 d}, and consider the nested sequence of weighting\nmetric classes M1 \u2282 M2 \u2282 \u00b7\u00b7\u00b7 . Let \u00b5d be any non-negative measure across the sequence Md\nd \u00b5d = 1 (for d = 1, 2,\u00b7\u00b7\u00b7 ). Then for any \u03bb \u2265 0, with probability at least 1 \u2212 \u03b4 over an\ni.i.d. 
draw of sample Sm from an unknown B-bounded-support distribution D, for all d = 1, 2,\u00b7\u00b7\u00b7 ,\n\nsuch that(cid:80)\n(cid:17)\nC \u00b7 B\u03bb(cid:112)d ln(1/\u03b4\u00b5d)/m\nand all M d \u2208 Md,(cid:2) err\u03bb(M d,D) \u2212 err\u03bb(M d, Sm)(cid:3) \u2264 O\nare m \u2265 \u2126(cid:0)d(CB\u03bb)2 ln(1/\u03b4\u00b5d)/\u00012(cid:1) samples, then with probability at least 1 \u2212 \u03b4\n\nwhere C := B for distance-based error, or C := \u03b3\nIn particular, for a data distribution D that has metric learning complexity at most d \u2208 N, if there\n\nln D for classi\ufb01er-based error (for H2-net\n\u03c3\u03b3 ).\n\n,\n\n(3)\n\n(cid:16)\n\n\u221a\n\nF\n\n(cid:2)err\u03bb(M reg\n(cid:2)err\u03bb(M, Sm) + \u039bM dM\n\nm ,D) \u2212 err\u03bb(M\u2217,D)(cid:3) \u2264 O(\u0001),\n\n(cid:113)\n(cid:3), \u039bM:=CB\u03bb\n\nln(\u03b4\u00b5dM\n\n)\u22121/m , dM :=(cid:6)(cid:107)M TM(cid:107)2\n\n(cid:7) .\n\nF\n\nfor M reg\n\nm := argmin\nM\u2208M\n\nThe measure \u00b5d above encodes our prior belief on the complexity class Md from which a target\nmetric is selected by a metric learning algorithm given the training sample Sm. In absence of any\nprior beliefs, \u00b5d can be set to 1/D (for d = 1, . . . , D) for scale constrained weighting metrics\n(\u03c3max = 1). Thus, for an unknown underlying data distribution D with metric learning complexity\nd, with number of samples just proportional to d, we can \ufb01nd a good weighting metric.\nThis result also highlights that the generalization error of any weighting metric returned by an al-\ngorithm is proportional to the (smallest) norm-bounded class to which it belongs (cf. Eq. 3). If two\nmetrics M1 and M2 have similar empirical errors on a given sample, but have different intrinsic\ncomplexities, then the expected risk of the two metrics can be considerably different. We expect the\nmetric with lower intrinsic complexity to yield better generalization error. 
This partly explains the\nobserved empirical success of norm-regularized optimization for metric learning [7, 8].\nUsing this as a guiding principle, we can design an improved optimization criterion for metric learning\nthat jointly minimizes the sample error and a Frobenius norm regularization penalty. In particular,\n\nmin_{M\u2208M} err(M, S_m) + \u039b \u2016M^T M\u2016_F^2    (4)\n\nfor any error criterion \u2018err\u2019 used in a downstream prediction task and a regularization parameter \u039b.\nSimilar optimizations have been studied before [7, 8]; here we explore the practical efficacy of\nthis augmented optimization on existing metric learning algorithms in high noise regimes where a\ndataset\u2019s intrinsic dimension is much smaller than its representation dimension.\n\nFigure 1: Nearest-neighbor classification performance of LMNN and ITML metric learning algorithms without\nregularization (dashed red lines) and with regularization (solid blue lines) on benchmark UCI datasets. The\nhorizontal dotted line is the classification error of random label assignment drawn according to the class\nproportions, and the solid gray line shows the classification error of k-NN with respect to the identity metric (no\nmetric learning) for baseline reference.\n\n5 Empirical Evaluation\nOur analysis shows that the generalization error of metric learning can scale with the representation\ndimension, and regularization can help mitigate this by adapting to the intrinsic metric learning\ncomplexity of the given dataset. We want to explore to what degree these effects manifest in practice.\nWe select two popular metric learning algorithms, LMNN [1] and ITML [2], that are used to find\nmetrics that improve nearest-neighbor classification quality. 
These algorithms have varying degrees\nof regularization built into their optimization criteria: LMNN implicitly regularizes the metric via its\n\u201clarge margin\u201d criterion, while ITML allows for explicit regularization by letting the practitioners\nspecify a \u201cprior\u201d weighting metric. We modi\ufb01ed the LMNN optimization criteria as per Eq. (4) to\nalso allow for an explicit norm-regularization controlled by the trade-off parameter \u039b.\nWe can evaluate how the unregularized criteria (i.e., unmodi\ufb01ed LMNN, or ITML with the prior\nset to the identity matrix) compares to the regularized criteria (i.e., modi\ufb01ed LMNN with best \u039b, or\nITML with the prior set to a low-rank matrix).\nDatasets. We use the UCI benchmark datasets for our experiments: IRIS (4 dim., 150 samples),\nWINE (13 dim., 178 samples) and IONOSPHERE (34 dim., 351 samples) datasets [11]. Each dataset\nhas a \ufb01xed (unknown, but low) intrinsic dimension; we can vary the representation dimension by\naugmenting each dataset with synthetic correlated noise of varying dimensions, simulating regimes\nwhere datasets contain large numbers of uninformative features. Each UCI dataset is augmented\nwith synthetic D-dimensional correlated noise as detailed in Appendix B.\nExperimental setup. Each noise-augmented dataset was randomly split between 70% training, 10%\nvalidation, and 20% test samples. We used the default settings for each algorithm. For regularized\nLMNN, we picked the best performing trade-off parameter \u039b from {0, 0.1, 0.2, ..., 1} on the valida-\ntion set. For regularized ITML, we seeded with the rank-one discriminating metric, i.e., we set the\nprior as the matrix with all zeros, except the diagonal entry corresponding to the most discriminating\ncoordinate set to one. All the reported results were averaged over 20 runs.\nResults. Figure 1 shows the nearest-neighbor performance (with k = 3) of LMNN and ITML on\nnoise-augmented UCI datasets. 
Notice that the unregularized versions of both algorithms (dashed\nred lines) scale poorly when noisy features are introduced. As the number of uninformative features\ngrows, the performance of both algorithms quickly degrades to that of classification performance in\nthe original unweighted space with no metric learning (solid gray line), showing poor adaptability\nto the signal in the data.\nThe regularized versions of both algorithms (solid blue lines) significantly improve the classification\nperformance. Remarkably, regularized ITML shows almost no degradation in classification performance,\neven in very high noise regimes, demonstrating a strong robustness to noise. These results\nunderscore the value of regularization in metric learning, showing that regularization encourages\nadaptability to the intrinsic complexity and improved robustness to noise.\n\n[Figure 1 plots, three panels: UCI Iris Dataset, UCI Wine Dataset, UCI Ionosphere Dataset. x-axis: ambient noise dimension; y-axis: avg. test error; legend: Random, Id. Metric, LMNN, reg-LMNN, ITML, reg-ITML.]\n\n6 Discussion and Related Work\nPrevious theoretical work on metric learning has focused almost exclusively on analyzing upper\nbounds on the sample complexity in the distance-based framework, without exploring any intrinsic\nproperties of the input data. Our work improves these results and additionally analyzes the classifier-based\nframework. 
It is, to the best of our knowledge, the first to provide lower bounds showing that the\ndependence on D is necessary. Importantly, it is also the first to provide an analysis of sample\nrates based on a notion of intrinsic complexity of a dataset, which is particularly important in metric\nlearning, where we expect the representation dimension to be much higher than intrinsic complexity.\n[12] studied the norm-regularized convex losses for stable algorithms and showed an upper bound\nsublinear in \u221aD, which can be relaxed by applying techniques from [13]. We analyze the ERM\ncriterion directly (thus no assumptions are made about the optimization algorithm), and provide a\nprecise characterization of when the problem complexity is independent of D (Lm. 5). Our lower\nbound (Thm. 2) shows that the dependence on D is necessary for ERM in the assumption-free case.\n[14] and [15] analyzed the ERM criterion, and are most similar to our results providing an upper\nbound for the distance-based framework. [14] shows an O(m^{\u22121/2}) rate for thresholds on bounded\nconvex losses for distance-based metric learning without explicitly studying the dependence on\nD. Our upper bound (Thm. 1) improves this result by considering arbitrary (possibly non-convex)\ndistance-based Lipschitz losses and explicitly revealing the dependence on D. [15] provides an alternate\nERM analysis of norm-regularized metrics and parallels our norm-bounded analysis in Lemma\n5. While they focus on analyzing a specific optimization criterion (thresholds on the hinge loss with\nnorm-regularization), our result holds for general Lipschitz losses. Our Theorem 7 extends it further\nby explicitly showing when we can expect good generalization performance from a given dataset.\n[16] provides an interesting analysis for robust algorithms by relying upon the existence of a partition\nof the input space where each cell has similar training and test losses. 
Their sample complexity\nbound scales with the partition size, which in general can be exponential in D.\nIt is worth emphasizing that none of these closely related works discusses the importance of, or leverages,\nthe intrinsic structure in data for the metric learning problem. Our results in Section 4 formalize\nan intuitive notion of a dataset\u2019s intrinsic complexity for metric learning, and show sample complexity\nrates that are finely tuned to this metric learning complexity. Our lower bounds indicate that\nexploiting the structure is necessary to get rates that don\u2019t scale with representation dimension D.\nThe classifier-based framework we discuss has parallels with the kernel learning and similarity learning\nliterature. The typical focus in kernel learning is to analyze the generalization ability of linear\nseparators in Hilbert spaces [17, 18]. Similarity learning, on the other hand, is concerned with finding\na similarity function (that does not necessarily have a positive semidefinite structure) that can best\nassist in linear classification [19, 20]. Our work provides a complementary analysis for learning\nexplicit linear transformations of the given representation space for arbitrary hypothesis classes.\nOur theoretical analysis partly justifies the empirical success of norm-based regularization as well.\nOur empirical results show that such regularization not only helps in designing new metric learning\nalgorithms [7, 8], but can even benefit existing metric learning algorithms in high-noise regimes.\n\nAcknowledgments\nWe would like to thank Aditya Menon for insightful discussions, and the anonymous reviewers for\ntheir detailed comments that helped improve the final version of this manuscript.\n\nReferences\n[1] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification.\nJournal of Machine Learning Research (JMLR), 10:207\u2013244, 2009.\n\n[2] J.V. 
Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. International Conference on Machine Learning (ICML), pages 209\u2013216, 2007.\n\n[3] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. Neural Information Processing Systems (NIPS), 2004.\n\n[4] E.P. Xing, A.Y. Ng, M.I. Jordan, and S.J. Russell. Distance metric learning with application to clustering with side-information. Neural Information Processing Systems (NIPS), pages 505\u2013512, 2002.\n\n[5] B. McFee and G.R.G. Lanckriet. Metric learning to rank. International Conference on Machine Learning (ICML), 2010.\n\n[6] B. Shaw, B. Huang, and T. Jebara. Learning a distance metric from a network. Neural Information Processing Systems (NIPS), 2011.\n\n[7] D.K.H. Lim, B. McFee, and G.R.G. Lanckriet. Robust structural metric learning. International Conference on Machine Learning (ICML), 2013.\n\n[8] M.T. Law, N. Thome, and M. Cord. Fantope regularization in metric learning. Computer Vision and Pattern Recognition (CVPR), 2014.\n\n[9] M. Anthony and P. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.\n\n[10] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 4:359\u2013366, 1989.\n\n[11] K. Bache and M. Lichman. UCI machine learning repository, 2013.\n\n[12] R. Jin, S. Wang, and Y. Zhou. Regularized distance metric learning: Theory and algorithm. Neural Information Processing Systems (NIPS), pages 862\u2013870, 2009.\n\n[13] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research (JMLR), 2:499\u2013526, 2002.\n\n[14] W. Bian and D. Tao. Learning a distance metric by empirical loss minimization. International Joint Conference on Artificial Intelligence (IJCAI), pages 1186\u20131191, 2011.\n\n[15] Q. Cao, Z. Guo, and Y. Ying. 
Generalization bounds for metric and similarity learning. CoRR, abs/1207.5437, 2013.\n\n[16] A. Bellet and A. Habrard. Robustness and generalization for metric learning. CoRR, abs/1209.1086, 2012.\n\n[17] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. Conference on Computational Learning Theory (COLT), 2009.\n\n[18] C. Cortes, M. Mohri, and A. Rostamizadeh. New generalization bounds for learning kernels. International Conference on Machine Learning (ICML), 2010.\n\n[19] M-F. Balcan, A. Blum, and N. Srebro. Improved guarantees for learning via similarity functions. Conference on Computational Learning Theory (COLT), 2008.\n\n[20] A. Bellet, A. Habrard, and M. Sebban. Similarity learning for provably accurate sparse linear classification. International Conference on Machine Learning (ICML), 2012.\n\n[21] Z. Guo and Y. Ying. Generalization classification via regularized similarity learning. Neural Computation, 26(3):497\u2013552, 2014.\n\n[22] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2014.\n\n[23] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research (JMLR), 3:463\u2013482, 2002.\n\n[24] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, Theory and Applications. 2010.", "award": [], "sourceid": 1517, "authors": [{"given_name": "Nakul", "family_name": "Verma", "institution": "Janelia Research Campus HHMI"}, {"given_name": "Kristin", "family_name": "Branson", "institution": "Janelia Research Campus, HHMI"}]}