{"title": "Similarity-based Learning via Data Driven Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 1998, "page_last": 2006, "abstract": "We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by (Balcan-Blum 2006) and (Wang et al 2007). An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees that the goodness criterion best suited to a problem can itself be learned which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin.", "full_text": "Similarity-based Learning via Data Driven\n\nEmbeddings\n\nPurushottam Kar\n\nIndian Institute of Technology\n\nKanpur, INDIA\n\npurushot@cse.iitk.ac.in\n\nprajain@microsoft.com\n\nPrateek Jain\n\nMicrosoft Research India\n\nBangalore, INDIA\n\nAbstract\n\nWe consider the problem of classi\ufb01cation using similarity/distance functions over\ndata. 
Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself be learned, which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin.

1 Introduction

Machine learning algorithms have found applications in diverse domains such as computer vision, bio-informatics and speech recognition. Working in such heterogeneous domains often involves handling data that is not presented as explicit features embedded into vector spaces. However, in many domains, for example co-authorship graphs, it is natural to devise similarity/distance functions over pairs of points.
While classical techniques like decision trees and linear perceptrons cannot handle such data, several modern machine learning algorithms such as the support vector machine (SVM) can be kernelized and are thereby capable of using kernels or similarity functions.

However, most of these algorithms require the similarity functions to be positive semi-definite (PSD), which essentially implies that the similarity stems from an (implicit) embedding of the data into a Hilbert space. Unfortunately, in many domains the most natural notion of similarity does not satisfy this condition - moreover, verifying this condition is usually a non-trivial exercise. Take for example the case of images, on which the most natural notions of distance (Euclidean, Earth-mover) [3] do not form PSD kernels. Co-authorship graphs give another such example.

Consequently, there have been efforts to develop algorithms that do not make assumptions about the PSD-ness of the similarity functions used. One can discern three main approaches in this area. The first approach tries to coerce a given similarity measure into a PSD one by either clipping or shifting the spectrum of the kernel matrix [4, 5]. However, these approaches are mostly restricted to transductive settings and are not applicable to large scale problems due to eigenvector computation requirements. The second approach consists of algorithms that either adapt classical methods like k-NN to handle non-PSD similarity/distance functions and consequently offer slow test times [5], or are forced to solve non-convex formulations [6, 7].

The third approach, which has been investigated recently in a series of papers [1, 2, 8, 9], uses the similarity function to embed the domain into a low dimensional Euclidean space. More specifically, these algorithms choose landmark points in the domain which then give the embedding.
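On a finite sample, the PSD condition discussed above can at least be checked by inspecting the smallest eigenvalue of the Gram matrix; a minimal sketch (the negated squared-distance "similarity" is an assumed toy example, not one of the paper's kernels):

```python
import numpy as np

def is_psd_on_sample(K, tol=1e-8):
    """Check PSD-ness of a symmetric similarity (Gram) matrix on a
    finite sample by inspecting its smallest eigenvalue."""
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))

# An inner-product Gram matrix is PSD by construction ...
gram = X @ X.T

# ... while a natural affinity such as negated squared Euclidean
# distance need not be: it has zero trace with nonzero entries,
# which forces a negative eigenvalue.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

Checking every pair this way costs an eigendecomposition, which is exactly the expense the spectrum-modification approaches above incur at scale.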
Assuming a certain "goodness" property (that is formally defined) for the similarity function, these models offer both generalization guarantees in terms of how well-suited the similarity function is to the classification task as well as the ability to use fast algorithmic techniques such as linear SVM [10] on the landmarked space. The model proposed by Balcan-Blum in [1] gives sufficient conditions for a similarity function to be well suited to such a landmarking approach. Wang et al. in [2] on the other hand provide goodness conditions for dissimilarity functions that enable landmarking algorithms.

Informally, a similarity (or distance) function can be said to be good if points of similar labels are closer to each other than points of different labels in some sense. Both the models described above restrict themselves to a fixed goodness criterion, which need not hold for the underlying data. We observe that this might be too restrictive in many situations and present a framework that allows us to tune the goodness criterion itself to the classification problem at hand. Our framework consequently unifies and generalizes those presented in [1] and [2]. We first prove generalization bounds corresponding to landmarked embeddings under a fixed goodness criterion. We then provide a uniform-convergence bound that enables us to learn the best goodness criterion for a given problem. We further generalize our framework by giving the ability to incorporate any Lipschitz loss function into our goodness criterion, which allows us to give guarantees for the use of various algorithms such as C-SVM and logistic regression on the landmarked space.

Now, similar to [1, 2], our framework requires random sampling of training points to create the embedding space¹.
However, in practice, random sampling is inefficient and requires sampling a large number of points to form a useful embedding, thereby increasing training and test time. To address this issue, [2] proposes a heuristic to select the points that are to be used as landmarks. However, their scheme is tied to their optimization algorithm and is computationally inefficient for large scale data. In contrast, we propose a general heuristic for selecting informative landmarks based on a novel notion of diversity, which can then be applied to any instantiation of our model.

Finally, we apply our methods to a variety of benchmark datasets for similarity learning as well as ones from the UCI repository. We empirically demonstrate that our learning model and landmark selection heuristic consistently offer significant improvements over the existing approaches. In particular, for small numbers of landmark points, which is a practically important scenario as it is expensive to compute similarity function values at test time, our method provides, on an average, accuracy boosts of up to 5% over existing methods. We also note that our methods can be applied on top of any strategy used to learn the similarity measure (e.g. MKL techniques [11]) or the distance measure (e.g. [12]) itself. Akin to [1], our techniques can also be extended to learn a combination of (dis)similarity functions but we do not explore these extensions in this paper.

2 Methodology

Let D be a fixed but unknown distribution over the labeled input domain X and let ℓ : X → {−1, +1} be a labeling over the domain.
Given a (potentially non-PSD) similarity function² K : X × X → R, the goal is to learn a classifier ℓ̂ : X → {−1, +1} from a finite number of i.i.d. samples from D that has bounded generalization error over D.

Now, learning a reasonable classifier seems unlikely if the given similarity function does not have any inherent "goodness" property. Intuitively, the goodness of a similarity function should be its suitability to the classification task at hand. For PSD kernels, the notion of goodness is defined in terms of the margin offered in the RKHS [13]. However, a more basic requirement is that the similarity function should preserve affinities among similarly labeled points - that is to say, a good similarity function should not, on an average, assign higher similarity values to dissimilarly labeled points than to similarly labeled points. This intuitive notion of goodness turns out to be rather robust in the sense that all PSD kernels that offer a good margin in their respective RKHSs satisfy some form of this goodness criterion as well [14].

¹ Throughout the paper, we use the terms embedding space and landmarked space interchangeably.
² Results described in this section hold for distance functions as well; we present results with respect to similarity functions for sake of simplicity.

Recently there has been some interest in studying different realizations of this general notion of goodness and developing corresponding algorithms that allow for efficient learning with similarity/distance functions. Balcan-Blum in [1] present a goodness criterion in which a good similarity function is considered to be one that, for most points, assigns a greater average similarity to similarly labeled points than to dissimilarly labeled points.
More specifically, a similarity function is (ε, γ)-good if there exists a weighing function w : X → R such that at least a (1 − ε) probability mass of examples x ~ D satisfies:

E_{x'~D}[w(x') K(x, x') | ℓ(x') = ℓ(x)] ≥ E_{x'~D}[w(x') K(x, x') | ℓ(x') ≠ ℓ(x)] + γ,   (1)

where instead of average similarity, one considers an average weighted similarity to allow the definition to be more general.

Wang et al in [2] define a distance function d to be good if a large fraction of the domain is, on an average, closer to similarly labeled points than to dissimilarly labeled points. They allow these averages to be calculated based on some distribution distinct from D, one that may be more suited to the learning problem. However, it turns out that their definition is equivalent to one in which one again assigns weights to domain elements, as done by [1], and the following holds:

E_{x',x''~D×D}[w(x') w(x'') sgn(d(x, x'') − d(x, x')) | ℓ(x') = ℓ(x), ℓ(x'') ≠ ℓ(x)] > γ.   (2)

Assuming their respective goodness criteria, [1] and [2] provide efficient algorithms to learn classifiers with bounded generalization error. However, these notions of goodness with a single fixed criterion may be too restrictive in the sense that the data and the (dis)similarity function may not satisfy the underlying criterion. This is, for example, likely in situations with high intra-class variance. Thus there is a need to make the goodness criterion more flexible and data-dependent.

To this end, we unify and generalize both the above criteria to give a notion of goodness that is more data dependent.
Although the above goodness criteria (1) and (2) seem disparate at first, they can be shown to be special cases of a generalized framework where an antisymmetric function is used to compare intra and inter-class affinities. We use this observation to define our novel goodness criterion using arbitrary bounded antisymmetric functions which we refer to as transfer functions. This allows us to define a family of goodness criteria of which (1) and (2) form special cases ((1) uses the identity function and (2) uses the sign function as transfer function). Moreover, the resulting definition of a good similarity function is more flexible and data dependent. In the rest of the paper we shall always assume that our similarity functions are normalized, i.e. for the domain of interest X, sup_{x,y∈X} K(x, y) ≤ 1.

Definition 1 (Good Similarity Function). A similarity function K : X × X → R is said to be an (ε, γ, B)-good similarity for a learning problem, where ε, γ, B > 0, if for some antisymmetric transfer function f : R → R and some weighing function w : X × X → [−B, B], at least a (1 − ε) probability mass of examples x ~ D satisfies

E_{x',x''~D×D}[w(x', x'') f(K(x, x') − K(x, x'')) | ℓ(x') = ℓ(x), ℓ(x'') ≠ ℓ(x)] ≥ C_f γ,   (3)

where C_f = sup_{x,x'∈X} f(K(x, x')) − inf_{x,x'∈X} f(K(x, x')).

As mentioned before, the above goodness criterion generalizes the previous notions of goodness³ and is adaptive to changes in data as it allows us, as shall be shown later, to learn the best possible criterion for a given classification task by choosing the most appropriate transfer function from a parameterized family of functions.
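To make Definition 1 concrete, the inner expectation can be estimated on a finite labeled sample. The sketch below uses a uniform weighing function w ≡ 1 (an assumption for illustration only) and takes the transfer function as an argument, so that f = identity recovers criterion (1) and f = sign recovers a criterion in the spirit of (2):

```python
import numpy as np

def goodness_fraction(K, y, f, gamma):
    """Fraction of points x for which the average of
    f(K(x, x') - K(x, x'')), over same-label x' and different-label x'',
    clears the margin gamma (Definition 1 with w equal to 1 everywhere).
    K is an n x n similarity matrix, y an array of +/-1 labels."""
    n = len(y)
    good = 0
    for i in range(n):
        same = (y == y[i]) & (np.arange(n) != i)
        diff = y != y[i]
        # all (same, diff) pairs for this x, via broadcasting
        vals = f(K[i, same][:, None] - K[i, diff][None, :])
        good += vals.mean() >= gamma
    return good / n

# Toy data: two well-separated Gaussian blobs, Gaussian similarity.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.2, size=(20, 2)),
               rng.normal(+2.0, 0.2, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

frac_identity = goodness_fraction(K, y, lambda t: t, 0.05)  # criterion (1)
frac_sign = goodness_fraction(K, y, np.sign, 0.05)          # spirit of (2)
```

On such cleanly separated data both transfer functions certify every point; the criteria only start to disagree once class overlap or intra-class variance grows.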
We stress that the property of antisymmetry for the transfer function is crucial to the definition in order to provide a uniform treatment to points of all classes, as will be evident in the proof⁴ of Theorem 2.

As in [1, 2], our goodness criterion lends itself to a simple learning algorithm which consists of choosing a set of d random pairs of points from the domain P = {(x_i^+, x_i^-)}_{i=1}^d (which we refer to as landmark pairs) and defining an embedding of the domain into a landmarked space using these landmarks: Φ_L : X → R^d, Φ_L(x) = (f(K(x, x_i^+) − K(x, x_i^-)))_{i=1}^d ∈ R^d. The advantage of performing this embedding is the guaranteed existence of a large margin classifier in the landmarked space, as shown below.

Theorem 2. If K is an (ε, γ, B)-good similarity with respect to transfer function f and weight function w, then for any ε₁ > 0, with probability at least 1 − δ over the choice of d = (8/γ²) ln(2/δε₁) positive and negative samples {x_i^+}_{i=1}^d ⊂ D+ and {x_i^-}_{i=1}^d ⊂ D− respectively, the classifier h(x) = sgn[g(x)], where g(x) = (1/d) Σ_{i=1}^d w(x_i^+, x_i^-) f(K(x, x_i^+) − K(x, x_i^-)), has error no more than ε + ε₁ at margin γ/2.

³ We refer the reader to the supplementary material (Section 2) for a discussion.
⁴ Due to lack of space we relegate all proofs to the supplementary material.

However, there are two hurdles to obtaining this large margin classifier.
Firstly, the existence of this classifier itself is predicated on the use of the correct transfer function, something which is unknown. Secondly, even if an optimal transfer function is known, the above formulation cannot be converted into an efficient learning algorithm for discovering the (unknown) weights since the formulation seeks to minimize the number of misclassifications, which is an intractable problem in general.

We overcome these two hurdles by proposing a nested learning problem. First of all, we assume that for some fixed loss function L, given any transfer function and any set of landmark pairs, it is possible to obtain a large margin classifier in the corresponding landmarked space that minimizes L. Having made this assumption, we address below the issue of learning the optimal transfer function for a given learning task. However, as we have noted before, this assumption is not valid for arbitrary loss functions. This is why, subsequently in Section 2.2, we shall show it to be valid for a large class of loss functions by incorporating surrogate loss functions into our goodness criterion.

2.1 Learning the transfer function

In this section we present results that allow us to learn a near optimal transfer function from a family of transfer functions. We shall assume, for some fixed loss function L, the existence of an efficient routine, which we refer to as TRAIN, that shall return, for any landmarked space indexed by a set of landmark pairs P, a large margin classifier minimizing L.
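A minimal stand-in for this pipeline: build the landmarked embedding Φ_L(x) = (f(K(x, x_i^+) − K(x, x_i^-)))_i of Theorem 2, then let TRAIN be plain projected subgradient descent on the average hinge loss. The optimizer here is an assumption for illustration; the experiments in Section 3 use LIBLINEAR for this step:

```python
import numpy as np

def embed(K_pos, K_neg, f=lambda t: t):
    """Landmarked embedding: K_pos / K_neg hold similarities of each
    point to the d positive / negative landmarks (shape (n, d));
    f is the transfer function (identity by default)."""
    return f(K_pos - K_neg)

def train_hinge(Phi, y, B=1.0, lr=0.1, epochs=200):
    """A toy TRAIN routine: minimize average hinge loss over landmark
    weights by projected subgradient descent, keeping w in [-B, B]."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (Phi @ w)
        active = margins < 1.0     # points with nonzero hinge loss
        grad = -(y[active, None] * Phi[active]).sum(axis=0) / n
        w = np.clip(w - lr * grad, -B, B)
    return w

def predict(Phi, w):
    """h(x) = sgn[(1/d) sum_i w_i Phi(x)_i], as in Theorem 2."""
    return np.sign(Phi @ w / len(w))

# Toy usage: two points, two landmark pairs, identity transfer.
K_pos = np.array([[0.9, 0.8], [0.1, 0.2]])
K_neg = np.array([[0.1, 0.2], [0.9, 0.8]])
Phi = embed(K_pos, K_neg)
w = train_hinge(Phi, np.array([1.0, -1.0]))
```

Any off-the-shelf linear solver can replace train_hinge; the point is that after embedding, the weight-learning step is an ordinary linear problem.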
The routine TRAIN is allowed to make use of additional training data to come up with this classifier.

An immediate algorithm for choosing the best transfer function is to simply search the set of possible transfer functions (in an algorithmically efficient manner) and choose the one offering lowest training error. We show here that given enough landmark pairs, this simple technique, which we refer to as FTUNE (see Algorithm 2), is guaranteed to return a near-best transfer function. For this we prove a uniform convergence type guarantee on the space of transfer functions.

Let F ⊂ [−1, 1]^R be a class of antisymmetric transfer functions and let W = [−B, B]^{X×X} be a class of weight functions. For two real valued functions f and g defined on X, let ||f − g||_∞ := sup_{x∈X} |f(x) − g(x)|. Let B_∞(f, r) := {f' ∈ F : ||f − f'||_∞ < r}. Let L be a C_L-Lipschitz loss function. Let P = {(x_i^+, x_i^-)}_{i=1}^d be a set of (random) landmark pairs. For any f ∈ F, w ∈ W, define

G^{(f,w)}(x) = E_{x',x''~D×D}[w(x', x'') f(K(x, x') − K(x, x'')) | ℓ(x') = ℓ(x), ℓ(x'') ≠ ℓ(x)],
g^{(f,w)}(x) = (1/d) Σ_{i=1}^d w(x_i^+, x_i^-) f(K(x, x_i^+) − K(x, x_i^-)).

Theorem 5 (see Section 2.2) guarantees us that for any fixed f and any ε₁ > 0, if d is large enough then E_{x~D}[L(g^{(f,w)}(x))] ≤ E_{x~D}[L(G^{(f,w)}(x))] + ε₁. We now show that a similar result holds even if one is allowed to vary f. Before stating the result, we develop some notation.

For any transfer function f and arbitrary choice of landmark pairs P, let w^{(g,f)} be the best weighing function for this choice of transfer function and landmark pairs, i.e. let w^{(g,f)} = argmin_{w∈[−B,B]^d} E_{x~D}[L(g^{(f,w)}(x))]⁵. Similarly, let w^{(G,f)} be the best weighing function corresponding to G, i.e.
w^{(G,f)} = argmin_{w∈W} E_{x~D}[L(G^{(f,w)}(x))]. Then we can ensure the following:

Theorem 3. Let F be a compact class of transfer functions with respect to the infinity norm and let ε₁, δ > 0. Let N(F, r) be the size of the smallest ε-net over F with respect to the infinity norm at scale r = ε₁/(4 C_L B). Then if one chooses d = (64 B² C_L²/ε₁²) ln(16B · N(F, r)/(δε₁)) random landmark pairs, we have the following with probability greater than (1 − δ):

sup_{f∈F} | E_{x~D}[L(g^{(f, w^{(g,f)})}(x))] − E_{x~D}[L(G^{(f, w^{(G,f)})}(x))] | ≤ ε₁.

⁵ Note that the function g^{(f,w)}(x) is dictated by the choice of the set of landmark pairs P.

This result tells us that in a large enough landmarked space we shall, for each function f ∈ F, recover close to the best classifier possible for that transfer function. Thus, if we iterate over the set of transfer functions (or use some gradient-descent based optimization routine), we are bound to select a transfer function that is capable of giving a classifier that is close to the best.

2.2 Working with surrogate loss functions

The formulation of a good similarity function suggests a simple learning algorithm that involves the construction of an embedding of the domain into a landmarked space on which the existence of a large margin classifier having low misclassification rate is guaranteed.
However, in order to exploit this guarantee we would have to learn the weights w(x_i^+, x_i^-) associated with this classifier by minimizing the empirical misclassification rate on some training set. Unfortunately, not only is this problem intractable, it is also hard to solve approximately [15, 16]. Thus what we require is for the landmarked space to admit a classifier that has low error with respect to a loss function that can also be efficiently minimized on any training set. In such a situation, minimizing the loss on a random training set would, with very high probability, give us weights that give similar performance guarantees as the ones used in the goodness criterion.

With a similar objective in mind, [1] offers variants of its goodness criterion tailored to the hinge loss function, which can be efficiently optimized on large training sets (for example by LIBSVM [17]). Here we give a general notion of goodness that can be tailored to any arbitrary Lipschitz loss function.

Definition 4.
A similarity function K : X × X → R is said to be an (ε, B)-good similarity for a learning problem with respect to a loss function L : R → R+, where ε > 0, if for some transfer function f : R → R and some weighing function w : X × X → [−B, B], E_{x~D}[L(G(x))] ≤ ε, where
G(x) = E_{x',x''~D×D}[w(x', x'') f(K(x, x') − K(x, x'')) | ℓ(x') = ℓ(x), ℓ(x'') ≠ ℓ(x)].

One can see that taking the loss function to be L(x) = 1_{x<γ} recovers the notion of goodness used in Definition 1. For Lipschitz loss functions we can show the following:

Theorem 5. If K is an (ε, B)-good similarity with respect to a C_L-Lipschitz loss function L, then for any ε₁ > 0, with probability at least 1 − δ over the choice of d = (16 B² C_L²/ε₁²) ln(4B/δε₁) positive and negative samples from D+ and D− respectively, the expected loss of the classifier g(x) with respect to L satisfies E_x[L(g(x))] ≤ ε + ε₁, where g(x) = (1/d) Σ_{i=1}^d w(x_i^+, x_i^-) f(K(x, x_i^+) − K(x, x_i^-)).

If the loss function is hinge loss at margin γ then C_L = 1/γ. The 0−1 loss function and the loss function L(x) = 1_{x<γ} (implicitly used in Definition 1 and Theorem 2) are not Lipschitz and hence this proof technique does not apply to them.

2.3 Selecting informative landmarks

Recall that the generalization guarantees we described in the previous section rely on random selection of landmark pairs from a fixed distribution over the domain. However, in practice, a totally random selection might require one to select a large number of landmarks, thereby leading to an inefficient classifier in terms of training as well as test times. For typical domains such as computer vision, similarity function computation is an expensive task and hence selection of a small number of landmarks should lead to a significant improvement in the test times.
For this reason, we propose a landmark pair selection heuristic which we call DSELECT (see Algorithm 1).

Algorithm 1 DSELECT
Require: A training set T, landmarking size d.
Ensure: A set of d landmark pairs/singletons.
1: L ← get-random-element(T), P_FTUNE ← ∅
2: for j = 2 to d do
3:   z ← argmin_{x∈T} Σ_{x'∈L} K(x, x')
4:   L ← L ∪ {z}, T ← T \ {z}
5: end for
6: for j = 1 to d do
7:   Sample z1, z2 randomly from L with replacement s.t. ℓ(z1) = 1, ℓ(z2) = −1
8:   P_FTUNE ← P_FTUNE ∪ {(z1, z2)}
9: end for
10: return L (for BBS), P_FTUNE (for FTUNE)

Algorithm 2 FTUNE
Require: A family of transfer functions F, a similarity function K and a loss function L.
Ensure: An optimal transfer function f* ∈ F.
1: Select d landmark pairs P.
2: for all f ∈ F do
3:   w_f ← TRAIN(P, L), L_f ← L(w_f)
4: end for
5: f* ← argmin_{f∈F} L_f
6: return (f*, w_{f*})

The heuristic generalizes naturally to multi-class problems and can also be applied to the classification model of Balcan-Blum that uses landmark singletons instead of pairs.

At the core of our heuristic is a novel notion of diversity among landmarks. Assuming K is a normalized similarity kernel, we call a set of points S ⊂ X diverse if the average inter-point similarity is small, i.e. (1/(|S|(|S| − 1))) Σ_{x,y∈S, x≠y} K(x, y) ≪ 1 (in case we are working with a distance kernel we would require large inter-point distances). The key observation behind DSELECT is that a non-diverse set of landmarks would cause all data points to receive identical embeddings and linear separation would be impossible.
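The greedy selection loop of Algorithm 1 (lines 2-5) can be sketched as follows; for determinism the sketch takes the index of the first landmark as a parameter instead of sampling it:

```python
import numpy as np

def dselect(K, d, start=0):
    """Greedy diversity-based landmark selection: repeatedly add the
    point whose total similarity to the already-chosen landmarks is
    smallest. K is the n x n similarity matrix over the training set."""
    chosen = [start]
    remaining = [i for i in range(K.shape[0]) if i != start]
    for _ in range(d - 1):
        z = min(remaining, key=lambda i: K[i, chosen].sum())
        chosen.append(z)
        remaining.remove(z)
    return chosen

# Two tight clusters {0, 1} and {2, 3}: starting from point 0, the most
# diverse second landmark comes from the other cluster.
K = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
```

Calling dselect(K, 2) on this toy matrix picks one point from each cluster, whereas a random pick would land both landmarks in the same cluster half the time.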
Small inter-landmark similarity, on the other hand, would imply that the landmarks are well-spread in the domain and can capture novel patterns in the data. Similar notions of diversity have been used in the past for ensemble classifiers [18] and k-NN classifiers [5]. Here we use this notion to achieve a better embedding into the landmarked space. Experimental results demonstrate that the heuristic offers significant performance improvements over random landmark selection (see Figure 1). One can easily extend Algorithm 1 to multi-class problems by selecting a fixed number of landmarks from each class.

3 Empirical results

In this section, we empirically study the performance of our proposed methods on a variety of benchmark datasets. We refer to the algorithmic formulation presented in [1] as BBS and its augmentation using DSELECT as BBS+D. We refer to the formulation presented in [2] as DBOOST. We refer to our transfer function learning based formulation as FTUNE and its augmentation using DSELECT as FTUNE+D. In multi-class classification scenarios we use a one-vs-all formulation, which presents us with an opportunity to further exploit the transfer function by learning a separate transfer function per class (i.e. per one-vs-all problem). We shall refer to our formulation using a single (resp. multiple) transfer function as FTUNE+D-S (resp. FTUNE+D-M). We take the class of ramp functions indexed by a slope parameter as our set of transfer functions. We use 6 different values of the slope parameter: {1, 5, 10, 50, 100, 1000}.
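The ramp family and the outer FTUNE search can be sketched as below; embed_fn, train_fn and loss_fn are assumed caller-supplied hooks standing in for the embedding step and the TRAIN routine of Algorithm 2:

```python
import numpy as np

def ramp(slope):
    """Ramp transfer function: antisymmetric, equal to the identity near
    zero for slope 1 and approaching sign(.) as the slope grows."""
    return lambda t: np.clip(slope * t, -1.0, 1.0)

def ftune(slopes, embed_fn, train_fn, loss_fn):
    """FTUNE sketch (Algorithm 2): train one classifier per transfer
    function and keep the transfer function with the smallest loss."""
    best = None
    for s in slopes:
        Phi = embed_fn(ramp(s))    # landmarked embedding under f
        w = train_fn(Phi)          # TRAIN on the embedded sample
        L = loss_fn(Phi, w)
        if best is None or L < best[0]:
            best = (L, s, w)
    return best[1], best[2]

# Assumed toy hooks: D holds K(x, x_i^+) - K(x, x_i^-) for two points.
D = np.array([[0.5], [-0.5]])
y = np.array([1.0, -1.0])
embed_fn = lambda f: f(D)
train_fn = lambda Phi: Phi.T @ y / len(y)
loss_fn = lambda Phi, w: float(np.mean(np.sign(Phi @ w) != y))
slope, w = ftune([1, 10, 1000], embed_fn, train_fn, loss_fn)
```

The search is embarrassingly parallel over the 6 slopes, so its cost is dominated by the TRAIN calls themselves.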
Note that these functions (approximately) include both the identity function (used by [1]) and the sign function (used by [2]).

Our goal in this section is two-fold: 1) to show that our FTUNE method is able to learn a more suitable transfer function for the underlying data than the existing methods BBS and DBOOST, and 2) to show that our diversity based heuristic for landmark selection performs better than random selection. To this end, we perform experiments on a few benchmark datasets for learning with similarity (non-PSD) functions [5] as well as on a variety of standard UCI datasets where the similarity function used is the Gaussian kernel function.

For our experiments, we implemented our methods FTUNE and FTUNE+D as well as BBS and BBS+D using MATLAB, while using LIBLINEAR [10] for SVM classification. For DBOOST, we use the C++ code provided by the authors of [2]. On all the datasets we randomly selected a fixed percentage of data for training, validation and testing. Except for DBOOST, we selected the SVM penalty constant C from the set {1, 10, 100, 1000} using validation. For each method and dataset, we report classification accuracies averaged over 20 runs.
We compare accuracies obtained by different methods using a t-test at the 95% significance level.

(a) 30 Landmarks

Dataset       | BBS        | DBOOST     | FTUNE+D-S
AmazonBinary  | 0.73(0.13) | 0.77(0.10) | 0.84(0.12)
AuralSonar    | 0.82(0.08) | 0.81(0.08) | 0.80(0.08)
Patrol        | 0.51(0.06) | 0.34(0.11) | 0.58(0.06)
Voting        | 0.95(0.03) | 0.94(0.03) | 0.94(0.04)
Protein       | 1.00(0.01) | 0.98(0.02) | 0.98(0.02)
Mirex07       | 0.12(0.01) | 0.21(0.03) | 0.28(0.03)
Amazon47      | 0.39(0.06) | 0.07(0.04) | 0.61(0.08)
FaceRec       | 0.20(0.04) | 0.12(0.03) | 0.63(0.04)

(b) 300 Landmarks

Dataset       | BBS        | DBOOST     | FTUNE+D-S
AmazonBinary  | 0.78(0.11) | 0.82(0.10) | 0.88(0.07)
AuralSonar    | 0.88(0.06) | 0.85(0.07) | 0.85(0.07)
Patrol        | 0.79(0.05) | 0.55(0.12) | 0.79(0.07)
Voting        | 0.97(0.02) | 0.97(0.01) | 0.97(0.02)
Protein       | 0.98(0.02) | 0.99(0.02) | 0.98(0.02)
Mirex07       | 0.17(0.02) | 0.31(0.04) | 0.35(0.02)
Amazon47      | 0.40(0.13) | 0.07(0.05) | 0.66(0.07)
FaceRec       | 0.27(0.05) | 0.19(0.03) | 0.64(0.04)

Table 1: Accuracies for benchmark similarity learning datasets for embedding dimensionality = 30, 300. Bold numbers indicate the best performance with 95% confidence level.

Figure 1: Accuracy obtained by various methods on four different datasets as the number of landmarks used increases. Note that for small numbers of landmarks (30, 50) our diversity based landmark selection criterion increases accuracy for both BBS and our method FTUNE-S significantly.

3.1 Similarity learning datasets

First, we conduct experiments on a few similarity learning datasets [5]; these datasets provide a (non-PSD) similarity matrix along with class labels. For each of the datasets, we randomly select 70% of the data for training, 10% for validation and the remaining for testing purposes. We then apply our FTUNE-S, FTUNE+D-S and BBS+D methods along with BBS and DBOOST with varying number of landmark pairs.
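The comparison protocol (two-sample t-test at the 95% level, as used for the bold entries in Table 1) can be sketched as follows; the accuracy arrays are illustrative stand-ins, not numbers from the paper:

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracies for two methods over 20 runs.
acc_a = 0.84 + 0.001 * np.arange(20)
acc_b = 0.73 + 0.001 * np.arange(20)

# Two-sample t-test: a p-value below 0.05 means the difference in mean
# accuracy is significant at the 95% level, so only one of the two
# methods would be marked best.
t_stat, p_value = stats.ttest_ind(acc_a, acc_b)
```

When the test fails to find a significant difference, both methods are treated as tied for best, which is why some rows of Table 1 carry more than one bold entry.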
Note that we do not apply our FTUNE-M method to these datasets as it overfits heavily on them, they being typically small in size.

We first compare the accuracy achieved by FTUNE+D-S with the existing methods. Table 1 compares the accuracies achieved by our FTUNE+D-S method with those of BBS and DBOOST over different datasets when using landmark sets of sizes 30 and 300. Numbers in brackets denote standard deviation over different runs. Note that in both the tables FTUNE+D-S is one of the best methods (up to the 95% significance level) on all but one dataset. Furthermore, for datasets with a large number of classes such as Amazon47 and FaceRec, our method outperforms BBS and DBOOST by at least 20%. Also, note that some of the datasets have multiple bold faced methods, which means that the two sample t-test (at the 95% level) cannot distinguish between their mean performances.

Next, we evaluate the effectiveness of our landmark selection criterion for both BBS and our method. Figure 1 shows the accuracies achieved by various methods on four different datasets with increasing numbers of landmarks. Note that on all the datasets, our diversity based landmark selection criterion increases the classification accuracy by around 5-6% for small numbers of landmarks.

3.2 UCI benchmark datasets

We now compare our FTUNE method against existing methods on a variety of UCI datasets [19]. We ran experiments with FTUNE and FTUNE+D, but the latter did not provide any advantage, so for lack of space we drop it from our presentation and only show results for FTUNE-S (FTUNE with a single transfer function) and FTUNE-M (FTUNE with one transfer function per class). Similar to [2], we use the Gaussian kernel function as the similarity function for evaluating our method. We set the "width" parameter in the Gaussian kernel to be the mean of all pair-wise training data distances, a standard heuristic.
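A sketch of this kernel construction; the exp(−||x − y||²/(2·width²)) parameterization is an assumption, since the text only states that the width is the mean pairwise training distance:

```python
import numpy as np

def gaussian_similarity(X, width=None):
    """Gaussian similarity K(x, y) = exp(-||x - y||^2 / (2 * width^2)),
    with width defaulting to the mean pairwise distance of the sample."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    if width is None:
        dists = np.sqrt(sq)
        width = dists[np.triu_indices_from(dists, k=1)].mean()
    return np.exp(-sq / (2.0 * width ** 2))

X = np.array([[0.0, 0.0], [3.0, 4.0]])  # mean pairwise distance is 5
K = gaussian_similarity(X)
```

This default makes the kernel scale-free: rescaling the data rescales the width with it, so no per-dataset bandwidth tuning is needed.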
For all the datasets, we randomly select 50% of the data for training, 20% for validation and the remaining 30% for testing. We report accuracy values averaged over 20 runs for each method with varying numbers of landmark pairs.

[Figure 1 plots: accuracy vs. number of landmarks on AmazonBinary, Amazon47, Mirex07 and FaceRec for FTUNE+D, FTUNE, BBS+D, BBS and DBOOST.]

(a) 30 Landmarks

Dataset/Method   BBS          DBOOST       FTUNE-S      FTUNE-M
Cod-rna          0.93(0.01)   0.89(0.01)   0.93(0.01)   0.93(0.01)
Isolet           0.81(0.01)   0.67(0.01)   0.84(0.01)   0.83(0.01)
Letters          0.67(0.02)   0.58(0.01)   0.69(0.01)   0.68(0.02)
Magic            0.82(0.01)   0.81(0.01)   0.84(0.01)   0.84(0.01)
Pen-digits       0.94(0.01)   0.93(0.01)   0.97(0.01)   0.97(0.00)
Nursery          0.91(0.01)   0.91(0.01)   0.90(0.01)   0.90(0.00)
Faults           0.70(0.01)   0.68(0.02)   0.70(0.02)   0.71(0.02)
Mfeat-pixel      0.94(0.01)   0.91(0.01)   0.95(0.01)   0.94(0.01)
Mfeat-zernike    0.79(0.02)   0.72(0.02)   0.79(0.02)   0.79(0.02)
Opt-digits       0.92(0.01)   0.89(0.01)   0.94(0.01)   0.94(0.01)
Satellite        0.85(0.01)   0.86(0.01)   0.86(0.01)   0.87(0.01)
Segment          0.90(0.01)   0.93(0.01)   0.92(0.01)   0.92(0.01)

(b) 300 Landmarks

Dataset/Method   BBS          DBOOST       FTUNE-S      FTUNE-M
Cod-rna          0.94(0.00)   0.93(0.00)   0.94(0.00)   0.94(0.00)
Isolet           0.91(0.01)   0.89(0.01)   0.93(0.01)   0.93(0.00)
Letters          0.72(0.01)   0.84(0.01)   0.83(0.01)   0.83(0.01)
Magic            0.84(0.01)   0.84(0.00)   0.85(0.01)   0.85(0.01)
Pen-digits       0.96(0.00)   0.99(0.00)   0.99(0.00)   0.99(0.00)
Nursery          0.93(0.01)   0.97(0.00)   0.96(0.00)   0.97(0.00)
Faults           0.72(0.02)   0.74(0.02)   0.73(0.02)   0.73(0.02)
Mfeat-pixel      0.96(0.01)   0.97(0.01)   0.97(0.01)   0.97(0.01)
Mfeat-zernike    0.81(0.01)   0.79(0.01)   0.82(0.02)   0.82(0.01)
Opt-digits       0.95(0.01)   0.97(0.00)   0.98(0.00)   0.98(0.00)
Satellite        0.85(0.01)   0.90(0.01)   0.89(0.01)   0.89(0.01)
Segment          0.90(0.01)   0.96(0.01)   0.96(0.01)   0.96(0.01)

Table 2: Accuracies for the Gaussian kernel for embedding dimensionalities 30 and 300. Bold numbers indicate the best performance at the 95% confidence level. Note that both our methods, especially FTUNE-S, perform significantly better than the existing methods.

Figure 2: Accuracy achieved by various methods on four different UCI repository datasets as the number of landmarks used increases. Note that both FTUNE-S and FTUNE-M perform significantly better than BBS and DBOOST for small numbers of landmarks (30, 50).

Table 2 compares the accuracies obtained by our FTUNE-S and FTUNE-M methods with those of BBS and DBOOST when applied to different UCI benchmark datasets. Note that FTUNE-S is one of the best on most of the datasets for both landmarking sizes. Also, BBS performs reasonably well for small landmarking sizes while DBOOST performs well for large landmarking sizes. In contrast, our method consistently outperforms the existing methods in both scenarios.

Next, we study the accuracies obtained by our method for different landmarking sizes. Figure 2 shows the accuracies obtained by various methods as the number of landmarks selected increases. Note that the accuracy curve of our method dominates the accuracy curves of all the other methods, i.e., our method is consistently better than the existing methods for all the landmarking sizes considered.

3.3 Discussion

We note that since FTUNE selects its output by way of validation, it is susceptible to overfitting on small datasets but, at the same time, capable of giving performance boosts on large ones.
We observe a similar trend in our experiments: on smaller datasets (such as those in Table 1, with an average dataset size of 660), FTUNE overfits and performs worse than BBS and DBOOST. However, even in these cases, DSELECT (intuitively) removes redundancies in the landmark points, thus allowing FTUNE to recover the best transfer function. In contrast, for larger datasets like those in Table 2 (average size 13200), FTUNE is itself able to recover better transfer functions than the baseline methods, and hence both FTUNE-S and FTUNE-M perform significantly better than the baselines. Note that DSELECT is not able to provide any advantage here since, the dataset sizes being large, greedy selection actually ends up hurting the accuracy.

Acknowledgments

We thank the authors of [2] for providing us with C++ code of their implementation. P. K. is supported by Microsoft Corporation and Microsoft Research India under a Microsoft Research India Ph.D. fellowship award. Most of this work was done while P. K. was visiting Microsoft Research Labs India, Bangalore.

[Figure 2 plots: accuracy vs. number of landmarks on Isolet, Letters, Pen-digits and Opt-digits for FTUNE (Single), FTUNE (Multiple), BBS and DBOOST.]

References

[1] Maria-Florina Balcan and Avrim Blum. On a Theory of Learning with Similarity Functions. In International Conference on Machine Learning, pages 73-80, 2006.

[2] Liwei Wang, Cheng Yang, and Jufu Feng. On Learning with Dissimilarity Functions.
In International Conference on Machine Learning, pages 991-998, 2007.

[3] Piotr Indyk and Nitin Thaper. Fast Image Retrieval via Embeddings. In International Workshop on Statistical and Computational Theories of Vision, 2003.

[4] Elżbieta Pękalska and Robert P. W. Duin. On Combining Dissimilarity Representations. In Multiple Classifier Systems, pages 359-368, 2001.

[5] Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, and Luca Cazzanti. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research, 10:747-776, 2009.

[6] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with Non-positive Kernels. In International Conference on Machine Learning, 2004.

[7] Bernard Haasdonk. Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482-492, 2005.

[8] Thore Graepel, Ralf Herbrich, Peter Bollmann-Sdorra, and Klaus Obermayer. Classification on Pairwise Proximity Data. In Neural Information Processing Systems, pages 438-444, 1998.

[9] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. Improved Guarantees for Learning via Similarity Functions. In 21st Annual Conference on Computational Learning Theory, pages 287-298, 2008.

[10] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[11] Manik Varma and Bodla Rakesh Babu. More Generality in Efficient Multiple Kernel Learning. In 26th Annual International Conference on Machine Learning, pages 1065-1072, 2009.

[12] Prateek Jain, Brian Kulis, Jason V. Davis, and Inderjit S. Dhillon. Metric and Kernel Learning Using a Linear Transformation.
Journal of Machine Learning Research (JMLR), 2011. To appear.

[13] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings. Machine Learning, 65(1):79-94, 2006.

[14] Nathan Srebro. How Good Is a Kernel When Used as a Similarity Measure? In 20th Annual Conference on Computational Learning Theory, pages 323-335, 2007.

[15] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, 1979.

[16] Sanjeev Arora, László Babai, Jacques Stern, and Z. Sweedyk. The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations. Journal of Computer and System Sciences, 54(2):317-331, April 1997.

[17] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27, 2011.

[18] Krithika Venkataramani and B. V. K. Vijaya Kumar. Designing Classifiers for Fusion-based Biometric Verification. In N. V. Boulgouris, K. N. Plataniotis, and E. Micheli-Tzanakou, editors, Biometrics: Theory, Methods and Applications. Springer, 2009.

[19] A. Frank and Arthur Asuncion. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2010. University of California, Irvine, School of Information and Computer Sciences.