{"title": "Sparse Local Embeddings for Extreme Multi-label Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 730, "page_last": 738, "abstract": "The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems, as the low rank assumption is violated in most real world applications. This paper develops the SLEEC classifier to address both limitations. The main technical contribution in SLEEC is a formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows SLEEC to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors. We conducted extensive experiments on several real-world as well as benchmark data sets and compared our method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that SLEEC can make significantly more accurate predictions than the state-of-the-art methods, including both embedding-based (by as much as 35%) and tree-based (by as much as 6%) methods. 
SLEEC can also scale efficiently to data sets with a million labels which are beyond the pale of leading embedding methods.", "full_text": "Sparse Local Embeddings for Extreme Multi-label\n\nClassi\ufb01cation\n\nKush Bhatia\u2020, Himanshu Jain\u00a7, Purushottam Kar\u2021\u2217, Manik Varma\u2020, and Prateek Jain\u2020\n\n\u2020Microsoft Research, India\n\n\u00a7Indian Institute of Technology Delhi, India\n\u2021Indian Institute of Technology Kanpur, India\n\n{t-kushb,prajain,manik}@microsoft.com\n\nhimanshu.j689@gmail.com, purushot@cse.iitk.ac.in\n\nAbstract\n\nThe objective in extreme multi-label learning is to train a classi\ufb01er that can auto-\nmatically tag a novel data point with the most relevant subset of labels from an\nextremely large label set. Embedding based approaches attempt to make training\nand prediction tractable by assuming that the training label matrix is low-rank and\nreducing the effective number of labels by projecting the high dimensional label\nvectors onto a low dimensional linear subspace. Still, leading embedding ap-\nproaches have been unable to deliver high prediction accuracies, or scale to large\nproblems as the low rank assumption is violated in most real world applications.\nIn this paper we develop the SLEEC classi\ufb01er to address both limitations. The\nmain technical contribution in SLEEC is a formulation for learning a small ensem-\nble of local distance preserving embeddings which can accurately predict infre-\nquently occurring (tail) labels. This allows SLEEC to break free of the traditional\nlow-rank assumption and boost classi\ufb01cation accuracy by learning embeddings\nwhich preserve pairwise distances between only the nearest label vectors.\nWe conducted extensive experiments on several real-world, as well as bench-\nmark data sets and compared our method against state-of-the-art methods for ex-\ntreme multi-label classi\ufb01cation. 
Experiments reveal that SLEEC can make significantly more accurate predictions than the state-of-the-art methods, including both embedding-based (by as much as 35%) and tree-based (by as much as 6%) methods. SLEEC can also scale efficiently to data sets with a million labels which are beyond the pale of leading embedding methods.

1 Introduction

In this paper we develop SLEEC (Sparse Local Embeddings for Extreme Classification), an extreme multi-label classifier that can make significantly more accurate and faster predictions, as well as scale to larger problems, as compared to state-of-the-art embedding based approaches.
eXtreme Multi-label Learning (XML) addresses the problem of learning a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. For instance, there are more than a million labels (categories) on Wikipedia and one might wish to build a classifier that annotates a new article or web page with the subset of most relevant Wikipedia categories. It should be emphasized that multi-label learning is distinct from multi-class classification, where the aim is to predict a single mutually exclusive label.
Challenges: XML is a hard problem that involves learning with hundreds of thousands, or even millions, of labels, features and training points. Although some of these problems can be ameliorated using a label hierarchy, such hierarchies are unavailable in many applications [1, 2]. In this setting, an obvious baseline is thus provided by the 1-vs-All technique, which seeks to learn an independent classifier per label.

*This work was done while P.K. was a postdoctoral researcher at Microsoft Research India.
As expected, this technique is infeasible due to the prohibitive training and prediction costs given the large number of labels.
Embedding-based approaches: A natural way of overcoming the above problem is to reduce the effective number of labels. Embedding based approaches try to do so by projecting label vectors onto a low dimensional space, based on an assumption that the label matrix is low-rank. More specifically, given a set of n training points {(xi, yi)}_{i=1}^n with d-dimensional feature vectors xi ∈ R^d and L-dimensional label vectors yi ∈ {0, 1}^L, state-of-the-art embedding approaches project the label vectors onto a lower L̂-dimensional linear subspace as zi = U yi. Regressors are then trained to predict zi as V xi. Labels for a novel point x are predicted by post-processing y = U† V x, where U† is a decompression matrix which lifts the embedded label vectors back to the original label space.
Embedding methods mainly differ in the choice of their compression and decompression techniques, such as compressed sensing [3], Bloom filters [4], SVD [5], landmark labels [6, 7], output codes [8], etc. The state-of-the-art LEML algorithm [9] directly optimizes for U†, V using a regularized least squares objective. Embedding approaches have many advantages including simplicity, ease of implementation, strong theoretical foundations, the ability to handle label correlations, as well as the ability to adapt to online and incremental scenarios. Consequently, embeddings have proved to be the most popular approach for tackling XML problems [6, 7, 10, 4, 11, 3, 12, 9, 5, 13, 8, 14].
Embedding approaches also have limitations: they are slow at training and prediction even for small embedding dimensions L̂.
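As a concrete illustration of the compression–regression–decompression pipeline described above, the following is a hypothetical toy sketch (not the authors' code; the SVD-based compression, ridge regression, and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, L_hat = 200, 50, 100, 10

X = rng.normal(size=(d, n))                    # feature matrix, one column per point
Y = (rng.random((L, n)) < 0.05).astype(float)  # sparse binary label matrix

# Compression: project label vectors onto an L_hat-dimensional subspace
# spanned by the top left singular vectors of Y, i.e. z_i = U y_i.
U_full, _, _ = np.linalg.svd(Y, full_matrices=False)
U = U_full[:, :L_hat].T                        # (L_hat x L) compression matrix
Z = U @ Y                                      # embedded label vectors

# Regression: learn V by ridge regression so that z_i ≈ V x_i.
lam = 0.1
V = Z @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

# Prediction: lift the predicted embedding back with the decompression
# matrix (here U^T, since U has orthonormal rows) and rank the label scores.
x = X[:, 0]
scores = U.T @ (V @ x)                         # length-L vector of label scores
top5 = np.argsort(-scores)[:5]                 # indices of the 5 highest-scoring labels
```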
For instance, on WikiLSHTC [15, 16], a Wikipedia based challenge data set, LEML with L̂ = 500 takes ∼12 hours to train even with early termination, whereas prediction takes nearly 300 milliseconds per test point. In fact, for text applications with d̂-sparse feature vectors such as WikiLSHTC (where d̂ = 42 ≪ L̂ = 500), LEML's prediction time Ω(L̂(d̂ + L)) can be an order of magnitude more than even 1-vs-All's prediction time O(d̂L).
More importantly, the critical assumption made by embedding methods, that the training label matrix is low-rank, is violated in almost all real world applications. Figure 1(a) plots the approximation error in the label matrix as L̂ is varied on the WikiLSHTC data set. As is clear, even with a 500-dimensional subspace the label matrix still has 90% approximation error. This happens primarily due to the presence of hundreds of thousands of "tail" labels (Figure 1(b)) which occur in at most 5 data points each and, hence, cannot be well approximated by any linear low dimensional basis.
The SLEEC approach: Our algorithm SLEEC extends embedding methods in multiple ways to address these limitations. First, instead of globally projecting onto a linear low-rank subspace, SLEEC learns embeddings zi which non-linearly capture label correlations by preserving the pairwise distances between only the closest (rather than all) label vectors, i.e. d(zi, zj) ≈ d(yi, yj) only if i ∈ kNN(j), where d is a distance metric. Regressors V are trained to predict zi = V xi. We propose a novel formulation for learning such embeddings that can be formally shown to consistently preserve nearest neighbours in the label space.
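The low-rank violation caused by tail labels, discussed above, is easy to reproduce synthetically. The following hypothetical sketch (all sizes illustrative, not taken from the paper) builds a label matrix in which every label occurs in at most 5 data points and measures the relative Frobenius error of the best rank-k approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, k = 500, 1000, 100

# Tail-label matrix: each label is active in only 1 to 5 data points.
Y = np.zeros((L, n))
for j in range(L):
    support = rng.choice(n, size=rng.integers(1, 6), replace=False)
    Y[j, support] = 1.0

# Best rank-k approximation via the SVD (Eckart-Young).
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Relative approximation error; it remains large despite k = 100.
rel_err = np.linalg.norm(Y - Y_k) / np.linalg.norm(Y)
```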
We build an efficient pipeline for training these embeddings which can be orders of magnitude faster than state-of-the-art embedding methods.
During prediction, rather than using a decompression matrix, SLEEC uses a k-nearest neighbour (kNN) classifier in the embedding space, thus leveraging the fact that nearest neighbours have been preserved during training. Thus, for a novel point x, the predicted label vector is obtained using y = Σ_{i : V xi ∈ kNN(V x)} yi. The use of a kNN classifier is well motivated, as kNN outperforms discriminative methods in acutely low training data regimes [17], as is the case with tail labels.
The superiority of SLEEC's proposed embeddings over traditional low-rank embeddings can be seen by looking at Figure 1, which shows that the relative approximation error in learning SLEEC's embeddings is significantly smaller as compared to the low-rank approximation error. Moreover, we also find that SLEEC can improve the prediction accuracy of state-of-the-art embedding methods by as much as 35% (absolute) on the challenging WikiLSHTC data set. SLEEC also significantly outperforms methods such as WSABIE [13] which also use kNN classification in the embedding space but learn their embeddings using the traditional low-rank assumption.
Clustering based speedup: However, kNN classifiers are known to be slow at prediction. SLEEC therefore clusters the training data into C clusters, learning a separate embedding per cluster and performing kNN classification within the test point's cluster alone. This allows SLEEC to be more than two orders of magnitude faster at prediction than LEML and other embedding methods on the WikiLSHTC data. In fact, SLEEC also scales well to the Ads1M data set involving a million labels, which is beyond the pale of leading embedding methods. Moreover, the clustering trick does not significantly benefit other state-of-the-art methods (see Figure 1(c)), thus indicating that SLEEC's embeddings are key to its performance boost.

Figure 1: (a) error ||Y − Y_L̂||_F^2 / ||Y||_F^2 in approximating the label matrix Y. Global SVD denotes the error incurred by computing the rank-L̂ SVD of Y. Local SVD computes the rank-L̂ SVD of Y within each cluster. SLEEC NN objective denotes SLEEC's objective function. Global SVD incurs 90% error and the error is decreasing at most linearly as well. (b) shows the number of documents in which each label is present for the WikiLSHTC data set. There are about 300K labels which are present in < 5 documents, lending it a 'heavy tailed' distribution. (c) shows the Precision@1 accuracy of SLEEC and LocalLEML on the Wiki-10 data set as we vary the number of clusters.

Since clustering can be unstable in large dimensions, SLEEC compensates by learning a small ensemble where each individual learner is generated by a different random clustering. This was empirically found to help tackle instabilities of clustering and significantly boost prediction accuracy with only linear increases in training and prediction time. For instance, on WikiLSHTC, SLEEC's prediction accuracy was 55% with an 8 millisecond prediction time, whereas LEML could only manage 20% accuracy while taking 300 milliseconds for prediction per test point.
Tree-based approaches: Recently, tree based methods [1, 15, 2] have also become popular for XML as they enjoy significant accuracy gains over the existing embedding methods. For instance, FastXML [15] can achieve a prediction accuracy of 49% on WikiLSHTC using a 50 tree ensemble. However, using SLEEC, we are now able to extend embedding methods to outperform tree ensembles, achieving 49.8% with 2 learners and 55% with 10.
Thus, SLEEC obtains the best of both worlds: achieving the highest prediction accuracies across all methods on even the most challenging data sets, as well as retaining all the benefits of embeddings and eschewing the disadvantages of large tree ensembles such as large model size and lack of theoretical understanding.

2 Method
Let D = {(x1, y1), . . . , (xn, yn)} be the given training data set, xi ∈ X ⊆ R^d be the input feature vector, yi ∈ Y ⊆ {0, 1}^L be the corresponding label vector, and yij = 1 iff the j-th label is turned on for xi. Let X = [x1, . . . , xn] be the data matrix and Y = [y1, . . . , yn] be the label matrix. Given D, the goal is to learn a multi-label classifier f : R^d → {0, 1}^L that accurately predicts the label vector for a given test point. Recall that in XML settings, L is very large and is of the same order as n and d, ruling out several standard approaches such as 1-vs-All.
We now present our algorithm SLEEC, which is designed primarily to scale efficiently for large L. Our algorithm is an embedding-style algorithm, i.e., during training we map the label vectors yi to L̂-dimensional vectors zi ∈ R^L̂ and learn a set of regressors V ∈ R^{L̂×d} s.t. zi ≈ V xi, ∀i. During the test phase, for an unseen point x, we first compute its embedding V x and then perform kNN over the set [V x1, V x2, . . . , V xn]. To scale our algorithm, we perform a clustering of all the training points and apply the above mentioned procedure in each cluster separately. Below, we first discuss our method to compute the embeddings zi and the regressors V.
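The test phase sketched above (embed the test point, find its nearest training embeddings, aggregate their label vectors) can be written compactly. This is a hypothetical minimal implementation, not the authors' code; the function name `predict` and the brute-force distance computation are illustrative:

```python
import numpy as np

def predict(x, V, Z_train, Y_train, n_neighbors=5, top_p=3):
    """Z_train: (L_hat x n) training embeddings V x_i; Y_train: (L x n) label matrix."""
    z = V @ x                                         # embedding of the test point
    dists = np.linalg.norm(Z_train - z[:, None], axis=0)
    nn = np.argsort(dists)[:n_neighbors]              # nearest training embeddings
    scores = Y_train[:, nn].sum(axis=1)               # empirical label distribution over neighbors
    return np.argsort(-scores)[:top_p]                # indices of the top-p predicted labels
```

In practice the paper performs this search only within the test point's cluster, which keeps the candidate set small.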
Section 2.2 then discusses our approach for scaling the method to large data sets.

Algorithm 1 SLEEC: Train Algorithm
Require: D = {(x1, y1), . . . , (xn, yn)}, embedding dimensionality L̂, no. of neighbors n̄, no. of clusters C, regularization parameters λ, µ, L1 smoothing parameter ρ
1: Partition X into Q1, . . . , QC using k-means
2: for each partition Qj do
3:   Form Ω using the n̄ nearest neighbors of each label vector yi ∈ Qj
4:   [U, Σ] ← SVP(P_Ω(Y^j Y^{j T}), L̂)
5:   Z^j ← U Σ^{1/2}
6:   V^j ← ADMM(X^j, Z^j, λ, µ, ρ)
7:   Z^j ← V^j X^j
8: end for
9: Output: {(Q1, V^1, Z^1), . . . , (QC, V^C, Z^C)}

Algorithm 2 SLEEC: Test Algorithm
Require: test point x, no. of NN n̄, no. of desired labels p
1: Qτ ← partition closest to x
2: z ← V^τ x
3: Nz ← n̄ nearest neighbors of z in Z^τ
4: Px ← empirical label distribution for points ∈ Nz
5: y_pred ← Top_p(Px)

Sub-routine 3 SLEEC: SVP
Require: observations G, index set Ω, dimensionality L̂
1: M ← 0, η ← 1
2: repeat
3:   M̂ ← M + η(G − P_Ω(M))
4:   [U, Σ] ← Top-EigenDecomp(M̂, L̂)
5:   Σii ← max(0, Σii), ∀i
6:   M ← U · Σ · U^T
7: until convergence
8: Output: U, Σ

Sub-routine 4 SLEEC: ADMM
Require: data matrix X, embeddings Z, regularization parameters λ, µ, smoothing parameter ρ
1: β ← 0, α ← 0
2: repeat
3:   Q ← (Z + ρ(α − β)) X^T
4:   V ← Q (X X^T (1 + ρ) + λI)^{-1}
5:   α ← V X + β
6:   αi ← sign(αi) · max(0, |αi| − µ/ρ), ∀i
7:   β ← β + V X − α
8: until convergence
9: Output: V

2.1 Learning Embeddings
As mentioned earlier, our approach is motivated by the fact that a typical real-world data set tends to have a large number of tail labels that ensure that the label matrix Y cannot be well-approximated using a low-dimensional linear subspace (see Figure 1). However, Y can still be accurately modeled using a low-dimensional non-linear manifold. That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt to preserve the distance to only a few nearest neighbors. That is, we wish to find an L̂-dimensional embedding matrix Z = [z1, . . . , zn] ∈ R^{L̂×n} which minimizes the following objective:

  min_{Z ∈ R^{L̂×n}} ||P_Ω(Y^T Y) − P_Ω(Z^T Z)||_F^2 + λ||Z||_1,   (1)

where the index set Ω denotes the set of neighbors that we wish to preserve, i.e., (i, j) ∈ Ω iff j ∈ N_i, and N_i denotes a set of nearest neighbors of i. We select N_i = argmax_{S : |S| ≤ α·n} Σ_{j ∈ S} y_i^T y_j, which is the set of α·n points with the largest inner products with y_i. |N_i| is always chosen large enough so that distances (inner products) to a few far away points are also preserved while optimizing our objective function. This prohibits non-neighboring points from entering the immediate neighborhood of any given point. P_Ω : R^{n×n} → R^{n×n} is defined as:

  (P_Ω(Y^T Y))_{ij} = ⟨y_i, y_j⟩ if (i, j) ∈ Ω, and 0 otherwise.   (2)

Also, we add L1 regularization, ||Z||_1 = Σ_i ||z_i||_1, to the objective function to obtain sparse embeddings. Sparse embeddings have three key advantages: a) they reduce prediction time, b) they reduce the size of the model, and c) they avoid overfitting.
Now, given the embeddings Z = [z1, . . . , zn] ∈ R^{L̂×n}, we wish to learn a multi-regression model to predict the embeddings Z using the input features. That is, we require that Z ≈ V X, where V ∈ R^{L̂×d}. Combining the two formulations and adding an L2-regularization for V, we get:

  min_{V ∈ R^{L̂×d}} ||P_Ω(Y^T Y) − P_Ω(X^T V^T V X)||_F^2 + λ||V||_F^2 + µ||V X||_1.   (3)

Note that the above problem formulation is somewhat similar to a few existing methods for non-linear dimensionality reduction that also seek to preserve distances to a few near neighbors [18, 19]. However, in contrast to our approach, these methods do not have a direct out-of-sample generalization, do not scale well to large-scale data sets, and lack rigorous generalization error bounds.
Optimization: We first note that optimizing (3) is a significant challenge as the objective function is non-convex as well as non-differentiable. Furthermore, our goal is to perform optimization for data sets where L, n, d ≫ 100,000. To this end, we divide the optimization into two phases. We first learn the embeddings Z = [z1, . . . , zn] and then learn the regressors V in the second stage. That is, Z is obtained by directly solving (1) but without the L1 penalty term:

  min_{Z ∈ R^{L̂×n}} ||P_Ω(Y^T Y) − P_Ω(Z^T Z)||_F^2 ≡ min_{M ⪰ 0, rank(M) ≤ L̂} ||P_Ω(Y^T Y) − P_Ω(M)||_F^2,   (4)

where M = Z^T Z. Next, V is obtained by solving the following problem:

  min_{V ∈ R^{L̂×d}} ||Z − V X||_F^2 + λ||V||_F^2 + µ||V X||_1.   (5)

Note that the Z matrix obtained using (4) need not be sparse. However, we store and use V X as our embeddings, so that sparsity is still maintained.
Optimizing (4): Note that even the simplified problem (4) is an instance of the popular low-rank matrix completion problem and is known to be NP-hard in general. The main challenge arises due to the non-convex rank constraint on M. However, using the Singular Value Projection (SVP) method [20], a popular matrix completion method, we can guarantee convergence to a local minimum. SVP is a simple projected gradient descent method where the projection is onto the set of low-rank matrices. That is, the t-th step update for SVP is given by:

  M_{t+1} = P_L̂(M_t + η P_Ω(Y^T Y − M_t)),   (6)

where M_t is the t-th step iterate, η > 0 is the step-size, and P_L̂(M) is the projection of M onto the set of rank-L̂ positive semi-definite (PSD) matrices. Note that while the set of rank-L̂ PSD matrices is non-convex, we can still project onto this set efficiently using the eigenvalue decomposition of M. That is, let M = U_M Λ_M U_M^T be the eigenvalue decomposition of M. Then

  P_L̂(M) = U_M(1 : r) · Λ_M(1 : r) · U_M(1 : r)^T,

where r = min(L̂, L_M^+), L_M^+ is the number of positive eigenvalues of M, Λ_M(1 : r) denotes the top-r eigenvalues of M, and U_M(1 : r) denotes the corresponding eigenvectors.
While the above update restricts the rank of all intermediate iterates M_t to be at most L̂, computing the rank-L̂ eigenvalue decomposition can still be fairly expensive for large n. However, by using the special structure in the update (6), one can significantly reduce the eigenvalue decomposition's computation complexity as well. In general, the eigenvalue decomposition can be computed in time O(L̂ζ), where ζ is the time complexity of computing a matrix-vector product. Now, for the SVP update (6), the matrix has the special structure M̂ = M_t + η P_Ω(Y^T Y − M_t). Hence ζ = O(nL̂ + n n̄), where n̄ = |Ω|/n is the average number of neighbors preserved by SLEEC. Hence, the per-iteration time complexity reduces to O(nL̂^2 + nL̂n̄), which is linear in n, assuming n̄ is nearly constant.
Optimizing (5): (5) contains an L1 term which makes the problem non-smooth. Moreover, as the L1 term involves both V and X, we cannot directly apply the standard prox-function based algorithms. Instead, we use the ADMM method to optimize (5). See Sub-routine 4 for the updates and [21] for a detailed derivation of the algorithm.
Generalization Error Analysis: Let P be a fixed (but unknown) distribution over X × Y, and let each training point (xi, yi) ∈ D be sampled i.i.d. from P. Then, the goal of our non-linear embedding method (3) is to learn an embedding matrix A = V^T V that preserves the nearest neighbors (in terms of label distance/intersection) of any (x, y) ∼ P. The above requirement can be formulated as the following stochastic optimization problem:

  min_{A ⪰ 0, rank(A) ≤ k} L(A) = E_{(x,y),(x̃,ỹ) ∼ P} ℓ(A; (x, y), (x̃, ỹ)),   (7)

where the loss function ℓ(A; (x, y), (x̃, ỹ)) = g(⟨ỹ, y⟩)(⟨ỹ, y⟩ − x̃^T A x)^2 and g(⟨ỹ, y⟩) = I[⟨ỹ, y⟩ ≥ τ], with I[·] the indicator function. Hence, a loss is incurred only if y and ỹ have a large inner product.
For an appropriate selection of the neighborhood selection operator Ω, (3) indeed minimizes a regularized empirical estimate of the loss function (7), i.e., it is a regularized ERM w.r.t. (7).
We now show that the optimal solution Â to (3) indeed minimizes the loss (7) up to an additive approximation error. The existing techniques for analyzing excess risk in stochastic optimization require the empirical loss function to be decomposable over the training set, and as such do not apply to (3), which contains loss-terms with two training points. Still, using techniques from the AUC maximization literature [22], we can provide interesting excess risk bounds for Problem (7).
Theorem 1. With probability at least 1 − δ over the sampling of the dataset D, the solution Â to the optimization problem (3) satisfies

  L(Â) ≤ inf_{A* ∈ A} { L(A*) + E-Risk(n) },  where  E-Risk(n) = C (L̄^2 + (r^2 + ||A*||_F^2) R^4) √((1/n) log(1/δ)),

Â is the minimizer of (3), r = L̄/λ, and A := {A ∈ R^{d×d} : A ⪰ 0, rank(A) ≤ L̂}.
See Appendix A for a proof of the result. Note that the generalization error bound is independent of both d and L, which is critical for extreme multi-label classification problems with large d, L. In fact, the error bound is only dependent on L̄ ≪ L, which is the average number of positive labels per data point. Moreover, our bound also provides a way to compute the best regularization parameter λ that minimizes the error bound. However, in practice, we set λ to be a fixed constant.
Theorem 1 only preserves the population neighbors of a test point.
Theorem 7, given in Appendix A, extends Theorem 1 to ensure that the neighbors in the training set are also preserved. We would also like to stress that our excess risk bound is universal and hence holds even if Â does not minimize (3), i.e., L(Â) ≤ L(A*) + E-Risk(n) + (L̂(Â) − L̂(A*)), where L̂ denotes the empirical objective in (3) and E-Risk(n) is given in Theorem 1.
2.2 Scaling to Large-scale Data sets

For large-scale data sets, one might require the embedding dimension L̂ to be fairly large (say a few hundred), which might make computing the updates (6) infeasible. Hence, to scale to such large data sets, SLEEC clusters the given datapoints into smaller local regions. Several text-based data sets indeed reveal that there exist small local regions in the feature space where the number of points as well as the number of labels is reasonably small. Hence, we can train our embedding method over such local regions without significantly sacrificing overall accuracy.
We would like to stress that despite clustering datapoints into homogeneous regions, the label matrix of any given cluster is still not close to low-rank. Hence, applying a state-of-the-art linear embedding method, such as LEML, to each cluster is still significantly less accurate when compared to our method (see Figure 1). Naturally, one can cluster the data set into an extremely large number of regions, so that eventually the label matrix is low-rank in each cluster. However, increasing the number of clusters beyond a certain limit might decrease accuracy, as the error incurred during the cluster assignment phase itself might nullify the gain in accuracy due to better embeddings. Figure 1 illustrates this phenomenon where increasing the number of clusters beyond a certain limit in fact decreases the accuracy of LEML.
Algorithm 1 provides pseudo-code of our training algorithm. We first cluster the datapoints into C partitions.
Then, for each partition we learn a set of embeddings using Sub-routine 3 and then\ncompute the regression parameters V \u03c4 , 1 \u2264 \u03c4 \u2264 C using Sub-routine 4. For a given test point x,\nwe \ufb01rst \ufb01nd out the appropriate cluster \u03c4. Then, we \ufb01nd the embedding z = V \u03c4 x. The label vector\nis then predicted using k-NN in the embedding space. See Algorithm 2 for more details.\nOwing to the curse-of-dimensionality, clustering turns out to be quite unstable for data sets with\nlarge d and in many cases leads to some drop in prediction accuracy. To safeguard against such\ninstability, we use an ensemble of models generated using different sets of clusters. We use different\ninitialization points in our clustering procedure to obtain different sets of clusters. Our empirical\nresults demonstrate that using such ensembles leads to signi\ufb01cant increase in accuracy of SLEEC\n(see Figure 2) and also leads to stable solutions with small variance (see Table 4).\n\n3 Experiments\n\nExperiments were carried out on some of the largest XML benchmark data sets demonstrating that\nSLEEC could achieve signi\ufb01cantly higher prediction accuracies as compared to the state-of-the-art.\nIt is also demonstrated that SLEEC could be faster at training and prediction than leading embedding\ntechniques such as LEML.\n\n6\n\n\f(a)\n\n(b)\n\nFigure 2: Variation in Precision@1 accuracy with model size and the number of learners on large-scale data\nsets. Clearly, SLEEC achieves better accuracy than FastXML and LocalLEML-Ensemble at every point of the\ncurve. For WikiLSTHC, SLEEC with a single learner is more accurate than LocalLEML-Ensemble with even\n15 learners. 
Similarly, SLEEC with 2 learners achieves higher accuracy than FastXML with 50 learners.

Data sets: Experiments were carried out on multi-label data sets including Ads1M [15] (1M labels), Amazon [23] (670K labels), WikiLSHTC (320K labels), DeliciousLarge [24] (200K labels) and Wiki10 [25] (30K labels). All the data sets are publicly available except Ads1M, which is proprietary and is included here to test the scaling capabilities of SLEEC.
Unfortunately, most of the existing embedding techniques do not scale to such large data sets. We therefore also present comparisons on publicly available small data sets such as BibTeX [26], MediaMill [27], Delicious [28] and EURLex [29]. (Table 2 in the appendix lists their statistics.)
Baseline algorithms: This paper's primary focus is on comparing SLEEC to state-of-the-art methods which can scale to the large data sets, such as embedding based LEML [9] and tree based FastXML [15] and LPSR [2]. Naïve Bayes was used as the base classifier in LPSR, as was done in [15]. Techniques such as CS [3], CPLST [30], ML-CSSP [7] and 1-vs-All [31] could only be trained on the small data sets given standard resources. Comparisons between SLEEC and such techniques are therefore presented in the supplementary material. The implementation for LEML and FastXML was provided by the authors.
We implemented the remaining algorithms and ensured that the published results could be reproduced and were verified by the authors wherever possible.
Hyper-parameters: Most of SLEEC's hyper-parameters were kept fixed, including the number of clusters in a learner (⌊N_train/6000⌋), the embedding dimension (100 for the small data sets and 50 for the large), the number of learners in the ensemble (15), and the parameters used for optimizing (3). The remaining two hyper-parameters, the k in kNN and the number of neighbours considered during SVP, were both set by limited validation on a validation set.
The hyper-parameters for all the other algorithms were set using fine grained validation on each data set so as to achieve the highest possible prediction accuracy for each method. In addition, all the embedding methods were allowed a much larger embedding dimension (0.8L) than SLEEC (100) to give them as much opportunity as possible to outperform SLEEC.
Evaluation Metrics: We evaluated algorithms using metrics that have been widely adopted for XML and ranking tasks. Precision at k (P@k) is one such metric that counts the fraction of correct predictions in the top k scoring labels in ŷ, and has been widely utilized [1, 3, 15, 13, 2, 9]. We use the ranking measure nDCG@k as another evaluation metric. We refer the reader to the supplementary material (Appendix B.1 and Tables 5 and 6) for further descriptions of the metrics and results.
Results on large data sets with more than 100K labels: Table 1a compares SLEEC's prediction accuracy, in terms of P@k (k = {1, 3, 5}), to all the leading methods that could be trained on five such data sets. SLEEC could improve over the leading embedding method, LEML, by as much as 35% and 15% in terms of P@1 and P@5 on WikiLSHTC.
Similarly, SLEEC outperformed LEML by 27% and 22% in terms of P@1 and P@5 on the Amazon data set, which also has many tail labels. The gains on the other data sets are consistent but smaller, as the tail label problem is not as acute. SLEEC also outperforms the leading tree method, FastXML, by 6% in terms of P@1 and P@5 on WikiLSHTC and Wiki10 respectively. This demonstrates the superiority of SLEEC's overall pipeline, constructed using local distance preserving embeddings followed by kNN classification.
SLEEC also has better scaling properties than all other embedding methods. In particular, apart from LEML, no other embedding approach could scale to the large data sets, and even LEML could not scale to Ads1M with a million labels. In contrast, a single SLEEC learner could be learnt on WikiLSHTC in 4 hours on a single core and already gave a ~20% improvement in P@1 over LEML (see Figure 2 for the variation in accuracy with the number of SLEEC learners). In fact, SLEEC's training

[Figure 2: Precision@1 on WikiLSHTC (L = 325K, d = 1.61M, n = 1.77M) plotted against model size (GB) and against the number of learners, and on Wiki10 (L = 30K, d = 101K, n = 14K) against the number of learners, comparing SLEEC, FastXML and LocalLEML-Ens.]

Table 1: Precision accuracies. (a) Large-scale data sets: our proposed method SLEEC is as much as 35% more accurate in terms of P@1 and 22% in terms of P@5 than LEML, a leading embedding method. Other embedding based methods do not scale to the large-scale data sets; we compare against them on the small-scale data sets in Table 3. SLEEC is also 6% more accurate (w.r.t. P@1 and P@5) than FastXML, a state-of-the-art tree method. '-' indicates that LEML could not be run with standard resources. (b) Small-scale data sets: SLEEC consistently outperforms state-of-the-art approaches.
WSABIE, which also uses a kNN classifier on its embeddings, is significantly less accurate than SLEEC on all the data sets, showing the superiority of our embedding learning algorithm.

(a)
Data set         Metric  SLEEC  LEML   FastXML  LPSR-NB
Wiki10           P@1     85.54  73.50  82.56    72.71
                 P@3     73.59  62.38  66.67    58.51
                 P@5     63.10  54.30  56.70    49.40
Delicious-Large  P@1     47.03  40.30  42.81    18.59
                 P@3     41.67  37.76  38.76    15.43
                 P@5     38.88  36.66  36.34    14.07
WikiLSHTC        P@1     55.57  19.82  49.35    27.43
                 P@3     33.84  11.43  32.69    16.38
                 P@5     24.07   8.39  24.03    12.01
Amazon           P@1     35.05   8.13  33.36    28.65
                 P@3     31.25   6.83  29.30    24.88
                 P@5     28.56   6.03  26.12    22.37
Ads-1m           P@1     21.84    -    23.11    17.08
                 P@3     14.30    -    13.86    11.38
                 P@5     11.01    -    10.12     8.83

(b)
Data set   Metric  SLEEC  LEML   FastXML  WSABIE  OneVsAll
BibTex     P@1     65.57  62.53  63.73    54.77   61.83
           P@3     40.02  38.40  39.00    32.38   36.44
           P@5     29.30  28.21  28.54    23.98   26.46
Delicious  P@1     68.42  65.66  69.44    64.12   65.01
           P@3     61.83  60.54  63.62    58.13   58.90
           P@5     56.80  56.08  59.10    53.64   53.26
MediaMill  P@1     87.09  84.00  84.24    81.29   83.57
           P@3     72.44  67.19  67.39    64.74   65.50
           P@5     58.45  52.80  53.14    49.82   48.57
EurLEX     P@1     80.17  61.28  68.69    70.87   74.96
           P@3     65.39  48.66  57.73    56.62   62.92
           P@5     53.75  39.91  48.00    46.20   53.42

time on WikiLSHTC was comparable to that of the tree based FastXML. FastXML trains 50 trees in 7 hours on a single core to achieve a P@1 of 49.37%, whereas SLEEC achieves 49.98% by training 2 learners in 8 hours. Similarly, SLEEC's training time on Ads1M was 6 hours per learner on a single core.
SLEEC's predictions could also be up to 300 times faster than LEML's.
For instance, on WikiLSHTC, SLEEC made predictions in 8 milliseconds per test point as compared to LEML's 279. SLEEC therefore brings the prediction time of embedding methods much closer to that of tree based methods (FastXML took 0.5 milliseconds per test point on WikiLSHTC) and within the acceptable limits of most real world applications.
Effect of clustering and multiple learners: As mentioned in the introduction, other embedding methods could also be extended by clustering the data and then learning a local embedding in each cluster. Ensembles could also be learnt from multiple such clusterings. We extend LEML in this fashion, referring to it as LocalLEML, using exactly the same 300 clusters per learner in the ensemble as used in SLEEC for a fair comparison. As can be seen in Figure 2, SLEEC significantly outperforms LocalLEML, with a single SLEEC learner being much more accurate than an ensemble of even 10 LocalLEML learners. Figure 2 also demonstrates that SLEEC's ensemble can be much more accurate at prediction than the tree based FastXML ensemble (the same plot is also presented in the appendix, depicting the variation in accuracy with model size in RAM rather than with the number of learners in the ensemble). The figure further shows that very few SLEEC learners need to be trained before accuracy starts saturating. Finally, Table 4 shows that the variance in SLEEC's prediction accuracy (w.r.t. different cluster initializations) is very small, indicating that the method is stable even though clustering is performed in more than a million dimensions.
Results on small data sets: Table 3, in the appendix, compares the performance of SLEEC to several popular methods including embeddings, trees, kNN and 1-vs-All SVMs.
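The cluster-then-embed construction behind LocalLEML (and, at a high level, behind SLEEC's own pipeline of local embeddings followed by kNN classification) can be sketched as follows. This is a simplified, hypothetical illustration, not the authors' algorithm: a truncated SVD of each cluster's label submatrix stands in for SLEEC's learned distance-preserving embeddings, and an ordinary least-squares regressor stands in for the regressors learned by optimizing (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, n_clusters, n_iter=20):
    """Minimal k-means; returns cluster centroids and point assignments."""
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    return centroids, assign

def train_local_embeddings(X, Y, n_clusters, dim):
    """Partition the data and learn one low-dimensional label embedding
    per cluster, plus a linear map from features to that embedding space."""
    centroids, assign = kmeans(X, n_clusters)
    learners = []
    for c in range(n_clusters):
        Xc, Yc = X[assign == c], Y[assign == c]
        d = min(dim, *Yc.shape)                      # guard against tiny clusters
        U, s, _ = np.linalg.svd(Yc, full_matrices=False)
        Z = U[:, :d] * s[:d]                         # cluster-local label embeddings
        V, *_ = np.linalg.lstsq(Xc, Z, rcond=None)   # feature -> embedding regressor
        learners.append((V, Z, Yc))
    return centroids, learners

def predict(x, centroids, learners, k=5):
    """Route x to its nearest cluster, embed it, and let the k nearest
    embedded training points vote on the labels."""
    c = np.linalg.norm(centroids - x, axis=1).argmin()
    V, Z, Yc = learners[c]
    z = x @ V
    nn = np.linalg.norm(Z - z, axis=1).argsort()[:k]
    return Yc[nn].sum(axis=0)    # label scores = votes of the k neighbours

# toy data: 200 points, 20 features, 50 labels (~2.5 labels per point)
X = rng.standard_normal((200, 20))
Y = (rng.random((200, 50)) < 0.05).astype(float)
centroids, learners = train_local_embeddings(X, Y, n_clusters=4, dim=8)
scores = predict(X[0], centroids, learners)
print(scores.argsort()[::-1][:3])   # indices of the 3 highest-scoring labels
```

An ensemble, as in SLEEC and LocalLEML, would repeat this with several random clusterings and sum the resulting label score vectors before ranking.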
Even though the tail label problem is not acute on these data sets, and SLEEC was restricted to a single learner, SLEEC's predictions could be significantly more accurate than those of all the other methods (except on Delicious, where SLEEC ranked second). For instance, SLEEC outperformed the closest competitor on EurLex by 3% in terms of P@1. Particularly noteworthy is the observation that SLEEC outperformed WSABIE [13], which performs kNN classification on linear embeddings, by as much as 10% on multiple data sets. This demonstrates the superiority of SLEEC's local distance preserving embeddings over traditional low-rank embeddings.

Acknowledgments
We are grateful to Abhishek Kadian for helping with the experiments. Himanshu Jain is supported by a Google India PhD Fellowship at IIT Delhi.

References
[1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, pages 13-24, 2013.
[2] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In ICML, 2013.
[3] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.
[4] M. Cissé, N. Usunier, T. Artières, and P. Gallinari. Robust bloom filters for large multilabel classification tasks. In NIPS, pages 1851-1859, 2013.
[5] F. Tai and H.-T. Lin. Multi-label classification with principal label space transformation. In Workshop Proceedings of Learning from Multi-label Data, 2010.
[6] K. Balasubramanian and G. Lebanon. The landmark selection method for multiple output prediction. In ICML, 2012.
[7] W. Bi and J. T.-Y. Kwok. Efficient multi-label classification with many labels. In ICML, 2013.
[8] Y. Zhang and J. G. Schneider. Multi-label output codes using canonical correlation analysis. In AISTATS, pages 873-882, 2011.
[9] H.-F. Yu, P.
Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.
[10] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, pages 1538-1546, 2012.
[11] C.-S. Feng and H.-T. Lin. Multi-label classification with error-correcting codes. JMLR, 20, 2011.
[12] S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspace for multi-label classification. In KDD, 2008.
[13] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
[14] Z. Lin, G. Ding, M. Hu, and J. Wang. Multi-label classification via feature-aware implicit label space encoding. In ICML, pages 325-333, 2014.
[15] Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, pages 263-272, 2014.
[16] Wikipedia dataset for the 4th large scale hierarchical text classification challenge, 2014.
[17] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS, 2002.
[18] K. Q. Weinberger and L. K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, pages 1683-1686, 2006.
[19] B. Shaw and T. Jebara. Minimum volume embedding. In AISTATS, pages 460-467, 2007.
[20] P. Jain, R. Meka, and I. S. Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, pages 937-945, 2010.
[21] P. Sprechmann, R. Litman, T. B. Yakar, A. Bronstein, and G. Sapiro. Efficient supervised sparse analysis and synthesis operators. In NIPS, 2013.
[22] P. Kar, K. B. Sriperumbudur, P. Jain, and H. Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In ICML, 2013.
[23] J. Leskovec and A. Krevl.
SNAP Datasets: Stanford large network dataset collection, 2014.
[24] R. Wetzker, C. Zimmermann, and C. Bauckhage. Analyzing social bookmarking systems: A del.icio.us cookbook. In Mining Social Data (MSoDa) Workshop Proceedings, ECAI, pages 26-30, July 2008.
[25] A. Zubiaga. Enhancing navigation on Wikipedia with social tags, 2009.
[26] I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD 2008 Discovery Challenge, 2008.
[27] C. Snoek, M. Worring, J. van Gemert, J.-M. Geusebroek, and A. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM Multimedia, 2006.
[28] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In ECML/PKDD, 2008.
[29] E. Loza Mencía and J. Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In ECML/PKDD, 2008.
[30] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, pages 1538-1546, 2012.
[31] B. Hariharan, S. V. N. Vishwanathan, and M. Varma. Efficient max-margin multi-label classification with applications to zero-shot learning. ML, 2012.