{"title": "Top-k Multiclass SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "Class ambiguity is typical in image classification problems with a large number of classes. When classes are difficult to discriminate, it makes sense to allow k guesses and evaluate classifiers based on the top-k error instead of the standard zero-one loss. We propose top-k multiclass SVM as a direct method to optimize for top-k performance. Our generalization of the well-known multiclass SVM is based on a tight convex upper bound of the top-k error. We propose a fast optimization scheme based on an efficient projection onto the top-k simplex, which is of its own interest. Experiments on five datasets show consistent improvements in top-k accuracy compared to various baselines.", "full_text": "Top-k Multiclass SVM\n\nMaksim Lapin,1 Matthias Hein2 and Bernt Schiele1\n\n1Max Planck Institute for Informatics, Saarbr\u00fccken, Germany\n\n2Saarland University, Saarbr\u00fccken, Germany\n\nAbstract\n\nClass ambiguity is typical in image classi\ufb01cation problems with a large number\nof classes. When classes are dif\ufb01cult to discriminate, it makes sense to allow k\nguesses and evaluate classi\ufb01ers based on the top-k error instead of the standard\nzero-one loss. We propose top-k multiclass SVM as a direct method to optimize\nfor top-k performance. Our generalization of the well-known multiclass SVM is\nbased on a tight convex upper bound of the top-k error. We propose a fast opti-\nmization scheme based on an ef\ufb01cient projection onto the top-k simplex, which is\nof its own interest. Experiments on \ufb01ve datasets show consistent improvements in\ntop-k accuracy compared to various baselines.\n\n1\n\nIntroduction\n\nFigure 1: Images from SUN 397 [29] illustrating\nclass ambiguity. Top: (left to right) Park, River,\nPond. 
Bottom: Park, Campus, Picnic area.\n\nAs the number of classes increases, two im-\nportant issues emerge: class overlap and multi-\nlabel nature of examples [9]. This phenomenon\nasks for adjustments of both the evaluation met-\nrics as well as the loss functions employed.\nWhen a predictor is allowed k guesses and is\nnot penalized for k \u2212 1 mistakes, such an eval-\nuation measure is known as top-k error. We ar-\ngue that this is an important metric that will in-\nevitably receive more attention in the future as\nthe illustration in Figure 1 indicates.\nHow obvious is it that each row of Figure 1 shows examples of different classes? Can we imagine\na human to predict correctly on the \ufb01rst attempt? Does it even make sense to penalize a learning\nsystem for such \u201cmistakes\u201d? While the problem of class ambiguity is apparent in computer vision,\nsimilar problems arise in other domains when the number of classes becomes large.\nWe propose top-k multiclass SVM as a generalization of the well-known multiclass SVM [5]. It\nis based on a tight convex upper bound of the top-k zero-one loss which we call top-k hinge loss.\nWhile it turns out to be similar to a top-k version of the ranking based loss proposed by [27], we\nshow that the top-k hinge loss is a lower bound on their version and is thus a tighter bound on the\ntop-k zero-one loss. We propose an ef\ufb01cient implementation based on stochastic dual coordinate\nascent (SDCA) [24]. A key ingredient in the optimization is the (biased) projection onto the top-k\nsimplex. This projection turns out to be a tricky generalization of the continuous quadratic knapsack\nproblem, respectively the projection onto the standard simplex. The proposed algorithm for solving\nit has complexity O(m log m) for x \u2208 Rm. Our implementation of the top-k multiclass SVM scales\nto large datasets like Places 205 with about 2.5 million examples and 205 classes [30]. 
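To fix ideas, the top-k error used throughout is computable directly from the matrix of prediction scores; a minimal NumPy sketch (the function name and array layout are ours, not from the paper's released code):

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Top-k zero-one error: fraction of examples whose true label is
    not among the k largest prediction scores <w_y, x_i>.

    scores: (n, m) array of scores, labels: (n,) true class indices.
    """
    # unordered indices of the k largest scores in each row
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()
```

For k = 1 this reduces to the standard zero-one error, and for k = m it is always 0, matching the regime 1 <= k < m discussed in Section 2.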
Finally, extensive experiments on several challenging computer vision problems show that top-k multiclass SVM consistently improves in top-k error over the multiclass SVM (equivalent to our top-1 multiclass SVM), one-vs-all SVM and other methods based on different ranking losses [11, 16].

2 Top-k Loss in Multiclass Classification

In multiclass classification, one is given a set $S = \{(x_i, y_i) \mid i = 1, \ldots, n\}$ of $n$ training examples $x_i \in \mathcal{X}$ along with the corresponding labels $y_i \in \mathcal{Y}$. Let $\mathcal{X} = \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{1, \ldots, m\}$ the set of labels. The task is to learn a set of $m$ linear predictors $w_y \in \mathbb{R}^d$ such that the risk of the classifier $\arg\max_{y \in \mathcal{Y}} \langle w_y, x \rangle$ is minimized for a given loss function, which is usually chosen to be a convex upper bound of the zero-one loss. The generalization to nonlinear predictors using kernels is discussed below.

The classification problem becomes extremely challenging in the presence of a large number of ambiguous classes. It is natural in that case to extend the evaluation protocol to allow $k$ guesses, which leads to the popular top-k error and top-k accuracy performance measures. Formally, we consider a ranking of labels induced by the prediction scores $\langle w_y, x \rangle$. Let the bracket $[\cdot]$ denote a permutation of labels such that $[j]$ is the index of the $j$-th largest score, i.e.

$$\langle w_{[1]}, x \rangle \geq \langle w_{[2]}, x \rangle \geq \ldots \geq \langle w_{[m]}, x \rangle.$$

The top-k zero-one loss $\operatorname{err}_k$ is defined as

$$\operatorname{err}_k(f(x), y) = \mathbf{1}_{\langle w_{[k]}, x \rangle > \langle w_y, x \rangle},$$

where $f(x) = (\langle w_1, x \rangle, \ldots, \langle w_m, x \rangle)^\top$ and $\mathbf{1}_P = 1$ if $P$ is true and $0$ otherwise. Note that the standard zero-one loss is recovered when $k = 1$, and $\operatorname{err}_k(f(x), y)$ is always $0$ for $k = m$. Therefore, we are interested in the regime $1 \leq k < m$.

2.1 Multiclass Support Vector Machine

In this section we review the multiclass SVM of Crammer and Singer [5], which will be extended to the top-k multiclass SVM in the following. We mainly follow the notation of [24]. Given a training pair $(x_i, y_i)$, the multiclass SVM loss on example $x_i$ is defined as

$$\max_{y \in \mathcal{Y}} \{ \mathbf{1}_{y \neq y_i} + \langle w_y, x_i \rangle - \langle w_{y_i}, x_i \rangle \}. \quad (1)$$

Since our optimization scheme is based on Fenchel duality, we also require a convex conjugate of the primal loss function (1). Let $c \triangleq \mathbf{1} - e_{y_i}$, where $\mathbf{1}$ is the all ones vector and $e_j$ is the $j$-th standard basis vector in $\mathbb{R}^m$; let $a \in \mathbb{R}^m$ be defined componentwise as $a_j \triangleq \langle w_j, x_i \rangle - \langle w_{y_i}, x_i \rangle$; and let

$$\Delta \triangleq \{x \in \mathbb{R}^m \mid \langle \mathbf{1}, x \rangle \leq 1,\ 0 \leq x_i,\ i = 1, \ldots, m\}.$$

Proposition 1 ([24], § 5.1). A primal-conjugate pair for the multiclass SVM loss (1) is

$$\phi(a) = \max\{0, (a + c)_{[1]}\}, \qquad \phi^*(b) = \begin{cases} -\langle c, b \rangle & \text{if } b \in \Delta, \\ +\infty & \text{otherwise.} \end{cases} \quad (2)$$

Note that thresholding with 0 in $\phi(a)$ is actually redundant, as $(a + c)_{[1]} \geq (a + c)_{y_i} = 0$, and is only given to enhance similarity to the top-k version defined later.

2.2 Top-k Support Vector Machine

The main motivation for the top-k loss is to relax the penalty for making an error in the top-k predictions. Looking at $\phi$ in (2), a direct extension to the top-k setting would be a function

$$\psi_k(a) = \max\{0, (a + c)_{[k]}\},$$

which incurs a loss iff $(a + c)_{[k]} > 0$. Since the ground truth score $(a + c)_{y_i} = 0$, we conclude that

$$\psi_k(a) > 0 \iff \langle w_{[1]}, x_i \rangle \geq \ldots \geq \langle w_{[k]}, x_i \rangle > \langle w_{y_i}, x_i \rangle - 1,$$

which directly corresponds to the top-k zero-one loss $\operatorname{err}_k$ with margin 1.

Note that the function $\psi_k$ ignores the values of the first $(k - 1)$ scores, which could be quite large if there are highly similar classes. That would be fine in this model as long as the correct prediction is within the first $k$ guesses. However, the function $\psi_k$ is unfortunately nonconvex, since the function $f_k(x) = x_{[k]}$ returning the $k$-th largest coordinate is nonconvex for $k \geq 2$. Therefore, finding a globally optimal solution is computationally intractable.

Instead, we propose the following convex upper bound on $\psi_k$, which we call the top-k hinge loss,

$$\phi_k(a) = \max\Big\{0, \frac{1}{k} \sum_{j=1}^{k} (a + c)_{[j]}\Big\}, \quad (3)$$

where the sum of the $k$ largest components is known to be convex [3]. We have that

$$\psi_k(a) \leq \phi_k(a) \leq \phi_1(a) = \phi(a)$$

for any $k \geq 1$ and $a \in \mathbb{R}^m$. Moreover, $\phi_k(a) < \phi(a)$ unless all $k$ largest scores are the same. This extra slack can be used to increase the margin between the current and the $(m - k)$ remaining least similar classes, which should then lead to an improvement in the top-k metric.

2.2.1 Top-k Simplex and Convex Conjugate of the Top-k Hinge Loss

In this section we derive the conjugate of the proposed loss (3). We begin with a well known result that is used later in the proof. All proofs can be found in the supplement. Let $[a]_+ = \max\{0, a\}$.

Lemma 1 ([17], Lemma 1). $\sum_{j=1}^{k} h_{[j]} = \min_t \big\{ kt + \sum_{j=1}^{m} [h_j - t]_+ \big\}$.

We also define a set $\Delta_k$ which arises naturally as the effective domain¹ of the conjugate of (3). By analogy, we call it the top-k simplex, as for $k = 1$ it reduces to the standard simplex with the inequality constraint (i.e. $0 \in \Delta_k$). Let $[m] \triangleq \{1, \ldots, m\}$.

Definition 1. The top-k simplex is a convex polytope defined as

$$\Delta_k(r) \triangleq \Big\{ x \ \Big|\ \langle \mathbf{1}, x \rangle \leq r,\ 0 \leq x_i \leq \tfrac{1}{k} \langle \mathbf{1}, x \rangle,\ i \in [m] \Big\},$$

where $r \geq 0$ is the bound on the sum $\langle \mathbf{1}, x \rangle$. We let $\Delta_k \triangleq \Delta_k(1)$.

The crucial difference to the standard simplex is the upper bound on the $x_i$'s, which limits their maximal contribution to the total sum $\langle \mathbf{1}, x \rangle$. See Figure 2 for an illustration.

[Figure 2: Top-k simplex $\Delta_k(1)$ for $m = 3$. Unlike the standard simplex, it has $\binom{m}{k} + 1$ vertices.]

The first technical contribution of this work is as follows.

Proposition 2. A primal-conjugate pair for the top-k hinge loss (3) is given as follows:

$$\phi_k(a) = \max\Big\{0, \frac{1}{k} \sum_{j=1}^{k} (a + c)_{[j]}\Big\}, \qquad \phi_k^*(b) = \begin{cases} -\langle c, b \rangle & \text{if } b \in \Delta_k, \\ +\infty & \text{otherwise.} \end{cases} \quad (4)$$

Moreover, $\phi_k(a) = \max\{\langle a + c, \lambda \rangle \mid \lambda \in \Delta_k\}$.

Therefore, we see that the proposed formulation (3) naturally extends the multiclass SVM of Crammer and Singer [5], which is recovered when $k = 1$. We have also obtained an interesting extension (or rather contraction, since $\Delta_k \subset \Delta$) of the standard simplex.

¹ A convex function $f : X \to \mathbb{R} \cup \{\pm\infty\}$ has an effective domain $\operatorname{dom} f = \{x \in X \mid f(x) < +\infty\}$.

2.3 Relation of the Top-k Hinge Loss to Ranking Based Losses

Usunier et al. [27] have recently formulated a very general family of convex losses for ranking and multiclass classification. In their framework, the hinge loss on example $x_i$ can be written as

$$L_\beta(a) = \sum_{y=1}^{m} \beta_y \max\{0, (a + c)_{[y]}\},$$

where $\beta_1 \geq \ldots \geq \beta_m \geq 0$ is a non-increasing sequence of non-negative numbers which act as weights for the ordered losses.

The relation to the top-k hinge loss becomes apparent if we choose $\beta_j = \frac{1}{k}$ if $j \leq k$, and 0 otherwise. In that case, we obtain another version of the top-k hinge loss

$$\tilde{\phi}_k(a) = \frac{1}{k} \sum_{j=1}^{k} \max\{0, (a + c)_{[j]}\}. \quad (5)$$

It is straightforward to check that

$$\psi_k(a) \leq \phi_k(a) \leq \tilde{\phi}_k(a) \leq \phi_1(a) = \tilde{\phi}_1(a) = \phi(a).$$

The bound $\phi_k(a) \leq \tilde{\phi}_k(a)$ holds with equality if $(a + c)_{[1]} \leq 0$ or $(a + c)_{[k]} \geq 0$. Otherwise, there is a gap and our top-k loss is a strictly better upper bound on the actual top-k zero-one loss. We perform extensive evaluation and comparison of both versions of the top-k hinge loss in § 5.

While [27] employed LaRank [1], and [9, 28] optimized an approximation of $L_\beta(a)$, we show in the supplement how the loss function (5) can be optimized exactly and efficiently within the Prox-SDCA framework.

Multiclass to binary reduction. It is also possible to compare directly to ranking based methods that solve a binary problem using the following reduction. We employ it in our experiments to evaluate the ranking based methods SVMPerf [11] and TopPush [16]. The trick is to augment the training set by embedding each $x_i \in \mathbb{R}^d$ into $\mathbb{R}^{md}$ using a feature map $\Phi_y$ for each $y \in \mathcal{Y}$. The mapping $\Phi_y$ places $x_i$ at the $y$-th position in $\mathbb{R}^{md}$ and puts zeros everywhere else. The example $\Phi_{y_i}(x_i)$ is labeled $+1$ and all $\Phi_y(x_i)$ for $y \neq y_i$ are labeled $-1$. Therefore, we have a new training set with $mn$ examples and $md$ dimensional (sparse) features. Moreover, $\langle w, \Phi_y(x_i) \rangle = \langle w_y, x_i \rangle$, which establishes the relation to the original multiclass problem.

Another approach to general performance measures is given in [11].
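The reduction just described is mechanical to implement; a small sketch of the feature map $\Phi_y$ and the induced binary training set (dense arrays for clarity, although the actual features are sparse; the helper names are ours):

```python
import numpy as np

def phi(x, y, m):
    """Feature map Phi_y: place x in the y-th of m blocks of R^{m*d}."""
    d = x.shape[0]
    out = np.zeros(m * d)
    out[y * d:(y + 1) * d] = x
    return out

def binary_reduction(X, labels, m):
    """Turn n multiclass examples (rows of X) into m*n binary examples:
    Phi_{y_i}(x_i) is labeled +1, Phi_y(x_i) with y != y_i is labeled -1."""
    feats, signs = [], []
    for x, yi in zip(X, labels):
        for y in range(m):
            feats.append(phi(x, y, m))
            signs.append(1 if y == yi else -1)
    return np.array(feats), np.array(signs)
```

Since a stacked weight vector $w = (w_1, \ldots, w_m)$ satisfies $\langle w, \Phi_y(x_i) \rangle = \langle w_y, x_i \rangle$, scores of the binary classifier on the embedded examples coincide with the original multiclass scores.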
It turns out that using the above reduction, one can show that under certain constraints on the classifier, the recall@k is equivalent to the top-k error. A convex upper bound on recall@k is then optimized in [11] via structured SVM. As their convex upper bound on the recall@k is not decomposable into an instance based loss, it is not directly comparable to our loss. While being theoretically very elegant, the approach of [11] does not scale to very large datasets.

3 Optimization Framework

We begin with a general $\ell_2$-regularized multiclass classification problem, where for notational convenience we keep the loss function unspecified. The multiclass SVM or the top-k multiclass SVM are obtained by plugging in the corresponding loss function from § 2.

3.1 Fenchel Duality for $\ell_2$-Regularized Multiclass Classification Problems

Let $X \in \mathbb{R}^{d \times n}$ be the matrix of training examples $x_i \in \mathbb{R}^d$, let $W \in \mathbb{R}^{d \times m}$ be the matrix of primal variables obtained by stacking the vectors $w_y \in \mathbb{R}^d$, and let $A \in \mathbb{R}^{m \times n}$ be the matrix of dual variables. Before we prove our main result of this section (Theorem 1), we first impose a technical constraint on a loss function to be compatible with the choice of the ground truth coordinate. The top-k hinge loss from Section 2 satisfies this requirement, as we show in Proposition 3. We also prove an auxiliary Lemma 2, which is then used in Theorem 1.

Definition 2. A convex function $\phi$ is $j$-compatible if for any $y \in \mathbb{R}^m$ with $y_j = 0$ we have that

$$\sup\{\langle y, x \rangle - \phi(x) \mid x_j = 0\} = \phi^*(y).$$

This constraint is needed to prove equality in the following lemma.

Lemma 2. Let $\phi$ be $j$-compatible, let $H_j = I - \mathbf{1} e_j^\top$, and let $\Phi(x) = \phi(H_j x)$. Then

$$\Phi^*(y) = \begin{cases} \phi^*(y - y_j e_j) & \text{if } \langle \mathbf{1}, y \rangle = 0, \\ +\infty & \text{otherwise.} \end{cases}$$

We can now use Lemma 2 to compute convex conjugates of the loss functions.

Theorem 1. Let $\phi_i$ be $y_i$-compatible for each $i \in [n]$, let $\lambda > 0$ be a regularization parameter, and let $K = X^\top X$ be the Gram matrix. The primal and Fenchel dual objective functions are given as:

$$P(W) = \frac{1}{n} \sum_{i=1}^{n} \phi_i\big(W^\top x_i - \langle w_{y_i}, x_i \rangle \mathbf{1}\big) + \frac{\lambda}{2} \operatorname{tr}\big(W^\top W\big),$$

$$D(A) = -\frac{1}{n} \sum_{i=1}^{n} \phi_i^*\big({-\lambda n (a_i - a_{y_i,i} e_{y_i})}\big) - \frac{\lambda}{2} \operatorname{tr}\big(A K A^\top\big) \ \text{if } \langle \mathbf{1}, a_i \rangle = 0\ \forall i, \text{ and } -\infty \text{ otherwise.}$$

Moreover, we have that $W = X A^\top$ and $W^\top x_i = A K_i$, where $K_i$ is the $i$-th column of $K$.

Finally, we show that Theorem 1 applies to the loss functions that we consider.

Proposition 3. The top-k hinge loss function from Section 2 is $y_i$-compatible.

We have repeated the derivation from Section 5.7 in [24], as there is a typo in the optimization problem (20) leading to the conclusion that $a_{y_i,i}$ must be 0 at the optimum. Lemma 2 fixes this by making the requirement $a_{y_i,i} = -\sum_{j \neq y_i} a_{j,i}$ explicit. Note that this modification is already mentioned in their pseudo-code for Prox-SDCA.

3.2 Optimization of Top-k Multiclass SVM via Prox-SDCA

As an optimization scheme, we employ the proximal stochastic dual coordinate ascent (Prox-SDCA) framework of Shalev-Shwartz and Zhang [24], which has strong convergence guarantees and is easy to adapt to our problem. In particular, we iteratively update a batch $a_i \in \mathbb{R}^m$ of dual variables corresponding to the training pair $(x_i, y_i)$, so as to maximize the dual objective $D(A)$ from Theorem 1. We also maintain the primal variables $W = X A^\top$ and stop when the relative duality gap is below $\epsilon$. This procedure is summarized in Algorithm 1.

Algorithm 1 Top-k Multiclass SVM
1: Input: training data $\{(x_i, y_i)\}_{i=1}^n$, parameters $k$ (loss), $\lambda$ (regularization), $\epsilon$ (stopping condition)
2: Output: $W \in \mathbb{R}^{d \times m}$, $A \in \mathbb{R}^{m \times n}$
3: Initialize: $W \leftarrow 0$, $A \leftarrow 0$
4: repeat
5:   randomly permute training data
6:   for $i = 1$ to $n$ do
7:     $s_i \leftarrow W^\top x_i$  {prediction scores}
8:     $a_i^{\text{old}} \leftarrow a_i$  {cache previous values}
9:     $a_i \leftarrow \text{update}(k, \lambda, \|x_i\|^2, y_i, s_i, a_i)$  {see § 3.2.1 for details}
10:    $W \leftarrow W + x_i (a_i - a_i^{\text{old}})^\top$  {rank-1 update}
11:  end for
12: until relative duality gap is below $\epsilon$

Let us make a few comments on the advantages of the proposed method. First, apart from the update step which we discuss below, all main operations can be computed using a BLAS library, which makes the overall implementation efficient. Second, the update step in Line 9 is optimal in the sense that it yields the maximal dual objective increase jointly over $m$ variables. This is opposed to SGD updates with data-independent step sizes, as well as to maximal but scalar updates in other SDCA variants. Finally, we have a well-defined stopping criterion, as we can compute the duality gap (see discussion in [2]). The latter is especially attractive if there is a time budget for learning. The algorithm can also be easily kernelized, since $W^\top x_i = A K_i$ (cf. Theorem 1).

3.2.1 Dual Variables Update

For the proposed top-k hinge loss from Section 2, optimization of the dual objective $D(A)$ over $a_i \in \mathbb{R}^m$, with the other variables fixed, is an instance of a regularized (biased) projection problem onto the top-k simplex $\Delta_k(\frac{1}{\lambda n})$. Let $a^{\backslash j}$ be obtained by removing the $j$-th coordinate from vector $a$.

Proposition 4. The following two problems are equivalent with $a_i^{\backslash y_i} = -x$ and $a_{y_i,i} = \langle \mathbf{1}, x \rangle$:

$$\max_{a_i} \{D(A) \mid \langle \mathbf{1}, a_i \rangle = 0\} \ \equiv\ \min_x \big\{\|b - x\|^2 + \rho \langle \mathbf{1}, x \rangle^2 \ \big|\ x \in \Delta_k(\tfrac{1}{\lambda n})\big\},$$

where $b = \frac{1}{\langle x_i, x_i \rangle}\big(q^{\backslash y_i} + (1 - q_{y_i}) \mathbf{1}\big)$, $q = W^\top x_i - \langle x_i, x_i \rangle a_i$, and $\rho = 1$.

We discuss in the following section how to project onto the set $\Delta_k(\frac{1}{\lambda n})$ efficiently.

4 Efficient Projection onto the Top-k Simplex

One of our main technical results is an algorithm for efficiently computing projections onto $\Delta_k(r)$, respectively the biased projection introduced in Proposition 4. The optimization problem in Proposition 4 reduces to the Euclidean projection onto $\Delta_k(r)$ for $\rho = 0$, and for $\rho > 0$ it biases the solution to be orthogonal to $\mathbf{1}$.
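For orientation, in the special case $k = 1$, $\rho = 0$, with the sum constraint active, the problem is the classical Euclidean projection onto the simplex, which a short sort-based routine solves (our own reference implementation, not the variable-fixing method of [13] that the paper actually uses):

```python
import numpy as np

def project_simplex(a, r=1.0):
    """Euclidean projection of a onto {x : x >= 0, <1, x> = r} via sorting.

    This is the u = +inf special case of the CQKP discussed below; the
    top-k cap u = r/k requires the full knapsack treatment.
    """
    a = np.asarray(a, dtype=float)
    u = np.sort(a)[::-1]                      # descending
    css = np.cumsum(u)
    idx = np.arange(1, a.size + 1)
    rho = idx[u * idx > css - r][-1]          # largest feasible support size
    theta = (css[rho - 1] - r) / rho          # water-filling threshold
    return np.maximum(a - theta, 0.0)
```

The O(d log d) cost of the sort is what the surveys [18, 19] improve on; the point here is only the thresholded form x = max(0, a - theta), which reappears (with an additional cap u) in Lemmas 3 and 5 below.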
Let us highlight that $\Delta_k(r)$ is substantially different from the standard simplex, and none of the existing methods can be used, as we discuss below.

4.1 Continuous Quadratic Knapsack Problem

Finding the Euclidean projection onto the simplex is an instance of the general optimization problem $\min_x \{\|a - x\|_2^2 \mid \langle b, x \rangle \leq r,\ l \leq x_i \leq u\}$, known as the continuous quadratic knapsack problem (CQKP). For example, to project onto the simplex we set $b = \mathbf{1}$, $l = 0$ and $r = u = 1$. This is a well examined problem and several highly efficient algorithms are available (see the surveys [18, 19]).

The first main difference to our set is the upper bound on the $x_i$'s. All existing algorithms expect that $u$ is fixed, which allows them to consider decompositions $\min_{x_i} \{(a_i - x_i)^2 \mid l \leq x_i \leq u\}$ which can be solved in closed form. In our case, the upper bound $\frac{1}{k} \langle \mathbf{1}, x \rangle$ introduces coupling across all variables, which makes the existing algorithms not applicable. A second main difference is the bias term $\rho \langle \mathbf{1}, x \rangle^2$ added to the objective. The additional difficulty introduced by this term is relatively minor. Thus we solve the problem for general $\rho$ (including $\rho = 0$ for the Euclidean projection onto $\Delta_k(r)$), even though we need only $\rho = 1$ in Proposition 4. The only case when our problem reduces to CQKP is when the constraint $\langle \mathbf{1}, x \rangle \leq r$ is satisfied with equality. In that case we can let $u = r/k$ and use any algorithm for the knapsack problem. We choose [13] since it is easy to implement, does not require sorting, and scales linearly in practice. The bias in the projection problem reduces to a constant $\rho r^2$ in this case and has, therefore, no effect.

4.2 Projection onto the Top-k Cone

When the constraint $\langle \mathbf{1}, x \rangle \leq r$ is not satisfied with equality at the optimum, it has essentially no influence on the projection problem and can be removed. In that case we are left with the problem of the (biased) projection onto the top-k cone, which we address with the following lemma.

Lemma 3. Let $x^* \in \mathbb{R}^d$ be the solution to the following optimization problem

$$\min_x \big\{\|a - x\|^2 + \rho \langle \mathbf{1}, x \rangle^2 \ \big|\ 0 \leq x_i \leq \tfrac{1}{k} \langle \mathbf{1}, x \rangle,\ i \in [d]\big\}, \quad (6)$$

and let $U \triangleq \{i \mid x_i^* = \tfrac{1}{k} \langle \mathbf{1}, x^* \rangle\}$, $M \triangleq \{i \mid 0 < x_i^* < \tfrac{1}{k} \langle \mathbf{1}, x^* \rangle\}$, $L \triangleq \{i \mid x_i^* = 0\}$.

1. If $U = \emptyset$ and $M = \emptyset$, then $x^* = 0$.

2. If $U \neq \emptyset$ and $M = \emptyset$, then $U = \{[1], \ldots, [k]\}$ and $x_i^* = \frac{1}{k + \rho k^2} \sum_{j=1}^{k} a_{[j]}$ for $i \in U$, where $[j]$ is the index of the $j$-th largest component in $a$.

3. Otherwise ($M \neq \emptyset$), the following system of linear equations holds:

$$D = (k - |U|)^2 + (|U| + \rho k^2)\,|M|,$$
$$u = \Big(|M| \sum_{i \in U} a_i + (k - |U|) \sum_{i \in M} a_i\Big)\Big/D,$$
$$t' = \Big(|U|(1 + \rho k) \sum_{i \in M} a_i - (k - |U| + \rho k |M|) \sum_{i \in U} a_i\Big)\Big/D,$$

together with the feasibility constraints on $t \triangleq t' + \rho u k$:

$$\max_{i \in L} a_i \leq t \leq \min_{i \in M} a_i, \qquad \max_{i \in M} a_i \leq t + u \leq \min_{i \in U} a_i, \quad (7)$$

and we have $x^* = \min\{\max\{0, a - t\}, u\}$.

We now show how to check if the (biased) projection is 0. For the standard simplex, where the cone is the positive orthant $\mathbb{R}^d_+$, the projection is 0 when all $a_i \leq 0$. It is slightly more involved for $\Delta_k$.

Lemma 4. The biased projection $x^*$ onto the top-k cone is zero if $\sum_{i=1}^{k} a_{[i]} \leq 0$ (sufficient condition). If $\rho = 0$ this is also necessary.

Projection. Lemmas 3 and 4 suggest a simple algorithm for the (biased) projection onto the top-k cone. First, we check if the projection is constant (cases 1 and 2 in Lemma 3). In case 2, we compute $x$ and check if it is compatible with the corresponding sets $U$, $M$, $L$. In the general case 3, we suggest a simple exhaustive search strategy. We sort $a$ and loop over the feasible partitions $U$, $M$, $L$ until we find a solution to (6) that satisfies (7). Since we know that $0 \leq |U| < k$ and $k \leq |U| + |M| \leq d$, we can limit the search to $(k - 1)(d - k + 1)$ iterations in the worst case, where each iteration requires a constant number of operations. For the biased projection, we leave $x = 0$ as the fallback case, as Lemma 4 gives only a sufficient condition. This yields a runtime complexity of $O(d \log d + kd)$, which is comparable to simplex projection algorithms based on sorting.

4.3 Projection onto the Top-k Simplex

As we argued in § 4.1, the (biased) projection onto the top-k simplex becomes either the knapsack problem or the (biased) projection onto the top-k cone, depending on the constraint $\langle \mathbf{1}, x \rangle \leq r$ at the optimum. The following lemma provides a way to check which of the two cases applies.

Lemma 5. Let $x^* \in \mathbb{R}^d$ be the solution to the following optimization problem

$$\min_x \big\{\|a - x\|^2 + \rho \langle \mathbf{1}, x \rangle^2 \ \big|\ \langle \mathbf{1}, x \rangle \leq r,\ 0 \leq x_i \leq \tfrac{1}{k} \langle \mathbf{1}, x \rangle,\ i \in [d]\big\},$$

let $(t, u)$ be the optimal thresholds such that $x^* = \min\{\max\{0, a - t\}, u\}$, and let $U$ be defined as in Lemma 3. Then it must hold that $\lambda = t + \frac{p}{k} - \rho r \geq 0$, where $p = \sum_{i \in U} a_i - |U|(t + u)$.

Projection. We can now use Lemma 5 to compute the (biased) projection onto $\Delta_k(r)$ as follows. First, we check the special cases of zero and constant projections, as we did before. If that fails, we proceed with the knapsack problem, since it is faster to solve. Having the thresholds $(t, u)$ and the partitioning into the sets $U$, $M$, $L$, we compute the value of $\lambda$ as given in Lemma 5. If $\lambda \geq 0$, we are done. Otherwise, we know that $\langle \mathbf{1}, x \rangle < r$ and go directly to the general case 3 in Lemma 3.

5 Experimental Results

We have two main goals in the experiments. First, we show that the (biased) projection onto the top-k simplex is scalable and comparable to an efficient algorithm [13] for the simplex projection (see the supplement). Second, we show that the top-k multiclass SVM using both versions of the top-k hinge loss (3) and (5), denoted top-k SVMα and top-k SVMβ respectively, leads to improvements in top-k accuracy consistently over all datasets and choices of k. In particular, we note improvements compared to the multiclass SVM of Crammer and Singer [5], which corresponds to top-1 SVMα / top-1 SVMβ. We release our implementation of the projection procedures and both SDCA solvers as a C++ library² with a Matlab interface.

5.1 Image Classification Experiments

We evaluate our method on five image classification datasets of different scale and complexity: Caltech 101 Silhouettes [26] (m = 101, n = 4100), MIT Indoor 67 [20] (m = 67, n = 5354), SUN 397 [29] (m = 397, n = 19850), Places 205 [30] (m = 205, n = 2448873), and ImageNet 2012 [22] (m = 1000, n = 1281167). For Caltech, d = 784, and for the others d = 4096.
The results on the two large scale datasets are in the supplement.

We cross-validate hyper-parameters in the range $10^{-5}$ to $10^{3}$, extending it when the optimal value is at the boundary. We use LibLinear [7] for SVM-OVA, SVMPerf [11] with the corresponding loss function for Recall@k, and the code provided by [16] for TopPush. When a ranking method like Recall@k and TopPush does not scale to a particular dataset using the reduction of the multiclass to a binary problem discussed in § 2.3, we use the one-vs-all version of the corresponding method. We implemented Wsabie++ (denoted W++, Q/m) based on the pseudo-code from Table 3 in [9].

On Caltech 101, we use features provided by [26]. For the other datasets, we extract CNN features of a pre-trained CNN (fc7 layer after ReLU). For the scene recognition datasets, we use the Places 205 CNN [30], and for ILSVRC 2012 we use the Caffe reference model [10].

² https://github.com/mlapin/libsdca

Caltech 101 Silhouettes
Method         Top-1  Top-2  Top-3  Top-4  Top-5  Top-10
Top-1 [26]     62.1   -      79.6   -      83.1   -
Top-2 [26]     61.4   -      79.2   -      83.4   -
Top-5 [26]     60.2   -      78.7   -      83.4   -
SVM-OVA        61.81  73.13  76.25  77.76  78.89  83.57
TopPush        63.11  75.16  78.46  80.19  81.97  86.95
Recall@1       61.55  73.13  77.03  79.41  80.97  85.18
Recall@5       61.60  72.87  76.51  78.76  80.54  84.74
Recall@10      61.51  72.95  76.46  78.72  80.54  84.92
W++, 0/256     62.68  76.33  79.41  81.71  83.18  88.95
W++, 1/256     59.25  65.63  69.22  71.09  72.95  79.71
W++, 2/256     55.09  61.81  66.02  68.88  70.61  76.59
top-1 SVMα     62.81  74.60  77.76  80.02  81.97  86.91
top-10 SVMα    62.98  77.33  80.49  82.66  84.57  89.55
top-20 SVMα    59.21  75.64  80.88  83.49  85.39  90.33
top-1 SVMβ     62.81  74.60  77.76  80.02  81.97  86.91
top-10 SVMβ    64.02  77.11  80.49  83.01  84.87  89.42
top-20 SVMβ    63.37  77.24  81.06  83.31  85.18  90.03

MIT Indoor 67
State of the art (Top-1): BLH [4] 48.3; SP [25] 51.4; JVJ [12] 63.10; DGE [6] 66.87; ZLX [30] 68.24; GWG [8] 68.88; RAS [21] 69.0; KL [14] 70.1.
Method         Top-1  Top-2  Top-3  Top-4  Top-5  Top-10
SVM-OVA        71.72  81.49  84.93  86.49  87.39  90.45
TopPush        70.52  83.13  86.94  90.00  91.64  95.90
Recall@1       71.57  83.06  87.69  90.45  92.24  96.19
Recall@5       71.49  81.49  85.45  87.24  88.21  92.01
Recall@10      71.42  81.49  85.52  87.24  88.28  92.16
W++, 0/256     70.07  84.10  89.48  92.46  94.48  97.91
W++, 1/256     68.13  81.49  86.64  89.63  91.42  95.45
W++, 2/256     64.63  78.43  84.18  88.13  89.93  94.55
top-1 SVMα     73.96  85.22  89.25  91.94  93.43  96.94
top-10 SVMα    70.00  85.45  90.00  93.13  94.63  97.76
top-20 SVMα    65.90  84.10  89.93  92.69  94.25  97.54
top-1 SVMβ     73.96  85.22  89.25  91.94  93.43  96.94
top-10 SVMβ    71.87  85.30  90.45  93.36  94.40  97.76
top-20 SVMβ    71.94  85.30  90.07  92.46  94.33  97.39

SUN 397 (10 splits)
State of the art (Top-1 accuracy): XHE [29] 38.0; SPM [23] 47.2±0.2; LSH [15] 49.48±0.3; GWG [8] 51.98; ZLX [30] 54.32±0.1; KL [14] 54.65±0.2.
Method         Top-1      Top-2      Top-3      Top-4      Top-5      Top-10
SVM-OVA        55.23±0.6  66.23±0.6  70.81±0.4  73.30±0.2  74.93±0.2  79.00±0.3
TopPush-OVA    53.53±0.3  65.39±0.3  71.46±0.2  75.25±0.1  77.95±0.2  85.15±0.3
Recall@1-OVA   52.95±0.2  65.49±0.2  71.86±0.2  75.88±0.2  78.72±0.2  86.03±0.2
Recall@5-OVA   50.72±0.2  64.74±0.3  70.75±0.3  74.02±0.3  76.06±0.3  80.66±0.2
Recall@10-OVA  50.92±0.2  64.94±0.2  70.95±0.2  74.14±0.2  76.21±0.2  80.68±0.2
top-1 SVMα     58.16±0.2  71.66±0.2  78.22±0.1  82.29±0.2  84.98±0.2  91.48±0.2
top-10 SVMα    58.00±0.2  73.65±0.1  80.80±0.1  84.81±0.2  87.45±0.2  93.40±0.2
top-20 SVMα    55.98±0.3  72.51±0.2  80.22±0.2  84.54±0.2  87.37±0.2  93.62±0.2
top-1 SVMβ     58.16±0.2  71.66±0.2  78.22±0.1  82.29±0.2  84.98±0.2  91.48±0.2
top-10 SVMβ    59.32±0.1  74.13±0.2  80.91±0.2  84.92±0.2  87.49±0.2  93.36±0.2
top-20 SVMβ    58.65±0.2  73.96±0.2  80.95±0.2  85.05±0.2  87.70±0.2  93.64±0.2

Table 1: Top-k accuracy (%). Top section: State of the art. Middle section: Baseline methods. Bottom section: Top-k SVMs: top-k SVMα, with the loss (3); top-k SVMβ, with the loss (5).

Experimental results are given in Table 1. First, we note that our method is scalable to large datasets with millions of training examples, such as Places 205 and ILSVRC 2012 (results in the supplement). Second, we observe that optimizing the top-k hinge loss (both versions) yields consistently better top-k performance. This might come at the cost of a decreased top-1 accuracy (e.g. on MIT Indoor 67), but, interestingly, may also result in a noticeable increase in the top-1 accuracy on larger datasets like Caltech 101 Silhouettes and SUN 397. This resonates with our argumentation that optimizing for top-k is often more appropriate for datasets with a large number of classes.

Overall, we get a systematic increase in top-k accuracy over all datasets that we examined. For example, we get the following improvements in top-5 accuracy with our top-10 SVMα compared to top-1 SVMα: +2.6% on Caltech 101, +1.2% on MIT Indoor 67, and +2.5% on SUN 397.

6 Conclusion

We demonstrated scalability and effectiveness of the proposed top-k multiclass SVM on five image recognition datasets, leading to consistent improvements in top-k performance.
In the future, one could study whether the top-k hinge loss (3) can be generalized to the family of ranking losses [27]. Similarly to the top-k loss, this could lead to tighter convex upper bounds on the corresponding discrete losses.

References

[1] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support vector machines with LaRank. In ICML, pages 89–96, 2007.
[2] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] S. Bu, Z. Liu, J. Han, and J. Wu. Superpixel segmentation based structural scene recognition. In ACM MM, pages 681–684, 2013.
[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[6] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In NIPS, pages 494–502, 2013.
[7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[8] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
[9] M. R. Gupta, S. Bengio, and J. Weston. Training highly multiclass classifiers. JMLR, 15:1461–1492, 2014.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[11] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.
[12] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman. Blocks that shout: distinctive parts for scene classification.
In CVPR, 2013.
[13] K. Kiwiel. Variable fixing algorithms for the continuous quadratic knapsack problem. Journal of Optimization Theory and Applications, 136(3):445–458, 2008.
[14] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In ACM MM, pages 1169–1172, 2014.
[15] M. Lapin, B. Schiele, and M. Hein. Scalable multitask representation learning for scene classification. In CVPR, 2014.
[16] N. Li, R. Jin, and Z.-H. Zhou. Top rank optimization in linear time. In NIPS, pages 1502–1510, 2014.
[17] W. Ogryczak and A. Tamir. Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122, 2003.
[18] M. Patriksson. A survey on the continuous nonlinear resource allocation problem. European Journal of Operational Research, 185(1):1–46, 2008.
[19] M. Patriksson and C. Strömberg. Algorithms for the continuous nonlinear resource allocation problem – new implementations and numerical studies. European Journal of Operational Research, 243(3):703–722, 2015.
[20] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.
[21] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[23] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: theory and practice. IJCV, pages 1–24, 2013.
[24] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization.
Mathematical Programming, pages 1–41, 2014.
[25] J. Sun and J. Ponce. Learning discriminative part detectors for image classification and cosegmentation. In ICCV, pages 3400–3407, 2013.
[26] K. Swersky, B. J. Frey, D. Tarlow, R. S. Zemel, and R. P. Adams. Probabilistic n-choose-k models for classification and ranking. In NIPS, pages 3050–3058, 2012.
[27] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, pages 1057–1064, 2009.
[28] J. Weston, S. Bengio, and N. Usunier. Wsabie: scaling up to large vocabulary image annotation. In IJCAI, pages 2764–2770, 2011.
[29] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[30] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.