{"title": "Large Margin Discriminant Dimensionality Reduction in Prediction Space", "book": "Advances in Neural Information Processing Systems", "page_first": 1488, "page_last": 1496, "abstract": "In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through combination of weak learners. We argue that the intermediate mapping, e.g. boosting predictor, is preserving the discriminant aspects of the data and by controlling the dimension of this mapping it is possible to achieve discriminant low dimensional representations for the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER) that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification.", "full_text": "Large Margin Discriminant Dimensionality\n\nReduction in Prediction Space\n\nMohammad Saberian\n\nNet\ufb02ix\n\nesaberian@netflix.com\n\nJose Costa Pereira\n\nINESCTEC\n\njose.c.pereira@inesctec.pt\n\nCan Xu\nGoogle\n\ncanxu@google.com\n\nJian Yang\n\nYahoo Research\n\nNuno Vasconcelos\n\nUC San Diego\n\njianyang@yahoo-inc.com\n\nnvasconcelos@ucsd.edu\n\nAbstract\n\nIn this paper we establish a duality between boosting and SVM, and use this\nto derive a novel discriminant dimensionality reduction algorithm. 
In particular,\nusing the multiclass formulation of boosting and SVM we note that both use\na combination of mapping and linear classi\ufb01cation to maximize the multiclass\nmargin. In SVM this is implemented using a pre-de\ufb01ned mapping (induced by\nthe kernel) and optimizing the linear classi\ufb01ers. In boosting the linear classi\ufb01ers\nare pre-de\ufb01ned and the mapping (predictor) is learned through a combination of\nweak learners. We argue that the intermediate mapping, i.e. boosting predictor, is\npreserving the discriminant aspects of the data and that by controlling the dimension\nof this mapping it is possible to obtain discriminant low dimensional representations\nfor the data. We use the aforementioned duality and propose a new method, Large\nMargin Discriminant Dimensionality Reduction (LADDER) that jointly learns the\nmapping and the linear classi\ufb01ers in an ef\ufb01cient manner. This leads to a data-driven\nmapping which can embed data into any number of dimensions. Experimental\nresults show that this embedding can signi\ufb01cantly improve performance on tasks\nsuch as hashing and image/scene classi\ufb01cation.\n\n1\n\nIntroduction\n\nBoosting and support vector machines (SVM) are widely popular techniques for learning classi\ufb01ers.\nWhile both methods are maximizing the margin, there are a number of differences that distinguish\nthem; e.g. while SVM selects a number of examples to assemble the decision boundary, boosting\nachieves this by combining a set of weak learners. In this work we propose a new duality between\nboosting and SVM which follows from their multiclass formulations. It shows that both methods\nseek a linear decision rule by maximizing the margin after transforming input data to an intermediate\nspace. 
In particular, kernel-SVM (K-SVM) [39] first selects a transformation (induced by the kernel) that maps data points into an intermediate space, and then learns a set of linear decision boundaries that maximize the margin. This is depicted in Figure 1-bottom. In contrast, multiclass boosting (MCBoost) [34] relies on a set of pre-defined codewords in an intermediate space, and then learns a mapping to this space such that it maximizes the margin with respect to the boundaries defined by those codewords. See Figure 1-top. Therefore, both boosting and SVM follow a two-step procedure: (i) mapping data to some intermediate space, and (ii) determining the boundaries that separate the classes. There are, however, two notable differences: 1) while K-SVM aims to learn only the boundaries, the effort of MCBoost is on learning the mapping; and 2) in K-SVM the intermediate space typically has infinite dimensions, while in MCBoost the space has M or M − 1 dimensions, where M is the number of classes.\n\nFigure 1: Duality between multiclass boosting and SVM.\n\nThe intermediate space (called the prediction space) in the exposed duality has some interesting properties. In particular, the final classifier decision is based on the representation of data points in this prediction space, where the decision boundaries are linear. Accurate classification by these simple boundaries suggests that the input data points must be very well separated in this space. Given that in the case of boosting this space has limited dimensions, e.g. M or M − 1, this suggests that we can potentially use the predictor of MCBoost as a discriminant dimensionality reduction operator. However, the dimension of MCBoost is either M or M − 1, which restricts application of this operator as a general dimensionality reduction operator. 
In\naddition, according to the proposed duality, each of K-SVM or Boosting optimizes only one of the\ntwo components, i.e. mapping and decision boundaries. Because of this, extra care needs to be put in\nmanually choosing the right kernel in K-SVM; and in MCBoost, we may not even be able to learn a\ngood mapping if we preset some bad boundaries.\nWe can potentially overcome these limitations by combining boosting and SVM to jointly learn both\nthe mapping and linear classi\ufb01ers for a prediction space of arbitrary dimension d. We note that this\nis not a straightforward merge of the two methods as this can lead to a computationally prohibitive\nmethod; e.g. imagine having to solve the quadratic optimization of K-SVM before each iteration of\nboosting. In this paper, we propose a new algorithm, Large-mArgin Discriminant DimEnsionality\nReduction (LADDER), to ef\ufb01ciently implement this hybrid approach using a boosting-like method.\nLADDER is able to learn both the mapping and the decision boundaries in a margin maximizing\nobjective function that is adjustable to any number of dimensions. Experiments show that the resulting\nembedding can signi\ufb01cantly improve tasks such as hashing and image/scene classi\ufb01cation.\nRelated works: This paper touches several topics such as dimensionality reduction, classi\ufb01cation,\nembedding and representation learning. Due to space constraints we present only a brief overview\nand comparison to previous work.\nDimensionality reduction has been studied extensively. Unsupervised techniques, such as principal\ncomponent analysis (PCA), non-negative matrix factorization (NMF), clustering, or deep auto-\nencoders, are conceptually simple and easy to implement, but may eliminate discriminant dimensions\nof the data and result in sub-optimal representations for classi\ufb01cation. 
Discriminant methods, such as sequential feature selection techniques [31], neighborhood components analysis [11], large margin nearest neighbors [42] or maximally collapsing metric learning [37], can require extensive computation and/or fail to guarantee large margin discriminant data representations.\nThe idea of jointly optimizing the classifiers and the embedding has been extensively explored in the embedding and classification literature, e.g. [7, 41, 45, 43]. These methods, however, typically rely on a linear data transformation/classifier, require more complex semi-definite programming [41], or rely on the Error Correcting Output Codes (ECOC) approach [7, 45, 10], which has shown inferior performance compared to direct multiclass boosting methods [34, 27]. In comparison, we note that the proposed method (1) is able to learn a very non-linear transformation through the boosting predictor, e.g. by boosting deep decision trees; and (2) relies on direct multiclass boosting that optimizes a margin-enforcing loss function. Another example of jointly learning the classifiers and the embedding is the multiple kernel learning (MKL) literature, e.g. [12, 36]. In these methods, a new kernel is learned as a linear combination of fixed basis functions. Compared with MKL, in LADDER 1) the basis functions are data-driven rather than fixed, and 2) our method is also able to combine weak learners and form novel basis functions tailored to the task at hand. Finally, it is also possible to jointly learn the classifiers and embedding using deep neural networks. This, however, requires a large amount of training data and can be computationally very intensive. In addition, the proposed LADDER method is a meta-algorithm that can be used to further improve deep networks, e.g. 
by boosting of the deep CNNs.\n\n2 Duality of boosting and SVM\n\nConsider an M-class classification problem, with training set D = {(xi, zi)}, i = 1 . . . n, where zi ∈ {1 . . . M} is the class of example xi. The goal is to learn a real-valued (multidimensional) function f(x) to predict the class label z of each example x. This is formulated as the predictor f(x) that minimizes the risk defined in terms of the expected loss L(z, f(x)):\n\nR[f] = E_{X,Z}{L(z, f(x))} ≈ (1/n) Σ_i L(zi, f(xi)).    (1)\n\nDifferent algorithms vary in their choice of loss functions and numerical optimization procedures. The learned predictor has large margin if the loss L(z, f(x)) encourages large values of the classification margin. For binary classification, f(x) ∈ R, z ∈ {1, 2}, the margin is defined as M(xi, zi) = yi f(xi), where yi = y(zi) ∈ {−1, 1} is the codeword of class zi. The classifier is then F(x) = H(sign[f(x)]), where H(+1) = 1 and H(−1) = 2.\n\nThe extension to M-ary classification requires M codewords. These are defined in a multidimensional space, i.e. as y_k ∈ R^d, k = 1 . . . M, where commonly d = M or d = M − 1. The predictor is then f(x) = [f_1(x), f_2(x) . . . f_d(x)] ∈ R^d, and the margin is defined as\n\nM(xi, zi) = (1/2) [⟨f(xi), y_{zi}⟩ − max_{l≠zi} ⟨f(xi), y_l⟩],    (2)\n\nwhere ⟨·, ·⟩ is the Euclidean dot product. Finally, the classifier is implemented as\n\nF(x) = arg max_{k ∈ {1,...,M}} ⟨y_k, f(x)⟩.    (3)\n\nNote that the binary equations are the special cases of (2)-(3) for codewords {−1, 1}.\n\nMulticlass Boosting: MCBoost [34] is a multiclass boosting method that uses a set of unit vectors as codewords – forming a regular simplex in R^{M−1} – and the exponential loss\n\nL(zi, f(xi)) = Σ_{j=1, j≠zi}^{M} e^{−(1/2)[⟨y_{zi}, f(xi)⟩ − ⟨y_j, f(xi)⟩]}.    (4)\n\nFor M = 2, this reduces to the loss L(zi, f(xi)) = e^{−y_{zi} f(xi)} of AdaBoost [9].\n\nGiven a set, G, of weak learners g(x) ∈ G : X → R^{M−1}, MCBoost minimizes (1) by gradient descent in function space. In each iteration MCBoost computes the directional derivative of the risk for updating f(x) along the direction of g(x),\n\nδR[f; g] = ∂R[f + εg]/∂ε |_{ε=0} = −(1/2n) Σ_{i=1}^{n} ⟨g(xi), w(xi)⟩,    (5)\n\nwhere w(xi) = Σ_{j=1}^{M} (y_{zi} − y_j) e^{−(1/2)⟨y_{zi} − y_j, f(xi)⟩} ∈ R^{M−1}. The direction of steepest descent and the optimal step size along that direction are then\n\ng* = arg min_{g∈G} δR[f; g],    α* = arg min_{α∈R} R[f + α g*].    (6)\n\nThe predictor is finally updated with f := f + α* g*. This method is summarized in Algorithm 1. 
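The multiclass margin (2) and the exponential risk (1)/(4) above are straightforward to evaluate numerically. The following is a minimal sketch (our own illustration, not the authors' code), where the rows of F hold the predictions f(xi), the rows of Y hold the codewords, and z holds the labels:

```python
# Minimal numerical sketch of the multiclass margin (eq. 2) and the
# MCBoost exponential risk (eqs. 1 and 4). Array names (F, Y, z) are
# illustrative assumptions, not from the paper's implementation.
import numpy as np

def multiclass_margin(F, Y, z):
    # M(x_i, z_i) = 1/2 [ <f(x_i), y_{z_i}> - max_{l != z_i} <f(x_i), y_l> ]
    S = F @ Y.T                         # n x M matrix of <f(x_i), y_k>
    idx = np.arange(len(z))
    true_score = S[idx, z]
    S_others = S.copy()
    S_others[idx, z] = -np.inf          # exclude the true class from the max
    return 0.5 * (true_score - S_others.max(axis=1))

def exp_risk(F, Y, z):
    # R[f] = (1/n) sum_i sum_{j != z_i} exp(-1/2 [<y_{z_i}, f> - <y_j, f>])
    S = F @ Y.T
    idx = np.arange(len(z))
    L = np.exp(-0.5 * (S[idx, z][:, None] - S))
    L[idx, z] = 0.0                     # drop the j = z_i term
    return L.sum(axis=1).mean()

# toy check: canonical-basis codewords (the d = M case), confident predictions
Y = np.eye(3)
F = 6.0 * np.eye(3)                     # example i strongly predicts class i
z = np.array([0, 1, 2])
margins = multiclass_margin(F, Y, z)    # each margin is 0.5 * (6 - 0) = 3
```

A large margin drives the per-example loss toward zero, which is the mechanism MCBoost exploits when descending the risk.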
As previously mentioned, it reduces to AdaBoost [9] for M = 2, in which case α* has a closed form.\n\nMulticlass Kernel SVM (MC-KSVM): In the support vector machine (SVM) literature, the margin is defined as\n\nM(xi, w_{zi}) = ⟨Φ(xi), w_{zi}⟩ − max_{l≠zi} ⟨Φ(xi), w_l⟩,    (7)\n\nwhere Φ(x) is a feature transformation, usually defined indirectly through a kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩, and w_l (l = 1 . . . M) are a set of discriminative projections. Several algorithms have been proposed for multiclass SVM learning [39, 44, 17, 5]. The classical formulation by Vapnik finds the projections that solve:\n\nmin_{w_1...w_M} Σ_{l=1}^{M} ||w_l||^2 + C Σ_i ξ_i\ns.t. ⟨Φ(xi), w_{zi}⟩ − ⟨Φ(xi), w_l⟩ ≥ 1 − ξ_i, ∀(xi, zi) ∈ D, l ≠ zi,\n     ξ_i ≥ 0 ∀i.    (8)\n\nAlgorithm 1 MCBoost\nInput: Number of classes M, number of iterations Nb, codewords {y_1, . . . , y_M} ∈ R^{M−1}, and dataset D = {(xi, zi)}, i = 1 . . . n, where zi ∈ {1 . . . M} is the label of example xi.\nInitialization: Set f = 0 ∈ R^{M−1}.\nfor t = 1 to Nb do\n    Find the best weak learner g*(x) and optimal step size α* using (6).\n    Update f(x) := f(x) + α* g*(x).\nend for\nOutput: F(x) = arg max_k ⟨f(x), y_k⟩\n\nRewriting the constraints as\n\nξ_i ≥ max[0, 1 − (⟨Φ(xi), w_{zi}⟩ − max_{l≠zi} ⟨Φ(xi), w_l⟩)],\n\nand using the fact that the objective function is monotonically increasing in ξ_i, this is identical to solving the problem\n\nmin_{w_1...w_M} Σ_i ⌊⟨Φ(xi), w_{zi}⟩ − max_{l≠zi} ⟨Φ(xi), w_l⟩⌋_+ + λ Σ_{l=1}^{M} ||w_l||^2,    (9)\n\nwhere ⌊x⌋_+ = max(0, 1 − x) is the hinge loss, and λ = 1/C. Hence, MC-KSVM minimizes the risk R[f] subject to a regularization constraint on Σ_l ||w_l||^2. The predictor of the multiclass kernel SVM (MC-KSVM) is then defined as\n\nF_{MC-KSVM}(x) = arg max_{l=1..M} ⟨Φ(x), w*_l⟩.    (10)\n\nDuality: The discussion of the previous sections unveils an interesting duality between multiclass boosting and SVM. Since (7) and (10) are special cases of (2) and (3), respectively, MC-KSVM is a special case of the formulation of Section 2, with predictor f(x) = Φ(x) and codewords y_l = w_l. This leads to the duality of Figure 1. Both boosting and SVM implement a classifier with a set of linear decision boundaries on a prediction space F. This prediction space is the range space of the predictor f(x). The linear decision boundaries are the planes whose normals are the codewords y_l. For both boosting and SVM, the decision boundaries implement a large margin classifier in F. However, the learning procedure is different. For the SVM, examples are first mapped into F by a pre-defined predictor. 
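The unconstrained form (9) is easy to evaluate directly. Below is a small sketch (our own illustration), where the rows of Phi_X hold the transformed points Φ(xi) and the rows of W hold the projections w_l:

```python
# Sketch of the MC-KSVM objective in eq. (9): multiclass hinge loss plus
# an L2 penalty. Names (Phi_X, W, z, lam) are illustrative assumptions.
import numpy as np

def mc_ksvm_objective(W, Phi_X, z, lam):
    S = Phi_X @ W.T                     # n x M matrix of <Phi(x_i), w_l>
    idx = np.arange(len(z))
    true_score = S[idx, z]
    S_others = S.copy()
    S_others[idx, z] = -np.inf
    margins = true_score - S_others.max(axis=1)   # eq. (7), per example
    hinge = np.maximum(0.0, 1.0 - margins)        # |_x_|+ = max(0, 1 - x)
    return hinge.sum() + lam * np.sum(W ** 2)

# toy check: a perfectly separating W leaves only the regularization term
Phi_X = np.eye(3)
W = 2.0 * np.eye(3)                     # every margin is 2 > 1, so hinge = 0
z = np.array([0, 1, 2])
obj = mc_ksvm_objective(W, Phi_X, z, lam=0.1)
```

Note the structural match with the boosting risk: same margin (7), but a hinge instead of an exponential loss, which is the root of the duality discussed next.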
This is the feature transformation Φ(x) that underlies the SVM kernel. The codewords (linear classifiers) are then learned so as to maximize the margin. On the other hand, for boosting, the codewords are pre-defined and the boosting algorithm learns the predictor f(x) that maximizes the margin. The boosting / SVM duality is summarized in Table 1.\n\nTable 1: Duality between MCBoost and MC-KSVM\n\n          | predictor   | codewords\nMCBoost   | learns f(x) | fix yi\nMC-KSVM   | fix Φ(x)    | learns wl\n\n3 Discriminant dimensionality reduction\n\nIn this section, we exploit the multiclass boosting / SVM duality to derive a new family of discriminant dimensionality reduction methods. Many learning problems require dimensionality reduction. This is usually done by mapping the space of features X to some lower dimensional space Z, and then learning a classifier on Z. However, the mapping from X to Z is usually quite difficult to learn. Unsupervised procedures, such as principal component analysis (PCA) or clustering, frequently eliminate discriminant dimensions of the data that are important for classification. On the other hand, supervised procedures tend to lead to complex optimization problems and can be quite difficult to implement. Using the proposed duality, we argue that it is possible to use an embedding provided by boosting or SVM. In the case of SVM this embedding is usually infinite dimensional, which can make it impractical for some applications, e.g. the hashing problem [20]. In the case of boosting the embedding, f(x), has a finite dimension d. In general, the complexity of learning a predictor f(x) is inversely proportional to this dimension d, and lower dimensional codewords/predictors require more sophisticated predictor learning. For example, convolutional networks such as [22] use the canonical basis of R^M as codeword set, and a predictor composed of M neural network outputs. This is a deep predictor, with multiple layers of feature transformation, using a combination of linear and non-linear operations. Similarly, as discussed in the previous section, MCBoost can be used to learn predictors of dimension M or M − 1, by combining weak learners. A predictor learned by any of these methods can be interpreted as a low-dimensional embedding. Compared to the classic sequential approach of first learning an intermediate low dimensional space Z and then learning a predictor f : Z → F = R^M, these methods learn the classifier directly in a low-dimensional prediction space, i.e. F = Z. In the case of boosting, this leverages a classifier that explicitly maximizes the classification margin for the solution of the dimensionality reduction problem.\n\nThe main limitation of this approach is that current multiclass boosting methods [34, 27] rely on a fixed codeword dimension d, e.g. d = M in [27] or d = M − 1 in [34]. In addition, these codewords are pre-defined and independent of the input data, e.g. vertices of a regular simplex in R^M or R^{M−1} [34]. In summary, the dimensionality of the predictor and codewords is tied to the number of classes.\n\nAlgorithm 2 Codeword boosting\nInput: Dataset D = {(xi, zi)}, i = 1 . . . n, where zi ∈ {1 . . . M} is the label of example xi, number of classes M, a predictor f(x) : X → R^d, number of codeword learning iterations Nc, and a set of d-dimensional codewords Y.\nfor t = 1 to Nc do\n    Compute ∂R/∂Y and find the best step size β* by (12).\n    Update Y := Y − β* ∂R/∂Y.\n    Normalize codewords in Y to satisfy the constraint of (11).\nend for\nOutput: Codeword set Y\n\nFigure 2: Codeword updates after a gradient descent step
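A standard construction for the pre-defined, data-independent codewords mentioned above (the vertices of a regular simplex, M unit vectors in R^{M−1}) can be sketched as follows; this construction is our own illustration, not code from the paper:

```python
# Sketch: vertices of a regular simplex as M unit-norm codewords in
# R^(M-1), the kind of fixed codeword set used by [34]. Our own
# construction, given here for illustration.
import numpy as np

def simplex_codewords(M):
    E = np.eye(M) - 1.0 / M            # centered canonical basis (rank M-1)
    _, s, Vt = np.linalg.svd(E)
    B = Vt[:M - 1]                     # orthonormal basis of the row space
    Y = E @ B.T                        # isometric coordinates, M x (M-1)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

Y = simplex_codewords(3)               # 3 unit vectors in the plane
G = Y @ Y.T                            # Gram matrix: 1 on the diagonal and
                                       # -1/(M-1) = -0.5 off the diagonal
```

By symmetry, all pairwise codeword distances are equal, which is exactly the maximally spread configuration that the data-driven codeword learning of the next section uses as a starting point.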
Next, we propose a method that extends current boosting algorithms 1) to use embeddings of arbitrary dimensions and 2) to learn the codewords (linear classifiers) from the input data.\n\nIn principle, the formulation of Section 2 is applicable to any codeword set, and the challenge is to find the optimal codewords for a target dimension d. For this, we propose to leverage the duality between boosting and SVM: first, use boosting to learn the optimal predictor for a given set of codewords; second, use SVM to learn the optimal codewords for the given predictor. This procedure has two limitations. First, although both are large margin methods, boosting and SVM use different loss functions (exponential vs. hinge). Hence, the procedure is not guaranteed to converge. Second, an algorithm based on multiple iterations of boosting and SVM learning is computationally intensive. We avoid these problems by formulating the codeword learning problem in the boosting framework rather than an SVM formulation. For this, we note that, given a predictor f(x), it is possible to learn a set of codewords Y = {y_1 . . . y_M} that guarantees large margins, under the exponential loss, by solving\n\nmin_{y_1...y_M} R[Y, f] = (1/2n) Σ_{i=1}^{n} L(Y, zi, f(xi))    s.t. ||y_k|| = 1 ∀k,    (11)\n\nwhere L(Y, zi, f(xi)) = Σ_{j≠zi} e^{−(1/2)⟨y_{zi} − y_j, f(xi)⟩}. As is usual in boosting, we propose to solve this optimization by a gradient descent procedure. Each iteration of the proposed codeword boosting algorithm computes the risk derivatives with respect to all codewords and forms the matrix ∂R/∂Y = [∂R[Y, f]/∂y_1 . . . ∂R[Y, f]/∂y_M]. The codewords are then updated according to Y := Y − β* ∂R/∂Y, where\n\nβ* = arg min_β R[Y − β ∂R/∂Y, f]    (12)\n\nis found by a line search. Finally, each codeword y_l is normalized to satisfy the constraint of (11). This algorithm is summarized in Algorithm 2.\n\nGiven this, we are ready to introduce an algorithm that jointly optimizes the codeword set Y and the predictor f. This is implemented using an alternate minimization procedure that iterates between the following two steps. First, given a codeword set Y, determine the predictor f*(x) of minimum risk R[Y, f]. This is implemented with MCBoost (Algorithm 1). Second, given the optimal predictor f*(x), determine the codeword set Y* of minimum risk R[Y*, f*]. This is implemented with codeword boosting (Algorithm 2). Note that, unlike the combined SVM-boosting solution, the two steps of this algorithm optimize the common risk of (11). Since this risk encourages predictors of large margin, the algorithm is denoted Large mArgin Discriminant DimEnsionality Reduction (LADDER). The procedure is summarized in Algorithm 3.\n\nAlgorithm 3 LADDER\nInput: Number of classes M, dataset D = {(xi, zi)}, i = 1 . . . n, where zi ∈ {1 . . . M} is the label of example xi, predictor and codeword dimension d, number of boosting iterations Nb, number of codeword learning iterations Nc, and number of interleaving rounds Nr.\nInitialization: Set f = 0 ∈ R^d and initialize Y.\nfor t = 1 to Nr do\n    Use Y and run Nb iterations of MCBoost, Algorithm 1, to update f(x).\n    Use f(x) and run Nc iterations of gradient descent in Algorithm 2 to update Y.\nend for\nOutput: Predictor f(x), codeword set Y, and decision rule F(x) = arg max_k ⟨f(x), y_k⟩\n\nAnalysis: First, note that the sub-problems solved by each step of LADDER, i.e. the minimization of R[Y, f] given Y or f, are convex. 
However, the overall optimization of (11) is not convex. Hence, the algorithm will converge to a local optimum, which depends on the initialization conditions. We propose an initialization procedure motivated by the following intuition. If two of the codewords are very close, e.g. y_j ≈ y_k, then ⟨y_j, f(x)⟩ is very similar to ⟨y_k, f(x)⟩ and small variations of x may change the classification results of (3) from k to j and vice-versa. This suggests that the codewords should be as distant from each other as possible. We thus propose to initialize the MCBoost codewords with the set of unit vectors of maximum pair-wise distance, e.g.\n\nmax_{y_1...y_M} min_{j≠k} ||y_j − y_k||.    (13)\n\nFor d = M, these codewords can be the canonical basis of R^M. We have implemented a barrier method from [18] to obtain maximum pair-wise distance codeword sets for any d < M.\n\nSecond, Algorithm 2 has interesting intuitions. We start by rewriting the risk derivatives as\n\n∂R[Y, f]/∂y_j = (1/4n) Σ_i (−1)^{δ_ij} f(xi) L_i s_ij^{(1−δ_ij)},\n\nwhere L_i = L(Y, zi, f(xi)), s_ij = e^{(1/2)⟨y_j, f(xi)⟩} / Σ_{k≠zi} e^{(1/2)⟨y_k, f(xi)⟩}, and δ_ij = 1 if zi = j and δ_ij = 0 otherwise. It follows that the update of each codeword along the gradient descent direction, −∂R[Y, f]/∂y_j, is a weighted average of the predictions f(xi). Since δ_ij is an indicator of the examples xi in class j, the term (−1)^{δ_ij} reflects the assignment of examples to the classes. While each xi in class j contributes to the update of y_j with a multiple of the prediction f(xi), this contribution is −f(xi) for examples in classes other than j. Hence, each example xi in class j pulls y_j towards its current prediction f(xi), while pulling all other codewords in the opposite direction. This is illustrated in Figure 2. The result is an increase of the dot-product ⟨y_j, f(xi)⟩, while the dot-products ⟨y_k, f(xi)⟩ ∀k ≠ j decrease. Besides encouraging correct classification, these dot product adjustments maximize the multiclass margin. This effect is modulated by the weight of the contribution of each point. This weight is the factor L_i s_ij^{(1−δ_ij)}, which has two components. The first, L_i, is the loss of the current predictor f(xi) for example xi. This measures how much xi contributes to the current risk and is similar to the example weighting mechanism of AdaBoost. Training examples are weighted so as to emphasize those poorly classified by the current predictor f(x). The second, s_ij^{(1−δ_ij)}, only affects examples xi that do not belong to class j. For these, the weight is multiplied by s_ij. This computes a softmax-like operation among the codeword projections of f(xi) and is large when the projection along y_j is one of the largest, and small otherwise. Hence, among examples xi from classes other than j that have equivalent loss L_i, the learning algorithm weights more heavily those most likely to be mistakenly assigned to class j. As a result, the emphasis on incorrectly classified examples is modulated by how much class pairs are confused by the current predictor. Examples from classes that are more confusable with class j receive larger weight for the update of the latter.\n\nFigure 3: Left: Initial codewords for all traffic sign classes. Middle: codewords learned by LADDER. Right: Error rate evaluation with standard MCBoost classifier (CLR) with several dimensionality reduction techniques.\n\n4 Experiments\n\nWe start with a traffic sign detection problem that allows some insight on the merits of learning codewords from data. 
This experiment was based on \u223c 2K instances from 17 different types of traf\ufb01c\nsigns in the \ufb01rst set of the Summer traf\ufb01c sign dataset [25], which was split into training and test set.\nExamples of traf\ufb01c signs are shown in the left of \ufb01gure 3. We also collected about 1, 000 background\nimages, to represent non-traf\ufb01c sign images, leading to a total of 18 classes. The background class is\nshown as a black image in \ufb01gure 3-left and middle. All images were resized to 40 \u00d7 40 pixels and\nthe integral channel method of [8] was used to extract 810 features per image.\nThe \ufb01rst experiment compared the performance of traditional multiclass boosting to LADDER. The\nformer was implemented by running MCBoost (Algorithm 1) for Nb = 200 iterations, using the\noptimal solution of (13) as codeword set. LADDER was implemented with Algorithm 3, using\nNb = 2, Nc = 4, and Nr = 100. In both cases, codewords were initialized with the solution\nof (13) and the initial assignment of codewords to classes was random. In each experiment, the\nlearning algorithm was initialized with 5 different random assignments. Figure 3 compares the initial\ncodewords (Left) to those learned by LADDER (Middle) for a 2-D embedding (d = 2). A video\nshowing the evolution of the codewords is available in the supplementary materials. The organization\nof the learned codewords re\ufb02ects the semantics of the various classes. Note, for example, how\nLADDER clusters the codewords associated with speed limit signs, which were initially scattered\naround the unit circle. On the other hand, all traf\ufb01c sign codewords are pushed away from that of the\nbackground image class. Within the traf\ufb01c sign class, round signs are positioned in one half-space\nand signs of other shapes on the other. Regarding discriminant power, a decision rule learned by\nMCBoost achieved 0.44 \u00b1 0.03 error rate, while LADDER achieved 0.21 \u00b1 0.02. 
In summary, codeword adaptation produces a significantly more discriminant prediction space.\n\nThis experiment was repeated for d ∈ [2, 27], with the results of Figure 3-right. For small d, LADDER substantially improves on MCBoost (about half the error rate for d ≤ 5). LADDER was also compared to various classical dimensionality reduction techniques that do not operate on the prediction space. These included PCA, LDA, Probabilistic PCA [33], Kernel PCA [35], Locality Preserving Projections (LPP) [16], and Neighborhood Preserving Embedding (NPE) [15]. All implementations were provided by [1]. For each method, the data was mapped to a lower dimension d and classified using MCBoost. LADDER outperformed all methods for all dimensions.\n\nHashing and retrieval: Image retrieval is a classical problem in Vision [3, 4]. Encoding high dimensional feature vectors into short binary codes to enable large scale retrieval has gained momentum in the last few years [6, 38, 23, 13, 24, 26]. LADDER enables the design of an effective discriminant hash code for retrieval systems. To obtain a d-bit hash, we learn a predictor f(x) ∈ R^d. Each predictor coordinate is then thresholded and mapped to {0, 1}. Retrieval is finally based on the Hamming distance between these hash codes. We compare this hashing method to a number of popular techniques on CIFAR-10 [21], which contains 60K images of ten classes. Evaluation was based on the test settings of [26], using 1,000 randomly selected images. Learning was based on a random set of 2,000 images, sampled from the remaining 59K. All images are represented as 512-dimensional GIST feature vectors [28]. The 1,000 test images were used to query a database containing the remaining 59K images.\n\nTable 2: Left: Mean average precision (mAP) for CIFAR-10. 
Right: Classification accuracy on the MIT indoor scenes dataset.\n\nMethod      | 8 bits | 10 bits | 12 bits\nLSH         | 0.147  | 0.150   | 0.150\nBRE         | 0.156  | 0.156   | 0.158\nITQ unsup.  | 0.162  | 0.159   | 0.164\nITQ sup.    | 0.220  | 0.225   | 0.231\nMCBoost     | 0.200  | 0.250   | 0.250\nKSH         | 0.237  | 0.252   | 0.253\nLADDER      | 0.224  | 0.270   | 0.266\n\nMethod              | Accuracy\nRBoW [29]           | 37.9%\nSPM-SM [40]         | 44.0%\nHMP [2]             | 47.6%\nconv5+PCA+FV        | 52.9%\nconv5+MC-Boost+FV   | 52.8%\nconv5+LADDER+FV     | 55.2%\n\nTable 2-Left shows mean average precision (mAP) scores under different code lengths for LSH [6], BRE [23], ITQ [13], MCBoost [34], KSH [26] and LADDER. Several conclusions can be drawn. First, using a multiclass boosting technique with the predefined equally spaced codewords of (13), MCBoost, we observe competitive performance, on par with popular approaches such as ITQ, though slightly worse than KSH. Second, LADDER improves on MCBoost, with mAP gains that range from 6 to 12%. This is due to the ability of LADDER to adjust/learn codewords according to the training data. Finally, LADDER outperformed the other popular methods for hash code lengths ≥ 10 bits. These gains are about 5 and 7% as compared to KSH, the second best method.\n\nScene understanding: In this experiment we show that LADDER can provide more efficient dimensionality reduction than standard methods such as PCA. For this we selected the scene understanding pipeline of [30, 14], which consists of deep CNNs [22, 19], PCA, Fisher Vectors (FV) and SVM. PCA in this setting is necessary as the Fisher Vectors can become extremely high dimensional. We replaced the PCA component by embeddings of MCBoost and LADDER and compared their performance with PCA and other scene classification methods on the MIT Indoor dataset [32]. This is a dataset of 67 indoor scene categories where the standard train/test split contains 80 images for training and 20 images for testing per class. 
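The hashing scheme used in the retrieval experiment above, thresholding each coordinate of a learned d-dimensional predictor at zero and ranking the database by Hamming distance, can be sketched as follows (our own illustration; a fixed linear projection stands in for the learned LADDER embedding):

```python
# Sketch of the retrieval scheme described above: binarize each
# coordinate of a d-dimensional embedding f(x), then rank a database by
# Hamming distance to the query code. The linear 'predictor' P is a
# stand-in assumption for the learned embedding, not the paper's model.
import numpy as np

def hash_codes(X, P):
    # f(x) = P^T x as a stand-in predictor; threshold each coordinate at 0
    return (X @ P > 0).astype(np.uint8)       # n x d binary codes

def hamming_rank(query_code, db_codes):
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind='stable')   # database indices, nearest first

rng = np.random.default_rng(0)
P = rng.normal(size=(16, 12))                 # 16-dim features -> 12-bit hash
X_db = rng.normal(size=(100, 16))
codes = hash_codes(X_db, P)
query = hash_codes(X_db[:1], P)[0]            # query with database item 0
ranking = hamming_rank(query, codes)
```

Because the codes are short binary strings, the Hamming-distance ranking is what makes this kind of retrieval feasible at large scale.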
Table 2-Right summarizes the performance of the different methods. Again, even the plain MCBoost predictor achieves competitive performance, on par with PCA. LADDER then improves performance further by learning the embedding and codewords jointly.

5 Conclusions
In this work we present a duality between boosting and SVM, and use it to propose a novel discriminant dimensionality reduction method. We show that both boosting and K-SVM maximize the margin using the combination of a non-linear predictor and linear classification. For K-SVM, the predictor (induced by the kernel) is fixed and the linear classifier is learned. For boosting, the linear classifier is fixed and the predictor is learned. It follows from this duality that 1) the predictor learned by boosting is a discriminant mapping, and 2) by iterating between boosting and SVM it should be possible to design better discriminant mappings. We propose the LADDER algorithm to efficiently implement these two steps and learn an embedding of arbitrary dimension. Experiments show that LADDER learns low-dimensional spaces that are more discriminant.

References
[1] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, MIT Press, 2008.
[2] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, 2012.
[3] J. Costa Pereira and N. Vasconcelos. On the regularization of image semantics by modal expansion. In Proc. IEEE CVPR, pages 3093–3099, 2012.
[4] J. Costa Pereira and N. Vasconcelos. Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems. Comput. Vision Image Understand., 124:123–135, July 2014.
[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, MIT Press, 2:265–292, 2002.
[6] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni.
Locality-sensitive hashing scheme based on p-stable distributions. In Proc. ACM Symp. on Comp. Geometry, pages 253–262, 2004.
[7] O. Dekel and Y. Singer. Multiclass learning by probabilistic embeddings. In Adv. NIPS, pages 945–952, 2002.
[8] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In Proc. BMVC, 2009.
[9] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
[10] T. Gao and D. Koller. Multiclass boosting with hinge loss based on output coding. In ICML, 2011.
[11] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Adv. NIPS, pages 513–520, 2004.
[12] M. Gonen and E. Alpaydin. Multiple kernel learning algorithms. JMLR, MIT Press, 12:2211–2268, July 2011.
[13] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., (99):1–15, 2012.
[14] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. ECCV, 2014.
[15] X. He, D. Cai, S. Yan, and H.-J. Zhang. Neighborhood preserving embedding. In Proc. IEEE ICCV, 2005.
[16] X. He and P. Niyogi. Locality preserving projections. In Adv. NIPS, 2003.
[17] C. Hsu and C. Lin. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw., 13(2):415–425, 2002.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Verlag, New York, 1999.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[20] D. Knuth. The Art of Computer Programming: Sorting and Searching, 1973.
[21] A. Krizhevsky and G.
Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Dept. of Computer Science, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. NIPS, 2012.
[23] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Adv. NIPS, volume 22, pages 1042–1050, 2009.
[24] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell., 34(6):1092–1104, 2012.
[25] F. Larsson, M. Felsberg, and P. Forssen. Correlating Fourier descriptors of local patches for road sign recognition. IET Computer Vision, 5(4):244–254, 2011.
[26] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. IEEE CVPR, pages 2074–2081, 2012.
[27] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. In Adv. NIPS, 2010.
[28] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. Journal Comput. Vision, 42(3):145–175, 2001.
[29] S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb. Reconfigurable models for scene recognition. In Proc. IEEE CVPR, 2012.
[30] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
[31] P. Pudil, J. Novovičová, and J. Kittler. Floating search methods in feature selection. Pattern Recogn. Lett., pages 1119–1125, 1994.
[32] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. IEEE CVPR, 2009.
[33] S. Roweis. EM algorithms for PCA and SPCA. In Adv. NIPS, 1998.
[34] M. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In Adv. NIPS, 2011.
[35] B. Schölkopf, A. Smola, and K.-R. Müller.
Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, pages 1299–1319, 1998.
[36] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, MIT Press, 7:1531–1565, Dec. 2006.
[37] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. JMLR, MIT Press, pages 1027–1061, 2007.
[38] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proc. IEEE CVPR, pages 1–8, 2008.
[39] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
[40] N. Vasconcelos and N. Rasiwasia. Scene recognition on the semantic manifold. In Proc. ECCV, 2012.
[41] K. Q. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In Adv. NIPS, 2009.
[42] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, MIT Press, pages 207–244, 2009.
[43] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. In Proc. ECML, 2010.
[44] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Euro. Symp. on Artificial Neural Networks, pages 219–224, 1999.
[45] B. Zhao and E. Xing. Sparse output coding for large-scale visual recognition. In Proc. IEEE CVPR, pages 3350–3357, 2013.