{"title": "A Risk Minimization Principle for a Class of Parzen Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 1137, "page_last": 1144, "abstract": "This paper explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an $O(n)$ algorithm able to process large datasets in reasonable time.", "full_text": "A Risk Minimization Principle\nfor a Class of Parzen Estimators\n\nKristiaan Pelckmans, Johan A.K. Suykens, Bart De Moor\nDepartment of Electrical Engineering (ESAT) - SCD/SISTA\n\nK.U.Leuven University\n\nKasteelpark Arenberg 10, Leuven, Belgium\n\nKristiaan.Pelckmans@esat.kuleuven.be\n\nAbstract\n\nThis paper1 explores the use of a Maximal Average Margin (MAM) optimality\nprinciple for the design of learning algorithms. It is shown that the application\nof this risk minimization principle results in a class of (computationally) simple\nlearning machines similar to the classical Parzen window classi\ufb01er. A direct rela-\ntion with the Rademacher complexities is established, as such facilitating analysis\nand providing a notion of certainty of prediction. This analysis is related to Sup-\nport Vector Machines by means of a margin transformation. 
The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an $O(n)$ algorithm able to process large datasets in reasonable time.\n\n1 Introduction\n\nThe quest for efficient machine learning techniques which (a) have favorable generalization capacities, (b) are flexible for adaptation to a specific task, and (c) are cheap to implement is a pervasive theme in the literature, see e.g. [14] and references therein. This paper introduces a novel concept for designing a learning algorithm, namely the Maximal Average Margin (MAM) principle. It closely resembles the classical notion of maximal margin that lies at the basis of perceptrons, Support Vector Machines (SVMs) and boosting algorithms, see among others [14, 11]. It however optimizes the average margin of points to the (hypothesis) hyperplane, instead of the worst-case margin as is traditional. The full margin distribution was studied earlier in e.g. [13], and theoretical results were extended and incorporated in a learning algorithm in [5].\n\nThe contribution of this paper is twofold. On a methodological level, we relate (i) results in structural risk minimization, (ii) data-dependent (but dimension-independent) Rademacher complexities [8, 1, 14] and a new concept of 'certainty of prediction', (iii) the notion of margin (as central in most state-of-the-art learning machines), and (iv) statistical estimators such as Parzen windows and Nadaraya-Watson kernel estimators. In [10], the principle was already shown to underlie the approach of mincuts for transductive inference over a weighted undirected graph. Further, consider the model class consisting of all models with bounded average margin (or classes with a fixed Rademacher complexity, as we will indicate later on). 
The set of such classes is clearly nested, enabling structural risk minimization [8].\n\nOn a practical level, we show how the optimality principle can be used for designing a computationally fast approach to (large-scale) classification and ordinal regression tasks, much along the same lines as Parzen classifiers and Nadaraya-Watson estimators. It becomes clear that this result enables researchers on Parzen windows to benefit directly from recent advances in kernel machines, two fields which have evolved mostly separately. It must be emphasized that the resulting learning rules were already studied in different forms and motivated by asymptotic and geometric arguments, e.g. the Parzen window classifier [4], the 'simple classifier' as in [12] chap. 1, and probabilistic neural networks [15], while in this paper we show how an (empirical) risk based optimality criterion underlies this approach. A number of experiments confirm the use of the resulting cheap learning rules for providing a reasonable (baseline) performance in a small time window.\n\n1 Acknowledgements - K. Pelckmans is supported by an FWO PDM. J.A.K. Suykens and B. De Moor are (full) professors at the Katholieke Universiteit Leuven, Belgium. Research supported by Research Council KUL: GOA AMBioRICS, CoE EF/05/006 OPTEC, IOF-SCORES4CHEM, several PhD/postdoc & fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0452.04, G.0499.04, G.0211.05, G.0226.06, G.0321.06, G.0302.07, (ICCoS, ANMMM, MLDM); IWT: PhD Grants, McKnow-E, Eureka-Flite+; Belgian Federal Science Policy Office: IUAP P6/04; EU: ERNSI.\n\nThe following notational conventions are used throughout the paper. Let the random vector $(X, Y) \\in \\mathbb{R}^d \\times \\{-1, 1\\}$ obey a (fixed but unknown) joint distribution $P_{XY}$ from a probability space $(\\mathbb{R}^d \\times \\{-1, 1\\}, \\mathcal{P})$. Let $D_n = \\{(X_i, Y_i)\\}_{i=1}^n$ be sampled i.i.d. according to $P_{XY}$. 
Let $y \\in \\mathbb{R}^n$ be defined as $y = (Y_1, \\dots, Y_n)^T \\in \\{-1, 1\\}^n$ and $X = (X_1, \\dots, X_n)^T \\in \\mathbb{R}^{n \\times d}$. This paper is organized as follows. The next section illustrates the principle of maximal average margin for classification problems. Section 3 investigates the close relationship with Rademacher complexities, Section 4 develops the maximal average margin principle for ordinal regression, and Section 5 reports experimental results of application of the MAM to classification and ordinal regression tasks.\n\n2 Maximal Average Margin for Classifiers\n\n2.1 The Linear Case\n\nLet the class of hypotheses be defined as\n\n$$H = \\left\\{ f(\\cdot) : \\mathbb{R}^d \\to \\mathbb{R},\\ w \\in \\mathbb{R}^d \\ \\middle|\\ \\forall x \\in \\mathbb{R}^d : f(x) = w^T x,\\ \\|w\\|_2 = 1 \\right\\}. \\qquad (1)$$\n\nConsequently, the signed distance of a sample $(X, Y)$ to the hyperplane $w^T x = 0$, or the margin $M(w) \\in \\mathbb{R}$, can be defined as\n\n$$M(w) = \\frac{Y (w^T X)}{\\|w\\|_2}. \\qquad (2)$$\n\nSVMs maximize the worst-case margin. We instead focus on the first moment of the margin distribution. Maximizing the expected (average) margin follows from solving\n\n$$M^* = \\max_w E\\left[ \\frac{Y (w^T X)}{\\|w\\|_2} \\right] = \\max_{f \\in H} E[Y f(X)]. \\qquad (3)$$\n\nRemark that the non-separable case does not require slack variables. The empirical counterpart becomes\n\n$$\\hat{M} = \\max_w \\frac{1}{n} \\sum_{i=1}^n \\frac{Y_i (w^T X_i)}{\\|w\\|_2}, \\qquad (4)$$\n\nwhich can be written as a constrained convex problem as $\\min_w -\\frac{1}{n} \\sum_{i=1}^n Y_i (w^T X_i)$ s.t. $\\|w\\|_2 \\le 1$. The Lagrangian with multiplier $\\lambda \\ge 0$ becomes $L(w, \\lambda) = -\\frac{1}{n} \\sum_{i=1}^n Y_i (w^T X_i) + \\frac{\\lambda}{2} (w^T w - 1)$. By switching the minimax problem to a maximin problem (application of Slater's condition), the first order condition for optimality $\\frac{\\partial L(w, \\lambda)}{\\partial w} = 0$ gives\n\n$$w_n = \\frac{1}{\\lambda n} \\sum_{i=1}^n Y_i X_i = \\frac{1}{\\lambda n} X^T y, \\qquad (5)$$\n\nwhere $w_n \\in \\mathbb{R}^d$ denotes the optimum to (4). 
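

The closed form (5) can be checked numerically. The following is a minimal sketch in plain Python, assuming the normalization `lam = ||X^T y||_2 / n` that follows from the constraint `w^T w = 1`; the function names and the toy sample are hypothetical and only serve as illustration.

```python
import math


def mam_linear_fit(X, y):
    '''Return (w, lam) with w = X^T y / (lam * n) and lam = ||X^T y||_2 / n.'''
    n, d = len(X), len(X[0])
    # Unnormalized direction sum_i Y_i X_i, i.e. X^T y.
    s = [sum(y[i] * X[i][k] for i in range(n)) for k in range(d)]
    norm = math.sqrt(sum(v * v for v in s))
    if norm == 0.0:
        raise ValueError('X^T y is the zero vector, so the direction is undefined')
    lam = norm / n
    w = [v / (lam * n) for v in s]
    return w, lam


def mam_linear_predict(w, x):
    '''Predict a label in {-1, 1} from the sign of w^T x.'''
    score = sum(wk * xk for wk, xk in zip(w, x))
    return 1 if score >= 0 else -1


# Hypothetical toy sample, used only for illustration.
X_toy = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
y_toy = [1, 1, -1, -1]
w_toy, lam_toy = mam_linear_fit(X_toy, y_toy)
# Empirical average margin (1 / n) * sum_i Y_i w^T X_i attained by w_toy.
avg_margin = sum(yi * sum(wk * xk for wk, xk in zip(w_toy, xi))
                 for xi, yi in zip(X_toy, y_toy)) / len(X_toy)
```

In this sketch the returned `w` has unit norm, and the attained empirical average margin coincides with `lam`, since `(1 / n) y^T X w = ||X^T y||_2 / n`.
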
The corresponding parameter $\\lambda$ can be found by substituting (5) in the constraint $w^T w = 1$, or $\\lambda = \\frac{1}{n} \\left\\| \\sum_{i=1}^n Y_i X_i \\right\\|_2 = \\frac{1}{n} \\sqrt{y^T X X^T y}$, since the optimum is obviously taking place when $w^T w = 1$. It becomes clear that the above derivations remain valid as $n \\to \\infty$, resulting in the following theorem.\n\nTheorem 1 (Explicit Actual Optimum for the MAMC) The function $f(x) = w^T x$ in $H$ maximizing the expected margin satisfies\n\n$$\\arg\\max_w E\\left[ \\frac{Y (w^T X)}{\\|w\\|_2} \\right] = \\frac{1}{\\lambda} E[XY] \\triangleq w^*, \\qquad (6)$$\n\nwhere $\\lambda$ is a normalization constant such that $\\|w^*\\|_2 = 1$.\n\n2.2 Kernel-based Classifier and Parzen Window\n\nIt becomes straightforward to recast the resulting classifier as a kernel classifier by mapping the input data-samples $X$ in a feature space $\\varphi : \\mathbb{R}^d \\to \\mathbb{R}^{d_\\varphi}$ where $d_\\varphi$ is possibly infinite. In particular, we do not have to resort to Lagrange duality in a context of convex optimization (see e.g. [14, 9] for an overview) or functional analysis in a Reproducing Kernel Hilbert Space. Specifically,\n\n$$w_n^T \\varphi(X) = \\frac{1}{\\lambda n} \\sum_{i=1}^n Y_i K(X_i, X), \\qquad (7)$$\n\nwhere $K : \\mathbb{R}^d \\times \\mathbb{R}^d \\to \\mathbb{R}$ is defined as the inner product such that $\\varphi(X)^T \\varphi(X') = K(X, X')$ for any $X, X'$. Conversely, any function $K$ corresponds with the inner product of a valid map $\\varphi$ if the function $K$ is positive definite. As previously, the term $\\lambda$ becomes $\\lambda = \\frac{1}{n} \\sqrt{y^T \\Omega y}$ with kernel matrix $\\Omega \\in \\mathbb{R}^{n \\times n}$ where $\\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \\dots, n$. Now the class of positive definite Mercer kernels can be used as they induce a proper mapping $\\varphi$. 
A classical choice is the use of a linear kernel (or $K(X, X') = X^T X'$), a polynomial kernel of degree $p \\in \\mathbb{N}_0$ (or $K(X, X') = (X^T X' + b)^p$), an RBF kernel (or $K(X, X') = \\exp(-\\|X - X'\\|_2^2 / \\sigma)$), or a dedicated kernel for a specific application (e.g. a string kernel, a Fisher kernel, see e.g. [14] and references therein). Figure 1.a depicts an example of a nonlinear classifier based on the well-known Ripley dataset, and the contour lines score the 'certainty of prediction' as explained in the next section.\n\nThe expression (7) is similar (proportional) to the classical Parzen window for classification, but differs in the use of a positive definite (Mercer) kernel $K$ instead of the pdf $\\kappa\\left(\\frac{X - \\cdot}{h}\\right)$ with bandwidth $h > 0$, and in the form of the denominator. The classical motivation of statistical kernel estimators is based on asymptotic theory in low dimensions (i.e. $d = O(1)$), see e.g. [4], chap. 10 and references. The functional form of the optimal rule (7) is similar to the 'simple classifier' described in [12], chap. 1. Finally, this estimator was also termed and empirically validated as a probabilistic neural network by [15]. The novel element from the above result is the derivation of a clear (both theoretical and empirical) optimality principle of the rule, as opposed to the asymptotic results of [4] and the geometric motivations in [12, 15]. 
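

The kernel rule (7) can be sketched in a few lines of plain Python. This is an illustrative sketch only, assuming the RBF form quoted above with a hypothetical default `sigma = 1.0`; the function names and the toy sample are not taken from the paper.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    '''RBF kernel exp(-||a - b||_2^2 / sigma), the form quoted in the text.'''
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma)


def mam_kernel_lambda(X, y, kernel):
    '''Normalization lam = sqrt(y^T Omega y) / n with Omega_ij = K(X_i, X_j).'''
    n = len(X)
    quad = sum(y[i] * y[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    if quad <= 0.0:
        raise ValueError('y^T Omega y must be positive')
    return math.sqrt(quad) / n


def mam_kernel_score(X, y, kernel, lam, x):
    '''Latent value (1 / (lam * n)) * sum_i Y_i K(X_i, x), as in (7).'''
    n = len(X)
    return sum(yi * kernel(xi, x) for xi, yi in zip(X, y)) / (lam * n)


# Hypothetical toy sample with two well separated groups.
X_toy = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]]
y_toy = [1, 1, -1, -1]
lam_toy = mam_kernel_lambda(X_toy, y_toy, rbf_kernel)
score_pos = mam_kernel_score(X_toy, y_toy, rbf_kernel, lam_toy, [0.1, 0.0])
score_neg = mam_kernel_score(X_toy, y_toy, rbf_kernel, lam_toy, [2.0, 2.1])
```

The predicted label is the sign of the score, which does not depend on `lam`; the normalization only matters when the magnitude of the score is interpreted, and for this kernel the score stays within [-1, 1] because both the weight vector and the feature map have unit norm.
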
As a direct byproduct, it becomes straightforward to extend the Parzen window classifier with an additional intercept term or other parametric parts, or towards additive (structured) models as in [9].\n\n3 Analysis and Rademacher Complexities\n\nThe quantity of interest in the analysis of the generalization performance is the probability of predicting a mistake (the risk $R(w; P_{XY})$), or\n\n$$R(w; P_{XY}) = P_{XY}\\left( Y (w^T \\varphi(X)) \\le 0 \\right) = E\\left[ I(Y (w^T \\varphi(X)) \\le 0) \\right], \\qquad (8)$$\n\nwhere $I(z)$ equals one if $z$ is true, and zero otherwise.\n\n3.1 Rademacher Complexity\n\nLet $\\{\\sigma_i\\}_{i=1}^n$ taken from the set $\\{-1, 1\\}^n$ be Bernoulli random variables with $P(\\sigma = 1) = P(\\sigma = -1) = \\frac{1}{2}$. The empirical Rademacher complexity is then defined [8, 1] as\n\n$$\\hat{R}_n(H) \\triangleq E_\\sigma\\left[ \\sup_{f \\in H} \\left| \\frac{2}{n} \\sum_{i=1}^n \\sigma_i f(X_i) \\right| \\ \\middle|\\ X_1, \\dots, X_n \\right], \\qquad (9)$$\n\nwhere the expectation is taken over the choice of the binary vector $\\sigma = (\\sigma_1, \\dots, \\sigma_n)^T \\in \\{-1, 1\\}^n$. It is observed that the empirical Rademacher complexity defines a natural complexity measure to study the maximal average margin classifier, as the definitions of the empirical Rademacher complexity and the maximal average margin closely resemble each other (see also [8]). The following result was given in [1], Lemma 22, but we give an alternative proof by exploiting the structure of the optimal estimate explicitly.\n\nLemma 1 (Trace bound for the Empirical Rademacher Complexity for H) Let $\\Omega \\in \\mathbb{R}^{n \\times n}$ be defined as $\\Omega_{ij} = K(X_i, X_j)$ for all i, j = 1, . . . 
, n, then\n\n$$\\hat{R}_n(H) \\le \\frac{2}{n} \\sqrt{\\mathrm{tr}(\\Omega)}. \\qquad (10)$$\n\nProof: The proof goes along the same lines as the classical bound on the empirical Rademacher complexity for kernel machines outlined in [1], Lemma 22. Specifically, once a vector $\\sigma \\in \\{-1, 1\\}^n$ is fixed, it is immediately seen that $\\max_{f \\in H} \\sum_{i=1}^n \\sigma_i f(X_i)$ equals the solution as in (7), or $\\max_w \\sum_{i=1}^n \\sigma_i (w^T \\varphi(X_i)) = \\frac{\\sigma^T \\Omega \\sigma}{\\sqrt{\\sigma^T \\Omega \\sigma}} = \\sqrt{\\sigma^T \\Omega \\sigma}$. Now, application of the expectation operator $E$ over the choice of the Rademacher variables gives\n\n$$\\hat{R}_n(H) = E\\left[ \\frac{2}{n} \\sqrt{\\sigma^T \\Omega \\sigma} \\right] \\le \\frac{2}{n} \\left( E\\left[ \\sigma^T \\Omega \\sigma \\right] \\right)^{\\frac{1}{2}} = \\frac{2}{n} \\left( \\sum_{i,j} E[\\sigma_i \\sigma_j] K(X_i, X_j) \\right)^{\\frac{1}{2}} = \\frac{2}{n} \\left( \\sum_{i=1}^n K(X_i, X_i) \\right)^{\\frac{1}{2}} = \\frac{2}{n} \\sqrt{\\mathrm{tr}(\\Omega)}, \\qquad (11)$$\n\nwhere the inequality is based on application of Jensen's inequality. This proves the Lemma. $\\square$\n\nRemark that in the case of a kernel with constant trace (as e.g. in the case of the RBF kernel where $\\sqrt{\\mathrm{tr}(\\Omega)} = \\sqrt{n}$), it follows from this result that also the (expected) Rademacher complexity $E[\\hat{R}_n(H)] \\le \\sqrt{\\mathrm{tr}(\\Omega)}$. In general, one has that $E[K(X, X)]$ equals the trace of the integral operator $T_K$ defined on $L_2(P_X)$ as $T_K(f) = \\int K(X, Y) f(X) \\, dP_X(X)$ as in [1]. Application of McDiarmid's inequality on the variable $Z = \\sup_{f \\in H} \\left( E[Y (w^T \\varphi(X))] - \\frac{1}{n} \\sum_{i=1}^n Y_i (w^T \\varphi(X_i)) \\right)$ gives the following result, as in [8, 1].\n\nLemma 2 (Deviation Inequality) Let $0 < B_\\varphi < \\infty$ be a fixed constant such that $\\sup_z \\|\\varphi(z)\\|_2 = \\sup_z \\sqrt{K(z, z)} \\le B_\\varphi$ such that $|w^T \\varphi(z)| \\le B_\\varphi$, and let $\\delta \\in \\mathbb{R}_0^+$ be fixed. 
Then with probability exceeding $1 - \\delta$, one has for any $w \\in \\mathbb{R}^d$ that\n\n$$E[Y (w^T \\varphi(X))] \\ge \\frac{1}{n} \\sum_{i=1}^n Y_i (w^T \\varphi(X_i)) - \\hat{R}_n(H) - 3 B_\\varphi \\sqrt{\\frac{2 \\ln(2/\\delta)}{n}}. \\qquad (12)$$\n\nTherefore it follows that one maximizes the expected margin by maximizing the empirical average margin, while controlling the empirical Rademacher complexity by choice of the model class (kernel). In the case of RBF kernels, $B_\\varphi = 1$, resulting in a reasonably tight bound. It is now illustrated how one can obtain a practical upper bound to the 'certainty of prediction' using $f(x) = w_n^T x$.\n\nTheorem 2 (Occurrence of Mistakes) Given an i.i.d. sample $D_n = \\{(X_i, Y_i)\\}_{i=1}^n$, a constant $B_\\varphi \\in \\mathbb{R}$ such that $\\sup_z \\sqrt{K(z, z)} \\le B_\\varphi$, and a fixed $\\delta \\in \\mathbb{R}_0^+$. Then, with probability exceeding $1 - \\delta$, one has for all $w \\in \\mathbb{R}^d$ that\n\n$$P\\left( Y (w^T \\varphi(X)) \\le 0 \\right) \\le \\frac{B_\\varphi - E[Y (w^T \\varphi(X))]}{B_\\varphi} \\le 1 - \\left( \\frac{\\sqrt{y^T \\Omega y}}{n B_\\varphi} - \\frac{\\hat{R}_n(H)}{B_\\varphi} - 3 \\sqrt{\\frac{2 \\ln(2/\\delta)}{n}} \\right). \\qquad (13)$$\n\nProof: The proof follows directly from application of Markov's inequality on the positive random variable $B_\\varphi - Y (w^T \\varphi(X))$, with expectation $B_\\varphi - E[Y (w^T \\varphi(X))]$, estimated by the sample average as in Lemma 2, where for $w = w_n$ the empirical average margin equals $\\sqrt{y^T \\Omega y} / n$. $\\square$\n\nMore generally, one obtains with probability exceeding $1 - \\delta$ that for any $w \\in \\mathbb{R}^d$ and for any $\\rho$ such that $-B_\\varphi < \\rho < B_\\varphi$\n\n$$P\\left( Y (w^T \\varphi(X)) \\le -\\rho \\right) \\le \\frac{B_\\varphi}{B_\\varphi + \\rho} - \\left( \\frac{\\sqrt{y^T \\Omega y}}{n (B_\\varphi + \\rho)} - \\frac{\\hat{R}_n(H)}{B_\\varphi + \\rho} - \\frac{3 B_\\varphi}{B_\\varphi + \\rho} \\sqrt{\\frac{2 \\ln(2/\\delta)}{n}} \\right). \\qquad (14)$$\n\nThis results in a practical assessment of the 'certainty' of a prediction as follows. 
At first, note that the random variable $Y (w_n^T \\varphi(x))$ for a fixed $X = x$ can take two values: either $-|w_n^T \\varphi(x)|$ or $|w_n^T \\varphi(x)|$. Therefore $P(Y (w_n^T \\varphi(x)) \\le 0) = P(Y (w_n^T \\varphi(x)) = -|w_n^T \\varphi(x)|) \\le P(Y (w_n^T \\varphi(x)) \\le -|w_n^T \\varphi(x)|)$ as $Y$ can only take the two values $-1$ or $1$. Thus the event '$Y \\ne \\mathrm{sign}(w^T x^*)$' for samples $X = x^*$ occurs with probability lower than the rhs. of (14) with $\\rho = |w^T x^*|$. When asserting this for a number $n_v \\in \\mathbb{N}$ of samples $X \\sim P_X$ with $n_v \\to \\infty$, a misprediction would occur less than $\\delta n_v$ times. In this sense, one can use the latent variable $w^T \\varphi(x^*)$ as an indication of how 'certain' the prediction is. Figure 1.a gives an example of the MAM classifier, together with the level plots indicating the certainty of prediction.\n\nFigure 1: Example of (a) the MAM classifier and (b) the SVM on the Ripley dataset (axes $X_1$ and $X_2$; legend: class prediction, class 1, class 2). The contour lines represent the estimate of certainty of prediction ('scores') as derived in Theorem 2 for the MAM classifier for (a), and as in Corollary 1 for the case of SVMs with $g(z) = \\min(1, \\max(-1, z))$ where $|z| < 1$ corresponds with the inner part of the margin of the SVM (b). While the contours in (a) give an overall score of the predictions, the scores given in (b) focus towards the margin of the SVM.\n\n
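

The complexity term entering these certainty bounds is the empirical Rademacher complexity, which Lemma 1 bounds by the trace of the kernel matrix. As a small numerical check, the sketch below averages `(2 / n) * sqrt(s^T Omega s)` exactly over all sign vectors and compares it with `(2 / n) * sqrt(tr(Omega))`. The exhaustive enumeration is an assumption made here to keep the check deterministic; it is only feasible for very small `n`, and the function names and inputs are hypothetical.

```python
import itertools
import math


def rademacher_exact(omega):
    '''Average of (2 / n) * sqrt(s^T Omega s) over all sign vectors s in {-1, 1}^n.'''
    n = len(omega)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        quad = sum(signs[i] * signs[j] * omega[i][j]
                   for i in range(n) for j in range(n))
        # Guard against tiny negative values caused by rounding.
        total += math.sqrt(max(quad, 0.0))
    return 2.0 * total / (n * 2 ** n)


def trace_bound(omega):
    '''Right-hand side of (10): (2 / n) * sqrt(tr(Omega)).'''
    n = len(omega)
    return 2.0 * math.sqrt(sum(omega[i][i] for i in range(n))) / n


def rbf_gram(points, sigma=1.0):
    '''Kernel matrix for the RBF kernel exp(-||a - b||_2^2 / sigma).'''
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / sigma)
             for q in points] for p in points]


# Hypothetical inputs, chosen only to keep the enumeration small.
omega_toy = rbf_gram([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
exact = rademacher_exact(omega_toy)
bound = trace_bound(omega_toy)
```

For an identity kernel matrix the quadratic form equals `n` for every sign vector, so the two quantities coincide; for other kernel matrices the exact value lies below the bound, in line with the Jensen step in the proof of Lemma 1.
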
Remark however that the described 'certainty of prediction' statement differs from a conditional statement of the risk given as $P(Y (w^T \\varphi(X)) < 0 \\mid X = x^*)$. The essential difference with the probabilistic estimates based on the density estimates resulting from the Parzen window estimator is that results become independent of the data dimension, as one avoids estimating the joint distribution.\n\n3.2 Transforming the Margin Distribution\n\nConsider the case where the assumption of a reasonable constant $B$ such that $P(\\|X\\|_2 < B) = 1$ is unrealistic. Then, a transformation of the random variable $Y (w^T X)$ can be fruitful using a monotonically increasing function $g : \\mathbb{R} \\to \\mathbb{R}$ with a constant $B'_\\varphi \\ll B$ such that $|g(z)| \\le B'_\\varphi$, and $g(0) = 0$. In the choice of a proper transformation, two counteracting effects should be traded off. At first, a small choice of the bounding constant improves the bound as e.g. described in Lemma 2. On the other hand, such a transformation would make the expected value $E[g(Y (w^T \\varphi(X)))]$ smaller than $E[Y (w^T \\varphi(X))]$. Modifying Theorem 2 gives\n\nCorollary 1 (Occurrence of Mistakes, bis) Given i.i.d. samples $D_n = \\{(X_i, Y_i)\\}_{i=1}^n$, and a fixed $\\delta \\in \\mathbb{R}_0^+$. Let $g : \\mathbb{R} \\to \\mathbb{R}$ be a monotonically increasing function with Lipschitz constant $0 < L_g < \\infty$, let $B'_\\varphi \\in \\mathbb{R}$ such that $|g(z)| \\le B'_\\varphi$ for all $z$, and $g(0) = 0$. 
Then with probability exceeding $1 - \\delta$, one has for any $\\rho$ such that $-B'_\\varphi \\le \\rho \\le B'_\\varphi$ and $w \\in \\mathbb{R}^d$ that\n\n$$P\\left( g(Y (w_n^T \\varphi(X))) \\le -\\rho \\right) \\le \\frac{B'_\\varphi}{B'_\\varphi + \\rho} - \\frac{\\frac{1}{n} \\sum_{i=1}^n g(Y_i (w_n^T \\varphi(X_i))) - L_g \\hat{R}_n(H) - 3 B'_\\varphi \\sqrt{\\frac{2 \\log(2/\\delta)}{n}}}{B'_\\varphi + \\rho}. \\qquad (15)$$\n\nThis result follows straightforwardly from Theorem 2 using the property that $\\hat{R}_n(g \\circ H) \\le L_g \\hat{R}_n(H)$, see e.g. [1]. When $\\rho = 0$, one has $P\\left( g(Y (w_n^T \\varphi(X))) \\le 0 \\right) \\le 1 - \\frac{E[g(Y (w_n^T \\varphi(X)))]}{B'_\\varphi}$. Similarly to the previous section, Corollary 1 can be used to score the certainty of prediction by considering for each $X = x^*$ the value of $g(w^T x^*)$ and $g(-w^T x^*)$. Figure 1.b gives an example by considering the clipping transformation $g(z) = \\min(1, \\max(-1, z)) \\in [-1, 1]$ such that $B'_\\varphi = 1$. Note that this a-priori choice of the function $g$ is not dependent on the (empirical) optimality criterion at hand.\n\n3.3 Soft-margin SVMs and MAM classifiers\n\nApart from the margin-based mechanisms, the MAM classifier shares other properties with the soft-margin maximal margin classifier (SVM) as well. Consider the following saturation function $g(z) = (1 - z)_+$, where $(\\cdot)_+$ is defined as $(z)_+ = z$ if $z \\ge 0$, and zero otherwise. Applying this function to the MAM formulation of (4), one obtains for a $C > 0$\n\n$$\\max_w -\\sum_{i=1}^n \\left( 1 - Y_i (w^T \\varphi(X_i)) \\right)_+ \\quad \\text{s.t.} \\quad w^T w = C, \\qquad (16)$$\n\nwhich is similar to the support vector machine (see e.g. [14]). To make this equivalence more explicit, consider the following formulation of (16)\n\n$\\min_{w, \\xi} \\sum_{i=1}^n \\xi_i$ s.t. $w^T w \\le C$ and $Y_i (w^T \\varphi(X_i)) \\ge 1 - \\xi_i$, $\\xi_i \\ge 0$ for all i = 1, . . . 
, n, (17)\n\nwhich is similar to the SVM. Consider the following modification\n\n$$\\min_{w, \\xi} \\sum_{i=1}^n \\xi_i \\quad \\text{s.t.} \\quad w^T w \\le C \\ \\text{and} \\ Y_i (w^T \\varphi(X_i)) \\ge 1 - \\xi_i \\quad \\forall i = 1, \\dots, n, \\qquad (18)$$\n\nwhich is equivalent to (4) as in the optimum, $Y_i (w^T \\varphi(X_i)) = (1 - \\xi_i)$ for all $i$. Thus, omission of the slack constraints $\\xi_i \\ge 0$ in the SVM formulation results in the Parzen window classifier.\n\n4 Maximal Average Margin for Ordinal Regression\n\nAlong the same lines as [6], the maximal average margin principle can be applied to ordinal regression tasks. Let $(X, Y) \\in \\mathbb{R}^d \\times \\{1, \\dots, m\\}$ with distribution $P_{XY}$. The $w \\in \\mathbb{R}^d$ maximizing $P(w^T (\\varphi(X) - \\varphi(X')) (Y - Y') > 0)$ can be found by solving for the maximal average margin between pairs as follows\n\n$$M^* = \\max_w E\\left[ \\frac{\\mathrm{sign}(Y - Y') \\, w^T (\\varphi(X) - \\varphi(X'))}{\\|w\\|_2} \\right]. \\qquad (19)$$\n\nGiven $n$ i.i.d. samples $\\{(X_i, Y_i)\\}_{i=1}^n$, empirical risk minimization is obtained by solving\n\n$$\\min_w -\\frac{1}{n} \\sum_{i,j=1}^n \\mathrm{sign}(Y_j - Y_i) \\, w^T (\\varphi(X_j) - \\varphi(X_i)) \\quad \\text{s.t.} \\quad \\|w\\|_2 \\le 1. \\qquad (20)$$\n\nThe Lagrangian with multiplier $\\lambda \\ge 0$ becomes $L(w, \\lambda) = -\\frac{1}{n} \\sum_{i,j} w^T \\mathrm{sign}(Y_j - Y_i) (\\varphi(X_j) - \\varphi(X_i)) + \\frac{\\lambda}{2} (w^T w - 1)$. Let there be $n'$ couples $(i, j)$. Let $D_y \\in \\{-1, 0, 1\\}^{n' \\times n}$ such that $D_{y,ki} = 1$ and $D_{y,kj} = -1$ if the $k$th couple equals $(i, j)$. Then, by switching the minimax problem to a maximin problem, the first order condition for optimality $\\frac{\\partial L(w, \\lambda)}{\\partial w} = 0$ gives the expression $w_n = \\frac{1}{\\lambda' n} \\sum_{Y_i < Y_j} (\\varphi(X_j) - \\varphi(X_i)) = \\frac{1}{\\lambda n} X D_y 1_{n'}$. Now the parameter $\\lambda$ can be found by substituting this expression in the constraint $w^T w = 1$, or $\\lambda = \\frac{1}{n} \\sqrt{1_{n'}^T D_y^T X^T X D_y 1_{n'}}$. Now the key element is the computation of $d_y = D_y 1_{n'}$. Note that\n\n$$d_y(i) = \\sum_{j=1}^n \\mathrm{sign}(Y_j - Y_i) \\triangleq r_Y(i), \\qquad (21)$$\n\nwith $r_Y$ denoting the ranks of all $Y_i$ in $y$. This simplifies the expression for $w_n$ to $w_n = \\frac{1}{\\lambda n} X d_y$. It is seen that using kernels as before, the resulting estimator of the order of the responses corresponding to $x$ and $x'$ becomes\n\n$$\\hat{f}_K(x, x') = \\mathrm{sign}(m(x) - m(x')), \\quad \\text{where} \\quad m(x) = \\frac{1}{\\lambda n} \\sum_{i=1}^n K(X_i, x) \\, r_Y(i). \\qquad (22)$$\n\nRemark that the estimator $m : \\mathbb{R}^d \\to \\mathbb{R}$ equals (except for the normalization term) the Nadaraya-Watson kernel estimator based on the rank-transform $r_Y$ of the responses. This observation suggests the application of standard regression tools based on the rank-transformed responses as in [7]. Experiments confirm the use of the proposed ranking estimator, and also motivate the use of more involved function approximation tools such as LS-SVMs [16] based on the rank-transformed responses.\n\n(a) Histogram: horizontal axis $\\tau$ from 0.5 to 1, vertical axis Frequency from 0 to 120, legend: oMAM, LS-SVM, oSVM, oGP.\n\nData (train/test)\nBank(1) (100/8.092)\nBank(1) (500/7.629)\nBank(1) (5.000/3.192)\nBank(1) (7.500/692)\nBank(2) (100/8.092)\nBank(2) (500/7.629)\nBank(2) (5.000/3.192)\nBank(2) (7.500/692)\nCpu(1) (100/20.540)\nCpu(1) (500/20.140)\nCpu(1) (5.000/15.640)\nCpu(1) (7.500/13.140)\nCpu(1) (15.000/5.640)\n(b)\n\n0.37\n0.49\n0.56\n0.57\n0.81\n0.83\n0.86\n0.88\n0.44\n0.50\n0.57\n0.60\n0.69\n\noMAM LS-SVM oSVM oGP\n0.41\n0.50\n\n0.46\n0.55\n\n0.43\n0.51\n0.56\n\n-\n\n-\n-\n\n-\n-\n\n0.84\n0.86\n0.88\n\n-\n\n0.62\n0.66\n0.68\n\n-\n-\n\n0.87\n0.87\n\n-\n-\n\n0.64\n0.66\n\n-\n-\n-\n\n0.80\n0.81\n\n-\n-\n\n0.63\n0.65\n\n-\n-\n-\n\nFigure 2: Results on ordinal regression tasks using oMAM (22) of $O(n)$, a regression on the rank-transformed responses using LS-SVMs [16] of $O(n^2) - O(n^3)$, ordinal SVMs and ordinal Gaussian Processes for preferential learning of $O(n^4) - O(n^6)$. The results are expressed as Kendall's $\\tau$ (with $-1 \\le \\tau \\le 1$) computed on the validation datasets. Figure (a) reports the numerical results of the artificially generated data, Table (b) gives the result on a number of large scaled datasets described in [2], if the computation took less than 5 minutes.\n\n5 Illustrative Example\n\nTable 2.b provides numerical results on the 13 classification (including 100 randomizations) benchmark datasets as described in [11]. The choice of an appropriate kernel parameter was obtained by cross-validation over a range of bandwidths from $\\sigma = 10^{-2}$ to $\\sigma = 10^{15}$. The results illustrate that the Parzen window classifier performs in general slightly (but not significantly so) worse than the other methods, but obviously reduces the required amount of memory and computation time (i.e. $O(n)$ versus $O(n^2) - O(n^3)$). Hence, it is advised to use the Parzen classifier as a cheap baseline method, or to use it in a context where time or memory requirements are stringent. The first artificial dataset for testing the ordinal regression scheme is constructed as follows. The training set $\\{(X_i, Y_i)\\}_{i=1}^n \\subset \\mathbb{R}^5 \\times \\mathbb{R}$ with $n = 100$ and a validation set $\\{(X_i^v, Y_i^v)\\}_{i=1}^{n_v} \\subset \\mathbb{R}^5 \\times \\mathbb{R}$ with $n_v = 250$ is constructed such that $Z_i = (w_*^T X_i)^3 + e_i$ and $Z_i^v = (w_*^T X_i^v)^3 + e_i^v$ with $w_* \\sim N(0, 1)$, $X, X^v \\sim N(0, I_5)$, and $e, e^v \\sim N(0, 0.25)$. Now $Y$ (and $Y^v$) are generated preserving the order implied by $\\{Z_i\\}_{i=1}^{100}$ (and $\\{Z_i^v\\}_{i=1}^{250}$) with the intervals $\\chi^2$-distributed with 5 degrees of freedom. 
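

The ranking rule (22) evaluated in these experiments can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. Two assumptions are made here: the rank scores are taken as `r(i) = sum_j sign(Y_i - Y_j)`, a sign convention chosen so that larger responses receive larger scores, and the positive factor `1 / (lam * n)` is dropped because it does not change the estimated order. The function names, the RBF kernel with `sigma = 1.0` and the toy sample are hypothetical.

```python
import bisect
import math


def rank_scores(y):
    '''r(i) = #{j : y_j < y_i} - #{j : y_j > y_i}, computed by sorting.'''
    ordered = sorted(y)
    n = len(y)
    return [bisect.bisect_left(ordered, yi) - (n - bisect.bisect_right(ordered, yi))
            for yi in y]


def rbf_kernel(a, b, sigma=1.0):
    '''RBF kernel exp(-||a - b||_2^2 / sigma).'''
    return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / sigma)


def omam_score(X, r, kernel, x):
    '''m(x) up to the positive factor 1 / (lam * n): sum_i K(X_i, x) r(i).'''
    return sum(ri * kernel(xi, x) for xi, ri in zip(X, r))


def omam_order(X, r, kernel, x, x_prime):
    '''Estimated order of the responses at x and x_prime: sign(m(x) - m(x_prime)).'''
    diff = omam_score(X, r, kernel, x) - omam_score(X, r, kernel, x_prime)
    return (diff > 0) - (diff < 0)


# Hypothetical toy sample with ordinal responses that increase with the input.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1, 2, 2, 3]
r_toy = rank_scores(y_toy)
order = omam_order(X_toy, r_toy, rbf_kernel, [3.0], [0.0])
```

Once the rank scores are available, each evaluation of the score costs one pass over the training sample, which is where the linear cost per prediction comes from; the sorting step used here to obtain the rank scores adds a one-off `O(n log n)` cost.
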
Figure 2.a shows the results of a Monte Carlo experiment comparing the proposed $O(n)$ estimator (22), an LS-SVM regressor of $O(n^2) - O(n^3)$ on the rank-transformed responses $\\{(X_i, r_Y(i))\\}$, the $O(n^4) - O(n^6)$ SVM approach as proposed in [3] and the Gaussian Process approach of $O(n^4) - O(n^6)$ given in [2]. The performance of the different algorithms is expressed in terms of Kendall's $\\tau$ computed on the validation data. Table 2.b reports the results on some large scale datasets as described in [2], imposing a maximal computation time of 5 minutes. Both tests suggest the competitive nature of the proposed $O(n)$ procedure, while clearly showing the benefit of using function estimation (e.g. LS-SVMs) based on the rank-transformed responses.\n\n6 Conclusion\n\nThis paper discussed the use of the MAM risk optimality principle for designing a learning machine for classification and ordinal regression. The relation with classical methods including Parzen windows and Nadaraya-Watson estimators is established, while the relation with the empirical Rademacher complexity is used to provide a measure of 'certainty of prediction'. Empirical experiments show the applicability of the $O(n)$ algorithms on real world problems, trading performance somewhat for computational efficiency with respect to state-of-the-art learning algorithms.\n\nReferences\n\n[1] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[2] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019\u20131041, 2006.\n\n[3] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proc. of International Conference on Machine Learning, pages 145\u2013152. 2005.\n\n[4] L. Devroye, L. 
Gy\u00f6rfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.\n\n[5] A. Garg and D. Roth. Margin distribution and learning algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pages 210\u2013217. Morgan Kaufmann Publishers, 2003.\n\n[6] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115\u2013132, 2000. MIT Press, Cambridge, MA.\n\n[7] R.L. Iman and W.J. Conover. The use of the rank transform in regression. Technometrics, 21(4):499\u2013509, 1979.\n\n[8] V. Koltchinski. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902\u20131914, 1999.\n\n[9] K. Pelckmans. Primal-Dual kernel Machines. PhD thesis, Faculty of Engineering, K.U.Leuven, May 2005. 280 p., TR 05-95.\n\n[10] K. Pelckmans, J. Shawe-Taylor, J.A.K. Suykens, and B. De Moor. Margin based transductive graph cuts using linear programming. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), pp. 360\u2013367, San Juan, Puerto Rico, 2007.\n\n[11] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. Machine Learning, 42(3):287\u2013320, 2001.\n\n[12] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n\n[13] J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the twelfth annual conference on Computational learning theory (COLT), pages 278\u2013285. ACM Press, 1999.\n\n[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n\n[15] D.F. Specht. Probabilistic neural networks. Neural Networks, 3:110\u2013118, 1990.\n\n[16] J.A.K. Suykens, T. van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. 
Least Squares Support Vector Machines. World Scientific, Singapore, 2002.\n", "award": [], "sourceid": 581, "authors": [{"given_name": "Kristiaan", "family_name": "Pelckmans", "institution": null}, {"given_name": "Johan", "family_name": "Suykens", "institution": null}, {"given_name": "Bart", "family_name": "Moor", "institution": null}]}