{"title": "Minimax Probability Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 807, "abstract": null, "full_text": "Minimax Probability Machine \n\nUniversity of California, Berkeley \n\nUniversity of California, Berkeley \n\nGert R.G. Lanckriet* \nDepartment of EECS \n\nBerkeley, CA 94720-1770 \n\ngert@eecs. berkeley.edu \n\nLaurent EI Ghaoui \nDepartment of EECS \n\nBerkeley, CA 94720-1770 \nelghaoui@eecs.berkeley.edu \n\nChiranjib Bhattacharyya \n\nDepartment of EECS \n\nUniversity of California, Berkeley \n\nBerkeley, CA 94720-1776 \nchiru@eecs.berkeley.edu \n\nMichael I. Jordan \n\nComputer Science and Statistics \nUniversity of California, Berkeley \n\nBerkeley, CA 94720-1776 \njordan@cs.berkeley.edu \n\nAbstract \n\nWhen constructing a classifier, the probability of correct classifi(cid:173)\ncation of future data points should be maximized. In the current \npaper this desideratum is translated in a very direct way into an \noptimization problem, which is solved using methods from con(cid:173)\nvex optimization. We also show how to exploit Mercer kernels in \nthis setting to obtain nonlinear decision boundaries. A worst-case \nbound on the probability of misclassification of future data is ob(cid:173)\ntained explicitly. \n\n1 \n\nIntroduction \n\nConsider the problem of choosing a linear discriminant by minimizing the probabil(cid:173)\nities that data vectors fall on the wrong side of the boundary. One way to attempt \nto achieve this is via a generative approach in which one makes distributional as(cid:173)\nsumptions about the class-conditional densities and thereby estimates and controls \nthe relevant probabilities. The need to make distributional assumptions, however, \ncasts doubt on the generality and validity of such an approach, and in discrimina(cid:173)\ntive solutions to classification problems it is common to attempt to dispense with \nclass-conditional densities entirely. 
Rather than avoiding any reference to class-conditional densities, it might be useful to attempt to control misclassification probabilities in a worst-case setting; that is, under all possible choices of class-conditional densities. Such a minimax approach could be viewed as providing an alternative justification for discriminative approaches. In this paper we show how such a minimax programme can be carried out in the setting of binary classification. Our approach involves exploiting the following powerful theorem due to Isii [6], as extended in recent work by Bertsimas and Sethuraman [2]:

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \ge b\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{a^T y \ge b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}), \qquad (1)$$

where $y$ is a random vector, where $a$ and $b$ are constants, and where the supremum is taken over all distributions having mean $\bar{y}$ and covariance matrix $\Sigma_y$. This theorem provides us with the ability to bound the probability of misclassifying a point, without making Gaussian or other specific distributional assumptions. We will show how to exploit this ability in the design of linear classifiers.

One of the appealing features of this formulation is that one obtains an explicit upper bound on the probability of misclassification of future data: $1/(1+d^2)$. A second appealing feature of this approach is that, as in linear discriminant analysis [7], it is possible to generalize the basic methodology, utilizing Mercer kernels and thereby forming nonlinear decision boundaries. We show how to do this in Section 3.

The paper is organized as follows: in Section 2 we present the minimax formulation for linear classifiers, while in Section 3 we deal with kernelizing the method. We present empirical results in Section 4.

* http://robotics.eecs.berkeley.edu/~gert/

2 Maximum probabilistic decision hyperplane

In this section we present our minimax formulation for linear decision boundaries.
Let $x$ and $y$ denote random vectors in a binary classification problem, with mean vectors and covariance matrices given by $x \sim (\bar{x}, \Sigma_x)$ and $y \sim (\bar{y}, \Sigma_y)$, respectively, where "$\sim$" means that the random variable has the specified mean and covariance matrix but that the distribution is otherwise unconstrained. Note that $x, \bar{x}, y, \bar{y} \in \mathbb{R}^n$ and $\Sigma_x, \Sigma_y \in \mathbb{R}^{n \times n}$.

We want to determine the hyperplane $a^T z = b$ ($a, z \in \mathbb{R}^n$ and $b \in \mathbb{R}$) that separates the two classes of points with maximal probability with respect to all distributions having these means and covariance matrices. This boils down to:

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad \inf \Pr\{a^T x \ge b\} \ge \alpha, \quad \inf \Pr\{a^T y \le b\} \ge \alpha, \qquad (2)$$

or,

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad 1 - \alpha \ge \sup \Pr\{a^T x \le b\}, \quad 1 - \alpha \ge \sup \Pr\{a^T y \ge b\}. \qquad (3)$$

Consider the second constraint in (3). Recall the result of Bertsimas and Sethuraman [2]:

$$\sup \Pr\{a^T y \ge b\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{a^T y \ge b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}). \qquad (4)$$

We can write this as $d^2 = \inf_{c^T w \ge d} w^T w$, where $w = \Sigma_y^{-1/2}(y - \bar{y})$, $c^T = a^T \Sigma_y^{1/2}$ and $d = b - a^T \bar{y}$. To solve this, first notice that we can assume that $a^T \bar{y} \le b$ (i.e. $\bar{y}$ is classified correctly by the decision hyperplane $a^T z = b$): indeed, otherwise we would find $d^2 = 0$ and thus $\alpha = 0$ for that particular $a$ and $b$, which can never be an optimal value. So, $d > 0$. We then form the Lagrangian:

$$\mathcal{L}(w, \lambda) = w^T w + \lambda(d - c^T w), \qquad (5)$$

which is to be maximized with respect to $\lambda \ge 0$ and minimized with respect to $w$. At the optimum, $2w = \lambda c$ and $d = c^T w$, so $\lambda = \frac{2d}{c^T c}$ and $w = \frac{d}{c^T c} c$. This yields:

$$d^2 = \frac{(b - a^T \bar{y})^2}{a^T \Sigma_y a}. \qquad (6)$$

Using (4), the second constraint in (3) becomes $1 - \alpha \ge 1/(1+d^2)$ or $d^2 \ge \alpha/(1-\alpha)$. Taking (6) into account, this boils down to:

$$b - a^T \bar{y} \ge \kappa(\alpha) \sqrt{a^T \Sigma_y a}, \quad \text{where} \quad \kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}}. \qquad (7)$$

We can handle the first constraint in (3) in a similar way (just write $a^T x \le b$ as $-a^T x \ge -b$ and apply the result (7) for the second constraint).
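The bound (4)–(7) admits a quick numerical sanity check. The following sketch (plain NumPy; the helper names are our own, not from the paper) evaluates the worst-case tail probability $1/(1+d^2)$ via (6), and confirms that choosing $b$ with slack $\kappa(\alpha)\sqrt{a^T \Sigma_y a}$ above $a^T \bar{y}$ drives the worst-case misclassification probability to exactly $1 - \alpha$.

```python
import numpy as np

def worst_case_tail(a, b, y_bar, Sigma_y):
    """Worst-case Pr{a^T y >= b} over all distributions with mean y_bar
    and covariance Sigma_y, via (4) and (6): 1/(1 + d^2) with
    d^2 = (b - a^T y_bar)^2 / (a^T Sigma_y a), valid when a^T y_bar <= b
    (otherwise d^2 = 0 and the bound is trivially 1)."""
    slack = b - a @ y_bar
    if slack <= 0:
        return 1.0
    d2 = slack**2 / (a @ Sigma_y @ a)
    return 1.0 / (1.0 + d2)

def kappa(alpha):
    """kappa(alpha) = sqrt(alpha / (1 - alpha)) from (7)."""
    return np.sqrt(alpha / (1.0 - alpha))

# Sanity check: a slack of exactly kappa(alpha) * sqrt(a^T Sigma_y a)
# makes the worst-case tail probability equal to 1 - alpha.
a = np.array([1.0, -2.0])
y_bar = np.array([0.5, 1.0])
Sigma_y = np.array([[2.0, 0.3], [0.3, 1.0]])
alpha = 0.9
b = a @ y_bar + kappa(alpha) * np.sqrt(a @ Sigma_y @ a)
print(worst_case_tail(a, b, y_bar, Sigma_y))  # ~0.1, i.e. 1 - alpha
```

Note that the bound is distribution-free: no Gaussianity is assumed anywhere, only the first two moments.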
The optimization problem (3) then becomes:

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \kappa(\alpha)\sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \kappa(\alpha)\sqrt{a^T \Sigma_y a}. \qquad (8)$$

Because $\kappa(\alpha)$ is a monotone increasing function of $\alpha$, we can write this as:

$$\max_{\kappa, a, b} \kappa \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \kappa\sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \kappa\sqrt{a^T \Sigma_y a}. \qquad (9)$$

From both constraints in (9), we get

$$a^T \bar{y} + \kappa\sqrt{a^T \Sigma_y a} \le b \le a^T \bar{x} - \kappa\sqrt{a^T \Sigma_x a}, \qquad (10)$$

which allows us to eliminate $b$ from (9):

$$\max_{\kappa, a} \kappa \quad \text{s.t.} \quad a^T \bar{y} + \kappa\sqrt{a^T \Sigma_y a} \le a^T \bar{x} - \kappa\sqrt{a^T \Sigma_x a}. \qquad (11)$$

Because we want to maximize $\kappa$, it is obvious that the inequalities in (10) will become equalities at the optimum. The optimal value of $b$ will thus be given by

$$b_* = a_*^T \bar{x} - \kappa_* \sqrt{a_*^T \Sigma_x a_*} = a_*^T \bar{y} + \kappa_* \sqrt{a_*^T \Sigma_y a_*}, \qquad (12)$$

where $a_*$ and $\kappa_*$ are the optimal values of $a$ and $\kappa$ respectively. Rearranging the constraint in (11), we get:

$$a^T(\bar{x} - \bar{y}) \ge \kappa\left(\sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a}\right). \qquad (13)$$

The above is positively homogeneous in $a$: if $a$ satisfies (13), then $sa$ with $s \in \mathbb{R}_+$ also does. Furthermore, (13) implies $a^T(\bar{x} - \bar{y}) \ge 0$. Thus, we can restrict $a$ to be such that $a^T(\bar{x} - \bar{y}) = 1$. The optimization problem (11) then becomes

$$\max_{\kappa, a} \kappa \quad \text{s.t.} \quad \frac{1}{\kappa} \ge \sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a}, \quad a^T(\bar{x} - \bar{y}) = 1, \qquad (14)$$

which allows us to eliminate $\kappa$:

$$\min_a \sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a} \quad \text{s.t.} \quad a^T(\bar{x} - \bar{y}) = 1, \qquad (15)$$

or, equivalently,

$$\min_a \|\Sigma_x^{1/2} a\|_2 + \|\Sigma_y^{1/2} a\|_2 \quad \text{s.t.} \quad a^T(\bar{x} - \bar{y}) = 1. \qquad (16)$$

This is a convex optimization problem, more precisely a second order cone program (SOCP) [8, 5]. Furthermore, notice that we can write $a = a_0 + Fu$, where $u \in \mathbb{R}^{n-1}$, $a_0 = (\bar{x} - \bar{y})/\|\bar{x} - \bar{y}\|_2^2$, and $F \in \mathbb{R}^{n \times (n-1)}$ is an orthogonal matrix whose columns span the subspace of vectors orthogonal to $\bar{x} - \bar{y}$. Using this we can write (16) as an unconstrained SOCP:

$$\min_u \|\Sigma_x^{1/2}(a_0 + Fu)\|_2 + \|\Sigma_y^{1/2}(a_0 + Fu)\|_2. \qquad (17)$$

We can solve this problem in various ways, for example using interior-point methods for SOCP [8], which yield a worst-case complexity of $O(n^3)$. Of course, the first and second moments of $x$, $y$ must be estimated from data, using for example plug-in estimates $\hat{\bar{x}}, \hat{\bar{y}}, \hat{\Sigma}_x, \hat{\Sigma}_y$ for respectively $\bar{x}, \bar{y}, \Sigma_x, \Sigma_y$.
This brings the total complexity to $O(l n^3)$, where $l$ is the number of data points. This is the same complexity as that of the quadratic programs one has to solve in support vector machines.

In our implementations, we took an iterative least-squares approach, which is based on the following form, equivalent to (17):

$$\min_{u, \delta, \epsilon} \; \frac{\delta}{2} + \frac{\|\Sigma_x^{1/2}(a_0 + Fu)\|_2^2}{2\delta} + \frac{\epsilon}{2} + \frac{\|\Sigma_y^{1/2}(a_0 + Fu)\|_2^2}{2\epsilon}, \qquad (18)$$

where minimizing over $\delta > 0$ and $\epsilon > 0$ in closed form recovers (17). At iteration $k$, we first minimize with respect to $\delta$ and $\epsilon$ by setting $\delta_k = \|\Sigma_x^{1/2}(a_0 + Fu_{k-1})\|_2$ and $\epsilon_k = \|\Sigma_y^{1/2}(a_0 + Fu_{k-1})\|_2$. Then we minimize with respect to $u$ by solving a least squares problem in $u$ for $\delta = \delta_k$ and $\epsilon = \epsilon_k$, which gives us $u_k$. Because in both update steps the objective of this convex program will not increase, the iteration will converge to the global minimum $\|\Sigma_x^{1/2}(a_0 + Fu_*)\|_2 + \|\Sigma_y^{1/2}(a_0 + Fu_*)\|_2$, with $u_*$ an optimal value of $u$.

We then obtain $a_*$ as $a_0 + Fu_*$ and $b_*$ from (12) with $\kappa_* = 1/\left(\sqrt{a_*^T \Sigma_x a_*} + \sqrt{a_*^T \Sigma_y a_*}\right)$. Classification of a new data point $z_{\text{new}}$ is done by evaluating $\text{sign}(a_*^T z_{\text{new}} - b_*)$: if this is $+1$, $z_{\text{new}}$ is classified as belonging to class $x$, otherwise $z_{\text{new}}$ is classified as belonging to class $y$.

It is interesting to see what happens if we make distributional assumptions; in particular, let us assume that $x \sim \mathcal{N}(\bar{x}, \Sigma_x)$ and $y \sim \mathcal{N}(\bar{y}, \Sigma_y)$. This leads to the following optimization problem:

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \Phi^{-1}(\alpha)\sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \Phi^{-1}(\alpha)\sqrt{a^T \Sigma_y a}, \qquad (19)$$

where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution. This has the same form as (8), but now with $\kappa(\alpha) = \Phi^{-1}(\alpha)$ instead of $\kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}}$ (cf. a result by Chernoff [4]). We thus solve the same optimization problem ($\alpha$ disappears from the optimization problem because $\kappa(\alpha)$ is monotone increasing) and find the same decision hyperplane $a^T z = b$. The difference lies in the value of $\alpha$ associated with $\kappa_*$: $\alpha$ will be higher in this case, so the hyperplane will have a higher predicted probability of classifying future data correctly.
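As a concrete illustration of the procedure of Section 2, here is a minimal NumPy sketch of the whole pipeline: plug-in moments, the $a = a_0 + Fu$ parametrization, the iterative least-squares updates for (17)–(18), and the resulting classifier $\text{sign}(a^T z - b)$. The function names, the small ridge added to the covariance estimates, and the fixed iteration count are our own choices for the sketch, not prescriptions of the paper.

```python
import numpy as np

def mpm_train(X, Y, n_iter=50, ridge=1e-8):
    """Sketch of the linear minimax probability machine. X, Y are
    (num_points, n) arrays for the two classes. Returns (a, b, alpha),
    alpha being the worst-case bound on correct classification."""
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Sx = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    Sy = np.cov(Y, rowvar=False) + ridge * np.eye(Y.shape[1])
    Lx, Ly = np.linalg.cholesky(Sx), np.linalg.cholesky(Sy)  # Sx = Lx Lx^T
    diff = x_bar - y_bar
    a0 = diff / (diff @ diff)        # a0 = (x_bar - y_bar)/||x_bar - y_bar||^2
    # F: orthonormal basis of the subspace orthogonal to x_bar - y_bar
    Q, _ = np.linalg.qr(np.column_stack([diff, np.eye(len(diff))]))
    F = Q[:, 1:len(diff)]
    u = np.zeros(F.shape[1])
    for _ in range(n_iter):
        a = a0 + F @ u
        delta = np.linalg.norm(Lx.T @ a)   # delta_k, eps_k updates of (18)
        eps = np.linalg.norm(Ly.T @ a)
        # least-squares step in u for fixed delta, eps:
        # minimize ||Lx^T (a0+Fu)||^2 / delta + ||Ly^T (a0+Fu)||^2 / eps
        M = np.vstack([Lx.T @ F / np.sqrt(delta), Ly.T @ F / np.sqrt(eps)])
        r = np.concatenate([Lx.T @ a0 / np.sqrt(delta), Ly.T @ a0 / np.sqrt(eps)])
        u = np.linalg.lstsq(M, -r, rcond=None)[0]
    a = a0 + F @ u
    kap = 1.0 / (np.linalg.norm(Lx.T @ a) + np.linalg.norm(Ly.T @ a))
    b = a @ x_bar - kap * np.linalg.norm(Lx.T @ a)   # optimal b from (12)
    alpha = kap**2 / (1.0 + kap**2)                  # alpha = kappa^2/(1+kappa^2)
    return a, b, alpha

def mpm_classify(a, b, Z):
    """sign(a^T z - b): +1 -> class x, -1 -> class y."""
    return np.sign(Z @ a - b)

rng = np.random.default_rng(0)
X = rng.normal([2.0, 2.0], 0.5, size=(100, 2))
Y = rng.normal([-2.0, -2.0], 0.5, size=(100, 2))
a, b, alpha = mpm_train(X, Y)
print(alpha)  # worst-case probability of correct classification
```

The constraint $a^T(\bar{x} - \bar{y}) = 1$ holds automatically throughout, since the columns of $F$ are orthogonal to $\bar{x} - \bar{y}$ and $a_0^T(\bar{x} - \bar{y}) = 1$.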
3 Kernelization

In this section we describe the "kernelization" of the minimax approach described in the previous section. We seek to map the problem to a higher dimensional feature space $\mathbb{R}^f$ via a mapping $\varphi: \mathbb{R}^n \mapsto \mathbb{R}^f$, such that a linear discriminant in the feature space corresponds to a nonlinear discriminant in the original space. To carry out this programme, we need to try to reformulate the minimax problem in terms of a kernel function $K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$ satisfying Mercer's condition.

Let the data be mapped as $x \mapsto \varphi(x) \sim (\overline{\varphi(x)}, \Sigma_{\varphi(x)})$ and $y \mapsto \varphi(y) \sim (\overline{\varphi(y)}, \Sigma_{\varphi(y)})$, where $\{x_i\}_{i=1}^{N_x}$ and $\{y_i\}_{i=1}^{N_y}$ are training data points in the classes corresponding to $x$ and $y$ respectively. The decision hyperplane in $\mathbb{R}^f$ is then given by $a^T \varphi(z) = b$ with $a, \varphi(z) \in \mathbb{R}^f$ and $b \in \mathbb{R}$. In $\mathbb{R}^f$, we need to solve the following optimization problem:

$$\min_a \sqrt{a^T \Sigma_{\varphi(x)} a} + \sqrt{a^T \Sigma_{\varphi(y)} a} \quad \text{s.t.} \quad a^T\left(\overline{\varphi(x)} - \overline{\varphi(y)}\right) = 1, \qquad (20)$$

where, as in (12), the optimal value of $b$ will be given by

$$b_* = a_*^T \overline{\varphi(x)} - \kappa_* \sqrt{a_*^T \Sigma_{\varphi(x)} a_*} = a_*^T \overline{\varphi(y)} + \kappa_* \sqrt{a_*^T \Sigma_{\varphi(y)} a_*}, \qquad (21)$$

where $a_*$ and $\kappa_*$ are the optimal values of $a$ and $\kappa$ respectively. However, we do not wish to solve the convex program in this form, because we want to avoid using $f$ or $\varphi$ explicitly.

If $a$ has a component in $\mathbb{R}^f$ which is orthogonal to the subspace spanned by $\varphi(x_i)$, $i = 1, 2, \ldots, N_x$, and $\varphi(y_i)$, $i = 1, 2, \ldots, N_y$, then that component won't affect the objective or the constraint in (20). This implies that we can write $a$ as

$$a = \sum_{i=1}^{N_x} \alpha_i \varphi(x_i) + \sum_{j=1}^{N_y} \beta_j \varphi(y_j). \qquad (22)$$

Substituting expression (22) for $a$ and the estimates $\widehat{\overline{\varphi(x)}} = \frac{1}{N_x}\sum_{i=1}^{N_x} \varphi(x_i)$, $\widehat{\overline{\varphi(y)}} = \frac{1}{N_y}\sum_{i=1}^{N_y} \varphi(y_i)$, $\hat{\Sigma}_{\varphi(x)} = \frac{1}{N_x}\sum_{i=1}^{N_x} \left(\varphi(x_i) - \widehat{\overline{\varphi(x)}}\right)\left(\varphi(x_i) - \widehat{\overline{\varphi(x)}}\right)^T$ and $\hat{\Sigma}_{\varphi(y)} = \frac{1}{N_y}\sum_{i=1}^{N_y} \left(\varphi(y_i) - \widehat{\overline{\varphi(y)}}\right)\left(\varphi(y_i) - \widehat{\overline{\varphi(y)}}\right)^T$ for the means and the covariance matrices in the objective and the constraint of the optimization problem (20), we see that both the objective and the constraints can be written in terms of the kernel function $K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$. We obtain:

$$\min_\gamma \sqrt{\frac{1}{N_x} \gamma^T \tilde{K}_x^T \tilde{K}_x \gamma} + \sqrt{\frac{1}{N_y} \gamma^T \tilde{K}_y^T \tilde{K}_y \gamma} \quad \text{s.t.} \quad \gamma^T (\tilde{k}_x - \tilde{k}_y) = 1, \qquad (23)$$

where $\gamma = [\alpha_1 \ \alpha_2 \ \cdots \ \alpha_{N_x} \ \beta_1 \ \beta_2 \ \cdots \ \beta_{N_y}]^T$, $\tilde{k}_x \in \mathbb{R}^{N_x+N_y}$ with $[\tilde{k}_x]_i = \frac{1}{N_x}\sum_{j=1}^{N_x} K(x_j, z_i)$, $\tilde{k}_y \in \mathbb{R}^{N_x+N_y}$ with $[\tilde{k}_y]_i = \frac{1}{N_y}\sum_{j=1}^{N_y} K(y_j, z_i)$, $z_i = x_i$ for $i = 1, 2, \ldots, N_x$ and $z_i = y_{i-N_x}$ for $i = N_x+1, N_x+2, \ldots, N_x+N_y$. $\tilde{K}$ is defined as:

$$\tilde{K} = \begin{pmatrix} K_x - \mathbf{1}_{N_x} \tilde{k}_x^T \\ K_y - \mathbf{1}_{N_y} \tilde{k}_y^T \end{pmatrix} = \begin{pmatrix} \tilde{K}_x \\ \tilde{K}_y \end{pmatrix}, \qquad (24)$$

where $\mathbf{1}_m$ is a column vector of ones of dimension $m$. $K_x$ and $K_y$ contain respectively the first $N_x$ rows and the last $N_y$ rows of the Gram matrix $K$ (defined as $K_{ij} = \varphi(z_i)^T \varphi(z_j) = K(z_i, z_j)$). We can also write (23) as

$$\min_\gamma \left\|\frac{1}{\sqrt{N_x}} \tilde{K}_x \gamma\right\|_2 + \left\|\frac{1}{\sqrt{N_y}} \tilde{K}_y \gamma\right\|_2 \quad \text{s.t.} \quad \gamma^T (\tilde{k}_x - \tilde{k}_y) = 1, \qquad (25)$$

which is a second order cone program (SOCP) [5] that has the same form as the SOCP in (16) and can thus be solved in a similar way. Notice that, in this case, the optimizing variable is $\gamma \in \mathbb{R}^{N_x+N_y}$ instead of $a \in \mathbb{R}^n$. Thus the dimension of the optimization problem increases, but the solution is more powerful because the kernelization corresponds to a more complex decision boundary in $\mathbb{R}^n$.

Similarly, the optimal value $b_*$ of $b$ in (21) will then become

$$b_* = \gamma_*^T \tilde{k}_x - \kappa_* \sqrt{\frac{1}{N_x} \gamma_*^T \tilde{K}_x^T \tilde{K}_x \gamma_*} = \gamma_*^T \tilde{k}_y + \kappa_* \sqrt{\frac{1}{N_y} \gamma_*^T \tilde{K}_y^T \tilde{K}_y \gamma_*}, \qquad (26)$$

where $\gamma_*$ and $\kappa_*$ are the optimal values of $\gamma$ and $\kappa$ respectively. Once $\gamma_*$ is known, we get $\kappa_* = 1\Big/\left(\sqrt{\frac{1}{N_x} \gamma_*^T \tilde{K}_x^T \tilde{K}_x \gamma_*} + \sqrt{\frac{1}{N_y} \gamma_*^T \tilde{K}_y^T \tilde{K}_y \gamma_*}\right)$ and then $b_*$ from (26). Classification of a new data point $z_{\text{new}}$ is then done by evaluating $\text{sign}(a_*^T \varphi(z_{\text{new}}) - b_*) = \text{sign}\left(\sum_{i=1}^{N_x+N_y} [\gamma_*]_i K(z_i, z_{\text{new}}) - b_*\right)$.
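To make the kernel algebra in (22)–(24) concrete, the sketch below (NumPy only; the helper names are ours, not the paper's) builds $\tilde{k}_x$, $\tilde{k}_y$ and the centered blocks $\tilde{K}_x$, $\tilde{K}_y$ from a Gram matrix, and then checks, for the linear kernel where $\varphi$ is the identity, that $\frac{1}{N_x}\gamma^T \tilde{K}_x^T \tilde{K}_x \gamma$ coincides with $a^T \hat{\Sigma}_x a$ for $a = \sum_i [\gamma]_i z_i$, exactly as the substitution (22) promises.

```python
import numpy as np

def kernel_mpm_quantities(X, Y, k):
    """Build the quantities of (23)-(24): stacked points z_i, the averaged
    kernel vectors k_x, k_y and the centered Gram blocks Kx_t, Ky_t.
    k is a kernel function k(Z1, Z2) -> Gram matrix."""
    Z = np.vstack([X, Y])
    Nx = len(X)
    K = k(Z, Z)                  # Gram matrix, K_ij = K(z_i, z_j)
    Kx, Ky = K[:Nx], K[Nx:]      # first Nx rows, last Ny rows
    k_x = Kx.mean(axis=0)        # [k_x]_i = (1/Nx) sum_j K(x_j, z_i)
    k_y = Ky.mean(axis=0)
    Kx_t = Kx - k_x              # Kx - 1 k_x^T: subtract column means
    Ky_t = Ky - k_y
    return Z, k_x, k_y, Kx_t, Ky_t

def linear_kernel(Z1, Z2):
    return Z1 @ Z2.T

# With the linear kernel (phi = identity), the kernelized objective term
# (1/Nx) gamma^T Kx_t^T Kx_t gamma must equal a^T Sigma_x a for
# a = sum_i gamma_i z_i, where Sigma_x is the (1/Nx)-normalized plug-in
# covariance used in the text.
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
Z, k_x, k_y, Kx_t, Ky_t = kernel_mpm_quantities(X, Y, linear_kernel)
gamma = rng.normal(size=len(Z))
a = Z.T @ gamma
Sigma_x = np.cov(X, rowvar=False, bias=True)
lhs = gamma @ Kx_t.T @ Kx_t @ gamma / len(X)
print(np.allclose(lhs, a @ Sigma_x @ a))  # True
```

For a nonlinear kernel (e.g. a Gaussian RBF) the same code applies unchanged; only `linear_kernel` is swapped out, and the SOCP (25) is then solved over $\gamma$ just as (16) was solved over $a$.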