{"title": "Learning Monotonic Transformations for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 681, "page_last": 688, "abstract": null, "full_text": "Learning Monotonic Transformations for\n\nClassi(cid:12)cation\n\nAndrew G. Howard\n\nTony Jebara\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nColumbia University\nNew York, NY 10027\n\nColumbia University\nNew York, NY 10027\n\nahoward@cs.columbia.edu\n\njebara@cs.columbia.edu\n\nAbstract\n\nA discriminative method is proposed for learning monotonic transforma-\ntions of the training data while jointly estimating a large-margin classi(cid:12)er.\nIn many domains such as document classi(cid:12)cation, image histogram classi(cid:12)-\ncation and gene microarray experiments, (cid:12)xed monotonic transformations\ncan be useful as a preprocessing step. However, most classi(cid:12)ers only explore\nthese transformations through manual trial and error or via prior domain\nknowledge. The proposed method learns monotonic transformations auto-\nmatically while training a large-margin classi(cid:12)er without any prior knowl-\nedge of the domain. A monotonic piecewise linear function is learned which\ntransforms data for subsequent processing by a linear hyperplane classi(cid:12)er.\nTwo algorithmic implementations of the method are formalized. The (cid:12)rst\nsolves a convergent alternating sequence of quadratic and linear programs\nuntil it obtains a locally optimal solution. An improved algorithm is then\nderived using a convex semide(cid:12)nite relaxation that overcomes initializa-\ntion issues in the greedy optimization problem. The e(cid:11)ectiveness of these\nlearned transformations on synthetic problems, text data and image data\nis demonstrated.\n\n1 Introduction\n\nMany (cid:12)elds have developed heuristic methods for preprocessing data to improve perfor-\nmance. This often takes the form of applying a monotonic transformation prior to using\na classi(cid:12)cation algorithm. For example, when the bag of words representation is used in\ndocument classi(cid:12)cation, it is common to take the square root of the term frequency [6, 5].\nMonotonic transforms are also used when classifying image histograms. In [3], transforma-\ntions of the form xa where 0 (cid:20) a (cid:20) 1 are demonstrated to improve performance. When\nclassifying genes from various microarray experiments it is common to take the logarithm of\nthe gene expression ratio [2]. Monotonic transformations can also capture crucial properties\nof the data such as threshold and saturation e(cid:11)ects.\n\nIn this paper, we propose to simultaneously learn a hyperplane classi(cid:12)er and a monotonic\ntransformation. The solution produced by our algorithm is a piecewise linear monotonic\nfunction and a maximum margin hyperplane classi(cid:12)er similar to a support vector machine\n(SVM) [4]. By allowing for a richer class of transforms learned at training time (as opposed\nto a rule of thumb applied during preprocessing), we improve classi(cid:12)cation accuracy. The\nlearned transform is speci(cid:12)cally tuned to the classi(cid:12)cation task. 
The main contributions of this paper include a novel framework for estimating a monotonic transformation and a hyperplane classifier simultaneously at training time, an efficient method for finding a locally optimal solution to the problem, and a convex relaxation for finding a globally optimal approximate solution.

Figure 1: Monotonic transform applied to each dimension followed by a hyperplane classifier.

The paper is organized as follows. In section 2, we present our formulation for learning a piecewise linear monotonic function and a hyperplane. We show how to learn this combined model through an iterative coordinate ascent optimization using interleaved quadratic and linear programs to find a local minimum. In section 3, we derive a convex relaxation based on Lasserre's method [8]. In section 4, synthetic experiments as well as document and image classification problems demonstrate the diverse utility of our method. We conclude with a discussion and future work.

2 Learning Monotonic Transformations

For an unknown distribution $P(\vec{x}, y)$ over inputs $\vec{x} \in \mathbb{R}^d$ and labels $y \in \{-1, 1\}$, we assume that there is an unknown nuisance monotonic transformation $\Phi(x)$ and an unknown hyperplane parameterized by $\vec{w}$ and $b$ such that predicting with $f(\vec{x}) = \mathrm{sign}(\vec{w}^T \Phi(\vec{x}) + b)$ yields a low expected test error $R = \int \frac{1}{2} |y - f(\vec{x})| \, dP(\vec{x}, y)$. We would like to recover $\Phi(\vec{x})$, $\vec{w}$, and $b$ from a labeled training set $S = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_N, y_N)\}$ which is sampled i.i.d. from $P(\vec{x}, y)$. The transformation acts elementwise, as can be seen in Figure 1.

We propose to learn both a maximum margin hyperplane and the unknown transform $\Phi(x)$ simultaneously. In our formulation, $\Phi(x)$ is a piecewise linear function that we parameterize with a set of $K$ knots $\{z_1, \ldots, z_K\}$ and associated positive weights $\{m_1, \ldots, m_K\}$ where $z_j \in \mathbb{R}$ and $m_j \in \mathbb{R}^+$. The transformation can be written as $\Phi(x) = \sum_{j=1}^{K} m_j \phi_j(x)$ where the $\phi_j(x)$ are truncated ramp functions acting on vectors and matrices elementwise as follows:

$$\phi_j(x) = \begin{cases} 0 & x \le z_j \\ \frac{x - z_j}{z_{j+1} - z_j} & z_j < x < z_{j+1} \\ 1 & z_{j+1} \le x \end{cases} \qquad (1)$$

This is a less common way to parameterize piecewise linear functions. The positivity constraints enforce monotonicity on $\Phi(x)$ for all $x$. A more common method is to parameterize the function value $\Phi(z)$ at each knot $z$ and apply order constraints between subsequent knots to enforce monotonicity, with values in between knots found through linear interpolation. This is the method used in isotonic regression [10], but in practice these are equivalent formulations. Using truncated ramp functions is preferable for numerous reasons. They can be easily precomputed and are sparse. Once precomputed, most calculations can be done via sparse matrix multiplications. The positivity constraints on the weights $\vec{m}$ also yield a simpler formulation than order constraints and interpolation, which becomes important in subsequent relaxation steps.
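As a rough illustration of this parameterization (not code from the paper), the truncated ramp basis of Eq. (1) and the resulting transform $\Phi(x)$ can be sketched in a few lines of Python. The quantile-based knot placement follows the paper's description of how knots are chosen before training; the extra right endpoint beyond the last knot is an assumption, since the paper does not spell out how the final ramp is terminated.

```python
import numpy as np

def truncated_ramp(x, z, j):
    """phi_j from Eq. (1): 0 below z[j], linear on (z[j], z[j+1]), 1 above z[j+1].
    Acts elementwise on an array x."""
    return np.clip((x - z[j]) / (z[j + 1] - z[j]), 0.0, 1.0)

def monotonic_transform(x, z, m):
    """Phi(x) = sum_j m_j * phi_j(x) with m_j >= 0, hence monotonic in x."""
    return sum(m[j] * truncated_ramp(x, z, j) for j in range(len(m)))

# Example: K knots at empirical quantiles of the training data, plus one extra
# right endpoint so the last ramp has somewhere to saturate (an assumption).
X = np.random.rand(200, 10)                      # toy data, not the paper's
K = 8
z = np.quantile(X, np.linspace(0.0, 1.0, K + 1))
m = np.full(K, 1.0 / K)                          # uniform weights give a roughly linear Phi
X_transformed = monotonic_transform(X, z, m)
```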
Figure 2a shows the truncated ramp function associated with knot $z_1$, and Figure 2b shows a conic combination of truncated ramps that builds a piecewise linear monotonic function.

Figure 2: Building blocks for piecewise linear functions. a) Truncated ramp function $\phi_1(x)$. b) $\Phi(x) = \sum_{j=1}^{5} m_j \phi_j(x)$.

Combining this with the support vector machine formulation leads us to the following learning problem:

$$\begin{aligned} \min_{\vec{w}, \vec{\xi}, b, \vec{m}} \quad & \|\vec{w}\|_2^2 + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} \quad & y_i \left( \left\langle \vec{w}, \sum_{j=1}^{K} m_j \phi_j(\vec{x}_i) \right\rangle + b \right) \ge 1 - \xi_i \quad \forall i \\ & \xi_i \ge 0, \quad m_j \ge 0, \quad \sum_j m_j \le 1 \quad \forall i, j \end{aligned} \qquad (2)$$

where $\vec{\xi}$ are the standard SVM slack variables, and $\vec{w}$ and $b$ are the maximum margin solution for the training set that has been transformed via $\Phi(x)$ with learned weights $\vec{m}$. Before training, the knot locations are chosen at the empirical quantiles so that they are evenly spaced in the data.

This problem is nonconvex due to the quadratic term involving $\vec{w}$ and $\vec{m}$ in the classification constraints. Although it is difficult to find a globally optimal solution, the structure of the problem suggests a simple method for finding a locally optimal solution. We can divide the problem into two convex subproblems. This amounts to solving a support vector machine for $\vec{w}$ and $b$ with a fixed $\Phi(x)$, and alternately solving for $\Phi(x)$ as a linear program with the SVM solution fixed. In both subproblems, we optimize over $\vec{\xi}$ as it is part of the hinge loss. This yields an efficient convergent optimization method. However, this method can get stuck in local minima. In practice, we initialize it with a linear $\Phi(x)$ and iterate from there; alternative initializations do not help much. This leads us to look for a method to efficiently find global solutions.
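Before moving to the convex relaxation, here is a hedged sketch of the greedy alternating scheme just described, built from off-the-shelf solvers (scikit-learn's LinearSVC for the SVM step and SciPy's linprog for the LP step). The function names, solver choices, and details such as how the intercept is regularized are my own assumptions rather than the authors' implementation; note that in the LP step the $\|\vec w\|_2^2$ term of problem (2) is constant, so only the slack term is minimized.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import LinearSVC

def ramp_features(X, z):
    """Stack phi_j(X) over all K knots: shape (K, n, d)."""
    return np.stack([np.clip((X - z[j]) / (z[j + 1] - z[j]), 0.0, 1.0)
                     for j in range(len(z) - 1)])

def alternating_fit(X, y, z, C=1.0, iters=10):
    """Greedy local optimization of problem (2): alternate an SVM step
    (fix m, solve for w, b) with an LP step (fix w, b, solve for m and xi)."""
    Phi = ramp_features(X, z)                      # (K, n, d)
    K, n, _ = Phi.shape
    m = np.full(K, 1.0 / K)                        # start from a roughly linear transform
    for _ in range(iters):
        # SVM step: transform the data with the current m and fit a linear SVM.
        Xt = np.tensordot(m, Phi, axes=1)          # (n, d)
        svm = LinearSVC(C=C, loss="hinge", max_iter=100000).fit(Xt, y)
        w, b = svm.coef_.ravel(), float(svm.intercept_[0])
        # LP step: with w, b fixed the margin constraint is linear in m,
        # since w . Phi(x_i) = sum_j m_j (w . phi_j(x_i)) = A[i] . m.
        A = (Phi @ w).T                            # (n, K)
        c = np.concatenate([np.zeros(K), C * np.ones(n)])    # minimize C * sum(xi)
        A_ub = np.hstack([-y[:, None] * A, -np.eye(n)])      # -y_i A[i].m - xi_i <= y_i b - 1
        b_ub = y * b - 1.0
        A_ub = np.vstack([A_ub, np.r_[np.ones(K), np.zeros(n)]])  # sum_j m_j <= 1
        b_ub = np.append(b_ub, 1.0)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * (K + n), method="highs")
        m = res.x[:K]
    return m, w, b
```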
3 Convex Relaxation

When faced with a nonconvex quadratic problem, an increasingly popular technique is to relax it into a convex one. Lasserre [8] proposed a sequence of convex relaxations for these types of nonconvex quadratic programs. This method replaces all quadratic terms in the original optimization problem with entries in a matrix. In its simplest form this matrix corresponds to the outer product of the original variables, with rank one and semidefinite constraints. The relaxation comes from dropping the rank one constraint on the outer product matrix. Lasserre proposed more elaborate relaxations using higher order moments of the variables. However, we mainly use the first moment relaxation along with a few of the second order moment constraints that do not require any additional variables beyond the outer product matrix.

A convex relaxation could be derived directly from the primal formulation of our problem. Both $\vec{w}$ and $\vec{m}$ would be relaxed as they interact in the nonconvex quadratic terms. Unfortunately, this yields a semidefinite constraint that scales with both the number of knots and the dimensionality of the data. This is troublesome because we wish to work with high dimensional data such as a bag of words representation for text. However, if we first find the dual formulation for $\vec{w}$, $b$, and $\vec{\xi}$, we only have to relax $\vec{m}$, which yields both a tighter relaxation and a less computationally intensive problem. Finding the dual leaves us with the following min max saddle point problem that will be subsequently relaxed and transformed into a semidefinite program:

$$\begin{aligned} \min_{\vec{m}} \max_{\vec{\alpha}} \quad & 2 \vec{\alpha}^T \vec{1} - \vec{\alpha}^T \left( Y \left( \sum_{i,j} m_i m_j \, \phi_i(X)^T \phi_j(X) \right) Y \right) \vec{\alpha} \\ \text{subject to} \quad & 0 \le \alpha_i \le C, \quad \vec{\alpha}^T \vec{y} = 0, \quad m_j \ge 0, \quad \sum_j m_j \le 1 \quad \forall i, j \end{aligned} \qquad (3)$$

where $\vec{1}$ is a vector of ones, $\vec{y}$ is a vector of the labels, $Y = \mathrm{diag}(\vec{y})$ is a matrix with the labels on its diagonal and zeros elsewhere, and $X$ is a matrix with $\vec{x}_i$ in the $i$th column. We introduce the relaxation via the substitution $M = \bar{m} \bar{m}^T$ and the constraint $M \succeq 0$, where $\bar{m}$ is constructed by concatenating 1 with $\vec{m}$. We can then transform the relaxed min max problem into a semidefinite program similar to the multiple kernel learning framework [7] by finding the dual with respect to $\vec{\alpha}$ and using the Schur complement lemma to generate a linear matrix inequality [1]:

$$\begin{aligned} \min_{M, t, \lambda, \vec{\nu}, \vec{\delta}} \quad & t \\ \text{subject to} \quad & \begin{pmatrix} Y \sum_{i,j} M_{i,j} \, \phi_i(X)^T \phi_j(X) \, Y & \vec{1} + \vec{\nu} - \vec{\delta} + \lambda \vec{y} \\ (\vec{1} + \vec{\nu} - \vec{\delta} + \lambda \vec{y})^T & t - 2 C \vec{\delta}^T \vec{1} \end{pmatrix} \succeq 0 \\ & M \succeq 0, \quad M \ge 0, \quad M \bar{1} \le \vec{0}, \quad M_{0,0} = 1, \quad \vec{\nu} \ge \vec{0}, \quad \vec{\delta} \ge \vec{0} \end{aligned} \qquad (4)$$

where $\vec{0}$ is a vector of zeros and $\bar{1}$ is a vector with $-1$ in the first dimension and ones in the rest. The variables $\lambda$, $\vec{\nu}$, $\vec{\delta}$ arise from the dual transformation. This relaxation is exact if $M$ is a rank one matrix.

The above can be seen as a generalization of the multiple kernel learning framework. Instead of learning a kernel from a combination of kernels, we are learning a combination of inner products of different functions applied to our data. In our case, these are truncated ramp functions. The terms $\phi_i(X)^T \phi_j(X)$ are not Mercer kernels except when $i = j$. This more general combination requires the stricter constraint that the mixing weights $M$ form a positive semidefinite matrix, a constraint which is introduced via the relaxation. This is a sufficient condition for the resulting matrix $\sum_{i,j} M_{i,j} \phi_i(X)^T \phi_j(X)$ to also be positive semidefinite.

When using this relaxation, we can recover the monotonic transform by using the first column (row) of $M$ as the mixing weights $\vec{m}$ of the truncated ramp functions. In practice, however, we use the learned kernel in our predictions: $k(\vec{x}, \vec{x}') = \sum_{i,j} M_{i,j} \, \phi_i(\vec{x})^T \phi_j(\vec{x}')$.
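The sketch below shows how this learned kernel could be assembled in practice. It is an illustration under assumptions, not the authors' code: it reuses the ramp construction from the earlier sketches, and it takes $M$ to be the $K \times K$ knot block of the relaxed moment matrix with the leading row and column (whose $(0,0)$ entry is fixed to 1) dropped. The resulting Gram matrix can be handed to any kernelized SVM.

```python
import numpy as np

def ramp_features(X, z):
    """Truncated ramp stack phi_j(X), shape (K, n, d), as in the earlier sketches."""
    return np.stack([np.clip((X - z[j]) / (z[j + 1] - z[j]), 0.0, 1.0)
                     for j in range(len(z) - 1)])

def learned_kernel(X1, X2, M, z):
    """Gram matrix k(x, x') = sum_{i,j} M[i, j] * phi_i(x)^T phi_j(x'),
    where M is the K x K knot block of the relaxed matrix."""
    P1, P2 = ramp_features(X1, z), ramp_features(X2, z)
    G = np.zeros((X1.shape[0], X2.shape[0]))
    for i in range(len(M)):
        for j in range(len(M)):
            G += M[i, j] * (P1[i] @ P2[j].T)
    return G

# The transform itself can be read off the relaxation as m = M_full[1:, 0],
# i.e. the first column of the full (K+1) x (K+1) matrix, as noted above.
```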
4 Experiments

4.1 Synthetic Experiment

In this experiment we demonstrate our method's ability to recover a monotonic transformation from data. We sampled data near a linear decision boundary and generated labels based on this boundary. We then applied a strictly monotonic function to this sampled data. The training set is made up of the transformed points and the original labels. A linear algorithm will have difficulty because the mapped data is not linearly separable. However, if we could recover the inverse monotonic function, then a linear decision boundary would perform well.

Figure 3: a) Original data. b) Data transformed by a logarithm. c) Data transformed by a quadratic function. d-f) The transformation functions learned using the nonconvex algorithm. g-i) The transformation functions learned using the convex algorithm.

Figure 3a shows the original data and decision boundary. Figure 3b shows the data and hyperplane transformed with a normalized logarithm. Figure 3c depicts a quadratic transform. 600 data points were sampled and then transformed; 200 were used for training, 200 for cross validation and 200 for testing. We compared our locally optimal method (L Mono), our convex relaxation (C Mono) and a linear SVM (Linear). The linear SVM struggled on all of the transformed data while the other methods performed well, as reported in Figure 4. The learned transforms for L Mono are plotted in Figure 3(d-f); the solid blue line is the mean over 10 experiments, the dashed blue lines give the standard deviation, and the black line is the true target function. The learned functions for C Mono are in Figure 3(g-i). Both algorithms performed quite well on the classification task and recover nearly the exact monotonic transform. The local method outperformed the relaxation slightly because this was an easy problem with few local minima.

            linear    exponential    square root    total
Linear      0.0005    0.0375         0.0685         0.0355
L Mono      0.0020    0.0005         0.0020         0.0015
C Mono      0.0025    0.0075         0.0025         0.0042

Figure 4: Testing error rates for the synthetic experiments.
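The synthetic setup above can be reproduced roughly as follows. This is a hypothetical sketch: the paper does not specify the exact sampling distribution, true hyperplane, or normalization of the logarithm, so those choices are assumptions; only the sample sizes (600 points split 200/200/200) and the use of strictly monotonic nuisance transforms come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points in the unit square and label them with a linear boundary
# (the particular boundary x1 + x2 = 1 is an assumption).
n = 600
U = rng.uniform(0.0, 1.0, size=(n, 2))
y = np.where(U.sum(axis=1) > 1.0, 1, -1)

# Apply a strictly monotonic nuisance transform elementwise, e.g. a logarithm
# normalized to map [0, 1] onto [0, 1]; a quadratic U**2 is another option.
X = np.log1p(9.0 * U) / np.log(10.0)

# 200 points each for training, cross validation, and testing.
X_train, X_val, X_test = X[:200], X[200:400], X[400:]
y_train, y_val, y_test = y[:200], y[200:400], y[400:]
```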
4.2 Document Classification

In this experiment we used the four universities WebKB dataset. The data is made up of web pages from four universities plus an additional larger set from miscellaneous universities. These web pages are categorized, and we work with the largest four categories: student, faculty, course, and project. The task is to solve all six pairwise classification problems. In [6, 5], preprocessing the data with a square root was demonstrated to yield good results. We compare our nonconvex method (L Mono) and our convex relaxation (C Mono) to a linear SVM with and without the square root, with TFIDF features, and also to a kernelized SVM with both the polynomial kernel and the RBF kernel. We follow the setup of [6] by training on three universities and the miscellaneous university set and testing on web pages from the fourth university. We repeated this four fold experiment five times. For each fold, we use a subset of 200 points for training, 200 to cross validate the parameter settings, and all of the fourth university's points for testing.

            1 vs 2    1 vs 3    1 vs 4    2 vs 3    2 vs 4    3 vs 4    total
Linear      0.0509    0.0879    0.1381    0.0653    0.1755    0.0941    0.1025
TFIDF       0.0428    0.0891    0.1623    0.0486    0.1910    0.1096    0.1059
Sqrt        0.0363    0.0667    0.0996    0.0456    0.1153    0.0674    0.0711
Poly        0.0499    0.0861    0.1389    0.0599    0.1750    0.0950    0.1009
RBF         0.0514    0.0836    0.1356    0.0641    0.1755    0.0981    0.1024
L Mono      0.0338    0.0739    0.0854    0.0511    0.1060    0.0602    0.0683
C Mono      0.0322    0.0776    0.0812    0.0501    0.0973    0.0584    0.0657

Figure 5: Testing error rates for WebKB.

Our two methods outperform the competition on average, as reported in Figure 5. The convex relaxation chooses a step function nearly every time; this outputs a 1 if a word is present in the training vector and 0 if it is absent. The nonconvex greedy algorithm does not recover this solution as reliably and seems to get stuck in local minima, which leads to slightly worse performance than the convex version.

4.3 Image Histogram Classification

In this experiment, we used the Corel image dataset. In [3], it was shown that monotonic transforms of the form $x^a$ for $0 \le a \le 1$ work well. The Corel image dataset is made up of various categories, each containing 100 images. We chose four categories of animals: 1) eagles, 2) elephants, 3) horses, and 4) tigers. Images were transformed into RGB histograms following the binning strategy of [3, 5]. We ran a series of six pairwise experiments where the data was randomly split into 80 percent training, 10 percent cross validation, and 10 percent testing. These six experiments were repeated 10 times. We compared our two methods to a linear support vector machine, as well as to SVMs with RBF and polynomial kernels. We also compared to the set of transforms $x^a$ for $0 \le a \le 1$ where we cross validated over $a \in \{0, .125, .25, .5, .625, .75, .875, 1\}$. This set includes the linear transform $a = 1$ at one end, a binary threshold $a = 0$ at the other (choosing $0^0 = 0$), and the square root transform in the middle.
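This $x^a$ baseline can be made concrete with a short sketch. It is my own illustration of cross validating over the fixed exponent grid with a linear SVM; the regularization constant and the split handling are assumed rather than taken from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def power_transform(X, a):
    """x -> x^a elementwise on nonnegative histogram features,
    with the convention 0^0 = 0 (i.e. a = 0 is a binary threshold)."""
    if a == 0:
        return (X > 0).astype(float)
    return np.power(X, a)

def select_exponent(X_tr, y_tr, X_val, y_val, C=1.0):
    """Pick a from the fixed grid by validation accuracy."""
    grid = [0, 0.125, 0.25, 0.5, 0.625, 0.75, 0.875, 1]
    best_a, best_acc = 1, -np.inf
    for a in grid:
        clf = LinearSVC(C=C, max_iter=100000).fit(power_transform(X_tr, a), y_tr)
        acc = clf.score(power_transform(X_val, a), y_val)
        if acc > best_acc:
            best_a, best_acc = a, acc
    return best_a
```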
The convex relaxation performed best or tied for best on 4 out of 6 of the experiments and was the best overall, as reported in Figure 6. The nonconvex version also performed well but ended up with a lower accuracy than the cross validated family of $x^a$ transforms. The key to this dataset is that most of the data is very close to zero due to few pixels falling in a given bin. Cross validation over $x^a$ most often chose low nonzero values of $a$. Our method had many knots at these extremely low values because that is where the data support was. Plots of our learned functions on these small values can be found in Figure 7(a-f): solid blue is the mean for the nonconvex algorithm, dashed blue is the standard deviation, and similarly the convex relaxation is in red.

            1 vs 2    1 vs 3    1 vs 4    2 vs 3    2 vs 4    3 vs 4    total
Linear      0.08      0.10      0.28      0.11      0.14      0.26      0.1617
Sqrt        0.03      0.05      0.09      0.12      0.08      0.20      0.0950
Poly        0.07      0.10      0.28      0.11      0.15      0.23      0.1567
RBF         0.06      0.08      0.22      0.10      0.13      0.23      0.1367
x^a         0.08      0.04      0.03      0.03      0.09      0.06      0.0550
L Mono      0.05      0.06      0.04      0.05      0.13      0.05      0.0633
C Mono      0.04      0.03      0.03      0.04      0.06      0.05      0.0417

Figure 6: Testing error rates on the Corel dataset.

Figure 7: The learned transformation functions for 6 Corel problems.

4.4 Gender Classification

In this experiment we try to differentiate between images of males and females. We have 1755 labeled images from the FERET dataset, processed as in [9]. Each processed image is a 21 by 12 pixel gray scale image with 256 gray levels that is rasterized to form a training vector. There are 1044 male images and 711 female images. We randomly split the data into 80 percent training, 10 percent cross validation, and 10 percent testing, and compare a linear SVM to our two methods on 5 random splits of the data. The learned monotonic functions from L Mono and C Mono are similar to a sigmoid function, which indicates that useful saturation and threshold effects were uncovered by our methods. Figure 8a shows examples of training images before and after they have been transformed by our learned function. Figure 8b summarizes the results. Our learned transformations outperform the linear SVM, with the convex relaxation performing best.

5 Discussion

A data driven framework was presented for jointly learning monotonic transformations of input data and a discriminative linear classifier. The joint optimization improves classification accuracy and produces interesting transformations that otherwise would require a priori domain knowledge. Two implementations were discussed. The first is a fast greedy algorithm for finding a locally optimal solution. Subsequently, a semidefinite relaxation of the original problem was presented which does not suffer from local minima. The greedy algorithm has similar scaling properties to a support vector machine yet has local minima to contend with. The semidefinite relaxation is more computationally intensive yet ensures a reliable global solution.
Nevertheless, both implementations were helpful in synthetic and real experiments, including text and image classification, and improved over standard support vector machine tools.

Algorithm    Error
Linear       0.0909
L Mono       0.0818
C Mono       0.0648

Figure 8: a) Original and transformed gender images. b) Error rates for gender classification.

A natural next step is to explore faster (convex) algorithms that take advantage of the specific structure of the problem. These faster algorithms will help us explore extensions such as learning transformations across multiple tasks. We also hope to explore applications to other domains such as gene expression data, to refine the current logarithmic transforms necessary to compensate for well-known saturation effects in expression level measurements. We are also interested in looking at fMRI and audio data where monotonic transformations are useful.

6 Acknowledgements

This work was supported in part by NSF Award IIS-0347499 and ONR Award N000140710507.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares Jr., and D. Haussler. Support vector machine classification of microarray gene expression data, 1999.

[3] O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based classification. IEEE Transactions on Neural Networks, 10:1055-1064, 1999.

[4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[5] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of Artificial Intelligence and Statistics, 2005.

[6] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819-844, 2004.

[7] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.

[8] J. B. Lasserre. Convergent LMI relaxations for nonconvex quadratic programs. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000.

[9] B. Moghaddam and M.-H. Yang. Sex with support vector machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 960-966. MIT Press, 2000.

[10] T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. Wiley, 1988.
", "award": [], "sourceid": 1045, "authors": [{"given_name": "Andrew", "family_name": "Howard", "institution": null}, {"given_name": "Tony", "family_name": "Jebara", "institution": null}]}