{"title": "Learning Kernels with Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1298, "page_last": 1306, "abstract": "Randomized features provide a computationally efficient way to approximate kernel machines in machine learning tasks. However, such methods require a user-defined kernel as input. We extend the randomized-feature approach to the task of learning a kernel (via its associated random features). Specifically, we present an efficient optimization problem that learns a kernel in a supervised manner. We prove the consistency of the estimated kernel as well as generalization bounds for the class of estimators induced by the optimized kernel, and we experimentally evaluate our technique on several datasets. Our approach is efficient and highly scalable, and we attain competitive results with a fraction of the training cost of other techniques.", "full_text": "Learning Kernels with Random Features\n\nAman Sinha1\n\nJohn Duchi1,2\n\nDepartments of 1Electrical Engineering and 2Statistics\n\nStanford University\n\n{amans,jduchi}@stanford.edu\n\nAbstract\n\nRandomized features provide a computationally ef\ufb01cient way to approximate kernel\nmachines in machine learning tasks. However, such methods require a user-de\ufb01ned\nkernel as input. We extend the randomized-feature approach to the task of learning\na kernel (via its associated random features). Speci\ufb01cally, we present an ef\ufb01cient\noptimization problem that learns a kernel in a supervised manner. We prove the\nconsistency of the estimated kernel as well as generalization bounds for the class\nof estimators induced by the optimized kernel, and we experimentally evaluate our\ntechnique on several datasets. 
Our approach is ef\ufb01cient and highly scalable, and we\nattain competitive results with a fraction of the training cost of other techniques.\n\n1\n\nIntroduction\n\nAn essential element of supervised learning systems is the representation of input data. Kernel\nmethods [27] provide one approach to this problem: they implicitly transform the data to a new\nfeature space, allowing non-linear data representations. This representation comes with a cost, as\nkernelized learning algorithms require time that grows at least quadratically in the data set size,\nand predictions with a kernelized procedure require the entire training set. This motivated Rahimi\nand Recht [24, 25] to develop randomized methods that ef\ufb01ciently approximate kernel evaluations\nwith explicit feature transformations; this approach gives substantial computational bene\ufb01ts for large\ntraining sets and allows the use of simple linear models in the randomly constructed feature space.\nWhether we use standard kernel methods or randomized approaches, using the \u201cright\u201d kernel for a\nproblem can make the difference between learning a useful or useless model. Standard kernel methods\nas well as the aforementioned randomized-feature techniques assume the input of a user-de\ufb01ned\nkernel\u2014a weakness if we do not a priori know a good data representation. To address this weakness,\none often wishes to learn a good kernel, which requires substantial computation. We combine kernel\nlearning with randomization, exploiting the computational advantages offered by randomized features\nto learn the kernel in a supervised manner. Speci\ufb01cally, we use a simple pre-processing stage for\nselecting our random features rather than jointly optimizing over the kernel and model parameters.\nOur work\ufb02ow is straightforward: we create randomized features, solve a simple optimization problem\nto select a subset, then train a model with the optimized features. 
The procedure results in lower-dimensional models than the original random-feature approach for the same performance. We give empirical evidence supporting these claims and provide theoretical guarantees that our procedure is consistent with respect to the limits of infinite training data and infinite-dimensional random features.

1.1 Related work

To discuss related work, we first describe the supervised learning problem underlying our approach. We have a cost c : R × Y → R, where c(·, y) is convex for y ∈ Y, and a reproducing kernel Hilbert space (RKHS) of functions F with kernel K. Given a sample {(xi, yi)}_{i=1}^n, the usual ℓ2-regularized learning problem is to solve the following (shown in primal and dual forms respectively):

minimize_{f∈F} Σ_{i=1}^n c(f(xi), yi) + (λ/2)‖f‖₂²,  or  maximize_{α∈R^n} −Σ_{i=1}^n c*(αi, yi) − (1/(2λ)) α^T G α,   (1)

where ‖·‖₂ denotes the Hilbert space norm, c*(α, y) = sup_z {αz − c(z, y)} is the convex conjugate of c (for fixed y), and G = [K(xi, xj)]_{i,j=1}^n denotes the Gram matrix.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Several researchers have studied kernel learning. As noted by Gönen and Alpaydın [14], most formulations fall into one of a few categories. In the supervised setting, one assumes a base class or classes of kernels and either uses heuristic rules to combine kernels [2, 23], optimizes structured (e.g. linear, nonnegative, convex) compositions of the kernels with respect to an alignment metric [9, 16, 20, 28], or jointly optimizes kernel compositions with empirical risk [17, 20, 29]. The latter approaches require an eigendecomposition of the Gram matrix or costly optimization problems (e.g.
quadratic or semidefinite programs) [10, 14], but these models have a variety of generalization guarantees [1, 8, 10, 18, 19]. Bayesian variants of compositional kernel search also exist [12, 13]. In un- and semi-supervised settings, the goal is to learn an embedding of the input distribution followed by a simple classifier in the embedded space (e.g. [15]); the hope is that the input distribution carries the structure relevant to the task. Despite the current popularity of these techniques, especially deep neural architectures, they are costly, and it is difficult to provide guarantees on their performance.
Our approach optimizes kernel compositions with respect to an alignment metric, but rather than work with Gram matrices in the original data representation, we work with randomized feature maps that approximate RKHS embeddings. We learn a kernel that is structurally different from a user-supplied base kernel, and our method is an efficiently (near linear-time) solvable convex program.

2 Proposed approach

At a high level, we take a feature mapping, find a distribution that aligns this mapping with the labels y, and draw random features from the learned distribution; we then use these features in a standard supervised learning approach.
For simplicity, we focus on binary classification: we have n datapoints (xi, yi) ∈ R^d × {−1, 1}. Letting φ : R^d × W → [−1, 1] and Q be a probability measure on a space W, define the kernel

K_Q(x, x′) := ∫ φ(x, w) φ(x′, w) dQ(w).   (2)

We want to find the "best" kernel K_Q over all distributions Q in some (large, nonparametric) set P of possible distributions on random features; we consider a kernel alignment problem of the form

maximize_{Q∈P} Σ_{i,j} K_Q(xi, xj) yi yj.   (3)

We focus on sets P defined by divergence measures on the space of probability distributions. For a convex function f with f(1) = 0, the f-divergence between distributions P and Q is D_f(P‖Q) = ∫ f(dP/dQ) dQ. Then, for a base (user-defined) distribution P0, we consider collections P := {Q : D_f(Q‖P0) ≤ ρ}, where ρ > 0 is a specified constant. In this paper, we focus on divergences f(t) = t^k − 1 for k ≥ 2. Intuitively, the distribution Q maximizing the alignment (3) gives a feature space in which pairwise distances are similar to those in the output space Y. Unfortunately, the problem (3) is generally intractable as it is infinite dimensional.
Using the randomized feature approach, we approximate the integral (2) as a discrete sum over samples W^i iid∼ P0, i ∈ [Nw]. Defining the discrete approximation P_Nw := {q : D_f(q‖1/Nw) ≤ ρ} to P, we have the following empirical version of problem (3):

maximize_{q∈P_Nw} Σ_{i,j} yi yj Σ_{m=1}^{Nw} q_m φ(xi, w_m) φ(xj, w_m).   (4)

Using randomized features, matching the input and output distances in problem (4) translates to finding a (weighted) set of points among w1, w2, ..., w_Nw that best "describe" the underlying dataset, or, more directly, finding weights q so that the kernel matrix matches the correlation matrix yy^T.

Given a solution q̂ to problem (4), we can solve the primal form of problem (1) in two ways. First, we can apply the Rahimi and Recht [24] approach by drawing D samples W^1, . . . , W^D iid∼ q̂, defining features φi = [φ(xi, w^1) ··· φ(xi, w^D)]^T, and solving the risk minimization problem

θ̂ = argmin_θ { Σ_{i=1}^n c( (1/√D) θ^T φi, yi ) + r(θ) }   (5)

for some regularization r.
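As a concrete sketch of this first option: draw D feature indices according to the learned weights, build the 1/√D-scaled feature matrix, and fit a regularized linear model. Everything below is illustrative rather than the paper's code; the cosine feature map, the synthetic data, and the stand-in weights q_hat (which plays the role of a solution to (4)) are assumptions, and squared loss is used so that the fit reduces to a ridge solve.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(X, W, b):
    """Random cosine features (a common choice for the Gaussian kernel):
    phi(x, (w, b)) = cos(w . x + b)."""
    return np.cos(X @ W.T + b)

# Toy data and a stand-in weight vector q_hat over Nw candidate features.
n, d, Nw, D = 200, 5, 100, 25
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))
q_hat = rng.dirichlet(np.ones(Nw))            # stand-in for a solution of (4)
W0 = rng.normal(size=(Nw, d))
b0 = rng.uniform(0, 2 * np.pi, Nw)

# Estimator (5): resample D features from q_hat, scale by 1/sqrt(D),
# then fit a regularized linear model (ridge regression for brevity).
idx = rng.choice(Nw, size=D, p=q_hat)
F = phi(X, W0[idx], b0[idx]) / np.sqrt(D)     # n x D feature matrix
lam = 1e-2
theta = np.linalg.solve(F.T @ F + lam * np.eye(D), F.T @ y)
pred = np.sign(F @ theta)                     # predicted labels on training data
```

With a convex loss other than the squared loss, the ridge solve would be replaced by the corresponding regularized empirical-risk minimization, but the feature-resampling step is unchanged.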
Alternatively, we may set φi = [φ(xi, w1) ··· φ(xi, w_Nw)]^T, where w1, . . . , w_Nw are the original random samples from P0 used to solve (4), and directly solve

θ̂ = argmin_θ { Σ_{i=1}^n c( θ^T diag(q̂)^{1/2} φi, yi ) + r(θ) }.   (6)

Notably, if q̂ is sparse, the problem (6) need only store the random features corresponding to non-zero entries of q̂. Contrast our two-phase procedure to that of Rahimi and Recht [25], which samples W^1, . . . , W^D iid∼ P0 and solves the minimization problem

minimize_{α∈R^{Nw}} Σ_{i=1}^n c( Σ_{m=1}^{D} α_m φ(xi, w_m), yi )  subject to ‖α‖∞ ≤ C/Nw,   (7)

where C is a numerical constant. At first glance, it appears that we may suffer both in terms of computational efficiency and in classification or learning performance compared to the one-step procedure (7). However, as we show in the sequel, the alignment problem (4) can be solved very efficiently and often yields sparse vectors q̂, thus substantially decreasing the dimensionality of problem (6). Additionally, we give experimental evidence in Section 4 that the two-phase procedure yields generalization performance similar to standard kernel and randomized feature methods.

2.1 Efficiently solving problem (4)

The optimization problem (4) has structure that enables efficient (near linear-time) solutions. Define the matrix Φ = [φ1 ··· φn] ∈ R^{Nw×n}, where φi = [φ(xi, w1) ··· φ(xi, w_Nw)]^T ∈ R^{Nw} is the randomized feature representation for xi and w_m iid∼ P0.
We can rewrite the optimization objective as

Σ_{i,j} yi yj Σ_{m=1}^{Nw} q_m φ(xi, w_m) φ(xj, w_m) = Σ_{m=1}^{Nw} q_m ( Σ_{i=1}^n yi φ(xi, w_m) )² = q^T ((Φy) ⊙ (Φy)),

where ⊙ denotes the Hadamard product. Constructing the linear objective requires the evaluation of Φy. Assuming that the computation of φ is O(d), construction of Φ is O(nNwd) on a single processor. However, this construction is trivially parallelizable. Furthermore, computation can be sped up even further for certain distributions P0. For example, the Fastfood technique can approximate Φ in O(nNw log(d)) time for the Gaussian kernel [21].
The problem (4) is also efficiently solvable via bisection over a scalar dual variable. Using λ ≥ 0 for the constraint D_f(Q‖P0) ≤ ρ, a partial Lagrangian is

L(q, λ) = q^T ((Φy) ⊙ (Φy)) − λ (D_f(q‖1/Nw) − ρ).

The corresponding dual function is g(λ) = sup_{q∈Δ} L(q, λ), where Δ := {q ∈ R^{Nw}_+ : q^T 1 = 1} is the probability simplex. Minimizing g(λ) yields the solution to problem (4); this is a convex optimization problem in one dimension so we can use bisection. The computationally expensive step in each iteration is maximizing L(q, λ) with respect to q for a given λ. For f(t) = t^k − 1, we define v := (Φy) ⊙ (Φy) and solve

maximize_{q∈Δ} q^T v − (λ/Nw) Σ_{m=1}^{Nw} (Nw q_m)^k.   (8)

This has a solution of the form q_m = [v_m/(λ Nw^{k−1}) + τ]_+^{1/(k−1)}, where τ is chosen so that Σ_m q_m = 1. We can find such a τ by a variant of median-based search in O(Nw) time [11].
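In the χ² case (k = 2), the inner maximization over the simplex is just a Euclidean projection, so the whole bisection fits in a few lines. The sketch below is an illustrative implementation under that k = 2 assumption (function names are ours, not the paper's); for simplicity it uses a sort-based O(Nw log Nw) simplex projection in place of the linear-time median-based search of [11].

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection of u onto {q >= 0, sum q = 1} (sort-based variant)."""
    s = np.sort(u)[::-1]
    css = np.cumsum(s)
    idx = np.nonzero(s * np.arange(1, u.size + 1) > css - 1)[0][-1]
    tau = (css[idx] - 1) / (idx + 1.0)
    return np.maximum(u - tau, 0.0)

def chi2_div(q):
    """D_f(q || 1/Nw) for f(t) = t^2 - 1, which equals Nw * ||q||^2 - 1."""
    return q.size * q.dot(q) - 1.0

def optimize_q(v, rho, tol=1e-6):
    """Bisection over the dual variable lambda (the k = 2 case).
    For fixed lam, the maximizer of q.v - lam * Nw * ||q||^2 over the
    simplex is the projection of v / (2 * lam * Nw) onto the simplex."""
    Nw = v.size
    inner = lambda lam: project_simplex(v / (2.0 * lam * Nw))
    lam_lo, lam_hi = 0.0, 1.0
    while chi2_div(inner(lam_hi)) >= rho:   # double until the constraint holds
        lam_hi *= 2.0
    while lam_hi - lam_lo > tol * lam_hi:
        lam = 0.5 * (lam_lo + lam_hi)
        if chi2_div(inner(lam)) < rho:
            lam_hi = lam
        else:
            lam_lo = lam
    return inner(lam_hi)                    # feasible, near-optimal weights
```

Given the feature matrix Φ and labels y, one would call `optimize_q((Phi @ y) ** 2, rho)`: coordinates with larger v_m receive larger weight, while the constraint D_f(q‖1/Nw) ≤ ρ caps how far q may move from uniform.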
Thus, for any k ≥ 2, an ε-suboptimal solution to problem (4) can be found in O(Nw log(1/ε)) time (see Algorithm 1).

Algorithm 1 Kernel optimization with f(t) = t^k − 1 as divergence
INPUT: distribution P0 on W, sample {(xi, yi)}_{i=1}^n, Nw ∈ N, feature function φ, ε > 0
OUTPUT: q ∈ R^{Nw} that is an ε-suboptimal solution to (4).
SETUP: Draw Nw samples w_m iid∼ P0, build feature matrix Φ, compute v := (Φy) ⊙ (Φy).
Set λu ← ∞, λl ← 0, λs ← 1
while λu = ∞
  q ← argmax_{q∈Δ} L(q, λs)   // (solution to problem (8))
  if D_f(q‖1/Nw) < ρ then λu ← λs else λs ← 2λs
while λu − λl > ελs
  λ ← (λu + λl)/2
  q ← argmax_{q∈Δ} L(q, λ)   // (solution to problem (8))
  if D_f(q‖1/Nw) < ρ then λu ← λ else λl ← λ

3 Consistency and generalization performance guarantees

Although the procedure (4) is a discrete approximation to a heuristic kernel alignment problem, we can provide guarantees on its performance as well as the generalization performance of our subsequent model trained with the optimized kernel.

Consistency First, we provide guarantees that the solution to problem (4) approaches a population optimum as the data and random sampling increase (n → ∞ and Nw → ∞, respectively). We consider the following (slightly more general) setting: let S : X × X → [−1, 1] be a bounded function, where we intuitively think of S(x, x′) as a similarity metric between labels for x and x′, and denote Sij := S(xi, xj) (in the binary case with y ∈ {−1, 1}, we have Sij = yiyj).
We then define the alignment functions

T(P) := E[S(X, X′) K_P(X, X′)],   T̂(P) := (1/(n(n−1))) Σ_{i≠j} Sij K_P(xi, xj),

where the expectation is taken over S and the independent variables X, X′. Lemmas 1 and 2 provide consistency guarantees with respect to the data sample (xi and Sij) and the random feature sample (w_m); together they give us the overall consistency result of Theorem 1. We provide proofs in the supplement (Sections A.1, A.2, and A.3 respectively).
Lemma 1 (Consistency with respect to data). Let f(t) = t^k − 1 for k ≥ 2. Let P0 be any distribution on the space W, and let P = {Q : D_f(Q‖P0) ≤ ρ}. Then

P( sup_{Q∈P} |T̂(Q) − T(Q)| ≥ t ) ≤ √2 exp( −n t² / (16(1+ρ)) ).

Lemma 1 shows that the empirical quantity T̂ is close to the true T. Now we show that, independent of the size of the training data, we can consistently estimate the optimal Q ∈ P via sampling (i.e. Q ∈ P_Nw).
Lemma 2 (Consistency with respect to sampling features). Let the conditions of Lemma 1 hold. Then, with C_ρ = 2(ρ+1)/(√(1+ρ) − 1) and D_ρ = √(8(1+ρ)), we have

| sup_{Q∈P_Nw} T̂(Q) − sup_{Q∈P} T̂(Q) | ≤ 4 C_ρ √(log(2Nw)/Nw) + D_ρ √(log(2/δ)/Nw)

with probability at least 1 − δ over the draw of the samples W^m iid∼ P0.
Finally, we combine the consistency guarantees for data and sampling to reach our main result, which shows that the alignment provided by the estimated distribution Q̂ is nearly optimal.
Theorem 1. Let Q̂w maximize T̂(Q) over Q ∈ P_Nw. Then, with probability at least 1 − 3δ over the sampling of both (x, y) and W, we have

| T(Q̂w) − sup_{Q∈P} T(Q) | ≤ 4 C_ρ √(log(2Nw)/Nw) + D_ρ √(log(2/δ)/Nw) + 2 D_ρ √(2 log(2/δ)/n).

Generalization performance The consistency results above show that our optimization procedure nearly maximizes alignment T(P), but they say little about generalization performance for our model trained using the optimized kernel. We now show that the class of estimators employed by our method has strong performance guarantees. By construction, our estimator (6) uses the function class

F_Nw := { h(x) = Σ_{m=1}^{Nw} α_m √(q_m) φ(x, w_m) | q ∈ P_Nw, ‖α‖₂ ≤ B },

and we provide bounds on its generalization via empirical Rademacher complexity. To that end, define R_n(F_Nw) := (1/n) E[ sup_{f∈F_Nw} Σ_{i=1}^n σi f(xi) ], where the expectation is taken over the i.i.d. Rademacher variables σi ∈ {−1, 1}. We have the following lemma, whose proof is in Section A.4.
Lemma 3. Under the conditions of the preceding paragraph, R_n(F_Nw) ≤ B √(2(1+ρ)/n).
Applying standard concentration results, we obtain the following generalization guarantee.
Theorem 2 ([8, 18]).
Let the true misclassification risk and ν-empirical misclassification risk for an estimator h be defined as follows:

R(h) := P(Y h(X) < 0),   R̂ν(h) := (1/n) Σ_{i=1}^n min{ 1, [1 − yi h(xi)/ν]_+ }.

Then sup_{h∈F_Nw} {R(h) − R̂ν(h)} ≤ (2/ν) R_n(F_Nw) + 3 √(log(2/δ)/(2n)) with probability at least 1 − δ.
The bound is independent of the number of terms Nw, though in practice we let B grow with Nw.

4 Empirical evaluations

We now turn to empirical evaluations, comparing our approach's predictive performance with that of Rahimi and Recht's randomized features [24] as well as a joint optimization over kernel compositions and empirical risk. In each of our experiments, we investigate the effect of increasing dimensionality of the randomized feature space D. For our approach, we use the χ²-divergence (k = 2, or f(t) = t² − 1). Letting q̂ denote the solution to problem (4), we use two variants of our approach: when D < nnz(q̂) we use estimator (5), and we use estimator (6) otherwise. For the original randomized feature approach, we relax the constraint in problem (7) with an ℓ2 penalty. Finally, for the joint optimization in which we learn the kernel and classifier together, we consider the kernel-learning objective, i.e. finding the best Gram matrix G in problem (1) for the soft-margin SVM [14]:

minimize_{q∈P_Nw} sup_α { α^T 1 − (1/2) Σ_{i,j} αi αj yi yj Σ_{m=1}^{Nw} q_m φ(xi, w_m) φ(xj, w_m) }
subject to 0 ⪯ α ⪯ C·1, α^T y = 0.   (9)

We use a standard primal-dual algorithm [4] to solve the min-max problem (9).
While this is an expensive optimization, it is a convex problem and is solvable in polynomial time.
In Section 4.1, we visualize a particular problem that illustrates the effectiveness of our approach when the user-defined kernel is poor. Section 4.2 shows how learning the kernel can be used to quickly find a sparse set of features in high dimensional data, and Section 4.3 compares our performance with unoptimized random features and the joint procedure (9) on benchmark datasets. The supplement contains more experimental results in Section C.

4.1 Learning a new kernel with a poor choice of P0

For our first experiment, we generate synthetic data xi iid∼ N(0, I) with labels yi = sign(‖x‖₂ − √d), where x ∈ R^d. The Gaussian kernel is ill-suited for this task, as the Euclidean distance used in this kernel does not capture the underlying structure of the classes. Nevertheless, we use the Gaussian kernel, which corresponds [24] to φ(x, (w, v)) = cos((x, 1)^T (w, v)) where (W, V) ∼ N(0, I) × Uni(0, 2π), to showcase the effects of our method. We consider a training set of size n = 10⁴ and a test set of size 10³, and we employ logistic regression with D = nnz(q̂) for both our technique as well as the original random feature approach.¹
¹For 2 ≤ d ≤ 15, nnz(q̂) < 250 when the kernel is trained with Nw = 2·10⁴ and ρ = 200.

[Figure 1 panels: (a) Training data & optimized features for d = 2; (b) Error vs. d]

Figure 1. Experiments with synthetic data. (a) Positive and negative training examples are blue and red, and optimized randomized features (wm) are yellow. All offset parameters vm were optimized to be near 0 or π (not shown). (b) Misclassification error of logistic regression model vs. dimensionality of data. GK denotes random features with a Gaussian kernel, and our optimized kernel is denoted OK.

[Figure 2 panels: (a) Error vs. D; (b) q̂i vs. i]

Figure 2. Feature selection in sparse data. (a) Misclassification error of ridge regression model vs. dimensionality of data. LK denotes random features with a linear kernel, and OK denotes our method. Our error is fixed above D = nnz(q̂), after which we employ estimator (6). (b) Weight of feature i in optimized kernel (qi) vs. i. Vertical bars delineate separations between k-grams, where 1 ≤ k ≤ 5 is nondecreasing in i. Circled features are prefixes of GGTTG and GTTGG at indices 60-64.

Figure 1 shows the results of the experiments for d ∈ {2, . . . , 15}. Figure 1(a) illustrates the output of the optimization when d = 2. The selected kernel features wm lie near (1, 1) and (−1, −1); the offsets vm are near 0 and π, giving the feature φ(·, w, v) a parity flip. Thus, the kernel computes similarity between datapoints via neighborhoods of (1, 1) and (−1, −1) close to the classification boundary. In higher dimensions, this generalizes to neighborhoods of pairs of opposing points along the surface of the d-sphere; these features provide a coarse approximation to vector magnitude. Performance degradation with d occurs because the neighborhoods grow exponentially larger and less dense (due to fixed Nw and n). Nevertheless, as shown in Figure 1(b), this degradation occurs much more slowly than that of the Gaussian kernel, which suffers a similar curse of dimensionality due to its dependence on Euclidean distance.
Although somewhat contrived, this example shows that even in situations with poor base kernels our approach learns a more suitable representation.

4.2 Feature selection and biological sequences

In addition to the computational advantages rendered by the sparsity of q after performing the optimization (4), we can use this sparsity to gain insights about important features in high-dimensional datasets; this can act as an efficient filtering mechanism before further investigation. We present one example of this task, studying an aptamer selection problem [6]. In this task, we are given n = 2900 nucleotide sequences (aptamers) xi ∈ A^81, where A = {A,C,G,T}, and labels yi indicate (thresholded) binding affinity of the aptamer to a molecular target. We create one-hot encoded forms of k-grams of the sequence, where 1 ≤ k ≤ 5, resulting in d = Σ_{k=1}^{5} |A|^k (82 − k) = 105,476 features.

[Figure 3 panels: (a) Error vs. D, adult; (b) Error vs. D, reuters; (c) Error vs. D, buzz; (d) Speedup vs. D, adult; (e) Speedup vs. D, reuters; (f) Speedup vs. D, buzz]

Figure 3. Performance analysis on benchmark datasets. The top row shows training and test misclassification rates. Our method is denoted as OK and is shown in red. The blue methods are random features with Gaussian, linear, or arc-cosine kernels (GK, LK, or ACK respectively). Our error and running time become fixed above D = nnz(q̂), after which we employ estimator (6). The bottom row shows the speedup factor of using our method over regular random features (speedup = x indicates our method takes 1/x of the time required to use regular random features).
Our method is faster at moderate to large D and shows better performance than the random feature approach at small to moderate D.

Table 1: Best test results over benchmark datasets
Dataset | n, ntest | d | Model | Our error (%), time(s) | Random error (%), time(s)
adult | 32561, 16281 | 123 | Logistic | 15.54, 3.6 | 15.44, 43.1
reuters | 23149, 781265 | 47236 | Ridge | 9.27, 0.8 | 9.36, 295.9
buzz | 105530, 35177 | 77 | Ridge | 4.92, 2.0 | 4.58, 11.9

We consider the linear kernel, i.e. φ(x, w) = x_w, where w ∼ Uni({1, . . . , d}). Figure 2(a) compares the misclassification error of our method with that of random k-gram features, while Figure 2(b) indicates the weights qi given to features by our method. In under 0.2 seconds, we whittle down the original feature space to 379 important features. By restricting random selection to just these features, we outperform the approach of selecting features uniformly at random when D ≪ d. More importantly, however, we can derive insights from this selection. For example, the circled features in Figure 2(b) correspond to k-gram prefixes for the 5-grams GGTTG and GTTGG at indices 60 through 64; G-complexes are known to be relevant for binding affinities in aptamers [6], so this is reasonable.

4.3 Performance on benchmark datasets

We now show the benefits of our approach on large-scale datasets, combining the efficiency of random features with the performance of kernel-learning techniques. We perform experiments on three distinct types of datasets, tracking training/test error rates as well as total (training + test) time. For the adult² dataset we employ the Gaussian kernel with a logistic regression model, and for the reuters³ dataset we employ a linear kernel with a ridge regression model. For the buzz⁴ dataset we employ ridge regression with an arc-cosine kernel of order 2, i.e.
P0 = N(0, I) and φ(x, w) = H(w^T x)(w^T x)², where H(·) is the Heaviside step function [7].

²https://archive.ics.uci.edu/ml/datasets/Adult
³http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm. We consider predicting whether a document has a CCAT label.
⁴http://ama.liglab.fr/data/buzz/classification/. We use the Twitter dataset.

Table 2: Comparisons with joint optimization on subsampled data
Dataset | Our training / test error (%), time(s) | Joint training / test error (%), time(s)
adult | 16.22 / 16.36, 1.8 | 14.88 / 16.31, 198.1
reuters | 7.64 / 9.66, 0.6 | 6.30 / 8.96, 173.3
buzz | 8.44 / 8.32, 0.4 | 7.38 / 7.08, 137.5

Comparison with unoptimized random features Results comparing our method with unoptimized random features are shown in Figure 3 for many values of D, and Table 1 tabulates the best test error and corresponding time for the methods. Our method outperforms the original random feature approach in terms of generalization error for small and moderate values of D; at very large D the random feature approach either matches or surpasses our performance. The trends in speedup are opposite: our method requires extra optimizations that dominate training time at extremely small D; at very large D we use estimator (6), so our method requires less overall time. The nonmonotonic behavior for reuters (Figure 3(e)) occurs due to the following: at D ≲ nnz(q̂), sampling indices from the optimized distribution takes a non-negligible fraction of total time, and solving the linear system requires more time when rows of Φ are not unique (due to sampling).
Performance improvements also depend on the kernel choice for a dataset.
Namely, our method\nprovides the most improvement, in terms of training time for a given amount of generalization error,\nover random features generated for the linear kernel on the reuters dataset; we are able to surpass\nthe best results of the random feature approach 2 orders of magnitude faster. This makes sense when\nconsidering the ability of our method to sample from a small subset of important features. On the\nother hand, random features for the arc-cosine kernel are able to achieve excellent results on the\nbuzz dataset even without optimization, so our approach only offers modest improvement at small to\nmoderate D. For the Gaussian kernel employed on the adult dataset, our method is able to achieve\nthe same generalization performance as random features in roughly 1/12 the training time.\nThus, we see that our optimization approach generally achieves competitive results with random\nfeatures at lower computational costs, and it offers the most improvements when either the base\nkernel is not well-suited to the data or requires a large number of random features (large D) for good\nperformance. In other words, our method reduces the sensitivity of model performance to the user\u2019s\nselection of base kernels.\n\nComparison with joint optimization Despite the fact that we do not choose empirical risk as our\nobjective in optimizing kernel compositions, our optimized kernel enjoys competitive generalization\nperformance compared to the joint optimization procedure (9). Because the joint optimization is\nvery costly, we consider subsampled training datasets of 5000 training examples. Results are shown\nin Table 2, where it is evident that the ef\ufb01ciency of our method outweighs the marginal gain in\nclassi\ufb01cation performance for joint optimization.\n\n5 Conclusion\n\nWe have developed a method to learn a kernel in a supervised manner using random features. 
Although\nwe consider a kernel alignment problem similar to other approaches in the literature, we exploit\ncomputational advantages offered by random features to develop a much more ef\ufb01cient and scalable\noptimization procedure. Our concentration bounds guarantee the results of our optimization procedure\nclosely match the limits of in\ufb01nite data (n \u2192 \u221e) and sampling (Nw \u2192 \u221e), and our method produces\nmodels that enjoy good generalization performance guarantees. Empirical evaluations indicate that\nour optimized kernels indeed \u201clearn\u201d structure from data, and we attain competitive results on\nbenchmark datasets at a fraction of the training time for other methods. Generalizing the theoretical\nresults for concentration and risk to other f\u2212divergences is the subject of further research. More\nbroadly, our approach opens exciting questions regarding the usefulness of simple optimizations on\nrandom features in speeding up other traditionally expensive learning problems.\n\nAcknowledgements This research was supported by a Fannie & John Hertz Foundation Fellowship\nand a Stanford Graduate Fellowship.\n\n8\n\n\fReferences\n[1] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results.\n\nThe Journal of Machine Learning Research, 3:463\u2013482, 2003.\n\n[2] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein\u2013protein interactions. Bioinformatics,\n\n21(suppl 1):i38\u2013i46, 2005.\n\n[3] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of\n\noptimization problems affected by uncertain probabilities. Management Science, 59(2):341\u2013357, 2013.\n\n[4] D. Bertsekas. Nonlinear Programming. Athena Scienti\ufb01c, 1999.\n[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of\n\nIndependence. Oxford University Press, 2013.\n\n[6] M. Cho, S. S. Oh, J. Nie, R. Stewart, M. 
Eisenstein, J. Chambers, J. D. Marth, F. Walker, J. A. Thomson, and H. T. Soh. Quantitative selection and parallel characterization of aptamers. Proceedings of the National Academy of Sciences, 110(46), 2013.

[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.

[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 247–254, 2010.

[9] C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795–828, 2012.

[10] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target alignment. In Innovations in Machine Learning, pages 205–256. Springer, 2006.

[11] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[12] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.

[13] M. Girolami and S. Rogers. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 241–248. ACM, 2005.

[14] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011.

[15] G. E. Hinton and R. R. Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pages 1249–1256, 2008.

[16] J. Kandola, J. Shawe-Taylor, and N. Cristianini.
Optimizing kernel alignment over combinations of kernels. 2002.

[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 12:953–997, 2011.

[18] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

[19] V. Koltchinskii and D. Panchenko. Complexities of convex combinations and bounding the generalization error in classification. The Annals of Statistics, 33(4):1455–1496, 2005.

[20] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.

[21] Q. Le, T. Sarlós, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252, 2013.

[22] D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.

[23] S. Qiu and T. Lane. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2009.

[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, 2007.

[25] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, 2008.

[26] P. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. Annals of Probability, 28(1):416–461, 2000.

[27] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[28] Y. Ying, K.
Huang, and C. Campbell. Enhanced protein fold recognition through a novel data integration approach. BMC Bioinformatics, 10(1):1, 2009.

[29] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, pages 1191–1198. ACM, 2007.", "award": [], "sourceid": 710, "authors": [{"given_name": "Aman", "family_name": "Sinha", "institution": "Stanford University"}, {"given_name": "John", "family_name": "Duchi", "institution": "Stanford"}]}