{"title": "Ensemble Nystrom Method", "book": "Advances in Neural Information Processing Systems", "page_first": 1060, "page_last": 1068, "abstract": "A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nystrom approximations, ensemble Nystrom algorithms, that yield more accurate low-rank approximations than the standard Nystrom method. We give a detailed study of multiple variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nystrom method. Finally, we report the results of extensive experiments with several data sets containing up to 1M points demonstrating the signi\ufb01cant performance improvements gained over the standard Nystrom approximation.", "full_text": "Ensemble Nystr\u00a8om Method\n\nSanjiv Kumar\nGoogle Research\nNew York, NY\n\nCourant Institute and Google Research\n\nMehryar Mohri\n\nNew York, NY\n\nsanjivk@google.com\n\nmohri@cs.nyu.edu\n\nCourant Institute of Mathematical Sciences\n\nAmeet Talwalkar\n\nNew York, NY\n\nameet@cs.nyu.edu\n\nAbstract\n\nA crucial technique for scaling kernel methods to very large data sets reaching\nor exceeding millions of instances is based on low-rank approximation of kernel\nmatrices. We introduce a new family of algorithms based on mixtures of Nystr\u00a8om\napproximations, ensemble Nystr\u00a8om algorithms, that yield more accurate low-rank\napproximations than the standard Nystr\u00a8om method. We give a detailed study of\nvariants of these algorithms based on simple averaging, an exponential weight\nmethod, or regression-based methods. 
We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

1 Introduction

Modern learning problems in computer vision, natural language processing, computational biology, and other areas are often based on large data sets of tens of thousands to millions of training instances. But several standard learning algorithms, such as support vector machines (SVMs) [2, 4], kernel ridge regression (KRR) [14], kernel principal component analysis (KPCA) [15], manifold learning [13], or other kernel-based algorithms, do not scale to such orders of magnitude. Even the storage of the kernel matrix is an issue at this scale since it is often not sparse and the number of entries is extremely large. One solution to deal with such large data sets is to use an approximation of the kernel matrix. As shown by [18], and later by [6, 17, 19], low-rank approximations of the kernel matrix using the Nyström method can provide an effective technique for tackling large-scale data sets with no significant decrease in performance.

This paper deals with very large-scale applications where the sample size can reach millions of instances. This motivates our search for further improved low-rank approximations that can scale to such orders of magnitude and generate accurate approximations. We show that a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, yields more accurate low-rank approximations than the standard Nyström method.
Moreover, these ensemble algorithms naturally fit distributed computing environments, where their computational cost is roughly the same as that of the standard Nyström method. This issue is of great practical significance given the prevalence of distributed computing frameworks for handling large-scale learning problems.

The remainder of this paper is organized as follows. Section 2 gives an overview of the Nyström low-rank approximation method and describes our ensemble Nyström algorithms. We describe several variants of these algorithms, including one based on simple averaging of p Nyström solutions, an exponential weight method, and a regression method which consists of estimating the mixture parameters of the ensemble using a few columns sampled from the matrix. In Section 3, we present a theoretical analysis of ensemble Nyström algorithms, namely bounds on the reconstruction error for both the Frobenius norm and the spectral norm. These novel generalization bounds guarantee a better convergence rate for these algorithms in comparison to the standard Nyström method. Section 4 reports the results of extensive experiments with these algorithms on several data sets containing up to 1M points, comparing different variants of our ensemble Nyström algorithms and demonstrating the performance improvements gained over the standard Nyström method.

2 Algorithm

We first give a brief overview of the Nyström low-rank approximation method, introduce the notation used in the following sections, and then describe our ensemble Nyström algorithms.

2.1 Standard Nyström method

We adopt a notation similar to that of [5, 9] and other previous work. The Nyström approximation of a symmetric positive semidefinite (SPSD) matrix K is based on a sample of m ≪ n columns of K [5, 18].
Let C denote the n × m matrix formed by these columns and W the m × m matrix consisting of the intersection of these m columns with the corresponding m rows of K. The columns and rows of K can be rearranged based on this sampling so that K and C can be written as follows:

K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}  and  C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}.    (1)

Note that W is also SPSD since K is SPSD. For a uniform sampling of the columns, the Nyström method generates a rank-k approximation \widetilde{K} of K for k ≤ m defined by:

\widetilde{K} = C W_k^+ C^\top ≈ K,    (2)

where W_k is the best rank-k approximation of W for the Frobenius norm, that is W_k = \argmin_{rank(V)=k} \|W − V\|_F, and W_k^+ denotes the pseudo-inverse of W_k [7]. W_k^+ can be derived from the singular value decomposition (SVD) of W, W = U Σ U^\top, where U is orthonormal and Σ = diag(σ_1, ..., σ_m) is a real diagonal matrix with σ_1 ≥ ··· ≥ σ_m ≥ 0. For k ≤ rank(W), it is given by W_k^+ = \sum_{i=1}^k σ_i^{−1} U_i U_i^\top, where U_i denotes the ith column of U. Since the running-time complexity of SVD is O(m^3) and O(nmk) is required for the multiplication with C, the total complexity of the Nyström approximation computation is O(m^3 + nmk).

2.2 Ensemble Nyström algorithm

The main idea behind our ensemble Nyström algorithm is to treat each approximation generated by the Nyström method for a sample of m columns as an expert and to combine p ≥ 1 such experts to derive an improved hypothesis, typically more accurate than any of the original experts.

The learning set-up is defined as follows. We assume a fixed kernel function K: X × X → R that can be used to generate the entries of a kernel matrix K. The learner receives a sample S of mp columns randomly selected from matrix K uniformly without replacement. S is decomposed into p subsamples S_1, ..., S_p.
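A minimal NumPy sketch may help make the standard Nyström construction of Eq. (2) concrete; the function name nystrom and the cutoff on near-zero singular values are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def nystrom(K, idx, k):
    """Rank-k Nystrom approximation (Eq. (2)): K_tilde = C W_k^+ C^T,
    where idx holds the indices of the m sampled columns of the SPSD
    matrix K."""
    C = K[:, idx]              # n x m block of sampled columns
    W = C[idx, :]              # m x m intersection block
    # SVD of W (W is SPSD, so its singular vectors are eigenvectors)
    U, s, _ = np.linalg.svd(W)
    # Pseudo-inverse of the best rank-k approximation of W:
    # W_k^+ = sum_{i=1}^k sigma_i^{-1} U_i U_i^T, dropping near-zero
    # singular values for numerical stability.
    keep = s[:k] > 1e-12
    Uk = U[:, :k][:, keep]
    Wk_pinv = (Uk / s[:k][keep]) @ Uk.T
    return C @ Wk_pinv @ C.T
```

Sampling the m column indices uniformly without replacement, e.g. idx = rng.choice(n, size=m, replace=False), reproduces the uniform sampling scheme described above.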
Each subsample S_r, r ∈ [1, p], contains m columns and is used to define a rank-k Nyström approximation \widetilde{K}_r. Dropping the rank subscript k in favor of the sample index r, \widetilde{K}_r can be written as \widetilde{K}_r = C_r W_r^+ C_r^\top, where C_r and W_r denote the matrices formed from the columns of S_r and W_r^+ is the pseudo-inverse of the rank-k approximation of W_r. The learner further receives a sample V of s columns used to determine the weight μ_r ∈ R attributed to each expert \widetilde{K}_r. Thus, the general form of the approximation of K generated by the ensemble Nyström algorithm is

\widetilde{K}^{ens} = \sum_{r=1}^p μ_r \widetilde{K}_r.    (3)

The mixture weights μ_r can be defined in many ways. The most straightforward choice consists of assigning equal weight to each expert, μ_r = 1/p, r ∈ [1, p]. This choice does not require the additional sample V, but it ignores the relative quality of each Nyström approximation. Nevertheless, this simple uniform method already generates a solution superior to any one of the approximations \widetilde{K}_r used in the combination, as we shall see in the experimental section.

Another method, the exponential weight method, consists of measuring the reconstruction error ε̂_r of each expert \widetilde{K}_r over the validation sample V and defining the mixture weight as μ_r = exp(−η ε̂_r)/Z, where η > 0 is a parameter of the algorithm and Z a normalization factor ensuring that the vector μ = (μ_1, ..., μ_p) belongs to the simplex Δ of R^p: Δ = {μ ∈ R^p : μ ≥ 0 ∧ \sum_{r=1}^p μ_r = 1}. The choice of the mixture weights here is similar to those used in the weighted-majority algorithm [11]. Let K^V denote the matrix formed by using the samples from V as its columns and let \widetilde{K}_r^V denote the submatrix of \widetilde{K}_r containing the columns corresponding to the columns in V.
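As an illustration of Eq. (3), the uniform and exponential mixture weights described above can be sketched as follows; the function name combine_experts and the parameter eta (for η) are our own, and the p expert approximations \widetilde{K}_r are assumed to be precomputed.

```python
import numpy as np

def combine_experts(K, experts, V=None, eta=1.0):
    """Ensemble Nystrom combination (Eq. (3)): K_ens = sum_r mu_r K_r.

    experts: list of p (n x n) Nystrom approximations of K.
    V: indices of the s validation columns; if None, uniform weights
       mu_r = 1/p are used. Otherwise the exponential weight method sets
       mu_r = exp(-eta * err_r) / Z, with err_r the reconstruction error
       of expert r on the columns in V."""
    p = len(experts)
    if V is None:
        mu = np.full(p, 1.0 / p)
    else:
        errs = np.array([np.linalg.norm(Kt[:, V] - K[:, V]) for Kt in experts])
        w = np.exp(-eta * errs)
        mu = w / w.sum()   # normalization puts mu on the simplex
    K_ens = sum(m_r * Kt for m_r, Kt in zip(mu, experts))
    return K_ens, mu
```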
The reconstruction error ε̂_r = \|\widetilde{K}_r^V − K^V\| can be directly computed from these matrices.

A more general class of methods consists of using the sample V to train the mixture weights μ_r to optimize a regression objective function such as the following:

\min_μ  λ \|μ\|_2^2 + \Big\| \sum_{r=1}^p μ_r \widetilde{K}_r^V − K^V \Big\|_F^2,    (4)

where K^V denotes the matrix formed by the columns of the samples S and V and λ > 0. This can be viewed as a ridge regression objective function and admits a closed-form solution. We will refer to this method as the ridge regression method.

The total complexity of the ensemble Nyström algorithm is O(pm^3 + pmkn + C_μ), where C_μ is the cost of computing the mixture weights μ used to combine the p Nyström approximations. In general, the cubic term dominates the complexity since the mixture weights can be computed in constant time for the uniform method, in O(psn) for the exponential weight method, or in O(p^3 + pms) for the ridge regression method. Furthermore, although the ensemble Nyström algorithm requires p times more space and CPU cycles than the standard Nyström method, these additional requirements are quite reasonable in practice. The space requirement is still manageable for even large-scale applications given that p is typically O(1) and m is usually a very small percentage of n (see Section 4 for further details). In terms of CPU requirements, we note that our algorithm can be easily parallelized, as all p experts can be computed simultaneously. Thus, with a cluster of p machines, the running-time complexity of this algorithm is nearly equal to that of the standard Nyström algorithm with m samples.

3 Theoretical analysis

We now present a theoretical analysis of the ensemble Nyström method, for which we use as tools some results previously shown by [5] and [9].
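Before turning to the analysis, the ridge regression method of Eq. (4) can be made concrete: stacking each vectorized \widetilde{K}_r^V as a column of a matrix A and setting b = vec(K^V) turns Eq. (4) into an ordinary ridge regression with the familiar closed-form solution μ = (A^\top A + λI)^{−1} A^\top b. A minimal sketch, with our own naming (lam stands for λ):

```python
import numpy as np

def ridge_weights(K, experts, V, lam=1.0):
    """Closed-form mixture weights for the ridge objective of Eq. (4):
        min_mu  lam * ||mu||_2^2 + || sum_r mu_r K_r^V - K^V ||_F^2.
    Vectorizing each expert's validation block reduces this to standard
    ridge regression: mu = (A^T A + lam * I)^{-1} A^T b."""
    A = np.column_stack([Kt[:, V].ravel() for Kt in experts])  # (n*s) x p
    b = K[:, V].ravel()
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)
```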
As in [9], we shall use the following generalization of McDiarmid's concentration bound to sampling without replacement [3].

Theorem 1. Let Z_1, ..., Z_m be a sequence of random variables sampled uniformly without replacement from a fixed set of m + u elements Z, and let φ: Z^m → R be a symmetric function such that for all i ∈ [1, m] and for all z_1, ..., z_m ∈ Z and z'_1, ..., z'_m ∈ Z, |φ(z_1, ..., z_m) − φ(z_1, ..., z_{i−1}, z'_i, z_{i+1}, ..., z_m)| ≤ c. Then, for all ε > 0, the following inequality holds:

Pr[φ − E[φ] ≥ ε] ≤ exp\Big( \frac{−2ε^2}{α(m, u) c^2} \Big),    (5)

where α(m, u) = \frac{mu}{m + u − 1/2} · \frac{1}{1 − 1/(2 \max\{m, u\})}.

We define the selection matrix corresponding to a sample of m columns as the matrix S ∈ R^{n×m} defined by S_{ii} = 1 if the ith column of K is among those sampled, S_{ij} = 0 otherwise. Thus, C = KS is the matrix formed by the columns sampled. Since K is SPSD, there exists X ∈ R^{N×n} such that K = X^\top X. We shall denote by K_max the maximum diagonal entry of K, K_max = \max_i K_{ii}, and by d_max^K the distance \max_{ij} \sqrt{K_{ii} + K_{jj} − 2K_{ij}}.

3.1 Error bounds for the standard Nyström method

The following theorem gives an upper bound on the norm-2 error of the Nyström approximation of the form \|K − \widetilde{K}\|_2 / \|K\|_2 ≤ \|K − K_k\|_2 / \|K\|_2 + O(1/\sqrt{m}) and an upper bound on the Frobenius error of the Nyström approximation of the form \|K − \widetilde{K}\|_F / \|K\|_F ≤ \|K − K_k\|_F / \|K\|_F + O(1/m^{1/4}). Note that these bounds are similar to the bounds of Theorem 3 in [9], though in this work we give new results for the spectral norm and present a tighter Lipschitz condition (9), the latter of which is needed to derive tighter bounds in Section 3.2.

Theorem 2.
Let \widetilde{K} denote the rank-k Nyström approximation of K based on m columns sampled uniformly at random without replacement from K, and K_k the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequalities hold for any sample of size m:

\|K − \widetilde{K}\|_2 ≤ \|K − K_k\|_2 + \frac{2n}{\sqrt{m}} K_max \Big[ 1 + \sqrt{ \frac{n−m}{n−1/2} \, \frac{1}{β(m,n)} \, \log\frac{1}{δ} } \, d_max^K / K_max^{1/2} \Big],

\|K − \widetilde{K}\|_F ≤ \|K − K_k\|_F + \Big[ \frac{64k}{m} \Big]^{1/4} n K_max \Big[ 1 + \sqrt{ \frac{n−m}{n−1/2} \, \frac{1}{β(m,n)} \, \log\frac{1}{δ} } \, d_max^K / K_max^{1/2} \Big]^{1/2},

where β(m, n) = 1 − \frac{1}{2 \max\{m, n−m\}}.

Proof. To bound the norm-2 error of the Nyström method in the scenario of sampling without replacement, we start with the following general inequality given by [5] [proof of Lemma 4]:

\|K − \widetilde{K}\|_2 ≤ \|K − K_k\|_2 + 2\|XX^\top − ZZ^\top\|_2,    (6)

where Z = \sqrt{n/m} \, XS. We then apply the McDiarmid-type inequality of Theorem 1 to φ(S) = \|XX^\top − ZZ^\top\|_2. Let S' be a sampling matrix selecting the same columns as S except for one, and let Z' denote \sqrt{n/m} \, XS'. Let z and z' denote the only differing columns of Z and Z'; then

|φ(S') − φ(S)| ≤ \|z'z'^\top − zz^\top\|_2 = \|(z' − z)z'^\top + z(z' − z)^\top\|_2    (7)
≤ 2\|z' − z\|_2 \max\{\|z\|_2, \|z'\|_2\}.    (8)

Columns of Z are those of X scaled by \sqrt{n/m}. The norm of the difference of two columns of X can be viewed as the norm of the difference of two feature vectors associated to K and thus can be bounded by d_max^K. Similarly, the norm of a single column of X is bounded by K_max^{1/2}. This leads to the following inequality:

|φ(S') − φ(S)| ≤ \frac{2n}{m} \, d_max^K \, K_max^{1/2}.    (9)

The expectation of φ can be bounded as follows:

E[φ] = E[\|XX^\top − ZZ^\top\|_2] ≤ E[\|XX^\top − ZZ^\top\|_F] ≤ \frac{n}{\sqrt{m}} K_max,    (10)

where the last inequality follows from Corollary 2 of [9]. The inequalities (9) and (10) combined with Theorem 1 give a bound on \|XX^\top − ZZ^\top\|_2 and yield the statement of the theorem.

The following general inequality holds for the Frobenius error of the Nyström method [5]:

\|K − \widetilde{K}\|_F^2 ≤ \|K − K_k\|_F^2 + \sqrt{64k} \, \|XX^\top − ZZ^\top\|_F \, n K_max.    (11)

Bounding the term \|XX^\top − ZZ^\top\|_F as in the norm-2 case and using the concentration bound of Theorem 1 yields the result of the theorem.

3.2 Error bounds for the ensemble Nyström method

The following error bounds hold for ensemble Nyström methods based on a convex combination of Nyström approximations.

Theorem 3. Let S be a sample of pm columns drawn uniformly at random without replacement from K, decomposed into p subsamples of size m, S_1, ..., S_p. For r ∈ [1, p], let \widetilde{K}_r denote the rank-k Nyström approximation of K based on the sample S_r, and let K_k denote the best rank-k approximation of K.
Then, with probability at least 1 − δ, the following inequalities hold for any sample S of size pm and for any μ in the simplex Δ and \widetilde{K}^{ens} = \sum_{r=1}^p μ_r \widetilde{K}_r:

\|K − \widetilde{K}^{ens}\|_2 ≤ \|K − K_k\|_2 + \frac{2n}{\sqrt{m}} K_max \Big[ 1 + μ_max p^{1/2} \sqrt{ \frac{n−pm}{n−1/2} \, \frac{1}{β(pm,n)} \, \log\frac{1}{δ} } \, d_max^K / K_max^{1/2} \Big],

\|K − \widetilde{K}^{ens}\|_F ≤ \|K − K_k\|_F + \Big[ \frac{64k}{m} \Big]^{1/4} n K_max \Big[ 1 + μ_max p^{1/2} \sqrt{ \frac{n−pm}{n−1/2} \, \frac{1}{β(pm,n)} \, \log\frac{1}{δ} } \, d_max^K / K_max^{1/2} \Big]^{1/2},

where β(pm, n) = 1 − \frac{1}{2 \max\{pm, n−pm\}} and μ_max = \max_{r=1}^p μ_r.

Proof. For r ∈ [1, p], let Z_r = \sqrt{n/m} \, XS_r, where S_r denotes the selection matrix corresponding to the sample S_r. By definition of \widetilde{K}^{ens} and the upper bound on \|K − \widetilde{K}_r\|_2 already used in the proof of Theorem 2, the following holds:

\|K − \widetilde{K}^{ens}\|_2 = \Big\| \sum_{r=1}^p μ_r (K − \widetilde{K}_r) \Big\|_2 ≤ \sum_{r=1}^p μ_r \|K − \widetilde{K}_r\|_2    (12)
≤ \sum_{r=1}^p μ_r \big( \|K − K_k\|_2 + 2\|XX^\top − Z_r Z_r^\top\|_2 \big)    (13)
= \|K − K_k\|_2 + 2 \sum_{r=1}^p μ_r \|XX^\top − Z_r Z_r^\top\|_2.    (14)

We apply Theorem 1 to φ(S) = \sum_{r=1}^p μ_r \|XX^\top − Z_r Z_r^\top\|_2. Let S' be a sample differing from S by only one column. Observe that changing one column of the full sample S changes only one subsample S_r and thus only one term μ_r \|XX^\top − Z_r Z_r^\top\|_2. Thus, in view of the bound (9) on the change to \|XX^\top − Z_r Z_r^\top\|_2, the following holds:

|φ(S') − φ(S)| ≤ \frac{2n}{m} \, μ_max \, d_max^K \, K_max^{1/2}.    (15)

The expectation of φ can be straightforwardly bounded by E[φ(S)] = \sum_{r=1}^p μ_r E[\|XX^\top − Z_r Z_r^\top\|_2] ≤ \sum_{r=1}^p μ_r \frac{n}{\sqrt{m}} K_max = \frac{n}{\sqrt{m}} K_max, using the bound (10) for a single expert. Plugging this upper bound and the Lipschitz bound (15) into Theorem 1 yields our norm-2 bound for the ensemble Nyström method.

For the Frobenius error bound, using the convexity of the squared Frobenius norm \|·\|_F^2 and the general inequality (11), we can write

\|K − \widetilde{K}^{ens}\|_F^2 = \Big\| \sum_{r=1}^p μ_r (K − \widetilde{K}_r) \Big\|_F^2 ≤ \sum_{r=1}^p μ_r \|K − \widetilde{K}_r\|_F^2    (16)
≤ \sum_{r=1}^p μ_r \big[ \|K − K_k\|_F^2 + \sqrt{64k} \, \|XX^\top − Z_r Z_r^\top\|_F \, n K_max \big]    (17)
= \|K − K_k\|_F^2 + \sqrt{64k} \sum_{r=1}^p μ_r \|XX^\top − Z_r Z_r^\top\|_F \, n K_max.    (18)

The result follows by the application of Theorem 1 to ψ(S) = \sum_{r=1}^p μ_r \|XX^\top − Z_r Z_r^\top\|_F in a way similar to the norm-2 case.

The bounds of Theorem 3 are similar in form to those of Theorem 2. However, the bounds for the ensemble Nyström method are tighter than those for any Nyström expert based on a single sample of size m, even for a uniform weighting. In particular, for μ = 1/p, the last term of the ensemble bound for norm-2 is smaller by a factor of μ_max p^{1/2} = 1/\sqrt{p}.

4 Experiments

In this section, we present experimental results that illustrate the performance of the ensemble Nyström method. We work with the datasets listed in Table 1. In Section 4.1, we compare the performance of various methods for calculating the mixture weights (μ_r). In Section 4.2, we show the effectiveness of our technique on large-scale datasets.
Throughout our experiments, we measure the accuracy of a low-rank approximation \widetilde{K} by calculating the relative error in the Frobenius and spectral norms; that is, for ξ ∈ {2, F}, we calculate the following quantity:

% error = \frac{\|K − \widetilde{K}\|_ξ}{\|K\|_ξ} × 100.    (19)

Dataset          Type of data     # Points (n)   # Features (d)   Kernel
PIE-2.7K [16]    face images      2731           2304             linear
MNIST [10]       digit images     4000           784              linear
ESS [8]          proteins         4728           16               RBF
AB-S [1]         abalones         4177           8                RBF
DEXT [1]         bag of words     2000           20000            linear
SIFT-1M [12]     image features   1M             128              RBF

Table 1: A summary of the datasets used in the experiments.

4.1 Ensemble Nyström with various mixture weights

In this set of experiments, we show results for our ensemble Nyström method using different techniques to choose the mixture weights as discussed in Section 2.2. We first experimented with the first five datasets shown in Table 1. For each dataset, we fixed the reduced rank to k = 50, and set the number of sampled columns to m = 3% of n.¹ Furthermore, for the exponential and the ridge regression variants, we sampled an additional set of s = 20 columns and used an additional 20 columns (s') as a hold-out set for selecting the optimal values of η and λ. The number of approximations, p, was varied from 2 to 30. As a baseline, we also measured the minimal and mean percent error across the p Nyström approximations used to construct \widetilde{K}^{ens}. For the Frobenius norm, we also calculated the performance when using the optimal μ, that is, we used least-squares regression to find the best possible choice of combination weights for a fixed set of p approximations by setting s = n.

The results of these experiments are presented in Figure 1 for the Frobenius norm and in Figure 2 for the spectral norm.
These results clearly show that the ensemble Nyström performance is significantly better than any of the individual Nyström approximations. Furthermore, the ridge regression technique is the best of the proposed techniques and generates nearly the optimal solution in terms of the percent error in Frobenius norm. We also observed that when s is increased to approximately 5% to 10% of n, linear regression without any regularization performs about as well as ridge regression for both the Frobenius and spectral norms. Figure 3 shows this comparison between linear regression and ridge regression for varying values of s using a fixed number of experts (p = 10). Finally, we note that the ensemble Nyström method tends to converge very quickly, and the most significant gain in performance occurs as p increases from 2 to 10.

4.2 Large-scale experiments

Next, we present an empirical study of the effectiveness of the ensemble Nyström method on the SIFT-1M dataset in Table 1 containing 1 million data points. As is common practice with large-scale datasets, we worked on a cluster of several machines for this dataset. We present results comparing the performance of the ensemble Nyström method, using both uniform and ridge regression mixture weights, with that of the best and mean performance across the p Nyström approximations used to construct \widetilde{K}^{ens}. We also make comparisons with a recently proposed k-means based sampling technique for the Nyström method [19]. Although the k-means technique is quite effective at generating informative columns by exploiting the data distribution, the cost of performing k-means becomes expensive for even moderately sized datasets, making it difficult to use in large-scale settings.
Nevertheless, in this work, we include the k-means method in our comparison, and we present results for various subsamples of the SIFT-1M dataset, with n ranging from 5K to 1M.

To fairly compare these techniques, we performed 'fixed-time' experiments. To do this, we first searched for an appropriate m such that the percent error for the ensemble Nyström method with ridge weights was approximately 10%, and measured the time required by the cluster to construct this approximation. We then allotted an equal amount of time (within 1 second) for the other techniques, and measured the quality of the resulting approximations. For these experiments, we set k = 50 and p = 10, based on the results from the previous section. Furthermore, in order to speed up computation on this large dataset, we decreased the size of the validation and hold-out sets to s = 2 and s' = 2, respectively.

¹ Similar results (not reported here) were observed for other values of k and m as well.

[Figure 1: plots of percent error (Frobenius) versus number of base learners p for PIE-2.7K, MNIST, ESS, AB-S, and DEXT.]

Figure 1: Percent error in Frobenius norm for the ensemble Nyström method using uniform ('uni'), exponential ('exp'), ridge ('ridge'), and optimal ('optimal') mixture weights as well as the best ('best b.l.') and mean ('mean b.l.') performance of the p base learners used to create the ensemble approximation.

[Figure 2: plots of percent error (spectral) versus number of base learners p for PIE-2.7K, MNIST, ESS, AB-S, and DEXT.]

Figure 2: Percent error in spectral norm for the ensemble Nyström method using various mixture weights as well as the best and mean performance of the p approximations used to create the ensemble approximation. Legend entries are the same as in Figure 1.

The results of this experiment, presented in Figure 4, clearly show that the ensemble Nyström method is the most effective technique given a fixed amount of time. Furthermore, even with the small values of s and s', ensemble Nyström with ridge-regression weighting outperforms the uniform ensemble Nyström method. We also observe that due to the high computational cost of k-means for large datasets, the k-means approximation does not perform well in this 'fixed-time' experiment. It generates an approximation that is worse than the mean standard Nyström approximation and its performance increasingly deteriorates as n approaches 1M.
[Figure 3: plots of percent error (Frobenius) versus the relative size of the validation set for PIE-2.7K, MNIST, ESS, AB-S, and DEXT.]

Figure 3: Comparison of percent error in Frobenius norm for the ensemble Nyström method with p = 10 experts with weights derived from linear regression ('no-ridge') and ridge regression ('ridge'). The dotted line indicates the optimal combination. The relative size of the validation set equals s/n × 100%.

[Figure 4: percent error (Frobenius) versus dataset size n for the large-scale ensemble study on SIFT-1M.]

Figure 4: Large-scale performance comparison with the SIFT-1M dataset. Given fixed computational time, ensemble Nyström with ridge weights tends to outperform other techniques.

Finally, we note that although the space requirements are 10 times greater for ensemble Nyström in comparison to standard Nyström (since p = 10 in this experiment), the space constraints are nonetheless quite reasonable. For instance, when working with the full 1M points, the ensemble Nyström method with ridge regression weights only required approximately 1% of the columns of K to achieve a percent error of 10%.

5 Conclusion

We presented a novel family of algorithms, ensemble Nyström algorithms, for accurate low-rank approximations in large-scale applications. The consistent and significant performance improvement across a number of different data sets, along with the fact that these algorithms can be easily parallelized, suggests that these algorithms can benefit a variety of applications where kernel methods are used. Interestingly, the algorithmic solution we have proposed for scaling these kernel learning algorithms to larger scales is itself derived from the machine learning idea of ensemble methods. We also gave the first theoretical analysis of these methods. We expect that finer error bounds and theoretical guarantees will further guide the design of the ensemble algorithms and help us gain a better insight about the convergence properties of our algorithms.

References

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[2] B. E. Boser, I. Guyon, and V. N. Vapnik.
A training algorithm for optimal margin classifiers. In COLT, volume 5, pages 144-152, 1992.
[3] C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi. Stability of transductive regression algorithms. In ICML, 2008.
[4] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[5] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153-2175, 2005.
[6] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 2004.
[7] G. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edition, 1983.
[8] A. Gustafson, E. Snitkin, S. Parker, C. DeLisi, and S. Kasif. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics, 7:265, 2006.
[9] S. Kumar, M. Mohri, and A. Talwalkar. Sampling techniques for the Nyström method. In AISTATS, pages 304-311, 2009.
[10] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 2009.
[11] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[13] J. C. Platt. Fast embedding of sparse similarity graphs. In NIPS, 2004.
[14] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML '98, pages 515-521, 1998.
[15] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[16] T. Sim, S. Baker, and M. Bsat. The CMU PIE database.
In Conference on Automatic Face and Gesture Recognition, 2002.
[17] A. Talwalkar, S. Kumar, and H. Rowley. Large-scale manifold learning. In CVPR, 2008.
[18] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682-688, 2000.
[19] K. Zhang, I. Tsang, and J. Kwok. Improved Nyström low-rank approximation and error analysis. In ICML, pages 273-297, 2008.
", "award": [], "sourceid": 434, "authors": [{"given_name": "Sanjiv", "family_name": "Kumar", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": null}]}