{"title": "Multiresolution Kernel Approximation for Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3740, "page_last": 3748, "abstract": "Gaussian process regression generally does not scale beyond a few thousand data points without applying some sort of kernel approximation method. Most approximations focus on the high eigenvalue part of the spectrum of the kernel matrix, $K$, which leads to bad performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad bandwidth kernel approximation algorithm. Important points about MKA are that it is memory efficient, and it is a direct method, which means that it also makes it easy to approximate $K^{-1}$ and $\\mathop{\\textrm{det}}(K)$.", "full_text": "Multiresolution Kernel Approximation for Gaussian Process Regression\n\nYi Ding\u2217, Risi Kondor\u2217\u2020, Jonathan Eskreis-Winkler\u2020\n\u2217Department of Computer Science, \u2020Department of Statistics\n{dingy,risi,eskreiswinkler}@uchicago.edu\nThe University of Chicago, Chicago, IL, 60637\n\nAbstract\n\nGaussian process regression generally does not scale beyond a few thousand data points without applying some sort of kernel approximation method. Most approximations focus on the high eigenvalue part of the spectrum of the kernel matrix, K, which leads to bad performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad bandwidth kernel approximation algorithm. 
Important points about MKA are that it is memory efficient, and it is a direct method, which means that it also makes it easy to approximate K\u22121 and det(K).\n\n1 Introduction\n\nGaussian Process (GP) regression, and its frequentist cousin, kernel ridge regression, are such natural and canonical algorithms that they have been reinvented many times by different communities under different names. In machine learning, GPs are considered one of the standard methods of Bayesian nonparametric inference [1]. Meanwhile, the same model, under the name Kriging or Gaussian Random Fields, is the de facto standard for modeling a range of natural phenomena from geophysics to biology [2]. One of the most appealing features of GPs is that, ultimately, the algorithm reduces to \u201cjust\u201d having to compute the inverse of a kernel matrix, K. Unfortunately, this also turns out to be the algorithm\u2019s Achilles heel, since in the general case, the complexity of inverting a dense n\u00d7n matrix scales with O(n\u00b3), meaning that when the number of training examples exceeds 10\u2074\u223c10\u2075, GP inference becomes problematic on virtually any computer\u00b9. Over the course of the last 15 years, devising approximations to address this problem has become a burgeoning field.\n\nThe most common approach is to use one of the so-called Nystr\u00f6m methods [3], which select a small subset {x_{i_1}, . . . , x_{i_m}} of the original training data points as \u201canchors\u201d and approximate K in the form K \u2248 K_{\u2217,I} C K_{\u2217,I}\u22a4, where K_{\u2217,I} is the submatrix of K consisting of columns {i_1, . . . , i_m}, and C is a matrix such as the pseudo-inverse of K_{I,I}. Nystr\u00f6m methods often work well in practice and have a mature literature offering strong theoretical guarantees. 
Still, Nystr\u00f6m is inherently a global low rank approximation, and, as pointed out in [4], a priori there is no reason to believe that K should be well approximable by a low rank matrix: for example, in the case of the popular Gaussian kernel k(x, x\u2032) = exp(\u2212(x\u2212x\u2032)\u00b2/(2\u2113\u00b2)), as \u2113 decreases and the kernel becomes more and more \u201clocal\u201d, the number of significant eigenvalues quickly increases. This observation has motivated alternative types of approximations, including local, hierarchical and distributed ones (see Section 2). In certain contexts involving translation invariant kernels yet other strategies may be applicable [5], but these are beyond the scope of the present paper.\n\nIn this paper we present a new kernel approximation method, Multiresolution Kernel Approximation (MKA), which is inspired by a combination of ideas from hierarchical matrix decomposition algorithms and multiresolution analysis. Some of the important features of MKA are that (a) it is a broad spectrum algorithm that approximates the entire kernel matrix K, not just its top eigenvectors, and (b) it is a so-called \u201cdirect\u201d method, i.e., it yields explicit approximations to K\u22121 and det(K).\n\n\u00b9 In the limited case of evaluating a GP with a fixed Gram matrix on a single training set, GP inference reduces to solving a linear system in K, which scales better with n, but may be problematic when the condition number of K is large.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nNotations. We define [n] = {1, 2, . . . , n}. Given a matrix A, and a tuple I = (i_1, . . . , i_r), A_{I,\u2217} will denote the submatrix of A formed of the rows indexed by i_1, . . . , i_r; similarly, given J = (j_1, . . . , j_p), A_{\u2217,J} will denote the submatrix formed of the columns indexed by j_1, . . . , j_p, and A_{I,J} will denote the submatrix at the intersection of rows i_1, . . . , i_r and columns j_1, . . . , j_p. We extend these notations to the case when I and J are sets in the obvious way. If A is a blocked matrix, then \u27e6A\u27e7_{i,j} will denote its (i, j) block.\n\n2 Local vs. global kernel approximation\n\nRecall that a Gaussian Process (GP) on a space X is a prior over functions f : X \u2192 R defined by a mean function \u03bc(x) = E[f(x)] and a covariance function k(x, x\u2032) = Cov(f(x), f(x\u2032)). Using the most elementary model y_i = f(x_i) + \u03b5, where \u03b5 \u223c N(0, \u03c3\u00b2) and \u03c3\u00b2 is a noise parameter, given training data {(x_1, y_1), . . . , (x_n, y_n)}, the posterior is also a GP, with mean \u03bc\u2032(x) = \u03bc(x) + k_x\u22a4(K + \u03c3\u00b2I)\u22121 y, where k_x = (k(x, x_1), . . . , k(x, x_n)), y = (y_1, . . . , y_n), and covariance\n\nk\u2032(x, x\u2032) = k(x, x\u2032) \u2212 k_{x\u2032}\u22a4(K + \u03c3\u00b2I)\u22121 k_x.    (1)\n\nThus (here and in the following assuming \u03bc = 0 for simplicity), the maximum a posteriori (MAP) estimate of f is\n\n\u02c6f(x) = k_x\u22a4(K + \u03c3\u00b2I)\u22121 y.    (2)\n\nRidge regression, which is the frequentist analog of GP regression, yields the same formula, but regards \u02c6f as the solution to a regularized risk minimization problem over a Hilbert space H induced by k. We will use \u201cGP\u201d as the generic term to refer to both Bayesian GPs and ridge regression. Letting K\u2032 = (K + \u03c3\u00b2I), virtually all GP approximation approaches focus on trying to approximate the (augmented) kernel matrix K\u2032 in such a way so as to make inverting it, solving linear systems K\u2032\u03b1 = y, or computing det(K\u2032) easier. 
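The posterior mean and variance formulas (1)\u2013(2) can be sketched directly in a few lines of NumPy. This is an illustrative sketch only (the function name and the plain Gaussian kernel are ours, not the paper's), assuming \u03bc = 0; it is the O(n\u00b3) computation whose cost motivates the approximations discussed below.

```python
import numpy as np

def gp_posterior(X, y, x_star, k, sigma2):
    """MAP estimate f_hat(x*) = k_x^T (K + sigma^2 I)^{-1} y (eq. 2),
    plus the posterior variance from eq. (1). Assumes mean mu = 0."""
    n = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    kx = np.array([k(x_star, xi) for xi in X])
    Kp = K + sigma2 * np.eye(n)            # augmented kernel matrix K'
    mean = kx @ np.linalg.solve(Kp, y)     # eq. (2)
    var = k(x_star, x_star) - kx @ np.linalg.solve(Kp, kx)  # eq. (1)
    return mean, var

# Gaussian kernel with length scale l, as in Section 2
l = 0.5
k = lambda a, b: np.exp(-(a - b) ** 2 / (2 * l ** 2))
```

With a near-zero noise parameter the posterior mean interpolates the training targets, which is a quick sanity check on the formulas.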
For the sake of simplicity, in the following we will actually discuss approximating K, since adding the diagonal term usually doesn\u2019t make the problem any more challenging.\n\n2.1 Global low rank methods\n\nAs in other kernel methods, intuitively, K_{i,j} = k(x_i, x_j) encodes the degree of similarity or closeness between the two points x_i and x_j, as it relates to the degree of correlation/similarity between the value of f at x_i and at x_j. Given that k is often conceived of as a smooth, slowly varying function, one very natural idea is to take a smaller set {x_{i_1}, . . . , x_{i_m}} of \u201clandmark points\u201d or \u201cpseudo-inputs\u201d and approximate k(x, x\u2032) in terms of the similarity of x to each of the landmarks, the relationship of the landmarks to each other, and the similarity of the landmarks to x\u2032. Mathematically,\n\nk(x, x\u2032) \u2248 \u2211_{s=1}^m \u2211_{j=1}^m k(x, x_{i_s}) c_{i_s,i_j} k(x_{i_j}, x\u2032),\n\nwhich, assuming that {x_{i_1}, . . . , x_{i_m}} is a subset of the original point set {x_1, . . . , x_n}, amounts to an approximation of the form K \u2248 K_{\u2217,I} C K_{\u2217,I}\u22a4, with I = {i_1, . . . , i_m}. The canonical choice for C is C = W\u207a, where W = K_{I,I} and W\u207a denotes the Moore\u2013Penrose pseudoinverse of W. The resulting approximation\n\nK \u2248 K_{\u2217,I} W\u207a K_{\u2217,I}\u22a4    (3)\n\nis known as the Nystr\u00f6m approximation, because it is analogous to the so-called Nystr\u00f6m extension used to extrapolate continuous operators from a finite number of quadrature points. Clearly, the choice of I is critical for a good quality approximation. Starting with the pioneering papers [6, 3, 7], over the course of the last 15 years a sequence of different sampling strategies has been developed for obtaining I, several with rigorous approximation bounds [8, 9, 10, 11]. 
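The Nystr\u00f6m approximation (3) can be sketched in a few lines; this is a minimal illustration (the function name and landmark choice are ours), using the canonical C = W\u207a:

```python
import numpy as np

def nystrom(K, I):
    """Nystrom approximation (eq. 3): K ~= K[:,I] pinv(K[I,I]) K[:,I]^T,
    where I indexes the landmark points."""
    C = K[:, I]                              # K_{*,I}
    W_pinv = np.linalg.pinv(K[np.ix_(I, I)]) # W^+ with W = K_{I,I}
    return C @ W_pinv @ C.T
```

For a long length-scale kernel the spectrum decays fast and a handful of landmarks already gives a tight approximation; as the length scale shrinks, the rank needed grows, which is exactly the limitation discussed in the text.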
Further variations include the ensemble Nystr\u00f6m method [12] and the modified Nystr\u00f6m method [13].\n\nNystr\u00f6m methods have the advantage of being relatively simple and having reliable performance bounds. A fundamental limitation, however, is that the approximation (3) is inherently low rank. As pointed out in [4], there is no reason to believe that kernel matrices in general should be close to low rank. An even more fundamental issue, which is less often discussed in the literature, relates to the specific form of (2). The appearance of K\u2032\u22121 in this formula suggests that it is the low eigenvalue eigenvectors of K\u2032 that should dominate the result of GP regression. On the other hand, multiplying the matrix by k_x largely cancels this effect, since k_x is effectively a row of a kernel matrix similar to K\u2032, and will likely concentrate most weight on the high eigenvalue eigenvectors. Therefore, ultimately, it is not K\u2032 itself, but the relationship between the eigenvectors of K\u2032 and the data vector y that determines which part of the spectrum of K\u2032 the result of GP regression is most sensitive to.\n\nOnce again, intuition about the kernel helps clarify this point. In a setting where the function that we are regressing is smooth, and correspondingly, the kernel has a large length scale parameter, it is the global, long range relationships between data points that dominate GP regression, and these can indeed be well approximated by the landmark point method. In terms of the linear algebra, the spectral expansion of K\u2032 is dominated by a few large eigenvalue eigenvectors; we will call this the \u201cPCA-like\u201d scenario. In contrast, in situations where f varies more rapidly, a shorter length scale kernel is called for, and local relationships between nearby points become more important, which the landmark point method is less well suited to capture. 
We call this the \u201ck\u2013nearest neighbor type\u201d scenario. In reality, most non-trivial GP regression problems fall somewhere in between the above two extremes. In high dimensions data points tend to be almost equally far from each other anyway, limiting the applicability of simple geometric interpretations. Nonetheless, the two scenarios are an illustration of the general point that one of the key challenges in large scale machine learning is integrating information from both local and global scales.\n\n2.2 Local and hierarchical low rank methods\n\nRealizing the limitations of the low rank approach, local kernel approximation methods have also started appearing in the literature. Broadly, these algorithms: (1) first cluster the rows/columns of K with some appropriate fast clustering method, e.g., METIS [14] or GRACLUS [15], and block K accordingly; (2) compute a low rank, but relatively high accuracy, approximation \u27e6K\u27e7_{i,i} \u2248 U_i \u03a3_i U_i\u22a4 to each diagonal block of K; (3) use the {U_i} bases to compute possibly coarser approximations to the \u27e6K\u27e7_{i,j} off-diagonal blocks. This idea appears in its purest form in [16], and is refined in [4] in a way that avoids having to form all rows/columns of the off-diagonal blocks in the first place. Recently, [17] proposed a related approach, where all the blocks in a given row share the same row basis but have different column bases. A major advantage of local approaches is that they are inherently parallelizable. The clustering itself, however, is a delicate, and sometimes not very robust, component of these methods. In fact, divide-and-conquer type algorithms such as [18] and [19] can also be included in the same category, even though in these cases the blocking is usually random.\n\nA natural extension of the blocking idea would be to apply the divide-and-conquer approach recursively, at multiple different scales. 
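The three-step local scheme just described can be sketched as follows. This is only an illustration under simplifying assumptions: the clusters are given rather than computed by METIS/GRACLUS, and a plain per-block eigendecomposition stands in for whatever compressor [16] or [4] actually use; off-diagonal blocks are compressed in the diagonal blocks' bases, as in step (3).

```python
import numpy as np

def block_lowrank(K, clusters, r):
    """Steps (1)-(3) of Sec. 2.2: given a clustering, approximate each
    diagonal block by its top-r eigenspace U_i, then compress each
    off-diagonal block as U_i (U_i^T K_ij U_j) U_j^T."""
    U = []
    for ci in clusters:                      # step (2): per-block bases
        w, V = np.linalg.eigh(K[np.ix_(ci, ci)])
        U.append(V[:, np.argsort(-np.abs(w))[:r]])
    Kt = np.zeros_like(K)
    for i, ci in enumerate(clusters):        # step (3): project all blocks
        for j, cj in enumerate(clusters):
            B = K[np.ix_(ci, cj)]
            Kt[np.ix_(ci, cj)] = U[i] @ (U[i].T @ B @ U[j]) @ U[j].T
    return Kt
```

When r equals the block size the projection is the identity and the approximation is exact, which gives a quick correctness check; the interesting regime is r well below the block size.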
Geometrically, this is similar to recent multiresolution data analysis approaches such as [20]. In fact, hierarchical matrix approximations, including HODLR matrices, H\u2013matrices [21], H\u00b2\u2013matrices [22] and HSS matrices [23], are very popular in the numerical analysis literature. While the exact details vary, each of these methods imposes a specific type of block structure on the matrix and forces the off-diagonal blocks to be low rank (Figure 1 in the Supplement). Intuitively, nearby clusters interact in a richer way, but as we move farther away, data can be aggregated more and more coarsely, just as in the fast multipole method [24].\n\nWe know of only two applications of the hierarchical matrix methodology to kernel approximation: B\u00f6rm and Garcke\u2019s H\u00b2 matrix approach [25] and O\u2019Neil et al.\u2019s HODLR method [26]. The advantage of H\u00b2 matrices is their more intricate structure, allowing relatively tight interactions between neighboring clusters even when the two clusters are not siblings in the tree (e.g., blocks 8 and 9 in Figure 1c in the Supplement). However, the H\u00b2 format does not directly help with inverting K or computing its determinant: it is merely a memory-efficient way of storing K and performing matrix/vector multiplies inside an iterative method. HODLR matrices have a simpler structure, but admit a factorization that makes it possible to directly compute both the inverse and the determinant of the approximated matrix in just O(n log n) time.\n\nThe reason that hierarchical matrix approximations have not become more popular in machine learning so far is that in the case of high dimensional, unstructured data, finding the way to organize {x_1, . . . , x_n} into a single hierarchy is much more challenging than in the setting of regularly spaced points in R\u00b2 or R\u00b3, where these methods originate: 1. 
Hierarchical matrices require making hard assignments of data points to clusters, since the block structure at each level corresponds to partitioning the rows/columns of the original matrix. 2. The hierarchy must form a single tree, which puts deep divisions between clusters whose closest common ancestor is high up in the tree. 3. Finding the hierarchy in the first place is by no means trivial. Most works use a top-down strategy, which defeats the inherent parallelism of the matrix structure, and the actual algorithm used (kd-trees) is known to be problematic in high dimensions [27].\n\n3 Multiresolution Kernel Approximation\n\nOur goal in this paper is to develop a data adapted multiscale kernel matrix approximation method, Multiresolution Kernel Approximation (MKA), that reflects the \u201cdistant clusters only interact in a low rank fashion\u201d insight of the fast multipole method, but is considerably more flexible than existing hierarchical matrix decompositions. The basic building blocks of MKA are local factorizations of a specific form, which we call core-diagonal compression.\n\nDefinition 1 We say that a matrix H is c\u2013core-diagonal if H_{i,j} = 0 unless either i, j \u2264 c or i = j.\n\nDefinition 2 A c\u2013core-diagonal compression of a symmetric matrix A \u2208 R^{m\u00d7m} is an approximation of the form\n\nA \u2248 Q\u22a4 H Q,    (4)\n\nwhere Q is orthogonal and H is c\u2013core-diagonal.\n\nCore-diagonal compression is to be contrasted with rank c sketching, where H would just have the c\u00d7c block, without the rest of the diagonal. 
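A c\u2013core-diagonal compression in the sense of Definition 2 can be illustrated with a few lines of NumPy. For concreteness we build Q from a full eigenbasis of A; with such a dense Q the truncation loses essentially nothing (H is already diagonal), but Q is not sparse. The actual compressors the paper proposes (SPCA and MMF, Section 3) trade this exactness for sparse, cheaply computable Q; the function below is only a structural stand-in.

```python
import numpy as np

def core_diagonal_compress(A, c):
    """One c-core-diagonal compression A ~= Q^T H Q (Def. 2), with Q taken
    from an eigenbasis of A (illustrative stand-in for SPCA/MMF)."""
    w, V = np.linalg.eigh(A)
    Q = V[:, np.argsort(-np.abs(w))].T   # rows of Q: eigenvectors, dominant first
    H = Q @ A @ Q.T
    mask = np.zeros_like(H, dtype=bool)
    mask[:c, :c] = True                  # keep the c x c core ...
    np.fill_diagonal(mask, True)         # ... plus the remaining diagonal
    return Q, np.where(mask, H, 0.0)     # truncate to c-core-diagonal form
```

The first c rows of Q span the "scaling space" and the rest the "detail space" in the terminology used below.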
From our multiresolution inspired point of view, however, the purpose of (4) is not just to sketch A, but also to split R^m into the direct sum of two subspaces: (a) the \u201cdetail space\u201d, spanned by the last m\u2212c rows of Q, responsible for capturing purely local interactions in A, and (b) the \u201cscaling space\u201d, spanned by the first c rows, capturing the overall structure of A and its relationship to other diagonal blocks.\n\nHierarchical matrix methods apply low rank decompositions to many blocks of K in parallel, at different scales. MKA works similarly, by applying core-diagonal compressions. Specifically, the algorithm proceeds by taking K through a sequence of transformations K = K_0 \u21a6 K_1 \u21a6 . . . \u21a6 K_s, called stages. In the first stage:\n\n1. Similar to other local methods, MKA first uses a fast clustering method to cluster the rows/columns of K_0 into clusters C\u00b9_1, . . . , C\u00b9_{p_1}. Using the corresponding permutation matrix C_1 (which maps the elements of the first cluster to (1, 2, . . . , |C\u00b9_1|), the elements of the second cluster to (|C\u00b9_1|+1, . . . , |C\u00b9_1|+|C\u00b9_2|), and so on) we form a blocked matrix \u00afK_0 = C_1 K_0 C_1\u22a4, where \u27e6\u00afK_0\u27e7_{i,j} = K_{C\u00b9_i,C\u00b9_j}.\n\n2. Each diagonal block of \u00afK_0 is independently core-diagonally compressed as in (4) to yield\n\nH\u00b9_i = ( Q\u00b9_i \u27e6\u00afK_0\u27e7_{i,i} (Q\u00b9_i)\u22a4 )_{CD(c\u00b9_i)},    (5)\n\nwhere CD(c\u00b9_i) in the index stands for truncation to c\u00b9_i\u2013core-diagonal form.\n\n3. The Q\u00b9_i local rotations are assembled into a single large orthogonal matrix \u00afQ_1 = \u2295_i Q\u00b9_i and applied to the full matrix to give H_1 = \u00afQ_1 \u00afK_0 \u00afQ_1\u22a4.\n\n4. The rows/columns of H_1 are rearranged by applying a permutation P_1 that maps the core part of each block to one of the first c_1 := c\u00b9_1 + . . . + c\u00b9_{p_1} coordinates, and the diagonal part to the rest, giving H_1^{pre} = P_1 H_1 P_1\u22a4.\n\n5. Finally, H_1^{pre} is truncated into the core-diagonal form H_1 = K_1 \u2295 D_1, where K_1 \u2208 R^{c_1\u00d7c_1} is dense, while D_1 is diagonal. Effectively, K_1 is a compressed version of K_0, while D_1 is formed by concatenating the diagonal parts of each of the H\u00b9_i matrices. Together, this gives a global core-diagonal compression\n\nK_0 \u2248 C_1\u22a4 \u00afQ_1\u22a4 P_1\u22a4 (K_1 \u2295 D_1) P_1 \u00afQ_1 C_1 = Q_1\u22a4 (K_1 \u2295 D_1) Q_1,  with Q_1 = P_1 \u00afQ_1 C_1,\n\nof the entire original matrix K_0.\n\nThe second and further stages of MKA consist of applying the above five steps to K_1, K_2, . . . , K_{s\u22121} in turn, so ultimately the algorithm yields a kernel approximation \u02dcK which has the telescoping form\n\n\u02dcK = Q_1\u22a4(Q_2\u22a4(. . . Q_s\u22a4(K_s \u2295 D_s)Q_s . . . \u2295 D_2)Q_2 \u2295 D_1)Q_1.    (6)\n\nThe pseudocode of the full algorithm is in the Supplementary Material.\n\nMKA is really a meta-algorithm, in the sense that it can be used in conjunction with different core-diagonal compressors. The main requirements on the compressor are that (a) the core of H should capture the dominant part of A, in particular the subspace that most strongly interacts with other blocks, and (b) the first c rows of Q should be as sparse as possible. We consider two alternatives.\n\nAugmented Sparse PCA (SPCA). Sparse PCA algorithms explicitly set out to find a set of vectors {v_1, . . . , v_c} so as to maximize \u2016V\u22a4AV\u2016_Frob, where V = [v_1, . . . , v_c], while constraining each vector to be as sparse as possible [28]. While not all SPCAs guarantee orthogonality, this can be enforced a posteriori via, e.g., QR factorization, yielding Q_sc, the top c rows of Q in (4). 
Letting U be a basis for the complementary subspace, the optimal choice for the bottom m\u2212c rows in terms of minimizing the Frobenius norm error of the compression is Q_wlet = (U \u02c6O)\u22a4, where\n\n\u02c6O = argmax_{O\u22a4O=I} \u2016diag(O\u22a4 U\u22a4 A U O)\u2016,\n\nthe solution to which is of course given by the eigenvectors of U\u22a4AU. The main drawback of the SPCA approach is its computational cost: depending on the algorithm, the complexity of SPCA scales with m\u00b3 or worse [29, 30].\n\nMultiresolution Matrix Factorization (MMF). MMF is a recently introduced matrix factorization algorithm motivated by similar multiresolution ideas as the present work, but applied at the level of individual matrix entries rather than at the level of matrix blocks [31]. Specifically, MMF yields a factorization of the form\n\nA \u2248 q_1\u22a4 . . . q_L\u22a4 H q_L . . . q_1,  with Q = q_L . . . q_1,\n\nwhere, in the simplest case, the q_i\u2019s are just Givens rotations. Typically, the number of rotations in MMF is O(m). MMF is efficient to compute, and sparsity is guaranteed by the sparsity of the individual q_i\u2019s and the structure of the algorithm. Hence, MMF has complementary strengths to SPCA: it comes with strong bounds on sparsity and computation time, but the quality of the scaling/wavelet space split that it produces is less well controlled.\n\nRemarks. We make a few remarks about MKA. 1. Typically, low rank approximations reduce dimensionality quite aggressively. In contrast, in core-diagonal compression c is often on the order of m/2, leading to \u201cgentler\u201d and more faithful kernel approximations. 2. In hierarchical matrix methods, the block structure of the matrix is defined by a single tree, which, as discussed above, is potentially problematic. 
In contrast, by virtue of reclustering the rows/columns of K_\u2113 before every stage, MKA affords a more flexible factorization. In fact, beyond the first stage, it is not even individual data points that MKA clusters, but subspaces defined by the earlier local compressions. 3. While C_\u2113 and P_\u2113 are presented as explicit permutations, they really just correspond to different ways of blocking K_\u2113, which is done implicitly in practice with relatively little overhead. 4. Step 3 of the algorithm is critical, because it extends the core-diagonal splits found in the diagonal blocks of the matrix to the off-diagonal blocks. Essentially the same is done in [4] and [17]. This operation reflects a structural assumption about K, namely that the same bases that pick out the dominant parts of the diagonal blocks (composed of the first c^\u2113_i rows of the Q^\u2113_i rotations) are also good for compressing the off-diagonal blocks. In the hierarchical matrix literature, for the case of specific kernels sampled in specific ways in low dimensions, it is possible to prove such statements. In our high dimensional and less structured setting, deriving analytical results is much more challenging. 5. MKA is an inherently bottom-up algorithm, including the clustering, thus it is naturally parallelizable and can be implemented in a distributed environment. 6. The hierarchical structure of MKA is similar to that of the parallel version of MMF (pMMF) [32], but the way that the compressions are calculated is different (pMMF tries to minimize an objective that relates to the entire matrix).\n\n4 Complexity and application to GPs\n\nFor MKA to be effective for large scale GP regression, it must be possible to compute the factorization fast. In addition, the resulting approximation \u02dcK must be symmetric positive semi-definite (spsd) (MEKA, for example, fails to fulfill this [4]). 
We say that a matrix approximation algorithm A \u21a6 \u02dcA is spsd preserving if \u02dcA is spsd whenever A is. It is clear from its form that the Nystr\u00f6m approximation is spsd preserving, and so is augmented SPCA compression. MMF has different variants, but the core part of H is always derived by conjugating A by rotations, while the diagonal elements are guaranteed to be positive, therefore MMF is spsd preserving as well.\n\nProposition 1 If the individual core-diagonal compressions in MKA are spsd preserving, then the entire algorithm is spsd preserving.\n\nThe complexity of MKA depends on the complexity of the local compressions. Next, we assume that to leading order in m this cost is bounded by c_comp m^{\u03b1_comp} (with \u03b1_comp \u2265 1) and that each row of the Q matrix that is produced is c_sp\u2013sparse. We assume that the MKA has s stages, the size of the final K_s \u201ccore matrix\u201d is d_core \u00d7 d_core, and that the size of the largest cluster is m_max. We assume that the maximum number of clusters in any stage is b_max and that the clustering is close to balanced in the sense that b_max = \u0398(n/m_max) with a small constant. We ignore the cost of the clustering algorithm, which varies, but usually scales linearly in s n b_max. We also ignore the cost of permuting the rows/columns of K_\u2113, since this is a memory bound operation that can be virtualized away. The following results are to leading order in m_max and are similar to those in [32] for parallel MMF.\n\nProposition 2 With the above notations, the number of operations needed to compute the MKA of an n\u00d7n matrix is upper bounded by 2 s c_sp n\u00b2 + s c_comp m_max^{\u03b1_comp\u22121} n. Assuming b_max\u2013fold parallelism, this complexity reduces to 2 s c_sp n\u00b2/b_max + s c_comp m_max^{\u03b1_comp}.\n\nThe memory cost of MKA is just the cost of storing the various matrices appearing in (6). We only include the number of non-zero reals that need to be stored, and not indices, etc.\n\nProposition 3 The storage complexity of MKA is upper bounded by (s c_sp + 1) n + d_core\u00b2.\n\nRather than the general case, it is more informative to focus on MMF based MKA, which is what we use in our experiments. We consider the simplest case of MMF, referred to as \u201cgreedy-Jacobi\u201d MMF, in which each of the q_i elementary rotations is a Givens rotation. An additional parameter of this algorithm is the compression ratio \u03b3, which in our notation is equal to c/m. Some of the special features of this type of core-diagonal compression are: (a) While any given row of the rotation Q produced by the algorithm is not guaranteed to be sparse, Q will be the product of exactly \u230a(1\u2212\u03b3)m\u230b Givens rotations. (b) The leading term in the cost is the m\u00b3 cost of computing A\u22a4A, but this is a BLAS operation, so it is fast. (c) Once A\u22a4A has been computed, the cost of the rest of the compression scales with m\u00b2. Together, these features result in very fast core-diagonal compressions and a very compact representation of the kernel matrix.\n\nProposition 4 The complexity of computing the MMF-based MKA of an n\u00d7n dense matrix is upper bounded by 4 s n\u00b2 + s m_max\u00b2 n, where s = log(d_core/n)/log \u03b3. Assuming b_max\u2013fold parallelism, this is reduced to 4 s n m_max + m_max\u00b3.\n\nProposition 5 The storage complexity of MMF-based MKA is upper bounded by (2s + 1) n + d_core\u00b2. Typically, d_core = O(1). Note that this implies O(n log n) storage complexity, which is similar to Nystr\u00f6m approximations with very low rank.\n\nFinally, we have the following results, which are critical for using MKA in GPs.\n\nProposition 6 Given an approximate kernel \u02dcK in MMF-based MKA form (6), and a vector z \u2208 R^n, the product \u02dcKz can be computed in 4 s n + d_core\u00b2 operations. With b_max\u2013fold parallelism, this is reduced to 4 s m_max + d_core\u00b2.\n\nProposition 7 Given an approximate kernel \u02dcK in (MMF or SPCA-based) MKA form, the MKA form of \u02dcK^\u03b1 for any \u03b1 can be computed in O(n + d_core\u00b3) operations. The complexity of computing the matrix exponential exp(\u03b2\u02dcK) for any \u03b2 in MKA form and the complexity of computing det(\u02dcK) are also O(n + d_core\u00b3).\n\n4.1 MKA\u2013GPs and MKA Ridge Regression\n\nThe most direct way of applying MKA to speed up GP regression (or ridge regression) is simply to use it to approximate the augmented kernel matrix K\u2032 = (K + \u03c3\u00b2I) and then invert this approximation using Proposition 7 (with \u03b1 = \u22121). Note that the resulting \u02dcK\u2032\u22121 never needs to be evaluated fully, in matrix form. Instead, in equations such as (2), the matrix-vector product \u02dcK\u2032\u22121 y can be computed in \u201cmatrix-free\u201d form by cascading y through the analog of (6). 
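The matrix-free product of Proposition 6 can be sketched as follows. This is an illustrative reading of the telescoping form (6), not the paper's implementation: each stage is represented by its combined orthogonal matrix Q_l (with the permutation P_l folded in, so the core occupies the leading coordinates) and the diagonal entries D_l of the detail coordinates.

```python
import numpy as np

def mka_matvec(stages, K_core, z):
    """Compute K~ z for K~ = Q1^T(Q2^T(... Qs^T(Ks (+) Ds)Qs ... (+) D2)Q2 (+) D1)Q1,
    without ever forming K~. `stages` is a list of (Q_l, D_l); K_core is Ks."""
    tops = []
    for Q, D in stages:
        z = Q @ z
        c = len(z) - len(D)           # core size at this stage
        tops.append((Q, D, z[c:]))    # detail coordinates, rescaled on the way back
        z = z[:c]
    z = K_core @ z                    # dense multiply at the final core
    for Q, D, detail in reversed(tops):
        z = Q.T @ np.concatenate([z, D * detail])
    return z
```

Each stage does one orthogonal multiply down and one up, which is where the O(sn) term in Proposition 6 comes from when the Q_l are sparse.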
Assuming that d_core \u226a n and m_max is not too large, the serial complexity of each stage of this computation scales with at most n\u00b2, which is the same as the complexity of computing K in the first place.\n\nOne potential issue with the above approach, however, is that because MKA involves repeated truncation of the H_\u2113^{pre} matrices, \u02dcK\u2032 will be a biased approximation to K\u2032, therefore expressions such as (2), which mix an approximate K\u2032 with an exact k_x, will exhibit some systematic bias.\n\nFigure 1: Snelson\u2019s 1D example (panels: Full, SOR, FITC, PITC, MEKA, MKA): ground truth (black circles); prediction mean (solid line curves); one standard deviation in prediction uncertainty (dashed line curves).\n\nTable 1: Regression results, with k the number of pseudo-inputs/d_core: SMSE (MNLP).\n\nhousing (k=16): Full 0.36(\u22120.32); SOR 0.93(\u22120.03); FITC 0.91(\u22120.04); PITC 0.96(\u22120.02); MEKA 0.85(\u22120.08); MKA 0.52(\u22120.32)\nrupture (k=16): Full 0.17(\u22120.89); SOR 0.94(\u22120.04); FITC 0.96(\u22120.04); PITC 0.93(\u22120.05); MEKA 0.46(\u22120.18); MKA 0.32(\u22120.54)\nwine (k=32): Full 0.59(\u22120.33); SOR 0.86(\u22120.07); FITC 0.84(\u22120.03); PITC 0.87(\u22120.07); MEKA 0.97(\u22120.12); MKA 0.70(\u22120.23)\npageblocks (k=32): Full 0.44(\u22121.10); SOR 0.86(\u22120.57); FITC 0.81(\u22120.78); PITC 0.86(\u22120.72); MEKA 0.96(\u22120.10); MKA 0.63(\u22120.85)\ncompAct (k=32): Full 0.58(\u22120.66); SOR 0.88(\u22120.13); FITC 0.91(\u22120.08); PITC 0.88(\u22120.14); MEKA 0.75(\u22120.21); MKA 0.60(\u22120.32)\npendigit (k=64): Full 0.15(\u22120.73); SOR 0.65(\u22120.19); FITC 0.70(\u22120.17); PITC 0.71(\u22120.17); MEKA 0.53(\u22120.29); MKA 0.30(\u22120.42)\n\n
In Nystr\u00f6m type methods (specifically, the so-called Subset of Regressors and Deterministic Training Conditional GP approximations) this problem is addressed by replacing k_x with its own Nystr\u00f6m approximation, \u02c6k_x = K_{\u2217,I} W\u207a k_x^I, where [k_x^I]_j = k(x, x_{i_j}). Although \u02c6K\u2032 = K_{\u2217,I} W\u207a K_{\u2217,I}\u22a4 + \u03c3\u00b2I is a large matrix, expressions such as \u02c6k_x\u22a4 \u02c6K\u2032\u22121 y can nonetheless be efficiently evaluated by using a variant of the Sherman\u2013Morrison\u2013Woodbury identity and the fact that W is low rank (see [33]).\n\nThe same approach cannot be applied to MKA because \u02dcK is not low rank. Assuming that the testing set {x\u2032_1, . . . , x\u2032_p} is known at training time, however, instead of approximating K or K\u2032, we compute the MKA approximation \u02dcK of the joint train/test kernel matrix\n\n\u00afK = [ K, K_\u2217 ; K_\u2217\u22a4, K_test ],  where K_{i,j} = k(x_i, x_j) + \u03c3\u00b2\u03b4_{i,j},  [K_\u2217]_{i,j} = k(x_i, x\u2032_j),  [K_test]_{i,j} = k(x\u2032_i, x\u2032_j).\n\nWriting its inverse in blocked form\n\n\u02dcK\u22121 = [ A, B ; C, D ],\n\nand taking the Schur complement of D now recovers an alternative approximation \u02c7K\u22121 = A \u2212 B D\u22121 C to K\u22121, which is consistent with the off-diagonal block K_\u2217, leading to our final MKA\u2013GP formula \u02c6f = K_\u2217\u22a4 \u02c7K\u22121 y, where \u02c6f = (\u02c6f(x\u2032_1), . . . , \u02c6f(x\u2032_p))\u22a4. 
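The Schur-complement construction above can be sketched in a few lines. This is only an illustration (function name ours): here the exact inverse of the joint matrix stands in for the MKA approximation of it, in which case the blockwise-inversion identity A \u2212 B D\u207b\u00b9 C = (K + \u03c3\u00b2I)\u207b\u00b9 makes the prediction coincide with the standard GP mean.

```python
import numpy as np

def mka_gp_predict(K_joint, n, y):
    """Invert the (approximate) joint train/test kernel matrix, block its
    inverse at n as [[A, B], [C, D]], form Kcheck^{-1} = A - B D^{-1} C,
    and predict f_hat = K_*^T Kcheck^{-1} y."""
    Kinv = np.linalg.inv(K_joint)         # stand-in for the MKA inverse
    A, B = Kinv[:n, :n], Kinv[:n, n:]
    C, D = Kinv[n:, :n], Kinv[n:, n:]
    Kcheck_inv = A - B @ np.linalg.solve(D, C)
    K_star = K_joint[:n, n:]              # train/test cross-covariances
    return K_star.T @ Kcheck_inv @ y
```

With the MKA inverse substituted for `np.linalg.inv`, inverting the small p\u00d7p block D is the only extra work, matching the remark that its cost is negligible for p \u226a n.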
While conceptually this is somewhat more involved than naively estimating $K'$, assuming $p \ll n$, the cost of inverting $D$ is negligible, and the overall serial complexity of the algorithm remains $O((n+p)^2)$.
In certain GP applications, the $O(n^2)$ cost of merely writing down the kernel matrix is already forbidding. The one circumstance under which MKA can get around this problem is when the kernel matrix is a matrix polynomial in a sparse matrix $L$, which is most notably the case for diffusion kernels and certain other graph kernels. Specifically, in the case of MMF-based MKA, the computational cost is dominated by computing local "Gram matrices" $A^\top A$; when $L$ is sparse, and this sparsity is retained from one compression stage to the next, the MKA of sparse matrices can be computed very fast. In the case of graph Laplacians, empirically, the complexity is close to linear in $n$. By Proposition 7, the diffusion kernel and certain other graph kernels can then also be approximated in about $O(n \log n)$ time.

5 Experiments

We compare MKA to five other methods: 1. Full: the full GP regression using Cholesky factorization [1]. 2. SOR: the Subset of Regressors method (also equivalent to DTC in mean) [1]. 3. FITC: the Fully Independent Training Conditional approximation, also called Sparse Gaussian Processes using Pseudo-inputs [34]. 4. PITC: the Partially Independent Training Conditional approximation method (also equivalent to PTC in mean) [33]. 5. MEKA: the Memory Efficient Kernel Approximation method [4]. KISS-GP [35] and other interpolation based methods are not discussed in this paper because, we believe, they mostly apply only to low dimensional settings. We used custom Matlab implementations [1] for Full, SOR, FITC, and PITC.
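Before turning to the experiments, the sparse-polynomial structure discussed above can be illustrated in a few lines: a diffusion kernel $\exp(-\beta L)$ can be applied to a vector using only sparse matrix–vector products, for example via a truncated Taylor series. This is our own minimal sketch (the path graph, $\beta$, and truncation order are illustrative choices, not part of the paper's MMF pipeline):

```python
import numpy as np
import scipy.sparse as sp

def diffusion_matvec(L, v, beta=0.2, order=10):
    # Apply the diffusion kernel K = exp(-beta * L) to a vector v via
    # a truncated Taylor series: K v ~ sum_k (-beta L)^k v / k!.
    # Each term costs one sparse mat-vec, so the total cost is
    # O(order * nnz(L)) instead of O(n^2).
    result = v.copy()
    term = v.copy()
    for k in range(1, order + 1):
        term = (-beta / k) * (L @ term)
        result = result + term
    return result

# Laplacian of a path graph on n vertices, stored sparsely.
n = 200
main = np.full(n, 2.0)
main[0] = main[-1] = 1.0
off = -np.ones(n - 1)
L = sp.diags([off, main, off], [-1, 0, 1], format="csr")
```

Since the eigenvalues of this Laplacian lie in $[0, 4]$, the matrix $\beta L$ has small norm here and ten Taylor terms already match a dense matrix exponential to high accuracy.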
We used the Matlab code provided by the author for MEKA. Our algorithm MKA was implemented in C++ with a Matlab interface. To get an approximately fair comparison, we set $d_{\mathrm{core}}$ in MKA to be the number of pseudo-inputs. The parallel MMF algorithm was used as the compressor due to its computational efficiency [32]. The Gaussian kernel was used in all experiments, with a single length scale shared across all input dimensions.

Figure 2 (panels: housing, housing, rupture, rupture): SMSE and MNLP as a function of the number of pseudo-inputs/$d_{\mathrm{core}}$ on two datasets. In the given range MKA clearly outperforms the other methods in both error measures.

Qualitative results. We show the qualitative behavior of each method on the 1D toy dataset from [34]. We sampled the ground truth from a Gaussian process with length scale $\ell = 0.5$; the number of pseudo-inputs ($d_{\mathrm{core}}$) was 10. We applied cross-validation to select the parameters with which each method fits the data. Figure 1 shows that MKA fits the data almost as well as the full GP does. As for the other approximate methods, although their fit to the data is smoother, this comes at the cost of missing the local structure of the underlying data, which verifies MKA's ability to capture the entire spectrum of the kernel matrix, not just its top eigenvectors.
Real data. We tested the efficacy of GP regression on real-world datasets. The data are normalized to mean zero and variance one.
We randomly selected 10% of each dataset to be used as a test set. On the other 90% we did five-fold cross-validation to learn the length scale and noise parameter for each method, and the regression results were averaged over five repetitions of this setting. All experiments were run on a 3.4GHz 8-core machine with 8GB of memory. Two distinct error measures are used to assess performance: (a) the standardized mean square error (SMSE), $\frac{1}{n}\sum_{t=1}^{n}(\hat{y}_t - y_t)^2/\hat{\sigma}^2_\star$, where $\hat{\sigma}^2_\star$ is the variance of the test outputs, and (b) the mean negative log probability (MNLP), $\frac{1}{n}\sum_{t=1}^{n}\big((\hat{y}_t - y_t)^2/\hat{\sigma}^2_t + \log \hat{\sigma}^2_t + \log 2\pi\big)$, where $\hat{\sigma}^2_t$ is the predictive variance at the $t$-th test point; the two measures assess the quality of the predictive mean and of the predictive variance, respectively. From Table 1, MKA is competitive in both error measures when the number of pseudo-inputs ($d_{\mathrm{core}}$) is small, which reveals the low-rank methods' inability to capture the local structure of the data. We also illustrate the sensitivity of performance to the number of pseudo-inputs on selected datasets. In Figure 2, over the interval of pseudo-inputs considered, MKA's performance is robust to $d_{\mathrm{core}}$, while the performance of the low-rank based methods changes rapidly, which shows MKA's ability to achieve good regression results even at severe compression levels. The Supplementary Material gives a more detailed discussion of the datasets and experiments.

6 Conclusions

In this paper we made the case that whether a learning problem is low rank or not depends on the nature of the data rather than just the spectral properties of the kernel matrix $K$. This is easiest to see in the case of Gaussian Processes, which is the algorithm that we focused on in this paper, but it is also true more generally.
Most existing sketching algorithms used in GP regression force low rank structure on $K$, either globally or at the block level. When the nature of the problem is indeed low rank, this might actually act as an additional regularizer and improve performance. When the data does not have low rank structure, however, low rank approximations will fail. Inspired by recent work on multiresolution factorizations, we proposed a multiresolution meta-algorithm, MKA, for approximating kernel matrices, which assumes that the interaction between distant clusters is low rank, while avoiding forcing a low rank structure on the data locally, at any scale. Importantly, MKA allows fast direct calculation of the inverse of the kernel matrix and of its determinant, which are almost always the computational bottlenecks in GP problems.

Acknowledgements

This work was completed in part with resources provided by the University of Chicago Research Computing Center. The authors wish to thank Michael Stein for helpful suggestions.

References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[2] Michael L. Stein. Statistical Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
[3] Christopher Williams and Matthias Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems 13, 2001.
[4] Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory Efficient Kernel Approximation.
In ICML, 2014.
[5] Ali Rahimi and Benjamin Recht. Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning. In NIPS, 2008.
[6] Alex J. Smola and Bernhard Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 911–918, 2000.
[7] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral Grouping Using the Nyström Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
[8] P. Drineas and M. W. Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
[9] Rong Jin, Tianbao Yang, Mehrdad Mahdavi, Yu-Feng Li, and Zhi-Hua Zhou. Improved Bounds for the Nyström Method with Application to Kernel Classification. IEEE Transactions on Information Theory, 2013.
[10] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström Method for Improved Large-Scale Machine Learning. In ICML, 28:567–575, 2013.
[11] Shiliang Sun, Jing Zhao, and Jiang Zhu. A Review of Nyström Methods for Large-Scale Machine Learning. Information Fusion, 26:36–48, 2015.
[12] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström Method. In NIPS, 2009.
[13] Shusen Wang. Efficient Algorithms and Error Analysis for the Modified Nyström Method. In AISTATS, 2014.
[14] Amine Abou-Rjeili and George Karypis. Multilevel Algorithms for Partitioning Power-Law Graphs. In Proceedings of the 20th International Conference on Parallel and Distributed Processing, 2006.
[15] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.
[16] Berkant Savas, Inderjit Dhillon, et al. Clustered Low-Rank Approximation of Graphs in Information Science Applications. In Proceedings of the SIAM International Conference on Data Mining, 2011.
[17] Ruoxi Wang, Yingzhou Li, Michael W. Mahoney, and Eric Darve. Structured Block Basis Factorization for Scalable Kernel Matrix Evaluation. arXiv preprint arXiv:1505.00398, 2015.
[18] Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, and David Woodruff. Improved Distributed Principal Component Analysis. In NIPS, pages 3113–3121, 2014.
[19] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and Conquer Kernel Ridge Regression. In Conference on Learning Theory, 30:1–26, 2013.
[20] William K. Allard, Guangliang Chen, and Mauro Maggioni. Multi-scale Geometric Methods for Data Sets II: Geometric Multi-resolution Analysis. Applied and Computational Harmonic Analysis, 2012.
[21] W. Hackbusch. A Sparse Matrix Arithmetic Based on H-Matrices. Part I: Introduction to H-Matrices. Computing, 62:89–108, 1999.
[22] Wolfgang Hackbusch, Boris Khoromskij, and Stefan A. Sauter. On H2-Matrices. Lectures on Applied Mathematics, pages 9–29, 2000.
[23] S. Chandrasekaran, M. Gu, and W. Lyons. A Fast Adaptive Solver for Hierarchically Semi-separable Representations. Calcolo, 42(3-4):171–185, 2005.
[24] L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. J. Comput. Phys., 1987.
[25] Steffen Börm and Jochen Garcke. Approximating Gaussian Processes with H2-Matrices. In ECML, 2007.
[26] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W. Hogg, and Michael O'Neil. Fast Direct Methods for Gaussian Processes. arXiv:1403.6015v2, April 2015.
[27] Nazneen Rajani, Kate McArdle, and Inderjit S. Dhillon.
Parallel k-Nearest Neighbor Graph Construction Using Tree-based Data Structures. In 1st High Performance Graph Mining Workshop, 2015.
[28] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2004.
[29] Q. Berthet and P. Rigollet. Complexity Theoretic Lower Bounds for Sparse Principal Component Detection. Journal of Machine Learning Research (COLT), 30:1046–1066, 2013.
[30] Volodymyr Kuleshov. Fast Algorithms for Sparse Principal Component Analysis Based on Rayleigh Quotient Iteration. In ICML, pages 1418–1425, 2013.
[31] Risi Kondor, Nedelina Teneva, and Vikas Garg. Multiresolution Matrix Factorization. In ICML, 2014.
[32] Nedelina Teneva, Pramod K. Mudrakarta, and Risi Kondor. Multiresolution Matrix Compression. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS-16), 2016.
[33] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[34] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian Processes Using Pseudo-inputs. In NIPS, 2005.
[35] Andrew Gordon Wilson and Hannes Nickisch. Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In ICML, pages 1775–1784, 2015.