{"title": "Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 2463, "page_last": 2472, "abstract": "Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results. We provide provable guarantees for deterministic column sampling using ridge leverage scores. The matrix sketch returned by our algorithm is a column subset of the original matrix, yielding additional interpretability. Like the randomized counterparts, the deterministic algorithm provides $(1+\\epsilon)$ error column subset selection, $(1+\\epsilon)$ error projection-cost preservation, and an additive-multiplicative spectral bound. We also show that under the assumption of power-law decay of ridge leverage scores, this deterministic algorithm is provably as accurate as randomized algorithms. Lastly, ridge regression is frequently used to regularize ill-posed linear least-squares problems. While ridge regression provides shrinkage for the regression coefficients, many of the coefficients remain small but non-zero. Performing ridge regression with the matrix sketch returned by our algorithm and a particular regularization parameter forces coefficients to zero and has a provable $(1+\\epsilon)$ bound on the statistical risk. As such, it is an interesting alternative to elastic net regularization.", "full_text": "Ridge Regression and Provable Deterministic Ridge\n\nLeverage Score Sampling\n\nShannon R. 
McCurdy

California Institute for Quantitative Biosciences
UC Berkeley
Berkeley, CA 94720
smccurdy@berkeley.edu

Abstract

Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results. We provide provable guarantees for deterministic column sampling using ridge leverage scores. The matrix sketch returned by our algorithm is a column subset of the original matrix, yielding additional interpretability. Like the randomized counterparts, the deterministic algorithm provides $(1+\epsilon)$ error column subset selection, $(1+\epsilon)$ error projection-cost preservation, and an additive-multiplicative spectral bound. We also show that under the assumption of power-law decay of ridge leverage scores, this deterministic algorithm is provably as accurate as randomized algorithms. Lastly, ridge regression is frequently used to regularize ill-posed linear least-squares problems. While ridge regression provides shrinkage for the regression coefficients, many of the coefficients remain small but non-zero. Performing ridge regression with the matrix sketch returned by our algorithm and a particular regularization parameter forces coefficients to zero and has a provable $(1+\epsilon)$ bound on the statistical risk. As such, it is an interesting alternative to elastic net regularization.

1 Introduction

Classical leverage scores quantify the importance of each column $i$ for the range space of the sample-by-feature data matrix $A \in \mathbb{R}^{n \times d}$.
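To make this definition concrete, classical column leverage scores can be computed directly from the SVD; the identity $\tau_i = a_i^T (AA^T)^+ a_i = ||v_i||_2^2$ (the squared norm of the $i$th row of $V$) is standard, while the code below, with its toy matrix and variable names, is our own illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 20))   # n = 8 samples, d = 20 features

# Classical leverage of column i: tau_i = a_i^T (A A^T)^+ a_i.
# With the SVD A = U S V^T this is the squared norm of the i-th row of V.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tau = np.sum(Vt ** 2, axis=0)
```

Each score lies in $[0, 1]$, and the scores sum to $\mathrm{rank}(A)$, which is why large scores flag columns that the range space cannot do without.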
Classical leverage scores have been used in regression diagnostics, outlier detection, and randomized matrix algorithms (Velleman and Welsch, 1981; Chatterjee and Hadi, 1986; Drineas et al., 2008). Historically, leverage scores were used to select informative samples (rows, in our matrix orientation). More recently, as datasets with $d > n$ have become more common, leverage scores have been used to select informative features (columns, in our matrix orientation). There are many different flavors of leverage scores, and we will focus on a variation called ridge leverage scores. However, to appreciate the advantages of ridge leverage scores, we also briefly review classical and rank-$k$ subspace leverage scores.

Ridge leverage scores were introduced by Alaoui and Mahoney (2015) to give statistical bounds for the Nyström approximation for kernel ridge regression. Alaoui and Mahoney (2015) argue that ridge leverage scores provide the relevant notion of leverage in the context of kernel ridge regression. Ridge leverage scores have been successfully used in kernel ridge regression to approximate the symmetric kernel matrix ($\in \mathbb{R}^{n \times n}$) by selecting informative samples (Alaoui and Mahoney, 2015; Rudi et al., 2015). Cohen et al. (2017) provide a definition of ridge leverage scores for selecting informative features from the non-symmetric sample-by-feature data matrix $A \in \mathbb{R}^{n \times d}$.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The ridge leverage score $\bar\tau_i(A)$ for the $i$th column of $A$ is

$$\bar\tau_i(A) = a_i^T \left(AA^T + \lambda^2 I\right)^+ a_i, \qquad (1)$$

where the $i$th column of $A$ is an $(n \times 1)$-vector denoted by $a_i$, $M^+$ denotes the Moore-Penrose pseudoinverse of $M$, and $\lambda^2$ is the regularization parameter. We will always choose $\lambda^2 = \frac{1}{k}||A - A_k||_F^2$, where $A_k$ is the rank-$k$ SVD approximation to $A$, defined in Sec. 1.2, because this choice of regularization parameter gives the stated guarantees. In contrast to ridge leverage scores, the rank-$k$ subspace leverage score $\tau_i(A_k)$ is

$$\tau_i(A_k) = a_i^T \left(A_k A_k^T\right)^+ a_i. \qquad (2)$$

The classical leverage score is the ridge leverage score (and also the rank-$k$ subspace leverage score) evaluated at $k = \mathrm{rank}(A) = r \le n$.

Ridge leverage scores and rank-$k$ subspace leverage scores take two different approaches to mitigating the small singular value components of $AA^T$ in classical leverage scores. Ridge leverage scores diminish the importance of small principal components through regularization, whereas rank-$k$ subspace leverage scores omit the small principal components entirely. Cohen et al. (2017) argue that regularization is a more natural and stable alternative to omission. For randomized algorithms with ridge leverage score sampling, Cohen et al. (2017) prove bounds for the spectrum, column subset selection, and projection-cost preservation (counterparts to our Theorems 1, 2, and 3 for deterministic ridge leverage scores, respectively). The first and the last bounds hold for a weighted column subset of the full data matrix. These bounds require $O(k \log(k/\delta)/\epsilon^2)$ columns, where $\delta$ is the failure probability and $\epsilon$ is the error.

In the "big data" era, much attention has been paid to randomized algorithms due to improved algorithm performance and ease of generalization to the streaming setting. However, for moderately big data (i.e., the feature set is too large for inspection by humans, but algorithm performance is not a limitation), deterministic algorithms provide more interpretability to the practitioner than randomized algorithms, since they always provide the same results and have no failure probability. The usefulness of deterministic algorithms has already been recognized. Papailiopoulos et al.
(2014) introduce a deterministic algorithm for sampling columns from rank-$k$ subspace leverage scores and provide a column subset selection bound (the counterpart to our Theorem 2 for deterministic ridge leverage scores). McCurdy et al. (2017) prove a $(1+\epsilon)$ spectral bound for Papailiopoulos et al. (2014)'s deterministic algorithm and for random sampling with rank-$k$ subspace leverage scores (the counterpart to our Theorem 1 for deterministic ridge leverage scores). One major drawback of using the rank-$k$ subspace leverage scores is that their relative spectral bound is limited to the rank-$k$ subspace projection of the column subset matrix $C$ and the full data matrix $A$, so to get a relative spectral bound on the complete subspace requires $k = n$. A consequence of this is that projection-cost preservation also requires $k = n$ (the counterpart to our Theorem 3). One advantage of using deterministic rather than randomized rank-$k$ subspace leverage score algorithms is that, under the condition of power-law decay in the sorted rank-$k$ subspace leverage scores, the deterministic algorithm chooses fewer columns than random sampling with the same error for the column subset selection bound when $\max\left((2k/\epsilon)^{\frac{1}{a}} - 1, (2k/((a-1)\epsilon))^{\frac{1}{a-1}} - 1, k\right) < Ck\log(k/\delta)/\epsilon^2$, where $a$ is the decay power and $C$ is an absolute constant (Papailiopoulos et al., 2014) (this is the counterpart to our Theorem 5). In addition, Papailiopoulos et al. (2014) show that many real datasets display power-law decay in the sorted rank-$k$ subspace leverage scores, illustrating the deterministic algorithm's real-world utility.

Ridge regression (Hoerl and Kennard, 1970) is a commonly used method to regularize ill-posed linear least-squares problems.
The ridge regression minimization problem is, for outcome $y \in \mathbb{R}^n$, features $A \in \mathbb{R}^{n \times d}$, and coefficients $x \in \mathbb{R}^d$,

$$\hat{x}_A = \operatorname*{argmin}_x \left(||y - Ax||_2^2 + \lambda^2 ||x||_2^2\right) = \left(A^T A + \lambda^2 I\right)^{-1} A^T y, \qquad (3)$$

where the regularization parameter $\lambda^2$ penalizes the size of the coefficients in the minimization problem. We will always choose $\lambda^2 = \frac{1}{k}||A - A_k||_F^2$ for ridge regression with matrix $A$.

In ridge regression, the underlying statistical model for data generation is

$$y = y^* + \sigma^2 \xi, \qquad (4)$$

where $y^* = Ax^*$ is a deterministic linear function of the fixed design features $A$ and $\xi \sim N(0, I)$ is the random error. The mean squared error is a measure of the statistical risk $R(\hat{y})$ for the squared error loss function and estimator $\hat{y}$:

$$R(\hat{y}) = \frac{1}{n}\, \mathbb{E}_\xi\left[||\hat{y} - y^*||_2^2\right]. \qquad (5)$$

Ridge regression is often chosen over regression subset selection procedures for regularization because, as a continuous shrinkage method, it exhibits lower variability (Breiman, 1996). However, many ridge regression coefficients can be small but non-zero, leading to a lack of interpretability for moderately big data ($d > n$). The lasso method (Tibshirani, 1994) provides continuous shrinkage and automatic feature selection using an $L_1$ penalty function instead of the $L_2$ penalty function of ridge regression, but in the $d > n$ case, lasso saturates at $n$ features. The elastic net algorithm combines lasso ($L_1$ penalty function) and ridge regression ($L_2$ penalty function) for continuous shrinkage and automatic feature selection (Zou and Hastie, 2005).

1.1 Contributions

We explore deterministic ridge leverage score (DRLS) sampling for matrix approximation and for feature selection in concert with ridge regression.
This work has two main motivations: (1) the advantages of ridge leverage scores over rank-$k$ subspace leverage scores, and (2) the advantages of deterministic algorithms in some practical settings. This work complements Papailiopoulos et al. (2014), who considered deterministic rank-$k$ subspace leverage sampling and experiments on real data, but did not consider DRLS sampling or uses beyond matrix approximation. This work also complements Cohen et al. (2017), who considered randomized RLS sampling but did not consider DRLS sampling, the uses of RLS sampling beyond matrix approximation (e.g., ridge regression), or experiments on real data.

We introduce a deterministic algorithm (Algorithm 1) for ridge leverage score sampling inspired by the deterministic algorithm for rank-$k$ subspace leverage score sampling (Papailiopoulos et al., 2014). By using ridge leverage scores instead of rank-$k$ subspace scores in the deterministic algorithm, we prove significantly better bounds for the column subset matrix $C$ (see Table 1 for a comparison). We prove that the same additive-multiplicative spectral bound (Theorem 1), $(1+\epsilon)$ column subset selection (Theorem 2), and $(1+\epsilon)$ projection-cost preservation (Theorem 3) hold for DRLS column sampling as for random sampling as in Cohen et al. (2017). We show that under the condition of power-law decay in the ridge leverage scores, the deterministic algorithm chooses fewer columns than random sampling with the same error when $\max\left((4k/\epsilon)^{\frac{1}{a}} - 1, (4k/((a-1)\epsilon))^{\frac{1}{a-1}} - 1, k\right) < Ck\log(k/\delta)/\epsilon^2$, where $a$ is the decay power and $C$ is an absolute constant (Theorem 5).

We combine deterministic ridge leverage score column subset selection with ridge regression for a particular value of the regularization parameter, providing automatic feature selection and continuous shrinkage. This procedure has a provable $(1+\epsilon)$ bound on the statistical risk (Theorem 4).
The proof techniques are such that a $(1+\epsilon)$ bound on the statistical risk also holds for randomized ridge leverage score sampling. Our ridge regression theorem is novel for both deterministic and randomized sampling with ridge leverage scores (as far as we know, this has never been considered for any leverage score), another demonstrable advance of the state of the art, and one of our main results.

We also provide a proof-of-concept illustration on real biological data, with figures included in the Supplementary Materials. Our real-data illustration makes a strong case for the empirical usefulness of the DRLS algorithm and bounds. The real data exhibits striking power-law decay of the ridge leverage scores (Figure 7), justifying the assumptions underlying the use of DRLS sampling (Theorem 5).

Our work is triply beneficial from the interpretability standpoint; it is deterministic, it chooses a subset of representative columns, and it comes with four desirable error guarantees for all rank $k$, three of which stem from the naturalness of the low-rank ridge regularization.

1.2 Notation

The singular value decomposition (SVD) of any complex matrix $A$ is $A = U\Sigma V^\dagger$, where $U$ and $V$ are square unitary matrices ($U^\dagger U = UU^\dagger = I$, $V^\dagger V = VV^\dagger = I$) and $\Sigma$ is a rectangular diagonal matrix with real, non-negative, non-increasingly ordered entries. $U^\dagger$ is the complex conjugate transpose of $U$, and $I$ is the identity matrix. The diagonal elements of $\Sigma$ are called the singular values, and they are the positive square roots of the eigenvalues of both $AA^\dagger$ and $A^\dagger A$, which have eigenvectors $U$ and $V$, respectively. $U$ and $V$ are the left and right singular vectors of $A$.

Table 1: Comparison of deterministic ridge and rank-k subspace leverage score theorems.

Deterministic Sampling Algorithm     | Rank-k Subspace (Papailiopoulos et al., 2014) | Rank-k Ridge (Algorithm 1)
Spectral Bound for CC^T              | Multiplicative, k = n (McCurdy et al., 2017)  | Additive-Multiplicative, all k (Theorem 1)
Column Subset Selection              | all k (Papailiopoulos et al., 2014)           | all k (Theorem 2)
Rank-k Projection Cost Preservation  | k = n                                         | all k (Theorem 3)
Approximate Ridge Regression Risk    | N/A                                           | all k (Theorem 4)
Leverage Power-law Decay             | all k (Papailiopoulos et al., 2014)           | all k (Theorem 5)

Defining $U_k$ as the first $k$ columns of $U$, and analogously for $V$, and $\Sigma_k$ as the square diagonal matrix with the first $k$ entries of $\Sigma$, then $A_k = U_k \Sigma_k V_k^\dagger$ is the rank-$k$ SVD approximation to $A$. Furthermore, we refer to the matrices with only the last $n-k$ columns of $U$ and $V$ and the last $n-k$ entries of $\Sigma$ as $U_{\backslash k}$, $V_{\backslash k}$, and $\Sigma_{\backslash k}$. The Moore-Penrose pseudoinverse of a rank-$k$ matrix $A$ is given by $A^+ = V_k \Sigma_k^{-1} U_k^\dagger$. The Frobenius norm $||A||_F$ of a matrix $A$ is given by $||A||_F^2 = \mathrm{tr}(AA^\dagger)$. The spectral norm $||A||_2$ of a matrix $A$ is given by the largest singular value of $A$.

2 Deterministic Ridge Leverage Score (DRLS) Column Sampling

2.1 The DRLS Algorithm

Algorithm 1. The DRLS algorithm selects for the submatrix $C$ all columns $i$ with ridge leverage score $\bar\tau_i(A)$ above a threshold $\theta$, determined by the error tolerance $\epsilon$. This algorithm is deeply indebted to the deterministic algorithm of Papailiopoulos et al. (2014). It substitutes ridge leverage scores for rank-$k$ subspace scores, and has a different stopping parameter. The algorithm is as follows.

1. Choose the error tolerance, $\epsilon$.
2. For every column $i$, calculate the ridge leverage score $\bar\tau_i(A)$ (Eqn. 1).
3. Sort the columns by $\bar\tau_i(A)$, from largest to smallest. The sorted column indices are $\pi_i$.
4. Define an empty set $\Theta = \{\}$. Starting with the largest sorted column index $\pi_0$, add the corresponding column index $i$ to the set $\Theta$, in decreasing order, until

$$\sum_{i \in \Theta} \bar\tau_i(A) > \bar{t} - \epsilon, \qquad (6)$$

and then stop. Note that $\bar{t} = \sum_{i=1}^d \bar\tau_i(A) \le 2k$ (see Sec. 1.2 for proof). It will be useful to define $\tilde\epsilon = \sum_{i \notin \Theta} \bar\tau_i(A)$. Eqn. 6 can equivalently be written as $\epsilon > \tilde\epsilon$.

5. If the set size $|\Theta| < k$, continue adding columns in decreasing order until $|\Theta| = k$.
6. The leverage score $\bar\tau_i(A)$ of the last column $i$ included in $\Theta$ defines the leverage score threshold $\theta$.
7. Introduce a rectangular selection matrix $S$ of size $d \times |\Theta|$. If the column indexed by $(i, \pi_i)$ is in $\Theta$, then $S_{i,\pi_i} = 1$; $S_{i,\pi_i} = 0$ otherwise. The DRLS submatrix is $C = AS$.

Note that when the ridge leverage scores on either side of the threshold are not equal, the algorithm returns a unique solution. Otherwise, there are as many solutions as there are columns with equal ridge leverage scores at the threshold. Algorithm 1 requires $O(\min(d, n)\,nd)$ arithmetic operations.

3 Approximation Guarantees

3.1 Bounds for DRLS

We derive a new additive-multiplicative spectral approximation bound (Eqn. 7) for the square of the submatrix $C$ selected with DRLS.

Theorem 1. Additive-Multiplicative Spectral Bound: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies

$$(1 - \epsilon)AA^T - \frac{\epsilon}{k}||A_{\backslash k}||_F^2\, I \preceq CC^T \preceq AA^T. \qquad (7)$$

The symbol $\preceq$ denotes the Loewner partial ordering, which is reviewed in Sec. 1.1 (see Horn and Johnson (2013) for a thorough discussion).

Conceptually, the Loewner ordering in Eqn.
7 is the generalization of the ordering of real numbers (e.g., $1 < 1.5$) to Hermitian matrices. Statements of Loewner ordering are quite powerful; important consequences include inequalities for the eigenvalues. We will use Eqn. 7 to prove Theorems 2, 3, and 4. Note that our additive-multiplicative bound holds for an un-weighted column subset of $A$.

Theorem 2. Column Subset Selection: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies

$$||A - CC^+A||_F^2 \le ||A - \Pi^F_{C,k}(A)||_F^2 \le (1 + 4\epsilon)||A_{\backslash k}||_F^2, \qquad (8)$$

with $0 < \epsilon < \frac{1}{4}$, and where $\Pi^F_{C,k}(A) = (CC^+A)_k$ is the best rank-$k$ approximation to $A$ in the column space of $C$ with respect to the Frobenius norm.

Column subset selection algorithms are widely used for feature selection for high-dimensional data, since the aim of the column subset selection problem is to find a small number of columns of $A$ that approximate the column space nearly as well as the top $k$ singular vectors.

Theorem 3. Rank-k Projection-Cost Preservation: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Then $C$ satisfies, for any rank-$k$ orthogonal projection $X \in \mathbb{R}^{n \times n}$,

$$(1 - \epsilon)||A - XA||_F^2 \le ||C - XC||_F^2 \le ||A - XA||_F^2. \qquad (9)$$

To simplify the bookkeeping, we prove the lower bound of Theorem 3 with $(1 - \alpha\epsilon)$ error ($\alpha = 2(2 + \sqrt{2})$), and assume $0 < \epsilon < \frac{1}{2}$.

Projection-cost preservation bounds were formalized recently in Feldman et al. (2013); Cohen et al. (2015). Bounds of this type are important because they mean that low-rank projection problems can be solved with $C$ instead of $A$ while maintaining the projection cost.
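For intuition, Algorithm 1 (Sec. 2.1) and the upper bound of Theorem 3 can be exercised on synthetic data. The sketch below is ours, not the released implementation; `drls_select` follows steps 1-5 of Algorithm 1 and ignores ties at the threshold:

```python
import numpy as np

def ridge_leverage_scores(A, k):
    """Ridge leverage scores (Eqn. 1) with lambda^2 = ||A - A_k||_F^2 / k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam2 = np.sum(s[k:] ** 2) / k
    # tau_i = sum_j [s_j^2 / (s_j^2 + lam2)] * (V^T)_{ji}^2
    return ((s ** 2 / (s ** 2 + lam2))[:, None] * Vt ** 2).sum(axis=0)

def drls_select(A, k, eps):
    """Steps 1-5 of Algorithm 1: keep top columns until scores sum past tbar - eps."""
    tau = ridge_leverage_scores(A, k)
    order = np.argsort(-tau)                  # sorted column indices, largest first
    csum = np.cumsum(tau[order])
    m = int(np.searchsorted(csum, tau.sum() - eps, side="right")) + 1
    m = max(m, k)                             # step 5: keep at least k columns
    return A[:, order[:m]], order[:m]

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10)) @ rng.standard_normal((10, 200))  # low rank
C, kept = drls_select(A, k=5, eps=0.1)
```

Because $C$ is an un-weighted column subset of $A$, the upper bound of Eqn. 9 holds term by term for any projection $X$, which makes it a convenient sanity check for an implementation.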
Furthermore, the projection-cost preservation bound has implications for k-means clustering, because the k-means objective function can be written in terms of the orthogonal rank-$k$ cluster indicator matrix (Boutsidis et al., 2009).¹ Note that our rank-$k$ projection-cost preservation bound holds for an un-weighted column subset of $A$.

¹Thanks to Michael Mahoney for this point.

A useful lemma on an approximate ridge leverage score kernel comes from combining Theorems 1 and 3.

Lemma 1. Approximate ridge leverage score kernel: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1. Let $\alpha$ be the coefficient in the lower bound of Theorem 3 and assume $0 < \epsilon < \frac{1}{2}$. Let $K(M) = \left(MM^T + \frac{1}{k}||M_{\backslash k}||_F^2\, I\right)^+$ for matrix $M \in \mathbb{R}^{n \times l}$. Then $K(C)$ and $K(A)$ satisfy

$$K(A) \preceq K(C) \preceq \frac{1}{1 - (\alpha + 1)\epsilon}\, K(A). \qquad (10)$$

Theorem 4. Approximate Ridge Regression with DRLS: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Construct $C$ following the DRLS algorithm described in Sec. 2.1, let $\alpha$ be the coefficient in the lower bound of Theorem 3, and assume $0 < \epsilon < \frac{1}{2\alpha} < \frac{1}{2}$. Choose the regularization parameter $\lambda^2 = \frac{1}{k}||M_{\backslash k}||_F^2$ for ridge regression with a matrix $M$ (Eqn. 3). Under these conditions, the statistical risk $R(\hat{y}_C)$ of the ridge regression estimator $\hat{y}_C$ is bounded by the statistical risk $R(\hat{y}_A)$ of the ridge regression estimator $\hat{y}_A$:

$$R(\hat{y}_C) \le (1 + \beta\epsilon)R(\hat{y}_A), \qquad (11)$$

where $\beta = \frac{2\alpha(-1 + 2\alpha + 3\alpha^2)}{(1-\alpha)^2}$.

Theorem 4 means that there are bounds on the statistical risk when substituting the DRLS-selected column subset matrix for the complete matrix while performing ridge regression with the appropriate regularization parameter. Performing ridge regression with the column subset $C$ effectively forces coefficients to be zero and adds the benefit of automatic feature selection to the $L_2$ regularization problem. We also note that the proof of Theorem 4 relies only on Theorem 1, Theorem 3, and facts from linear algebra, so a randomized selection of weighted column subsets that obeys similar bounds to Theorems 1 and 3 (e.g., Cohen et al. (2017)) will also have bounded statistical risk, albeit with a different coefficient $\beta$. As a point of comparison, consider the elastic net minimization with our ridge regression regularization parameter:

$$\hat{x}_E = \operatorname*{argmin}_x \left(||y - Ax||_2^2 + \frac{1}{k}||A_{\backslash k}||_F^2\,||x||_2^2 + \lambda_1 \sum_{j=1}^d |x_j|\right). \qquad (12)$$

The risk of the elastic net $R(\hat{y}_E)$ has the following bound in terms of the risk of ridge regression $R(\hat{y}_A)$:

$$R(\hat{y}_E = A\hat{x}_E) = R(\hat{y}_A) + \lambda_1^2\, \frac{4d\,||A||_2^2}{k^2\,||A_{\backslash k}||_F^4}. \qquad (13)$$

This comes from a slight re-working of Theorem 3.1 of Zou and Zhang (2009). The bounds for the elastic net risk and $R(\hat{y}_C)$ are comparable when $\lambda_1^2\, \frac{4d\,||A||_2^2}{k^2\,||A_{\backslash k}||_F^4} \approx \beta\epsilon\, R(\hat{y}_A)$.

Ridge regression is a special case of kernel ridge regression with a linear kernel.
While previous work in kernel ridge regression has considered the use of ridge leverage scores to approximate the symmetric kernel matrix by selecting a subset of $n$ informative samples (Alaoui and Mahoney, 2015; Rudi et al., 2015), to our knowledge, no previous work has used ridge leverage scores to approximate the symmetric kernel matrix by selecting a subset of the $f$ informative features (after the feature mapping of the $d$-dimensional data points). The latter case would be the natural generalization of Theorem 4 to non-linear kernels, and remains an interesting open question. Lastly, we note that placing statistical assumptions on $A$ in the spirit of Rudi et al. (2015) may lead to an improved bound for random designs for $A$.

Theorem 5. Ridge Leverage Power-law Decay: Let $A \in \mathbb{R}^{n \times d}$ be a matrix of at least rank $k$ and $\bar\tau_i(A)$ be defined as in Eqn. 1. Furthermore, let the ridge leverage scores exhibit power-law decay in the sorted column index $\pi_i$,

$$\bar\tau_{\pi_i}(A) = \pi_i^{-a}\, \bar\tau_{\pi_0}(A), \qquad a > 1. \qquad (14)$$

Construct $C$ following the DRLS algorithm described in Sec. 2.1. The number of sample columns selected by DRLS is

$$|\Theta| \le \max\left(\left(\frac{4k}{\epsilon}\right)^{\frac{1}{a}} - 1,\; \left(\frac{4k}{(a-1)\epsilon}\right)^{\frac{1}{a-1}} - 1,\; k\right). \qquad (15)$$

Theorem 3 of Papailiopoulos et al. (2014) introduces the concept of power-law decay behavior for rank-$k$ subspace leverage scores. Our Theorem 5 is an adaptation of Papailiopoulos et al. (2014)'s Theorem 3 to ridge leverage scores.

An obvious extension of Eqn. 7 is the following bound,

$$(1 - \epsilon)AA^T - \frac{\epsilon}{k}||A_{\backslash k}||_F^2\, I \preceq CC^T \preceq (1 + \epsilon)AA^T + \frac{\epsilon}{k}||A_{\backslash k}||_F^2\, I, \qquad (16)$$

which also holds for $C$ selected by ridge leverage random sampling methods with $O\!\left(\frac{k}{\epsilon^2}\ln\left(\frac{k}{\delta}\right)\right)$ weighted columns and failure probability $\delta$ (Cohen et al., 2017). Thus, DRLS selects fewer columns with the same accuracy $\epsilon$ in Eqn. 16 for power-law decay in the ridge leverage scores when

$$\max\left(\left(\frac{4k}{\epsilon}\right)^{\frac{1}{a}} - 1,\; \left(\frac{4k}{(a-1)\epsilon}\right)^{\frac{1}{a-1}} - 1,\; k\right) < C\,\frac{k}{\epsilon^2}\ln\left(\frac{k}{\delta}\right), \qquad (17)$$

where $C$ is an absolute constant. In particular, when $a \ge 2$, the number of columns deterministically sampled is $O(k)$.²

4 Biological Data Illustration

We provide a biological data illustration of ridge leverage scores and ridge regression with multi-omic data from lower-grade glioma (LGG) tumor samples collected by the TCGA Research Network (http://cancergenome.nih.gov/). Diffuse lower-grade gliomas are infiltrative brain tumors that occur most frequently in the cerebral hemisphere of adults.

The data is publicly available and hosted by the Broad Institute's GDAC Firehose (Broad Institute of MIT and Harvard, 2016). We download the data using the R tool TCGA2STAT (Wan et al., 2016). TCGA2STAT imports the latest available version-stamped standardized Level 3 dataset on Firehose. The data collection and data platforms are discussed in detail in the original paper (The Cancer Genome Atlas Research Network, 2015).

We use the following multi-omic data types: mutations (d = 4845), DNA copy number (alteration (d = 22618) and variation (d = 22618)), messenger RNA (mRNA) expression (d = 20501), and microRNA expression (d = 1046). Methylation data is also available, but we omit it due to memory constraints.
The mRNA and microRNA data are normalized. DNA copy number (variation and alteration) has an additional pre-processing step; the segmentation data reported by TCGA is turned into copy number using the R tool CNtools (Zhang, 2015) that is embedded in TCGA2STAT. The mutation data is filtered based on status and variant classification and then aggregated at the gene level (Wan et al., 2016).

There are 280 tumor samples and d = 71628 multi-omic features in the downloaded dataset. We are interested in performing ridge regression with the biologically meaningful outcome variables relating to mutations of the "IDH1" and "IDH2" genes and deletions of the "1p/19q" chromosome arms ("codel"). These variables were shown to be predictive of favorable clinical outcomes and can be found in the supplemental tables (The Cancer Genome Atlas Research Network, 2015). We restrict to samples with these outcome variables (275 tumor samples), and we drop an additional sample ("TCGA-CS-4944") because it is an outlier with respect to the k = 3 SVD projection of the samples. This leaves a total of 274 tumor samples with outcome variables "IDH" (a mutation in either "IDH1" or "IDH2") and "codel" for the analysis.

Lastly, we drop all multi-omic features that have zero columns and greater than 10% missing data on the 274 tumor samples. We then replace missing values with the mean of the column. This leaves a final multi-omic feature set of d = 68522 for the 274 tumor samples. Our final matrix $A \in \mathbb{R}^{274 \times 68522}$ is column mean-centered. Figure 1 shows a pie chart of the breakdown of the final matrix $A$'s multi-omic feature types.

4.1 Ridge leverage score sampling

Figure 2 shows the spectrum of eigenvalues of $AA^T$ for LGG. The eigenvalues range over multiple orders of magnitude. We choose k = 3 for the DRLS algorithm because these components are meaningful for the "IDH" and "codel" outcome variables (see Figures 3, 4, and 5). The top three components capture 79% of the Frobenius norm $||A||_F^2$. Applying the DRLS algorithm with $k = 3$, $\epsilon = 0.1$ leads to $|\Theta| = 1512$, selecting approximately 2% of the total multi-omic features for the column subset matrix $C$. The majority of the features selected are mRNA (1473 features), and the remainder are microRNA (39 features). Figure 6 shows the relationship between the number of columns kept, $|\Theta|$, and $\tilde\epsilon = \sum_{i \notin \Theta} \bar\tau_i(A)$ for the $k = 3$ ridge leverage scores. Only a small error penalty is incurred by a dramatic reduction in the number of columns kept according to Algorithm 1.

Figure 7 shows the power-law decay of the LGG $k = 3$ ridge leverage scores with sorted column index. This LGG multi-omic data example shows that ridge leverage score power-law decay occurs in the wild. Figure 8 shows a histogram of the ratio $||C - XC||_F^2 \,/\, ||A - XA||_F^2$ for 1000 random rank-$k = 3$ orthogonal projections $X$. The projections are chosen as the first 3 directions from an orthogonal basis randomly selected with respect to the Haar measure on the orthogonal group $O(n)$ (Mezzadri, 2006). This confirms that the projection cost empirically has very small error. Lastly, Figure 9 illustrates the $k = 3$ ridge leverage score regularization of the classical leverage scores for the LGG multi-omic features. As expected, many of the columns' ridge leverage scores exhibit shrinkage when compared to the classical leverage scores.

²Thanks to Ahmed El Alaoui for this point.
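The shrinkage behavior in Figure 9 is easy to reproduce without the LGG data; the following sketch, with our own toy matrix, compares classical and ridge leverage scores column by column:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 15)) @ rng.standard_normal((15, 300))  # rank <= 15
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
lam2 = np.sum(s[k:] ** 2) / k                  # lambda^2 = ||A - A_k||_F^2 / k

r = int(np.sum(s > 1e-10 * s[0]))              # numerical rank
# Classical score: tau_i = a_i^T (A A^T)^+ a_i. The ridge score adds lam2*I
# inside the inverse, shrinking each spectral weight from 1 to s_j^2/(s_j^2 + lam2).
classical = np.sum(Vt[:r] ** 2, axis=0)
ridge = np.sum((s[:r, None] ** 2 / (s[:r, None] ** 2 + lam2)) * Vt[:r] ** 2, axis=0)
```

Since each spectral weight $s_j^2/(s_j^2 + \lambda^2)$ is at most one, every ridge leverage score is bounded above by the corresponding classical score, with the strongest shrinkage along the small singular directions.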
Table 2 includes ratios derived from the full data matrix $A$ and the column subset matrix $C$ selected by the DRLS algorithm with $k = 3$, $\epsilon = 0.1$.

4.2 Ridge regression with ridge leverage score sampling

We perform ridge regression with the appropriate regularization parameter for two biologically meaningful outcome variables; the first is whether either the "IDH1" or the "IDH2" gene is mutated, and the second is whether the "1p/19q" chromosome arms have deletions ("codel"). We encode the status of each event as ±1. Figures 3, 4, and 5 show the top three SVD projections for the tumor samples, colored by the combined status for "IDH" and "codel". No tumor samples have the "1p/19q" codeletion and no "IDH" mutation. Visual inspection of the SVD plot confirms that this is a reasonable regression problem for "IDH" and a difficult regression problem for "codel"; also, logistic regression would be more natural for binary outcomes. We proceed anyway, since our objective is to compare ridge regression with all of the features ($A$) to ridge regression with the DRLS subset ($C$) on realistic biological data. Figures 10 and 11 confirm that the ridge regression fits are close ($\hat{y}_A - \hat{y}_C$) for all the tumor samples. Figures 12 and 13 confirm that the ridge regression coefficients are close ($\hat{x}_A - \hat{x}_C$) for all the tumor samples. Figures 14 and 15 illustrate the overall performance of ridge regression for these two outcome variables.

Lastly, we simulate 274 samples $y$ according to the linear model (Eqn. 4), where $y^* = Ax^*$, the coefficients $x^* \sim N(0, I)$, and $A$ is the LGG multi-omic feature matrix. We choose $\sigma^2 \in \{10^{-3}, 1, 10^3\}$. We perform ridge regression with $A$ and then again with $C$ in accordance with Theorem 4. We calculate the risks $R(\hat{y}_A)$ and $R(\hat{y}_C)$ and find that Theorem 4 is not violated.
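This simulation can be mimicked end-to-end on synthetic data. The sketch below is a hedged stand-in for our pipeline: it substitutes a random low-rank matrix for the LGG matrix $A$, keeps the top-scoring columns in place of the full Algorithm 1, and Monte Carlo estimates the risks of Eqn. 5:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 60, 300, 3
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))          # stand-in for the LGG matrix

def lam2(M, k):
    """Regularization parameter lambda^2 = ||M - M_k||_F^2 / k (Theorem 4)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s[k:] ** 2) / k

def hat_matrix(M, k):
    """Maps y to the ridge fit y_hat = M (M^T M + lam2 I)^{-1} M^T y (Eqn. 3)."""
    return M @ np.linalg.solve(M.T @ M + lam2(M, k) * np.eye(M.shape[1]), M.T)

# Keep the 80 highest-scoring columns (a stand-in for the full Algorithm 1).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tau = ((s ** 2 / (s ** 2 + lam2(A, k)))[:, None] * Vt ** 2).sum(axis=0)
C = A[:, np.argsort(-tau)[:80]]

# Monte Carlo estimate of the risk R(y_hat) = E ||y_hat - y*||_2^2 / n (Eqn. 5).
H_A, H_C = hat_matrix(A, k), hat_matrix(C, k)
x_star = rng.standard_normal(d)
y_star = A @ x_star
sigma2, trials = 1.0, 200
risk_A = risk_C = 0.0
for _ in range(trials):
    y = y_star + sigma2 * rng.standard_normal(n)   # Eqn. 4
    risk_A += np.sum((H_A @ y - y_star) ** 2) / n
    risk_C += np.sum((H_C @ y - y_star) ** 2) / n
ratio = (risk_C / trials) / (risk_A / trials)
```

Under Theorem 4, the ratio for a DRLS-selected $C$ should stay within $1 + \beta\epsilon$ of one; the stand-in selection above carries no such guarantee, so the ratio is only a qualitative check.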
Table 2 shows the risk ratios R(ŷ_C)/R(ŷ_A) along with other relevant ratios for the ridge leverage scores.

Table 2: Ridge leverage score ratios for k = 3, ε = 0.1 for LGG tumor multi-omic data. The ratios are near one, as expected. Ridge regression risk ratio R(ŷ_C)/R(ŷ_A) for data simulated from the LGG multi-omic matrix A and Eqn. 4.

                                  ||C\k||²_F / ||A\k||²_F    ave(Σ̄²/Σ̄²_C)    ave(Σ²_C/Σ²)
    Algorithm 1, k = 3, ε = 0.1            0.97                  1.03             0.85

      σ²      R(ŷ_C)/R(ŷ_A)
      10⁻³        0.99
      10⁰         0.99
      10³         0.99

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number F32HG008713. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. SRM thanks Michael Mahoney, Ahmed El Alaoui, Elaine Angelino, and Kai Rothauge for thoughtful comments and the Barcellos and Pachter Labs.

Supporting Information

Software in the form of python and R code is available at https://github.com/srmcc/deterministic-ridge-leverage-sampling. Code for downloading the data and reproducing all of the figures is included. Proofs and figures are included in the Supplementary Material.

References

Ahmed El Alaoui and Michael W. Mahoney. 2015. Fast Randomized Kernel Ridge Regression with Statistical Guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'15). MIT Press, Cambridge, MA, USA, 775–783. http://dl.acm.org/citation.cfm?id=2969239.2969326 http://arxiv.org/abs/1411.0306

Christos Boutsidis, Petros Drineas, and Michael W. Mahoney. 2009. Unsupervised Feature Selection for the k-means Clustering Problem. In Advances in Neural Information Processing Systems 22, Y. Bengio, D.
Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.). Curran Associates, Inc., 153–161. http://papers.nips.cc/paper/3724-unsupervised-feature-selection-for-the-k-means-clustering-problem.pdf

Leo Breiman. 1996. Heuristics of instability and stabilization in model selection. The Annals of Statistics 24, 6 (Dec. 1996), 2350–2383. https://doi.org/10.1214/aos/1032181158

Broad Institute of MIT and Harvard. 2016. Broad Institute TCGA Genome Data Analysis Center (2016): Analysis-ready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run. (Jan. 2016). https://doi.org/10.7908/C11G0KM9 Dataset.

Samprit Chatterjee and Ali S. Hadi. 1986. Influential Observations, High Leverage Points, and Outliers in Linear Regression. Statist. Sci. 1, 3 (Aug. 1986), 379–393. https://doi.org/10.1214/ss/1177013622

Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. 2015. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing (STOC '15). ACM, New York, NY, USA, 163–172. https://doi.org/10.1145/2746539.2746569

Michael B. Cohen, Cameron Musco, and Christopher Musco. 2017. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '17). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1758–1777. http://dl.acm.org/citation.cfm?id=3039686.3039801

Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. 2008. Relative-Error CUR Matrix Decompositions. SIAM J. Matrix Anal. Appl. 30, 2 (Sept. 2008), 844–881. https://doi.org/10.1137/07070471X

D. Feldman, M. Schmidt, and C. Sohler. 2013. Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering.
In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1434–1453. https://doi.org/10.1137/1.9781611973105.103

Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 1 (Feb. 1970), 55–67. https://doi.org/10.1080/00401706.1970.10488634

Roger A. Horn and Charles R. Johnson. 2013. Matrix Analysis (2nd ed.). Cambridge University Press, New York.

Shannon McCurdy, Vasilis Ntranos, and Lior Pachter. 2017. Column subset selection for single-cell RNA-Seq clustering. bioRxiv (July 2017), 159079. https://doi.org/10.1101/159079

Francesco Mezzadri. 2006. How to generate random matrices from the classical compact groups. Notices of the American Mathematical Society 54 (Oct. 2006).

Dimitris Papailiopoulos, Anastasios Kyrillidis, and Christos Boutsidis. 2014. Provable Deterministic Leverage Score Sampling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 997–1006. https://doi.org/10.1145/2623330.2623698

Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. 2015. Less is More: Nyström Computational Regularization. arXiv:1507.04717 [cs, stat] (July 2015). http://arxiv.org/abs/1507.04717

The Cancer Genome Atlas Research Network. 2015. Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas. The New England Journal of Medicine 372, 26 (June 2015), 2481–2498. https://doi.org/10.1056/NEJMoa1402121

Robert Tibshirani. 1994. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B 58 (1994), 267–288.

Paul F. Velleman and Roy E. Welsch. 1981. Efficient Computing of Regression Diagnostics. The American Statistician 35, 4 (1981), 234–242.
https://doi.org/10.2307/2683296

Ying-Wooi Wan, Genevera I. Allen, and Zhandong Liu. 2016. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 32, 6 (March 2016), 952–954. https://doi.org/10.1093/bioinformatics/btv677

Jianhua Zhang. 2015. CNTools: Convert segment data into a region by sample matrix to allow for other high level computational analyses. (2015). http://bioconductor.org/packages/CNTools/ R package version 1.26.0.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B 67 (2005), 301–320.

Hui Zou and Hao Helen Zhang. 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 4 (Aug. 2009), 1733–1751. https://doi.org/10.1214/08-AOS625