{"title": "On the number of variables to use in principal component regression", "book": "Advances in Neural Information Processing Systems", "page_first": 5094, "page_last": 5103, "abstract": "We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p>n$, the estimator is the least $\\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p,n,N \\to \\infty$ with $p/N \\to \\alpha$ and $n/N \\to \\beta$, for some constants $\\alpha \\in [0,1]$ and $\\beta \\in (0,1)$. In this average-case setting, the prediction error exhibits a ``double descent'' shape as a function of $p$. We also establish conditions under which the minimum risk is achieved in the interpolating ($p>n$) regime.", "full_text": "On the number of variables to use in principal\n\ncomponent regression\n\nJi Xu\n\nColumbia University\n\njixu@cs.columbia.edu\n\nDaniel Hsu\n\nColumbia University\n\ndjhsu@cs.columbia.edu\n\nAbstract\n\nWe study least squares linear regression over N uncorrelated Gaussian features that\nare selected in order of decreasing variance. When the number of selected features\np is at most the sample size n, the estimator under consideration coincides with the\nprincipal component regression estimator; when p > n, the estimator is the least\n`2 norm solution over the selected features. We give an average-case analysis of\nthe out-of-sample prediction error as p, n, N ! 1 with p/N ! \u21b5 and n/N ! ,\nfor some constants \u21b5 2 [0, 1] and 2 (0, 1). In this average-case setting, the\nprediction error exhibits a \u201cdouble descent\u201d shape as a function of p. We also\nestablish conditions under which the minimum risk is achieved in the interpolating\n(p > n) regime.\n\n1\n\nIntroduction\n\nIn principal component regression (PCR), a linear model is \ufb01t to variables obtained using principal\ncomponent analysis on the original covariates. Suppose the data consists of n i.i.d. observations\n(x1, y1), . . . , (xn, yn) from RN \u21e5 R. Let X := [x1|\u00b7\u00b7\u00b7|xn]> be the n \u21e5 N design matrix, y :=\n(y1, . . . , yn)> be the n-dimensional vector of responses, and \u2303 := E[x1x>1 ] 2 RN\u21e5N. Assuming\n\u2303 is known (as we do in this paper), the PCR \ufb01t is given by V (XV )+y, where V 2 RN\u21e5p is the\nmatrix of top p (orthonormal) eigenvectors of \u2303, and A+ denotes the Moore-Penrose pseudo-inverse\nof A. PCR notably addresses issues of multi-collinearity in under-determined (n < N) settings, while\navoiding saturation effects suffered by other regression methods such as ridge regression [1, 7, 12].\nThe critical parameter in PCR is the number of components p to include in the regression. Nearly\nall previous analyses of variable selection have restricted attention to the p < n regime [e.g., 4].\nThis restriction may seem benign, as conventional wisdom suggests that choosing p > n leads to\nover-\ufb01tting. This paper aims to challenge this conventional wisdom in a particular setting for PCR.\nWe study the prediction error of the PCR \ufb01t for all values of p in the under-determined regime. 
We assume the x_i are Gaussian and conduct an “average-case” analysis, where the “true” coefficient vector is randomly chosen from an isotropic prior distribution. Thus, all of the original variables in x_i are relevant but weak in terms of predicting the response. When the eigenvalues of Σ exhibit some decay, one expects diminishing returns as p increases. It is often suggested to find a value of p that balances bias and variance, and such a value of p can be found in the p < n regime.

However, we show that when p > n, the prediction error can again be decreasing with p. This phenomenon, the second descent of the so-called “double descent” risk curve [2], has been observed in a number of scenarios and for many different machine learning models (where p is regarded as a nominal number of model parameters) [2, 3, 8, 13, 17]. In these previous studies, the limiting risk as p → ∞ was often (but not always) observed to be lower than the best risk achieved in the p < n regime. We prove that this phenomenon occurs with PCR in our data model: the lowest prediction error is achieved at some p > n, rather than at any p < n.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our data model. Our data (x_1, y_1), . . . , (x_n, y_n) are assumed to be i.i.d. with x_i ∼ N(0, Σ), and

    y_i = x_i⊤ θ + w_i.

Here, w_1, . . . , w_n are i.i.d. N(0, σ²) noise variables, and θ ∈ R^N is the true coefficient vector. We assume, without loss of generality, that Σ is diagonal. In fact, we shall take Σ := diag(λ_1, . . . , λ_N) with distinct positive eigenvalues λ_1 > · · · > λ_N > 0. The prediction (squared) error of θ′ ∈ R^N is E_{x,y}[(y − x⊤θ′)²], where (x, y) is an independent copy of (x_1, y_1).

Some notation. For a vector v ∈ R^N, let v_P ∈ R^p denote the sub-vector of the first p entries of v, and let v_{P^c} ∈ R^{N−p} denote the sub-vector of the last N − p entries. Similarly, for a matrix M ∈ R^{n×N}, let M_P ∈ R^{n×p} denote the sub-matrix of the first p columns of M, and let M_{P^c} ∈ R^{n×(N−p)} denote the sub-matrix of the last N − p columns.

Recall that PCR selects components in order of decreasing λ_j. So, using the notation from above, the PCR estimator θ̂ for θ is defined by

    θ̂_P := (X_P⊤ X_P)^{−1} X_P⊤ y        if p ≤ n,
    θ̂_P := X_P⊤ (X_P X_P⊤)^{−1} y        if p > n;        θ̂_{P^c} := 0.    (1)

(Recall that X := [x_1 | · · · | x_n]⊤ and y := (y_1, . . . , y_n)⊤; also, the matrices being inverted above are, indeed, invertible with probability 1.) The prediction error of the PCR estimate θ̂ is denoted by

    Error := E_{x,y}[(y − x⊤ θ̂)²].

Observe that the (squared) correlation between the response and the j-th variable is proportional to λ_j θ_j², but PCR selects variables only on the basis of the λ_j. So, for a worst-case θ, PCR may be unlucky and end up selecting the p least correlated variables. To avoid this worst-case scenario, we consider an “average-case” analysis, where the true coefficient vector θ is independently drawn from an isotropic prior distribution:

    E_θ[θ] = 0,    E_θ[θ θ⊤] = I.    (2)

We will study the random quantity E_{w,θ}[Error], where the expectation is conditional on the design matrix X, but averages over the observation noise w = (w_1, . . . , w_n) and the random choice of θ.

Our analysis uses high-dimensional asymptotics to study the under-determined (n < N) regression problem, letting p, n, N → ∞ with p/N → α and n/N → β for some fixed constants α ∈ [0, 1] and β ∈ (0, 1). We are primarily interested in the limiting value of E_{w,θ}[Error], which is the asymptotic risk.

Our results. In Section 2, we give an exact expression for the asymptotic risk in the case where the eigenvalues of Σ exhibit polynomial decay, namely λ_j = j^{−κ} for a fixed constant κ > 0. Our expression covers both the p < n and p > n regimes, and we find that the smallest asymptotic risk can be achieved with p > n (or equivalently, α > β) in noiseless settings. In noisy settings, the comparison of the p < n and p > n regimes depends crucially on the exponent κ.

In Section 3, we relax the condition on the eigenvalues, and instead just assume that the empirical distribution of the c_N λ_j, for some suitable sequence (c_N)_{N≥1}, has a “nice” limiting distribution. We obtain results similar to those in Section 2 using a slightly different variable selection rule.

Our analyses permit a 1 − o(1) fraction of the λ_j's to converge to zero as p, n, N → ∞. (In particular, the c_N may go to infinity.) This makes our analysis technically non-trivial and more generally applicable. The proofs of the results are detailed in the full version of the paper [19].

Related works. Strategies for choosing the optimal value of p in PCR (e.g., cross validation, variance inflation factors) are typically only studied in the p < n regime [9]. For instance, the exact risk of PCR as a function of p for Gaussian designs can be extracted from the analysis of Breiman and Freedman [4], but only for the p < n regime.

The high-dimensional analyses of ridge regression by Dicker [5], Dobriban and Wager [6], and Hastie et al. [8] are closely related to our work. Indeed, for fixed p, the PCR estimator (or “ridgeless” estimator) is obtained by taking the ridge regularization parameter to zero (a small numerical check of this limit is sketched below). These analyses extend beyond the Gaussian design setting that we consider, but are restricted to cases where either all eigenvalues of Σ remain bounded below by an absolute constant as N → ∞, or where the ridge regularization parameter is held at some positive constant.
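As a small numerical check (ours, not part of any of the cited analyses; it assumes NumPy and an arbitrary synthetic design), the sketch below verifies the ridgeless limit just mentioned: for p > n, the minimum ℓ2-norm solution in (1) is recovered as the ridge regularization parameter tends to zero.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 200                      # over-parameterized: p > n
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # Minimum l2-norm interpolating solution, as in (1): X^T (X X^T)^{-1} y.
    theta_minnorm = X.T @ np.linalg.solve(X @ X.T, y)

    # Ridge solutions (X^T X + lam I)^{-1} X^T y for shrinking lam.
    for lam in [1e-1, 1e-4, 1e-8]:
        theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        print(lam, np.linalg.norm(theta_ridge - theta_minnorm))
    # The distance to the min-norm solution shrinks with lam, illustrating that
    # the "ridgeless" estimator is the lam -> 0 limit of ridge regression.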
While their “misspecified” setting appears to be similar to our setup, we note that varying their p/n parameter (which they call γ) changes the statistical problem under consideration. In contrast, our analysis looks at the effect of choosing different p on the same statistical problem, and thus is able to shed light on the number of variables one should use in principal component regression.

Notations for asymptotics. For any two random quantities X and Y, we use the notation X →_p Y to mean that X = Y + o_p(Y) as n, p, N → ∞. Similarly, for any two non-random quantities X and Y, we use the notation X → Y to mean that X = Y + o(Y) as n, p, N → ∞. Finally, we say that X > Y holds in probability if Pr(X > Y) → 1 as n, p, N → ∞.

2 Analysis under polynomial eigenvalue decay

In this section, we analyze the asymptotic risk of PCR under the following assumptions:

A.1 There exists a constant κ > 0 such that λ_j = j^{−κ} for all j = 1, . . . , N.
A.2 There exist constants α ∈ [0, 1] and β ∈ (0, 1) such that p/N → α and n/N → β as p, n, N → ∞.

Assumption A.1 implies that the eigenvalues of Σ decay to zero at a polynomial rate, while Assumption A.2 is a standard scaling for high-dimensional asymptotic analysis.

We also assume in this section that there is no observation noise, i.e., var(w_i) = σ² = 0. In the noiseless setting, the asymptotic risk is the limiting value of E_θ[Error]. Results for the noisy setting are stated in Appendix C.

2.1 Main results

Our first theorem characterizes the asymptotic risk when α < β. Define the functions h_κ and R_κ on (0, β):

    h_κ(α) := (β − α) α^{−κ} − ∫_α^1 t^{−κ} dt,    for all α < β;    (3)

    R_κ(α) := N^{1−κ} ∫_α^1 t^{−κ} dt · β/(β − α),    for all α < β.    (4)

(Up to the positive factor (β − α)², h_κ(α) is the negative of the derivative of α ↦ ∫_α^1 t^{−κ} dt / (β − α), so the sign of h_κ determines where R_κ is decreasing or increasing.)

Theorem 1. Assume A.1 with constant κ; A.2 with constants α and β; σ² = 0; and α < β. Then

    E_θ[Error] →_p R_κ(α).

Furthermore, the equation h_κ(α) = 0 has a unique solution α* over the interval (0, β), and R_κ(α) is decreasing on α ∈ [0, α*) and increasing on α ∈ (α*, β). Finally,

    R_κ(α*) = min_{0 ≤ α < β} R_κ(α) = N^{1−κ} β (α*)^{−κ}.    (5)
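To make Theorem 1 concrete, the following sketch (ours, not from the paper; it assumes NumPy and SciPy, and uses the expressions (3)–(5) exactly as stated above) evaluates R_κ(α) for κ = 1 and β = 0.3, and locates α* by solving h_κ(α) = 0 with a scalar root finder. The printed values trace the descending-then-ascending branch of the risk curve on α < β.

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    kappa, beta, N = 1.0, 0.3, 1000

    def tail_integral(alpha):
        # integral_alpha^1 t^{-kappa} dt, i.e., tr(Sigma_{P^c}) / N^{1-kappa}.
        return quad(lambda t: t ** (-kappa), alpha, 1.0)[0]

    def R(alpha):
        # Asymptotic risk (4) for alpha < beta.
        return N ** (1.0 - kappa) * tail_integral(alpha) * beta / (beta - alpha)

    def h(alpha):
        # Function (3); its unique zero on (0, beta) is the minimizer alpha*.
        return (beta - alpha) * alpha ** (-kappa) - tail_integral(alpha)

    alpha_star = brentq(h, 1e-6, beta - 1e-6)
    print("alpha*      =", alpha_star)
    print("R(alpha*)   =", R(alpha_star))
    print("check (5)   =", N ** (1.0 - kappa) * beta * alpha_star ** (-kappa))

    for a in [0.05, alpha_star, 0.2, 0.29]:
        print(a, R(a))   # decreases up to alpha*, then increases toward beta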
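Figure 1 below compares the asymptotic formulas with finite-sample behavior. A small Monte Carlo sketch of that comparison is given here (ours, not the authors' simulation code; it assumes NumPy and uses modest sizes so it runs quickly): it draws θ from the isotropic prior (2), fits the estimator (1) for a range of p spanning both regimes, and averages the noiseless prediction error over independent draws of (X, θ). Plotting the resulting averages against α = p/N reproduces the “double descent” shape.

    import numpy as np

    rng = np.random.default_rng(1)
    N, n, kappa, trials = 400, 120, 1.0, 20               # beta = n/N = 0.3
    lam = np.arange(1, N + 1, dtype=float) ** (-kappa)    # eigenvalues of Sigma (A.1)

    def pcr_risk(p):
        # Average of Error = sum_j lam_j (theta_hat_j - theta_j)^2 over random (X, theta).
        err = 0.0
        for _ in range(trials):
            X = rng.standard_normal((n, N)) * np.sqrt(lam)   # rows ~ N(0, Sigma)
            theta = rng.standard_normal(N)                   # isotropic prior (2)
            y = X @ theta                                    # noiseless responses
            Xp = X[:, :p]
            if p <= n:      # ordinary least squares on the top p components
                theta_p = np.linalg.lstsq(Xp, y, rcond=None)[0]
            else:           # minimum l2-norm interpolant, as in (1)
                theta_p = Xp.T @ np.linalg.solve(Xp @ Xp.T, y)
            diff = np.concatenate([theta_p - theta[:p], -theta[p:]])
            err += np.sum(lam * diff ** 2)
        return err / trials

    for p in [20, 40, 80, 110, 130, 200, 300, 400]:
        print(p / N, pcr_risk(p))   # dips, spikes near p = n, then descends again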
[Figure 1 appears here: two panels plotting risk against α = p/N, for κ = 1 (left) and κ = 2 (right); each panel shows the empirical average E_{w,θ}[Error] together with the asymptotic risk R_κ(α).]

Figure 1: The asymptotic risk function R_κ as a function of α (with n = 300, N = 1000, β = n/N = 0.3 and κ = 1, 2 respectively). The location of α* from Theorem 1 is marked with a black circle. In both cases, the asymptotic risk at α = 1 is lower than the asymptotic risk at α*.

The proof of Theorem 1 is sketched in Section 2.2, with some details left to Appendix A. Theorem 1 supports the well-known intuition that the risk curve is “U-shaped” in the p < n regime. Our next theorem, however, shows a very different behavior when α > β.

Formally, define m_κ(z) for z ≤ 0 to be the smallest positive solution to the equation

    −z = 1/m_κ(z) − (1/β) ∫_{α^{−κ}}^∞ dt / (κ t^{1/κ} (1 + t · m_κ(z))),    (6)

and let m′_κ(·) denote the derivative of m_κ(·). Also define the function R_κ on (β, 1]:

    R_κ(α) := N^{1−κ} ( β / m_κ(0) + m′_κ(0)/m_κ(0)² · ∫_α^1 t^{−κ} dt ),    for all α > β.    (7)

Theorem 2. Assume A.1 with constant κ; A.2 with constants α and β; σ² = 0; and α > β. The function m_κ and its derivative m′_κ are well-defined and positive at z = 0 (and hence R_κ(α) is well-defined for all α > β). 
Moreover,\n\nE\u2713[Error] p!R \uf8ff(\u21b5).\n\nThe proof of Theorem 2 is sketched in Section 2.3, with some details left to Appendix B.\nWe plot the asymptotic risk function R\uf8ff in Figure 1 for two different values of \uf8ff, both with = 0.3.\n(In simulations, we \ufb01nd that E\u2713[Error] matches these curves very closely for sample sizes as small\nas n = 300.) For both values of \uf8ff 2{ 1, 2}, we observe the striking \u201cdouble descent\u201d behavior as\nfound in previous studies [e.g., 2]. Moreover, we see that the asymptotic risk at \u21b5 = 1 is smaller than\nthe minimum asymptotic risk achieved at any \u21b5< . This, in fact, happens for all values of \uf8ff> 0,\nas we claim in the next theorem.\nTheorem 3. Assume A.1 with constant \uf8ff, A.2 with constants \u21b5 and , 2 = 0. Let \u21b5\u21e4 be\nthe minimizer of R\uf8ff over the interval [0, ). Then lim supN R\uf8ff(1)/R\uf8ff(\u21b5\u21e4) < 1. Moreover,\nR\uf8ff(\u21b5)/R\uf8ff(1) ! 1 as \u21b5 ! .\nThe proof of Theorem 3 is given in Section 2.4. Theorem 3 shows that the asymptotic risk exhibits a\nsecond decrease somewhere in the p > n regime when N is suf\ufb01ciently large, and moreover, that it is\npossible to \ufb01nd a value of p in this p > n regime to achieve a lower asymptotic risk than any p < n.\nIn the noisy setting (see Appendix C), it is possible for the asymptotic risk to be dominated by the\nnoise, in which case the minimum asymptotic risk is in fact achieved by \u21b5 = 0 (i.e., p = o(n)).\nHowever, there exists a regime with 2 > 0 in which we have the same conclusion as in Theorem 3.\n\n4\n\n\f2.2 Proof sketch for Theorem 1\nWe \ufb01rst show that h\uf8ff(\u21b5) = 0 has a unique solution on (0, ). De\ufb01ne \u02dch\uf8ff(\u21b5) := \u21b51\uf8ffh\uf8ff(\u21b5). We\nshall show that \u02dch\uf8ff(\u21b5) = 0 has a unique solution on (0, ), which in turn immediately implies that\nh\uf8ff(\u21b5) = 0 also has a unique solution on the same interval. Observe that\n\nd\u02dch\uf8ff(\u21b5)\n\nd\u21b5\n\n= \uf8ff + \uf8ff\u21b5\n\n\u21b51+\uf8ff\n\n< 0.\n\nHence, the function \u02dch\uf8ff(\u21b5) is strictly decreasing on \u21b5 2 (0, ]. Furthermore, we have\n\n\u02dch\uf8ff(\u21b5) > 0 as \u21b5 ! 0+,\n\nand\n\n\u02dch\uf8ff(\u21b5) < 0 at \u21b5 = .\n\n(8)\n\n(9)\n\nBecause \u02dch\uf8ff is continuous, it follows that the equation \u02dch\uf8ff(\u21b5) = 0 has a unique solution on (0, ).\nWe now prove E\u2713[Error] p!R \uf8ff(\u21b5). Since the proof only requires standard techniques, we just\nsketch the main ideas in this section, and leave the full proof to Appendix A. First, since \u21b5< , for\nlarge enough N, we have p < n. Then the prediction error is given by\n\nP c \u2713P ck2,\n\nE\u2713[Error] = tr(X>\n\nP X P c\u2713P ck2 + k\u23031/2\n\nP X P1 X>\n\nError = Ex,y[(y x>\u02c6\u2713)2] = k\u23031/2\n\nwhere \u2303P 2 Rp\u21e5p and \u2303P c 2 R(Np)\u21e5(Np) are two diagonal matrices whose diagonal elements\nare the \ufb01rst p and last N p diagonal elements of \u2303, respectively. By (2), we have\n\nP X>\nP X P1 \u2303PX>\nNote that X P c is independent of X P , thus, given X P , the trace that includes X P c is a sum of\nN p independent random variables. Therefore, we have\np! tr(\u2303P c) \u00b7 (tr((X>\n= tr(\u2303P c) \u00b7 (tr(( \u00afX>\np! tr(\u2303P c)\n\nP X P )1\u2303P ) + 1)\n\u00afX P )1) + 1)\n\nP X P1 X>\n\nP cX PX>\n\nP X P c) + tr(\u2303P c).\n\nE\u2713[Error]\n\n\n\nP\n\n,\n\n \u21b5\n\nP\n\n(4), we just need to compute tr(\u2303P c). 
Note thatR s+1\n\nwhere \u00afX P := X P \u23031/2\nis a standard Gaussian matrix. The \ufb01rst line above uses Markov\u2019s\ninequality to show that E\u2713[Error] converges in probability to E\u2713,X P c [Error]. The third line above\n\u00afX P is a standard Wishart matrix Wp(I, n). So, to prove\nuses Assumption A.2 and the fact that \u00afX>\nP\nt\uf8ff dt < s\uf8ff , we know the function q\uf8ff(s, \u21b5) is strictly decreasing on s 2 (0, ( \n\u21b5 )1/\uf8ff] and\nstrictly increasing on s 2 [( \n\u21b5 )1/\uf8ff,1). Furthermore, q\uf8ff(s, \u21b5) ! 1 as s ! 0 and q\uf8ff(s, \u21b5) ! 0\nas s ! 1. Hence, by the continuity of s 7! q\uf8ff(s, \u21b5), we conclude that q\uf8ff(s, \u21b5) = 0 has a unique\nsolution s\u21e4\uf8ff.\nUsing the chain rule, we can also show that m0\uf8ff(0) is well-de\ufb01ned, and that its value is given by\n\nm0\uf8ff(0) = \uf8ffm2\nWe leave the details to Appendix B.1.\nOur next goal is to prove E\u2713[Error] p!R \uf8ff(\u21b5). Since \u21b5> , we have p > n for large enough N. In\nthis case,\n\n\uf8ff(0) \u00b7 (1 + (s\u21e4\uf8ff)\uf8ff)/ + ( \u21b5)(s\u21e4\uf8ff)\uf8ff > 0.\n\nError = Ex,y[(y x>\u02c6\u2713)2] = Ex,y[(x>P (\u02c6\u2713P \u2713P ) x>P c\u2713P c)2]\n\nP ((\u21e7X P I)\u2713P + X>\n\nP (X P X>\n\nP )1X P c\u2713P c)k2 + k\u23031/2\n\nP c \u2713P ck2,\n\nwhere \u21e7X P := X>\nSection 2.2. Hence, E\u2713[Error] is equal to\n\nP1 X P , and the diagonal matrices \u2303P and \u2303P c are as de\ufb01ned in\n\ntr(\u2303P (I \u21e7X P ))\n\n+ tr(X>\n\nP c(X P X>\n\nP (X P X>\n\nP )1X P c) + tr(\u2303P c)\n\n.\n\n(15)\n\n= k\u23031/2\nPX P X>\n|\n}\n\npart 1\n\n{z\n\n|\n\nWe claim that\n\nP )1X P \u2303P X>\npart 2\n\n{z\n\n}\n\npart 1\n\np!\n\nN 1\uf8ff\nm\uf8ff(0)\n\n,\n\nand part 2 p! N 1\uf8ff \u00b7\n\nm0\uf8ff(0)\nm2\n\n\uf8ff(0) \u00b7Z 1\n\n\u21b5\n\nt\uf8ff2 dt + op(N 1\uf8ff);\n\n(16)\n\ntogether, they complete the proof that E\u2713[Error] p!R \uf8ff(\u21b5). Rigorous proofs of the claims in (16)\nare presented in Appendix B.2 and Appendix B.3; here, we give a heuristic argument that conveys\nthe main idea. For part 1, let \u02dc\u2303P = N \uf8ff\u2303P and \u02dcX P = N \uf8ff/2X P . This scaling ensures that the\nempirical eigenvalue distribution of \u02dc\u2303P has a limiting distribution with probability density\n\nf\uf8ff(s) =\n\n1\n\uf8ff\u21b5\n\ns11/\uf8ff \u00b7 1\n\n{s2[\u21b5\uf8ff,1)}\n\n(Lemma 2 in Appendix B.2). Also, under this scaling, we have\n\ntr(\u2303PI \u21e7X P) = lim\ntr \u02dc\u2303P\u2713 1\n\n= lim\n\u00b5!0\n\nn\nN \uf8ff \u00b7\n\n\u00b5!0\n\n\u00b5\nn\n\nn\n\nn\n\nN \uf8ff\u2713 1\n\nn\n\n\u02dcX>\nP\n\ntr( \u02dc\u2303P ) \n\n1\nn\n\n\u02dcX P + \u00b5I\u25c61! = lim\n\n\u00b5!0\n\ntr( \u02dc\u2303P ( \u02dcX>\nP\n\n\u02dcX P + \u00b5nI)1 \u02dcX>\nP\n\n\u02dcX P )\u25c6\n\nn\nN \uf8ff \u00b7\n\n\u00b5\nn\n\ntr( \u02dc\u2303P \u02dcSn),\n\n(17)\n\n6\n\n\fwhere \u02dcSn := (n1 \u02dcX>\nP\nlimiting distribution with bounded support, we have\n\n\u02dcX P + \u00b5I)1. As long as the empirical eigenvalue distribution of \u02dc\u2303P has a\n\n8\u00b5 > 0, \u00b5 \u00b7\n\n1\nn\n\ntr \u02dc\u2303P\u2713 1\n\nn\n\n\u02dcX>\nP\n\n\u02dcX P + \u00b5I\u25c61! p!\n\n1\n\nm\uf8ff(\u00b5)\n\n,\n\n(18)\n\nwhere m\uf8ff(z) is, in fact, the Stieltjes transform of the limiting empirical eigenvalue distribution of\nn1 \u02dcX P \u02dcX>\nP (Lemma 1 in Appendix B.2); this follows from results of Dobriban and Wager [6],\nwhich in turn are derived from the results of Ledoit and P\u00e9ch\u00e9 [11]. 
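To make the role of m_κ concrete: the fixed point in (6) is straightforward to evaluate numerically at z = 0, which is all that the expression (7) requires. The following sketch (ours, not from the paper; it assumes NumPy/SciPy and uses the form of (6) as stated above) solves for m_κ(0) by bracketing a root and approximates m′_κ(0) by a one-sided finite difference at a small negative z.

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    kappa, alpha, beta = 1.0, 1.0, 0.3   # a point in the interpolating regime (alpha > beta)

    def G(m):
        # (1/beta) * integral_{alpha^{-kappa}}^{inf} dt / (kappa * t^{1/kappa} * (1 + t*m))
        val, _ = quad(lambda t: 1.0 / (kappa * t ** (1.0 / kappa) * (1.0 + t * m)),
                      alpha ** (-kappa), np.inf)
        return val / beta

    def m_of(z):
        # Root of  1/m - G(m) + z = 0, i.e., equation (6); bracket chosen heuristically.
        return brentq(lambda m: 1.0 / m - G(m) + z, 1e-8, 1e6)

    m0 = m_of(0.0)
    eps = 1e-6
    m0_prime = (m0 - m_of(-eps)) / eps        # finite-difference estimate of m'(0)
    print("m_kappa(0)  =", m0)
    print("m_kappa'(0) =", m0_prime)          # both positive, as in Theorem 2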
Assume we can exchange the two\nlimits \u00b5 ! 0+ and N ! 1, and also that (18) still holds for f\uf8ff(s) which has unbounded support.\nThen, from (17), we conclude\n\n.\n\n\u00b5!0\n\npart 2\n\nN 1\uf8ff\nm\uf8ff(0)\n\nPX P X>\n\nFor part 2, note that X P c is independent of X P . Thus, conditional on X P , part 2 is a sum of N p\nindependent random variables. Therefore, using Markov inequality, we can show that\n\npart 1 = tr(\u2303PI \u21e7X P) p!\np! EX P c [part 2] = tr (\u2303P c) \u00b7\u2713tr\u21e3\u2303P X>\n= tr (\u2303P c) \u00b7 lim\n= tr (\u2303P c) \u00b7\u2713 lim\n(19)\nAgain, if we ignore the fact that the support of f\uf8ff(s) is unbounded and assume the limits of \u00b5 ! 0\nand N ! 1 can be exchanged, then by Lemma 7.4 of Dobriban and Wager [6], we have\npart 2 p! tr (\u2303P c) \u00b7\u2713 lim\nm0\uf8ff(0)\nm2\n\uf8ff(0)\nA straightforward analysis of tr(\u2303P c) (as in (10)) completes the analysis of part 2 of (16).\nRemark 1. Although Theorem 2 should intuitively hold given the results of Dobriban and Wager\n[6], a careful and more involved argument is needed to deal with the facts that k \u02dc\u2303Pk2 ! 1 (since\nn tr( \u02dc\u2303P \u02dcSn) = Op(N \uf8ff).\nk\u23031\nHowever, we need the stronger bound \u00b5\n\nP2 X P\u2318 + 1\u25c6\n\u02dcX P\u21e3 \u02dcX>\nn\u2318 + 1\u25c6 .\nn\u2318 + 1\u25c6 p! tr (\u2303P c) \u00b7\n\ntr\u2713 \u02dc\u2303P\u21e3 \u02dcX>\ntr\u21e3 \u02dc\u2303P \u02dcSn\u2318 \ntr\u21e3 \u02dc\u2303P \u02dcSn\u2318 \n\n\u02dcX P + \u00b5nI\u23181\ntr\u21e3 \u02dc\u2303P \u02dcS\ntr\u21e3 \u02dc\u2303P \u02dcS\n\nP k2 ! 1) and \u00b5 ! 0. For example, standard techniques only imply \u00b5\n\n\u02dcX P + \u00b5nI\u23181\u25c6 + 1!\n\nn tr( \u02dc\u2303P \u02dcSn) = Op(1) (e.g., Appendix B.2.2).\n\n\u02dcX>\nP\n\n\u00b5!0\n\n\u00b5!0\n\n(20)\n\n\u00b5\nn\n\n\u00b5\nn\n\n1\nn\n\n1\nn\n\nP\n\n2\n\n2\n\nP\n\n.\n\n2.4 Proof of Theorem 3\nComparing the expression for R\uf8ff(\u21b5) in (7) at \u21b5 = 1 to the expression for R\uf8ff(\u21b5\u21e4) in (5), we see that\nit suf\ufb01ces to prove m\uf8ff(0)1/\uf8ff >\u21b5 \u21e4. Recall that in Section 2.3, we have proved s\u21e4\uf8ff := m\uf8ff(0)1/\uf8ff is\nthe unique solution of the equation q\uf8ff(s, 1) = 0. Furthermore, using the expression for the derivative\nof q\uf8ff(s, 1) with respect to s in (14), we know that q(s, 1) > 0 ) s < s\u21e4\uf8ff. Thus, we only need to\nshow q\uf8ff(\u21b5\u21e4, 1) > 0 = h\uf8ff(\u21b5\u21e4), where the equality is due to the de\ufb01nition of \u21b5\u21e4 in Theorem 1. Note\nthat by the de\ufb01nitions of the functions q\uf8ff and h\uf8ff in (3) and (13), we have\n\nh\uf8ff(s) =\n\n\n\ns Z 1\n\ns\n\nt\uf8ff2 dt 1 = q\uf8ff(s, 1) +Z 1\n\ns\n\nt\uf8ff2\n\n(1 + t\uf8ff)\n\ndt Z 1\n\ns\n\nt\uf8ff2 dt 1.\n\nFurthermore, h\uf8ff(s) q\uf8ff(s, 1) is increasing in s:\n= \n\nd (h\uf8ff(s) q\uf8ff(s, 1))\nHence, for all for all s 2 (0, 1], we have\n\nds\n\ns\uf8ff2\n\n(1 + s\uf8ff)\n\n+ s\uf8ff2 =\n\ns2\uf8ff2\n1 + s\uf8ff > 0.\n\nh\uf8ff(s) q\uf8ff(s, 1) \uf8ff h\uf8ff(1) q\uf8ff(1, 1) = Z 1\ndt Z 1\n\n= Z 1\n\nt\uf8ff2\n\n1\n\n1\n\nt\uf8ff2\n\ndt 1\n\n(1 + t\uf8ff)\n\n1\n\nt2 dt = Z 1\n\n(1 + t\uf8ff)\n\nt2(1 + t\uf8ff)\nSince \u21b5\u21e4 << 1, we have 0 = h\uf8ff(\u21b5\u21e4) < q\uf8ff(\u21b5\u21e4, 1), and thus we have s\u21e4\uf8ff >\u21b5 \u21e4.\nBy inspection of the expression for R\uf8ff(\u21b5) in (4), it is also clear that R\uf8ff(\u21b5)/R\uf8ff(1) ! 1 as\n\u21b5 ! 
.\n\n1\n\n1\n\n1\n\ndt < 0.\n\n7\n\n\f3 Analysis under general eigenvalue decay\n\n1\n\nj=1\n\n{j\u232bN}.\n\nIn this section, we extend the results from Section 2 (with noise) to hold under a more general\nassumption on the eigenvalues of \u2303. To simplify calculations, we use a slightly different feature\n\nInstead of Assumptions A.1 and A.2, we assume the following:\n\nselection procedure that includes all components j such that j \u232bN, so p =PN\nB.1 k\u2303k2 \uf8ff C for some constant C > 0. Also, there exists a positive sequence (cN )N1\nsuch that the empirical eigenvalue distribution of cN \u2303 converges as N ! 1 to F =\n(1 )F0 + F1, where 2 (0, 1], F0 is a point mass of 0, and F1 has a continuous\nprobability density f supported on either [\u23181,\u2318 2] or [\u23181,1) for some constants \u23181,\u2318 2 > 0.\nB.2 There exist constants \u232b> 0 and 2 (0, ) s.t. \u232bN cN ! \u232b and n/N ! as n, N ! 1.\nThe cN in Assumption B.1 generalizes the N \uf8ff scaling introduced in the proof of Theorem 2. In\nfact, Assumption B.1 is more general than the eigenvalue assumptions made by Dobriban and Wager\n[6] and Hastie et al. [8]: the eigenvalues of \u2303 could decrease smoothly ( = 1), or there could be a\nsudden drop between (say) j and j+1 (< 1). Since p is now determined by \u232b, whether p < n\nor p > n is now determined by whether \u232b>\u232b b or \u232b<\u232b b, where \u232bb >\u2318 1 is given by the equation\n\nf (t) dt = . Finally, by Assumption B.1,\n\np\nN\n\n=\n\n1\nN\n\nNXj=1\n\n1\n\n{cN j\u232b}\n\na.s.! Es\u21e0f [1\n\n{s\u232b}] = Z 1\n\n\u232b\n\nf (t) dt =: \u21b5(\u232b),\n\n8\u232b> 0.\n\n(21)\n\nFor \u232b = 0, i.e., \u232bN = o(1/cN ), we choose \u232bN be the N largest eigenvalues of \u2303, then \u21b5(\u232b) = .\nHence, combined with Assumption B.2, we have the same asymptotics considered in Section 2,\nexcept that is now restricted in (0, ). This restriction on is required, otherwise both cN X>X\nand cN XX> are asymptotically singular.\nThe following theorem generalizes the results in Section 2 to hold under Assumptions B.1 and B.2.\nTheorem 4. Assume B.1 with sequence (cN )N1 and constants C, , \u23181, and \u23182; and B.2 with\nconstants \u232b and .\n\nR 1\n\n\u232bb\n\n=: Rf (\u232b, ).\n\n(22)\n\ntf (t) dt. If the equation hf (\u232b) = 0 has a\n\n\u23181\n\n\u23181\n\n\n\u232b f (t) dt\n\ntf (t) dt + 2!\n\u232b f (t) dt R \u232b\nRf (\u232b\u21e4, 0) = min\n\n(i) Assume \u232b 2 (\u232bb,1). Then\nEw,\u2713[Error] p! N\ncN \u00b7 Z \u232b\nDe\ufb01ne hf (\u232b) := \u232b \u232bR 1\nsolution on (\u232bb,1)T supp(f ), then the solution \u232b\u21e4 is unique, and\nN\ncN \u00b7 \u232b\u21e4.\nZ 1\n\n\u232b2(\u232bb,1)Rf (\u232b, 0) = lim\n\n\u232b2(\u232bb,1)Rf (\u232b, 0) =\n\n R 1\n\n\u232b!1Rf (\u232b, 0) =\n(ii) Assume \u232b 2 [0,\u232b b). De\ufb01ne qf (s, \u232b) := s sR 1\nR \u232b\ns\u21e4fR 1\n\nEw,\u2713[Error]\n\nwhere s\u21e4f is the unique solution of the equation qf (s, \u232b) = 0.\n\ns+t dt. Then\ntf (t) dt + 2\n(s\u21e4f +t)2 dt\n\ntf (t)\n\ns\u21e4f + \u00b7\n\nOtherwise,\n\nN\ncN\n\n\u232b\n\n\u23181\n\n\u232b\n\nN\ncN\n\np!\n\nN\ncN\n\ntf (t)\n\ninf\n\n(23)\n\n(24)\n\ntf (t) dt.\n\n\u23181\n\n=: Rf (\u232b, ),\n\n(25)\n\n(iii) Suppose = 0. Let \u232b\u21e4 be the minimizer of Rf (\u232b, 0) over the interval (\u232bb,1] (including 1).\nLet Rf (\u23181, 0) be the risk achieved at \u232b = \u23181. 
Then lim sup_{N→∞} R_f(η_1, 0)/R_f(ν*, 0) < 1.

The proof of this theorem is presented in Appendix D.

4 Discussion

Our results confirm the emergence of the “double descent” risk curve in a natural setting with Gaussian design. As in previous works [e.g., 3, 8, 13], the shape emerges when there is a spike at the interpolation threshold (p = n), which is typically caused by a near-zero minimum eigenvalue of the empirical covariance matrix.

More importantly, however, our results shed light on when the minimum risk is achieved before or after the interpolation threshold in terms of the noise level and eigenvalues of the (population) covariance matrix. For instance, when the eigenvalues decay very slowly or not at all (κ < 1), a smaller risk is achieved after the interpolation threshold (p > n) than at any point before (p < n). On the other hand, when the eigenvalues decay more quickly (κ > 1), a smaller risk is achieved in the p > n regime only in the noiseless setting. In general, the p < n regime yields a smaller risk when the noise dominates the error due to model misspecification. Providing a full characterization is an important direction for future research.

Finally, we point out that the PCR estimator we study is a non-standard “oracle” estimator because it generally requires knowledge of Σ. Although it can be plausibly implemented in a semi-supervised setting (by estimating Σ very accurately using unlabeled data), a full analysis that accounts for estimation errors in Σ, or of a more standard PCR estimator, remains open. However, we note that the PCR estimator with p = N can be implemented, and in our analysis, the dominance of the p > n regime is always established at p = N. We believe that this should be true for the standard PCR estimator as well.

Acknowledgments

This research was supported by NSF CCF-1740833, a Sloan Research Fellowship, a Google Faculty Award, and a Cheung-Kong Graduate School of Business Fellowship.

References

[1] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.

[2] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

[3] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.

[4] Leo Breiman and David Freedman. How many variables should be entered in a regression equation? Journal of the American Statistical Association, 78(381):131–136, 1983.

[5] Lee H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.

[6] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.

[7] L. Lo Gerfo, Lorenzo Rosasco, Francesca Odone, Ernesto De Vito, and Alessandro Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.

[8] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[9] Ian Jolliffe. Principal Component Analysis. Springer, 2011.

[10] Beatrice Laurent and Pascal Massart. 
Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302–1338, 2000.

[11] Olivier Ledoit and Sandrine Péché. Eigenvectors of some large sample covariance matrix ensembles. Probability Theory and Related Fields, 151(1-2):233–264, 2011.

[12] Peter Mathé. Saturation of regularization methods for linear ill-posed problems in Hilbert spaces. SIAM Journal on Numerical Analysis, 42(3):968–973, 2004.

[13] Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. arXiv preprint arXiv:1903.09139, 2019.

[14] Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

[15] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[16] Jack W. Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995.

[17] Stefano Spigler, Mario Geiger, Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.

[18] Antonia M. Tulino and Sergio Verdú. Random matrix theory and wireless communications. Foundations and Trends in Communications and Information Theory, 1(1):1–182, 2004.

[19] Ji Xu and Daniel Hsu. On the number of variables to use in principal component regression. arXiv preprint arXiv:1906.01139, 2019.

[20] Ji Xu, Arian Maleki, and Kamiar Rahnama Rad. Consistent risk estimation in high-dimensional linear regression. arXiv preprint arXiv:1902.01753, 2019.
", "award": [], "sourceid": 2794, "authors": [{"given_name": "Ji", "family_name": "Xu", "institution": "Columbia University"}, {"given_name": "Daniel", "family_name": "Hsu", "institution": "Columbia University"}]}