Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Denny Wu, Ji Xu
We consider the linear model $\vy=\vX\vbeta_{\star}+\vepsilon$ with $\vX\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\vbeta_{\star}$ via generalized (weighted) ridge regression: $\hat{\vbeta}_{\lambda}=\left(\vX^{\t}\vX+\lambda\vSigma_w\right)^{\dagger}\vX^{\t}\vy$, where $\vSigma_w$ is the weighting matrix. Assuming a random effects model with general data covariance $\vSigma_x$ and an anisotropic prior on the true coefficients $\vbeta_{\star}$, i.e., $\bbE\vbeta_{\star}\vbeta_{\star}^{\t}=\vSigma_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\vx^{\t}\hat{\vbeta}_{\lambda})^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that determine the sign of the optimal setting $\lambda_{\opt}$ of the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\opt}$ can be \textit{negative} in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when $\vX$ and $\vbeta_{\star}$ are both anisotropic. Finally, we determine the optimal weighting matrix $\vSigma_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\opt}$) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
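To make the estimator and the prediction risk concrete, here is a minimal NumPy sketch of the weighted ridge fit $\hat{\vbeta}_{\lambda}=(\vX^{\t}\vX+\lambda\vSigma_w)^{\dagger}\vX^{\t}\vy$ and the resulting risk $\mathbb{E}(y-\vx^{\t}\hat{\vbeta}_{\lambda})^2$. The specific choices of $n$, $p$, noise level, and isotropic $\vSigma_x$, $\vSigma_\beta$, $\vSigma_w$ are toy assumptions for illustration only, not the paper's setup or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized setting p > n (illustrative values, not from the paper).
n, p = 200, 400
sigma_eps = 0.5

Sigma_w = np.eye(p)                               # weighting matrix (identity = standard ridge)
Sigma_x = np.eye(p)                               # data covariance (isotropic for this sketch)
X = rng.standard_normal((n, p))                   # rows drawn with covariance Sigma_x = I
beta_star = rng.standard_normal(p) / np.sqrt(p)   # prior E[beta beta^T] = I/p (isotropic toy prior)
y = X @ beta_star + sigma_eps * rng.standard_normal(n)

def weighted_ridge(X, y, lam, Sigma_w):
    """Generalized ridge estimator: (X^T X + lam * Sigma_w)^+ X^T y."""
    return np.linalg.pinv(X.T @ X + lam * Sigma_w) @ (X.T @ y)

def prediction_risk(beta_hat, beta_star, sigma_eps, Sigma_x):
    """E(y0 - x0^T beta_hat)^2 for a fresh point (x0, y0) from the same model."""
    d = beta_hat - beta_star
    return sigma_eps**2 + d @ Sigma_x @ d

# Negative lambda can be admissible when p > n, as discussed in the abstract.
for lam in [-0.5, 0.0, 0.5, 1.0]:
    beta_hat = weighted_ridge(X, y, lam, Sigma_w)
    print(f"lambda = {lam:5.2f}   risk = {prediction_risk(beta_hat, beta_star, sigma_eps, Sigma_x):.4f}")
```

At $\lambda \to 0$ the pseudoinverse returns the minimum-norm (ridgeless) interpolator; replacing `Sigma_w` with a non-identity positive definite matrix gives the weighted objective whose optimal choice the paper characterizes.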