Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back

Part of Advances in Neural Information Processing Systems 29 (NIPS 2016)

Bibtex Metadata Paper Reviews Supplemental


Vitaly Feldman


In stochastic convex optimization the goal is to minimize a convex function $F(x) \doteq \E_{f\sim D}[f(x)]$ over a convex set $\K \subset \R^d$ where $D$ is some unknown distribution and each $f(\cdot)$ in the support of $D$ is convex over $\K$. The optimization is based on i.i.d.~samples $f^1,f^2,\ldots,f^n$ from $D$. A common approach to such problems is empirical risk minimization (ERM) that optimizes $F_S(x) \doteq \frac{1}{n}\sum_{i\leq n} f^i(x)$. Here we consider the question of how many samples are necessary for ERM to succeed and the closely related question of uniform convergence of $F_S$ to $F$ over $\K$. We demonstrate that in the standard $\ell_p/\ell_q$ setting of Lipschitz-bounded functions over a $\K$ of bounded radius, ERM requires sample size that scales linearly with the dimension $d$. This nearly matches standard upper bounds and improves on $\Omega(\log d)$ dependence proved for $\ell_2/\ell_2$ setting in (Shalev-Shwartz et al. 2009). In stark contrast, these problems can be solved using dimension-independent number of samples for $\ell_2/\ell_2$ setting and $\log d$ dependence for $\ell_1/\ell_\infty$ setting using other approaches. We also demonstrate that for a more general class of range-bounded (but not Lipschitz-bounded) stochastic convex programs an even stronger gap appears already in dimension 2.