{"title": "On Optimal Generalizability in Parametric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3455, "page_last": 3465, "abstract": "We consider the parametric learning problem, where the objective of the learner is determined by a parametric loss function. Employing empirical risk minimization with possibly regularization, the inferred parameter vector will be biased toward the training samples. Such bias is measured by the cross validation procedure in practice where the data set is partitioned into a training set used for training and a validation set, which is not used in training and is left to measure the out-of-sample performance. A classical cross validation strategy is the leave-one-out cross validation (LOOCV) where one sample is left out for validation and training is done on the rest of the samples that are presented to the learner, and this process is repeated on all of the samples. LOOCV is rarely used in practice due to the high computational complexity. In this paper, we first develop a computationally efficient approximate LOOCV (ALOOCV) and provide theoretical guarantees for its performance. Then we use ALOOCV to provide an optimization algorithm for finding the regularizer in the empirical risk minimization framework. In our numerical experiments, we illustrate the accuracy and efficiency of ALOOCV as well as our proposed framework for the optimization of the regularizer.", "full_text": "On Optimal Generalizability in Parametric Learning\n\nAhmad Beirami\u2217\n\nbeirami@seas.harvard.edu\n\nMeisam Razaviyayn\u2020\nrazaviya@usc.edu\n\nShahin Shahrampour\u2217\n\nshahin@seas.harvard.edu\n\nVahid Tarokh\u2217\n\nvahid@seas.harvard.edu\n\nAbstract\n\nWe consider the parametric learning problem, where the objective of the learner is\ndetermined by a parametric loss function. Employing empirical risk minimization\nwith possibly regularization, the inferred parameter vector will be biased toward\nthe training samples. Such bias is measured by the cross validation procedure\nin practice where the data set is partitioned into a training set used for training\nand a validation set, which is not used in training and is left to measure the out-\nof-sample performance. A classical cross validation strategy is the leave-one-out\ncross validation (LOOCV) where one sample is left out for validation and training\nis done on the rest of the samples that are presented to the learner, and this process\nis repeated on all of the samples. LOOCV is rarely used in practice due to the\nhigh computational complexity. In this paper, we \ufb01rst develop a computationally\nef\ufb01cient approximate LOOCV (ALOOCV) and provide theoretical guarantees for\nits performance. Then we use ALOOCV to provide an optimization algorithm\nfor \ufb01nding the regularizer in the empirical risk minimization framework. In our\nnumerical experiments, we illustrate the accuracy and ef\ufb01ciency of ALOOCV as\nwell as our proposed framework for the optimization of the regularizer.\n\nIntroduction\n\n1\nWe consider the parametric supervised/unsupervised learning problem, where the objective of the\nlearner is to build a predictor based on a set of historical data. Let zn = {zi}n\ni=1, where zi \u2208\nZ denotes the data samples at the learner\u2019s disposal that are assumed to be drawn i.i.d. from an\nunknown density function p(\u00b7), and Z is compact.\nWe assume that the learner expresses the objective in terms of minimizing a parametric loss function\n(cid:96)(z; \u03b8), which is a function of the parameter vector \u03b8. The learner solves for the unknown parameter\nvector \u03b8 \u2208 \u0398 \u2286 Rk, where k denotes the number of parameters in the model class, and \u0398 is a\nconvex, compact set.\nLet\n\nL(\u03b8) (cid:44) E{(cid:96)(z; \u03b8)}\n\n(1)\n\nbe the risk associated with the parameter vector \u03b8, where the expectation is with respect to the den-\nsity p(\u00b7) that is unknown to the learner. Ideally, the goal of the learner is to choose the parameter\n\u2217 \u2208 arg min\u03b8\u2208\u0398 L(\u03b8) = arg min\u03b8\u2208\u0398 E{(cid:96)(z; \u03b8)}. Since the density function\nvector \u03b8\np(\u00b7) is unknown, the learner cannot compute \u03b8\n\u2217 and hence cannot achieve the ideal performance of\n) = min\u03b8\u2208\u0398 L(\u03b8) associated with the model class \u0398. Instead, one can consider the minimiza-\nL(\u03b8\n\n\u2217 such that \u03b8\n\n\u2217\n\n\u2217School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.\n\u2020Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA\n\n90089, USA.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f(cid:98)\u03b8(zn) \u2208 arg min\n\n\u03b8\u2208\u0398\n\n(cid:96)(zi; \u03b8) + r(\u03b8),\n\n(cid:88)\n\ni\u2208[n]\n\ntion of the empirical version of the problem through the empirical risk minimization framework:\n\nwhere [n] (cid:44) {1, 2, . . . , n} and r(\u03b8) is some regularization function. While the learner can eval-\nuate her performance on the training data samples (also called the in-sample empirical risk, i.e.,\n1\nn\n\n(cid:80)n\ni=1 (cid:96)(zi;(cid:98)\u03b8(zn))), it is imperative to assess the average performance of the learner on fresh test\nsamples, i.e., L((cid:98)\u03b8(zn)), which is referred to as the out-of-sample risk. A simple and universal ap-\n\nproach to measuring the out-of-sample risk is cross validation [1]. Leave-one-out cross validation\n(LOOCV), which is a popular exhaustive cross validation strategy, uses (n \u2212 1) of the samples for\ntraining while one sample is left out for testing. This procedure is repeated on the n samples in a\nround-robin fashion, and the learner ends up with n estimates for the out-of-sample loss correspond-\ning to each sample. These estimates together form a cross validation vector which can be used for\nthe estimation of the out-of-sample performance, model selection, and tuning the model hyperpa-\nrameters. While LOOCV provides a reliable estimate of the out-of-sample loss, it brings about an\nadditional factor of n in terms of computational cost, which makes it practically impossible because\nof the high computational cost of training when the number of samples is large.\nContribution: Our \ufb01rst contribution is to provide an approximation for the cross validation vector,\ncalled ALOOCV, with much lower computational cost. We compare its performance with LOOCV\nin problems of reasonable size where LOOCV is tractable. We also test it on problems of large size\nwhere LOOCV is practically impossible to implement. We describe how to handle quasi-smooth\nloss/regularizer functions. We also show that ALOOCV is asymptotically equivalent to Takeuchi\ninformation criterion (TIC) under certain regularity conditions.\nOur second contribution is to use ALOOCV to develop a gradient descent algorithm for jointly\noptimizing the regularization hyperparameters as well as the unknown parameter vector \u03b8. We show\nthat multiple hyperparameters could be tuned using the developed algorithm. We emphasize that\nthe second contribution would not have been possible without the developed estimator as obtaining\nthe gradient of the LOOCV with respect to tuning parameters is computationally expensive. Our\nexperiments show that the developed method handles quasi-smooth regularized loss functions as\nwell as number of tuning parameters that is on the order of the training samples.\nFinally, it is worth mentioning that although the leave-one-out cross validation scenario is considered\nin our analyses, the results and the algorithms can be extended to the leave-q-out cross validation\nand bootstrap techniques.\nRelated work: A main application of cross validation (see [1] for a recent survey) is in model\nselection [2\u20134]. On the theoretical side, the proposed approximation on LOOCV is asymptotically\nequivalent to Takeuchi information criterion (TIC) [4\u20137], under certain regularity conditions (see [8]\nfor a proof of asymptotic equivalence of AIC and LOOCV in autoregressive models). This is also\nrelated to Barron\u2019s predicted square error (PSE) [9] and Moody\u2019s effective number of parameters\nfor nonlinear systems [10]. Despite these asymptotic equivalences our main focus is on the non-\nasymptotic performance of ALOOCV.\nALOOCV simpli\ufb01es to the closed form derivation of the LOOCV for linear regression, called\nPRESS (see [11, 12]). Hence, this work can be viewed as an approximate extension of this closed\nform derivation for an arbitrary smooth regularized loss function. This work is also related to the\nconcept of in\ufb02uence functions [13], which has recently received renewed interest [14]. In contrast to\nmethods based on in\ufb02uence functions that require large number of samples due to their asymptotic\nnature, we empirically show that the developed ALOOCV works well even when the number of\nsamples and features are small and comparable to each other. In particular, ALOOCV is capable\nof predicting over\ufb01tting and hence can be used for model selection and choosing the regularization\nhyperparameter. Finally, we expect that the idea of ALOOCV can be extended to derive computa-\ntionally ef\ufb01cient approximate bootstrap estimators [15].\nOur second contribution is a gradient descent optimization algorithm for tuning the regularization\nhyperparameters in parametric learning problems. A similar approach has been taken for tuning\nthe single parameter in ridge regression where cross validation can be obtained in closed form [16].\nMost of the existing methods, on the other hand, ignore the response and carry out the optimiza-\ntion solely based on the features, e.g., Stein unbiased estimator of the risk for multiple parameter\nselection [17, 18].\n\n2\n\n\fBayesian optimization has been used for tuning the hyperparameters in the model [19\u201323], which\npostulates a prior on the parameters and optimizes for the best parameter. Bayesian optimization\nmethods are generally derivative free leading to slow convergence rate. In contrast, the proposed\nmethod is based on a gradient descent method. Other popular approaches to the tuning of the opti-\nmization parameters include grid search and random search [24\u201326]. These methods, by nature, also\nsuffer from slow convergence. Finally, model selection has been considered as a bi-level optimiza-\ntion [27,28] where the training process is modeled as a second level optimization problem within the\noriginal problem. These formulations, similar to many other bi-level optimization problems, often\nlead to computationally intensive algorithms that are not scalable.\nWe remark that ALOOCV can also be used within Bayesian optimization, random search, and grid\nsearch methods. Further, resource allocation can be used for improving the optimization perfor-\nmance in all of such methods.\n\n2 Problem Setup\n\nTo facilitate the presentation of the ideas, let us de\ufb01ne the following concepts. Throughout, we\nassume that all the vectors are in column format.\n\nDe\ufb01nition 1 (regularization vector/regularized loss function) We suppose that\nthe learner is\nconcerned with M regularization functions r1(\u03b8), . . . , rM (\u03b8) in addition to the main loss function\n(cid:96)(z; \u03b8). We de\ufb01ne the regularization vector r(\u03b8) as\n\nr(\u03b8) (cid:44) (r1(\u03b8), . . . , rM (\u03b8))(cid:62).\n\nFurther, let \u03bb = (\u03bb1, . . . , \u03bbM )(cid:62) be the vector of regularization parameters. We call wn(z; \u03b8, \u03bb) the\nregularized loss function given by\n\nwn(z; \u03b8, \u03bb) (cid:44) (cid:96)(z; \u03b8) +\n\n(cid:62)\n\n\u03bb\n\n1\nn\n\nr(\u03b8) = (cid:96)(z; \u03b8) +\n\n1\nn\n\n\u03bbmrm(\u03b8).\n\n(cid:88)\n\nm\u2208[M ]\n\nn\n\n1\nn\n\nThe above de\ufb01nition encompasses many popular learning problems. For example, elastic net regres-\nsion [31] can be cast in this framework by setting r1(\u03b8) = (cid:107)\u03b8(cid:107)1 and r2(\u03b8) = 1\nDe\ufb01nition 2 (empirical risk/regularized empirical risk) Let\n\n(cid:80)n\ni=1 (cid:96)(zi; \u03b8). Similarly, let the regularized empirical risk be de\ufb01ned as(cid:99)Wzn (\u03b8, \u03bb) =\n\nthe empirical risk be de\ufb01ned as\n\n2(cid:107)\u03b8(cid:107)2\n2.\n\nDe\ufb01nition 3 (regularized empirical risk minimization) We suppose that the learner solves the\n\n(cid:98)Lzn (\u03b8) = 1\n(cid:80)n\ni=1{wn(zi; \u03b8, \u03bb)}.\nempirical risk minimization problem by selecting(cid:98)\u03b8\u03bb(zn) as follows:\n\uf8f1\uf8f2\uf8f3(cid:88)\n\n(cid:110)(cid:99)Wzn (\u03b8, \u03bb)\n(cid:111)\n(cid:98)\u03b8\u03bb(zn) \u2208 arg min\nOnce the learner solves for(cid:98)\u03b8\u03bb(zn), the empirical risk corresponding to(cid:98)\u03b8\u03bb(zn) can be readily com-\n(cid:80)\nputed by (cid:98)Lzn ((cid:98)\u03b8\u03bb(zn)) = 1\ni\u2208[n] (cid:96)(zi;(cid:98)\u03b8\u03bb(zn)). While the learner can evaluate her performance\non the observed data samples (also called the in-sample empirical risk, i.e., (cid:98)Lzn((cid:98)\u03b8\u03bb(zn))), it is\nimperative to assess the performance of the learner on unobserved fresh samples, i.e., L((cid:98)\u03b8\u03bb(zn))\n\n(see (1)), which is referred to as the out-of-sample risk. To measure the out-of-sample risk, it is\na common practice to perform cross validation as it works outstandingly well in many practical\nsituations and is conceptually universal and simple to implement.\nLeave-one-out cross validation (LOOCV) uses all of the samples but one for training, which is left\nout for testing, leading to an n-dimensional cross validation vector of out-of-sample estimates. Let\nus formalize this notion. Let zn\\i (cid:44) (z1, . . . , zi\u22121, zi+1, . . . , zn) denote the set of the training\nexamples excluding zi.\n\n\uf8fc\uf8fd\uf8fe .\n\n(cid:96)(zi; \u03b8) + \u03bb\n\nr(\u03b8)\n\n(cid:62)\n\n\u03b8\u2208\u0398\n\n= arg min\n\u03b8\u2208\u0398\n\ni\u2208[n]\n\n(2)\n\nn\n\n3\n\n\fDe\ufb01nition 4 (LOOCV empirical risk minimization/cross validation vector) Let (cid:98)\u03b8\u03bb(zn\\i) be\n\nthe estimated parameter over the training set zn\\i, i.e.,\n\n(cid:98)\u03b8\u03bb(zn\\i) \u2208 arg min\n\n\u03b8\u2208Rk\n\n(cid:110)(cid:99)Wzn\\i(\u03b8, \u03bb)\n\n(cid:111)\n\n= arg min\n\u03b8\u2208Rk\n\n\uf8f1\uf8f2\uf8f3 (cid:88)\n\n\uf8fc\uf8fd\uf8fe .\n\nThe cross validation vector is given by {CV\u03bb,i(zn)}i\u2208[n] where CV\u03bb,i(zn) (cid:44) (cid:96)(zi;(cid:98)\u03b8\u03bb(zn\\i)), and\n\nj\u2208[n]\\i\n\nthe cross validation out-of-sample estimate is given by CV\u03bb(zn) (cid:44) 1\n\n(cid:80)n\n\ni=1 CV\u03bb,i(zn).\n\nn\n\n(cid:96)(zj; \u03b8) + \u03bb\n\nr(\u03b8)\n\n(cid:62)\n\n(3)\n\nThe empirical mean and the empirical variance of the n-dimensional cross validation vector are used\nby practitioners as surrogates on assessing the out-of-sample performance of a learning method.\nThe computational cost of solving the problem in (3) is n times that of the original problem in (2).\nHence, while LOOCV provides a simple yet powerful tool to estimate the out-of-sample perfor-\nmance, the additional factor of n in terms of the computational cost makes LOOCV impractical\nin large-scale problems. One common solution to this problem is to perform validation on fewer\nnumber of samples, where the downside is that the learner would obtain a much more noisy and\nsometimes completely unstable estimate of the out-of-sample performance compared to the case\nwhere the entire LOOCV vector is at the learner\u2019s disposal. On the other hand, ALOOCV described\nnext will provide the bene\ufb01ts of LOOCV with negligible additional computational cost.\nWe emphasize that the presented problem formulation is general and includes a variety of parametric\nmachine learning tasks, where the learner empirically solves an optimization problem to minimize\nsome loss function.\n\n3 Approximate Leave-One-Out Cross Validation (ALOOCV)\nWe assume that the regularized loss function is three times differentiable with continuous derivatives\n(see Assumption 1). This includes many learning problems, such as the L2 regularized logistic loss\nfunction. We later comment on how to handle the (cid:96)1 regularizer function in LASSO. To proceed,\nwe need one more de\ufb01nition.\nDe\ufb01nition 5 (Hessian/empirical Hessian) Let H(\u03b8) denote the Hessian of the risk function de\ufb01ned\nas H(\u03b8) (cid:44) \u22072\n\n\u03b8L(\u03b8). Further, let (cid:98)Hzn(\u03b8, \u03bb) denote the empirical Hessian of the regularized loss\n\nfunction, de\ufb01ned as (cid:98)Hzn (\u03b8, \u03bb) (cid:44) (cid:98)Ezn\nde\ufb01ne (cid:98)Hzn (\u03b8, \u03bb) (cid:44) (cid:98)Ezn\\i\n\n(cid:8)\u22072\n\u03b8wn(z; \u03b8, \u03bb)(cid:9) = 1\n\n(cid:8)\u22072\n\u03b8wn(z; \u03b8, \u03bb)(cid:9) = 1\n\n(cid:80)n\n(cid:80)\ni=1 \u22072\ni\u2208[n]\\i \u22072\n\u03b8wn(zi; \u03b8, \u03bb).\n\n\u03b8wn(zi; \u03b8, \u03bb). Similarly, we\n\nn\n\nn\u22121\n\nNext we present the set of assumptions we need to prove the main result of the paper.\n\nAssumption 1 We assume that\n\n\u2217 \u2208 \u0398\u25e6,3 such that (cid:107)(cid:98)\u03b8\u03bb(zn) \u2212 \u03b8\n\n(a) There exists \u03b8\n(b) wn(z; \u03b8) is of class C 3 as a function of \u03b8 for all z \u2208 Z.\n(c) H(\u03b8\n\n) (cid:31) 0 is positive de\ufb01nite.\n\n\u2217\n\n\u2217(cid:107)\u221e = op(1).4\n\nTheorem 1 Under Assumption 1, let\n\nassuming the inverse exists. Then,\n\n(i)\n\n(cid:101)\u03b8\n\u03bb (zn) (cid:44)(cid:98)\u03b8\u03bb(zn) +\n(cid:98)\u03b8\u03bb(zn\\i) \u2212(cid:101)\u03b8\n\n(i)\n\u03bb (zn) =\n\n1\n\nn \u2212 1\n\n1\n\nn \u2212 1\n\n(cid:16)(cid:98)Hzn\\i\n(cid:16)(cid:98)Hzn\\i\n\n(cid:16)(cid:98)\u03b8\u03bb(zn), \u03bb\n(cid:16)(cid:98)\u03b8\u03bb(zn), \u03bb\n\n(cid:17)(cid:17)\u22121 \u2207\u03b8(cid:96)(zi;(cid:98)\u03b8\u03bb(zn)),\n(cid:17)(cid:17)\u22121\n\n\u03b5(i)\n\u03bb,n,\n\n(4)\n\n(5)\n\n3(\u00b7)\u25e6 denotes the interior operator.\n4Xn = op(an) implies that Xn/an approaches 0 in probability with respect to the density function p(\u00b7).\n\n4\n\n\fwith high probability where\n\n\u03b5(i)\n\u03bb,n = \u03b5(i),1\n\n\u03bb,n \u2212 \u03b5(i),2\n\u03bb,n ,\n\n(6)\n\n\u03ba\u2208[k]\n\nj\u2208[n]\\i\n\n\u03b5(i),1\n\u03bb,n\n\n(cid:44) 1\n2\n\n(cid:88)\n\nand \u03b5(i),1\n\n(cid:62)(cid:18) \u2202\n\n\u03bb,n is de\ufb01ned as\n\n(cid:19)\nwhere(cid:98)e\u03ba is \u03ba-th standard unit vector, and such that for all \u03ba \u2208 [k], \u03b6i,j,1\n\n(cid:88)\n((cid:98)\u03b8\u03bb(zn) \u2212(cid:98)\u03b8\u03bb(zn\\i))\n)(cid:98)\u03b8\u03bb(zn\\i) for some 0 \u2264 \u03b1i,j,1\n(cid:88)\n\u03bd ((cid:98)\u03b8\u03bb(zn)\u2212(cid:98)\u03b8\u03bb(zn\\i))\n(cid:98)e\n\n\u22072\n\u03b8wn\u22121(zj; \u03b6i,j,1\n\n(cid:44)(cid:88)\n\n(1 \u2212 \u03b1i,j,1\n\n\u03bb,\u03ba (zn), \u03bb)\n\n\u03b5(i),2\n\u03bb,n\n\n\u2202\u03b8\u03ba\n\n(cid:62)\n\n\u03ba\n\nj\u2208[n]\\i\n\n\u03ba,\u03bd\u2208[k]\n\n\u03bb,\u03ba (zn) = \u03b1i,j,1\n\n(7)\n\n((cid:98)\u03b8\u03bb(zn) \u2212(cid:98)\u03b8\u03bb(zn\\i))(cid:98)e\u03ba,\n\u03ba (cid:98)\u03b8\u03bb(zn) +\n(cid:19)\n((cid:98)\u03b8\u03bb(zn)\u2212(cid:98)\u03b8\u03bb(zn\\i))(cid:98)e\u03ba,\n(8)\n\u03ba,\u03bd \u2264 1.\n\n\u2202\u03b8\u03ba\u2202\u03b8\u03bd\n\n\u03bb,\u03ba,\u03bd (zn), \u03bb)\n\n\u03bb,n is de\ufb01ned as\n\n\u2207(cid:62)\n\u03b8 wn\u22121(zj; \u03b6i,j,2\n\n\u03ba \u2264 1. Further, \u03b5(i),2\n(cid:18) \u22022\n\u03ba,\u03bd )(cid:98)\u03b8\u03bb(zn\\i) for some 0 \u2264 \u03b1i,j,2\n\u03ba,\u03bd (cid:98)\u03b8\u03bb(zn) + (1\u2212 \u03b1i,j,2\n(cid:18) 1\n(cid:19)\n(cid:107)(cid:98)\u03b8\u03bb(zn) \u2212(cid:98)\u03b8\u03bb(zn\\i))(cid:107)\u221e = Op\n(cid:19)\n(cid:18) 1\n(cid:107)(cid:98)\u03b8\u03bb(zn\\i) \u2212(cid:101)\u03b8\n\u03bb (zn)(cid:107)\u221e = Op\n\n(i)\n\nn\n\n.\n\n,\n\nn2\n\n(9)\n\n(10)\n\nsuch that for \u03ba, \u03bd \u2208 [k], \u03b6(i),2\nFurther, we have5\n\n\u03bb,\u03ba,\u03bd(zn) = \u03b1i,j,2\n\nSee the appendix for the proof. Inspired by Theorem 1, we provide an approximation on the cross\nvalidation vector.\n\n(cid:18)\n\nzi;(cid:101)\u03b8\n\n(i)\n\u03bb (zn)\n\n(cid:19)\n\n. We call\n\n(11)\n\nDe\ufb01nition 6 (approximate cross validation vector) Let ACV\u03bb,i(zn) = (cid:96)\n{ACV\u03bb,i(zn)}i\u2208[n] the approximate cross validation vector. We further call\n\nACV\u03bb(zn) (cid:44) 1\nn\n\nACV\u03bb,i(zn)\n\nn(cid:88)\n\ni=1\n\nthe approximate cross validation estimator of the out-of-sample loss.\n\n(i)\n\nWe remark that the de\ufb01nition can be extended to leave-q-out and q-fold cross validation by replacing\nthe index i to an index set S with |S| = q, comprised of the q left-out samples in (4).\n\nThe cost of the computation of {(cid:101)\u03b8\nis the computational cost of solving for(cid:98)\u03b8\u03bb(zn) in (2); see [14]. Note that the empirical risk mini-\n\u03bb (zn)}i\u2208[n] is upper bounded by O(np+C(n, p)) where C(n, p)\nof {(cid:101)\u03b8\nmization problem posed in (2) requires time at least \u2126(np). Hence, the overall cost of computation\ncross validation performance by naively solving n optimization problems {(cid:98)\u03b8\u03bb(zn\\i)}i\u2208[n] posed\n\u03bb (zn)}i\u2208[n] is dominated by solving (2). On the other hand, the cost of computing the true\n\nin (3) would be O(nC(n, p)) which would necessarily be \u2126(n2p) making it impractical for large-\nscale problems.\n\n(i)\n\ni \u2208 [n] .\n\nCorollary 2 The approximate cross validation vector is exact for kernel ridge regression. That is,\n\ngiven that the regularized loss function is quadratic in \u03b8, we have (cid:101)\u03b8\n\u03bb (zn) = (cid:98)\u03b8\u03bb(zn\\i) for all\ntion in a neighborhood of(cid:98)\u03b8\u03bb(zn). Hence, provided that the regularized loss function is quadratic in\n\u03bb,n = 0 for all i \u2208 [n].\n5Xn = Op(an) implies that Xn/an is stochastically bounded with respect to the density function p(\u00b7).\n\n\u03bb,n in (6) only depends on the third derivative of the loss func-\n\nProof We notice that the error term \u03b5(i)\n\n\u03b8, \u03b5(i)\n\n(cid:4)\n\n(i)\n\n5\n\n\fThe fact that the cross validation vector could be obtained for kernel ridge regression in closed form\nwithout actually performing cross validation is not new, and the method is known as PRESS [11].\nIn a sense, the presented approximation could be thought of as an extension of this idea to more\ngeneral loss and regularizer functions while losing the exactness property. We remark that the idea\nof ALOOCV is also related to that of the in\ufb02uence functions. In particular, in\ufb02uence functions have\nbeen used in [14] to derive an approximation on LOOCV for neural networks with large sample\nsizes. However, we notice that methods based on in\ufb02uence functions usually underestimate over-\n\ufb01tting making them impractical for model selection. In contrast, we empirically demonstrate the\neffectiveness of ALOOCV in capturing over\ufb01tting and model selection.\n\nIn the case of (cid:96)1 regularizer we assume that the support set of(cid:98)\u03b8\u03bb(zn) and(cid:98)\u03b8\u03bb(zn\\i) are the same.\ngiven sample zn when sample i is left out. Provided that the support set of(cid:98)\u03b8\u03bb(zn\\i) is known we use\n\nAlthough this would be true for large enough n under Assumption 1, it is not necessarily true for a\n\nthe developed machinery in Theorem 1 on the subset of parameters that are non-zero. Further, we\nignore the (cid:96)1 regularizer term in the regularized loss function as it does not contribute to the Hessian\nmatrix locally, and we assume that the regularized loss function is otherwise smooth in the sense\nof Assumption 1. In this case, the cost of calculating ALOOCV would scale with O(npa log(1/\u0001))\n\nwhere pa denotes the number of non-zero coordinates in the solution(cid:98)\u03b8\u03bb(zn).\n\nWe remark that although the nature of guarantees in Theorem 1 are asymptotic, we have experi-\nmentally observed that the estimator works really well even for n and p as small as 50 in elastic net\nregression, logistic regression, and ridge regression. Next, we also provide an asymptotic character-\nization of the approximate cross validation.\nLemma 3 Under Assumption 1, we have\n\nACV\u03bb(zn) = (cid:98)Lzn((cid:98)\u03b8\u03bb(zn)) + (cid:98)Rzn ((cid:98)\u03b8\u03bb(zn), \u03bb) + Op\n(cid:104)(cid:98)Hzn\\i(\u03b8, \u03bb)\n(cid:98)Rzn (\u03b8, \u03bb) (cid:44)\n\n\u2207(cid:62)\n\u03b8 (cid:96)(zi; \u03b8)\n\n(cid:88)\n\n1\n\nn(n \u2212 1)\n\ni\u2208[n]\n\n(cid:19)\n(cid:18) 1\n(cid:105)\u22121 \u2207\u03b8(cid:96)(zi; \u03b8).\n\nn2\n\n,\n\n(12)\n\n(13)\n\nwhere\n\nNote that in contrast to the ALOOCV (in Theorem 1), the Op(1/n2) error term here depends on the\nsecond derivative of the loss function with respect to the parameters, consequently leading to worse\nperformance, and underestimation of over\ufb01tting.\n\n4 Tuning the Regularization Parameters\nThus far, we presented an approximate cross validation vector that closely follows the pre-\ndictions provided by the cross validation vector, while being computationally inexpensive.\nIn this section, we use the approximate cross validation vector\nto tune the regulariza-\ntion parameters for the optimal out-of-sample performance. We are interested in solving\n. To this end, we need to calculate the gradient of\nmin\u03bb\n\n(cid:80)n\n\n(cid:16)\n\ni=1 (cid:96)\n\nCV\u03bb(zn) = 1\nn\n\n(cid:0)zn\\i(cid:1)(cid:17)(cid:17)\n(cid:16)\nzi;(cid:98)\u03b8\u03bb\n(cid:98)\u03b8\u03bb(zn) with respect to \u03bb, which is given in the following lemma.\n(cid:104)(cid:98)Hzn\n(cid:16)(cid:98)\u03b8\u03bb(zn), \u03bb\n(cid:17)(cid:105)\u22121 \u2207\u03b8r((cid:98)\u03b8\u03bb(zn)).\nLemma 4 We have \u2207\u03bb(cid:98)\u03b8\u03bb(zn) = \u2212 1\n(cid:16)(cid:98)\u03b8\u03bb(zn\\i), \u03bb\n(cid:104)(cid:98)Hzn\\i\nCorollary 5 We have \u2207\u03bb(cid:98)\u03b8\u03bb(zn\\i) = \u2212 1\n(cid:16)\n(cid:16)\nzn\\i(cid:17)(cid:17)\nzi;(cid:98)\u03b8\u03bb\n\u03bb(cid:98)\u03b8\u03bb(zn\\i) \u2207\u03b8(cid:96)\n(cid:17)(cid:105)\u22121 \u2207\u03b8(cid:96)\n(cid:16)(cid:98)\u03b8\u03bb(zn\\i)\n(cid:17) (cid:104)(cid:98)Hzn\\i\n(cid:16)(cid:98)\u03b8\u03bb(zn\\i)\n\n\u2207\u03bbCV\u03bb(zn) =\n\nn(cid:88)\n\nn(cid:88)\n\n\u2207(cid:62)\n\u03b8 r\n\n= \u2212\n\n\u2207(cid:62)\n\nn\u22121\n\nn\n\n1\nn\n\ni=1\n\nIn order to apply \ufb01rst order optimization methods for minimizing CV\u03bb(zn), we need to compute its\ngradient with respect to the tuning parameter vector \u03bb. Applying the simple chain rule implies\n\n(cid:16)\nzi;(cid:98)\u03b8\u03bb\n\n(cid:16)\n\nzn\\i(cid:17)(cid:17)\n\n,\n\n(14)\n\n(15)\n\n(cid:17)(cid:105)\u22121 \u2207\u03b8r((cid:98)\u03b8\u03bb(zn\\i)).\n\n1\n\nn(n \u2212 1)\n\ni=1\n\n6\n\n\fFigure 1: The progression of the loss when\nAlgorithm 1 is applied to ridge regression\nwith diagonal regressors.\n\nFigure 2: The progression of \u03bb\u2019s when Al-\ngorithm 1 is applied to ridge regression with\ndiagonal regressors.\n\nwhere (15) follows by substituting \u2207\u03bb(cid:98)\u03b8\u03bb(zn\\i) from Corollary 5. However, (15) is computationally\n\nexpensive and almost impossible in practice even for medium size datasets. Hence, we use the\nALOOCV from (4) (Theorem 1) in (14) to approximate the gradient.\nLet\n\n(cid:18)(cid:101)\u03b8\n\n(cid:19) (cid:20)(cid:98)Hzn\\i\n\n(cid:18)(cid:101)\u03b8\n\n(i)\n\u03bb (zn)\n\n(cid:19)(cid:21)\u22121 \u2207\u03b8(cid:96)\n\n(cid:18)\n\nzi;(cid:101)\u03b8\n\n(cid:19)\n\n(i)\n\u03bb (zn)\n\n.\n\n(16)\n\n\u03bb (zn) (cid:44) \u2212 1\ng(i)\nn \u2212 1\n\n\u2207(cid:62)\n\u03b8 r\n\n(i)\n\u03bb (zn)\n\n(cid:80)\n\nFurther, motivated by the suggested ALOOCV, let us de\ufb01ne the approximate gradient g\u03bb(zn) as\ng\u03bb(zn) (cid:44) 1\n\u03bb (zn) . Based on our numerical experiments, this approximate gradient\nclosely follows the gradient of the cross validation, i.e., \u2207\u03bbCV\u03bb(zn) \u2248 g\u03bb(zn). Note that this\napproximation is straightforward to compute. Therefore, using this approximation, we can apply the\n\ufb01rst order optimization algorithm 1 to optimize the tuning parameter \u03bb. Although Algorithm 1 is\n\ni\u2208[n] g(i)\n\nn\n\nAlgorithm 1 Approximate gradient descent algorithm for tuning \u03bb\n\nInitialize the tuning parameter \u03bb0, choose a step-size selection rule, and set t = 0\nfor t = 0, 1, 2, . . . do\n\ncalculate the approximate gradient g\u03bbt(zn)\nset \u03bbt+1 = \u03bbt \u2212 \u03b1tg\u03bbt(zn)\n\nend for\n\nmore computationally ef\ufb01cient compared to LOOCV (saving a factor of n), it might still be compu-\ntationally expensive for large values of n as it still scales linearly with n. Hence, we also present an\nonline version of the algorithm using the stochastic gradient descent idea; see Algorithm 2.\n\nAlgorithm 2 Stochastic (online) approximate gradient descent algorithm for tuning \u03bb\n\nInitialize the tuning parameter \u03bb0 and set t = 0\nfor t = 0, 1, 2, . . . do\n\nchoose a random index it \u2208 {1, . . . , n}\ncalculate the stochastic gradient g(it)\nset \u03bbt+1 = \u03bbt \u2212 \u03b1tg(it)\n\n\u03bbt (zn)\n\n\u03bbt (zn) using (16)\n\nend for\n\n5 Numerical Experiments\n\nRidge regression with diagonal regressors: We consider the following regularized loss function:\n\nwn(z; \u03b8, \u03bb) = (cid:96)(z; \u03b8) +\n\n(cid:62)\n\n\u03bb\n\n1\nn\n\nr(\u03b8) =\n\n(y \u2212 \u03b8\n\n(cid:62)\n\nx)2 +\n\n1\n2\n\n(cid:62)\n\n\u03b8\n\n1\n2n\n\ndiag(\u03bb)\u03b8.\n\n7\n\n0100200300400500600700800Iteration Number0.20.30.40.50.60.70.8LossALOOCVOut-of-Sample LossElapsed time: 28 seconds0100200300400500600700800Iteration Number0.10.150.20.250.30.35 mean(1,...,m)mean(m+1, ...,p)\f(cid:98)Lzn\n\n0.6578\n0.5810\n0.5318\n0.5152\n0.4859\n0.4456\n\nn\np \u03bb\n1e5\n1e4\n1e3\n1e2\n1e1\n1e0\n\nL\n\n0.6591\n0.6069\n0.5832\n0.5675\n0.5977\n0.6623\n\nACV\n\n0.6578 (0.0041)\n0.5841 (0.0079)\n0.5444 (0.0121)\n0.5379 (0.0146)\n0.5560 (0.0183)\n0.6132 (0.0244)\n\n(cid:96)(zi;(cid:98)\u03b8\u03bb(zn))\n\n0.0872\n0.0920\n0.0926\n0.0941\n0.0950\n0.0990\n0.1505\n\nCV\n8.5526\n2.1399\n10.8783\n3.5210\n5.7753\n5.2626\n12.0483\n\nACV\n8.6495\n2.1092\n9.4791\n3.3162\n6.1859\n5.0554\n11.5281\n\nIF\n\n0.2202\n0.2081\n0.2351\n0.2210\n0.2343\n0.2405\n0.3878\n\nFigure 3: The histogram of the normalized difference between LOOCV and ALOOCV for 5 runs\nof the algorithm on randomly selected samples for each \u03bb in Table 1 (MNIST dataset with n = 200\nand p = 400).\n\n(cid:98)Lzn\n\n0.0637 (0.0064)\n0.0468 (0.0051)\n0.0327 (0.0038)\n0.0218 (0.0026)\n0.0139 (0.0017)\n0.0086 (0.0011)\n0.0051 (0.0006)\n\n\u03bb\n\n3.3333\n1.6667\n0.8333\n0.4167\n0.2083\n0.1042\n0.0521\n\nL\n\n0.1095 (0.0168)\n0.1021 (0.0182)\n0.0996 (0.0201)\n0.1011 (0.0226)\n0.1059 (0.0256)\n0.1131 (0.0291)\n0.1219 (0.0330)\n\nCV\n\n0.1077 (0.0151)\n0.1056 (0.0179)\n0.1085 (0.0214)\n0.1158 (0.0256)\n0.1264 (0.0304)\n0.1397 (0.0356)\n0.1549 (0.0411)\n\nACV\n\n0.1080 (0.0152)\n0.1059 (0.0179)\n0.1087 (0.0213)\n0.1155 (0.0254)\n0.1258 (0.0300)\n0.1386 (0.0349)\n0.1534 (0.0402)\n\nIF\n\n0.0906 (0.0113)\n0.0734 (0.0100)\n0.0559 (0.0079)\n0.0397 (0.0056)\n0.0267 (0.0038)\n0.0171 (0.0024)\n0.0106 (0.0015)\n\nTable 1: The results of logistic regression (in-sample loss, out-of-sample loss, LOOCV, and\nALOOCV, and In\ufb02uence Function LOOCV) for different regularization parameters on MNIST\ndataset with n = 200 and p = 400. The numbers in parentheses represent the standard error.\n\nTable 2: The results of logistic regression (in-\nsample loss, out-of-sample loss, CV, ACV)\non CIFAR-10 dataset with n = 9600 and p =\n3072.\n\nTable 3: Comparison of the leave-one-out es-\ntimates on the 8 outlier samples with highest\nin-sample loss in the MNIST dataset.\n\n\u2217(cid:62)\n\nIn other words, we consider one regularization parameter per each model parameter. To validate the\nproposed optimization algorithm, we consider a scenario with p = 50 where x is drawn i.i.d. from\nN (0, Ip). We let y = \u03b8\nx + \u0001 where \u03b81 = . . . = \u03b840 = 0 and \u03b841, . . . , \u03b850 \u223c N (0, 1) i.i.d, and\n\u0001 \u223c N (0, 0.1). We draw n = 150 samples from this model, and apply Algorithm 1 to optimize for\n\u03bb = (\u03bb1, . . . , \u03bb50). The problem is designed in such a way that out of 50 features, the \ufb01rst 40 are\nirrelevant while the last 10 are important. We initialize the algorithm with \u03bb1\n50 = 1/3\nand compute ACV using Theorem 1. Recall that in this case, ACV is exactly equivalent to CV (see\nCorollary 2). Figure 1 plots ALOOCV, the out-of-sample loss, and the mean value of \u03bb calculated\nover the irrelevant and relevant features respectively. As expected, the \u03bb for an irrelevant feature is\nset to a larger number, on the average, compared to that of a relevant feature. Finally, we remark\nthat the optimization of 50 tuning parameters in 800 iterations took a mere 28 seconds on a PC.\n\n1 = . . . = \u03bb1\n\n8\n\nHistogram of\fFigure 4: The application of Algorithms 1 and 2 to elastic net regression. The left panel shows the\nloss vs. number of iterations. The right panel shows the run-time vs. n (the sample size).\n\n(cid:62)\n\nr(\u03b8) = H(y|| sigmoid(\u03b80 + \u03b8\n\n(cid:62)\n\n\u03bb\n\n1\nn\n\n\u03bb(cid:107)\u03b8(cid:107)2\n2.\n\nwn(z; \u03b8, \u03bb) = (cid:96)(z; \u03b8) +\n\nLogistic regression: The second example that we consider is logistic regression:\n1\n2n\n\nx)) +\nv + (1 \u2212 u) log 1\nwhere H(\u00b7||\u00b7) for any u \u2208 [0, 1] and v \u2208 (0, 1) is given by H(u||v) := u log 1\n1\u2212v ,\nand denotes the binary cross entropy function, and sigmoid(x) := 1/(1 + e\u2212x) denotes the sig-\nmoid function. In this case, we only consider a single regularization parameter. Since the loss and\nregularizer are smooth, we resort to Theorem 1 to compute ACV. We applied logistic regression on\nMNIST and CIFAR-10 image datasets where we used each pixel in the image as a feature according\nto the aforementioned loss function. In MNIST, we classify the digits 2 and 3 while in CIFAR-10,\nwe classify \u201cbird\u201d and \u201ccat.\u201d As can be seen in Tables 1 and 2, ACV closely follows CV on the\nMNIST dataset. On the other hand, the approximation of LOOCV based on in\ufb02uence functions [14]\nperforms poorly in the regime where the model is signi\ufb01cantly over\ufb01t and hence it cannot be used\nfor effective model selection. On CIFAR-10, ACV takes \u22481s to run per each sample, whereas CV\ntakes \u224860s per each sample requiring days to run for each \u03bb even for this medium sized problem.\nThe histogram of the normalized difference between CV and ACV vectors is plotted in Figure 3\nfor 5 runs of the algorithm for each \u03bb in Table 1. As can be seen, CV and ACV are almost always\nwithin 5% of each other. We have also plotted the loss for the eight outlier samples with the highest\nin-sample loss in the MNIST dataset in Table 3. As can be seen, ALOOCV closely follows LOOCV\neven when the leave-one-out loss is two orders of magnitude larger than the in-sample loss for these\noutliers. On the other hand, the approximation based on the in\ufb02uence functions fails to capture the\nout-of-sample performance and the outliers in this case.\nElastic net regression: Finally, we consider the popular elastic net regression problem [31]:\n\nwn(z; \u03b8, \u03bb) = (cid:96)(z; \u03b8) +\n\n(cid:62)\n\n\u03bb\n\n1\nn\n\nr(\u03b8) =\n\n(y \u2212 \u03b8\n\n(cid:62)\n\nx)2 +\n\n1\n2\n\n\u03bb1(cid:107)\u03b8(cid:107)1 +\n\n1\nn\n\n\u03bb2(cid:107)\u03b8(cid:107)2\n2.\n\n1\n2n\n\n\u2217(cid:62)\n\nIn this case, there are only two regularization parameters to be optimized for the quasi-smooth\nregularized loss. Similar to the previous case, we consider y = \u03b8\nx + \u0001 where \u03b8\u03ba = \u03ba\u03c1\u03ba\u03c8\u03ba where\n\u03c1\u03ba is a Bernoulli(1/2) RV and \u03c8\u03ba \u223c N (0, 1). Hence, the features are weighted non-uniformly in y\nand half of them are zeroed out on the average. We apply both Algorithms 1 and 2 where we used the\napproximation in Theorem 1 and the explanation on how to handle (cid:96)1 regularizers to compute ACV.\nWe initialized with \u03bb1 = \u03bb2 = 0. As can be seen on the left panel (Figure 4), ACV closely follows\nCV in this case. Further, we see that both algorithms are capable of signi\ufb01cantly reducing the loss\nafter only a few iterations. The right panel compares the run-time of the algorithms vs. the number\nof samples. This con\ufb01rms our analysis that the run-time of CV scales quadratically with O(n2) as\nopposed to O(n) in ACV. This impact is more signi\ufb01ed in the inner panel where the run-time ratio\nis plotted.\n\nAcknowledgement\n\nThis work was supported in part by DARPA under Grant No. W911NF-16-1-0561. The authors are\nthankful to Jason D. Lee (USC) who brought to their attention the recent work [14] on in\ufb02uence\nfunctions for approximating leave-one-out cross validation.\n\n9\n\n02468101214161820050010001500200025003000350040004500 Out-of-Sample LossOnline Batch Estimated CVand Actual CV77.588.59330034003500360037003800Actual CVOut-of-Sample Loss Estimated CVn=\t70p\t=\t50No.\tof\titerations(Algo.\t1)(Algo.\t2)ACVCVACV\tand\tCVloss708090100110120130140150p00.511.522.533.5Computation TimeLOOCVApproximate\tLOOCV70809010011012013014015046810121416182022Run\ttime\tratioLOOCV/ALOOCVn\t=\t200\fReferences\n\n[1] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection.\n\nStatistics surveys, 4:40\u201379, 2010.\n\n[2] Seymour Geisser. The predictive sample reuse method with applications. Journal of the Amer-\n\nican Statistical Association, 70(350):320\u2013328, 1975.\n\n[3] Peter Craven and Grace Wahba. Smoothing noisy data with spline functions. Numerische\n\nMathematik, 31(4):377\u2013403, 1978.\n\n[4] Kenneth P Burnham and David R Anderson. Model selection and multimodel inference: a\n\npractical information-theoretic approach. Springer Science & Business Media, 2003.\n\n[5] Hirotugu Akaike. Statistical predictor identi\ufb01cation. Annals of the Institute of Statistical Math-\n\nematics, 22(1):203\u2013217, 1970.\n\n[6] Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle.\n\nIn Selected Papers of Hirotugu Akaike, pages 199\u2013213. Springer, 1998.\n\n[7] K Takeuchi. Distribution of informational statistics and a criterion of model \ufb01tting. suri-kagaku\n\n(mathematical sciences) 153 12-18, 1976.\n\n[8] Mervyn Stone. Cross-validation and multinomial prediction. Biometrika, pages 509\u2013515,\n\n1974.\n\n[9] Andrew R Barron. Predicted squared error: a criterion for automatic model selection. 1984.\n[10] John E Moody. The effective number of parameters: An analysis of generalization and regular-\nization in nonlinear learning systems. In Advances in neural information processing systems,\npages 847\u2013854, 1992.\n\n[11] David M Allen. The relationship between variable selection and data agumentation and a\n\nmethod for prediction. Technometrics, 16(1):125\u2013127, 1974.\n\n[12] Ronald Christensen. Plane answers to complex questions:\n\nSpringer Science & Business Media, 2011.\n\nthe theory of linear models.\n\n[13] R Dennis Cook and Sanford Weisberg. Characterizations of an empirical in\ufb02uence function\n\nfor detecting in\ufb02uential cases in regression. Technometrics, 22(4):495\u2013508, 1980.\n\n[14] Pang Wei Koh and Percy Liang. Understanding black-box predictions via in\ufb02uence functions.\n\nInternational Conference on Machine Learning, 2017.\n\n[15] Bradley Efron. Better bootstrap con\ufb01dence intervals. Journal of the American statistical\n\nAssociation, 82(397):171\u2013185, 1987.\n\n[16] Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method\n\nfor choosing a good ridge parameter. Technometrics, 21(2):215\u2013223, 1979.\n\n[17] Charles-Alban Deledalle, Samuel Vaiter, Jalal Fadili, and Gabriel Peyr\u00b4e. Stein Unbiased GrA-\ndient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM Journal on Imag-\ning Sciences, 7(4):2448\u20132487, 2014.\n\n[18] Sathish Ramani, Zhihao Liu, Jeffrey Rosen, Jon-Fredrik Nielsen, and Jeffrey A Fessler. Regu-\nlarization parameter selection for nonlinear iterative image restoration and MRI reconstruction\nusing GCV and SURE-based methods. IEEE Transactions on Image Processing, 21(8):3659\u2013\n3672, 2012.\n\n[19] Jonas Mo\u02c7ckus. Application of Bayesian approach to numerical methods of global and stochas-\n\ntic optimization. Journal of Global Optimization, 4(4):347\u2013365, 1994.\n\n[20] Jonas Mo\u02c7ckus. On Bayesian methods for seeking the extremum. In Optimization Techniques\n\nIFIP Technical Conference, pages 400\u2013404. Springer, 1975.\n\n[21] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of ex-\npensive cost functions, with application to active user modeling and hierarchical reinforcement\nlearning. arXiv preprint arXiv:1012.2599, 2010.\n\n[22] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization\nfor general algorithm con\ufb01guration. In International Conference on Learning and Intelligent\nOptimization, pages 507\u2013523. Springer, 2011.\n\n10\n\n\f[23] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of ma-\nchine learning algorithms. In Advances in neural information processing systems, pages 2951\u2013\n2959, 2012.\n\n[24] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal\n\nof Machine Learning Research, 13(Feb):281\u2013305, 2012.\n\n[25] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Com-\nbined selection and hyperparameter optimization of classi\ufb01cation algorithms. In Proceedings\nof the 19th ACM SIGKDD international conference on Knowledge discovery and data mining,\npages 847\u2013855. ACM, 2013.\n\n[26] Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger\nHoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing Bayesian\noptimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and\nPractice, pages 1\u20135, 2013.\n\n[27] Gautam Kunapuli, K Bennett, Jing Hu, and Jong-Shi Pang. Bilevel model selection for support\nvector machines. In CRM Proceedings and Lecture Notes, volume 45, pages 129\u2013158, 2008.\n[28] Kristin P Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection\nvia bilevel optimization. In 2006 Intl. Joint Conf. on Neural Networks IJCNN\u201906, pages 1922\u2013\n1929, 2006.\n\n[29] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization.\n\nAdvances in neural information processing systems, pages 2004\u20132012, 2013.\n\nIn\n\n[30] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hy-\nperband: Bandit-based con\ufb01guration evaluation for hyperparameter optimization. Proc. of\nICLR, 17, 2017.\n\n[31] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal\n\nof the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.\n\n11\n\n\f", "award": [], "sourceid": 1967, "authors": [{"given_name": "Ahmad", "family_name": "Beirami", "institution": "Harvard University & MIT"}, {"given_name": "Meisam", "family_name": "Razaviyayn", "institution": "University of Southern California"}, {"given_name": "Shahin", "family_name": "Shahrampour", "institution": "Harvard University"}, {"given_name": "Vahid", "family_name": "Tarokh", "institution": "Harvard University"}]}