{"title": "Regularization Path of Cross-Validation Error Lower Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 1675, "page_last": 1683, "abstract": "Careful tuning of a regularization parameter is indispensable in many machine learning tasks because it has a significant impact on generalization performances.Nevertheless, current practice of regularization parameter tuning is more of an art than a science, e.g., it is hard to tell how many grid-points would be needed in cross-validation (CV) for obtaining a solution with sufficiently small CV error.In this paper we propose a novel framework for computing a lower bound of the CV errors as a function of the regularization parameter, which we call regularization path of CV error lower bounds.The proposed framework can be used for providing a theoretical approximation guarantee on a set of solutions in the sense that how far the CV error of the current best solution could be away from best possible CV error in the entire range of the regularization parameters.We demonstrate through numerical experiments that a theoretically guaranteed a choice of regularization parameter in the above sense is possible with reasonable computational costs.", "full_text": "Regularization Path of\n\nCross-Validation Error Lower Bounds\n\nAtsushi Shibagaki, Yoshiki Suzuki, Masayuki Karasuyama, and Ichiro Takeuchi\n\n{shibagaki.a.mllab.nit,suzuki.mllab.nit}@gmail.com\n\n{karasuyama,takeuchi.ichiro}@nitech.ac.jp\n\nNagoya Institute of Technology\n\nNagoya, 466-8555, Japan\n\nAbstract\n\nCareful tuning of a regularization parameter is indispensable in many machine\nlearning tasks because it has a signi\ufb01cant impact on generalization performances.\nNevertheless, current practice of regularization parameter tuning is more of an art\nthan a science, e.g., it is hard to tell how many grid-points would be needed in\ncross-validation (CV) for obtaining a solution with suf\ufb01ciently small CV error. 
In this paper we propose a novel framework for computing a lower bound of the CV errors as a function of the regularization parameter, which we call the regularization path of CV error lower bounds. The proposed framework provides a theoretical approximation guarantee on a set of solutions: it explicitly quantifies how far the CV error of the current best solution can be from the best possible CV error in the entire range of the regularization parameters. Our numerical experiments demonstrate that a theoretically guaranteed choice of a regularization parameter in the above sense is possible with reasonable computational costs.

1 Introduction

Many machine learning tasks involve careful tuning of a regularization parameter that controls the balance between an empirical loss term and a regularization term. A regularization parameter is usually selected by comparing the cross-validation (CV) errors at several different regularization parameters. Although its choice has a significant impact on generalization performance, the current practice is still more of an art than a science. For example, in the commonly used grid search, it is hard to tell how many grid points we should search over for obtaining a sufficiently small CV error.

In this paper we introduce a novel framework for a class of regularized binary classification problems that can compute a regularization path of CV error lower bounds. For an ε ∈ [0, 1], we define the ε-approximate regularization parameters to be the set of regularization parameters such that the CV error of the solution at such a regularization parameter is guaranteed to be no greater by ε than the best possible CV error in the entire range of regularization parameters. Given a set of solutions obtained, for example, by grid search, the proposed framework allows us to provide a theoretical guarantee on the current best solution by explicitly quantifying its approximation level ε in the above sense. Furthermore, when a desired approximation level ε is specified, the proposed framework can be used for efficiently finding one of the ε-approximate regularization parameters.

The proposed framework is built on a novel CV error lower bound represented as a function of the regularization parameter, which is why we call it a regularization path of CV error lower bounds. Our CV error lower bound can be computed using only a finite number of solutions obtained by arbitrary algorithms. It is thus easy to apply our framework to common regularization parameter tuning strategies such as grid search or Bayesian optimization. Furthermore, the proposed framework can be used not only with exact optimal solutions but also with sufficiently good approximate solutions, which is computationally advantageous because completely solving an optimization problem is often much more costly than obtaining a reasonably good approximate solution.

Figure 1: An illustration of the proposed framework. One of our algorithms presented in §4 automatically selected 39 regularization parameter values in [10⁻³, 10³], and an upper bound of the validation error for each of them is obtained by solving an optimization problem approximately. Among those 39 values, the one with the smallest validation error upper bound (indicated as ⋆ at C = 1.368) is guaranteed to be an ε (= 0.1)-approximate regularization parameter in the sense that its validation error is no greater by ε than the smallest possible validation error in the whole interval [10⁻³, 10³]. See §5 for the setup (see also Figure 3 for the results with other options).

Our main contribution in this paper is to show that a theoretically guaranteed choice of a regularization parameter in the above sense is possible with reasonable computational costs. To the best of our knowledge, no other existing method provides such a theoretical guarantee on the CV error and can be used as generally as ours. Figure 1 illustrates the behavior of the algorithm for obtaining an ε = 0.1 approximate regularization parameter (see §5 for the setup).

Related works. The optimal regularization parameter can be found if the exact regularization path can be computed. Exact regularization paths have been intensively studied [1, 2], but they are known to be numerically unstable and do not scale well. Furthermore, the exact regularization path can be computed only for a limited class of problems whose solutions are written as piecewise-linear functions of the regularization parameter [3]. Our framework is much more efficient and can be applied to wider classes of problems whose exact regularization paths cannot be computed. This work was motivated by recent studies on approximate regularization paths [4, 5, 6, 7]. These approximate regularization paths have the property that the objective function value at each regularization parameter value is no greater by ε than the optimal objective function value in the entire range of regularization parameters. Although these algorithms are much more stable and efficient than exact ones, for the task of tuning a regularization parameter our interest is not in objective function values but in CV errors. Our approach is more suitable for regularization parameter tuning tasks in the sense that the approximation quality is guaranteed in terms of the CV error.

As illustrated in Figure 1, we compute only a finite number of solutions, but still provide an approximation guarantee over the whole interval of the regularization parameter. To ensure such a property, we need a novel CV error lower bound that is sufficiently tight and represented as a monotonic function of the regularization parameter. Although several CV error bounds (mostly for leave-one-out CV) of SVM and similar learning frameworks exist (e.g., [8, 9, 10, 11]), none of them satisfies the above required properties. The idea of our CV error bound is inspired by recent studies on safe screening [12, 13, 14, 15, 16] (see Appendix A for the details). Furthermore, we emphasize that our contribution is not in presenting a new generalization error bound, but in introducing a practical framework for providing a theoretical guarantee on the choice of a regularization parameter. Although generalization error bounds such as structural risk minimization [17] might be used for a rough tuning of a regularization parameter, they are known to be too loose to use as an alternative to CV (see, e.g., §11 in [18]). We also note that our contribution is not in presenting a new method for regularization parameter tuning such as Bayesian optimization [19], random search [20], or gradient-based search [21]. As we demonstrate in the experiments, our approach can provide a theoretical approximation guarantee for the regularization parameter selected by these existing methods.

2 Problem Setup

We consider linear binary classification problems. Let {(xᵢ, yᵢ) ∈ ℝᵈ × {−1, 1}}_{i∈[n]} be the training set, where n is the size of the training set, d is the input dimension, and [n] := {1, ..., n}.
An independent held-out validation set with size n′ is denoted similarly as {(x′ᵢ, y′ᵢ) ∈ ℝᵈ × {−1, 1}}_{i∈[n′]}. A linear decision function is written as f(x) = w⊤x, where w ∈ ℝᵈ is a vector of coefficients and ⊤ represents the transpose. We assume the availability of a held-out validation set only for simplifying the exposition; all the proposed methods can be straightforwardly adapted to a cross-validation setup. Furthermore, the proposed methods can be kernelized if the loss function satisfies a certain condition. In this paper we focus on the following class of regularized convex loss minimization problems:

  w*_C := argmin_{w∈ℝᵈ} (1/2)‖w‖² + C Σ_{i∈[n]} ℓ(yᵢ, w⊤xᵢ),   (1)

where C > 0 is the regularization parameter and ‖·‖ is the Euclidean norm. The loss function ℓ : {−1, 1} × ℝ → ℝ is assumed to be convex and subdifferentiable in the second argument. Examples of such loss functions include the logistic loss, the hinge loss, and the Huber hinge loss. For notational convenience, we denote the individual loss as ℓᵢ(w) := ℓ(yᵢ, w⊤xᵢ) for all i ∈ [n]. The optimal solution for the regularization parameter C is explicitly denoted as w*_C. We assume that the regularization parameter is defined in a finite interval [Cℓ, Cu], e.g., Cℓ = 10⁻³ and Cu = 10³ as in our experiments.

For a solution w ∈ ℝᵈ, the validation error¹ is defined as

  E_v(w) := (1/n′) Σ_{i∈[n′]} I(y′ᵢ w⊤x′ᵢ < 0),   (2)

where I(·) is the indicator function.

¹ For simplicity, we regard a validation instance whose score is exactly zero, i.e., w⊤x′ᵢ = 0, as correctly classified in (2). Hereafter, we assume that there are no validation instances whose input vector is identically zero, i.e., x′ᵢ = 0, because such instances are always correctly classified according to the definition in (2).

In this paper, we consider two problem setups. The first problem setup is, given a set of (either optimal or approximate) solutions w*_{C₁}, ..., w*_{C_T} at T different regularization parameters C₁, ..., C_T ∈ [Cℓ, Cu], to compute the approximation level ε such that

  min_{Cₜ∈{C₁,...,C_T}} E_v(w*_{Cₜ}) − E*_v ≤ ε,  where E*_v := min_{C∈[Cℓ,Cu]} E_v(w*_C),   (3)

by which we can tell how accurate our search (typically grid search) is in terms of the deviation of the achieved validation error from the true minimum E*_v in the range. The second problem setup is, given the approximation level ε, to find an ε-approximate regularization parameter within the interval [Cℓ, Cu], defined as an element of the set

  C(ε) := {C ∈ [Cℓ, Cu] | E_v(w*_C) − E*_v ≤ ε}.

Our goal in this second setup is to derive an efficient exploration procedure that achieves the specified validation approximation level ε. These two problem setups are both common scenarios in practical data analysis, and both can be solved by using our proposed framework for computing a path of validation error lower bounds.

3 Validation error lower bounds as a function of regularization parameter

In this section, we derive a validation error lower bound represented as a function of the regularization parameter C. Our basic idea is to compute a lower and an upper bound of the inner product score w*_C⊤x′ᵢ for each validation input x′ᵢ, i ∈ [n′], as a function of the regularization parameter C. For computing the bounds of w*_C⊤x′ᵢ, we use a solution (either optimal or approximate) for a different regularization parameter C̃ ≠ C.

3.1 Score bounds

We first describe how to obtain a lower and an upper bound of the inner product score w*_C⊤x′ᵢ based on an approximate solution ŵ_C̃ at a different regularization parameter C̃ ≠ C.

Lemma 1. Let ŵ_C̃ be an approximate solution of problem (1) for a regularization parameter value C̃, and let ξᵢ(ŵ_C̃) be a subgradient of ℓᵢ at w = ŵ_C̃ such that a subgradient of the objective function is

  g(ŵ_C̃) := ŵ_C̃ + C̃ Σ_{i∈[n]} ξᵢ(ŵ_C̃).   (4)

Then, for any C > 0, the score w*_C⊤x′ᵢ, i ∈ [n′], satisfies

  w*_C⊤x′ᵢ ≥ LB(w*_C⊤x′ᵢ | ŵ_C̃) := { α(ŵ_C̃, x′ᵢ) − (C/C̃)(β(ŵ_C̃, x′ᵢ) + γ(g(ŵ_C̃), x′ᵢ)),  if C > C̃,
                                      −β(ŵ_C̃, x′ᵢ) + (C/C̃)(α(ŵ_C̃, x′ᵢ) + δ(g(ŵ_C̃), x′ᵢ)),  if C < C̃,   (5a)

  w*_C⊤x′ᵢ ≤ UB(w*_C⊤x′ᵢ | ŵ_C̃) := { −β(ŵ_C̃, x′ᵢ) + (C/C̃)(α(ŵ_C̃, x′ᵢ) + δ(g(ŵ_C̃), x′ᵢ)),  if C > C̃,
                                       α(ŵ_C̃, x′ᵢ) − (C/C̃)(β(ŵ_C̃, x′ᵢ) + γ(g(ŵ_C̃), x′ᵢ)),  if C < C̃,   (5b)

where

  α(ŵ_C̃, x′ᵢ) := (1/2)(‖ŵ_C̃‖‖x′ᵢ‖ + ŵ_C̃⊤x′ᵢ) ≥ 0,   γ(g(ŵ_C̃), x′ᵢ) := (1/2)(‖g(ŵ_C̃)‖‖x′ᵢ‖ + g(ŵ_C̃)⊤x′ᵢ) ≥ 0,
  β(ŵ_C̃, x′ᵢ) := (1/2)(‖ŵ_C̃‖‖x′ᵢ‖ − ŵ_C̃⊤x′ᵢ) ≥ 0,   δ(g(ŵ_C̃), x′ᵢ) := (1/2)(‖g(ŵ_C̃)‖‖x′ᵢ‖ − g(ŵ_C̃)⊤x′ᵢ) ≥ 0.

The proof is presented in Appendix A. Lemma 1 tells us that we have a lower and an upper bound of the score w*_C⊤x′ᵢ for each validation instance, both of which change linearly with the regularization parameter C. When ŵ_C̃ is optimal, it can be shown (see Proposition B.24 in [22]) that there exists a subgradient such that g(ŵ_C̃) = 0, meaning that the bounds are tight because γ(g(ŵ_C̃), x′ᵢ) = δ(g(ŵ_C̃), x′ᵢ) = 0.

Corollary 2. When C = C̃, the score w*_C̃⊤x′ᵢ, i ∈ [n′], for the regularization parameter value C̃ itself satisfies

  w*_C̃⊤x′ᵢ ≥ LB(w*_C̃⊤x′ᵢ | ŵ_C̃) = ŵ_C̃⊤x′ᵢ − γ(g(ŵ_C̃), x′ᵢ),   w*_C̃⊤x′ᵢ ≤ UB(w*_C̃⊤x′ᵢ | ŵ_C̃) = ŵ_C̃⊤x′ᵢ + δ(g(ŵ_C̃), x′ᵢ).

The results in Corollary 2 are obtained by simply substituting C = C̃ into (5a) and (5b).

3.2 Validation error bounds

Given a lower and an upper bound of the score of each validation instance, a lower bound of the validation error can be computed by simply using the following facts:

  y′ᵢ = +1 and UB(w*_C⊤x′ᵢ | ŵ_C̃) < 0 ⇒ mis-classified,   (6a)
  y′ᵢ = −1 and LB(w*_C⊤x′ᵢ | ŵ_C̃) > 0 ⇒ mis-classified.   (6b)

Furthermore, since the bounds in Lemma 1 change linearly with the regularization parameter C, we can identify the interval of C within which the validation instance is guaranteed to be mis-classified.

Lemma 3. For a validation instance with y′ᵢ = +1, if

  C̃ < C < [β(ŵ_C̃, x′ᵢ) / (α(ŵ_C̃, x′ᵢ) + δ(g(ŵ_C̃), x′ᵢ))] C̃  or  [α(ŵ_C̃, x′ᵢ) / (β(ŵ_C̃, x′ᵢ) + γ(g(ŵ_C̃), x′ᵢ))] C̃ < C < C̃,

then the validation instance (x′ᵢ, y′ᵢ) is mis-classified. Similarly, for a validation instance with y′ᵢ = −1, if

  C̃ < C < [α(ŵ_C̃, x′ᵢ) / (β(ŵ_C̃, x′ᵢ) + γ(g(ŵ_C̃), x′ᵢ))] C̃  or  [β(ŵ_C̃, x′ᵢ) / (α(ŵ_C̃, x′ᵢ) + δ(g(ŵ_C̃), x′ᵢ))] C̃ < C < C̃,

then the validation instance (x′ᵢ, y′ᵢ) is mis-classified.

This lemma can be easily shown by applying (5) to (6).

As a direct consequence of Lemma 3, the lower bound of the validation error is represented as a function of the regularization parameter C in the following form.

Theorem 4. Using an approximate solution ŵ_C̃ for a regularization parameter C̃, the validation error E_v(w*_C) for any C > 0 satisfies

  E_v(w*_C) ≥ LB(E_v(w*_C) | ŵ_C̃) := (1/n′) [ Σ_{y′ᵢ=+1} I( UB(w*_C⊤x′ᵢ | ŵ_C̃) < 0 ) + Σ_{y′ᵢ=−1} I( LB(w*_C⊤x′ᵢ | ŵ_C̃) > 0 ) ],   (7)

where, by Lemma 3, each indicator equals 1 exactly on an interval of C, so that LB(E_v(w*_C) | ŵ_C̃) is a staircase function of C. In particular, substituting C = C̃ (Corollary 2) gives computable bounds of the validation error at C̃ itself:

  E_v(w*_C̃) ≥ LB(E_v(w*_C̃) | ŵ_C̃) = (1/n′) [ Σ_{y′ᵢ=+1} I( ŵ_C̃⊤x′ᵢ + δ(g(ŵ_C̃), x′ᵢ) < 0 ) + Σ_{y′ᵢ=−1} I( ŵ_C̃⊤x′ᵢ − γ(g(ŵ_C̃), x′ᵢ) > 0 ) ],   (8a)
  E_v(w*_C̃) ≤ UB(E_v(w*_C̃) | ŵ_C̃) = 1 − (1/n′) [ Σ_{y′ᵢ=+1} I( ŵ_C̃⊤x′ᵢ − γ(g(ŵ_C̃), x′ᵢ) ≥ 0 ) + Σ_{y′ᵢ=−1} I( ŵ_C̃⊤x′ᵢ + δ(g(ŵ_C̃), x′ᵢ) ≤ 0 ) ].   (8b)

4 Algorithm

In this section we present two algorithms, one for each of the two problems discussed in §2. Due to space limitations, we describe only the most fundamental forms of these algorithms. Details and several extensions of the algorithms are presented in supplementary Appendices B and C.

4.1 Problem setup 1: Computing the approximation level ε from a given set of solutions

Given a set of (either optimal or approximate) solutions ŵ_C̃₁, ..., ŵ_C̃_T, obtained e.g. by ordinary grid search, our first problem is to provide a theoretical approximation level ε in the sense of (3)². This problem can be solved easily by using the validation error lower bounds developed in §3.2. The algorithm is presented in Algorithm 1, where we compute the current best validation error E_v^best in line 1 and a lower bound of the best possible validation error E*_v := min_{C∈[Cℓ,Cu]} E_v(w*_C) in line 2. The approximation level ε is then obtained simply by subtracting the latter from the former. We note that LB(E*_v), the lower bound of E*_v, can be easily computed from the T validation error lower bounds LB(E_v(w*_C) | ŵ_C̃ₜ), t = 1, ..., T, because they are staircase functions of C.

² When we only have approximate solutions ŵ_C̃₁, ..., ŵ_C̃_T, Eq. (3) is slightly incorrect: the first term of the l.h.s. of (3) should be min_{C̃ₜ∈{C̃₁,...,C̃_T}} UB(E_v(ŵ_C̃ₜ) | ŵ_C̃ₜ).

4.2 Problem setup 2: Finding an ε-approximate regularization parameter

Given a desired approximation level ε such as ε = 0.01, our second problem is to find an ε-approximate regularization parameter. To this end we develop an algorithm that produces a set of optimal or approximate solutions ŵ_C̃₁, ..., ŵ_C̃_T such that, if we apply Algorithm 1 to this sequence, the resulting approximation level is smaller than or equal to ε. Algorithm 2 is the pseudo-code of this algorithm. It computes approximate solutions for an increasing sequence of regularization parameters in the main loop (lines 2-11).

Algorithm 2: Finding an ε-approximate regularization parameter with approximate solutions
Input: {(xᵢ, yᵢ)}_{i∈[n]}, {(x′ᵢ, y′ᵢ)}_{i∈[n′]}, Cℓ, Cu, ε
 1: t ← 1, C̃ₜ ← Cℓ, C_best ← Cℓ, E_v^best ← 1
 2: while C̃ₜ ≤ Cu do
 3:   ŵ_C̃ₜ ← solve (1) approximately for C = C̃ₜ
 4:   Compute UB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) by (8b)
 5:   if UB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) < E_v^best then
 6:     E_v^best ← UB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ)
 7:     C_best ← C̃ₜ
 8:   end if
 9:   Set C̃ₜ₊₁ by (10)
10:   t ← t + 1
11: end while
Output: C_best ∈ C(ε)

Let us now consider the t-th iteration in the main loop, where we have already computed t−1 approximate solutions ŵ_C̃₁, ..., ŵ_C̃ₜ₋₁ for C̃₁ < ... < C̃ₜ₋₁. At this point,

  C_best := argmin_{C̃_τ∈{C̃₁,...,C̃ₜ₋₁}} UB(E_v(w*_C̃_τ) | ŵ_C̃_τ)

is the best (in the worst-case sense) regularization parameter obtained so far, and it is guaranteed to be an ε-approximate regularization parameter in the interval [Cℓ, C̃ₜ] in the sense that the validation error

  E_v^best := min_{C̃_τ∈{C̃₁,...,C̃ₜ₋₁}} UB(E_v(w*_C̃_τ) | ŵ_C̃_τ)

is shown to be at most greater by ε than the smallest possible validation error in the interval [Cℓ, C̃ₜ]. However, we are not sure whether C_best keeps the ε-approximation property for C > C̃ₜ. Thus, in line 3, we approximately solve the optimization problem (1) at C = C̃ₜ and obtain an approximate solution ŵ_C̃ₜ.
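To make the bound computations concrete, the following is a minimal numpy sketch of the score bounds of Lemma 1 and the validation error bounds of Theorem 4 and Eq. (8b). This is our illustrative re-implementation under the paper's definitions, not the authors' released code; all function and variable names are ours.

```python
import numpy as np

def bound_terms(w, g, x):
    """alpha, beta, gamma, delta of Lemma 1 for one validation input x,
    given a solution w at C_tilde and an objective subgradient g (Eq. (4))."""
    nw, ng, nx = np.linalg.norm(w), np.linalg.norm(g), np.linalg.norm(x)
    alpha = 0.5 * (nw * nx + w @ x)
    beta  = 0.5 * (nw * nx - w @ x)
    gamma = 0.5 * (ng * nx + g @ x)
    delta = 0.5 * (ng * nx - g @ x)
    return alpha, beta, gamma, delta

def score_bounds(w, g, C_tilde, C, x):
    """(LB, UB) of the score w*_C^T x in Eqs. (5a)/(5b); both are linear in C."""
    a, b, gm, d = bound_terms(w, g, x)
    if C >= C_tilde:  # at C == C_tilde this reduces to Corollary 2
        return a - (b + gm) * C / C_tilde, -b + (a + d) * C / C_tilde
    return -b + (a + d) * C / C_tilde, a - (b + gm) * C / C_tilde

def val_error_lb(w, g, C_tilde, C, Xv, yv):
    """Theorem 4: fraction of validation instances guaranteed mis-classified
    at C by (6a)/(6b), using the solution obtained at C_tilde."""
    m = 0
    for x, y in zip(Xv, yv):
        lb, ub = score_bounds(w, g, C_tilde, C, x)
        m += (y == +1 and ub < 0) or (y == -1 and lb > 0)
    return m / len(yv)

def val_error_ub(w, g, Xv, yv):
    """Eq. (8b): upper bound of E_v(w*_C_tilde) at C = C_tilde itself."""
    c = 0
    for x, y in zip(Xv, yv):
        _, _, gm, d = bound_terms(w, g, x)
        c += (y == +1 and w @ x - gm >= 0) or (y == -1 and w @ x + d <= 0)
    return 1 - c / len(yv)
```

When ŵ_C̃ is optimal (so g = 0 is a valid subgradient), γ = δ = 0 and both score bounds collapse to the exact scores at C = C̃, so `val_error_lb` and `val_error_ub` both coincide with the actual validation error there.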
Note that the approximate solution ŵ_C̃ₜ must be sufficiently good in the sense that UB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) − LB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) is sufficiently smaller than ε (typically 0.1ε). If the upper bound of the validation error UB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) is smaller than E_v^best, we update E_v^best and C_best (lines 5-8).

Our next task is to find C̃ₜ₊₁ in such a way that C_best is an ε-approximate regularization parameter in the interval [Cℓ, C̃ₜ₊₁]. Using the validation error lower bound in Theorem 4, the task is to find the smallest C̃ₜ₊₁ > C̃ₜ that violates

  E_v^best − LB(E_v(w*_C) | ŵ_C̃ₜ) ≤ ε,  ∀C ∈ [C̃ₜ, Cu].   (9)

In order to formulate such a C̃ₜ₊₁, let us define

  P := {i ∈ [n′] | y′ᵢ = +1, UB(w*_C̃ₜ⊤x′ᵢ | ŵ_C̃ₜ) < 0},   N := {i ∈ [n′] | y′ᵢ = −1, LB(w*_C̃ₜ⊤x′ᵢ | ŵ_C̃ₜ) > 0}.

Furthermore, let

  Γ := { [β(ŵ_C̃ₜ, x′ᵢ) / (α(ŵ_C̃ₜ, x′ᵢ) + δ(g(ŵ_C̃ₜ), x′ᵢ))] C̃ₜ }_{i∈P} ∪ { [α(ŵ_C̃ₜ, x′ᵢ) / (β(ŵ_C̃ₜ, x′ᵢ) + γ(g(ŵ_C̃ₜ), x′ᵢ))] C̃ₜ }_{i∈N},

and denote the k-th smallest element of Γ as kth(Γ) for any natural number k. Then, the smallest C̃ₜ₊₁ > C̃ₜ that violates (9) is given as

  C̃ₜ₊₁ ← (⌊n′(LB(E_v(w*_C̃ₜ) | ŵ_C̃ₜ) − E_v^best + ε)⌋ + 1)th(Γ).   (10)

5 Experiments

In this section we present experiments illustrating the proposed methods. Table 2 summarizes the datasets used in the experiments.
They are taken from the libsvm dataset repository [23]. All the input features except those of D9 and D10 were standardized to [−1, 1]³. For illustrative results, the instances were randomly divided into training and validation sets of roughly equal size. For quantitative results, we used 10-fold CV. We used the Huber hinge loss (e.g., [24]), which is convex and subdifferentiable with respect to the second argument. The proposed methods are free from the choice of optimization solvers. In the experiments, we used the optimization solver described in [25], which is also implemented in the well-known liblinear software [26]. Our slightly modified code (adapted to the Huber hinge loss) is provided as supplementary material, and is also available at https://github.com/takeuchi-lab/RPCVELB. Whenever possible, we used a warm-start approach, i.e., when we trained a new solution, we used the closest solution trained so far (either approximate or optimal) as the initial starting point of the optimizer. All the computations were conducted using a single core of an HP Z800 workstation (Xeon(R) CPU X5675 (3.07 GHz), 48 GB memory).

³ We use D9 and D10 as they are, for exploiting sparsity.

Figure 2: Illustrations of Algorithm 1 on three benchmark datasets: liver-disorders (D2), ionosphere (D3), and australian (D4). The plots indicate how the approximation level ε improves as the number of solutions T increases in grid search (red), Bayesian optimization (blue), and our own method (green; see the main text).

Figure 3: Illustrations of Algorithm 2 on the ionosphere (D3) dataset for (a) op2 with ε = 0.10 (without tricks), (b) op2 with ε = 0.05 (without tricks), and (c) op3 with ε = 0.05 (with tricks 1 and 2). Figure 1 also shows the result for op3 with ε = 0.10.
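As a concrete reference for how Algorithm 1 combines the bounds, the following self-contained numpy sketch computes the approximation level ε of Eq. (3) from an arbitrary set of (C̃, ŵ, g) triples. This is our illustration, not the authors' code: where the paper evaluates the staircase lower bounds exactly, this sketch takes their pointwise maximum over a fine logarithmic grid of C, which may slightly under-estimate ε if the grid misses a step of a staircase.

```python
import numpy as np

def staircase_lb(w, g, C_tilde, C, Xv, yv):
    """Theorem 4 lower bound LB(E_v(w*_C) | w) for one reference solution."""
    nx = np.linalg.norm(Xv, axis=1)
    s, sg = Xv @ w, Xv @ g
    a = 0.5 * (np.linalg.norm(w) * nx + s)   # alpha of Lemma 1 (vectorized)
    b = 0.5 * (np.linalg.norm(w) * nx - s)   # beta
    gm = 0.5 * (np.linalg.norm(g) * nx + sg)  # gamma
    d = 0.5 * (np.linalg.norm(g) * nx - sg)   # delta
    r = C / C_tilde
    if C >= C_tilde:
        lb, ub = a - (b + gm) * r, -b + (a + d) * r   # Eqs. (5a)/(5b)
    else:
        lb, ub = -b + (a + d) * r, a - (b + gm) * r
    # guaranteed mis-classifications, Eqs. (6a)/(6b)
    return np.mean(((yv == 1) & (ub < 0)) | ((yv == -1) & (lb > 0)))

def approx_level(solutions, Cl, Cu, Xv, yv, n_grid=2000):
    """epsilon of Eq. (3) for `solutions` = [(C_tilde, w, g), ...]."""
    # line 1 of Algorithm 1: current best validation error via Eq. (8b)
    best_ub = 1.0
    for Ct, w, g in solutions:
        nx = np.linalg.norm(Xv, axis=1)
        s, sg = Xv @ w, Xv @ g
        gm = 0.5 * (np.linalg.norm(g) * nx + sg)
        d = 0.5 * (np.linalg.norm(g) * nx - sg)
        ok = ((yv == 1) & (s - gm >= 0)) | ((yv == -1) & (s + d <= 0))
        best_ub = min(best_ub, 1.0 - np.mean(ok))
    # line 2: lower bound of E*_v, approximated here on a log-grid of C
    grid = np.logspace(np.log10(Cl), np.log10(Cu), n_grid)
    lb_path = [max(staircase_lb(w, g, Ct, C, Xv, yv)
                   for Ct, w, g in solutions) for C in grid]
    return best_ub - min(lb_path)
```

The triples can come from any tuning strategy (grid search, Bayesian optimization, or the adaptive search below); only the bounds, not the strategy, matter for the guarantee.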
In all the experiments, we set Cℓ = 10⁻³ and Cu = 10³.

Results on problem 1. We applied Algorithm 1 in §4 to sets of solutions obtained by 1) grid search, 2) Bayesian optimization (BO) with the expected improvement acquisition function, and 3) adaptive search with our framework, which sequentially computes a solution whose validation error lower bound is smallest given the information obtained so far. Figure 2 illustrates the results on three datasets, showing how the approximation level ε on the vertical axis changes as the number of solutions (T in our notation) increases. In grid search, as we increase the number of grid points, the approximation level ε tends to improve. Since BO tends to focus on a small region of the regularization parameter, it was difficult to tightly bound the approximation level. The adaptive search, using our framework straightforwardly, seems to offer a slight improvement over grid search.

Results on problem 2. We applied Algorithm 2 to the benchmark datasets to demonstrate that a theoretically guaranteed choice of a regularization parameter is possible with reasonable computational costs. Besides the algorithm presented in §4, we also tested a variant described in supplementary Appendix B. Specifically, we have three algorithm options. In the first option (op1), we used optimal solutions {w*_C̃ₜ}_{t∈[T]} for computing the CV error lower bounds. In the second option (op2), we instead used approximate solutions {ŵ_C̃ₜ}_{t∈[T]}. In the last option (op3), we additionally used the speed-up tricks described in supplementary Appendix B. We considered four different choices of ε ∈ {0.1, 0.05, 0.01, 0}.
Note that ε = 0 corresponds to the task of finding the exactly optimal regularization parameter. In some datasets, the smallest validation errors are less than 0.1 or 0.05, in which cases we do not report the results (indicated as "Ev < 0.05" etc. in Table 1). In trick 1, we initially computed solutions at four regularization parameter values evenly allocated in [10⁻³, 10³] on the logarithmic scale. In trick 2, the next regularization parameter C̃ₜ₊₁ was set by replacing ε in (10) with 1.5ε (see supplementary Appendix B). For the purpose of illustration, we plot examples of validation error curves in several setups: Figure 3 shows the validation error curves of the ionosphere (D3) dataset for several options and values of ε.

Table 1: Computational costs. For each of the three options (op1 using w*_C̃ₜ, op2 using ŵ_C̃ₜ, op3 using the tricks) and ε ∈ {0.10, 0.05, 0.01, 0}, the number of optimization problems solved (denoted as T) and the total computational cost in seconds (denoted as time) are listed for datasets D1-D10. Note that, for op2, there are no results for ε = 0. [Per-dataset numerical entries omitted: column alignment was lost in extraction.]

Table 1 shows the number of optimization problems solved by the algorithm (denoted as T) and the total computation time in the CV setups. The computational cost mostly depends on T, which gets smaller as ε increases. The two tricks in supplementary Appendix B were effective in most cases for reducing T. In addition, we see the advantage of using approximate solutions by comparing the computation times of op1 and op2 (though this strategy is only available for ε ≠ 0).
Overall, the results suggest that the proposed algorithm allows us to find theoretically guaranteed approximate regularization parameters with reasonable costs, except for the ε = 0 cases. For example, the algorithm found an ε = 0.01 approximate regularization parameter within a minute in 10-fold CV for a dataset with more than 50000 instances (see the results on D10 for ε = 0.01 with op2 and op3 in Table 1).

Table 2: Benchmark datasets used in the experiments (dataset name, sample size, input dimension).

  D1  heart             270    13        D6   german.numer   1000    24
  D2  liver-disorders   345     6        D7   svmguide3      1284    21
  D3  ionosphere        351    34        D8   svmguide1      7089     4
  D4  australian        690    14        D9   a1a           32561   123
  D5  diabetes          768     8        D10  w8a           64700   300

6 Conclusions and future works

We presented a novel algorithmic framework for computing CV error lower bounds as a function of the regularization parameter. The proposed framework can be used for a theoretically guaranteed choice of a regularization parameter. An additional advantage of this framework is that we only need to compute a set of sufficiently good approximate solutions for obtaining such a theoretical guarantee, which is computationally advantageous. As demonstrated in the experiments, our algorithm is practical in the sense that the computational cost is reasonable as long as the approximation quality ε is not too close to 0. An important future work is to extend the approach to the tuning of multiple hyper-parameters.

References
[1] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[2] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
[3] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012-1030, 2007.
[4] J. Giesen, J. Mueller, S. Laue, and S. Swiercy. Approximating concavely parameterized optimization problems. In Advances in Neural Information Processing Systems, 2012.
[5] J. Giesen, M. Jaggi, and S. Laue. Approximating parameterized convex optimization problems. ACM Transactions on Algorithms, 9, 2012.
[6] J. Giesen, S. Laue, and P. Wieschollek. Robust and efficient kernel hyperparameter paths with guarantees. In International Conference on Machine Learning, 2014.
[7] J. Mairal and B. Yu. Complexity analysis of the Lasso regularization path. In International Conference on Machine Learning, 2012.
[8] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12:2013-2036, 2000.
[9] T. Joachims. Estimating the generalization performance of a SVM efficiently. In International Conference on Machine Learning, 2000.
[10] K. Chung, W. Kao, C. Sun, L. Wang, and C. Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural Computation, 2003.
[11] M. Lee, S. Keerthi, C. Ong, and D. DeCoste. An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels. IEEE Transactions on Neural Networks, 15:750-757, 2004.
[12] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 2012.
[13] Z. Xiang, H. Xu, and P. Ramadge. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems, 2011.
[14] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning, 2013.
[15] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe screening with variational inequalities and its application to Lasso. In International Conference on Machine Learning, volume 32, 2014.
[16] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, 2014.
[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1996.
[18] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning. Cambridge University Press, 2014.
[19] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
[20] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
[21] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131-159, 2002.
[22] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[23] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:1-39, 2011.
[24] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155-1178, 2007.
[25] C. Lin, R. Weng, and S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627-650, 2008.
[26] R. Fan, K. Chang, and C. Hsieh. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
The Journal of\n\nMachine Learning, 9:1871\u20131874, 2008.\n\n9\n\n\f", "award": [], "sourceid": 1025, "authors": [{"given_name": "Atsushi", "family_name": "Shibagaki", "institution": "Nagoya Institute of Technology"}, {"given_name": "Yoshiki", "family_name": "Suzuki", "institution": "Nagoya Institute of Technology"}, {"given_name": "Masayuki", "family_name": "Karasuyama", "institution": "Nagoya Institute of Technology"}, {"given_name": "Ichiro", "family_name": "Takeuchi", "institution": "Nagoya Institute of Technology"}]}
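The guarantee described above, namely stopping the search once the CV error of the current best solution is within ε of a lower bound on the best possible CV error, can be sketched as follows. This is a minimal illustration under our own assumptions: the candidate-scan loop structure and the toy `cv_error` / `global_lower_bound` callables are hypothetical stand-ins, not the paper's actual bound computation from Section 3.

```python
import math


def find_eps_approximate_C(cv_error, global_lower_bound, C_candidates, eps):
    """Scan candidate regularization parameters and stop as soon as the
    current best CV error is certified to be within eps of the best
    possible CV error (via the supplied lower-bound callable)."""
    best_C, best_err = None, float("inf")
    for C in C_candidates:
        err = cv_error(C)  # CV error of the (approximate) solution at C
        if err < best_err:
            best_C, best_err = C, err
        # eps-approximation guarantee: no other C can improve on the
        # current best by more than eps, so we may stop early
        if best_err - global_lower_bound() <= eps:
            break
    return best_C, best_err


# Toy example: a smooth validation-error curve with minimum 0.10 at C = 1
cv = lambda C: 0.10 + 0.01 * math.log10(C) ** 2
lb = lambda: 0.10  # pretend-oracle lower bound on the best possible CV error
C_hat, err = find_eps_approximate_C(cv, lb, [1e-3, 1e-1, 1e1, 1e3], eps=0.05)
# stops at C = 0.1 (err = 0.11), since 0.11 - 0.10 <= 0.05
```

In this toy run the loop terminates after two candidates rather than four, which mirrors why T (and hence the total cost) shrinks as ε grows in Table 1.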