{"title": "Asymptotically Optimal Regularization in Smooth Parametric Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1132, "page_last": 1140, "abstract": "Many types of regularization schemes have been employed in statistical learning, each one motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning.", "full_text": "Asymptotically Optimal Regularization in Smooth Parametric Models

Percy Liang (University of California, Berkeley), pliang@cs.berkeley.edu
Francis Bach (INRIA - École Normale Supérieure, France), francis.bach@ens.fr
Guillaume Bouchard (Xerox Research Centre Europe, France), Guillaume.Bouchard@xrce.xerox.com
Michael I. Jordan (University of California, Berkeley), jordan@cs.berkeley.edu

Abstract

Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning.

1 Introduction

Many problems in machine learning and statistics involve the estimation of parameters from finite data.
Although empirical risk minimization has favorable limiting properties, it is well known that this procedure can overfit on finite data. Hence, various forms of regularization have been employed to control this overfitting. Regularizers are usually chosen based on assumptions about the problem domain at hand. For example, in classification, we might use L2 regularization if we expect the data to be separable with a large margin. We might regularize with a generative model if we think it is roughly well-specified [7, 20, 15, 17]. In multi-task learning, we might penalize deviation between parameters across tasks if we believe the tasks to be similar [3, 12, 2, 13].

In each case, we would like (1) a procedure for choosing the parameters of the regularizer (for example, its strength) and (2) an analysis that shows the amount by which regularization reduces expected risk, expressed as a function of the compatibility between the regularizer and the problem domain. In this paper, we address these two points by developing an asymptotic analysis of smooth regularizers for parametric problems. The key idea is to derive a second-order Taylor approximation of the expected risk, yielding a simple and interpretable quadratic form which can be directly minimized with respect to the regularization parameters. We first develop the general theory (Section 2) and then apply it to some examples of common regularizers used in practice (Section 3).

2 General theory

We use uppercase letters (e.g., $L$, $R$, $Z$) to denote random variables and script letters (e.g., $\mathcal{L}$, $\mathcal{R}$, $\mathcal{I}$) to denote constant limits of random variables. For a $\lambda$-parametrized differentiable function $\theta \mapsto f(\lambda; \theta)$, let $\dot f$, $\ddot f$, and $\dddot f$ denote the first, second, and third derivatives of $f$ with respect to $\theta$, and let $\nabla f(\lambda; \theta)$ denote the derivative with respect to $\lambda$.
Let $X_n = O_p(n^{-\alpha})$ denote a sequence of random variables for which $n^{\alpha} X_n$ is bounded in probability. Let $X_n \xrightarrow{P} X$ denote convergence in probability. For a vector $v$, let $v^{\otimes} = vv^\top$. Expectation and variance operators are denoted $\mathbb{E}[\cdot]$ and $\mathbb{V}[\cdot]$, respectively.

2.1 Setup

We are given a loss function $\ell(\cdot\,; \theta)$ parametrized by $\theta \in \mathbb{R}^d$ (e.g., $\ell((x, y); \theta) = \frac{1}{2}(y - x^\top\theta)^2$ for linear regression). Our goal is to minimize the expected risk,

$\theta_\infty \overset{\mathrm{def}}{=} \operatorname{argmin}_{\theta \in \mathbb{R}^d} L(\theta), \qquad L(\theta) \overset{\mathrm{def}}{=} \mathbb{E}_{Z \sim p^*}[\ell(Z; \theta)],$   (1)

which averages the loss over some true data-generating distribution $p^*(Z)$. We do not have access to $p^*$, but instead receive a sample of $n$ i.i.d. data points $Z_1, \dots, Z_n$ drawn from $p^*$. The standard unregularized estimator minimizes the empirical risk:

$\hat\theta^0_n \overset{\mathrm{def}}{=} \operatorname{argmin}_{\theta \in \mathbb{R}^d} L_n(\theta), \qquad L_n(\theta) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell(Z_i; \theta).$   (2)

Although $\hat\theta^0_n$ is consistent (that is, it converges in probability to $\theta_\infty$) under relatively weak conditions, it is well known that regularization can improve performance substantially for finite $n$. Let $R_n(\lambda, \theta)$ be a (possibly data-dependent) regularization function, where $\lambda \in \mathbb{R}^b$ are the regularization parameters. For linear regression, we might use squared regularization ($R_n(\lambda, \theta) = \frac{\lambda}{2n}\|\theta\|^2$), where $\lambda \in \mathbb{R}$ determines the strength. Define the regularized estimator as follows:

$\hat\theta^\lambda_n \overset{\mathrm{def}}{=} \operatorname{argmin}_{\theta \in \mathbb{R}^d} L_n(\theta) + R_n(\lambda, \theta).$   (3)

The goal of this paper is to choose good values of $\lambda$ and analyze the subsequent impact on performance.
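As a concrete illustration of (2) and (3), the following is a minimal sketch (in Python/NumPy; the synthetic data and the choice of a quadratic penalty are our own illustrative assumptions, not part of the paper) of the unregularized and squared-regularized estimators for linear regression, where both argmins have closed forms:

```python
import numpy as np

def unregularized(X, y):
    # hat{theta}^0_n: minimizes L_n(theta) = (1/n) sum_i (y_i - x_i' theta)^2 / 2
    return np.linalg.solve(X.T @ X, X.T @ y)

def regularized(X, y, lam):
    # hat{theta}^lambda_n: minimizes L_n(theta) + (lam / (2n)) ||theta||^2.
    # Note the O(1/n) scaling of the penalty, so its effect vanishes as n grows.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + (lam / n) * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(size=n)

theta0 = unregularized(X, y)
theta_lam = regularized(X, y, lam=5.0)    # shrinks theta0 toward zero
```

Setting `lam=0` recovers the unregularized estimator exactly, matching the requirement $R_n(0, \theta) \equiv 0$ in Assumption 3 below.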
Specifically, we wish to minimize the relative risk:

$\mathcal{L}_n(\lambda) \overset{\mathrm{def}}{=} \mathbb{E}_{Z_1,\dots,Z_n \sim p^*}\big[L(\hat\theta^\lambda_n) - L(\hat\theta^0_n)\big],$   (4)

which is the difference in risk (averaged over the training data) between the regularized and unregularized estimators; $\mathcal{L}_n(\lambda) < 0$ is desirable. Clearly, $\operatorname{argmin}_\lambda \mathcal{L}_n(\lambda)$ is the optimal regularization parameter. However, it is difficult to get a handle on $\mathcal{L}_n(\lambda)$. Therefore, the main focus of this work is on deriving an asymptotic expansion for $\mathcal{L}_n(\lambda)$. In this paper, we make the following assumptions:¹

Assumption 1 (Compact support). The true distribution $p^*(Z)$ has compact support.

Assumption 2 (Smooth loss). The loss function $\ell(z; \theta)$ is thrice-differentiable with respect to $\theta$. Furthermore, assume the expected Hessian of the loss function is positive definite ($\ddot{\mathcal{L}}(\theta_\infty) \succ 0$).²

Assumption 3 (Smooth regularizer). The regularizer $R_n(\lambda, \theta)$ is thrice-differentiable with respect to $\theta$ and differentiable with respect to $\lambda$. Assume $R_n(0, \theta) \equiv 0$ and $R_n(\lambda, \theta) \xrightarrow{P} 0$ as $n \to \infty$.

2.2 Rate of regularization strength

Let us establish some basic properties that the regularizer $R_n(\lambda, \theta)$ should satisfy. First, a desirable property is consistency ($\hat\theta^\lambda_n \xrightarrow{P} \theta_\infty$), i.e., convergence to the parameters that achieve the minimum possible risk in our hypothesis class. To achieve this, it suffices (and in general also necessitates) that (1) the loss class satisfies standard uniform convergence properties [22] and (2) the regularizer has a vanishing impact in the limit of infinite data ($R_n(\lambda, \theta) \xrightarrow{P} 0$). These two properties can be verified given our assumptions.

The next question is: at what rate should $R_n(\lambda, \theta)$ converge to 0?
As we show in [16], $R_n(\lambda, \theta) = O_p(n^{-1})$ is the rate that minimizes the relative risk $\mathcal{L}_n$. With this rate, it is natural to consider the regularizer as a prior $p(\theta \mid \lambda) \propto \exp\{-R_n(\lambda, \theta)\}$ (and $-\ell(z; \theta)$ as the log-likelihood), in which case $\hat\theta^\lambda_n$ is the maximum a posteriori (MAP) estimate.

¹While we do not explicitly assume convexity of $\ell$ and $R_n$, the local nature of our analysis means that we are essentially working under strong convexity.

²This assumption can be weakened. If $\ddot{\mathcal{L}} \not\succ 0$, the parameters can only be estimated up to the row space of $\ddot{\mathcal{L}}$. But since we are interested in the parameters $\theta$ only in terms of $L(\theta)$, this particular non-identifiability of the parameters is irrelevant.

2.3 Asymptotic expansion

Our main result is the following theorem, which provides a simple, interpretable asymptotic expression for the relative risk, characterizing the impact of regularization (see [16] for the proof):

Theorem 1. Assume $R_n(\lambda, \theta_\infty) = O_p(n^{-1})$.
The relative risk admits the following asymptotic expansion:

$\mathcal{L}_n(\lambda) = \mathcal{L}(\lambda) \cdot n^{-2} + O_p(n^{-\frac{5}{2}})$   (5)

in terms of the asymptotic relative risk:

$\mathcal{L}(\lambda) \overset{\mathrm{def}}{=} \tfrac{1}{2}\operatorname{tr}\{\dot R(\lambda)^{\otimes} \ddot{\mathcal{L}}^{-1}\} - \operatorname{tr}\{\mathcal{I}_{\ell\ell}\, \ddot{\mathcal{L}}^{-1} \ddot R(\lambda)\, \ddot{\mathcal{L}}^{-1}\} - 2\mathcal{B}^\top \dot R(\lambda) + \operatorname{tr}\{\mathcal{I}_{\ell r}(\lambda)\, \ddot{\mathcal{L}}^{-1}\},$   (6)

where $\ddot{\mathcal{L}} \overset{\mathrm{def}}{=} \mathbb{E}[\ddot\ell(Z; \theta_\infty)]$, $R(\lambda) \overset{\mathrm{def}}{=} \lim_{n\to\infty} nR_n(\lambda, \theta_\infty)$ (derivatives thereof are defined analogously), $\mathcal{I}_{\ell\ell} \overset{\mathrm{def}}{=} \mathbb{E}[\dot\ell(Z; \theta_\infty)^{\otimes}]$, $\mathcal{I}_{\ell r}(\lambda) \overset{\mathrm{def}}{=} \lim_{n\to\infty} n\mathbb{E}[\dot L_n \dot R_n(\lambda)^\top]$, and $\mathcal{B} \overset{\mathrm{def}}{=} \lim_{n\to\infty} n\mathbb{E}[\hat\theta^0_n - \theta_\infty]$.

The most important equation of this paper is (6), which captures the lowest-order terms of the relative risk defined in (4).

Interpretation. The significance of Theorem 1 is in identifying the three problem-dependent contributions to the asymptotic relative risk:

Squared bias of the regularizer, $\operatorname{tr}\{\dot R(\lambda)^{\otimes} \ddot{\mathcal{L}}^{-1}\}$: $\dot R(\lambda)$ is the gradient of the regularizer at the limiting parameters $\theta_\infty$; the squared regularizer bias is the squared norm of $\dot R(\lambda)$ with respect to the Mahalanobis metric given by $\ddot{\mathcal{L}}$. Note that the squared regularizer bias is always positive: it always increases the risk by an amount which depends on how "wrong" the regularizer is.

Variance reduction provided by the regularizer, $\operatorname{tr}\{\mathcal{I}_{\ell\ell}\, \ddot{\mathcal{L}}^{-1} \ddot R(\lambda)\, \ddot{\mathcal{L}}^{-1}\}$: The key quantity is $\ddot R(\lambda)$, the Hessian of the regularizer, whose impact on the relative risk is channeled through $\ddot{\mathcal{L}}^{-1}$ and $\mathcal{I}_{\ell\ell}$. For convex regularizers, $\ddot R(\lambda) \succeq 0$, so we always improve the stability of the estimate by regularizing.
Furthermore, if the loss is the negative log-likelihood and our model is well-specified (that is, $p^*(z) = \exp\{-\ell(z; \theta_\infty)\}$), then $\mathcal{I}_{\ell\ell} = \ddot{\mathcal{L}}$ by the first Bartlett identity [4], and the variance reduction term simplifies to $\operatorname{tr}\{\ddot R(\lambda)\, \ddot{\mathcal{L}}^{-1}\}$.

Alignment between regularizer bias and unregularized estimator bias, $2\mathcal{B}^\top \dot R(\lambda) - \operatorname{tr}\{\mathcal{I}_{\ell r}(\lambda)\, \ddot{\mathcal{L}}^{-1}\}$: The alignment has two parts, the first of which is nonzero only for non-linear models and the second of which is nonzero only when the regularizer depends on the training data. The unregularized estimator errs in direction $\mathcal{B}$; we can reduce the risk if the regularizer bias $\dot R(\lambda)$ helps correct for the estimator bias ($\mathcal{B}^\top \dot R(\lambda) > 0$). The second part carries the same intuition: the risk is reduced when the random regularizer compensates for the loss ($\operatorname{tr}\{\mathcal{I}_{\ell r}(\lambda)\, \ddot{\mathcal{L}}^{-1}\} < 0$).

2.4 Oracle regularizer

The principal advantage of having a simple expression for $\mathcal{L}(\lambda)$ is that we can minimize it with respect to $\lambda$. Let $\lambda^* \overset{\mathrm{def}}{=} \operatorname{argmin}_\lambda \mathcal{L}(\lambda)$ and call $\hat\theta^{\lambda^*}_n$ the oracle estimator. We have a closed form for $\lambda^*$ in the important special case that the regularization parameter $\lambda$ is the strength of the regularizer:

Corollary 1 (Oracle regularization strength). If $R_n(\lambda, \theta) = \frac{\lambda}{n} r(\theta)$ for some $r(\theta)$, then

$\lambda^* = \frac{C_1}{C_2}, \qquad \mathcal{L}(\lambda^*) = -\frac{C_1^2}{2C_2},$   (7)

where $C_1 \overset{\mathrm{def}}{=} \operatorname{tr}\{\mathcal{I}_{\ell\ell}\, \ddot{\mathcal{L}}^{-1} \ddot r\, \ddot{\mathcal{L}}^{-1}\} + 2\mathcal{B}^\top \dot r - \operatorname{tr}\{\mathcal{I}_{\ell r}\, \ddot{\mathcal{L}}^{-1}\}$ and $C_2 \overset{\mathrm{def}}{=} \dot r^\top \ddot{\mathcal{L}}^{-1} \dot r$.

Proof. (6) is a quadratic in $\lambda$; solve by differentiation. Compute $\mathcal{L}(\lambda^*)$ by substitution.

In general, $\lambda^*$ will depend on $\theta_\infty$ and hence is not computable from data; Section 2.5 will remedy this.
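The closed form in Corollary 1 is easy to evaluate once the problem-dependent matrices are available. A minimal sketch (the function name and inputs are ours; $\mathcal{I}_{\ell r} = 0$ is assumed, as holds for a data-independent regularizer), checked against the normal-means model analyzed in Section 3.1:

```python
import numpy as np

def oracle_strength(I_ll, L_dd, r_dd, r_d, B):
    """Corollary 1: lambda* = C1 / C2 and L(lambda*) = -C1^2 / (2 C2), with
    C1 = tr(I_ll Ldd^-1 rdd Ldd^-1) + 2 B' rd   (I_lr = 0 assumed) and
    C2 = rd' Ldd^-1 rd."""
    L_inv = np.linalg.inv(L_dd)
    C1 = np.trace(I_ll @ L_inv @ r_dd @ L_inv) + 2.0 * (B @ r_d)
    C2 = r_d @ L_inv @ r_d
    return C1 / C2, -C1 ** 2 / (2.0 * C2)

# Normal-means sanity check (Section 3.1): I_ll = Ldd = rdd = I, rd = theta,
# B = 0, which should give lambda* = d / ||theta||^2, L* = -d^2 / (2 ||theta||^2).
d = 5
theta_inf = np.full(d, 2.0)
lam_star, L_star = oracle_strength(np.eye(d), np.eye(d), np.eye(d),
                                   theta_inf, np.zeros(d))
```

The quadratic shape of (6) in $\lambda$ is what makes this a one-line minimization rather than a search.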
Nevertheless, the oracle regularizer provides an upper bound on performance and some insight into the relevant quantities that make a regularizer useful.

Note that $\mathcal{L}(\lambda^*) \le 0$, since optimizing over $\lambda$ must be no worse than not regularizing, as $\mathcal{L}(0) = 0$. But what might be surprising at first is that the oracle regularization parameter $\lambda^*$ can be negative (corresponding to "anti-regularization"). But if $\frac{\partial \mathcal{L}(\lambda)}{\partial \lambda}\big|_{\lambda=0} = -C_1 < 0$, then (positive) regularization helps ($\lambda^* > 0$ and $\mathcal{L}(\lambda) < 0$ for $0 < \lambda < 2\lambda^*$).

Table 1: Notation for the various estimators and their relative risks.

| Estimator | UNREGULARIZED | ORACLE | PLUGIN | ORACLEPLUGIN |
|---|---|---|---|---|
| Notation | $\hat\theta^0_n$ | $\hat\theta^{\lambda^*}_n$ | $\hat\theta^{\hat\lambda_n}_n = \hat\theta^{\bullet 1}_n$ | $\hat\theta^{\bullet\lambda^{\bullet*}}_n$ |
| Relative risk | $0$ | $\mathcal{L}(\lambda^*)$ | $\mathcal{L}^\bullet(1)$ | $\mathcal{L}^\bullet(\lambda^{\bullet*})$ |

2.5 Plugin regularizer

While the oracle regularizer $R_n(\lambda^*, \theta)$ given by (7) is asymptotically optimal, $\lambda^*$ depends on the unknown $\theta_\infty$, so $\hat\theta^{\lambda^*}_n$ is actually not implementable. In this section, we develop the plugin regularizer as a way to avoid this dependence.
The key idea is to substitute $\lambda^*$ with an estimate $\hat\lambda_n \overset{\mathrm{def}}{=} \lambda^* + \varepsilon_n$, where $\varepsilon_n = O_p(n^{-\frac{1}{2}})$. We then use the plugin estimator $\hat\theta^{\hat\lambda_n}_n \overset{\mathrm{def}}{=} \operatorname{argmin}_\theta L_n(\theta) + R_n(\hat\lambda_n, \theta)$.

How well does this plugin estimator work, that is, what is its relative risk $\mathbb{E}[L(\hat\theta^{\hat\lambda_n}_n) - L(\hat\theta^0_n)]$? We cannot simply write $\mathcal{L}_n(\hat\lambda_n)$ and apply Theorem 1 because $\mathcal{L}(\cdot)$ can only be applied to non-random arguments. However, we can still leverage the existing machinery by defining a new plugin regularizer $R^\bullet_n(\lambda^\bullet, \theta) \overset{\mathrm{def}}{=} \lambda^\bullet R_n(\hat\lambda_n, \theta)$ with regularization parameter $\lambda^\bullet \in \mathbb{R}$. Henceforth, the superscript $\bullet$ will denote quantities concerning the plugin regularizer. The corresponding estimator $\hat\theta^{\bullet\lambda^\bullet}_n \overset{\mathrm{def}}{=} \operatorname{argmin}_\theta L_n(\theta) + R^\bullet_n(\lambda^\bullet, \theta)$ has relative risk $\mathcal{L}^\bullet_n(\lambda^\bullet) = \mathbb{E}[L(\hat\theta^{\bullet\lambda^\bullet}_n) - L(\hat\theta^{\bullet 0}_n)]$. The key identity is $\hat\theta^{\hat\lambda_n}_n = \hat\theta^{\bullet 1}_n$, which means the asymptotic risk of the plugin estimator $\hat\theta^{\hat\lambda_n}_n$ is simply $\mathcal{L}^\bullet(1)$.

We could try to squeeze more out of the plugin regularizer by further optimizing $\lambda^\bullet$ according to $\lambda^{\bullet*} \overset{\mathrm{def}}{=} \operatorname{argmin}_{\lambda^\bullet} \mathcal{L}^\bullet(\lambda^\bullet)$ and use the oracle plugin estimator $\hat\theta^{\bullet\lambda^{\bullet*}}_n$ rather than just using $\lambda^\bullet = 1$. In general, this is not useful since $\lambda^{\bullet*}$ might depend on $\theta_\infty$, and the whole point of plugin is to remove this dependence. However, in a fortuitous turn of events, for some linear models (Sections 3.1 and 3.4), $\lambda^{\bullet*}$ is in fact independent of $\theta_\infty$, and so $\hat\theta^{\bullet\lambda^{\bullet*}}_n$ is actually implementable. Table 1 summarizes all the estimators we have discussed.

The following theorem relates the risks of all the estimators we have considered (see [16] for the proof):

Theorem 2 (Relative risk of plugin). The relative risk of the plugin estimator is $\mathcal{L}^\bullet(1) = \mathcal{L}(\lambda^*) + E$, where $E \overset{\mathrm{def}}{=} \lim_{n\to\infty} n\mathbb{E}[\operatorname{tr}\{\dot L_n (\nabla\dot R_n(\lambda^*)\varepsilon_n)^\top \ddot{\mathcal{L}}^{-1}\}]$.
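Procedurally, PLUGIN is a simple two-stage recipe: fit the unregularized estimator, plug it into the formula for $\lambda^*$, and refit with the resulting strength. A minimal sketch for the Gaussian-means case worked out in Section 3.1 (where $\lambda^* = f(\theta) = d/\|\theta\|^2$; the function name and data are ours):

```python
import numpy as np

def plugin_estimate(X):
    """Two-stage PLUGIN: (i) unregularized estimate, (ii) lambda_hat = f(theta_hat),
    (iii) refit with the quadratic regularizer (lam / 2n) ||theta||^2.
    For Gaussian means the regularized argmin has the closed form
    theta = n * xbar / (n + lam)."""
    n, d = X.shape
    xbar = X.mean(axis=0)                 # stage (i): hat{theta}^0_n
    lam_hat = d / (xbar @ xbar)           # stage (ii): f(theta) = d / ||theta||^2
    return n * xbar / (n + lam_hat)       # stage (iii): shrunken refit

rng = np.random.default_rng(1)
n, d = 50, 6
theta_inf = np.ones(d)                    # hypothetical true means
X = rng.normal(loc=theta_inf, size=(n, d))
theta_plugin = plugin_estimate(X)
```

Because $\hat\lambda_n$ depends on the data only through $\hat\theta^0_n$, this is exactly the setting covered by Theorem 3 below.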
If $R_n(\lambda)$ is linear in $\lambda$, then the relative risk of the oracle plugin estimator is $\mathcal{L}^\bullet(\lambda^{\bullet*}) = \mathcal{L}^\bullet(1) + \frac{E^2}{4\mathcal{L}(\lambda^*)}$, attained at $\lambda^{\bullet*} = 1 + \frac{E}{2\mathcal{L}(\lambda^*)}$.

Note that the sign of $E$ depends on the nature of the error $\varepsilon_n$, so PLUGIN could be either better or worse than ORACLE. On the other hand, ORACLEPLUGIN is always better than PLUGIN. We can get a simpler expression for $E$ if we know more about $\varepsilon_n$ (see [16] for the proof):

Theorem 3. Suppose $\lambda^* = f(\theta_\infty)$ for some differentiable $f : \mathbb{R}^d \to \mathbb{R}^b$. If $\hat\lambda_n = f(\hat\theta^0_n)$, then the results of Theorem 2 hold with $E = -\operatorname{tr}\{\mathcal{I}_{\ell\ell}\, \ddot{\mathcal{L}}^{-1} \nabla\dot R(\lambda^*) \dot f\, \ddot{\mathcal{L}}^{-1}\}$.

3 Examples

In this section, we apply our results from Section 2 to specific problems. Having made all the asymptotic derivations in the general setting, we now only need to make a few straightforward calculations to obtain the asymptotic relative risks and regularization parameters for a given problem. We first explore two classical examples from statistics (Sections 3.1 and 3.2) to get some intuition for the theory. Then we consider two important examples in machine learning (Sections 3.3 and 3.4).

3.1 Estimation of normal means

Assume that data are generated from a multivariate normal distribution with $d$ independent components ($p^* = \mathcal{N}(\theta_\infty, I)$). We use the negative log-likelihood as the loss function: $\ell(x; \theta) = \frac{1}{2}\|x - \theta\|^2$, so the model is well-specified.

In his seminal 1961 paper [14], Stein showed that, surprisingly, the standard empirical risk minimizer $\hat\theta^0_n = \bar X \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n X_i$ is beaten by the James-Stein estimator $\hat\theta^{\mathrm{JS}}_n \overset{\mathrm{def}}{=} \bar X\big(1 - \frac{d-2}{n\|\bar X\|^2}\big)$, in the sense that $\mathbb{E}[L(\hat\theta^{\mathrm{JS}}_n)] < \mathbb{E}[L(\hat\theta^0_n)]$ for all $n$ and $\theta_\infty$ if $d > 2$. We will show that the James-Stein estimator is essentially equivalent to ORACLEPLUGIN with quadratic regularization ($r(\theta) = \frac{1}{2}\|\theta\|^2$).

First compute $\dot L_n = \theta_\infty - \bar X$, $\ddot{\mathcal{L}} = I$, $\mathcal{B} = 0$, $\dot r = \theta_\infty$, and $\ddot r = I$. By (7), the oracle regularization weight is $\lambda^* = \frac{d}{\|\theta_\infty\|^2}$, which yields a relative risk of $\mathcal{L}(\lambda^*) = -\frac{d^2}{2\|\theta_\infty\|^2}$.

Now let us derive PLUGIN (Section 2.5). We have $f(\theta) = \frac{d}{\|\theta\|^2}$ and $\dot f(\theta) = \frac{-2d\theta}{\|\theta\|^4}$. By Theorems 2 and 3, $E = \frac{2d}{\|\theta_\infty\|^2}$ and $\mathcal{L}^\bullet(1) = -\frac{d(d-4)}{2\|\theta_\infty\|^2}$. Note that since $E > 0$, PLUGIN is always (asymptotically) worse than ORACLE, but better than UNREGULARIZED if $d > 4$.

To get ORACLEPLUGIN, compute $\lambda^{\bullet*} = 1 - \frac{2}{d}$ (note that this does not depend on $\theta_\infty$), which results in $R^\bullet_n(\theta) = \frac{d-2}{2n\|\bar X\|^2}\|\theta\|^2$. By Theorem 2, its relative risk is $\mathcal{L}^\bullet(\lambda^{\bullet*}) = -\frac{(d-2)^2}{2\|\theta_\infty\|^2}$, which offers a small improvement over PLUGIN (and is superior to UNREGULARIZED when $d > 2$).

Note that ORACLEPLUGIN and PLUGIN are adaptive: we regularize more or less depending on whether our preliminary estimate $\bar X$ is small or large, respectively. By simple algebra, ORACLEPLUGIN has the closed form $\hat\theta^{\bullet\lambda^{\bullet*}}_n = \bar X\big(1 - \frac{d-2}{n\|\bar X\|^2 + d - 2}\big)$, which differs from JAMESSTEIN by a very small amount: $\hat\theta^{\bullet\lambda^{\bullet*}}_n - \hat\theta^{\mathrm{JS}}_n = O_p(n^{-\frac{5}{2}})$. ORACLEPLUGIN has the added benefit that it always shrinks towards zero by an amount between 0 and 1, whereas JAMESSTEIN can overshoot. Empirically, we found that ORACLEPLUGIN generally had a lower expected risk than JAMESSTEIN when $\|\theta_\infty\|$ is large, but JAMESSTEIN was better when $\|\theta_\infty\| \le 1$.
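The advantage of both shrinkage estimators over the maximum-likelihood estimate is easy to confirm by simulation; a minimal Monte Carlo sketch (the dimensions, sample size, and seed are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 8, 20, 2000
theta_inf = np.full(d, 1.5)

def james_stein(xbar, n, d):
    # xbar * (1 - (d-2) / (n ||xbar||^2)); the factor can overshoot past zero
    return xbar * (1.0 - (d - 2) / (n * (xbar @ xbar)))

def oracle_plugin(xbar, n, d):
    # xbar * (1 - (d-2) / (n ||xbar||^2 + d - 2)); shrinkage factor stays in (0, 1)
    return xbar * (1.0 - (d - 2) / (n * (xbar @ xbar) + d - 2))

risk = {"mle": 0.0, "js": 0.0, "op": 0.0}
for _ in range(trials):
    # xbar is a sufficient statistic: xbar ~ N(theta_inf, I/n)
    xbar = theta_inf + rng.normal(size=d) / np.sqrt(n)
    risk["mle"] += np.sum((xbar - theta_inf) ** 2)
    risk["js"] += np.sum((james_stein(xbar, n, d) - theta_inf) ** 2)
    risk["op"] += np.sum((oracle_plugin(xbar, n, d) - theta_inf) ** 2)
for k in risk:
    risk[k] /= trials
```

Because all three estimators are evaluated on the same draws of $\bar X$, the comparison is paired and the risk gap of order $(d-2)^2/(n^2\|\theta_\infty\|^2)$ shows up well above the Monte Carlo noise.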
3.2 Binomial estimation

Consider the estimation of $\theta$, the log-odds of a coin coming up heads. We use the negative log-likelihood loss $\ell(x; \theta) = -x\theta + \log(1 + e^\theta)$, where $x \in \{0, 1\}$ is the outcome of the coin. This example serves to provide intuition for the bias $\mathcal{B}$ appearing in (6), which is typically ignored in first-order asymptotics or is zero (for linear models).

Consider a regularizer $r(\theta) = \frac{1}{2}(\theta + 2\log(1 + e^{-\theta}))$, which corresponds to a $\mathrm{Beta}(\frac{\lambda}{2}, \frac{\lambda}{2})$ prior. Choosing $\lambda$ has been studied extensively in statistics. Some common choices are the Haldane prior ($\lambda = 0$), the reference (Jeffreys) prior ($\lambda = 1$), the uniform prior ($\lambda = 2$), and Laplace smoothing ($\lambda = 4$). We will choose $\lambda$ to minimize expected risk adaptively based on data.

Define $\mu \overset{\mathrm{def}}{=} \frac{1}{1+e^{-\theta_\infty}}$, $v \overset{\mathrm{def}}{=} \mu(1-\mu)$, and $b \overset{\mathrm{def}}{=} \mu - \frac{1}{2}$. Then compute $\ddot{\mathcal{L}} = v$, $\dddot{\mathcal{L}} = -2vb$, $\dot r = b$, $\ddot r = v$, and $\mathcal{B} = v^{-1}b$. ORACLE corresponds to $\lambda^* = 2 + \frac{v}{b^2}$. Note that $\lambda^* > 0$, so again (positive) regularization always helps.

We can compute the difference between ORACLE and PLUGIN: $E = 2 - \frac{2v}{b^2}$. If $|b| > \frac{\sqrt{2}}{4}$, then $E > 0$, which means that PLUGIN is worse; otherwise PLUGIN is actually better. Even when PLUGIN is worse than ORACLE, PLUGIN is still better than UNREGULARIZED, which can be verified by checking that $\mathcal{L}^\bullet(1) = -\frac{5}{2}vb^{-2} - 2v^{-1}b^2 < 0$ for all $\theta_\infty$.
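The adaptive choice above can be turned into a two-stage plugin procedure for the coin: estimate $\mu$, form $\hat\lambda = 2 + \hat v/\hat b^2$, and smooth the counts with a $\mathrm{Beta}(\hat\lambda/2, \hat\lambda/2)$ prior. A minimal sketch (variable names are ours; the small floor on $\hat b^2$, which guards the division when the empirical frequency is exactly one half, is our own assumption):

```python
import numpy as np

def plugin_log_odds(x):
    """Adaptive Beta(lam/2, lam/2) smoothing of the log-odds,
    with lam_hat = 2 + v_hat / b_hat^2 plugged in from the data."""
    n = len(x)
    mu_hat = np.mean(x)
    v_hat = mu_hat * (1 - mu_hat)
    b_hat = mu_hat - 0.5
    lam_hat = 2.0 + v_hat / max(b_hat ** 2, 1e-8)  # floor avoids division by zero
    # Smoothed frequency: (heads + lam/2) / (n + lam), then map to log-odds
    mu_smooth = (np.sum(x) + lam_hat / 2.0) / (n + lam_hat)
    return np.log(mu_smooth / (1 - mu_smooth))

x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])  # 8 heads out of 10
theta_hat = plugin_log_odds(x)
```

Note the behavior matches the theory: the closer the empirical frequency is to one half (small $\hat b$), the larger $\hat\lambda$ and the harder the estimate is pulled toward even odds.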
3.3 Hybrid generative-discriminative learning

In prediction tasks, we wish to learn a mapping from some input $x \in \mathcal{X}$ to an output $y \in \mathcal{Y}$. A common approach is to use probabilistic models defined by exponential families, which are defined by a vector of sufficient statistics (features) $\phi(x, y) \in \mathbb{R}^d$ and an accompanying vector of parameters $\theta \in \mathbb{R}^d$. These features can be used to define a generative model (8) or a discriminative model (9):

$p_\theta(x, y) = \exp\{\phi(x, y)^\top\theta - A(\theta)\}, \qquad A(\theta) = \log \int_{\mathcal{X}}\int_{\mathcal{Y}} \exp\{\phi(x, y)^\top\theta\}\,dy\,dx,$   (8)

$p_\theta(y \mid x) = \exp\{\phi(x, y)^\top\theta - A(\theta; x)\}, \qquad A(\theta; x) = \log \int_{\mathcal{Y}} \exp\{\phi(x, y)^\top\theta\}\,dy.$   (9)

Table 2: The oracle regularizer for the hybrid generative-discriminative estimator. As misspecification increases, we regularize less, but the relative risk is reduced more (due to more variance reduction).

| Misspecification | $\operatorname{tr}\{\mathcal{I}_{\ell\ell}v_x^{-1}vv_x^{-1}\}$ | $2\mathcal{B}^\top(\mu - \mu_{xy})$ | $\operatorname{tr}\{(\mu - \mu_{xy})^{\otimes}v_x^{-1}\}$ | $\lambda^*$ | $\mathcal{L}(\lambda^*)$ |
|---|---|---|---|---|---|
| 0% | 5 | 0 | 0 | $\infty$ | -0.65 |
| 5% | 5.38 | -0.073 | 0.00098 | 310 | -48 |
| 50% | 13.8 | -1.0 | 0.034 | 230 | -808 |

Given these definitions, we can either use a generative estimator $\hat\theta^{\mathrm{gen}}_n \overset{\mathrm{def}}{=} \operatorname{argmin}_\theta G_n(\theta)$, where $G_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i, y_i)$, or a discriminative estimator $\hat\theta^{\mathrm{dis}}_n \overset{\mathrm{def}}{=} \operatorname{argmin}_\theta D_n(\theta)$, where $D_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \log p_\theta(y_i \mid x_i)$. There has been a flurry of work on combining generative and discriminative learning [7, 20, 15, 18, 17].
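For finite $\mathcal{X}$ and $\mathcal{Y}$, the two log-likelihoods in (8) and (9) differ only in which set the log-partition function normalizes over. A minimal sketch (the toy feature table and all names are ours):

```python
import numpy as np

def log_p_joint(phi, theta):
    """Generative log-likelihood log p_theta(x, y), eq. (8):
    phi has shape (|X|, |Y|, d); A(theta) normalizes over all of X x Y."""
    scores = phi @ theta                       # shape (|X|, |Y|)
    A = np.log(np.sum(np.exp(scores)))         # log-partition over X x Y
    return scores - A

def log_p_cond(phi, theta):
    """Discriminative log-likelihood log p_theta(y | x), eq. (9):
    A(theta; x) normalizes over Y only, separately for each x."""
    scores = phi @ theta
    A_x = np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return scores - A_x

rng = np.random.default_rng(3)
phi = rng.normal(size=(4, 3, 2))   # |X| = 4, |Y| = 3, d = 2 (toy feature table)
theta = np.array([0.5, -1.0])
lj = log_p_joint(phi, theta)
lc = log_p_cond(phi, theta)
```

The identity $\log p_\theta(y \mid x) = \log p_\theta(x, y) - \log p_\theta(x)$ connects the two and is a convenient check on any implementation.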
[17] showed that if the generative model is well-specified ($p^*(x, y) = p_{\theta_\infty}(x, y)$), then the generative estimator is better in the sense that $L(\hat\theta^{\mathrm{gen}}_n) \le L(\hat\theta^{\mathrm{dis}}_n) - \frac{c}{n} + O_p(n^{-\frac{3}{2}})$ for some $c \ge 0$; if the model is misspecified, the discriminative estimator is asymptotically better. To create a hybrid estimator, let us treat the discriminative and generative objectives as the empirical risk and the regularizer, respectively, so $\ell((x, y); \theta) = -\log p_\theta(y \mid x)$, $L_n = D_n$, and $R_n(\lambda, \theta) = \frac{\lambda}{n}G_n(\theta)$. As $n \to \infty$, the discriminative objective dominates, as desired. Our approach generalizes the analysis of [6], which applies only to unbiased estimators for conditionally well-specified models.

By moment-generating properties of the exponential family, we arrive at the following quantities (write $\phi$ for $\phi(X, Y)$): $\ddot{\mathcal{L}} = v_x \overset{\mathrm{def}}{=} \mathbb{E}_{p^*(X)}[\mathbb{V}_{p_{\theta_\infty}(Y|X)}[\phi \mid X]]$, $\dot R(\lambda) = \lambda(\mu - \mu_{xy}) \overset{\mathrm{def}}{=} \lambda(\mathbb{E}_{p_{\theta_\infty}(X,Y)}[\phi] - \mathbb{E}_{p^*(X,Y)}[\phi])$, and $\ddot R(\lambda) = \lambda v \overset{\mathrm{def}}{=} \lambda\mathbb{V}_{p_{\theta_\infty}(X,Y)}[\phi]$. The oracle regularization parameter is then

$\lambda^* = \frac{\operatorname{tr}\{\mathcal{I}_{\ell\ell}v_x^{-1}vv_x^{-1}\} + 2\mathcal{B}^\top(\mu - \mu_{xy}) - \operatorname{tr}\{\mathcal{I}_{\ell r}v_x^{-1}\}}{\operatorname{tr}\{(\mu - \mu_{xy})^{\otimes}v_x^{-1}\}}.$   (10)

The sign and magnitude of $\lambda^*$ provide insight into how generative regularization improves prediction as a function of the model and problem: specifically, a large positive $\lambda^*$ suggests regularization is helpful. To simplify, assume that the discriminative model is well-specified, that is, $p^*(y \mid x) = p_{\theta_\infty}(y \mid x)$ (note that the generative model could still be misspecified).
In this case, $\mathcal{I}_{\ell\ell} = \ddot{\mathcal{L}}$ and $\mathcal{I}_{\ell r} = v_x$, and so the numerator reduces to $\operatorname{tr}\{(v - v_x)v_x^{-1}\} + 2\mathcal{B}^\top(\mu - \mu_{xy})$. Since $v \succeq v_x$ (the key fact used in [17]), the variance reduction (plus the random alignment term from $\mathcal{I}_{\ell r}$) is always non-negative, with magnitude equal to the fraction of missing information provided by the generative model. There is still the non-random alignment term $2\mathcal{B}^\top(\mu - \mu_{xy})$, whose sign depends on the problem. Finally, the denominator (always positive) affects the optimal magnitude of the regularization. If the generative model is almost well-specified, $\mu$ will be close to $\mu_{xy}$, and the regularizer should be trusted more (large $\lambda^*$). Since our analysis is local, misspecification (how much $p_{\theta_\infty}(x, y)$ deviates from $p^*(x, y)$) is measured by a Mahalanobis distance between $\mu$ and $\mu_{xy}$, rather than by something more stringent and global like KL-divergence.

An empirical example. To provide some concrete intuition, we investigated the oracle regularizer for a synthetic binary classification problem of predicting $y \in \{0, 1\}$ from $x \in \{0, 1\}^k$. Using features $\phi(x, y) = (\mathbb{I}[y = 0]x^\top, \mathbb{I}[y = 1]x^\top)^\top$ defines the corresponding generative (naive Bayes) and discriminative (logistic regression) estimators. We set $k = 5$, $\theta_\infty = (\frac{1}{10}, \dots, \frac{1}{10}, \frac{3}{10}, \dots, \frac{3}{10})^\top$, and $p^*(x, y) = (1 - \epsilon)p_{\theta_\infty}(x, y) + \epsilon\, p_{\theta_\infty}(y)p_{\theta_\infty}(x_1 \mid y)\mathbb{I}[x_1 = \dots = x_k]$. The amount of misspecification is controlled by $0 \le \epsilon \le 1$, the fraction of examples whose features are perfectly correlated.

Table 2 shows how the oracle regularizer changes with $\epsilon$. As $\epsilon$ increases, $\lambda^*$ decreases (we regularize less), as expected.
But perhaps surprisingly, the relative risk is reduced more with more misspecification; this is due to the fact that the variance reduction term increases and has a quadratic effect on $\mathcal{L}(\lambda^*)$. Figure 1(a) shows the relative risk $\mathcal{L}_n(\lambda)$ for various values of $\lambda$. The vertical line corresponds to $\lambda^*$, which was computed numerically by sampling. Note that the minimizer of the curves ($\operatorname{argmin}_\lambda \mathcal{L}_n(\lambda)$), the desired quantity, is quite close to $\lambda^*$ and approaches $\lambda^*$ as $n$ increases, which empirically justifies our asymptotic approximations.

Unlabeled data. One of the main advantages of having a generative model is that we can leverage unlabeled examples by marginalizing out their hidden outputs. Specifically, suppose we have $m$ i.i.d. unlabeled examples $X_{n+1}, \dots, X_{n+m} \sim p^*(x)$, with $m \to \infty$ as $n \to \infty$. Define the unlabeled regularizer as $R_n(\lambda, \theta) = -\frac{\lambda}{nm}\sum_{i=1}^m \log p_\theta(X_{n+i})$.

We can compute $\dot R = \mu - \mu_{xy}$ using the stationarity conditions of the loss function at $\theta_\infty$. Also, $\ddot R = v - v_x$, and $\mathcal{I}_{\ell r} = 0$ (the regularizer does not depend on the labeled data). If the model is conditionally well-specified, we can verify that the oracle regularization parameter $\lambda^*$ is the same as if we had regularized with $G_n$. This equivalence suggests that the dominant concern asymptotically is developing an adequate generative model with small bias, not exactly how it is used in learning.

3.4 Multi-task regression
The intuition behind multi-task learning is to share statistical strength between tasks [3, 12, 2, 13]. Suppose we have $K$ regression tasks. For each task $k = 1, \dots, K$, we generate each data point $i = 1, \dots, n$ independently as follows: $X^k_i \sim p^*(X^k_i)$ and $Y^k_i \sim \mathcal{N}(X^{k\top}_i\theta^k_\infty, 1)$. We can treat this as a single-task problem by concatenating the vectors for all the tasks: $X_i = (X^{1\top}_i, \dots, X^{K\top}_i)^\top \in \mathbb{R}^{Kd}$, $Y = (Y^1, \dots, Y^K)^\top \in \mathbb{R}^K$, and $\theta = (\theta^{1\top}, \dots, \theta^{K\top})^\top \in \mathbb{R}^{Kd}$. It will also be useful to represent $\theta \in \mathbb{R}^{Kd}$ by the matrix $\Theta = (\theta^1, \dots, \theta^K) \in \mathbb{R}^{d \times K}$. The loss function is $\ell((x, y); \theta) = \frac{1}{2}\sum_{k=1}^K (y^k - x^{k\top}\theta^k)^2$. Assume the model is conditionally well-specified.

We would like to be flexible in case some tasks are more related than others, so let us define a positive definite matrix $\Lambda \in \mathbb{R}^{K \times K}$ of inter-task affinities and use the quadratic regularizer $r(\Lambda, \theta) = \frac{1}{2}\theta^\top(\Lambda \otimes I_d)\theta$. For simplicity, assume $\mathbb{E}[X^{k\otimes}_i] = I_d$, which implies that $\mathcal{I}_{\ell\ell} = \ddot{\mathcal{L}} = I_{Kd}$.

Most of the computations that follow parallel those of Section 3.1, only extended to matrices. Substituting the relevant quantities into (6) yields the relative risk $\mathcal{L}(\Lambda) = \frac{1}{2}\operatorname{tr}\{\Lambda^2\Theta^\top_\infty\Theta_\infty\} - d\operatorname{tr}\{\Lambda\}$. Optimizing with respect to $\Lambda$ produces the oracle regularization parameter $\Lambda^* = d(\Theta^\top_\infty\Theta_\infty)^{-1}$ and its associated relative risk $\mathcal{L}(\Lambda^*) = -\frac{1}{2}d^2\operatorname{tr}\{(\Theta^\top_\infty\Theta_\infty)^{-1}\}$.

To analyze PLUGIN, first compute $\dot f = -d(\Theta^\top_\infty\Theta_\infty)^{-1}(2\Theta^\top_\infty(\cdot))(\Theta^\top_\infty\Theta_\infty)^{-1}$; we find that PLUGIN increases the asymptotic risk by $E = 2d\operatorname{tr}\{(\Theta^\top_\infty\Theta_\infty)^{-1}\}$. However, the relative risk of PLUGIN is still favorable when $d > 4$, as $\mathcal{L}^\bullet(1) = -\frac{1}{2}d(d-4)\operatorname{tr}\{(\Theta^\top_\infty\Theta_\infty)^{-1}\} < 0$ for $d > 4$.

We can do slightly better using ORACLEPLUGIN ($\lambda^{\bullet*} = 1 - \frac{2}{d}$), which results in a relative risk of $\mathcal{L}^\bullet(\lambda^{\bullet*}) = -\frac{1}{2}(d-2)^2\operatorname{tr}\{(\Theta^\top_\infty\Theta_\infty)^{-1}\}$. For comparison, if we had solved the $K$ regression tasks completely independently with $K$ independent regularization parameters, our relative risk would have been $-\frac{1}{2}(d-2)^2\sum_{k=1}^K \|\theta^k_\infty\|^{-2}$ (following similar but simpler computations).

We now compare joint versus independent regularization. Let $A = \Theta^\top_\infty\Theta_\infty$ with eigendecomposition $A = UDU^\top$. The difference in relative risks between joint and independent regularization is $\Delta = -\frac{1}{2}(d-2)^2\big(\sum_k D^{-1}_{kk} - \sum_k A^{-1}_{kk}\big)$ ($\Delta < 0$ means joint regularization is better). The gap between joint and independent regularization is large when the tasks are non-trivial but similar (the $\theta^k_\infty$ are close, but $\|\theta^k_\infty\|$ is large). In that case, $D^{-1}_{kk}$ is quite large for $k > 1$, but all the $A^{-1}_{kk}$ are small.

MHC-I binding prediction. We evaluated our multi-task regularization method on the IEDB MHC-I peptide binding dataset created by [19] and used by [13]. The goal here is to predict the binding affinity (represented by log IC50) of an MHC-I molecule given its amino-acid sequence (represented by a vector of binary features, reduced to a 20-dimensional real vector using SVD). We created five regression tasks corresponding to the five most common MHC-I molecules.

We compared four estimators: UNREGULARIZED, DIAGCV ($\Lambda = cI$), UNIFORMCV (using the same task-affinity for all pairs of tasks, with $\Lambda = c(1^{\otimes} + 10^{-5}I)$), and PLUGINCV ($\Lambda = cd(\hat\Theta^\top_n\hat\Theta_n)^{-1}$), where $c$ was chosen by cross-validation.³

³We performed three-fold cross-validation to select $c$ from 21 candidates in $[10^{-5}, 10^5]$.

Figure 1: (a) Relative risk ($\mathcal{L}_n(\lambda)$) of the hybrid generative/discriminative estimator for various $\lambda$; the $\lambda$ attaining the minimum of $\mathcal{L}_n(\lambda)$ is close to the oracle $\lambda^*$ (the vertical line). (b) On the MHC-I binding prediction task, test risk for the four multi-task estimators; PLUGINCV (estimating all pairwise task affinities using PLUGIN and cross-validating the strength) works best.

Figure 1 shows the results averaged over 30 independent train/test splits. Multi-task regularization actually performs worse than independent learning (DIAGCV) if we assume all tasks are equally related (UNIFORMCV).
By learning the full matrix of task affinities (PLUGINCV), we obtain the best results. Note that setting the O(K^2) entries of Λ via cross-validation is not computationally feasible, though other approaches are possible [13].

4 Related work and discussion

The subject of choosing regularization parameters has received much attention. Much of the learning theory literature focuses on risk bounds, which approximate the expected risk L(θ̂_n^λ) with upper bounds. Our analysis provides a different type of approximation, one that is exact in the first few terms of the expansion. Though we cannot make a precise statement about the risk for any given n, exact control over the first few terms offers other advantages, e.g., the ability to compare estimators.

To elaborate further, risk bounds are generally based on the complexity of the hypothesis class, whereas our analysis is based on the variance of the estimator. Vanilla uniform convergence bounds yield worst-case analyses, whereas our asymptotic analysis is tailored to a particular problem (p* and θ_∞) and algorithm (estimator). Localization techniques [5], regret analyses [9], and stability-based bounds [8] all allow for some degree of problem- and algorithm-dependence. As bounds, however, they necessarily have some looseness, whereas our analysis provides exact constants, at least the ones associated with the lowest-order terms.

Asymptotics has a rich tradition in statistics. In fact, our methodology of performing a Taylor expansion of the risk is reminiscent of AIC [1]. However, our aim is different: AIC is intended for model selection, whereas we are interested in optimizing regularization parameters. The Stein unbiased risk estimate (SURE) is another method of estimating the expected risk for linear models [21], with generalizations to non-linear models [11].

In practice, cross-validation procedures [10] are quite effective.
However, they are only feasible when the number of hyperparameters is very small, whereas our approach can optimize many hyperparameters. Section 3.4 showed that combining the two approaches can be effective.

To conclude, we have developed a general asymptotic framework for analyzing regularization, along with an efficient procedure for choosing regularization parameters. Although we are so far restricted to parametric problems with smooth losses and regularizers, we think that these tools provide a complementary perspective on analyzing learning algorithms to that of risk bounds, deepening our understanding of regularization.

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems (NIPS), pages 41–48, 2007.

[3] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.

[4] M. S. Bartlett. Approximate confidence intervals. II. More than one unknown parameter. Biometrika, 40:306–317, 1953.

[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.

[6] G. Bouchard. Bias-variance tradeoff in hybrid generative-discriminative models. In Sixth International Conference on Machine Learning and Applications (ICMLA), pages 124–129, 2007.

[7] G. Bouchard and B. Triggs. The trade-off between generative and discriminative classifiers.
In International Conference on Computational Statistics, pages 721–728, 2004.

[8] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[10] P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1978.

[11] Y. C. Eldar. Generalized SURE for exponential families: Applications to regularization. IEEE Transactions on Signal Processing, 57(2):471–481, 2009.

[12] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[13] L. Jacob, F. Bach, and J. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems (NIPS), pages 745–752, 2009.

[14] W. James and C. Stein. Estimation with quadratic loss. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–380, 1961.

[15] J. A. Lasserre, C. M. Bishop, and T. P. Minka. Principled hybrids of generative and discriminative models. In Computer Vision and Pattern Recognition (CVPR), pages 87–94, 2006.

[16] P. Liang, F. Bach, G. Bouchard, and M. I. Jordan. Asymptotically optimal regularization in smooth parametric models. Technical report, ArXiv, 2010.

[17] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudo-likelihood estimators. In International Conference on Machine Learning (ICML), 2008.

[18] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In Association for the Advancement of Artificial Intelligence (AAAI), 2006.

[19] B. Peters, H. Bui, S. Frankild, M. Nielson, C. Lundegaard, E. Kostem, D. Basch, K. Lamberth, M. Harndahl, W. Fleri, S. S. Wilson, J. Sidney, O. Lund, S. Buus, and A. Sette. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Computational Biology, 2, 2006.

[20] R. Raina, Y. Shen, A. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems (NIPS), 2004.

[21] C. M. Stein. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135–1151, 1981.

[22] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.