{"title": "Greedy Model Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 1242, "page_last": 1250, "abstract": "This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression.  It is known that if the models are mis-specified, model averaging is superior to model selection.  Specifically, let $n$ be the sample size, then the worst case regret of the former decays at the rate of $O(1/n)$   while the worst case regret of the latter decays at the rate of $O(1/\\sqrt{n})$.  In the literature, the most important and widely studied model averaging method that achieves the optimal $O(1/n)$ average regret   is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations.   The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA.  We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.", "full_text": "Greedy Model Averaging\n\nDepartment of Statistics Rutgers University, New Jersey, 08816\n\ndongdai916@gmail.com\n\nDong Dai\n\nDepartment of Statistics, Rutgers University, New Jersey, 08816\n\ntzhang@stat.rutgers.edu\n\nTong Zhang\n\nAbstract\n\nThis paper considers the problem of combining multiple models to achieve a\nprediction accuracy not much worse than that of the best single model for least\nsquares regression. It is known that if the models are mis-speci\ufb01ed, model aver-\naging is superior to model selection. Speci\ufb01cally, let n be the sample size, then\nthe worst case regret of the former decays at the rate of O(1/n) while the worst\ncase regret of the latter decays at the rate of O(1/\nn). In the literature, the most\nimportant and widely studied model averaging method that achieves the optimal\nO(1/n) average regret is the exponential weighted model averaging (EWMA) al-\ngorithm. However this method suffers from several limitations. The purpose of\nthis paper is to present a new greedy model averaging procedure that improves\nEWMA. We prove strong theoretical guarantees for the new procedure and illus-\ntrate our theoretical results with empirical examples.\n\n\u221a\n\n1\n\nIntroduction\n\nThis paper considers the model combination problem, where the goal is to combine multiple models\nin order to achieve improved accuracy. This problem is important for practical applications because\nit is often the case that single learning models do not perform as well as their combinations. In\npractice, model combination is often achieved through the so-called \u201cstacking\u201d procedure, where\nmultiple models {f1(x), . . . , fM (x)} are \ufb01rst learned based on a shared \u201ctraining dataset\u201d. Then\nthese models are combined on a separate \u201cvalidation dataset\u201d. This paper is motivated by this sce-\nnario. In particular, we assume that M models {f1(x), . . . , fM (x)} are given a priori (e.g., we may\nregard them as being obtained with a separate training set), and we are provided with n labeled data\npoints (validation data) {(X1, Y1), . . . , (Xn, Yn)} to combine these models.\nFor simplicity and clarity, our analysis focuses on least squares regression in \ufb01xed design although\nsimilar analysis can be extended to random design and to other loss functions. 
In this setting, for notational convenience, we represent the $k$-th model on the validation data as a vector $f_k = [f_k(X_1), \ldots, f_k(X_n)] \in R^n$, and we let the observation vector be $y = [Y_1, \ldots, Y_n] \in R^n$. Let $g = E y$ be the mean. Our goal (in the fixed design or denoising setting) is to estimate the mean vector $g$ from $y$ using the $M$ existing models $F = \{f_1, \ldots, f_M\}$. Here, we can write

$y = g + \xi$,

where we assume for simplicity that $\xi$ is iid Gaussian noise: $\xi \sim N(0, \sigma^2 I_{n \times n})$. This iid Gaussian assumption isn't critical, and the results remain the same for independent sub-Gaussian noise.

We assume that the models may be mis-specified. That is, let $k^*$ be the best single model, defined as

$k^* = \mathrm{argmin}_k \|f_k - g\|_2^2$;   (1)

then $f_{k^*} \neq g$. We are interested in an estimator $\hat{f}$ of $g$ that achieves a small regret

$R(\hat{f}) = \frac{1}{n} \|\hat{f} - g\|_2^2 - \frac{1}{n} \|f_{k^*} - g\|_2^2$.

This paper considers a special class of model combination methods which we refer to as model averaging, with combined estimators of the form

$\hat{f} = \sum_{k=1}^M \hat{w}_k f_k$,

where $\hat{w}_k \geq 0$ and $\sum_k \hat{w}_k = 1$. A standard method for "model averaging" is model selection, where we choose the model $\hat{k}$ with the smallest least squares error:

$\hat{f}_{MS} = f_{\hat{k}}$;  $\hat{k} = \mathrm{argmin}_k \|f_k - y\|_2^2$.

This corresponds to the choice of $\hat{w}_{\hat{k}} = 1$ and $\hat{w}_k = 0$ when $k \neq \hat{k}$. However, it is well known that the worst case regret this procedure can achieve is $R(\hat{f}_{MS}) = O(\sqrt{\ln M / n})$ [1]. Another standard model averaging method is the Exponential Weighted Model Averaging (EWMA) estimator, defined as

$\hat{f}_{EWMA} = \sum_{k=1}^M \hat{w}_k f_k$,  $\hat{w}_k = \frac{q_k e^{-\lambda \|f_k - y\|_2^2}}{\sum_{j=1}^M q_j e^{-\lambda \|f_j - y\|_2^2}}$,   (2)

with a tuned parameter $\lambda \geq 0$. The extra parameters $\{q_j\}_{j=1,\ldots,M}$ are priors that impose bias favoring some models over others. Here we assume that $q_j \geq 0$ and $\sum_j q_j = 1$. In this setting, the most common prior choice is the flat prior $q_j = 1/M$. It should be pointed out that a progressive variant of (2), which returns the average of $n+1$ EWMA estimators computed on $S_i = \{(X_1, Y_1), \ldots, (X_i, Y_i)\}$ for $i = 0, 1, \ldots, n$, was often analyzed in the earlier literature [2, 9, 5, 1]. Nevertheless, the non-progressive version presented in (2) is clearly a more natural estimator, and this is the form that has been studied in more recent work [3, 6, 8]. Our current paper does not differentiate these two versions of EWMA because they have similar theoretical properties. In particular, our experiments only compare to the non-progressive version (2), which performs better in practice.

It is known that exponential model averaging leads to an average regret of $O(\ln M / n)$, which achieves the optimal rate; however, it was pointed out in [1] that this rate does not hold with large probability. Specifically, EWMA only leads to a sub-optimal deviation bound of $O(\sqrt{\ln M / n})$ with large probability.
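For concreteness, the EWMA weights in (2) can be computed directly from the validation losses. The following is a minimal sketch, not part of the original paper; it assumes numpy, the function names are ours, and the models are stacked as rows of an $M \times n$ array:

    import numpy as np

    def ewma_weights(F, y, lam, q=None):
        # F: (M, n) array whose row k is f_k on the n design points; y: (n,) observations.
        # Returns w_k proportional to q_k * exp(-lam * ||f_k - y||_2^2), as in eq. (2).
        M = F.shape[0]
        q = np.full(M, 1.0 / M) if q is None else np.asarray(q)  # flat prior by default
        loss = np.sum((F - y) ** 2, axis=1)        # ||f_k - y||_2^2 for each k
        logits = np.log(q) - lam * loss
        logits -= logits.max()                     # subtract the max for numerical stability
        w = np.exp(logits)
        return w / w.sum()

    def ewma_estimate(F, y, lam, q=None):
        return ewma_weights(F, y, lam, q) @ F      # \hat f_EWMA = sum_k w_k f_k

The max-subtraction does not change the normalized weights; it only avoids overflow when $\lambda \|f_k - y\|_2^2$ is large.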
To remedy this sub-optimality, an empirical star algorithm (which we will refer to as STAR from now on) was then proposed in [1]; it was shown that the algorithm gives an $O(\ln M / n)$ deviation bound with large probability under the flat prior $q_i = 1/M$. One major issue of the STAR algorithm is that its average performance is often inferior to EWMA, as we can see from our empirical examples. Therefore, although theoretically interesting, it is not an algorithm that can be regarded as a replacement of EWMA for practical purposes. Partly for this reason, a more recent study [7] re-examined the problem of improving EWMA, where different estimators were proposed in order to achieve optimal deviation for model averaging. However, the proposed algorithms are rather complex and difficult to implement. The purpose of this paper is to present a simple greedy model averaging (GMA) algorithm that gives the optimal $O(\ln M / n)$ deviation bound with large probability, and it can be applied with an arbitrary prior $q_i$. Moreover, unlike STAR, whose average performance is inferior to EWMA, the average performance of the GMA algorithm is generally superior to EWMA, as we shall illustrate with examples. It also has some other advantages which we will discuss in more detail later in the paper.

2 Greedy Model Averaging

This paper studies a new model averaging procedure presented in Algorithm 1. The procedure has $L$ stages, and each stage adds an additional model $f_{\hat{k}^{(\ell)}}$ into the ensemble. It is based on a simple but important modification of a classical sequential greedy approximation procedure in the literature [4], which corresponds to setting $\mu^{(\ell)} = 0$, $\lambda = 0$ in Algorithm 1 with $\alpha^{(\ell)}$ optimized over $[0, 1]$. The STAR algorithm corresponds to the stage-2 estimator $\hat{f}^{(2)}$ of the above mentioned classical greedy procedure of [4]. However, in order to prove the desired deviation bound, our analysis critically depends on the extra term $\mu^{(\ell)} \|\hat{f}^{(\ell-1)} - f_j\|_2^2$, which isn't present in the classical procedure (that is, our proof does not apply to the procedure of [4]). As we will see in Section 4, this extra term does have a positive impact under suitable conditions that correspond to Theorem 1 and Theorem 2 below; thus this term is not only of theoretical interest, but also leads to practical benefits under the right conditions.

Another difference between GMA and the greedy algorithm in [4] is that our procedure allows the use of non-flat priors through the extra penalty term $\lambda c^{(\ell)} \ln(1/q_j)$. This generality can be useful for some applications. Moreover, it is useful to notice that if we choose the flat prior $q_j = 1/M$, then the term $\lambda c^{(\ell)} \ln(1/q_j)$ is identical for all models, and thus this term can be removed from the optimization. In this case, the proposed method has the advantage of being parameter free (with the default choice of $\nu = 0.5$). This advantage is also shared by the STAR algorithm.

Algorithm 1: Greedy Model Averaging (GMA)

  input: noisy observation $y$ and static models $f_1, \ldots, f_M$
  output: averaged model $\hat{f}^{(L)}$
  parameters: prior $\{q_j\}_{j=1,\ldots,M}$ and regularization parameters $\nu$ and $\lambda$
  let $\hat{f}^{(0)} = 0$
  for $\ell = 1, 2, \ldots, L$ do
    let $\alpha^{(\ell)} = (\ell - 1)/\ell$
    let $\mu^{(1)} = 0$; $\mu^{(2)} = 0.05$; $\mu^{(\ell)} = \nu(\ell - 1)/\ell^2$ if $\ell > 2$
    let $c^{(1)} = 1$; $c^{(2)} = 0.25$; and $c^{(\ell)} = [20\nu(1-\nu)(\ell - 1)]^{-1}$ if $\ell > 2$
    let $\hat{k}^{(\ell)} = \mathrm{argmin}_j Q^{(\ell)}(j)$, where
      $Q^{(\ell)}(j) := \|\alpha^{(\ell)} \hat{f}^{(\ell-1)} + (1 - \alpha^{(\ell)}) f_j - y\|_2^2 + \mu^{(\ell)} \|\hat{f}^{(\ell-1)} - f_j\|_2^2 + \lambda c^{(\ell)} \ln(1/q_j)$
    let $\hat{f}^{(\ell)} = \alpha^{(\ell)} \hat{f}^{(\ell-1)} + (1 - \alpha^{(\ell)}) f_{\hat{k}^{(\ell)}}$
  end

Observe that the first stage of GMA corresponds to the standard model selection procedure:

$\hat{k}^{(1)} = \mathrm{argmin}_j [\|f_j - y\|_2^2 + \lambda \ln(1/q_j)]$,  $\hat{f}^{(1)} = f_{\hat{k}^{(1)}}$.

As we have pointed out earlier, it is well known that only $O(1/\sqrt{n})$ regret can be achieved by any model selection procedure (that is, any procedure that returns a single model $f_{\hat{k}}$ for some $\hat{k}$). However, a combination of only two models allows us to achieve the optimal $O(1/n)$ rate. In fact, $\hat{f}^{(2)}$ achieves this rate. For clarity, we rewrite this stage-2 estimator as

$\hat{k}^{(2)} = \mathrm{argmin}_j \left[\left\|\tfrac{1}{2}(f_{\hat{k}^{(1)}} + f_j) - y\right\|_2^2 + \tfrac{1}{20} \|f_{\hat{k}^{(1)}} - f_j\|_2^2 + \tfrac{\lambda}{4} \ln(1/q_j)\right]$,  $\hat{f}^{(2)} = \tfrac{1}{2}(f_{\hat{k}^{(1)}} + f_{\hat{k}^{(2)}})$.

Theorem 1 shows that this simple stage-2 estimator achieves $O(1/n)$ regret. A similar result was shown in [1] for the STAR algorithm under the flat prior $q_j = 1/M$, which corresponds to the stage-2 estimator of the classical greedy algorithm in [4]. Theoretically, our result has several advantages over that of the classical EWMA method. First, it produces a sparse estimator while the exponential averaging estimator is dense; second, the performance bound is scale free in the sense that the bound depends only on the noise variance but not on the magnitude of $\max_j \|f_j\|$; third, the optimal bound holds with high probability while EWMA only achieves the optimal bound on average but not with large probability; and finally, if we choose a flat prior $q_j = 1/M$, the estimator is parameter free because we can exclude the term $\lambda \ln(1/q_j)$ from the estimators. This result also improves the recent work of [7] in that the resulting bound is scale free while the algorithm itself is significantly simpler. One disadvantage of this stage-2 estimator (and similarly the STAR estimator of [1]) is that its average performance is generally inferior to that of EWMA, mainly due to the relatively large constant in Theorem 1 (the same issue holds for the STAR algorithm). For this reason, the stage-2 estimator is not a practical replacement of EWMA. This is the main reason why it is necessary to run GMA for $L > 2$ stages, which leads to reduced constants (see Theorem 2 below).
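To make Algorithm 1 concrete, here is a minimal sketch of GMA; it is ours rather than the authors' code, it assumes numpy and the same $(M, n)$ model layout as the EWMA sketch above, and the function name is illustrative. With the flat prior, the $\lambda c^{(\ell)} \ln(1/q_j)$ term is constant in $j$, so lam=0.0 reproduces the parameter-free variant:

    import numpy as np

    def gma(F, y, L=5, nu=0.5, lam=0.0, q=None):
        # Greedy Model Averaging (a sketch of Algorithm 1).
        # F: (M, n) array of model predictions; y: (n,) observations.
        M, n = F.shape
        q = np.full(M, 1.0 / M) if q is None else np.asarray(q)
        f_hat = np.zeros(n)                               # \hat f^{(0)} = 0
        for ell in range(1, L + 1):
            alpha = (ell - 1) / ell                       # alpha^{(ell)}
            if ell == 1:
                mu, c = 0.0, 1.0
            elif ell == 2:
                mu, c = 0.05, 0.25
            else:
                mu = nu * (ell - 1) / ell ** 2
                c = 1.0 / (20 * nu * (1 - nu) * (ell - 1))
            # Q^{(ell)}(j), evaluated for all candidate models j at once
            fit = np.sum((alpha * f_hat + (1 - alpha) * F - y) ** 2, axis=1)
            prox = mu * np.sum((f_hat - F) ** 2, axis=1)  # mu^{(ell)} ||f_hat^{(ell-1)} - f_j||^2
            prior = lam * c * np.log(1.0 / q)             # lambda c^{(ell)} ln(1/q_j)
            k = int(np.argmin(fit + prox + prior))
            f_hat = alpha * f_hat + (1 - alpha) * F[k]    # \hat f^{(ell)}
        return f_hat

Note that the first pass of the loop (where alpha = 0 and mu = 0) reduces to the penalized model selection step displayed above.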
Our empirical experiments show that in order to compete with EWMA in average performance, it is important to take $L > 2$. However, a relatively small $L$ (as small as $L = 5$) is often sufficient, and in that case the resulting estimator is still quite sparse.

Theorem 1 Given $q_j \geq 0$ such that $\sum_{j=1}^M q_j = 1$. If $\lambda \geq 40\sigma^2$, then with probability $1 - 2\delta$ we have

$R(\hat{f}^{(2)}) \leq \frac{\lambda}{n} \left[\frac{3}{4} \ln(1/q_{k^*}) + \frac{1}{2} \ln(1/\delta)\right]$.

While the stage-2 estimator $\hat{f}^{(2)}$ achieves the optimal rate, running GMA for more stages can further improve the performance. The following theorem shows that similar bounds can be obtained for GMA at stages larger than 2. However, the constant before $\frac{\sigma^2}{n} \ln \frac{1}{q_{k^*} \delta}$ approaches 8 when $\ell \to \infty$ (with the default $\nu = 0.5$), which is smaller than the constant of Theorem 1, which is about 30. Indeed, with $\nu = 0.5$ we have $20\nu(1-\nu)\ell = 5\ell$, so as $\ell \to \infty$ the bound of Theorem 2 tends to $\frac{\lambda}{5n} \ln \frac{1}{q_{k^*}\delta}$, which equals $\frac{8\sigma^2}{n} \ln \frac{1}{q_{k^*}\delta}$ at $\lambda = 40\sigma^2$. This implies a potential improvement when we run more stages, and this improvement is confirmed in our empirical study. In fact, with relatively large $\ell$, the GMA method not only has the theoretical advantage of achieving smaller regret in deviation (that is, the regret bound holds with large probability) but also achieves better average performance in practice.

Theorem 2 Given $q_j \geq 0$ such that $\sum_{j=1}^M q_j = 1$. If $\lambda \geq 40\sigma^2$ and $0 < \nu < 1$ in Algorithm 1, then with probability $1 - 2\delta$ we have

$R(\hat{f}^{(\ell)}) \leq \frac{\lambda}{n} \cdot \frac{(\ell - 2) + \ln(\ell - 1) + 30\nu(1-\nu)}{20\nu(1-\nu)\ell} \ln \frac{1}{q_{k^*} \delta}$.

Another important advantage of running GMA for $\ell > 2$ stages is that the resulting estimator not only competes with the best single estimator $f_{k^*}$, but also competes with the best estimator in the convex hull $\mathrm{cov}(F)$ (with the parameter $\nu$ appropriately tuned). Note that the latter can be significantly better than the former. Define the convex hull of $F$ as

$\mathrm{cov}(F) = \left\{\sum_{j=1}^M w_j f_j : w_j \geq 0; \sum_j w_j = 1\right\}$.

The following theorem shows that as $\ell \to \infty$, the prediction error of $\hat{f}^{(\ell)}$ is no more than $O(1/\sqrt{n})$ worse than that of the optimal $\bar{f} \in \mathrm{cov}(F)$ when we choose a sufficiently small $\nu = O(1/\sqrt{n})$ in Algorithm 1. Note that in this case, it is beneficial to use a parameter $\nu$ smaller than the default choice of $\nu = 0.5$. This phenomenon is also confirmed by our experiments.

Theorem 3 Given $q_j \geq 0$ such that $\sum_{j=1}^M q_j = 1$. Consider any $\{w_j : j = 1, \ldots, M\}$ such that $\sum_j w_j = 1$ and $w_j \geq 0$, and let $\bar{f} = \sum_j w_j f_j$. If $\lambda \geq 40\sigma^2$ and $0 < \nu < 1$ in Algorithm 1, then with probability $1 - 2\delta$, as $\ell \to \infty$:

$\frac{1}{n} \|\hat{f}^{(\ell)} - g\|_2^2 \leq \frac{1}{n} \|\bar{f} - g\|_2^2 + \frac{\nu}{n} \sum_k w_k \|f_k - \bar{f}\|_2^2 + \frac{\lambda}{20\nu(1-\nu)n} \sum_k w_k \ln \frac{1}{\delta q_k} + O\!\left(\frac{1}{\ell}\right)$.
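The role of a small $\nu$ in Theorem 3 can be seen from a short optimization over $\nu$; the following worked computation is ours (not in the paper), and the shorthand symbols $A$ and $B$ are ours:

    % Shorthand (ours): A = (1/n) \sum_k w_k \|f_k - \bar f\|_2^2 (typically O(1)),
    %                   B = \sum_k w_k \ln(1/(\delta q_k)).
    % The nu-dependent excess in Theorem 3 is
    \nu A + \frac{\lambda B}{20\nu(1-\nu)n}
      \;\approx\; \nu A + \frac{\lambda B}{20\nu n} \quad (\text{small } \nu),
    % which is minimized at
    \nu^* = \sqrt{\frac{\lambda B}{20 A n}} = O(1/\sqrt{n}),
    \qquad\text{giving an excess of order } 2\sqrt{\frac{\lambda A B}{20 n}} = O(1/\sqrt{n}).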
3 Experiments

The point of these experiments is to show that the consequences of our theoretical analysis can be observed in practice, which supports the main conclusions we reach. For this purpose, we consider the model $g = X w + 0.5 \Delta g$, where $X = (f_1, \ldots, f_M)$ is an $n \times M$ matrix with independent standard Gaussian entries, and $\Delta g \sim N(0, I_{n \times n})$ implies that the model is mis-specified. The noise vector $\xi \sim N(0, \sigma^2 I_{n \times n})$ is generated independently of $X$. The coefficient vector $w = (w_1, \ldots, w_M)^\top$ is given by $w_i = |u_i| / \sum_{j=1}^s |u_j|$ for $i = 1, \ldots, s$, where $u_1, \ldots, u_s$ are independent standard uniform random variables for some fixed $s$.

The performance of an estimator $\hat{f}$ is measured here by the mean squared error (MSE), defined as

$\mathrm{MSE}(\hat{f}) = \frac{1}{n} \|\hat{f} - g\|_2^2$.

We run the Greedy Model Averaging (GMA) algorithm for up to $L = 40$ stages. The EWMA parameter is tuned via 10-fold cross-validation. Moreover, we also list the performance of EWMA with projection, which is the method that runs EWMA but with each model $f_k$ replaced by the model $\tilde{f}_k = \alpha_k f_k$, where $\alpha_k = \mathrm{argmin}_{\alpha \in R} \|\alpha f_k - y\|_2^2$. That is, $\tilde{f}_k$ is the best linear scaling of $f_k$ for predicting $y$. Note that this is a special case of the class of methods studied in [6] (which considers more general projections) that leads to non-progressive regret bounds, and this is the method of significant current interest [3, 8]. However, at least for the scenario considered in our paper, the projected EWMA method never improves performance in our experiments. Finally, for reference purposes, we also report the MSE of the best single model (BSM) $f_{k^*}$, where $k^*$ is given by (1). The model $f_{k^*}$ is clearly not a valid estimator because it depends on the unobserved $g$; however, its performance is informative, and it is thus included in the tables. For simplicity, all algorithms use the flat prior $q_k = 1/M$.

4 Illustration of Theorem 1 and Theorem 2

The first set of experiments is performed with the parameters $n = 50$, $M = 200$, $s = 1$ and $\sigma = 2$. Five hundred replications are run, and the MSE performance of the different algorithms is reported in Table 1 using the "mean ± standard deviation" format.

Note that with $s = 1$, the target is $g = f_1 + 0.5 \Delta g$. Since $f_1$ and $\Delta g$ are random Gaussian vectors, the best single model is likely $f_1$. The noise $\sigma = 2$ is relatively large. This is thus the situation where model averaging does not achieve as good a performance as the best single model, which corresponds to the scenario considered in Theorem 1 and Theorem 2.

The results indicate that for GMA, from $L = 1$ (corresponding to model selection) to $L = 2$ (the stage-2 model averaging of Theorem 1), there is a significant reduction of error. The performance of GMA with $L = 2$ is comparable to that of the STAR algorithm. This isn't surprising, because STAR can be regarded as the stage-2 estimator based on the more classical greedy algorithm of [4].
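The simulation setup just described is straightforward to reproduce; the following minimal sketch is ours (it assumes numpy, and the variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, M, s, sigma = 50, 200, 1, 2.0     # first experiment; the second uses s = 10, sigma = 0.5

    X = rng.standard_normal((n, M))      # columns of X are the models f_1, ..., f_M
    u = rng.uniform(size=s)              # u_1, ..., u_s standard uniform
    w = np.zeros(M)
    w[:s] = u / u.sum()                  # w_i = |u_i| / sum_{j <= s} |u_j|
    g = X @ w + 0.5 * rng.standard_normal(n)   # mis-specified mean g = Xw + 0.5 * Delta_g
    y = g + sigma * rng.standard_normal(n)     # observations y = g + xi

    F = X.T                              # (M, n) layout used by the sketches above
    mse = lambda f_hat: np.mean((f_hat - g) ** 2)  # MSE(f_hat) = ||f_hat - g||_2^2 / n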
We also observe that the error keeps decreasing (though at a slower pace) when $L > 2$, which is consistent with Theorem 2. It means that in order to achieve good performance, it is necessary to use more stages than $L = 2$ (although this doesn't change the $O(1/n)$ rate for the regret, it can significantly reduce the constant). GMA becomes better than EWMA when $L$ is as small as 5, which still gives a relatively sparse averaged model. EWMA with projection does not perform as well as the standard EWMA method in this setting. Moreover, we note that in this scenario, the standard choice of $\nu = 0.5$ in Theorem 2 is superior to choosing a smaller $\nu = 0.1$ or $\nu = 0.01$. This again is consistent with Theorem 2, which shows that the new term we added to the greedy algorithm is indeed useful in this scenario.

Table 1: MSE of different algorithms: best single model is superior to averaged models

  STAR: 0.663 ± 0.4   EWMA: 0.645 ± 0.5   EWMA (with projection): 0.744 ± 0.5   BSM: 0.252 ± 0.05

  GMA        L = 1          L = 2         L = 5          L = 20         L = 40
  ν = 0.5    0.735 ± 0.74   0.689 ± 0.4   0.58 ± 0.39    0.566 ± 0.37   0.567 ± 0.38
  ν = 0.1    0.735 ± 0.74   0.689 ± 0.4   0.645 ± 0.31   0.623 ± 0.29   0.622 ± 0.29
  ν = 0.01   0.735 ± 0.74   0.689 ± 0.4   0.663 ± 0.3    0.638 ± 0.28   0.639 ± 0.28

5 Illustration of Theorem 3

The second set of experiments is performed with the parameters $n = 50$, $M = 200$, $s = 10$ and $\sigma = 0.5$. Five hundred replications are run, and the MSE performance of the different algorithms is reported in Table 2 using the "mean ± standard deviation" format.

Note that with $s = 10$, the target is $g = \bar{f} + 0.5 \Delta g$ for some $\bar{f} \in \mathrm{cov}(F)$. The noise $\sigma = 0.5$ is relatively small, which makes it beneficial to compete with the best model $\bar{f}$ in the convex hull even though GMA has a larger regret of $O(1/\sqrt{n})$ when competing with $\bar{f}$. This is thus the situation considered in Theorem 3, in which model averaging can achieve better performance than the best single model.

The results again show that for GMA, from $L = 1$ (corresponding to model selection) to $L = 2$ (the stage-2 model averaging of Theorem 1), there is a significant reduction of error. The performance of GMA with $L = 2$ is again comparable to that of the STAR algorithm. Again we observe that even with the standard choice of $\nu = 0.5$, the error keeps decreasing (though at a slower pace) when $L > 2$, which is consistent with Theorem 2. GMA becomes better than EWMA when $L$ is as small as 5, which still gives a relatively sparse averaged model. EWMA with projection again does not perform as well as the standard EWMA method in this setting. Moreover, we note that in this scenario, the standard choice of $\nu = 0.5$ in Theorem 2 is inferior to choosing the smaller parameter values $\nu = 0.1$ or $\nu = 0.01$. This is consistent with Theorem 3, where it is beneficial to use a smaller value of $\nu$ in order to compete with the best model in the convex hull.
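Both tables refer to "EWMA with projection"; the per-model rescaling there is a one-dimensional least squares fit with a closed form. A minimal sketch (ours, continuing the numpy layout of the sketches above):

    # alpha_k = argmin_a ||a f_k - y||_2^2 = <f_k, y> / ||f_k||_2^2
    alphas = (F @ y) / np.sum(F ** 2, axis=1)   # (M,) least squares scalings
    F_proj = alphas[:, None] * F                # \tilde f_k = alpha_k f_k; run EWMA on F_proj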
Table 2: MSE of different algorithms: best single model is inferior to averaged models

  STAR: 0.443 ± 0.08   EWMA: 0.316 ± 0.087   EWMA (with projection): 0.364 ± 0.078   BSM: 0.736 ± 0.083

  GMA        L = 1          L = 2           L = 5           L = 20          L = 40
  ν = 0.5    0.809 ± 0.12   0.456 ± 0.081   0.305 ± 0.062   0.266 ± 0.057   0.265 ± 0.057
  ν = 0.1    0.809 ± 0.12   0.456 ± 0.081   0.269 ± 0.056   0.214 ± 0.046   0.211 ± 0.045
  ν = 0.01   0.809 ± 0.12   0.456 ± 0.081   0.268 ± 0.053   0.211 ± 0.045   0.207 ± 0.045

6 Conclusion

This paper presents a new model averaging scheme which we call greedy model averaging (GMA). It is shown that the new method can achieve a regret bound of $O(\ln M / n)$ with large probability when competing with the single best model. Moreover, it can also compete with the best combined model in the convex hull. Both our theory and our experimental results suggest that the proposed GMA algorithm is superior to the standard EWMA procedure. Due to the simplicity of our proposal, GMA may be regarded as a valid alternative to the more widely studied EWMA procedure, both for practical applications and for theoretical purposes. Finally, we point out that while this work only considers static model averaging where the model set $F$ is finite, similar results can be obtained for affine estimators or infinite models considered in recent work [3, 6, 8]. Such extensions are left to the extended report.

A Proof Sketches

We only include proof sketches, and leave the details to the supplemental material that accompanies the submission. First we need the following standard Gaussian tail bounds; the proofs can be found in the supplemental material.

Proposition 1 Let $f_j \in R^n$ be a set of fixed vectors ($j = 1, \ldots, M$), and assume that $q_j \geq 0$ with $\sum_j q_j = 1$. Let $k^*$ be a fixed integer between 1 and $M$. Define the event $E_1$ as

$E_1 = \left\{\forall j : (f_j - f_{k^*})^\top \xi \leq \sigma \|f_j - f_{k^*}\|_2 \sqrt{2 \ln(1/(\delta q_j))}\right\}$,

and define the event $E_2$ as

$E_2 = \left\{\forall j, k : (f_j - f_k)^\top \xi \leq \sigma \|f_j - f_k\|_2 \sqrt{2 \ln(1/(\delta q_j q_k))}\right\}$;

then $P(E_1) \geq 1 - \delta$ and $P(E_2) \geq 1 - \delta$.

A.1 Proof Sketch of Theorem 1

A more detailed proof can be found in the supplemental material. Note that with probability $1 - 2\delta$, both event $E_1$ and event $E_2$ of Proposition 1 hold. Moreover, we have

$\|\hat{f}^{(2)} - g\|_2^2 \leq \|\alpha^{(2)} \hat{f}^{(1)} + (1 - \alpha^{(2)}) f_{k^*} - g\|_2^2 + 2(1 - \alpha^{(2)}) \xi^\top (f_{\hat{k}^{(2)}} - f_{k^*}) + \mu^{(2)} \left(\|\hat{f}^{(1)} - f_{k^*}\|_2^2 - \|\hat{f}^{(1)} - f_{\hat{k}^{(2)}}\|_2^2\right) + \lambda c^{(2)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(2)}}))$.

In the above derivation, the inequality is equivalent to $Q^{(2)}(\hat{k}^{(2)}) \leq Q^{(2)}(k^*)$, which is a simple consequence of the definition of $\hat{k}^{(\ell)}$ in the algorithm.
Also, we can rewrite the fact that $Q^{(1)}(\hat{k}^{(1)}) \leq Q^{(1)}(k^*)$ as

$\|\hat{f}^{(1)} - g\|_2^2 - \|f_{k^*} - g\|_2^2 \leq 2\xi^\top (f_{\hat{k}^{(1)}} - f_{k^*}) + \lambda c^{(1)} \ln(q_{\hat{k}^{(1)}}/q_{k^*})$.

By combining the above two inequalities, we obtain

$\|\hat{f}^{(2)} - g\|_2^2 - \|f_{k^*} - g\|_2^2 \leq \alpha^{(2)} \left[2\xi^\top (f_{\hat{k}^{(1)}} - f_{k^*}) + \lambda c^{(1)} \ln(q_{\hat{k}^{(1)}}/q_{k^*})\right] + 2(1 - \alpha^{(2)}) \xi^\top (f_{\hat{k}^{(2)}} - f_{k^*}) + \left[\mu^{(2)} - \alpha^{(2)}(1 - \alpha^{(2)})\right] \|f_{\hat{k}^{(1)}} - f_{k^*}\|_2^2 - \mu^{(2)} \|f_{\hat{k}^{(1)}} - f_{\hat{k}^{(2)}}\|_2^2 + \lambda c^{(2)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(2)}}))$.

Since $\alpha^{(2)} = 1/2$, writing $\xi^\top (f_{\hat{k}^{(2)}} - f_{k^*}) = \xi^\top (f_{\hat{k}^{(2)}} - f_{\hat{k}^{(1)}}) + \xi^\top (f_{\hat{k}^{(1)}} - f_{k^*})$ and applying the tail bounds of events $E_1$ and $E_2$, we obtain

$\|\hat{f}^{(2)} - g\|_2^2 - \|f_{k^*} - g\|_2^2 \leq 2\|f_{\hat{k}^{(1)}} - f_{k^*}\|_2 \, \sigma \sqrt{2 \ln \frac{1}{q_{\hat{k}^{(1)}}\delta}} + \|f_{\hat{k}^{(2)}} - f_{\hat{k}^{(1)}}\|_2 \, \sigma \sqrt{2 \ln \frac{1}{q_{\hat{k}^{(1)}} q_{\hat{k}^{(2)}}\delta}} + (\mu^{(2)} - 1/4)\|f_{\hat{k}^{(1)}} - f_{k^*}\|_2^2 - \mu^{(2)}\|f_{\hat{k}^{(1)}} - f_{\hat{k}^{(2)}}\|_2^2 + \tfrac{1}{2}\lambda c^{(1)} \ln(q_{\hat{k}^{(1)}}/q_{k^*}) + \lambda c^{(2)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(2)}}))$.

We then use the algebraic inequalities $2a_1 b_1 \leq a_1^2/r_1 + r_1 b_1^2$ and $2a_2 b_2 \leq a_2^2/r_2 + r_2 b_2^2$ (with suitable $r_1, r_2 > 0$): since $\lambda \geq 40\sigma^2$, the resulting quadratic terms are absorbed by $(\mu^{(2)} - 1/4)\|f_{\hat{k}^{(1)}} - f_{k^*}\|_2^2$ and $-\mu^{(2)}\|f_{\hat{k}^{(1)}} - f_{\hat{k}^{(2)}}\|_2^2$, and the $\ln(1/q_{\hat{k}^{(1)}})$ and $\ln(1/q_{\hat{k}^{(2)}})$ terms are absorbed by the prior penalties, leaving

$\|\hat{f}^{(2)} - g\|_2^2 - \|f_{k^*} - g\|_2^2 \leq \left(\tfrac{1}{2}\lambda c^{(1)} + \lambda c^{(2)}\right) \ln(1/q_{k^*}) + (2r_1 + 2r_2) \ln(1/\delta)$,

which implies the desired bound.

A.2 Proof Sketch of Theorem 2

Again, a more detailed proof can be found in the supplemental material. With probability $1 - 2\delta$, both event $E_1$ and event $E_2$ of Proposition 1 hold; this implies that the claim of Theorem 1 also holds. Now consider any $\ell \geq 3$.
We have

$\|\hat{f}^{(\ell)} - g\|_2^2 \leq \|\alpha^{(\ell)} \hat{f}^{(\ell-1)} + (1 - \alpha^{(\ell)}) f_{k^*} - g\|_2^2 + 2(1 - \alpha^{(\ell)}) \xi^\top (f_{\hat{k}^{(\ell)}} - f_{k^*}) + \lambda c^{(\ell)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(\ell)}})) + \mu^{(\ell)} \left(\|\hat{f}^{(\ell-1)} - f_{k^*}\|_2^2 - \|\hat{f}^{(\ell-1)} - f_{\hat{k}^{(\ell)}}\|_2^2\right)$.

The inequality is equivalent to $Q^{(\ell)}(\hat{k}^{(\ell)}) \leq Q^{(\ell)}(k^*)$, which is a simple consequence of the definition of $\hat{k}^{(\ell)}$ in the algorithm. We can rewrite the above inequality as

$\|\hat{f}^{(\ell)} - g\|_2^2 - \|f_{k^*} - g\|_2^2$
$\leq \alpha^{(\ell)} \left(\|\hat{f}^{(\ell-1)} - g\|_2^2 - \|f_{k^*} - g\|_2^2\right) + \left[\mu^{(\ell)} - \alpha^{(\ell)}(1 - \alpha^{(\ell)})\right] \|\hat{f}^{(\ell-1)} - f_{k^*}\|_2^2 - \mu^{(\ell)} \|f_{\hat{k}^{(\ell)}} - \hat{f}^{(\ell-1)}\|_2^2 + 2(1 - \alpha^{(\ell)}) \xi^\top (f_{\hat{k}^{(\ell)}} - f_{k^*}) + \lambda c^{(\ell)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(\ell)}}))$
$\leq \frac{\ell - 1}{\ell} \left(\|\hat{f}^{(\ell-1)} - g\|_2^2 - \|f_{k^*} - g\|_2^2\right) - \frac{\ell - 1}{\ell^2}\nu(1-\nu) \|f_{\hat{k}^{(\ell)}} - f_{k^*}\|_2^2 + \frac{2}{\ell} \|f_{\hat{k}^{(\ell)}} - f_{k^*}\|_2 \, \sigma \sqrt{2 \ln \frac{1}{q_{\hat{k}^{(\ell)}}\delta}} + \lambda c^{(\ell)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(\ell)}}))$
$\leq \frac{\ell - 1}{\ell} \left(\|\hat{f}^{(\ell-1)} - g\|_2^2 - \|f_{k^*} - g\|_2^2\right) + \left[\frac{\sigma^2}{\ell^2 r_\ell} - \frac{\ell - 1}{\ell^2}\nu(1-\nu)\right] \|f_{\hat{k}^{(\ell)}} - f_{k^*}\|_2^2 + 2 r_\ell \ln \frac{1}{q_{\hat{k}^{(\ell)}}\delta} + \lambda c^{(\ell)} (\ln(1/q_{k^*}) - \ln(1/q_{\hat{k}^{(\ell)}}))$.

The second inequality uses the fact that $-p\|a\|^2 - q\|b\|^2 \leq -\frac{pq}{p+q}\|a + b\|^2$, which implies that $\left[\mu^{(\ell)} - \alpha^{(\ell)}(1-\alpha^{(\ell)})\right]\|\hat{f}^{(\ell-1)} - f_{k^*}\|_2^2 - \mu^{(\ell)}\|f_{\hat{k}^{(\ell)}} - \hat{f}^{(\ell-1)}\|_2^2 \leq -\frac{\mu^{(\ell)}[\alpha^{(\ell)}(1-\alpha^{(\ell)}) - \mu^{(\ell)}]}{\alpha^{(\ell)}(1-\alpha^{(\ell)})}\|f_{\hat{k}^{(\ell)}} - f_{k^*}\|_2^2 = -\frac{\ell-1}{\ell^2}\nu(1-\nu)\|f_{\hat{k}^{(\ell)}} - f_{k^*}\|_2^2$, together with the Gaussian tail bound of event $E_1$. The last inequality uses $2ab \leq a^2/r_\ell + r_\ell b^2$ with $r_\ell = \lambda c^{(\ell)}/2 > 0$. Denote by $R^{(\ell)} = \|\hat{f}^{(\ell)} - g\|_2^2 - \|f_{k^*} - g\|_2^2$; then, since $\lambda \geq 40\sigma^2$ and the choice of parameters is $c^{(\ell)} = [20\nu(1-\nu)(\ell - 1)]^{-1}$, the bracketed coefficient is non-positive, and since $2r_\ell = \lambda c^{(\ell)}$ the logarithmic terms combine into $\lambda c^{(\ell)} \ln \frac{1}{q_{k^*}\delta}$. We obtain

$R^{(\ell)} \leq \frac{\ell - 1}{\ell} R^{(\ell-1)} + \lambda c^{(\ell)} \ln \frac{1}{q_{k^*}\delta}$.
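Solving this recursion is elementary; the following worked computation is ours and uses only the recursion just displayed together with the bound of Theorem 1 for $R^{(2)}$:

    % Multiply through by ell and telescope down to ell = 2:
    \ell R^{(\ell)} \le 2R^{(2)} + \frac{\lambda}{20\nu(1-\nu)}\sum_{j=3}^{\ell}\frac{j}{j-1}\,\ln\frac{1}{q_{k^*}\delta},
    \qquad
    \sum_{j=3}^{\ell}\frac{j}{j-1} = (\ell-2) + \sum_{m=2}^{\ell-1}\frac{1}{m} \le (\ell-2) + \ln(\ell-1).
    % Theorem 1 gives 2R^{(2)} <= (3/2) \lambda \ln(1/(q_{k*} \delta)),
    % and 3/2 = 30\nu(1-\nu) / (20\nu(1-\nu)); dividing by \ell (and by n for the regret)
    % yields exactly the bound of Theorem 2.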
Unrolling this recursion for $R^{(\ell)}$ (see the computation above) leads to the desired bound.

A.3 Proof Sketch of Theorem 3

Again, a more detailed proof can be found in the supplemental material. Consider any $\ell \geq 3$. We have

$\|\hat{f}^{(\ell)} - g\|_2^2 \leq \sum_k w_k \|\alpha^{(\ell)} \hat{f}^{(\ell-1)} + (1 - \alpha^{(\ell)}) f_k - g\|_2^2 + \mu^{(\ell)} \left(\sum_k w_k \|\hat{f}^{(\ell-1)} - f_k\|_2^2 - \|\hat{f}^{(\ell-1)} - f_{\hat{k}^{(\ell)}}\|_2^2\right) + \lambda c^{(\ell)} \left(\sum_k w_k \ln(1/q_k) - \ln(1/q_{\hat{k}^{(\ell)}})\right) + 2\xi^\top \left[(1 - \alpha^{(\ell)}) f_{\hat{k}^{(\ell)}} - (1 - \alpha^{(\ell)}) \sum_k w_k f_k\right]$.

The inequality is equivalent to $Q^{(\ell)}(\hat{k}^{(\ell)}) \leq \sum_k w_k Q^{(\ell)}(k)$, which is a simple consequence of the definition of $\hat{k}^{(\ell)}$ in the algorithm. Denote by $R^{(\ell)} = \|\hat{f}^{(\ell)} - g\|_2^2 - \|\bar{f} - g\|_2^2$; then the same derivation as that of Theorem 2 implies that

$R^{(\ell)} \leq \frac{\ell - 1}{\ell} R^{(\ell-1)} + \lambda c^{(\ell)} \sum_k w_k \ln(1/(\delta q_k)) + \left[\mu^{(\ell)} + (1 - \alpha^{(\ell)})^2\right] \sum_k w_k \|f_k - \bar{f}\|_2^2$.

Now, by solving the recursion, we obtain the theorem.

References

[1] Jean-Yves Audibert. Progressive mixture rules are deviation suboptimal. In NIPS'07, 2008.
[2] Olivier Catoni. Statistical Learning Theory and Stochastic Optimization. Springer-Verlag, 2004.
[3] Arnak Dalalyan and Joseph Salmon. Optimal aggregation of affine estimators. In COLT, 2011.
[4] L. K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608–613, 1992.
[5] Anatoli Juditsky, Philippe Rigollet, and Alexandre Tsybakov. Learning by mirror averaging. The Annals of Statistics, 36:2183–2206, 2008.
[6] Gilbert Leung and A. R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, 52(8):3396–3410, August 2006.
[7] Philippe Rigollet. Kullback-Leibler aggregation and misspecified generalized linear models. arXiv:0911.2919, November 2010.
[8] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39:731–771, 2011.
[9] Yuhong Yang. Adaptive regression by mixing.
Journal of the American Statistical Association, 96:574–588, 2001.", "award": [], "sourceid": 721, "authors": [{"given_name": "Dong", "family_name": "Dai", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}