{"title": "Competitive On-line Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 364, "page_last": 370, "abstract": "", "full_text": "Competitive On-Line Linear Regression \n\nV. Vovk \n\nDepartment of Computer Science \n\nRoyal Holloway, University of London \n\nEgham, Surrey TW20 0EX, UK \n\nvovk@dcs.rhbnc.ac.uk \n\nAbstract \n\nWe apply a general algorithm for merging prediction strategies (the Aggregating Algorithm) to the problem of linear regression with the square loss; our main assumption is that the response variable is bounded. It turns out that for this particular problem the Aggregating Algorithm resembles, but is slightly different from, the well-known ridge estimation procedure. From general results about the Aggregating Algorithm we deduce a guaranteed bound on the difference between our algorithm's performance and the performance of the best, in some sense, linear regression function. We show that the AA attains the optimal constant in our bound, whereas the constant attained by the ridge regression procedure can in general be 4 times worse. \n\n1 INTRODUCTION \n\nThe usual approach to regression problems is to assume that the data are generated by some stochastic mechanism and to make some, typically very restrictive, assumptions about that mechanism. In recent years, however, a different approach to this kind of problem has been developed (see, e.g., DeSantis et al. [2], Littlestone and Warmuth [7]): in our context, that approach sets the goal of finding an on-line algorithm that performs not much worse than the best regression function found off-line; in other words, it replaces the usual statistical analyses by the competitive analysis of on-line algorithms. \n\nDeSantis et al. 
[2] performed a competitive analysis of the Bayesian merging scheme for the log-loss prediction game; later Littlestone and Warmuth [7] and Vovk [10] introduced an on-line algorithm (called the Weighted Majority Algorithm by the former authors) for the simple binary prediction game. These two algorithms (the Bayesian merging scheme and the Weighted Majority Algorithm) are special cases of the Aggregating Algorithm (AA) proposed in [9, 11]. The AA is a member of a wide family of algorithms called \"multiplicative weight\" or \"exponential weight\" algorithms. \n\nCloser to the topic of this paper, Cesa-Bianchi et al. [1] performed a competitive analysis, under the square loss, of the standard Gradient Descent Algorithm, and Kivinen and Warmuth [6] complemented it by a competitive analysis of a modification of Gradient Descent, which they call the Exponentiated Gradient Algorithm. The bounds obtained in [1, 6] are of the following type: at every trial T, \n\nL_T <= c L*_T + O(1), (1) \n\nwhere L_T is the loss (over the first T trials) of the on-line algorithm, L*_T is the loss of the best (by trial T) linear regression function, and c is a constant, c > 1; specifically, c = 2 for Gradient Descent and c = 3 for Exponentiated Gradient. These bounds hold under the following assumptions: for Gradient Descent, it is assumed that the L2 norm of the weights and of all data items is bounded by the constant 1; for Exponentiated Gradient, that the L1 norm of the weights and the L∞ norm of all data items are bounded by 1. \n\nIn many interesting cases bound (1) is weak. For example, suppose that our comparison class contains a \"true\" regression function, but its values are corrupted by i.i.d. noise. 
Then, under reasonable assumptions about the noise, L*_T will grow linearly in T, and inequality (1) will only bound the difference L_T - L*_T by a linear function of T. (Though in other situations bound (1) can be better than our bound (2), see below. For example, in the case of the Exponentiated Gradient, the O(1) in (1) depends on the number of parameters n logarithmically, whereas our bound depends on n linearly.) \n\nIn this paper we will apply the AA to the problem of linear regression. The AA has been proven to be optimal in some simple cases [5, 11], so we can also expect good performance in the problem of linear regression. The following is a typical result that can be obtained using the AA: Learner has a strategy which ensures that always \n\nL_T <= L*_T + n ln(T + 1) + 1 (2) \n\n(n is the number of predictor variables). It is interesting that the assumptions for the last inequality are weaker than those for both Gradient Descent and Exponentiated Gradient: we only assume that the L2 norm of the weights and the L∞ norm of all data items are bounded by the constant 1 (these assumptions will be further relaxed later on). The norms L2 and L∞ are not dual, which casts doubt on the accepted intuition that the weights and data items should be measured by dual norms (such as L1-L∞ or L2-L2). \n\nNotice that the logarithmic term n ln(T + 1) of (2) is similar to the term (n/2) ln T occurring in the analysis of the log-loss game and its generalizations, in particular in Wallace's theory of minimum message length, Rissanen's theory of stochastic complexity, and minimax regret analysis. In the case n = 1 and x_t = 1, ∀t, inequality (2) differs from Freund's [4] Theorem 4 only in the additive constant. In this paper we will see another manifestation of a phenomenon noticed by Freund [4]: for some important problems, the adversarial bounds of on-line competitive learning theory are only a tiny amount worse than the average-case bounds for some stochastic strategies for Nature. \n\nA weaker variant of inequality (2) can be deduced from Foster's [3] Theorem 1 (if we additionally assume that the response variable takes only two values, -1 or 1): Foster's result implies \n\nL_T <= L*_T + 8n ln(2n(T + 1)) + 8 \n\n(a multiple of 4 arises from replacing Foster's set {0, 1} of possible values of the response variable by our {-1, 1}; we also replaced Foster's d by 2n: to span our set of possible weights we need 2n of Foster's predictors). \n\nInequality (2) is also similar to Yamanishi's [12] result; in that paper, he considers a more general framework than ours but does not attempt to find optimal constants. \n\n2 ALGORITHM \n\nWe consider the following protocol of interaction between Learner and Nature: \n\nFOR t = 1, 2, ... \n  Nature chooses x_t ∈ R^n \n  Learner chooses prediction p_t ∈ R \n  Nature chooses y_t ∈ [-Y, Y] \nEND FOR. \n\nThis is a \"perfect-information\" protocol: either player can see the other player's moves. The parameters of our protocol are: a fixed positive number n (the dimensionality of our regression problem) and an upper bound Y > 0 on the values y_t returned by Nature. It is important, however, that our algorithm for playing this game (on the part of Learner) does not need to know Y. \n\nWe will only give a description of our regression algorithm; its derivation from the general AA will be given in the future full version of this paper. (It is usually a non-trivial task to represent the AA in a computationally efficient form, and the case of on-line linear regression is not an exception.) Fix n and a > 0. The algorithm is as follows: \n\nA := aI; b := 0 \nFOR TRIAL t = 1, 2, ...: \n  read new x_t ∈ R^n \n  A := A + x_t x_t' \n  output prediction p_t := b' A^{-1} x_t \n  read new y_t ∈ R \n  b := b + y_t x_t \nEND FOR. 
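In code, the trial loop above can be sketched as follows (a minimal sketch, assuming Python with numpy; the class name AARegression and all identifiers are ours, chosen for illustration and not part of the paper):

```python
import numpy as np

class AARegression:
    """Sketch of the on-line regression algorithm described above.

    Note the order of operations: the current x_t enters A *before* the
    prediction is made; b is updated only after Nature reveals y_t.
    This early inclusion of x_t in A is what makes the procedure differ
    from the ordinary ridge regression estimate.
    """

    def __init__(self, n, a=1.0):
        self.A = a * np.eye(n)   # A := aI
        self.b = np.zeros(n)     # b := 0

    def predict(self, x):
        # A := A + x x'; then p := b' A^{-1} x (solve instead of inverting)
        self.A += np.outer(x, x)
        return self.b @ np.linalg.solve(self.A, x)

    def update(self, y, x):
        # b := b + y x
        self.b += y * x
```

A naive call to `np.linalg.solve` costs O(n^3) per trial; the O(n^2) version mentioned below would maintain A^{-1} incrementally instead.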
\n\nIn this description, A is an n x n matrix (which is always symmetric and positive definite), b ∈ R^n, I is the unit n x n matrix, and 0 is the all-0 vector. \n\nThe naive implementation of this algorithm would require O(n^3) arithmetic operations at every trial, but the standard recursive technique allows us to spend only O(n^2) arithmetic operations per trial. This is still not as good as for the Gradient Descent Algorithm and the Exponentiated Gradient Algorithm (they require only O(n) operations per trial); we seem to have a trade-off between the quality of bounds on predictive performance and computational efficiency. In the rest of the paper \"AA\" will mean the algorithm described in the previous paragraph (which is the Aggregating Algorithm applied to a particular uncountable pool of experts with a particular Gaussian prior). \n\n3 BOUNDS \n\nIn this section we state, without proof, results describing the predictive performance of our algorithm. Our comparison class consists of the linear functions y_t = w · x_t, where w ∈ R^n. We will call the possible weights w \"experts\" (imagine that we have continuously many experts indexed by w ∈ R^n; Expert w always recommends prediction w · x_t to Learner). At every trial t Expert w and Learner suffer loss (y_t - w · x_t)^2 and (y_t - p_t)^2, respectively. Our notation for the total losses suffered by Expert w and Learner over the first T trials will be \n\nL_T(w) := Σ_{t=1}^T (y_t - w · x_t)^2 and L_T(Learner) := Σ_{t=1}^T (y_t - p_t)^2, \n\nrespectively. \n\nFor compact pools of experts (which, in our setting, corresponds to the set of possible weights w being bounded and closed) it is usually possible to derive bounds (such as (2)) where the learner's loss is compared to the best expert's loss. In our case of a non-compact pool, however, we need to give the learner a start on remote experts. 
Specifically, instead of comparing Learner's performance to inf_w L_T(w), we compare it to inf_w (L_T(w) + a||w||^2) (thus giving Learner a start of a||w||^2 on Expert w), where a > 0 is a constant reflecting our prior expectations about the \"complexity\" ||w|| := (Σ_{i=1}^n w_i^2)^{1/2} of successful experts. \n\nThis idea of giving a start to experts allows us to prove stronger results; e.g., the following elaboration of (2) holds: \n\nL_T(Learner) <= inf_w (L_T(w) + ||w||^2) + n ln(T + 1) (3) \n\n(this inequality still assumes that ||x_t||_∞ <= 1 for all t, but w is unbounded). \n\nOur notation for the transpose of matrix A will be A'; as usual, vectors are identified with one-column matrices. \n\nTheorem 1 For any fixed n, Learner has a strategy which ensures that always \n\nL_T(Learner) <= inf_w (L_T(w) + a||w||^2) + Y^2 ln det(I + (1/a) Σ_{t=1}^T x_t x_t'). \n\nIf, in addition, ||x_t||_∞ <= X, ∀t, \n\nL_T(Learner) <= inf_w (L_T(w) + a||w||^2) + n Y^2 ln(1 + T X^2 / a). (4) \n\nThe last inequality of this theorem implies inequality (3): it suffices to put X = Y = a = 1. \n\nThe term \n\nln det(I + (1/a) Σ_{t=1}^T x_t x_t') \n\nin Theorem 1 might be difficult to interpret. Notice that it can be rewritten as \n\nn ln T + ln det((1/T) I + (1/a) cov(X_1, ..., X_n)), \n\nwhere cov(X_1, ..., X_n) is the empirical covariance matrix of the predictor variables (in other words, cov(X_1, ..., X_n) is the covariance matrix of the random vector which takes the values x_1, ..., x_T with equal probability 1/T). We can see that this term is typically close to n ln T. \n\nUsing standard transformations, it is easy to deduce from Theorem 1, e.g., the following results (for simplicity we assume n = 1 and x_t, y_t ∈ [-1, 1], ∀t): \n\n• if the pool of experts consists of all polynomials of degree d, Learner has a strategy guaranteeing \n\nL_T(Learner) <= inf_w (L_T(w) + ||w||^2) + (d + 1) ln(T + 1); \n\n• if the pool of experts consists of all splines of degree d with k nodes (chosen a priori), Learner has a strategy guaranteeing \n\nL_T(Learner) <= inf_w (L_T(w) + ||w||^2) + (d + k + 1) ln(T + 1). \n\nThe following theorem shows that the constant n in inequality (4) cannot be improved. 
\n\nTheorem 2 Fix n (the number of attributes) and Y (the upper bound on |y_t|). For any ε > 0 there exist a constant C and a stochastic strategy for Nature such that ||x_t||_∞ = 1 and |y_t| = Y, for all t, and, for any stochastic strategy for Learner, \n\nE (L_T(Learner) - inf_{w: ||w|| <= Y} L_T(w)) >= (n - ε) Y^2 ln T - C, ∀T. \n\n4 COMPARISONS \n\nIt is easy to see that the ridge regression procedure sometimes gives results that are not sensible in our framework, where y_t ∈ [-Y, Y] and the goal is to compete against the best linear regression function. For example, suppose n = 1, Y = 1, and Nature generates outcomes (x_t, y_t), t = 1, 2, ..., where \n\n0 << x_1 << x_2 << ..., y_t = 1 if t is odd, y_t = -1 if t is even. \n\nAt trial t = 2, 3, ... the ridge regression procedure (more accurately, its natural modification which truncates its predictions to [-1, 1]) will give a prediction p_t = y_{t-1} equal to the previous response, and so will suffer a loss of about 4T over T trials. On the other hand, the AA's prediction will be close to 0, and so the cumulative loss of the AA over the first T trials will be about T, which is close to the best expert's loss. We can see that the ridge regression procedure in this situation is forced to suffer a loss 4 times as big as the AA's loss. \n\nThe lower bound stated in Theorem 2 does not imply that our regression algorithm is better than the ridge regression procedure in our adversarial framework. (Moreover, the idea of our proof of Theorem 2 is to lower bound the performance of the ridge regression procedure in the situation where the expected loss of the ridge regression procedure is optimal.) Theorem 1 asserts that \n\nL_T(Learner) <= inf_w (L_T(w) + a||w||^2) + Y^2 Σ_{i=1}^n ln(1 + (1/a) Σ_{t=1}^T x_{t,i}^2) (5) \n\nwhen Learner follows the AA. The next theorem shows that the ridge regression procedure sometimes violates this inequality. 
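The alternating-outcome example above can be checked numerically. The following sketch (assuming Python with numpy; the helper name run_game, the concrete growth x_t = 100^t, and the choice a = 1 are ours, picked as one instance of the construction) plays the one-dimensional game against both the truncated ridge procedure and the AA:

```python
import numpy as np

def run_game(T):
    """Alternating outcomes on rapidly growing x_t (n = 1, a = 1)."""
    a = 1.0
    xs = [100.0 ** t for t in range(1, T + 1)]            # x_1 << x_2 << ...
    ys = [1.0 if t % 2 == 1 else -1.0 for t in range(1, T + 1)]

    ridge_loss = aa_loss = 0.0
    A = a      # running a + sum of past x_s^2
    b = 0.0    # running sum of past y_s x_s
    for x, y in zip(xs, ys):
        # ridge (truncated to [-1, 1]) predicts using only past data in A
        p_ridge = np.clip(b * x / A, -1.0, 1.0)
        # the AA adds the current x_t to A before predicting
        p_aa = b * x / (A + x * x)
        ridge_loss += (y - p_ridge) ** 2
        aa_loss += (y - p_aa) ** 2
        A += x * x
        b += y * x
    return ridge_loss, aa_loss
```

For moderate T the simulated losses behave as described in the example: the truncated ridge prediction tracks the previous response and suffers about 4 per trial, while the AA's prediction stays near 0 and it suffers about 1 per trial.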
\nTheorem 3 Let n = 1 (the number of attributes) and Y = 1 (the upper bound on |y_t|); fix a > 0. Nature has a strategy such that, when Learner plays the ridge regression strategy, \n\nL_T(Learner) = 4T + O(1), (6) \n\ninf_w (L_T(w) + a||w||^2) = T + O(1), (7) \n\nln(1 + (1/a) Σ_{t=1}^T x_t^2) = T ln 2 + O(1) (8) \n\nas T → ∞ (and, therefore, (5) is violated). \n\n5 CONCLUSION \n\nA distinctive feature of our approach to linear regression is that our only assumption about the data is that |y_t| <= Y, ∀t; we do not make any assumptions about the stochastic properties of the data-generating mechanism. In some situations (if the data were generated by a partially known stochastic mechanism) this feature is a disadvantage, but often it will be an advantage. \n\nThis paper was greatly influenced by Vapnik's [8] idea of transductive inference. The algorithm analyzed in this paper is \"transductive\" in the sense that it outputs some prediction p_t for y_t after being given x_t, rather than outputting a general rule for mapping x_t into p_t; in particular, p_t may depend non-linearly on x_t. (It is easy, however, to extract such a rule from the description of the algorithm once it is found.) \n\nAcknowledgments \n\nKostas Skouras and Philip Dawid noticed that our regression algorithm is different from ridge regression and that in some situations it behaves very differently. Manfred Warmuth's advice about relevant literature is also gratefully appreciated. \n\nReferences \n\n[1] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth (1996), Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent, IEEE Trans. Neural Networks 7:604-619. \n\n[2] A. DeSantis, G. Markowsky, and M. N. Wegman (1988), Learning probabilistic prediction functions, in \"Proceedings, 29th Annual IEEE Symposium on Foundations of Computer Science,\" pp. 110-119, Los Alamitos, CA: IEEE Comput. Soc. \n\n[3] D. P. Foster (1991), Prediction in the worst case, Ann. Statist. 19:1084-1090. \n\n[4] Y. Freund (1996), Predicting a binary sequence almost as well as the optimal biased coin, in \"Proceedings, 9th Annual ACM Conference on Computational Learning Theory,\" pp. 89-98, New York: Assoc. Comput. Mach. \n\n[5] D. Haussler, J. Kivinen, and M. K. Warmuth (1994), Tight worst-case loss bounds for predicting with expert advice, University of California at Santa Cruz, Technical Report UCSC-CRL-94-36, revised December. Short version in \"Computational Learning Theory\" (P. Vitanyi, Ed.), Lecture Notes in Computer Science, Vol. 904, pp. 69-83, Berlin: Springer, 1995. \n\n[6] J. Kivinen and M. K. Warmuth (1997), Exponentiated Gradient versus Gradient Descent for linear predictors, Inform. Computation 132:1-63. \n\n[7] N. Littlestone and M. K. Warmuth (1994), The Weighted Majority Algorithm, Inform. Computation 108:212-261. \n\n[8] V. N. Vapnik (1995), The Nature of Statistical Learning Theory, New York: Springer. \n\n[9] V. Vovk (1990), Aggregating strategies, in \"Proceedings, 3rd Annual Workshop on Computational Learning Theory\" (M. Fulk and J. Case, Eds.), pp. 371-383, San Mateo, CA: Morgan Kaufmann. \n\n[10] V. Vovk (1992), Universal forecasting algorithms, Inform. Computation 96:245-277. \n\n[11] V. Vovk (1997), A game of prediction with expert advice, to appear in J. Comput. Inform. Syst. Short version in \"Proceedings, 8th Annual ACM Conference on Computational Learning Theory,\" pp. 51-60, New York: Assoc. Comput. Mach., 1995. \n\n[12] K. Yamanishi (1997), A decision-theoretic extension of stochastic complexity and its applications to learning, submitted to IEEE Trans. Inform. Theory. \n", "award": [], "sourceid": 1419, "authors": [{"given_name": "Volodya", "family_name": "Vovk", "institution": null}]}