{"title": "Putting Bayes to sleep", "book": "Advances in Neural Information Processing Systems", "page_first": 135, "page_last": 143, "abstract": "We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such ''sparse composite models'' is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the ''specialist'' framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.", "full_text": "Putting Bayes to sleep\n\nWouter M. Koolen\u2217\n\nDmitry Adamskiy\u2020\n\nManfred K. Warmuth\u2021\n\nAbstract\n\nWe consider sequential prediction algorithms that are given the predictions from\na set of models as inputs. 
If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such \u201csparse composite models\u201d is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors.\nWe build atop the \u201cspecialist\u201d framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.\n\n1 Introduction\n\nWe consider sequential prediction of outcomes y1, y2, . . . using a set of models m = 1, . . . , M for this task. In practice m could range over a mix of human experts, parametric models, or even complex machine learning algorithms. In any case we denote the prediction of model m for outcome yt given past observations y<t by P(yt | y<t, m). [...] \u03c7t = w for all t. Since the outcomes y\u2264t are a stochastic function of m and \u03c7\u2264t, the Bayesian joint satisfies\n\n(5)\n\nTheorem 2. Let Predt(yt) be the prediction of MPP for some mixing scheme \u03b31, \u03b32, . . . 
Let\nP(yt | y<t) [...] \u03c7t = w. Then\n\nPredt(yt) = P(yt | y<t).\n\nProof. By induction on t. [...] For t > 1, we expand the right-hand side, apply (5), use the independence we just proved, and the fact that asleep specialists predict with the rest:\n\n[...]\n\n= P(y
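
The abstract's central contrast — adaptivity by mixing a bit of the *initial prior* back into the weights versus long-term memory by mixing in a bit of *all past posteriors* — can be sketched numerically. The following is an illustrative toy, not the paper's algorithm: the function names, the uniform mixing scheme over past posteriors (one simple member of the MPP family of Bousquet and Warmuth), and the two-segment setup are all assumptions made for the example.

```python
import numpy as np

def bayes_posterior(w, lik):
    # Loss update: reweight each model by the probability it
    # assigned to the current outcome, then renormalize.
    v = w * lik
    return v / v.sum()

def fixed_share(v, prior, alpha):
    # Weak restart: mix a bit of the initial prior back in,
    # so every model's weight is bounded below by ~alpha/M.
    return (1 - alpha) * v + alpha * prior

def mix_past_posteriors(v, past, alpha):
    # MPP-style update (uniform mixing over all past posteriors):
    # a model that dominated any earlier segment keeps weight
    # inside the average, giving a long-term memory.
    return (1 - alpha) * v + alpha * np.mean(past, axis=0)

M, alpha = 3, 0.1
prior = np.ones(M) / M
fs_w, mpp_w = prior.copy(), prior.copy()
past = [prior.copy()]

# Two 20-round segments: model 0 predicts well, then model 1 does.
for good in [0] * 20 + [1] * 20:
    lik = np.full(M, 0.1)
    lik[good] = 0.9
    fs_w = fixed_share(bayes_posterior(fs_w, lik), prior, alpha)
    v = bayes_posterior(mpp_w, lik)
    past.append(v)
    mpp_w = mix_past_posteriors(v, past, alpha)

# After the second segment, model 0 retains more weight under
# past-posterior mixing than under the prior-mixing restart, so it
# would be recovered faster if it became good again.
```

Under the prior-mixing update, model 0's weight decays to the uniform floor of roughly alpha/M once model 1 takes over; under past-posterior mixing it stays near alpha times the fraction of past rounds it dominated, which is the "any model that has done well in the past can be recovered quickly" behavior the abstract describes.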