{"title": "A Filtering Approach to Stochastic Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2114, "page_last": 2122, "abstract": "Stochastic variational inference (SVI) uses stochastic optimization to scale up Bayesian computation to massive data. We present an alternative perspective on SVI as approximate parallel coordinate ascent. SVI trades-off bias and variance to step close to the unknown true coordinate optimum given by batch variational Bayes (VB). We define a model to automate this process. The model infers the location of the next VB optimum from a sequence of noisy realizations. As a consequence of this construction, we update the variational parameters using Bayes rule, rather than a hand-crafted optimization schedule. When our model is a Kalman filter this procedure can recover the original SVI algorithm and SVI with adaptive steps. We may also encode additional assumptions in the model, such as heavy-tailed noise. By doing so, our algorithm outperforms the original SVI schedule and a state-of-the-art adaptive SVI algorithm in two diverse domains.", "full_text": "A Filtering Approach to Stochastic Variational\n\nInference\n\nNeil M.T. Houlsby \u2217\nGoogle Research\nZurich, Switzerland\n\nneilhoulsby@google.com\n\nDavid M. Blei\n\nDepartment of Statistics\n\nDepartment of Computer Science\n\nColombia University\n\ndavid.blei@colombia.edu\n\nAbstract\n\nStochastic variational inference (SVI) uses stochastic optimization to scale up\nBayesian computation to massive data. We present an alternative perspective on\nSVI as approximate parallel coordinate ascent. SVI trades-off bias and variance\nto step close to the unknown true coordinate optimum given by batch variational\nBayes (VB). We de\ufb01ne a model to automate this process. The model infers the lo-\ncation of the next VB optimum from a sequence of noisy realizations. 
As a consequence of this construction, we update the variational parameters using Bayes rule, rather than a hand-crafted optimization schedule. When our model is a Kalman filter this procedure can recover the original SVI algorithm and SVI with adaptive steps. We may also encode additional assumptions in the model, such as heavy-tailed noise. By doing so, our algorithm outperforms the original SVI schedule and a state-of-the-art adaptive SVI algorithm in two diverse domains.

1 Introduction

Stochastic variational inference (SVI) is a powerful method for scaling up Bayesian computation to massive data sets [1]. It has been successfully used in many settings, including topic models [2], probabilistic matrix factorization [3], statistical network analysis [4, 5], and Gaussian processes [6]. SVI uses stochastic optimization to fit a variational distribution, following cheap-to-compute noisy natural gradients that arise from repeatedly subsampling the data. The algorithm follows these gradients with a decreasing step size [7]. One nuisance, as for all stochastic optimization techniques, is setting the step size schedule.

In this paper we develop variational filtering, an alternative perspective on stochastic variational inference. We show that this perspective leads naturally to a tracking algorithm, one based on a Kalman filter, that effectively adapts the step size to the idiosyncrasies of data subsampling. Without any tuning, variational filtering performs as well as or better than the best constant learning rate chosen in retrospect. Further, it outperforms both the original SVI algorithm and SVI with adaptive learning rates [8].

In more detail, variational inference optimizes a high-dimensional variational parameter λ to find a distribution that approximates an intractable posterior. A concept that is important in SVI is the parallel coordinate update.
This refers to setting each dimension of λ to its coordinate optimum, but where these coordinates are computed in parallel. We denote the resulting updated parameters λ^VB. With this definition we have a new perspective on SVI. At each iteration it attempts to reach its parallel coordinate update, but one estimated from a randomly sampled data point. (The true coordinate update requires iterating over all of the data.) Specifically, SVI iteratively updates an estimate of λ as follows,

λ_t = (1 − ρ_t) λ_{t−1} + ρ_t λ̂_t ,   (1)

where λ̂_t is a random variable whose expectation is λ^VB_t and ρ_t is the learning rate. The original paper on SVI points out that this iteration works because λ^VB_t − λ_t is the natural gradient of the variational objective, and so Eq 1 is a noisy gradient update. But we can also see the iteration as a noisy attempt to reach the parallel coordinate optimum λ^VB_t. While λ̂_t is an unbiased estimate of this quantity, we will show that Eq 1 uses a biased estimate but with reduced variance.

This new perspective opens the door to other ways of updating λ_t based on the noisy estimates of λ^VB_t. In particular, we use a Kalman filter to track the progress of λ_t based on the sequence of noisy coordinate updates. This gives us a 'meta-model' of the optimal parameter, which we now estimate through efficient inference.

∗Work carried out while a member of the University of Cambridge, visiting Princeton University.
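The iteration in Eq 1 can be sketched in a few lines. The following is a minimal illustration (our code, not the authors'); the Robbins-Monro schedule is the one SVI uses, and the target value 5.0 is an arbitrary stand-in for the unknown coordinate optimum:

```python
import random

def robbins_monro(t, t0=1.0, kappa=0.7):
    """Decreasing step size rho_t = (t0 + t)^(-kappa), as used by SVI."""
    return (t0 + t) ** (-kappa)

def svi_updates(lam0, noisy_optima, step):
    """Iterate Eq 1: lam_t = (1 - rho_t) * lam_{t-1} + rho_t * lam_hat_t."""
    lam = lam0
    for t, lam_hat in enumerate(noisy_optima, start=1):
        rho = step(t)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam

# Toy run: each lam_hat_t is an unbiased but noisy realization of an
# unknown coordinate optimum at 5.0; the decaying step size averages
# the noise away while moving toward the optimum.
random.seed(0)
observations = [5.0 + random.gauss(0.0, 2.0) for _ in range(5000)]
lam_T = svi_updates(0.0, observations, robbins_monro)
```

Because each observation is unbiased, the convex combination in Eq 1 ends up close to the target even though any single observation is far from it.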
We show that one setting of the Kalman filter corresponds to SVI; another corresponds to SVI with adaptive learning rates; and others, like using a t-distribution in place of a Gaussian, account better for noise than any previous methods.

2 Variational Filtering

We first introduce stochastic variational inference (SVI) as approximate parallel coordinate ascent. We use this view to present variational filtering, a model-based approach to variational optimization that observes noisy parallel coordinate optima and seeks to infer the true VB optimum. We instantiate this method with a Kalman filter, discuss relationships to other optimization schedules, and extend the model to handle real-world SVI problems.

Stochastic Variational Inference   Given data x_{1:N}, we want to infer the posterior distribution over model parameters θ, p(θ|x_{1:N}). For most interesting models exact inference is intractable and we must use approximations. Variational Bayes (VB) formulates approximate inference as a batch optimization problem. The intractable posterior distribution p(θ|x_{1:N}) is approximated by a simpler distribution q(θ; λ), where λ are the variational parameters of q.¹ These parameters are adjusted to maximize a lower bound on the model evidence (the ELBO),

L(λ) = Σ_{i=1}^N E_q[log p(x_i|θ)] + E_q[log p(θ)] − E_q[log q(θ)] .   (2)

Maximizing Eq 2 is equivalent to minimizing the KL divergence between the exact and approximate posterior, KL[q||p]. Successive optima of the ELBO often have closed form [1], so to maximize Eq 2 VB can perform successive parallel coordinate updates on the elements of λ, λ_{t+1} = λ^VB_t. Unfortunately, the sum over all N datapoints in Eq 2 means that computing λ^VB_t is too expensive on large datasets.
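As a toy illustration of the gap between the batch update and a subsampled one, consider estimating the posterior mean of a Gaussian. This conjugate example is ours, not the paper's LDA setup, and the names below are illustrative:

```python
import random

random.seed(1)

# Toy conjugate model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1).
# The batch coordinate update of the variational mean is the exact
# posterior mean, which requires a pass over all N datapoints.
N = 10_000
data = [random.gauss(2.0, 1.0) for _ in range(N)]
lam_vb = sum(data) / (N + 1)          # touches every datapoint

def noisy_update(x_i):
    """Cheap update from one sampled datapoint, scaled by N."""
    return N * x_i / (N + 1)

# Averaging noisy one-point updates over uniformly sampled datapoints
# recovers the expensive batch optimum: the scaling by N keeps each
# cheap update unbiased for lam_vb.
est = sum(noisy_update(random.choice(data)) for _ in range(20_000)) / 20_000
```

Each cheap update is far noisier than the batch one, but its expectation matches the batch optimum, which is the property SVI exploits below.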
SVI avoids this difficulty by sampling a single datapoint (or a mini-batch) and optimizing a cheap, noisy estimate of the ELBO, L̂(λ). The optimum of L̂(λ) is denoted λ̂_t,

λ̂_t := argmax_λ L̂(λ) ,   (3)
L̂(λ) = N E_q[log p(x_t|θ)] + E_q[log p(θ)] − E_q[log q(θ)] .   (4)

The constant N in Eq 4 ensures the noisy parallel coordinate optimum is unbiased with respect to the full VB optimum, E[λ̂_t] = λ^VB_t. After computing λ̂_t, SVI updates the parameters using Eq 1. This corresponds to using natural gradients [9] to perform stochastic gradient ascent on the ELBO.

We present an alternative perspective on Eq 1. SVI may be viewed as an attempt to reach the true parallel coordinate optimum λ^VB_t using the noisy estimate λ̂_t. The observation λ̂_t is an unbiased estimator of λ^VB_t with variance Var[λ̂_t]. The variance may be large, so SVI makes a bias/variance trade-off to reduce the overall error. The bias and variance in λ_t computed using SVI (Eq 1) are

E[λ_t − λ^VB_t] = (1 − ρ_t)(λ_{t−1} − λ^VB_t) ,   Var[λ_t] = ρ_t² Var[λ̂_t] ,   (5)

respectively. Decreasing the step size reduces the variance but increases the bias. However, as the algorithm converges, the bias decreases as the VB optima fall closer to the current parameters. Thus, λ_{t−1} − λ^VB_t tends to zero and, as optimization progresses, ρ_t should decay. This reduces the variance given the same level of bias.

Indeed, most stochastic optimization schedules decay the step size, including the Robbins-Monro schedule [7] used in SVI. Different schedules yield different bias/variance trade-offs, but the trade-off is heuristic and these schedules often require hand tuning. Instead we use a model to infer the location of λ^VB_t from the observations, and use Bayes rule to determine the optimal step size.

Probabilistic Filtering for SVI   We described our view of SVI as approximate parallel coordinate ascent. With this perspective, we can define a model to infer λ^VB_t. We have three sets of variables: λ_t are the current parameters of the approximate posterior q(θ; λ_t); λ^VB_t is a hidden variable corresponding to the VB coordinate update at the current time step; and λ̂_t is an unbiased, but noisy, observation of λ^VB_t.

We specify a model that observes the sequence of noisy coordinate optima λ̂_{1:t}, and we use it to compute a distribution over the full VB update, p(λ^VB_t | λ̂_{1:t}). When making a parallel coordinate update at time t we move to the best estimate of the VB optimum under the model, λ_t = E[λ^VB_t | λ̂_{1:t}]. Using this approach we i) avoid the need to tune the step size because Bayes rule determines how the posterior mean moves at each iteration; ii) can use a Kalman filter to recover particular static and adaptive step size algorithms; and iii) can add extra modelling assumptions to vary the step size schedule in useful ways.

In variational inference, our 'target' is λ^VB_t. It moves because the parameters of the approximate posterior λ_t change as optimization progresses. Therefore, we use a dynamic tracking model, the Kalman filter [10]. We compute the posterior over the next VB optimum given previous observations, p(λ^VB_t | λ̂_{1:t}). In tracking, this is called filtering, so we call our method variational filtering (VF).² At each time t, VF has a current set of model parameters λ_{t−1} and takes these steps.

1. Sample a datapoint x_t.
2. Compute the noisy estimate of the coordinate update λ̂_t using Eq 3.
3. Run Kalman filtering to compute the posterior over the VB optimum, p(λ^VB_t | λ̂_{1:t}).
4. Update the parameters to the posterior mean λ_t = E[λ^VB_t | λ̂_{1:t}] and repeat.

Variational filtering uses the entire history of observations, encoded by the posterior, to infer the location of the VB update. Standard optimization schedules use only the current parameters λ_t to regularize the noisy coordinate update, and these methods require tuning to balance bias and variance in the update. In our setting, Bayes rule automatically makes this trade-off.

To illustrate this perspective we consider a small problem. We fit a variational distribution for latent Dirichlet allocation on a small corpus of 2.5k documents from the ArXiv. For this problem we can compute the full parallel coordinate update, and thus the tracking error ||λ^VB_t − λ_t||²_2 and the observation noise ||λ^VB_t − λ̂_t||²_2, for various algorithms. We emphasize that λ̂_t is unbiased, and so the observation noise is completely due to variance. A reduction in tracking error indicates an advantage to incurring bias for a reduction in variance.

We compared variational filtering (Alg. 1) to the original Robbins-Monro schedule used in SVI [1], and a large constant step size of 0.5. The same sequence of random documents was handed to each algorithm. Figs. 1 (a-c) show the tracking error of each algorithm. The large constant step size yields large error due to high variance, see Eq 5.

¹To readers familiar with stochastic variational inference, we refer to the global variational parameters, assuming that the local parameters are optimized at each iteration. Details can be found in [1].
The SVI updates are too small and the bias dominates. Here, the bias is even larger than the variance in the noisy observations during early stages, but it decays as the term (λ_{t−1} − λ^VB_t) in Eq 5 slowly decreases. The variational filter automatically balances bias and variance, yielding the smallest tracking error. As a result of following the VB optima more closely, the variational filter achieves larger values of the ELBO, shown in Fig. 1 (d).

3 Kalman Variational Filter

We now detail our Kalman filter for SVI. Then we discuss different settings of the parameters and estimating these online. Finally, we extend the filter to handle heavy-tailed noise.

²We do not perform 'smoothing' in our dynamical system because we are not interested in old VB coordinate optima after the parameters have been optimized further.

Figure 1: (a) Variational Filtering, (b) SVI, Robbins-Monro, (c) Constant Rate, (d) ELBO. (a-c) Curves show the error in tracking the VB update. Markers depict the error in the noisy observations λ̂_t to the VB update. (d) Evolution of the ELBO computed on the entire dataset.

The Gaussian Kalman filter (KF) is attractive because inference is tractable and, in SVI, computational time is the limiting factor, not the rate of data acquisition. The model is specified as

p(λ^VB_{t+1} | λ^VB_t) = N(λ^VB_t, Q) ,   p(λ̂_t | λ^VB_t) = N(λ^VB_t, R) ,   (6)

where R models the variance in the noisy coordinate updates and Q models how far the VB optima move at each iteration. The observation noise has zero mean because the noisy updates are unbiased. We assume no systematic parameter drift, so E[λ^VB_{t+1}] = λ^VB_t. Filtering in this linear-Gaussian model is tractable; given the current posterior p(λ^VB_{t−1} | λ̂_{1:t−1}) = N(µ_{t−1}, Σ_{t−1}) and a noisy coordinate update λ̂_t, the next posterior is computed directly using Gaussian manipulations [11],

p(λ^VB_t | λ̂_{1:t}) = N([1 − P_t]µ_{t−1} + P_t λ̂_t , [1 − P_t][Σ_{t−1} + Q]) ,   (7)
P_t = [Σ_{t−1} + Q][Σ_{t−1} + Q + R]^{−1} .   (8)

The variable P_t is known as the Kalman gain. Notice the update to the posterior mean has the same form as the SVI update in Eq 1. The gain P_t is directly equivalent to the SVI step size ρ_t.³ Different modelling choices lead to different optimization schedules. We now present some key cases.

Static Parameters   If the parameters Q and R are fixed, the step size progression in Eq 7 can be computed a priori as P_{t+1} = [Q/R + P_t][1 + Q/R + P_t]^{−1}. This yields a fixed sequence of decreasing step sizes. A popular schedule is the Robbins-Monro routine, ρ_t ∝ (t₀ + t)^{−κ}, also used in SVI [1]. If we set Q = 0 the variational filter returns a Robbins-Monro schedule with κ = 1. This corresponds to online estimation of the mean of a Gaussian: Q = 0 assumes that the optimization has converged, so the filter simply averages the noisy updates.

In practice, decay rates slower than κ = 1 perform better [2, 8]. This is because updates which were computed using old parameter values are forgotten faster. Setting Q > 0 yields the same reduced memory. In this case, the step size tends to a constant, lim_{t→∞} P_t = [√(1 + 4R/Q) + 1][√(1 + 4R/Q) + 1 + 2R/Q]^{−1}. Larger noise-to-signal ratios R/Q result in smaller limiting step sizes. This demonstrates the automatic bias/variance trade-off.
If R/Q is large, the variance in the noisy updates Var[λ̂_t] is assumed large. Therefore, the filter uses a smaller step size, yielding more bias (Eq 5), but with lower overall error. Conversely, if there is no noise, R/Q = 0, then P_∞ = 1 and we recover batch VB.

Parameter Estimation   Normally the parameters will not be known a priori. Further, if Q is fixed then the step size does not tend to zero and so the Robbins-Monro criteria do not hold [7]. We can address both issues by estimating Q and R online.

The parameter R models the variance in the noisy optima, and Q measures how near the process is to convergence. These parameters are unknown and will change as the optimization progresses. Q will decrease as convergence is approached; R may decrease or increase. In our demonstration in Fig. 1, it increases during early iterations and then plateaus. Therefore we estimate these parameters online, similar to [8, 12]. The desired parameter values are

R = E[||λ̂_t − λ^VB_t||²_2] = E[||λ̂_t − λ^VB_{t−1}||²_2] − ||λ^VB_t − λ^VB_{t−1}||²_2 ,   (9)
Q = ||λ^VB_t − λ^VB_{t−1}||²_2 .   (10)

³In general, P_t is a full-rank matrix update. For simplicity, and to compare to scalar learning rates, we present the 1D case. The multi-dimensional generalization is straightforward.

Figure 2: Step sizes learned by the Gaussian Kalman filter, the Student's t filter (Alg. 1) and the adaptive learning rate in [8], on non-stationary ArXiv data. The adaptive algorithms react to the dataset shift by increasing the step size. The variational filters react even faster than adaptive-SVI because not only do Q and R adjust, but the posterior variance increases at the shift, which further augments the next step size.

We estimate these using exponentially weighted moving averages. To estimate the two terms in Eq 9, we estimate the expected difference between the current state and the observation, g_t = E[λ̂_t − λ^VB_{t−1}], and the norm of this difference, h_t = E[||λ̂_t − λ^VB_{t−1}||²_2], using

g_t = (1 − τ_t^{−1}) g_{t−1} + τ_t^{−1} (λ̂_t − µ_{t−1}) ,   h_t = (1 − τ_t^{−1}) h_{t−1} + τ_t^{−1} ||λ̂_t − µ_{t−1}||²_2 ,   (11)

where τ is the window length and µ_{t−1} is the current posterior mean. The parameters are estimated as R = h_t − ||g_t||²_2 and Q = ||g_t||²_2. After filtering, the window length is adjusted to τ_{t+1} = (1 − P_t)τ_t + 1. Larger steps result in shorter memory of old parameter values. Joint parameter and state estimation can be poorly determined. Initializing the parameters to appropriate values with Monte Carlo sampling, as in [8], mitigates this issue. In our experiments we avoid this under-specification by tying the filtering parameters across the filters for each variational parameter.

The variational filter with parameter estimation recovers an automatic step size similar to the adaptive-SVI algorithm in [8]. Their step size is equivalent to ρ_t = Q/[Q + R]. Variational filtering uses P_t = [Σ_{t−1} + Q]/[Σ_{t−1} + Q + R], Eq 7. If the posterior variance Σ_{t−1} is zero, the updates are identical. If Σ_{t−1} is large, as in early time steps, the filter produces a larger step size.

Fig. 2 demonstrates how these methods react to non-stationary data.
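The online estimates of Eq 9-11 can be sketched as follows (scalar case; this is our illustration, and the small clamp keeping R non-negative is our numerical safeguard, not part of the paper):

```python
def update_statistics(g, h, tau, lam_hat, mu):
    """Exponentially weighted moving averages of Eq 11 (scalar case).

    g, h:    running estimates of E[lam_hat - lam_VB_{t-1}] and of
             E[(lam_hat - lam_VB_{t-1})^2].
    tau:     current window length.
    lam_hat: new noisy coordinate update; mu: current posterior mean.
    """
    w = 1.0 / tau
    g = (1.0 - w) * g + w * (lam_hat - mu)
    h = (1.0 - w) * h + w * (lam_hat - mu) ** 2
    R = max(h - g * g, 1e-12)   # observation noise; clamp is our safeguard
    Q = g * g                   # squared drift of the optimum itself
    return g, h, R, Q

def update_window(tau, gain):
    """After filtering: tau_{t+1} = (1 - P_t) * tau_t + 1, so large
    steps shorten the memory of old parameter values."""
    return (1.0 - gain) * tau + 1.0
```

For example, starting from g = h = 0 with window τ = 10 and residual λ̂ − µ = 2, one update gives g = 0.2, h = 0.4, and hence R = 0.36, Q = 0.04.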
LDA was run on ArXiv abstracts whose category changed every 5k documents. Variational filtering and adaptive-SVI react to the shift by increasing the step size; the ELBO is similar for both methods.

Student's t Filter   In SVI, the noisy estimates λ̂_t are often heavy-tailed. For example, in matrix factorization heavy-tailed parameter distributions [13] produce heavy-tailed noisy updates. Empirically, we observe similar heavy tails in LDA. Heavy tails may also arise from computing Euclidean distances between parameter vectors and not using the more natural Fisher information metric [9]. We add robustness to these sources of noise with a heavy-tailed Kalman filter.

We use a t-distributed noise model, p(λ̂_t | λ^VB_t) = T(λ^VB_t, R, δ), where T(m, V, d) denotes a t-distribution with mean m, covariance V and d degrees of freedom. For computational convenience we also use a t-distributed transition model, p(λ^VB_{t+1} | λ^VB_t) = T(λ^VB_t, Q, γ). If the current posterior is t-distributed, p(λ^VB_{t−1} | λ̂_{1:t−1}) = T(µ_{t−1}, Σ_{t−1}, η_{t−1}), and the degrees of freedom are identical, η_{t−1} = γ = δ, then filtering has closed form,

p(λ^VB_t | λ̂_{1:t}) = T( (1 − P_t)µ_{t−1} + P_t λ̂_t , (η_{t−1} + Δ²)/(η_{t−1} + ||λ||₀) (1 − P_t)[Σ_{t−1} + Q] , η_{t−1} + ||λ||₀ ) ,   (12)
where P_t = [Σ_{t−1} + Q]/[Σ_{t−1} + Q + R] , and Δ² = ||λ̂_t − µ_{t−1}||²_2 / [Σ_{t−1} + Q + R] .   (13)

The update to the mean is the same as in the Gaussian KF. The crucial difference is in the update to the variance in Eq 12. If an outlier λ̂_t arrives, then Δ², and hence Σ_t, are augmented. The increased posterior uncertainty at time t + 1 yields an increased gain P_{t+1}.
This allows the filter to react quickly to a large perturbation. The t-filter differs fundamentally from the Gaussian KF in that the step size is now a direct function of the observations. In the Gaussian KF the dependency is indirect, through the estimation of R and Q.

Eq 12 has closed form because the degrees of freedom are equal. Unfortunately, this will not generally be the case because the posterior degrees of freedom grow, so we require an approximation. Following [14], we approximate the 'incompatible' t-distributions by adjusting their degrees of freedom to be equal. We choose all of these to equal η̃_t = min(η_t, γ, δ). We match the degrees of freedom in this way because it prevents the posterior degrees of freedom from growing over time. If η_t in Eq 12 were allowed to grow large, the t-distributed filter would revert back to a Gaussian KF. This is undesirable because the heavy-tailed noise does not necessarily disappear at convergence.

To account for adjusting the degrees of freedom, we moment match the old and new t-distributions. This has closed form; to match the second moments of T(m, Σ̃, η̃) to T(m, Σ, η), the variance is set to Σ̃ = η(η̃ − 2)/[(η − 2)η̃] Σ. This results in tractable filtering and has the same computational cost as Gaussian filtering. The routine is summarized in Algorithm 1.

Algorithm 1 Variational filtering with Student's t-distributed noise
1:  procedure FILTER(data x_{1:N})
2:    Initialize filtering distribution Σ₀, µ₀, η₀, see § 5
3:    Initialize statistics g₀, h₀, τ₀ with Monte-Carlo sampling
4:    Set initial variational parameters λ₀ ← µ₀
5:    for t = 1, . . . , T do
6:      Sample a datapoint x_t    ▷ Or a mini-batch of data.
7:      λ̂_t ← f(λ_t, x_t), with f given by Eq 3-4    ▷ Noisy estimate of the coordinate optimum.
8:      Compute g_t and h_t using Eq 11.
9:      R ← h_t − ||g_t||²_2 ,  Q ← ||g_t||²_2    ▷ Update parameters of the filter.
10:     η̃_{t−1} ← min(η_{t−1}, γ, δ)    ▷ Match degrees of freedom.
11:     Σ̃_{t−1} ← η_{t−1}(η̃_{t−1} − 2)[(η_{t−1} − 2)η̃_{t−1}]^{−1} Σ_{t−1}, similarly for R̃, Q̃    ▷ Moment match.
12:     P_t ← [Σ̃_{t−1} + Q̃][Σ̃_{t−1} + Q̃ + R̃]^{−1}    ▷ Compute gain, or step size.
13:     Δ² ← ||λ̂_t − µ_{t−1}||²_2 [Σ̃_{t−1} + Q̃ + R̃]^{−1}
14:     µ_t ← [I − P_t]µ_{t−1} + P_t λ̂_t    ▷ Update filter posterior.
15:     Σ_t ← (η̃_{t−1} + Δ²)/(η̃_{t−1} + ||λ||₀) [I − P_t][Σ̃_{t−1} + Q̃] ,  η_t ← η̃_{t−1} + 1
16:     λ_t ← µ_t    ▷ Update the variational parameters of q.
17:   end for
18:   return λ_T
19: end procedure

4 Related Work

Stochastic and Streamed VB   SVI performs fast inference on a fixed dataset of known size N. Online VB algorithms process an infinite stream of data [15, 16], but these methods cannot use a re-sampled datapoint. Variational filtering falls between both camps. The noisy observations require an estimate of N. However, Kalman filtering does not try to optimize a static dataset like a fixed Robbins-Monro schedule. As observed in Fig. 2, the algorithm can adapt to a regime change, and forgets the old data.
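One step of Algorithm 1 can be sketched in scalar form as follows. This is our illustration (variable names are ours); the helper implements the Σ̃ moment-matching rescaling from § 3, and requires all degrees of freedom to exceed 2:

```python
def match_moments(var, eta, eta_tilde):
    """Rescale a t-distribution's variance parameter so its second moment
    is preserved when the degrees of freedom drop to eta_tilde (dof > 2)."""
    return eta * (eta_tilde - 2.0) / ((eta - 2.0) * eta_tilde) * var

def t_filter_step(mu, sigma, eta, lam_hat, Q, R, gamma, delta, dim=1.0):
    """One scalar step of the Student's t variational filter (cf. Alg. 1).

    mu, sigma, eta: current posterior T(mu, sigma, eta) over the VB optimum.
    lam_hat:        new noisy coordinate update.
    Q, gamma:       transition noise scale and degrees of freedom.
    R, delta:       observation noise scale and degrees of freedom.
    """
    eta_t = min(eta, gamma, delta)                   # match degrees of freedom
    sigma_m = match_moments(sigma, eta, eta_t)       # moment match all three
    Q_m = match_moments(Q, gamma, eta_t)
    R_m = match_moments(R, delta, eta_t)
    gain = (sigma_m + Q_m) / (sigma_m + Q_m + R_m)   # Eq 13: gain / step size
    delta2 = (lam_hat - mu) ** 2 / (sigma_m + Q_m + R_m)
    mu_new = (1.0 - gain) * mu + gain * lam_hat
    # Eq 12: an outlier inflates delta2, hence the posterior variance,
    # which in turn raises the gain at the next step.
    sigma_new = (eta_t + delta2) / (eta_t + dim) * (1.0 - gain) * (sigma_m + Q_m)
    return mu_new, sigma_new, eta_t + dim, gain
```

Calling the step with an outlying observation yields a much larger posterior variance than an inlying one, which is the mechanism behind the fast reaction to perturbations described above.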
The filter simply tries to move to the VB coordinate update at each step, and is not directly concerned about asymptotic convergence on a static dataset.

Kalman filters for parameter learning   Kalman filters have been used to learn neural network parameters. Extended Kalman filters have been used to train supervised networks [17, 18, 19]. The network weights evolve because of data non-stationarity. This problem differs fundamentally from SVI. In the neural network setting, the observations are the fixed data labels, but in SVI the observations are noisy realizations of a moving VB parallel coordinate optimum. If the VF draws the same datapoint, the observations λ̂ will still change because λ_t will have changed. In the work with neural nets, the same datapoint always yields the same observation for the filter.

Adaptive learning rates   Automatic step size schedules have been proposed for online estimation of the mean of a Gaussian [20], or drifting parameters [21]. The latter work uses a Gaussian KF for parameter estimation in approximate dynamic programming. Automatic step sizes are derived for stochastic gradient descent in [12] and SVI in [8]. These methods set the step size to minimize the expected update error. Our work is the first Bayesian approach to learn the SVI schedule.

Meta-modelling   Variational filtering is a 'meta-model': a model that assists the training of a more complex method. Meta-models are becoming increasingly popular; examples include Kalman filters for training neural networks [17], Gaussian process optimization for hyperparameter search [22] and Gaussian process regression to construct Bayesian quasi-Newton methods [23].

Figure 3: (a) LDA ArXiv, (b) LDA NYT, (c) LDA Wikipedia, (d) BMF WebView, (e) BMF Kosarak, (f) BMF Netflix. Final performance achieved by each algorithm on the two problems. Stars indicate the best performing non-oracle algorithm and those statistically indistinguishable at p = 0.05. (a-c) LDA: value of the ELBO after observing 0.5M documents. (d-f) BMF: recall@10 after observing 2 · 10⁸ cells.

5 Empirical Case Studies

We tested variational filtering on two diverse problems: topic modelling with Latent Dirichlet Allocation (LDA) [24], a popular testbed for scalable inference routines, and binary matrix factorization (BMF). Variational filtering outperforms Robbins-Monro SVI and a state-of-the-art adaptive method [8] in both domains. The Student's t filter performs substantially better than the Gaussian KF and is competitive with an oracle that picks the best constant step size with hindsight.

Models   We used 100 topics in LDA and set the Dirichlet hyperparameters to 0.5. This value is slightly larger than usual because it helps the stochastic routines escape local minima early on. For BMF we used a logistic matrix factorization model with a Gaussian variational posterior over the latent matrices [3]. This task differs from LDA in two ways: the variational parameters are Gaussian, and we sample single cells from the matrix to form stochastic updates. We used minibatches of 100 documents in LDA, and 5 times the number of rows in BMF.

Datasets   We trained LDA on three large document corpora: 630k abstracts from the ArXiv, 1.73M New York Times articles, and Wikipedia, which has ≈ 4M articles. For BMF we used three recommendation matrices: clickstream data from the Kosarak news portal; click data from an e-commerce website, BMS-WebView-2 [25]; and the Netflix data, treating 4-5 star ratings as ones. Following [3] we kept the 1000 items with the most ones and sampled up to 40k users.

Algorithms   We ran our Student's t variational filter in Algorithm 1 (TVF) and the Gaussian version in § 3 (GVF).
The variational parameters were initialized randomly in LDA and with an\nSVD-based routine [26] in BMF. The prior variance was set to \u03a30 = 103 and t-distribution\u2019s degrees\nof freedom to \u03b70 = 3 to get the heaviest tails with a \ufb01nite variance for moment matching.\nIn general, VF can learn full-rank matrix stepsizes. LDA and BMF, however, have many parameters,\nand so we used the simplest setting of VF in which a single step size was learned for all of them;\nthat is, Q and R are constrained to be proportional to the identity matrix. This choice reduces the\ncost of VF from O(N 3) to O(N ). Empirically, this computational overhead was negligible. Also\n\n7\n\n\u22128.5\u22128\u22127.5test ELBOTVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const \u22128\u22127.9\u22127.8\u22127.7\u22127.6test ELBOTVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const \u22127.5\u22127.4\u22127.3\u22127.2\u22127.1\u22127test ELBOTVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const 0.20.250.30.350.40.45recall@10TVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const 0.250.30.350.40.45recall@10TVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const 0.120.140.160.180.20.22recall@10TVF (this paper) GVF (this paper) SVI [Hof13] Adapt\u2212SVI [Ran13] Oracle Const \f(a) LDA ArXiv, ELBO\n\n(b) BMF WebView, recall@10\n\ncurves\n\nFigure 4:\nExample\nlearning\nof\n(a)\nthe ELBO (plot\nsmoothed with Lowess\u2019\nmethod) and (b)\nre-\ncall@10, on the LDA\nand BMF problems,\nrespectively.\n\nit allows us to aggregate statistics across the variational parameters, yielding more robust estimates.\nFinally, we can directly compare our Bayesian adaptive rate to the single adaptive rate in [8].\nWe compared to the SVI schedule proposed in [1].\nThis is a Robbins-Monro schedule\n\u03c1t = (t0 + t)\u2212\u03ba, we used \u03ba = 0.7; t0 = 1000 for LDA 
as these performed well in [1, 2, 8], and κ = 0.7, t0 = 0 for BMF, as in [3]. We also compared to the adaptive-SVI routine in [8]. Finally, we used an oracle method that picked the constant learning rate, from a grid of rates 10^{-k}, k ∈ {1, . . . , 5}, that gave the best final performance. In BMF, the Robbins-Monro SVI schedule learns a different rate for each row and column. All other methods computed a single rate.
Evaluation
In LDA, we evaluated the algorithms using the per-word ELBO, estimated on random sets of held-out documents. Each algorithm was given 0.5M documents and the final ELBO was averaged over the final 10% of the iterations. We computed statistical significance between the algorithms with a t-test on these noisy estimates of the ELBO. Our BMF datasets come from item recommendation problems, for which recall is a popular metric [27]. We computed recall at N by withholding a single 1-entry from each row during training. We then ranked that row's 0-entries by their posterior probability of being a 1 and computed the fraction of rows in which the held-out 1 appeared in the top N. We used a budget of 2 × 10^8 observations and computed statistical significance over 8 repeats of the experiment, including the random train/test split.
Results
The final performance levels on both tasks are plotted in Fig. 3. These plots show that, over the six datasets and two tasks, the Student's t variational filter is the strongest non-oracle method. SVI [1] and Adapt-SVI [8] come close on LDA, for which they were originally developed, but on the WebView and Kosarak binary matrices they yield substantially lower recall. In terms of the ELBO in BMF (not plotted), TVF was the best non-oracle method on WebView and Kosarak, and SVI was best on Netflix, with TVF second best. The Gaussian Kalman filter worked less well.
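To see how a Gaussian filter turns tracking into a learning rate, consider a generic scalar Kalman filter with random-walk dynamics. This is a minimal sketch of the idea, not the paper's exact parameterization; the function and variable names are our own illustration. The gain K acts as the step size, so underestimating the observation noise R keeps the effective learning rate large.

```python
def kalman_step(lam, lam_hat, P, Q, R):
    """One scalar Kalman-filter update for tracking a drifting optimum.

    lam     -- current estimate of a variational parameter
    lam_hat -- new noisy observation of the coordinate optimum
    P       -- posterior variance of the current estimate
    Q, R    -- assumed process and observation noise variances
    Returns (new estimate, new variance, gain K); the gain K plays
    the role of the SVI learning rate.
    """
    P_pred = P + Q                       # predict: random walk inflates variance
    K = P_pred / (P_pred + R)            # Kalman gain, always in (0, 1)
    lam_new = lam + K * (lam_hat - lam)  # convex combination, like an SVI step
    P_new = (1.0 - K) * P_pred           # observing shrinks the variance
    return lam_new, P_new, K
```

In this form, assuming Gaussian observation noise when the true noise is heavy-tailed amounts to underestimating R, which keeps the gain, and hence the learning rate, large.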
It produced high learning rates due to the inaccurate Gaussian noise assumption.
The t-distributed filter appears to be robust to highly non-Gaussian noise. It was even competitive with the oracle method (2 wins, 2 draws, 1 loss). Note that the oracle picked the best final performance at time T, but at t < T the variational filter converged faster, particularly in LDA. Fig. 4 (a) shows example learning curves on the ArXiv data. Although the oracle just outperforms TVF at 0.5M documents, TVF converged much faster. Fig. 4 (b) shows example learning curves in BMF on the WebView data. This figure shows that most of the BMF routines converge within the budget. Again, TVF not only reached the best solution, but also converged fastest.
Conclusions
We have presented a new perspective on SVI as approximate parallel coordinate ascent. With our model-based approach to this problem, we shift the requirement from hand-tuning optimization schedules to constructing an appropriate tracking model. This approach allows us to derive a new algorithm for robust SVI that uses a model with Student's t-distributed noise. This Student's t variational filtering algorithm performed strongly on two domains with completely different variational distributions. Variational filtering is a promising new direction for SVI.
Acknowledgements
NMTH is grateful to the Google European Doctoral Fellowship scheme for funding this research. DMB is supported by NSF CAREER NSF IIS-0745520, NSF BIGDATA NSF IIS-1247664, NSF NEURO NSF IIS-1009542, ONR N00014-11-1-0651 and DARPA FA8750-14-2-0009.
We thank James McInerney, Alp Kucukelbir, Stephan Mandt, Rajesh Ranganath, Maxim Rabinovich, David Duvenaud, Thang Bui and the anonymous reviewers for insightful feedback.

References
[1] M.D. Hoffman, D.M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14:1303–1347, 2013.
[2] M.D. Hoffman, D.M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. NIPS, 23:856–864, 2010.
[3] J.M. Hernandez-Lobato, N.M.T. Houlsby, and Z. Ghahramani. Stochastic inference for scalable probabilistic modeling of binary matrices. In ICML, 2014.
[4] P.K. Gopalan and D.M. Blei. Efficient discovery of overlapping communities in massive networks. PNAS, 110(36):14534–14539, 2013.
[5] J. Yin, Q. Ho, and E. Xing. A scalable approach to probabilistic latent space inference of large-scale networks. In NIPS, pages 422–430, 2013.
[6] J. Hensman, N. Fusi, and N.D. Lawrence. Gaussian processes for big data. CoRR, abs/1309.6835, 2013.
[7] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[8] R. Ranganath, C. Wang, D.M. Blei, and E.P. Xing. An adaptive learning rate for stochastic variational inference. In ICML, pages 298–306, 2013.
[9] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[10] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[11] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.
[12] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.
[13] B. Lakshminarayanan, G. Bouchard, and C. Archambeau. Robust Bayesian matrix factorisation. In AISTATS, pages 425–433, 2011.
[14] M. Roth, E. Ozkan, and F. Gustafsson. A Student's t filter for heavy tailed process and measurement noise. In ICASSP, pages 5770–5774. IEEE, 2013.
[15] T. Broderick, N. Boyd, A. Wibisono, A.C. Wilson, and M.I. Jordan. Streaming variational Bayes. In NIPS, pages 1727–1735, 2013.
[16] Z. Ghahramani and H. Attias. Online variational Bayesian learning. In Slides from talk presented at NIPS workshop on Online Learning, 2000.
[17] J.F.G. de Freitas, M. Niranjan, and A.H. Gee. Hierarchical Bayesian models for regularization in sequential learning. Neural Computation, 12(4):933–953, 2000.
[18] S.S. Haykin. Kalman filtering and neural networks. Wiley Online Library, 2001.
[19] E. Capobianco. Robust control methods for on-line statistical learning. EURASIP Journal on Advances in Signal Processing, (2):121–127, 2001.
[20] Y.T. Chien and K. Fu. On Bayesian learning and stochastic approximation. IEEE Transactions on Systems Science and Cybernetics, 3(1):28–38, 1967.
[21] A.P. George and W.B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167–198, 2006.
[22] J. Snoek, H. Larochelle, and R.P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2960–2968, 2012.
[23] P. Hennig and M. Kiefel. Quasi-Newton methods: a new direction. JMLR, 14(1):843–865, 2013.
[24] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[25] R. Kohavi, C.E. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: peeling the onion. ACM SIGKDD Explorations Newsletter, 2(2):86–93, 2000.
[26] S. Nakajima, M. Sugiyama, and R. Tomioka. Global analytic solution for variational Bayesian matrix factorization. NIPS, 23:1759–1767, 2010.
[27] A. Gunawardana and G. Shani. A survey of accuracy evaluation metrics of recommendation tasks. JMLR, 10:2935–2962, 2009.