{"title": "Sparse Variational Inference: Bayesian Coresets from Scratch", "book": "Advances in Neural Information Processing Systems", "page_first": 11461, "page_last": 11472, "abstract": "The proliferation of automated inference algorithms in Bayesian statistics has provided practitioners newfound access to fast, reproducible data analysis and powerful statistical models. Designing automated methods that are also both computationally scalable and theoretically sound, however, remains a significant challenge. Recent work on Bayesian coresets takes the approach of compressing the dataset before running a standard inference algorithm, providing both scalability and guarantees on posterior approximation error. But the automation of past coreset methods is limited because they depend on the availability of a reasonable coarse posterior approximation, which is difficult to specify in practice. In the present work we remove this requirement by formulating coreset construction as sparsity-constrained variational inference within an exponential family. This perspective leads to a novel construction via greedy optimization, and also provides a unifying information-geometric view of present and past methods. The proposed Riemannian coreset construction algorithm is fully automated, requiring no problem-specific inputs aside from the probabilistic model and dataset. 
In addition to being significantly easier to use than past methods, experiments demonstrate that past coreset constructions are fundamentally limited by the fixed coarse posterior approximation; in contrast, the proposed algorithm is able to continually improve the coreset, providing state-of-the-art Bayesian dataset summarization with orders-of-magnitude reduction in KL divergence to the exact posterior.", "full_text": "Sparse Variational Inference: Bayesian Coresets from Scratch

Trevor Campbell
Department of Statistics
University of British Columbia
Vancouver, BC V6T 1Z4
trevor@stat.ubc.ca

Boyan Beronov
Department of Computer Science
University of British Columbia
Vancouver, BC V6T 1Z4
beronov@cs.ubc.ca

Abstract

The proliferation of automated inference algorithms in Bayesian statistics has provided practitioners newfound access to fast, reproducible data analysis and powerful statistical models. Designing automated methods that are also both computationally scalable and theoretically sound, however, remains a significant challenge. Recent work on Bayesian coresets takes the approach of compressing the dataset before running a standard inference algorithm, providing both scalability and guarantees on posterior approximation error. But the automation of past coreset methods is limited because they depend on the availability of a reasonable coarse posterior approximation, which is difficult to specify in practice. In the present work we remove this requirement by formulating coreset construction as sparsity-constrained variational inference within an exponential family. This perspective leads to a novel construction via greedy optimization, and also provides a unifying information-geometric view of present and past methods. The proposed Riemannian coreset construction algorithm is fully automated, requiring no problem-specific inputs aside from the probabilistic model and dataset. 
In addition to being significantly easier to use than past methods, experiments demonstrate that past coreset constructions are fundamentally limited by the fixed coarse posterior approximation; in contrast, the proposed algorithm is able to continually improve the coreset, providing state-of-the-art Bayesian dataset summarization with orders-of-magnitude reduction in KL divergence to the exact posterior.

1 Introduction

Bayesian statistical models are powerful tools for learning from data, with the ability to encode complex hierarchical dependence and domain expertise, as well as coherently quantify uncertainty in latent parameters. In practice, however, exact Bayesian inference is typically intractable, and we must use approximate inference algorithms such as Markov chain Monte Carlo (MCMC) [1; 2, Ch. 11,12] and variational inference (VI) [3, 4]. Until recently, implementations of these methods were created on a per-model basis, requiring expert input to design the MCMC transition kernels or derive VI gradient updates. But developments in automated tools—e.g., automatic differentiation [5, 6], “black-box” gradient estimates [7], and Hamiltonian transition kernels [8, 9]—have obviated much of this expert input, greatly expanding the repertoire of Bayesian models accessible to practitioners.

In modern data analysis problems, automation alone is insufficient; inference algorithms must also be computationally scalable—to handle the ever-growing size of datasets—and provide theoretical guarantees on the quality of their output such that statistical practitioners may confidently use them in failure-sensitive settings. Here the standard set of tools falls short. 
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Designing correct MCMC schemes in the large-scale data setting is a challenging, problem-specific task [10–12]; and despite recent results in asymptotic theory [13–16], it is difficult to assess the effect of the variational family on VI approximations for finite data, where a poor choice can result in severe underestimation of posterior uncertainty [17, Ch. 21]. Other scalable Bayesian inference algorithms have largely been developed by modifying standard inference algorithms to handle distributed or streaming data processing [10, 11, 18–29]; these tend to have no guarantees on inferential quality and require extensive model-specific expert tuning.

Bayesian coresets (“core of a dataset”) [30–32] are an alternative approach—based on the notion that large datasets often contain a significant fraction of redundant data—that summarizes and sparsifies the data as a preprocessing step before running a standard inference algorithm such as MCMC or VI. In contrast to other large-scale inference techniques, Bayesian coreset construction is computationally inexpensive, simple to implement, and provides theoretical guarantees relating coreset size to posterior approximation quality. However, state-of-the-art algorithms formulate coreset construction as a sparse regression problem in a Hilbert space, which involves the choice of a weighted L2 inner product [31]. If left to the user, the choice of weighting distribution significantly reduces the overall automation of the approach; and current methods for finding the weighting distribution programmatically are generally as expensive as posterior inference on the full dataset itself. 
Further, even if an appropriate inner product is specified, computing it exactly is typically intractable, requiring the use of finite-dimensional projections for approximation [31]. Although the problem in finite dimensions can be studied using well-known techniques from sparse regression, compressed sensing, random sketching, boosting, and greedy approximation [33–51], these projections incur an unknown error in the construction process in practice, and preclude asymptotic consistency as the coreset size grows.

In this work, we provide a new formulation of coreset construction as exponential family variational inference with a sparsity constraint. The fact that coresets form a sparse subset of an exponential family is crucial in two regards. First, it enables tractable unbiased Kullback-Leibler (KL) divergence gradient estimation, which is used in the development of a novel coreset construction algorithm based on greedy optimization. In contrast to past work, this algorithm is fully automated, with no problem-specific inputs aside from the probabilistic model and dataset. Second, it provides a unifying view and strong theoretical underpinnings of both the present and past coreset constructions through Riemannian information geometry. In particular, past methods are shown to operate in a single tangent space of the coreset manifold; our experiments show that this fundamentally limits the quality of the coreset constructed with these methods. In contrast, the proposed method proceeds along the manifold towards the posterior target, and is able to continually improve its approximation. Furthermore, new relationships between the optimization objective of past approaches and the coreset posterior KL divergence are derived. 
The paper concludes with experiments demonstrating that, compared with past methods, Riemannian coreset construction is both easier to use and provides orders-of-magnitude reduction in KL divergence to the exact posterior.

2 Background

In the problem setting of the present paper, we are given a probability density π(θ) for variables θ ∈ Θ that decomposes into N potentials (f_n(θ))_{n=1}^N and a base density π_0(θ),

    π(θ) := (1/Z) exp( Σ_{n=1}^N f_n(θ) ) π_0(θ),    (1)

where Z is the (unknown) normalization constant. Such distributions arise frequently in a number of scenarios: for example, in Bayesian statistical inference problems with conditionally independent data given θ, the functions f_n are the log-likelihood terms for the N data points, π_0 is the prior density, and π is the posterior; or in undirected graphical models, the functions f_n and log π_0 might represent N + 1 potentials. The algorithms and analysis in the present work are agnostic to their particular meaning, but for clarity we will focus on the setting of Bayesian inference throughout.

As it is often intractable to compute expectations under π exactly, practitioners have turned to approximate algorithms. Markov chain Monte Carlo (MCMC) methods [1, 8, 9], which return approximate samples from π, remain the gold standard for this purpose. But since each sample typically requires at least one evaluation of a function proportional to π with computational cost Θ(N), in the large-N setting it is expensive to obtain sufficiently many samples to provide high confidence in empirical estimates. To reduce the cost of MCMC, we can instead run it on a small, weighted subset of data known as a Bayesian coreset [30], a concept originating from the computational geometry and optimization literature [52–57]. 
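To make the decomposition in Eq. (1) concrete, here is a minimal numpy sketch (not code from the paper) assuming a toy Gaussian location model, where each f_n is the log-likelihood of one observation:

```python
import numpy as np

def log_joint_unnormalized(theta, x, sigma=1.0, sigma0=1.0):
    """Unnormalized log pi(theta): sum of potentials f_n plus log pi_0, as in Eq. (1).

    Assumed toy model (not the paper's): x_n ~ N(theta, sigma^2) with a
    N(0, sigma0^2) prior; additive constants (and hence Z) are dropped.
    """
    f = -0.5 * (x - theta) ** 2 / sigma ** 2   # potentials f_n(theta), n = 1..N
    log_prior = -0.5 * theta ** 2 / sigma0 ** 2
    return f.sum() + log_prior

x = np.array([0.5, -0.2, 1.1])
lp = log_joint_unnormalized(0.0, x)   # unnormalized: Z is unknown, as in Eq. (1)
```

A coreset replaces the full sum over N potentials with a small weighted sum, which is the approximation developed next.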
Let w ∈ R^N_{≥0} be a sparse vector of nonnegative weights such that only M ≪ N are nonzero, i.e. ‖w‖_0 := Σ_{n=1}^N 1[w_n > 0] ≤ M. Then we approximate the full log-density with a w-reweighted sum with normalization Z(w) > 0 and run MCMC on the approximation¹,

    π_w(θ) := (1/Z(w)) exp( Σ_{n=1}^N w_n f_n(θ) ) π_0(θ),    (2)

where π_1 = π corresponds to the full density. If M ≪ N, evaluating a function proportional to π_w is much less expensive than doing so for the original π, resulting in a significant reduction in MCMC computation time. The major challenge posed by this approach, then, is to find a set of weights w that renders π_w as close as possible to π while maintaining sparsity. Past work [31, 32] formulated this as a sparse regression problem in a Hilbert space with the L2(π̂) norm for some weighting distribution π̂ and vectors² g_n := (f_n − E_π̂[f_n]),

    w⋆ = arg min_{w ∈ R^N} E_π̂[ ( Σ_{n=1}^N g_n − Σ_{n=1}^N w_n g_n )² ]    s.t. w ≥ 0, ‖w‖_0 ≤ M.    (3)

As the expectation is generally intractable to compute exactly, a Monte Carlo approximation is used in its place: taking samples (θ_s)_{s=1}^S i.i.d. ∼ π̂ and setting ĝ_n = S^{−1/2} [g_n(θ_1) − ḡ_n, …, g_n(θ_S) − ḡ_n]ᵀ ∈ R^S, where ḡ_n = (1/S) Σ_{s=1}^S g_n(θ_s), yields a linear finite-dimensional sparse regression problem in R^S,

    w⋆ = arg min_{w ∈ R^N} ‖ Σ_{n=1}^N ĝ_n − Σ_{n=1}^N w_n ĝ_n ‖²_2    s.t. w ≥ 0, ‖w‖_0 ≤ M,    (4)

which can be solved with sparse optimization techniques [31, 32, 34–36, 45, 46, 48, 58–60]. However, there are two drawbacks inherent to the Hilbert space formulation. First, the use of the L2(π̂) norm requires the selection of the weighting function π̂, posing a barrier to the full automation of coreset construction. There is currently no guidance in the literature on how to select π̂, or on the effect of different choices. We show in Sections 4 and 5 that using such a fixed weighting π̂ fundamentally limits the quality of coreset construction. Second, the inner products typically cannot be computed exactly, requiring a Monte Carlo approximation. This adds noise to the construction and precludes asymptotic consistency (in the sense that π_w ↛ π_1 as the sparsity budget M → ∞). Addressing these drawbacks is the focus of the present work.

3 Bayesian coresets from scratch

In this section, we provide a new formulation of Bayesian coreset construction as variational inference over an exponential family with sparse natural parameters, and develop an iterative greedy algorithm for optimization.

3.1 Sparse exponential family variational inference

We formulate coreset construction as a sparse variational inference problem,

    w⋆ = arg min_{w ∈ R^N} D_KL(π_w ‖ π_1)    s.t. w ≥ 0, ‖w‖_0 ≤ M.    (5)

Expanding the objective and denoting expectations under π_w as E_w,

    D_KL(π_w ‖ π_1) = log Z(1) − log Z(w) − Σ_{n=1}^N (1 − w_n) E_w[f_n(θ)].    (6)

¹Throughout, [N] := {1, …, N}, 1 and 0 are the constant vectors of all 1s / 0s respectively (the dimension will be clear from context), 1_A is the indicator vector for A ⊆ [N], and 1_n is the indicator vector for n ∈ [N].
²In [31], the E_π̂[f_n] term was missing; it is necessary to account for the shift-invariance of potentials.

Eq. (6) illuminates the major challenges with the variational approach posed in Eq. (5). First, the normalization constant Z(w) of π_w—itself a function of the weights w—is unknown; typically, the form of the approximate distribution is known fully in variational inference. Second, even if the constant were known, computing the objective in Eq. (5) requires taking expectations under π_w, which is in general just as difficult as the original problem of sampling from the true posterior π_1.

Two key insights in this work address these issues and lead to both the development of a new coreset construction algorithm (Algorithm 1) and a more comprehensive understanding of the coreset construction literature (Section 4). First, the coresets form a sparse subset of an exponential family: the nonnegative weights form the natural parameter w ∈ R^N_{≥0}, the component potentials (f_n(θ))_{n=1}^N form the sufficient statistic, log Z(w) is the log partition function, and π_0 is the base density,

    π_w(θ) := exp( wᵀf(θ) − log Z(w) ) π_0(θ),    f(θ) := [ f_1(θ) … f_N(θ) ]ᵀ.    (7)

Using the well-known fact that the gradient of an exponential family log-partition function is the mean of the sufficient statistic, E_w[f(θ)] = ∇_w log Z(w), we can rewrite the optimization Eq. (5) as

    w⋆ = arg min_{w ∈ R^N} log Z(1) − log Z(w) − (1 − w)ᵀ∇_w log Z(w)    s.t. w ≥ 0, ‖w‖_0 ≤ M.    (8)

Taking the gradient of this objective function and noting again that, for an exponential family, the Hessian of the log-partition function log Z(w) is the covariance of the sufficient statistic,

    ∇_w D_KL(π_w ‖ π_1) = −∇²_w log Z(w)(1 − w) = −Cov_w[ f, fᵀ(1 − w) ],    (9)

where Cov_w denotes covariance under π_w. In other words, increasing the weight w_n by a small amount decreases D_KL(π_w ‖ π_1) by an amount proportional to the covariance of the nth potential f_n(θ) with the residual error Σ_{n=1}^N f_n(θ) − Σ_{n=1}^N w_n f_n(θ) under π_w. If required, it is not difficult to use the connection between derivatives of log Z(w) and moments of the sufficient statistic under π_w to derive 2nd and higher order derivatives of D_KL(π_w ‖ π_1).

This provides a natural tool for optimizing the coreset construction objective in Eq. (5)—Monte Carlo estimates of sufficient statistic moments—and enables coreset construction without both the problematic selection of a Hilbert space (i.e., π̂) and the finite-dimensional projection error of past approaches. But obtaining Monte Carlo estimates requires sampling from π_w; the second key insight in this work is that as long as we build up the sparse approximation w incrementally, the iterates will themselves be sparse. Therefore, using a standard Markov chain Monte Carlo algorithm [9] to obtain samples from π_w for gradient estimation is actually not expensive—with cost O(M) instead of O(N)—despite the potentially complicated form of π_w.

3.2 Greedy selection

One option to build up a coreset incrementally is to use a greedy approach (Algorithm 1) to select and subsequently reweight a single potential function at a time. 
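The covariance form of the KL gradient in Eq. (9) admits a simple Monte Carlo estimator. A minimal numpy sketch (hypothetical helper names; samples from π_w are assumed to come from an external MCMC routine):

```python
import numpy as np

def kl_grad_estimate(thetas, potentials, w):
    """Monte Carlo estimate of grad_w KL(pi_w || pi_1) = -Cov_w[f, f^T (1 - w)].

    thetas: (S, d) array of samples, assumed drawn from pi_w (e.g. by MCMC).
    potentials: hypothetical callback mapping thetas to an (S, N) array of f_n values.
    w: (N,) weight vector.
    """
    F = potentials(thetas)              # (S, N) potential evaluations
    G = F - F.mean(axis=0)              # centered: ghat_s = f(theta_s) - fbar
    S = G.shape[0]
    # -(1/S) sum_s ghat_s ghat_s^T (1 - w): sample covariance with the residual
    return -(G.T @ (G @ (1.0 - w))) / S
```

Note the cost is O(NS): the N×N covariance matrix is never formed, only its product with the residual direction (1 − w).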
For greedy selection, the naïve approach is to select the potential that provides the largest local decrease in KL divergence around the current weights w, i.e., selecting the potential with the largest covariance with the residual error per Eq. (9). However, since the weight w_{n⋆} will then be optimized over [0, ∞), the selection of the next potential to add should be invariant to scaling each potential f_n by any positive constant. Thus we propose the use of the correlation—rather than the covariance—between f_n and the residual error fᵀ(1 − w) as the selection criterion:

    n⋆ = arg max_{n ∈ [N]} { |Corr_w[f_n, fᵀ(1 − w)]|  if w_n > 0;  Corr_w[f_n, fᵀ(1 − w)]  if w_n = 0 }.    (10)

Although seemingly ad hoc, this modification will be placed on a solid information-geometric theoretical foundation in Proposition 1 (see also Eq. (34) in Appendix A). Note that since we do not have access to the exact correlations, we must use Monte Carlo estimates via sampling from π_w for greedy selection. Given S samples (θ_s)_{s=1}^S i.i.d. ∼ π_w, these are given by the N-dimensional vector

    Ĉorr = diag[ (1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ ]^{−1/2} ( (1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ )(1 − w),    ĝ_s := [ f_1(θ_s), …, f_N(θ_s) ]ᵀ − (1/S) Σ_{r=1}^S [ f_1(θ_r), …, f_N(θ_r) ]ᵀ,    (11)

where diag[·] returns a diagonal matrix with the same diagonal entries as its argument. The details of using the correlation estimate (Eq. (11)) in the greedy selection rule (Eq. (10)) to add points to the coreset are shown in lines 4–9 of Algorithm 1. Note that this computation has cost O(NS). 
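The correlation criterion of Eqs. (10) and (11) can be sketched as follows (a hypothetical helper operating on centered potential evaluations; not the paper's released code):

```python
import numpy as np

def greedy_select(G, w, active):
    """Pick the next potential via the correlation criterion of Eq. (10).

    G: (S, N) centered potential evaluations ghat_s (rows: samples from pi_w).
    w: (N,) current weights; active: boolean (N,) mask of indices with w_n > 0.
    """
    S = G.shape[0]
    cov_resid = G.T @ (G @ (1.0 - w)) / S      # Cov_w[f_n, f^T(1-w)] estimates
    sd = np.sqrt(np.mean(G ** 2, axis=0))      # sqrt of diag of sample covariance
    corr = cov_resid / np.where(sd > 0, sd, 1.0)
    # Eq. (10): absolute value only for potentials already in the coreset
    score = np.where(active, np.abs(corr), corr)
    return int(np.argmax(score))
```

Dividing by the per-potential standard deviation makes the score invariant to rescaling any f_n by a positive constant, which is the motivation given above for preferring correlation over covariance.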
If N is large enough that computing the entire vectors ĝ_s ∈ R^N is cost-prohibitive, one may instead compute ĝ_s in Eq. (11) only for indices in I ∪ U—where I = {n ∈ [N] : w_n > 0} is the set of active indices, and U is a uniformly selected subsample of U ∈ [N] indices—and perform greedy selection only within these indices.

3.3 Weight update

After selecting a new potential function n⋆, we add it to the active set of indices I ⊆ [N] and update the weights by optimizing

    w⋆ = arg min_{v ∈ R^N} D_KL(π_v ‖ π)    s.t. v ≥ 0, (1 − 1_I)ᵀv = 0.    (12)

In particular, we run T steps of generating S samples (θ_s)_{s=1}^S i.i.d. ∼ π_w, computing a Monte Carlo estimate D of the gradient ∇_w D_KL(π_w ‖ π_1) based on Eq. (9),

    D := −(1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ (1 − w) ∈ R^N,    ĝ_s as in Eq. (11),    (13)

and taking a stochastic gradient step w_n ← w_n − γ_t D_n at step t ∈ [T] for each n ∈ I, using a typical learning rate γ_t ∝ t⁻¹. The details of the weight update step are shown in lines 10–15 of Algorithm 1. As in the greedy selection step, the cost of each gradient step is O(NS), due to the ĝ_sᵀ1 term in the gradient. 
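The weight update loop of Eqs. (12) and (13) might look like the following sketch, assuming hypothetical `sample_from` and `potentials` callbacks (neither is an API from the paper's released code):

```python
import numpy as np

def update_weights(w, active, sample_from, potentials, T=100):
    """SGD on KL(pi_w || pi) over the active coreset weights, per Eqs. (12)-(13).

    sample_from(w): assumed routine returning samples from pi_w (e.g. MCMC).
    potentials(thetas): assumed callback returning an (S, N) array of f_n values.
    active: boolean (N,) mask for the index set I; inactive weights stay zero.
    """
    w = w.copy()
    for t in range(1, T + 1):
        thetas = sample_from(w)
        F = potentials(thetas)
        G = F - F.mean(axis=0)              # ghat_s vectors, as in Eq. (11)
        S = G.shape[0]
        D = -(G.T @ (G @ (1.0 - w))) / S    # gradient estimate, Eq. (13)
        w[active] -= (1.0 / t) * D[active]  # learning rate gamma_t proportional to 1/t
        np.maximum(w, 0.0, out=w)           # keep the iterate feasible (w >= 0)
    return w
```

The final clipping step is one simple way to maintain the nonnegativity constraint in Eq. (12) under stochastic steps.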
If N is large enough that this computation is cost-prohibitive, one can use ĝ_s computed only for indices in I ∪ U, where U is a uniformly selected subsample of U ∈ [N] indices.

4 The information geometry of coreset construction

The perspective of coresets as a sparse exponential family also enables the use of information geometry to derive a unifying connection between the variational formulation and previous constructions. In particular, the family of coreset posteriors defines a Riemannian statistical manifold M = {π_w}_{w ∈ R^N_{≥0}} with chart M → R^N_{≥0}, endowed with the Fisher information metric G [61, p. 33, 34],

    G(w) = ∫ π_w(θ) ∇_w log π_w(θ) ∇_w log π_w(θ)ᵀ dθ = ∇²_w log Z(w) = Cov_w[f].    (14)

For any differentiable curve γ : [0, 1] → R^N_{≥0}, the metric defines a notion of path length,

    L(γ) = ∫₀¹ √( (dγ(t)/dt)ᵀ G(γ(t)) (dγ(t)/dt) ) dt,    (15)

and a constant-speed curve of minimal length between any two points w, w′ ∈ R^N_{≥0} is referred to as a geodesic [61, Thm. 5.2]. The geodesics are the generalization of straight lines in Euclidean space to curved Riemannian manifolds, such as M. Using this information-geometric view, Proposition 1 shows that both Hilbert coreset construction (Eq. (3)) and the proposed greedy sparse variational inference procedure (Algorithm 1) attempt to directionally align the ŵ → w and ŵ → 1 geodesics on M for ŵ, w, 1 ∈ R^N_{≥0} (reference, coreset, and true posterior weights, respectively) as illustrated in Fig. 1. The key difference is that Hilbert coreset construction uses a fixed reference point ŵ—corresponding to π̂ in Eq. (3)—and thus operates entirely in a single tangent space of M, while the proposed greedy method uses ŵ = w and thus improves its tangent space approximation as the algorithm iterates. For this reason, we refer to the method in Section 3 as a Riemannian coreset construction algorithm. In addition to this unification of coreset construction methods, the geometric perspective also provides the means to show that the Hilbert coresets objective bounds the symmetrized coreset KL divergence D_KL(π_w ‖ π) + D_KL(π ‖ π_w) if the Riemannian metric does not vary too much, as shown in Proposition 2. Incidentally, Lemma 3 in Appendix A—which is used to prove Proposition 2—also provides a nonnegative unbiased estimate of the symmetrized coreset KL divergence, which may be used for performance monitoring in practice.

Algorithm 1 Greedy sparse stochastic variational inference
1: procedure SPARSEVI(f, π_0, S, T, (γ_t)_{t=1}^∞, M)
2:   w ← 0 ∈ R^N, I ← ∅
3:   for m = 1, …, M do
       ▷ Take S samples from the current coreset posterior approximation π_w
4:     (θ_s)_{s=1}^S i.i.d. ∼ π_w ∝ exp(wᵀf(θ)) π_0(θ)
       ▷ Compute the N-dimensional potential vector for each sample
5:     f̂_s ← f(θ_s) ∈ R^N for s ∈ [S], and f̄ ← (1/S) Σ_{s=1}^S f̂_s
6:     ĝ_s ← f̂_s − f̄ for s ∈ [S]
       ▷ Estimate correlations between the potentials and the residual error
7:     Ĉorr ← diag[ (1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ ]^{−1/2} ( (1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ )(1 − w) ∈ R^N
       ▷ Add the best next potential to the coreset
8:     n⋆ ← arg max_{n ∈ [N]} |Ĉorr_n| 1[n ∈ I] + Ĉorr_n 1[n ∉ I]
9:     I ← I ∪ {n⋆}
       ▷ Update all the active weights in I via stochastic gradient descent on D_KL(π_w ‖ π)
10:    for t = 1, …, T do
         ▷ Use samples from π_w to estimate the gradient
11:      (θ_s)_{s=1}^S i.i.d. ∼ π_w ∝ exp(wᵀf(θ)) π_0(θ)
12:      f̂_s ← f(θ_s) ∈ R^N for s ∈ [S], and f̄ ← (1/S) Σ_{s=1}^S f̂_s
13:      ĝ_s ← f̂_s − f̄ for s ∈ [S]
14:      D ← −(1/S) Σ_{s=1}^S ĝ_s ĝ_sᵀ (1 − w)
         ▷ Take a stochastic gradient step for active indices in I
15:      w ← w − γ_t I_I D, where I_I := Σ_{n ∈ I} 1_n 1_nᵀ is the diagonal indicator matrix for I
16:    end for
17:  end for
18:  return w
19: end procedure

Proposition 1. Suppose π̂ in Eq. (3) satisfies π̂ = π_ŵ for a set of weights ŵ ∈ R^N_{≥0}. 
For u, v ∈ R^N_{≥0}, let ξ_{u→v} denote the initial tangent of the u → v geodesic on M, and ⟨·, ·⟩_u denote the inner product under the Riemannian metric G(u) with induced norm ‖·‖_u. Then Hilbert coreset construction in Eq. (3) is equivalent to

    w⋆ = arg min_{w ∈ R^N} ‖ξ_{ŵ→1} − ξ_{ŵ→w}‖_ŵ    s.t. w ≥ 0, ‖w‖_0 ≤ M,    (16)

and each greedy selection step of Riemannian coreset construction in Eq. (10) is equivalent to

    n⋆ = arg min_{n ∈ [N], t_n ∈ R} ‖ξ_{w→1} − ξ_{w→w+t_n 1_n}‖_w    s.t. ∀ n ∉ I, t_n > 0.    (17)

Proposition 2. Suppose π̂ in Eq. (3) satisfies π̂ = π_ŵ for a set of weights ŵ ∈ R^N_{≥0}. Then if J_π̂(w) is the objective function in Eq. (3),

    D_KL(π ‖ π_w) + D_KL(π_w ‖ π) ≤ C_π̂(w) · J_π̂(w),    (18)

where C_π̂(w) := E_{U ∼ Unif[0,1]}[ λ_max( G(ŵ)^{−1/2} G((1 − U)w + U·1) G(ŵ)^{−1/2} ) ]. In particular, if ∇²_w log Z(w) is constant in w ∈ R^N_{≥0}, then C_π̂(w) = 1.

5 Experiments

In this section, we compare the quality of coresets constructed via the proposed SparseVI greedy coreset construction method, uniform random subsampling, and Hilbert coreset construction (GIGA [32]). 
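For context on the Hilbert-space baseline, the finite-dimensional projection underlying Eq. (4) can be sketched as follows (hypothetical helpers, assuming samples drawn from a chosen weighting distribution π̂; not the paper's released code):

```python
import numpy as np

def project_potentials(thetas, potentials):
    """Finite-dimensional projection behind Eq. (4): each f_n maps to ghat_n in R^S.

    thetas: samples assumed drawn from the weighting distribution pihat.
    potentials(thetas): assumed callback returning an (S, N) array of f_n values.
    Centering by the sample mean approximates the E_pihat[f_n] shift.
    """
    F = potentials(thetas)
    G = F - F.mean(axis=0)
    S = G.shape[0]
    return G.T / np.sqrt(S)     # (N, S): rows are the vectors ghat_n

def hilbert_objective(Ghat, w):
    """Sparse regression objective of Eq. (4): ||sum_n (1 - w_n) ghat_n||^2."""
    r = (1.0 - w) @ Ghat
    return float(r @ r)
```

Once the projection is fixed, minimizing `hilbert_objective` under the sparsity constraint is a standard finite-dimensional sparse regression problem; the fixed choice of π̂ is exactly the limitation the experiments below probe.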
In particular, for GIGA we used a 100-dimensional random projection generated from a Gaussian π̂ with two parametrizations: one with mean and covariance set using the moments of the exact posterior (Optimal), which is a benchmark but is not possible to achieve in practice; and one with mean and covariance uniformly distributed between the prior and the posterior with 75% relative noise added (Realistic) to simulate the choice of π̂ without exact posterior information. Experiments were performed on a machine with an Intel i7 8700K processor and 32GB memory; code is available at www.github.com/trevorcampbell/bayesian-coresets.

Figure 1: Information-geometric view of greedy coreset construction on the coreset manifold M. (1a): Hilbert coreset construction, with weighting distribution π_ŵ, full posterior π, coreset posterior π_w, and arrows denoting initial geodesic directions from ŵ towards new datapoints. (1b): Riemannian coreset construction, with the path of posterior approximations π_{w_t}, t = 0, …, 3, and arrows denoting initial geodesic directions towards new datapoints to add within each tangent plane.

Figure 2: (2a): Synthetic comparison of coreset construction methods. Solid lines show the median KL divergence over 10 trials, with 25th and 75th percentiles shown by shaded areas. (2b): 2D projection of coresets after 0, 1, 5, 20, 50, and 100 iterations via SparseVI. True/coreset posterior and 2σ-predictive ellipses are shown in black/blue respectively. Coreset points are black with radius denoting weight.

5.1 Synthetic Gaussian posterior inference

We first compared the coreset construction algorithms on a synthetic example involving posterior inference for the mean of a d-dimensional Gaussian with Gaussian observations,

    θ ∼ N(µ_0, Σ_0),    x_n | θ i.i.d. ∼ N(θ, Σ),    n = 1, …, N.    (19)

We selected this example because it decouples the evaluation of the coreset construction methods from the concerns of stochastic optimization and approximate posterior inference: the coreset posterior π_w is a Gaussian π_w = N(µ_w, Σ_w) with closed-form expressions for the parameters as well as covariance (see Appendix B for the derivation),

    Σ_w = ( Σ_0⁻¹ + Σ_{n=1}^N w_n Σ⁻¹ )⁻¹,    µ_w = Σ_w( Σ_0⁻¹µ_0 + Σ⁻¹ Σ_{n=1}^N w_n x_n ),    (20)

    Cov_w[f_n, f_m] = (1/2) tr(ΨᵀΨ) + ν_mᵀ Ψ ν_n,    (21)

where Σ = QQᵀ, ν_n := Q⁻¹(x_n − µ_w) and Ψ := Q⁻¹Σ_wQ⁻ᵀ. Thus the greedy selection and weight update can be performed without Monte Carlo estimation.

Figure 3: (3a): Coreset construction on regression of housing prices using radial basis functions in the UK Land Registry data. Solid lines show the median KL divergence over 10 trials, with 25th and 75th percentiles shown by shaded areas. (3b): Posterior mean contours with coresets of size 0–300 via SparseVI compared with the exact posterior. Posterior and final coreset highlighted in the top row. Coreset points are black with radius denoting weight.

We set Σ_0 = Σ = I, µ_0 = 0, d = 200, and N = 1,000. We used a learning rate of γ_t = t⁻¹, T = 100 weight update optimization iterations, and M = 200 greedy iterations, although note that this is an upper bound on the size of the coreset as the same data point may be selected multiple times. The results in Fig. 2 demonstrate that the use of a fixed weighting function π̂ (and thus, a fixed tangent plane on the coreset manifold) fundamentally limits the quality of coresets constructed via past algorithms. 
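As an aside, the closed-form quantities of Eq. (20) are straightforward to implement; a hypothetical numpy sketch, together with the Gaussian KL divergence used to evaluate coreset quality in this experiment:

```python
import numpy as np

def gaussian_coreset_posterior(w, X, mu0, Sigma0, Sigma):
    """Closed-form coreset posterior N(mu_w, Sigma_w) of Eq. (20), Gaussian-mean model.

    w: (N,) weights; X: (N, d) observations; helper names are assumptions,
    not identifiers from the paper's released code.
    """
    P0, P = np.linalg.inv(Sigma0), np.linalg.inv(Sigma)
    Sigma_w = np.linalg.inv(P0 + w.sum() * P)
    mu_w = Sigma_w @ (P0 @ mu0 + P @ (w @ X))
    return mu_w, Sigma_w

def gaussian_kl(mu_a, S_a, mu_b, S_b):
    """KL( N(mu_a, S_a) || N(mu_b, S_b) ), the evaluation metric for this example."""
    d = len(mu_a)
    S_b_inv = np.linalg.inv(S_b)
    dm = mu_b - mu_a
    return 0.5 * (np.trace(S_b_inv @ S_a) + dm @ S_b_inv @ dm - d
                  + np.log(np.linalg.det(S_b) / np.linalg.det(S_a)))
```

With w = 1 the coreset posterior coincides with the exact posterior and the KL divergence is zero, which is a useful sanity check when reproducing this experiment.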
In contrast, the proposed greedy algorithm is "manifold-aware" and is able to continually improve the approximation, resulting in orders-of-magnitude improvements in KL divergence to the true posterior.

5.2 Bayesian radial basis function regression

Next, we compared the coreset construction algorithms on Bayesian basis function regression for N = 10,000 records of house sale log-price yn ∈ R as a function of latitude / longitude coordinates xn ∈ R2 in the UK.3 The regression problem involved inference for the coefficients α ∈ RK of a linear combination of radial basis functions bk(x) = exp(−(x − µk)²/(2σk²)), k = 1, . . . , K:

$$y_n = b_n^T \alpha + \epsilon_n, \qquad \epsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad \alpha \sim \mathcal{N}(\mu_0, \sigma_0^2 I), \qquad b_n = [\, b_1(x_n)\ \cdots\ b_K(x_n)\,]^T. \tag{22}$$

We generated 50 basis functions for each of 6 scales σk ∈ {0.2, 0.4, 0.8, 1.2, 1.6, 2.0} by generating means µk uniformly from the data, and added one additional near-constant basis with scale 100 and mean corresponding to the mean latitude and longitude of the data. This resulted in K = 301 total basis functions and thus a 301-dimensional regression problem. We set the prior and noise parameters µ0, σ0², σ² equal to the empirical mean, second moment, and variance, respectively, of the prices paid (y1, . . . , yN) across the whole dataset. As in Section 5.1, the posterior and log-likelihood covariances are available in closed form, and all algorithmic steps can be performed without Monte Carlo. In particular, πw = N (µw, Σw), where (see Appendix B for the derivation)

$$\Sigma_w = \Big(\Sigma_0^{-1} + \sigma^{-2}\sum_{n=1}^{N} w_n b_n b_n^T\Big)^{-1}, \qquad \mu_w = \Sigma_w\Big(\Sigma_0^{-1}\mu_0 + \sigma^{-2}\sum_{n=1}^{N} w_n y_n b_n\Big), \tag{23}$$

$$\operatorname{Cov}_w[f_n, f_m] = \sigma^{-4}\Big(\nu_n \nu_m \beta_n^T \beta_m + \tfrac{1}{2}\big(\beta_n^T \beta_m\big)^2\Big), \tag{24}$$

where νn := yn − µwT bn, Σw = LLT, and βn := LT bn. We used a learning rate of γt = t−1, T = 100 optimization steps, and M = 300 greedy iterations, although again note that this is an upper bound on the coreset size.

3This dataset was constructed by merging housing prices from the UK land registry data https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads with latitude & longitude coordinates from the Geonames postal code data http://download.geonames.org/export/zip/.

Figure 4: The results of the logistic (4a) and Poisson (4b) regression experiments. Plots show the median KL divergence (estimated using the Laplace approximation [62] and normalized by its value for the prior) across 10 trials, with 25th and 75th percentiles shown by shaded areas. From top to bottom, (4a) shows the results for logistic regression on synthetic, chemical reactivities, and phishing websites data, while (4b) shows the results for Poisson regression on synthetic, bike trips, and airport delays data. See Appendix C for details.

The results in Fig. 3 generally align with those from the previous synthetic experiment. The proposed sparse variational inference formulation builds coresets of comparable quality to Hilbert coreset construction (when given the exact posterior for π̂) up to a size of about 150. Beyond this point, past methods become limited by their fixed tangent plane approximation, while the proposed method continues to improve.
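The closed-form quantities in equations (23) and (24) can likewise be sketched directly in NumPy; this is our own illustrative code under assumed toy sizes, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of the closed-form RBF-regression coreset posterior;
# sizes are shrunk for readability (paper: K=301, N=10000).
rng = np.random.default_rng(1)
K, N = 10, 50
sigma2, sigma02, mu0 = 1.0, 2.0, 0.5   # noise variance, prior variance, prior mean
B = rng.normal(size=(N, K))            # row n is b_n = [b_1(x_n) ... b_K(x_n)]
y = rng.normal(size=N)                 # responses y_n
w = rng.uniform(size=N)                # coreset weights

Sigma0_inv = np.eye(K) / sigma02       # prior covariance Sigma_0 = sigma_0^2 I

# Eq. (23): Sigma_w and mu_w
Sigma_w = np.linalg.inv(Sigma0_inv + (w[:, None] * B).T @ B / sigma2)
mu_w = Sigma_w @ (Sigma0_inv @ (mu0 * np.ones(K)) + (w * y) @ B / sigma2)

# Eq. (24): Cov_w[f_n, f_m] = sigma^{-4}(nu_n nu_m beta_n^T beta_m
#                                        + (beta_n^T beta_m)^2 / 2),
# with nu_n = y_n - mu_w^T b_n, Sigma_w = L L^T, beta_n = L^T b_n.
L = np.linalg.cholesky(Sigma_w)
nu = y - B @ mu_w                      # residuals nu_n
beta = B @ L                           # row n is beta_n

def cov_f(n, m):
    dot = beta[n] @ beta[m]
    return (nu[n] * nu[m] * dot + 0.5 * dot**2) / sigma2**2
```

Since every quantity above is a dense matrix product, the greedy selection and weight update in this experiment can indeed run without any Monte Carlo estimation.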
This experiment also highlights the sensitivity of past methods to the choice of π̂: uniform subsampling outperforms GIGA with a realistic choice of π̂.

5.3 Bayesian logistic and Poisson regression

Finally, we compared the methods on logistic and Poisson regression applied to six datasets (details may be found in Appendix C) with N = 500 and dimensions ranging from 2 to 15. We used M = 100 greedy iterations, S = 100 samples for Monte Carlo covariance estimation, and T = 500 optimization iterations with learning rate γt = 0.5t−1. Fig. 4 shows the results of this test, demonstrating that the proposed greedy sparse VI method successfully recovers a coreset with divergence from the exact posterior as low as or lower than GIGA, without having the benefit of a user-specified weighting function. Note that there is a computational price to pay for this level of automation; Fig. 5 in Appendix C shows that SparseVI is significantly slower than Hilbert coreset construction via GIGA [32], primarily due to the expensive gradient descent weight update. However, if we remove GIGA (Optimal) from consideration due to its unrealistic use of π̂ ≈ π1, SparseVI is the only practical coreset construction algorithm that appreciably reduces the KL divergence to the posterior for reasonable coreset sizes. We leave improvements to computational cost for future work.

6 Conclusion

This paper introduced sparse variational inference for Bayesian coreset construction. By exploiting the fact that coreset posteriors form an exponential family, we developed a greedy algorithm as well as a unifying Riemannian information-geometric view of present and past coreset constructions. Future work includes extending sparse VI to improved optimization techniques beyond greedy methods, and reducing computational cost.

Acknowledgments  T. Campbell and B.
Beronov are supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. T. Campbell is additionally supported by an NSERC Discovery Launch Supplement.

References

[1] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

[2] Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian data analysis. CRC Press, 3rd edition, 2013.

[3] Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[4] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[5] Atılım Güneş Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.

[6] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18:1–45, 2017.

[7] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

[8] Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov chain Monte Carlo, chapter 5. CRC Press, 2011.

[9] Matthew Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014.

[10] Rémi Bardenet, Arnaud Doucet, and Chris Holmes.
On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1–43, 2017.

[11] Steven Scott, Alexander Blocker, Fernando Bonassi, Hugh Chipman, Edward George, and Robert McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11:78–88, 2016.

[12] Michael Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In International Conference on Machine Learning, 2015.

[13] Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 2018 (to appear).

[14] Yixin Wang and David Blei. Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 0(0):1–15, 2018.

[15] Yun Yang, Debdeep Pati, and Anirban Bhattacharya. α-variational inference with statistical guarantees. The Annals of Statistics, 2018 (to appear).

[16] Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12:2995–3035, 2018.

[17] Kevin Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.

[18] Matthew Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14:1303–1347, 2013.

[19] Maxim Rabinovich, Elaine Angelino, and Michael Jordan. Variational consensus Monte Carlo. In Advances in Neural Information Processing Systems, 2015.

[20] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, 2013.

[21] Trevor Campbell, Julian Straub, John W. Fisher III, and Jonathan How.
Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, 2015.

[22] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.

[23] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.

[24] Rémi Bardenet, Arnaud Doucet, and Chris C Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning, pages 405–413, 2014.

[25] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 2014.

[26] Dougal Maclaurin and Ryan Adams. Firefly Monte Carlo: exact MCMC with subsets of data. In Conference on Uncertainty in Artificial Intelligence, 2014.

[27] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: scalable Bayes via barycenters of subset posteriors. In International Conference on Artificial Intelligence and Statistics, 2015.

[28] Reihaneh Entezari, Radu Craiu, and Jeffrey Rosenthal. Likelihood inflating sampling algorithm. arXiv:1605.02113, 2016.

[29] Elaine Angelino, Matthew Johnson, and Ryan Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(1–2):1–129, 2016.

[30] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.

[31] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15):1–38, 2019.

[32] Trevor Campbell and Tamara Broderick.
Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, 2018.

[33] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4), 2010.

[34] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

[35] Francesco Locatello, Michael Tschannen, Gunnar Rätsch, and Martin Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, 2017.

[36] Andrew Barron, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64–94, 2008.

[37] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence, 2010.

[38] Robert Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[39] Ferenc Huszar and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Uncertainty in Artificial Intelligence, 2012.

[40] Yoav Freund and Robert Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

[41] Emmanuel Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[42] Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[43] David Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[44] Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral. A survey of compressed sensing.
In Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybíral, editors, Compressed Sensing and its Applications: MATHEON Workshop 2013. Birkhäuser, 2015.

[45] Stéphane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[46] Sheng Chen, Stephen Billings, and Wan Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.

[47] Scott Chen, David Donoho, and Michael Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 1999.

[48] Joel Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[49] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.

[50] Leo Geppert, Katja Ickstadt, Alexander Munteanu, Jens Quedenfeld, and Christian Sohler. Random projections for Bayesian regression. Statistics and Computing, 27:79–101, 2017.

[51] Daniel Ahfock, William Astle, and Sylvia Richardson. Statistical properties of sketching algorithms. arXiv:1706.03665, 2017.

[52] Pankaj Agarwal, Sariel Har-Peled, and Kasturi Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.

[53] Michael Langberg and Leonard Schulman. Universal ε-approximators for integrals. In Proceedings of the 21st Annual ACM–SIAM Symposium on Discrete Algorithms, pages 598–607, 2010.

[54] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 569–578, 2011.

[55] Dan Feldman, Melanie Schmidt, and Christian Sohler.
Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th Annual ACM–SIAM Symposium on Discrete Algorithms, pages 1434–1453, 2013.

[56] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv:1703.06476, 2017.

[57] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv:1612.00889, 2016.

[58] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

[59] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35:110–119, 1986.

[60] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.

[61] Shun-ichi Amari. Information Geometry and its Applications. Springer, 2016.

[62] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.