{"title": "Nonparametric Bayesian inference on multivariate exponential families", "book": "Advances in Neural Information Processing Systems", "page_first": 2546, "page_last": 2554, "abstract": "We develop a model by choosing the maximum entropy distribution from the set of models satisfying certain smoothness and independence criteria; we show that inference on this model generalizes local kernel estimation to the context of Bayesian inference on stochastic processes. Our model enables Bayesian inference in contexts when standard techniques like Gaussian process inference are too expensive to apply. Exact inference on our model is possible for any likelihood function from the exponential family. Inference is then highly efficient, requiring only O(log N) time and O(N) space at run time. We demonstrate our algorithm on several problems and show quantifiable improvement in both speed and performance relative to models based on the Gaussian process.", "full_text": "Nonparametric Bayesian inference on multivariate\n\nexponential families\n\nWilliam Vega-Brown, Marek Doniec, and Nicholas Roy\n\nMassachusetts Institute of Technology\n\n{wrvb, doniec, nickroy}@csail.mit.edu\n\nCambridge, MA 02139\n\nAbstract\n\nWe develop a model by choosing the maximum entropy distribution from the\nset of models satisfying certain smoothness and independence criteria; we show\nthat inference on this model generalizes local kernel estimation to the context of\nBayesian inference on stochastic processes. Our model enables Bayesian infer-\nence in contexts when standard techniques like Gaussian process inference are\ntoo expensive to apply. Exact inference on our model is possible for any like-\nlihood function from the exponential family. Inference is then highly ef\ufb01cient,\nrequiring only O (log N ) time and O (N ) space at run time. 
We demonstrate our\nalgorithm on several problems and show quantifiable improvement in both speed\nand performance relative to models based on the Gaussian process.\n\n1 Introduction\n\nMany learning problems can be formulated in terms of inference on predictive stochastic models.\nThese models are distributions p(y|x) over possible observation values y drawn from some observation\nset Y, conditioned on a known input value x from an input set X. The supervised learning\nproblem is then to infer a distribution p(y|x\u2217,D) over possible observations for some target input\nx\u2217, given a sequence of N independent observations D = {(x1, y1), . . . , (xN , yN )}.\nIt is often convenient to associate latent parameters \u03b8 \u2208 \u0398 with each input x, where p(y|\u03b8) is a\nknown likelihood function. By inferring a distribution over target parameters \u03b8\u2217 associated with x\u2217,\nwe can infer a distribution over y.\n\np(y \\mid x^*, D) = \\int_{\\Theta} d\\theta^* \\, p(y \\mid \\theta^*) \\, p(\\theta^* \\mid x^*, D) \\qquad (1)\n\nFor instance, regression problems can be formulated as the inference of an unknown but deterministic\nunderlying function \u03b8(x) given noisy observations, so that p(y|x) = N (y; \u03b8(x), \u03c3^2), where\n\u03c3^2 is a known noise variance. If we can specify a joint prior over the parameters corresponding to\ndifferent inputs, we can infer p(\u03b8\u2217|x\u2217,D) using Bayes\u2019 rule.\n\np(\\theta^* \\mid x^*, D) \\propto \\int_{\\Theta^N} \\left[ \\prod_{i=1}^{N} d\\theta_i \\, p(y_i \\mid \\theta_i) \\right] p(\\theta_{1:N}, \\theta^* \\mid x^*, x_{1:N}) \\qquad (2)\n\nThe notation x1:N indicates the sample inputs {x1, . . . , xN}; this model is depicted graphically in\nfigure 1a. Although the choice of likelihood is often straightforward, specifying a prior can be more\ndifficult. 
Ideally, we want a prior which is expressive, in the sense that it can accurately capture all\nprior knowledge, and which permits ef\ufb01cient and accurate inference.\nA powerful motivating example for specifying problems in terms of generative models is the Gaus-\nsian process [1], which speci\ufb01es the prior p(\u03b81:N|x1:N ) as a multivariate Gaussian with a covariance\nparameterized by x1:N . This prior can express complex and subtle relationships between inputs and\n\n1\n\n\fxn\n\n\u03b8n\n\ni, j\n\nyn\n\n\u221e\n\nxn\n\nx\u2217\n\n\u03b8n\n\n\u03b8\u2217\n\nN\n\nyn\n\ny\u2217\n\n(a) Stochastic process\n\n(b) Inference model\n\nFigure 1: Figure 1a models any stochastic process with fully connected latent parameters. Figure 1b is our\napproximate model, used for inference; we assume that the latent parameters are independent given the target\nparameters. Shaded nodes are observed.\n\nobservations, and permits ef\ufb01cient exact inference for a Gaussian likelihood with known variance.\nExtensions exist to perform approximate inference with other likelihood functions [2, 3, 4, 5].\nHowever, the assumptions of the Gaussian process are not the only set of reasonable assumptions,\nand are not always appropriate. Very large datasets require complex sparsi\ufb01cation techniques to be\ncomputationally tractable [6]. Likelihood functions with many coupled parameters, such as the pa-\nrameters of a categorical distribution or of the covariance matrix of a multivariate Gaussian, require\nthe introduction of large numbers of latent variables which must be inferred approximately. As an\nexample, the generalized Wishart process developed by Wilson and Ghahramani [7] provides a dis-\ntribution over covariance matrices using a sum of Gaussian processes. 
Inference of the posterior\ndistribution over the covariance can only be performed approximately: no exact inference procedure\nis known.\nHistorically, an alternative approach to estimation has been to use local kernel estimation techniques\n[8, 9, 10], which are often developed from a weighted parameter likelihood of the form\np(\u03b8|D) = \u220f_i p(y_i|\u03b8)^{w_i}. Algorithms for determining the maximum likelihood parameters of such a model\nare easy to implement and very fast in practice; various techniques, such as dual trees [11] or the\nimproved fast Gauss transform [12], allow the computation of kernel estimates in logarithmic or\nconstant time. This choice of model is often principally motivated by the computational convenience\nof resulting algorithms. However, it is not clear how to perform Bayesian inference on such models.\nMost instantiations instead return a point estimate of the parameters.\nIn this paper, we bridge the gap between the local kernel estimators and Bayesian inference. Rather\nthan perform approximate inference on an exact generative model, we formulate a simplified model\nfor which we can efficiently perform exact inference. Our simplification is to choose the maximum\nentropy distribution from the set of models satisfying certain smoothness and independence criteria.\nWe then show that for any likelihood function in the exponential family, our process model\nhas a conjugate prior, which permits us to perform Bayesian inference in closed form. This motivates\nmany of the local kernel estimators from a Bayesian perspective, and generalizes them to new\nproblem domains. 
We demonstrate the usefulness of this model on multidimensional regression\nproblems with coupled observations and input-dependent noise, a setting which is difficult to model\nusing Gaussian process inference.\n\n2 The kernel process model\n\nGiven a likelihood function, a generative model can be specified by a prior p(\u03b81:N , \u03b8\u2217|x\u2217, x1:N ).\nFor almost all combinations of prior and likelihood, inference is analytically intractable. We relax\nthe requirement that the model be generative, and instead require only that the prior be well-formed\nfor a given query x\u2217. To facilitate inference, we make the strong assumption that the latent training\nparameters \u03b81:N are conditionally independent given the target parameters \u03b8\u2217.\n\np(\\theta_{1:N}, \\theta^* \\mid x_{1:N}, x^*) = \\left[ \\prod_{i=1}^{N} p(\\theta_i \\mid \\theta^*, x_i, x^*) \\right] p(\\theta^* \\mid x^*) \\qquad (3)\n\nThis restricted structure is depicted graphically in figure 1b. Essentially, we assume that interactions\nbetween latent parameters are unimportant relative to interactions between the latent and target\nparameters; this will allow us to build models based on exponential family likelihood functions\nwhich will permit exact inference. Note that models with this structure will not correspond exactly\nto probabilistic generative models; the prior distribution over the latent parameters associated with\nthe training inputs varies depending on the target input. Instead of approximating inference on our\nmodel, we make our approximation at the stage of model selection; in doing so, we enable fast,\nexact inference. Note that the class of models with the structure of equation (3) is quite rich, and\nas we demonstrate in section 3.2, performs well on many problems. 
We discuss in section 4 the\nramifications of this assumption and when it is appropriate.\nThis assumption is closely related to known techniques for sparsifying Gaussian process inference.\nQui\u00f1onero-Candela and Rasmussen [6] provide a summary of many sparsification techniques, and\ndescribe which correspond to generative models. One of the most successful sparsification techniques,\nthe fully independent training conditional approximation of Snelson [13], assumes all training\nexamples are independent given a specified set of inducing inputs. Our assumption extends this\nto the case of a single inducing input equal to the target input.\nNote that by marginalizing the parameters \u03b81:N , we can directly relate the observations y1:N to the\ntarget parameters \u03b8\u2217. Combining equations (2) and (3),\n\np(\\theta^* \\mid x^*, D) \\propto \\left[ \\prod_{i=1}^{N} \\int_{\\Theta} d\\theta_i \\, p(y_i \\mid \\theta_i) \\, p(\\theta_i \\mid \\theta^*, x_i, x^*) \\right] p(\\theta^* \\mid x^*) \\qquad (4)\n\nand marginalizing the latent parameters \u03b81:N , we observe that the posterior factors into a product\nover likelihoods p(yi|\u03b8\u2217, x, x\u2217) and a prior over \u03b8\u2217.\n\n= \\left[ \\prod_{i=1}^{N} p(y_i \\mid \\theta^*, x_i, x^*) \\right] p(\\theta^* \\mid x^*) \\qquad (5)\n\nNote that we can equivalently specify either p(\u03b8|\u03b8\u2217, x, x\u2217) or p(y|\u03b8\u2217, x, x\u2217), without loss of generality.\nIn other words, we can equivalently describe the interaction between input variables either\nin the likelihood function or in the prior.\n\n2.1 The extended likelihood function\nBy construction, we know the distribution p(yi|\u03b8i). After making the transformation to equation (5),\nmuch of the problem of model specification has shifted to specifying the distribution\np(yi|\u03b8\u2217, xi, x\u2217). 
We call this distribution the extended likelihood distribution. Intuitively, these distributions\nshould be related; if x\u2217 = xi, then we expect \u03b8i = \u03b8\u2217 and p(yi|\u03b8\u2217, xi, x\u2217) = p(yi|\u03b8i).\nWe therefore define the extended likelihood function in terms of the known likelihood p(yi|\u03b8i).\nTypically, we prefer smooth models: we expect similar inputs to lead to a similar distribution over\noutputs. In the absence of a smoothness constraint, any inference method can perform arbitrarily\npoorly [14]. However, the notion of smoothness is not well-defined in the context of probability\ndistributions. Denote g(yi) = p(yi|\u03b8\u2217, xi, x\u2217), and f (yi) = p(yi|\u03b8i). We can formalize a smooth\nmodel as one in which the information divergence of the likelihood distribution f from the extended\nlikelihood distribution g is bounded by some function \u03c1 : X \u00d7 X \u2192 R+.\n\nD_{KL}(g \\| f) \\le \\rho(x^*, x_i) \\qquad (6)\n\nSince the divergence is a premetric, \u03c1(\u00b7,\u00b7) must also satisfy the properties of a premetric: \u03c1(x, x) = 0 \u2200x\nand \u03c1(x1, x2) \u2265 0 \u2200x1, x2. For example, if X = R^n, we may draw an analogy to Lipschitz\ncontinuity and choose \u03c1(x1, x2) = K\u2016x1 \u2212 x2\u2016, with K a positive constant. The class of models\nwith bounded divergence has the property that g \u2192 f as x' \u2192 x, and it does so smoothly provided\n\u03c1(\u00b7,\u00b7) is smooth. Note that this bound is a constraint on possible g, not an objective to be minimized;\nin particular, we do not minimize the divergence between g and f to develop an approximation, as is\ncommon in the approximate inference literature. 
Note also that this constraint has a straightforward\ninformation-theoretic interpretation; \u03c1(x1, x2) is a bound on the amount of information we would\nlose if we were to assume an observation y1 were taken at x2 instead of at x1.\nThe assumptions of equations (3) and (6) define a class of models for a given likelihood function,\nbut are insufficient for specifying a well-defined prior. We therefore use the principle of maximum\nentropy and choose the maximum entropy distribution from among that class. In our attached supporting\nmaterial, we prove the following.\nTheorem 1 The maximum entropy distribution g satisfying D_{KL}(g \u2016 f) \u2264 \u03c1(x\u2217, x) has the form\n\ng(y) \\propto f(y)^{k(x^*, x)} \\qquad (7)\n\nwhere k : X \u00d7 X \u2192 [0, 1] is a kernel function which can be uniquely determined from \u03c1(\u00b7,\u00b7) and\nf (\u00b7).\n\nThere is an equivalence relationship between functions k(\u00b7,\u00b7) and \u03c1(\u00b7,\u00b7); as either is uniquely determined\nby the other, it may be more convenient to select a kernel function than a smoothness bound,\nand doing so implies no loss in generality or correctness. Note it is neither necessary nor sufficient\nthat the kernel function k(\u00b7,\u00b7) be positive definite. It is necessary only that k(x, x) = 1 \u2200x and that\nk(x, x') \u2208 [0, 1] \u2200x, x'. This includes the possibility of asymmetric kernel functions. 
We discuss\nin the attached supporting material the mapping between valid kernel functions k(\u00b7,\u00b7) and bounding\nfunctions \u03c1(\u00b7,\u00b7).\nIt follows from equation (7) that the maximum entropy distribution satisfying a bound of \u03c1(x, x\u2217)\non the divergence of the observation distribution p(y|\u03b8\u2217, x, x\u2217) from the known distribution\np(y|\u03b8, x, x\u2217) = p(y|\u03b8) is\n\np(y \\mid \\theta^*, x, x^*) \\propto p(y \\mid \\theta)^{k(x, x^*)}. \\qquad (8)\n\nBy combining equations (5) and (6), we can fully specify a stochastic model with a likelihood\np(y|\u03b8), a pointwise marginal prior p(\u03b8|x), and a kernel function k : X \u00d7 X \u2192 [0, 1]. To perform\ninference, we must evaluate\n\np(\\theta \\mid x, D) \\propto \\prod_{i=1}^{N} p(y_i \\mid \\theta)^{k(x, x_i)} \\, p(\\theta \\mid x) \\qquad (9)\n\nThis can be done in closed form if we can normalize the terms on the right side of the equality.\nIn certain limiting cases with uninformative priors, our model can be reduced to known frequentist\nestimators. For instance, if we employ an uninformative prior p(\u03b8|x) \u221d 1 and choose the maximum-likelihood\ntarget parameters \u03b8\u0302\u2217 = arg max p(\u03b8\u2217|x\u2217,D), we recover the weighted maximum-likelihood\nestimator, detailed by Wang [15]. If the function k(x, x') is local, in the sense that it goes\nto zero if the distance \u2016x \u2212 x'\u2016 is large, then choosing maximum likelihood parameter estimates for\nan uninformative prior gives the locally weighted maximum-likelihood estimator, described in the\ncontext of regression by Cleveland [16] and for generalized linear models by Tibshirani and Hastie\n[10]. However, our result is derived from a Bayesian interpretation of statistics, and we infer a full\ndistribution over the parameters; we are not limited to a point estimate. The distinction is of both\nacademic and practical interest; in addition to providing insight into the meaning of the weighting\nfunction and the validity of the inferred parameters, by inferring a posterior distribution we provide\na principled way to reason about our knowledge and to insert prior knowledge of the underlying\nprocess.\n\n2.2 Kernel inference on the exponential family\nEquation (8) is particularly useful if we choose our likelihood model p(y|\u03b8) from the exponential\nfamily.\n\np(y \\mid \\theta) = h(y) \\exp\\left( \\theta^\\top T(y) - A(\\theta) \\right) \\qquad (10)\n\nA member of an exponential family remains in the same family when raised to the power of\nk(x, xi). Because every exponential family has a conjugate prior, we may choose our point-wise\nprior p(\u03b8\u2217|x\u2217) to be conjugate to our chosen likelihood. We denote this conjugate prior p\u03c0(\u03c7, \u03bd),\nwhere \u03c7 and \u03bd are hyperparameters.\n\np(\\theta \\mid x^*) = p_\\pi(\\chi(x^*), \\nu(x^*)) = f(\\chi(x^*), \\nu(x^*)) \\exp\\left( \\theta \\cdot \\chi(x^*) - \\nu(x^*) A(\\theta) \\right) \\qquad (11)\n\nTherefore, our posterior as defined by equation (9) may be evaluated in closed form.\n\np(\\theta^* \\mid x^*, D) = p_\\pi\\left( \\sum_{i=1}^{N} k(x^*, x_i) T(y_i) + \\chi(x^*), \\; \\sum_{i=1}^{N} k(x^*, x_i) + \\nu(x^*) \\right) \\qquad (12)\n\nThe prior predictive distribution p(y|x) is given by\n\np(y \\mid x) = \\int p(y \\mid \\theta) \\, p_\\pi(\\theta \\mid \\chi(x^*), \\nu(x^*)) \\, d\\theta \\qquad (13)\n\n= h(y) \\frac{f(\\chi(x^*), \\nu(x^*))}{f(\\chi(x^*) + T(y), \\nu(x^*) + 1)} \\qquad (14)\n\nand the posterior predictive distribution is\n\np(y \\mid x^*, D) = h(y) \\frac{f\\left( \\sum_{i=1}^{N} k(x^*, x_i) T(y_i) + \\chi(x^*), \\; \\sum_{i=1}^{N} k(x^*, x_i) + \\nu(x^*) \\right)}{f\\left( \\sum_{i=1}^{N} k(x^*, x_i) T(y_i) + \\chi(x^*) + T(y), \\; \\sum_{i=1}^{N} k(x^*, x_i) + \\nu(x^*) + 1 \\right)} \\qquad (15)\n\nThis is a general formulation of the posterior distribution over the parameters of any likelihood\nmodel belonging to the exponential family. Note that given a function k(x\u2217, x), we may evaluate this\nposterior without sampling, in time linear in the number of samples. Moreover, for several choices\nof kernels the relevant sums can be evaluated in sub-linear time; a sum over squared exponential\nkernels, for instance, can be evaluated in logarithmic time.\n\n3 Local inference for multivariate Gaussian\n\nWe now discuss in detail the application of equation (12) to the case of a multivariate Gaussian\nlikelihood model with unknown mean \u00b5 and unknown covariance \u03a3.\n\np(y \\mid \\mu, \\Sigma) = \\mathcal{N}(y; \\mu, \\Sigma) \\qquad (16)\n\nWe present the conjugate prior, posterior, and predictive distributions without derivation; see [17],\nfor example, for a derivation. The conjugate prior for a multivariate Gaussian with unknown mean\nand covariance is the normal-inverse Wishart distribution, with hyperparameter functions \u00b50(x\u2217),\n\u03a8(x\u2217), \u03bd(x\u2217), and \u03bb(x\u2217).\n\np(\\mu, \\Sigma \\mid x^*) = \\mathcal{N}\\left( \\mu; \\mu_0(x^*), \\frac{\\Sigma}{\\lambda(x^*)} \\right) \\times \\mathcal{W}^{-1}\\left( \\Sigma; \\Psi(x^*), \\nu(x^*) \\right) \\qquad (17)\n\nThe hyperparameter functions have intuitive interpretations; \u00b50(x\u2217) is our initial belief of the mean\nfunction, while \u03bb(x\u2217) is our confidence in that belief, with \u03bb(x\u2217) = 0 indicating no confidence\nin the region near x\u2217, and \u03bb(x\u2217) \u2192 \u221e indicating a state of perfect knowledge. Likewise, \u03a8(x\u2217)\nindicates the expected covariance, and \u03bd(x\u2217) represents the confidence in that estimate, much like\n\u03bb. 
Given a dataset D, we can compute a posterior over the mean and covariance, represented by\nupdated parameters \u00b5(cid:48)\n\n\u03bd(cid:48)(x\u2217) = \u03bd(x\u2217) + k(x\u2217)\n\n0(x\u2217), \u03a8(cid:48)(x\u2217), \u03bb(cid:48)(x\u2217), and \u03bd(cid:48)(x\u2217).\n\u03bb(cid:48)(x\u2217) = \u03bb(x\u2217) + k(x\u2217)\n\u00b5(cid:48)\n0(x\u2217) =\n\n\u03bb(x\u2217)\u00b50(x\u2217) + y\n\u03bb(x\u2217) + k(x\u2217)\n\u03a8(cid:48)(x\u2217) = \u03a8(x\u2217) + S(x\u2217) +\n\nwhere\n\nk(x\u2217) =\n\nS(x\u2217) =\n\nE(x\u2217) =\n\ni=1\n\nN(cid:88)\nN(cid:88)\n(cid:16)\n\ni=1\n\n\u03bb(x\u2217)k(x\u2217)\n\u03bb(x\u2217) + k(x\u2217)\n\nE(x\u2217)\n\nN(cid:88)\n\nk(x\u2217, xi)\n\nk(x\u2217, xi)\n\n(cid:16)\n\n1\n\ni=1\n\nk(x\u2217)\n\ny(x\u2217) =\n\nk(x\u2217, xi)yi\n(cid:17)(cid:16)\n(cid:17)(cid:62)\nyi \u2212 y(x\u2217)\nyi \u2212 y(x\u2217)\n(cid:17)(cid:16)\n(cid:17)(cid:62)\ny(x\u2217) \u2212 \u00b50(x\u2217)\n\ny(x\u2217) \u2212 \u00b50(x\u2217)\n(cid:18)\n\n(cid:19)\n\n(18)\n\n(19)\n\n(20)\n\n(21)\n\nThe resulting posterior predictive distribution is a multivariate Student-t distribution.\n\np(y|x\u2217) = t\u03bd(cid:48)(x\u2217)\n\n\u00b5(cid:48)\n0(x\u2217),\n\n\u03bb(cid:48)(x\u2217) + 1\n\u03bb(cid:48)(x\u2217)\u03bd(cid:48)(x\u2217)\n\n\u03a8(cid:48)(x\u2217)\n\n3.1 Special cases\n\nTwo special cases of the multivariate Gaussian are worth mentioning. First, a \ufb01xed, known co-\nvariance \u03a3(x\u2217) can be described by the hyperparameters \u03a8(x\u2217) = lim\u03bd\u2192\u221e \u03a3(x\u2217)\n. 
The resulting\nposterior distribution is then\n\n\u03bd\n\n(cid:19)\n\np(\u00b5|x\u2217,D) = N\n\n\u00b5(cid:48)\n0,\n\n1\n\n\u03bb(cid:48)(x\u2217)\n\n\u03a3(x\u2217)\n\n(cid:18)\n\n5\n\n\fwith predictive distribution\n\np(\u00b5|x\u2217,D) = N\n\n(cid:18)\n\n\u00b5(cid:48)\n0,\n\n(cid:19)\n\n1 + \u03bb(cid:48)(x\u2217)\n\n\u03bb(cid:48)(x\u2217)\n\n\u03a3(x\u2217)\n\n(22)\n\nIn the limit as \u03bb goes to 0, when the prior is uninformative, the mean and mode of the predictive\ndistribution approaches the Nadaraya-Watson [8, 9] estimate.\n\n(cid:80)N\n(cid:80)N\ni=1 k(x\u2217, xi)yi\ni=1 k(x\u2217, xi)\n\n\u00b5NW (x\u2217) =\n\n(cid:18)\n(cid:32)\n\nN(cid:88)\n\nki\n\n(cid:19)\n(cid:33)\n\n(23)\n\n(24)\n\n(25)\n\n(26)\n\nThe complementary case of known mean \u00b5(x\u2217) and unknown covariance \u03a3(x\u2217) is described by the\nlimit \u03bb \u2192 \u221e. In this case, the posterior distribution is\n\np(\u03a3|x\u2217,D) = W\u22121\n\n\u03a8(x\u2217) +\n\nki(yi \u2212 \u00b5(x\u2217))(yi \u2212 \u00b5(x\u2217))(cid:62), \u03bb(x\u2217) +\n\nwith predictive distribution\n\np(y|x\u2217) = t\u03bd(cid:48)(x\u2217)\n\n\u00b5(x\u2217),\n\ni=1\n\ni=1\n\n1\n\n\u03bd(cid:48)(x\u2217)\n\n\u03a8(x\u2217) +\n\nN(cid:88)\n\ni=1\n\nki(yi \u2212 \u00b5(x\u2217))(yi \u2212 \u00b5(x\u2217))(cid:62)\n\nIn the limit as \u03bd goes to 0, the maximum likelihood covariance estimate is\n\n\u03a3ML(x\u2217) =\n\nki(yi \u2212 \u00b5(x\u2217))(yi \u2212 \u00b5(x\u2217))(cid:62)\n\nwhich is precisely the result of our prior work [18, 19]. 
In both cases, our method yields distributions\nover parameters, rather than point estimates; moreover, the use of Bayesian inference naturally\nhandles the case of limited or no available samples.\n\ni=1\n\n3.2 Experimental results\n\nWe evaluate our approach on several regression problems, and compare the results with alterna-\ntive nonparametric Bayesian models.\nIn all experiments, we use the squared-exponential kernel\n2(cid:107)y \u2212 y(cid:48)(cid:107)2). This function meets both the requirements of our algorithm and is\nk(y, y(cid:48)) = exp( c\npositive-de\ufb01nite and thus a suitable covariance function for models based on the Gaussian process.\nWe set the kernel scale c by maximum likelihood for each model.\nWe compare our approach to covariance prediction to the generalized Wishart process (GWP) of [7].\nFirst, we sample a synthetic dataset; the output is a two-dimensional observation set Y = R2, where\nsamples are drawn from a zero-mean normal distribution with a covariance that rotates over time.\n\n(cid:18)cos(t) \u2212 sin(t)\n\n(cid:19)(cid:18)4\n\nsin(t)\n\ncos(t)\n\n0\n\n(cid:19)(cid:18)cos(t) \u2212 sin(t)\n(cid:19)(cid:62)\n\nsin(t)\n\ncos(t)\n\n0\n10\n\n\u03a3(t) =\n\n(27)\n\nSecond, we predict the covariances of the returns on two currency exchanges\u2014the Euro to US dollar,\nand the Japanese yen to US dollar\u2014over the past four years. Following Wilson and Ghahramani, we\nde\ufb01ne a return as log( Pt+1\n), where Pt is the exchange rate on day t. Illustrative results are provided\nPt\nin \ufb01gure 2. 
To compare these results quantitatively, one natural measure is the mean of the logarithm\nof the likelihood of the predicted model given the data.\n\nN(cid:88)\n\nN(cid:88)\n\nN(cid:88)\n\ni=1\n\nMLL =\n\n1\nN\n\n(y(cid:62)\n\ni\n\n\u02c6\u03a3\n\n\u22121\ni yi + log det \u02c6\u03a3i)\n\n\u2212 1\n2\n\n(28)\n\nHere, \u02c6\u03a3i is the maximum likelihood covariance predicted for the ith sample.\nIn addition to how well our model describes the available data, we may also be interested in how\naccurately we recover the distribution used to generate the data. This is a measure of how closely\nthe inferred ellipses in \ufb01gure 2 approximate the true covariance ellipses. One measure of the quality\nof the inferred distribution is the KL divergence of the inferred distribution from the true distribution\n\n6\n\n\f(a) Synthetic periodic data\n\n(b) Exchange data\n\nFigure 2: Comparison of covariances predicted by our kernel inverse Wishart process and the generalized\nWishart process for the problems described in section 3.2. The true covariance used to generate data is provided\nfor comparison. The samples used are plotted so that the area of the circle is proportional to the weight assigned\nby the kernel. The kernel inverse Wishart process outperforms the generalized Wishart process, both in terms\nof the likelihood of the training data, and in terms of the divergence of the inferred distribution from the true\ndistribution.\n\nused to generate the data. Note we cannot evaluate this quantity on the exchange dataset, as we do\nnot know the true distribution. We present both the mean likelihood and the KL divergence of both\nalgorithms, along with running times, in table 1.\nBy both metrics, our algorithm outperforms the GWP by a signi\ufb01cant margin; the running time\nadvantage of kernel estimation over the GWP is even more dramatic. 
It is important to note that\nrunning times are dif\ufb01cult to compare, as they depend heavily on implementation and hardware\ndetails; the numbers reported should be considered qualitatively. Both algorithms were implemented\nin the MATLAB programming language, with the likelihood functions for the GWP implemented in\nheavily optimized c code in an effort to ensure a fair competition. Despite this, the GWP took over\na thousand times longer than our method to generate predictions.\n\nPeriodic\n\nExchange\n\nttr (s)\nkNIW 0.022\nGWP\n7.08\nkNIW 0.520\nGWP\n15.7\n\ntev (ms) MLL DKL (\u02c6p(cid:107)p)\n0.0138\n0.003\n0.135\n0.0248\n0.020\n1.708\n\n-10.43\n-19.79\n7.73\n7.56\n\n\u2014\n\u2014\n\nTable 1: Comparison of the performance of two models of covariance prediction, based on time required to\nmake predictions at evaluation, the mean log likelihood and the KL divergence between the predicted covariance\nand the ground truth covariance.\n\nWe next evaluate our approach on heteroscedastic regression problems. First, we generate 100\nsamples from the distribution described by Yuan and Wahba [20], which has mean \u00b5(x\u2217) =\n2 exp(\u221230(x\u2217 \u2212 0.25)2) + sin(\u03c0(x\u2217)2) and variance \u03c32(x\u2217) = exp(2 \u2217 sin(2\u03c0x\u2217)). Second,\nwe test on the motorcycle dataset of Silverman et al. [21]. We compare our approach to a variety\nof Gaussian process based regression algorithms, including a standard homoscedastic Gaussian pro-\ncess, the variational heteroscedastic Gaussian process of L\u00b4azaro-Gredilla and Titsias [4], and the\nmaximum likelihood heteroscedastic Gaussian process of Quadrianto et al. [22]. All algorithms are\nimplemented in MATLAB, using the authors\u2019 own code. 
Running times are presented with the same\ncaveat as in the previous experiments, and a similar conclusion holds: our method provides results\nwhich are as good or better than methods based upon the Gaussian process, and does so in a fraction\nof the time. Figure 3 illustrates the predictions made by our method on the heteroscedastic motorcycle\ndataset of Silverman.\n\n[Figure 2 plot labels: (a) axes x1, x2 with panels \u03b1 = 0.83, 1.20, 1.57, 1.94, 2.31 and legend Ground Truth, Inverse Wishart, Generalised Wishart Process; (b) axes EUR/USD, JPY/USD with panels 2010/10/29, 2011/8/17, 2012/6/4, 2013/3/22, 2014/1/10 and legend Inverse Wishart, Generalised Wishart Process.]\n\n(a) kNIW\n\n(b) VHGP\n\nFigure 3: Comparison of the distributions inferred using the kernel normal inverse Wishart process and the\nvariational heteroscedastic Gaussian process to model Silverman\u2019s motorcycle dataset. Both models capture\nthe time-varying nature of the measurement noise; as is typical, the kernel model is much less smooth and has\nmore local structure than the Gaussian process model. Both models perform well according to most metrics,\nbut the kernel model can be computed in a fraction of the time.\n\n
For reference, we provide the distribution generated by the variational\nheteroscedastic Gaussian process.\n\nDataset     Model   ttr (s)  tev (ms)  NMSE    MLL\nMotorcycle  kNIW    0.124    2.95      0.2     -4.04\nMotorcycle  GP      0.52     3.52      0.202   -4.51\nMotorcycle  VHGP    3.12     7.53      0.202   -4.07\nMotorcycle  MLHGP   2.39     5.83      0.204   -4.03\nPeriodic    kNIW    0.68     7.94      0.0708  -2.07\nPeriodic    GP      3.41     22        0.0822  -2.56\nPeriodic    VHGP    26.4     54.4      0.0827  -1.85\nPeriodic    MLHGP   38.3     29.1      0.0827  -2.38\n\nTable 2: Comparison of the performance of various models of heteroscedastic processes, based on time required\nto train, time required to make predictions at evaluation, the normalized mean squared error, and the\nmean log likelihood. Note how the normal-inverse Wishart process obtains performance as good or better than\nthe other algorithms in a fraction of the time.\n\n4 Discussion\n\nWe have presented a family of stochastic models which permit exact inference for any likelihood\nfunction from the exponential family. Algorithms for performing inference on this model include\nmany local kernel estimators, and extend them to probabilistic contexts. We showed the instantiation\nof our model for a multivariate Gaussian likelihood; due to lack of space, we do not present others,\nbut the approach is easily extended to tasks like classification and counting. The models we develop\nare built on a strong assumption of independence; this assumption is critical to enabling efficient\nexact inference. We now explore the costs of this assumption, and when it is inappropriate.\nFirst, while the kernel function in our model does not need to be positive definite, or even\nsymmetric, we lose an important degree of flexibility relative to the covariance functions employed\nin a Gaussian process. Covariance functions can express a number of complex concepts, such as a\nprior over functions with a specified additive or hierarchical structure [23]; these concepts cannot\nbe easily formulated in terms of smoothness. 
Second, by neglecting the relationships between latent\nparameters, we lose the ability to extrapolate trends in the data, meaning that in places where data\nis sparse we cannot expect good performance. Thus, for a problem like time series forecasting, our\napproach will likely be unsuccessful. Our approach is suitable in situations where we are likely to\nsee similar inputs many times, which is often the case. Moreover, regardless of the family of models\nused, extrapolation to regions of sparse data can perform very poorly if the prior does not model the\ntrue process well. Our approach is particularly effective when data is readily available, but computation\nis expensive; the gains in efficiency due to an independence assumption allow us to scale to\nmuch larger datasets, improving predictive performance with less design effort.\n\nAcknowledgements\n\nThis research was funded by the Office of Naval Research under contracts N00014-09-1-1052 and\nN00014-10-1-0936. The support of Behzad Kamgar-Parsi and Tom McKenna is gratefully acknowledged.\n\n[Figure 3 plot axes: t from 10 to 50, a from \u2212150 to 100.]\n\nReferences\n[1] C. E. Rasmussen and C. Williams, Gaussian processes for machine learning. Cambridge,\nMA: MIT Press, Apr. 2006, vol. 14, no. 2.\n\n[2] Q. Le, A. Smola, and S. Canu, \u201cHeteroscedastic Gaussian process regression,\u201d in Proc. ICML,\n2005, pp. 489\u2013496.\n\n[3] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, \u201cMost-Likely Heteroscedastic Gaussian\nProcess Regression,\u201d in Proc. ICML, Corvallis, OR, USA, June 2007, pp. 393\u2013400.\n\n[4] M. L\u00e1zaro-Gredilla and M. Titsias, \u201cVariational heteroscedastic Gaussian process regression,\u201d\nin Proc. ICML, 2011.\n\n[5] L. Shang and A. B. Chan, \u201cOn approximate inference for generalized gaussian process models,\u201d\narXiv preprint arXiv:1311.6371, 2013.\n\n[6] J. 
Qui\u00f1onero-Candela and C. Rasmussen, \u201cA unifying view of sparse approximate Gaussian process regression,\u201d The Journal of Machine Learning Research, vol. 6, pp. 1939\u20131959, 2005.\n[7] A. Wilson and Z. Ghahramani, \u201cGeneralised Wishart processes,\u201d in Proc. UAI, 2011, pp. 736\u2013744.\n[8] E. Nadaraya, \u201cOn estimating regression,\u201d Theory of Probability & Its Applications, vol. 9, no. 1, pp. 141\u2013143, 1964.\n[9] G. Watson, \u201cSmooth regression analysis,\u201d Sankhy\u0101: The Indian Journal of Statistics, Series A, vol. 26, no. 4, pp. 359\u2013372, 1964.\n[10] R. Tibshirani and T. Hastie, \u201cLocal likelihood estimation,\u201d Journal of the American Statistical Association, vol. 82, no. 398, pp. 559\u2013567, 1987.\n[11] A. G. Gray and A. W. Moore, \u201c\u2018N-body\u2019 problems in statistical learning,\u201d in NIPS, vol. 4. Citeseer, 2000, pp. 521\u2013527.\n[12] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, \u201cImproved fast Gauss transform and efficient kernel density estimation,\u201d in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 664\u2013671.\n[13] E. Snelson, \u201cFlexible and efficient Gaussian process models for machine learning,\u201d PhD thesis, University of London, 2007.\n[14] L. Gy\u00f6rfi, M. Kohler, A. Krzyzak, and H. Walk, A Distribution Free Theory of Nonparametric Regression. New York, NY: Springer, 2002.\n[15] S. Wang, \u201cMaximum weighted likelihood estimation,\u201d PhD thesis, University of British Columbia, 2001.\n[16] W. S. Cleveland, \u201cRobust locally weighted regression and smoothing scatterplots,\u201d Journal of the American Statistical Association, vol. 74, no. 368, pp. 829\u2013836, 1979.\n[17] K. Murphy, \u201cConjugate Bayesian analysis of the Gaussian distribution,\u201d 2007.\n[18] W. 
Vega-Brown, \u201cPredictive Parameter Estimation for Bayesian Filtering,\u201d SM Thesis, Massachusetts Institute of Technology, 2013.\n[19] W. Vega-Brown and N. Roy, \u201cCELLO-EM: Adaptive Sensor Models without Ground Truth,\u201d in Proc. IROS, Tokyo, Japan, 2013.\n[20] M. Yuan and G. Wahba, \u201cDoubly penalized likelihood estimator in heteroscedastic regression,\u201d Statistics & Probability Letters, vol. 69, no. 1, pp. 11\u201320, 2004.\n[21] B. W. Silverman et al., \u201cSome aspects of the spline smoothing approach to non-parametric regression curve fitting,\u201d Journal of the Royal Statistical Society, Series B, vol. 47, no. 1, pp. 1\u201352, 1985.\n[22] N. Quadrianto, K. Kersting, M. Reid, T. Caetano, and W. Buntine, \u201cMost-Likely Heteroscedastic Gaussian Process Regression,\u201d in Proc. ICDM, Miami, FL, USA, December 2009.\n[23] D. Duvenaud, H. Nickisch, and C. E. Rasmussen, \u201cAdditive Gaussian processes,\u201d in Advances in Neural Information Processing Systems 24, Granada, Spain, 2011, pp. 226\u2013234.\n", "award": [], "sourceid": 1323, "authors": [{"given_name": "William", "family_name": "Vega-Brown", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Marek", "family_name": "Doniec", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Nicholas", "family_name": "Roy", "institution": "Massachusetts Institute of Technology"}]}