{"title": "Kernel Bayes' Rule", "book": "Advances in Neural Information Processing Systems", "page_first": 1737, "page_last": 1745, "abstract": "A nonparametric kernel-based method for realizing Bayes' rule is proposed, based on kernel representations of probabilities in reproducing kernel Hilbert spaces. The prior and conditional probabilities are expressed as empirical kernel mean and covariance operators, respectively, and the kernel mean of the posterior distribution is computed in the form of a weighted sample. The kernel Bayes' rule can be applied to a wide variety of Bayesian inference problems: we demonstrate Bayesian computation without likelihood, and filtering with a nonparametric state-space model. A consistency rate for the posterior estimate is established.", "full_text": "Kernel Bayes\u2019 Rule\n\nKenji Fukumizu\n\nThe Institute of Statistical\n\nMathematics, Tokyo\n\nLe Song\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nArthur Gretton\nGatsby Unit, UCL\n\nMPI for Intelligent Systems\n\nfukumizu@ism.ac.jp\n\nlsong@cc.gatech.edu\n\narthur.gretton@gmail.com\n\nAbstract\n\nA nonparametric kernel-based method for realizing Bayes\u2019 rule is proposed, based\non kernel representations of probabilities in reproducing kernel Hilbert spaces.\nThe prior and conditional probabilities are expressed as empirical kernel mean\nand covariance operators, respectively, and the kernel mean of the posterior dis-\ntribution is computed in the form of a weighted sample. The kernel Bayes\u2019 rule\ncan be applied to a wide variety of Bayesian inference problems: we demonstrate\nBayesian computation without likelihood, and \ufb01ltering with a nonparametric state-\nspace model. 
A consistency rate for the posterior estimate is established.

1 Introduction

Kernel methods have long provided powerful tools for generalizing linear statistical approaches to nonlinear settings, through an embedding of the sample to a high dimensional feature space, namely a reproducing kernel Hilbert space (RKHS) [16]. The inner product between feature mappings need never be computed explicitly, but is given by a positive definite kernel function, which permits efficient computation without the need to deal explicitly with the feature representation. More recently, the mean of the RKHS feature map has been used to represent probability distributions, rather than mapping single points: we will refer to these representations of probability distributions as kernel means. With an appropriate choice of kernel, the feature mapping becomes rich enough that its expectation uniquely identifies the distribution: the associated RKHSs are termed characteristic [6, 7, 22]. Kernel means in characteristic RKHSs have been applied successfully in a number of statistical tasks, including the two sample problem [9], independence tests [10], and conditional independence tests [8]. An advantage of the kernel approach is that these tests apply immediately to any domain on which kernels may be defined.

We propose a general nonparametric framework for Bayesian inference, expressed entirely in terms of kernel means. The goal of Bayesian inference is to find the posterior of x given observation y:

    q(x|y) = p(y|x)π(x) / q_Y(y),    q_Y(y) = ∫ p(y|x)π(x) dμ_X(x),    (1)

where π(x) and p(y|x) are respectively the density function of the prior, and the conditional density or likelihood of y given x. In our framework, the posterior, prior, and likelihood are all expressed as kernel means: the update from prior to posterior is called the Kernel Bayes' Rule (KBR).
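Since Eq. (1) is the object every later estimator approximates, it helps to keep a closed-form instance at hand. The sketch below (our own illustrative numbers, not from the paper) evaluates Eq. (1) on a grid for a conjugate Gaussian model and checks it against the known posterior mean; the same kind of exact Gaussian ground truth is used in the comparisons of Sec. 4.2.

```python
import numpy as np

# Worked instance of Eq. (1): prior x ~ N(0, tau2), likelihood y|x ~ N(x, sigma2).
# The exact posterior is N(tau2*y/(tau2+sigma2), tau2*sigma2/(tau2+sigma2)); we
# confirm the posterior mean by evaluating Eq. (1) numerically on a grid.
tau2, sigma2, y = 1.0, 0.5, 0.8

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
prior = np.exp(-xs**2 / (2 * tau2)) / np.sqrt(2 * np.pi * tau2)          # pi(x)
lik = np.exp(-(y - xs)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)  # p(y|x)
qY = np.sum(lik * prior) * dx                 # q_Y(y): the normalizing integral
posterior = lik * prior / qY                  # q(x|y) on the grid

post_mean = np.sum(xs * posterior) * dx
print(post_mean, tau2 * y / (tau2 + sigma2))  # both ~ 0.5333
```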
To implement KBR, the kernel means are learned nonparametrically from training data: the prior and likelihood means are expressed in terms of samples from the prior and joint probabilities, and the posterior as a kernel mean of a weighted sample. The resulting updates are straightforward matrix operations. This leads to the main advantage of the KBR approach: in the absence of a specific parametric model or an analytic form for the prior and likelihood densities, we can still perform Bayesian inference by making sufficient observations on the system. Alternatively, we may have a parametric model, but it might be complex and require time-consuming sampling techniques for inference. By contrast, KBR is simple to implement, and is amenable to well-established approximation techniques which yield an overall computational cost linear in the training sample size [5]. We further establish the rate of consistency of the estimated posterior kernel mean to the true posterior, as a function of training sample size.

The proposed kernel realization of Bayes' rule is an extension of the approach used in [20] for state-space models. This earlier work applies a heuristic, however, in which the kernel mean of the previous hidden state and the observation are assumed to combine additively to update the hidden state estimate. More recently, a method for belief propagation using kernel means was proposed [18, 19]: unlike the present work, this directly estimates conditional densities, assuming the prior to be uniform. An alternative to kernel means would be to use nonparametric density estimates. Classical approaches include finite distribution estimates on a partitioned domain or kernel density estimation, which perform poorly on high dimensional data. Alternatively, direct estimates of the density ratio may be used in estimating the conditional p.d.f. [24].
By contrast with density estimation approaches, KBR makes it easy to compute posterior expectations (as an RKHS inner product) and to perform conditioning and marginalization, without requiring numerical integration.

2 Kernel expression of Bayes' rule

2.1 Positive definite kernel and probabilities

We begin with a review of some basic concepts and tools concerning statistics on RKHS [1, 3, 6, 7]. Given a set Ω, a (R-valued) positive definite kernel k on Ω is a symmetric kernel k : Ω × Ω → R such that Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0 for arbitrary points x_1, ..., x_n in Ω and real numbers c_1, ..., c_n. It is known [1] that a positive definite kernel on Ω uniquely defines a Hilbert space H (RKHS) consisting of functions on Ω, where ⟨f, k(·, x)⟩ = f(x) for any x ∈ Ω and f ∈ H (reproducing property).

Let (X, B_X, μ_X) and (Y, B_Y, μ_Y) be measure spaces, and (X, Y) be a random variable on X × Y with probability P. Throughout this paper, it is assumed that positive definite kernels on the measurable spaces are measurable and bounded, where boundedness is defined as sup_{x∈Ω} k(x, x) < ∞. Let k_X be a positive definite kernel on a measurable space (X, B_X), with RKHS H_X. The kernel mean m_X of X on H_X is defined by the mean of the H_X-valued random variable k_X(·, X), namely

    m_X = ∫ k_X(·, x) dP_X(x).    (2)

For notational simplicity, the dependence on k_X in m_X is not shown. Since the kernel mean depends only on the distribution of X (and the kernel), it may also be written m_{P_X}; we will use whichever of these equivalent notations is clearest in each context.
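As a quick numerical illustration of Eq. (2) (our own sketch, not from the paper; Gaussian kernel with unit bandwidth), the empirical kernel mean evaluated at a point x0 is the sample estimate of E[k(x0, X)], which has a closed form when the input distribution is Gaussian, so the √n-consistency mentioned below can be observed directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-d samples."""
    d2 = (np.asarray(A, float)[:, None] - np.asarray(B, float)[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

n = 20000
X = rng.normal(size=n)          # sample from P_X = N(0, 1)
x0 = 0.3

# Empirical kernel mean of Eq. (2): mhat_X = (1/n) sum_i k(., X_i).
# Evaluating it at x0 (reproducing property with f = k(., x0)) gives the
# sample estimate of E[k(x0, X)].
mhat_at_x0 = gauss_kernel([x0], X).mean()

# For X ~ N(0,1) and a unit-bandwidth Gaussian kernel, E[k(x0, X)] has the
# closed form exp(-x0^2/4)/sqrt(2), by the Gaussian convolution formula.
exact = np.exp(-x0 ** 2 / 4) / np.sqrt(2)
print(abs(mhat_at_x0 - exact))   # small, O(1/sqrt(n))
```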
From the reproducing property, we have

    ⟨f, m_X⟩ = E[f(X)]    (∀f ∈ H_X).

Let k_X and k_Y be positive definite kernels on X and Y with respective RKHSs H_X and H_Y. The (uncentered) covariance operator C_{YX} : H_X → H_Y is defined by the relation

    ⟨g, C_{YX} f⟩_{H_Y} = E[f(X)g(Y)]    ( = ⟨g ⊗ f, m_{(YX)}⟩_{H_Y ⊗ H_X} )    (∀f ∈ H_X, g ∈ H_Y).    (3)

It should be noted that C_{YX} is identified with the mean m_{(YX)} in the tensor product space H_Y ⊗ H_X, which is given by the product kernel k_Y k_X [1]. The identification is standard: the tensor product is isomorphic to the space of linear maps by the correspondence ψ ⊗ φ ↔ [h ↦ ψ⟨φ, h⟩]. We also define C_{XX} : H_X → H_X by ⟨f_2, C_{XX} f_1⟩ = E[f_2(X)f_1(X)] for any f_1, f_2 ∈ H_X.

We next introduce the notion of a characteristic RKHS, which is essential when using kernels to manipulate probability measures. A bounded measurable positive definite kernel k is called characteristic if E_{X∼P}[k(·, X)] = E_{X′∼Q}[k(·, X′)] implies P = Q: probabilities are uniquely determined by their kernel means [7, 22]. With this property, problems of statistical inference can be cast in terms of inference on the kernel means. A widely used characteristic kernel on R^m is the Gaussian kernel, exp(−‖x − y‖²/(2σ²)).

Empirical estimates of the kernel mean and covariance operator are straightforward to obtain. Given an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n) with law P, the empirical kernel mean and covariance operator are respectively

    m̂_X^{(n)} = (1/n) Σ_{i=1}^n k_X(·, X_i),    Ĉ_{YX}^{(n)} = (1/n) Σ_{i=1}^n k_Y(·, Y_i) ⊗ k_X(·, X_i),

where Ĉ_{YX}^{(n)} is written in the tensor product form. These are known to be √n-consistent in norm.

2.2 Kernel Bayes' rule

We now derive the kernel mean implementation of Bayes' rule.
Let Π be a prior distribution on X with p.d.f. π(x). In the following, Q and Q_Y denote the probabilities with p.d.f. q(x, y) = p(y|x)π(x) and q_Y(y) in Eq. (1), respectively. Our goal is to obtain an estimator of the kernel mean of the posterior, m_{Q_X|y} = ∫ k_X(·, x) q(x|y) dμ_X(x). The following theorem is fundamental in manipulating conditional probabilities with positive definite kernels.

Theorem 1 ([6]). If E[g(Y)|X = ·] ∈ H_X holds for g ∈ H_Y, then

    C_{XX} E[g(Y)|X = ·] = C_{XY} g.

If C_{XX} is injective, the above relation can be expressed as

    E[g(Y)|X = ·] = C_{XX}^{-1} C_{XY} g.    (4)

Using Eq. (4), we can obtain an expression for the kernel mean of Q_Y.

Theorem 2 ([20]). Assume C_{XX} is injective, and let m_Π and m_{Q_Y} be the kernel means of Π in H_X and Q_Y in H_Y, respectively. If m_Π ∈ R(C_{XX}) and E[g(Y)|X = ·] ∈ H_X for any g ∈ H_Y, then

    m_{Q_Y} = C_{YX} C_{XX}^{-1} m_Π.    (5)

As discussed in [20], the operator C_{YX} C_{XX}^{-1} implements forward filtering of the prior π with the conditional density p(y|x), as in Eq. (1). Note, however, that the assumptions E[g(Y)|X = ·] ∈ H_X and injectivity of C_{XX} may not hold in general; we can easily provide counterexamples. In the following, we nonetheless derive a population expression of Bayes' rule under these strong assumptions, use it as a prototype for an empirical estimator expressed in terms of Gram matrices, and prove its consistency subject to appropriate smoothness conditions on the distributions.

In deriving the kernel realization of Bayes' rule, we will also use Theorem 2 to obtain a kernel mean representation of the joint probability Q:

    m_Q = C_{(YX)X} C_{XX}^{-1} m_Π ∈ H_Y ⊗ H_X.    (6)

In the above equation, C_{(YX)X} is the covariance operator from H_X to H_Y ⊗ H_X with p.d.f. p̃((y, x), x′) = p(x, y) δ_x(x′), where δ_x(x′) is the point measure at x.

In many applications of Bayesian inference, the probability conditioned on a particular value should be computed. By plugging the point measure at x into Π in Eq. (5), we have a population expression

    E[k_Y(·, Y)|X = x] = C_{YX} C_{XX}^{-1} k_X(·, x),    (7)

which was used by [20, 18, 19] as the kernel mean of the conditional probability p(y|x). Let (Z, W) be a random variable on X × Y with law Q. Replacing P by Q and x by y in Eq. (7), we obtain

    E[k_X(·, Z)|W = y] = C_{ZW} C_{WW}^{-1} k_Y(·, y).    (8)

This is exactly the kernel mean of the posterior which we want to obtain. The next step is to derive the covariance operators in Eq. (8). Recalling that the mean m_Q = m_{(ZW)} ∈ H_X ⊗ H_Y can be identified with the covariance operator C_{ZW} : H_Y → H_X, and m_{(WW)} ∈ H_Y ⊗ H_Y with C_{WW}, we use Eq. (6) to obtain the operators in Eq. (8), and thus the kernel mean expression of Bayes' rule.

The above argument can be rigorously implemented for empirical estimates of the kernel means and covariances. Let (X_1, Y_1), ..., (X_n, Y_n) be an i.i.d. sample with law P, and assume a consistent estimator for m_Π given by

    m̂_Π^{(ℓ)} = Σ_{j=1}^ℓ γ_j k_X(·, U_j),

where U_1, ..., U_ℓ is the sample that defines the estimator (which need not be generated by Π), and γ_j are the weights. Negative values are allowed for γ_j. The empirical estimators for C_{ZW} and C_{WW} are identified with m̂_{(ZW)} and m̂_{(WW)}, respectively. From Eq. (6), they are given by

    m̂_Q = m̂_{(ZW)} = Ĉ_{(YX)X}^{(n)} (Ĉ_{XX}^{(n)} + ε_n I)^{-1} m̂_Π^{(ℓ)},    m̂_{(WW)} = Ĉ_{(YY)X}^{(n)} (Ĉ_{XX}^{(n)} + ε_n I)^{-1} m̂_Π^{(ℓ)},

where I is the identity and ε_n is the coefficient of Tikhonov regularization for operator inversion. The next two propositions express these estimators using Gram matrices. The proofs are simple matrix manipulations and are shown in the Supplementary material. In the following, G_X and G_Y denote the Gram matrices (k_X(X_i, X_j)) and (k_Y(Y_i, Y_j)), respectively.

Input: (i) {(X_i, Y_i)}_{i=1}^n: sample to express P. (ii) {(U_j, γ_j)}_{j=1}^ℓ: weighted sample to express the kernel mean of the prior m̂_Π. (iii) ε_n, δ_n: regularization constants.
Computation:
1. Compute Gram matrices G_X = (k_X(X_i, X_j)), G_Y = (k_Y(Y_i, Y_j)), and a vector m̂_Π = (Σ_{j=1}^ℓ γ_j k_X(X_i, U_j))_{i=1}^n ∈ R^n.
2. Compute μ̂ = n(G_X + nε_n I_n)^{-1} m̂_Π.
3. Compute R_{X|Y} = ΛG_Y((ΛG_Y)² + δ_n I_n)^{-1} Λ, where Λ = Diag(μ̂).
Output: n × n matrix R_{X|Y}.
Given conditioning value y, the kernel mean of the posterior q(x|y) is estimated by the weighted sample {(X_i, w_i)}_{i=1}^n with w = R_{X|Y} k_Y(y), where k_Y(y) = (k_Y(Y_i, y))_{i=1}^n.

Figure 1: Kernel Bayes' Rule Algorithm

Proposition 3. The Gram matrix expressions of Ĉ_{ZW} and Ĉ_{WW} are given by

    Ĉ_{ZW} = Σ_{i=1}^n μ̂_i k_X(·, X_i) ⊗ k_Y(·, Y_i)  and  Ĉ_{WW} = Σ_{i=1}^n μ̂_i k_Y(·, Y_i) ⊗ k_Y(·, Y_i),

respectively, where the common coefficient μ̂ ∈ R^n is

    μ̂ = n(G_X + nε_n I_n)^{-1} m̂_Π,    m̂_{Π,i} = m̂_Π(X_i) = Σ_{j=1}^ℓ γ_j k_X(X_i, U_j).    (9)

Prop. 3 implies that the probabilities Q and Q_Y are estimated by the weighted samples {((X_i, Y_i), μ̂_i)}_{i=1}^n and {(Y_i, μ̂_i)}_{i=1}^n, respectively, with common weights. Since the weights μ̂_i may be negative, we use another type of Tikhonov regularization in computing Eq. (8),

    m̂_{Q_X|y} := Ĉ_{ZW} (Ĉ_{WW}² + δ_n I)^{-1} Ĉ_{WW} k_Y(·, y).    (10)

Proposition 4. For any y ∈ Y, the Gram matrix expression of m̂_{Q_X|y} is given by

    m̂_{Q_X|y} = k_X^T R_{X|Y} k_Y(y),    R_{X|Y} := ΛG_Y((ΛG_Y)² + δ_n I_n)^{-1} Λ,    (11)

where Λ = Diag(μ̂) is a diagonal matrix with elements μ̂_i given by Eq. (9), k_X = (k_X(·, X_1), ..., k_X(·, X_n))^T ∈ H_X^n, and k_Y = (k_Y(·, Y_1), ..., k_Y(·, Y_n))^T ∈ H_Y^n.

We call Eq. (10) or (11) the kernel Bayes' rule (KBR): i.e., the expression of Bayes' rule entirely in terms of kernel means. The algorithm to implement KBR is summarized in Fig. 1. If our aim is to estimate E[f(Z)|W = y], that is, the expectation of a function f ∈ H_X with respect to the posterior, then based on Eq. (3) an estimator is given by

    ⟨f, m̂_{Q_X|y}⟩_{H_X} = f_X^T R_{X|Y} k_Y(y),    (12)

where f_X = (f(X_1), ..., f(X_n))^T ∈ R^n. In using a weighted sample to represent the posterior, KBR has some similarity to Monte Carlo methods such as importance sampling and sequential Monte Carlo ([4]). The KBR method, however, does not generate samples from the posterior, but updates the weights of a sample via matrix operations. We will provide experimental comparisons between KBR and sampling methods in Sec. 4.1.

2.3 Consistency of KBR estimator

We now demonstrate the consistency of the KBR estimator in Eq. (12). We show only the best rate that can be derived under the assumptions, and leave more detailed discussions and proofs to the Supplementary material. We assume that the sample size ℓ = ℓ_n for the prior goes to infinity as the sample size n for the likelihood goes to infinity, and that m̂_Π^{(ℓ_n)} is n^α-consistent. In the theoretical results, we assume all Hilbert spaces are separable. In the following, R(A) denotes the range of A.

Theorem 5. Let f ∈ H_X, (Z, W) be a random vector on X × Y such that its law is Q with p.d.f. p(y|x)π(x), and let m̂_Π^{(ℓ_n)} be an estimator of m_Π such that ‖m̂_Π^{(ℓ_n)} − m_Π‖_{H_X} = O_p(n^{−α}) as n → ∞ for some 0 < α ≤ 1/2. Assume that π/p_X ∈ R(C_{XX}^{1/2}), where p_X is the p.d.f. of P_X, and E[f(Z)|W = ·] ∈ R(C_{WW}²). For ε_n = n^{−(2/3)α} and δ_n = n^{−(8/27)α}, we have for any y ∈ Y

    f_X^T R_{X|Y} k_Y(y) − E[f(Z)|W = y] = O_p(n^{−(8/27)α})    (n → ∞),

where f_X^T R_{X|Y} k_Y(y) is the estimator of E[f(Z)|W = y] given by Eq. (12).

The condition π/p_X ∈ R(C_{XX}^{1/2}) requires the prior to be smooth. If ℓ_n = n, and if m̂_Π^{(n)} is a direct empirical kernel mean with an i.i.d. sample of size n from Π, typically α = 1/2 and the theorem implies n^{4/27}-consistency. While this might seem to be a slow rate, in practice the convergence may be much faster than the above theoretical guarantee.

3 Bayesian inference with Kernel Bayes' Rule

In Bayesian inference, tasks of interest include finding properties of the posterior (MAP value, moments), and computing the expectation of a function under the posterior. We now demonstrate the use of the kernel mean obtained via KBR in solving these problems.

First, we have already seen from Theorem 5 that we may obtain a consistent estimator under the posterior for the expectation of some f ∈ H_X. This covers a wide class of functions when characteristic kernels are used (see also experiments in Sec. 4.1).

Next, regarding a point estimate of x, [20] proposes to use the preimage x̂ = arg min_x ‖k_X(·, x) − k_X^T R_{X|Y} k_Y(y)‖²_{H_X}, which represents the posterior mean most effectively by one point. We use this approach in the present paper where point estimates are considered. In the case of the Gaussian
In the case of the Gaussian\nkernel, a \ufb01xed point method can be used to sequentially optimize x [13].\nIn KBR the prior and likelihood are expressed in terms of samples. Thus unlike many methods for\nBayesian inference, exact knowledge on their densities is not needed, once samples are obtained.\nThe following are typical situations where the KBR approach is advantageous:\n\u2022 The relation among variables is dif\ufb01cult to realize with a simple parametric model, however we\n\u2022 The p.d.f of the prior and/or likelihood is hard to obtain explicitly, but sampling is possible: (a) In\npopulation genetics, branching processes are used for the likelihood to model the split of species,\nfor which the explicit density is hard to obtain. Approximate Bayesian Computation (ABC)\nis a popular sampling method in these situations [25, 12, 17]. (b) In nonparametric Bayesian\ninference (e.g. [14]), the prior is typically given in the form of a process without a density.\nThe KBR approach can give alternative ways of Bayesian computation for these problems. We\nwill show some experimental comparisons between KBR approach and ABC in Sec. 4.2.\n\ncan obtain samples of the variables (e.g. nonparametric state-space model in Sec. 3).\n\n\u2022 If a standard sampling method such as MCMC or sequential MC is applicable, the computation\ngiven y may be time consuming, and real-time applications may not be feasible. Using KBR, the\nexpectation of the posterior given y is obtained simply by the inner product as in Eq. (12), once\nX RX|Y has been computed.\nf T\n\nThe KBR approach nonetheless has a weakness common to other nonparametric methods: if a new\ndata point appears far from the training sample, the reliability of the output will be low. Thus, we\nneed suf\ufb01cient diversity in training sample to reliably estimate the posterior.\n\nIn KBR computation, Gram matrix inversion is necessary, which would cost O(n3) for sample size n\nif attempted directly. 
Substantial cost reductions can be achieved by low rank matrix approximations such as the incomplete Cholesky decomposition [5], which approximates a Gram matrix in the form ΓΓ^T with an n × r matrix Γ. Computing Γ costs O(nr²), and with the Woodbury identity, KBR can be approximately computed with cost O(nr²).

Kernel choice or model selection is key to the effectiveness of KBR, as in other kernel methods. KBR involves three model parameters: the kernel (or its parameters), and the regularization parameters ε_n and δ_n. The strategy for parameter selection depends on how the posterior is to be used in the inference problem. If it is applied in a supervised setting, we can use standard cross-validation (CV). A more general approach requires constructing a related supervised problem. Suppose the prior is given by the marginal P_X of P. The posterior density q(x|y) averaged with P_Y is then equal to the marginal density p_X. We are then able to compare the discrepancy of the kernel mean of P_X and the average of the estimators Q̂_{X|y=Y_i} over Y_i. This leads to application of a K-fold CV approach. Namely, for a partition of {1, ..., n} into K disjoint subsets {T_a}_{a=1}^K, let m̂^{[−a]}_{Q_X|y} be the kernel mean of the posterior estimated with data {(X_i, Y_i)}_{i∉T_a}, and the prior mean m̂^{[−a]}_X with data {X_i}_{i∉T_a}. We use

    Σ_{a=1}^K ‖ (1/|T_a|) Σ_{j∈T_a} m̂^{[−a]}_{Q_X|y=Y_j} − m̂^{[a]}_X ‖²_{H_X}

for CV, where m̂^{[a]}_X = (1/|T_a|) Σ_{j∈T_a} k_X(·, X_j).

Application to nonparametric state-space model. Consider the state-space model

    p(X, Y) = π(X_1) Π_{t=1}^T p(Y_t|X_t) Π_{t=1}^{T−1} q(X_{t+1}|X_t),

where Y_t is observable and X_t is a hidden state. We do not assume the conditional probabilities p(Y_t|X_t) and q(X_{t+1}|X_t) to be known explicitly, nor do we estimate them with simple parametric models. Rather, we assume a sample (X_1, Y_1), ..., (X_{T+1}, Y_{T+1}) is given for both the observable and hidden variables in the training phase. This problem has already been considered in [20], but we give a more principled approach based on KBR. The conditional probability for the transition q(x_{t+1}|x_t) and the observation process p(y|x) are represented by the covariance operators computed with the training sample: Ĉ_{X,X+1} = (1/T) Σ_{i=1}^T k_X(·, X_i) ⊗ k_X(·, X_{i+1}), Ĉ_{XY} = (1/T) Σ_{i=1}^T k_X(·, X_i) ⊗ k_Y(·, Y_i), and Ĉ_{YY} and Ĉ_{XX} are defined similarly. Note that though the data are not i.i.d., consistency is achieved by the mixing property of the Markov model.

For simplicity, we focus on the filtering problem, but smoothing and prediction can be done similarly. In filtering, we wish to estimate the current hidden state x_t, given observations ỹ_1, ..., ỹ_t. The sequential estimate of p(x_t|ỹ_1, ..., ỹ_t) can be derived using KBR (we give only a sketch below; see the Supplementary material for the detailed derivation). Suppose we already have an estimator of the kernel mean of p(x_t|ỹ_1, ..., ỹ_t) in the form

    m̂_{x_t|ỹ_1,...,ỹ_t} = Σ_{i=1}^T α_i^{(t)} k_X(·, X_i),    (13)

where α_i^{(t)} = α_i^{(t)}(ỹ_1, ..., ỹ_t) are the coefficients at time t. By applying Theorem 2 twice, the kernel mean of p(y_{t+1}|ỹ_1, ..., ỹ_t) is estimated by m̂_{y_{t+1}|ỹ_1,...,ỹ_t} = Σ_{i=1}^T μ̂_i^{(t+1)} k_Y(·, Y_i), where

    μ̂^{(t+1)} = (G_X + Tε_T I_T)^{-1} G_{X,X+1} (G_X + Tε_T I_T)^{-1} G_X α^{(t)}.

Here G_{X+1,X} is the "transfer" matrix defined by (G_{X+1,X})_{ij} = k_X(X_{i+1}, X_j). With the notation Λ^{(t+1)} = Diag(μ̂_1^{(t+1)}, ..., μ̂_T^{(t+1)}), kernel Bayes' rule yields

    α^{(t+1)} = Λ^{(t+1)} G_Y ((Λ^{(t+1)} G_Y)² + δ_T I_T)^{-1} Λ^{(t+1)} k_Y(ỹ_{t+1}).    (14)

Eqs. (13) and (14) describe the update rule of α^{(t)}(ỹ_1, ..., ỹ_t). By contrast with [20], where the estimates of the previous hidden state and observation are assumed to combine additively, the above derivation is based only on applying KBR. In sequential filtering, a substantial reduction of computational cost can be achieved by low rank approximations for the matrices of the training phase: given rank r, the computation costs only O(Tr²) for each step in filtering.

Bayesian computation without likelihood. When the likelihood and/or prior is not obtained in an analytic form but sampling is possible, the ABC approach [25, 12, 17] is popular for Bayesian computation. The ABC rejection method generates a sample from q(X|Y = y) as follows: (1) generate X_t from the prior Π; (2) generate Y_t from p(y|X_t); (3) if D(y, Y_t) < ρ, accept X_t, otherwise reject; (4) go to (1). In Step (3), D is a distance on X, and ρ is the tolerance to acceptance.

In exactly the same situation as the above, the KBR approach gives the following method: (i) generate X_1, ..., X_n from the prior Π; (ii) generate a sample Y_t from p(y|X_t) (t = 1, ..., n); (iii) compute the Gram matrices G_X and G_Y with (X_1, Y_1), ..., (X_n, Y_n), and R_{X|Y} k_Y(y).

The distribution of a sample given by ABC approaches the true posterior as ρ → 0, while the empirical posterior estimate of KBR converges to the true one as n → ∞. The computational efficiency of ABC, however, can be arbitrarily low for a small ρ, since X_t is then rarely accepted in Step (3). Finally, ABC generates a sample, which allows any statistic of the posterior to be approximated. In the case of KBR, certain statistics of the posterior (such as confidence intervals) can be harder to obtain, since consistency is guaranteed only for expectations of RKHS functions. In Sec.
4.2, we provide experimental comparisons addressing the trade-off between computational time and accuracy for ABC and KBR.

4 Experiments

4.1 Nonparametric inference of posterior

First we compare KBR and standard kernel density estimation (KDE). Let {(X_i, Y_i)}_{i=1}^n be an i.i.d. sample from P on R^d × R^r. With p.d.f.s K(x) on R^d and H(y) on R^r, the conditional p.d.f. p(y|x) is estimated by

    p̂(y|x) = Σ_{j=1}^n K_{h_X}(x − X_j) H_{h_Y}(y − Y_j) / Σ_{j=1}^n K_{h_X}(x − X_j),

where K_{h_X}(x) = h_X^{−d} K(x/h_X) and H_{h_Y}(y) = h_Y^{−r} H(y/h_Y). Given an i.i.d. sample {U_j}_{j=1}^ℓ from the prior Π, the posterior q(x|y) is represented by the weighted sample (U_i, w_i) with w_i = p̂(y|U_i) / Σ_{j=1}^ℓ p̂(y|U_j) as importance weight (IW).

We compare the estimates of ∫ x q(x|y) dx obtained by KBR and KDE+IW, using Gaussian kernels for both methods. Note that with the Gaussian kernel, the function f(x) = x does not belong to H_X, and the consistency of the KBR method is not rigorously guaranteed (c.f. Theorem 5). Gaussian kernels, however, are known to be able to approximate any continuous function on a compact subset with arbitrary accuracy [23]. We can thus expect that the posterior mean can be estimated effectively.

In the experiments, the dimensionality was given by r = d, ranging from 2 to 64. The distribution P of (X, Y) was N((0, 1_d)^T, V) with V randomly generated for each run. The prior Π was P_X = N(0, V_{XX}/2), where V_{XX} is the X-component of V. The sample sizes were n = ℓ = 200. The bandwidth parameters h_X, h_Y in KDE were set h_X = h_Y and chosen in two ways: least squares cross-validation [15], and the best mean performance, over the set {2i | i = 1, ..., 10}.
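The KDE+IW construction above can be sketched as follows (our own minimal 1-d version with Gaussian smoothing kernels; names and bandwidths are illustrative, and the toy model has a known posterior mean to check against):

```python
import numpy as np

rng = np.random.default_rng(2)

def kde_iw_posterior_mean(X, Y, U, y, hx, hy):
    """KDE estimate of p(y|x) at the prior sample U, then importance weighting."""
    # p_hat(y|U_i) = sum_j K_hx(U_i - X_j) H_hy(y - Y_j) / sum_j K_hx(U_i - X_j)
    KX = np.exp(-(U[:, None] - X[None, :]) ** 2 / (2 * hx ** 2))  # (l, n)
    HY = np.exp(-(y - Y) ** 2 / (2 * hy ** 2))                    # (n,)
    lik = (KX * HY[None, :]).sum(axis=1) / KX.sum(axis=1)
    w = lik / lik.sum()            # importance weights over the prior sample
    return w @ U                   # estimate of the posterior mean E[x|y]

# Toy joint: X ~ N(0,1), Y = X + N(0, 0.5); with the prior equal to the
# marginal of X, the exact posterior mean at y is y / 1.5.
n = 2000
X = rng.normal(size=n)
Y = X + rng.normal(scale=np.sqrt(0.5), size=n)
U = rng.normal(size=n)

est = kde_iw_posterior_mean(X, Y, U, y=0.8, hx=0.3, hy=0.3)
print(est)   # near 0.8 / 1.5 = 0.533, up to smoothing bias
```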
For the KBR, we used two methods to choose the deviation parameter in the Gaussian kernel: the median over the pairwise distances in the data [10], and the 10-fold CV described in Sec. 3. Fig. 2 shows the MSE of the estimates over 1000 random points y ∼ N(0, V_YY). While the accuracy of both methods decreases for larger dimensionality, KBR significantly outperforms KDE+IW.

Figure 2: KBR vs. KDE+IW (average MSE of E[X|Y=y] over 50 runs, dimension 2 to 64; KBR with CV and median-distance bandwidths, KDE+IW with LSCV and best bandwidths).

4.2 Bayesian computation without likelihood

We compare KBR and ABC in terms of estimation accuracy and computational time. To compute the estimation accuracy rigorously, Gaussian distributions are used for the true prior and likelihood. The samples are taken from the same model as in Sec. 4.1, and ∫ x q(x|y) dx is evaluated at 10 different points of y. We performed 10 runs with different covariances.

For ABC, we used only the rejection method; while there are more advanced sampling schemes [12, 17], implementation is not straightforward. Various parameters for the acceptance are used, and the accuracy and computational time are shown in Fig. 3 together with the total sizes of generated samples. For the KBR method, the sample sizes n of the likelihood and prior are varied. The regularization parameters are given by ε_n = 0.01/n and δ_n = 2ε_n. In KBR, Gaussian kernels are used and the incomplete Cholesky decomposition is employed.
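For concreteness, the ABC rejection steps (1)-(4) of Sec. 3 look as follows in a toy Gaussian model where the exact posterior mean is available (our own sketch; the tolerance ρ trades accuracy against acceptance rate, which is exactly the trade-off measured in this comparison):

```python
import numpy as np

rng = np.random.default_rng(3)

def abc_rejection(y, n_prop, rho):
    """ABC rejection for prior x ~ N(0,1), likelihood y|x ~ N(x, 0.5):
    accept x_t when |y - y_t| < rho (D is the absolute distance)."""
    x = rng.normal(size=n_prop)                              # (1) draw from the prior
    y_sim = x + rng.normal(scale=np.sqrt(0.5), size=n_prop)  # (2) simulate data
    return x[np.abs(y_sim - y) < rho]                        # (3) accept/reject

y = 0.8
acc = abc_rejection(y, n_prop=200_000, rho=0.05)
# The exact posterior mean is y/1.5 ~ 0.533; the ABC mean approaches it as
# rho -> 0, but the acceptance rate (and hence efficiency) also shrinks.
print(acc.mean(), len(acc) / 200_000)
```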
The results indicate that KBR achieves more accurate results than ABC in the same computational time.

Figure 3: Estimation accuracy and computational time with KBR and ABC (average mean square error vs. CPU time in seconds, 6 dim.).

4.3 Filtering problems

The KBR filter proposed in Sec. 3 is applied. Alternative strategies for state-space models with complex dynamics involve the extended Kalman filter (EKF) and the unscented Kalman filter (UKF, [11]). There is earlier work on nonparametric state-space models or HMMs using nonparametric estimation of the conditional p.d.f., such as KDE or partitions [27, 26] and, more recently, kernel methods [20, 21]. In the following, the KBR method is compared with linear and nonlinear Kalman filters.

KBR has the regularization parameters ε_T, δ_T, and kernel parameters for k_X and k_Y (e.g., the deviation parameter of the Gaussian kernel). The validation approach is applied for selecting them, by dividing the training sample into two. To reduce the search space, we set δ_T = 2ε_T and use the Gaussian kernel deviations βσ_X and βσ_Y, where σ_X and σ_Y are the medians of pairwise distances among the training samples ([10]), leaving only two parameters β and ε_T to be tuned.

Figure 4: Comparisons of the KBR filter with EKF and UKF (mean square errors vs. training sample size on datasets (a) and (b); average MSEs and SEs over 30 runs).

Table 1: Average MSEs and SEs of camera angle estimates (10 runs).

                 KBR (Gauss)      KBR (Tr)         Kalman (9 dim.)   Kalman (Quat.)
    σ² = 10⁻⁴    0.210 ± 0.015    0.146 ± 0.003    1.980 ± 0.083     0.557 ± 0.023
    σ² = 10⁻³    0.222 ± 0.009    0.210 ± 0.008    1.935 ± 0.064     0.541 ± 0.022

We first use two synthetic data sets with KBR, EKF, and UKF, assuming that EKF and UKF know the exact dynamics. The dynamics has a hidden state X_t = (u_t, v_t)^T ∈ R², and is given by

    (u_{t+1}, v_{t+1}) = (1 + b sin(Mθ_{t+1}))(cos θ_{t+1}, sin θ_{t+1}) + Z_t,    θ_{t+1} = θ_t + η (mod 2π),

where Z_t ∼ N(0, σ_h² I₂) is independent noise. Note that the dynamics of (u_t, v_t) is nonlinear even for b = 0. The observation Y_t follows Y_t = X_t + W_t, where W_t ∼ N(0, σ_o² I). The two dynamics are defined as follows: (a) (noisy rotation) η = 0.3, b = 0, σ_h = σ_o = 0.2; (b) (noisy oscillatory rotation) η = 0.4, b = 0.4, M = 8, σ_h = σ_o = 0.2. The results are shown in Fig. 4. In all cases, the difference between EKF and UKF is negligibly small. The dynamics in (a) has weak nonlinearity, and KBR shows slightly worse MSE than EKF and UKF. For dataset (b), with strong nonlinearity, KBR outperforms the nonlinear Kalman filters for T ≥ 200, even though they know the true dynamics.

Next, we applied the KBR filter to the camera rotation problem used in [20]¹, where the angle of a camera is the hidden variable and the movie frames of a room taken by the camera are observed. We are given 3600 frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200), where the first 1800 frames are used for training, and the second half are used for test. For the details on the data, see [20]. We make the data noisy by adding Gaussian noise N(0, σ²) to Y_t. Our experiments cover two settings. In the first, we assume we do not know that the hidden state X_t is included in SO(3), but treat it as a general 3 × 3 matrix.
In this case, we use the Kalman filter by estimating the relations under a linear assumption, and the KBR filter with Gaussian kernels for Xt and Yt. In the second setting, we exploit the fact that Xt ∈ SO(3): for the Kalman filter, Xt is represented by a quaternion, and for the KBR filter the kernel k(A, B) = Tr[ABᵀ] is used for Xt. Table 1 shows the Frobenius norms between the estimated matrices and the true ones. The KBR filter significantly outperforms the Kalman filter, since KBR has the advantage of extracting the complex nonlinear dependence of the observation on the hidden state.

5 Conclusion

We have proposed a general, novel framework for implementing Bayesian inference, in which the prior, likelihood, and posterior are expressed as kernel means in reproducing kernel Hilbert spaces. The model is expressed in terms of a set of training samples, and inference consists of a small number of straightforward matrix operations. Our approach is well suited to cases where simple parametric models or analytic forms of the densities are not available, but samples are easily obtained. We have addressed two applications: Bayesian inference without likelihood, and sequential filtering with a nonparametric state-space model. Future studies could include more comparisons with sampling approaches such as advanced Monte Carlo methods, and applications to further inference problems such as nonparametric Bayesian models and Bayesian reinforcement learning.

Acknowledgements. KF was supported in part by JSPS KAKENHI (B) 22300098.

¹ Due to a difference in the noise model, the results here are not directly comparable with those of [20].

References

[1] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68(3):337–404, 1950.
[2] C.R. Baker. Joint measures and cross-covariance operators. Trans. Amer. Math. Soc., 186:273–289, 1973.
[3] A. Berlinet and C. Thomas-Agnan.
Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

[4] A. Doucet, N. De Freitas, and N.J. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[5] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. JMLR, 2:243–264, 2001.
[6] K. Fukumizu, F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73–99, 2004.
[7] K. Fukumizu, F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. Ann. Stat., 37(4):1871–1905, 2009.
[8] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in NIPS 20, pages 489–496. MIT Press, 2008.
[9] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in NIPS 19, pages 513–520. MIT Press, 2007.
[10] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in NIPS 20, pages 585–592. MIT Press, 2008.
[11] S.J. Julier and J.K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proc. AeroSense: The 11th Intern. Symp. Aerospace/Defence Sensing, Simulation and Controls, 1997.
[12] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré. Markov chain Monte Carlo without likelihoods. PNAS, 100(26):15324–15328, 2003.
[13] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in NIPS 11, pages 536–542. MIT Press, 1999.
[14] P. Müller and F.A. Quintana. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110, 2004.
[15] M. Rudemo. Empirical choice of histograms and kernel density estimators. Scandinavian J.
Statistics, 9(2):65–78, 1982.

[16] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, 2002.
[17] S.A. Sisson, Y. Fan, and M.M. Tanaka. Sequential Monte Carlo without likelihoods. PNAS, 104(6):1760–1765, 2007.
[18] L. Song, A. Gretton, and C. Guestrin. Nonparametric tree graphical models via kernel embeddings. In AISTATS 2010, pages 765–772, 2010.
[19] L. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin. Kernel belief propagation. In AISTATS 2011.
[20] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proc. ICML 2009, pages 961–968, 2009.
[21] L. Song, S.M. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. In Proc. ICML 2010, pages 991–998, 2010.
[22] B.K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G.R.G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517–1561, 2010.
[23] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. JMLR, 2:67–93, 2001.
[24] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In AISTATS 2010, pages 781–788, 2010.
[25] S. Tavaré, D.J. Balding, R.C. Griffiths, and P. Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145:505–518, 1997.
[26] S. Thrun, J. Langford, and D. Fox. Monte Carlo hidden Markov models: Learning non-parametric models of partially observable stochastic processes. In ICML 1999, pages 415–424, 1999.
[27] V. Monbet, P. Ailliot, and P.F. Marteau. l1-convergence of smoothing densities in non-parametric state space models.
Statistical Inference for Stochastic Processes, 11:311–325, 2008.