{"title": "Kernel Bayesian Inference with Posterior Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 4763, "page_last": 4771, "abstract": "We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster and has comparable performance to the squared regularization in kernel Bayes' rule. This regularization coincides with a former thresholding approach used in kernel POMDPs whose consistency remains to be established. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimizational formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.", "full_text": "Kernel Bayesian Inference with\n\nPosterior Regularization\n\nYang Song\u2020, Jun Zhu\u2021\u2217, Yong Ren\u2021\n\n\u2020 Dept. of Physics, Tsinghua University, Beijing, China\n\n\u2021 Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research\n\nState Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China\n\nyangsong@cs.stanford.edu; {dcszj@, renyong15@mails}.tsinghua.edu.cn\n\nAbstract\n\nWe propose a vector-valued regression problem whose solution is equivalent to the\nreproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior\ndistribution. This equivalence provides a new understanding of kernel Bayesian\ninference. 
Moreover, the optimization problem induces a new regularization for the\nposterior embedding estimator, which is faster and has comparable performance\nto the squared regularization in kernel Bayes\u2019 rule. This regularization coincides\nwith a former thresholding approach used in kernel POMDPs whose consistency\nremains to be established. Our theoretical work solves this open problem and\nprovides consistency analysis in regression settings. Based on our optimizational\nformulation, we propose a \ufb02exible Bayesian posterior regularization framework\nwhich for the \ufb01rst time enables us to put regularization at the distribution level.\nWe apply this method to nonparametric state-space \ufb01ltering tasks with extremely\nnonlinear dynamics and show performance gains over all other baselines.\n\n1\n\nIntroduction\n\nKernel methods have long been effective in generalizing linear statistical approaches to nonlinear\ncases by embedding a sample to the reproducing kernel Hilbert space (RKHS) [1]. In recent years,\nthe idea has been generalized to embedding probability distributions [2, 3]. Such embeddings of\nprobability measures are usually called kernel embeddings (a.k.a. kernel means). Moreover, [4, 5, 6]\nshow that statistical operations of distributions can be realized in RKHS by manipulating kernel\nembeddings via linear operators. This approach has been applied to various statistical inference and\nlearning problems, including training hidden Markov models (HMM) [7], belief propagation (BP)\nin tree graphical models [8], planning Markov decision processes (MDP) [9] and partially observed\nMarkov decision processes (POMDP) [10].\nOne of the key workhorses in the above applications is the kernel Bayes\u2019 rule [5], which establishes\nthe relation among the RKHS representations of the priors, likelihood functions and posterior\ndistributions. Despite empirical success, the characterization of kernel Bayes\u2019 rule remains largely\nincomplete. 
For example, it is unclear how the estimators of the posterior distribution embeddings relate to optimizers of some loss functions, though the vanilla Bayes' rule has a nice connection [11]. This makes generalizing the results especially difficult and hinders an intuitive understanding of kernel Bayes' rule.

To alleviate this weakness, we propose a vector-valued regression [12] problem whose optimizer is the posterior distribution embedding. This new formulation is inspired by the progress in two fields: 1) the alternative characterization of conditional embeddings as regressors [13], and 2) the introduction of posterior regularized Bayesian inference (RegBayes) [14] based on an optimizational reformulation of the Bayes' rule.

*Corresponding author.

We demonstrate the novelty of our formulation by providing a new understanding of kernel Bayesian inference, with theoretical, algorithmic and practical implications. On the theoretical side, we are able to prove the (weak) consistency of the estimator obtained by solving the vector-valued regression problem under reasonable assumptions. As a side product, our proof can be applied to a thresholding technique used in [10], whose consistency is left as an open problem. On the algorithmic side, we propose a new regularization technique, which is shown to run faster and to have comparable accuracy to the squared regularization used in the original kernel Bayes' rule [5]. Similar in spirit to RegBayes, we are also able to derive an extended version of the embeddings by directly imposing regularization on the posterior distributions. We call this new framework kRegBayes. Thanks to RKHS embeddings of distributions, this is the first time, to the best of our knowledge, that posterior regularization can be performed without invoking linear functionals (such as moments) of the random variables. 
On the practical side, we demonstrate the efficacy of our methods on both simple and complicated synthetic state-space filtering datasets.

As with other algorithms based on kernel embeddings, our kernel regularized Bayesian inference framework is nonparametric and general. The algorithm is nonparametric because the priors, posterior distributions and likelihood functions are all characterized by weighted sums of data samples. Hence it does not need an explicit mechanism, such as the differential equations of a robot arm in filtering tasks. It is general in terms of being applicable to a broad variety of domains as long as kernels can be defined, such as strings, orthonormal matrices, permutations and graphs.

2 Preliminaries

2.1 Kernel embeddings

Let $(\mathcal{X}, \mathcal{B}_\mathcal{X})$ be a measurable space of random variables, $p_X$ be the associated probability measure and $\mathcal{H}_X$ be a RKHS with kernel $k(\cdot,\cdot)$. We define the kernel embedding of $p_X$ to be $\mu_X = \mathbb{E}_{p_X}[\phi(X)] \in \mathcal{H}_X$, where $\phi(X) = k(X,\cdot)$ is the feature map. Such a vector-valued expectation always exists if the kernel is bounded, namely $\sup_x k_X(x,x) < \infty$.

The concept of kernel embeddings has several important statistical merits. Owing to the reproducing property, the expectation of $f \in \mathcal{H}$ w.r.t. $p_X$ can be easily computed as $\mathbb{E}_{p_X}[f(X)] = \mathbb{E}_{p_X}[\langle f, \phi(X)\rangle] = \langle f, \mu_X\rangle$. There exist universal kernels [15] whose corresponding RKHS $\mathcal{H}$ is dense in $C_\mathcal{X}$ in terms of the sup norm. This means $\mathcal{H}$ contains a rich range of functions $f$ whose expectations can be computed by inner products without invoking usually intractable integrals. In addition, the inner product structure of the embedding space $\mathcal{H}$ provides a natural way to measure the differences of distributions through norms.

In much the same way we can define kernel embeddings of linear operators. 
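As a concrete illustration of the two merits above (this sketch is ours, not part of the paper; the kernel, bandwidth and sample sizes are arbitrary choices), the empirical embedding $\hat{\mu}_X = \frac{1}{N}\sum_i \phi(x_i)$ turns expectations of any $f \in \mathcal{H}_X$ into inner products, and the RKHS norm of the difference of two empirical embeddings gives a discrepancy between the underlying distributions:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gram matrix k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2) for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=200)               # sample {x_i} from p_X

# a test function f = sum_j c_j k(z_j, .), which lies in H_X
z = np.array([-1.0, 0.0, 1.0])
c = np.array([0.3, -0.2, 0.5])

# E_pX[f(X)] ~ <f, mu_hat_X>: the embedding reduces the expectation to an
# inner product; both evaluation orders below must agree.
via_embedding = c @ rbf(z, x).mean(axis=1)   # <f, mu_hat_X>
via_average = (c @ rbf(z, x)).mean()         # (1/N) sum_i f(x_i)

# squared RKHS norm of the difference of two empirical embeddings
# (an empirical MMD), computed purely from Gram-matrix averages
x2 = rng.normal(loc=2.0, size=200)           # sample from a shifted distribution
mmd2 = rbf(x, x).mean() + rbf(x2, x2).mean() - 2 * rbf(x, x2).mean()
```

The shifted sample yields a clearly positive `mmd2`, reflecting that the two distributions differ.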
Let $(\mathcal{X}, \mathcal{B}_\mathcal{X})$ and $(\mathcal{Y}, \mathcal{B}_\mathcal{Y})$ be two measurable spaces, $\phi(x)$ and $\psi(y)$ be the measurable feature maps of the corresponding RKHS $\mathcal{H}_X$ and $\mathcal{H}_Y$ with bounded kernels, and $p$ denote the joint distribution of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with product measures. The covariance operator $C_{XY}$ is defined as $C_{XY} = \mathbb{E}_p[\phi(X) \otimes \psi(Y)]$, where $\otimes$ denotes the tensor product. Note that it is possible to identify $C_{XY}$ with $\mu_{(XY)}$ in $\mathcal{H}_X \otimes \mathcal{H}_Y$ with the kernel function $k((x_1,y_1),(x_2,y_2)) = k_X(x_1,x_2)\,k_Y(y_1,y_2)$ [16]. There is an important relation between kernel embeddings of distributions and covariance operators, which is fundamental for the sequel:

Theorem 1 ([4, 5]). Let $\mu_X$, $\mu_Y$ be the kernel embeddings of $p_X$ and $p_Y$ respectively. If $C_{XX}$ is injective, $\mu_X \in \mathcal{R}(C_{XX})$ and $\mathbb{E}[g(Y) \mid X = \cdot] \in \mathcal{H}_X$ for all $g \in \mathcal{H}_Y$, then

$$\mu_Y = C_{YX} C_{XX}^{-1} \mu_X. \quad (1)$$

In addition, $\mu_{Y|X=x} = \mathbb{E}[\psi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \phi(x)$.

On the implementation side, we need to estimate these kernel embeddings via samples. An intuitive estimator for the embedding $\mu_X$ is $\hat{\mu}_X = \frac{1}{N}\sum_{i=1}^N \phi(x_i)$, where $\{x_i\}_{i=1}^N$ is a sample from $p_X$. Similarly, the covariance operators can be estimated by $\hat{C}_{XY} = \frac{1}{N}\sum_{i=1}^N \phi(x_i) \otimes \psi(y_i)$. Both estimators are shown to converge in the RKHS norm at a rate of $O_p(N^{-1/2})$ [4].

2.2 Kernel Bayes' rule

Let $\pi(Y)$ be the prior distribution of a random variable $Y$, $p(X = x \mid Y)$ be the likelihood, $p^\pi(Y \mid X = x)$ be the posterior distribution given $\pi(Y)$ and observation $x$, and $p^\pi(X, Y)$ be the joint distribution incorporating $\pi(Y)$ and $p(X \mid Y)$. Kernel Bayesian inference aims to obtain the posterior embedding $\mu^\pi_Y(X = x)$ given a prior embedding $\pi_Y$ and a covariance operator $C_{XY}$. By Bayes' rule, $p^\pi(Y \mid X = x) \propto \pi(Y)\,p(X = x \mid Y)$. We assume that there exists a joint distribution $p$ on $\mathcal{X} \times \mathcal{Y}$ whose conditional distribution matches $p(X \mid Y)$, and let $C_{XY}$ be its covariance operator. Note that we do not require $p = p^\pi$, hence $p$ can be any convenient distribution.

According to Thm. 1, $\mu^\pi_Y(X = x) = C^\pi_{YX} (C^\pi_{XX})^{-1} \phi(x)$, where $C^\pi_{YX}$ corresponds to the joint distribution $p^\pi$ and $C^\pi_{XX}$ to the marginal probability of $p^\pi$ on $\mathcal{X}$. Recall that $C^\pi_{YX}$ can be identified with $\mu_{(YX)}$ in $\mathcal{H}_Y \otimes \mathcal{H}_X$; we can thus apply Thm. 1 to obtain $\mu_{(YX)} = C_{(YX)Y} C_{YY}^{-1} \pi_Y$, where $C_{(YX)Y} := \mathbb{E}[\psi(Y) \otimes \phi(X) \otimes \psi(Y)]$. Similarly, $C^\pi_{XX}$ can be represented as $\mu_{(XX)} = C_{(XX)Y} C_{YY}^{-1} \pi_Y$. This way of computing posterior embeddings is called the kernel Bayes' rule [5].

Given estimators of the prior embedding $\hat{\pi}_Y = \sum_{i=1}^m \tilde{\alpha}_i \psi(y_i)$ and the covariance operator $\hat{C}_{YX}$, the posterior embedding can be obtained via $\hat{\mu}^\pi_Y(X = x) = \hat{C}^\pi_{YX} ([\hat{C}^\pi_{XX}]^2 + \lambda I)^{-1} \hat{C}^\pi_{XX} \phi(x)$, where squared regularization is added to the inversion. Note that the regularization for $\hat{\mu}^\pi_Y(X = x)$ is not unique. A thresholding alternative is proposed in [10] without establishing its consistency. We will discuss this thresholding regularization from a different perspective and give consistency results in the sequel.

2.3 Regularized Bayesian inference

Regularized Bayesian inference (RegBayes [14]) is based on a variational formulation of the Bayes' rule [11]. The posterior distribution can be viewed as the solution of $\min_{p(Y|X=x)} \mathrm{KL}(p(Y|X=x)\,\|\,\pi(Y)) - \int \log p(X=x|Y)\,\mathrm{d}p(Y|X=x)$, subject to $p(Y|X=x) \in \mathcal{P}_{\mathrm{prob}}$, where $\mathcal{P}_{\mathrm{prob}}$ is the set of valid probability measures. RegBayes combines this formulation and posterior regularization [17] in the following way:

$$\min_{p(Y|X=x),\,\xi}\ \mathrm{KL}(p(Y|X=x)\,\|\,\pi(Y)) - \int \log p(X=x|Y)\,\mathrm{d}p(Y|X=x) + U(\xi) \quad \text{s.t. } p(Y|X=x) \in \mathcal{P}_{\mathrm{prob}}(\xi),$$

where $\mathcal{P}_{\mathrm{prob}}(\xi)$ is a subset depending on $\xi$ and $U(\xi)$ is a loss function. Such a formulation makes it possible to regularize Bayesian posterior distributions, smoothing the gap between Bayesian generative models and discriminative models. Related applications include max-margin topic models [18] and infinite latent SVMs [14].

Despite the flexibility of RegBayes, regularization on the posterior distributions is in practice imposed indirectly via expectations of a function. We shall see in the sequel that our new framework of kernel regularized Bayesian inference can control the posterior distribution in a direct way.

2.4 Vector-valued regression

The main task of vector-valued regression [12] is to minimize the following objective

$$\mathcal{E}(f) := \sum_{j=1}^{n} \|y_j - f(x_j)\|_{\mathcal{H}_Y}^2 + \lambda \|f\|_{\mathcal{H}_K}^2,$$

where $y_j \in \mathcal{H}_Y$ and $f : \mathcal{X} \to \mathcal{H}_Y$. Note that $f$ is a function with RKHS values and we assume that $f$ belongs to a vector-valued RKHS $\mathcal{H}_K$. In a vector-valued RKHS, the kernel function $k$ is generalized to linear operators $\mathcal{L}(\mathcal{H}_Y) \ni K(x_1, x_2) : \mathcal{H}_Y \to \mathcal{H}_Y$, such that $K(x_1,x_2)y := (K_{x_2} y)(x_1)$ for every $x_1, x_2 \in \mathcal{X}$ and $y \in \mathcal{H}_Y$, where $K_{x_2} y \in \mathcal{H}_K$. The reproducing property is generalized to $\langle y, f(x)\rangle_{\mathcal{H}_Y} = \langle K_x y, f\rangle_{\mathcal{H}_K}$ for every $y \in \mathcal{H}_Y$, $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. 
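To make this objective concrete, here is a small self-contained sketch (ours, not from the paper): we take $\mathcal{H}_Y$ to be an explicit two-dimensional feature space so everything is computable, use the operator-valued kernel $K(x_1,x_2) = k_X(x_1,x_2) I$ that the paper adopts later in Prop. 2, and solve the ridge-regression objective in closed form, $\hat{f}(x) = \Psi (K_X + \lambda I)^{-1} K_{:x}$; the kernel, bandwidth and data are illustrative choices.

```python
import numpy as np

def k_x(a, b, gamma=1.0):
    """Scalar RBF kernel; with K(x1, x2) = k_x(x1, x2) * I it is operator-valued."""
    d = a[:, None] - b[None, :]
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=60)
# explicit finite-dimensional "feature map" psi: targets y_j = psi(x_j) in R^2
psi = np.stack([np.sin(x), np.cos(2 * x)], axis=1)

lam = 1e-3
K = k_x(x, x)
# Representer theorem: f_hat(.) = sum_i k_x(., x_i) c_i with coefficient
# matrix C = (K + lam I)^{-1} Psi (one column per output dimension).
C = np.linalg.solve(K + lam * np.eye(len(x)), psi)

def f_hat(xq):
    """Evaluate the fitted vector-valued function at query points xq."""
    return k_x(xq, x) @ C          # shape (len(xq), 2)

# with a small ridge parameter the fit nearly interpolates the targets
err = np.abs(f_hat(x) - psi).max()
```

The same closed form, with the weighted coefficients of Prop. 2, is what the paper's posterior-embedding estimator instantiates.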
In addition, [12] shows that the representer theorem still holds for vector-valued RKHS.

3 Kernel Bayesian inference as a regression problem

One of the unique merits of the posterior embedding $\mu^\pi_Y(X = x)$ is that expectations w.r.t. posterior distributions can be computed via inner products, i.e., $\langle h, \mu^\pi_Y(X=x)\rangle = \mathbb{E}_{p^\pi(Y|X=x)}[h(Y)]$ for all $h \in \mathcal{H}_Y$. Since $\mu^\pi_Y(X=x) \in \mathcal{H}_Y$, $\mu^\pi_Y$ can be viewed as an element of a vector-valued RKHS $\mathcal{H}_K$ containing functions $f : \mathcal{X} \to \mathcal{H}_Y$.

A natural optimization objective [13] thus follows from the above observations:

$$\mathcal{E}[\mu] := \sup_{\|h\|_{\mathcal{Y}} \le 1} \mathbb{E}_X\big[(\mathbb{E}_Y[h(Y) \mid X] - \langle h, \mu(X)\rangle_{\mathcal{H}_Y})^2\big], \quad (2)$$

where $\mathbb{E}_X[\cdot]$ denotes the expectation w.r.t. $p^\pi(X)$ and $\mathbb{E}_Y[\cdot \mid X]$ denotes the expectation w.r.t. the Bayesian posterior distribution, i.e., $p^\pi(Y \mid X) \propto \pi(Y)\,p(X \mid Y)$. Clearly, $\mu^\pi_Y = \arg\inf_\mu \mathcal{E}[\mu]$. Following [13], we introduce an upper bound $\mathcal{E}_s$ of $\mathcal{E}$ by applying Jensen's and Cauchy-Schwarz's inequalities consecutively:

$$\mathcal{E}_s[\mu] := \mathbb{E}_{(X,Y)}\big[\|\psi(Y) - \mu(X)\|_{\mathcal{H}_Y}^2\big], \quad (3)$$

where $(X, Y)$ is the random variable on $\mathcal{X} \times \mathcal{Y}$ with the joint distribution $p^\pi(X, Y) = \pi(Y)\,p(X \mid Y)$.

The first step to make this optimizational framework practical is to find finite sample estimators of $\mathcal{E}_s[\mu]$. We show how to do this in the following section.

3.1 A consistent estimator of $\mathcal{E}_s[\mu]$

Unlike the conditional embeddings in [13], we do not have i.i.d. samples from the joint distribution $p^\pi(X, Y)$, as the priors and likelihood functions are represented with samples from different distributions. 
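For completeness, the two inequalities behind the bound (3) can be spelled out (our derivation, following the statement above). Writing $\mathbb{E}_Y[h(Y) \mid X] = \langle h, \mathbb{E}[\psi(Y) \mid X]\rangle_{\mathcal{H}_Y}$,

```latex
\begin{aligned}
\big(\mathbb{E}_Y[h(Y)\,|\,X] - \langle h, \mu(X)\rangle\big)^2
  &= \big\langle h,\ \mathbb{E}[\psi(Y)\,|\,X] - \mu(X)\big\rangle^2 \\
  &\le \|h\|^2\,\big\|\mathbb{E}[\psi(Y) - \mu(X)\,|\,X]\big\|^2
     && \text{(Cauchy--Schwarz)} \\
  &\le \|h\|^2\,\mathbb{E}\big[\|\psi(Y) - \mu(X)\|^2\,\big|\,X\big]
     && \text{(Jensen)},
\end{aligned}
```

so taking $\mathbb{E}_X$ and the supremum over $\|h\| \le 1$ yields $\mathcal{E}[\mu] \le \mathcal{E}_s[\mu]$.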
We will eliminate this problem using a kernel trick, which is one of our main innovations in this paper.

The idea is to use the inner product property of a kernel embedding $\mu_{(X,Y)}$ to represent the expectation $\mathbb{E}_{(X,Y)}[\|\psi(Y) - \mu(X)\|_{\mathcal{H}_Y}^2]$, and then use finite sample estimators of $\mu_{(X,Y)}$ to estimate $\mathcal{E}_s[\mu]$. Recall that we can identify $C_{XY} := \mathbb{E}_{XY}[\phi(X) \otimes \psi(Y)]$ with $\mu_{(X,Y)}$ in a product space $\mathcal{H}_X \otimes \mathcal{H}_Y$ with a product kernel $k_X k_Y$ on $\mathcal{X} \times \mathcal{Y}$ [16]. Let $f(x, y) = \|\psi(y) - \mu(x)\|_{\mathcal{H}_Y}^2$ and assume that $f \in \mathcal{H}_X \otimes \mathcal{H}_Y$. The optimization objective $\mathcal{E}_s[\mu]$ can be written as

$$\mathcal{E}_s[\mu] = \mathbb{E}_{(X,Y)}\big[\|\psi(Y) - \mu(X)\|_{\mathcal{H}_Y}^2\big] = \langle f, \mu_{(X,Y)}\rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y}. \quad (4)$$

From Thm. 1, we assert that $\mu_{(X,Y)} = C_{(X,Y)Y} C_{YY}^{-1} \pi_Y$, and a natural estimator follows to be $\hat{\mu}_{(X,Y)} = \hat{C}_{(X,Y)Y} (\hat{C}_{YY} + \lambda I)^{-1} \hat{\pi}_Y$. As a result, $\hat{\mathcal{E}}_s[\mu] := \langle \hat{\mu}_{(X,Y)}, f\rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y}$, and we introduce the following proposition to write $\hat{\mathcal{E}}_s$ in terms of Gram matrices.

Proposition 1 (Proof in Appendix). Suppose $(X, Y)$ is a random variable in $\mathcal{X} \times \mathcal{Y}$, where the prior for $Y$ is $\pi(Y)$ and the likelihood is $p(X \mid Y)$. Let $\mathcal{H}_X$ be a RKHS with kernel $k_X$ and feature map $\phi(x)$, $\mathcal{H}_Y$ be a RKHS with kernel $k_Y$ and feature map $\psi(y)$, $\phi(x, y)$ be the feature map of $\mathcal{H}_X \otimes \mathcal{H}_Y$, $\hat{\pi}_Y = \sum_{i=1}^l \tilde{\alpha}_i \psi(\tilde{y}_i)$ be a consistent estimator of $\pi_Y$, and $\{(x_i, y_i)\}_{i=1}^n$ be a sample representing $p(X \mid Y)$. Under the assumption that $f(x, y) = \|\psi(y) - \mu(x)\|_{\mathcal{H}_Y}^2 \in \mathcal{H}_X \otimes \mathcal{H}_Y$, we have

$$\hat{\mathcal{E}}_s[\mu] = \sum_{i=1}^n \beta_i\,\|\psi(y_i) - \mu(x_i)\|_{\mathcal{H}_Y}^2, \quad (5)$$

where $\beta = (\beta_1, \cdots, \beta_n)^\top$ is given by $\beta = (G_Y + n\lambda I)^{-1} \tilde{G}_Y \tilde{\alpha}$, with $(G_Y)_{ij} = k_Y(y_i, y_j)$, $(\tilde{G}_Y)_{ij} = k_Y(y_i, \tilde{y}_j)$, and $\tilde{\alpha} = (\tilde{\alpha}_1, \cdots, \tilde{\alpha}_l)^\top$.

The consistency of $\hat{\mathcal{E}}_s[\mu]$ is a direct consequence of the following theorem adapted from [5], since the Cauchy-Schwarz inequality ensures $|\langle \mu_{(X,Y)}, f\rangle - \langle \hat{\mu}_{(X,Y)}, f\rangle| \le \|\mu_{(X,Y)} - \hat{\mu}_{(X,Y)}\|\,\|f\|$.

Theorem 2 (Adapted from [5], Theorem 8). Assume that $C_{YY}$ is injective, $\hat{\pi}_Y$ is a consistent estimator of $\pi_Y$ in $\mathcal{H}_Y$ norm, and that $\mathbb{E}[k((X,Y),(\tilde{X},\tilde{Y})) \mid Y = y, \tilde{Y} = \tilde{y}]$ is included in $\mathcal{H}_Y \otimes \mathcal{H}_Y$ as a function of $(y, \tilde{y})$, where $(\tilde{X}, \tilde{Y})$ is an independent copy of $(X, Y)$. Then, if the regularization coefficient $\lambda_n$ decays to $0$ sufficiently slowly,

$$\big\|\hat{C}_{(X,Y)Y} (\hat{C}_{YY} + \lambda_n I)^{-1} \hat{\pi}_Y - \mu_{(X,Y)}\big\|_{\mathcal{H}_X \otimes \mathcal{H}_Y} \to 0 \quad (6)$$

in probability as $n \to \infty$.

Although $\hat{\mathcal{E}}_s[\mu]$ is a consistent estimator of $\mathcal{E}_s[\mu]$, it does not necessarily have minima, since the coefficients $\beta_i$ can be negative. One of our main contributions in this paper is the discovery that we can ignore data points $(x_i, y_i)$ with a negative $\beta_i$, i.e., replace $\beta_i$ with $\beta_i^+ := \max(0, \beta_i)$ in $\hat{\mathcal{E}}_s[\mu]$. We give explanations and theoretical justifications in the next section.

3.2 The thresholding regularization

We show in the following theorem that $\hat{\mathcal{E}}_s^+[\mu] := \sum_{i=1}^n \beta_i^+ \|\psi(y_i) - \mu(x_i)\|^2$ converges to $\mathcal{E}_s[\mu]$ in probability in discrete situations. The trick of replacing $\beta_i$ with $\beta_i^+$ is named thresholding regularization.

Theorem 3 (Proof in Appendix). Assume that $\mathcal{X}$ is compact and $|\mathcal{Y}| < \infty$, $k$ is a strictly positive definite continuous kernel with $\sup_{(x,y)} k((x,y),(x,y)) < \kappa$, and $f(x,y) = \|\psi(y) - \mu(x)\|_{\mathcal{H}_Y}^2 \in \mathcal{H}_X \otimes \mathcal{H}_Y$. With the conditions in Thm. 2, we assert that $\hat{\mu}^+_{(X,Y)}$ is a consistent estimator of $\mu_{(X,Y)}$ and $|\hat{\mathcal{E}}_s^+[\mu] - \mathcal{E}_s[\mu]| \to 0$ in probability as $n \to \infty$.

In the context of partially observed Markov decision processes (POMDPs) [10], a similar thresholding approach, combined with normalization, was proposed to make the Bellman operator isotonic and contractive. However, the authors left the consistency of that approach as an open problem. The justification of normalization has been provided in [13], Lemma 2.2, under the finite space assumption. A slight modification of our proof of Thm. 3 (changing the probability space from $\mathcal{X} \times \mathcal{Y}$ to $\mathcal{X}$) can complete the other half as a side product, under the same assumptions.

Compared to the original squared regularization used in [5], thresholding regularization is more computationally efficient because 1) it does not need to multiply the Gram matrix twice, and 2) it does not need to take into consideration those data points with negative $\beta_i$'s. 
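A sketch of how the weights in (5) and their thresholded version can be computed (our illustration on synthetic data; the kernel, bandwidth and sample sizes are arbitrary, and $G_Y$, $\tilde{G}_Y$, $\tilde{\alpha}$ are as in Prop. 1):

```python
import numpy as np

def k_y(a, b, gamma=0.5):
    """RBF Gram matrix between two 1-D samples."""
    d = a[:, None] - b[None, :]
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(2)
y = rng.normal(size=50)            # {y_i}: sample representing the likelihood
y_prior = rng.normal(size=30)      # {y~_i}: sample representing the prior
alpha = np.full(30, 1.0 / 30)      # prior weights alpha~, here uniform

n, lam = len(y), 1e-2
G_Y = k_y(y, y)                    # (G_Y)_ij  = k_Y(y_i, y_j)
G_Yt = k_y(y, y_prior)             # (G~_Y)_ij = k_Y(y_i, y~_j)

# beta = (G_Y + n*lam*I)^{-1} G~_Y alpha~   (Proposition 1)
beta = np.linalg.solve(G_Y + n * lam * np.eye(n), G_Yt @ alpha)

# thresholding regularization: clip the negative weights to zero,
# which simply drops the corresponding data points from the objective
beta_plus = np.maximum(beta, 0.0)
```

The surviving `beta_plus` weights are exactly what enters the regression estimator derived in Sec. 3.3.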
In many cases a large portion of $\{\beta_i\}_{i=1}^n$ is negative but the sum of their absolute values is small. The finite space assumption in Thm. 3 may also be weakened, but this requires deeper theoretical analysis.

3.3 Minimizing $\hat{\mathcal{E}}_s^+[\mu]$

Following the standard steps of solving a RKHS regression problem, we add a Tikhonov regularization term to $\hat{\mathcal{E}}_s^+[\mu]$ to obtain a well-posed problem,

$$\hat{\mathcal{E}}_{\lambda,n}[\mu] = \sum_{i=1}^n \beta_i^+\,\|\psi(y_i) - \mu(x_i)\|_{\mathcal{H}_Y}^2 + \lambda\|\mu\|_{\mathcal{H}_K}^2. \quad (7)$$

Let $\hat{\mu}_{\lambda,n} = \arg\min_\mu \hat{\mathcal{E}}_{\lambda,n}[\mu]$. Note that $\hat{\mathcal{E}}_{\lambda,n}[\mu]$ is a vector-valued regression problem, and the representer theorems in vector-valued RKHS apply here. We summarize the matrix expression of $\hat{\mu}_{\lambda,n}$ in the following proposition.

Proposition 2 (Proof in Appendix). Without loss of generality, we assume that $\beta_i^+ \ne 0$ for all $1 \le i \le n$. Let $\mu \in \mathcal{H}_K$ and choose the kernel of $\mathcal{H}_K$ to be $K(x_i, x_j) = k_X(x_i, x_j) I$, where $I : \mathcal{H}_K \to \mathcal{H}_K$ is an identity map. Then

$$\hat{\mu}_{\lambda,n}(x) = \Psi (K_X + \lambda_n \Lambda^+)^{-1} K_{:x}, \quad (8)$$

where $\Psi = (\psi(y_1), \cdots, \psi(y_n))$, $(K_X)_{ij} = k_X(x_i, x_j)$, $\Lambda^+ = \mathrm{diag}(1/\beta_1^+, \cdots, 1/\beta_n^+)$, $K_{:x} = (k_X(x, x_1), \cdots, k_X(x, x_n))^\top$ and $\lambda_n$ is a positive regularization constant.

3.4 Theoretical justifications for $\hat{\mu}_{\lambda,n}$

In this section, we provide theoretical explanations for using $\hat{\mu}_{\lambda,n}$ as an estimator of the posterior embedding under specific assumptions. Let $\mu^* = \arg\min_\mu \mathcal{E}[\mu]$, $\mu' = \arg\min_\mu \mathcal{E}_s[\mu]$, and recall that $\hat{\mu}_{\lambda,n} = \arg\min_\mu \hat{\mathcal{E}}_{\lambda,n}[\mu]$. We first show the relations between $\mu^*$ and $\mu'$ and then discuss the relations between $\hat{\mu}_{\lambda,n}$ and $\mu'$.

The forms of $\mathcal{E}$ and $\mathcal{E}_s$ are exactly the same for posterior kernel embeddings and conditional kernel embeddings. As a consequence, the following theorem in [13] still holds.

Theorem 4 ([13]). If there exists a $\mu^* \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_Y$, $\mathbb{E}[h \mid X] = \langle h, \mu^*(X)\rangle_{\mathcal{H}_Y}$ $p_X$-a.s., then $\mu^*$ is the $p_X$-a.s. unique minimizer of both objectives:

$$\mu^* = \arg\min_{\mu \in \mathcal{H}_K} \mathcal{E}[\mu] = \arg\min_{\mu \in \mathcal{H}_K} \mathcal{E}_s[\mu].$$

This theorem shows that if the vector-valued RKHS $\mathcal{H}_K$ is rich enough to contain $\mu^\pi_{Y|X=x}$, both $\mathcal{E}$ and $\mathcal{E}_s$ can lead us to the correct embedding. In this case, it is reasonable to use $\mu'$ instead of $\mu^*$. For the situation where $\mu^\pi_{Y|X=x} \notin \mathcal{H}_K$, we refer the readers to [13].

Unfortunately, we cannot obtain the relation between $\hat{\mu}_{\lambda,n}$ and $\mu'$ by referring to [19], as was done in [13]. The main difficulty here is that $\{(x_i, y_i)\}_{i=1}^n$ is not an i.i.d. sample from $p^\pi(X, Y) = \pi(Y)\,p(X \mid Y)$, and the estimator $\hat{\mathcal{E}}_s^+[\mu]$ does not use i.i.d. samples to estimate expectations. Therefore the concentration inequality ([19], Prop. 2) used in the proofs of [19] cannot be applied.

To solve the problem, we propose Thm. 9 (in Appendix), which leads to a consistency proof for $\hat{\mu}_{\lambda,n}$. The relation between $\hat{\mu}_{\lambda,n}$ and $\mu'$ can now be summarized in the following theorem.

Theorem 5 (Proof in Appendix). Assume Hypothesis 1 and Hypothesis 2 in [20] and our Assumption 1 (in the Appendix) hold. With the conditions in Thm. 3, we assert that if $\lambda_n$ decreases to $0$ sufficiently slowly,

$$\mathcal{E}_s[\hat{\mu}_{\lambda_n,n}] - \mathcal{E}_s[\mu'] \to 0 \quad (9)$$

in probability as $n \to \infty$.

4 Kernel Bayesian inference with posterior regularization

Based on our optimizational formulation of kernel Bayesian inference, we can add additional regularization terms to control the posterior embeddings. This technique gives us the possibility to incorporate rich side information from domain knowledge and to enforce supervision on Bayesian inference. We call our framework of imposing posterior regularization kRegBayes.

As an example of the framework, we study the following optimization problem

$$\mathcal{L} := \underbrace{\sum_{i=1}^{m} \beta_i^+\,\|\mu(x_i) - \psi(y_i)\|_{\mathcal{H}_Y}^2 + \lambda\|\mu\|_{\mathcal{H}_K}^2}_{\hat{\mathcal{E}}_{\lambda,n}[\mu]} \;+\; \delta \underbrace{\sum_{i=m+1}^{n} \|\mu(x_i) - \psi(t_i)\|_{\mathcal{H}_Y}^2}_{\text{the regularization term}}, \quad (10)$$

where $\{(x_i, y_i)\}_{i=1}^m$ is the sample used for representing the likelihood, $\{(x_i, t_i)\}_{i=m+1}^n$ is the sample used for posterior regularization, and $\lambda$, $\delta$ are the regularization constants. Note that in RKHS embeddings, $\psi(t)$ is identified as a point distribution at $t$ [2]. Hence the regularization term in (10) encourages the posterior distributions $p(Y \mid X = x_i)$ to be concentrated at $t_i$. More complicated regularization terms are also possible, such as $\|\mu(x_i) - \sum_{i=1}^l \alpha_i \psi(t_i)\|_{\mathcal{H}_Y}$.

Compared to vanilla RegBayes, our kernel counterpart has several obvious advantages. First, the difference between two distributions can be naturally measured by RKHS norms. This makes it possible to regularize the posterior distribution as a whole, rather than through expectations of discriminant functions. 
Second, the framework of kernel Bayesian inference is totally nonparametric, where the priors and likelihood functions are all represented by respective samples. We further demonstrate the properties of kRegBayes through experiments in the next section.

Let $\hat{\mu}_{\mathrm{reg}} = \arg\min_\mu \mathcal{L}$. It is clear that solving $\mathcal{L}$ is substantially the same as solving $\hat{\mathcal{E}}_{\lambda,n}[\mu]$, and we summarize it in the following proposition.

Proposition 3. With the conditions in Prop. 2, we have

$$\hat{\mu}_{\mathrm{reg}}(x) = \Psi (K_X + \lambda \Lambda^+)^{-1} K_{:x}, \quad (11)$$

where $\Psi = (\psi(y_1), \cdots, \psi(y_n))$, $(K_X)_{ij} = k_X(x_i, x_j)|_{1 \le i,j \le n}$, $\Lambda^+ = \mathrm{diag}(1/\beta_1^+, \cdots, 1/\beta_m^+, 1/\delta, \cdots, 1/\delta)$, and $K_{:x} = (k_X(x, x_1), \cdots, k_X(x, x_n))^\top$.

5 Experiments

In this section, we compare the results of kRegBayes and several other baselines on two state-space filtering tasks. The mechanism behind kernel filtering is stated in [5] and we provide a detailed introduction in the Appendix, including all the formulas used in the implementation.

Toy dynamics This experiment is a twist of that used in [5]. We report the results of the extended Kalman filter (EKF) [21], the unscented Kalman filter (UKF) [22], kernel Bayes' rule (KBR) [5], kernel Bayesian learning with thresholding regularization (pKBR) and kRegBayes.

The data points $\{(\theta_t, x_t, y_t)\}$ are generated from the dynamics

$$\theta_{t+1} = \theta_t + 0.4 + \xi_t \pmod{2\pi}, \qquad \begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = (1 + \sin(8\theta_{t+1})) \begin{pmatrix} \cos\theta_{t+1} \\ \sin\theta_{t+1} \end{pmatrix} + \zeta_t, \quad (12)$$

where $\theta_t$ is the hidden state, $(x_t, y_t)$ is the observation, $\xi_t \sim \mathcal{N}(0, 0.04)$ and $\zeta_t \sim \mathcal{N}(0, 0.04)$. Note that this dynamics is nonlinear in both the transition and observation functions. The observation model is an oscillation around the unit circle. 
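The toy dynamics (12) can be simulated directly; a short sketch (ours, not the paper's code) that generates the $(\theta_t, x_t, y_t)$ sequence, where the noise standard deviation $0.2$ corresponds to the stated variance $0.04$:

```python
import numpy as np

def simulate(T, theta0=0.0, seed=0):
    """Generate T steps of the toy dynamics in Eq. (12)."""
    rng = np.random.default_rng(seed)
    thetas, obs = [], []
    theta = theta0
    for _ in range(T):
        # hidden state: rotate by 0.4 plus N(0, 0.04) noise, wrapped to [0, 2*pi)
        theta = (theta + 0.4 + rng.normal(0, 0.2)) % (2 * np.pi)
        # observation: oscillation around the unit circle, plus N(0, 0.04) noise
        radius = 1 + np.sin(8 * theta)
        xy = radius * np.array([np.cos(theta), np.sin(theta)])
        xy += rng.normal(0, 0.2, size=2)
        thetas.append(theta)
        obs.append(xy)
    return np.array(thetas), np.array(obs)

thetas, obs = simulate(1000)   # e.g. a training sequence of length 1000
```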
There are 1000 training data and 200 validation/test data for each algorithm.

We suppose that EKF, UKF and kRegBayes know the true dynamics of the model and the first hidden state $\theta_1$. In this case, we use $\tilde{\theta}_{t+1} = \theta_1 + 0.4t \pmod{2\pi}$ and $(\tilde{x}_{t+1}, \tilde{y}_{t+1})^\top = (1 + \sin(8\tilde{\theta}_{t+1}))(\cos\tilde{\theta}_{t+1}, \sin\tilde{\theta}_{t+1})^\top$ as the supervision data point for the $(t+1)$-th step. We follow [5] to set our parameters.

The results are summarized in Fig. 1. pKBR has lower errors than KBR, which means the thresholding regularization is practically no worse than the original squared regularization. The lower MSE of kRegBayes compared with pKBR shows that the posterior regularization successfully incorporates information from the equations of the dynamics. Moreover, pKBR and kRegBayes run faster than KBR. The total running times for 50 random datasets of pKBR, kRegBayes and KBR are respectively 601.3s, 677.5s and 3667.4s.

Figure 1: Mean running MSEs against time steps for each algorithm. (Best viewed in color)

Camera position recovery In this experiment, we build a scene containing a table and a chair, which is derived from classchair.pov (http://www.oyonale.com). With a fixed focal point, the position of the camera uniquely determines the view of the scene. The task of this experiment is to estimate the position of the camera given the image. This is a problem with practical applications in remote sensing and robotics.

We vary the position of the camera in a plane with a fixed height. The transition equations of the hidden states are

$$\theta_{t+1} = \theta_t + 0.2 + \xi_\theta, \quad r_{t+1} = \max(R_1, \min(R_2, r_t + \xi_r)), \quad x_{t+1} = r_{t+1}\cos\theta_{t+1}, \quad y_{t+1} = r_{t+1}\sin\theta_{t+1},$$

where $\xi_\theta \sim \mathcal{N}(0, 4\times 10^{-4})$, $\xi_r \sim \mathcal{N}(0, 1)$, $0 \le R_1 < R_2$ are two constants, and $\{(x_t, y_t)\}_{t=1}^m$ are treated as the hidden variables. 
As the observation at the $t$-th step, we render a $100 \times 100$ image with the camera located at $(x_t, y_t)$. For training data, we set $R_1 = 0$ and $R_2 = 10$, while for validation and test data we set $R_1 = 5$ and $R_2 = 7$. The motivation is to test the efficacy of enforcing the posterior distribution to concentrate around distance 6 by kRegBayes. We show a sample set of training and test images in Fig. 2.

We compare KBR, pKBR and kRegBayes with the traditional linear Kalman filter (KF [23]). Following [4], we down-sample the images and train a linear regressor for the observation model. In all experiments, we flatten the images to a column vector and apply Gaussian RBF kernels where needed. The kernel bandwidths are set to be the median distances in the training data. Based on experiments on the validation dataset, we set $\lambda_T = 10^{-6} = 2\delta_T$ and $\mu_T = 10^{-5}$.

Figure 2: First several frames of training data (upper row) and test data (lower row).

Figure 3: (a) MSEs for different algorithms (best viewed in color). Since KF performs much worse than kernel filters, we use a different scale and plot it on the right y-axis. (b) Probability histograms for the distance between each state and the scene center. All algorithms use 100 training data.

To provide supervision for kRegBayes, we uniformly generate 2000 data points $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{2000}$ on the circle $r = 6$. Given the previous estimate $(\tilde{x}_t, \tilde{y}_t)$, we first compute $\hat{\theta}_t = \arctan(\tilde{y}_t/\tilde{x}_t)$ (where the value $\hat{\theta}_t$ is adapted according to the quadrant of $(\tilde{x}_t, \tilde{y}_t)$) and estimate $(\breve{x}_{t+1}, \breve{y}_{t+1}) = (\cos(\hat{\theta}_t + 0.4), \sin(\hat{\theta}_t + 0.4))$. 
Next, we find the nearest point to $(\breve{x}_{t+1}, \breve{y}_{t+1})$ in the supervision set, say $(\tilde{x}_k, \tilde{y}_k)$, and add the regularization $\mu_T \|\mu(I_{t+1}) - \phi(\tilde{x}_k, \tilde{y}_k)\|$ to the posterior embedding, where $I_{t+1}$ denotes the $(t+1)$-th image.

We vary the size of the training dataset from 100 to 300 and report the results of KBR, pKBR, kRegBayes and KF on 200 test images in Fig. 3. KF performs much worse than all three kernel filters due to the extreme non-linearity. The result of pKBR is a little worse than that of KBR, but the gap decreases as the training dataset becomes larger. kRegBayes always performs the best. Note that the advantage becomes less obvious as more data come. This is because kernel methods can learn the distance relation better with more data, and posterior regularization tends to be more useful when data are not abundant and domain knowledge matters. Furthermore, Fig. 3(b) shows that the posterior regularization helps the distances to concentrate.

6 Conclusions

We propose an optimizational framework for kernel Bayesian inference. With thresholding regularization, the minimizer of the framework is shown to be a reasonable estimator of the posterior kernel embedding. In addition, we propose a posterior regularized kernel Bayesian inference framework called kRegBayes. These frameworks are applied to non-linear state-space filtering tasks and the results of different algorithms are compared extensively.

Acknowledgements

We thank all the anonymous reviewers for valuable suggestions. The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China Projects (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, and Tsinghua Initiative Scientific Research Program (No. 20141080934).

References

[1] Alex J Smola and Bernhard Schölkopf. Learning with kernels. 
Citeseer, 1998.

[2] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[3] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31. Springer, 2007.

[4] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968. ACM, 2009.

[5] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes' rule. In Advances in Neural Information Processing Systems, pages 1737–1745, 2011.

[6] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

[7] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alex Smola. Hilbert space embeddings of hidden Markov models. 2010.

[8] Le Song, Arthur Gretton, and Carlos Guestrin. Nonparametric tree graphical models. In International Conference on Artificial Intelligence and Statistics, pages 765–772, 2010.

[9] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. arXiv preprint arXiv:1206.4655, 2012.

[10] Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of POMDPs. arXiv preprint arXiv:1210.4887, 2012.

[11] Peter M. Williams. Bayesian conditionalisation and the principle of minimum information. The British Journal for the Philosophy of Science, 31(2), 1980.

[12] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions.
Neural Computation, 17(1):177–204, 2005.

[13] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1823–1830, 2012.

[14] Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. The Journal of Machine Learning Research, 15(1):1799–1847, 2014.

[15] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. The Journal of Machine Learning Research, 7:2651–2667, 2006.

[16] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[17] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001–2049, 2010.

[18] Jun Zhu, Amr Ahmed, and Eric Xing. MedLDA: Maximum margin supervised topic models. The Journal of Machine Learning Research, 13:2237–2278, 2012.

[19] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[20] Ernesto De Vito and Andrea Caponnetto. Risk bounds for regularized least-squares algorithm with operator-valued kernels. Technical report, DTIC Document, 2005.

[21] Simon J. Julier and Jeffrey K. Uhlmann. New extension of the Kalman filter to nonlinear systems. In AeroSense'97, pages 182–193. International Society for Optics and Photonics, 1997.

[22] Eric A. Wan and Rudolph Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pages 153–158. IEEE, 2000.

[23] Rudolph Emil Kalman.
A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[24] Alexander J. Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.

[25] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.