{"title": "t-divergence Based Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1494, "page_last": 1502, "abstract": "Approximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student's t-prior.", "full_text": "t-divergence Based Approximate Inference\n\nNan Ding2, S.V. N. Vishwanathan1,2, Yuan Qi2,1\nDepartments of 1Statistics and 2Computer Science\nPurdue University\nding10@purdue.edu, vishy@stat.purdue.edu, alanqi@cs.purdue.edu\n\nAbstract\n\nApproximate inference is an important technique for dealing with large, intractable graphical models based on the exponential family of distributions. We extend the idea of approximate inference to the t-exponential family by defining a new t-divergence. This divergence measure is obtained via convex duality between the log-partition function of the t-exponential family and a new t-entropy. We illustrate our approach on the Bayes Point Machine with a Student\u2019s t-prior.\n\n1 Introduction\n\nThe exponential family of distributions is ubiquitous in statistical machine learning. One prominent application is their use in modeling conditional independence between random variables via a graphical model. However, when the number of random variables is large and the underlying graph structure is complex, a number of computational issues need to be tackled in order to make inference feasible.
Therefore, a number of approximate techniques have been brought to bear on the problem. Two prominent approximate inference techniques are the Markov chain Monte Carlo (MCMC) methods [1] and the deterministic methods [2, 3].\n\nDeterministic methods are gaining significant research traction, mostly because of their high efficiency and practical success in many applications. Essentially, these methods are premised on the search for a proxy in an analytically solvable distribution family that approximates the true underlying distribution. To measure the closeness between the true and the approximate distributions, the relative entropy between these two distributions is used. When working with the exponential family, one uses the Shannon-Boltzmann-Gibbs (SBG) entropy, in which case the relative entropy is the well-known Kullback-Leibler (KL) divergence [2]. Numerous well-known algorithms in the exponential family, such as the mean field method [2, 4] and expectation propagation [3, 5], are based on this criterion.\n\nThe thin-tailed nature of the exponential family makes it unsuitable for designing algorithms which are potentially robust against certain kinds of noisy data. Notable work, including [6, 7], utilizes mixture/split exponential family based approximate models to improve robustness. Meanwhile, effort has also been devoted to developing alternate, generalized distribution families in statistics [e.g. 8, 9], statistical physics [e.g. 10, 11], and most recently in machine learning [e.g. 12]. Of particular interest to us is the t-exponential family1, which was first proposed by Tsallis and co-workers [10, 13, 14]. It is a special case of the more general \u03c6-exponential family of Naudts [11, 15\u201317].
Related work in [18] has applied the t-exponential family to generalize logistic regression and obtain an algorithm that is robust against certain types of label noise.\n\nIn this paper, we attempt to generalize deterministic approximate inference by using the t-exponential family; in other words, the approximate distribution used is from the t-exponential family. To obtain the corresponding divergence measure, as in the exponential family, we exploit the convex duality between the log-partition function of the t-exponential family and a new t-entropy2 to define the t-divergence. To illustrate the usage of the above procedure, we use it for approximate inference in the Bayes Point Machine (BPM) [3], but with a Student\u2019s t-prior.\n\nThe rest of the paper is organized as follows. Section 2 consists of a brief review of the t-exponential family. In Section 3, a new t-entropy is defined as the convex dual of the log-partition function of the t-exponential family. In Section 4, the t-divergence is derived; it is used for approximate inference in Section 5. Section 6 illustrates the inference approach by applying it to the Bayes Point Machine with a Student\u2019s t-prior, and we conclude the paper with a discussion in Section 7.\n\n1 Sometimes also called the q-exponential family or the Tsallis distribution.\n\n2 The t-exponential Family and Related Entropies\n\nThe t-exponential family was first proposed by Tsallis and co-workers [10, 13, 14]. It is defined as\n\np(x; \u03b8) := exp_t(\u27e8\u03a6(x), \u03b8\u27e9 \u2212 g_t(\u03b8)), (1)\n\nwhere\n\nexp_t(x) := exp(x) if t = 1; [1 + (1 \u2212 t)x]_+^{1/(1\u2212t)} otherwise. (2)\n\nThe inverse of the exp_t function is called log_t.
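For concreteness, the deformed exponential and its inverse can be sketched in a few lines of numpy. The function names exp_t and log_t below are ours; the snippet only illustrates the definitions above.

```python
import numpy as np

def exp_t(x, t):
    # Deformed exponential from Eq. (1): exp(x) when t = 1,
    # [1 + (1 - t) x]_+^{1/(1-t)} otherwise.
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def log_t(x, t):
    # Inverse of exp_t on positive values: log(x) when t = 1,
    # (x^{1-t} - 1)/(1 - t) otherwise.
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

x = np.linspace(-0.5, 1.5, 9)
for t in (0.5, 1.0, 1.5):
    y = exp_t(x, t)
    assert np.all(y > 0)                # exp_t stays positive on this range
    assert np.allclose(log_t(y, t), x)  # log_t inverts exp_t
```

For t > 1 the t-exponential decays polynomially rather than exponentially, which is the heavy-tailed behavior exploited later for robustness.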
Note that the log-partition function g_t(\u03b8) in (1) preserves convexity and satisfies\n\n\u2207_\u03b8 g_t(\u03b8) = E_q[\u03a6(x)]. (3)\n\nHere q(x) is called the escort distribution of p(x), and is defined as\n\nq(x) := p(x)^t / \u222b p(x)^t dx. (4)\n\nSee the supplementary material for a proof of convexity of g_t(\u03b8) based on material from [17], and for a detailed review of the t-exponential family of distributions.\n\nThere are various generalizations of the Shannon-Boltzmann-Gibbs (SBG) entropy which have been proposed in statistical physics and paired with the t-exponential family of distributions. Perhaps the most well known among them is the Tsallis entropy [10]:\n\nH_tsallis(p) := \u2212\u222b p(x)^t log_t p(x) dx. (5)\n\nNaudts [11, 15, 16, 17] proposed a more general framework, wherein the familiar exp and log functions are generalized to exp_\u03c6 and log_\u03c6 functions which are defined via a function \u03c6. These generalized functions are used to define a family of distributions, and corresponding to this family an entropy-like measure called the information content I_\u03c6(p), as well as its divergence measure, is defined. The information content is the dual of a function F(\u03b8), where\n\n\u2207_\u03b8 F(\u03b8) = E_p[\u03a6(x)]. (6)\n\nSetting \u03c6(p) = p^t in the Naudts framework recovers the t-exponential family defined in (1). Interestingly, when \u03c6(p) = (1/t) p^{2\u2212t}, the information content I_\u03c6 is exactly the Tsallis entropy (5).\n\nAnother well-known non-SBG entropy is the R\u00e9nyi entropy [19]. The R\u00e9nyi \u03b1-entropy (when \u03b1 \u2260 1) of the probability distribution p(x) is defined as:\n\nH_\u03b1(p) = (1/(1 \u2212 \u03b1)) log \u222b p(x)^\u03b1 dx. (7)\n\nBesides these entropies proposed in statistical physics, it is also worth noting efforts that work with generalized linear models or utilize different divergence measures, such as [5, 8, 20, 21].\n\nIt is well known that the negative SBG entropy is the Fenchel dual of the log-partition function of an exponential family distribution. This fact is crucially used in variational inference [2]. Although all of the above generalized entropies are useful in their own way, none of them satisfies this important property for the t-exponential family. In the following sections we attempt to find an entropy which satisfies this property, and outline the principles of approximate inference using the t-exponential family. Note that although our main focus is the t-exponential family, we believe that our results can also be extended to the more general \u03c6-exponential family of Naudts [15, 17].\n\n2 Although closely related, our t-entropy definition is different from either the Tsallis entropy [10] or the information content in [17]. Nevertheless, it can be regarded as an example of the generalized framework of entropy proposed in [8].\n\n3 Convex Duality and the t-Entropy\n\nDefinition 1 (Inspired by Wainwright and Jordan [2]) The t-entropy of a distribution p(x; \u03b8) is defined as\n\nH_t(p(x; \u03b8)) := \u2212\u222b q(x; \u03b8) log_t p(x; \u03b8) dx = \u2212E_q[log_t p(x; \u03b8)], (8)\n\nwhere q(x; \u03b8) is the escort distribution of p(x; \u03b8). It is straightforward to verify that the t-entropy is non-negative. Furthermore, the following theorem establishes the duality between \u2212H_t and g_t. The proof is provided in the supplementary material.
This extends Theorem 3.4 of [2] to the t-entropy.\n\nTheorem 2 For any \u03bc, define \u03b8(\u03bc) (if it exists) to be the parameter of the t-exponential family such that\n\n\u03bc = E_{q(x;\u03b8(\u03bc))}[\u03a6(x)] = \u222b \u03a6(x) q(x; \u03b8(\u03bc)) dx. (9)\n\nThen\n\ng*_t(\u03bc) = \u2212H_t(p(x; \u03b8(\u03bc))) if \u03b8(\u03bc) exists; +\u221e otherwise, (10)\n\nwhere g*_t(\u03bc) denotes the Fenchel dual of g_t(\u03b8). By duality it also follows that\n\ng_t(\u03b8) = sup_\u03bc {\u27e8\u03bc, \u03b8\u27e9 \u2212 g*_t(\u03bc)}. (11)\n\nFrom Theorem 2, it is obvious that H_t(\u03bc) is a concave function. Below, we derive the t-entropy function corresponding to two commonly used distributions. See Figure 1 for a graphical illustration.\n\nExample 1 (t-entropy of the Bernoulli distribution) Assume the Bernoulli distribution is Bern(p) with parameter p. The t-entropy is\n\nH_t(p) = (\u2212p^t log_t p \u2212 (1 \u2212 p)^t log_t(1 \u2212 p)) / (p^t + (1 \u2212 p)^t) = (1 \u2212 (p^t + (1 \u2212 p)^t)^{\u22121}) / (1 \u2212 t). (12)\n\nExample 2 (t-entropy of the Student\u2019s t-distribution) Assume that a k-dimensional Student\u2019s t-distribution p(x; \u03bc, \u03a3, v) is given by (54); then the t-entropy of p(x; \u03bc, \u03a3, v) is given by\n\nH_t(p(x)) = \u2212(\u03a8/(1 \u2212 t))(1 + v^{\u22121}) + 1/(1 \u2212 t), (13)\n\nwhere K = (v\u03a3)^{\u22121}, v = 2/(t \u2212 1) \u2212 k, and \u03a8 = (\u0393((v+k)/2) / ((\u03c0v)^{k/2} \u0393(v/2) |\u03a3|^{1/2}))^{\u22122/(v+k)}.\n\n3.1 Relation with the Tsallis Entropy\n\nUsing (4), (5), and (8), the relation between the t-entropy and the Tsallis entropy is obvious.
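The Bernoulli closed form in Example 1 can be checked directly against Definition (8). A minimal sketch (helper names are ours; the closed form is written here as (1 - Z^-1)/(1 - t) with Z = p^t + (1 - p)^t):

```python
import numpy as np

def log_t(x, t):
    # log_t(x) = (x^{1-t} - 1)/(1 - t); ordinary log when t = 1.
    return np.log(x) if t == 1.0 else (x ** (1.0 - t) - 1.0) / (1.0 - t)

def t_entropy_bern(p, t):
    # t-entropy of Bern(p) via Definition (8): H_t = -E_q[log_t p],
    # where q is the escort distribution, q proportional to p^t.
    probs = np.array([p, 1.0 - p])
    escort = probs ** t / np.sum(probs ** t)
    return -np.sum(escort * log_t(probs, t))

p, t = 0.3, 1.5
Z = p ** t + (1 - p) ** t
closed_form = (1.0 - 1.0 / Z) / (1.0 - t)        # closed form of Example 1
assert np.isclose(t_entropy_bern(p, t), closed_form)
assert t_entropy_bern(p, t) >= 0.0               # t-entropy is non-negative
# t = 1 recovers the SBG (Shannon) entropy of Bern(p)
assert np.isclose(t_entropy_bern(p, 1.0),
                  -(p * np.log(p) + (1 - p) * np.log(1 - p)))
```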
Basically, the t-entropy is a normalized version of the Tsallis entropy,\n\nH_t(p) = \u2212(1/\u222b p(x)^t dx) \u222b p(x)^t log_t p(x) dx = (1/\u222b p(x)^t dx) H_tsallis(p). (14)\n\nFigure 1: t-entropy corresponding to two well-known probability distributions. Left: the Bernoulli distribution Bern(x; p), for t \u2208 {0.1, 0.5, 1.0, 1.5, 1.9}; Right: the Student\u2019s t-distribution St(x; 0, \u03c3^2, v), where v = 2/(t \u2212 1) \u2212 1, for t \u2208 {1.0, 1.3, 1.6, 1.9}. One can recover the SBG entropy by setting t = 1.0.\n\n3.2 Relation with the R\u00e9nyi Entropy\n\nWe can equivalently rewrite the R\u00e9nyi entropy as:\n\nH_\u03b1(p) = (1/(1 \u2212 \u03b1)) log \u222b p(x)^\u03b1 dx = \u2212log (\u222b p(x)^\u03b1 dx)^{\u22121/(1\u2212\u03b1)}. (15)\n\nThe t-entropy of p(x) (when t \u2260 1) is equal to\n\nH_t(p) = \u2212\u222b p(x)^t log_t p(x) dx / \u222b p(x)^t dx (16)\n= \u2212log_t (\u222b p(x)^t dx)^{\u22121/(1\u2212t)}. (17)\n\nTherefore, when \u03b1 = t,\n\nH_t(p) = \u2212log_t(exp(\u2212H_\u03b1(p))). (18)\n\nWhen t and \u03b1 \u2192 1, both entropies go to the SBG entropy.\n\n4 The t-divergence\n\nRecall that the Bregman divergence defined by a convex function \u2212H between p and \u02dcp is [22]:\n\nD(p \u2225 \u02dcp) = \u2212H(p) + H(\u02dcp) + \u222b (dH(\u02dcp)/d\u02dcp) (p(x) \u2212 \u02dcp(x)) dx.\n\nFor the SBG entropy, it is easy to verify that the Bregman divergence leads to the relative SBG entropy (also widely known as the Kullback-Leibler (KL) divergence).
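Identities (14) and (18) are easy to verify numerically for a discrete distribution; a short sketch (t = 1.5, all names ours):

```python
import numpy as np

t = 1.5
p = np.array([0.5, 0.3, 0.2])              # an arbitrary discrete distribution

def log_t(x):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

Z = np.sum(p ** t)                          # escort normalizer, sum of p(x)^t
escort = p ** t / Z
H_t = -np.sum(escort * log_t(p))            # t-entropy, Definition (8)
H_tsallis = -np.sum(p ** t * log_t(p))      # Tsallis entropy, Eq. (5)
H_renyi = np.log(Z) / (1.0 - t)             # Renyi entropy with alpha = t

assert np.isclose(H_t, H_tsallis / Z)               # Eq. (14): normalized Tsallis
assert np.isclose(H_t, -log_t(np.exp(-H_renyi)))    # Eq. (18): Renyi relation
```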
Analogously, one can define the t-divergence3 as the Bregman divergence, or relative entropy, based on the t-entropy.\n\nDefinition 3 The t-divergence, which is the relative t-entropy between two distributions p(x) and \u02dcp(x), is defined as\n\nD_t(p \u2225 \u02dcp) = \u222b q(x) log_t p(x) \u2212 q(x) log_t \u02dcp(x) dx. (19)\n\nThe following theorem states the relationship between the relative t-entropy and the Bregman divergence. The proof is provided in the supplementary material.\n\nTheorem 4 The t-divergence is the Bregman divergence defined on the negative t-entropy \u2212H_t(p).\n\n3 Note that the t-divergence is not a special case of the divergence measure of Naudts [17], because the entropies are defined differently, although the derivations are fairly similar in spirit.\n\nThe t-divergence plays a central role in the variational inference that will be derived shortly. It also preserves the following properties:\n\n\u2022 D_t(p \u2225 \u02dcp) \u2265 0, \u2200p, \u02dcp.
The equality holds only for p = \u02dcp.\n\u2022 D_t(p \u2225 \u02dcp) \u2260 D_t(\u02dcp \u2225 p).\n\nExample 3 (Relative t-entropy between Bernoulli distributions) Assume two Bernoulli distributions Bern(p_1) and Bern(p_2); then the relative t-entropy D_t(p_1 \u2225 p_2) between these two distributions is:\n\nD_t(p_1 \u2225 p_2) = (p_1^t log_t p_1 + (1 \u2212 p_1)^t log_t(1 \u2212 p_1) \u2212 p_1^t log_t p_2 \u2212 (1 \u2212 p_1)^t log_t(1 \u2212 p_2)) / (p_1^t + (1 \u2212 p_1)^t) (20)\n= (1 \u2212 p_1^t p_2^{1\u2212t} \u2212 (1 \u2212 p_1)^t (1 \u2212 p_2)^{1\u2212t}) / ((1 \u2212 t)(p_1^t + (1 \u2212 p_1)^t)). (21)\n\nExample 4 (Relative t-entropy between Student\u2019s t-distributions) Assume two Student\u2019s t-distributions p_1(x; \u03bc_1, \u03a3_1, v) and p_2(x; \u03bc_2, \u03a3_2, v); then the relative t-entropy D_t(p_1 \u2225 p_2) between these two distributions is:\n\nD_t(p_1 \u2225 p_2) = \u222b q_1(x) log_t p_1(x) \u2212 q_1(x) log_t p_2(x) dx (22)\n= (\u03a8_1/(1 \u2212 t))(1 + v^{\u22121}) \u2212 (\u03a8_2/(1 \u2212 t)) Tr(K_2 \u03a3_1) \u2212 (\u03a8_2/(1 \u2212 t)) \u03bc_1^T K_2 \u03bc_1 + (2\u03a8_2/(1 \u2212 t)) \u03bc_1^T K_2 \u03bc_2 \u2212 (\u03a8_2/(1 \u2212 t))(\u03bc_2^T K_2 \u03bc_2 + 1). (23)\n\nFigure 2: The t-divergence between: Left: Bern(p_1) and Bern(p_2 = 0.5), for t \u2208 {0.1, 0.5, 1.0, 1.5, 1.9}; Middle: St(x; \u03bc, 1, v) and St(x; 0, 1, v); Right: St(x; 0, \u03c3^2, v) and St(x; 0, 1, v), where v = 2/(t \u2212 1) \u2212 1, for t \u2208 {1.0, 1.3, 1.6, 1.9}.\n\n5 Approximate Inference in the t-Exponential Family\n\nIn essence, the deterministic
approximate inference finds an approximate distribution, from an analytically tractable distribution family, which minimizes the relative entropy (e.g. the KL divergence in the exponential family) with respect to the true distribution. Since the relative entropy is not symmetric, the results of minimizing D(p \u2225 \u02dcp) and D(\u02dcp \u2225 p) are different. In the main body of the paper we describe methods which minimize D(p \u2225 \u02dcp), where \u02dcp comes from the t-exponential family. Algorithms which minimize D(\u02dcp \u2225 p) are described in the supplementary material.\n\nGiven an arbitrary probability distribution p(x), in order to obtain a good approximation \u02dcp(x; \u03b8) in the t-exponential family, we minimize the relative t-entropy (19):\n\n\u02dcp = argmin_{\u02dcp} D_t(p \u2225 \u02dcp) = argmin_{\u02dcp} \u222b q(x) log_t p(x) \u2212 q(x) log_t \u02dcp(x; \u03b8) dx. (24)\n\nHere q(x) = (1/Z) p(x)^t denotes the escort of the original distribution p(x). Since\n\n\u02dcp(x; \u03b8) = exp_t(\u27e8\u03a6(x), \u03b8\u27e9 \u2212 g_t(\u03b8)), (25)\n\nusing the fact that \u2207_\u03b8 g_t(\u03b8) = E_{\u02dcq}[\u03a6(x)], one can take the derivative of (24) with respect to \u03b8 and set it to zero:\n\nE_q[\u03a6(x)] = E_{\u02dcq}[\u03a6(x)]. (26)\n\nIn other words, the approximate distribution can be obtained by matching the escort expectations of \u03a6(x) between the two distributions.\n\nThe escort expectation matching in (26) is reminiscent of the moment matching in the Power-EP [5] or Fractional BP [23] algorithms, where the approximate distribution is obtained by\n\nE_{\u02dcp}[\u03a6(x)] = E_{p^\u03b1 \u02dcp^{1\u2212\u03b1}/Z}[\u03a6(x)]. (27)\n\nThe main reason for using the t-divergence, however, is not to address the computational or convergence issues, as is done in the case of Power-EP/Fractional BP.
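Escort expectation matching (26) can be illustrated on a toy discrete t-exponential family. The sketch below is entirely our construction (support {0, 1, 2}, Phi(x) = x, t = 1.5): it solves for g_t(theta) by bisection, grid-minimizes the t-divergence (24), and checks that the minimizer matches the escort means.

```python
import numpy as np

t = 1.5
xs = np.array([0.0, 1.0, 2.0])     # toy support, with Phi(x) = x
p = np.array([0.5, 0.2, 0.3])      # target distribution to approximate

def exp_t(u):
    # Clamping keeps the bisection well behaved outside exp_t's domain (t > 1).
    v = np.maximum(1e-12, 1.0 + (1.0 - t) * u)
    return v ** (1.0 / (1.0 - t))

def log_t(x):
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def log_partition(theta):
    # Solve for g with sum_x exp_t(theta * x - g) = 1 by bisection;
    # the sum is decreasing in g, so a simple bracket suffices.
    lo, hi = theta * xs.max() - 50.0, theta * xs.max() + 50.0
    for _ in range(200):
        g = 0.5 * (lo + hi)
        if np.sum(exp_t(theta * xs - g)) > 1.0:
            lo = g
        else:
            hi = g
    return g

q = p ** t / np.sum(p ** t)         # escort of the target, Eq. (4)

def t_div(theta):
    # D_t(p || p~) from Eq. (24), with log_t p~(x) = theta * x - g.
    pt = exp_t(theta * xs - log_partition(theta))
    return np.sum(q * (log_t(p) - log_t(pt)))

thetas = np.linspace(-3.0, 3.0, 2001)
best = thetas[np.argmin([t_div(th) for th in thetas])]

pt = exp_t(best * xs - log_partition(best))
qt = pt ** t / np.sum(pt ** t)      # escort of the approximation
# Minimizing D_t matches escort expectations, Eq. (26), up to grid resolution.
assert abs(np.dot(q, xs) - np.dot(qt, xs)) < 0.02
```

In the limit t = 1 the same construction reduces to ordinary moment matching in the exponential family.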
In contrast, we use the generalized exponential family (the t-exponential family) to build our approximate models. In this context, the t-divergence plays the same role as the KL divergence does in the exponential family.\n\nTo illustrate our ideas on a non-trivial problem, we apply escort expectation matching to the Bayes Point Machine (BPM) [3] with a Student\u2019s t-distribution prior.\n\n6 Bayes Point Machine with Student\u2019s t-Prior\n\nLet D = {(x_1, y_1), . . . , (x_n, y_n)} be the training data. Consider a linear model parametrized by the k-dimensional weight vector w. For each training data point (x_i, y_i), the conditional distribution of the label y_i given x_i and w is modeled as [3]:\n\nt_i(w) = p(y_i | x_i, w) = \u03b5 + (1 \u2212 2\u03b5) \u0398(y_i \u27e8w, x_i\u27e9), (28)\n\nwhere \u0398(z) is the step function: \u0398(z) = 1 if z > 0 and \u0398(z) = 0 otherwise. By making a standard i.i.d. assumption about the data, the posterior distribution can be written as\n\np(w | D) \u221d p_0(w) \u220f_i t_i(w), (29)\n\nwhere p_0(w) denotes a prior distribution. Instead of using a multivariate Gaussian distribution as a prior, as was done by Minka [3], we use a Student\u2019s t-prior, because we want to build robust models:\n\np_0(w) = St(w; 0, I, v). (30)\n\nAs it turns out, the posterior p(w | D) is infeasible to obtain in practice. Therefore we find a multivariate Student\u2019s t-distribution to approximate the true posterior:\n\np(w | D) \u2248 \u02dcp(w) = St(w; \u02dc\u03bc, \u02dc\u03a3, v). (31)\n\nIn order to obtain such a distribution, we implement the Bayesian online learning method [24], which is also known as Assumed Density Filtering [25]. The extension to expectation propagation is similar to [3] and is omitted due to space limitations. The main idea is to process data points one by one and update the posterior using escort moment matching. Assume the approximate distribution after processing (x_1, y_1), . . .
, (x_{i\u22121}, y_{i\u22121}) to be \u02dcp_{i\u22121}(w), and define\n\n\u02dcp_0(w) = p_0(w), (32)\np_i(w) \u221d \u02dcp_{i\u22121}(w) t_i(w). (33)\n\nThen the approximate posterior \u02dcp_i(w) is updated as\n\n\u02dcp_i(w) = St(w; \u03bc^{(i)}, \u03a3^{(i)}, v) = argmin_{\u03bc,\u03a3} D_t(p_i(w) \u2225 St(w; \u03bc, \u03a3, v)). (34)\n\nBecause \u02dcp_i(w) is a k-dimensional Student\u2019s t-distribution with degrees of freedom v, for which \u03a6(w) = [w, w w^T] and t = 1 + 2/(v + k) (see Example 5 in Appendix A), it turns out that we only need\n\n\u222b q_i(w) w dw = \u222b \u02dcq_i(w) w dw, and (35)\n\u222b q_i(w) w w^T dw = \u222b \u02dcq_i(w) w w^T dw. (36)\n\nHere \u02dcq_i(w) \u221d \u02dcp_i(w)^t, q_i(w) \u221d \u02dcp_{i\u22121}(w)^t \u02dct_i(w), and\n\n\u02dct_i(w) = t_i(w)^t = \u03b5^t + ((1 \u2212 \u03b5)^t \u2212 \u03b5^t) \u0398(y_i \u27e8w, x_i\u27e9). (37)\n\nDenote \u02dcp_{i\u22121}(w) = St(w; \u03bc^{(i\u22121)}, \u03a3^{(i\u22121)}, v) and \u02dcq_{i\u22121}(w) = St(w; \u03bc^{(i\u22121)}, v \u03a3^{(i\u22121)}/(v + 2), v + 2) (also see Example 5), and we make use of the following relations:\n\nZ_1 = \u222b \u02dcp_{i\u22121}(w) \u02dct_i(w) dw (38)\n= \u03b5^t + ((1 \u2212 \u03b5)^t \u2212 \u03b5^t) \u222b_{\u2212\u221e}^{z} St(x; 0, 1, v) dx, (39)\n\nZ_2 = \u222b \u02dcq_{i\u22121}(w) \u02dct_i(w) dw (40)\n= \u03b5^t + ((1 \u2212 \u03b5)^t \u2212 \u03b5^t) \u222b_{\u2212\u221e}^{z} St(x; 0, v/(v + 2), v + 2) dx, (41)\n\ng = (1/Z_2) \u2207_\u03bc Z_1 = y_i \u03b1 x_i, (42)\n\nG = (1/Z_2) \u2207_\u03a3 Z_1 = \u2212(1/2) (y_i \u03b1 \u27e8x_i, \u03bc^{(i\u22121)}\u27e9 / (x_i^T \u03a3^{(i\u22121)} x_i)) x_i x_i^T, (43)\n\nwhere\n\n\u03b1 = ((1 \u2212 \u03b5)^t \u2212 \u03b5^t) St(z; 0, 1, v) / (Z_2 (x_i^T \u03a3^{(i\u22121)} x_i)^{1/2}) and z = y_i \u27e8x_i, \u03bc^{(i\u22121)}\u27e9 / (x_i^T \u03a3^{(i\u22121)} x_i)^{1/2}.\n\nEquations (39) and (41) are analogous to Eq. (5.17) in [3].
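The per-observation quantities above can be sketched with scipy. This is only an illustration under our own assumptions: all variable names are ours, and we assume scipy.stats.t's scale argument corresponds to the square root of the second slot in the paper's St(x; 0, sigma^2, v) notation.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(0)
k, v, eps = 5, 3.0, 0.01                 # dimension, degrees of freedom, label noise
t_param = 1.0 + 2.0 / (v + k)            # t = 1 + 2/(v + k)

mu, Sigma = np.zeros(k), np.eye(k)       # current posterior St(w; mu, Sigma, v)
x, y = rng.normal(size=k), 1.0           # one incoming observation

s2 = x @ Sigma @ x                        # x^T Sigma x
z = y * (x @ mu) / np.sqrt(s2)            # the z defined below Eq. (43)
c = (1 - eps) ** t_param - eps ** t_param
Z1 = eps ** t_param + c * student_t.cdf(z, df=v)                                   # Eq. (39)
Z2 = eps ** t_param + c * student_t.cdf(z, df=v + 2, scale=np.sqrt(v / (v + 2)))   # Eq. (41)
alpha = c * student_t.pdf(z, df=v) / (Z2 * np.sqrt(s2))                            # the alpha below Eq. (43)

g = y * alpha * x                         # Eq. (42)
mu_new = mu + Sigma @ g                   # mean update via the escort expectation
r = Z1 / Z2                               # scaling used in the covariance update

assert 0 < Z1 < 1 and 0 < Z2 < 1 and alpha > 0
assert mu_new.shape == (k,)
```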
By assuming that a regularity condition4 holds, \u222b and \u2207 can be interchanged in \u2207Z_1 of (42) and (43). Combining with (38) and (40), we obtain the escort expectations of p_i(w) from Z_1 and Z_2 (similar to Eqs. (5.12) and (5.13) in [3]):\n\nE_q[w] = (1/Z_2) \u222b \u02dcq_{i\u22121}(w) \u02dct_i(w) w dw = \u03bc^{(i\u22121)} + \u03a3^{(i\u22121)} g, (44)\n\nE_q[w w^T] \u2212 E_q[w] E_q[w]^T = (1/Z_2) \u222b \u02dcq_{i\u22121}(w) \u02dct_i(w) w w^T dw \u2212 E_q[w] E_q[w]^T = r \u03a3^{(i\u22121)} \u2212 \u03a3^{(i\u22121)} (g g^T \u2212 2G) \u03a3^{(i\u22121)}, (45)\n\nwhere r = Z_1/Z_2 and E_q[\u00b7] means the expectation with respect to q_i(w).\n\nSince the mean and variance of the escort of \u02dcp_i(w) are \u03bc^{(i)} and \u03a3^{(i)} (again see Example 5), after combining with (42) and (43),\n\n\u03bc^{(i)} = E_q[w] = \u03bc^{(i\u22121)} + \u03b1 y_i \u03a3^{(i\u22121)} x_i, (46)\n\n\u03a3^{(i)} = E_q[w w^T] \u2212 E_q[w] E_q[w]^T = r \u03a3^{(i\u22121)} \u2212 (\u03a3^{(i\u22121)} x_i) (\u03b1 y_i \u27e8x_i, \u03bc^{(i)}\u27e9 / (x_i^T \u03a3^{(i\u22121)} x_i)) (\u03a3^{(i\u22121)} x_i)^T. (47)\n\n6.1 Results\n\nIn the above Bayesian online learning algorithm, every time a new data point x_n comes in, p(\u03b8 | x_1, . . . , x_{n\u22121}) is used as a prior, and the posterior is computed by incorporating the likelihood p(x_n | \u03b8). The Student\u2019s t-distribution is a more conservative, or non-subjective, prior than the Gaussian distribution because of its heavy-tailed nature. More specifically, it means that the Student\u2019s t-based BPM can be more strongly influenced by newly arriving points.\n\nIn many binary classification problems, it is assumed that the underlying classification hyperplane is always fixed. However, in some real situations, this assumption might not hold. Especially, in\n\n4 This is a fairly standard technical requirement which is often proved using the Dominated Convergence Theorem (see e.g.
Section 9.2 of Rosenthal [26]).\n\nFigure 3: The number of differing signs between the base weight vector and the posterior mean, against the number of processed points, for the Gaussian prior and the Student\u2019s t-priors with v = 3 and v = 10. Left: case I; Right: case II.\n\nTable 1: The classification error of all the data points\n\n        Gauss  v=3    v=10\nCase I  0.337  0.242  0.254\nCase II 0.150  0.130  0.128\n\nan online learning problem, the data sequence coming in is time dependent, and it is possible that the underlying classifier is also time dependent. For a scenario like this, we require our learning machine to be able to self-adjust over time, given the data.\n\nIn our experiment, we build a synthetic online dataset which mimics the above scenario; that is, the underlying classification hyperplane changes during certain time intervals. Our sequence of data is composed of 4000 data points randomly generated from a 100-dimensional isotropic Gaussian distribution N(0, I). The sequence can be partitioned into 10 sub-sequences of length 400. During each sub-sequence s, there is a base weight vector w_b(s) \u2208 {\u22121, +1}^{100}. Each point x(i) of the sub-sequence is labeled as y(i) = sign(w(i)^T x(i)), where w(i) = w_b(s) + n and n is a random noise vector drawn from [\u22120.1, +0.1]^{100}. The base weight vector w_b(s) can be (I) totally randomly generated, or (II) generated from the base weight vector w_b(s\u22121) in the following way:\n\nw_b(s)_j = Rand{\u22121, +1} if j \u2208 [400s \u2212 399, 400s]; w_b(s\u22121)_j otherwise. (48)\n\nNamely, only 10% of the base weight vector is changed based upon the previous base weight vector. We compare the Bayes Point Machine with a Student\u2019s t-prior (with v = 3 and v = 10) to one with a Gaussian prior.
For both methods, \u03b5 = 0.01. We report (1) for each point, the number of differing signs between the base weight vector and the mean of the posterior, and (2) the error rate over all points. According to Fig. 3 and Table 1, we find that the Bayes Point Machine with the Student\u2019s t-prior adjusts itself significantly faster than the one with the Gaussian prior, and it also ends up with better classification results. We believe this mostly results from its heavy-tailed nature.\n\n7 Discussion\n\nIn this paper, we investigated the convex duality of the log-partition function of the t-exponential family, and defined a new t-entropy. By using the t-divergence as the divergence measure, we proposed approximate inference in the t-exponential family by matching the expectations of the escort distributions. The results in this paper can be extended to the more generalized \u03c6-exponential family of Naudts [15].\n\nThe t-divergence based approximate inference has only been applied to a toy example here. The focus of our future work is on utilizing this approach in various graphical models. Especially, it is important to investigate a new family of graphical models based on heavy-tailed distributions for applications involving noisy data.\n\nReferences\n\n[1] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1995.\n\n[2] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[3] T. Minka. Expectation Propagation for approximate Bayesian inference. PhD thesis, MIT Media Lab, Cambridge, USA, 2001.\n\n[4] Y. Weiss. Comparing the mean field method and belief propagation for approximate inference in MRFs. In David Saad and Manfred Opper, editors, Advanced Mean Field Methods. MIT Press, 2001.\n\n[5] T. Minka. Divergence measures and message passing.
Technical Report MSR-TR-2005-173, Microsoft Research, 2005.\n\n[6] C. Bishop, N. Lawrence, T. Jaakkola, and M. Jordan. Approximating posterior distributions in belief networks using mixtures. In Advances in Neural Information Processing Systems 10, 1997.\n\n[7] G. Bouchard and O. Zoeter. Split variational inference. In Proc. Intl. Conf. Machine Learning, 2009.\n\n[8] P. Grunwald and A. Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367\u20131433, 2004.\n\n[9] C. R. Shalizi. Maximum likelihood estimation for q-exponential (Tsallis) distributions, 2007. URL http://arxiv.org/abs/math.ST/0701854.\n\n[10] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52:479\u2013487, 1988.\n\n[11] J. Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 316:323\u2013334, 2002. URL http://arxiv.org/pdf/cond-mat/0203489.\n\n[12] T. D. Sears. Generalized Maximum Entropy, Convexity, and Machine Learning. PhD thesis, Australian National University, 2008.\n\n[13] A. Sousa and C. Tsallis. Student\u2019s t- and r-distributions: Unified derivation from an entropic variational principle. Physica A, 236:52\u201357, 1994.\n\n[14] C. Tsallis, R. S. Mendes, and A. R. Plastino. The role of constraints within generalized nonextensive statistics. Physica A: Statistical and Theoretical Physics, 261:534\u2013554, 1998.\n\n[15] J. Naudts. Generalized thermostatistics based on deformed exponential and logarithmic functions. Physica A, 340:32\u201340, 2004.\n\n[16] J. Naudts. Generalized thermostatistics and mean-field theory. Physica A, 332:279\u2013300, 2004.\n\n[17] J. Naudts. Estimators, escort probabilities, and \u03c6-exponential families in statistical physics. Journal of Inequalities in Pure and Applied Mathematics, 5(4), 2004.\n\n[18] N. Ding and S. V. N. Vishwanathan. t-logistic regression.
In Richard Zemel, John Shawe-Taylor, John Lafferty, Chris Williams, and Aron Culotta, editors, Advances in Neural Information Processing Systems 23, 2010.\n\n[19] A. R\u00e9nyi. On measures of information and entropy. In Proc. 4th Berkeley Symposium on Mathematics, Statistics and Probability, pages 547\u2013561, 1960.\n\n[20] J. D. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. Annual Conf. Computational Learning Theory, volume 12, pages 125\u2013133. ACM Press, New York, NY, 1999.\n\n[21] I. Csisz\u00e1r. Information type measures of differences of probability distributions and indirect observations. Studia Math. Hungarica, 2:299\u2013318, 1967.\n\n[22] K. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211\u2013246, 2001. Special issue on Theoretical Advances in On-line Learning, Game Theory and Boosting.\n\n[23] W. Wiegerinck and T. Heskes. Fractional belief propagation. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 438\u2013445, 2003.\n\n[24] M. Opper. A Bayesian approach to online learning. In On-line Learning in Neural Networks, pages 363\u2013378. Cambridge University Press, 1998.\n\n[25] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.\n\n[26] J. S. Rosenthal. A First Look at Rigorous Probability Theory. World Scientific Publishing, 2006.", "award": [], "sourceid": 856, "authors": [{"given_name": "Nan", "family_name": "Ding", "institution": null}, {"given_name": "Yuan", "family_name": "Qi", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}