{"title": "Learning from Distributions via Support Measure Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework.", "full_text": "Learning from Distributions via Support Measure\n\nMachines\n\nKrikamol Muandet\n\nMPI for Intelligent Systems, T\u00a8ubingen\nkrikamol@tuebingen.mpg.de\n\nKenji Fukumizu\n\nThe Institute of Statistical Mathematics, Tokyo\n\nfukumizu@ism.ac.jp\n\nFrancesco Dinuzzo\n\nMPI for Intelligent Systems, T\u00a8ubingen\nfdinuzzo@tuebingen.mpg.de\n\nBernhard Sch\u00a8olkopf\n\nMPI for Intelligent Systems, T\u00a8ubingen\n\nbs@tuebingen.mpg.de\n\nAbstract\n\nThis paper presents a kernel-based discriminative learning framework on prob-\nability measures. Rather than relying on large collections of vectorial training\nexamples, our framework learns using a collection of probability distributions\nthat have been constructed to meaningfully represent training data. 
By represent-\ning these probability distributions as mean embeddings in the reproducing kernel\nHilbert space (RKHS), we are able to apply many standard kernel-based learning\ntechniques in straightforward fashion. To accomplish this, we construct a gener-\nalization of the support vector machine (SVM) called a support measure machine\n(SMM). Our analyses of SMMs provides several insights into their relationship\nto traditional SVMs. Based on such insights, we propose a \ufb02exible SVM (Flex-\nSVM) that places different kernel functions on each training example. Experi-\nmental results on both synthetic and real-world data demonstrate the effectiveness\nof our proposed framework.\n\n1\n\nIntroduction\n\nDiscriminative learning algorithms are typically trained from large collections of vectorial training\nexamples. In many classical learning problems, however, it is arguably more appropriate to represent\ntraining data not as individual data points, but as probability distributions. There are, in fact, multiple\nreasons why probability distributions may be preferable.\n\nFirstly, uncertain or missing data naturally arises in many applications. For example, gene expres-\nsion data obtained from the microarray experiments are known to be very noisy due to various\nsources of variabilities [1]. In order to reduce uncertainty, and to allow for estimates of con\ufb01dence\nlevels, experiments are often replicated. Unfortunately, the feasibility of replicating the microarray\nexperiments is often inhibited by cost constraints, as well as the amount of available mRNA. To cope\nwith experimental uncertainty given a limited amount of data, it is natural to represent each array as\na probability distribution that has been designed to approximate the variability of gene expressions\nacross slides.\n\nProbability distributions may be equally appropriate given an abundance of training data. 
In data-\nrich disciplines such as neuroinformatics, climate informatics, and astronomy, a high throughput\nexperiment can easily generate a huge amount of data, leading to signi\ufb01cant computational chal-\nlenges in both time and space. Instead of scaling up one\u2019s learning algorithms, one can scale down\none\u2019s dataset by constructing a smaller collection of distributions which represents groups of similar\nsamples. Besides computational ef\ufb01ciency, aggregate statistics can potentially incorporate higher-\nlevel information that represents the collective behavior of multiple data points.\n\n1\n\n\fPrevious attempts have been made to learn from distributions by creating positive de\ufb01nite (p.d.)\nkernels on probability measures. In [2], the probability product kernel (PPK) was proposed as a\ngeneralized inner product between two input objects, which is in fact closely related to well-known\nkernels such as the Bhattacharyya kernel [3] and the exponential symmetrized Kullback-Leibler\n(KL) divergence [4]. In [5], an extension of a two-parameter family of Hilbertian metrics of Tops\u00f8e\nwas used to de\ufb01ne Hilbertian kernels on probability measures. In [6], the semi-group kernels were\ndesigned for objects with additive semi-group structure such as positive measures. Recently, [7] in-\ntroduced nonextensive information theoretic kernels on probability measures based on new Jensen-\nShannon-type divergences. Although these kernels have proven successful in many applications,\nthey are designed speci\ufb01cally for certain properties of distributions and application domains. More-\nover, there has been no attempt in making a connection to the kernels on corresponding input spaces.\n\nThe contributions of this paper can be summarized as follows. 
First, we prove the representer theorem for a regularization framework over the space of probability distributions, which is a generalization of regularization over the input space on which the distributions are defined (Section 2). Second, a family of positive definite kernels on distributions is introduced (Section 3). Based on such kernels, a learning algorithm on probability measures called the support measure machine (SMM) is proposed, and an SVM on the input space is provably a special case of the SMM. Third, the paper presents the relations between sample-based and distribution-based methods (Section 4). If the distributions depend only on their locations in the input space, the SMM reduces to a more flexible SVM that places a different kernel on each data point.

2 Regularization on probability distributions

Given a non-empty set X, let P denote the set of all probability measures P on a measurable space (X, A), where A is a σ-algebra of subsets of X. The goal of this work is to learn a function h : P → Y given a set of example pairs {(P_i, y_i)}_{i=1}^m, where P_i ∈ P and y_i ∈ Y. In other words, we consider a supervised setting in which the input training examples are probability distributions. In this paper, we focus on the binary classification problem, i.e., Y = {+1, −1}.

In order to learn from distributions, we employ a compact representation that not only preserves the necessary information about the individual distributions, but also permits efficient computations. That is, we adopt a Hilbert space embedding to represent each distribution as a mean function in an RKHS [8, 9]. Formally, let H denote an RKHS of functions f : X → R, endowed with a reproducing kernel k : X × X → R. The mean map from P into H is defined as

    µ : P → H,   P ↦ ∫_X k(x, ·) dP(x) .    (1)

We assume that k(x, ·) is bounded for any x ∈ X. 
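For intuition, the mean map (1) can be approximated from samples: for an empirical measure, the embedding is the average of kernel sections, µ̂_P(t) = (1/n) Σ_i k(x_i, t), and the identity E_P[f] = ⟨µ_P, f⟩_H can be checked directly. A minimal sketch (the Gaussian RBF kernel, sampling setup, and variable names are our illustrative choices, not prescribed by the paper):

```python
import numpy as np

def k(x, z, gamma=0.5):
    """Gaussian RBF embedding kernel k(x, z) = exp(-gamma/2 * ||x - z||^2)."""
    x, z = np.atleast_2d(x), np.atleast_2d(z)
    d2 = ((x[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * d2)            # shape (len(x), len(z))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                   # i.i.d. sample representing P

# A generic RKHS element f = sum_j beta_j k(z_j, .)
Z = rng.normal(size=(3, 2))
beta = np.array([0.7, -1.2, 0.4])

# E_P[f]: average of f over the sample from P
E_f = (k(X, Z) @ beta).mean()

# <mu_P, f>_H via the embedding: mu_P(z) = (1/n) sum_i k(x_i, z)
mu_at_Z = k(X, Z).mean(axis=0)
inner = mu_at_Z @ beta

assert np.isclose(E_f, inner)                   # E_P[f] = <mu_P, f>_H
```

For an empirical measure the identity holds exactly by construction; for a general P, both sides above are Monte Carlo estimates of the same quantity.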
It can be shown that, if k is characteristic, the map (1) is injective, i.e., all the information about the distribution is preserved [10]. For any P, letting µ_P = µ(P), we have the reproducing property

    E_P[f] = ⟨µ_P, f⟩_H ,  ∀f ∈ H .    (2)

That is, we can see the mean embedding µ_P as a feature map associated with the kernel K : P × P → R defined as K(P, Q) = ⟨µ_P, µ_Q⟩_H. Since sup_x ||k(x, ·)||_H < ∞, it also follows that K(P, Q) = ∫∫ ⟨k(x, ·), k(z, ·)⟩_H dP(x) dQ(z) = ∫∫ k(x, z) dP(x) dQ(z), where the second equality follows from the reproducing property of H. It is immediate that K is a p.d. kernel on P.

The following theorem shows that optimal solutions of a suitable class of regularization problems involving distributions can be expressed as a finite linear combination of mean embeddings.

Theorem 1. Given training examples (P_i, y_i) ∈ P × R, i = 1, . . . , m, a strictly monotonically increasing function Ω : [0, +∞) → R, and a loss function ℓ : (P × R^2)^m → R ∪ {+∞}, any f ∈ H minimizing the regularized risk functional

    ℓ(P_1, y_1, E_{P_1}[f], . . . , P_m, y_m, E_{P_m}[f]) + Ω(||f||_H)    (3)

admits a representation of the form f = Σ_{i=1}^m α_i µ_{P_i} for some α_i ∈ R, i = 1, . . . , m.

Theorem 1 clearly indicates how each distribution contributes to the minimizer of (3). Roughly speaking, the coefficient α_i controls the contribution of the i-th distribution through the mean embedding µ_{P_i}. Furthermore, if we restrict P to the class of Dirac measures δ_x on X and consider the training set {(δ_{x_i}, y_i)}_{i=1}^m, the functional (3) reduces to the usual regularization functional [11] and the solution reduces to f = Σ_{i=1}^m α_i k(x_i, ·). Therefore, the standard representer theorem is recovered as a particular case (see also [12] for more general results on representer theorems).

Note that, on the one hand, the minimization problem (3) is different from minimizing the functional E_{P_1} · · · E_{P_m} ℓ(x_1, y_1, f(x_1), . . . , x_m, y_m, f(x_m)) + Ω(||f||_H) for the special case of an additive loss ℓ. Therefore, the solution of our regularization problem differs from what one would get in the limit by training on infinitely many points sampled from P_1, . . . , P_m. On the other hand, it is also different from minimizing the functional ℓ(M_1, y_1, f(M_1), . . . , M_m, y_m, f(M_m)) + Ω(||f||_H), where M_i = E_{x∼P_i}[x]. In a sense, our framework is something in between.

3 Kernels on probability distributions

As the map (1) is linear in P, optimizing the functional (3) amounts to finding a function in H that approximates well the functions from P to R in the function class F := {P ↦ ∫_X g dP | P ∈ P, g ∈ C(X)}, where C(X) is a class of bounded continuous functions on X. Since δ_x ∈ P for any x ∈ X, it follows that C(X) ⊂ F ⊂ C(P), where C(P) is a class of bounded continuous functions on P endowed with the topology of weak convergence and the associated Borel σ-algebra. The following lemma states the relation between the RKHS H induced by the kernel k and the function class F.

Lemma 2. Assume that X is compact. The RKHS H induced by a kernel k is dense in F if k is universal, i.e., for every function F ∈ F and every ε > 0 there exists a function g ∈ H with sup_{P∈P} |F(P) − ∫ g dP| ≤ ε.

Proof. Assume that k is universal. Then, for every function f ∈ C(X) and every ε > 0 there exists a function g ∈ H induced by k with sup_{x∈X} |f(x) − g(x)| ≤ ε [13]. 
Hence, by linearity of F, for every F ∈ F and every ε > 0 there exists a function h ∈ H such that sup_{P∈P} |F(P) − ∫ h dP| ≤ ε. ∎

Nonlinear kernels on P can be defined in an analogous way to nonlinear kernels on X, by treating the mean embedding µ_P of P ∈ P as its feature representation. First, assume that the map (1) is injective and let ⟨·, ·⟩_P be an inner product on P. By linearity, we have ⟨P, Q⟩_P = ⟨µ_P, µ_Q⟩_H (cf. [8] for more details). Then, nonlinear kernels on P can be defined as K(P, Q) = κ(µ_P, µ_Q) = ⟨ψ(µ_P), ψ(µ_Q)⟩_{H_κ}, where κ is a p.d. kernel. As a result, many standard nonlinear kernels on X can be used to define nonlinear kernels on P as long as the kernel evaluation depends entirely on the inner product ⟨µ_P, µ_Q⟩_H, e.g., K(P, Q) = (⟨µ_P, µ_Q⟩_H + c)^d. Although requiring more computational effort, their practical use is simple and flexible. Specifically, the notion of p.d. kernels on distributions proposed in this work is so generic that standard kernel functions can be reused to derive kernels on distributions that are different from many other kernel functions proposed specifically for certain distributions.

It has recently been proved that the Gaussian RBF kernel given by K(P, Q) = exp(−(γ/2) ||µ_P − µ_Q||_H^2), ∀P, Q ∈ P, is universal w.r.t. C(P), given that X is compact and the map µ is injective [14]. Despite its success in real-world applications, the theory of kernel-based classifiers beyond the input space X ⊂ R^d, as also mentioned by [14], is still incomplete. It is therefore of theoretical interest to consider more general classes of universal kernels on probability distributions.

3.1 Support measure machines

This subsection extends SVMs to deal with probability distributions, leading to support measure machines (SMMs). 
In its general form, an SMM amounts to solving an SVM problem with the expected kernel K(P, Q) = E_{x∼P, z∼Q}[k(x, z)]. This kernel can be computed in closed form for certain classes of distributions and kernels k; examples are given in Table 1. Alternatively, one can approximate the kernel K(P, Q) by the empirical estimate

    K_emp(P̂_n, Q̂_m) = (1/(n·m)) Σ_{i=1}^n Σ_{j=1}^m k(x_i, z_j) ,    (4)

where P̂_n and Q̂_m are empirical distributions of P and Q given random samples {x_i}_{i=1}^n and {z_j}_{j=1}^m, respectively. A finite sample of size m from a distribution P suffices (with high probability) to compute an approximation within an error of O(m^{-1/2}).

Table 1: the analytic forms of the expected kernel K(P_i, P_j) = ⟨µ_{P_i}, µ_{P_j}⟩_H for different choices of embedding kernels and distributions.

Distributions        Embedding kernel k(x, y)                     K(P_i, P_j) = ⟨µ_{P_i}, µ_{P_j}⟩_H
Arbitrary P(m; Σ)    Linear ⟨x, y⟩                                m_i^T m_j + δ_ij tr Σ_i
Gaussian N(m; Σ)     Gaussian RBF exp(−(γ/2)||x − y||^2)          exp(−(1/2)(m_i − m_j)^T (Σ_i + Σ_j + γ^{-1} I)^{-1} (m_i − m_j)) / |γΣ_i + γΣ_j + I|^{1/2}
Gaussian N(m; Σ)     Polynomial degree 2 (⟨x, y⟩ + 1)^2           (⟨m_i, m_j⟩ + 1)^2 + tr Σ_iΣ_j + m_i^T Σ_j m_i + m_j^T Σ_i m_j
Gaussian N(m; Σ)     Polynomial degree 3 (⟨x, y⟩ + 1)^3           (⟨m_i, m_j⟩ + 1)^3 + 6 m_i^T Σ_iΣ_j m_j + 3(⟨m_i, m_j⟩ + 1)(tr Σ_iΣ_j + m_i^T Σ_j m_i + m_j^T Σ_i m_j)
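The closed forms in Table 1 can be checked against Monte Carlo estimates in the spirit of (4). A small sketch for the linear and Gaussian RBF rows (the particular means, covariances, and bandwidth below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = np.array([1.0, 0.0]), np.array([1.0, 2.0])
S1 = np.array([[0.5, 0.1], [0.1, 0.3]])
S2 = np.array([[0.4, -0.1], [-0.1, 0.6]])
gamma = 0.8
n = 200_000

X = rng.multivariate_normal(m1, S1, size=n)   # x ~ P_i = N(m1, S1)
Z = rng.multivariate_normal(m2, S2, size=n)   # z ~ P_j = N(m2, S2)

# Linear embedding kernel, i != j (Table 1): K(P_i, P_j) = m_i^T m_j
K_lin_closed = m1 @ m2
K_lin_mc = X.mean(axis=0) @ Z.mean(axis=0)    # the double sum in (4) factorizes
assert abs(K_lin_mc - K_lin_closed) < 0.05

# Gaussian RBF embedding kernel between Gaussians (Table 1):
# K = exp(-1/2 (m_i-m_j)^T (S_i+S_j+I/gamma)^{-1} (m_i-m_j)) / |gamma S_i + gamma S_j + I|^{1/2}
delta = m1 - m2
A = S1 + S2 + np.eye(2) / gamma               # note: gamma * A = gamma*S1 + gamma*S2 + I
K_rbf_closed = (np.exp(-0.5 * delta @ np.linalg.solve(A, delta))
                / np.sqrt(np.linalg.det(gamma * A)))
# One-sample-per-pair Monte Carlo of E[k(x, z)] (unbiased since x and z are independent)
K_rbf_mc = np.exp(-0.5 * gamma * ((X - Z) ** 2).sum(axis=1)).mean()
assert abs(K_rbf_mc - K_rbf_closed) < 0.01
```

In practice, the resulting Gram matrix over training distributions can be passed to any standard SVM solver that accepts precomputed kernels.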
Instead, if the sample set is sufficiently large, one may choose to approximate the true distribution by simpler probabilistic models, e.g., a mixture of Gaussians, and choose a kernel k whose expected value admits an analytic form. Storing only the parameters of the probabilistic models may save space compared to storing all data points.

Note that the standard SVM feature map φ(x) is usually nonlinear in x, whereas µ_P is linear in P. Thus, for an SMM, the first-level kernel k is used to obtain a vectorial representation of the measures, and the second-level kernel K allows for a nonlinear algorithm on distributions. For clarity, we will refer to k and K as the embedding kernel and the level-2 kernel, respectively.

4 Theoretical analyses

This section presents key theoretical aspects of the proposed framework, which reveal an important connection between kernel-based learning algorithms on the space of distributions and on the input space on which they are defined.

4.1 Risk deviation bound

Given a training sample {(P_i, y_i)}_{i=1}^m drawn i.i.d. from some unknown probability distribution P on P × Y, a loss function ℓ : R × R → R, and a function class Λ, the goal of statistical learning is to find the function f ∈ Λ that minimizes the expected risk functional R(f) = ∫_P ∫_X ℓ(y, f(x)) dP(x) dP(P, y). Since P is unknown, the empirical risk R_emp(f) = (1/m) Σ_{i=1}^m ∫_X ℓ(y_i, f(x)) dP_i(x) based on the training sample is considered instead. Furthermore, the risk functional can be simplified further by considering R̂_emp(f) = (1/(m·n)) Σ_{i=1}^m Σ_{j=1}^n ℓ(y_i, f(x_ij)) based on n samples x_ij drawn from each P_i.

Our framework, on the other hand, alleviates the problem by minimizing the risk functional R^µ(f) = ∫_P ℓ(y, E_P[f(x)]) dP(P, y) for f ∈ H, with corresponding empirical risk functional R^µ_emp(f) = (1/m) Σ_{i=1}^m ℓ(y_i, E_{P_i}[f(x)]) (cf. the discussion at the end of Section 2). It is often easier to optimize R^µ_emp(f), as the expectation can be computed exactly for certain choices of P_i and H. Moreover, for universal H, this simplification preserves all information of the distributions. Nevertheless, there is still a loss of information due to the loss function ℓ.

Due to the i.i.d. assumption, the analysis of the difference between R and R^µ can be simplified w.l.o.g. to the analysis of the difference between E_P[ℓ(y, f(x))] and ℓ(y, E_P[f(x)]) for a particular distribution P ∈ P. The theorem below provides a bound on this difference.

Theorem 3. Given an arbitrary probability distribution P with variance σ^2, a Lipschitz continuous function f : R → R with constant C_f, and an arbitrary loss function ℓ : R × R → R that is Lipschitz continuous in the second argument with constant C_ℓ, it follows that |E_{x∼P}[ℓ(y, f(x))] − ℓ(y, E_{x∼P}[f(x)])| ≤ 2 C_ℓ C_f σ for any y ∈ R.

Theorem 3 indicates that if the random variable x is concentrated around its mean, and the functions f and ℓ are well-behaved, i.e., Lipschitz continuous, then the loss deviation |E_P[ℓ(y, f(x))] − ℓ(y, E_P[f(x)])| will be small. 
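The bound of Theorem 3 is easy to probe numerically. A sketch (the particular loss, function, and distribution are our illustrative choices: the hinge loss has C_ℓ = 1 in its second argument, and f(t) = 2 tanh(t) has C_f = 2):

```python
import numpy as np

rng = np.random.default_rng(2)

# P: a concrete 1-d distribution with standard deviation sigma
mu, sigma = 0.3, 0.7
x = rng.normal(mu, sigma, size=1_000_000)

f = lambda t: 2.0 * np.tanh(t)                       # Lipschitz with C_f = 2
hinge = lambda y, t: np.maximum(0.0, 1.0 - y * t)    # C_l = 1 in the 2nd argument
y = 1.0

# |E_P[l(y, f(x))] - l(y, E_P[f(x)])| vs. the bound 2 * C_l * C_f * sigma
deviation = abs(hinge(y, f(x)).mean() - hinge(y, f(x).mean()))
bound = 2 * 1.0 * 2.0 * sigma
assert deviation <= bound
```

The gap between the two sides is typically much smaller than the bound; the bound is loose but holds for any Lipschitz loss and any y.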
As a result, if this holds for any distribution P_i in the training set {(P_i, y_i)}_{i=1}^m, the true risk deviation |R − R^µ| is also expected to be small.

4.2 Flexible support vector machines

It turns out that, for certain choices of distributions P_i, the linear SMM trained using {(P_i, y_i)}_{i=1}^m is equivalent to an SVM trained using samples {(x_i, y_i)}_{i=1}^m with an appropriate choice of kernel function.

Lemma 4. Let k(x, z) be a bounded p.d. kernel on a measure space such that ∫∫ k(x, z)^2 dx dz < ∞, and let g(x, x̃) be a square-integrable function such that ∫ g(x, x̃) dx̃ < ∞ for all x. Given a sample {(P_i, y_i)}_{i=1}^m, where each P_i is assumed to have a density given by g(x_i, x), the linear SMM is equivalent to the SVM on the training sample {(x_i, y_i)}_{i=1}^m with kernel K_g(x, z) = ∫∫ k(x̃, z̃) g(x, x̃) g(z, z̃) dx̃ dz̃.

Note that the important assumption for this equivalence is that the distributions P_i differ only in their location in the parameter space. This need not be the case in all possible applications of SMMs. Furthermore, we have K_g(x, z) = ⟨∫ k(x̃, ·) g(x, x̃) dx̃, ∫ k(z̃, ·) g(z, z̃) dz̃⟩_H. Thus, it is clear that the feature map of x depends not only on the kernel k, but also on the density g(x, x̃). Consequently, by virtue of Lemma 4, the kernel K_g allows the SVM to place different kernels at each data point. We call this algorithm a flexible SVM (Flex-SVM).

Consider, for example, the linear SMM with Gaussian distributions N(x_1; σ_1^2·I), . . . , N(x_m; σ_m^2·I) and Gaussian RBF kernel k_{σ^2} with bandwidth parameter σ. The convolution theorem for Gaussian distributions implies that this SMM is equivalent to a flexible SVM that places a data-dependent kernel k_{σ^2 + 2σ_i^2}(x_i, ·) on training example x_i, i.e., a Gaussian RBF kernel with larger bandwidth.

5 Related works

The kernel K(P, Q) = ⟨µ_P, µ_Q⟩_H is in fact a special case of the Hilbertian metric [5], with the associated kernel K(P, Q) = E_{P,Q}[k(x, x̃)], and of the generative mean map kernel (GMMK) proposed by [15]. In the GMMK, the kernel between two objects x and y is defined via p̂_x and p̂_y, which are estimated probabilistic models of x and y, respectively. That is, a probabilistic model p̂_x is learned for each example and used as a surrogate to construct the kernel between those examples. The idea of surrogate kernels has also been adopted by the probability product kernel (PPK) [2]. In this case, we have K_ρ(p, p′) = ∫_X p(x)^ρ p′(x)^ρ dx, which has been shown to be a special case of the GMMK when ρ = 1 [15]. Consequently, the GMMK, the PPK with ρ = 1, and our linear kernel are equivalent when the embedding kernel is k(x, x′) = δ(x − x′). More recently, the empirical kernel (4) was employed in an unsupervised way for multi-task learning to generalize to a previously unseen task [16]. In contrast, we treat the probability distributions in a supervised way (cf. the regularized functional (3)), and the kernel is not restricted to only the empirical kernel.

The use of expected kernels in dealing with uncertainty in the input data has a connection to robust SVMs. For instance, a generalized form of the SVM in [17] incorporates the probabilistic uncertainty into the maximization of the margin. This results in a second-order cone program (SOCP) that generalizes the standard SVM. In the SOCP, one needs to specify a parameter τ_i that reflects the probability of correctly classifying the i-th training example. The parameter τ_i is therefore closely related to the parameter σ_i, which specifies the variance of the distribution centered at the i-th example. 
[18] showed the equivalence between SVMs using expected kernels and the SOCP when τ_i = 0. When τ_i > 0, the mean and covariance of missing kernel entries have to be estimated explicitly, making the SOCP more involved for nonlinear kernels. Although achieving comparable performance to the standard SVM with expected kernels, the SOCP requires a computationally more expensive SOCP solver, as opposed to simple quadratic programming (QP).

6 Experimental results

In the experiments, we primarily consider three different learning algorithms: i) the SVM, considered as a baseline algorithm; ii) the augmented SVM (ASVM), an SVM trained on augmented samples drawn according to the distributions {P_i}_{i=1}^m, with the same number of examples drawn from each distribution; and iii) the SMM, a distribution-based method that can be applied directly to the distributions¹.

¹We used the LIBSVM implementation.

Figure 1: (a) the decision boundaries of SVM, ASVM, and SMM. 
(b) the heatmap plots of average accuracies of SMM over 30 experiments using POLY-RBF (center) and RBF-RBF (right) kernel combinations, with the plots of average accuracies at different parameter values (left).

Table 2: accuracies (%) of SMM on synthetic data with different combinations of embedding and level-2 kernels.

Level-2 \ Embedding    LIN           POLY2         POLY3         RBF           NRBF
LIN                    85.20±2.20    81.04±3.11    81.10±2.76    87.74±2.19    85.39±2.56
POLY                   83.95±2.11    81.34±1.21    82.66±1.75    88.06±1.73    86.84±1.51
RBF                    87.80±1.96    73.12±3.29    78.28±2.19    89.65±1.37    86.86±1.88

6.1 Synthetic data

Firstly, we conducted a basic experiment that illustrates a fundamental difference between the SVM, the ASVM, and the SMM. A binary classification problem of 7 Gaussian distributions with different means and covariances was considered. We trained the SVM using only the means of the distributions, the ASVM with 30 virtual examples generated from each distribution, and the SMM using the distributions themselves as training examples. A Gaussian RBF kernel with γ = 0.25 was used for all algorithms.

Figure 1a shows the resulting decision boundaries. Having been trained only on the means of the distributions, the SVM classifier tends to overemphasize the regions with high densities and underrepresent the lower-density regions. In contrast, the ASVM is more expensive and sensitive to outliers, especially when learning on heavy-tailed distributions. The SMM treats each distribution as a training example and implicitly incorporates properties of the distributions, i.e., means and covariances, into the classifier. Note that the SVM can be trained to achieve a similar result to the SMM by choosing an appropriate value for γ (cf. Lemma 4). 
Nevertheless, this becomes more difficult if the training distributions are, for example, nonisotropic and have different covariance matrices.

Secondly, we evaluate the performance of the SMM for different combinations of embedding and level-2 kernels. Two classes of synthetic Gaussian distributions on R^10 were generated. The mean parameters of the positive and negative distributions are normally distributed with means m+ = (1, . . . , 1) and m− = (2, . . . , 2), respectively, and identical covariance matrix Σ = 0.5·I_10. The covariance matrix for each distribution is generated according to two Wishart distributions with covariance matrices Σ+ = 0.6·I_10 and Σ− = 1.2·I_10 and 10 degrees of freedom. The training set consists of 500 distributions from the positive class and 500 from the negative class; the test set consists of 200 distributions with the same class proportion.

The kernels used in the experiment include the linear kernel (LIN), the polynomial kernel of degree 2 (POLY2), the polynomial kernel of degree 3 (POLY3), the unnormalized Gaussian RBF kernel (RBF), and the normalized Gaussian RBF kernel (NRBF). To fix the parameter values of both the kernel functions and the SMM, 10-fold cross-validation (10-CV) is performed on a parameter grid: C ∈ {2^{-3}, 2^{-2}, . . . , 2^7} for the SMM, bandwidth parameter γ ∈ {10^{-3}, 10^{-2}, . . . , 10^2} for Gaussian RBF kernels, and degree parameter d ∈ {2, 3, 4, 5, 6} for polynomial kernels. The average accuracy and ±1 standard deviation for all kernel combinations over 30 repetitions are reported in Table 2. Moreover, we also investigate the sensitivity of the kernel parameters for two kernel combinations, RBF-RBF and POLY-RBF. In this case, we consider the bandwidth parameter γ ∈ {10^{-3}, 10^{-2}, . . . 
, 10^3} for Gaussian RBF kernels and degree parameter d ∈ {2, 3, . . . , 8} for polynomial kernels. Figure 1b depicts the accuracy values and average accuracies for the considered kernel functions.

Figure 2: the performance of SVM, ASVM, and SMM algorithms on handwritten digits constructed using three basic transformations (panels: 1 vs 8, 3 vs 4, 3 vs 8, 6 vs 9; rows: scaling, translation, rotation).

Figure 3: relative computational cost of ASVM and SMM (baseline: SMM with 2000 virtual examples).

Figure 4: accuracies of four different techniques (pLSA, SVM, LSMM, NLSMM) for natural scene categorization.

Table 2 indicates that both the embedding and level-2 kernels are important for the performance of the classifier. The embedding kernels tend to have more impact on the predictive performance than the level-2 kernels. This conclusion also coincides with the results depicted in Figure 1b.

6.2 Handwritten digit recognition

In this section, the proposed framework is applied to distributions over equivalence classes of images that are invariant to basic transformations, namely scaling, translation, and rotation. We consider handwritten digits obtained from the USPS dataset. 
For each 16 × 16 image, the distribution over the equivalence class of the transformations is determined by a prior on the parameters associated with such transformations. Scaling and translation are parametrized by the scale factors (s_x, s_y) and displacements (t_x, t_y) along the x and y axes, respectively; rotation is parametrized by an angle θ. We adopt Gaussian distributions as priors, namely N([1, 1], 0.1·I_2), N([0, 0], 5·I_2), and N(0; π). For each image, the virtual examples are obtained by sampling parameter values from the prior and applying the transformations accordingly.

Experiments are categorized into simple and difficult binary classification tasks. The former consists of classifying digit 1 against digit 8 and digit 3 against digit 4; the latter considers classifying digit 3 against digit 8 and digit 6 against digit 9. The initial dataset for each task is constructed by randomly selecting 100 examples from each class. Then, for each example in the initial dataset, we generate 10, 20, and 30 virtual examples using the aforementioned transformations to construct virtual datasets consisting of 2,000, 4,000, and 6,000 examples, respectively. One third of the examples in the initial dataset are used as a test set; the original examples are excluded from the virtual datasets. The virtual examples are normalized such that their feature values lie in [0, 1]. Then, to reduce the computational cost, principal component analysis (PCA) is performed to reduce the dimensionality to 16. We compare the SVM on the initial dataset, the ASVM on the virtual datasets, and the SMM. For the SVM and ASVM, the Gaussian RBF kernel is used. For the SMM, we employ the empirical kernel (4) with a Gaussian RBF base kernel. The parameters of the algorithms are fixed by 10-CV over parameters C ∈ {2^{-3}, 2^{-2}, . . . 
, 2^7} and γ ∈ {0.01, 0.1, 1}.

The results depicted in Figure 2 clearly demonstrate the benefits of learning directly from the equivalence classes of digits under basic transformations². In most cases, the SMM outperforms both the SVM and the ASVM as the number of virtual examples increases. Moreover, Figure 3 shows the benefit of the SMM over the ASVM in terms of computational cost³.

²While the reported results were obtained using virtual examples with Gaussian parameter distributions (Sec. 6.2), we got similar results using uniform distributions.

³The evaluation was made on a 64-bit desktop computer with an Intel® Core™ 2 Duo CPU E8400 at 3.00GHz×2 and 4GB of memory.

6.3 Natural scene categorization

This section illustrates the benefits of nonlinear kernels between distributions for learning natural scene categories, in which the bag-of-words (BoW) representation is used to represent images in the dataset. Each image is represented as a collection of local patches, each being a codeword from a large vocabulary of codewords called a codebook. Standard BoW representations encode each image as a histogram that enumerates the occurrence probability of the local patches detected in the image w.r.t. those in the codebook. In contrast, our setting represents each image as a distribution over these codewords; images of different scenes thus tend to generate distinct sets of patches. Based on this representation, both the histogram and the local patches can be used in our framework.

We use the dataset presented in [19]. According to their results, most errors occur among the four indoor categories (830 images), namely bedroom (174 images), living room (289 images), kitchen (151 images), and office (216 images). Therefore, we will focus on these four categories. 
For each category, we split the dataset randomly into two separate sets of images, 100 for training and the rest for testing.

A codebook is formed from the training images of all categories. Firstly, interesting keypoints in the image are randomly detected. Local patches are then generated accordingly. After patch detection, each patch is transformed into a 128-dim SIFT vector [20]. Given the collection of detected patches, K-means clustering is performed over all local patches. Codewords are then defined as the centers of the learned clusters. Then, each patch in an image is mapped to a codeword, and the image can be represented by the histogram of the codewords. In addition, we also have an M × 128 matrix of SIFT vectors, where M is the number of codewords.

We compare the performance of probabilistic latent semantic analysis (pLSA) with the standard BoW representation, the SVM, the linear SMM (LSMM), and the nonlinear SMM (NLSMM). For the SMM, we use the empirical embedding kernel with a Gaussian RBF base kernel k: K(h_i, h_j) = Σ_{r=1}^{M} Σ_{s=1}^{M} h_i(c_r) h_j(c_s) k(c_r, c_s), where h_i is the histogram of the ith image and c_r is the rth SIFT vector. A Gaussian RBF kernel is also used as the level-2 kernel for the nonlinear SMM. For the SVM, we adopt a Gaussian RBF kernel with the χ²-distance between the histograms [21], i.e., K(h_i, h_j) = exp(−γ χ²(h_i, h_j)), where χ²(h_i, h_j) = Σ_{r=1}^{M} (h_i(c_r) − h_j(c_r))² / (h_i(c_r) + h_j(c_r)). The parameters of the algorithms are fixed by 10-CV over C ∈ {2^-3, 2^-2, ..., 2^7} and γ ∈ {0.01, 0.1, 1}. For the NLSMM, we use the best γ of the LSMM in the base kernel and perform 10-CV to choose the γ parameter of the level-2 kernel only. To deal with multiple categories, we adopt the pairwise approach and a voting scheme to categorize test images. The results in Figure 4 illustrate the benefit of the distribution-based framework. Understanding the context of a complex scene is challenging. Employing distribution-based methods provides an elegant way of utilizing higher-order statistics in natural images that could not be captured by traditional sample-based methods.

7 Conclusions

This paper proposes a method for kernel-based discriminative learning on probability distributions. The trick is to embed distributions into an RKHS, resulting in a simple and efficient learning algorithm on distributions. A family of linear and nonlinear kernels on distributions allows one to flexibly choose the kernel function that is suitable for the problem at hand. Our analyses provide insights into the relations between distribution-based methods and traditional sample-based methods, particularly the flexible SVM that places different kernels on each training example. The experimental results illustrate the benefits of learning from a pool of distributions, compared to a pool of examples, on both synthetic and real-world data.

Acknowledgments

KM would like to thank Zoubin Ghahramani, Arthur Gretton, Christian Walder, and Philipp Hennig for fruitful discussions. We also thank all three insightful reviewers for their invaluable comments.

References

[1] Y. H. Yang and T. Speed. Design issues for cDNA microarray experiments. Nat. Rev. Genet., 3(8):579–588, 2002.

[2] T. Jebara, R. Kondor, A. Howard, K. Bennett, and N. Cesa-Bianchi. Probability product kernels. Journal of Machine Learning Research, 5:819–844, 2004.

[3] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 1943.

[4] P. J. Moreno, P. P. Ho, and N. Vasconcelos.
A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Proceedings of Advances in Neural Information Processing Systems. MIT Press, 2004.

[5] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 136–143, 2005.

[6] M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198, 2005.

[7] André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. Nonextensive information theoretic kernels on measures. Journal of Machine Learning Research, 10:935–975, 2009.

[8] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, 2004.

[9] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, 2007.

[10] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 99:1517–1561, 2010.

[11] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT '01/EuroCOLT '01, pages 416–426. Springer-Verlag, 2001.

[12] F. Dinuzzo and B. Schölkopf. The representer theorem for Hilbert spaces: a necessary and sufficient condition. In Advances in Neural Information Processing Systems 25, pages 189–196, 2012.

[13] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.

[14] A. Christmann and I. Steinwart.
Universal kernels on non-standard input spaces. In Proceedings of Advances in Neural Information Processing Systems, pages 406–414, 2010.

[15] N. A. Mehta and A. G. Gray. Generative and latent mean map kernels. CoRR, abs/1005.0188, 2010.

[16] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems 24, pages 2178–2186, 2011.

[17] P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.

[18] H. S. Anderson and M. R. Gupta. Expected kernel for missing features in support vector machines. In Statistical Signal Processing Workshop, pages 285–288, 2011.

[19] L. Fei-Fei. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 524–531, 2005.

[20] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, pages 1150–1157, Washington, DC, USA, 1999.

[21] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In Proceedings of the International Conference on Computer Vision, pages 606–613, 2009.