{"title": "Variational Mixture of Gaussian Process Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 1897, "page_last": 1904, "abstract": "Mixture of Gaussian processes models extend a single Gaussian process with the ability to model multi-modal data and to reduce training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated as a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. Our gating network is more flexible than previous generative approaches, as the inputs for each expert are modeled by a Gaussian mixture model. The number of experts and the number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method.", "full_text": "Variational Mixture of Gaussian Process Experts\n\nChao Yuan and Claus Neubauer\n\nSiemens Corporate Research\n\nIntegrated Data Systems Department\n\n755 College Road East, Princeton, NJ 08540\n\n{chao.yuan,claus.neubauer}@siemens.com\n\nAbstract\n\nMixture of Gaussian processes models extend a single Gaussian process with the ability to model multi-modal data and to reduce training complexity. Previous inference algorithms for these models are mostly based on Gibbs sampling, which can be very slow, particularly for large-scale data sets. We present a new generative mixture of experts model. Each expert is still a Gaussian process but is reformulated as a linear model. This breaks the dependency among training outputs and enables us to use a much faster variational Bayesian algorithm for training. 
Our gating network is more flexible than previous generative approaches, as the inputs for each expert are modeled by a Gaussian mixture model. The number of experts and the number of Gaussian components for an expert are inferred automatically. A variety of tests show the advantages of our method.\n\n1 Introduction\n\nDespite its widespread success in regression problems, the Gaussian process (GP) has two limitations. First, it cannot handle data with multi-modality. Multi-modality can exist in the input dimension (e.g., non-stationarity), in the output dimension (given the same input, the output has multiple modes), or in a combination of both. Secondly, the cost of training is O(N³), where N is the size of the training set, which can be too expensive for large data sets. Mixture of GP experts models were proposed to tackle the above problems (Rasmussen & Ghahramani [1]; Meeds & Osindero [2]). Markov chain Monte Carlo (MCMC) sampling methods (e.g., Gibbs sampling) are the standard approaches to train these models, which theoretically can achieve very accurate results. However, MCMC methods can be slow to converge and their convergence can be difficult to diagnose. It is thus important to explore alternatives.\n\nIn this paper, we propose a new generative mixture of Gaussian processes model for regression problems and apply variational Bayesian methods to train it. Each Gaussian process expert is described by a linear model, which breaks the dependency among training outputs and makes variational inference feasible. The distribution of inputs for each expert is modeled by a Gaussian mixture model (GMM). Thus, our gating network can handle missing inputs and is more flexible than single Gaussian-based gating models [2-4]. The number of experts and the number of components for each GMM are automatically inferred. Training using variational methods is much faster than using MCMC. The rest of this paper is organized as follows. 
Section 2 surveys the related work. Section 3 describes the proposed algorithm. We present test results in Section 4 and summarize the paper in Section 5.\n\n2 Related work\n\nGaussian process is a powerful tool for regression problems (Rasmussen & Williams [5]). It elegantly models the dependency among data with a Gaussian distribution: P(Y) = N(Y | 0, K + σn²I), where Y = {y1:N} are the N training outputs and I is an identity matrix. We will use y1:N to denote y1, y2, ..., yN. The kernel matrix K considered here consists of kernel functions between pairs of inputs xi and xj: Kij = k(xi, xj) = σf² exp(−Σ_{m=1}^{d} (xim − xjm)²/(2σgm²)), where d is the dimension of the input x. The d + 2 hyperparameters σn, σf, σg1, σg2, ..., σgd can be efficiently estimated from the data. However, Gaussian process has difficulties in modeling large-scale data and multi-modal data. The first issue was addressed by various sparse Gaussian processes [6-9, 16].\n\nFigure 1: The graphical model representation for the proposed mixture of experts model. It consists of a hyperparameter set Θ = {L, αy, C, αx, m0, R0, r, S, θ1:L, I1:L, a, b} and a parameter set Ψ = {p, ql, mlc, Rlc, vl, γl | l = 1, 2, ..., L and c = 1, 2, ..., C}. The local expert is a GP linear model to predict output y from input x; the gating network is a GMM for input x. Data can be generated as follows. Step 1, determine hyperparameters Θ. Step 2, sample parameters Ψ. Step 3, to sample one data point x and y, we sequentially sample expert indicator t, cluster indicator z, x and y. Step 3 is independently repeated until enough data points are generated.\n\nThe mixture of experts (MoE) framework offers a natural solution for multi-modality problems (Jacobs et al. [10]). Early MoE work used linear experts [3, 4, 11, 12] and some of them were neatly trained via variational methods [4, 11, 12]. 
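As a quick illustration of the covariance construction just described, the following minimal sketch (hypothetical variable names, not the authors' code) builds the ARD kernel matrix and the output covariance K + σn²I:

```python
import numpy as np

def ard_kernel(X1, X2, sigma_f, sigma_g):
    # K[i, j] = sigma_f^2 * exp(-sum_m (X1[i,m] - X2[j,m])^2 / (2 * sigma_g[m]^2))
    diff = X1[:, None, :] - X2[None, :, :]   # pairwise differences, shape (N1, N2, d)
    return sigma_f**2 * np.exp(-np.sum(diff**2 / (2.0 * sigma_g**2), axis=-1))

X = np.random.RandomState(0).randn(5, 2)     # 5 training inputs, d = 2
K = ard_kernel(X, X, sigma_f=1.0, sigma_g=np.array([1.0, 2.0]))
cov_Y = K + 0.1**2 * np.eye(5)               # P(Y) = N(Y | 0, K + sigma_n^2 I)
```

The per-dimension length scales sigma_g are what make the kernel "automatic relevance determination"-style: a large sigma_g[m] effectively switches dimension m off.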
However, these methods cannot model nonlinear data sets well. Tresp [13] proposed a mixture of GPs model that can be trained quickly using the EM algorithm. However, hyperparameters, including the number of experts, needed to be specified, and the training complexity issue was not addressed. By introducing the Dirichlet process mixture (DPM) prior, infinite mixture of GPs models are able to infer the number of experts, both hyperparameters and parameters, via Gibbs sampling [1, 2]. However, these models are trained by MCMC methods, which demand expensive training and testing time (as collected samples are usually combined to give predictive distributions). How to select samples and how many samples to use are still challenging problems.\n\n3 Algorithm description\n\nFig.1 shows the graphical model of the proposed mixture of experts. It consists of the local expert part and the gating network part, which are covered in Sections 3.1 and 3.2, respectively. In Section 3.3, we describe how to perform variational inference in this model.\n\n3.1 Local Gaussian process expert\n\nA local Gaussian process expert is specified by the following linear model given the expert indicator t = l (where l = 1 : L) and other related variables:\n\nP(y | x, t = l, vl, θl, Il, γl) = N(y | vlᵀφl(x), γl⁻¹).   (1)\n\nThis linear model is symbolized by the inner product of the weight vector vl and a nonlinear feature vector φl(x). φl(x) is a vector of kernel functions between a test input x and a subset of training inputs: [kl(x, xIl1), kl(x, xIl2), ..., kl(x, xIlM)]ᵀ. The active set Il denotes the indices of the selected M training samples. How to select Il will be addressed in Section 3.3; for now let us assume that we use the whole training set as the active set. vl has a Gaussian distribution N(vl | 0, Ul⁻¹) with 0 mean and inverse covariance Ul. 
Ul is set to Kl + σhl²I, where Kl is an M × M kernel matrix consisting of kernel functions between training samples in the active set. σhl² is needed to avoid singularity of Ul. θl = {σhl, σfl, σgl1, σgl2, ..., σgld} denotes the set of hyperparameters for this linear model. Note that φl(x) depends on θl. γl is the inverse variance of this linear model. The prior of γl is set as a Gamma distribution: Γ(γl | a, b) ∝ b^a γl^(a−1) e^(−bγl) with hyperparameters a and b. It is easy to see that for each expert, y is a Gaussian process defined on x. Such a linear model was proposed by Silverman [14] and was used by sparse Gaussian process models [6, 8]. If we set σhl = 0 and γl = 1/σnl², the joint distribution of the training outputs Y, assuming they are from the same expert l, can be proved to be N(Y | 0, Kl + σnl²I). This has exactly the same form as a regular Gaussian process. However, the largest advantage of this linear model is that it breaks the dependency of y1:N once t1:N are given; i.e., P(y1:N | x1:N, t1:N, v1:L, θ1:L, I1:L, γ1:L) = Π_{n=1}^{N} P(yn | xn, tn = l, vl, θl, Il, γl). This makes the variational inference of the mixture of Gaussian processes feasible.\n\n3.2 Gating network\n\nA gating network determines which expert to use based on input x. We consider a generative gating network, where the expert indicator t is generated by a categorical distribution P(t = l) = pl. p = [p1 p2 ... pL] is given a symmetric Dirichlet distribution P(p) = Dir(p | αy/L, αy/L, ..., αy/L). Given expert indicator t = l, we assume that x follows a Gaussian mixture model (GMM) with C components. Each component (cluster) is modeled by a Gaussian distribution P(x | t = l, z = c, mlc, Rlc) = N(x | mlc, Rlc⁻¹). 
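To make the generative process of Fig. 1 concrete, here is a minimal sketch of Step 3 (sampling one data point); all names and the toy configuration are hypothetical, and each expert is abstracted as a function returning its predictive mean and variance at x:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point(p, q, means, covs, experts):
    # Step 3 of the generative process: sample t, then z, then x, then y.
    t = rng.choice(len(p), p=p)                           # expert indicator t ~ Cat(p)
    z = rng.choice(len(q[t]), p=q[t])                     # cluster indicator z ~ Cat(q_t)
    x = rng.multivariate_normal(means[t][z], covs[t][z])  # x ~ N(m_tz, R_tz^-1)
    mean, var = experts[t](x)                             # expert t's linear model at x
    y = rng.normal(mean, np.sqrt(var))                    # y ~ N(v_t^T phi_t(x), gamma_t^-1)
    return t, z, x, y

# toy configuration: L = 2 experts, C = 2 clusters each, 1-D inputs
p = [0.5, 0.5]
q = [[0.7, 0.3], [0.4, 0.6]]
means = [[[0.0], [2.0]], [[5.0], [7.0]]]
covs = [[np.eye(1), np.eye(1)], [np.eye(1), np.eye(1)]]
experts = [lambda x: (np.sin(x[0]), 0.01), lambda x: (x[0], 0.04)]
t, z, x, y = sample_point(p, q, means, covs, experts)
```

Repeating sample_point independently produces a data set whose inputs follow a per-expert GMM, matching the gating network described above.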
z is the cluster indicator, which has a categorical distribution P(z = c | t = l, ql) = qlc. In addition, we give mlc a Gaussian prior N(mlc | m0, R0⁻¹), Rlc a Wishart prior W(Rlc | r, S) and ql a symmetric Dirichlet prior Dir(ql | αx/C, αx/C, ..., αx/C).\n\nIn previous generative gating networks [2-4], the expert indicator also acts as the cluster indicator (i.e., t = z), such that the inputs for an expert can only have one Gaussian distribution. In comparison, our model is more flexible in that it models the inputs x for each expert as a Gaussian mixture distribution. One can also put a prior (e.g., an inverse Gamma distribution) on αx and αy as done in [1, 2, 15]. In this paper we treat them as fixed hyperparameters.\n\n3.3 Variational inference\n\nVariational EM algorithm. Given a set of training data D = {(xn, yn) | n = 1 : N}, the task of learning is to estimate the unknown hyperparameters and infer the posterior distribution of the parameters. This problem is nicely addressed by the variational EM algorithm. The objective is to maximize log P(D|Θ) over the hyperparameters Θ. The parameters Ψ, expert indicators T = {t1:N} and cluster indicators Z = {z1:N} are treated as hidden variables, denoted by Ω = {Ψ, T, Z}.\n\nIt is possible to estimate all hyperparameters via the EM algorithm. However, most of the hyperparameters are generic and are thus fixed as follows. m0 and R0 are set to the mean and inverse covariance of the training inputs, respectively. We fix the degrees of freedom r = d and the scale matrix S = 100I for the Wishart distribution. αx, αy, C and L are all set to 10. Following Bishop & Svensén [12], we set a = 0.01 and b = 0.0001. Such settings give broad priors to the parameters and make our model sufficiently flexible. Our algorithm is not found to be sensitive to these generic hyperparameters. The only hyperparameters that remain to be estimated are Θ = {θ1:L, I1:L}. 
Note that these GP-related hyperparameters are problem specific and should not be assumed known.\n\nIn the E-step, based on the current estimate of Θ, the posterior probability of the hidden variables P(Ω|D, Θ) is computed. Variational inference is involved in this step by approximating P(Ω|D, Θ) with a factorized distribution\n\nQ(Ω) = Π_{l,c} Q(mlc)Q(Rlc) Π_{l} Q(ql)Q(vl)Q(γl) Q(p) Π_{n} Q(tn, zn).   (2)\n\nEach hidden variable has the same type of posterior distribution as its conjugate prior. To compute the distribution for a hidden variable ωi, we need to compute the posterior mean of log P(D, Ω|Θ) over all hidden variables except ωi: ⟨log P(D, Ω|Θ)⟩Ω/ωi. The derivation is standard and is thus omitted.\n\nVariational inference for each hidden variable takes linear time with respect to N, C and L, because the factorized form of P(D, Ω|Θ) leads to separation of the hidden variables in log P(D, Ω|Θ). If we switched from our linear model to a regular Gaussian process, we would encounter a prohibitive complexity of O(L^N) for integrating log P(y1:N | x1:N, t1:N, Θ) over t1:N. Also note that C = L = 10 represents the maximum number of clusters and experts. The actual number is usually smaller. During iteration, if a cluster c for expert l does not have a single training sample supporting it (i.e., no n with Q(tn = l, zn = c) > 0), this cluster and its associated parameters mlc and Rlc are removed. Similarly, we remove an expert l if there is no n with Q(tn = l) > 0. These C and L choices are flexible enough for all our tests, but for more complicated data, larger values may be needed.\n\nIn the M-step, we search for the Θ which maximizes ⟨log P(D, Ω|Θ)⟩Ω. We employ the conjugate gradient method to estimate θ1:L similarly to [5]. Both the E-step and the M-step are repeated until the algorithm converges. 
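The joint responsibilities Q(tn, zn) computed in the E-step are proportional to the product of the gating terms and the expert likelihood. The sketch below (hypothetical names, scalar x, and a deliberate simplification: it plugs in point estimates where the true variational update uses expectations of log densities under Q) shows the shape of that computation:

```python
import numpy as np

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def responsibilities(x, y, p, q, m, var_x, y_mean, y_var):
    # Q(t=l, z=c) proportional to p_l * q_lc * N(x | m_lc, .) * N(y | expert l's prediction)
    L, C = len(p), len(q[0])
    R = np.empty((L, C))
    for l in range(L):
        for c in range(C):
            R[l, c] = (p[l] * q[l][c] * gauss(x, m[l][c], var_x[l][c])
                       * gauss(y, y_mean[l], y_var[l]))
    return R / R.sum()  # normalize over all (l, c) pairs

resp = responsibilities(x=0.1, y=0.2,
                        p=[0.5, 0.5], q=[[0.5, 0.5], [0.5, 0.5]],
                        m=[[0.0, 2.0], [1.0, 3.0]], var_x=[[1.0, 1.0], [1.0, 1.0]],
                        y_mean=[0.0, 1.0], y_var=[0.1, 0.1])
```

Because the computation is a double loop over l and c for each data point, one E-step sweep over all responsibilities is linear in N, C and L, as stated above.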
For better efficiency, we do not select the active sets I1:L in each M-step; instead, we fix I1:L during the EM algorithm and only update I1:L once, when the EM algorithm converges. The details are given after we introduce the algorithm initialization.\n\nInitialization. Without proper initialization, variational methods can easily be trapped in local optima. Consequently, using pure randomization methods, one cannot rely on a single result, but has to run the algorithm multiple times and then either pick the best result [12] or average the results [11]. We present a new initialization method that only needs the algorithm to run once. Our method is based on the assumption that the combined data including x and y for an expert are usually distributed locally in the combined d + 1 dimensional space. Therefore, clustering methods such as k-means can be used to cluster the data, one cluster for one expert.\n\nExperts are initialized incrementally as follows. First, all training data are used to train one expert. Secondly, we cluster all training data into two clusters and train one expert per cluster. We do this four times and collect a total of L = 1 + 2 + 3 + 4 = 10 experts. Different experts represent different local portions of the training data at different scales. Although our assumption may not be true in some cases (e.g., one expert's data intersect with others'), this initialization method does give us a meaningful starting point. In practice, we find it effective and reliable.\n\nActive set selection. We now address the problem of selecting the active set Il of size M in defining the feature vector φl for expert l. The posterior distribution Q(vl) can be proved to be Gaussian with inverse covariance Ũl = ⟨γl⟩ Σ_n Tnl φl(xn)φl(xn)ᵀ + Kl + σhl²I and mean μ̃l = Ũl⁻¹ ⟨γl⟩ Σ_n Tnl yn φl(xn). 
Tnl is an abbreviation for Q(tn = l) and ⟨γl⟩ is the posterior mean of γl. Inverting Ũl has a complexity of O(M³); thus, for small data sets, the active set can be set to the full training set (M = N). But for large data sets, we have to select a subset with M < N.\n\nThe active set Il is randomly initialized. With Il fixed, we run the variational EM algorithm and obtain Q(Ω) and Θ. Now we want to improve our results by updating Il. Our method is inspired by the maximum a posteriori probability (MAP) criterion used by sparse Gaussian processes [6, 8]. Specifically, the optimization target in our case is max_{Il,vl} P(vl|D) ≈ Q(vl) with the posterior distributions of the other hidden variables fixed. The justification of this choice is that a good Il should be strongly supported by the data D such that Q(vl) is highly peaked. Since Q(vl) is Gaussian, vl is always μ̃l at the optimal point, and thus this optimization is equivalent to maximizing the determinant of the inverse covariance\n\nmax_{Il} |Ũl| = |⟨γl⟩ Σ_n Tnl φl(xn)φl(xn)ᵀ + Kl + σhl²I|.   (3)\n\nNote that if Tnl is one for all n, our method turns into a MAP-based sparse Gaussian process. However, even in that case, our criterion max_{Il,vl} P(vl|D) differs from the max_{Il,vl} P(D|vl)P(vl) derived in previous MAP-based work [6, 8]. First, the denominator P(D), which actually depends on Il, is ignored by previous methods. Secondly, the |Kl + σhl²I| term in P(vl) is also ignored in previous methods. For these reasons, previous methods are not real MAP estimation but approximations of it.\n\nLooking for the globally optimal active set of size M is not feasible. Thus, similarly to many sparse Gaussian processes, we consider a greedy algorithm that adds one index to Il at a time. For a candidate index i, computing the new Ũl requires O(NM); incrementally updating the Cholesky factorization of Ũl requires O(M²) and computing the new |Ũl| needs O(1). 
Therefore, checking one candidate i takes O(NM). We consider selecting the best index from κ = 100 randomly selected candidates [6, 8], which makes the total time for adding one index O(κNM). For adding all M indices, the total time is O(κNM²). Such a complexity is comparable to that of [6], but higher than those of [7, 8]. Note that this time is needed for each of the L experts.\n\nIn summary, the variational EM algorithm with active set selection proceeds as follows. During initialization, training data are clustered and assigned to each expert by the k-means clustering algorithm noted above; the data assigned to each expert are used for randomly selecting the active set and then training the linear model. During each iteration, we run variational EM to update parameters and hyperparameters; when the EM algorithm converges, we update the active set and Q(vl) for each expert. Such an iteration is repeated until convergence.\n\nIt is also possible to define the feature vector φl(x) as [k(x, x1), k(x, x2), ..., k(x, xM)]ᵀ, where each x is a pseudo-input (Snelson & Ghahramani [9]). In this way, these pseudo-inputs X can be viewed as hyperparameters and can be optimized in the same variational EM algorithm without resorting to a separate update for active sets as we do. This is theoretically more sound. However, it leads to a large number of hyperparameters to be optimized. 
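The greedy selection just described can be sketched as follows. This is a simplified sketch with hypothetical names: it rebuilds Ũl naively for every candidate instead of using the O(M²) Cholesky update, and it scores candidates by the log-determinant form of the criterion in Eq. (3):

```python
import numpy as np

def rbf(a, b, sf=1.0, sg=1.0):
    return sf**2 * np.exp(-np.sum((a - b) ** 2) / (2.0 * sg**2))

def score(X, T, gamma, active, sigma_h=0.1):
    # log |U~| with U~ = gamma * sum_n T_n phi(x_n) phi(x_n)^T + K + sigma_h^2 I,
    # where phi(x) holds kernel values against the current active set.
    A = X[active]
    K = np.array([[rbf(a, b) for b in A] for a in A])
    Phi = np.array([[rbf(x, a) for a in A] for x in X])
    U = gamma * (Phi * T[:, None]).T @ Phi + K + sigma_h**2 * np.eye(len(active))
    return np.linalg.slogdet(U)[1]

def greedy_active_set(X, T, gamma, M, n_cand=5, seed=0):
    rng = np.random.default_rng(seed)
    active = []
    for _ in range(M):  # add one index at a time, best of n_cand random candidates
        pool = [i for i in range(len(X)) if i not in active]
        cands = rng.choice(pool, size=min(n_cand, len(pool)), replace=False)
        best = max(cands, key=lambda i: score(X, T, gamma, active + [int(i)]))
        active.append(int(best))
    return active

X = np.random.RandomState(1).randn(20, 2)
T = np.ones(20)  # responsibilities T_nl; all one recovers the sparse-GP special case
active = greedy_active_set(X, T, gamma=1.0, M=5)
```

With T set to all ones, the score reduces to the MAP-style sparse-GP criterion discussed above; per-expert responsibilities simply reweight each training point's contribution to Ũl.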
Although overfitting may not be an issue, the authors cautioned that this method can be vulnerable to local optima.\n\nPredictive distribution. Once training is done, for a test input x∗, its predictive distribution P(y∗|x∗, D, Θ) is evaluated as follows:\n\nP(y∗|x∗, D, Θ) = ∫ P(y∗|x∗, Ω, Θ)P(Ω|D, Θ)dΩ ≈ ∫ P(y∗|x∗, Ω, Θ)Q(Ω)dΩ ≈ P(y∗|x∗, ⟨p⟩, {⟨ql⟩}, {⟨mlc⟩}, {⟨Rlc⟩}, {⟨vl⟩}, {⟨γl⟩}, {θl}, {Il}).   (4)\n\nThe first approximation uses the results from the variational inference. Note that the expert indicators T and cluster indicators Z are integrated out. Suppose that there are sufficient training data. Then the posterior distributions of all parameters are usually highly peaked. This leads to the second approximation, where the integral reduces to a point evaluation at the posterior mean of each parameter. Eq.(4) can be easily computed using the standard predictive algorithm for a mixture of linear experts. See the appendix for more details.\n\n4 Test results\n\nFor all data sets, we normalize each dimension of the data to zero mean and unit variance before using them for training. After training, to plot fitting results, we de-normalize the data into their original scales.\n\nArtificial toy data. We consider the toy data set used by [2], which consists of four continuous functions covering the input ranges (0, 15), (35, 60), (45, 80) and (80, 100), respectively. Different levels of noise (with standard deviations std = 7, 7, 4 and 2) are added to the different functions. This is a challenging multi-modality problem in both the input and output dimensions. Fig.2 (left) shows 400 points generated by this toy model, each point with an equal probability of 0.25 of being assigned to one of the four functions. Using these 400 points as training data, our method found two experts that fit the data nicely. 
Fig.2 (left) shows the results. In general, expert one represents the last two functions while expert two represents the first two functions. One may desire to recover each function separately by an expert. However, note that the first two functions have the same noise level (std = 7); so it is reasonable to use just one GP to model these two functions. In fact, we recovered a very close estimated std = 1/√⟨γ2⟩ = 6.87 for the second expert. The stds of the last two functions are also close (4 vs. 2), and are also similar to the 1/√⟨γ1⟩ = 2.48 of the first expert. Note that the GP for expert one appears to fit the data of the first function comparably well to that of expert two. However, the gating network does not support this: the means of the GMM for expert one do not cover the region of the first function.\n\nRef.[2] and our method performed similarly well in discovering different modalities in different input regions. We did not plot the mean of the predictive distribution, as this data set has multiple modes in the output dimension. Our results were produced using an active set size M = 60. Larger active sets did not give appreciably better results.\n\nMotorcycle data. Our algorithm was also applied to the 2D motorcycle data set [14], which contains 133 data points with input-dependent noise, as shown in Fig.2 (right). Our algorithm yielded two experts, with the first expert modeling the majority of the points and the second expert only depicting the beginning part. The estimated stds of the two experts are 23.46 and 2.21, respectively. This appears to correctly represent the different levels of noise present in different parts of the data.\n\nFigure 2: Test results for toy data (left) and motorcycle data (right). Each data point is assigned to an expert l based on its posterior probability Q(tn = l) and is referred to as “data for expert l”. 
The means of the GMM for each expert are also shown at the bottom as “m for expert l”. In the right figure, the mean of the predictive distribution is shown as a solid line and samples drawn from the predictive distribution are shown as dots (100 samples for each of the 45 horizontal locations).\n\nWe also plot the mean of the predictive distribution (4) in Fig.2 (right). Our mean result compares favorably with those of other methods using medians of mixtures [1, 2]. In particular, our result is similar to that of [1] at input ≤ 30. At input > 35, the result of [1] abruptly becomes flat while our result is smooth and appears to fit the data better. The result of [2] is jagged, which may suggest using more Gibbs samples for smoother results. In terms of the full predictive (posterior) distribution (represented by samples in Fig.2 (right)), our results are better at input ≤ 40, as more artifacts are produced by [1, 2] (especially between 15 and 25). However, our results have more artifacts at input > 40, because that region shares the same std = 23.46 as the other region where the input is between 15 and 40. The active set size of our method was set to 40. Training using MATLAB 7 on a Pentium 2.4 GHz machine took 20 seconds, compared to one hour spent by the Gibbs sampling method [1].\n\nRobot arm data. We consider the two-link robot arm data set used by [12]. Fig.3 (left) shows the kinematics of such a 2D robot. The joint angles are limited to the ranges 0.3 ≤ θ1 ≤ 1.2 and π/2 ≤ θ2 ≤ 3π/2. Based on the forward kinematic equations (see [12]), the end point position (x1, x2) has a unique solution given values of the joint angles (θ1, θ2). 
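For reference, the forward kinematics of a generic planar two-link arm has the familiar closed form below. This is only a sketch: the link lengths here are placeholders, not the exact constants used in [12].

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=1.0, L2=0.5):
    # End point of a planar two-link arm with joint angles theta1, theta2
    # and (assumed, illustrative) link lengths L1, L2.
    x1 = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return x1, x2

x1, x2 = forward_kinematics(0.8, np.pi)
```

The map (θ1, θ2) → (x1, x2) is single-valued, but its inverse is not: a reachable end point generally admits two joint-angle solutions (elbow-up and elbow-down), which is exactly the two-mode structure of region B in Fig. 3.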
However, we are interested in the inverse kinematics problem: given the end point position, we want to estimate the joint angles. We randomly generated 2000 points based on the forward kinematics, with the first 1000 points for training and the remaining 1000 points for testing. Although noise can be added, we did not do so, to make our results comparable to those of [12].\n\nSince this problem involves predicting two correlated outputs at the same time, we used an independent set of local experts for each output but let the two outputs share the same gating network. This was easily accommodated in our algorithm. Our algorithm found five experts vs. the 16 experts used by [12]. The average number of GMM components is 3. We use residue plots [12] to present the results (see Fig.3). Compared to that of [12], the first residue plot is much cleaner, suggesting that our errors are much smaller. This is expected, as we use more powerful GP experts vs. the linear experts used by [12]. The second residue plot (not used in [12]) also gives a clean result but is worse than the first plot. This is because the modality with the smaller posterior probability is more likely to be replaced by false positive modes. The active set size was set to 100. A larger size did not improve the results.\n\nDELVE data. We applied our algorithm to three widely used DELVE data sets: Boston, Kin-8nm and Pumadyn-32nm. These data sets appear to be single-modal, because impressive results were achieved by a single GP. The purpose of this test is to check how our algorithm (intended for multi-modality) handles single modality without knowing it. We followed the standard DELVE testing framework: for the Boston data, there are two tests, each using 128 training examples; for both the Kin-8nm and Pumadyn-32nm data, there are four tests, each using 1024 training examples.\n\nTable 1 shows the standardised squared errors for the test. 
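The score reported in Table 1 is, in the common convention, the test mean squared error normalized by the variance of the training targets, so that always predicting the training mean scores about 1. This is a sketch of the metric as we understand it (the exact DELVE normalization may differ in detail):

```python
import numpy as np

def standardised_squared_error(y_true, y_pred, y_train):
    # Test MSE divided by the variance of the training targets.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_train)

y_train = np.array([0.0, 1.0, 2.0, 3.0])
# Predicting the training mean for test targets symmetric about it scores 1.0 here.
sse_mean = standardised_squared_error(np.array([0.0, 2.0]),
                                      np.full(2, y_train.mean()), y_train)
```

Scores well below 1, like those in Table 1, therefore indicate substantial explained variance.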
The scores from all previous methods are copied from Waterhouse [11]. We used the full training set as the active set. Reducing the active\n\nFigure 3: Test results for the robot arm data set. Left: illustration of the robot kinematics (adapted from [12]). Our task is to estimate the joint angles (θ1, θ2) based on the end point positions. In region B, there are two modalities for the same end point position. In regions A and C, there is only one modality. Middle: the first residue plot. For a test point, its predictive distribution is a Gaussian mixture. The mean of the Gaussian distribution with the highest probability was fed into the forward kinematics to obtain the estimated end point position. A line was drawn between the estimated and real end point positions; the length of the line indicates the magnitude of the error. The average line length (error) is a very small 0.00067, so many lines appear as dots. Right: the second residue plot, using the mean of the Gaussian distribution with the second highest probability, only for region B. The average line length is 0.001. Both residue plots are needed to check whether both modalities are detected correctly.\n\nData sets   gp              mars            mlp             me              vmgp\nBoston      0.194 ± 0.061   0.157 ± 0.009   -               0.159 ± 0.023   0.157 ± 0.002\nKin8nm      0.116 ± 0.006   0.460 ± 0.013   0.094 ± 0.013   0.182 ± 0.020   0.119 ± 0.005\nPum32nm     0.044 ± 0.009   0.061 ± 0.003   0.046 ± 0.023   0.701 ± 0.079   0.041 ± 0.005\n\nTable 1: Standardised squared errors of different methods on the DELVE data sets. 
Our method (vmgp) is compared with a single Gaussian process trained using a maximum a posteriori method (gp), a bagged version of MARS (mars), a multi-layer perceptron trained using hybrid MCMC (mlp) and a committee of mixtures of linear experts (me) [11].\n\nset compromised the results, suggesting that for these high dimensional data sets, a large number of training examples is required, and that for the present training sets, each training example carries information not represented by the others. We started with ten experts and found an average of 2, 1 and 2.75 experts for these data sets, respectively. The average numbers of GMM components for these data sets are 8.5, 10 and 9.5, respectively, indicating that more GMM components are needed for modeling higher dimensional inputs. Our results are comparable to, and sometimes better than, those of previous methods.\n\nFinally, to test how our active set selection algorithm performs, we conducted a standard test for sparse GPs: 7168 samples from Pumadyn-32nm were used for training and the remaining 1024 were used for testing. The active set size M was varied from 10 to 150. The error was 0.0569 when M = 10, but quickly reduced to 0.0225, the same as the benchmark error in [7], when M = 25. We rapidly achieved 0.0196 at M = 50 and the error did not decrease after that. This result is better than that of [7] and comparable to the best result of [9].\n\n5 Conclusions\n\nWe present a new mixture of Gaussian processes model and apply a variational Bayesian method to train it. The proposed algorithm nicely addresses the data multi-modality and training complexity issues of a single Gaussian process. Our method achieved comparable results to previous MCMC-based models on several 2D data sets. One future direction is to compare all algorithms using high dimensional data so we can draw more meaningful conclusions. 
However, one clear advantage of our method is that training is much faster. This makes our method more suitable for many real-world applications where speed is critical.\n\nOur active set selection method works well on the Pumadyn-32nm data set. But this test was done in the context of a mixture of GPs. To make a fair comparison to other sparse GPs, we can set L = 1 and also try more data sets. It is worth noting that in the current implementation, the active set size M is fixed for all experts. This can be improved by using a smaller M for an expert with a smaller number of supporting training samples.\n\nAcknowledgments\n\nThanks to Carl Rasmussen and Christopher Williams for sharing the GPML matlab package.\n\nAppendix\n\nEq.(4) can be expressed as a weighted sum of all experts, where hyperparameters and parameters are omitted:\n\nP(y∗|x∗) = Σ_l Σ_c P(t∗ = l, z∗ = c | x∗) P(y∗ | x∗, t∗ = l).   (A-1)\n\nThe first term in (A-1) is the posterior probability for expert t∗ = l and it is the sum over c of\n\nP(t∗ = l, z∗ = c | x∗) = P(x∗ | t∗ = l, z∗ = c) P(t∗ = l, z∗ = c) / Σ_{l′} Σ_{c′} P(x∗ | t∗ = l′, z∗ = c′) P(t∗ = l′, z∗ = c′),   (A-2)\n\nwhere P(t∗ = l, z∗ = c) = ⟨pl⟩⟨qlc⟩. The second term in (A-1) is the predictive probability for y∗ given expert l, which is Gaussian.\n\nReferences\n\n[1] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.\n\n[2] E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.\n\n[3] L. Xu, M. I. Jordan, and G. E. Hinton. 
An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems 7. MIT Press, 1995.\n\n[4] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(10):1223–1241, 2002.\n\n[5] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[6] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.\n\n[7] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on Artificial Intelligence and Statistics 9, 2003.\n\n[8] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.\n\n[9] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.\n\n[10] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixture of local experts. Neural Computation, 3:79–87, 1991.\n\n[11] S. Waterhouse. Classification and regression using mixtures of experts. PhD Thesis, Department of Engineering, Cambridge University, 1997.\n\n[12] C. M. Bishop and M. Svensén. Bayesian hierarchical mixtures of experts. In Proc. Uncertainty in Artificial Intelligence, 2003.\n\n[13] V. Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.\n\n[14] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Royal Stat. Society B, 47(1):1–52, 1985.\n\n[15] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.\n\n[16] L. 
Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.\n", "award": [], "sourceid": 94, "authors": [{"given_name": "Chao", "family_name": "Yuan", "institution": null}, {"given_name": "Claus", "family_name": "Neubauer", "institution": null}]}