{"title": "Multiresolution Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 745, "abstract": "We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the conditional likelihood of the observations given the partition tree. This allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of Magnetoencephalography (MEG) recordings of brain activity.", "full_text": "Multiresolution Gaussian Processes

Emily B. Fox
Dept of Statistics, University of Washington
ebfox@stat.washington.edu

David B. Dunson
Dept of Statistical Science, Duke University
dunson@stat.duke.edu

Abstract

We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques.
We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.

1 Introduction
A key challenge in many time series applications is capturing long-range dependencies for which Markov-based models are insufficient. One method of addressing this challenge is to employ a Gaussian process (GP) with an appropriate (non-band-limited) covariance function. However, GPs typically assume smoothness properties that can blur key elements of the signal if abrupt changes occur. The Matérn kernel enables less smooth functions, but assumes a stationary process that does not adapt to varying levels of smoothness. Likewise, a changepoint [21] or partition [8] model between smooth functions fails to capture long-range dependencies spanning changepoints.

Another long-memory process is the fractional ARIMA process [5, 13]. Wavelet methods have also been proposed, including recently for smooth functions with discontinuities [2]. We take a fundamentally different approach based on GPs that allows for (i) direct interpretability, (ii) local stationarity, (iii) irregular grids of observations, and (iv) sharing information across related time series.

As a motivating application, consider magnetoencephalography (MEG) recordings of brain activity in response to a word stimulus. Due to the low signal-to-noise-ratio (SNR) regime, multiple trials are often recorded, presenting a functional data analysis scenario. Each trial results in a noisy trajectory with key discontinuities (e.g., after stimulus onset). Although there are overall similarities between the trials, there are also key differences that occur based on various physiological phenomena, as depicted in Fig. 1. We clearly see abrupt changes as well as long-range correlations.
Key to the data analysis is the ability to share information about the overall trajectory between the single trials without forcing unrealistic smoothness assumptions on the single trials themselves.

In order to capture both long-range dependencies and potential discontinuities, we propose a multiresolution GP (mGP) that hierarchically couples a collection of smooth GPs, each defined over an element of a nested partition set. The top-level GP captures a smooth global trajectory, while the partition points define abrupt changes in correlation induced by the lower-level GPs. Due to the inherent conjugacy of the GPs, conditioned on the partition points the resulting function at the bottom level is marginally GP-distributed with a partition-dependent (and thus non-stationary) covariance function. The correlation between any two observations yi and yj generated by the mGP at locations xi and xj is a function of the distance ||xi − xj|| and which partition sets contain both xi and xj. In a standard regression setting, the marginal GP structure of the mGP allows us to compute the marginal likelihood of the data conditioned on the partition, enabling efficient inference of the partition itself. We integrate over the hierarchy of GPs and only sample the partition points. For our proposal distribution, we borrow the graph-theoretic idea of normalized cuts [22] often used in image segmentation. Our inferences integrate over the partition tree, allowing blurring of discontinuities and producing functions which can appear smooth when discontinuities are not present in the data.

Figure 1: For sensor 1 and word house, Left: Data from three trials; Middle: Empirical correlation matrix from 20 trials; Right: Hierarchical segmentation produced by recursive minimization of normalized cut objective, with color indicating tree level.

Figure 2: mGP on a balanced, binary tree partition: Parent function is split by A^1 = {A^1_1, A^1_2}. Recursing down the tree, each partition has a GP with mean given by its parent function restricted to that set.

2 Background
A GP provides a distribution on real-valued functions f : X → ℝ, with the property that the function evaluated at any finite collection of points is jointly Gaussian. The GP, denoted GP(m, c), is uniquely defined by its mean function m and covariance function c. That is, f ∼ GP(m, c) if and only if for all n ≥ 1 and x1, ..., xn, (f(x1), ..., f(xn)) ∼ N_n(µ, K), with µ = [m(x1), ..., m(xn)] and [K]_{ij} = c(xi, xj). The properties (e.g., continuity, smoothness, periodicity, etc.) of functions drawn from a given GP are determined by the covariance function. The squared exponential kernel, c(x, x′) = d exp(−κ ||x − x′||²₂), leads to smooth functions. Here, d is a scale hyperparameter and κ is the bandwidth determining the extent of the correlation in f over X. See [18] for further details.

3 Multiresolution Gaussian Process Formulation
Our interest is in modeling a function g that (i) is locally smooth, (ii) exhibits long-range correlations (i.e., corr(g(x), g(x′)) > 0 for ||x − x′|| relatively large), and (iii) has abrupt changes.
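As a quick numerical illustration of the Sec. 2 background, the following minimal sketch draws a smooth function from a GP with the squared exponential kernel (the helper name `se_kernel`, the jitter term, and the hyperparameter values are our own, not the paper's):

```python
import numpy as np

def se_kernel(x, d=1.0, kappa=10.0):
    # Squared exponential kernel of Sec. 2: c(x, x') = d * exp(-kappa * ||x - x'||^2)
    diff = x[:, None] - x[None, :]
    return d * np.exp(-kappa * diff**2)

# Evaluate the GP at a finite set of locations and draw f ~ N(0, K):
x = np.linspace(0, 1, 50)
K = se_kernel(x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability
f = np.random.default_rng(0).multivariate_normal(np.zeros(len(x)), K)
```

Larger κ shrinks the correlation length, so draws wiggle more; larger d scales the overall amplitude.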
We begin by modeling a single function, but with a specification that readily lends itself to modeling a collection of functions that share a common global trajectory, as explored in Sec. 4.

Generative Model Assume a set of noisy observations y = {y1, ..., yn}, yi ∈ ℝ, of the function g at locations {x1, ..., xn}, xi ∈ X ⊂ ℝ^p:

yi = g(xi) + ϵi,  ϵi ∼ N(0, σ²).  (1)

We hierarchically define g as follows. Let A = {A^0, A^1, ..., A^{L−1}} be a nested partition, or tree partition, of X with A^0 = X, X = ∪_i A^ℓ_i, A^ℓ_i ∩ A^ℓ_j = ∅, and A^ℓ_i ⊂ A^{ℓ−1}_k for some k. Furthermore, assume that each A^ℓ_i is a contiguous subset of X. Fig. 2 depicts a balanced, binary tree partition. We define a global parent function on A^0 as f^0 ∼ GP(0, c^0). This function captures the overall shape of g and its long-range dependencies. Then, over each partition set A^ℓ_i we independently draw

f^ℓ(A^ℓ_i) ∼ GP(f^{ℓ−1}(A^ℓ_i), c^ℓ_i).  (2)

That is, the mean of the GP is given by the parent function restricted to the current partition set. Due to the conditional independence of these draws, f^ℓ can have discontinuities at the partition points. However, due to the coupling of GPs through the tree, f^ℓ will maintain aspects of the shape of f^0. Finally, we set g = f^{L−1}. A pictorial representation of the mGP is shown in Fig.
2.

We can equivalently represent the mGP as an additive GP model: φ^ℓ(A^ℓ_i) ∼ GP(0, c^ℓ_i), g = Σ_ℓ φ^ℓ.

Covariance Function We assume a squared exponential kernel c^ℓ_i = d^ℓ_i exp(−κ^ℓ_i ||x − x′||²₂), encouraging local smoothness over each partition set A^ℓ_i. We focus on d^ℓ_i = d^ℓ with Σ_{ℓ=1}^∞ (d^ℓ)² < 1 for finite variance regardless of tree depth and additionally encouraging lower levels to vary less from their parent function, providing regularization and robustness to the choice of L. We typically assume bandwidths κ^ℓ_i = κ/||A^ℓ_i||²₂ so that each child function is locally as smooth as its parent. One can think of this formulation as akin to a fractal process: zooming in on any partition, the locally defined function has the same smoothness as that of its parent over the larger partition. Thus, lower levels encode finer-resolution details. We denote the covariance hyperparameters as θ = {d^0, ..., d^{L−1}, κ}, and omit the dependency in conditional distributions for notational simplicity. See the Supplementary Material for discussion of other possible covariance specifications.

Induced Marginal GP The conditional independencies of our mGP imply that

p(g | A) = ∫ p(f^0) Π_{ℓ=1}^{L−1} p(f^ℓ | f^{ℓ−1}, A^ℓ) df^{0:L−2}.  (3)

Due to the inherent conjugacy of the GPs, one can analytically marginalize the hierarchy of GPs conditioned on the partition tree A, yielding

g | A ∼ GP(0, c*_A),  c*_A = Σ_{ℓ=0}^{L−1} Σ_i c^ℓ_i I_{A^ℓ_i}.  (4)

Here, I_{A^ℓ_i}(x, x′) = 1 if x, x′ ∈ A^ℓ_i and 0 otherwise. Eq.
(4) provides an interpretation of the mGP as a (marginally) partition-dependent GP, where the partition A defines the discontinuities in the covariance function c*_A. The covariance function encodes local smoothness of g and discontinuities at the partition points. Note that c*_A defines a non-stationary covariance function.

The correlation between any two observations yi and yj at locations xi and xj generated as in Eq. (1) is a function of how many tree levels contain both xi and xj and the distance ||xi − xj||. Let r^ℓ_i index the partition set such that xi ∈ A^ℓ_{r^ℓ_i}, and let Lij be the lowest level for which xi and xj fall into the same set (i.e., the largest ℓ such that r^ℓ_i = r^ℓ_j). Then, for xi ≠ xj,

corr(yi, yj | A) = [Σ_{ℓ=0}^{Lij} c^ℓ_{r^ℓ_i}(xi, xj)] / Π_{k∈{i,j}} (σ² + Σ_{ℓ=0}^{L−1} c^ℓ_{r^ℓ_k}(xk, xk))^{1/2}
                = [Σ_{ℓ=0}^{Lij} d^ℓ exp(−κ ||xi − xj||²₂ / ||A^ℓ_{r^ℓ_i}||²₂)] / (σ² + Σ_{ℓ=0}^{L−1} d^ℓ),  (5)

where the second equality follows from assuming the previously described kernels. An example correlation matrix is shown in Fig. 3(c). κ determines the width of the bands while d^ℓ controls the contribution of level ℓ. Since d^ℓ is square summable, lower levels are less influential.

Marginal Likelihood Based on a vector of observations y = [y1 · · · yn]′ at locations x = [x1 · · · xn]′, we can restrict our attention to evaluating the GPs at x. Let f^ℓ(x) = [f^ℓ(x1) · · · f^ℓ(xn)]′.
By definition of the GP, we have

f^ℓ(x) | f^{ℓ−1}(x), A^ℓ ∼ N(f^{ℓ−1}(x), K^ℓ),  [K^ℓ]_{i,j} = c^ℓ_r(xi, xj) if xi, xj ∈ A^ℓ_r, and 0 otherwise.  (6)

The level-specific covariance matrix K^ℓ is block-diagonal with structure determined by the level-specific partition A^ℓ. Observations are generated as y | g(x) ∼ N(g(x), σ²I_n). Recalling Eq. (3), standard results yield

g(x) | A ∼ N(0, Σ_{ℓ=0}^{L−1} K^ℓ),  y | A ∼ N(0, σ²I_n + Σ_{ℓ=0}^{L−1} K^ℓ).  (7)

This result can also be derived from the induced mGP of Eq. (4). We see that the marginal likelihood p(y | A) has a closed form. Alternatively, one can condition on the GP at any level ℓ′:

y | f^{ℓ′}(x), A ∼ N(f^{ℓ′}(x), σ²I_n + Σ_{ℓ=ℓ′+1}^{L−1} K^ℓ).  (8)

A key advantage of the mGP is the conditional conjugacy of the latent GPs that allows us to compute the likelihood of the data conditioned simply on the hierarchical partition A (see Eq. (7)). This fact is fundamental to the efficiency of the partition inference procedure described in Sec. 5.

4 Multiple Trials
In many applications, such as the motivating MEG application, one has a collection of observations of an underlying signal. To capture the common global trajectory of these trials while still allowing for trial-specific variability, we model each as a realization from an mGP with a shared parent function f^0. One could trivially allow for alternative structures of hierarchical sharing beyond f^0 if an application warranted.
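The closed-form marginal likelihood of Eq. (7) underlies the partition inference described later. A minimal numerical sketch of the block-diagonal covariance of Eq. (6) and the resulting log marginal likelihood (helper names and the Cholesky-based evaluation are ours, not the paper's):

```python
import numpy as np

def block_se_cov(x, blocks, d, kappa):
    """Level-specific covariance K^l of Eq. (6): block-diagonal, with a
    squared exponential block for each partition set A^l_i (index lists in `blocks`)."""
    n = len(x)
    K = np.zeros((n, n))
    for idx in blocks:
        diff = x[idx][:, None] - x[idx][None, :]
        K[np.ix_(idx, idx)] = d * np.exp(-kappa * diff**2)
    return K

def log_marginal_likelihood(y, Ks, sigma2):
    """log p(y | A) from Eq. (7): y ~ N(0, sigma^2 I + sum_l K^l)."""
    n = len(y)
    C = sigma2 * np.eye(n) + sum(Ks)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L, y)
    return -0.5 * alpha @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
```

Note that entries of K^ℓ across different partition sets are exactly zero; correlation across a level-ℓ boundary comes only from the coarser levels in the sum.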
For simplicity, and due to the motivating MEG application, we additionally assume shared changepoints between the trials, though this assumption can also be relaxed.

Figure 3: (a) Three trials and (b) all 100 trials of data generated from a 5-level mGP with a shared parent function f^0 and partition A (randomly sampled). (c) True correlation matrix. (d) Empirical correlation matrix from 100 trials. (e) Hierarchical segmentation produced by recursive minimization of normalized cut objective.

Generative Model For each trial y^(j) = {y^(j)_1, ..., y^(j)_n}, we model

y^(j)_i = g^(j)(xi) + ϵ^(j)_i,  ϵ^(j)_i ∼ N(0, σ²),  (9)

with g^(j) = f^{L−1,(j)} generated from a trial-specific GP hierarchy f^0 → f^{1,(j)} → · · · → f^{L−1,(j)} with shared parent f^0. (Again, alternative structures can be considered.) From Eq. (8) with ℓ′ = 0, and exploiting the independence of {f^{ℓ,(j)}}, independently for each j

y^(j) | f^0(x), A ∼ N(y^(j); f^0(x), σ²I_n + Σ_{ℓ=1}^{L−1} K^ℓ).  (10)

Note that with our GP-based formulation, we need not assume coincident observation locations x1, ..., xn between the trials.
However, for simplicity of exposition, we consider shared locations. We compactly denote the covariance by Σ = σ²I_n + Σ_{ℓ=1}^{L−1} K^ℓ.

Simulated data generated from a 5-level mGP with shared f^0 and A are shown in Fig. 3. The sample correlation matrix is also shown. Compare with the MEG data of Fig. 1. Both the qualitative structure of the raw time series and the blockiness of the correlation matrix have striking similarities.

Posterior Global Trajectory and Predictions Based on a set of trials {y^(1), ..., y^(J)}, it is of interest to infer the posterior of f^0. Standard Gaussian conjugacy results imply that

p(f^0(x) | y^(1), ..., y^(J), A) = N((K_0^{−1} + JΣ^{−1})^{−1} ỹ, (K_0^{−1} + JΣ^{−1})^{−1}),  (11)

where ỹ = Σ^{−1} Σ_i y^(i). Likewise, the predictive distribution of data from a new trial is

p(y^(J+1) | y^(1), ..., y^(J), A) = ∫ p(y^(J+1) | f^0(x), A) p(f^0(x) | y^(1), ..., y^(J), A) df^0
 = N((K_0^{−1} + JΣ^{−1})^{−1} ỹ, Σ + (K_0^{−1} + JΣ^{−1})^{−1}).  (12)

Marginal Likelihood Since the set of trials Y = {y^(1), ..., y^(J)} are generated from a shared parent function f^0, the marginal likelihood does not decompose over trials. Instead,

p(Y | A) = [|K_0|^{−1/2} |Σ|^{−J/2} / ((2π)^{nJ/2} |K_0^{−1} + JΣ^{−1}|^{1/2})] exp(−(1/2) Σ_i y^(i)′ Σ^{−1} y^(i) + (1/2) ỹ′ (K_0^{−1} + JΣ^{−1})^{−1} ỹ).  (13)

See the Supplementary Material for a derivation. One can easily verify that the above simplifies to the marginal likelihood of Eq.
(7) when J = 1.

5 Inference of the Hierarchical Partition
In the formulation so far, we have assumed that the hierarchical partition A is given. A key question is how to infer the partition from the data. Assume that we have a prior p(A) on the hierarchical partition. Based on the fact that we can analytically compute p(Y | A), we can use importance sampling or independence-chain Metropolis-Hastings to draw samples from the posterior p(A | Y).

In what follows, we assume a balanced binary tree for A. See the Supplementary Material for a discussion of how unbalanced trees can be considered via modifications to the covariance hyperparameter specification or by considering alternative priors p(A) such as the Mondrian process [20].

Partition Prior We consider a prior solely on the partition points {z1, ..., z_{2^{L−1}−1}} rather than taking tree level into account as well. Because of our time-series analysis focus, we assume X ⊂ ℝ. We define a distribution F on X and specify p(A) = Π_i F(zi). Generatively, one can think of drawing 2^{L−1} − 1 partition points from F and deterministically forming a balanced binary tree A from these. For multidimensional X, one could use a Voronoi tessellation and graph matching to build the tree from the randomly selected zi. Such a prior allows for trivial specification of a uniform distribution on A (simply taking F uniform on X) or for eliciting prior information on changepoints, such as based on physiological information for the MEG data. Eliciting such information in a level-dependent setup is not straightforward. Also, despite common deployment, taking the partition point at level ℓ as uniformly distributed over the parent set A^{ℓ−1}_i yields high mass on A with small A^ℓ_i. This property is undesirable because it leads to trees with highly unbalanced partitions.

Our resulting inferences perform Bayesian model averaging over trees.
As such, even though we specify a prior on partitions with 2^{L−1} − 1 changepoints, the resulting functions can appear to adaptively use fewer by averaging over the uncertainty in the discontinuity location.

Partition Proposal Although stochastic tree search algorithms tend to be inefficient in general, we can harness the well-defined correlation structure associated with a given hierarchical partition to much more efficiently search the tree space. One can think of every observed location xi as a node in a graph with edge weights between xi and xj defined by the magnitude of the correlation of yi and yj. Based on this interpretation, the partition points of A correspond to graph cuts that bisect small edge weights, as graphically depicted in Fig. 4. As such, we seek a method for hierarchically cutting a graph. Given a cost matrix W with elements w_{uv} defined for all pairs of nodes u, v in a set V, the normalized cut metric [22] for partitioning V into disjoint sets A and B is given by

ncut(A, B) = cut(A, B) [assoc(A, V)^{−1} + assoc(B, V)^{−1}],  (14)

where cut(A, B) = Σ_{u∈A,v∈B} w_{uv} and assoc(A, V) = Σ_{u∈A,v∈V} w_{uv}. Typically, the cut point is selected as the minimum of the metric ncut(A, B) computed over all possible subsets A and B. The normalized cut metric balances between the cost of the edge weights cut and the connectivity of the cut component, thus avoiding cuts that separate small sets. Fig. 1 shows an example of applying a greedy normalized cuts algorithm (recursively minimizing ncut(A, B)) to MEG data.

Instead of deterministically selecting cut points, we employ the normalized cut objective as a proposal distribution. Let the cost matrix W be the absolute value of the empirical correlation matrix computed from trials {y^(1), ..., y^(J)} (see Fig. 1). Due to the natural ordering of our locations xi ∈ X ⊂ ℝ, the algorithm is straightforwardly implemented.
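The normalized cut objective of Eq. (14), restricted to contiguous time splits as used here, can be sketched as follows (helper names are ours; the scan over contiguous splits exploits the natural ordering of the xi ∈ ℝ):

```python
import numpy as np

def ncut(W, A, B):
    """Normalized cut objective of Eq. (14) for a split of nodes into sets A, B."""
    cut = W[np.ix_(A, B)].sum()
    V = A + B
    assoc_A = W[np.ix_(A, V)].sum()
    assoc_B = W[np.ix_(B, V)].sum()
    return cut * (1.0 / assoc_A + 1.0 / assoc_B)

def best_contiguous_cut(W):
    """For time-ordered nodes, scan all contiguous splits {0..t-1} vs {t..n-1}
    and return the cut point minimizing ncut, plus all scores."""
    n = W.shape[0]
    scores = [ncut(W, list(range(t)), list(range(t, n))) for t in range(1, n)]
    return int(np.argmin(scores)) + 1, scores
```

Inverting the scores (as in the proposal of Eq. (15)) instead of taking the argmin turns this deterministic scan into a proposal distribution over cut points.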
We step down the hierarchy, first proposing a cut of A^0 into {A^1_1, A^1_2} with probability

q({A^1_1, A^1_2}) ∝ ncut(A^1_1, A^1_2)^{−1}.  (15)

Figure 4: Illustration of cutpoints dividing contiguous segments at points of low correlation.

At level ℓ, each A^ℓ_i is partitioned via a normalized cut proposal based on the submatrix of W corresponding to the locations xi ∈ A^ℓ_i. The probability of any partition A under the specified proposal distribution is simply computed as the product of the sequence of conditional probabilities of each cut. This procedure generates cut points only at the observed locations xi. More formally, the partition point in X is proposed as uniformly distributed between xi and x_{i+1}. Extensions to multidimensional X rely on spectral clustering algorithms based on the graph Laplacian [24].

Markov Chain Monte Carlo An importance sampler draws hierarchical partitions A^(m) ∼ q, with the proposal distribution q defined as above, and then weights the samples by p(A^(m))/q(A^(m)) to obtain posterior draws [19]. Such an approach is naively parallelizable, and thus amenable to efficient computations, though the effective sample size may be low if q does not adequately match the posterior p(A | Y). Alternatively, a straightforward independence-chain Metropolis-Hastings algorithm (see Supplementary Material) is defined by iteratively proposing A′ ∼ q, which is accepted with probability min{r(A′ | A), 1}, where A is a previous sample of a hierarchical partition and

r(A′ | A) = p(Y | A′) p(A′) q(A) / [p(Y | A) p(A) q(A′)].  (16)

The tailoring of the proposal distribution q to this application based on normalized cuts dramatically aids in improving the acceptance rate relative to more naive tree proposals.
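The acceptance step of Eq. (16), evaluated in log space for numerical stability, might look like the following sketch (the function arguments are placeholders for the quantities log p(Y | ·), log p(·), and log q(·) computed elsewhere):

```python
import numpy as np

def mh_accept(log_lik, log_prior, log_q, A_prop, A_curr, rng):
    """Independence-chain MH step for the partition (Eq. (16)), in log space:
    accept A' with probability min{1, p(Y|A') p(A') q(A) / [p(Y|A) p(A) q(A')]}."""
    log_r = (log_lik(A_prop) + log_prior(A_prop) + log_q(A_curr)
             - log_lik(A_curr) - log_prior(A_curr) - log_q(A_prop))
    return np.log(rng.uniform()) < min(log_r, 0.0)
```

Because the proposal is an independence chain, q enters only through the ratio q(A)/q(A′); a proposal that closely tracks the posterior drives log r toward 0 and the acceptance rate toward 1.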
However, the acceptance rate tends to decrease as higher posterior probability partitions A are discovered, especially for trees with many levels and large input spaces X for which the search space is larger.

One benefit of the MCMC approach over importance sampling is the ability to include more intricate tree proposals to increase efficiency. We choose to interleave both local and global tree proposals. At each iteration, we first randomly select a node in the tree (i.e., a partition set A^ℓ_i) and then propose a new sequence of cuts for all children of this node. When the root node is selected, corresponding to A^0, the proposal is equivalent to the global proposals previously considered. We adapt the proposal distribution for node selection to encourage more global searches at first and then shift towards a greater balance between local and global searches as the sampling progresses. Sequential Monte Carlo methods [4] can also be considered, with particles generated as global proposals.

Computational Complexity The per-iteration complexity is O(n³), equivalent to a typical likelihood evaluation under a GP prior. Using dynamic programming, the cost associated with the normalized cuts proposal is O(n²(L − 1)). Standard techniques for more efficient GP computations are readily applicable, as well as extensions that harness the additive block structure of the covariance.

6 Related Work
Various aspects of the mGP have similarities to other models proposed in the literature that primarily fall into two main categories: (i) GPs defined over a partitioned input space, and (ii) collections of GPs defined at tree nodes. The treed GP [8] captures non-stationarities by defining independent GPs at the leaves of a Bayesian CART-partitioned input space. The related approach of [12] assumes a Voronoi tessellation.
For time series, [21] examines online inference of changepoints with GPs modeling the data within each segment. These methods capture abrupt changes, but do not allow for long-range dependencies spanning changepoints nor a functional data hierarchical structure, both inherent to our multiresolution perspective. A main motivation of the treed GP is the resulting computational speed-ups of an independently partitioned GP. A two-level hierarchical GP also aimed at computational efficiency is considered by [16], where the top-level GP is defined at a coarser scale and provides a piecewise-constant mean for lower-level GPs on a pre-partitioned input space.

[10, 11] consider covariance functions defined on a phylogenetic tree such that the covariance between function-valued traits depends on both their spatial distance and the evolutionary time spanned via a common ancestor. Here, the tree defines the strength and structure of sharing between a collection of functions rather than abrupt changes within the function. The Bayesian rose tree of [3] considers a mixture of GP experts, as in [14, 17], but using Bayesian hierarchical clustering with arbitrary branching structure in place of a Dirichlet process mixture. Such an approach is fundamentally different from the mGP: each GP is defined over the entire input space, data result from a GP mixture, and input points are not necessarily spatially clustered. Alternatively, multiscale processes have a long history (cf. [25]): the variables define a Markov process on a typically balanced, binary tree and higher-level nodes capture coarser-level information about the process.
In contrast, the higher-level nodes in the mGP share the same temporal resolution and only vary in smoothness.

At a high level, the mGP differs from previous GP-based tree models in that the nodes of our tree represent GPs over a contiguous subset of the input space X constrained in a hierarchical fashion. Thus, the mGP combines ideas of GP-based tree models and GP-based partition models.

As presented in Sec. 3, one can formulate an mGP as an additive GP where each GP in the sum decomposes independently over the level-specific partition of the input space X. The additive GPs of [6] instead focus on coping with multivariate inputs, in a similar vein to hierarchical kernel learning [1], thus addressing an inherently different task.

7 Results
7.1 Synthetic Experiments
To assess our ability to infer a hierarchical partition via the proposed MCMC sampler, we generated 100 trials of length 200 from a 5-level mGP with a shared parent function f^0. The hyperparameters were set to σ² = 0.1, κ = 10, d^ℓ = d^0 exp(−0.5(ℓ + 1)) for ℓ = 0, ..., L − 1 with d^0 = 5. The data are shown in Fig. 3, along with the empirical correlation matrix that is used as the cost matrix for the normalized cuts proposals.

For inference, we set σ² = σ̂²/3 and d^ℓ = (σ̂²/3) exp(−0.5ℓ), where σ̂² is the average time-specific sample variance. κ was set as in the simulation.
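A simulation of this kind can be sketched as follows (a simplified sketch: we use a fixed bandwidth κ for every partition set rather than the set-size scaling of Sec. 3, the partition is passed in as fixed index blocks, and all helper names are our own):

```python
import numpy as np

def se_kernel(x, d, kappa):
    # Squared exponential kernel on 1-D inputs
    diff = x[:, None] - x[None, :]
    return d * np.exp(-kappa * diff**2)

def sample_mgp_trial(x, level_blocks, ds, kappa, sigma2, rng):
    """Sample one trial from an L-level mGP (Sec. 3):
    f^l | f^{l-1} ~ GP(f^{l-1}, c^l) independently over each partition set,
    g = f^{L-1}, y = g + noise. level_blocks[l] lists the index sets A^l_i."""
    n = len(x)
    jitter = 1e-10 * np.eye(n)
    f = rng.multivariate_normal(np.zeros(n), se_kernel(x, ds[0], kappa) + jitter[:n, :n])  # f^0
    for blocks, d in zip(level_blocks[1:], ds[1:]):
        for idx in blocks:  # independent deviation on each partition set A^l_i
            K = se_kernel(x[idx], d, kappa) + 1e-10 * np.eye(len(idx))
            f[idx] = rng.multivariate_normal(f[idx], K)
    return f + np.sqrt(sigma2) * rng.standard_normal(n)
```

Calling this repeatedly with the same f^0 held fixed (rather than redrawn) would mimic the shared-parent, multiple-trial setup of Sec. 4.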
The hyperparameter mismatch demonstrates some robustness to misspecification. For a uniform prior p(A), 10 independent MCMC chains were run for 3000 iterations, thinned by 10. The first 1000 iterations used pure global tree searches; the sampler was then tempered to uniform node proposals. The effects of this choice are apparent in the likelihood plot of Fig. 5, which also displays the true hierarchical partition and the MAP estimate. Compare to the normalized cuts partition of Fig. 3, especially at the important level 1 cut.

Figure 5: For the data of Fig. 3, (a) true and (b) MAP partitions. (c) Trace plots of log likelihood versus MCMC iteration for 10 chains. Log likelihood under the true partition (cyan) and minimized normalized cut partition of Fig. 3 (magenta) are also shown. (d) Errors between posterior mean f^0 and true f^0 for GP, hGP, and mGP. (e) Predictive log likelihood of 10 heldout sequences for GP, hGP, and mGP with L = 2, 5 (true), 7, 10.
The full simulation study took less than 7 minutes to run on a single 1.8 GHz Intel Core i7 processor.

To assess sensitivity to the choice of L, we compare the predictive log-likelihood of 10 heldout test sequences under an mGP with 2, 5, 7, and 10 levels. As shown in Fig. 5(e), there is a clear gain going from 2 to 5 levels. However, overestimating L has minimal influence on predictive likelihood since lower tree levels capture finer details and have less overall effect. We also compare to a single GP and a 2-level hierarchical GP (hGP) (see Sec. 7.2). For a direct comparison, both use squared exponential kernels. Hyperparameters were set as in the mGP for the top-level GP. The total variance was also matched, with the GP taking this as noise and the hGP splitting it between level 2 and noise. In addition to better predictive performance, Fig. 5(d) shows the mGP's improved estimation of f^0.

7.2 MEG Analysis
We analyzed magnetoencephalography (MEG) recordings of neuronal activity collected from a helmet with gradiometers distributed over 102 locations around the head. The gradiometers measure the spatial gradient of the magnetic activity in Teslas per meter (T/m) [9]. Since the firings of neurons in the brain only induce a weak magnetic field outside of the skull, the signal-to-noise ratio of the MEG data is very low, and typically multiple recordings, or trials, of a given task are collected. Our MEG data was recorded while a subject viewed 20 stimuli describing concrete nouns (both the written noun and a representative line drawing), with 20 interleaved trials per word. See the Supplementary Material for further details on the data and our analyses presented herein.

Efficient sharing of information between the single trials is important for tasks such as word classification [7].
A key insight of [7] was the importance of capturing the time-varying correlations between MEG sensors for performing classification. However, the formulation still necessitates a mean model. [7] propose a 2-level hierarchical GP (hGP): a parent GP captures the common global trajectory, as in the mGP, and each trial-specific GP is centered about the entire parent function¹. This formulation maintains global smoothness at the individual trial level. The mGP instead models the trial-specific variability with a multi-level tree of GPs defined as deviations from the parent function over local partitions, allowing for abrupt changes relative to the smooth global trajectory.

For our analyses, we consider the words associated with the "building" and "tool" categories shown in Fig. 7. Independently for each of the 10 words and 102 sensors, we trained a 5-level mGP using 15 randomly selected trials as training data and the 5 remaining for testing. Each trial was of length n = 340. We ran 3 independent MCMC chains for 3000 iterations with both global and local tree searches. We discarded the first 1000 samples as burn-in and thinned by 10. The mGP hyperparameters were set exactly as in the simulated study of Sec. 7.1 for structure learning and then optimized over a grid to maximize the marginal likelihood of the training data.

We compare the predictive performance of the mGP in terms of MSE of heldout segments relative to a GP and hGP, each with similarly optimized hyperparameters. The predictive mean conditioned on data up to the heldout time is straightforwardly derived from Eq. (12). For the mGP, the calculation is averaged over the posterior samples of A. Fig. 6 displays the MSEs decomposed by cortical region.

¹The model of [7] uses an hGP in a latent space.
The mGP could be similarly deployed.

Figure 6: Per-lobe comparison of mGP to (a) GP and (b) hGP: for various values of τ, % decrease in predictive MSE of heldout y*_{τ:τ+30} conditioned on y*_{1:τ−1} and 15 training sequences. (c) For a visual cortex sensor and the word hammer, plots of test data, empirical mean (MLE), and hGP and mGP predictive means for the entire heldout y*. (d) Boxplots of predictive log-likelihood of heldout y* for the mGP and the wavelet-based method of [15].

The results clearly indicate that the mGP consistently better captures the features of the data, particularly for sensors with large abrupt changes such as in the visual cortex. The heldout trials for a visual cortex sensor are displayed in Fig. 6(c). Relative to the hGP, the mGP much better tracks the early dip in activity right after the visual stimulus onset (t = 0). The posterior distributions of inferred changepoints at level 1, broken down by cortical region, are displayed in Fig. 7. As expected, the visual cortex has the earliest changepoints.
Similar trends are seen in the parietal lobe, which handles perception and sensory integration. The temporal lobe, which is key in semantic processing, has changepoints occurring later. These results concur with the findings of [23]: semantic processing starts between 250 and 600 ms, and word length (a visual feature) is decoded most accurately very near the standard 100 ms response time ("n100").

Figure 7: Inferred changepoints at level 1 aggregated over sensors within each lobe: visual (top-left), frontal (top-right), parietal (bottom-left), and temporal (bottom-right). (Words, top to bottom in each panel: igloo, house, church, apartment, barn, hammer, saw, screwdriver, pliers, chisel.)

We also compare our predictive performance to that of the wavelet-based functional mixed model (wfmm) of [15]. The wfmm has become a standard approach for functional data analysis since it allows for spiky trajectories and efficient sharing of information between trials. One limitation, however, is the restriction to a regular grid of observations. The wfmm enables analysis in a multivariate setting, but for a direct comparison we simply apply the wfmm to each word and sensor independently. Fig. 6(d) shows boxplots of the predictive heldout log-likelihood of the test trials under the mGP and wfmm. The results are over 5 heldout trials, 102 sensors, and 10 words. In addition to being more easily interpretable, the mGP also exceeds the wfmm in predictive performance.

8 Discussion
The mGP provides a flexible framework for characterizing the dependence structure of real data, such as the examined MEG recordings, capturing certain features more accurately than previous models.
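The two-level special case of this construction, a single smooth parent GP plus independent child-GP deviations over the elements of a partition, can be sketched generatively as follows. The changepoint locations and hyperparameter values here are illustrative placeholders, not the random nested partition or inferred settings of the full model:

```python
import numpy as np

def sq_exp(x, length_scale, variance):
    """Squared exponential kernel matrix for 1-D inputs x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def sample_two_level_mgp(x, changepoints, rng, parent_ls=0.3, child_ls=0.05,
                         parent_var=1.0, child_var=0.25, jitter=1e-6):
    """Toy two-level draw: a smooth parent GP over all of x, plus an
    independent, rougher child-GP deviation on each partition element."""
    n = len(x)
    parent = rng.multivariate_normal(
        np.zeros(n), sq_exp(x, parent_ls, parent_var) + jitter * np.eye(n))
    f = parent.copy()
    bounds = [0] + list(changepoints) + [n]
    for a, b in zip(bounds[:-1], bounds[1:]):
        xs = x[a:b]
        m = len(xs)
        # Child deviation restarts at each changepoint, giving abrupt changes
        # relative to the smooth global trajectory.
        f[a:b] += rng.multivariate_normal(
            np.zeros(m), sq_exp(xs, child_ls, child_var) + jitter * np.eye(m))
    return parent, f

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
parent, f = sample_two_level_mgp(x, changepoints=[70, 140], rng=rng)
```

Deeper trees simply repeat the child step recursively, with each level defined as a deviation over a refinement of its parent's partition.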
In particular, the mGP provides a hierarchical functional data analysis framework for modeling (i) strong, locally smooth sharing of information, (ii) global long-range correlations, and (iii) abrupt changes. The simplicity of the mGP formulation enables further theoretical analysis, for example, combining posterior consistency results from changepoint analysis with those for GPs. Although we focused on univariate time series analysis, our formulation is amenable to multivariate functional data analysis extensions: one can naturally accommodate hierarchical dependence structures through partial sharing of parents in the tree, or possibly via mGP factor models.

There are many interesting questions relating to the proposed covariance function. Our fractal specification represents a particular choice to avoid over-parameterization, although alternatives could be considered. For hyperparameter inference, we anticipate that joint sampling with the partition would mix poorly, and consider it a topic for future exploration. Another interesting topic is to explore proposals for more general tree structures. We believe that the proposed mGP represents a powerful, broadly applicable new framework for non-stationary analyses, especially in a functional data analysis setting, and sets the foundation for many interesting possible extensions.

Acknowledgments
The authors thank Alona Fyshe, Gustavo Sudre, and Tom Mitchell for their help with data acquisition, preprocessing, and useful suggestions. This work was supported in part by AFOSR Grant FA9550-12-1-0453 and the National Institute of Environmental Health Sciences (NIEHS) of the NIH under Grant R01 ES017240.

References
[1] F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical Report 0909.0844v1, arXiv, 2009.

[2] J. Beran and Y. Shumeyko.
On asymptotically optimal wavelet estimation of trend functions under long-range dependence. Bernoulli, 18(1):137–176, 2012.

[3] C. Blundell, Y. W. Teh, and K. A. Heller. Bayesian rose trees. In Proc. Uncertainty in Artificial Intelligence, pages 65–72, 2010.

[4] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society, Series B, 68(3):411–436, 2006.

[5] F. X. Diebold and G. D. Rudebusch. Long memory and persistence in aggregate output. Journal of Monetary Economics, 24:189–209, 1989.

[6] D. Duvenaud, H. Nickisch, and C. E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, volume 24, pages 226–234, 2011.

[7] A. Y. Fyshe, E. B. Fox, D. B. Dunson, and T. Mitchell. Hierarchical latent dictionaries for models of brain activation. In Proc. International Conference on Artificial Intelligence and Statistics, pages 409–421, 2012.

[8] R. B. Gramacy and H. K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

[9] P. Hansen, M. Kringelbach, and R. Salmelin. MEG: An Introduction to Methods. Oxford University Press, USA, 2010. ISBN 0195307232.

[10] R. Henao and J. E. Lucas. Efficient hierarchical clustering for continuous data. Technical Report 1204.4708v1, arXiv, 2012.

[11] N. S. Jones and J. Moriarty. Evolutionary inference for function-valued traits: Gaussian process regression on phylogenies. Technical Report 1004.4668v2, arXiv, 2011.

[12] H. M. Kim, B. K. Mallick, and C. C. Holmes. Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association, 100(470):653–668, 2005.

[13] P. S. Kokoszka and M. S. Taqqu. Parameter estimation for infinite variance fractional ARIMA.
The Annals of Statistics, 24(5):1880–1913, 1996.

[14] E. Meeds and S. Osindero. An alternative mixture of Gaussian process experts. In Advances in Neural Information Processing Systems, volume 18, pages 883–890, 2006.

[15] J. S. Morris and R. J. Carroll. Wavelet-based functional mixed models. Journal of the Royal Statistical Society, Series B, 68(2):179–199, 2006.

[16] S. Park and S. Choi. Hierarchical Gaussian process regression. In Proc. Asian Conference on Machine Learning, pages 95–110, 2010.

[17] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, volume 2, pages 881–888, 2002.

[18] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[19] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2005.

[20] D. M. Roy and Y. W. Teh. The Mondrian process. In Advances in Neural Information Processing Systems, volume 21, pages 1377–1384, 2009.

[21] Y. Saatci, R. Turner, and C. E. Rasmussen. Gaussian process change point models. In Proc. International Conference on Machine Learning, pages 927–934, 2010.

[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[23] G. Sudre, D. Pomerleau, M. Palatucci, L. Wehbe, A. Fyshe, R. Salmelin, and T. Mitchell. Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62(1):451–463, 2012.

[24] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[25] A. S. Willsky. Multiresolution Markov models for signal and image processing.
Proceedings of the IEEE, 90(8):1396–1458, 2002.