{"title": "GP Kernels for Cross-Spectrum Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1999, "page_last": 2007, "abstract": "Multi-output Gaussian processes provide a convenient framework for multi-task problems. An illustrative and motivating example of a multi-task problem is multi-region electrophysiological time-series data, where experimentalists are interested in both power and phase coherence between channels. Recently, Wilson and Adams (2013) proposed the spectral mixture (SM) kernel to model the spectral density of a single task in a Gaussian process framework. In this paper, we develop a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. This new, flexible kernel represents both the power and phase relationship between multiple observation channels. We demonstrate the expressive capabilities of the CSM kernel through implementation of a Bayesian hidden Markov model, where the emission distribution is a multi-output Gaussian process with a CSM covariance kernel. Results are presented for measured multi-region electrophysiological data.", "full_text": "GP Kernels for Cross-Spectrum Analysis\n\n1Kyle Ulrich, 3David E. Carlson, 2Kafui Dzirasa, 1Lawrence Carin\n1Department of Electrical and Computer Engineering, Duke University\n2Department of Psychiatry and Behavioral Sciences, Duke University\n{kyle.ulrich, kafui.dzirasa, lcarin}@duke.edu\n\n3Department of Statistics, Columbia University\n\ndavid.edwin.carlson@gmail.com\n\nAbstract\n\nMulti-output Gaussian processes provide a convenient framework for multi-task\nproblems. An illustrative and motivating example of a multi-task problem is\nmulti-region electrophysiological time-series data, where experimentalists are in-\nterested in both power and phase coherence between channels. 
Recently, Wilson\nand Adams (2013) proposed the spectral mixture (SM) kernel to model the spec-\ntral density of a single task in a Gaussian process framework. In this paper, we\ndevelop a novel covariance kernel for multiple outputs, called the cross-spectral\nmixture (CSM) kernel. This new, \ufb02exible kernel represents both the power and\nphase relationship between multiple observation channels. We demonstrate the\nexpressive capabilities of the CSM kernel through implementation of a Bayesian\nhidden Markov model, where the emission distribution is a multi-output Gaus-\nsian process with a CSM covariance kernel. Results are presented for measured\nmulti-region electrophysiological data.\n\nIntroduction\n\n1\nGaussian process (GP) models have become an important component of the machine learning liter-\nature. They have provided a basis for non-linear multivariate regression and classi\ufb01cation tasks, and\nhave enjoyed much success in a wide variety of applications [16].\nA GP places a prior distribution over latent functions, rather than model parameters. In the sense\nthat these functions are de\ufb01ned for any number of sample points and sample positions, as well\nas any general functional form, GPs are nonparametric. The properties of the latent functions are\nde\ufb01ned by a positive de\ufb01nite covariance kernel that controls the covariance between the function\nat any two sample points. Recently, the spectral mixture (SM) kernel was proposed by Wilson and\nAdams [24] to model a spectral density with a scale-location mixture of Gaussians. This \ufb02exible\nand interpretable class of kernels is capable of recovering any composition of stationary kernels\n[27, 9, 13]. 
The SM kernel has been used for GP regression of a scalar output (i.e., single function, or\nobservation \u201ctask\u201d), achieving impressive results in extrapolating atmospheric CO2 concentrations\n[24]; image inpainting [25]; and feature extraction from electrophysiological signals [21].\nHowever, the SM kernel is not de\ufb01ned for multiple outputs (multiple correlated functions). Multi-\noutput GPs intersect with the \ufb01eld of multi-task learning [4], where solving similar problems jointly\nallows for the transfer of statistical strength between problems, improving learning performance\nwhen compared to learning all tasks individually. In this paper, we consider neuroscience appli-\ncations where low-frequency (< 200 Hz) extracellular potentials are simultaneously recorded from\nimplanted electrodes in multiple brain regions of a mouse [6]. These signals are known as local \ufb01eld\npotentials (LFPs) and are often highly correlated between channels. Inferring and understanding\nthat interdependence is biologically signi\ufb01cant.\n\n1\n\n\fA multi-output GP can be thought of as a standard GP (all observations are jointly normal) where the\ncovariance kernel is a function of both the input space and the output space (see [2] and references\ntherein for a comprehensive review); here \u201cinput space\u201d means the points at which the functions are\nsampled (e.g., time), and the \u201coutput space\u201d may correspond to different brain regions. 
A particular\npositive de\ufb01nite form of this multi-output covariance kernel is the sum of separable (SoS) kernels,\nor the linear model of coregionalization (LMC) in the geostatistics literature [10], where a separable\nkernel is represented by the product of separate kernels for the input and output spaces.\nWhile extending the SM kernel to the multi-output setting via the LMC framework (i.e., the SM-\nLMC kernel) provides a powerful modeling framework, the SM-LMC kernel does not intuitively\nrepresent the data. Speci\ufb01cally, the SM-LMC kernel encodes the cross-amplitude spectrum (square\nroot of the cross power spectral density) between every pair of channels, but provides no cross-\nphase information. Together, the cross-amplitude and cross-phase spectra form the cross-spectrum,\nde\ufb01ned as the Fourier transform of the cross-covariance between the pair of channels.\nMotivated by the desire to encode the full cross-spectra into the covariance kernel, we design a novel\nkernel termed the cross-spectral mixture (CSM) kernel, which provides an intuitive representation\nof the power and phase dependencies between multiple outputs. The need for embedding the full\ncross-spectrum into the covariance kernel is illustrated by a recent surge in neuroscience research\ndiscovering that LFP interdependencies between regions exhibit phase synchrony patterns that are\ndependent on frequency band [11, 17, 18].\nThe remainder of the paper is organized as follows. Section 2 provides a summary of GP regression\nmodels for vector-valued data, and Section 3 introduces the SM, SM-LMC, and novel CSM covari-\nance kernels. In Section 4, the CSM kernel is incorporated in a Bayesian hidden Markov model\n(HMM) [14] with a GP emission distribution as a demonstration of its utility in hierarchical model-\ning. 
Section 5 provides details on inverting the Bayesian HMM with variational inference, as well as details on a fast, novel GP fitting process that approximates the CSM kernel by its representation in the spectral domain. Section 6 analyzes the performance of this approximation and presents results for the CSM kernel in the neuroscience application, considering measured multi-region LFP data from the brain of a mouse. We conclude in Section 7 by discussing how this novel kernel can trivially be extended to any time-series application where GPs and the cross-spectrum are of interest.\n\n2 Review of Multi-Output Gaussian Process Regression\n\nA multi-output regression task estimates samples from C output channels, y_n = [y_{n1}, ..., y_{nC}]^T corresponding to the n-th input point x_n (e.g., the n-th temporal sample). An unobserved latent function f(x) = [f_1(x), ..., f_C(x)]^T is responsible for generating the observations, such that y_n ∼ N(f(x_n), H^{-1}), where H = diag(η_1, ..., η_C) is the precision of additive Gaussian noise.\n\nA GP prior on the latent function is formalized by f(x) ∼ GP(m(x), K(x, x′)) for arbitrary input x, where the mean function m(x) ∈ R^C is set to equal 0 without loss of generality, and the covariance function (K(x, x′))_{c,c′} = k^{c,c′}(x, x′) = cov(f_c(x), f_{c′}(x′)) creates dependencies between observations at input points x and x′, as observed on channels c and c′. In general, the input space x could be vector valued, but for simplicity we here assume it to be scalar, consistent with our motivating neuroscience application in which x corresponds to time.\n\nA convenient representation for multi-output kernel functions is to separate the kernel into the product of a kernel for the input space and a kernel for the interactions between the outputs. This is known as a separable kernel. 
A sum of separable kernels (SoS) representation [2] is given by\n\nk^{c,c′}(x, x′) = Σ_{q=1}^Q b_q(c, c′) k_q(x, x′),   or   K(x, x′) = Σ_{q=1}^Q B_q k_q(x, x′),   (1)\n\nwhere k_q(x, x′) is the input space kernel for component q, b_q(c, c′) is the q-th output interaction kernel, and B_q ∈ R^{C×C} is a positive semi-definite output kernel matrix. Note that we have a discrete set of C output spaces, c ∈ {1, ..., C}, where the input space x is continuous, and discretely sampled arbitrarily in experiments. The SoS formulation is also known as the linear model of coregionalization (LMC) [10] and B_q is termed the coregionalization matrix. When Q = 1, the LMC reduces to the intrinsic coregionalization model (ICM) [2], and when rank(B_q) is restricted to equal 1, the LMC reduces to the semiparametric latent factor model (SLFM) [19].\n\nAny finite number of latent functional evaluations f = [f_1(x), ..., f_C(x)]^T at locations x = [x_1, ..., x_N]^T has a multivariate normal distribution N(f; 0, K), such that K is formed through the block partitioning\n\nK = [ k^{1,1}(x, x) ··· k^{1,C}(x, x) ; ... ; k^{C,1}(x, x) ··· k^{C,C}(x, x) ] = Σ_{q=1}^Q B_q ⊗ k_q(x, x),   (2)\n\nwhere each k^{c,c′}(x, x) is an N × N matrix and ⊗ symbolizes the Kronecker product.\n\nA vector-valued dataset consists of observations y = vec([y_1, ..., y_N]^T) ∈ R^{CN} at the respective locations x = [x_1, ..., x_N]^T such that the first N elements of y are from channel 1 up to the last N elements belonging to channel C. 
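As an illustrative sketch of the sum-of-separable construction in (1)-(2) — the squared-exponential input kernels and random rank-1 coregionalization matrices below are assumptions for the example, not the paper's choices:

```python
import numpy as np

def sq_exp(x, ell):
    """Squared-exponential input-space kernel k_q(x, x') on a 1-D grid."""
    tau = x[:, None] - x[None, :]
    return np.exp(-0.5 * tau**2 / ell**2)

def lmc_covariance(x, Bs, ells):
    """Sum-of-separable (LMC) covariance: K = sum_q B_q kron k_q(x, x)."""
    C = Bs[0].shape[0]
    K = np.zeros((C * len(x), C * len(x)))
    for B, ell in zip(Bs, ells):
        K += np.kron(B, sq_exp(x, ell))
    return K

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
# rank-1 PSD coregionalization matrices (the SLFM special case): C = 3, Q = 2
Bs = [np.outer(w, w) for w in rng.standard_normal((2, 3))]
K = lmc_covariance(x, Bs, ells=[0.2, 0.5])
```

Since each B_q is positive semi-definite and each k_q is a valid kernel, the Kronecker sum K is symmetric positive semi-definite by construction.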
Since both the likelihood p(y|f, x) and distribution over latent functions p(f|x) are Gaussian, the marginal likelihood is conveniently represented by\n\np(y|x) = ∫ p(y|f, x) p(f|x) df = N(0, Γ),   Γ = K + H^{-1} ⊗ I_N,   (3)\n\nwhere all possible functions f have been marginalized out.\n\nEach input-space covariance kernel is defined by a set of hyperparameters, θ. This conditioning was removed for notational simplicity, but will henceforth be included in the notation. For example, if the squared exponential kernel is used, then k_SE(x, x′; θ) = exp(−½ ||x − x′||² / ℓ²), defined by a single hyperparameter θ = {ℓ}. To fit a GP to the dataset, the hyperparameters are typically chosen to maximize the marginal likelihood in (3) via gradient ascent.\n\n3 Expressive Kernels in the Spectral Domain\n\nThis section first introduces the spectral mixture (SM) kernel [24] as well as a multi-output extension of the SM kernel within the LMC framework. While the SM-LMC model is capable of representing complex spectral relationships between channels, it does not intuitively model the cross-phase spectrum between channels. We propose a novel kernel known as the cross-spectral mixture (CSM) kernel that provides both the cross-amplitude and cross-phase spectra of multi-channel observations. Detailed derivations of each of these kernels are found in the Supplemental Material.\n\n3.1 The Spectral Mixture Kernel\n\nA spectral Gaussian (SG) kernel is defined by an amplitude spectrum with a single Gaussian distribution reflected about the origin,\n\nS_SG(ω; θ) = ½ [N(ω; −μ, ν) + N(ω; μ, ν)],   (4)\n\nwhere θ = {μ, ν} are the kernel hyperparameters, μ represents the peak frequency, and the variance ν is a scale parameter that controls the spread of the spectrum around μ. 
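The density in (4) is straightforward to evaluate numerically; a minimal sketch (the frequency grid and hyperparameter values are illustrative):

```python
import numpy as np

def sg_spectral_density(omega, mu, nu):
    """Eq. (4): a Gaussian in angular frequency reflected about the origin."""
    gauss = lambda w, m: np.exp(-0.5 * (w - m) ** 2 / nu) / np.sqrt(2 * np.pi * nu)
    return 0.5 * (gauss(omega, -mu) + gauss(omega, mu))

omega = np.linspace(-20.0, 20.0, 2001)
S = sg_spectral_density(omega, mu=2 * np.pi * 1.0, nu=1.0)
# symmetric about the origin, peaked near +/- mu, and integrates to one
```

The reflection about the origin is what makes the Fourier transform of (4) real valued.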
This spectrum is a function of angular frequency. The Fourier transform of (4) results in the stationary, positive definite auto-covariance function\n\nk_SG(τ; θ) = exp(−½ ν τ²) cos(μτ),   (5)\n\nwhere stationarity implies dependence on input domain differences, k(τ; θ) = k(x, x′; θ) with τ = x − x′. The SG kernel may also be derived by considering a latent signal f(x) = √2 cos(ω(x + φ)) with frequency uncertainty ω ∼ N(μ, ν) and phase offset ωφ. The kernel is the auto-covariance function for f(x), such that k_SG(τ; θ) = cov(f(x), f(x + τ)). When computing the auto-covariance, the frequency ω is marginalized out, providing the kernel in (5) that includes all frequencies in the spectral domain with probability 1.\n\nA weighted, linear combination of SG kernels gives the spectral mixture (SM) kernel [24],\n\nS_SM(ω; θ) = Σ_{q=1}^Q a_q S_SG(ω; θ_q),   k_SM(τ; θ) = Σ_{q=1}^Q a_q k_SG(τ; θ_q),   (6)\n\nwhere θ_q = {a_q, ν_q, μ_q} and θ = {θ_q} has 3Q degrees of freedom. The SM kernel may be derived as the Fourier transform of the spectral density S_SM(ω; θ) or as the auto-covariance of latent functions f(x) = Σ_{q=1}^Q √(2 a_q) cos(ω_q(x + φ_q)) with uncertainty in angular frequency ω_q ∼ N(μ_q, ν_q).\n\nFigure 1: Latent functions drawn for two channels f1(x) (blue) and f2(x) (red) using the CSM kernel (left) and rank-1 SM-LMC kernel (center). The functions are comprised of two SG components centered at 4 and 5 Hz. For the CSM kernel, we set the phase shift ψ_{c′,2} = π. 
Right: the cross-amplitude (purple) and cross-phase (green) spectra between f1(x) and f2(x) are shown for the CSM kernel (solid) and SM-LMC kernel (dashed). The ability to tune phase relationships is beneficial for kernel design and interpretation.\n\nThe moniker for the SM kernel in (6) reflects the mixture of Gaussian components that define the spectral density of the kernel. The SM kernel is able to represent any stationary covariance kernel given large enough Q; to name a few, this includes any combination of squared exponential, Matérn, rational quadratic, or periodic kernels [9, 16, 24].\n\n3.2 The Cross-Spectral Mixture Kernel\n\nA multi-output version of the SM kernel uses the SG kernel directly within the LMC framework:\n\nK_SM-LMC(τ; θ) = Σ_{q=1}^Q B_q k_SG(τ; θ_q),   (7)\n\nwhere Q SG kernels are shared among the outputs via the coregionalization matrices {B_q}_{q=1}^Q. A generalized, non-stationary version of this SM-LMC kernel was proposed in [23] using the Gaussian process regression network (GPRN) [26]. The marginal distribution for any single channel is simply a Gaussian process with a SM covariance kernel. While this formulation is capable of providing a full cross-amplitude spectrum between two channels, it contains no information about a cross-phase spectrum. Specifically, each channel is merely a weighted sum of Σ_q R_q latent functions, where R_q = rank(B_q). Whereas these functions are shared exactly across channels, our novel CSM kernel shares phase-shifted versions of these latent functions across channels.\n\nDefinition 3.1. 
The cross-spectral mixture (CSM) kernel takes the form\n\nk^{c,c′}_CSM(τ; θ) = Σ_{q=1}^Q Σ_{r=1}^{R_q} √(a^r_{cq} a^r_{c′q}) exp(−½ ν_q τ²) cos(μ_q (τ + φ^r_{c′q} − φ^r_{cq})),   (8)\n\nwhere θ = {ν_q, μ_q, {a^r_{cq}, φ^r_{cq}}_{r=1}^{R_q}}_{q=1}^Q has 2Q + Σ_{q=1}^Q R_q(2C − 1) degrees of freedom (with φ^r_{1q} ≜ 0), and a^r_{cq} and φ^r_{cq} respectively represent the amplitude and shift in the input space for latent functions associated with channel c. In the LMC framework, the CSM kernel is\n\nK_CSM(τ; θ) = Re{ Σ_{q=1}^Q B_q k̃_SG(τ; θ_q) },   k̃_SG(τ; θ_q) = exp(−½ ν_q τ² + j μ_q τ),   B_q = Σ_{r=1}^{R_q} β^r_q (β^r_q)†,   β^r_{cq} = √(a^r_{cq}) exp(−j ψ^r_{cq}),\n\nwhere k̃_SG(τ; θ_q) is phasor notation of the SG kernel, B_q is rank-R_q, {β^r_{cq}} are complex scalar coefficients encoding amplitude and phase, and ψ^r_{cq} ≜ μ_q φ^r_{cq} is an alternative phase representation. We use complex notation where j = √−1, Re{·} returns the real component of its argument, and β† represents the complex conjugate of β.\n\nBoth the CSM and SM-LMC kernels force the marginal distribution of data from a single channel to be a Gaussian process with a SM covariance kernel. The CSM kernel is derived in the Supplemental Material by considering functions represented by phase-shifted sinusoidal signals, f_c(x) = Σ_{q=1}^Q Σ_{r=1}^{R_q} √(2 a^r_{cq}) cos(ω^r_q (x + φ^r_{cq})), where each ω^r_q iid∼ N(μ_q, ν_q). Computing the cross-covariance function cov(f_c(x), f_{c′}(x + τ)) provides the CSM kernel.\n\nA comparison between draws from Gaussian processes with CSM and SM-LMC kernels is shown in Figure 1. The utility of the CSM kernel is clearly illustrated by its ability to encode phase information, as well as its powerful functional form of the full cross-spectrum (both amplitude and phase). The amplitude function A_{c,c′}(ω) and phase function Φ_{c,c′}(ω) are obtained by representing the cross-spectrum in phasor notation, i.e., Γ_{c,c′}(ω; Θ) = Σ_q (B_q)_{c,c′} S_SG(ω; θ_q) = A_{c,c′}(ω) exp(j Φ_{c,c′}(ω)). Interestingly, while the CSM and SM-LMC kernels have identical marginal amplitude spectra for shared {μ_q, ν_q, a_q}, their cross-amplitude spectra differ due to the inherent destructive interference of the CSM kernel (see Figure 1, right).\n\n4 Multi-Channel HMM Analysis\n\nNeuroscientists are interested in examining how the network structure of the brain changes as animals undergo a task, or various levels of arousal [15]. The LFP signal is a modality that allows researchers to explore this network structure. In the model provided in this section, we cluster segments of the LFP signal into discrete "brain states" [21]. Each brain state is represented by a unique cross-spectrum provided by the CSM kernel. 
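A direct, unvectorized transcription of the CSM cross-covariance in (8) can be sketched as follows; the nested-list parameterization `a[c][q][r]`, `phi[c][q][r]` is an implementation convenience for this example, not the paper's notation:

```python
import numpy as np

def csm_cross_kernel(tau, c, cp, a, phi, mu, nu):
    """Cross-covariance k^{c,c'}_CSM(tau; theta) of eq. (8).

    a[c][q][r] and phi[c][q][r] hold per-channel amplitudes and input-space
    shifts; mu[q], nu[q] are component frequencies/variances (angular units)."""
    k = np.zeros_like(tau, dtype=float)
    for q in range(len(mu)):
        env = np.exp(-0.5 * nu[q] * tau ** 2)
        for r in range(len(a[c][q])):
            amp = np.sqrt(a[c][q][r] * a[cp][q][r])
            k += amp * env * np.cos(mu[q] * (tau + phi[cp][q][r] - phi[c][q][r]))
    return k

# two channels, one component at 4 Hz, one latent function per component
mu, nu = [2 * np.pi * 4.0], [1.0]
a = [[[1.0]], [[1.0]]]
phi = [[[0.0]], [[0.1]]]          # channel 2 lags channel 1 by 0.1 s
tau = np.linspace(-1.0, 1.0, 201)
k_auto = csm_cross_kernel(tau, 0, 0, a, phi, mu, nu)   # reduces to the SG kernel
k_cross = csm_cross_kernel(tau, 0, 1, a, phi, mu, nu)  # phase-shifted version
```

With c = c′ the shift terms cancel and (8) collapses to a sum of SG kernels, matching the statement that each channel's marginal is a GP with a SM kernel.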
The use of the full cross-spectrum to define brain states is supported by previous work discovering that 1) the power spectral density of LFP signals indicates various levels of arousal states in mice [7, 21], and 2) frequency-dependent phase synchrony patterns change as animals undergo different conditions in a task [11, 17, 18] (see Figure 2).\n\nThe vector-valued observations from C channels are segmented into W contiguous, non-overlapping windows. The windows are common across channels, such that the C-channel data for window w ∈ {1, ..., W} are represented by y^w_n = [y^w_{n1}, ..., y^w_{nC}]^T at sample location x^w_n. Given data, each window consists of N_w temporal samples, but the model is defined for any set of sample locations.\n\nWe model the observations {y^w_n} as emissions from a hidden Markov model (HMM) with L hidden, discrete states. State assignments are represented by latent variables ζ_w ∈ {1, ..., L} for each window w ∈ {1, ..., W}. In general, L is a set upper bound on the number of states (brain states [21], or "clusters"), but the model can shrink down and infer the number of states needed to fit the data. This is achieved by defining the dynamics of the latent states according to a Bayesian HMM [14]:\n\nζ_1 ∼ Categorical(ρ_0),   ζ_w ∼ Categorical(ρ_{ζ_{w−1}}) ∀ w ≥ 2,   ρ_0, ρ_ℓ ∼ Dirichlet(ν),\n\nwhere the initial state assignment is drawn from a categorical distribution with probability vector ρ_0 and all subsequent state assignments are drawn from the transition vector ρ_{ζ_{w−1}}. Here, ρ_{ℓh} is the probability of transitioning from state ℓ to state h. The vectors {ρ_0, ρ_1, ..., ρ_L} are independently drawn from symmetric Dirichlet distributions centered around ν = [1/L, ..., 1/L] to impose sparsity on transition probabilities. 
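The state dynamics above can be sketched generatively; the symmetric Dirichlet(1/L) prior on the initial vector and each transition row follows the text, while the values of W, L, and the seed are illustrative:

```python
import numpy as np

def sample_hmm_states(W, L, seed=None):
    """Draw a window-state sequence from the Bayesian HMM prior: the initial
    vector and each transition row are symmetric Dirichlet(1/L, ..., 1/L)."""
    rng = np.random.default_rng(seed)
    nu = np.full(L, 1.0 / L)
    rho0 = rng.dirichlet(nu)
    rho = np.stack([rng.dirichlet(nu) for _ in range(L)])  # L x L transitions
    z = np.empty(W, dtype=int)
    z[0] = rng.choice(L, p=rho0)
    for w in range(1, W):
        z[w] = rng.choice(L, p=rho[z[w - 1]])
    return z

z = sample_hmm_states(W=200, L=7, seed=0)
# the sparse Dirichlet prior tends to concentrate mass on few of the L states
```

Because each Dirichlet row has concentration 1/L per entry, sampled transition vectors are typically near-sparse, which is the mechanism by which unused states are shrunk away.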
In effect, this allows the model to learn the number of states needed for the data (i.e., fewer than L) [3].\n\nEach cluster ℓ ∈ {1, ..., L} is assigned GP parameters θ_ℓ. The latent cluster assignment ζ_w for window w indicates which set of GP parameters controls the emission distribution of the HMM:\n\ny^w_n ∼ N(f^w(x^w_n), H^{-1}_{ζ_w}),   f^w(x) ∼ GP(0, K(x, x′; θ_{ζ_w})),   (9)\n\nwhere (K(x, x′; θ_ℓ))_{c,c′} = k^{c,c′}_CSM(x, x′; θ_ℓ) is the CSM kernel, and the cluster-dependent precision H_{ζ_w} = diag(η_{ζ_w}) generates independent Gaussian observation noise. In this way, each window w is modeled as a stochastic process with a multi-channel cross-spectrum defined by θ_{ζ_w}.\n\nFigure 2: A short segment of LFP data recorded from the basolateral amygdala and infralimbic cortex is shown on the left. The cross-amplitude and phase spectra are produced using Welch's averaged periodogram method [22] for several consecutive 5 second windows of LFP data. Frequency-dependent phase synchrony lags are consistently present in the cross-phase spectrum, motivating the CSM kernel. This frequency dependency aligns with preconceived notions of bands, or brain waves (e.g., 8-12 Hz alpha waves). (Figure panels mark the delta, theta, alpha, and beta wave bands.)\n\n5 Inference\n\nA convenient notation vectorizes all observations within a window, y^w = vec([y^w_1, ..., y^w_{N_w}]^T), where vec(A) is the vectorization of matrix A; i.e., the first N_w elements of y^w are observations from channel 1, up to the last N_w elements of y^w belonging to channel C. 
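The empirical cross-spectra of Figure 2 come from Welch's averaged periodogram; as a sketch on synthetic data (two channels with an imposed 0.4 rad lag at 6 Hz — not the paper's LFP recordings), SciPy's `csd` estimates both cross-amplitude and cross-phase:

```python
import numpy as np
from scipy.signal import csd

fs = 200.0                      # sampling rate used for the LFP data (Hz)
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
lag = 0.4                       # imposed phase lag (radians) at 6 Hz
x1 = np.cos(2 * np.pi * 6 * t) + 0.1 * rng.standard_normal(t.size)
x2 = np.cos(2 * np.pi * 6 * t - lag) + 0.1 * rng.standard_normal(t.size)

# Welch-averaged cross-spectral density between the two channels
f, Pxy = csd(x1, x2, fs=fs, nperseg=512)
amp, phase = np.abs(Pxy), np.angle(Pxy)
peak = int(np.argmax(amp))
# the cross-amplitude peaks near 6 Hz, and the cross-phase there recovers
# the imposed lag up to the sign convention of the estimator
```

The CSM kernel provides a parametric, generative counterpart to this nonparametric estimate.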
Because samples are obtained on an evenly spaced temporal grid, we fix N_w = N and align relative sample locations within a window to an oracle x^w = x = [x_1, ..., x_N]^T for all w.\n\nThe model in Section 4 generates the set of observations Y = {y^w}_{w=1}^W at aligned sample locations x given kernel hyperparameters Θ = {θ_ℓ, η_ℓ}_{ℓ=1}^L and model variables Ω = {{ρ_ℓ}_{ℓ=0}^L, {ζ_w}_{w=1}^W}. The latent variables Ω are inverted using mean-field variational inference [3], obtaining an approximate posterior distribution q(Ω) = q(ζ_{1:W}) Π_{ℓ=0}^L Dir(ρ_ℓ; α_ℓ). The approximate posterior is chosen to minimize the KL divergence to the true posterior distribution p(Ω|Y, Θ, x) using the standard variational EM method detailed in Chapter 3 of [3].\n\nDuring each iteration of the variational EM algorithm, the kernel hyperparameters Θ are chosen to maximize the expected marginal log-likelihood Q = Σ_{w=1}^W Σ_{ℓ=1}^L q(ζ_w = ℓ) log N(y^w; 0, Γ_ℓ) via gradient ascent, where q(ζ_w = ℓ) is the marginal posterior probability that window w is assigned to brain state ℓ, and Γ_ℓ = Re{Γ̃_ℓ} is the CSM kernel matrix for state ℓ with the complex form Γ̃_ℓ = Σ_q B^ℓ_q ⊗ k̃_SG(x, x; θ_{ℓq}) + H^{-1}_ℓ ⊗ I_N. Performing gradient ascent requires the derivatives ∂Q/∂Θ_j = ½ Σ_{w,ℓ} q(ζ_w = ℓ) tr((α_{ℓw} α^T_{ℓw} − Γ^{-1}_ℓ) ∂Γ_ℓ/∂Θ_j), where α_{ℓw} = Γ^{-1}_ℓ y^w [16]. A naïve implementation of this gradient requires the inversion of Γ_ℓ, which has complexity O(N³C³) and storage requirements O(N²C²) since a simple method to invert a sum of Kronecker products does not exist.\n\nA common trick for GPs with evenly spaced samples (e.g., a temporal grid) is to use the discrete Fourier transform (DFT) to approximate the inverse of Γ_ℓ by viewing this as an approximately circulant matrix [5, 12]. 
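The circulant trick rests on the fact that a circulant matrix is diagonalized by the DFT, so a linear solve reduces to an elementwise division in the frequency domain. A single-channel sketch (the wrapped squared-exponential column and diagonal jitter are illustrative choices):

```python
import numpy as np

def circulant_solve(first_col, y):
    """Solve C x = y for a circulant C (given by its first column) in
    O(N log N): C is diagonalized by the DFT, with eigenvalues fft(first_col)."""
    return np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(first_col)))

N = 64
d = np.minimum(np.arange(N), N - np.arange(N))   # wrapped lag distances
c = np.exp(-0.5 * (d / 5.0) ** 2)                # wrapped SE covariance column
c[0] += 0.1                                      # noise jitter on the diagonal
C = np.stack([np.roll(c, i) for i in range(N)], axis=1)  # dense copy, for checking
y = np.random.default_rng(1).standard_normal(N)
x = circulant_solve(c, y)
```

A stationary kernel on an evenly spaced grid yields a Toeplitz, only approximately circulant, covariance; the approximation error is what Section 6.1 quantifies.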
These methods can speed up inference because circulant matrices are diagonalizable by the DFT coefficient matrix. Adjusting these methods to the multi-output formulation, we show how the DFT of the marginal covariance matrices retains the cross-spectrum information.\n\nProposition 5.1. Let y^w ∼ N(0, Γ_{ζ_w}) represent the marginal likelihood of circularly-symmetric [8] real-valued observations in window w, and denote the concatenation of the DFT of each channel as z^w = (I_C ⊗ U)† y^w, where U is the N × N unitary DFT matrix. Then, z^w is shown in the Supplemental Material to have the complex normal distribution [8]:\n\nz^w ∼ CN(0, 2 S_{ζ_w}),   S_ℓ = δ^{-1} Σ_{q=1}^Q B^ℓ_q ⊗ W^ℓ_q + H^{-1}_ℓ ⊗ I_N,   (10)\n\nwhere δ = x_i − x_{i−1} for all i = 2, ..., N, and W^ℓ_q ≈ diag([S_SG(ω; θ_{ℓq}), 0]) is approximately diagonal. The spectral density S_SG(ω; θ) = [S_SG(ω_1; θ), ..., S_SG(ω_{⌊N/2⌋+1}; θ)] is found via (4) at angular frequencies ω = (2π/(Nδ)) [0, 1, ..., ⌊N/2⌋], and 0 = [0, ..., 0] is a row vector of ⌊(N−1)/2⌋ zeros.\n\nThe hyperparameters of the CSM kernels Θ may now be optimized from the expected marginal log-likelihood of Z = {z^w}_{w=1}^W instead of Y. Conceptually, the only difference during the fitting process is that, with the latter, derivatives of the covariance kernel are used, while, with the former, derivatives of the power spectral density are used. 
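Proposition 5.1 implies that, after a per-channel DFT, the likelihood factorizes across frequencies into C × C blocks. A minimal sketch of this O(NC³) evaluation, assuming the per-frequency covariance blocks Sf[n] are already assembled and glossing over the conjugate-symmetry bookkeeping of a real-valued series (normalization is schematic):

```python
import numpy as np

def spectral_loglik(Y, Sf):
    """Complex-normal log-likelihood evaluated frequency-by-frequency.

    Y is C x N (real, evenly sampled); Sf is N x C x C with one Hermitian
    PSD block per DFT frequency. Cost is O(N C^3) rather than O(N^3 C^3)."""
    C, N = Y.shape
    Z = np.fft.fft(Y, axis=1) / np.sqrt(N)       # unitary DFT of each channel
    ll = 0.0
    for n in range(N):
        S = 2.0 * Sf[n]                          # CN(0, 2 S) as in (10)
        z = Z[:, n]
        ll -= np.real(np.conj(z) @ np.linalg.solve(S, z))
        ll -= np.log(np.abs(np.linalg.det(np.pi * S)))
    return ll

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 64))
Sf = np.tile(np.eye(3), (64, 1, 1))              # flat (white) spectrum blocks
ll = spectral_loglik(Y, Sf)
```

Each iteration touches only a C × C block, which is the source of the complexity reduction discussed next.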
Computationally, this method improves the naïve O(N³C³) complexity of fitting the standard CSM kernel to O(NC³) complexity. Memory requirements are also reduced from O(N²C²) to O(NC²). The reason for this improvement is that S_ℓ is now represented as N independent C × C blocks, reducing the inversion of S_ℓ to inverting a permuted block-diagonal matrix.\n\n6 Experiments\n\nSection 6.1 demonstrates the performance of the CSM kernel and the accuracy of the DFT approximation. In Section 6.2, the DFT approximation for the CSM kernel is used in a Bayesian HMM framework to cluster time-varying multi-channel LFP data based on the full cross-spectrum; the HMM states here correspond to states of the brain during LFP recording.\n\nFigure 3: Time-series data is drawn from a Gaussian process with a known CSM covariance kernel, where the domain is restricted to a fixed number of seconds. A Gaussian process is then fitted to this data using the DFT approximation. The KL divergence of the fitted marginal likelihood from the true marginal likelihood is shown.\n\n6.1 Performance and Inference Analysis\n\nThe performance of the CSM kernel is compared to the SM-LMC kernel and SE-LMC (squared exponential) kernel. Each of these models allows Q=20, and the rank of the coregionalization matrices is varied from rank-1 to rank-3. For a given rank, the CSM kernel always obtains the largest marginal likelihood for a window of LFP data, and the marginal likelihood always increases for increasing rank. To penalize the number of kernel parameters (e.g., a rank-3, Q=20 CSM kernel for 7 channels has 827 free parameters to optimize), the Akaike information criterion (AIC) is used for model selection [1]. For this reason, we do not test rank greater than 3. 
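The criterion used here is AIC = 2k − 2 log L̂, penalizing parameter count against fit. A sketch of the criterion together with the parameter count quoted above (the breakdown into 2Q global parameters, R_q(2C − 1) per-component channel parameters, and C noise precisions is our reading of the degrees-of-freedom count in Definition 3.1):

```python
def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log-likelihood (lower is better)."""
    return 2 * n_params - 2 * log_lik

# parameter count for a rank-3, Q = 20 CSM kernel over C = 7 channels:
# 2Q global {mu_q, nu_q}, R_q(2C - 1) amplitudes/shifts per component,
# plus C noise precisions
Q, R, C = 20, 3, 7
n_params = 2 * Q + Q * R * (2 * C - 1) + C   # 827 free parameters
```

This count reproduces the 827 free parameters cited in the text.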
Table 1 shows that a rank-2 CSM kernel is selected using this criterion, followed by a rank-1 CSM kernel. To show that the rank-2 CSM kernel is consistently selected as the preferred model, we report means and standard deviations of AIC value differences across 30 different randomly selected 3-second windows of LFP data.\n\nNext, we provide numerical results for the conditions required when using the DFT approximation in (10). This allows one to define details of a particular application in order to determine if the DFT approximation to the CSM kernel is appropriate. A CSM kernel is defined for two outputs with a single Gaussian component, Q = 1. The mean frequency and variance for this component are set to push the limits of the application. For example, with LFP data, low-frequency content is of interest, namely greater than 1 Hz; therefore, we test values of μ̃_1 ∈ {1/2, 1, 3} Hz. We anticipate variances at these frequencies to be around ν̃_1 = 1 Hz². A conversion to angular frequency gives μ_1 = 2π μ̃_1 and ν_1 = 4π² ν̃_1. The covariance matrix Γ in (3) is formed using these parameters, a fixed noise variance, and N observations on a time grid with sampling rate of 200 Hz. Data y are drawn from the marginal likelihood with covariance Γ.\n\nA new CSM kernel is fit to y using the DFT approximation, providing an estimate Γ̂. 
The KL divergence of the fitted marginal likelihood from the true marginal likelihood is\n\nKL(p(y|Γ̂) || p(y|Γ)) = ½ [ log(|Γ|/|Γ̂|) − N + tr(Γ^{-1} Γ̂) ],\n\nwhere |·| and tr(·) are the determinant and trace operators, respectively.\n\nTable 1: The mean and standard deviation of the difference between the AIC value of a given model and the AIC value of the rank-2 CSM model. Lower values are better.\n\nRank  Model    Δ AIC\n1     SE-LMC   4770 (993)\n1     SM-LMC   512 (190)\n1     CSM      109 (110)\n2     SE-LMC   5180 (1120)\n2     SM-LMC   325 (167)\n2     CSM      0 (0)\n3     SE-LMC   5550 (1240)\n3     SM-LMC   412 (184)\n3     CSM      204 (71.7)\n\nComputing (1/N) KL(p(y|Γ̂) || p(y|Γ)) for various values of μ̃_1 and N provides the results in Figure 3. This plot shows that the DFT approximation struggles to resolve low-frequency components unless the series length is sufficiently long. Due to the approximation error, when using the DFT approximation on LFP data we a priori filter out frequencies below 1.5 Hz and perform analyses with a series length of 3 seconds. This ensures the DFT approximation represents the true covariance matrix. The following application of the CSM kernel uses these settings.\n\n6.2 Including the CSM Kernel in a Bayesian Hierarchical Model\n\nWe analyze 12 hours of LFP data of a mouse transitioning between different stages of sleep [7, 21]. Observations were recorded simultaneously from 4 channels [6], high-pass filtered at 1.5 Hz, and subsampled to 200 Hz. Using 3 second windows provides N = 600 and W = 14,400. The HMM was implemented with the number of kernel components Q = 15 and the number of states L = 7.\n\nFigure 4: A subset of results from the Bayesian HMM analysis of brain states. 
In the upper left, the full cross-spectrum for an arbitrary state (state 7) is plotted. In the upper right, the amplitude (top) and phase (bottom) functions for the cross-spectrum between the Dorsomedial Striatum (DMS) and Hippocampus (DHipp) are shown for all seven states. On the bottom, the maximum likelihood state assignments are shown and compared to the state assignments from [7]. The same colors between the CSM state assignments and the phase and amplitude functions correspond to the same state. These colors are aligned to the [7] states, but there is no explicit relationship between the colors of the two state sequences.

This number was chosen because sleep staging tasks categorize as many as seven states: various levels of rapid eye movement, slow wave sleep, and wake [20]. Although rigorous model selection on L would be necessary to draw scientific conclusions from the results, the purpose of this experiment is to illustrate the utility of the CSM kernel in this application.

An illustrative subset of the results is shown in Figure 4. The full cross-spectrum is shown for a single state (state 7), and the cross-spectrum between the Dorsomedial Striatum and the Dorsal Hippocampus is shown for all states. Furthermore, we show the progression of these brain state assignments over 3 hours and compare them to states from the method of [7], where statistics of the Hippocampus spectral density were clustered in an ad hoc fashion. To the best of our knowledge, this method represents the most relevant and accurate results for sleep staging from LFP signals in the neuroscience literature. From these results, it is apparent that our clusters pick up sub-states of [7]. For instance, states 3, 6, and 7 all appear with high probability when the method from [7] infers state 3. Observing the cross-phase function of sub-state 7 reveals striking differences from other states in the theta wave (4-7 Hz) and the alpha wave (8-15 Hz).
This cross-phase function is nearly identical for states 2 and 5, implying that significant differences in the cross-amplitude spectrum may have played a role in identifying the difference between these two brain states.

Many more interesting details like these exist due to the expressive nature of the CSM kernel. While a full interpretation of the cross-spectrum results is not the focus of this work, we contend that the CSM kernel has the potential to have a tremendous impact in fields such as neuroscience, where the dynamics of cross-spectrum relationships of LFP signals are of great interest.

7 Conclusion

This work introduces the cross-spectral mixture kernel as an expressive kernel capable of extracting patterns from multi-channel observations. Combined with the powerful nonparametric representation of a Gaussian process, the CSM kernel expresses a functional form for every pairwise cross-spectrum between channels. This is a novel approach that merges Gaussian processes from the machine learning community with standard signal processing techniques. We believe the CSM kernel has the potential to impact a broad array of disciplines, since the kernel can trivially be extended to any time-series application where Gaussian processes and the cross-spectrum are of interest.

Acknowledgments

The research reported here was funded in part by ARO, DARPA, DOE, NGA and ONR.

References

[1] H. Akaike. A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[2] M. A. Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

[3] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London.

[4] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[5] C. R. Dietrich and G. N. Newsam. Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing, 18(4):1088–1107, 1997.

[6] K. Dzirasa, R. Fuentes, S. Kumar, J. M. Potes, and M. A. L. Nicolelis. Chronic in vivo multi-circuit neurophysiological recordings in mice. Journal of Neuroscience Methods, 195(1):36–46, 2011.

[7] K. Dzirasa, S. Ribeiro, R. Costa, L. M. Santos, S. C. Lin, A. Grosmark, T. D. Sotnikova, R. R. Gainetdinov, M. G. Caron, and M. A. L. Nicolelis. Dopaminergic control of sleep–wake states. The Journal of Neuroscience, 26(41):10577–10589, 2006.

[8] R. G. Gallager. Principles of Digital Communication, pages 229–232. 2008.

[9] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. JMLR, 12:2211–2268, 2011.

[10] P. Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

[11] G. G. Gregoriou, S. J. Gotts, H. Zhou, and R. Desimone. High-frequency, long-range coupling between prefrontal and visual cortex during attention. Science, 324(5931):1207–1210, 2009.

[12] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. JMLR, 11:1865–1881, 2010.

[13] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic construction and natural-language description of nonparametric regression models.
AAAI, 2014.

[14] D. J. C. MacKay. Ensemble learning for hidden Markov models. Technical report, 1997.

[15] D. Pfaff, A. Ribeiro, J. Matthews, and L. Kow. Concepts and mechanisms of generalized central nervous system arousal. ANYAS, 2008.

[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. 2006.

[17] P. Sauseng and W. Klimesch. What does phase information of oscillatory brain activity tell us about cognitive processes? Neuroscience and Biobehavioral Reviews, 32:1001–1013, 2008.

[18] C. M. Sweeney-Reed, T. Zaehle, J. Voges, F. C. Schmitt, L. Buentjen, K. Kopitzki, C. Esslinger, H. Hinrichs, H. J. Heinze, R. T. Knight, and A. Richardson-Klavehn. Corticothalamic phase synchrony and cross-frequency coupling predict human memory formation. eLife, 2014.

[19] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. AISTATS, 10:333–340, 2005.

[20] M. A. Tucker, Y. Hirota, E. J. Wamsley, H. Lau, A. Chaklader, and W. Fishbein. A daytime nap containing solely non-REM sleep enhances declarative but not procedural memory. Neurobiology of Learning and Memory, 86(2):241–247, 2006.

[21] K. Ulrich, D. E. Carlson, W. Lian, J. S. Borg, K. Dzirasa, and L. Carin. Analysis of brain states from multi-region LFP time-series. NIPS, 2014.

[22] P. D. Welch. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2):70–73, 1967.

[23] A. G. Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.

[24] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. ICML, 2013.

[25] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation.
NIPS, 2014.

[26] A. G. Wilson and D. A. Knowles. Gaussian process regression networks. ICML, 2012.

[27] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. À la carte – learning fast kernels. AISTATS, 2015.