A State-Space Model for Decoding Auditory Attentional Modulation from MEG in a Competing-Speaker Environment

Advances in Neural Information Processing Systems, pages 460–468

Sahar Akram1,2, Jonathan Z. 
Simon1,2,3, Shihab Shamma1,2, and Behtash Babadi1,2

1 Department of Electrical and Computer Engineering, 2 Institute for Systems Research, 3 Department of Biology
University of Maryland, College Park, MD 20742, USA
{sakram,jzsimon,sas,behtash}@umd.edu

Abstract

Humans are able to segregate auditory objects in a complex acoustic scene through an interplay of bottom-up feature extraction and top-down selective attention in the brain. The detailed mechanism underlying this process is largely unknown, and the ability to mimic it is an important problem in artificial intelligence and computational neuroscience. We consider the problem of decoding the attentional state of a listener in a competing-speaker environment from magnetoencephalographic (MEG) recordings from the human brain. We develop a behaviorally inspired state-space model to account for the modulation of the MEG with respect to the attentional state of the listener. We construct a decoder based on the maximum a posteriori (MAP) estimate of the state parameters via the Expectation-Maximization (EM) algorithm. The resulting decoder is able to track the attentional modulation of the listener with multi-second resolution using only the envelopes of the two speech streams as covariates. We present simulation studies as well as application to real MEG data from two human subjects. Our results reveal that the proposed decoder provides substantial gains in terms of temporal resolution, complexity, and decoding accuracy.

1 Introduction

Segregating a speaker of interest in a multi-speaker environment is an effortless task we routinely perform. It has been hypothesized that, after entering the auditory system, the complex auditory signal resulting from concurrent sound sources in a crowded environment is decomposed into acoustic features. 
An appropriate binding of the relevant features, and discounting of others, leads to the formation of the percept of an auditory object [1, 2, 3]. The complexity of this process becomes tangible when one tries to synthesize the underlying mechanism, known as the cocktail party problem [4, 5, 6, 7]. A number of recent studies have shown that concurrent auditory objects, even those with highly overlapping spectrotemporal features, are neurally encoded as distinct objects in auditory cortex and emerge as fundamental representational units for high-level cognitive processing [8, 9, 10]. In the case of listening to speech, Ding and Simon [8] have recently demonstrated that the auditory response manifested in MEG is strongly modulated by the spectrotemporal features of the speech. In the presence of two speakers, this modulation appears to be strongly correlated with the temporal features of the attended speaker, as opposed to the unattended speaker (see Figure 1–A). Previous studies employ time-averaging across multiple trials in order to decode the attentional state of the listener from MEG observations. This method is only valid when the subject is attending to a single speaker during the entire trial. In a real-world scenario, the attention of the listener can switch dynamically from one speaker to another. Decoding the attentional target in this scenario requires a

Figure 1: A) Schematic depiction of auditory object encoding in the auditory cortex. B) The MEG magnetic field distribution of the first DSS component shows a stereotypical pattern of neural activity originating separately in the left and right auditory cortices. Purple and green contours represent the magnetic field strength. Red arrows schematically represent the locations of the dipole currents generating the measured magnetic field. 
C) An example of the TRF, estimated from real MEG data. Significant TRF components analogous to the well-known M50 and M100 auditory responses are marked in the plot.

dynamic estimation framework with high temporal resolution. Moreover, current techniques use the full spectrotemporal features of the speech for decoding; it is not clear whether the decoding can be carried out with a more parsimonious set of spectrotemporal features.

In this paper, we develop a behaviorally inspired state-space model to account for the modulation of MEG with respect to the attentional state of the listener in a double-speaker environment. MAP estimation of the state-space parameters given the MEG observations is carried out via the EM algorithm. We present simulation studies as well as application to experimentally acquired MEG data, which reveal that the proposed decoder is able to accurately track the attentional state of a listener in a double-speaker environment while selectively listening to one of the two speakers. Our method has three main advantages over existing techniques. First, the decoder provides estimates with sub-second temporal resolution. Second, it only uses the envelopes of the two speech streams as covariates, a substantial reduction in the dimension of the spectrotemporal feature set used for decoding. Third, the principled statistical framework used in constructing the decoder allows us to obtain confidence bounds on the estimated attentional state.

The paper is organized as follows. In Section 2, we introduce the state-space model and the proposed decoding algorithm. In Section 3, we present simulation studies that test the robustness of the decoder with respect to noise as well as its tracking performance, and apply it to real MEG data recorded from two human subjects. 
Finally, we discuss future directions and generalizations of the proposed framework in Section 4.

2 Methods

We first consider the forward problem of relating the MEG observations to the spectrotemporal features of the attended and unattended speech streams. Next, we consider the inverse problem, where we seek to decode the attentional state of the listener given the MEG observations and the temporal features of the two speech streams.

2.1 The Forward Problem: Estimating the Temporal Response Function

Consider a task where the subject is passively listening to a speech stream. Let the discrete-time MEG observation at time t, sensor j, and trial r be denoted by x_{t,j,r}, for t = 1, 2, …, T, j = 1, 2, …, M, and r = 1, 2, …, R. The stimulus-irrelevant neural activity can be removed using denoising source separation (DSS) [11]. The DSS algorithm is a blind source separation method that decomposes the data into temporally uncorrelated components by enhancing components that are consistent over trials and suppressing noise-like components of the data, with no knowledge of the stimulus or the timing of the task. Let the time series y_{1,r}, y_{2,r}, …, y_{T,r} denote the first significant component of the DSS decomposition, hereafter referred to as the MEG data. In an auditory task, this component has a field map consistent with the stereotypical auditory response in MEG (see Figure 1–B). Also, let E_t be the speech envelope of the speaker at time t in dB scale. 
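As an illustrative aside, the dB-scale envelope covariate E_t can be computed in many ways; the paper does not specify the extraction method used. The sketch below uses a simple rectify-and-smooth envelope followed by conversion to dB, with the smoothing window and dB floor chosen arbitrarily for illustration:

```python
import numpy as np

def speech_envelope_db(speech, fs, smooth_ms=20.0, floor_db=-60.0):
    """Crude dB-scale envelope: full-wave rectification followed by a
    moving-average smoother. (One simple choice, not necessarily the
    authors'; smooth_ms and floor_db are illustrative parameters.)"""
    rectified = np.abs(speech)
    win = max(1, int(fs * smooth_ms / 1e3))
    env = np.convolve(rectified, np.ones(win) / win, mode="same")
    env /= env.max() + 1e-12                      # normalize to unit peak
    # clip at a floor so that silent stretches do not map to -inf dB
    return 20.0 * np.log10(np.maximum(env, 10.0 ** (floor_db / 20.0)))

# 1 s of toy "speech" at fs = 200 Hz: amplitude-modulated noise
fs = 200
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
x = (1.0 + np.sin(2 * np.pi * 3 * t)) * rng.standard_normal(fs)
E = speech_envelope_db(x, fs)
assert E.shape == x.shape and float(E.max()) <= 0.0
```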
In a linear model, the MEG data is linearly related to the envelope of speech as:

    y_{t,r} = τ_t ∗ E_t + v_{t,r},    (1)

where τ_t is a linear filter of length L, referred to as the temporal response function (TRF), ∗ denotes the convolution operator, and v_{t,r} is a nuisance component accounting for trial-dependent and stimulus-independent components manifested in the MEG data. It is known that the TRF is a sparse filter, with significant components analogous to the M50 and M100 auditory responses ([9, 8]; see Figure 1–C). A commonly used technique for estimating the TRF is known as boosting ([12, 9]), where the components of the TRF are greedily selected to decrease the mean square error (MSE) of the fit to the MEG data. We employ an alternative estimation framework based on ℓ1-regularization. Let τ := [τ_L, τ_{L−1}, …, τ_1]′ be the time-reversed version of the TRF filter in vector form, and let E_t := [E_t, E_{t−1}, …, E_{t−L+1}]′. In order to obtain a sparse estimate of the TRF, we seek the ℓ1-regularized estimate:

    τ̂ = argmin_τ Σ_{r,t=1}^{R,T} ‖y_{t,r} − τ′E_t‖²₂ + γ‖τ‖₁,    (2)

where γ is the regularization parameter. The above problem can be solved using standard optimization software; we have used a fast solver based on iteratively re-weighted least squares [13]. The parameter γ is chosen by two-fold cross-validation, where the first half of the data is used for estimating τ and the second half is used to evaluate the goodness-of-fit in the MSE sense. An example of the estimated TRF is shown in Figure 1–C. 
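The ℓ1-regularized problem of Eq. (2) can be prototyped with a generic proximal-gradient (ISTA) solver in place of the IRLS solver of [13]. The sketch below is a minimal illustration on synthetic data, with a simplified (non-reversed) lag indexing of τ; it is not the authors' implementation:

```python
import numpy as np

def lagged_design(env, L):
    """Rows [E_t, E_{t-1}, ..., E_{t-L+1}], so (X @ tau)[t] = sum_j tau[j] E_{t-j}."""
    T = len(env)
    X = np.zeros((T, L))
    for lag in range(L):
        X[lag:, lag] = env[: T - lag]
    return X

def estimate_trf_l1(env, y, L, gamma, n_iter=1000):
    """ISTA for  argmin_tau ||y - X tau||_2^2 + gamma ||tau||_1  (cf. Eq. 2)."""
    X = lagged_design(env, L)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1/Lipschitz constant of the gradient
    tau = np.zeros(L)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ tau - y)
        z = tau - step * grad
        tau = np.sign(z) * np.maximum(np.abs(z) - step * gamma, 0.0)  # soft threshold
    return tau

# Recover a sparse two-tap TRF from noisy synthetic data
rng = np.random.default_rng(0)
env = rng.standard_normal(2000)
true_tau = np.zeros(20)
true_tau[5], true_tau[12] = 1.0, -0.6
y = lagged_design(env, 20) @ true_tau + 0.05 * rng.standard_normal(2000)
tau_hat = estimate_trf_l1(env, y, L=20, gamma=5.0)
assert int(np.argmax(np.abs(tau_hat))) == 5
```

The soft-thresholding step is what produces the sparsity that the boosting alternative achieves by greedy selection.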
In a competing-speaker environment, where the subject is attending to only one of the two speakers, the linear model takes the form:

    y_{t,r} = τ^a_t ∗ E^a_t + τ^u_t ∗ E^u_t + v_{t,r},    (3)

with τ^a_t, E^a_t, τ^u_t, and E^u_t denoting the TRFs and envelopes of the attended and unattended speakers, respectively. The above estimation framework can be generalized to the two-speaker case by replacing the regressor τ′E_t with τ^a′E^a_t + τ^u′E^u_t, where τ^a, E^a_t, τ^u, and E^u_t are defined in a fashion similar to the single-speaker case. Similarly, the regularization term γ‖τ‖₁ is replaced by γ^a‖τ^a‖₁ + γ^u‖τ^u‖₁.

2.2 The Inverse Problem: Decoding Attentional Modulation

2.2.1 Observation Model

Let y_{1,r}, y_{2,r}, …, y_{T,r} denote the MEG data time series at trial r, for r = 1, 2, …, R, during an observation period of length T. For a window length W, let

    y_{k,r} := [y_{(k−1)W+1,r}, y_{(k−1)W+2,r}, …, y_{kW,r}]′,    (4)

for k = 1, 2, …, K := ⌊T/W⌋. Also, let E_{i,t} be the speech envelope of speaker i at time t in dB scale, i = 1, 2. 
Let τ^a_t and τ^u_t denote the TRFs of the attended and unattended speakers, respectively. The MEG predictors in the linear model are given by:

    e_{1,t} := τ^a_t ∗ E_{1,t} + τ^u_t ∗ E_{2,t},   attending to speaker 1,
    e_{2,t} := τ^a_t ∗ E_{2,t} + τ^u_t ∗ E_{1,t},   attending to speaker 2,    (5)

for t = 1, 2, …, T. Let

    e_{i,k} := [e_{i,(k−1)W+1}, e_{i,(k−1)W+2}, …, e_{i,kW}]′,    (6)

for i = 1, 2 and k = 1, 2, …, K. Recent work by Ding and Simon [8] suggests that the MEG data y_{k,r} is more correlated with the predictor e_{i,k} when the subject is attending to the i-th speaker at window k. Let

    θ_{i,k,r} := arccos( ⟨ y_{k,r}/‖y_{k,r}‖₂ , e_{i,k}/‖e_{i,k}‖₂ ⟩ )    (7)

denote the empirical correlation between the observed MEG data and the model prediction when attending to speaker i at window k and trial r. When θ_{i,k,r} is close to 0 (π), the MEG data and its predicted value are highly (poorly) correlated. Inspired by the findings of Ding and Simon [8], we model the statistics of θ_{i,k,r} by the von Mises distribution [14]:

    p(θ_{i,k,r}) = 1/(π I₀(κ_i)) exp(κ_i cos(θ_{i,k,r})),   θ_{i,k,r} ∈ [0, π],    (8)

where I₀(·) is the zeroth-order modified Bessel function of the first kind, and κ_i denotes the spread parameter of the von Mises distribution for i = 1, 2. The von Mises distribution gives more (less) weight to higher (lower) values of correlation between the MEG data and its predictor, and is robust to gain fluctuations of the neural data. The spread parameter κ_i accounts for the concentration of θ_{i,k,r} around 0. 
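The observation model of Eqs. (7)–(8) can be sketched directly: θ is the angle between the normalized MEG window and the normalized predictor, and the half-range von Mises density on [0, π] is normalized by πI₀(κ). A minimal numerical check of both:

```python
import numpy as np

def correlation_angle(y_win, e_win):
    """theta = arccos(<y/||y||, e/||e||>)  (Eq. 7): small theta means the MEG
    window and the attention-conditional predictor are highly correlated."""
    c = np.dot(y_win, e_win) / (np.linalg.norm(y_win) * np.linalg.norm(e_win))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def von_mises_pdf(theta, kappa):
    """p(theta) = exp(kappa cos theta) / (pi I0(kappa)) on [0, pi]  (Eq. 8);
    I0 is numpy's modified Bessel function of order zero."""
    return np.exp(kappa * np.cos(theta)) / (np.pi * np.i0(kappa))

# A predictor proportional to the MEG window gives theta = 0 (perfect correlation)
assert correlation_angle(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])) < 1e-6

# The half-range density integrates to 1 on [0, pi] and concentrates near theta = 0
grid = np.linspace(0.0, np.pi, 20001)
pdf = von_mises_pdf(grid, kappa=2.0)
area = float(np.sum((pdf[:-1] + pdf[1:]) * np.diff(grid)) * 0.5)  # trapezoid rule
assert abs(area - 1.0) < 1e-6 and pdf[0] > pdf[-1]
```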
We assume a conjugate prior of the form

    p(κ_i) ∝ exp(c₀ d κ_i) / I₀(κ_i)^d,   i = 1, 2,

over κ_i, for some hyper-parameters c₀ and d.

2.2.2 State Model

Suppose that at each window of observation, the subject is attending to one of the two speakers. Let n_{k,r} be a binary variable denoting the attentional state of the subject at window k and trial r:

    n_{k,r} = { 1,  attending to speaker 1;  0,  attending to speaker 2. }    (9)

The subjective experience of attending to a specific speech stream among a number of competing streams reveals that attention often switches to the competing speakers, even when not intended by the listener. We therefore model the statistics of n_{k,r} by a Bernoulli process with success probability q_k:

    p(n_{k,r}|q_k) = q_k^{n_{k,r}} (1 − q_k)^{1−n_{k,r}}.    (10)

A value of q_k close to 1 (0) implies attention to speaker 1 (2). The process {q_k}_{k=1}^K is assumed to be common among different trials. In order to model the dynamics of q_k, we define a variable z_k such that

    q_k = logit⁻¹(z_k) := exp(z_k) / (1 + exp(z_k)).    (11)

When z_k tends to +∞ (−∞), q_k tends to 1 (0). We assume that z_k obeys AR(1) dynamics of the form:

    z_k = z_{k−1} + w_k,    (12)

where w_k is a zero-mean i.i.d. Gaussian random variable with variance η_k. We further assume that the η_k are distributed according to the conjugate prior given by the inverse-Gamma distribution with hyper-parameters α (shape) and β (scale). 
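The generative side of the state model in Eqs. (9)–(12) is compact enough to simulate in a few lines. The sketch below (with arbitrary illustrative values for η, z₀, and the seed) draws a Gaussian random walk z_k, maps it through the inverse logit to q_k, and samples R Bernoulli attention indicators per window:

```python
import numpy as np

def simulate_attention_state(K, R, eta=0.1, z0=2.0, seed=0):
    """Simulate Eqs. (9)-(12): a Gaussian random walk z_k, q_k = logit^{-1}(z_k),
    and R Bernoulli attention indicators n_{k,r} per window.
    (eta is held constant here for simplicity; in the model it is
    window-dependent with an inverse-Gamma prior.)"""
    rng = np.random.default_rng(seed)
    z = z0 + np.cumsum(rng.normal(0.0, np.sqrt(eta), size=K))  # Eq. (12)
    q = 1.0 / (1.0 + np.exp(-z))                               # Eq. (11)
    n = (rng.random((K, R)) < q[:, None]).astype(int)          # Eq. (10)
    return z, q, n

z, q, n = simulate_attention_state(K=240, R=3)
assert z.shape == (240,) and n.shape == (240, 3)
assert 0.0 < float(q.min()) and float(q.max()) < 1.0
```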
2.2.3 Parameter Estimation

Let

    Ω := { κ₁, κ₂, {z_k}_{k=1}^K, {η_k}_{k=1}^K }    (13)

be the set of parameters. The log-posterior of the parameter set Ω given the observed data {θ_{i,k,r}}_{i,k,r=1}^{2,K,R} is given by:

    log p( Ω | {θ_{i,k,r}}_{i,k,r=1}^{2,K,R} ) =
        Σ_{r,k=1}^{R,K} log[ q_k/(π I₀(κ₁)) exp(κ₁ cos(θ_{1,k,r})) + (1 − q_k)/(π I₀(κ₂)) exp(κ₂ cos(θ_{2,k,r})) ]
        + [ (κ₁ + κ₂) c₀ d − d ( log I₀(κ₁) + log I₀(κ₂) ) ]
        − Σ_{r,k=1}^{R,K} { (z_k − z_{k−1})²/(2η_k) + (1/2) log η_k + (α + 1) log η_k + β/η_k } + cst.,

where cst. denotes terms that are not functions of Ω. The MAP estimate of the parameters is difficult to obtain given the involved functional form of the log-posterior. However, the complete-data log-posterior, in which the unobservable sequence {n_{k,r}}_{k=1,r=1}^{K,R} is given, takes the form:

    log p( Ω | {θ_{i,k,r}, n_{k,r}}_{i,k,r=1}^{2,K,R} ) =
        Σ_{r,k=1}^{R,K} n_{k,r} [ κ₁ cos(θ_{1,k,r}) − log I₀(κ₁) + log q_k ]
        + Σ_{r,k=1}^{R,K} (1 − n_{k,r}) [ κ₂ cos(θ_{2,k,r}) − log I₀(κ₂) + log(1 − q_k) ]
        + [ (κ₁ + κ₂) c₀ d − d ( log I₀(κ₁) + log I₀(κ₂) ) ]
        − Σ_{r,k=1}^{R,K} { (z_k − z_{k−1})²/(2η_k) + (1/2) log η_k + (α + 1) log η_k + β/η_k } + cst.

The log-posterior of the parameters given the complete data has a tractable functional form for optimization purposes. Therefore, taking {n_{k,r}}_{k=1,r=1}^{K,R} as the unobserved data, we can estimate Ω via the EM algorithm [15]. 
Using Bayes' rule, the expectation of n_{k,r}, given {θ_{i,k,r}}_{i,k,r=1}^{2,K,R} and the current estimates of the parameters Ω^(ℓ) := { κ₁^(ℓ), κ₂^(ℓ), {z_k^(ℓ)}_{k=1}^K, {η_k^(ℓ)}_{k=1}^K }, is given by:

    E{ n_{k,r} | {θ_{i,k,r}}_{i,k,r=1}^{2,K,R}, Ω^(ℓ) } =
        [ q_k^(ℓ)/(π I₀(κ₁^(ℓ))) exp(κ₁^(ℓ) cos(θ_{1,k,r})) ] /
        [ q_k^(ℓ)/(π I₀(κ₁^(ℓ))) exp(κ₁^(ℓ) cos(θ_{1,k,r})) + (1 − q_k^(ℓ))/(π I₀(κ₂^(ℓ))) exp(κ₂^(ℓ) cos(θ_{2,k,r})) ].

Denoting the expectation above by the shorthand E^(ℓ){n_{k,r}}, and letting

    ε^(ℓ)_{i,k,r} := { E^(ℓ){n_{k,r}},  i = 1;  1 − E^(ℓ){n_{k,r}},  i = 2, }    (14)

the M-step of the EM algorithm for κ₁^(ℓ+1) and κ₂^(ℓ+1) gives:

    κ_i^(ℓ+1) = A⁻¹( ( c₀ d + Σ_{r,k=1}^{R,K} ε^(ℓ)_{i,k,r} cos(θ_{i,k,r}) ) / ( d + Σ_{r,k=1}^{R,K} ε^(ℓ)_{i,k,r} ) ),

where A(x) := I₁(x)/I₀(x), with I₁(·) denoting the first-order modified Bessel function of the first kind. The inversion of A(·) can be carried out numerically in order to find κ₁^(ℓ+1) and κ₂^(ℓ+1). 
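The E-step responsibilities and the von Mises M-step above can be prototyped as follows. This is a sketch under stated assumptions: A(·) is evaluated by numerical quadrature of the integral representation I_m(x) = (1/π)∫₀^π e^{x cos t} cos(mt) dt (to keep the example dependency-free), and its inversion is done by bisection, one of several standard choices:

```python
import numpy as np

def _trapezoid(y, x):
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) * 0.5)

def bessel_ratio(kappa, n=4001):
    """A(kappa) = I1(kappa)/I0(kappa), via quadrature of the integral
    representation of the modified Bessel functions."""
    t = np.linspace(0.0, np.pi, n)
    ex = np.exp(kappa * np.cos(t))
    return _trapezoid(ex * np.cos(t), t) / _trapezoid(ex, t)

def e_step(theta1, theta2, q, k1, k2):
    """E{n_{k,r}} by Bayes' rule; theta1, theta2 are K x R arrays of angles."""
    a = q[:, None] * np.exp(k1 * np.cos(theta1)) / (np.pi * np.i0(k1))
    b = (1.0 - q)[:, None] * np.exp(k2 * np.cos(theta2)) / (np.pi * np.i0(k2))
    return a / (a + b)

def m_step_kappa(resp, theta, c0, d, lo=1e-6, hi=500.0):
    """Solve A(kappa) = (c0*d + sum resp*cos theta) / (d + sum resp) by bisection
    (A is monotonically increasing from 0 toward 1)."""
    target = (c0 * d + np.sum(resp * np.cos(theta))) / (d + np.sum(resp))
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bessel_ratio(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

# If theta is concentrated so the empirical ratio equals A(2), the M-step recovers kappa = 2
r = bessel_ratio(2.0)                       # I1(2)/I0(2), about 0.698
theta = np.full((50, 3), np.arccos(r))
kappa_hat = m_step_kappa(np.ones((50, 3)), theta, c0=r, d=10.0)
assert abs(kappa_hat - 2.0) < 1e-3

# Windows whose speaker-1 predictor correlates better get responsibility near 1
resp = e_step(np.full((4, 3), 0.3), np.full((4, 3), 2.5), np.full(4, 0.5), 2.0, 2.0)
assert np.all(resp > 0.9)
```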
The M-step for {η_k}_{k=1}^K and {z_k}_{k=1}^K corresponds to the following maximization problem:

    argmax_{ {z_k, η_k}_{k=1}^K } Σ_{r,k=1}^{R,K} [ E^(ℓ){n_{k,r}} z_k − log(1 + exp(z_k)) − ( (z_k − z_{k−1})² + 2β )/(2η_k) − ( (1 + 2(α + 1))/2 ) log η_k ].

An efficient approximate solution to this maximization problem is given by another EM algorithm, where the E-step is the point process smoothing algorithm [16, 17] and the M-step updates the state variance sequence [18]. At iteration m, given an estimate of η_k^(ℓ+1), denoted by η_k^(ℓ+1,m), the forward pass of the E-step for k = 1, 2, …, K is given by:

    z̄^(ℓ+1,m)_{k|k−1} = z̄^(ℓ+1,m)_{k−1|k−1},
    σ^(ℓ+1,m)_{k|k−1} = σ^(ℓ+1,m)_{k−1|k−1} + η_k^(ℓ+1,m)/R,
    z̄^(ℓ+1,m)_{k|k} = z̄^(ℓ+1,m)_{k|k−1} + σ^(ℓ+1,m)_{k|k−1} [ Σ_{r=1}^R E^(ℓ){n_{k,r}} − R exp(z̄^(ℓ+1,m)_{k|k}) / (1 + exp(z̄^(ℓ+1,m)_{k|k})) ],
    σ^(ℓ+1,m)_{k|k} = [ 1/σ^(ℓ+1,m)_{k|k−1} + R exp(z̄^(ℓ+1,m)_{k|k}) / (1 + exp(z̄^(ℓ+1,m)_{k|k}))² ]⁻¹,    (15)

and for k = K − 1, K − 2, …, 1, the backward pass of the E-step is given by:

    s^(ℓ+1,m)_k = σ^(ℓ+1,m)_{k|k} / σ^(ℓ+1,m)_{k+1|k},
    z̄^(ℓ+1,m)_{k|K} = z̄^(ℓ+1,m)_{k|k} + s^(ℓ+1,m)_k ( z̄^(ℓ+1,m)_{k+1|K} − z̄^(ℓ+1,m)_{k+1|k} ),
    σ^(ℓ+1,m)_{k|K} = σ^(ℓ+1,m)_{k|k} + ( s^(ℓ+1,m)_k )² ( σ^(ℓ+1,m)_{k+1|K} − σ^(ℓ+1,m)_{k+1|k} ).    (16)

Note that the third equation in the forward filter is non-linear in z̄^(ℓ+1,m)_{k|k}, and can be solved using standard techniques (e.g., Newton's method). The M-step gives the updated value of η_k^(ℓ+1,m+1) as:

    η_k^(ℓ+1,m+1) = [ ( z̄^(ℓ+1,m)_{k|K} − z̄^(ℓ+1,m)_{k−1|K} )² + σ^(ℓ+1,m)_{k|K} + σ^(ℓ+1,m)_{k−1|K} − 2 σ^(ℓ+1,m)_{k|K} s^(ℓ+1,m)_{k−1} + 2β ] / [ 1 + 2(α + 1) ].    (17)

For each ℓ in the outer EM iteration, the inner iteration over m is repeated until convergence, to obtain the updated values of {z_k^(ℓ+1)}_{k=1}^K and {η_k^(ℓ+1)}_{k=1}^K to be passed to the outer EM iteration. The updated estimate of the Bernoulli success probability at window k and iteration ℓ + 1 is given by q_k^(ℓ+1) = logit⁻¹(z_k^(ℓ+1)). Starting with an initial guess of the parameters, the outer EM algorithm alternates between finding the expectation of {n_{k,r}}_{k=1,r=1}^{K,R} and estimating the parameters κ₁, κ₂, {z_k}_{k=1}^K, and {η_k}_{k=1}^K until convergence. Confidence intervals for q_k^(ℓ) can be obtained by mapping the Gaussian confidence intervals for the Gaussian variable z_k^(ℓ) via the inverse logit mapping. In summary, the decoder takes as input the MEG observations and the envelopes of the two speech streams, and outputs the Bernoulli success probability sequence corresponding to attention to speaker 1. 
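The forward-backward recursions of Eqs. (15)–(16) can be prototyped as follows. This is a sketch, not the authors' implementation: the initial condition (z̄₀ = 0, σ₀ = 1) is our own assumption, and the implicit mean update in the forward filter is solved here with a few Newton iterations:

```python
import numpy as np

def smooth_attention(resp, eta, n_newton=30):
    """Forward Bernoulli filter and backward smoother for the state z_k
    (cf. Eqs. 15-16). `resp` is the K x R matrix of E{n_{k,r}}; `eta` is the
    length-K state-variance sequence. Initial condition is an assumption."""
    K, R = resp.shape
    z_p = np.zeros(K); s_p = np.zeros(K)   # one-step predictions
    z_f = np.zeros(K); s_f = np.zeros(K)   # filtered mean / variance
    z_prev, s_prev = 0.0, 1.0              # assumed initial condition
    for k in range(K):
        z_p[k] = z_prev
        s_p[k] = s_prev + eta[k] / R       # prediction variance, as in Eq. (15)
        z = z_p[k]
        for _ in range(n_newton):          # Newton solve of the implicit update:
            p = 1.0 / (1.0 + np.exp(-z))   # z = z_p + s_p (sum_r resp - R p(z))
            g = z - z_p[k] - s_p[k] * (resp[k].sum() - R * p)
            z -= g / (1.0 + s_p[k] * R * p * (1.0 - p))
        p = 1.0 / (1.0 + np.exp(-z))
        z_f[k] = z
        s_f[k] = 1.0 / (1.0 / s_p[k] + R * p * (1.0 - p))
        z_prev, s_prev = z_f[k], s_f[k]
    z_s = z_f.copy(); s_s = s_f.copy()     # backward (RTS-style) pass, Eq. (16)
    for k in range(K - 2, -1, -1):
        sk = s_f[k] / s_p[k + 1]
        z_s[k] = z_f[k] + sk * (z_s[k + 1] - z_p[k + 1])
        s_s[k] = s_f[k] + sk ** 2 * (s_s[k + 1] - s_p[k + 1])
    return z_s, s_s

# A listener attending speaker 1 for 20 windows, then speaker 2 for 20
resp = np.vstack([np.full((20, 3), 0.95), np.full((20, 3), 0.05)])
z_s, s_s = smooth_attention(resp, eta=np.full(40, 0.5))
q_s = 1.0 / (1.0 + np.exp(-z_s))           # q_k = logit^{-1}(z_k)
assert float(q_s[:12].min()) > 0.6 and float(q_s[-12:].max()) < 0.4
```

The smoothed variances s_s are what the inverse-logit mapping turns into the confidence hulls shown in the figures.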
3 Results

3.1 Simulated Experiments

We first evaluated the proposed state-space model and estimation procedure on simulated MEG data. For a sampling rate of Fs = 200 Hz, a window length of W = 50 samples (250 ms), and a total observation time of T = 12000 samples (60 s), the binary sequence {n_{k,r}}_{k=1,r=1}^{240,3} is generated as realizations of a Bernoulli process with success probability q_k = 0.95 or 0.05, corresponding to attention to speaker one or two, respectively. Using a TRF template of length 0.5 s estimated from real data, we generated 3 trials with an SNR of 10 dB. Each trial includes three attentional switches, occurring every 15 seconds. The hyper-parameters α and β for the inverse-Gamma prior on the state variance are chosen as α = 2.01 and β = 2. This choice of α close to 2 results in a non-informative prior, as the variance of the prior is given by β²/[(α − 1)²(α − 2)] ≈ 400, while the mean is given by β/(α − 1) ≈ 2. The mean of the prior is chosen large enough that a state transition from q_k = 0.99 to q_{k+1} = 0.01 lies within the 98% confidence interval around the state innovation variable w_k (see Eq. (12)). The hyper-parameters for the von Mises distribution are chosen as d = (7/2)KR and c₀ = 0.15, as the average observed correlation between the MEG data and the model prediction is in the range of 0.1–0.2. The choice of d = (7/2)KR gives more weight to the prior than to the empirical estimate of κ_i.

Figure 2–A and 2–B show the simulated MEG signal (black traces) and the predictors of attending to speaker one and two (red traces), respectively, at an SNR of 10 dB. Regions highlighted in yellow in panels A and B indicate the attention of the listener to either of the two speakers. 
Estimated values of {q_k}_{k=1}^{240} (green trace) and the corresponding confidence intervals (green hull) are shown in Figure 2–C. The estimated q_k values reliably track the attentional modulation, and the transitions are captured with high accuracy. MEG data recorded from the brain is usually contaminated with environmental noise as well as nuisance sources of neural activity, which can considerably decrease the SNR of the measured signal. In order to test the robustness of the decoder with respect to observation noise, we repeated the above simulation with SNR values of 0 dB, −10 dB, and −20 dB. As Figure 2–D shows, the dynamic denoising feature of the proposed state-space model results in a desirable decoding performance for SNR values as low as −20 dB. The confidence intervals and the estimated transition width widen gracefully as the SNR decreases. Finally, we tested the tracking performance of the decoder with respect to the frequency of the attentional switches. From subjective experience, attentional switches occur over a time scale of a few seconds. We repeated the above simulation at an SNR of 10 dB with 14 attentional switches equally spaced during the 60 s trial. Figure 2–E shows the corresponding estimated values of {q_k}, which reliably track the 14 attentional switches during the observation period.

3.2 Application to Real MEG Data

We evaluated our proposed state-space model and decoder on real MEG data recorded from two human subjects listening to a speech mixture from a male and a female speaker under different attentional conditions. The experimental methods were approved by the Institutional Review Board (IRB) at the authors' home institution. Two normal-hearing, right-handed young adults participated in this experiment. Listeners selectively listened to one of the two competing speakers of opposite gender, mixed into a single acoustic channel at equal intensity. 
The stimuli consisted of 4 segments from the book A Child's History of England by Charles Dickens, narrated by two different readers (one male and one female). Three different mixtures, each 60 s long, were generated and used in different experimental conditions, to prevent the reduction in attentional focus that results from listening to a single mixture repeatedly over the entire experiment. All stimuli were delivered

Figure 2: Simulated MEG data (black traces) and model prediction (red traces) of A) speaker one and B) speaker two at SNR = 10 dB. Regions highlighted in yellow indicate the attention of the listener to each of the speakers. C) Estimated values of {q_k} with 95% confidence intervals. D) Estimated values of {q_k} from simulated MEG data at SNR = 0, −10, and −20 dB. E) Estimated values of {q_k} from simulated MEG data with SNR = 10 dB and 14 equally spaced attentional switches during the entire trial. Error hulls indicate 95% confidence intervals. The MEG units are in pT/m.

identically to both ears using tube phones plugged into the ears, at a comfortable loudness level of around 65 dB. The neuromagnetic signal was recorded using a 157-channel, whole-head MEG system (KIT) in a magnetically shielded room, with a sampling rate of 1 kHz. Three reference channels were used to measure and cancel the environmental magnetic field [19].

The stimulus-irrelevant neural activity was removed using the DSS algorithm [11]. The recorded neural response during each 60 s trial was high-pass filtered at 1 Hz and downsampled to 200 Hz before submission to the DSS analysis. Only the first component of the DSS decomposition was used in the analysis [9]. The TRF corresponding to the attended speaker was estimated from a pilot condition in which only a single speech stream was presented to the subject, using 3 repeated trials (see Section 2.1). 
The TRF corresponding to the unattended speaker was approximated by truncating the attended TRF beyond a lag of 90 ms, on the grounds of the recent findings of Ding and Simon [8], which show that the components of the unattended TRF are significantly suppressed beyond the M50 evoked field. In the following analysis, trials with poor correlation between the MEG data and the model prediction were removed by testing the hypothesis of uncorrelatedness using the Fisher transformation at a confidence level of 95% [20], resulting in the rejection of about 26% of the trials. All hyper-parameters are equal to those used in the simulation studies (see Section 3.1).

In the first and second conditions, subjects were asked to attend to the male and the female speaker, respectively, during the entire trial. Figures 3–A and 3–B show the MEG data and the predicted q_k values for averaged as well as single trials for both subjects. Confidence intervals are shown by the shaded hulls for the averaged-trial estimate in each condition. The decoding results indicate that the decoder reliably recovers the attentional modulation in both conditions, estimating {q_k} close to 1 and 0 for the first and second conditions, respectively. For the third and fourth conditions, subjects were instructed to switch their attention in the middle of each trial, from the male to the female speaker (third condition) and from the female to the male speaker (fourth condition). Switching times were cued by inserting a 2 s pause starting at 28 s into each trial. Figures 3–C and 3–D show the MEG data and the predicted q_k values for averaged and single trials corresponding to the third and fourth conditions, respectively. Dashed vertical lines mark the start of the 2 s pause before the attentional switch. Using multiple trials, the decoder is able to capture the attentional switch occurring roughly halfway through the trial. 
The decoding of individual trials suggests that the exact switching time is not consistent across trials, as the attentional switch may occur slightly earlier or later than the presented cue due to inter-trial variability. Moreover, the decoding results for a correlation-based classifier are shown in the third panel of each figure for one of the subjects. At each time window, the

Figure 3: Decoding of auditory attentional modulation from real MEG data. In each subplot, the MEG data (black traces) and the model prediction (red traces) for attending to speaker 1 (male) and speaker 2 (female) are shown in the first and second panels, respectively, for subject 1. The third panel shows the estimated values of {q_k} and the corresponding confidence intervals using multiple trials for both subjects. The gray traces show the results for a correlation-based classifier for subject 1. The fourth panel shows the estimated {q_k} values for single trials. A) Condition one: attending to speaker 1 through the entire trial. B) Condition two: attending to speaker 2 through the entire trial. C) Condition three: attending to speaker 1 until t = 28 s and switching attention to speaker 2 starting at t = 30 s. D) Condition four: attending to speaker 2 until t = 28 s and switching attention to speaker 1 starting at t = 30 s. Dashed lines in subplots C and D indicate the start of the 2 s silence cue for the attentional switch. Error hulls indicate 95% confidence intervals. The MEG units are in pT/m.

classifier picks the speaker with the maximum correlation (averaged across trials) between the MEG data and its predicted value based on the envelopes. 
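The correlation-based baseline is simple enough to state in code. A minimal sketch of the classifier as described (per-window, trial-averaged correlation, no temporal smoothing; cos θ_i serves as the correlation between the MEG window and predictor i):

```python
import numpy as np

def correlation_classifier(theta1, theta2):
    """Baseline from Section 3: per window, average the correlation cos(theta_i)
    across trials and pick the speaker with the higher value. The theta arrays
    are K x R (windows x trials); returns 1 for speaker 1, 0 for speaker 2."""
    c1 = np.cos(theta1).mean(axis=1)
    c2 = np.cos(theta2).mean(axis=1)
    return (c1 > c2).astype(int)

# Window-by-window decisions flip with whichever predictor correlates better;
# with no state dynamics, isolated noisy windows flip the decision too
theta1 = np.array([[0.2, 0.3], [1.4, 1.5]])
theta2 = np.array([[1.2, 1.1], [0.4, 0.2]])
assert correlation_classifier(theta1, theta2).tolist() == [1, 0]
```

The lack of any temporal prior in this baseline is exactly what the state-space smoother adds, which is why the gray traces in Figure 3 fluctuate while the {q_k} estimates do not.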
Our proposed method significantly outperforms the correlation-based classifier, which is unable to consistently track the attentional modulation of the listener over time.

4 Discussion

In this paper, we presented a behaviorally inspired state-space model and an estimation framework for decoding the attentional state of a listener in a competing-speaker environment. The estimation framework takes advantage of the temporal continuity of the attentional state, yielding a decoder with high accuracy and high temporal resolution. Parameter estimation is carried out using the EM algorithm, which at its heart relies on efficient smoothing of the underlying Bernoulli process, resulting in very low overall computational complexity. We illustrate the performance of our technique on simulated and real MEG data from human subjects. The proposed approach benefits from the inherent model-based dynamic denoising of the underlying state-space model and is able to reliably decode the attentional state under very low SNR conditions. Future work includes generalizing the proposed model to more realistic and complex auditory environments with more diverse sources, such as mixtures of speech, music, and structured background noise. Adapting the proposed model and estimation framework to EEG is also under study.

References

[1] Bregman, A. S. (1994). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.

[2] Griffiths, T. D., & Warren, J. D. (2004). What is an auditory object? Nature Reviews Neuroscience, 5(11), 887–892.

[3] Shamma, S. A., Elhilali, M., & Micheyl, C. (2011). Temporal coherence and attention in auditory scene analysis. Trends in Neurosciences, 34(3), 114–123.

[4] Bregman, A. S. (1998). Psychological data and computational ASA. In Computational Auditory Scene Analysis (pp. 1–12). Hillsdale, NJ: L.
Erlbaum Associates Inc.

[5] Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25(5), 975–979.

[6] Elhilali, M., Xiang, J., Shamma, S. A., & Simon, J. Z. (2009). Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. PLoS Biology, 7(6), e1000129.

[7] Shinn-Cunningham, B. G. (2008). Object-based auditory and visual attention. Trends in Cognitive Sciences, 12(5), 182–186.

[8] Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. PNAS, 109(29), 11854–11859.

[9] Ding, N., & Simon, J. Z. (2012). Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. Journal of Neurophysiology, 107(1), 78–89.

[10] Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), 233–236.

[11] de Cheveigné, A., & Simon, J. Z. (2008). Denoising based on spatial filtering. Journal of Neuroscience Methods, 171(2), 331–339.

[12] David, S. V., Mesgarani, N., & Shamma, S. A. (2007). Estimating sparse spectro-temporal receptive fields with natural stimuli. Network: Computation in Neural Systems, 18(3), 191–212.

[13] Ba, D., Babadi, B., Purdon, P. L., & Brown, E. N. (2014). Convergence and stability of iteratively re-weighted least squares algorithms. IEEE Transactions on Signal Processing, 62(1), 183–195.

[14] Fisher, N. I. (1995). Statistical Analysis of Circular Data. Cambridge, UK: Cambridge University Press.

[15] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.

[16] Smith, A. C., & Brown, E. N. (2003).
Estimating a state-space model from point process observations. Neural Computation, 15(5), 965–991.

[17] Smith, A. C., Frank, L. M., Wirth, S., Yanike, M., Hu, D., Kubota, Y., Graybiel, A. M., Suzuki, W. A., & Brown, E. N. (2004). Dynamic analysis of learning in behavioral experiments. The Journal of Neuroscience, 24(2), 447–461.

[18] Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4), 253–264.

[19] de Cheveigné, A., & Simon, J. Z. (2007). Denoising based on time-shift PCA. Journal of Neuroscience Methods, 165(2), 297–305.

[20] Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.