{"title": "Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 3292, "page_last": 3302, "abstract": "Frequency-specific patterns of neural activity are traditionally interpreted as sustained rhythmic oscillations, and related to cognitive mechanisms such as attention, high level visual processing or motor control. While alpha waves (8--12\\,Hz) are known to closely resemble short sinusoids, and thus are revealed by Fourier analysis or wavelet transforms, there is an evolving debate that electromagnetic neural signals are composed of more complex waveforms that cannot be analyzed by linear filters and traditional signal representations. In this paper, we propose to learn dedicated representations of such recordings using a multivariate convolutional sparse coding (CSC) algorithm. Applied to electroencephalography (EEG) or magnetoencephalography (MEG) data, this method is able to learn not only prototypical temporal waveforms, but also associated spatial patterns so their origin can be localized in the brain. Our algorithm is based on alternated minimization and a greedy coordinate descent solver that leads to state-of-the-art running time on long time series. To demonstrate the implications of this method, we apply it to MEG data and show that it is able to recover biological artifacts. 
More remarkably, our approach also reveals the presence of non-sinusoidal mu-shaped patterns, along with their topographic maps related to the somatosensory cortex.", "full_text": "Multivariate Convolutional Sparse Coding for\n\nElectromagnetic Brain Signals\n\nTom Dupr\u00e9 La Tour\u22171, Thomas Moreau\u22172, Mainak Jas1, Alexandre Gramfort2\n\n1: LTCI, T\u00e9l\u00e9com ParisTech, Universit\u00e9 Paris-Saclay, Paris, France\n\n2: INRIA, Universit\u00e9 Paris-Saclay, Saclay, France\n\n*: Both authors contributed equally.\n\nAbstract\n\nFrequency-speci\ufb01c patterns of neural activity are traditionally interpreted as sus-\ntained rhythmic oscillations, and related to cognitive mechanisms such as attention,\nhigh level visual processing or motor control. While alpha waves (8\u201312 Hz) are\nknown to closely resemble short sinusoids, and thus are revealed by Fourier analy-\nsis or wavelet transforms, there is an evolving debate that electromagnetic neural\nsignals are composed of more complex waveforms that cannot be analyzed by\nlinear \ufb01lters and traditional signal representations. In this paper, we propose to\nlearn dedicated representations of such recordings using a multivariate convolu-\ntional sparse coding (CSC) algorithm. Applied to electroencephalography (EEG)\nor magnetoencephalography (MEG) data, this method is able to learn not only\nprototypical temporal waveforms, but also associated spatial patterns so their origin\ncan be localized in the brain. Our algorithm is based on alternated minimization\nand a greedy coordinate descent solver that leads to state-of-the-art running time\non long time series. To demonstrate the implications of this method, we apply it to\nMEG data and show that it is able to recover biological artifacts. 
More remarkably, our approach also reveals the presence of non-sinusoidal mu-shaped patterns, along with their topographic maps related to the somatosensory cortex.\n\n1 Introduction\n\nNeural activity recorded via measurements of the electrical potential over the scalp by electroencephalography (EEG), or of magnetic fields by magnetoencephalography (MEG), can be used to investigate human cognitive processes and certain pathologies. Such recordings consist of dozens to hundreds of simultaneously recorded signals, for durations going from minutes to hours. In order to describe and quantify neural activity in such multi-gigabyte data, it is classical to decompose the signal into predefined representations such as the Fourier or wavelet bases. This leads to canonical frequency bands such as theta (4–8 Hz), alpha (8–12 Hz), or beta (15–30 Hz) (Buzsaki, 2006), in which signal power can be quantified. While such linear analyses have had significant impact in neuroscience, there is now a debate regarding whether neural activity consists of transient bursts of isolated events rather than rhythmically sustained oscillations (van Ede et al., 2018). To study these transient events and the morphology of the waveforms (Mazaheri and Jensen, 2008; Cole and Voytek, 2017), which matter in cognition and for our understanding of pathologies (Jones, 2016; Cole et al., 2017), there is a clear need to go beyond traditionally employed signal processing methodologies (Cole and Voytek, 2018). For instance, a classic Fourier analysis fails to distinguish alpha-rhythms from mu-rhythms, which have the same peak frequency at around 10 Hz, but whose waveforms are different (Cole and Voytek, 2017; Hari and Puce, 2017).\nThe key to many modern statistical analyses of complex data such as natural images, sounds or neural time series is the estimation of data-driven representations.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nDictionary learning is one family of techniques, which consists in learning atoms (or patterns) that offer sparse data approximations. When working with long signals in which events can happen at any instant, one idea is to learn shift-invariant atoms. They can offer better signal approximations than generic bases such as Fourier or wavelets, since they are not limited to narrow frequency bands. Multiple approaches have been proposed to solve this shift-invariant dictionary learning problem, such as MoTIF (Jost et al., 2006), the sliding window matching (Gips et al., 2017), the adaptive waveform learning (Hitziger et al., 2017), or the learning of recurrent waveforms (Brockmeier and Príncipe, 2016), yet they all have several limitations, as discussed in Jas et al. (2017). A more popular approach, especially in image processing, is the convolutional sparse coding (CSC) model (Jas et al., 2017; Pachitariu et al., 2013; Kavukcuoglu et al., 2010; Zeiler et al., 2010; Heide et al., 2015; Wohlberg, 2016b; Šorel and Šroubek, 2016; Grosse et al., 2007; Mailhé et al., 2008). The idea is to cast the problem as an optimization problem, representing the signal as a sum of convolutions between atoms and activation signals. The CSC approach has been quite successful in several fields such as computer vision (Kavukcuoglu et al., 2010; Zeiler et al., 2010; Heide et al., 2015; Wohlberg, 2016b; Šorel and Šroubek, 2016), biomedical imaging (Jas et al., 2017; Pachitariu et al., 2013), and audio signal processing (Grosse et al., 2007; Mailhé et al., 2008), yet it was essentially developed for univariate signals. Interestingly, images can be multivariate, such as color or hyper-spectral images, yet most CSC methods only consider grayscale images.
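In its simplest univariate form, this generative model writes a signal as a sum of convolutions, x = Σ_k z_k ∗ d_k. A toy NumPy sketch of this synthesis (illustrative only; the names are ours, not those of any cited implementation):

```python
import numpy as np

def csc_synthesize(z_list, d_list):
    """Synthesize x = sum_k z_k * d_k from activations z_k and atoms d_k."""
    T = len(z_list[0]) + len(d_list[0]) - 1  # T = (T - L + 1) + L - 1
    x = np.zeros(T)
    for z, d in zip(z_list, d_list):
        # each atom is reproduced wherever its activation signal is nonzero
        x += np.convolve(z, d)
    return x
```

A single nonzero activation simply places a scaled copy of the atom at that instant, which is what makes the representation shift-invariant.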
To the best of our knowledge, the only reference to multivariate CSC\nis Wohlberg (2016a), where the author proposes two models well suited for 3-channel images. In\nthe case of EEG and MEG recordings, neural activity is instantaneously and linearly spread across\nchannels, due to Maxwell\u2019s equations (Hari and Puce, 2017). The same temporal patterns are\nreproduced on all channels with different intensities, which depend on each activity\u2019s location in the\nbrain. To exploit this property, we propose to use a rank-1 constraint on each multivariate atom. This\nidea has been mentioned in (Barth\u00e9lemy et al., 2012, 2013), but was considered less \ufb02exible than the\nfull-rank model. Moreover, their proposed optimization techniques are not speci\ufb01c to shift-invariant\nmodels, and not scalable to long signals. Multivariate shift-invariant rank-1 decomposition of EEG\nhas also been considered with matching pursuit (Durka et al., 2005), but without learning the atoms,\nwhich are \ufb01xed Gabor \ufb01lters.\n\nContribution In this study, we develop a multivariate model for CSC, using a rank-1 constraint\non the atoms to account for the instantaneous spreading of an electromagnetic source over all the\nchannels. We also propose ef\ufb01cient optimization strategies, namely a locally greedy coordinate\ndescent (LGCD, Moreau et al. 2018), and precomputation steps for faster gradient computations. We\nprovide multiple numerical evaluations of our method, which show the highly competitive running\ntime on both univariate and multivariate models, even when working with hundreds of channels. We\nalso demonstrate the estimation performance of the multivariate model by recovering patterns on low\nsignal-to-noise ratio (SNR) data. 
Finally, we illustrate our method with atoms learned on multivariate MEG data, which, thanks to the rank-1 model, can be localized in the brain for clinical or cognitive neuroscience studies.\n\nNotation A multivariate signal with T time points in ℝ^P is noted X ∈ ℝ^{P×T}, while x ∈ ℝ^T is a univariate signal. We index time with brackets X[t] ∈ ℝ^P, while X_i ∈ ℝ^T is the channel i in X. For a vector v ∈ ℝ^P we define the ℓq norm as ‖v‖_q = (Σ_i |v_i|^q)^{1/q}, and for a multivariate signal X ∈ ℝ^{P×T}, we define the time-wise ℓq norm as ‖X‖_q = (Σ_{t=1}^T ‖X[t]‖_q^q)^{1/q}. The transpose of a matrix U is denoted by U⊤. For a multivariate signal X ∈ ℝ^{P×T}, ↼X is obtained by reversal of the temporal dimension, i.e., ↼X[t] = X[T + 1 − t]. The convolution of two signals z ∈ ℝ^{T−L+1} and d ∈ ℝ^L is denoted by z ∗ d ∈ ℝ^T. For D ∈ ℝ^{P×L}, z ∗ D is obtained by convolving every row of D by z. For D′ ∈ ℝ^{P×L}, D ∗̃ D′ ∈ ℝ^{2L−1} is obtained by summing the convolution between each row of D and D′: D ∗̃ D′ = Σ_{p=1}^P D_p ∗ D′_p. We note [a, b] the set of real numbers between a and b, and ⟦a, b⟧ the set of integers between a and b. We define T̃ as T − L + 1.\n\n2 Multivariate Convolutional Sparse Coding\n\nIn this section, we introduce the convolutional sparse coding (CSC) models used in this work. We focus on 1D-convolution, although these models can be naturally extended to higher order signals such as images by using the proper convolution operators.\n\nUnivariate CSC The CSC formulation adopted in this work follows the shift-invariant sparse coding (SISC) model from Grosse et al. (2007).
It is defined as follows:\n\n    min_{ {d_k}_k, {z_k^n}_{k,n} }  Σ_{n=1}^N ½ ‖ x^n − Σ_{k=1}^K z_k^n ∗ d_k ‖₂² + λ Σ_{k=1}^K ‖z_k^n‖₁ ,   s.t. ‖d_k‖₂² ≤ 1 and z_k^n ≥ 0 ,   (1)\n\nwhere {x^n}_{n=1}^N ⊂ ℝ^T are N observed signals, λ > 0 is the regularization parameter, {d_k}_{k=1}^K ⊂ ℝ^L are the K temporal atoms we aim to learn, and {z_k^n}_{k=1}^K ⊂ ℝ^{T̃} are K signals of activations, a.k.a. the code associated with x^n. This model assumes that the coding signals z_k^n are sparse, in the sense that only a few entries are nonzero in each signal. In this work, we also assume that the entries of z_k^n are positive, which means that the temporal patterns are present each time with the same polarity.\n\nMultivariate CSC The multivariate formulation uses an additional dimension on the signals and on the atoms, since the signal is recorded over P channels (mapping to space locations):\n\n    min_{ {D_k}_k, {z_k^n}_{k,n} }  Σ_{n=1}^N ½ ‖ X^n − Σ_{k=1}^K z_k^n ∗ D_k ‖₂² + λ Σ_{k=1}^K ‖z_k^n‖₁ ,   s.t. ‖D_k‖₂² ≤ 1 and z_k^n ≥ 0 ,   (2)\n\nwhere {X^n}_{n=1}^N ⊂ ℝ^{P×T} are N observed multivariate signals, {D_k}_{k=1}^K ⊂ ℝ^{P×L} are the spatio-temporal atoms, and {z_k^n}_{k=1}^K ⊂ ℝ^{T̃} are the sparse activations associated with X^n.\n\nMultivariate CSC with rank-1 constraint This model is similar to the multivariate case but it adds a rank-1 constraint on the dictionary, D_k = u_k v_k⊤ ∈ ℝ^{P×L}, with u_k ∈ ℝ^P being the pattern over channels and v_k ∈ ℝ^L the pattern over time.
The optimization problem boils down to:\n\n    min_{ {u_k}_k, {v_k}_k, {z_k^n}_{k,n} }  Σ_{n=1}^N ½ ‖ X^n − Σ_{k=1}^K z_k^n ∗ (u_k v_k⊤) ‖₂² + λ Σ_{k=1}^K ‖z_k^n‖₁ ,   s.t. ‖u_k‖₂² ≤ 1 , ‖v_k‖₂² ≤ 1 and z_k^n ≥ 0 .   (3)\n\nThe rank-1 constraint is consistent with Maxwell's equations and the physical model of electrophysiological signals like EEG or MEG, where each source is linearly spread instantaneously over channels with a constant topographic map (Hari and Puce, 2017). Using this assumption, one aims to improve the estimation of patterns under the presence of independent noise over channels. Moreover, it can help separate overlapping sources, which are inherently rank-1 but whose sum is generally of higher rank. Finally, as explained below, several computations can be factorized to speed up the algorithm.\n\nNoise model Note that our models assume Gaussian noise, whereas one can also use an alpha-stable noise distribution to better handle strong artifacts, as proposed by Jas et al. (2017). Importantly, our contribution is orthogonal to their work, and one can easily extend multivariate models to alpha-stable noise distributions, by using their EM algorithm and by updating the ℓ2 loss into a weighted ℓ2 loss in (3). Also, our experiments used artifact-free datasets, so the Gaussian noise model is appropriate.\n\n3 Model estimation\n\nProblems (1), (2) and (3) share the same structure. They are convex in each variable but not jointly convex. The resolution is done using a block coordinate descent approach, which alternately minimizes the objective function over one block of the variables.
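For concreteness, the rank-1 objective in (3) can be evaluated in a few lines of NumPy. This is only an illustrative sketch for a single trial (N = 1); the names are ours and not those of the released implementation:

```python
import numpy as np

def rank1_csc_objective(X, u, v, z, reg):
    """Objective of (3) for one signal X (P, T), spatial patterns u (K, P),
    temporal patterns v (K, L), activations z (K, T - L + 1), reg > 0."""
    P, T = X.shape
    recon = np.zeros((P, T))
    for k in range(len(v)):
        temporal = np.convolve(z[k], v[k])   # z_k * v_k, length T
        recon += np.outer(u[k], temporal)    # rank-1 spread over channels
    residual = X - recon
    return 0.5 * np.sum(residual ** 2) + reg * np.abs(z).sum()
```

Setting all activations to zero recovers 0.5‖X‖₂², and a signal generated exactly by the model reaches the pure ℓ1 penalty, which is a convenient sanity check for any implementation.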
In this section, we describe this approach on the multivariate CSC with rank-1 constraint case (3), updating iteratively the activations z_k^n, the spatial patterns u_k, and the temporal patterns v_k.\n\n3.1 Z-step: solving for the activations\n\nGiven K fixed atoms D_k and a regularization parameter λ > 0, the Z-step aims to retrieve the NK activation signals z_k^n ∈ ℝ^{T̃} associated to the signals X^n ∈ ℝ^{P×T} by solving the following ℓ1-regularized optimization problem:\n\n    min_{ {z_k^n}_k , z_k^n ≥ 0 }  ½ ‖ X^n − Σ_{k=1}^K z_k^n ∗ D_k ‖₂² + λ Σ_{k=1}^K ‖z_k^n‖₁ .   (4)\n\nAlgorithm 1: Locally greedy coordinate descent (LGCD)\nInput: signal X, atoms D_k, number of segments M, stopping parameter ε > 0, z_k initialization\nInitialize β_k[t] with (5).\nrepeat\n    for m = 1 to M do\n        Compute z′_k[t] = max( (β_k[t] − λ) / ‖D_k‖₂² , 0 ) for (k, t) ∈ C_m\n        Choose (k0, t0) = arg max_{(k, t) ∈ C_m} |z_k[t] − z′_k[t]|\n        Update β with (6)\n        Update the current point estimate z_{k0}[t0] ← z′_{k0}[t0]\nuntil ‖z − z′‖_∞ < ε\n\nThis problem is convex in z_k^n and can be efficiently solved. In Chalasani et al. (2013), the authors proposed an algorithm based on FISTA (Beck and Teboulle, 2009) to solve it. Bristow et al. (2013) introduced a method based on ADMM (Boyd et al., 2011) to compute the activation signals z_k^n efficiently. These two methods are detailed and compared by Wohlberg (2016b), who also made use of the fast Fourier transform (FFT) to accelerate the computations. Recently, Jas et al. (2017) proposed to use L-BFGS (Byrd et al., 1995) to improve on first order methods. Finally, Kavukcuoglu et al.
(2010) adapted the greedy coordinate descent (GCD) to solve this convolutional sparse coding problem. However, for long signals, these techniques can be quite slow due to the computation of the gradient (FISTA, ADMM, L-BFGS) or the choice of the best coordinate to update in GCD, which are operations that scale linearly in T. A way to alleviate this limitation is to use a locally greedy coordinate descent (LGCD) strategy, presented recently in Moreau et al. (2018).\nNote that problem (4) is independent for each signal X^n. The computation of each z^n can thus be parallelized, independently of the technique selected to solve the optimization (Jas et al., 2017). Therefore, we omit the superscript n in the following subsection to simplify the notation.\n\nCoordinate descent (CD) The key idea of coordinate descent is to update our estimate of the solution one coordinate z_k[t] at a time. For (4), it is possible to compute the optimal value z′_k[t] of one coordinate z_k[t] given that all the others are fixed. Indeed, the problem (4) restricted to one coordinate has a closed-form solution given by:\n\n    z′_k[t] = max( (β_k[t] − λ) / ‖D_k‖₂² , 0 ) ,  with  β_k[t] = [ ↼D_k ∗̃ ( X − Σ_{l=1}^K z_l ∗ D_l + z_k[t] e_t ∗ D_k ) ][t] ,   (5)\n\nwhere e_t ∈ ℝ^{T̃} is the canonical basis vector with value 1 at index t and 0 elsewhere. When updating the coefficient z_{k0}[t0] to the value z′_{k0}[t0], β is updated with:\n\n    β_k^{(q+1)}[t] = β_k^{(q)}[t] + ( ↼D_{k0} ∗̃ D_k )[t − t0] ( z_{k0}[t0] − z′_{k0}[t0] ) ,  ∀ (k, t) ≠ (k0, t0) .   (6)\n\nThe term ( ↼D_{k0} ∗̃ D_k )[t − t0] is zero for |t − t0| ≥ L. Thus, only K(2L − 1) coefficients of β need to be changed (Kavukcuoglu et al., 2010). The CD algorithm updates at each iteration a coordinate to this optimal value.
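The closed-form update (5) is easy to check numerically. The following univariate sketch (P = 1; names are ours) recomputes the residual naively instead of maintaining β through (6):

```python
import numpy as np

def optimal_coordinate(x, d, z, t, reg):
    """Closed-form optimum (5) of coordinate z[t], all others fixed.
    Univariate case: signal x (T,), atom d (L,), code z (T - L + 1,)."""
    T, L = len(x), len(d)
    # residual with the contribution of z[t] added back:
    # x - sum_l z_l * d_l + z[t] e_t * d
    partial = x - np.convolve(z, d) + z[t] * np.pad(d, (t, T - L - t))
    beta_t = np.dot(d, partial[t:t + L])  # flipped-atom correlation at lag t
    return max((beta_t - reg) / np.dot(d, d), 0.0)
```

With λ = 0, a coordinate of an exact nonnegative code is left unchanged; with λ > 0, the correlation is soft-thresholded before rescaling.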
The coordinate to update can be chosen with different strategies, such as the cyclic strategy, which iterates over all coordinates (Friedman et al., 2007), the randomized CD (Nesterov, 2010; Richtárik and Takáč, 2014), which chooses a coordinate at random at each iteration, or the greedy CD (Osher and Li, 2009), which chooses the coordinate farthest from its optimal value.\n\nLocally greedy coordinate descent (LGCD) The choice of a coordinate selection strategy results from a tradeoff between the computational cost of each iteration and the improvement it provides. For cyclic and randomized strategies, the iteration complexity is O(KL), as the coordinate selection can be performed in constant time. The greedy selection of a coordinate is more expensive, as it is linear in the signal length, O(KT̃). However, greedy selection is more efficient iteration-wise (Nutini et al., 2015).\n\nTable 1: Computational complexities of each step\n\nStep    | Computation           | Computed       | Rank-1       | Full-rank\nZ-step  | β initialization      | once           | NKT(L + P)   | NKT(LP)\nZ-step  | Precomputation        | once           | K²L(L + P)   | K²L(LP)\nZ-step  | M coordinate updates  | multiple times | MKL          | MKL\nD-step  | Φ precomputation      | once           | NKTLP        | NKTLP\nD-step  | Ψ precomputation      | once           | NK²TL        | NK²TL\nD-step  | Gradient evaluation   | multiple times | K²L(L + P)   | K²L(LP)\nD-step  | Function evaluation   | multiple times | K²L(L + P)   | K²L(LP)\n\nMoreau et al. (2018) proposed to consider a locally greedy selection strategy for CD.
The coordinate to update is chosen greedily in one of M subsegments of the signal, i.e., at iteration q, the selected coordinate is:\n\n    (k0, t0) = arg max_{(k, t) ∈ C_m} | z_k[t] − z′_k[t] | ,  m ≡ q (mod M) + 1 ,   (7)\n\nwith C_m = ⟦1, K⟧ × ⟦(m − 1)T̃/M , mT̃/M⟧. With this strategy, the coordinate selection complexity is linear in the length of the considered subsegment, O(KT̃/M). By choosing M = ⌊T̃/(2L − 1)⌋, the complexity of each update is the same as the complexity of random and cyclic coordinate selection, O(KL). We detail the steps of LGCD in Algorithm 1. This algorithm is particularly efficient when the z_k are sparse. Indeed, in this case, only a few coefficients need to be updated in the signal, resulting in a low number of iterations. Computational complexities are detailed in Table 1.\n\nRelation with matching pursuit (MP) Note that the greedy CD is strongly related to the well-known matching pursuit (MP) algorithm (Locatello et al., 2018). The main difference is that MP solves a slightly different problem, where the ℓ1 regularization is replaced with an ℓ0 constraint. Therefore, the size of the support is a fixed parameter in MP, whereas it is controlled by the regularization parameter λ in our case.
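Algorithm 1 and the selection rule (7) can be condensed into a short univariate sketch (single atom; names are ours; for clarity, the optimal values are recomputed densely at each iteration, whereas the actual algorithm maintains β incrementally with (6)):

```python
import numpy as np

def lgcd_univariate(x, d, reg, n_segments, n_iter=1000, tol=1e-10):
    """LGCD sketch for (4) with a single univariate atom d and signal x."""
    T, L = len(x), len(d)
    T_valid = T - L + 1
    z = np.zeros(T_valid)
    norm_d = np.dot(d, d)
    bounds = np.linspace(0, T_valid, n_segments + 1).astype(int)
    for q in range(n_iter):
        # optimal value (5) of every coordinate (dense recomputation here)
        residual = x - np.convolve(z, d)
        beta = np.correlate(residual, d, mode="valid") + z * norm_d
        z_opt = np.maximum((beta - reg) / norm_d, 0.0)
        if np.max(np.abs(z - z_opt)) < tol:  # stopping test of Algorithm 1
            break
        # greedy selection restricted to segment C_m, as in (7)
        m = q % n_segments
        lo, hi = bounds[m], bounds[m + 1]
        t0 = lo + np.argmax(np.abs(z - z_opt)[lo:hi])
        z[t0] = z_opt[t0]
    return z
```

Each update is an exact coordinate minimization, so the objective never increases; only the search for the best coordinate is restricted to the current segment.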
In terms of algorithm, both methods update one coordinate at a time, selected greedily, but MP does not apply a soft-thresholding in (5).\n\n3.2 D-step: solving for the atoms\n\nGiven KN fixed activation signals z_k^n ∈ ℝ^{T̃}, associated to signals X^n ∈ ℝ^{P×T}, the D-step aims to update the K spatial patterns u_k ∈ ℝ^P and K temporal patterns v_k ∈ ℝ^L, by solving:\n\n    min_{ ‖u_k‖₂ ≤ 1 , ‖v_k‖₂ ≤ 1 }  E ,  where  E ≜ Σ_{n=1}^N ½ ‖ X^n − Σ_{k=1}^K z_k^n ∗ (u_k v_k⊤) ‖₂² .   (8)\n\nThe problem (8) is convex in each block of variables {u_k}_k and {v_k}_k, but not jointly convex. Therefore, we optimize first {u_k}_k, then {v_k}_k, using in both cases a projected gradient descent with an Armijo backtracking line-search (Wright and Nocedal, 1999) to find a good step size. These steps are detailed in Algorithm A.1.\n\nGradient relative to u_k and v_k The gradient of E relative to {u_k}_k and {v_k}_k can be computed using the chain rule. First, we compute the gradient relative to a full atom D_k = u_k v_k⊤ ∈ ℝ^{P×L}:\n\n    ∇_{D_k} E = Σ_{n=1}^N ↼(z_k^n) ∗ ( X^n − Σ_{l=1}^K z_l^n ∗ D_l ) = Φ_k − Σ_{l=1}^K Ψ_{k,l} ∗ D_l ,   (9)\n\nwhere we reordered this expression to define Φ_k ∈ ℝ^{P×L} and Ψ_{k,l} ∈ ℝ^{2L−1}. These terms are both constant during a D-step and can thus be precomputed to accelerate the computation of the gradients and the cost function E. We detail these computations in the supplementary materials (see Section A.1). Computational complexities are detailed in Table 1. Note that the dependence in T is present only in the precomputations, which makes the following iterations very fast. Without precomputations, the complexity of each gradient computation in the D-step would be O(NKTLP).\n\nFigure 1: Comparison of state-of-the-art univariate (a, b) and multivariate (c, d) methods with our approach. (a) Convergence plot with the objective function relative to the obtained minimum, as a function of computational time, for λ = 10. (b) Time taken to reach a relative precision of 10⁻³, for different regularization parameters λ. (c, d) Same as (a, b) in the multivariate setting P = 5.\n\n3.3 Initialization\n\nThe activations sub-problem (Z-step) is regularized with an ℓ1-norm, which induces sparsity: the higher the regularization parameter λ, the higher the sparsity. Therefore, there exists a value λmax above which the sub-problem solution is always zero (Hastie et al., 2015). As λmax depends on the atoms D_k and on the signals X^n, its value changes after each D-step. In particular, its value might change a lot between the initialization and the first D-step. This is problematic since we cannot use a regularization λ above this initial λmax, even though the following λmax might be higher.\nThe standard strategy to initialize CSC methods is to generate random atoms with Gaussian white noise. However, as these atoms generally correlate poorly with the signals, the initial λmax is low compared to the following ones. For example, on the MEG dataset described later on, we found that the initial λmax is about 1/3 of the following ones in the univariate case, with L = 32.
In the multivariate case, it is even more problematic: with P = 204, the initial λmax can be as low as 1/20 of the following ones.\nTo fix this problem, we propose to initialize the dictionary with random chunks of the signal, projecting each chunk on a rank-1 approximation using singular value decomposition. We noticed on the MEG dataset that the initial λmax was then about the same value as the following ones, which enables the use of higher regularization parameters. We used this scheme in all our experiments.\n\n4 Experiments\n\nAll numerical experiments were run using Python (Python Software Foundation, 2017) and our code is publicly available online at https://alphacsc.github.io/.\n\nFigure 2: Timings of Z and D updates when varying the number of channels P, for (a) λ = 0.005λmax and (b) λ = 0.001λmax. The scaling is sublinear with P, due to the precomputation steps in the optimization.\n\nSpeed performance To illustrate the performance of our optimization strategy, we monitored its convergence speed on a real MEG dataset. The somatosensory dataset from the MNE software (Gramfort et al., 2013, 2014) contains responses to median nerve stimulation. We consider only gradiometer channels and we used the following parameters: T = 134 700, N = 2, K = 8, and L = 128.\nFirst we compared our strategy against three state-of-the-art univariate CSC solvers available online. The first was developed by Garcia-Cardona and Wohlberg (2017) and is based on ADMM. The second and third were developed by Jas et al. (2017), and are respectively based on FISTA and L-BFGS.
All solvers shared the same objective function, but as the problem is non-convex, the solvers are not guaranteed to reach the same local minima, even though we started from the same initial settings. Hence, for a fair comparison, we computed the convergence curves relative to each local minimum, and averaged them over 10 different initializations. The results, presented in Figure 1(a, b), demonstrate the competitiveness of our method, for reasonable choices of λ. Indeed, a higher regularization parameter leads to sparser activations z_k^n, on which LGCD is particularly efficient.\nThen, we also compared our method against a multivariate ADMM solver developed by Wohlberg (2016a). As this solver was quite slow on these long signals, we limited our experiments to P = 5 channels. The results, presented in Figure 1(c, d), show that our method is faster than the competing method for large λ. More benchmarks are available in the supplementary materials.\n\nScaling with the number of channels The multivariate model involves an extra dimension P but its impact on the computational complexity of our solver is limited. Figure 2 shows the average running times of the Z-step and the D-step. Timings are normalized w.r.t. the timings for a single channel. The running times are computed using the same signals from the somatosensory dataset, with the following parameters: T = 26 940, N = 10, K = 2, L = 128. We can see that the scaling of these operations is sub-linear in P. For the Z-step, only the initial computation of β_k and of the constants ↼D_k ∗̃ D_l depends linearly on P, so that the complexity increase is limited compared to the complexity of solving the optimization problem (4). For the D-step, the scaling to compute the gradients is linear with P.
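As an illustration of these precomputations, here is a univariate, single-trial sketch of the gradient (9) computed from Φ and Ψ (names are ours; plain loops are used for the Ψ term so that the indexing of (9) stays visible, and we assume T̃ ≥ L):

```python
import numpy as np

def dstep_gradient(x, z, d):
    """Gradient (9) of the D-step fit, as Phi_k - sum_l Psi_{k,l} * d_l.
    Univariate, single-trial sketch: x (T,), z (K, T - L + 1), d (K, L)."""
    K, L = d.shape
    T_valid = z.shape[1]  # assumes T_valid >= L
    # Phi_k[tau] = sum_t z_k[t] x[t + tau]: the only T-dependent term
    phi = np.array([np.correlate(x, z[k], mode="valid") for k in range(K)])
    # Psi_{k,l}[delta] = sum_t z_k[t] z_l[t - delta], for |delta| < L
    psi = np.zeros((K, K, 2 * L - 1))
    for k in range(K):
        for l in range(K):
            full = np.correlate(z[k], z[l], mode="full")
            psi[k, l] = full[T_valid - L:T_valid + L - 1]
    # grad_k[tau] = Phi_k[tau] - sum_{l, delta} Psi_{k,l}[delta] d_l[delta + tau]
    grad = phi.copy()
    for k in range(K):
        for l in range(K):
            for tau in range(L):
                for delta in range(-tau, L - tau):
                    grad[k, tau] -= psi[k, l, delta + L - 1] * d[l, delta + tau]
    return grad
```

Once Φ and Ψ are cached, re-evaluating the gradient after an atom update no longer touches the full signal, which is what makes the inner D-step iterations cheap.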
However, the most expensive operation here is the computation of the constants Ψ_k, which does not depend on P.\n\nFinding patterns in low SNR signals Since the multivariate model has access to more data, we would expect it to perform better than the univariate model, especially for low SNR signals. To demonstrate this, we compare the two models when varying the number of channels P and the SNR of the data. The original dictionary contains two temporal patterns, a square and a triangle, presented in Figure 3(a). The spatial maps are designed with a sine and a cosine, and the first channel's amplitude is forced to 1 to make sure both atoms are present even with only one channel. The signals are obtained by convolving the atoms with activation signals z_k^n, where the activation locations are sampled uniformly in ⟦1, T̃⟧ × ⟦1, K⟧ with 5% non-zero activations, and the amplitudes are uniformly sampled in [0, 1]. Then, a Gaussian white noise with variance σ is added to the signal. We fixed N = 100, L = 64 and T̃ = 640 for our simulated signals.\nWe can see in Figure 3(a) the temporal patterns recovered for σ = 10⁻³ using only one channel and using 5 channels. While the patterns recovered with one channel are very noisy, the multivariate model with rank-1 constraint recovers the original atoms accurately. This can be expected as the univariate model is ill-defined in this situation, where some atoms are superimposed. For the rank-1 model, as the atoms have different spatial maps, the problem is easier. Then, we evaluate the learned temporal atoms. Due to permutation and sign ambiguity, we compute the ℓ2-norm of the difference between the temporal patterns v̂_k and the ground truths, v_k or −v_k, over all permutations s ∈ S(K), i.e.,\n\n    loss(v̂) = min_{s ∈ S(K)} Σ_{k=1}^K min( ‖v̂_{s(k)} − v_k‖₂² , ‖v̂_{s(k)} + v_k‖₂² ) .   (10)\n\nFigure 3: (a) Patterns recovered with P = 1 and P = 5. The signals were generated with the two simulated temporal patterns and with σ = 10⁻³. (b) Evolution of the recovery loss with σ for different values of P (lower is better). Using more channels improves the recovery of the original patterns.\n\nFigure 4: Atom revealed using the MNE somatosensory data: (a) temporal waveform, (b) spatial pattern, (c) power spectral density (dB), (d) dipole fit. Note the non-sinusoidal comb shape of the mu rhythm. This atom has been manually selected, and other atoms are presented in Figure B.4.\n\nMultiple values of λ were tested and the best loss is reported in Figure 3(b) for varying noise levels σ. We observe that, independently of the noise level, the multivariate rank-1 model outperforms the univariate one. This is true even for good SNR, as using multiple channels disambiguates the separation of overlapped patterns.\n\nExamples of atoms in real MEG signals We show the results of our algorithm on experimental data, using the MNE somatosensory dataset (Gramfort et al., 2013, 2014). This dataset contains MEG recordings of one patient receiving median nerve stimulations. Here we first extract N = 103 trials from the data. Each trial lasts 6 s with a sampling frequency of 150 Hz (T = 900). We selected only gradiometer channels, leading to P = 204 channels. The signals were notch-filtered to remove the power-line noise, and high-pass filtered at 2 Hz to remove low-frequency drift artifacts, which contribute a lot to the variance of the raw signals. We learned K = 40
We learned K = 40 atoms with L = 150 using a rank-1 multivariate CSC model, with a regularization λ = 0.2 λmax. Figure 4(a) shows a recovered non-sinusoidal brain rhythm which resembles the well-known mu-rhythm. The mu-rhythm has been implicated in motor-related activity (Hari, 2006) and is centered around 9–11 Hz. Indeed, while the power is concentrated in the same frequency band as the alpha rhythm, it has a very different spatial topography (Figure 4(b)). In Figure 4(c), the power spectral density (PSD) shows two components of the mu-rhythm: one at around 9 Hz, and a harmonic at 18 Hz, as previously reported in (Hari, 2006). Based on our analysis, it is clear that the 18 Hz component is simply a harmonic of the mu-rhythm, even though a Fourier-based analysis could lead us to falsely conclude that the data contained beta-rhythms. Finally, due to the rank-1 nature of our atoms, it is straightforward to fit an equivalent current dipole (Tuomisto et al., 1983) to interpret the origin of the signal. Figure 4(d) shows that the atom does indeed localize in the primary somatosensory cortex, the so-called S1 region, with a 59.3% goodness of fit. For results on more MEG datasets, see Section B.2. It notably includes mu-shaped atoms from S2.

5 Conclusion

Many neuroscientific debates today are centered around the morphology of the signals under consideration. For instance, are alpha-rhythms asymmetric (Mazaheri and Jensen, 2008)? Are frequency-specific patterns the result of sustained oscillations or transient bursts (van Ede et al., 2018)? In this paper, we presented a multivariate extension to the CSC problem applied to MEG data to help answer such questions.
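One such pitfall, noted in the previous section, is easy to reproduce on synthetic data: any non-sinusoidal periodic waveform at 9 Hz necessarily places power at integer multiples of 9 Hz, so a purely Fourier-based analysis reports an 18 Hz "beta" peak even though there is no independent beta generator. In this minimal sketch a sawtooth stands in for a generic non-sinusoidal waveform; it is not the actual mu shape.

```python
import numpy as np
from scipy import signal

sfreq = 150.0                      # same sampling rate as the MEG example
t = np.arange(int(20 * sfreq)) / sfreq
# A non-sinusoidal periodic waveform at 9 Hz (generic stand-in, not mu).
x = signal.sawtooth(2 * np.pi * 9 * t)

# Welch PSD estimate of the waveform.
freqs, psd = signal.welch(x, fs=sfreq, nperseg=1024)
peak = freqs[np.argmax(psd)]       # fundamental, near 9 Hz

def band_power(f_lo, f_hi):
    """Total PSD mass in the band [f_lo, f_hi)."""
    m = (freqs >= f_lo) & (freqs < f_hi)
    return psd[m].sum()

# The band around the 18 Hz harmonic dominates a harmonic-free band nearby,
# mimicking a spurious "beta" component.
beta_like = band_power(17, 19) > band_power(14, 16)
```

A narrow-band filter or spectral peak detector applied at 18 Hz would flag this signal, which is exactly why learning the full waveform shape, as CSC does, is informative.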
In the original CSC formulation, the signal is expressed as a convolution of atoms and their activations. Our method extends this to the case of multiple channels and imposes a rank-1 constraint on the atoms to account for the instantaneous propagation of electromagnetic fields. We demonstrate the usefulness of our method on publicly available multivariate MEG data. Not only are we able to recover neurologically plausible atoms, we are also able to find temporal waveforms which are non-sinusoidal. Empirical evaluations show that our solvers are significantly faster than existing CSC methods, even in the univariate case (single channel). The algorithm scales sublinearly with the number of channels, which means it can be employed even for dense sensor arrays with 200–300 sensors, leading to better estimation of the patterns and their origin in the brain.

Acknowledgment

This work was supported by the ERC Starting Grant SLAB ERC-YStG-676943 and by the ANR THALAMEEG ANR-14-NEUC-0002-01.

References

Q. Barthélemy, A. Larue, A. Mayoue, D. Mercier, and J. I. Mars. Shift & 2D rotation invariant sparse coding for multivariate signals. IEEE Transactions on Signal Processing, 60(4):1597–1611, 2012.

Q. Barthélemy, C. Gouy-Pailler, Y. Isaac, A. Souloumiac, A. Larue, and J. I. Mars. Multivariate temporal dictionary learning for EEG. J. Neurosci. Methods, 215(1):19–28, 2013.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.

H. Bristow, A. Eriksson, and S. Lucey. Fast convolutional sparse coding.
In Computer Vision and Pattern Recognition (CVPR), pages 391–398, 2013.

A. J. Brockmeier and J. C. Príncipe. Learning recurrent waveforms within EEGs. IEEE Transactions on Biomedical Engineering, 63(1):43–54, 2016.

G. Buzsaki. Rhythms of the Brain. Oxford University Press, 2006.

R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.

R. Chalasani, J. C. Principe, and N. Ramakrishnan. A fast proximal method for convolutional sparse coding. In International Joint Conference on Neural Networks (IJCNN), pages 1–5, 2013. ISBN 9781467361293.

S. R. Cole and B. Voytek. Brain oscillations and the importance of waveform shape. Trends Cogn. Sci., 2017.

S. R. Cole and B. Voytek. Cycle-by-cycle analysis of neural oscillations. preprint bioRxiv, 2018.

S. R. Cole, R. van der Meij, E. J. Peterson, C. de Hemptinne, P. A. Starr, and B. Voytek. Nonsinusoidal beta oscillations reflect cortical pathophysiology in Parkinson's disease. Journal of Neuroscience, 37(18):4830–4840, 2017.

P. J. Durka, A. Matysiak, E. M. Montes, P. V. Sosa, and K. J. Blinowska. Multichannel matching pursuit and EEG inverse solutions. J. Neurosci. Methods, 148(1):49–59, 2005.

J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

C. Garcia-Cardona and B. Wohlberg. Convolutional dictionary learning. arXiv preprint arXiv:1709.02893, 2017.

B. Gips, A. Bahramisharif, E. Lowet, M. Roberts, P. de Weerd, O. Jensen, and J. van der Eerden. Discovering recurring patterns in electrophysiological recordings. J. Neurosci. Methods, 275:66–79, 2017.

A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, R. Goj, M. Jas, T. Brooks, L. Parkkonen, and M. S.
H\u00e4m\u00e4l\u00e4inen. MEG and EEG data analysis with MNE-Python.\nFrontiers in neuroscience, 7, 2013.\n\nA. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and\nM. S. H\u00e4m\u00e4l\u00e4inen. MNE software for processing MEG and EEG data. Neuroimage, 86:446\u2013460,\n2014.\n\nR. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classi\ufb01cation.\nIn 23rd Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), pages 149\u2013158. AUAI Press,\n2007. ISBN 0-9749039-3-0.\n\nR. Hari. Action\u2013perception connection and the cortical mu rhythm. Progress in brain research, 159:\n\n253\u2013260, 2006.\n\nR. Hari and A. Puce. MEG-EEG Primer. Oxford University Press, 2017.\n\nT. Hastie, R. Tibshirani, and M. J. Wainwright. Statistical Learning with Sparsity. CRC Press, 2015.\n\nF. Heide, W. Heidrich, and G. Wetzstein. Fast and \ufb02exible convolutional sparse coding. In Computer\n\nVision and Pattern Recognition (CVPR), pages 5135\u20135143. IEEE, 2015.\n\nS. Hitziger, M. Clerc, S. Saillet, C. Benar, and T. Papadopoulo. Adaptive Waveform Learning: A\nFramework for Modeling Variability in Neurophysiological Signals. IEEE Transactions on Signal\nProcessing, 2017.\n\nM. Jas, T. Dupr\u00e9 La Tour, U. \u00b8Sim\u00b8sekli, and A. Gramfort. Learning the morphology of brain signals\nusing alpha-stable convolutional sparse coding. In Advances in Neural Information Processing\nSystems (NIPS), pages 1\u201315, 2017.\n\nS. R. Jones. When brain rhythms aren\u2019t \u2018rhythmic\u2019: implication for their mechanisms and meaning.\n\nCurr. Opin. Neurobiol., 40:72\u201380, 2016.\n\nP. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval. MoTIF: an ef\ufb01cient algorithm for learning\ntranslation invariant dictionaries. In Acoustics, Speech and Signal Processing (ICASSP), volume 5.\nIEEE, 2006.\n\nK. Kavukcuoglu, P. Sermanet, Y-L. Boureau, K. Gregor, M. Mathieu, and Y. Le Cun. 
Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems (NIPS), pages 1090–1098, 2010.

F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. Stich, and M. Jaggi. On matching pursuit and coordinate descent. In International Conference on Machine Learning (ICML), pages 3204–3213, 2018.

B. Mailhé, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst. Shift-invariant dictionary learning for sparse representations: extending K-SVD. In 16th Eur. Signal Process. Conf., pages 1–5. IEEE, 2008.

A. Mazaheri and O. Jensen. Asymmetric amplitude modulations of brain oscillations generate slow evoked responses. The Journal of Neuroscience, 28(31):7781–7787, 2008.

T. Moreau, L. Oudre, and N. Vayatis. DICOD: Distributed convolutional sparse coding. In International Conference on Machine Learning (ICML), 2018.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2010.

J. Nutini, M. Schmidt, I. H. Laradji, M. P. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning (ICML), pages 1632–1641, 2015.

S. Osher and Y. Li. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.

M. Pachitariu, A. M. Packer, N. Pettit, H. Dalgleish, M. Hausser, and M. Sahani. Extracting regions of interest from biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems (NIPS), pages 1745–1753, 2013.

Python Software Foundation. Python Language Reference, version 3.6. http://python.org/, 2017.

P. Richtárik and M. Takáč.
Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

M. Šorel and F. Šroubek. Fast convolutional sparse coding using matrix inversion lemma. Digital Signal Processing, 2016.

T. Tuomisto, R. Hari, T. Katila, T. Poutanen, and T. Varpula. Studies of auditory evoked magnetic and electric responses: Modality specificity and modelling. Il Nuovo Cimento D, 2(2):471–483, 1983.

F. van Ede, A. J. Quinn, M. W. Woolrich, and A. C. Nobre. Neural oscillations: Sustained rhythms or transient burst-events? Trends in Neurosciences, 2018.

B. Wohlberg. Convolutional sparse representation of color images. In IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), pages 57–60, 2016a.

B. Wohlberg. Efficient algorithms for convolutional sparse representations. IEEE Transactions on Image Processing, 25(1):301–315, 2016b.

S. Wright and J. Nocedal. Numerical Optimization, volume 35. Springer Science, 1999.

M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), pages 2528–2535. IEEE, 2010.