{"title": "Time-dependent spatially varying graphical models, with application to brain fMRI data analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 5832, "page_last": 5840, "abstract": "In this work, we present an additive model for space-time data that splits the data into a temporally correlated component and a spatially correlated component. We model the spatially correlated portion using a time-varying Gaussian graphical model. Under assumptions on the smoothness of changes in covariance matrices, we derive strong single sample convergence results, confirming our ability to estimate meaningful graphical structures as they evolve over time. We apply our methodology to the discovery of time-varying spatial structures in human brain fMRI signals.", "full_text": "Time-dependent spatially varying graphical models,\n\nwith application to brain fMRI data analysis\n\nKristjan Greenewald\nDepartment of Statistics\n\nHarvard University\n\nSeyoung Park\n\nDepartment of Biostatistics\n\nYale University\n\nShuheng Zhou\n\nDepartment of Statistics\nUniversity of Michigan\n\nAlexander Giessing\n\nDepartment of Statistics\nUniversity of Michigan\n\nAbstract\n\nIn this work, we present an additive model for space-time data that splits the data\ninto a temporally correlated component and a spatially correlated component. We\nmodel the spatially correlated portion using a time-varying Gaussian graphical\nmodel. Under assumptions on the smoothness of changes in covariance matrices,\nwe derive strong single sample convergence results, con\ufb01rming our ability to es-\ntimate meaningful graphical structures as they evolve over time. We apply our\nmethodology to the discovery of time-varying spatial structures in human brain\nfMRI signals.\n\n1\n\nIntroduction\n\nLearning structured models of high-dimensional datasets from relatively few training samples is an\nimportant task in statistics and machine learning. 
Spatiotemporal data, in the form of n variables\nevolving over m time points, often fits this regime due to the high (mn) dimension and potential\ndifficulty in obtaining independent samples. In this work, we develop a nonparametric framework\nfor estimating time varying spatiotemporal graphical structure using an \u21131 regularization method.\nThe covariance of a spatiotemporal array X = [x1, . . . , xm] \u2208 Rn\u00d7m is an mn by mn matrix\n\n$$\\Sigma = \\mathrm{Cov}\\big[\\mathrm{vec}([x_1, \\ldots, x_m])\\big], \\tag{1}$$\n\nwhere xi \u2208 Rn, i = 1, . . . , m denotes the n variables or features of interest at the ith time point.\nEven for moderately large m and n the number of degrees of freedom (mn(mn + 1)/2) in the\ncovariance matrix can greatly exceed the number of training samples available for estimation. One\nway to handle this problem is to introduce structure and/or sparsity, thus reducing the number of\nparameters to be estimated. Spatiotemporal data is often highly structured, hence the design of\nestimators that model and exploit appropriate covariance structure can provide significant gains.\nWe aim to develop a nonparametric framework for estimating time varying graphical structure for\nmatrix-variate distributions. Associated with each xi \u2208 Rn is its undirected graph G(i). Under the\nassumption that the law L(xi) of xi changes smoothly, Zhou et al. (2010) introduced a nonparametric\nmethod to estimate the graph sequence G(1), G(2), . . . assuming that the xi \u223c N (0, B(i/m))\nare independent, where B(t) is a smooth function over t \u2208 [0, 1] and we have mapped the indices i\nonto points t = i/m on the interval [0, 1]. In this work, we are interested in the general time series\nmodel where the xi, i = 1, . . . , m are dependent and the B\u22121(t) graphs change over time.\nOne way to introduce dependency into the xi is to study the following covariance structure. 
Let\nA = (aij) \u2208 Rm\u00d7m, B(t) = (bij(t)) \u2208 Rn\u00d7n, t \u2208 [0, 1] be symmetric positive definite covariance\nmatrices. Let diag(v), v = (v1, . . . , vm) be the diagonal matrix with elements vi along the diagonal.\nConsider the random matrix X with row vectors yj corresponding to measurements at the jth spatial\nlocation, and columns xi corresponding to the m measurements at times i/m, i = 1, . . . , m:\n\u2200j = 1, . . . , n, yj \u223c Nm(0, Aj) where Aj = A + diag(bjj(1), . . . , bjj(m)), and (2)\n\u2200i = 1, . . . , m, xi \u223c Nn(0, aiiI + B(i/m)) where B(t) changes smoothly over t \u2208 [0, 1]; (3)\nthat is, the covariance of the column vectors xi corresponding to each time point changes smoothly\nwith time (if aii is a smooth function of i). This provides ultimate flexibility in parameterizing\nspatial correlations, for example, across different geographical scales through variograms (Cressie,\n2015), each of which is allowed to change over seasons. Observe that while we have used the normal\ndistribution here for simplicity, all our results hold for general subgaussian distributions.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe model (3) also allows modeling the dynamic gene regulatory and brain connectivity networks\nwith topological (e.g., Erd\u0151s-R\u00e9nyi random graph, small-world graph, or modular graphs) constraints\nvia degree specifications as well as spatial constraints in the set of {B(t), t = 1, 2, . . .}.\nWhen A = 0, we return to the case of Zhou et al. (2010) where there is no temporal correlation, i.e.\ny1, . . . , yn are assumed to be independent.\nWe propose methodologies to study the model as constructed in (2) and (3). Building upon and\nextending techniques of Zhou et al. (2010) and Rudelson & Zhou (2017); Zhou (2014), we aim to\ndesign estimators to estimate graph sequence G(1), G(2), . . 
., where the temporal graph H and spatial\ngraphs G(i) are determined by the zeros of A\u22121 and B(t)\u22121. Intuitively, the temporal correlation\nand spatial correlation are modeled as two additive processes. The covariance of X is now\n\n$$\\mathrm{Cov}[\\mathrm{vec}(X)] = \\Sigma = A \\otimes I_n + \\sum_{i=1}^{m} (e_i e_i^T) \\otimes B(i/m), \\tag{4}$$\n\nwhere ei \u2208 Rm, \u2200i are the m-dimensional standard basis vectors.\nIn the context of this model, we aim to develop a nonparametric method for estimating time varying\ngraphical structure for matrix-variate normal distributions using an \u21131 regularization method. We\nwill show that, as long as the covariances change smoothly over time, we can estimate the spatial\nand temporal covariance matrices well in terms of predictive risk even when n, m are both large. We\nwill investigate the following theoretical properties: (a) consistency and rate of convergence in the\noperator and Frobenius norm of the covariance matrices and their inverse, (b) large deviation results\nfor covariance matrices for simultaneously correlated and non-identically distributed observations,\nand (c) conditions that guarantee smoothness of the covariances.\nBesides the model (4), another well-studied option for modeling spatio-temporal covariances \u03a3 is\nto introduce structure via the Kronecker product of smaller symmetric positive definite matrices, i.e.\n\u03a3 = A \u2297 B. The Kronecker product model, however, is restrictive when applied to general spatio-temporal\ncovariances as it assumes the covariance is separable (disallowing such simple scenarios\nas the presence of additive noise), and does not allow for time varying spatial structure. When used\nto estimate covariances not following Kronecker product structure, many estimators will respond to\nthe model mismatch by giving ill-conditioned estimates (Greenewald & Hero, 2015).\nHuman neuroscience data is a notable application where time-varying structure emerges. 
In neuroscience,\none must take into account temporal correlations as well as spatial correlations, which\nreflect the connectivity formed by the neural pathways. It is conceivable that the brain connectivity\ngraph will change over a sufficiently long period of measurements. For example, as a child learns to\nassociate symbols in the environment, certain pathways within the brain are reinforced. When they\nbegin to associate images with words, the correlation between a particular sound like Mommy and\nthe sight of a face becomes stronger and forms a well worn pathway. On the other hand, long term\nnon-use of connections between sensory and motor neurons can result in a loss of the pathway.\n\n1.1 Datasets and Related Work\n\nEstimating graphical models (connectomes) in fMRI data using sparse inverse covariance techniques\nhas enjoyed wide application (Huang et al., 2010; Varoquaux et al., 2010; Narayan et al., 2015; Kim\net al., 2015). However, recent research has only now begun exploring observed phenomena such\nas temporal correlations and additive temporally correlated noise (Chen et al., 2015; Arbabshirani\net al., 2014; Kim et al., 2015; Qiu et al., 2016), and time-varying dynamics and graphical models\n(connectomes) (Calhoun et al., 2014; Liu & Duyn, 2013; Chang & Glover, 2010; Chen et al., 2015).\n\nWe consider the ADHD-200 fMRI dataset (Biswal et al., 2010), and study resting state fMRIs for\na variety of healthy patients in the dataset at different stages of development. 
Using our methods,\nwe are able to directly estimate age-varying graphical models across brain regions, chronicling the\ndevelopment of brain structure throughout childhood.\nSeveral models have emerged to generalize the Kronecker product model to allow it to model more\nrealistic covariances while still maintaining many of the gains associated with Kronecker structure.\nKronecker PCA, discussed in Tsiligkaridis & Hero (2013), approximates the covariance matrix using\na sum of Kronecker products. An algorithm (Permuted Rank-penalized Least Squares (PRLS)) for\nfitting the KronPCA model to a measured sample covariance matrix was introduced in (Tsiligkaridis\n& Hero, 2013) and was shown to have strong high dimensional MSE performance guarantees. From\na modeling perspective, the strengths of Kronecker PCA lie in its ability to handle \u201cnear separable\u201d\ncovariances and a variety of time-varying effects. While the Kronecker PCA model is very general,\nso far incorporation of sparsity in the inverse covariance has not been forthcoming. This motivates\nour introduction of the sparse model (4), which we demonstrate experimentally in Section 10 of the\nsupplement to enjoy better statistical convergence.\nCarvalho et al. (2007) proposed a Bayesian additive time-varying graphical model, where the\nspatially-correlated noise term is a parameter of the driving noise covariance in a temporal dynamical\nmodel. Unlike our method, they did not estimate the temporal correlation, instead requiring the\ndynamical model to be pre-set. Our proposed method has wholly independent spatial and temporal\nmodels, directly estimating an inverse covariance graphical model for the temporal relationships of\nthe data. This allows for a much richer temporal model and increases its applicability.\nIn the context of fMRI, the work of Qiu et al. 
(2016) used a similar kernel-weighted estimator for the\nspatial covariance, however they modeled the temporal covariance with a simple AR-1 model which\nthey did not estimate, and which their estimator did not attempt to remove. Similarly, Monti et al. (2014)\nused a smoothed kernel estimator for B\u22121(t) with a penalty to further promote smoothness, but did\nnot model the temporal correlations. Our additive model allows the direct estimation of the temporal\nbehavior, revealing a richer structure than a simple AR-1, and allowing for effective denoising of the\ndata, and hence better estimation of the spatial graph structures.\n\n2 The model and method\nLet the elements of A \u227b 0 and B(t) be denoted as [A]ij := aij and [B(t)]ij := bij(t), t \u2208 [0, 1].\nSimilar to the setting in (Zhou et al., 2010), we assume that bij(t) is a smooth function of time t\nfor all i, j, and assume that B\u22121(t) is sparse. Furthermore, we suppose that m \u226b n, corresponding\nto there being more time points than spatial variables. For a random variable Y , the subgaussian\nnorm of Y , $\\|Y\\|_{\\psi_2}$, is defined via $\\|Y\\|_{\\psi_2} = \\sup_{p \\geq 1} p^{-1/2} (E|Y|^p)^{1/p}$. Note that if E[Y ] = 0,\nwe also have $E[\\exp(tY)] \\leq \\exp(C t^2 \\|Y\\|_{\\psi_2}^2)$ \u2200t \u2208 R. Define an n \u00d7 m random matrix Z with\nindependent, zero mean entries Zij satisfying $E[Z_{ij}^2] = 1$ and having subgaussian norm $\\|Z_{ij}\\|_{\\psi_2} \\leq K$.\nMatrices Z1, Z2 denote independent copies of Z. We now write an additive generative model\nfor subgaussian data X \u2208 Rn\u00d7m having covariance given in (4). Let\n\n$$X = Z_1 A^{1/2} + Z_B, \\tag{5}$$\n\nwhere $Z_B = [B(1/m)^{1/2} Z_2 e_1, \\ldots, B(i/m)^{1/2} Z_2 e_i, \\ldots, B(1)^{1/2} Z_2 e_m]$, and ei \u2208 Rm, \u2200i are the\nm-dimensional standard basis vectors. 
Then the covariance is\n\n$$\\begin{aligned} \\Sigma = \\mathrm{Cov}[\\mathrm{vec}(X)] &= \\mathrm{Cov}[\\mathrm{vec}(Z_1 A^{1/2})] + \\mathrm{Cov}[\\mathrm{vec}(Z_B)] \\\\ &= \\mathrm{Cov}[\\mathrm{vec}(Z_1 A^{1/2})] + \\sum_{i=1}^{m} (e_i e_i^T) \\otimes \\mathrm{Cov}[B(i/m)^{1/2} Z_2 e_i] \\\\ &= A \\otimes I_n + \\sum_{i=1}^{m} (e_i e_i^T) \\otimes B(i/m). \\end{aligned}$$\n\nThus (5) is a generative model for data following the covariance model (4).\n\n2.1 Estimators\n\nAs in Rudelson & Zhou (2017), we can exploit the large-m convergence of $Z_1 A Z_1^T$ to tr(A)I to\nproject out the A part and create an estimator for the B covariances. As B(t) is time-varying, we\nuse a weighted average across time to create local estimators of spatial covariance matrix B(t).\nIt is often assumed that knowledge of the trace of one of the factors is available a priori. For example,\nthe spatial signal variance may be known and time invariant, corresponding to tr(B(t)) being\nknown. Alternatively, the temporal component variance may be constant and known, corresponding\nto tr(A) being known. In our analysis below, we suppose that tr(A) is known or otherwise estimated\n(similar results hold when tr(B(t)) is known). For simplicity in stating the trace estimators, in what\nfollows we suppose that tr(B(t)) = tr(B) is constant, and without loss of generality that the data\nhas been normalized such that diagonal elements Aii are constant over i.\nAs B(t) is smoothly varying over time, the estimate at time t0 should depend strongly on the time\nsamples close to t0, and less on the samples further from t0. For any time of interest t0, we thus\nconstruct a weighted estimator using a weight vector wi(t0) such that $\\sum_{t=1}^{m} w_t(t_0) = 1$. 
Our\nweighted, unstructured sample-based estimator for B(t0) is then given by\n\n$$\\hat{S}_m(t_0) := \\sum_{i=1}^{m} w_i(t_0) \\left( x_i x_i^T - \\frac{\\mathrm{tr}(A)}{m} I_n \\right), \\quad \\text{where } w_i(t_0) = \\frac{1}{mh} K\\left( \\frac{i/m - t_0}{h} \\right), \\tag{6}$$\n\nand we have considered the class of weight vectors wi(t0) arising from a symmetric nonnegative\nkernel function K with compact support [0, 1] and bandwidth determined by parameter h. Minor\nregularity assumptions on K are listed in the supplement. For kernels such as the Gaussian\nkernel, this wt(t0) will result in samples close to t0 being highly weighted, with the \u201cweight decay\u201d\naway from t0 scaling with the bandwidth h. A wide bandwidth will be appropriate for slowly-varying\nB(t), and a narrow bandwidth for quickly-varying B(t).\n\nTo enforce sparsity in the estimator for B\u22121(t0), we substitute $\\hat{S}_m(t_0)$ into the widely-used GLasso\nobjective function, resulting in a penalized estimator for B(t0) with regularization parameter \u03bbm\n\n$$\\hat{B}_\\lambda(t_0) := \\arg\\min_{B_\\lambda \\succ 0} \\; \\mathrm{tr}\\left( B_\\lambda^{-1} \\hat{S}_m(t_0) \\right) + \\log |B_\\lambda| + \\lambda_m |B_\\lambda^{-1}|_1. \\tag{7}$$\n\nFor a matrix B, we let $|B|_1 := \\sum_{ij} |B_{ij}|$. Increasing the parameter \u03bbm gives an increasingly sparse\n$\\hat{B}_\\lambda^{-1}(t_0)$. Having formed an estimator for B, we can now form a similar estimator for A. Under the\nconstant-trace assumption, we construct an estimator for tr(B)\n\n$$\\hat{\\mathrm{tr}}(B) = \\sum_{i=1}^{m} w_i \\|X_i\\|_2^2 - \\frac{n}{m} \\mathrm{tr}(A), \\quad \\text{with } w_i = \\frac{1}{m}. 
\\tag{8}$$\n\nFor a time-varying trace tr(B(t)), use the time-averaged kernel\n\n$$\\hat{\\mathrm{tr}}(B(t_0)) = \\sum_{i=1}^{m} w_i(t_0) \\|X_i\\|_2^2 - \\frac{n}{m} \\mathrm{tr}(A), \\quad \\text{with } w_i(t_0) = \\frac{1}{mh} K\\left( \\frac{i/m - t_0}{h} \\right). \\tag{9}$$\n\nIn the future we will derive rates for the time varying case by choosing an optimal h. 
These estimators\nallow us to construct a sample covariance matrix for A:\n\n$$\\tilde{A} = \\frac{1}{n} X^T X - \\frac{1}{n} \\mathrm{diag}\\{ \\hat{\\mathrm{tr}}(B(1/m)), \\ldots, \\hat{\\mathrm{tr}}(B(1)) \\}. \\tag{10}$$\n\nWe (similarly to B(t)) apply the GLasso approach to \u02dcA. Note that with m > n, \u02dcA has negative\neigenvalues since $\\lambda_{\\min}\\left( \\frac{1}{n} X^T X \\right) = 0$. We obtain a positive semidefinite matrix \u02dcA+ as:\n\n$$\\tilde{A}_+ = \\arg\\min_{A \\succeq 0} \\|\\tilde{A} - A\\|_{\\max}. \\tag{11}$$\n\nWe use the alternating direction method of multipliers (ADMM) to solve (11) as in Boyd et al. (2011),\nand prove that this retains a tight elementwise error bound. Note that while we chose this method\nof obtaining a positive semidefinite \u02dcA+ for its simplicity, there may exist other possible projections;\nthe exact method is not critical to our overall Kronecker sum approach. In fact, if the GLasso is not\nused, it is not necessary to do the projection (11), as the elementwise bounds also hold for \u02dcA.\nWe provide a regularized estimator for the correlation matrices \u03c1(A) = diag(A)\u22121/2Adiag(A)\u22121/2\nusing the positive semidefinite \u02dcA+ as the initial input to the GLasso problem\n\n$$\\hat{\\rho}_\\lambda(A) = \\arg\\min_{A_\\rho \\succ 0} \\; \\mathrm{tr}\\left( A_\\rho^{-1} \\rho(\\tilde{A}_+) \\right) + \\log |A_\\rho| + \\lambda_n |A_\\rho^{-1}|_{1,\\mathrm{off}}, \\tag{12}$$\n\nwhere \u03bbn > 0 is a regularization parameter and | \u00b7 |1,off is the L1 norm on the offdiagonal.\nForm the estimate for A as $\\frac{\\mathrm{tr}(A)}{m} \\hat{\\rho}_\\lambda(A)$. Observe that our method has three tuning parameters, two if\ntr(A) is known or can be estimated. If tr(A) is not known, we present several methods to choose it\nin Section 7.1 in the supplement. 
Once tr(A) is chosen, the estimators (7) and (12) for B(t) and A\nrespectively do not depend on each other, allowing \u03bbm and \u03bbn to be tuned independently.\n\n3 Statistical convergence\n\nWe first bound the estimation error for the time-varying B(t). Since \u02c6B(t) is based on a kernel-smoothed\nsample covariance, \u02c6B(t) is a biased estimator, with the bias depending on the kernel width\nand the smoothness of B(t). In Section 12.1 of the supplement, we derive the bias and variance of\n$\\hat{S}_m(t_0)$, using arguments from kernel smoothing and subgaussian concentration respectively.\nIn the following results, we assume that the eigenvalues of the matrices A and B(t) are bounded:\nAssumption 1: There exist positive constants cA, cB such that $\\frac{1}{c_A} \\leq \\lambda_{\\min}(A) \\leq \\lambda_{\\max}(A) \\leq c_A$\nand $\\frac{1}{c_B} \\leq \\lambda_{\\min}(B(t)) \\leq \\lambda_{\\max}(B(t)) \\leq c_B$ for all t.\nAssumption 2: B(t) has entries with bounded second derivatives on [0, 1].\nPutting the bounds on the bias and variance together and optimizing the rate of h, we obtain the\nfollowing, which we prove in the supplementary material.\nTheorem 1. Suppose that the above Assumption holds, the entries Bij(t) of B(t) have bounded second\nderivatives for all i, j, and t \u2208 [0, 1], $s_b + n = o((m/\\log m)^{2/3})$, and that $h \\asymp (m^{-1} \\log m)^{1/3}$.\nThen with probability at least $1 - \\frac{c''}{m^{8/3}}$, $\\hat{S}_m(t_0)$ is positive definite and for some C\n\n$$\\max_{ij} |\\hat{S}_m(t_0, i, j) - B(t_0, i, j)| \\leq C \\left( m^{-1} \\log m \\right)^{1/3}.$$\n\nThis result confirms that the mh temporal samples selected by the kernel act as replicates for estimating\nB(t). We can now substitute this elementwise bound on $\\hat{S}_m(t_0)$ into the GLasso proof, obtaining\nthe following theorem which demonstrates that \u02c6B(t) successfully exploits sparsity in B\u22121(t).\nTheorem 2. Suppose the conditions of Theorem 1 and that B\u22121(t) has at most sb nonzero off-diagonal\nelements for all t. 
If $\\lambda_m \\sim \\sqrt{\\frac{\\log m}{m^{2/3}}}$, then the GLasso estimator (7) satisfies\n\n$$\\|\\hat{B}(t_0) - B(t_0)\\|_F = O_p\\left( \\sqrt{\\frac{(s_b + n) \\log m}{m^{2/3}}} \\right), \\quad \\|\\hat{B}^{-1}(t_0) - B^{-1}(t_0)\\|_F = O_p\\left( \\sqrt{\\frac{(s_b + n) \\log m}{m^{2/3}}} \\right).$$\n\nObserve that this single-sample bound converges whenever the A part dimensionality m grows.\nThe proof follows from the concentration bound in Theorem 1 using the argument in Zhou et al.\n(2010), Zhou et al. (2011), and Rothman et al. (2008). Note that \u03bbm goes to zero as m increases, in\naccordance with the standard bias/variance tradeoff.\nWe now turn to the estimator for the A part. As it does not involve kernel smoothing, we simply\nneed to bound the variance. We have the following bound on the error of \u02dcA:\nTheorem 3. Suppose the above Assumption holds. Then\n\n$$\\max_{ij} |\\tilde{A}_{ij} - A_{ij}| \\leq C (c_A + c_B) \\sqrt{n^{-1} \\log m}$$\n\nwith probability $1 - \\frac{c}{m^4}$ for some constants C, c > 0.\nRecall that we have assumed that m > n, so the probability converges to 1 with increasing m or\nn. While \u02dcA is not positive definite, the triangle inequality implies a bound on the positive definite\nprojection \u02dcA+ (11):\n\n$$\\|\\tilde{A}_+ - A\\|_{\\max} \\leq \\|\\tilde{A}_+ - \\tilde{A}\\|_{\\max} + \\|\\tilde{A} - A\\|_{\\max} \\leq 2 \\|\\tilde{A} - A\\|_{\\max} = O_p\\left( \\sqrt{n^{-1} \\log m} \\right). \\tag{13}$$\n\nThus, similarly to the earlier result for B(t), the estimator (12) formed by substituting the positive\nsemidefinite \u03c1( \u02dcA+) into the GLasso objective enjoys the following error bound (Zhou et al., 2011).\nTheorem 4. 
Suppose the conditions of Theorem 3 and that A\u22121 has at most sa = o(n/ log m)\nnonzero off-diagonal elements. If $\\lambda_n \\sim \\sqrt{\\frac{\\log m}{n}}$, then the GLasso estimator (12) satisfies\n\n$$\\|\\hat{A} - A\\|_F = O_p\\left( \\sqrt{\\frac{s_a \\log m}{n}} \\right), \\quad \\|\\hat{A}^{-1} - A^{-1}\\|_F = O_p\\left( \\sqrt{\\frac{s_a \\log m}{n}} \\right).$$\n\nObserve that this single-sample bound converges whenever the B(t) dimensionality n grows since\nthe sparsity sa = o(n/ log m). For relaxation of this stringent sparsity assumption, one can use\nother assumptions; see, for example, Theorem 3.3 in Zhou (2014).\n\n4 Simulation study\nWe generated a time varying sequence of spatial covariances B(ti) = \u0398(ti)\u22121 according to the\nmethod of Zhou et al. (2010), which follows a type of Erd\u0151s-R\u00e9nyi random graph model. Initially\nwe set \u0398(0) = 0.25In\u00d7n, where n = 100. Then, we randomly select k edges and update \u0398(t) as\nfollows: for each new edge (i, j), a weight a > 0 is chosen uniformly at random from [0.1, 0.3]; we\nsubtract a from \u0398ij and \u0398ji, and increase \u0398ii, \u0398jj by a. This keeps B(t) positive definite. When\nwe later delete an existing edge from the graph, we reverse the above procedure.\nWe consider t \u2208 [0, 1], changing the graph structure at the points ti = i/5 as follows. At each ti,\nfive existing edges are deleted, and five new edges are added. For each of the five new edges, a target\nweight is chosen. Linear interpolation of the edge weights between the ti is used to smoothly add\nthe new edges and gradually delete the ones to be deleted. Thus, almost always, there are 105 edges\nin the graph and 10 edges have weights that are varying smoothly (Figure 1).\n\nFigure 1: Example sequence of Erd\u0151s-R\u00e9nyi B\u22121(t) = \u0398(t) graphs. 
At each time point, the 100\nedges connecting n = 100 nodes are shown. Changes are indicated by red and green edges: red\nedges indicate edges that will be deleted in the next increment and green indicates new edges.\nPanels: t = 0/5, t = 1/5, t = 2/5, t = 3/5, t = 4/5, t = 5/5.\n\nIn the first set of experiments we consider B(t) generated from the ER time-varying graph procedure\nand A an AR-1 covariance with parameter \u03c1. The magnitudes of the two factors are balanced. We\nset n = 100 and vary m from 200 to 2400. For each n, m pair, we vary the B(t) regularization\nparameter \u03bb, estimating every B(t), t = 1/m, . . . , 1 for each. We evaluate performance using the\nmean relative Frobenius B(t) estimation error ($\\|\\hat{B}(t) - B(t)\\|_F / \\|B(t)\\|_F$), the mean relative L2\nestimation error ($\\|\\hat{B}(t) - B(t)\\|_2 / \\|B(t)\\|_2$), and the Matthews correlation coefficient (MCC).\nThe MCC quantifies edge support estimation performance, and is defined as follows. Let the number\nof true positive edge detections be TP, true negatives TN, false positives FP, and false negatives FN.\nThe Matthews correlation coefficient is defined as\n\n$$\\mathrm{MCC} = \\frac{\\mathrm{TP} \\cdot \\mathrm{TN} - \\mathrm{FP} \\cdot \\mathrm{FN}}{\\sqrt{(\\mathrm{TP}+\\mathrm{FP})(\\mathrm{TP}+\\mathrm{FN})(\\mathrm{TN}+\\mathrm{FP})(\\mathrm{TN}+\\mathrm{FN})}}.$$\n\nIncreasing values of MCC imply better edge estimation performance, with MCC = 0 implying\ncomplete failure and MCC = 1 implying perfect edge set estimation.\nResults are shown in Figure 2, for \u03c1 = .5 and 50 edges in B, \u03c1 = .5 and 100 edges in B, and \u03c1 = .95\nand 100 edges in B. As predicted by the theory, increasing m improves performance and increasing\n\u03c1 decreases performance. Increasing the number of edges in B changes the optimal \u03bb, as expected.\nFigure 3 shows performance results for the penalized estimator \u02c6A using MCC, Frobenius error, and\nL2 error, where A follows an AR(1) model with \u03c1 = 0.5 and B follows a random ER model. Note\nthat the MCC and the Frobenius and spectral norm errors all improve with larger n. 
In the supplement (Section\n11), we repeat these experiments using alternate random graph topologies, with similar results.\n\nFigure 2: MCC, Frobenius, and L2 norm error curves for B a random ER graph and n = 100. Top:\nA is AR covariance with \u03c1 = .5 and 50 edges in B, Middle: A is AR(1) covariance with \u03c1 = .5 and\nB having 100 edges, Bottom: A is AR covariance with \u03c1 = .95 and 100 edges in B.\n\nFigure 3: MCC, Frobenius, and L2 norm error curves for A an AR(1) covariance with \u03c1 = 0.5 when B is a\nrandom ER graph. From top to bottom: m = 200 and m = 800.\n\n5 fMRI Application\n\nThe ADHD-200 resting-state fMRI dataset (Biswal et al., 2010) was collected from 973 subjects,\n197 of whom were diagnosed with ADHD types 1, 2, or 3. The fMRI images have varying numbers\nof voxels which we divide into 90 regions of interest for graphical model analysis (Wehbe et al.,\n2014), and between 76 and 276 images exist for each subject. Provided covariates for the subjects\ninclude age, gender, handedness, and IQ. Previous works such as (Qiu et al., 2016) used this dataset\nto establish that the brain network density increases with age, corresponding to brain development\nas subjects mature. We revisit this problem using our additive approach. Our additive model allows\nthe direct estimation of the temporal behavior, revealing a richer structure than a simple AR-1, and\nallowing for effective denoising of the data, and better estimation of the spatial graph structure.\nWe estimate the temporal A covariances for each subject using the voxels contained in the regions\nof interest, with example results shown in Figure 5 in the supplement. 
We choose \u03c4B as the lower\nlimit of the eigenvalues of $\\frac{1}{n} X^T X$, as in the high sample regime it is an upper bound on \u03c4B.\n\n[Figures 2 and 3: plots of MCC, Frobenius error, and L2 error; Figure 2 plots them against \u03bb for m = 200, 400, 800, 2400, and Figure 3 shows curves for n = 200, 400, 800 at m = 200 and m = 800.]\n\nWe then estimate the brain connectivity network at a range of ages from 8 to 18, using both our\nproposed method and the method of Monti et al. (2014), as it is an optimally-penalized version\nof the estimator in Qiu et al. (2016). We use a Gaussian kernel with bandwidth h, and estimate\nthe graphs using a variety of values of \u03bb and h. Subjects with fewer than 120 time samples were\neliminated, and those with more were truncated to 120 to reduce bias towards longer scans. The\nnumber of edges in the estimated graphs is shown in Figure 4. Note the consistent increase in\nnetwork density with age, becoming smoother with increasing h.\n\n(a) Non-additive method of Monti et al. (2014) (optimally penalized version of Qiu et al. (2016)).\n\n(b) Our proposed additive method, allowing for denoising of the time-correlated data.\n\nFigure 4: Number of edges in the estimated B\u22121(t) graphical models across 90 brain regions as a\nfunction of age. 
Shown are results using three different values of the regularization parameter \u03bb,\nand from left to right the kernel bandwidth parameter used is h = 1.5, 2, and 3. Note the consistently\nincreasing edge density in our estimate, corresponding to predictions of increased brain connectivity\nas the brain develops, leveling off in the late teenage years. Compare this to the method of Monti\net al. (2014), which successfully detects the trend in the years 11-14, but fails for other ages.\n\n6 Conclusion\n\nIn this work, we presented estimators for time-varying graphical models in the presence of time-correlated\nsignals and noise. We revealed a bias-variance tradeoff scaling with the underlying rate\nof change, and proved strong single sample convergence results in high dimensions. We applied our\nmethodology to an fMRI dataset, discovering meaningful temporal changes in functional connectivity,\nconsistent with scientifically expected childhood growth and development.\n\nAcknowledgement\n\nThis work was supported in part by NSF under Grant DMS-1316731, Elizabeth Caroline Crosby\nResearch Award from the Advance Program at the University of Michigan, and by AFOSR grant\nFA9550-13-1-0043.\n\nReferences\nArbabshirani, M., Damaraju, E., Phlypo, R., Plis, S., Allen, E., Ma, S., Mathalon, D., Preda, A.,\nVaidya, J., and Adali, T. Impact of autocorrelation on functional connectivity. 
Neuroimage, 102:\n294\u2013308, 2014.\n\nBiswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J.,\nBuckner, R., and Colcombe, S. Toward discovery science of human brain function. Proceedings\nof the National Academy of Sciences, 107(10):4734\u20134739, 2010.\n\nBoyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical\nlearning via ADMM. Foundations and Trends\u00ae in Machine Learning, 3(1):1\u2013122, 2011.\n\nCalhoun, V., Miller, R., Pearlson, G., and Adal\u0131, T. The chronnectome: time-varying connectivity\nnetworks as the next frontier in fMRI data discovery. Neuron, 84(2):262\u2013274, 2014.\n\nCarvalho, C., West, M., et al. Dynamic matrix-variate graphical models. Bayesian analysis, 2(1):\n69\u201397, 2007.\n\nChang, C. and Glover, G. Time\u2013frequency dynamics of resting-state brain connectivity measured\nwith fmri. Neuroimage, 50(1):81\u201398, 2010.\n\nChen, S., Liu, K., Yang, Y., Xu, Y., Lee, S., Lindquist, M., Caffo, B., and Vogelstein,\nJ. An m-estimator for reduced-rank high-dimensional linear dynamical system identification.\narXiv:1509.03927, 2015.\n\nCressie, N. Statistics for spatial data. John Wiley & Sons, 2015.\n\nGreenewald, K. and Hero, A. Robust kronecker product PCA for spatio-temporal covariance estimation. 
IEEE Transactions on Signal Processing, 63(23):6368\u20136378, Dec 2015.\n\nHuang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., Chen, K., and Reiman, E. Learning brain connectivity of Alzheimer\u2019s disease by sparse inverse covariance estimation. NeuroImage, 50(3):935\u2013949, 2010.\n\nKim, J., Pan, W., and the Alzheimer\u2019s Disease Neuroimaging Initiative. Highly adaptive tests for group differences in brain functional connectivity. NeuroImage: Clinical, 9:625\u2013639, 2015.\n\nLiu, X. and Duyn, J. Time-varying functional network information extracted from brief instances of spontaneous brain activity. Proceedings of the National Academy of Sciences, 110(11):4392\u20134397, 2013.\n\nMonti, R., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, C., and Montana, G. Estimating time-varying brain connectivity networks from fMRI time series. NeuroImage, 103:427\u2013443, 2014.\n\nNarayan, M., Allen, G., and Tomson, S. Two sample inference for populations of graphical models with applications to functional connectivity. arXiv preprint arXiv:1502.03853, 2015.\n\nQiu, H., Han, F., Liu, H., and Caffo, B. Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B, 78(2):487\u2013504, 2016.\n\nRothman, A., Bickel, P., Levina, E., Zhu, J., et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494\u2013515, 2008.\n\nRudelson, M. and Zhou, S. Errors-in-variables models with dependent measurements. Electronic Journal of Statistics, 11(1):1699\u20131797, 2017.\n\nTsiligkaridis, T. and Hero, A. Covariance estimation in high dimensions via Kronecker product expansions. IEEE Transactions on Signal Processing, 61(21):5347\u20135360, 2013.\n\nVaroquaux, G., Gramfort, A., Poline, J-B., and Thirion, B. Brain covariance selection: better individual functional connectivity models using population prior. 
Advances in Neural Information Processing Systems, 23:2334\u20132342, 2010.\n\nWehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., and Mitchell, T. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLOS ONE, 9(11):e112575, 2014.\n\nZhou, S. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532\u2013562, 2014.\n\nZhou, S., Lafferty, J., and Wasserman, L. Time varying undirected graphs. Machine Learning, 80(2\u20133):295\u2013319, 2010.\n\nZhou, S., R\u00fctimann, P., Xu, M., and B\u00fchlmann, P. High-dimensional covariance estimation based on Gaussian graphical models. The Journal of Machine Learning Research, 12:2975\u20133026, 2011.", "award": [], "sourceid": 2986, "authors": [{"given_name": "Kristjan", "family_name": "Greenewald", "institution": "University of Michigan"}, {"given_name": "Seyoung", "family_name": "Park", "institution": "Yale University"}, {"given_name": "Shuheng", "family_name": "Zhou", "institution": "University of Michigan"}, {"given_name": "Alexander", "family_name": "Giessing", "institution": "University of Michigan"}]}