{"title": "Learning Networks of Stochastic Differential Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "We consider linear models for stochastic dynamics. Any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T. We analyse the l1-regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined \u2018time complexity\u2019 for the network inference problem.", "full_text": "Learning Networks of\n\nStochastic Differential Equations\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nJos\u00b4e Bento\n\nStanford University\nStanford, CA 94305\n\nMorteza Ibrahimi\n\nStanford University\nStanford, CA 94305\n\njbento@stanford.edu\n\nibrahimi@stanford.edu\n\nDepartment of Electrical Engineering and Statistics\n\nAndrea Montanari\n\nStanford University\nStanford, CA 94305\n\nmontanari@stanford.edu\n\nAbstract\n\nWe consider linear models for stochastic dynamics. To any such model can be as-\nsociated a network (namely a directed graph) describing which degrees of freedom\ninteract under the dynamics. We tackle the problem of learning such a network\nfrom observation of the system trajectory over a time interval T .\nWe analyze the \u21131-regularized least squares algorithm and, in the setting in which\nthe underlying network is sparse, we prove performance guarantees that are uni-\nform in the sampling rate as long as this is suf\ufb01ciently high. 
This result substantiates the notion of a well-defined 'time complexity' for the network inference problem.

keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection.

1 Introduction and main results

Let G = (V, E) be a directed graph with weight A0_ij ∈ R associated to the directed edge (j, i) from j ∈ V to i ∈ V. To each node i ∈ V in this network is associated an independent standard Brownian motion b_i and a variable x_i taking values in R and evolving according to

 dx_i(t) = Σ_{j∈∂+i} A0_ij x_j(t) dt + db_i(t) ,

where ∂+i = {j ∈ V : (j, i) ∈ E} is the set of 'parents' of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of x_i is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is represented by

 dx(t) = A0 x(t) dt + db(t) , (1)

with x(t) ∈ R^p, b(t) a p-dimensional standard Brownian motion, and A0 ∈ R^{p×p} a matrix with entries {A0_ij}_{i,j∈[p]} whose sparsity pattern is given by the graph G. We assume that the linear system ẋ(t) = A0 x(t) is stable (i.e., that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable, independent of b(t), distributed according to the invariant measure. Under the stability assumption, this is a mild restriction, since the system converges exponentially to stationarity.
A portion of time length T of the system trajectory {x(t)}_{t∈[0,T]} is observed, and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0).
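The model (1) is straightforward to simulate by an Euler-Maruyama discretization. The following sketch does so for a small 3-node chain network; the matrix, horizon, and step size are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-node chain: edge (j, i) is present iff A0[i, j] != 0,
# and the negative diagonal makes the linear system stable.
A0 = np.array([[-2.0, 0.0, 0.0],
               [1.0, -2.0, 0.0],
               [0.0, 1.0, -2.0]])

def simulate(A0, T=50.0, dt=1e-3, rng=rng):
    # Euler-Maruyama discretization of dx(t) = A0 x(t) dt + db(t)
    p = A0.shape[0]
    n = int(T / dt)
    x = np.zeros((n + 1, p))
    for t in range(n):
        db = rng.normal(scale=np.sqrt(dt), size=p)  # Brownian increment
        x[t + 1] = x[t] + dt * (A0 @ x[t]) + db
    return x

x = simulate(A0)  # trajectory x(0), x(dt), ..., x(T)
```

The rows of the returned array play the role of the observed trajectory {x(t)} in what follows.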
We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom?

As an example application, chemical reactions can be conveniently modeled by systems of non-linear stochastic differential equations, whose variables encode the densities of the various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area has so far focused on low-dimensional networks, i.e., on methods that are guaranteed to be correct for fixed p as T → ∞, while we tackle here the regime in which both p and T diverge.
Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems:

(i) Samples are not independent. This can (and does) increase the sample complexity.
(ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T]). Of course one can select a finite subsample, for instance at regularly spaced times {x(iη)}_{i=0,1,...}. This raises the question as to whether the learning performance depends on the choice of the spacing η.
(iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful.
Indeed, the matrix A0 contains more information than the stationary distribution of the process (1), and only the latter can be learned from independent samples.
(iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T.

Our results confirm these intuitions in a detailed and quantitative way.

1.1 Results: Regularized least squares

Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with the existing literature in Section 1.3.
In the present case, the algorithm reconstructs each row of the matrix A0 independently. The rth row, A0_r, is estimated by solving the following convex optimization problem for A_r ∈ R^p:

 minimize L(A_r; {x(t)}_{t∈[0,T]}) + λ ‖A_r‖_1 , (2)

where the likelihood function L is defined by

 L(A_r; {x(t)}_{t∈[0,T]}) = (1/2T) ∫_0^T (A_r* x(t))^2 dt − (1/T) ∫_0^T (A_r* x(t)) dx_r(t) . (3)

(Here and below M* denotes the transpose of the matrix/vector M.) To see that this likelihood function is indeed related to least squares, one can formally write ẋ_r(t) = dx_r(t)/dt and complete the square on the right-hand side of Eq. (3), thus getting the integral

 (1/2T) ∫_0^T (A_r* x(t) − ẋ_r(t))^2 dt − (1/2T) ∫_0^T ẋ_r(t)^2 dt .

The first term is a sum of squared residuals, and the second is independent of A_r. Finally, the l1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries A_ij, thus effectively selecting the structure.

Let S0 be the support of row A0_r, and assume |S0| ≤ k. We will refer to the vector sign(A0_r) as the signed support of A0_r (where sign(0) = 0 by convention).
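The completing-the-square step relating the two forms of the cost can be checked numerically on a discretized trajectory, with finite differences standing in for ẋ_r; the identity is then exact up to floating-point error. The matrix, sizes, and candidate row in this sketch are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, eta = 4, 2000, 0.01

# Illustrative stable sparse matrix: -I plus small off-diagonal couplings.
off = 0.2 * (rng.random((p, p)) < 0.3)
np.fill_diagonal(off, 0.0)
A0 = off - np.eye(p)

# Discretized trajectory, used here as a stand-in for {x(t)}.
x = np.zeros((n + 1, p))
for t in range(n):
    x[t + 1] = x[t] + eta * (A0 @ x[t]) + rng.normal(scale=np.sqrt(eta), size=p)

a = rng.normal(size=p)   # an arbitrary candidate row A_r
T = n * eta
dxr = np.diff(x[:, 0])   # increments of coordinate r = 0
ax = x[:-1] @ a          # A_r* x(t) along the trajectory

# Squared-residual form, minus the term that does not depend on A_r ...
lhs = (eta * (ax - dxr / eta) ** 2).sum() / (2 * T) \
      - (dxr ** 2 / eta).sum() / (2 * T)
# ... equals the form of Eq. (3) with the stochastic integral:
rhs = (eta * ax ** 2).sum() / (2 * T) - (ax * dxr).sum() / T
```

The two quantities agree for any candidate row a, which is why the A-independent term can be dropped from the objective.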
Let λmax(M) and λmin(M) stand for the maximum and minimum eigenvalue of a square matrix M, respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0_r.
When stable, the diffusion process (1) has a unique stationary measure, which is Gaussian with covariance Q0 ∈ R^{p×p} given by the solution of Lyapunov's equation [5]

 A0 Q0 + Q0 (A0)* + I = 0 . (4)

Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin(A0) (given a matrix M, we denote by M_{L,R} its submatrix M_{L,R} ≡ (M_ij)_{i∈L,j∈R}):

(a) We denote by Cmin ≡ λmin(Q0_{S0,S0}) the minimum eigenvalue of the restriction of Q0 to the support S0, and assume Cmin > 0.
(b) We define the incoherence parameter α by letting |||Q0_{(S0)^C,S0} (Q0_{S0,S0})^{-1}|||_∞ = 1 − α, and assume α > 0. (Here ||| · |||_∞ is the operator sup norm.)
(c) We define ρmin(A0) = −λmax((A0 + (A0)*)/2) and assume ρmin(A0) > 0. Note this is a stronger form of the stability assumption.

Our main result shows that there exists a well-defined time complexity, i.e., a minimum time interval T such that observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem.

Theorem 1.1. Consider the problem of learning the support S0 of row A0_r of the matrix A0 from a sample trajectory {x(t)}_{t∈[0,T]} distributed according to the model (1). If

 T > ( 10^4 k^2 (k ρmin(A0)^{-2} + Amin^{-2}) / (α^2 ρmin(A0) Cmin^2) ) log(4pk/δ) , (5)

then there exists λ such that l1-regularized least squares recovers the signed support of A0_r with probability larger than 1 − δ.
This is achieved by taking λ = √(36 log(4p/δ) / (T α^2 ρmin(A0))).

The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin(A0), which is quite satisfying conceptually, since ρmin(A0)^{-1} controls the relaxation time of the dynamics.

1.2 Overview of other results

So far we focused on continuous-time dynamics. While this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0:

 x(t) = x(t − 1) + η A0 x(t − 1) + w(t) , t ∈ N_0 . (6)

Here x(t) ∈ R^p is the vector collecting the dynamical variables, A0 ∈ R^{p×p} specifies the dynamics as above, and {w(t)}_{t≥0} is a sequence of i.i.d. normal vectors with covariance η I_{p×p} (i.e., with independent components of variance η). We assume that consecutive samples {x(t)}_{0≤t≤n} are given, and will ask under which conditions regularized least squares reconstructs the support of A0.
The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed, we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}_{0≤t≤n} are dependent (and strongly dependent as η → 0).
Discrete-time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous-time system modeled as in Eq.
(1). Notice that in the latter case the matrices A0 appearing in Eqs. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation, already mentioned above: whenever the sample spacing is not too large, the time complexity (i.e., the product nη) is roughly independent of the spacing itself.

1.3 Related work

A substantial amount of work has been devoted to the analysis of l1-regularized least squares and its variants [6, 7, 8, 9, 10]. The most closely related results are the ones concerning high-dimensional consistency for support recovery [11, 12]. Our proof indeed follows the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfy the irrepresentability conditions used in those works. Second, the observations are not corrupted by i.i.d. noise (since successive configurations are correlated), and therefore elementary concentration inequalities are not sufficient.
Learning sparse graphical models via l1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability condition on a modified covariance. That paper also focuses on i.i.d. samples.
Most of these proofs build on the technique of [12]. A naive adaptation to the present case allows one to prove some performance guarantees for the discrete-time setting. However, the resulting bounds are not uniform as η → 0 for nη = T fixed.
In particular, they do not allow one to prove an analogue of our continuous-time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η.
Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while 'averaging out' the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on systems with a fixed (small) number of dimensions p.
Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17].

2 Illustration of the main results

It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce the parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustrations, both via numerical simulations and by checking the conditions on specific classes of graphs.

2.1 Learning the laplacian of graphs with bounded degree

Given a simple graph G = (V, E) on the vertex set V = [p], its laplacian Δ_G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries Δ_G(i,i) = −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.)
It is well known that Δ_G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + Δ_G fits into the setting of Theorem 1.1 for m > 0.
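For a concrete instance of this setting, the sketch below builds A0 = −m I + Δ_G for a small cycle graph (an illustrative choice), solves Lyapunov's equation (4) for Q0, and evaluates the quantities entering conditions (a)-(c). Note that scipy's solve_continuous_lyapunov solves a x + x a^H = q, so q = −I is passed to match Eq. (4).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative instance: G is an 8-cycle (not an example worked in the text).
p, m = 8, 1.0
adj = np.zeros((p, p))
for i in range(p):
    adj[i, (i + 1) % p] = adj[(i + 1) % p, i] = 1.0
lap = adj - np.diag(adj.sum(axis=1))   # graph laplacian Delta_G
A0 = -m * np.eye(p) + lap

# Lyapunov equation (4): A0 Q0 + Q0 A0* + I = 0
Q0 = solve_continuous_lyapunov(A0, -np.eye(p))

r = 0
S0 = np.flatnonzero(A0[r])                    # support of row r
Sc = np.setdiff1d(np.arange(p), S0)
C_min = np.linalg.eigvalsh(Q0[np.ix_(S0, S0)]).min()            # condition (a)
M = Q0[np.ix_(Sc, S0)] @ np.linalg.inv(Q0[np.ix_(S0, S0)])
alpha = 1.0 - np.abs(M).sum(axis=1).max()                       # condition (b)
rho_min = -np.linalg.eigvalsh((A0 + A0.T) / 2).max()            # condition (c)
```

Since the cycle is connected, the laplacian has a single zero eigenvalue, and one finds rho_min = m, consistent with the role of m as the stability margin.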
The corresponding model (1) describes the overdamped dynamics of a network of masses connected by springs of unit strength, each also connected by a spring of strength m to the origin. We obtain the following result.

Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k, and consider the model (1) with A0 = −m I + Δ_G, where Δ_G is the laplacian of G and m > 0. If

 T ≥ 2·10^5 k^2 ((k + m)/m)^5 (k + m^2) log(4pk/δ) , (7)

then there exists λ such that l1-regularized least squares recovers the signed support of A0_r with probability larger than 1 − δ. This is achieved by taking λ = √(36 (k + m)^2 log(4p/δ) / (T m^3)).

In other words, for m bounded away from 0 and ∞, regularized least squares correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian Δ_G is uniquely determined. Also, the proof technique used for this example generalizes to other graphs as well.

Figure 1: (left) Probability of success vs. length of the observation interval nη, for p = 16, 32, 64, 128, 256, 512. (right) Sample complexity for 90% probability of success vs. p.

2.2 Numerical illustrations

In this section we present numerical validation of the proposed method on synthetic data.
The results confirm our observations in Theorems 1.1 and 3.1 below, namely that the time complexity scales logarithmically with the number of nodes p in the network, for a given constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Figs. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}^{p×p} with elements chosen independently at random, with P(A0_ij = 1) = k/p, k = 5. The process x_0^n ≡ {x(t)}_{0≤t≤n} is then generated according to Eq. (6). We solve the regularized least squares problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record whether the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices.

The left plot in Fig. 1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 2^11 instances, and each instance is generated using a new random matrix A0. The right plot in Fig. 1 is the corresponding curve of the sample complexity vs. p, where sample complexity is defined as the minimum value of nη achieving a probability of success of 90%. As predicted by Theorem 2.1, the curve shows the logarithmic scaling of the sample complexity with p.
In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with a step much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T. We used 2^11 instances for each curve.
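One instance of this experiment can be sketched as follows, with the row estimate obtained by proximal gradient (ISTA) on the l1-regularized cost (8). The sizes and the regularization value are illustrative; also, the random ensemble here is restricted to upper-triangular support plus a −I shift, a simplification of the {0, 1} draw in the text that guarantees stability for this small example.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, eta, n, lam = 16, 3, 0.1, 4000, 0.1

# Sparse random dynamics; the triangular structure pins every eigenvalue
# at -1, so the discrete map I + eta*A0 is stable.
A0 = 0.3 * np.triu(rng.random((p, p)) < k / p, 1) - np.eye(p)

x = np.zeros((n + 1, p))
for t in range(n):  # discrete-time dynamics, Eq. (6)
    x[t + 1] = x[t] + eta * (A0 @ x[t]) + rng.normal(scale=np.sqrt(eta), size=p)

def estimate_row(x, r, eta, lam, iters=500):
    # ISTA on L(A_r) + lam * l1-norm, cf. Eqs. (8)-(9).
    X, dxr = x[:-1], np.diff(x[:, r])
    nn = X.shape[0]
    Q = X.T @ X / nn                 # Hessian of L
    g0 = X.T @ dxr / (eta * nn)      # gradient of L is Q a - g0
    step = 1.0 / np.linalg.eigvalsh(Q).max()
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        a = a - step * (Q @ a - g0)
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # soft threshold
    return a

a_hat = estimate_row(x, 0, eta, lam)
```

Repeating this over fresh draws of A0 and comparing sign(a_hat) with sign(A0_r) gives the success-probability curves of Fig. 1.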
As predicted by Theorem 1.1, for a fixed observation interval T the probability of success converges to some limiting value as η → 0.

Figure 2: (left) Probability of success vs. length of the observation interval nη, for η = 0.04, 0.06, 0.08, 0.1, 0.14, 0.18, 0.22. (right) Probability of success vs. η for a fixed length of the observation interval (nη = 150). The process is generated for a small value of η and sampled at different rates.

3 Discrete-time model: Statement of the results

Consider a system evolving in discrete time according to the model (6), and let x_0^n ≡ {x(t)}_{0≤t≤n} be the observed portion of the trajectory. The rth row A0_r is estimated by solving the following convex optimization problem for A_r ∈ R^p:

 minimize L(A_r; x_0^n) + λ ‖A_r‖_1 , (8)

where

 L(A_r; x_0^n) ≡ (1/(2η^2 n)) Σ_{t=0}^{n−1} { x_r(t + 1) − x_r(t) − η A_r* x(t) }^2 . (9)

Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous-time case, cf. Eq. (3). Indeed, the proof of Theorem 1.1 will amount to a more precise version of this statement. Furthermore, L(A_r; x_0^n) is easily seen to be the log-likelihood of A_r within the model (6).

As before, we let S0 be the support of row A0_r, and assume |S0| ≤ k.
Under the model (6), x(t) has a Gaussian stationary distribution with covariance Q0 determined by the following modified Lyapunov equation:

 A0 Q0 + Q0 (A0)* + η A0 Q0 (A0)* + I = 0 . (10)

It will be clear from the context whether A0/Q0 refers to the dynamics/stationary matrix of the continuous- or discrete-time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notation already introduced there. We use the shorthand σmax ≡ σmax(I + η A0), where σmax(·) is the maximum singular value. Also define D ≡ (1 − σmax)/η. We will assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state.

Theorem 3.1. Consider the problem of learning the support S0 of row A0_r from the discrete-time trajectory {x(t)}_{0≤t≤n}. If

 nη > ( 10^4 k^2 (k D^{-2} + Amin^{-2}) / (α^2 D Cmin^2) ) log(4pk/δ) , (11)

then there exists λ such that l1-regularized least squares recovers the signed support of A0_r with probability larger than 1 − δ. This is achieved by taking λ = √(36 log(4p/δ) / (D α^2 nη)).

In other words, the discrete-time sample complexity n is logarithmic in the model dimension, polynomial in the maximum network degree, and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuous-time sample complexity as the limit η → 0 of the discrete-time sample complexity.
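Equation (10) can be rewritten as the discrete-time Lyapunov equation (I + ηA0) Q (I + ηA0)* − Q + ηI = 0, which expands term by term back to (10); its solution Q0(η) approaches the solution of Eq. (4) as η → 0. The sketch below checks this convergence on an illustrative matrix, using scipy solvers (solve_discrete_lyapunov solves a x a^H − x + q = 0).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_discrete_lyapunov

rng = np.random.default_rng(3)
p = 5
# Illustrative stable matrix (triangular support, all eigenvalues -1).
A0 = 0.2 * np.triu(rng.random((p, p)) < 0.5, 1) - np.eye(p)

Q_cont = solve_continuous_lyapunov(A0, -np.eye(p))       # Eq. (4)

def Q_eta(eta):
    Ad = np.eye(p) + eta * A0
    return solve_discrete_lyapunov(Ad, eta * np.eye(p))  # Eq. (10)

# The gap to the continuous-time covariance shrinks with eta.
gaps = [np.abs(Q_eta(e) - Q_cont).max() for e in (0.1, 0.01, 0.001)]
```

The gap is of order η, matching the remark that the continuous and discrete descriptions coincide to the zeroth order in η.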
It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrarily large number of samples by sampling the continuous process at finer resolutions, there is a limited amount of information that can be harnessed from a given time interval [0, T].

4 Proofs

In the following we denote by X ∈ R^{p×n} the matrix whose (t + 1)th column corresponds to the configuration x(t), i.e., X = [x(0), x(1), . . . , x(n − 1)]. Further, ΔX ∈ R^{p×n} is the matrix containing the configuration changes, namely ΔX = [x(1) − x(0), . . . , x(n) − x(n − 1)]. Finally, we write W = [w(1), . . . , w(n)] for the matrix containing the Gaussian noise realization. Equivalently,

 W = ΔX − η A0 X .

The rth row of W is denoted by W_r. In order to lighten the notation, we will omit the reference to x_0^n in the likelihood function (9) and simply write L(A_r). We define its normalized gradient and Hessian by

 Ĝ = −∇L(A0_r) = (1/(nη)) X W_r* , Q̂ = ∇^2 L(A0_r) = (1/n) X X* . (12)

4.1 Discrete time

In this section we outline the proof of our main result for discrete-time dynamics, i.e., Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. Then we present a series of concentration lemmas to be used to prove the validity of these conditions, and finally we sketch the outline of the proof.

As mentioned, the proof strategy, and in particular the following proposition, which provides a compact set of sufficient conditions for the support to be recovered correctly, is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material.

Proposition 4.1.
Let α, Cmin > 0 be defined by

 λmin(Q0_{S0,S0}) ≡ Cmin , |||Q0_{(S0)^C,S0} (Q0_{S0,S0})^{-1}|||_∞ ≡ 1 − α . (13)

If the following conditions hold, then the regularized least squares solution (8) correctly recovers the signed support sign(A0_r):

 ‖Ĝ‖_∞ ≤ λα/3 , ‖Ĝ_{S0}‖_∞ ≤ (Amin Cmin)/(4k) − λ , (14)

 |||Q̂_{(S0)^C,S0} − Q0_{(S0)^C,S0}|||_∞ ≤ (α/12) (Cmin/√k) , |||Q̂_{S0,S0} − Q0_{S0,S0}|||_∞ ≤ (α/12) (Cmin/√k) . (15)

Further, the same statement holds for the continuous model (1), provided Ĝ and Q̂ are the gradient and the Hessian of the likelihood (3).

The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14) and (15) hold with high probability. Checking these conditions can be regarded in turn as concentration-of-measure statements. Indeed, if expectation is taken with respect to a stationary trajectory, we have E{Ĝ} = 0, E{Q̂} = Q0.

4.1.1 Technical lemmas

In this section we state the necessary concentration lemmas for proving Theorem 3.1. These are non-trivial because Ĝ and Q̂ are quadratic functions of dependent random variables (the samples {x(t)}_{0≤t≤n}). The proofs of Proposition 4.2, Proposition 4.3, and Corollary 4.4 can be found in the supplementary material.

Our first proposition implies concentration of Ĝ around 0.

Proposition 4.2. Let S ⊆ [p] be any set of vertices and ε < 1/2. If σmax ≡ σmax(I + η A0) < 1, then

 P{ ‖Ĝ_S‖_∞ > ε } ≤ 2|S| exp( −n (1 − σmax) ε^2 / 4 ) . (16)

We furthermore need to bound the matrix norms as per (15) in Proposition 4.1.
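The statements E{Ĝ} = 0 and E{Q̂} = Q0 can be illustrated numerically. The sketch below (illustrative dynamics and sizes) averages the statistics of Eq. (12), formed for row r = 0, over stationary trajectories of model (6), with the stationary covariance obtained from the modified Lyapunov equation (10) in its discrete-Lyapunov form.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(4)
p, eta, n, reps = 4, 0.05, 400, 200
A0 = -np.eye(p)
A0[1, 0] = 0.5           # one off-diagonal interaction (illustrative)
Ad = np.eye(p) + eta * A0
Q0 = solve_discrete_lyapunov(Ad, eta * np.eye(p))  # stationary covariance
chol = np.linalg.cholesky(Q0)

G_sum, Q_sum = np.zeros(p), np.zeros((p, p))
for _ in range(reps):
    x = np.empty((n + 1, p))
    x[0] = chol @ rng.normal(size=p)               # start in the stationary state
    W = rng.normal(scale=np.sqrt(eta), size=(n, p))
    for t in range(n):
        x[t + 1] = Ad @ x[t] + W[t]
    X = x[:-1]
    G_sum += X.T @ W[:, 0] / (n * eta)             # hat-G for row r = 0
    Q_sum += X.T @ X / n                           # hat-Q
G_bar, Q_bar = G_sum / reps, Q_sum / reps
```

The averaged statistics land close to 0 and Q0 respectively; the lemmas below quantify how fast individual trajectories concentrate around these means.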
First we relate bounds on |||Q̂_{JS} − Q0_{JS}|||_∞ to bounds on |Q̂_ij − Q0_ij| (i ∈ J, j ∈ S), where J and S are any subsets of {1, . . . , p}. We have

 P( |||Q̂_{JS} − Q0_{JS}|||_∞ > ε ) ≤ |J| |S| max_{i∈J, j∈S} P( |Q̂_ij − Q0_ij| > ε/|S| ) . (17)

Then we bound |Q̂_ij − Q0_ij| using the following proposition.

Proposition 4.3. Let i, j ∈ {1, . . . , p}, σmax ≡ σmax(I + η A0) < 1, T = ηn > 3/D and 0 < ε < 2/D, where D = (1 − σmax)/η. Then

 P( |Q̂_ij − Q0_ij| > ε ) ≤ 2 exp( −(n/(32 η^2)) (1 − σmax)^3 ε^2 ) . (18)

Finally, the next corollary follows from Proposition 4.3 and Eq. (17).

Corollary 4.4. Let J, S (|S| ≤ k) be any two subsets of {1, . . . , p}, and let σmax ≡ σmax(I + η A0) < 1, ε < 2k/D and nη > 3/D (where D = (1 − σmax)/η). Then

 P( |||Q̂_{JS} − Q0_{JS}|||_∞ > ε ) ≤ 2|J| k exp( −(n/(32 k^2 η^2)) (1 − σmax)^3 ε^2 ) . (19)

4.1.2 Outline of the proof of Theorem 3.1

With these concentration bounds we can now easily prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions (α, Cmin > 0) of Proposition 4.1 hold. In order to make the first condition on Ĝ imply the second condition on Ĝ, we assume that λα/3 ≤ (Amin Cmin)/(4k) − λ, which is guaranteed to hold if

 λ ≤ Amin Cmin / (8k) . (20)

We also combine the two last conditions on Q̂, thus obtaining

 |||Q̂_{[p],S0} − Q0_{[p],S0}|||_∞ ≤ (α/12) (Cmin/√k) , (21)

since [p] = S0 ∪ (S0)^C. We then impose that both the probability of the condition on Q̂ failing and the probability of the condition on Ĝ failing are upper bounded by δ/2, using Proposition 4.2 and Corollary 4.4.
It is shown in the supplementary material that this is satisfied if condition (11) holds.

4.2 Outline of the proof of Theorem 1.1

To prove Theorem 1.1 we recall that Proposition 4.1 holds provided the appropriate continuous-time expressions are used for Ĝ and Q̂, namely

 Ĝ = −∇L(A0_r) = (1/T) ∫_0^T x(t) db_r(t) , Q̂ = ∇^2 L(A0_r) = (1/T) ∫_0^T x(t) x(t)* dt . (22)

These are of course random variables. In order to distinguish them from the discrete-time versions, we will adopt the notation Ĝ_n, Q̂_n for the latter. We claim that these random variables can be coupled (i.e., defined on the same probability space) in such a way that Ĝ_n → Ĝ and Q̂_n → Q̂ almost surely as n → ∞ for fixed T. Under assumption (5), it is easy to show that (11) holds for all n > n0, with n0 a sufficiently large constant (for a proof see the supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for the gradient Ĝ_n and Hessian Q̂_n for any n ≥ n0, with probability larger than 1 − δ. But by the claimed convergence Ĝ_n → Ĝ and Q̂_n → Q̂, they hold also for Ĝ and Q̂ with probability at least 1 − δ, which proves the theorem.

We are left with the task of showing that the discrete- and continuous-time processes can be coupled in such a way that Ĝ_n → Ĝ and Q̂_n → Q̂. With a slight abuse of notation, the state of the discrete-time system (6) will be denoted by x(i), where i ∈ N, and the state of the continuous-time system (1) by x(t), where t ∈ R. We denote by Q0 the solution of (4) and by Q0(η) the solution of (10).
It is easy to check that Q0(η) → Q0 as η → 0, by the uniqueness of the stationary state distribution.
The initial state x(t = 0) of the continuous-time system is a N(0, Q0) random variable independent of b(t), and the initial state of the discrete-time system is defined to be x(i = 0) = (Q0(η))^{1/2} (Q0)^{−1/2} x(t = 0). At subsequent times, x(i) and x(t) are generated by the respective dynamical systems using the same matrix A0, with common randomness provided by the standard Brownian motion {b(t)}_{0≤t≤T} in R^p. In order to couple x(t) and x(i), we construct w(i), the noise driving the discrete-time system, by letting w(i) ≡ b(Ti/n) − b(T(i − 1)/n).
The almost sure convergence Ĝ_n → Ĝ and Q̂_n → Q̂ then follows from the standard convergence of random walks to Brownian motion.

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and a Portuguese Doctoral FCT fellowship.

References

[1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35-55, 2007.
[2] D. Higham. Modeling and simulating chemical reactions. SIAM Review, 50:347-368, 2008.
[3] N.D. Lawrence et al., editors. Learning and Inference in Computational Systems Biology. MIT Press, 2010.
[4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6:187-202, 2009.
[5] K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[7] D.L. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution.
Communications on Pure and Applied Mathematics, 59(7):907-934, 2006.
[8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797-829, 2006.
[9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109-2144, 2009.
[10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.
[11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. Advances in Neural Information Processing Systems, 19:1465, 2007.
[12] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Annals of Applied Probability, 16:1925-1961, 2006.
[15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. Journal of Statistical Physics, 127:741-781, 2007.
[16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. pages 89-116, 2010.
[17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. Submitted.
[18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
[19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression.
Annals of Statistics, 2008.
", "award": [], "sourceid": 715, "authors": [{"given_name": "Jos\u00e9", "family_name": "Pereira", "institution": null}, {"given_name": "Morteza", "family_name": "Ibrahimi", "institution": null}, {"given_name": "Andrea", "family_name": "Montanari", "institution": null}]}