{"title": "Spectral Learning of Dynamic Systems from Nonequilibrium Data", "book": "Advances in Neural Information Processing Systems", "page_first": 4179, "page_last": 4187, "abstract": "Observable operator models (OOMs) and related models are one of the most important and powerful tools for modeling and analyzing stochastic systems. They exactly describe dynamics of finite-rank systems and can be efficiently and consistently estimated through spectral learning under the assumption of identically distributed data. In this paper, we investigate the properties of spectral learning without this assumption due to the requirements of analyzing large-time scale systems, and show that the equilibrium dynamics of a system can be extracted from nonequilibrium observation data by imposing an equilibrium constraint. In addition, we propose a binless extension of spectral learning for continuous data. In comparison with the other continuous-valued spectral algorithms, the binless algorithm can achieve consistent estimation of equilibrium dynamics with only linear complexity.", "full_text": "Spectral Learning of Dynamic Systems from\n\nNonequilibrium Data\n\nHao Wu and Frank No\u00e9\n\nDepartment of Mathematics and Computer Science\n\nFreie Universit\u00e4t Berlin\n\nArnimallee 6, 14195 Berlin\n\n{hao.wu,frank.noe}@fu-berlin.de\n\nAbstract\n\nObservable operator models (OOMs) and related models are one of the most im-\nportant and powerful tools for modeling and analyzing stochastic systems. They\nexactly describe dynamics of \ufb01nite-rank systems and can be ef\ufb01ciently and con-\nsistently estimated through spectral learning under the assumption of identically\ndistributed data. 
In this paper, we investigate the properties of spectral learning\nwithout this assumption due to the requirements of analyzing large-time scale\nsystems, and show that the equilibrium dynamics of a system can be extracted\nfrom nonequilibrium observation data by imposing an equilibrium constraint. In\naddition, we propose a binless extension of spectral learning for continuous data.\nIn comparison with the other continuous-valued spectral algorithms, the binless\nalgorithm can achieve consistent estimation of equilibrium dynamics with only\nlinear complexity.\n\n1\n\nIntroduction\n\nIn the last two decades, a collection of highly related dynamic models including observable operator\nmodels (OOMs) [1\u20133], predictive state representations [4\u20136] and reduced-rank hidden Markov models\n[7, 8], have become powerful and increasingly popular tools for analysis of dynamic data. These\nmodels are largely similar, and all can be learned by spectral methods in a general framework of\nmultiplicity automata, or equivalently sequential systems [9, 10]. In contrast with the other commonly\nused models such as Markov state models [11, 12], Langevin models [13, 14], traditional hidden\nMarkov models (HMMs) [15, 16], Gaussian process state-space models [17, 18] and recurrent\nneural networks [19], the spectral learning based models can exactly characterize the dynamics of a\nstochastic system without any a priori knowledge except the assumption of \ufb01nite dynamic rank (i.e.,\nthe rank of Hankel matrix) [10, 20], and the parameter estimation can be ef\ufb01ciently performed for\ndiscrete-valued systems without solving any intractable inverse or optimization problem. 
We focus in this paper on stochastic systems without control inputs; for such systems, all spectral learning based models can be expressed in the form of OOMs, so we will refer to them as OOMs below.

In most literature on spectral learning, the observation data are assumed to be identically (though possibly not independently) distributed, so that the expected values of observables associated with the parameter estimation can be reliably computed by empirical averaging. However, this assumption can be severely violated due to the limits of experimental techniques or computational capacity in many practical situations, especially where metastable physical or chemical processes are involved. A notable example is the distributed computing project Folding@home [21], which explores protein folding processes that occur on timescales of microseconds to milliseconds based on molecular dynamics simulations on the order of nanoseconds in length. In such a nonequilibrium case, where the distributions of the observation data are time-varying and dependent on initial conditions, it is still unclear whether reliable estimates of OOMs can be obtained.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In [22], a hybrid estimation algorithm was proposed to improve spectral learning of large-timescale processes by using both dynamic and static data, but it still requires the assumption of identically distributed data. One way to reduce the statistical bias caused by nonequilibrium data is to discard the observation data generated before the system reaches a steady state, which is a common trick in applied statistics [23]. Obviously, this approach suffers from substantial information loss and is infeasible when observation trajectories are shorter than mixing times.
Another possible way would be to learn OOMs by likelihood-based estimation instead of spectral methods, but no effective maximum likelihood or Bayesian estimator of OOMs is available to date. The maximum pseudo-likelihood estimator of OOMs proposed in [24] demands a high computational cost, and its consistency is as yet unverified.

Another difficulty for spectral approaches is learning with continuous data, where density estimation problems are involved. The density estimation can be performed by parametric methods such as fuzzy interpolation [25] and kernel density estimation [8], but these methods reduce the flexibility of OOMs for dynamic modeling because of their limited expressive capacity. Recently, a kernel embedding based spectral algorithm was proposed to cope with continuous data [26], which avoids explicit density estimation and learns OOMs in a nonparametric manner. However, the kernel embedding usually incurs a very large computational complexity, which greatly limits practical applications of this algorithm to real-world systems.

The purpose of this paper is to address the challenge of spectral learning of OOMs from nonequilibrium data for the analysis of both discrete- and continuous-valued systems. We first provide a modified spectral method for discrete-valued stochastic systems which allows us to consistently estimate the equilibrium dynamics from nonequilibrium data, and then extend this method to continuous observations in a binless manner. In comparison with the existing learning methods for continuous OOMs, the proposed binless spectral method does not rely on any density estimator, and can achieve consistent estimation with computational complexity linear in the data size even if the assumption of identically distributed observations does not hold.
Moreover, numerical experiments are provided to demonstrate the capabilities of the proposed methods.

2 Preliminaries

2.1 Notation

In this paper, we use $P$ to denote the probability distribution for discrete random variables and the probability density for continuous random variables. The indicator function of an event $e$ is denoted by $1_e$, and the Dirac delta function centered at $x$ is denoted by $\delta_x(\cdot)$. For a given process $\{a_t\}$, we write the subsequence $(a_k, a_{k+1}, \ldots, a_{k'})$ as $a_{k:k'}$, and $E_\infty[a_t] \triangleq \lim_{t\to\infty} E[a_t]$ denotes the equilibrium expected value of $a_t$ if the limit exists. In addition, convergence in probability is denoted by $\xrightarrow{p}$.

2.2 Observable operator models

An $m$-dimensional observable operator model (OOM) with observation space $\mathcal{O}$ can be represented by a tuple $M = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$, which consists of an initial state vector $\omega \in \mathbb{R}^{1\times m}$, an evaluation vector $\sigma \in \mathbb{R}^{m\times 1}$, and an observable operator matrix $\Xi(x) \in \mathbb{R}^{m\times m}$ associated with each element $x \in \mathcal{O}$. $M$ defines a stochastic process $\{x_t\}$ in $\mathcal{O}$ via

$P(x_{1:t} \mid M) = \omega\,\Xi(x_{1:t})\,\sigma$   (1)

under the condition that $\omega\Xi(x_{1:t})\sigma \ge 0$, $\omega\Xi(\mathcal{O})\sigma = 1$ and $\omega\Xi(x_{1:t})\sigma = \omega\Xi(x_{1:t})\Xi(\mathcal{O})\sigma$ hold for all $t$ and $x_{1:t} \in \mathcal{O}^t$ [10], where $\Xi(x_{1:t}) \triangleq \Xi(x_1)\cdots\Xi(x_t)$ and $\Xi(A) \triangleq \int_A \mathrm{d}x\,\Xi(x)$. Two OOMs $M$ and $M'$ are said to be equivalent if $P(x_{1:t}\mid M) \equiv P(x_{1:t}\mid M')$.

3 Spectral learning of OOMs

3.1 Algorithm

Here and hereafter, we only consider the case that the observation space $\mathcal{O}$ is a finite set. (Learning with continuous observations will be discussed in Section 4.2.)
A large number of largely similar spectral methods have been developed, and the generic learning procedure of these methods is summarized in Algorithm 1, omitting details of algorithm implementation and parameter choice [27, 7, 28].

Algorithm 1 General procedure for spectral learning of OOMs
INPUT: Observation trajectories generated by a stochastic process $\{x_t\}$ in $\mathcal{O}$
OUTPUT: $\hat M = (\hat\omega, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$
PARAMETER: $m$: dimension of the OOM. $D_1, D_2$: numbers of feature functions. $L$: order of feature functions.
1: Construct feature functions $\varphi_1 = (\phi_{1,1}, \ldots, \phi_{1,D_1})^\top$ and $\varphi_2 = (\phi_{2,1}, \ldots, \phi_{2,D_2})^\top$, where each $\phi_{i,j}$ is a mapping from $\mathcal{O}^L$ to $\mathbb{R}$ and $D_1, D_2 \ge m$.
2: Approximate

$\bar\varphi_1 \triangleq E[\varphi_1(x_{t-L:t-1})]$,  $\bar\varphi_2 \triangleq E[\varphi_2(x_{t:t+L-1})]$   (5)

$C_{1,2} \triangleq E[\varphi_1(x_{t-L:t-1})\,\varphi_2(x_{t:t+L-1})^\top]$   (6)

$C_{1,3}(x) \triangleq E[1_{x_t=x} \cdot \varphi_1(x_{t-L:t-1})\,\varphi_2(x_{t+1:t+L})^\top]$, $\forall x \in \mathcal{O}$   (7)

by their empirical means $\hat{\bar\varphi}_1$, $\hat{\bar\varphi}_2$, $\hat C_{1,2}$ and $\hat C_{1,3}(x)$ over the observation data.
3: Compute $F_1 = U\Sigma^{-1} \in \mathbb{R}^{D_1\times m}$ and $F_2 = V \in \mathbb{R}^{D_2\times m}$ from the truncated singular value decomposition $\hat C_{1,2} \approx U\Sigma V^\top$, where $\Sigma \in \mathbb{R}^{m\times m}$ is a diagonal matrix containing the top $m$ singular values of $\hat C_{1,2}$, and $U$ and $V$ consist of the corresponding $m$ left and right singular vectors of $\hat C_{1,2}$.
4: Compute

$\hat\sigma = F_1^\top \hat{\bar\varphi}_1$   (8)

$\hat\Xi(x) = F_1^\top \hat C_{1,3}(x)\,F_2$, $\forall x \in \mathcal{O}$   (9)

$\hat\omega = \hat{\bar\varphi}_2^\top F_2$   (10)
For convenience of description and analysis, we specify in this paper the formulas for calculating $\hat{\bar\varphi}_1$, $\hat{\bar\varphi}_2$, $\hat C_{1,2}$ and $\hat C_{1,3}(x)$ in Line 2 of Algorithm 1 as follows:

$\hat{\bar\varphi}_1 = \frac{1}{N}\sum_{n=1}^N \varphi_1(\vec s^{\,1}_n)$,  $\hat{\bar\varphi}_2 = \frac{1}{N}\sum_{n=1}^N \varphi_2(\vec s^{\,2}_n)$   (2)

$\hat C_{1,2} = \frac{1}{N}\sum_{n=1}^N \varphi_1(\vec s^{\,1}_n)\,\varphi_2(\vec s^{\,2}_n)^\top$   (3)

$\hat C_{1,3}(x) = \frac{1}{N}\sum_{n=1}^N 1_{s^2_n = x}\,\varphi_1(\vec s^{\,1}_n)\,\varphi_2(\vec s^{\,3}_n)^\top$, $\forall x \in \mathcal{O}$   (4)

Here $\{(\vec s^{\,1}_n, s^2_n, \vec s^{\,3}_n)\}_{n=1}^N$ is the collection of all subsequences of length $2L+1$ appearing in the observation data ($N = T - 2L$ for a single observation trajectory of length $T$). If an observation subsequence $x_{t-L:t+L}$ is denoted by $(\vec s^{\,1}_n, s^2_n, \vec s^{\,3}_n)$ for some $n$, then $\vec s^{\,1}_n = x_{t-L:t-1}$ and $\vec s^{\,3}_n = x_{t+1:t+L}$ represent the prefix and suffix of $x_{t-L:t+L}$ of length $L$, $s^2_n = x_t$ is the intermediate observation value, and $\vec s^{\,2}_n = x_{t:t+L-1}$ is an “intermediate part” of the subsequence of length $L$ starting from time $t$ (see Fig. 1 for a graphical illustration).

Algorithm 1 is much more efficient than the commonly used likelihood-based learning algorithms and does not suffer from local optima issues. In addition, and more importantly, this algorithm can be shown to be consistent if the triples $(\vec s^{\,1}_n, s^2_n, \vec s^{\,3}_n)$ are (i) independently sampled from $M$ or (ii) obtained from a finite number of trajectories which have fully mixed, so that all observation triples are identically distributed (see, e.g., [8, 3, 10] for related works).
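Algorithm 1 together with the estimators (2)-(4) amounts to counting plus one SVD. The sketch below assumes $L = 1$ and one-hot indicator feature functions over a finite alphabet (one specific admissible choice among many, not the paper's prescribed one), so the empirical means become frequency counts over observation triples:

```python
import numpy as np

def spectral_oom(trajs, n_obs, m):
    """Algorithm 1 with L = 1 and one-hot indicator features, D1 = D2 = n_obs.

    A sketch of the generic spectral procedure, not the authors' exact code:
    for this feature choice the estimators (2)-(4) reduce to frequency
    counts over all observation triples (s1, s2, s3) = (x_{t-1}, x_t, x_{t+1}).
    """
    phi1 = np.zeros(n_obs)                 # empirical mean of phi1(s1), Eq. (2)
    phi2 = np.zeros(n_obs)                 # empirical mean of phi2(s2), Eq. (2)
    C12 = np.zeros((n_obs, n_obs))         # Eq. (3)
    C13 = np.zeros((n_obs, n_obs, n_obs))  # C13[x] for each x in O, Eq. (4)
    N = 0
    for traj in trajs:
        for s1, s2, s3 in zip(traj[:-2], traj[1:-1], traj[2:]):
            phi1[s1] += 1
            phi2[s2] += 1                  # for L = 1 the middle window is s2
            C12[s1, s2] += 1
            C13[s2, s1, s3] += 1
            N += 1
    phi1, phi2, C12, C13 = phi1 / N, phi2 / N, C12 / N, C13 / N
    U, S, Vt = np.linalg.svd(C12)
    F1 = U[:, :m] / S[:m]                  # F1 = U Sigma^{-1}
    F2 = Vt[:m].T                          # F2 = V
    sigma = F1.T @ phi1                    # Eq. (8)
    Xi = {x: F1.T @ C13[x] @ F2 for x in range(n_obs)}   # Eq. (9)
    omega = phi2 @ F2                      # Eq. (10)
    return omega, Xi, sigma
```

For $m = D_1 = D_2 = |\mathcal{O}|$ the quantity $\hat\omega\hat\Xi(x)\hat\sigma$ reproduces the empirical one-step frequency of $x$ exactly; for $m < D_1$ it becomes the rank-$m$ approximation used above.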
However, the asymptotic correctness of OOMs learned from short trajectories starting from nonequilibrium states has not been formally established.

Figure 1: Illustration of the variables $\vec s^{\,1}_n$, $s^2_n$, $\vec s^{\,3}_n$ and $\vec s^{\,2}_n$ used in Eqs. (2)-(4) with $(\vec s^{\,1}_n, s^2_n, \vec s^{\,3}_n) = x_{t-L:t+L}$.

3.2 Theoretical analysis

We now analyze the statistical properties of the spectral algorithm without the assumption of identically distributed observations. Before stating our main result, we list some assumptions on the observation data:

Assumption 1. The observation data consist of $I$ independent trajectories of length $T$ produced by a stochastic process $\{x_t\}$, and the data size tends to infinity with (i) $I \to \infty$ and $T = T_0$ or (ii) $T \to \infty$ and $I = I_0$.

Assumption 2. $\{x_t\}$ is driven by an $m$-dimensional OOM $M = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$, and

$\frac{1}{T'}\sum_{t=1}^{T'} f(x_{t:t+l-1}) \xrightarrow{p} E_\infty[f(x_{t:t+l-1})] = E_\infty[f(x_{t:t+l-1}) \mid x_{1:k}]$   (11)

as $T' \to \infty$ for all $k$, $l$, $x_{1:k}$ and $f: \mathcal{O}^l \mapsto \mathbb{R}$.

Assumption 3. The rank of the limit of $\hat C_{1,2}$ is not less than $m$.

Notice that Assumption 2 only states the asymptotic stationarity of $\{x_t\}$; the marginal distributions of observation triples are possibly time dependent if $\omega \ne \omega\Xi(\mathcal{O})$. Assumption 3 ensures that the limit of $\hat M$ given by Algorithm 1 is well defined, which generally holds for minimal OOMs (see [10]).

Based on the above assumptions, we have the following theorem concerning the statistical consistency of the OOM learning algorithm (see Appendix A.1 for the proof):

Theorem 1.
Under Assumptions 1-3, there exists an OOM $M' = (\omega', \{\Xi'(x)\}_{x\in\mathcal{O}}, \sigma')$ which is equivalent to $\hat M$ and satisfies

$\sigma' \xrightarrow{p} \sigma$,  $\Xi'(x) \xrightarrow{p} \Xi(x)$, $\forall x \in \mathcal{O}$   (12)

This theorem is central to this paper: it implies that the spectral learning algorithm can achieve consistent estimation of all parameters of OOMs except the initial state vector, even for nonequilibrium data. ($\hat\omega \xrightarrow{p} \omega'$ does not hold in most cases, except when $\{x_t\}$ is stationary.) The theorem can be further generalized, according to requirements, to more complicated situations where, for example, observation trajectories are generated with multiple different initial conditions (see Appendix A.2).

4 Spectral learning of equilibrium OOMs

In this section, we highlight the application of spectral learning to the problem of recovering equilibrium properties of dynamic systems from nonequilibrium data, which is an important problem in practice, especially for thermodynamic and kinetic analysis in computational physics and chemistry.

4.1 Learning from discrete data

According to the definition of OOMs, the equilibrium dynamics of an OOM $M = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ can be described by an equilibrium OOM $M_{\mathrm{eq}} = (\omega_{\mathrm{eq}}, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ with

$\lim_{t\to\infty} P(x_{t+1:t+k} = z_{1:k} \mid M) = P(x_{1:k} = z_{1:k} \mid M_{\mathrm{eq}})$   (13)

if the equilibrium state vector

$\omega_{\mathrm{eq}} = \lim_{t\to\infty} \omega\,\Xi(\mathcal{O})^t$   (14)

exists.
From (13) and (14), we have

$\omega_{\mathrm{eq}}\Xi(\mathcal{O}) = \lim_{t\to\infty} \omega\,\Xi(\mathcal{O})^{t+1} = \omega_{\mathrm{eq}}$,  $\omega_{\mathrm{eq}}\sigma = \lim_{t\to\infty} \sum_{x\in\mathcal{O}} P(x_{t+1} = x) = 1$   (15)

This equilibrium constraint of OOMs motivates the following algorithm for learning equilibrium OOMs: perform Algorithm 1 to get $\hat\Xi(x)$ and $\hat\sigma$, and calculate $\hat\omega_{\mathrm{eq}}$ by solving the quadratic programming problem

$\hat\omega_{\mathrm{eq}} = \arg\min_{w \in \{w \mid w\hat\sigma = 1\}} \left\| w\hat\Xi(\mathcal{O}) - w \right\|^2$   (16)

(See Appendix A.3 for a closed-form expression of the solution to (16).)

The existence and uniqueness of $\omega_{\mathrm{eq}}$ are shown in Appendix A.3, which yields the following theorem:

Theorem 2. Under Assumptions 1-3, the estimated equilibrium OOM $\hat M_{\mathrm{eq}} = (\hat\omega_{\mathrm{eq}}, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$ provided by Algorithm 1 and Eq. (16) satisfies

$P(x_{1:l} = z_{1:l} \mid \hat M_{\mathrm{eq}}) \xrightarrow{p} \lim_{t\to\infty} P(x_{t+1:t+l} = z_{1:l})$   (17)

for all $l$ and $z_{1:l}$.

Remark 1. $\hat\omega_{\mathrm{eq}}$ can also be computed as an eigenvector of $\hat\Xi(\mathcal{O})$, but the eigenvalue problem can suffer from numerical instability and complex values because of statistical noise, unless specific feature functions $\varphi_1, \varphi_2$ are selected so that $\hat\omega_{\mathrm{eq}}\hat\Xi(\mathcal{O}) = \hat\omega_{\mathrm{eq}}$ can be solved exactly in the real field [29].

4.2 Learning from continuous data

A straightforward way to extend spectral algorithms to handle continuous data is based on coarse-graining of the observation space. Suppose that $\{x_t\}$ is a stochastic process in a continuous observation space $\mathcal{O} \subset \mathbb{R}^d$, and $\mathcal{O}$ is partitioned into $J$ discrete bins $\mathcal{B}_1, \ldots, \mathcal{B}_J$.
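As an aside on the discrete algorithm above: the quadratic program (16) is small (dimension $m$) and can be solved directly. The paper's own closed form lives in Appendix A.3 and is not reproduced here; the following is a generic sketch that solves the same constrained least-squares problem via its KKT system:

```python
import numpy as np

def equilibrium_state(Xi_total, sigma):
    """Solve Eq. (16): min_w ||w (Xi(O) - I)||^2  subject to  w sigma = 1.

    Xi_total is the matrix Xi(O) = sum_x Xi(x); sigma is the evaluation
    vector.  This is a sketch via the KKT linear system, equivalent in
    spirit to the closed-form solution of Appendix A.3 (which is not
    reproduced here).
    """
    m = Xi_total.shape[0]
    A = Xi_total - np.eye(m)       # at the exact solution, w A = 0
    G = A @ A.T                    # objective is the quadratic form w G w^T
    # KKT conditions: 2 w G + lam * sigma^T = 0  and  w sigma = 1
    K = np.zeros((m + 1, m + 1))
    K[:m, :m] = 2.0 * G
    K[:m, m] = sigma
    K[m, :m] = sigma
    rhs = np.zeros(m + 1)
    rhs[m] = 1.0
    # lstsq copes with the (typically singular) matrix G
    sol, *_ = np.linalg.lstsq(K, rhs, rcond=None)
    return sol[:m]                 # the row vector omega_eq
```

For instance, if $\hat\Xi(\mathcal{O})$ happens to be a row-stochastic transition matrix and $\hat\sigma = \mathbf{1}$, the result is its stationary distribution, consistent with the fixed-point constraint (15).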
Then we can utilize the algorithm in Section 4.1 to approximate the equilibrium transition dynamics between bins as

$\lim_{t\to\infty} P(x_{t+1} \in \mathcal{B}_{j_1}, \ldots, x_{t+l} \in \mathcal{B}_{j_l}) \approx \hat\omega_{\mathrm{eq}}\,\hat\Xi(\mathcal{B}_{j_1}) \cdots \hat\Xi(\mathcal{B}_{j_l})\,\hat\sigma$   (18)

and obtain a binned OOM $\hat M_{\mathrm{eq}} = (\hat\omega_{\mathrm{eq}}, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$ for the continuous dynamics of $\{x_t\}$ with

$\hat\Xi(x) = \frac{\hat\Xi(\mathcal{B}(x))}{\mathrm{vol}(\mathcal{B}(x))}$   (19)

by assuming the observable operator matrices to be piecewise constant on bins, where $\mathcal{B}(x)$ denotes the bin containing $x$ and $\mathrm{vol}(\mathcal{B})$ is the volume of $\mathcal{B}$. Conventional wisdom dictates that the number of bins is a key parameter of the coarse-graining strategy and should be chosen carefully to balance statistical noise against discretization error. However, we will show in what follows that it is justifiable to increase the number of bins to infinity.

Let us consider the limit case where $J \to \infty$ and the bins are infinitesimal with $\max_j \mathrm{vol}(\mathcal{B}_j) \to 0$. In this case,

$\hat\Xi(x) = \lim_{\mathrm{vol}(\mathcal{B}(x))\to 0} \frac{\hat\Xi(\mathcal{B}(x))}{\mathrm{vol}(\mathcal{B}(x))} = \begin{cases} \hat W_{s^2_n}\,\delta_{s^2_n}(x), & x = s^2_n \\ 0, & \text{otherwise} \end{cases}$   (20)

where

$\hat W_{s^2_n} = \frac{1}{N}\,F_1^\top \varphi_1(\vec s^{\,1}_n)\,\varphi_2(\vec s^{\,3}_n)^\top F_2$   (21)

according to (9) in Algorithm 1.
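The weight matrices (21) are cheap to assemble: one rank-one product per sample point. The sketch below does this for $L = 1$ with user-supplied feature maps and evaluates a one-step equilibrium-style average $\sum_n g(s^2_n)\,\hat\omega\,\hat W_{s^2_n}\,\hat\sigma$; for brevity it uses the plain state vector from (10), whereas the equilibrium version would substitute the solution of (16). Feature maps, data and model dimension below are illustrative assumptions:

```python
import numpy as np

def binless_oom(traj, phi1, phi2, m):
    """Binless spectral estimation, L = 1 sketch (Section 4.2 style).

    phi1/phi2 are user-chosen bounded feature maps (an assumption; any
    admissible features work).  Returns the state/evaluation vectors and
    the per-sample weight matrices W_n of Eq. (21), plus the sample
    points s2_n the weights are attached to.
    """
    prev, mid, nxt = traj[:-2], traj[1:-1], traj[2:]
    Phi_prev = np.array([phi1(x) for x in prev])   # phi1(s1_n)
    Phi_mid = np.array([phi2(x) for x in mid])     # phi2(s2_n)
    Phi_nxt = np.array([phi2(x) for x in nxt])     # phi2(s3_n)
    N = len(mid)
    mean1 = Phi_prev.mean(axis=0)
    mean2 = Phi_mid.mean(axis=0)
    C12 = Phi_prev.T @ Phi_mid / N                 # Eq. (3)
    U, S, Vt = np.linalg.svd(C12, full_matrices=False)
    F1 = U[:, :m] / S[:m]
    F2 = Vt[:m].T
    sigma = F1.T @ mean1                           # Eq. (8)
    omega = mean2 @ F2                             # Eq. (10); for equilibrium
                                                   # quantities, replace by the
                                                   # solution of Eq. (16)
    # Eq. (21): one m x m weight matrix per sample point s2_n
    W = [(F1.T @ np.outer(Phi_prev[n], Phi_nxt[n]) @ F2) / N for n in range(N)]
    return omega, W, sigma, mid

def average_over_samples(g, omega, W, sigma, points):
    # one-step average: sum_n g(s2_n) * omega W_n sigma
    return sum(g(z) * float(omega @ Wn @ sigma) for z, Wn in zip(points, W))
```

The cost is a single pass over the data plus one SVD of a $D_1 \times D_2$ matrix, which is the $O(N)$ behavior claimed for the binless algorithm.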
Then $\hat M_{\mathrm{eq}}$ becomes a binless OOM over the sample points $X = \{s^2_n\}_{n=1}^N$ and can be estimated from data by Algorithm 2, where the feature functions can be selected as indicator functions, radial basis functions, or other activation functions commonly used for single-layer neural networks, in order to digest adequate dynamic information from the observation data.

Algorithm 2 Procedure for learning binless equilibrium OOMs
INPUT: Observation trajectories generated by a stochastic process $\{x_t\}$ in $\mathcal{O} \subset \mathbb{R}^d$
OUTPUT: Binless OOM $\hat M = (\hat\omega, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$
1: Construct feature functions $\varphi_1: \mathbb{R}^{Ld} \mapsto \mathbb{R}^{D_1}$ and $\varphi_2: \mathbb{R}^{Ld} \mapsto \mathbb{R}^{D_2}$ with $D_1, D_2 \ge m$.
2: Calculate $\hat{\bar\varphi}_1$, $\hat{\bar\varphi}_2$, $\hat C_{1,2}$ by (2) and (3).
3: Compute $F_1 = U\Sigma^{-1} \in \mathbb{R}^{D_1\times m}$ and $F_2 = V \in \mathbb{R}^{D_2\times m}$ from the truncated singular value decomposition $\hat C_{1,2} \approx U\Sigma V^\top$.
4: Compute $\hat\sigma$, $\hat\omega$ and $\hat\Xi(x) = \sum_{z\in X} \hat W_z\,\delta_z(x)$ by (8), (16) and (21), where $\hat\Xi(\mathcal{O}) = \int_{\mathcal{O}} \mathrm{d}x\,\hat\Xi(x) = \sum_{z\in X} \hat W_z$.

The binless algorithm presented here can be efficiently implemented with a computational complexity linear in the data size, $O(N)$, and is applicable to more general cases where observations are strings, graphs or other structured variables. Unlike the other spectral algorithms for continuous data, it does not require that the observed dynamics coincide with some parametric model defined by the feature functions. Lastly but most importantly, as stated in the following theorem, this algorithm can be used to consistently extract static and kinetic properties of a dynamic system in equilibrium from nonequilibrium data (see Appendix A.3 for the proof):

Theorem 3.
Provided that the observation space $\mathcal{O}$ is a closed set in $\mathbb{R}^d$, the feature functions $\varphi_1, \varphi_2$ are bounded on $\mathcal{O}^L$, and Assumptions 1-3 hold, the binless OOM given by Algorithm 2 satisfies

$E[g(x_{1:r}) \mid \hat M_{\mathrm{eq}}] \xrightarrow{p} E_\infty[g(x_{t+1:t+r})]$   (22)

with

$E[g(x_{1:r}) \mid \hat M_{\mathrm{eq}}] = \sum_{z_{1:r}\in X^r} g(z_{1:r})\,\hat\omega\,\hat W_{z_1} \cdots \hat W_{z_r}\,\hat\sigma$   (23)

(i) for all continuous functions $g: \mathcal{O}^r \mapsto \mathbb{R}$;
(ii) for all bounded and Borel measurable functions $g: \mathcal{O}^r \mapsto \mathbb{R}$, if there exist positive constants $\bar\xi$ and $\xi$ so that $\|\Xi(x)\| \le \bar\xi$ and $\lim_{t\to\infty} P(x_{t+1:t+r} = z_{1:r}) \ge \xi$ for all $x \in \mathcal{O}$ and $z_{1:r} \in \mathcal{O}^r$.

4.3 Comparison with related methods

It is worth pointing out that the spectral learning investigated in this section is an ideal tool for analyzing the dynamic properties of stochastic processes, because the related quantities, such as stationary distributions, principal components and time-lagged correlations, can easily be computed from the parameters of discrete OOMs or binless OOMs. For many popular nonlinear dynamic models, including Gaussian process state-space models [17] and recurrent neural networks [19], the computation of such quantities is intractable or time-consuming.

The major disadvantage of spectral learning is that the estimated OOMs are usually only “approximately valid” and may assign “negative probabilities” to some observation sequences. It is therefore difficult to apply spectral methods to prediction, filtering and smoothing of signals, where Bayesian inference is involved.

5 Applications

In this section, we evaluate our algorithms on two diffusion processes and on the molecular dynamics of alanine dipeptide, and compare them to several alternatives.
The detailed settings of the simulations and algorithms are provided in Appendix B.

Brownian dynamics Let us consider a one-dimensional diffusion process driven by the Brownian dynamics

$\mathrm{d}x_t = -\nabla V(x_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}W_t$   (24)

with observations generated by

$y_t = \begin{cases} 1, & x_t \in \mathrm{I} \\ 0, & x_t \in \mathrm{II} \end{cases}$

Figure 2: Comparison of modeling methods for a one-dimensional diffusion process. (a) Potential function. (b) Estimates of the difference between the equilibrium probabilities of I and II given by the traditional OOM, HMM and the equilibrium OOM (EQ-OOM) obtained from the proposed algorithm with $\mathcal{O} = \{\mathrm{I}, \mathrm{II}\}$. (c) Estimates of the probability difference given by the empirical estimator, HMM and the proposed binless OOM with $\mathcal{O} = [0, 2]$. (d) Stationary histograms of $\{x_t\}$ with 100 uniform bins estimated from trajectories of length 50. The length of each trajectory is $T = 50 \sim 1000$ and the number of trajectories is $[10^5/T]$. Error bars are standard deviations over 30 independent experiments.

The potential function $V(x)$ is shown in Fig. 2(a) and contains two potential wells I, II. In this example, all simulations are started from a uniform distribution on $[0, 0.2]$, which implies that the simulations are highly nonequilibrium and that it is difficult to accurately estimate the equilibrium probabilities $\mathrm{Prob}_{\mathrm{I}} = E_\infty[1_{x_t\in\mathrm{I}}] = E_\infty[y_t]$ and $\mathrm{Prob}_{\mathrm{II}} = E_\infty[1_{x_t\in\mathrm{II}}] = 1 - E_\infty[y_t]$ of the two potential wells from the simulation data. We first utilize traditional spectral learning without the equilibrium constraint, expectation-maximization based HMM learning, and the proposed discrete spectral algorithm to estimate $\mathrm{Prob}_{\mathrm{I}}$ and $\mathrm{Prob}_{\mathrm{II}}$ based on $\{y_t\}$; the estimation results for different simulation lengths are summarized in Fig. 2(b).
It can be seen that, in contrast to the other methods, the spectral algorithm for equilibrium OOMs effectively reduces the statistical bias of the nonequilibrium data, and achieves statistically correct estimation at $T = 300$.

Figs. 2(c) and 2(d) plot estimates of the stationary distribution of $\{x_t\}$ obtained from $\{x_t\}$ directly, where the empirical estimator calculates statistics by averaging over all observations. In this case, the proposed binless OOM significantly outperforms the other methods, and its estimates are very close to the true values even for extremely short trajectories.

Fig. 3 provides an example of a two-dimensional diffusion process. The dynamics of this process can also be represented in the form of (24), and the potential function is shown in Fig. 3(a). The goal of this example is to estimate the first time-structure based independent component $w_{\mathrm{TICA}}$ [30] of this process from simulation data. Here $w_{\mathrm{TICA}}$ is a kinetic quantity of the process and is the solution to the generalized eigenvalue problem

$C_\tau w = \lambda C_0 w$

with the largest eigenvalue, where $C_0$ is the covariance matrix of $\{x_t\}$ in equilibrium and $C_\tau = E_\infty[x_t x_{t+\tau}^\top] - E_\infty[x_t]\,E_\infty[x_t]^\top$ is the equilibrium time-lagged covariance matrix. The simulation data are again nonequilibrium, with all simulations starting from the uniform distribution on $[-2, 0] \times [-2, 0]$. Fig. 3(b) displays the estimation errors of $w_{\mathrm{TICA}}$ obtained from the different learning methods, which also demonstrates the superiority of the binless spectral method.

Alanine dipeptide Alanine dipeptide is a small molecule which consists of two alanine amino acid units, and its configuration can be described by two backbone dihedral angles. Fig.
4(a) shows the potential profile of the alanine dipeptide with respect to the two angles, which contains five metastable states {I, II, III, IV, V}.

Figure 3: Comparison of modeling methods for a two-dimensional diffusion process. (a) Potential function. (b) Estimation error of $w_{\mathrm{TICA}} \in \mathbb{R}^2$ of the first TIC with lag time 100. The length of each trajectory is $T = 200 \sim 2500$ and the number of trajectories is $[10^5/T]$. Error bars are standard deviations over 30 independent experiments.

Figure 4: Comparison of modeling methods for molecular dynamics of alanine dipeptide. (a) Reduced free energy. (b) Estimation error of $\pi$, where the horizontal axis denotes the total simulation time $T \times I$. The length of each trajectory is $T = 10\,\mathrm{ns}$ and the number of trajectories is $I = 150 \sim 1500$. Error bars are standard deviations over 30 independent experiments.

We perform multiple short molecular dynamics simulations starting from the metastable state IV, where each simulation is of length 10 ns, and utilize different methods to approximate the stationary distribution $\pi = (\mathrm{Prob}_{\mathrm{I}}, \mathrm{Prob}_{\mathrm{II}}, \ldots, \mathrm{Prob}_{\mathrm{V}})$ of the five metastable states. As shown in Fig. 4(b), the proposed binless algorithm yields a lower estimation error than each of the alternatives.

6 Conclusion

In this paper, we investigated the statistical properties of the general spectral learning procedure for nonequilibrium data, and developed novel spectral methods for learning equilibrium dynamics from nonequilibrium (discrete or continuous) data. The main ideas of the presented methods are to correct the model parameters by the equilibrium constraint and to handle continuous observations in a binless manner.
Interesting directions of future research include the analysis of the approximation error at finite data size and applications to controlled systems.

Acknowledgments

This work was funded by Deutsche Forschungsgemeinschaft (SFB 1114) and the European Research Council (starting grant “pcCells”).

References

[1] H. Jaeger, “Observable operator models for discrete stochastic time series,” Neural Comput., vol. 12, no. 6, pp. 1371–1398, 2000.

[2] M.-J. Zhao, H. Jaeger, and M. Thon, “A bound on modeling error in observable operator models and an associated learning algorithm,” Neural Comput., vol. 21, no. 9, pp. 2687–2712, 2009.

[3] H. Jaeger, “Discrete-time, discrete-valued observable operator models: a tutorial,” tech. rep., International University Bremen, 2012.

[4] M. L. Littman, R. S. Sutton, and S. Singh, “Predictive representations of state,” in Adv. Neural. Inf. Process. Syst. 14 (NIPS 2001), pp. 1555–1561, 2001.

[5] S. Singh, M. James, and M. Rudary, “Predictive state representations: A new theory for modeling dynamical systems,” in Proc. 20th Conf. Uncertainty Artif. Intell. (UAI 2004), pp. 512–519, 2004.

[6] E. Wiewiora, “Learning predictive representations from a history,” in Proc. 22nd Intl. Conf. on Mach. Learn. (ICML 2005), pp. 964–971, 2005.

[7] D. Hsu, S. M. Kakade, and T. Zhang, “A spectral algorithm for learning hidden Markov models,” in Proc. 22nd Conf. Learning Theory (COLT 2009), 2009.

[8] S. Siddiqi, B. Boots, and G. Gordon, “Reduced-rank hidden Markov models,” in Proc. 13th Intl. Conf. Artif. Intell. Stat. (AISTATS 2010), vol. 9, pp. 741–748, 2010.

[9] A. Beimel, F. Bergadano, N. H. Bshouty, E. Kushilevitz, and S.
Varricchio, “Learning functions represented as multiplicity automata,” J. ACM, vol. 47, no. 3, pp. 506–530, 2000.

[10] M. Thon and H. Jaeger, “Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework,” J. Mach. Learn. Res., vol. 16, pp. 103–147, 2015.

[11] J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé, “Markov models of molecular kinetics: Generation and validation,” J. Chem. Phys., vol. 134, p. 174105, 2011.

[12] G. R. Bowman, V. S. Pande, and F. Noé, An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation. Springer, 2013.

[13] A. Ruttor, P. Batz, and M. Opper, “Approximate Gaussian process inference for the drift function in stochastic differential equations,” in Adv. Neural. Inf. Process. Syst. 26 (NIPS 2013), pp. 2040–2048, 2013.

[14] N. Schaudinnus, B. Bastian, R. Hegger, and G. Stock, “Multidimensional Langevin modeling of nonoverdamped dynamics,” Phys. Rev. Lett., vol. 115, no. 5, p. 050602, 2015.

[15] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[16] F. Noé, H. Wu, J.-H. Prinz, and N. Plattner, “Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules,” J. Chem. Phys., vol. 139, p. 184114, 2013.

[17] R. D. Turner, M. P. Deisenroth, and C. E. Rasmussen, “State-space inference and learning with Gaussian processes,” in Proc. 13th Intl. Conf. Artif. Intell. Stat. (AISTATS 2010), pp. 868–875, 2010.

[18] A. Svensson, A. Solin, S. Särkkä, and T. B. Schön, “Computationally efficient Bayesian learning of Gaussian process state space models,” in Proc. 19th Intl.
Conf. Artif. Intell. Stat. (AISTATS 2016), pp. 213–221, 2016.

[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[20] H. Wu, J.-H. Prinz, and F. Noé, “Projected metastable Markov processes and their estimation with observable operator models,” J. Chem. Phys., vol. 143, no. 14, p. 144101, 2015.

[21] M. Shirts and V. S. Pande, “Screen savers of the world unite,” Science, vol. 290, pp. 1903–1904, 2000.

[22] T.-K. Huang and J. Schneider, “Spectral learning of hidden Markov models from dynamic and static data,” in Proc. 30th Intl. Conf. on Mach. Learn. (ICML 2013), pp. 630–638, 2013.

[23] M. K. Cowles and B. P. Carlin, “Markov chain Monte Carlo convergence diagnostics: a comparative review,” J. Am. Stat. Assoc., vol. 91, no. 434, pp. 883–904, 1996.

[24] N. Jiang, A. Kulesza, and S. Singh, “Improving predictive state representations via gradient descent,” in Proc. 30th AAAI Conf. Artif. Intell. (AAAI 2016), 2016.

[25] H. Jaeger, “Modeling and learning continuous-valued stochastic processes with OOMs,” Tech. Rep. GMD-102, German National Research Center for Information Technology (GMD), 2001.

[26] B. Boots, S. M. Siddiqi, G. Gordon, and A. Smola, “Hilbert space embeddings of hidden Markov models,” in Proc. 27th Intl. Conf. on Mach. Learn. (ICML 2010), 2010.

[27] M. Rosencrantz, G. Gordon, and S. Thrun, “Learning low dimensional predictive representations,” in Proc. 21st Intl. Conf. on Mach. Learn. (ICML 2004), pp. 88–95, ACM, 2004.

[28] B. Boots, Spectral Approaches to Learning Predictive Representations. PhD thesis, Carnegie Mellon University, 2012.

[29] H. Jaeger, M. Zhao, and A. Kolling, “Efficient estimation of OOMs,” in Adv. Neural. Inf. Process. Syst. 18 (NIPS 2005), pp. 555–562, 2005.

[30] G.
Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, “Identification of slow molecular order parameters for Markov model construction,” J. Chem. Phys., vol. 139, no. 1, p. 015102, 2013.", "award": [], "sourceid": 2070, "authors": [{"given_name": "Hao", "family_name": "Wu", "institution": "Free University of Berlin"}, {"given_name": "Frank", "family_name": "Noe", "institution": "FU Berlin"}]}