{"title": "Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability", "book": "Advances in Neural Information Processing Systems", "page_first": 189, "page_last": 197, "abstract": "Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning.  We experiment with manifold embeddings as the reconstructed observable state-space of an off-line, model-based reinforcement learning approach to control. We demonstrate the embedding of a system changes as a result of learning and that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system.  We apply this approach in simulation to learn a neurostimulation policy that is more efficient in treating epilepsy than conventional policies.  We then demonstrate the learned policy completely suppressing seizures in real-world neurostimulation experiments on actual animal brain slices.", "full_text": "Manifold Embeddings for Model-Based\n\nReinforcement Learning under Partial Observability\n\nKeith Bush\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nJoelle Pineau\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nkbush@cs.mcgill.ca\n\njpineau@cs.mcgill.ca\n\nAbstract\n\nInteresting real-world datasets often exhibit nonlinear, noisy, continuous-valued\nstates that are unexplorable, are poorly described by \ufb01rst principles, and are only\npartially observable. If partial observability can be overcome, these constraints\nsuggest the use of model-based reinforcement learning. We experiment with man-\nifold embeddings to reconstruct the observable state-space in the context of off-\nline, model-based reinforcement learning. 
We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices.

1 Introduction
The accessibility of large quantities of off-line discrete-time dynamic data—state-action sequences drawn from real-world domains—represents an untapped opportunity for widespread adoption of reinforcement learning. By real-world we imply domains that are characterized by continuous state, noise, and partial observability. Barriers to making use of this data include: 1) goals (rewards) are not well-defined, 2) exploration is expensive (or not permissible), and 3) the data does not preserve the Markov property. If we assume that the reward function is part of the problem description, then to learn from this data we must ensure the Markov property is preserved before we approximate the optimal policy with respect to the reward function in a model-free or model-based way.

For many domains, particularly those governed by differential equations, we may leverage the inductive bias of locality during function approximation to satisfy the Markov property. When applied to model-free reinforcement learning, function approximation typically assumes that the value function maps nearby states to similar expectations of future reward. As part of model-based reinforcement learning, function approximation additionally assumes that similar actions map to nearby future states from nearby current states [10].
Impressive performance and scalability of local model-based approaches [1, 2] and global model-free approaches [6, 17] have been achieved by exploiting the locality of dynamics in fully observable state-space representations of challenging real-world problems.

In partially observable systems, however, locality is not preserved without additional context. First principle models offer some guidance in defining local dynamics, but the existence of known first principles cannot always be assumed. Rather, we desire a general framework for reconstructing state-spaces of partially observable systems which guarantees the preservation of locality. Nonlinear dynamic analysis has long used manifold embeddings to reconstruct locally Euclidean state-spaces of unforced, partially observable systems [24, 18] and has identified ways of finding these embeddings non-parametrically [7, 12]. Dynamicists have also used embeddings as generative models of partially observable unforced systems [16] by numerically integrating over the resultant embedding.

Recent advances have extended the theory of manifold embeddings to encompass deterministically and stochastically forced systems [21, 22].

A natural next step is to apply these latest theoretical tools to reconstruct and control partially observable forced systems. We do this by first identifying an appropriate embedding for the system of interest and then leveraging the resultant locality to perform reinforcement learning in a model-based way.
We believe it may be more practical to address reinforcement learning under partial observability in a model-based way because doing so facilitates reasoning about domain knowledge and off-line validation of the embedding parameters.

The primary contribution of this paper is to formally combine and empirically evaluate these existing, but not well-known, methods by incorporating them in off-line, model-based reinforcement learning of two domains. First, we study the use of embeddings to learn control policies in a partially observable variant of the well-known Mountain Car domain. Second, we demonstrate the embedding-driven, model-based technique to learn an effective and efficient neurostimulation policy for the treatment of epilepsy. The neurostimulation example is important because it resides among the hardest classes of learning domain—a continuous-valued state-space that is nonlinear, partially observable, prohibitively expensive to explore, noisy, and governed by dynamics that are currently not well-described by mathematical models drawn from first principles.

2 Methods
In this section we combine reinforcement learning, partial observability, and manifold embeddings into a single mathematical formalism. We then describe non-parametric means of identifying the manifold embedding of a system and how the resultant embedding may be used as a local model.

2.1 Reinforcement Learning
Reinforcement learning (RL) is a class of problems in which an agent learns an optimal solution to a multi-step decision task by interacting with its environment [23]. Many RL algorithms exist, but we will focus on the Q-learning algorithm.

Consider an environment (i.e. forced system) having a state vector, s ∈ R^M, which evolves according to a nonlinear differential equation but is discretized in time and integrated numerically according to the map, f.
Consider an agent that interacts with the environment by selecting action, a, according to a policy function, π. Consider also that there exists a reward function, g, which informs the agent of the scalar goodness of taking an action with respect to the goal of some multi-step decision task. Thus, for each time, t,

    a(t) = π(s(t)),                       (1)
    s(t + 1) = f(s(t), a(t)), and         (2)
    r(t + 1) = g(s(t), a(t)).             (3)

RL is the process of learning the optimal policy function, π*, that maximizes the expected sum of future rewards, termed the optimal action-value function or Q-function, Q*, such that,

    Q*(s(t), a(t)) = r(t + 1) + γ max_a Q*(s(t + 1), a),    (4)

where γ is the discount factor on [0, 1). Equation 4 assumes that Q* is known. Without a priori knowledge of Q*, an approximation, Q, must be constructed iteratively. Assume the current estimate, Q, of the optimal Q-function, Q*, contains error, δ,

    δ(t) = r(t + 1) + γ max_a Q(s(t + 1), a) − Q(s(t), a(t)),

where δ(t) is termed the temporal difference error or TD-error. The TD-error can be used to improve the approximation of Q by

    Q(s(t), a(t)) = Q(s(t), a(t)) + α δ(t),    (5)

where α is the learning rate. By selecting the action a that maximizes the current estimate of Q, Q-learning specifies that over many applications of Equation 5, Q approaches Q*.

2.2 Manifold Embeddings for Reinforcement Learning Under Partial Observability
Q-learning relies on complete state observability to identify the optimal policy. Nonlinear dynamic systems theory provides a means of reconstructing complete state observability from incomplete state via the method of delayed embeddings, formalized by Takens' Theorem [24].
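Before turning to the embedding machinery, the Q-learning update of Equations 4 and 5 can be made concrete. The following is a minimal tabular sketch, assuming discrete (or discretized) states and actions; the function name and the dictionary representation of Q are illustrative, not the authors' implementation.

```python
# One Q-learning step (Equations 4-5): compute the TD-error delta and
# nudge Q(s, a) toward the bootstrapped target by the learning rate alpha.
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one TD update to the tabular estimate Q, keyed by (state, action)."""
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta
```

Repeated application of this update over recorded (or simulated) transitions is exactly the batch Q-learning used in the later sections, with the transition source swapped for the local model.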
Here we present the key points of Takens' Theorem utilizing the notation of Huke [8] in a deterministically forced system.

Assume s is an M-dimensional, real-valued, bounded vector space and a is a real-valued action input to the environment. Assuming that the state update f and the policy π are deterministic functions, Equation 1 may be substituted into Equation 2 to compose a new function, φ,

    s(t + 1) = f(s(t), π(s(t))) = φ(s(t)),    (6)

which specifies the discrete time evolution of the agent acting on the environment. If φ is a smooth map φ : R^M → R^M and this system is observed via a function, y, such that

    s̃(t) = y(s(t)),    (7)

where y : R^M → R, then if φ is invertible, φ⁻¹ exists, and φ, φ⁻¹, and y are continuously differentiable, we may apply Takens' Theorem [24] to reconstruct the complete state-space of the observed system. Thus, for each s̃(t), we can construct a vector sE(t),

    sE(t) = [s̃(t), s̃(t − 1), ..., s̃(t − (E − 1))], E > 2M,    (8)

such that sE lies on a subset of R^E which is an embedding of s. Because embeddings preserve the connectivity of the original vector-space, in the context of RL the mapping ψ,

    sE(t + 1) = ψ(sE(t)),    (9)

may be substituted for f (Eqn. 6) and vectors sE(t) may be substituted for corresponding vectors s(t) in Equations 1–5 without loss of generality.

2.3 Non-parametric Identification of Manifold Embeddings
Takens' Theorem does not define how to compute the embedding dimension of arbitrary sequences of observations, nor does it provide a test to determine if the theorem is applicable. In general, the intrinsic dimension, M, of a system is unknown.
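The delay-coordinate construction of Equation 8 is mechanical and can be sketched directly. This is a minimal NumPy sketch, assuming a scalar observation sequence and an integer lag (unit lag by default); the function name is illustrative.

```python
# Delay-coordinate embedding (Equation 8): each row t stacks the current
# observation with E-1 lagged copies, [obs[t], obs[t-tau], ..., obs[t-(E-1)tau]].
import numpy as np

def delay_embed(obs, E, tau=1):
    """Return the matrix of delay vectors s_E(t) for t >= (E-1)*tau."""
    start = (E - 1) * tau
    return np.stack([[obs[t - j * tau] for j in range(E)]
                     for t in range(start, len(obs))])
```

For E > 2M the rows of this matrix lie on an embedding of the underlying state-space, which is what licenses substituting sE(t) for s(t) in Equations 1–5.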
Finding high-quality embedding parameters of challenging domains, such as chaotic and noise-corrupted nonlinear signals, occupies much of the fields of subspace identification and nonlinear dynamic analysis. Numerous methods of note exist, drawn from both disciplines. We employ a spectral approach [7]. This method, premised on the singular value decomposition (SVD), is non-parametric, computationally efficient, and robust to additive noise—all of which are useful in practical application. As will be seen in succeeding sections, this method finds embeddings which are both accurate in theoretical tests and useful in practice.

We summarize the spectral parameter selection algorithm as follows. Given a sequence of state observations s̃ of length S̃, we choose a sufficiently large fixed embedding dimension, Ê. Sufficiently large refers to a cardinality of dimension which is certain to be greater than twice the dimension in which the actual state-space resides. For each embedding window size, T̂min ∈ {Ê, ..., S̃}, we: 1) define a matrix S_Ê having row vectors, s_Ê(t), t ∈ {T̂min, ..., S̃}, constructed according to the rule,

    s_Ê(t) = [s̃(t), s̃(t − τ), ..., s̃(t − (Ê − 1)τ)],    (10)

where τ = T̂min/(Ê − 1); 2) compute the SVD of the matrix S_Ê; and 3) record the vector of singular values, σ(T̂min). Embedding parameters of s̃ are found by analysis of the second singular values, σ₂(T̂min), T̂min ∈ {Ê, ..., S̃}. The T̂min value of the first local maximum of this sequence is the approximate embedding window, Tmin, of s̃.
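The window-selection loop just described (steps 1–3, Equation 10) can be sketched as follows. This is a NumPy sketch under simplifying assumptions: the lag τ is rounded to an integer number of samples, and the window is taken at the first local maximum of the second singular value; all names are illustrative, not the authors' code.

```python
# Spectral parameter selection (Section 2.3): for each candidate window
# T_min, build the delay matrix at fixed dimension E_hat with lag
# tau = T_min / (E_hat - 1), take its SVD, and record the singular values.
import numpy as np

def singular_spectrum(obs, E_hat, windows):
    """Return {T_min: singular values of the delay matrix S_E_hat}."""
    spectra = {}
    for T in windows:
        tau = max(1, T // (E_hat - 1))   # integer lag, a sketch-level shortcut
        start = (E_hat - 1) * tau
        S = np.stack([[obs[t - j * tau] for j in range(E_hat)]
                      for t in range(start, len(obs))])
        spectra[T] = np.linalg.svd(S, compute_uv=False)
    return spectra

def embedding_window(spectra):
    """T_min at the first local maximum of the second singular value."""
    Ts = sorted(spectra)
    s2 = [spectra[T][1] for T in Ts]
    for i in range(1, len(s2) - 1):
        if s2[i - 1] < s2[i] > s2[i + 1]:
            return Ts[i]
    return Ts[-1]   # fallback if the sequence is monotone over the scan
```

The approximate embedding dimension then follows from counting the singular values of σ(Tmin) that sit above the long-term trend, as described next.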
The approximate embedding dimension, E, is the number of non-trivial singular values of σ(Tmin), where we define non-trivial as a value greater than the long-term trend of σ_Ê with respect to T̂min. Embedding s̃ according to Equation 10 via parameters E and Tmin yields the matrix S_E of row vectors, sE(t), t ∈ {Tmin, ..., S̃}.

2.4 Generative Local Models from Embeddings
The preservation of locality and dynamics afforded by the embedding allows an approximation of the underlying dynamic system. To model this space we assume that the derivative of the Voronoi region surrounding each embedded point is well-approximated by the derivative at the point itself, a nearest-neighbors derivative [16]. Using this, we simulate trajectories as iterative numerical integration of the local state and gradient. We define the model and integration process formally.

Consider a dataset D as a set of temporally aligned sequences of state observations s̃(t), action observations a(t), and reward observations r(t), t ∈ {1, ..., S̃}. Applying the spectral embedding method to D yields a sequence of vectors sE(t) in R^E indexed by t ∈ {Tmin, ..., S̃}. A local model M of D is the set of 3-tuples, m(t) = {sE(t), a(t), r(t)}, t ∈ {Tmin, ..., S̃}, as well as operations on these tuples: A(m(t)) ≡ a(t), S(m(t)) ≡ sE(t), Z(m(t)) ≡ z(t) where z(t) = [sE(t), a(t)], and U(M, a) ≡ M_a where M_a is the subset of tuples in M containing action a.

Consider a state vector x(i) in R^E indexed by simulation time, i. To numerically integrate this state we define the gradient according to our definition of locality, namely the nearest neighbor. This step is defined differently for models having discrete and continuous actions.
The model's nearest neighbor of x(i) when taking action a(i) is defined in the case of a discrete set of actions, A, according to Equation 11, and in the continuous case it is defined by Equation 12:

    m(t_x(i)) = argmin_{m(t) ∈ U(M, a(i))} ‖S(m(t)) − x(i)‖, a ∈ A,    (11)

    m(t_x(i)) = argmin_{m(t) ∈ M} ‖Z(m(t)) − [x(i), ω a(i)]‖, a ∈ R,    (12)

where ω is a scaling parameter on the action space. The model gradient and numerical integration are defined, respectively, as,

    ∇x(i) = S(m(t_x(i) + 1)) − S(m(t_x(i))) and    (13)

    x(i + 1) = x(i) + Δi (∇x(i) + η),    (14)

where η is a vector of noise and Δi is the integration step-size. Applying Equations 11–14 iteratively simulates a trajectory of the underlying system, termed a surrogate trajectory. Surrogate trajectories are initialized from state x(0). Equation 14 assumes that dataset D contains noise. This noise biases the derivative estimate in R^E, via the embedding rule (Eqn. 10). In practice, a small amount of additive noise facilitates generalization.

2.5 Summary of Approach
Our approach is to combine the practices of dynamic analysis and RL to construct useful policies in partially observable, real-world domains via off-line learning. Our meta-level approach is divided into two phases: the modeling phase and the learning phase.

We perform the modeling phase in steps: 1) record a partially observable system (and its rewards) under the control of a random policy or some other policy or set of policies that include observations of high reward value; 2) identify good candidate parameters for the embedding via the spectral embedding method; and 3) construct the embedding vectors and define the local model of the system.

During the learning phase, we identify the optimal policy on the local model with respect to the rewards, R(m(t)) ≡ r(t), via batch Q-learning.
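The local model and surrogate-trajectory integration of Equations 11, 13, and 14 can be sketched for the discrete-action case. This is a minimal NumPy sketch, assuming a unit integration step (Δi = 1) and a brute-force nearest-neighbor search; the function name, argument layout, and noise handling are illustrative, not the authors' implementation.

```python
# Nearest-neighbor local model (Equations 11, 13-14): each embedded point
# stores its action; the gradient at x is the one-step difference at the
# nearest stored neighbor among tuples sharing the chosen action, U(M, a).
import numpy as np

def surrogate_trajectory(SE, actions, x0, policy, steps, eta=0.0, rng=None):
    """Integrate x(i+1) = x(i) + (gradient at nearest neighbor + noise)."""
    rng = rng or np.random.default_rng(0)
    x, path = np.array(x0, float), []
    for _ in range(steps):
        a = policy(x)
        idx = [t for t in range(len(SE) - 1) if actions[t] == a]   # U(M, a)
        t_near = min(idx, key=lambda t: np.linalg.norm(SE[t] - x))  # Eq. 11
        grad = SE[t_near + 1] - SE[t_near]                          # Eq. 13
        x = x + grad + eta * rng.standard_normal(len(x))            # Eq. 14
        path.append(x)
    return np.stack(path)
```

A small positive eta plays the role of the additive noise the text recommends for generalization; with eta = 0 the surrogate trajectory simply replays locally stitched pieces of the recorded dynamics.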
In this work we consider strictly local function approximation of the model and Q-function; thus, we define the Q-function as a set of values, Q, indexed by the model elements, Q(m), m ∈ M. For a state vector x(i) in R^E at simulation time i, and an associated action, a(i), the reward and Q-value of this state can be indexed by either Equation 11 or 12, depending on whether the action is discrete or continuous. Note, our technique does not preclude the use of non-local function approximation, but here we assume a sufficient density of data exists to reconstruct the embedded state-space with minimal bias.

3 Case Study: Mountain Car
The Mountain Car problem is a second-order, nonlinear dynamic system with low-dimensional, continuous-valued state and action spaces. This domain is perhaps the most studied continuous-valued RL domain in the literature, but, surprisingly, there is little study of the problem in the case where the velocity component of state is unobserved.
While not a real-world domain as imagined in the introduction, Mountain Car provides a familiar benchmark to evaluate our approach.

Figure 1: Learning experiments on Mountain Car under partial observability. (a) Embedding spectrum and accompanying trajectory (E = 3, Tmin = 0.70 sec.) under the random policy. (b) Learning performance as a function of embedding parameters (Tmin ∈ {0.20, 0.70, 1.20, 1.70, 2.20} sec.) and quantity of training samples for E = 2, compared against the maximum, best, and random policies. (c) The same learning performance comparison for E = 3. (d) Embedding spectrum and accompanying trajectory for the learned policy.

We use the Mountain Car dynamics and boundaries of Sutton and Barto [23]. We fix the initial state for all experiments (and resets) to be the lowest point of the mountain domain with zero velocity, which requires the longest path-to-goal in the optimal policy.
Only the position element of the state is observable. During the modeling phase, we record this domain under a random control policy for 10,000 time-steps (Δt = 0.05 seconds), where the action is changed every Δt = 0.20 seconds. We then compute the spectral embedding of the observations (Tmin = [0.20, 9.95] sec., ΔTmin = 0.25 sec., and Ê = 5). The resulting spectrum is presented in Figure 1(a). We conclude that the embedding of Mountain Car under the random policy requires dimension E = 3 with a maximum embedding window of Tmin = 1.70 seconds.

To evaluate learning phase outcomes with respect to modeling phase outcomes, we perform an experiment where we model the randomly collected observations using embedding parameters drawn from the product of the sets Tmin = {0.20, 0.70, 1.20, 1.70, 2.20} seconds and E = {2, 3}. While we fix the size of the local model to 10,000 elements, we vary the total amount of training samples observed from 10,000 to 200,000 at intervals of 10,000. We use batch Q-learning to identify the optimal policy in a model-based way—in Equation 5 the transition between the state-action pair and the resulting state-reward pair is drawn from the model (η = 0.001). After learning converges, we execute the learned policy on the real system for 10,000 time-steps, recording the mean path-to-goal length over all goals reached. Each configuration is executed 30 times.

We summarize the results of these experiments by log-scale plots, Figures 1(b) and (c), for embeddings of dimension two and three, respectively. We compare learning performance against three measures: the maximum performing policy achievable given the dynamics of the system (path-to-goal = 63 steps), the best (99th percentile) learned policy for each quantity of training data for each embedding dimension, and the random policy.
Learned performance is plotted as linear regression fits of the data.

Policy performance results of Figures 1(b) and (c) may be summarized by the following observations. Performance positively relates to the quantity of off-line training data for all embedding parameters. Except for the configuration (E = 2, Tmin = 0.20), the influence of Tmin on learning performance relative to E is small. Learning performance of 3-dimensional embeddings dominates all but the shortest 2-dimensional embeddings. These observations indicate that the parameters of the embedding ultimately determine the effectiveness of RL under partial observability. This is not surprising. What is surprising is that the best performing parameter configurations are linked to dynamic characteristics of the system under both a random policy and the learned policy.

To support this claim we collected 1,000 sample observations of the best policy (E = 3, Tmin = 0.70 sec., Ntrain = 200,000) during control of the real Mountain Car domain (path-to-goal = 79 steps). We computed and plotted the embedding spectrum and first two dimensions of the embedding in Figure 1(d). We compare these results to similar plots for the random policy in Figure 1(a). We observe that the spectrum of the learned system has shifted such that the optimal embedding parameters require a shorter embedding window, Tmin = 0.70–1.20 sec., and a lower embedding dimension, E = 2 (i.e., σ2 peaks at Tmin = 0.70–1.20 and σ3 falls below the trend of σ5 at this window length). We confirm this by observing the embedding directly, Figure 1(d). Unlike the random policy, which includes both an unstable spiral fixed point and limit cycle structure and requires a 3-dimensional embedding to preserve locality, the learned policy exhibits a 2-dimensional unstable spiral fixed point.
Thus, the fixed-point structure (embedding structure) of the combined policy-environment system changes during learning.

To reinforce this claim, we consider the difference between a 2-dimensional and a 3-dimensional embedding. An agent may learn to project into a 2-dimensional plane of the 3-dimensional space, thus decreasing its embedding dimension if the training data supports a 2-dimensional policy. We believe it is no accident that (E = 3, Tmin = 0.70) is the best performing configuration across all quantities of training data. This configuration can represent both 3-dimensional and 2-dimensional policies, depending on the amount of training data available. It can also select between 2-dimensional embeddings having window sizes of Tmin = {0.35, 0.70} sec., depending on whether the second or third dimension is projected out. One resulting parameter configuration (E = 2, Tmin = 0.35) is near the optimal 2-dimensional configuration of Figure 1(b).

4 Case Study: Neurostimulation Treatment of Epilepsy
Epilepsy is a common neurological disorder which manifests itself, electrophysiologically, in the form of intermittent seizures—intense, synchronized firing of neural populations. Researchers now recognize seizures as artifacts of abnormal neural dynamics and rely heavily on the nonlinear dynamic systems analysis and control literature to understand and treat seizures [4]. Promising techniques have emerged from this union. For example, fixed-frequency electrical stimulation of slices of the rat hippocampus under artificially induced epilepsy has been demonstrated to suppress the frequency, duration, or amplitude of seizures [9, 5]. Next generation epilepsy treatments, derived from machine learning, promise maximal seizure suppression via minimal electrical stimulation by adapting control policies to patients' unique neural dynamics.
Barriers to constructing these treatments arise from a lack of first-principles understanding of epilepsy. Without first principles, neuroscientists have only vague notions of what effective neurostimulation treatments should look like. Even if effective policies could be envisioned, exploration of the vast space of policy parameters is impractical without computational models.

Our specific control problem is defined as follows. Given labeled field potential recordings of brain slices under fixed-frequency electrical stimulation policies of 0.5, 1.0, and 2.0 Hz, as well as unstimulated control data, similar to the time-series depicted in Figure 2(a), we desire to learn a stimulation policy that suppresses seizures of a real, previously unseen, brain slice with an effective mean frequency (number of stimulations divided by the time the policy is active) of less than 1.0 Hz (1.0 Hz is currently known to be the most robust suppression policy for the brain slice model we use [9, 5]). As a further complication, on-line exploration is extremely expensive because the brain slices are experimentally viable for periods of less than 2 hours.

Again, we approach this problem as separate modeling and learning phases. We first compute the embedding spectrum of our dataset assuming Ê = 15, presented in Figure 2(b). Using our knowledge of the interaction between embedding parameters and learning, we select the embedding dimension E = 3 and embedding window Tmin = 1.05 seconds. Note, the strong maximum of σ2 at Tmin = 110 seconds is the result of the periodicity of seizures in our small training dataset. Periodicity of spontaneous seizure formation, however, varies substantially between slices.
We select a shorter embedding window and rely on integration of the local model to unmask long-term dynamics.

Figure 2: Graphical summary of the modeling phase of our adaptive neurostimulation study. (a) Sample field potential observations from the fixed-frequency stimulation dataset (control, 0.5 Hz, 1 Hz, and 2 Hz). Seizures are labeled with horizontal lines. (b) The embedding spectrum (σ1, σ2, σ3 versus Tmin) of the fixed-frequency stimulation dataset. The large maximum of σ2 at approximately 100 sec. is an artifact of the periodicity of seizures in the dataset. *Detail of the embedding spectrum for Tmin = [0.05, 2.0] depicting a maximum of σ2 at the time-scale of individual stimulation events. (c) The resultant neurostimulation model, plotted by principal components, constructed from embedding the dataset with parameters (E = 3, Tmin = 1.05 sec.).
Note, the model has been desampled 5× in the plot.

In this complex domain we apply the spectral method differently than described in Section 2. Rather than building the model directly from the embedding (E = 3, Tmin = 1.05), we perform a change of basis on the embedding (Ê = 15, Tmin = 1.05), using the first three columns of the right singular vectors, analogous to projecting onto the principal components. This embedding is plotted in Figure 2(c). Also, unlike the previous case study, we convert stimulation events in the training data from discrete frequencies to a continuous scale of time-elapsed-since-stimulation. This allows us to combine all of the data into a single state-action space and then simulate any arbitrary frequency. Based on exhaustive closed-loop simulations of fixed-frequency suppression efficacy across a spectrum of [0.001, 2.0] Hz, we constrain the model's action set to the discrete frequencies a = {2.0, 0.25} Hz in the hope of easing the learning problem. We then perform batch Q-learning over the model (Δt = 0.05, ω = 0.1, and η = 0.00001), using discount factor γ = 0.9. We structure the reward function to penalize each electrical stimulation by −1 and each visited seizure state by −20.

Without stimulation, seizure states comprise 25.6% of simulation states. Under a 1.0 Hz fixed-frequency policy, stimulation events comprise 5.0% and seizures comprise 6.8% of the simulation states. The policy learned by the agent also reduces the percent of seizure states to 5.2% of simulation states while stimulating only 3.1% of the time (effective frequency equals 0.62 Hz). In simulation, therefore, the learned policy achieves the goal.

We then deployed the learned policy on real brain slices to test on-line seizure suppression performance. The policy was tested over four trials on two unique brain slices extracted from the same animal.
The effective frequencies of these four trials were {0.65, 0.64, 0.66, 0.65} Hz. In all trials seizures were effectively suppressed after a short transient period, during which the policy and slice achieved equilibrium. (Note: seizures occurring at the onset of stimulation are common artifacts of neurostimulation.) Figure 3 displays two of these trials spaced over four sequential phases: (a) a control (no stimulation) phase used to determine baseline seizure activity, (b) a learned policy trial lasting 1,860 seconds, (c) a recovery phase to ensure slice viability after stimulation and to recompute baseline seizure activity, and (d) a learned policy trial lasting 2,130 seconds.

Figure 3: Field potential trace of a real seizure suppression experiment using a policy learned from simulation. Seizures are labeled as horizontal lines above the traces. Stimulation events are marked by vertical bars below the traces. (a) A control phase used to determine baseline seizure activity. (b) The initial application of the learned policy. (c) A recovery phase to ensure slice viability after stimulation and recompute baseline seizure activity. (d) The second application of the learned policy. *10 minutes of trace are omitted while the algorithm was reset.

5 Discussion and Related Work
The RL community has long studied low-dimensional representations to capture complex domains. Approaches for efficient function approximation, basis function construction, and discovery of embeddings have been the topic of significant investigations [3, 11, 20, 15, 13]. Most of this work has been limited to the fully observable (MDP) case and has not been extended to partially observable environments.
The question of state space representation in partially observable domains was tackled under the POMDP framework [14] and, more recently, in the PSR framework [19]. These methods address a similar problem but have been limited primarily to discrete action and observation spaces. The PSR framework was extended to continuous (nonlinear) domains [25]. That method differs significantly from our work, both in the class of representations it considers and in the criteria used to select the appropriate representation. Furthermore, it has not yet been applied to real-world domains. An empirical comparison with our approach is left for future consideration.

The contribution of our work is to integrate embeddings with model-based RL to solve real-world problems. We do this by leveraging the locality-preserving qualities of embeddings to construct dynamic models of the system to be controlled. While these models do not improve the quality of off-line learning itself, they permit embedding validation and reasoning over the domain, either to constrain the learning problem or to anticipate the effects of the learned policy on the dynamics of the controlled system. To demonstrate our approach, we applied it to learn a neurostimulation treatment of epilepsy, a challenging real-world domain. We showed that a policy learned off-line from an embedding-based, local model can be successfully transferred on-line. This is a promising step toward widespread application of RL in real-world domains. Looking to the future, we anticipate the ability to adjust the embedding a priori using a non-parametric policy gradient approach over the local model. An empirical investigation into the benefits of this extension is also left for future consideration.

Acknowledgments
The authors thank Dr. Gabriella Panuccio and Dr. Massimo Avoli of the Montreal Neurological Institute for generating the time-series described in Section 4.
The authors also thank Arthur Guez, Robert Vincent, Jordan Frank, and Mahdi Milani Fard for valuable comments and suggestions. The authors gratefully acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.

References

[1] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning for control. Artificial Intelligence Review, 11:75–113, 1997.

[2] Christopher G. Atkeson and Jun Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems, 2003.

[3] M. Bowling, A. Ghodsi, and D. Wilkinson. Action respecting embedding. In Proceedings of ICML, 2005.

[4] F. Lopes da Silva, W. Blanes, S. Kalitzin, J. Parra, P. Suffczynski, and D. Velis. Dynamical diseases of brain systems: Different routes to epileptic seizures. IEEE Transactions on Biomedical Engineering, 50(5):540–548, 2003.

[5] G. D'Arcangelo, G. Panuccio, V. Tancredi, and M. Avoli. Repetitive low-frequency stimulation reduces epileptiform synchronization in limbic neuronal networks. Neurobiology of Disease, 19:119–128, 2005.

[6] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

[7] A. Galka. Topics in Nonlinear Time Series Analysis: with Implications for EEG Analysis. World Scientific, 2000.

[8] J.P. Huke. Embedding nonlinear dynamical systems: A guide to Takens' Theorem. Technical report, Manchester Institute for Mathematical Sciences, University of Manchester, March 2006.

[9] K. Jerger and S. Schiff. Periodic pacing and in vitro epileptic focus. Journal of Neurophysiology, 73(2):876–879, 1995.

[10] Nicholas K. Jong and Peter Stone. Model-based function approximation in reinforcement learning. In Proceedings of AAMAS, 2007.

[11] P.W. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of ICML, 2006.

[12] M. Kennel and H. Abarbanel. False neighbors and false strands: A reliable minimum embedding dimension algorithm. Physical Review E, 66:026209, 2002.

[13] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.

[14] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[15] R. Munos and A. Moore. Variable resolution discretization in optimal control. Machine Learning, 49:291–323, 2002.

[16] U. Parlitz and C. Merkwirth. Prediction of spatiotemporal time series based on reconstructed local states. Physical Review Letters, 84(9):1890–1893, 2000.

[17] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. In Proceedings of ECML, 2005.

[18] Tim Sauer, James A. Yorke, and Martin Casdagli. Embedology. Journal of Statistical Physics, 65(3/4):579–616, 1991.

[19] S. Singh, M. L. Littman, N. K. Jong, D. Pardoe, and P. Stone. Learning predictive state representations. In Proceedings of ICML, 2003.

[20] W. Smart. Explicit manifold representations for value-functions in reinforcement learning. In Proceedings of ISAIM, 2004.

[21] J. Stark. Delay embeddings for forced systems. I. Deterministic forcing. Journal of Nonlinear Science, 9:255–332, 1999.

[22] J. Stark, D.S. Broomhead, M.E. Davies, and J. Huke. Delay embeddings for forced systems. II. Stochastic forcing. Journal of Nonlinear Science, 13:519–577, 2003.

[23] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.

[24] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L. S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, volume 898, pages 366–381, 1981.

[25] D. Wingate and S. Singh. On discovery and learning of models with predictive state representations of state for agents with continuous actions and observations. In Proceedings of AAMAS, 2007.