{"title": "Continuous-time Value Function Approximation in Reproducing Kernel Hilbert Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 2813, "page_last": 2824, "abstract": "Motivated by the success of reinforcement learning (RL) for discrete-time tasks such as AlphaGo and Atari games, there has been a recent surge of interest in using RL for continuous-time control of physical systems (cf. many challenging tasks in OpenAI Gym and DeepMind Control Suite).\nSince discretization of time is susceptible to error, it is methodologically more desirable to handle the system dynamics directly in continuous time.\nHowever, very few techniques exist for continuous-time RL and they lack flexibility in value function approximation.\nIn this paper, we propose a novel framework for model-based continuous-time value function approximation in reproducing kernel Hilbert spaces.\nThe resulting framework is so flexible that it can accommodate any kind of kernel-based approach, such as Gaussian processes and kernel adaptive filters, and it allows us to handle uncertainties and nonstationarity without prior knowledge about the environment or what basis functions to employ.\nWe demonstrate the validity of the presented framework through experiments.", "full_text": "Continuous-time Value Function Approximation\n\nin Reproducing Kernel Hilbert Spaces\n\nMotoya Ohnishi\n\nKeio Univ., KTH, RIKEN\n\nmotoya.ohnishi@riken.jp\n\nMikael Johansson\n\nKTH\n\nmikaelj@ee.kth.se\n\nMasahiro Yukawa\nKeio Univ., RIKEN\n\nyukawa@elec.keio.ac.jp\n\nMasashi Sugiyama\nRIKEN, Univ. Tokyo\n\nmasashi.sugiyama@riken.jp\n\nAbstract\n\nMotivated by the success of reinforcement learning (RL) for discrete-time tasks\nsuch as AlphaGo and Atari games, there has been a recent surge of interest in using\nRL for continuous-time control of physical systems (cf. many challenging tasks\nin OpenAI Gym and DeepMind Control Suite). 
Since discretization of time is\nsusceptible to error, it is methodologically more desirable to handle the system\ndynamics directly in continuous time. However, very few techniques exist for\ncontinuous-time RL and they lack \ufb02exibility in value function approximation.\nIn this paper, we propose a novel framework for model-based continuous-time\nvalue function approximation in reproducing kernel Hilbert spaces. The resulting\nframework is so \ufb02exible that it can accommodate any kind of kernel-based approach,\nsuch as Gaussian processes and kernel adaptive \ufb01lters, and it allows us to handle\nuncertainties and nonstationarity without prior knowledge about the environment\nor what basis functions to employ. We demonstrate the validity of the presented\nframework through experiments.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) [37, 20, 35] has been successful in a variety of applications such as\nAlphaGo and Atari games, particularly for discrete stochastic systems. Recently, application of RL to\nphysical control tasks has also been gaining attention, because solving an optimal control problem\n(or the Hamilton-Jacobi-Bellman-Isaacs equation) [21] directly is computationally prohibitive for\ncomplex nonlinear system dynamics and/or cost functions.\nIn the physical world, states and actions are continuous, and many dynamical systems evolve in\ncontinuous time. OpenAI Gym [7] and DeepMind Control Suite [40] offer several representative\nexamples of such physical tasks. When handling continuous-time (CT) systems, CT formulations\nare methodologically desirable over the use of discrete-time (DT) formulations with the small time\nintervals, since such discretization is susceptible to errors. In terms of computational complexities\nand the ease of analysis, CT formulations are also more advantageous over DT counterparts for\ncontrol-theoretic analyses such as stability and forward invariance [14], which are useful for safety-\ncritical applications. 
As we will show in this paper, our framework allows us to constrain control inputs\nand/or states in a computationally efficient way.\nOne of the early examples of RL for CT systems [4] pointed out that Q-learning is incapable of\nlearning in continuous time and proposed advantage updating. Convergence proofs were given\nin [25] for systems described by stochastic differential equations (SDEs) [28] using a grid-based\ndiscretization of states and time.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fTable 1: Relations to the existing approaches\n\n| | DT | DT stochastic (MDP) | CT | CT stochastic |\n| Non kernel-based | (e.g. [5]) | (e.g. [44]) | (e.g. [8]) | (e.g. [25]) |\n| Kernel-based | (e.g. [9]) | (e.g. [13]) | (This work) | (This work) |\n\nStochastic differential dynamic programming and RL have also been\nstudied in, for example, [43, 30, 42]. For continuous states and actions, function approximators are\noften employed instead of finely discretizing the state space to avoid the explosion of computational\ncomplexities. The work in [8] presented an application of CT-RL by function approximators such\nas Gaussian networks with a fixed number of basis functions. In [45], it was mentioned that any\ncontinuously differentiable value function (VF) can be approximated by increasing the number of\nindependent basis functions to infinity in CT scenarios, and a CT policy iteration was proposed.\nHowever, without resorting to the theory of reproducing kernels [3], determining the number of\nbasis functions and selecting a suitable basis function class cannot be performed systematically\nin general. Nonparametric learning is often desirable when no a priori knowledge about a suitable\nset of basis functions for learning is available.
Kernel-based methods offer many nonparametric\nlearning algorithms, ranging from Gaussian processes (GPs) [32] to kernel adaptive filters (KFs)\n[22], which can provably deal with uncertainties and nonstationarity. While DT kernel-based\nRL was studied in [29, 49, 41, 36, 26, 13, 27], for example, and the Gaussian process temporal\ndifference (GPTD) algorithm was presented in [9], no CT kernel-based RL has been proposed to\nour knowledge. Moreover, there is no unified framework in which existing kernel methods and their\nconvergence/tracking analyses are straightforwardly applied to model-based VF approximation.\nIn this paper, we present a novel theoretical framework of model-based CT-VF approximation\nin reproducing kernel Hilbert spaces (RKHSs) [3] for systems described by SDEs. The RKHS\nfor learning is defined through one-to-one correspondence to a user-defined RKHS in which the\nVF to be estimated is assumed to lie. We then obtain the associated kernel to be used for learning. The\nresulting framework renders any kind of kernel-based method applicable in model-based CT-VF\napproximation, including GPs [32] and KFs [22]. In addition, we propose an efficient barrier-certified\npolicy update for CT systems, which implicitly enforces state constraints. Relations of our framework\nto the existing approaches for DT, DT stochastic (the Markov decision process (MDP)), CT, and\nCT stochastic systems are shown in Table 1. Our proposed framework covers model-based VF\napproximation working in RKHSs, including those for CT and CT stochastic systems. We verify the\nvalidity of the framework on the classical Mountain Car problem and a simulated inverted pendulum.\n\n2 Problem setting\n\nThroughout, R, Z≥0, and Z>0 are the sets of real numbers, nonnegative integers, and strictly positive\nintegers, respectively.
We suppose that the system dynamics described by the SDE [28],\n\ndx = h(x(t), u(t))dt + η(x(t), u(t))dw,   (1)\n\nis known or learned, where x(t) ∈ R^{nx}, u(t) ∈ U ⊂ R^{nu}, and w are the state, control, and a Brownian\nmotion of dimensions nx ∈ Z>0, nu ∈ Z>0, and nw ∈ Z>0, respectively, h : R^{nx} × U → R^{nx} is\nthe drift, and η : R^{nx} × U → R^{nx×nw} is the diffusion. A Brownian motion can be considered as a\nprocess noise, and is known to satisfy the Markov property [28]. Given a policy φ : R^{nx} → U, we\ndefine h^φ(x) := h(x, φ(x)) and η^φ(x) := η(x, φ(x)), and make the following two assumptions.\nAssumption 1. For any Lipschitz continuous policy φ, both h^φ(x) and η^φ(x) are Lipschitz continuous, i.e., the stochastic process defined in (1) is an It\u00f4 diffusion [28, Definition 7.1.1], which has a\npathwise unique solution for t ∈ [0, ∞).\nAssumption 2. The set X ⊂ R^{nx} is compact with nonempty interior int(X), and int(X) is invariant\nunder the system (1) with any Lipschitz continuous policy φ, i.e.,\n\nP_x(x(t) ∈ int(X)) = 1, ∀x ∈ int(X), ∀t ≥ 0,   (2)\n\nwhere P_x(x(t) ∈ int(X)) denotes the probability that x(t) lies in int(X) when starting from\nx(0) = x.\n\n\fFigure 1: An illustration of the main ideas of our proposed framework. Given a system dynamics and\nan RKHS H_V for the VF V^φ, define H_R under one-to-one correspondence to estimate an observable\nimmediate cost function in H_R, and obtain V^φ by bringing it back to H_V.\n\nAssumption 2 implies that a solution of the system (1) stays in int(X) with probability one. We refer\nthe readers to [15] for stochastic stability and invariance for SDEs.\nIn this paper, we consider the immediate cost function¹ R : R^{nx} × U → R, which is continuous and\nsatisfies E_x[∫_0^∞ e^{−βt} |R(x(t), u(t))| dt] < ∞, where E_x is the expectation for all trajectories (time\nevolutions of x(t)) starting from x(0) = x, and β ≥ 0 is the discount factor.
Note this boundedness\nimplies that β > 0 or that there exists a zero-cost state which is stochastically asymptotically stable\n[15]. Specifically, we consider the case where the immediate cost is not known a priori but is\nsequentially observed. Now, the VF associated with a policy φ is given by\n\nV^φ(x) := E_x[∫_0^∞ e^{−βt} R^φ(x(t)) dt] < ∞,   (3)\n\nwhere R^φ(x(t)) := R(x(t), φ(x(t))).\nThe advantages of using CT formulations include a smooth control performance and an efficient\npolicy update, and CT formulations require no elaborate partitioning of time [8]. In addition, our\nwork shows that CT formulations make control-theoretic analyses easier and computationally more\nefficient and are less susceptible to errors when the time interval is small.\nWe mention that the CT formulation can still be considered in spite of the fact that the algorithm is\nimplemented in discrete time.\nWith these problem settings in place, our goal is to estimate the CT-VF in an RKHS and improve\npolicies. However, since the output V^φ(x) is unobservable and the so-called double-sampling\nproblem exists when approximating VFs (see e.g., [38, 16]), kernel-based supervised learning and its\nanalysis cannot be directly applied to VF approximation in general. Motivated by this fact, we propose\na novel model-based CT-VF approximation framework which enables us to conduct kernel-based VF\napproximation as supervised learning.\n\n3 Model-based CT-VF approximation in RKHSs\n\nIn this section, we briefly present an overview of our framework. We take the following steps:\n\n1. Select an RKHS H_V which is supposed to contain V^φ as one of its elements.\n2. Construct another RKHS H_R under one-to-one correspondence to H_V through a certain\nbijective linear operator U : H_V → H_R to be defined later in the next section.\n3. Estimate the immediate cost function R^φ in the RKHS H_R by kernel-based supervised\nlearning, and return its estimate R̂^φ.\n\n4. 
An estimate of the VF V^φ is immediately obtained by U^{−1}(R̂^φ).\n\nAn illustration of our framework is depicted in Figure 1. Note we can avoid the double-sampling\nproblem because the operator U is deterministic even though the system dynamics is stochastic.\nTherefore, under this framework, model-based CT-VF approximation in RKHSs can be derived,\nand convergence/tracking analyses of kernel-based supervised learning can also be applied to VF\napproximation.\n\n\fAlgorithm 1 Model-based CT-VF Approximation in RKHSs with Barrier-Certified Policy Updates\n\nEstimate of the VF: V̂^φ_n = U^{−1}(R̂^φ_n)\nfor n ∈ Z≥0 do\n  - Receive x_n ∈ X, φ(x_n) ∈ U, and R(x_n, φ(x_n)) ∈ R\n  - Update the estimate R̂^φ_n of R^φ by using some kernel-based method in H_R (e.g., Section 6)\n  - Update the policy with barrier certificates when V^φ is well estimated (e.g., (11))\nend for\n\nPolicy update while restricting certain regions of the state space. As mentioned above, one of\nthe advantages of a CT framework is its affinity for control-theoretic analyses such as stability and\nforward invariance, which are useful for safety-critical applications. For example, suppose that we\nneed to restrict the region of exploration in the state space to some set C := {x ∈ X | b(x) ≥ 0},\nwhere b : X → R is smooth. This is often required for safety-critical applications.\nTo this end, control inputs must be properly constrained so that\nthe state trajectory remains inside the set C. In the safe RL\ncontext, there exists an idea of considering a smaller space of\nallowable policies (see [11] and references therein). To effectively constrain policies, we employ control barrier certificates\n(cf. [50, 48, 12, 46, 2, 1]).
Without explicitly calculating the\nstate trajectory over a long time horizon, it is known that any\nLipschitz continuous policy satisfying control barrier certificates renders the set C forward invariant [50], i.e., the state\ntrajectory remains inside the set C. In other words, we can implicitly enforce state constraints by satisfying barrier certificates\nwhen updating policies. Barrier-certified policy update was first\nintroduced in [27] for DT systems, but is computationally more\nefficient in our CT scenario. This concept is illustrated in Figure 2, where Φ is the space of Lipschitz continuous policies\nφ : X → U, and Γ is the space of barrier-certified allowable policies.\n\nFigure 2: An illustration of barrier-certified policy updates. State constraints are implicitly enforced via\nbarrier certificates.\n\nA brief summary of the proposed model-based CT-VF approximation in RKHSs is given in Algorithm\n1. In the next section, we present theoretical analyses of our framework.\n\n4 Theoretical analyses\n\nWe presented the motivations and an overview of our framework in the previous section. In this\nsection, we validate the proposed framework from theoretical viewpoints. Because the output V^φ(x)\nof the VF is unobservable, we follow the strategy presented in the previous section. First, by properly\nidentifying the RKHS H_V which is supposed to contain the VF, we can implicitly restrict the class of\nthe VF. If the VF V^φ is twice continuously differentiable² over int(X) ⊂ X, we obtain the following\nHamilton-Jacobi-Bellman-Isaacs equation [28]:\n\nβV^φ(x) = −G(V^φ)(x) + R^φ(x), x ∈ int(X),   (4)\n\nwhere the infinitesimal generator G is defined as\n\nG(V^φ)(x) := −(1/2) tr[(∂²V^φ(x)/∂x²) A^φ(x)] − (∂V^φ(x)/∂x) h^φ(x), x ∈ int(X).   (5)\n\nHere, tr stands for the trace, and A^φ(x) := A(x, φ(x)) ∈ R^{nx×nx}, where A(x, u) = η(x, u)η(x, u)^T.
By employing a suitable RKHS such as a Gaussian RKHS for HV , we can guarantee\ntwice continuous differentiability of an estimated VF. Note that functions in a Gaussian RKHS are\nsmooth [23], and any continuous function on every compact subset of Rnx can be approximated with\nan arbitrary accuracy [34] in a Gaussian RKHS.\n\n1The cost function might be obtained by the negation of the reward function.\n2 See, for example, [10, Chapter IV],[19], for more detailed arguments about the conditions under which\n\ntwice continuous differentiability is guaranteed.\n\n4\n\n\fNext, we need to construct another RKHS HR which contains the immediate cost function R as one\nof its element. The relation between the VF and the immediate cost function is given by rewriting (4)\nas\n\nR(x) = [Iop + G] (V )(x), x 2 int(X ),\n\n(6)\nwhere Iop is the identity operator. To de\ufb01ne the operator Iop + G over the whole X , we use the\nfollowing de\ufb01nition.\nDe\ufb01nition 1 ([52, De\ufb01nition 1]). Let Is := {\u21b5 := [\u21b51,\u21b5 2, . . . ,\u21b5 nx]T 2 Znx\nj=1 \u21b5j \uf8ff\ns} for s 2 Z0, nx 2 Z>0. De\ufb01ne D\u21b5'(x) =\n@(x1)\u21b51 @(x2)\u21b52 ...@(xnx )\u21b5nx '(x), where x :=\n[x1, x2, . . . , xnx]T 2 Rnx. If X\u21e2 Rnx is compact with nonempty interior int(X ), Cs(int(X )) is\nthe space of functions ' over int(X ) such that D\u21b5' is well de\ufb01ned and continuous over int(X )\nfor each \u21b5 2 Is. 
De\ufb01ne Cs(X ) to be the space of continuous functions ' over X such that\n'|int(X ) 2 Cs(int(X )) and that D\u21b5('|int(X )) has a continuous extension D\u21b5' to X for each \u21b5 2 Is.\nIf \uf8ff 2 C2s(X\u21e5X ), de\ufb01ne (D\u21b5\uf8ff)x(y) =\nNow, suppose that HV is an RKHS associated with the reproducing kernel \uf8ffV (\u00b7,\u00b7) 2 C2\u21e52(X\u21e5X ).\nThen, we de\ufb01ne the operator U : HV !H R := {' | '(x) = U ('V )(x), 9'V 2H V , 8x 2X} as\n\n@(x1)\u21b51 @(x2)\u21b52 ...@(xnx )\u21b5nx \uf8ff(y, x), 8x, y 2 int(X ).\n\n0 | Pnx\n\n@Pnx\n\n@Pnx\n\nj=1 \u21b5j\n\nj=1 \u21b5j\n\nU ('V )(x) := 'V (x)  [De1'V (x), De2'V (x), . . . , Denx 'V (x)]h(x)\n8'V 2H V , 8x 2X ,\n\nm,n(x)Dem+en'V (x),\n\nA\n\n\n\n1\n2\n\nnxXm,n=1\n\n(7)\n\nm,n(x) is the (m, n) entry of A(x). Note here that U ('V )(x) = [Iop + G] ('V )(x) over\nwhere A\nint(X ). We emphasize here that the expected value and the immediate cost are related through the\ndeterministic operator U. The following main theorem states that HR is indeed an RKHS under\nAssumptions 1 and 2, and its corresponding reproducing kernel is obtained.\nTheorem 1. Under Assumptions 1 and 2, suppose that HV is an RKHS associated with the repro-\nducing kernel \uf8ffV (\u00b7,\u00b7) 2 C2\u21e52(X\u21e5X ). Suppose also that (i) > 0, or that (ii) HV is a Gaussian\nRKHS, and there exists a point xt!1 2 int(X ) which is stochastically asymptotically stable over\nx(t) = xt!1\u2318 = 1 for any starting point x 2 int(X ). 
Then, the following\nint(X ), i.e., Px\u21e3 lim\nstatements hold.\n(a) The space HR := {' | '(x) = U ('V )(x), 9'V 2H V , 8x 2X} is an isomorphic Hilbert\nspace of HV equipped with the inner product de\ufb01ned by\n,' i(x) := U ('V\n\nt!1\n\n(8)\n\n1 ,' V\n\ni )(x), 8x 2X , i 2{ 1, 2},\n\nh'1,' 2iHR\n\n:=\u2326'V\n\n2\u21b5HV\n\nwhere the operator U is de\ufb01ned in (7).\n(b) The Hilbert space HR has the reproducing kernel given by\n\n\uf8ff(x, y) := U (K(\u00b7, y))(x), x, y 2X ,\n\nwhere\n\n(9)\n\n(10)\n\nK(x, y) = \uf8ffV (x, y)  [(De1\uf8ffV )y(x), (De2\uf8ffV )y(x), . . . , (Denx \uf8ffV )y(x)]h(y)\n\n1\n2\n\n\n\nnxXm,n=1\n\nA\n\nm,n(y)(Dem+en\uf8ffV )y(x).\n\nProof. See Appendices A and B in the supplementary document.\nUnder Assumptions 1 and 2, Theorem 1 implies that the VF V  can be uniquely determined by the\nimmediate cost function R for a policy  if the VF is in an RKHS of a particular class. In fact, the\nrelation between the VF and the immediate cost function in (4) is based on the assumption that the\nVF is twice continuously differentiable over int(X ), and the veri\ufb01cation theorem (cf. [10]) states\nthat, when the immediate cost function and a twice continuously differentiable function satisfying\nthe relation (4) are given under certain conditions, the twice continuously differentiable function is\nindeed the VF. In Theorem 1, on the other hand, we \ufb01rst restrict the class of the VF by identifying\n\n5\n\n\fn (x) = U1( \u02c6R\n\nT\n\nT\n\nT\n\nn)(x) =Pr\n\nan RKHS HV , and then approximate the immediate cost function in the RKHS HR any element of\nwhich satis\ufb01es the relation (4). 
Because the immediate cost R(x(t)) is observable, we can employ\nany kernel-based supervised learning to estimate the function R in the RKHS HR, such as GPs and\nKFs, as elaborated later in Section 6.\nIn the RKHS HR, an estimate of R at time instant n 2 Z0 is given by \u02c6R\nn(x) =\nPr\ni ci\uf8ff(x, xi), ci 2 R, r 2 Z0, where {xi}i2{1,2,...,r} \u21e2X is the set of samples, and the\nreproducing kernel \uf8ff is de\ufb01ned in (9). An estimate of the VF V  at time instant n 2 Z0 is thus\nimmediately obtained by \u02c6V \ni=1 ciK(x, xi), where K is de\ufb01ned in (10).\nNote, when the system dynamics is described by an ordinary differential equation (i.e., \u2318 = 0),\nthe assumptions that V  is twice continuously differentiable and that \uf8ffV (\u00b7,\u00b7) 2 C2\u21e52(X\u21e5X ) are\nrelaxed to that V  is continuously differentiable and that \uf8ffV (\u00b7,\u00b7) 2 C2\u21e51(X\u21e5X ), respectively.\nAs an illustrative example of Theorem 1, we show the case of the linear-quadratic regulator (LQR)\nbelow.\nSpecial case: linear-quadratic regulator Consider a linear feedback LQR, i.e., LQR(x) =\nFLQRx, FLQR 2 Rnu\u21e5nx, and a linear system \u02d9x := dx\ndt = ALQRx + BLQRu, where ALQR 2\nRnx\u21e5nx and BLQR 2 Rnx\u21e5nu are matrices. In this case, we know that the value function V LQR\nbecomes quadratic with respect to the state variable (cf. [53]). Therefore, we employ an RKHS\nwith a quadratic kernel for HV , i.e., \uf8ffV (x, y) = (xTy)2. If we assume that the input space X is so\nlarge that the set span{Asym|Asym = xxT, 9x 2X} accommodates any real symmetric matrix, we\nobtain HV = {X 3 x 7! xTAsymx|Asym is symmetric}.\nMoreover, it follows from the product rule of the directional derivative [6] that K(x, y) =\nLQR)x, where ALQR := ALQR \nxTALQRyxTy  xTyxTALQRy = xT(ALQRyyT  yyTA\nLQR is symmetric, implying K(\u00b7, y) 2H V ,\nBLQRFLQR. Note Avalue(y) := ALQRyyT  yyTA\nT\nand we obtain \uf8ff(x, y) = xT(A\nLQRAvalue(y) + Avalue(y)ALQR)x. 
Because Acost(y) :=\nLQRAvalue(y)  Avalue(y)ALQR is symmetric, it follows that \uf8ff(\u00b7, y) 2H V . If ALQR is stable\nA\n(Hurwitz), from Theorem 1, the one-to-one correspondence between HV and HR thus implies that\nHV = HR. Therefore, we can fully approximate the immediate cost function RLQR in HR if RLQR\nis quadratic with respect to the state variable.\nSuppose that\ni=1 ci\uf8ff(x, xi) =\nxTAcostx. Then, the estimated value function will be given by V LQR(x) = U1(RLQR)(x) =\nT\nLQRAvalue + AvalueALQR + Acost = 0, which is indeed\nthe well-known Lyapunov equation [53]. Unlike Gaussian-kernel cases, we only require a \ufb01nite\nnumber of parameters to fully approximate the immediate cost function, and hence is analytically\nobtainable.\nBarrier-certi\ufb01ed policy updates under CT formulation Next, we show that the CT formula-\ntion makes barrier-certi\ufb01ed policy updates computationally more ef\ufb01cient under certain condi-\ntions. Assume that the system dynamics is af\ufb01ne in the control, i.e., h(x, u) = f (x) + g(x)u,\nand \u2318 = 0, where f : Rnx ! Rnx and g : Rnx ! Rnx\u21e5nu, and that the immediate cost\nR(x, u) is given by Q(x) + 1\n2 uTM u, where Q : Rnx ! R, and M 2 Rnu\u21e5nu is a positive\nde\ufb01nite matrix. Then, any Lipschitz continuous policy  : X!U\nsatisfying (x) 2S (x) :=\n@x g(x)u + \u21b5(b(x))  0o renders the set C forward invariant [50], i.e., the\nnu 2U | @b(x)\nstate trajectory remains inside the set C, where \u21b5 : R ! R is strictly increasing and \u21b5(0) = 0. 
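The barrier condition just stated can be checked numerically for a candidate input. The following is a minimal sketch, not the authors' implementation: it assumes the condition reads grad_b(x) f(x) + grad_b(x) g(x) u + alpha(b(x)) >= 0 with u restricted to a box, and the toy double-integrator dynamics, the linear choice alpha(s) = s, and all function names are hypothetical.

```python
import numpy as np

def barrier_margin(x, u, f, g, b, grad_b, alpha):
    # Left-hand side of the assumed barrier condition:
    #   grad_b(x) . f(x) + grad_b(x) . g(x) u + alpha(b(x))
    db = np.asarray(grad_b(x), dtype=float)
    drift_term = db @ np.asarray(f(x), dtype=float)
    input_term = db @ (np.asarray(g(x), dtype=float) @ np.atleast_1d(u))
    return float(drift_term + input_term + alpha(b(x)))

def is_admissible(x, u, f, g, b, grad_b, alpha, u_min, u_max):
    # u must lie in the input set (a box here) and keep the margin nonnegative.
    u = np.atleast_1d(np.asarray(u, dtype=float))
    in_box = bool(np.all(u >= u_min) and np.all(u <= u_max))
    return in_box and barrier_margin(x, u, f, g, b, grad_b, alpha) >= 0.0

# Hypothetical toy system: state [position, velocity], scalar input.
f = lambda x: np.array([x[1], 0.0])
g = lambda x: np.array([[0.0], [1.0]])
b = lambda x: 0.05 + x[1]             # keeps the velocity above -0.05
grad_b = lambda x: np.array([0.0, 1.0])
alpha = lambda s: s                   # strictly increasing, alpha(0) = 0

x = np.array([0.0, -0.05])            # on the boundary b(x) = 0
ok = is_admissible(x, 0.2, f, g, b, grad_b, alpha, -1.0, 1.0)
bad = is_admissible(x, -0.2, f, g, b, grad_b, alpha, -1.0, 1.0)
```

On the boundary of the safe set, only inputs that push the velocity back up pass the check, which is the sense in which the state constraint is enforced implicitly through the input.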
Taking\nthis constraint into account, the barrier-certi\ufb01ed greedy policy update is given by\n\nthe immediate cost function is given by RLQR(x) = Pr\n\nPr\ni=1 ciK(x, xi) = xTAvaluex, where A\n\n@x f (x) + @b(x)\n\n+(x) = argmin\n\nu2S(x) \uf8ff 1\n\n2\n\nuTM u +\n\n@V (x)\n\n@x\n\ng(x)u ,\n\n(11)\n\nwhich is, by virtue of the CT formulation, a quadratic programming (QP) problem at x when U\u21e2 Rnu\nde\ufb01nes af\ufb01ne constraints (see Appendix C in the supplementary document). The space of allowable\npolicies is thus given by := { 2  | (x) 2S (x), 8x 2X} . When \u2318 6= 0 and the dynamics is\nlearned by GPs as in [30], the work in [47] provides a barrier certi\ufb01cate for uncertain dynamics. Note,\none can also employ a function approximator or add noises to the greedily updated policy to avoid\n\n6\n\n\funstable performance and promote exploration (see e.g., [8]). To see if the updated policy + remains\nin the space of Lipschitz continuous policies , i.e.,  \u21e2 , we present the following proposition.\nProposition 1. Assume the conditions in Theorem 1. Assume also that U\u21e2 Rnu de\ufb01nes af\ufb01ne\nconstraints, and that f, g, \u21b5, and the derivative of b are Lipschitz continuous over X . Then, the policy\n+ de\ufb01ned in (11) is Lipschitz continuous over X if the width of a feasible set3 is strictly larger than\nzero over X .\nProof. 
See Appendix D in the supplementary document.\nNote, if U\u21e2 Rnx de\ufb01nes the bounds of each entry of control inputs, it de\ufb01nes af\ufb01ne constraints.\nLastly, the width of a feasible set is strictly larger than zero if U is suf\ufb01ciently large and @b(x)\n@x g(x) 6= 0.\nWe will further clarify the relations of the proposed theoretical framework to existing works below.\n\n5 Relations to existing works\n\nFirst, our proposed framework takes advantage of the capability of learning complicated functions\nand nonparametric \ufb02exibility of RKHSs, and reproduces some of the existing model-based DT-VF\napproximation techniques (see Appendix E in the supplementary document). Note that some of the\nexisting DT-VF approximations in RKHSs, such as GPTD [9], also work for model-free cases (see\n[27] for model-free adaptive DT action-value function approximation, for example). Second, since\nthe RKHS HR for learning is explicitly de\ufb01ned in our framework, any kernel-based method and its\nconvergence/tracking analyses are directly applicable. While, for example, the work in [17], which\naims at attaining a sparse representation of the unknown function in an online fashion in RKHSs, was\nextended to the policy evaluation [18] by addressing the double-sampling problem, our framework\ndoes not suffer from the double-sampling problem, and hence any kernel-based online learning (e.g.,\n[17, 51, 39]) can be straightforwardly applied. Third, when the time interval is small, DT formulations\nbecome susceptible to errors, while CT formulations are immune to the choice of the time interval.\nNote, on the other hand, a larger time interval poorly represents the system dynamics evolving in\ncontinuous time. Lastly, barrier certi\ufb01cates are ef\ufb01ciently incorporated in our CT framework through\nQPs under certain conditions, and state constraints are implicitly taken into account. 
Stochastic\noptimal control such as the work in [43, 42] requires sample trajectories over prede\ufb01ned \ufb01nite time\nhorizons and the value is computed along the trajectories while the VF is estimated in an RKHS even\nwithout having to follow the trajectory in our framework.\n\n6 Applications and practical implementation\n\nRnx\n\n(2\u21e12)L/2 exp kx  yk2\n\n22\n\n1\n\nWe apply the theory presented in the previous section to the Gaussian kernel case and introduce CTGP\nas an example, and present a practical implementation. Assume that A(x, u) 2 Rnx\u21e5nx is diagonal,\n!,\nfor simplicity. The Gaussian kernel is given by \uf8ffV (x, y) :=\nx, y 2X , > 0. Given Gaussian kernel \uf8ffV (x, y), the reproducing kernel \uf8ff(x, y) de\ufb01ned in (9) is\nderived as \uf8ff(x, y) = a(x, y)\uf8ffV (x, y), where a : X\u21e5X! R is a real-valued function (see Appendix\nF in the supplementary document for the explicit form of a(x, y)).\nCTGP One of the celebrated properties of GPs is their Bayesian formulation, which enables us\nto deal with uncertainty through credible intervals. Suppose that the observation d at time instant\nn 2 Z0 contains some noise \u270f 2 R, i.e., d(x) = R(x) + \u270f, \u270f \u21e0N (0, \u00b52\no), \u00b5o  0. Given data\ndN := [d(x0), d(x1), . . . , d(xN )]T for some N 2 Z0, we can employ GP regression to obtain the\nmean m(x\u21e4) and the variance \u00b52(x\u21e4) of \u02c6R(x\u21e4) at a point x\u21e4 2X as\n\u00b52(x\u21e4) = \uf8ff(x\u21e4, x\u21e4)  kT\n\n(12)\nwhere I is the identity matrix, k\u21e4 := [\uf8ff(x\u21e4, x0),\uf8ff (x\u21e4, x1), . . . ,\uf8ff (x\u21e4, xN )]T, and the (i, j) entry of\nG 2 R(N +1)\u21e5(N +1) is \uf8ff(xi1, xj1). 
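The GP regression step behind (12) can be sketched as follows, assuming the standard posterior formulas: mean k_star^T (G + noise^2 I)^{-1} d and variance kernel(x_star, x_star) - k_star^T (G + noise^2 I)^{-1} k_star. This is an illustration under stated assumptions rather than the authors' code: a plain Gaussian kernel stands in for the paper's derived kernel, and the function names and toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Stand-in kernel; the kernel derived in the paper would replace this function.
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def gp_posterior(xs, d, x_star, kernel, noise_std):
    # Gram matrix G with entries kernel(x_i, x_j).
    n = len(xs)
    G = np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])
    k_star = np.array([kernel(x_star, xs[i]) for i in range(n)])
    A = G + noise_std ** 2 * np.eye(n)
    # Solve linear systems instead of forming the inverse explicitly.
    mean = k_star @ np.linalg.solve(A, np.asarray(d, dtype=float))
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(A, k_star)
    return float(mean), float(var)

# Hypothetical noisy cost observations at three sample states.
xs = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
d = [0.0, 1.0, 0.5]
mean, var = gp_posterior(xs, d, np.array([1.0]), gaussian_kernel, noise_std=1e-3)
```

With a small observation noise, the posterior mean at a sampled state nearly reproduces the observed cost and the posterior variance there is close to the noise variance.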
Then, by the existence of the inverse operator U1, the mean\nmV (x\u21e4) and the variance \u00b5V 2(x\u21e4) of \u02c6V (x\u21e4) at a point x\u21e4 2X can be given by\n(G + \u00b52\n\nm(x\u21e4) = kT\n\n\u21e4 (G + \u00b52\n\n\u21e4 (G + \u00b52\n\nT\n\n(G + \u00b52\n\noI)1dN , \u00b5V 2\n\noI)1k\u21e4,\n\noI)1dN ,\n\nmV (x\u21e4) = KV\n\u21e4\n\n(x\u21e4) = \uf8ffV (x\u21e4, x\u21e4)  KV\n\u21e4\n\nT\n\noI)1KV\n\u21e4 ,\n\n(13)\n\n3Width indicates how much control margin is left for the strictest constraint, as de\ufb01ned in [24, Equation 21].\n\n7\n\n\fTable 2: Comparisons of the cumulative costs and numbers of times the observed velocities became\nlower than 0.05 with and without barrier certi\ufb01cates\n\nCumulative cost\nWith barrier\nWithout barrier\n\nCTKF\n114.2\n0 (times)\n0 (times)\n\nGPTD_1 DTKF_1 CTGP\n299.1\n0 (times)\n0 (times)\n\n299.1\n0 (times)\n0 (times)\n\n82.2\n0 (times)\n10 (times)\n\nGPTD_20 DTKF_20\n89.2\n0 (times)\n20 (times)\n\n90.4\n0 (times)\n20 (times)\n\nwhere KV\n\u21e4\ndocument for more details).\n\n:= [K(x\u21e4, x0), K(x\u21e4, x1), . . . , K(x\u21e4, xN )]T (see Appendix G in the supplementary\n\n7 Numerical Experiment\n\n0\n\nv(t)\n\n0.0025 cos (3x(t)) dt +\uf8ff\n\nIn this section, we \ufb01rst show the validity of the proposed CT framework and its advantage over\nDT counterparts when the time interval is small, and then compare CTGP and GPTD for RL on a\nsimulated inverted pendulum. In both of the experiments, the coherence-based sparsi\ufb01cation [33] in\nthe RKHS HR is employed to curb the growth of the dictionary size.\nPolicy evaluations:\ncomparison of CT and DT approaches We show that our CT ap-\nproaches are advantageous over DT counterparts in terms of susceptibility to errors, by using\nMountainCarContinuous-v0 in OpenAI Gym [7] as the environment. 
The state is given by\nx(t) := [x(t), v(t)]^T ∈ R^2, where x(t) and v(t) are the position and the velocity of the car, and the dynamics is given by dx = [v(t); −0.0025 cos(3x(t))] dt + [0; 0.0015] u(t) dt, where u(t) ∈ [−1.0, 1.0] and semicolons separate the entries of a column vector.\nThe position and the velocity are clipped to [−1.2, 0.6] and [−0.07, 0.07], respectively, and the\ngoal is to reach the position x = 0.45. In the simulation, the control cycle (i.e., the period at which\ncontrol inputs are applied and the states and costs are observed) is set to 1.0 second. The observed immediate cost is given by R(x(t), u(t)) + ε = 1 + 0.001u²(t) + ε for x(t) < 0.45 and\nR(x(t), u(t)) + ε = 0.001u²(t) + ε for x(t) ≥ 0.45, where ε ∼ N(0, 0.1²). Note the immediate\ncost for the DT cases is given by (R(x(t), u(t)) + ε)Δt, where Δt is the time interval. For policy\nevaluations, we use the policy obtained by RL based on the cross-entropy method⁴, and the four methods, CTGP, KF-based CT-VF approximation (CTKF), GPTD, and KF-based DT-VF approximation\n(DTKF), are used to learn value functions associated with the policy by executing the current policy\nfor five episodes, each of which terminates whenever t = 300 or x(t) ≥ 0.45. GP-based techniques\nare expected to handle the random component ε added to the immediate cost. The new policies are\nthen obtained by the barrier-certified policy updates under CT formulations, and these policies are\nevaluated five times. Here, the barrier function is given by b(x) = 0.05 + v, which prevents the\nvelocity from becoming lower than −0.05. Figure 3 compares the value functions⁵ learned by each\nmethod for the time intervals Δt = 20.0 and Δt = 1.0. We observe that the DT approaches are\nsensitive to the choice of Δt. Table 2 compares the cumulative costs averaged over five episodes for\neach method and for different time intervals and the numbers of times we observed the velocity being\nlower than −0.05 with and without the barrier certificate.
(Numbers associated with the algorithm names indicate the lengths of the time intervals.) Note that the cumulative costs are calculated by summing up the immediate costs multiplied by the duration of each control cycle, i.e., we discretized the immediate cost based on the control cycle. It is observed that the CT approach is immune to the choice of $\\Delta t$ while the performance of the DT approach degrades when the time interval becomes small, and that the barrier-certified policy updates work efficiently.\n\nFootnote 4: We used the code in https://github.com/udacity/deep-reinforcement-learning/blob/master/cross-entropy/CEM.ipynb offered by Udacity. The code is based on PyTorch [31].\n\nFootnote 5: We used \"jet colormaps\" in Python Matplotlib for illustrating the value functions.\n\nFigure 3: Illustrations of the value functions obtained by CTGP, CTKF, GPTD, and DTKF for time intervals $\\Delta t = 20.0$ and $\\Delta t = 1.0$. Panels: (a) GPTD for $\\Delta t = 20.0$, (b) GPTD for $\\Delta t = 1.0$, (c) CTGP, (d) DTKF for $\\Delta t = 20.0$, (e) DTKF for $\\Delta t = 1.0$, (f) CTKF. The policy is obtained by RL based on the cross-entropy method. CT approaches are not affected by the choice of $\\Delta t$.\n\nReinforcement learning: inverted pendulum. We show the advantage of CTGP over GPTD when the time interval for the estimation is small. Let the state $x(t) := [\\theta(t), \\omega(t)]^T \\in \\mathbb{R}^2$ consist of the angle $\\theta(t)$ and the angular velocity $\\omega(t)$ of an inverted pendulum, and consider the dynamics\n\n$$dx = \\begin{bmatrix} \\omega(t) \\\\ \\frac{g}{\\ell} \\sin(\\theta(t)) - \\frac{\\rho}{m\\ell^2} \\omega(t) \\end{bmatrix} dt + \\begin{bmatrix} 0 \\\\ \\frac{1}{m\\ell^2} \\end{bmatrix} u(t) dt + 0.01 I dw,$$\n\nwhere $g = 9.8$, $m = 1$, $\\ell = 1$, and $\\rho = 0.01$. The Brownian motion may come from outer disturbances and/or modeling error. In the simulation, the time interval $\\Delta t$ is set to 0.01 seconds, and the simulated dynamics evolves by $\\Delta x = h(x(t), u(t))\\Delta t + \\sqrt{\\Delta t} \\eta(x(t), u(t))\\epsilon_w$, where $\\epsilon_w \\sim \\mathcal{N}(0, I)$. In this experiment, the task is to stabilize the inverted pendulum at $\\theta = 0$. 
The observed immediate cost is given by $R(x(t), u(t)) + \\epsilon = 1/(1 + e^{-10(\\theta(t) - \\pi/16)}) + 100/(1 + e^{-10(\\theta(t) - \\pi/6)}) + 0.05u^2(t) + \\epsilon$, where $\\epsilon \\sim \\mathcal{N}(0, 0.1^2)$. A trajectory associated with the current policy is generated to learn the VF. The trajectory terminates when $|\\theta(t)| > \\pi/4$ and restarts from a random initial angle. After 10 seconds, the policy is updated. To evaluate the current policy, we use the average time, over five episodes, during which the pendulum stays up ($|\\theta(t)| \\leq \\pi/4$) when initialized as $\\theta(0) \\in [-\\pi/6, \\pi/6]$. Figure 4 compares this average time, with standard deviations, for CTGP and GPTD over up to five policy updates, i.e., until stable policy improvement becomes difficult without heuristic techniques such as adding noise to policies. Note that the same seed for the random number generator is used for the initializations of both approaches. It is observed that GPTD fails to improve policies. The large credible interval of CTGP is due to the random initialization of the state.\n\nFigure 4: Comparison of time up to which the pendulum stays up between CTGP and GPTD for the inverted pendulum (\u00b1 std. deviation).\n\n8 Conclusion and future work\n\nWe presented a novel theoretical framework that renders the CT-VF approximation problem simultaneously solvable in an RKHS by conducting kernel-based supervised learning for the immediate cost function in the properly defined RKHS. Our CT framework is compatible with rich theories of control, including control barrier certificates for safety-critical applications. The validity of the proposed framework and its advantage over DT counterparts when the time interval is small were verified experimentally on the classical Mountain Car problem and a simulated inverted pendulum.\n\nThere are several possible directions to explore as future work. First, we can employ state-of-the-art kernel methods within our theoretical framework or use other variants of RL, such as actor-critic methods, to improve practical performance. Second, we can consider uncertainties in value function approximation by virtue of the RKHS-based formulation, which might be used for safety verification. Lastly, the advantages of CT formulations for physical tasks are worth exploring further.\n\nAcknowledgments\n\nThis work was partially conducted when M. Ohnishi was at the GRITS Lab, Georgia Institute of Technology; M. Ohnishi thanks the members of the GRITS Lab, including Dr. Li Wang, and Prof. Magnus Egerstedt for discussions regarding barrier functions. M. Yukawa was supported in part by KAKENHI 18H01446 and 15H02757, M. Johansson was supported in part by the Swedish Research Council and by the Knut and Alice Wallenberg Foundation, and M. Sugiyama was supported in part by KAKENHI 17H00757. Lastly, the authors thank all of the anonymous reviewers for their very insightful comments.\n\nReferences\n\n[1] A. Agrawal and K. Sreenath. \u201cDiscrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation\u201d. In: Proc. RSS. 2017.\n\n[2] A. D. Ames et al. \u201cControl barrier function based quadratic programs with application to automotive safety systems\u201d. In: arXiv preprint arXiv:1609.06408 (2016).\n\n[3] N. Aronszajn. \u201cTheory of reproducing kernels\u201d. In: Trans. Amer. Math. Soc. 68.3 (1950), pp. 337\u2013404.\n\n[4] L. Baird. \u201cReinforcement learning in continuous time: Advantage updating\u201d. In: Proc. IEEE ICNN. Vol. 4. 1994, pp. 2448\u20132453.\n\n[5] L. Baird. \u201cResidual algorithms: Reinforcement learning with function approximation\u201d. In: Proc. ICML. 1995, pp. 30\u201337.\n\n[6] J. Bonet and R. D. Wood. Nonlinear continuum mechanics for finite element analysis. Cambridge University Press, 1997.\n\n[7] G. 
Brockman et al. \u201cOpenAI Gym\u201d. In: arXiv preprint arXiv:1606.01540 (2016).\n\n[8] K. Doya. \u201cReinforcement learning in continuous time and space\u201d. In: Neural Computation 12.1 (2000), pp. 219\u2013245.\n\n[9] Y. Engel, S. Mannor, and R. Meir. \u201cReinforcement learning with Gaussian processes\u201d. In: Proc. ICML. 2005, pp. 201\u2013208.\n\n[10] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Vol. 25. Springer Science & Business Media, 2006.\n\n[11] J. Garc\u00eda and F. Fern\u00e1ndez. \u201cA comprehensive survey on safe reinforcement learning\u201d. In: J. Mach. Learn. Res. 16.1 (2015), pp. 1437\u20131480.\n\n[12] P. Glotfelter, J. Cort\u00e9s, and M. Egerstedt. \u201cNonsmooth barrier functions with applications to multi-robot systems\u201d. In: IEEE Control Systems Letters 1.2 (2017), pp. 310\u2013315.\n\n[13] S. Grunewalder et al. \u201cModelling transition dynamics in MDPs with RKHS embeddings\u201d. In: Proc. ICML. 2012.\n\n[14] H. K. Khalil. \u201cNonlinear systems\u201d. In: Prentice-Hall 3 (1996).\n\n[15] R. Khasminskii. Stochastic stability of differential equations. Vol. 66. Springer Science & Business Media, 2011.\n\n[16] V. R. Konda, J. N. Tsitsiklis, et al. \u201cConvergence rate of linear two-time-scale stochastic approximation\u201d. In: The Annals of Applied Probability 14.2 (2004), pp. 796\u2013819.\n\n[17] A. Koppel et al. \u201cParsimonious online learning with kernels via sparse projections in function space\u201d. In: Proc. IEEE ICASSP. 2017, pp. 4671\u20134675.\n\n[18] A. Koppel et al. \u201cPolicy evaluation in continuous MDPs with efficient kernelized gradient temporal difference\u201d. In: IEEE Trans. Automatic Control (submitted) (2017).\n\n[19] N. V. Krylov. Controlled diffusion processes. Vol. 14. Springer Science & Business Media, 2008.\n\n[20] F. L. Lewis and D. Vrabie. 
\u201cReinforcement learning and adaptive dynamic programming for feedback control\u201d. In: IEEE Circuits and Systems Magazine 9.3 (2009).\n\n[21] D. Liberzon. Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011.\n\n[22] W. Liu, J. Pr\u00edncipe, and S. Haykin. Kernel adaptive filtering. New Jersey: Wiley, 2010.\n\n[23] H. Q. Minh. \u201cSome properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory\u201d. In: Constructive Approximation 32.2 (2010), pp. 307\u2013338.\n\n[24] B. Morris, M. J. Powell, and A. D. Ames. \u201cSufficient conditions for the Lipschitz continuity of QP-based multi-objective control of humanoid robots\u201d. In: Proc. CDC. 2013, pp. 2920\u20132926.\n\n[25] R. Munos and P. Bourgine. \u201cReinforcement learning for continuous stochastic control problems\u201d. In: Proc. NIPS. 1998, pp. 1029\u20131035.\n\n[26] Y. Nishiyama et al. \u201cHilbert space embeddings of POMDPs\u201d. In: arXiv preprint arXiv:1210.4887 (2012).\n\n[27] M. Ohnishi et al. \u201cBarrier-certified adaptive reinforcement learning with applications to brushbot navigation\u201d. In: arXiv preprint arXiv:1801.09627 (2018).\n\n[28] B. \u00d8ksendal. Stochastic differential equations. Springer, 2003.\n\n[29] D. Ormoneit and \u015a. Sen. \u201cKernel-based reinforcement learning\u201d. In: Machine Learning 49.2 (2002), pp. 161\u2013178.\n\n[30] Y. Pan and E. Theodorou. \u201cProbabilistic differential dynamic programming\u201d. In: Proc. NIPS. 2014, pp. 1907\u20131915.\n\n[31] A. Paszke et al. \u201cAutomatic differentiation in PyTorch\u201d. In: (2017).\n\n[32] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning. Vol. 1. MIT Press, Cambridge, 2006.\n\n[33] C. Richard, J. Bermudez, and P. Honeine. \u201cOnline prediction of time series data with kernels\u201d. In: IEEE Trans. 
Signal Process. 57.3 (2009), pp. 1058\u20131067.\n\n[34] I. Steinwart. \u201cOn the influence of the kernel on the consistency of support vector machines\u201d. In: J. Mach. Learn. Res. 2 (2001), pp. 67\u201393.\n\n[35] M. Sugiyama. Statistical reinforcement learning: modern machine learning approaches. CRC Press, 2015.\n\n[36] W. Sun and J. A. Bagnell. \u201cOnline Bellman residual and temporal difference algorithms with predictive error guarantees\u201d. In: Proc. IJCAI. 2016.\n\n[37] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.\n\n[38] R. S. Sutton, H. R. Maei, and C. Szepesv\u00e1ri. \u201cA convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation\u201d. In: Advances in Neural Information Processing Systems. 2009, pp. 1609\u20131616.\n\n[39] M. Takizawa and M. Yukawa. \u201cAdaptive nonlinear estimation based on parallel projection along affine subspaces in reproducing kernel Hilbert space\u201d. In: IEEE Trans. Signal Processing 63.16 (2015), pp. 4257\u20134269.\n\n[40] Y. Tassa et al. \u201cDeepMind Control Suite\u201d. In: arXiv preprint arXiv:1801.00690 (2018).\n\n[41] G. Taylor and R. Parr. \u201cKernelized value function approximation for reinforcement learning\u201d. In: Proc. ICML. 2009, pp. 1017\u20131024.\n\n[42] E. Theodorou, J. Buchli, and S. Schaal. \u201cReinforcement learning of motor skills in high dimensions: A path integral approach\u201d. In: Proc. IEEE ICRA. 2010, pp. 2397\u20132403.\n\n[43] E. Theodorou, Y. Tassa, and E. Todorov. \u201cStochastic differential dynamic programming\u201d. In: Proc. IEEE ACC. 2010, pp. 1125\u20131132.\n\n[44] J. N. Tsitsiklis and B. Van Roy. \u201cAnalysis of temporal-difference learning with function approximation\u201d. In: Proc. NIPS. 1997, pp. 1075\u20131081.\n\n[45] K. G. Vamvoudakis and F. L. Lewis. 
\u201cOnline actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem\u201d. In: Automatica 46.5 (2010), pp. 878\u2013888.\n\n[46] L. Wang, A. D. Ames, and M. Egerstedt. \u201cSafety barrier certificates for collisions-free multi-robot systems\u201d. In: IEEE Trans. Robotics (2017).\n\n[47] L. Wang, E. A. Theodorou, and M. Egerstedt. \u201cSafe learning of quadrotor dynamics using barrier certificates\u201d. In: arXiv preprint arXiv:1710.05472 (2017).\n\n[48] P. Wieland and F. Allg\u00f6wer. \u201cConstructive safety using control barrier functions\u201d. In: Proc. IFAC 40.12 (2007), pp. 462\u2013467.\n\n[49] X. Xu, D. Hu, and X. Lu. \u201cKernel-based least squares policy iteration for reinforcement learning\u201d. In: IEEE Trans. Neural Networks 18.4 (2007), pp. 973\u2013992.\n\n[50] X. Xu et al. \u201cRobustness of control barrier functions for safety critical control\u201d. In: Proc. IFAC 48.27 (2015), pp. 54\u201361.\n\n[51] M. Yukawa. \u201cMultikernel Adaptive Filtering\u201d. In: IEEE Trans. Signal Processing 60.9 (2012), pp. 4672\u20134682.\n\n[52] D.-X. Zhou. \u201cDerivative reproducing properties for kernel methods in learning theory\u201d. In: Journal of Computational and Applied Mathematics 220.1-2 (2008), pp. 456\u2013463.\n\n[53] K. Zhou, J. C. Doyle, K. Glover, et al. Robust and optimal control. Vol. 40. Prentice Hall, 1996.", "award": [], "sourceid": 1484, "authors": [{"given_name": "Motoya", "family_name": "Ohnishi", "institution": "Keio University/KTH Royal Institute of Technology/RIKEN"}, {"given_name": "Masahiro", "family_name": "Yukawa", "institution": "Keio University"}, {"given_name": "Mikael", "family_name": "Johansson", "institution": "KTH - Royal Institute of Technology"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "RIKEN / University of Tokyo"}]}