{"title": "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7216, "page_last": 7225, "abstract": "This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution is fourfold. 1) we define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot Model-Based method similar to minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms.", "full_text": "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning\n\nErwan Lecarpentier\nUniversité de Toulouse\nONERA - The French Aerospace Lab\nerwan.lecarpentier@isae-supaero.fr\n\nEmmanuel Rachelson\nUniversité de Toulouse\nISAE-SUPAERO\nemmanuel.rachelson@isae-supaero.fr\n\nAbstract\n\nThis work tackles the problem of robust planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. 
We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution is fourfold. 1) we define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a Model-Based method similar to minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms.\n\n1 Introduction\n\nOne of the hot topics of modern Artificial Intelligence (AI) is the ability for an agent to adapt its behavior to changing tasks. In the literature, this problem is often linked to the setting of Lifelong Reinforcement Learning (LRL) [Silver et al., 2013, Abel et al., 2018a,b] and learning in non-stationary environments [Choi et al., 1999, Jaulmes et al., 2005, Hadoux, 2015]. In LRL, the tasks presented to the agent change sequentially at discrete transition epochs [Silver et al., 2013]. Similarly, the non-stationary environments considered in the literature often evolve abruptly [Hadoux, 2015, Hadoux et al., 2014, Doya et al., 2002, Da Silva et al., 2006, Choi et al., 1999, 2000, 2001, Campo et al., 1991, Wiering, 2001]. In this paper, we investigate environments that change continuously over time, which we call Non-Stationary Markov Decision Processes (NSMDPs). 
In this setting, it is realistic to bound the evolution rate of the environment using a Lipschitz Continuity (LC) assumption.\n\nModel-Based Reinforcement Learning approaches [Sutton et al., 1998] benefit from the knowledge of a model, allowing them to reach impressive performance, as demonstrated by the Monte Carlo Tree Search (MCTS) algorithm [Silver et al., 2016]. In this respect, access to a model is a major concern in AI [Asadi et al., 2018, Jaulmes et al., 2005, Doya et al., 2002, Da Silva et al., 2006]. In the context of NSMDPs, we assume that an agent is provided with a snapshot model when its action is computed. By this, we mean that it only has access to the current model of the environment but not to its future evolution, as if it took a photograph of the environment without being able to predict how it will evolve. This hypothesis is realistic, because many environments have a tractable state while their future evolution is hard to predict [Da Silva et al., 2006, Wiering, 2001]. In order to solve LC-NSMDPs, we propose a method that considers the worst-case possible evolution of the model and performs planning w.r.t. this model. This is equivalent to considering Nature as an adversarial agent. The paper is organized as follows: first we describe the NSMDP setting and the regularity assumption (Section 2); then we outline related works (Section 3); next we explain the worst-case approach proposed in this paper (Section 4); then we describe an algorithm reflecting this approach (Section 5); finally we illustrate its behavior empirically (Section 6).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Non-Stationary Markov Decision Processes\n\nTo define a Non-Stationary Markov Decision Process (NSMDP), we revert to the initial MDP model introduced by Puterman [2014], where the transition and reward functions depend on time.\n\nDefinition 1. NSMDP. 
An NSMDP is an MDP whose transition and reward functions depend on the decision epoch. It is defined by a 5-tuple {S, T, A, (p_t)_{t∈T}, (r_t)_{t∈T}} where S is a state space; T ≡ {1, 2, ..., N} is the set of decision epochs with N ≤ +∞; A is an action space; p_t(s' | s, a) is the probability of reaching state s' while performing action a at decision epoch t in state s; r_t(s, a, s') is the scalar reward associated to the transition from s to s' with action a at decision epoch t.\n\nThis definition can be viewed as that of a stationary MDP whose state space has been enhanced with time. While this addition is trivial in episodic tasks, where an agent is given the opportunity to interact several times with the same MDP, it is different when the experience is unique. Indeed, no exploration is allowed along the temporal axis. Within a stationary, infinite-horizon MDP with a discounted criterion, it is proven that there exists an optimal policy that is Markovian, deterministic and stationary [Puterman, 2014]. This is not the case within NSMDPs, where the optimal policy is non-stationary in the most general case. Additionally, we define the expected reward received when taking action a at state s and decision epoch t as R_t(s, a) = E_{s'∼p_t(·|s,a)}[r_t(s, a, s')]. Without loss of generality, we assume the reward function to be bounded between −1 and 1. In this paper, we consider discrete-time decision processes with constant transition durations, which implies deterministic decision times in Definition 1. This assumption is mild, since many discrete-time sequential decision problems follow it. A non-stationary policy π is a sequence of decision rules π_t which map states to actions (or distributions over actions). 
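To make Definition 1 concrete, an NSMDP can be represented as a pair of time-indexed transition and reward functions over finite S and A. The sketch below is ours, not the paper's: the two-state chain, the drift rate and all numerical values are purely illustrative.

```python
import random

class NSMDP:
    """Minimal NSMDP (Definition 1): transition p_t and reward r_t depend on the decision epoch t."""
    def __init__(self, states, actions, p, r):
        self.states, self.actions = states, actions
        self.p = p  # p(t, s, a) -> dict {s': probability}
        self.r = r  # r(t, s, a, s') -> scalar reward in [-1, 1]

    def step(self, t, s, a, rng=random):
        """Sample s' ~ p_t(. | s, a) and return (s', r_t(s, a, s'))."""
        dist = self.p(t, s, a)
        s2 = rng.choices(list(dist), weights=list(dist.values()))[0]
        return s2, self.r(t, s, a, s2)

# Illustrative two-state NSMDP: the success probability of action "go" drifts slowly with t,
# i.e. the evolution rate is bounded (0.01 per decision epoch here).
def p(t, s, a):
    q = min(0.9, 0.5 + 0.01 * t)
    return {1: q, 0: 1.0 - q} if a == "go" else {s: 1.0}

def r(t, s, a, s2):
    return 1.0 if s2 == 1 else -0.1

mdp = NSMDP([0, 1], ["go", "stay"], p, r)
```

Sampling a transition at a given epoch then reads `s2, rew = mdp.step(t, s, a)`; the same model queried at two different epochs yields two different snapshot MDPs.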
For a stochastic non-stationary policy π_t(a | s), the value of a state s at decision epoch t within an infinite-horizon NSMDP is defined, with γ ∈ [0, 1) a discount factor, by:\n\nV^π_t(s) = E[ Σ_{i=t}^{∞} γ^{i−t} R_i(s_i, a_i) | s_t = s, a_i ∼ π_i(· | s_i), s_{i+1} ∼ p_i(· | s_i, a_i) ].\n\nThe definition of the state-action value function Q^π_t for π at decision epoch t is straightforward:\n\nQ^π_t(s, a) = R_t(s, a) + γ E_{s'∼p_t(·|s,a)}[ V^π_{t+1}(s') ].\n\nOverall, we defined an NSMDP as an MDP where we stress the distinction between state, time, and decision epoch, due to the inability for an agent to explore the temporal axis at will. This distinction is particularly relevant for non-episodic tasks, i.e. when there is no possibility to re-experience the same MDP starting from a prior date.\n\nThe regularity hypothesis. Many real-world problems can be modeled as an NSMDP. For instance, the problem of path planning for a glider immersed in a non-stationary atmosphere [Chung et al., 2015, Lecarpentier et al., 2017], or that of vehicle routing in dynamic traffic congestion. Realistically, we consider that the expected reward and transition functions do not evolve arbitrarily fast over time. Conversely, if such an assumption were not made, a chaotic evolution of the NSMDP would be allowed, which is both unrealistic and hard to solve. Hence, we assume that changes occur slowly over time. Mathematically, we formalize this hypothesis by bounding the evolution rate of the transition and expected reward functions, using the notion of Lipschitz Continuity (LC).\n\nDefinition 2. Lipschitz Continuity. 
Let (X, d_X) and (Y, d_Y) be two metric spaces and f : X → Y. The function f is L-Lipschitz Continuous (L-LC) with L ∈ R+ iff d_Y(f(x), f(x̂)) ≤ L d_X(x, x̂), ∀(x, x̂) ∈ X^2. L is called a Lipschitz constant of the function f.\n\nWe apply this hypothesis to the transition and reward functions of an NSMDP so that those functions are LC w.r.t. time. For the transition function, this leads to the consideration of a metric between probability density functions. For that purpose, we use the 1-Wasserstein distance [Villani, 2008].\n\nDefinition 3. 1-Wasserstein distance. Let (X, d_X) be a Polish metric space, μ, ν any probability measures on X, and Π(μ, ν) the set of joint distributions on X × X with marginals μ and ν. The 1-Wasserstein distance between μ and ν is W_1(μ, ν) = inf_{π∈Π(μ,ν)} ∫_{X×X} d_X(x, y) dπ(x, y).\n\nThe choice of the Wasserstein distance is motivated by the fact that it quantifies the distance between two distributions in a physical manner, respectful of the topology of the measured space [Dabney et al., 2018, Asadi et al., 2018]. First, it is sensitive to the difference between the supports of the distributions. Comparatively, the Kullback-Leibler divergence between distributions with disjoint supports is infinite. Secondly, if one considers two regions of the support where two distributions differ, the Wasserstein distance is sensitive to the distance between the elements of those regions. Comparatively, the total-variation metric is the same regardless of this distance.\n\nDefinition 4. (L_p, L_r)-LC-NSMDP. An (L_p, L_r)-LC-NSMDP is an NSMDP whose transition and reward functions are respectively L_p-LC and L_r-LC w.r.t. 
time, i.e., ∀(t, t̂, s, s', a) ∈ T^2 × S^2 × A,\n\nW_1(p_t(· | s, a), p_t̂(· | s, a)) ≤ L_p |t − t̂| and |r_t(s, a, s') − r_t̂(s, a, s')| ≤ L_r |t − t̂|.\n\nOne should remark that, for the sake of realism, the LC property should be defined with respect to actual decision times and not decision epoch indexes. In the present case, both have the same value, and we keep this convention for clarity. Our results however extend easily to the case where indexes and times do not coincide. From now on, we consider (L_p, L_r)-LC-NSMDPs, making Lipschitz continuity our regularity property. Notice that R is defined as a convex combination of r by the probability measure p. As a result, the notion of Lipschitz continuity of R is strongly related to that of r and p, as shown by Property 1. All the proofs of the paper can be found in the Appendix.\n\nProperty 1. Given an (L_p, L_r)-LC-NSMDP, the expected reward function R_t : s, a ↦ E_{s'∼p_t(·|s,a)}[r_t(s, a, s')] is L_R-LC with L_R = L_r + L_p.\n\nThis result shows that R's evolution rate is conditioned by the evolution rates of r and p. It allows one to work either with the reward function r or with its expectation R, benefiting from the same LC property.\n\n3 Related work\n\nIyengar [2005] introduced the framework of robust MDPs, where the transition function is allowed to evolve within a set of functions due to uncertainty. This differs from our work in two fundamental aspects: 1) we consider uncertainty in the reward model as well; 2) we use a stronger Lipschitz formulation on the set of possible transition and reward functions, this last point being motivated by its relevance to the non-stationary setting. Szita et al. 
[2002] also consider the robust MDP setting and adopt a constraint on the set of possible functions that differs from our LC assumption: they control by a scalar value the total-variation distance between transition functions of subsequent decision epochs. Such slowly changing environments allow model-free RL algorithms such as Q-Learning to find near-optimal policies. Lim et al. [2013] consider learning in robust MDPs where the model evolves in an adversarial manner for a subset of S × A. In that setting, they propose to learn to what extent the adversary can modify the model and to deduce a behavior close to the minimax policy. Even-Dar et al. [2009] studied the case of non-stationary reward functions with fixed transition models. No assumption is made on the set of possible functions, and they propose an algorithm achieving sub-linear regret w.r.t. the best stationary policy. Dick et al. [2014] viewed a similar setting from the perspective of online linear optimization. Csáji and Monostori [2008] studied the NSMDP setting with an assumption of reward and transition functions varying in a neighborhood of a reference reward-transition function pair. Finally, Abbasi et al. [2013] address the adversarial NSMDP setting with a mixing assumption instead of the LC assumption we make.\n\nNon-stationary environments have also been studied through the framework of Hidden Mode MDPs (HM-MDPs) introduced by Choi et al. [1999]. This is a special class of Partially Observable MDPs (POMDPs) [Kaelbling et al., 1998] where a hidden mode indexes a latent stationary MDP within which the agent evolves. Similarly to the context of LRL, the agent experiences a series of different MDPs over time. In this setting, Choi et al. [1999, 2000] proposed methods to learn the different models of the latent stationary MDPs. Doya et al. [2002] built a modular architecture switching between models and policies when a change is detected. 
Similarly, Wiering [2001], Da Silva et al. [2006] and Hadoux et al. [2014] proposed methods that track the switching occurrences and re-plan when needed. Overall, as in LRL, the HM-MDP setting considers abrupt evolutions of the transition and reward functions, whereas we consider continuous ones. Other settings have been considered, as by Jaulmes et al. [2005], who make no particular hypothesis on the evolution of the NSMDP. They build a learning algorithm for POMDP solving, weighting recently experienced transitions more than older ones to account for the time dependency.\n\nTo plan robustly within an NSMDP, our approach consists in exploiting the slow LC evolution of the environment. Utilizing Lipschitz continuity to infer bounds on a function is common in the RL, bandit and optimization communities [Kleinberg et al., 2008, Rachelson and Lagoudakis, 2010, Pirotta et al., 2015, Pazis and Parr, 2013, Munos, 2014]. We implement this approach with a minimax-like algorithm [Fudenberg and Tirole, 1991], where the environment is seen as an adversarial agent.\n\n4 Worst-case approach\n\nWe consider finding an optimal policy within an LC-NSMDP under the non-episodic task hypothesis. The latter prevents us from learning from previous experience data, since such data become outdated with time and no information samples have been collected yet for future time steps. An alternative is to use model-based RL algorithms such as MCTS. For a current state s_0, such algorithms focus on finding the optimal action a*_0 by using a generative model. This action is then undertaken and the operation repeated at the next state. However, using the true NSMDP model for this purpose is an unrealistic hypothesis, since this model is generally unknown. 
We assume the agent does not have access to the true NSMDP model; instead, we introduce the notion of snapshot model. Intuitively, the snapshot associated to time t_0 is a temporal slice of the NSMDP at t_0.\n\nDefinition 5. Snapshot of an NSMDP. The snapshot of an NSMDP {S, T, A, (p_t)_{t∈T}, (r_t)_{t∈T}} at decision epoch t_0, denoted by MDP_{t_0}, is the stationary MDP defined by the 4-tuple {S, A, p_{t_0}, r_{t_0}}, where p_{t_0}(s' | s, a) and r_{t_0}(s, a, s') are the transition and reward functions of the NSMDP at t_0.\n\nSimilarly to the NSMDP, this definition induces the existence of the snapshot expected reward R_{t_0}, defined by R_{t_0} : s, a ↦ E_{s'∼p_{t_0}(·|s,a)}[r_{t_0}(s, a, s')]. Notice that the snapshot MDP_{t_0} is stationary and coincides with the NSMDP only at t_0. Particularly, one can generate a trajectory {s_0, r_0, ···, s_k} within an NSMDP using the sequence of snapshots {MDP_{t_0}, ···, MDP_{t_0+k−1}} as a model. Overall, the hypothesis of using snapshot models amounts to considering a planning agent only able to get the current stationary model of the environment. In real-world problems, predictions are often uncertain or hard to perform, e.g. in the thermal soaring problem of a glider.\n\nWe consider a generic planning agent at s_0, t_0, using MDP_{t_0} as a model of the NSMDP. By planning, we mean conducting a look-ahead search within the possible trajectories starting from s_0, t_0, given a model of the environment. The search allows in turn to identify an optimal action w.r.t. the model. This action is then undertaken and the agent jumps to the next state, where the operation is repeated. The consequence of planning with MDP_{t_0} is that the estimated value of an s, t pair is the value of the optimal policy of MDP_{t_0}, written V*_{MDP_{t_0}}(s). The true optimal value of s at t within the NSMDP does not match this estimate because of the non-stationarity. 
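As an illustration of Definition 5, a snapshot is obtained by freezing the time argument of the NSMDP model, after which any stationary planner applies. The sketch below freezes a time-indexed model at t_0 and runs value iteration on the resulting stationary MDP; the toy two-state model and all constants are ours, for illustration only.

```python
def snapshot(p, r, t0):
    """Freeze a time-indexed model (p, r) at decision epoch t0 (Definition 5)."""
    return (lambda s, a: p(t0, s, a)), (lambda s, a, s2: r(t0, s, a, s2))

def value_iteration(states, actions, p0, r0, gamma=0.9, iters=200):
    """Optimal state values of a stationary (snapshot) MDP via value iteration."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(sum(q * (r0(s, a, s2) + gamma * V[s2])
                        for s2, q in p0(s, a).items())
                    for a in actions)
             for s in states}
    return V

# Illustrative two-state model whose transition drifts slowly with t.
def p(t, s, a):
    q = min(0.9, 0.5 + 0.01 * t)
    return {1: q, 0: 1 - q} if a == "go" else {s: 1.0}

def r(t, s, a, s2):
    return 1.0 if s2 == 1 else 0.0

p0, r0 = snapshot(p, r, t0=5)
V = value_iteration([0, 1], ["go", "stay"], p0, r0)
```

Planning with the snapshot alone yields exactly the estimate V*_{MDP_{t_0}}(s) discussed above; the worst-case machinery of the next sections corrects for the fact that the frozen model becomes stale as t moves away from t_0.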
The intuition we develop is that, given the slow evolution rate of the environment, for a state s seen at a future decision epoch during the search, we can predict a scope into which the transition and reward functions at s lie.\n\nProperty 2. Set of admissible snapshot models. Consider an (L_p, L_r)-LC-NSMDP and s, t, a ∈ S × T × A. The transition and expected reward functions (p_t, R_t) of the snapshot MDP_t respect\n\n(p_t, R_t) ∈ Δ_t := B_{W_1}(p_{t−1}(· | s, a), L_p) × B_{|·|}(R_{t−1}(s, a), L_R),\n\nwhere L_R = L_p + L_r and B_d(c, ρ) denotes the ball of centre c, defined with metric d and radius ρ.\n\nFor a future prediction at s, t, we consider the question of using a better model than p_{t_0}, R_{t_0}. The underlying evolution of the NSMDP being unknown, a desirable feature would be to use a model leading to a policy that is robust to every possible evolution. To that end, we propose to use the snapshots corresponding to the worst possible evolution scenario under the constraints of Property 2. We claim that such a practice is an efficient way to 1) ensure performance that is robust to all possible evolutions of the NSMDP and 2) avoid catastrophic terminal states. 
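Property 2 can be checked numerically. For distributions over a finite one-dimensional support, the 1-Wasserstein distance of Definition 3 reduces to the area between the two cumulative distribution functions, so membership in Δ_t amounts to two ball tests. The helper below is a sketch under that finite-support assumption; the example distributions and constants are ours.

```python
def wasserstein1(mu, nu, support):
    """W1 between two distributions on a sorted 1-D finite support:
    integral of |CDF_mu - CDF_nu| (Definition 3, specialized to the real line)."""
    w, cdf_gap = 0.0, 0.0
    for x, x_next in zip(support, support[1:]):
        cdf_gap += mu.get(x, 0.0) - nu.get(x, 0.0)
        w += abs(cdf_gap) * (x_next - x)
    return w

def in_admissible_set(p_prev, R_prev, p_t, R_t, L_p, L_R, support):
    """Membership test for Property 2: (p_t, R_t) in B_W1(p_prev, L_p) x B_abs(R_prev, L_R)."""
    return wasserstein1(p_prev, p_t, support) <= L_p and abs(R_prev - R_t) <= L_R

support = [0.0, 1.0, 2.0]
p_prev = {0.0: 0.5, 1.0: 0.5}
p_t = {0.0: 0.4, 1.0: 0.5, 2.0: 0.1}  # 0.1 of mass transported from 0 to 2: W1 = 0.2
```

With these numbers, a candidate snapshot with W1 shift 0.2 is admissible for L_p = 0.25 but rejected for L_p = 0.1, which is exactly how the Lipschitz constants bound how far the environment may drift in one decision epoch.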
Practically, this boils down to using, for s at t, a different value estimate than V*_{MDP_{t_0}}(s), which provides no robustness guarantees. Given a policy π = (π_t)_{t∈T} and a decision epoch t, a worst-case NSMDP corresponds to a sequence of transition and reward models minimizing the expected value of applying π in any pair (s, t), while remaining within the bounds of Property 2. We write V̄^π_t(s) this value for s at decision epoch t:\n\nV̄^π_t(s) := min_{(p_i,R_i)∈Δ_i, ∀i∈T} E[ Σ_{i=t}^{∞} γ^{i−t} R_i(s_i, a_i) | s_t = s, a_i ∼ π_i(· | s_i), s_{i+1} ∼ p_i(· | s_i, a_i) ]. (1)\n\nIntuitively, the worst-case NSMDP is a model of a non-stationary environment leading to the poorest possible performance for π, while being an admissible evolution of MDP_t. Let us define Q̄^π_t(s, a) as the worst-case Q-value for the pair (s, a) at decision epoch t:\n\nQ̄^π_t(s, a) := min_{(p,R)∈Δ_t} E_{s'∼p}[ R(s, a) + γ V̄^π_{t+1}(s') ]. (2)\n\n[Figure 1: Tree structure and results from the Non-Stationary bridge experiment. (a) Tree structure, d_max = 2, A = {a_1, a_2}, with decision, chance and leaf nodes. (b) Expected return E[Σ r] and CVaR at 5% of RATS, DP-snapshot and DP-NSMDP.]\n\n5 Risk-Averse Tree-Search algorithm\n\nThe algorithm. Tree search algorithms within MDPs have been well studied and cover two classes of search trees, namely closed-loop [Keller and Helmert, 2013, Kocsis and Szepesvári, 2006, Browne et al., 2012] and open-loop [Bubeck and Munos, 2010, Lecarpentier et al., 2018]. 
Following [Keller and Helmert, 2013], we consider closed-loop search trees, composed of decision nodes alternating with chance nodes. We adapt their formulation to take time into account, resulting in the following definitions. A decision node at depth t, denoted by ν_{s,t}, is labeled by a unique state / decision epoch pair (s, t). The edges leading to its children chance nodes correspond to the available actions at (s, t). A chance node, denoted by ν_{s,t,a}, is labeled by a state / decision epoch / action triplet (s, t, a). The edges leading to its children decision nodes correspond to the reachable state / decision epoch pairs (s', t') after performing a in (s, t), as illustrated by Figure 1a. We consider the problem of estimating the optimal action a*_0 at s_0, t_0 within a worst-case NSMDP, knowing MDP_{t_0}. This problem is twofold. It requires 1) estimating the worst-case NSMDP given MDP_{t_0} and 2) exploring the latter in order to identify a*_0. We propose to tackle both problems with an algorithm inspired by the minimax algorithm [Fudenberg and Tirole, 1991], where the max operator corresponds to the agent's policy, seeking to maximize the return, and the min operator corresponds to the worst-case model, seeking to minimize the return. Estimating the worst-case NSMDP requires estimating the sequence of subsequent snapshots minimizing Equation 2. The inter-dependence of those snapshots (Equation 1) makes the problem hard to solve [Iyengar, 2005], particularly because of the combinatorial nature of the opponent's action space. Instead, we propose to solve a relaxation of this problem, by considering snapshots only constrained by MDP_{t_0}. Making this approximation leaves a possibility to violate Property 2, but it allows for an efficient search within the developed tree and (as will be shown experimentally) leads to robust policies. 
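The decision / chance node structure above translates directly into data structures. The sketch below is a minimal closed-loop tree skeleton in the spirit of Figure 1a; the class and field names are ours, not from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionNode:
    """Labeled by a unique (state, decision epoch) pair; children are chance nodes, one per action."""
    state: object
    t: int
    depth: int
    children: list = field(default_factory=list)  # list of ChanceNode
    value: float = 0.0

@dataclass
class ChanceNode:
    """Labeled by a (state, decision epoch, action) triplet; children are reachable decision nodes."""
    state: object
    t: int
    action: object
    children: list = field(default_factory=list)  # list of DecisionNode
    value: float = 0.0

def expand(node, actions, reachable):
    """Attach one chance node per action, then one decision node per reachable next state."""
    for a in actions:
        cn = ChanceNode(node.state, node.t, a)
        for s2 in reachable(node.state, a):
            cn.children.append(DecisionNode(s2, node.t + 1, node.depth + 1))
        node.children.append(cn)

root = DecisionNode(state="s0", t=0, depth=0)
expand(root, ["a1", "a2"], reachable=lambda s, a: ["s1", "s2"])
```

Each level of expansion advances the decision epoch by one, which is what lets the search anticipate how far the admissible model set Δ^t_{t_0} has grown at every depth.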
For that purpose, we define the set of admissible snapshot models w.r.t. MDP_{t_0} by Δ^t_{t_0} := B_{W_1}(p_{t_0}(· | s, a), L_p |t − t_0|) × B_{|·|}(R_{t_0}(s, a), L_R |t − t_0|). The relaxed analogues of Equations 1 and 2 for s, t, a ∈ S × T × A are defined as follows:\n\nV̂^π_{t_0,t}(s) := min_{(p_i,R_i)∈Δ^i_{t_0}, ∀i∈T} E[ Σ_{i=t}^{∞} γ^{i−t} R_i(s_i, a_i) | s_t = s, a_i ∼ π_i(· | s_i), s_{i+1} ∼ p_i(· | s_i, a_i) ],\n\nQ̂^π_{t_0,t}(s, a) := min_{(p,R)∈Δ^t_{t_0}} E_{s'∼p}[ R(s, a) + γ V̂^π_{t_0,t+1}(s') ].\n\nAlgorithm 1: RATS algorithm\n\nRATS(s_0, t_0, maxDepth):\n  ν_0 = rootNode(s_0, t_0)\n  Minimax(ν_0, maxDepth)\n  ν* = argmax_{ν' ∈ ν_0.children} ν'.value\n  return ν*.action\n\nMinimax(ν, maxDepth):\n  if ν is a DecisionNode then\n    if ν.state is terminal or ν.depth = maxDepth then\n      return ν.value = heuristicValue(ν.state)\n    else\n      return ν.value = max_{ν' ∈ ν.children} Minimax(ν', maxDepth)\n  else\n    return ν.value = min_{(p,R) ∈ Δ^t_{t_0}} R(ν) + γ Σ_{ν' ∈ ν.children} p(ν' | ν) · Minimax(ν', maxDepth)\n\nTheir optimal counterparts, while seeking to find the optimal policy, verify the following equations:\n\nV̂*_{t_0,t}(s) = max_{a∈A} Q̂*_{t_0,t}(s, a), (3)\n\nQ̂*_{t_0,t}(s, a) = min_{(p,R)∈Δ^t_{t_0}} E_{s'∼p}[ R(s, a) + γ V̂*_{t_0,t+1}(s') ]. (4)\n\nWe now provide a method to calculate those quantities within the nodes of the tree search algorithm.\n\nMax nodes. 
A decision node ν_{s,t} corresponds to a max node, due to the greediness of the agent w.r.t. the subsequent values of the children. We aim at maximizing the return while retaining a risk-averse behavior. As a result, the value of ν_{s,t} follows Equation 3 and is defined as:\n\nV(ν_{s,t}) = max_{a∈A} V(ν_{s,t,a}). (5)\n\nMin nodes. A chance node ν_{s,t,a} corresponds to a min node, due to the use of a worst-case NSMDP as a model, which minimizes the value of ν_{s,t,a} w.r.t. the reward and the subsequent values of its children. Writing the value of ν_{s,t,a} as the value of s, t, a within the worst-case snapshot minimizing Equation 4, and using the children's values as values for the next reachable states, leads to Equation 6:\n\nV(ν_{s,t,a}) = min_{(p,R)∈Δ^t_{t_0}} R(s, a) + γ E_{s'∼p}[ V(ν_{s',t+1}) ]. (6)\n\nOur approach considers the environment as an adversarial agent, as in an asymmetric two-player game, in order to search for a robust plan. The resulting algorithm, RATS for Risk-Averse Tree-Search, is described in Algorithm 1. Given an initial state / decision epoch pair, a minimax tree is built using the snapshot MDP_{t_0} and the operators corresponding to Equations 5 and 6, in order to estimate the worst-case snapshots at each depth. Once the tree is built, the action leading to the best possible value from the root node is selected and a real transition is performed. The next state is then reached, the new snapshot model MDP_{t_0+1} is acquired and the process re-starts. Notice the use of R(ν) and p(ν' | ν) in the pseudo-code: they are light notations standing respectively for R_t(s, a) corresponding to a chance node ν ≡ ν_{s,t,a}, and for the probability p_t(s' | s, a) to jump to a decision node ν' ≡ ν_{s',t+1} given a chance node ν ≡ ν_{s,t,a}. The tree built by RATS is entirely developed until the maximum depth d_max. 
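The recursion of Equations 5 and 6 can be sketched compactly. At chance nodes the minimization admits a closed form (cf. Property 3 below): lower the snapshot reward by L_R|t − t_0| and shift probability mass onto the worst child. The sketch below is ours and takes one simplifying liberty, flagged in the comments: it uses the budget L_p|t − t_0| directly as a bound on the shifted mass instead of dividing by W_1(p_sat, p_{t_0}).

```python
# Sketch of the RATS minimax recursion (Algorithm 1) on a finite snapshot model.
# Assumed inputs (ours, for illustration): snapshot functions p0(s, a) -> {s': prob}
# and R0(s, a) -> float at epoch t0, Lipschitz constants L_p and L_R, heuristic H(s, t).

def rats_value(s, t, depth, t0, p0, R0, actions, L_p, L_R, H, gamma, d_max, terminal):
    """Worst-case value of decision node nu_{s,t}: max over chance nodes (Equation 5)."""
    if terminal(s) or depth == d_max:
        return H(s, t)  # leaf node: bootstrap with the heuristic
    return max(chance_value(s, t, a, depth, t0, p0, R0, actions,
                            L_p, L_R, H, gamma, d_max, terminal)
               for a in actions)

def chance_value(s, t, a, depth, t0, p0, R0, actions, L_p, L_R, H, gamma, d_max, terminal):
    """Min node (Equation 6): reward lowered by L_R|t - t0|, probability mass shifted
    toward the worst child (simplified lambda: mass budget taken as L_p|t - t0|)."""
    dt = abs(t - t0)
    children = {s2: rats_value(s2, t + 1, depth + 1, t0, p0, R0, actions,
                               L_p, L_R, H, gamma, d_max, terminal)
                for s2 in p0(s, a)}
    worst = min(children, key=children.get)
    lam = min(1.0, L_p * dt)
    p_hat = {s2: (1 - lam) * q + (lam if s2 == worst else 0.0)
             for s2, q in p0(s, a).items()}
    r_hat = R0(s, a) - L_R * dt
    return r_hat + gamma * sum(q * children[s2] for s2, q in p_hat.items())

# Tiny illustrative instance: one action, two successor states of values +1 and -1.
p0 = lambda s, a: {0: 0.5, 1: 0.5}
R0 = lambda s, a: 0.0
H = lambda s, t: 1.0 if s == 1 else -1.0
```

At the root (t = t_0) the admissible set is a singleton and the snapshot is used as-is; one epoch later the adversary may both lower the reward and tilt the transition toward the worst outcome, which is what makes the resulting plan risk-averse.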
A heuristic function is used to evaluate the leaf nodes of the tree.\n\nAnalysis of RATS. We are interested in characterizing Algorithm 1 without function approximation, and therefore consider finite, countable sets S × A. We now detail the computation of the min operator (Property 3), the computational complexity of RATS (Property 4) and the heuristic function.\n\nProperty 3. Closed-form expression of the worst-case snapshot of a chance node. Following Algorithm 1, a solution to Equation 6 is given by:\n\nR̂(s, a) = R_{t_0}(s, a) − L_R |t − t_0| and p̂(· | s, a) = (1 − λ) p_{t_0}(· | s, a) + λ p_sat(· | s, a),\n\nwith p_sat(· | s, a) = (0, ···, 0, 1, 0, ···, 0) with the 1 at position argmin_{s'} V(ν_{s',t+1}), and λ = 1 if W_1(p_sat, p_{t_0}) ≤ L_p |t − t_0|, λ = L_p |t − t_0| / W_1(p_sat, p_{t_0}) otherwise.\n\nProperty 4. Computational complexity. The total computational complexity of Algorithm 1 is O(B |S|^{1.5} |A| (|S||A|)^{d_max}), with B the number of time steps and d_max the maximum depth.\n\nHeuristic function. As in vanilla minimax algorithms, Algorithm 1 bootstraps the values of the leaf nodes with a heuristic function if these leaves do not correspond to terminal states. Given such a leaf node ν_{s,t}, a heuristic aims at estimating the value of the optimal policy at (s, t) within the worst-case NSMDP, i.e. V̂*_{t_0,t}(s). Let H(s, t) be such a heuristic function; we call heuristic error in (s, t) the difference between H(s, t) and V̂*_{t_0,t}(s). Assuming that the heuristic error is uniformly bounded, the following property provides an upper bound on the propagated error due to the choice of H.\n\nProperty 5. Upper bound on the propagated heuristic error within RATS. Consider an agent executing Algorithm 1 at s_0, t_0 with a heuristic function H. 
We note L the set of all leaf nodes. Suppose that the heuristic error is uniformly bounded, i.e. ∃δ > 0, ∀ν_{s,t} ∈ L, |H(s) − V̂*_{t_0,t}(s)| ≤ δ. Then, for every decision node ν_{s,t} and chance node ν_{s,t,a} at any depth d ∈ [0, d_max]:\n\n|V(ν_{s,t}) − V̂*_{t_0,t}(s)| ≤ γ^{d_max−d} δ and |V(ν_{s,t,a}) − Q̂*_{t_0,t}(s, a)| ≤ γ^{d_max−d} δ.\n\nThis last result implies that with any heuristic function H inducing a uniform heuristic error, the propagated error at the root of the tree is guaranteed to be upper bounded by γ^{d_max} δ. In particular, since the reward function is bounded by hypothesis, we have V̂*_{t_0,t}(s) ≤ 1/(1 − γ). Thus, selecting for instance the zero function ensures a root-node heuristic error of at most γ^{d_max}/(1 − γ). In order to improve the precision of the algorithm, we propose to guide the heuristic by using a function reflecting better the value of state s at leaf node ν_{s,t}. The ideal function would of course be H(s) = V̂*_{t_0,t}(s), reducing the heuristic error to zero, but this is intractable. Instead, we suggest using the value of s within the snapshot MDP_t under an evaluation policy π, i.e. H(s) = V^π_{MDP_t}(s). This snapshot is also not available, but Property 6 provides a range wherein this value lies.\n\nProperty 6. Bounds on the snapshot values. Let s ∈ S, π a stationary policy, and MDP_{t_0} and MDP_t two snapshot MDPs with t, t_0 ∈ T^2. We note V^π_{MDP_i}(s) the value of s within MDP_i following π. Then,\n\n|V^π_{MDP_{t_0}}(s) − V^π_{MDP_t}(s)| ≤ |t − t_0| L_R / (1 − γ).\n\nSince MDP_{t_0} is available, V^π_{MDP_{t_0}}(s) can be estimated, e.g. via Monte Carlo roll-outs. Let V̂^π_{MDP_{t_0}}(s) denote such an estimate. 
Following Property 6, V̂^π_{MDP_{t_0}}(s) − |t − t_0| L_R / (1 − γ) ≤ V^π_{MDP_t}(s). Hence, a worst-case heuristic on V^π_{MDP_t}(s) is H(s) = V̂^π_{MDP_{t_0}}(s) − |t − t_0| L_R / (1 − γ). The bounds provided by Property 5 decrease quickly with d_max and, given that d_max is large enough, RATS provides the optimal risk-averse policy, maximizing the worst-case value for any evolution of the NSMDP.\n\n6 Experiments\n\nWe compare the RATS algorithm with two policies.¹ The first one, named DP-snapshot, uses Dynamic Programming to compute the optimal actions w.r.t. the snapshot models at each decision epoch. The second one, named DP-NSMDP, uses the real NSMDP as a model to provide its optimal action. The latter behaves as an omniscient agent and should be seen as an upper bound on the performance. We choose a particular grid-world domain coined “Non-Stationary bridge”, illustrated in Appendix, Section 7. An agent starts at the state labeled S in the center, and the goal is to reach one of the two terminal states labeled G, where a reward of +1 is received. The gray cells represent holes, which are terminal states where a reward of −1 is received. Reaching the goal on the right leads to the highest payoff since it is closest to the initial state and a discount factor γ = 0.9 is applied. The actions are A = {Up, Right, Down, Left}. The transition function is stochastic and non-stationary. At decision epoch t = 0, any action deterministically yields the intuitive outcome. With time, when applying Left or Right, the probability to reach the positions usually stemming from Up and Down increases symmetrically until reaching 0.45. We set the Lipschitz constant L_p = 1. Additionally, we introduce a parameter ε ∈ [0, 1] controlling the behavior of the environment. If ε = 0, only the left-hand side bridge becomes slippery with time, which reflects a close to worst-case evolution for a policy aiming at the left-hand side goal. 
If ε = 1, only the right-hand side bridge becomes slippery with time. This reflects a close to worst-case evolution for a policy aiming at the right-hand side goal. In between, the misstep probability is proportionally balanced between left and right. One should note that changing ε from 0 to 1 does not cover all the possible evolutions from MDP_{t0}, but it provides a concrete, graphical illustration of RATS's behavior for various possible evolutions of the NSMDP.

¹ Code: https://github.com/SuReLI/rats-experiments – ML reproducibility checklist: Appendix, Section 8.

Figure 2: Discounted return of the three algorithms for various values of ε. (a) Discounted return vs ε, with 50% of the standard deviation. (b) Discounted return distributions for ε ∈ {0, 0.5, 1}.

We tested RATS with dmax = 6, so that leaf nodes in the search tree are terminal states. Hence, the optimal risk-averse policy is applied and no heuristic approximation is made. Our goal is to demonstrate that planning in this worst-case NSMDP minimizes the loss given any possible evolution of the environment. To illustrate this, we report results reflecting different evolutions of the same NSMDP using the ε factor. It should be noted that, at t = 0, RATS always moves to the left, even if that goal is further away, since going to the right may be risky if the probabilities to go Up and Down increase. This corresponds to the careful, risk-averse behavior. Conversely, DP-snapshot always moves to the right, since MDP_0 does not capture this risk. As a result, the ε = 0 case reflects a favorable evolution for DP-snapshot and a bad one for RATS. The opposite occurs with ε = 1, where the cautious behavior dominates the risky one, and the in-between cases mitigate this effect.

In Figure 2a, we display the achieved expected return for each algorithm as a function of ε, i.e.
as a function of the possible evolutions of the NSMDP. As expected, the performance of DP-snapshot strongly depends on this evolution: it achieves a high return for ε = 0 and a low return for ε = 1. Conversely, the performance of RATS varies little across the different values of ε. The effect illustrated here is that RATS maximizes the minimal possible return given any evolution of the NSMDP: it provides the guarantee to achieve the best return in the worst case. This behavior is highly desirable when one requires robust performance guarantees as, for instance, in critical certification processes. Figure 2b displays the return distributions of the three algorithms for ε ∈ {0, 0.5, 1}. The effect seen here is the tendency of RATS to diminish the left tail of the distribution, corresponding to low returns, for each evolution. This corresponds to the optimized criterion, i.e. robustly maximizing the worst-case value. A common risk measure is the Conditional Value at Risk (CVaR), defined as the expected return in the worst q% of cases. We report the CVaR at 5% achieved by each algorithm in Table 1b. Notice that RATS always maximizes the CVaR compared to both DP-snapshot and DP-NSMDP. Indeed, even if the latter uses the true model, the criterion optimized by DP is the expected return.

7 Conclusion

We proposed an approach for robust planning in non-stationary stochastic environments. We introduced the framework of Lipschitz-Continuous Non-Stationary MDPs (NSMDPs) and derived the Risk-Averse Tree-Search (RATS) algorithm, designed to predict the worst-case evolution and to plan optimally w.r.t. this worst-case NSMDP. We analyzed RATS theoretically and showed that it approximates a worst-case NSMDP with a control parameter that is the depth of the search tree. We showed empirically the benefit of the approach, which searches for the highest lower bound on the worst achievable score.
RATS is robust to every possible evolution of the environment, i.e. it maximizes the expected worst-case outcome over the whole set of possible NSMDPs. Our method was applied to the uncertainty on the evolution of a model. More generally, it could be extended to any uncertainty on the model used for planning, given bounds on the set of feasible models. The purpose of this contribution is to lay a basis of worst-case analysis for robust solutions to NSMDPs. As is, RATS is computationally intensive, and scaling the algorithm to larger problems is an exciting future challenge.

Acknowledgments

This research was supported by the Occitanie region, France.

References

Y. Abbasi, P. L. Bartlett, V. Kanade, Y. Seldin, and C. Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, 2013.

D. Abel, D. Arumugam, L. Lehnert, and M. Littman. State abstractions for lifelong reinforcement learning. In International Conference on Machine Learning, pages 10–19, 2018a.

D. Abel, Y. Jinnai, S. Y. Guo, G. Konidaris, and M. Littman. Policy and value transfer in lifelong reinforcement learning. In International Conference on Machine Learning, pages 20–29, 2018b.

K. Asadi, D. Misra, and M. L. Littman. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.

C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods.
IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

S. Bubeck and R. Munos. Open loop optimistic planning. In 10th Conference on Learning Theory, 2010.

L. Campo, P. Mookerjee, and Y. Bar-Shalom. State estimation for systems with sojourn-time-dependent Markov model switching. IEEE Transactions on Automatic Control, 36(2):238–243, 1991.

S. P. Choi, D.-Y. Yeung, and N. L. Zhang. Hidden-mode Markov decision processes. In IJCAI Workshop on Neural, Symbolic, and Reinforcement Methods for Sequence Learning. Citeseer, 1999.

S. P. Choi, D.-Y. Yeung, and N. L. Zhang. Hidden-mode Markov decision processes for nonstationary sequential decision making. In Sequence Learning, pages 264–287. Springer, 2000.

S. P. Choi, N. L. Zhang, and D.-Y. Yeung. Solving hidden-mode Markov decision problems. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, 2001.

J. J. Chung, N. R. Lawrance, and S. Sukkarieh. Learning to soar: Resource-constrained exploration in reinforcement learning. The International Journal of Robotics Research, 34(2):158–172, 2015.

B. C. Csáji and L. Monostori. Value function based reinforcement learning in changing Markovian environments. Journal of Machine Learning Research, 9(Aug):1679–1709, 2008.

B. C. Da Silva, E. W. Basso, A. L. Bazzan, and P. M. Engel. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning, pages 217–224. ACM, 2006.

W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

T. Dick, A. Gyorgy, and C. Szepesvari. Online learning in Markov decision processes with changing cost sequences.
In International Conference on Machine Learning, pages 512–520, 2014.

K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.

E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.

D. Fudenberg and J. Tirole. Game theory. Cambridge, Massachusetts, 393(12):80, 1991.

E. Hadoux. Markovian sequential decision-making in non-stationary environments: application to argumentative debates. PhD thesis, UPMC, Sorbonne Universités CNRS, 2015.

E. Hadoux, A. Beynier, and P. Weng. Sequential decision-making under non-stationary environments via sequential change-point detection. In Learning over Multiple Contexts (LMCE), 2014.

G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

R. Jaulmes, J. Pineau, and D. Precup. Learning in non-stationary partially observable Markov decision processes. In ECML Workshop on Reinforcement Learning in Non-Stationary Environments, volume 25, pages 26–32, 2005.

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

T. Keller and M. Helmert. Trial-based heuristic tree search for finite horizon MDPs. In ICAPS, 2013.

R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008.

L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

E. Lecarpentier, S. Rapp, M. Melo, and E. Rachelson. Empirical evaluation of a Q-learning algorithm for model-free autonomous soaring. arXiv preprint arXiv:1707.05668, 2017.

E. Lecarpentier, G. Infantes, C. Lesire, and E. Rachelson. Open loop execution of tree-search algorithms. In IJCAI, 2018.

S. H. Lim, H. Xu, and S. Mannor. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 701–709, 2013.

R. Munos. From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129, 2014.

J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In AAAI, 2013.

M. Pirotta, M. Restelli, and L. Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

E. Rachelson and M. G. Lagoudakis. On the locality of action domination in sequential decision making. 2010.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, page 05, 2013.

R. S. Sutton, A. G. Barto, et al. Reinforcement learning: An introduction. MIT Press, 1998.

I. Szita, B. Takács, and A. Lörincz. ε-MDPs: Learning in varying environments. Journal of Machine Learning Research, 3(Aug):145–174, 2002.

C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

M. A. Wiering. Reinforcement learning in dynamic environments using instantiated information.
In Machine Learning: Proceedings of the Eighteenth International Conference (ICML2001), pages 585–592, 2001.