{"title": "Safe Exploration in Finite Markov Decision Processes with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4312, "page_last": 4320, "abstract": "In classical reinforcement learning, agents accept arbitrary short-term loss for long-term gain when exploring their environment. This is infeasible for safety-critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment. In this paper, we address the problem of safely exploring finite Markov decision processes (MDPs). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm, SAFEMDP, for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.", "full_text": "Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

Matteo Turchetta, ETH Zurich, matteotu@ethz.ch
Felix Berkenkamp, ETH Zurich, befelix@ethz.ch
Andreas Krause, ETH Zurich, krausea@ethz.ch

Abstract

In classical reinforcement learning, agents accept arbitrary short-term loss for long-term gain when exploring their environment. This is infeasible for safety-critical applications such as robotics, where even a single unsafe action may cause system failure or harm the environment.
In this paper, we address the problem of safely exploring finite Markov decision processes (MDPs). We define safety in terms of an a priori unknown safety constraint that depends on states and actions and satisfies certain regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm, SAFEMDP, for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.

1 Introduction

Today's robots are required to operate in variable and often unknown environments. The traditional solution is to specify all potential scenarios that a robot may encounter during operation a priori. This is time consuming or even infeasible. As a consequence, robots need to be able to learn and adapt to unknown environments autonomously [10, 2]. While exploration algorithms are known, safety is still an open problem in the development of such systems [18]. In fact, most learning algorithms allow robots to make unsafe decisions during exploration. This can damage the platform or its environment.

In this paper, we provide a solution to this problem and develop an algorithm that enables agents to safely and autonomously explore unknown environments. Specifically, we consider the problem of exploring a Markov decision process (MDP), where it is a priori unknown which state-action pairs are safe.
Our algorithm cautiously explores this environment without taking actions that are unsafe or that may leave the exploring agent stuck.

Related Work. Safe exploration is an open problem in the reinforcement learning community and several definitions of safety have been proposed [16]. In risk-sensitive reinforcement learning, the goal is to maximize the expected return for the worst-case scenario [5]. However, these approaches only minimize risk and do not treat safety as a hard constraint. For example, Geibel and Wysotzki [7] define risk as the probability of driving the system to a previously known set of undesirable states. The main difference to our approach is that we do not assume the undesirable states to be known a priori. Garcia and Fernández [6] propose to ensure safety by means of a backup policy; that is, a policy that is known to be safe in advance. Our approach is different, since it does not require a backup policy but only a set of initially safe states from which the agent starts to explore. Another approach that makes use of a backup policy is shown by Hans et al. [9], where safety is defined in terms of a minimum reward, which is learned from data.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Moldovan and Abbeel [14] provide probabilistic safety guarantees at every time step by optimizing over ergodic policies; that is, policies that let the agent recover from any visited state. This approach needs to solve a large linear program at every time step, which is computationally demanding even for small state spaces. Nevertheless, the idea of ergodicity also plays an important role in our method. In the control community, safety is mostly considered in terms of stability or constraint satisfaction of controlled systems. Akametalu et al. [1] use reachability analysis to ensure stability under the assumption of bounded disturbances.
The work in [3] uses robust control techniques in order to ensure robust stability for model uncertainties, while the uncertain model is improved.

Another field that has recently considered safety is Bayesian optimization [13]. There, in order to find the global optimum of an a priori unknown function [21], regularity assumptions in the form of a Gaussian process (GP) [17] prior are made. The corresponding GP posterior distribution over the unknown function is used to guide evaluations to informative locations. In this setting, safety-centered approaches include the work of Sui et al. [22] and Schreiter et al. [20], where the goal is to find the safely reachable optimum without violating an a priori unknown safety constraint at any evaluation. To achieve this, the function is cautiously explored, starting from a set of points that is known to be safe initially. The method in [22] was applied in the field of robotics to safely optimize the controller parameters of a quadrotor vehicle [4]. However, they considered a bandit setting, where at each iteration any arm can be played. In contrast, we consider exploring an MDP, which introduces restrictions in terms of reachability that have not been considered in Bayesian optimization before.

Contribution. We introduce SAFEMDP, a novel algorithm for safe exploration in MDPs. We model safety via an a priori unknown constraint that depends on state-action pairs. Starting from an initial set of states and actions that are known to satisfy the safety constraint, the algorithm exploits the regularity assumptions on the constraint function in order to determine if nearby, unvisited states are safe. This leads to safe exploration, where only state-action pairs that are known to fulfil the safety constraint are evaluated. The main contribution consists of extending the work on safe Bayesian optimization in [22] from the bandit setting to deterministic, finite MDPs.
In order to achieve this, we explicitly consider not only the safety constraint, but also the reachability properties induced by the MDP dynamics. We provide a full theoretical analysis of the algorithm. It provably enjoys similar safety guarantees in terms of ergodicity as discussed in [14], but at a reduced computational cost. The reason for this is that our method separates safety from the reachability properties of the MDP. Beyond this, we prove that SAFEMDP is able to fully explore the safely reachable region of the MDP, without getting stuck or violating the safety constraint with high probability. To the best of our knowledge, this is the first full exploration result in MDPs subject to a safety constraint. We validate our method on an exploration task, where a rover has to explore an a priori unknown map.

2 Problem Statement

In this section, we define our problem and assumptions. The unknown environment is modeled as a finite, deterministic MDP [23]. Such an MDP is a tuple ⟨𝒮, 𝒜(·), f(s, a), r(s, a)⟩ with a finite set of states 𝒮, a set of state-dependent actions 𝒜(·), a known, deterministic transition model f(s, a), and a reward function r(s, a). In the typical reinforcement learning framework, the goal is to maximize the cumulative reward. In this paper, we consider the problem of safely exploring the MDP. Thus, instead of aiming to maximize the cumulative rewards, we define r(s, a) as an a priori unknown safety feature. Although r(s, a) is unknown, we make regularity assumptions about it to make the problem tractable. When traversing the MDP, at each discrete time step, k, the agent has to decide which action, and thereby which state, to visit next. We assume that the underlying system is safety-critical and that for any visited state-action pair, (s_k, a_k), the unknown, associated safety feature, r(s_k, a_k), must be above a safety threshold, h.
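The finite, deterministic MDP tuple above can be represented directly in code. The following is a minimal sketch (not from the paper's released implementation); the class name, the corridor example, and the action encoding are illustrative assumptions:

```python
# Minimal sketch of the finite, deterministic MDP tuple <S, A(.), f, r>.
# States and actions are integers; the safety feature r is a priori unknown
# to the agent and therefore not stored here.

class FiniteMDP:
    def __init__(self, n_states, actions, f):
        self.states = range(n_states)   # finite state set S
        self.actions = actions          # actions[s] -> available actions A(s)
        self.f = f                      # f(s, a) -> next state (deterministic)

# Example: a 1-D corridor with 5 states and actions left/right (0/1).
n = 5
actions = {s: [0, 1] for s in range(n)}
f = lambda s, a: max(0, s - 1) if a == 0 else min(n - 1, s + 1)
mdp = FiniteMDP(n, actions, f)
assert mdp.f(0, 0) == 0 and mdp.f(3, 1) == 4
```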
While the assumption of deterministic dynamics does not hold for general MDPs, in our framework, uncertainty about the environment is captured by the safety feature. If requested, the agent can obtain noisy measurements of the safety feature, r(s_k, a_k), by taking action a_k in state s_k. The index t is used to index measurements, while k denotes movement steps. Typically k ≥ t.

It is hopeless to achieve the goal of safe exploration unless the agent starts in a safe location. Hence, we assume that the agent starts in an initial set of state-action pairs, S_0, that is known to be safe a priori. The goal is to identify the maximum safely reachable region starting from S_0, without visiting any unsafe states. For clarity of exposition, we assume that safety depends on states only; that is, r(s, a) = r(s). We provide an extension to safety features that also depend on actions in Sec. 3.

Figure 1: Illustration of the set operators with S̄ = {s̄_1, s̄_2}. The set S = {s} can be reached from s̄_2 in one step and from s̄_1 in two steps, while only the state s̄_1 can be reached from s. Visiting s̄_1 is safe; that is, it is above the safety threshold, is reachable, and there exists a safe return path through s̄_2.

Assumptions on the reward function. Ensuring that all visited states are safe without any prior knowledge about the safety feature is an impossible task (e.g., if the safety feature is discontinuous). However, many practical safety features exhibit some regularity, where similar states will lead to similar values of r.

In the following, we assume that 𝒮 is endowed with a positive definite kernel function k(·, ·) and that the function r(·) has bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS) [19]. The norm induced by the inner product of the RKHS indicates the smoothness of functions with respect to the kernel. This assumption allows us to model r as a GP [21], r(s) ∼ GP(μ(s), k(s, s′)).
A GP is a probability distribution over functions that is fully specified by its mean function μ(s) and its covariance function k(s, s′). The randomness expressed by this distribution captures our uncertainty about the environment. We assume μ(s) = 0 for all s ∈ 𝒮, without loss of generality. The posterior distribution over r(·) can be computed analytically, based on t measurements at states D_t = {s_1, . . . , s_t} ⊆ 𝒮 with measurements, y_t = [r(s_1) + ω_1, . . . , r(s_t) + ω_t]^T, that are corrupted by zero-mean Gaussian noise, ω_t ∼ N(0, σ²). The posterior is a GP distribution with mean μ_t(s) = k_t(s)^T (K_t + σ²I)^{-1} y_t, variance σ_t²(s) = k_t(s, s), and covariance k_t(s, s′) = k(s, s′) − k_t(s)^T (K_t + σ²I)^{-1} k_t(s′), where k_t(s) = [k(s_1, s), . . . , k(s_t, s)]^T and K_t is the positive definite kernel matrix, [k(s, s′)]_{s,s′∈D_t}. The identity matrix is denoted by I ∈ R^{t×t}. We also assume L-Lipschitz continuity of the safety function with respect to some metric d(·, ·) on 𝒮. This is guaranteed by many commonly used kernels with high probability [21, 8].

Goal. In this section, we define the goal of safe exploration. In particular, we ask what is the best that any algorithm may hope to achieve. Since we only observe noisy measurements, it is impossible to know the underlying safety function r(·) exactly after a finite number of measurements. Instead, we consider algorithms that only have knowledge of r(·) up to some statistical confidence ε. Based on this confidence within some safe set S, states with small distance to S can be classified to satisfy the safety constraint using the Lipschitz continuity of r(·).
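The GP posterior formulas above can be sketched directly in NumPy. The squared-exponential kernel and the 1-D states below are illustrative assumptions, not the terrain model used later in the paper:

```python
import numpy as np

# Sketch of the posterior equations:
#   mu_t(s)    = k_t(s)^T (K_t + sigma^2 I)^{-1} y_t
#   k_t(s, s') = k(s, s') - k_t(s)^T (K_t + sigma^2 I)^{-1} k_t(s')
# with variance sigma_t^2(s) = k_t(s, s).

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel between 1-D state arrays a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(s_query, s_data, y, sigma=0.1):
    K = rbf(s_data, s_data)
    kq = rbf(s_data, s_query)           # k_t(s) for each query state
    # Solve (K_t + sigma^2 I) x = [y_t, k_t(s)] in one call.
    A = np.linalg.solve(K + sigma**2 * np.eye(len(s_data)), np.c_[y, kq])
    mu = kq.T @ A[:, 0]
    var = np.diag(rbf(s_query, s_query)) - np.einsum('ij,ij->j', kq, A[:, 1:])
    return mu, var

s_data = np.array([0.0, 1.0]); y = np.array([0.5, 0.2])
mu, var = gp_posterior(np.array([0.0, 2.0]), s_data, y)
# Uncertainty shrinks near observed states and stays near the prior far away.
assert var[0] < var[1]
```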
The resulting set of safe states is

R^safe_ε(S) = S ∪ {s ∈ 𝒮 | ∃s′ ∈ S : r(s′) − ε − L·d(s, s′) ≥ h},   (1)

which contains states that can be classified as safe given the information about the states in S. While (1) considers the safety constraint, it does not consider any restrictions put in place by the structure of the MDP. In particular, we may not be able to visit every state in R^safe_ε(S) without visiting an unsafe state first. As a result, the agent is further restricted to

R^reach(S) = S ∪ {s ∈ 𝒮 | ∃s′ ∈ S, a ∈ 𝒜(s′) : s = f(s′, a)},   (2)

the set of all states that can be reached starting from the safe set in one step. These states are called the one-step safely reachable states. However, even restricted to this set, the agent may still get stuck in a state without any safe actions. We define

R_ret(S̄, S) = S ∪ {s ∈ S̄ | ∃a ∈ 𝒜(s) : f(s, a) ∈ S}   (3)

as the set of states that are able to return to a set S through some other set of states, S̄, in one step. In particular, we care about the ability to return to a certain set through a set of safe states S̄. Therefore, these are called the one-step safely returnable states. In general, the return routes may require taking more than one action, see Fig. 1. The n-step returnability operator R^n_ret(S̄, S) = R_ret(S̄, R^{n−1}_ret(S̄, S)), with R^1_ret(S̄, S) = R_ret(S̄, S), considers these longer return routes by repeatedly applying the return operator, R_ret in (3), n times. The limit R̄_ret(S̄, S) = lim_{n→∞} R^n_ret(S̄, S) contains all the states that can reach the set S through an arbitrarily long path in S̄.

Algorithm 1 Safe exploration in MDPs (SafeMDP)
Inputs: states 𝒮, actions 𝒜, transition function f(s, a), kernel k(s, s′), safety threshold h, Lipschitz constant L, safe seed S_0.
C_0(s) ← [h, ∞) for all s ∈ S_0
for t = 1, 2, . . .
do
  S_t ← {s ∈ 𝒮 | ∃s′ ∈ Ŝ_{t−1} : l_t(s′) − L·d(s, s′) ≥ h}
  Ŝ_t ← {s ∈ S_t | s ∈ R^reach(Ŝ_{t−1}), s ∈ R̄_ret(S_t, Ŝ_{t−1})}
  G_t ← {s ∈ Ŝ_t | g_t(s) > 0}
  s_t ← argmax_{s∈G_t} w_t(s)
  Safe Dijkstra in Ŝ_t from s_{t−1} to s_t
  Update GP with s_t and y_t ← r(s_t) + ω_t
  if G_t = ∅ or max_{s∈G_t} w_t(s) ≤ ε then Break

For safe exploration of MDPs, all of the above are requirements; that is, any state that we may want to visit needs to be safe (satisfy the safety constraint), reachable, and we must be able to return to safe states from this new state. Thus, any algorithm that aims to safely explore an MDP is only allowed to visit states in

R_ε(S) = R^safe_ε(S) ∩ R^reach(S) ∩ R̄_ret(R^safe_ε(S), S),   (4)

which is the intersection of the three safety-relevant sets. Given a safe set S that fulfills the safety requirements, R̄_ret(R^safe_ε(S), S) is the set of states from which we can return to S by only visiting states that can be classified as above the safety threshold. By including it in the definition of R_ε(S), we avoid the agent getting stuck in a state without an action that leads to another safe state. Knowledge about the safety feature in S up to ε accuracy thus allows us to expand the set of safe ergodic states to R_ε(S). Any algorithm that has the goal of exploring the state space should consequently explore these newly available safe states and gain new knowledge about the safety feature to potentially further enlarge the safe set. The safe set after n such expansions can be found by repeatedly applying the operator in (4): R^n_ε(S) = R_ε(R^{n−1}_ε(S)) with R^1_ε(S) = R_ε(S). Ultimately, the size of the safe set is bounded by surrounding unsafe states or the number of states in 𝒮.
As a result, the biggest set that any algorithm may classify as safe without visiting unsafe states is given by taking the limit, R̄_ε(S) = lim_{n→∞} R^n_ε(S).

Thus, given a tolerance level ε and an initial safe seed set S_0, R̄_ε(S_0) is the set of states that any algorithm may hope to classify as safe. Let S_t denote the set of states that an algorithm determines to be safe at iteration t. In the following, we will refer to complete, safe exploration whenever an algorithm fulfills R̄_ε(S_0) ⊆ lim_{t→∞} S_t ⊆ R̄_0(S_0); that is, the algorithm classifies every safely reachable state up to ε accuracy as safe, without misclassification or visiting unsafe states.

3 SAFEMDP Algorithm

We start by giving a high-level overview of the method. The SAFEMDP algorithm relies on a GP model of r to make predictions about the safety feature and uses the predictive uncertainty to guide the safe exploration. In order to guarantee safety, it maintains two sets. The first set, S_t, contains all states that can be classified as satisfying the safety constraint using the GP posterior, while the second one, Ŝ_t, additionally considers the ability to reach points in S_t and the ability to safely return to the previous safe set, Ŝ_{t−1}. The algorithm ensures safety and ergodicity by only visiting states in Ŝ_t. In order to expand the safe region, the algorithm visits states in G_t ⊆ Ŝ_t, a set of candidate states that, if visited, could expand the safe set. Specifically, the algorithm selects the most uncertain state in G_t, which is the safe state that we can gain the most information about. We move to this state via the shortest safe path, which is guaranteed to exist (Lemma 2). The algorithm is summarized in Algorithm 1.

Initialization.
The algorithm relies on an initial safe set S_0 as a starting point to explore the MDP. These states must be safe; that is, r(s) ≥ h for all s ∈ S_0. They must also fulfill the reachability and returnability requirements from Sec. 2. Consequently, for any two states, s, s′ ∈ S_0, there must exist a path in S_0 that connects them: s′ ∈ R̄_ret(S_0, {s}). While this may seem restrictive, the requirement is, for example, fulfilled by a single state with an action that leads back to the same state.

(a) States are classified as safe (above the safety constraint, dashed line) according to the confidence intervals of the GP model (red bar). States in the green bar can expand the safe set if sampled, G_t.

(b) Modified MDP model that is used to encode safety features that depend on actions. In this model, actions lead to abstract action-states s_a, which only have one available action that leads to f(s, a).

Classification. In order to safely explore the MDP, the algorithm must determine which states are safe without visiting them. The regularity assumptions introduced in Sec. 2 allow us to model the safety feature as a GP, so that we can use the uncertainty estimate of the GP model in order to determine a confidence interval within which the true safety function lies with high probability. For every state s, this confidence interval has the form Q_t(s) = [μ_{t−1}(s) ± √β_t · σ_{t−1}(s)], where β_t is a positive scalar that determines the amplitude of the interval. We discuss how to select β_t in Sec. 4. Rather than defining high-probability bounds on the values of r(s) directly in terms of Q_t, we consider the intersection of the sets Q_t up to iteration t, C_t(s) = Q_t(s) ∩ C_{t−1}(s), with C_0(s) = [h, ∞) for safe states s ∈ S_0 and C_0(s) = R otherwise. This choice ensures that the set of states that we classify as safe does not shrink over iterations and is justified by the selection of β_t in Sec. 4.
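The contained confidence interval update C_t(s) = Q_t(s) ∩ C_{t−1}(s) can be sketched in a few lines. The numeric values and the function name below are illustrative assumptions; the posterior mean, standard deviation, and β_t would come from the GP model:

```python
import math

# Sketch of the contained intervals: Q_t(s) = mu ± sqrt(beta_t)*sigma,
# intersected with the previous interval so that C_t(s) never grows.

def update_interval(c_prev, mu, std, beta):
    half = math.sqrt(beta) * std
    lo, hi = mu - half, mu + half                      # Q_t(s)
    return max(c_prev[0], lo), min(c_prev[1], hi)      # C_t = Q_t ∩ C_{t-1}

c = (-float('inf'), float('inf'))                  # C_0 for a state outside S_0
c = update_interval(c, mu=1.0, std=0.5, beta=4.0)  # approximately (0.0, 2.0)
c = update_interval(c, mu=1.2, std=0.3, beta=4.0)  # approximately (0.6, 1.8)
assert abs(c[0] - 0.6) < 1e-9 and abs(c[1] - 1.8) < 1e-9
```

The lower and upper bounds l_t(s) = min C_t(s) and u_t(s) = max C_t(s) used below are then simply the two endpoints of this tuple, and the width u_t(s) − l_t(s) is non-increasing in t by construction.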
Based on these confidence intervals, we define a lower bound, l_t(s) = min C_t(s), and an upper bound, u_t(s) = max C_t(s), on the values that the safety feature r(s) is likely to take based on the data obtained up to iteration t. Based on these lower bounds, we define

S_t = {s ∈ 𝒮 | ∃s′ ∈ Ŝ_{t−1} : l_t(s′) − L·d(s, s′) ≥ h}   (5)

as the set of states that fulfill the safety constraint on r with high probability, by using the Lipschitz constant to generalize beyond the current safe set. Based on this classification, the set of ergodic safe states is the set of states that achieve the safety threshold and, additionally, fulfill the reachability and returnability properties discussed in Sec. 2:

Ŝ_t = {s ∈ S_t | s ∈ R^reach(Ŝ_{t−1}) ∩ R̄_ret(S_t, Ŝ_{t−1})}.   (6)

Expanders. With the set of safe states defined, the task of the algorithm is to identify and explore states that might expand the set of states that can be classified as safe. We use the uncertainty estimate in the GP in order to define an optimistic set of expanders,

G_t = {s ∈ Ŝ_t | g_t(s) > 0},   (7)

where g_t(s) = |{s′ ∈ 𝒮 \ S_t | u_t(s) − L·d(s, s′) ≥ h}|. The function g_t(s) is positive whenever an optimistic measurement at s, equal to the upper confidence bound, u_t(s), would allow us to determine that a previously unsafe state s′ indeed has value r(s′) above the safety threshold. Intuitively, sampling s might lead to the expansion of S_t and thereby Ŝ_t. The set G_t explicitly considers the expansion of the safe set as the exploration goal; see Fig. 2a for a graphical illustration of the set.

Sampling and shortest safe path. The remaining part of the algorithm is concerned with selecting safe states to evaluate and finding a safe path in the MDP that leads towards them. The goal is to visit states that allow the safe set to expand as quickly as possible, so that we do not waste resources when exploring the MDP.
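The classification step (5) and the expander test (7) can be sketched with plain sets and dictionaries. The distances and bounds below are made-up placeholder values, not the paper's implementation:

```python
# A state s is classified safe if some s' already in the safe set satisfies
# l_t(s') - L*d(s, s') >= h, and a safe state is an expander if its optimistic
# value u_t(s) could certify at least one currently-unsafe state.

def safe_set(states, prev_safe, l, d, L, h):
    return {s for s in states
            if any(l[sp] - L * d[s][sp] >= h for sp in prev_safe)}

def expanders(safe, states, u, d, L, h):
    return {s for s in safe
            if sum(1 for sp in states - safe
                   if u[s] - L * d[s][sp] >= h) > 0}   # g_t(s) > 0

states = {0, 1, 2}
d = {s: {sp: abs(s - sp) for sp in states} for s in states}
l = {0: 1.0, 1: 0.2, 2: 0.0}       # lower bounds l_t (placeholder values)
u = {0: 1.2, 1: 1.1, 2: 0.1}       # upper bounds u_t (placeholder values)
S_t = safe_set(states, {0}, l, d, L=0.5, h=0.4)
assert S_t == {0, 1}               # state 1: 1.0 - 0.5*1 = 0.5 >= 0.4
G_t = expanders(S_t, states, u, d, L=0.5, h=0.4)
assert G_t == {1}                  # u[1] - 0.5*1 >= 0.4 could certify state 2
```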
We use the GP posterior uncertainty about the states in G_t in order to make this choice. At each iteration t, we select as the next target sample the state with the highest variance in G_t, s_t = argmax_{s∈G_t} w_t(s), where w_t(s) = u_t(s) − l_t(s). This choice is justified because, while all points in G_t are safe and can potentially enlarge the safe set, based on one noisy sample we can gain the most information from the state that we are the most uncertain about. This design choice maximizes the knowledge acquired with every sample but can lead to long paths between measurements within the safe region. Given s_t, we use Dijkstra's algorithm within the set Ŝ_t in order to find the shortest safe path to the target from the current state, s_{t−1}. Since we require reachability and returnability for all safe states, such a path is guaranteed to exist. We terminate the algorithm when we reach the desired accuracy; that is, when max_{s∈G_t} w_t(s) ≤ ε.

Action-dependent safety. So far, we have considered safety features that only depend on the states, r(s). In general, safety can also depend on the actions, r(s, a). In this section, we introduce a modified MDP that captures these dependencies without modifying the algorithm. The modified MDP is equivalent to the original one in terms of dynamics, f(s, a). However, we introduce additional action-states s_a for each action in the original MDP. When we start in a state s and take action a, we first transition to the corresponding action-state and from there transition to f(s, a) deterministically. This model is illustrated in Fig. 2b. Safety features that depend on action-states, s_a, are equivalent to action-dependent safety features. The SAFEMDP algorithm can be used on this modified MDP without modification. See the experiments in Sec.
5 for an example.

4 Theoretical Results

The safety and exploration aspects of the algorithm that we presented in the previous section rely on the correctness of the confidence intervals C_t(s). In particular, they require that the true value of the safety feature, r(s), lies within C_t(s) with high probability for all s ∈ 𝒮 and all iterations t > 0. Furthermore, these confidence intervals have to shrink sufficiently fast over time. The probability of r taking values within the confidence intervals depends on the scaling factor β_t. This scaling factor trades off conservativeness in the exploration against the probability of unsafe states being visited. Appropriate selection of β_t has been studied by Srinivas et al. [21] in the multi-armed bandit setting. Even though our framework is different, their setting can be applied to our case. We choose

β_t = 2B + 300 γ_t log³(t/δ),   (8)

where B is the bound on the squared RKHS norm of the function r(·), δ is the tolerated probability of visiting unsafe states, and γ_t is the maximum mutual information that can be gained about r(·) from t noisy observations; that is, γ_t = max_{|A|≤t} I(r; y_A). The information capacity γ_t has a sublinear dependence on t for many commonly used kernels [21]. The choice of β_t in (8) is justified by the following lemma, which follows from [21, Theorem 6]:

Lemma 1. Assume that ‖r‖²_k ≤ B, and that the noise ω_t is zero-mean conditioned on the history, as well as uniformly bounded by σ for all t > 0. If β_t is chosen as in (8), then, for all t > 0 and all s ∈ 𝒮, it holds with probability at least 1 − δ that r(s) ∈ C_t(s).

This lemma states that, for β_t as in (8), the safety function r(s) takes values within the confidence intervals C_t(s) with high probability. Now we show that the safe shortest path problem always has a solution:

Lemma 2. Assume that S_0 ≠ ∅ and that for all states s, s′ ∈ S_0, s ∈ R̄_ret(S_0, {s′}).
Then, when using Algorithm 1 under the assumptions in Theorem 1, for all t > 0 and for all states s, s′ ∈ Ŝ_t, s ∈ R̄_ret(S_t, {s′}).

This lemma states that, given an initial safe set that fulfills the initialization requirements, we can always find a policy that drives us from any state in Ŝ_t to any other state in Ŝ_t without leaving the set of safe states, S_t. Lemmas 1 and 2 have a key role in ensuring safety during exploration and, thus, in our main theoretical result:

Theorem 1. Assume that r(·) is L-Lipschitz continuous and that the assumptions of Lemma 1 hold. Also, assume that S_0 ≠ ∅, r(s) ≥ h for all s ∈ S_0, and that for any two states, s, s′ ∈ S_0, s′ ∈ R̄_ret(S_0, {s}). Choose β_t as in (8). Then, with probability at least 1 − δ, we have r(s) ≥ h for any s along any state trajectory induced by Algorithm 1 on an MDP with transition function f(s, a). Moreover, let t* be the smallest integer such that t*/(β_{t*} γ_{t*}) ≥ C|R̄_0(S_0)|/ε², with C = 8/log(1 + σ⁻²). Then there exists a t_0 ≤ t* such that, with probability at least 1 − δ, R̄_ε(S_0) ⊆ Ŝ_{t_0} ⊆ R̄_0(S_0).

Theorem 1 states that Algorithm 1 performs safe and complete exploration of the state space; that is, it explores the maximum reachable safe set without visiting unsafe states. Moreover, for any desired accuracy ε and probability of failure δ, the safely reachable region can be found within a finite number of observations. This bound depends on the information capacity γ_t, which in turn depends on the kernel. If the safety feature is allowed to change rapidly across states, the information capacity will be larger than if the safety feature were smooth. Intuitively, the less prior knowledge the kernel encodes, the more careful we have to be when exploring the MDP, which requires more measurements.

5 Experiments

In this section, we demonstrate Algorithm 1 on an exploration task.
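Before turning to the setup, the shortest-safe-path step of Algorithm 1 (Dijkstra's algorithm restricted to the ergodic safe set) can be sketched as follows. This is an illustrative sketch with unit step costs and a toy corridor, not the released implementation:

```python
import heapq

# Dijkstra within the ergodic safe set: edges exist between s and s' = f(s, a)
# only when s' is in the safe set. Lemma 2 guarantees a path exists, so the
# sketch assumes the goal is reachable and does not handle the failure case.

def safe_shortest_path(start, goal, safe, f, actions):
    dist, prev, pq = {start: 0}, {}, [(0, start)]
    while pq:
        c, s = heapq.heappop(pq)
        if s == goal:
            break
        for a in actions[s]:
            sn = f(s, a)
            if sn in safe and c + 1 < dist.get(sn, float('inf')):
                dist[sn], prev[sn] = c + 1, s
                heapq.heappush(pq, (c + 1, sn))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Corridor 0..4 where state 3 is unsafe: the path must stay inside {0, 1, 2}.
actions = {s: [0, 1] for s in range(5)}
f = lambda s, a: max(0, s - 1) if a == 0 else min(4, s + 1)
assert safe_shortest_path(0, 2, {0, 1, 2}, f, actions) == [0, 1, 2]
```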
We consider the setting in [14], the exploration of the surface of Mars with a rover. The code for the experiments is available at http://github.com/befelix/SafeMDP.

For space exploration, communication delays between the rover and the operator on Earth can be prohibitive. Thus, it is important that the robot can act autonomously and explore the environment without risking unsafe behavior. For the experiment, we consider the Mars Science Laboratory (MSL) [11], a rover deployed on Mars. Due to communication delays, the MSL can travel 20 meters before it can obtain new instructions from an operator. It can climb a maximum slope of 30° [15, Sec. 2.1.3]. In our experiments we use digital terrain models of the surface of Mars from the High Resolution Imaging Science Experiment (HiRISE), which have a resolution of one meter [12]. As opposed to the experiments considered in [14], we do not have to subsample or smoothen the data in order to achieve good exploration results. This is due to the flexibility of the GP framework that considers noisy measurements. Therefore, every state in the MDP represents a d × d square area with d = 1 m, as opposed to d = 20 m in [14].

At every state, the agent can take one of four actions: up, down, left, and right. If the rover attempts to climb a slope that is steeper than 30°, it fails and may be damaged. Otherwise it moves deterministically to the desired neighboring state. In this setting, we define safety over state transitions by using the extension introduced in Sec. 3. The safety feature over the transition from s to s′ is defined in terms of the height difference between the two states, H(s) − H(s′). Given the maximum slope of α = 30° that the rover can climb, the safety threshold is set at a conservative h = −d tan(25°). This encodes that it is unsafe for the robot to climb hills that are too steep.
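The transition safety feature and threshold can be sketched in a few lines. The sign convention (uphill transitions have negative safety value, so climbs steeper than 25° violate r ≥ h) is our reading of the text, and the terrain heights below are made-up example values:

```python
import math

# Transition safety feature r(s, s') = H(s) - H(s') with threshold
# h = -d * tan(25 deg): moderate climbs stay above h, steep climbs fall below.

d = 1.0                                  # cell size in meters
h = -d * math.tan(math.radians(25.0))    # about -0.466 m per 1 m step

def transition_safety(H, s, s_next):
    return H[s] - H[s_next]              # positive when going downhill

H = {'a': 10.0, 'b': 10.3, 'c': 11.0}    # hypothetical terrain heights [m]
assert transition_safety(H, 'a', 'b') >= h    # gentle climb: safe
assert transition_safety(H, 'a', 'c') < h     # 45-degree climb: unsafe
assert transition_safety(H, 'c', 'a') >= h    # going downhill: safe
```

Note the asymmetry: the transition c → a is safe while a → c is not, which is exactly why safety must be defined over (direction-dependent) transitions rather than states here.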
In particular, while the MDP dynamics assume that Mars is flat and every state can be reached, the safety constraint depends on the a priori unknown heights. Therefore, under the prior belief, it is unknown which transitions are safe. We model the height distribution, H(s), as a GP with a Matérn kernel with ν = 5/2. Due to the limitation on the grid resolution, tuning of the hyperparameters is necessary to achieve both safety and satisfactory exploration results. With a finer resolution, more cautious hyperparameters would also be able to generalize to neighbouring states. The lengthscales are set to 14.5 m and the prior standard deviation of heights is 10 m. We assume a noise standard deviation of 0.075 m. Since the safety feature of each state transition is a linear combination of heights, the GP model of the heights induces a GP model over the differences of heights, which we use to classify whether state transitions fulfill the safety constraint. In particular, the safety depends on the direction of travel; that is, going downhill is possible, while going uphill might be unsafe.

Following the recommendations in [22], in our experiments we use the GP confidence intervals Q_t(s) directly to determine the safe set S_t. As a result, the Lipschitz constant is only used to determine the expanders in G_t. Guaranteeing safe exploration with high probability over multiple steps leads to conservative behavior, as every step beyond the set that is known to be safe decreases the 'probability budget' for failure. In order to demonstrate that safety can be achieved empirically using less conservative parameters than those suggested by Theorem 1, we fix β_t to a constant value, β_t = 2 for all t ≥ 0. This choice aims to guarantee safety per iteration, rather than jointly over all the iterations. The same assumption is used in [14].

We compare our algorithm to several baselines.
The \ufb01rst one considers both the safety threshold and\nthe ergodicity requirements but neglects the expanders. In this setting, the agent samples the most\nuncertain safe state transaction, which corresponds to the safe Bayesian optimization framework\nin [20]. We expect the exploration to be safe, but less ef\ufb01cient than our approach. The second baseline\nconsiders the safety threshold, but does not consider ergodicity requirements. In this setting, we\nexpect the rover\u2019s behavior to ful\ufb01ll the safety constraint and to never attempt to climb steep slopes,\nbut it may get stuck in states without safe actions. The third method uses the unconstrained Bayesian\noptimization framework in order to explore new states, without safety requirements. In this setting,\nthe agent tries to obtain measurements from the most uncertain state transition over the entire space,\nrather than restricting itself to the safe set. In this case, the rover can easily get stuck and may also\nincur failures by attempting to climb steep slopes. Last, we consider a random exploration strategy,\nwhich is similar to the \u270f-greedy exploration strategies that are widely used in reinforcement learning.\n\n7\n\n\f(a) Non-ergodic\n\n(b) Unsafe\n\n(c) Random\n\n(d) No Expanders\n\nSafeMDP\nNo Expanders\nNon-ergodic\nUnsafe\nRandom\n\nR.15(S0) [%]\n80.28 %\n30.44 %\n0.86 %\n0.23 %\n0.98 %\n\nk at failure\n\n-\n-\n2\n1\n219\n\n(e) SafeMDP\n\n(f) Performance metrics.\n\nFigure 2: Comparison of different exploration schemes. The background color shows the real altitude\nof the terrain. All algorithms are run for 525 iterations, or until the \ufb01rst unsafe action is attempted.\nThe saturated color indicates the region that each strategy is able to explore. The baselines get stuck\nin the crater in the bottom-right corner or fail to explore, while Algorithm 1 manages to safely explore\nthe unknown environment. See the statistics in Fig. 
2f.
We compare these baselines over a 120 by 70 meter area at 30.6° latitude and 202.2° longitude. We set the accuracy to ε = σn. The resulting exploration behaviors can be seen in Fig. 2. The rover starts in the center-top part of the plot, a relatively planar area. In the top-right corner there is a hill that the rover cannot climb, while in the bottom-right corner there is a crater that, once entered, the rover cannot leave. The safe behavior that we expect is to explore the planar area, without moving into the crater or attempting to climb the hill. We run all algorithms for 525 iterations or until the first unsafe action is attempted. It can be seen in Fig. 2e that our method explores the safe area that surrounds the crater, without attempting to move inside. While some state-action pairs closer to the crater are also safe, the GP model would require more data to classify them as safe with the necessary confidence. In contrast, the baselines perform significantly worse. The baseline that does not ensure the ability to return to the safe set (non-ergodic) can be seen in Fig. 2a. It does not explore the area, because it quickly reaches a state without a safe path to the next target sample. Our approach avoids these situations explicitly. The unsafe exploration baseline in Fig. 2b considers ergodicity, but concludes that every state is reachable according to the MDP model. Consequently, it follows a path that crosses the boundary of the crater and eventually evaluates an unsafe action. Overall, it is not enough to consider only ergodicity or only safety in order to solve the safe exploration problem. The random exploration in Fig. 2c attempts an unsafe action after some exploration. In contrast, Algorithm 1 manages to safely explore a large part of the unknown environment. Running the algorithm without considering expanders leads to the behavior in Fig.
2d, which is safe, but only manages to explore a small subset of the safely reachable area within the same number of iterations in which Algorithm 1 explores over 80% of it. The results are summarized in Fig. 2f.

6 Conclusion

We presented SAFEMDP, an algorithm to safely explore a priori unknown environments. We used a Gaussian process to model the safety constraints, which allows the algorithm to reason about the safety of state-action pairs before visiting them. An important aspect of the algorithm is that it considers the transition dynamics of the MDP in order to ensure that there is a safe return route before visiting states. We proved that the algorithm is capable of exploring the full safely reachable region with few measurements, and demonstrated its practicality and performance in experiments.

Acknowledgement. This research was partially supported by the Max Planck ETH Center for Learning Systems and SNSF grant 200020_159557.

References

[1] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.

[2] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[3] Felix Berkenkamp and Angela P. Schoellig. Safe and robust learning control with Gaussian processes. In Proc. of the European Control Conference (ECC), pages 2501–2506, 2015.

[4] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc.
of the IEEE International Conference on Robotics and Automation (ICRA), 2016.

[5] Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.

[6] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, pages 515–564, 2012.

[7] Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR), 24:81–108, 2005.

[8] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, 34(5):2413–2429, 2006.

[9] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In Proc. of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2008.

[10] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: a survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

[11] Mary Kae Lockwood. Introduction: Mars Science Laboratory: The Next Generation of Mars Landers. Journal of Spacecraft and Rockets, 43(2):257–257, 2006.

[12] Alfred S. McEwen, Eric M. Eliason, James W. Bergstrom, Nathan T. Bridges, Candice J. Hansen, W. Alan Delamere, John A. Grant, Virginia C. Gulick, Kenneth E. Herkenhoff, Laszlo Keszthelyi, Randolph L. Kirk, Michael T. Mellon, Steven W. Squyres, Nicolas Thomas, and Catherine M. Weitz. Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research: Planets, 112(E5):E05S02, 2007.

[13] Jonas Mockus.
Bayesian Approach to Global Optimization, volume 37 of Mathematics and Its Applications. Springer Netherlands, 1989.

[14] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In Proc. of the International Conference on Machine Learning (ICML), pages 1711–1718, 2012.

[15] MSL. MSL Landing Site Selection User's Guide to Engineering Constraints, 2007. URL http://marsoweb.nas.nasa.gov/landingsites/msl/memoranda/MSL_Eng_User_Guide_v4.5.1.pdf.

[16] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.

[17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.

[18] Stefan Schaal and Christopher Atkeson. Learning Control in Robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, 2010.

[19] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[20] Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Proc. of the European Conference on Machine Learning (ECML), volume 9284, pages 133–149, 2015.

[21] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. of the International Conference on Machine Learning (ICML), 2010.

[22] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.

[23] Richard S. Sutton and Andrew G. Barto.
Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 1998.