{"title": "Verifiable Reinforcement Learning via Policy Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 2494, "page_last": 2504, "abstract": "While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies. We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured). The challenge is that decision tree policies are difficult to train. We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy.", "full_text": "Veri\ufb01able Reinforcement Learning\n\nvia Policy Extraction\n\nOsbert Bastani\n\nMIT\n\nobastani@csail.mit.edu\n\nYewen Pu\n\nMIT\n\nyewenpu@mit.edu\n\nArmando Solar-Lezama\n\nMIT\n\nasolar@csail.mit.edu\n\nAbstract\n\nWhile deep reinforcement learning has successfully solved many challenging con-\ntrol tasks, its real-world applicability has been limited by the inability to ensure\nthe safety of learned policies. 
We propose an approach to veri\ufb01able reinforcement\nlearning by training decision tree policies, which can represent complex policies\n(since they are nonparametric), yet can be ef\ufb01ciently veri\ufb01ed using existing tech-\nniques (since they are highly structured). The challenge is that decision tree policies\nare dif\ufb01cult to train. We propose VIPER, an algorithm that combines ideas from\nmodel compression and imitation learning to learn decision tree policies guided by\na DNN policy (called the oracle) and its Q-function, and show that it substantially\noutperforms two baselines. We use VIPER to (i) learn a provably robust decision\ntree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a\ndecision tree policy for a toy game based on Pong that provably never loses, and\n(iii) learn a provably stable decision tree policy for cart-pole. In each case, the\ndecision tree policy achieves performance equal to that of the original DNN policy.\n\n1\n\nIntroduction\n\nDeep reinforcement learning has proven to be a promising approach for automatically learning\npolicies for control problems [11, 22, 29]. However, an important challenge limiting real-world\napplicability is the dif\ufb01culty ensuring the safety of deep neural network (DNN) policies learned\nusing reinforcement learning. For example, self-driving cars must robustly handle a variety of human\nbehaviors [26], controllers for robotics typically need stability guarantees [2, 20, 8], and air traf\ufb01c\ncontrol should provably satisfy safety properties including robustness [19]. Due to the complexity of\nDNNs, verifying these properties is typically very inef\ufb01cient if not infeasible [6].\nOur goal is to learn policies for which desirable properties such as safety, stability, and robustness\ncan be ef\ufb01ciently veri\ufb01ed. 
We focus on learning decision tree policies for two reasons: (i) they are\nnonparametric, so in principle they can represent very complex policies, and (ii) they are highly\nstructured, making them easy to verify. However, decision trees are challenging to learn even in\nthe supervised setting; there has been some work learning decision tree policies for reinforcement\nlearning [13], but we find that these approaches do not even scale to simple problems like cart-pole [5].\nTo learn decision tree policies, we build on the idea of model compression [10] (or distillation [17]),\nwhich uses high-performing DNNs to guide the training of shallower [4, 17] or more structured [34, 7]\nclassifiers. Their key insight is that DNNs perform better not because they have better representative\npower, but because they are better regularized and therefore easier to train [4]. Our goal is to devise a\npolicy extraction algorithm that distills a high-performing DNN policy into a decision tree policy.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n[Figure 1 diagram: MDP, neural network policy, decision tree policy, verification, verified policy]\n\nFigure 1: The high level approach VIPER uses to learn verifiable policies.\n\nOur approach to policy extraction is based on imitation learning [27, 1], in particular, DAGGER [25]:\nthe pretrained DNN policy (which we call the oracle) is used to generate labeled data, and then\nsupervised learning is used to train a decision tree policy. However, we find that DAGGER learns\nmuch larger decision tree policies than necessary. In particular, DAGGER cannot leverage the fact that\nour oracle provides not just the optimal action to take in a given state, but also the cumulative reward\nof every state-action pair (either directly as a Q-function or indirectly as a distribution over possible\nactions). 
First, we propose Q-DAGGER, a novel imitation learning algorithm that extends DAGGER to\nuse the Q-function for the oracle; we show that Q-DAGGER can use this extra information to achieve\nprovably better performance than DAGGER. Then, we propose VIPER (see footnote 1), which modifies Q-DAGGER\nto extract decision tree policies; we show that VIPER can learn decision tree policies that are an order\nof magnitude smaller than those learned by DAGGER (and are thus easier to verify).\nWe show how existing verification techniques can be adapted to efficiently verify desirable properties\nof extracted decision tree policies: (i) we learn a decision tree policy that plays Atari Pong (on a\nsymbolic abstraction of the state space rather than from pixels; see footnote 2) [22] and verify its robustness [6, 19],\n(ii) we learn a decision tree policy to play a toy game based on Pong, and prove that it never loses (the\ndifficulty in doing so for Atari Pong is that the system dynamics are unavailable; see footnote 3), and (iii) we learn a\ndecision tree policy for cart-pole [5], and compute its region of stability around the goal state (with\nrespect to the degree-5 Taylor approximation of the system dynamics). In each case, our decision tree\npolicy also achieves perfect reward. Additionally, we discover a counterexample to the correctness of\nour decision tree policy for the toy game based on Pong, which we show can be fixed by slightly extending\nthe paddle length. In summary, our contributions are:\n\n\u2022 We propose an approach to learning verifiable policies (summarized in Figure 1).\n\u2022 We propose a novel imitation learning algorithm called VIPER, which is based on DAGGER\nbut leverages a Q-function for the oracle. 
We show that VIPER learns relatively small\ndecision trees (< 1000 nodes) that play perfectly on Atari Pong (with symbolic state space),\na toy game based on Pong, and cart-pole.\n\n\u2022 We describe how to verify correctness (for the case of a toy game based on Pong), stability,\nand robustness of decision tree policies, and show that veri\ufb01cation is orders of magnitude\nmore scalable than approaches compatible with DNN policies.\n\nRelated work. There has been work on verifying machine learning systems [3, 30, 16, 6, 19, 18, 15].\nSpeci\ufb01c to reinforcement learning, there has been substantial interest in safe exploration [23, 36, 33];\nsee [14] for a survey. Veri\ufb01cation of learned controllers [24, 32, 3, 20, 19, 31] is a crucial component\nof many such systems [2, 8], but existing approaches do not scale to high dimensional state spaces.\nThere has been work training decision tree policies for reinforcement learning [13], but we \ufb01nd that\ntheir approach does not even scale to cart-pole. There has also been work using model compression\nto learn decision trees [34, 7], but the focus has been on supervised learning rather than reinforcement\nlearning, and on interpretability rather than veri\ufb01cation. 
There has also been recent work using\nprogram synthesis to devise structured policies using imitation learning [35], but their focus is\ninterpretability, and they are outperformed by DNNs even on cart-pole.\n\n1 VIPER stands for Verifiability via Iterative Policy ExtRaction.\n2 We believe that this limitation is reasonable for safety-critical systems; furthermore, a model of the system\ndynamics defined with respect to symbolic state space is anyway required for most verification tasks.\n3 We believe that having the system dynamics available is a reasonable assumption; they are available for\nmost real-world robots, including sophisticated robots such as the walking robot ATLAS [20].\n\nFigure 2: An MDP with initial state s_0, deterministic transitions shown as arrows (the label is the\naction), actions A = {left, right, down} (taking an unavailable action transitions to s_end), rewards\nR(s\u0303) = T, R(s_k) = T \u2212 \u03b1 (where \u03b1 \u2208 (0, 1) is a constant), and R(s) = 0 otherwise, and time\nhorizon T = 3(k + 1). Trajectories taken by \u03c0*, \u03c0_left : s \u21a6 left, and \u03c0_right : s \u21a6 right are shown as\ndashed edges, red edges, and green edges, respectively.\n\n2 Policy Extraction\n\nWe describe Q-DAGGER, a general policy extraction algorithm with theoretical guarantees improving\non DAGGER\u2019s, and then describe how VIPER modifies Q-DAGGER to extract decision tree policies.\n\nProblem formulation. Let (S, A, P, R) be a finite-horizon (T-step) MDP with states S, actions\nA, transition probabilities P : S \u00d7 A \u00d7 S \u2192 [0, 1] (i.e., P(s, a, s') = p(s' | s, a)), and rewards\nR : S \u2192 \u211d. Given a policy \u03c0 : S \u2192 
A, for t \u2208 {0, ..., T \u2212 1}, let\n\nV_t^(\u03c0)(s) = R(s) + \u2211_{s'\u2208S} P(s, \u03c0(s), s') V_{t+1}^(\u03c0)(s')\nQ_t^(\u03c0)(s, a) = R(s) + \u2211_{s'\u2208S} P(s, a, s') V_{t+1}^(\u03c0)(s')\n\nbe its value function and Q-function, where V_T^(\u03c0)(s) = 0. Without loss of\ngenerality, we assume that there is a single initial state s_0 \u2208 S. Then, let\n\nd_0^(\u03c0)(s) = I[s = s_0]\nd_t^(\u03c0)(s) = \u2211_{s'\u2208S} P(s', \u03c0(s'), s) d_{t\u22121}^(\u03c0)(s') (for t > 0)\n\nbe the distribution over states at time t, where I is the indicator function, and let d^(\u03c0)(s) =\nT^{\u22121} \u2211_{t=0}^{T\u22121} d_t^(\u03c0)(s). Let J(\u03c0) = \u2212V_0^(\u03c0)(s_0) be the cost-to-go of \u03c0 from s_0. Our goal is to learn the\nbest policy in a given class \u03a0, leveraging an oracle \u03c0* : S \u2192 A and its Q-function Q_t^(\u03c0*)(s, a).\n\nThe Q-DAGGER algorithm. Consider the (in general nonconvex) loss function\n\n\u2113_t(s, \u03c0) = V_t^(\u03c0*)(s) \u2212 Q_t^(\u03c0*)(s, \u03c0(s)).\n\nLet g(s, \u03c0) = I[\u03c0(s) \u2260 \u03c0*(s)] be the 0-1 loss and g\u0303(s, \u03c0) a convex upper bound (in the parameters\nof \u03c0), e.g., the hinge loss [25] (see footnote 4). Then, \u2113\u0303_t(s, \u03c0) = \u2113\u0303_t(s) g\u0303(s, \u03c0) is a convex upper bound on \u2113_t(s, \u03c0), where\n\n\u2113\u0303_t(s) = V_t^(\u03c0*)(s) \u2212 min_{a\u2208A} Q_t^(\u03c0*)(s, a).\n\nQ-DAGGER runs DAGGER (Algorithm 3.1 from [25]) with the convex loss \u2113\u0303_t(s, \u03c0) and \u03b2_i = I[i = 1].\n\nTheory. We bound the performance of Q-DAGGER and compare it to the bound in [25]; proofs are\nin Appendix A. 
First, we characterize the loss \u2113(\u03c0) = T^{\u22121} \u2211_{t=0}^{T\u22121} E_{s\u223cd_t^(\u03c0)}[\u2113_t(s, \u03c0)].\n\n4 Other choices of g\u0303 are possible; our theory holds as long as it is a convex upper bound on the 0-1 loss g.\n\nAlgorithm 1 Decision tree policy extraction.\n\nprocedure VIPER((S, A, P, R), \u03c0*, Q*, M, N)\n  Initialize dataset D \u2190 \u2205\n  Initialize policy \u03c0\u0302_0 \u2190 \u03c0*\n  for i = 1 to N do\n    Sample M trajectories D_i \u2190 {(s, \u03c0*(s)) \u223c d^(\u03c0\u0302_{i\u22121})}\n    Aggregate dataset D \u2190 D \u222a D_i\n    Resample dataset D' \u2190 {(s, a) \u223c p((s, a)) \u221d \u2113\u0303(s) I[(s, a) \u2208 D]}\n    Train decision tree \u03c0\u0302_i \u2190 TrainDecisionTree(D')\n  end for\n  return Best policy \u03c0\u0302 \u2208 {\u03c0\u0302_1, ..., \u03c0\u0302_N} on cross validation\nend procedure\n\nLemma 2.1. For any policy \u03c0, we have T \u2113(\u03c0) = J(\u03c0) \u2212 J(\u03c0*).\nNext, let \u03b5_N = min_{\u03c0\u2208\u03a0} N^{\u22121} \u2211_{i=1}^{N} T^{\u22121} \u2211_{t=0}^{T\u22121} E_{s\u223cd_t^(\u03c0\u0302_i)}[\u2113\u0303_t(s, \u03c0)] be the training loss, where N is\nthe number of iterations of Q-DAGGER and \u03c0\u0302_i is the policy computed on iteration i. Let \u2113_max be an\nupper bound on \u2113\u0303_t(s, \u03c0), i.e., \u2113\u0303_t(s, \u03c0) \u2264 \u2113_max for all s \u2208 S and \u03c0 \u2208 \u03a0.\nTheorem 2.2. For any \u03b4 > 0, there exists a policy \u03c0\u0302 \u2208 {\u03c0\u0302_1, ..., \u03c0\u0302_N} such that\n\nJ(\u03c0\u0302) \u2264 J(\u03c0*) + T \u03b5_N + O\u0303(1)\n\nwith probability at least 1 \u2212 \u03b4, as long as N = \u0398\u0303(\u2113_max^2 T^2 log(1/\u03b4)).\nIn contrast, the bound J(\u03c0\u0302) \u2264 J(\u03c0*) + u T \u03b5_N + O\u0303(1) in [25] includes the value u that upper bounds\nQ_t^(\u03c0*)(s, a) \u2212 Q_t^(\u03c0*)(s, \u03c0*(s)) for all a \u2208 A, s \u2208 S, and t \u2208 {0, ..., T \u2212 1}. In general, u may\nbe O(T), e.g., if there are critical states s such that failing to take the action \u03c0*(s) in s results in\nforfeiting all subsequent rewards. 
For example, in cart-pole [5], we may consider the system to have\nfailed if the pole hits the ground; in this case, all future reward is forfeited, so u = O(T).\nAn analog of u appears implicitly in \u03b5_N, since our loss \u2113\u0303_t(s, \u03c0) includes an extra multiplicative factor\n\u2113\u0303_t(s) = V_t^(\u03c0*)(s) \u2212 min_{a\u2208A} Q_t^(\u03c0*)(s, a). However, our bound is O(T) as long as \u03c0\u0302 achieves high\naccuracy on critical states, whereas the bound in [25] is O(T^2) regardless of how well \u03c0\u0302 performs.\nWe make the gap explicit. Consider the MDP in Figure 2 (with \u03b1 \u2208 (0, 1) constant and T = 3(k + 1)).\nLet \u03a0 = {\u03c0_left : s \u21a6 left, \u03c0_right : s \u21a6 right}, and let g(\u03c0) = E_{s\u223cd^(\u03c0)}[g(s, \u03c0)] be the 0-1 loss.\nTheorem 2.3. g(\u03c0_left) = O(T^{\u22121}), g(\u03c0_right) = O(1), \u2113(\u03c0_left) = O(1), and \u2113(\u03c0_right) = O(T^{\u22121}).\n\nThat is, according to the 0-1 loss g(\u03c0), the worse policy \u03c0_left (J(\u03c0_left) = 0) is preferred, whereas\naccording to our loss \u2113(\u03c0), the better policy \u03c0_right (J(\u03c0_right) = \u2212(T \u2212 \u03b1)) is preferred.\n\nExtracting decision tree policies. Our algorithm VIPER for extracting decision tree policies is\nshown in Algorithm 1. Because the loss function for decision trees is not convex, there do not exist\nonline learning algorithms with the theoretical guarantees required by DAGGER. Nevertheless, we\nuse a heuristic based on the follow-the-leader algorithm [25]: on each iteration, we use the CART\nalgorithm [9] to train a decision tree on the aggregated dataset D. We also assume that \u03c0* and Q^(\u03c0*)\nare not time-varying, which is typically true in practice. 
Next, rather than modify the loss optimized\nby CART, VIPER resamples points (s, a) \u2208 D weighted by \u2113\u0303(s), i.e., according to\n\np((s, a)) \u221d \u2113\u0303(s) I[(s, a) \u2208 D].\n\nThen, we have E_{(s,a)\u223cp((s,a))}[g\u0303(s, \u03c0)] = E_{(s,a)\u223cD}[\u2113\u0303(s, \u03c0)], so using CART to train a decision tree\non D' is in expectation equivalent to training a decision tree with the loss \u2113\u0303(s, \u03c0). Finally, when\nusing neural network policies trained using policy gradients (so no Q-function is available), we use\nthe maximum entropy formulation of reinforcement learning to obtain Q values, i.e., Q(s, a) =\nlog \u03c0*(s, a), where \u03c0*(s, a) is the probability that the (stochastic) oracle takes action a in state s [37].\n\nFigure 3: (a) An example of an initial state of our toy Pong model; the ball is the white dot, the paddle\nis the white rectangle at the bottom, and the red arrow denotes the initial velocity (v_x, v_y) of the ball.\n(b) An intuitive visualization of the ball positions (blue region) and velocities (red arrows) in Y_0. (c)\nA counterexample to correctness discovered by our verification algorithm.\n\n3 Verification\n\nIn this section, we describe three desirable control properties that we can efficiently verify for decision\ntree policies but that are difficult to verify for DNN policies.\n\nCorrectness for toy Pong. Correctness of a controller is system-dependent; we first discuss proving\ncorrectness of a controller for a toy model of the Pong Atari game [22]. This toy model consists of a\nball bouncing on the screen, with a player-controlled paddle at the bottom. If the ball hits the top\nor the side of the screen, or if the ball hits the paddle at the bottom, then it is reflected; if the ball\nhits the bottom of the screen where the paddle is not present, then the game is over. The system\nis frictionless and all collisions are elastic. 
It can be thought of as Pong where the system paddle\nis replaced with a wall. The goal is to play for as long as possible before the game ends. The\nstates are (x, y, v_x, v_y, x_p) \u2208 \u211d^5, where (x, y) is the position of the ball (with x \u2208 [0, x_max] and\ny \u2208 [0, y_max]), (v_x, v_y) is its velocity (with v_x, v_y \u2208 [\u2212v_max, v_max]), and x_p is the position of the\npaddle (with x_p \u2208 [0, x_max]), and the actions are {left, right, stay}, indicating how to move the paddle.\nOur goal is to prove that the controller never loses, i.e., the ball never hits the bottom of the\nscreen at a position where the paddle is not present. More precisely, assuming the system is\ninitialized to a safe state (i.e., y \u2208 Y_0 = [y_max/2, y_max]), then it should avoid an unsafe region (i.e.,\ny = 0 \u2227 (x \u2264 x_p \u2212 L \u2228 x \u2265 x_p + L), where L is half the paddle length).\nTo do so, we assume that the speed of the ball in the y direction is lower bounded, i.e., |v_y| > v_min;\nsince the speed in each direction is conserved, this assumption is equivalent to assuming that the\ninitial y velocity is in [\u2212v_max, \u2212v_min] \u222a [v_min, v_max]. Then, it suffices to prove the following inductive\ninvariant: as long as the ball starts in Y_0, then it re-enters Y_0 after at most t_max = \u23082 y_max / v_min\u2309 steps.\nBoth the dynamics f : S \u00d7 A \u2192 S and the controller \u03c0 : S \u2192 A are piecewise-linear, so the joint\ndynamics f_\u03c0(s) = f(s, \u03c0(s)) are also piecewise linear; let S = S_1 \u222a ... \u222a S_k be a partition of the\nstate space so that f_\u03c0(s) = f_i(s) = \u03b2_i^T s for all s \u2208 S_i. Then, let s_t be a variable denoting the state\nof the system at time t \u2208 {0, ..., t_max}; then, the following constraints specify the system dynamics:\n\n\u03c6_t = \u22c1_{i=1}^{k} (s_{t\u22121} \u2208 S_i \u21d2 s_t = \u03b2_i^T s_{t\u22121})   \u2200t \u2208 {1, ..., t_max}\n\nFurthermore, letting \u03c8_t = (s_t \u2208 Y_0), we can express the correctness of the system as the formula (see footnote 5)\n\n\u03c8 = (\u22c0_{t=1}^{t_max} \u03c6_t) \u2227 \u03c8_0 \u21d2 \u22c1_{t=1}^{t_max} \u03c8_t.\n\nNote that \u03c3 \u21d2 \u03c4 is equivalent to \u00ac\u03c3 \u2228 \u03c4. 
Then, since Y_0 and all of the S_i are polyhedra, the\npredicates s_t \u2208 Y_0 and s_t \u2208 S_i are conjunctions of linear (in)equalities; thus, the formulas \u03c6_t and \u03c8_t\nare disjunctions of conjunctions of linear (in)equalities. As a consequence, \u03c8 consists of conjunctions\nand disjunctions of linear (in)equalities; standard tools exist for checking whether such formulas\nare satisfiable [12]. In particular, the controller is correct if and only if \u00ac\u03c8 is unsatisfiable, since a\nsatisfying assignment to \u00ac\u03c8 is a counterexample showing that \u03c8 does not always hold.\nFinally, note that we can slightly simplify \u03c8: (i) we only have to show that the system enters a state\nwhere v_y > 0 after t_max steps, not that it returns to Y_0, and (ii) we can restrict Y_0 to states where\nv_y < 0. We use parameters (x_max, y_max, v_min, v_max, L) = (30, 20, 1, 2, 4); Figure 3 (a) shows an\nexample of an initial state, and Figure 3 (b) depicts the set Y_0 of initial states that we verify.\n\n5 We are verifying correctness over a continuous state space, so enumerative approaches are not feasible.\n\nCorrectness for cart-pole. We also discuss proving correctness of a cart-pole control policy. The\nclassical cart-pole control problem has a 4-dimensional state space (x, v, \u03b8, \u03c9) \u2208 \u211d^4, where x is\nthe cart position, v is the cart velocity, \u03b8 is the pole angle, and \u03c9 is the pole angular velocity, and\na 1-dimensional action space a \u2208 \u211d, where a is the lateral force to apply to the cart. Consider a\ncontroller trained to move the cart to the right while keeping the pole in the upright position. 
The\ngoal is to prove that the pole never falls below a certain height, which can be encoded as the formula (see footnote 6)\n\n\u03c8 \u2261 s_0 \u2208 S_0 \u2227 \u22c0_{t=0}^{\u221e} |\u03c6(s_t)| \u2264 y_0,\n\nwhere S_0 = [\u22120.05, 0.05]^4 is the set of initial states, s_t = f(s_{t\u22121}, a_{t\u22121}) is the state on step t, f\nis the transition function, \u03c6(s) is the deviation of the pole angle from upright in state s, and y_0 is\nthe maximum desirable deviation from the upright position. As with correctness for toy Pong, the\ncontroller is correct if \u00ac\u03c8 is unsatisfiable. The property \u03c8 can be thought of as a toy example of\na safety property we would like to verify for a controller for a walking robot; in particular, we\nmight want the robot to run as fast as possible, but prove that it never falls over while doing so.\nThere are two difficulties in verifying \u03c8: (i) the infinite time horizon, and (ii) the nonlinear transitions\nf. To address (i), we approximate the system using a finite time horizon T_max = 10, i.e., we show\nthat the system is safe for the first ten time steps. To address (ii), we use a linear approximation\nf(s, a) \u2248 As + Ba; for cart-pole, this approximation is good as long as \u03c6(s_t) is small.\n\nStability. Stability is a property from control theory saying that systems asymptotically reach their\ngoal [31]. Consider a continuous-time dynamical system with states s \u2208 S = \u211d^n, actions a \u2208 A =\n\u211d^m, and dynamics s\u0307 = f(s, a). For a policy \u03c0 : S \u2192 A, we say the system f_\u03c0(s) = f(s, \u03c0(s))\nis stable if there is a region of attraction U \u2286 \u211d^n containing 0 such that for any s_0 \u2208 U, we have\nlim_{t\u2192\u221e} s(t) = 0, where s(t) is a solution to s\u0307 = f(s, a) with initial condition s(0) = s_0.\nWhen f_\u03c0 is nonlinear, we can verify stability (and compute U) by finding a Lyapunov function\nV : S \u2192 \u211d which satisfies (i) V(s) > 0 for all s \u2208 U \u2216 {0}, (ii) V(0) = 0, and (iii) V\u0307(s) =\n(\u2207V)(s) \u00b7 f(s) < 0 for all s \u2208 U \u2216 {0} [31]. 
Given a candidate Lyapunov function, exhaustive\nsearch can be used to check whether the Lyapunov properties hold [8], but scales exponentially in n.\nWhen f_\u03c0 is polynomial, we can use sum-of-squares (SOS) optimization to devise a candidate\nLyapunov function, check the Lyapunov properties, and compute U [24, 32, 31]; we give a brief\noverview. First, suppose that V(s) = s^T P s for some P \u2208 \u211d^{n\u00d7n}. To compute a candidate Lyapunov\nfunction, we choose P so that the Lyapunov properties hold for the linear approximation f_\u03c0(s) \u2248 As,\nwhich can be accomplished by solving the SOS program (see footnote 7)\n\n\u2203P \u2208 \u211d^{n\u00d7n}\nsubj. to s^T P s \u2212 \u2016s\u2016^2 \u2265 0 and s^T P A s + \u2016s\u2016^2 \u2264 0 (\u2200s \u2208 S).   (1)\n\nThe first constraint ensures properties (i) and (ii); in particular, the term \u2016s\u2016^2 ensures that s^T P s > 0\nexcept when s = 0. Similarly, the second constraint ensures property (iii). Next, we can simultaneously\ncheck whether the Lyapunov properties hold for f_\u03c0 and compute U using the SOS program\n\narg max_{\u03c1\u2208\u211d_+, \u039b\u2208\u211d^{n\u00d7n}} \u03c1\nsubj. to (s^T \u039b s)(s^T P f_\u03c0(s)) + (\u03c1 \u2212 s^T P s)\u2016s\u2016^2 \u2264 0 and s^T \u039b s \u2265 0 (\u2200s \u2208 S).   (2)\n\n6 This property cannot be expressed as a stability property since the cart is always moving.\n7 Simpler approaches exist, but this one motivates our approach to checking whether the Lyapunov properties\nhold for V for the polynomial dynamics f_\u03c0.\n\nThe term \u03bb(s) = s^T \u039b s is a slack variable: when \u03c1 > s^T P s or s = 0 (so the second term is\nnonpositive), it can be made sufficiently large so that the first constraint holds regardless of s^T P f_\u03c0(s),\nbut when \u03c1 \u2264 s^T P s and s \u2260 0 (so the second term is positive), we must have s^T P f_\u03c0(s) < 0 since\ns^T \u039b s \u2265 0 by the second constraint. 
Properties (i) and (ii) hold from (1), and (2) verifies (iii) for all\n\ns \u2208 U = {s \u2208 S | V(s) \u2264 \u03c1}.\n\nThus, if a solution \u03c1 > 0 is found, then V is a Lyapunov function with region of attraction U. This\napproach extends to higher-order polynomials V(s) by taking V(s) = m(s)^T P m(s), where m(s) is\na vector of monomials (and similarly for \u03bb(s)).\nNow, let \u03c0 be a decision tree whose leaf nodes are associated with linear functions of the state s\n(rather than restricted to constant functions). For \u2113 \u2208 leaves(\u03c0), let \u03b2_\u2113^T s be the associated linear\nfunction. Let \u2113_0 \u2208 leaves(\u03c0) be the leaf node such that 0 \u2208 routed(\u2113_0; \u03c0), where routed(\u2113; \u03c0) \u2286 S is\nthe set of states routed to \u2113 (i.e., the computation of the decision tree maps s to leaf node \u2113). Then,\nwe can compute a Lyapunov function for the linear policy \u03c0\u0303(s) = \u03b2_{\u2113_0}^T s; letting U\u0303 be the region of\nattraction for \u03c0\u0303, the region of attraction for \u03c0 is U = U\u0303 \u2229 routed(\u2113_0; \u03c0). To maximize U, we can bias\nthe decision tree learning algorithm to prefer branching farther from s = 0.\nThere are two limitations of our approach. First, we require that the dynamics be polynomial. For\nconvenience, we use Taylor approximations of the dynamics, which only approximates the true property\nbut works well in practice [32]. This limitation can be addressed by reformulating the dynamics as a\npolynomial system or by handling approximation error in the dynamics [31]. Second, we focus on\nverifying stability locally around 0; there has been work extending the approach we use by \u201cpatching\ntogether\u201d different regions of attraction [32].\n\nRobustness. Robustness has been studied for image classification [30, 16, 6]. We study this\nproperty primarily since it can be checked when the dynamics are unknown, though it has been\nstudied for air traffic control as a safety consideration [19]. 
We say \u03c0 is \u03b5-robust at s_0 \u2208 S = \u211d^d if (see footnote 8)\n\n\u03c0(s) = \u03c0(s_0) (\u2200s \u2208 B_\u221e(s_0, \u03b5)),\n\nwhere B_\u221e(s_0, \u03b5) is the L_\u221e-ball of radius \u03b5 around s_0. If \u03c0 is a decision tree, we can efficiently\ncompute the largest \u03b5 such that \u03c0 is \u03b5-robust at s_0, which we denote \u03b5(s_0; \u03c0). Consider a leaf node\n\u2113 \u2208 leaves(\u03c0) labeled with action a_\u2113 \u2260 \u03c0(s_0). The following linear program computes the distance\nfrom s_0 to the closest point s \u2208 S (in L_\u221e norm) such that s \u2208 routed(\u2113; \u03c0):\n\n\u03b5(s_0; \u2113, \u03c0) = min_{s\u2208S, \u03b5\u2208\u211d_+} \u03b5\nsubj. to (\u22c0_{n\u2208path(\u2113;\u03c0)} \u03b4_n s_{i_n} \u2264 t_n) \u2227 (\u22c0_{i\u2208[d]} |s_i \u2212 (s_0)_i| \u2264 \u03b5),\n\nwhere path(\u2113; \u03c0) is the set of internal nodes along the path from the root of \u03c0 to \u2113, \u03b4_n = 1 if n is a\nleft-child and \u22121 otherwise, i_n is the feature index of n, and t_n is the threshold of n. Then,\n\n\u03b5(s_0; \u03c0) = min_{\u2113\u2208leaves(\u03c0)} { \u221e if a_\u2113 = \u03c0(s_0); \u03b5(s_0; \u2113, \u03c0) otherwise }.\n\n4 Evaluation\n\nVerifying robustness of an Atari Pong controller. For the Atari Pong environment, we use a 7-dimensional\nstate space (extracted from raw images), which includes the position (x, y) and velocity\n(v_x, v_y) of the ball, and the position y_p, velocity v_p, acceleration a_p, and jerk j_p of the player\u2019s\npaddle. The actions are A = {up, down, stay}, corresponding to moving the paddle up, down, or\nunchanged. A reward of 1 is given if the player scores, and -1 if the opponent scores, for 21 rounds\n(so R \u2208 {\u221221, ..., 21}). Our oracle is the deep Q-network [22], which achieves a perfect reward of\n21.0 (averaged over 50 rollouts; see footnote 9). VIPER (with N = 80 iterations and M = 10 sampled traces per\niteration) extracts a decision tree policy \u03c0 with 769 nodes that also achieves perfect reward 21.0.\nWe compute the robustness \u03b5(s_0; \u03c0) at 5 random states s_0 \u2208 S, which took just under 2.9 seconds for\neach point (on a 2.5 GHz Intel Core i7 CPU); the computed \u03b5 varies from 0.5 to 2.8. 
We compare to\nReluplex, a state-of-the-art tool for verifying DNNs. We use policy gradients to train a stochastic\nDNN policy \u03c0 : \u211d^7 \u00d7 A \u2192 [0, 1], and use Reluplex to compute the robustness of \u03c0 on the same 5\npoints. We use line search on \u03b5 to find the distance to the nearest adversarial example to within 0.1\n(which requires 4 iterations of Reluplex); in contrast, our approach computes \u03b5 to within 10^{\u22125}, and\ncan easily be made more precise. The Reluplex running times varied substantially: they were 12,\n136, 641, and 649 seconds; verifying the fifth point timed out after running for one hour.\n\n8 This definition of robustness is different than the one in control theory.\n9 This policy operates on images, but we can still use it as an oracle.\n\nVerifying correctness of a toy Pong controller. Because we do not have a model of the system\ndynamics for Atari Pong, we cannot verify correctness; we instead verify correctness for our toy\nmodel of Pong. We use policy gradients to train a DNN policy to play toy Pong; it achieves\na perfect reward of 250 (averaged over 50 rollouts), the maximum number of time steps.\nVIPER extracts a decision tree with 31 nodes, which also plays perfectly. We use Z3 to check\nsatisfiability of \u00ac\u03c8. In fact, we discover a counterexample: when the ball starts near the edge of the\nscreen, the paddle oscillates and may miss it (see footnote 10).\nFurthermore, by manually examining this counterexample, we were able to devise two fixes to repair\nthe system. First, we discovered a region of the state space where the decision tree was taking a\nclearly suboptimal action that led to the counterexample. To fix this issue, we added a top-level node\nto the decision tree so that it performs a safer action in this case. Second, we noticed that extending\nthe paddle length by one (i.e., L = 9/2) was also sufficient to remove the counterexample. 
For both\nfixes, we reran the verification algorithm and proved that no additional counterexamples exist,\ni.e., the controller never loses the game. All verification tasks ran in just under 5 seconds.\n\nVerifying correctness of a cart-pole controller. We restricted to discrete actions a \u2208 A =\n{\u22121, 1}, and used policy gradients to train a stochastic oracle \u03c0* : S \u00d7 A \u2192 [0, 1] (a neural\nnetwork with a single hidden layer) to keep the pole upright while moving the cart to the right;\nthe oracle achieved a perfect reward of 200.0 (averaged over 100 rollouts), i.e., the pole never falls\ndown. We use VIPER as before to extract a decision tree policy. In Figure 4 (a), we show the reward\nachieved by extracted decision trees of varying sizes; a tree with just 3 nodes (one internal and two\nleaf) suffices to achieve perfect reward. We used Z3 to check satisfiability of \u00ac\u03c8; Z3 proves that the\ndesired safety property holds, running in 1.5 seconds.\n\nVerifying stability of a cart-pole controller. Next, we tried to verify stability of the cart-pole\ncontroller, trained as before except without moving the cart to the right; as before, the decision tree\nachieves a perfect reward of 200.0. However, achieving a perfect reward only requires that the pole\ndoes not fall below a given height, not stability; thus, neither the extracted decision tree policy nor\nthe original neural network policy is stable.\nInstead, we used an approach inspired by guided policy search [21]. We trained another decision\ntree using a different oracle, namely, an iterative linear quadratic regulator (iLQR), which comes\nwith stability guarantees (at least with respect to the linear approximation of the dynamics, which is\nvery good near the origin). Note that we require a model to use an iLQR oracle, but we anyway\nneed the true model to verify stability. We use iLQR with a time horizon of T = 50 steps and n = 3\niterations. 
To extract a policy, we use Q(s, a) = -J_T(s), where J_T(s) = s^T P_T s is the cost-to-go\nfor the final iLQR step. Because iLQR can be slow, we compute the LQR controller for the linear\napproximation of the dynamics around the origin, and use it when \u2016s\u2016_\u221e \u2264 0.05. We now use\ncontinuous actions A = [-a_max, a_max], so we extract a 3-node decision tree policy \u03c0 with linear\nregressors at the leaves (internal branches are axis-aligned); \u03c0 achieves a reward of 200.0.\nWe verify stability of \u03c0 with respect to the degree-5 Taylor approximation of the cart-pole dynamics.\nSolving the SOS program (2) takes 3.9 seconds. The optimal solution is \u03c1 = 3.75, which suffices to\nverify that the region of stability contains {s \u2208 S | \u2016s\u2016_\u221e \u2264 0.03}. We compare to an enumerative\nalgorithm for verifying stability similar to the one used in [8]; after running for more than 10\nminutes, it only verified a region U' whose volume is 10^-15 that of U. To the best of our knowledge,\nenumeration is the only approach that can be used to verify stability of neural network policies.\n\n10 While this counterexample was not present for the original neural network controller, we have no way of\nknowing if other counterexamples exist for that controller.\n\nFigure 4: (a) Reward (maximum R = 200) as a function of the size (in number of nodes) of the\ndecision tree extracted by VIPER, on the cart-pole benchmark. (b) Reward (maximum R = 200) as a\nfunction of the number of training rollouts, on the cart-pole benchmark, for VIPER (black, circle) and\nfitted Q-iteration (red, triangle); for VIPER, we include rollouts used to train the oracle. (c) Decision\ntree size needed to achieve a given reward R \u2208 {0, 5, 10, 15, 20, 21} (maximum R = 21), on the\nAtari Pong benchmark, for VIPER (black, circle) and DAGGER with the 0-1 loss (red, triangle).\n\nComparison to fitted Q iteration. On the cart-pole benchmark, we compare VIPER to fitted Q\niteration [13], which is an actor-critic algorithm that uses a decision tree policy that is retrained on\nevery step rather than using gradient updates; for the Q-function, we use a neural network with a\nsingle hidden layer. In Figure 4 (b), we compare the reward achieved by VIPER to that achieved by fitted Q\niteration as a function of the number of rollouts (for VIPER, we include the initial rollouts used to\ntrain the oracle \u03c0*). Even after 200K rollouts, fitted Q iteration only achieves a reward of 104.3.\n\nComparison to DAGGER. On the Atari Pong benchmark, we compare VIPER to using DAGGER\nwith the 0-1 loss. We use each algorithm to learn decision trees with maximum depths from 4\nto 16. In Figure 4 (c), we show the smallest decision tree needed to achieve reward R \u2208\n{0, 5, 10, 15, 20, 21}. VIPER consistently produces trees an order of magnitude smaller than those\nproduced by DAGGER (e.g., 31 nodes vs. 127 nodes for R = 0, 127 nodes vs. 3459\nnodes for R = 20, and 769 nodes vs. 7967 nodes for R = 21), likely because VIPER prioritizes accuracy on\ncritical states. Evaluating pointwise robustness for DAGGER trees is thus an order of magnitude\nslower: 36 to 40 seconds for the R = 21 tree (vs. under 3 seconds for the R = 21 VIPER tree).\n\nController for half-cheetah. We demonstrate that we can learn high quality decision trees for the\nhalf-cheetah problem instance in the MuJoCo benchmark. In particular, we used a neural network\noracle trained using PPO [28] to extract a regression tree controller. 
The regression tree had 9757\nnodes, and achieved cumulative reward R = 4014 (whereas the neural network achieved R = 4189).\n\n5 Conclusion\n\nWe have proposed an approach to learning decision tree policies that can be verified efficiently. Much\nwork remains to be done to fully realize the potential of our approach. For instance, we used a number\nof approximations to verify correctness for the cart-pole controller; it may be possible to avoid these\napproximations, e.g., by finding an invariant set (similar to our approach to verifying toy Pong),\nand by using upper and lower piecewise linear bounds on the transition function. More generally, we\nconsidered a limited variety of verification tasks; we expect that a wider range of properties may\nbe verified for our policies. Another important direction is exploring whether we can automatically\nrepair errors discovered in a decision tree policy. Finally, our decision tree policies may be useful for\nimproving the efficiency of safe reinforcement learning algorithms that rely on verification.\n\nAcknowledgments\nThis work was funded by the Toyota Research Institute and NSF InTrans award 1665282.\n\nReferences\n[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.\n\n[2] Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In CDC, 2014.\n\n[3] Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 2013.\n\n[4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.\n\n[5] A. Barto, R. Sutton, and C. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. 
IEEE Transactions on Systems, Man, and Cybernetics, 1983.\n\n[6] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, and Antonio Criminisi. Measuring neural net robustness with constraints. In NIPS, 2016.\n\n[7] Osbert Bastani, Carolyn Kim, and Hamsa Bastani. Interpretability via model extraction. In FAT/ML, 2017.\n\n[8] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In NIPS, 2017.\n\n[9] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth, 1984.\n\n[10] Cristian Bucilu\u01ce, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.\n\n[11] Steve Collins, Andy Ruina, Russ Tedrake, and Martijn Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 2005.\n\n[12] Leonardo De Moura and Nikolaj Bj\u00f8rner. Z3: An efficient SMT solver. In TACAS, 2008.\n\n[13] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. JMLR, 2005.\n\n[14] Javier Garc\u00eda and Fernando Fern\u00e1ndez. A comprehensive survey on safe reinforcement learning. JMLR, 2015.\n\n[15] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Security & Privacy, 2018.\n\n[16] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.\n\n[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.\n\n[18] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In CAV, 2017.\n\n[19] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. 
Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV, 2017.\n\n[20] Scott Kuindersma, Robin Deits, Maurice Fallon, Andr\u00e9s Valenzuela, Hongkai Dai, Frank Permenter, Twan Koolen, Pat Marion, and Russ Tedrake. Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot. Autonomous Robots, 2016.\n\n[21] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1\u20139, 2013.\n\n[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.\n\n[23] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In ICML, 2012.\n\n[24] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.\n\n[25] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.\n\n[26] Dorsa Sadigh, S Shankar Sastry, Sanjit A Seshia, and Anca Dragan. Information gathering actions over human internal state. In IROS, 2016.\n\n[27] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 1999.\n\n[28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[29] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. 
Nature, 2016.\n\n[30] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.\n\n[31] Russ Tedrake. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation. 2018.\n\n[32] Russ Tedrake, Ian R Manchester, Mark Tobenkin, and John W Roberts. LQR-trees: Feedback motion planning via sums-of-squares verification. IJRR, 2010.\n\n[33] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS, 2016.\n\n[34] Gilles Vandewiele, Olivier Janssens, Femke Ongenae, Filip De Turck, and Sofie Van Hoecke. Genesim: genetic extraction of a single, interpretable model. In NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016.\n\n[35] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.\n\n[36] Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesv\u00e1ri. Conservative bandits. In ICML, 2016.\n\n[37] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.", "award": [], "sourceid": 1247, "authors": [{"given_name": "Osbert", "family_name": "Bastani", "institution": "University of Pennsylvania"}, {"given_name": "Yewen", "family_name": "Pu", "institution": "MIT"}, {"given_name": "Armando", "family_name": "Solar-Lezama", "institution": "MIT"}]}