A Composable Specification Language for Reinforcement Learning Tasks

Kishor Jothimurugan, Rajeev Alur, Osbert Bastani
University of Pennsylvania
{kishor,alur,obastani}@cis.upenn.edu

Advances in Neural Information Processing Systems (NeurIPS 2019), pages 13041–13051.

Abstract

Reinforcement learning is a promising approach for learning control policies for robot tasks. However, specifying complex tasks (e.g., with multiple objectives and safety constraints) can be challenging, since the user must design a reward function that encodes the entire task. Furthermore, the user often needs to manually shape the reward to ensure convergence of the learning algorithm. We propose a language for specifying complex control tasks, along with an algorithm that compiles specifications in our language into a reward function and automatically performs reward shaping. We implement our approach in a tool called SPECTRL, and show that it outperforms several state-of-the-art baselines.

1 Introduction

Reinforcement learning (RL) is a promising approach to learning control policies for robotics tasks [5, 21, 16, 15].
A key shortcoming of RL is that the user must manually encode the task as a real-valued reward function, which can be challenging for several reasons. First, for complex tasks with multiple objectives and constraints, the user must manually devise a single reward function that balances different parts of the task. Second, the state space must often be extended to encode the reward, e.g., by adding indicators that keep track of which subtasks have been completed. Third, oftentimes, different reward functions can encode the same task, and the choice of reward function can have a large impact on the convergence of the RL algorithm. Thus, users must manually design rewards that assign "partial credit" for achieving intermediate goals, known as reward shaping [17]. For example, consider the task in Figure 1, where the state is the robot position and its remaining fuel, the action is a (bounded) robot velocity, and the task is

"Reach target q, then reach target p, while maintaining positive fuel and avoiding obstacle O".

To encode this task, we would have to combine rewards for (i) reaching q, and then reaching p (where "reach x" denotes the task of reaching an ε box around x; the regions corresponding to p and q are denoted by P and Q, respectively), (ii) avoiding region O, and (iii) maintaining positive fuel, into a single reward function. Furthermore, we would have to extend the state space to keep track of whether q has been reached; otherwise, the control policy would not know whether the current goal is to move towards q or p. Finally, we might need to shape the reward to assign partial credit for getting closer to q, or for reaching q without reaching p.

We propose a language for users to specify control tasks.
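For concreteness, the kind of hand-written encoding this motivates replacing might look like the following sketch; the penalty magnitude, distance-based shaping, and the manually threaded `reached_q` indicator are illustrative assumptions, not part of the paper:

```python
import numpy as np

def manual_reward(state, reached_q,
                  q=np.array([5.0, 10.0]), p=np.array([5.0, 0.0]),
                  obstacle=(np.array([4.0, 4.0]), np.array([6.0, 6.0]))):
    """Hand-crafted reward for "reach q, then reach p, avoiding O with positive fuel".

    `state` is (x, y, fuel); `reached_q` is an indicator the user must thread
    through the state space by hand to make the reward Markovian.
    """
    pos, fuel = np.asarray(state[:2]), state[2]
    lo, hi = obstacle
    in_obstacle = bool(np.all(pos >= lo) and np.all(pos <= hi))
    if in_obstacle or fuel <= 0:           # safety constraints dominate
        return -10.0
    goal = p if reached_q else q           # which subtask are we on?
    return -np.max(np.abs(pos - goal))     # partial credit: negative L-infinity distance
```

Even this small task forces the user to pick penalty magnitudes, decide how constraint violations trade off against goal progress, and extend the state by hand; these are exactly the decisions the compilation algorithm below automates.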
Our language allows the user to specify objectives and safety constraints as logical predicates over states, and then compose these primitives sequentially or as disjunctions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Example control task. The blue dashed trajectory satisfies the specification φex (ignoring the fuel budget), whereas the red dotted trajectory does not satisfy φex as it passes through the obstacle.

For example, the above task can be expressed as

φex = achieve (reach q; reach p) ensuring (avoid O ∧ fuel > 0),    (1)

where fuel is the component of the state space keeping track of how much fuel is remaining.

The principle underlying our approach is that in many applications, users have in mind a sequence of high-level actions that are needed to accomplish a given task. For example, φex may encode the scenario where the user wants a quadcopter to fly to a location q, take a photograph, and then return back to its owner at position p, while avoiding a building O and without running out of battery. Alternatively, a user may want to program a warehouse robot to go to the next room, pick up a box, and then bring this item back to the first room. In addition to specifying sequences of tasks, users can also specify choices between multiple tasks (e.g., bring back any box).

Another key aspect of our approach is to allow the user to specify a task without providing the low-level sequence of actions needed to accomplish the task.
Instead, analogous to how a compiler generates machine code from a program written by the user, we propose a compiler for our language that takes the user-provided task specification and generates a control policy that achieves the task. RL is a perfect tool for doing so; in particular, our algorithm compiles the task specification to a reward function, and then uses state-of-the-art RL algorithms to learn a control policy. Overall, the user provides the high-level task structure, and the RL algorithm fills in the low-level details.

A key challenge is that our specifications may encode rewards that are not Markov; e.g., in φex, the robot needs memory that keeps track of whether its current goal is reach q or reach p. Thus, our compiler automatically extends the state space using a task monitor, which is an automaton that keeps track of which subtasks have been completed.¹ Furthermore, this automaton may have nondeterministic transitions; thus, our compiler also extends the action space with actions for choosing state transitions. Intuitively, there may be multiple points in time at which a subtask is considered completed, and the robot must choose which one to use.

Another challenge is that the naïve choice of rewards, i.e., reward 1 if the task is completed and 0 otherwise, can be very sparse, especially for complex tasks. Thus, our compiler automatically performs two kinds of reward shaping based on the structure of the specification: it assigns partial credit (i) for partially accomplishing intermediate subtasks, and (ii) for completing more subtasks. For deterministic MDPs, our reward shaping is guaranteed to preserve the optimal policy; we empirically find it also works well for stochastic MDPs.

We have implemented our approach in a tool called SPECTRL,² and evaluated the performance of SPECTRL compared to a number of baselines.
We show that SPECTRL learns policies that solve each task in our benchmark with a success rate of at least 97%. In summary, our contributions are:

• We propose a language for users to specify RL tasks (Section 2).
• We design an algorithm for compiling a specification into an RL problem, which can be solved using standard RL algorithms (Section 3).
• We have implemented SPECTRL, and empirically demonstrated its benefits (Section 4).

Related work. Imitation learning enables users to specify tasks by providing demonstrations of the desired task [18, 1, 26, 20, 10]. However, in many settings, it may be easier for the user to directly specify the task; e.g., when programming a warehouse robot, it may be easier to specify waypoints describing paths the robot should take than to manually drive the robot to obtain demonstrations.

¹Intuitively, this construction is analogous to compiling a regular expression to a finite state automaton.
²SPECTRL stands for SPECifying Tasks for Reinforcement Learning.

Also, unlike imitation learning, our language allows the user to specify global safety constraints on the robot. Indeed, we believe our approach complements imitation learning, since the user can specify some parts of the task in our language and others using demonstrations.

Another approach is for the user to provide a policy sketch, i.e., a string of tokens specifying a sequence of subtasks [2]. However, tokens have no meaning, except equal tokens represent the same task. Thus, policy sketches cannot be compiled to a reward function, which must be provided separately.

Our specification language is based on temporal logic [19], a language of logical formulas for specifying constraints over (typically, infinite) sequences of events happening over time.
For example, temporal logic allows the user to specify that a logical predicate must be satisfied at some point in time (e.g., "eventually reach state q") or that it must always be satisfied (e.g., "always avoid an obstacle"). In our language, these notions are represented using the achieve and ensuring operators, respectively. Our language restricts temporal logic in a way that enables us to perform reward shaping, and also adds useful operators such as sequencing that allow the user to easily express complex control tasks.

Algorithms have been designed for automatically synthesizing a control policy that satisfies a given temporal logic formula; see [4] for a recent survey, and [12, 25, 6, 9] for applications to robotic motion planning. However, these algorithms are typically based on exhaustive search over control policies. Thus, as with finite-state planning algorithms such as value iteration [22], they cannot be applied to tasks with continuous state and action spaces that can be solved using RL.

Reward machines have been proposed as a high-level way to specify tasks [11]. In that work, the user provides a specification in the form of a finite state machine along with reward functions for each state. Then, they propose an algorithm for learning multiple tasks simultaneously by applying Q-learning updates across different specifications. At a high level, these reward machines are similar to the task monitors defined in our work. However, we differ from their approach in two ways. First, in contrast to their work, the user only needs to provide a high-level logical specification; we automatically generate a task monitor from this specification.
Second, our notion of task monitor has a finite set of registers that can store real values; in contrast, their finite state reward machines cannot store quantitative information.

The most closely related work is [13], which proposes a variant of temporal logic called truncated LTL, along with an algorithm for compiling a specification written in this language to a reward function that can be optimized using RL. However, they do not use any analog of the task monitor, which we demonstrate is needed to handle non-Markovian specifications. Finally, [24] allows the user to separately specify objectives and safety constraints, and then uses RL to learn a policy. However, they do not provide any way to compose rewards, and do not perform any reward shaping. Also, their approach is tied to a specific RL algorithm. We show empirically that our approach substantially outperforms both these approaches.

Finally, an alternative approach is to manually specify rewards for sub-goals to improve performance. However, many challenges arise when implementing sub-goal based rewards: e.g., how does achieving a sub-goal count compared to violating a constraint, how to handle sub-goals that can be achieved in multiple ways, how to ensure the agent does not repeatedly obtain a reward for a previously completed sub-goal, etc. As tasks become more complex and deeply nested, manually specifying rewards for sub-goals becomes very challenging. Our system is designed to automatically solve these issues.

2 Task Specification Language

Markov decision processes. A Markov decision process (MDP) is a tuple (S, D, A, P, T), where S ⊆ Rⁿ are the states, D is the initial state distribution, A ⊆ Rᵐ are the actions, P : S × A × S → [0, 1] are the transition probabilities, and T ∈ N is the time horizon. A rollout ζ ∈ Z of length t is a sequence ζ = s0 −a0→ s1 −a1→ ⋯ −a_{t−1}→ st, where si ∈ S and ai ∈ A. Given a (deterministic) policy π : Z → A, we can generate a rollout using ai = π(ζ0:i). Optionally, an MDP can also include a reward function R : Z → R.³

³Note that we consider rollout-based rewards rather than state-based rewards. Most modern RL algorithms, such as policy gradient algorithms, can use rollout-based rewards.

Specification language. Intuitively, a specification φ in our language is a logical formula specifying whether a given rollout ζ successfully accomplishes the desired task; in particular, it can be interpreted as a function φ : Z → B, where B = {true, false}, defined by

φ(ζ) = I[ζ successfully achieves the task],

where I is the indicator function. Formally, the user first defines a set of atomic predicates P0, where every p ∈ P0 is associated with a function ⟦p⟧ : S → B such that ⟦p⟧(s) indicates whether s satisfies p. For example, given x ∈ S, the atomic predicate

⟦reach x⟧(s) = (‖s − x‖∞ < 1)

indicates whether the robot is in a state near x, and given a rectangular region O ⊆ S, the atomic predicate

⟦avoid O⟧(s) = (s ∉ O)

indicates if the robot is avoiding O. In general, the user can define a new atomic predicate as an arbitrary function ⟦p⟧ : S → B. Next, predicates b ∈ P are conjunctions and disjunctions of atomic predicates. In particular, the syntax of predicates is given by⁴

b ::= p | (b1 ∧ b2) | (b1 ∨ b2),

where p ∈ P0.
Similar to atomic predicates, each predicate b ∈ P corresponds to a function ⟦b⟧ : S → B, defined recursively by ⟦b1 ∧ b2⟧(s) = ⟦b1⟧(s) ∧ ⟦b2⟧(s) and ⟦b1 ∨ b2⟧(s) = ⟦b1⟧(s) ∨ ⟦b2⟧(s). Finally, the syntax of our specifications is given by⁵

φ ::= achieve b | φ1 ensuring b | φ1; φ2 | φ1 or φ2,

where b ∈ P. Intuitively, the first construct means that the robot should try to reach a state s such that ⟦b⟧(s) = true. The second construct says that the robot should try to satisfy φ1 while always staying in states s such that ⟦b⟧(s) = true. The third construct says the robot should try to satisfy task φ1 and then task φ2. The fourth construct means that the robot should try to satisfy either task φ1 or task φ2. Formally, we associate a function ⟦φ⟧ : Z → B with φ recursively as follows:

⟦achieve b⟧(ζ) = ∃ i < t, ⟦b⟧(si)
⟦φ ensuring b⟧(ζ) = ⟦φ⟧(ζ) ∧ (∀ i < t, ⟦b⟧(si))
⟦φ1; φ2⟧(ζ) = ∃ i < t, (⟦φ1⟧(ζ0:i) ∧ ⟦φ2⟧(ζi:t))
⟦φ1 or φ2⟧(ζ) = ⟦φ1⟧(ζ) ∨ ⟦φ2⟧(ζ),

where t is the length of ζ. A rollout ζ satisfies φ if ⟦φ⟧(ζ) = true, which is denoted ζ ⊨ φ.

Problem formulation. Given an MDP and a specification φ, our goal is to compute

π* ∈ arg max_π Pr_{ζ∼Dπ} [⟦φ⟧(ζ) = true],    (2)

where Dπ is the distribution over rollouts generated by π.
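These Boolean semantics translate almost line for line into code. The following is a sketch (the tuple-based encoding of specifications and the predicate lambdas are our own illustrative choices, not the paper's implementation); the split in the sequencing case shares state si between the prefix and suffix, matching ζ0:i and ζi:t:

```python
from typing import List, Tuple

State = Tuple[float, ...]

# A specification is a nested tuple: ("achieve", b), ("ensuring", phi, b),
# ("seq", phi1, phi2), or ("or", phi1, phi2), where b : State -> bool.

def satisfies(phi, zeta: List[State]) -> bool:
    """Boolean semantics [[phi]](zeta) from Section 2."""
    tag = phi[0]
    t = len(zeta)
    if tag == "achieve":                    # exists i < t with b(s_i)
        return any(phi[1](s) for s in zeta)
    if tag == "ensuring":                   # [[phi1]](zeta) and b at every i < t
        return satisfies(phi[1], zeta) and all(phi[2](s) for s in zeta)
    if tag == "seq":                        # exists a split index i; s_i is shared
        return any(satisfies(phi[1], zeta[:i + 1]) and satisfies(phi[2], zeta[i:])
                   for i in range(t))
    if tag == "or":
        return satisfies(phi[1], zeta) or satisfies(phi[2], zeta)
    raise ValueError(tag)
```

For instance, the specification φex (without the fuel predicate) can be built from a `near` predicate for reach and a box check for avoid, and evaluated on candidate trajectories.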
In other words, we want to learn a policy π* that maximizes the probability that a generated rollout ζ satisfies φ.

3 Compilation and Learning Algorithms

In this section, we describe our algorithm for reducing the above problem (2) for a given MDP (S, D, A, P, T) and a specification φ to an RL problem specified as an MDP with a reward function. At a high level, our algorithm extends the state space S to keep track of completed subtasks and constructs a reward function R : Z → R encoding φ. A key feature of our algorithm is that the user has control over the compilation process: we provide a natural default compilation strategy, but the user can extend or modify our approach to improve the performance of the RL algorithm. We give proofs in Appendix B.

Quantitative semantics. So far, we have associated specifications φ with Boolean semantics (i.e., ⟦φ⟧(ζ) ∈ B). A naïve strategy is to assign rewards to rollouts based on whether they satisfy φ:

R(ζ) = 1 if ζ ⊨ φ, and R(ζ) = 0 otherwise.

⁴Formally, a predicate is a string in the context-free language generated by this context-free grammar.
⁵Here, achieve and ensuring correspond to the "eventually" and "always" operators in temporal logic.

[Figure 2 diagram: monitor states q1, q2, q3, q4; registers initialized x1 ← 0, x2 ← 0, x3 ← ∞, x4 ← ∞; transition conditions Σ : s ∈ Q (with update x1 ← 1 − d∞(s, q)), Σ : min{x1, x3, x4} > 0, and Σ : s ∈ P (with update x2 ← 1 − d∞(s, p)); final-state reward ρ : min{x1, x2, x3, x4}; u marks register updates.]

Figure 2: An example of a task monitor. States are labeled with rewards (prefixed with "ρ :"). Transitions are labeled with transition conditions (prefixed with "Σ :"), as well as register update rules. A transition from q2 to q4 is omitted for clarity.
Also, u denotes the two updates x3 ← min{x3, d∞(s, O)} and x4 ← min{x4, fuel(s)}.

However, it is usually difficult to learn a policy to maximize this reward due to its discrete nature. A common strategy is to provide a shaped reward that quantifies the "degree" to which ζ satisfies φ. Our algorithm uses an approach based on quantitative semantics for temporal logic [7, 8, 14]. In particular, we associate an alternate interpretation of a specification φ as a real-valued function ⟦φ⟧q : Z → R. To do so, the user provides quantitative semantics for atomic predicates p ∈ P0; in particular, they provide a function ⟦p⟧q : S → R that quantifies the degree to which p holds for s ∈ S. For example, we can use

⟦reach x⟧q(s) = 1 − d∞(s, x)
⟦avoid O⟧q(s) = d∞(s, O),

where d∞ denotes the L∞ distance. The user should ensure that ⟦p⟧q(s) > 0 if and only if ⟦p⟧(s) = true, and a larger value of ⟦p⟧q should correspond to a greater degree of satisfaction. The quantitative semantics for predicates b ∈ P are ⟦b1 ∧ b2⟧q(s) = min{⟦b1⟧q(s), ⟦b2⟧q(s)} and ⟦b1 ∨ b2⟧q(s) = max{⟦b1⟧q(s), ⟦b2⟧q(s)}. Assuming ⟦p⟧q satisfies the above properties, then ⟦b⟧q > 0 if and only if ⟦b⟧ = true. The quantitative semantics of specifications are

⟦achieve b⟧q(ζ) = max_{i<t} ⟦b⟧q(si)
⟦φ ensuring b⟧q(ζ) = min{⟦φ⟧q(ζ), ⟦b⟧q(s0), ..., ⟦b⟧q(st−1)}
⟦φ1; φ2⟧q(ζ) = max_{i<t} min{⟦φ1⟧q(ζ0:i), ⟦φ2⟧q(ζi:t)}
⟦φ1 or φ2⟧q(ζ) = max{⟦φ1⟧q(ζ), ⟦φ2⟧q(ζ)}.

Then, ⟦φ⟧q(ζ) > 0 if and only if ⟦φ⟧(ζ) = true, so we could define a reward function R(ζ) = ⟦φ⟧q(ζ). However, one of our key goals is to extend the state space so the policy knows which subtasks have been completed.
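The quantitative semantics admit an equally direct transcription. As a sketch (with an ad-hoc tuple encoding of specifications; d∞ is the L∞ distance used in the examples above):

```python
from typing import List, Tuple

State = Tuple[float, ...]

# Specifications are nested tuples: ("achieve", bq), ("ensuring", phi, bq),
# ("seq", phi1, phi2), ("or", phi1, phi2), where bq : State -> float is the
# user-supplied quantitative predicate (positive iff the Boolean predicate holds).

def d_inf(s, x):
    """L-infinity distance between two points."""
    return max(abs(a - b) for a, b in zip(s, x))

def quant(phi, zeta: List[State]) -> float:
    """Quantitative semantics [[phi]]_q(zeta); positive iff zeta satisfies phi."""
    tag = phi[0]
    t = len(zeta)
    if tag == "achieve":
        return max(phi[1](s) for s in zeta)
    if tag == "ensuring":
        return min([quant(phi[1], zeta)] + [phi[2](s) for s in zeta])
    if tag == "seq":    # best split point; both halves must hold, so take min
        return max(min(quant(phi[1], zeta[:i + 1]), quant(phi[2], zeta[i:]))
                   for i in range(t))
    if tag == "or":
        return max(quant(phi[1], zeta), quant(phi[2], zeta))
    raise ValueError(tag)
```

Note that `quant` maximizes over every possible completion point of every subtask, which is exactly the "in hindsight" quantification discussed next.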
On the other hand, the semantics ⟦φ⟧q quantify over all possible ways that subtasks could have been completed in hindsight (i.e., once the entire trajectory is known). For example, there may be multiple points in a trajectory at which a subtask reach q could be considered completed. Below, we describe our construction of the reward function, which is based on ⟦φ⟧q, but applied to a single choice of time steps at which each subtask is completed.

Task monitor. Intuitively, a task monitor is a finite-state automaton (FSA) that keeps track of which subtasks have been completed and which constraints are still satisfied. Unlike an FSA, its transitions may depend on the state s ∈ S of a given MDP. Also, since we are using quantitative semantics, the task monitor has to keep track of the degree to which subtasks are completed and the degree to which constraints are satisfied; thus, it includes registers that keep track of these values. A key challenge is that the task monitor is nondeterministic; as we describe below, we let the policy resolve the nondeterminism, which corresponds to choosing which subtask to complete on each step.

Formally, a task monitor is a tuple M = (Q, X, Σ, U, Δ, q0, v0, F, ρ). First, Q is a finite set of monitor states, which are used to keep track of which subtasks have been completed. Also, X is a finite set of registers, which are variables used to keep track of the degree to which the specification holds so far. Given an MDP (S, D, A, P, T), an augmented state is a tuple (s, q, v) ∈ S × Q × V, where V = R^X; i.e., an MDP state s ∈ S, a monitor state q ∈ Q, and a vector v ∈ V encoding the value of each register in the task monitor.
An augmented state is analogous to a state of an FSA. The transitions Δ of the task monitor depend on the augmented state; thus, they need to specify two pieces of information: (i) conditions on the MDP states and registers for the transition to be enabled, and (ii) how the registers are updated. To handle (i), we consider a set Σ of predicates over S × V, and to handle (ii), we consider a set U of functions u : S × V → V. Then, Δ ⊆ Q × Σ × U × Q is a finite set of (nondeterministic) transitions, where (q, σ, u, q′) ∈ Δ encodes augmented transitions (s, q, v) −a→ (s′, q′, u(s, v)), where s −a→ s′ is an MDP transition, which can be taken as long as σ(s, v) = true. Finally, v0 ∈ R^X is the vector of initial register values, F ⊆ Q is a set of final monitor states, and ρ : S × F × V → R is a reward function.

Given an MDP (S, D, A, P, T) and a specification φ, our algorithm constructs a task monitor Mφ = (Q, X, Σ, U, Δ, q0, v0, F, ρ) whose states and registers keep track of which subtasks of φ have been completed. Our task monitor construction algorithm is analogous to compiling a regular expression to an FSA. More specifically, it is analogous to algorithms for compiling temporal logic formulas to automata [23]. We detail this algorithm in Appendix A. The underlying graph of a task monitor constructed from any given specification is acyclic (ignoring self loops), and final states correspond to sink vertices with no outgoing edges (except a self loop).

As an example, the task monitor for φex is shown in Figure 2. It has monitor states Q = {q1, q2, q3, q4} and registers X = {x1, x2, x3, x4}.
The monitor states encode when the robot (i) has not yet reached q (q1), (ii) has reached q, but has not yet returned to p (q2 and q3), and (iii) has returned to p (q4); q3 is an intermediate monitor state used to ensure that the constraints are satisfied before continuing. Register x1 records ⟦reach q⟧q(s) = 1 − d∞(s, q) when transitioning from q1 to q2, and x2 records ⟦reach p⟧q(s) = 1 − d∞(s, p) when transitioning from q3 to q4. Register x3 keeps track of the minimum value of ⟦avoid O⟧q(s) = d∞(s, O) over states s in the rollout, and x4 keeps track of the minimum value of ⟦fuel > 0⟧q(s) over states s in the rollout.

Augmented MDP. Given an MDP, a specification φ, and its task monitor Mφ, our algorithm constructs an augmented MDP, which is an MDP with a reward function (S̃, s̃0, Ã, P̃, R̃, T). Intuitively, if π̃* is a good policy (one that achieves a high expected reward) for the augmented MDP, then rollouts generated using π̃* should satisfy φ with high probability.

In particular, we have S̃ = S × Q × V and s̃0 = (s0, q0, v0). The transitions P̃ are based on P and Δ. However, the task monitor transitions Δ may be nondeterministic. To resolve this nondeterminism, we require that the policy decide which task monitor transitions to take. In particular, we extend the actions Ã = A × Aφ to include a component Aφ = Δ indicating which transition to take at each step. An augmented action (a, δ) ∈ Ã, where δ = (q, σ, u, q′), is only available in augmented state s̃ = (s, q, v) if σ(s, v) = true.
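To make the nondeterminism concrete, here is a minimal sketch of a task monitor whose transitions carry a guard σ and a register update u, with the choice among enabled transitions left to the caller (i.e., the policy). The class layout and dict-based registers are our own, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vals = Dict[str, float]  # register valuation v

@dataclass
class Transition:
    source: str
    guard: Callable[[tuple, Vals], bool]    # sigma(s, v): is the transition enabled?
    update: Callable[[tuple, Vals], Vals]   # u(s, v): new register valuation
    target: str

@dataclass
class TaskMonitor:
    transitions: List[Transition]
    q0: str
    v0: Vals
    final: set

    def enabled(self, s, q, v) -> List[Transition]:
        """Transitions the policy may choose from in augmented state (s, q, v)."""
        return [d for d in self.transitions if d.source == q and d.guard(s, v)]

    def step(self, s, q, v, delta: Transition):
        """Resolve the nondeterminism with the policy's chosen transition delta."""
        assert delta in self.enabled(s, q, v)
        return delta.target, delta.update(s, v)
```

For the "reach q" fragment of φex, the monitor state q1 has both a self-loop and an advance transition guarded by s ∈ Q, so in a state near q the policy faces exactly the choice described above: stay, or declare the subtask complete and record x1.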
Then, the augmented transition probability is given by

P̃((s, q, v), (a, (q, σ, u, q′)), (s′, q′, u(s, v))) = P(s, a, s′).

Next, an augmented rollout of length t is a sequence ζ̃ = (s0, q0, v0) −a0→ ⋯ −a_{t−1}→ (st, qt, vt) of augmented transitions. The projection proj(ζ̃) = s0 −a0→ ⋯ −a_{t−1}→ st of ζ̃ is the corresponding (normal) rollout. Then, the augmented rewards

R̃(ζ̃) = ρ(sT, qT, vT) if qT ∈ F, and R̃(ζ̃) = −∞ otherwise,

are constructed based on F and ρ. The augmented rewards satisfy the following property.

Theorem 3.1. For any MDP, specification φ, and rollout ζ of the MDP, ζ satisfies φ if and only if there exists an augmented rollout ζ̃ such that (i) R̃(ζ̃) > 0, and (ii) proj(ζ̃) = ζ.

Thus, if we use RL to learn an optimal augmented policy π̃* over augmented states, then π̃* is more likely to generate rollouts ζ̃ such that proj(ζ̃) satisfies φ.

Reward shaping. As discussed before, our algorithm constructs a shaped reward function that provides "partial credit" based on the degree to which φ is satisfied.
We have already described one step of reward shaping, namely, using quantitative semantics instead of the Boolean semantics. However, the augmented rewards R̃ are −∞ unless a run reaches a final state of the task monitor. Thus, our algorithm performs an additional step of reward shaping; in particular, it constructs a reward function R̃s that gives partial credit for accomplishing subtasks in the MDP.

For a non-final monitor state q, let α : S × Q × V → R be defined by

α(s, q, v) = max_{(q,σ,u,q′)∈Δ, q′≠q} ⟦σ⟧q(s, v).

Intuitively, α quantifies how "close" an augmented state s̃ = (s, q, v) is to transitioning to another augmented state with a different monitor state. Then, our algorithm assigns partial credit to augmented states where α is larger.

However, to ensure that a good policy according to the shaped rewards R̃s is also a good policy according to R̃, it does so in a way that preserves the ordering of the cumulative rewards for rollouts; i.e., for two length-T rollouts ζ̃ and ζ̃′, it guarantees that if R̃(ζ̃) > R̃(ζ̃′), then R̃s(ζ̃) > R̃s(ζ̃′). To this end, we assume that we are given a lower bound Cℓ on the final reward achieved when reaching a final monitor state; i.e., Cℓ < R̃(ζ̃) for all ζ̃ with final state s̃T = (sT, qT, vT) such that qT ∈ F is a final monitor state.
Furthermore, we assume that we are given an upper bound Cu on the absolute value of α over non-final monitor states; i.e., Cu ≥ |α(s, q, v)| for any augmented state such that q ∉ F.

Now, for any q ∈ Q, let dq be the length of the longest path from q0 to q in the graph of Mφ (ignoring self loops in Δ) and D = max_{q∈Q} dq. Given an augmented rollout ζ̃, let s̃i = (si, qi, vi) be the first augmented state in ζ̃ such that qi = qi+1 = ... = qT. Then, the shaped reward is

R̃s(ζ̃) = R̃(ζ̃) if qT ∈ F, and otherwise
R̃s(ζ̃) = max_{i≤j<T} α(sj, qT, vj) + 2Cu · (d_{qT} − D) + Cℓ.

The shaped rewards satisfy the following properties.

Theorem 3.2. For any two length-T augmented rollouts ζ̃ and ζ̃′: (i) if R̃(ζ̃) > R̃(ζ̃′), then R̃s(ζ̃) > R̃s(ζ̃′), and (ii) if ζ̃ and ζ̃′ end in distinct non-final monitor states qT and q′T such that d_{qT} > d_{q′T}, then R̃s(ζ̃) ≥ R̃s(ζ̃′).

Reinforcement learning. Once our algorithm has constructed an augmented MDP, it can use any RL algorithm to learn an augmented policy π̃ : S̃ → Ã for the augmented MDP:

π̃* ∈ arg max_{π̃} E_{ζ̃∼D_{π̃}} [R̃s(ζ̃)],

where D_{π̃} denotes the distribution over augmented rollouts generated by policy π̃. We solve this RL problem using augmented random search (ARS), a state-of-the-art RL algorithm [15].

After computing π̃*, we can convert π̃* to a projected policy π* = proj(π̃*) for the original MDP by integrating π̃* with the task monitor Mφ, which keeps track of the information needed for π̃* to make decisions. More precisely, proj(π̃*) includes internal memory that keeps track of the current monitor state and register values (qt, vt) ∈ Q × V. It initializes this memory to the initial monitor state q0 and the initial register valuation v0.
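One way to realize this depth-based shaping in code is sketched below; the offset constants are our reconstruction (chosen so that deeper monitor states score at least as high as shallower ones, and every final-state reward, being above Cℓ, beats every non-final shaped reward), not necessarily the paper's exact form:

```python
def shaped_reward(final_reward, alpha_values, q_is_final, d_q, D, C_l, C_u):
    """Depth-based reward shaping (a sketch).

    final_reward : the unshaped reward, valid only when the run ends in a final
                   monitor state
    alpha_values : progress values alpha(s_j, q_T, v_j) collected while the run
                   sits in its last monitor state q_T
    d_q, D       : longest-path depth of q_T, and the maximum depth in the monitor
    C_l          : lower bound on final rewards; C_u bounds |alpha| off final states
    """
    if q_is_final:
        return final_reward
    # Partial credit: best progress toward leaving q_T, offset by depth. Each
    # extra level of depth is worth 2*C_u, which dominates the spread of alpha,
    # so completing more subtasks always scores at least as high.
    return max(alpha_values) + 2 * C_u * (d_q - D) + C_l
```

A quick check of the ordering properties: with Cℓ = 0, Cu = 1, D = 3, a run stuck at depth 2 outscores a run stuck at depth 1 even if the latter made more local progress, and any final-state reward above Cℓ outscores both.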
Given an augmented action (a, (q, σ, u, q′)) = π̃*((st, qt, vt)), it updates this internal memory using the rules qt+1 = q′ and vt+1 = u(st, vt).

Finally, we use a neural network architecture similar to neural module networks [3, 2], where different neural networks accomplish different subtasks in φ. In particular, an augmented policy π̃ is a set of neural networks {Nq | q ∈ Q}, where Q are the monitor states in Mφ. Each Nq takes as input (s, v) ∈ S × V and outputs an augmented action Nq(s, v) = (a, a′) ∈ R^{k+2}, where k is the out-degree of q in Mφ; then, π̃(s, q, v) = Nq(s, v).

Figure 3: Learning curves for φ1, φ2, φ3 and φ4 (top, left to right), and φ5, φ6 and φ7 (bottom, left to right), for SPECTRL (green), TLTL (blue), CCE (yellow), and SPECTRL without reward shaping (purple). The x-axis shows the number of sample trajectories, and the y-axis shows the probability of satisfying the specification (estimated using samples). To exclude outliers, we omitted one best and one worst run out of the 5 runs. The plots are the average over the remaining 3 runs, with error bars indicating one standard deviation around the average.

4 Experiments

Setup. We implemented our algorithm in a tool SPECTRL,⁶ and used it to learn policies for a variety of specifications. We consider a dynamical system with states S = R² × R, where (x, r) ∈ S encodes the robot position x and its remaining fuel r, actions A = [−1, 1]², where an action a ∈ A is the robot velocity, and transitions f(x, r, a) = (x + a + ε, r − 0.1 · |x1| · ‖a‖), where ε ∼ N(0, σ²I) and the fuel consumed is proportional to the product of speed and distance from the y-axis.
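The experimental dynamics are simple enough to transcribe directly; in this sketch the noise scale σ and the action clipping are assumptions (the excerpt does not report σ), and ‖·‖ is taken to be the Euclidean norm:

```python
import numpy as np

def step(state, action, sigma=0.05, rng=None):
    """One step of the 2D robot-with-fuel system from the experiments:
    position moves by the (noisy) velocity; fuel burn is proportional to
    speed times distance from the y-axis."""
    rng = rng or np.random.default_rng()
    x, r = np.asarray(state[:2], dtype=float), float(state[2])
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)  # A = [-1, 1]^2
    eps = rng.normal(0.0, sigma, size=2)                     # epsilon ~ N(0, sigma^2 I)
    x_next = x + a + eps
    r_next = r - 0.1 * abs(x[0]) * np.linalg.norm(a)         # fuel consumption
    return (x_next[0], x_next[1], r_next)
```

With σ = 0, starting from the initial state (5, 0, 7) and moving at full speed along the x-axis, one step costs 0.1 · 5 · 1 = 0.5 units of fuel.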
The initial state is s0 = (5, 0, 7), and the horizon is T = 40.
In Figure 3, we consider the following specifications, where O = [4, 6] × [4, 6]:

• φ1 = achieve reach (5, 10) ensuring (avoid O)
• φ2 = achieve reach (5, 10) ensuring (avoid O ∧ (r > 0))
• φ3 = achieve (reach [(5, 10); (5, 0)]) ensuring avoid O
• φ4 = achieve (reach (5, 10) or reach (10, 0); reach (10, 10)) ensuring avoid O
• φ5 = achieve (reach [(5, 10); (5, 0); (10, 0)]) ensuring avoid O
• φ6 = achieve (reach [(5, 10); (5, 0); (10, 0); (10, 10)]) ensuring avoid O
• φ7 = achieve (reach [(5, 10); (5, 0); (10, 0); (10, 10); (0, 0)]) ensuring avoid O

where the abbreviation achieve (b; b′) denotes achieve b; achieve b′ and the abbreviation reach [p1; p2] denotes reach p1; reach p2. For all specifications, each Nq has two fully connected hidden layers with 30 neurons each, ReLU activations, and a tanh output layer. We compare our algorithm to [13] (TLTL), which directly uses the quantitative semantics of the specification as the reward function (with ARS as the learning algorithm), and to the constrained cross-entropy method (CCE) [24], a state-of-the-art RL algorithm for learning policies for tasks with constraints. We used neural networks with two hidden layers and 50 neurons per layer for both baselines.
Results. Figure 3 shows learning curves of SPECTRL (our tool), TLTL, and CCE. In addition, it shows SPECTRL without reward shaping (Unshaped), which uses rewards ˜R instead of ˜Rs. These plots demonstrate the ability of SPECTRL to outperform state-of-the-art baselines. For specifications φ1, ..., φ5, the curve for SPECTRL gets close to 100% in all executions, and for φ6 and φ7, it gets close to 100% in 4 out of 5 executions.
The performance of CCE drops when multiple constraints are added (here, the obstacle and the fuel, i.e., φ2). TLTL performs similarly to SPECTRL on tasks φ1, φ3 and φ4 (at least in some executions), but SPECTRL converges faster for φ1 and φ4.

6The implementation can be found at https://github.com/keyshor/spectrl_tool.

Figure 4: Sample complexity curves (left), with the number of nested sequencing operators on the x-axis and the average number of samples to converge on the y-axis. Learning curve for the cartpole example (right).

Since TLTL and CCE use a single neural network to encode the policy as a function of state, they perform poorly on tasks that require memory—i.e., φ5, φ6, and φ7. For example, to satisfy φ5, the action that should be taken at s = (5, 0) depends on whether (5, 10) has been visited. In contrast, SPECTRL performs well on these tasks since its policy is based on the monitor state.
These results also demonstrate the importance of reward shaping. Without it, ARS cannot learn unless it randomly samples a policy that reaches a final monitor state. Reward shaping is especially important for specifications that include many sequencing operators (φ; φ′)—i.e., specifications φ5, φ6, and φ7.
Figure 4 (left) shows how sample complexity grows with the number of nested sequencing operators (φ1, φ3, φ5, φ6, φ7). Each curve indicates the average number of samples needed to learn a policy that achieves satisfaction probability ≥ τ. SPECTRL scales well with the size of the specification.
Cartpole. Finally, we applied SPECTRL to a different control task—namely, learning a policy for the version of cartpole in OpenAI Gym, modified to use continuous instead of discrete actions. The specification is to move the cart to the right and then move back left without letting the pole fall.
The formal specification is given by

φ = achieve (reach 0.5; reach 0.0) ensuring balance

where the predicate balance holds when the vertical angle of the pole is smaller than π/15 in absolute value. Figure 4 (right) shows the learning curve for this task, averaged over 3 runs of the algorithm, along with the three baselines. TLTL is able to learn a policy for this task, but it converges more slowly than SPECTRL; CCE is unable to learn a policy satisfying this specification.

5 Conclusion

We have proposed a language for formally specifying control tasks, along with an algorithm that learns policies to perform tasks specified in the language. Our algorithm first constructs a task monitor from the given specification, and then uses the task monitor to assign shaped rewards to runs of the system. Furthermore, the monitor state is also given as input to the controller, which enables our algorithm to learn policies for non-Markovian specifications. Finally, we implemented our approach in a tool called SPECTRL, which enables users to program what the agent needs to do at a high level; the tool then automatically learns a policy that tries to best satisfy the user's intent. We also demonstrated that SPECTRL can be used to learn policies for complex specifications, and that it can outperform state-of-the-art baselines.
Acknowledgements. We thank the reviewers for their insightful comments. This work was partially supported by NSF grant CCF-1723567 and by AFRL and DARPA under Contract No. FA8750-18-C-0090.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pages 1–8, 2004.

[2] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches.
In Proceedings of the 34th International Conference on Machine Learning, pages 166–175, 2017.

[3] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

[4] Roderick Bloem, Krishnendu Chatterjee, and Barbara Jobstmann. Graph games and reactive synthesis. In Handbook of Model Checking, pages 921–962. Springer, 2018.

[5] Steve Collins, Andy Ruina, Russ Tedrake, and Martijn Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 307(5712):1082–1085, 2005.

[6] Samuel Coogan, Ebru Aydin Gol, Murat Arcak, and Calin Belta. Traffic network control from temporal logic specifications. IEEE Transactions on Control of Network Systems, 3(2):162–172, 2015.

[7] Jyotirmoy V. Deshmukh, Alexandre Donzé, Shromona Ghosh, Xiaoqing Jin, Garvit Juniwal, and Sanjit A. Seshia. Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30, 2017.

[8] Georgios E. Fainekos and George J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science, 410(42):4262–4291, September 2009.

[9] Keliang He, Morteza Lahijanian, Lydia E. Kavraki, and Moshe Y. Vardi. Reactive synthesis for finite tasks under resource constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5326–5332, 2017.

[10] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[11] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning.
In Proceedings of the 35th International Conference on Machine Learning, pages 2112–2121, 2018.

[12] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.

[13] Xiao Li, Cristian-Ioan Vasile, and Calin Belta. Reinforcement learning with temporal logic rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3834–3839, 2017.

[14] Oded Maler and Dejan Ničković. Monitoring properties of analog and mixed-signal circuits. International Journal on Software Tools for Technology Transfer, 15(3):247–268, 2013.

[15] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018.

[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[17] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.

[18] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.

[19] Amir Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 46–57, 1977.

[20] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning.
In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pages 387–395, 2014.

[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[23] M. Y. Vardi and P. Wolper. Reasoning about infinite computations. Information and Computation, 115(1):1–37, 1994.

[24] Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 7450–7460, 2018.

[25] Tichakorn Wongpiromsarn, Ufuk Topcu, and Richard M. Murray. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control, 57(11):2817–2830, 2012.

[26] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.