{"title": "Budgeted Reinforcement Learning in Continuous State Space", "book": "Advances in Neural Information Processing Systems", "page_first": 9299, "page_last": 9309, "abstract": "A Budgeted Markov Decision Process (BMDP) is an extension of a Markov Decision Process to critical applications requiring safety constraints. It relies on a notion of risk implemented in the shape of an upper bound on a constrains violation signal that -- importantly -- can be modified in real-time. So far, BMDPs could only be solved in the case of finite state spaces with known dynamics. This work extends the state-of-the-art to continuous spaces environments and unknown dynamics. We show that the solution to a BMDP is the fixed point of a novel Budgeted Bellman Optimality operator. This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. We validate our approach on two simulated applications: spoken dialogue and autonomous driving.", "full_text": "Budgeted Reinforcement Learning in Continuous\n\nState Space\n\nNicolas Carrara\u2217\n\nSequeL team, INRIA Lille \u2013 Nord Europe\u2020\n\nnicolas.carrara@inria.fr\n\nRomain Laroche\n\nMicrosoft Research, Montreal, Canada\n\nromain.laroche@microsoft.com\n\nOdalric-Ambrym Maillard\n\nSequeL team, INRIA Lille \u2013 Nord Europe\n\nodalric.maillard@inria.fr\n\nEdouard Leurent\u2217\n\nSequeL team, INRIA Lille \u2013 Nord Europe\u2020\n\nRenault Group, France\n\nedouard.leurent@inria.fr\n\nTanguy Urvoy\n\nOrange Labs, Lannion, France\n\ntanguy.urvoy@orange.com\n\nOlivier Pietquin\n\nGoogle Research - Brain Team\n\nSequeL team, INRIA Lille \u2013 Nord Europe\u2020\n\npietquin@google.com\n\nAbstract\n\nA Budgeted Markov Decision Process (BMDP) is an extension of a Markov\nDecision Process to critical applications requiring safety constraints. 
It relies on a\nnotion of risk implemented in the shape of a cost signal constrained to lie below\nan \u2013 adjustable \u2013 threshold. So far, BMDPs could only be solved in the case of\nfinite state spaces with known dynamics. This work extends the state of the art\nto continuous-space environments and unknown dynamics. We show that the\nsolution to a BMDP is a fixed point of a novel Budgeted Bellman Optimality\noperator. This observation allows us to introduce natural extensions of Deep\nReinforcement Learning algorithms to address large-scale BMDPs. We validate\nour approach on two simulated applications: spoken dialogue and autonomous\ndriving\u00b3.\n\n\u2217Both authors contributed equally.\n\u2020Univ. Lille, CNRS, Centrale Lille, INRIA UMR 9189 - CRIStAL, Lille, France\n\u00b3Videos and code are available at https://budgeted-rl.github.io/.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nReinforcement Learning (RL) is a general framework for decision-making under uncertainty. 
It\nframes the learning objective as the optimal control of a Markov Decision Process $(\\mathcal{S}, \\mathcal{A}, P, R_r, \\gamma)$\nwith measurable state space $\\mathcal{S}$, discrete actions $\\mathcal{A}$, unknown rewards $R_r \\in \\mathbb{R}^{\\mathcal{S} \\times \\mathcal{A}}$, and unknown\ndynamics $P \\in \\mathcal{M}(\\mathcal{S})^{\\mathcal{S} \\times \\mathcal{A}}$, where $\\mathcal{M}(\\mathcal{X})$ denotes the probability measures over a set $\\mathcal{X}$. Formally,\nwe seek a policy $\\pi \\in \\mathcal{M}(\\mathcal{A})^{\\mathcal{S}}$ that maximises in expectation the $\\gamma$-discounted return of rewards\n$G^\\pi_r = \\sum_{t=0}^{\\infty} \\gamma^t R_r(s_t, a_t)$.\nHowever, this modelling assumption comes at a price: no control is given over the spread of the\nperformance distribution (Dann et al., 2019). In many critical real-world applications where failures\nmay turn out very costly, this is an issue as most decision-makers would rather give away some\namount of expected optimality to improve performance in the lower tail of the distribution. This\nhas led to the development of several risk-averse variants where the optimisation criteria include\nother statistics of the performance, such as the worst-case realisation (Iyengar, 2005; Nilim and El\nGhaoui, 2005; Wiesemann et al., 2013), the variance-penalised expectation (Garc\u00eda and Fern\u00e1ndez,\n2015; Tamar et al., 2012), the Value-At-Risk (VaR) (Mausser and Rosen, 2003; Luenberger, 2013),\nor the Conditional Value-At-Risk (CVaR) (Chow et al., 2015, 2018).\nReinforcement Learning also assumes that the performance can be described by a single reward\nfunction $R_r$. Conversely, real problems typically involve many aspects, some of which can be\ncontradictory (Liu et al., 2014). For instance, a self-driving car needs to balance between progressing\nquickly on the road and avoiding collisions. 
When aggregating several objectives in a single scalar\nsignal, as is often done in Multi-Objective RL (Roijers et al., 2013), no control is given over their relative\nratios, as high rewards can compensate for high penalties. For instance, if a weighted sum is used to\nbalance velocity $v$ and crashes $c$, then for any given choice of weights $\\omega$ the optimality equation\n$\\omega_v \\mathbb{E}[\\sum \\gamma^t v_t] + \\omega_a \\mathbb{E}[\\sum \\gamma^t c_t] = G^*_r = \\max_\\pi G^\\pi_r$ is the equation of a line in $(\\mathbb{E}[\\sum \\gamma^t v_t], \\mathbb{E}[\\sum \\gamma^t c_t])$,\nand the automotive company cannot control where its optimal policy $\\pi^*$ lies on that line.\nBoth of these concerns can be addressed in the Constrained Markov Decision Process (CMDP)\nsetting (Beutler and Ross, 1985; Altman, 1999). In this multi-objective formulation, task completion\nand safety are considered separately. We equip the MDP with a cost signal $R_c \\in \\mathbb{R}^{\\mathcal{S} \\times \\mathcal{A}}$ and a cost\nbudget $\\beta \\in \\mathbb{R}$. Similarly to $G^\\pi_r$, we define the return of costs $G^\\pi_c = \\sum_{t=0}^{\\infty} \\gamma^t R_c(s_t, a_t)$ and the new\ncost-constrained objective:\n\n$$\\max_{\\pi \\in \\mathcal{M}(\\mathcal{A})^{\\mathcal{S}}} \\mathbb{E}[G^\\pi_r \\mid s_0 = s] \\quad \\text{s.t.} \\quad \\mathbb{E}[G^\\pi_c \\mid s_0 = s] \\leq \\beta \\qquad (1)$$\n\nThis constrained framework allows for better control of the performance-safety tradeoff. However, it\nsuffers from a major limitation: the budget has to be chosen before training, and cannot be changed\nafterwards.\nTo address this concern, the Budgeted Markov Decision Process (BMDP) was introduced by Boutilier\nand Lu (2016) as an extension of CMDPs to enable online control of the budget $\\beta$ within an\ninterval $\\mathcal{B} \\subset \\mathbb{R}$ of admissible budgets. 
Instead of fixing the budget prior to training, the objective is\nnow to find a generic optimal policy $\\pi^*$ that takes $\\beta$ as input so as to solve the corresponding CMDP\n(Eq. (1)) for all $\\beta \\in \\mathcal{B}$. This gives the system designer the ability to move the optimal policy $\\pi^*$ in\nreal-time along the Pareto-optimal curve of the different reward-cost trade-offs.\nOur first contribution is to re-frame the original BMDP formulation in the context of continuous states\nand infinite discounted horizon. We then propose a novel Budgeted Bellman Optimality Operator and\nprove the optimal value function to be a fixed point of this operator. Second, we use this operator in\nBFTQ, a batch Reinforcement Learning algorithm, for solving BMDPs online by interaction with\nan environment, through function approximation and a tailored exploration procedure. 
Third, we\nscale this algorithm to large problems by providing an efficient implementation of the Budgeted\nBellman Optimality Operator based on convex programming, a risk-sensitive exploration procedure,\nand by leveraging tools from Deep Reinforcement Learning such as Deep Neural Networks and\nsynchronous parallel computing. Finally, we validate our approach in two environments that display\na clear trade-off between rewards and costs: a spoken dialogue system and a problem of behaviour\nplanning for autonomous driving. The proofs of our main results are provided in Appendix A.\n\n2 Budgeted Dynamic Programming\n\nWe work in the space of budgeted policies, where a policy $\\pi$ both depends on the current budget $\\beta$\nand also outputs a next budget $\\beta_a$. Hence, the budget $\\beta$ is neither fixed nor constant as in the CMDP\nsetting but instead evolves as part of the dynamics.\nWe cast the BMDP problem as a multi-objective MDP problem (Roijers et al., 2013) by considering\naugmented state and action spaces $\\overline{\\mathcal{S}} = \\mathcal{S} \\times \\mathcal{B}$ and $\\overline{\\mathcal{A}} = \\mathcal{A} \\times \\mathcal{B}$, and equip them with the augmented\ndynamics $\\bar{P} \\in \\mathcal{M}(\\overline{\\mathcal{S}})^{\\overline{\\mathcal{S}} \\times \\overline{\\mathcal{A}}}$ defined as:\n\n$$\\bar{P}(\\bar{s}' \\mid \\bar{s}, \\bar{a}) = \\bar{P}((s', \\beta') \\mid (s, \\beta), (a, \\beta_a)) \\overset{\\text{def}}{=} P(s' \\mid s, a) \\, \\delta(\\beta' - \\beta_a), \\qquad (2)$$\n\nwhere $\\delta$ is the Dirac indicator distribution.\n\nIn other words, in these augmented dynamics, the output budget $\\beta_a$ returned at time $t$ by a budgeted\npolicy $\\pi \\in \\Pi = \\mathcal{M}(\\overline{\\mathcal{A}})^{\\overline{\\mathcal{S}}}$ will be used to condition the policy at the next timestep $t + 1$.\nWe stack the rewards and cost functions in a single vectorial signal $R \\in (\\mathbb{R}^2)^{\\overline{\\mathcal{S}} \\times \\overline{\\mathcal{A}}}$. 
Given an\naugmented transition $(\\bar{s}, \\bar{a}) = ((s, \\beta), (a, \\beta_a))$, we define:\n\n$$R(\\bar{s}, \\bar{a}) \\overset{\\text{def}}{=} \\begin{bmatrix} R_r(s, a) \\\\ R_c(s, a) \\end{bmatrix} \\in \\mathbb{R}^2. \\qquad (3)$$\n\nLikewise, the return $G^\\pi = (G^\\pi_r, G^\\pi_c)$ of a budgeted policy $\\pi \\in \\Pi$ refers to: $G^\\pi \\overset{\\text{def}}{=} \\sum_{t=0}^{\\infty} \\gamma^t R(\\bar{s}_t, \\bar{a}_t)$,\nand the value functions $V^\\pi$, $Q^\\pi$ of a budgeted policy $\\pi \\in \\Pi$ are defined as:\n\n$$V^\\pi(\\bar{s}) = (V^\\pi_r, V^\\pi_c) \\overset{\\text{def}}{=} \\mathbb{E}[G^\\pi \\mid \\bar{s}_0 = \\bar{s}], \\qquad Q^\\pi(\\bar{s}, \\bar{a}) = (Q^\\pi_r, Q^\\pi_c) \\overset{\\text{def}}{=} \\mathbb{E}[G^\\pi \\mid \\bar{s}_0 = \\bar{s}, \\bar{a}_0 = \\bar{a}]. \\qquad (4)$$\n\nWe restrict $\\overline{\\mathcal{S}}$ to feasible budgets only: $\\overline{\\mathcal{S}}_f \\overset{\\text{def}}{=} \\{\\bar{s} = (s, \\beta) \\in \\overline{\\mathcal{S}} : \\exists \\pi \\in \\Pi, V^\\pi_c(\\bar{s}) \\leq \\beta\\}$ that we assume is\nnon-empty for the BMDP to admit a solution. We still write $\\overline{\\mathcal{S}}$ in place of $\\overline{\\mathcal{S}}_f$ for brevity of notations.\nProposition 1 (Budgeted Bellman Expectation). 
The value functions $V^\\pi$ and $Q^\\pi$ verify:\n\n$$V^\\pi(\\bar{s}) = \\sum_{\\bar{a} \\in \\overline{\\mathcal{A}}} \\pi(\\bar{a} \\mid \\bar{s}) Q^\\pi(\\bar{s}, \\bar{a}), \\qquad Q^\\pi(\\bar{s}, \\bar{a}) = R(\\bar{s}, \\bar{a}) + \\gamma \\sum_{\\bar{s}' \\in \\overline{\\mathcal{S}}} \\bar{P}(\\bar{s}' \\mid \\bar{s}, \\bar{a}) V^\\pi(\\bar{s}') \\qquad (5)$$\n\nMoreover, consider the Budgeted Bellman Expectation operator $\\mathcal{T}^\\pi$: $\\forall Q \\in (\\mathbb{R}^2)^{\\overline{\\mathcal{S}} \\overline{\\mathcal{A}}}, \\bar{s} \\in \\overline{\\mathcal{S}}, \\bar{a} \\in \\overline{\\mathcal{A}}$,\n\n$$\\mathcal{T}^\\pi Q(\\bar{s}, \\bar{a}) \\overset{\\text{def}}{=} R(\\bar{s}, \\bar{a}) + \\gamma \\sum_{\\bar{s}' \\in \\overline{\\mathcal{S}}} \\sum_{\\bar{a}' \\in \\overline{\\mathcal{A}}} \\bar{P}(\\bar{s}' \\mid \\bar{s}, \\bar{a}) \\, \\pi(\\bar{a}' \\mid \\bar{s}') \\, Q(\\bar{s}', \\bar{a}') \\qquad (6)$$\n\nThen $\\mathcal{T}^\\pi$ is a $\\gamma$-contraction and $Q^\\pi$ is its unique fixed point.\nDefinition 1 (Budgeted Optimality). We now come to the definition of budgeted optimality. We want\nan optimal budgeted policy to: (i) respect the cost budget $\\beta$, (ii) maximise the $\\gamma$-discounted return of\nrewards $G_r$, (iii) in case of tie, minimise the $\\gamma$-discounted return of costs $G_c$. 
To that end, we define\nfor all $\\bar{s} \\in \\overline{\\mathcal{S}}$:\n(i) Admissible policies $\\Pi_a$:\n\n$$\\Pi_a(\\bar{s}) \\overset{\\text{def}}{=} \\{\\pi \\in \\Pi : V^\\pi_c(\\bar{s}) \\leq \\beta\\} \\text{ where } \\bar{s} = (s, \\beta) \\qquad (7)$$\n\n(ii) Optimal value function for rewards $V^*_r$ and candidate policies $\\Pi_r$:\n\n$$V^*_r(\\bar{s}) \\overset{\\text{def}}{=} \\max_{\\pi \\in \\Pi_a(\\bar{s})} V^\\pi_r(\\bar{s}), \\qquad \\Pi_r(\\bar{s}) \\overset{\\text{def}}{=} \\arg\\max_{\\pi \\in \\Pi_a(\\bar{s})} V^\\pi_r(\\bar{s}) \\qquad (8)$$\n\n(iii) Optimal value function for costs $V^*_c$ and optimal policies $\\Pi^*$:\n\n$$V^*_c(\\bar{s}) \\overset{\\text{def}}{=} \\min_{\\pi \\in \\Pi_r(\\bar{s})} V^\\pi_c(\\bar{s}), \\qquad \\Pi^*(\\bar{s}) \\overset{\\text{def}}{=} \\arg\\min_{\\pi \\in \\Pi_r(\\bar{s})} V^\\pi_c(\\bar{s}) \\qquad (9)$$\n\nWe define the budgeted action-value function $Q^*$ similarly:\n\n$$Q^*_r(\\bar{s}, \\bar{a}) \\overset{\\text{def}}{=} \\max_{\\pi \\in \\Pi_a(\\bar{s})} Q^\\pi_r(\\bar{s}, \\bar{a}), \\qquad Q^*_c(\\bar{s}, \\bar{a}) \\overset{\\text{def}}{=} \\min_{\\pi \\in \\Pi_r(\\bar{s})} Q^\\pi_c(\\bar{s}, \\bar{a}) \\qquad (10)$$\n\nand denote $V^* = (V^*_r, V^*_c)$, $Q^* = (Q^*_r, Q^*_c)$.\nTheorem 1 (Budgeted Bellman Optimality). 
The optimal budgeted action-value function $Q^*$ verifies:\n\n$$Q^*(\\bar{s}, \\bar{a}) = \\mathcal{T} Q^*(\\bar{s}, \\bar{a}) \\overset{\\text{def}}{=} R(\\bar{s}, \\bar{a}) + \\gamma \\sum_{\\bar{s}' \\in \\overline{\\mathcal{S}}} \\bar{P}(\\bar{s}' \\mid \\bar{s}, \\bar{a}) \\sum_{\\bar{a}' \\in \\overline{\\mathcal{A}}} \\pi_{\\text{greedy}}(\\bar{a}' \\mid \\bar{s}'; Q^*) \\, Q^*(\\bar{s}', \\bar{a}'), \\qquad (11)$$\n\nwhere the greedy policy $\\pi_{\\text{greedy}}$ is defined by: $\\forall \\bar{s} = (s, \\beta) \\in \\overline{\\mathcal{S}}, \\bar{a} \\in \\overline{\\mathcal{A}}, \\forall Q \\in (\\mathbb{R}^2)^{\\overline{\\mathcal{A}} \\times \\overline{\\mathcal{S}}}$,\n\n$$\\pi_{\\text{greedy}}(\\bar{a} \\mid \\bar{s}; Q) \\in \\arg\\min_{\\rho \\in \\Pi^Q_r} \\mathbb{E}_{\\bar{a} \\sim \\rho} Q_c(\\bar{s}, \\bar{a}), \\qquad (12a)$$\n\n$$\\text{where } \\Pi^Q_r \\overset{\\text{def}}{=} \\arg\\max_{\\rho \\in \\mathcal{M}(\\overline{\\mathcal{A}})} \\mathbb{E}_{\\bar{a} \\sim \\rho} Q_r(\\bar{s}, \\bar{a}) \\qquad (12b)$$\n\n$$\\text{s.t. } \\mathbb{E}_{\\bar{a} \\sim \\rho} Q_c(\\bar{s}, \\bar{a}) \\leq \\beta. \\qquad (12c)$$\n\nRemark 1 (Appearance of the greedy policy). 
In classical Reinforcement Learning, the greedy policy\ntakes a simple form $\\pi_{\\text{greedy}}(s; Q^*) = \\arg\\max_{a \\in \\mathcal{A}} Q^*(s, a)$, and the term $\\pi_{\\text{greedy}}(\\bar{a}' \\mid \\bar{s}'; Q^*) Q^*(\\bar{s}', \\bar{a}')$\nin (11) conveniently simplifies to $\\max_{a' \\in \\mathcal{A}} Q^*(s', a')$. Unfortunately, in a budgeted setting the greedy\npolicy requires solving the nested constrained optimisation program (12) at each state and budget in\norder to apply this Budgeted Bellman Optimality operator.\nProposition 2 (Optimality of the greedy policy). The greedy policy $\\pi_{\\text{greedy}}(\\cdot\\,; Q^*)$ is uniformly\noptimal: $\\forall \\bar{s} \\in \\overline{\\mathcal{S}}, \\pi_{\\text{greedy}}(\\cdot\\,; Q^*) \\in \\Pi^*(\\bar{s})$. In particular, $V^{\\pi_{\\text{greedy}}(\\cdot; Q^*)} = V^*$ and $Q^{\\pi_{\\text{greedy}}(\\cdot; Q^*)} = Q^*$.\nBudgeted Value Iteration The Budgeted Bellman Optimality equation is a fixed-point equation,\nwhich motivates the introduction of a fixed-point iteration procedure. We introduce Algorithm 1,\na Dynamic Programming algorithm for solving known BMDPs. If it were to converge to a unique\nfixed point, this algorithm would provide a way to compute $Q^*$ and recover the associated optimal\nbudgeted policy $\\pi_{\\text{greedy}}(\\cdot\\,; Q^*)$.\nTheorem 2 (Non-contractivity of $\\mathcal{T}$). For any BMDP $(\\mathcal{S}, \\mathcal{A}, P, R_r, R_c, \\gamma)$ with $|\\mathcal{A}| \\geq 2$, $\\mathcal{T}$ is not a\ncontraction. Precisely: $\\forall \\varepsilon > 0, \\exists Q_1, Q_2 \\in (\\mathbb{R}^2)^{\\overline{\\mathcal{S}} \\overline{\\mathcal{A}}} : \\|\\mathcal{T} Q_1 - \\mathcal{T} Q_2\\|_\\infty \\geq \\frac{1}{\\varepsilon} \\|Q_1 - Q_2\\|_\\infty$.\nUnfortunately, as $\\mathcal{T}$ is not a contraction, we can guarantee neither the convergence of Algorithm 1\nnor the uniqueness of its fixed points. 
Despite those theoretical limitations, we empirically observed the\nconvergence to a fixed point in our experiments (Section 5). We conjecture a possible explanation:\nTheorem 3 (Contractivity of $\\mathcal{T}$ on smooth Q-functions). 
The operator $\\mathcal{T}$ is a contraction when\nrestricted to the subset $\\mathcal{L}_\\gamma$ of Q-functions such that \"$Q_r$ is Lipschitz with respect to $Q_c$\":\n\n$$\\mathcal{L}_\\gamma = \\left\\{ Q \\in (\\mathbb{R}^2)^{\\overline{\\mathcal{S}} \\overline{\\mathcal{A}}} \\text{ s.t. } \\exists L < \\tfrac{1}{\\gamma} - 1 : \\forall \\bar{s} \\in \\overline{\\mathcal{S}}, \\bar{a}_1, \\bar{a}_2 \\in \\overline{\\mathcal{A}}, \\; |Q_r(\\bar{s}, \\bar{a}_1) - Q_r(\\bar{s}, \\bar{a}_2)| \\leq L \\, |Q_c(\\bar{s}, \\bar{a}_1) - Q_c(\\bar{s}, \\bar{a}_2)| \\right\\} \\qquad (13)$$\n\nThus, we expect that Algorithm 1 is likely to converge when $Q^*$ is smooth, but could diverge if the\nslope of $Q^*$ is too high. L2-regularisation can be used to encourage smoothness and mitigate risk of\ndivergence.\n\n3 Budgeted Reinforcement Learning\n\nIn this section, we consider BMDPs with unknown parameters that must be solved by interaction\nwith an environment.\n\n3.1 Budgeted Fitted-Q\n\nWhen the BMDP is unknown, we need to adapt Algorithm 1 to work with a batch of samples\n$\\mathcal{D} = \\{(\\bar{s}_i, \\bar{a}_i, r_i, \\bar{s}'_i)\\}_{i \\in [1, N]}$ collected by interaction with the environment. Applying $\\mathcal{T}$ in (11)\nwould require computing an expectation $\\mathbb{E}_{\\bar{s}' \\sim \\bar{P}}$ over next states $\\bar{s}'$ and hence an access to the model\n$\\bar{P}$. We instead use $\\hat{\\mathcal{T}}$, a sampling operator, in which this expectation is replaced by:\n\n$$\\hat{\\mathcal{T}} Q(\\bar{s}, \\bar{a}, r, \\bar{s}') \\overset{\\text{def}}{=} r + \\gamma \\sum_{\\bar{a}' \\in \\overline{\\mathcal{A}}} \\pi_{\\text{greedy}}(\\bar{a}' \\mid \\bar{s}'; Q) \\, Q(\\bar{s}', \\bar{a}').$$\n\nWe introduce in Algorithm 2 the Budgeted-Fitted-Q (BFTQ) algorithm, an extension of the Fitted-Q\n(FTQ) algorithm (Ernst et al., 2005; Riedmiller, 2005) adapted to solve unknown BMDPs. 
Because we\nwork with continuous state space $\\mathcal{S}$ and budget space $\\mathcal{B}$, we need to employ function-approximation\nin order to generalise to nearby states and budgets. Precisely, given a parametrized model $Q_\\theta$, we\nseek to minimise a regression loss $\\mathcal{L}(Q_\\theta, Q_{\\text{target}}; \\mathcal{D}) = \\sum_{\\mathcal{D}} \\|Q_\\theta(\\bar{s}, \\bar{a}) - Q_{\\text{target}}(\\bar{s}, \\bar{a}, r, \\bar{s}')\\|_2^2$. 
Any\nmodel can be used, such as linear models, regression trees, or neural networks.\n\nAlgorithm 1: Budgeted Value Iteration\nData: P, Rr, Rc\nResult: Q\u2217\n1 Q_0 \u2190 0\n2 repeat\n3   Q_{k+1} \u2190 T Q_k\n4 until convergence\n\nAlgorithm 2: Budgeted Fitted-Q\nData: D\nResult: Q\u2217\n1 Q_{\u03b8_0} \u2190 0\n2 repeat\n3   \u03b8_{k+1} \u2190 arg min_\u03b8 L(Q_\u03b8, \u02c6T Q_{\u03b8_k}; D)\n4 until convergence\n\n3.2 Risk-sensitive exploration\n\nIn order to run Algorithm 2, we must first gather a batch of samples D. The following strategy is\nmotivated by the intuition that a wide variety of risk levels needs to be experienced during training,\nwhich can be achieved by enforcing the risk constraints during data collection. Ideally we would\nneed samples from the asymptotic state-budget distribution $\\lim_{t \\to \\infty} P(\\bar{s}_t)$ induced by an optimal\npolicy $\\pi^*$ given an initial distribution $P(\\bar{s}_0)$, but as we are actually building this policy, it is not\npossible. Following the same idea of \u03b5-greedy exploration for FTQ (Ernst et al., 2005; Riedmiller,\n2005), we introduce an algorithm for risk-sensitive exploration. We follow an exploration policy: a\nmixture between a random budgeted policy \u03c0rand and the current greedy policy \u03c0greedy. The batch D is\nsplit into several mini-batches generated sequentially, and \u03c0greedy is updated by running Algorithm 2\non D upon mini-batch completion. \u03c0rand should only pick augmented actions that are admissible\ncandidates for \u03c0greedy. To that end, \u03c0rand is designed to obtain trajectories that only explore feasible\nbudgets: we impose that the joint distribution $P(a, \\beta_a \\mid s, \\beta)$ verifies $\\mathbb{E}[\\beta_a] \\leq \\beta$. This condition\ndefines a probability simplex \u2206A from which we sample uniformly. 
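As a rough illustration of this sampling step, the sketch below assumes that the augmented action set has been discretised into finitely many pairs (a, \u03b2a). Normalised exponential draws are uniform on the probability simplex, and rejecting the draws whose expected next budget exceeds \u03b2 leaves a uniform sample on the admissible subset. The function names, the budget grid and the rejection scheme are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_uniform_simplex(n, rng):
    # Normalised i.i.d. Exp(1) draws are uniformly distributed on the
    # probability simplex, i.e. they follow Dirichlet(1, ..., 1).
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_random_budgeted_policy(actions, budgets, beta, rng, max_tries=10000):
    # Hypothetical helper: returns the list of augmented actions (a, beta_a)
    # and a random distribution over them such that E[beta_a] <= beta.
    augmented = [(a, b) for a in actions for b in budgets]
    for _ in range(max_tries):
        probs = sample_uniform_simplex(len(augmented), rng)
        expected_budget = sum(p * b for p, (_, b) in zip(probs, augmented))
        if expected_budget <= beta:
            # Accepted draws are uniform on the admissible part of the simplex.
            return augmented, probs
    raise RuntimeError('no admissible mixture found; is min(budgets) <= beta?')


rng = random.Random(0)
augmented, probs = sample_random_budgeted_policy(
    ['slower', 'faster'], [0.0, 0.5, 1.0], beta=0.5, rng=rng)
# An augmented action can then be drawn from the sampled mixture.
action, next_budget = rng.choices(augmented, weights=probs, k=1)[0]
```

Rejection sampling becomes slow when \u03b2 is close to the smallest budget on the grid, since few mixtures are then admissible. A dedicated sampler for the constrained simplex would be preferable in that regime.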
Finally, when interacting with an\nenvironment the initial state s0 is usually sampled from a starting distribution P (s0). In the budgeted\nsetting, we also need to sample the initial budget \u03b20. Importantly, we pick a uniform distribution\nP (\u03b20) = U(B) so that the entire range of risk levels is explored, and not only reward-seeking\nbehaviours as would be the case with a traditional risk-neutral \u03b5-greedy strategy. The pseudo-code of\nour exploration procedure is shown in Algorithm 3.\n\nAlgorithm 3: Risk-sensitive exploration\nData: An environment, a BFTQ solver, W CPU workers\nResult: A batch of transitions D\n1 D \u2190 \u2205\n2 for each intermediate batch do\n3   split episodes between W workers\n4   for each episode in batch do   // run this loop on each worker in parallel\n5     sample initial budget \u03b2 \u223c U(B).\n6     while episode not done do\n7       update \u03b5 from schedule.\n8       sample z \u223c U([0, 1]).\n9       if z < \u03b5 then sample (a, \u03b2a) \u223c U(\u2206_{AB}).   // Explore\n10      else sample (a, \u03b2a) \u223c \u03c0greedy(a, \u03b2a|s, \u03b2; Q\u2217).   // Exploit\n11      append transition (s, \u03b2, a, \u03b2a, R, C, s') to batch D.\n12      step episode budget \u03b2 \u2190 \u03b2a\n13    end\n14  end\n15  \u03c0greedy(\u00b7 ; Q\u2217) \u2190 BFTQ(D).\n16 end\n17 return the batch of transitions D\n\n4 A Scalable Implementation\n\nIn this section, we introduce an implementation of the BFTQ algorithm designed to operate efficiently\nand handle large batches of experiences D.\n\n4.1 How to compute the greedy policy?\n\nAs stated in Remark 1, computing the greedy policy \u03c0greedy in (11) is not trivial since it requires\nsolving the nested constrained optimisation program (12). 
However, it can be solved efficiently by\nexploiting the structure of the set of solutions with respect to \u03b2, that is, concave and increasing.\nProposition 3 (Equality of \u03c0greedy and \u03c0hull). 
Algorithm 1 and Algorithm 2 can be run by replacing\n\u03c0greedy in the equation (11) of T with \u03c0hull as described in Algorithm 4.\n\n$$\\pi_{\\text{greedy}}(\\bar{a} \\mid \\bar{s}; Q) = \\pi_{\\text{hull}}(\\bar{a} \\mid \\bar{s}; Q)$$\n\nAlgorithm 4: Convex hull policy \u03c0hull(a|s; Q)\nData: s = (s, \u03b2), Q\n1 Q+ \u2190 {Qc > min{Qc(s, a) s.t. a \u2208 arg max_a Qr(s, a)}}   // dominated points\n2 F \u2190 top frontier of convex_hull(Q(s,A) \\ Q+)   // candidate mixtures\n3 FQ \u2190 F \u2229 Q(s,A)\n4 for points q = Q(s, a) \u2208 FQ in clockwise order do\n5   if find two successive points ((q1_c, q1_r), (q2_c, q2_r)) of FQ such that q1_c \u2264 \u03b2 < q2_c then\n6     p \u2190 (\u03b2 \u2212 q1_c)/(q2_c \u2212 q1_c)\n7     return the mixture (1 \u2212 p)\u03b4(a \u2212 a1) + p\u03b4(a \u2212 a2)\n8 end\n9 return \u03b4(a \u2212 arg max_a Qr(s, a))   // budget \u03b2 always respected\n\nThe computation of \u03c0hull in Algorithm 4 is illustrated in Figure 1: first\nwe get rid of dominated points. Then we compute the top frontier of the\nconvex hull of the Q-function. Next, we find the two closest augmented actions\na1 and a2 with cost-value Qc surrounding \u03b2: Qc(s, a1) \u2264 \u03b2 < Qc(s, a2).\nFinally, we mix the two actions such that the expected spent budget is equal to \u03b2.\nBecause of the concavity of the convex hull top frontier, any other combination of augmented\nactions would lead to a lower expected reward Qr.\n\nFigure 1: Representation of \u03c0hull. When the budget lies between\nQ(s, a1) and Q(s, a2), two points of the top frontier of the convex\nhull, then the policy is a mixture of these two points.\n\n4.2 Function approximation\n\nNeural networks are well suited to model Q-functions in Reinforcement Learning algorithms (Riedmiller, 2005; Mnih et al., 2015). 
We approximate Q = (Qr, Qc) using one single neural network.\nThus, the two components are jointly optimised which accelerates convergence and fosters learning\nof useful shared representations. Moreover, since we are dealing with a finite\n(categorical) action space A as in (Mnih et al., 2015), instead of including the action in the input we add the output of the\nQ-function for each action to the last layer. Again, it provides a faster convergence toward useful\nshared representations and it only requires one forward pass to evaluate all action values. Finally,\nbeside the state s there is one more input to a budgeted Q-function: the budget \u03b2a. This budget is a\nscalar value whereas the state s is a vector of potentially large size. To avoid a weak influence of \u03b2\ncompared to s in the prediction, we include an additional encoder for the budget, whose width and\ndepth may depend on the application. A straightforward choice is a single layer with the same width\nas the state. The overall architecture is shown in Figure 7 in Appendix B.\n\n4.3 Parallel computing\n\nIn a simulated environment, a first process that can be distributed is the collection of samples in\nthe exploration procedure of Algorithm 3, as \u03c0greedy stays constant within each mini-batch which\navoids the need of synchronisation between workers. Second, the main bottleneck of BFTQ is the\ncomputation of the target T Q. Indeed, when computing \u03c0hull we must perform at each epoch a\nGraham-scan of complexity $O(|\\mathcal{A}||\\tilde{\\mathcal{B}}| \\log |\\mathcal{A}\\tilde{\\mathcal{B}}|)$ per sample in D to compute the convex hulls of Q\n(where $\\tilde{\\mathcal{B}}$ is a finite discretisation of $\\mathcal{B}$). 
The resulting total time-complexity is $O\\left(\\frac{|\\mathcal{D}||\\mathcal{A}||\\tilde{\\mathcal{B}}|}{1 - \\gamma} \\log |\\mathcal{A}||\\tilde{\\mathcal{B}}|\\right)$.\nThis operation can easily be distributed over several CPUs provided that we first evaluate the\nmodel $Q(\\bar{s}', \\mathcal{A}\\tilde{\\mathcal{B}})$ for each sample $\\bar{s}' \\in \\mathcal{D}$, which can be done in a single forward pass. 
By using\nmultiprocessing in the computations of \u03c0hull, we enjoy a linear speedup. The full description of our\nscalable implementation of BFTQ is recalled in Algorithm 5 in Appendix B.\n\n5 Experiments\n\nThere are two hypotheses we want to validate.\n\nExploration strategies We claimed in Section 3.2 that a risk-sensitive exploration was required in\nthe setting of BMDPs. We test this hypothesis by comparing our strategy with a classical risk-neutral\nstrategy. The latter is chosen to be an \u03b5-greedy policy slowly transitioning from a random to a greedy\npolicy\u2074 that aims to maximise $\\mathbb{E}_\\pi G^\\pi_r$ regardless of $\\mathbb{E}_\\pi G^\\pi_c$. The quality of the resulting batches D is\nassessed by training a BFTQ policy and comparing the resulting performance.\n\nBudgeted algorithms We compare our scalable BFTQ algorithm described in Section 4 to an\nFTQ(\u03bb) baseline. This baseline consists in approximating the BMDP by a finite set of CMDP\nproblems. We solve each of these CMDPs using the standard technique of Lagrangian Relaxation: the\ncost constraint is converted to a soft penalty weighted by a Lagrangian multiplier \u03bb in a surrogate\nreward function: $\\max_\\pi \\mathbb{E}_\\pi[G^\\pi_r - \\lambda G^\\pi_c]$. The resulting MDP can be solved by any RL algorithm, and\nwe chose FTQ for being closest to BFTQ. In our experiments, a single training of BFTQ corresponds\nto 10 trainings of FTQ(\u03bb) policies. Each run was repeated Nseeds times. Parameters of the algorithms\ncan be found in Appendix D.3.1\n\n5.1 Environments\n\nWe evaluate our method on three different environments involving reward-cost trade-offs. 
Their\nparameters can be found in Appendix D.3.2\n\nCorridors This simple environment is only meant to highlight clearly the specificity of exploration\nin a budgeted setting. It is a continuous gridworld with Gaussian perturbations, consisting in a maze\ncomposed of two corridors: a risky one with high rewards and costs, and a safe one with low rewards\nand no cost. In both corridors the outermost cell is the one yielding the most reward, which motivates\na deep exploration.\n\nSpoken dialogue system Our second application is a dialogue-based slot-filling simulation that\nhas already benefited from batch RL optimisation in the past (Li et al., 2009; Chandramohan et al.,\n2010; Pietquin et al., 2011). The system fills in a form of slot-values by interacting with a user through\nspeech, before sending them a response. For example, in a restaurant reservation domain, it may\nask for three slots: the area of the restaurant, the price-range and the food type. The user could\nrespectively provide those three slot-values: Cambridge, Cheap and Indian-food. In this\napplication, we do not focus on how to extract such information from the user utterances, we rather\nfocus on decision-making for filling in the form. To that end, the system can choose among a set of\ngeneric actions. As in (Carrara et al., 2018), there are two ways of asking for a slot value: a slot value\ncan either be provided with an utterance, which may cause speech recognition errors with some\nprobability, or by requiring the user to fill in the slots by using a numeric pad. In this case, there are\nno recognition errors but a counterpart risk of hang-up: we assume that manually filling a key-value\nform is time-consuming and annoying. The environment yields a reward if all slots are filled without\nerrors, and a cost if the user hangs up. 
Thus, there is a clear trade-off between using utterances\nand potentially committing a mistake, or using the numeric pad and risking a premature hang-up.\n\n\u2074We train this greedy policy using FTQ.\n\nFigure 2: Density of explored states (left) and corresponding policy performances (right) of two exploration\nstrategies in the corridors environment.\n\nFigure 3: Performance comparison of FTQ(\u03bb) and BFTQ on slot-filling (left) and highway-env (right)\n\nAutonomous driving\nIn our third application, we use the highway-env environment (Leurent,\n2018) for simulated highway driving and behavioural decision-making. We define a task that displays\na clear trade-off between safety and efficiency. The agent controls a vehicle with a finite set of\nmanoeuvres implemented by low-level controllers: A = {no-op, right-lane, left-lane, faster, slower}.\nIt is driving on a two-lane road populated with other traffic participants: the vehicles in front of\nthe agent drive slowly, and there are incoming vehicles on the opposite lane. Their behaviours are\nrandomised, which introduces some uncertainty with respect to their possible future trajectories. The\ntask consists in driving as fast as possible, which is modelled by a reward proportional to the velocity:\n$R_r(s_t, a_t) \\propto v_t$. This motivates the agent to try and overtake its preceding vehicles by driving\nfast on the opposite lane. This optimal but overly aggressive behaviour can be tempered through a\ncost function that embodies a safety objective: $R_c(s_t, a_t)$ is set to $1/H$ whenever the ego-vehicle is\ndriving on the opposite lane, where $H$ is the episode horizon. 
Thus, the constrained signal $G^\\pi_c$ is the\nmaximum proportion of time that the agent is allowed to drive on the wrong side of the road.\n\n5.2 Results\n\nIn the following figures, each patch represents the mean and 95% confidence interval over Nseeds\nseeds of the means of $(G^\\pi_r, G^\\pi_c)$ over Ntrajs trajectories. That way, we display the variation related to\nlearning (and batches) rather than the variation in the execution of the policies.\nWe first bring to light the role of risk-sensitive exploration in the corridors environment: Figure 2\nshows the set of trajectories collected by each exploration strategy, and the resulting performance\nof a budgeted policy trained on each batch. The trajectories (orange) in the risk-neutral batch are\nconcentrated along the risky corridor (right) and ignore the safe corridor (left), which results in\npoor performance in the low-risk regime. Conversely, trajectories in the risk-sensitive batch (blue)\nare well distributed among both corridors and the corresponding budgeted policy achieves good\nperformance across the whole spectrum of risk budgets.\nIn a second experiment displayed in Figure 3, we compare the performance of FTQ(\u03bb) to that of\nBFTQ in the dialogue and autonomous driving tasks. For each algorithm, we plot the reward-cost\ntrade-off curve. In both cases, BFTQ performs almost as well as FTQ(\u03bb) despite only requiring a\nsingle model. All budgets are well-respected on slot-filling, but on highway-env we can observe an\nunderestimation of Qc, since e.g. $\\mathbb{E}[G_c \\mid \\beta = 0] \\simeq 0.1$. This underestimation can be a consequence\nof two approximations: the use of the sampling operator \u02c6T instead of the true population operator\nT , and the use of the neural network function approximation Q\u03b8 instead of Q. Still, BFTQ provides\nbetter control over the expected cost of the policy than FTQ(\u03bb). 
In addition, BFTQ behaves more\nconsistently than FTQ(\u03bb) overall, as shown by its lower extra-seed variance.\n\nAdditional material such as videos of policy executions is provided in Appendix D.\n\n6 Discussion\n\nAlgorithm 2 is an algorithm for solving large unknown BMDPs with continuous states. To the best of\nour knowledge, there is no algorithm in the current literature that combines all those features.\nAlgorithms have been proposed for CMDPs, which are less flexible sub-problems of the more general\nBMDP. When the environment parameters (P , Rr, Rc) are known but not tractable, solutions relying\non function approximation (Undurti et al., 2011) or approximate linear programming (Poupart et al.,\n2015) have been proposed. For unknown environments, online algorithms (Geibel and Wysotzki,\n2005; Abe and others, 2010; Chow et al., 2018; Achiam et al., 2017) and batch algorithms (Thomas\net al., 2015; Petrik et al., 2016; Laroche and Trichelair, 2019; Le et al., 2019) can solve large unknown\nCMDPs. Nevertheless, these approaches are limited in that the constraint thresholds are fixed prior\nto training and cannot be updated in real-time at policy execution to select the desired level of risk.\nTo our knowledge, there were only two ways of solving a BMDP. The first one is to approximate\nit with a finite set of CMDPs (e.g. see our FTQ(\u03bb) baseline). The solutions of these CMDPs take\nthe form of mixtures between two deterministic policies (Theorem 4.4, Beutler and Ross, 1985). To\nobtain these policies, one needs to evaluate their expected cost by interacting with the environment\u2075.\nOur solution not only requires one single model but also avoids any supplementary interaction.\nThe only other existing BMDP algorithm, and closest work to ours, is the Dynamic Programming\nalgorithm proposed by Boutilier and Lu (2016). 
However, their work was established for finite state\nspaces only, and their solution relies heavily on this property. For instance, they enumerate and sort\nthe next states s\u2032 \u2208 S by their expected value-by-cost, which could not be performed in a continuous\nstate space S. Moreover, they rely on the knowledge of the model (P, R_r, R_c), and do not address\nthe question of learning from interaction data.\n\n7 Conclusion\n\nThe BMDP framework is a principled framework for safe decision making under uncertainty, which\ncould be beneficial to the diffusion of Reinforcement Learning in industrial applications. However,\nBMDPs could so far only be solved in finite state spaces, which limits their interest in many use-cases.\nWe extend their definition to continuous states by introducing a novel Dynamic Programming\noperator, which we build upon to propose a Reinforcement Learning algorithm. In order to scale to large\nproblems, we provide an efficient implementation that exploits the structure of the value function and\nleverages tools from Deep Distributed Reinforcement Learning. We show that on two practical tasks\nour solution performs similarly to a baseline Lagrangian relaxation method while only requiring a\nsingle model to train, and relying on an interpretable \u03b2 instead of the tedious tuning of the penalty \u03bb.\n\nAcknowledgements\n\nThis work has been supported by CPER Nord-Pas de Calais/FEDER DATA Advanced data science\nand technologies 2015-2020, the French Ministry of Higher Education and Research, INRIA, and the\nFrench Agence Nationale de la Recherche (ANR). We thank Guillaume Gautier, Fabrice Clerot, and\nXuedong Shang for the helpful discussions and valuable insights.\n\nFootnote 5: More details are provided in Appendix C.\n\nReferences\nNaoki Abe et al. Optimizing debt collections using constrained reinforcement learning. 
In Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2010.\n\nJoshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2017.\n\nEitan Altman. Constrained Markov Decision Processes. CRC Press, 1999.\n\nFrederick J. Beutler and Keith W. Ross. Optimal policies for controlled Markov chains with a constraint. In Journal of Mathematical Analysis and Applications, 1985.\n\nCraig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained Markov decision processes. In Uncertainty in Artificial Intelligence (UAI), 2016.\n\nNicolas Carrara, Romain Laroche, Jean-L\u00e9on Bouraoui, Tanguy Urvoy, and Olivier Pietquin. Safe transfer learning for dialogue applications. In International Conference on Statistical Language and Speech Processing (SLSP), 2018.\n\nSenthilkumar Chandramohan, Matthieu Geist, and Olivier Pietquin. Optimizing spoken dialogue management with fitted value iteration. In Conference of the International Speech Communication Association (InterSpeech), 2010.\n\nYinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach. In Advances in Neural Information Processing Systems (NIPS), 2015.\n\nYinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. In Journal of Machine Learning Research (JMLR), 2018.\n\nChristoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2019.\n\nDamien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. In Journal of Machine Learning Research (JMLR), 2005.\n\nJavier Garc\u00eda and Fernando Fern\u00e1ndez. 
A Comprehensive Survey on Safe Reinforcement Learning. In Journal of Machine Learning Research (JMLR), 2015.\n\nPeter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. In Journal of Artificial Intelligence Research (JAIR), 2005.\n\nGarud N. Iyengar. Robust Dynamic Programming. In Mathematics of Operations Research, 2005.\n\nHatim Khouzaimi, Romain Laroche, and Fabrice Lefevre. Optimising turn-taking strategies with reinforcement learning. In Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015.\n\nRomain Laroche, Paul Trichelair, and R\u00e9mi Tachet des Combes. Safe policy improvement with baseline bootstrapping. In Proceedings of the International Conference on Machine Learning (ICML), 2019.\n\nHoang M. Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In Proceedings of the International Conference on Machine Learning (ICML), 2019.\n\nEdouard Leurent. An environment for autonomous driving decision-making. https://github.com/eleurent/highway-env, 2018.\n\nLihong Li, Jason D. Williams, and Suhrid Balakrishnan. Reinforcement learning for dialog management using least-squares policy iteration and fast feature selection. In Conference of the International Speech Communication Association (InterSpeech), 2009.\n\nChunming Liu, Xin Xu, and Dewen Hu. Multiobjective Reinforcement Learning: A Comprehensive Overview. In IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2014.\n\nDavid G. Luenberger. Investment science. Oxford University Press, Incorporated, 2013.\n\nH. Mausser and D. Rosen. Beyond VaR: from measuring risk to managing risk. In Proceedings of the IEEE Conference on Computational Intelligence for Financial Engineering, 2003.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. 
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.\n\nArnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. In Operations Research, 2005.\n\nMarek Petrik, Mohammad Ghavamzadeh, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\nOlivier Pietquin, Matthieu Geist, Senthilkumar Chandramohan, and Herv\u00e9 Frezza-Buet. Sample-efficient batch reinforcement learning for dialogue management optimization. ACM Transactions on Speech and Language Processing (TSLP), 7(3):7, 2011.\n\nPascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate linear programming for constrained partially observable Markov decision processes. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference (AAAI), 2015.\n\nMartin Riedmiller. Neural fitted Q iteration - First experiences with a data efficient neural Reinforcement Learning method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2005.\n\nDiederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. In Journal of Artificial Intelligence Research (JAIR), 2013.\n\nAviv Tamar, Dotan Di Castro, and Shie Mannor. Policy Gradients with Variance Related Risk Criteria. In Proceedings of the International Conference on Machine Learning (ICML), 2012.\n\nPhilip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. 
In Proceedings of the International Conference on Machine Learning (ICML), 2015.\n\nAditya Undurti, Alborz Geramifard, and Jonathan P. How. Function approximation for continuous constrained MDPs. In Tech Report, 2011.\n\nWolfram Wiesemann, Daniel Kuhn, and Ber\u00e7 Rustem. Robust Markov decision processes. In Mathematics of Operations Research, 2013.\n", "award": [], "sourceid": 4976, "authors": [{"given_name": "Nicolas", "family_name": "Carrara", "institution": "ULille"}, {"given_name": "Edouard", "family_name": "Leurent", "institution": "INRIA"}, {"given_name": "Romain", "family_name": "Laroche", "institution": "Microsoft Research"}, {"given_name": "Tanguy", "family_name": "Urvoy", "institution": "Orange-Labs"}, {"given_name": "Odalric-Ambrym", "family_name": "Maillard", "institution": "INRIA"}, {"given_name": "Olivier", "family_name": "Pietquin", "institution": "Google Research    Brain Team"}]}