{"title": "VIREL: A Variational Inference Framework for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7122, "page_last": 7136, "abstract": "Applying probabilistic models to reinforcement learning (RL) enables the use of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the lack of mode-capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. Applying variational expectation-maximisation to VIREL, we show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We then derive a family of actor-critic methods from VIREL, including a scheme for adaptive exploration. Finally, we demonstrate that actor-critic algorithms from this family outperform state-of-the-art methods based on soft value functions in several domains.", "full_text": "VIREL: A Variational Inference Framework\n\nfor Reinforcement Learning\n\nMatthew Fellows\u2217 Anuj Mahajan\u2217 Tim G. J. 
Rudner\n\nShimon Whiteson\n\nDepartment of Computer Science\n\nUniversity of Oxford\n\nAbstract\n\nApplying probabilistic models to reinforcement learning (RL) enables the use of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the lack of mode-capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. Applying variational expectation-maximisation to VIREL, we show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We derive a family of actor-critic methods from VIREL, including a scheme for adaptive exploration, and demonstrate that our algorithms outperform state-of-the-art methods based on soft value functions in several domains.\n\n1 Introduction\n\nEfforts to combine reinforcement learning (RL) and probabilistic inference have a long history, spanning diverse fields such as control, robotics, and RL [64, 62, 46, 47, 27, 74, 75, 73, 36]. Formalising RL as probabilistic inference enables the application of many approximate inference tools to reinforcement learning, extending models in flexible and powerful ways [35]. 
However, existing methods at the intersection of RL and inference suffer from several deficiencies. Methods that derive from the pseudo-likelihood inference framework [12, 64, 46, 26, 44, 1] and use expectation-maximisation (EM) favour risk-seeking policies [34], which can be suboptimal. Yet another approach, the MERL inference framework [35] (which we refer to as MERLIN), derives from maximum entropy reinforcement learning (MERL) [33, 74, 75, 73]. While MERLIN does not suffer from the issues of the pseudo-likelihood inference framework, it presents different practical difficulties. These methods do not naturally learn deterministic optimal policies, and constraining the variational policies to be deterministic renders inference intractable [47]. As we show by way of counterexample in Section 2.2, an optimal policy under the reinforcement learning objective is not guaranteed from the optimal MERL objective. Moreover, these methods rely on soft value functions, which are sensitive to a pre-defined temperature hyperparameter.\n\nAdditionally, no existing framework formally accounts for replacing exact value functions with function approximators in the objective; learning function approximators is carried out independently of the inference problem and no analysis of convergence is given for the corresponding algorithms.\n\n\u2217Equal contribution. Correspondence to matthew.fellows@cs.ox.ac.uk and anuj.mahajan@cs.ox.ac.uk.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThis paper addresses these deficiencies. We introduce VIREL, an inference framework that translates the problem of finding an optimal policy into an inference problem. Given this framework, we demonstrate that applying EM induces a family of actor-critic algorithms, where the E-step corresponds exactly to policy improvement and the M-step exactly to policy evaluation. 
Using a variational EM algorithm, we derive analytic updates for both the model and variational policy parameters, giving a unified approach to learning parametrised value functions and optimal policies.\n\nWe extensively evaluate two algorithms derived from our framework against DDPG [38] and an existing state-of-the-art actor-critic algorithm, soft actor-critic (SAC) [25], on a variety of OpenAI gym domains [9]. While our algorithms perform similarly to SAC and DDPG on simple low-dimensional tasks, they outperform them substantially on complex, high-dimensional tasks.\n\nThe main contributions of this work are: 1) an exact reduction of entropy regularised RL to probabilistic inference using value function estimators; 2) the introduction of a theoretically justified general framework for developing inference-style algorithms for RL that incorporate the uncertainty in the optimality of the action-value function, ˆQω(h), to drive exploration, but that can also learn optimal deterministic policies; and 3) a family of practical algorithms arising from our framework that adaptively balances exploration-driving entropy with the RL objective and outperforms the current state-of-the-art SAC, reconciling existing advanced actor-critic methods like A3C [43], MPO [1] and EPG [10] into a broader theoretical approach.\n\n2 Background\n\nWe assume familiarity with probabilistic inference [30] and provide a review in Appendix A.\n\n2.1 Reinforcement Learning\n\nFormally, an RL problem is modelled as a Markov decision process (MDP) defined by the tuple ⟨S, A, r, p, p0, γ⟩ [54, 59], where S is the set of states and A ⊆ R^n the set of available actions. An agent in state s ∈ S chooses an action a ∈ A according to the policy a ∼ π(·|s), forming a state-action pair h ∈ H, h := ⟨s, a⟩. 
This pair induces a scalar reward according to the reward function rt := r(ht) ∈ R and the agent transitions to a new state s′ ∼ p(·|h). The initial state distribution for the agent is given by s0 ∼ p0. We denote a sampled state-action pair at timestep t as ht := ⟨st, at⟩. As the agent interacts with the environment using π, it gathers a trajectory τ = (h0, r0, h1, r1, ...). The value function is the expected, discounted reward for a trajectory, starting in state s. The action-value function or Q-function is the expected, discounted reward for each trajectory, starting in h: Qπ(h) := E_{τ∼pπ(τ|h)}[∑_{t=0}^∞ γ^t rt], where pπ(τ|h) := p(s1|h0 = h) ∏_{t′=1}^∞ p(st′+1|ht′)π(at′|st′). Any Q-function satisfies a Bellman equation TπQπ(·) = Qπ(·), where Tπ· := r(h) + γE_{h′∼p(s′|h)π(a′|s′)}[·] is the Bellman operator. We consider infinite-horizon problems with a discount factor γ ∈ [0, 1). The agent seeks an optimal policy π∗ ∈ arg maxπ Jπ, where\n\nJπ = E_{h∼p0(s)π(a|s)}[Qπ(h)].   (1)\n\nWe denote optimal Q-functions as Q∗(·) := Qπ∗(·) and the set of optimal policies Π∗ := arg maxπ Jπ. The optimal Bellman operator is T∗· := r(h) + γE_{h′∼p(s′|h)}[maxa′(·)].\n\n2.2 Maximum Entropy RL\n\nThe MERL objective supplements each reward in the RL objective with an entropy term [61, 74, 75, 73], Jπ_merl := E_{τ∼p(τ)}[∑_{t=0}^{T−1}(rt − c log π(at|st))]. The standard, undiscounted RL objective is recovered for c → 0 and we assume c = 1 without loss of generality. 
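To make the Section 2.1 definitions concrete, the fixed-point property of the Bellman operator can be illustrated with a minimal sketch (ours, not from the paper): repeatedly applying Tπ to a made-up two-state, two-action MDP converges to a Qπ satisfying TπQπ = Qπ. All numbers below (rewards, transitions, the uniform policy) are hypothetical.

```python
# Toy illustration of the Bellman operator T^pi: repeated application
# converges to its fixed point Q^pi. MDP and policy are made up.
GAMMA = 0.9
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}          # deterministic next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.0}}  # reward r(s, a)
pi = {s: {0: 0.5, 1: 0.5} for s in (0, 1)}      # uniform policy

def bellman_backup(Q):
    # One application of T^pi: Q(s,a) <- r(s,a) + gamma E_{s',a'}[Q(s',a')]
    return {s: {a: R[s][a] + GAMMA * sum(pi[P[s][a]][a2] * Q[P[s][a]][a2]
                                         for a2 in (0, 1))
                for a in (0, 1)} for s in (0, 1)}

Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
for _ in range(500):   # T^pi is a gamma-contraction, so iteration converges
    Q = bellman_backup(Q)

Q_next = bellman_backup(Q)   # at the fixed point, T^pi Q^pi = Q^pi
assert all(abs(Q[s][a] - Q_next[s][a]) < 1e-9 for s in (0, 1) for a in (0, 1))
```

The same loop with a max over next actions in the backup would instead iterate the optimal operator T∗.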
The MERL objective is often used to motivate the MERL inference framework (which we call MERLIN) [34], mapping the problem of finding the optimal policy, π∗_merl(a|s) = arg maxπ Jπ_merl, to an equivalent inference problem. A full exposition of this framework is given by Levine [35] and we discuss the graphical model of MERLIN in comparison to VIREL in Section 3.3. The inference problem is often solved using a message passing algorithm, where the log backward messages are called soft value functions due to their similarity to classic (hard) value functions [63, 48, 25, 24, 35]. The soft Q-function is defined as Qπ_soft(h) := E_{τ∼qπ(τ|h)}[r0 + ∑_{t=1}^{T−1}(rt − log π(at|st))], where qπ(τ|h) := p(s0|h) ∏_{t=0}^{T−1} p(st+1|ht)π(at|st). The corresponding soft Bellman operator is Tπ_soft· := r(h) + E_{h′∼p(s′|h)π(a′|s′)}[· − log π(a′|s′)]. Several algorithms have been developed that mirror existing RL algorithms using soft Bellman equations, including maximum entropy policy gradients [35], soft Q-learning [24], and soft actor-critic (SAC) [25]. MERL is also compatible with methods that use recall traces [21].\n\nWe now outline key drawbacks of MERLIN. It is well-understood that optimal policies under regularised Bellman operators are more stochastic than under their equivalent unregularised operators [20]. While this can lead to improved exploration, the optimal policy under these operators will still be stochastic, meaning optimal deterministic policies are not learnt naturally. This leads to two difficulties: 1) a deterministic policy can be constructed by taking the action a∗ = arg maxa π∗_merl(a|s), corresponding to the maximum a posteriori (MAP) policy; however, in continuous domains, finding the MAP policy requires optimising the Q-function approximator for actions, which is often a deep neural network. A common approximation is to use the mean of a variational policy instead; 2) even if we obtain a good approximation, as we show below by way of counterexample, the deterministic MAP policy is not guaranteed to be the optimal policy under Jπ. Constraining the variational policies to the set of Dirac-delta distributions does not solve this problem either, since it renders the inference procedure intractable [47, 48].\n\nFigure 1: A discrete MDP counterexample for the optimal policy under maximum entropy.\n\nNext, we demonstrate that the optimal policy under Jπ cannot always be recovered from the MAP policy under Jπ_merl. Consider the discrete state MDP as shown in Fig. 1, with action set A = {a1, a2, a_1^1, · · ·, a_1^{k1}} and state set S = {s0, s1, s2, s3, s4, s_1^1, · · ·, s_1^{k1}, s5, · · ·, s_{5+k2}}. All state transitions are deterministic, with p(s1|s0, a1) = p(s2|s0, a2) = p(s_1^i|s1, a_1^i) = 1. All other state transitions are deterministic and independent of the action taken, that is, p(sj|·, sj−1) = 1 ∀ j > 2 and p(s5|·, s_1^i) = 1. The reward function is r(s0, a2) = 1 and zero otherwise. Clearly the optimal policy under Jπ has π∗(a2|s0) = 1. Define a maximum entropy reinforcement learning policy as πmerl with πmerl(a1|s0) = p1, πmerl(a2|s0) = (1 − p1) and πmerl(a_1^i|s1) = p_1^i. For πmerl and k2 ≫ 5, we can evaluate Jπ_merl for any scaling constant c and discount factor γ as:\n\nJπ_merl = (1 − p1)(1 − c log(1 − p1)) − p1 c log p1 − γ c p1 (∑_{i=1}^{k1} p_1^i log p_1^i).   (2)\n\nWe now find the optimal MERL policy. Note that p_1^i = 1/k1 maximises the final term in Eq. (2). Substituting for p_1^i = 1/k1, then taking derivatives of Eq. (2) with respect to p1, and setting to zero, we find p1∗ = π∗_merl(a1|s0) as:\n\n1 − c log(1 − p1∗) = γc log(k1) − c log p1∗, =⇒ p1∗ = 1 / (k1^{−γ} exp(1/c) + 1),\n\nhence, for any k1^{−γ} exp(1/c) < 1, we have p1∗ > 1/2 and so π∗ cannot be recovered from π∗_merl, even using the mode action a1 = arg maxa π∗_merl(a|s0). The degree to which the MAP policy varies from the optimal unregularised policy depends on both the value of c and k1, the latter controlling the number of states with sub-optimal reward. Our counterexample illustrates that when there are large regions of the state-space with sub-optimal reward, the temperature must be comparatively small to compensate, hence algorithms derived from MERLIN become very sensitive to temperature. As we discuss in Section 3.3, this problem stems from the fact that MERL policies optimise for expected reward and long-term expected entropy. While initially beneficial for exploration, this can lead to sub-optimal policies being learnt in complex domains, as there is often too little a priori knowledge about the MDP to make it possible to choose an appropriate value or schedule for c.\n\nFinally, a minor issue with MERLIN is that many existing models are defined for finite-horizon problems [35, 48]. 
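The counterexample can also be checked numerically. The sketch below (ours, not from the paper) maximises Eq. (2) over p1 under the uniform choice p_1^i = 1/k1, reading the final entropy term as weighted by the visitation probability p1, and compares the numerical maximiser with the closed form p1∗ = 1/(k1^{−γ} exp(1/c) + 1); the values of c, γ and k1 are hypothetical and chosen so that k1^{−γ} exp(1/c) < 1.

```python
import math

# Numerical check of the Section 2.2 counterexample: with uniform
# p_1^i = 1/k1, Eq. (2) is a function of p1 alone, and its maximiser
# should match the closed form. Constants are hypothetical.
c, gamma, k1 = 1.0, 0.99, 10

def j_merl(p1):
    # (1 - p1)(1 - c log(1 - p1)) - c p1 log p1 + gamma c p1 log k1
    return ((1 - p1) * (1 - c * math.log(1 - p1))
            - c * p1 * math.log(p1)
            + gamma * c * p1 * math.log(k1))

grid = [i / 100000 for i in range(1, 100000)]
p1_numeric = max(grid, key=j_merl)
p1_closed = 1.0 / (k1 ** -gamma * math.exp(1.0 / c) + 1.0)

assert abs(p1_numeric - p1_closed) < 1e-3
assert p1_closed > 0.5   # the MAP action at s0 is a1, not the rewarding a2
```

With these constants the optimiser places about 0.78 probability on a1, so the MAP action disagrees with the unregularised optimal action a2.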
While it is possible to discount and extend MERLIN to infinite-horizon problems, doing so is often nontrivial and can alter the objective [60, 25].\n\n2.3 Pseudo-Likelihood Methods\n\nA related but distinct approach is to apply Jensen's inequality directly to the RL objective Jπ. Firstly, we rewrite Eq. (1) as an expectation over τ to obtain J = E_{h∼p0(s)π(a|s)}[Qπ(h)] = E_{τ∼p(τ)}[R(τ)], where R(τ) = ∑_{t=0}^{T−1} γ^t rt and p(τ) = p0(s0)π(a0|s0) ∏_{t=0}^{T−1} p(ht+1|ht). We then treat p(R, τ) = R(τ)p(τ) as a joint distribution, and if rewards are positive and bounded, Jensen's inequality can be applied, enabling the derivation of an evidence lower bound (ELBO). Inference algorithms such as EM can then be employed to find a policy that optimises the pseudo-likelihood objective [12, 64, 46, 26, 44, 1]. Pseudo-likelihood methods can also be extended to a model-based setting by defining a prior over the environment's transition dynamics. Furmston & Barber [19] demonstrate that the posterior over all possible environment models can be integrated over to obtain an optimal policy in a Bayesian setting.\n\nMany pseudo-likelihood methods minimise KL(pO ∥ pπ), where pπ is the policy to be learnt and pO is a target distribution monotonically related to reward [35]. Classical RL methods minimise KL(pπ ∥ pO). The latter encourages learning a mode of the target distribution, while the former encourages matching the moments of the target distribution. If the optimal policy can be represented accurately in the class of policy distributions, optimisation converges to a global optimum, and the problem is fully observable, then the optimal policy is the same in both cases. 
Otherwise, the pseudo-likelihood objective reduces the influence of large negative rewards, encouraging risk-seeking policies.\n\n3 VIREL\n\nBefore describing our framework, we state some relevant assumptions.\n\nDefinition 1 (Unique Maximum and Locally Smooth Function). Let f : X → Y be a function with a unique maximum f(x∗) = supx f, where the domain X is a compact set and the range Y is bounded. Let f be locally C2 smooth about x∗, i.e., ∃ ∆ > 0 s.t. f(x) ∈ C2 ∀ x ∈ {x | ∥x − x∗∥ < ∆}.\n\nAssumption 1. The optimal action-value function for the reinforcement learning problem is finite and strictly positive, i.e., 0 < Q∗(h) < ∞ ∀ h ∈ H.\n\nAny MDP for which rewards are lower bounded and finite, that is, R ⊂ [rmin, ∞), satisfies Assumption 1. To see this, we can construct a new MDP by subtracting rmin from the reward function, ensuring that all rewards are positive and hence the optimal action-value function for the reinforcement learning problem is finite and strictly positive. This does not affect the optimal solution. Now we introduce a function approximator ˆQω(h) ≈ Qπ(h) parametrised by ω ∈ Ω.\n\nAssumption 2 (Exact Representability Under Optimisation). Our function approximator can represent the optimal Q-function, i.e., ∃ ω∗ ∈ Ω s.t. Q∗(·) = ˆQω∗(·).\n\nIn Appendix F.1, we extend the work of Bhatnagar et al. [6] to continuous domains, demonstrating that Assumption 2 can be neglected if projected Bellman operators are used.\n\nAssumption 3 (Local Smoothness of Q-functions). 
For ω∗ parametrising Q∗(h) in Assumption 2, Qω∗(h) has a unique maximum and is locally smooth under Definition 1 for actions in any state.\n\nThis assumption is formally required for the strict convergence of a Boltzmann to a Dirac-delta distribution and, as we discuss in Appendix F.4, is of more mathematical than practical concern.\n\n3.1 Objective Specification\n\nWe now define an objective that we motivate by satisfying three desiderata: 1) in the limit of maximising our objective, a deterministic optimal policy can be recovered and the optimal Bellman equation is satisfied by our function approximator; 2) when our objective is not maximised, stochastic policies can be recovered that encourage effective exploration of the state-action space; and 3) our objective permits the application of powerful and tractable optimisation algorithms from variational inference that optimise the risk-neutral form of KL divergence, KL(pπ ∥ pO), introduced in Section 2.3.\n\nFirstly, we define the residual error εω := c∥TωˆQω(h) − ˆQω(h)∥_p^p, where Tω = Tπω· := r(h) + γE_{h′∼p(s′|h)πω(a′|s′)}[·] is the Bellman operator for the Boltzmann policy with temperature εω:\n\nπω(a|s) := exp(ˆQω(h)/εω) / ∫_A exp(ˆQω(h)/εω) da.   (3)\n\nWe assume p = 2 and c = 1/|H| without loss of generality. Our main result in Theorem 2 proves that finding a ω∗ that reduces the residual error to zero, i.e., εω∗ = 0, is a sufficient condition for learning an optimal Q-function ˆQω∗(h) = Q∗(h). Additionally, the Boltzmann distribution πω(a|s) tends towards a Dirac-delta distribution πω(a|s) = δ(a = arg maxa′ ˆQω∗(a′, s)) whenever εω → 0 (see Theorem 1), which is an optimal policy. The simple objective arg min(L(ω)) := arg min(εω) therefore satisfies 1. Moreover, when our objective is not minimised, we have εω > 0 and from Eq. (3) we see that πω(a|s) is non-deterministic for all non-optimal ω. L(ω) therefore satisfies 2, as any agent following πω(a|s) will continue exploring until the RL problem is solved. To generalise our framework, we extend Tω· to any operator from the set of target operators Tω· ∈ T:\n\nDefinition 2 (Target Operator Set). Define T to be the set of target operators such that an optimal Bellman operator for ˆQω(h) is recovered when the Boltzmann policy in Eq. (3) is greedy with respect to ˆQω(h), i.e., T := {Tω· | limεω→0 πω(a|s) =⇒ TωˆQω(h) = T∗ˆQω(h)}.\n\nAs an illustration, we prove in Appendix C that the Bellman operator Tπω· introduced above is a member of T and can be approximated by several well-known RL targets. We also discuss how Tπω· induces a constraint on Ω due to its recursive definition. As we show in Section 3.2, there exists an ω in the constrained domain that maximises the RL objective under these conditions, so an optimal solution is always feasible. Moreover, we provide an analysis in Appendix F.5 to establish that such a policy is an attractive fixed point for our algorithmic updates, even when we ignore this constraint. 
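A quick numeric sketch (ours, not from the paper) over a small discrete action set illustrates the behaviour of the Boltzmann policy in Eq. (3) that Theorem 1 later formalises: as the temperature εω shrinks, the policy concentrates on arg maxa ˆQω(h), while a larger temperature keeps it stochastic. The Q-values are made up.

```python
import math

# Boltzmann policy over a discrete action set, pi(a) ~ exp(Q(a) / eps).
q_values = [1.0, 2.0, 3.5, 3.4]   # hypothetical Q_hat(s, a) for four actions

def boltzmann(qs, eps):
    m = max(qs)                               # subtract max for stability
    w = [math.exp((q - m) / eps) for q in qs]
    z = sum(w)
    return [x / z for x in w]

for eps in (1.0, 0.1, 0.01):
    pi = boltzmann(q_values, eps)
    assert abs(sum(pi) - 1.0) < 1e-12         # a valid distribution

# mass concentrates on argmax_a Q_hat as eps -> 0 ...
assert boltzmann(q_values, 0.01)[2] > 0.99
# ... while a higher temperature leaves the policy exploratory
assert max(boltzmann(q_values, 1.0)) < 0.99
```

The continuous-action case in Eq. (3) replaces the sum over actions with an integral, which is exactly the normalisation constant the text notes is generally intractable.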
Off-policy operators will not constrain Ω: by definition, the optimal Bellman operator T∗· is a member of T and does not constrain Ω; similarly, we derive an off-policy operator based on a Boltzmann distribution with a diminishing temperature in Appendix F.2 that is a member of T. Observe that soft Bellman operators are not members of T, as the optimal policy under Jπ_merl is not deterministic; hence algorithms such as SAC cannot be derived from the VIREL framework.\n\nOne problem remains: calculating the normalisation constant to sample directly from the Boltzmann distribution in Eq. (3) is intractable for many MDPs and function approximators. As such, we look to variational inference to learn an approximate variational policy πθ(a|s) ≈ πω(a|s), parametrised by θ ∈ Θ with finite variance and the same support as πω(a|s). This suggests optimising a new objective that penalises πθ(a|s) when πθ(a|s) ≠ πω(a|s) but still has a global maximum at εω = 0. A tractable objective that meets these requirements is the evidence lower bound (ELBO) on the unnormalised potential of the Boltzmann distribution, defined as {ω∗, θ∗} ∈ arg maxω,θ L(ω, θ),\n\nL(ω, θ) := E_{s∼d(s)}[E_{a∼πθ(a|s)}[ˆQω(h)/εω] + H(πθ(a|s))],   (4)\n\nwhere qθ(h) := d(s)πθ(a|s) is a variational distribution, H(·) denotes the differential entropy of a distribution, and d(s) is any arbitrary sampling distribution with support over S. From Eq. (4), maximising our objective with respect to ω is achieved when εω → 0 and hence L(ω, θ) satisfies 1 and 2. As we show in Lemma 1, H(·) in Eq. (4) causes L(ω, θ) → −∞ whenever πθ(a|s) is a Dirac-delta distribution for all εω > 0. This means our objective heavily penalises premature convergence of our variational policy to greedy Dirac-delta policies except under optimality. We discuss a probabilistic interpretation of our framework in Appendix B, where it can be shown that πω(a|s) characterises our model's uncertainty in the optimality of ˆQω(h).\n\nWe now motivate L(ω, θ) from an inference perspective: in Appendix D.1, we write L(ω, θ) in terms of the log-normalisation constant of the Boltzmann distribution and the KL divergence between the action-state normalised Boltzmann distribution, pω(h), and the variational distribution, qθ(h):\n\nL(ω, θ) = ℓ(ω) − KL(qθ(h) ∥ pω(h)) − H(d(s)),   (5)\n\nwhere ℓ(ω) := log ∫_H exp(ˆQω(h)/εω) dh and pω(h) := exp(ˆQω(h)/εω) / ∫_H exp(ˆQω(h)/εω) dh.\n\nAs the KL divergence in Eq. (5) is always positive and the final entropy term has no dependence on ω or θ, maximising our objective for θ always reduces the KL divergence between πω(a|s) and πθ(a|s) for any εω > 0, with πθ(a|s) = πω(a|s) achieved under exact representability (see Theorem 3). This yields a tractable way to estimate πω(a|s) at any point during our optimisation procedure by maximising L(ω, θ) for θ. From Eq. (5), we see that our objective satisfies 3, as we minimise the mode-seeking direction of KL divergence, KL(qθ(h) ∥ pω(h)), and our objective is an ELBO, which is the starting point for inference algorithms [30, 4, 17]. 
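The decomposition in Eq. (5) can be sanity-checked in a discrete analogue (ours, not from the paper), where the integrals over H become sums and the differential entropies become discrete ones; all distributions and Q-values below are made up.

```python
import math

# Discrete analogue of Eq. (5): L(w, th) = l(w) - KL(q_th || p_w) - H(d),
# with h = (s, a) over a small hypothetical state-action grid.
eps = 0.5
Q = {(s, a): 0.3 * s + 0.7 * a for s in range(2) for a in range(3)}  # fake Q_hat
d = {0: 0.4, 1: 0.6}                            # sampling distribution d(s)
pi = {0: [0.2, 0.3, 0.5], 1: [0.6, 0.3, 0.1]}   # variational policy pi_th

q = {(s, a): d[s] * pi[s][a] for s in range(2) for a in range(3)}  # q_th(h)

# Left-hand side: the ELBO of Eq. (4).
lhs = (sum(q[h] * Q[h] / eps for h in Q)
       + sum(d[s] * -sum(p * math.log(p) for p in pi[s]) for s in d))

# Right-hand side: log-normaliser minus KL minus H(d).
log_z = math.log(sum(math.exp(Q[h] / eps) for h in Q))       # l(w)
p_w = {h: math.exp(Q[h] / eps - log_z) for h in Q}           # p_w(h)
kl = sum(q[h] * math.log(q[h] / p_w[h]) for h in Q)
h_d = -sum(d[s] * math.log(d[s]) for s in d)
rhs = log_z - kl - h_d

assert abs(lhs - rhs) < 1e-10   # the identity holds exactly
```

Since the KL term is the only part of the right-hand side that depends on θ, increasing the left-hand side in θ necessarily shrinks KL(qθ ∥ pω), which is the mechanism the text describes.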
When the RL problem is solved and εω = 0, our objective tends towards infinity for any variational distribution that is non-deterministic (see Lemma 1). This is of little consequence, however, as whenever εω = 0, our approximator is the optimal value function, ˆQω∗(h) = Q∗(h) (Theorem 2), and hence, π∗(a|s) can be inferred exactly by finding maxa′ ˆQω∗(a′, s) or by using the policy gradient ∇θE_{d(s)πθ(a|s)}[ˆQω∗(h)] (see Section 4.2).\n\n3.2 Theoretical Results\n\nWe now formalise the intuition behind 1 - 3. Theorem 1 establishes the emergence of a Dirac-delta distribution in the limit of εω → 0. To the authors' knowledge, this is the first rigorous proof of this result. Theorem 2 shows that finding an optimal policy that maximises the RL objective in Eq. (1) reduces to finding the Boltzmann distribution associated with the parameters ω∗ ∈ arg maxω L(ω, θ). The existence of such a distribution is a sufficient condition for the policy to be optimal. Theorem 3 shows that whenever εω > 0, maximising our objective for θ always reduces the KL divergence between πω(a|s) and πθ(a|s), providing a tractable method to infer the current Boltzmann policy.\n\nTheorem 1 (Convergence of Boltzmann Distribution to Dirac Delta). Let pε : X → [0, 1] be a Boltzmann distribution with temperature ε ∈ R≥0, pε(x) = exp(f(x)/ε) / ∫_X exp(f(x)/ε) dx, where f : X → Y is a function that satisfies Definition 1. In the limit ε → 0, pε(x) → δ(x = arg supx′ f(x′)).\n\nProof. See Appendix D.2.\n\nLemma 1 (Lower and Upper Limits of L(ω, θ)). i) For any εω > 0 and πθ(a|s) = δ(a∗), we have L(ω, θ) = −∞. ii) For ˆQω(h) > 0 and any non-deterministic πθ(a|s), limεω→0 L(ω, θ) = ∞.\n\nProof. See Appendix D.3.\n\nTheorem 2 (Optimal Boltzmann Distributions as Optimal Policies). For ω∗ that maximises L(ω, θ) defined in Eq. (4), the corresponding Boltzmann policy induced must be optimal, i.e., {ω∗, θ∗} ∈ arg maxω,θ L(ω, θ) =⇒ πω∗(a|s) ∈ Π∗.\n\nProof. See Appendix D.3.\n\nTheorem 3 (Maximising the ELBO for θ). For any εω > 0, maxθ L(ω, θ) = E_{d(s)}[minθ KL(πθ(a|s) ∥ πω(a|s))], with πω(a|s) = πθ(a|s) under exact representability.\n\nProof. See Appendix D.4.\n\n3.3 Comparing VIREL and MERLIN Frameworks\n\nTo compare MERLIN and VIREL, we consider the probabilistic interpretation of the two models discussed in Appendix B; introducing a binary variable O ∈ {0, 1} defines a graphical model for our inference problem whenever εω > 0. Comparing the graphs in Fig. 2, observe that MERLIN models exponential cumulative rewards over entire trajectories. By contrast, VIREL's variational policy models a single step and a function approximator is used to model future expected rewards. The resulting KL divergence minimisation for MERLIN is therefore much more sensitive to the value of temperature, as this affects how much future entropy influences the variational policy. For VIREL, temperature is defined by the model, and updates to the variational policy will not be as sensitive to errors in its value or linear scaling, as its influence only extends to a single interaction. 
We hypothesise that VIREL may afford advantages in higher dimensional domains where there is greater chance of encountering large regions of state-action space with sub-optimal reward; as in our counterexample from Section 2, c must be comparatively small to balance the influence of entropy in these regions to prevent MERLIN algorithms from learning sub-optimal policies.\n\nFigure 2: Graphical models for MERLIN and VIREL (variational approximations are dashed).\n\nTheorem 1 demonstrates that, unlike in MERLIN, VIREL naturally learns optimal deterministic policies directly from the optimisation procedure while still maintaining the benefits of stochastic policies in training. While Boltzmann policies with fixed temperatures have been proposed before [49], as we discuss in Appendix B, the adaptive temperature εω in VIREL's Boltzmann policy has a unique interpretation, characterising the model's uncertainty in the optimality of ˆQω(h); both πω(a|s) and its variational approximation πθ(a|s) have an adaptive variance that reduces as ˆQω(h) → Q∗(h), allowing us to benefit from uncertainty-driven exploration when sampling under πθ(a|s).\n\n4 Actor-Critic and EM\n\nWe now apply the expectation-maximisation (EM) algorithm [13, 23] to optimise our objective L(ω, θ). (See Appendix A for an exposition of this algorithm.) In keeping with RL nomenclature, we refer to ˆQω(h) as the critic and πθ(a|s) as the actor. 
We establish that the expectation (E-)step is equivalent to carrying out policy improvement and the maximisation (M-)step to policy evaluation. This formulation reverses the situation in most pseudo-likelihood methods, where the E-step is related to policy evaluation and the M-step is related to policy improvement, and is a direct result of optimising the forward KL divergence, KL(qθ(h) ∥ pω(h|O)), as opposed to the reverse KL divergence used in pseudo-likelihood methods. As discussed in Section 2.3, this mode-seeking objective prevents the algorithm from learning risk-seeking policies. We now introduce an extension to Assumption 2 that is sufficient to guarantee convergence.\n\nAssumption 4 (Universal Variational Representability). Every Boltzmann policy can be represented as πθ(a|s), i.e., ∀ ω ∈ Ω ∃ θ ∈ Θ s.t. πθ(a|s) = πω(a|s).\n\nAssumption 4 is strong but, like in variational inference, our variational policy πθ(a|s) provides a useful approximation when Assumption 4 does not hold. As we discuss in Appendix F.1, using projected Bellman errors also ensures that our M-step always converges no matter what our current policy is.\n\n4.1 Variational Actor-Critic\n\nIn the E-step, we keep the parameters of our critic ωk constant while updating the actor's parameters by maximising the ELBO with respect to θ: θk+1 ← arg maxθ L(ωk, θ). Using gradient ascent with
Using gradient ascent with step size αactor, we optimise εωk L(ωk, θ) instead, which prevents ill-conditioning and does not alter the optimal solution, yielding the update (see Appendix E.1 for full derivation):

E-Step (Actor):
θi+1 ← θi + αactor (εωk ∇θL(ωk, θ))|θ=θi ,
εωk ∇θL(ωk, θ) = Es∼d(s)[ Ea∼πθ(a|s)[ Q̂ωk(h) ∇θ log πθ(a|s) ] + εωk ∇θH(πθ(a|s)) ].   (6)

In the M-step, we maximise the ELBO with respect to ω while holding the parameters θk+1 constant. Hence expectations are taken with respect to the variational policy found in the E-step: ωk+1 ← arg maxω L(ω, θk+1). We use gradient ascent with step size αcritic(εωi)² to optimise L(ω, θk+1) to prevent ill-conditioning, yielding (see Appendix E.2 for full derivation):

M-Step (Critic):
ωi+1 ← ωi + αcritic(εωi)² ∇ωL(ω, θk+1)|ω=ωi ,
(εωi)² ∇ωL(ω, θk+1) = εωi Ed(s)πθk+1(a|s)[ ∇ωQ̂ω(h) ] − Ed(s)πθk+1(a|s)[ Q̂ωi(h) ] ∇ωεω .   (7)

4.2 Discussion

From an RL perspective, the E-step corresponds to training an actor using a policy gradient method [56] with an adaptive entropy regularisation term [69, 43]. The M-step update corresponds to a policy evaluation step, as we seek to reduce the MSBE in the second term of Eq. (7). We derive ∇ωεω exactly in Appendix E.3. Note that this term depends on (TωQ̂ω(h) − Q̂ω(h)) ∇ωTωQ̂ω(h), which typically requires evaluating two independent expectations.
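To illustrate the alternating updates of Eqs. (6) and (7), here is a toy sketch on a one-state problem with a tabular critic and a softmax actor, with expectations computed exactly. This is only an idealised caricature of the full algorithm: the adaptive temperature is held at a fixed constant and the ∇ωεω term is dropped, so the M-step reduces to a plain expected-SGD step on the squared Bellman error:

```python
import numpy as np

# Toy setup: single state, three actions, gamma = 0, so Q*(a) = r(a).
# The adaptive temperature eps_omega is held at a fixed constant here;
# in the full algorithm it is estimated from the Bellman error.
r = np.array([0.0, 1.0, 0.5])   # rewards (gamma = 0, so Q* = r)
theta = np.zeros(3)             # actor: softmax logits
omega = np.zeros(3)             # critic: tabular Q-estimates
eps, a_actor, a_critic = 0.1, 0.5, 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    pi = softmax(theta)
    # M-step (critic): expected-SGD step on the squared Bellman error
    # under the current policy (grad-of-eps term dropped).
    omega += a_critic * pi * (r - omega)
    # E-step (actor): exact entropy-regularised policy gradient, Eq. (6).
    H = -np.sum(pi * np.log(pi))
    pg = pi * (omega - pi @ omega)     # E[Q * grad log pi] for softmax
    ent_grad = -pi * (np.log(pi) + H)  # gradient of the policy entropy
    theta += a_actor * (pg + eps * ent_grad)

pi = softmax(theta)
```

The critic converges to Q∗ = r while the actor concentrates on the best action, approaching the Boltzmann policy softmax(Q∗/ε) as the fixed point of the alternation.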
For convergence guarantees, techniques such as residual gradients [2] or GTD2/TDC [6] need to be employed to obtain an unbiased estimate of this term. If guaranteed convergence is not a priority, dropping gradient terms allows us to use direct methods [55], which are often simpler to implement. We discuss these methods further in Appendix F.3 and provide an analysis in Appendix F.5 demonstrating that the corresponding updates act as a variational approximation to Q-learning [68, 42]. A key component of our algorithm is its behaviour when εω∗ = 0; under this condition, there is no M-step update (both εωk = 0 and ∇ωεω = 0) and Qω∗(h) = Q∗(h) (see Theorem 2), so our E-step reduces exactly to a policy gradient step, θk+1 ← θk + αactor Eh∼d(s)πθ(a|s)[Q∗(h) ∇θ log πθ(a|s)], recovering the optimal policy in the limit of convergence, that is, πθ(a|s) → π∗(a|s).

From an inference perspective, the E-step improves the parameters of our variational distribution to reduce the gap between the current Boltzmann posterior and the variational policy, KL(πθ(a|s) ‖ πωk(a|s)) (see Theorem 3). This interpretation makes precise the intuition that how much we can improve our policy is determined by how similar Q̂ωk(h) is to Q∗(h), limiting policy improvement to the complete E-step: πθk+1(a|s) = πωk(a|s). We see that the common greedy policy improvement step, πθk+1(a|s) = δ(a ∈ arg maxa′ Q̂ωk(a′, s)), acts as an approximation to the Boltzmann form in Eq.
(3), replacing the softmax with a hard maximum.

If Assumption 4 holds and any constraint induced by Tω does not prevent convergence to a complete E-step, the EM algorithm alternates between two convex optimisation schemes and is guaranteed to converge to at least a local optimum of L(ω, θ) [71]. In practice, we cannot carry out complete E- and M-steps for complex domains, and our variational distributions are unlikely to satisfy Assumption 4. Under these conditions, we can resort to the empirically successful variational EM algorithm [30], carrying out partial E- and M-steps instead, which we discuss further in Appendix F.3.

4.3 Advanced Actor-Critic Methods

A family of actor-critic algorithms follows naturally from our framework: 1) we can use powerful inference techniques such as control variates [22] or variance-reducing baselines, subtracting any function that does not depend on the action [50], e.g., V(s), from the action-value function, as this does not change our objective; 2) we can manipulate Eq. (6) to obtain variance-reducing gradient estimators such as EPG [11], FPG [15], and SVG(0) [28]; and 3) we can take advantage of d(s) being any general decorrelated distribution by using replay buffers [42] or empirically successful asynchronous methods that combine several agents' individual gradient updates at once [43]. As we discuss in Appendix E.4, the manipulation required to derive the estimators in 2) is not strictly justified in the classic policy gradient theorem [56] and the MERL formulation [25].

MPO is a state-of-the-art EM algorithm derived from the pseudo-likelihood objective [1]. In its derivation, policy evaluation does not arise naturally from either of its EM steps and must be carried out separately. In addition, its E-step is approximated, giving rise to the one-step KL-regularised update.
As we demonstrate in Appendix G, under the probabilistic interpretation of our model, including a prior of the form pφ(h) = U(s)πφ(a|s) in our ELBO and specifying a hyper-prior p(ω), the MPO objective with an adaptive regularisation constant can be recovered from VIREL:

LMPO(ω, θ, φ) = Es∼d(s)[ Ea∼πθ(a|s)[ Q̂ω(h) / εω ] − KL(πθ(a|s) ‖ πφ(a|s)) ] + log p(ω).

We also show in Appendix G that applying the (variational) EM algorithm from Section 4 yields the MPO updates with the missing policy evaluation step and without approximation in the E-step.

5 Experiments

We evaluate our EM algorithm using the direct method approximation outlined in Appendix F.3 with Tω, ignoring constraints on Ω. The aim of our evaluation is threefold: firstly, as explained in Section 3.1, algorithms using soft value functions cannot be recovered from VIREL, so we demonstrate that using hard value functions does not affect performance. Secondly, we provide evidence for our hypothesis, stated in Section 3.3, that using soft value functions may harm performance in higher-dimensional tasks. Thirdly, we show that, even under all the practical approximations discussed, the algorithm derived in Section 4 still outperforms advanced actor-critic methods.

We compare our methods to the state-of-the-art SAC² and DDPG [38] algorithms on MuJoCo tasks in OpenAI Gym [9] and in rllab [14]. We use SAC as a baseline because Haarnoja et al. [25] show that it outperforms PPO [52], Soft Q-Learning [24], and TD3 [18]. We compare to DDPG [38] because, like our methods, it can learn deterministic optimal policies.
We consider two variants: in the first, called virel, we keep the scale of the entropy term in the gradient update for the variational policy fixed at a constant α; in the second, called beta, we use an estimate ε̂ω of εω to scale the corresponding term in Eq. (25). We compute ε̂ω using a buffer to draw a fixed number of samples Nε for the estimate.

To adjust for the relative magnitude of the first term in Eq. (25) against that of εω scaling the second term, we also multiply the estimate ε̂ω by a scalar λ ≈ (1 − γ)/ravg, where ravg is the average reward observed; λ⁻¹ roughly captures the order of magnitude of the first term and allows ε̂ω to balance policy changes between exploration and exploitation. We found performance is poor and unstable without λ. To reduce variance, all algorithms use a value function network V(φ) as a baseline and a Gaussian policy, which enables the use of the reparametrisation trick. Pseudocode can be found in Appendix H. All experiments use 5 random initialisations and parameter values are given in Appendix I.1.

²We use implementations provided by the authors: https://github.com/haarnoja/sac for v1 and https://github.com/vitchyr/rlkit for v2.

Figure 3: Training curves on continuous control benchmarks gym-Mujoco-v2: high-dimensional domains.

Fig. 3 gives the training curves for the various algorithms on the high-dimensional gym-Mujoco-v2 tasks. In particular, in Humanoid-v2 (action space dimensionality: 17, state space dimensionality: 376) and Ant-v2 (action space dimensionality: 8, state space dimensionality: 111), DDPG fails to learn any reasonable policy. We believe this is because the Ornstein-Uhlenbeck noise that DDPG uses for exploration is insufficiently adaptive in high dimensions.
While SAC performs better, virel and beta still significantly outperform it. As hypothesised in Section 3.3, we believe this performance advantage arises because the gap between optimal unregularised policies and optimal variational policies learnt under MERLIN is sensitive to the temperature c. This effect is exacerbated in high dimensions, where there may be large regions of the state-action space with sub-optimal reward. All algorithms learn optimal policies in simple domains, the training curves for which can be found in Fig. 8 in Appendix I.3. Thus, as the state-action dimensionality increases, algorithms derived from VIREL outperform SAC and DDPG.

Fujimoto et al. [18] and van Hasselt et al. [67] note that taking the minimum of two randomly initialised action-value functions helps mitigate the positive bias introduced by function approximation in policy gradient methods. A variant of SAC therefore uses two soft critics. We compare this variant of SAC to two variants of virel: virel1, which uses two hard Q-functions, and virel2, which uses one hard and one soft Q-function. We scale the rewards so that the means of the Q-function estimates in virel2 are approximately aligned. Fig. 4 shows the training curves on three gym-Mujoco-v1 domains, with additional plots shown in Fig. 7 in Appendix I.2. Again, the results demonstrate that virel1 and virel2 perform on par with SAC in simple domains like Half-Cheetah and outperform it in challenging high-dimensional domains like humanoid-gym and -rllab (17- and 21-dimensional action spaces, 376-dimensional state space).

Figure 4: Training curves on continuous control benchmarks gym-Mujoco-v1.

6 Conclusion and Future Work

This paper presented VIREL, a novel framework that recasts the reinforcement learning problem as an inference problem using function approximators.
We provided strong theoretical justifications for this framework and compared two simple actor-critic algorithms that arise naturally from applying variational EM to the objective. Extensive empirical evaluation shows that our algorithms perform on par with current state-of-the-art methods on simple domains and substantially outperform them on challenging high-dimensional domains. As immediate future work, our focus is to find better estimates of εω to provide a principled method for uncertainty-based exploration; we expect this to help attain sample efficiency in conjunction with methods like [39, 40]. Another avenue of research is to extend our framework to multi-agent settings, in which it could be used to tackle the sub-optimality induced by the representational constraints used in MARL algorithms [41].

7 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. Matthew Fellows is funded by the EPSRC. Anuj Mahajan is funded by Google DeepMind and the Drapers Scholarship. Tim G. J.
Rudner is funded by the Rhodes Trust and the EPSRC. We would like to thank Yarin Gal and Piotr Miłoś for helpful comments.

References

[1] Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b.

[2] Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

[3] Bass, R. Real Analysis for Graduate Students. CreateSpace Independent Publishing, 2013. ISBN 9781481869140. URL https://books.google.co.uk/books?id=s6mVlgEACAAJ.

[4] Beal, M. J. Variational algorithms for approximate Bayesian inference. PhD thesis, 2003.

[5] Bertsekas, D. Constrained Optimization and Lagrange Multiplier Methods. Athena scientific series in optimization and neural computation. Athena Scientific, 1996. ISBN 9781886529045. URL http://web.mit.edu/dimitrib/www/Constrained-Opt.pdf.

[6] Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvári, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 1204–1212. Curran Associates, Inc., 2009.

[7] Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.

[8] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[9] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. CoRR, abs/1606.01540, 2016.
URL http://arxiv.org/abs/1606.01540.

[10] Ciosek, K. and Whiteson, S. Expected policy gradients. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

[11] Ciosek, K. and Whiteson, S. Expected policy gradients for reinforcement learning. arXiv preprint arXiv:1801.03326, 2018.

[12] Dayan, P. and Hinton, G. E. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997. doi: 10.1162/neco.1997.9.2.271. URL https://doi.org/10.1162/neco.1997.9.2.271.

[13] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[14] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

[15] Fellows, M., Ciosek, K., and Whiteson, S. Fourier policy gradients. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1486–1495, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/fellows18a.html.

[16] Foerster, J., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E., and Whiteson, S. DiCE: The infinitely differentiable Monte Carlo estimator. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1529–1538, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/foerster18a.html.

[17] Fox, C. W. and Roberts, S. J. A tutorial on variational Bayesian inference. Artificial Intelligence Review, pp. 1–11, 2010. ISSN 0269-2821.
doi: 10.1007/s10462-011-9236-8.

[18] Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1587–1596, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/fujimoto18a.html.

[19] Furmston, T. and Barber, D. Variational methods for reinforcement learning. In AISTATS, pp. 241–248, 2010.

[20] Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2160–2169, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/geist19a.html.

[21] Goyal, A., Brakel, P., Fedus, W., Lillicrap, T. P., Levine, S., Larochelle, H., and Bengio, Y. Recall traces: Backtracking models for efficient reinforcement learning. CoRR, abs/1804.00379, 2018. URL http://arxiv.org/abs/1804.00379.

[22] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. pp. 1–13, 2016. URL http://arxiv.org/abs/1611.02247.

[23] Gunawardana, A. and Byrne, W. Convergence theorems for generalized alternating minimization procedures. J. Mach. Learn. Res., 6:2049–2073, December 2005. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1046920.1194913.

[24] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Precup, D. and Teh, Y. W.
(eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1352–1361, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/haarnoja17a.html.

[25] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861–1870, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/haarnoja18b.html.

[26] Hachiya, H., Peters, J., and Sugiyama, M. Efficient sample reuse in EM-based policy search. In Buntine, W., Grobelnik, M., Mladenić, D., and Shawe-Taylor, J. (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 469–484, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-642-04180-8.

[27] Heess, N., Silver, D., and Teh, Y. W. Actor-critic reinforcement learning with energy-based policies. In Deisenroth, M. P., Szepesvári, C., and Peters, J. (eds.), Proceedings of the Tenth European Workshop on Reinforcement Learning, volume 24 of Proceedings of Machine Learning Research, pp. 45–58, Edinburgh, Scotland, 30 Jun–01 Jul 2013. PMLR. URL http://proceedings.mlr.press/v24/heess12a.html.

[28] Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2944–2952. Curran Associates, Inc., 2015.

[29] Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. Learning continuous control policies by stochastic value gradients. pp. 1–13, 2015. URL http://arxiv.org/abs/1510.09142.

[30] Jordan, M. I. (ed.). Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-60032-3.

[31] Kelly, J. Generalized Functions, chapter 4, pp. 111–124. John Wiley & Sons, Ltd, 2008. ISBN 9783527618897. doi: 10.1002/9783527618897.ch4. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9783527618897.ch4.

[32] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014. URL http://arxiv.org/abs/1312.6114.

[33] Koller, D. and Parr, R. Policy iteration for factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI'00, pp. 326–334, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-709-9. URL http://dl.acm.org/citation.cfm?id=2073946.2073985.

[34] Levine, S. Motor Skill Learning with Trajectory Methods. PhD thesis, 2014. URL https://people.eecs.berkeley.edu/~svlevine/papers/thesis.pdf.

[35] Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. 2018. URL https://arxiv.org/pdf/1805.00909.pdf.

[36] Levine, S. and Koltun, V. Variational policy search via trajectory optimization. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 207–215. Curran Associates, Inc., 2013.

[37] Liberzon, D. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, Princeton, NJ, USA, 2011. ISBN 9780691151878.

[38] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning.
arXiv preprint arXiv:1509.02971, 2015.

[39] Mahajan, A. and Tulabandhula, T. Symmetry learning for function approximation in reinforcement learning. arXiv preprint arXiv:1706.02999, 2017.

[40] Mahajan, A. and Tulabandhula, T. Symmetry detection and exploitation for function approximation in deep RL. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1619–1621. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[41] Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-agent variational exploration, 2019. URL https://arxiv.org/abs/1910.07483.

[42] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.

[43] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/mniha16.html.

[44] Neumann, G. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pp. 817–824, USA, 2011. Omnipress. ISBN 978-1-4503-0619-5. URL http://dl.acm.org/citation.cfm?id=3104482.3104585.

[45] Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6:147–160, 1994.

[46] Peters, J.
and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp. 745–750, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: 10.1145/1273496.1273590. URL http://doi.acm.org/10.1145/1273496.1273590.

[47] Rawlik, K., Toussaint, M., and Vijayakumar, S. Approximate inference and stochastic optimal control. CoRR, abs/1009.3958, 2010. URL http://arxiv.org/abs/1009.3958.

[48] Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems, 2012.

[49] Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. J. Mach. Learn. Res., 5:1063–1088, December 2004. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1005332.1016794.

[50] Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 3528–3536. Curran Associates, Inc., 2015.

[51] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1889–1897, 07–09 Jul 2015. URL http://proceedings.mlr.press/v37/schulman15.html.

[52] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[53] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 387–395, 2014.

[54] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[55] Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 2nd edition, 2017. ISBN 0262193981.

[56] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 1999.

[57] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 993–1000, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553501. URL http://doi.acm.org/10.1145/1553374.1553501.

[58] Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1609–1616. Curran Associates, Inc., 2009.

[59] Szepesvári, C. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010. doi: 10.2200/S00268ED1V01Y201005AIM009. URL http://www.morganclaypool.com/doi/abs/10.2200/S00268ED1V01Y201005AIM009.

[60] Thomas, P. Bias in natural actor-critic algorithms. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 441–448, Beijing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/thomas14.html.

[61] Todorov, E. Linearly-solvable Markov decision problems.
In Schölkopf, B., Platt, J. C., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems 19, pp. 1369–1376. MIT Press, 2007.

[62] Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 1–8, 2009. ISBN 9781605585161. doi: 10.1145/1553374.1553508. URL http://portal.acm.org/citation.cfm?doid=1553374.1553508.

[63] Toussaint, M. Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 3, 01 2009.

[64] Toussaint, M. and Storkey, A. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pp. 945–952, 2006. doi: 10.1145/1143844.1143963. URL http://portal.acm.org/citation.cfm?doid=1143844.1143963.

[65] Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. doi: 10.1109/9.580874.

[66] Turner, R. E. and Sahani, M. Two problems with variational expectation maximisation for time series models, pp. 104–124. Cambridge University Press, 2011. doi: 10.1017/CBO9780511984679.006.

[67] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. 2015. URL http://arxiv.org/abs/1509.06461.

[68] Watkins, C. J. C. H. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992. doi: 10.1007/BF00992698. URL http://link.springer.com/10.1007/BF00992698.

[69] Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
doi: 10.1080/09540099108946587. URL https://doi.org/10.1080/09540099108946587.

[70] Williams, R. J. and Baird, L. C., III. Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems, 1993.

[71] Wu, C. F. J. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

[72] Yang, Z., Xie, Y., and Wang, Z. A theoretical analysis of deep Q-learning. CoRR, abs/1901.00137, 2019. URL http://arxiv.org/abs/1901.00137.

[73] Ziebart, B. D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, 2010. URL http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf.

[74] Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI'08, pp. 1433–1438. AAAI Press, 2008. ISBN 978-1-57735-368-3. URL http://dl.acm.org/citation.cfm?id=1620270.1620297.

[75] Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pp. 1255–1262, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm.org/citation.cfm?id=3104322.3104481.