{"title": "An Actor/Critic Algorithm that is Equivalent to Q-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": "An Actor/Critic Algorithm that is Equivalent to Q-Learning \n\nRobert H. Crites \nComputer Science Department \nUniversity of Massachusetts \nAmherst, MA 01003 \ncrites@cs.umass.edu \n\nAndrew G. Barto \nComputer Science Department \nUniversity of Massachusetts \nAmherst, MA 01003 \nbarto@cs.umass.edu \n\nAbstract \n\nWe prove the convergence of an actor/critic algorithm that is equivalent to Q-learning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor and critic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria that depend on the relative probability of the action that was executed. \n\n1 INTRODUCTION \n\nIn actor/critic learning systems, the actor implements a stochastic policy that maps states to action probability vectors, and the critic attempts to estimate the value of each state in order to provide more useful reinforcement feedback to the actor. The result is two interacting adaptive processes: the actor adapts to the critic, while the critic adapts to the actor. \n\nThe foundations of actor/critic learning systems date back at least to Samuel's checker program in the late 1950s (Samuel, 1963). Examples of actor/critic systems include Barto, Sutton, & Anderson's (1983) ASE/ACE architecture and Sutton's (1990) Dyna-PI architecture. Sutton (1988) notes that the critic in these systems performs temporal credit assignment using what he calls temporal difference (TD) methods. 
Barto, Sutton, & Watkins (1990) note a relationship between actor/critic architectures and a dynamic programming (DP) algorithm known as policy iteration. \n\nAlthough DP is a collection of general methods for solving Markov decision processes (MDPs), these algorithms are computationally infeasible for problems with very large state sets. Indeed, classical DP algorithms require multiple complete sweeps of the entire state set. However, progress has been made recently in developing asynchronous, incremental versions of DP that can be run online concurrently with control (Watkins, 1989; Barto et al., 1993). Most of the theoretical results for incremental DP have been for algorithms based on a DP algorithm known as value iteration. Examples include Watkins' (1989) Q-learning algorithm (motivated by a desire for on-line learning), and Bertsekas & Tsitsiklis' (1989) results on asynchronous DP (motivated by a desire for parallel implementations). Convergence proofs for incremental algorithms based on policy iteration (such as actor/critic algorithms) have been slower in coming. \n\nWilliams & Baird (1993) provide a valuable analysis of the convergence of certain actor/critic learning systems that use deterministic policies. They assume that a model of the MDP (including all the transition probabilities and expected rewards) is available, allowing the use of operations that look ahead to all possible next states. When a model is not available for the evaluation of alternative actions, one must resort to other methods for exploration, such as the use of stochastic policies. We prove convergence for an actor/critic algorithm that uses stochastic policies and does not require a model of the MDP. \n\nThe key idea behind our proof is to construct an actor/critic algorithm that is equivalent to Q-learning. 
It achieves this equivalence by encoding Q-values within the policy and value function of the actor and critic. By illustrating the way Q-learning appears as an actor/critic algorithm, the construction sheds light on two significant differences between Q-learning and traditional actor/critic algorithms. Traditionally, the critic attempts to provide feedback to the actor by estimating V^π, the value function corresponding to the current policy π. In our construction, instead of estimating V^π, the critic directly estimates the optimal value function V*. In practice, this means that the value function estimate V is updated only when the most probable action is executed from any given state. In addition, our actor is provided with more discriminating feedback, based not only on the TD error, but also on the relative probability of the action that was executed. By adding these modifications, we can show that this algorithm behaves exactly like Q-learning constrained by a particular exploration strategy. Since a number of proofs of the convergence of Q-learning already exist (Tsitsiklis, 1994; Jaakkola et al., 1993; Watkins & Dayan, 1992), the fact that this algorithm behaves exactly like Q-learning implies that it too converges to the optimal value function with probability one. \n\n2 MARKOV DECISION PROCESSES \n\nActor/critic and Q-learning algorithms are usually studied within the Markov decision process framework. In a finite MDP, at each discrete time step, an agent observes the state x from a finite set X, and selects an action a from a finite set A_x by using a stochastic policy π that assigns a probability to each action in A_x. The agent receives a reward with expected value R(x, a), and the state at the next time step is y with probability p^a(x, y). 
For any policy π and x ∈ X, let V^π(x) denote the expected infinite-horizon discounted return from x given that the agent uses policy π. Letting r_t denote the reward at time t, this is defined as: \n\nV^π(x) = E_π [ Σ_{t=0}^∞ γ^t r_t | x_0 = x ],  (1) \n\nwhere x_0 is the initial state, 0 ≤ γ < 1 is a factor used to discount future rewards, and E_π is the expectation assuming the agent always uses policy π. It is usual to call V^π(x) the value of x under π. The function V^π is the value function corresponding to π. The objective is to find an optimal policy, i.e., a policy π* that maximizes the value of each state x defined by (1). The unique optimal value function, V*, is the value function corresponding to any optimal policy. Additional details on this and other types of MDPs can be found in many references. \n\n3 ACTOR/CRITIC ALGORITHMS \n\nA generic actor/critic algorithm is as follows: \n\n1. Initialize the stochastic policy and the value function estimate. \n2. From the current state x, execute action a randomly according to the current policy. Note the next state y, the reward r, and the TD error \n\ne = [r + γV(y)] − V(x), \n\nwhere 0 ≤ γ < 1 is the discount factor. \n3. Update the actor by adjusting the action probabilities for state x using the TD error. If e > 0, action a performed relatively well and its probability should be increased. If e < 0, action a performed relatively poorly and its probability should be decreased. \n4. Update the critic by adjusting the estimated value of state x using the TD error: \n\nV(x) ← V(x) + α e, \n\nwhere α is the learning rate. \n5. x ← y. Go to step 2. \n\nThere are a variety of implementations of this generic algorithm in the literature. They differ in the exact details of how the policy is stored and updated. 
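As a concrete reference point, the generic loop above can be sketched in a few lines of Python. The two-state, two-action MDP (tables P and R), the Boltzmann form of the policy, and all step sizes below are illustrative assumptions, not taken from the paper.

```python
import math
import random

# A runnable sketch of the generic actor/critic loop (steps 1-5 above).
# The toy MDP and constants are made-up illustrations.
random.seed(0)

gamma = 0.9                  # discount factor
alpha, beta = 0.1, 0.1       # critic and actor step sizes
P = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}          # next state
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0}  # expected reward

V = [0.0, 0.0]                # critic: value function estimate
w = [[0.0, 0.0], [0.0, 0.0]]  # actor: action preferences w(x, a)

def policy(x):
    # Boltzmann conversion of preferences into action probabilities
    e = [math.exp(v) for v in w[x]]
    s = sum(e)
    return [p / s for p in e]

x = 0
for _ in range(2000):
    a = random.choices([0, 1], weights=policy(x))[0]  # step 2: sample action
    y, r = P[(x, a)], R[(x, a)]
    e = (r + gamma * V[y]) - V[x]                     # TD error
    w[x][a] += beta * e                               # step 3: actor update
    V[x] += alpha * e                                 # step 4: critic update
    x = y                                             # step 5
```

In this toy problem only action 1 is rewarded, so after training the actor's preference for action 1 dominates in both states and the critic's values grow toward r/(1 − γ).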
Barto et al. (1990) and Lin (1993) store the action probabilities indirectly using parameters w(x, a) that need not be positive, and need not sum to one. Increasing (or decreasing) the probability of action a in state x is accomplished by increasing (or decreasing) the value of the parameter w(x, a). Sutton (1990) modifies the generic algorithm so that these parameters can be interpreted as action value estimates. He redefines e in step 2 as follows: \n\ne = [r + γV(y)] − w(x, a). \n\nFor this reason, the Dyna-PI architecture (Sutton, 1990) and the modified actor/critic algorithm we present below both reward less probable actions more readily because of their lower estimated values. \n\nBarto et al. (1990) select actions by adding exponentially distributed random numbers to each parameter w(x, a) for the current state, and then executing the action with the maximum sum. Sutton (1990) and Lin (1993) convert the parameters w(x, a) into action probabilities using the Boltzmann distribution, where given a temperature T, the probability of selecting action i in state x is \n\ne^{w(x,i)/T} / Σ_{a∈A_x} e^{w(x,a)/T}. \n\nIn spite of the empirical success of these algorithms, their convergence has never been proven. \n\n4 Q-LEARNING \n\nRather than learning the values of states, the Q-learning algorithm learns the values of state/action pairs. Q(x, a) is the expected discounted return obtained by performing action a in state x and performing optimally thereafter. Once the Q function has been learned, an optimal action in state x is any action that maximizes Q(x, ·). 
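A single tabular Q-learning step, as made precise by the update rule stated just below, can be sketched as follows. The table shape, the sample transition, and the step size are illustrative choices.

```python
# A minimal tabular sketch of one Q-learning update.
# The table, the sample transition, and alpha are made-up illustrations.
gamma = 0.9

def q_update(Q, x, a, r, y, alpha):
    """Q(x,a) <- Q(x,a) + alpha * ([r + gamma * max_b Q(y,b)] - Q(x,a))."""
    target = r + gamma * max(Q[y])
    Q[x][a] += alpha * (target - Q[x][a])

Q = [[0.0, 0.0], [0.0, 0.0]]            # two states, two actions, all zeros
q_update(Q, 0, 1, 1.0, 1, alpha=0.5)    # observe x=0, a=1, r=1.0, y=1
# Q[0][1] is now 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```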
Whenever an action a is executed from state x, the Q-value estimate for that state/action pair is updated as follows: \n\nQ(x, a) ← Q(x, a) + α_{xa}(n) [r + γ max_{b∈A_y} Q(y, b) − Q(x, a)], \n\nwhere α_{xa}(n) is the non-negative learning rate used the nth time action a is executed from state x. Q-learning does not specify an exploration mechanism, but requires that all actions be tried infinitely often from all states. In actor/critic learning systems, exploration is fully determined by the action probabilities of the actor. \n\n5 A MODIFIED ACTOR/CRITIC ALGORITHM \n\nFor each value v ∈ ℝ, the modified actor/critic algorithm presented below uses an invertible function, H_v, that assigns a real number to each action probability ratio: \n\nH_v : (0, ∞) → ℝ. \n\nEach H_v must be a continuous, strictly increasing function such that H_v(1) = v, and \n\nH_{H_v(z_2)}(z_1 / z_2) = H_v(z_1) for all z_1, z_2 > 0. \n\nOne example of such a class of functions is H_v(z) = T ln(z) + v, v ∈ ℝ, for some positive T. This class of functions corresponds to Boltzmann exploration in Q-learning. Thus, a kind of simulated annealing can be accomplished in the modified actor/critic algorithm (as is often done in Q-learning) by gradually lowering the \"temperature\" T and appropriately renormalizing the action probabilities. It is also possible to restrict the range of H_v if bounds on the possible values for a given MDP are known a priori. \n\nFor a state x, let p_a be the probability of action a, let p_max be the probability of the most probable action, a_max, and let z_a = p_a / p_max. \n\nThe modified actor/critic algorithm is as follows: \n\n1. Initialize the stochastic policy and the value function estimate. \n2. From the current state x, execute an action randomly according to the current policy. Call it action i. 
Note the next state y and the immediate reward r, and let \n\ne = [r + γV(y)] − H_{V(x)}(z_i). \n\n3. Increase the probability of action i if e > 0, and decrease its probability if e < 0. The precise probability update is as follows. First calculate \n\nz_i^+ = H^{-1}_{V(x)}[H_{V(x)}(z_i) + α_{xi}(n) e]. \n\nThen determine the new action probabilities by dividing by the normalization factor N = z_i^+ + Σ_{j≠i} z_j, as follows: \n\np_i ← z_i^+ / N, and p_j ← z_j / N, j ≠ i. \n\n4. Update V(x) only if i = a_max. Since the action probabilities are updated after every action, the most probable action may be different before and after the update. If i = a_max both before and after step 3 above, then update the value function estimate as follows: \n\nV(x) ← V(x) + α_{xi}(n) e. \n\nOtherwise, if i = a_max before or after step 3: \n\nV(x) ← H_{V(x)}(N p_k), \n\nwhere action k is the most probable action after step 3. \n5. x ← y. Go to step 2. \n\n6 CONVERGENCE OF THE MODIFIED ALGORITHM \n\nTheorem: The modified actor/critic algorithm given above converges to the optimal value function V* with probability one if: \n\n1. The state and action sets are finite. \n2. Σ_{n=0}^∞ α_{xa}(n) = ∞ and Σ_{n=0}^∞ α²_{xa}(n) < ∞. \n\nSpace does not permit us to supply the complete proof, which follows this outline: \n\n1. The modified actor/critic algorithm behaves exactly the same as a Q-learning algorithm constrained by a particular exploration strategy. \n2. Q-learning converges to V* with probability one, given the conditions above (Tsitsiklis, 1994; Jaakkola et al., 1993; Watkins & Dayan, 1992). \n3. Therefore, the modified actor/critic algorithm also converges to V* with probability one. \n\nThe commutative diagram below illustrates how the modified actor/critic algorithm behaves exactly like Q-learning constrained by a particular exploration strategy. 
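Concretely, with the Boltzmann family H_v(z) = T ln(z) + v from Section 5, the correspondence between (π, V) and Q-values at the heart of the equivalence can be sketched as follows. The temperature T and the example Q-values are illustrative assumptions.

```python
import math

# A sketch of the (pi, V) <-> Q correspondence behind the equivalence,
# using the Boltzmann family H_v(z) = T*ln(z) + v from Section 5.
# T and the example Q-values below are illustrative assumptions.
T = 1.0

def q_from_policy(probs, v):
    """Encode Q-values in (pi, V): Q(x, a) = H_{V(x)}(p_a / p_max)."""
    p_max = max(probs)
    return [T * math.log(p / p_max) + v for p in probs]

def policy_from_q(qvals):
    """Recover (pi, V): V(x) = max_a Q(x, a); pi is Boltzmann in Q."""
    v = max(qvals)
    z = [math.exp((q - v) / T) for q in qvals]   # z_a = H^{-1}_v(Q(x, a))
    n = sum(z)
    return [zi / n for zi in z], v

q = [1.0, 0.5, -0.2]              # hypothetical Q-values for one state
probs, v = policy_from_q(q)       # H^{-1}: exploration strategy from Q
q_back = q_from_policy(probs, v)  # H: recovers the same Q-values exactly
```

Because H_v is strictly increasing with H_v(1) = v, and V is set to the maximum Q-value, the round trip is exact: updating Q-values and converting back to (π, V) is precisely the process that the modified algorithm collapses into one step.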
\n\nThe function H recovers Q-values from the policy π and value function V. H^{-1} recovers (π, V) from the Q-values, thus determining an exploration strategy. Given the ability to move back and forth between (π, V) and Q, we can determine how to change (π, V) by converting to Q, determining updated Q-values, and then converting back to obtain an updated (π, V). The modified actor/critic algorithm simply collapses this process into one step, bypassing the explicit use of Q-values. \n\n(π, V)_t --- Modified Actor/Critic ---> (π, V)_{t+1} \n   |                                        ^ \n   H                                     H^{-1} \n   v                                        | \n  Q_t ----------- Q-Learning ----------> Q_{t+1} \n\nFollowing the diagram above, (π, V) can be converted to Q-values as follows: \n\nQ(x, a) = H_{V(x)}(z_a). \n\nGoing the other direction, Q-values can be converted to (π, V) as follows: \n\nV(x) = max_a Q(x, a), \n\nand \n\np_a = H^{-1}_{V(x)}(Q(x, a)) / Σ_{b∈A_x} H^{-1}_{V(x)}(Q(x, b)). \n\nThe only Q-value that should change at time t is the one corresponding to the state/action pair that was visited at time t; call it Q(x, i). In order to prove the convergence theorem, we must verify that after an iteration of the modified actor/critic algorithm, its encoded Q-values match the values produced by Q-learning: \n\nQ_{t+1}(x, i) = Q_t(x, i) + α_{xi}(n) [r + γ max_{b∈A_y} Q_t(y, b) − Q_t(x, i)],  (2) \n\nQ_{t+1}(x, a) = Q_t(x, a),  a ≠ i.  (3) \n\nIn verifying this, it is necessary to consider the four cases where Q(x, i) is, or is not, the maximum Q-value for state x at times t and t+1. Only enough space exists to present a detailed verification of one case. \n\nCase 1: Q_t(x, i) = max Q_t(x, ·) and Q_{t+1}(x, i) = max Q_{t+1}(x, ·). \n\nIn this case, p_i(t) = p_max(t) and p_i(t+1) = p_max(t+1), since H_{V_t(x)} and H_{V_{t+1}(x)} are strictly increasing. Therefore z_i(t) = 1 and z_i(t+1) = 1. 
Therefore, V_t(x) = H_{V_t(x)}[1] = H_{V_t(x)}[z_i(t)] = Q_t(x, i), and \n\nQ_{t+1}(x, i) = H_{V_{t+1}(x)}[z_i(t+1)] \n  = H_{V_{t+1}(x)}[1] \n  = V_{t+1}(x) \n  = V_t(x) + α_{xi}(n) e \n  = Q_t(x, i) + α_{xi}(n) [r + γ max_{b∈A_y} Q_t(y, b) − Q_t(x, i)]. \n\nThis establishes (2). To show that (3) holds, we have that \n\nV_{t+1}(x) = V_t(x) + α_{xi}(n) e \n  = Q_t(x, i) + α_{xi}(n) e \n  = H_{V_t(x)}[z_i(t)] + α_{xi}(n) e \n  = H_{V_t(x)}[H^{-1}_{V_t(x)}[H_{V_t(x)}[z_i(t)] + α_{xi}(n) e]] \n  = H_{V_t(x)}[z_i^+(t)],  (4) \n\nand, for a ≠ i, \n\nQ_{t+1}(x, a) = H_{V_{t+1}(x)}[z_a(t+1)] \n  = H_{V_{t+1}(x)}[p_a(t+1) / p_max(t+1)] \n  = H_{V_{t+1}(x)}[(z_a(t)/N) / (z_i^+(t)/N)] \n  = H_{V_{t+1}(x)}[z_a(t) / z_i^+(t)] \n  = H_{H_{V_t(x)}[z_i^+(t)]}[z_a(t) / z_i^+(t)]  by (4) \n  = H_{V_t(x)}[z_a(t)]  by a property of H \n  = Q_t(x, a). \n\nThe other cases can be shown similarly. \n\n7 CONCLUSIONS \n\nWe have presented an actor/critic algorithm that is equivalent to Q-learning constrained by a particular exploration strategy. Like Q-learning, it estimates V* directly without a model of the underlying decision process. It uses exactly the same amount of storage as Q-learning: one location for every state/action pair. (For each state, |A| − 1 locations are needed to store the action probabilities, since they must sum to one. The remaining location can be used to store the value of that state.) One advantage of Q-learning is that its exploration is uncoupled from its value function estimates. In the modified actor/critic algorithm, the exploration strategy is more constrained. \n\nIt is still an open question whether other actor/critic algorithms are guaranteed to converge. One way to approach this question would be to investigate further the relationship between the modified actor/critic algorithm described here and the actor/critic algorithms that have been employed by others. 
\n\nAcknowledgements \n\nWe thank Vijay GullapaUi and Rich Sutton for helpful discussions. This research \nwas supported by Air Force Office of Scientific Research grant F49620-93-1-0269. \n\nReferences \n\nA. G. Barto, S. J. Bradtke &. S. P. Singh. (1993) Learning to act using real-time \ndynamic programming. Artificial Intelligence, Accepted. \nA. G. Barto, R. S. Sutton &. C. W. Anderson. (1983) Neuronlike adaptive elements \nthat can solve difficult learning control problems. IEEE Transactions on Systems, \nMan, and Cybernetics 13:835-846. \nA. G. Barto, R. S. Sutton &. C. J. C. H. Watkins. (1990) Learning and sequential \ndecision making. In M. Gabriel &. J. Moore, editors, Learning and Computational \nNeuroscience: Foundations of Adaptive Networks. MIT Press, Cambridge, MA. \nD. P. Bertsekas &. J. N. Tsitsiklis. (1989) Parallel and Distributed Computation: \nNumerical Metkods. Prentice-Hall, Englewood Cliffs, N J. \nT. Jaakkola, M. 1. Jordan &. S. P. Singh. (1993) On the convergence of stochastic \niterative dynamic programming algorithms. MIT Computational Cognitive Science \nTechnical Report 9307. \n\nL. Lin. (1993) Reinforcement Learning for Robots Using Neural Networks. PhD \nThesis, Carnegie Mellon University, Pittsburgh, PA. \n\nA. L. Samuel. (1963) Some studies in machine learning using the game of checkers. \nIn E. Feigenbaum &. J. Feldman, editors, Computers and Tkougkt. McGraw-Hill, \nNew York, NY. \n\nR. S. Sutton. (1988) Learning to predict by the methods of temporal differences. \nMackine Learning 3:9-44. \n\nR. S. Sutton. (1990) Integrated architectures for learning, planning, and react(cid:173)\ning based on approximating dynamic programming. In Proceedings of tke Seventk \nInternational Conference on Mackine Learning. \n\nJ. N. Tsitsiklis. \nMackine Learning 16:185-202. \n\n(1994) Asynchronous stochastic approximation and Q-Iearning. \n\nC. J. C. H. Watkins. (1989) Learning from Delayed Rewards. PhD thesis, Cam(cid:173)\nbridge University. \nC. 
J. C. H. Watkins & P. Dayan. (1992) Q-learning. Machine Learning 8:279-292. \n\nR. J. Williams & L. C. Baird. (1993) Analysis of some incremental variants of policy iteration: first steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11, Northeastern University College of Computer Science. \n", "award": [], "sourceid": 916, "authors": [{"given_name": "Robert", "family_name": "Crites", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}