{"title": "Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task", "book": "Advances in Neural Information Processing Systems", "page_first": 1075, "page_last": 1081, "abstract": null, "full_text": "Using Free Energies to Represent Q-values in a \n\nMultiagent Reinforcement Learning Task \n\nBrian Sallans \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto M5S 2Z9 Canada \n\nsallam'@cs,toronto,edu \n\nGeoffrey E. Hinton \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \nLondon WCIN 3AR u.K. \nhinton @ gatsby, ucZ. ac, uk \n\nAbstract \n\nThe problem of reinforcement learning in large factored Markov decision \nprocesses is explored. The Q-value of a state-action pair is approximated \nby the free energy of a product of experts network. Network parameters \nare learned on-line using a modified SARSA algorithm which minimizes \nthe inconsistency of the Q-values of consecutive state-action pairs. Ac(cid:173)\ntions are chosen based on the current value estimates by fixing the current \nstate and sampling actions from the network using Gibbs sampling. The \nalgorithm is tested on a co-operative multi-agent task. The product of \nexperts model is found to perform comparably to table-based Q-Iearning \nfor small instances of the task, and continues to perform well when the \nproblem becomes too large for a table-based representation. \n\n1 \n\nIntroduction \n\nOnline Reinforcement Learning (RL) algorithms try to find a policy which maximizes the \nexpected time-discounted reward provided by the environment. They do this by performing \nsample backups to learn a value function over states or state-action pairs [1]. If the decision \nproblem is Markov in the observed states, then the optimal value function over state-action \npairs (the Q-function) yields all of the information required to find the optimal policy for the \ndecision problem. 
For example, when the Q-function is represented as a table, the optimal action for a given state can be found simply by searching the row of the table corresponding to that state. \n\n1.1 Factored Markov Decision Processes \n\nIn many cases the dimensionality of the problem makes a table representation impractical, so a more compact representation that makes use of the structure inherent in the problem is required. In a co-operative multi-agent system, for example, it is natural to represent both the state and action as sets of variables (one for each agent). We expect that the mapping from the combined states of all the agents to the combined actions of all the agents is not arbitrary: given an individual agent's state, that agent's action might be largely independent of the other agents' exact states and actions, at least for some regions of the combined state space. We expect that a factored representation of the Q-value function will be appropriate for two reasons: the original representation of the combined states and combined actions is factored, and the ways in which the optimal actions of one agent depend on the states and actions of other agents might be well captured by a small number of \"hidden\" factors rather than the exponential number required to express arbitrary mappings. \n\n1.2 Actor-Critic Architectures \n\nIf a non-linear function approximator is used to model the Q-function, then it is difficult and time-consuming to extract the policy directly from the Q-function, because a non-linear optimization must be solved for each action choice. One solution, called an actor-critic architecture, is to use a separate function approximator to model the policy (i.e. to approximate the non-linear optimization) [2, 3]. This has the advantage of being fast, and allows us to explicitly learn a stochastic policy, which can be advantageous if the underlying problem is not strictly Markov [4]. 
However, a specific parameterized family of policies must be chosen a priori. \n\nInstead we present a method where the Q-value of a state-action pair is represented (up to an additive constant) by the negative free energy, -F, of the state-action pair under a non-causal graphical model. The graphical model is a product of experts [5] which has two very useful properties: given a state-action pair, the exact free energy is easily computed, and the derivative of this free energy w.r.t. each parameter of the network is also very simple. The model is trained to minimize the inconsistency between the free energy of a state-action pair and the discounted free energy of the next state-action pair, taking into account the immediate reinforcement. After training, a good action for a given state can be found by clamping the state and drawing a sample of the action variables using Gibbs sampling [6]. Although finding optimal actions would still be difficult for large problems, selecting an action with a probability that is approximately proportional to exp(-F) can be done with a modest number of iterations of Gibbs sampling. \n\n1.3 Markov Decision Processes \n\nWe will concentrate on finite, factored, Markov decision processes (factored MDPs), in which each state and action is represented as a set of discrete variables. Formally, a factored MDP consists of the set { {S_alpha}_{alpha=1}^M, {A_beta}_{beta=1}^N, {s_0^alpha}_{alpha=1}^M, P, P_r }, where: S_alpha is the set of possible values for state variable alpha; A_beta is the set of possible values for action variable beta; s_0^alpha is the initial value for state variable alpha; P is a transition distribution P(s_{t+1} | s_t, a_t); and P_r is a reward distribution P(r_t | s_t, a_t, s_{t+1}). A state is an M-tuple and an action is an N-tuple. \n\nThe goal of solving an MDP is to find a policy, which is a sequence of (possibly stochastic) mappings pi_t : S_1 x S_2 x ... x S_M -> A_1 x A_2 x ... 
x A_N which maximize the total expected reward received over the course of the task: \n\n<R_t>_pi = <r_t + gamma r_{t+1} + ... + gamma^{T-t} r_T>_pi (1) \n\nwhere gamma is a discount factor and <.>_pi denotes the expectation taken with respect to policy pi_t. We will focus on the case when the policy is stationary: pi_t is identical for all t. \n\n2 Approximating Q-values with a Product of Experts \n\nAs the number of state and action variables increases, a table representation quickly becomes intractable. We represent the value of a state and action as the negative free energy (up to a constant) under a product of experts model (see Figure 1(a)). \n\nWith a product of experts, the probability assigned to a state-action pair (s, a) is just the (normalized) product of the probabilities assigned to (s, a) under each of the individual experts: \n\nP(s, a | theta_1, ..., theta_K) = prod_{k=1}^K P_k(s, a | theta_k) / sum_{(s', a')} prod_{k=1}^K P_k(s', a' | theta_k) (2) \n\nwhere {theta_1, ..., theta_K} are the parameters of the K experts and (s', a') indexes all possible state-action pairs. \n\nFigure 1: a) The Boltzmann product of experts. The estimated Q-value (up to an additive constant) of a setting of the state and action units is found by holding these units fixed and computing the free energy of the network. Actions are selected by alternating between updating all of the hidden units in parallel and updating all of the action units in parallel, with the state units held constant. b) A multinomial state or action variable is represented by a set of \"one-of-n\" binary units in which exactly one is on. \n\nIn the following, we will assume that there are an equal number of state and action variables (i.e. M = N), and that each state or action variable has the same arity (for all alpha, beta: |S_alpha| = |S_beta| and |A_alpha| = |A_beta|). 
These assumptions are appropriate, for example, when there is one state and action variable for each agent in a multi-agent task. Extension to the general case is straightforward. In the following, beta will index agents. \n\nMany kinds of \"experts\" could be used while still retaining the useful properties of the PoE. We will focus on the case where each expert is a single binary sigmoid unit, because it is particularly suited to the discrete tasks we consider here. Each agent's (multinomial) state or action is represented using a \"one-of-N\" set of binary units which are constrained so that exactly one of them is on. The product of experts is then a bipartite \"Restricted Boltzmann Machine\" [5]. We use s_{beta i} to denote agent beta's ith state and a_{beta j} to denote its jth action. We will denote the binary latent variables of the \"experts\" by h_k (see Figure 1(b)). \n\nFor a state s = {s_{beta i}} and an action a = {a_{beta j}}, the free energy is given by the expected energy under the posterior distribution of the hidden units, minus the entropy of this posterior distribution. This is simple to compute because the hidden units are independent in the posterior distribution: \n\nF(s, a) = - sum_{k=1}^K sum_{beta=1}^M [ sum_{i=1}^{|S|} (w_{beta i k} s_{beta i} h_k + b_{beta i} s_{beta i}) + sum_{j=1}^{|A|} (u_{beta j k} a_{beta j} h_k + b_{beta j} a_{beta j}) ] - sum_{k=1}^K b_k h_k + sum_{k=1}^K [ h_k log h_k + (1 - h_k) log(1 - h_k) ] - C_F (3) \n\nwhere w_{beta i k} is the weight from the kth expert to binary state variable s_{beta i}; u_{beta j k} is the weight from the kth expert to binary action variable a_{beta j}; b_k, b_{beta i} and b_{beta j} are biases; and \n\nh_k = sigma( sum_{beta=1}^M [ sum_{i=1}^{|S|} w_{beta i k} s_{beta i} + sum_{j=1}^{|A|} u_{beta j k} a_{beta j} ] + b_k ) (4) \n\nis the expected value of each expert given the data, where sigma(x) = 1 / (1 + e^{-x}) denotes the logistic function. C_F is an additive constant equal to the log of the partition function.
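Summing out the independent hidden units gives a closed form for this free energy. The following sketch is our own minimal NumPy rendering (the function names and the dense weight matrices W and U are assumptions, not the paper's code); it checks the closed form implied by Equations (3)-(4), up to the constant C_F, against brute-force marginalization over the hidden units:

```python
import numpy as np
from itertools import product

def free_energy(s, a, W, U, b_s, b_a, b_h):
    # Total input to each hidden unit (the argument of eq. 4).
    x = W.T @ s + U.T @ a + b_h
    # Summing out the independent binary hidden units gives one softplus
    # term per unit: F = -(visible bias terms) - sum_k log(1 + e^{x_k}).
    return -(b_s @ s + b_a @ a) - np.sum(np.logaddexp(0.0, x))

def free_energy_brute(s, a, W, U, b_s, b_a, b_h):
    # Direct marginalization: F(s, a) = -log sum_h exp(-E(s, a, h)).
    z = 0.0
    for h_bits in product([0.0, 1.0], repeat=len(b_h)):
        h = np.array(h_bits)
        energy = -(s @ W @ h + a @ U @ h + b_s @ s + b_a @ a + b_h @ h)
        z += np.exp(-energy)
    return -np.log(z)

# Tiny random model: 4 state units, 4 action units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
U = rng.normal(scale=0.1, size=(4, 3))
b_s, b_a, b_h = rng.normal(size=4), rng.normal(size=4), rng.normal(size=3)
s = np.array([0., 1., 0., 0.])   # one-of-N state
a = np.array([1., 0., 0., 0.])   # one-of-N action
assert np.isclose(free_energy(s, a, W, U, b_s, b_a, b_h),
                  free_energy_brute(s, a, W, U, b_s, b_a, b_h))
```

The brute-force check enumerates all 2^K hidden configurations, so it is only feasible for tiny networks; the closed form costs one matrix-vector product per expert.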
The first two terms of (3) correspond to an unnormalized negative log-likelihood, and the third to the negative entropy of the distribution over the hidden units given the data. The free energy can be computed tractably because inference is tractable in a product of experts: under the product model, each expert is independent of the others given the data. We can therefore efficiently compute the exact free energy of a state and action under the product model, up to an additive constant. The Q-function will be approximated by the negative free energy (or goodness), without the constant: \n\nQ(s, a) ≈ -F(s, a) + C_F (5) \n\n2.1 Learning the Parameters \n\nThe parameters of the model must be adjusted so that the goodness of a state-action pair under the product model approximates its actual Q-value. This is done with a modified SARSA learning rule designed to minimize the Bellman error [7, 8]. If we consider a delta-rule update where the target for input (s_t, a_t) is r_t + gamma Q(s_{t+1}, a_{t+1}), then (for example) the update for w_{beta i k} is given by: \n\nDelta w_{beta i k} proportional to (r_t + gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) dQ(s_t, a_t)/dw_{beta i k} (6) \n\n= (r_t + gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) s_{beta i} h_k (7) \n\nThe other weights and biases are updated similarly. Although there is no proof of convergence for this learning rule, it works well in practice, even though it ignores the effect of changes in w_{beta i k} on Q(s_{t+1}, a_{t+1}). \n\n2.2 Sampling Actions \n\nGiven a trained network and the current state s_t, we need to generate actions according to their goodness. We would like to select actions according to a Boltzmann exploration scheme in which the probability of selecting an action is proportional to e^{Q/T}. This selection scheme has the desirable property that it optimizes the trade-off between the expected payoff, Q, and the entropy of the selection distribution, where T is the relative importance of exploration versus exploitation. 
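Returning to the learning rule of Section 2.1: in code, the update (6)-(7) is a one-line delta rule in which the temporal-difference error scales the gradient s_{beta i} h_k. The sketch below is our own minimal NumPy rendering (the variable names, the closed-form goodness, and the lumped weight matrices are assumptions, not the paper's implementation); the analogous updates for U and the biases are omitted:

```python
import numpy as np

def goodness(s, a, W, U, b_s, b_a, b_h):
    # Q(s, a) up to the additive constant C_F: the negative free energy.
    x = W.T @ s + U.T @ a + b_h
    return b_s @ s + b_a @ a + np.sum(np.logaddexp(0.0, x))

def sarsa_step(s, a, r, s_next, a_next, W, U, b_s, b_a, b_h,
               gamma=0.9, lr=0.01):
    # One modified-SARSA update of the state-side weights W (eqs. 6-7).
    h = 1.0 / (1.0 + np.exp(-(W.T @ s + U.T @ a + b_h)))   # eq. (4)
    td = (r + gamma * goodness(s_next, a_next, W, U, b_s, b_a, b_h)
            - goodness(s, a, W, U, b_s, b_a, b_h))
    return W + lr * td * np.outer(s, h)   # dQ/dw_{ik} = s_i * h_k
```

With a positive temporal-difference error, the step raises the goodness of (s_t, a_t); iterating such steps over sampled transitions gives an on-line training loop of the kind described above.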
Fortunately, the additive constant, C_F, does not need to be known in order to select actions in this way. It is sufficient to do alternating Gibbs sampling. We start with an arbitrary initial action represented on the action units. Holding the state units fixed, we update all of the hidden units in parallel so that we get a sample from the posterior distribution over the hidden units given the state and the action. Then we update all of the action units in parallel so that we get a sample from the posterior distribution over actions given the states of the hidden units. When updating the states of the action units, we use a \"softmax\" to enforce the one-of-N constraint within a set of binary units that represent mutually exclusive actions of the same agent. When the alternating Gibbs sampling reaches equilibrium, it draws unbiased samples of actions according to their Q-values. For the networks we used, 50 Gibbs iterations appeared to be sufficient to come close to the equilibrium distribution. \n\n3 Experimental Results \n\nTo test the algorithm we introduce a co-operative multi-agent task in which there are offensive players trying to reach an end-zone, and defensive players trying to block them (see Figure 2). \n\nFigure 2: An example of the \"blocker\" task. Agents must get past the blockers to the end-zone. The blockers are pre-programmed with a strategy to stop them, but if they co-operate the blockers cannot stop them all simultaneously. \n\nThe task is co-operative: as long as one agent reaches the end-zone, the \"team\" is rewarded. The team receives a reward of +1 when an agent reaches the end-zone, and a reward of -1 otherwise. The blockers are pre-programmed with a fixed blocking strategy. Each agent occupies one square on the grid, and each blocker occupies three horizontally adjacent squares. 
An agent cannot move into a square occupied by a blocker or another agent. The task has non-wrap-around edge conditions on the east, west and south sides of the field, and the blockers and agents can move north, south, east or west. \n\nA product of experts (PoE) network with 4 hidden units was trained on a 5 x 4 blocker task with two agents and one blocker. The combined state consisted of three position variables (two agents and one blocker) which could take on integer values {1, ..., 20}. The combined action consisted of two action variables taking on values from {1, ..., 4}. \n\nThe network was run twice, once for 60,000 combined actions and once for 400,000 combined actions, with a learning rate going from 0.1 to 0.01 linearly and a temperature going from 1.0 to 0.01 exponentially over the course of training. Each trial was terminated after either the end-zone was reached, or 20 combined actions were taken, whichever occurred first. Each trial was initialized with the blocker placed randomly in the top row and the agents placed randomly in the bottom row. The same learning rate and temperature schedule were used to train a Q-learner with a table containing 128,000 elements (20^3 x 4^2), except that the Q-learner was allowed to train for 1 million combined actions. After training, each policy was run for 10,000 steps, and all rewards were totaled. The two algorithms were also compared to a hand-coded policy, where the agents first move to opposite sides of the field and then move to the end-zone. In this case, all of the algorithms performed comparably, with the PoE network performing well even after the short training time. \n\nA PoE network with 16 hidden units was trained on a 4 x 7 blockers task with three agents and two blockers. Again, the input consisted of position variables for each blocker and agent, and action variables for each agent. 
The network was trained for 400,000 combined actions, with a learning rate going from 0.01 to 0.001 and the same temperature schedule as the previous task. Each trial was terminated after either the end-zone was reached, or 40 steps were taken, whichever occurred first. After training, the resultant policy was run for 10,000 steps and the rewards received were totaled. As the table representation would have over a billion elements (28^5 x 4^3), a table-based Q-learner could not be trained for comparison. The hand-coded policy moved agents 1, 2 and 3 to the left, middle and right columns respectively, and then moved all agents towards the end-zone. The PoE performed comparably to this hand-coded policy. The results for all experiments are summarized in Table 1. \n\nTable 1: Experimental Results \n\nAlgorithm | Reward \nRandom policy (5 x 4, 2 agents, 1 blocker) | -9986 \nhand-coded (5 x 4, 2 agents, 1 blocker) | -6782 \nQ-learning (5 x 4, 2 agents, 1 blocker, 1000K steps) | -6904 \nPoE (5 x 4, 2 agents, 1 blocker, 60K steps) | -7303 \nPoE (5 x 4, 2 agents, 1 blocker, 400K steps) | -6738 \nRandom policy (4 x 7, 3 agents, 2 blockers) | -9486 \nhand-coded (4 x 7, 3 agents, 2 blockers) | -7074 \nPoE (4 x 7, 3 agents, 2 blockers, 400K steps) | -7631 \n\n4 Discussion \n\nEach hidden unit in the product model implements a probabilistic constraint that captures one aspect of the relationship between combined states and combined actions in a good policy. In practice the hidden units tend to represent particular strategies that are relevant in particular parts of the combined state space. This suggests that the hidden units could be used for hierarchical or temporal learning. A reinforcement learner could, for example, learn the dynamics between hidden unit values (useful for POMDPs) and the rewards associated with hidden unit activations. 
\n\nBecause the PoE network implicitly represents a joint probability distribution over state-action pairs, it can be queried in ways that are not normally possible for an actor network. Given any subset of state and action variables, the remainder can be sampled from the network using Gibbs sampling. This makes it easy to answer questions of the form: \"How should agent 3 behave given fixed actions for agents 1 and 2?\" or \"I can see some of the state variables but not others. What values would I most like to see for the others?\". Further, because there is an efficient unsupervised learning algorithm for PoE networks, an agent could improve its policy by watching another agent's actions and making them more probable under its own model. \n\nThere are a number of related works, both in the fields of reinforcement learning and unsupervised learning. The SARSA algorithm is from [7, 8]. A delta-rule update similar to ours was explored by [9] for POMDPs and Q-learning. Factored MDPs and function approximators have a long history in the adaptive control and RL literature (see for example [10]). \n\nOur method is also closely related to actor-critic methods [2, 3]. Normally with an actor-critic method, the actor network can be viewed as a biased scheme for selecting actions according to the value assigned by the critic. The selection is biased by the choice of parameterization. Our method of action selection is unbiased (if the Markov chain is allowed to converge). Further, the resultant policy can potentially be much more complicated than a typical parameterized actor network would allow. This is exactly the tradeoff explored in the graphical models literature between the use of Monte Carlo inference [11] and variational approximations [12]. 
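The conditional queries described at the start of this section (e.g. "How should agent 3 behave given fixed actions for agents 1 and 2?") come down to clamping the known units and running the same alternating Gibbs sampling over the rest. A minimal sketch, again under our own naming (not the paper's code), with a dictionary of clamped agents as an illustrative device:

```python
import numpy as np

def gibbs_query(s, W, U, b_h, b_a, n_agents, n_actions,
                clamped=None, n_iters=50, T=1.0, seed=0):
    # Sample the free action variables by alternating Gibbs sampling,
    # holding the state (and any clamped agents' actions) fixed.
    # `clamped` maps agent index -> action index.
    rng = np.random.default_rng(seed)
    clamped = clamped or {}
    a = np.zeros(n_agents * n_actions)
    for g in range(n_agents):   # arbitrary initial one-of-N actions
        a[g * n_actions + clamped.get(g, int(rng.integers(n_actions)))] = 1.0
    for _ in range(n_iters):
        # Sample the hidden units given the clamped state and current actions.
        p_h = 1.0 / (1.0 + np.exp(-(W.T @ s + U.T @ a + b_h) / T))
        h = (rng.random(len(b_h)) < p_h).astype(float)
        # Resample each free agent's action: softmax over its one-of-N group.
        logits = (U @ h + b_a) / T
        for g in range(n_agents):
            if g in clamped:
                continue
            grp = logits[g * n_actions:(g + 1) * n_actions]
            p = np.exp(grp - grp.max())
            p /= p.sum()
            a[g * n_actions:(g + 1) * n_actions] = 0.0
            a[g * n_actions + rng.choice(n_actions, p=p)] = 1.0
    return a
```

Leaving `clamped` empty recovers the action-selection procedure of Section 2.2; clamping some agents' actions answers the conditional queries above.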
\n\nOur algorithm is also related to probability matching [13], in which good actions are made more probable under the model, and the temperature at which the probability is computed is slowly reduced over time in order to move from exploration to exploitation and avoid local minima. Unlike our algorithm, the probability matching algorithm used a parameterized distribution which was maximized using gradient descent, and it did not address temporal credit assignment. \n\n5 Conclusions \n\nWe have shown that a product of experts network can be used to learn the values of state-action pairs (including temporal credit assignment) when both the states and actions have a factored representation. An unbiased sample of actions can then be recovered with Gibbs sampling, and 50 iterations appear to be sufficient. The network performs as well as a table-based Q-learner for small tasks, and continues to perform well when the task becomes too large for a table-based representation. \n\nAcknowledgments \n\nWe thank Peter Dayan, Zoubin Ghahramani and Andy Brown for helpful discussions. This research was funded by NSERC Canada and the Gatsby Charitable Foundation. \n\nReferences \n\n[1] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. \n\n[2] A.G. Barto, R.S. Sutton, and C.W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13:835-846, 1983. \n\n[3] R.S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proc. International Conference on Machine Learning, 1990. \n\n[4] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Gerald Tesauro, David S. Touretzky, and Todd K. 
Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 345-352. The MIT Press, Cambridge, 1995. \n\n[5] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Computational Neuroscience Unit, UCL, 2000. \n\n[6] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984. \n\n[7] G.A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Engineering Department, Cambridge University, 1994. \n\n[8] R.S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky et al. [14], pages 1038-1044. \n\n[9] M.L. Littman, A.R. Cassandra, and L.P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proc. International Conference on Machine Learning, 1995. \n\n[10] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. \n\n[11] R.M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992. \n\n[12] T.S. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. Ph.D. thesis, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, 1997. \n\n[13] Philip N. Sabes and Michael I. Jordan. Reinforcement learning by probability matching. In Touretzky et al. [14], pages 1080-1086. \n\n[14] David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors. Advances in Neural Information Processing Systems, volume 8. The MIT Press, Cambridge, 1996. \n", "award": [], "sourceid": 1888, "authors": [{"given_name": "Brian", "family_name": "Sallans", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}