{"title": "Balancing Multiple Sources of Reward in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1082, "page_last": 1088, "abstract": null, "full_text": "Balancing Multiple Sources of Reward in \n\nReinforcement Learning \n\nChristian R. Shelton \n\nArtificial Intelligence Lab \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \ncshelton@ai.mit.edu \n\nAbstract \n\nFor many problems which  would be natural for reinforcement learning, \nthe reward signal is not a single scalar value but has multiple scalar com(cid:173)\nponents.  Examples of such problems include agents  with multiple goals \nand agents  with  multiple users.  Creating a single reward value by com(cid:173)\nbining  the  multiple  components  can throwaway  vital  information and \ncan lead  to incorrect solutions.  We describe the multiple reward source \nproblem  and  discuss  the  problems  with  applying  traditional  reinforce(cid:173)\nment learning.  We  then present an  new  algorithm for finding a solution \nand results on simulated environments. \n\n1  Introduction \n\nIn  the  traditional reinforcement learning  framework, the  learning  agent is  given  a single \nscalar value  of reward  at each  time  step.  The goal  is  for  the  agent to  optimize  the  sum \nof these rewards over time (the return).  For many  applications, there is  more information \navailable. \n\nConsider the case  of a home entertainment system designed to  sense which residents  are \ncurrently in the room and automatically select a television program to  suit their tastes.  We \nmight construct the reward signal to be the total number of people paying attention to the \nsystem.  However,  a  reward  signal  of 2 ignores  important information about  which two \nusers are watching. The users of the system change as people leave and enter the room. We \ncould, in  theory, learn the relationship among the users present, who is  watching, and  the \nreward.  In general, it is  better to use the domain knowledge we  have instead of requiring \nthe  system to  learn it.  We know which users are  contributing to  the reward  and that only \npresent users can contribute. \n\nIn  other cases,  the  multiple  sources  aren't users,  but goals.  For elevator scheduling we \nmight be trading off people serviced per minute against average waiting time.  For financial \nportfolio managing, we might be weighing profit against risk.  In these cases, we may wish \nto  change the  weighting  over time.  In  order to  keep  from  having to relearn the  solution \nfrom scratch each time the weighting is changed, we need to keep track of which rewards \nto attribute to which goals. \n\nThere  is  a  separate  difficulty  if the  rewards  are  not  designed  functions  of the  state  but \n\n\frather are  given by other agents  or people in  the  environment.  Consider the  case  of the \nentertainment system above but where every resident has a dial by which they can give the \nsystem feedback or reward. The rewards are incomparable. One user may decide to reward \nthe  system  with  values  twice as  large as  those  of another which  should not result in  that \nuser having twice the control over the entertainment. This isn't limited to scalings but also \nincludes any  other monotonic transforms  of the returns.  If the users of the  system know \nthey are training it, they will employ all kinds of reward strategies to try to steer the system \nto the desired behavior [2].  
By keeping track of the sources of the rewards, we will derive an algorithm to overcome these difficulties.

1.1  Related Work

The work presented here is related to recent work on multiagent reinforcement learning [1, 4, 5, 7] in that multiple reward signals are present and game theory provides a solution. This work is different in that it attacks a simpler problem where the computation is consolidated on a single agent. Work on multiple goals (see [3, 8] as examples) is also related but assumes either that the returns of the goals are to be linearly combined for an overall value function or that only one goal is to be solved at a time.

1.2  Problem Setup

We will be working with partially observable environments with discrete actions and discrete observations. We make no assumptions about the world model and thus do not use belief states. x(t) and a(t) are the observation and action, respectively, at time t. We consider only reactive policies (although the observations could be expanded to include history). π(x, a) is the policy, or probability the agent will take action a when observing x. At each time step, the agent receives a set of rewards (one for each source in the environment); r_s(t) is the reward at time t from source s. We use the average reward formulation, and so R_s^π = lim_{n→∞} (1/n) E[r_s(1) + r_s(2) + ... + r_s(n) | π] is the expected return from source s for following policy π. It is this return that we want to maximize for each source.

We will also assume that the algorithm knows the set of sources present at each time step. Sources which are not present provide a constant reward, regardless of the state or action, which we will assume to be zero. All sums over sources will be assumed to be taken over only the present sources.

The goal is to produce an algorithm that will produce a policy based on previous experience and the sources present. The agent's experience will take the form of prior interactions with the world. Each experience is a sequence of observation, action, and reward triplets for a particular run of a particular policy.

2  Balancing Multiple Rewards

2.1  Policy Votes

If rewards are not directly comparable, we need to find a property of the sources which is comparable and a metric to optimize. We begin by noting that we want to limit the amount of control any given source has over the behavior of the agent. To that end, we construct the policy as the average of a set of votes, one for each source present. The votes for a source must sum to 1 and must all be non-negative (thus giving each source an equal "say" in the agent's policy). We will first consider restricting the rewards from a given source to only affect the votes for that source.

The form for the policy is therefore

\pi(x, a) = \frac{\sum_s \alpha_s(x) \, v_s(x, a)}{\sum_{s'} \alpha_{s'}(x)}    (1)

where for each present source s, Σ_x α_s(x) = 1, α_s(x) ≥ 0 for all x, Σ_a v_s(x, a) = 1 for all x, and v_s(x, a) ≥ 0 for all x and a. We have broken the vote from a source apart into two parts, α and v. α_s(x) is how much effort source s is putting into affecting the policy for observation x; v_s(x, a) is the vote by source s for the policy for observation x. Mathematically this is the same as constructing a single vote (v'_s(x, a) = α_s(x) v_s(x, a)), but we find α and v to be more interpretable.
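To make equation 1 concrete, here is a minimal sketch (ours, not the paper's; the function name and array shapes are assumptions made for illustration) of combining the present sources' efforts and votes into a single policy:

import numpy as np

def combined_policy(alpha, v):
    # Combine per-source votes into a single policy (equation 1).
    # alpha: (S, X) efforts alpha_s(x); each row sums to 1 over observations.
    # v:     (S, X, A) votes v_s(x, a); each (s, x) slice sums to 1 over actions.
    # Only the sources currently present are included; assumes at least one of
    # them puts nonzero effort on every observation.
    weighted = np.einsum('sx,sxa->xa', alpha, v)        # sum_s alpha_s(x) v_s(x, a)
    total_effort = alpha.sum(axis=0)[:, None]           # sum_s alpha_s(x)
    return weighted / total_effort

# Two sources, one observation, two actions: each source puts all of its
# effort on this observation and votes for a different action.
alpha = np.array([[1.0], [1.0]])
v = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
print(combined_policy(alpha, v))    # [[0.5, 0.5]]

With these votes the policy splits evenly between the two actions, which is exactly the two-source situation examined in the next paragraph.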
We have constrained the total effort and vote any one source can apply. Unfortunately, these votes are not quite the correct parameters for our policy. They are not invariant to the other sources present. To illustrate this, consider the example of a single state with two actions, two sources, and a learning agent with the voting method from above. If s_1 prefers only a_1 and s_2 likes an equal mix of a_1 and a_2, the agent will learn a vote of (1, 0) for s_1, and s_2 can reward the agent to cause it to learn a vote of (0, 1) for s_2, resulting in a policy of (0.5, 0.5). Whether this is the correct final policy depends on the problem definition. However, the real problem arises when we consider what happens if s_1 is removed. The policy reverts to (0, 1), which is far from s_2's (the only present source's) desired (0.5, 0.5). Clearly, the learned votes for s_2 are meaningless when s_1 is not present.

Thus, while the voting scheme does limit the control each present source has over the agent, it does not provide a description of the source's preferences which would allow for the removal or addition (or reweighting) of sources.

2.2  Returns as Preferences

While rewards (or returns) are not comparable across sources, they are comparable within a source. In particular, we know that if R_s^{π_1} > R_s^{π_2} then source s prefers policy π_1 to policy π_2. We do not know how to weigh that preference against a different source's preference, so an explicit tradeoff is still impossible, but we can limit (using the voting scheme of equation 1) how much one source's preference can override another source's preference.

We allow a source's preference for a change to prevail insofar as its votes are sufficient to effect the change in the presence of the other sources' votes. We have a type of general-sum game (letting the sources be the players, in game theory jargon). The value to source s' of the set of all sources' votes is R_{s'}^π, where π is the function of the votes defined in equation 1. Each source s' would like to set its particular votes, α_{s'}(x) and v_{s'}(x, a), to maximize its value (or return). Our algorithm will set each source's vote in this way, thus ensuring that no source could do better by "lying" about its true reward function.

In game theory, a "solution" to such a game is called a Nash Equilibrium [6], a point at which each player (source) is playing (voting) its best response to the other players. At a Nash Equilibrium, no single player can change its play and achieve a gain. Because the votes are real-valued, we are looking for the equilibrium of a continuous game. We will derive a fictitious play algorithm to find an equilibrium for this game.

3  Multiple Reward Source Algorithm

3.1  Return Parameterization

In order to apply the ideas of the previous section, we must find a method for finding a Nash Equilibrium. To do that, we will pick a parametric form for R̂_s^π (the estimate of the return): linear in the KL-divergence between a target vote and π. Letting a_s, b_s, β_s(x), and ρ_s(x, a) be the parameters of R̂_s^π,

\hat{R}_s^\pi = -a_s \sum_x \beta_s(x) \sum_a \rho_s(x, a) \log\left( \frac{\rho_s(x, a)}{\pi(x, a)} \right) + b_s    (2)

where a_s ≥ 0, β_s(x) ≥ 0, Σ_x β_s(x) = 1, ρ_s(x, a) ≥ 0, and Σ_a ρ_s(x, a) = 1.
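A small numerical sketch (ours, not from the paper) of evaluating this return model for a given policy; it assumes strictly positive ρ_s and π so the logarithm stays finite:

import numpy as np

def estimated_return(a_s, b_s, beta_s, rho_s, pi):
    # Parametric return estimate of equation 2:
    #   R_hat = -a_s * sum_x beta_s(x) * KL(rho_s(x, .) || pi(x, .)) + b_s
    # beta_s: (X,); rho_s and pi: (X, A); all rows are probability distributions.
    kl = np.sum(rho_s * np.log(rho_s / pi), axis=1)   # per-observation KL divergence
    return -a_s * np.dot(beta_s, kl) + b_s

# The estimate is largest (equal to b_s) when pi matches rho_s on the
# observations that source s cares about (those with large beta_s).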
Just as α_s(x) was the amount of vote source s was putting towards the policy for observation x, β_s(x) is the importance for source s of the policy for observation x. And, while v_s(x, a) was the policy vote for observation x for source s, ρ_s(x, a) is the preferred policy for observation x for source s. The constants a_s and b_s allow for scaling and translation of the return.

If we let p'_s(x, a) = a_s β_s(x) ρ_s(x, a), then, given experiences of different policies and their empirical returns, we can estimate p'_s(x, a) using linear least-squares. Imposing the constraints just involves finding the normal least-squares fit with the constraint that all p'_s(x, a) be non-negative. From p'_s(x, a) we can calculate a_s = Σ_{x,a} p'_s(x, a), β_s(x) = (1/a_s) Σ_a p'_s(x, a), and ρ_s(x, a) = p'_s(x, a) / Σ_{a'} p'_s(x, a'). We now have a method for solving for R̂_s^π given experience. We now need to find a way to compute the agent's policy.

3.2  Best Response Algorithm

To produce an algorithm for finding a Nash Equilibrium, let us first start by deriving an algorithm for finding the best response for source s to a set of votes. We need to find the set of α_s(x) and v_s(x, a) that satisfy the constraints on the votes and maximize equation 2, which is the same as minimizing

\sum_x \beta_s(x) \sum_a \rho_s(x, a) \log\left( \frac{\sum_{s'} \alpha_{s'}(x)}{\sum_{s'} \alpha_{s'}(x) \, v_{s'}(x, a)} \right)    (3)

over α_s(x) and v_s(x, a) for the given s, because the other terms depend on neither α_s(x) nor v_s(x, a).

To minimize equation 3, let's first fix the α-values and optimize v_s(x, a). We will ignore the non-negative constraints on v_s(x, a) and just impose the constraint that Σ_a v_s(x, a) = 1. The solution, whose derivation is simple and omitted due to space, is

v_s(x, a) = \rho_s(x, a) + \frac{1}{\alpha_s(x)} \sum_{s' \neq s} \alpha_{s'}(x) \left[ \rho_s(x, a) - v_{s'}(x, a) \right]    (4)

We impose the non-negative constraints by setting to zero any v_s(x, a) which are negative and renormalizing.

Unfortunately, we have not been able to find such a nice solution for α_s(x). Instead, we use gradient descent to optimize equation 3, yielding the gradient

\frac{\partial}{\partial \alpha_s(x)} = \frac{\beta_s(x)}{\sum_{s'} \alpha_{s'}(x)} \left[ 1 - \sum_a \frac{\rho_s(x, a) \, v_s(x, a)}{\pi(x, a)} \right]    (5)

We constrain the gradient to fit the constraints.

We can find the best response for source s by iterating between the two steps above. First we initialize α_s(x) = β_s(x) for all x. We then solve for a new set of v_s(x, a) with equation 4. Using those v-values, we take a step in the direction of the gradient of α_s(x) with equation 5. We keep repeating until the solution converges (reducing the step size each iteration), which usually only takes a few tens of steps.
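Putting the two steps together, the following sketch (ours, not the paper's; it assumes the forms of equations 4 and 5 given above and adds small constants to guard divisions) iterates a best response for one source against the other sources' fixed votes:

import numpy as np

def best_response(s, alpha, v, beta_s, rho_s, lr=0.1, iters=50):
    # Best-response iteration of section 3.2 for source s.
    # alpha: (S, X) efforts and v: (S, X, A) votes of all present sources.
    # beta_s: (X,) importances, rho_s: (X, A) preferred policy of source s.
    alpha = alpha.copy()
    v = v.copy()
    others = [t for t in range(alpha.shape[0]) if t != s]
    alpha[s] = beta_s.copy()                      # suggested initialization
    for _ in range(iters):
        # Equation 4: closed-form v_s(x, a), then clip negatives and renormalize.
        c = np.einsum('ox,oxa->xa', alpha[others], v[others])
        C = alpha[others].sum(axis=0)             # total effort of the other sources
        new_v = rho_s + (rho_s * C[:, None] - c) / np.maximum(alpha[s][:, None], 1e-8)
        new_v = np.clip(new_v, 0.0, None)
        v[s] = new_v / np.maximum(new_v.sum(axis=1, keepdims=True), 1e-12)
        # Equation 5: gradient of equation 3 with respect to alpha_s(x),
        # followed by a projected descent step (the paper shrinks the step size).
        A = np.maximum(alpha.sum(axis=0), 1e-12)
        pi = np.maximum(np.einsum('sx,sxa->xa', alpha, v) / A[:, None], 1e-12)
        grad = beta_s / A * (1.0 - (rho_s * v[s] / pi).sum(axis=1))
        alpha[s] = np.clip(alpha[s] - lr * grad, 0.0, None)
        alpha[s] = alpha[s] / np.maximum(alpha[s].sum(), 1e-12)
    return alpha[s], v[s]

The Nash Equilibrium algorithm of the next subsection would call such a routine for every present source and blend old and new responses when oscillation is detected.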
Figure 1: Load-unload problem. The right is the state diagram. Cargo is loaded in state 1. Delivery to a boxed state results in reward from the source associated with that state. The left is the solution found. For state 5, from left to right are shown the ρ-values, the v-values, and the policy.

Figure 2: Transfer of the load-unload solution. Plots of the same values as in figure 1 but with the left source absent. No additional learning was allowed (the left side plots are the same). The votes, however, change, and thus so does the final policy.

3.3  Nash Equilibrium Algorithm

To find a Nash Equilibrium, we start with α_s(x) = β_s(x) and v_s(x, a) = ρ_s(x, a) and iterate to an equilibrium by repeatedly finding the best response for each source and simultaneously replacing the old solution with the new best responses. To prevent oscillation, whenever the change in α_s(x)v_s(x, a) grows from one step to the next, we replace the old solution with one halfway between the old and new solutions and continue the iteration.

4  Example Results

In all of these examples we used the same learning scheme. We ran the algorithm for a series of epochs. At each epoch, we calculated π using the Nash Equilibrium algorithm. With probability ε, we replace π with one chosen uniformly over the simplex of conditional distributions. This ensures some exploration. We follow π for a fixed number of time steps and record the average reward for each source. We add these average rewards and the empirical estimate of the policy followed as data to the least-squares estimate of the returns. We then repeat for the next epoch.
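A compact sketch of this loop (ours, not the paper's; env_rollout and nash_policy are hypothetical placeholders standing in for the environment simulation and the Nash Equilibrium computation of section 3.3) might look like:

import numpy as np

def run_epoch(env_rollout, nash_policy, datasets, n_steps, eps, n_obs, n_act, rng):
    # One epoch of the learning scheme described above (our paraphrase).
    # env_rollout(pi, n_steps) -> (avg_rewards, empirical_pi) runs the environment;
    # nash_policy() -> pi computes the current Nash Equilibrium policy.
    pi = nash_policy()
    if rng.random() < eps:
        # Exploration: a policy drawn uniformly over conditional distributions.
        pi = rng.dirichlet(np.ones(n_act), size=n_obs)
    avg_rewards, empirical_pi = env_rollout(pi, n_steps)
    for s, r in enumerate(avg_rewards):
        # Data point for the least-squares fit of source s's return (section 3.1).
        datasets[s].append((empirical_pi, r))
    return datasets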
4.1  Multiple Delivery Load-Unload Problem

We extend the classic load-unload problem to multiple receivers. The observation state is shown in figure 1. The hidden state is whether the agent is currently carrying cargo. Whenever the agent enters the top state (state 1), cargo is placed on the agent. Whenever the agent arrives in any of the boxed states while carrying cargo, the cargo is removed and the agent receives reward. For each boxed state, there is one reward source which only rewards for deliveries to that state (a reward of 1 for a delivery and 0 for all other time steps). In state 5, the agent has the choice of four actions, each of which moves the agent to the corresponding state without error. Since the agent can observe neither whether it has cargo nor its history, the optimal policy for state 5 is stochastic.

The algorithm set all α- and β-values to 0 for states other than state 5. We started ε at 0.5 and reduced it to 0.1 by the end of the run. We ran for 300 epochs of 200 iterations, by which point the algorithm consistently settled on the solution shown in figure 1. For each source, the algorithm found the best solution of randomly picking between the load state and the source's delivery state (as shown by the ρ-values). The votes are heavily weighted towards the delivery actions to overcome the other sources' preferences, resulting in an approximately uniform policy. The important point is that, without additional learning, the policy can be changed if the left source leaves. The learned α- and ρ-values are kept the same, but the Nash Equilibrium is different, resulting in the policy in figure 2.

Figure 3: One-way door state diagram. At every state there are two actions (right and left) available to the agent. In states 1, 9, 10, and 15, where there are only single outgoing edges, both actions follow the same edge. With probability 0.1, an action will actually follow the other edge. Source 1 rewards entering state 1 whereas source 2 rewards entering state 9.

Figure 4: One-way door solution. From left to right: the sources' ideal policies, the votes, and the final agent's policy. Light bars are for states for which both actions lead to the same state.

4.2  One-way Door Problem

In this case we consider the environment shown in figure 3. From each state the agent can move to the left or right except in states 1, 9, 10, and 15, where there is only one possible action. We can think of states 1 and 9 as one-way doors. Once the agent enters state 1 or 9, it may not pass back through except by going around through state 5. Source 1 gives reward when the agent passes through state 1. Source 2 gives reward when the agent passes through state 9. Actions fail (move in the opposite direction from the one intended) 0.1 of the time.

We ran the learning scheme for 1000 epochs of 100 iterations, starting ε at 0.5 and reducing it to 0.015 by the last epoch. The algorithm consistently converged to the solution shown in figure 4. Source 1 considers the left-side states (2-5 and 11-12) the most important while source 2 considers the right-side states (5-8 and 13-14) the most important. The ideal policies captured by the ρ-values show that source 1 wants the agent to move left and source 2 wants the agent to move right for the upper states (2-8), while the sources agree that for the lower states (11-14) the agent should move towards state 5. The votes reflect this preference and agreement. Both sources spend most of their vote on state 5, the state they both feel is important and on which they disagree. On the other states (states for which only one source has a strong opinion or on which they agree), they do not need to spend much of their vote. The resulting policy is the natural one: in state 5, the agent randomly picks a direction, after which it moves around the chosen loop quickly to return to state 5. Just as in the load-unload problem, if we remove one source, the agent automatically adapts to the ideal policy for the remaining source (with only one source, s_0, present, π(x, a) = ρ_{s_0}(x, a)).

Estimating the optimal policies and then taking the mixture of these two policies would produce a far worse result.
For states 2-8, both sources would have differing opinions and the mixture model would produce a uniform policy in those states; the agent would spend most of its time near state 5. Constructing a reward signal that is the sum of the sources' rewards does not lead to a good solution either. The agent will find that circling either the left or right loop is optimal and will have no incentive to ever travel along the other loop.

5  Conclusions

It is difficult to conceive of a method for providing a single reward signal that would result in the solution shown in figure 4 and still automatically change when one of the reward sources was removed. The biggest improvement in the algorithm will come from changing the form of the R̂_s^π estimator. For problems in which there is a single best solution, the KL-divergence measure seems to work well. However, we would like to be able to extend the load-unload result to the situation where the agent has a memory bit. In this case, the returns as a function of π are bimodal (due to the symmetry in the interpretation of the bit). In general, allowing each source's preference to be modelled in a more complex manner could help extend these results.

Acknowledgments

We would like to thank Charles Isbell, Tommi Jaakkola, Leslie Kaelbling, Michael Kearns, Satinder Singh, and Peter Stone for their discussions and comments.

This report describes research done within CBCL in the Department of Brain and Cognitive Sciences and in the AI Lab at MIT. This research is sponsored by grants from ONR contracts Nos. N00014-93-1-3085 & N00014-95-1-0600, and NSF contracts Nos. IIS-9800032 & DMS-9872936. Additional support was provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Chrysler, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.

References

[1] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proc. of the 15th International Conf. on Machine Learning, pages 242-250, 1998.
[2] C. L. Isbell, C. R. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. 2000. Submitted to Autonomous Agents 2001.
[3] J. Karlsson. Learning to Solve Multiple Goals. PhD thesis, University of Rochester, 1997.
[4] M. Kearns, Y. Mansour, and S. Singh. Fast planning in stochastic games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.
[5] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th International Conference on Machine Learning, pages 157-163, 1994.
[6] G. Owen. Game Theory. Academic Press, UK, 1995.
[7] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.
[8] S. P. Singh. The efficient learning of multiple task sequences. In NIPS, volume 4, 1992.
", "award": [], "sourceid": 1831, "authors": [{"given_name": "Christian", "family_name": "Shelton", "institution": null}]}