{"title": "The Steering Approach for Multi-Criteria Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1563, "page_last": 1570, "abstract": "", "full_text": "The Steering Approach for Multi-Criteria\n\nReinforcement Learning\n\nShie Mannor and Nahum Shimkin\nDepartment of Electrical Engineering\n\nTechnion, Haifa 32000, Israel\n\nfshie,shimking@ftx,eeg.technion.ac.il\n\nAbstract\n\nWe consider the problem of learning to attain multiple goals in a dynamic envi-\nronment, which is initially unknown.\nIn addition, the environment may contain\narbitrarily varying elements related to actions of other agents or to non-stationary\nmoves of Nature. This problem is modelled as a stochastic (Markov) game between\nthe learning agent and an arbitrary player, with a vector-valued reward function.\nThe objective of the learning agent is to have its long-term average reward vector\nbelong to a given target set. We devise an algorithm for achieving this task, which\nis based on the theory of approachability for stochastic games. This algorithm com-\nbines, in an appropriate way, a \ufb02nite set of standard, scalar-reward learning algo-\nrithms. Su\u2013cient conditions are given for the convergence of the learning algorithm\nto a general target set. The specialization of these results to the single-controller\nMarkov decision problem are discussed as well.\n\n1\n\nIntroduction\n\nThis paper considers an on-line learning problem for Markov decision processes with vector-valued\nrewards. Each entry of the reward vector represents a scalar reward (or cost) function which is\nof interest. Focusing on the long-term average reward, we assume that the desired performance is\nspeci\ufb02ed through a given target set, to which the average reward vector should eventually belong.\nAccordingly, the speci\ufb02ed goal of the decision maker is to ensure that the average reward vector will\nconverge to the target set. 
Following terminology from game theory, we refer to such convergence of the reward vector as approaching the target set.

A distinctive feature of our problem formulation is the possible incorporation of arbitrarily varying elements of the environment, which may account for the influence of other agents or non-stationary moves of Nature. These are collectively modelled as a second agent, whose actions may affect both the state transitions and the obtained rewards. This agent is free to choose its actions according to any control policy, and no prior assumptions are made regarding its policy.

This problem formulation is derived from the so-called theory of approachability, which was introduced in [3] in the context of repeated matrix games with vector payoffs. Using a geometric viewpoint, it characterizes the sets in the reward space that a player can guarantee for himself under any possible policy of the other player, and provides appropriate policies for approaching these sets. Approachability theory has been extended to stochastic (Markov) games in [14], and the relevant results are briefly reviewed in Section 2. In this paper we add the learning aspect, and consider the problem of learning such approaching policies on-line, using Reinforcement Learning (RL) or similar algorithms.

Approaching policies are generally required to be non-stationary. Their construction relies on a geometric viewpoint, whereby the average reward vector is 'steered' in the direction of the target set by the use of direction-dependent (and possibly stationary) control policies. To motivate the steering viewpoint, consider the following one-dimensional example of an automatic temperature-controlling agent. The measured property is the temperature, which should lie in some prescribed range [T_min, T_max]; the agent may activate a cooler or a heater at will. 
An obvious algorithm that achieves the prescribed temperature range is: when the average temperature is higher than T_max, choose a 'policy' that reduces it, namely activate the cooler; and if the average temperature is lower than T_min, use the heater. See Figure 1(a) for an illustration. Note that this algorithm is robust and requires little knowledge about the characteristics of the processes, as would be required by a procedure that tunes the heater or cooler for continuous operation. A learning algorithm need only determine which element to use in each of the two extreme regions.

Figure 1: (a) The one-dimensional temperature example. If the temperature is higher than T_max the control is to cool, and if the temperature is lower than T_min the control is to heat. (b) The two-dimensional temperature-humidity example. The learning directions are denoted by arrows; note that an infinite number of directions is to be considered.

Consider next a more complex multi-objective version of this controlling agent. The controller's objective is, as before, to keep the temperature in a certain range. One can add other parameters such as the average humidity, frequency of switching between policies, average energy consumption, and so on. This problem is naturally characterized as a multi-objective problem, in which the objective of the controller is to have the average reward in some target set. (Note that in this example the temperature itself is apparently the object of interest, rather than its long-term average. However, we can reformulate the temperature requirement as an average-reward objective by measuring the fraction of time that the temperature is outside the target range, and requiring this fraction to be zero. For the purpose of illustration we shall proceed here with the original formulation.) 
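The one-dimensional rule above amounts to a simple threshold test on the running average. The following is a minimal simulation sketch, under assumed first-order dynamics; the range endpoints, setpoints, gains, and noise level are illustrative values of ours, not from the text:

```python
import random

# Illustrative target range for the long-run average temperature.
T_MIN, T_MAX = 18.0, 22.0

def simulate(steps=20000, seed=0):
    """Steer the average temperature into [T_MIN, T_MAX] by switching
    between a 'cool' and a 'heat' policy, as in Figure 1(a)."""
    rng = random.Random(seed)
    temp, total, action = 30.0, 0.0, "cool"
    for n in range(steps):
        avg = total / n if n else temp          # running average so far
        if avg > T_MAX:
            action = "cool"                     # steer the average down
        elif avg < T_MIN:
            action = "heat"                     # steer the average up
        # inside the range the previous action is kept (arbitrary play)
        target = 10.0 if action == "cool" else 30.0
        # assumed first-order dynamics toward the active element's setpoint
        temp += 0.1 * (target - temp) + rng.gauss(0.0, 0.2)
        total += temp
    return total / steps
```

Running `simulate()` yields a long-run average inside (or very near) the prescribed range, despite the controller knowing nothing about the process dynamics.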
For example, suppose that the controller is also interested in the humidity. For the controlled environment of, say, a greenhouse, the allowed level of humidity depends on the average temperature. An illustrative target set is shown in Figure 1(b). A steering policy for the controller is not as simple anymore. In place of the two directions (left/right) of the one-dimensional case, we now face a continuum of possible directions, each associated with a possibly different steering policy. For the purpose of the proposed learning algorithm we shall need to consider only a finite number of steering policies. We will show that this can always be done, with negligible effect on the attainable performance.

The analytical basis for this work relies on three elements: stochastic game models, which capture the Markovian system dynamics while allowing arbitrary variation in some elements of the environment; the theory of approachability for vector-valued dynamic games, which provides the basis for the steering approach; and RL algorithms for (scalar) average-reward problems. For the sake of brevity, we do not detail the mathematical models and proofs, and concentrate on concepts.

Reinforcement Learning (RL) has emerged in the last decade as a unifying discipline for learning and adaptive control. Comprehensive overviews may be found in [2, 7]. RL for average-reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. Several methods exist for average-reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more.

The paper is organized as follows: In Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows us to consider only a finite number of directions for approaching a set. Section 3 describes the proposed multi-criteria RL algorithm and outlines its convergence proof. 
We also briefly discuss learning in multi-criteria single-controller environments, as this case is a special case of the more general game model. An illustrative example is briefly described in Section 4, and concluding remarks are drawn in Section 5.

2 Multi-Criteria Stochastic Games

In this section we present the multi-criteria stochastic game model. We recall some known results from approachability theory for stochastic games with vector-valued rewards, and state a key theorem which decomposes the problem of approaching a target set into a finite number of scalar control problems.

We consider a two-person average-reward stochastic game model with a vector-valued reward function. We refer to the players as P1 (the learning agent) and P2 (the arbitrary adversary). The game is defined by: the state space S; the sets of actions for P1 and P2 in each state s, denoted A and B, respectively; the state transition kernel P = (P(s'|s, a, b)); and a vector-valued reward function m: S × A × B → R^k. At each time epoch n ≥ 0, both players observe the current state s_n, and then P1 and P2 simultaneously choose actions a_n and b_n, respectively. As a result, P1 receives the reward vector m_n = m(s_n, a_n, b_n), and the next state is determined according to the transition probability P(·|s_n, a_n, b_n). More generally, we allow the actual reward m_n to be random, in which case m(s_n, a_n, b_n) denotes its mean and a bounded second moment is assumed. We further assume that both players observe the previous rewards and actions (however, in some of the learning algorithms below, the assumption that P1 observes P2's action may be relaxed). A policy π ∈ Π for P1 is a mapping which assigns to each possible observed history a mixed action in Δ(A), namely a probability vector over P1's action set A. 
A policy σ ∈ Σ for P2 is defined similarly. A policy of either player is called stationary if the mixed action it prescribes depends only on the current state s_n. Let m̂_n denote the average reward by time n:

m̂_n := (1/n) Σ_{t=0}^{n−1} m_t.

The following recurrence assumption will be imposed. Let s* denote a specific reference state to which a return is guaranteed, and define the hitting time of state s* as τ := min{n > 0 : s_n = s*}.

Assumption 1 (Recurrence) There exist a state s* ∈ S and a finite constant N such that

E_s^{πσ}(τ²) < N  for all π ∈ Π, σ ∈ Σ and s ∈ S,

where E_s^{πσ} is the expectation operator when starting from state s_0 = s and using policies π and σ for P1 and P2, respectively.

If the game is finite, then this assumption is satisfied if state s* is accessible from all other states under any pair of stationary deterministic policies [14]. We note that the recurrence assumption may be relaxed in a similar manner to [11].

Let u be a unit vector in the reward space R^k. We often consider the projected game in direction u, namely the zero-sum stochastic game with the same dynamics as above and scalar rewards r_n := m_n · u, where '·' stands for the standard inner product in R^k. Denote this game by Γ_s(u), where s is the initial state. The scalar stochastic game Γ_s(u) has a value, denoted v_{Γ_s}(u), if

v_{Γ_s}(u) = sup_π inf_σ liminf_{n→∞} E_s^{πσ}(m̂_n · u) = inf_σ sup_π limsup_{n→∞} E_s^{πσ}(m̂_n · u).

For finite games, the value exists [12]. Furthermore, under Assumption 1 the value is independent of the initial state and can be achieved in stationary policies [6]. 
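The projected game Γ(u) changes only the reward stream, not the dynamics, so the reduction to a scalar problem is a one-line projection. A small sketch (the helper names are ours, for illustration):

```python
# Reduction to the projected game in direction u: replace each vector
# reward m_t by the scalar r_t = m_t · u, then track the running average.

def dot(x, y):
    """Standard inner product in R^k."""
    return sum(a * b for a, b in zip(x, y))

def projected_rewards(vector_rewards, u):
    """Scalar reward stream r_t = m_t · u of the projected game Γ(u)."""
    return [dot(m, u) for m in vector_rewards]

def running_average(rewards):
    """The averages r̂_n = (1/n) Σ_{t<n} r_t, for n = 1, 2, ..."""
    total, avgs = 0.0, []
    for n, r in enumerate(rewards, start=1):
        total += r
        avgs.append(total / n)
    return avgs
```

Any standard scalar average-reward learner can then be run on the projected stream, one learner per direction u.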
We henceforth simply write v_Γ(u) for this value.

We next consider the task of approaching a given target set in the reward space, and introduce approaching policies for the case where the game parameters are fully known to P1. Let T ⊂ R^k denote the target set. In the following, d is the Euclidean distance in R^k, and P_{π,σ}^s is the probability measure induced by the policies π and σ, with initial state s.

Definition 2.1 The set T ⊂ R^k is approachable (from initial state s) if there exists a T-approaching policy π* of P1 such that d(m̂_n, T) → 0, P_{π*,σ}^s-a.s., for every σ ∈ Σ, at a uniform rate over Σ.

The policy π* in this definition will be called an approaching policy for P1. A set is approachable if it is approachable from all states. Noting that approaching a set and its closure are the same, we shall henceforth suppose that the set T is closed.

We recall the basic results from [14] regarding approachability for known stochastic games, which generalize Blackwell's conditions for repeated matrix games. Let

ℓ(π, σ) := E_{s*}^{π,σ}(Σ_{t=0}^{τ−1} m_t) / E_{s*}^{π,σ}(τ)    (1)

denote the average per-cycle reward vector, which is the expected total reward over the cycle that starts and ends in the reference state, divided by the expected duration of that cycle. For any x ∉ T, denote by C_x a closest point in T to x, and let u_x be the unit vector in the direction of C_x − x, which points from x to the target set T; see Figure 2 for an illustration.

Theorem 2.1 [14] Let Assumption 1 hold. Assume that for every point x ∉ T there exists a policy π(x) satisfying condition (2) below. Then T is approachable by P1. 
An approaching policy is: If s_n = s* and m̂_n ∉ T, play π(m̂_n) until the next visit to state s*; otherwise, play arbitrarily.

(ℓ(π(x), σ) − C_x) · u_x ≥ 0  for all σ ∈ Σ.    (2)

Figure 2: An illustration of approachability. π(x) brings P1 to the other side of the hyperplane perpendicular to the segment between C_x and x.

Geometrically, the condition in (2) means that P1 can ensure, irrespective of P2's policy, that the average per-cycle reward will be on the other side (relative to x) of the hyperplane which is perpendicular to the line segment that points from x to C_x. We shall refer to the direction u_x as the steering direction from point x, and to the policy π(x) as the steering policy from x. The approaching policy uses the following rule: between successive visits to the reference state, a fixed (possibly stationary) policy is used. When in the reference state, the current average reward vector m̂_n is inspected. If this vector is not in T, then the steering policy that satisfies (2) with x = m̂_n is selected for the next cycle. Consequently, the average reward is steered towards T, and eventually converges to it.

Recalling the definition of the projected game in direction u and its value v_Γ(u), the condition in (2) may be equivalently stated as v_Γ(u_x) ≥ C_x · u_x. Furthermore, the policy π(x) can always be chosen as the stationary policy which is optimal for P1 in the game Γ(u_x). In particular, the steering policy π(x) needs to depend only on the corresponding steering direction u_x. It can be shown that for convex target sets, the condition of the last theorem is both sufficient and necessary.

Standard approachability results, as outlined above, require considering an infinite number of steering directions whenever the reward is non-scalar. 
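The geometric steering step, finding the closest point C_x and the steering direction u_x, is a Euclidean projection. A sketch for the special case where T is an axis-aligned box (a simple convex stand-in for the target sets discussed here; the helper names are ours):

```python
import math

def closest_point_in_box(x, lo, hi):
    """C_x: Euclidean projection of x onto the box [lo_1,hi_1]x...x[lo_k,hi_k]."""
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

def steering_direction(x, lo, hi):
    """u_x: unit vector from x toward C_x, or None if x already lies in T."""
    c = closest_point_in_box(x, lo, hi)
    d = [ci - xi for ci, xi in zip(c, x)]
    norm = math.sqrt(sum(di * di for di in d))
    if norm == 0.0:
        return None          # x is inside the target set
    return [di / norm for di in d]
```

For a general convex T the same two quantities are obtained from the convex projection of x onto T; the box case keeps the computation to a coordinate-wise clamp.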
The corresponding set of steering policies may turn out to be infinite as well. For the purpose of our learning scheme, we shall require an approaching policy which relies on a finite set of steering directions and policies. The following results show that this can indeed be done, possibly requiring a slight expansion of the target set. In the following, let M be an upper bound on the magnitude of the expected one-stage reward vector, so that ‖m(s, a, b)‖ < M for all (s, a, b), where ‖·‖ denotes the Euclidean norm. We say that a set of vectors (u_1, ..., u_J) is an ε-cover of the unit ball if for every vector u in the unit ball there exists a vector u_i such that ‖u_i − u‖ ≤ ε.

Theorem 2.2 Let Assumption 1 hold and suppose that the target set T ⊂ R^k satisfies condition (2). Fix ε > 0. Let {u_1, ..., u_J} be an ε/M cover of the unit ball. Suppose that π_i is an optimal strategy in the scalar game Γ(u_i) (1 ≤ i ≤ J). Then the following policy approaches T^ε, the ε-expansion of T: If s_n = s* and m̂_n ∉ T^ε, then choose j so that u_{m̂_n} is closest to u_j (in Euclidean norm) and play π_j until the next visit to state s*; otherwise, play arbitrarily.

Proof: (Outline) The basic observation is that if two directions u and u_i are close, then v_Γ(u) and v_Γ(u_i) are close. Consequently, playing a strategy which is optimal in Γ(u_i) results in a play which is almost optimal in Γ(u). Finally, we can apply Theorem 2.1 (Blackwell's condition) to the expansion of T, by noticing that a 'good enough' strategy is played in every direction.

Remark: It follows immediately from the last theorem that the set T itself (rather than its ε-expansion) is approachable with a finite number of steering directions if T^{−ε}, the ε-shrinkage of T, satisfies (2). 
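For k = 2 the finite cover required by Theorem 2.2 can be built from equally spaced angles on the unit circle. A sketch of this construction (the function names and the particular spacing bound are ours, for illustration):

```python
import math

def eps_cover_2d(eps):
    """J equally spaced unit directions whose cover radius is at most eps
    (0 < eps < 2): the nearest direction to any unit vector is within
    pi/J radians, hence within chord length 2*sin(pi/(2J)) <= pi/J <= eps."""
    J = max(3, math.ceil(math.pi / eps))
    return [(math.cos(2 * math.pi * j / J), math.sin(2 * math.pi * j / J))
            for j in range(J)]

def closest_direction(u, cover):
    """Index j minimizing ||u_j - u||, used to select the steering policy pi_j."""
    return min(range(len(cover)),
               key=lambda j: (cover[j][0] - u[0]) ** 2
                           + (cover[j][1] - u[1]) ** 2)
```

In dimension k the analogous construction needs on the order of (1/ε)^{k−1} directions, which is the source of the exponential scaling noted in the conclusion.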
Equivalently, T is required to satisfy (2) with the 0 on the right-hand side replaced by ε > 0.

3 The Multi-Criteria Reinforcement Learning Algorithm

In this section we introduce the MCRL (Multi-Criteria Reinforcement Learning) algorithm and prove its convergence. We consider the controlled Markov model of Section 2, but here we assume that P1, the learning agent, does not know the model parameters, namely the state transition probabilities and the reward functions. A policy of P1 that does not rely on knowledge of these parameters will be referred to as a learning policy. P1's task is to approach a given target set T, namely to ensure convergence of the average reward vector to this set irrespective of P2's actions.

The proposed learning algorithm relies on the previous section's construction of approaching policies with a finite number of steering directions. The main idea is to apply a (scalar) learning algorithm for each of the projected games Γ(u_j) corresponding to these directions. Recall that each such game is a standard zero-sum stochastic game with average reward. The required learning algorithm for game Γ(u) should secure an average reward that is not less than the value v_Γ(u) of that game.

Consider a zero-sum stochastic game with reward function r(s, a, b), average reward r̂_n and value v. Assume for simplicity that the initial state is fixed. We say that a learning policy π of P1 is ε-optimal in this game if, for any policy σ of P2, the average reward satisfies

liminf_{n→∞} r̂_n ≥ v − ε,  P_{πσ}-a.s.,

where P_{πσ} is the probability measure induced by the algorithm π, P2's policy σ, and the game dynamics. 
Note that P1 may be unable to learn a min-max policy, as P2 may play an inferior policy and refrain from playing certain actions, thereby keeping some parts of the game unobserved.

Remark: RL for average-reward zero-sum stochastic games can be devised in a similar manner to average-reward Markov decision processes. For example, a Q-learning based algorithm which combines the ideas of [9] with those of [1] can be devised. An additional assumption that is needed for the analysis is that all actions of both players are used infinitely often. A different type of scalar algorithm that overcomes this problem is given in [4]. The algorithm there is similar to the E^3 algorithm [8], which is based on an explicit exploration-exploitation tradeoff and estimation of the game reward and transition structure.

We now describe the MCRL algorithm, which nearly approaches any target set T that satisfies (2). The parameters of the algorithm are ε and M: ε is the approximation level, and M is a known bound on the norm of the expected reward per step. The goal of the algorithm is to approach T^ε, the ε-expansion of T. There are J learning algorithms that are run in parallel, denoted π_1, ..., π_J. The MCRL algorithm is described in Figure 3 and is given here as a meta-algorithm (the scalar RL algorithms π_i are not specified). When arriving at s*, the decision maker checks whether the average reward vector is outside the set T^ε. In that case, he switches to an appropriate policy that is intended to steer the average reward vector towards the target set. The steering policy π_j is chosen according to the closest direction u_j to the actual direction needed by the problem geometry. Recall that each π_j is actually a learning policy with respect to a scalar reward function. In general, when π_j is not played, its learning pauses and the process history during that time is ignored. 
Note, however, that some 'off-policy' algorithms (such as Q-learning) can learn the optimal policy even while playing a different policy. In that case a more efficient version of MCRL is suggested, in which learning is performed by all learning policies π_j continuously and concurrently.

0. Let u_1, ..., u_J be an ε/2M cover of the unit ball. Initialize J different ε/2-optimal scalar algorithms π_1, ..., π_J.
1. If s_0 ≠ s*, play arbitrarily until s_n = s*.
2. (s_n = s*) If m̂_n ∈ T^ε, go to step 1. Else let i = argmin_{1≤i≤J} ‖u_i − u_{m̂_n}‖₂.
3. While s_n ≠ s*, play according to π_i; the reward π_i receives is u_i · m_n.
4. When s_n = s*, go to step 2.

Figure 3: The MCRL algorithm

Theorem 3.1 Suppose that Assumption 1 holds and the MCRL algorithm is used with ε-optimal scalar learning algorithms. If the target set T satisfies (2), then T^ε is approached using MCRL.

Proof: (Outline) If a direction is played infinitely often, then eventually the learned strategy in this direction is nearly optimal. If a direction is not played infinitely often, it has a negligible effect on the long-term average reward vector. Since the learning algorithms are nearly optimal, any policy π_j that is played infinitely often eventually attains a (scalar) average reward of v_Γ(u_j) − ε/2. One can apply Theorem 2.2 to the set T^{ε/2} to verify that the overall policy is an approaching policy for the target set.

Note that for convex target sets the algorithm is consistent, in the sense that if the set is approachable then the algorithm attains it.

Remark: Multi-criteria Markov Decision Process (MDP) models may be regarded as a special case of the stochastic game model considered so far, with P2 eliminated from the problem. The MCRL meta-algorithm of the previous section remains the same for MDPs. 
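The control flow of Figure 3 can be sketched as follows. The environment and the scalar learners here are toy stand-ins invented for illustration (a single-state "game" where every step visits s*, and a greedy running-average learner), so only the meta-level steering logic mirrors the figure:

```python
import math

class SingleStateEnv:
    """Toy stand-in: P1 repeatedly picks one of four unit reward vectors;
    the single state doubles as the reference state s*."""
    actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
    ref_state = state = 0
    def step(self, a):
        return self.actions[a]

class GreedyLearner:
    """Crude scalar learner: running-average value per action, rare exploration."""
    def __init__(self, n_actions):
        self.q = [0.0] * n_actions
        self.cnt = [1] * n_actions
        self.t = self.last = 0
    def act(self, state):
        self.t += 1
        if self.t % 10 == 0:                      # occasional exploration
            self.last = (self.t // 10) % len(self.q)
        else:
            self.last = max(range(len(self.q)), key=lambda a: self.q[a])
        return self.last
    def observe(self, r):
        a = self.last
        self.cnt[a] += 1
        self.q[a] += (r - self.q[a]) / self.cnt[a]

def mcrl(env, learners, cover, in_target, closest_in_target, steps):
    """Steps 1-4 of Figure 3: at each visit to s*, if the average reward is
    outside the target, activate the learner of the nearest cover direction."""
    total = [0.0, 0.0]
    i = 0                                         # index of the active learner
    for n in range(steps):
        if env.state == env.ref_state and n > 0:  # steering decision at s*
            avg = [x / n for x in total]
            if not in_target(avg):
                c = closest_in_target(avg)
                d = [ci - ai for ci, ai in zip(c, avg)]
                norm = math.hypot(*d)
                u = [di / norm for di in d]
                i = min(range(len(cover)), key=lambda j: sum(
                    (cover[j][k] - u[k]) ** 2 for k in range(2)))
        m = env.step(learners[i].act(env.state))  # vector reward m_n
        # the active learner sees only the projected scalar reward u_i . m_n
        learners[i].observe(sum(uk * mk for uk, mk in zip(cover[i], m)))
        total = [x + mk for x, mk in zip(total, m)]
    return [x / steps for x in total]

def demo(steps=50000):
    """Steer the average reward into the (hypothetical) box [0.2, 0.3]^2."""
    cover = [(math.cos(2 * math.pi * j / 8), math.sin(2 * math.pi * j / 8))
             for j in range(8)]
    inside = lambda v: all(0.2 <= x <= 0.3 for x in v)
    clamp = lambda v: [min(max(x, 0.2), 0.3) for x in v]
    env = SingleStateEnv()
    learners = [GreedyLearner(len(env.actions)) for _ in cover]
    return mcrl(env, learners, cover, inside, clamp, steps)
```

Even with this crude learner, the steering loop drives the long-run average reward toward the target box, illustrating that MCRL's guarantees rest on the steering geometry rather than on any particular scalar algorithm.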
Its constituent scalar learning algorithms are now learning algorithms for the optimal policies in average-reward MDPs. These are generally simpler than for the game problem. Examples of optimal or ε-optimal algorithms are Q-learning with persistent exploration [2], actor-critic schemes [2], an appropriate version of the E^3 algorithm [8], and others. In the absence of an adversary, the problem of approaching a set becomes much simpler. Moreover, it can be shown that if a set is approachable, then it may be approached using a stationary (possibly randomized) policy. Indeed, any point in the feasible set of state-action frequencies may be achieved by such a stationary policy [5]. Thus, alternative learning schemes may be applicable to this problem. Another observation is that all steering policies learned and used within MCRL may now be deterministic stationary policies, which simplifies the implementation of the algorithm.

4 Example

Recall the humidity-temperature example from the introduction. Suppose that the system is modelled in such a way that P1 chooses a temperature-humidity curve, and then Nature (modelled as P2) chooses the exact location on that curve. In Figure 4(a) we show three different temperature-humidity curves that can be determined by P1 (each defined by a certain strategy of P1: f_0, f_1, f_2). We implemented the MCRL algorithm with nine directions. In each direction a version of Littman's Q-learning [9], adapted for average-cost games, was used. A sample path of the average reward generated by the MCRL algorithm is shown in Figure 4(b). The sample path starts at 'S' and finishes at 'E'. For this specific run, an even smaller number of directions would have sufficed (up and right). It can be seen that the learning algorithm pushes towards the target set, so that the path is mostly on the edge of the target set. 
Note that in this example a small number of directions was quite enough for approaching the target set.

Figure 4: (a) Greenhouse problem dynamics for the different strategies f_0, f_1, f_2 (humidity vs. temperature). (b) A sample path of the average reward from 'S' to 'E'.

5 Conclusion

We have presented a learning algorithm that approaches a prescribed target set in a multi-dimensional performance space, provided this set satisfies a certain sufficient condition. Our approach essentially relies on the theory of approachability for stochastic games, which is based on the idea of steering the average reward vector towards the target set. We provided a key result stating that a set can be approached to a given precision using only a finite number of steering policies, which may be learned on-line.

An interesting observation regarding the proposed learning algorithm is that the learned optimal policies in each direction are essentially independent of the target set T. Thus, the target set need not be fixed in advance and may be modified on-line without requiring a new learning process. This may be especially useful for constrained MDPs.

Of further interest is the question of reducing the number of steering directions used in the algorithm. In some cases, especially when the requirements embodied by the target set T are not stringent, this number may be quite small compared to the worst-case estimate used above. A possible refinement of the algorithm is to eliminate directions that are not required.

The scaling of the algorithm with the dimension of the reward space is exponential. 
The problem is that as the dimension increases, exponentially many directions are needed to cover the unit ball. While in general this is necessary, it may happen that considerably fewer directions suffice. Conditions and algorithms that use far fewer than exponentially many directions are under current study.

Acknowledgement

This research was supported by the fund for the promotion of research at the Technion.

References

[1] J. Abounadi, D. Bertsekas, and V. Borkar. Learning algorithms for Markov decision processes with average cost. LIDS-P 2434, Lab. for Info. and Decision Systems, MIT, October 1998.

[2] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[3] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific J. Math., 6(1):1-8, 1956.

[4] R.I. Brafman and M. Tennenholtz. A near optimal polynomial time algorithm for learning in certain classes of stochastic games. Artificial Intelligence, 121(1-2):31-47, April 2000.

[5] C. Derman. Finite State Markovian Decision Processes. Academic Press, 1970.

[6] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Verlag, 1996.

[7] L.P. Kaelbling, M. Littman, and A.W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285, May 1996.

[8] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. In Proc. of the 15th Int. Conf. on Machine Learning, pages 260-268. Morgan Kaufmann, 1998.

[9] M.L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th Int. Conf. on Machine Learning, pages 157-163. Morgan Kaufmann, 1994.

[10] S. Mahadevan. Average reward reinforcement learning: foundations, algorithms, and empirical results. Machine Learning, 22(1):159-196, 1996.

[11] S. Mannor and N. Shimkin. 
The empirical Bayes envelope approach to regret minimization in stochastic games. Technical Report EE-1262, Faculty of Electrical Engineering, Technion, Israel, October 2000.

[12] J.F. Mertens and A. Neyman. Stochastic games. International Journal of Game Theory, 10(2):53-66, 1981.

[13] A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proc. of the 10th Int. Conf. on Machine Learning, pages 298-305. Morgan Kaufmann, 1993.

[14] N. Shimkin and A. Shwartz. Guaranteed performance regions in Markovian systems with competing decision makers. IEEE Trans. on Automatic Control, 38(1):84-95, January 1993.
", "award": [], "sourceid": 1986, "authors": [{"given_name": "Shie", "family_name": "Mannor", "institution": null}, {"given_name": "Nahum", "family_name": "Shimkin", "institution": null}]}