{"title": "Exponentially Weighted Imitation Learning for Batched Historical Data", "book": "Advances in Neural Information Processing Systems", "page_first": 6288, "page_last": 6297, "abstract": "We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or ``environment oracle'' as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space. The method does not rely on the knowledge of the behavior policy, thus can be used to learn from data generated by an unknown policy. Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.", "full_text": "Exponentially Weighted Imitation Learning for\n\nBatched Historical Data\n\nQing Wang1\n\nJiechao Xiong1 Lei Han1 Peng Sun1 Han Liu12 Tong Zhang1\n\n1Tencent AI Lab\n\n2Northwestern University\n\n{drwang, jcxiong, lxhan, pythonsun}@tencent.com\n\nhanliu@northwestern.edu, tongzhang@tongzhang-ml.org\n\nAbstract\n\nWe consider deep policy learning with only batched historical trajectories. The\nmain challenge of this problem is that the learner no longer has a simulator or\n\u201cenvironment oracle\u201d as in most reinforcement learning settings. To solve this\nproblem, we propose a monotonic advantage reweighted imitation learning strategy\nthat is applicable to problems with complex nonlinear function approximation\nand works well with hybrid (discrete and continuous) action space. The method\ndoes not rely on the knowledge of the behavior policy, thus can be used to learn\nfrom data generated by an unknown policy. 
Under mild conditions, our algorithm, though surprisingly simple, has a policy improvement bound and outperforms most competing methods empirically. Thorough numerical results are also provided to demonstrate the efficacy of the proposed methodology.

1 Introduction

In this article, we consider the problem of learning a deep policy from batched historical trajectories. This problem is both important and challenging: in many real-world tasks we have abundant historical data generated by different policies, but lack a perfect simulator of the environment. In this case, we want to learn a good policy from these data, in order to make decisions in a complex environment with a possibly continuous state space and a hybrid action space of discrete and continuous parts.

Several existing fields of research concern the problem of policy learning from batched data. In particular, imitation learning (IL) aims to find a policy whose performance is close to that of the data-generating policy [Abbeel and Ng, 2004]. On the other hand, off-policy reinforcement learning (RL) concerns the problem of learning a good (or possibly better) policy from data collected by a behavior policy [Sutton and Barto, 1998]. However, to the best of our knowledge, previous methods either do not achieve satisfactory performance or are not directly applicable in a complex environment such as ours, with a continuous state space and a hybrid action space.

In this work, we propose a novel yet simple method that imitates a better policy by monotonic advantage reweighting.
From theoretical analysis and empirical results, we find that the proposed method has several advantages:

• From theoretical analysis, we show that the proposed algorithm has a policy improvement lower bound under mild conditions.

• Empirically, the proposed method works well with function approximation and hybrid action spaces, which is crucial for the success of deep RL in practical problems.

• For off-policy learning, the method does not rely on knowing the action probabilities of the behavior policy, thus can be used to learn from data generated by an unknown policy, and is robust when the current policy deviates from the behavior policy.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In our real-world problem of a complex MOBA game, the proposed method has been successfully applied to human replay data, which validates its effectiveness.

The article is organized as follows: We first state some preliminaries (Sec. 2) and related work (Sec. 3). We then present our main method of imitating a better policy (Sec. 4), with theoretical analysis (Sec. 5) and empirical experiments (Sec. 6). Finally we conclude our discussion (Sec. 7).

2 Preliminaries

Consider an infinite-horizon Markov decision process (MDP), denoted by $M = (\mathcal{S}, \mathcal{A}, P, r, d_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability defined on $\mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, $r$ is the reward function $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, $d_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. A trajectory $\tau$ is a sequence of triplets of state, action and reward, i.e., $\tau = \{(s_t, a_t, r_t)\}_{t=1,\dots,T}$, where $T$ is the terminal step number. A stochastic policy, denoted by $\pi$, is defined as $\mathcal{S} \times \mathcal{A} \to [0, 1]$.
We use the following standard notation for the state-value $V^\pi(s_t)$, action-value $Q^\pi(s_t, a_t)$ and advantage $A^\pi(s_t, a_t)$, defined as
$$V^\pi(s_t) = \mathbb{E}_{\pi|s_t} \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}), \qquad Q^\pi(s_t, a_t) = \mathbb{E}_{\pi|s_t, a_t} \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}),$$
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t),$$
where $\mathbb{E}_{\pi|s_t}$ means $a_l \sim \pi(a|s_l)$, $s_{l+1} \sim P(s_{l+1}|s_l, a_l)$, $\forall l \ge t$, and $\mathbb{E}_{\pi|s_t, a_t}$ means $s_{l+1} \sim P(s_{l+1}|s_l, a_l)$, $a_{l+1} \sim \pi(a|s_{l+1})$, $\forall l \ge t$. As the state space $\mathcal{S}$ may be prohibitively large, we approximate the policy and state-value with parameterized forms $\pi_\theta(s, a)$ and $V^\pi_\theta(s)$ with parameter $\theta \in \Theta$. We denote the original policy space as $\Pi = \{\pi \mid \pi(s, a) \in [0, 1], \sum_{a \in \mathcal{A}} \pi(s, a) = 1, \forall s \in \mathcal{S}, a \in \mathcal{A}\}$ and the parametrized policy space as $\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\}$.

To measure the similarity between two policies $\pi$ and $\pi'$, we consider the Kullback–Leibler (KL) divergence and total variation (TV) distance defined as
$$D^d_{\mathrm{KL}}(\pi'\|\pi) = \sum_s d(s) \sum_a \pi'(a|s) \log \frac{\pi'(a|s)}{\pi(a|s)}, \qquad D^d_{\mathrm{TV}}(\pi', \pi) = \frac{1}{2} \sum_s d(s) \sum_a |\pi'(a|s) - \pi(a|s)|,$$
where $d(s)$ is a probability distribution over states. The performance of a policy $\pi$ is measured by its expected discounted reward:
$$\eta(\pi) = \mathbb{E}_{d_0, \pi} \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t),$$
where $\mathbb{E}_{d_0, \pi}$ means $s_0 \sim d_0$, $a_t \sim \pi(a_t|s_t)$, and $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$. We omit the subscript $d_0$ when there is no ambiguity.
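To make the definitions above concrete, the two divergences and a single-trajectory sample of the discounted return can be computed directly for a small tabular example. The following is an illustrative sketch (the numbers and function names are ours, not from the paper):

```python
import numpy as np

def kl_divergence(d, pi_new, pi_old):
    # D^d_KL(pi'||pi) = sum_s d(s) sum_a pi'(a|s) log(pi'(a|s) / pi(a|s))
    return float(np.sum(d[:, None] * pi_new * np.log(pi_new / pi_old)))

def tv_distance(d, pi_new, pi_old):
    # D^d_TV(pi', pi) = (1/2) sum_s d(s) sum_a |pi'(a|s) - pi(a|s)|
    return float(0.5 * np.sum(d[:, None] * np.abs(pi_new - pi_old)))

def discounted_return(rewards, gamma):
    # sum_t gamma^t r_t, one Monte-Carlo sample of eta(pi)
    return float(sum(gamma**t * r for t, r in enumerate(rewards)))

# Hypothetical 2-state, 2-action example; rows are states.
d      = np.array([0.6, 0.4])                  # state distribution d(s)
pi     = np.array([[0.5, 0.5], [0.9, 0.1]])    # pi(a|s)
pi_new = np.array([[0.7, 0.3], [0.9, 0.1]])    # pi'(a|s)

print(kl_divergence(d, pi_new, pi))            # > 0; zero iff policies coincide
print(tv_distance(d, pi_new, pi))              # 0.5 * 0.6 * (0.2 + 0.2) = 0.12
print(discounted_return([1.0, 1.0, 1.0], 0.9)) # 1 + 0.9 + 0.81 = 2.71
```

Both divergences vanish when the two policies coincide and grow as they separate, matching the definitions above.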
In [Kakade and Langford, 2002], a useful equation has been proved:
$$\eta(\pi') - \eta(\pi) = \frac{1}{1 - \gamma} \sum_s d_{\pi'}(s) \sum_a \pi'(a|s) A^\pi(s, a),$$
where $d_\pi$ is the discounted visitation frequency defined as $d_\pi(s) = (1 - \gamma) \mathbb{E}_{d_0, \pi} \sum_{t=0}^{\infty} \gamma^t \mathbf{1}(s_t = s)$, and $\mathbf{1}(\cdot)$ is an indicator function. In addition, define $L_{d, \pi}(\pi')$ as
$$L_{d, \pi}(\pi') = \frac{1}{1 - \gamma} \sum_s d(s) \sum_a \pi'(a|s) A^\pi(s, a);$$
then from [Schulman et al., 2015, Theorem 1], the difference between $\eta(\pi')$ and $\eta(\pi)$ can be approximated by $L_{d_\pi, \pi}(\pi')$, where the approximation error is bounded by the total variation $D^{d_\pi}_{\mathrm{TV}}(\pi', \pi)$, which can be further bounded by $D^{d_\pi}_{\mathrm{KL}}(\pi'\|\pi)$ or $D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi')$.

In the following sections, we mainly focus on maximizing $L_{d_\pi, \pi}(\pi_\theta)$ as a proxy for optimizing the policy performance $\eta(\pi_\theta)$, for $\pi_\theta \in \Pi_\Theta$.

3 Related Work

Off-policy learning [Sutton and Barto, 1998] is a broad area of research. Among policy improvement methods with performance guarantees, conservative policy iteration [Kakade and Langford, 2002] and safe policy iteration [Pirotta et al., 2013] have long been topics of interest in the literature. The terms "safe" and "conservative" usually mean that the algorithm is guaranteed to produce a series of monotonically improved policies. Exact or high-probability bounds on policy improvement are often provided in these previous works [Thomas and Brunskill, 2016, Jiang and Li, 2016, Thomas et al., 2015, Ghavamzadeh et al., 2016]. We refer readers to [García and Fernández, 2015] for a comprehensive survey of safe RL.
However, to the best of our knowledge, these prior methods cannot be directly applied to our problem of learning in a complex game environment with large-scale replay data, as they either require full knowledge of the MDP or mainly consider the tabular case of finite states and discrete actions, with prohibitive computational complexity.

Constrained policy optimization problems in the parameter space are considered in previous works [Schulman et al., 2015, Peters et al., 2010]. In [Peters et al., 2010], the policy is constrained on the distribution $p_\pi(s, a) = \mu_\pi(s)\pi(a|s)$, while in [Schulman et al., 2015], the constraint is on $\pi(a|s)$, with fixed state-wise weight $d(s)$. Moreover, [Schulman et al., 2015] considers $D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta)$ as a policy divergence constraint, while [Peters et al., 2010] considers $D_{\mathrm{KL}}(\mu_\pi \pi \| q)$. The connection with our proposed method is elaborated in Appendix B.1. A closely related work is [Abdolmaleki et al., 2018], which presents exponential advantage weighting from an EM perspective. Independently, we further generalize it to monotonic advantage reweighting and also derive a lower bound for imitation learning.

Besides off-policy policy iteration algorithms, value iteration algorithms can also be used in off-policy settings. For deep reinforcement learning, DQN [Mnih et al., 2013] and DQfD [Hester et al., 2018] work primarily with discrete actions, while DDPG [Lillicrap et al., 2016] works well with continuous actions. For hybrid action spaces, there are also works combining the ideas of DQN and DDPG [Hausknecht and Stone, 2016]. In our preliminary experiments, we found that value iteration methods failed to converge for the tasks in the HFO environment.
It seems that the discrepancy between the behavior policy and the target policy (the arg max policy in DQN) should be properly restrained, which we think is worth further research and investigation.

There are also related methods in the field of imitation learning. For example, when expert data is available, we can learn a policy directly by predicting the expert action [Bain and Sommut, 1999, Ross et al., 2011]. Another related idea is to imitate an MCTS policy [Guo et al., 2014, Silver et al., 2016]. In [Silver et al., 2016], the authors propose to use Monte-Carlo Tree Search (MCTS) to form a new policy $\tilde\pi = \mathrm{MCTS}(\pi)$, where $\pi$ is the base policy of the network, and then imitate the better policy $\tilde\pi$ by minimizing $D_{\mathrm{KL}}(\tilde\pi\|\pi_\theta)$. Similarly, in [Guo et al., 2014], the authors use UCT as a policy improvement operator and generate data from $\tilde\pi = \mathrm{UCT}(\pi)$, then perform regression or classification on the dataset, which can be seen as approximating the policy under a normal or multinomial distribution parametrization.

4 Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)

The most straightforward way to learn a policy from data is imitation learning (behavior cloning). Suppose we have state-action pairs $(s_t, a_t)$ generated by a behavior policy $\pi$; we can then minimize the KL divergence between $\pi$ and $\pi_\theta$. Specifically, we would like to minimize
$$D^d_{\mathrm{KL}}(\pi\|\pi_\theta) = -\mathbb{E}_{s \sim d(s), a \sim \pi(a|s)}\left(\log \pi_\theta(a|s) - \log \pi(a|s)\right) \tag{1}$$
under some state distribution $d(s)$. However, this method makes no distinction between "good" and "bad" actions: the learned $\pi_\theta$ simply imitates all the actions generated by $\pi$. If we also have the rewards $r_t$ in the data, we can know the consequence of taking action $a_t$ by looking at the future state $s_{t+1}$ and reward $r_t$.
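For reference, the ordinary behavior-cloning objective (1) amounts to maximum-likelihood training on the logged state-action pairs. A minimal tabular sketch (the logged data, step size, and iteration count here are hypothetical illustration values, not from the paper):

```python
import numpy as np

# Hypothetical logged (s_t, a_t) pairs from an unknown behavior policy pi.
states  = np.array([0, 0, 0, 1, 1])
actions = np.array([1, 1, 0, 0, 0])

theta = np.zeros((2, 2))  # logits theta[s, a] of a tabular softmax policy

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Gradient ascent on E log pi_theta(a|s); the gradient of the log-softmax
# is (one-hot(a) - pi_theta(.|s)), accumulated at the visited state.
for _ in range(2000):
    p = softmax(theta[states])                     # (N, 2)
    grad = np.zeros_like(theta)
    np.add.at(grad, states, np.eye(2)[actions] - p)
    theta += 0.1 * grad / len(states)

p = softmax(theta)
# pi_theta approaches the empirical action frequencies of the data,
# e.g. p[0, 1] is close to 2/3: it imitates all actions, good or bad alike.
```

This is exactly the failure mode discussed above: without reward information, every logged action is weighted equally.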
Suppose we have an estimate $\hat A^\pi(s_t, a_t)$ of the advantage of action $a_t$. We can then put a higher sample weight on actions with higher advantage, thus imitating good actions more often. Inspired by this idea, we propose a monotonic advantage reweighted imitation learning method (Algorithm 1), which maximizes
$$\mathbb{E}_{s \sim d_\pi(s), a \sim \pi(a|s)}\, \exp(\beta \hat A^\pi(s, a)) \log \pi_\theta(a|s), \tag{2}$$
where $\beta$ is a hyper-parameter. When $\beta = 0$ the algorithm degenerates to ordinary imitation learning.

Algorithm 1 Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
  Input: Historical data $D$ generated by $\pi$; hyper-parameter $\beta$.
  For each trajectory $\tau$ in $D$, estimate the advantages $\hat A^\pi(s_t, a_t)$ for $t = 1, \dots, T$.
  Maximize $\mathbb{E}_{(s_t, a_t) \in D}\, \exp(\beta \hat A^\pi(s_t, a_t)) \log \pi_\theta(a_t|s_t)$ with respect to $\theta$.

Ideally, we would like to estimate the advantage function $A(s_t, a_t) = \mathbb{E}_{\pi|s_t, a_t}(R_t - V^\pi(s_t))$ using the cumulative discounted future reward $R_t = \sum_{l=t}^{T} \gamma^{l-t} r_l$. For example, one possible solution is to use a neural network to estimate $A(s_t, a_t)$ by minimizing $\mathbb{E}_{\pi|s_t, a_t}(A_\theta(s_t, a_t) - (R_t - V_\theta(s_t)))^2$ for $R_t$ computed from different trajectories, where $V_\theta(s_t)$ is likewise estimated with a neural network. In practice, we find that good results can be achieved by simply using the single-path estimate $\hat A(s_t, a_t) = (R_t - V_\theta(s_t))/c$, where we normalize the advantage by its average norm $c$¹ in order to keep the scale of $\beta$ stable across different environments. We use this method in our experiments, as it greatly simplifies the computation.

Although the algorithm has a very simple formulation, it has several strengths:

1. Under mild conditions, we show by theoretical analysis that the proposed algorithm has a policy improvement bound.
Specifically, the policy $\tilde\pi$ is uniformly as good as, or better than, the behavior policy $\pi$.

2. The method works well with function approximation by a complex neural network, as suggested by theoretical analysis and validated empirically. The method is naturally compatible with hybrid actions of discrete and continuous parts, which are common in practical problems.

3. In contrast to most off-policy methods, the algorithm does not rely on importance sampling with the value of $\pi(a_t|s_t)$, the action probability of the behavior policy; thus it can be used to learn from an unknown policy, and it is also robust when the current policy deviates from the behavior policy. We validate this with several empirical experiments.

In Section 5 we give a proposition of policy improvement by theoretical analysis, and in Section 6 we give experimental results for the proposed algorithm in off-policy settings.

5 Theoretical Analysis

In this section, we first show that in the ideal case Algorithm 1 is equivalent to imitating a new policy $\tilde\pi$. We then show that the policy $\tilde\pi$ is indeed uniformly better than $\pi$; thus Algorithm 1 can also be regarded as imitating a better policy (IBP). For function approximation, we also provide a policy improvement lower bound under mild conditions.

5.1 Equivalence to Imitating a New Policy

In this subsection, we show that in the ideal case, when we know the advantage $A^\pi(s_t, a_t)$, Algorithm 1 is equivalent to minimizing the KL divergence between $\pi_\theta$ and a hypothetical $\tilde\pi$.
Consider the problem
$$\tilde\pi = \arg\max_{\pi' \in \Pi} \left( (1 - \gamma)\beta L_{d_\pi, \pi}(\pi') - D^{d_\pi}_{\mathrm{KL}}(\pi'\|\pi) \right), \tag{3}$$
which has an analytical solution in the policy space $\Pi$ [Azar et al., 2012, Appendix A, Proposition 1]:
$$\tilde\pi(a|s) = \pi(a|s) \exp(\beta A^\pi(s, a) + C(s)), \tag{4}$$
where $C(s)$ is a normalizing factor ensuring $\sum_{a \in \mathcal{A}} \tilde\pi(a|s) = 1$ for each state $s$. Then
$$\arg\min_\theta D^d_{\mathrm{KL}}(\tilde\pi\|\pi_\theta) = \arg\max_\theta \sum_s d(s) \sum_a \tilde\pi(a|s) \log \pi_\theta(a|s) = \arg\max_\theta \sum_s d(s) \exp(C(s)) \sum_a \pi(a|s) \exp(\beta A^\pi(s, a)) \log \pi_\theta(a|s). \tag{5}$$
Thus Algorithm 1 is equivalent to minimizing $D^d_{\mathrm{KL}}(\tilde\pi\|\pi_\theta)$ for $d(s) \propto d_\pi(s) \exp(-C(s))$.²

¹In our experiments, the average norm of the advantage is approximated with a moving-average estimate, $c^2 \leftarrow c^2 + 10^{-8}((R_t - V_\theta(s_t))^2 - c^2)$.

²In the implementation of the algorithm, we omit the step discount in $d_\pi$, i.e., we use $d'_\pi(s) = \mathbb{E}_{d_0, \pi} \sum_{t=0}^{T} \mathbf{1}(s_t = s)$, where $T$ is the terminal step. Sampling from $d_\pi(s)$ is possible, but usually leads to inferior performance according to our preliminary experiments.

5.2 Monotonic Advantage Reweighting

In Subsection 5.1, we have shown that the $\tilde\pi$ defined in (4) is the analytical solution to problem (3). In this section, we further show that $\tilde\pi$ is indeed uniformly as good as, or better than, $\pi$. To be rigorous, a policy $\pi'$ is considered uniformly as good as, or better than, $\pi$ if $\forall s \in \mathcal{S}$ we have $V^{\pi'}(s) \ge V^\pi(s)$. In Proposition 1, we give a family of $\tilde\pi$ which are uniformly as good as, or better than, $\pi$.
To be specific, we have

Proposition 1. Suppose two policies $\pi$ and $\tilde\pi$ satisfy
$$g(\tilde\pi(a|s)) = g(\pi(a|s)) + h(s, A^\pi(s, a)), \tag{6}$$
where $g(\cdot)$ is a monotonically increasing function and $h(s, \cdot)$ is monotonically increasing for any fixed $s$. Then we have
$$V^{\tilde\pi}(s) \ge V^\pi(s), \quad \forall s \in \mathcal{S}, \tag{7}$$
that is, $\tilde\pi$ is uniformly as good as, or better than, $\pi$.

The idea behind this proposition is simple. Condition (6) requires that the policy $\tilde\pi$ have positive advantages exactly on the actions where $\tilde\pi(a|s) \ge \pi(a|s)$. The claim then follows directly from the well-known policy improvement theorem, as stated in [Sutton and Barto, 1998, Equation 4.8]. A short proof is provided in Appendix A.1 for completeness.

When $g(\cdot)$ and $h(s, \cdot)$ in (6) are chosen as $g(\pi) = \log(\pi)$ and $h(s, A^\pi(s, a)) = \beta A^\pi(s, a) + C(s)$, we recover the formula in (4). By Proposition 1, we have thus shown that the $\tilde\pi$ defined in (4) is as good as, or better than, the policy $\pi$.

We note that there are other choices of $g(\cdot)$ and $h(s, \cdot)$ as well. For example, we can choose $g(\pi) = \log(\pi)$ and $h(s, A^\pi(s, a)) = \log((\beta A^\pi(s, a))_+ + \epsilon) + C(s)$, where $(\cdot)_+$ is a positive truncation, $\epsilon$ is a small positive number, and $C(s)$ is a normalizing factor ensuring $\sum_{a \in \mathcal{A}} \tilde\pi(a|s) = 1$. In this case, minimizing $D^d_{\mathrm{KL}}(\tilde\pi\|\pi_\theta)$ amounts to maximizing $\sum_s d(s) \exp(C(s)) \sum_a \pi(a|s)((\beta A^\pi(s, a))_+ + \epsilon) \log \pi_\theta(a|s)$, up to an additive constant.

5.3 Lower Bound under Approximation

For practical usage, we usually seek a parametric approximation of $\tilde\pi$. The following proposition gives a lower bound on policy improvement for the parametric policy $\pi_\theta$.

Proposition 2.
Suppose we use a parametric policy $\pi_\theta$ to approximate the improved policy $\tilde\pi$ defined in (3). Then we have the following lower bound on the policy improvement:
$$\eta(\pi_\theta) - \eta(\pi) \ge -\frac{\sqrt{2}}{1 - \gamma}\, \delta_1^{\frac{1}{2}} M^{\pi_\theta} + \frac{1}{(1 - \gamma)\beta}\, \delta_2 - \frac{2\sqrt{2}\,\gamma\,\epsilon^{\tilde\pi}_\pi}{(1 - \gamma)^2}\, \delta_1^{\frac{1}{2}}, \tag{8}$$
where $\delta_1 = \min(D^{d_{\tilde\pi}}_{\mathrm{KL}}(\pi_\theta\|\tilde\pi), D^{d_{\tilde\pi}}_{\mathrm{KL}}(\tilde\pi\|\pi_\theta))$, $\delta_2 = D^{d_\pi}_{\mathrm{KL}}(\tilde\pi\|\pi)$, $\epsilon^{\pi'}_\pi = \max_s |\mathbb{E}_{a \sim \pi'} A^\pi(s, a)|$, and $M^\pi = \max_{s,a} |A^\pi(s, a)| \le \max_{s,a} |r(s, a)|/(1 - \gamma)$.

A short proof can be found in Appendix A.2. Note that in theory we would like to approximate $\tilde\pi$ under the state distribution $d_{\tilde\pi}$. In practice, however, we use a heuristic approximation and sample data from trajectories generated by the base policy $\pi$ as in Algorithm 1, which is equivalent to imitating $\tilde\pi$ under a slightly different state distribution $d$, as discussed in Sec. 5.1.

6 Experimental Results

In this section, we provide empirical evidence that the algorithm is well suited for off-policy RL tasks, as it does not need to know the action probabilities of the behavior policy and is therefore robust when learning from replays of an unknown policy. We evaluate the proposed algorithm in the HFO environment under different settings (Sec. 6.1). Furthermore, we also provide two other environments (TORCS and a mobile MOBA game) to evaluate the algorithm in learning from replay data (Sec.
6.2, 6.3).

Denote the behavior policy as $\pi$ and the desired parametrized policy as $\pi_\theta$. The policy losses $L_p$ for the policy iteration algorithms considered are listed below ($C$ denotes a $\theta$-independent constant):

• (IL) Imitation learning, minimizing $D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta)$:
$$L_p = D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta) = -\mathbb{E}_{s \sim d_\pi(s), a \sim \pi(a|s)} \log \pi_\theta(a|s) + C. \tag{9}$$

• (PG) Policy gradient with baseline and $D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta)$ regularization:
$$L_p = D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta) - (1 - \gamma)\beta L_{d_\pi, \pi}(\pi_\theta) = -\mathbb{E}_{s \sim d_\pi(s), a \sim \pi(a|s)}(\beta A^\pi(s, a) + 1) \log \pi_\theta(a|s) + C. \tag{10}$$

• (PGIS) Policy gradient with baseline and $D^{d_\pi}_{\mathrm{KL}}(\pi\|\pi_\theta)$ regularization, with off-policy correction by importance sampling (IS), as in TRPO [Schulman et al., 2015] and CPO [Achiam et al., 2017]. Here we simply use a penalized gradient algorithm to optimize the objective, instead of the delegated optimization method of [Schulman et al., 2015]:
$$L_p = -\mathbb{E}_{s \sim d_\pi(s), a \sim \pi(a|s)}\left( \frac{\pi_\theta(a|s)}{\pi(a|s)}\, \beta A^\pi(s, a) + \log \pi_\theta(a|s) \right) + C. \tag{11}$$

• (MARWIL) Minimizing $D^d_{\mathrm{KL}}(\tilde\pi\|\pi_\theta)$ as in (5) and Algorithm 1:
$$L_p = D^d_{\mathrm{KL}}(\tilde\pi\|\pi_\theta) = -\mathbb{E}_{s \sim d_\pi(s), a \sim \pi(a|s)} \exp(\beta A^\pi(s, a)) \log \pi_\theta(a|s) + C. \tag{12}$$

Note that IL simply imitates all the actions in the data, while PG needs the on-policy assumption to be a reasonable algorithm. Both PGIS and MARWIL are derived for the off-policy setting. However, the importance ratio $\pi_\theta/\pi$ used by PGIS to correct the off-policy bias usually has large variance and may cause severe problems when $\pi_\theta$ deviates far from $\pi$ [Sutton and Barto, 1998].
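For concreteness, the four surrogate losses above can be written per sample as follows. This is our illustrative sketch (function and variable names are ours), assuming advantage estimates and log-probabilities have already been computed:

```python
import numpy as np

def policy_losses(logp, logp_b, adv, beta):
    """Per-sample surrogate losses (9)-(12), up to theta-independent constants.

    logp   : log pi_theta(a_t|s_t) of the learned policy
    logp_b : log pi(a_t|s_t) of the behavior policy (needed only by PGIS)
    adv    : advantage estimates A^pi(s_t, a_t)
    beta   : hyper-parameter
    """
    il     = -logp                             # (9)  imitation learning
    pg     = -(beta * adv + 1.0) * logp        # (10) policy gradient + KL reg.
    ratio  = np.exp(logp - logp_b)             # importance ratio pi_theta / pi
    pgis   = -(ratio * beta * adv + logp)      # (11) IS-corrected PG
    marwil = -np.exp(beta * adv) * logp        # (12) MARWIL
    return il, pg, pgis, marwil

# With beta = 0, PG, PGIS and MARWIL all reduce to plain imitation learning.
logp, logp_b = np.log([0.5, 0.2]), np.log([0.4, 0.3])
adv = np.array([1.0, -0.5])
il, pg, pgis, marwil = policy_losses(logp, logp_b, adv, beta=0.0)
```

Note that only the PGIS loss touches `logp_b`: IL, PG and MARWIL never need the behavior policy's action probabilities, which is what allows learning from data generated by an unknown policy.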
Several methods have been proposed to alleviate this variance problem [Schulman et al., 2017, Munos et al., 2016, Precup et al., 2000]. The MARWIL algorithm, on the other hand, is naturally off-policy and does not rely on the importance ratio $\pi_\theta/\pi$ for off-policy correction. We therefore expect the proposed algorithm to work better when learning from a possibly unknown behavior policy.

6.1 Experiments with Half Field Offense (HFO)

To compare the aforementioned algorithms, we employ Half Field Offense (HFO) as our primary experimental environment. HFO is an abstraction of the full RoboCup 2D game, in which an agent plays soccer in a half field. The HFO environment has a continuous state space and a hybrid (discrete and continuous) action space, similar to our task in a MOBA game (Sec. 6.3). In this simplified environment, we validate the effectiveness and efficiency of the proposed learning method.

6.1.1 Environment Settings

As in [Hausknecht and Stone, 2016], we let the agent try to score a goal without a goalkeeper, and we follow [Hausknecht and Stone, 2016] for the settings, briefed below.

The observation is a 59-dimensional feature vector encoding the relative positions of several critical objects such as the ball, the goal and other landmarks (see [Hausknecht, 2017]). In our experiments, we use a hybrid action space of discrete actions and continuous parameters. Three types of actions are considered, corresponding to {"Dash", "Turn", "Kick"}. For each type $k$ of action, we require the policy to output a parameter $x_k \in \mathbb{R}^2$. For the actions "Dash" and "Kick", the parameter $x_k$ is interpreted as $(r \cos \alpha, r \sin \alpha)$, with $r$ truncated to 1 when it exceeds 1.
Then $\alpha \in [0, 2\pi]$ is interpreted as the relative direction of the action, while $r \in [0, 1]$ is interpreted as its power/force. For the action "Turn", the parameter $x_k$ is first normalized to $(\cos \alpha, \sin \alpha)$ and then $\alpha$ is interpreted as the relative degree of turning. The reward is hand-crafted, written as
$$r_t = d_t(b, a) - d_{t+1}(b, a) + I^{\mathrm{kick}}_{t+1} + 3(d_t(b, g) - d_{t+1}(b, g)) + 5 I^{\mathrm{goal}}_{t+1},$$
where $d_t(b, a)$ (or $d_t(b, g)$) is the distance between the ball and the agent (or the center of the goal), $I^{\mathrm{kick}}_t = 1$ if the agent is close enough to kick the ball, and $I^{\mathrm{goal}}_t = 1$ if a successful goal happens. We use the winning rate $N_G/(N_G + N_F)$ to evaluate the final performance, where $N_G$ is the number of goals (G) achieved and $N_F$ is the number of failures (F), due to either out-of-time (the agent does not kick the ball within 100 frames or does not score within 500 frames) or out-of-bound (the ball leaves the half field).

When learning from data, the historical experience is generated with a mixture of a perfect (100% winning rate) policy $\pi_{\mathrm{perfect}}$ and a random policy $\pi_{\mathrm{random}}$. For the continuous part of the action, Gaussian noise with $\sigma = 0.2$ or $0.4$, respectively, is added to the model output. A mixture coefficient $\epsilon$ is used to adjust the proportion of "good" and "bad" actions. To be specific, at each step the action is taken as
$$a_t \sim \begin{cases} \pi_{\mathrm{perfect}}(\cdot|s_t) + N(0, \sigma) & \text{w.p. } \epsilon, \\ \pi_{\mathrm{random}}(\cdot|s_t) + N(0, \sigma) & \text{w.p. } 1 - \epsilon. \end{cases} \tag{13}$$
The parameter $\epsilon$ is adjusted from 0.1 to 0.5. Smaller $\epsilon$ means greater noise, in which case it is harder for the algorithms to find a good policy from the noisy data.

Algorithm 2 Stochastic Gradient Algorithm for MARWIL
  Input: Policy loss $L_p$, being one of (9)-(12); base policy $\pi$; parameters $m$, $c_v$.
  Randomly initialize $\pi_\theta$. Empty the replay memory $D$.
  Fill $D$ with trajectories from $\pi$ and calculate $R_t$ for each $(s_t, a_t)$ in $D$.
  for $i = 1$ to $N$ do
    Sample a batch $B = \{(s_k, a_k, R_k)\}_m$ from $D$.
    Compute the mini-batch gradients $\nabla_\theta \hat L_p$, $\nabla_\theta \hat L_v$ on $B$.
    Update $\theta$: $-\Delta\theta \propto \nabla_\theta \hat L_p + c_v \nabla_\theta \hat L_v$.
  end for

Table 1: Performance of PG and MARWIL in TORCS, where $\beta = 0$ is the case of IL. For consistent performance, $\beta$ should be inversely proportional to the scale of the (normalized) $A^\pi$. Different $\beta$ are tested in the experiments. The performance is evaluated as the sum of rewards per episode.

  beta     0.0     0.25   0.5    0.75   1.0
  PG       2710    6396   6735   6758   7152
  MARWIL   (2710)  5583   6832   7670   9492

6.1.2 Algorithm Setting

For the HFO game, we model the 3 discrete actions with multinomial probabilities and the 2 continuous parameters of each action with normal distributions of known $\sigma = 0.2$ but unknown $\mu$. Parameters for different types of action are modeled separately. In total we have 3 output nodes for discrete action probabilities and 6 output nodes for continuous action parameters, in the form of
$$\pi_\theta((k, x_k)|s) = p_\theta(k|s)\, N(x_k|\mu_{\theta,k}, \sigma), \qquad k \in \{1, 2, 3\},\ x_k \in \mathbb{R}^2,$$
where $p_\theta(\cdot|s)$ is computed as a soft-max over discrete actions and $N(\cdot|\mu_\theta, \sigma)$ is the probability density function of a Gaussian distribution.

When learning from data, the base policy (13) is used to generate trajectories into a replay memory $D$, and the policy network is updated by the different algorithms, respectively. We denote the policy loss objective as $L_p$, being one of the formulas (9), (10), (11), (12). We then optimize the policy loss $L_p$ and the value loss $L_v$ simultaneously, with a mixture coefficient $c_v$ as a hyper-parameter (by default $c_v = 1$).
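The factored hybrid-action likelihood above can be sketched as follows; this is a stand-alone illustration of the density computation (names and numbers are ours, not the authors' network code):

```python
import numpy as np

SIGMA = 0.2  # fixed standard deviation of the continuous part, as in the HFO setting

def hybrid_log_prob(logits, mus, k, x):
    """log pi_theta((k, x_k)|s) = log p_theta(k|s) + log N(x_k | mu_{theta,k}, sigma).

    logits : (3,)   discrete-action logits for one state
    mus    : (3, 2) predicted means mu_{theta,k}, one 2-d vector per action type
    k      : chosen action type in {0, 1, 2}
    x      : (2,)   chosen continuous parameter x_k
    """
    m = logits.max()
    log_pk = logits[k] - (m + np.log(np.exp(logits - m).sum()))  # stable log-softmax
    diff = x - mus[k]
    # 2-d isotropic Gaussian density: -||x - mu||^2 / (2 sigma^2) - log(2 pi sigma^2)
    log_px = -0.5 * np.dot(diff, diff) / SIGMA**2 - np.log(2 * np.pi * SIGMA**2)
    return float(log_pk + log_px)

# Uniform logits give p_theta(k|s) = 1/3; x at the mean maximizes the Gaussian term.
lp = hybrid_log_prob(np.zeros(3), np.zeros((3, 2)), 0, np.zeros(2))
```

Because the log-probability is a plain sum of a discrete and a continuous term, the advantage-weighted losses above apply to the hybrid action space without modification.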
The value loss $L_v$ is defined as $L_v = \mathbb{E}_{d, \pi}(R_t - V_\theta(s_t))^2$. A stochastic gradient algorithm is given in Algorithm 2. Each experiment is repeated 3 times and the average score is reported in Figure 1. Additional details of the algorithm settings are given in Appendix B.2.

We note that the explicit value $\pi(a_t|s_t)$ is crucial for the correction used by most off-policy policy iteration methods [Sutton and Barto, 1998], including [Munos et al., 2016, Wang et al., 2016, Schulman et al., 2017, Wu et al., 2017] and many other works [Geist and Scherrer, 2014]. Here, for a comparable experiment between policy gradient methods and our proposed method, we consider a simple off-policy correction by importance sampling as in (11). We test the performance of the proposed method and previous works under different settings in Figure 1. We can see that the proposed MARWIL achieves consistently better performance than the other methods.³

Figure 1: Left: Learning from data with additional noise $\sigma = 0.2$. Right: Learning from data with additional noise $\sigma = 0.4$. The data is generated with a mixture of a perfect (100% winning rate) policy $\pi_{\mathrm{perfect}}$ and a random policy $\pi_{\mathrm{random}}$. For the continuous part of the action, Gaussian noise of $\sigma = 0.2$ (left) or $0.4$ (right) is added to the model output, respectively. The mixture coefficient $\epsilon$ is used to adjust the proportion of "good" and "bad" actions. Smaller $\epsilon$ means fewer "good" actions and a harder problem. The performance of the behavior policy is plotted in black. We see that the performance of IL is stable, while PG and PGIS may be affected by the increasing noise in the data. In all settings, the proposed algorithm MARWIL performs best in this task.

6.2 Experiments with TORCS

We also evaluate imitation learning and the proposed method in the TORCS [Wymann et al., 2014] environment. In TORCS, the observation is the raw screen image of size $64 \times 64 \times 3$, the action is a scalar indicating the steering angle in $[-\pi, \pi]$, and the reward $r_t$ is the momentary speed. When the car crashes, a $-1$ reward is received and the game terminates.

³We note that the gap between the behavior policy and IL is partly due to the approximation we used: as we have a continuous action space, we use a Gaussian model with fixed $\sigma$, so the variance of the learned policy may be lower than that of the behavior policy. A fair comparison should be made among IL, PG, PGIS, and MARWIL.

For the TORCS environment, a simple rule is leveraged to keep the car running and to prevent it from crashing. We can therefore use the rule as the optimal policy to generate expert trajectories. In addition, we generate noisy trajectories with random actions to intentionally confuse the learning algorithms, and test whether the proposed method can learn a better policy from the data generated by the deteriorated policy. We make the training data by generating 10 matches with the optimal policy and another 10 matches with random actions.

We train imitation learning and the proposed method for 5 epochs to compare their performance. Table 1 shows the test scores when varying the parameter $\beta$. From the results, we see that our proposed algorithm is effective at learning a better policy from these noisy trajectories.

6.3 Experiments with King of Glory

We also evaluate the proposed algorithm with King of Glory, a mobile MOBA (Multi-player Online Battle Arena) game popular in China. In the experiments, we collect millions of human replay files, amounting to tens of billions of time steps in total. Evaluation is performed in the "solo" game mode, where an agent fights against another AI on the opposing side.
A DNN based function approximator is adopted. In a proprietary test, we find that our AI agent, trained with the proposed method, can reach the level of an experienced human player in a solo game. Additional details of the algorithm settings for King of Glory are given in Appendix B.3.

7 Conclusion

In this article, we present an off-policy learning algorithm that can form a better policy from trajectories generated by a possibly unknown policy. When learning from replay data, the proposed algorithm does not require the behavior probability π over the actions, which is usually missing in human generated data, and it also works well with function approximation and hybrid action spaces. The algorithm is preferable in real-world applications, including playing video games. Experimental results over several real-world datasets validate the effectiveness of the proposed algorithm. We note that the proposed MARWIL algorithm can also work as a full reinforcement learning method when applied iteratively on self-generated replay data. Due to space limitations, a thorough study of our method for full reinforcement learning is left to future work.

Acknowledgement We are grateful to the anonymous reviewers for their detailed and helpful comments on this work. We also thank our colleagues on the King of Glory AI project, particularly Haobo Fu and Tengfei Shi, for their assistance with the game environment and with parsing replay data.

References

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation.
In International Conference on Learning Representations, 2018.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 22–31, 2017.

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.

Michael Bain and Claude Sammut. A framework for behavioural cloning. Machine intelligence, 15(15):103, 1999.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.

Rémi Coulom. Bayesian elo rating. https://www.remi-coulom.fr/Bayesian-Elo/, 2005. [Online; accessed 9-Feb-2018].

Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Matthieu Geist and Bruno Scherrer. Off-policy learning with eligibility traces: A survey. The Journal of Machine Learning Research, 15(1):289–333, 2014.

Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016.

Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014.

Matthew Hausknecht. Robocup 2d half field offense technical manual. https://github.com/LARG/HFO/blob/master/doc/manual.pdf, 2017.
[Online; accessed 9-Feb-2018].

Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Leibo, and Audrunas Gruslys. Deep q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, 2018.

Daniel Jiang, Emmanuel Ekwedike, and Han Liu. Feedback-based tree search for reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2284–2293, 2018.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 652–661, 2016.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search.
In Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1607–1612, 2010.

Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.

Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility traces for off-policy policy evaluation. In ICML, pages 759–766. Citeseer, 2000.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1998.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh.
High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5285–5294, 2017.

Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.

Jiechao Xiong, Qing Wang, Zhuoran Yang, Peng Sun, Lei Han, Yang Zheng, Haobo Fu, Tong Zhang, Ji Liu, and Han Liu. Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394, 2018.