{"title": "Provably Efficient Q-Learning with Low Switching Cost", "book": "Advances in Neural Information Processing Systems", "page_first": 8004, "page_last": 8013, "abstract": "We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change its exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of \\emph{local switching cost}. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for $H$-step episodic MDP that achieves sublinear regret whose local switching cost in $K$ episodes is $O(H^3SA\\log K)$, and we provide a lower bound of $\\Omega(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting \\citep{guo2015concurrent}, which yields nontrivial results that improve upon prior work in certain aspects.", "full_text": "Provably Ef\ufb01cient Q-Learning\n\nwith Low Switching Cost\n\nYu Bai\n\nStanford University\nyub@stanford.edu\n\nTengyang Xie\n\nNan Jiang\n\nUIUC\n\n{tx10, nanjiang}@illinois.edu\n\nYu-Xiang Wang\nUC Santa Barbara\n\nyuxiangw@cs.ucsb.edu\n\nAbstract\n\nWe take initial steps in studying PAC-MDP algorithms with limited adaptivity,\nthat is, algorithms that change its exploration policy as infrequently as possible\nduring regret minimization. This is motivated by the dif\ufb01culty of running fully\nadaptive algorithms in real-world applications (such as medical domains), and we\npropose to quantify adaptivity using the notion of local switching cost. 
Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret and whose local switching cost in K episodes is O(H^3 SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.

1 Introduction

This paper is concerned with reinforcement learning (RL) under limited adaptivity or low switching cost, a setting in which the agent is allowed to act in the environment for a long period but is constrained to switch its policy at most N times. A small switching cost N restricts the agent from frequently adjusting its exploration strategy based on feedback from the environment.

There are strong practical motivations for developing RL algorithms under limited adaptivity. The setting of restricted policy switching captures various real-world settings where deploying new policies comes at a cost. For example, in medical applications where actions correspond to treatments, it is often unrealistic to execute fully adaptive RL algorithms – instead one can only run a fixed policy approved by the domain experts to collect data, and a separate approval process is required every time one would like to switch to a new policy [19, 2, 3]. In personalized recommendation [25], it is computationally impractical to adjust the policy online based on instantaneous data (for instance, think of online video recommendation, where millions of users generate feedback every second). A more common practice is to aggregate data over a long period before deploying a new policy.
In problems where we run RL for compiler optimization [4] and hardware placement [20], as well as for learning to optimize databases [18], it is often desirable to limit the frequency of changes to the policy, since it is costly to recompile the code, to run profiling, to reconfigure an FPGA device, or to restructure a deployed relational database. The problem is even more prominent in RL-guided new material discovery, as it takes time to fabricate the materials and set up the experiments [24, 21]. In many of these applications, adaptivity turns out to be the real bottleneck.

Understanding limited adaptivity RL is also important from a theoretical perspective. First, algorithms with low adaptivity (a.k.a. "batched" algorithms) that are as effective as their fully sequential counterparts have been established in bandits [23, 12], online learning [8], and optimization [11], and it would be interesting to extend such understanding to RL. Second, algorithms with few policy switches are naturally easy to parallelize, as there is no need for parallel agents to communicate if they just execute the same policy. Third, limited adaptivity is closely related to off-policy RL¹ and offers a relaxation less challenging than the pure off-policy setting. We would also like to note that limited adaptivity can be viewed as a constraint for designing RL algorithms, which is conceptually similar to those in constrained MDPs [9, 26].

In this paper, we take initial steps towards studying theoretical aspects of limited adaptivity RL through designing low-regret algorithms with limited adaptivity. We focus on model-free algorithms, in particular Q-Learning, which was recently shown to achieve an Õ(√(poly(H) · SAT)) regret bound with UCB exploration and a careful stepsize choice by Jin et al. [16].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Our goal is to design Q-Learning-type algorithms that achieve similar regret bounds with a bounded switching cost. The main contributions of this paper are summarized as follows:

• We propose a notion of local switching cost that captures the adaptivity of an RL algorithm in episodic MDPs (Section 2). Algorithms with lower local switching cost make fewer switches in their deployed policies.

• Building on insights from the UCB2 algorithm in multi-armed bandits [5] (Section 3), we propose our main algorithms, Q-Learning with UCB2-{Hoeffding, Bernstein} exploration. We prove that these two algorithms achieve Õ(√(H^{4,3} SAT)) regret (respectively) and O(H^3 SA log(K/A)) local switching cost (Section 4). The regret matches that of their vanilla counterparts in [16], but the switching cost is only logarithmic in the number of episodes.

• We show how our low switching cost algorithms can be applied in the concurrent RL setting [13], in which multiple agents can act in parallel (Section 5). The parallelized versions of our algorithms with UCB2 exploration give rise to Concurrent Q-Learning algorithms, which achieve a nearly linear speedup in execution time and compare favorably against existing concurrent algorithms in sample complexity for exploration.

• We show a simple Ω(HSA) lower bound on the switching cost for any sublinear regret algorithm, which is at most a factor of O(H^2 log(K/A)) away from the upper bound (Section 7).

1.1 Prior work

Low-regret RL Sample-efficient RL has been studied extensively since the classical work of Kearns and Singh [17] and Brafman and Tennenholtz [7], with a focus on obtaining a near-optimal policy in polynomial time, i.e. PAC guarantees. A subsequent line of work initiated the study of regret in RL and provided algorithms that achieve regret Õ(√(poly(H, S, A) · T)) [15, 22, 1].
In our episodic MDP setting, the information-theoretic lower bound for the regret is Ω(√(H^2 SAT)), which is matched in recent work by the UCBVI [6] and ORLC [10] algorithms. On the other hand, while all the above low-regret algorithms are essentially model-based, the recent work of [16] shows that model-free algorithms such as Q-learning are able to achieve Õ(√(H^{4,3} SAT)) regret, which is only O(√H) worse than the lower bound.

Low switching cost / batched algorithms Auer et al. [5] propose UCB2 for bandit problems, which achieves the same regret bound as UCB but has switching cost only O(log T) instead of the naive O(T). Cesa-Bianchi et al. [8] study the switching cost in online learning in both the adversarial and stochastic settings, and design an algorithm for stochastic bandits that achieves optimal regret and O(log log T) switching cost.

Learning algorithms with switching cost bounded by a fixed O(1) constant are often referred to as batched algorithms. Minimax rates for batched algorithms have been established in various problems such as bandits [23, 12] and convex optimization [11]. In all these scenarios, minimax optimal M-batch algorithms are obtained for all M, and their rate matches that of fully adaptive algorithms once M = O(log log T).

¹ In particular, N = 0 corresponds to off-policy RL, where the algorithm can only choose one data collection policy [14].

2 Problem setup

In this paper, we consider undiscounted episodic tabular MDPs of the form (H, S, P, A, r). The MDP has horizon H, with trajectories of the form (x_1, a_1, ..., x_H, a_H, x_{H+1}), where x_h ∈ S and a_h ∈ A. The state space S and action space A are discrete, with |S| = S and |A| = A. The initial state x_1 can be either adversarial (chosen by an adversary who has access to our algorithm) or stochastic, specified by some distribution P_0(x_1).
For any (h, x_h, a_h) ∈ [H] × S × A, the transition probability is denoted as P_h(x_{h+1} | x_h, a_h). The reward is denoted as r_h(x_h, a_h) ∈ [0, 1], which we assume to be deterministic². We assume in addition that r_{H+1}(x) = 0 for all x, so that the last state x_{H+1} is effectively an (uninformative) absorbing state.

A deterministic policy π consists of H sub-policies π_h(·) : S → A. For any deterministic policy π, let V^π_h(·) : S → R and Q^π_h(·, ·) : S × A → R denote its value function and state-action value function at the h-th step, respectively. Let π⋆ denote an optimal policy, and V⋆_h = V^{π⋆}_h and Q⋆_h = Q^{π⋆}_h denote the optimal V and Q functions for all h. As a convenient shorthand, we denote [P_h V_{h+1}](x, a) := E_{x′ ∼ P_h(·|x,a)}[V_{h+1}(x′)], and also use [P̂_h V_{h+1}](x_h, a_h) := V_{h+1}(x_{h+1}) in the proofs to denote the observed transition. Unless otherwise specified, we will focus on deterministic policies in this paper, which is without loss of generality as there exists at least one deterministic policy π⋆ that is optimal.

Regret We focus on the regret for measuring the performance of RL algorithms. Let K be the number of episodes that the agent can play (so that the total number of steps is T := KH). The regret of an algorithm is defined as

Regret(K) := Σ_{k=1}^K [ V⋆_1(x^k_1) − V^{π_k}_1(x^k_1) ],

where π_k is the policy it employs before episode k starts, and V⋆_1 is the optimal value function for the entire episode.

Miscellaneous notation We use standard Big-Oh notation in this paper: A_n = O(B_n) means that there exists an absolute constant C > 0 such that A_n ≤ C B_n (similarly A_n = Ω(B_n) for A_n ≥ C B_n).
A_n = Õ(B_n) means that A_n ≤ C_n B_n, where C_n depends at most poly-logarithmically on all the problem parameters.

2.1 Measuring adaptivity through local switching cost

To quantify the adaptivity of RL algorithms, we consider the following notion of local switching cost.

Definition 2.1. The local switching cost (henceforth also "switching cost") between any pair of policies (π, π′) is defined as the number of (h, x) pairs on which π and π′ are different:

n_switch(π, π′) := |{ (h, x) ∈ [H] × S : π_h(x) ≠ [π′]_h(x) }| .

For an RL algorithm that employs policies (π_1, ..., π_K), its local switching cost is defined as

N_switch := Σ_{k=1}^{K−1} n_switch(π_k, π_{k+1}).

Note that (1) N_switch is in general a random variable, as π_k can depend on the outcome of the MDP; (2) we have the trivial bounds n_switch(π, π′) ≤ HS for any (π, π′) and N_switch(A) ≤ HS(K − 1) for any algorithm A.³

Remark The local switching cost naturally extends the notion of switching cost in online learning [8] and is suitable in scenarios where the cost of deploying a new policy scales with the portion of (h, x) on which the action π_h(x) is changed.

² Our results can be straightforwardly extended to the case with stochastic rewards.
³ To avoid confusion, we also note that our local switching cost does not measure the change of the sub-policy π_h between timesteps h and h + 1 (which is in any case needed due to potential non-stationarity), but rather the change of the entire policy π_k = {π^k_h} between episodes k and k + 1.
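As a concrete illustration (our own sketch, not part of the paper's algorithms), Definition 2.1 can be computed directly for deterministic tabular policies stored as H × S action tables; all names below are ours:

```python
import numpy as np

def n_switch(pi, pi_next):
    """Local switching cost between two deterministic policies.

    Each policy is an (H, S) integer array with pi[h, x] the action taken
    in state x at step h.  Counts the (h, x) pairs where the two policies
    differ, as in Definition 2.1.
    """
    return int(np.sum(pi != pi_next))

def total_switching_cost(policies):
    """N_switch for a deployed sequence of policies (pi_1, ..., pi_K)."""
    return sum(n_switch(p, q) for p, q in zip(policies, policies[1:]))

# Tiny example with H = 2 steps and S = 3 states:
pi1 = np.array([[0, 0, 1],
                [1, 0, 0]])
pi2 = np.array([[0, 1, 1],   # action changed at (h=1, x=2)
                [1, 0, 1]])  # action changed at (h=2, x=3)
assert n_switch(pi1, pi2) == 2                     # within the trivial bound HS = 6
assert total_switching_cost([pi1, pi2, pi2]) == 2  # repeating pi2 adds no cost
```

Note that redeploying an identical policy contributes zero, matching the trivial bounds n_switch(π, π′) ≤ HS and N_switch ≤ HS(K − 1).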
A closely related notion of adaptivity is the global switching cost, which simply measures how many times the algorithm switches its entire policy:

N^gl_switch := Σ_{k=1}^{K−1} 1{ π_k ≠ π_{k+1} }.

As π_k ≠ π_{k+1} implies n_switch(π_k, π_{k+1}) ≥ 1, we have the trivial bound N^gl_switch ≤ N_switch. However, the global switching cost can be substantially smaller for algorithms that tend to change the policy "entirely" rather than "locally". In this paper, we focus on bounding N_switch, and leave the task of tighter bounds on N^gl_switch as future work.

3 UCB2 for multi-armed bandits

To gain intuition about the switching cost, we briefly review the UCB2 algorithm [5] for multi-armed bandit problems, which achieves the same regret bound as the original UCB but has a substantially lower switching cost.

The multi-armed bandit problem can be viewed as an RL problem with H = 1, S = 1, so that the agent need only play one action a ∈ A and observe the (random) reward r(a) ∈ [0, 1]. The distributions of the r(a)'s are unknown to the agent, and the goal is to achieve low regret.

The UCB2 algorithm is a variant of the celebrated UCB (Upper Confidence Bound) algorithm for bandits. UCB2 also maintains upper confidence bounds on the true means μ_1, ..., μ_A, but plays each arm multiple times rather than just once when it is found to maximize the upper confidence bound. Specifically, when an arm is found to maximize the UCB for the r-th time, UCB2 will play it τ(r) − τ(r − 1) times, where

τ(r) = (1 + η)^r

for r = 0, 1, 2, ... and some parameter η ∈ (0, 1) to be determined.⁴ The full UCB2 algorithm is presented in Algorithm 1.

Algorithm 1 UCB2 for multi-armed bandits
input Parameter η ∈ (0, 1).
  Initialize: r_j = 0 for j = 1, ..., A. Play each arm once. Set t ← 0 and T ← T − A.
  while t ≤ T do
    Select the arm j that maximizes r̄_j + a_{r_j}, where r̄_j is the average reward obtained from arm j and a_r = O(√(log T / τ(r))) (with some specific choice).
    Play arm j exactly τ(r_j + 1) − τ(r_j) times.
    Set t ← t + τ(r_j + 1) − τ(r_j) and r_j ← r_j + 1.
  end while

Theorem 1 (Auer et al. [5]). For T ≥ max_{i : μ_i < μ⋆} 1/(2Δ_i^2), the UCB2 algorithm achieves the expected regret bound

E[ Σ_{t=1}^T (μ⋆ − μ_t) ] ≤ O_η( log T · Σ_{i : μ_i < μ⋆} 1/Δ_i ),

where Δ_i := μ⋆ − μ_i is the gap between arm i and the optimal arm. Further, the switching cost is at most O(A log(T/A) / η).

The switching cost bound in Theorem 1 comes directly from the fact that Σ_{i=1}^A (1 + η)^{r_i} ≤ T implies Σ_{i=1}^A r_i ≤ O(A log(T/A) / η), by the convexity of r ↦ (1 + η)^r and Jensen's inequality. Such an approach can be fairly general, and we will follow it in the sequel to develop RL algorithms with low switching cost.

⁴ For convenience, here we treat (1 + η)^r as an integer. In Q-learning we could not make this approximation (as we choose η super small), and will massage the sequence τ(r) to deal with it.

4 Q-learning with UCB2 exploration

In this section, we propose our main algorithm, Q-learning with UCB2 exploration, and show that it achieves sublinear regret as well as logarithmic local switching cost.

4.1 Algorithm description

High-level idea Our algorithm maintains two sets of optimistic Q estimates: a running estimate Q̃, which is updated after every episode, and a delayed estimate Q, which is only updated occasionally but used to select the action.
In between two updates to Q, the policy stays fixed, so the number of policy switches is bounded by the number of updates to Q.

To describe our algorithm, let τ(r) be defined as

τ(r) = ⌈(1 + η)^r⌉,  r = 1, 2, ...

and define the triggering sequence as

{t_n}_{n≥1} = {1, 2, ..., τ(r⋆)} ∪ {τ(r⋆ + 1), τ(r⋆ + 2), ...},  (1)

where the parameters (η, r⋆) will be inputs to the algorithm. Define for all t ∈ {1, 2, ...} the quantities

τ_last(t) := max{ t_n : t_n ≤ t }  and  α_t = (H + 1) / (H + t).

Two-stage switching strategy The triggering sequence (1) defines a two-stage strategy for switching policies. Suppose that, for a given (h, x_h), the algorithm decides to take some particular a_h for the t-th time, and has observed (r_h, x_{h+1}) and updated the running estimate Q̃_h(x_h, a_h) accordingly. Then, whether to also update the policy Q is decided as follows:

• Stage I: if t ≤ τ(r⋆), then always perform the update Q_h(x_h, a_h) ← Q̃_h(x_h, a_h).

• Stage II: if t > τ(r⋆), then perform the above update only if t is in the triggering sequence, that is, t = τ(r) = ⌈(1 + η)^r⌉ for some r > r⋆.

In other words, for any state-action pair, the algorithm performs eager policy updates during the first τ(r⋆) visitations, and switches to delayed policy updates after that according to the UCB2 scheduling.

Optimistic exploration bonus We employ either a Hoeffding-type or a Bernstein-type exploration bonus to make sure that our running Q estimates are optimistic.
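To make the two-stage schedule concrete before presenting the full algorithm, the following sketch (our own illustration; the parameter values are hypothetical, whereas the paper's analysis sets η = 1/(2H(H+1))) generates the triggering sequence (1) for a single state-action pair and shows that the number of triggered policy updates grows only logarithmically in the visit count:

```python
import math

def tau(r, eta):
    """tau(r) = ceil((1 + eta)^r), the UCB2-style schedule."""
    return math.ceil((1 + eta) ** r)

def triggering_set(eta, r_star, t_max):
    """All triggering times t_n <= t_max from Eq. (1): every visit up to
    tau(r_star) (Stage I), then only the times tau(r) for r > r_star
    (Stage II, delayed updates)."""
    triggers = set(range(1, min(tau(r_star, eta), t_max) + 1))
    r = r_star + 1
    while tau(r, eta) <= t_max:
        triggers.add(tau(r, eta))
        r += 1
    return triggers

# Hypothetical small parameters, for illustration only.
eta, r_star = 0.5, 4
updates = triggering_set(eta, r_star, t_max=10000)
# Out of 10,000 visits to this (h, x, a), only a logarithmic number
# trigger a policy update (here, fewer than 50).
assert len(updates) < 50
assert 1 in updates and tau(r_star + 1, eta) in updates
```

The geometric spacing of Stage II is exactly what drives the O(H^3 SA log(K/A)) switching cost bound: each of the HSA state-action-step triples can trigger only O(log K / η) delayed updates.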
The full algorithm with Hoeffding-style bonus is presented in Algorithm 2.

4.2 Regret and switching cost guarantee

We now present our main results.

Theorem 2 (Q-learning with UCB2H exploration achieves sublinear regret and low switching cost). Choosing η = 1/(2H(H+1)) and r⋆ = ⌈log(10H^2) / log(1+η)⌉, with probability at least 1 − p, the regret of Algorithm 2 is bounded by Õ(√(H^4 SAT)). Further, the local switching cost is bounded as N_switch ≤ O(H^3 SA log(K/A)).

Theorem 2 shows that the total regret of Q-learning with UCB2 exploration is Õ(√(H^4 SAT)), the same as the UCB version of [16]. In addition, the local switching cost of our algorithm is only O(H^3 SA log(K/A)), which is logarithmic in K, whereas the UCB version can in the worst case have the trivial bound HS(K − 1). We give a high-level overview of the proof of Theorem 2 in Section 6, and defer the full proof to Appendix A.

Bernstein version Replacing the Hoeffding bonus with a Bernstein-type bonus, we can achieve Õ(√(H^3 SAT)) regret (√H better than UCB2H) and the same switching cost bound.

Algorithm 2 Q-learning with UCB2-Hoeffding (UCB2H) Exploration
input Parameters η ∈ (0, 1), r⋆ ∈ Z_{>0}, and c > 0.
  Initialize: Q̃_h(x, a) ← H, Q_h ← Q̃_h, N_h(x, a) ← 0 for all (x, a, h) ∈ S × A × [H].
  for episode k = 1, ..., K do
    Receive x_1.
    for step h = 1, ..., H do
      Take action a_h ← argmax_{a′} Q_h(x_h, a′), and observe x_{h+1}.
      t = N_h(x_h, a_h) ← N_h(x_h, a_h) + 1; b_t = c√(H^3 ℓ / t) (Hoeffding-type bonus);
      Q̃_h(x_h, a_h) ← (1 − α_t) Q̃_h(x_h, a_h) + α_t [ r_h(x_h, a_h) + Ṽ_{h+1}(x_{h+1}) + b_t ].
      Ṽ_h(x_h) ← min{ H, max_{a′ ∈ A} Q̃_h(x_h, a′) }.
      if t ∈ {t_n}_{n≥1} (where t_n is defined in (1)) then
        (Update policy) Q_h(x_h, ·) ← Q̃_h(x_h, ·).
      end if
    end for
  end for

Theorem 3 (Q-learning with UCB2B exploration achieves sublinear regret and low switching cost). Choosing η = 1/(2H(H+1)) and r⋆ = ⌈log(10H^2) / log(1+η)⌉, with probability at least 1 − p, the regret of the algorithm is bounded by Õ(√(H^3 SAT)) as long as T = Ω̃(H^6 S^2 A^2). Further, the local switching cost is bounded as N_switch ≤ O(H^3 SA log(K/A)).

The full algorithm description, as well as the proof of Theorem 3, are deferred to Appendix B. Compared with Q-learning with UCB [16], Theorems 2 and 3 demonstrate that "vanilla" low-regret RL algorithms such as Q-Learning can be turned into low switching cost versions without any sacrifice in the regret bound.

4.3 PAC guarantee

Our low switching cost algorithms can also achieve a PAC learnability guarantee. Specifically, we have the following

Corollary 4 (PAC bound for Q-Learning with UCB2 exploration). Suppose (WLOG) that x_1 is deterministic.
For any ε > 0, Q-Learning with {UCB2H, UCB2B} exploration can output a (stochastic) policy π̂ such that, with high probability,

V⋆_1(x_1) − V^π̂_1(x_1) ≤ ε

after K = Õ(H^{5,4} SA / ε^2) episodes.

The proof of Corollary 4 involves turning the regret bounds in Theorems 2 and 3 into PAC bounds using the online-to-batch conversion, similarly to [16]. The full proof is deferred to Appendix C.

5 Application: Concurrent Q-Learning

Our low switching cost Q-Learning can be applied to developing algorithms for Concurrent RL [13] – a setting in which multiple RL agents can act in parallel and hopefully accelerate the exploration in wall-clock time.

Setting We assume there are M agents / machines, where each machine can interact with an independent copy of the episodic MDP (so that the transitions and rewards on the M MDPs are mutually independent). Within each episode, the M machines must play synchronously and cannot communicate, and can only exchange information after the entire episode has finished. Note that our setting is in a way more stringent than [13], which allows communication after each timestep.

We define a "round" as the duration in which the M machines simultaneously finish one episode and (optionally) communicate and update their policies. We measure the performance of a concurrent algorithm by the number of rounds it requires to find an ε near-optimal policy.
With larger M, we expect this number of rounds to be smaller, and the best we can hope for is a linear speedup in which the number of rounds scales as M^{−1}.

Concurrent Q-Learning Intuitively, any low switching cost algorithm can be made into a concurrent algorithm, as its execution can be parallelized in between two consecutive policy switches. Indeed, we can design concurrent versions of our low switching Q-Learning algorithms and achieve a nearly linear speedup.

Theorem 5 (Concurrent Q-Learning achieves nearly linear speedup). There exist concurrent versions of Q-Learning with {UCB2H, UCB2B} exploration that, given a budget of M parallel machines, return an ε near-optimal policy in

Õ( H^3 SA + H^{5,4} SA / (ε^2 M) )

rounds of execution.

Theorem 5 shows that concurrent Q-Learning has a linear speedup so long as M = Õ(H^{2,1} / ε^2). In particular, in high-accuracy (small ε) cases, the constant overhead term H^3 SA can be negligible, and we essentially have a linear speedup over a wide range of M. The proof of Theorem 5 is deferred to Appendix D.

Comparison with existing concurrent algorithms Theorem 5 implies a PAC mistake bound as well: there exist concurrent algorithms on M machines, Concurrent Q-Learning with {UCB2H, UCB2B}, that perform an ε near-optimal action on all but

N^CQL_ε := Õ( H^4 SAM + H^{6,5} SA / ε^2 )

actions with high probability (detailed argument in Appendix D.2).

We compare ourselves with the Concurrent MBIE (CMBIE) algorithm of [13], which considers discounted, infinite-horizon MDPs and has a mistake bound⁵

N^CMBIE_ε := Õ( S′A′M / (ε(1 − γ′)^2) + S′^2 A′ / (ε^3 (1 − γ′)^6) ).

Our concurrent Q-Learning compares favorably against CMBIE in terms of the mistake bound:

• Dependence on ε. CMBIE achieves N^CMBIE_ε = Õ(ε^{−3} + ε^{−1} M), whereas our algorithm achieves N^CQL_ε = Õ(ε^{−2} + M), better by a factor of ε^{−1}.

• Dependence on (H, S, A). These are not comparable in general, but under the "typical" correspondence⁶ S′ ← HS, A′ ← A, (1 − γ′)^{−1} ← H, we get N^CMBIE_ε = Õ(H^3 SAM ε^{−1} + H^8 S^2 A ε^{−3}). Compared to N^CQL_ε, CMBIE has a higher dependence on H, as well as an S^2 term due to its model-based nature.

⁵ (S′, A′, γ′) are the {# states, # actions, discount factor} of the discounted infinite-horizon MDP.
⁶ One can transform an episodic MDP with S states into an infinite-horizon MDP with HS states. Also note that the "effective" horizon for a discounted MDP is (1 − γ)^{−1}.

6 Proof overview of Theorem 2

The proof of Theorem 2 involves two parts: the switching cost bound and the regret bound. The switching cost bound results directly from the UCB2 switching schedule, similarly to the bandit case (cf. Section 3).
However, such a switching schedule results in delayed policy updates, which makes establishing the regret bound technically challenging.

The key to the Õ(poly(H) · √(SAT)) regret bound for "vanilla" Q-Learning in [16] is a propagation of error argument, which shows that the regret⁷ from the h-th step onward (henceforth the h-regret), defined as

Σ_{k=1}^K δ̃^k_h := Σ_{k=1}^K [ Ṽ^k_h − V^{π_k}_h ](x^k_h),

is bounded by 1 + 1/H times the (h+1)-regret, plus some bounded error term. As (1 + 1/H)^H = O(1), this fact can be applied recursively for h = H, ..., 1, which results in a total regret bound that is not exponential in H. The control of the (excess) error propagation factor by 1/H and the ability to converge are then achieved simultaneously via the stepsize choice α_t = (H+1)/(H+t).

In contrast, our low-switching version of Q-Learning updates the exploration policy in a delayed fashion according to the UCB2 schedule. Specifically, the policy at episode k does not correspond to the argmax of the running estimate Q̃^k, but rather to a previous version Q^k = Q̃^{k′} for some k′ ≤ k. This introduces a mismatch between the Q used for exploration and the Q̃ being updated, and it is a priori unclear whether such a mismatch will blow up the propagation of error.

We resolve this issue via a novel error analysis, which at a high level consists of the following steps:

(i) We show that the quantity δ̃^k_h is upper bounded by a max error

δ̃^k_h ≤ ( max{ Q̃^{k′}_h, Q̃^k_h } − Q^{π_k}_h )(x^k_h, a^k_h) = ( Q̃^{k′}_h − Q^{π_k}_h )(x^k_h, a^k_h) + [ Q̃^k_h − Q̃^{k′}_h ]_+ (x^k_h, a^k_h)

(Lemma A.3). On the right hand side, the first term Q̃^{k′}_h − Q^{π_k}_h does not have a mismatch (as π_k depends on Q̃^{k′}_h) and can be bounded similarly as in [16]. The second term [ Q̃^k_h − Q̃^{k′}_h ]_+ is a perturbation term, which we bound in a precise way that relates it to the stepsizes in between episodes k′ and k and the (h+1)-regret (Lemma A.4).

(ii) We show that, under the UCB2 scheduling, the combined error above results in a mild blowup in the relation between the h-regret and the (h+1)-regret – the multiplicative factor can now be bounded by (1 + 1/H)(1 + O(ηH)) (Lemma A.5). Choosing η = O(1/H^2) makes the multiplicative factor 1 + O(1/H) and lets the propagation of error argument go through.

We hope that the above analysis can be applied more broadly in analyzing exploration problems with delayed updates or asynchronous parallelization.

⁷ Technically it is an upper bound on the regret.

7 Lower bound on switching cost

Theorem 6.
Let A ≥ 4 and M be the set of episodic MDPs satisfying the conditions in Section 2. For any RL algorithm A satisfying N_switch ≤ HSA/2, we have

sup_{M ∈ M} E_{x_1, M}[ Σ_{k=1}^K V⋆_1(x_1) − V^{π_k}_1(x_1) ] ≥ KH/4,

i.e. the worst-case regret is linear in K.

Theorem 6 implies that the switching cost of any no-regret algorithm is lower bounded by Ω(HSA), which is quite intuitive as one would like to play each action at least once on every (h, x). Compared with this lower bound, the switching cost O(H^3 SA log K) we achieve through UCB2 scheduling is off by at most a factor of O(H^2 log K). We believe that the log K factor is not necessary, as there exist algorithms achieving doubly-logarithmic switching cost in bandits [8], and we would also like to leave the tightening of the H^2 factor as future work. The proof of Theorem 6 is deferred to Appendix E.

8 Conclusion

In this paper, we take steps toward studying limited adaptivity RL. We propose a notion of local switching cost to account for the adaptivity of RL algorithms. We design a Q-Learning algorithm with infrequent policy switching that achieves Õ(√(H^{4,3} SAT)) regret while switching its policy at most O(log T) times. Our algorithm works in the concurrent setting through parallelization, achieving a nearly linear speedup and favorable sample complexity.
Our proof involves a novel perturbation analysis for exploration algorithms with delayed updates, which could be of broader interest.

There are many interesting future directions, including (1) low switching cost algorithms with tighter regret bounds, most likely via model-based approaches; (2) algorithms with even lower switching cost; (3) investigating the connection to other settings such as off-policy RL.

Acknowledgment

The authors would like to thank Emma Brunskill, Ramtin Keramati, Andrea Zanette, and the staff of CS234 at Stanford for the valuable feedback on an earlier version of this work, and Chao Tao for the very insightful feedback and discussions on the concurrent Q-learning algorithm. YW was supported by a start-up grant from the UCSB CS department, NSF-OAC 1934641, and a gift from AWS ML Research Award.

References

[1] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.

[2] D. Almirall, S. N. Compton, M. Gunlicks-Stoessel, N. Duan, and S. A. Murphy. Designing a pilot sequential multiple assignment randomized trial for developing an adaptive treatment strategy. Statistics in Medicine, 31(17):1887–1902, 2012.

[3] D. Almirall, I. Nahum-Shani, N. E. Sherwood, and S. A. Murphy. Introduction to SMART designs for the development of adaptive interventions: with application to weight loss research. Translational Behavioral Medicine, 4(3):260–274, 2014.

[4] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano. A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR), 51(5):96, 2018.

[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[6] M. G. Azar, I. Osband, and R. Munos.
Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.

[7] R. I. Brafman and M. Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[8] N. Cesa-Bianchi, O. Dekel, and O. Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.

[9] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.

[10] C. Dann, L. Li, W. Wei, and E. Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.

[11] J. Duchi, F. Ruan, and C. Yun. Minimax bounds on stochastic batched convex optimization. In Conference On Learning Theory, pages 3065–3162, 2018.

[12] Z. Gao, Y. Han, Z. Ren, and Z. Zhou. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763, 2019.

[13] Z. Guo and E. Brunskill. Concurrent PAC RL. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[14] J. P. Hanna, P. S. Thomas, P. Stone, and S. Niekum. Data-efficient policy evaluation through behavior policy search. In Proceedings of the 34th International Conference on Machine Learning, pages 1394–1403. JMLR.org, 2017.

[15] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

[16] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.

[17] M. Kearns and S. Singh.
Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[18] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.

[19] H. Lei, I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy. A "SMART" design for building individualized treatment sequences. Annual Review of Clinical Psychology, 8:21–48, 2012.

[20] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning (ICML-17), pages 2430–2439. JMLR.org, 2017.

[21] P. Nguyen, T. Tran, S. Gupta, S. Rana, M. Barnett, and S. Venkatesh. Incomplete conditional density estimation for fast materials discovery. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 549–557. SIAM, 2019.

[22] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

[23] V. Perchet, P. Rigollet, S. Chassang, E. Snowberg, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.

[24] P. Raccuglia, K. C. Elbert, P. D. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier, and A. J. Norquist. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73, 2016.

[25] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[26] M. Yu, Z. Yang, M. Kolar, and Z. Wang. Convergent policy optimization for safe reinforcement learning.
In Advances in Neural Information Processing Systems, 2019.