{"title": "Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle", "book": "Advances in Neural Information Processing Systems", "page_first": 8060, "page_last": 8070, "abstract": "Q-learning with function approximation is one of the most popular methods in reinforcement learning. Though the idea of using function approximation was proposed at least 60 years ago, even in the simplest setup, i.e., approximating Q-functions with linear functions, it is still an open problem how to design a provably efficient algorithm that learns a near-optimal policy. The key challenges are how to efficiently explore the state space and how to decide when to stop exploring in conjunction with the function approximation scheme.\n\nThe current paper presents a provably efficient algorithm for Q-learning with linear function approximation. Under certain regularity assumptions, our algorithm, Difference Maximization Q-learning, combined with linear function approximation, returns a near-optimal policy using a polynomial number of trajectories. Our algorithm introduces a new notion, the Distribution Shift Error Checking (DSEC) oracle. This oracle tests whether there exists a function in the function class that predicts well on a distribution $\mathcal{D}_1$, but predicts poorly on another distribution $\mathcal{D}_2$, where $\mathcal{D}_1$ and $\mathcal{D}_2$ are distributions over states induced by two different exploration policies. For the linear function class, this oracle is equivalent to solving a top eigenvalue problem. We believe our algorithmic insights, especially the DSEC oracle, are also useful in designing and analyzing reinforcement learning algorithms with general function approximation.", "full_text": "Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle\n\nSimon S.
Du\n\nInstitute for Advanced Study\n\nssdu@ias.edu\n\nYuping Luo\n\nPrinceton University\n\nyupingl@cs.princeton.edu\n\nRuosong Wang\n\nCarnegie Mellon University\n\nruosongw@andrew.cmu.edu\n\nHanrui Zhang\n\nDuke University\n\nhrzhang@cs.duke.edu\n\nAbstract\n\nQ-learning with function approximation is one of the most popular methods in reinforcement learning. Though the idea of using function approximation was proposed at least 60 years ago [27], even in the simplest setup, i.e., approximating Q-functions with linear functions, it is still an open problem how to design a provably efficient algorithm that learns a near-optimal policy. The key challenges are how to efficiently explore the state space and how to decide when to stop exploring in conjunction with the function approximation scheme.\n\nThe current paper presents a provably efficient algorithm for Q-learning with linear function approximation. Under certain regularity assumptions, our algorithm, Difference Maximization Q-learning (DMQ), combined with linear function approximation, returns a near-optimal policy using a polynomial number of trajectories. Our algorithm introduces a new notion, the Distribution Shift Error Checking (DSEC) oracle. This oracle tests whether there exists a function in the function class that predicts well on a distribution $\mathcal{D}_1$, but predicts poorly on another distribution $\mathcal{D}_2$, where $\mathcal{D}_1$ and $\mathcal{D}_2$ are distributions over states induced by two different exploration policies. For the linear function class, this oracle is equivalent to solving a top eigenvalue problem. We believe our algorithmic insights, especially the DSEC oracle, are also useful in designing and analyzing reinforcement learning algorithms with general function approximation.\n\n1 Introduction\n\nQ-learning is a foundational method in reinforcement learning [35] and has been successfully applied in various domains.
Q-learning aims at learning the optimal state-action value function (Q-function). Once we have learned the Q-function, at every state we can simply choose the action with the largest Q-value greedily, which is guaranteed to yield an optimal policy.\n\nAlthough it is a fundamental method, theoretically we only have a good understanding of Q-learning in the tabular setting. Strehl et al. [30] and Jin et al. [18] showed that with proper exploration techniques, one can obtain a near-optimal Q-function (and hence a near-optimal policy) using a number of trajectories polynomial in the number of states, the number of actions, and the planning horizon. While these analyses provide valuable insights, they are of limited practical importance because the number of states in most applications is enormous. Even worse, it has been proved that in the tabular setting, the number of trajectories needed to learn a near-optimal policy scales at least linearly with the number of states [16].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo resolve this problem, we need reinforcement learning methods that generalize, which, for Q-learning methods, means constraining the Q-function to a pre-specified function class, e.g., linear functions or neural networks. The basic assumption of this function approximation scheme is that the true Q-function lies in the function class. A natural problem is:\n\nCan we design provably efficient Q-learning algorithms with function approximation?\n\nIndeed, this is one of the major open problems in reinforcement learning [32]. The idea of using function approximation was proposed at least 60 years ago [27], where linear functions are used to approximate the value functions in playing checkers.
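The greedy rule above, instantiated with the linear function approximation just mentioned, can be sketched in a few lines. The feature vector `phi_s` and per-action coefficient matrix `thetas` below are illustrative stand-ins, not quantities specified in the paper:

```python
import numpy as np

def greedy_action(phi_s, thetas):
    """Greedily pick the action with the largest linear Q-value phi(s)^T theta_a.

    phi_s:  (d,) feature vector of the current state.
    thetas: (K, d) matrix holding one coefficient vector per action.
    """
    q_values = thetas @ phi_s  # (K,) estimated Q(s, a), one entry per action
    return int(np.argmax(q_values))

# Toy example with K = 2 actions and d = 3 features.
phi_s = np.array([1.0, 0.0, 2.0])
thetas = np.array([[0.1, 0.5, 0.1],   # coefficients for action 0
                   [0.3, 0.2, 0.4]])  # coefficients for action 1
# Q(s, 0) = 0.1 + 0.2 = 0.3 and Q(s, 1) = 0.3 + 0.8 = 1.1, so action 1 is greedy.
```

Acting greedily with respect to the *true* Q-function recovers an optimal policy; with estimated coefficients, the quality of the resulting greedy policy is exactly what the paper's analysis controls.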
However, even in the most basic setting, Q-learning with linear function approximation, there is no provably efficient algorithm in the general stochastic setting.\n\nThe key challenges are how to (1) efficiently explore the state space to learn a good predictor that generalizes across states, and (2) decide when to stop exploring. To deal with these challenges, we need to exploit the fact that the true Q-function belongs to a pre-specified function class.\n\nOur Contributions Our main theoretical contribution is a provably efficient algorithm for Q-learning with linear function approximation in the episodic Markov decision process (MDP) setting.\n\nTheorem 1.1 (Main Theorem (informal)). Suppose the Q-function is linear. Then under certain regularity assumptions, Algorithm 1, Difference Maximization Q-learning (DMQ), returns an $\epsilon$-suboptimal policy $\pi$ using $\mathrm{poly}(1/\epsilon)$ trajectories.\n\nOur algorithm works for episodic MDPs with general stochastic transitions. In contrast, previous algorithms only work for deterministic systems, or rely on strong assumptions, e.g., that a sufficiently good exploration policy is given. See Section 2 for more discussion. Our main assumption is that the Q-function is linear. Note this is, in a sense, a necessary assumption, because otherwise one should not use linear function approximation in the first place.\n\nBefore getting into details, we first give an overview of our main techniques. As we have discussed, the main technical challenge is to design an efficient exploration algorithm and to decide when to stop exploring. Our main algorithmic contribution is to introduce a new notion, the Distribution Shift Error Checking (DSEC) oracle (cf. Oracle 4.1 and Oracle 4.2). Given two distributions $\mathcal{D}_1$ and $\mathcal{D}_2$, this oracle returns True if there exists a function in the pre-specified function class which predicts well on $\mathcal{D}_1$ but predicts poorly on $\mathcal{D}_2$.
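As the abstract notes, for the linear function class this oracle reduces to a top-eigenvalue problem (the reduction is carried out in Section 6). A minimal numpy sketch of that special case follows; the function name is ours, and the single `reg` constant stands in for the paper's policy-set-dependent regularizer:

```python
import numpy as np

def dsec_oracle_linear(D1, D2, eps1, eps2, reg):
    """Sample-based DSEC check for linear predictors (sketch).

    D1, D2: (n, d) arrays of state features sampled from the two distributions.
    Returns True if two linear predictors can agree on D1 (regularized error
    at most eps1) while differing by at least eps2 on average over D2.
    """
    d = D1.shape[1]
    M1 = D1.T @ D1 / len(D1) + reg * np.eye(d)  # regularized covariance of D1
    M2 = D2.T @ D2 / len(D2)                    # covariance of D2
    # Whiten by M1^{-1/2}; the constrained maximum is then a top eigenvalue.
    evals, evecs = np.linalg.eigh(M1)
    M1_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    v = np.linalg.eigvalsh(eps1 * M1_inv_half @ M2 @ M1_inv_half).max()
    return bool(v >= eps2)

# States seen so far lie on one feature direction; a new policy visits another.
seen = np.tile([1.0, 0.0], (50, 1))
new = np.tile([0.0, 1.0], (50, 1))
# The unexplored direction makes the oracle fire, flagging a useful policy.
```

Intuitively, `v` is large exactly when `D2` puts weight on feature directions that `D1` barely covers, which is the criterion for keeping the exploration policy that produced `D2`.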
We will show that this is an extremely useful notion. If the oracle returns False, then our learned predictor performs well on both distributions. If the oracle returns True, we know $\mathcal{D}_2$ contains information that we can explore, which implies that the policy that generates $\mathcal{D}_2$ is a valuable exploration policy. We will discuss the DSEC oracle in more detail in Section 4.\n\nWith this oracle at hand, a natural question is how many times the oracle will return True, as we will not stop exploring if it always returns True. A technical contribution of this paper is to show that for the linear function class, the oracle returns True at most a polynomial number of times. At a high level, whenever the oracle returns True, we will learn something new from $\mathcal{D}_2$. However, since the complexity of the function class is bounded, we cannot learn new things too many times. Formally, we use a potential function argument to make this intuition rigorous (cf. Lemma A.5).\n\n1.1 Organization\n\nThis paper is organized as follows. In Section 2, we review related work. In Section 3, we introduce necessary notations, definitions and assumptions. In Section 4, we describe the DSEC oracle in detail. In Section 5, we present our general algorithm for Q-learning with function approximation. In Section 6, we instantiate the general algorithm for linear function approximation and present our main theorem. We conclude and discuss future work in Section 7. All technical proofs are deferred to the supplementary material.\n\n2 Related Work\n\nClassical theoretical reinforcement learning literature studies the asymptotic behavior of concrete algorithms. The most related work is [24], which studies an online Q-learning algorithm with a fixed exploration policy. They showed that the estimated Q-function converges to the true Q-function asymptotically. Recently, Zou et al. [39] derived finite-sample bounds for the same setting.
The major drawback of these works is that they place strong assumptions on the fixed exploration policy. For example, Zou et al. [39] require that the covariance matrix induced by the exploration policy has a lower-bounded least eigenvalue. In general, it is hard to verify whether a policy has such benign properties.\n\nWhile it is challenging to design efficient algorithms for Q-learning with function approximation, in the tabular setting exploration becomes much easier, as one can first estimate the transition probabilities and then design exploration policies accordingly. There is a substantial body of work on tabular reinforcement learning [2, 16, 19, 5, 21, 10]. For Q-learning, Strehl et al. [30] introduced the delayed Q-learning algorithm, which has an $O(T^{4/5})$ regret bound. A recent work by Jin et al. [18] gave a UCB-based algorithm which enjoys an $O(\sqrt{T})$ regret bound. More recent papers provided refined analyses that exploit benign properties of the MDP, e.g., the gap between the optimal action and the rest [28, 38], which our algorithm also utilizes. However, it is hard to generalize the exploration techniques in these previous works, since they all rely on the fact that the total number of states is finite.\n\nRecently, exploration algorithms have been proposed for Q-learning with function approximation. Osband et al. [25] proposed a Thompson-sampling-based method for the linear function class. Later works further generalized sampling-based algorithms to Q-functions with neural network parameterization [6, 23, 13]. However, none of these works have polynomial sample complexity guarantees.
Pazis and Parr [26] gave a nearest-neighbor-based algorithm for exploration in continuous state spaces. However, in general this type of algorithm has exponential dependence on the state dimension.\n\nThe seminal work by Wen and Van Roy [36] proposed an algorithm, optimistic constraint propagation (OCP), which enjoys polynomial sample complexity bounds for a family of Q-function classes, including the linear function class as a special case. However, their algorithm can only deal with deterministic systems, i.e., both the transition dynamics and the rewards are deterministic. A line of recent papers studies Q-learning in general state-action metric spaces [37, 29]. However, due to this generality, the sample complexity has exponential dependence on the dimension.\n\nFinally, a recent series of works introduced contextual decision processes (CDPs) [22, 17, 9, 31, 11] and developed algorithms with polynomial sample complexity guarantees. Our paper is not directly comparable with these results, since they can deal with general function classes. In some cases, the function approximation is not even for the Q-function, but for modeling the map from the observed state to hidden states [11]. The result in [17] also applies to our setting. However, their bound depends on both the function class complexity and a quantity called the Bellman rank. Conceptually, since our bound does not depend on the Bellman rank, our result demonstrates that the function class complexity alone is enough for efficient learning.\n\n3 Preliminaries\n\nNotations We begin by introducing necessary notations. We write $[h]$ to denote the set $\{1, \ldots, h\}$. For any finite set $S$, we write $\mathrm{unif}(S)$ to denote the uniform distribution over $S$ and $\Delta(S)$ to denote the probability simplex over $S$.
Let $\|\cdot\|_2$ denote the Euclidean norm of a finite-dimensional vector in $\mathbb{R}^d$. For a symmetric matrix $A$, let $\|A\|_{\mathrm{op}}$ denote its operator norm and $\lambda_i(A)$ denote its $i$-th eigenvalue. Throughout the paper, all sets are multisets, i.e., a single element can appear multiple times.\n\nMarkov Decision Processes (MDPs) Let $M = (\mathcal{S}, \mathcal{A}, H, P, R)$ be an MDP, where $\mathcal{S}$ is the (possibly uncountable) state space, $\mathcal{A}$ is the finite action space with $|\mathcal{A}| = K$, $H \in \mathbb{Z}_+$ is the planning horizon, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function and $R : \mathcal{S} \times \mathcal{A} \to \Delta(\mathbb{R})$ is the reward distribution.\n\nA (stochastic) policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ prescribes a distribution over actions for each state. Without loss of generality, we assume a fixed start state $s_1$.^1 The policy $\pi$ induces a random trajectory $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_H, a_H, r_H$ where $r_1 \sim R(s_1, a_1)$, $s_2 \sim P(s_1, a_1)$, $a_2 \sim \pi(s_2)$, etc. For a given policy $\pi$, we use $\mathcal{D}^\pi_h$ to denote the distribution over $\mathcal{S}_h$ induced by executing policy $\pi$. To streamline our analysis, we denote by $\mathcal{S}_h \subseteq \mathcal{S}$ the set of states at level $h$. Similar to previous theoretical reinforcement learning results, we also assume $r_h \geq 0$ for all $h \in [H]$ and $\sum_{h=1}^H r_h \leq 1$ [17]. Our goal is to find a policy $\pi$ that maximizes the expected reward $\mathbb{E}[\sum_{h=1}^H r_h \mid \pi]$. We use $\pi^*$ to denote the optimal policy.\n\nGiven a policy $\pi$, a level $h \in [H]$ and a state-action pair $(s, a) \in \mathcal{S}_h \times \mathcal{A}$, the Q-function is defined as $Q^\pi(s, a) = \mathbb{E}[\sum_{h'=h}^H r_{h'} \mid s_h = s, a_h = a, \pi]$. It will also be useful to define the value function of a given state $s \in \mathcal{S}_h$ as $V^\pi(s) = \mathbb{E}[\sum_{h'=h}^H r_{h'} \mid s_h = s, \pi]$. For simplicity, we denote $Q^*(s, a) = Q^{\pi^*}(s, a)$ and $V^*(s) = V^{\pi^*}(s)$.
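The value function just defined can be approximated by straightforward Monte Carlo rollouts. The sketch below does this on a hypothetical deterministic toy MDP; the sampler `step`, the policy signature, and all constants are our own illustrative choices, and the rewards respect the normalization above:

```python
import random

def estimate_value(step, policy, s1, H, n_traj=2000, seed=0):
    """Monte Carlo estimate of V^pi(s1), the expected sum of rewards under pi.

    step(s, a, rng) -> (next_state, reward) samples the transition P and the
    reward R; policy(s, h) returns the action taken at state s, level h.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        s = s1
        for h in range(1, H + 1):
            a = policy(s, h)
            s, r = step(s, a, rng)
            total += r
    return total / n_traj

# Toy deterministic chain: action 1 earns 1/H per step, action 0 earns nothing,
# so the per-trajectory reward always sums to at most 1.
H = 4

def step(s, a, rng):
    return s, a / H

v = estimate_value(step, lambda s, h: 1, "s1", H)
# Always choosing action 1 collects H * (1/H) = 1 reward per trajectory.
```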
Recall that if we know $Q^*$, we can simply choose the action greedily: $\pi^*(s) = \mathrm{argmax}_{a \in \mathcal{A}} Q^*(s, a)$. In this paper, we make the following assumption about the variation of the suboptimality of policies [12].\n\nAssumption 3.1 (Bounded Coefficient of Variation of Policy Sub-optimality). There exists a constant $1 \leq C < \infty$ such that for any fixed level $h \in [H]$ and deterministic policy $\pi$, $$\mathbb{E}_{s \sim \mathcal{D}^\pi_h}\left[|V^\pi(s) - V^*(s)|^2\right] \leq C \left(\mathbb{E}_{s \sim \mathcal{D}^\pi_h}\left[|V^\pi(s) - V^*(s)|\right]\right)^2.$$\n\nIntuitively, this assumption says that the variation due to the randomness over states is not too large compared to the mean. For example, if the transition is deterministic, then this assumption holds with $C = 1$.\n\nOur paper also relies on the following fine-grained characterization of the MDP.\n\nDefinition 3.1 (Suboptimality Gaps). Given $s \in \mathcal{S}$ and $a \in \mathcal{A}$, the gap is defined as $\mathrm{gap}(s, a) = V^*(s) - Q^*(s, a)$. The minimum gap is defined as $\gamma \triangleq \min_{s \in \mathcal{S}, a \in \mathcal{A}} \{\mathrm{gap}(s, a) : \mathrm{gap}(s, a) > 0\}$.\n\nThis notion has been extensively studied in the bandit literature to obtain fine-grained bounds [4]. Recently, Simchowitz et al. [28] derived regret bounds in tabular MDPs based on this notion. In this paper we assume $\gamma > 0$, and the sample complexity of our algorithm depends polynomially on $1/\gamma$.\n\nNotice that assuming $\gamma$ is strictly positive is not a restrictive assumption for the finite-action setting considered in this paper. First, in the contextual linear bandit literature, this assumption is widely discussed; see, e.g., [1, 8]. The notion of context in the bandit literature is essentially $\phi(s)$ in our paper, and the number of contexts can also be infinite. Second, there are many natural environments in RL which satisfy this assumption. For example, in many environments, states can be classified as good states and bad states. In these environments, an agent can obtain a reward only if it is in a good state. There are also two kinds of actions: good actions and bad actions.
If the agent is in a good state and chooses a good action, the agent will transit to a good state. If the agent chooses a bad action, the agent will transit to a bad state. If the agent is in a bad state, whatever action the agent chooses, the agent will transit to a bad state. Note that for this kind of environment, there is a strictly positive gap between good actions and bad actions when the agent is in a good state, and there is no difference between good actions and bad actions when the agent is in a bad state. In this case, $\gamma$ is strictly positive, since by Definition 3.1 we take the minimum over all state-action pairs with strictly positive gap. These environments are natural generalizations of the combination lock environment [20]. Some Atari games, e.g., Freeway, have a similar flavor to these environments.\n\nFunction Approximation When the state space is large, we need structure on the state space so that reinforcement learning methods can generalize. We constrain the optimal Q-function to a pre-specified function class $\mathcal{Q}$ [7], e.g., the class of linear functions. In this paper we associate each $h \in [H]$ and $a \in \mathcal{A}$ with a Q-function $f^a_h \in \mathcal{Q}$. We make the following assumption.\n\nAssumption 3.2. For every $(h, a) \in [H] \times \mathcal{A}$, its associated optimal Q-function is in $\mathcal{Q}$.\n\n^1Some papers assume the starting state is sampled from a distribution $P_1$. Note this is equivalent to assuming a fixed state $s_1$, by setting $P(s_1, a) = P_1$ for all $a \in \mathcal{A}$; our $s_2$ is then equivalent to the starting state in their assumption.\n\nThis is a widely used assumption in the theoretical reinforcement learning literature [17]. Note that without this assumption, we cannot hope to obtain the optimal policy using functions in $\mathcal{Q}$ as the Q-function.\n\nThe focus of this paper is the linear function class, which is one of the most popular function classes used in practice. This function class depends on a feature extractor $\phi : \mathcal{S} \to \mathbb{R}^d$, which can be a hand-crafted feature extractor or a pre-trained neural network that transforms a state into a $d$-dimensional embedding. For $s_h \in \mathcal{S}_h$ and $a \in \mathcal{A}$, our estimated optimal Q-function admits the form $f^a_h(s) = \phi(s)^\top \hat{\theta}^a_h$, where $\hat{\theta}^a_h \in \mathbb{R}^d$ only depends on the level $h \in [H]$ and the action $a \in \mathcal{A}$. Therefore, we only need to learn $K \cdot H$ $d$-dimensional coefficient vectors, since by Assumption 3.2, for each $h \in [H]$ and $a \in \mathcal{A}$ there exists $\theta^a_h \in \mathbb{R}^d$ such that $Q^*(s_h, a) = \phi(s_h)^\top \theta^a_h$ for all $s_h \in \mathcal{S}_h$.\n\nThe aim of this paper is to obtain polynomial sample complexity bounds. To this end, we also need some regularity conditions.\n\nAssumption 3.3. For all $s \in \mathcal{S}$, the feature is bounded: $\|\phi(s)\|_2 \leq 1$. For all $h \in [H]$ and $a \in \mathcal{A}$, the true linear predictor is bounded: $\|\theta^a_h\|_2 \leq 1$.\n\n4 Distribution Shift Error Checking Oracle\n\nAs we have discussed in Section 1, in reinforcement learning we often need to know whether a predictor learned from samples generated from one distribution $\mathcal{D}_1$ can predict well on another distribution $\mathcal{D}_2$. This is related to off-policy learning, for which one often needs to bound the probability density ratio between $\mathcal{D}_1$ and $\mathcal{D}_2$ on all state-action pairs. When a function approximation scheme is used, we naturally arrive at the following oracle.\n\nOracle 4.1 (Distribution Shift Error Checking Oracle $(\mathcal{D}_1, \mathcal{D}_2, \epsilon_1, \epsilon_2, \Lambda)$). For two given distributions $\mathcal{D}_1, \mathcal{D}_2$ over $\mathcal{S}$, two real numbers $\epsilon_1$ and $\epsilon_2$, and a regularizer $\Lambda : \mathcal{Q} \times \mathcal{Q} \to \mathbb{R}$, define $$v = \max_{f_1, f_2 \in \mathcal{Q}} \mathbb{E}_{s \sim \mathcal{D}_2}\left[(f_1(s) - f_2(s))^2\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}_1}\left[(f_1(s) - f_2(s))^2\right] + \Lambda(f_1, f_2) \leq \epsilon_1.$$ The oracle returns True if $v \geq \epsilon_2$, and False otherwise.\n\nTo motivate this oracle, let $f_2$ be the optimal Q-function and let $f_1$ be a predictor we learned using samples generated from distribution $\mathcal{D}_1$. In this scenario, we know $f_1$ has a small expected error $\epsilon_1$ on distribution $\mathcal{D}_1$.
Note that since we maximize over the entire function class $\mathcal{Q}$, $v$ is an upper bound on the expected error of $f_1$ on distribution $\mathcal{D}_2$. If $v$ is large, say larger than $\epsilon_2$, then it could be the case that we cannot predict well on distribution $\mathcal{D}_2$. On the other hand, if $v$ is small, we are certain that $f_1$ has small error on $\mathcal{D}_2$. Here we add a regularization term $\Lambda(f_1, f_2)$ to prevent pathological cases. The concrete choice of $\Lambda$ will be given later.\n\nIn practice, it is impossible to get access to the underlying distributions $\mathcal{D}_1$ and $\mathcal{D}_2$. Thus, we use samples generated from these two distributions instead.\n\nOracle 4.2 (Sample-based Distribution Shift Error Checking Oracle $(D_1, D_2, \epsilon_1, \epsilon_2, \Lambda)$). For two sets of states $D_1, D_2 \subseteq \mathcal{S}$, two real numbers $\epsilon_1$ and $\epsilon_2$, and a regularizer $\Lambda : \mathcal{Q} \times \mathcal{Q} \to \mathbb{R}$, define $$v = \max_{f_1, f_2 \in \mathcal{Q}} \frac{1}{|D_2|} \sum_{t_i \in D_2} (f_1(t_i) - f_2(t_i))^2 \quad \text{s.t.} \quad \frac{1}{|D_1|} \sum_{s_i \in D_1} (f_1(s_i) - f_2(s_i))^2 + \Lambda(f_1, f_2) \leq \epsilon_1.$$ The oracle returns True if $v \geq \epsilon_2$ and False otherwise. If $D_1 = \emptyset$, the oracle simply returns True.\n\nAn interesting property of Oracle 4.2 is that it only depends on the states and does not rely on the reward values.\n\n5 Difference Maximization Q-learning\n\nNow we describe our algorithm.\n\nAlgorithm 1 Difference Maximization Q-learning (DMQ)\nOutput: A near-optimal policy $\hat{\pi}$.\n1: for $h = H, H-1, \ldots, 1$ do\n2:   Run Algorithm 2 on input $h$.\n3: Return $\hat{\pi}$, the greedy policy with respect to $\{f^a_h\}_{a \in \mathcal{A}, h \in [H]}$.\n\nWe maintain three sets of global variables.\n1. $\{f^a_h\}_{a \in \mathcal{A}, h \in [H]}$. These are our estimated Q-functions for all actions $a \in \mathcal{A}$ and all levels $h \in [H]$.\n2. $\{\Pi_h\}_{h \in [H]}$. For each level $h \in [H]$, $\Pi_h$ is a set of exploration policies for level $h$, which we use to collect data.\n3. $\{D_h\}_{h \in [H]}$. For each $h \in [H]$, $D_h = \{s_{h,i}\}_{i=1}^N$ is a set of states in $\mathcal{S}_h$.\n\nWe initialize these global variables in the following manner.
For $\{f^a_h\}_{a \in \mathcal{A}, h \in [H]}$, we initialize them arbitrarily. For each $h \in [H]$, we initialize $\Pi_h$ to be a single purely random exploration policy, i.e., $\Pi_h = \{\pi\}$, where $\pi(s) = \mathrm{unif}(\mathcal{A})$ for all $s \in \mathcal{S}$. We initialize $\{D_h\}_{h \in [H]}$ to be empty sets.\n\nAlgorithm 1 uses Algorithm 2 to learn predictors for each level $h \in [H]$. Algorithm 2 takes $h \in [H]$ as input and tries to learn the predictors $\{f^a_h\}_{a \in \mathcal{A}}$ at level $h$. Algorithm 3 takes $h \in [H]$ and $a \in \mathcal{A}$ as inputs, and checks whether the predictors learned for later levels $h' > h$ are accurate enough under the current policy.\n\nNow we explain Algorithm 2 and Algorithm 3 in more detail. Algorithm 2 iterates over all actions, and for each action $a \in \mathcal{A}$, it uses Algorithm 3 to check whether we can learn the Q-function corresponding to $a$ well. After executing Algorithm 3, we are certain that we can learn $f^a_h$ well (we will explain this in the next paragraph), and thus construct a set of new policies $\Pi^a_h = \{\pi^a_h\}_{\pi_h \in \Pi_h}$ in the following way. For each policy $\pi_h \in \Pi_h$, we define $\pi^a_h$ as $$\pi^a_h(s_{h'}) = \begin{cases} \pi_h(s_{h'}) & \text{if } h' < h \\ a & \text{if } h' = h \\ \mathrm{argmax}_{a' \in \mathcal{A}} f^{a'}_{h'}(s_{h'}) & \text{if } h' > h \end{cases} \quad (1)$$ This policy uses $\pi_h$ as the roll-in policy up to level $h$, chooses action $a$ at level $h$, and uses the greedy policy with respect to $\{f^a_{h'}\}_{h' > h, a \in \mathcal{A}}$, the current estimates of the Q-functions at levels $h+1, \ldots, H$, as the roll-out policy. In each iteration, we sample one policy $\pi$ uniformly at random from $\Pi^a_h$ and use it to collect a pair $(s, y)$, where $s \in \mathcal{S}_h$ and $y \in \mathbb{R}$ is the on-the-go reward. In total we collect a dataset $D^a_h$ of size $N \cdot |\Pi_h|$, and we use regression to learn a predictor from these data. Formally, we calculate $$f^a_h = \mathrm{argmin}_{f \in \mathcal{Q}} \left[\frac{1}{N \cdot |\Pi_h|} \sum_{(s,y) \in D^a_h} (f(s) - y)^2 + \Lambda(f)\right]. \quad (2)$$ Here, $\Lambda(f)$ represents a regularization term on $f$.
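With the linear class, Section 6 instantiates the regression in (2) as ridge regression, which has a closed-form solution. A minimal sketch of that estimator follows; the feature matrix, the penalty value, and the sanity-check data are illustrative, not the paper's tuned choices:

```python
import numpy as np

def ridge_q_estimate(Phi, y, lam):
    """Closed-form ridge estimate of linear Q-function coefficients.

    Phi: (n, d) feature vectors phi(s_i) of the states collected at one level.
    y:   (n,)  observed on-the-go rewards from those states.
    Solves argmin_theta mean((Phi @ theta - y)^2) + lam * ||theta||^2.
    """
    n, d = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(d)  # regularized empirical covariance
    b = Phi.T @ y / n                      # empirical feature-reward correlation
    return np.linalg.solve(A, b)

# Sanity check on noiseless linear data: with a tiny penalty, the estimate
# recovers the coefficients that generated the rewards.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 3))
theta_true = np.array([0.5, -0.2, 0.1])
y = Phi @ theta_true
theta_hat = ridge_q_estimate(Phi, y, 1e-8)
```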
Finally, we update $D_h$ by using each $\pi_h \in \Pi_h$ to collect $N$ states in $\mathcal{S}_h$.\n\nNow we explain Algorithm 3. For each $\pi_h \in \Pi_h$, we use the policy $\pi^a_h$ defined in (1) to collect $N$ trajectories. For each $h' = h+1, \ldots, H$, we set $\tilde{D}^{\pi^a_h}_{h'} = \{s_{h',i}\}_{i=1}^N$, where $s_{h',i}$ is the state at level $h'$ in the $i$-th trajectory. Next, for each $h' = H, \ldots, h+1$, we invoke Oracle 4.2 on input $D_{h'}$ and $\tilde{D}^{\pi^a_h}_{h'}$. Note that $D_{h'}$ was collected when we executed Algorithm 2 to learn the predictors at level $h'$. The oracle returns whether our current predictors at level $h'$ can still predict well on the distribution that generates $\tilde{D}^{\pi^a_h}_{h'}$. If not, then we add $\pi^a_h$ to our policy set $\Pi_{h'}$, and we execute Algorithm 2 to learn the predictors at level $h'$ once again. Note that it is crucial to iterate $h'$ from $H$ down to $h+1$, so that we always make sure the predictors at later levels are correct.\n\nAlgorithm 2\nInput: $h \in [H]$, a target level.\n1: for $a \in \mathcal{A}$ do\n2:   Execute Algorithm 3 on input $(h, a)$.\n3:   Initialize $D^a_h = \emptyset$.\n4:   Construct a policy set $\Pi^a_h$ according to (1).\n5:   for $i = 1, \ldots, N \cdot |\Pi^a_h|$ do\n6:     Sample $\pi \sim \mathrm{unif}(\Pi^a_h)$.\n7:     Use $\pi$ to collect $(s_i, y_i)$, where $s_i \in \mathcal{S}_h$ and $y_i$ is the on-the-go reward.\n8:     Add $(s_i, y_i)$ into $D^a_h$.\n9:   Learn a predictor $f^a_h = \mathrm{argmin}_{f \in \mathcal{Q}} [\frac{1}{N \cdot |\Pi_h|} \sum_{(s,y) \in D^a_h} (f(s) - y)^2 + \Lambda(f)]$.\n10: Set $D_h = \emptyset$.\n11: for $\pi_h \in \Pi_h$ do\n12:   Use $\pi_h$ to collect a set of states $\{s_{\pi_h,i}\}_{i=1}^N$, where $s_{\pi_h,i} \in \mathcal{S}_h$.\n13:   Add all states $\{s_{\pi_h,i}\}_{i=1}^N$ into $D_h$.\n\nAlgorithm 3\nInput: target level $h \in [H]$ and an action $a \in \mathcal{A}$.\n1: for $\pi_h \in \Pi_h$ do\n2:   Collect $N$ trajectories using the policy $\pi^a_h$ defined in (1).\n3:   for $h' = H, H-1, \ldots, h+1$ do\n4:     Let $\tilde{D}^{\pi^a_h}_{h'} = \{s_{h',i}\}_{i=1}^N$ be the states at level $h'$ on the $N$ trajectories collected using $\pi^a_h$.\n5:     Invoke Oracle 4.2 on input $(D_{h'}, \tilde{D}^{\pi^a_h}_{h'}, \epsilon_s/|\Pi_{h'}|, \epsilon_t, \Lambda_{\Pi_{h'}})$.\n6:     if Oracle 4.2 returns True then\n7:       $\Pi_{h'} = \Pi_{h'} \cup \{\pi^a_h\}$.\n8:       Execute Algorithm 2 on input $h'$.\n\n6 Provably Efficient Q-learning with Linear Function Approximation\n\nNow we instantiate our algorithm for the linear function class. For the regression problem in (2), we set $\Lambda(\theta) = \lambda_{\mathrm{ridge}} \|\theta\|_2^2$. The concrete choice of the parameter $\lambda_{\mathrm{ridge}}$ will be given later. In this case, the regression program is the ridge regression estimator $$\hat{\theta}^a_h = \left(\frac{1}{|D^a_h|} \sum_{(s,y) \in D^a_h} \phi(s)\phi(s)^\top + \lambda_{\mathrm{ridge}} \cdot I\right)^{-1} \left(\frac{1}{|D^a_h|} \sum_{(s,y) \in D^a_h} y \cdot \phi(s)\right),$$ and $f^a_h(s_h) = \phi(s_h)^\top \hat{\theta}^a_h$ for $s_h \in \mathcal{S}_h$.\n\nFor Oracle 4.2, we choose $\Lambda_{\Pi_{h'}}(\theta_1, \theta_2) = \lambda_r/|\Pi_{h'}| \cdot \|\theta_1 - \theta_2\|_2^2$. The concrete choice of the parameter $\lambda_r$ will be given later. Since $\mathcal{Q}$ is the linear function class, the optimization problem is equivalent to the following program: $$\max_{\theta_1, \theta_2} \frac{1}{|D_2|} \sum_{t_i \in D_2} \left((\theta_1 - \theta_2)^\top \phi(t_i)\right)^2 \quad \text{s.t.} \quad \frac{1}{|D_1|} \sum_{s_i \in D_1} \left((\theta_1 - \theta_2)^\top \phi(s_i)\right)^2 + \lambda_r/|\Pi_{h'}| \cdot \|\theta_1 - \theta_2\|_2^2 \leq \epsilon_1.$$ We let $M_1 = \frac{1}{|D_1|} \sum_{s_i \in D_1} \phi(s_i)\phi(s_i)^\top + \lambda_r/|\Pi_{h'}| \cdot I$ and $M_2 = \frac{1}{|D_2|} \sum_{t_i \in D_2} \phi(t_i)\phi(t_i)^\top$, and let $\tilde{\theta} \triangleq \frac{1}{\sqrt{\epsilon_1}} M_1^{1/2} (\theta_1 - \theta_2)$; then the optimization problem can be further reduced to $$\max_{\tilde{\theta}} \tilde{\theta}^\top \left(\epsilon_1 M_1^{-1/2} M_2 M_1^{-1/2}\right) \tilde{\theta} \quad \text{s.t.} \quad \|\tilde{\theta}\|_2 \leq 1,$$ which is equivalent to computing the top eigenvalue of $\epsilon_1 M_1^{-1/2} M_2 M_1^{-1/2}$. Therefore, the regression problem in (2) and Oracle 4.2 can be efficiently implemented. Our main result is the following theorem.\n\nTheorem 6.1 (Provably Efficient Q-Learning with Linear Function Approximation). Let $\epsilon \leq \mathrm{poly}(\gamma, 1/C, 1/d, 1/H, 1/K)$ be the target accuracy parameter. Under Assumptions 3.1, 3.2 and 3.3, using at most $\mathrm{poly}(1/\epsilon)$ trajectories, with high probability, Algorithm 1 returns a policy $\hat{\pi}$ that satisfies $V^{\hat{\pi}}(s_1) \geq V^*(s_1) - \epsilon$.\n\nThis theorem demonstrates that if the true Q-function is linear, then it is indeed possible to learn a near-optimal policy with a polynomial number of samples. We refer readers to the proof of Theorem 6.1 for the specific values of $\epsilon_t, \epsilon_s, \epsilon_N, \lambda_{\mathrm{ridge}}, \lambda_r, N$. Furthermore, our algorithm also runs in polynomial time. Therefore, this is the first provably efficient algorithm for Q-learning with function approximation in the stochastic setting.\n\nNow we briefly sketch the proof of Theorem 6.1. The full proof is deferred to Section A. Our proof follows directly from the design of our algorithm. First, through a classical analysis of linear regression, we know the learned predictor $\hat{\theta}^a_h$ predicts well on the distribution induced by $\pi^a_h$. Second, Oracle 4.2 guarantees that if it returns False, then the learned predictors at level $h'$ predict well on the distribution over $\mathcal{S}_{h'}$ induced by the policy $\pi^a_h$. Therefore, the labels we use to learn $\theta^a_h$ have only small bias, and thus we can learn $\theta^a_h$ well. Now the trickiest part of the proof is to show that Oracle 4.2 returns True at most a polynomial number of times. To establish this, for each $h \in [H]$, we construct a potential function in terms of the covariance matrices induced by the policies in $\Pi_h$. We show that whenever a new policy is added to $\Pi_h$, this potential function must increase by a multiplicative factor. We further show that this potential function is always polynomially upper bounded by the size of the policy set. Therefore, we can conclude that the size of $\Pi_h$ is polynomially upper bounded. See Lemma A.5 for details.\n\n7 Discussion\n\nBy giving a provably efficient algorithm for Q-learning with function approximation, this paper paves the way for rigorous studies of modern model-free reinforcement learning methods with function approximation. We now list some future directions.\n\nRegret Bound This paper presents a PAC bound but no regret bound. Note that we assume the gap between the on-the-go reward of the best action and the rest is strictly positive. In the tabular setting, previous work showed that under this assumption, one can obtain a $\log T$ regret bound [28, 38]. We believe it would be a very strong result to prove (or disprove) a $\log T$ regret bound in the setting considered in this paper.\n\nQ-learning with General Function Classes While the main theorem in this paper is about the linear function class, the DSEC oracle and the general algorithmic framework apply to any function class. From an information-theoretic point of view, given Oracle 4.2, can we use it to design algorithms for general function classes with polynomial sample complexity guarantees? For example, if the Q-function class has bounded VC-dimension, can Algorithm 1 give a polynomial sample complexity guarantee?
We believe a generalization of Lemma A.5 is required to resolve this question. Another interesting problem is to generalize our algorithm to the case where the Q-function is not exactly linear but can only be approximated by a linear function.\n\nFrom the computational point of view, can we characterize the function classes for which we have an efficient solver for Oracle 4.2? For those for which we do not have such exact solvers, can we develop a relaxed version of Oracle 4.2 which, possibly sacrificing some sample efficiency, makes the optimization problem tractable? This idea was used in the sparse learning literature [34]. Another interesting problem is to improve the computational efficiency of our algorithm, to make it fast enough to be used in practice.\n\nToward a Rigorous Theory for DQN Deep Q-learning (DQN) is one of the most popular model-free methods in modern reinforcement learning. Recent studies established that over-parameterized neural networks are equivalent to kernel predictors [15, 3] with multi-layer kernel functions. Since kernel predictors can be viewed as linear predictors in infinite-dimensional feature spaces, can we adapt our algorithm to over-parameterized neural networks and multi-layer kernels, and prove polynomial sample complexity guarantees when, e.g., the true Q-function has a small reproducing kernel Hilbert space norm?\n\nAcknowledgements\n\nThe authors would like to thank Nan Jiang, Akshay Krishnamurthy, Wen Sun, Yining Wang and Lin F. Yang for useful discussions. The paper was initiated while S. S. Du was an intern at MSR NYC and a Ph.D. student at Carnegie Mellon University. Part of this work was done while S. S. Du and R. Wang were visiting the Simons Institute.\n\nReferences\n\n[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits.
In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.

[3] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

[4] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), 2010.

[5] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.

[6] K. Azizzadenesheli, E. Brunskill, and A. Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9, February 2018.

[7] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.

[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

[9] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On polynomial time PAC reinforcement learning with rich observations. arXiv preprint arXiv:1803.00606, 2018.

[10] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 5717–5727. Curran Associates Inc., 2017.

[11] Simon S. Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudík, and John Langford. Provably efficient RL with rich observations via latent state decoding.
arXiv preprint arXiv:1901.09018, 2019.

[12] Brian Everitt. The Cambridge Dictionary of Statistics. Cambridge University Press, Cambridge, UK; New York, 2002.

[13] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations, 2018.

[14] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.

[15] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[16] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

[17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1704–1713. JMLR.org, 2017.

[18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.

[19] Sham Kakade, Mengdi Wang, and Lin Yang. Variance reduction methods for sublinear reinforcement learning. February 2018.

[20] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

[21] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Mach.
Learn., 49(2-3):209–232, November 2002.

[22] Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.

[23] Zachary Chase Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In AAAI, 2018.

[24] Francisco S. Melo and M. Isabel Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.

[25] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2377–2386. JMLR.org, 2016.

[26] Jason Pazis and Ronald Parr. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, pages 774–781. AAAI Press, 2013.

[27] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, July 1959.

[28] Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. May 2019.

[29] Zhao Song and Wen Sun. Efficient model-free reinforcement learning in metric spaces. arXiv preprint arXiv:1905.00475, 2019.

[30] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[31] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based reinforcement learning in contextual decision processes.
arXiv preprint arXiv:1811.08540, 2018.

[32] Richard S. Sutton. Open theoretical questions in reinforcement learning. In EuroCOLT, 1999.

[33] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[34] Vincent Q. Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670–2678, 2013.

[35] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[36] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.

[37] Lin F. Yang, Chengzhuo Ni, and Mengdi Wang. Learning to control in metric space with optimal regret. arXiv preprint arXiv:1905.01576, 2019.

[38] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.

[39] Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint arXiv:1902.02234, 2019.