{"title": "On Oracle-Efficient PAC RL with Rich Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1422, "page_last": 1432, "abstract": "We study the computational tractability of PAC reinforcement learning with rich observations. We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. With stochastic hidden state dynamics, we prove that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.", "full_text": "On Oracle-Ef\ufb01cient PAC RL with Rich Observations\n\nChristoph Dann\n\nCarnegie Mellon University\n\nPittsburgh, Pennsylvania\n\nNan Jiang\u2217\n\nUIUC\n\nUrbana, Illinois\n\nAkshay Krishnamurthy\n\nMicrosoft Research\nNew York, New York\n\ncdann@cdann.net\n\nnanjiang@illinois.edu\n\nakshay@cs.umass.edu\n\nAlekh Agarwal\n\nMicrosoft Research\n\nRedmond, Washington\n\nalekha@microsoft.com\n\nJohn Langford\n\nMicrosoft Research\nNew York, New York\njcl@microsoft.com\n\nRobert E. Schapire\nMicrosoft Research\nNew York, New York\n\nschapire@microsoft.com\n\nAbstract\n\nWe study the computational tractability of PAC reinforcement learning with rich\nobservations. 
We present new provably sample-ef\ufb01cient algorithms for environ-\nments with deterministic hidden state dynamics and stochastic rich observations.\nThese methods operate in an oracle model of computation\u2014accessing policy and\nvalue function classes exclusively through standard optimization primitives\u2014and\ntherefore represent computationally ef\ufb01cient alternatives to prior algorithms that\nrequire enumeration. With stochastic hidden state dynamics, we prove that the only\nknown sample-ef\ufb01cient algorithm, OLIVE [1], cannot be implemented in the oracle\nmodel. We also present several examples that illustrate fundamental challenges of\ntractable PAC reinforcement learning in such general settings.\n\n1\n\nIntroduction\n\nWe study episodic reinforcement learning (RL) in environments with realistically rich observations\nsuch as images or text, which we refer to broadly as contextual decision processes. We aim for\nmethods that use function approximation in a provably effective manner to \ufb01nd the best possible\npolicy through strategic exploration.\nWhile such problems are central to empirical RL research [2], most theoretical results on strategic\nexploration focus on tabular MDPs with small state spaces [3\u201310]. Comparatively little work exists\non provably effective exploration with large observation spaces that require generalization through\nfunction approximation. The few algorithms that do exist either have poor sample complexity\nguarantees [e.g., 11\u201314] or require fully deterministic environments [15, 16] and are therefore\ninapplicable to most real-world applications and modern empirical RL benchmarks. 
This scarcity of positive results on ef\ufb01cient exploration with function approximation can likely be attributed to the challenging nature of this problem rather than a lack of interest by the research community.\n\nOn the statistical side, recent important progress was made by showing that contextual decision processes (CDPs) with rich stochastic observations and deterministic dynamics over M hidden states can be learned with a sample complexity polynomial in M [17]. This was followed by an algorithm called OLIVE [1] that enjoys a polynomial sample complexity guarantee for a broader range of CDPs, including ones with stochastic hidden state transitions. While encouraging, these efforts focused exclusively on statistical issues, ignoring computation altogether. Speci\ufb01cally, the proposed algorithms exhaustively enumerate candidate value functions to eliminate the ones that violate Bellman equations, an approach that is computationally intractable for any function class of practical interest. Thus, while showing that RL with rich observations can be statistically tractable, these results leave open the question of computational feasibility.\n\n\u2217The work was done while NJ was a postdoc researcher at MSR NYC.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we focus on this dif\ufb01cult computational challenge. We work in an oracle model of computation, meaning that we aim to design sample-ef\ufb01cient algorithms whose computation can be reduced to common optimization primitives over function spaces, such as linear programming and cost-sensitive classi\ufb01cation. 
The oracle-based approach has produced practically effective algorithms\nfor active learning [18], contextual bandits [19], structured prediction [20, 21], and multi-class\nclassi\ufb01cation [22], and here, we consider oracle-based algorithms for challenging RL settings.\nWe begin by studying the setting of Krishnamurthy et al. [17] with deterministic dynamics over\nM hidden states and stochastic rich observations. In Section 4, we use cost-sensitive classi\ufb01cation\nand linear programming oracles to develop VALOR, the \ufb01rst algorithm that is both computationally\nand statistically ef\ufb01cient for this setting. While deterministic hidden-state dynamics are somewhat\nrestrictive, the model is considerably more general than fully deterministic MDPs assumed by prior\nwork [15, 16], and it accurately captures modern empirical benchmarks such as visual grid-worlds in\nMinecraft [23]. As such, this method represents a considerable advance toward provably ef\ufb01cient RL\nin practically relevant scenarios.\nNevertheless, we ultimately seek ef\ufb01cient algorithms for more general settings, such as those with\nstochastic hidden-state transitions. Working toward this goal, we study the computational aspects of\nthe OLIVE algorithm [1], which applies to a wide range of environments. Unfortunately, in Section 5.1,\nwe show that OLIVE cannot be implemented ef\ufb01ciently in the oracle model of computation. As\nOLIVE is the only known statistically ef\ufb01cient approach for this general setting, our result establishes a\nsigni\ufb01cant barrier to computational ef\ufb01ciency. In the appendix, we also describe several other barriers,\nand two other oracle-based algorithms for the deterministic-dynamics setting that are considerably\ndifferent from VALOR. The negative results identify where the hardness lies while the positive results\nprovide a suite of new algorithmic tools. 
Together, these results advance our understanding of ef\ufb01cient\nreinforcement learning with rich observations.\n\n2 Related Work\n\nThere is abundant work on strategic exploration in the tabular setting [3\u201310]. The computation\nin these algorithms often involves planning in optimistic models and can be solved ef\ufb01ciently via\ndynamic programming. To extend the theory to the more practical settings of large state spaces,\ntypical approaches include (1) distance-based state identity test under smoothness assumptions [e.g.,\n11\u201314], or (2) working with factored MDPs [e.g., 24]. The former approach is similar to the use of\nstate abstractions [25], and typically incurs exponential sample complexity in state dimension. The\nlatter approach does have sample-ef\ufb01cient results, but the factored representation assumes relatively\ndisentangled state variables which cannot model rich sensory inputs (such as images).\nAzizzadenesheli et al. [26] have studied regret minimization in rich observation MDPs, a special case\nof contextual decision processes with a small number of hidden states and reactive policies. They do\nnot utilize function approximation, and hence incur polynomial dependence on the number of unique\nobservations in both sample and computational complexity. Therefore, this approach, along with\nrelated works [27, 28], does not scale to the rich observation settings that we focus on here.\nWen and Van Roy [15, 16] have studied exploration with function approximation in fully deterministic\nMDPs, which is considerably more restrictive than our setting of deterministic hidden state dynamics\nwith stochastic observations and rewards. Moreover, their analysis measures representation com-\nplexity using eluder dimension [29, 30], which is only known to be small for some simple function\nclasses. 
In comparison, our bounds scale with more standard complexity measures and can easily extend to VC-type quantities, which allows our theory to apply to practical and popular function approximators including neural networks [31].\n\n3 Setting and Background\n\nWe consider reinforcement learning (RL) in a common special case of contextual decision processes [17, 1], sometimes referred to as rich observation MDPs [26]. We assume an H-step process where in each episode, a random trajectory s1, x1, a1, r1, s2, x2, . . . , sH , xH , aH , rH is generated.\n\nFigure 1: Graphical representation of the problem class considered by our algorithm, VALOR: The main assumptions that enable sample-ef\ufb01cient learning are (1) that the small hidden state sh is identi\ufb01able from the rich observation xh and (2) that the next state is a deterministic function of the previous state and action. State and observation examples are from https://github.com/Microsoft/malmo-challenge.\n\nFor each time step (or level) h \u2208 [H], sh \u2208 S where S is a \ufb01nite hidden state space, xh \u2208 X where X is the rich observation (context) space, ah \u2208 A where A is a \ufb01nite action space of size K, and rh \u2208 R. Each hidden state s \u2208 S is associated with an emission process Os \u2208 \u2206(X ), and we use x \u223c s as a shorthand for x \u223c Os. We assume that each rich observation contains enough information so that s can in principle be identi\ufb01ed just from x \u223c Os\u2014hence x is a Markov state and the process is in fact an MDP over X\u2014but the mapping x \u21a6 s is unavailable to the agent and s is never observed. 
The hidden states S introduce structure into the problem, which is essential since we allow the observation space X to be in\ufb01nitely large.2 The issue of partial observability is not the focus of the paper.\n\nLet \u0393 : S \u00d7 A \u2192 \u2206(S) de\ufb01ne transition dynamics over the hidden states, and let \u03931 \u2208 \u2206(S) denote an initial distribution over hidden states. R : X \u00d7 A \u2192 \u2206(R) is the reward function; this differs from partially observable MDPs where reward depends only on s, making the problem more challenging. With this notation, a trajectory is generated as follows: s1 \u223c \u03931, x1 \u223c s1, r1 \u223c R(x1, a1), s2 \u223c \u0393(s1, a1), x2 \u223c s2, . . . , sH \u223c \u0393(sH\u22121, aH\u22121), xH \u223c sH, rH \u223c R(xH, aH), with actions a1:H chosen by the agent. We emphasize that s1:H are unobservable to the agent.\n\nTo simplify notation, we assume that each observation and hidden state can only appear at a particular level. This implies that S is partitioned into {Sh}H h=1 with size M := maxh\u2208[H] |Sh|. For regularity, assume rh \u2265 0 and \u2211H h=1 rh \u2264 1 almost surely.\n\nIn this setting, the learning goal is to \ufb01nd a policy \u03c0 : X \u2192 A that maximizes the expected return V \u03c0 := E[\u2211H h=1 rh | a1:H \u223c \u03c0]. Let \u03c0\u22c6 denote the optimal policy, which maximizes V \u03c0, with optimal value function g\u22c6 de\ufb01ned as g\u22c6(x) := E[\u2211H h\u2032=h rh\u2032 | xh = x, ah:H \u223c \u03c0\u22c6]. As is standard, g\u22c6 satis\ufb01es the Bellman equation: \u2200x at level h,\n\ng\u22c6(x) = maxa\u2208A E[rh + g\u22c6(xh+1) | xh = x, ah = a],\n\nwith the understanding that g\u22c6(xH+1) \u2261 0. A similar equation holds for the optimal Q-value function Q\u22c6(x, a) := E[\u2211H h\u2032=h rh\u2032 | xh = x, ah = a, ah+1:H \u223c \u03c0\u22c6], and \u03c0\u22c6(x) = argmaxa\u2208A Q\u22c6(x, a).3\n\nBelow are two special cases of the setting described above that will be important for later discussions.\n\nTabular MDPs: An MDP with a \ufb01nite and small state space is a special case of this model, where X = S and Os is the identity map for each s. This setting is relevant in our discussion of oracle-ef\ufb01ciency of the existing OLIVE algorithm in Section 5.1.\n\nDeterministic dynamics over hidden states: Our algorithm, VALOR, works in this special case, which requires \u03931 and \u0393(s, a) to be point masses. Originally proposed by Krishnamurthy et al. [17], this setting can model some challenging benchmark environments in modern reinforcement learning, including visual grid-worlds common to the deep RL literature [e.g., 23]. In such tasks, the state records the position of each game element in a grid but the agent observes a rendered 3D view. Figure 1 shows a visual summary of this setting. We describe VALOR in detail in Section 4.\n\nThroughout the paper, we use \u02c6ED[\u00b7] to denote empirical expectation over samples from a data set D.\n\n2Indeed, the lower bound in Proposition 6 of Jiang et al. [1] shows that ignoring underlying structure precludes provably-ef\ufb01cient RL, even with function approximation.\n\n3Note that the optimal policy and value functions depend on x and not just s even if s were known, since reward is a function of x.\n\n3.1 Function Classes and Optimization Oracles\n\nAs X can be rich, the agent must use function approximation to generalize across observations. 
To that end, we assume a given value function class G \u2282 (X \u2192 [0, 1]) and policy class \u03a0 \u2282 (X \u2192 A). Our algorithm is agnostic to the speci\ufb01c function classes used, but for the guarantees to hold, they must be expressive enough to represent the optimal value function and policy, that is, \u03c0\u22c6 \u2208 \u03a0 and g\u22c6 \u2208 G. Prior works often use F \u2282 (X \u00d7 A \u2192 [0, 1]) to approximate Q\u22c6 instead, but for example Jiang et al. [1] point out that their OLIVE algorithm can equivalently work with G and \u03a0. This (G, \u03a0) representation is useful in resolving the computational dif\ufb01culty in the deterministic setting, and has also been used in practice [32].\n\nWhen working with large and abstract function classes as we do here, it is natural to consider an oracle model of computation and assume that these classes support various optimization primitives. We adopt this oracle-based approach here, and speci\ufb01cally use the following oracles:\n\nCost-Sensitive Classi\ufb01cation (CSC) on Policies. A cost-sensitive classi\ufb01cation (CSC) oracle receives as inputs a parameter \u03b5sub and a sequence {(x(i), c(i))}i\u2208[n] of observations x(i) \u2208 X and cost vectors c(i) \u2208 RK, where c(i)(a) is the cost of predicting action a \u2208 A for x(i). The oracle returns a policy whose average cost is within \u03b5sub of the minimum average cost, min\u03c0\u2208\u03a0 (1/n) \u2211n i=1 c(i)(\u03c0(x(i))). While CSC is NP-hard in the worst case, it can be further reduced to binary classi\ufb01cation [33, 34], for which many practical algorithms exist and which actually forms the core of empirical machine learning. As further motivation, the CSC oracle has been used in practically effective algorithms for contextual bandits [35, 19], imitation learning [20], and structured prediction [21].\n\nLinear Programs (LP) on Value Functions. A linear program (LP) oracle considers an optimization problem where the objective o : G \u2192 R and the constraints h1, . . . , hm are linear functionals of G generated by \ufb01nitely many function evaluations. That is, o and each hj have the form \u2211n i=1 \u03b1ig(xi) with coef\ufb01cients {\u03b1i}i\u2208[n] and contexts {xi}i\u2208[n]. Formally, for a program of the form\n\nmaxg\u2208G o(g), subject to hj(g) \u2264 cj, \u2200j \u2208 [m],\n\nwith constants {cj}j\u2208[m], an LP oracle with approximation parameters \u03b5sub, \u03b5feas returns a function \u02c6g that is at most \u03b5sub-suboptimal and that violates each constraint by at most \u03b5feas. For intuition, if the value functions G are linear with parameter vector \u03b8 \u2208 Rd, i.e., g(x) = \u27e8\u03b8, x\u27e9, then this reduces to a linear program in Rd for which a plethora of provably ef\ufb01cient solvers exist. Beyond the linear case, such problems can be practically solved using standard continuous optimization methods. LP oracles are also employed in prior work focusing on deterministic MDPs [15, 16].\n\nLeast-Squares (LS) Regression on Value Functions. We also consider a least-squares regression (LS) oracle that returns the value function which minimizes a square-loss objective. Since VALOR does not use this oracle, we defer details to the appendix.\n\nWe de\ufb01ne the following notion of oracle-ef\ufb01ciency based on the optimization primitives above.\n\nDe\ufb01nition 1 (Oracle-Ef\ufb01cient). 
An algorithm is oracle-ef\ufb01cient if it can be implemented with polyno-\nmially many basic operations and calls to CSC, LP, and LS oracles.\n\nNote that our algorithmic results continue to hold if we include additional oracles in the de\ufb01nition,\nwhile our hardness results easily extend, provided that the new oracles can be ef\ufb01ciently implemented\nin the tabular setting (i.e., they satisfy Proposition 6; see Section 5).\n\n4 VALOR: An Oracle-Ef\ufb01cient Algorithm\n\nIn this section we propose and analyze a new algorithm, VALOR (Values stored Locally for RL)\nshown in Algorithm 1 (with 2 & 3 as subroutines). As we will show, this algorithm is oracle-ef\ufb01cient\n\n4\n\n\fand enjoys a polynomial sample-complexity guarantee in the deterministic hidden-state dynamics\nsetting described earlier, which was originally introduced by Krishnamurthy et al. [17].\n\nAlgorithm 2: Subroutine: Policy optimization\nwith local values\n1 Function polvalfun()\n2\n3\n\n\u02c6V (cid:63) \u2190 V of the only dataset in D1;\nfor h = 1 : H do\n// CSC-oracle\n\u02c6\u03c0h \u2190 argmax\n\u03c0\u2208\u03a0h\n\n(D,V,{Va})\u2208Dh\n\n(cid:80)\n\nVD(\u03c0;{Va});\n\n4\n\n5\n\nreturn \u02c6\u03c01:H , \u02c6V (cid:63);\n\n// Alg.3\n\nNotation:\nVD(\u03c0;{Va}) := \u02c6ED[K1{\u03c0(x) = a}(r + Va)]\n\n// see exact values in Table 1 in the appendix\n// accuracy of learned values at level h\n\nAlgorithm 1: Main Algorithm VALOR\n1 Global: D1, . . .DH initialized as \u2205;\n2 Function MetaAlg\ndfslearn (\u2205) ;\n3\nfor k = 1, . . . , M H do\n4\n5\n6\n\n\u02c6\u03c0(k), \u02c6V (k) \u2190 polvalfun() ; // Alg.2\nT \u2190 sample neval trajectories with \u02c6\u03c0(k);\n\u02c6V \u02c6\u03c0(k) \u2190 average return of T ;\nif \u02c6V (k) \u2264 \u02c6V \u02c6\u03c0(k)\nfor h = 1 . . . H \u2212 1 do\n\n2 then return \u02c6\u03c0(k) ;\n\nfor all a1:h of nexpl traj. 
\u2208 T do\n\n+ \u0001\n\n// Alg.3\n\ndfslearn (a1:h) ;\n\nreturn failure;\n\nAlgorithm 3: Subroutine: DFS Learning of local values\n1 \u0001feas = \u0001sub = \u0001stat = \u02dcO(\u00012/M H 3) ;\n2 \u03c6h = (H + 1 \u2212 h)(6\u0001stat + 2\u0001sub + \u0001feas) ;\n3 Function dfslearn(path p with length h \u2212 1)\n4\n5\n\nfor a \u2208 A do\n\n\u02c6ED(cid:48)[g(xh+1)]\n\nD(cid:48) \u2190 Sample ntest trajectories with actions p \u25e6 a ;\nVopt \u2190 maxg\u2208Gh+1\nif |Vopt \u2212 Vpes| \u2264 2\u03c6h+1 + 4\u0001stat + 2\u0001feas then\nelse\n\ns.t. \u2200(D, V, ) \u2208 Dh+1 :\n\nVa \u2190 (Vopt + Vpes)/2 ;\nVa \u2190 dfslearn(p \u25e6 a) ;\n\n\u02dcD \u2190 Sample ntrain traj. with p and ah \u223c Unif(K);\n\u02dcV \u2190 max\u03c0\u2208\u03a0h V \u02dcD(\u03c0;{Va});\nAdd ( \u02dcD, \u02dcV ,{Va}a\u2208A) to Dh;\nreturn \u02dcV ;\n\n7\n\n8\n9\n10\n11\n\n12\n\n6\n\n7\n8\n9\n10\n\n11\n12\n13\n14\n\n// compute optimistic / pessimistic values using LP-oracle\n\n(and Vpes \u2190 ming\u2208Gh+1\n|V \u2212 \u02c6ED[g(xh+1)]| \u2264 \u03c6h+1 ;\n\n\u02c6ED(cid:48)[g(xh+1)])\n\n// consensus among remaining functions\n\n// no consensus, descend\n\n// CSC-oracle\n\nSince hidden states can be deterministically reached by sequences of actions (or paths), from an\nalgorithmic perspective, the process can be thought of as an exponentially large tree where each\nnode is associated with a hidden state (such association is unknown to the agent). Similar to LSVEE\n[17], VALOR \ufb01rst explores this tree (Line 3) with a form of depth \ufb01rst search (Algorithm 3). 
To\navoid visiting all of the exponentially many paths, VALOR performs a state identity test (Algorithm 3,\nLines 5\u20138): the data collected so far is used to (virtually) eliminate functions in G (Algorithm 3,\nLine 6), and we do not descend to a child if the remaining functions agree on the value of the child\nnode (Algorithm 3, Line 7).\nThe state identity test prevents exploring the same hidden state twice but might also incorrectly\nprune unvisited states if all functions happen to agree on the value. Unfortunately, with no data\nfrom such pruned states, we are unable to learn the optimal policy on them. To address this issue,\nafter dfslearn returns, we \ufb01rst use the stored data and values (Line 5) to compute a policy (see\nAlgorithm 2) that is near optimal on all explored states. Then, VALOR deploys the computed policy\n(Line 6) and only terminates if the estimated optimal value is achieved (Line 8). If not, the policy\nhas good probability of visiting those accidentally pruned states (see Appendix B.5), so we invoke\ndfslearn on the generated paths to complement the data sets (Line 11).\n\n5\n\n\fIn the rest of this section we describe VALOR in more detail, and then state its statistical and\ncomputational guarantees. VALOR follows a dynamic programming style and learns in a bottom-up\nfashion. As a result, even given stationary function classes (G, \u03a0) as inputs, the algorithm can return\na non-stationary policy \u02c6\u03c01:H := (\u02c6\u03c01, . . . , \u02c6\u03c0H ) \u2208 \u03a0H that may use different policies at different time\nsteps.4 To avoid ambiguity, we de\ufb01ne \u03a0h := \u03a0 and Gh := G for h \u2208 [H], to emphasize the time\npoint h under consideration. For convenience, we also de\ufb01ne GH+1 to be the singleton {x (cid:55)\u2192 0}.\nThis notation also allows our algorithms to handle more general non-stationary function classes.\nDetails of depth-\ufb01rst search exploration. 
VALOR maintains many data sets collected at paths\nvisited by dfslearn. Each data set D is collected from some path p, which leads to some hidden\nstate s. (Due to determinism, we will refer to p and s interchangeably throughout this section.) D\nconsists of tuples (x, a, r) where x \u223c p (i.e., x \u223c Os), a \u223c Unif(K), and r is the instantaneous\nreward. Associated with D, we also store a scalar V which approximates V (cid:63)(s), and {Va}a\u2208A\nwhich approximate {V (cid:63)(s \u25e6 a)}a\u2208A, where s \u25e6 a denotes the state reached when taking a in s. The\nestimates {Va}a\u2208A of the future optimal values associated with the current path p \u2208 Ah\u22121 are\neither determined through a recursive call (Line 10), or through a state-identity test (Lines 5\u20138 in\ndfslearn). To check if we already know V (cid:63)(p \u25e6 a), we solve constrained optimization problems to\ncompute optimistic and pessimistic estimates, using a small amount of data from p\u25e6a. The constraints\neliminate all g \u2208 Gh+1 that make incorrect predictions for V (cid:63)(s(cid:48)) for any previously visited s(cid:48) at\nlevel h + 1. As such, if we have learned the value of s \u25e6 a on a different path, the optimistic and\npessimistic values must agree (\u201cconsensus\u201d), so we need not descend. Once we have the future values\nVa, the value estimate \u02dcV (which approximates V (cid:63)(s)) is computed (in Line 12) by maximizing the\nsum of immediate reward and future values, re-weighted using importance sampling to re\ufb02ect the\npolicy under consideration \u03c0:\n\nVD(\u03c0;{Va}) := \u02c6ED[K1{\u03c0(x) = a}(r + Va)].\n\n(1)\n\nDetails of policy optimization and exploration-on-demand. polvalfun performs a sequence of\npolicy optimization steps using all the data sets collected so far to \ufb01nd a non-stationary policy that is\nnear-optimal at all explored states simultaneously. Note that this policy differs from that computed in\n(Alg. 
3, Line 12) as it is common for all datasets at a level h. Finally, using this non-stationary policy, MetaAlg estimates its suboptimality and either terminates successfully, or issues several other calls to dfslearn to gather more data sets. This so-called exploration-on-demand scheme is due to Krishnamurthy et al. [17], who describe the subroutine in more detail.\n\n4.1 What is new compared to LSVEE?\n\nThe overall structure of VALOR is similar to LSVEE [17]. The main differences are in the pruning mechanism, where we use a novel state-identity test, and the policy optimization step in Algorithm 2. LSVEE uses a Q-value function class F \u2282 (X \u00d7 A \u2192 [0, 1]) and a state identity test based on Bellman errors on data sets D consisting of (x, a, r, x\u2032) tuples:\n\n\u02c6ED[(f(x, a) \u2212 r \u2212 \u02c6Ex\u2032\u223ca[maxa\u2032\u2208A f(x\u2032, a\u2032)])2].\n\nThis enables a conceptually simpler statistical analysis, but the coupling between the value function and the policy yields challenging optimization problems that do not obviously admit ef\ufb01cient solutions. In contrast, VALOR uses dynamic programming to propagate optimal value estimates from future to earlier time points. From an optimization perspective, we \ufb01x the future value and only optimize the current policy, which can be implemented by standard oracles, as we will see. However, from a statistical perspective, the inaccuracy of the future value estimates leads to bias that accumulates over levels. 
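To make the pruning mechanism concrete, the following toy example illustrates the consensus-based state-identity test (Algorithm 3, lines 5\u20138). The tiny explicit function class and the brute-force feasibility search below are illustrative stand-ins for G and the LP oracle, not the paper's implementation:

```python
# Illustrative stand-in for VALOR's state-identity test: G is a tiny explicit
# function class and brute-force search replaces the LP oracle.

def lp_opt_pes(candidates, D_new, stored, phi):
    """Optimistic and pessimistic values of a new child state over all value
    functions consistent (within phi) with every stored (dataset, value) pair."""
    def mean(g, D):
        return sum(g[x] for x in D) / len(D)
    feasible = [g for g in candidates
                if all(abs(V - mean(g, D)) <= phi for D, V in stored)]
    vals = [mean(g, D_new) for g in feasible]
    return max(vals), min(vals)

# Two candidate value functions over observations {0, 1, 2}.
G = [{0: 0.0, 1: 0.5, 2: 0.5}, {0: 0.0, 1: 0.5, 2: 0.9}]
# One previously visited state: its dataset of observations and learned value.
stored = [([1, 1], 0.5)]

# A child emitting observation 1: all consistent functions agree -> prune.
v_opt, v_pes = lp_opt_pes(G, [1], stored, phi=0.05)
assert abs(v_opt - v_pes) < 1e-9          # consensus, no need to descend

# A child emitting observation 2: consistent functions disagree -> descend.
v_opt, v_pes = lp_opt_pes(G, [2], stored, phi=0.05)
assert v_opt - v_pes > 0.3                # no consensus, recurse via dfslearn
```

When every function consistent with the previously stored (dataset, value) pairs agrees on a child's value, VALOR reuses that value instead of descending, which is what prevents visiting the same hidden state twice.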
By a careful design of the algorithm and through an intricate and novel analysis, we show that this bias only accumulates linearly (as opposed to exponentially; see e.g., Appendix E.1), which leads to a polynomial sample complexity guarantee.\n\n4This is not rare in RL; see e.g., Chapter 3.4 of Ross [36].\n\n4.2 Computational and Sample Complexity of VALOR\n\nVALOR requires two types of nontrivial computations over the function classes. We show that they can be reduced to CSC on \u03a0 and LP on G (recall Section 3.1), respectively, and hence VALOR is oracle-ef\ufb01cient.\n\nFirst, Line 4 in polvalfun and Line 12 in dfslearn involve optimizing VD(\u03c0;{Va}) (Eq. (1)) over \u03a0, which can be reduced to CSC as follows: We \ufb01rst form tuples (x(i), a(i), y(i)) from D and {Va} on which VD(\u03c0;{Va}) depends, where we bind xh to x(i), ah to a(i), and rh + Vah to y(i). From the tuples, we construct a CSC data set (x(i), \u2212[K1{a = a(i)}y(i)]a\u2208A). On this data set, the cost-sensitive error of any policy (interpreted as a classi\ufb01er) is exactly \u2212VD(\u03c0;{Va}), so minimizing error (which the oracle does) maximizes the original objective.\n\nSecond, the state identity test requires solving the following problem over the function class G:\n\nVopt = maxg\u2208G \u02c6ED\u2032[g(xh)]   (and min for Vpes)   (2)\n\ns.t. V \u2212 \u03c6h \u2264 \u02c6ED[g(xh)] \u2264 V + \u03c6h, \u2200(D, V) \u2208 Dh.\n\nThe objective and the constraints are linear functionals of G, all empirical expectations involve polynomially many samples, and the number of constraints is |Dh|, which remains polynomial throughout the execution of the algorithm, as we will show in the sample complexity analysis. Therefore, the LP oracle can directly handle this optimization problem.\n\nWe now formally state the main computational and statistical guarantees for VALOR.\n\nTheorem 2 (Oracle ef\ufb01ciency of VALOR). 
Consider a contextual decision process with deterministic dynamics over M hidden states as described in Section 3. Assume \u03c0\u22c6 \u2208 \u03a0 and g\u22c6 \u2208 G. Then for any \u03b5, \u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4, VALOR makes O((MKH2/\u03b5) log(MH/\u03b4)) CSC oracle calls and at most O((MH2/\u03b5) log(MH/\u03b4)) LP oracle calls with required accuracy \u03b5feas = \u03b5sub = \u02dcO(\u03b52/MH3).\n\nTheorem 3 (PAC bound of VALOR). Under the same setting and assumptions as in Theorem 2, VALOR returns a policy \u02c6\u03c0 such that V \u22c6 \u2212 V \u02c6\u03c0 \u2264 \u03b5 with probability at least 1 \u2212 \u03b4, after collecting at most \u02dcO((M3H8K/\u03b55) log(|G||\u03a0|/\u03b4) log3(1/\u03b4)) trajectories.5\n\nNote that this bound assumes \ufb01nite value function and policy classes for simplicity, but can be extended to in\ufb01nite function classes with bounded statistical complexity using standard tools, as in Section 5.3 of Jiang et al. [1]. The resulting bound scales linearly with the Natarajan and pseudo-dimension of the function classes, which are generalizations of VC-dimension. We further expect that one can generalize the theorems above to an approximate version of realizability as in Section 5.4 of Jiang et al. [1].\n\nCompared to the guarantee for LSVEE [17], Theorem 3 is worse in the dependence on M, H, and \u03b5. Yet, in Appendix B.7 we show that a version of VALOR with alternative oracle assumptions enjoys a better PAC bound than LSVEE. 
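As an illustration of the first reduction in Section 4.2, the following self-contained sketch builds a CSC data set from (x, a, r) tuples and learned future values {Va}, and checks that minimizing cost-sensitive error maximizes the objective in Eq. (1). The tiny enumerated policy class stands in for the CSC oracle, and all numbers are illustrative:

```python
# Sketch of the reduction from maximizing V_D(pi; {V_a}) (Eq. (1)) to
# cost-sensitive classification; enumeration stands in for the CSC oracle.

K = 2                                         # number of actions
V_future = {0: 0.3, 1: 0.6}                   # learned future values {V_a}
D = [(0, 0, 0.1), (1, 1, 0.0), (2, 0, 0.5)]   # (x, a, r) tuples, a ~ Unif(K)

def V_D(pi):                                  # importance-weighted objective
    return sum(K * (pi[x] == a) * (r + V_future[a]) for x, a, r in D) / len(D)

# CSC data set: per-example cost vector c(x)[a'] = -K * 1{a' = a} * (r + V_a).
csc = [(x, [-K * (ap == a) * (r + V_future[a]) for ap in range(K)])
       for x, a, r in D]

def csc_cost(pi):                             # average cost of pi as classifier
    return sum(c[pi[x]] for x, c in csc) / len(csc)

# Enumerate the (tiny) tabular policy class over observations {0, 1, 2}.
policies = [{0: i, 1: j, 2: k} for i in range(K) for j in range(K) for k in range(K)]
best_csc = min(policies, key=csc_cost)        # what the CSC oracle returns
best_val = max(policies, key=V_D)             # direct maximizer of Eq. (1)
assert abs(V_D(best_csc) - V_D(best_val)) < 1e-12   # same objective value
```

The cost of a policy on the constructed data set is exactly the negated objective, so the oracle's cost minimizer is an objective maximizer, which is the whole content of the reduction.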
Nevertheless, we emphasize that our main goal is to understand the interplay between statistical and computational ef\ufb01ciency and to discover new algorithmic ideas that may lead to practical methods, rather than to improve sample complexity bounds.\n\n5 \u02dcO(\u00b7) suppresses logarithmic dependencies on M, K, H, 1/\u03b5 and doubly-logarithmic dependencies on 1/\u03b4, |G|, and |\u03a0|.\n\n5 Toward Oracle-Ef\ufb01cient PAC-RL with Stochastic Hidden State Dynamics\n\nVALOR demonstrates that provably sample- and oracle-ef\ufb01cient RL with rich stochastic observations is possible and, as such, makes progress toward reliable and practical RL in many applications. In this section, we discuss the natural next step of allowing stochastic hidden-state transitions.\n\n5.1 OLIVE is not Oracle-Ef\ufb01cient\n\nFor this more general setting with stochastic hidden state dynamics, OLIVE [1] is the only known algorithm with polynomial sample complexity, but its computational properties remain underexplored. We show here that OLIVE is in fact not oracle-ef\ufb01cient. A brief description of the algorithm is provided below, and in the theorem statement, we refer to a parameter \u03c6, which the algorithm uses as a tolerance on deviations of empirical expectations.\n\nTheorem 4. Assuming P \u2260 NP, even with algorithm parameter \u03c6 = 0 and perfect evaluation of expectations, OLIVE is not oracle-ef\ufb01cient, that is, it cannot be implemented with polynomially many basic arithmetic operations and calls to CSC, LP, and LS oracles.\n\nThe assumptions of perfect evaluation of expectations and \u03c6 = 0 are merely to unclutter the constructions in the proofs. We show this result by proving that even in tabular MDPs, OLIVE solves an NP-hard problem to determine its next exploration policy, while all oracles we consider have polynomial runtime in the tabular setting. 
While we only show this for CSC, LP, and LS oracles explicitly, we expect other practically relevant oracles to also be efficient in the tabular setting, and therefore they could not help to implement OLIVE efficiently.

This theorem shows that there are no known oracle-efficient PAC-RL methods for this general setting and that simply applying clever optimization tricks to implement OLIVE is not enough to achieve a practical algorithm. Yet, this result does not preclude tractable PAC RL altogether, and we discuss plausible directions in the subsequent section. Below we highlight the main arguments of the proof.

Proof Sketch of Theorem 4. OLIVE is round-based and follows the optimism in the face of uncertainty principle. At round k it selects a value function and a policy to execute, (ĝ_k, π̂_k), that promise the highest return while satisfying all average Bellman error constraints:

    (ĝ_k, π̂_k) = argmax_{g ∈ G, π ∈ Π}  Ê_{D_0}[g(x)]                        (3)
                  s.t.  |Ê_{D_i}[K·1{a = π(x)}(g(x) − r − g(x′))]| ≤ φ,  ∀ D_i ∈ D.

Here D_0 is a data set of initial contexts x, D consists of data sets of (x, a, r, x′) tuples collected in the previous rounds, and φ is a statistical tolerance parameter. If this optimistic policy π̂_k is close to optimal, OLIVE returns it and terminates. Otherwise we add a constraint to (3) by (i) choosing a time point h, (ii) collecting trajectories with π̂_k but choosing the h-th action uniformly at random, and (iii) storing the tuples (x_h, a_h, r_h, x_{h+1}) in the new data set D_k, which is added to the constraints for the next round.

The following theorem shows that OLIVE's optimization is NP-hard even in tabular MDPs.

Theorem 5.
Let P_OLIVE denote the family of problems of the form (3), parameterized by (X, A, Env, t), which describes the optimization problem induced by running OLIVE in the MDP Env (with states X, actions A, and perfect evaluation of expectations) for t rounds. OLIVE is given tabular function classes G = (X → [0, 1]) and Π = (X → A) and uses φ = 0. Then P_OLIVE is NP-hard.

At the same time, the oracles are implementable in polynomial time:

Proposition 6. For tabular value functions G = (X → [0, 1]) and policies Π = (X → A), the CSC, LP, and LS oracles can be implemented in time polynomial in |X|, K = |A|, and the input size.

Both proofs are in Appendix D. Proposition 6 implies that if OLIVE could be implemented with polynomially many CSC/LP/LS oracle calls, its total runtime would be polynomial for tabular MDPs. Assuming P ≠ NP, this contradicts Theorem 5, which states that determining the exploration policy of OLIVE in tabular MDPs is NP-hard. Combining both statements therefore proves Theorem 4.

We now give brief intuition for Proposition 6. To implement the CSC oracle, for each of the polynomially many observations x ∈ X, we simply add the cost vectors for that observation together and pick the action that minimizes the total cost, that is, compute the action π̂(x) as argmin_{a ∈ A} Σ_{i ∈ [n]: x^(i) = x} c^(i)(a). Similarly, the square-loss objective of the LS oracle decomposes, and we can compute the tabular solution one entry at a time. In both cases, the oracle runtime is O(nK|X|). Finally, using a one-hot encoding, G can be written as a linear function in R^|X|, for which the LP oracle problem reduces to an LP in R^|X|.
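As a concrete illustration of the tabular CSC computation just described, here is a minimal Python sketch (function and variable names are ours, not from the paper): it accumulates the cost vectors per observation and returns, for each observation, the action with the smallest total cost, in O(nK) time.

```python
from collections import defaultdict


def tabular_csc_oracle(examples, num_actions):
    """Tabular cost-sensitive classification oracle.

    examples: list of (x, c) pairs, where x is a hashable observation
    and c is a length-K sequence of per-action costs.
    Returns a dict mapping each observed x to the action minimizing the
    cost summed over all examples with that observation.
    """
    totals = defaultdict(lambda: [0.0] * num_actions)
    for x, cost_vector in examples:
        for a in range(num_actions):
            totals[x][a] += cost_vector[a]
    # For each observation, pick the argmin-cost action.
    return {x: min(range(num_actions), key=lambda a: costs[a])
            for x, costs in totals.items()}


# Two observations, K = 2 actions: summed costs for "s0" are [1.5, 1.0],
# so action 1 is chosen; for "s1" action 0 is cheapest.
policy = tabular_csc_oracle(
    [("s0", [1.0, 0.0]), ("s0", [0.5, 1.0]), ("s1", [0.0, 2.0])], 2)
# policy == {"s0": 1, "s1": 0}
```

The per-observation decomposition is exactly what makes the oracle polynomial-time in the tabular case.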
The ellipsoid method [37] solves such LPs approximately in polynomial time.

5.2 Computational Barriers with Decoupled Learning Rules

One factor contributing to the computational intractability of OLIVE is that (3) involves optimizing over policies and values jointly. It is therefore promising to look for algorithms that separate optimizations over policies and values, as in VALOR. In Appendix E, we provide a series of examples that illustrate some limitations of such algorithms. First, we show that methods that compute optimal values iteratively, in the style of fitted value iteration [38], need additional assumptions on G and Π besides realizability (Theorem 45). (Storing value estimates of states explicitly allows VALOR to require only realizability.) Second, we show that with stochastic state dynamics, average value constraints, as in Line 6 of Algorithm 3, can cause the algorithm to miss a high-value state (Proposition 46). Finally, we show that square-loss constraints suffer from similar problems (Proposition 47).

5.3 Alternative Algorithms

An important element of VALOR is that it explicitly stores value estimates of the hidden states, which we call "local values." Local values lead to statistical and computational efficiency under weak realizability conditions, but this approach is unlikely to generalize to the stochastic setting, where the agent may not be able to consistently visit a particular hidden state. In Appendices B.7-C.2, we therefore derive alternative algorithms which do not store local values but instead approximate the future value g⋆(x_{h+1}) directly. Inspired by classical RL algorithms, these algorithms approximate g⋆(x_{h+1}) by either bootstrap targets ĝ_{h+1}(x_{h+1}) (as in TD methods) or Monte-Carlo estimates of the return using a near-optimal roll-out policy π̂_{h+1:H} (as in PSDP [39]).
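To make the distinction between the two kinds of regression target concrete, here is a minimal sketch (all names are hypothetical illustrations, not the paper's algorithms, which live in Appendices B.7-C.2):

```python
def bootstrap_target(reward, next_context, g_next):
    """TD-style target: immediate reward plus the current value
    estimate (standing in for a learned g-hat at step h+1) applied
    to the next context."""
    return reward + g_next(next_context)


def monte_carlo_target(rewards_from_h):
    """Monte-Carlo target: the empirical return of a roll-out policy
    from step h to the horizon H (sum of the remaining rewards)."""
    return sum(rewards_from_h)


# With a constant value estimate of 0.5, the bootstrap target for a
# reward of 1.0 is 1.5; the Monte-Carlo target sums the roll-out rewards.
td = bootstrap_target(1.0, "x_next", lambda x: 0.5)   # 1.5
mc = monte_carlo_target([1.0, 0.0, 2.0])              # 3.0
```

The bootstrap target depends on the accuracy of the value estimate at the next step, while the Monte-Carlo target depends on the quality of the roll-out policy, which is the source of the additional errors discussed next.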
Using such targets can introduce additional errors, and stronger realizability-type assumptions on Π and G are necessary for polynomial sample complexity (see Appendices C and E). Nevertheless, these algorithms are also oracle-efficient, and while we only establish statistical efficiency with deterministic hidden state dynamics, we believe that they considerably expand the space of plausible algorithms for the general setting.

6 Conclusion

This paper describes new RL algorithms for environments with rich stochastic observations and deterministic hidden state dynamics. Unlike other existing approaches, these algorithms are computationally efficient in an oracle model, and we emphasize that the oracle-based approach has led to practical algorithms for many other settings. We believe this work represents an important step toward computationally and statistically efficient RL with rich observations.

While challenging benchmark environments in modern RL (e.g., visual grid-worlds [23]) often have the assumed deterministic hidden state dynamics, the natural goal is to develop efficient algorithms that handle stochastic hidden-state dynamics. We show that the only known approach for this setting is not implementable with standard oracles, and we also provide several constructions demonstrating other concrete challenges of RL with stochastic state dynamics. This provides insight into the key open question of whether we can design an efficient algorithm for the general setting. We hope to resolve this question in future work.

References

[1] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K.
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.

[3] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

[4] Ronen I. Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2003.

[5] Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In International Conference on Machine Learning, 2005.

[6] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.

[7] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, 2009.

[8] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, 2015.

[9] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017.

[10] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

[11] Sham M. Kakade, Michael Kearns, and John Langford. Exploration in metric state spaces. In International Conference on Machine Learning, 2003.

[12] Jason Pazis and Ronald Parr. PAC optimal exploration in continuous space Markov decision processes.
In AAAI Conference on Artificial Intelligence, 2013.

[13] Robert Grande, Thomas Walsh, and Jonathan How. Sample efficient reinforcement learning with Gaussian processes. In International Conference on Machine Learning, 2014.

[14] Jason Pazis and Ronald Parr. Efficient PAC-optimal exploration in concurrent, continuous state MDPs with delayed updates. In AAAI Conference on Artificial Intelligence, 2016.

[15] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, 2013.

[16] Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 2017.

[17] Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, 2016.

[18] Daniel Joseph Hsu. Algorithms for active learning. PhD thesis, UC San Diego, 2010.

[19] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.

[20] Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv:1406.5979, 2014.

[21] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.

[22] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 2000.

[23] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell.
The Malmo Platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence, 2016.

[24] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In International Joint Conference on Artificial Intelligence, 1999.

[25] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In International Symposium on Artificial Intelligence and Mathematics, 2006.

[26] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning in rich-observation MDPs using spectral methods. arXiv:1611.03907, 2016.

[27] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of POMDPs using spectral methods. In Conference on Learning Theory, 2016.

[28] Zhaohan Daniel Guo, Shayan Doroudi, and Emma Brunskill. A PAC RL algorithm for episodic POMDPs. In Artificial Intelligence and Statistics, 2016.

[29] Dan Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.

[30] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the Eluder dimension. In Advances in Neural Information Processing Systems, 2014.

[31] Martin Anthony and Peter L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[32] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, 2018.

[33] Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory, 2009.

[34] John Langford and Alina Beygelzimer.
Sensitive error correcting output codes. In International Conference on Computational Learning Theory, 2005.

[35] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.

[36] Stéphane Ross. Interactive learning for sequential decisions and predictions. PhD thesis, Carnegie Mellon University, 2013.

[37] Leonid G. Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 1980.

[38] Geoffrey J. Gordon. Stable function approximation in dynamic programming. In International Conference on Machine Learning, 1995.

[39] J. Andrew Bagnell, Sham M. Kakade, Jeff G. Schneider, and Andrew Y. Ng. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, 2004.

[40] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 2012.

[41] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.

[42] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.

[43] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.

[44] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

[45] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization.
Combinatorica, 1981.

[46] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.

[47] Amir-Massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.