{"title": "Limiting Extrapolation in Linear Approximate Value Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 5615, "page_last": 5624, "abstract": "We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of \\textit{anchor} states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-square LAVI method diverges.", "full_text": "Limiting Extrapolation in\n\nLinear Approximate Value Iteration\n\nInstitute for Computational and Mathematical Engineering,\n\nAndrea Zanette\n\nStanford University, CA\nzanette@stanford.edu\n\nAlessandro Lazaric\nFacebook AI Research\n\nlazaric@fb.com\n\nDepartment of Aeronautics and Astronautics,\n\nDepartment of Computer Science,\n\nMykel J. Kochenderfer\n\nStanford University, CA\nmykel@stanford.edu\n\nEmma Brunskill\n\nStanford University, CA\n\nebrun@cs.stanford.edu\n\nAbstract\n\nWe study linear approximate value iteration (LAVI) with a generative model. 
While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-squares LAVI method diverges.

1 Introduction

Impressive empirical successes [Mni+13; Sil+16; Sil+17] in using deep neural networks in reinforcement learning (RL) often use sample-inefficient algorithms. Despite recent advances in the theoretical analysis of value-based batch RL with function approximation [MS08; ASM08; FSM10; YXW19; CJ19], designing provably sample-efficient approximate RL algorithms with function approximation remains an open challenge.
In this paper, we study value iteration with linear approximation (LAVI for short). 
Linear function approximators represent action-value functions as the inner product between a weight vector w and a d-dimensional feature map evaluated at each state-action pair, i.e., \hat{Q}(s, a) = w^\top \phi(s, a). Linear models are common and powerful because they allow us to compactly represent functions with a small number of parameters, and therefore hold promise for requiring a small sample size to learn such functions. Unfortunately, it is well known that the Bellman operator combined with the projection onto a linear space in, e.g., \ell_2-norm, may result in an expansive operator. As a result, even when the features are expressive enough so that the optimal state-action value function Q^\star can be accurately represented (i.e., Q^\star(s, a) \approx (w^\star)^\top \phi(s, a)), combining linear function approximation with value iteration may lead to divergence [Bai95; TV96]. Munos [Mun05] derived bounds on the error propagation for general approximate value iteration (AVI) and later Munos and Szepesvári [MS08] proved finite-sample guarantees for fitted value iteration with a generative model, while sharper results can be found in [FSM10]. A key issue in AVI is that errors at one iteration may be amplified through the application of the Bellman operator and projection. In the analysis of Munos and Szepesvári [MS08], this effect is illustrated by the inherent Bellman error, which measures how well the image through the Bellman operator of any function in the approximation space can be approximated within the space itself. Whenever the inherent Bellman error is unbounded, AVI may diverge.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In contrast to the amplification of errors of linear value function approximation, averagers [Gor95], such as barycentric interpolators [MM99], nearest-neighbors, and kernels [OS02], can reduce how errors are propagated through iterations. 
Averagers represent the value function at a state-action pair as an interpolation of its values at a finite set of anchor points. By interpolating instead of extrapolating, the function approximator is guaranteed to be a non-expansion in \ell_\infty-norm, and therefore the Bellman backup remains a contraction even after the projection onto the approximation space. Unfortunately, the number of anchor points needed to accurately represent the value function, and thus the number of parameters to learn, may scale exponentially with the input state dimension.
In this paper, we explore a new function approximator that tries to balance the compactness and generalization of linear methods, leading to sample efficiency at each iteration, while constraining the resulting expansion, as in averagers, thus providing a small amplification factor over iterations. Our algorithm estimates the Q-values at a set of anchor points and predicts the function at any other point by taking a combination of those values, while using a linear representation. We show that whenever the features generate a convex set, it is possible to avoid any error amplification and achieve a sample complexity that is polynomial in the number of anchor points and in the horizon. A related convexity assumption has very recently been used by Yang and Wang [YW19] to obtain the first algorithm with near-optimal sample complexity. Nonetheless, their result holds when the transition model p admits a non-negative low-rank factorization in \phi, which also corresponds to a zero inherent Bellman error. In our analysis, we consider the far more general setting in which the optimal state-action value function can be accurately approximated with a linear set of features. Note that this can be true even if the transition model does not admit a low-rank decomposition, as we illustrate in our simulation results. Furthermore, our result holds even when the inherent Bellman error is infinite. 
Unlike [YW19], we also report a thorough discussion on how to select anchor points and provide a heuristic procedure to automatically create them.
In our simulations we show that small levels of amplification can be achieved, and that our algorithm can effectively mitigate the divergence observed in some simple MDPs for least-squares AVI. This happens even when using identical feature representations, highlighting the benefit of bounding extrapolation through constructing feature representations as near convex combinations (versus \ell_2 or other common loss functions). Furthermore, we empirically show that small amplification factors can be obtained with relatively small sets of anchor points. We believe this work provides a first step towards designing sample-efficient algorithms that effectively balance per-iteration generalization and sample complexity against the amplification of errors through iterations for general linear action-value function solvers.

2 Preliminaries

We consider a fixed-horizon MDP M = \langle S, A, p, r, H, \rho \rangle defined by a continuous state space S, a discrete action space A, a horizon H, an initial state distribution \rho, a transition model p(s, a) and a reward model r(s, a). We also denote by R(s, a) the random reward, with expected value r(s, a). A deterministic policy \pi_t(s) is a mapping from a state and timestep to an action. The Q-value of a policy \pi in state-action-timestep (s, a, t) is the expected return after taking action a in s at timestep t and following policy \pi afterwards, and V^\pi_t(s) = Q^\pi_t(s, \pi_t(s)). An optimal policy \pi^\star maximizes the value function at any state and timestep, i.e., \pi^\star_t = \arg\max_\pi V^\pi_t. We use V^\star_t = V^{\pi^\star}_t and Q^\star_t = Q^{\pi^\star}_t to denote the functions corresponding to an optimal policy \pi^\star.
We consider the so-called generative model setting, where p and r are unknown but a simulator can be queried at any state-action pair (s, a) to obtain samples s' \sim p(s, a) and R(s, a). As the generation of each sample may be expensive, the overall objective is to compute a near-optimal policy with as few samples as possible. Approximate dynamic programming algorithms can be used to replace p and r with a finite number of simulator samples, and can be used for high-dimensional or continuous spaces. Approximate value iteration (AVI) (closely related to fitted value iteration) takes as input a regression algorithm F and proceeds backward from horizon H to 1. At each timestep t, given the approximation \hat{Q}^\star_{t+1}, it queries the simulator n times and obtains a set of tuples \{(s_i, a_i, r_i, s'_i)\}^n_{i=1}, used to construct a regression dataset D_t = \{((s_i, a_i), y_i)\}^n_{i=1} with y_i = r_i + \max_a \hat{Q}^\star_{t+1}(s'_i, a). AVI then computes \hat{Q}^\star_t = F(D_t), returns the approximated optimal policy \hat{\pi}^\star_t(s) = \arg\max_a \hat{Q}^\star_t(s, a), and proceeds to timestep t - 1.
A popular instance of AVI is to use linear regression to approximate Q-functions. We refer to this general scheme as linear AVI (LAVI). Let \phi_t : S_t \times A_t \to R^d be a feature mapping for timestep t. We define \Phi_t = \{\phi \in R^d : \exists s \in S, a \in A_{t,s}, \phi_t(s, a) = \phi\} as the subset of R^d obtained by evaluating \phi_t at any state-action pair (s, a). Any approximate action-value function \hat{Q}_t is represented as a linear combination of weights \hat{w}_t \in R^d and features \phi_t as \hat{Q}_t(s, a) = \hat{w}_t^\top \phi_t(s, a), where \hat{w}_t is usually computed by minimizing the \ell_2-loss on the dataset D_t. 
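The backward AVI scheme just described can be sketched in a few lines. This is a generic illustration, not the paper's implementation; `simulator` and `regress` are stand-in names for the generative model and the regression algorithm F:

```python
import numpy as np

def avi(simulator, regress, states, actions, H, n=100, seed=0):
    """Generic approximate value iteration, proceeding backward from timestep H to 1.

    simulator(s, a, rng) -> (reward, next_state)   # one generative-model sample
    regress(X, y)        -> Q, with Q(s, a) -> float  # regression algorithm F
    Returns a dict mapping each timestep t to the greedy policy at t.
    """
    rng = np.random.default_rng(seed)
    Q_next = lambda s, a: 0.0              # zero value beyond the horizon
    policies = {}
    for t in range(H, 0, -1):
        X, y = [], []
        for _ in range(n):                 # build the regression dataset D_t
            s = states[rng.integers(len(states))]
            a = actions[rng.integers(len(actions))]
            r, s2 = simulator(s, a, rng)
            X.append((s, a))
            y.append(r + max(Q_next(s2, b) for b in actions))   # Bellman backup target
        Q_t = regress(X, y)                # Q-hat*_t = F(D_t)
        policies[t] = lambda s, Q=Q_t: max(actions, key=lambda a: Q(s, a))
        Q_next = Q_t
    return policies
```

Plugging in a least-squares regressor on features \phi_t yields the LAVI scheme discussed next; a tabular mean regressor recovers exact value iteration on small MDPs.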
Linear function approximation requires only O(d/\epsilon^2) samples to achieve an \epsilon estimation error, independent of the size of S and A. Nonetheless, at each timestep t the combination of the \ell_2-loss minimization (i.e., F) with the application of the Bellman operator to the function computed at timestep t + 1 may correspond to an expansive operation. In this case, errors at each iteration may be amplified and eventually lead LAVI to diverge.

3 Linear Approximate Value Iteration with Extrapolation Reduction

We introduce IER (Interpolation for Extrapolation Reduction), a novel approximation algorithm that interpolates Q-values at a set of anchor points. We study its prediction error and we analyze the sample complexity of the LAVI scheme obtained by executing IER backward from H to 1.
At each timestep t, IER receives as input an estimate \hat{Q}^\star_{t+1} of the action-value function at timestep t + 1, the feature map \phi_t, and a set K_t \subset S \times A of K_t anchor state-action pairs. IER first estimates Q^\star_t(s_i, a_i) at any anchor point (s_i, a_i) \in K_t by repeatedly sampling from the simulator and using the approximation \hat{Q}^\star_{t+1} to compute the backup values. We define the anchor values as

\hat{Q}^\star_{t,i} = \frac{1}{n_{supp}} \sum_{j=1}^{n_{supp}} \Big( R^{(j)} + \max_{a \in A} \hat{Q}^\star_{t+1}(s^{(j)}_{t+1}, a) \Big),   (1)

where R^{(j)} and s^{(j)}_{t+1} are the samples generated from the generative model at (s_i, a_i) and n_{supp} is the sampling budget at each anchor point. Given these estimates, the approximation \hat{Q}^\star_t(s, a) returned by IER at any state-action pair (s, a) is obtained by a linear combination of the \hat{Q}^\star_{t,i} values as

\hat{Q}^\star_t(s, a) = \sum_{i=1}^{K_t} \theta^{t(s,a)}_i \hat{Q}^\star_{t,i},   (2)

where the interpolation vector \theta^{t(s,a)} \in R^{K_t} is the solution to the optimization problem

\min_{\theta^{t(s,a)}} \|\theta^{t(s,a)}\|_1   subject to   \phi_t(s, a) = \sum_{i=1}^{K_t} \theta^{t(s,a)}_i \phi_t(s_i, a_i).   (3)

As long as the image of the anchor points \{(s_i, a_i)\}^{K_t}_{i=1} spans R^d, (3) admits a solution. This problem is a linear optimization program with linear constraints and it can be solved efficiently using standard techniques [BV04; NW06]. Notice that the weights \theta^{t(s,a)} change with s, a and no positiveness constraint is enforced.

3.1 Prediction Error and Sample Complexity of IER

In most problems, the optimal action-value function Q^\star_t cannot be exactly represented by a low-dimensional inner product w_t^\top \phi_t(\cdot, \cdot). The best approximator that can be expressed by the features and its associated approximation error are defined as

w^\star_t = \arg\min_{w \in R^d} \|w^\top \phi_t(\cdot) - Q^\star_t(\cdot)\|_\infty ;   \epsilon^{app}_t = \min_{w \in R^d} \|w^\top \phi_t(\cdot) - Q^\star_t(\cdot)\|_\infty,   (4)

where \|\cdot\|_\infty denotes the infinity norm, i.e., the maximum over state-action pairs in S \times A. Standard linear function approximation methods instead minimize the \ell_2-norm (i.e., least-squares) or a regularized version of it.
We are interested in studying whether IER approaches the performance of w^\star. Before analyzing IER, we focus on its “exact” counterpart. 
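Program (3) is an \ell_1-minimization under linear equality constraints and can be cast as a standard linear program via the split \theta = u - v with u, v \geq 0. A minimal sketch, assuming NumPy/SciPy; `interp_l1` and the variable names are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def interp_l1(Phi_anchors, phi):
    """Solve program (3): min ||theta||_1  s.t.  sum_i theta_i * phi_i = phi.

    Phi_anchors: (K, d) array, one anchor feature vector per row.
    phi:         (d,) target feature vector phi_t(s, a).
    Uses the standard LP reformulation theta = u - v with u, v >= 0.
    """
    K, d = Phi_anchors.shape
    c = np.ones(2 * K)                                   # sum(u) + sum(v) = ||theta||_1
    A_eq = np.hstack([Phi_anchors.T, -Phi_anchors.T])    # (d, 2K) equality constraints
    res = linprog(c, A_eq=A_eq, b_eq=phi, bounds=[(0, None)] * (2 * K))
    assert res.success, "anchor features must span the target feature"
    return res.x[:K] - res.x[K:]                         # theta = u - v
```

The quantity \max_{s,a} \|\theta^{t(s,a)}\|_1 computed from these solutions is exactly the amplification factor C_t analyzed next.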
We introduce \tilde{Q}^\star_t(s, a) as the interpolator obtained by combining the exact Q^\star-function evaluated on the anchor points as

\tilde{Q}^\star_t(s, a) := \sum_{i=1}^{K_t} \theta^{(s,a)}_i Q^\star_t(s_i, a_i),   (5)

where the vector \theta^{(s,a)} is the solution of (3). We prove the following.

Lemma 1 (Error Bounds of \tilde{Q}^\star_t). Let \epsilon^{app}_t be the approximation error of the best linear model (Eq. 4). If \epsilon^{app}_t = 0, i.e., Q^\star_t(s, a) = (w^\star_t)^\top \phi_t(s, a), then \tilde{Q}^\star_t(s, a) = (w^\star_t)^\top \phi_t(s, a). Otherwise the (exact) interpolator in Eq. 5 has an error

\max_{(s,a) \in S \times A} |\tilde{Q}^\star_t(s, a) - Q^\star_t(s, a)| \leq (1 + C_t) \epsilon^{app}_t,   (6)

where C_t := \max_{(s,a) \in S \times A} \|\theta^{(s,a)}_t\|_1 is the amplification factor.

This result shows that the interpolation done in (5) preserves the linearity of the model whenever the function evaluated at the anchor points is linear itself. Furthermore, the prediction error is a factor (1 + C_t) bigger than that of the best approximator. The optimization program (3) plays a crucial role in obtaining both results. In particular, the constraint ensures that the linear structure is preserved, while the minimization over \theta^{(s,a)} aims at controlling the amplification factor C_t. We now study the sample complexity of IER at timestep t when an approximation of the optimal value function V^\star_{t+1} at timestep t + 1 is available (the proof and the definition of \delta' are postponed to the supplementary).

Lemma 2. Let \epsilon^{app}_t be the error of the best linear model at timestep t and \hat{V}^\star_{t+1} be the approximation of V^\star_{t+1} used in estimating the values at the anchor points in Eq. 1. Let \|\hat{V}^\star_{t+1} - V^\star_{t+1}\|_\infty \leq \epsilon^{bias}_{t+1} be the prediction error of \hat{V}^\star_{t+1}. If IER is run with K_t anchor points, then the prediction error of \hat{Q}^\star_t is

\|\hat{Q}^\star_t - Q^\star_t\|_\infty \leq \underbrace{(1 + C_t)\epsilon^{app}_t + C_t \epsilon^{est}_t}_{\text{errors at timestep } t} + \underbrace{C_t \epsilon^{bias}_{t+1}}_{\text{propagation error}}   (7)

with probability at least 1 - \delta/H, as long as n_{supp} \geq \ln(2/\delta')/(2(\epsilon^{est}_t)^2).

Lem. 2 shows that the prediction error of IER is bounded by three main components: an estimation error \epsilon^{est}_t due to the noise in estimating the Q-values \hat{Q}^\star_{t,i} at the anchor points, an approximation error (1 + C_t)\epsilon^{app}_t due to the linear model defined by the features \phi_t, and a propagation error C_t \epsilon^{bias}_{t+1} due to the prediction error of \hat{V}^\star_{t+1} at timestep t + 1. The key result from this lemma is to illustrate how C_t not only impacts the approximation error as in Lem. 1, but also determines how the errors of \hat{V}^\star_{t+1} propagate from timestep t + 1 to t. While for a standard least-squares method C_t may be much larger than one, the approximator (2) with the interpolation vector obtained from (3) aims at minimizing the extrapolation and lowering C_t as much as possible, while preserving the linearity of the representation. As discussed in Sect. 4, a suitable choice of the anchor points may significantly reduce the amplification factor by leveraging the additional degrees of freedom offered by choosing K_t larger than d. In general, we may expect that the larger K_t, the smaller C_t. Nonetheless, the overall sample complexity of IER increases as K_t n_{supp}. This shows the need of trading off the number of anchor points (hence possibly higher variance) in exchange for better control on how errors get amplified. In this sense, Lem. 2 reveals a critical extrapolation-variance trade-off.

3.2 Sample Complexity of LAVIER

We analyze LAVIER (Linear Approximate Value Iteration with Extrapolation Reduction), obtained by running IER backward from timestep H to 1, and we derive a sample complexity upper bound to achieve a near-optimal policy. 
Under the assumption of bounded value functions V^\star_t(s) \in [0, 1] and bounded immediate reward random variables R(s, a) \in [0, 1], we obtain the following result.^1

Theorem 1. Let C_t \leq C and \epsilon^{app}_t \leq \epsilon^{app} for all t = 1, ..., H. If LAVIER is run with failure probability \delta > 0, precision \epsilon > 0, constant \bar{C} > C^H, and n_{tot} \geq \frac{K H^5 \bar{C}^2}{\epsilon^2} \ln(2KH/\delta) samples, then with probability 1 - \delta LAVIER returns a policy \hat{\pi}^\star such that

V^\star_1(s_0) - V^{\hat{\pi}^\star}_1(s_0) \leq \underbrace{\epsilon}_{\text{est. error}} + \underbrace{4H^2 \bar{C} \epsilon^{app}}_{\text{app. error}}.   (8)

Algorithm 1 LAVIER algorithm.
  Input: Failure probability \delta, accuracy \epsilon, set of anchor points \{K_t\}_{t=1,...,H}, time horizon H, total amplification constant \bar{C}.
  Set \delta' = \delta / (\sum^H_{t=1} K_t), n_{supp} = \lceil \frac{H^4 \bar{C}^2}{\epsilon^2} \ln(2 (\sum^H_{t=1} K_t)/\delta) \rceil
  \hat{Q}^\star_{H+1}(\cdot) = 0 (zero predictor at terminal states)
  for t = H downto 1 do
    Call IER with parameters (n_{supp}, K_t, \hat{Q}^\star_{t+1}(\cdot)) and obtain \hat{Q}^\star_t(\cdot)
  end for
  Return policy \hat{\pi}^\star_t(s) = \arg\max_{a \in A_{t,s}} \hat{Q}^\star_t(s, a)

This bound decomposes the prediction error into two components: an estimation error due to the noise in the samples and an approximation error due to the features \{\phi_t\}_t and the target functions \{Q^\star_t\}_t. Thm. 1 illustrates the impact of the amplification factor on the overall sample complexity and final error. If C > 1, then C^H (and hence \bar{C}) grows exponentially with the horizon. Furthermore, the error \epsilon^{app} itself is amplified by \bar{C}, thus leading to an approximation error scaling exponentially with H. This result is not unexpected, as it confirms previous negative results showing how the extrapolation typical of linear models may lead the error to diverge over iterations [Bai95; TV97].

^1 This assumption is inspired by [JA18], who suggested this is a more expressive framework, as it allows some rewards to be substantially larger than others in terms of contributing to the final value function.
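The role of the amplification constant in Thm. 1 can be checked with simple arithmetic: a per-step factor C compounds multiplicatively over the H backups, so any C > 1 blows up exponentially, while C \leq 1 + 1/H keeps the compounded factor below e. A minimal numeric check (illustrative only):

```python
def compounded_amplification(C, H):
    """Per-step amplification C compounds multiplicatively over H backups: C**H."""
    return C ** H

H = 50
# C = 1.5 diverges quickly: 1.5**50 is roughly 6.4e8.
assert compounded_amplification(1.5, H) > 1e8
# C = 1 + 1/H stays bounded: (1 + 1/H)**H <= e for all H >= 1.
assert compounded_amplification(1 + 1 / H, H) <= 2.7182818285
```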
Nonetheless, if the amplification constant is C < (1 + 1/H), then C^H \leq (1 + 1/H)^H \leq e, which gives a polynomial sample complexity bound of order \tilde{O}(K H^5/\epsilon^2) and a final error where the approximation error is only amplified by H^2. While this configuration does remove the divergence problem, it may still lead to a sample-inefficient algorithm. In fact, in order to achieve C \approx 1, we may need to take K very large. This raises the fundamental question of whether low amplification error and low sample complexity can be obtained at the same time. In the next section, we first discuss how anchor points with small amplification C can be efficiently constructed, while in Sect. 5 we empirically show how in some scenarios this can be achieved with a small number of anchor points K and thus low sample complexity. Finally, we notice that when the features are chosen to be averagers, the interpolation scheme corresponds to a convex combination of anchor weights, thus corresponding to C = 1. As a result, Thm. 1 is also a sample complexity result for averagers [Gor95].

4 Anchor Points and Amplification Factor

While averagers attain C = 1, in general they may not generalize as well as linear models. Furthermore, averagers usually have poor sample complexity, as they may require a number of samples scaling exponentially with the dimension of the state-action space [see e.g., Thm. 3 in OS02]. The aim of the minimization program (3) is to trade off the generalization capacity of linear models against their extrapolation, without compromising the overall sample complexity. The process of constructing a “good” set of anchor points can be seen as a form of “experimental design”. 
While in experimental optimal design the objective is to find a small number of anchor points such that least-squares achieves small prediction error, here the objective is to construct a set K_t such that the amplification factor C_t is small. We have the following result.

Proposition 1. Let \Phi(K_t) = \{\phi \in R^d : \exists (s_i, a_i) \in K_t, \phi(s_i, a_i) = \phi\} be the image of the anchor points through \phi. If the convex hull of \Phi(K_t) contains all the features in \Phi_t, i.e.,

\Phi_t \subseteq conv(\Phi(K_t)) = \{\phi \in R^d : \exists \theta \in R^{K_t}, \phi = \sum_{i=1}^{K_t} \theta_i \phi_i, \text{ with } \theta_i \geq 0, \sum_{i=1}^{K_t} \theta_i = 1\},

then the amplification factor is C_t \leq 1.

Under the condition of Prop. 1, prediction errors propagate linearly through timesteps. In general, it is not possible to provide a bound on K_t, as the number of anchor points needed to construct a convex hull containing \Phi_t may largely vary depending on the structure of \Phi_t.^2 If the convex hull is not known or it contains too many features, an approximate convex hull can be found by standard techniques, for example [GO17; Blu+17] or [SV16; HA13], and can still provably yield a linear propagation of the error if it is of sufficient quality (i.e., C_t < (1 + 1/H)). Importantly, finding an approximate convex hull can be performed offline without accessing the generative model, as it only requires access to the mapping function \phi_t(\cdot, \cdot). Finally, as the algorithm solves the optimization program (3) during the learning phase (to compute the backup \hat{V}^\star_{t+1}(s') with the sampled next state s'), the actual value of \|\theta^{(s,a)}\|_1 is computed, and therefore the algorithm can identify whether significant extrapolation is taking place and whether the set of anchor points K_t may need to be increased or adjusted. 

^2 For instance, if \Phi_t is a polyhedron in R^d, K_t may be as large as exponential in d.
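Because the interpolation vector of program (3) is computed explicitly, the amplification factor C(K_t) = \max_{s,a} \|\theta^{t(s,a)}\|_1 can be monitored directly, and Prop. 1 can be sanity-checked numerically. A minimal sketch, assuming NumPy/SciPy; the function names are illustrative and finite feature sets stand in for \Phi_t:

```python
import numpy as np
from scipy.optimize import linprog

def l1_interpolation_norm(Phi_anchors, phi):
    """||theta||_1 of the minimizer of program (3), via a linear program."""
    K, _ = Phi_anchors.shape
    res = linprog(np.ones(2 * K),
                  A_eq=np.hstack([Phi_anchors.T, -Phi_anchors.T]), b_eq=phi,
                  bounds=[(0, None)] * (2 * K))          # theta = u - v, u, v >= 0
    assert res.success, "anchor features do not span the target feature"
    return float(np.abs(res.x[:K] - res.x[K:]).sum())

def amplification_factor(Phi_anchors, Phi_all):
    """C(K) = max over the feature set of the l1-norm of the interpolation vector."""
    return max(l1_interpolation_norm(Phi_anchors, phi) for phi in Phi_all)

# Anchors at the endpoints of a segment contain every feature on the segment in
# their convex hull, so Prop. 1 predicts C <= 1; a feature outside the hull
# forces extrapolation and C > 1.
anchors = np.array([[0.0, 1.0], [1.0, 0.0]])
assert amplification_factor(anchors, np.array([[0.5, 0.5], [0.25, 0.75]])) <= 1.0 + 1e-6
assert amplification_factor(anchors, np.array([[1.5, -0.5]])) > 1.0
```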
While we defer the adaptive construction of approximate convex hulls to future work, we propose a simple greedy heuristic to construct a good set of anchor points before the learning process.
Let \bar{C} be a target amplification factor; at timestep t we would like to find the smallest set K_t such that C(K_t) = \max_{s,a} \|\theta^{t(s,a)}\|_1 is below \bar{C}, where the interpolation vector \theta^{t(s,a)} is computed as in (3). As this problem may be NP-hard, we propose a sequential greedy scheme where anchor points are added to K_t until the condition is met. Starting with K_t including a single arbitrary state-action pair (s_1, a_1), if C(K_t) > \bar{C}, we compute (s, a) = \arg\max_{s,a} \|\theta^{t(s,a)}\|_1 and add it to K_t. Notice that this process does not necessarily return a positive interpolation vector \theta^{t(s,a)}, and thus \hat{Q}^\star_t may not be a convex combination of the anchor values. This extra degree of freedom w.r.t. convex hulls may allow us to obtain a small amplification factor with fewer anchor points. Although we do not have theoretical guarantees on the number of anchor points K = |K_t| added through this heuristic process, we report experiments showing that it is possible to effectively obtain small C, and thus small prediction error, with few anchor points.

5 Numerical Simulations

We investigate the potential benefit of LAVIER over least-squares AVI (LS-AVI). Although LAVIER shares similarities with averagers, a fair comparison is difficult and out of the scope of this preliminary empirical study. In fact, in designing an averager, the choice of structure and parameters (e.g., the position of the points in a nearest-neighbor procedure) heavily affects the corresponding function class, i.e., the type of functions that can be accurately represented. 
As a result, any difference in performance would mostly depend on the different function class used by the averager and the linear model (i.e., \phi) used by LAVIER.
The following MDPs are toy examples designed to investigate the differences between LAVIER and LS-AVI and confirm our theoretical findings. The empirical results are obtained by averaging 100 simulations and they are reported with 95%-confidence intervals.

Figure 1: Left: Two-state MDP (features w and 2w, termination with probability \epsilon). Right: Prediction error for least-squares AVI and LAVIER.

Two-state MDP of Tsitsiklis and Van Roy. The first experiment focuses on how the interpolation scheme of IER may avoid divergence. The smallest-known problem where least-squares approximation diverges is reported in [TV96; SB18]. This problem consists of a two-state Markov reward process (i.e., an MDP with only one action per state) plus a terminal state (Fig. 1). As there is only one possible policy, the approximation problem reduces to estimating its value function. The feature \phi maps a state to a fixed real number, i.e., \phi(\cdot, \cdot) \in R, and there is only one weight to learn. For simplicity, we set the parameter \epsilon = 0.01, and add zero-mean noise to all rewards, generated as 1/2 - Ber(1/2), where Ber(\cdot) is a Bernoulli random variable. We study the approximation error at the left-most state when each algorithm is run for a varying number of iterations H and with 1000 samples at each timestep. The samples are generated uniformly from the left and middle node, which serve as anchor points. Fig. 1 shows that the error of the least-squares-based method rapidly diverges through iterations, while LAVIER is more robust and its error remains stable.

Figure 2: Left: Chain MDP (transitions with probabilities 1/N and 1 - 1/N; reward r \sim N(1/(10N), 1) in s_1 and r \sim N(1, 1) in s_N). Right: Suboptimality of the policy at s_1, V^\star(s_1) - V^{\tilde{\pi}^\star}(s_1).

Chain MDP. We now evaluate the quality of the anchor points returned by the heuristic method illustrated in Sect. 4. In the chain MDP of Fig. 2 the agent starts in the leftmost state and the optimal policy is to always go right and catch the noisy reward in the rightmost state before the episode terminates. However, a small reward is present in the leftmost state and settling for this reward yields a suboptimal policy. We define the feature \phi(s, a) = [Q^\star(s, a), v(s, a)], where v(s, a) \sim Unif(0, 1) is a random number fixed for each simulation and (s, a) pair. We run LS-AVI by sampling state-actions in the reachable space uniformly at random, while for LAVIER we compute an anchor set with C \leq 1.2. Both algorithms use the same number of samples, and LAVIER splits the budget of samples uniformly over the anchor points to compute the anchor values. The length of the chain is N = 50, which is also the time horizon. We report the quality of the learned policy at s_1 and notice that LAVIER is consistently better than LS-AVI (see App. A for further experiments).

Figure 3: Left: MDP with a sequence of linear bandits with actions in 2 dimensions. Center: Example of the anchor points generated by the heuristic greedy algorithm. Right: Accuracy V^\star(s_t) - V^{\tilde{\pi}^\star}(s_t) as a function of state.

Successive Linear Bandits. We consider an MDP defined as a sequence of linear bandit problems (Fig. 
3), which is designed so that significant extrapolation occurs at each iteration. In this MDP, there are N states s_1, ..., s_N, augmented with a starting state s_0 and a terminal state s_\star. From the starting state s_0 there are two actions (left and right). The optimal policy is to take left and receive a reward of 10. The states s_1, ..., s_N are linear bandit problems, where each action gives a Gaussian noisy return of mean 0 and variance 1 and the state transitions deterministically from s to s + 1. This represents a sequence of linear bandits with no signal, i.e., the output is not correlated with the features and the learner only experiences noise, hence V^\star_1(s_1) = 0. The feature map \phi_t(s, a) = a returns the features describing the action itself, and the solution Q^\star_t(s, a) = 0 is exactly representable by a zero weight vector. The solution is unique. The learner should estimate the value of V^\star_t(s_1) accurately to infer the right action in state s_0. At each state s_1, ..., s_N, we represent actions in R^2 and we generate 100 actions by uniformly discretizing the circumference. As the canonical vectors e_1 and e_2 are the most informative actions to estimate the reward associated with any other action (see [SLM14] for best policy identification in linear bandits), we collect our samples from these two actions. The anchor points for LAVIER are chosen by our adaptive procedure for different values of the extrapolation coefficient C \in \{1.05, 1.2, 1.5\}. The extrapolation becomes more and more controlled as C approaches one. Fig. 3 shows the performance at different states. For small values of C, LAVIER significantly outperforms LS-AVI. Furthermore, looking more closely into the rightmost states (i.e., the states that are updated at early iterations) reveals the extrapolation-variance tradeoff (see Fig. 5 in App. 
B for a zoomed version of the plot): a value of C = 1.5 ensures a more accurate estimate (due to less variance) in the first timesteps, but the curve steeply diverges. By contrast, C = 1.05 has an initially poorer estimate, but this estimate remains far more stable with the horizon. We also report the support points selected by the algorithm. Although C is small, only a few points are necessary. In fact, we do not need to cover the circle with an approximate convex hull, and our procedure can, for example, 'flip' the sign of the learned value without causing extrapolation (i.e., keeping C small).
In Fig. 3 we also report the performance of LS-AVI. In this case, the divergence of the estimate of LS-AVI is extreme, and it does not allow an accurate estimate of V^\star(s_1) = 0, yielding a policy that cannot identify the correct initial action. Furthermore, in this example we additionally evaluate Least-Squares Temporal Difference (LSTD) for off-policy prediction [SB18]. LSTD is not a policy optimization algorithm, but we can use it to evaluate the value of a policy that chooses, for example, the action [1/\sqrt{2}, 1/\sqrt{2}] in every state of the chain. The training data for LSTD are identical to LS-AVI, i.e., the canonical vectors e_1 and e_2. Despite collecting data along the informative directions e_1 and e_2, the LSTD solution is of increasingly poor quality as a function of the chain length.

6 Conclusion

Related work. Most of the literature on linear function approximation has focused on designing feature maps \phi that represent action-value functions well, by optimizing parameterized features (e.g., in deep networks or in [MMS05]), by an initial representation-learning phase to extract features adapted to the structure of the MDP [MM07; Pet07; Bel+19], or by adding features to reduce the approximation error [Tos+17]. 
Unfortunately, accurately fitting value functions does not guarantee small inherent
Bellman error (IBE), and thus LAVI may still be very unstable. In this paper, we assume $\phi$ has small
approximation error but arbitrary IBE, and we focus on how to reduce the amplification factor at each
iteration.
Yang and Wang [YW19] recently studied the sample complexity of LAVI under the assumption that
the transition model $p$ admits a non-negative low-rank factorization in the features $\phi$. In particular,
they show that in this case the inherent Bellman error is zero, thus avoiding the amplification of
errors through iterations of LAVI. In this paper, we consider the more general case where only the
optimal action-value function should be accurately approximated in $\phi$, which may be true even
when the transition model does not admit a low-rank decomposition. In fact, Thm. 1 holds even
when the inherent Bellman error is infinite and shows that whenever the amplification factor $C$ is
small, LAVIER can still achieve polynomial sample complexity. Yang and Wang [YW19] used the
convexity condition in Prop. 1 to derive sample complexity guarantees in their setting, while in our
case the same condition is used to control the amplification of errors. Furthermore, we notice that in
LAVIER we only need to control the $\ell_1$-norm of the interpolation weights, which does not necessarily
require any convexity assumption (see also experiments). [YW19] introduced OPPQ-Learning and
proved a near-optimal sample complexity bound of order $\widetilde{O}(K/(\epsilon^2(1-\gamma)^3))$, where $K$ is the number
of anchor points, which are assumed to be provided.3 While this shows that both methods scale
linearly with the number of anchor points, OPPQ-Learning enjoys a much better dependency on the
horizon.
It remains an open question whether our analysis for LAVIER can be improved to match
their bound, or whether the difference is the unavoidable price to pay for the more general setting we consider.4
Averagers pursue the same objective but take an extreme approach, where no extrapolation is allowed
and Q-functions are approximated by interpolation of values at a fixed set of anchor points [Gor95;
Gor96; PP13; KKL03; MM99]. Unfortunately, such an approach may suffer from a poor sample
complexity [PP13; KKL03], as the number of anchor points may scale exponentially with the
problem dimensionality. In LAVIER, we introduce a more explicit extrapolation-variance tradeoff,
where the anchor points should be designed to avoid extrapolation only when/where it happens.

3Yang and Wang [YW19] point out that the convexity assumption requires the number of features $d$ to scale
with the number of anchor points $K$.
4We conjecture that the dependency on $H$ could be greatly improved using arguments similar to those in [YW19],
such as monotonicity, tighter concentration inequalities, and variance reduction.

Future work. There are several directions for future investigation. AVI is a core building block
for many exploration-exploitation algorithms [OVW16; Kum+18], and better LAVI may help in
building sample-efficient online learning algorithms with function approximation. Another avenue
of investigation is off-policy prediction with batch data. The mismatch between behavioural and
target policies poses challenges similar to those in the error propagation of AVI. Controlling the
extrapolation-variance tradeoff may require penalizing a non-uniform use of the samples (to reduce the
variance), while the $\ell_1$-norm minimization objective may reduce the amount of extrapolation to the
desired value.

Acknowledgment

This work was partially supported by a Total Innovation Fellowship.
The authors are grateful to the reviewers for the high quality reviews and helpful suggestions.

References

[ASM08] András Antos, Csaba Szepesvári, and Rémi Munos. "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path". In: Machine Learning 71.1 (2008), pp. 89–129.
[Bai95] Leemon Baird. "Residual algorithms: Reinforcement learning with function approximation". In: International Conference on Machine Learning (ICML). 1995.
[Bel+19] Marc G. Bellemare et al. "A Geometric Perspective on Optimal Representations for Reinforcement Learning". In: CoRR abs/1901.11530 (2019). arXiv: 1901.11530. URL: http://arxiv.org/abs/1901.11530.
[Blu+17] Avrim Blum et al. Approximate Convex Hull of Data Streams. 2017. arXiv: 1712.04564 [cs.CG].
[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CJ19] Jinglin Chen and Nan Jiang. "Information-Theoretic Considerations in Batch Reinforcement Learning". In: arXiv e-prints, arXiv:1905.00360 (May 2019). arXiv: 1905.00360 [cs.LG].
[FSM10] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. "Error propagation for approximate policy and value iteration". In: Advances in Neural Information Processing Systems (NIPS). 2010.
[GO17] Robert Graham and Adam M. Oberman. Approximate Convex Hulls: sketching the convex hull using curvature. 2017. arXiv: 1703.01350 [cs.CG].
[Gor95] Geoffrey J Gordon. "Stable function approximation in dynamic programming". In: International Conference on Machine Learning (ICML). 1995, pp. 261–268.
[Gor96] Geoffrey J Gordon. "Stable fitted reinforcement learning". In: Advances in Neural Information Processing Systems (NIPS). 1996.
[HA13] M Zahid Hossain and M Ashraful Amin. "On constructing approximate convex hull". In: American Journal of Computational Mathematics 3.1 (2013), p. 11.
[Hoe63] Wassily Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association (1963).
[JA18] Nan Jiang and Alekh Agarwal. "Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon". In: Conference on Learning Theory (COLT). 2018, pp. 3395–3398.
[KKL03] Sham M. Kakade, Michael Kearns, and John Langford. "Exploration in Metric State Spaces". In: International Conference on Machine Learning (ICML). 2003.
[Kum+18] Raksha Kumaraswamy et al. "Context-dependent upper-confidence bounds for directed exploration". In: Advances in Neural Information Processing Systems (NIPS). 2018.
[MM07] Sridhar Mahadevan and Mauro Maggioni. "Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes". In: Journal of Machine Learning Research 8 (Dec. 2007), pp. 2169–2231. URL: http://dl.acm.org/citation.cfm?id=1314498.1314570.
[MM99] Remi Munos and Andrew W Moore. "Barycentric interpolators for continuous space and time reinforcement learning". In: Advances in Neural Information Processing Systems (NIPS). 1999.
[MMS05] Ishai Menache, Shie Mannor, and Nahum Shimkin. "Basis Function Adaptation in Temporal Difference Reinforcement Learning". In: Annals of Operations Research 134.1 (Feb. 2005), pp. 215–238. DOI: 10.1007/s10479-005-5732-z.
[Mni+13] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[MS08] Rémi Munos and Csaba Szepesvári. "Finite-time bounds for fitted value iteration". In: Journal of Machine Learning Research 9 (May 2008), pp. 815–857.
[Mun05] Rémi Munos. "Error bounds for approximate value iteration". In: AAAI Conference on Artificial Intelligence (AAAI). 2005.
[NW06] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, 2006.
[OS02] Dirk Ormoneit and Śaunak Sen. "Kernel-based reinforcement learning". In: Machine Learning 49.2-3 (2002), pp. 161–178.
[OVW16] Ian Osband, Benjamin Van Roy, and Zheng Wen. "Generalization and Exploration via Randomized Value Functions". In: International Conference on Machine Learning (ICML). 2016.
[Pet07] Marek Petrik. "An Analysis of Laplacian Methods for Value Function Approximation in MDPs". In: International Joint Conference on Artificial Intelligence (IJCAI). 2007, pp. 2574–2579. URL: http://dl.acm.org/citation.cfm?id=1625275.1625690.
[PP13] Jason Pazis and Ronald Parr. "PAC Optimal Exploration in Continuous Space Markov Decision Processes". In: AAAI Conference on Artificial Intelligence (AAAI). 2013, pp. 774–781. URL: http://dl.acm.org/citation.cfm?id=2891460.2891568.
[SB18] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[Sil+16] David Silver et al. "Mastering the game of Go with deep neural networks and tree search". In: Nature 529.7587 (2016), p. 484.
[Sil+17] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550.7676 (2017), p. 354.
[SLM14] Marta Soare, Alessandro Lazaric, and Rémi Munos. "Best-arm identification in linear bandits". In: Advances in Neural Information Processing Systems (NIPS). 2014, pp. 828–836.
[SV16] Hossein Sartipizadeh and Tyrone L. Vincent. Computing the Approximate Convex Hull in High Dimensions. 2016. arXiv: 1603.04422 [cs.CG].
[Tos+17] Samuele Tosatto et al. "Boosted Fitted Q-Iteration". In: International Conference on Machine Learning (ICML). 2017, pp. 3434–3443. URL: http://proceedings.mlr.press/v70/tosatto17a.html.
[TV96] John N Tsitsiklis and Benjamin Van Roy. "Feature-based methods for large scale dynamic programming". In: Machine Learning 22.1-3 (1996), pp. 59–94.
[TV97] John N Tsitsiklis and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation". In: Advances in Neural Information Processing Systems (NIPS). 1997.
[YW19] Lin F Yang and Mengdi Wang. "Sample-Optimal Parametric Q-Learning with Linear Transition Models". In: arXiv preprint arXiv:1902.04779 (2019).
[YXW19] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. "A Theoretical Analysis of Deep Q-Learning". In: CoRR abs/1901.00137 (2019). arXiv: 1901.00137. URL: http://arxiv.org/abs/1901.00137.