{"title": "Iterative Value-Aware Model Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9072, "page_last": 9083, "abstract": "This paper introduces a model-based reinforcement learning (MBRL) framework that incorporates the underlying decision problem in learning the transition model of the environment. This is in contrast with conventional approaches to MBRL that learn the model of the environment, for example by finding the maximum likelihood estimate, without taking into account the decision problem. Value-Aware Model Learning (VAML) framework argues that this might not be a good idea, especially if the true model of the environment does not belong to the model class from which we are estimating the model. The original VAML framework, however, may result in an optimization problem that is difficult to solve. This paper introduces a new MBRL class of algorithms, called Iterative VAML, that benefits from the structure of how the planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem. The paper theoretically analyzes Iterative VAML and provides finite sample error upper bound guarantee for it.", "full_text": "Iterative Value-Aware Model Learning\n\nAmir-massoud Farahmand\u2217\nVector Institute, Toronto, Canada\n\nfarahmand@vectorinstitute.ai\n\nAbstract\n\nThis paper introduces a model-based reinforcement learning (MBRL) framework\nthat incorporates the underlying decision problem in learning the transition model\nof the environment. This is in contrast with conventional approaches to MBRL\nthat learn the model of the environment, for example by \ufb01nding the maximum\nlikelihood estimate, without taking into account the decision problem. 
The Value-Aware Model Learning (VAML) framework argues that this might not be a good idea, especially if the true model of the environment does not belong to the model class from which we are estimating the model.
The original VAML framework, however, may result in an optimization problem that is difficult to solve. This paper introduces a new class of MBRL algorithms, called Iterative VAML, that benefits from the structure of how the planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem. The paper theoretically analyzes Iterative VAML and provides a finite-sample error upper bound guarantee for it.

1 Introduction

Value-Aware Model Learning (VAML) is a novel framework for learning the model of the environment in Model-Based Reinforcement Learning (MBRL) [Farahmand et al., 2017a, 2016a]. The conventional approach to model learning in MBRL is based on minimizing some kind of probabilistic loss. A common choice is to minimize the KL-divergence between the empirical data and the model, which leads to the Maximum Likelihood Estimator (MLE). Farahmand et al. [2017a, 2016a] argue that minimizing a probabilistic loss function might not be a good idea because it does not take into account the underlying decision problem. Any knowledge about the reward, value function, or policy is ignored in the conventional model learning approaches in MBRL (some recent exceptions are Joseph et al. [2013], Silver et al. [2017], Oh et al. [2017], Farquhar et al. [2018]; refer to the supplementary material for a detailed literature review of MBRL). The main thesis behind decision-aware model learning, including VAML, is that the knowledge about the underlying decision problem, which is often available, should be used in model learning itself. VAML, as its name suggests, uses the information about the value function. In particular, the formulation by Farahmand et al.
[2017a] incorporates the knowledge about the value function space in learning the model. In this work, we suggest an alternative, and possibly simpler, approach called Iterative VAML (IterVAML, for short).
VAML defines a robust loss function and has a $\min_{\mathcal{P} \in \mathcal{M}} \max_{V \in \mathcal{F}}$ structure, where $\mathcal{M}$ is the transition probability model of the environment to be learned and $\mathcal{F}$ is the function space to which the value function belongs (we discuss this in more detail in Section 2). Solving this min-max optimization can be difficult in general, unless we impose some structure on $\mathcal{F}$, e.g., a linear function space. IterVAML mitigates this issue by benefiting from the special structure of how value functions are generated within the approximate value iteration (AVI) framework (Section 3).

∗Homepage: http://academic.sologen.net. Part of this work has been done when the author was affiliated with Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We theoretically analyze IterVAML (Section 4). We provide a finite-sample error upper bound guarantee for the model learning that shows the effect of the number of samples and the complexity of the model on the error bound (Section 4.1). We also analyze how the errors in the learned model affect the quality of the outcome policy. This is in the form of an error propagation result (Section 4.2).

2 Background on Value-Aware Model Learning

To formalize the framework, let us consider a discounted Markov Decision Process (MDP) $(\mathcal{X}, \mathcal{A}, \mathcal{R}^*, \mathcal{P}^*, \gamma)$ [Szepesvári, 2010]. Here $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}^*$ is the reward distribution, $\mathcal{P}^*$ is the transition probability kernel, and $0 \le \gamma < 1$ is the discount factor. In the RL setting, $\mathcal{P}^*$ and $\mathcal{R}^*$ are not known to the agent.
Instead, the agent can interact with the environment to collect samples from these distributions. The collected data is in the form of

$$\mathcal{D}_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}, \qquad (1)$$

with the current state-action pair distributed according to $Z_i = (X_i, A_i) \sim \nu(\mathcal{X} \times \mathcal{A}) \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, the reward $R_i \sim \mathcal{R}^*(\cdot|X_i, A_i)$, and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$. We denote the expected reward by $r(x, a) = \mathbb{E}[\mathcal{R}^*(\cdot|x, a)]$.²
The goal of model learning is to find a $\hat{\mathcal{P}}$ that is close to $\mathcal{P}^*$.³ The learned model $\hat{\mathcal{P}}$ is then used by an MBRL algorithm to find a policy. To formalize this, let us denote Planner as an algorithm that receives a model $\hat{\mathcal{P}}$ and returns a policy, i.e., $\pi \leftarrow \text{Planner}(\hat{\mathcal{P}})$. We assume that the reward function is already known to Planner, so we do not explicitly pass it as an argument. There are many variations on how Planner may use the learned model to obtain a new policy. For example, Planner might be a value function-based approach that computes an estimate of the optimal value function based on $\hat{\mathcal{P}}$, and then returns the greedy policy of the estimated value function. Or it might be a policy gradient method that computes the gradient of the performance with respect to (w.r.t.) the policy using the learned model.
A fundamental question is how we should measure the closeness of $\hat{\mathcal{P}}$ to $\mathcal{P}^*$. The answer to this question depends on how Planner is going to use the model. It is possible that some aspects of the dynamics are irrelevant to Planner. The usual approaches based on probabilistic losses, such as the KL-divergence that leads to MLE, ignore this dependency. Therefore, they might be less efficient than an approach that considers how Planner is going to use the learned model.
VAML, introduced by Farahmand et al.
[2016a, 2017a], is a value-based approach and assumes that Planner uses the Bellman optimality operator defined based on $\hat{\mathcal{P}}$ to find a $\hat{Q}^*$, that is

$$T^*_{\hat{\mathcal{P}}} : Q \mapsto r + \gamma \hat{\mathcal{P}} \max_a Q, \qquad (2)$$

and then outputs $\pi = \hat{\pi}(\cdot; \hat{Q}^*)$, the greedy policy w.r.t. $\hat{Q}^*$ defined as $\hat{\pi}(x; Q) = \operatorname{argmax}_{a \in \mathcal{A}} Q(x, a)$. For brevity, we sometimes use $\hat{T}^*$ instead of $T^*_{\hat{\mathcal{P}}}$. The use of the Bellman [optimality] operator is central to value-based approaches such as the family of (Approximate) Value Iteration [Gordon, 1995, Szepesvári and Smart, 2004, Ernst et al., 2005, Munos and Szepesvári, 2008, Farahmand et al., 2009, Farahmand and Precup, 2012, Mnih et al., 2015, Tosatto et al., 2017, Farahmand et al., 2017b] or (Approximate) Policy Iteration (API) algorithms [Lagoudakis and Parr, 2003, Antos et al., 2008, Bertsekas, 2011, Lazaric et al., 2012, Scherrer et al., 2012, Farahmand et al., 2016b].
VAML focuses on finding $\hat{\mathcal{P}}$ such that the difference between $T^* Q$ and $\hat{T}^* Q$ is small. It starts from assuming that $V$ is known and defines the pointwise loss (or cost) between $\hat{\mathcal{P}}$ and $\mathcal{P}^*$ as

²Given a set $\Omega$ and its $\sigma$-algebra $\sigma_\Omega$, $\bar{\mathcal{M}}(\Omega)$ refers to the set of all probability distributions defined over $\sigma_\Omega$. As we do not get involved in measure theoretic issues in this paper, we do not explicitly define the $\sigma$-algebra, and simply use a well-defined and "standard" one, e.g., Borel sets defined for metric spaces.
³Learning the expected reward $r$ is also a part of model learning, which can be formulated as a regression problem.
For simplicity of presentation, we assume that $r$ is known.

$$c(\hat{\mathcal{P}}, \mathcal{P}^*; V)(x, a) = \left| \left\langle \mathcal{P}^*(\cdot|x, a) - \hat{\mathcal{P}}(\cdot|x, a) \,,\, V \right\rangle \right| = \left| \int \left[ \mathcal{P}^*(dx'|x, a) - \hat{\mathcal{P}}(dx'|x, a) \right] V(x') \right|, \qquad (3)$$

in which we substituted $\max_a Q(\cdot, a)$ in (2) with $V$ to simplify the presentation. In the rest of the paper, we sometimes use $\mathcal{P}_z(\cdot)$ with $z = (x, a) \in \mathcal{Z} = \mathcal{X} \times \mathcal{A}$ to refer to the probability distribution $\mathcal{P}(\cdot|x, a)$, so $\mathcal{P}_z V = \int \mathcal{P}(dx'|x, a) V(x')$.
Given a probability distribution $\nu \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, which can be the same distribution as the data generating one, VAML defines the expected loss function

$$c^2_{2,\nu}(\hat{\mathcal{P}}, \mathcal{P}^*; V) = \int \mathrm{d}\nu(x, a) \left| \int \left[ \mathcal{P}^*(dx'|x, a) - \hat{\mathcal{P}}(dx'|x, a) \right] V(x') \right|^2. \qquad (4)$$

Notice that the value function $V$ is unknown, so we cannot readily minimize this loss function, or its empirical version. What differentiates this work from the original VAML formulation is how this unknown $V$ is handled. VAML takes a robust approach: it considers the worst-case choice of $V$ in the value function space $\mathcal{F}$ that is used by Planner. Therefore, it minimizes

$$c^2_{2,\nu}(\hat{\mathcal{P}}, \mathcal{P}^*) = \int \mathrm{d}\nu(x, a) \sup_{V \in \mathcal{F}} \left| \int \left[ \mathcal{P}^*(dx'|x, a) - \hat{\mathcal{P}}(dx'|x, a) \right] V(x') \right|^2. \qquad (5)$$

As argued by Farahmand et al.
[2017a], this is still a tighter objective to minimize than the KL-divergence. To see this, consider a fixed $z = (x, a)$. We have $\sup_{V \in \mathcal{F}} |\langle \mathcal{P}^*(\cdot|x, a) - \hat{\mathcal{P}}(\cdot|x, a), V \rangle| \le \|\mathcal{P}^*_z - \hat{\mathcal{P}}_z\|_1 \sup_{V \in \mathcal{F}} \|V\|_\infty \le \sqrt{2\,\mathrm{KL}(\mathcal{P}^*_z \| \hat{\mathcal{P}}_z)} \sup_{V \in \mathcal{F}} \|V\|_\infty$, where we used Pinsker's inequality. As MLE is the minimizer of the KL-divergence based on data, these upper bounds suggest that if we find a good MLE (with small KL-divergence), we also have an accurate Bellman operator. This sequence of upper bounds, however, might be quite loose. For an extreme, but instructive, example, suppose that the value function space consists of bounded constant functions ($\mathcal{F} = \{ x \mapsto c : |c| < \infty \}$). In that case, $\sup_{V \in \mathcal{F}} |\langle \mathcal{P}^*(\cdot|x, a) - \hat{\mathcal{P}}(\cdot|x, a), V \rangle|$ is always zero, no matter how large the total variation and the KL-divergence of the two distributions are. MLE does not explicitly benefit from this interaction of the value function and the model. Asadi et al. [2018] show that the VAML objective $\sup_{V \in \mathcal{F}} |\langle \mathcal{P}^*(\cdot|x, a) - \hat{\mathcal{P}}(\cdot|x, a), V \rangle|$ with the choice of 1-Lipschitz functions for $\mathcal{F}$ is equivalent to the Wasserstein metric between $\mathcal{P}^*(\cdot|x, a)$ and $\hat{\mathcal{P}}(\cdot|x, a)$. Refer to Farahmand et al. [2017a] for more detail and discussion about VAML and its properties.
The loss function (5) defines the population version of the VAML loss. The empirical version, which is minimized in practice, replaces $\mathcal{P}^*$ and $\nu$ by their empirical distributions.
The result is

$$c^2_{2,n}(\hat{\mathcal{P}}) = \frac{1}{n} \sum_{(X_i, A_i, X'_i) \in \mathcal{D}_n} \sup_{V \in \mathcal{F}} \left| V(X'_i) - \int \hat{\mathcal{P}}(dx'|X_i, A_i) V(x') \right|^2. \qquad (6)$$

The estimated probability distribution $\hat{\mathcal{P}}$ is obtained by solving the following optimization problem:

$$\hat{\mathcal{P}} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} c^2_{2,n}(\mathcal{P}). \qquad (7)$$

Farahmand et al. [2017a] provide an expression for the gradient of this objective function when $\mathcal{F}$ is the space of linear function approximators, commonly used in the RL literature, and $\mathcal{M}$ is an exponential family. They also provide a finite-sample error upper bound guarantee showing that the minimizer of (6) converges to the minimizer of (5).

3 Iterative Value-Aware Model Learning

In this section we describe an alternative approach to formulating a value-aware model learning method. As opposed to the original formulation (5), and its empirical version (6), it is not based on a worst-case formulation. Instead, it defines the loss function based on the actual sequence of value functions generated by an (Approximate) Value Iteration (AVI) type of Planner.
Consider the Value Iteration (VI) procedure: at iteration $k = 0, 1, \dots$,

$$Q_{k+1}(x, a) \leftarrow r(x, a) + \gamma \int \mathcal{P}^*(dx'|x, a) \max_{a'} Q_k(x', a'), \qquad (8)$$

or more succinctly,

$$Q_{k+1} \leftarrow T^*_{\mathcal{P}^*} Q_k \triangleq r + \gamma \mathcal{P}^* V_k,$$

with $V_k(x) \triangleq \max_a Q_k(x, a)$. Here $T^*_{\mathcal{P}^*}$ is the Bellman optimality operator defined based on the true transition probability kernel $\mathcal{P}^*$ (we similarly define $T^*_{\mathcal{P}}$ for any generic transition probability kernel $\mathcal{P}$). Because of the contraction property of the Bellman optimality operator for discounted MDPs, we have $Q_k \to Q^*$ as $k \to \infty$. This is the basis of the VI procedure.
The intuition behind IterVAML can be developed by studying the sequence $Q_0, Q_1, \ldots$
generated by VI. Starting from $Q_0 \leftarrow r$, VI generates the following sequence:

$$Q_0 \leftarrow r$$
$$Q_1 \leftarrow T^*_{\mathcal{P}^*} Q_0 = r + \gamma \mathcal{P}^* r$$
$$Q_2 \leftarrow T^*_{\mathcal{P}^*} Q_1 = r + \gamma \mathcal{P}^* V_1$$
$$\vdots$$

To obtain the value of $Q_0$, we do not need the knowledge of $\mathcal{P}^*$. To obtain the value of $Q_1$, we only need to compute $\mathcal{P}^* V_0 = \mathcal{P}^* r$. If we find a $\hat{\mathcal{P}}$ such that

$$\hat{\mathcal{P}} r = \mathcal{P}^* r,$$

we may replace it with $\mathcal{P}^*$ to obtain $Q_1$ without any error, i.e., $Q_1 = r + \gamma \mathcal{P}^* r = r + \gamma \hat{\mathcal{P}} r$. Likewise, for any $k \ge 1$ and given $Q_k$, in order to compute $Q_{k+1}$ exactly we only need to find a $\hat{\mathcal{P}}$ such that

$$\hat{\mathcal{P}} V_k = \mathcal{P}^* V_k.$$

We have two sources of errors though. The first is that we may only guarantee that

$$\hat{\mathcal{P}} V_k \approx \mathcal{P}^* V_k.$$

As a result, $T^*_{\hat{\mathcal{P}}} V_k$ is not exactly the same as $T^*_{\mathcal{P}^*} V_k$. IterVAML's goal is to make the error $\hat{\mathcal{P}} V_k - \mathcal{P}^* V_k$ as small as possible. Based on this, we define the following "idealized" optimization problem: given a model space $\mathcal{M}$ and the current approximation of the value function $\hat{Q}_k$ (and therefore $\hat{V}_k$), solve

$$\hat{\mathcal{P}}^{(k)} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} \left\| (\mathcal{P} - \mathcal{P}^*) \hat{V}_k \right\|^2_{2,\nu} = \int \left| \int (\mathcal{P} - \mathcal{P}^*)(dx'|z) \max_{a'} \hat{Q}_k(x', a') \right|^2 \mathrm{d}\nu(z), \qquad (9)$$

where $\nu \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$ is a user-defined distribution over the state-action space. Oftentimes, this distribution is the same as the empirical distribution generating $\mathcal{D}_n$ (1), hence our use of the same notation. Afterwards, we use $\hat{\mathcal{P}}^{(k)}$ to find $\hat{V}_{k+1}$ by using the usual VI approach, that is,

$$\hat{Q}_{k+1} \leftarrow T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k. \qquad (10)$$

This procedure is repeated.
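As a concrete illustration of the idealized iteration (9)-(10), the following self-contained sketch runs it on a small tabular MDP, searching a small finite model class exhaustively. The toy MDP, the model class, and all variable names are hypothetical; they only mirror the structure of the two updates, and are not the paper's implementation.

```python
import numpy as np

# Toy tabular MDP: n_s states, n_a actions; P_true[a, s] is the row P*(.|s, a).
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P_true = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))
r = rng.uniform(0.0, 1.0, size=(n_s, n_a))  # known expected reward r(x, a)

# A small finite model class M; we include the true kernel so that a
# zero-loss model exists (a realizable setting, chosen for illustration).
model_class = [rng.dirichlet(np.ones(n_s), size=(n_a, n_s)) for _ in range(50)]
model_class.append(P_true)

def itervaml_step(Q):
    """One idealized IterVAML iteration: the model-learning step (9) with a
    uniform weighting over state-action pairs, then the exact update (10)."""
    V = Q.max(axis=1)  # V_k(x) = max_a Q_k(x, a)
    # Loss (9): squared error of the next-state expectation of V_k.
    losses = [np.sum(((P - P_true) @ V) ** 2) for P in model_class]
    P_hat = model_class[int(np.argmin(losses))]
    # Update (10): Q_{k+1} <- r + gamma * P_hat V_k.
    Q_next = r + gamma * np.stack([P_hat[a] @ V for a in range(n_a)], axis=1)
    return P_hat, Q_next

Q = np.array(r)  # Q_0 <- r
for _ in range(100):
    _, Q = itervaml_step(Q)
```

Because the selected model matches $\mathcal{P}^*$ on the next-state expectation of $V_k$ (zero loss), the update coincides with exact VI and $Q$ converges to $Q^*$; this is exactly the point of the idealized formulation.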
This is exact VI based on $\hat{\mathcal{P}}^{(k)}$.
This formulation is idealized as we do not have access to $\mathcal{P}^*$ and $\nu$, but only samples from them. We use the empirical distribution instead:

$$\hat{\mathcal{P}}^{(k+1)} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} \frac{1}{n} \sum_{(X_i, A_i, X'_i) \in \mathcal{D}_n} \left| \hat{V}_k(X'_i) - \int \mathcal{P}(dx'|X_i, A_i) \hat{V}_k(x') \right|^2. \qquad (11)$$

The optimization problem minimizes the distance between the next-state expectation of $\hat{V}_k$ according to $\mathcal{P}$ and the samples $\hat{V}_k(X')$ with $X'$ being drawn from the true next-state distribution. In case the integral is difficult to compute, we may replace it by samples from $\mathcal{P}$, i.e.,

$$\int \mathcal{P}(dx'|X_i, A_i) \hat{V}_k(x') \approx \frac{1}{m} \sum_{j=1}^{m} \hat{V}_k(X'_{i,j}),$$

with $X'_{i,j} \sim \mathcal{P}(\cdot|X_i, A_i)$ for $j = 1, \dots, m$. These are "virtual" samples generated from the model.
The second source of error is that the VI cannot be performed exactly (for example because the state space is very large), and can only be performed approximately. This leads to the Approximate Value Iteration (AVI) procedure (also known as Fitted Value or Q-Iteration) with a function space $\mathcal{F}^{|\mathcal{A}|}$ (the space of action-value functions); see, e.g., Ernst et al. [2005], Munos and Szepesvári [2008], Farahmand et al. [2009], Farahmand and Precup [2012], Mnih et al. [2015], Tosatto et al.
[2017]. Instead of setting $\hat{Q}_{k+1} \leftarrow T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k$, in AVI we have

$$\hat{Q}_{k+1} \leftarrow \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \frac{1}{n} \sum_{(X_i, A_i, R_i) \in \mathcal{D}_n} \left| Q(X_i, A_i) - \left( R_i + \gamma \int \hat{\mathcal{P}}^{(k+1)}(dx'|X_i, A_i) \hat{V}_k(x') \right) \right|^2. \qquad (12)$$

Notice that finding $\hat{Q}_{k+1}$ is a regression problem, for which many solution methods are available, including regularized variants of this empirical risk minimization problem [Farahmand et al., 2009]. As before, we may replace the integral with virtual samples, i.e.,

$$\int \hat{\mathcal{P}}^{(k+1)}(dx'|X_i, A_i) \hat{V}_k(x') \approx \frac{1}{m} \sum_{j=1}^{m} \hat{V}_k(X'_{i,j}),$$

with $X'_{i,j} \sim \hat{\mathcal{P}}^{(k+1)}(\cdot|X_i, A_i)$ for $j = 1, \dots, m$, for each $i = 1, \dots, n$. These "virtual" samples from the model play the same role as the hypothetical experience in the Dyna architecture [Sutton, 1990] or imagination in imagination-augmented agents by Racanière et al. [2017].
Algorithm 1 summarizes a generic IterVAML procedure. The algorithm receives a model space $\mathcal{M}$, the action-value function space $\mathcal{F}^{|\mathcal{A}|}$, the space of reward functions $\mathcal{G}$, and the number of iterations $K$. At each iteration $k = 0, 1, \dots$, it generates a fresh training dataset $\mathcal{D}^{(k)}_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}$ by interacting with the environment. It learns the transition model $\hat{\mathcal{P}}^{(k+1)}$ by solving (11). It also learns the reward function $\hat{r}$ by minimizing $\mathrm{Loss}_R$, which can be the squared loss (or a robust variant). We do not analyze learning the reward function in this work. Afterwards, it performs one step of AVI by solving (12). These steps are repeated for $K$ iterations.
Many variations of this algorithm are possible. We briefly remark on some of them. Here the AVI step only uses the model $\hat{\mathcal{P}}$.
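To make the sample-based AVI step concrete, here is a small sketch of the regression (12) in a tabular setting, using $m$ virtual next-state samples from a learned model. The model and reward are fabricated at random, the features are one-hot (so least squares reduces to per-cell averaging), and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the AVI regression step (12) with m "virtual" next-state samples
# drawn from the learned model; everything here is an illustrative stand-in.
rng = np.random.default_rng(1)
n_s, n_a, gamma, n, m = 4, 2, 0.9, 500, 8
P_hat = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))  # learned model P^(k+1)
r_hat = rng.uniform(size=(n_s, n_a))                  # learned reward r_hat
V_k = rng.uniform(size=n_s)                           # current value estimate

X = rng.integers(n_s, size=n)   # states X_i ~ nu (uniform here)
A = rng.integers(n_a, size=n)   # actions A_i
R = r_hat[X, A]                 # rewards (taken from r_hat for simplicity)

# Monte Carlo estimate of \int P_hat(dx'|X_i, A_i) V_k(x') via m virtual samples.
targets = np.empty(n)
for i in range(n):
    Xp = rng.choice(n_s, size=m, p=P_hat[A[i], X[i]])  # X'_{i,j} ~ P_hat
    targets[i] = R[i] + gamma * V_k[Xp].mean()

# Least-squares fit of Q over one-hot (tabular) features: the regression (12).
Phi = np.zeros((n, n_s * n_a))
Phi[np.arange(n), X * n_a + A] = 1.0
w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
Q_next = w.reshape(n_s, n_a)
```

With enough data, `Q_next` approaches $\hat{r} + \gamma \hat{\mathcal{P}}^{(k+1)} \hat{V}_k$ up to Monte Carlo error in the virtual samples; replacing the one-hot features with any regressor gives the general fitted-Q form of the step.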
In practice, however, one may use both the learned model $\hat{\mathcal{P}}$ and the data $\mathcal{D}_n$ in solving the optimization problem (12) in order to obtain better solutions, as in the Dyna architecture [Sutton, 1990]. Moreover, the summation in (12) is over $\mathcal{D}_n$ (or $\cup_{i=0}^{k} \mathcal{D}^{(i)}_n$ as stated in the algorithm), which is the dataset of true samples. If we also learn a distribution model $\hat{\nu}$ of $\nu$, we can sample from it too. In that case, we can increase the number of samples used in solving the regression step of IterVAML. We have more discussion about the algorithm in the supplementary material.
We can also similarly define a policy evaluation version of IterVAML.

4 Theoretical Analysis of Iterative VAML

We analyze the statistical properties of the IterVAML procedure. The analysis is divided into two parts. First we analyze one iteration of model learning (cf. (11)) and provide an upper bound on the error in learning the model (Theorem 1 in Section 4.1). Afterwards, we consider how errors at each iteration propagate throughout the iterations of IterVAML and affect the quality of the learned policy (Theorem 2 in Section 4.2). Theorem 3 combines these two results and shows how the model learning errors affect the quality of the outcome policy.
The proofs and more extensive discussions are deferred to the extended version of the paper, which is provided as supplementary material.

4.1 Error Analysis for a Single Iteration

We analyze the $k$-th iteration of IterVAML (11) and provide an error bound on $\|(\hat{\mathcal{P}}^{(k+1)} - \mathcal{P}^*) V_k\|_2$. To reduce clutter, we do not specify the iteration index $k$; e.g., the analyzed loss is denoted by $\|(\hat{\mathcal{P}} - \mathcal{P}^*) V\|_2$ for a fixed $V$.

Algorithm 1 Model-based Reinforcement Learning Algorithm with Iterative VAML

// MDP $(\mathcal{X}, \mathcal{A}, \mathcal{R}^*, \mathcal{P}^*, \gamma)$
// $K$: Number of iterations
// $\mathcal{M}$: Space of transition probability kernels
// $\mathcal{F}^{|\mathcal{A}|}$: Space of action-value functions
// $\mathcal{G}$: Space of reward functions
Initialize a policy $\pi_0$ and a value function $\hat{V}_0$.
for $k = 0$ to $K - 1$ do
  Generate training set $\mathcal{D}^{(k)}_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}$ by interacting with the true environment (potentially using $\pi_k$), i.e., $(X_i, A_i) \sim \nu_k$ with $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$ and $R_i \sim \mathcal{R}^*(\cdot|X_i, A_i)$.
  $\hat{\mathcal{P}}^{(k+1)} \leftarrow \operatorname{argmin}_{\mathcal{P} \in \mathcal{M}} \big\| \hat{V}_k(X'_i) - \int \mathcal{P}(dx'|X_i, A_i) \hat{V}_k(x') \big\|^2_{\cup_{i=0}^{k} \mathcal{D}^{(i)}_n}$.
  $\hat{r} \leftarrow \operatorname{argmin}_{r \in \mathcal{G}} \mathrm{Loss}_R(r; \cup_{i=0}^{k} \mathcal{D}^{(i)}_n)$.
  $\hat{Q}_{k+1} \leftarrow \operatorname{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \big\| Q(X_i, A_i) - \big( \hat{r}(X_i, A_i) + \gamma \int \hat{\mathcal{P}}^{(k+1)}(dx'|X_i, A_i) \hat{V}_k(x') \big) \big\|^2_{\cup_{i=0}^{k} \mathcal{D}^{(i)}_n}$.
  $\pi_{k+1} \leftarrow \hat{\pi}(\cdot; \hat{Q}_{k+1})$.
end for
return $\pi_K$

Consider a fixed value function $V : \mathcal{X} \to \mathbb{R}$. We are given a dataset $\mathcal{D}_n = \{(X_i, A_i, X'_i)\}_{i=1}^{n}$ with $Z_i = (X_i, A_i) \sim \nu(\mathcal{X} \times \mathcal{A}) \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$, as specified in (1).
We now enlist our set of assumptions.
Some of them are technical assumptions to simplify the analysis, and some characterize crucial aspects of the model learning. We shall remark on these as we introduce them.
Assumption A1 (Samples) At the $k$-th iteration we are given a dataset $\mathcal{D}_n (= \mathcal{D}^{(k)}_n)$

$$\mathcal{D}_n = \{(X_i, A_i, X'_i)\}_{i=1}^{n}, \qquad (13)$$

with $Z_i = (X_i, A_i)$ being independent and identically distributed (i.i.d.) samples drawn from $\nu(\mathcal{X} \times \mathcal{A}) \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$ and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$. Furthermore, we assume that $\mathcal{D}^{(k)}_n$ and $\mathcal{D}^{(k')}_n$ for $k \ne k'$ are independent.
The i.i.d. assumption is to simplify the analysis, and with extra effort one can provide similar results for dependent processes that gradually "forget" their past. The forgetting behaviour can be characterized by the mixing behaviour of the stochastic process [Doukhan, 1994]. One can then provide statistical guarantees for learning algorithms under various mixing conditions [Yu, 1994, Meir, 2000, Steinwart and Christmann, 2009, Mohri and Rostamizadeh, 2009, 2010, Farahmand and Szepesvári, 2012].
In this assumption we also require that the datasets of two different iterations are independent. This is again to simplify the analysis. In practice, we might reuse the same dataset in all iterations. Theoretical results by Munos and Szepesvári [2008] suggest that the dependence between iterations may not lead to significant performance degradation.
We need to make some assumptions about the model space $\mathcal{M}$ and its complexity (i.e., capacity). We use the covering number (and its logarithm, i.e., metric entropy) of a function space (here the model space $\mathcal{M}$) as the characterization of its complexity.
The covering number at resolution $\varepsilon$ is the minimum number of balls with radius $\varepsilon$ required to cover the model space $\mathcal{M}$ according to a particular metric, and is denoted by $N(\varepsilon, \mathcal{M})$ (see the supplementary material for definitions). As $\varepsilon$ decreases, the covering number increases (or more accurately, the covering number is non-decreasing). For example, the covering number of a $p$-dimensional linear function approximator with a constraint on the magnitude of its functions behaves like $O(\frac{1}{\varepsilon^p})$. A similar result holds when the subgraphs of a function space have VC-dimension $p$. Model spaces whose covering numbers grow faster are more complex, and estimating a function within them is more difficult. This leads to a larger estimation error, as we shall see. On the other hand, such model spaces often (but not always) have better model approximation properties too.
In order to show the finer behaviour of the error bound, we define $\mathcal{M}$ as a subset of a larger family of probability distributions $\mathcal{M}_0$. Let $J : \mathcal{M}_0 \to [0, \infty)$ be a pseudo-norm defined on functions in $\mathcal{M}_0$. We then define $\mathcal{M} = \{\mathcal{P} \in \mathcal{M}_0 : J(\mathcal{P}) \le R\}$ for some $R > 0$. One may think of $J$ as a measure of complexity of functions in $\mathcal{M}_0$, so $\mathcal{M}$ would be a ball with a fixed radius $R$ w.r.t. $J$. If $\mathcal{M}_0$ is defined based on a reproducing kernel Hilbert space (RKHS), we can think of $J$ as the inner product norm of the RKHS.
Assumption A2 (Model Space) For $R > 0$, let $\mathcal{M} = \mathcal{M}_R = \{\mathcal{P} \in \mathcal{M}_0 : J(\mathcal{P}) \le R\}$. There exist constants $c > 0$ and $0 < \alpha < 1$ such that for any $\varepsilon, R > 0$ and all sequences
$z_{1:n} \triangleq z_1, \dots, z_n \subset \mathcal{X} \times \mathcal{A}$, the following metric entropy condition is satisfied:

$$\log N(\varepsilon, \mathcal{M}, L_2(z_{1:n})) \le c \left( \frac{R}{\varepsilon} \right)^{2\alpha}.$$

Furthermore, the model space $\mathcal{M}$ is convex, and compact w.r.t. $d_{\infty,\mathrm{TV}}(\mathcal{P}_1, \mathcal{P}_2) = \sup_{z \in \mathcal{Z}} \int |\mathcal{P}_1(dy|z) - \mathcal{P}_2(dy|z)|$.
This form of the metric entropy of $\mathcal{M} = \mathcal{M}_R$ is suitable to capture the complexity of large function spaces such as some RKHS and Sobolev spaces. For example, for $W^k(\mathbb{R}^d) = W^{k,2}(\mathbb{R}^d)$, the Sobolev space defined w.r.t. the $L_2$-norm of the weak derivatives up to order $k$, we can set $\alpha = \frac{d}{2k}$; see, e.g., Lemma 20.6 of Györfi et al. [2002]. For smaller function spaces, such as the $p$-dimensional linear function approximator mentioned above, the metric entropy behaves like $p \log(\frac{1}{\varepsilon})$, which can be seen as having $\alpha \to 0$ at a certain rate. For many examples of covering number and metric entropy results, refer to van de Geer [2000], Györfi et al. [2002], Zhou [2003], Steinwart and Christmann [2008], Giné and Nickl [2015]. Also note that here we require the convexity and compactness of $\mathcal{M}$. The convexity is a crucial assumption for obtaining a fast convergence rate, as was shown and discussed by Lee et al. [1998, 2008], Mendelson [2008]. The compactness w.r.t. this particular metric is a technical assumption that may possibly be relaxed.
Assumption A3 (Value Function) The value function $V$ is fixed (i.e., not dependent on $\mathcal{D}_n$) and is $V_{\max}$-bounded with $V_{\max} \ge 1$.
This assumption is for the simplicity of the analysis. We use it in large deviation results that require the boundedness of the involved random variables, e.g., Theorem 2.1 of Bartlett et al. [2005] or Theorem 19.1 of Györfi et al. [2002], which we use in our proofs.
We are now ready to state the main result of this section.
Theorem 1. Suppose that Assumptions A1, A2, and A3 hold.
Consider $\hat{\mathcal{P}}$ obtained by solving (11). There exists a finite $c(\alpha) > 0$, depending only on $\alpha$, such that for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\left\| (\hat{\mathcal{P}}_z - \mathcal{P}^*_z) V \right\|^2_{2,\nu} \le \inf_{\mathcal{P} \in \mathcal{M}} \left\| (\mathcal{P}_z - \mathcal{P}^*_z) V \right\|^2_{2,\nu} + \frac{c(\alpha)\, V_{\max}^2\, R^{\frac{2\alpha}{1+\alpha}}}{n^{\frac{1}{1+\alpha}}} \sqrt{\log(1/\delta)}.$$

This result upper bounds the error of $\hat{\mathcal{P}}$ in approximating the next-state expectation of the value function. The upper bound consists of the model (or function) approximation error (the first term) and the estimation error (the second term). It is notable that the constant in front of the model approximation error is one, so the best we can hope from this algorithm, in the limit of infinite data, is to be as good as the best model in the model class $\mathcal{M}$.
The estimation error behaves as $n^{-1/(1+\alpha)}$ ($0 < \alpha < 1$). This is a fast rate and can approach $n^{-1}$ as $\alpha \to 0$. We do not know whether this is an optimal rate for this particular problem, but results from regression with the least-squares loss suggest that it might indeed be optimal: for a regression function belonging to a function space that has a packing entropy of the same form as in the upper bound of Assumption A2, the rate $\Omega(n^{-1/(1+\alpha)})$ is its minimax lower bound [Yang and Barron, 1999].
We use local Rademacher complexity and analyze the modulus of continuity of empirical processes to obtain rates faster than what could be achieved by the more conventional technique of analyzing the supremum of the empirical process. Farahmand et al. [2017a] used the supremum of the empirical process to analyze VAML and obtained an $n^{-1/2}$ rate, which is slower than $n^{-1/(1+\alpha)}$. Notice that the loss functions of VAML and IterVAML are different, so this is only an approximate comparison.
The $n^{-1/2}$ rate, however, is common in analyses based on the supremum of the empirical process, so we would expect it to hold if we used those techniques to analyze IterVAML. Finally, notice that the error rate of VAML is not necessarily slower than IterVAML's; the present difference is at least partly due to the simpler proof technique used there.

4.2 Error Propagation

We analyze how the errors incurred at each step of IterVAML propagate throughout the iterations and affect the quality of the outcome. For policy evaluation, the quality is defined as the difference between $V^\pi$ and $\hat{V}_K$, weighted according to a user-defined probability distribution $\rho_{\mathcal{X}} \in \bar{\mathcal{M}}(\mathcal{X})$, i.e., $\|V^\pi - \hat{V}_K\|_{1,\rho_{\mathcal{X}}}$. For the control case we consider the performance loss, which is defined as the difference between the value of following the greedy policy w.r.t. $\hat{Q}_K$ and the value of the optimal policy $Q^*$, weighted according to a user-defined probability distribution $\rho \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, i.e., $\rho(Q^* - Q^{\pi_K})$ (cf. Algorithm 1). This type of error propagation analysis has been performed before by Munos [2007], Antos et al. [2008], Farahmand et al. [2010], Scherrer et al. [2012], Huang et al. [2015], Mann et al. [2015], Farahmand et al. [2016c].
Recall that there are two sources of errors in the IterVAML procedure. The first is the error in model learning, which arises because the model $\hat{\mathcal{P}}^{(k+1)}$ learned by solving the minimization problem (11) only satisfies $\hat{\mathcal{P}}^{(k+1)} \hat{V}_k \approx \mathcal{P}^* \hat{V}_k$ instead of being exact. This error is studied in Section 4.1, and Theorem 1 provides an upper bound on it.
The second source of error is that the AVI performs the Bellman update only approximately.
So instead of having $\hat{Q}_{k+1} = T^*_{\hat{P}^{(k+1)}} \hat{Q}_k$ (or its policy evaluation equivalent), the function $\hat{Q}_{k+1}$ obtained by solving (12) is only approximately equal to $T^*_{\hat{P}^{(k+1)}} \hat{Q}_k$. As already mentioned, this step essentially solves a regression problem, so many of the standard error guarantees for regression can be used here too, possibly with minor changes.

Consider a sequence $\hat{Q}_0, \hat{Q}_1, \ldots, \hat{Q}_K$ with $\hat{Q}_{k+1} \approx T^*_{\hat{P}^{(k+1)}} \hat{Q}_k$, with $\hat{P}^{(k+1)}$ being an approximation of the true $P \triangleq P^*$. IterVAML, which consists of repeatedly solving (11) and (12), is an example of a procedure that generates such $\hat{Q}_k$. The result, however, is more general and does not depend on the particular way $\hat{P}^{(k+1)}$ and $\hat{Q}_{k+1}$ are produced.

We define the following concentrability coefficients, similar to the coefficient introduced by Farahmand et al. [2010], which itself is a relaxation of the coefficient introduced by Munos [2007]. These coefficients are the Radon-Nikodym (R-N) derivatives of the multi-step-ahead state-action distribution w.r.t. the distribution $\nu$. The R-N derivative can be thought of as the ratio of two probability density functions.

Definition 1 (Expected Concentrability of the Future State-Action Distribution). Given $\rho, \nu \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, an integer $k \geq 0$, and an arbitrary sequence of policies $(\pi_i)_{i=1}^k$, the distribution $\rho P^{\pi_1} \cdots P^{\pi_k}$ denotes the future state-action distribution obtained when the first state-action pair is distributed according to $\rho$ and the agent follows the sequence of policies $\pi_1, \pi_2, \ldots, \pi_k$.
Define
$$
\bar{c}_{VI,\rho,\nu}(k) = \sup_{\pi_1, \ldots, \pi_k} \left\| \frac{d\left(\rho P^{\pi_1} \cdots P^{\pi_k}\right)}{d\nu} \right\|_{2,\nu}.
$$
If the future state-action distribution $\rho P^{\pi_1} \cdots P^{\pi_k}$ is not absolutely continuous w.r.t. $\nu$, we take $\bar{c}_{VI,\rho,\nu}(k) = \infty$. Moreover, for a discount factor $0 \leq \gamma < 1$, define the discounted weighted average concentrability coefficient as
$$
\bar{C}(\rho, \nu) = (1-\gamma)^2 \sum_{k \geq 1} k \gamma^{k-1} \bar{c}_{VI,\rho,\nu}(k).
$$
The definition of $\bar{C}(\rho, \nu)$ is similar to the second-order discounted future state distribution concentration coefficient of Munos [2007], with the main difference being that it is defined for the expectation of the R-N derivative instead of its supremum.

The following theorem is our main error propagation result. It can be seen as a generalization of the results of Farahmand et al. [2010] and Munos [2007] to the case where we use a model that has an error, whereas the aforementioned papers are for the model-free case (or when the model is exact). Because of this similarity, several steps of the proof are similar to theirs.

Theorem 2. Consider a sequence of action-value functions $(\hat{Q}_k)_{k=0}^K$ and their corresponding $(\hat{V}_k)_{k=0}^K$, each defined as $\hat{V}_k(x) = \max_a \hat{Q}_k(x, a)$. Suppose that the MDP is such that the expected rewards are $R_{\max}$-bounded, and $\hat{Q}_0$ is initialized such that it is $V_{\max} \leq \frac{R_{\max}}{1-\gamma}$-bounded. Let $\varepsilon_k = T^*_{\hat{P}^{(k+1)}} \hat{Q}_k - \hat{Q}_{k+1}$ (regression error) and $e_k = (P^* - \hat{P}^{(k+1)}) \hat{V}_k$ (modelling error) for $k = 0, 1, \ldots, K-1$. Let $\pi_K$ be the greedy policy w.r.t. $\hat{Q}_K$, i.e., $\pi_K(x) = \mathrm{argmax}_{a \in \mathcal{A}} \hat{Q}_K(x, a)$ for all $x \in \mathcal{X}$.
Consider probability distributions $\rho, \nu \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$. We have
$$
\|Q^* - Q^{\pi_K}\|_{1,\rho} \leq \frac{2\gamma}{(1-\gamma)^2} \left[ \bar{C}(\rho, \nu) \max_{0 \leq k \leq K-1} \left( \|\varepsilon_k\|_{2,\nu} + \gamma \|e_k\|_{2,\nu} \right) + 2\gamma^K R_{\max} \right].
$$
We compare this result with the results of Munos [2007], Farahmand et al. [2010], and Farahmand [2011] in the same section of the supplementary material. Before stating Theorem 3, which is a direct implication of Theorems 1 and 2, we state another assumption.

Assumption A4 (Value Function Space) The value function space $\mathcal{F}^{|\mathcal{A}|}$ is $V_{\max}$-bounded with $V_{\max} \leq \frac{R_{\max}}{1-\gamma}$, and $V_{\max} \geq 1$.

This assumption requires that all the value functions $\hat{Q}_k$ and $\hat{V}_k$ generated by performing a step of AVI (12) and used in the model learning step (11) are $V_{\max}$-bounded. This ensures that Assumption A3, which is required by Theorem 1, is satisfied in all iterations. The assumption is easy to satisfy in practice by clipping the output of the value function estimator at the level of $\pm V_{\max}$. Theoretical analysis of such a clipped value function estimator, however, is more complicated. As we do not analyze the value function estimation steps of IterVAML, which depend on the choice of $\mathcal{F}^{|\mathcal{A}|}$, we ignore this issue.

Theorem 3. Consider the IterVAML procedure in which, at the $k$-th iteration, the model $\hat{P}^{(k+1)}$ is obtained by solving (11) and $\hat{Q}_{k+1}$ is obtained by solving (12). Let $\varepsilon_k = T^*_{\hat{P}^{(k+1)}} \hat{Q}_k - \hat{Q}_{k+1}$ be the regression error. Suppose that Assumptions A1, A2, and A4 hold. Consider the greedy policy $\pi_K$ w.r.t. $\hat{Q}_K$.
For any $\rho \in \bar{\mathcal{M}}(\mathcal{X} \times \mathcal{A})$, there exists a finite $c(\alpha) > 0$, depending only on $\alpha$, such that for any $\delta > 0$, with probability at least $1 - \delta$, we have
$$
\|Q^* - Q^{\pi_K}\|_{1,\rho} \leq \frac{2\gamma}{(1-\gamma)^2} \left[ \bar{C}(\rho, \nu) \left( \max_{0 \leq k \leq K-1} \|\varepsilon_k\|_{2,\nu} + \gamma\, e_{\mathrm{model}}(n) \right) + 2\gamma^K R_{\max} \right],
$$
where
$$
e_{\mathrm{model}}(n) = \sup_{V \in \mathcal{F}^+} \inf_{P \in \mathcal{M}} \|(P_z - P^*_z) V\|_{2,\nu} + \frac{c(\alpha)\, V_{\max}\, R^{\frac{\alpha}{1+\alpha}} \sqrt[4]{\log(K/\delta)}}{n^{\frac{1}{2(1+\alpha)}}},
$$
and $\mathcal{F}^+ = \left\{ \max_a Q(\cdot, a) : Q \in \mathcal{F}^{|\mathcal{A}|} \right\}$.

This result provides an upper bound on the quality of the learned policy $\pi_K$ as a function of the number of samples and the properties of the model space $\mathcal{M}$ and the MDP. The estimation error due to the model learning is $O(n^{-1/(2(1+\alpha))})$, which is discussed in some detail after Theorem 1. The model approximation error term $\sup_{V \in \mathcal{F}^+} \inf_{P \in \mathcal{M}} \|(P_z - P^*_z) V\|_{2,\nu}$ shows the interaction between the model space $\mathcal{M}$ and the value function space $\mathcal{F}^{|\mathcal{A}|}$. This quantity is likely conservative and can be improved. We also note that an upper bound on $\|\varepsilon_k\|_{2,\nu}$ depends on the regression method, the choice of $\mathcal{F}^{|\mathcal{A}|}$, and the number of samples generated from $\hat{P}^{(k+1)}$.

5 Conclusion

We have introduced IterVAML, a decision-aware model-based RL algorithm. We proved a finite sample error upper bound for the model learning procedure (Theorem 1) and a generic error propagation result for an approximate value iteration algorithm that uses an inaccurate model (Theorem 2).
The consequence of these two results is Theorem 3, which provides an error upper bound guarantee on the quality of the outcome policy of IterVAML.

There are several possible future research directions. One is an empirical study of IterVAML and a comparison with non-decision-aware methods. Another is to investigate other approaches to decision-aware model-based RL algorithms.

Acknowledgments

I would like to thank the anonymous reviewers for their helpful feedback, and Mehdi Ghasemi and Murat A. Erdogdu for discussions.

References

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.

Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman. Equivalence between Wasserstein and value-aware model-based reinforcement learning. In FAIM Workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.

Paul Doukhan. Mixing: Properties and Examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1994.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Amir-massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Amir-massoud Farahmand and Doina Precup. Value pursuit iteration. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems (NIPS - 25), pages 1349–1357. Curran Associates, Inc., 2012.

Amir-massoud Farahmand and Csaba Szepesvári. Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference, 142(2):493–505, 2012.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of American Control Conference (ACC), pages 725–730, June 2009.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 568–576, 2010.

Amir-massoud Farahmand, André M.S. Barreto, and Daniel N. Nikovski. Value-aware loss function for model learning in reinforcement learning. In 13th European Workshop on Reinforcement Learning (EWRL), December 2016a.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research (JMLR), 17(139):1–66, 2016b.

Amir-massoud Farahmand, Daniel N. Nikovski, Yuji Igarashi, and Hiroki Konaka. Truncated approximate dynamic programming with task-dependent terminal value. In AAAI Conference on Artificial Intelligence, February 2016c.

Amir-massoud Farahmand, André M.S. Barreto, and Daniel N. Nikovski. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1486–1494, April 2017a.
Amir-massoud Farahmand, Saleh Nabi, and Daniel N. Nikovski. Deep reinforcement learning for partial differential equation control. In American Control Conference (ACC), 2017b.

Gregory Farquhar, Tim Rocktaeschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2018.

Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.

Geoffrey Gordon. Stable function approximation in dynamic programming. In International Conference on Machine Learning (ICML), 1995.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Verlag, New York, 2002.

De-An Huang, Amir-massoud Farahmand, Kris M. Kitani, and J. Andrew Bagnell. Approximate MaxEnt inverse optimal control and its application for mental simulation of human interactions. In AAAI Conference on Artificial Intelligence, January 2015.

Joshua Joseph, Alborz Geramifard, John W. Roberts, Jonathan P. How, and Nicholas Roy. Reinforcement learning with misspecified model classes. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pages 939–946. IEEE, 2013.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107–1149, 2003.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research (JMLR), 13:3041–3074, October 2012.

Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.
Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to the importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 54(9):4395, 2008.

Timothy A. Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research (JAIR), 53:375–438, 2015.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.

Shahar Mendelson. Lower bounds for the empirical risk minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, August 2008.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21, pages 1097–1104. Curran Associates, Inc., 2009.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research (JMLR), 11:789–814, 2010.

Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research (JMLR), 9:815–857, 2008.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network.
In Advances in Neural Information Processing Systems (NIPS - 30), pages 6118–6128. Curran Associates, Inc., 2017.

Sébastien Racanière, Theophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS - 30), pages 5690–5701. Curran Associates, Inc., 2017.

Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, and Matthieu Geist. Approximate modified policy iteration. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, André M.S. Barreto, and Thomas Degris. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3191–3199, 2017.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems (NIPS - 22), pages 1768–1776. Curran Associates, Inc., 2009.

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (ICML), 1990.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan Claypool Publishers, 2010.

Csaba Szepesvári and William D. Smart. Interpolation-based Q-learning. In Proceedings of the twenty-first International Conference on Machine Learning (ICML), 2004.
Samuele Tosatto, Matteo Pirotta, Carlo D'Eramo, and Marcello Restelli. Boosted fitted Q-iteration. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3434–3443, August 2017.

Sara A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.

Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.