{"title": "Value Function in Frequency Domain and the Characteristic Value Iteration Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 14808, "page_last": 14819, "abstract": "This paper considers the problem of estimating the distribution of returns in reinforcement learning (i.e., distributional RL problem). It presents a new representational framework to maintain the uncertainty of returns and provides mathematical tools to compute it. \nWe show that instead of representing a probability distribution function of returns, one can represent their characteristic function instead, the Fourier transform of their distribution. We call the new representation Characteristic Value Function (CVF), which can be interpreted as the frequency domain representation of the probability distribution of returns.\nWe show that the CVF satisfies a Bellman-like equation, and its corresponding Bellman operator is contraction with respect to certain metrics.\nThe contraction property allows us to devise an iterative procedure to compute the CVF, which we call Characteristic Value Iteration (CVI). We analyze CVI and its approximate variant and show how approximation errors affect the quality of computed CVF.", "full_text": "Value Function in Frequency Domain\n\nand the Characteristic Value Iteration Algorithm\n\nAmir-massoud Farahmand\u2217\n\nVector Institute & University of Toronto\n\nToronto, Canada\n\nfarahmand@vectorinstitute.ai\n\nAbstract\n\nThis paper considers the problem of estimating the distribution of returns in rein-\nforcement learning, i.e., distributional RL problem. It presents a new representa-\ntional framework to maintain the uncertainty of returns and provides mathematical\ntools to compute it. We show that instead of representing a probability distribution\nfunction of returns, one can represent their characteristic function, the Fourier\ntransform of their distribution. 
We call the new representation the Characteristic Value Function (CVF). The CVF satisfies a Bellman-like equation, and its corresponding Bellman operator is a contraction with respect to certain metrics. The contraction property allows us to devise an iterative procedure to compute the CVF, which we call Characteristic Value Iteration (CVI). We analyze CVI and its approximate variant and show how approximation errors affect the quality of the computed CVF.

1 Introduction

The object of focus of conventional RL is the expected return of following a policy, i.e., the value function [Sutton and Barto, 2019]. The goal is to find a policy that maximizes that expectation over all states, i.e., the optimal policy. This leads to agents that do not consider the distribution of returns in their decision making, but only its first moment. This might be of concern in scenarios where risk is of paramount importance. Estimating the distribution of the return facilitates designing agents that consider objectives more general than maximizing the expected return, such as various notions of risk [Tamar et al., 2012, Prashanth and Ghavamzadeh, 2013, García and Fernández, 2015, Chow et al., 2018].
The Distributional RL (DistRL) literature [Engel et al., 2005, Morimura et al., 2010b, Bellemare et al., 2017, Barth-Maron et al., 2018, Lyle et al., 2019], on the other hand, moves away from the conventional goal of estimating the expectation of the return and attempts to estimate a richer representation of the return, such as the distribution itself [Morimura et al., 2010b,a] or some statistical functional of it [Rowland et al., 2018, Dabney et al., 2018, Rowland et al., 2019].
It is notable that so far the focus of the DistRL literature has mostly been on designing agents that perform better according to the expected return, not according to any risk-related performance measure, but it is conceivable that those methods can be used for designing risk-aware agents too.

∗Homepage: http://academic.sologen.net.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper develops a new framework for maintaining the information available in the distribution of returns. Instead of estimating the distribution function itself, we maintain the Characteristic Function (CF) of the returns. The CF of a random variable (r.v.) is the Fourier transform of its probability distribution function (PDF). Similar to the PDF, the CF of a r.v. contains all the information available about the distribution of that r.v., i.e., CF and PDF are in a bijective relationship. They are nonetheless different representations of the uncertainty of a r.v., hence they allow different types of manipulations and processing. The benefit of a new representation is that it opens up the possibility of designing new algorithms. An example from the field of control theory is that we have both time and frequency domain representations of a dynamical system. Although they are equivalent in many cases, designing a controller in the frequency domain is sometimes easier and may provide better insights. This work brings the frequency-based representation of uncertainty to DistRL.
Estimation procedures based on the CF are not novel. Methods based on the Empirical Characteristic Function (ECF) have a long history in the statistics and econometrics literature [Feuerverger and Mureika, 1977, Feuerverger and McDunnough, 1981, Feuerverger, 1990, Knight and Yu, 2002, Yu, 2004].
These methods are considered alternatives to maximum likelihood estimation (MLE) because, as opposed to MLE, whose computation might be infeasible for some distributions, one can always define and compute the ECF. This paper is inspired by that literature and develops similar tools for RL and approximate dynamic programming.
The main idea of this work is that by transforming the return, which is a r.v., to the frequency domain through the Fourier transform, we can define the Characteristic Value Function (CVF), which essentially captures all information about the distribution of the return. A contribution of this work is that we prove that the CVF indeed satisfies a Bellman-like equation T̃^π Ṽ = Ṽ (Section 3). The corresponding Bellman operator, however, is different from the conventional ones or those in the DistRL literature. Instead of having an additive form, it is multiplicative, i.e.,

(T̃^π Ṽ)(ω; x) ≜ R̃(ω; x) ∫ P^π(dy|x) Ṽ(γω; y),

with ω being the frequency variable, x being the state variable, and R̃ being the Fourier transform of the immediate reward distribution (we will define these quantities later). We also prove that the new Bellman operator is a contraction with respect to (w.r.t.) some specific metrics defined in the frequency domain (Section 3.1). The contraction property suggests that one might find the CVF through an iterative procedure similar to value iteration, which we call the Characteristic Value Iteration (CVI) algorithm (Section 4). This is the algorithmic contribution of this work.
Any procedure that implements CVI, however, may not perform it exactly, for example because we only have data as opposed to the actual transition probability distribution, or because the state space is very large and we need to use function approximation.
In case we can only approximately perform CVI, which we call Approximate CVI (ACVI), we inevitably have some errors. To better understand the effect of using function approximation on these errors, we consider a class of band-limited (in the frequency domain) functions, and study their function approximation and covering number properties (in the extended version of the paper). Another contribution of this work is the analysis of how the errors caused at each iteration of ACVI propagate throughout iterations and affect the quality of the outcome CVF (Section 5). We show that the errors in earlier iterations decay exponentially fast, i.e., the past errors are forgotten quickly. This is the same phenomenon observed in the conventional approximate value iteration. Finally, we show how to convert the error of the CVF in the frequency domain to an error in distributions, measured according to the p-smooth Wasserstein distance (Section 6).

2 Distributional Bellman equation

We consider a discounted Markov Decision Process (MDP) (X, A, R, P, γ) [Szepesvári, 2010]. Here X is the state space, A is the action space, P : X × A → M(X) is the transition probability kernel, R : X × A → M(R) is the immediate reward distribution, and 0 ≤ γ < 1 is the discount factor.² The (Markov stationary) policy π : X → M(A) induces the transition probability kernel P^π : X → M(X) and the immediate reward distribution for the policy R^π : X → M(R).
An MDP together with an initial state distribution ρ ∈ M(X) encodes the laws governing the temporal evolution of a discrete-time stochastic process controlled by an agent as follows: The controlled process starts at time t = 0 with a random initial state X₀ drawn from ρ, i.e., X₀ ∼ ρ.
The agent following a policy π chooses action Aₜ ∈ A according to Aₜ ∼ π(·|Xₜ) (stochastic policy) or Aₜ = π(Xₜ) (deterministic policy). In response, the next state is Xₜ₊₁ ∼ P(·|Xₜ, Aₜ) and the agent receives reward Rₜ ∼ R(·|Xₜ, Aₜ). This process repeats. We may occasionally use R(x, a) or R^π(x) to denote the r.v. that is drawn from R(·|x, a) or R^π(·|x). Also, we may use z = (x, a) as a shorthand. When we refer to a r.v. Z = (X, A), this should be interpreted as a r.v. defined with A ∼ π(·|X), where the policy should be clear from the context.

²Here M(Ω) refers to the space of all probability distributions on an appropriately defined σ-algebra of Ω, e.g., the Borel σ-algebra on R. We do not deal with the measure theoretic considerations in this work. Refer to Appendix C of Bertsekas [2013] or Chapter 7 of Bertsekas and Shreve [1978]. We occasionally use X̄ to denote the probability distribution μ of the r.v. X.

The return of the agent starting from a state x ∈ X and following a policy π is the following random variable:

G^π(x) = Σ_{i≥0} γ^i Rᵢ.

The (conventional) value function V^π is the first moment of this r.v., i.e.,

V^π(x) = E[G^π(X₀) | X₀ = x].

Likewise, one may define the return G^π(x, a) for starting from state x, choosing action a, and following policy π afterwards. The corresponding first moment of G^π(x, a) would be the action-value function Q^π(x, a).
From G^π(x) = R₀ + γ Σ_{i≥0} γ^i R_{i+1}, we see that G^π(x) is the addition of two r.v. R₀ and γG^π(X₁), with X₁ ∼ P^π(·|X₀ = x).
Therefore, the law (probability distribution) of G^π(x) is the same as the law of R₀ + γG^π(X₁), i.e.,

G^π(x) (D)= R₀ + γG^π(X₁).  (1)

Here we use the symbol (D)= to emphasize that we are comparing two probability distributions. This is the Bellman-like distributional equation in the conventional DistRL.
We can also have a similar equation that relates Ḡ^π (the distribution of the r.v. G^π) and R̄(x) = R^π(·|x) (the distribution of the r.v. R^π(x)) [Rowland et al., 2018]. To define it, we recall the definition of the pushforward measure: Given a probability distribution ν ∈ M(R) and a measurable function f : R → R, the pushforward measure f#ν ∈ M(R) is defined as (f#ν)(A) = ν(f⁻¹(A)) for all Borel sets A ⊂ R.
The Bellman operator T̄^π : M(X) → M(X) between distributions is defined as

(T̄^π Ḡ)(x) ≜ ∫ (r + γy)# Ḡ(y) R^π(dr|x) P^π(dy|x),  ∀x ∈ X.

With this notation, the distributional Bellman equation is

Ḡ^π(x) = (T̄^π Ḡ^π)(x),  ∀x ∈ X.  (2)

The distributional Bellman equation represents the intrinsic uncertainty of the return due to the randomness of the dynamics and policy. We may occasionally use V̄^π to refer to Ḡ^π, to show its close relation to the conventional value function.

3 Characteristic value function

The conventional approach to representing the uncertainty of a r.v. is through its probability distribution function. This is not the only way to characterize a r.v., though. An alternative is to characterize the r.v. through the Fourier transform of its distribution function.
This is known as the Characteristic Function (CF) of the random variable [Williams, 1991].
In this section we show that instead of representing the distribution function of the return G^π, we may represent its characteristic function. Interestingly, the CF of the return satisfies a Bellman-like equation, which is quite different from the conventional ones (1) and (2) that we have encountered so far.
Let us briefly recall the definition of the CF of a random variable. Given a real-valued r.v. X with probability distribution μ ∈ M(R), its corresponding CF c_X : R → C is the function defined as³

c_X(ω) ≜ E[e^{jXω}] = ∫ exp(jxω) μ(dx),  ω ∈ R,  (3)

where j = √−1 is the imaginary unit.

³Here X is a generic r.v. and does not refer to the state. The particular r.v. will be clear from the context.

The CF of a probability distribution is closely related to the Fourier transform of its distribution function. If the probability density function is well-defined, the CF is its Fourier transform, though the CF exists even if the density does not. Several properties of the CF are summarized in an appendix of the extended version of the paper. Thinking in terms of the spatial-frequency duality common in Fourier analysis, the probability distribution function is the spatial representation of a r.v. (with the magnitude of the r.v. corresponding to the space dimension), and the CF is its frequency representation.
Consider the recursive relation G^π(x) = R^π(x) + γG^π(X′), with X′ ∼ P^π(·|x), between the return G^π(x) (a r.v.), the random reward R^π(x), and the return at the next step G^π(X′). By the distributional equality of both sides (cf.
(1)), we have

c_{G^π(x)}(ω) = E[exp(jωG^π(x))] = E[exp(jω(R^π(x) + γG^π(X′)))],  ∀ω ∈ R.  (4)

The right-hand side (RHS) of (4) is

E[exp(jω(R^π(x) + γG^π(X′)))] = E[E[exp(jω(R^π(x) + γG^π(X′))) | X = x, A]]
= E[E[exp(jωR^π(x)) | X = x, A] E[exp(jωγG^π(X′)) | X = x, A]]
= c_{R^π(x)}(ω) E[E[exp(jωγG^π(X′)) | X = x, A]]
= c_{R^π(x)}(ω) E[exp(jωγG^π(X′)) | X = x],  (5)

where A is a r.v. drawn from π(·|x). Here we benefitted from the fact that the r.v. R^π(x) and G^π(X′) are conditionally independent given X = x and A.
Let us consider the CF of G^π(X′) conditioned on X = x:

E[exp(jωG^π(X′)) | X = x] = E[E[exp(jωG^π(X′)) | X′] | X = x]
= ∫ P^π(dx′|x) E[exp(jωG^π(x′))]
= E[c_{G^π(X′)}(ω) | X = x],  (6)

where we conditioned the inner expectation on the next-state X′ (so its randomness comes from the return from that point onward), and used the definition of the CF.
Plugging (6) in (5) gives the RHS of (4).
So we get

c_{G^π(x)}(ω) = c_{R^π(x)}(ω) E[exp(jωγG^π(X′)) | X = x]
= c_{R^π(x)}(ω) E[c_{γG^π(X′)}(ω) | X = x]
= c_{R^π(x)}(ω) E[c_{G^π(X′)}(γω) | X = x] = c_{R^π(x)}(ω) ∫ P^π(dy|x) c_{G^π(y)}(γω),  (7)

where the penultimate equality is because of the scaling property of the CF (refer to the extended version of the paper for more information).
We denote the CF of the reward c_{R^π(x)}(ω) by R̃(ω; x), and the CF of the return c_{G^π(x)}(ω) by Ṽ^π(ω; x), for all x ∈ X and ω ∈ R. Here the symbol ˜· is used to remind us that we are referring to the CF of a random variable. With these notations, we can write (7) in the more compact form of

Ṽ^π(ω; x) = R̃(ω; x) ∫ P^π(dy|x) Ṽ^π(γω; y).  (8)

This is the Bellman-like equation between the CF of the return and the reward. The function Ṽ^π : R × X → C₁ (where C₁ is the closed unit disc in the complex plane, i.e., C₁ = { z ∈ C : |z| ≤ 1 }) is the CF of G^π(x) for all x ∈ X. We call Ṽ^π the Characteristic Value Function (CVF).
We also define the Bellman operator between CFs:

(T̃^π Ṽ)(ω; x) ≜ R̃(ω; x) ∫ P^π(dy|x) Ṽ(γω; y).

With this notation, the Bellman equation can be written more compactly as

Ṽ^π = T̃^π Ṽ^π.

It is worth mentioning that for any fixed x ∈ X, ω ↦ Ṽ^π(ω; x) is a CF. A CF is a continuous function of ω and its magnitude is bounded by 1 (refer to the extended version of the paper).

3.1 The Bellman operator is a contraction

We show that the Bellman operator T̃^π is a contraction w.r.t.
certain metrics, to be specified. This allows us to devise a value iteration-like procedure that converges to the CVF Ṽ^π of a policy π.
We first define some distance metrics between CFs. Given two CFs c₁, c₂ : R → C, and p ≥ 1, we define

d_{∞,p}(c₁, c₂) ≜ sup_{ω∈R} | (c₁(ω) − c₂(ω)) / ω^p |,
d_{1,p}(c₁, c₂) ≜ ∫ | (c₁(ω) − c₂(ω)) / ω^p | dω.  (9)

Here we use the convention that 0/0 = 0.⁴
We also define similar metrics for functions such as R̃ and Ṽ^π. Given Ṽ₁, Ṽ₂ : R × X → C, we define

d_{∞,p}(Ṽ₁, Ṽ₂) ≜ sup_{x∈X} sup_{ω∈R} | (Ṽ₁(ω; x) − Ṽ₂(ω; x)) / ω^p |,
d_{1,p}(Ṽ₁, Ṽ₂) ≜ sup_{x∈X} ∫ | (Ṽ₁(ω; x) − Ṽ₂(ω; x)) / ω^p | dω.  (10)

These are similar to the distances for comparing two CFs, with the difference that we take the supremum over all states x ∈ X. To be more precise about how the distances are calculated (e.g., sup over X, etc.), we could use d_{X(∞),ω(∞,p)}(Ṽ₁, Ṽ₂) instead of d_{∞,p}(Ṽ₁, Ṽ₂).
To simplify the notation, however, we use the overloaded symbols d_{∞,p} and d_{1,p} instead.
Based on these distances, we define the following norms for a function Ṽ : R × X → C:

‖Ṽ‖_{∞,p} = d_{∞,p}(Ṽ, 0),  ‖Ṽ‖_{1,p} = d_{1,p}(Ṽ, 0),

where 0 is the constant function (ω; x) ↦ 0. We sometimes refer to the supremum w.r.t. x ∈ X of Ṽ by ‖Ṽ(ω; ·)‖_∞ = sup_{x∈X} |Ṽ(ω; x)|. This should not be confused with ‖Ṽ‖_{∞,p}, whose supremum is over both ω and x, and in which the ω variable is weighted by ω^{−p}.
Several properties of d_{∞,p} and d_{1,p} are presented in an appendix of the extended version of the paper. Briefly, we show that d_{1,p} and d_{∞,p} are metrics. We also show that the space of CVFs V = { Ṽ : R × X → C₁ : Ṽ(0; x) = 1 }, which is a superset of the space of all feasible CVFs, endowed with d_{∞,p} is complete.
The following result shows that the Bellman operator for the CVF is a contraction operator w.r.t. d_{1,p} and d_{∞,p}. This is the main result of this section.
Lemma 1. Let 0 < γ < 1. The operator T̃^π is a γ^p-contraction in d_{∞,p} (for p > 0) and a γ^{p−1}-contraction in d_{1,p} (for p > 1).
That is, for any Ṽ₁, Ṽ₂ : R × X → C with d_{∞,p}(Ṽ₁, Ṽ₂) < ∞ or d_{1,p}(Ṽ₁, Ṽ₂) < ∞, we have

d_{∞,p}(T̃^π Ṽ₁, T̃^π Ṽ₂) ≤ γ^p d_{∞,p}(Ṽ₁, Ṽ₂),
d_{1,p}(T̃^π Ṽ₁, T̃^π Ṽ₂) ≤ γ^{p−1} d_{1,p}(Ṽ₁, Ṽ₂).

For the contraction to be non-trivial, and to avoid a trivial inequality such as ∞ ≤ γ^p ∞, we require the boundedness of d_{∞,p}(Ṽ₁, Ṽ₂) or d_{1,p}(Ṽ₁, Ṽ₂). This is a condition that should be verified, and as we shall soon see, it holds under certain conditions.
We briefly remark that the Bellman operator T̃^π is not a contraction w.r.t. the supremum norm ‖Ṽ‖_∞ = sup_{x∈X} sup_{ω∈R} |Ṽ(ω; x)|. This is shown in the extended version of the paper.
The importance of showing that the Bellman operator for the CVF is a contraction is that we can then apply the Banach fixed point theorem (e.g., Theorem 3.2 of Hunter and Nachtergaele [2001]) to show the uniqueness of the fixed point Ṽ^π (we also require the completeness of the space, which is shown for d_{∞,p}). Moreover, it suggests that we can find the fixed point by iterative application of the operator. This is the path we pursue in the next section.

⁴The metric d_{∞,p} has been studied under the name of Fourier-based metric by Carrillo and Toscani [2007], and is called the Toscani distance by Villani [2008].

4 Characteristic value iteration

The contraction property of the Bellman operator T̃^π (Lemma 1) suggests that we can find Ṽ^π by an iterative procedure, similar to the conventional value iteration.
The procedure is

Ṽ₁ ← R̃,
Ṽ_{k+1} ← T̃^π Ṽ_k = R̃ P^π Ṽ_k.  (k ≥ 1)  (11)

We call this procedure Characteristic Value Iteration (CVI).
CVI converges under certain conditions. To see this, notice that Ṽ^π = T̃^π Ṽ^π, so for p ≥ 1, by Lemma 1 we have

d_{∞,p}(T̃^π Ṽ_k, Ṽ^π) = d_{∞,p}(T̃^π Ṽ_k, T̃^π Ṽ^π) ≤ γ^p d_{∞,p}(Ṽ_k, Ṽ^π),

under the condition that d_{∞,p}(Ṽ_k, Ṽ^π) < ∞. Similarly, we have d_{1,p}(T̃^π Ṽ_k, Ṽ^π) ≤ γ^{p−1} d_{1,p}(Ṽ_k, Ṽ^π) (for p > 1). By iterative application of this upper bound, assuming that d_{∞,p}(R̃, Ṽ^π) < ∞, we get that

d_{∞,p}(Ṽ_{k+1}, Ṽ^π) ≤ γ^p d_{∞,p}(Ṽ_k, Ṽ^π) ≤ ··· ≤ (γ^p)^k d_{∞,p}(Ṽ₁, Ṽ^π) = (γ^p)^k d_{∞,p}(R̃, Ṽ^π).  (12)

Likewise, assuming that d_{1,p}(R̃, Ṽ^π) < ∞, we obtain

d_{1,p}(Ṽ_{k+1}, Ṽ^π) ≤ (γ^{p−1})^k d_{1,p}(R̃, Ṽ^π).  (13)

As long as d_{∞,p}(R̃, Ṽ^π) (or d_{1,p}(R̃, Ṽ^π)) is finite for some p ≥ 1 (p > 1), CVI converges geometrically fast. A result in an appendix of the extended version of the paper specifies the condition under which the d_{∞,p} distance of two CFs is finite. For p = 1, it is sufficient that the immediate reward R^π(x) ∼ R(·; x) and the return G^π(x) be integrable, i.e., E[|R^π(x)|], E[|G^π(x)|] < ∞ for all states x ∈ X. Since we deal with a discounted MDP, the integrability of R^π(x) (uniformly over X) entails the integrability of G^π(x).
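As a concrete illustration, the CVI recursion (11) can be run numerically on a small finite MDP once a frequency grid is fixed. The sketch below uses entirely hypothetical numbers (a two-state MDP under a fixed policy, with Gaussian reward distributions so that R̃ has a closed form); it evaluates Ṽ_k(γω; ·) by linear interpolation on the grid, and checks that the computed CVF equals 1 at ω = 0 and that the slope of its imaginary part at ω = 0 recovers the conventional value function V^π = (I − γP^π)^{−1} μ:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy pi (all numbers illustrative).
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])               # P^pi(y | x)
mu = np.array([0.2, -0.1])               # reward means
sigma = np.array([0.5, 0.2])             # reward standard deviations

# Frequency grid; R~(w; x) is the CF of a Gaussian N(mu_x, sigma_x^2).
omegas = np.linspace(-4.0, 4.0, 8001)
R_tilde = np.exp(1j * np.outer(mu, omegas)
                 - 0.5 * np.outer(sigma ** 2, omegas ** 2))

def bellman_cvf(V):
    """One application of (T~^pi V~)(w; x) = R~(w; x) * sum_y P(y|x) V~(gamma*w; y)."""
    # V~(gamma*w; y) is evaluated by linear interpolation on the grid.
    V_scaled = np.array([np.interp(gamma * omegas, omegas, V[y].real)
                         + 1j * np.interp(gamma * omegas, omegas, V[y].imag)
                         for y in range(V.shape[0])])
    return R_tilde * (P @ V_scaled)

# CVI: V~_1 <- R~, V~_{k+1} <- T~^pi V~_k.
V = R_tilde.copy()
for _ in range(100):
    V = bellman_cvf(V)

# A valid CVF satisfies V~(0; x) = 1, and the slope of Im V~ at w = 0 is the
# expected return, which should match V^pi = (I - gamma P)^{-1} mu.
i0 = len(omegas) // 2                    # index of w = 0
h = omegas[1] - omegas[0]
mean_return = (V[:, i0 + 1] - V[:, i0 - 1]).imag / (2 * h)
V_pi = np.linalg.solve(np.eye(2) - gamma * P, mu)
```

Linear interpolation on the grid stands in for an exact evaluation of Ṽ_k(γω; ·); a finer grid reduces this numerical error.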
Therefore, under very mild conditions, CVI is convergent w.r.t. d_{∞,1}.
For integer-valued p ≥ 2, the condition becomes more restrictive. The first requirement is that E[|R^π(x)|^p] and E[|G^π(x)|^p] are finite. This is not restrictive, and holds for many problems. The restrictive condition is that the first k = 1, . . . , p − 1 moments of the reward and the return should match, i.e., E[R^π(x)^k] = E[G^π(x)^k] for all x ∈ X. This does not seem realistic, except perhaps for p = 2, where one can imagine problems with zero expected immediate reward for all states but with varying variance.
One can show that the fixed point of T̃^π is unique. The result is formally stated in the extended version of the paper.

4.1 Approximate characteristic value iteration

Performing CVI (11) exactly may not be practical, for at least two reasons. First, for problems with a large state space, we cannot represent Ṽ^π exactly and we need to rely on function approximation. Second, in the learning scenario, where we do not have access to the model P^π but only observe data from interacting with the environment, we cannot apply the Bellman operator T̃^π exactly either.
We can extend CVI to Approximate CVI (ACVI), similar to how exact VI can be extended to Approximate Value Iteration (AVI), also known as Fitted Value Iteration or Fitted Q-Iteration. Various variants of AVI have been empirically and theoretically studied in the literature [Ernst et al., 2005, Munos and Szepesvári, 2008, Farahmand et al., 2009, Silver et al., 2016, Tosatto et al., 2017, Chen and Jiang, 2019]. We would like to build the same general framework for the CVF and CVI.
Suppose that for whatever reason we perform each iteration of CVI only approximately, that is, Ṽ_{k+1} ≈ T̃^π Ṽ_k.
The resulting procedure can be described as

Ṽ₁ ← R̃ + ε̃₁,
Ṽ_{k+1} ← T̃^π Ṽ_k + ε̃_{k+1}.  (k ≥ 1)  (14)

Here ε̃_k : R × X → C is the error in the frequency-state space. Recall that the value of a valid CF at frequency ω = 0 is equal to one, i.e., c(0) = 1. To ensure that Ṽ_k(·; x) is a CF for all x ∈ X, we must have Ṽ_k(0; x) = 1. This is satisfied if we require that ε̃_k(0; x) = 0 for all k = 1, 2, . . . and x ∈ X. We can interpret this requirement by noticing that the condition c(0) = 1 is simply the requirement that c(0) = E[e^{jX·0}] = E[1] = ∫ μ(dx) be equal to 1. So we are essentially requiring that we do not lose or add probability mass at each iteration of ACVI.
Performing ACVI can be quite similar to the conventional AVI. Suppose that we are given a dataset D_n = {(X_i, R_i, X′_i)}ⁿ_{i=1}, with X_i ∼ μ, X′_i ∼ P^π(·|X_i), and R_i ∼ R^π(·|X_i). Given this dataset and a CVF Ṽ, we define the empirical Bellman operator as the following mapping:

(T̂^π Ṽ)(ω; X_i) ≜ e^{jωR_i} Ṽ(γω; X′_i),  ∀ω ∈ R, ∀i = 1, . . . , n.

For any fixed function Ṽ and at any fixed state X_i, with a r.v. A_i ∼ π(·|X_i), we have

E[(T̂^π Ṽ)(ω; X) | X = X_i] = E[e^{jωR_i} Ṽ(γω; X′_i) | X = X_i]
= R̃(ω; X_i) ∫ P^π(dy|X_i) Ṽ(γω; y) = (T̃^π Ṽ)(ω; X_i).

This shows that the random process (T̂^π Ṽ)(ω; X_i) is an unbiased estimate of (T̃^π Ṽ)(ω; X_i).
In other words, (T̃^π Ṽ)(ω; X_i) is the conditional mean of (T̂^π Ṽ)(ω; X_i). Finding the conditional mean of a r.v. is the regression problem (i.e., estimating m(x) = E[Y | X = x] by m̂(x) using a dataset {(X_i, Y_i)}ⁿ_{i=1}), which has been extensively studied in the statistics and machine learning literature [Györfi et al., 2002, Wasserman, 2007, Hastie et al., 2009, Goodfellow et al., 2016]. A powerful estimator that generalizes well across states and ω allows us to approximately perform one step of ACVI.
One approach to finding a regression estimator is to solve an empirical risk minimization problem:

Ṽ_{k+1} ← argmin_{Ṽ∈F} (1/n) Σⁿ_{i=1} ∫ | Ṽ(ω; X_i) − e^{jωR_i} Ṽ_k(γω; X′_i) |² w(ω) dω,  (15)

where F ⊂ V is a space of functions from R × X to C₁, which can be represented by various types of function approximators (including decision trees, kernel-based ones, and neural networks), and w : R → R is a weighting function that indicates the importance of different frequencies ω. This is similar to the usual Fitted Value Iteration procedure [Ernst et al., 2005, Munos and Szepesvári, 2008, Farahmand et al., 2009, Silver et al., 2016, Tosatto et al., 2017, Chen and Jiang, 2019], which solves

V_{k+1} ← argmin_{V∈F} (1/n) Σⁿ_{i=1} | V(X_i) − (R_i + γV_k(X′_i)) |²,  (16)

with an appropriately chosen function space F (and similarly for Fitted Q-Iteration and the action-value function Q). One clear difference between (15) and (16) is that we have an integral over the frequency domain in the former. This one-dimensional integral can be numerically integrated, for example, by discretizing the low-frequency domain [−b, +b] (with b > 0) with resolution ε_int.
This incurs some controlled numerical error that is a function of ε_int. For some function approximators, such as a decision tree, one might be able to calculate the integral more efficiently by benefitting from the constancy of values within a leaf.
The quality of approximating T̃^π Ṽ_k by Ṽ_{k+1} determines the error ε̃_k. The error depends on the regression method being used, as well as the number of data points available, the capacity and expressiveness of the function space F, etc. We do not analyze this regression problem in this paper. We are nevertheless interested in knowing whether one can hope to have a small error with a reasonably selected F. Two relevant questions are whether one can approximate T̃^π Ṽ_k within F well enough (function approximation error), and whether F has enough regularity to allow a reasonable convergence rate for the estimation error. We study these questions in detail in the appendices of the extended version of the paper. We only briefly mention that if the reward distribution is smooth in a certain sense, a band-limited function class F_b = { Ṽ : R × X → C₁ : Ṽ(0; x) = 1, Ṽ(ω; x) = 0 ∀|ω| > b } provides an approximation error that goes to zero as the bandwidth b increases. More specifically, the d_{∞,1} distance-based norm of the approximation error behaves like O(b^{−1/(1+β)}), with β being the smoothness parameter. Furthermore, if the first s absolute moments of the reward distribution are finite, the CVF Ṽ(·; x) belongs to the smoothness class C^s([−b, b]) ∩ F_b. This leads to a well-behaved covering number, which can be used to obtain a convergence rate for the estimation error.
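The unbiasedness property underlying this regression view can be checked numerically by averaging samples of the empirical Bellman operator and comparing against the exact operator. The following is a minimal sketch in which the Gaussian reward, the uniform two-state transition, and the plug-in function Ṽ are all hypothetical choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical model at a single state x: reward ~ N(1, 0.5^2), and the
# next state X' is 0 or 1 with probability 1/2 each.
mu_r, sigma_r = 1.0, 0.5
next_probs = np.array([0.5, 0.5])

def V_tilde(w, y):
    # Some fixed CVF-like function to plug into the operator: the CF of N(y, 1).
    return np.exp(1j * y * w - 0.5 * w ** 2)

w = 1.3  # a single test frequency

# Exact operator: (T~ V~)(w; x) = R~(w; x) * sum_y P(y|x) V~(gamma*w; y),
# with R~ the CF of the Gaussian reward.
R_cf = np.exp(1j * mu_r * w - 0.5 * sigma_r ** 2 * w ** 2)
exact = R_cf * sum(p * V_tilde(gamma * w, y) for y, p in enumerate(next_probs))

# Empirical operator on sampled transitions: (T^ V~)(w; X_i) = e^{j w R_i} V~(gamma*w; X'_i).
n = 200_000
R_i = rng.normal(mu_r, sigma_r, size=n)
Xp_i = rng.integers(0, 2, size=n)
estimates = np.exp(1j * w * R_i) * V_tilde(gamma * w, Xp_i)
mc = estimates.mean()  # Monte Carlo average should approach the exact operator
```

Since each sample has magnitude at most one, the Monte Carlo average concentrates quickly around the exact value, mirroring the conditional-mean argument above.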
A side benefit of working with a band-limited function space is that the integral in (15) becomes a definite integral, which is easier to evaluate numerically.

Next we analyze how these errors, however generated, affect the quality of the outcome $\tilde{V}_K$ after performing $K$ steps of ACVI.

5 Error propagation analysis

We analyze how the errors in the ACVI procedure (14) propagate throughout the iterations and affect the quality of the outcome CVF $\tilde{V}_K$, where $K$ is the number of times the iteration is performed. We skip all the intermediate steps required to prove the main result of this section. They can be found in the same section of the extended version of the paper.

Theorem 2. Consider the ACVI procedure (14) after $K \ge 1$ iterations. Assume that $\tilde{\varepsilon}_k(0; x) = 0$ for all $x \in \mathcal{X}$ and $k = 1, \dots, K+1$. We have

$$d_{\infty,p}(\tilde{V}_{K+1}, \tilde{V}^\pi) \le \sum_{i=0}^{K} (\gamma^p)^i \left\| \tilde{\varepsilon}_{K+1-i} \right\|_{\infty,p} + (\gamma^p)^K d_{\infty,p}(\tilde{R}, \tilde{V}^\pi), \qquad (p \ge 1)$$

$$d_{1,p}(\tilde{V}_{K+1}, \tilde{V}^\pi) \le \sum_{i=0}^{K} (\gamma^{p-1})^i \left\| \tilde{\varepsilon}_{K+1-i} \right\|_{1,p} + (\gamma^{p-1})^K d_{1,p}(\tilde{R}, \tilde{V}^\pi). \qquad (p > 1)$$

This result shows how the errors $\tilde{\varepsilon}_k$ in the ACVI procedure propagate throughout the iterations and affect the quality of the approximation of $\tilde{V}^\pi$ by $\tilde{V}_{K+1}$. The error is measured according to the distances $d_{1,p}$ and $d_{\infty,p}$. The upper bounds show that errors in the earlier iterations decay geometrically. This entails that if the resources are limited, it is better to ensure the smallness of errors in the later iterations.
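The geometric decay of early errors in Theorem 2 is easy to see numerically. The sketch below (our own illustration, not part of the paper) evaluates the first upper bound for two ways of spending the same error budget, confirming that an error committed in the last iteration costs more than the same error committed in the first.

```python
def theorem2_bound(eps_norms, gamma, p, d_init):
    """First (d_{infty,p}) upper bound of Theorem 2.

    eps_norms[k-1] stands for ||eps_k||_{infty,p}, k = 1, ..., K+1;
    d_init stands for d_{infty,p}(R~, V~^pi)."""
    K = len(eps_norms) - 1
    rate = gamma ** p
    # sum_{i=0}^{K} (gamma^p)^i ||eps_{K+1-i}||  +  (gamma^p)^K d_init
    bound = sum(rate ** i * eps_norms[K - i] for i in range(K + 1))
    return bound + rate ** K * d_init

gamma, p, d_init = 0.9, 1, 1.0
early_error = [0.5, 0.0, 0.0, 0.0]   # all the error in the first iteration
late_error  = [0.0, 0.0, 0.0, 0.5]   # the same budget, spent in the last one
# An early error is discounted by (gamma^p)^K, so it hurts less:
assert theorem2_bound(early_error, gamma, p, d_init) \
       < theorem2_bound(late_error, gamma, p, d_init)
```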
This phenomenon is similar to what we have observed in the conventional value iteration [Farahmand et al., 2010].

As discussed in Section 4, the condition that $d_{\infty,p}(\tilde{R}, \tilde{V}^\pi)$ is finite might be very restrictive for $p > 2$, and even for $p = 2$, it might hold only in special problems. The finiteness of $d_{\infty,1}$, however, requires only mild conditions. For the finiteness of $d_{\infty,1}(\tilde{R}, \tilde{V}^\pi)$ in the upper bound, the finiteness of the first absolute moment of the reward function is sufficient, as discussed after (13). For the finiteness of the $\|\tilde{\varepsilon}_i\|_{\infty,1}$ terms, it is sufficient that $\tilde{\varepsilon}_i(0; x) = 0$ and that its first derivative w.r.t. $\omega$ is bounded for all states $x \in \mathcal{X}$, i.e., $|\tilde{\varepsilon}^{(1)}(\omega; x)| < \infty$. Based on these considerations, from now on we focus on $p = 1$.

6 From error in frequency domain to error in probability distributions

Theorem 2 in the previous section relates the errors at each iteration of ACVI to the quality of the obtained approximation of $\tilde{V}^\pi$. The error is measured according to the metrics $d_{1,p}$ and $d_{\infty,p}$, which are metrics in the frequency domain. What does having a small error in the frequency domain imply about the quality of approximating the distribution of returns $\bar{V}^\pi$?

From Lévy's continuity theorem, we know that the pointwise convergence of CFs implies the convergence in distribution of their corresponding distributions. This suggests that we could define the error in the frequency domain as

$$d_{\mathrm{unif}}(\tilde{V}, \tilde{V}^\pi) = \sup_{x \in \mathcal{X}} \sup_{\omega \in \mathbb{R}} \left| \tilde{V}(\omega; x) - \tilde{V}^\pi(\omega; x) \right|.$$

Nevertheless, we did not define the distance this way because the Bellman operator would not be a contraction w.r.t. it.
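As a quick numerical illustration of this frequency-domain perspective (our own sketch; the empirical CF estimator below is standard, but not part of the paper), the characteristic function of a distribution can be estimated from samples, and its uniform deviation from the true CF on a frequency grid, the quantity a $d_{\mathrm{unif}}$-style error measures, shrinks at the usual Monte Carlo rate:

```python
import numpy as np

def empirical_cf(samples, omegas):
    # Empirical characteristic function: (1/n) * sum_i exp(j * omega * Z_i),
    # evaluated on a grid of frequencies omegas.
    return np.exp(1j * np.outer(omegas, samples)).mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.standard_normal(20_000)    # Z_i ~ N(0, 1)
omegas = np.linspace(-3.0, 3.0, 61)

# The CF of N(0, 1) is exp(-omega^2 / 2); the sup-over-grid deviation plays
# the role of d_unif restricted to [-3, 3] and is small for large n.
err = np.max(np.abs(empirical_cf(samples, omegas) - np.exp(-omegas**2 / 2)))
assert err < 0.05
```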
So a valid question is whether, or in what sense, the smallness of $d_{\infty,p}(\tilde{V}, \tilde{V}^\pi)$ implies anything about the closeness of the corresponding probability distribution functions $\bar{V}$ and $\bar{V}^\pi$. In this section we show that such a relation indeed exists. We relate $d_{\infty,p}$ and $d_{1,p}$ to the $p$-smooth Wasserstein distance of the probability distribution functions [Arras et al., 2017].

Definition 1. Let $p \ge 1$, let $C^p(\Omega)$ be the space of $p$-times continuously differentiable functions on a domain $\Omega$, and let $\mathcal{F}_p(\Omega) = \{ f \in C^p(\Omega) : \|f^{(k)}\|_\infty \le 1,\ 0 \le k \le p \}$. For two probability distributions $\mu_1, \mu_2 \in \mathcal{M}(\Omega)$, the $p$-smooth Wasserstein distance is defined as

$$W_{C^p}(\mu_1, \mu_2) = \sup_{f \in \mathcal{F}_p(\Omega)} \left| \int f(x) \left( \mathrm{d}\mu_1(x) - \mathrm{d}\mu_2(x) \right) \right|.$$

Remark 1. Note that the conventional 1-Wasserstein distance is defined as

$$W_1(\mu_1, \mu_2) = \sup_{f \in \mathrm{Lip}_1(\Omega)} \left| \int f(x) \left( \mathrm{d}\mu_1(x) - \mathrm{d}\mu_2(x) \right) \right|,$$

where $\mathrm{Lip}_1$ is the space of 1-Lipschitz functions. As $\|f^{(1)}\|_\infty \le 1$ implies 1-Lipschitzness, but not necessarily vice versa, $W_{C^1}(\mu_1, \mu_2) \le W_1(\mu_1, \mu_2)$.

Let us also define the $p$-smooth Wasserstein distance between $\bar{V}_1$ and $\bar{V}_2$ as follows:

$$W_{C^p}(\bar{V}_1, \bar{V}_2) \triangleq \sup_{x \in \mathcal{X}} W_{C^p}(\bar{V}_1(\cdot; x), \bar{V}_2(\cdot; x)).$$

This is the maximum over states $x \in \mathcal{X}$ of the $p$-smooth Wasserstein distance between the distributions of return $\bar{V}_1(\cdot; x)$ and $\bar{V}_2(\cdot; x)$.

Theorem 3. Consider the ACVI procedure (14) after $K \ge 1$ iterations.
Assume that $\tilde{\varepsilon}_k(0; x) = 0$ for all $x \in \mathcal{X}$ and $k = 1, \dots, K+1$. Furthermore, assume that the immediate reward distribution $\mathcal{R}^\pi(\cdot|x)$ is $R_{\max}$-bounded. We then have

$$W_{C^2}(\bar{V}_{K+1}, \bar{V}^\pi) \le \frac{2\sqrt{2}}{\sqrt{\pi}} \sqrt{\frac{R_{\max}}{1-\gamma}} \left[ \sum_{i=0}^{K} \gamma^i \left\| \tilde{\varepsilon}_{K+1-i} \right\|_{\infty,1} + \frac{2\gamma^K}{1-\gamma} R_{\max} \right].$$

This upper bound can be simplified if we are willing to use an upper bound on $\|\tilde{\varepsilon}_{K+1-i}\|_{\infty,1}$ that is uniform over iterations. In that case, we have

$$W_{C^2}(\bar{V}_{K+1}, \bar{V}^\pi) \le \frac{2\sqrt{2 R_{\max}}}{\sqrt{\pi}\,(1-\gamma)^{3/2}} \left[ \max_{i=1,\dots,K+1} \left\| \tilde{\varepsilon}_i \right\|_{\infty,1} + 2\gamma^K R_{\max} \right].$$

We note that the 2-smooth Wasserstein distance $W_{C^2}$, which is an integral probability metric [Müller, 1997], is only one of many distances between probability distributions [Gibbs and Su, 2002]. The choice of the right probability distance most likely depends on the performance measure we would like the policy to optimize. Studying this further is an interesting topic for future research.

7 Conclusion

This paper laid the groundwork for a new class of distributional RL algorithms. We have shown that one can represent the uncertainty about the return in the frequency domain, and that such a representation (called the Characteristic Value Function) enjoys properties such as satisfying a Bellman equation and having a contractive Bellman operator. This in turn allows us to compute the CVF by an iterative method called Characteristic Value Iteration. We also showed the effect of errors in the iterative procedure and provided error propagation results, both in the frequency domain and in the probability distribution space.

This paper is only the first step towards understanding CVFs and their properties.
Among the remaining questions is how to perform the regression step (15) of ACVI properly and efficiently. Specifically, how should we set the weighting function $w(\omega)$ in order to achieve an accurate CVF in the frequencies that are relevant for the tasks we want to solve? Studying other distances between CFs and their properties is another interesting research direction. This work focused only on the policy evaluation problem, so another obvious direction is designing risk-aware policy optimization algorithms based on the CVF. Finally, empirically evaluating this approach to representing return uncertainty may lead to a better understanding of its strengths and weaknesses.

Acknowledgments

I would like to thank the anonymous reviewers for their helpful feedback, particularly Reviewer #4. I acknowledge the funding from the Canada CIFAR AI Chairs program.

References

Benjamin Arras, Guillaume Mijoule, Guillaume Poly, and Yvik Swan. A new approach to the Stein-Tikhomirov method: with applications to the second Wiener chaos and Dickman convergence. arXiv:1605.06819v2, 2017.

Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributional policy gradients. In International Conference on Learning Representations (ICLR), 2018.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Dimitri P. Bertsekas. Abstract Dynamic Programming. Athena Scientific, Belmont, 2013.

Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Academic Press, 1978.

José Antonio Carrillo and Giuseppe Toscani. Contractive probability metrics and asymptotic behavior of dissipative kinetic equations (notes of the Porto Ercole school, June 2006). Riv. Mat.
Univ. Parma, 7(6):75–198, 2007.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research (JMLR), 18(167):1–51, 2018.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 201–208. ACM, 2005.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference (ACC), pages 725–730, June 2009.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2010.

Andrey Feuerverger. An efficiency result for the empirical characteristic function in stationary time-series models. Canadian Journal of Statistics, 18(2):155–161, 1990.

Andrey Feuerverger and Philip McDunnough. On some Fourier methods for inference.
Journal of the American Statistical Association, 76(374):379–387, 1981.

Andrey Feuerverger and Roman A. Mureika. The empirical characteristic function and its applications. Annals of Statistics, 5(1):88–97, 1977.

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR), 16:1437–1480, 2015.

Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Verlag, New York, 2002.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). Springer, 2009.

John K. Hunter and Bruno Nachtergaele. Applied Analysis. World Scientific Publishing Company, 2001.

John L. Knight and Jun Yu. Empirical characteristic function in time series estimation. Econometric Theory, 18(3):691–721, 2002.

Clare Lyle, Pablo Samuel Castro, and Marc G. Bellemare. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010a.
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), pages 368–375, 2010b.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research (JMLR), 9:815–857, 2008.

L.A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems (NIPS), 2013.

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2019.

Csaba Szepesvári. Algorithms for Reinforcement Learning.
Morgan & Claypool Publishers, 2010.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Samuele Tosatto, Matteo Pirotta, Carlo D'Eramo, and Marcello Restelli. Boosted fitted Q-iteration. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Larry Wasserman. All of Nonparametric Statistics. Springer, 2007.

David Williams. Probability with Martingales. Cambridge University Press, 1991.

Jun Yu. Empirical characteristic function estimation and its applications. Econometric Reviews, 23(2):93–123, 2004.