Reviews: Value Function in Frequency Domain and the Characteristic Value Iteration Algorithm

Thanks to the authors for their response! I maintain that a lot of good work was put in, and primarily recommend including some mention/intuition of scenarios where the proposed method would be preferred over an existing approach (like the examples given in the response). If possible, I think deriving an algorithm based on this foundation and empirically evaluating it (even in a toy domain), would greatly strengthen the paper! ----- The paper had minor grammatical errors. It's clear that a lot of work was put in, and many key theoretical results were worked out for their proposed setting. However, I think the paper strongly lacks in motivation. They suggest that instead of representing a probability distribution over returns, one can instead represent the Fourier transform of it. Can the authors provide any intuition as to why this might be a preferred approach? Are there any known drawbacks of existing distributional RL approaches that this approach can better handle? They show that the characteristic value function in the frequency domain satisfies a Bellman-like equation, which might not be surprising given the linearity of the Fourier transform. Interestingly, it results in a multiplicative form instead of an additive form. Are there any insights as to whether or not this operator might be better behaved compared to the additive form of standard distributional RL? They then characterized the error one might expect from approximately applying value iteration in the frequency domain, and how its implications on errors in the underlying probability distribution. I think this was a good direction as it attempts to tie things back to the standard approach of estimating a probability distribution, but are there any insights on how this might compare to errors in trying to approximate the probability distribution directly? An interesting parallel is that prior work has been done exploring the Fourier transform of returns in the expected, non-distributional setting (De Asis, 2018). While not about the return distribution, perhaps it could give context or motivate why viewing reinforcement learning from the frequency domain might be interesting.

POST-REBUTTAL I thank the authors for their detailed response, and in particular the discussion with regards to usefulness of the bounds in the main paper, and possible function approximation classes. I found the discussion around band-limited approximations to the return distribution interesting, and have updated my score from 5 to 6. I'd still argue that further discussion of practical function approximation classes (such as mixtures of Dirac deltas), and/or investigation of a practical algorithm derived from these approximations would make the paper stronger, but am happy to recommend that this paper be accepted, with the additional discussion given by the authors in the rebuttal. --------------------------------------- HIGH-LEVEL COMMENTS This paper proposes a perspective on distributional RL based on characteristic functions. Contractivity of a corresponding Bellman operator is shown under the so-called Toscani metrics, and analysis is undertaken with regards to approximate policy evaluation. The introductory material is well-written, and clearly motivates the authors’ work. Keeping track of return distribution estimates via characteristic functions is an interesting idea, which to my knowledge has not been explored in the distributional RL literature before. My concerns about the significance of the paper stem from the facts that (i) there is no practical algorithm proposed or investigated in the paper. I don’t think this in itself should preclude acceptance, but (ii) there is also little discussion around the usefulness of the theoretical bounds presented in the paper -- it is not clear when the upper bounds presented may be finite, or whether the errors may be driven to zero as the expressivity of the function approximation class appearing in Eqn (14) is increased. Some discussion around why it may be of interest to approximate characteristic functions rather than approximate distributions directly, as in previous DistRL work, would also be interesting. There are some minor mathematical issues regarding conditional independence around Eqn (1) and several later equations, although I don’t believe these affect the overall correctness of the paper significantly. The metrics the authors propose in Section 3.1 allow for distances of +\infty between points. This means that results such as Banach’s fixed point theorem are less readily applicable, and so additional care should be taken around these discussions. As I understand it, it might be necessary to restrict to the case p=1 to recover a unique fixed point using Banach’s fixed point theorem. Eqn (14) seems central to the idea of approximate policy evaluation that the authors consider, but there is little discussion as to how this objective might be optimised in practice (how should the integral over frequencies be approximated?), or with regards to the magnitude of the errors that may arise -- it seems possible that the \infty,p or 1,p norm of the errors might be infinite, which would make the upper bounds appearing in the paper less interesting -- can the authors comment on whether this might be the case, and in what scenarios the error terms might be guaranteed to be finite in norm? I found that the formatting of the supplementary material as an extended version of the main paper made navigating the supplementary material quite difficult. I would ask the authors to consider reformatting the supplementary material to avoid repetition of the main paper where possible in future versions, and to make sure theorem numberings etc. are consistent with the main paper. I have checked the proofs of the results in the main paper. Whilst there are some minor issues, as noted in more detail below, I believe that they are not fundamentally flawed and can be fixed relatively straightforwardly by the authors. DETAILED COMMENTS Line 68 & footnote: if mentioning sigma-algebras in the footnote, they should perhaps be specified in the main paper too. For example, no sigma-algebra is declared for either \mathcal{X} or \mathcal{R}, yet the notation \mathcal{M}(\mathcal{X}) is used. It might be worth mentioning somewhere that you assume the usual Borel sigma algebra for \mathbb{R}, and assume that \mathcal{X} comes equipped with some sigma-algebra. Line 77: it may be slightly more accurate to include a conditioning on X_0 = x on the right-hand side of this equation. Line 82: perhaps remark that G is the sum of two *independent* random variables. Eqn (1): I think the notation here is slightly incorrect. Equality in distribution should apply to random variables, not probability distributions. Similarly, adding two probability distributions does not yield another probability distribution (the distribution on the right-hand side of Eqn (1) would have total mass 1 + \gamma). It would be more correct to say that the random variables concerned are equal in distribution, and that their distributions satisfy a distributional equation (involving e.g. push-forwards and convolutions of measures). This equation is also missing a conditioning on X_0 = x on the right-hand side. Eqn below Line 91: As with Eqn (1), there is an issue with conflating distributions and random variables. Line 109-111 are slightly unclear to me: do the authors mean “probability density function” rather than “probability distribution function”? What about distributions without a pdf? Line 112: why is the notation \mathcal{P}^\pi(\cdot|X=x) used -- isn’t \mathcal{P}^\pi(\cdot|x) more consistent with the definition at the beginning of Line 69? Line 115: I don’t think that R^\pi(x) and G^\p(X’) are conditionally independent given X = x, since they both depend on the action taken in state x. Can the authors comment? It seems that the derivations around Eqn (5) should include a conditioning on the action taken as well; with some care I think this should fix the issue. Line 152 (and later in the paper): I found the notation d_{1/\infty, p} confusing, thinking that 1/\infty was the first parameter of the distance, as opposed to representing two different distances. Lemma 1: in the proof of this result, as with Eqn (5), there should be additional conditioning on the action taken in the initial state to justify conditional independence of the immediate reward and the next state. Line 159: In this paper, the authors allow a metric to take on the value infinity. It may be worth making this clearer in the paper, because this definition means that the notion of contraction is weaker (since \infty <= \gamma \infty), and so results such as Banach’s fixed point theorem do not necessarily apply (for example, if (X,d) is a complete metric space allowing for the possibility of infinite distances between points, then a contraction mapping does not necessarily have a unique fixed point). Value Iteration usually refers specifically to algorithms that aim to find a (near) optimal policy, whereas the algorithm described in Section 4 aims to evaluate a fixed policy. The algorithm might be more accurately described as “characteristic policy evaluation”, or something similar? Line 204: I believe there are similar issues here as with Eqn (5) around lack of conditional independence of R_i and X’_i given X; they are only conditionally independent when conditioning on the action taken at X as well. Line 212: Can the authors comment on whether they would expect the terms (or gradients of the terms) in Eqn (14) to be computable? I wonder whether the integral over frequency is possible to evaluate analytically when \tilde{V} is parametrized according to e.g. a decision tree or a neural network, and if the weighting function w is taken to be uniform, how the integral would be approximated. Line 216: can the authors comment on whether a uniform weighting of frequencies is likely to lead to a finite integral in Eqn (14)? Theorem 2: when are these results likely to be non-vacuous? I.e. when will the \infty, p and/or 1,p norms and metrics be finite? Proof of Theorem 2: in Eqn (24), the final \tilde{\eps} seems to be missing an \omega argument. Line 284: why is this stated as an inequality rather than an equality? Theorem 3: when are these results likely to be non-vacuous? I.e. when will the \infty, p and/or 1,p norms and metrics be finite? Proof of Theorem 3 and related results in the appendix. Proof of Lemma 5: in the first line of Eqn (28), should e^{jwx} be replaced by (e^{jwX_1} - e^{jwX_2})? Line 334: is there a missing factor of 2\pi in the use of Parseval’s theorem here? It seems to appear after Line 337 again. Lines 338-344: there seems to be a problem with a missing factor of 2\pi again. The proof of Lemma 2 (supplementary) would be more neatly expressed as an inductive argument, avoiding the use of ellipses.

Paper ID:	8400
Title:	Value Function in Frequency Domain and the Characteristic Value Iteration Algorithm

Reviewer 1

Reviewer 2

Reviewer 3