NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4447
Title:Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning

Reviewer 1

The DiCE gradient estimator [1] allows the computation of higher-order derivatives in stochastic computation graphs. This may be useful in contexts such as multi-agent learning or meta-RL, where the proper application of methods such as MAML requires the computation of second-order derivatives. The current paper extends DiCE and derives a more general objective that allows integration of the advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) in order to control the variance while providing unbiased estimates. The advantage can be approximated using parametric function approximators and methods such as Generalized Advantage Estimation (GAE), which trade some bias for lower variance. Moreover, the authors propose to further control the variance of the higher-order gradients by discounting the impact of past actions on the current advantage, thus limiting the range of causal dependencies.

This paper is well executed: it is well written, technically sound and potentially impactful. The method is tested on a toy domain, which highlights that the new estimator is more efficient than baselines at estimating the true higher-order derivatives, and on a meta-RL task.

I suggest that the authors strengthen their experimental section and describe the setup in more detail:
- In Section 4.1, please add a few lines on how the ground-truth higher-order gradients are computed in the small MDP. I couldn't find that information in the supplementary material.
- In Figure 2, the convergence plots with respect to the batch size are nice. Are you using \lambda = 1?
- In Section 4.2, the authors test on Half-Cheetah. Citing footnote 2 in [3]: "half-cheetah is an example of a weak benchmark–it can be learned in just a few gradient step from a random prior. It can also be solved with a linear policy. Stop using half-cheetah." I wonder if the authors can find other, more challenging tasks that can show the empirical superiority of their method.
- In Figure 4, I think you are using Loaded-DiCE with the best \tau found for GAE. If that's the case, you should state it explicitly.
- In Section 4.2, it would be better to add a few more baselines. Maybe [2]?

Pros:
+ Addresses an important problem.
+ Well written and technically sound.
+ Empirically validated on a toy setting.

Cons:
- The range of realistic use-case applications is rather limited.

[1] [2]
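
For readers less familiar with the advantage estimation mentioned in this review, the following is a minimal sketch of Generalized Advantage Estimation for a single finite episode. It is not taken from the paper under review: the function name `gae_advantages` and the parameter names `gamma` and `trace_decay` are illustrative assumptions, and `trace_decay` is assumed to play the role of the GAE parameter that the review appears to call \tau. Setting it to 1 gives the discounted return minus the value baseline (no additional bias from the value estimates, higher variance), while setting it to 0 gives the one-step TD error (lower variance, more bias).

```python
def gae_advantages(rewards, values, gamma=0.99, trace_decay=0.95):
    """Sketch of Generalized Advantage Estimation for one episode.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T + 1 value estimates V(s_0) .. V(s_T); the last
             entry is the bootstrap value (use 0.0 for a terminal state)
    gamma:   discount factor
    trace_decay: hypothetical name for the GAE bias/variance parameter
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the accumulated future terms.
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * trace_decay) * A_{t+1}
        running = delta + gamma * trace_decay * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Two-step episode with unit rewards and a terminal bootstrap value of 0.
    print(gae_advantages([1.0, 1.0], [0.5, 0.25, 0.0], gamma=0.5, trace_decay=1.0))
```

With `trace_decay=1.0`, the first entry equals the discounted return from the first state minus V(s_0), which is the sample-based form of Q(s_t, a_t) - V(s_t) referred to above.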

Reviewer 2

This paper extends the DiCE objective in a way that allows automatic differentiation to produce low-variance unbiased estimators of derivatives of any order. This is done in a way that can incorporate any advantage estimator, and the approach is shown to be accurate and valuable in practice through two experiments. As someone who knows little about this field, I found the paper well written, with sufficient references to, and background on, the existing literature. What is known and what is novel are clearly pointed out, and the motivation for Loaded DiCE is well justified. The extension of DiCE appears efficient to compute, even though it is much more powerful. Figures 3 and 4 show substantial improvement over the previous state of the art. The work seems to be of high quality and is presented clearly enough to be read by non-RL researchers. Given my limited knowledge of the area, I'll defer a careful assessment of the significance and originality of the work to the other reviewers.

Reviewer 3

The paper sets out to derive an arbitrary-order score function estimator that can balance variance and bias. With this in mind, I think that the strengths and weaknesses of the paper are:
+ well written, clear presentation
+ clear derivation of the new algorithm
- the empirical analysis is somewhat weak

Other minor comments:
- the subfigures in Figure 2 are nearly unreadable, especially in b&w
- the authors might want to consider "conjugate MDPs" and how they relate to their work

UPDATE: While I still do not feel perfectly confident in my assessment, I find the authors' rebuttal satisfactory. I increase my score.