NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4447
Title:Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning

This paper presents a novel methodology that, in combination with automatic differentiation, yields unbiased and low-variance estimators of derivatives of any order. It appears to be potentially widely useful, and the exposition is clear. The reviewers and I are in general agreement in liking the paper.
Reviewer 1 wrote a thorough review touching on many aspects of the paper. His overall score was 7, and his bottom-line positive was: "This paper is well executed: it is well written, technically sound and potentially impactful." The main bottom-line negative was: "The range of realistic use-case applications is rather limited." Reviewer 2 also gave an overall score of 7 and agreed with many of R1's comments. For example: "I found the paper well-written, with sufficient references to and background of existing literature. What's known and what's novel is clearly pointed out, and the motivation for Loaded DiCE is well justified ... The work seems to be of high quality and is presented clearly enough to be read by non-RL researchers." However, due to limited knowledge of this topic, R2 deferred a careful assessment of the significance and originality of the work to the other reviewers; as a result, his confidence score was only 2. Reviewer 3 originally gave an overall score of 5 but, after reading the other reviews and the author feedback, increased it to 6. R3 also gave a low confidence score of 1 due to lack of expertise in this topic.
My take is that the paper makes a nice contribution, is correct and easily accessible, may prove to be widely useful, and provides opportunities to build further on this line of research. With overall scores of (7, 7, 6), the paper is in a range where it should have a decent chance of being accepted at NeurIPS.
However, due to the very low confidence scores of R2 and R3, the Senior Area Chair and I decided to seek out an additional last-minute reviewer who is quite familiar with the previous work that forms the basis of the current submission. Due to lack of time, this reviewer provided only the following brief comments: "I cannot say for the correctness of the paper, but I think that there is a useful contribution in the paper: that of using the Dice trick with control variables (GAE). Afaik, the dice paper only considers plain finite-horizon likelihood ratio estimator. The experiments show a plot of the standard deviation as a function of the order, and a maml-experiment with confidence bands around their curves. The experimental methodology looks good to me." With this additional input, the Senior Area Chair and I agree that there is now sufficient confidence to follow the unanimous Accept recommendations of the initial three reviewers.