NeurIPS 2020

Self-Imitation Learning via Generalized Lower Bound Q-learning


Review 1

Summary and Contributions: The paper proposes an extension to self-imitation learning, generalising return-based lower-bound Q-learning to a family of n-step lower-bound Q-learning algorithms. It is shown that lower-bound Q-learning trades off fixed-point bias against contraction rate compared to uncorrected n-step Q-learning. Empirical evidence shows improvements on RL benchmarks.

Strengths: The proposed method generalises return-based lower-bound Q-learning to the n-step setting, introducing a flexible family of new self-imitation learning algorithms. The change is intuitive, as the n-step version leverages the current Q-function and reduces variance. In addition, the paper characterises the trade-off between contraction rate and fixed-point bias for lower-bound Q-learning versus uncorrected n-step Q-learning.
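To illustrate the mechanism described above, here is a minimal sketch of how an n-step lower-bound target could be formed from a stored trajectory; the function name and the exact target form (the maximum of an uncorrected n-step return bootstrapped with the current Q-function and the current estimate) are illustrative assumptions of mine, not the paper's exact operator.

```python
def n_step_lower_bound_target(rewards, q_bootstrap, q_current, gamma=0.99):
    """Illustrative n-step lower-bound target; a sketch, not the paper's operator.

    rewards:      r_0, ..., r_{n-1} observed along a stored behaviour trajectory
    q_bootstrap:  current Q-estimate at the state-action pair reached after n steps
    q_current:    current Q-estimate at the starting pair (x, a)
    """
    n = len(rewards)
    # Uncorrected n-step return, bootstrapped with the current Q-function;
    # bootstrapping after n steps is what cuts variance relative to full returns.
    n_step_return = sum(gamma**t * r for t, r in enumerate(rewards)) + gamma**n * q_bootstrap
    # Lower-bound flavour: only regress the estimate upwards, towards
    # max(n-step return, current estimate), never downwards.
    return max(n_step_return, q_current)
```

With n equal to the episode length (no bootstrapping), this reduces to the return-based target max(R, Q), which corresponds to the return-based lower bound of the original SIL that the family generalises.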

Weaknesses: I have several concerns about the proposed method:
1) It is not quite clear how the trade-off between contraction rate and fixed-point bias affects practical performance. Additional discussion of this could strengthen the relationship between the theoretical and the practical results.
2) There appears to be a disconnect between the empirical evaluation and the theoretical discussion. In general, the best-performing models in the experiments use importance sampling (IS), which is not reflected in the theoretical results. More discussion of this aspect is needed.
3) It would be beneficial to include an empirical comparison with unbiased Q-function estimators, such as Retrace and \alpha-Retrace (the standard Retrace operator is recalled below), to understand the trade-off introduced by the bias.
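For reference, and as an aid to point 3), this is my recollection of the standard Retrace(\lambda) evaluation operator from Munos et al. (2016); the notation (behaviour policy \mu, target policy \pi) is chosen to match the review and may differ from the paper's.

```latex
% Retrace(\lambda) operator (Munos et al., 2016), written with behaviour policy \mu
% and target policy \pi; the truncated ratios c_s keep the off-policy correction
% unbiased while controlling the variance of full importance-sampling products.
\mathcal{R} Q(x, a) \;=\; Q(x, a) \;+\;
  \mathbb{E}_{\mu}\!\left[ \sum_{t \ge 0} \gamma^{t}
  \Big( \textstyle\prod_{s=1}^{t} c_{s} \Big)
  \big( r_t + \gamma \, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_t, a_t) \big) \right],
\qquad
c_{s} \;=\; \lambda \, \min\!\Big( 1, \tfrac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)} \Big).
```

Its fixed point is Q^{\pi} regardless of the behaviour policy \mu, which is what makes it a natural unbiased point of comparison for biased lower-bound operators.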

Correctness: The method is theoretically justified and the proof appears correct. The ablation study on the proposed n-step Q-function estimator is informative and quite comprehensive.

Clarity: The paper is well written, and easy to follow.

Relation to Prior Work: There is extensive discussion with respect to previous work, and how the proposed method relates to them. Sufficient details are given about how the proposed method differs from previous works.

Reproducibility: Yes

Additional Feedback: -- After rebuttal: My concerns have been adequately addressed. I hope that the authors incorporate their revised discussion of contraction rate and fixed-point bias with respect to empirical performance. I have revised my score to 7.


Review 2

Summary and Contributions: Self-imitation learning (SIL) has shown good performance on some hard-exploration tasks. This paper proposes generalised self-imitation learning, which extends SIL with n-step lower-bound Q-learning. The n-step lower-bound Q-learning provides a trade-off between fixed-point bias and contraction rate, which connects it to uncorrected n-step Q-learning. In the experiments, the generalised method shows better performance in some of the testing environments; however, the improvements are not very strong.
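To make the trade-off mentioned above concrete, here is a schematic decomposition of my own (not an equation taken from the paper): if an n-step operator \mathcal{T} is a \gamma^n-contraction with fixed point \tilde{Q}, then after k applications

```latex
% Schematic contraction-rate / fixed-point-bias decomposition (triangle inequality
% plus the \gamma^n-contraction property); not an equation from the paper.
\| \mathcal{T}^{k} Q_0 - Q^{*} \|_{\infty}
  \;\le\; \gamma^{nk} \, \| Q_0 - \tilde{Q} \|_{\infty}
  \;+\; \| \tilde{Q} - Q^{*} \|_{\infty}.
```

The first term shrinks at the contraction rate while the second is the fixed-point bias; a faster-contracting operator can be preferable even if it carries some bias, which is the kind of tension the paper's trade-off result formalises.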

Strengths: The paper has good theoretical grounding, deriving n-step generalised self-imitation learning via the maximum-entropy RL formulation. The contribution includes a generalised self-imitation learning method, which provides additional advantages over the original SIL method, such as learning from partial trajectories. The authors also formalise the trade-offs of SIL and show that generalised SIL trades off contraction rate against fixed-point bias. The authors show that, during evaluation, the proposed generalised SIL outperforms the baseline methods.

Weaknesses: The performance improvement is incremental and needs to be further evaluated. For example, each experiment should be conducted over 5 random seeds instead of 3 for a more accurate comparison. Besides, in only 3 out of the 8 environments shown in Figure 2 does the proposed method show a clear improvement. More baseline methods should also be considered, such as SAC. Also, in the original SIL paper the method shows strong performance in hard-exploration environments such as Montezuma's Revenge. So how does generalised SIL compare to SIL on the Montezuma's Revenge task? Does it improve performance or not? Post-rebuttal: After reading the rebuttal, I have increased my score because the authors added new experimental results comparing with SAC. It would be more convincing if there were results on Montezuma's Revenge.

Correctness: Yes, these are correct.

Clarity: Yes, it is well written.

Relation to Prior Work: Yes, it is clear.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The paper proposes a general version of self-imitation learning by replacing the original lower-bound Q with an n-step generalized lower bound of Q^*, which is tunable and can trade off the contraction rate against the bias to achieve better performance. Detailed theoretical discussion and extensive empirical results demonstrate the improvement brought by the algorithm.

Strengths: The paper is well motivated and the idea is simple to understand. It combines two important aspects of off-policy learning methods that can achieve good performance. The idea of generalizing the lower bound of the Q-function in self-imitation learning opens a potential way to improve performance. The empirical results are convincing and show performance gains.

Weaknesses: My major concern is the variance of the empirical operator. The paper mentions in lines 87-88 that the variance can in practice be reduced by large training batch sizes. However, in continuous settings each (x,a) pair may have only one observed trajectory, so the lower bound L^{\pi, \mu, n} can only be estimated from a single trajectory, which will unavoidably have large variance (see the schematic estimator below). Since the trade-off lemma in equation (1) includes a variance term, I think it would be better to address this problem in the paper, e.g., given an oracle transition operator that can draw more trajectories starting from (x,a), would the performance improve empirically?
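To spell out the concern: with a single behaviour trajectory (x, a, r_0, x_1, a_1, \ldots, x_n, a_n) collected under \mu, the expectation defining the lower bound is replaced by a one-sample estimate. The form below is my own schematic written to match the reviewer's notation, not the paper's exact definition.

```latex
% One-sample estimate of the n-step lower bound from a single behaviour trajectory;
% schematic only, and the bootstrap term Q(x_n, a_n) is itself a single sample.
\hat{L}^{\pi, \mu, n}(x, a) \;=\; \sum_{t=0}^{n-1} \gamma^{t} r_{t} \;+\; \gamma^{n} Q(x_n, a_n)
```

Its variance is that of the discounted n-step return and cannot be averaged away when (x, a) is visited only once, hence the suggestion of an oracle that draws multiple trajectories from the same (x, a).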

Correctness: The paper is theoretically sound and the empirical methodology is standard.

Clarity: The paper is well written and well motivated. Minor, a few typos: line 78: the variance of \T includes an undefined Q, which should be \tilde{Q}? Line 108: \mathcal{H} should be \mathcal{H}^{\pi}.

Relation to Prior Work: Existing works have been clearly discussed in the introduction and throughout the paper. Another line of recent research on off-policy learning, which involves learning importance weights over states or state-action pairs and avoids the high variance of products of \pi(a_t|x_t)/\mu(a_t|x_t), is also worth mentioning [1-5].
[1] Liu, Li, Tang, Zhou: Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.
[2] Liu, Swaminathan, Agarwal, Brunskill: Off-Policy Policy Gradient with State Distribution Correction.
[3] Nachum, Dai, Chow, Li: DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections.
[4] Nachum, Dai, Kostrikov, Chow, Li, Schuurmans: AlgaeDICE: Policy Gradient from Arbitrary Experience.
[5] Jiang, Huang: Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization.

Reproducibility: Yes

Additional Feedback: