NeurIPS 2020

Meta-Gradient Reinforcement Learning with an Objective Discovered Online

Review 1

Summary and Contributions: This paper proposes an approach for meta-gradient reinforcement learning in which the value target used in the loss function is learned in an outer optimization loop. The authors hypothesize that this gives an RL agent more freedom to learn its own objective for policy optimization, rather than having priors (such as the Bellman equation) built into the learning algorithm. The authors present their algorithm (called FRODO) in both value-function and actor-critic variants and provide empirical evidence for the effectiveness of their method on toy tasks and the ALE benchmark. UPDATE: Thanks to the authors for their clarifications!

Strengths: * Novel method for meta-RL that is clearly different from prior work (also clearly highlighted in section 2). * Promising empirical results on the tested environments along with useful analyses * Clear exposition and writing

Weaknesses: * Would have liked to see some environments more substantial than Atari (e.g. MuJoCo or DeepMind Lab)

Correctness: * The methodology largely seems correct to me. * For the empirical results, why not have an RL baseline for the Catch environment?

Clarity: * The paper is well written. I especially appreciate the notation and natural lead-up to the algorithm description in section 3. * The description of results on the Catch environment could use some more detail. Particularly, what exactly does the 3-step lookahead baseline do? Is it trained or does it simply perform a search based on pre-defined values?

Relation to Prior Work: Yes, in pretty good detail, including a table of features of different works.

Reproducibility: Yes

Additional Feedback: * Line 112: The authors mention that one intuition for predicting targets instead of the objective directly is to ensure that learning always moves the agent's predictions towards the target rather than away. What happens if the target itself jumps around haphazardly? How different would this be from the divergence effect that the authors would like to avoid? * Re: reproducibility: it would be helpful to provide exact architecture details and release code publicly.

Review 2

Summary and Contributions: In this paper, the authors propose a new meta RL algorithm where the value prediction target is self-learned, i.e., generated by a trained prediction model. The value function learns to predict the self-generated value target in the inner loop of the meta RL algorithm, whereas in the outer loop, the value function learns to predict a canonical multi-step bootstrapped return. The target in the outer loss could be replaced by any standard RL update target. Post-rebuttal update: The main results for this method are presented as aggregated learning curves over a set of Atari games (Figure 3 in the main paper), and the individual scores for each game are not provided in the appendix. It would be good to see such scores in the updated version of this paper, so that follow-up work can more easily compare against this method.
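One way to read the inner/outer structure described in the summary above is sketched below. This is a minimal illustration, not the authors' implementation: it assumes a scalar value prediction `theta`, a learned target that is linear in two made-up trajectory features, squared losses, and a fixed number standing in for the multi-step bootstrapped return. All function names and constants are hypothetical.

```python
# Minimal sketch of a meta-gradient step with a learned update target.
# Assumptions (not from the paper): scalar prediction `theta`, target
# g_eta(tau) linear in two hand-picked trajectory features, squared losses.

def target(eta, feats):
    """Learned update target g_eta(tau) = sum_i eta_i * feats_i."""
    return sum(e * f for e, f in zip(eta, feats))


def inner_update(theta, eta, feats, alpha):
    """Inner loop: move the prediction toward the learned target."""
    return theta + alpha * (target(eta, feats) - theta)


def meta_gradient(theta, eta, feats, outer_return, alpha):
    """Outer loop: gradient of (outer_return - theta')^2 w.r.t. eta,
    where theta' is the prediction after one inner update."""
    theta_new = inner_update(theta, eta, feats, alpha)
    # d theta' / d eta_i = alpha * feats_i, so the chain rule gives:
    return [-2.0 * (outer_return - theta_new) * alpha * f for f in feats]


def train(steps=2000, alpha=0.5, beta=0.05):
    theta, eta = 0.0, [0.0, 0.0]
    feats = [1.0, 0.5]    # hypothetical features of a trajectory
    outer_return = 2.0    # stand-in for an n-step bootstrapped return
    for _ in range(steps):
        grad = meta_gradient(theta, eta, feats, outer_return, alpha)
        eta = [e - beta * g for e, g in zip(eta, grad)]   # outer update
        theta = inner_update(theta, eta, feats, alpha)    # inner update
    return theta, eta, target(eta, feats)


if __name__ == "__main__":
    print(train())
```

In this toy setting the meta-parameters `eta` are never told what the target should be; they are adjusted only so that the prediction, after an inner update toward the learned target, scores well under the outer loss.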

Strengths: + The idea of formulating the inner loss for meta RL as learning from an objective that the agent discovers on its own is interesting and novel. Generally, letting the algorithm self-discover its objective moves the learning algorithm one step closer to automated machine intelligence, compared to conventional meta RL methods, which rely heavily on experts' design choices (such as which hyperparameters to learn) to perform learning-to-learn. + The authors present extensive experimental results to evaluate the proposed method. The proposed method has been evaluated on three task domains: a Catch game to demonstrate that the method can effectively learn bootstrapping, a 5-state random walk to demonstrate that the method works in non-stationary environments, and ALE, which is a large-scale RL testbed. In all the task domains, the proposed method achieves a noticeable performance improvement over the compared baselines. + The authors propose a consistency loss for large-scale experiments, which is used to regularize the output of the target prediction model to be self-consistent over time. The consistency loss can bring a significant performance improvement when tested on ALE, if its weight is properly set.
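For concreteness, one plausible form of a "self-consistent over time" regularizer is a one-step Bellman-style penalty on the learned targets themselves. The sketch below is an assumption about what such a loss could look like, not the exact loss from the paper; the function name, argument layout, and the choice of a mean over steps are all hypothetical.

```python
# Minimal sketch of a consistency penalty on learned targets.
# Assumption (not confirmed by the reviews): "self-consistent over time"
# means g_t should be close to r_{t+1} + gamma * g_{t+1}.

def consistency_loss(targets, rewards, gamma):
    """Mean squared gap between g_t and r_{t+1} + gamma * g_{t+1}.

    targets[t] is the learned target at step t.
    rewards[t] is the reward received on the transition from step t to t+1.
    """
    gaps = [
        targets[t] - (rewards[t] + gamma * targets[t + 1])
        for t in range(len(targets) - 1)
    ]
    return sum(gap * gap for gap in gaps) / len(gaps)
```

Such a penalty would be added to the outer loss with a weight, which is consistent with the remark above that the benefit depends on setting that weight properly.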

Weaknesses: - My main concern is that the proposed method has not been compared with many meta RL methods. It would be good to have some meta RL baselines that perform hyperparameter optimization (e.g., [1]) or that use, as the inner-loop prediction target, a classic RL target different from the outer-loop target return. The effect of incorporating a self-discovered objective could have been evaluated more thoroughly. [1] Meta-Gradient Reinforcement Learning (NeurIPS 2018).

Correctness: The problem formulation is sound and can generally work for most deep RL problems.

Clarity: The paper is generally well-written and structured clearly and I enjoyed reading it. The method is formulated in a clear way and is easy to understand.

Relation to Prior Work: The authors present an extensive literature review for meta RL and provide an in-depth comparison with the existing literature.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper proposes to allow RL agents to learn their own objective during online interactions with their environment. First, a meta-learner takes the trajectory as input and meta-learns an update target. Then, a traditional TD-like update or policy-gradient update is performed based on the learned target. Because the target is learned from online trajectories, it can adapt to the changing context of learning over time. To speed up learning in complex environments, a simple heuristic is added to regularize the learning space of the target function towards targets that are self-consistent over time. This paper provides motivating examples to show that the proposed method can address issues of bootstrapping, non-stationarity, and off-policy learning. In addition, the authors conduct large-scale experiments on Atari to further evaluate their method. Contributions: 1. This paper makes a first step towards meta-learning the update target in the context of meta-gradient RL. 2. The proposed method makes use of online trajectories to allow the agent to learn its own online objective and thus learn how to learn increasingly effectively.

Strengths: 1. This paper tackles a very valuable problem of meta-learning knowledge from online trajectories. The idea of building a learnable update target for RL is interesting and novel. This paper is relevant to a broad range of research on meta RL. 2. The authors provide motivating examples to validate the motivation behind their algorithm design. These experiments are well designed to support the main claims.

Weaknesses: 1. I wonder whether R2D2, or an extended version of a conventional model-free RL method that supports POMDPs, would perform similarly to the method proposed in this paper. For example, what if we simply use a trajectory encoder to capture the true state from the history and then feed it to the value network and policy network as an additional input feature? Maybe this naïve method can also address tasks like “Catch” and “5-state Random Walk”. More experiments would help to further clarify the contribution of this paper. 2. The proposed approach has only comparable or slightly better performance than the baseline method (IMPALA) on the large-scale standard Atari benchmark, but it has a much more complex implementation than the baseline (a meta-learning algorithm seems harder to train and use than a simple model-free method, and may be less stable in practice). In addition, some recent stronger baselines (e.g., R2D2, NGU, Agent57, and MuZero) on this benchmark are not included.

Correctness: The claims and method are correct. The empirical methodology is sound.

Clarity: Overall, the paper is well written. I only have a few suggestions. 1. It would be better if an algorithm sketch or pseudocode were provided. 2. Some of the notation is a little confusing. In Sections 3.4 and 3.5, g and G represent the learned update target and the return, respectively, but in Line 231, G denotes the learned update target. 3. In Line 174, should “When the pellet reaches the top row the episode terminates” be “When the pellet reaches the bottom row the episode terminates”?

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: 1. In “5-state Random Walk”, what is the trajectory length used in g(\tau)? Does a trajectory lie within a single episode, or can it span many episodes? 2. Is the implementation code available? Post-rebuttal comment: Thanks for the feedback. I have increased my score to 6 and lean towards accepting this paper, since the feedback has addressed my main concern. The authors provide additional experiments showing that their method can outperform a naïve extension of an RL algorithm that uses a trajectory encoder as an additional input to the value and policy networks and in principle supports non-stationarity.