Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper presents a very interesting idea: uses combinatorial methods only for projecting neural (or hybrid neural-symbolic) functions to the programmatic space. The projection operator uses program synthesis via imitation learning such as Dagger. The paper does not clearly explain why projection is much easier than searching in program space. A detailed explanation with an example will be needed. The evaluation results are very weak. It uses very simple RL tasks. The dropbox link https://www.dropbox.com/sh/qsgxk76xg9t5ow3/AACyJIkqwhrO9EUmVa6SIrJva?dl=0 does not have the code. The folder is empty. ---after author feedback and reading other reviews--- The authors have addressed some of my concerns: projection via imitation learning is easier, and provided the code and a video comparing with NDPS. The authors did not address my question on evaluation well. Although TORCS tasks are important, a single environment is not enough for evaluation. In particular, when can IPPG learn effectively via imitation learning can not be established via TORCS itself. Online imitation learning may be hard in many application domains. Can you leverage offline demonstrations? As a result, the paper still needs lots of improvement. However, I am willing to increase my rating to 6 given some of my concerns are addressed.
Very nice work. Some minor comments: 1. As in so many NIPS and ICML papers I review these days, the scholarship is somewhat shoddy. There's a vast literature of work on mirror descent based RL, including mirror-descent based safe policy gradient (Thomas et al, NIPS 2013), mirror-descent based sparse Q-learning (Mahadevan and Liu, UAI 2012), mirror-descent based gradient TD (Liu et al, UAI 2015, JAIR 2019), and many more articles in this vein. Please do a more thorough literature review. 2. From the results of Thomas et al., NIPS 2013, one can reinterpret your learning algorithm (Section 3) as doing natural gradient updates. Please read Thomas et al., and summarize your ideas from this perspective. 3. Following Mahadevan and Liu's work on sparse Q-learning, can you provide any sparsity guarantees on your learned policies (which would help in their interpretability). 4. Table 1 needs to be augmented with some of these earlier mirror-descent based approaches.
This paper addresses the problem of learning programmatic policies, which are structured in programmatic classes such as programming languages or regression trees. To this end, the paper proposes a "lift-and-project" framework (IPPG) that alternatively (1) optimizes a policy parameterized by a neural network in an unconstrained policy space and (2) projects the learned knowledge to space where the desired policy is constrained with a programmatic representation. Specifically, (1) is achieved by using deep policy gradient methods (e.g. DDPG, TRPO, etc.) and (2) is obtained by synthesizing programs to describe behaviors (program synthesis via imitation learning). The experiments on TORCS (a simulated car racing environment) show that the learned programmatic policies outperform the methods that imitate or distill a pre-trained neural policy and DDPG. Also, they exhibit stronger generalization ability and better verifiability (e.g. verifying the smoothness). I believe this work explores a promising direction for learning programmatic policies yet I am not entirely convinced by the experiments. I am will to raise the score if the rebuttal addresses my concerns/questions. [Strengths] *motivation* - The motivation for alternating between learning a neural policy and a programmatic policy to avoid obtaining a suboptimal programmatic policy from distilling a pre-trained neural policy is convincing and intuitive. *novelty & technical contribution* - I believe the idea of iteratively learning a programmatic policy with its neural counterpart is novel. This paper presents an effective way to implement this idea. - The provided theoretical analyses on the convergence and the overall performance are helpful. *clarity* - The paper gives clear descriptions in both theoretical and intuitive ways. The notations, formulations, and theorem are well-explained. *experimental* - The presentations of results are clear. - The proposed framework (IPPG) ----- outperforms the methods (VIPER and NDPS) that distill fixed neural policies and achieve comparable results compared to a neural policy learned using DDPG. ----- progressively improves the performance of a given sub-optimal programmatic policy (PRIOR). ----- exhibits stronger generalization ability (compared to DDPG): learn from a track and generalize to another. [Weaknesses] *assumption* - I am not sure if it is safe to assume any programmatic policy can be parameterized by a vector \theta and is differentiable in \theta. (for Theorem 4.2) *initial policy* - In all the experiments (TORCS, MountainCar, and Pendulum), the IPPG polices improve upon the PRIOR. It is not clear if IPPG can learn from scratch. Showing the performance of IPPG learning from scratch would be important to verify this. - Can IPPG be initialized with a neural policy? It seems that it is possible based on Algorithm 1. If so, it would be interesting to see how well IPPG work using a neural policy learned with DDPG instead of PRIOR. Can IIPG improve upon DDPG? *experiment setup* - It is mentioned that "both NDPS and VIPER rely on imitating a fixed neural policy oracle" (L244). What is this policy oracle? Is this the policy learned using DDPG shown in the tables? If not, what's the performance of using NDPS and VIPER to distill the DDPG policies? - It would be interesting to see if the proposed framework works with different policy gradient approaches. *experiment results* - How many random seeds are used for learning the policies (DDPO and IPPG)? - What are the standard deviation or confidence intervals for all performance values? Are all the tracks deterministic? Are the DDPG policies deterministic during testing? - It would be better if the authors provided some videos showing different policies controlling cars on different tracks so that we can have a better idea of how different methods work. *reproducibility* - Some implementation details are lacking from the main paper, which makes reproducing the results difficult. It is not clear to me what policy gradient approach is used. - The provided dropbox link leads to an empty folder (I checked it on July 5th). *related work* - I believe it would be better if some prior works [1-5] exploring learning-based program synthesis frameworks were mentioned in the paper. *reference*  "Neuro-symbolic program synthesis" in ICLR 2017  "Robustfill: Neural program learning under noisy I/O" in ICML 2017  "Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis" in ICLR 2018  "Neural program synthesis from diverse demonstration videos" in ICML 2018  "Execution-Guided Neural Program Synthesis" in ICLR 2019 ----- final review ----- After reading the other reviews and the author response, I have mixed feelings about this paper. On one hand, I do recognize the importance of this problem and appreciate the proposed framework (IPPG). On the other hand, many of my concerns (e.g. the choices of initial policy, experiment setup, and experiment results) are not addressed, which makes me worried about the empirical performance of the proposed framework. To be more specific, I believe the following questions are important for understanding the performance of IPPG, which remain unanswered: (1) Can IPPG learn from scratch (i.e. where no neural policy could solve the task that we are interested in)? The authors stated that "IPPG can be initialized with a neural policy, learned for example via DDPG, and thus can be made to learn" in the rebuttal, which does not answer my question, but it is probably because my original question was confusing. (2) Can IPPG be initialized with a neural policy? If so, can IPPG be initialized with a policy learned using DDPG and improve it? As DDPG achieves great performance on different tracks, I am just interested in if IPPG can even improve it. (3) How many random seeds are used for learning the policies (DDPO and IPPG)? What are the standard deviation or confidence intervals for all performance values? I believe this is important for understanding the performance of RL algorithms. (4) What is the oracle policy that NDPS and VIPER learn from? If they do not learn from the DDPG policy, what is the performance if they distill the DDPG policy. (5) Can IPPG learn from a TPRO/PPO policy? While the authors mentioned that TRPO and PPO can't solve TORCS tasks, I believe this can be verified using the CartPole or other simpler environment. In sum, I decided to keep my score as 5. I am ok if this paper gets accepted (which is likely to happen given positive reviews from other reviewers) but I do hope this paper gets improved from the above points. Also, it would be good to discuss learning-based program synthesis frameworks as they are highly-related.