NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4927
Title: Write, Execute, Assess: Program Synthesis with a REPL

Reviewer 1

This line of work is very promising, but the paper seems like a small variation on existing work in the area. For example, how does this paper compare to [Tian et al. 2019], which takes a very similar approach and likewise examines 3D graphics?

As for the method itself, one area of confusion is the reward signal. Page 3 line 129 claims that the reward signal is 1 if the partial program satisfies the spec. That cannot be entirely correct, because some reward must be given to partial solutions. Indeed, Page 6 line 177 states that IoU is the measure used. It would be useful to be clearer about exactly how this works.
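To make the point concrete, the graded reward the reviewer is asking about could look something like the following. This is a minimal illustrative sketch, not the paper's implementation; the function names and the flat 0/1 occupancy-grid representation are assumptions.

```python
def iou(pred, target):
    """Intersection-over-union between two boolean occupancy grids,
    given as flat lists of 0/1 values of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty grids are considered identical.
    return intersection / union if union else 1.0

def reward(pred, target):
    """Reward is 1 only when the program's output exactly matches the
    spec; otherwise IoU gives partial solutions graded credit."""
    return 1.0 if pred == target else iou(pred, target)
```

Under this reading, the claim on page 3 (reward 1 on exact satisfaction) and the IoU measure on page 6 are consistent: IoU is the shaped signal for partial programs.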

Reviewer 2

> "Given a large enough time budget the 'no REPL' baseline is competitive with our ablated alternatives."

As I understand it, the key result here is the effectiveness of search (SMC) over pure RL methods in the program-synthesis context. However, the policy-rollout baseline is trained with RL on a single machine, making it difficult to explore using entropy-based methods or epsilon-greedy. Using multiple actors in an asynchronous setting, and then doing policy rollouts, would be a stronger and fairer baseline against the SMC approach. I expect SMC to do well, but this is an important empirical question (other cited methods, such as Ganin et al., appear to do this in the same context).

> "The value-guided SMC sampler leads to the highest overall number of correct programs, requiring less time and fewer nodes expanded compared to other inference techniques."

How well does an SMC sampler work without value-guided proposals in both case studies?

> How sensitive are the results in Figure 6 to random seeds?

> Are there burn-in issues in getting the sampling procedure to work reliably?

> It would be informative to visualize the SMC population for the 3D examples, similar to Figure 3.
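The ablation the reviewer requests (SMC with and without value guidance) can be framed with a small sketch of one SMC step. This is a hypothetical illustration, not the paper's code: `extend` stands in for the policy proposing the next line of each partial program, and `weight_fn` for the learned value network; dropping value guidance corresponds to replacing `weight_fn` with a constant.

```python
import random

def smc_step(particles, extend, weight_fn, n):
    """One sequential Monte Carlo step over partial programs:
    extend each particle, reweight by the (learned) value score,
    then resample n particles with replacement."""
    extended = [extend(p) for p in particles]
    weights = [weight_fn(p) for p in extended]
    total = sum(weights)
    if total == 0:
        # Degenerate weights: fall back to uniform resampling.
        probs = [1.0 / len(extended)] * len(extended)
    else:
        probs = [w / total for w in weights]
    return random.choices(extended, weights=probs, k=n)
```

Running the same loop with `weight_fn=lambda p: 1.0` would give the value-free SMC variant the reviewer asks about, isolating how much of the gain comes from the value-guided reweighting.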

Reviewer 3

The proposed idea of partial program evaluation via a REPL seems to rest on a careful design of the evaluation mechanism and is specific to each of the two domains targeted. Even within the domains considered in the paper, such as the graphics domain, I think different evaluation mechanisms may have different effects on the synthesis outcomes. The proposed generation procedure starts from the bottom of the syntax tree and generates the terminals (individual shapes), which can be evaluated quite well; but if we instead follow the grammar and synthesize the program top-down from the non-terminals, partial evaluation is not as easy. Some discussion of the effects of different possible evaluation mechanisms and different designs of the sequential decision process would be helpful.

The RL loss in eq (3) is somewhat non-standard; I wonder if the authors have tried more standard value-estimation losses rather than the cross-entropy loss.

The paper is well written and clear.
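The comparison the reviewer raises can be made concrete with a sketch of the two per-example losses. This is an illustration under the assumption that the value head outputs a probability in (0, 1) that a partial program can be completed into a correct one; the function names are hypothetical, not from the paper.

```python
import math

def value_xent_loss(v, reached_goal):
    """Cross-entropy loss (as in eq (3)'s style): treats the value
    output v as the probability of success; reached_goal is the
    0/1 outcome label."""
    eps = 1e-12
    v = min(max(v, eps), 1 - eps)  # clamp for numerical stability
    return -(reached_goal * math.log(v) + (1 - reached_goal) * math.log(1 - v))

def value_mse_loss(v, reached_goal):
    """A more standard squared-error value-estimation loss, the kind
    of alternative the reviewer suggests trying."""
    return (v - reached_goal) ** 2
```

Both losses share the same minimizer in expectation (the success probability), but the cross-entropy version penalizes confident mistakes much more sharply, which is one plausible reason to prefer it; an empirical comparison would settle the question.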