NeurIPS 2020

### Review 1

Summary and Contributions: In 2016 Whalen announced the first automated theorem prover based on deep reinforcement learning techniques. The prover was built on Metamath, a math formalization corpus and proof checker. The current work extends Whalen's work by augmenting training with synthetic data, which is also generated using neural networks (a combination of deterministic methods, GAN, and reinforcement learning).

Strengths: The content of this work is relevant to NeurIPS. Data augmentation is also a natural and proven methodology for improving machine learning performance, though it is also known that with a fixed amount of source data the gain from data augmentation is marginal. Based on a reimplementation of Whalen's code adapted to run on GPUs, the authors have shown some improvements with their data augmentation approach, as expected.

Weaknesses: There is not that much novelty in the paper. It would be nice if the authors could give more discussion of the limitations of their approach. I am also wondering whether, at a certain point of augmentation, they witnessed any degradation of performance due to overfitting. The part on Metamath is not clearly written. Not being an expert on Metamath myself, I had to go back to Whalen's paper from time to time (unfortunately it is also not very clearly written) and eventually to Metamath's manual. It took me some time to realize that only the tree interpretation of Metamath sentences is essential to understanding this line of work. I also think it is somewhat of an overclaim to say that your work is 'orthogonal' to all previous approaches, since your verification methodology follows largely from previous approaches, especially Whalen, Loos, etc.
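
To make the tree interpretation concrete, here is a minimal sketch (my own illustration, not the paper's or Whalen's code): a Metamath sentence such as `( ph -> ( ps -> ph ) )` (the statement of ax-1) is the flattening of a parse tree whose internal nodes are syntax axioms such as `wi` (implication) and whose leaves are variables.

```python
# Minimal sketch of the "tree interpretation" of a Metamath sentence:
# the token string is viewed as a parse tree whose root is a syntax
# axiom (here wi, i.e. "wff ( ph -> ps )") and whose leaves are variables.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # syntax axiom or variable, e.g. "wi", "ph"
    children: list = field(default_factory=list)

    def tokens(self):
        """Flatten the tree back into the Metamath token string."""
        if self.label == "wi":      # wi: wff ( ph -> ps )
            a, b = self.children
            return ["("] + a.tokens() + ["->"] + b.tokens() + [")"]
        return [self.label]         # a variable leaf

# ( ph -> ( ps -> ph ) ), i.e. the statement of ax-1
tree = Node("wi", [Node("ph"), Node("wi", [Node("ps"), Node("ph")])])
print(" ".join(tree.tokens()))      # ( ph -> ( ps -> ph ) )
```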

Correctness: Some of the claims are overstated, as the paper is not that novel, but other than this I have no complaints about the method or its correctness.

Clarity: Some parts could be explained better, as discussed above.

Relation to Prior Work: It is ok.

Reproducibility: Yes

Additional Feedback:
- Line 57: "MetaGen" (correct the quotation marks).
- Line 71: add a reference for HOL Light (Harrison et al.).
- Line 92: "...self-play, with one agent..."
- Line 98: "...compete with each other..."
- Line 184: "...optionally their proof trees..."
- Line 322: "MetaGen-Rand" (typo).
- Line 324: "...due to the fact that our prover runs..."
- Line 334: "...focus more on..."
- Line 341: "...and less comfortable / confident with..."

### Review 2

Summary and Contributions: The paper describes a 7.7% improvement over the Holophrasm system, obtained by generating synthetic theorems. The theorems are generated by guided instantiation, and special care is taken not to generate instantiations that are trivially true.
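
For concreteness, the following is a minimal sketch (my own illustration, not the authors' code) of instantiation-based generation: substitute terms for the metavariables of an existing theorem and discard instances that a cheap check flags as trivially true, here when the conclusion merely repeats a hypothesis. In the paper the substitution is proposed by a learned substitution network rather than sampled at random.

```python
# Sketch of instantiation-based theorem generation with a triviality filter.
# `theorem` is (hypotheses, conclusion) over metavariable tokens.
import random

def instantiate(expr, subst):
    """Replace each metavariable token in expr by its substituted tokens."""
    out = []
    for tok in expr:
        out.extend(subst.get(tok, [tok]))
    return out

def generate(theorem, variables, candidate_terms, rng=random):
    hyps, concl = theorem
    # a random substitution stands in for the learned substitution network
    subst = {v: rng.choice(candidate_terms) for v in variables}
    new_hyps = [instantiate(h, subst) for h in hyps]
    new_concl = instantiate(concl, subst)
    if new_concl in new_hyps:       # trivially true: conclusion is a hypothesis
        return None
    return new_hyps, new_concl

# modus-ponens-shaped schema: from ph and ( ph -> ps ), conclude ps
thm = ([["ph"], ["(", "ph", "->", "ps", ")"]], ["ps"])
print(generate(thm, ["ph", "ps"], [["ch"], ["(", "ch", "->", "th", ")"]]))
```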

Strengths: Generating relevant synthetic theorems is a nice and nontrivial research direction, related to tasks such as lemma introduction and theory exploration. Proving 7.7% more theorems on the test set is a solid result, and the authors have done a substantial amount of work.

Correctness: See above.

Clarity: See above.

Relation to Prior Work: See above.

Reproducibility: Yes

### Review 3

Summary and Contributions: This paper proposes an approach to augment the training data of a prover. Trained with the generated data, the number of theorems proved increases from 557 to 600 (Table 3), which is the state-of-the-art number on this dataset.

Strengths: The paper shows that data augmentation is useful in the field of automated theorem proving. As far as I understand, no prior research in ML-based automated theorem proving has addressed this direction. As a result, the trained prover achieves the state-of-the-art result on a Metamath dataset.

Weaknesses: The weakness of the paper is the limited experimental evaluation. The evaluation uses set.mm only, and the improvement in Table 3 is not large; I am not convinced whether the result is merely lucky on this particular dataset. The number of "test proofs found" differs between the authors' Holophrasm reimplementation and the original Holophrasm, and it varies considerably depending on the configuration. Achieving state of the art is not impactful unless computational cost etc. is controlled.

Correctness: One way to demonstrate the effectiveness of the proposed approach would be to experiment on other Metamath theorem sets. The authors propose a constraint in line 178, but its effectiveness is not evaluated; adding such an analysis would give a better understanding of the proposed data generation method.

Clarity: Combined with the supplementary material, the paper is overall well written. However, the details of the substitution network are not clear (lines 229 and 230); an example would be helpful.

Relation to Prior Work: The authors do explicitly state their contributions and the differences from prior work. However, the difference from the work of Whalen (2016) could have been stated more carefully.

Reproducibility: No