Review for NeurIPS paper: Bootstrapping neural processes

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes to use bootstrap techniques to improve the uncertainty estimates of Neural Processes (NPs), and consequently, their robustness to model mismatch. The paper states that a naive application of bootstrap to NPs doesn't work, and so it proposes a method that does work.

Strengths: I find the ideas in this paper sound; however, I doubt their significance and relevance. In my understanding, the uncertainty estimates of NPs are of very low quality compared to those of GPs. Thus, I would find it significant if a method could close this gap between NPs and GPs. I think that this paper tries to take a step in this direction. However, the bootstrap does not seem to have the power to solve this issue. I think this is best illustrated in Figure 2, where it's clear that there is no big difference between the behaviour of the bootstrapped and vanilla models. Therefore, it might have been better to focus on the downstream tasks, where differences could be more visible, as seen, for example, in the Bayesian optimisation experiment in Section 5.2.

Weaknesses: The proposed method, in my opinion, does not provide a good enough solution to the problems related to uncertainty and robustness in NP-based models. While the experiments show improvements over vanilla NPs, it is unclear to me whether these improvements are significant, or whether the evaluation is done correctly. For instance, all tables in the paper report log-likelihoods. However, without extra steps, one can only get a lower bound on the log-likelihood using NPs. I haven't found an explanation of how the numbers in the tables are computed, so I'm still wondering what they are. The figures included in the paper do not show significant improvements over the base models. Thus, if I had to use NPs, I would choose the vanilla version. The extra layer of complexity in the form of bootstrapping is, in my opinion, not worth the gains. The only result which I find more or less convincing is the Bayesian optimisation experiment. However, the paper pays very little attention to it, and so it is difficult to interpret the results.
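To make the lower-bound remark concrete, here is a minimal sketch of two quantities that are easily confused when reporting NP log-likelihoods. It assumes, purely for illustration, that K latent samples z_k have been drawn and that the per-sample values log p(y | x, z_k) are available; the numbers and the names `log_mean_exp`, `log_p_given_z`, `estimate`, and `jensen_bound` are hypothetical, and nothing here is claimed to be how the paper computed its tables.

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))

# Hypothetical values of log p(y | x, z_k) for K = 4 latent samples (made up).
log_p_given_z = [-1.2, -0.8, -2.5, -1.0]

# Monte Carlo estimate: log (1/K) sum_k p(y | x, z_k).
estimate = log_mean_exp(log_p_given_z)

# Average of per-sample log-likelihoods: (1/K) sum_k log p(y | x, z_k).
# By Jensen's inequality this is never larger than the estimate above.
jensen_bound = sum(log_p_given_z) / len(log_p_given_z)

print(f"log-mean-exp estimate: {estimate:.4f}")
print(f"mean of logs (looser): {jensen_bound:.4f}")
```

Both quantities are, in expectation, lower bounds on the true log-likelihood when the z_k are sampled from the model's own latent distribution; the log-mean-exp version tightens as K grows, which is why it matters which one a table reports.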

Correctness: I didn't find mistakes in the derivations. Personally, I don't find the method very elegant.

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: Some random remarks:
-- I find the abstract a bit confusing, e.g. I wouldn't say that NPs learn from a 'data stream' or that having a 'single latent variable limits the flexibility'.
-- Line 23: it's weird to see ... in p(y|x....).
-- I suspect that the training time of BNPs is higher than that of NPs. I think the paper should discuss this.
-- I think that if tables like 1 or 2 are used, then they should have far fewer entries, e.g. include likelihoods only on the target points.
-- The paper lacks an interpretation of Figure 4, which could be quite interesting.

-------------- POST-REBUTTAL UPDATE -------------------------
I'm grateful to the authors for answering my questions in the rebuttal. I acknowledge that BNPs result in improved log-likelihoods over the baselines. However, I still think that the extra layer of complexity on top of NPs needs to result in even better performance gains before it can be fully justified. From the reviewers' discussion, I believe that other reviewers agree that this is a rather incremental work. I tend to agree with this assessment. This is a borderline paper for me, but I wouldn't be upset if it's accepted.

Review 2

Summary and Contributions: This paper proposes an architecture that applies bootstrapping to the latent variable of Neural Processes. The design adds only a small number of parameters, and the authors validate that it is more robust than Neural Processes, particularly when the test data come from a different distribution than the training data.

Strengths: The motivation is well set: addressing the limited flexibility of the global latent variable by means of the bootstrap. I think that it deserves to be shared with the community. The proposed method is clearly described, and the experimental design and analysis are also quite good.

Weaknesses: The limitation of the method is that it requires more computational resources, even though it adds only a few parameters, because it runs encoding-decoding twice with k variables. The limitation of the paper is the lack of analysis of cases where the method does not work well.

Correctness: Correct

Clarity: Well written

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:
- The authors mention the naive application of the bootstrap to NPs, but do not show empirical results for it. It would be better to include it as one of the baselines and compare it with BNP.
- BNP is processed in a parallel manner and requires encoding-decoding twice, which takes more computation time, so a wall-time comparison for training would help in understanding BNP.
- Line 92, in the equation: \tilde{X}^{(k)} -> \tilde{X}^{(j)}

# To authors: Thank you for the update about processing time. What I wanted to check was a comparison of convergence speed with the baselines, but your data are also useful to check.

Review 3

Summary and Contributions: This paper proposes an extension to NPs that offers one way to model the functional uncertainty of NP models. The proposed method uses the bootstrap, which essentially resamples the contexts and constructs an ensemble of predictions. The resulting algorithm is called Bootstrapping Neural Process (BNP). This functional uncertainty modeling is said to improve the robustness of NPs, especially in the case of model-data mismatch.
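The idea of resampling the contexts and forming an ensemble of predictions can be sketched in a few lines. This is only an illustration of the generic paired bootstrap, not the paper's BNP algorithm: the least-squares line `fit_line` is a hypothetical stand-in for an NP encoder-decoder, and the context points, `bootstrap_predictions`, and its parameters are made up for the example.

```python
import random
import statistics

def fit_line(points):
    """Least-squares line through (x, y) pairs; a stand-in for an NP model."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx < 1e-12:  # all resampled x values identical: use a constant predictor
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar

def bootstrap_predictions(context, x_target, num_resamples=50, seed=0):
    """Resample the context with replacement and predict x_target each time."""
    rng = random.Random(seed)
    preds = []
    for _ in range(num_resamples):
        resampled = rng.choices(context, k=len(context))  # with replacement
        slope, intercept = fit_line(resampled)
        preds.append(slope * x_target + intercept)
    return preds

# Hypothetical context set of (x, y) pairs.
context = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
preds = bootstrap_predictions(context, x_target=5.0)

ensemble_mean = statistics.fmean(preds)
ensemble_std = statistics.stdev(preds)  # spread across resamples
print(f"prediction at x=5: {ensemble_mean:.3f} +/- {ensemble_std:.3f}")
```

The spread of `preds` is the bootstrap's notion of functional uncertainty: it reflects how much the prediction changes when the context set itself is perturbed.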

Strengths: The paper pursues an interesting research question that looks at the problem of model-data mismatch. Modeling the functional uncertainty of the context representation would be a good way to improve the prediction of target data. The authors propose a method that uses bootstrapping to obtain an ensemble of context representations, and hence an ensembled distribution of target predictions. The proposed idea makes sense, as bootstrapping has been used successfully in frequentist statistics and other ML frameworks.

Weaknesses: Given the paper's current state, I have the following major comments:
- The proposed method's motivation is to tackle the issue of model-data mismatch by modeling the uncertainty of the context representation. However, the notion of model-data mismatch is loosely defined. It would be more interesting if the paper formulated this problem in a principled way, e.g. with the training task distribution and the target task distribution defined on different domains, as in the experiments.
- The choice of the training objective in (14) needs justification. The combined objective of two models, with and without bootstrap, is somewhat questionable. The computation of residuals would strongly influence the input and hence the convergence of the full model. I would really like to see this technical issue analysed and evaluated more carefully with ablations.
- The proposed method seems to enjoy many advantages, as seen in the discussion in Section 3.4. But the lack of a parallel implementation and of a demonstration of its benefit is unfortunate.
- BNP/BANP does not always perform better than the original NP family. Sometimes it performs better, but not by much, and the gain is not clearly visible, especially in qualitative results such as those on EMNIST.
- As BNP can model the uncertainty of context data, it might be interesting to see comparisons among methods for different numbers of context points.

Correctness: The claim and empirical methodology are correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: Yes, the paper has a great discussion of related work.

Reproducibility: Yes

Additional Feedback: ----------------------------------------------------------------------------------- I have read the author response. I have changed my score in light of some good responses.

Review 4

Summary and Contributions: The paper proposes a solution to the problem of generating functional uncertainty in neural process models. To achieve this, it uses the residual bootstrap. A positive effect is that the resulting models are robust against model-data mismatch.
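For readers unfamiliar with the term, the classical residual bootstrap can be sketched as follows. This is the textbook procedure applied to a simple regression, not the paper's method: the least-squares line `fit_line` is a hypothetical stand-in for the model, and the data, `residual_bootstrap`, and its parameters are invented for the example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept; a stand-in for the actual model."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar

def residual_bootstrap(xs, ys, num_resamples=50, seed=0):
    """Keep inputs fixed; build new targets as fitted value + resampled residual."""
    rng = random.Random(seed)
    slope, intercept = fit_line(xs, ys)
    fitted = [slope * x + intercept for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    datasets = []
    for _ in range(num_resamples):
        noise = rng.choices(residuals, k=len(xs))  # residuals, with replacement
        datasets.append([f + e for f, e in zip(fitted, noise)])
    return datasets

# Hypothetical data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

boot_ys = residual_bootstrap(xs, ys)
# Refit on every bootstrap dataset to see how much the fit varies.
boot_slopes = [fit_line(xs, b)[0] for b in boot_ys]
print(f"slope spread: {min(boot_slopes):.3f} to {max(boot_slopes):.3f}")
```

Unlike resampling (x, y) pairs, this variant keeps every input location and perturbs only the targets, so each bootstrap dataset has the same size and coverage as the original.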

Strengths: This work is well-designed and not trivial, as illustrated by the fact that the naïve application of the residual bootstrap to NPs does not work (Sec. 3.1). The experimental validation is good, covering several very different scenarios and demonstrating consistently good results. The method rests on a few good ideas, and so should not pose specific reproducibility problems.

Weaknesses: Theoretical analysis and analytical experiments could be strengthened.

Correctness: Yes. The experimental part is good. The experiments are not novel, in the sense that they follow established precedents, but the coverage is good. The code accompanying the paper is complete.

Clarity: The paper is well written and clear overall. Several English errors should not have made it to the submitted paper, but do not obfuscate the meaning of the text. The structure of the paper is good. Though not part of the assessment done for this review, and not necessary, the supplementary material is interesting.

Relation to Prior Work: Yes. The connection with the NP literature is pretty straightforward. A few more connections could be drawn to the literature on neural architecture aspects. The connection with the literature using bootstrap ideas is fine.

Reproducibility: Yes

Additional Feedback: The paper should be spell-checked and grammar-checked. On lines 105 and 108, I believe the ^ is missing from the mentions of mu and sigma. The last author of ref 4 is missing.
---- Note after reviewer discussion: I have carefully read the author feedback, the other reviews, and the reviewer discussion. My overall score is updated to a solid 7, with higher confidence than before.