NeurIPS 2020

Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural Processes


Review 1

Summary and Contributions: This paper proposes a model called ConvNP, which adds a global latent variable to the previously proposed Convolutional Conditional Neural Process (ConvCNP), allowing it to produce coherent samples rather than independent predictions at each data point (analogous to how the Neural Process relates to the Conditional Neural Process). The authors propose to train the model by approximating the likelihood objective with importance sampling rather than amortized variational inference, and compare the two training schemes both for the proposed model and for a previously proposed attention-based NP variant. The experiments show that the proposed model and training method outperform the baselines on prediction of 1D functions, images, and weather data.
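For concreteness, my reading of the two surrogate objectives being compared is roughly the following (notation mine, so treat this as my understanding rather than the authors' exact formulation). The NP-style variational objective conditions the approximate posterior on the context set D_c and target set D_t:

\mathcal{L}_{\mathrm{VI}} = \mathbb{E}_{q(z \mid D_c \cup D_t)}\big[\log p(y_t \mid x_t, z)\big] - \mathrm{KL}\big(q(z \mid D_c \cup D_t) \,\|\, q(z \mid D_c)\big),

whereas the proposed objective is the log of a Monte Carlo average over latent samples drawn from the context-conditioned encoder:

\mathcal{L}_{\mathrm{ML}} = \log \frac{1}{L} \sum_{\ell=1}^{L} p\big(y_t \mid x_t, z^{(\ell)}\big), \qquad z^{(\ell)} \sim q(z \mid D_c).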

Strengths: The paper extends the ConvCNP model in a way that makes it useful in cases where there is global uncertainty to be captured and coherent samples are required. The paper also proposes a training method different from the one usually used in work on NPs. The results further suggest that this training method helps not only the proposed model but also previously proposed models such as ANP (although this is not yet very convincing in this paper and would require more research). The paper is well written, and the experimental section does a good job of conveying the performance of the model and shows that it outperforms strong baselines on the tested tasks.

Weaknesses: Some parts of the explanations contain statements that are a bit vague and hard to follow. The image completion experiments fail to show multi-modal prediction. See below for more details.

Correctness: As far as I can tell, everything is correct. See the minor comments below on a few unclear points.

Clarity: The paper is well written.

Relation to Prior Work: The paper proposes a relatively small extension to a previous model (ConvCNP) and as such is described mainly in relation to it. As it is part of the general Neural Process family of models, it is also clearly positioned relative to other members of the family (NP), and the experiments include another strong baseline from the family (ANP). One thing I think is missing is an experimental comparison to ConvCNP.

Reproducibility: Yes

Additional Feedback:
1. The authors say that they use a ConvCNP as the encoder. Looking at the pseudo-code in Algorithm 1 in the appendix, it is unclear to me whether the ConvCNP is actually run all the way through with some discretized grid as targets, or whether the discretization at the level of the t_i is used. I would assume the latter, but this is not stated in the text (a sketch of my current reading follows this list). If it is the former, I do not understand why lines 6 and 7 (in Algorithm 1) are needed in the encoder.
2. Figure 1 is not very informative, especially compared to Figure 1 in the appendix, which is much better. The same goes for the pseudo-code in the appendix. I think it would be good to move these to the main paper, even at the expense of some of the text.
3. Since both the proposed training method and the ELBO method are surrogates for maximum-likelihood optimization, I would call the proposed method Importance Sampling (IS) and the ELBO method VI or ELBO.
4. I do not understand the claim in lines 172-173. Why is this not a lower bound? How is this conceptually different from the standard NP or any latent variable model?
5. Lines 174-175 discuss the difficulty of using the KL divergence for the 'non-discretized ConvNP'. I do not see any reference to this model; as far as I understand, the ConvCNP always needs some discretization to realize the method.
6. The image completion task in Figure 3 does not show predictions that cover different modes, which is an important aspect of NPs. I would add experiments with very few context points showing that the model can produce clearly different completions, and if this does not happen, it should be discussed.

Update: I have read the authors' feedback and am happy to keep my score.
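Regarding point 1, for reference, the reading I currently have in mind is that the ConvCNP-style encoder is evaluated on the discretization grid, a latent function is sampled at every grid point, and the decoder maps the sample back to the targets. The following is a purely hypothetical 1D sketch of that reading, not the authors' code: the function name, the grid construction, and the random linear map standing in for the CNNs are my own choices, made only to keep the example runnable.

import numpy as np

def rbf(diff, lengthscale=0.2):
    """Gaussian (RBF) weights used for the set convolution and interpolation."""
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def convnp_forward_sketch(x_ctx, y_ctx, x_tgt, n_samples=8, grid_size=64, seed=0):
    """Hypothetical 1D ConvNP forward pass (illustrative reading, not the paper's code)."""
    rng = np.random.default_rng(seed)

    # 1. Fix a uniform discretization grid t covering context and target inputs.
    lo = min(x_ctx.min(), x_tgt.min()) - 0.5
    hi = max(x_ctx.max(), x_tgt.max()) + 0.5
    t = np.linspace(lo, hi, grid_size)                        # (G,)

    # 2. ConvCNP-style SetConv encoder: map the context set onto the grid
    #    (density channel + normalized data channel).
    w = rbf(t[:, None] - x_ctx[None, :])                      # (G, C)
    density = w.sum(axis=1)                                   # (G,)
    signal = (w * y_ctx[None, :]).sum(axis=1) / np.maximum(density, 1e-8)
    h = np.stack([density, signal], axis=1)                   # (G, 2)

    # 3. A CNN on the grid would output mean/std of the latent function at every
    #    grid point; a random linear map stands in for it here to stay runnable.
    W_enc = rng.standard_normal((2, 2)) * 0.1
    stats = h @ W_enc                                         # (G, 2)
    z_mean, z_std = stats[:, 0], np.exp(stats[:, 1])

    # 4. Sample the latent *function* on the grid: one global sample per draw,
    #    which is what makes predictions coherent across target points.
    z = z_mean + z_std * rng.standard_normal((n_samples, grid_size))  # (S, G)

    # 5. Decoder: (another CNN on the grid, omitted here) then interpolate the
    #    grid values back to the target inputs with a normalized SetConv.
    w_tgt = rbf(x_tgt[:, None] - t[None, :])                  # (T, G)
    w_tgt = w_tgt / w_tgt.sum(axis=1, keepdims=True)
    return z @ w_tgt.T                                        # (S, T) sampled predictions

# Toy usage:
x_ctx, y_ctx = np.array([-1.0, 0.0, 1.5]), np.array([0.2, -0.3, 0.8])
x_tgt = np.linspace(-2.0, 2.0, 50)
print(convnp_forward_sketch(x_ctx, y_ctx, x_tgt).shape)  # (8, 50)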


Review 2

Summary and Contributions: Updated: I have read the other reviews and comments and will stick to my current score. The authors build on previous work on Convolutional Conditional Neural Processes (ConvCNPs) and introduce Convolutional Neural Processes (ConvNPs). Unlike ConvCNPs, ConvNPs can capture complex joint distributions over the predictions. In addition to the ConvNP model itself, they present a new training procedure based on approximate maximum likelihood (as opposed to variational inference, which was used for training previous latent-variable Neural Process (NP) models). Finally, the authors provide experimental results on time-series few-shot regression, image inpainting, and an environmental dataset.

Strengths: - The model is a natural extension of the original ConvCNP. The fact that ConvNPs can carry out few-shot regression on data with translation equivariance is a significant contribution. The additional maximum-likelihood training regime makes this an even stronger paper. - Overall the paper is well written, clearly structured, and has good figures. - The experiments are varied and the results look convincing. Normally I would not consider a single baseline to be enough, but in this case attentive NPs are really the main model the authors should be comparing to (along with GPs for tasks that require translation equivariance, as they do), so one baseline seems sufficient.

Weaknesses: The paper is dense in some parts, as it builds on previous work and uses a lot of notation.

Correctness: The methodology seems correct.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: No additional comments.


Review 3

Summary and Contributions: This paper proposes the Convolutional Neural Process (ConvNP), which extends the ConvCNP to model more expressive distributions, especially in spatio-temporal settings. The authors also propose a variational lower-bound approach and an approximate maximum-likelihood training method for learning the ConvNP. Experimental results on toy time-series experiments, image-based sampling and extrapolation, and real-world environmental data demonstrate its advantage over existing baselines, e.g., ConvCNP and ANP.

Strengths: This is very nice work on Neural Processes: it equips NPs with translation equivariance and extends convolutional conditional NPs to learn more expressive distributions. Both aspects are important for modelling more realistic probabilistic distributions. The empirical evaluation is thorough, with good analysis of the experimental results. This is a significant contribution to the area. Overall, this is a good paper.

Weaknesses: Almost none. It would be great if the authors could apply the method to more challenging tasks and datasets, e.g., few-shot learning. This would greatly strengthen the paper.

Correctness: The claims, method, and empirical methodology appear correct.

Clarity: This paper is well written and well motivated. The authors provide a clear explanation of their proposed model and its derivation, and the paper is easy to read and follow.

Relation to Prior Work: Yes, this paper provides a good discussion of related work and its connections to existing work.

Reproducibility: Yes

Additional Feedback: Rebuttal: I have read the rebuttal and keep my score.


Review 4

Summary and Contributions: The paper aims to achieve improved modeling of stationary stochastic processes. The claimed contributions are the following: (1) Proposing the Convolutional Neural Process (ConvNP), which extends Neural Processes (NPs) to use convolutions and extends Convolutional Conditional Neural Processes (ConvCNPs) by introducing latent variables. (2) Proposing a new, biased maximum-likelihood objective for training NPs (to replace the ELBO that is typically used for training NPs). (3) Demonstrating improved performance on multiple tasks (e.g. 1D regression, image completion, a real-world spatial data task).
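On contribution (2), a note on my reading of 'biased' (notation mine, not the authors' exact statement): the objective is the log of a Monte Carlo average over latent samples drawn from the context-conditioned encoder, so by Jensen's inequality its expectation lower-bounds the log predictive likelihood, with the gap shrinking as the number of samples L grows. Using the fact that in the NP family the predictive is the mixture of the decoder over the context-conditioned latent distribution,

\mathbb{E}\Big[\log \tfrac{1}{L} \textstyle\sum_{\ell=1}^{L} p\big(y_t \mid x_t, z^{(\ell)}\big)\Big] \le \log \mathbb{E}_{q(z \mid D_c)}\big[p(y_t \mid x_t, z)\big] = \log p(y_t \mid x_t, D_c), \qquad z^{(\ell)} \sim q(z \mid D_c) \text{ i.i.d.}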

Strengths: (1) The authors argue for combining the benefits of ConvCNP (translation equivariance) with those of NPs (more expressivity via latent variables) and provide a thorough discussion of the advantages. (2) The paper provides a good empirical investigation, comparing the proposed approach with the alternatives, including on a downstream task (albeit a toy problem), and also comparing the maximum-likelihood and ELBO objectives in isolation.

Weaknesses: (1) The approach itself is a combination of two previously proposed approaches (ConvCNP, NPs), so it might have limited impact from the methods perspective. That said, it is a good investigation into the architecture and training objective for NPs. (2) The paper considers baselines such as GPs, ANP, and ConvCNP. However, not all baselines are evaluated on all tasks; e.g. ANP and ConvCNP are not evaluated on the real-data task, nor on the downstream task. Showing that ConvNP is consistently better than the alternatives across tasks would improve the paper.

Correctness: The paper seems methodologically sound. The proposed approach is evaluated empirically on multiple tasks, with both illustrative and numerical results.

Clarity: The paper is well written; it describes clearly the approach, prior work, and relevant terms. Some minor comments:
* In the notation section, the paper should describe what context and target sets are. These are explained later (e.g. in 2.1), but they should be described as soon as they are introduced.
* I like the rainfall example. It would be nice if the example also illustrated the source of randomness in the random function from \mathcal{X} to \mathcal{Y}.
* In Figure 5, the legend reads 'NP UCB' and 'BP TS' even though the approach used is ConvNP.

Relation to Prior Work: The authors combine the benefits of ConvCNP (translation equivariance) with those of NPs (more expressivity via latent variables). The paper adequately distinguishes itself from these works. I'm not familiar with work that makes the same claims.

Reproducibility: Yes

Additional Feedback: The improvement in log-likelihood from the ML objective over the standard NP (VI) objective is much larger for ConvNP than for ANP (e.g. in Table 2). Do you have any intuition for this?