NeurIPS 2020

Sparse Spectrum Warped Input Measures for Nonstationary Kernel Learning

Review 1

Summary and Contributions: This paper considers a non-stationary extension of sparse spectrum GPs by including an input-dependent warping function in the model.

Strengths: The paper is clear, well written, and easy to follow; the theme is relevant to the conference; and the paper includes a wide range of experiments.

Weaknesses: The paper extends well-known methodology by a warping function for non-stationarity. The novelty is limited and the construction itself rather straightforward, but interesting and useful nevertheless.

Correctness: Apart from some vague use of terminology (see detailed comments), I could not spot flaws in the methodology.

Clarity: The paper reads rather well and the examples and visualisations are illustrative. The structure is slightly surprising (Related Work is a subsection of Experiments) and would need some re-structuring.

Relation to Prior Work: The paper covers related work in sufficient detail. Non-stationary methods have received a lot of interest in the past couple of years, and the coverage of recent work could perhaps be improved.

Reproducibility: Yes

Additional Feedback: I generally liked the paper and its extensive experiments. However, the following concerns are worth addressing in the rebuttal.

1. Originality/impact/novelty: My main concern relates to novelty and impact. The idea of warping the inputs of a sparse spectrum GP is rather straightforward, and similar constructions can be applied to other GP approximation schemes as well. I am not aware of previous implementations of this kind in the sparse spectrum setting, but the approach cannot be seen as particularly original.

2. Terminology: The authors talk about varying smoothness with respect to the input. However, the paper is not considering models with varying smoothness (degree of differentiability), but varying characteristic length-scale (by morphing the inputs). The models considered in the paper are typically infinitely smooth (the RBF and sinusoidal bases are used), and it would be interesting to see how the model would work with, say, the exponential covariance function in the GP prior.

3. Restructuring: As mentioned earlier, the structure is confusing. Having the analysis and related work at the end is not unconventional, but the related work should perhaps be its own section and not a subsection of Experiments.

After rebuttal: Thank you for your answers.

Review 2

Summary and Contributions: The paper extends the Gaussian process literature, specifically sparse spectrum GPs, by proposing to learn nonstationary GP models using input warping. To this end it introduces a novel self-supervised, input-dependent warping function. The paper also develops a self-supervised training scheme for representing this warping function that makes use of pseudo-training points, which are themselves hyperparameters (initially set to a prior). The paper then naturally extends this to multiple layers of warping, each successive layer leading to a response variable that is more stationary than with the preceding layer alone.

Strengths: Soundness of claims: the paper claims to develop a general input-dependent, measure-valued warping to learn nonstationary kernels. It also claims to validate the new approach on a wide variety of datasets and compares against prior work. With the detailed math provided in the paper and an indeed extensive experiments section, these claims appear to be well validated.

Significance and novelty: the paper extends the work on sparse spectrum GPs and enables the use of nonstationary kernels, especially in regression tasks. Given the ubiquitous presence of nonstationary stochastic processes, this is significant work.

Relevance to the NeurIPS community: Gaussian processes are well studied and popular in the machine learning community and have been used to study a variety of tasks, including Bayesian optimization and neural networks. As such, extensions to GPs are likely to be well received by the community.

Weaknesses: I was not able to spot any glaring weaknesses in the work, although that might be due to my relative lack of expertise in this area.

Correctness: See Strengths.

Clarity: The paper is well written for the most part with minor typos which are identified in additional comments.

Relation to Prior Work: As far as I can tell, yes. The authors have established the connection to standard GPs as well as sparse spectrum GPs, which they improve upon. The related work section and the experiments section compare against and cover several other notable works.

Reproducibility: Yes

Additional Feedback: Great job putting together an impressive work despite current stressful times. Congratulations! In addition to the above comments, there are some minor typos identified here:

- line 38: the measure after the comma seems redundant.
- line 85: "out the front" is not needed.
- line 275: "this is" should be "that is".
- line 276: "this perspectives" should be "this perspective".

UPDATE: points raised by other reviewers have convinced me to downgrade my rating to 6.

Review 3

Summary and Contributions: This paper extends the sparse spectrum approximation for tractable GP kernel learning to incorporate non-stationary kernels by warping the input space into a new space where a stationary kernel is appropriate, resulting in a non-stationary kernel via composition. The approach benefits from the scaling complexity afforded by the sparse spectrum formulation, and these efficiencies are maintained by specifying the warp as a particular locally affine transformation. Whilst multiple levels of warping no longer possess a direct closed-form solution (under the sparse spectrum approximation), the authors also propose a moment-matching approach to approximate the results for multiple levels of warping.
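The composition the review describes, a stationary sparse spectrum (random Fourier feature) kernel evaluated on warped inputs, can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the fixed tanh warp and the unit-lengthscale RBF spectrum are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def ss_features(x, omega, phases):
    """Sparse spectrum / random Fourier features: phi(x) @ phi(x').T ~= k(x, x')."""
    m = omega.shape[0]
    proj = x @ omega.T + phases          # (n, m)
    return np.sqrt(2.0 / m) * np.cos(proj)

# Stationary RBF kernel in the warped space: spectral points drawn from N(0, I).
m = 500
omega = rng.standard_normal((m, 1))
phases = rng.uniform(0.0, 2.0 * np.pi, m)

def warp(x):
    """Toy warp: roughly the identity far from the origin, but stretching
    inputs near x = 0, so k(warp(x), warp(x')) is nonstationary in x."""
    return x + 0.8 * np.tanh(2.0 * x)

x = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
phi = ss_features(warp(x), omega, phases)
K = phi @ phi.T                          # approximate nonstationary kernel matrix
```

Because the warp stretches the region near the origin, equally spaced inputs there end up further apart in the warped space, so their approximate kernel values are smaller than for the same spacing in the flat region.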

Strengths:
+ The limitations of assumptions of stationarity have often been identified as a weakness in the GP literature, and moves to provide efficient non-stationary kernel formulations are consequently of interest to the community.
+ The linear scaling conferred by the sparse spectrum approximation makes the method highly desirable for larger datasets where direct approaches are prohibitive.
+ The formulation allows gradients to flow throughout, making it possible to learn the warping and kernel hyperparameters simultaneously.
+ The experiments run some favourable comparisons with a number of related approaches to specifying effectively non-stationary kernels in GP frameworks.

Weaknesses:
- The input space warping approach to providing non-stationarity in this context has clear parallels with the deep GP formulations, and it was surprising that a discussion of this was left until the very end of the paper and only really dealt with at a cursory level. There is much discussion in the literature about the difficulties of maintaining uncertainty when warping the input space for stationarity, and this discussion ties in with injectivity. For example, Duvenaud et al. discuss the potential pathologies, and the deep GP thesis of Damianou also considers the collapse of uncertainty. A source of difficulty lies in the lack of injectivity in the warping layer. The results presented in this paper seem to follow the discussion of Salimbeni et al. in the DSDGP, in that initialising the layers close to the identity is critical to ensure that a good local optimum is obtained (this ties in with the discussion of the affine initialisation in this work). However, the "close to injective" solutions are not guaranteed. The methods based on dynamical systems, e.g. Hegde et al. AISTATS 2019 and Ustyuzhaninov et al. AISTATS 2020, can provide injectivity guarantees, and it would seem important to have a discussion around this issue and to provide comparisons to the dynamics approaches in the empirical evaluation.

Correctness: The theory and derivations are all clear. The illustrative experiment in Fig. 1 is interesting - the warping functions in (b) show that the confidence intervals for the g() and h() posteriors do not cover the inducing data well - could the authors please comment on this? I would argue that comparisons to the dynamics versions of input warping (e.g. [40]) are also missing from the experimental evaluation, and I cannot see any justification in the text for their absence - could the authors please comment on this?

Clarity: I found the paper to be clear and logical in its presentation apart from the delayed discussion of related work.

Relation to Prior Work: The relation to prior work in the sparse spectrum approximation family is well covered; however, my main reservation with the paper is the discussion with respect to the deep GP literature and the dynamics family of input warping, which I believe merit further discussion and comparisons.

Reproducibility: Yes

Additional Feedback: It is interesting that the use of the pseudo data for determining the warps brings together the two different sparsity approaches (sparse spectrum and inducing points) into a single model. Would the authors like to comment on the decisions for the different approximations in the different parts? Did they try the approximations the other way around, or using the same approach for both?

In addition to the injectivity issue discussed above, it still doesn't seem clear from the literature what a suitable inductive bias should be for a non-stationary kernel - e.g. too much local freedom will remove the ability to reject noise. I found the discussion of this in section 4.1.1 to be quite brief, but it seems to be a critical issue in the wider application of such models - could the authors provide more evidence as to why this particular specification of a local Euclidean manifold is appropriate? It would seem that the specific priors on the g() and h() functions would be more important towards this bias than simply adding additional layers - is this not the case? I was expecting to see experiments focused on these priors rather than on adding additional warping layers, which would seem to move even further away from an intuitive prior (especially as additional approximation errors are introduced). Finally, I would guess that the argument about the local Euclidean structure would also play into the dimensionality of the warped space - is there a reason we should expect this to be the same as the original input space? Would there be merit in using ARD stationary kernels to allow the output GP to control the dimensionality of the warped input space, similar to a deep GP model?

Overall I found the paper to be interesting, and adding this property to the desirable sparse spectrum approximation family is clearly beneficial. My recommendation would be more favourable if the discussions with respect to the existing literature and the challenges of warping input spaces were addressed by the authors. Line 207: Do the authors mean Figure 3?

UPDATE: Thanks to the authors for their detailed rebuttal. I think some interesting points were raised, and the paper would definitely benefit from including the discussion points around injectivity and the priors (although I can appreciate that the priors are probably too big a topic for this work). I have raised my rating accordingly. A discussion about the potential collapse would be beneficial - it is not desperately clear to me that the effects will be negligible for shallow models, so more explanation/justification about this in the final version would be appreciated. I agree with other reviewers that the related work should really be discussed earlier in the paper, as it sets the context for the presentation of this model. I also still think it would be beneficial to have a comparison against one of the dynamic warping approaches - they may have a different paradigm, but they are seeking to achieve precisely the same goal, so it would be interesting to know the trade-offs between the approaches.

As an aside, to continue the discussion (unrelated to the review rating): I take the point about some latent mappings not wanting to be injective (e.g. discriminative models where one might desire certain invariances); however, in a generative model I would argue one would usually seek equivariance, which would probably want to avoid complete knots in the mapping. Given the title's focus on non-stationarity I accept the argument for both sides, but in terms of non-stationarity through warping I think injectivity is still a key point of discussion.

Review 4

Summary and Contributions: The paper proposes an adaptation to sparse spectrum Gaussian process (SSGP) models that incorporates input warping into SSGPs. The paper considers the problem of modelling nonstationary data, where processes may change significantly over the input domain, motivating adaptive kernels. The model learns an input-dependent local affine transformation of inputs where components of the transform are also Gaussian processes. The top-level GP can then be represented as a stationary process on the transformed (warped) inputs. By using this kind of transformation, the authors are able to extend closed-form expressions for SSGPs to the warped case that generalise to an arbitrary number of warping levels. Experimental evaluation finds the approach (termed SSWIM) performs favourably against benchmarks over a number of real datasets.
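To make the "input-dependent local affine transformation" concrete, here is a hedged 1-D sketch in numpy. RBF-weighted interpolation of per-center coefficients stands in for the GP-distributed warp components of the actual model; the function and parameter names (`local_affine_warp`, `slopes`, `offsets`) are illustrative, not from the paper.

```python
import numpy as np

def local_affine_warp(x, centers, slopes, offsets, lengthscale=0.5):
    """Input-dependent affine warp w(x) = A(x) * x + b(x).

    A(x) and b(x) vary smoothly with x via normalised RBF interpolation
    of per-center coefficients -- a stand-in for the GP-distributed
    components described in the summary above."""
    # Normalised RBF weights over the centers: shape (n, n_centers).
    w = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / lengthscale) ** 2)
    w = w / w.sum(axis=1, keepdims=True)
    A_x = w @ slopes      # input-dependent slope A(x)
    b_x = w @ offsets     # input-dependent offset b(x)
    return A_x * x + b_x

# Stretch inputs near the left center (short effective length-scale) and
# compress them near the right one (long effective length-scale), so a
# stationary kernel on w(x) behaves nonstationarily in x.
centers = np.array([-1.0, 1.0])
slopes = np.array([3.0, 0.5])
offsets = np.array([0.0, 0.0])
wx = local_affine_warp(np.array([-3.0, 3.0]), centers, slopes, offsets)
```

Far to the left the warp is approximately 3x, far to the right approximately 0.5x, so the same input spacing corresponds to very different warped distances on the two sides.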

Strengths: The proposed approach is theoretically sound and well supported by experiments with a reasonably comprehensive set of benchmark methods. The use of the input-dependent affine transformation provides a balance of expressiveness (sufficient for the examples given) and efficiency that appears well suited to a range of applications. The efficiency in particular is an advantage of this method, as it allows capture of input-varying processes without resorting to very expensive models, and integrates this with scalable SSGP methods.

Significance: medium
Novelty: medium

I think this technique is relevant to the wider GP community and is especially helpful for applications on large datasets.

Weaknesses: The authors argue that a key limitation of existing nonstationary spectral GP models, warped input models, and other rich models such as deep GPs is their potential to be too expressive (to overfit) and costly. However, the results analysis is focused on summary statistics of predictive accuracy. There is little discussion of the relative complexity of the benchmark methods or more detailed comparison of model performance. For example, the supplement includes an overfitting analysis for SSWIM, but it is hard to interpret this without comparable overfitting analyses for competing methods. Relatedly, it would be helpful to have more detail on the experiment benchmarks in the text rather than leaving this to the code. In particular, given that more expressive models can usually be "dialed down" to reduce both cost and overfitting, further design details for the benchmarks would add support to the modelling results.

Correctness: Claims and method appear correct.

Clarity: The paper is well written, with most key points clearly described. Some kind of model diagram would be very useful. The opening of the Introduction (paragraph 1) only discusses the context of GP models utilising stationary kernels; some mention of previous attempts to tackle processes that vary over the input domain would seem appropriate here. Notation: tilde-x and hat-x notation is introduced at Eq. (10) without prior definition - clarification would be helpful. I found the use of X and Y notation for the pseudo-training points a bit muddy - consider changing to something more distinct from the raw data. Minor typos/grammatical errors: lines 35, 38, 260.

Relation to Prior Work: As per my comments above, I felt that the detail on competing methods, and the support for explanations as to why differences are observed, was a little thin.

Reproducibility: Yes

Additional Feedback: After rebuttal: Thank you for the further information provided in the rebuttal. Considering my own concerns in combination with those of the other reviewers, I have downgraded my overall score to marginal accept.