NeurIPS 2020

### Review 1

Summary and Contributions: The paper proposes a methodology to jointly fit the parameters and the loss function of generalised linear models (GLMs). Related approaches parameterise the loss in terms of an invertible link function (i.e. a monotonic function that maps [0,1] to R). While existing methodologies attempt to learn the link function directly, the authors propose to fit what they call the "source function", which is defined to be a differentiable strictly increasing function. It is shown in the paper that it is possible to construct a proper loss by means of the composition of a learnable source function and a canonical link. The source function is claimed to be more convenient from a computational point of view, as it only imposes monotonicity; the source can map R to R (as opposed to the link function). A GP regression scheme is proposed in order to perform inference for the source function. A particular prior is considered in the space of functions: the integral (up to x) of a squared GP (ISGP). The authors discuss the theoretical properties of the prior induced by the ISGP, and they develop a Laplace approximation scheme to approximate the non-Gaussian posterior; the GP hyperparameters are handled by means of marginal likelihood maximization. They apply the scheme in terms of a GLM, where the model parameters and the source function are trained using Expectation Maximisation (EM). In the experiments, the authors demonstrate the fitting of the source function and the actual learning of a classification loss on both toy and real-world datasets.
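The ISGP construction described above — integrating the square of a GP sample path up to x — can be illustrated with a minimal sketch. The RBF kernel and grid below are assumptions for illustration only (the paper itself uses a trigonometric kernel); the point is that squaring makes the integrand non-negative, so the integral is non-decreasing and hence a valid source function by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid on which to sample a latent GP path.
x = np.linspace(-3.0, 3.0, 200)

# RBF kernel (an illustrative assumption; the paper uses its own
# trigonometric kernel instead).
def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

K = rbf(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
g = rng.multivariate_normal(np.zeros(len(x)), K)  # one GP sample path

# ISGP sample: integrate the squared path up to each x (trapezoid rule).
# The integrand g(t)^2 is non-negative, so u is non-decreasing,
# i.e. a valid source function by construction.
u = np.concatenate(
    ([0.0], np.cumsum(0.5 * (g[1:] ** 2 + g[:-1] ** 2) * np.diff(x)))
)

assert np.all(np.diff(u) >= 0)  # monotone, as a source must be
```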

Strengths: The paper discusses a novel approach to learn a loss function that I think would be of interest to the community. The parametrisation of the loss by means of the source function is novel and well-motivated. The authors use the results of recent literature to show that it is possible to reconstruct any proper loss function as the composition of a canonical link and a source (Eq. 4). Also, they propose a principled inference method to capture these source functions. There is an extended theoretical discussion of the properties of the proposed ISGP prior. It is shown that the sample paths of the prior are valid source functions, and it easily follows that the posterior samples are valid sources too (Lemma 1, Theorem 3). In my opinion, this modelling of the loss in combination with the inference scheme differentiates the paper from the previous literature. The parametrisation proposed is more 'tailored' to the requirements associated with fitting a loss function, and the experimental results seem to support this claim.

Weaknesses: My biggest concern regarding this work is readability. The authors could benefit from some extra space to provide some intuition regarding the concepts they introduce. I am a bit confused about Theorem 8. What is the distribution of \vu in the expectation on the left-hand side? Is this the true posterior distribution? If that is the case, then the loss induced by the approximate posterior is an upper bound to the loss of the true posterior, which would be remarkable.

Correctness: I think that the claims and the mathematical derivations in the paper are correct. The methodologies used to perform inference, including the Laplace approximation scheme and the expectation maximization algorithm, are appropriate for the problem at hand.

Clarity: I think the paper is difficult to understand for machine learning researchers who are not very familiar with the topic. Also, I appreciate the popular culture reference of the title, but I don't think it clearly reflects the content of the paper.

Relation to Prior Work: The authors clarify that the paper is effectively an extension of a very recent piece of literature [NM20]. Compared to the work in [NM20], the authors of the current paper employ a Bayesian regression method (ie the ISGP scheme) in the place of the optimisation scheme of [NM20]. Also, inference is performed on the source function, rather than the link as in previous works. These contributions are clearly stated.

Reproducibility: Yes

Additional Feedback: Post Rebuttal Feedback: I thank the authors for their response. Personally, I am in favour of a more technical title. Regarding Vapnik's quote, Yann LeCun refers to it as an "inside-inside" joke. So I think that the current title is way too cryptic.

### Review 2

Summary and Contributions: This paper proposes a method to augment proper loss functions when evaluating a classifier. Instead of using a single proper loss on the fitted function, this paper considers using a source function to map the fitted function to a set of latent functions before calculating the loss, thus allowing the final loss to be augmented. A monotonic GP is applied to model the source function to ensure the invertibility between the fitted function and the latent functions, thus keeping the properness of the original loss.
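The composition the summary describes — fitted score, through a strictly increasing source, through the canonical (logistic) link, then through a proper loss — can be sketched as follows. The particular source function here is a hypothetical stand-in (any strictly increasing R-to-R map would do), not the one inferred by the paper's GP scheme.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    # Standard proper log-loss for a binary label y in {0, 1}.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def source(f):
    # A hypothetical strictly increasing source u: R -> R
    # (derivative is 1 + 0.5*sech^2(f) > 0 everywhere).
    return f + 0.5 * math.tanh(f)

def composite_loss(y, f):
    # Map the fitted score f through the source, then through the
    # canonical logistic link, then apply the original proper loss.
    return log_loss(y, sigmoid(source(f)))
```

Because the source is strictly increasing (hence invertible), the composition preserves the properness of the original log-loss, which is the invertibility argument the summary refers to.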

Strengths: 1. The direction of the work is interesting. It has been a default setting in the machine learning community to pick a single proper loss for probabilistic classifiers. This paper points out a way to enrich a given proper loss without affecting its properness. 2. The proposed method is constructed and proved on a body of related work on proper losses / margin losses / link functions, and is hence theoretically sound.

Weaknesses: 1. While both the research direction and the solution are interesting, the title and certain statements in this paper seem to be over-claimed. First of all, the title is "All your loss are belong to Bayes", which is not very informative about the paper's content. To my understanding, given a particular proper loss, the proposed method allows augmenting that loss by inferring a set of source functions. This means the final obtained loss still depends on the original proper loss. So the proposed method is unlikely to perform inference across different types of proper losses, such as the Brier score and log-loss (I am not saying it is impossible, but it requires further mathematical demonstration to see under what circumstances different proper losses can meet). Also, the proposed method seems to work only in a binary setting, which is limited compared to the general setting of proper scoring rules. 2. Although this paper is more on the mathematical and theoretical side, the empirical evaluation is relatively weak. All the classification experiments use only the logistic link, which is limited to a particular kind of proper loss. Experiments with more related loss/link functions would make the results stronger, as the paper claims to contribute a general framework for inferring proper losses. Also, while the paper suggests the approach improves "consistency, robustness, calibration and rates", none of these benefits are formally defined and checked in the experiments. For instance, outputting calibrated probabilities is a key property of classifiers and is closely related to proper losses. It would be better if the authors checked certain metrics, such as expected calibration error, to evaluate the proposed approach.
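The calibration check suggested above is straightforward to compute. Below is a minimal binned expected calibration error (ECE) sketch; the bin count and the equal-width binning scheme are illustrative choices, not anything prescribed by the paper under review.

```python
def ece(probs, labels, n_bins=10):
    """Binned expected calibration error for binary predictions.

    probs: predicted probabilities of the positive class, in [0, 1].
    labels: true labels in {0, 1}.
    """
    total, n = 0.0, len(probs)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Last bin is closed on the right so p == 1.0 is not dropped.
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if not idx:
            continue
        conf = sum(probs[i] for i in idx) / len(idx)  # mean confidence
        acc = sum(labels[i] for i in idx) / len(idx)  # empirical accuracy
        total += len(idx) / n * abs(acc - conf)       # weighted gap
    return total
```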

Correctness: As described above, I agree that the proposed method is useful and has potential benefits over existing approaches. However, the paper title and some statements in the paper might need adjustments.

Clarity: While a reader with a related background can understand most of the paper, this paper is quite heavy on math, and some of the notation can be confusing while reading. It might be useful to add a table of notation in the supplementary.

Relation to Prior Work: Yes, the discussion on existing work and the differences from them are relatively clear.

Reproducibility: Yes

Additional Feedback: I appreciate the effort on the theoretical work in this paper; the general ML community might benefit more if this paper included more empirical analysis/demonstration of the difference between the proposed approach and existing proper scoring rules. At the moment, it seems to be a complex and expensive method without detailed pros and cons. It would also be better if the authors could be more accurate with the paper title, as the current one seems too broad.

=========== AFTER AUTHOR FEEDBACK ===========

Following the discussion with the other reviewers, I increased my score to 6, as this paper provides a nice extension within a particular area (e.g. NM20). While I do appreciate the idea and agree it brings interesting insights for investigating proper losses, I am still not fully convinced that I should switch to the proposed approach from existing proper losses (e.g. providing higher accuracies might also make the model vulnerable to predicting uncalibrated probabilities). I also feel there is some miscommunication between the authors and me, as shown in the feedback. Please see some further explanations below. 1. The authors replied that their approach could work with any proper loss. However, I wasn't questioning this. I was wondering whether the proposed method can build links between different proper losses. For instance, the Brier score and log-loss are quite different losses, and I thought maybe the proposed augmentation could somehow merge them to give a more generic view of these losses. 2. The authors said the new loss could work on a multi-class problem via one-vs-rest / one-vs-one, which agrees with my comment that the proposed loss can only work in a binary setting. Proper losses like the Brier score are supposed to be truly multi-class and do not require such repeated training to work on more than two classes. 3. The authors suggested their method is not expensive, as the inference can be quick. My point was that the proposed method is more expensive than simply applying existing proper losses, which do not require any optimisation steps at all.

### Review 3

Summary and Contributions: This paper proposes a novel approach to estimate the loss function in classification problems, and it derives theoretical conditions and analyses about the problem of loss estimation.

Strengths: This is an important problem, for which some approaches have appeared in the earlier literature of statistics. This paper proposes a fresh look at the problem and some solid theory around the proposed approach.

Weaknesses: I believe that the presentation of the paper could be improved. I also think that the paper could benefit from the use of more modern inference techniques for Gaussian processes. The experiments could also feature some more compelling examples where inference for the loss could lead to considerable improvements in performance. Finally, a clear algorithmic breakdown of the proposed approach would help clarify what the Authors implemented exactly.

Correctness: As far as I can see, the method is correct. The empirical methodology could be improved by providing more comparisons and analyses.

Clarity: Title and abstract would benefit from a rewriting. I find the title nonsensical.

Relation to Prior Work: I think that the paper is well positioned within the literature on the topic.

Reproducibility: No

### Review 4

Summary and Contributions: The paper learns the loss function in a Bayesian manner by decomposing the loss function into a link and a source, and then using the ISGP to model the source. Inference with the ISGP is done via Laplace approximation, and a trigonometric kernel is proposed as a practical means to implement the model.
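For readers unfamiliar with the Laplace approximation mentioned above, here is a one-dimensional sketch: approximate a posterior p(w) proportional to exp(-E(w)) by a Gaussian centred at the mode, with variance 1/E''(mode). The energy function below is a toy choice (Gaussian prior plus a logistic likelihood term), not the paper's actual ISGP posterior.

```python
import math

# Toy energy E(w) = -log p(w) up to a constant:
# Gaussian prior term + one logistic-likelihood term (illustrative only).
def energy(w):
    return 0.5 * w**2 + math.log(1.0 + math.exp(-3.0 * w))

def d_energy(w):
    return w - 3.0 / (1.0 + math.exp(3.0 * w))

def dd_energy(w):
    s = 1.0 / (1.0 + math.exp(-3.0 * w))
    return 1.0 + 9.0 * s * (1.0 - s)  # always positive: E is convex

# Newton iterations to find the posterior mode.
w = 0.0
for _ in range(50):
    w -= d_energy(w) / dd_energy(w)

# Laplace approximation: N(mode, 1 / E''(mode)).
mode, var = w, 1.0 / dd_energy(w)
```

In the paper this is done over the latent GP values rather than a scalar, but the mode-plus-curvature idea is the same.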

Strengths: The paper has a strong foundation in theory (section 2). It proposes a GP-based model to implement the theory, with a suggestion of the inference method. This will serve as a useful reference point for future work in this area. One novel contribution could additionally be stated in the introduction: the ISGP itself, which may be of independent interest.

Weaknesses: W1. Is the trigonometric kernel universal? This is neither clearly stated nor proved in the paper. W2. Although universal kernels are "nice to have" (Theorem 3), are they really crucial to the applicability of the model? The authors have not expressed their opinion on this explicitly, though Theorem 2ii seems to imply that it does not matter for classification. W3. The paper only provides the trigonometric kernels, and it seems that other kernels are rather hard to use. Does this limit the applicability and flexibility of the method? W4. The OU kernel has rather easy eigenfunctions and eigenvalues (see appendix B.6.2 of Cha10). Though it is not a universal kernel according to Theorem 17 of MXZ06, it should also be investigated empirically. W5. Since the paper focuses on losses, the experiment on isotonic regression in section 5 seems like a distraction; it could be moved to the supplementary material. I would prefer more experiments on learning the losses, and also analysis of the results. See also C5 below. [Cha10] Chai. Multi-task Learning with Gaussian Processes. PhD thesis, 2010.

Correctness: Mostly. Though it is unclear whether the trigonometric kernel is universal, and if it is not, then the theory (Theorem 3 requires a universal kernel) is currently not supported by a model.

Clarity: Mostly. Some comments: C1. Figure 1 is rather hard to interpret because the terms are not explained within the caption itself. The general term "Correspondence" in the caption does not help. C2. In line 137, it would be good to inform the reader where "later" is. It seems to be Theorem 5 and section 4.3, but I am not sure. C3. In the GP community, "translation invariant" (line 172) is more commonly known as "stationary". C4. Line 233: If the prior is a GP, then this is not a construct for a proper composite loss, as implied by Definition 1 (line 120) and noted on lines 291/291. It is best to bring this upfront at line 233. The outcome is simply a GP classification model with a very peculiar kernel that first projects $x$ onto a line. C5. The references to the supplementary material (SM) in Section 5 make the main paper totally dependent on reading the SM, which defeats the entire purpose of having a proper main paper and an SM. If the authors deem the material in the SM so important, it should be moved to the main paper. In fact, Figures 8 and 9 should be moved to the main paper, since they are central to the paper.

Relation to Prior Work: I suggest the authors also discuss the following two works in relation to theirs: Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian Processes. In Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03). MIT Press, Cambridge, MA, USA, 337-344. Miguel Lázaro-Gredilla. Bayesian Warped Gaussian Processes. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1 (NIPS'12). Curran Associates Inc., Red Hook, NY, USA, 1619-1627.

Reproducibility: Yes

Additional Feedback: A1. Not pertaining just to this paper: this area of work seems to be focused on the latent model being linear, but with a flexible loss function (GLM models). How does it compare with a flexible latent model (e.g., a GP) but a fixed loss function?

Comments after rebuttal
===================

Thank you for the reply. I have just one more comment to make. It seems to me that the Nyström method is more general than the trigonometric kernel, which seems to be an invention just for the ISGP. In this case, I would prefer the Nyström method to be in the main text and the trigonometric kernel in the SM if there is a lack of space. Also, I forgot to mention that you should also relate the trigonometric kernel to some of the Fourier expansion approximations of GPs in the literature.