NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8286
Title: Bayesian Layers: A Module for Neural Network Uncertainty

Reviewer 1

EDIT post-rebuttal: Thanks to the authors for responding to my comments. I am still voting for acceptance of this paper.

This paper is about a software component, called Bayesian Layers, that allows for consistent creation of deep layers that are associated with some form of uncertainty or stochasticity. The paper outlines the design philosophy and principles, shows many examples and concludes with new demonstrations of Bayesian neural network applications.

I find that this work is on a significant topic, since software for Bayesian (deep) learning models lags significantly behind. Integration and drop-in replacement with traditional architectures seems like the right avenue to pursue, and is a strong motivation for this approach.

I also think that this work is sufficiently original, relative to what one could expect from a software component. I find that the relation to Pyro's random module is strong, perhaps stronger than discussed in the paper, when it comes to the fundamental concepts behind it (for this statement I am assuming that there is no fundamental reason why Pyro couldn't be extended with recent estimators using the current random module's architecture, but I am not an expert). When it comes to practicality, though, I find that Bayesian Layers is sufficiently different, more consistent and more extensive. In any case, the community needs more work on this kind of module. Having said that, a comparison to Pyro (e.g. more discussion or side-by-side snippets) would be useful.

Regarding quality, it is in general difficult to criticize software in the same way as a mathematical idea: ultimately, in software it's all about trade-offs in functionality. Nevertheless, I find that from the technical viewpoint, the design choices are very reasonable, offer useful features and improve upon past approaches (e.g. Pyro's random module) in certain aspects.
The consistent use of layers with traditional architectures and software components makes the approach very useful, readable and scalable. The separation of model and inference is an important aspect for the Bayesian community, and it is indeed unfortunate that the default use of Bayesian Layers obscures this separation. However, it is also true that no one design can fulfill all requirements, and the authors admit that this is deliberate, to facilitate other important aspects of their framework. Indeed, the modularization of inference per layer is a very interesting idea; however, for such an important design choice, I feel that the paper does not clearly explain all the reasons why it is beneficial. A more concrete side-by-side demonstration of its advantages would be welcome. As a side note, it is also not entirely clear how the approach in section 2.5 separates model and inference in principle. Could the authors provide a more extended snippet that uses the model and the posterior?

As for presentation, in this type of software paper the usual issue of trading off details and abstraction is exacerbated. However, I feel that the authors did a good job exposing the correct amount of detail. Having said that, the flow between high-level and low-level material is not very smooth, and at times the paper assumes too deep a knowledge of Keras, TensorFlow and recent Bayesian ML methods (e.g. lines 22-40). The set of readers who are in-depth experts in all of the above is smaller than the target audience. Therefore, I suggest a background section listing the basic topics before diving deeper into the details. The authors use a variety of ML models to demonstrate the concepts. Although this is more difficult to follow, I feel that it is ultimately more instructive, and also better demonstrates the range of applications for Bayesian Layers. Related work is generally discussed thoroughly.
Overall, the presentation in this paper is honest (starting with what it is: code), and gets right to the point. I think that the paper would benefit from the following central discussion: Bayesian models are still not as widely used in practice as DNNs. Some say that this is due to current software capabilities. Is Bayesian Layers a reply to this? In my opinion, it seems that there is more missing, whether in the inference methods, practical tricks or software. It would be interesting to see a discussion about this here. The paper shows *how* to use Bayesian Layers, but not as much *why*. This is more important than in theoretical papers, because by (some) definition a software paper is about practicality. More than simply a discussion of the above, it would be nice to have a demonstration, e.g. with stronger, previously unthought-of RL results. However, that would be an impressive bonus, and I understand that it goes beyond the scope of the paper.

Reviewer 2

-- Paper Summary --

This paper describes a comprehensive extension to Edward2, built upon TensorFlow, which permits the seamless inclusion of non-deterministic layers for constructing deep probabilistic models. The API is similar to that used for standard neural network layers, but permits the inclusion of uncertainty on a variety of components such as the weights in a layer, any associated parameters in the activation function, as well as the inclusion of Gaussian process layers (for which a variety of formulations are included). This module not only makes it easier for practitioners to construct such probabilistic models, but also exploits the processing power of state-of-the-art TPUs for implementing large-scale models such as the Bayesian transformer described in the paper.

-- Writing/Clarity --

The paper is well-written and methodical in its approach. The related work is properly described, and I found the individual subsections on different layer options to be informative and useful. On the downside, I found the code snippets to be fairly messy - the presentation is quite poor with some instances of overlapping text, and I highly encourage the authors to rethink their presentation. I am also not a fan of the current side-by-side placement of the figures, which is hard to follow at times.

With particular emphasis on the paper title and the first line of the abstract, I am slightly puzzled by the insistence on referring to the module as catering for ‘neural networks’ when it also enables the inclusion of Gaussian process layers, and consequently the construction of deep Gaussian processes. Of course we could debate the connection between Gaussian processes and infinitely-wide neural networks, but this generalisation appears misleading to me in this context (and possibly also undersells the broader functionality provided by the proposed module).
It may also be helpful to include a more detailed breakdown of the various options available for some of the arguments in the method signatures. For example, the local reparameterisation trick is mentioned in the description of one of the experimental set-ups, but it would be nice to have a more comprehensive list of the available options for each parameter (perhaps in the appendix). I appreciate that this verges on providing full code documentation, but I think it would be nice to have more of this included in the paper, along with a list of references for the papers that originally proposed the implemented/featured techniques.

Some additional minor complaints:
- There are some typesetting issues in the references, where words such as ‘Gaussian’ appear in lower-case. I noted some papers cited here as appearing on ‘arXiv’ which have since been published at either ICLR or ICML 2019 - please double check and update accordingly.
- A few typos/preferences: L29: monolothic -> monolithic; L32: ops -> operations; L85: abbreviated last names in citation; L207: In ‘the’ experiments.
- The phrase ‘whether it be the weights, pre-activation units, activations’ appears in some form or another at least 3 times in the first two pages. While I appreciate the authors’ intent to drive the message home, it unintentionally comes across as overly repetitive.

-- Originality and Significance --

The importance of Bayesian inference in practical applications of machine learning has recently become more prominent, and the work presented here is certainly a step forward in facilitating its use. As the authors clearly illustrate in the supplementary material, implementing Bayesian variations of neural networks and other models using frameworks such as TensorFlow and PyTorch typically requires a substantial amount of tweaking (often bordering on ‘hacks’), whereas the proposed module could help abstract away such complex implementations.
Homogenising techniques such as Bayesian neural networks and Gaussian processes under a single framework should also have positive repercussions by encouraging future work on these topics to include broader experimental comparisons of both methods. While reading the paper, I was wondering about the possibility of incorporating the recent work by Sun et al. 2019 in this set-up, so I was very pleased to see this listed as a direction for future work. I also liked that the authors picked reinforcement learning as a use-case for demonstrating the scope of this work. Given the ballooning popularity of this field, showing how Bayesian Layers enable the extension of these methods to the Bayesian setting should be a great entry point for both ‘Bayesians’ interested in dipping their toes into reinforcement learning, and vice versa. The related work section is comprehensive and I appreciated the segment dedicated to describing the differences to Pyro, which would be the primary competing mechanism available on PyTorch. However, I was surprised that there was no mention of the MXFusion package (Dai et al., 2018), which completes the trifecta of modular probabilistic modelling by offering an implementation for MXNet. To the best of my understanding, this package bears greater similarity to the work presented here due to the inclusion of Gaussian processes and a similar notion of ‘inference modularity’ relying on variational inference. I would expect to see any connections explored in greater detail given the similar nature of this work. Instead, it is currently conspicuous by its absence. Work by Cutajar et al (2017) on deep Gaussian processes also merits a mention in the discussion on GPs for being one of the first practical instances of exploiting Tensorflow for implementing large-scale DGPs. 
-- Technical Quality/Evaluation --

There are few theoretical elements to comment on here, but I found most of the discussion on the implementation details of this module sufficiently clear to follow. Perhaps including some more background information on Edward2 could be useful for readers who aren’t immediately familiar with what it currently provides. As highlighted earlier in my review, however, the presentation and content of the code snippets could definitely be improved. I think it’s also important to specify the degree to which this work extends beyond simply building an API around existing functionality and implemented models - at the moment the extent of the contributions is not entirely clear.

Although the experiments showcasing the module are diverse and sufficiently convey the broad scope of its potential use, I feel as though the paper is missing some degree of benchmarking with regard to both scalability across hardware and model parallelisation. While I appreciate that this is not immediately within the scope of this work (which rather relies on the fundamentals of TensorFlow for these inner workings), it would be interesting to see whether a direct comparison against Pyro and MXFusion can be carried out in this regard. In its current format, the experimental evaluation feels fairly isolated in simply showcasing the functionality of the module, with little external context. It is also slightly unclear to me how existing frameworks such as GPflow are positioned in relation to this module - while I appreciate that the scope of GPflow is much greater than the brief appearance of GPs featured here, I am still curious to understand whether, say, the DGP of Salimbeni et al. (2017), which can be constructed here, is just as good as the original implementation.
-- Overall recommendation --

I would be hard-pressed to classify this paper as ‘essential reading’, but I also believe that it successfully describes the module in a succinct manner, while also giving potential readers and conference attendees a better incentive for incorporating probabilistic modelling in their workflows. There are a few disappointing aspects which I highlighted in the review - aside from some other minor issues, the messy inclusion of code snippets is a sign of carelessness when preparing the submission. While effective in showcasing the diversity of models which can be tackled using the proposed module, the experimental section also verges on being more of a ‘demo’ than a critical evaluation. I am currently giving this submission a relatively ‘modest’ score, but would be keen on raising this score following a convincing rebuttal.

** Post-rebuttal update **

Thank you for your response! The rebuttal targets the majority of concerns listed in the reviews, and also clears up some of the more muddled aspects of the paper. If accepted, there are some points which require particular attention, especially clarifying the similarities to related work (all reviewers had issues with this aspect) and highlighting the contributions further. Papers describing software toolkits are always faced with greater scepticism, which makes it essential to clearly emphasise the contributions of the paper. Likewise, the inclusion of additional comparisons and benchmarks will elevate this from seeming like a standard technical report or documentation to a proper paper. Coupled with the presentation issues highlighted across reviews, I believe there is still some work to do, which is why I am not increasing my score. However, I also think that this work could benefit from the increased attention enabled by NeurIPS, and hopefully encourage more streamlined model implementation and evaluation within the Bayesian community.
For this reason, my vote still tends towards accepting this paper. * With regards to the title, it is ultimately at your discretion whether to change it or not. However, ‘uncertainty-aware functions’ has a nice ring to it!

Reviewer 3

Post-rebuttal update: I thank the authors for the clarification. After some discussion with the other reviewers and the AC, I have decided to increase my score to lean more towards acceptance. I do believe, however, that the current version of the paper is sub-par in terms of presentation. Please add a comparison with Aboleth, along with the other missing information that I mentioned in my original review, and please fix all the formatting issues. I do hope that, given sufficient work from the authors in preparing the camera-ready version, this paper could read more like a proper scientific paper than a description of a software toolkit.

This article describes an extension of TensorFlow (TF) called "Bayesian Layers" (BL) which abstracts the variational inference implementation (e.g. sampling, the reparametrization trick, and KL-divergence calculation) inside the API itself. These layers are constructed in such a way that they remain compatible with the pre-existing layers and API in TF. This allows users to stack variational and vanilla layers together when building their models. Some examples provided by the authors include the variational versions of fully-connected, convolutional, RNN, and Gaussian process layers. The article shows that these layers can be used as drop-in replacements for the existing deterministic layers in (possibly any) existing models to enable uncertainty quantification, which is very important in real-world systems. Furthermore, the authors demonstrate that the proposed layers scale well to a particularly large model with 5 billion parameters.

I like the idea of having an easy way of building BNNs.
In particular, having a painless way to turn complex, deterministic models (like those people use in NLP or CV) into Bayesian ones is very appealing. Even more so because real-world systems nowadays cannot quantify their uncertainty, giving rise to many safety issues. So, I think there is big real-world potential for this toolkit.

Having said that, I have the feeling that the proposed toolkit is a bit too similar to existing toolkits such as Aboleth. Indeed, the authors cite Aboleth as the most similar toolkit to theirs. However, there is a lack of comparison between BL and Aboleth, so it is difficult to know what makes BL special and original. Furthermore, when comparing BL with Pyro, the authors mention that BL can use more recent estimators such as Flipout and deterministic VI. I think this could be a strong point for BL, but the authors did not follow up on this feature.

While BL hides away the pain of implementing VI, it makes the toolkit inflexible. For example, I am under the impression that only VI with a Gaussian prior and a Gaussian variational posterior (with diagonal covariance) is supported. While Edward can be used on top of BL for more advanced inference, it is still not clear to me how easy this would be, and whether BL could easily be used with non-Gaussian priors and posteriors. Perhaps the authors could clarify this point further.

For the experiments, I think it is really great to see that one can use the proposed toolkit to turn a large, complex model into a Bayesian one. But I think an additional comparison with the respective vanilla deterministic model in terms of training time and memory overhead would be important. I also think the claim in Figure 9 that the performance of BL scales linearly is not warranted, as there are not enough data points to draw that conclusion. That is, there is a lot of uncertainty about what the performance curve looks like between x=128 and x=512.
The overall writing of the article is clear, although some questionable terms such as "tensor-dimensional" are used here and there. I appreciate that the authors show code describing the usage of BL. However, the presentation of that code could be better, as the snippets often overlap with each other and cross the page boundary. Another minor point that I would like to bring up is that on the x-axis of Figure 9, the numbers 8 and 32 are too close together, which is confusing at a glance.

Finally, I would like to mention again that I really like what the authors propose in this article and hope that it could have a big impact on real-world systems. However, ultimately, I think this article's scientific significance is low, as the article is by nature a description of a software toolkit.
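
For readers unfamiliar with what Reviewer 3 means by a layer that hides "sampling, reparametrization trick, and KL-divergence calculation", the following is a minimal sketch in plain Python. It is not the Bayesian Layers API and does not reproduce the authors' code: the class name `MeanFieldDense`, its methods, and the initial values are all hypothetical, and it assumes the simplest case the reviewer describes, a diagonal Gaussian posterior over the weights with a standard normal prior.

```python
import math
import random


def softplus(x):
    """Map an unconstrained value to a positive standard deviation."""
    return math.log1p(math.exp(x))


class MeanFieldDense:
    """Hypothetical dense layer with a diagonal Gaussian posterior over weights."""

    def __init__(self, n_in, n_out, rho_init=-3.0, seed=0):
        self.rng = random.Random(seed)
        # Variational parameters: one mean and one pre-softplus scale per weight.
        self.mu = [[self.rng.gauss(0.0, 0.1) for _ in range(n_out)]
                   for _ in range(n_in)]
        self.rho = [[rho_init] * n_out for _ in range(n_in)]

    def sample_weights(self):
        # Reparameterisation trick: w = mu + sigma * eps, with eps ~ N(0, 1).
        return [
            [m + softplus(r) * self.rng.gauss(0.0, 1.0)
             for m, r in zip(mu_row, rho_row)]
            for mu_row, rho_row in zip(self.mu, self.rho)
        ]

    def kl_to_standard_normal(self):
        # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all weights.
        total = 0.0
        for mu_row, rho_row in zip(self.mu, self.rho):
            for m, r in zip(mu_row, rho_row):
                sigma = softplus(r)
                total += 0.5 * (sigma ** 2 + m ** 2 - 1.0) - math.log(sigma)
        return total

    def __call__(self, x):
        # One stochastic forward pass: draw a weight matrix, then compute x @ W.
        w = self.sample_weights()
        n_out = len(w[0])
        return [sum(x_i * w[i][j] for i, x_i in enumerate(x))
                for j in range(n_out)]


layer = MeanFieldDense(n_in=3, n_out=2)
y_first = layer([1.0, 2.0, 3.0])   # each call draws fresh weights
y_second = layer([1.0, 2.0, 3.0])
kl = layer.kl_to_standard_normal()  # would be added to the training loss
```

In this sketch the caller only sees a callable layer plus a KL term to add to the loss, which is the kind of encapsulation the review credits with making drop-in replacement easy and also blames for the reduced flexibility with non-Gaussian priors and posteriors.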