Review for NeurIPS paper: The Pitfalls of Simplicity Bias in Neural Networks

NeurIPS 2020

Review 1

Summary and Contributions: The paper studies the implicit bias of neural networks, and argues that networks are biased to learn "simpler" features, even when they could be more accurate and robust by learning more complex ones. The main contribution of this paper is introducing several simplified data distributions where this can be more easily investigated. They show that certain neural networks empirically learn only "simple" features on these distributions, and also show this theoretically in a toy setting.

Strengths: The main strength of the work is in the data-distributions which they introduce. These are a simple and interesting testbed for theories about inductive bias, and they could be used in future work in the area. The "CIFAR+MNIST" dataset is especially interesting, since it is a more realistic distribution, and clearly demonstrates the "simplicity bias." The theoretical section is also a nice observation, extending existing results to the newly-introduced distributions. This paper will be relevant to the NeurIPS community, since it sheds more light on the implicit bias of neural networks (empirically and theoretically).

Weaknesses: The main weakness is that most of the settings studied are toy models: synthetic non-image distributions (with the exception of the CIFAR/MNIST experiment). The effects in these toy models are interesting, but it is unclear how much they say about implicit bias on real distributions.

Correctness: The theoretical claims appear to be correct. The proof of Theorem 1 could use some elaboration: specifically, the conclusion about small test error is not clearly demonstrated (the relevant Lemma F.3 appears to be about the train loss, without connecting it to the test loss). I don't think this will affect correctness of the Theorem, although the proof should be clarified.

Clarity: The paper is reasonably well written overall, though there are some points where claims are overstated. Most notably, the term "explains" is overused (e.g., SB "explains" adversarial examples, distribution shift, spurious features, etc.). The examples in this paper do take steps towards understanding these phenomena, but it is too strong to claim they "explain" them. After all, this paper considers toy distributions -- and further, several of the terms above are not formally defined in the literature (e.g., "spurious features"), so it is unclear what an explanation would entail.

Relation to Prior Work: Prior work is sufficiently discussed in context. This problem of implicit bias is studied in various ways, and the field is young enough that every paper typically studies it in a different way.

Reproducibility: Yes

Additional Feedback: Comments which do not affect the review score: I suggest cutting down on the number of "synthetic distributions" introduced in the main body. Currently there are 4 synthetic distributions, but all share roughly the same interesting feature, and it's not worth forcing the reader to context-switch between these in order to make points (which could, presumably, be made equally well just by focusing on 1 distribution). The CIFAR/MNIST distribution is in my opinion the most interesting aspect of this work, and could be highlighted more. Section 5, which shows that the implicit bias can hurt generalization, is interesting -- one would imagine that fully-connected nets can exploit the non-linear component. This could be worthy of more discussion. Section 4.3 is a fairly weak section, in that most of it is speculation that is not truly supported by the data. I suggest moving it out of the main body, or putting it into a "discussion" section explicitly. There is also a lot of interesting-looking material in the appendix. Consider adding a sketch of the appendix results towards the end of the main body.

Post-rebuttal update: I reduced my score by 1 point because the concerns about over-claiming were not adequately addressed. However, I would like to see this paper appear in NeurIPS if these concerns are addressed.

Review 2

Summary and Contributions: [UPDATE AFTER REBUTTAL] I'm generally excited about this paper! The authors addressed my suggestions by stating that the code & dataset will be open-sourced, and by describing two additional experiments studying the role of initialization as well as a different loss function. Probably due to time constraints, the authors do not provide any results for these experiments in the rebuttal. I hope these experiments will be included in the final version. Given the author response I see no reason to downgrade my score; at the same time, concerns raised by other reviewers regarding overly broad claims prevent me from raising my score (but I think these concerns can be quite easily addressed). I thus support publication at NeurIPS.

[ORIGINAL REVIEW] The paper "The Pitfalls of Simplicity Bias in Neural Networks" investigates the tendency of neural networks to learn "simple" solutions. More specifically, it shows that CNNs a) ignore complex features when simple ones are equally predictive, b) ignore complex features even when simple ones are less predictive and c) base their "confidence" mostly on simple features. The paper proposes a number of simple toy datasets where these phenomena can be studied and convincingly shows the prevalence and scope of the "simplicity bias". Simplicity bias is often considered beneficial in the sense that it prevents overfitting, but this paper shows how it can lead to poor OOD generalisation and how models often fall short of learning more than the simplest solution.

Strengths: The paper has a number of experiments with different architectures and optimizers (SGD, Adam), providing support to the empirical evaluation. The findings are surprising (not necessarily that these pitfalls exist but that they are so strong) and very relevant to the NeurIPS community. Overall, the observed problem, in combination with other works in this direction, underlies one of the most important problems of present-day deep learning. A better understanding of these phenomena is a necessity, and the paper does a very good job in providing a better understanding. I believe it will be a valuable contribution to the community.

Weaknesses:
- Simplicity bias is attributed to "standard training procedures such as SGD". However, Jacobsen et al. (reference [20]) showed that cross-entropy may be to blame: a modified loss function encourages neural networks to learn more than the most simple solution. This issue should be discussed and ideally one would like to investigate the invertible network from [20] on the proposed datasets.
- The paper could benefit from investigating the role of initialization: clearly, a model that happened to be initialized such that a complex feature is already "learned" would make use of that feature. But at which point would it switch to learn the simpler one? I.e., one could design an experiment where a network is trained on a dataset where only the complex feature is predictive. This network is then used as the initialization for a dataset where both the complex and simpler feature are predictive. Linearly interpolating between the weights of this network and a random initialization could enable one to investigate how much the weights can deviate from the "complex feature solution" such that the complex solution is still learned, or whether the simple solution will always be preferred irrespective of initialization.
- I appreciate that code was submitted alongside the paper. That being said, it would be good to mention if/how the dataset will be made available to others (which would be very helpful).
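
The interpolation step in the initialization experiment suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the parameter dictionaries, the layer names `w1` and `b1`, and the helper `interpolate_weights` are hypothetical stand-ins, not anything taken from the submitted code, and the random arrays merely play the role of a network pretrained on the complex-feature-only dataset.

```python
import numpy as np


def interpolate_weights(pretrained, random_init, alpha):
    """Return (1 - alpha) * pretrained + alpha * random_init, parameter by parameter.

    alpha = 0 gives the "complex feature solution"; alpha = 1 gives the random
    initialization. Both arguments are dicts mapping parameter names to arrays.
    """
    if pretrained.keys() != random_init.keys():
        raise ValueError("parameter names differ between the two networks")
    mixed = {}
    for name in pretrained:
        a, b = pretrained[name], random_init[name]
        if a.shape != b.shape:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = (1.0 - alpha) * a + alpha * b
    return mixed


rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the proposed experiment, `pretrained` would hold
# weights trained on data where only the complex feature is predictive.
pretrained = {"w1": rng.normal(size=(4, 3)), "b1": rng.normal(size=3)}
random_init = {"w1": rng.normal(size=(4, 3)), "b1": np.zeros(3)}

# One starting point per interpolation coefficient; each would then be trained
# on the dataset where both the simple and the complex feature are predictive.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
starting_points = [(alpha, interpolate_weights(pretrained, random_init, alpha))
                   for alpha in alphas]
```

Training from each starting point and recording which feature the final model relies on would then trace out how far from the complex solution the initialization can move before the simple feature takes over.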

Correctness: I have not thoroughly checked the math, but the overall approach and datasets look convincing and make sense to me.

Clarity: The paper is very well written and figures nicely illustrate the setup. The appendix does not comply with the NeurIPS style file. At points the paper appears a bit crammed.

Relation to Prior Work: The paper discusses prior and related work. However, many aspects seem related to "shortcut learning" (https://arxiv.org/abs/2004.07780) and I believe the reader will benefit from discussing this connection. At some point, the authors mention that "to the best of our knowledge, prior works only focus on the positive aspect of SB: the lack of overfitting in practice". This is not the case e.g. in the shortcut paper, where the problematic aspects of SB are discussed as well. That being said, I agree with the authors that the positive aspects of SB have been more prevalent in the literature, so re-phrasing this statement accordingly would be more accurate.

Reproducibility: Yes

Additional Feedback:
- Line 274: a period is missing.
- Concurrent work (https://arxiv.org/pdf/2006.12433.pdf) observes that "when two features redundantly predict the label, the model preferentially represents one, and its preference reflects what was most linearly decodable from the untrained model." Just FYI in case the authors haven't seen this already; this may be a pattern worth looking out for in the future.

Review 3

Summary and Contributions: The paper develops a notion of feature simplicity. It designs datasets and shows NNs rely on simple features in these. This is offered as an explanation why NNs have poor robustness to data shift.

Strengths: The paper addresses the important questions of robustness in neural networks and generalization, specifically its relationship to simplicity. It introduces synthetic datasets which allow both theoretical and empirical evaluation of simplicity and test errors. The contributions are novel to my knowledge, and ambitious in their scope.

Weaknesses: Several key claims seem false or insufficiently supported.
1) The authors demonstrate that “contrary to conventional wisdom, extreme [simplicity bias] can in fact hurt generalization”. We can always _construct_ a dataset where a simplicity-biased algorithm will fail. This is unsurprising and follows from the no free lunch theorems. Constructing datasets is in fact the method used here. But the practical significance/realism of these datasets must be established. This is not done at all for the first 3 datasets. The MNIST-CIFAR dataset may be more realistic but on its own it is not enough to establish the paper’s claims.
2) The authors show that simplicity bias is one possible reason for non-robustness. But they claim that simplicity bias is the cause behind non-robustness in other datasets. This is a large leap of faith with no support.
3) Key claims are phrased as if they apply to neural networks in general. In fact, theoretical results seem to be limited to the synthetic constructed datasets, and do not seem to apply generally.
Nonetheless, this work could be promising if the relevance and generality of the bias towards "simple" features were established. The observation that NNs can learn small-margin classifiers if they correspond to simple features is interesting, though it is unclear if it holds outside the constructed datasets.

Correctness: As noted above, the methods do not sufficiently support the key claims, but I have not spotted any technical errors.

Clarity: The abstract and introduction are clear and appear promising. Section 3.0 neglects to define a few terms (see below) and from section 4.3 the paper becomes less focused.

Relation to Prior Work: A discussion of simplicity is missing. I personally think this would be more useful than discussing distribution shift and adversarial robustness.

Reproducibility: Yes

Additional Feedback:
- Is MNIST necessarily less complex than CIFAR-10? It’s intuitive but not otherwise justified.
- Definition of simplicity: distinguish simplicity of functions, which is more standard, from simplicity of features.
- Claims are overly general, implying that they apply to all NNs (L74-87).
- Related work on simplicity and SB is missing.
- Claim (ii): SB is only one explanation for overconfidence; others seem possible.
- L135-146 are hard to follow. For example, it would be useful to define x^S, x^Sc, \bar{x}^S. The same goes for “marginal distribution of S”; I don’t think the distribution of a set is meant here?
- The abstract states that a key shortcoming in prior work is that simplicity is vaguely defined. The authors claim that the number of linear classifiers needed is a “natural notion of simplicity” but this is not justified. Furthermore, no method is offered to measure it (except in the authors’ own datasets by construction).
- Typo: “would almost perfect”

Edit: I increased my score by one point since the authors now gave some argument why their datasets are realistic (although not a decisive one in my view). Additionally, the authors indicated they will acknowledge that their results apply to specific, constructed datasets and not necessarily in general.

Review 4

Summary and Contributions: ** Update after authors’ feedback ** I will maintain my score and recommend this work for acceptance. I do however agree with some of the other reviews that some parts of the paper could be written more cautiously and would like to encourage the authors to do so.

In this work the authors describe an approach to systematically investigate a certain kind of bias, and therefore also the generalization properties, of neural networks for supervised classification. At its core, the authors propose to construct datasets that contain both “simple” and more “complex” features which are all highly predictive for the given task. In a sequence of experiments the authors show convincingly that fully connected MLPs, ConvNets and GRU-based sequence models all almost exclusively concentrate on the simple features in the input data to make predictions, disregarding the more complex but equally predictive features in the data. The authors call this property the “simplicity bias”. Besides empirically demonstrating this property for a range of neural network architectures and for various optimization methods (SGD, Adam, RMSProp), the authors furthermore present a proof that this is expected for one-hidden-layer neural networks with ReLU activations.
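
To make the setup described in this summary concrete, here is a minimal sketch of a dataset with one “simple” and one “complex” feature that are both fully predictive, plus one way the sensitivity of a model to each feature could be probed. Everything in it is an assumption made for illustration: the two-coordinate layout, the slab boundaries, the helper names `make_toy_data` and `shuffled_accuracy`, and the hand-written `simple_only` rule that stands in for a trained network. It is not the authors' construction or evaluation protocol.

```python
import numpy as np


def make_toy_data(n, rng):
    """Two coordinates, each fully predictive of the label y in {0, 1}.

    Column 0 is the "simple" feature: a single threshold at 0 separates the classes.
    Column 1 is the "complex" feature: class 0 sits in a middle slab and class 1
    in two outer slabs, so two thresholds are needed to separate the classes.
    """
    y = rng.integers(0, 2, size=n)
    simple = np.where(y == 1, 1.0, -1.0) * rng.uniform(0.1, 1.0, size=n)
    middle = rng.uniform(-0.4, 0.4, size=n)
    outer = rng.choice([-1.0, 1.0], size=n) * rng.uniform(0.6, 1.0, size=n)
    complex_feature = np.where(y == 1, outer, middle)
    return np.stack([simple, complex_feature], axis=1), y


def shuffled_accuracy(predict, x, y, column, rng):
    """Accuracy after permuting one column across examples, breaking its link to y."""
    x_shuffled = x.copy()
    x_shuffled[:, column] = rng.permutation(x_shuffled[:, column])
    return float(np.mean(predict(x_shuffled) == y))


def simple_only(x):
    """Hypothetical stand-in for a trained network that uses only the simple feature."""
    return (x[:, 0] > 0).astype(int)


rng = np.random.default_rng(0)
x, y = make_toy_data(2000, rng)

acc_clean = float(np.mean(simple_only(x) == y))
acc_simple_shuffled = shuffled_accuracy(simple_only, x, y, column=0, rng=rng)
acc_complex_shuffled = shuffled_accuracy(simple_only, x, y, column=1, rng=rng)
```

For a model that relies exclusively on the simple feature, shuffling the complex column leaves accuracy unchanged, while shuffling the simple column drops it to roughly chance level. Replacing `simple_only` with the prediction function of an actual trained model would turn this into a probe of that model.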

Strengths: The authors start with a simple idea, the definition of “simple” and “complex” predictive features, construct datasets from it and use these to systematically investigate the sensitivity of the learned neural networks to these input features. This simple approach proves to be strikingly powerful. The paper and the appendix describe a vast number of experiments that together form a very convincing argument that the trained networks indeed exhibit a very strong simplicity bias. The large number of systematically conducted experiments makes it very unlikely that the observed effects are the result of any one of the specific architectural or hyperparameter choices; rather, they appear to be an inherent feature of the way we currently train neural networks. The paper furthermore describes an impressive proof that this is indeed expected in one-hidden-layer neural networks. Besides providing immediate insight into the properties of neural networks as they are trained today, I suspect that this paper will spark significant follow-up work which might further investigate, or try to mitigate, the described and harmful simplicity bias.

Weaknesses: I generally don’t see any weaknesses in the work as it is presented here. The authors make effective use of the 8 page limit and the space in the appendix. The presented methodology suggests and enables future work that might investigate other aspects related to the effects described here. For example, it would be interesting to see whether techniques like drop-out or batch-norm have any influence on the obtained results. But at this point, I think it is absolutely acceptable to leave such questions for follow-up work.

Correctness: The presented arguments appear sound, the experiments are systematic and thorough.

Clarity: The paper is well written, easy to follow and makes effective use of the space available. The appendix contains a large number of additional experiments, details and insights.

Relation to Prior Work: In section 2 the authors correctly point out that the work presented here touches upon previous work on generalization, out-of-distribution performance and adversarial robustness. I think prior work in this area is adequately discussed.

Reproducibility: Yes

Additional Feedback: