__ Summary and Contributions__: The authors empirically evaluate various generalization bounds via a population-of-environments approach, as was done in recent work [6]. The critical difference from [6] is that the authors use a worst-case-focused metric instead of a correlation-focused metric (e.g., the Kendall correlation coefficient in [6]). In particular, they examine what they call the robust sign-error. Empirical evaluation using this metric reveals several interesting findings.
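To fix ideas, here is my own sketch of the distinction (not the authors' code; the function names and synthetic data are mine): the robust sign-error scores a measure by its worst sign-agreement with the change in generalization error over a family of environments, whereas an average- or correlation-based evaluation aggregates over all of them.

```python
import numpy as np

def sign_error(delta_measure, delta_gen):
    """Fraction of interventions where the measure's change disagrees
    in sign with the change in generalization error."""
    return np.mean(np.sign(delta_measure) != np.sign(delta_gen))

def robust_sign_error(environments):
    """Worst-case sign-error; `environments` is a list of
    (delta_measure, delta_gen) array pairs, one per intervention family."""
    return max(sign_error(dm, dg) for dm, dg in environments)

# Synthetic illustration: 5 environments of 50 interventions each.
rng = np.random.default_rng(0)
envs = [(rng.normal(size=50), rng.normal(size=50)) for _ in range(5)]
average = np.mean([sign_error(dm, dg) for dm, dg in envs])
worst = robust_sign_error(envs)
# worst >= average always: a measure can look fine on average yet fail
# badly in one environment, which is what the worst-case metric exposes.
```

The point of the sketch is only that the max and the mean of the same per-environment statistic can diverge arbitrarily; the question of *why* the max is the right summary remains.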

__ Strengths__: - Significance (of the effort): As in [6], the computational effort put into the work is impressive; the experimental results per se may be a great asset to future researchers, especially if all the trained models, code, and detailed setups are disclosed to the public.
- Significance (of the problem): The problem to be addressed (if it is addressed correctly) is definitely significant. Putting aside the importance of understanding the generalization of neural network models, answering the question "what is the right measure to empirically evaluate theoretical generalization bounds?" is of critical importance.

__ Weaknesses__: - Soundness of the claim (theoretical grounding): Although I have some experience with distributionally robust optimization, I must say that I cannot directly see *why* the proposed "robust error" (which, in my opinion, should rather be called the worst-case error) is a better empirical quantity to look at. What properties of the estimate make it so? How should the discrepancy between average-based and worst-case-based evaluations be compared? The authors do refer to some existing works (I am not sure they are that relevant, though), but these points should also be formally addressed in the main text.

__ Correctness__: Mixed. The proposed evaluation metric looks reasonable, but I cannot find any concrete reasoning to support it.

__ Clarity__: I do not believe so.
- Many claims are poorly cited. For example, see line 47, where the authors state that a "test-set bound provides a sharp estimate of risk"; what exactly is meant by the "test-set bound"? I guessed that the authors are trying to invoke a prediction-theoretic argument, but I could not be sure because they did not cite an example of such test-set bounds.
- I also believe that the paper could improve in terms of organization as well. For instance, I needed to spend some time crawling through the paper to find an (explicit) distinction between the environments "e" and the samples "omega" (by the way, are 5 seeds enough?).

__ Relation to Prior Work__: The distinction in terms of the evaluation metric has been made quite clear (although the authors could also introduce the Kendall correlation coefficient formally to help the readers). On the other hand, in terms of discussion of experimental results, I wish to see more explicit comparisons to the results in [6].
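To make the suggestion concrete, the Kendall correlation coefficient used in [6] admits a one-formula definition; a minimal self-contained sketch (my own illustration, not from either paper):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation: (concordant - discordant pairs) / total pairs.
    +1 for perfectly agreeing rankings, -1 for perfectly reversed ones."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfectly agreeing and perfectly reversed rankings:
assert kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]) == 1.0
assert kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]) == -1.0
```

Stating something like this in the paper would let readers see immediately that [6] rewards monotone association across the whole population, while the robust sign-error penalizes a single bad environment.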

__ Reproducibility__: No

__ Additional Feedback__: I read the response and participated in the earlier discussion, but forgot to edit the review.
Still, I am not fully convinced of the proposed methodology, which is only poetically justified by the authors as 'theory is only as strong as its weakest link.' I strongly believe that understanding and formally explaining why such methodology is better than the previous approach is an essential part of the research procedure.
--------------------
- I recommend renaming some elements, e.g., using $\omega \in \Omega$ and $f \in \mathcal{F}$. $P^e$ looks quite similar to "probability of error" rather than $P$ parametrized by $e$.

__ Summary and Contributions__: - The paper proposes a systematic, empirical methodology to evaluate existing generalization bounds (or measures) with the aim of understanding and, more importantly, explaining generalization. The main idea consists of testing the predictions made by a generalization theory (against empirically observed generalization errors) by intervening on relevant characteristics, thus attempting to uncover causal relationships between these characteristics and generalization properties. The importance of such a methodology is highlighted by demonstrating, anecdotally, that generalization bounds can rely on non-causal (but correlated) factors, and that due to the presence of various such interactions, using these bounds to explain generalization can be complicated. Finally, some findings from the large-scale empirical study are presented and briefly discussed.
---------------------------------------------
I would like to thank the authors for addressing the questions raised in the review.

__ Strengths__: - The paper is very well written and has good clarity, both in terms of readability and conceptual clarity. The motivating example of the SVMs was quite helpful in clarifying the reasoning of the paper.
- One central contribution of the paper: using the framework of distributional robustness and considering robust sign error follows from the objective of the paper - evaluating theories of generalization by testing their predictions. This, although it seems like a minor change, crucially differentiates this methodology from other large scale empirical evaluation approaches.

__ Weaknesses__: - A bit more discussion of the experimental findings would have been instructive.
--------------------
Upon further discussions and more reflection, I believe that the missing discussion on how to interpret the results is somewhat more concerning than I previously believed. To reflect my updated stance on the paper, I am changing the overall score to a 7.

__ Correctness__: Yes, to the extent that I could verify.

__ Clarity__: Yes, it is one of the main strengths of the paper.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: - Did the authors consider using the approach to formulate robust generalization measures (for the same family of environments used in the paper)?
- In the discussion on robustness to width: the authors mention that since the networks are overparameterized, robustness w.r.t. width could not be observed. However, shouldn't it be possible to consider a broader range of widths in the experiment to avoid failure due to overparameterization?

__ Summary and Contributions__: This paper proposes an improved framework for evaluating theories of generalization for deep neural networks, building on the notion of distributional robustness; in particular, robustness of a theory's efficacy in response to changes in the hyperparameters. The paper also includes large-scale empirical studies showing that, under this framework, no existing generalization theory can reliably predict the empirical performance of deep neural networks. Finally, the paper analyzes the advantages and weaknesses of various families of complexity measures.

__ Strengths__: I enjoyed reading this paper. As in other disciplines of science (e.g., physics), a theory built from first principles is only useful when it fits what we observe in nature, and I believe this "experimental" spirit is largely missing in the theory community of machine learning. I believe the methodology proposed in this paper will help bridge that gap. Distributional robustness over different environments is, in my opinion, a good way to quantify how good a theory is, and should probably be adopted by all theory papers (in particular, frequentist bounds papers) once suitable benchmarks are established and the details of the application are further refined. I believe the method is also more flexible than its predecessors and reduces the amount of compute required. This paper is of great importance to the NeurIPS community, since generalization is the crux of supervised machine learning.

__ Weaknesses__: I think the claim about scale made in the paper may be a bit misleading, since the hyperparameter space actually searched is not that large and does not include common techniques such as dropout or weight decay; this might be okay, since there is no reason why the techniques cannot be applied to any hyperparameters. Another potential issue I see is the relation between this paper and Jiang et al. While the authors claim that Jiang et al. falls short in being distributionally robust, I believe the conditional independence test is extremely close to the method proposed here. In fact, that method has a formal connection to intervention and the IC algorithm, which is used to build causal graphs. Specifically, Jiang et al. also takes a minimum over all possible interventions, which to me seems equivalent to taking the infimum over the "environment" proposed in this paper. I hope that the authors can better explain how this method improves over the independence test in Jiang et al. and outline the pros/cons of using this method over the former. Some potential benefits I see: the proposed method is more flexible than the one proposed by Jiang et al. due to the introduction of Monte Carlo noise, and the ability to define a distribution over environments is not only more flexible but also facilitates analysis. That being said, I hope to hear the authors' thoughts on this. Lastly, it is not clear to me how the single-network experiment relates to distributional robustness.

__ Correctness__: I believe that the claims and methods made in the paper are correct.

__ Clarity__: The clarity of the paper could be improved. The definition of environments in Section 5 is somewhat hard to follow, and it is not immediately clear what "Monte Carlo noise" refers to. The introduction of the measure in the second paragraph of Section 4.2, and the rest of that paragraph, are also unclear, as is why it is important. I believe this paper is important not only to the theory community but also to practitioners, so many of the notations in the paper need to be better explained.

__ Relation to Prior Work__: The discussion on Jiang et al. could be elaborated. See weakness for details.

__ Reproducibility__: Yes

__ Additional Feedback__: ------------------------------------Update-----------------------------------
The reviewers and AC had an extensive discussion on this paper. I maintain my evaluation, so I did not update the score, but I believe it is nonetheless good for the authors to see this.
I believe the authors have some misunderstanding about (6) and (7) of Jiang et al., which are actually not computing the average case.
Regarding MI collapsing to 0, I believe it is not a "bug" but a feature. If I know everything about the neural network except the training randomness, then I hope everyone agrees that repeatedly training the model would yield models with more or less the same performance, which implies that the MI would be 0. In this light, it is actually possible to see the method proposed in this paper as a stronger version of the IC algorithm as presented, but a weaker version of the IC algorithm with all conditioning variables.
My impression of the paper is positive but I hope the authors can properly address this in the future.

__ Summary and Contributions__: The paper proposes a new way to evaluate generalization measures that targets behavior in the worst case rather than on average. The method is applied to study how well different generalization measures can predict the impact of hyperparameters. The main experimental setup, called coupled-network, uses a pair of networks trained independently with all hyperparameters shared except one. The claim is that a good generalization measure will move in the same direction as the generalization error for all perturbations. They conduct a large-scale evaluation of a diverse set of measures on the CIFAR and SVHN datasets.

__ Strengths__: The paper provides a few surprising findings and insights:
- No measure is robust, i.e., each disagrees with the error for some perturbation
- num. params is the best measure on average
- Digging into the failure cases for this measure reveals failures of the generalization measures that are not visible under non-robust evaluation
From the paper: "We find that no existing complexity measure has better robust sign-error than a coin flip. Even though some measures perform well on average, every single measure suffers from 100% failure in predicting the sign change in generalization error under some intervention. This observation is not the end of the evaluation, but the beginning. To better understand the measures, we evaluate them in families of environments defined by interventions to a single hyperparameter. We find: (i) most, though not all, measures are good at robustly predicting changes due to training set size; (ii) robustly predicting changes due to width and depth is hard for all measures, though some PAC-Bayes-based measures show (weak) signs of robustness; (iii) norm-based measures outperform other measures at learning rate interventions."

__ Weaknesses__: While the theoretical motivation behind using $\sup$ is clear, in practice it makes the evaluation criterion less robust to the coarseness of the underlying set of environments. E.g., if there were only 2 values for width, the results would probably behave quite differently.
The authors mention that the reason network width has little effect is that the neural network is overparameterized. So it is natural to wonder how the measures would perform on bigger datasets such as ImageNet.

__ Correctness__: The methods are clear, but better argumentation for the selection of hyperparameters would be preferred (see above).

__ Clarity__: The paper is written well.

__ Relation to Prior Work__: The contribution is clear.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: [ QUESTION-ONLY NON-REVIEW FROM AC. SCORE IS FAKE. BUT PLEASE ANSWER! ]
1. I briefly checked your anonymous code and I believe you are calculating spectral norms incorrectly: the spectral norm of a convolution layer is _not_ the spectral norm of its parameters (the matrix defining the filter). It is okay to use an approximation, but it should be spelled out in the paper so the reader does not have to consult the code; meanwhile, to my taste, the spectral norm of the filter is not a reasonable approximation unless you provide some evidence. Can you please clarify?
2. I realize you are relying on [6] as the source of your generalization measures, but can you at least explain a little about the bounds, for the sake of the reader? Many of these bounds are apples-to-oranges comparisons, and some discretion is needed when interpreting a total ordering of them. E.g., they use varying amounts of information and computation, and they bound different things (some are on networks, some are on posterior averages over networks, etc.).
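To make point 1 concrete: for a single-channel, stride-1 convolution with circular padding, the exact operator spectral norm can be read off the 2D DFT of the filter (per Sedghi et al., 2019, "The Singular Values of Convolutional Layers"). A sketch of my own (purely illustrative, not the authors' code) contrasting it with the norm of the raw filter matrix:

```python
import numpy as np

def conv_operator_spectral_norm(kernel, n):
    """Exact spectral norm of circular convolution with `kernel` on n x n
    inputs: the circulant operator's eigenvalue magnitudes are the absolute
    values of the 2D DFT of the zero-padded filter."""
    padded = np.zeros((n, n))
    k = kernel.shape[0]
    padded[:k, :k] = kernel
    return np.max(np.abs(np.fft.fft2(padded)))

kernel = np.ones((3, 3)) / 9.0              # a 3x3 averaging filter
naive = np.linalg.norm(kernel, 2)           # spectral norm of the 3x3 parameter matrix: 1/3
exact = conv_operator_spectral_norm(kernel, 32)   # operator norm: 1.0
```

Even in this simplest setting the two quantities differ by a factor of 3, which is why I think the approximation needs explicit justification.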

__ Strengths__: .

__ Weaknesses__: .

__ Correctness__: .

__ Clarity__: .

__ Relation to Prior Work__: .

__ Reproducibility__: Yes

__ Additional Feedback__: .