__ Summary and Contributions__: ++ Post Rebuttal
I'm happy with the rebuttal, which clarified some points about the paper. The extra experiments show that the method extends beyond WideResNet too. For this reason I'm raising my score.
++
This paper discusses the idea of building ensembles by combining different hyper-parameters with different initializations of a deep model. The paper applies this idea in two directions. The first is deep ensembles, for which the authors introduce stratified hyper ensembles, a greedy search algorithm that improves on the hyper ensemble method. The second is batch ensembles, a budget-friendly ensembling mechanism. The authors introduce batch hyper ensembles, which are 2x the size of hyper ensemble. They also merge the idea of self-tuning networks into batch hyper ensembles to obtain an efficient, non-greedy upgrade of batch ensembles.

__ Strengths__: I have read the paper several times (4-5) to make sure of these points:
- I'm not sure whether the idea itself is novel in this area, but it seems valid.
- The empirical evaluation shows an improvement over previous methods and covers comparisons with different alternatives such as deep ensembles and batch ensembles.
- The pictorial analysis in Figure 2, which shows that the parameter searches of deep ensembles and hyper ensembles are special cases of the method.
- The extension of self-tuning networks to K ensemble members, as in Equation 7, and the updated objective function in Equation 8 are another contribution of the paper.

__ Weaknesses__: I do have several questions:
- Did the hyper-ensemble paper force the networks to start from the same initialization point? I looked into the paper in ref [12] for this information but couldn't tell. If not, then the work needs more justification of how it differs from hyper-ensemble.
- On page 5, starting from line 172, it is not clear why o(mk) became o(k^2). Can you elaborate on this?
- From Table 1 and Table 2 it seems that hyper-ens, str hyper ens and deep ens are quite close to each other in terms of NLL, accuracy and ECE. What exactly is the range of improvement from using str-hyper-ens over the others?
- In Tables 1 and 2, what is the meaning of the numbers in brackets (1), (4)?
- I understand that the empirical evaluation is expensive, but reporting results on other deep models such as VGG, ResNet and DenseNet for a small subset of the settings would clear any doubt that the method only works well for WideResNet.
- On the same point about WideResNet, regarding lines 269-272 on using two deep ensembles: what are the results of this comparison for WideResNet?
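To make the question about the NLL/accuracy/ECE comparison concrete, here is a minimal sketch of how ECE is commonly computed with equal-width confidence bins. It is not taken from the paper: the function name `expected_calibration_error` and the default of 15 bins are assumptions for illustration, and the paper's exact binning may differ.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins.

    probs: (N, C) predicted class probabilities; labels: (N,) integer classes.
    n_bins is an illustrative default, not a value from the paper.
    """
    conf = probs.max(axis=1)                  # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by fraction of samples in bin
    return float(ece)

# Toy check: five predictions, all with confidence 0.8 for class 0.
toy_probs = np.tile([0.8, 0.2], (5, 1))
ece_calibrated = expected_calibration_error(toy_probs, np.array([0, 0, 0, 0, 1]))
ece_underconfident = expected_calibration_error(toy_probs, np.array([0, 0, 0, 0, 0]))
```

Because ECE is an average gap on a 0-1 scale, small differences between methods are hard to interpret without error bars, which is the motivation for asking about the range of improvement.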

__ Correctness__: It seems correct except for the comments in the weaknesses section.

__ Clarity__: Each section of the paper is clear on its own but the overall flow is not the best.
The paper has one main idea [hyper-parameters + different initializations], which it then applies to two different techniques [deep ensembles and batch ensembles]. It would be better to introduce the idea first, then devote a complete section to the improvement over deep ensembles and another to batch ensembles. The same applies to the discussion in the empirical evaluation section. Of course, this is only a suggestion.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: Post-rebuttal:
-----------------------------------------------
Thank you for the clarification. After reading the other reviews and the rebuttal, I decided to increase my score.
-----------------------------------------------
This paper presents a generalization of deep ensembles (Lakshminarayanan et al., NIPS 2017). In addition to ensembling neural networks based on distinct points in the parameter space, this paper proposes to also ensemble over the *hyper*parameter space (which contains all possible values of a neural network's hyperparameters, e.g. weight decay, dropout rate, learning rate). The authors propose a greedy algorithm for constructing an ensemble given a set of models which have different parameter and hyperparameter values. Furthermore, a lightweight version of this algorithm, based on the recently proposed batch ensembles (Wen et al., ICLR 2020), is proposed. Extensive experiments show that ensembling networks based on both parameters and hyperparameters yields a substantial improvement in uncertainty quantification compared to the baselines.

__ Strengths__: I like this paper since it is an important step toward full uncertainty quantification (i.e. quantifying all sources of uncertainty) of neural networks. As more sources of uncertainty are quantified, the quality of predictive uncertainty---the quantity that matters the most in predictive systems---improves. This can have a big implication in safety-critical systems since one can trust the predictive uncertainty better.
The empirical evaluation is solid but can be further improved: it is sufficiently broad but could be more in-depth. Please find some suggestions at the bottom of this review.

__ Weaknesses__: This paper has a strong connection to the hierarchical Bayesian modeling of neural networks, where a hyperprior (prior distribution over hyperparameters) is assigned to the probabilistic model. Hyper ensembles can roughly be seen as consisting of samples from the posterior of this Bayesian model. It is thus a bit disappointing that the authors did not compare, discuss, or at least mention this connection in the paper. A comparison and discussion would be very helpful to point out exactly the novelty of hyper ensembles compared to this established Bayesian modeling technique.

__ Correctness__: The proposed methods are straightforward generalizations of prior works such as deep & batch ensembles and self-tuning networks (MacKay et al., ICLR 2019), so I do not think there is any obvious issue here. Nevertheless, there is a questionable design decision in the method: In lines 157-159: Why does the algorithm hyper_ens select a model *with replacement*? Doesn't this mean that the resulting ensemble could consist of K exact copies of a single model? In this case, wouldn't it defeat the purpose of forming ensembles?
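To make the with-replacement question concrete, here is a minimal sketch of greedy ensemble selection with replacement on validation NLL. It is an illustration under my own assumptions, not the authors' hyper_ens implementation: the names `ensemble_nll` and `greedy_select` and the toy probabilities are hypothetical.

```python
import numpy as np

def ensemble_nll(probs, labels):
    """Mean negative log-likelihood of (averaged) class probabilities."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def greedy_select(member_probs, labels, k):
    """Greedily pick k member indices, with replacement, minimising validation NLL.

    member_probs: (M, N, C) validation probabilities of M candidate models.
    """
    selected = []
    running_sum = np.zeros_like(member_probs[0])
    for _ in range(k):
        # Score every candidate, including ones already selected.
        scores = [
            ensemble_nll((running_sum + p) / (len(selected) + 1), labels)
            for p in member_probs
        ]
        best = int(np.argmin(scores))
        selected.append(best)
        running_sum += member_probs[best]
    return selected

# Hypothetical toy data: probability each model assigns to the true class.
labels = np.array([0, 0, 1, 1])
p_true = np.array([
    [0.95, 0.95, 0.40, 0.40],   # model 0: strong on the first two examples
    [0.40, 0.40, 0.90, 0.90],   # model 1: strong on the last two examples
    [0.55, 0.55, 0.55, 0.55],   # model 2: mediocre everywhere
])
member_probs = np.empty((3, 4, 2))
member_probs[:, np.arange(4), labels] = p_true
member_probs[:, np.arange(4), 1 - labels] = 1.0 - p_true

selected = greedy_select(member_probs, labels, k=3)
```

In this toy case the second pick is the complementary model rather than a copy of the first, and a repeat only appears in the third step, where it acts as an upweighting of an existing member. Whether the same holds on real validation data, i.e. whether K identical copies can occur in practice, is what I would like the authors to clarify.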

__ Clarity__: This paper is very well written. I appreciate the extensive discussion about deep & batch ensembles and self-tuning networks. One minor complaint would be: what does "skew" mean in the caption of Figure 3? I do not think that it is discussed or defined anywhere in the main text.

__ Relation to Prior Work__: I think the authors have discussed the related work sufficiently well, except for the connection to hierarchical Bayesian models. I would like to see this connection discussed and, if possible, compared in the experiment section. Additionally, I think (standard, non-hierarchical) BNN baselines need to be included in the empirical evaluation.

__ Reproducibility__: Yes

__ Additional Feedback__: Additional feedback:
- Please do not wait until acceptance before releasing the code. I think even a simple self-contained code example over a toy dataset in the form of a Jupyter notebook would be really helpful.
- I would like to see more OOD experiments. Perhaps a big table consisting of AUROC values like Table 1 in Hein et al., CVPR 2019? A deeper OOD experiment is important since it will complement the frequentist calibration results in Tables 1 and 2 (frequentist calibration only concerns in-distribution uncertainty).
- Please polish the References section. There are some inaccuracies there, e.g. [25] and [28]---they are either workshop or conference papers, not just Arxiv papers.
Questions:
- What is the interpretation of the parameter \xi_t in the distribution of p(\lambda | \xi_t)? Does this mean p(\lambda | \xi_t) is some kind of (time-dependent) stochastic process?
- How many samples do you use for computing the objectives in Eq. 8 and 9?
To summarize, I think this paper provides an important step toward full uncertainty quantification---where all sources of uncertainty are considered---of neural networks. I think this paper would be much stronger if the authors (i) discussed the connection between hyper ensembles and hierarchical Bayes, (ii) added more non-ensemble baselines like BNNs, and (iii) added deeper OOD experiments.

__ Summary and Contributions__: 1. This paper unifies hyper-parameter tuning and random initialization as two dimensions to encourage model diversity. When combining these two methods, the overall result is better than each method.
2. The paper further applies a recently proposed batch ensemble technique to simulate deep ensemble and extend the existing self-tuning networks to the ensemble learning scenario.
3. Empirical results are provided on benchmark datasets with different architectures.

__ Strengths__: Empirical results look believable and the authors promise to release code upon acceptance.

__ Weaknesses__: 1. The proposed method improves only marginally over previous methods.
2. The proposed method is a combination of existing techniques. The main innovation I can see so far is the design of self-tuning networks for ensemble learning.
3. The paper claims that two sources of diversity jointly contribute to the overall ensemble model. Actually, there is a third source of diversity during training, from p_t(\lambda_k), which controls the diversity of \lambda_k. Assuming p_t does not degenerate, how can the variance of p_t be effectively controlled so that it does a good local search around \lambda_k? It would be great to have a qualitative explanation of why multiple ensemble members p_t(\lambda_k) for k=1,..., K work well independently and can jointly explore a wider space of lambda.

__ Correctness__: All techniques are properly used in this paper as far as I can see.

__ Clarity__: This paper is well-written.

__ Relation to Prior Work__: This paper includes related prior works as far as I can see.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes ensembling over both weights and hyperparameters to improve performance. Specifically, the paper proposes stratified hyper ensembles, which involve a random search over different hyperparameters, stratified across multiple random initializations. The authors also propose batch hyper ensembles, which are a parameter-efficient version of the model. The proposed model is tested on image classification tasks and achieves favorable performance.

__ Strengths__: 1. This paper is well-motivated and well-written. It is easy to read and follow, with sufficient detail on the model.
2. The model consistently outperforms the baselines.

__ Weaknesses__: On the novelty: the proposed method is simple and straightforward. Although the empirical performance is good, the novelty is incremental.

__ Correctness__: Correct

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: