__ Summary and Contributions__: This paper addresses the problem of hidden stratification in (image) classification: many classes (e.g. “dog”) contain sub-classes (e.g. “golden retriever”), and a model may perform much better on some subclasses than others. If the subclass labels are known, this can be measured by the “robust performance”, which is the worst-case performance over all subclasses. The authors propose a method (GEORGE) meant to reduce the gap between the overall performance and the robust performance without knowledge of the subclass labels. The basic idea is to first train a classifier using the available labels, then cluster the representations to produce pseudo-labels for subclasses, and finally use these pseudo-labels to train GDRO [42] (a prior method to address stratification in the case where subclass labels are known). The paper also provides a data generating process and uses that framework to prove some asymptotic sample complexity results.
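
To make the distinction between overall and robust performance concrete, here is a minimal sketch using accuracy as the metric. The function names and the toy data are hypothetical and are not taken from the paper; the sketch assumes ground-truth subclass labels are available at evaluation time.

```python
from collections import defaultdict


def subclass_accuracies(preds, labels, subclasses):
    """Return a dict mapping each subclass id to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, s in zip(preds, labels, subclasses):
        total[s] += 1
        correct[s] += int(p == y)
    return {s: correct[s] / total[s] for s in total}


def overall_and_robust_accuracy(preds, labels, subclasses):
    """Overall accuracy and worst-case accuracy over subclasses."""
    per_subclass = subclass_accuracies(preds, labels, subclasses)
    overall = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
    robust = min(per_subclass.values())
    return overall, robust


# Toy example (made up): one superclass "dog" (label 1) with a large
# subclass and a small one. The model is right on every "retriever"
# but only on half of the "pug" examples.
labels = [1] * 10
subclasses = ["retriever"] * 8 + ["pug"] * 2
preds = [1] * 8 + [1, 0]

overall, robust = overall_and_robust_accuracy(preds, labels, subclasses)
```

In this toy case the overall accuracy looks high while the robust accuracy is set entirely by the small subclass, which is the gap the paper targets.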

__ Strengths__: This paper addresses an important problem in computer vision that deserves more attention, and the proposed technique is interesting and effective. Based on the experiments in the paper, GEORGE provides large gains in robust performance compared to ERM.
More broadly, the experimental evaluation is thorough and multifaceted, often including results from multiple runs to demonstrate robustness.

__ Weaknesses__: The clustering step seems somewhat intricate (see supplementary material), and it is not clear how that procedure came about or how robust the performance of GEORGE is to changes in that procedure.
It is not clear how relevant the theoretical contributions in Section 5 are in practice - it would be great to include a frank discussion of that point. If they’re not practically relevant, what is the key obstacle there? If they are practically relevant, is there a way to demonstrate that empirically? For instance, is it possible to empirically demonstrate the convergence behavior predicted by Lemma 1?

__ Correctness__: The proposed approach seems reasonable. The theoretical results are correct to the best of my knowledge, but I did not review the proofs in detail so I can’t comment on line-by-line correctness.

__ Clarity__: The paper is well written and the figures are generally clear and informative.

__ Relation to Prior Work__: The paper is well situated with respect to prior work, particularly given the additional related works section in the supplementary materials.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be interesting to see numbers for other common classification metrics, e.g. average precision or class averaged accuracy (maybe using ground truth subclass labels?). This might paint a more comprehensive picture of what is improving and what is getting worse when we switch from ERM to GEORGE.
All of the results are for binary classification tasks and one or two subclasses. Is there a natural way to extend this to multiclass classification and/or to many subclasses? Would GEORGE be expected to perform well as the problem complexity increases in this way?
Minor comments:
The arrowheads in the graphical model in Figure 2(a) are too small, making it hard to see the direction of the edges.
***Post-Rebuttal Comments***
I think the rebuttal was satisfactory. After reading the rebuttal and the other reviews, I will maintain my previous rating in favor of acceptance.

__ Summary and Contributions__: This paper proposes a method to both measure and mitigate hidden stratification, even when subclass labels are unknown. Specifically, the authors first observe that unlabeled subclasses are often separable in the feature space. They then use the approximate subclass labels as a form of noisy supervision in a distributionally robust optimization objective. Experiments are conducted on four datasets.
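
One common way to write such an objective is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $f_\theta$ is the classifier, $\ell$ the per-example loss, $\hat{z}_i \in \{1,\dots,K\}$ the estimated subclass of training example $i$, and $n_k$ the number of training examples assigned to estimated subclass $k$.

```latex
\min_{\theta} \; \max_{k \in \{1,\dots,K\}} \; \frac{1}{n_k} \sum_{i \,:\, \hat{z}_i = k} \ell\bigl(f_\theta(x_i),\, y_i\bigr)
```

By contrast, ERM minimizes the average loss over all examples, so a small or hard subclass contributes little to its objective.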

__ Strengths__: + The paper is clearly written and easy to follow.
+ The proposed method is technically sound and the basic idea is interesting.
+ The theoretical analysis is good.
+ Experimental results in both the main paper and the supplementary materials are good.

__ Weaknesses__: There are several typos throughout this paper. The authors should carefully proofread their paper.

__ Correctness__: The claims and method are correct. Also, the empirical methodology is correct.

__ Clarity__: This paper is clearly written and easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes a method named GEORGE to measure and mitigate the consequences of hidden stratification without subclass labels. The main causes of this phenomenon are inherent hardness and imbalance in subclass sizes. Since overall accuracy is the only evaluation metric, the model may consistently favor the superclass labels while failing to capture visually meaningful finer-grained variation. Specifically, UMAP dimensionality reduction and Gaussian mixture model clustering are used to produce subclass pseudo-labels from the representation space. The second step trains a deep model on those pseudo-labels to achieve better worst-case (robust) performance. Finally, the paper theoretically analyzes GEORGE and empirically validates its performance on four datasets. Experiments demonstrate the effectiveness of the proposed approach.

__ Strengths__: # Advantages:
1. This paper is well written and comes with sufficient supplementary materials, comprising pseudocode, experiment details, code, specific parameters, theoretical derivations, proofs, and so on.
2. The idea of using clustering-derived pseudo subclass labels and optimizing worst-case performance to measure and mitigate the hidden stratification problem is novel, and the method seems to work well.
3. The GEORGE procedure is very practical and could easily be added to a commonly used ERM model, just by clustering the representations of the pretrained ERM model and retraining with the generated subclass labels.
4. Experiments are extensive. Four challenging datasets are used to evaluate the effectiveness of the proposed approach. Multiple ablation studies are provided to give insight into the method.

__ Weaknesses__: # Disadvantages:
1. It would be better to highlight your contributions and provide a figure of the whole framework to illustrate the process more directly.
2. The quality of Figure 2 needs to be improved for better aesthetics and expressiveness.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper first introduces a generative model to explain how the hidden stratification problem can occur. Then, a framework called GEORGE is introduced to mitigate the hidden stratification problem without requiring access to subclass labels. Empirical results on four datasets show that GEORGE outperforms ERM.

__ Strengths__: This paper focuses on the hidden stratification problem, in which training deep neural networks with only coarse-grained class labels results in variable performance across different subclasses. Motivated by the observation that the feature representations of deep neural networks often capture information about unlabeled subclasses, this paper proposes GEORGE, a two-step method for mitigating hidden stratification. In the first step, GEORGE estimates subclass labels in feature space via Gaussian mixture model clustering. In the second step, the estimated subclass labels are used in a distributionally robust optimization objective to train a robust classifier.
The strengths of this work are:
1. This paper is complete overall. It starts by introducing a generative model to study how the hidden stratification problem occurs. It then proposes GEORGE, a two-step framework for mitigating the problem. A theoretical analysis of the proposed method is also provided, and extensive experiments support the method.
2. In the first step, GEORGE trains a standard ERM model and clusters the subclass features. Empirically, the recovered clusters are often closely aligned with the true labels. With such estimated subclass labels, GEORGE is then able to train a robust classifier via a group distributionally robust optimization (GDRO) approach.
3. GEORGE can converge to the optimal robust risk at the same sample complexity rate as GDRO when it recovers the true latent features.
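
To illustrate how a GDRO-style second step can focus training on the worst group, here is a minimal sketch of an exponentiated-gradient update of group weights, as used in common formulations of group DRO. This is an assumption for illustration only: the function names, step size, and loss values are hypothetical, and the reviews do not confirm that GEORGE uses exactly this update. In GEORGE the groups would be the estimated clusters rather than true subclasses.

```python
import math


def update_group_weights(weights, group_losses, step_size):
    """Scale each group weight by exp(step_size * loss), then renormalize."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_loss(weights, group_losses):
    """Training objective for the model step: weight-averaged group loss."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Toy example (made-up numbers): group 1 has a much higher average loss.
# The losses are held fixed here to isolate the weight dynamics; in real
# training they would change as the model is updated.
weights = [0.5, 0.5]
group_losses = [0.2, 1.0]
for _ in range(20):
    weights = update_group_weights(weights, group_losses, step_size=0.5)

objective = weighted_loss(weights, group_losses)
```

With the losses held fixed, the weights concentrate on the higher-loss group, so the weighted objective moves from the plain average toward the worst-group loss.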

__ Weaknesses__:
1. It is good to see a study of the causes of the hidden stratification problem before GEORGE is formally proposed. According to the paper, there are two main causes: inherent hardness and data imbalance. However, it is unclear how GEORGE addresses these two causes. A discussion of this would further improve the paper.
2. The computation cost of GEORGE might be high since it is a two-step framework. Are there any results showing the computation cost?

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: After reading the rebuttal and the other reviews, I am satisfied with the rebuttal, as the additional computational cost is further reduced by modifications to the code. Therefore, I agree to raise my score.