__ Summary and Contributions__: This paper works on importance weighting for deep learning under distribution shift. Overall, it is of high quality. The idea is novel and the results are significant.

__ Strengths__: The paper starts from the goal defined in equation 1. It identifies the circular dependency in figure 1 as the bottleneck that prevents existing algorithms from being applied. This bottleneck is also theoretically justified after theorem 1 on page 3. Then, the solution to the bottleneck is given in figure 2 (rough) and in equations 8, 9, 10 and algorithm 1 (detailed). The equations and algorithm are clearly motivated. I think the solution is not only novel but also technically sound. Finally, the paper ends with applications and discussions (both read like related work, covering closely and remotely related topics).

__ Weaknesses__: However, it is unclear how the experiments were done. Which version of SGD was used, and why not Adam or some optimizer better than Adam in deep learning? The authors claimed the method is compatible with any model and any optimizer, but didn't show that it is not limited to SGD. Moreover, why is the convergence analysis interesting? Can you show that the assumptions hold even if the optimizer is limited to SGD? If so, how can you know the algorithm doesn't converge to a poor local minimum, which would mean the model or the algorithm itself is poor?
Last but not least, section 3 is fine but section 4 is a bit misleading. Importance weighting (including learning to reweight examples) seems more powerful because it has labeled validation data, while unsupervised domain adaptation has only unlabeled validation data and distributionally robust supervised learning does not even have unlabeled validation data. You should state this fact clearly in this section.
In summary, this is a nice paper with an important problem, a deep understanding, a novel idea, and a lot of experiments. Without enough information, I think the experiments are less convincing than they should be. The authors should also motivate their choices of model/optimizer; otherwise these choices may appear cherry-picked. The convergence analysis is confusing: I cannot get the message even after carefully reading the paper. The merits outweigh the flaws and I vote to accept the paper. I would be willing to increase my score by one if the authors can convince me.

__ Correctness__: yes

__ Clarity__: yes

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes dynamic importance weighting as an end-to-end solution to the distribution shift problem. The authors train a deep classifier on the importance-weighted training data and use it as the feature extractor for importance weighting. The experimental results show that the proposed method outperforms traditional importance weighting methods.

__ Strengths__: 1. IW is general and fundamental in machine learning against distribution shift.
2. The analysis is intuitive and insightful.
3. The proposed algorithm is novel and compatible with deep learning techniques.
4. The proposed algorithm experimentally outperforms learning to reweight.

__ Weaknesses__: 1. In the experiments on learning with noisy labels, the baseline algorithms are not the latest.
2. Furthermore, the datasets are not sufficient; there should be a non-image dataset.
3. The distribution shift is synthetic; there should be a real-world dataset.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: -----------------------------
I have read the authors' rebuttal. It addressed the issue of the latest baselines, and I have decided to change my score to 8.

__ Summary and Contributions__: ===UPDATE AFTER AUTHOR FEEDBACK===
I think that the additional results and proposed experiments from the author feedback will really fill out the empirical contributions of this paper, and I have changed the overall score to reflect this. I think that this paper belongs in NeurIPS.
This paper introduces a method that simultaneously trains a deep network to perform (importance) weight estimation and classification using the same extracted/hidden features, for a context involving both complex data and distribution shift where one desires to perform weighted ERM with a deep network. Their method, DIW, alternates between updating the network based on a weight estimation objective and a weighted classification objective. The weight estimation objective is kernel mean matching on the distributions of either the final hidden layer representations or the loss values between the training data and a small validation set. They describe under which conditions the correct IWs for the raw data are also the correct IWs for the data under some transformation. They then motivate their choice of matching the loss, arguing that finding optimal weights to match data distributions is unnecessary. Finally, experiments are presented showing superior classification performance under label noise; results for the label shift scenario are also presented.
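To make the loss-matching step concrete, here is a minimal sketch of kernel mean matching on scalar loss values. It is not the authors' implementation: the function names, the Gaussian kernel bandwidth, the upper bound on the weights, and the projected-gradient solver are all assumptions chosen for illustration, and the usual constraint that keeps the mean weight close to 1 is omitted for brevity.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between two 1-D arrays of loss values."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd_objective(w, K, kappa, n_tr, n_val):
    """Squared distance between weighted-train and validation kernel means (up to a constant)."""
    return w @ K @ w / n_tr**2 - 2.0 * (w @ kappa) / (n_tr * n_val)

def kmm_weights(train_losses, val_losses, bandwidth=0.5, upper=10.0, n_steps=500):
    """Projected gradient descent on the KMM objective with 0 <= w <= upper.

    Simplification (assumption): the constraint on mean(w) is dropped.
    """
    n_tr, n_val = len(train_losses), len(val_losses)
    K = gaussian_kernel(train_losses, train_losses, bandwidth)
    kappa = gaussian_kernel(train_losses, val_losses, bandwidth).sum(axis=1)
    # Step size 1/L, where L = 2 * lambda_max(K) / n_tr^2 bounds the gradient's Lipschitz constant.
    step = n_tr**2 / (2.0 * np.linalg.eigvalsh(K)[-1])
    w = np.ones(n_tr)
    for _ in range(n_steps):
        grad = 2.0 * (K @ w) / n_tr**2 - 2.0 * kappa / (n_tr * n_val)
        w = np.clip(w - step * grad, 0.0, upper)  # project back onto the box
    return w

# Synthetic stand-in for per-example losses: the first 50 training points
# behave like the validation data, the last 50 behave like mislabeled points.
rng = np.random.default_rng(0)
train_losses = np.concatenate([rng.normal(0.1, 0.05, 50), rng.normal(2.0, 0.05, 50)])
val_losses = rng.normal(0.1, 0.05, 20)
weights = kmm_weights(train_losses, val_losses)
```

In an alternating scheme of the kind summarized above, such weights would be recomputed on each mini-batch from the current network's losses and then used in the weighted classification update; that outer loop is not shown here.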

__ Strengths__: The method is novel and interesting, and outperforms previous work empirically on image classification with label noise. Section 2 in the paper does a good job of justifying their choices in algorithm design. Since their final algorithm combines several design choices, I appreciate the inclusion of ablation studies that effectively illustrate how these individual choices affect model performance. They not only show superior classification results for the label noise scenario, but also show that their method is effective in downweighting mislabeled classes. Distribution shift is an important problem present in many real-world scenarios, and this paper is a step towards tackling the problem for complex modalities that require deep learning models.

__ Weaknesses__: I don't understand the purpose of the class imbalance experiments. Weight estimation in the label shift scenario only requires estimating the test/train ratio of each label's frequency, which is trivial with access to the labeled validation set. Weight estimation from extracted features seems like overkill here, and there is a glaring absence of a baseline that just uses the label ratios between the validation set and the training set as the importance weights. Weighting for label shift can even be done effectively without validation set labels. There isn't any discussion of the results of this experiment in the paper either. The paper mentions that this experiment tests whether DIW can estimate weights without being told that it is in the class imbalance setting, but I think that experiments for the covariate shift scenario would be much more effective at demonstrating the weight estimation abilities of the proposed model, and it is also a harder task.
(Additionally, I usually see the term "label shift" applied to this setting while "class imbalance" usually refers to one or more classes being more prevalent in the data regardless of distribution shift).
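The missing baseline could be as simple as the following sketch, which assumes integer class labels and uses a hypothetical helper name; each training example receives the ratio of its label's frequency in the validation set to its frequency in the training set.

```python
import numpy as np

def label_ratio_weights(train_labels, val_labels, n_classes):
    """Per-example importance weights p_val(y) / p_train(y) for label shift."""
    train_freq = np.bincount(train_labels, minlength=n_classes) / len(train_labels)
    val_freq = np.bincount(val_labels, minlength=n_classes) / len(val_labels)
    # Classes absent from the training set get ratio 0: there is no example to weight.
    ratio = np.divide(val_freq, train_freq, out=np.zeros(n_classes), where=train_freq > 0)
    return ratio[train_labels]

train_labels = np.array([0, 0, 0, 1])
val_labels = np.array([0, 1, 1, 1])
weights = label_ratio_weights(train_labels, val_labels, n_classes=2)
# weights is [1/3, 1/3, 1/3, 3]: the rare training class is upweighted.
```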
For this type of deep learning methods paper, absent strong theoretical grounding, it's important to have thorough empirical results. While the label noise results presented are quite strong, the experiments only address one data modality (images) and don't address the covariate shift setting.

__ Correctness__: Yes, I didn't see any flaws in correctness of claims and empirical methodology.

__ Clarity__: The paper is very clear and well-written. The section that describes their method is easy to follow and well-organized.

__ Relation to Prior Work__: Yes, the authors illustrate how their method differs from Learning to Reweight, as well as how their problem setup differs from similar ones in section 4.

__ Reproducibility__: Yes

__ Additional Feedback__: Experiments on text data and covariate shift could make the empirical results very convincing. Additionally, I think it would be very interesting to look at how the weighted classifiers partition the train and test sets into different classes, i.e. what the ratios between the number of predicted and ground truth examples are for different classes.
It could be useful to briefly motivate the setting where you have access to a small labeled validation set drawn from the test distribution. This isn't the canonical supervised learning setup, so I think the paper should make an argument for the relevance of this scenario.
I also think that this paper would benefit from having a real conclusion section that contextualizes the method and examines the experimental results.
While plotting accuracy against epochs in Figure 3 does illustrate the overfitting of other methods, it might be more interesting to plot accuracy against the amount of label noise to compare how the different methods perform across varying label flip rates.
I'd also like to note that whether or not performing weighted ERM with deep networks results in the desired effect can be heavily dependent on training time and hyperparameter choice, even when importance weights are known [Byrd, Lipton. What is the Effect of Importance Weighting in Deep Learning? ICML 2019]; however, your empirical results suggest that your method is still able to perform well.
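As a rough illustration of the diagnostic I have in mind (the helper name and the toy labels are hypothetical), the per-class ratio of predicted to ground truth counts can be computed as follows; a value above 1 means the classifier over-predicts that class.

```python
import numpy as np

def predicted_to_true_ratio(y_true, y_pred, n_classes):
    """Ratio of predicted count to ground truth count for each class."""
    true_counts = np.bincount(y_true, minlength=n_classes).astype(float)
    pred_counts = np.bincount(y_pred, minlength=n_classes).astype(float)
    # Classes with no ground truth examples are reported as NaN.
    return np.divide(pred_counts, true_counts,
                     out=np.full(n_classes, np.nan), where=true_counts > 0)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 2, 2])
ratios = predicted_to_true_ratio(y_true, y_pred, n_classes=3)
# ratios is [1.5, 0.5, 1.0]: class 0 is over-predicted, class 1 under-predicted.
```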

__ Summary and Contributions__: ===UPDATE AFTER AUTHOR FEEDBACK===
Thanks to the authors for the clarifications. My score has not changed; I still believe that this is a good paper. The proposed experiments will make it stronger.
The paper presents a new framework for training deep learning models in the presence of distribution shift between train and test data. The proposed method iteratively optimises the classifier for the prediction task and estimates importance weights. The authors theoretically show the validity of the proposed approach, empirically compare their method with existing methods, and perform an ablation study.

__ Strengths__: The proposed framework for dynamic estimation of importance weights is novel and potentially could be interesting for the community. The method is clearly described. Experiments are well designed and analysis of results highlights the main properties of the proposed approach.

__ Weaknesses__: There are no experiments for the 'covariate shift' type of distribution shift. In my opinion, the paper is incomplete without them, and they are needed to support the claims.

__ Correctness__: Provided derivations and empirical methodology look correct.

__ Clarity__: While the paper is clearly written, there are a couple of points that can be improved:
- It would be useful to add references for the first and second reasons why it is difficult to boost the expressive power of WE (paragraph starting at line 38).
- The paragraph starting at line 33 is a bit confusing and not really convincing. How does the conclusion depend on the number of classes? Does w* use y as input, or why does the dimension equal (d+1)?

__ Relation to Prior Work__: There is a fairly good discussion section with a comparison of the proposed method and existing approaches. However, there is no explanation for the choice of baseline methods [20] and [38]. I would recommend highlighting the differences between them. For instance, explicitly mention that [20] doesn't require labels for a target test set to estimate IW.

__ Reproducibility__: Yes

__ Additional Feedback__: