Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This is an interesting paper with solid and novel work.
1. Novelty. The authors move from label-distribution-independent margin theory to a distribution-aware margin generalization bound, from which the optimal distribution-aware margin is derived. Inspired by this label-distribution-aware margin theory, they propose the LDAM loss to adjust the margins of frequent and minority classes. The class-distribution-aware margin trade-off theory is profound and novel.
2. Quality. The main mathematical tool is margin theory. The mathematical statements are simplified, e.g. Eq. (3), but remain sufficient to illustrate the idea. I did not carefully check the proofs provided by the authors.
3. Clarity. The paper is well written and well organized.
4. Significance. The proposed method is simple yet effective. Because it is derived from margin theory, it is theoretically sound. The authors also conduct carefully designed experiments, which validate its feasibility.
This is a new solution to an old problem. The problem is somewhat significant. The paper's goal is clear, and the paper is well written overall. The theoretical analysis is sound and intuitive, though I did not check the proofs. The simulation study, including an ablation study, is fairly thorough. However, I find that some details are missing.
1. It is unclear to me why the loss function (10) enforces the desired margin in (9). This seems to be a missing piece of the puzzle. A better, more intuitive explanation (perhaps a direct calculation) is needed.
2. I wonder what exactly is shown in Figure 2. What is the "feature distribution of different methods"? Isn't that the output X_i*b, where b is the coefficient matrix?
3. I am puzzled by some of the discussion in 3.3. On the one hand, the authors propose the "Deferred Re-balancing Optimization Schedule", which "first trains using vanilla ERM with the LDAM loss before annealing the learning rate, and then deploys a re-weighted LDAM loss with a smaller learning rate." On the other hand, they also mention that "the second stage does not move the weights very far." If the second stage does not move the weights by much, then shouldn't vanilla ERM with the LDAM loss already work well enough? I don't think the results have any issue, but the way the motivation for this new strategy is presented needs fixing.
Edit: I have read the author response and changed my score from 6 to 7.
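The question in point 1 (how a loss can enforce a per-class margin) can be illustrated with a small sketch. It is not the authors' implementation: it assumes the margin-adjusted softmax form in which the true class's logit is reduced by a class-dependent margin before cross-entropy is applied, and it assumes margins of the form C / n_j^(1/4), where n_j is the sample count of class j. The function names and the default value of C are hypothetical.

```python
import math


def ldam_margins(class_counts, C=0.5):
    """Per-class margins, assumed here to be C / n_j**(1/4).

    Rarer classes (small n_j) receive larger margins.
    """
    return [C / (n ** 0.25) for n in class_counts]


def margin_softmax_loss(logits, label, margins):
    """Cross-entropy computed after subtracting the true class's margin.

    Only the logit of the true class is shifted; the others are unchanged.
    """
    shifted = list(logits)
    shifted[label] -= margins[label]
    # Numerically stable log-sum-exp over the shifted logits.
    m = max(shifted)
    log_sum = m + math.log(sum(math.exp(z - m) for z in shifted))
    return log_sum - shifted[label]


if __name__ == "__main__":
    counts = [10000, 16]           # frequent class, minority class
    margins = ldam_margins(counts, C=1.0)
    logits = [2.0, 1.0]
    print(margins)
    print(margin_softmax_loss(logits, 1, margins))
```

Under these assumptions, the loss for a sample of class y is small only when its logit exceeds every other logit by roughly the margin of class y, because the comparison inside the softmax is between the shifted true-class logit and the unshifted logits of the other classes. With all margins set to zero the function reduces to ordinary softmax cross-entropy.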
+ LDAM regularizes the margins (i.e. the minimum distance of data samples to the decision boundary) of minority classes in order to improve generalization to minority classes at test time. The margin of each class depends on the number of samples in that class, so LDAM is label-distribution-aware. DRW applies re-weighting together with LDAM at a smaller learning rate, in order to fine-tune the model after an initial stage of training. Although DRW lacks theoretical justification, its efficacy is demonstrated across various experiments. These ideas are novel and are shown to provide superior performance, and even to avoid overfitting to frequent classes, in comparison with naive re-weighting/re-sampling techniques and other baselines (e.g. Focal loss, CB).
- There is no description of how to choose the hyperparameter C (which is used in Equation 13 to adjust the class-dependent margins). An analysis of the sensitivity of performance with respect to C is also required. Additionally, since LDAM is used in both stages of Algorithm 1, should the two stages use different values of C?
- How is the LDAM-HG-DRS in Table 1 implemented?
- While CB is the main baseline used for comparison in this paper, a clearer explanation of the difference between CB and the proposed method is needed. In detail: CB is concurrent work (published at CVPR 2019) and models the class-imbalance problem from a slightly different perspective, but when we compare Equations 12 and 13 of this submission with CB+softmax, they appear quite similar. More insight is therefore required into why and how the proposed LDAM improves performance.
- As Table 2 shows, the main boost in performance seems to stem from the DRW (deferred re-weighting) mechanism. Given this, and given the similarity between CB and the proposed LDAM, an additional baseline, i.e. CB+DRW, is needed to clarify the contribution of LDAM.
** after rebuttal ** My review comments and concerns are well addressed by the authors. Therefore I would stick to my positive opinion on this submission. However, I agree with the other reviewers that the derivations of Equations 5, 6, 7, and 8 currently rest only on binary classification; the multi-class case needs more justification. I change my rating from 7 to 6 accordingly.
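The deferred re-weighting schedule discussed in this review can be sketched as a per-epoch choice of class weights: uniform weights during the first stage, then re-balancing weights once the learning rate has been annealed. The sketch below is an illustration under stated assumptions, not the paper's procedure: the inverse-frequency weighting, the normalization to mean 1, and the names `drw_class_weights` and `defer_epoch` are all hypothetical.

```python
def drw_class_weights(epoch, class_counts, defer_epoch):
    """Return per-class loss weights for the given epoch.

    Before `defer_epoch`: uniform weights (plain ERM).
    From `defer_epoch` on: inverse-frequency weights, rescaled so that
    they average to 1 (an assumed choice, made for illustration only).
    """
    k = len(class_counts)
    if epoch < defer_epoch:
        return [1.0] * k
    inverse = [1.0 / n for n in class_counts]
    scale = k / sum(inverse)
    return [w * scale for w in inverse]


if __name__ == "__main__":
    counts = [100, 10]  # frequent class, minority class
    for epoch in (0, 4, 5, 9):
        print(epoch, drw_class_weights(epoch, counts, defer_epoch=5))
```

Keeping the weights at mean 1 is a design choice that leaves the overall loss scale comparable across the two stages, so the change at `defer_epoch` only shifts emphasis between classes.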
The key point of this paper is Equation (10), which is inspired by the trade-off between the class margins for BINARY classification. The LDAM loss function takes the class margins into account so that the minority classes enjoy a larger margin. The best numerical results come from the LDAM-DRW scheme. I cannot find a very convincing argument for the deferred re-weighting scheme. The numerical results outperform currently popular strategies for dealing with imbalanced datasets. However, the method is only tested on computer vision tasks.