Paper ID: 2964

Title: Robustness to Adversarial Perturbations in Learning from Incomplete Data

Update after reading the rebuttal and discussion with other reviewers: I feel my background knowledge is not sufficient to give strong support for this submission, so I have decreased my score.

I should admit that I am not familiar with this area.

Originality: Good. I think the notion of minimum supervision ratio is interesting and novel. Moreover, the authors use a hyperparameter lambda to control the quality of the unlabeled data, and their results subsume the cluster assumption, which is widely used in semi-supervised learning papers.

Quality: Good. I did not check the proofs in the appendix, but the statements look reasonable to me, and the logical flow is clear and convincing. The authors clearly discuss existing results as special cases of their result, e.g., the cluster assumption.

Clarity: Good. This paper is nicely presented and structured.

Significance: Good. Semi-supervised learning is widely used in practice, and it is important to have a theoretical understanding of the algorithms and their sample complexity. Therefore I think this paper is a nice contribution to the topic, and many people in the NeurIPS community will be interested.

This paper proposes a theoretical framework that combines Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). The authors add an entropy regularizer on the unlabeled data and analyze the generalization error of the solution to a Lagrangian relaxation. The generalization bound is based on two new complexity measures: semi-supervised Monge (SSM) complexity and the minimum supervision ratio (MSR). The authors show the connection between SSM and the VC dimension, which is nice. The proposed algorithm shows performance comparable to the state of the art on a number of real-world benchmark datasets.

Major Comments: It is unclear why the proposed SSM measure can explain the help from the unlabeled data. Consider a purely supervised algorithm that uses only the n*eta labeled data points from the dataset and optimizes only the supervised part of the loss. We can use the standard Rademacher complexity to derive a generalization bound. How does this bound compare to the SSM-based generalization bound? Under what conditions does unlabeled data help generalization?

Minor Comments:
1. It seems possible to define SSM complexity without the distributional robustness (so it could apply to general semi-supervised learning). Is there any existing work on this?
2. The minimum supervision ratio has a very implicit dependence on lambda and zeta, so I am not sure how useful this complexity measure is. Is there a way to estimate the MSR empirically?
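For readers unfamiliar with the setup, it may help to recall the Lagrangian relaxation that this line of work builds on. The sketch below follows the distributionally robust formulation of Sinha et al. [9]; the notation is mine, and the submission's semi-supervised variant additionally involves an entropy regularizer over the unlabeled portion of the data:

```latex
% Lagrangian relaxation of the Wasserstein DRO objective, as in [9]
% (notation mine; c is a transport cost, gamma the Lagrange multiplier)
\min_{\theta} \; \mathbb{E}_{Z \sim P_0}\!\left[\phi_\gamma(\theta; Z)\right],
\qquad
\phi_\gamma(\theta; z_0) \;=\;
  \sup_{z \in \mathcal{Z}} \bigl\{ \ell(\theta; z) - \gamma\, c(z, z_0) \bigr\}.
```

The robust surrogate loss phi_gamma penalizes perturbations of each data point by their transport cost, which is the sense in which the guarantees are distributional rather than pointwise.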

The paper is well written and easy to follow. The proposed methodology seems to work well.

** I noticed that one of the files in the code-for-reproducibility archive submitted by the authors contains a name-like string which could possibly identify the authors (see dmuxspec.yml). Fortunately, this happened after I had formed an opinion about the submission, but I am not sure whether it constitutes a violation of the reviewers' code of conduct (or equivalent).**

I believe that the submission could be much more than a 'self-learning version' of Sinha, Namkoong and Duchi [9] if the authors put additional effort into persuading the readers of the importance of the proposed notions, e.g., the minimum supervision ratio and the SSM Rademacher complexity. For readers already familiar with the results in [9] or [11], I do not think Theorems 1, 2, or 3 *per se* would come across as surprising or significant; the proof ideas are already quite well established. Instead, I am more interested in the nature of the newly suggested notions, such as the MSR, and an in-depth analysis of these objects would be the more substantial contribution of the submission. For example, it would be quite interesting to see whether the MSR could be characterized explicitly under simple scenarios, from which one could understand its dependence on various parameters.

Here are a few minor remarks:
1. I noticed that in Section 1.2 the authors use "distributionally robust learning" quite interchangeably with "adversarial robustness", which are quite different subjects to me, in the sense that the latter concerns pointwise perturbations rather than distributional ones.
2. Is n equal to n_l + n_{ul}? This could cause unnecessary confusion, as in Section 1.1 n is introduced as the cardinality of the dataset D = {Z_1, ..., Z_n}, without any semi-supervised context.
3. Missing space between sentences at line 222.
4. Typo at line 246: denot -> denote.
5. Some readers may be interested in seeing the choice of model in the main text, to be more critical about the robustness guaranteeable by semi-supervised methods.
6. The title of reference [20] should be "Monge blunts Bayes: ...", if I am correct.