NeurIPS 2020

Learning from Positive and Unlabeled Data with Arbitrary Positive Shift

Review 1

Summary and Contributions: The authors propose a method for learning from only positive and unlabeled data when the positive class is allowed to shift arbitrarily. The key assumption is that the negative class distribution stays constant across the training and test sets. They provide a two-stage algorithm and a one-stage recursive algorithm for solving the problem. Experiments under different shift settings demonstrate the utility of their approach. The authors addressed the reviewers' questions well in their rebuttal, and I am keeping my score of acceptance.

Strengths: - The authors provide theoretical analysis of their risk estimators and offer more than one algorithm to solve the problem. - The setting of learning from positive data with arbitrary shifts is relevant and interesting, and the authors provide several real-world examples. - The empirical evaluation is thorough, covering multiple shift scenarios.

Weaknesses: - There are potentially many variations of this problem, e.g., obtained by exchanging the roles of the positive and negative classes: one can assume the positive class distribution stays the same while allowing arbitrary shifts in the negative class. It would be clearer to readers if the contributions were explained in terms of this equivalence class of problems.

Correctness: The theorems and empirical methodology are correct to the best of my knowledge.

Clarity: Yes, the paper is clear and well written.

Relation to Prior Work: Yes, the authors sufficiently cite prior works and point out the relations.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper studies positive-unlabeled binary classification in the setting where the positive distribution may change between the training and test stages while the negative distribution stays the same. The available data are positive and unlabeled data from the training distribution, plus unlabeled data from the test distribution. The final goal is to learn an (inductive) classifier from these resources. A similar problem setting was introduced in previous work, but this paper makes different assumptions and proposes a different algorithm. Experiments show that the proposed methods outperform previous methods on image and NLP datasets.

Strengths: - The paper explains strong potential applications for arbitrary-positive, unlabeled (aPU) classification. - The writing is well organized and builds up gradually: it starts by replacing the max operator with an absolute value in the original PU risk, then introduces a two-step method for aPU, and finally proposes a one-step method for aPU. - Experiments are extensive and cover both image and NLP domains. - The method does not rely on the input-output consistency assumption that previous works relied on.

Weaknesses: If I understood the Kiryo et al. paper correctly, Section 3 of the paper under review was already proposed there. Kiryo et al. propose a method with max in their Eq. 6, but the final proposal in Algorithm 1 of their paper does not use max; instead it flips the gradient when the loss goes below zero (or below a specified hyper-parameter value). As far as I understand, this is mathematically equivalent to using the absolute value: if the value inside the absolute value is negative, the sign flips, which also flips the sign of the gradient. Kiryo et al. have a step-size hyper-parameter gamma for adjusting the gradient, so if this hyper-parameter is set to 1, the procedure seems to become the same as using the absolute-value function. The method in Section 3 has the benefit of not having the hyper-parameter gamma, because it uses gamma = 1 implicitly.

The idea of PURR is interesting and creative. I think the motivation was to propose a one-step method so that errors in the first step do not propagate. However, the experiments suggest that PURR is not necessarily beneficial over the two-step methods, and the intuition behind this was not explained. (On the other hand, the comparison between aPNU and wUU was clear and the results were intuitive.)

The experiments seem to use kernel models for PUc but neural networks for the proposed methods. Is there a reason for not aligning the underlying models? The experiments would be much more meaningful if the underlying settings were aligned as much as possible.
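
The equivalence argued above can be made concrete in a toy sketch. This is my own illustration, not code from either paper: the scalar `neg_term` stands in for the (possibly negative) empirical negative-risk term, and the function names are hypothetical.

```python
import numpy as np

def grad_abs(neg_term_grad, neg_term_value):
    """Gradient contribution of |neg_term| w.r.t. the parameters
    (chain rule: sign of the inner value times its gradient)."""
    return np.sign(neg_term_value) * neg_term_grad

def grad_kiryo(neg_term_grad, neg_term_value, gamma=1.0):
    """Kiryo et al.-style update: descend on neg_term while it is
    non-negative; when it dips below zero, ascend instead, scaled
    by the step-size hyper-parameter gamma."""
    if neg_term_value >= 0:
        return neg_term_grad
    return -gamma * neg_term_grad
```

For gamma = 1 the two updates coincide wherever the absolute value is differentiable (i.e., everywhere except neg_term = 0), matching the claim that the absolute-value correction implicitly fixes gamma = 1.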

Correctness: See my answer to the previous question about comments on Section 3.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Overall comment: Although I enjoyed reading the paper and it proposes novel ideas for PU learning research, I could not give a high score because: it is hard to compare methods in the experiments due to the use of different models for the proposed methods and the baselines; some of the work in this paper (Sec. 3) was already proposed in a previous paper; and the benefit of one of the proposed methods (PURR) is not clear in the experiments.

Other comments: The output of logistic classifiers lies between 0 and 1, and theoretically it should be an estimate of p(y|x). In practice, the estimate of p(y|x) can be quite noisy, or the model may overfit and produce peaky $\hat{p}(y|x)$ distributions, according to papers like "On Calibration of Modern Neural Networks" (ICML 2017). Assuming $\hat{\sigma}(x) = p_{tr}(y=-1|x)$ seems to be a strong assumption; does this cause any issues in the experiments? A minor suggestion is to investigate confidence calibration and see how sensitive the final PU classifier is to poor calibration. Can $\rho$ in aPNU be tuned with validation data? It seems unrealistic to have knowledge of how much overlap there is between $p_{tr-p}(x)$ and $p_{te-p}(x)$.

________________________________________________
After rebuttal period: Thank you for answering my questions. Some of my initial concerns have been resolved and I have decided to raise my score. The response helped me understand the difference from the Kiryo et al. 2017 paper. However, Table 16 (Appendix E.5) seems to show quite similar results, so I am not sure of the significance of the proposal (although it is novel). On using different models for different baselines: I now think the comparisons are fair. It still puts some burden on the reader to reach that conclusion, so if accepted, I suggest using the same models as much as possible in the camera-ready version.

Review 3

Summary and Contributions: This paper focuses on an arbitrary-positive, unlabeled (aPU) learning setting where the labeled (positive) data may be arbitrarily different from the target distribution's positive class. The paper uses an absolute-value correction for PU risk estimation. It also proposes a two-step method to solve the aPU problem.

Strengths: + Arbitrary-positive, unlabeled (aPU) learning is an under-explored problem. The paper proposes a two-step method to solve the aPU problem: create a representative negative set, then classify X_{te-u}. + The paper replaces the non-negativity constraint (i.e., the max term) with an absolute-value correction.

Weaknesses: - Some notation is used before being defined, which makes the paper hard to follow. For example, in Line 70, p, n, and u are not defined until Line 103. - The paper directly uses some conclusions from other papers without any explanation, which also makes it hard to understand. It would be nice to include a brief introduction to make the paper self-contained.

Correctness: I do not totally understand the paper, so I cannot fully verify correctness.

Clarity: No.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: I have read the response and the authors addressed my concerns. I raised my score to accept this paper.

Review 4

Summary and Contributions: The paper tackles the problem of PU learning where the labeled set of positives is arbitrarily biased. In addition to an unlabeled set that contains unbiased positives and negatives, the authors exploit another unlabeled set made of positives biased in the same way as the labeled set, together with unbiased negatives. They introduce a modification of the non-negative risk estimator (for unbiased PU) that replaces the max operation with an absolute value, which leads to a simpler algorithm. The absolute-value formulation of the risk is further modified in three different ways to provide consistent risk estimators based on the biased positive set and the two unlabeled sets.
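
The modified estimator described above can be sketched in a few lines. This is a minimal illustration under my own assumptions (a sigmoid surrogate loss and hypothetical function names), not the paper's actual formulation:

```python
import numpy as np

def sigmoid_loss(z, y):
    """Surrogate loss l(z, y) = sigmoid(-y * z), bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(y * z))

def abs_corrected_pu_risk(scores_p, scores_u, pi):
    """Empirical PU risk with |.| in place of nnPU's max(0, .) clamp.

    scores_p: classifier outputs on labeled positives
    scores_u: classifier outputs on unlabeled data
    pi: assumed class prior p(y = +1)
    """
    pos_risk = np.mean(sigmoid_loss(scores_p, +1))
    # Unbiased estimate of the negative-class risk; with a flexible
    # model this term can dip below zero, which the correction absorbs.
    neg_term = (np.mean(sigmoid_loss(scores_u, -1))
                - pi * np.mean(sigmoid_loss(scores_p, -1)))
    return pi * pos_risk + np.abs(neg_term)
```

Replacing `np.abs(neg_term)` with `np.maximum(0.0, neg_term)` would recover the nnPU-style clamp; the absolute value instead penalizes the term for going negative, which is what enables the simpler training loop.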

Strengths: 1) The problem formulation based on two unlabeled sets is novel and has some real-world applications. 2) The method and the theoretical results, although not surprising, are novel. 3) Experiments on many datasets demonstrate the efficacy of the method.

Weaknesses: 1) The assumption that the positives in the training unlabeled set have the same bias as the labeled positives is restrictive and does not solve the bias problem in PU learning in general. For example, in the medical domain it might be difficult to construct an unlabeled set where the distribution of diseased individuals matches the biased set of individuals known to have the disease, while the healthy individuals in the set represent an unbiased distribution of healthy individuals. 2) The authors claim that comparison to the bPU methods is infeasible. I understand that the bias assumptions of these methods are restrictive, but it should be possible to include them as baselines.

Correctness: yes

Clarity: The paper is very well written.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: The epidemiology example might not be apt for this problem. If the population of the region is constant across the years and the distribution of diseased individuals changes, would that not imply that the distribution of healthy individuals also changes? Every individual in the region, if not healthy, is diseased, and vice versa.

=======================After rebuttal======================
I still think that the epidemiology example should be removed, since the distribution of negatives is not stationary from one year to the next. If the authors want to keep it, they should give some supporting evidence. The authors have not responded to my comment on not including bPU as a baseline, so I am lowering my score to marginally above acceptance.