#### Authors

Saharon Rosset, Ji Zhu, Hui Zou, Trevor Hastie

#### Abstract

We consider the situation in semi-supervised learning where the "label sampling" mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: (a) as an input to a supervised learning procedure, which can be used to "de-bias" its results using labeled data only, and (b) as a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

#### 1 Introduction

In semi-supervised learning, we assume we have a sample $(x_i, y_i, s_i)_{i=1}^{n}$ of i.i.d. draws from a joint distribution on $(X, Y, S)$, where:¹

- $x_i \in \mathbb{R}^p$ are $p$-vectors of features.
- $y_i$ is a label, or response ($y_i \in \mathbb{R}$ for regression, $y_i \in \{0, 1\}$ for 2-class classification).
- $s_i \in \{0, 1\}$ is a "labeling indicator", that is, $y_i$ is observed if and only if $s_i = 1$, while $x_i$ is observed for all $i$.


¹ Our notation here differs somewhat from many semi-supervised learning papers, where the unlabeled part of the sample is separated from the labeled part and sometimes called "test set".

In this paper we consider the interesting case of semi-supervised learning where the probability of observing the response depends on the data through the true response, as well as potentially through the features. Our goal is to model this unknown dependence:

$$l(x, y) = \Pr(S = 1 \mid x, y) \tag{1}$$


Note that the dependence on y (which is unobserved when S = 0) prevents us from using standard supervised modeling approaches to learn l. We show here that we can use the whole data-set (labeled+unlabeled data) to obtain estimates of this probability distribution within a parametric family of distributions, without needing to "impute" the unobserved responses.²

We believe this setup is of significant practical interest. Here are a couple of examples of realistic situations:

1. The problem of learning from positive examples and unlabeled data is of significant interest in document topic learning [4, 6, 8]. Consider a generalization of that problem, where we observe a sample of positive and negative examples and unlabeled data, but we believe that the positive and negative labels are supplied with different probabilities (in the document learning example, positive examples are typically more likely to be labeled than negative ones, which are much more abundant). These probabilities may also not be uniform within each class, and may depend on the features as well. Our methods allow us to infer these labeling probabilities by utilizing the unlabeled data.
2. Consider a satisfaction survey, where clients of a company are requested to report their level of satisfaction, but they can choose whether or not they do so. It is reasonable to assume that their willingness to report their satisfaction depends on their actual satisfaction level. Using our methods, we can infer the dependence of the reporting probability on the actual satisfaction by utilizing the unlabeled data, i.e., the customers who declined to respond.

Being able to infer the labeling mechanism is important for two distinct reasons. First, it may be useful for "de-biasing" the results of supervised learning, which uses only the labeled examples. The generic approach for achieving this is to use "inverse sampling" weights (i.e., weight labeled examples by 1/l(x, y)). The use of this for maximum likelihood estimation is well established in the literature as a method for correcting sampling bias (of which semi-supervised learning is an example) [10]. We can also use the learned mechanism to post-adjust the probabilities from a probability estimation method such as logistic regression to attain "unbiasedness" and consistency [11]. Second, understanding the labeling mechanism may be an interesting and useful learning task in itself. Consider, for example, the "satisfaction survey" scenario described above. Understanding the way in which satisfaction affects the customers' willingness to respond to the survey can be used to get a better picture of overall satisfaction and to design better future surveys, regardless of any supervised learning task which models the actual satisfaction.
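To make the "inverse sampling" weighting concrete, the following is a minimal simulation sketch and not an implementation from this paper. The population, the logistic link, and the labeling probabilities 0.8 and 0.3 are all hypothetical choices made for illustration, and l(x, y) is assumed to be known exactly. It compares the unweighted labeled-only estimate of Pr(Y = 1) with the estimate that weights each labeled example by 1/l(x, y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical population: one feature x and a binary response y.
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(int)

# Assumed (known) labeling mechanism l(x, y); here it depends on y only.
l = np.where(y == 1, 0.8, 0.3)
s = rng.random(n) < l                 # s[i] is True iff y[i] is observed

true_rate = y.mean()                  # uses every y; unavailable in practice
naive_rate = y[s].mean()              # labeled examples only, unweighted
w = 1.0 / l[s]                        # inverse sampling weights
weighted_rate = np.sum(w * y[s]) / n  # labeled examples only, weighted

print(f"true={true_rate:.3f} naive={naive_rate:.3f} weighted={weighted_rate:.3f}")
```

In this setup positives are labeled more often than negatives, so the unweighted labeled-only rate overstates Pr(Y = 1), while the weighted estimate stays close to the full-population rate.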

Our approach is described in section 2, and is based on a method of moments. Observe that for every function of the features $g(x)$, we can get an unbiased estimate of its mean as $\frac{1}{n}\sum_{i=1}^{n} g(x_i)$. We show that if we know the underlying label sampling mechanism $l(x, y)$ we can get a different unbiased estimate of $E\,g(x)$, which uses only the labeled examples, weighted by $1/l(x, y)$. We suggest inferring the unknown function $l(x, y)$ by requiring that we get identical estimates of $E\,g(x)$ using both approaches. We illustrate our method's implementation on the California Housing data-set in section 3. In section 4 we review related work in the machine learning and statistics literature, and we conclude with a discussion in section 5.
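The moment-matching idea can be sketched in a deliberately simplified setting, which is an assumption of this example and not the parametric family used in the paper: l depends on a binary y only, so there are two unknowns, l(y = 0) and l(y = 1). With the two hypothetical moment functions g(x) = 1 and g(x) = x, requiring both estimates of E g(x) to agree gives a 2-by-2 linear system in the inverse probabilities w_c = 1/l(y = c). The simulated data and the true values 0.3 and 0.8 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population: one feature x and a binary response y.
x = rng.normal(size=n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(int)

# Hypothetical labeling mechanism: l(y=0) = 0.3, l(y=1) = 0.8.
l_true = np.array([0.3, 0.8])
s = rng.random(n) < l_true[y]          # s[i] is True iff y[i] is observed

# Moment functions g_1(x) = 1 and g_2(x) = x, stacked as rows.
G = np.vstack([np.ones(n), x])         # shape (2, n)

# Estimate 1: plain average over ALL examples (labeled and unlabeled).
b = G.mean(axis=1)

# Estimate 2: labeled examples only, weighted by w_c = 1 / l(y=c).
# It is linear in (w_0, w_1), with
#   A[k, c] = (1/n) * sum of g_k(x_i) over labeled i with y_i = c.
# The mask includes s, so only observed responses are ever used.
A = np.column_stack([G[:, s & (y == c)].sum(axis=1) / n for c in (0, 1)])

# Require both estimates to coincide: A @ w = b.
w_hat = np.linalg.solve(A, b)
l_hat = 1.0 / w_hat

print("estimated l(y=0), l(y=1):", l_hat)
```

The system is solvable here because the labeled examples of the two classes have different feature means. If g(x) carried no information that separates the classes, the two columns of A would be nearly proportional and the labeling probabilities would not be identifiable from these moments.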

² The importance of this is that we are required to hypothesize and fit a conditional probability model for l(x, y) only, as opposed to the full probability model for (S, X, Y) required for, say, EM.