Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track
Nived Rajaraman, Fnu Devvrit, Pranjal Awasthi
Labeled data often comes at a high cost, as it may require recruiting human labelers or running costly experiments. At the same time, in many practical scenarios, one already has access to a partially labeled, potentially biased dataset that can help with the learning task at hand. Motivated by such settings, we formally initiate a study of "semi-supervised active learning" through the frame of linear regression. Here, the learner has access to a dataset $X \in \mathbb{R}^{(n_{\mathrm{un}} + n_{\mathrm{lab}}) \times d}$ composed of $n_{\mathrm{un}}$ unlabeled examples that the learner can actively query, and $n_{\mathrm{lab}}$ examples labeled a priori. Denoting the true labels by $Y \in \mathbb{R}^{n_{\mathrm{un}} + n_{\mathrm{lab}}}$, the learner's objective is to find $\widehat{\beta} \in \mathbb{R}^d$ such that
\[ \| X \widehat{\beta} - Y \|_2^2 \le (1 + \epsilon) \min_{\beta \in \mathbb{R}^d} \| X \beta - Y \|_2^2 \]
while querying the labels of as few unlabeled points as possible. In this paper, we introduce an instance-dependent parameter called the reduced rank, denoted $R_X$, and propose an efficient algorithm with query complexity $O(R_X / \epsilon)$. This result directly implies improved upper bounds for two important special cases: (i) active ridge regression, and (ii) active kernel ridge regression, where the reduced rank equates to the "statistical dimension" $sd_\lambda$ and the "effective dimension" $d_\lambda$ of the problem, respectively, with $\lambda \ge 0$ denoting the regularization parameter. Finally, we introduce a distributional version of the problem as a special case of the agnostic formulation considered earlier; here, for every $X$, we prove a matching instance-wise lower bound of $\Omega(R_X / \epsilon)$ on the query complexity of any algorithm.
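For reference, the statistical dimension and effective dimension mentioned above are commonly defined as follows in the sketching and kernel ridge regression literature; this is a standard convention, and the paper's exact normalization may differ. Writing $\sigma_1, \ldots, \sigma_d$ for the singular values of $X$ and $\mu_1, \ldots, \mu_n$ for the eigenvalues of the kernel matrix $K$,
\[ sd_\lambda(X) = \sum_{i=1}^{d} \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \qquad d_\lambda(K) = \operatorname{tr}\!\left( K (K + \lambda I)^{-1} \right) = \sum_{i=1}^{n} \frac{\mu_i}{\mu_i + \lambda}. \]
At $\lambda = 0$ both quantities reduce to the rank, consistent with the unregularized regression setting.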
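As a minimal illustration, the sketch below computes the statistical dimension under the convention above and checks the $(1 + \epsilon)$-approximation guarantee for a candidate $\widehat{\beta}$. The function names are ours, and this is not the paper's query algorithm, only a numerical check of the quantities in the abstract.

import numpy as np

def statistical_dimension(X, lam):
    # sd_lambda(X) = sum_i sigma_i^2 / (sigma_i^2 + lam), with sigma_i the
    # singular values of X (standard convention; the paper's normalization
    # may differ).
    sigma = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(sigma**2 / (sigma**2 + lam)))

def satisfies_guarantee(X, Y, beta_hat, eps):
    # Checks the abstract's objective:
    # ||X beta_hat - Y||_2^2 <= (1 + eps) * min_beta ||X beta - Y||_2^2.
    beta_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
    opt = np.linalg.norm(X @ beta_star - Y) ** 2
    return np.linalg.norm(X @ beta_hat - Y) ** 2 <= (1 + eps) * opt

For a full-rank $X$ and lam = 0.0, statistical_dimension returns $d$ up to floating-point error, matching the reduction of $sd_\lambda$ to the rank as $\lambda \to 0$.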