{"title": "Active learning for misspecified generalized linear models", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": null, "full_text": "Active learning for misspeci\ufb01ed\n\ngeneralized linear models\n\nFrancis R. Bach\n\nCentre de Morphologie Math\u00b4ematique\n\nEcole des Mines de Paris\n\nFontainebleau, France\n\nfrancis.bach@mines.org\n\nAbstract\n\nActive learning refers to algorithmic frameworks aimed at selecting training data\npoints in order to reduce the number of required training data points and/or im-\nprove the generalization performance of a learning method.\nIn this paper, we\npresent an asymptotic analysis of active learning for generalized linear models.\nOur analysis holds under the common practical situation of model misspeci\ufb01ca-\ntion, and is based on realistic assumptions regarding the nature of the sampling\ndistributions, which are usually neither independent nor identical. We derive un-\nbiased estimators of generalization performance, as well as estimators of expected\nreduction in generalization error after adding a new training data point, that allow\nus to optimize its sampling distribution through a convex optimization problem.\nOur analysis naturally leads to an algorithm for sequential active learning which is\napplicable for all tasks supported by generalized linear models (e.g., binary clas-\nsi\ufb01cation, multi-class classi\ufb01cation, regression) and can be applied in non-linear\nsettings through the use of Mercer kernels.\n\n1 Introduction\n\nThe goal of active learning is to select training data points so that the number of required training\ndata points for a given performance is smaller than the number which is required when randomly\nsampling those points. 
Active learning has emerged as a dynamic field of research in machine learning and statistics [1], from early work in optimal experimental design [2, 3] to recent theoretical results [4] and applications in text retrieval [5], image retrieval [6] and bioinformatics [7].

Despite the many successful applications of active learning to reduce the number of required training data points, many authors have also reported cases where widely used active learning heuristics, such as maximum uncertainty sampling, perform worse than random selection [8, 9], casting doubt on the practical applicability of active learning: why would a practitioner use an active learning strategy that is not guaranteed to perform better than random, unless the data satisfy possibly unrealistic and usually unverifiable assumptions? The objectives of this paper are (1) to provide a theoretical analysis of active learning under realistic assumptions and (2) to derive a principled algorithm for active learning with guaranteed consistency.

In this paper, we consider generalized linear models [10], which provide flexible and widely used tools for many supervised learning tasks (Section 2). Our analysis is based on asymptotic arguments and follows previous asymptotic analyses of active learning [11, 12, 9, 13]; however, as shown in Section 4, we do not rely on correct model specification, and we assume that the data are not identically distributed and may not be independent. As shown in Section 5, our theoretical results naturally lead to convex optimization problems for selecting training data points in a sequential design.
In Section 6, we present simulations on synthetic data, illustrating our algorithms and comparing them favorably to usual active learning schemes.

2 Generalized linear models

Given data x ∈ R^d and targets y in a set Y, we consider the problem of modeling the conditional probability p(y|x) through a generalized linear model (GLIM) [10]. We assume that we are given an exponential family adapted to our prediction task, of the form p(y|η) = exp(ηᵀT(y) − ψ(η)), where T(y) is a k-dimensional vector of sufficient statistics, η ∈ R^k is a vector of natural parameters and ψ(η) is the convex log-partition function. We then consider the generalized linear model defined as p(y|x, θ) = exp(tr(θᵀxT(y)ᵀ) − ψ(θᵀx)), where θ ∈ Θ ⊂ R^{d×k}. The framework of GLIMs is general enough to accommodate many supervised learning tasks [10], in particular:

• Binary classification: the Bernoulli distribution leads to logistic regression, with Y = {0, 1}, T(y) = y and ψ(η) = log(1 + e^η).

• k-class classification: the multinomial distribution leads to softmax regression, with Y = {y ∈ {0,1}^k, Σ_{i=1}^k y_i = 1}, T(y) = y and ψ(η) = log(Σ_{i=1}^k e^{η_i}).

• Regression: the normal distribution leads to Y = R, T(y) = (y, −½y²)ᵀ ∈ R², and ψ(η_1, η_2) = η_1²/(2η_2) − ½ log η_2 + ½ log 2π. When both η_1 and η_2 depend linearly on x, we have a heteroscedastic model, while if η_2 is constant for all x, we obtain homoscedastic regression (constant noise variance).

Maximum likelihood estimation   We assume that we are given independent and identically distributed (i.i.d.) data sampled from the distribution p_0(x, y) = p_0(x)p_0(y|x).
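As a quick numerical check of the exponential-family ingredients above, the classical identity ψ′(η) = E_{p(y|η)} T(y) (used below to characterize the ML estimator) can be verified for the Bernoulli family; a minimal numpy sketch, with illustrative function names not taken from the paper:

```python
import numpy as np

# For the Bernoulli family, psi(eta) = log(1 + e^eta), and its derivative
# equals the expected sufficient statistic E[T(y)] = P(y = 1) = sigmoid(eta).

def psi_bernoulli(eta):
    # log-partition function of the Bernoulli family (logistic regression)
    return np.log1p(np.exp(eta))

def expected_T_bernoulli(eta):
    # E[T(y)] = sigmoid(eta)
    return 1.0 / (1.0 + np.exp(-eta))

eta = 0.7
eps = 1e-6
# finite-difference derivative of psi matches the expected sufficient statistic
fd = (psi_bernoulli(eta + eps) - psi_bernoulli(eta - eps)) / (2 * eps)
assert abs(fd - expected_T_bernoulli(eta)) < 1e-8
```

The same check applies to the softmax and Gaussian log-partition functions listed above, with gradients instead of scalar derivatives.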
The maximum likelihood population estimator θ_0 is defined as the minimizer of the expectation under p_0 of the negative log-likelihood ℓ(y, x, θ) = −tr(θᵀxT(y)ᵀ) + ψ(θᵀx). The function ℓ(y, x, θ) is convex in θ, and by taking derivatives and using the classical relationship between the derivative of the log-partition function and the expected sufficient statistics [10], the population maximum likelihood estimate is defined by:

E_{p_0(x,y)} ∇ℓ(y, x, θ_0) = E_{p_0(x)} { x (E_{p(y|x,θ_0)} T(y) − E_{p_0(y|x)} T(y))ᵀ } = 0.   (1)

Given i.i.d. data (x_i, y_i), i = 1, …, n, we use the penalized maximum likelihood estimator, which minimizes Σ_{i=1}^n ℓ(y_i, x_i, θ) + ½ λ tr θᵀθ. The minimization is performed by Newton's method [14].

Model specification   A GLIM is said to be well-specified if there exists θ ∈ R^{d×k} such that for all x ∈ R^d, E_{p(y|x,θ)} T(y) = E_{p_0(y|x)} T(y). A sufficient condition for correct specification is that there exists θ ∈ R^{d×k} such that for all x ∈ R^d and y ∈ Y, p(y|x, θ) = p_0(y|x). This condition is also necessary for the Bernoulli and multinomial exponential families, but not, for example, for the normal distribution. In practice, the model is often misspecified, and it is thus important to account for potential misspecification when deriving asymptotic expansions.

Kernels   The theoretical results of this paper mainly focus on generalized linear models; however, they can be readily generalized to non-linear settings by using Mercer kernels [15], leading for example to kernel logistic regression or kernel ridge regression. When the data are given by a kernel matrix, we can use the incomplete Cholesky decomposition [16] to find an approximate basis of the feature space, on which the usual linear methods can be applied.
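A minimal sketch of such a low-rank approximation of a kernel matrix, assuming a greedy diagonally pivoted variant of incomplete Cholesky suffices (this is an illustration, not the implementation of [16]):

```python
import numpy as np

# Pivoted incomplete Cholesky: given a PSD kernel matrix K, build a low-rank
# factor L with K ~= L L^T; the rows of L then act as explicit finite-dimensional
# features on which the linear methods of Section 2 can be applied.

def incomplete_cholesky(K, rank, tol=1e-10):
    n = K.shape[0]
    L = np.zeros((n, rank))
    d = np.diag(K).astype(float).copy()      # residual diagonal of K - L L^T
    for j in range(rank):
        i = int(np.argmax(d))                # greedy pivot: largest residual
        if d[i] < tol:
            return L[:, :j]                  # numerical rank reached
        L[:, j] = (K[:, i] - L @ L[i, :]) / np.sqrt(d[i])
        d -= L[:, j] ** 2
    return L

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = X @ X.T                                  # a rank-3 PSD "kernel" matrix
L = incomplete_cholesky(K, rank=3)
assert np.allclose(K, L @ L.T, atol=1e-8)    # exact recovery at the true rank
```

For an exactly low-rank matrix the factorization is recovered after `rank` pivots; for a full-rank kernel matrix one keeps as many columns as needed for the desired approximation accuracy.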
Note that our asymptotic results do not hold when the number of parameters may grow with the data (which is the case for kernels such as the Gaussian kernel). However, our dimensionality reduction procedure uses a non-parametric method on the entire (usually large) training dataset, and we then consider a finite-dimensional problem on a much smaller sample. If the whole training dataset is large enough, the dimension reduction procedure may be considered deterministic and our criteria may apply.

3 Active learning set-up

We consider the following "pool-based" active learning scenario: we have a large set of i.i.d. data points x_i ∈ R^d, i = 1, …, m, sampled from p_0(x). The goal of active learning is to select the points to label, i.e., the points for which the corresponding y_i will be observed. We assume that given x_i, i = 1, …, n, the targets y_i, i = 1, …, n, are independent and sampled from the corresponding conditional distribution p_0(y_i|x_i). This active learning set-up is well studied and appears naturally in many applications where the input distribution p_0(x) is known only through i.i.d. samples [5, 17]. For alternative scenarios, where the density p_0(x) is known, see, e.g., [18, 19, 20].

More precisely, we assume that the points x_i are selected sequentially, and we denote by q_i(x_i|x_1, …, x_{i−1}) the sampling distribution of x_i given the previously observed points. In situations where the data are not sampled from the testing distribution, likelihood weighting techniques have proved advantageous [13, 19], and we thus consider weights w_i = w_i(x_i|x_1, …, x_{i−1}).
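For concreteness, a weighted penalized ML fit of this kind, for binary logistic regression solved by Newton's method as in Section 2, might look like the following sketch (all names and data are illustrative, not from the paper):

```python
import numpy as np

# Minimize sum_i w_i * l(y_i, x_i, theta) + (lambda/2) theta^T theta
# for binary logistic regression (T(y) = y), by Newton's method.

def fit_weighted_logistic(X, y, w, lam=1e-2, iters=25):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))       # sigmoid(x_i^T theta)
        grad = X.T @ (w * (p - y)) + lam * theta   # weighted gradient
        # weighted Hessian: X^T diag(w_i p_i (1-p_i)) X + lam I
        H = X.T @ (X * (w * p * (1 - p))[:, None]) + lam * np.eye(d)
        theta -= np.linalg.solve(H, grad)          # Newton step
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)
w = np.ones(500)                                   # unit weights in this toy run
theta_hat = fit_weighted_logistic(X, y, w)
```

Replacing the unit weights by, e.g., importance weights p_0(x_i)/q_i(x_i) gives the reweighted estimators discussed below.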
We let θ̂_n denote the weighted penalized ML estimator, defined as the minimizer with respect to θ of

Σ_{i=1}^n w_i ℓ(y_i, x_i, θ) + (λ/2) tr θᵀθ.   (2)

In this paper, we work with two different assumptions regarding the sequential sampling distributions: (1) the variables x_i are independent, i.e., q_i(x_i|x_1, …, x_{i−1}) = q_i(x_i); (2) the variable x_i depends on x_1, …, x_{i−1} only through the current empirical ML estimator θ̂_i, i.e., q_i(x_i|x_1, …, x_{i−1}) = q(x_i|θ̂_i), where q(x|θ) is a pre-specified sampling distribution. The first assumption is not realistic but readily leads to asymptotic expansions. The second assumption is more realistic, as most heuristic schemes for sequential active learning satisfy it. It turns out that, under certain assumptions, the asymptotic expansions of the expected generalization performance are identical for both sets of assumptions.

4 Asymptotic expansions

In this section, we derive the asymptotic expansions that will lead to the active learning algorithms of Section 5. Throughout this section, we assume that p_0(x) has a compact support K and a twice differentiable density with respect to the Lebesgue measure, and that all sampling distributions have compact support included in that of p_0(x) and twice differentiable densities.

We first make the assumption that the variables x_i are independent, i.e., we have sampling distributions q_i(x_i) and weights w_i(x_i), both measurable and such that w_i(x_i) > 0 for all x_i ∈ K. In Section 4.4, we extend some of our results to the dependent case.

4.1 Bias and variance of the ML estimator

The following proposition is a simple extension, to non-identically distributed observations, of classical results on maximum likelihood for misspecified generalized linear models [21, 13].
We let E_D and var_D denote the expectation and variance with respect to the data D = {(x_i, y_i), i = 1, …, n}.

Proposition 1   Let θ_n denote the minimizer of Σ_{i=1}^n E_{q_i(x_i)p_0(y_i|x_i)} w_i(x_i) ℓ(y_i, x_i, θ). If (a) the weight functions w_n and the sampling densities q_n are pointwise strictly positive and such that w_n(x)q_n(x) converges in the L∞-norm, and (b) E_{q_n(x)} w_n(x)² is bounded, then θ̂_n − θ_n converges to zero in probability and we have

E_D θ̂_n = θ_n + O(n⁻¹)  and  var_D θ̂_n = (1/n) J_n⁻¹ I_n J_n⁻¹ + O(n⁻²),   (3)

where J_n = (1/n) Σ_{i=1}^n E_{q_i(x)} w_i(x) ∇²ℓ(x, θ_n) can be consistently estimated by Ĵ_n = (1/n) Σ_{i=1}^n w_i h_i, and I_n = (1/n) Σ_{i=1}^n E_{q_i(x)p_0(y|x)} w_i(x)² ∇ℓ(y, x, θ_n)∇ℓ(y, x, θ_n)ᵀ can be consistently estimated by Î_n = (1/n) Σ_{i=1}^n w_i² g_i g_iᵀ, where g_i = ∇ℓ(y_i, x_i, θ̂_n) and h_i = ∇²ℓ(x_i, θ̂_n).

From Proposition 1, it is worth noting that in general θ_n will not converge to the population maximum likelihood estimate θ_0, i.e., using a sampling distribution different from p_0(x) may introduce a bias in estimating θ_0 that does not vanish asymptotically. Thus, active learning requires ensuring (a) that our estimators have low bias and variance in estimating θ_n, and (b) that θ_n does actually converge to θ_0. This double objective is taken care of by our estimates of generalization performance in Propositions 2 and 3.

There are two situations, however, where θ_n is equal to θ_0.
First, if the model is well-specified, then whatever the sampling distributions are, θ_n is the population ML estimate (a simple consequence of the fact that E_{p(y|x,θ_0)} T(y) = E_{p_0(y|x)} T(y) for all x implies that, for all q(x), E_{q(x)p_0(y|x)} ∇ℓ(y, x, θ_0) = E_{q(x)} { x (E_{p(y|x,θ_0)} T(y) − E_{p_0(y|x)} T(y))ᵀ } = 0).

Second, when w_n(x) = p_0(x)/q_n(x), θ_n is also equal to θ_0; we refer to this weighting scheme as the unbiased reweighting scheme, which was used by [19] in the context of active learning. We refer to the weights w_n^u = p_0(x_n)/q_n(x_n) as the importance weights. Note, however, that restricting ourselves to such unbiased estimators, as done in [19], might not be optimal, because they may lead to higher variance [13], in particular due to the potentially high variance of the importance weights (see simulations in Section 6).

4.2 Expected generalization performance

We let L^u(θ) = E_{p_0(x)p_0(y|x)} ℓ(y, x, θ) denote the generalization performance¹ of the parameter θ. We now provide an unbiased estimator of the expected generalization error of θ̂_n, which generalizes the Akaike information criterion [22] (for a proof, see [23]):

Proposition 2   In addition to the assumptions of Proposition 1, we assume that E_{q_n(x)} (p_0(x)/q_n(x))² is bounded. Let

Ĝ = (1/n) Σ_{i=1}^n w_i^u ℓ(y_i, x_i, θ̂_n) + (1/n) ( (1/n) Σ_{i=1}^n w_i^u w_i g_iᵀ (Ĵ_n)⁻¹ g_i ),   (4)

where w_i^u = p_0(x_i)/q_i(x_i).
Ĝ is an asymptotically unbiased estimator of E_D L^u(θ̂_n), i.e., E_D Ĝ = E_D L^u(θ̂_n) + O(n⁻²).

The criterion Ĝ is a sum of two terms: the second term corresponds to a variance term and converges to zero in probability at rate O(n⁻¹); the first term, however, which corresponds to a selection bias induced by a specific choice of sampling distributions, will not always converge to the minimum possible value L^u(θ_0). Thus, in order to ensure that our active learning methods are consistent, we have to ensure that this first term goes to its minimum value. One simple way to achieve this is to always optimize our weights so that the estimate Ĝ is smaller than the estimate for the unbiased reweighting scheme (see Section 5).

4.3 Expected performance gain

We now look at the following situation: given the first n data points (x_i, y_i), the current estimate θ̂_n, the gradients g_i = ∇ℓ(y_i, x_i, θ̂_n), the Hessians h_i = ∇²ℓ(x_i, θ̂_n) and the third derivatives T_i = ∇³ℓ(x_i, θ̂_n), we consider the following criterion, which depends on the sampling distribution and weight function of the (n+1)-th point:

Ĥ(q_{n+1}, w_{n+1}|α, β) = (1/n³) [ Σ_{i=1}^n α_i w_i^u w_{n+1}(x_i) q_{n+1}(x_i)/p_0(x_i) + Σ_{i=1}^n β_i w_i^u w_{n+1}(x_i)² q_{n+1}(x_i)/p_0(x_i) ],   (5)

where

α_i = −(n+1)n g̃_iᵀ Ĵ_n A − w_i w_i^u g̃_iᵀ h_i g̃_i + w_i^u g̃_iᵀ Ĵ_n^u g̃_i − 2 g̃_iᵀ B + T_i[g̃_i, C] − 2 w_i g̃_iᵀ h_i A + T_i[A, g̃_i, g̃_i],   (6)

β_i = ½ g̃_iᵀ Ĵ_n^u g̃_i − w_i g̃_iᵀ h_i g̃_i + Aᵀ h_i g̃_i,   (7)

with g̃_i = Ĵ_n⁻¹ g_i, A = Ĵ_n⁻¹ (1/n) Σ_{i=1}^n w_i^u g_i, B = (1/n) Σ_{i=1}^n w_i w_i^u h_i g̃_i, C = (1/n) Σ_{i=1}^n w_i w_i^u g̃_i g̃_iᵀ, and Ĵ_n^u = (1/n) Σ_{i=1}^n w_i^u h_i.

The following proposition shows that Ĥ(q_{n+1}, w_{n+1}|α, β) is an
estimate of the expected performance gain obtained by choosing a point x_{n+1} according to the distribution q_{n+1} with weight w_{n+1} (marginalizing over y_{n+1}), and may be used as an objective function for learning the distributions q_{n+1}, w_{n+1} (for a proof, see [23]). In Section 5, we show that if the distributions and weights are properly parameterized, this leads to a convex optimization problem.

Proposition 3   We assume that E_{q_n(x)} w_n(x)² and E_{q_n(x)} (p_0(x)/q_n(x))² are bounded. Let θ̂_n denote the weighted ML estimator obtained from the first n points, and θ̂_{n+1} the one-step estimator obtained from the first n+1 points, i.e., θ̂_{n+1} is obtained by one Newton step from θ̂_n [24]; then the criterion defined in Eq. (5) is such that E_D Ĥ(q_{n+1}, w_{n+1}) = E_D L^u(θ̂_n) − E_D L^u(θ̂_{n+1}) + O(n⁻³), where E_D denotes the expectation with respect to the first n+1 data points and their labels. Moreover, for n large enough, all values of β_i are positive.

¹In this paper, we use the negative log-likelihood as the measure of performance, which allows simple asymptotic expansions; the focus of the paper is on the differences between testing and training sampling distributions. The study of potentially different costs for testing and training is beyond the scope of this paper.

Note that many of the terms in Eq. (6) and Eq. (7) are dedicated to weighting schemes for the first n points other than the unbiased reweighting scheme. For the unbiased reweighting scheme, where w_i = w_i^u for i = 1, …, n, we have A = 0 and the equations may be simplified.

4.4 Dependent observations

In this section, we show that under a certain form of weak dependence between the data points x_i, i = 1, …, n, the results presented in Propositions 1 and 2 still hold.
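Throughout this section, the key quantity is the importance weight w_n^u = p_0(x_n)/q_n(x_n). A toy sketch of why such weights remove the bias introduced by a sampling-distribution mismatch (Gaussian p_0 and q chosen purely for illustration):

```python
import numpy as np

# Importance weights w^u = p0(x)/q(x) debias an expectation when samples
# come from q instead of p0. Here p0 = N(0, 1) and q = N(1, 1).

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(1.0, 1.0, n)                       # samples from q

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

wu = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 1.0)   # p0(x)/q(x)
naive = np.mean(x)            # biased: estimates E_q[x] = 1, not E_p0[x]
weighted = np.mean(wu * x)    # unbiased estimate of E_p0[x] = 0
```

The price of this unbiasedness is the variance of the weights themselves, which is exactly why the algorithms of Section 5 constrain it explicitly.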
For simplicity and brevity, we restrict ourselves to the unbiased reweighting scheme, i.e., w_n(x_n|x_1, …, x_{n−1}) = p_0(x_n)/q_n(x_n|x_1, …, x_{n−1}) for all n, and we assume that those weights are uniformly bounded away from zero and infinity. In addition, we only prove our result in the well-specified case, which leads to a simpler argument for the consistency of the estimator.

Many sequential active learning schemes select a training data point with a distribution or criterion that depends on the estimate obtained so far (see Section 6 for details). We thus assume that the sampling distribution q_n is of the form q(x_n|θ̂_n), where q(x|θ) is a fixed set of smooth parameterized densities.

Proposition 4 (for a proof, see [23])   Let

Ĝ = (1/n) Σ_{i=1}^n w_i ℓ(y_i, x_i, θ̂_n) + (1/n) ( (1/n) Σ_{i=1}^n w_i² g_iᵀ (Ĵ_n)⁻¹ g_i ),   (8)

where w_i = w_i^u = p_0(x_i)/q(x_i|θ̂_i). Ĝ is an asymptotically unbiased estimator of E_D L^u(θ̂_n), i.e., E_D Ĝ = E_D L^u(θ̂_n) + O(log(n) n⁻²).

The estimator is the same as in Proposition 2. The effect of the dependence is asymptotically negligible and only impacts the result through the additional log(n) factor. In the algorithms presented in Section 5, the distribution q_n is obtained as the solution of a convex optimization problem, and thus the previous theorem does not readily apply.
However, when n gets large, q_n depends on the previous data points only through the first two derivatives of the objective function of the convex problem, which are empirical averages of certain functions of all currently observed data points; we are currently working out a generalization of Proposition 4 that allows dependence on certain empirical moments and potential misspecification.

5 Algorithms

In Section 4, we derived a criterion Ĥ in Eq. (5) that enables us to optimize the sampling density of the (n+1)-th point, and an estimate Ĝ, in Eq. (4) and Eq. (8), of the generalization error. Our algorithms are composed of the following three ingredients:

1. The criteria assume that the variance of the importance weights w_n^u = p_0(x_n)/q_n(x_n) is controlled. In order to make sure that those results apply, our algorithms ensure that this condition is met.

2. The sampling density q_{n+1} is obtained by minimizing Ĥ(w_{n+1}, q_{n+1}|α, β) for a certain parameterization of q_{n+1} and w_{n+1}. It turns out that those minimization problems are convex, and can thus be solved efficiently, without local minima.

3. Once a new sample has been selected and its label observed, Proposition 4 is used, in a way similar to [13], to search for the best mixture between the current weights (w_i) and the importance weights (w_i^u): we consider weights of the form w_i^γ (w_i^u)^{1−γ} and perform a grid search on γ to find the γ for which Ĝ in Eq. (4) is minimal.

The main interest of the first and third ingredients is that we obtain a final estimator of θ_0 which is at least provably consistent: indeed, although our criteria are obtained from an assumption of independence, the generalization performance result also holds for "weakly" dependent observations and thus ensures the consistency of our approach.
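The grid search over γ in ingredient 3 can be sketched as follows, with toy placeholder inputs (the per-point losses and quadratic terms g_iᵀ Ĵ_n⁻¹ g_i are given as precomputed numbers, not produced by a real model):

```python
import numpy as np

# Grid search over gamma for weights of the form w_i^gamma (w_i^u)^(1-gamma),
# picking the gamma minimizing the generalization estimate G-hat of Eq. (4).

def g_hat(w, wu, losses, quad):
    # G-hat = (1/n) sum_i w_i^u l_i + (1/n^2) sum_i w_i^u w_i g_i^T J^-1 g_i
    n = len(w)
    return (wu * losses).mean() + (wu * w * quad).mean() / n

rng = np.random.default_rng(1)
n = 100
w = rng.uniform(0.5, 2.0, n)       # current training weights w_i
wu = rng.uniform(0.5, 2.0, n)      # importance weights p0(x_i)/q_i(x_i)
losses = rng.random(n)             # toy values of l(y_i, x_i, theta-hat)
quad = rng.random(n)               # toy values of g_i^T J-hat^{-1} g_i

gammas = np.linspace(0.0, 1.0, 11)
scores = [g_hat(w**g * wu**(1 - g), wu, losses, quad) for g in gammas]
best_gamma = gammas[int(np.argmin(scores))]
```

The endpoint γ = 0 recovers the unbiased reweighting scheme, so the selected mixture is never worse, as measured by Ĝ, than the unbiased one.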
Thus, as opposed to most previous active learning heuristics, our estimator will always converge (in probability) to the ML estimator. In Section 6, we show empirically that usual heuristic schemes do not share this property.

Convex optimization problem   We assume that we have a fixed set of candidate distributions s_k(x) of the form s_k(x) = p_0(x) r_k(x). Note that the multiplicative form of our candidate distributions allows efficient sampling from a pool of samples of p_0. We look at distributions q_{n+1}(x) with mixture density of the form s(x|η) = Σ_k η_k s_k(x) = p_0(x) r(x), where the weights η are non-negative and sum to one. The criterion Ĥ(q_{n+1}, w_{n+1}|α, β) in Eq. (5) is thus a function H(η|α, β) of η. We consider two weighting schemes: (a) one with all weights equal to one (the unit weighting scheme), which leads to H_0(η|α, β), and (b) the unbiased reweighting scheme, where w_{n+1}(x) = p_0(x)/q_{n+1}(x), which leads to H_1(η|α, β). We have

H_0(η|α, β) = (1/n³) Σ_k η_k ( Σ_{i=1}^n (α_i + β_i) w_i^u r_k(x_i) ),   (9)

H_1(η|α, β) = (1/n³) ( Σ_{i=1}^n α_i w_i^u + Σ_{i=1}^n β_i w_i^u / Σ_k η_k r_k(x_i) ).   (10)

The function H_0(η) is linear in η, while the function H_1(η) is the sum of a constant and of positive inverse functions, and is thus convex [14].

Unless natural candidate distributions s_k(x) can be defined for the active learning problem, we use the set of distributions obtained as follows: we perform K-means clustering with a large number p of clusters (e.g., 100 or 200), and then consider functions r_k(x) of the form r_k(x) = (1/Z_k) e^{−α_k ‖x−μ_k‖²}, where α_k is one element of a finite given set of parameters, and μ_k is one of the p centroids y_1, …, y_p obtained from K-means. We let w̃_i denote the number of data points assigned to the centroid y_i.
We normalize by Z_k = Σ_{i=1}^p w̃_i e^{−α_k ‖y_i−μ_k‖²} / Σ_{i=1}^p w̃_i. We thus obtain O(p) candidate distributions r_k(x), which, if p is large enough, provide a flexible yet tractable set of mixture distributions.

One additional element is the constraint on the variance of the importance weights. The variance of w^u_{n+1} can be estimated as var w^u_{n+1} = (1/m) Σ_{i=1}^m 1/r(x_i) − 1 = (1/m) Σ_{i=1}^m 1/(Σ_k η_k r_k(x_i)) − 1 = V(η), which is convex in η. Thus, constraining the variance of the new weights leads to a convex optimization problem, with convex objective and convex constraints, which can be solved efficiently by the log-barrier method [14], with cubic complexity in the number of candidate distributions.

Algorithms   We have three versions of our algorithm: one with unit weights (referred to as "no weight"), which optimizes H_0(η|α, β) at each iteration; one with the unbiased reweighting scheme, which optimizes H_1(η|α, β) (referred to as "unbiased"); and one which does both and chooses the better one, as measured by Ĥ (referred to as "full"). In the initialization phase, K-means is run to generate the candidate distributions that will be used throughout the sampling of new points. Then, in order to select the new training data point x_{n+1}, the scores α and β are computed from Eq. (6) and Eq. (7), the appropriate cost function, H_0(η|α, β), H_1(η|α, β) (or both), is minimized and, once η is obtained, we sample x_{n+1} from the corresponding distribution and compute the weights w_{n+1} and w^u_{n+1}. As described earlier, we then find γ such that Ĝ((w_i^γ (w_i^u)^{1−γ})_i) in Eq. (4) is minimal, and we update the weights accordingly.

Regularization parameter   In the active learning set-up, the number of samples used for learning varies a lot. It is thus not possible to use a constant regularization parameter.
We thus learn it by cross-validation every 10 new samples.

6 Simulation experiments

In this section, we present simulation experiments on synthetic examples (sampled from Gaussian mixtures in two dimensions), for the tasks of binary and 3-class classification. We compare our algorithms to the following three active learning frameworks. In the maximum uncertainty framework (referred to as "maxunc"), the next training data point is selected such that the entropy of p(y|x, θ̂_n) is maximal [17]. In the maximum variance reduction framework [25, 9] (referred to as "varred"), the next point is selected so that the variance of the resulting estimator has the lowest determinant, which is equivalent to finding x such that tr ∇²ℓ(x, θ̂_n) Ĵ_n⁻¹ is minimal. Note that this criterion has theoretical justification under correct model specification. In the minimum prediction error framework (referred to as "minpred"), the next point is selected so that it reduces the most the expected log-loss, with the current model used as an estimate of the unknown conditional probability p_0(y|x) [5, 8].

Sampling densities   In Figure 1, we look at the limit of the selected sampling densities, i.e., we assume that a large number of points has been sampled, and we look at the criterion Ĥ in Eq. (5).
We show the density obtained from the unbiased reweighting scheme (middle of Figure 1), as well as the function γ(x) (right of Figure 1) such that, for the unit weighting scheme, Ĥ(q_{n+1}(x), 1) = ∫ γ(x) q_{n+1}(x) dx.

[Figure 1: Proposal distributions. (Left) density p_0(x) with the two different classes (red and blue); (Middle) best density with unbiased reweighting; (Right) function γ(x) such that Ĥ(q_{n+1}(x), 1) = ∫ γ(x)q_{n+1}(x)dx (see text for details).]

[Figure 2: Error rates vs. number of samples, averaged over 10 replications sampled from the same distribution as in Figure 1. (Left) random sampling and active learning "full", with standard deviations; (Middle) comparison of the two schemes "unbiased" and "no weight"; (Right) comparison with other methods ("random", "minpred", "varred", "maxunc").]

In this framework, minimizing the cost without any constraint leads to a Dirac at the maximum of γ(x), while minimizing with a constraint on the variance of the corresponding importance weights will select points with high values of γ(x).
We also show the line θ_0ᵀx = 0. From Figure 1, we see (a) that the unit weighting scheme tends to be more selective (i.e., finer-grained) than the unbiased scheme, and (b) that the modes of the optimal densities are close to the maximum-uncertainty hyperplane, but that some parts of this hyperplane in fact lead to negative cost gains (e.g., the part of the hyperplane crossing the central blob), hinting at the potentially bad behavior of the maximum uncertainty framework.

Comparison with other algorithms   In Figure 2 and Figure 3, we compare the performance of our active learning algorithms. In the left of Figure 2, we see that our active learning framework not only performs better on average, but also leads to smaller variance. In the middle of Figure 2, we compare the two schemes "no weight" and "unbiased", showing the superiority of the unit weighting scheme and the significance of our asymptotic results in Propositions 2 and 3, which extend the unbiased framework of [13]. In the right of Figure 2 and in Figure 3, we compare with the other usual heuristic schemes: our "full" algorithm outperforms the other schemes; moreover, in those experiments, the other schemes perform worse than random sampling and converge to the wrong estimator, a bad situation that our algorithms provably avoid.

7 Conclusion

We have presented a theoretical asymptotic analysis of active learning for generalized linear models under realistic sampling assumptions. From this analysis, we obtain convex criteria which can be optimized to provide algorithms for online optimization of the sampling distributions. This work naturally leads to several extensions. First, our framework is not limited to generalized linear models, but can be readily extended to any convex differentiable M-estimators [24].
Second, it seems advantageous to combine our active learning analysis with semi-supervised learning frameworks, in particular ones based on data-dependent regularization [26]. Finally, we are currently investigating applications to large-scale image retrieval tasks, where unlabelled data are abundant but labelled data are scarce.

[Figure 3: Error rates vs. number of samples, averaged over 10 replications, for 3 classes: (left) data; (right) comparison of methods ("random", "full", "minpred", "varred", "maxunc").]

References

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. J. Art. Intel. Res., 4:129–145, 1996.
[2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
[3] P. Chaudhuri and P. A. Mykland. On efficient designing of nonlinear experiments. Stat. Sin., 5:421–440, 1995.
[4] S. Dasgupta. Coarse sample complexity bounds for active learning. In Adv. NIPS 18, 2006.
[5] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proc. ICML, 2001.
[6] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In Proc. ACM Multimedia, 2001.
[7] M. Warmuth, G. Rätsch, M. Mathieson, J. Liao, and C. Lemmen. Active learning in the drug discovery process. In Adv. NIPS 14, 2002.
[8] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In Proc. ICML, 2003.
[9] A. I. Schein. Active Learning for Logistic Regression. Ph.D. thesis, University of Pennsylvania, CIS Department, 2005.
[10] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[11] T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In Proc. ICML, 2000.
[12] O.
Chapelle. Active learning for Parzen window classifiers. In Proc. AISTATS, 2005.
[13] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inf., 90:227–244, 2000.
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2003.
[15] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[16] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243–264, 2001.
[17] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. ICML, 2000.
[18] K. Fukumizu. Active learning in multilayer perceptrons. In Adv. NIPS 8, 1996.
[19] T. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted log-likelihood estimator. J. Stat. Plan. Inf., 116:149–162, 2003.
[20] T. Kanamori. Statistical asymptotic theory of active learning. Ann. Inst. Stat. Math., 54(3):459–475, 2002.
[21] H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–26, 1982.
[22] H. Akaike. A new look at the statistical model identification. IEEE Trans. Aut. Cont., 19:716–722, 1974.
[23] F. R. Bach. Active learning for misspecified generalized linear models. Technical Report N15/06/MM, Ecole des Mines de Paris, 2006.
[24] A. W. van der Vaart. Asymptotic Statistics. Cambridge Univ. Press, 1998.
[25] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
[26] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Adv. NIPS 17, 2005.