{"title": "Integrating Bayesian and Discriminative Sparse Kernel Machines for Multi-class Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2285, "page_last": 2294, "abstract": "We propose a novel active learning (AL) model that integrates Bayesian and discriminative kernel machines for fast and accurate multi-class data sampling. By joining a sparse Bayesian model and a maximum margin machine under a unified kernel machine committee (KMC), the proposed model is able to identify a small number of data samples that best represent the overall data space while accurately capturing the decision boundaries. The integration is conducted using the maximum entropy discrimination framework, resulting in a joint objective function that contains generalized entropy as a regularizer. Such a property allows the proposed AL model to choose data samples that more effectively handle non-separable classification problems. Parameter learning is achieved through a principled optimization framework that leverages convex duality and sparse structure of KMC to efficiently optimize the joint objective function. Key model parameters are used to design a novel sampling function to choose data samples that can simultaneously improve multiple decision boundaries, making it an effective sampler for problems with a large number of classes. 
Experiments conducted over both synthetic and real data and comparison with competitive AL methods demonstrate the effectiveness of the proposed model.", "full_text": "Integrating Bayesian and Discriminative Sparse Kernel Machines for Multi-class Active Learning

Weishi Shi
Rochester Institute of Technology
ws7586@rit.edu

Qi Yu
Rochester Institute of Technology
qi.yu@rit.edu

Abstract

We propose a novel active learning (AL) model that integrates Bayesian and discriminative kernel machines for fast and accurate multi-class data sampling. By joining a sparse Bayesian model and a maximum margin machine under a unified kernel machine committee (KMC), the proposed model is able to identify a small number of data samples that best represent the overall data space while accurately capturing the decision boundaries. The integration is conducted using the maximum entropy discrimination framework, resulting in a joint objective function that contains generalized entropy as a regularizer. Such a property allows the proposed AL model to choose data samples that more effectively handle non-separable classification problems. Parameter learning is achieved through a principled optimization framework that leverages convex duality and the sparse structure of KMC to efficiently optimize the joint objective function. Key model parameters are used to design a novel sampling function to choose data samples that can simultaneously improve multiple decision boundaries, making it an effective sampler for problems with a large number of classes. Experiments conducted over both synthetic and real data and comparison with competitive AL methods demonstrate the effectiveness of the proposed model.

1 Introduction
While more labeled data tends to improve the performance of supervised learning, labeling a large number of data samples is labor intensive and time consuming.
Furthermore, obtaining accurate labels may be highly challenging for many specialized domains, such as medicine and biology, where expert knowledge is required for understanding and extracting the underlying semantics of data. Active Learning (AL) provides a promising direction to use a small subset of labeled data samples to train high-quality supervised learning models in a cost-effective way. Consequently, AL has been successfully applied to various applications [1, 2, 3].

A large number of AL models have been developed for different types of supervised learning models. However, the design of the data sampling strategy is usually limited by the learning models, which are not designed specifically for AL purposes. For example, max-margin based classifiers, such as support vector machines (SVMs), are widely used for sampling in AL. However, as they are essentially designed for the classification task, using them directly for sampling might lead to slow convergence. Figure 1a illustrates such behavior in existing models. Assume that the two middle clusters contain 80% of the data samples. Hence, it is highly likely that the initially labeled samples are from these two clusters, which give the initial decision boundary shown by the dashed line. Then, samples in the middle clusters will continue to be sampled as they are close to the current decision boundary. This will cause very slow convergence to the true decision boundary shown as the solid line. Furthermore, since the model performance over iterations stays roughly the same, it may cause AL to terminate early. This is undesirable, because the true decision boundary is never discovered.
Such behavior is intrinsic to the classifier, which primarily focuses on exploiting the current decision boundary rather than exploring the entire data distribution for more effective data sampling.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) Undesired convergence behavior of AL; (b) Distribution of SVs; (c) Distribution of RVs

To address the undesired convergence behavior of existing AL models, we propose a novel kernel machine committee (KMC) based model that integrates Bayesian and discriminative sparse kernel machines for multi-class active learning. The KMC model naturally extends the sampling space from around the current decision boundaries to other critical areas through the representative data samples identified by the sparse Bayesian model. More specifically, the proposed KMC sampler incorporates a relevance vector machine (RVM), which is a Bayesian sparse kernel technique, to identify data samples (referred to as relevance vectors, or RVs) that capture the overall data distribution. By further augmenting the SVM with an RVM, the KMC model is able to choose data samples that provide a good coverage of the entire data space (by maximizing the data likelihood) while giving special attention to the critical areas for accurate classification (by maximizing the margins of decision boundaries). Figures 1b and 1c demonstrate the complementary distribution of RVs and SVs (support vectors) and how they cover different critical areas in the data space: most SVs are located near the decision boundary, while most RVs are in the densely distributed areas of the two classes. There are also far fewer RVs than SVs, implying that RVM is an even sparser model than SVM [4].
The sparse nature of both RVM and SVM makes their combination an ideal choice for AL.

In essence, the KMC joins a sparse Bayesian model (RVM) with a maximum margin machine (SVM) to choose data samples that meet two key properties simultaneously: (1) providing a good fit of the overall data distribution, and (2) accurately capturing the decision boundaries. We propose to use the maximum entropy discrimination (MED) framework [5, 6] to seamlessly integrate these two distinctive properties into one joint objective function to train the KMC for multi-class data sampling in AL. Furthermore, the objective function can be equivalently expressed as a combination of a likelihood term with the generalized entropy [7]. This deeper connection implies that the KMC model is able to choose data samples that are most instrumental in tackling difficult non-separable classification problems (as shown in our experiments). In contrast, SVM based models need a much larger number of SVs (hence more labeled data) to accurately capture the more complex decision boundaries. Our main contribution is threefold: (i) a novel kernel machine committee (KMC) model that seamlessly unifies Bayesian and discriminative sparse kernel machines for effective data sampling; (ii) a principled optimization framework that leverages convex duality and the sparse structure of KMC to efficiently optimize the joint objective function; and (iii) a novel sampling function that combines key model parameters to choose data samples that can simultaneously improve multiple decision boundaries, making it an effective sampler for problems with a large number of classes.

2 Related Work
Uncertainty sampling is one of the most commonly used sampling methods for AL, where the informativeness of an unlabeled data point is determined by its distance to the decision boundaries [8, 9].
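As a concrete illustration of this idea, classical uncertainty sampling with a margin classifier queries the pool point closest to the decision boundary. A minimal sketch (the signed scores and toy pool below are illustrative assumptions, not from the paper):

```python
import numpy as np

def uncertainty_pick(f_scores):
    """f_scores[i] = signed distance of pool point i to the decision boundary.
    Uncertainty sampling queries the point with the smallest |distance|."""
    return int(np.argmin(np.abs(f_scores)))

# Toy pool: point 1 sits almost on the boundary, so it is queried first.
pool_scores = np.array([2.3, -0.1, 0.8, -1.7])
queried = uncertainty_pick(pool_scores)
```

This is exactly the exploit-only behavior Figure 1a warns about: the sampler keeps querying near the current boundary regardless of where the data mass lies.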
In order to better accommodate multi-class problems, Culotta and McCallum [10] propose a sampling method based on the predictive probability over the sample pool, where the data point with the smallest probability for its predicted class is sampled. However, this method prefers trivial data samples with evenly distributed predictive probabilities. To address this, a Best-versus-Second-Best (BvSB) model is proposed to choose the data sample whose probabilities for the most and second most probable classes are closest to each other [10]. But BvSB only focuses on the two most probable classes while the probability distribution over the other classes is ignored. As a result, this method is less effective with more classes. Entropy-based methods obtain a complete view of uncertainty over all classes to conduct effective AL sampling, but the lack of training samples at the beginning of AL impedes the accurate estimation of entropy. In fact, all sampling methods that rely on the probability output of SVM should be treated with caution, as the probabilities are estimated by fitting an additional logistic regression model over SVM scores. Thus, the estimated probability might not reflect the true behavior of the SVM as a discriminative model [11].

Compared to discriminative models (e.g., SVMs), generative models can be more naturally used for multi-class AL. Roy et al. propose a sampling method based on Naive Bayes using the expectation of future classification error as the sampling criterion [12]. Kottke et al.
propose a multi-class probabilistic AL model (McPAL) based on the expectation of the classification error over clusters of unlabeled data [13]. Both methods are computationally intensive, making them hard to apply to real-time AL. Furthermore, since the learning objective (e.g., maximum likelihood) of generative models is not specifically designed for discrimination [14], the model performance might be less competitive when AL is complete as compared to discriminative models.

There are also existing models that utilize the properties of the data space or the trained model for effective data sampling. A convex-hull based AL model is developed to avoid sampling less informative data points that are close to the current support vectors [15]. A similar strategy is adopted in [16], where data samples with the furthest distance to their closest relevance vectors are sampled. The QUIRE model combines the clustering structure of unlabeled data with the class assignments of the labeled data, allowing it to choose samples that are both informative and representative [17]. However, this model is designed for binary problems. Different from all existing works, the proposed KMC model leverages the complementary behavior of Bayesian and discriminative sparse kernel machines and systematically integrates them for effective data sampling in multi-class AL.

3 Kernel Machine Committee based Active Learning
Let X = {x1, ..., xM} denote a training set with M data samples and y = {y1, ..., yM} be their corresponding labels. Let us consider the binary case where ∀yi ∈ y, yi ∈ {−1, +1}; multi-class problems can be handled via the one-versus-the-rest strategy. The conditional distribution of label yi is given by p(yi = 1|w, xi) = σ(wT φ(xi)), where σ is the logistic sigmoid function, w is the coefficient vector, and φ(xi) is the feature vector of xi.
For RVM, we set φj(xi) = k(xi, xj) with k(·,·) being a kernel function. We further place a prior over the coefficients w with hyperparameters α = (α1, ..., αM), given by p(w|α) = ∏j N(wj|0, αj⁻¹). Having a separate hyperparameter αi for each coefficient wi ensures model sparsity through automatic relevance determination (ARD) [4]. In particular, during the parameter learning process, some of the αi will be driven to infinity, which has the effect of making the corresponding wi approach zero. As a result, the associated data sample xi will be excluded from the RV set. The optimal α can be determined through evidence approximation, which maximizes the log marginal likelihood of the observed data, given by ln p(y|X, α) = ln ∫ ∏i p(yi|xi, w) p(w|α) dw.

Since the likelihood term p(yi|xi, w) is a logistic function, which is non-conjugate to the Gaussian prior p(w|α), the integration cannot be performed straightforwardly. By applying Jensen's inequality to the log function, we can obtain a lower bound of the log likelihood given by ln p(y|X, α) ≥ Eq(w)[ln p(y|X, w)] − KL(q(w)||p(w|α)), where q(w) is a variational distribution and the second term is the KL divergence between q(w) and the prior distribution p(w|α).
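The kernel feature map and the pruning effect of the ARD prior can be illustrated with a minimal sketch (the RBF kernel choice, `gamma_rbf`, and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X, Y, gamma_rbf=1.0):
    # k(x, y) = exp(-gamma_rbf * ||x - y||^2), one possible choice of k(.,.)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_rbf * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # M = 6 toy samples
Phi = rbf_kernel(X, X)               # phi_j(x_i) = k(x_i, x_j): M x M, symmetric, unit diagonal

# ARD prior p(w|alpha) = prod_j N(w_j | 0, 1/alpha_j): driving alpha_j to
# (numerical) infinity pins w_j to zero, excluding x_j from the RV set.
alpha = np.array([1.0, 1.0, 1e12, 1e12, 1.0, 1e12])
w_samples = rng.normal(size=(10000, 6)) / np.sqrt(alpha)
active = alpha < 1e6                 # surviving basis functions = relevance vectors
```

Here the prior variance 1/alpha_j of the pruned coefficients is negligible, so any draw of w effectively uses only the three active basis functions.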
This change makes it possible to put parameter learning in RVM into the MED framework [5], which allows us to further integrate a set of margin-based constraints:

min over q(w), α, ξ:  KL(q(w)||p(w|α)) − Eq(w)[ln p(y|X, w)] + C Σi ξi    (1)
subject to ∀i: Eq(w)[yi f(w, xi)] ≥ −ξi,  ξi ≥ 0,  ∫ q(w)dw = 1

where the ξ's are slack variables and f(w, xi) is a cost function, defined as f(w, xi) = ln [p(yi=1|w, xi)/p(yi=−1|w, xi)] = wT φ(xi) for yi = 1, and f(w, xi) = ln [p(yi=−1|w, xi)/p(yi=1|w, xi)] = −wT φ(xi) for yi = −1, which ensures that a linear cost is introduced only for misclassified data samples.

Directly optimizing (1) is still challenging due to the likelihood term that follows a logistic function. We make a further approximation by using an exponential quadratic function to lower bound the logistic function [18]: σ(z) ≥ σ(γ) exp{(z − γ)/2 − λ(γ)(z² − γ²)}, where λ(γ) = (1/(2γ))(σ(γ) − 1/2). This approximation allows us to derive a lower bound of the likelihood term in (1).

Lemma 1. For ∀γ ∈ R^M, ∀y ∈ {−1, +1}^M, there exists a lower bound of the likelihood of the logistic regression function that has an exponential quadratic functional form and satisfies:

p(y|w, X) ≥ ∏i=1..M σ(γi) exp{ (1/2)(wT φ(xi) yi − γi) − λ(γi)([wT φ(xi)]² − γi²) } = h(w, γ)    (2)

where γ = (γ1, ..., γM)T.

Proof.
By leveraging the symmetry of the sigmoid function, we have

p(yi = 1|w, xi) = σ(wT φ(xi))    (3)
p(yi = −1|w, xi) = 1 − σ(wT φ(xi)) = σ(−wT φ(xi))    (4)

Using σ(z) ≥ σ(γ) exp{(z − γ)/2 − λ(γ)(z² − γ²)}, the conditional likelihood of yi is given by:

p(yi|w, xi) = σ(yi wT φ(xi)) ≥ σ(γi) exp{ (1/2)(yi wT φ(xi) − γi) − λ(γi)([yi wT φ(xi)]² − γi²) }    (5)

Substituting yi² = 1 completes the proof of Lemma 1.

Replacing the likelihood with h(w, γ) in (1), the final objective function of KMC is given by

Objective (KMC): min over q(w), γ, α, ξ:  KL(q(w)||p(w|α)) − Eq(w)[ln h(w, γ)] + C Σi ξi    (6)
subject to ∀i: Eq(w)[yi f(w, xi)] ≥ −ξi,  ξi ≥ 0,  ∫ q(w)dw = 1

The first term is a regularizer of the variational distribution q(w). The use of an ARD prior imposes sparsity on w, which guarantees the sparsity of the KMC. The second term approximates the negative log likelihood of the observed data, and the last term brings in the maximum margin-based constraints. It is also worth noting that the expectation Eq(w)[f(w, xi)] is taken over q(w), which demonstrates the interplay of the Bayesian RVM model and the maximum margin SVM model. The integrated objective allows the KMC model to identify a small number of data samples, referred to as KMC vectors, which describe the observed data well while accurately capturing the decision boundaries at the same time. In addition, the first and third terms together form a special case of generalized entropy [7]. Therefore, the KMC objective function can also be interpreted as minimizing the negative log likelihood with the generalized entropy as a regularizer.
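The exponential quadratic bound underlying Lemma 1 can be checked numerically; a short sketch (the evaluation grids are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lam(g):
    # lambda(gamma) = (sigma(gamma) - 1/2) / (2 * gamma)
    return (sigmoid(g) - 0.5) / (2.0 * g)

def lower_bound(z, g):
    # sigma(z) >= sigma(gamma) * exp{(z - gamma)/2 - lambda(gamma) * (z^2 - gamma^2)}
    return sigmoid(g) * np.exp((z - g) / 2.0 - lam(g) * (z ** 2 - g ** 2))

zs = np.linspace(-8, 8, 161)
gs = np.linspace(0.1, 8, 80)
gap = sigmoid(zs[:, None]) - lower_bound(zs[:, None], gs[None, :])
# gap >= 0 everywhere; the bound is tight exactly at z = +/- gamma.
```

Tightness at z = ±γ is what makes the variational parameter γ re-estimable per data point later in the derivation.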
Such a formulation implies that it can more effectively handle non-separable classification problems, benefiting from the generalized entropy (as confirmed through our experiments). To extend to K classes, we adopt the one-versus-the-rest strategy and then apply a softmax transformation, which gives rise to the posterior probability of the k-th class: p(Ck|x) = exp(E[wkT φ(x)]) / Σj=1..K exp(E[wjT φ(x)]).

The conjugacy introduced by the lower bound function is essential for efficient KMC parameter learning. First, it guarantees a Gaussian form of q(w), with the other parameters expressed by the moments of q(w). This allows us to develop an iterative algorithm to efficiently optimize q(w). Second, we can leverage convex duality and the sparse structure to efficiently solve for the Lagrange multipliers.

3.1 Parameter Learning in KMC
In order to optimize the KMC objective function in (6) and learn the key model parameters, we first derive the Lagrangian function by introducing dual variables ui ≥ 0 and v for the inequality and equality constraints:

L(ξ, α, γ) = KL(q(w)||p(w|α)) − Eq(w)[ln h(w, γ)] + C Σi ξi − Σi ui(Eq(w)[yi f(w, xi)] + ξi) + v(∫ q(w)dw − 1)    (7)

We start by solving for q(w). Setting ∂L/∂q(w) = 0 and recognizing that q(w) takes an exponential quadratic form, we have q(w) = N(mq, Sq) with

mq = Sq[Σi yi(ui + 1/2) φ(xi)],  Sq⁻¹ = A + 2 Σi λ(γi) φ(xi) φ(xi)T,  A = diag(α)    (8)

We now solve for the Lagrange multipliers.
By substituting (8) back into (7), we obtain the dual problem:

max over u:  −ln Z(u)  subject to  Σi ui = C,  ∀i, ui ≥ 0    (9)

where u = (u1, ..., uM)T and Z(u) is the normalization factor that ensures that q(w) integrates to 1. In particular, we have

ln Z(u) = (M/2) ln(2π) + Σi ln σ(γi) + (1/2) ln|Sq| + (1/2) zT Sq z + Σi (λ(γi) γi² − γi/2)    (10)

where z = Σi yi(ui + 1/2) φ(xi). By removing terms irrelevant to u, the dual problem is given by

Dual (KMC): min over u:  (1/2) zT Sq z  subject to:  Σi ui = C,  ui ≥ 0    (11)

The dual problem is essentially a constrained quadratic programming problem over u, which can be solved using a standard QP solver. However, using the ARD prior ensures that the KMC problem has a nice sparse structure, as shown in the following theorem.

Theorem 1. Using an ARD prior, the covariance matrix Sq of the variational distribution q(w) has a sparse structure. In particular, for |αi| → ∞ and |αj| → ∞, Sq(i, j) → 0 as Sq(i, j) ∝ 1/|αj|.

Proof. First, reformulate Sq⁻¹ in matrix form: A + 2ΦT ΛΦ, where Φ = (φ(x1), ..., φ(xM))T and Λ = diag(λ(γ1), ..., λ(γM)). Given the definition of φ(xi), Φ = ΦT. Applying the Woodbury identity to Sq⁻¹, we have

Sq = A⁻¹ + A⁻¹Φ(Λ⁻¹ + ΦA⁻¹Φ)⁻¹ΦA⁻¹    (12)

where A⁻¹ = diag(α1⁻¹, ..., αM⁻¹) is a diagonal matrix (hence already sparse). So we focus on the second term in (12). In particular,

A⁻¹Φ = (α1⁻¹φ(x1), ..., αM⁻¹φ(xM))T    (13)

Similarly, we have ΦA⁻¹ = (α1⁻¹φ(x1), ..., αM⁻¹φ(xM)). It can be shown that a significant proportion of the α's approach ∞ due to the ARD prior [19]. We then apply the Woodbury identity to the term (Λ⁻¹ + ΦA⁻¹Φ)⁻¹ in (12) and assume that |αi| → ∞; it is then straightforward to show that (Λ⁻¹ + ΦA⁻¹Φ)⁻¹ ≈ A. Using this fact and (13), we can show Sq(i, j) ∝ 1/|αj| and hence Sq(i, j) → 0 for |αj| → ∞.

Our empirical evaluation over both synthetic and real data shows that a high percentage (e.g., > 80%) of the α's are driven to ∞ during the optimization process. Therefore, Sq is indeed highly sparse, and the problem can be solved much more efficiently by quadratic solvers that exploit sparse input, such as MOSEK [20].

Next, we solve for γ by setting ∂L/∂γi = 0, obtaining the update rule for γi:

γi² = φ(xi)T Eq(w)[wwT] φ(xi) = φ(xi)T [Sq + mq mqT] φ(xi)    (14)

The derivation of the update rule for α benefits from the following result.

Lemma 2. Let p1(x) ~ N(x|m1, S1) and p2(x) ~ N(x|m2, S2); then KL(p1||p2) is given by:

KL(p1||p2) = (1/2)[ln(|S2|/|S1|) − M + Tr(S2⁻¹S1) + (m1 − m2)T S2⁻¹ (m1 − m2)]    (15)

Substituting q(w) for p1 and p(w|α) for p2 in (15), we have

KL(q(w)||p(w|α)) = (1/2)(Σi ln αi⁻¹ − ln|Sq| − M + Tr[Sp⁻¹Sq] + mqT Sp⁻¹ mq)    (16)

where Sp⁻¹ = A.
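The sparsity effect stated in Theorem 1 can be illustrated numerically by inverting Sq⁻¹ = A + 2ΦT ΛΦ directly; a sketch with arbitrary toy values (none of which come from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5
Phi = rng.normal(size=(M, M))
Phi = (Phi + Phi.T) / 2.0                  # Phi is symmetric, as in the proof of Theorem 1
Lam = np.diag(rng.uniform(0.05, 0.25, M))  # lambda(gamma_i) > 0

def S_q(alpha):
    # S_q^{-1} = A + 2 * Phi^T Lam Phi, with A = diag(alpha)
    return np.linalg.inv(np.diag(alpha) + 2.0 * Phi.T @ Lam @ Phi)

# Drive alpha_2 and alpha_3 to (numerical) infinity, as ARD does:
alpha = np.array([1.0, 1.0, 1e10, 1e10, 1.0])
Sq = S_q(alpha)
# Rows/columns tied to pruned coefficients vanish: Sq(i, j) ~ 1/|alpha_j| -> 0
```

Only the rows and columns of the surviving coefficients carry non-negligible mass, which is what makes sparse QP solvers effective on (11).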
Solving ∂L/∂αi = 0 while making use of (16), we obtain the update rule for αi:

αi = 1/(Sq(ii) + (mq(i))²)    (17)

where Sq(ii) denotes the i-th diagonal entry of Sq.

3.2 KMC-based Multi-class Data Sampling
We develop a novel two-phase KMC-based sampling process to achieve many-class sampling. The proposed KMC model is used for different purposes in each phase: predicting the posterior probabilities of different classes in initial sampling, and KMC vector discovery for final sampling. More specifically, in the initial sampling phase, a KMC model pre-trained on the initial labeled pool is used to make predictions for all the data samples in the unlabeled pool. The top-S samples are then selected according to their entropy defined over the posterior probabilities of the different classes. In essence, these samples confuse the current KMC model the most and thus have the greatest potential to improve the model if labeled. Different from existing AL approaches that directly send these samples for human labeling, the proposed process proceeds by including these samples along with their predicted labels to retrain the KMC model. The goal is to further select the data samples identified as KMC vectors, which contribute to improving the decision boundaries while properly exploring the data space to avoid slow convergence of AL.

We propose a multi-class sampling function to measure the overall improvement that a sample can bring to all the classes. In particular, when solving the KMC objective in (6) for the k-th class, we obtain an optimal α_i^(k) for each xi. Similar to RVM [4], when optimizing each α_i^(k), we obtain two quantities s_i^(k) and q_i^(k), referred to as sparsity and quality, respectively, where sparsity measures the overlap of data sample xi with other samples and quality measures xi's contribution to reducing the error between the model output and the actual targets.
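Collecting the update rules of Section 3.1, the inner loop of parameter learning can be sketched as follows. This is a simplified sketch: it fixes the dual variables u at a feasible point instead of solving the QP in (11), and the toy data and RBF kernel are assumptions.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def lam(g): return (sigmoid(g) - 0.5) / (2.0 * g)

rng = np.random.default_rng(2)
M, C = 8, 1.0
X = rng.normal(size=(M, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
Phi = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # symmetric RBF design matrix

u = np.full(M, C / M)            # feasible dual point: sum(u) = C, u >= 0
alpha = np.ones(M)
gamma = np.ones(M)
for _ in range(50):
    # (8): q(w) = N(m_q, S_q)
    Sq_inv = np.diag(alpha) + 2.0 * (Phi * lam(gamma)) @ Phi.T
    Sq = np.linalg.inv(Sq_inv)
    mq = Sq @ (Phi @ (y * (u + 0.5)))
    # (14): gamma_i^2 = phi(x_i)^T [S_q + m_q m_q^T] phi(x_i)
    Ew2 = Sq + np.outer(mq, mq)
    gamma = np.sqrt(np.einsum('ij,jk,ik->i', Phi, Ew2, Phi))
    # (17): alpha_i = 1 / (S_q(ii) + m_q(i)^2)
    alpha = 1.0 / (np.diag(Sq) + mq ** 2)
```

In the full algorithm, the u update (the sparse QP in (11)) would be interleaved with these steps; here it is frozen purely to keep the sketch short.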
The optimization process will set α_i^(k) → ∞ if q_i^(k)² < s_i^(k), which makes the corresponding wi → 0. In this case, xi will not contribute to the k-th class (and is not included as a KMC vector). Otherwise, α_i^(k) is set to s_i^(k)²/(q_i^(k)² − s_i^(k)). Intuitively, if xi is not too close to other data samples and is effective at reducing the classification error, its corresponding α_i^(k) will take a small positive value. By considering all K classes, we use the following function for multi-class sampling:

x∗ = arg min over xi of Σk ‖Eq(w)[w_i^(k)]‖₂² / α_i^(k)    (18)

In essence, the multi-class sampling function aggregates the impact of w_i^(k), which is directly used to compute the posterior probability, and the contribution of α_i^(k), which gives preference to non-redundant samples that help reduce the classification error. Finally, by combining the contributions to all classes, it chooses a data sample that can benefit a large number of classes, making it effective and efficient when many classes are involved. The details are summarized in the supplementary materials and the source code is available at [21].

4 Experiments
We conduct extensive experiments to evaluate the proposed KMC AL model. We first investigate and verify some important model properties using synthetic data and through comparison with SVM and RVM, which helps demonstrate the potential of using KMC for AL. We then apply the model to multiple real-world datasets from diverse domains. Comparison with state-of-the-art multi-class AL models establishes the advantage of using KMC in real-world AL applications. For KMC, unless
For KMC, unless\notherwise speci\ufb01ed, parameter C is set to 10\u22122 and the convergence threshold is set to 10\u22123.\n\n4.1 Synthetic Data\nWe draw 500 2-D data samples from a moon-shape distribution and use 70 % of the data for training\nand 30 % for testing. In Figures 2 and 3, we visualize the learned vectors of the compare models\nat different noise levels. The results help verify important properties of the proposed KMC model,\nas described in our theoretical \ufb01ndings. First, the selected KMC vectors suf\ufb01ciently explore critical\nareas to cover the entire data distribution while giving adequate attention to the decision boundaries.\nIn contrast, SVM overly focuses on the decision boundaries by using a large number of SVs while\nRVM under explores the decision areas by using a very small number of RVs and hence suffers\nfrom a relatively low model accuracy. This result veri\ufb01es the desired behavior of KMC vectors\nthat are discovered through optimizing the joint objective function (6). Second, KMC maintains a\nvery high sparsity level at around 90% in both cases. This veri\ufb01es our theoretical result as stated in\nTheorem 1. These two important properties clearly establish the potential of using KMC for effective\ndata sampling in AL. Finally, as we add more Gaussian noises to make the data less separable, KMC\n\n6\n\n\fFigure 2: Moon-shape distribution with 30% Gaussian noises\n\nFigure 3: Moon-shape distribution with 60% Gaussian noises\n\nDataset\nYeast\nReuters\nPenstroke\nDerm 1\nDerm 2\nAuto-drive\n\n8\n\n5227\n500\n1391\n1554\n48\n\nTable 1: Description of Datasets\n\n#Inst #Attr #Classes Class Distr. Domain\nBiology\n1484\nNews\n10788\nImage\n1144\nMedical\n800\n868\nMedical\nAutomobile\n58509\n\nSkewed\nSkewed\nEven\nEven\nEven\nEven\n\n10\n75\n26\n50\n30\n11\n\nshows its robustness by maintaining the highest accuracy. Furthermore, the sparsity of KMC remains\nstable unlike SVM with an exploding number of SVs. 
This veri\ufb01es the impact of the generalized\nentropy regularizer in the KMC objective function (6).\n4.2 Real Data\nWe choose 6 datasets from different domains, as summarized in Table 1, to evaluate the proposed\nKMC based multi-class sampling model.\n\u2022 Yeast uses biological features to predict cellular localization sites of proteins.\n\u2022 Reuters uses the content of Reuters News to conduct text classi\ufb01cation.\n\u2022 Penstroke contains handwritten English letters from writers with different writing styles.\n\u2022 Derm I&II contain physicians\u2019 verbal narrations of dermatological images with different diseases.\n\u2022 Auto-drive uses different sensor readings to predict failures of a running automobile.\nExperimental setting and comparison methods:\nIt is typical for an AL model to start with limited\nlabeled training samples. For datasets with an even class distribution (except for Auto-drive), we\nrandomly select one data sample per class to form the initial training set for AL. For Auto-drive, given\nits large size, we use 20 labeled samples per class. For unevenly distributed datasets, we randomly\nsample 1% of the data from each class to form the initial set. For comparison, random sampling\n(Random) is a commonly used baseline approach. We also include Entropy-based sampling (Entr) that\nselects the data sample with the maximum entropy of the predicted class distribution. Furthermore,\nwe also compare the KMC AL model with three state-of-the-art multi-class AL models that have\nbeen discussed in the related work section, including Best-vs-Second-Best sampling (BvSB) [10],\nmulti-class Probabilistic Active Learning (McPAL) [13], and multi-class convex hull based sampling\n(MC-CH) [15]. The reported test accuracy is averaged over three runs. 
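The entropy-based selection used both by the Entr baseline above and by the initial phase of Section 3.2 (top-S by posterior entropy under the softmax over one-versus-rest outputs) can be sketched as follows (the toy scores are arbitrary; S = 1 recovers the Entr baseline):

```python
import numpy as np

def softmax(scores):
    # p(C_k|x) = exp(E[w_k^T phi(x)]) / sum_j exp(E[w_j^T phi(x)])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def top_s_by_entropy(scores, S):
    P = softmax(scores)
    H = -(P * np.log(P + 1e-12)).sum(axis=1)   # predictive entropy per pool point
    return np.argsort(-H)[:S]                   # the S most confusing samples

rng = np.random.default_rng(4)
scores = rng.normal(size=(100, 5))             # E[w_k^T phi(x)] for 100 pool points, 5 classes
idx = top_s_by_entropy(scores, S=10)
```

These S candidates are then fed back with their predicted labels to retrain KMC, rather than being sent directly for human labeling.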
Two important parameters of KMC, the large margin coefficient C and the initial sample size S, are set to 1 and 40, respectively, when compared with other AL models.

[Panels of Figures 2 and 3: at 30% noise, RVM with 17 vectors (test accuracy 0.89), SVM with 123 vectors (0.94), KMC with 23 vectors (0.95); at 60% noise, RVM with 19 vectors (0.78), SVM with 192 vectors (0.80), KMC with 40 vectors (0.80).]

Figure 4: AL Performance Comparison

AL performance comparison: Figure 4 compares the AL results from KMC and the other four competitive models along with the baseline random sampling. In most cases, the KMC model shows a fast convergence of AL and better model accuracy. In four datasets, including Yeast, Reuters, Derm I (tied with MC-CH), and Derm II, KMC demonstrates a very clear advantage in the convergence speed of AL. In the other two datasets (i.e., Penstroke and Auto-drive), KMC achieves comparable performance with the best competitive model in the early stage of AL but converges to a high model accuracy in the end.
The excellent AL performance of KMC benefits from the KMC vectors' ability to effectively explore the entire data distribution while accurately capturing the decision boundaries. The sparse structure of KMC further ensures that only a very limited number of labeled data samples are needed to train highly accurate models. Furthermore, the fast convergence is also attributable to the effectiveness of the multi-class sampling function (18), which chooses data samples that simultaneously improve multiple classes, as further verified in later experiments.

Impact of model parameters: We study the impact of two parameters that may affect model performance: the large margin coefficient C and the initial sample size S. Since a data sample with ξi > 0 will be wrongly classified, a large C leads to a variational distribution q(w) with more discriminative strength. In contrast, a smaller C makes the model more tolerant of wrongly classified cases (and also more robust to noise). In practice, KMC performs quite stably over a wide range of C values (i.e., 0.001 to 1) for all the datasets. As for the initial sample size S, Figure 5 shows the AL curves of KMC with S set to 5, 20, and 40, respectively. It can be seen that at the early stage of AL, a larger S creates an advantage by allowing the model to explore more confusing data samples when the model is not yet accurate enough to resolve the confusion. For most datasets, this advantage diminishes as the model becomes more accurate over the course of AL. The only exception is Auto-drive, where a larger initial sample size still shows a clear advantage toward the end of the 500 AL iterations. In fact, given the large size of this dataset, even after sampling 500 data samples, the labeled data comprises only around 1% of the entire dataset.
In addition, this dataset is highly noisy, so the AL model may not have fully converged with such limited labeled data, which corresponds to a relatively early stage of AL. Therefore, the advantage of using a larger sample size remains significant, as in the early stages of the other datasets.

Effectiveness of multi-class sampling: In this set of experiments, we further verify the effectiveness of the multi-class sampling function by demonstrating that the selected data samples have the potential to benefit multiple classes. Figure 6 visualizes data samples with high sampling scores in the Penstroke dataset from the first 100 AL iterations. For each character, we show the true label along with the most confusing labels (shown in parentheses) based on the predictive distribution of KMC. The AL iteration is shown at the bottom. It can be seen that the sampling function prefers samples that are confusing w.r.t. multiple classes. In other words, once such a sample is labeled, it has the potential to improve multiple decision boundaries. This exhibits more effective exploitation, leading to fast convergence in AL. We also observe that characters of similar appearance tend not to be sampled repeatedly over a large number of AL iterations. Instead, samples from each class are selected in a round-robin manner, which shows effective exploration.
The good balance between exploitation and exploration explains the fast and accurate sampling behavior of the KMC based multi-class sampling function.

Figure 5: Impact of the Initial Sample Size S

Figure 6: Representative AL Samples that Benefit Multiple Classes

5 Conclusion
We propose a novel kernel machine committee that combines Bayesian and discriminative sparse kernel machines for multi-class AL. These two kernel machines with distinct properties are seamlessly unified using the maximum entropy discrimination framework in a principled way that allows the resulting model to choose data samples ideal for AL. The sparse structure of the KMC minimizes the number of data samples selected for labeling and also ensures efficient parameter learning to support fast model training for real-time AL.
A novel multi-class sampling function is designed that combines key model parameters to choose the data samples most effective at improving the decision boundaries of multiple classes, leading to faster AL convergence in multi-class problems. Extensive experiments conducted over both synthetic and real data verify our theoretical results and clearly demonstrate the effectiveness of the proposed KMC based AL model.

Acknowledgement
This research was supported in part by an NSF IIS award IIS-1814450 and an ONR award N00014-18-1-2875. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.

References
[1] Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917–926. ACM, 2009.

[2] Oisin Mac Aodha, Neill D. F. Campbell, Jan Kautz, and Gabriel J. Brostow. Hierarchical subquery evaluation for active learning on a graph.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2014.

[3] Meng Wang and Xian-Sheng Hua. Active learning in multimedia annotation and retrieval: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 2(2):10, 2011.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[5] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems, pages 470–476, 2000.

[6] Jun Zhu and Eric P. Xing.
Maximum entropy discrimination Markov networks. Journal of Machine Learning Research, 10(Nov):2531–2569, 2009.

[7] Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8(Jun):1217–1260, 2007.

[8] Simon Tong and Edward Chang. Support vector machine active learning for image retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia, pages 107–118. ACM, 2001.

[9] Michael I. Mandel, Graham E. Poliner, and Daniel P. W. Ellis. Support vector machine active learning for music retrieval. Multimedia Systems, 12(1):3–13, 2006.

[10] Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In CVPR, pages 2372–2379. IEEE, 2009.

[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.

[12] Nicholas Roy and Andrew McCallum. Toward optimal active learning through Monte Carlo estimation of error reduction. In ICML, Williamstown, pages 441–448, 2001.

[13] Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, and Myra Spiliopoulou. Multi-class probabilistic active learning. In ECAI, pages 586–594, 2016.

[14] Vladimir N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.

[15] Weishi Shi and Qi Yu. An efficient many-class active learning framework for knowledge-rich domains. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1230–1235. IEEE, 2018.

[16] Catarina Silva and Bernardete Ribeiro. Combining active learning and relevance vector machines for text classification.
In Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pages 130–135. IEEE, 2007.

[17] Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems, pages 892–900, 2010.

[18] Tommi S. Jaakkola and Michael I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37, 2000.

[19] Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems, pages 383–389, 2002.

[20] MOSEK ApS. MOSEK Optimizer API for Python 8.1.0.80, 2019.

[21] Weishi Shi and Qi Yu. Source code and data. https://drive.google.com/drive/folders/1kk50iDvgR8PdpB8lbt32rqn9KGTDFB4n?usp=sharing, 2019. [Online; accessed 12-October-2019].

[22] M. Andersen, Joachim Dahl, and Lieven Vandenberghe. CVXOPT: A Python package for convex optimization. abel.ee.ucla.edu/cvxopt, 2013.