{"title": "Active Instance Sampling via Matrix Partition", "book": "Advances in Neural Information Processing Systems", "page_first": 802, "page_last": 810, "abstract": "Recently, batch-mode active learning has attracted a lot of attention. In this paper, we propose a novel batch-mode active learning approach that selects a batch of queries in each iteration by maximizing a natural form of mutual information criterion between the labeled and unlabeled instances. By employing a Gaussian process framework, this mutual information based instance selection problem can be formulated as a matrix partition problem. Although the matrix partition is an NP-hard combinatorial optimization problem, we show a good local solution can be obtained by exploiting an effective local optimization technique on the relaxed continuous optimization problem. The proposed active learning approach is independent of employed classification models. Our empirical studies show this approach can achieve comparable or superior performance to discriminative batch-mode active learning methods.", "full_text": "Active Instance Sampling via Matrix Partition\n\nDepartment of Computer & Information Sciences\n\nYuhong Guo\n\nTemple University\n\nPhiladelphia, PA 19122\nyuhong@temple.edu\n\nAbstract\n\nRecently, batch-mode active learning has attracted a lot of attention. In this pa-\nper, we propose a novel batch-mode active learning approach that selects a batch\nof queries in each iteration by maximizing a natural mutual information criterion\nbetween the labeled and unlabeled instances. By employing a Gaussian process\nframework, this mutual information based instance selection problem can be for-\nmulated as a matrix partition problem. Although matrix partition is an NP-hard\ncombinatorial optimization problem, we show that a good local solution can be\nobtained by exploiting an effective local optimization technique on a relaxed con-\ntinuous optimization problem. 
The proposed active learning approach is indepen-\ndent of employed classi\ufb01cation models. Our empirical studies show this approach\ncan achieve comparable or superior performance to discriminative batch-mode ac-\ntive learning methods.\n\n1\n\nIntroduction\n\nActive learning is well-motivated in many supervised learning scenarios where unlabeled instances\nare abundant and easy to retrieve but labels are dif\ufb01cult, time-consuming, or expensive to obtain.\nFor example, it is easy to gather large amounts of unlabeled documents or images from the Inter-\nnet, whereas labeling them requires manual effort from experienced human annotators. Randomly\nselecting unlabeled instances for labeling is inef\ufb01cient in many situations, since non-informative or\nredundant instances might be selected. Aiming to reduce labeling effort, active learning (i.e., selec-\ntive sampling) methods have been adopted to control the labeling process in many areas of machine\nlearning. Given a large pool of unlabeled instances, active learning provides a way to iteratively\nselect the most informative unlabeled instances\u2014the queries\u2014from the pool to label.\n\nMany researchers have addressed the active learning problem in various ways [13]. Most have fo-\ncused on selecting a single most informative unlabeled instance to query each time. The ultimate\ngoal for most such approaches is to select instances that could lead to a classi\ufb01er with low gener-\nalization error. Towards this, a few variants of a mutual information criterion have been employed\nin the literature to guide the active instance sampling process. The approaches in [4][10] select the\ninstance to maximize the increase of mutual information and the mutual information, respectively,\nbetween the selected set of instances and the remainder based on Gaussian process models. 
The\napproach proposed in [5] seeks the instance whose optimistic label provides maximum mutual in-\nformation about the labels of the remaining unlabeled instances. The mutual information measure\nused is discriminative, computed using their trained classi\ufb01er at that point. This approach implicitly\nexploits the clustering information contained in the unlabeled data in an optimistic way.\n\nThe single instance selection active learning methods require tedious retraining with each single in-\nstance being labeled. When the learning task is suf\ufb01ciently complex, the retraining process between\nqueries can become very slow. This may make highly interactive learning inef\ufb01cient or imprac-\ntical. Furthermore, if a parallel labeling system is available, e.g., multiple annotators working on\ndifferent labeling workstations at the same time on a network, a single instance selection system\ncan make wasteful use of the resource. Thus, a batch-mode active learning strategy that selects\nmultiple instances each time is more appropriate under these circumstances. The challenge in batch-\nmode active learning is how to properly assemble the optimal query batch. Simply using a single\ninstance selection strategy to select a batch of queries in each iteration does not work well, since\nit fails to take the information overlap between the multiple instances into account. Principles for\nbatch mode active learning need to be developed to address the multi-instance selection speci\ufb01cally.\nSeveral sophisticated batch-mode active learning methods have been proposed for classi\ufb01cation.\nMost of these approaches use greedy heuristics to ensure the overall informativeness of the batch by\ntaking both the individual informativeness and the diversity of the selected instances into account.\nSchohn and Cohn [12] select instances according to their proximity to the dividing hyperplane for\na linear SVM. 
Brinker [2] considers an approach for SVMs that explicitly takes the diversity of the\nselected instances into account, in addition to individual informativeness. Xu et al. [14] propose a\nrepresentative sampling approach for SVM active learning, which also incorporates a diversity mea-\nsure. Speci\ufb01cally, they query cluster centroids for instances that lie close to the decision boundary.\nHoi et al. [7, 8] extend the Fisher information framework to the batch-mode setting for binary logistic\nregression. Hoi et al. [9] propose a novel batch-mode active learning scheme on SVMs that exploits\nsemi-supervised kernel learning. In particular, a kernel function is \ufb01rst learned from a mixture of\nlabeled and unlabeled examples, and then is used to effectively identify the informative and diverse\ninstances via a min-max framework. Instead of using heuristic measures, Guo and Schuurmans [6]\ntreat batch construction for logistic regression as a discriminative optimization problem, and at-\ntempt to construct the most informative batch directly. Overall, these batch-mode active learning\napproaches all make batch selection decisions directly based on the classi\ufb01ers employed.\n\nIn this paper, we propose a novel batch-mode active learning approach that makes query selection\ndecisions independent of the classi\ufb01cation model employed. The idea is to select a batch of queries\nin each iteration by maximizing a general mutual information measure between the labeled instances\nand the unlabeled instances. By employing a Gaussian process framework, this mutual information\nmaximization problem can be further formulated as a matrix partition problem. Although the matrix\npartition problem is an NP-hard combinatorial optimization, it can \ufb01rst be relaxed into a continuous\noptimization problem and then a good local solution can be obtained by exploiting an effective local\noptimization. 
The local optimization method we use combines a local linearization of the objective function, based on its first-order Taylor series expansion, with a straightforward backtracking line search. Unlike most active learning methods studied in the literature, our query selection method does not require knowledge of the employed classifier. Our empirical studies show that the proposed batch-mode active learning approach can achieve superior or comparable performance to discriminative batch-mode active learning methods that have been optimized on specific classifiers.\n\nThe remainder of the paper is organized as follows. Section 2 provides preliminaries on Gaussian processes. Section 3 introduces the proposed matrix partition approach for batch-mode active learning. Empirical studies are presented in Section 4, and Section 5 concludes this work.\n\n2 Gaussian Processes\n\nA Gaussian process is a generalization of the Gaussian probability distribution. Although Gaussian processes have a long history in statistics, their potential has only become widely appreciated in the machine learning community during the past decade [11]. In this section, we provide an overview of Gaussian processes and some of their important properties, which we will exploit later to construct our active learning approach.\n\n2.1 Multivariate Gaussian Distribution\n\nThe Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of multiple random variables, the joint multivariate Gaussian distribution for a $d \\times 1$ vector $x$ is given in the form\n\n$$P(x) = \\frac{1}{(2\\pi)^{d/2} |\\Sigma|^{1/2}} \\exp\\left(-\\frac{1}{2}(x-\\mu)^\\top \\Sigma^{-1} (x-\\mu)\\right)$$\n\nwhere $\\mu$ is a $d$-dimensional mean vector, $\\Sigma$ is a $d \\times d$ covariance matrix, and $|\\Sigma|$ denotes the determinant of $\\Sigma$. When $d = 1$, we obtain the standard one-variable Gaussian distribution.\n\n2.2 Gaussian Processes\n\nA Gaussian process is a generalization of a multivariate Gaussian distribution over a finite vector space to a function space of infinite dimension. Given a set of instances $X = [x_1^\\top; x_2^\\top; \\cdots; x_t^\\top]$, a data modeling function $f(\\cdot)$ can be viewed as a single sample from a Gaussian distribution with a mean function $\\mu(\\cdot)$ and a covariance function $C(\\cdot, \\cdot)$. In particular, $\\mu(x_i)$ denotes the mean of the function variable $f(x_i)$ at point $x_i$, and $C(x_i, x_j)$ expresses the expected covariance between the values of the function $f$ at points $x_i$ and $x_j$. A Gaussian process is defined as a Gaussian distribution on a space of functions $f$, which can be written in the form\n\n$$P(f(x)) = \\frac{1}{Z} \\exp\\left(-\\frac{1}{2}(f(x)-\\mu(x))^\\top \\Sigma^{-1} (f(x)-\\mu(x))\\right)$$\n\nwhere $\\mu(x)$ is the mean function, $\\Sigma$ is defined using the covariance function $C$, and $Z$ denotes the normalization factor. One typical choice for the covariance function $C$ is a symmetric positive-definite kernel function $K$, e.g., a Gaussian kernel\n\n$$K(x_i, x_j) = \\exp\\left(-\\frac{\\|x_i - x_j\\|^2}{\\tau^2}\\right) \\qquad (1)$$\n\nOne important property of Gaussian processes is that for every finite set (or subset) of instances $X_Q$ with indices $Q$, the joint distribution over the corresponding random function variables $f_Q = f(X_Q)$ is a multivariate Gaussian distribution with a mean vector $\\mu_Q = \\mu(X_Q)$ and a covariance matrix $\\Sigma_{QQ}$, where each entry $\\Sigma_{i,j}$ is defined using the covariance kernel function $K(x_i, x_j)$:\n\n$$P(f_Q) = \\frac{1}{Z} \\exp\\left(-\\frac{1}{2}(f_Q-\\mu_Q)^\\top \\Sigma_{QQ}^{-1} (f_Q-\\mu_Q)\\right) \\qquad (2)$$\n\nHere $Z = (2\\pi)^{q/2}|\\Sigma_{QQ}|^{1/2}$, and $q$ is the size of set $Q$. We can assume that the mean function is $\\mu(\\cdot) = 0$. 
In any case, the mean function plays no role in this paper.\n\n3 Batch-mode Active Learning via Matrix Partition\n\nGiven a small set of labeled instances $\\{(x_i, y_i)\\}_{i \\in L}$ and a large set of unlabeled instances $\\{x_j\\}_{j \\in U}$, our task is to iteratively select the most informative set of $b$ instances from $U$ and add them to the labeled set $L$ after querying their labels from a labeling system. In this section, we propose to conduct selective instance sampling using a maximum mutual information strategy, which can then be formulated as a matrix partition problem.\n\n3.1 Maximum Mutual Information Instance Selection\n\nSince the ultimate goal of active learning is to achieve a classifier with good generalization performance on unseen test data, it makes sense to select instances that can produce a labeled set that is most informative about the unseen test instances. The unseen test data is of course not accessible. Nevertheless, in the active learning setting, we have a large number of unlabeled instances available that come from the same distribution as the future test instances. Thus we can instead select instances that lead to a labeled set which is most informative about the large set of unlabeled instances. We propose to use a mutual information criterion to measure the informativeness of the labeled set $L$ over the unlabeled set $U$\n\n$$I(X_L, X_U) = H(X_L) + H(X_U) - H(X_L, X_U) \\qquad (3)$$\n\nwhere $X_L$ and $X_U$ denote the labeled set of instances and the unlabeled set of instances respectively, and $H(\\cdot)$ denotes the entropy.\n\nBoth the mutual information measure and the entropy measure are defined on probability distributions [3]. We thus employ a Gaussian process framework (introduced in the previous section) to model the joint probability distribution over all the instances. We first associate each instance $x_i$ with a random variable $f_i$. 
Then the joint distribution over a finite number of instances $X_Q$ can be represented using the joint multivariate Gaussian distribution over variables $f_Q$, which is given in (2). Thus the entropy term $H(X_Q) = H(f_Q)$ can be computed in closed form\n\n$$H(f_Q) = \\frac{1}{2} \\ln\\left((2\\pi e)^m |\\Sigma_{QQ}|\\right) \\qquad (4)$$\n\nwhere $m$ is the number of variables, i.e., the size of $Q$, and $\\Sigma_{QQ}$ is the covariance matrix computed over $X_Q$ using the kernel function $K$ given in (1). Within this Gaussian process framework, the mutual information criterion in (3) can be rewritten as\n\n$$I(X_L, X_U) = H(f_L) + H(f_U) - H(f_L, f_U) = \\frac{1}{2}\\ln\\left((2\\pi e)^l |\\Sigma_{LL}|\\right) + \\frac{1}{2}\\ln\\left((2\\pi e)^u |\\Sigma_{UU}|\\right) - \\frac{1}{2}\\ln\\left((2\\pi e)^t |\\Sigma_{VV}|\\right) \\qquad (5)$$\n\nwhere $V$ is the union of $L$ and $U$, and $l$, $u$, $t$ denote the sizes of $L$, $U$, $V$ respectively, such that $l + u = t$. Note that for a given data set, the overall number of instances does not change during the active learning process. We simply move $b$ instances from the unlabeled set $U$ into the labeled set $L$ in each iteration. Thus the set $V$ and the entropy term $H(f_L, f_U)$ are irrelevant to the instance selection. Based on this observation, our maximum mutual information instance selection strategy can be formulated as\n\n$$Q^* = \\arg\\max_{|Q|=b,\\, Q \\subseteq U} I(X_{L \\cup Q}, X_{U \\setminus Q}) = \\arg\\max_{|Q|=b,\\, Q \\subseteq U} \\ln|\\Sigma_{L'L'}| + \\ln|\\Sigma_{U'U'}| \\qquad (6)$$\n\nwhere $L' = L \\cup Q$ and $U' = U \\setminus Q$. This also suggests that the mutual information criterion depends only on the covariance matrices computed using the kernel function over the instances. Our maximum mutual information strategy attempts to select the batch of $b$ instances from the unlabeled set $U$ to label so as to maximize the sum of the log-determinants of the covariance matrices over the resulting sets $L'$ and $U'$.\n\n3.2 Matrix Partition\n\nLet $\\Sigma$ be the covariance matrix over all the instances indexed by $V = L \\cup U = L' \\cup U'$. Then the covariance matrices $\\Sigma_{LL}$, $\\Sigma_{UU}$, $\\Sigma_{L'L'}$ and $\\Sigma_{U'U'}$ are all submatrices of $\\Sigma$. Without loss of generality, we assume the instances are arranged in the order $[U, L]$, such that\n\n$$\\Sigma = \\begin{bmatrix} \\Sigma_{UU} & \\Sigma_{UL} \\\\ \\Sigma_{LU} & \\Sigma_{LL} \\end{bmatrix} \\qquad (7)$$\n\nThe instance selection problem formulated in (6) selects a subset of $b$ instances indexed by $Q$ from $U$ and moves them into the labeled set $L$. This problem is equivalent to partitioning the matrix $\\Sigma$ into submatrices $\\Sigma_{L'L'}$, $\\Sigma_{U'U'}$, $\\Sigma_{L'U'}$ and $\\Sigma_{U'L'}$ by reordering the instances in $U$. Since $L$ is fixed, the actual matrix partition is conducted on the covariance matrix $\\Sigma_{UU}$. Now we define a permutation matrix $M \\in \\{0, 1\\}^{u \\times u}$ such that\n\n$$M\\mathbf{1} = \\mathbf{1}, \\quad M^\\top \\mathbf{1} = \\mathbf{1}$$\n\nwhere $\\mathbf{1}$ denotes a vector of all 1 entries. We let $M_{\\tilde{b}}$ denote the first $u - b$ rows of $M$, and $M_b$ denote the last $b$ rows of $M$, such that\n\n$$M_{\\tilde{b}} \\Sigma_{UU} M_{\\tilde{b}}^\\top = \\Sigma_{U'U'}, \\quad M_b \\Sigma_{UU} M_b^\\top = \\Sigma_{QQ} \\qquad (8)$$\n\nObviously $M_b$ selects $b$ instances from $U$ to form $Q$. Let\n\n$$T = \\begin{bmatrix} M_{\\tilde{b}} & O_{(u-b) \\times l} \\end{bmatrix}, \\quad B = \\begin{bmatrix} M_b & O_{b \\times l} \\\\ O_{l \\times u} & I_l \\end{bmatrix} \\qquad (9)$$\n\nwhere $O_{m \\times n}$ denotes an $m \\times n$ matrix with all 0 entries, and $I_l$ denotes an $l \\times l$ identity matrix. According to (8) we then have\n\n$$\\Sigma_{U'U'} = T \\Sigma T^\\top, \\quad \\Sigma_{L'L'} = B \\Sigma B^\\top \\qquad (10)$$\n\nFinally, the maximum mutual information problem given in (6) can be equivalently formulated as the following matrix partition problem\n\n$$\\max_M \\; \\ln|B \\Sigma B^\\top| + \\ln|T \\Sigma T^\\top| \\quad \\text{s.t.} \\quad M \\in \\{0, 1\\}^{u \\times u},\\; M\\mathbf{1} = \\mathbf{1},\\; M^\\top \\mathbf{1} = \\mathbf{1} \\qquad (11)$$\n\nAfter solving this problem to obtain an optimal $M^*$, the instance selection can be determined from the last $b$ rows of $M^*$, i.e., $M_b^*$.\n\nHowever, the optimization problem (11) is an NP-hard combinatorial optimization problem over an integer matrix $M$. 
To facilitate a convenient optimization procedure, we relax the integer optimization problem (11) into the following upper bound optimization problem\n\n$$\\max_M \\; \\ln|B \\Sigma B^\\top| + \\ln|T \\Sigma T^\\top| \\qquad (12)$$\n\n$$\\text{s.t.} \\quad 0 \\le M \\le 1,\\; M\\mathbf{1} = \\mathbf{1},\\; M^\\top \\mathbf{1} = \\mathbf{1} \\qquad (13)$$\n\nNote that the determinant is a log-concave function on positive definite matrices [1]. Thus $\\ln|X|$ is concave in $X$. However, the quadratic matrix function $X = B \\Sigma B^\\top$ is matrix convex given that the matrix $\\Sigma$ is positive definite. Thus the composite function $\\ln|B \\Sigma B^\\top|$ is neither convex nor concave, but it is differentiable. In general, problems of this type are difficult global optimization problems. We instead develop an efficient local optimization technique to find a reasonable local solution.\n\n3.3 First-order Local Optimization\n\nThe target optimization (12) is an optimization problem over a $u \\times u$ matrix $M$, subject to the linear inequality and equality constraints (13). Here $u$ is the number of unlabeled instances, and we typically assume it is a large number. Therefore a second-order optimization approach would be space demanding. We develop a first-order local maximization algorithm, which combines a gradient direction finding method with a straightforward backtracking line search technique. This local optimization algorithm produced promising results in our experiments.\n\nThe algorithm is an iterative procedure, starting from an initial matrix $M^{(0)}$. Let $M^{(k)}$ denote the optimization variable values returned from the $k$th iteration. 
At the $(k+1)$th iteration, we approximate the objective function in (12) using its first-order Taylor series expansion at point $M^{(k)}$\n\n$$g(M) = \\ln|B \\Sigma B^\\top| + \\ln|T \\Sigma T^\\top| \\approx \\ln|B^{(k)} \\Sigma B^{(k)\\top}| + \\ln|T^{(k)} \\Sigma T^{(k)\\top}| + \\mathrm{Tr}\\left(G(M^{(k)})^\\top (M - M^{(k)})\\right) \\qquad (14)$$\n\nwhere $B^{(k)}$ and $T^{(k)}$ denote the corresponding $B$ and $T$ matrices with their $M$ submatrices fixed to the values given by $M^{(k)}$; $\\mathrm{Tr}$ denotes the trace operator; and $G(M^{(k)})$ denotes the gradient matrix value at point $M^{(k)}$. The gradient of the objective function $g(M)$ can be calculated using matrix calculus, which gives the following results\n\n$$G(M_{\\tilde{b}}) = \\frac{dg(M)}{dM_{\\tilde{b}}} = 2\\left[(T \\Sigma T^\\top)^{-1} T \\Sigma\\right]_{1:(u-b),\\,1:u} \\qquad (15)$$\n\n$$G(M_b) = \\frac{dg(M)}{dM_b} = 2\\left[(B \\Sigma B^\\top)^{-1} B \\Sigma\\right]_{1:b,\\,1:u} \\qquad (16)$$\n\n$$G(M) = \\left[G(M_{\\tilde{b}})^\\top, \\; G(M_b)^\\top\\right]^\\top \\qquad (17)$$\n\nNote that here we use MATLAB-style notation, where $[X]_{i:j,\\,m:n}$ denotes the $(j-i+1) \\times (n-m+1)$ submatrix of $X$ formed by the entries in the $i$th to $j$th rows and the $m$th to $n$th columns.\n\nGiven the gradient at point $M^{(k)}$, we maximize the local linearization (14) to seek a gradient direction that respects the constraints. This leads to a convex linear optimization\n\n$$\\widetilde{M} = \\arg\\max_M \\; \\mathrm{Tr}\\left(G(M^{(k)})^\\top M\\right) \\quad \\text{s.t.} \\quad 0 \\le M \\le 1,\\; M\\mathbf{1} = \\mathbf{1},\\; M^\\top \\mathbf{1} = \\mathbf{1} \\qquad (18)$$\n\nThe gradient direction for the $(k+1)$th iteration can be determined as\n\n$$D = \\widetilde{M} - M^{(k)} \\qquad (19)$$\n\nWe then employ a backtracking line search to seek the optimal value $M^{(k+1)}$ to improve the original objective function $g(M)$ with $g(M^{(k+1)}) > g(M^{(k)})$. The line search procedure, linesearch$(D, M^{(k)})$, seeks an optimal step size, $0 \\le s < 1$, to update $M^{(k)}$ in the ascent direction $D$ given in (19), i.e., $M^{(k+1)} = M^{(k)} + sD$, guaranteeing that the returned $M^{(k+1)}$ satisfies the linear constraints in (13) and leads to an objective value no worse than before.\n\nAlgorithm 1 Matrix Partition\n  Input: $l$: the number of labeled instances; $u$: the number of unlabeled instances; $\\Sigma$: covariance matrix given in the form of (7); $b$: batch size; $M^{(0)}$; $\\epsilon < 10^{-8}$.\n  Output: $M^*$\n  Initialize $k = 0$, NoChange = false.\n  repeat\n    Set $T$ and $B$ according to equation (9) using the current $M^{(k)}$.\n    Compute the gradient $G(M^{(k)})$ at point $M^{(k)}$ according to equations (15), (16) and (17).\n    Solve the local linear optimization (18) for the given gradient to get $\\widetilde{M}$.\n    Compute the gradient ascent direction $D$ using equation (19).\n    Compute $M^{(k+1)} = \\mathrm{linesearch}(D, M^{(k)})$.\n    if $\\|M^{(k+1)} - M^{(k)}\\|^2 < \\epsilon$ then NoChange = true. end if\n    $k = k + 1$.\n  until NoChange is true or the maximum iteration number is reached.\n  $M^* = M^{(k)}$.\n\nAlgorithm 2 Heuristic Greedy Rounding Procedure\n  Input: $b$, $M \\in (0, 1)^{b \\times u}$ for $b < u$.\n  Output: $\\widehat{M}$, $Q$.\n  Initialize: let $Q = \\emptyset$, set $\\widehat{M}$ as a $b \\times u$ matrix with all 0 entries.\n  for $k = 1$ to $b$ do\n    Identify the largest value $v = \\max(M(:))$.\n    Identify the indices $(i, j)$ of $v$ in $M$.\n    Set $Q = Q \\cup \\{j\\}$, $\\widehat{M}(i, j) = 1$, $M(i, :) = -\\mathrm{Inf}$, $M(:, j) = -\\mathrm{Inf}$.\n  end for\n\nThe overall algorithm for optimizing the matrix partition problem (12) is given in Algorithm 1. In our implementation, the constrained linear optimization (18) is solved efficiently using the optimization software package CPLEX. When the number of unlabeled instances, $u$, is large, computing the log-determinant of the $(u-b) \\times (u-b)$ matrix $T \\Sigma T^\\top$ is likely to run into overflow or underflow. 
Instead of computing the log-determinant directly, we choose to compute it in an alternative efficient way. The key idea is based on the mathematical fact that the determinant of a triangular matrix equals the product of its diagonal elements. Hence, the matrix\u2019s log-determinant is equal to the sum of the logarithms of its diagonal elements. By keeping all computations in log-scale, the underflow/overflow problem caused by taking the product of many numbers can be effectively circumvented. For positive definite matrices, such as the matrices we have, one can use the Cholesky factorization to first produce a triangular matrix and then compute the log-determinant of the original matrix from the logarithms of the diagonal values of the triangular matrix. The computations of log-determinants and matrix inverses in our algorithm are all conducted on matrices assumed to be positive definite. However, in order to increase the robustness of the algorithm and avoid numerical problems, we can add an additional $\\delta I$ term to the matrices to guarantee positive definiteness. Here $\\delta$ is a very small value and $I$ is an identity matrix.\n\nBy solving the matrix partition problem in (12) using Algorithm 1, a locally optimal matrix $M^*$ is returned. However, this $M^*$ contains continuous values. In order to determine which set of $b$ instances to select, we need to round $M^*$ to a $\\{0,1\\}$-valued matrix $\\widehat{M}^*$, while maintaining the permutation constraints $\\widehat{M}^*\\mathbf{1} = \\mathbf{1}$ and $\\widehat{M}^{*\\top}\\mathbf{1} = \\mathbf{1}$. We use a simple heuristic greedy procedure to conduct the rounding. In this procedure, we focus on rounding the last $b$ rows, $M_b^*$, since they are the ones used to pick the $b$ instances for labeling. 
The procedure is described in Algorithm 2, which returns the indices of the selected $b$ instances as well.\n\n4 Experiments\n\nTo investigate the empirical performance of the proposed batch-mode active learning algorithm, we conducted two sets of experiments, on a few UCI datasets and on the 20 newsgroups dataset. Note that the proposed active learning method is in general independent of the specific classification model employed. For the experiments in this section, we used logistic regression as the classification model to evaluate the informativeness of the selected labeled instances. We compared the proposed approach, denoted as Matrix, with three discriminative batch-mode active learning methods proposed in the literature: svmD, an approach that incorporates diversity in active learning with SVMs [2]; Fisher, an approach that uses the Fisher information matrix based on logistic regression classifiers for instance selection [8]; and Discriminative, a discriminative optimization approach based on logistic regression classifiers [6]. We also compared our approach to a transductive experimental design method, which is formulated for regression problems and whose instance selection process is independent of the classification model used for evaluation [15]. We used the sequential design code downloaded from the authors\u2019 webpage and denote this method as Design.\n\nFirst, we conducted experiments on seven UCI datasets. We consider a hard case of active learning, where we start active learning from only a few labeled instances. In each experiment, we start with two randomly selected labeled instances, one in each class. We then randomly select 2/3 of the remaining instances as the unlabeled set, using all the other instances for testing. All the algorithms start with the same initial labeled set, unlabeled set and testing set. 
For a fixed batch size $b$, each algorithm repeatedly selects $b$ instances to label, and the resulting classifier is evaluated on the testing data after each new labeling, with a maximum of 110 instances selected in total. The experiments were repeated 20 times. In Table 1, we report the experimental results with $b = 10$, comparing the proposed Matrix algorithm with each of the alternative methods. With $b = 10$, there are 11 evaluation points in total, with 20 results at each of them. We therefore run a 2-sided paired t-test at each evaluation point to compare the performance of each pair of algorithms. The \u201cwin%\u201d column denotes the percentage of evaluation points where the Matrix algorithm outperforms the specified algorithm according to a 2-sided paired t-test at the level of p < 0.05; the \u201close%\u201d column denotes the percentage of evaluation points where the specified algorithm outperforms the Matrix algorithm. The \u201coverall\u201d column instead shows the comparison results using a single 2-sided paired t-test on all 220 results. These results show that the proposed active learning method, Matrix, outperformed svmD, Fisher and Design on most data sets, the exceptions being an overall loss to svmD on pima, a tie with Fisher and Design on hepatitis, and a tie with Design on flare. Matrix is mostly tied with Discriminative on all data sets, with a slight pointwise win on crx and a slight overall loss on german. Although Matrix and Discriminative demonstrated similar performance, the proposed Matrix is more efficient in running time on the relatively big data sets. The running times over the 20 repeats are compared in Table 2.\n\nTable 1: Comparison of the active learning algorithms on UCI data with batch size = 10. These results are based on 2-sided paired t-tests at the level of p < 0.05.\n\nData set  | Matrix vs svmD        | Matrix vs Fisher      | Matrix vs Discriminative | Matrix vs Design\n          | win%   lose%  overall | win%   lose%  overall | win%   lose%  overall    | win%   lose%  overall\ncleve     | 63.6   0      win     | 45.5   0      win     | 0      0      tie        | 90.9   0      win\ncrx       | 27.3   0      win     | 9.1    0      win     | 9.1    0      tie        | 90.9   0      win\nflare     | 54.5   0      win     | 100.0  0      win     | 0      0      tie        | 36.4   9.1    tie\ngerman    | 81.8   0      win     | 9.1    0      win     | 0      0      lose       | 72.7   0      win\nheart     | 63.6   0      win     | 36.4   0      win     | 0      0      tie        | 100.0  0      win\nhepatitis | 100.0  0      win     | 33.3   0      tie     | 0      0      tie        | 0      0      tie\npima      | 0      0      lose    | 100.0  0      win     | 0      0      tie        | 81.8   0      win\n\nTable 2: Average running time (in minutes)\n\nMethod         | cleve  crx    flare   german  heart  hepatitis  pima\nMatrix         | 8.37   6.14   9.53    22.08   5.68   0.12       60.11\nDiscriminative | 3.33   61.44  220.12  285.65  2.40   0.08       68.27\n\nTable 3: Comparison of the active learning algorithms on Newsgroup data with batch size = 20. These results are based on 2-sided paired t-tests at the level of p < 0.05.\n\nData set | Matrix vs svmD        | Matrix vs Fisher      | Matrix vs Random      | Matrix vs Design\n         | win%   lose%  overall | win%   lose%  overall | win%   lose%  overall | win%   lose%  overall\nAutos    | 86.7   0      win     | 20.0   6.6    tie     | 73.3   6.6    win     | 80.0   6.7    win\nHardware | 100.0  0      win     | 0      0      tie     | 13.3   0      win     | 86.7   0      win\nSport    | 86.7   6.6    win     | 20.0   13.3   tie     | 46.7   0      win     | 80.0   6.7    win\n\nNext we conducted experiments on the 20 newsgroups dataset for document categorization. We build three binary classification tasks: (1) Autos: rec.autos (987 documents) vs. rec.motorcycles (993 documents); (2) Hardware: comp.sys.ibm.pc.hardware (979 documents) vs. comp.sys.mac.hardware (958 documents); (3) Sport: rec.sport.baseball (991 documents) vs. rec.sport.hockey (997 documents). Each document is first minimally processed into a \u201ctf.idf\u201d vector. 
We then select the top 400 features according to their total \u201ctf.idf\u201d frequencies in all the documents for the considered task. In each experiment, we start with four randomly selected labeled instances, two in each class. We then randomly select 1000 instances (500 from each class) from the remaining ones as the unlabeled set, using all the other instances for testing. All the algorithms start with the same initial labeled set, unlabeled set and testing set. For a fixed batch size $b$, each algorithm repeatedly selects $b$ instances to label, with a maximum of 300 instances selected in total. In this section, we report the experimental results with $b = 20$, averaged over 20 repetitions. There are 300/20 = 15 evaluation points in this case.\n\nNote that the unlabeled sets used for this set of experiments are much larger than the ones used for the experiments on UCI datasets. This substantially increases the search space of instance selection. One consequence in our experiments is that the Discriminative algorithm becomes very slow. Thus we were not able to produce comparison results for this algorithm. The proposed Matrix method was affected as well. However, we coped with this problem using a subsampling-assisted method, where we first select a subset of 400 instances from the unlabeled set and then restrict our instance selection to this subset. This is equivalent to solving the matrix partition optimization in (12) with additional constraints on $M_b$, such that the columns of $M_b$ corresponding to instances outside of this subset of 400 instances are all set to 0. For the experiments, we chose the 400 instances as the ones with the highest entropy under the current classification model. The same subsampling was used for the method Design as well. Table 3 shows the comparison results on the three document categorization tasks, comparing Matrix to svmD, Fisher, Design and a baseline random selection, Random. 
These results show that the proposed Matrix outperformed svmD, Design and Random. It tied with Fisher on the overall measure, but had a slight win on the pointwise measure.\n\nThese empirical results suggest that selecting unlabeled instances independent of the classification model using the proposed matrix partition method can achieve reasonable performance, which is better than a transductive experimental design method and comparable to the discriminative batch-mode active learning approaches. Moreover, our approach can offer certain conveniences in circumstances where one does not know the classification model to be employed.\n\n5 Conclusions\n\nIn this paper, we propose a novel batch-mode active learning approach that makes query selection decisions independent of the classification model employed. The proposed approach is based on a general maximum mutual information principle. It is formulated as a matrix partition optimization problem under a Gaussian process framework. To tackle the formulated combinatorial optimization problem, we developed an effective local optimization technique. Our empirical studies show that the proposed flexible batch-mode active learning approach can achieve comparable or superior performance to discriminative batch-mode active learning methods that have been optimized on specific classifiers. A future extension of this work is to consider batch-mode active learning with structured data by exploiting different kernel functions.\n\nReferences\n[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[2] K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of International Conference on Machine Learning, 2003.\n[3] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.\n[4] C. Guestrin, A. Krause, and A. Singh. 
Near-optimal sensor placements in Gaussian processes. In Proceedings of International Conference on Machine Learning, 2005.\n[5] Y. Guo and R. Greiner. Optimistic active learning using mutual information. In Proceedings of International Joint Conference on Artificial Intelligence, 2007.\n[6] Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In Proceedings of Neural Information Processing Systems, 2007.\n[7] S. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by batch mode active learning. In Proceedings of the International World Wide Web Conference, 2006.\n[8] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of International Conference on Machine Learning, 2006.\n[9] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Semi-supervised SVM batch mode active learning for image retrieval. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.\n[10] A. Krause, C. Guestrin, A. Gupta, and J. Kleinberg. Near-optimal sensor placements: Maximizing information while minimizing communication cost. In International Symposium on Information Processing in Sensor Networks, 2006.\n[11] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[12] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of International Conference on Machine Learning, 2000.\n[13] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin\u2013Madison, 2009.\n[14] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In European Conference on Information Retrieval, 2003.\n[15] K. Yu and J. Bi. Active learning via transductive experimental design. 
In Proceedings of the International Conference on Machine Learning, 2006.\n", "award": [], "sourceid": 613, "authors": [{"given_name": "Yuhong", "family_name": "Guo", "institution": null}]}