{"title": "Active Classification based on Value of Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 1062, "page_last": 1070, "abstract": "Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.", "full_text": "Active Classi\ufb01cation based on Value of Classi\ufb01er\n\nTianshi Gao\nDepartment of Electrical Engineering\nStanford University\nStanford, CA 94305\ntianshig@stanford.edu\n\nDaphne Koller\nDepartment of Computer Science\nStanford University\nStanford, CA 94305\nkoller@cs.stanford.edu\n\nAbstract\n\nModern classi\ufb01cation tasks usually involve many class labels and can be informed by a broad range of features. 
Many of these tasks are tackled by constructing a\nset of classi\ufb01ers, which are then applied at test time and then pieced together in a\n\ufb01xed procedure determined in advance or at training time. We present an active\nclassi\ufb01cation process at the test time, where each classi\ufb01er in a large ensemble\nis viewed as a potential observation that might inform our classi\ufb01cation process.\nObservations are then selected dynamically based on previous observations, using\na value-theoretic computation that balances an estimate of the expected classi\ufb01ca-\ntion gain from each observation as well as its computational cost. The expected\nclassi\ufb01cation gain is computed using a probabilistic model that uses the outcome\nfrom previous observations. This active classi\ufb01cation process is applied at test\ntime for each individual test instance, resulting in an ef\ufb01cient instance-speci\ufb01c de-\ncision path. We demonstrate the bene\ufb01t of the active scheme on various real-world\ndatasets, and show that it can achieve comparable or even higher classi\ufb01cation ac-\ncuracy at a fraction of the computational costs of traditional methods.\n\n1 Introduction\n\nAs the scope of machine learning applications has increased, the complexity of the classi\ufb01cation\ntasks that are commonly tackled has grown dramatically. On one dimension, many classi\ufb01cation\nproblems involve hundreds or even thousands of possible classes [8]. On another dimension, re-\nsearchers have spent considerable effort developing new feature sets for particular applications, or\nnew types of kernels. For example, in an image labeling task, we have the option of using GIST\nfeature [26], SIFT feature [23], spatial HOG feature [33], Object Bank [21] and more. 
The bene\ufb01ts of combining information from different types of features can be very signi\ufb01cant [12, 33].\n\nTo solve a complex classi\ufb01cation problem, many researchers have resorted to ensemble methods, in which multiple classi\ufb01ers are combined to achieve an accurate classi\ufb01cation decision. For example, the Viola-Jones classi\ufb01er [32] uses a cascade of classi\ufb01ers, each of which focuses on different spatial and appearance patterns. Boosting [10] constructs a committee of weak classi\ufb01ers, each of which focuses on different input distributions. Multiclass classi\ufb01cation problems are very often reduced to a set of simpler (often binary) decisions, including one-vs-one [11], one-vs-all, error-correcting output codes [9, 1], or tree-based approaches [27, 13, 3]. Intuitively, different classi\ufb01ers provide different \u201cexpertise\u201d in making certain distinctions that can inform the classi\ufb01cation task. However, as we discuss in Section 2, most of these methods use a \ufb01xed procedure determined at training time to apply the classi\ufb01ers, without adapting to each individual test instance.\n\nIn this paper, we take an active and adaptive approach to combining multiple classi\ufb01ers/features at test time, based on the idea of value of information [16, 17, 24, 22]. At training time, we construct a rich family of classi\ufb01ers, which may vary in the features that they use or the set of distinctions that they make (i.e., the subset of classes that they try to distinguish). Each of these classi\ufb01ers is trained on all of the relevant training data. At test time, we dynamically select an instance-speci\ufb01c subset of classi\ufb01ers. We view each of our pre-trained classi\ufb01ers as a possible observation we can make about an instance; each one adds a potential value towards our ability to classify the instance, but also has a cost. 
Starting from an empty set of observations, at each stage, we use a myopic value-of-\ninformation computation to select the next classi\ufb01er to apply to the instance in a way that attempts to\nincrease the accuracy of our classi\ufb01cation state (e.g., decrease the uncertainty about the class label)\nat a low computational cost. This process stops when one of the suitable criteria is met (e.g., if\nwe are suf\ufb01ciently con\ufb01dent about the prediction). We provide an ef\ufb01cient probabilistic method for\nestimating the uncertainty of the class variable and about the expected gain from each classi\ufb01er. We\nshow that this approach provides a natural trajectory, in which simple, cheap classi\ufb01ers are applied\ninitially, and used to provide guidance on which of our more expensive classi\ufb01ers is likely to be\nmore informative. In particular, we show that we can get comparable (or even better) performance\nto a method that uses a large range of expensive classi\ufb01ers, at a fraction of the computational cost.\n\n2 Related Work\n\nOur classi\ufb01cation model is based on multiple classi\ufb01ers, so it resembles ensemble methods like\nboosting [10], random forests [4] and output-coding based multiclass classi\ufb01cation [9, 1, 29, 14].\nHowever, these methods use a static decision process, where all classi\ufb01ers have to be evaluated\nbefore any decision can be made. Moreover, they often consider a homogeneous set of classi\ufb01ers,\nbut we consider a variety of heterogeneous classi\ufb01ers with different features and function forms.\n\nSome existing methods can make classi\ufb01cation decisions based on partial observations. One exam-\nple is a cascade of classi\ufb01ers [32, 28], where an instance goes through a chain of classi\ufb01ers and the\ndecision can be made at any point if the classi\ufb01er response passes some threshold. Another type of\nmethod focuses on designing the stopping criteria. Schwing et al. 
[30] proposed a stopping criterion\nfor random forests such that decisions can be made based on a subset of the trees. However, these\nmethods have a \ufb01xed evaluation sequence for any instance, so there is no adaptive selection of which\nclassi\ufb01ers to use based on what we have already observed.\n\nInstance-speci\ufb01c decision paths based on previous observations can be found in decision tree style\nmodels, e.g., DAGSVM [27] and tree-based methods [15, 13, 3]. Instead of making hard decisions\nbased on individual observations like these methods, we use a probabilistic model to fuse informa-\ntion from multiple observations and only make decisions when it is suf\ufb01ciently con\ufb01dent.\n\nWhen observations are associated with different features, our method also performs feature selec-\ntion. Instead of selecting a \ufb01xed set of features in the learning stage [34], we actively select instance-\nspeci\ufb01c features in the test stage. Furthermore, our method also considers computational properties\nof the observations. Our selection criterion trades off between the statistical gain and the compu-\ntational cost of the classi\ufb01er, resulting in a computationally ef\ufb01cient cheap-to-expensive evaluation\nprocess. Similar ideas are hard-coded by Vedaldi et al. [31] without adaptive decisions about when to\nswitch to which classi\ufb01er with what cost. Angelova et al. [2] performed feature selection to achieve\ncertain accuracy under some computational budget, but the selection is at training time without adap-\ntation to individual test instances. Chai et al. [5] considered test-time feature value acquisition with\na strong assumption that observations are conditionally independent given the class variable.\n\nFinally, our work is inspired by decision-making under uncertainty based on value of informa-\ntion [16, 17, 24, 22]. 
For classi\ufb01cation, Krause and Guestrin [19] used it to compute a conditional\nplan for asking the expert, trying to optimize classi\ufb01cation accuracy while requiring as little expert\ninteraction as possible. In machine learning, Cohn et al. [7] used active learning to select training\ninstances to reduce the labeling cost and speedup the learning, while our work focuses on inference.\n\n3 Model\n\nWe denote the instance and label pair as (X, Y ). Furthermore, we assume that we have been pro-\nvided a set of trained classi\ufb01ers H, where hi \u2208 H : X \u2192 R can be any real-valued classi\ufb01ers\n(functions) from existing methods. For example, for multiclass classi\ufb01cation, hi can be one-vs-all\nclassi\ufb01ers, one-vs-one classi\ufb01ers and weak learners from the boosting algorithms. Note that hi\u2019s\ndo not have to be homogeneous meaning that they can have different function forms, e.g., linear\nor nonlinear, and more importantly they can be trained on different types of features with various\ncomputational costs. Given an instance x, our goal is to infer Y by sequentially selecting one hi to\nevaluate at a time, based on what has already been observed, until we are suf\ufb01ciently con\ufb01dent about\n\n2\n\n\fY or some other stopping criterion is met, e.g., the computational constraint. The key in this process\nis how valuable we think a classi\ufb01er hi is, so we introduce the value of a classi\ufb01er as follows.\nValue of Classi\ufb01er. Let O be the set of classi\ufb01ers that have already been evaluated (empty at the\nbeginning). Denote the random variable Mi = hi(X) as the response/margin of the i-th classi\ufb01er\nin H and denote the random vector for the observed classi\ufb01ers as MO = [Mo1 , Mo2 , . . . , Mo|O|]T ,\nwhere \u2200oi \u2208 O. Given the actual observed values mO of MO, we have a posterior P (Y |mO)\nover Y . 
For now, suppose we are given a reward R : P \u2192 R which takes in a distribution P and returns a real value indicating how preferable P is. Furthermore, we use C(hi|O) to denote the computational cost of evaluating classi\ufb01er hi conditioned on the set of evaluated classi\ufb01ers O. This is because if hi shares the same feature with some oi \u2208 O, we do not need to compute the feature again. With some chosen reward R and a computational model C(hi|O), we de\ufb01ne the value of an unobserved classi\ufb01er as follows.\n\nDe\ufb01nition 1 The value of classi\ufb01er V (hi|mO) for a classi\ufb01er hi given the observed classi\ufb01er responses mO is the combination of the expected reward of the state informed by hi and the computational cost of hi. Formally,\n\nV (hi|mO) \u2206= \u222b P (mi|mO) R(P (Y |mi, mO)) dmi \u2212 (1/\u03c4) C(hi|O) = E mi\u223cP (Mi|mO)[R(P (Y |mi, mO))] \u2212 (1/\u03c4) C(hi|O)   (1)\n\nThe value of classi\ufb01er has two parts corresponding to the statistical and computational properties of the classi\ufb01er respectively. The \ufb01rst part VR(hi|mO) \u2206= E[R(P (Y |mi, mO))] is the expected reward of P (Y |mi, mO), where the expectation is with respect to the posterior of Mi given mO. The second part VC(hi|mO) \u2206= \u2212(1/\u03c4) C(hi|O) is a computational penalty incurred by evaluating the classi\ufb01er hi. The constant \u03c4 controls the tradeoff between the reward and the cost.\n\nGiven the de\ufb01nition of the value of classi\ufb01er, at each step of our sequential evaluations, our goal is to pick hi with the highest value:\n\nh\u2217 = argmax hi\u2208H\\O V (hi|mO) = argmax hi\u2208H\\O [VR(hi|mO) + VC(hi|mO)]   (2)\n\nWe introduce the building blocks of the value of classi\ufb01er, i.e., the reward, the cost and the probabilistic model, in the following, and then explain how to compute it.\n\nReward De\ufb01nition. 
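The greedy step in Eq. (2) can be sketched as follows. This is a minimal illustration under our own toy inputs: `candidates` holds precomputed (expected reward, cost) pairs standing in for the quantities the paper derives from its probabilistic model, and the function name is ours, not the paper's.

```python
# Minimal sketch of the myopic selection rule in Eq. (2): pick the unobserved
# classifier maximizing V(h_i | m_O) = VR(h_i | m_O) - (1/tau) * C(h_i | O).

def select_classifier(candidates, tau):
    """Return the index of the candidate with the highest value of classifier."""
    best_idx, best_val = None, float("-inf")
    for idx, (expected_reward, cost) in enumerate(candidates):
        value = expected_reward - cost / tau
        if value > best_val:
            best_idx, best_val = idx, value
    return best_idx

# A cheap, mildly informative classifier can beat an expensive, better one
# when tau is small (cost-sensitive), and lose when tau is large.
candidates = [(-0.9, 1.0), (-0.2, 50.0)]
print(select_classifier(candidates, tau=10.0))    # 0: cheap classifier wins
print(select_classifier(candidates, tau=1000.0))  # 1: expensive classifier wins
```

The two calls show how \u03c4 shifts the reward/cost tradeoff described below Eq. (1).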
We propose two ways to de\ufb01ne the reward R : P \u2192 R.\n\nResidual Entropy. From the information-theoretic point of view, we want to reduce the uncertainty of the class variable Y by observing classi\ufb01er responses. Therefore, a natural way to de\ufb01ne the reward is the negative residual entropy, that is, the lower the entropy the higher the reward. Formally, given some posterior distribution P (Y |mO), we de\ufb01ne\n\nR(P (Y |mO)) = \u2212H(Y |mO) = \u2211y P (y|mO) log P (y|mO)   (3)\n\nThe value of classi\ufb01er under this reward de\ufb01nition is closely related to information gain. Speci\ufb01cally,\n\nVR(hi|mO) = E mi\u223cP (Mi|mO)[\u2212H(Y |mi, mO)] + H(Y |mO) \u2212 H(Y |mO) = I(Y ; Mi|mO) \u2212 H(Y |mO)   (4)\n\nSince H(Y |mO) is a constant w.r.t. hi, we have\n\nh\u2217 = argmax hi\u2208H\\O [VR(hi|mO) + VC(hi|mO)] = argmax hi\u2208H\\O [I(Y ; Mi|mO) + VC(hi|mO)]   (5)\n\nTherefore, at each step, we want to pick the classi\ufb01er with the highest mutual information with the class variable Y given the observed classi\ufb01er responses mO, subject to a computational constraint.\n\nClassi\ufb01cation Loss. From the classi\ufb01cation loss point of view, we want to minimize the expected loss when choosing classi\ufb01ers to evaluate. Therefore, given a loss function \u2206(y, y\u2032) specifying the penalty of classifying an instance of class y as y\u2032, we can de\ufb01ne the reward as the negative of the minimum expected loss:\n\nR(P (Y |mO)) = \u2212 min y\u2032 \u2211y P (y|mO) \u2206(y, y\u2032) = \u2212 min y\u2032 E y\u223cP (Y |mO)[\u2206(y, y\u2032)]   (6)\n\nTo gain some intuition about this de\ufb01nition, consider a 0-1 loss function, i.e., \u2206(y, y\u2032) = 1{y \u2260 y\u2032}; then R(P (Y |mO)) = \u22121 + max y\u2032 P (y\u2032|mO). To maximize R, we want the peak of P (Y |mO) to be as high as possible. 
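The two reward definitions in Eqs. (3) and (6) can be sketched directly; this is a toy illustration with function names of our own choosing, where a posterior is just a list of probabilities and the loss is a matrix `loss[y][y']`.

```python
import math

def reward_neg_entropy(posterior):
    """Eq. (3): negative residual entropy, sum_y p(y) log p(y)."""
    return sum(p * math.log(p) for p in posterior if p > 0.0)

def reward_neg_expected_loss(posterior, loss):
    """Eq. (6): negative of the minimum expected loss over predictions y'."""
    k = len(posterior)
    return -min(sum(posterior[y] * loss[y][yp] for y in range(k))
                for yp in range(k))

# With 0-1 loss, Eq. (6) reduces to -1 + max_y' P(y' | m_O), as in the text.
zero_one = [[0 if y == yp else 1 for yp in range(3)] for y in range(3)]
peaked, uniform = [0.8, 0.1, 0.1], [1/3, 1/3, 1/3]
assert reward_neg_entropy(peaked) > reward_neg_entropy(uniform)
print(round(reward_neg_expected_loss(peaked, zero_one), 6))  # -0.2 = -1 + 0.8
```

Both functions prefer peaked posteriors, which is the property the selection rule exploits.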
In our experiment, these two reward de\ufb01nitions give similar results.\n\nClassi\ufb01cation Cost. The cost of evaluating a classi\ufb01er h on an instance x can be broken down into two parts. The \ufb01rst part is the cost of computing the feature \u03c6 : X \u2192 Rn on which h is built, and the second is the cost of computing the function value of h given the input \u03c6(x). If h shares the same feature as some evaluated classi\ufb01ers in O, then C(h|O) only consists of the cost of evaluating the function h; otherwise it also includes the cost of computing the feature input \u03c6. Note that computing \u03c6 is usually much more expensive than evaluating the function value of h.\n\nProbabilistic Model. Given a test instance x, we construct an instance-speci\ufb01c joint distribution over Y and the selected observations MO. Our probabilistic model is a mixture model, where each component corresponds to a class Y = y, and we use a uniform prior P (Y ). Starting from an empty O, we model P (Mi, Y ) as a mixture of Gaussian distributions. At each step, given the selected MO, we model the new joint distribution P (Mi, MO, Y ) = P (Mi|MO, Y )P (MO, Y ) by modeling the new P (Mi|MO, Y = y) as a linear Gaussian, i.e., P (Mi|MO, Y = y) = N (\u03b8yT MO, \u03c3y^2). As we show in Section 5, this choice of probabilistic model works well empirically. We discuss how to learn the distribution and do inference in the next section.\n\n4 Learning and Inference\n\nLearning P (Mi|mO, y). Given the subset of the training set {(x(j), y(j) = y)}j=1..Ny corresponding to the instances from class y, we denote m(j)i = hi(x(j)); then our goal is to learn P (Mi|mO, y) from {(m(j), y(j) = y)}j=1..Ny. If O = \u2205, then P (Mi|mO, y) reduces to the marginal distribution P (Mi|y) = N (\u00b5y, \u03c3y^2), and based on maximum likelihood estimation, we have \u00b5y = (1/Ny) \u2211j m(j)i and \u03c3y^2 = (1/Ny) \u2211j (m(j)i \u2212 \u00b5y)^2. If O \u2260 \u2205, we assume that P (Mi|mO, y) is a linear Gaussian, i.e., \u00b5y = \u03b8yT mO. Note that we also append a constant 1 to mO as the bias term. Since we know mO at test time, we estimate \u03b8y and \u03c3y^2 by maximizing the local likelihood with a Gaussian prior on \u03b8y. Speci\ufb01cally, for each training instance j from class y, let wj = exp(\u2212\u2016mO \u2212 m(j)O\u2016^2 / \u03b2), where \u03b2 is a bandwidth parameter; then the regularized local log likelihood is\n\nL(\u03b8y, \u03c3y; mO) = \u2212\u03bb \u2016\u03b8y\u2016^2 + \u2211j=1..Ny wj log N (m(j)i; \u03b8yT m(j)O, \u03c3y^2)   (7)\n\nwhere we overload the notation N (x; \u00b5y, \u03c3y^2) to mean the value of a Gaussian PDF with mean \u00b5y and variance \u03c3y^2 evaluated at x. Note that maximizing (7) is equivalent to locally weighted regression [6] with \u21132 regularization. Maximizing (7) results in:\n\n\u02c6\u03b8y = argmin \u03b8y \u03bb \u2016\u03b8y\u2016^2 + \u2211j=1..Ny wj \u2016m(j)i \u2212 \u03b8yT m(j)O\u2016^2 = ( \u00afMOT W \u00afMO + \u03bbI)\u22121 \u00afMOT W \u00afMi   (8)\n\nwhere \u00afMO is a matrix whose j-th row is m(j)OT, W is a diagonal matrix whose diagonal entries are the wj\u2019s, \u00afMi is a column vector whose j-th element is m(j)i, and I is an identity matrix. It is worth noting that ( \u00afMOT W \u00afMO + \u03bbI)\u22121 \u00afMOT W in (8) does not depend on i, so it can be computed once and shared for different classi\ufb01ers hi. Finally, the estimated \u03c3y^2 is\n\n\u02c6\u03c3y^2 = (1 / \u2211j wj) \u2211j=1..Ny wj \u2016m(j)i \u2212 \u02c6\u03b8yT m(j)O\u2016^2   (9)\n\nComputing V (hi|mO). Given the learned distribution, we can easily compute the two CPDs in (1), i.e., P (Mi|mO) and P (Y |mi, mO). 
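The closed forms in Eqs. (8) and (9) are ordinary weighted ridge regression and a weighted residual variance, so they can be sketched in a few lines of NumPy. This is a toy example on synthetic data of our own making, not the paper's code; `M_O` stacks one observed-response row per training instance (with the constant-1 bias appended) and `m_i` holds the target classifier's responses.

```python
import numpy as np

def local_linear_gaussian(M_O, m_i, w, lam):
    """Return (theta_hat, sigma2_hat) per Eqs. (8) and (9)."""
    W = np.diag(w)
    # (M^T W M + lambda I)^{-1} M^T W is shared across target classifiers i.
    shared = np.linalg.solve(M_O.T @ W @ M_O + lam * np.eye(M_O.shape[1]),
                             M_O.T @ W)
    theta = shared @ m_i
    resid = m_i - M_O @ theta
    sigma2 = np.sum(w * resid**2) / np.sum(w)  # weighted residual variance
    return theta, sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
M_O = np.hstack([X, np.ones((50, 1))])            # append constant-1 bias term
m_i = M_O @ np.array([1.5, -2.0, 0.3]) + 0.01 * rng.normal(size=50)
w = np.exp(-np.sum((X - X[0])**2, axis=1) / 2.0)  # kernel weights w_j
theta, sigma2 = local_linear_gaussian(M_O, m_i, w, lam=1e-6)
print(np.round(theta, 1))  # close to the true coefficients [1.5, -2., 0.3]
```

Precomputing `shared` once per class mirrors the sharing trick the text notes for Eq. (8).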
P (Mi|mO) can be obtained as P (Mi|mO) = \u2211y P (Mi|mO, y)P (y|mO), where P (Y |mO) is the posterior over Y given the observations mO, which is tracked over iterations. Speci\ufb01cally, P (Y |mi, mO) \u221d P (mi, mO|Y )P (Y ) = P (mi|mO, Y )P (mO|Y )P (Y ), where all terms are available by caching previous computations. Finally, to compute V (hi|mO), the computational part VC(hi|mO) is just a lookup in a cost table, and the expected reward part VR(hi|mO) can be rewritten as:\n\nVR(hi|mO) = \u2211y P (y|mO) E mi\u223cP (Mi|mO,y)[R(P (Y |mi, mO))]   (10)\n\nTherefore, each component E mi\u223cP (Mi|mO,y)[R(P (Y |mi, mO))] is the expectation of a function of a scalar Gaussian variable. We use Gaussian quadrature [18]1 to approximate each component expectation, and then take the weighted average to get VR(hi|mO).\n\nDynamic Inference. Given the building blocks introduced before, one can execute the classi\ufb01cation process in |H| steps, where at each step the values of all the remaining classi\ufb01ers are computed. However, this will incur a large scheduling cost, since |H| is usually large. For example, in multiclass classi\ufb01cation, if we include all one-vs-one classi\ufb01ers in H, |H| is quadratic in the number of classes. Since we maintain a belief over Y as observations are accumulated, we can use it to make the inference process more adaptive, resulting in a small scheduling cost.\n\nEarly Stopping. Based on the posterior P (Y |mO), we can make dynamic and adaptive decisions about whether to continue observing new classi\ufb01ers or to stop the process. We propose two stopping criteria. We stop the inference process whenever either of them is met, and use the posterior over Y at that point to make the classi\ufb01cation decision. The \ufb01rst criterion is based on the information-theoretic point of view. 
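The scalar-Gaussian expectation inside Eq. (10) can be approximated with Gauss-Hermite quadrature, which is available in NumPy; the footnote reports that 3 or 5 points already give an accurate approximation. The sketch below is ours (the reward `R` is any scalar function of the margin), not the paper's implementation.

```python
import numpy as np

def gaussian_expectation(R, mu, sigma, n_points=5):
    """Approximate E_{m ~ N(mu, sigma^2)}[R(m)] with Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables m = mu + sqrt(2)*sigma*t maps the e^{-t^2} weight
    # of Hermite quadrature onto the N(mu, sigma^2) density.
    return float(np.sum(w * R(mu + np.sqrt(2.0) * sigma * t)) / np.sqrt(np.pi))

# Sanity check: E[m^2] for m ~ N(0, 1) is exactly 1; 3 points suffice because
# 3-point Gauss-Hermite is exact for polynomials up to degree 5.
print(round(gaussian_expectation(lambda m: m**2, 0.0, 1.0, n_points=3), 6))  # 1.0
```

In the full computation, one such expectation is evaluated per promising class y and averaged with weights P(y|mO), as in Eq. (10).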
Given the current posterior estimate P (Y |mi, mO) and the previous posterior estimate P (Y |mO), the relative entropy (KL divergence) between them is D(P (Y |mO) \u2016 P (Y |mi, mO)). We stop the inference procedure when this divergence is below some threshold t. The second criterion is based on the classi\ufb01cation point of view. We consider the gap between the probability of the current best class and that of the runner-up. Speci\ufb01cally, we de\ufb01ne the margin given a posterior P (Y |mO) as \u03b4m(P (Y |mO)) = P (y\u2217|mO) \u2212 max y\u2260y\u2217 P (y|mO), where y\u2217 = argmax y P (y|mO). If \u03b4m(P (Y |mO)) \u2265 t\u2032, then the inference stops.\n\nDynamic Pruning of Class Space. In many cases, a class is mainly confused with a small number of other classes (the confusion matrix is often close to sparse). This implies that after observing a few classi\ufb01ers, the posterior P (Y |mO) is very likely to be dominated by a few modes, leaving the rest with very small probability. For those classes y with very small P (y|mO), their contributions to the value of classi\ufb01er (10) are negligible. Therefore, when computing (10), we ignore the components whose P (y|mO) is below some small threshold (equivalent to setting the contribution from this component to 0). Furthermore, when P (y|mO) falls below some very small threshold for a class y, we will not estimate the likelihood related to y, i.e., P (Mi|mO, y), but use a small constant.\n\nDynamic Classi\ufb01er Space. To avoid computing the values of all the remaining classi\ufb01ers, we can dynamically restrict the search space of classi\ufb01ers to those having high expected mutual information with Y with respect to the current posterior P (Y |mO). 
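The two stopping tests can be sketched together; the threshold values below are illustrative choices of ours, not the paper's tuned settings, and posteriors are plain probability lists.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (eps guards log of zero)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def should_stop(prev_posterior, posterior, kl_thresh=1e-3, margin_thresh=0.9):
    """Stop if the posterior barely moved, or if the best class dominates."""
    top_two = sorted(posterior, reverse=True)[:2]
    margin = top_two[0] - top_two[1]
    return (kl_divergence(prev_posterior, posterior) < kl_thresh
            or margin >= margin_thresh)

print(should_stop([0.5, 0.3, 0.2], [0.95, 0.03, 0.02]))  # True: margin 0.92
print(should_stop([0.5, 0.3, 0.2], [0.6, 0.25, 0.15]))   # False: still moving
```

Either test firing ends the sequential evaluation, and the current posterior is used for the final prediction.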
Speci\ufb01cally, during training, for each classi\ufb01er hi we can compute the mutual information I(Mi; By) between its response Mi and a class y, where By is a binary variable indicating whether an instance is from class y or not. Given our current posterior P (Y |mO), we tried two ways to rank the unobserved classi\ufb01ers. First, we simply select the top L classi\ufb01ers with the highest I(Mi; B\u02c6y), where \u02c6y is the most probable class based on the current posterior. Since we can sort classi\ufb01ers in the training stage, this step takes constant time. The other way is, for each classi\ufb01er, to compute a weighted mutual information score, i.e., \u2211y P (y|mO)I(Mi; By), and restrict the classi\ufb01er space to those with the top L scores. Note that computing the scores is very ef\ufb01cient, since each is just an inner product between two vectors, where the I(Mi; By)\u2019s have been computed and cached before testing. Our experiments showed that these two scores have similar performance, and we used the \ufb01rst method to report the results.\n\nAnalysis of Time Complexity. 
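The weighted-mutual-information shortlist is exactly one matrix-vector product followed by a top-L selection. In the sketch below, the MI matrix is made up for illustration; in the paper it would be precomputed during training and cached.

```python
import numpy as np

def shortlist(mi, posterior, L):
    """mi[i, y] = I(M_i; B_y); return indices of the L highest-scoring classifiers."""
    scores = mi @ posterior              # one inner product per classifier
    return np.argsort(scores)[::-1][:L]

mi = np.array([[0.9, 0.0, 0.0],          # classifier 0: informative about class 0
               [0.0, 0.8, 0.1],          # classifier 1: informative about class 1
               [0.1, 0.1, 0.7]])         # classifier 2: informative about class 2
posterior = np.array([0.1, 0.8, 0.1])    # class 1 currently most probable
print(shortlist(mi, posterior, L=2))     # [1 2]: classifier 1 ranks first
```

Only the shortlisted classifiers then have their full value V(hi|mO) computed, which is what keeps the per-iteration scheduling cost small.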
At each iteration t, the scheduling overhead includes selecting the top L candidate observations, and for each candidate i, learning P (Mi|mO, y) and computing\n\n1We found that 3 or 5 points provide an accurate approximation.\n\nFigure 1: (Best viewed magni\ufb01ed and in colors) Performance comparisons on UCI datasets. From left to right are the results on the satimage, pendigits, vowel and letter (in log-scale) datasets. 
Note that the error bars for the pendigits and letter datasets are very small (around 0.5% on average).\n\nV (hi|mO). First, selecting the top L candidate observations takes constant time, since we can sort the observations based on I(Mi; By) before the test process. Second, estimating P (Mi|mO, y) requires computing (8) and (9) for different y\u2019s. Given our dynamic pruning of the class space, suppose there are only Nt,Y promising classes to consider instead of the total number of classes K. Since ( \u00afMOT W \u00afMO + \u03bbI)\u22121 \u00afMOT W in (8) does not depend on i, we compute it for each promising class, which takes O(tNy^2 + t^2Ny + t^3) \ufb02oating point operations, and share it for different i\u2019s. After computing this shared component, for each pair of i and a promising class, computing (8) and (9) both take O(tNy). Finally, computing (10) takes O(Nt,Y^2). Putting everything together, the overall cost at iteration t is O(Nt,Y(tNy^2 + t^2Ny + t^3) + L Nt,Y tNy + L Nt,Y^2). The key to having a low cost is to effectively prune the class space (small Nt,Y) and reach a decision quickly (small t).\n\n5 Experimental Results\n\nWe performed experiments on a collection of four UCI datasets [25] and on a scene recognition dataset [20]. All tasks are multiclass classi\ufb01cation problems. The \ufb01rst set of experiments focuses on a single feature type and aims to show that (i) our probabilistic model is able to combine multiple binary classi\ufb01ers to achieve comparable or higher classi\ufb01cation accuracy than traditional methods; (ii) our active evaluation strategy successfully selects a signi\ufb01cantly smaller number of classi\ufb01ers. The second set of experiments considers multiple features with varying computational complexities. This experiment shows the real power of our active scheme. 
Speci\ufb01cally, it dynamically selects an instance-speci\ufb01c subset of features, resulting in classi\ufb01cation accuracy as high as or higher than using all features, but with a signi\ufb01cant reduction in the computational cost.\n\nBasic Setup. Given a feature \u03c6, our set of classi\ufb01ers H\u03c6 consists of all one-vs-one classi\ufb01ers, all one-vs-all classi\ufb01ers, and all node classi\ufb01ers from a tree-based method [13], where a node classi\ufb01er can be trained to distinguish two arbitrary clusters of classes. Therefore, for a K-class problem, the number of classi\ufb01ers given a single feature is |H\u03c6| = (K\u22121)K/2 + K + N\u03c6,tree, where N\u03c6,tree is the number of nodes in the tree model. If there are multiple features {\u03c6i}i=1..F, our pool of classi\ufb01ers is H = \u222ai=1..F H\u03c6i. The form of all classi\ufb01ers is linear SVM for the \ufb01rst set of experiments and nonlinear SVM with various kernels for the second set of experiments. During training, in addition to learning the classi\ufb01ers, we also need to compute the response m(j)i of each classi\ufb01er hi \u2208 H on each training instance x(j). In order to make the training distribution of the classi\ufb01er responses better match the test distribution, when evaluating classi\ufb01er hi on x(j), we do not want hi to have been trained on x(j). To achieve this, we use a procedure similar to cross validation. Speci\ufb01cally, we split the training set into 10 folds, and for each fold, instances from this fold are tested using the classi\ufb01ers trained on the other 9 folds. After this procedure, each training instance x(j) will have been evaluated by all hi\u2019s. Note that the classi\ufb01ers used in the test stage are trained on the entire training set. 
Although for different training instances x(j) and x(k) from different folds and a test instance x, the responses m(j)i, m(k)i and mi are obtained using different hi\u2019s, our experimental results con\ufb01rmed that their empirical distributions are close enough to achieve good performance.\n\nStandard Multiclass Problems from UCI Repository. The \ufb01rst set of experiments is done on four standard multiclass problems from the UCI machine learning repository [25]: vowel (speech recognition, 11 classes), letter (optical character recognition, 26 classes), satimage (pixel-based classi\ufb01cation/segmentation on satellite images, 6 classes) and pendigits (handwritten digit recognition, 10 classes). We used the same training/test split as speci\ufb01ed in the UCI repository. For each dataset, there is only one type of feature, so it will be computed at the \ufb01rst step no matter which classi\ufb01er is selected. After that, all classi\ufb01ers have the same complexity, so the results will be independent of the \u03c4 parameter in the de\ufb01nition of the value of classi\ufb01er (1). For the baselines, we have one-vs-one with max win, one-vs-all, DAGSVM [27] and a tree-based method [13]. These methods vary both in terms of what set of classi\ufb01ers they use and how those classi\ufb01ers are evaluated and combined. To evaluate the effectiveness of our classi\ufb01er selection scheme, we introduce another baseline that selects classi\ufb01ers randomly, for which we repeated the experiments 10 times and report the average and one standard deviation. We compare different methods in terms of both the classi\ufb01cation accuracy and the number of evaluated classi\ufb01ers. For our algorithm and the random selection baseline, we show the accuracy over iterations as well. 
Since in our framework the number of iterations (classi\ufb01ers) needed varies over instances due to early stopping, the maximum number of iterations shown is de\ufb01ned as the mean plus one standard deviation of the number of classi\ufb01er evaluations over all test instances. In addition, for the tree-based method, the number of evaluated classi\ufb01ers is the mean over all test instances.\n\nFigure 1 shows a set of results. As can be seen, our method can achieve comparable or higher accuracy than traditional methods. In fact, we achieved the best accuracy on three datasets, and the gains over the runner-up methods are 0.2%, 5.2% and 8.2% for the satimage, vowel and letter datasets respectively. We think the statistical gain might come from two facts: (i) we are performing instance-speci\ufb01c \u201cfeature selection\u201d to only consider the most informative classi\ufb01ers; (ii) another layer of probabilistic model is used to combine the classi\ufb01ers, instead of the uniform voting of classi\ufb01ers used by many traditional methods. In terms of the number of evaluated classi\ufb01ers, our active scheme is very effective: the mean numbers of classi\ufb01er evaluations for the 6-class, 10-class, 11-class and 26-class problems are 4.50, 3.22, 6.15 and 7.72 respectively. Although the tree-based method also evaluates only a few classi\ufb01ers, it sometimes suffers a signi\ufb01cant drop in accuracy, as on the vowel and letter datasets. Furthermore, compared to the random selection scheme, our method effectively selects more informative classi\ufb01ers, resulting in faster convergence to a given classi\ufb01cation accuracy.\n\nThe performance gain of our method is not free. To maintain a belief over the class variable Y and to dynamically select classi\ufb01ers with high value, we have introduced additional computational costs, i.e., estimating conditional distributions and computing the value of classi\ufb01ers. 
For example, on satimage this additional cost is around 10ms, whereas evaluating a linear classifier takes less than 1ms due to the very low feature dimension, so the actual running time of the active scheme is higher than that of one-vs-one. Therefore, our method will have a real computational advantage only if the cost of evaluating the classifiers is higher than the cost of our probabilistic inference. We demonstrate such benefit of our method in the context of multiple high dimensional features below.
Scene Recognition. We test our active classification on the benchmark scene recognition dataset Scene15 [20]. It has 15 scene classes and 4485 images in total. Following the protocol used in [20, 21], 100 images per class are randomly sampled for training and the remaining 2985 for test.

model                       accuracy   feature cost       classifier   scheduling   total
                                       (# of features)    cost         cost         running time
all features                86.40%     52.645s (184)      0.426s       0            53.071s
best feature OB [21]        83.38%     6.20s              0.024s       0            6.224s
fastest feature GIST [26]   72.70%     0.399s             0.0002s      0            0.3992s
ours, τ = 25                86.26%     1.718s (5.62)      0.010s       0.141s       1.869s (28.4x)
ours, τ = 100               86.77%     6.573s (4.71)      0.014s       0.116s       6.703s (7.9x)
ours, τ = 600               88.11%     19.821s (4.46)     0.031s       0.094s       19.946s (2.7x)

Table 1: Detailed performance comparisons on the Scene15 dataset with various feature types. For our methods, we show the speedup factors with respect to using all the features in a static way.
We consider various types of features, since as shown in [33], the classification accuracy can be significantly improved by combining multiple features, but at a high computational cost.
Our feature set includes 7 features from [33], namely GIST, spatial HOG, dense SIFT, Local Binary Pattern, self-similarity, texton histogram and geometry specific histograms (please refer to [33] for details), plus another recently proposed high-level image feature, Object Bank [21]. The basic idea of Object Bank is to use the responses of various object detectors as the feature. The current release of the code from the authors selected 177 object detectors, each of which outputs a feature vector φi with dimension 252. These individual vectors are concatenated together to form the final feature vector Φ = [φ1; φ2; . . . ; φ177] ∈ R44,604. Instead of treating Φ as an undecomposable single feature vector, we can think of it as a collection of 177 different features {φi}177i=1. Therefore, our feature pool consists of 184 features in total. Their computational costs vary from 0.035 to 13.796 seconds, with accuracies ranging from 54% to 83%. One traditional way to combine these features is through multiple kernel learning. Specifically, we take the average of the individual kernels constructed from the individual features, and train a one-vs-all SVM using the joint average kernel. Surprisingly, this simple average kernel performs comparably to learning the weights for combining the kernels [12].

For our active classification, we do not compute all features at the beginning of the evaluation process, but only compute a component φi when a classifier h based on it is selected. We cache all evaluated φi's, so different classifiers sharing the same φi will not induce repeated computation of the common φi.
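The lazy computation and caching of feature components described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `extractors` and the dict-based classifier representation are hypothetical placeholders for the actual feature pipelines and linear classifiers.

```python
class FeatureCache:
    """Lazily computes and caches feature components, so classifiers that
    share a component (e.g. one of Object Bank's 177 detector responses)
    never trigger a repeated computation. `extractors` maps a component
    name to a function computing that component for an image."""
    def __init__(self, extractors):
        self.extractors = extractors
        self.cache = {}

    def get(self, name, image):
        if name not in self.cache:       # compute on first request only
            self.cache[name] = self.extractors[name](image)
        return self.cache[name]

def evaluate_classifier(h, cache, image):
    """A classifier declares which component it needs; the cache supplies
    it, computing it at most once per test instance."""
    phi = cache.get(h['feature'], image)
    return sum(w * x for w, x in zip(h['weights'], phi)) + h['bias']
```

A fresh cache is created per test instance, so the feature cost reported for the active scheme only covers the components its selected classifiers actually request.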
We decompose the computational costs per instance into three parts: (1) the feature cost, which is the time spent on computing the features; (2) the classifier cost, which is the time spent on evaluating the function values of the classifiers; (3) the scheduling cost, which is the time spent on selecting the classifiers using our method. To demonstrate the trade-off between accuracy and computational cost in the definition of value of classifier, we run multiple experiments with various τ's.

The results are shown in Table 1. We also report comparisons to the best individual features in terms of either accuracy or speed (the reported accuracy is the best of one-vs-one and one-vs-all). As can be seen, combining all features using the traditional method indeed improves the accuracy significantly over the individual features, but at an expensive computational cost. However, using active classification, we can achieve similar accuracy to the all-features baseline with a 28.4x speedup (τ = 25). Note that at this configuration, our method is faster than the state-of-the-art individual feature [21], and is also 2.8% better in accuracy. Furthermore, if we put more emphasis on accuracy, we can reach the best accuracy of 88.11% when τ = 600.
To further test the effectiveness of our active selection scheme, we compare with another baseline that sequentially adds one feature at a time from a filtered pool of features. Specifically, we first rank the individual features based on their classification accuracy, and only consider the top 80 features (using 80 features achieves essentially the same accuracy as using 184 features).
Given this selected pool, we arrange the features in order of increasing computational complexity, and then train a classifier based on the top N features for all values of N from 1 to 80. As shown in Figure 2, our active scheme is one order of magnitude faster than this baseline at the same level of accuracy.

Figure 2: Classification accuracy versus running time for the baseline (sequentially adding features), active classification, and various individual features (GIST, LBP, spatial HOG, Object Bank, dense SIFT).

6 Conclusion and Future Work

In this paper, we presented an active classification process based on the value of classifier. We applied this active scheme in the context of multiclass classification, and achieved comparable or even higher classification accuracy with significant computational savings compared to traditional static methods. One interesting future direction is to estimate the value of features instead of individual classifiers. This is particularly important when computing a feature is much more expensive than evaluating the function values of the classifiers, which is often the case. Once a feature has been computed, the set of classifiers built on it is cheap to evaluate. Therefore, predicting the value of a feature (equivalent to the joint value of the multiple classifiers sharing that feature) can potentially lead to a more computationally efficient classification process.
Acknowledgment. This work was supported by the NSF under grant No. RI-0917151, the Office of Naval Research MURI grant N00014-10-10933, and the Boeing Company. We thank Pawan Kumar and the reviewers for helpful feedback.

References
[1] E. L. Allwein, R. E. Schapire, and Y. Singer.
Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113–141, 2001.
[2] A. Angelova, L. Matthies, D. Helmick, and P. Perona. Fast terrain classification using variable-length representation for autonomous navigation. In CVPR, 2007.
[3] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multiclass tasks. In NIPS, 2010.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] X. Chai, L. Deng, and Q. Yang. Test-cost sensitive naive Bayes classification. In ICDM, 2004.
[6] W. S. Cleveland and S. J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.
[7] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. CoRR, cs.AI/9603104, 1996.
[8] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages V: 71–84, 2010.
[9] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. J. of A. I. Res., 2:263–286, 1995.
[10] Y. Freund. Boosting a weak learning algorithm by majority. In Computational Learning Theory, 1995.
[11] J. H. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996.
[12] P. V. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[13] G. Griffin and P. Perona. Learning and using taxonomies for fast visual categorization. In CVPR, 2008.
[14] V. Guruswami and A. Sahai. Multiclass learning, boosting, and error-correcting codes. In Proc. of the Twelfth Annual Conf. on Computational Learning Theory, 1999.
[15] T. Hastie, R. Tibshirani, and J. H. Friedman.
The elements of statistical learning: data mining, inference, and prediction. Springer, 2009.
[16] R. A. Howard. Information value theory. IEEE Trans. on Systems Science and Cybernetics, 1966.
[17] R. A. Howard. Decision analysis: Practice and promise. Management Science, 1988.
[18] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[19] A. Krause and C. Guestrin. Optimal value of information in graphical models. Journal of Artificial Intelligence Research (JAIR), 35:557–591, 2009.
[20] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[21] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.
[22] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
[23] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[24] V. S. Mookerjee and M. V. Mannino. Sequential decision models for expert system optimization. IEEE Trans. on Knowledge & Data Engineering, (5):675.
[25] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
[26] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[27] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In NIPS, 2000.
[28] M. J. Saberian and N. Vasconcelos. Boosting classifier cascades. In NIPS, 2010.
[29] R. E. Schapire. Using output codes to boost multiclass learning problems. In ICML, 1997.
[30] A. G. Schwing, C. Zach, Y. Zheng, and M. Pollefeys. Adaptive random forest - how many "experts" to ask before making a decision? In CVPR, 2011.
[31] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[32] P. Viola and M. Jones. Robust real-time object detection. IJCV, 2002.
[33] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[34] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML, pages 412–420, 1997.