{"title": "Avoiding False Positive in Multi-Instance Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 811, "page_last": 819, "abstract": "In multi-instance learning, there are two kinds of prediction failure, i.e., false negative and false positive. Current research mainly focus on avoding the former. We attempt to utilize the geometric distribution of instances inside positive bags to avoid both the former and the latter. Based on kernel principal component analysis, we define a projection constraint for each positive bag to classify its constituent instances far away from the separating hyperplane while place positive instances and negative instances at opposite sides. We apply the Constrained Concave-Convex Procedure to solve the resulted problem. Empirical results demonstrate that our approach offers improved generalization performance.", "full_text": "Avoiding False Positive in Multi-Instance Learning\n\nYanjun Han, Qing Tao, Jue Wang\n\nInstitute of Automation, Chinese Academy of Sciences\n\nBeijing, 100190, China\n\nyanjun.han, qing.tao, jue.wang@ia.ac.cn\n\nAbstract\n\nIn multi-instance learning, there are two kinds of prediction failure, i.e., false\nnegative and false positive. Current research mainly focus on avoiding the for-\nmer. We attempt to utilize the geometric distribution of instances inside positive\nbags to avoid both the former and the latter. Based on kernel principal com-\nponent analysis, we de\ufb01ne a projection constraint for each positive bag to clas-\nsify its constituent instances far away from the separating hyperplane while place\npositive instances and negative instances at opposite sides. We apply the Con-\nstrained Concave-Convex Procedure to solve the resulted problem. Empirical re-\nsults demonstrate that our approach offers improved generalization performance.\n\n1 Introduction\n\nMulti-instance Learning (MIL) was \ufb01rst proposed by Dietterich et.al. 
in [1] to predict the binding\nability of a drug from its biochemical structure. A certain drug molecule corresponds to a set of\nconformations which cannot be differentiated via chemical experiments. A drug is labeled positive\nif any of its constituent conformations has binding ability greater than the threshold, and negative\notherwise. Therefore, each sample (a drug) is a bag of instances (its constituent conformations). In\nmulti-instance learning the label information for positive samples is incomplete in that the instances\nin a certain positive bag are all labeled positive. Generally, methods for multi-instance learning are\nmodi\ufb01ed versions of approaches for supervised learning by shifting the focus from discrimination\non instances to discrimination on bags.\nThe earliest explorations were the APR algorithms proposed in [1]. Since then, a number of\napproaches have emerged. Examples include Diverse Density [2], Citation k\u2212NN [3], MI-SVMs [4], MI-\nkernels [5], reg-SVM [6], MissSVM [7], sbMIL, stMIL [8], PPMM [9], MIGraphs [10], etc. Many\nreal-world applications can be regarded as multi-instance learning problems. Examples include\nimage classi\ufb01cation [11], document categorization [12], computer aided diagnosis [13], etc.\nAs far as positive bags are concerned, current research usually treats them as labyrinths in which\nwitnesses (responsible positive instances) are encaged, and considers nonwitnesses (other instances)\ntherein to be useless or even distracting. The information carried by nonwitnesses is not well utilized.\nIn fact, nonwitnesses are indispensable for characterizing the overall instance distribution, and\nthus help to improve the learner. Several researchers realized the importance of nonwitnesses and\nattempted to utilize them. In MI-kernels [5] and reg-SVM [6], nonwitnesses together with witnesses\nare squeezed into the kernel matrix. 
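The bag-level labeling rule of multi-instance learning described above (a bag is positive iff at least one of its instances is positive) can be sketched as follows; the function name is illustrative, not from any of the cited methods:

```python
# Sketch of the multi-instance labeling rule: a bag is positive
# iff at least one of its instances is positive (name illustrative).
def bag_label(instance_labels):
    """instance_labels: list of +1/-1 instance labels; returns the bag label."""
    return +1 if any(y == +1 for y in instance_labels) else -1

# A drug with one active conformation is a positive bag:
assert bag_label([-1, -1, +1]) == +1
# A bag with only negative instances is a negative bag:
assert bag_label([-1, -1, -1]) == -1
```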
In mi-SVM [4], the labels of all nonwitnesses are treated as\nunknown integer variables to be optimized. mi-SVM tends to misclassify negative instances in\npositive bags since the resulting margin will be larger. We will elaborate on this \ufb02aw in section\n3.1. In MissSVM [7] and stMIL [8], multi-instance learning is addressed from the view of semi-\nsupervised learning, and nonwitnesses are treated as unlabeled data whose labels should be assigned\nto maximize the margin. sbMIL [8] attempts to estimate the ratio of positive instances inside positive\nbags and utilizes this information in the subsequent classi\ufb01cation. MissSVM, sbMIL and stMIL\nsuffer from the same \ufb02aw as mi-SVM.\n\nFigure 1: Illustration of the False Positive Phenomenon: The top image is a positive training sample,\nand the bottom image is a negative testing sample. The symbols \u2295 and \u2296 respectively denote positive\nand negative instances. Enveloped points are instances in a positive bag. The point not enveloped\nis a negative bag of just one instance. Separating plane Fi corresponds to f (x) = i, and Gi corre-\nsponds to g(x) = i. The learners f and g are obtained with and without the projection constraint,\nrespectively. Instances are labeled according to f. For details, please refer to the passage below.\n\nThe neglect of nonwitnesses in positive bags may lead to false positive and cause a model to misclas-\nsify unseen negative samples. For example, in natural scene classi\ufb01cation, each image is segmented\ninto a bag of instances beforehand, and each instance is a patch (ROI, Region Of Interest) charac-\nterized by one feature vector describing its color. The task is to predict whether an image contains\na waterfall or not (Figure 1). 
A positive image contains some positive instances corresponding to\nthe waterfall and some negative instances from other categories such as sky, stone, grass, etc., while\na negative bag exclusively contains negative instances from other categories. Naturally, some neg-\native instances (patches) only exist in positive bags. For instance, the end of a waterfall is often\nsurrounded by mist. The aforementioned approaches tend to misclassify negative instances in posi-\ntive bags. Therefore, the patch corresponding to mist is misclassi\ufb01ed as positive. Given an unseen\nimage with cirrus cloud and without a waterfall, the obtained learner will misclassify this image as\npositive because cirrus cloud and mist are similar to each other.\nTo avoid both false negative and false positive, we attempt to classify instances inside positive bags\nfar from the separating hyperplane and place positive and negative instances at opposite sides. We\nachieve this by introducing projection constraints based on kernel principal component analysis into\nMI-SVM [4]. Each constraint is de\ufb01ned on a positive bag to encourage large variance of its con-\nstituent instances along the normal direction of the separating hyperplane. We apply the Constrained\nConcave-Convex Procedure (CCCP) to solve the resulting optimization problems.\nThe remainder of the paper is organized as follows: Section 2 introduces the notation convention and\nthe CCCP. In Section 3 we introduce the projection constraint and the corresponding formulation\nfor multi-instance learning. In Section 4, the algorithm is evaluated on real world data sets. Finally,\nconclusions are drawn in Section 5.\n\n2 Preliminaries\n\n2.1 Notation Convention\nThe origin of multi-instance learning [1] has been presented in section 1. Let X \u2286 Rp be the\nspace containing instances and D = {(Bi, yi)}, i = 1,\u00b7\u00b7\u00b7 , m, be the training data, where Bi is the ith bag of\ninstances {xi1,\u00b7\u00b7\u00b7 , xini} and yi \u2208 Y is the label for Bi. 
Y is {+1,\u22121} for classi\ufb01cation and R\nfor regression. In addition, denote the index set for instances xij of Bi by Ii. The task is to train\na learner to predict the label of an unseen bag. Compared with traditional supervised learning, the\nlearner is a mapping from 2^X to Y instead of from X to Y. Denote the index sets for positive and\nnegative bags by I+ and I\u2212 respectively. Without loss of generality, assume that the instances are\nordered in the sequence {x11,\u00b7\u00b7\u00b7 , x1n1 ,\u00b7\u00b7\u00b7 , xm1,\u00b7\u00b7\u00b7 , xmnm}. We index instances by the function\nI(xij) = \u2211_{k=1}^{i\u22121} nk + j, and I(Bi) returns the vector (\u2211_{k=1}^{i\u22121} nk + 1,\u00b7\u00b7\u00b7 , \u2211_{k=1}^{i\u22121} nk + ni).\n\n2.2 Constrained Concave-Convex Procedure\n\nNon-convex optimizations are undesirable because few algorithms effectively converge even to a\nlocal optimum. However, if both the objective function and the constraints take the form of a difference be-\ntween two convex functions, then a non-convex problem can be solved ef\ufb01ciently by the constrained\nconcave-convex procedure [14]. The fundamental idea is to eliminate the non-convexity by changing\nthe non-convex parts to their \ufb01rst-order Taylor expansions. The original problem is as follows:\n\nmin_x f0(x) \u2212 g0(x)\ns.t. fi(x) \u2212 gi(x) \u2264 ci, i = 1,\u00b7\u00b7\u00b7 , m    (1)\n\nwhere fi, gi (i = 0,\u00b7\u00b7\u00b7 , m) are real-valued, convex and differentiable functions on Rn. Starting\nfrom a random x(0), (1) is approximated by a sequence of successive convex optimization problems.\nAt the (t + 1)th iteration, the non-convex parts in the objective and constraints are substituted by their\n\ufb01rst-order Taylor expansions, and the resulting optimization problem is as follows:\n\nmin_x f0(x) \u2212 [g0(x(t)) + \u2207g0(x(t))T (x \u2212 x(t))]\ns.t. fi(x) \u2212 [gi(x(t)) + \u2207gi(x(t))T (x \u2212 x(t))] \u2264 ci    (2)\n\nwhere x(t) is the optimal solution to (2) at the tth iteration. The above procedure is repeated until\nconvergence. In [14] it is proved that the CCCP converges to a local optimum of (1).\n\n3 Multi-Instance Classi\ufb01cation\n\n3.1 Support Vector Machine Formulation\n\nOur work is based on the support vector machine (SVM) formulations for multi-instance learning,\nto be exact, the MI-SVM [4] as follows:\n\nmin_{w,b,\u03be} (1/2)\u2225w\u2225^2 + C [\u2211_{i\u2208I+} \u03bei + \u2211_{j\u2208Ii, i\u2208I\u2212} \u03beij]\ns.t. max_{j\u2208Ii} (wT xij + b) \u2265 1 \u2212 \u03bei, \u03bei \u2265 0, i \u2208 I+\n\u2212wT xij \u2212 b \u2265 1 \u2212 \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I\u2212    (3)\n\nCompared with the conventional SVM, in MI-SVM the notion of slack variables for positive samples\nis extended from individual instances to bags while that for negative samples remains unchanged.\nAs shown by the \ufb01rst set of max constraints, only the \u201cmost positive\u201d instance in a positive bag, or\nthe witness, could affect the margin. Other instances, or nonwitnesses, become irrelevant for\ndetermining the position of the separating plane once the witness is speci\ufb01ed.\nThe max constraint at \ufb01rst sight seems to well embody the characteristic of multi-instance learning.\nIndeed, it helps to avoid the false negative, i.e., the misclassi\ufb01cation of positive samples. However,\nit may incur false positive due to the following two reasons. Firstly, the max constraint aims at\ndiscovering the witness, and tends to skip nonwitnesses. Thus each positive bag is approximately\noversimpli\ufb01ed to a single pattern, i.e., the witness. Most information in positive bags is wasted.\nSecondly, due to the characteristic of the max function and the greediness of optimization methods,\nthe predictions of nonwitnesses are often adjusted above zero in the learning process. 
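As a toy illustration of the CCCP iteration in (2) (a sketch for intuition only, not the solver used in this paper), consider minimizing the difference of convex functions f0(x) = x^2 and g0(x) = |x|; linearizing g0 at x(t) leaves a convex subproblem with a closed-form minimizer:

```python
# Toy CCCP on the difference of convex functions f0(x) = x**2 and
# g0(x) = |x|.  At each iteration the concave part -g0 is replaced by
# its first-order Taylor expansion at x_t, giving the convex subproblem
#     min_x  x**2 - sign(x_t) * x,
# whose closed-form minimizer is sign(x_t) / 2.
def cccp(x0, iters=20):
    x = x0
    for _ in range(iters):
        s = 1.0 if x >= 0 else -1.0   # subgradient of |x| at x_t
        x = s / 2.0                    # solve the convex subproblem exactly
    return x

# Starting from x0 = 3.0 the iteration converges to the local optimum 0.5
# of x**2 - |x| (the other local optimum, -0.5, is reached from x0 < 0):
print(cccp(3.0))   # 0.5
```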
Besides, there\nis no mechanism to draw the predictions of nonwitnesses below zero. Nevertheless, many nonwit-\nnesses in positive bags are in fact negative instances. For example, in natural scene classi\ufb01cation,\nmany image patches in a positive bag are from the irrelevant background; in document categoriza-\ntion, many posts in a positive bag are not from the target category. Hence, many nonwitnesses are\nmislabeled as positive, and we obtain a falsely large margin.\nAs shown in Figure 1, MI-SVM classi\ufb01es half of the instances in the training sample as positive, and some\nnegative instances are mislabeled. This false positive will impair the generalization performance.\n\n3.2 Projection Constraint\n\nThe above problem is not unique to MI-SVM. Any approach that does not properly utilize nonwit-\nnesses has the same problem. In our preliminary work before this paper, we tried three solutions.\nFirstly, we treat the labels of all nonwitnesses as unknown integer variables to be optimized. In the\nSVM framework, this is exactly the mi-SVM [4] as follows:\n\nmin_{yij} min_{w,b,\u03be} (1/2)\u2225w\u2225^2 + C \u2211_{j\u2208Ii, i\u2208I+\u222aI\u2212} \u03beij\ns.t. yij(wT xij + b) \u2265 1 \u2212 \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I+\n\u2211_{j\u2208Ii} (yij + 1)/2 \u2265 1, i \u2208 I+\n\u2212wT xij \u2212 b \u2265 1 \u2212 \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I\u2212    (4)\n\nIt seems that assigning labels over all nonwitnesses should lead to a reasonable model. Nevertheless,\nnonwitnesses are usually labeled positive since the consequent margin will be larger. Thus, many\nnonwitnesses are misclassi\ufb01ed. As far as the example in Figure 1 is concerned, the obtained\nlearner is g(x) instead of f (x). MissSVM [7] takes an unsupervised approach. 
For every instance\nin positive bags, two slack variables are introduced, measuring the distances from the instance to\nthe positive boundary f (x) = +1 and the negative boundary f (x) = \u22121 respectively, and the label\nof the instance depends on the smaller slack variable. stMIL [8] takes a similar approach. Like mi-\nSVM, MissSVM and stMIL also suffer from misclassi\ufb01cation of nonwitnesses. sbMIL [8] tackles\nmulti-instance learning in two steps. The \ufb01rst step is similar to MI-SVM, and the second step is a\ntraditional SVM. Still, there is no mechanism in sbMIL to avoid false positive.\nIn the second solution, we simultaneously seek the \u201cmost positive\u201d instance and the \u201cmost neg-\native\u201d instance in a positive bag by adding the following constraints to (3):\n\n(\u22121) \u00b7 min_{j\u2208Ii} (wT xij + b) \u2265 \u22121 \u2212 \u03b6i, \u03b6i \u2265 0, i \u2208 I+    (5)\n\nAnd the term \u2211_{i\u2208I+} \u03bei in the objective of (3) is changed to \u2211_{i\u2208I+} (\u03bei + \u03b6i). Although misclas-\nsi\ufb01cation of nonwitnesses is alleviated since at least the \u201cmost negative\u201d nonwitness is classi\ufb01ed\ncorrectly, the information carried by most nonwitnesses is not fully utilized. As far as the example\nin Figure 1 is concerned, the obtained learner is still g(x) instead of f (x). Besides, this solution is\nnot appropriate for applications which involve positive bags containing only positive instances.\nThe third solution is the projection constraint proposed in this paper. In a maximum margin frame-\nwork we want to classify instances in a positive bag far away from the separating hyperplane while\nplacing positive instances and negative instances at opposite sides. From another point of view, in the\nfeature (kernel) space, we want to maximize the variance of instances in a positive bag along w, the\nnormal vector of the separating hyperplane. 
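For intuition, in the linear case this projected variance can be computed directly; a minimal numpy sketch with made-up data (the function name and the toy bag are illustrative):

```python
import numpy as np

# Sketch: variance of a positive bag's instances along the normal
# vector w of the separating hyperplane (linear case; data made up).
def projected_variance(X, w):
    """X: (n_i, p) instances of one bag; w: (p,) normal vector."""
    proj = X @ (w / np.linalg.norm(w))   # signed distances along w
    return proj.var()                     # large variance means instances
                                          # spread out along the normal

bag = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5]])
w = np.array([2.0, 0.0])
print(projected_variance(bag, w))   # 2/3: two instances sit far apart along w
```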
Therefore, the principal component analysis (PCA) [15]\nis just the technique that we need. To tackle complicated real world datasets, we directly develop our\napproach in the Reproducing Kernel Hilbert Space (RKHS). Let X be the space of instances, and H\nbe an RKHS of functions f : X \u2192 R with associated kernel function k(\u00b7,\u00b7). Note that f is both a\nfunction on X and a vector in H. With an abuse of notation, we will not differentiate them unless\nnecessary. Denote the RKHS norm of H by \u2225f\u2225H. Then MI-SVM can be rewritten as follows:\n\nmin_{f\u2208H,\u03be} (1/2)\u2225f\u2225^2 + C [\u2211_{i\u2208I+} \u03bei + \u2211_{j\u2208Ii, i\u2208I\u2212} \u03beij]\ns.t. max_{j\u2208Ii} f (xij) \u2265 1 \u2212 \u03bei, \u03bei \u2265 0, i \u2208 I+\n\u2212f (xij) \u2265 1 \u2212 \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I\u2212    (6)\n\nFigure 2: Illustration of the Effect of the Projection Constraint: Please note that the projection\nconstraint is effective for datasets with any geometric distribution once an appropriate kernel is\nselected. Enveloped points are instances in a positive bag. Points not enveloped are negative bags of\njust one instance. Separating plane Fi corresponds to f (x) = i, and Gi corresponds to g(x) = i.\nThe learners f and g are obtained with and without the projection constraint, respectively. Instances\nare labeled according to f. \u2295 and \u2296 denote positive instances and negative instances respectively.\n\nAccording to the representer theorem [16], each minimizer f \u2208 H of (6) has the following form:\n\nf = \u2211_{i\u2208I+\u222aI\u2212} \u2211_{j\u2208Ii} \u03b1ij\u03d5(xij)    (7)\n\nwhere all \u03b1ij \u2208 R, and \u03d5(\u00b7) induced by k(\u00b7,\u00b7) is the feature mapping from X to H.\nNext, we will propose our key contribution, i.e., the projection constraint. 
Given a positive bag\nBi, denote its instances by {xij}, j = 1,\u00b7\u00b7\u00b7 , ni, and denote the normal vector of the separating plane in the\nRKHS by f. According to the theory of PCA [15, 17], maximizing the variance of the mapped instances\n{\u03d5(xij)} along f is equivalent to minimizing the sum of the squared Euclidean distances from the centralized\ndata points to their projections on the normalized vector f /\u2225f\u2225, as follows:\n\nJi(f ) = \u2211_{j=1}^{ni} \u2225 cj f /\u2225f\u2225 \u2212 (\u03d5(xij) \u2212 \u03d5(mi)) \u2225^2    (8)\n\nwhere \u03d5(mi) = (1/ni) \u2211_{j=1}^{ni} \u03d5(xij) is the mean of {\u03d5(xij)}, and |cj| is the distance from \u03d5(mi) to the\nprojection point of \u03d5(xij). After simple algebra, we get:\n\ncj = (f T /\u2225f\u2225) (\u03d5(xij) \u2212 \u03d5(mi))    (9)\n\nSubstituting (9) and (7) into (8), we arrive at:\n\nJi(\u03b1) = oi \u2212 (\u03b1T L2i \u03b1)/(\u03b1T K\u03b1)    (10)\n\nwhere K is the n \u00d7 n kernel matrix de\ufb01ned on all the instances of both positive bags and negative\nbags, oi = trace(KI(Bi)) \u2212 (1/ni) 1T KI(Bi) 1, where KI(Bi) is the ni \u00d7 ni matrix formed by extracting\nthe I(Bi) columns (please refer to section 2.1) and the I(Bi) rows of the overall kernel matrix K,\nand L2i is the \u201ccentralized\u201d LiT Li as follows:\n\nL2i = LiT Li \u2212 1n LiT Li \u2212 LiT Li 1n + 1n LiT Li 1n    (11)\n\nwhere 1n is a matrix with all elements equal to 1/n, and Li is the n \u00d7 n matrix formed by keeping the\nI(Bi) rows of K and setting all the elements in other rows to 0:\n\nLi(p, q) = K(p, q) if p \u2208 I(Bi), \u2200q \u2208 {1,\u00b7\u00b7\u00b7 , n}; Li(p, q) = 0 otherwise\n\nGenerally, the optimal normal vector f varies for 
different positive bags. Hence it is meaningless to\nsolve (10) for its optimum. Instead, we average (10) by the bag size ni, and use a common threshold\n\u03bb to bound the averaged projection distance for different bags from above. We name the obtained\ninequality \u201cthe projection constraint\u201d, as follows:\n\n(1/ni) (oi \u2212 (\u03b1T L2i \u03b1)/(\u03b1T K\u03b1)) \u2264 \u03bb    (12)\n\nThis is equivalent to bounding the variance of instances in positive bags along f from below [15].\nSubstituting (7) into (6), and adding the projection constraint (12) for each positive bag to the re-\nsulting problem, we arrive at the following optimization problem:\n\nmin_{\u03b1,b,\u03be} (1/2)\u03b1T K\u03b1 + C [\u2211_{i\u2208I+} \u03bei + \u2211_{j\u2208Ii, i\u2208I\u2212} \u03beij]\ns.t. 1 \u2212 \u03bei \u2212 (max_{j\u2208Ii} kI(xij)T \u03b1 + b) \u2264 0, \u03bei \u2265 0, i \u2208 I+\nkI(xij)T \u03b1 + b \u2264 \u22121 + \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I\u2212\n\u03b1T (oi \u00b7 K \u2212 L2i)\u03b1 \u2212 \u03bbni \u00b7 \u03b1T K\u03b1 \u2264 0, i \u2208 I+    (13)\n\nwhere kI(xij) denotes the I(xij)th column of K.\n\n3.3 Optimization via the CCCP\n\nIn the problem (13), the objective function and the second set of constraints are convex. The \ufb01rst\nset of constraints are all in the form of a difference of two convex functions since the max function\nis convex. According to the de\ufb01nition of Ji(f ) in (8), Ji(\u03b1) in (10) is not less than 0 for any \u03b1.\nThus for any i \u2208 I+, oi \u00b7 K \u2212 L2i is positive semi-de\ufb01nite. Consequently, the third set of constraints\nare all in the form of a difference of two convex functions. Therefore, we can apply the Constrained\nConcave-Convex Procedure (CCCP) introduced in section 2.2 to solve the problem (13).\nSince the max function in the \ufb01rst set of constraints is nonsmooth, we have to change gradients to\nsubgradients to use the CCCP. 
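As a numerical illustration (not part of the paper's solver), the left-hand side of the projection constraint (12) can be evaluated for one positive bag as follows; the linear kernel, the bag indices and the coefficient vector are made up, and the centering matrix follows (11) with all entries equal to 1/n:

```python
import numpy as np

# Sketch: evaluate the left-hand side of the projection constraint (12)
# for one positive bag (illustrative data; linear kernel assumed).
def projection_lhs(K, idx, alpha):
    """K: (n, n) kernel matrix over all instances; idx: indices of the
    bag's instances; alpha: (n,) expansion coefficients."""
    n, ni = K.shape[0], len(idx)
    K_bag = K[np.ix_(idx, idx)]                       # K_I(Bi)
    o_i = np.trace(K_bag) - K_bag.sum() / ni          # o_i as in (10)
    L = np.zeros_like(K)
    L[idx, :] = K[idx, :]                             # keep the bag's rows of K
    ones = np.full((n, n), 1.0 / n)                   # matrix 1_n of (11)
    LtL = L.T @ L
    L2 = LtL - ones @ LtL - LtL @ ones + ones @ LtL @ ones  # centering, eq. (11)
    return (o_i - alpha @ L2 @ alpha / (alpha @ K @ alpha)) / ni

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = X @ X.T                                           # linear kernel
lhs = projection_lhs(K, idx=[0, 1, 2], alpha=rng.normal(size=5))
print(lhs)  # compare against the threshold lambda
```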
The subgradient is usually not unique, and we adopt the de\ufb01nition\nused in [6] for the subgradient of max_{j\u2208Ii} kI(xij)T \u03b1:\n\n\u2202(max_{j\u2208Ii} kI(xij)T \u03b1) = \u2211_{j\u2208Ii} \u03b2ij kI(xij)T    (14)\n\nwhere\n\n\u03b2ij = 0 if kI(xij)T \u03b1 \u0338= max_{j\u2208Ii} kI(xij)T \u03b1; \u03b2ij = 1/na otherwise    (15)\n\nwhere na is the number of xij that maximize kI(xij)T \u03b1. At the tth iteration, denote the current\nestimates for \u03b1 and \u03b2ij by \u03b1(t) and \u03b2ij(t) respectively. Then the \ufb01rst order Taylor expansion of\nmax_{j\u2208Ii} kI(xij)T \u03b1 is as follows:\n\nmax_{j\u2208Ii} kI(xij)T \u03b1(t) + \u2211_{j\u2208Ii} \u03b2ij(t) kI(xij)T (\u03b1 \u2212 \u03b1(t))    (16)\n\nAccording to (15), we have\n\n\u2211_{j\u2208Ii} \u03b2ij(t) kI(xij)T \u03b1(t) = max_{j\u2208Ii} (kI(xij)T \u03b1(t))    (17)\n\nUsing (17), (16) reduces to\n\n\u2211_{j\u2208Ii} \u03b2ij(t) kI(xij)T \u03b1    (18)\n\nReplacing max_{j\u2208Ii} kI(xij)T \u03b1 in the \ufb01rst set of constraints by (18) and \u03b1T K\u03b1 in the third set of con-\nstraints by their \ufb01rst order Taylor expansions, \ufb01nally we get:\n\nmin_{\u03b1,b,\u03be} (1/2)\u03b1T K\u03b1 + C [\u2211_{i\u2208I+} \u03bei + \u2211_{j\u2208Ii, i\u2208I\u2212} \u03beij]\ns.t. 1 \u2212 \u03bei \u2212 (\u2211_{j\u2208Ii} \u03b2ij(t) kI(xij)T \u03b1 + b) \u2264 0, \u03bei \u2265 0, i \u2208 I+\nkI(xij)T \u03b1 + b \u2264 \u22121 + \u03beij, \u03beij \u2265 0, j \u2208 Ii, i \u2208 I\u2212\n\u03b1T Si \u03b1 \u2212 2\u03bbni \u00b7 \u03b1(t)T K(\u03b1 \u2212 \u03b1(t)) \u2264 0, i \u2208 I+    (19)\n\nwhere Si = oi \u00b7 K \u2212 L2i. 
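The subgradient weights in (15) are straightforward to compute; a small numpy sketch with illustrative scores:

```python
import numpy as np

# Sketch of the subgradient weights in (15): beta_ij = 1/na for the
# instances attaining the bag maximum of k_I(xij)^T alpha, 0 otherwise.
def beta_weights(scores):
    """scores: (n_i,) values of k_I(xij)^T alpha for one positive bag."""
    scores = np.asarray(scores, dtype=float)
    is_max = scores == scores.max()    # which instances attain the maximum
    return is_max / is_max.sum()       # na = number of maximizers

# Two maximizers tied at 0.9 each receive weight 1/2, the rest weight 0:
print(beta_weights([0.2, 0.9, 0.9]))
```

Because the weights always sum to one, the linearization (18) agrees with the max at alpha(t), which is exactly the identity (17).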
The problem (19) is a quadratically constrained quadratic program (QCQP)\nwith a convex objective function and convex constraints, and thus can be readily solved via interior\npoint methods [18]. Following the CCCP, we iterate until (19) converges.\n\n4 Experiments\n\n4.1 Classi\ufb01cation: Benchmark\n\nThe benchmark data sets come from two areas. The Musk 1 and Musk 2 data sets [1] are two biochemical\ntasks which directly promoted the research of multi-instance learning. The aim is to predict activ-\nity of drugs from structural information. Each drug molecule is a bag of potential conformations\n(instances). The Musk 1 data set consists of 47 positive bags, 45 negative bags, and 476\ninstances in total. The Musk 2 data set consists of 39 positive bags, 63 negative bags, and 6598 in-\nstances in total. Each instance is represented by a 166 dimensional vector. Elephant, tiger and fox are three\ndata sets from image categorization. The aim is to differentiate images with elephant, tiger, and fox\n[4] from those without, respectively. A bag here is a group of ROIs (Regions Of Interest) drawn\nfrom a certain image. Each data set contains 100 positive bags and 100 negative bags, and each\nROI as an instance is a 230 dimensional vector. 
Related methods for comparison includes Diverse\n\nTable 1: Test Accuracy(%) On Benchmark: Rows and columns correspond to methods and datasets\nrespectively.\n\nmiGraph\n\nMIGraph\n\nFox\n65:7\n\nAlgorithm Musk 1 Musk 2 Elep\nTiger\n89:8\nPC-SVM 90.6\n83.8\n\u00b11:2 \u00b11:4 \u00b11.3\n\u00b12.7\n81.9\n85.1\n90.0\n\u00b12.8 \u00b11.7 \u00b11.5\n\u00b13.8\n86:0\n86.8\n88.9\n\u00b13.3\n\u00b10.7 \u00b12.8 \u00b11:0\n84.3\n88.0\n84.2\n\u00b11.6 \u00b11.9 \u00b11.0\n\u00b13.1\nMI-Kernel\nMI-SVM 77.9\n84.0\n81.4\nstMIL\n79.5\n81.6\n74.7\nsbMIL\n83.0\n88.6\n91:8\nN/A\nN/A\nDD\n88.0\nEM-DD\n84.8\n78.3\n72.1\n\n91:3\n\u00b13:2\n90.0\n\u00b12.7\n90.3\n\u00b12.6\n89.3\n\u00b11.5\n84.3\n68.4\n87.7\n84.0\n84.9\n\n59.4\n60.7\n69.8\nN/A\n56.1\n\n61.2\n\n61.6\n\n60.3\n\nDensity (DD,[2]), EM-DD [19], MI-SVM [4], MI-Kernel [5], stMIL [8], sbMIL [8], MIGraph and\nmiGraph [10]. When applied for multi-instance classi\ufb01cation, our approach involves three parame-\nters, namely, the bias/variance trade-off factor C, the kernel parameter (e.g.: \u03b3 in RBF kernel), and\nthe bound parameter \u03bb in the projection constraint. In the experiment, C, \u03b3, and \u03bb are selected from\n\n7\n\n\f{0.01,0.1,1,10,50,100}, {0.2,0.4,0.6,0.8,1.0} and {0.01,0.1,1,10,100} respectively. We employ the\nMOSEK toolbox 1 to solve the resulted QCQP problem (19). The other experiment uses the same\nparameter setting.\nThe ten-times 10-fold cross validation results (except Diverse Density) are shown in Table 1. The\nresults for other methods are replicated from their original papers. The results not available are\nmarked by N/A. The bolded \ufb01gure indicates that result is better than all other methods. Table 1\nshows that the performance of our approach (PC-SVM) is competitive. Recall that the difference\nbetween our approach and MI-SVM is just the projection constraint. 
Therefore, as discussed in\nsection 3.2, the results in Table 1 demonstrate that the strength of nonwitnesses is well utilized via\nthe projection constraint.\n\n4.2 Classi\ufb01cation: COREL Image Data Sets\n\nTable 2: Test Accuracy(%) On COREL: Rows and columns correspond to methods and datasets\nrespectively.\n\nAlgorithm 1000-Image 2000-Image\nPC-SVM 85.6 : [84.3, 86.9] 75.8 : [74.4, 77.2]\nreg-SVM 84.4 : [83.0, 85.8] N/A\nMIGraph 83.9 : [81.2, 85.7] 72.1 : [71.0, 73.2]\nmiGraph 82.4 : [80.2, 82.6] 70.5 : [68.7, 72.3]\nMI-Kernel 81.8 : [80.1, 83.6] 72.0 : [71.2, 72.8]\nMI-SVM 74.7 : [74.1, 75.3] 54.6 : [53.1, 56.1]\nDD-SVM 81.5 : [78.5, 84.5] 67.5 : [66.1, 68.9]\n\nCOREL is a collection of natural scene images which have been categorized according to the pres-\nence of certain objects. Each image is regarded as a bag, and the nine dimensional ROIs (Regions Of\nInterest) in it are regarded as its constituent instances. In the experiments, we use the 1000-Image data\nset and the 2000-Image data set, which contain ten and twenty categories, respectively. Following\nthe methodology in [10], on both of the two data sets the related methods are compared by their \ufb01ve\ntimes 2-fold cross validation results. The algorithms for comparison include Diverse Density (DD),\nMI-SVM, MIGraph, miGraph, MI-Kernel and reg-SVM. In the last four algorithms the one-against-all\nstrategy is employed to tackle this multi-class task. In our approach this strategy is also used. Table\n2 shows the overall accuracy as well as the 95% interval. As on the benchmark data sets, our approach is\ncompetitive with the latest methods. The results again suggest that fully utilizing the nonwitnesses\nis important for multi-instance classi\ufb01cation.\n\n5 Conclusion\n\nWe design a projection constraint to fully exploit nonwitnesses to avoid false positive. 
Since our\napproach is basically MI-SVM with projection constraints, the improved results on real world data\nsets validate the strength of nonwitnesses. We will introduce the universal projection constraint\ninto other existing approaches for multi-instance learning, and related learning tasks, such as multi-\ninstance regression, multi-label multi-instance learning, generalized multi-instance learning, etc.\n\nAcknowledgments\n\nWe gratefully acknowledge reviewers for their insightful remarks and editors for their assiduous\nwork. We also deeply appreciate Kuijun Ma\u2019s careful proof-reading. Finally, we are extremely\nthankful to Runing Liu for the fascinating illustrations. This work was partially supported by Na-\ntional Basic Research Program of China under Grant No.2004CB318103 and National Natural Sci-\nence Foundation of China under award No.60835002 and 60975040.\n\n1http://www.mosek.com/\n\nReferences\n\n[1] T. G. Dietterich, R. H. Lathrop, and T. Lozano-P\u00e9rez. Solving the multiple-instance problem with axis-\nparallel rectangles. Arti\ufb01cial Intelligence, 89(1-2):31\u201371, 1997.\n\n[2] O. Maron and T. Lozano-P\u00e9rez. A framework for multiple-instance learning. Advances in neural infor-\nmation processing systems, pages 570\u2013576, 1998.\n\n[3] J. Wang and J.D. Zucker. Solving the multiple-instance problem: A lazy learning approach. In Pro-\nceedings of the Seventeenth International Conference on Machine Learning, pages 1119\u20131126. Citeseer,\n2000.\n\n[4] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning.\nAdvances in neural information processing systems, pages 577\u2013584, 2003.\n\n[5] T. G\u00e4rtner, P.A. Flach, A. Kowalczyk, and A.J. Smola. Multi-instance kernels. In Proceedings of the\nNineteenth International Conference on Machine Learning, pages 179\u2013186. Citeseer, 2002.\n\n[6] P.M. Cheung and J.T. Kwok. 
A regularization framework for multiple-instance learning. In Proceedings\nof the 23rd international conference on Machine learning, page 200. ACM, 2006.\n\n[7] Z.H. Zhou and J.M. Xu. On the relation between multi-instance learning and semi-supervised learning.\nIn Proceedings of the 24th international conference on Machine learning, page 1174. ACM, 2007.\n\n[8] R.C. Bunescu and R.J. Mooney. Multiple instance learning for sparse positive bags. In Proceedings of\nthe 24th international conference on Machine learning, page 112. ACM, 2007.\n\n[9] H.Y. Wang, Q. Yang, and H. Zha. Adaptive p-posterior mixture-model kernels for multiple instance\nlearning. In Proceedings of the 25th international conference on Machine learning, pages 1136\u20131143.\nACM, 2008.\n\n[10] Z. H. Zhou, Y. Y. Sun, and Yu. F. Li. Multi-instance learning by treating instances as non-I.I.D. samples. In\nL\u00e9on Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine\nLearning, pages 1249\u20131256, Montreal, June 2009. Omnipress.\n\n[11] Y. Chen and J.Z. Wang. Image categorization by learning and reasoning with regions. The Journal of\nMachine Learning Research, 5:913\u2013939, 2004.\n\n[12] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. Advances in Neural Information\nProcessing Systems (NIPS), 20:1289\u20131296, 2008.\n\n[13] G. Fung, M. Dundar, B. Krishnapuram, and R.B. Rao. Multiple instance learning for computer aided\ndiagnosis. In NIPS2007, page 425. The MIT Press, 2007.\n\n[14] A.J. Smola, S.V.N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proceedings\nof the Tenth International Workshop on Arti\ufb01cial Intelligence and Statistics. Citeseer, 2005.\n\n[15] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classi\ufb01cation. John Wiley & Sons, 2001.\n\n[16] B. Sch\u00f6lkopf and A.J. Smola. Learning with kernels. Citeseer, 2002.\n\n[17] Q. Tao, D.J. Chu, and J. Wang. 
Recursive support vector machines for dimensionality reduction. IEEE\n\nTransactions on Neural Networks, 19(1):189\u2013193, 2008.\n\n[18] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge Univ Pr, 2004.\n[19] Q. Zhang and S.A. Goldman. Em-dd: An improved multiple-instance learning technique. Advances in\n\nneural information processing systems, 2:1073\u20131080, 2002.\n\n9\n\n\f", "award": [], "sourceid": 755, "authors": [{"given_name": "Yanjun", "family_name": "Han", "institution": null}, {"given_name": "Qing", "family_name": "Tao", "institution": null}, {"given_name": "Jue", "family_name": "Wang", "institution": null}]}