{"title": "EM-DD: An Improved Multiple-Instance Learning Technique", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": "EM-DD: An Improved Multiple-Instance \n\nLearning Technique \n\nQi Zhang \n\nSally A. Goldman \n\nDepartment of Computer Science \n\nDepartment of Computer Science \n\nWashington University \n\nSt. Louis, MO 63130-4899 \n\nqz@cs. wustl. edu \n\nWashington University \n\nSt. Louis, MO 63130-4899 \n\nsg@cs. wustl. edu \n\nAbstract \n\nWe present a new multiple-instance (MI) learning technique (EM(cid:173)\nDD) that combines EM with the diverse density (DD) algorithm. \nEM-DD is a general-purpose MI algorithm that can be applied with \nboolean or real-value labels and makes real-value predictions. On \nthe boolean Musk benchmarks, the EM-DD algorithm without any \ntuning significantly outperforms all previous algorithms. EM-DD \nis relatively insensitive to the number of relevant attributes in the \ndata set and scales up well to large bag sizes. Furthermore, EM(cid:173)\nDD provides a new framework for MI learning, in which the MI \nproblem is converted to a single-instance setting by using EM to \nestimate the instance responsible for the label of the bag. \n\nIntroduction \n\n1 \nThe multiple-instance (MI) learning model has received much attention. In this \nmodel, each training example is a set (or bag) of instances along with a single \nlabel equal to the maximum label among all instances in the bag. The individual \ninstances within the bag are not given labels. The goal is to learn to accurately \npredict the label of previously unseen bags. Standard supervised learning can be \nviewed as a special case of MI learning where each bag holds a single instance. 
The MI learning model was originally motivated by the drug activity prediction problem, where each instance is a possible conformation (or shape) of a molecule and each bag contains all likely low-energy conformations of the molecule. A molecule is active if it binds strongly to the target protein in at least one of its conformations and is inactive if no conformation binds to the protein. The problem is to predict the label (active or inactive) of molecules based on their conformations. \n\nThe MI learning model was first formalized by Dietterich et al. in their seminal paper [4], in which they developed MI algorithms for learning axis-parallel rectangles (APRs) and provided the two benchmark \"Musk\" data sets. Following this work, there has been a significant amount of research directed towards the development of MI algorithms using different learning models [2,5,6,9,12]. Maron and Ratan [7] applied the multiple-instance model to the task of recognizing a person from a series of images that are labeled positive if they contain the person and negative otherwise. The same technique was used to learn descriptions of natural scene images (such as a waterfall) and to retrieve similar images from a large image database using the learned concept [7]. More recently, Ruffo [11] has used this model for data mining applications. \n\nWhile the Musk data sets have boolean labels, algorithms that can handle real-value labels are often desirable in real-world applications. For example, the binding affinity between a molecule and a receptor is quantitative, and hence a real-value classification of binding strength is preferable to a binary one. Most prior research on MI learning is restricted to concept learning (i.e., boolean labels). Recently, MI learning with real-value labels has been performed using extensions of the diverse density (DD) and k-NN algorithms [1] and using MI regression [10]. 
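The bag-labeling rule of the MI model can be illustrated with a few lines of Python (a toy sketch; `bag_label` is our own illustrative name, not code from this work):

```python
def bag_label(instance_labels):
    """An MI bag's label is the maximum over its (hidden) instance labels."""
    return max(instance_labels)

# Boolean case: a bag is positive iff at least one instance is positive
# (i.e., the logical OR of the instance labels).
print(bag_label([0, 0, 1]))   # one binding conformation -> active (1)
print(bag_label([0, 0, 0]))   # no conformation binds -> inactive (0)

# Real-value case: the bag label is the largest instance label, e.g. the
# strongest binding affinity among a molecule's conformations.
print(bag_label([0.12, 0.87, 0.40]))  # -> 0.87
```

Standard supervised learning is recovered when every bag contains exactly one instance, so the max is over a single label.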
\n\nIn this paper, we present a general-purpose MI learning technique (EM-DD) that combines EM [3] with the extended DD [1] algorithm. The algorithm is applied to both boolean and real-value labeled data, and the results are compared with those of corresponding MI learning algorithms from previous work. In addition, the effects of the number of instances per bag and of the number of relevant features on the performance of EM-DD are evaluated using artificial data sets. A second contribution of this work is a new general framework for MI learning that converts the MI problem to a single-instance setting using EM. A very similar approach was also used by Ray and Page [10]. \n\n2 Background \n\nDietterich et al. [4] presented three algorithms for learning APRs in the MI model. Their best performing algorithm (iterated-discrim) starts with a point in the feature space and \"grows\" a box with the goal of finding the smallest box that covers at least one instance from each positive bag and no instances from any negative bag. The resulting box is then expanded (via a statistical technique) to get better results. However, the test data from Musk1 was used to tune the parameters of the algorithm; these parameters are then used for both Musk1 and Musk2. \n\nAuer [2] presented an algorithm, MULTINST, that learns using simple statistics to find the halfspaces defining the boundaries of the target APR, and hence avoids some potentially hard computational problems required by the heuristics used in the iterated-discrim algorithm. More recently, Wang and Zucker [12] proposed a lazy learning approach by applying two variants of the k-nearest neighbor algorithm (k-NN), which they refer to as citation-kNN and Bayesian k-NN. Ramon and De Raedt [9] developed an MI neural network algorithm. \n\nOur work builds heavily upon the Diverse Density (DD) algorithm of Maron and Lozano-Perez [5,6]. 
When describing the shape of a molecule by n features, one can view each conformation of the molecule as a point in an n-dimensional feature space. The diverse density at a point p in the feature space is a probabilistic measure of both how many different positive bags have an instance near p and how far the negative instances are from p. Intuitively, the diverse density of a hypothesis h is just the likelihood (with respect to the data) that h is the target. A high diverse density indicates a good candidate for a \"true\" concept. \n\nWe now formally define the general MI problem (with boolean or real-value labels) and the DD likelihood measurement originally defined in [6] and extended to real-value labels in [1]. Let D be the labeled data, which consists of a set of m bags B = {B_1, ..., B_m} and labels L = {ℓ_1, ..., ℓ_m}, i.e., D = {<B_1, ℓ_1>, ..., <B_m, ℓ_m>}. Let bag B_i = {B_i1, ..., B_ij, ..., B_in}, where B_ij denotes the j-th instance in bag i. Assume the labels of the instances in B_i are ℓ_i1, ..., ℓ_ij, ..., ℓ_in. For boolean labels, ℓ_i = ℓ_i1 ∨ ℓ_i2 ∨ ... ∨ ℓ_in, and for real-value labels, ℓ_i = max{ℓ_i1, ℓ_i2, ..., ℓ_in}. The diverse density of a hypothesized target point h is defined as DD(h) = Pr(h | B, L) = Pr(B, L | h) Pr(h) / Pr(B, L) = Pr(D | h) Pr(h) / Pr(D). Assuming a uniform prior on the hypothesis space and independence of the <B_i, ℓ_i> pairs given h, using Bayes' rule the maximum likelihood hypothesis h_DD is defined as: \n\nh_DD = argmax_{h ∈ H} Pr(D | h) = argmax_{h ∈ H} ∏_{i=1}^m Pr(B_i, ℓ_i | h) = argmin_{h ∈ H} Σ_{i=1}^m (-log Pr(ℓ_i | h, B_i)) \n\nwhere Label(B_i | h) is the label that would be given to B_i if h were the correct hypothesis. As in the extended DD algorithm [1], Pr(ℓ_i | h, B_i) is estimated as 1 - |ℓ_i - Label(B_i | h)|. 
When the labels are boolean (0 or 1), this formulation is exactly the most-likely-cause estimator used in the original DD algorithm [5]. For most applications, the influence each feature has on the label varies greatly. This variation is modeled in the DD algorithm by associating with each attribute an (unknown) scale factor. Hence the target concept really consists of two values per dimension: the ideal attribute value and the scale value. Using the assumption that binding strength drops exponentially as the distance between the conformation and the ideal shape increases, the following generative model was introduced by Maron and Lozano-Perez [6] for estimating the label of bag B_i for hypothesis h = {h_1, ..., h_n, s_1, ..., s_n}: \n\nLabel(B_i | h) = max_j { exp[ -Σ_{d=1}^n (s_d (B_ijd - h_d))^2 ] }    (1) \n\nwhere s_d is a scale factor indicating the importance of feature d, h_d is the feature value for dimension d, and B_ijd is the feature value of instance B_ij on dimension d. Let NLDD(h, D) = Σ_{i=1}^m (-log Pr(ℓ_i | h, B_i)), where NLDD denotes the negative logarithm of DD. The DD algorithm [6] uses a two-step gradient descent search to find a value of h that minimizes NLDD (and hence maximizes DD). \n\nRay and Page [10] developed a multiple-instance regression algorithm that can also handle real-value labeled data. They assumed an underlying linear model for the hypothesis and applied the algorithm to some artificial data. Similar to the current work, they also used EM to select one instance from each bag so that multiple regression can be applied to MI learning. \n\n3 Our algorithm: EM-DD \n\nWe now describe EM-DD and compare it with the original DD algorithm. One reason why MI learning is so difficult is the ambiguity caused by not knowing which instance is the important one. 
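Before describing EM-DD, the DD machinery of Section 2 can be summarized in code. The sketch below transcribes Equation (1) and the NLDD objective under the linear estimate Pr(ℓ_i | h, B_i) = 1 - |ℓ_i - Label(B_i | h)|; the function and variable names are ours, not the authors' implementation:

```python
import math

def label_of_bag(bag, h, s):
    """Equation (1): Label(B_i | h) = max_j exp(-sum_d (s_d * (B_ijd - h_d))^2).

    bag: list of instances, each a list of n feature values.
    h:   hypothesized ideal point (n feature values).
    s:   per-dimension scale factors (n values).
    """
    return max(
        math.exp(-sum((sd * (x - hd)) ** 2 for x, hd, sd in zip(inst, h, s)))
        for inst in bag
    )

def nldd(bags, labels, h, s, eps=1e-12):
    """NLDD(h, D) = sum_i -log Pr(l_i | h, B_i), with the linear estimate
    Pr(l_i | h, B_i) = 1 - |l_i - Label(B_i | h)| (clamped away from zero)."""
    return sum(
        -math.log(max(1.0 - abs(l - label_of_bag(bag, h, s)), eps))
        for bag, l in zip(bags, labels)
    )

# A bag containing an instance exactly at h gets label 1 and contributes 0 to NLDD.
bag = [[1.0, 2.0], [4.0, 4.0]]
print(label_of_bag(bag, h=[1.0, 2.0], s=[1.0, 1.0]))   # -> 1.0
print(nldd([bag], [1.0], h=[1.0, 2.0], s=[1.0, 1.0]))  # -> 0.0
```

Minimizing `nldd` over h (and the scales s) is exactly the search the DD gradient descent performs; note that `label_of_bag` contains the max over instances that the DD search must differentiate through.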
The basic idea behind EM-DD is to view the knowledge of which instance corresponds to the label of the bag as a missing attribute, which can be estimated using the EM approach in a way similar to how EM is used in MI regression [10]. EM-DD starts with some initial guess of a target point h, obtained in the standard way by trying points from positive bags, and then repeatedly performs the following two steps that combine EM with DD to search for the maximum likelihood hypothesis. In the first step (E-step), the current hypothesis h is used to pick from each bag the one instance which is most likely (given our generative model) to be the one responsible for the label given to the bag. In the second step (M-step), we use the two-step gradient ascent search (quasi-Newton search dfpmin in [8]) of the standard DD algorithm to find a new h' that maximizes DD(h'). Once this maximization step is completed, we reset the proposed target h to h' and return to the first step, iterating until the algorithm converges. Pseudo-code for EM-DD is given in Figure 1. \n\nWe now briefly provide intuition as to why EM-DD improves both the accuracy and computation time of the DD algorithm. Again, the basic approach of DD is to use a gradient search to find a value of h that maximizes DD(h). In every search step, the DD algorithm uses all points in each bag, and hence the maximum that occurs in Equation (1) must be computed. The prior diverse density algorithms [1,5,6,7] used a softmax approximation for the maximum (so that it will be differentiable), which dramatically increases the computational complexity and introduces additional error that depends on the parameter selected in softmax. In comparison, EM-DD converts the multiple-instance data to single-instance data by removing all but one point per bag in the E-step, which greatly simplifies the search step, since the maximum that occurs in Equation (1) is removed in the E-step. 
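The E-step/M-step interplay can be sketched as follows. This is a deliberately minimal, illustrative Python version: the M-step here is a plain numeric gradient descent on h with the scale factors held fixed, standing in for the quasi-Newton dfpmin search the algorithm actually uses, and all names are ours:

```python
import math

def pr_inst(x, h, s):
    """Pr(B_ij in h) = exp(-sum_d (s_d * (x_d - h_d))^2)."""
    return math.exp(-sum((sd * (xd - hd)) ** 2 for xd, hd, sd in zip(x, h, s)))

def nldd_single(points, labels, h, s, eps=1e-12):
    """Single-instance NLDD over the one instance kept per bag (linear model)."""
    return sum(-math.log(max(1.0 - abs(l - pr_inst(x, h, s)), eps))
               for x, l in zip(points, labels))

def em_dd(bags, labels, h, s, em_iters=10, gd_steps=200, lr=0.05, delta=1e-4):
    for _ in range(em_iters):
        # E-step: keep the single most responsible instance from each bag,
        # turning the MI data into single-instance data (no max remains).
        picks = [max(bag, key=lambda x: pr_inst(x, h, s)) for bag in bags]
        # M-step: minimize the single-instance NLDD in h by gradient descent
        # with forward-difference numeric gradients.
        for _ in range(gd_steps):
            base = nldd_single(picks, labels, h, s)
            grad = []
            for d in range(len(h)):
                h2 = h[:]
                h2[d] += delta
                grad.append((nldd_single(picks, labels, h2, s) - base) / delta)
            h = [hd - lr * g for hd, g in zip(h, grad)]
    return h

# Two positive bags in one dimension; EM-DD settles between the two
# instances it judges responsible ([0.0] and [0.9]).
h = em_dd([[[0.0], [1.0]], [[0.9], [5.0]]], [1.0, 1.0], h=[0.5], s=[1.0])
print(round(h[0], 2))  # -> 0.45
```

Because each M-step sees only one point per bag, no softmax approximation of the max in Equation (1) is needed, which is the source of the speedup described above.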
The removal of softmax in EM-DD greatly decreases the computation time. In addition, we believe that EM-DD helps avoid getting caught in local minima, since it makes major changes in the hypothesis when it switches which point is selected from a bag. \n\nWe now provide a sketch of the proof of convergence of EM-DD. Note that at each iteration t, given the set of instances selected in the E-step, the M-step will find a unique hypothesis h_t and corresponding diverse density dd_t. At iteration t+1, if dd_{t+1} ≤ dd_t, the algorithm terminates. Otherwise dd_{t+1} > dd_t, which means that a different set of instances has been selected. Thus, for the iterations to continue, the diverse density must increase monotonically, and so the set of instances selected can never repeat. Since there are only a finite number of sets of instances that can be selected in the E-step, the algorithm terminates after a finite number of iterations. \n\nHowever, there is no guarantee on the convergence rate of EM algorithms. We found that NLDD(h, D) usually decreases dramatically during the first several iterations and then begins to flatten out. From empirical tests we found that it is often beneficial to allow NLDD to increase slightly to escape a local minimum, and thus we used the less restrictive termination condition: |dd_1 - dd_0| < 0.01 · dd_0, or the number of iterations is greater than 10. This modification reduces the training time while giving comparable results. However, with this modification no convergence proof can be given without restricting the number of iterations. \n\n4 Experimental results \n\nIn this section we summarize our experimental results. We begin by reporting our results for the two Musk benchmark data sets provided by Dietterich et al. [4]. 
\nThese data sets contain 166-dimensional feature vectors describing the surface of low-energy conformations of 92 molecules for Musk1 and 102 molecules for Musk2, where roughly half of the molecules are known to smell musky and the remainder are not. The Musk1 data set is smaller both in having fewer bags (i.e., molecules) and many fewer instances per bag (an average of 6.0 for Musk1 versus 64.7 for Musk2). Prior to this work, the highly-tuned iterated-discrim algorithm of Dietterich et al. still gave the best performance on both Musk1 and Musk2. \n\nMain(k, D) \n  partition D = {D_1, D_2, ..., D_10};  // 10-fold cross validation \n  for (i = 1; i <= 10; i++) \n    D_t = D - D_i;  // D_t training data, D_i validation data \n    pick k random positive bags B_1, ..., B_k from D_t; \n    let H_0 be the union of all instances from the selected bags; \n    for every instance I_j in H_0 \n      h_j = EM-DD(I_j, D_t); \n    e_i = min_{1 <= j <= |H_0|} { error(h_j, D_i) }; \n  return avg(e_1, e_2, ..., e_10); \n\nEM-DD(I, D_t) \n  let h = {h_1, ..., h_n, s_1, ..., s_n};  // initial hypothesis \n  for each dimension d = 1, ..., n \n    h_d = I_d;  s_d = 0.1; \n  nldd_0 = +infinity;  nldd_1 = NLDD(h, D_t); \n  while (nldd_1 < nldd_0) \n    for each bag B_i in D_t  // E-step \n      p_i = argmax_{B_ij in B_i} Pr(B_ij in h); \n    h' = argmax_{h in H} prod_i Pr(ℓ_i | h, p_i);  // M-step \n    nldd_0 = nldd_1;  nldd_1 = NLDD(h', D_t); \n    h = h'; \n  return h; \n\nFigure 1: Pseudo-code for EM-DD, where k indicates the number of different starting bags used and Pr(B_ij in h) = exp[-Σ_{d=1}^n (s_d (B_ijd - h_d))^2]. Pr(ℓ_i | h, p_i) is calculated as either 1 - |ℓ_i - Pr(p_i in h)| (linear model) or exp[-(ℓ_i - Pr(p_i in h))^2] (Gaussian-like model), where Pr(p_i in h) = max_{B_ij in B_i} Pr(B_ij in h). \n\nMaron and Lozano-Perez [6] 
\n\nsummarize the generally held belief that \"The performance reported for iterated-discrim APR involves choosing parameters to maximize the test set performance and so probably represents an upper bound for accuracy on this (Musk1) data set.\" \n\nEM-DD without tuning outperforms all previous algorithms. To be consistent with the way in which past results have been reported for the Musk benchmarks, we report the average accuracy of 10-fold cross-validation (which is the value returned by Main in Figure 1). EM-DD obtains an average accuracy of 96.8% on Musk1 and 96.0% on Musk2. A summary of the performance of different algorithms on the Musk1 and Musk2 data sets is given in Table 1. In addition, for both data sets there are no false negative errors using EM-DD. This is important for the drug discovery application: the final hypothesis would be used to filter potential drugs, and a false negative error means that a potentially good drug molecule would not be tested, so it is good to minimize such errors. As compared to the standard DD algorithm, EM-DD used only three random bags for Musk1 and two random bags for Musk2 (versus all positive bags used in DD) as the starting points of the algorithm. Also, unlike the results reported in [6], in which the threshold is tuned based on leave-one-out cross validation, for our reported results the threshold value (of 0.5) is not tuned. More importantly, EM-DD runs over 10 times faster than DD on Musk1 and over 100 times faster when applied to Musk2. \n\nTable 1: Comparison of performance on the Musk1 and Musk2 data sets, measured as the average accuracy across 10 runs using 10-fold cross validation. 
\n\nAlgorithm \n\nEM-DD \nIterated-discrim [4] \nCitation-kNN [11] \nBayesian-kNN [11] \nDiverse density [6] \nMulti-instance neural network [9] \nMultinst [2] \n\nMusk1 \naccuracy \n96.8% \n92.4% \n92.4% \n90.2% \n88.9% \n88.0% \n76.7% \n\nMusk 2 \naccuracy \n96.0% \n89.2% \n86.3% \n82.4% \n82.5% \n82.0% \n84.0% \n\nIn addition to its superior performance on the musk data sets, EM-DD can handle \nreal-value labeled data and produces real-value predictions. We present results \nusing one real data set (Affinity) 1 that has real-value labels and several artificial \ndata sets generated using the technique of our earlier work [1]. For these data sets, \nwe used as our starting points the points from the bag with the highest DD value. \nThe result are shown in Table 2. The Affinity data set has 283 features and 139 \nbags with an average of 32.5 points per bag. Only 29 bags have labels that were \nhigh enough to be considered as \"positive.\" Using the Gaussian-like version of our \ngenerative model we obtained a squared loss of 0.0185 and with the linear model \nwe performed slightly better with a loss of 0.0164. In contrast using the standard \ndiverse density algorithm the loss was 0.042l. EM-DD also gained much better \nperformance than DD on two artificial data (160.166.1a-S and 80.166.1a-S) where \nboth algorithms were used 2 . The best result on Affinity data was obtained using a \nversion of citation-kNN [1] that works with real-value data with the loss as 0.0124. \nWe think that the affinity data set is well-suited for a nearest neighbor approach in \nthat all of the negative bags have labels between 0.34 and 0.42 and so the actual \npredictions for the negative bags are better with citation-kNN. \n\nTo study the sensitivity of EM-DD to the number ofrelevant attributes and the size \nof the bags, tests were performed on artificial data sets with different number of \nrelevant features and bag sizes. 
As shown in Table 2, and similar to the DD algorithm [1], the performance of EM-DD degrades as the number of relevant features decreases. This behavior is expected, since all scale factors are initialized to the same value; when most of the features are relevant, less adjustment is needed and hence the algorithm is more likely to succeed. In comparison to DD, EM-DD is more robust to changes in the number of relevant features. For example, as shown in Figure 2, when the number of relevant features is 160 out of 166, both the EM-DD and DD algorithms perform well, with good correlation between the actual and predicted labels. However, when the number of relevant features decreases to 80, almost no correlation between the actual and predicted labels is found using DD, while EM-DD can still provide good predictions of the labels. \n\nIntuitively, as the size of the bags increases, more ambiguity is introduced into the data and the performance of the algorithms is expected to go down. However, somewhat surprisingly, the performance of EM-DD actually improves as the number of examples per bag increases. \n\nTable 2: Performance on data with real-value labels, measured as squared loss. \n\nData set        # rel. features   # pts per bag   EM-DD   DD [1] \nAffinity                               32.5        .0164   .0421 \n160.166.1a-S         160                4          .0014   .0052 \n160.166.1b-S         160               15          .0013 \n160.166.1c-S         160               25          .0012 \n80.166.1a-S           80                4          .0029   .1116 \n80.166.1b-S           80               15          .0023 \n80.166.1c-S           80               25          .0022 \n40.166.1a-S           40                4          .0038 \n40.166.1b-S           40               15          .0026 \n40.166.1c-S           40               25          .0037 \n\n1 Jonathan Greene from CombiChem provided us with the Affinity data set. However, due to its proprietary nature we cannot make it publicly available. \n2 See Amar et al. [1] for a description of these two data sets. 
We believe that this is partly due to the fact that with few points per bag, the chance that a bad starting point has the highest diverse density is much higher than when the bags are large. In addition, in contrast to the standard diverse density algorithm, the overall time complexity of EM-DD does not go up as the size of the bags increases, since after the instance selection (E-step), the time complexity of the dominant M-step is essentially the same for data sets with different bag sizes. The fact that EM-DD scales up well to large bag sizes, in both performance and running time, is very important for real drug-discovery applications, in which the bags can be quite large. \n\n5 Future directions \n\nThere are many avenues for future work. We believe that EM-DD can be refined to obtain better performance by finding alternate ways to select the initial hypothesis and scale factors. One option would be to use the result from a different learning algorithm as the starting point and then use EM-DD to refine the hypothesis. We are currently studying the application of the EM-DD algorithm to other domains such as content-based image retrieval. Since our algorithm is based on the diverse density likelihood measurement, we believe that it will perform well on all applications in which the standard diverse density algorithm has worked well. In addition, EM-DD and MI regression [10] present a framework for converting multiple-instance data to single-instance data, to which supervised learning algorithms can be applied. We are currently working on using this general methodology to develop new MI learning techniques based on supervised learning algorithms and EM. \n\nAcknowledgments \n\nThe authors gratefully acknowledge the support of NSF grant CCR-9988314. We thank Dan Dooly for many useful discussions. We also thank Jonathan Greene, who provided us with the Affinity data set. \n\nReferences \n\n[1] Amar, R.A., Dooly, D.R., Goldman, S.A. 
& Zhang, Q. (2001). Multiple-instance learning of real-valued data. Proceedings 18th International Conference on Machine Learning, pp. 3-10. San Francisco, CA: Morgan Kaufmann. \n\n[Figure 2 appeared here: four scatter plots of predicted versus actual labels, for the panels 160.166.1a-S (DD), 80.166.1a-S (DD), 160.166.1a-S (EM-DD), and 80.166.1a-S (EM-DD).] \n\nFigure 2: Comparison of EM-DD and DD on real-value labeled artificial data with different numbers of relevant features. The x-axis corresponds to the actual label and the y-axis gives the predicted label. \n\n[2] Auer, P. (1997). On learning from multi-instance examples: Empirical evaluation of a theoretical approach. Proceedings 14th International Conference on Machine Learning, pp. 21-29. San Francisco, CA: Morgan Kaufmann. \n\n[3] Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1-38. \n\n[4] Dietterich, T.G., Lathrop, R.H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2): 31-71. \n\n[5] Maron, O. (1998). Learning from Ambiguity. Doctoral dissertation, MIT, AI Technical Report 1639. \n\n[6] Maron, O. & Lozano-Perez, T. (1998). A framework for multiple-instance learning. Neural Information Processing Systems 10. Cambridge, MA: MIT Press. \n\n[7] Maron, O. & Ratan, A. (1998). 
Multiple-instance learning for natural scene classification. Proceedings 15th International Conference on Machine Learning, pp. 341-349. San Francisco, CA: Morgan Kaufmann. \n\n[8] Press, W.H., Teukolsky, S.A., Vetterling, W.T., & Flannery, B.P. (1992). Numerical Recipes in C: The Art of Scientific Computing, second edition. New York: Cambridge University Press. \n\n[9] Ramon, J. & De Raedt, L. (2000). Multi instance neural networks. Proceedings of the ICML-2000 Workshop on Attribute-Value and Relational Learning. \n\n[10] Ray, S. & Page, D. (2001). Multiple-instance regression. Proceedings 18th International Conference on Machine Learning, pp. 425-432. San Francisco, CA: Morgan Kaufmann. \n\n[11] Ruffo, G. (2000). Learning Single and Multiple Instance Decision Trees for Computer Security Applications. Doctoral dissertation, Department of Computer Science, University of Turin, Torino, Italy. \n\n[12] Wang, J. & Zucker, J.-D. (2000). Solving the multiple-instance learning problem: A lazy learning approach. Proceedings 17th International Conference on Machine Learning, pp. 1119-1125. San Francisco, CA: Morgan Kaufmann. \n", "award": [], "sourceid": 1959, "authors": [{"given_name": "Qi", "family_name": "Zhang", "institution": null}, {"given_name": "Sally", "family_name": "Goldman", "institution": null}]}