Manfred K. K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, ﬁnd those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector prediction and we pick the unlabeled examples for which votes with its the prediction is most evenly split between . For a third selec- tion strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select un- labeled examples that cause the most even split of the version space. We demonstrate that on two data sets provided by DuPont Pharmaceu- ticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.