{"title": "A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 216, "page_last": 223, "abstract": null, "full_text": "A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction \n\nThomas G. Dietterich \nArris Pharmaceutical Corporation and Oregon State University \nCorvallis, OR 97331-3202 \n\nAjay N. Jain \nArris Pharmaceutical Corporation \n385 Oyster Point Blvd., Suite 3 \nSouth San Francisco, CA 94080 \n\nRichard H. Lathrop and Tomas Lozano-Perez \nArris Pharmaceutical Corporation and MIT Artificial Intelligence Laboratory \n545 Technology Square \nCambridge, MA 02139 \n\nAbstract \n\nIn drug activity prediction (as in handwritten character recognition), the features extracted to describe a training example depend on the pose (location, orientation, etc.) of the example. In handwritten character recognition, one of the best techniques for addressing this problem is the tangent distance method of Simard, LeCun and Denker (1993). Jain et al. (1993a, 1993b) introduce a new technique, dynamic reposing, that also addresses this problem. Dynamic reposing iteratively learns a neural network and then reposes the examples in an effort to maximize the predicted output values. New models are trained and new poses computed until models and poses converge. This paper compares dynamic reposing to the tangent distance method on the task of predicting the biological activity of musk compounds. In a 20-fold cross-validation, dynamic reposing attains 91% correct compared to 79% for the tangent distance method, 75% for a neural network with standard poses, and 75% for the nearest neighbor method. 
\n\n1 INTRODUCTION \n\nThe task of drug activity prediction is to predict the activity of proposed drug compounds by learning from the observed activity of previously synthesized drug compounds. Accurate drug activity prediction can save substantial time and money by focusing the efforts of chemists and biologists on the synthesis and testing of compounds whose predicted activity is high. If the requirements for highly active binding can be displayed in three dimensions, chemists can work from such displays to design new compounds having high predicted activity. \n\nDrug molecules usually act by binding to localized sites on large receptor molecules or large enzyme molecules. One reasonable way to represent drug molecules is to capture the location of their surface in the (fixed) frame of reference of the (hypothesized) binding site. By learning constraints on the allowed location of the molecular surface (and important charged regions on the surface), a learning algorithm can form a model of the binding site that can yield accurate predictions and support drug design. \n\nThe training data for drug activity prediction consists of molecules (described by their structures, i.e., bond graphs) and measured binding activities. There are two complications that make it difficult to learn binding site models from such data. \n\nFirst, the bond graph does not uniquely determine the shape of the molecule. The bond graph can be viewed as specifying a (possibly cyclic) kinematic chain which may have several internal degrees of freedom (i.e., rotatable bonds). The conformations that the graph can adopt, when it is embedded in 3-space, can be assigned energies that depend on such intramolecular interactions as the Coulomb attraction, the van der Waals force, internal hydrogen bonds, and hydrophobic interactions. 
\nAlgorithms exist for searching through the space of conformations to find local minima having low energy (these are called \"conformers\"). Even relatively rigid molecules may have tens or even hundreds of low energy conformers. The training data does not indicate which of these conformers is the \"bioactive\" one, that is, the conformer that binds to the binding site and produces the observed binding activity. \n\nSecond, even if the bioactive conformer were known, the features describing the molecular surface change as the molecule rotates and translates (rigidly) in space, because they are measured in the frame of reference of the binding site. \n\nHence, if we consider feature space, each training example (bond graph) induces a family of 6-dimensional manifolds. Each manifold corresponds to one conformer as it rotates and translates (6 degrees of freedom) in space. For a classification task, a positive decision region for \"active\" molecules would be a region that intersects at least one manifold of each active molecule and no manifolds of any inactive molecules. Finding such a decision region is quite difficult, because the manifolds are difficult to compute. \n\nA similar \"feature manifold problem\" arises in handwritten character recognition. There, the training examples are labelled handwritten digits, the features are extracted by taking a digitized gray-scale picture, and the feature values depend on the rotation, translation, and zoom of the camera with respect to the character. We can formalize this situation as follows. Let x_i, i = 1, ..., N be training examples (i.e., bond graphs or physical handwritten digits), and let f(x_i) be the label associated with x_i (i.e., the measured activity of the molecule or the identity of the handwritten digit). 
Suppose we extract n real-valued features V(x_i) to describe object x_i and then employ, for example, a multilayer sigmoid network to approximate f(x) by f̂(x) = g(V(x)). This is the ordinary supervised learning task. \n\nHowever, the feature manifold problem arises when the extracted features depend on the \"pose\" of the example. We will define the pose to be a vector p of parameters that describe, for example, the rotation, translation, and conformation of a molecule or the rotation, translation, scale, and line thickness of a handwritten digit. In this case, the feature vector V(x, p) depends on both the example and the pose. \n\nWithin the handwritten character recognition community, several techniques have been developed for dealing with the feature manifold problem. Three existing approaches are standardized poses, the tangent-prop method, and the tangent-distance method. Jain et al. (1993a, 1993b) describe a new method, dynamic reposing, that applies supervised learning simultaneously to discover the \"best\" pose p_i* of each training example x_i and also to learn an approximation to the unknown function f(x) as f̂(x_i) = g(V(x_i, p_i*)). In this paper, we briefly review each of these methods and then compare the performance of standardized poses, tangent distance, and dynamic reposing on the problem of predicting the activity of musk molecules. \n\n2 FOUR APPROACHES TO THE FEATURE MANIFOLD PROBLEM \n\n2.1 STANDARDIZED POSES \n\nThe simplest approach is to select only one of the feature vectors V(x_i, p_i) for each example by constructing a function, p_i = S(x_i), that computes a standard pose for each object. Once p_i is chosen for each example, we have the usual supervised learning task: each training example has a unique feature vector, and we can approximate f by f̂(x) = g(V(x, S(x))). \n\nThe difficulty is that S can be very hard to design. 
In optical character recognition, S typically works by computing some pose-invariant properties (e.g., principal axes of a circumscribing ellipse) of x_i and then choosing p_i to translate, rotate, and scale x_i to give these properties standard values. Errors committed by OCR algorithms can often be traced to errors in the S function, so that characters are incorrectly positioned for recognition. \n\nIn drug activity prediction, the standardizing function S must guess which conformer is the bioactive conformer. This is exceedingly difficult to do without additional information (e.g., 3-D atom coordinates of the molecule bound in the binding site as determined by x-ray crystallography). In addition, S must determine the orientation of the bioactive conformers within the binding site. This is also quite difficult: the bioactive conformers must be mutually aligned so that shared potential chemical interactions (e.g., hydrogen bond donors) are superimposed. \n\n2.2 TANGENT PROPAGATION \n\nThe tangent-prop approach (Simard, Victorri, LeCun, & Denker, 1992) also employs a standardizing function S, but it augments the learning procedure with the constraint that the output of the learned function g(V(x, p)) should be invariant with respect to slight changes in the poses of the examples: \n\n    ||∇_p g(V(x, p)) |_{p=S(x)}|| = 0, \n\nwhere ||·|| indicates the Euclidean norm. This constraint is incorporated by using the left-hand side as a regularizer during backpropagation training. \n\nTangent-prop can be viewed as a way of focusing the learning algorithm on those input features and hidden-unit features that are invariant with respect to slight changes in pose. Without the tangent-prop constraint, the learning algorithm may identify features that \"accidentally\" discriminate between classes. 
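The tangent-prop regularizer above can be illustrated with a small numerical sketch. This is not the paper's implementation: the feature map `features`, the single-sigmoid model `g`, and the central finite-difference estimate of the pose gradient are all simplified stand-ins invented here for illustration (tangent-prop computes the derivative analytically).

```python
import numpy as np

def features(x, pose):
    """Toy stand-in for the feature map V(x, p): rotate the example's
    2-D points by the angle pose[0] and flatten them."""
    c, s = np.cos(pose[0]), np.sin(pose[0])
    return (x @ np.array([[c, -s], [s, c]]).T).ravel()

def g(w, v):
    """Toy learned function: a single sigmoid unit with weights w."""
    return 1.0 / (1.0 + np.exp(-w @ v))

def tangent_prop_penalty(w, x, standard_pose, eps=1e-5):
    """Squared Euclidean norm of the gradient of g(V(x, p)) with respect
    to the pose, evaluated at the standard pose.  Estimated here by
    central finite differences."""
    grad = np.zeros_like(standard_pose)
    for k in range(len(standard_pose)):
        dp = np.zeros_like(standard_pose)
        dp[k] = eps
        grad[k] = (g(w, features(x, standard_pose + dp)) -
                   g(w, features(x, standard_pose - dp))) / (2 * eps)
    return float(grad @ grad)

x = np.array([[1.0, 0.0], [0.0, 1.0]])        # two "surface points"
w = np.array([0.1, 0.2, -0.3, 0.4])           # arbitrary toy weights
penalty = tangent_prop_penalty(w, x, np.array([0.0]))
```

During training, this penalty would be added to the usual error term, pushing the network toward weights for which small pose changes leave the output unchanged.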
However, tangent-prop still assumes that the standard poses are correct. This is not a safe assumption in drug activity prediction. \n\n2.3 TANGENT DISTANCE \n\nThe tangent-distance approach (Simard, LeCun & Denker, 1993) is a variant of the nearest-neighbor algorithm that addresses the feature manifold problem. Ideally, the best distance metric to employ for the nearest-neighbor algorithm with feature manifolds is the \"manifold distance\", the distance at the point of nearest approach between two manifolds: \n\n    D(x_1, x_2) = min_{p_1, p_2} ||V(x_1, p_1) - V(x_2, p_2)||. \n\nThis is very expensive to compute, however, because the manifolds can have highly nonlinear shapes in feature space, so the manifold distance can have many local minima. \n\nThe tangent distance is an approximation to the manifold distance. It is computed by approximating the manifold by a tangent plane in the vicinity of the standard poses. Let J_i be the Jacobian matrix defined by (J_i)_{jk} = ∂V(x_i, p_i)_j / ∂(p_i)_k, which gives the plane tangent to the manifold of molecule x_i at pose p_i. The tangent distance is defined as \n\n    TD(x_1, x_2) = min_{a, b} ||(V(x_1, p_1) + J_1 a) - (V(x_2, p_2) + J_2 b)||, \n\nwhere p_1 = S(x_1) and p_2 = S(x_2). The column vectors a and b give the change in the pose required to minimize the distance between the tangent planes approximating the manifolds. The values of a and b minimizing the right-hand side can be computed fairly quickly via gradient descent (Simard, personal communication). In practice, only poses close to S(x_1) and S(x_2) are considered, but this provides more opportunity for objects belonging to the same class to adopt poses that make them more similar to each other. \n\nIn experiments with handwritten digits, Simard, LeCun, and Denker (1993) found that tangent distance gave the best performance of these three methods. \n\n2.4 DYNAMIC REPOSING \n\nAll of the preceding methods can be viewed as attempts to make the final predicted output f̂(x) invariant with respect to changes in pose. 
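As a concrete illustration of the tangent distance of Section 2.3: since both tangent planes are linear in a and b, the minimization can be solved in closed form by linear least squares rather than gradient descent. The feature vectors and Jacobians below are invented toy values, not data from the paper.

```python
import numpy as np

def tangent_distance(v1, J1, v2, J2):
    """Minimize ||(v1 + J1 a) - (v2 + J2 b)|| over pose perturbations
    a and b.  Stacking [J1, -J2] turns this into one least-squares
    problem in the concatenated vector [a; b]."""
    A = np.hstack([J1, -J2])                  # columns span both tangent planes
    ab, *_ = np.linalg.lstsq(A, v2 - v1, rcond=None)
    return float(np.linalg.norm(v1 + A @ ab - v2))

# Two 3-D feature vectors, each with a 1-D pose (one tangent direction).
v1, v2 = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])
J1 = np.array([[1.0], [0.0], [0.0]])          # x_1's manifold moves along x
J2 = np.array([[0.0], [1.0], [0.0]])          # x_2's manifold moves along y
d_euclid = float(np.linalg.norm(v1 - v2))     # distance at the standard poses
d_tangent = tangent_distance(v1, J1, v2, J2)  # allows both poses to shift
```

Here the Euclidean distance between the standard poses is 1.0, but shifting x_1 one unit along its tangent direction makes the two feature vectors coincide, so the tangent distance is essentially zero: exactly the effect that lets same-class examples adopt poses that make them more similar.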
Standard poses do this by not permitting poses to change. Tangent-prop adds a local invariance constraint. Tangent distance enforces a somewhat less local invariance constraint. In dynamic reposing, we make f̂ invariant by defining it to be the maximum value (taken over all poses p) of an auxiliary function g: \n\n    f̂(x) = max_p g(V(x, p)). \n\nThe function g will be the function learned by the neural network. \n\nBefore we consider how g is learned, let us first consider how it can be used to predict the activity of a new molecule x'. To compute f̂(x'), we must find the pose p'* that maximizes g(V(x', p'*)). We can do this by performing a gradient ascent starting from the standard pose S(x') and moving in the direction of the gradient of g with respect to the pose: ∇_{p'} g(V(x', p')). \n\nThis process has an important physical analog in drug activity prediction. If x' is a new molecule and g is a learned model of the binding site, then by varying the pose p' we are imitating the process by which the molecule chooses a low-energy conformation and rotates and translates to \"dock\" with the binding site. \n\nIn handwritten character recognition, this would be the dual of a deformable template model: the template (g) is held fixed, while the example is deformed (by rotation, translation, and scaling) to find the best fit to the template. \n\nThe function g is learned iteratively from a growing pool of feature vectors. Initially, the pool contains only the feature vectors for the standard poses of the training examples (actually, we start with one standard pose of each low energy conformation of each training example). In iteration j, we apply backpropagation to learn hypothesis g_j from selected feature vectors drawn from the pool. 
For each molecule, one feature vector is selected by performing a forward propagation (i.e., computing g(V(x_i, p_i))) for all feature vectors of that molecule and selecting the one giving the highest predicted activity for that molecule. \n\nAfter learning g_j, we then compute for each conformer the pose p_i^{j+1} that maximizes g_j(V(x_i, p)): \n\n    p_i^{j+1} = argmax_p g_j(V(x_i, p)). \n\nFrom the chemical perspective, we permit each of the molecules to \"dock\" to the current model g_j of the binding site. \n\nThe feature vectors V(x_i, p_i^{j+1}) corresponding to these poses are added to the pool of poses, and a new hypothesis g_{j+1} is learned. This process iterates until the poses cease to change. Note that this algorithm is analogous to the EM procedure (Redner & Walker, 1984) in that we accomplish the simultaneous optimization of g and the poses {p_i} by conducting a series of separate optimizations of g (holding {p_i} fixed) and {p_i} (holding g fixed). \n\nWe believe the power of dynamic reposing results from its ability to identify the features that are critical for discriminating active from inactive molecules. In the initial, standard poses, a learning algorithm is likely to find features that \"accidentally\" discriminate actives from inactives. However, during the reposing process, inactive molecules will be able to reorient themselves to resemble active molecules with respect to these features. In the next iteration, the learning algorithm is therefore forced to choose better features for discrimination. \n\nMoreover, during reposing, the active molecules are able to reorient themselves so that they become more similar to each other with respect to the features judged to be important in the previous iteration. 
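The alternating optimization described above (train g with the poses held fixed, then repose with g held fixed) can be sketched in miniature. Everything here is a simplified stand-in for the paper's method, under loudly stated assumptions: a single sigmoid unit instead of a multilayer network, a rotation angle as the only pose parameter, and a coarse grid search instead of gradient ascent over poses; the names `fit` and `repose` are invented for this sketch.

```python
import numpy as np

def features(x, theta):
    """V(x, p): rotate the example's 2-D points by theta, then flatten."""
    c, s = np.cos(theta), np.sin(theta)
    return (x @ np.array([[c, -s], [s, c]]).T).ravel()

def g(w, v):
    return 1.0 / (1.0 + np.exp(-w @ v))

def fit(X, poses, y, epochs=200, lr=0.5):
    """Train the auxiliary model g on the current poses (stands in for
    the paper's backpropagation step), via logistic-loss gradient descent."""
    V = np.array([features(x, p) for x, p in zip(X, poses)])
    w = np.zeros(V.shape[1])
    for _ in range(epochs):
        w -= lr * V.T @ (g(w, V.T) - y) / len(y)
    return w

def repose(w, x, grid=np.linspace(0.0, 2.0 * np.pi, 72)):
    """Choose the pose maximizing g(V(x, p)): a grid-search stand-in for
    the paper's gradient ascent ("docking" to the current model)."""
    return grid[np.argmax([g(w, features(x, t)) for t in grid])]

# Two toy "molecules" (point sets) with activity labels.
X = [np.array([[1.0, 0.0], [0.5, 0.5]]),
     np.array([[2.0, 0.0], [1.0, 1.0]])]
y = np.array([1.0, 0.0])
poses = [0.0, 0.0]                    # "standard" poses
for _ in range(5):                    # alternate models and poses
    w = fit(X, poses, y)
    poses = [repose(w, x) for x in X]
```

Note that both active and inactive examples are reposed to maximize their predicted output, which is what forces later iterations of the learner to rely on genuinely discriminating features.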
In subsequent iterations, the learning algorithm can \"tighten\" its criteria for recognizing active molecules. \n\nIn the initial, standard poses, the molecules are posed so that they resemble each other along all features more-or-less equally. At convergence, the active molecules have changed pose so that they only resemble each other along the features important for discrimination. \n\n3 AN EXPERIMENTAL COMPARISON \n\n3.1 MUSK ACTIVITY PREDICTION \n\nWe compared dynamic reposing with the tangent distance and standard pose methods on the task of musk odor prediction. The problem of musk odor prediction has been the focus of many modeling efforts (e.g., Bersuker et al., 1991; Fehr et al., 1989; Narvaez, Lavine & Jurs, 1986). Musk odor is a specific and clearly identifiable sensation, although the mechanisms underlying it are poorly understood. Musk odor is determined almost entirely by steric (i.e., \"molecular shape\") effects (Ohloff, 1986). The addition or deletion of a single methyl group can convert an odorless compound into a strong musk. Musk molecules are similar in size and composition to many kinds of drug molecules. \n\nWe studied a set of 102 diverse structures that were collected from published studies (Narvaez, Lavine & Jurs, 1986; Bersuker et al., 1991; Ohloff, 1986; Fehr et al., 1989). The data set contained 39 aromatic, oxygen-containing molecules with musk odor and 63 homologs that lacked musk odor. Each molecule was conformationally searched to identify low energy conformations. The final data set contained 6,953 conformations of the 102 molecules (for full details of this data set, see Jain et al., 1993a). Each of these conformations was placed into a starting pose via a hand-written S function. 
We then applied nearest neighbor with Euclidean distance, nearest neighbor with the tangent distance, a feed-forward network without reposing, and a feed-forward network with the dynamic reposing method. For dynamic reposing, five iterations of reposing were sufficient for convergence. The time required to compute the tangent distances far exceeds the computation times of the other algorithms. To make the tangent distance computations feasible, we only computed the tangent distance for the 200 neighbors that were nearest in Euclidean distance. Experiments with a subset of the molecules showed that this heuristic introduced no error on that subset. \n\nTable 1: Results of 20-fold cross-validation on 102 musk molecules. \n\nMethod | Percent Correct \nNearest neighbor (Euclidean distance) | 75 \nNeural network (standard poses) | 75 \nNearest neighbor (tangent distance) | 79 \nNeural network (dynamic reposing) | 91 \n\nTable 1 shows the results of a 20-fold cross-validation of all four methods. The tangent distance method does show improvement with respect to a standard neural network approach (and with respect to the standard nearest neighbor method). However, the dynamic reposing method outperforms the other two methods substantially. \n\nTable 2: Neural network cross-class predictions (percent correct). [The four column headers were structural diagrams of the molecular classes in the original and are not recoverable here.] \n\nStandard poses | 85 | 76 | 74 | 57 \nDynamic reposing | 100 | 90 | 85 | 71 \n\nAn important test for drug activity prediction methods is to predict the activity of molecules whose molecular structure (i.e., bond graph) is substantially different from the molecules in the training set. 
A weakness of many existing methods for drug activity prediction (Hansch & Fujita, 1964; Hansch, 1973) is that they rely on the assumption that all molecules in the training and test data sets share a common structural skeleton. Because our representation for molecules concerns itself only with the surface of the molecule, we should not suffer from this problem. Table 2 shows four structural classes of molecules and the results of \"class holdout\" experiments in which all molecules of a given class were excluded from the training set and then predicted. Cross-class predictions from standard poses are not particularly good. However, with dynamic reposing, we obtain excellent cross-class predictions. This demonstrates the ability of dynamic reposing to identify the critical discriminating features. Note that the accuracy of the predictions generally is determined by the size of the training set (i.e., as more molecules are withheld, performance drops). The exception to this is the right-most class, where the local geometry of the oxygen atom is substantially different from the other three classes. \n\n4 CONCLUDING REMARKS \n\nThe \"feature manifold problem\" arises in many application tasks, including drug activity prediction and handwritten character recognition. A new method, dynamic reposing, exhibits performance superior to the best existing method, tangent distance, and to other standard methods on the problem of musk activity prediction. In addition to producing more accurate predictions, dynamic reposing results in a learned binding site model that can guide the design of new drug molecules. Jain et al. (1993a) show a method for visualizing the learned model in the context of a given molecule and demonstrate how the model can be applied to guide drug design. 
Jain et al. (1993b) compare the method to other state-of-the-art methods for drug activity prediction and show that feed-forward networks with dynamic reposing are substantially superior on two steroid binding tasks. The method is currently being applied at Arris Pharmaceutical Corporation to aid the development of new pharmaceutical compounds. \n\nAcknowledgements \n\nMany people made contributions to this project. The authors thank Barr Bauer, John Burns, David Chapman, Roger Critchlow, Brad Katz, Kimberle Koile, John Park, Mike Ross, Teresa Webster, and George Whitesides for their efforts. \n\nReferences \n\nBersuker, I. B., Dimoglo, A. S., Gorbachov, M. Yu., Vlad, P. F., Pesaro, M. (1991). New Journal of Chemistry, 15, 307. \nFehr, C., Galindo, J., Haubrichs, R., Perret, R. (1989). Helv. Chim. Acta, 72, 1537. \nHansch, C. (1973). In C. J. Cavallito (Ed.), Structure-Activity Relationships. Oxford: Pergamon. \nHansch, C., Fujita, T. (1964). J. Am. Chem. Soc., 86, 1616. \nJain, A. N., Dietterich, T. G., Lathrop, R. H., Chapman, D., Critchlow, R. E., Bauer, B. E., Webster, T. A., Lozano-Perez, T. (1993a). A shape-based method for molecular design with adaptive alignment and conformational selection. Submitted. \nJain, A., Koile, K., Bauer, B., Chapman, D. (1993b). Compass: A 3D QSAR method. Performance comparisons on a steroid benchmark. Submitted. \nNarvaez, J. N., Lavine, B. K., Jurs, P. C. (1986). Chemical Senses, 11, 145-156. \nOhloff, G. (1986). Chemistry of odor stimuli. Experientia, 42, 271. \nRedner, R. A., Walker, H. F. (1984). Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26(2), 195-239. \nSimard, P., Victorri, B., Le Cun, Y., Denker, J. (1992). Tangent Prop: A formalism for specifying selected invariances in an adaptive network. In Moody, J. E., Hanson, S. J., Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann. 
895-903. \nSimard, P., Le Cun, Y., Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In Hanson, S. J., Cowan, J. D., Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann. 50-58. \n", "award": [], "sourceid": 781, "authors": [{"given_name": "Thomas", "family_name": "Dietterich", "institution": null}, {"given_name": "Ajay", "family_name": "Jain", "institution": null}, {"given_name": "Richard", "family_name": "Lathrop", "institution": null}, {"given_name": "Tom\u00e1s", "family_name": "Lozano-P\u00e9rez", "institution": null}]}