{"title": "Large-Scale Prediction of Disulphide Bond Connectivity", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 104, "abstract": null, "full_text": " Large-Scale Prediction of Disulphide Bond\n Connectivity\n\n\n\n Pierre Baldi Jianlin Cheng Alessandro Vullo\n Schoolof Information and Computer Science Computer Science Department\n University of California, Irvine University College Dublin\n Irvine, CA 92697-3425 Dublin, Ireland\n {pfbaldi,jianlinc}@ics.uci.edu alessandro.vullo@ucd.ie\n\n\n\n Abstract\n\n The formation of disulphide bridges among cysteines is an important fea-\n ture of protein structures. Here we develop new methods for the predic-\n tion of disulphide bond connectivity. We first build a large curated data\n set of proteins containing disulphide bridges and then use 2-Dimensional\n Recursive Neural Networks to predict bonding probabilities between cys-\n teine pairs. These probabilities in turn lead to a weighted graph matching\n problem that can be addressed efficiently. We show how the method con-\n sistently achieves better results than previous approaches on the same\n validation data. In addition, the method can easily cope with chains with\n arbitrary numbers of bonded cysteines. Therefore, it overcomes one of\n the major limitations of previous approaches restricting predictions to\n chains containing no more than 10 oxidized cysteines. The method can\n be applied both to situations where the bonded state of each cysteine is\n known or unknown, in which case bonded state can be predicted with\n 85% precision and 90% recall. The method also yields an estimate for\n the total number of disulphide bridges in each chain.\n\n\n\n1 Introduction\n\nThe formation of covalent links among cysteine (Cys) residues with disulphide bridges is\nan important and unique feature of protein folding and structure. 
Simulations [1], experiments\nin protein engineering [15, 8, 14], theoretical studies [7, 18], and even evolutionary\nmodels [9] stress the importance of disulphide bonds in stabilizing the native state of proteins.\nDisulphide bridges may link distant portions of a protein sequence, providing strong\nstructural constraints in the form of long-range interactions. Thus prediction/knowledge of\nthe disulphide connectivity of a protein is important and provides essential insights into its\nstructure and possibly also into its function and evolution.\n\nOnly recently has the problem of predicting disulphide bridges received increased attention.\nIn the current literature, this problem is typically split into three subproblems: (1) prediction\nof whether a protein chain contains intra-chain disulphide bridges or not; (2) prediction\nof the intra-chain bonded/non-bonded state of individual cysteines; and (3) prediction\nof intra-chain disulphide bridges, i.e. of the actual pairings between bonded cysteines (see\nFig. 1). In this paper, we address the problem of intra-chain connectivity prediction, and\nspecifically the solution of problem (3) alone, and of problems (2) and (3) simultaneously.\n\n AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRKMHHTCPCAPNLACVQTSPKKFKCLSK\n\nFigure 1: Structure (top) and connectivity pattern (bottom) of intestinal toxin 1, PDB code 1IMT.\nDisulphide bonds in the structure are shown as thick lines.\n\nExisting approaches to connectivity prediction use stochastic global optimization [10],\ncombinatorial optimization [13] and machine learning techniques [11, 17]. The method in\n[10] represents the set of potential disulphide bridges in a sequence as a complete weighted\nundirected graph. Vertices are oxidized cysteines and edges are labeled by the strength of\ninteraction (contact potential) in the associated pair of cysteines. A simulated annealing\napproach is first used to find an optimal set of weights. 
After a complete labeled graph\nis obtained, candidate bridges are then located by finding the maximum weight perfect\nmatching (footnote 1).\n\nThe method in [17] attempts to solve the problem using a different machine learning approach.\nCandidate connectivity patterns are modelled as undirected graphs. A recursive\nneural network architecture is trained to score candidate patterns according to a similarity\nmetric with respect to correct graphs. Vertices of the graphs are labeled by fixed-size vectors\ncorresponding to multiple alignment profiles in a local window around each cysteine.\nDuring prediction, the score computed by the network is used to exhaustively search the\nspace of candidate graphs. This method, tested on the same data as in [11], achieved the\nbest results. Unfortunately, for computational reasons, both this method and the previous\none can only deal with sequences containing a limited number of bonds (K ≤ 5).\n\nA different approach to predicting disulphide bridges is reported in [13], where finding\ndisulphide bridges is part of a more general protocol aimed at predicting the topology\nof β-sheets in proteins. Residue-to-residue contacts (including Cys-Cys bridges) are predicted\nby solving a series of integer linear programming problems in which customized\nhydrophobic contact energies must be maximized. This method cannot be compared with\nthe other approaches because the authors report validation results only for two relatively\nshort polypeptides with few bonds (2 and 3).\n\nIn this paper we use a 2-Dimensional Recursive Neural Network (2D-RNN, [4]) to predict\ndisulphide connectivity in proteins starting from their primary sequence and its homologues.\nThe outputs of the 2D-RNN are the pairwise probabilities of the existence of a bridge\nbetween any pair of cysteines. Candidate disulphide connectivities are predicted by finding\nthe maximum weight perfect matching. 
\n\nFootnote 1: A perfect matching of a graph (V, E) is a subset E' of E such that each vertex v in V is met by\nexactly one edge in E'.\n\n[Figure 2 graphic: (a) input plane, four hidden planes (NE, NW, SW, SE) and output plane; (b) column (i, j) with input I_{i,j}, hidden units NE_{i,j}, NW_{i,j}, SW_{i,j}, SE_{i,j} and output O_{i,j}.]\n\nFigure 2: (a) General layout of a 2D-RNN for processing two-dimensional objects such as disulphide\ncontacts, with nodes regularly arranged in one input plane, one output plane, and four hidden planes.\nIn each plane, nodes are arranged on a square lattice. The hidden planes contain directed edges\nassociated with the square lattices. All the edges of the square lattice in each hidden plane are\noriented in the direction of one of the four possible cardinal corners: NE, NW, SW, SE. Additional\ndirected edges run vertically in each column from the input plane to each hidden plane, and from each\nhidden plane to the output plane. (b) Connections within a vertical column (i, j) of the directed\ngraph. I_{i,j} represents the input, O_{i,j} the output, and NE_{i,j} represents the hidden variable in the\nNorth-East hidden plane.\n\nThe proposed framework represents a significant\nimprovement in disulphide connectivity prediction for several reasons. First, we show how\nthe method consistently achieves better results than all previous approaches on the same\nvalidation data. Second, our architecture can easily cope with chains with an arbitrary number\nof bonded cysteines. Therefore, it overcomes the limitation of previous approaches which\nrestrict predictions to chains with no more than 10 oxidized cysteines. Third, our methods\ncan be applied both to situations where the bonded state of each cysteine is known\nor unknown. 
And finally, once trained, our system is very rapid and can be used on a\nhigh-throughput scale.\n\n\n2 Methods\n\nAlgorithms\n\nTo predict disulphide connectivity patterns, we use the 2D-RNN approach described in [4],\nwhereby a suitable Bayesian network is recast, for computational effectiveness, in terms\nof recursive neural networks, where local conditional probability tables in the underlying\ndirected graph are replaced by deterministic relationships between a variable and its parent\nnode variables. These functions are parameterized by neural networks using appropriate\nweight sharing as described below. Here the underlying directed graph for disulphide con-\nnectivity has six 2D-layers: input, output, and four hidden layers (Figure 2(a)). Vertical\nconnections, within an (i, j) column, run from input to hidden and output layers, and from\nhidden layers to output (Figure 2(b)). In each one of the four hidden planes, square lattice\nconnections are oriented towards one of the four cardinal corners. Detailed motivation for\nthese architectures can be found in [4] and a mathematical analysis of their relationships\nto Bayesian networks in [5]. The essential point is that they combine the flexibility of\ngraphical models with the deterministic propagation and learning speed of artificial neural\nnetworks. Unlike traditional neural networks with fixed-size input, these architectures can\nprocess inputs of variable structure and length, and allow lateral propagation of contextual\ninformation over considerable length scales.\n\nIn a disulphide contact map prediction, the (i, j) output represents the probability of\nwhether the i-th and j-th cysteines in the sequence are linked by a disulphide bridge\n\n\f\nor not. This prediction depends directly on the (i, j) input and the four-hidden units in\nthe same column, associated with omni-directional contextual propagation in the hidden\nplanes. 
Hence, using weight sharing across different columns, the model can be summarized\nby 5 distinct neural networks in the form\n\n O_{i,j} = N_O(I_{i,j}, H^{NW}_{i,j}, H^{NE}_{i,j}, H^{SW}_{i,j}, H^{SE}_{i,j})\n H^{NE}_{i,j} = N_{NE}(I_{i,j}, H^{NE}_{i-1,j}, H^{NE}_{i,j-1})\n H^{NW}_{i,j} = N_{NW}(I_{i,j}, H^{NW}_{i+1,j}, H^{NW}_{i,j-1}) (1)\n H^{SW}_{i,j} = N_{SW}(I_{i,j}, H^{SW}_{i+1,j}, H^{SW}_{i,j+1})\n H^{SE}_{i,j} = N_{SE}(I_{i,j}, H^{SE}_{i-1,j}, H^{SE}_{i,j+1})\n\nwhere N represents NN parameterization. Learning can proceed by gradient descent (backpropagation)\ndue to the acyclic nature of the underlying graph.\n\nThe input information is based on the sequence itself or rather the corresponding profile\nderived by multiple alignment methods to leverage evolutionary information, possibly augmented\nwith secondary structure and solvent accessibility information derived from the\nPDB files and/or our SCRATCH suite of predictors [16, 3, 4]. For a sequence of length N\nand containing M cysteines, the output layer contains M × M units. The input and hidden\nlayers can scale like N × N if the full sequence is used, or like M × M if only fixed-size\nwindows around each cysteine are used, as in the experiments reported here. The results\nreported here are obtained using local windows of size 5 around each cysteine, as in [17].\nThe input of each position within a window is the normalized frequency of all 20 amino\nacids at that position in the multiple alignment generated by aligning the sequence with the\nsequences in the NR database using the PSI-BLAST program as described, for instance, in\n[16]. Gaps are treated as one additional amino acid. For each (i, j) location an extra input\nis added to represent the absolute linear distance between the two corresponding cysteines.\n\nFinally, it is essential to remark that the same 2D-RNN approach can be trained and applied\nhere in two different modes. 
In the first mode, we can assume that the bonded state of the\nindividual cysteines is known, for instance through the use of a specialized predictor for\nbonded/non-bonded states. Then if the sequence contains M cysteines, 2K (2K ≤ M)\nof which are intra-chain disulphide bonded, the prediction of the connectivity can focus\non the 2K bonded cysteines exclusively and ignore the remaining M - 2K cysteines that\nare not bonded. In the second mode, we can try to solve both prediction problems (bond\nstate and connectivity) at the same time by focusing on all cysteines in a given sequence.\nIn both cases, the output is an array of pairwise probabilities from which the connectivity\npattern graph must be inferred. In the first case, the total number of bonds or edges in the\nconnectivity graph is known (K). In the second case, the total number of edges must be\ninferred. In section 3, we show that the sum of all probabilities across the output array can be\nused to estimate the number of disulphide contacts.\n\n\nData Preparation\n\nIn order to assess our method, two data sets of known disulphide connectivities were compiled\nfrom the Swiss-Prot archive [2]. First, we considered the same selection of sequences\nas adopted in [11, 17] and taken from the Swiss-Prot database release no. 39 (October\n2000). Additionally, we collected and filtered a more recent selection of chains extracted\nfrom the latest available Swiss-Prot archive, version 41.19 (August 2003). In the following,\nwe refer to these two data sets as SP39 and SP41, respectively.\n\nSP41 was compiled with the same filtering procedure used for SP39. Specifically, only\nchains whose structure is deposited in the Protein Data Bank PDB [6] were retained. We\nfiltered out proteins with disulphide bonds assigned tentatively or disulphide bonds inferred\nby similarity. We finally ended up with 966 chains, each with a number of disulphide bonds\nin the range of 1 to 24. 
As previously pointed out, our methodology is not limited by the\nnumber of disulphide bonds, hence we were able to retain and test the algorithm on the\nwhole filtered set of non-trivial chains. This set consists of 712 sequences, each containing\nat least two bridges (K ≥ 2), the case K = 1 being trivial when the bonded state is known.\nBy comparison, SP39 includes 446 chains with no more than 5 bridges; SP41 additionally\nincludes 266 sequences and 112 of these have more than 10 oxidized cysteines.\n\nIn order to avoid biases during the assessment procedure and to perform k-fold cross validation,\nSP41 was partitioned into ten different subsets, with the constraint that sequence\nsimilarity between two different subsets be less than or equal to 30%. This is similar to the\ncriteria adopted in [17, 10], where SP39 was split into four subsets.\n\n\nGraph Matching to Derive Connectivity from Output Probabilities\n\nIn the case where the bonded state of the cysteines is known, one has a graph with 2K\nnodes, one for each bonded cysteine. The weight associated with each edge is the probability\nthat the corresponding bridge exists, as computed by the predictor. The problem is\nthen to find a connectivity pattern with K edges and maximum weight, where each cysteine\nis paired uniquely with another cysteine. The maximum weight matching algorithm\nof Gabow [12] is used to choose the paired cysteines (edges). Its time complexity is cubic,\nO(V^3) = O(K^3), where V is the number of vertices, and its space complexity is linear,\nO(V) = O(K), beyond the storage of the graph. Note that because the number of bonded cysteines\nin general is not very large, it is also possible in many cases to use an exhaustive search of\nall possible combinations. Indeed, the number of combinations is 1 × 3 × 5 × ... × 
(2K - 1)\nwhich yields 945 connectivity patterns in the case of 10 bonded cysteines.\n\nThe case where the bonded state of the cysteines is not known is slightly more involved\nand the Gabow algorithm cannot be applied directly since the graph has M nodes but, if\nsome of the cysteines are not bonded, only a subset of 2K < M nodes participate in the\nfinal maximum weighted matching. Instead, we use a greedy algorithm to derive\nthe connectivity pattern using the estimate of the total number of bonds. First, we order\nthe edges in decreasing order of probabilities. Then we pick the edge with the highest\nprobability. Then we pick the next edge with highest probability that is not incident to the\nfirst edge and so forth, until K edges have been selected. Because this greedy procedure is\nnot guaranteed to find the global optimum, we find it useful to make it a little more robust\nby repeating it L times. In each run i = 1, . . . , L, the first edge selected is the i-th most\nprobable edge. In other words the different runs differ by the choice of the first edge, noting\nthat in practice the optimal solution always contains one of the top L edges. This procedure\nworks well in practice because the edges with largest probabilities tend to occur in the final\npattern. For L reasonably large, the optimal connectivity pattern can usually be found. We\nhave compared this method with Gabow's algorithm in the case where the bonding state is\nknown and observed that when L = 6, this greedy heuristic yields results that are as good\nas those obtained with Gabow's algorithm which, in this case, is guaranteed to find a global\noptimum. The results reported here are obtained using the greedy procedure with L = 6.\nThe advantage of the greedy algorithm is its low O(LM^2) time complexity. 
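The greedy procedure with L restarts described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name greedy_matching, its argument names, and the list-of-lists representation of the output probabilities are hypothetical, and keeping the run with the largest total weight is one plausible reading of how the L runs are combined. Because all edges are sorted once, this sketch runs in O(M^2 log M + L M^2) time.

```python
def greedy_matching(prob, k, restarts=6):
    '''Greedy pairing with restarts (illustrative sketch, not the authors' code).

    prob: symmetric M x M list of lists of pairwise bonding probabilities.
    k: number of bonds (edges) to select.
    restarts: plays the role of L; run r forces the r-th most probable edge first.
    Returns (total_weight, sorted list of (i, j) pairs), or (-1.0, []) if no
    run manages to select k disjoint edges.
    '''
    m = len(prob)
    # All candidate edges (i < j), sorted by decreasing probability.
    edges = sorted(((prob[i][j], i, j)
                    for i in range(m) for j in range(i + 1, m)),
                   reverse=True)
    best_weight, best_pairs = -1.0, []
    for r in range(min(restarts, len(edges))):
        used, pairs, weight = set(), [], 0.0
        # Force the r-th best edge first, then scan the full ranking;
        # the forced edge is skipped on the second pass because its
        # endpoints are already marked as used.
        for p, i, j in [edges[r]] + edges:
            if len(pairs) == k:
                break
            if i in used or j in used:
                continue
            used.update((i, j))
            pairs.append((i, j))
            weight += p
        # Assumption: keep the run with the largest summed probability.
        if len(pairs) == k and weight > best_weight:
            best_weight, best_pairs = weight, sorted(pairs)
    return best_weight, best_pairs
```

Cysteines that appear in none of the returned pairs are implicitly predicted as non-bonded, which is how a single pass of this kind yields both the connectivity pattern and the bonding states.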
It is important\nto note that this method ends up producing a prediction of both the connectivity pattern\nand the bonding state of each cysteine.\n\n\n3 Results\n\nDisulphide Connectivity Prediction for Bonded Cysteines\n\nHere we assume that the bonding state is known. We train 2D-RNN architectures using\nthe SP39 data set to compare with other published results. We evaluate the performance\nusing the precision P (P = TP/(TP+FP) with TP = true positives and FP = false positives)\nand recall R (R = TP/(TP+FN) with FN = false negatives).\n\nAs shown in Table 1, in all but one case the results are superior to what has been previously\nreported in the literature [17, 11]. In some cases, results are substantially better. For\ninstance, in the case of K = 3 bonds, the precision reaches 0.61 and 0.51 at the pair\nand pattern levels, whereas the best similar results reported in the literature are 0.51 (pair)\nand 0.41 (pattern).\n\n K            Pair Precision    Pattern Precision\n 2            0.74* (0.73)      0.74* (0.73)\n 3            0.61* (0.51)      0.51* (0.41)\n 4            0.44* (0.37)      0.27* (0.24)\n 5            0.41* (0.30)      0.11  (0.13)\n 2 . . . 5    0.56* (0.49)      0.49* (0.44)\n\nTable 1: Disulphide connectivity prediction with 2D-RNN assuming the bonding state is known.\nLast row reports performance on all test chains. * denotes levels of precision that exceed the previously\nreported best results in the literature [17] (in parentheses).\n\nFigure 3: Correlation between number of bonded cysteines (2K) and Σ_{i≠j} O_{i,j} log M.\n\n\nEstimation of the Number K of Bonds\n\nAnalysis of the prediction results shows that there is a relationship between the sum of\nall the probabilities, Σ_{i≠j} O_{i,j}, in the graph (or the output layer of the 2D-RNN) and\nthe total number of bonded cysteines (2K). 
For instance, on one of the cross-validation\ntraining sets, the correlation coefficient between 2K and Σ_{i≠j} O_{i,j} is 0.7, the correlation\ncoefficient between 2K and M is 0.68, and the correlation coefficient between 2K and\nΣ_{i≠j} O_{i,j} log M is 0.72. As shown in Figure 3, there is a reasonably linear relationship\nbetween the total number 2K of bonded cysteines and the product Σ_{i≠j} O_{i,j} log M,\nwhere M is the total number of cysteines in the sequence being considered. The slope and\ny-intercept for the line are respectively 0.66 and 3.01 on one training data set. Using this,\nwe estimate the total number of bonded cysteines using linear regression and rounding off,\nmaking sure that the total number is even and does not exceed the total number of cysteines\nin the sequence. In the following experiments, the regression equation for predicting K is\nsolved separately based on each cross-validation training set.\n\n K    Pair Recall    Pair Precision    Pattern Precision\n 2    0.59           0.49              0.40\n 3    0.50           0.45              0.32\n 4    0.36           0.37              0.15\n 5    0.28           0.31              0.03\n\nTable 2: Prediction of disulphide connectivity pattern with 2D-RNN on all the cysteines, without\nassuming knowledge of the bonding state.\n\n\nDisulphide Connectivity Prediction from Scratch\n\nIn the last set of experiments, we do not assume any knowledge of the bonding state and\napply the 2D-RNN approach to all the cysteines (both bonded and not bonded) in each\nsequence. We predict the number of bonds, the bonding state, and connectivity pattern\nusing one predictor. Experiments are run both on SP39 (4-fold cross validation) and SP41\n(10-fold cross validation).\n\nFor lack of space, we cannot report all the results but, for example, precision and recall\nfor SP39 are given in Table 2 for 2 ≤ K ≤ 5. 
Table 3 shows the kind of results that are\nobtained when the method is applied to sequences with more than K = 5 bonds in SP41.\nThe pair precision remains quite good, although the results can be noisy for certain values\nbecause there are not many such examples in the data. Finally, the precision of bonded state\nprediction is 0.85, and the recall of bonded state prediction is 0.9. The precision and recall\nof bond number prediction are both 0.68. The average absolute difference between true bond number and\npredicted bond number is 0.42. The average absolute difference between true bond number\nand wrongly predicted bond number is 1.3.\n\n K            6     7     8     9     10    11    12    15    16    17    18    19    24\n Precision    0.41  0.40  0.34  0.37  0.5   0.4   0.17  0.37  0.57  0.40  0.56  0.42  0.24\n\nTable 3: Prediction of disulphide connectivity pattern with 2D-RNN on all the cysteines, without\nassuming knowledge of the bonding state and when the number of bridges K exceeds 5.\n\n\n4 Conclusion\n\nWe have presented a complete system for disulphide connectivity prediction in cysteine-rich\nproteins. Assuming knowledge of cysteine bonding state, the method outperforms\nexisting approaches on the same validation data. The results also show that the 2D-RNN\nmethod achieves good recall and accuracy on the prediction of connectivity pattern even\nwhen the bonding state of individual cysteines is not known. Unlike previous\napproaches, our method can be applied to chains with K > 5 bonds and yields good, cooperative\npredictions of the total number of bonds, as well as of the bonding states and bond\nlocations. Training can take days but, once trained, predictions can be carried out on a proteomic\nor protein engineering scale. 
Several improvements are currently in progress including (a)\ndeveloping a classifier to discriminate protein chains that do not contain any disulphide\nbridges, using kernel methods; (b) assessing the effect on prediction of additional input\ninformation, such as secondary structure and solvent accessibility; (c) leveraging the predicted\ncysteine contacts in 3D protein structure prediction; and (d) curating a new larger\ntraining set. The current version of our disulphide prediction server DIpro (which includes\nstep (a)) is available through: http://www.igb.uci.edu/servers/psss.html.\n\nAcknowledgments\nWork supported by an NIH grant, an NSF MRI grant, a grant from the University of California\nSystemwide Biotechnology Research and Education Program, and by the Institute\nfor Genomics and Bioinformatics at UCI.\n\n\nReferences\n\n [1] V.I. Abkevich and E.I. Shakhnovich. What can disulfide bonds tell us about protein energetics,\n function and folding: simulations and bioinformatics analysis. J. Mol. Biol., 300:975-985,\n 2000.\n\n [2] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database and its supplement\n TrEMBL. Nucleic Acids Res., 28:45-48, 2000.\n\n [3] P. Baldi and G. Pollastri. Machine learning structural and functional proteomics. IEEE Intelligent\n Systems. Special Issue on Intelligent Systems in Biology, 17(2), 2002.\n\n [4] P. Baldi and G. Pollastri. The principled design of large-scale recursive neural network\n architectures: DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning\n Research, 4:575-602, 2003.\n\n [5] P. Baldi and M. Rosen-Zvi. On the relationship between deterministic and probabilistic directed\n graphical models. 2004. Submitted.\n\n [6] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov,\n and P. E. Bourne. The Protein Data Bank. Nucl. Acids Res., 28:235-242, 2000.\n\n [7] S. Betz. Disulfide bonds and the stability of globular proteins. 
Proteins, Struct., Function\n Genet., 21:167-195, 1993.\n\n [8] J. Clarke and A.R. Fersht. Engineered disulfide bonds as probes of the folding pathway of barnase\n - increasing stability of proteins against the rate of denaturation. Biochemistry, 32:4322-4329,\n 1993.\n\n [9] L. Demetrius. Thermodynamics and kinetics of protein folding: an evolutionary perspective. J.\n Theor. Biol., 217:397-411, 2000.\n\n[10] P. Fariselli and R. Casadio. Prediction of disulfide connectivity in proteins. Bioinformatics,\n 17:957-964, 2001.\n\n[11] P. Fariselli, P. L. Martelli, and R. Casadio. A neural network-based method for predicting the\n disulfide connectivity in proteins. In E. Damiani et al., editors, Knowledge based intelligent\n information engineering systems and allied technologies (KES 2002), volume 1, pages 464-468.\n IOS Press, 2002.\n\n[12] H.N. Gabow. An efficient implementation of Edmonds' algorithm for maximum weight matching\n on graphs. Journal of the ACM, 23(2):221-234, 1976.\n\n[13] J.L. Klepeis and C.A. Floudas. Prediction of β-sheet topology and disulfide bridges in polypeptides.\n J. Comput. Chem., 24:191-208, 2003.\n\n[14] T.A. Klink, K.J. Woycechosky, K.M. Taylor, and R.T. Raines. Contribution of disulfide bonds to\n the conformational stability and catalytic activity of ribonuclease A. Eur. J. Biochem., 267:566-572,\n 2000.\n\n[15] M. Matsumura et al. Substantial increase of protein stability by multiple disulfide bonds. Nature,\n 342:291-293, 1989.\n\n[16] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary\n structure in three and eight classes using recurrent neural networks and profiles. Proteins,\n 47:228-235, 2002.\n\n[17] A. Vullo and P. Frasconi. Disulfide connectivity prediction using recursive neural networks and\n evolutionary information. Bioinformatics, 20:653-659, 2004.\n\n[18] W.J. Wedemeyer, E. Welkler, M. Narayan, and H.A. Scheraga. Disulfide bonds and protein-folding.\n 
Biochemistry, 39:4207-4216, 2000.\n", "award": [], "sourceid": 2607, "authors": [{"given_name": "Jianlin", "family_name": "Cheng", "institution": null}, {"given_name": "Alessandro", "family_name": "Vullo", "institution": null}, {"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}