{"title": "Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 512, "page_last": 520, "abstract": "Residue-residue contact prediction is a fundamental problem in protein structure prediction. However, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NN^k_{ij}, where i and j index the spatial coordinates of the contact map and k indexes ''time''. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. 
Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components.", "full_text": "Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction\n\nPietro Di Lena, Ken Nagata, Pierre Baldi\n\nDepartment of Computer Science, Institute for Genomics and Bioinformatics\n\nUniversity of California, Irvine\n\n{pdilena,knagata,pfbaldi}@[ics.]uci.edu\n\nAbstract\n\nResidue-residue contact prediction is a fundamental problem in protein structure\nprediction. However, despite considerable research efforts, contact prediction methods\nare still largely unreliable. Here we introduce a novel deep machine-learning\narchitecture which consists of a multidimensional stack of learning modules. For\ncontact prediction, the idea is implemented as a three-dimensional stack of Neural\nNetworks NN^k_{ij}, where i and j index the spatial coordinates of the contact\nmap and k indexes \u201ctime\u201d. The temporal dimension is introduced to capture the\nfact that protein folding is not an instantaneous process, but rather a progressive\nrefinement. Networks at level k in the stack can be trained in supervised fashion\nto refine the predictions produced by the previous level, hence addressing the\nproblem of vanishing gradients, typical of deep architectures. Increased accuracy\nand generalization capabilities of this approach are established by rigorous comparison\nwith other classical machine learning approaches for contact prediction.\nThe deep approach leads to an accuracy for difficult long-range contacts of about\n30%, roughly 10% above the state-of-the-art. 
Many variations in the architectures\nand the training algorithms are possible, leaving room for further improvements.\nFurthermore, the approach is applicable to other problems with strong underlying\nspatial and temporal components.\n\n1\n\nIntroduction\n\nProtein structure prediction from amino acidic sequence is one of the grand challenges in Bioinfor-\nmatics and Computational Biology. To date, the more accurate and reliable computational methods\nfor protein structure prediction are based on homology modeling [27]. Homology-based methods\nuse similarity to model the unknown target structure using known template structures. However,\nwhen good templates do not exist in protein structure repositories or when sequence similarity is\nvery poor\u2013which is often the case\u2013homology modeling is no more effective. This is the realm of ab\ninitio modeling methods, which attempt to recover three-dimensional protein models more or less\nfrom scratch. Because the structure of proteins is invariant under translations and rotations, it is\nuseful to consider structural representations that do not depend on cartesian coordinates. One such\nrepresentation is the contact map, essentially a sparse binary matrix representing which amino acids\nare in contact in the 3D structure. While contact map prediction can be viewed as a sub-problem\nin protein structure prediction, it is well known that it is essentially equivalent to protein structure\npredictions since 3D structures can be completely recovered from suf\ufb01ciently large subsets of true\ncontacts [20, 26, 23]. Furthermore, even small sets of correctly predicted contacts can be useful\nfor improving ab initio methods [25]. In short, contact map prediction plays a fundamental role\nin protein structure prediction and most of the state-of-the art contact predictors use some form of\nmachine learning. 
Contact prediction is assessed every two years in the CASP experiments [9, 15].\nHowever, despite considerable efforts, the accuracy of the best predictors at CASP rarely exceeds\n20% for long-range contacts, suggesting major room for improvements. Simulations suggest that\nthis accuracy ought to be increased to about 35% in order to be able to recover good 3D structures.\nThere are two main issues arising in contact prediction that have not been addressed systematically:\n(1) Residue contacts are not randomly distributed in native protein structures, rather they are spatially\ncorrelated. Current contact predictors generally do not take into account these correlations, not\neven at the local level, since the contact probability for a residue pair is typically learned/inferred\nindependently of the contact probabilities in the neighborhood of the pair. (2) Proteins do not assume\na 3D conformation instantaneously, but rather through a dynamic folding process that progressively\nrefines the structure. In contrast, current machine learning approaches attempt to learn contact map\nprobabilities in a single step. To address these issues, here we introduce a new machine-learning\ndeep architecture, designed as a deep stack of neural networks, in such a way that each level in\nthe stack receives as input and refines the predictions produced at the previous level. Each level\ncan be trained in a fully supervised fashion on the same set of target contacts/non-contacts, thus\novercoming the gradient vanishing problem, typical of deep architectures. The idea of layering\nlearning modules, such that the outputs of previous layers are fed as input to the next layers, is\nnot completely new and it has been applied in different contexts, particularly to computer vision\ndetection problems [4, 10, 12, 22]. 
However the techniques developed in visual detection cannot\nbe directly applied to contact prediction due to the intrinsic difference of such problems: protein\nsequences have different lengths, thus it is not possible to process the entire sequence at once in the\nnetwork input, as it is done for images. The present work represents, to our knowledge, the \ufb01rst\nattempt to introduce spatial correlation in protein contact prediction.\n\n2 Data preparation\n\n2.1 Contact de\ufb01nition and evaluation criteria\n\nWe de\ufb01ne two residues to be in contact if the Euclidean distance between their C\u03b2 atoms (C\u03b1\nfor Glycines) is lower than 8 \u02daA. This is the contact de\ufb01nition adopted for the contact prediction\nassessment in CASP experiments [15]. The protein map of contact (or contact map) provides a\ntwo-dimensional translation and rotation invariant representation of the protein three-dimensional\nstructure. The information content of the contact map is not uniform within different regions of the\nmap. Three distinct classes of contacts can be de\ufb01ned, depending on the linear sequence separation\nbetween the residues: (1) long-range contacts, with separation \u2265 24 residues; (2) medium-range\ncontacts, with separation between 12 and 23 residues; and (3) short-range contacts, with separation\nbetween 6 and 11 residues. Contacts between residues separated by less than 6 residues are dense\nand can be easily predicted from the secondary structure. Conversely, the sparse long-range contacts\nare the most informative and also the most dif\ufb01cult to predict. Thus, as in the CASP experiments, we\nfocus primarily on long-range contact prediction for performance assessment. The contact predic-\ntion performance is evaluated using the standard accuracy measure [15]: Acc = TP/(TP+FP), where\nTP and FP are the true positive and false positive predicted contacts, respectively. 
The Acc measure\nis computed for the sets of L/5, L/10 and 5 top scored predicted pairs, where L is the length of\nthe domain sequence. The most widely accepted measure of performance for contact prediction\nassessment is Acc for L/5 pairs and sequence separation \u2265 24 [15].\n\n2.2 Training and test sets\n\nIn order to assess the performance of our method, a training and a test set of protein domains are\nderived from the ASTRAL database [6]. We extract from the ASTRAL release 1.73 the precompiled\nset of protein domains with less than 20% pairwise sequence identity. We select only the domains\nbelonging to the main SCOP [17] classes (All-Alpha, All-Beta, Alpha/Beta and Alpha+Beta). We\nexclude domains of length less than 50 residues, domains with multiple 3D structures, as well as\nnon-contiguous domains (including those with missing backbone atoms). We further filter this list\nby selecting just one representative domain, the shortest one, per SCOP family. This yields a training\nset of 2,191 structures (the list of protein domains can be found as supplementary material of [8]).\nFor performance assessment purposes, this set is partitioned into 10 disjoint groups of roughly the\nsame size and average domain lengths, so that no domains from two distinct groups belong to the\nsame SCOP fold. As a result, the 10 sets do not share any structural or sequence similarity, providing\na high-quality benchmark for ab initio prediction. Model performance is assessed using a standard\n10-fold cross-validation procedure. In all our tests, the accuracy results on training/test are averaged\nover the 10 cross-validation experiments.\n\n2.3 Feature and training example selection\n\nIn this work, we do not attempt to determine the best static input features for contact prediction.\nRather, we focus on a minimal set of features commonly used in machine learning-based contact\nprediction [11, 2, 21, 7, 24, 5]. 
Each residue in the protein sequence is described by a feature vector\nencoding three sources of information (for a total of 25 values): evolutionary information in the form\nof profiles (20 values, one for each amino acid type), predicted secondary structure (3 binary values,\n\u03b2-sheet or \u03b1-helix or coil), and predicted solvent accessibility (2 binary values, buried or exposed).\nThe profiles are computed using PSI-BLAST [1] with an E-value cutoff equal to 0.001 and up to\nten iterations against the non-redundant protein sequence database (NR). The secondary structure is\npredicted with SSPRO [18] and the solvent accessibility with ACCPRO [19]. For a pair of residues,\nthese features are included in the network input by using a 9-residue long sliding window centered\nat each residue in the pair. In our Deep NN, these features represent the spatial features (Section 3).\nThe uneven distribution of positive (residue pairs in contact) and negative (residue pairs not in contact)\nexamples in native protein structures requires some rebalancing of the training data. For each\ntraining structure we randomly select 20% of the negative examples, while keeping all the positive\nexamples. We do not include in our set of selected examples residue pairs with sequence separation\nless than 6. All the methods compared in Section 4 are trained on exactly the same sets of examples.\n\n3 Deep Spatio-Temporal Neural Network (DST-NN) architecture\n\nIn the specific implementation used in the simulations, the DST-NN architecture consists of a\nthree-dimensional stack of neural networks NN^k_{ij}, where i and j are the usual spatial coordinates of the\ncontact map, and k is a \u201ctemporal\u201d index. 
All the neural networks in the stack have the same topology\n(same input, hidden, and output layer sizes) with a single hidden layer, and a single sigmoidal\noutput unit estimating the probability of contact between i and j at the level k (Figure 1(a) and 1(b)).\nFurthermore, in this implementation, all the networks in the level k have the same weights (weight\nsharing). Each level k can be trained in a fully supervised fashion, using the same contact maps\nas targets. In this way, each level of the deep architecture represents a distinct contact predictor.\nThe inputs into NN^k_{ij} can be separated into purely spatial inputs, and temporal inputs (which are not\npurely temporal but include also a spatial component). For fixed i and j, the purely spatial inputs are\nidentical for all levels k in the stack, hence they do not depend on \u201ctime\u201d. These purely spatial inputs\ninclude evolutionary profiles, predicted secondary structure, and solvent accessibility in a window\naround residue i and residue j. These are the standard inputs used by most other predictors which\nattempt to predict contacts in one shot and are described in more detail in Section 2.3. The temporal\ninputs, on the other hand, are novel.\n\n3.1 Temporal Features\n\nThe temporal inputs for NN^k_{ij} correspond to the outputs of the networks NN^{k-1}_{rs} at the previous\nlevel in the stack, where r and s range over a neighborhood of i and j. Here we use a neighborhood\nof radius 4 centered at (i, j). The temporal features capture the idea that residue contacts are not\nrandomly distributed in native protein structures, rather they are spatially correlated: a contacting\nresidue pair is very likely to be in the proximity of a different pair of contacting residues. 
For\ninstance, a comparison of the contact proximity distribution (data not shown) for long-range residue\npairs in contact and not in contact shows that over 98% of the contacting residue pairs are in the\nproximity of at least one additional contact, compared to 30% for non-contacting residue pairs,\nwithin a neighborhood of radius 4. Although the contact predictions at a given level of the stack are\ninaccurate, the contact probabilities included in the temporal feature vector can still provide some\nrough estimation of the contact distribution in a given neighborhood.\nThus, in short, while our model is not necessarily meant to simulate the physical folding process,\nthe stack is used to organize the prediction in such a way that each level in the stack is meant to\nrefine the predictions produced by the previous levels, integrating information over both space and\ntime. In particular, through the temporal inputs the architecture ought to be able to capture spatial\ncorrelations between contacts, at least over some range.\n\nFigure 1: DST-NN architecture. (a) Overview. Each NN^k_{ij} represents a feed-forward neural network\ntrainable by back-propagation. (b) Temporal input features for NN^k_{ij}: for a pair of residues (i, j),\nthe temporal inputs into NN^{k+1}_{ij} consist of the contact probabilities produced by the network\nat the previous level over a neighborhood of (i, j).\n\n3.2 Deep Learning\n\nTraining deep multi-layered neural networks is generally hard, since the error gradient tends to\nvanish or explode with a high number of layers [16]. In contrast, in the proposed model, the learning\ncapabilities are not directly degraded by the depth of the stack, since each level of the stack can be\ntrained in a supervised fashion using true contact maps to provide the targets. 
In this way, training\ncan be performed incrementally, by adding a new layer to the stack. More precisely, the weights\nof the first level network, NN^1_{ij}, are randomly initialized and the temporal feature vector is set to\n0. The first network NN^1_{ij} is then trained for one epoch on the given set of examples. The weights\nof NN^1_{ij} are then used to initialize the weights of NN^2_{ij} and the predictions obtained with NN^1_{ij} are\nused to setup the temporal feature vector of NN^2_{ij}. The network NN^2_{ij} is then trained for one epoch\non the same set of examples used for NN^1_{ij} and this procedure is repeated up to a certain depth.\nWe have experimented with several variations of this training procedure, such as randomization of\nthe weights for each new network in the stack, training each network in the stack for more than one\nepoch, growing the stack up to a maximum number of training epochs (one network for each epoch),\nor growing it to a smaller depth but then repeating the training procedure through one or more\nepochs. In Section 4.2 we discuss and compare such different training strategies. In Section 5 we\ndiscuss some possible variants and generalizations of the full architecture. In any case, this approach\nenables training very deep networks (e.g. with maximal values of k up to 100, corresponding to a\nglobal neural network architecture with 300 layers).\n\n4 Results\n\n4.1 Performance comparison\n\nHere we investigate the learning and generalization capabilities of the DST-NN model, and compare\nit with plain three-layer Neural Network (NN) models, as well as 2D Recurrent Neural Network\n(RNN) models, which are two of the most widely used machine learning approaches for contact\nprediction [11, 2, 21, 24]. Here, the NN model is perfectly equivalent to the NNs implemented\nin the DST-NN architecture, except for the temporal feature vector (which is missing in the NN\nimplementation). 
All three methods are trained with a standard on-line back-propagation procedure\nusing exactly the same set of examples and the same input features (Section 2.3).\nOne of the most typical problems in neural network design is related to the issue of choosing, for\na given classification problem, the most appropriate network size (i.e. typically the hidden layer\nsize, which affects the total number of connections in the network). The learning time and the\ngeneralization capabilities of the particular neural network model are highly affected by the network\nsize parameter. In order to take into account the intrinsically incomparable capabilities of the different\nDST-NN, NN, and RNN architectures, we perform our tests by considering a range of exponentially\nincreasing hidden layer sizes (4, 8, 16, 32, 64, and 128 units) for each architecture. The total number\nof connection weights for each architecture as a function of the hidden layer size, as well as the time\nneeded to perform one training epoch, are shown in Table 1.\nFigure 2 shows the learning curves of the three methods as a function of the training epoch and\nthe different hidden layer sizes. We show the average cross-validation accuracy on both training sets\n(continuous line) and test sets (dotted line). 
The learning curves in Figure 2 show the generalization\nperformance with respect to the contact prediction accuracy on L/5 long range contacts; the accuracy\nof prediction on long range contacts is the most widely accepted evaluation measure for contact\nprediction and it provides a better estimate of the prediction performance than the training/testing\nerror. Since very large training epochs are infeasible in terms of time for the RNN model (see Table\n1), for the aim of comparison, we trained each method for a maximum of 100 epochs. In Table 2\nwe summarize the prediction performance of the three machine learning methods by showing the\nmaximum average accuracy achieved in testing over 100 training epochs.\nFrom Figure 2, the DST-NN has overall higher storage and generalization capacity than NN and\nRNN. In particular, for hidden layer sizes larger than or equal to 8, the DST-NN performance are\nsuperior to those of NN and RNN, regardless of their sizes. Moreover, note that hidden layer sizes\nlarger than 32 do not increase the generalization capabilities of any one of the three methods (Ta-\nble 2). The counterintuitive learning curves of the RNN for hidden layer sizes larger than 8 can\nbe explained by considering the structure of the RNN architecture. The RNN model exploits a\nrecursive architecture that suffers, as general deep architectures, from the problem of gradient van-\nishing/explosion.\nIn order to overcome this problem the authors of [2] use a modi\ufb01ed form of\ngradient descent, by which the delta-error for back-propagation is mapped into a piecewise linear\ninterval; this prevents the delta-error from becoming too small or too large. The boundaries of the\ninterval have been tuned for very small hidden layers (private communication). In our experiments,\nwe use the same boundaries for all the tested hidden layer sizes and, apparently, these proved to\nbe ineffective for hidden layer sizes larger than or equal to 16. 
In comparison, we remark again\nthat the DST-NN is unaffected by the gradient vanishing problem, even for very deep stacks. From\nFigure 2, we notice that the DST-NN tends to overfit the training data more easily than the NN. For\ninstance, we notice some small overfitting for the DST-NN starting with hidden layer size 32, while\nthe NN starts to show some small overfitting only at hidden layer size 128. On the contrary, the\nRNN does not show any sign of overfitting in 100 epochs of training, regardless of the hidden layer\nsize in the tested range, and the performance in training is somewhat equivalent to the performance\nin testing. As a final consideration, from Table 2, the NN and RNN best performance on L/5 long\nrange contacts reflects quite well the state-of-the-art in contact prediction [9, 15] with an accuracy\nin the 21-23% range. In contrast, the DST-NN architecture achieves a maximum accuracy of 29%\nwhich represents a significant improvement over the state-of-the-art. As a visual example, Figure 3\nshows the best predictions obtained by each method on a target domain in our data set. 
Although the\nthree methods achieve exactly the same accuracy (0.6) on the top-scored L/5 long range contacts, it\nis evident that the DST-NN provides an overall better prediction of the contact map topology.\n\nTable 1: Connection weights and training times\n\nHL size  DST-NN #Conn  DST-NN Time  NN #Conn  NN Time  RNN #Conn  RNN Time\n4        2,133         \u223c6m          1,809     \u223c1m      17,169     \u223c1h30m\n8        4,265         \u223c10m         3,617     \u223c3m      19,105     \u223c2h\n16       8,529         \u223c15m         7,233     \u223c5m      22,977     \u223c2h40m\n32       17,057        \u223c26m         14,465    \u223c8m      30,721     \u223c3h20m\n64       34,113        \u223c1h20m       28,929    \u223c15m     46,209     \u223c4h50m\n128      68,225        \u223c2h          57,857    \u223c28m     77,185     \u223c7h\n\nFigure 2: Learning curves of different machine learning methods (accuracy versus training epochs). Panels: (a) Hidden Layer size 4; (b) Hidden Layer size 8; (c) Hidden Layer size 16; (d) Hidden Layer size 32; (e) Hidden Layer size 64; (f) Hidden Layer size 128.\n\nTable 2: Best prediction performance\n\nHL size  DST-NN L/5  DST-NN L/10  DST-NN Best5  NN L/5  NN L/10  NN Best5  RNN L/5  RNN L/10  RNN Best5\n4        0.21        0.23         0.26          0.21    0.24     0.27      0.21     0.23      0.25\n8        0.25        0.27         0.29          0.21    0.24     0.27      0.23     0.26      0.29\n16       0.27        0.30         0.33          0.23    0.26     0.28      0.22     0.25      0.29\n32       0.29        0.32         0.35          0.23    0.26     0.29      0.23     0.26      0.29\n64       0.29        0.33         0.37          0.23    0.25     0.28      0.22     0.25      0.28\n128      0.29        0.33         0.36          0.23    0.25     0.28      0.22     0.25      0.28\n\n4.2 Training strategies comparison\n\nHere we compare the generalization performance of the DST-NN under different training strategies.\nSince the training time for the DST-NN increases substantially with the size of the hidden layers, in\nthese tests we consider only hidden layers of size 16 and 32. On the other hand, as shown in Table 2, a\nhidden layer of size 32 does not limit the generalization performance of our method in comparison to\nlarger sizes. 
As in the previous section, we show the performance of the different training strategies\nin terms of learning curves (Figure 4) and maximum achievable accuracy in testing (Table 3).\n\nFigure 3: Predicted contacts at sequence separation \u2265 6 for the d1igqa domain: (a) DST-NN, (b) NN, (c) RNN. In all three figures,\nthe lower triangle shows the native contacts (black dots). The blue and red dots in the upper\ntriangle represent the correctly (blue) and incorrectly (red) predicted contacts among the N top-scored\nresidue pairs, where N is the number of native contacts at sequence separation \u2265 6. All three\nmethods achieve 0.6 accuracy on the top L/5 long range contacts.\n\nFigure 4: Learning curves of different training strategies (accuracy versus training epochs). (a) Hidden Layer size 16. (b) Hidden Layer size 32.\n\nRecall that, according to our general training strategy, when a new network is added to the stack its\ninitial connection weights are copied from the previous-level network in the stack. Moreover, each\nnetwork is trained on exactly the same set of examples. Thus, a natural question is to what extent\nthe randomization, in terms of both connection weights and training examples, affects the network\nlearning capabilities. As shown in Figure 4(a)(b), under weight randomization (DST-NN1), the DST-NN\ngets stuck in local minima and the best prediction performance is comparable to that of NN\nand RNN (Table 2 and Table 3). 
On the other hand, under weight randomization, the DST-NN does\nnot show any sign of over\ufb01tting and the training performance is similar to the testing performance,\nas for the RNN in the previous section. Conversely, randomized selection of the training examples\n(DST-NN2) does not affect the performance of the DST-NN. However, this training strategy seems to\nbe slightly less stable than our general strategy, since the standard deviation of the accuracy over the\nten training/testing sets is slightly higher (data not shown). In these tests, according to our general\ntraining strategy, each network in the stack has been trained for one single epoch. The approach of\ntraining each network for more than one single epoch leads to slightly better accuracy (< 1% of\nimprovement) at the cost of a larger training time (data not shown).\nAnother natural issue concerning DST-NNs is whether the depth of the stack affects the generaliza-\ntion capabilities of the model. To assess this issue, we train a new DST-NN by limiting the depth of\nthe stack to a \ufb01xed number of networks and then repeating the training procedure up to 100 epochs\n(DST-NN3). For this test, we use a limit size of 20 networks, which roughly corresponds to the\ninterval with highest learning peaks for hidden layer size 16 (see Figure 2). Due to the increased\ntraining time for this model (20 times slower), testing different stack depths is not practical. For this\ntraining strategy, the randomization of the weights for each newly added network in the stack does\nnot produce any dramatic loss in prediction accuracy, although the performance results are slightly\nlower than those obtained by using our general weight initialization strategy (data not shown). As\nshown in Figure 4 and Table 3, although more time consuming, this training technique allows an im-\nprovement of approximatively 2% points of accuracy with respect to our general training approach\n(at least for a hidden layer of size 16). 
For this reason, restarting the training on a fixed size stack is\nmore advantageous in terms of prediction performance than having a very deep stack. Unfortunately,\nthe optimal stack depth is very likely related to the specific classification problem and it cannot be\ninferred a priori from the architecture topology.\n\nTable 3: Best prediction performance\n\nMethod    HL 16 L/5  HL 16 L/10  HL 16 Best5  HL 32 L/5  HL 32 L/10  HL 32 Best5\nDST-NN    0.27       0.30        0.33         0.29       0.32        0.35\nDST-NN1   0.24       0.27        0.30         0.24       0.27        0.29\nDST-NN2   0.27       0.30        0.33         0.29       0.33        0.36\nDST-NN3   0.29       0.32        0.35         0.30       0.33        0.37\n\n5 Concluding remarks\n\nWe have presented a novel and general deep machine-learning architecture for contact prediction,\nimplemented as a stack of Neural Networks NN^k_{ij} with two spatial dimensions and one temporal\ndimension. The stack architecture is used to organize the prediction in such a way that each level\nin the stack can receive in input, through the temporal feature vectors, and refine the predictions\nproduced by the previous stages in the stack. This approach is closer to the characteristics of the\nfolding process, where the folded state is dynamically attained through a series of local refinements.\nWhile our architecture is not meant to simulate the folding process, the idea to model the contact\nprediction in a multi-level fashion seems more natural than the traditional single-shot approach. 
This\nis confirmed by the improved generalization capabilities and accuracy of the DST-NN model, which\nhave been demonstrated by rigorous comparison against other approaches.\nThe proposed architecture is somewhat general and it can be adopted as a starting point for more\nsophisticated methods for contact prediction or other problems. For instance, while the elementary\nlearning modules of the architecture are implemented using neural networks, it is clear that these\ncould be replaced by other models, such as SVMs. Moreover, here we considered a simple square\nneighborhood for encoding the contact predictions in the temporal feature vector; more complex\nrelationships could be discovered by exploiting different topologies for such feature vector. For\nexample, different secondary structure elements tend to form specific contacting patterns and such\npatterns could be directly implemented in one or more specific feature vectors (see, for example, [8]).\nAnother property of our DST-NN approach is that each level can be trained in supervised fashion.\nWhile we have used the true contact map as the target for all the levels in the architecture, it is clear\nthat different targets could be used at different levels [3]. For instance, experimental or simulation\ndata (footnote 1) on protein folding could be used to generate contact maps at different stages of folding and\nuse those as targets. Different variations based on these ideas are currently under investigation.\nThe DST-NN approach is in fact a special case of the DAG-RNN approach described in [2] and\nrelies on an underlying directed acyclic graph (DAG) to organize the computations. 
For these reasons, one could also imagine architectures based on a higher-dimensional stack of learning modules, for instance a stack of the form NN^{lm}_{ijk} where the spatial coordinates are three-dimensional, and the \u201ctemporal\u201d coordinates are two-dimensional with a connectivity that ensures the absence of directed cycles (the temporal connections running only from the \u201cpast\u201d towards the \u201cfuture\u201d). DST-NNs of the form NN^k_i, with one spatial and one temporal coordinate, could be applied to sequence problems, for instance to the prediction of secondary structure or relative solvent accessibility. Likewise, DST-NNs of the form NN^l_{ijk}, with three spatial and one temporal coordinate, could be applied, for instance, to problems in weather forecasting [13] or trajectory prediction in robot movements [14].\n\n\u00b9 http://www.dynameomics.org\n\nReferences\n\n[1] Altschul,S.F., Madden,T.L., Sch\u00e4ffer,A.A., Zhang,J., Zhang,Z., Miller,W., Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25(17), 3389-3402.\n\n[2] Baldi,P., Pollastri,G. (2003) The Principled Design of Large-Scale Recursive Neural Network Architectures-DAG-RNNs and the Protein Structure Prediction Problem, Journal of Machine Learning Research, 4, 575-602.\n\n[3] Baldi,P. (2012) Boolean Autoencoders and Hypercube Clustering Complexity, Designs, Codes, and Cryptography, 65, 383-403.\n\n[4] Bengio,Y., Lamblin,P., Popovici,D., Larochelle,H. (2006) Greedy Layer-Wise Training of Deep Networks. Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS 2006), 153-160.\n\n[5] Bj\u00f6rkholm,P., Daniluk,P., Kryshtafovych,A., Fidelis,K., Andersson,R., Hvidsten,T.R. (2009) Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts. 
Bioinformatics, 25, 1264-1270.\n\n[6] Chandonia,J.M., Hon,G., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M., Brenner,S.E. (2004) The ASTRAL Compendium in 2004, Nucl. Acids Res., 32(suppl 1), D189-D192.\n\n[7] Cheng,J., Baldi,P. (2007) Improved residue contact prediction using support vector machines and a large feature set, BMC Bioinformatics, 8, 113.\n\n[8] Di Lena,P., Nagata,K., Baldi,P. (2012) Deep Architectures for Protein Contact Map Prediction, Bioinformatics, 28, 2449-2457.\n\n[9] Ezkurdia,I., Gra\u00f1a,O., Izarzugaza,J.M., Tress,M.L. (2009) Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8, Proteins, 77(suppl 9), 196-209.\n\n[10] Farabet,C., Couprie,C., Najman,L., LeCun,Y. (2012) Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers. Proceedings of the 29th International Conference on Machine Learning (ICML 2012).\n\n[11] Fariselli,P., Olmea,O., Valencia,A., Casadio,R. (2001) Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins 5, 157-162.\n\n[12] Heitz,G., Gould,S., Saxena,A., Koller,D. (2008) Cascaded Classification Models: Combining Models for Holistic Scene Understanding. Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS 2008), 641-648.\n\n[13] Hsieh,W. (2009) Machine Learning Methods in the Environmental Sciences: Neural Networks and Kernels. Cambridge University Press, NY, USA.\n\n[14] Jetchev,N., Toussaint,M. (2009) Trajectory prediction: learning to map situations to robot trajectories. Proceedings of the 26th Annual International Conference on Machine Learning, 449-456.\n\n[15] Kryshtafovych,A., Fidelis,K., Moult,J. (2011) CASP9 results compared to those of previous CASP experiments, Proteins, In press.\n\n[16] Larochelle,H., Bengio,Y., Louradour,J., Lamblin,P. 
(2009) Exploring Strategies for Training Deep Neural Networks, Journal of Machine Learning Research, 10, 1-40.\n\n[17] Murzin,A.G., Brenner,S.E., Hubbard,T., Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol., 247(4), 536-540.\n\n[18] Pollastri,G., Przybylski,D., Rost,B., Baldi,P. (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins, 47(2), 228-235.\n\n[19] Pollastri,G., Baldi,P., Fariselli,P., Casadio,R. (2002) Prediction of Coordination Number and Relative Solvent Accessibility in Proteins, Proteins, 47(2), 142-153.\n\n[20] Porto,M., Bastolla,U., Roman,H.E., Vendruscolo,M. (2004) Reconstruction of protein structures from a vectorial representation, Phys. Rev. Lett., 92, 218101.\n\n[21] Punta,M., Rost,B. (2005) PROFcon: novel prediction of long-range contacts, Bioinformatics, 21, 2960-2968.\n\n[22] Ross,S., Munoz,D., Hebert,M., Bagnell,J.A. (2011) Learning message-passing inference machines for structured prediction, Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2737-2744.\n\n[23] Sathyapriya,R., Duarte,J.M., Stehr,H., Filippis,I., Lappe,M. (2009) Defining an Essence of Structure Determining Residue Contacts in Proteins. PLoS Comput Biol, 5(12), e1000584.\n\n[24] Shackelford,G., Karplus,K. (2007) Contact prediction using mutual information and neural nets. Proteins, 69, 159-164.\n\n[25] Tress,M.L., Valencia,A. (2010) Predicted residue-residue contacts can help the scoring of 3D models. Proteins, 78(8), 1980-1991.\n\n[26] Vassura,M., Margara,L., Di Lena,P., Medri,F., Fariselli,P., Casadio,R. (2008) FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24, 1313-1315.\n\n[27] Zhang,Y. (2008) Progress and challenges in protein structure prediction. 
Curr Opin Struct Biol., 18(3), 342-348.\n", "award": [], "sourceid": 263, "authors": [{"given_name": "Pietro", "family_name": "Lena", "institution": null}, {"given_name": "Ken", "family_name": "Nagata", "institution": null}, {"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}