{"title": "Generative modeling for protein structures", "book": "Advances in Neural Information Processing Systems", "page_first": 7494, "page_last": 7505, "abstract": "Analyzing the structure and function of proteins is a key part of understanding biology at the molecular and cellular level. In addition, a major engineering challenge is to design new proteins in a principled and methodical way. Current computational modeling methods for protein design are slow and often require human oversight and intervention. Here, we apply Generative Adversarial Networks (GANs) to the task of generating protein structures, toward application in fast de novo protein design. We encode protein structures in terms of pairwise distances between alpha-carbons on the protein backbone, which eliminates the need for the generative model to learn translational and rotational symmetries. We then introduce a convex formulation of corruption-robust 3D structure recovery to fold the protein structures from generated pairwise distance maps, and solve these problems using the Alternating Direction Method of Multipliers. We test the effectiveness of our models by predicting completions of corrupted protein structures and show that the method is capable of quickly producing structurally plausible solutions.", "full_text": "Generative Modeling for Protein Structures\n\nNamrata Anand\n\nBioengineering Department, Stanford\n\nnamrataa@stanford.edu\n\nPo-Ssu Huang\n\nBioengineering Department, Stanford\n\npossu@stanford.edu\n\nAbstract\n\nAnalyzing the structure and function of proteins is a key part of understanding\nbiology at the molecular and cellular level.\nIn addition, a major engineering\nchallenge is to design new proteins in a principled and methodical way. Current\ncomputational modeling methods for protein design are slow and often require hu-\nman oversight and intervention. 
Here, we apply Generative Adversarial Networks (GANs) to the task of generating protein structures, toward application in fast de novo protein design. We encode protein structures in terms of pairwise distances between \u03b1-carbons on the protein backbone, which eliminates the need for the generative model to learn translational and rotational symmetries. We then introduce a convex formulation of corruption-robust 3D structure recovery to fold the protein structures from generated pairwise distance maps, and solve these problems using the Alternating Direction Method of Multipliers. We test the effectiveness of our models by predicting completions of corrupted protein structures and show that the method is capable of quickly producing structurally plausible solutions.\n\n1 Introduction\n\nThe ability to determine and design protein structures has deepened our understanding of biology. Advancements in computational modeling methods have led to remarkable outcomes in protein design, including the development of new therapies [1, 2], enzymes [3, 4, 5], small-molecule binders [6], and biosensors [7]. These efforts are largely limited to modifying naturally occurring, or \u201cnative,\u201d proteins. To fully control the structure and function of engineered proteins, it is ideal to create proteins de novo [8]. A fundamental question is discovering new, non-native folds or structural elements that can be used for designing these novel proteins. The protein design problem remains a major engineering challenge because the current design process relies heavily on heuristics, requiring subjective expertise to negotiate pitfalls that result from optimizing imperfect scoring functions. We demonstrate the potential of deep generative modeling for fast generation of new, viable protein structures for use in protein design applications. 
We use Generative Adversarial Networks (GANs) to generate novel protein structures [9, 10] and use our trained models to predict missing sections of corrupted protein structures. We use a data representation restricted to structural information \u2013 pairwise distances of \u03b1-carbons on the protein backbone. Despite this reduced representation, our method successfully learns to generate new structures and, importantly, can be used to infer solutions for completing corrupted structures. We use the Alternating Direction Method of Multipliers (ADMM) algorithm to \u201cfold\u201d 2D pairwise distances into 3D Cartesian coordinates [11]. The algorithm presented is a new method for 3D structure generation and recovery using deep generative models, which is invariant to transformations in the Lie group of rotations and translations (SE(3)).\nThis paper is a step toward learning the protein design and folding process. Ultimately, our goal is to extend the generative model described with subsequent steps of reinforcement learning or imitation learning to produce realistic protein structure at atomic resolution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: a) Data representation. Proteins are made up of chains of amino acids and have secondary structure features such as alpha helices and beta sheets. We represent protein structures using pairwise distances in angstroms between the \u03b1-carbons on the protein backbone. b) Pipeline. The GAN generates a pairwise distance matrix, which is \u201cfolded\u201d into a 3D structure by ADMM to get \u03b1-carbon coordinate positions; a fast \u201ctrace\u201d script then traces a reasonable protein backbone through the \u03b1-carbon positions. We also fold structures directly from pairwise distances using Rosetta (fragment sampling subject to distance constraints). c) Model. DCGAN model architecture used for generating pairwise distance maps. 
The generator takes in a random vector z \u223c N(0, I) and outputs a fake distance map to fool the discriminator. The discriminator predicts whether inputs are real (data samples) or fake (generator output).\n\nThe main contributions of this paper are (i) a generative model of proteins that estimates the distribution of their structures in a way that is invariant to rotational and translational symmetry and (ii) a convex formulation for the resulting reconstruction problem that we show scales to large problem instances.\n\n2 Background\n\n2.1 Protein structure and design\n\nProteins are macromolecules made up of chains of amino acids, with side-chain groups branching off a connected backbone (Figure 1a). Interactions between the side-chains, the protein backbone, and the environment give rise to local secondary structure elements \u2013 such as helices, strands, or random coils \u2013 and to the ultimate 3D structure of the protein. The large number of possible conformations of the peptide backbone, as well as the requirement to satisfy correct chemical bonding geometry and interactions, makes the protein structural modeling problem challenging.\nIn this paper, we study sequence-agnostic structure generation; this is different from the task of protein structure prediction, in which the structure of the protein is predicted given the amino acid sequence. Although protein structures are determined by their primary amino acid sequence, in recent years it has become more apparent that protein structures and protein-protein interfaces conform largely to structural motifs [12]. A well-known example is helical coiled-coils, where the angles between two interacting and sequence-diverse helices fall within a range that facilitates knobs-into-holes packing [13]. These observations emphasize the importance of understanding sequence-agnostic backbone behaviors. 
Here, our goal is to try to sample from the distribution of viable protein backbones.\nThe conventional protein design process starts with designing a peptide backbone structure, which can either be derived from a native protein or artificially created (i.e., de novo design); this is followed by finding the sequence of amino acids or side-chain residues which will fold into the backbone structure. Often, a structure is only partially modified such that one or more segments are manipulated while the rest of the structure is kept intact; this is referred to as a loop modeling or loop closure problem. Ensuring that the resulting structure has a fully connected and plausible backbone can be difficult.\n\n\fThe current state-of-the-art method for designing protein structures is the Rosetta modeling suite. Guided by a heuristic energy function, Rosetta samples native protein fragments to fold backbones and can optimize over amino acid types and orientations to determine the most likely amino acid sequence corresponding to the designed backbone [14]. For loop modeling problems, Rosetta supplements fragment sampling with closure algorithms such as Cyclic Coordinate Descent (CCD) [15] and Kinematic Closure (KIC) [16] to ensure backbone connectivity. Hallmarks of the Rosetta methodology are that it has a highly refined energy function to guide sampling and that its model-building processes are intuitive and flexible. The main drawbacks of this method are that the fragment sampling step is very slow and the method requires extensive sampling steps before arriving at reasonable solutions. We suggest a fast method for loop modeling using a generative model, which takes into account the global structure of the protein.\n\n2.2 Generative models\nGenerative Adversarial Networks (GANs) are a powerful class of models for generating new data samples that match the statistics of an input dataset [9]. 
GANs are made up of a generator network, which seeks to produce realistic data, and a discriminator network, which tries to distinguish fake samples produced by the generator from real samples from the input dataset.\nGiven data x and random vector z \u223c N(0, I), the discriminator D and generator G each seek to maximize the following objectives:\n\n\max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\n\max_G \; \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))] \quad (1)\n\nWe use a deep convolutional generative adversarial network (DCGAN) as our generative model in this paper [10].\n\n2.3 Related Work\nOther than Rosetta-based fragment sampling methods, related work on sequence-agnostic generative models for protein backbones includes TorusDBN [17] and FB5-HMM [18], which are Hidden Markov Models (HMMs) trained to generate local backbone torsion angles and \u03b1-carbon coordinate placement, respectively. We baseline our work against these methods.\nIndirectly related papers use neural network models to predict properties of, and generate new, small molecules and protein/DNA sequences. Some of these use neural networks on graph or string representations of small molecules [19, 20, 21, 22]. A recent paper uses deep neural networks to predict the amino acid sequence for a corresponding structure [23]. Another result uses GANs to generate DNA sequences [24].\nStructure prediction methods include residue coevolution-based structure prediction [25, 26] and recent work on neural network-based methods [27]. 
These approaches assume the underlying amino-acid sequence is known.\nWe use ADMM to infer 3D coordinates from pairwise distances; similarly, others have used a semidefinite program (SDP) to infer protein structure from nuclear magnetic resonance (NMR) data [28], using semidefinite facial reduction to reduce the size of the SDP.\nThe algorithm presented in this paper is a new method for 3D structure generation and recovery using deep generative models in a manner invariant to transformations in SE(3). This method works because of the fixed order of the peptide chain. Current methods for 3D structure generation are not SE(3) invariant. A representative example is [29], which uses GANs to produce structures in 3D voxel space; we baseline our method against this model.\n\n3 Methods\n\n3.1 Dataset and map generation\nAn overview of our method is shown in Figure 1b. We use data from the Protein Data Bank [30], a repository of experimentally determined structures available online. Although full-atom, high-resolution structures are available for use, we sought a representation of protein structure that would eliminate the need for explicitly encoding SE(3) invariance of 3D structures. We chose to encode 3D structure as 2D pairwise distances between \u03b1-carbons on the protein backbone. This representation does not preserve information about the protein sequence (side chains) or the torsion angles of the polypeptide backbone, but preserves enough information to allow for structure recovery. We refer henceforth to these pairwise \u03b1-carbon distance matrices as \"maps.\" Note that the maps preserve the order of the peptide chain from N- to C-terminus and are SE(3)-invariant representations of the 3D structure by construction.\nTo minimize structural homology between the GAN training data and the test data, we separated train and test structures by SCOP (Structural Classification of Proteins) fold type. 
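As an illustration of the map representation described above, a pairwise \u03b1-carbon distance map can be computed from coordinates in a few lines of NumPy; this is a minimal sketch for exposition, not the paper's data pipeline:

```python
import numpy as np

def distance_map(ca_coords):
    """Pairwise alpha-carbon distance map (in the input units, e.g. angstroms).

    ca_coords: (m, 3) array of alpha-carbon coordinates, in N- to C-terminus
    order. Returns an (m, m) symmetric matrix with zero diagonal; the map is
    SE(3)-invariant by construction.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (m, m, 3)
    return np.linalg.norm(diff, axis=-1)
```

Because distances are preserved under any rigid rotation and translation, applying a random rigid motion to the coordinates leaves the map unchanged, which is exactly the SE(3) invariance the representation is designed to provide.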
We include an exact list of train and test set PDB IDs in the supplementary material. Note that the GAN does not require a train/test split, but our subsequent corruption recovery experiments do. To create our datasets, we extract non-overlapping fragments of lengths 16, 64, and 128 from chain \u2018A\u2019 of each protein structure, starting at the first residue, and calculate the pairwise distance matrices from the \u03b1-carbon coordinate positions. Importantly, the inputs to the model are not all independently folded domains, but include fragments. We made this choice because we were unable to stably train a generative model for arbitrary-length input structures. For the scope of this paper, our model is not necessarily learning protein structures which will fold, but rather learning building-block features that define secondary and tertiary structural elements.\nWe generated 16-, 64-, and 128-residue maps by training GANs on the corresponding maps in our dataset. The model architecture is represented in Figure 1c. Experiment details are given in the supplementary material, Section A.1.\n\n3.2 Folding generated maps\n\nAfter generating pairwise distance maps, we must recover or \u201cfold\u201d the corresponding 3D structure. We tested two methods for folding generated maps. The first is using Rosetta\u2019s optimization toolkit to find a low-energy structure via fragment sampling, given distance constraints with slack. In practice, this takes several minutes to fold small structures of fewer than 100 residues because of the fragment and rotamer sampling steps (Figure S1a). 
This is not a scalable method for folding and evaluating many structures; therefore, we sought another, faster way to reconstruct 3D protein structure via the Alternating Direction Method of Multipliers (ADMM) algorithm [11].\n\n3.2.1 ADMM\nThe task of determining 3D Cartesian coordinates given pairwise distance measurements is already well understood and has a natural formulation as a convex problem [31]. Given m coordinates [a_1, a_2, ..., a_m] = A \in R^{n \times m}, we form the Gram matrix G = A^T A \in S^m_+. Note that G is symmetric, positive-semidefinite with rank at most n. We want to recover A given pairwise distance measurements D, with d_{ij} = \|a_i - a_j\|_2. Since g_{ij} = a_i^T a_j and d^2_{ij} = g_{ii} + g_{jj} - 2 g_{ij}, we can find G by solving an SDP over the positive semidefinite cone:\n\n\min_{G,\eta} \; \lambda \|\eta\|_1 + \frac{1}{2} \sum_{i=1,j=1}^{m} (g_{ii} + g_{jj} - 2 g_{ij} + \eta_{ij} - d^2_{ij})^2 \quad \text{subject to } G \in S^m_+ \quad (2)\n\nwhere we have allowed a slack term \eta on each distance measurement, whose \u21131 norm is penalized with weight \lambda \in R. This slack term allows the model to be robust to sparse corruptions of distance measurements. This penalty is common in corruption-robust modeling approaches and has theoretical guarantees in other applications like Robust Principal Components Analysis [32]. We do not address such theoretical guarantees in this work but demonstrate the empirical properties in the supplement, Section A.2.\nWhile this optimization problem can be solved quickly using SDP solvers for systems where n is small, the runtime of traditional solvers is quadratic in n and renders large structure recovery problems out of reach. 
We found that indirect solvers like SCS were not able to handle problems with n > 115 (Figure S1b) [33, 34].\n\n\fFigure 2: a) Generated pairwise distance maps for the 64-residue model, along with corresponding nearest neighbors (NNs) by \u21132 distance in the training dataset, and maps after ADMM coordinate recovery, the subsequent \u03b1-carbon retrace step (CA retrace), and coordinate recovery by Rosetta fragment sampling. b) Distribution of \u21132 map errors after coordinate recovery for generated maps (n = 1000). c) Distribution of \u03b1-carbon rigid body alignment errors between folding methods for generated maps (n = 1000). d) Mean \u21132 map errors after coordinate recovery for generated and real maps vs. protein length. e) Mean \u03b1-carbon rigid body alignment errors vs. protein length (n = 64 per data point).\n\nHence, we use ADMM, which we found in practice converges to the correct solution quickly (Figure S1c). ADMM is a combination of dual ascent with decomposition and the method of multipliers. We write a modified optimization problem as\n\n\min_{G,Z,\eta} \; \lambda \|\eta\|_1 + \frac{1}{2} \sum_{i=1,j=1}^{m} (g_{ii} + g_{jj} - 2 g_{ij} + \eta_{ij} - d^2_{ij})^2 + \mathbb{I}\{Z \in S^m_+\} \quad \text{subject to } G - Z = 0 \quad (3)\n\nNow we decompose the problem into iterative updates over variables G, Z, and U as\n\nG^{k+1}, \eta^{k+1} = \arg\min_{G,\eta} \Big[ \lambda \|\eta\|_1 + \frac{1}{2} \sum_{i=1,j=1}^{m} (g_{ii} + g_{jj} - 2 g_{ij} + \eta_{ij} - d^2_{ij})^2 + \frac{\rho}{2} \|G - Z^k + U^k\|^2_2 \Big]\nZ^{k+1} = \Pi_{S^n_+}(G^{k+1} + U^k)\nU^{k+1} = U^k + G^{k+1} - Z^{k+1} \quad (4)\n\nwith augmented Lagrangian penalty \rho > 0. The update for Z is simply the projection onto the set of symmetric positive semidefinite matrices of rank n. We find G^k and \eta^k by several iterations of gradient descent. After convergence, coordinates A can be recovered from G via SVD. Note that this algorithm is generally applicable to any problem of structure recovery from pairwise distance measurements, not only protein structures. 
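For intuition about the final recovery step, the noiseless special case of this problem has a closed-form solution: double-center the squared distances to obtain the Gram matrix G, then keep the top-n eigenpairs (classical multidimensional scaling). The sketch below is our own illustration of that special case, not the paper's corruption-robust ADMM solver:

```python
import numpy as np

def coords_from_distance_map(D, dim=3):
    """Recover coordinates (up to rigid motion) from an exact distance map.

    Noiseless special case of the recovery problem above: double centering
    gives the Gram matrix G of the centered coordinates, and the top `dim`
    eigenpairs of G give a coordinate embedding (the SVD step in the text).
    """
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                # Gram matrix, G = A^T A
    w, V = np.linalg.eigh(G)                   # eigenvalues in ascending order
    w, V = w[::-1][:dim], V[:, ::-1][:, :dim]  # keep the top-`dim` eigenpairs
    return V * np.sqrt(np.clip(w, 0.0, None))  # (m, dim) coordinates
```

The ADMM formulation above generalizes this to corrupted measurements by carrying the slack term \u03b7 and the PSD projection through the iterations; the eigendecomposition here plays the same role as the projection and the final SVD.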
In practice, this algorithm is fairly robust to corruption of data (Section A.2).\nSince our data representation only includes pairwise distances between \u03b1-carbons, we need a fast method to recover the entire protein backbone given the \u03b1-carbon coordinates output by ADMM. To do this, we use Rosetta to do local fragment sampling for every five carbons, constraining the original placement of the carbons. The backbone is joined by selecting the middle residue for each carbon 5-mer. This is followed by a short design procedure which finds a low-energy sequence for the designed backbone. This is the \u201c\u03b1-carbon trace\u201d block shown in Figure 1b. Unlike running the full Rosetta protocol (dashed line, Figure 1b), this is a short procedure that runs in no more than a few minutes for large structures (Figure S1d).\nThe benefit of this procedure is that we recover realistic protein structures that adhere to the \u03b1-carbon placement determined by ADMM, while only sampling 5-mers and not all native fragments; hence, the procedure runs an order of magnitude faster than the Rosetta protocol described above. The primary drawback is that the ADMM procedure cannot always correct for local errors in secondary structure (such as deviations from helix or sheet structure), while the Rosetta sampling procedure usually guarantees correct local structure.\n\n\fFigure 3: Examples of real 64-residue fragments (a) from the training dataset versus generated 64-residue fragments (b-f). b) GAN-generated maps folded subject to \u03b1-carbon distance constraints using Rosetta for fragment sampling. c) GAN-generated maps folded using ADMM and the \u03b1-carbon trace script. d) Full-atom GAN-generated maps folded using ADMM and the \u03b1-carbon trace script. e) Structures from \u03c6, \u03c8, \u03c9 angles generated by the torsion GAN baseline with idealized backbone bond lengths and angles [35]. f) Structures generated by sampling from TorusDBN without a sequence prior [17]. g) Structures generated by sampling \u03b1-carbon traces from FB5-HMM without a sequence prior and recovering the full-atom structure via Rosetta fragment sampling [18]. 3D GAN voxel output is shown in Figure S8 [29].\n\n4 Experiments\n\n4.1 Generating protein structures\n\nWe generate 16-, 64-, and 128-residue maps and then fold them using the ADMM and Rosetta protocols above. Information on model architectures and training is given in detail in the supplementary material, Section A.1.\nWe compare our method for structure generation to the following baselines: the Hidden Markov Model (HMM) based methods TorusDBN [17] and FB5-HMM [18], a multi-scale torsion angle GAN, 3DGAN [29], and a full-atom GAN (2D pairwise distance maps for full-atom peptide backbones). Descriptions of these baselines are given in the supplement, Section A.3.\n\n4.1.1 Results\nGenerated maps from our trained 64- and 128-residue models are shown in Figure 2a and Figure S3a, alongside nearest-neighbor maps from the training set. We found that generated maps were highly variable and similar but not identical to real data, suggesting that the GAN is in fact generating new maps and is not memorizing the training data.\nWe fold structures in two different ways. First, we use Rosetta to do fragment sampling subject to the generated \u03b1-carbon distance constraints. In practice, this gives us a 3D structure with correct peptide backbone geometry that adheres to the generated constraints. 
Second, we use ADMM to find a 3D \u03b1-carbon placement that satisfies the generated constraints; we then use the \u03b1-carbon trace script (described in Section 3.2.1) to trace an idealized peptide backbone geometry through the \u03b1-carbons.\nIn Figure 2b, we show the distribution of mean \u21132 \u03b1-carbon map errors due to reconstruction via ADMM, the \u03b1-carbon retrace step, and fragment sampling subject to distance constraints by Rosetta. The errors are smaller than those corresponding to nearest neighbors in the training set, and the reconstructed maps retain qualitative features of the generated maps. The \u03b1-carbon rigid body alignment errors between the coordinate recovery methods are also small relative to nearest neighbors in the training set (Figure 2c).\n\n\fIn Figure 2d, we show the relative contribution to the recovered map error from the coordinate recovery process versus the generative model. While the reconstruction process introduces errors, the error is only slightly higher for the reconstruction of generated maps versus real maps for the 64- and 128-residue models. This indicates that most of the reconstruction error is inherent to the reconstruction process, rather than coming from correcting errors in the generated maps. In contrast, for a weaker 256-residue GAN, which fails to produce realistic maps or corresponding structures, the map reconstruction error for generated maps far exceeds that of real maps.\nFolded structures are shown in Figure 3. We found that the generator was able to learn to construct meaningful secondary structure elements such as alpha helices and beta sheets. The Rosetta folding procedure is slow but produces locally correct structures (Figure 3b). In contrast, the ADMM folding procedure is fast but cannot correct for errors in local structure (Figure 3c). 
Linear interpolations in the latent space of the GAN map to smooth interpolations in the generated pairwise distances, as well as in the corresponding structures (Figures S4, S5). In addition, we found that the generator could produce maps closely corresponding to real protein structure maps through optimization of the input latent vector, indicating that the models are sufficiently complex (Section A.4, Figure S6).\nAssuming average bond angles and bond lengths for the peptide backbone, the 3D structure of the backbone can be exactly represented in terms of the torsion, or dihedral, angles \u03c6, \u03c8, and \u03c9, defined as the angles around the N\u2013C\u03b1, C\u03b1\u2013C, and C\u2013N\u2032 bonds, respectively. \u03c9 is typically 180 degrees (trans orientation), except in the rare case when \u03c9 is around 0 degrees (cis orientation). The \u03c6, \u03c8 distribution (Ramachandran plot) indicates allowable backbone torsion angles, and different regions of the distribution correspond to alpha helix, beta sheet, and loop regions. We show the \u03c6, \u03c8 distributions of the generated structures and baselines in Figure S7.\nThe baselines (Figure 3d-g) underperform relative to the \u03b1-carbon distance map method. Out of all the methods, the torsion GAN best adheres to the true \u03c6, \u03c8 distribution (Figure S7h); however, the torsion GAN, TorusDBN, and FB5-HMM baselines generate many structures which loop in on themselves and do not have realistic global 3D protein structure (Figure 3e-g).\nThe full-atom GAN (Figure 3d) also underperforms relative to the \u03b1-carbon distance map method. The generated structures are rendered with breaks in the peptide chain, because often the placement of backbone atoms is far enough from real backbone geometry that the structure cannot be rendered as a contiguous chain. 
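For reference, each backbone dihedral discussed above is determined by four consecutive backbone atoms (e.g. \u03c6 by C\u2032-N-C\u03b1-C, \u03c8 by N-C\u03b1-C-N\u2032), and the standard atan2 construction computes it. A minimal sketch (our illustration, not code from the paper):

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four consecutive atom positions.

    The angle is measured around the central p1-p2 bond, between the planes
    (p0, p1, p2) and (p1, p2, p3).
    """
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))
```

Computing \u03c6 and \u03c8 for every residue of a generated structure and scattering the pairs gives exactly the kind of Ramachandran plots used for evaluation in Figure S7.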
It is possible that a full-atom method would work better with a generative model that is multi-scale, learning both local peptide structure and global 3D structure. Results for the 3DGAN baseline are given in Figure S8; we found that this method could not generate meaningful structures.\n\n4.2 Inpainting for protein design\n\nNext, we considered how to use the trained generative models to infer contextually correct missing portions of protein structures. This is a protein design problem which arises in the context of loop modeling, circular permutation, and interface prediction. We can formulate this problem naturally as an inpainting problem, where for a subset of residues all pairwise distances are eliminated and the task is to fill in these distances reasonably, given the context of the rest of the uncorrupted structure. We used a slightly modified version of the semantic inpainting method described in [36], omitting the Poisson blending step. This method is described in detail in the supplement, Section A.5. We only present inpainting results on structures in the test set, which the GAN has not seen during training.\n\n4.2.1 Baselines and metrics\nWe baseline with the following methods:\n10-residue supervised autoencoder. We train an autoencoder to reconstruct completed 64-residue pairwise distance maps given input maps with random 10-residue corruptions. The encoder and decoder networks are equivalent to the discriminator and generator networks for the GAN, omitting the last layer of the discriminator and the first layer of the generator. As before, the inputs are normalized and the outputs are forced to be positive and symmetric. The autoencoder is trained with a supervised \u21132 loss with respect to the uncorrupted map.\nRandom corruption supervised autoencoder. We also train the same autoencoder model to reconstruct completed 64-residue maps given input maps with random corruptions ranging from 5 to 25\n\n\fFigure 4: GAN vs. 
baselines for inpainting 10-, 15-, and 20-residue segments on 64-residue structures. Native structures are colored green and reconstructed structures are colored yellow. The omitted regions of each native structure are colored blue, and the inpainted solutions are colored red. a) GAN \u2013 solution found by sampling 1000 structures and selecting the structure with low Rosetta score and low \u03b1-carbon rigid body alignment error with respect to the native structure (e). b) Supervised autoencoder trained on 10-residue corruptions. c) Autoencoder trained on random corruptions from 5 to 25 residues in length. d) RosettaRemodel \u2013 best Rosetta score structure after 4000 sampling trajectories. e) Log-scaled Rosetta backbone score (log(Rosetta score + 500)) vs. mean \u03b1-carbon rigid body alignment error of the inpainted region for GAN solutions (n = 1000). Solutions are colored by the map recovery error of Rosetta with respect to the generated map. Arrows indicate the rendered solution (a).\n\nresidues in length. This is to more fairly compare against the GAN inpainting method, which can handle arbitrary-length corruptions.\nRosettaRemodel. We also baseline the GAN inpainting method with RosettaRemodel [37], which uses fragment sampling to do loop closure, followed by a sequence design process, guided by a heuristic energy score. For our experiments, we do 4000 sampling trajectories per structure and rank the output solutions by their Rosetta score.\nThere is no canonical evaluation metric for loop closure. We can try to more quantitatively describe the correctness of the inpainting procedure using as metrics the discriminator score and the least-squares rigid-body alignment error with respect to the true native structure. However, the GAN solutions are found by explicitly trying to optimize the discriminator score; hence the corresponding score will be artificially inflated relative to the baselines. 
In addition, in the case where the inpainted solution is plausible but deviates from the native structure, the rigid-body alignment error will be high. Therefore, we cannot necessarily view these metrics as strong indicators of whether the reconstructed solutions are reasonable, only as rough heuristics. The ultimate test of the inpainted solutions is to experimentally verify the structures.\n\n4.2.2 Inpainting results\nResults for inpainting of missing residues on 128-residue maps are shown in Figure S9. We see that the trained generator can fill in semantically correct pairwise distances for the removed portions of the maps. To test whether these inpainted portions correspond to legitimate reconstructions of missing parts of the protein, we fold the new inpainted maps into structures.\nWe render some GAN and baseline inpainting solutions for 64-residue structures folded by fragment sampling in Figure 4. We sample 1000 inpainting solutions using the generator and render solutions with low Rosetta backbone score and low rigid body alignment error with respect to the native structure. For the Rosetta score, we only include backbone energy terms, excluding those terms involving side-chain interactions. By sampling many solutions for a single inpainting task, we can see whether native-like solutions are discoverable. In order to show that these solutions are truly\n\n\fgenerated by the GAN and not simply found due to the fragment sampling used in map reconstruction, we color the points by the \u21132 map error between the generated map and the reconstructed map after coordinate recovery.\nIn general, the 10-residue supervised autoencoder produces unrealistic solutions when made to inpaint longer regions; however, the autoencoder trained on random corruptions tends to do better. 
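The inpainting search can be stated compactly: hold the uncorrupted distances fixed as context and optimize the latent vector so the generated map agrees with them, then read off the generator's fill for the corrupted block. The toy below illustrates that masked-loss latent optimization with a linear stand-in "generator" (an assumption for illustration only; the actual method uses the trained DCGAN and adds a discriminator term to the loss, per [36]):

```python
import numpy as np

def inpaint_latent(W, x, mask, steps=1000, lr=0.5):
    """Optimize latent z so the toy 'generator' W @ z matches x on known entries.

    W:    (d, k) linear stand-in for a trained generator (illustrative only).
    x:    (d,) target, e.g. a flattened distance map.
    mask: (d,) with 1 = known context, 0 = corrupted region to be inpainted.
    Returns the completed vector W @ z (context reproduced, gap filled in).
    """
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        resid = mask * (W @ z - x)  # reconstruction error on known context only
        z -= lr * (W.T @ resid)     # gradient step on 0.5 * ||resid||^2
    return W @ z
```

Because only the masked reconstruction loss drives z, the filled-in block is whatever the generator's range deems most consistent with the surrounding context; starting from different random initializations of z would yield multiple candidate closures, in the spirit of the multi-solution sampling described above.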
The primary advantages of the GAN over the autoencoder are that the GAN can handle arbitrary-length loop closures without issue and that the GAN can be used to sample multiple solutions for each inpainting task, as shown in Figures S10b, S11, and S12.\nWhile the generator and autoencoders can find inpainting solutions in minutes, including the coordinate recovery step, RosettaRemodel takes much longer. For example, for 64-residue structures, running 100 Rosetta sampling trajectories on a node with 16 CPU cores takes on average 20 minutes. For our experiments, we ran 100 trajectories per node across 40 nodes, which corresponds to about 13 hours of total CPU time per structure. It is also important to note that since Rosetta samples native fragments, it is possible for it to find the correct native solution in the course of sampling.\nDiscriminator score and mean rigid-body alignment error for the inpainting solutions are given in Figure 5. The mean alignment error is calculated for \u03b1-carbons over the inpainted region only. For the GAN, results are given for 10 solutions per structure (\u2018GAN all\u2019), as well as for the top solutions per structure under the discriminator score (\u2018GAN\u2019). For Rosetta, results are given for the top 40 structures out of 4000 sampling trajectories (\u2018Rosetta all\u2019) under the Rosetta score, as well as for the best solution among all trajectories (\u2018Rosetta\u2019).\nAs the size of the inpainting region increases, the autoencoder discriminator score decreases and the structural solutions are also qualitatively worse for the autoencoder trained on 10-residue corruptions; this indicates that supervised autoencoders are not flexible enough to handle loop closure problems of arbitrary length. The GAN discriminator score is explicitly optimized during inpainting and is therefore high. 
In general, in terms of alignment error, the GAN inpainting solutions deviate the most from the native structures relative to the baselines.

The GAN can also be used to model longer or shorter fragments relative to the native structure; we show two examples in Figure S10c. In Figures S11 and S12, we show more GAN inpainting solutions. We are able to recover native-like solutions in terms of rigid-body error and secondary structure assignment for many inpainting tasks. There are also low-energy, non-native solutions found in a few cases. We also render some non-native and clearly implausible inpainting solutions in Figure S13.

Figure 5: Discriminator score and mean coordinate ℓ2 alignment error for the 64-residue inpainting task. Each point is averaged over n = 64 datapoints, except for ‘GAN all’ (n = 640) and ‘Rosetta all’ (n = 2560).

5 Conclusion

We use GANs to generate protein α-carbon pairwise distance maps and use ADMM to “fold” the protein structure. We apply this method to the task of inferring completions for missing residues in protein structures.

Several immediate extensions are possible from this work. We can assess whether the learned GAN features can improve performance on semi-supervised prediction tasks. We can also see how conditioning on sequence data, functional data, or higher-resolution structural data might improve structure generation [38].

Furthermore, we can extend our generative modeling procedure to solve the structure recovery problem end-to-end and mitigate the issue our current model has with making errors in fine local structure. The current approach factors through the map representation, which overconstrains the recovery problem.
By incorporating the ADMM recovery procedure as a differentiable optimization layer of the generator, we can potentially extend the models presented to directly generate and evaluate 3D structures.

Acknowledgments

We would like to thank Frank DiMaio for helpful discussion on Rosetta and for providing baseline scripts to fold structures directly from pairwise distances.

References

[1] Timothy A Whitehead, Aaron Chevalier, Yifan Song, Cyrille Dreyfus, Sarel J Fleishman, Cecilia De Mattos, Chris A Myers, Hetunandan Kamisetty, Patrick Blair, Ian A Wilson, et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nature Biotechnology, 30(6):543, 2012.

[2] Eva-Maria Strauch, Steffen M Bernard, David La, Alan J Bohn, Peter S Lee, Caitlin E Anderson, Travis Nieusma, Carly A Holstein, Natalie K Garcia, Kathryn A Hooper, et al. Computational design of trimeric influenza-neutralizing proteins targeting the hemagglutinin receptor binding site. Nature Biotechnology, 35(7):667, 2017.

[3] Daniela Röthlisberger, Olga Khersonsky, Andrew M Wollacott, Lin Jiang, Jason DeChancie, Jamie Betker, Jasmine L Gallaher, Eric A Althoff, Alexandre Zanghellini, Orly Dym, et al. Kemp elimination catalysts by computational enzyme design. Nature, 453(7192):190, 2008.

[4] Justin B Siegel, Alexandre Zanghellini, Helena M Lovick, Gert Kiss, Abigail R Lambert, Jennifer L St Clair, Jasmine L Gallaher, Donald Hilvert, Michael H Gelb, Barry L Stoddard, et al. Computational design of an enzyme catalyst for a stereoselective bimolecular Diels-Alder reaction. Science, 329(5989):309–313, 2010.

[5] Lin Jiang, Eric A Althoff, Fernando R Clemente, Lindsey Doyle, Daniela Röthlisberger, Alexandre Zanghellini, Jasmine L Gallaher, Jamie L Betker, Fujie Tanaka, Carlos F Barbas, et al. De novo computational design of retro-aldol enzymes.
Science, 319(5868):1387–1391, 2008.

[6] Christine E Tinberg, Sagar D Khare, Jiayi Dou, Lindsey Doyle, Jorgen W Nelson, Alberto Schena, Wojciech Jankowski, Charalampos G Kalodimos, Kai Johnsson, Barry L Stoddard, et al. Computational design of ligand-binding proteins with high affinity and selectivity. Nature, 501(7466):212, 2013.

[7] Ashley D Smart, Roland A Pache, Nathan D Thomsen, Tanja Kortemme, Graeme W Davis, and James A Wells. Engineering a light-activated caspase-3 for precise ablation of neurons in vivo. Proceedings of the National Academy of Sciences, 114(39):E8174–E8183, 2017.

[8] Po-Ssu Huang, Scott E Boyken, and David Baker. The coming of age of de novo protein design. Nature, 537(7620):320, 2016.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[12] Fan Zheng and Gevorg Grigoryan. Sequence statistics of tertiary structural motifs reflect protein stability. PLoS ONE, 12(5):e0178272, 2017.

[13] JV Pratap, BF Luisi, and CR Calladine. Geometric principles in the assembly of α-helical bundles. Phil. Trans. R. Soc. A, 371(1993):20120369, 2013.

[14] Rhiju Das and David Baker. Macromolecular modeling with Rosetta. Annu. Rev. Biochem., 77:363–382, 2008.

[15] Adrian A Canutescu and Roland L Dunbrack.
Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Science, 12(5):963–972, 2003.

[16] Daniel J Mandell, Evangelos A Coutsias, and Tanja Kortemme. Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nature Methods, 6(8):551, 2009.

[17] Wouter Boomsma, Kanti V Mardia, Charles C Taylor, Jesper Ferkinghoff-Borg, Anders Krogh, and Thomas Hamelryck. A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences, 105(26):8932–8937, 2008.

[18] Thomas Hamelryck, John T Kent, and Anders Krogh. Sampling realistic protein conformations using local structural bias. PLoS Computational Biology, 2(9):e131, 2006.

[19] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[20] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[21] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.

[22] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.

[23] Jingxue Wang, Huali Cao, John ZH Zhang, and Yifei Qi. Computational protein design with deep learning neural networks.
arXiv preprint arXiv:1801.07130, 2018.

[24] Nathan Killoran, Leo J Lee, Andrew Delong, David Duvenaud, and Brendan J Frey. Generating and designing DNA with deep generative models. arXiv preprint arXiv:1712.06148, 2017.

[25] Hetunandan Kamisetty, Sergey Ovchinnikov, and David Baker. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences, 110(39):15674–15679, 2013.

[26] Thomas A Hopf, Charlotta PI Schärfe, João PGLM Rodrigues, Anna G Green, Oliver Kohlbacher, Chris Sander, Alexandre MJJ Bonvin, and Debora S Marks. Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife, 3:e03430, 2014.

[27] Mohammed AlQuraishi. End-to-end differentiable learning of protein structure. bioRxiv, page 265231, 2018.

[28] Babak Alipanahi, Nathan Krislock, Ali Ghodsi, Henry Wolkowicz, Logan Donaldson, and Ming Li. Determining protein structures from NOESY distance constraints by semidefinite programming. Journal of Computational Biology, 20(4):296–310, 2013.

[29] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.

[30] HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig, IN Shindyalov, and PE Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

[31] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[32] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[33] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding.
Journal of Optimization Theory and Applications, 169(3):1042–1068, June 2016.

[34] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. SCS: Splitting conic solver, version 2.0.2. https://github.com/cvxgrp/scs, November 2017.

[35] Richard A Engh and Robert Huber. Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica Section A, 47(4):392–400, 1991.

[36] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.

[37] Po-Ssu Huang, Yih-En Andrew Ban, Florian Richter, Ingemar Andre, Robert Vernon, William R Schief, and David Baker. RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS ONE, 6(8):e24109, 2011.

[38] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[39] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[40] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[41] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[43] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.

[44] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[46] Robbie P Joosten, Tim AH Te Beek, Elmar Krieger, Maarten L Hekkelman, Rob WW Hooft, Reinhard Schneider, Chris Sander, and Gert Vriend. A series of PDB related databases for everyday needs. Nucleic Acids Research, 39(suppl_1):D411–D419, 2010.

[47] Wolfgang Kabsch and Christian Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, 1983.