{"title": "Deep imitation learning for molecular inverse problems", "book": "Advances in Neural Information Processing Systems", "page_first": 4990, "page_last": 5000, "abstract": "Many measurement modalities arise from well-understood physical processes and result in information-rich but difficult-to-interpret data. Much of this data still requires laborious human interpretation. This is the case in nuclear magnetic resonance (NMR) spectroscopy, where the observed spectrum of a molecule provides a distinguishing fingerprint of its bond structure. Here we solve the resulting inverse problem: given a molecular formula and a spectrum, can we infer the chemical structure? We show that for a wide variety of molecules we can quickly compute the correct molecular structure, and can detect with reasonable certainty when our method fails. We treat this as a problem of graph-structured prediction, where, armed with per-vertex information on a subset of the vertices, we infer the edges and edge types. We frame the problem as a Markov decision process (MDP) and incrementally construct molecules one bond at a time, training a deep neural network via imitation learning, where we learn to imitate a subisomorphic oracle which knows which remaining bonds are correct. Our method is fast, accurate, and is the first among recent chemical-graph generation approaches to exploit per-vertex information and generate graphs with vertex constraints. Our method points the way towards automation of molecular structure identification and potentially active learning for spectroscopy.", "full_text": "Deep imitation learning for\nmolecular inverse problems\n\nEric Jonas\n\nDepartment of Computer Science\n\nUniversity of Chicago\nericj@uchicago.edu\n\nAbstract\n\nMany measurement modalities arise from well-understood physical processes and\nresult in information-rich but difficult-to-interpret data. Much of this data still\nrequires laborious human interpretation. 
This is the case in nuclear magnetic resonance (NMR) spectroscopy, where the observed spectrum of a molecule provides a\ndistinguishing fingerprint of its bond structure. Here we solve the resulting inverse\nproblem: given a molecular formula and a spectrum, can we infer the chemical\nstructure? We show that for a wide variety of molecules we can quickly compute the\ncorrect molecular structure, and can detect with reasonable certainty when our\nmethod fails. We treat this as a problem of graph-structured prediction where,\narmed with per-vertex information on a subset of the vertices, we infer the edges\nand edge types. We frame the problem as a Markov decision process (MDP) and incrementally construct molecules one bond at a time, training a deep neural network\nvia imitation learning, where we learn to imitate a subisomorphic oracle which\nknows which remaining bonds are correct. Our method is fast, accurate, and is\nthe first among recent chemical-graph generation approaches to exploit per-vertex\ninformation and generate graphs with vertex constraints. Our method points the\nway towards automation of molecular structure identification and active learning\nfor spectroscopy.\n\n1 Introduction\n\nUnderstanding the molecular structure of unknown and novel substances is a long-standing problem\nin chemistry, and is frequently addressed via spectroscopic techniques, which measure how materials\ninteract with electromagnetic radiation. One type of spectroscopy, nuclear magnetic resonance (NMR) spectroscopy, measures properties of individual nuclei in the molecule. This information\ncan be used to determine the molecular structure, and indeed this is a common, if laborious, exercise\ntaught to organic chemistry students.\n\nAlthough not usually conceived of as such, this is actually an inverse problem. 
Inverse problems\nare problems where we attempt to invert a known forward model y = f (x) to make inferences\nabout the unobserved x from measurements y. Inverse problems are at the heart of many important\nmeasurement modalities, including computational photography [31], medical imaging [5], and\nmicroscopy [22]. In these cases, the measurement model is often linear, that is f (x) = Ax, and the\nrecovered x is a vector in a high-dimensional vector space (see [24] for a review). Crucially, A is\nboth linear and well-known, as a result of extensive hardware engineering and calibration.\n\nNonlinear inverse problems have a non-linear forward model f , and structured inverse problems\nimpose additional structural constraints on the recovered x, like being a graph or a permutation matrix.\nSpectroscopy is an example of this sort of problem, where we wish to recover the graph structure of\na given molecule from its spectroscopic measurements. Yet we still know f very well, in the case\nof NMR, due to decades of expertise in quantum physics and computational chemistry, especially\ntechniques like Density Functional Theory (DFT) [23, 10].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: a. The forward problem is how to compute the spectrum of a molecule (right) given its\nstructure (left). Spectroscopists seek to solve the corresponding inverse problem, working backward\nfrom spectra towards the generating structure. b. Various properties measured in a spectrum, including\nthe chemical shift value and the degree to which each peak is split into multiplets.\n\nHere the inverse problem we tackle is structured as follows: given a collection of per-atom (vertex)\nmeasurements, can we recover the molecular structure (labeled edges)? 
Note we seek a single,\ncorrect graph \u2013 unlike recent work in using ML techniques for molecules, here there is one true,\ncorrect edge set (up to isomorphism) that we are attempting to find. We formulate the problem as a\nMarkov decision process, and use imitation learning to train a deep neural network to sequentially\nplace new bonds until the molecule is complete. The availability of a fast approximate forward\nmodel1 f enables us to rapidly check if our recovered structure matches experimental data, allowing\nus more confidence in the veracity of our recovered structure.\n\nIn this paper, we briefly review the relevant graph-theoretic connections to chemistry and NMR\nspectroscopy, explain the sequential problem setup, our dataset, and how the resulting Markov\ndecision process (MDP) can be solved via imitation learning. We evaluate on both synthetic and real\ndata, showing that the resulting problem is invertible, is reasonably well-posed in the face of input\ndata perturbations, and crucially performs well on experimentally-observed data for many molecules.\nWe assess how improvements to both the approximate forward model f and the inverse that we\npropose here can be made, and suggest avenues for future research.\n\n1.1 Basic chemistry\n\nWe can recall from early education that a molecule consists of a collection of atoms of various\nelements, with certain types of \"bonds\" between those atoms. This corresponds to the following definition.\n\nA molecule is:\n\n1. A collection of vertices (atoms) V = {vi}, each vertex having a known color (that is,\ncorresponding to a particular element) and maximum vertex degree, corresponding to the\nvalence of that element.\n\n2. A set of edges E = {eij} between the vertices (vi, vj) \u2208 V each with an associated edge\nlabel c corresponding to their bond order. The bond orders we are interested in are labeled\n{1, 1.5, 2, 3} corresponding to single, aromatic, double, and triple bonds.\n\n3. 
A graph G = (V, E) which is connected, simple, and obeys a color-degree relationship by\nwhich the weights of the edges connected to a vertex vi almost always sum to a known value\ndependent on that vertex\u2019s color.\n\nOur measurement technique, NMR spectroscopy, lets us observe various per-vertex properties, P (vi),\nwhich we can then featurize. Note we only observe P (vi) for a subset of nuclei, and will refer to the\nset of all of these observations as P .\n\nWe begin with a known molecular formula, such as C8H10N4O2, obtained via stoichiometric calculations or an alternative technique such as mass spectroscopy. In our application of NMR spectroscopy,\nthere are two sources of per-vertex information measured (fig. 1b) in an experiment: a unique per-nucleus resonance frequency, the chemical shift, that arises as a function of the local electronic\nenvironment, and peak splitting, which results from spin-spin coupling between a nucleus and its\nneighbors. In this work we focus on 13C spectroscopy and thus only observe these properties for\nthe carbon atoms; all other nuclei yield no spectroscopic information. In this case, the splitting then\nreflects the number of adjacent hydrogens bonded to the carbon. Thus, our problem ultimately is as\nfollows: based upon a collection of per-vertex measurements we wish to know the label (if any)\nof the edge eij between all vertex pairs (vi, vj).\n\n1 We use this model throughout the rest of the paper and will treat it as \"correct\", and thus will frequently drop the \"approximate\" designation.\n\n2 Sequential molecule assembly via imitation learning\n\nWe formulate our problem as a Markov decision process (MDP), where we start with a known set of\nvertices V and their observed properties P , and sequentially add bonds until the molecule is connected\nand all per-vertex constraints are met. Let Gk = (V, P, Ek) represent the kth state of our model,\nwith k existing edges Ek, k \u2208 [0, . . . , K]. 
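The per-vertex valence constraints that terminate this assembly process can be checked directly. The following is an illustrative Python sketch (not the paper's code); the valences for H, C, N, O are standard chemistry, and the bond-order labels are the four used in the text:

```python
# Hypothetical sketch of the labeled-graph representation: vertices are colored
# by element, edges carry bond orders, and the bond orders incident to a vertex
# must sum to that element's valence (the "color-degree relationship").
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}
BOND_ORDERS = (1.0, 1.5, 2.0, 3.0)  # single, aromatic, double, triple

def valence_satisfied(elements, edges):
    """elements: one element symbol per vertex.
    edges: dict mapping (i, j) with i < j to a bond order."""
    degree = [0.0] * len(elements)
    for (i, j), order in edges.items():
        assert order in BOND_ORDERS
        degree[i] += order
        degree[j] += order
    return all(degree[i] == VALENCE[el] for i, el in enumerate(elements))

# Methane: one carbon single-bonded to four hydrogens.
methane = (["C", "H", "H", "H", "H"],
           {(0, 1): 1.0, (0, 2): 1.0, (0, 3): 1.0, (0, 4): 1.0})
```

Aromatic bonds (order 1.5) are why the constraint holds only "almost always" as stated above; a practical checker would special-case aromatic rings.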
We seek to learn the function p(ek+1 = ei,j,c|Ek, V, P )\nassigning a probability to each possible new edge between vi and vj (with label c) in Gk+1.\n\nTo generate a single molecule (given spectral observations P and per-vertex information V ) we\nbegin with an empty edge set E0 and sequentially sample edges ei,j,c \u223c p(ei,j,c|Ek, V, P ) until we\nhave placed the correct number of edges necessary to satisfy all the valence constraints. We call\nthis resulting edge set (molecule) a candidate structure. For a given spectrum P , we can repeat this\nprocess N times, generating N (potentially different) candidate structures, {E_K^(i)}, i = 1, . . . , N. We can then\nevaluate the quality of these candidate structures by measuring how close their predicted spectra\nPi = f (E_K^(i)) are to the true observations P , that is,\n\nE_predicted = argmin_i ||f (E_K^(i)) \u2212 P ||^2\n\nThis is crucial \u2013 our sequential molecule generation process can produce a large number of candidates\nbut the presence of the fast forward model can rapidly filter out incorrect answers.\n\n2.1 Imitation learning\n\nFor any partially completed molecule Gk = (V, P, Ek) we seek out the single correct edge set EK .\nWe can use our ability to efficiently compute graph subisomorphism as an oracle \u2013 at training time\nwe compute which individual edges ei,j,c we could add to Ek, forming Ek \u222a {ei,j,c}, such that Gk+1 would still be\nsubisomorphic to GK (details of this subisomorphism calculation are provided in appendix 6.2).\n\nWe generate a large number of candidate (Gk, Ek+1) training pairs, and fit a function approximator\nto p(ei,j,c|Ek, V, P ). To generate these training pairs, we can take a known molecule GK with\nobserved properties P and delete d edges randomly, yielding state GK\u2212d (see figure 3). 
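The generate-then-filter procedure above can be sketched as follows. This is hypothetical illustrative code, not the paper's implementation: `sample_next_edge` stands in for the learned policy p(ei,j,c|Ek, V, P) and `forward_model` for the fast approximate f:

```python
def spectral_error(E, P, forward_model):
    # || f(E) - P ||^2, the spectral reconstruction error of one candidate
    pred = forward_model(E)
    return sum((a - b) ** 2 for a, b in zip(pred, P))

def recover_structure(V, P, sample_next_edge, forward_model,
                      n_candidates=128, n_edges=10):
    """Sample candidate edge sets with the policy, then keep the candidate
    whose predicted spectrum best matches the observations P.
    `n_edges` stands in for the valence-determined edge count."""
    candidates = []
    for _ in range(n_candidates):
        E = frozenset()  # E_0: start with no edges
        for _ in range(n_edges):  # place one (i, j, label) edge at a time
            E = E | {sample_next_edge(E, V, P)}
        candidates.append(E)
    return min(candidates, key=lambda E: spectral_error(E, P, forward_model))
```

In practice the inner loop would terminate when all valence constraints are met rather than after a fixed edge count.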
Note this is a\ncrucial difference between our problem and previous molecule generation work: at each point when\nassembling a molecule, there is a finite, known set of valid next edges, as we are seeking a single\ncorrect structure.\n\nThere are many ways to sample the number of deleted edges d and a pathological choice may\nyield distributional shift issues, where sequentially-assembled molecules during the recovery phase\nmight look significantly different from our training molecules. We did not observe this in practice,\nalthough adoption of distributional-shift techniques such as SEARN [6] or DAGGER [25] is worth\nexploring in the future. Here we sample d such that the fraction of remaining edges is distributed as\na mixture of uniform and beta distributions, d \u223c 0.2 \u00b7 Unif(0, 1) + 0.7 \u00b7 Beta(3, 3).\n\n2.2 Dataset\n\nComputing the spectrum from a given complete molecule G requires computationally-expensive\nquantum-mechanical (DFT) calculations, and there is a paucity of high-quality NMR experimental\ndata available. We leverage the authors\u2019 previous work on learning fast neural-network-based\napproximations for spectra to address this problem [?]. Our original fast forward model was trained\non a small number of experimentally-available spectra, and then that model was used to synthesize\nthe spectra of a very large number of candidate molecular structures.\n\nFigure 2: a.) Our input is a molecular formula indicating the number of atoms (vertices) of each\nelement (color) along with per-vertex properties P measured for a subset of those vertices (in this\ncase, carbon nuclei). b.) We sequentially construct a molecule by sampling the next edge (i, j) and\nedge label c conditioned on the existing edge set as well as vertices V and vertex properties P . 
c.)\nWe end up with a sampled collection of candidate molecules, which we can then pass back through\nour forward model to compute their spectra, and validate against the true (observed) spectrum.\n\nWe obtain molecular structures consisting of H, C, O, and N atoms by downloading a large fraction\nof the public database PubChem []. We select molecules with no more than 64 atoms total and at\nmost 32 heavy atoms (C, O, and N). We preprocess this dataset to eliminate excessively small (<5\natoms) rings and excessively large rings (>8 atoms), as well as radicals, as they were absent in the\ninitial dataset that trained our approximate forward model. We take care to ensure that no molecule is\nduplicated in our database, which would contaminate our train/test split. This leaves us with 1,268,461 molecules,\nof which we train on 819,200 and draw the test set from the remainder.\n\n2.3 Network\n\nOur network assigns a probability to each possible edge and edge-label given the current partially-completed graph, p(ei,j,c|Ek, V, P ). The network itself is a cascaded series of layers which combine\nper-edge and per-vertex information, sometimes called a \"relational network\" and nicely summarized\nin [2]. Each layer k of the network transforms per-vertex v(k) and per-edge features e(k) into per-vertex v(k+1) and per-edge e(k+1) output features, see figure 3. Crucially, the network combines\npairwise vertex features with the associated edge features, e\u2032_ij = \u03c6e(e_ij + v\u2032_i + v\u2032_j), when computing\nthe next edge feature, and computes an aggregate of all edges associated with a vertex, ve_i = max_j e\u2032_ij.\nWe adopt a residual structure as we find it easier to train, and our \u03c6 functions perform batch normalization\nbefore ReLU operations. We found that a recurrent GRU cell was the easiest to train efficiently when\ncombining input per-vertex features with the aggregated edge features via G(ve_i, v_i). 
Note our network\nat no point depends on the ordering of the input per-vertex or per-edge features, thus maintaining\npermutation-invariance.\n\nThe first layer takes in per-vertex properties including element type, typical valence, and others (see\nappendix 6.1) as well as the experimentally-observed per-vertex measurements for the relevant nuclei,\nincluding one-hot-encoded (binned) chemical shift value and peak splitting. We also featurize the\ncurrent state of the partially-assembled molecule as a one-hot-encoded adjacency matrix, where\ne(in)_i,j,c = 1 if there is an edge labeled c between vi and vj and 0 otherwise.\nWe cascade 32 relational layers with an internal feature dimensionality of 256, discarding the final\nlayer\u2019s per-vertex information. Each output edge is passed through a final 1024-dimensional fully-connected layer to reduce its dimensionality down to 4, outputting a one-hot-encoded edge\nweight for each of the four possible edge labels. We normalize the resulting outputs to give us the\nfinal p(ei,j,c|Ek, V, P ).\n\nWe train using binary cross-entropy loss between the allowed possible next edges and the predicted\nedges of the network. 
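To make the layer structure concrete, here is a minimal numpy sketch of one such relational layer. It is illustrative only: plain linear maps and ReLU stand in for the paper's batch-normalized \u03c6 functions and GRU-based combination G, and the dimensions are arbitrary. It does exhibit the permutation property claimed above:

```python
import numpy as np

def relational_layer(v, e, Wv, We):
    # v: (n, d) per-vertex features; e: (n, n, d) per-edge features.
    vp = v @ Wv                                  # v'_i, a linear map of v_i
    # e'_ij combines the edge feature with both endpoint vertex features
    ep = np.maximum(e @ We + vp[:, None, :] + vp[None, :, :], 0.0)
    ve = ep.max(axis=1)                          # max-aggregate incident edges
    v_out = np.maximum(ve, 0.0) + v              # simplified residual vertex update
    e_out = ep + e                               # residual edge update
    return v_out, e_out
```

Relabeling the vertices permutes the outputs identically (permutation equivariance), so the network's final edge probabilities never depend on input ordering.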
We use Adam with a learning rate of 10\u22124 and train the network for 100 hours\non an NVIDIA V100 GPU.\n\nOne layer of the relational network (figure 3):\nInput: v(in)_i, e(in)_ij. Output: v(out)_i, e(out)_ij.\n  e_ij \u2190 Lk_e(e(in)_ij)\n  v\u2032_i \u2190 Lk_v(v(in)_i)\n  e\u2032_ij \u2190 \u03c6e(e_ij + v\u2032_i + v\u2032_j)\n  ve_i \u2190 max_j e\u2032_ij\n  vc_i \u2190 \u03c6v(G(ve_i, v(in)_i))\n  v(out)_i \u2190 vc_i + v(in)_i\n  e(out)_ij \u2190 e\u2032_ij + e(in)_ij\n\nFigure 3: Each layer k of the network transforms per-vertex v(in) and per-edge features e(in) into\nper-vertex v(out) and per-edge e(out) output features. At train time we take true graphs, randomly\ndelete a subset of edges, and exactly compute which single edges could be added back into the graph\nand maintain subisomorphism. We minimize the binary cross-entropy loss between the output of our\nnetwork and this matrix of possible next edges.\n\n2.4 Related work\n\nEarly work on function approximation and machine learning techniques for inverse problems, including neural networks [16], focused primarily on one-dimensional problems. 
More recent work has\nfocused on enhancing and accelerating various sparse linear recovery problems, starting with [9].\nThese approaches can show superior performance to traditional reconstruction and inversion methods\n[29], but a large fraction of this may be due to learning better data-driven regularization [24].\n\nDeep learning methods for graphs have attracted considerable interest over the past several years,\nbeginning with graph convolutional kernels [15] and continuing to this day (see [2] for a good unifying\nview). Many of these advances have been applied to chemical problems, including trying to accurately\nestimate whole molecule properties [28], find latent embeddings of molecules in continuous spaces\n[8], and optimize molecules to have particular scalar properties [11]. These approaches have focused\non exploiting the latent space recovered by GANs [7] and variational autoencoders [26]. Crucially\nthese molecule generation approaches have largely focused on optimizing a single scalar property\nby optimizing in a learned latent space, which has made constraints like the precise number and\nelements of atoms difficult to enforce.\n\nOther approaches have attempted to sequentially construct graphs piecemeal [21, 30] using RNNs\nand deep Q-learning, but here we have an oracle policy that is known to be correct: our ability to\nexplicitly calculate subisomorphism between Ek+1 = eijc \u222a Ek and the final GK. This lets us reduce\nthe problem to learning to imitate this oracle, a substantially easier task.\n\nRecent work on structured prediction via imitation learning, sometimes termed learning to search [6],\nis a strong inspiration for this work (see [3, 4] for recent results and a good review). 
Our work\nextends these approaches to dense connected graphs via an explicit subisomorphism oracle.\n\nFinally, the use of machine learning methods for chemical inverse problems has a long history, especially with\nregard to other types of spectroscopy, such as mass spectrometry. One of the first expert systems, DENDRAL [20],\nwas conceived in the early 1960s to elucidate chemical structures from mass spectrometry data,\nwhich fragments molecules and measures the charge/mass ratio of the resulting fragments. Today,\ncommercial software exists [1] to perform chemical structure determination from two-dimensional\nNMR data, a richer but substantially slower form of NMR data acquisition, which measures both\nchemical shift values and scalar coupling constants between nuclei, which can directly inform bond\nstructure. Our focus on 1-D NMR has the potential to dramatically accelerate this structure discovery\nprocess.\n\nFigure 4: Example recovered structures. The left-most column is the true structure and associated\nSMILES string, and the right are the candidate structures produced by our method, ranked in order of\nspectral reconstruction error (number below). Color indicates correct SMILES string. The top row\nshows a molecule for which the correct structure was recovered, and the bottom row is an example of\na failure to identify a structure.\n\n3 Evaluation\n\nOur model samples candidate structures for a given set of observations using our trained policy,\nand then we check each of those structures\u2019 agreement with the observed data using our fast forward\nmodel. We call this error in reconstruction the spectral reconstruction error for candidate edge set\nE_K^(i):\n\nSe = ||f (E_K^(i)) \u2212 P ||^2\n\nWhile training we attempt to generate graphs with exact isomorphism \u2013 that is, the recovered\ngraph is isomorphic to the observed graph, including all per-vertex properties. 
This is excessively\nstringent for evaluation, as the experimental error for chemical shift values can be as high as 1\nppm; thus, experimentally, a carbon with a shift of 115.2 and a carbon with a shift of 115.9 are\nindistinguishable. Thus, we relax our definition and conclude a recovered molecule is correct if it agrees with the true molecule,\nirrespective of exact shift agreement, according to a canonicalized string representation. Here we use\nSMILES strings [27], which are canonical string representations of molecular structure with extensive\nuse in the literature. SMILES strings are a more realistic definition of molecular equivalence. We use\nthe SMILES generation code in RDKit [19], a popular Python-based chemistry package.\n\n3.1 Inversion, identifiability, noise sensitivity\n\nFirst we assess our ability to simply recover the correct molecular structure given observed data,\noperating entirely on simulated data, using the exact spectra produced by our forward model f .\nThis noise-free case represents the best case for our model, and lets us assess the degree to which\nour system works at all \u2013 that is, is the combination of chemical shift data and peak splitting data\nsufficient to uniquely recover molecular structure?\n\nFor each spectrum in the test set our model generates 128 candidate structures and we measure the\nfraction which are SMILES-correct. In figure 4 we show example spectra and recovered molecules,\nboth correct and incorrect (see appendix 6.3 for more). Since in this case we are using spectral\nobservations computed by our forward model, the spectral reconstruction error should be exactly 0.0\nfor the correct structure, and all incorrect structures have substantially larger spectral errors when\npassed through the forward model.\n\nFor each test molecule, what fraction of the 128 generated candidate structures are correct, irrespective\nof how well the forward model matches? 
This is a measure of how often we produce the correct\nstructure, and in the noiseless case represents our best-case performance. Figure 5a shows the fraction of correct candidate\nstructures sorted in decreasing order; we can see that 90.6% of molecules in the test set have at least\none generated structure that is correct. We then pass each of these candidate structures back through\nthe forward model and evaluate the MSE between the observed spectrum for the true molecule and\nthe spectra of the candidates.\n\nFigure 5c (AUC by training condition and evaluation data):\ndata \\ model: no noise | jitter | ref noise\nno noise: 0.994 | 0.992 | 0.991\njitter (\u03c3 = 0.1): 0.994 | 0.992 | 0.991\njitter (\u03c3 = 1.0): 0.868 | 0.928 | 0.935\nref noise: 0.862 | 0.898 | 0.916\n\nFigure 5: Synthetic data: a. Distribution of candidate molecules that are correct, before filtering,\nfor the noiseless case. For 90.6% of the test molecules we generate at least one correct structure. b.\nDistribution of reconstruction errors for correct and incorrect candidate structures, showing a clear\nseparation. c. Jitter adds random noise to each shift value independently, whereas reference noise\nshifts the entire spectrum by a random amount. We train models on each class of noise and evaluate\non each class of data. The performance as measured by AUC minimally degrades when the model\nis trained with the equivalent noisy (augmented) data.\n\n
Figure 5b shows the distribution of reconstruction error for the correct\nstructures and for the incorrect structures; a clear separation is visible, suggesting the existence of a\nthreshold that can be used to reliably distinguish correct from incorrect structures.\n\n3.2 Well-posedness\n\nInverse problems are often ill-posed in the sense of Hadamard [14], where the solution can vary wildly\nwith slight perturbations in the input. To evaluate the ill-posedness of our model, we again generate data\nwith our forward model, but perturb these spectra at evaluation time with per-chemical-shift\njitter and reference noise.\n\nPer-shift jitter can arise in SNR-limited applications when precise measurement of the chemical\nshift peak is difficult, and we model it here as the addition of Gaussian noise, training with \u03c3 = 0.5\nand evaluating with \u03c3 = {0.1, 1.0}. Reference noise can arise when the reference frequency is\nmis-calibrated or mis-measured and manifests as an offset applied to all observed chemical shift\nvalues. Here we model this as a uniform shift on [\u22122, 2] ppm.\n\nAs we have added noise to the observed data at evaluation time, in this case our spectral reconstruction\nerror Se will never be zero \u2013 even the candidate structure with the lowest Se will not match the\nforward model exactly. We can set a threshold for this reconstruction error, above which we will\nsimply choose not to suggest a solution \u2013 in this case our model is effectively saying \"I don\u2019t know,\nnone of these candidate structures fit well enough\". For a given evaluation set of molecules, we\ncan vary this threshold \u2013 when it is very low, we may only suggest solutions for a small number of\nmolecules (but they are more likely to be correct). We can then vary this threshold over the range of\nvalues, and compute the resulting area under the curve (AUC). 
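This accept-or-abstain thresholding can be sketched as follows (illustrative code, not the paper's evaluation scripts): sort molecules by their best candidate's spectral reconstruction error Se, and report accuracy only on the fraction we choose to answer:

```python
import numpy as np

def accuracy_at_coverage(errors, correct, coverage):
    """Accuracy over the `coverage` fraction of molecules with the smallest
    spectral reconstruction error; the rest are answered "I don't know"."""
    errors = np.asarray(errors, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(errors)                    # most confident first
    k = max(1, int(round(coverage * len(errors))))
    return float(correct[order[:k]].mean())
```

Sweeping `coverage` from 0 to 1 traces the accuracy-versus-coverage curve whose area is the AUC reported in the tables.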
We use this threshold-based confidence\napproach throughout the remainder of the paper.\n\nWe train our model with no noise in the training data, with training data augmented with just\nper-shift jitter, and with reference noise and per-shift jitter. We evaluate the resulting models on\nevaluation datasets with no noise, jitter at two values, and reference noise (figure 5c).\n\nTable 1 (evaluation data drawn from forward/inverse train or test sets):\nforward | inverse | AUC | top 10% | top 20% | top 50% | top-1 | top-5\ntrain | train | 0.95 | 100.0% | 100.0% | 99.2% | 79.5% | 79.9%\ntrain | test | 0.88 | 100.0% | 99.0% | 96.9% | 71.3% | 74.6%\ntest | train | 0.91 | 100.0% | 100.0% | 97.6% | 60.3% | 63.3%\ntest | test | 0.83 | 100.0% | 97.1% | 90.2% | 55.9% | 58.6%\n\nTable 1: Experimental data: Evaluation on varying degrees of unseen molecules. \"Train\" and \"test\"\nindicate whether the forward or inverse model was trained on the evaluation set or not. (\"test\", \"test\")\nrepresents molecules completely unseen at both training and test time, and reflects the most realistic\nscenario. \"Threshold accuracy\" reflects the SMILES accuracy of the top n% of molecule-prediction\nmatches, by spectral MSE. Top-1 and top-5 are average correctness across the entire dataset of the\ntop-1 and top-5 candidate structures.\n\nWe can see from the table that training in the presence of noise can substantially improve noise\nrobustness, and reference artifacts are more damaging to our recovery AUC, even when we train\nthe model with them. Thus while our model is robust to some classes of perturbations, the precise\nchemical shift values do matter. 
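The two perturbation classes used in section 3.2 are simple to state in code. The following is an illustrative sketch (not the paper's code): per-shift jitter adds independent Gaussian noise to each chemical shift, while reference noise adds one shared uniform offset to the whole spectrum:

```python
import random

def perturb_spectrum(shifts, jitter_sigma=0.0, ref_range=0.0, rng=random):
    """Apply per-shift Gaussian jitter (std jitter_sigma) and a single
    reference offset drawn uniformly from [-ref_range, ref_range] ppm."""
    offset = rng.uniform(-ref_range, ref_range)  # shared by all shifts
    return [s + rng.gauss(0.0, jitter_sigma) + offset for s in shifts]
```

With `jitter_sigma=0.5` this reproduces the training-time jitter augmentation described above; with `ref_range=2.0` it reproduces the reference-noise condition.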
Systematic exploration of this sensitivity is an obvious next step for\nfuture work.\n\n4 Experimental data\n\nAll previous results have used spectra generated from our fast forward model. Now we compare\nwith actual experimentally-measured spectra. We use NMRShiftDB [18] for all data. The original\nforward model was trained on a small number of experimentally-available spectra, and then that\nmodel was used to synthesize the spectra of a very large number of molecular structures from PubChem.\nThis gives us four different classes of experimental molecules to evaluate: molecules from the train\nand test sets of the forward model, and molecules whose structures were used in training the inverse\nnetwork or not.\n\nWe evaluate this performance in table 1. We measure the AUC as we vary the \"reconstruction\nerror\" threshold, and look at the accuracy of the top 10%, 20%, and 50% molecules with the lowest\nreconstruction error.\n\nFor molecules whose structure was never seen by either the approximate forward model or the inverse\nmodel, we achieve an AUC of 0.83 and a top-1 accuracy of 55.9%. We can select an Se threshold\nwhich gives a prediction for 50% of the molecules, and in this case we recover the correct structure\n90.2% of the time.\n\nWe can use the other forward/inverse pairings to sanity check these results and understand the relative\nimportance of molecular structure and chemical shift prediction between the forward and the inverse\nmodel. Unsurprisingly, when the structure was known to both the forward and the inverse model, we\nachieve the best performance, even though the inverse model never observed the empirical shifts,\nonly the simulated shifts from the forward model. AUC drops by 0.07-0.08 between train and test\nsets for the inverse model, regardless of whether or not the molecule was in forward model training or\ntest. 
Similarly, AUC drops by 0.04-0.05 between forward model train and test, regardless of whether\nthe inverse model had seen the molecule or not. This is suggestive of a consistent independent role\nfor both forward model accuracy and inverse model performance, and suggests that improving either\nwill ultimately yield improved reconstruction performance.\n\n5 Discussion\n\nHere we have demonstrated the capability to reconstruct entire molecular graphs from per-vertex\nspectroscopic information via deep imitation learning. This inverse problem is made viable by the\non-line generation of an enormous amount of training data, enabled by the construction of a fast\nforward model to quickly compute these per-vertex properties, and efficient computation of exact\ngraph subisomorphism.\n\nOur approach is very general for molecular spectroscopy problems, but many steps and extensions\nremain before we can completely automate the process of chemical structure identification. While\nour approach gives us the ability to evaluate the quality of our candidate structures (by passing them\nback through the forward model and computing MSE), this approach is not perfect, and there are\nnumerous avenues for improvement, both to expand the fraction of molecules we confidently predict\nand the resulting accuracy.\n\nFirst, we began by focusing on molecules with the four most common organic elements (H, C, O, and\nN) but plan to extend our method to other elements. One challenge is that chemical\nshift prediction for NMR spectra is less accurate for other important organic elements like S, P, F,\nand Cl. 
Even best-in-class density functional theory calculations struggle to accurately compute ab initio chemical shift values for heavier atoms such as metals, and for molecules with unusual charge distributions such as radicals [17].

Here we focus entirely on 13C NMR, but 1H NMR has a substantially higher signal-to-noise ratio (SNR) due to the near-100% natural abundance of the spin-active hydrogen isotope. However, the chemical shift range for hydrogen is substantially smaller (∼9 ppm) and much more variable. Nonetheless, a high-quality forward model approximation for hydrogen spectra would let us easily incorporate this information into our model, and would potentially improve performance.

Like the majority of machine learning approaches to chemistry, by focusing entirely on topology we ignore substantial geometric effects, both stereochemical and conformational. As different stereoisomers can have radically different biological properties, correctly identifying stereoisomerism is an important challenge, especially in natural products chemistry. Incorporating this explicit geometry into both our forward model approximation and our inverse recovery is an important avenue for future work.

More complex forms of NMR are capable of elucidating more elements of the coupled spin system created by a molecule, giving additional information about atomic bonding. As noted above, this is the basis for the (substantially slower) techniques of 2D NMR. Incorporating this style of data into our model is relatively straightforward, assuming we can develop a forward model that accurately predicts the parameters of the full spin Hamiltonian, and it has the potential to extend our performance to substantially larger molecules.

Finally, we see tremendous potential in using techniques like the one outlined here to make the spectroscopic process active.
As the NMR spectrometer is fully programmable, it may be possible to use the list of uncertain candidate structures to compute successive follow-on experiments to more confidently identify the true structure. This style of active instrumentation will require very rapid forward and inverse models, toward which we believe models like those presented in this work are an important first step.

References

[1] David C. Adams and Dimitris Argyropoulos. A review of Computer Assisted Structure Elucidation (CASE) methodology. ENC, 2018.

[2] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. 2018.

[3] Kai-Wei Chang, He He, Hal Daumé, John Langford, and Stephane Ross. A Credit Assignment Compiler for Joint Prediction. In Advances in Neural Information Processing Systems, 2016.

[4] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, and John Langford. Learning to Search Better than Your Teacher. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[5] Chen Chen and Junzhou Huang. Compressive Sensing MRI with Wavelet Tree Sparsity. In Advances in Neural Information Processing Systems 25, 2012.

[6] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

[7] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular
graphs. 2018.

[8] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science, 4(2):268–276, 2018.

[9] Karol Gregor and Yann LeCun. Learning Fast Approximations of Sparse Coding. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[10] Felix Hoffmann, Da Wei Li, Daniel Sebastiani, and Rafael Brüschweiler. Improved Quantum Chemical NMR Chemical Shift Prediction of Metabolites in Aqueous Solution toward the Validation of Unknowns. Journal of Physical Chemistry A, 121(16):3071–3078, 2017.

[11] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2018.

[12] Eric Jonas and Stefan Kuhn. Rapid prediction of NMR spectral properties with quantified uncertainty. Journal of Cheminformatics, 11(1):1–7, 2019.

[13] Alpár Jüttner and Péter Madarasi. VF2++ — An improved subgraph isomorphism algorithm. Discrete Applied Mathematics, 242:69–81, 2018.

[14] S. I. Kabanikhin. Definitions and examples of inverse and ill-posed problems. Journal of Inverse and Ill-Posed Problems, 16(4):317–357, 2008.

[15] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. 2016.

[16] Kitamura and Qing. Neural network application to solve Fredholm integral equations of the first kind. In International Joint Conference on Neural Networks. IEEE, 1989.

[17] Leonid B. Krivdin.
Calculation of 15N NMR chemical shifts: Recent advances and perspectives. Progress in Nuclear Magnetic Resonance Spectroscopy, 102-103:98–119, 2017.

[18] Stefan Kuhn, Björn Egert, Steffen Neumann, and Christoph Steinbeck. Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. BMC Bioinformatics, 9:400, 2008.

[19] Greg Landrum. RDKit: Open-source cheminformatics, 2006.

[20] J. Lederberg. How DENDRAL was conceived and born. In Proceedings of the ACM Conference on the History of Medical Informatics, pages 5–19, 1987.

[21] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning Deep Generative Models of Graphs. 2018.

[22] Hsiou-Yuan Liu, Eric Jonas, Lei Tian, Jingshan Zhong, Benjamin Recht, and Laura Waller. 3D imaging in volumetric scattering media using phase-space measurements. Optics Express, 23(11):14461, 2015.

[23] Michael W. Lodewyk, Matthew R. Siebert, and Dean J. Tantillo. Computational prediction of 1H and 13C chemical shifts: A useful tool for natural product, mechanistic, and synthetic organic chemistry. Chemical Reviews, 112(3):1839–1862, 2012.

[24] Michael T. McCann, Kyong Hwan Jin, and Michael Unser. A Review of Convolutional Neural Networks for Inverse Problems in Imaging. 2017.

[25] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS, pages 627–635, 2011.

[26] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. 2017.

[27] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Modeling, 28(1):31–36, 1988.

[28] Zhenqin Wu, Bharath Ramsundar, Evan N.
Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science, 9:513–530, 2017.

[29] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Deep ADMM-Net for Compressive Sensing MRI. In Advances in Neural Information Processing Systems, 2016.

[30] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. 2018.

[31] Changyin Zhou and Shree K. Nayar. Computational cameras: Convergence of optics and processing. IEEE Transactions on Image Processing, 20(12):3322–3340, 2011.