{"title": "Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 7113, "page_last": 7124, "abstract": "Deep generative models have achieved remarkable success in various data domains, including images, time series, and natural languages. There remain, however, substantial challenges for combinatorial structures, including graphs. One of the key challenges lies in the difficulty of ensuring semantic validity in context. For example, in molecular graphs, the number of bonding-electron pairs must not exceed the valence of an atom; whereas in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. These constraints are not easy to be incorporated into a generative model. In this work, we propose a regularization framework for variational autoencoders as a step toward semantic validity. We focus on the matrix representation of graphs and formulate penalty terms that regularize the output distribution of the decoder to encourage the satisfaction of validity constraints. Experimental results confirm a much higher likelihood of sampling valid graphs in our approach, compared with others reported in the literature.", "full_text": "Constrained Generation of Semantically Valid\n\nGraphs via Regularizing Variational Autoencoders\n\nTengfei.Ma1@ibm.com,\n\n{chenjie,cxiao}@us.ibm.com\n\nCao Xiao\n\nTengfei Ma\u2217\n\nJie Chen\u2217\nIBM Research\n\nAbstract\n\nDeep generative models have achieved remarkable success in various data domains,\nincluding images, time series, and natural languages. There remain, however,\nsubstantial challenges for combinatorial structures, including graphs. 
One of\nthe key challenges lies in the dif\ufb01culty of ensuring semantic validity in context.\nFor example, in molecular graphs, the number of bonding-electron pairs must not\nexceed the valence of an atom; whereas in protein interaction networks, two proteins\nmay be connected only when they belong to the same or correlated gene ontology\nterms. These constraints are not easy to be incorporated into a generative model.\nIn this work, we propose a regularization framework for variational autoencoders\nas a step toward semantic validity. We focus on the matrix representation of graphs\nand formulate penalty terms that regularize the output distribution of the decoder\nto encourage the satisfaction of validity constraints. Experimental results con\ufb01rm a\nmuch higher likelihood of sampling valid graphs in our approach, compared with\nothers reported in the literature.\n\n1\n\nIntroduction\n\nThe recent years have witnessed rapid progress in the development of deep generative models for a\nwide variety of data types, including continuous data (e.g., images [39]) and sequences (e.g., time\nseries [38] and natural language sentences [18]). Representative methods, including generative\nadversarial networks (GAN) [16] and variational autoencoders (VAE) [25], learn a distribution\nparameterized by deep neural networks from a set of training examples. Amid the tremendous\nprogress, deep generative models for combinatorial structures, particularly graphs, are less mature.\nThe dif\ufb01culty, in part, is owing to the challenge of an ef\ufb01cient parameterization of the graphs while\nmaintaining semantic validity in context. 
Whereas there exists a large body of work on learning an abstract representation of a graph [34, 11, 26, 10, 14], how one decodes it into a valid native representation is less straightforward.\nOne natural line of approaches [24, 29] treats the generation process as sequential decision making, wherein nodes and edges are inserted one by one, conditioned on the graph constructed so far. Vectorial representations of the graph, nodes, and edges may be simultaneously learned. When the training graphs are large, however, learning a long sequence of decisions is reportedly challenging [29]. Moreover, existing work is limited to a predefined ordering of the sequence, leaving open the role of permutation.\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAnother approach [37, 36] is to build a probabilistic graph model based on the matrix representation. In the simplest form, the adjacency matrix may encode the probability of the existence of each edge, together with a probability vector indicating the existence of nodes. The challenge of such an approach, in contrast to the local decisions made in the preceding one, is that global properties (e.g., connectivity) of the graph are hard to control [36]. Furthermore, in applications, there often exist constraints demanding that only certain combinations of the nodes and edges are valid. For example, in molecular graphs, the number of bonding-electron pairs cannot exceed the valence of an atom. Translated to the graph language, this constraint means that for a node of a certain type, the incident edges collectively must have total type values not exceeding a threshold, where the \u201ctype value\u201d is some numeric property of an edge type. 
For another example, in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. That is, two nodes are connected only when their types are compatible. These constraints are difficult to satisfy when the sampling of nodes and edges is independent according to the probability matrix. Such challenges motivate the present work.\nIn this work, we propose a regularization framework for VAEs to generate semantically valid graphs. We focus on the matrix representation and formulate penalty terms that address validity constraints. The penalties in effect regularize the distributions of the existence and types of the nodes and edges collectively. Examples of the constraints include graph connectivity, node label compatibility, as well as valence in the context of molecular graphs. All these constraints are formulated with respect to the node-label matrix and the edge-label tensor. We demonstrate the high probability of generating valid graphs under the proposed framework with two benchmark molecule data sets and a synthetic node-compatible data set.\n\n2 Related Work\n\nGenerative models aim at learning the distribution of training examples. The emergence of deep architectures for generative modeling, including VAE [25], GAN [16], and Generative Moment Matching Networks (GMMN) [28], demonstrates state-of-the-art performance in various data domains, including continuous ones (e.g., time series [38], musical improvisation [22], image generation [31], and video synthesis [40]) and discrete ones (e.g., program induction [13] and molecule generation [15, 27]). 
They inspire a rich literature on the training and extensions of these models; see, e.g., the work [8, 1, 3, 30, 33, 7, 19, 2, 17].\nPrior to deep learning, there exist several random graph models, notably the Erd\u0151s\u2013R\u00e9nyi model [12] and Barabasi\u2019s scale-free model [4], that explicitly describe a class of graphs. These models are limited to specific graph properties (e.g., degree distribution) and are not sufficiently expressive for real-life graphs with richer structures.\nRecently, deep generative models for graphs have attracted surging interest. In GraphVAE [36], the work most relevant to ours, the authors use VAE to reconstruct the matrix representation of a graph. Based on this representation, our work formulates a regularization framework to impose constraints so that the sampled graphs are semantically valid. In another work, NetGAN [6], the authors use the Wasserstein GAN [2] formulation to generate random paths and assemble them for a new graph. Rather than learning a distribution of graphs, this work learns the connectivity structure of a single graph and produces a sibling graph that shares the distribution of the paths. Yet another approach for graph generation is to produce nodes and edges sequentially [24, 29]. This approach resembles the usual sequence models for texts, wherein new tokens are generated conditioned on the history. These models are increasingly difficult to train as the sequence becomes longer.\nApplication-wise, de novo design of molecules is a convenient testbed and an easily validated task for deep graph generative models. Molecules can be represented as undirected graphs with atoms as nodes and bonds as edges. Most of the work for molecule generation, however, is not applicable to a general graph. For example, a popular approach to tackling the problem is based on the SMILES [41] representation. 
An array of work designs recurrent neural net architectures for generating SMILES strings [15, 5, 35]. G\u00f3mez-Bombarelli et al. [15] use VAEs to encode SMILES strings into a continuous latent space, from which one may search for new molecules with desirable properties through Bayesian optimization. The string representation, however, is very brittle, because small changes to the string may completely violate chemical validity. To resolve the problem, Kusner et al. [27] propose to use the SMILES grammar to generate a parse tree, which in turn is flattened as a SMILES string. This approach guarantees the syntactic validity of the output, but semantic validity is still questionable. Dai et al. [9] further apply attribute grammar as a constraint in the parse-tree generation, a step toward semantic validity. Jin et al. [23] exploit the fact that molecular graphs may be turned into a tree by treating the rings as super nodes. We note that none of these methods generalizes to a general graph.\n\n3 Regularized Variational Autoencoder for Graphs\n\nIn this section, we propose a regularization framework for VAEs to generate semantically valid graphs. The framework is inspired by the transformation of a constrained optimization problem to a regularized unconstrained one.\n\n3.1 Graph Representation and Probability Model\n\nLet a collection of graphs have at most N nodes, d node types, and t edge types. A graph from this collection is normally represented by the tuple (F, E) with\n\nnode-label matrix $F \\in \\mathbb{R}^{N \\times (1+d)}$ and edge-label tensor $E \\in \\mathbb{R}^{N \\times N \\times (1+t)}$,\n\nwhere 0-based indexing is used for convenience. The node types range from 1 to d and the edge types from 1 to t. For each node i, the row F(i, :) is one-hot. If F(i, 0) = 1, the node is nonexistent (subsequently called \u201cghost nodes\u201d). Otherwise, the \u201con\u201d location of F(i, :) indicates the label of the node. 
Similarly, for each pair (i, j), the fiber E(i, j, :) is one-hot. If E(i, j, 0) = 1, the edge is nonexistent; otherwise, the \u201con\u201d location of the fiber indicates the label of the edge.\nWe relax the one-hot rows of F and fibers of E to probability vectors and write with a tilde notation $\\tilde{G} = (\\tilde{F}, \\tilde{E})$. Now, $\\tilde{F}(i, r)$ is the probability of node i belonging to type r (nonexistent if r = 0), and $\\tilde{E}(i, j, k)$ is the probability of edge (i, j) belonging to type k (nonexistent if k = 0). Then, $\\tilde{G}$ is a random graph model, from which one may generate random realizations of graphs. Under the independence assumption, the probability of sampling a graph G using the model $\\tilde{G}$ is\n\n$$\\prod_{i=1}^{N} \\prod_{r=1}^{1+d} \\tilde{F}(i, r)^{F(i,r)} \\prod_{i < j} \\prod_{k=1}^{1+t} \\tilde{E}(i, j, k)^{E(i,j,k)}. \\tag{1}$$\n\nIn what follows, we use the one-hot G to denote a graph and the probabilistic $\\tilde{G}$ to denote the sampling distribution.\n\n3.2 Variational Autoencoder\n\nThe goal of a generative model is to learn a probability distribution from a set of training graphs, such that one can sample new graphs from it. To this end, let z be a latent vector. We want to learn a latent model $p_\\theta(G|z)$, which is defined by a generative network with parameters $\\theta$ and output $\\tilde{G}$. Assuming independence of training examples, the objective, then, is to maximize the log-evidence of the data:\n\n$$\\sum_l \\log p_\\theta(G^{(l)}) = \\sum_l \\log \\int p_\\theta(G^{(l)}|z) \\, p_\\theta(z) \\, dz, \\tag{2}$$\n\nwhere the superscript l indexes training examples.\nThe integral (2) being generally intractable, a common remedy in variational Bayes is to use a variational posterior $q_\\phi(z|G)$, defined by an inference network with parameters $\\phi$, to approximate the actual posterior $p_\\theta(z|G)$. 
In this vein, the log-evidence (2) admits a lower bound\n\n$$L_{\\mathrm{ELBO}} = -\\sum_l D_{\\mathrm{KL}}\\left( q_\\phi(z|G^{(l)}) \\,\\|\\, p_\\theta(z) \\right) + \\sum_l \\mathbb{E}_{q_\\phi(z|G^{(l)})}\\left[ \\log p_\\theta(G^{(l)}|z) \\right], \\tag{3}$$\n\nwhere $D_{\\mathrm{KL}}$ denotes the Kullback\u2013Leibler divergence. We maximize the lower bound $L_{\\mathrm{ELBO}}$ with respect to $\\theta$ and $\\phi$.\nSuch a variational treatment lends itself to an autoencoder, where the inference network encodes a training example G into a latent representation z, and the generative network decodes the latent z and reconstructs a graph from the probabilistic model $\\tilde{G}$, such that it is as close to G as possible. Between the two constituent parts of $L_{\\mathrm{ELBO}}$ in (3), the expectation term is the negative reconstruction loss, whereas the KL divergence term serves as a regularization that encourages the variational posterior to stay close to the prior.\nIt remains to define the probability distributions. The likelihood $p_\\theta(G|z)$ simply follows the probability model (1). The variational posterior $q_\\phi(z|G)$ is a factored Gaussian with mean vector $\\mu$ and variance vector $\\sigma^2$. Usually, the prior $p_\\theta(z)$ is standard normal, but we find that parameterizing it with a trainable mean vector m and variance vector $s^2$ sometimes improves inference.\n\n3.3 Regularization\n\nThe central contribution of this work is an approach to imposing validity constraints in the training of VAEs. In optimization, (in)equality constraints may be moved to the objective function to form a Lagrangian function, whose solution coincides with that of the original objective. This connection is one of the justifications of using regularization to formulate an unconstrained objective, which is otherwise challenging to optimize. The regularization corresponds to the original (in)equality constraints. 
For example, it is well known that the Euclidean-ball constraint is, under certain conditions, equivalent to an L2 regularization.\nFor VAE, we want the samples produced by the generative network $p_\\theta(G|z)$ to be valid, regardless of what latent value z one starts with. The constraint set is then infinite because of the cardinality of the (often) continuous random variable z. Hence, we first generalize the Lagrangian function for an infinite constraint set, and then use the generalization to motivate a sound approach for formulating regularization. The idea turns out to be fairly simple: it suffices to marginalize the constraints over z.\nLet f(x) be the objective function to be minimized. For VAE, the unknown x includes both the generative parameter $\\theta$ and the variational parameter $\\phi$, and f is the negative lower bound $-L_{\\mathrm{ELBO}}$. Suppose that for each z there are m equality constraints and r inequality constraints, such that the problem is formally written as\n\n$$\\min_x \\; f(x) \\quad \\text{subject to, for almost all } z \\sim p_x(z), \\quad h_1(x, z) = 0, \\ldots, h_m(x, z) = 0, \\quad g_1(x, z) \\le 0, \\ldots, g_r(x, z) \\le 0. \\tag{4}$$\n\nThe phrase \u201calmost all\u201d means that the set of z violating the constraints has a zero measure.\nTo solve (4), we generalize the usual notion of Lagrangian function to\n\n$$L(x, \\lambda, \\mu) = f(x) + \\sum_{i=1}^{m} \\lambda_i \\tilde{h}_i(x) + \\sum_{j=1}^{r} \\mu_j \\tilde{g}_j(x), \\tag{5}$$\n\nwhere $\\{\\lambda_i\\}$ and $\\{\\mu_j\\}$ are Lagrangian multipliers and\n\n$$\\tilde{h}_i(x) = \\left[ \\int h_i(x, z)^2 \\, p_x(z) \\, dz \\right]^{1/2} \\quad \\text{and} \\quad \\tilde{g}_j(x) = \\left[ \\int g_j(x, z)^2 \\, p_x(z) \\, dz \\right]^{1/2}. \\tag{6}$$\n\nThese two tilde terms correspond to the marginalization of the squared constraints; hence, the dependency on z in the Lagrangian function is eliminated. After a technical definition, we give a theorem that resembles the usual KKT condition for constrained problems. 
Its proof is given in the supplementary material.\nDefinition 1. For any feasible x, denote by $A(x) = \\{ j \\mid g_j(x, z) = 0 \\text{ for almost all } z \\} = \\{ j \\mid \\tilde{g}_j(x) = 0 \\}$ the set of active inequality constraints. A feasible x is said to be regular if the equality constraint gradients $\\nabla \\tilde{h}_i(x)$, $i = 1, \\ldots, m$, and the active inequality constraint gradients $\\nabla \\tilde{g}_j(x)$, $j = 1, \\ldots, r$, are linearly independent.\nTheorem 1. Let $x^*$ be a local minimum of the problem (4) and assume that $x^*$ is regular. Then, there exist unique vectors $\\lambda^* = (\\lambda_1^*, \\ldots, \\lambda_m^*)$ and $\\mu^* = (\\mu_1^*, \\ldots, \\mu_r^*)$ such that (a) $\\nabla_x L(x^*, \\lambda^*, \\mu^*) = 0$; (b) $\\mu_j^* \\ge 0$, $j = 1, \\ldots, r$; and (c) $\\mu_j^* = 0$, $\\forall\\, j \\notin A(x^*)$.\nThe above theorem indicates that a solution $x^*$ of the constrained problem (4) coincides with that of the unconstrained minimization of (5). An intuitive explanation of why validity constraints for every z may be equivalently reformulated as regularization terms involving only the marginalization of z is that $h_i(x, z)$ is zero for almost all z if and only if $\\tilde{h}_i(x)$ is zero. Moreover, active inequality constraints are equivalent to equality ones, and nonactive constraints have multipliers equal to zero. This argument proves a majority of the conclusions in Theorem 1. The spirit is that marginalization is a powerful tool for formulating regularizations that faithfully represent the constraints.\nNote that the squaring of $h_i$ (and similarly of $g_j$) in (6) ensures that $h_i(x, z)$ is zero for almost all z if and only if $\\tilde{h}_i(x)$ is zero, a premise of the correctness of the theorem. 
Without squaring, this if-and-only-if statement does not hold.\n\n3.4 Training\n\nReturning to the notation of VAE, let us write the i-th validity constraint as $g_i(\\theta, z) \\le 0$ for all z. Based on the preceding subsection, the loss function for training VAE may then be written as\n\n$$-L_{\\mathrm{ELBO}}(\\theta, \\phi) + \\mu \\sum_i \\left[ \\int g_i(\\theta, z)^2 \\, p_\\theta(z) \\, dz \\right]^{1/2},$$\n\nwhere $\\mu \\ge 0$ is treated as a tunable hyperparameter and the square-bracket term is a regularization. We avoid using different $\\mu$\u2019s for each i to reduce the number of hyperparameters. A problem for this loss function is that it penalizes not only the undesirable case $g_i(\\theta, z) > 0$, but also the opposite desirable case $g_i(\\theta, z) < 0$, because of the presence of the square. Hence, we make a slight modification to the regularization and use the following loss function for training instead:\n\n$$-L_{\\mathrm{ELBO}}(\\theta, \\phi) + \\mu \\sum_i \\left[ \\int g_i(\\theta, z)_+^2 \\, p_\\theta(z) \\, dz \\right]^{1/2}, \\tag{7}$$\n\nwhere $g_+ = \\max(g, 0)$ denotes the ramp function. This regularization will not penalize the desirable case $g_i(\\theta, z) \\le 0$.\nAd hoc as it may sound, the use of the ramp function follows the same interpretation of the usual relationship between a constrained optimization problem and the corresponding regularized unconstrained one. In the usual KKT condition where an inequality constraint $g \\le 0$ is not squared, the nonnegative multiplier $\\mu$ ensures that the regularization $\\mu g$ penalizes the undesirable case g > 0. Here, on the other hand, the squaring of the inequality constraint cannot distinguish the sign of g anymore. 
Therefore, $g_+$ is a correct replacement.\nIn practice, the integral in the regularization may be intractable, and hence we appeal to Monte Carlo approximation for evaluating the loss in each parameter update:\n\n$$-L_{\\mathrm{ELBO}}(\\theta, \\phi) + \\mu \\sum_i g_i(\\theta, z)_+, \\quad \\text{where} \\quad z \\sim p_\\theta(z). \\tag{8}$$\n\nSuch an approach is similar to the training of standard VAE, where the expectation term in (3) is also approximated by a Monte Carlo sample. There are two important distinctions, however. First, in the standard VAE, the latent vector z is sampled from the variational posterior $q_\\phi(z|G)$, whereas in regularized VAE, the additional z is sampled from the prior $p_\\theta(z)$. Second, the variational posterior sample z is encoded from a training graph G, whereas the prior sample z does not come from any training graph. We call the latter z synthetic. Despite the distinctions, reparameterization must be used for sampling in both cases, so that z is differentiable.\nThis training procedure is schematically illustrated by Figure 1. We use l to index a training example and $\\underline{l}$ (note the underline) to denote a synthetic example, needed by regularization. The top flow denotes the standard VAE, where an input graph $G^{(l)}$ is encoded as $z^{(l)}$, which in turn is decoded to compute the ELBO loss. The bottom flow denotes the regularization, where a synthetic $z^{(\\underline{l})}$ is decoded to compute the constraints $g_i(\\theta, z^{(\\underline{l})})_+$. The combination of the two gives the total loss in each optimization step.\n\nFigure 1: Overview of the regularization framework. In addition to a standard VAE (top flow), regularizations are imposed on synthetic $z^{(\\underline{l})}$ sampled from the prior (bottom flow).\n\n4 Constraint Formulation\n\nIn this section we formulate several constraints $g_i(\\theta, z)$ used as regularization in (7). All these constraints are concerned with the decoder output $\\tilde{G} = (\\tilde{F}, \\tilde{E})$ and hence the dependencies on $\\theta$ and z are omitted for readability; that is, we write the constraints as $g_i$. In this case, i corresponds to a graph node. If the constraints are imposed on each edge (i, j), we write the constraints as $g_{ij}$.\n\n4.1 Ghost Nodes and Valence\n\nA ghost node has no incident edges. On the other hand, if a graph represents a molecule, the configuration of the bonds must meet the valence criteria of the atoms. These two seemingly unrelated constraints have a common characteristic: the edges incident to a node are collectively subject to a limited choice of existence and types.\nDenote by V(i) the capacity of a node i and by U(i) an upper bound of the capacity. For example, V is the number of bonding-electron pairs and U is the valence. The constraint is written as\n\n$$g_i = V(i) - U(i). \\tag{9}$$\n\nTo define V and U, let h(k) be the capacity function of an edge type k:\n\n$$h(\\text{nonexistent}) = 0, \\quad h(\\text{single bond}) = 1, \\quad h(\\text{double bond}) = 2, \\quad h(\\text{triple bond}) = 3.$$\n\nThen,\n\n$$V(i) = \\sum_{j \\ne i} \\sum_k h(k) \\, \\tilde{E}(i, j, k).$$\n\nThis expression reads that if the fiber $\\tilde{E}(i, j, :)$ is one-hot, the inner summation is exactly the type capacity of the edge (i, j), and the outer summation sums over all other nodes j in the graph, forming the overall capacity of the node i. Of course, $\\tilde{E}(i, j, :)$ is not one-hot; hence, the inner summation is the expected capacity for the edge. A similar expectation is used to define the upper bound U (valence):\n\n$$U(i) = \\sum_r u(r) \\, \\tilde{F}(i, r),$$\n\nwhere u(r) is the valence of node type r if r \u2260 0, and u(r) = 0 if r = 0.\n\nNote that for graphs other than molecules, where the concept of valence does not exist, ghost nodes still must obey the constraint (9). 
In such a case, the capacity function h is 0 if an edge is nonexistent and 1 otherwise. The function u is 0 when r = 0 and N \u2212 1 when r \u2260 0.\n\n4.2 Connectivity\n\nA graph is connected if there is a path between every pair of non-ghost nodes. If A is the adjacency matrix of the graph, then the (i, j) element of the matrix $B = I + A + A^2 + \\cdots + A^{N-1}$ is nonzero if and only if i and j are connected by a path. Let q be an indicator vector of non-ghost nodes; that is, q(i) = 0 if i is a ghost node and q(i) = 1 otherwise. Then, the graph must satisfy the following constraint:\n\n$$q(i) q(j) \\cdot 1\\{B(i, j) = 0\\} + [1 - q(i) q(j)] \\cdot 1\\{B(i, j) \\ne 0\\} \\le 0, \\quad \\forall\\, i \\ne j.$$\n\nIn words, if at least one of i and j is a ghost node, B(i, j) must be zero. On the other hand, if neither of i and j is a ghost node, B(i, j) must be nonzero.\nThe above constraint is unfortunately nondifferentiable. To formulate a differentiable version, we first let $q(i) = 1 - \\tilde{F}(i, 0)$, because $\\tilde{F}(i, 0)$ is the probability that node i is a ghost node. Then, for the matrix B, we need to define A. Because $\\tilde{E}(i, j, 0)$ is the probability that edge (i, j) is nonexistent, we let $A(i, j) = 1 - \\tilde{E}(i, j, 0)$. Because now A is probabilistic, an element (i, j) of A is nonzero even if the probability that i and j are connected is tiny. Such a tiny nonzero element will be amplified through taking powers of A. Hence, we need a sigmoid transform for each power of A to suppress amplification. Specifically, define $\\sigma(x) = \\{1 + \\exp[-a(x - \\tfrac{1}{2})]\\}^{-1}$, where a > 0 is sufficiently large to make most of the transformed mass close to either 0 or 1 (say, a = 100). Then, define\n\n$$A_0 = I, \\quad A_1 = A, \\quad A_{i+1} = \\sigma(A_i A), \\; i = 1, \\ldots, N-2, \\quad B = \\sum_{i=0}^{N-1} A_i, \\quad C = \\sigma(B),$$\n\nwhere C(i, j) is now sufficiently close to the indicator $1\\{B(i, j) \\ne 0\\}$. Thus, the differentiable version of the constraint is\n\n$$g_{ij} = q(i) q(j) \\cdot [1 - C(i, j)] + [1 - q(i) q(j)] \\cdot C(i, j). \\tag{10}$$\n\n4.3 Node Compatibility\n\nA compatibility matrix $D \\in \\mathbb{R}^{(1+d) \\times (1+d)}$ summarizes the compatibility of every pair of node types. When both types r and r\u2032 are nonzero, D(r, r\u2032) is 1 if the two types are compatible and 0 otherwise. Moreover, ghost nodes are incompatible. Hence, when either of r and r\u2032 is zero, D(r, r\u2032) = 0.\nWe now consider a constraint that mandates that two nodes are connected only when their node types are compatible. This constraint appears in, for example, protein interaction networks where two proteins are connected only when they belong to the same or correlated gene ontology terms.\nConsider the matrix $P = \\tilde{F} D \\tilde{F}^T$. The (i, j) element of P is the probability that nodes i and j have compatible types. We want node pairs with low compatibility to be disconnected. Hence, the constraint is\n\n$$g_{ij} = [1 - \\tilde{E}(i, j, 0)][1 - P(i, j)] - \\alpha, \\tag{11}$$\n\nwhere $\\alpha \\in (0, 1)$ is a tunable hyperparameter. The interpretation of (11) is as follows. In order to satisfy the constraint $g_{ij} \\le 0$, the probability that edge (i, j) exists, $1 - \\tilde{E}(i, j, 0)$, must be $\\le \\alpha / [1 - P(i, j)]$. The smaller is P(i, j), the lower is the threshold, which leads to a smaller probability for the edge to exist. On the other hand, once P(i, j) exceeds $1 - \\alpha$, we see that $g_{ij} \\le 0$ always holds regardless of the existence probability $1 - \\tilde{E}(i, j, 0)$. 
Hence, for highly compatible pairs, the corresponding edge may or may not exist.\n\n5 Experiments\n\n5.1 Tasks, Data Sets, and Baselines\n\nWe consider two tasks: the generation of molecular graphs and that of node-compatible graphs. For molecular graphs, two benchmark data sets are QM9 [32] and ZINC [21]. The former contains molecules with at most 9 heavy atoms whereas the latter consists of drug-like commercially available molecules extracted at random from the ZINC database.\nFor node-compatible graphs, we construct a synthetic data set by first generating random node labels, followed by connecting node pairs with a certain probability if their labels are compatible. The compatibility matrix D, ignoring the empty top row and left column, is\n\n0 1 1 1 0\n1 0 1 0 1\n1 1 0 1 1\n1 0 1 0 0\n0 1 1 0 0\n\nBased on this matrix, we generate 100,000 random node-compatible graphs. For each graph, the number of nodes is a uniformly random integer in [10, 15]. Each node is randomly assigned one of the five possible labels. Then, for each pair of nodes whose types are compatible according to D, we assign an edge with probability 0.4. The graph is not necessarily connected.\nTable 1 summarizes the data sets.\n\nTable 1: Data sets.\n\nData set           # Graphs   # Nodes   # Node Types   # Edge Types\nQM9 (molecule)     134k       9         4              3\nZINC (molecule)    250k       38        9              3\nNode-compatible    100k       15        5              1\n\nBaselines for molecular graphs are character VAE (CVAE) [15] and grammar VAE (GVAE) [27]. Both methods are based on the SMILES string representation. The codes are downloaded from https://github.com/mkusner/grammarVAE. For node-compatible graphs, there is no baseline, because it is a new task. 
However, we will compare the results of using regularization versus not and show the effectiveness of the proposed method.\n\n5.2 Network Architecture\n\nFor input representation, we unfold the edge-label tensor $E \\in \\mathbb{R}^{N \\times N \\times (1+t)}$ and concatenate it with the node-label matrix $F \\in \\mathbb{R}^{N \\times (1+d)}$ to form a wide matrix with N rows and (1 + d) + N(1 + t) columns. The encoder is a 4-layer convolutional neural net (32, 32, 64, 64 channels with filter size 3\u00d73). The latent vector z is normally distributed and its mean and variance are generated from two separate fully connected layers over the output of the CNN encoder. The decoder follows the generator of DCGAN [31] and is a 4-layer deconvolutional neural net (64, 32, 32, 1 channels with filter size 3\u00d73). Both the encoder and the decoder have modules of the form Convolution\u2013BatchNorm\u2013ReLU [20].\n\n5.3 Results\n\nEffect of Regularization. Table 2 compares the performance of standard VAE with that of the proposed regularized VAE. The column \u201c% Valid\u201d denotes the percentage of valid graphs among those sampled from the prior, and the column \u201cELBO\u201d is the lower bound approximation of the log-evidence of the training data. Regularization parameters are tuned for the highest validity. One sees that regularization noticeably boosts the validity percentage in all data sets. ELBO becomes lower, as expected, because optimal parameters for the standard objective are not optimal for the regularized one. However, the difference of the two ELBOs is not large.\n\nTable 2: Standard VAE versus regularized VAE.\n\nData set          Method     % Valid   ELBO\nQM9               Standard   83.2      -17.3\nQM9               Regul.     96.6      -18.5\nZINC              Standard   29.6      -46.5\nZINC              Regul.     34.9      -47.0\nNode-compatible   Standard   40.2      -42.5\nNode-compatible   Regul.     98.4      -51.2\n\nComparison with Baselines. 
Validity percentage is not the only metric for measuring the success of an approach. In Table 3 we include other common metrics used in the literature and compare our results with those of the baselines. The column \u201c% Novel\u201d is the percentage of valid graphs sampled from the prior and not occurring in the training set, and the column \u201c% Recon.\u201d is the percentage of holdout graphs (in the training set) that can be reconstructed by the autoencoder. Our results substantially improve over those of the character VAE and grammar VAE.\n\nTable 3: Comparison with baselines. Baseline results: The \u201c% Valid\u201d and \u201c% Novel\u201d columns of QM9 are copied from Simonovsky and Komodakis [36]. The \u201c% Valid\u201d and \u201c% Recon.\u201d columns of ZINC are copied from Kusner et al. [27]. The \u201c% Recon.\u201d column of QM9 and \u201c% Novel\u201d column of ZINC are computed by using the downloaded codes mentioned in Section 5.1.\n\nQM9\nMethod     % Valid   % Novel   % Recon.\nProposed   96.6      97.5      61.8\nGVAE       60.2      80.9      96.0\nCVAE       10.3      90.0      3.61\n\nZINC\nMethod     % Valid   % Novel   % Recon.\nProposed   34.9      100       54.7\nGVAE       7.2       100       53.7\nCVAE       0.7       100       44.6\n\nSmoothness of Latent Space. We visually inspect the coherence of the latent space in two ways. In the first one, randomly pick a graph in the training set and encode it as z in the latent space. Then, decode latent vectors on a grid centering at z and with random orientation. We show the grid of graphs on a two-dimensional plane. In the second approach, randomly pick a few pairs in the training set and for each pair, perform a linear interpolation in the latent space. Figure 2 shows that the transitions of the graphs are quite smooth.\n\nFigure 2: Visualization of latent space. Data set: QM9. Left: Two-dimensional plane. Right: Each row is a one-dimensional interpolation.\n\nDenoising. Node-compatible graphs are often noisy. 
We perform an experiment to show that the\nproposed regularization may be used to recover a graph that obeys the compatibility constraints. To\nthis end, we generated another set of 10,000 graphs and randomly inserted edges between noncompatible\nnode pairs. Then, we applied regularized VAE to reconstruct the graphs and counted how many\nof them were valid. Results are shown in Table 4. One sees that the proposed approach leads to a\nhigh probability of reconstructing valid graphs, whereas standard VAE fails in most of the cases.\n\nTable 4: Percentage of validly decoded graphs.\n\nMethod           % Valid\nStandard VAE     11.2\nRegularized VAE  93.8\n\nFor more details of all the experiments, the reader is referred to the supplementary material.\n\n6 Conclusions\n\nGenerating semantically valid graphs is a challenging subject for deep generative models. Whereas\nsubstantial breakthroughs have been made for molecular graphs, rarely is a method generalizable to general\ngraphs. In this work we propose a regularization framework for training VAEs that encourages the\nsatisfaction of validity constraints. The approach is motivated by the transformation of a constrained\noptimization problem to a regularized unconstrained one. We demonstrate the effectiveness of the\nframework in two tasks: the generation of molecular graphs and that of node-compatible graphs.\n\nReferences\n\n[1] Mart\u00edn Arjovsky and L\u00e9on Bottou. Towards principled methods for training generative adversarial networks. ICLR 17, abs/1701.04862, 2017.\n\n[2] Mart\u00edn Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein generative adversarial networks. In ICML 17, 2017. URL http://proceedings.mlr.press/v70/arjovsky17a.html.\n\n[3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In ICML 17, 2017. URL http://proceedings.mlr.press/v70/arora17a.html.\n\n[4] Albert-Laszlo Barabasi and Reka Albert. 
Emergence of scaling in random networks. Science, 286(5439):509\u2013512, 1999. doi: 10.1126/science.286.5439.509.\n\n[5] Esben Jannik Bjerrum and Richard Threlfall. Molecular generation with recurrent neural networks (RNNs). CoRR, abs/1705.04612, 2017. URL http://arxiv.org/abs/1705.04612.\n\n[6] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Z\u00fcgner, and Stephan G\u00fcnnemann. GraphGAN: Generating graphs via random walks. In ICML, 2018.\n\n[7] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015. URL http://arxiv.org/abs/1509.00519.\n\n[8] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. CoRR, abs/1611.02731, 2016.\n\n[9] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyqShMZRb.\n\n[10] Micha\u00ebl Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375, 2016.\n\n[11] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS 15, 2015.\n\n[12] P. Erd\u0151s and A. R\u00e9nyi. On the evolution of random graphs. In Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17\u201361, 1960.\n\n[13] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. Terpret: A probabilistic programming language for program induction. CoRR, abs/1608.04428, 2016. URL http://arxiv.org/abs/1608.04428.\n\n[14] J. Gilmer, S.S. Schoenholz, P.F. Riley, O. Vinyals, and G.E. Dahl. 
Neural message passing for quantum chemistry. In ICML, 2017.\n\n[15] Rafael G\u00f3mez-Bombarelli, David K. Duvenaud, Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Al\u00e1n Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. CoRR, abs/1610.02415, 2016. URL http://arxiv.org/abs/1610.02415.\n\n[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS 14, 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.\n\n[17] Xiaojie Guo, Lingfei Wu, and Liang Zhao. Deep graph translation. arXiv preprint arXiv:1805.09980, 2018.\n\n[18] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, 2017.\n\n[19] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. On unifying deep generative models. CoRR, abs/1706.00550, 2017. URL http://arxiv.org/abs/1706.00550.\n\n[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[21] John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7), 2012.\n\n[22] Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck. Tuning recurrent neural networks with reinforcement learning. CoRR, abs/1611.02796, 2016. URL http://arxiv.org/abs/1611.02796.\n\n[23] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. CoRR, abs/1802.04364, 2018.\n\n[24] Daniel D. Johnson. Learning graphical state transitions. 
In ICLR, 2017.\n\n[25] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1312.html#KingmaW13.\n\n[26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.\n\n[27] Matt J. Kusner, Brooks Paige, and Jos\u00e9 Miguel Hern\u00e1ndez-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1945\u20131954, 2017.\n\n[28] Yujia Li, Kevin Swersky, and Richard Zemel. Generative moment matching networks. In ICML 15, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045301.\n\n[29] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs, 2018. URL https://openreview.net/forum?id=Hy1d-ebAb.\n\n[30] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.\n\n[31] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://arxiv.org/abs/1511.06434.\n\n[32] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.\n\n[33] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.\n\n[34] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20, 2009.\n\n[35] Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller. Generating focussed molecule libraries for drug discovery with recurrent neural networks. 
CoRR, abs/1701.01329, 2017. URL http://arxiv.org/abs/1701.01329.\n\n[36] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders, 2018. URL https://openreview.net/forum?id=SJlhPMWAW.\n\n[37] Sahar Tavakoli, Alireza Hajibagheri, and Gita Sukthankar. Learning social graph topologies using generative adversarial neural networks. In International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction, 2017.\n\n[38] A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.\n\n[39] A\u00e4ron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.\n\n[40] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. CoRR, abs/1609.02612, 2016. URL http://arxiv.org/abs/1609.02612.\n\n[41] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 28(1):31\u201336, February 1988. ISSN 0095-2338. doi: 10.1021/ci00057a005. URL http://dx.doi.org/10.1021/ci00057a005.\n", "award": [], "sourceid": 3538, "authors": [{"given_name": "Tengfei", "family_name": "Ma", "institution": "IBM Research"}, {"given_name": "Jie", "family_name": "Chen", "institution": "IBM Research"}, {"given_name": "Cao", "family_name": "Xiao", "institution": "IBM Research"}]}