{"title": "D-VAE: A Variational Autoencoder for Directed Acyclic Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1588, "page_last": 1600, "abstract": "Graph structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose an asynchronous message passing scheme that allows encoding the computations on DAGs, rather than using existing simultaneous message passing schemes to encode local graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.", "full_text": "D-VAE: A Variational Autoencoder for Directed Acyclic Graphs\n\nMuhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen\n\n{muhan, jiang.s, z.cui, garnett}@wustl.edu, chen@cse.wustl.edu\n\nDepartment of Computer Science and Engineering\n\nWashington University in St. Louis\n\nAbstract\n\nGraph structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). 
To encode DAGs into the latent space, we leverage graph neural networks. We propose an asynchronous message passing scheme that allows encoding the computations on DAGs, rather than using existing simultaneous message passing schemes to encode local graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.\n\n1 Introduction\n\nMany real-world problems can be posed as optimizing a directed acyclic graph (DAG) representing some computational task. For example, the architecture of a neural network is a DAG. The problem of searching for optimal neural architectures is essentially a DAG optimization task. Similarly, one critical problem in learning graphical models – optimizing the connection structures of Bayesian networks [1] – is also a DAG optimization task. DAG optimization is pervasive in other fields as well. In electronic circuit design, engineers need to optimize DAG circuit blocks not only to realize target functions, but also to meet specifications such as power usage and operating temperature.\n\nDAG optimization is a hard problem. Firstly, evaluating a DAG's performance is often time-consuming (e.g., training a neural network). Secondly, state-of-the-art black-box optimization techniques such as simulated annealing and Bayesian optimization primarily operate in a continuous space, and thus are not directly applicable to DAG optimization due to the discrete nature of DAGs. In particular, to make Bayesian optimization work for discrete structures, we need a kernel to measure the similarity between discrete structures as well as a method to explore the design space and extrapolate to new points. 
Principled solutions to these problems are still lacking.\n\nIs there a way to circumvent the trouble caused by discreteness? The answer is yes. If we can embed all DAGs into a continuous space and make the space relatively smooth, we might be able to directly use principled black-box optimization algorithms to optimize DAGs in this space, or even use gradient methods if gradients are available. Recently, there has been increased interest in training generative models for discrete data types such as molecules [2, 3], arithmetic expressions [4], source code [5], undirected graphs [6], etc. In particular, Kusner et al. [3] developed a grammar variational autoencoder (G-VAE) for molecules, which is able to encode and decode molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space instead of a discrete space.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nInspired by this work, we propose to also train a variational autoencoder for DAGs, and optimize DAG structures in the latent space via Bayesian optimization.\n\nTo encode DAGs, we leverage graph neural networks (GNNs) [7]. Traditionally, a GNN treats all nodes symmetrically, and extracts local features around each node by simultaneously passing every node's neighbors' messages to it. However, such a simultaneous message passing scheme is designed to learn local structural features. It might not be suitable for DAGs, since in a DAG: 1) nodes are not symmetric, but intrinsically have some ordering based on the dependency structure; and 2) we are more concerned with the computation represented by the entire graph, not the local structures.\n\nIn this paper, we propose an asynchronous message passing scheme to encode the computations on DAGs. 
The message passing no longer happens at all nodes simultaneously, but respects the computation dependencies (the partial order) among the nodes. For example, suppose node A has two predecessors, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on both B and C is finished. Then, the aggregated message from B and C is passed to A to trigger A's feature learning. This means that although the message passing is not simultaneous, it is also not completely unordered – some synchronization is still required. We incorporate this feature learning scheme in both our encoder and decoder, and propose the DAG variational autoencoder (D-VAE). D-VAE has an excellent theoretical property for modeling DAGs – we prove that D-VAE can injectively encode computations on DAGs. This means we can build a mapping from the discrete space to a continuous latent space so that every DAG computation has its unique embedding in the latent space, which justifies performing optimization in the latent space instead of the original design space.\n\nOur contributions in this paper are: 1) We propose D-VAE, a variational autoencoder for DAGs using a novel asynchronous message passing scheme, which is able to injectively encode computations. 2) Based on D-VAE, we propose a new DAG optimization framework which performs Bayesian optimization in a continuous latent space. 3) We apply D-VAE to two problems, neural architecture search and Bayesian network structure learning. 
Experiments show that D-VAE not only generates novel and valid DAGs, but also learns smooth latent spaces effective for optimizing DAG structures.\n\n2 Related work\n\nVariational autoencoder (VAE) [8, 9] provides a framework to learn both a probabilistic generative model pθ(x|z) (the decoder) as well as an approximated posterior distribution qφ(z|x) (the encoder). VAE is trained through maximizing the evidence lower bound\n\nL(φ, θ; x) = E_{z∼qφ(z|x)}[log pθ(x|z)] − KL[qφ(z|x) ‖ p(z)].  (1)\n\nThe posterior approximation qφ(z|x) and the generative model pθ(x|z) can in principle take arbitrary parametric forms whose parameters φ and θ are output by the encoder and decoder networks. After learning pθ(x|z), we can generate new data by decoding latent space vectors z sampled from the prior p(z). For generating discrete data, pθ(x|z) is often decomposed into a series of decision steps.\n\nDeep graph generative models use neural networks to learn distributions over graphs. There are mainly three types: token-based, adjacency-matrix-based, and graph-based. Token-based models [2, 3, 10] represent a graph as a sequence of tokens (e.g., characters, grammar rules) and model these sequences using RNNs. They are less general since task-specific graph grammars such as SMILES for molecules [11] are required. Adjacency-matrix-based models [12, 13, 14, 15, 16] leverage the proxy adjacency matrix representation of a graph, and generate the matrix in one shot or generate its columns/entries sequentially. In contrast, graph-based models [6, 17, 18, 19] seem more natural, since they operate directly on graph structures (instead of proxy matrix representations) by iteratively adding new nodes/edges to a graph based on the existing graph and node states. 
In addition, the graph and node states are learned by graph neural networks (GNNs), which have already shown their powerful graph representation learning ability on various tasks [20, 21, 22, 23, 24, 25, 26, 27].\n\nNeural architecture search (NAS) aims at automating the design of neural network architectures. It has seen major advances in recent years [28, 29, 30, 31, 32, 33]. See Hutter et al. [34] for an overview. NAS methods can be mainly categorized into: 1) reinforcement learning methods [28, 31, 33], which train controllers to generate architectures with high rewards in terms of validation accuracy; 2) Bayesian optimization based methods [35], which define kernels to measure architecture similarity and extrapolate the architecture space heuristically; 3) evolutionary approaches [29, 36, 37], which use evolutionary algorithms to optimize neural architectures; and 4) differentiable methods [32, 38, 39], which use continuous relaxations/mappings of neural architectures to enable gradient-based optimization. In Appendix A, we include a more detailed discussion of the most closely related works.\n\nBayesian network structure learning (BNSL) aims to learn the structure of the underlying Bayesian network from observed data [40, 41, 42, 43]. A Bayesian network is a probabilistic graphical model encoding conditional dependencies among variables via a DAG [1]. One main approach for BNSL is score-based search, i.e., define some “goodness-of-fit” score for network structures, and search for one with the optimal score in the discrete design space. Commonly used scores include BIC and BDeu, mostly based on marginal likelihood [1]. Due to the NP-hardness [44], however, exact algorithms such as dynamic programming [45] or shortest-path approaches [46, 47] can only solve small-scale problems. Thus, people have to resort to heuristic methods such as local search and simulated annealing [48]. 
BNSL is still an active research area [41, 43, 49, 50, 51].\n\n3 DAG variational autoencoder (D-VAE)\n\nIn this section, we describe our proposed DAG variational autoencoder (D-VAE). D-VAE uses an asynchronous message passing scheme to encode and decode DAGs. In contrast to the simultaneous message passing in traditional GNNs, D-VAE allows encoding computations rather than structures.\n\nDefinition 1. (Computation) Given a set of elementary operations O, a computation C is the composition of a finite number of operations o ∈ O applied to an input signal x, with the output of each operation being the input to its succeeding operations.\n\nFigure 1: Computations can be represented by DAGs. Note that the left and right DAGs represent the same computation.\n\nThe set of elementary operations O depends on specific applications. For example, when we are interested in computations given by a calculator, O will be the set of all the operations defined on the functional buttons, such as +, −, ×, ÷, etc. When modeling neural networks, O can be a predefined set of basic layers, such as 3×3 convolution, 5×5 convolution, 2×2 max pooling, etc. A computation can be represented as a directed acyclic graph (DAG), with directed edges representing signal flow directions among node operations. The graph must be acyclic, since otherwise the input signal would go through an infinite number of operations and the computation would never stop. Figure 1 shows two examples. Note that the two different DAGs in Figure 1 represent the same computation, as the input signal goes through exactly the same operations. We discuss this further in Appendix B.\n\n3.1 Encoding\n\nWe first introduce the encoder of D-VAE, which can be seen as a graph neural network (GNN) using an asynchronous message passing scheme. 
Given a DAG G, we assume there is a single starting node which does not have any predecessors (e.g., the input layer of a neural architecture). If there are multiple such nodes, we add a virtual starting node connecting to all of them.\n\nSimilar to standard GNNs, we use an update function U to compute the hidden state of each node based on its neighbors' incoming messages. The hidden state of node v is given by:\n\nh_v = U(x_v, h^in_v),  (2)\n\nwhere x_v is the one-hot encoding of v's type, and h^in_v represents the incoming message to v. h^in_v is given by aggregating the hidden states of v's predecessors using an aggregation function A:\n\nh^in_v = A({h_u : u → v}),  (3)\n\nwhere u → v denotes that there is a directed edge from u to v, and {h_u : u → v} represents a multiset of v's predecessors' hidden states. If an empty set is input to A (corresponding to the case of the starting node without any predecessors), we let A output an all-zero vector.\n\nCompared to the traditional simultaneous message passing, in D-VAE the message passing for a node must wait until all of its predecessors' hidden states have already been computed. This simulates how a computation is really performed – to execute some operation, we also need to wait until all its input signals are ready. So how do we make sure all the predecessor states are available when a new node comes? One solution is to sequentially perform message passing for nodes following a topological ordering of the DAG. We illustrate this encoding process in Figure 2.\n\nFigure 2: An illustration of the encoding procedure for a neural architecture. 
Following a topological ordering, we iteratively compute the hidden state for each node (red) by feeding in its predecessors' hidden states (blue). This simulates how an input signal goes through a computation, with h_v simulating the output signal at node v.\n\nAfter all nodes' hidden states are computed, we use h_{v_n}, the hidden state of the ending node v_n without any successors, as the output of the encoder. Then we feed h_{v_n} to two MLPs to get the mean and variance parameters of the posterior approximation qφ(z|G) in (1). If there are multiple nodes without successors, we again add a virtual ending node connecting from all of them.\n\nNote that although topological orderings are usually not unique for a DAG, we can take any one of them as the message passing order while ensuring the encoder output is always the same, as revealed by the following theorem. We include all theorem proofs in the appendix.\n\nTheorem 1. The D-VAE encoder is invariant to node permutations of the input DAG if the aggregation function A is invariant to the order of its inputs.\n\nTheorem 1 means isomorphic DAGs are always encoded the same, no matter how we index the nodes. It also indicates that as long as we encode a DAG complying with its partial order, we can perform message passing in arbitrary order (even in parallel for some nodes) with the same encoding result. The next theorem shows another property of D-VAE that is crucial for its success in modeling DAGs, i.e., it is able to injectively encode computations on DAGs.\n\nTheorem 2. Let G be any DAG representing some computation C. Let v_1, . . . , v_n be its nodes following a topological order, each representing some operation o_i, 1 ≤ i ≤ n, where v_n is the ending node. 
Then, the encoder of D-VAE maps C to h_{v_n} injectively if A is injective and U is injective.\n\nThe significance of Theorem 2 is that it provides a way to injectively encode computations on DAGs, so that every computation has a unique embedding in the latent space. Therefore, instead of performing optimization in the original discrete space, we may alternatively perform optimization in the continuous latent space. In this well-behaved Euclidean space, distance is well defined, and principled Bayesian optimization can be applied to search for latent points with high performance scores, which transforms the discrete optimization problem into an easier continuous problem.\n\nNote that Theorem 2 states that D-VAE injectively encodes computations on graph structures, rather than graph structures themselves. Being able to injectively encode graph structures is a very strong condition, as it implies an efficient algorithm to solve the challenging graph isomorphism (GI) problem. Luckily, what we really care about here are computations instead of structures, since we do not want to differentiate two different structures G1 and G2 as long as they represent the same computation. Figure 1 shows such an example. Our D-VAE can identify that the two DAGs in Figure 1 actually represent the same computation by encoding them to the same vector, while encoders focusing on encoding structures might fail to capture the underlying computation and output different vectors. We discuss more advantages of Theorem 2 in optimizing DAGs in Appendix G.\n\nTo model and learn the injective functions A and U, we resort to neural networks thanks to the universal approximation theorem [52]. For example, we can let A be a gated sum:\n\nh^in_v = Σ_{u→v} g(h_u) ⊙ m(h_u),  (4)\n\nwhere m is a mapping network and g is a gating network. Such a gated sum can model injective multiset functions [53], and is invariant to input order. 
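As an illustration only (not the authors' released implementation; the hidden size, sigmoid gating, and tanh mapping are assumptions), the gated sum in (4) can be sketched in a few lines; permuting the predecessor states leaves the aggregate unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden state size (assumed)

# Toy parameters of the gating network g and the mapping network m.
Wg, Wm = rng.normal(size=(D, D)), rng.normal(size=(D, D))

def g(h):
    return 1.0 / (1.0 + np.exp(-(Wg @ h)))  # gate in (0, 1)^D

def m(h):
    return np.tanh(Wm @ h)  # mapped message

def aggregate(predecessor_states):
    """Gated sum of eq. (4); returns zeros for a node with no predecessors."""
    if len(predecessor_states) == 0:
        return np.zeros(D)
    return sum(g(h) * m(h) for h in predecessor_states)

h1, h2, h3 = rng.normal(size=(3, D))
out1 = aggregate([h1, h2, h3])
out2 = aggregate([h3, h1, h2])  # permuted input order
assert np.allclose(out1, out2)  # invariant to the order of its inputs
```

Because the gate and map are applied per element before summation, the result depends only on the multiset of predecessor states, which is exactly the order-invariance Theorem 1 requires.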
To model the injective update function U, we can use a gated recurrent unit (GRU) [54], with h^in_v treated as the input hidden state:\n\nh_v = GRU_e(x_v, h^in_v).  (5)\n\nHere the subscript e denotes “encoding”. Using a GRU also allows reducing our framework to traditional sequence-to-sequence modeling frameworks [55], as discussed in Section 3.4.\n\nThe above aggregation and update functions can be used to encode general computation graphs. For neural architectures, depending on how the outputs of multiple previous layers are aggregated as the input to the next layer, we will make a modification to (4), which is discussed in Appendix E. For Bayesian networks, we also make some modifications to their encoding due to the special d-separation properties of Bayesian networks, which is discussed in Appendix F.\n\n3.2 Decoding\n\nFigure 3: An illustration of the steps for generating a new node.\n\nWe now describe how D-VAE decodes latent vectors to DAGs (the generative part). The D-VAE decoder uses the same asynchronous message passing scheme as in the encoder to learn intermediate node and graph states. Similar to (5), the decoder uses another GRU, denoted by GRU_d, to update node hidden states during the generation. Given the latent vector z to decode, we first use an MLP to map z to h_0 as the initial hidden state to be fed to GRU_d. Then, the decoder constructs a DAG node by node. For the ith generated node v_i, the following steps are performed:\n1. 
Compute v_i's type distribution using an MLP f_add_vertex (followed by a softmax) based on the current graph state h_G := h_{v_{i−1}}.\n\n2. Sample v_i's type. If the sampled type is the ending type, stop the decoding, connect all loose ends (nodes without successors) to v_i, and output the DAG; otherwise, continue the generation.\n\n3. Update v_i's hidden state by h_{v_i} = GRU_d(x_{v_i}, h^in_{v_i}), where h^in_{v_i} = h_0 if i = 1; otherwise, h^in_{v_i} is the aggregated message from its predecessors' hidden states given by equation (4).\n\n4. For j = i−1, i−2, . . . , 1: (a) compute the probability of edge (v_j, v_i) using an MLP f_add_edge based on h_{v_j} and h_{v_i}; (b) sample the edge; and (c) if a new edge is added, update h_{v_i} using step 3.\n\nThe above steps are iteratively applied to each newly generated node, until step 2 samples the ending type. For every new node, we first predict its node type based on the current graph state, and then sequentially predict whether each existing node has a directed edge to it based on the existing and current nodes' hidden states. Figure 3 illustrates this process. Since edges always point to new nodes, the generated graph is guaranteed to be acyclic. Note that we maintain hidden states for both the current node and existing nodes, and keep updating them during the generation. For example, whenever step 4 samples a new edge between v_j and v_i, we will update h_{v_i} to reflect the change of its predecessors and thus the change of the computation so far. Then, we will use the new h_{v_i} for the next prediction. Such a dynamic updating scheme is flexible, computation-aware, and always uses the up-to-date state of each node to predict next steps. 
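To make the control flow of the four decoding steps concrete, here is a minimal sketch (a toy stand-in: the MLPs f_add_vertex and f_add_edge and the GRU are replaced by simple placeholder functions, greedy selection replaces sampling, and all sizes are assumptions, not the paper's actual models):

```python
import numpy as np

rng = np.random.default_rng(1)
D, NUM_TYPES, END = 8, 4, 3  # hidden size, node types, ending type id (toy choices)

def f_add_vertex(h_graph):
    # Placeholder for the type-prediction MLP: random logits (illustration only).
    return rng.normal(size=NUM_TYPES)

def f_add_edge(h_j, h_i):
    # Placeholder for the edge-prediction MLP: a simple similarity-based probability.
    return 1.0 / (1.0 + np.exp(-np.dot(h_j, h_i) / D))

def gru_d(node_type, h_in):
    # Stand-in for GRU_d: mixes a toy type embedding with the incoming message.
    emb = np.eye(NUM_TYPES, D)[node_type]
    return np.tanh(h_in + emb)

def decode(h0, max_nodes=10):
    types, edges, h = [], [], []          # node types, edge list, node hidden states
    for i in range(max_nodes):
        h_graph = h[-1] if h else h0
        t = int(np.argmax(f_add_vertex(h_graph)))  # steps 1-2 (greedy for brevity)
        if t == END:
            break                                  # connect loose ends, output the DAG
        h_in = h0 if i == 0 else np.zeros(D)
        h_i = gru_d(t, h_in)                       # step 3
        for j in range(i - 1, -1, -1):             # step 4: reversed edge order
            if f_add_edge(h[j], h_i) > 0.5:
                edges.append((j, i))
                h_in = h_in + h[j]                 # new predecessor message
                h_i = gru_d(t, h_in)               # re-run step 3 after adding the edge
        types.append(t)
        h.append(h_i)
    return types, edges

types, edges = decode(np.zeros(D))
assert all(j < i for (j, i) in edges)  # edges point forward, so the graph is acyclic
```

The final assertion mirrors the acyclicity argument in the text: every sampled edge goes from an existing node to the newly generated one, so no cycle can form.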
In contrast, methods based on RNNs [3, 13] do not maintain states for old nodes, and only use the current RNN state to predict the next step.\n\nIn step 4, when sequentially predicting incoming edges from previous nodes, we choose the reversed order i−1, . . . , 1 instead of 1, . . . , i−1 or any other order. This is based on the prior knowledge that a new node v_i is more likely to first connect from the node v_{i−1} immediately before it. For example, in neural architecture design, when adding a new layer, we often first connect it from the last added layer, and then decide whether there should be skip connections from other previous layers. Note, however, that such an order is not fixed and can be flexible according to specific applications.\n\n3.3 Training\n\nDuring the training phase, we use teacher forcing [17] to measure the reconstruction loss: following the topological order with which the input DAG's nodes are consumed, we sum the negative log-likelihood of each decoding step by forcing the model to generate the ground-truth node type or edge at each step. This ensures that the model makes predictions based on the correct histories. Then, we optimize the VAE loss (the negative of (1)) using mini-batch gradient descent following [17]. Note that teacher forcing is only used in training. 
During generation, we sample a node type or edge at each step according to the decoding distributions described in Section 3.2 and calculate subsequent decoding distributions based on the sampled results.\n\n3.4 Discussion and model extensions\n\nRelation with RNNs. The D-VAE encoder and decoder can be reduced to ordinary RNNs when the input DAG is reduced to a chain of nodes. Although we propose D-VAE from a GNN's perspective, our model can also be seen as a generalization of traditional sequence modeling frameworks [55, 56], where a time step depends only on the time step immediately before it, to the DAG case, where a time step has multiple previous dependencies. As special DAGs, similar ideas have been explored for trees [57, 17], where a node can have multiple incoming edges yet only one outgoing edge.\n\nBidirectional encoding. D-VAE's encoding process can be seen as simulating how an input signal goes through a DAG, with h_v simulating the output signal at each node v. This is also known as forward propagation in neural networks. Inspired by the bidirectional RNN [58], we can also use another GRU to reversely encode a DAG (i.e., reverse all edge directions and encode the DAG again), thus simulating the backward propagation too. After reverse encoding, we get two ending states, which are concatenated and linearly mapped to their original size as the final output state. 
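A minimal sketch of combining the two ending states (the 2D-to-D linear map and the hidden size are assumptions about shapes, not the authors' exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # hidden size (assumed)

h_fwd = rng.normal(size=D)       # ending state from forward encoding
h_bwd = rng.normal(size=D)       # ending state after reversing all edges and re-encoding
W = rng.normal(size=(D, 2 * D))  # linear map back to the original size

h_out = W @ np.concatenate([h_fwd, h_bwd])  # final output state
assert h_out.shape == (D,)
```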
We find that this bidirectional encoding can improve the performance and convergence speed on neural architectures.\n\nIncorporating vertex semantics. Note that D-VAE currently uses the one-hot encoding of a node's type as x_v, which does not consider the semantic meanings of different node types. For example, a 3×3 convolution layer might be functionally very similar to a 5×5 convolution layer, while being functionally distinct from a max pooling layer. We expect that incorporating such semantic meanings of node types could further improve D-VAE's performance. For example, we can use pretrained embeddings of node types to replace the one-hot encoding. We leave this for future work.\n\n4 Experiments\n\nWe validate the proposed DAG variational autoencoder (D-VAE) on two DAG optimization tasks:\n• Neural architecture search. Our neural network dataset contains 19,020 neural architectures from the ENAS software [33]. Each neural architecture has 6 layers (excluding input and output layers) sampled from: 3×3 and 5×5 convolutions, 3×3 and 5×5 depthwise-separable convolutions [59], 3×3 max pooling, and 3×3 average pooling. We evaluate each neural architecture's weight-sharing accuracy [33] (a proxy of the true accuracy) on CIFAR-10 [60] as its performance measure. We split the dataset into 90% training and 10% held-out test sets. We use the training set for VAE training, and use the test set only for evaluation.\n• Bayesian network structure learning. Our Bayesian network dataset contains 200,000 random 8-node Bayesian networks from the bnlearn package [61] in R. For each network, we compute the Bayesian Information Criterion (BIC) score to measure the performance of the network structure for fitting the Asia dataset [62]. We split the Bayesian networks into 90% training and 10% test sets. 
For more details, please refer to Appendix I.\nFollowing [3], we do four experiments for each task:\n\u2022 Basic abilities of VAE models. In this experiment, we perform standard tests to evaluate the\nreconstructive and generative abilities of a VAE model for DAGs, including reconstruction accuracy,\nprior validity, uniqueness and novelty.\n\n\u2022 Predictive performance of latent representation. We test how well we can use the latent embed-\n\ndings of neural architectures and Bayesian networks to predict their performances.\n\n\u2022 Bayesian optimization. This is the motivating application of D-VAE. We test how well the learned\nlatent space can be used for searching for high-performance DAGs through Bayesian optimization.\n\u2022 Latent space visualization. We visualize the latent space to qualitatively evaluate its smoothness.\nSince there is little previous work on DAG generation, we compare D-VAE with four generative\nbaselines adapted for DAGs: S-VAE, GraphRNN, GCN and DeepGMG. Among them, S-VAE [56] and\nGraphRNN [13] are adjacency-matrix-based methods; GCN [22] and DeepGMG [6] are graph-based\nmethods which use simultaneous message passing to embed DAGs. We include more details about\nthese baselines and discuss D-VAE\u2019s advantages over them in Appendix J. The training details are in\nAppendix K. 
All the code and data are available at https://github.com/muhanzhang/D-VAE.\n\nTable 1: Reconstruction accuracy, prior validity, uniqueness and novelty (%).\n\nMethods | Neural architectures: Accuracy, Validity, Uniqueness, Novelty | Bayesian networks: Accuracy, Validity, Uniqueness, Novelty\nD-VAE | 99.96, 100.00, 37.26, 100.00 | 99.94, 98.84, 38.98, 98.01\nS-VAE | 99.98, 100.00, 37.03, 99.99 | 99.99, 100.00, 35.51, 99.70\nGraphRNN | 99.85, 99.84, 29.77, 100.00 | 96.71, 100.00, 27.30, 98.57\nGCN | 98.70, 99.53, 34.00, 100.00 | 99.81, 99.02, 32.84, 99.40\nDeepGMG | 94.98, 98.66, 46.37, 99.93 | 47.74, 98.86, 57.27, 98.49\n\n4.1 Reconstruction accuracy, prior validity, uniqueness and novelty\n\nBeing able to accurately reconstruct input examples and generate valid new examples are basic requirements for VAE models. In this experiment, we evaluate the models by measuring 1) how often they can reconstruct input DAGs perfectly (Accuracy), 2) how often they can generate valid neural architectures or Bayesian networks from the prior distribution (Validity), 3) the proportion of unique DAGs out of the valid generations (Uniqueness), and 4) the proportion of valid generations that are never seen in the training set (Novelty).\n\nWe first evaluate each model's reconstruction accuracy on the test sets. Following previous work [3, 17], we regard the encoding as a stochastic process. That is, after getting the mean and variance parameters of the posterior approximation qφ(z|G), we sample a z from it as G's latent vector. To estimate the reconstruction accuracy, we sample z 10 times for each G, and decode each z 10 times too. Then we report the average proportion of the 100 decoded DAGs that are identical to the input. To calculate prior validity, we sample 1,000 latent vectors z from the prior distribution p(z) and decode each latent vector 10 times. Then we report the proportion of valid DAGs in these 10,000 generations. 
A generated DAG is valid if it can be read by the original software which generated the training data. More details about the validity experiment are in Appendix M.1.\n\nWe show the results in Table 1. Among all the models, D-VAE and S-VAE generally perform the best. We find that D-VAE, S-VAE and GraphRNN all have near-perfect reconstruction accuracy, prior validity and novelty. However, D-VAE and S-VAE show higher uniqueness, meaning that they generate more diverse examples. GCN and DeepGMG have worse reconstruction accuracies for neural architectures due to nonzero training losses. This is because the simultaneous message passing scheme in them focuses more on learning local graph structures, but fails to encode the computation represented by the entire neural network. Besides, the sum pooling after the message passing might also lose some global topology information which is important for the reconstruction. The nonzero training loss of DeepGMG acts like an early-stopping regularizer, making DeepGMG generate more unique graphs. Nevertheless, reconstruction accuracy is much more important than uniqueness in our tasks, since we want our embeddings to accurately remap to their original structures after latent space optimization.\n\n4.2 Predictive performance of latent representation\n\nIn this experiment, we evaluate how well the learned latent embeddings can predict the corresponding DAGs' performances, which tests a VAE's unsupervised representation learning ability. Being able to accurately predict a latent point's performance also makes it much easier to search for high-performance points in this latent space. Thus, this experiment is also an indirect way to evaluate a VAE latent space's amenability to DAG optimization. Following [3], we train a sparse Gaussian process (SGP) model [63] with 500 inducing points on the embeddings of training data to predict the performance of unseen test data. 
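As background, the regression step can be illustrated with a plain (non-sparse) GP posterior mean in NumPy; this is a generic sketch on made-up data, not the sparse GP of [63]:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

# Toy latent embeddings (X) and performance scores (y); stand-ins for D-VAE outputs.
X_train = rng.normal(size=(50, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)
X_test = rng.normal(size=(5, 2))

# GP posterior mean: K(X*, X) (K(X, X) + noise * I)^{-1} y
K = rbf(X_train, X_train) + 1e-2 * np.eye(50)
mean = rbf(X_test, X_train) @ np.linalg.solve(K, y_train)
assert mean.shape == (5,)
```

A sparse GP replaces K(X, X) with a low-rank approximation built from inducing points, which is what makes training on the full embedding set tractable.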
We include the SGP training details in Appendix L.

Table 2: Predictive performance of encoded means.

                 Neural architectures            Bayesian networks
Methods      RMSE         Pearson's r       RMSE         Pearson's r
D-VAE        0.384±0.002  0.920±0.001       0.300±0.004  0.959±0.001
S-VAE        0.478±0.002  0.873±0.001       0.369±0.003  0.933±0.001
GraphRNN     0.726±0.002  0.669±0.001       0.774±0.007  0.641±0.002
GCN          0.485±0.006  0.870±0.001       0.557±0.006  0.836±0.002
DeepGMG      0.433±0.002  0.897±0.001       0.788±0.007  0.625±0.002

We use two metrics to evaluate the predictive performance of the latent embeddings (given by the mean of the posterior approximations qφ(z|G)). One is the RMSE between the SGP predictions and the true performances. The other is the Pearson correlation coefficient (Pearson's r), measuring how well the predictions and the real performances tend to go up and down together. A small RMSE and a large Pearson's r indicate better predictive performance.

Figure 4: Top 5 neural architectures found by each model and their true test accuracies.

Figure 5: Top 5 Bayesian networks found by each model and their BIC scores (higher is better).

All the experiments are repeated 10 times, and the means and standard deviations are reported. Table 2 shows the results. We find that both the RMSE and Pearson's r of D-VAE are significantly better than those of the other models. A possible explanation is that D-VAE encodes the computation, while a DAG's performance is primarily determined by its computation. Therefore, D-VAE's latent embeddings are more informative about performance.
In comparison, adjacency-matrix-based methods (S-VAE and GraphRNN) and graph-based methods with simultaneous message passing (GCN and DeepGMG) only encode (local) graph structures, without specifically modeling the computations on DAG structures. The better predictive power of D-VAE favors using a predictive model in its latent space to guide the search for high-performance graphs.

4.3 Bayesian optimization

We perform Bayesian optimization (BO) using the two best models, D-VAE and S-VAE, as validated by the previous experiments. Based on the SGP model from the last experiment, we perform 10 iterations of batch BO, and average results across 10 trials. Following Kusner et al. [3], in each iteration a batch of 50 points is proposed by sequentially maximizing the expected improvement (EI) acquisition function, using Kriging Believer [64] to assume labels for previously chosen points in the batch. For each batch of selected points, we evaluate their decoded DAGs' real performances and add them back to the SGP to select the next batch. Finally, we check the best-performing DAGs found by each model to evaluate its DAG optimization performance.

Neural architectures. For neural architectures, we select the top 15 found architectures in terms of their weight-sharing accuracies, and fully train them on the CIFAR-10 training set to evaluate their true test accuracies. We show the 5 architectures with the highest true test accuracies in Figure 4. As we can see, D-VAE in general found much better neural architectures than S-VAE. Among the selected architectures, D-VAE achieved a top accuracy of 94.80%, while S-VAE's highest accuracy was only 92.79%. In addition, all 5 of D-VAE's architectures have accuracies higher than 94%, indicating that D-VAE's latent space can stably yield many high-performance architectures. More details about our NAS experiments are in Appendix H.

Bayesian networks.
We similarly report the top 5 Bayesian networks found by each model, ranked by their BIC scores, in Figure 5. D-VAE generally found better Bayesian networks than S-VAE. The best Bayesian network found by D-VAE achieved a BIC of -11125.75, better than the best network in the training set, which has a BIC of -11141.89 (a higher BIC score is better). Note that BIC is on a log scale; the difference of about 16 thus means the probability of our found network explaining the data is roughly 1E7 times larger than that of the best training network. For reference, the true Bayesian network used to generate the Asia data has a BIC of -11109.74. Although we did not exactly find the true network, our found network was close to it and outperformed all 180,000 training networks. Our experiments show that searching in an embedding space is a promising direction for Bayesian network structure learning.

4.4 Latent space visualization

In this experiment, we visualize the latent spaces of the VAE models to get a sense of their smoothness.

Figure 6: Great circle interpolation starting from a point and returning to itself. Upper: D-VAE. Lower: S-VAE.

For neural architectures, we visualize the decoded architectures from points along a great circle in the latent space [65] (slerp). We start from the latent embedding of a straight network without skip connections.
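The great-circle sampling behind this visualization can be sketched as follows; a minimal NumPy sketch, assuming `z0` is the starting embedding and keeping the 35 evenly spaced points used here (function and argument names are illustrative):

```python
import numpy as np

def great_circle(z0, n_points=35, rng=None):
    """Sample n_points evenly spaced points on a great circle of the sphere
    of radius ||z0||, starting at z0 and returning to it."""
    rng = rng or np.random.default_rng(0)
    r = np.linalg.norm(z0)
    u = z0 / r                          # unit vector toward the start point
    v = rng.standard_normal(z0.shape)   # random direction ...
    v -= v.dot(u) * u                   # ... made orthogonal to u
    v /= np.linalg.norm(v)
    thetas = np.linspace(0.0, 2 * np.pi, n_points, endpoint=False)
    # z(theta) = r (cos(theta) u + sin(theta) v) stays on the sphere
    return np.array([r * (np.cos(t) * u + np.sin(t) * v) for t in thetas])
```

Each sampled point keeps the norm of `z0`, so the path stays on the sphere through the starting embedding and closes back on itself.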
Imagine this latent embedding as a point on the surface of a sphere (visualize the earth). We randomly pick a great circle that starts from this point and returns to it around the sphere. Along this circle, we evenly pick 35 points and visualize their decoded neural architectures in Figure 6. As we can see, both D-VAE and S-VAE show relatively smooth interpolations, changing only a few node types or edges each time. Visually, S-VAE's structural changes are even smoother. This is because S-VAE treats DAGs as strings, and thus tends to embed DAGs whose string representations differ little into similar regions of the latent space, without considering their computational differences (see Appendix J for more discussion of this problem). In contrast, D-VAE models computations, and focuses on smoothness w.r.t. computation rather than structure.

For Bayesian networks, we directly visualize the BIC score distribution of the latent space. To do so, we reduce its dimensionality by choosing a 2-D subspace spanned by the first two principal components of the training data's embeddings. In this low-dimensional subspace, we compute the BIC scores of all the points evenly spaced within a [-0.3, 0.3] grid and visualize the scores using a colormap in Figure 7.

Figure 7: Visualizing a principal 2-D subspace of the latent space.

As we can see, D-VAE seems to better differentiate high-score points from low-score ones and shows more smoothly changing BIC scores. In comparison, S-VAE shows sharp boundaries and seems to mix high-score and low-score points more severely. The smoother latent space might be a key reason for D-VAE's better Bayesian optimization performance. Furthermore, we notice that D-VAE's 2-D latent space is brighter; one explanation is that the two principal components of D-VAE explain more of the training data's variance (59%) than those of S-VAE (17%).
Thus, along the two principal components of S-VAE we see fewer points from the training distribution. These out-of-distribution points tend to decode to poor Bayesian networks and thus appear darker. This also indicates that D-VAE learns a more compact latent space.

5 Conclusion

In this paper, we have proposed D-VAE, a GNN-based deep generative model for DAGs. D-VAE uses a novel asynchronous message passing scheme to encode a DAG respecting its partial order, which explicitly models the computations on DAGs. By performing Bayesian optimization in D-VAE's latent spaces, we offer promising new directions for two important problems, neural architecture search and Bayesian network structure learning. We hope D-VAE can inspire more research on DAGs and their applications in the real world.

Acknowledgments

MZ, ZC and YC were supported by the National Science Foundation (NSF) under award numbers III-1526012 and SCH-1622678, and by the National Institute of Health under award number 1R21HS024581. SJ and RG were supported by the NSF under award numbers IIA-1355406, IIS-1845434, and OAC-1940224. The authors would like to thank Liran Wang for the helpful discussions.

References

[1] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

[2] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268-276, 2018.

[3] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder.
In International Conference on Machine Learning, pages 1945-1954, 2017.

[4] Matt J Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

[5] Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.

[6] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

[7] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[8] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[9] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[10] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-Directed Variational Autoencoder for Structured Data. arXiv preprint arXiv:1802.08786, 2018.

[11] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31-36, 1988.

[12] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders. arXiv preprint arXiv:1802.03480, 2018.

[13] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models. In International Conference on Machine Learning, pages 5694-5703, 2018.

[14] Nicola De Cao and Thomas Kipf.
MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

[15] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating Graphs via Random Walks. arXiv preprint arXiv:1803.00816, 2018.

[16] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pages 7113-7124, 2018.

[17] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, pages 2323-2332, 2018.

[18] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076, 2018.

[19] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6412-6422, 2018.

[20] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224-2232, 2015.

[21] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[22] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[23] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs.
In International Conference on Machine Learning, pages 2014-2023, 2016.

[24] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024-1034, 2017.

[25] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[26] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165-5175, 2018.

[27] Muhan Zhang and Yixin Chen. Inductive matrix completion based on graph neural networks. arXiv preprint arXiv:1904.12058, 2019.

[28] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[29] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.

[30] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.

[31] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.

[32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[33] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[34] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors.
Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.

[35] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, 2018.

[36] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.

[37] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293-312. Elsevier, 2019.

[38] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.

[39] Renqian Luo, Fei Tian, Tao Qin, En-Hong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, 2018.

[40] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.

[41] Tian Gao, Kshitij Fadnis, and Murray Campbell. Local-to-global Bayesian network structure learning. In International Conference on Machine Learning, pages 1193-1202, 2017.

[42] Tian Gao and Dennis Wei. Parallel Bayesian network structure learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1685-1694, Stockholmsmässan, Stockholm Sweden, 10-15 Jul 2018. PMLR.
URL http://proceedings.mlr.press/v80/gao18b.html.

[43] Dominik Linzner and Heinz Koeppl. Cluster Variational Approximations for Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data. In Advances in Neural Information Processing Systems, pages 7891-7901, 2018.

[44] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121-130. Springer, 1996.

[45] Ajit P. Singh and Andrew W. Moore. Finding Optimal Bayesian Networks by Dynamic Programming, 2005.

[46] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning Optimal Bayesian Networks Using A* Search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2186-2191. AAAI Press, 2011. ISBN 978-1-57735-515-1. doi: 10.5591/978-1-57735-516-8/IJCAI11-364. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-364.

[47] Changhe Yuan and Brandon Malone. Learning Optimal Bayesian Networks: A Shortest Path Perspective. Journal of Artificial Intelligence Research, 48(1):23-65, October 2013. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=2591248.2591250.

[48] Do Chickering, Dan Geiger, and David Heckerman. Learning Bayesian networks: Search methods and experimental results. In Proceedings of Fifth Conference on Artificial Intelligence and Statistics, pages 112-128, 1995.

[49] Tomi Silander, Janne Leppä-aho, Elias Jääsaari, and Teemu Roos. Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures. In International Conference on Artificial Intelligence and Statistics, pages 948-957, 2018.

[50] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning.
In Advances in Neural Information Processing Systems, pages 9472-9483, 2018.

[51] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG Structure Learning with Graph Neural Networks. arXiv preprint arXiv:1904.10098, 2019.

[52] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

[53] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

[54] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[55] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

[56] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[57] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

[58] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681, 1997.

[59] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.

[60] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[61] Marco Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, Articles, 35(3):1-22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v035.i03. URL https://www.jstatsoft.org/v035/i03.

[62] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 157-224, 1988.

[63] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264, 2006.

[64] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131-162. Springer, 2010.

[65] Tom White. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.

[66] Marc-André Zöller and Marco F Huber. Survey on automated machine learning. arXiv preprint arXiv:1904.12054, 2019.

[67] Jonas Mueller, David Gifford, and Tommi Jaakkola. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning, pages 2536-2544, 2017.

[68] Nicolo Fusi, Rishit Sheth, and Melih Elibol. Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems, pages 3352-3361, 2018.

[69] Benjamin Yackley and Terran Lane. Smoothness and Structure Learning by Proxy. In International Conference on Machine Learning, 2012.

[70] Blake Anderson and Terran Lane. Fast Bayesian network structure search using Gaussian processes. 2009. Available at https://www.cs.unm.edu/ treport/tr/09-06/paper.pdf.

[71] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.