{"title": "Neural Relational Inference with Fast Modular Meta-learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11827, "page_last": 11838, "abstract": "Graph neural networks (GNNs) are effective models for many dynamical systems consisting of entities and relations. Although most GNN applications assume a single type of entity and relation, many situations involve multiple types of interactions. Relational inference is the problem of inferring these interactions and learning the dynamics from observational data. We frame relational inference as a modular meta-learning problem, where neural modules are trained to be composed in different ways to solve many tasks. This meta-learning framework allows us to implicitly encode time invariance and infer relations in context of one another rather than independently, which increases inference capacity.  Framing inference as the inner-loop optimization of meta-learning leads to a model-based approach that is more data-efficient and capable of estimating the state of entities that we do not observe directly, but whose existence can be inferred from their effect on observed entities. To address the large search space of graph neural network compositions, we meta-learn a proposal function that speeds up the inner-loop simulated annealing search within the modular meta-learning algorithm, providing two orders of magnitude increase in the size of problems that can be addressed.", "full_text": "Neural Relational Inference\n\nwith Fast Modular Meta-learning\n\nFerran Alet, Erica Weng, Tom\u00e1s Lozano P\u00e9rez, Leslie Pack Kaelbling\n\nMIT Computer Science and Arti\ufb01cial Intelligence Laboratory\n\n{alet,ericaw,tlp,lpk}@mit.edu\n\nAbstract\n\nGraph neural networks (GNNs) are effective models for many dynamical systems\nconsisting of entities and relations. Although most GNN applications assume\na single type of entity and relation, many situations involve multiple types of\ninteractions. 
Relational inference is the problem of inferring these interactions and\nlearning the dynamics from observational data. We frame relational inference as a\nmodular meta-learning problem, where neural modules are trained to be composed\nin different ways to solve many tasks. This meta-learning framework allows us\nto implicitly encode time invariance and infer relations in context of one another\nrather than independently, which increases inference capacity. Framing inference\nas the inner-loop optimization of meta-learning leads to a model-based approach\nthat is more data-ef\ufb01cient and capable of estimating the state of entities that we\ndo not observe directly, but whose existence can be inferred from their effect\non observed entities. To address the large search space of graph neural network\ncompositions, we meta-learn a proposal function that speeds up the inner-loop\nsimulated annealing search within the modular meta-learning algorithm, providing\ntwo orders of magnitude increase in the size of problems that can be addressed.\n\n1\n\nIntroduction\n\nMany dynamical systems can be modeled in terms of entities interacting with each other, and can be\nbest described by a set of nodes and relations. Graph neural networks (GNNs) (Gori et al., 2005;\nBattaglia et al., 2018) leverage the representational power of deep learning to model these relational\nstructures. However, most applications of GNNs to such systems only consider a single type of object\nand interaction, which limits their applicability. In general there may be several types of interaction;\nfor example, charged particles of the same sign repel each other and particles of opposite charge\nattract each other. Moreover, even when there is a single type of interaction, the graph of interactions\nmay be sparse, with only some object pairs interacting. 
Similarly, relational inference can be a useful\nframework for a variety of applications such as modeling multi-agent systems (Sun et al., 2019; Wu\net al., 2019a), discovering causal relationships (Bengio et al., 2019b) or inferring goals and beliefs of\nagents (Rabinowitz et al., 2018).\nWe would like to infer object types and their relations by observing the dynamical system. Kipf et al.\n(2018) named this problem neural relational inference and approached it using a variational inference\nframework. In contrast, we propose to approach this problem as a modular meta-learning problem:\nafter seeing many instances of dynamical systems with the same underlying dynamics but different\nrelational structures, we see a new instance for a short amount of time and have to predict how it will\nevolve. Observing the behavior of the new instance allows us to infer its relational structure, and\ntherefore make better predictions of its future behavior.\nMeta-learning, or learning to learn, aims at fast generalization. The premise is that, by training on a\ndistribution of tasks, we can learn a learning algorithm that, when given a new task, will learn from\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Modular meta-learning with graph networks; adapted from Alet et al. (2018). The system\nmeta-learns a library of node and edge modules, represented as small neural networks; at performance\n(meta-test) time, it is only necessary to infer the combination of modules that best predict the observed\ndata for the system, and use that GNN to predict further system evolution.\n\nvery little data. Recent progress in meta-learning has been very promising; however, meta-learning\nhas rarely been applied to learn building blocks for a structured domain; more typically it is used to\nadapt parameters such as neural network weights. 
Modular meta-learning (Alet et al., 2018), instead, generalizes by learning a small set of neural network modules that can be composed in different ways to solve a new task, without changing their weights. This representation allows us to generalize to unseen datasets by combining learned modules, exhibiting combinatorial generalization; i.e., \"making infinite use of finite means\" (von Humboldt, 1836/1999). In this work we show that modular meta-learning is a promising approach to the neural relational inference problem.\nWe proposed the BounceGrad algorithm (Alet et al., 2018), which alternates between simulated annealing steps, which improve the structure (the assignment of node and edge modules in the GNN) for each dataset given the current neural modules, and gradient descent steps, which optimize the module weights given the modular structure used in each dataset. This formulation of neural relational inference offers several advantages over the variational formulation of Kipf et al. (2018). Primarily, it allows joint inference of the GNN structure that best models the task data, rather than making independent predictions of the types of each edge. In addition, since it is model-based, it is much more data efficient and supports other inferences for which it was not trained. However, the fact that the space of compositional hypotheses for GNNs is so large poses computational challenges for the original modular meta-learning algorithm, which could only tackle small modular compositions and meta-datasets of only a few hundred tasks, instead of the 50,000 tasks in our current framework.\nOur contributions are the following:\n\n1. A model-based approach to neural relational inference by framing it as a modular meta-learning problem. This leads to much higher data efficiency and enables the model to make inferences for which it was not originally trained.\n\n2. 
Speeding up modular meta-learning by two orders of magnitude, allowing it to scale\nto big datasets and modular compositions. With respect to our previous work (Alet et al.,\n2018), we increase the number of modules from 6 to 20 and the number of datasets from\na few hundreds to 50,000. We do so by showing we can batch computation over multiple\ntasks (not possible with most gradient-based meta-learning methods) and learning a proposal\nfunction that speeds up simulated annealing.\n\n3. We propose to leverage meta-data coming from each inner optimization during meta-\ntraining to simultaneously learn to learn and learn to optimize. Most meta-learning\nalgorithms only leverage loss function evaluations to propagate gradients back to a model\nand discard other information created by the inner loop optimization. We can leverage\nthis \u201cmeta-data\u201d to learn to perform these inner loop optimizations more ef\ufb01ciently; thus\nspeeding up both meta-training and meta-test optimizations.\n\n2 Related Work\n\nGraph neural networks (Battaglia et al., 2018) perform computations over a graph (see recent surveys\nby Battaglia et al. (2018); Zhou et al. (2018); Wu et al. (2019b)), with the aim of incorporating\nrelational inductive biases: assuming the existence of a set of entities and relations between them.\nAmong their many uses, we are especially interested in their ability to model dynamical systems.\nGNNs have been used to model objects (Chang et al., 2016; Battaglia et al., 2016; van Steenkiste et al.,\n\n2\n\n\f2018; Hamrick et al., 2018), parts of objects such as particles (Mrowca et al., 2018), links (Wang et al.,\n2018a), or even \ufb02uids (Li et al., 2018b) and partial differential equations (Alet et al., 2019). However,\nmost of these models assume a \ufb01xed graph and a single relation type that governs all interactions.\nWe want to do without this assumption and infer the relations, as in neural relational inference\n(NRI) (Kipf et al., 2018). 
Computationally, we build on the framework of message-passing neural networks (MPNNs) (Gilmer et al., 2017), similar to graph convolutional networks (GCNs) (Kipf & Welling, 2016; Battaglia et al., 2018).\nIn NRI, one infers the edge type of every node pair based on node states or state trajectories. This problem is related to generating graphs that follow some training distribution, as in applications such as molecule design. Some approaches generate edges independently (Simonovsky & Komodakis, 2018; Franceschi et al., 2019) or independently prune them from an over-complete graph (Selvan et al., 2018), some generate them sequentially (Johnson, 2017; Li et al., 2018a; Liu et al., 2018) and others generate graphs by first generating their junction tree (Jin et al., 2018). In our approach to NRI, we make iterative improvements to a hypothesized graph with a learned proposal function.\nThe literature in meta-learning (Schmidhuber, 1987; Thrun & Pratt, 2012; Lake et al., 2015) and multi-task learning (Torrey & Shavlik, 2010) is very extensive. However, it mostly involves parametric generalization; i.e., generalizing by changing parameters: either weights in a neural network, as in MAML and other variants (Finn et al., 2017; Clavera et al., 2019; Nichol et al., 2018), or in the inputs fed to the network by using LSTMs or similar methods (Ravi & Larochelle, 2017; Vinyals et al., 2016; Mishra et al., 2018; Garcia & Bruna, 2017).\nIn contrast, we build on our method of modular meta-learning, which aims at combinatorial generalization by reusing modules in different structures. This framework is a better fit for GNNs, which also heavily exploit module reuse. 
Combinatorial generalization plays a key role within a growing community that aims to merge the best aspects of deep learning with structured solution spaces in order to obtain broader generalizations (Tenenbaum et al., 2011; Reed & De Freitas, 2015; Andreas et al., 2016; Fernando et al., 2017; Ellis et al., 2018; Pierrot et al., 2019). This and similar ideas in multi-task learning (Fernando et al., 2017; Meyerson & Miikkulainen, 2017) have been used to plan efficiently (Chitnis et al., 2018) or find causal structures (Bengio et al., 2019a). Notably, Chang et al. (2018) learn to tackle a single task using an input-dependent modular composition, with a neural network trained with PPO (Schulman et al., 2017), a variant of policy gradients, deciding the composition. This has similarities to our bottom-up proposal approach in section 4, except we train the proposal function via supervised learning on data from the slower simulated annealing search.\n\n3 Methods\n\nFirst, we describe the original approaches to neural relational inference and modular meta-learning, then we detail our strategy for meta-learning the modules for a GNN model.\n\n3.1 Neural relational inference\n\nConsider a set of n known entities with states that evolve over T time steps: s_1^{1:T}, ..., s_n^{1:T}. Assume that each pair of entities is related according to one of a small set of unknown relations, which govern the dynamics of the system. For instance, these entities could be charged particles that can either attract or repel each other. Our goal is to predict the evolution of the dynamical system; i.e., given s_1^T, ..., s_n^T, predict values of s_1^{T+1:T+k}, ..., s_n^{T+1:T+k}. If we knew the true relations between the entities (which pairs of particles attract or repel) it would be easy to predict the evolution of the system. However, instead of being given those relations we have to infer them from the raw observational data.\n\nFigure 2: Task setup, taken from Kipf et al. (2018): we want to predict the evolution of a dynamical system by inferring the set of relations between the entities, such as attraction and repulsion between charged particles.\n\nMore formally, let G be a graph, with nodes v_1, ..., v_n and edges e_1, ..., e_{r'}. Let S be a structure detailing a mapping from each node to its corresponding node module and from each edge to its corresponding edge module. We can now run several steps of message passing: in each step, nodes read incoming messages from their neighbors and sum them, to then update their own states. The message µ_{ij} from node i to j is computed using the edge module determined by S, m_{S_ij}, which takes the states of nodes i and j as input, so µ^t_{ij} = m_{S_ij}(s^t_i, s^t_j). The state of each node is then updated using its own neural network module m_{S_i} (in our experiments, this module is the same across all nodes), which takes as input the sum of its incoming messages, so\n\ns^{t+1}_i = s^t_i + m_{S_i}(s^t_i, Σ_{j∈neigh(v_i)} µ^t_{ji}).\n\nWe apply this procedure T times to get s^{t+1}, ..., s^T; the whole process is differentiable, allowing us to train the parameters of m_{S_i}, m_{S_ij} end-to-end based on predictive loss.\nIn the neural relational inference (NRI) setting, the structure S is latent, and must be inferred from observations of the state sequence. In particular, NRI requires both learning the edge and node modules, m, and determining which module is used in which position (finding structure S for each scene). Kipf et al. (2018) propose using a variational auto-encoder with a GNN encoder and decoder, and using the Gumbel softmax representation to obtain categorical distributions. 
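To make the message-passing update above concrete, here is a minimal sketch in NumPy; the toy lambda modules stand in for the paper's small neural networks, and all names are illustrative:

```python
import numpy as np

def message_passing_step(states, structure, edge_modules, node_module):
    """One GNN step: each directed edge (i, j) uses the edge module chosen by
    the structure S to compute a message from (s_i, s_j); each node then adds
    an update computed from its own state and the sum of incoming messages."""
    incoming = np.zeros_like(states)
    for (i, j), k in structure.items():   # structure maps edge -> module index
        incoming[j] += edge_modules[k](states[i], states[j])
    return states + np.stack([node_module(s, m) for s, m in zip(states, incoming)])

# Toy modules standing in for learned networks: an "attract" edge pulls the
# receiver toward the sender; the node module just applies the summed messages.
attract = lambda s_i, s_j: 0.1 * (s_i - s_j)
node_update = lambda s, msg: msg

states = np.array([[0.0], [1.0]])
structure = {(0, 1): 0, (1, 0): 0}        # both edges use edge module 0
print(message_passing_step(states, structure, [attract], node_update))
# -> [[0.1], [0.9]]: the two particles move toward each other
```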
The encoder is a\ngraph neural network that takes an embedding of the trajectory of every particle and outputs, for each\nnode pair, a distribution over possible edge modules. The decoder samples from this factored graph\ndistribution to get a graph representing a GNN that can be run to obtain output data. However, the\nprobability distribution over structures is completely factored (each edge is chosen independently),\nwhich can be a poor approximation when the effects of several edges are interdependent or the graph\nis known a priori to have some structural property (such as being symmetric, a tree, or bipartite).\n\n3.2 Modular meta-learning\n\nMeta-learning can be seen as learning a learning algorithm. In the context of supervised learning,\ninstead of learning a regressor f with parameters \u0398 with the objective that f (xtest, \u0398) \u2248 ytest, we\naim to learn an algorithm A that takes a small training set Dtrain = (xtrain, ytrain) and returns a\nhypothesis h that performs well on the test set:\nh = A(Dtrain , \u0398) s.t. h(xtest) \u2248 ytest; i.e. A minimizes L(A(Dtrain , \u0398)(xtest), ytest) for loss L.\nSimilar to conventional learning algorithms, we optimize \u0398, the parameters of A, to perform well.\nModular meta-learning learns a set of small neural network modules and forms hypotheses by\ncomposing them into different structures. In particular, let m1, . . . , mk be the set of modules, with\nparameters \u03b81, . . . , \u03b8k and S be a set of structures that describes how modules are composed. For\nexample, simple compositions can be adding the modules\u2019 outputs, concatenating them, or using the\noutput of several modules to guide attention over the results of other modules.\nFor modular meta-learning, \u0398 = (\u03b81, . . . , \u03b8k) are the weights of modules m1, . . . , mk, and the\nalgorithm A operates by searching over the set of possible structures S to \ufb01nd the one that best \ufb01ts\nDtrain, and applies it to xtest. 
Let h_{S,Θ} be the function that predicts the output using the modular structure S and parameters Θ. Then\n\nA(D_train, Θ) = h_{S*,Θ}  where  S* = argmin_{S∈S} L(h_{S,Θ}(x_train), y_train).\n\nNote that, in contrast to many meta-learning algorithms, Θ is constant when learning a new task.\nAt meta-training time we have to find module weights θ1, ..., θk that compose well. To do this, we proposed the BOUNCEGRAD algorithm (Alet et al., 2018) to optimize the modules and find the structure for each task. It works by alternating steps of simulated annealing and gradient descent. Simulated annealing (a stochastic combinatorial optimization algorithm) optimizes the structure of each task using its train split. Gradient descent steps optimize module weights with the test split, pooling gradients from each instance of a module applied to different tasks. At meta-test time, it has access to the final training data set, which it uses to perform structure search to arrive at a final hypothesis.\n\n4 Modular meta-learning graph neural networks\n\nTo apply modular meta-learning to GNNs, we let G be the set of node modules g_1, ..., g_{|G|}, where g_i is a network with weights θ_{g_i}, and let H be the set of edge modules h_1, ..., h_{|H|}, where h_i has weights θ_{h_i}. We then apply a version of the BOUNCEGRAD method, described in the appendix. Both modular meta-learning and GNNs exhibit combinatorial generalization, combining small components in flexible ways to solve new problem instances, making modular meta-learning a particularly appropriate strategy for meta-learning in GNNs.\nTo use modular meta-learning for NRI, we create a number of edge modules that is greater than or equal to the potential number of types of interactions; then with modular meta-learning we learn specialized edge modules that can span many types of behaviors with different graphs. 
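The BOUNCEGRAD alternation can be sketched as follows; this is a toy, self-contained version in which a "structure" is just an integer standing in for the GNN module assignment, and `grad_step` is a placeholder for the pooled gradient update on module weights (all names are illustrative):

```python
import math
import random

def bouncegrad_iteration(structures, tasks, loss, propose, temperature, grad_step):
    """One outer iteration of a BOUNCEGRAD-style alternation (sketch).
    (a) Simulated annealing: propose a new structure for each task and accept
        it with the Metropolis criterion on that task's train split.
    (b) Gradient descent: one step on module weights using the test splits."""
    for t, task in enumerate(tasks):
        candidate = propose(structures[t])
        delta = loss(candidate, task["train"]) - loss(structures[t], task["train"])
        # Accept improvements always; accept worse moves with small probability.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            structures[t] = candidate
    grad_step(structures, [task["test"] for task in tasks])  # pooled over tasks
    return structures

# Toy instance: the best "structure" is the integer equal to the task's data.
random.seed(0)
structures, tasks = [0], [{"train": 5, "test": 5}]
for _ in range(100):
    bouncegrad_iteration(structures, tasks,
                         loss=lambda s, d: abs(s - d),
                         propose=lambda s: s + random.choice([-1, 1]),
                         temperature=0.05,
                         grad_step=lambda s, d: None)  # weight update omitted
print(structures)  # -> [5]
```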
For a new scene we\ninfer relations by optimizing the edge modules that best \ufb01t the data and then classifying the relation\naccording to the module used for that edge slot. This formulation of neural relational inference has a\nnumber of advantages.\nFirst, the simulated annealing (SA) step in the BOUNCEGRAD algorithm searches the space of struc-\ntures, tackling the problem directly in its combinatorial form rather than via differentiable variational\napproximations. Moreover, with SA, relations are inferred as a whole instead of independently; this\nis critical for inferring the correct relationships from short observation sequences of complex scenes,\nwhere there could be many \ufb01rst-order candidate explanations that roughly approximate the scene\nand one has to use higher-order dependencies to obtain an accurate model. For instance, if we are\ntrying to infer the causal relationship between two variables A and B and we have 40% probability\nof A \u2192 B and 60% of B \u2192 A, we want to express that these choices are mutually exclusive and the\nprobability of having both edges is 0% and not 24%.\nSecond, our formulation is a more direct, model-based approach. Given observational data from a\nnew scene (task from the meta-learning perspective), we infer an underlying latent model (types of\nrelations among the entities) by directly optimizing the ability of the inferred model to predict the\nobserved data. This framework allows facts about the underlying model to improve inference, which\nimproves generalization performance with small amounts of data. For example, the fact that the\nmodel class is GNNs means that the constraint of an underlying time-invariant dynamics is built into\nthe learning algorithm. The original feed-forward inference method for NRI cannot take advantage of\nthis important inductive bias. Another consequence of the model-based approach is that we can ask\nand answer other inference questions. 
An important example is that we can infer the existence and relational structure of unobserved entities based only on their observable effects on other entities.\nHowever, our modular meta-learning formulation poses a substantial computational challenge. Choosing the module type for each edge in a fully connected graph requires (n choose 2) = O(n^2) decisions; thus the search space increases as |H|^{O(n^2)}, which is too large even for small graphs. We address this problem by proposing two improvements to the BOUNCEGRAD algorithm, which together result in order-of-magnitude improvements in running time.\n\nMeta-learning a proposal function  One way to improve stochastic search methods, including simulated annealing, is to improve the proposal distribution, so that many fewer proposed moves are rejected. Similar proposals have been made in the context of particle filters (Doucet et al., 2000; Mahendran et al., 2012; Andrieu & Thoms, 2008). One strategy is to improve the proposal distribution by treating it as another parameter to be meta-learned (Wang et al., 2018b); this can be effective, but only at meta-test time. We take a different approach, which is to treat the current structures in simulated annealing as training examples for a new proposal function. Note that to train this proposal function we have plenty of data coming from search at meta-training time. In particular, after we evaluate a batch of tasks we take them and their respective structures as ground truth for a batch update to the proposal function. As the algorithm learns to learn, it also learns to optimize faster, since the proposal function suggests changes that tend to be accepted more often. This makes meta-training (and not only meta-testing) faster and makes the simulated annealing structures better, which in turn improves the proposal function. 
This virtuous cycle is similar to the relationship between the fast policy network and the slow MCTS planner in AlphaZero (Silver et al., 2017), analogous to our proposal function and simulated annealing optimization, respectively.\nOur proposal function takes a dataset D of state transitions and outputs a factored probability distribution over the modules for every edge. This predictor is structurally equivalent to the encoder of Kipf et al. (2018). We use this function to generate a proposal for SA by sampling a random node, and then using the predicted distribution to resample modules for each of the incoming edges. This blocked Gibbs sampler is very efficient because edges going to the same node are highly correlated and it is better to propose a coherent set of changes all at once. To train the proposal function, it would be ideal to know the true structures associated with the training data. Since we do not have access to the true structures, we use the best proxy for them: the current structure in the simulated annealing search. Therefore, for each batch of datasets we do a simulated annealing step on the training data to decide whether to update the structure. Then, we use the current batch of structures as targets for the current batch of datasets, providing a batch of training data for the proposal function.\nLearning to learn has been combined with learning to optimize (Li & Malik, 2016) before in meta-learning, in the context of meta-learned loss functions (Yu et al., 2018; Bechtle et al., 2019). Similarly, we think that other metadata generated by the inner-loop optimizations during meta-training could be useful to other few-shot learning algorithms, which could be made more efficient by simultaneously learning to optimize. 
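A minimal sketch of this blocked proposal (plain Python; `edge_dist` is a stand-in for the trained proposal network's per-edge output, and all names are ours):

```python
import random

def blocked_proposal(structure, n_nodes, edge_dist):
    """Propose a simulated annealing move: pick a random node and resample the
    module of every incoming edge from the learned per-edge distribution.
    Resampling a node's incoming edges jointly proposes a coherent block of
    changes, since edges into the same node are highly correlated."""
    proposal = dict(structure)
    j = random.randrange(n_nodes)
    for i in range(n_nodes):
        if i != j:
            probs = edge_dist[(i, j)]      # stand-in for the proposal network
            proposal[(i, j)] = random.choices(range(len(probs)), weights=probs)[0]
    return proposal

# Toy check: with all probability mass on module 1, every edge into the chosen
# node switches from module 0 to module 1; all other edges are left untouched.
n = 3
structure = {(i, j): 0 for i in range(n) for j in range(n) if i != j}
dist = {edge: [0.0, 1.0] for edge in structure}
print(blocked_proposal(structure, n, dist))
```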
In doing so, we could get meta-learning algorithms with expressive and non-local, but also fast, inner-loop adaptations.\n\nBatched modular meta-learning  From an implementation standpoint, it is important to note that, in contrast to most gradient-based meta-learning algorithms (Zintgraf et al. (2018) being a notable exception), modular meta-learning does not need to change the weights of the neural network modules in its inner loop. This enables us to run the same network for many different datasets in a batch, exploiting the parallelization capabilities of GPUs and with constant memory cost for the network parameters. Doing so is especially convenient for GNN structures. We use a common parallelization in graph neural network training, creating a super-graph composed of many graphs, one per dataset. Creating this graph only involves minor book-keeping, renaming vertices and edges. Since both edge and node modules can run all their instances in the same graph in parallel, they will parallelize the execution of all the datasets in the batch. Moreover, since the graphs of the different datasets are disconnected, their dynamics will not affect one another. In practice, this implementation speeds up both training and evaluation by an order of magnitude. Similar book-keeping methods are applicable to speed up modular meta-learning for structures other than GNNs.\n\n5 Experiments\n\nWe implement our solution in PyTorch (Paszke et al., 2017), using the Adam optimizer (Kingma & Ba, 2014); details and pseudo-code can be found in the appendix and code can be found at https://github.com/FerranAlet/modular-metalearning. We follow the choices of Kipf et al. (2018) whenever possible to make results comparable. 
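The super-graph book-keeping described in section 4 can be sketched as follows (a minimal illustration of the vertex renaming; the helper is ours, not from the released code):

```python
def batch_graphs(graphs):
    """Merge per-dataset graphs into one disconnected super-graph by shifting
    vertex indices: vertex v of dataset t becomes v + offsets[t]. All instances
    of each edge/node module can then run over every dataset in a single GNN
    pass, and the disconnected components cannot influence one another."""
    offsets, total_nodes, edges = [], 0, []
    for n_nodes, edge_list in graphs:   # each graph: (node count, edge list)
        offsets.append(total_nodes)
        edges += [(i + total_nodes, j + total_nodes, module)
                  for (i, j, module) in edge_list]
        total_nodes += n_nodes
    return total_nodes, edges, offsets

# Two datasets: 2 nodes with edge (0 -> 1), and 3 nodes with edge (1 -> 2).
print(batch_graphs([(2, [(0, 1, 0)]), (3, [(1, 2, 1)])]))
# -> (5, [(0, 1, 0), (3, 4, 1)], [0, 2])
```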
Please see the arXiv version for complete results.\nWe begin by addressing two problems on which NRI was originally demonstrated, then show that our approach can be applied to the novel problem of inferring the existence of unobserved nodes.\n\n5.1 Predicting physical systems\n\nTwo datasets from Kipf et al. (2018) are available online (https://github.com/ethanfetaya/NRI/); in each one, we observe the state of a dynamical system for 50 time steps and are asked both to infer the relations between object pairs and to predict their states for the next 10 time steps.\nSprings: a set of 5 particles moves in a box with elastic collisions with the walls. Each pair of particles is connected with a spring with probability 0.5. The spring exerts forces following Hooke's law. We observe that the graph of forces is symmetric, but none of the algorithms hard-codes this fact.\nCharged particles: similar to springs, a set of 5 particles moves in a box, but now all particles interact. A particle is set to have positive charge with probability 0.5 and negative charge otherwise. Particles of opposite charges attract and particles of the same charge repel, both following Coulomb's law. This behavior can be modeled using two edge modules, one which pulls a particle i closer to j and another that pushes it away. 
We observe that the graph of attraction is both symmetric and bipartite, but none of the algorithms hard-codes this fact.\n\nTable 1: Prediction results evaluated on datasets from Kipf et al. (2018), including their baselines for comparison. Mean-squared error in prediction after T steps; lower is better. We observe that our method is able to either match or improve the performance of the auto-encoder based approach, despite it being close to optimal.\n\nModel                   | Springs, 1 step | Springs, 10 steps | Charged, 1 step | Charged, 10 steps\nStatic                  | 7.93e-5         | 7.59e-3           | 5.09e-3         | 2.26e-2\nLSTM (single)           | 2.27e-6         | 4.69e-4           | 2.71e-3         | 7.05e-3\nLSTM (joint)            | 4.13e-8         | 2.19e-5           | 1.68e-3         | 6.45e-3\nNRI (full graph)        | 1.66e-5         | 1.64e-3           | 1.09e-3         | 3.78e-3\n(Kipf et al., 2018)     | 3.12e-8         | 3.29e-6           | 1.05e-3         | 3.21e-3\nModular meta-l.         | 3.13e-8         | 3.25e-6           | 1.03e-3         | 3.11e-3\nNRI (true graph)        | 1.69e-11        | 1.32e-9           | 1.04e-3         | 3.03e-3\n\nTable 2: Edge type prediction accuracy. Correlation baselines try to infer the pairwise relation between two particles with a simple classifier built upon the correlation between the temporal sequence of raw states or LSTM hidden states, respectively. The supervised gold standard trains the encoder alone with the ground truth edges. Our work matches the gold standard on the springs dataset and halves the distance between the variational approach and the gold standard in the charged particles domain.\n\nModel               | Springs | Charged\nCorrelation (data)  | 52.4    | 55.8\nCorrelation (LSTM)  | 52.7    | 54.2\n(Kipf et al., 2018) | 99.9    | 82.1\nModular meta-l.     | 99.9    | 88.4\nSupervised          | 99.9    | 95.0\n\nFigure 3: Accuracy as a function of the training set size (note the logarithmic axis). By being model-based, modular meta-learning is around 3-5 times more data efficient than the variational approach.\n\nOur main goal is to recover the relations accurately just from observational data from the trajectories, despite having no labels for the relations. 
To do so we minimize the surrogate goal of trajectory\nprediction error, as our model has to discover the underlying relations in order to make good\npredictions. We compare to 4 baselines and the novel method used by Kipf et al. (2018). Two of\nthese baselines resemble other popular meta-learning algorithms that do not properly exploit the\nmodularity of the problem: feeding the data to LSTMs (either a single trajectory or the trajectory\nof all particles) is analogous to recurrent networks used for few-shot learning (Ravi & Larochelle,\n2017) and using a graph neural network with only one edge to do predictions is similar to the work\nof Garcia & Bruna (2017) to classify images by creating fully connected graphs of the entire dataset.\nTo make the comparisons as fair as possible, all the neural network architectures (with the encoder in\nthe auto-encoder framework being our proposal function) are exactly the same.\nPrediction error results (table 1) for training on the full dataset indicate that our approach performs\nas well as or better than all other methods on both problems. This in turn leads to better edge\npredictions, shown in table 2, with our method substantially more accurately predicting the edge types\nfor the charged particle domain. By optimizing the edge-choices jointly instead of independently,\nour method has higher capacity, thus reaching higher accuracies for charged particles. Note that the\nhigher capacity also comes with higher computational cost, but not higher number of parameters\n(since the architectures are the same). In addition, we compare generalization performance of our\nmethod and the VAE approach of Kipf et al. (2018) by plotting predictive accuracy as a function of\nthe number of meta-training tasks in \ufb01gure 3. 
Our more model-based strategy has a built-in inductive\nbias that makes it perform signi\ufb01cantly better in the low-data regime.\n\n7\n\n\fFigure 4: We observe the trajectories shown in the black box and notice that they differ from the\npredictions of our model (red box). We can then hypothesize models with an additional, unseen,\nentity (red particle in green box) that is pulling the cyan particle higher and the black and blue particle\ntowards the right. Conditioning the trajectory of the particle on those predicted by our model, we\nmake a good estimate of the position of the unseen particle.\n\n5.2\n\nInferring unseen nodes\n\nIn many structured learning problems, we can improve the quality of our predictions by adding\nadditional latent state. For example, in graphical models, adding \"hidden cause\" nodes can substan-\ntially reduce model complexity and improve generalization performance. In NRI, we may improve\npredictive accuracy by adding an additional latent object to the model, represented as a latent node in\nthe GNN. A famous illustrative example is the discovery of Neptune in 1846 thanks to mathemat-\nical predictions based on its gravitational pull on Uranus, which modi\ufb01ed its trajectory. Based on\nthe deviations of Uranus\u2019 trajectory from its theoretical trajectory had there been no other planets,\nmathematicians were able to guess the existence of an unobserved planet and estimate its location.\nBy casting NRI as modular meta-learning, we have developed a model-based approach that allows\nus to infer properties beyond the edge relations. More concretely, we can add a node to the graph\nand optimize its trajectory as part of the inner-loop optimization of the meta-learning algorithm. We\nonly need to add the predicted positions at every time-step t for the new particle and keep the same\nself-supervised prediction loss. 
This loss applies both to the unseen object, ensuring that it has a realistic trajectory, and to the observed objects, optimizing the latent node's state so that it influences the observed nodes appropriately.
In practice, directly optimizing the trajectory is a very non-smooth problem in R^{4×T} (where T is the length of the predicted trajectories), which is difficult to search. Instead of searching for an optimal trajectory, we optimize only the initial state and determine the following states by running our learned predictive model. However, since small perturbations can lead to large deviations in the long run, the optimization is highly non-linear. We therefore resort to a combination of random sampling and gradient descent, refining our current best guess by gradient descent while continuing to sample radically different solutions. Detailed pseudo-code for this optimization can be found in the appendix. We illustrate this capability on the springs dataset by first training a good model with the true edges and then finding the trajectory of one of the particles given the other four; we are able to predict the state with an MSE of 1.09e-3, which is less than the error of some baselines that saw the entire dynamical system up to 10 timesteps prior, as seen in table 1.

6 Conclusion

We proposed framing relational inference as a modular meta-learning problem, where neural modules are trained to be composed in different ways to solve many tasks. We demonstrated that this approach leads to improved performance with less training data. We also showed how this framing enables us to estimate the state of entities that we do not observe directly.
To address the large search space of graph neural network compositions, we meta-learn a proposal function that speeds up the inner-loop simulated annealing search of the modular meta-learning algorithm, providing a one to two orders of magnitude increase in the size of problems that can be addressed.

Acknowledgments

We gratefully acknowledge support from NSF grants 1523767 and 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from the Honda Research Institute; and from SUTD Temasek Laboratories. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors. The authors want to thank Clement Gehring for insightful discussions and his help setting up the experiments, Thomas Kipf for his quick and detailed answers about his paper, and Maria Bauza for her feedback on an earlier draft of this work.

References

Ferran Alet, Tomas Lozano-Perez, and Leslie P. Kaelbling. Modular meta-learning. In Proceedings of The 2nd Conference on Robot Learning, pp. 856–868, 2018.

Ferran Alet, Adarsh K Jeewajee, Maria Bauza, Alberto Rodriguez, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Graph element networks: adaptive, structured computation and memory. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48, 2016.

Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp.
4502–4510, 2016.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti, Gaurav Sukhatme, and Franziska Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.

Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.

Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

Michael B Chang, Abhishek Gupta, Sergey Levine, and Thomas L Griffiths. Automatically composing representation transformations as a means for generalization. arXiv preprint arXiv:1807.04640, 2018.

Rohan Chitnis, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning quickly to plan quickly using modular meta-learning. arXiv preprint arXiv:1809.07878, 2018.

Ignasi Clavera, Anusha Nagabandi, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. In International Conference on Learning Representations, 2019.

Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks.
In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 176–183. Morgan Kaufmann Publishers Inc., 2000.

Kevin Ellis, Lucas Morales, Mathias Sablé Meyer, Armando Solar-Lezama, and Joshua B Tenenbaum. Search, compress, compile: Library learning in neurally-guided Bayesian program learning. In Advances in Neural Information Processing Systems, 2018.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. arXiv preprint arXiv:1903.11960, 2019.

Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pp. 729–734, 2005.

Jessica B Hamrick, Kelsey R Allen, Victor Bapst, Tina Zhu, Kevin R McKee, Joshua B Tenenbaum, and Peter W Battaglia. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.

Daniel D Johnson. Learning graphical state transitions.
In International Conference on Learning Representations (ICLR), 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.

Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018a.

Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566, 2018b.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. Constrained graph variational autoencoders for molecule design. In NeurIPS, pp. 7806–7815, 2018.

Nimalan Mahendran, Ziyu Wang, Firas Hamze, and Nando De Freitas. Adaptive MCMC with Bayesian optimization. In Artificial Intelligence and Statistics, pp. 751–760, 2012.

Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. arXiv preprint arXiv:1711.00108, 2017.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations (ICLR), 2018.

Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Joshua B. Tenenbaum, and Daniel L. K. Yamins.
Flexible neural representation for physics prediction. arXiv preprint arXiv:1806.08047, 2018.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.org/abs/1803.02999.

Adam Paszke, Sam Gross, and Adam Lerer. Automatic differentiation in PyTorch. In International Conference on Learning Representations, 2017.

Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, and Nando de Freitas. Learning compositional neural programs with recursive tree search and planning. arXiv preprint arXiv:1905.12941, 2019.

Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

Scott Reed and Nando De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Raghavendra Selvan, Thomas Kipf, Max Welling, Jesper H Pedersen, Jens Petersen, and Marleen de Bruijne. Graph refinement based tree extraction using mean-field networks and graph neural networks. arXiv preprint arXiv:1811.08674, 2018.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge.
Nature, 550(7676):354, 2017.

Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Springer, 2018.

Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641, 2019.

Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012.

Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242–264. IGI Global, 2010.

Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wilhelm von Humboldt. On Language: On the Diversity of Human Language Construction and its Influence on the Mental Development of the Human Species. Cambridge University Press, 1836/1999.

Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. In International Conference on Learning Representations, 2018a.

Tongzhou Wang, Yi Wu, Dave Moore, and Stuart J Russell. Meta-learning MCMC proposals. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 4146–4156.
Curran Associates, Inc., 2018b.

Jianchao Wu, Limin Wang, Li Wang, Jie Guo, and Gangshan Wu. Learning actor relation graphs for group activity recognition. arXiv preprint arXiv:1904.10117, 2019a.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019b.

Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.

Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.

Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. CAML: Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.