{"title": "Flexible neural representation for physics prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 8799, "page_last": 8810, "abstract": "Humans have a remarkable capacity to understand the physical dynamics of objects in their environment, flexibly capturing complex structures and interactions at multiple levels of detail. \nInspired by this ability, we propose a hierarchical particle-based object representation that covers a wide variety of types of three-dimensional objects, including both arbitrary rigid geometrical shapes and deformable materials. \nWe then describe the Hierarchical Relation Network (HRN), an end-to-end differentiable neural network based on hierarchical graph convolution, that learns to predict physical dynamics in this representation. \nCompared to other neural network baselines, the HRN accurately handles complex collisions and nonrigid deformations, generating plausible dynamics predictions at long time scales in novel settings, and scaling to large scene configurations.\nThese results demonstrate an architecture with the potential to form the basis of next-generation physics predictors for use in computer vision, robotics, and quantitative cognitive science.", "full_text": "Flexible Neural Representation for Physics Prediction\n\nDamian Mrowca1,\u21e4, Chengxu Zhuang2,\u21e4, Elias Wang3,\u21e4, Nick Haber2,4,5 , Li Fei-Fei1 ,\n\nJoshua B. Tenenbaum7,8 , and Daniel L. K. 
Yamins1,2,6\n\nDepartment of Computer Science1, Psychology2, Electrical Engineering3, Pediatrics4 and\nBiomedical Data Science5, and Wu Tsai Neurosciences Institute6, Stanford, CA 94305\n\nDepartment of Brain and Cognitive Sciences7, and Computer Science and Arti\ufb01cial Intelligence\n\nLaboratory8, MIT, Cambridge, MA 02139\n\n{mrowca, chengxuz, eliwang}@stanford.edu\n\nAbstract\n\nHumans have a remarkable capacity to understand the physical dynamics of objects\nin their environment, \ufb02exibly capturing complex structures and interactions at\nmultiple levels of detail. Inspired by this ability, we propose a hierarchical particle-\nbased object representation that covers a wide variety of types of three-dimensional\nobjects, including both arbitrary rigid geometrical shapes and deformable materi-\nals. We then describe the Hierarchical Relation Network (HRN), an end-to-end\ndifferentiable neural network based on hierarchical graph convolution, that learns\nto predict physical dynamics in this representation. Compared to other neural\nnetwork baselines, the HRN accurately handles complex collisions and nonrigid\ndeformations, generating plausible dynamics predictions at long time scales in\nnovel settings, and scaling to large scene con\ufb01gurations. These results demonstrate\nan architecture with the potential to form the basis of next-generation physics\npredictors for use in computer vision, robotics, and quantitative cognitive science.\n\n1\n\nIntroduction\n\nHumans ef\ufb01ciently decompose their environment into objects, and reason effectively about the\ndynamic interactions between these objects [43, 45]. Although human intuitive physics may be\nquantitatively inaccurate under some circumstances [32], humans make qualitatively plausible guesses\nabout dynamic trajectories of their environments over long time horizons [41]. 
Moreover, they either\nare born knowing, or quickly learn about, concepts such as object permanence, occlusion, and\ndeformability, which guide their perception and reasoning [42].\nAn arti\ufb01cial system that could mimic such abilities would be of great use for applications in computer\nvision, robotics, reinforcement learning, and many other areas. While traditional physics engines\nconstructed for computer graphics have made great strides, such routines are often hard-wired\nand thus challenging to integrate as components of larger learnable systems. Creating end-to-end\ndifferentiable neural networks for physics prediction is thus an appealing idea. Recently, Chang et al.\n[11] and Battaglia et al. [4] have illustrated the use of neural networks to predict physical object\ninteractions in (mostly) 2D scenarios by proposing object-centric and relation-centric representations.\nCommon to these works is the treatment of scenes as graphs, with nodes representing object point\nmasses and edges describing the pairwise relations between objects (e.g. gravitational, spring-like, or\nrepulsing relationships). Object relations and physical states are used to compute the pairwise effects\nbetween objects. After combining effects on an object, the future physical state of the environment is\npredicted on a per-object basis. This approach is very promising in its ability to explicitly handle\n\n\u21e4Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fobject interactions. However, a number of challenges have remained in generalizing this approach\nto real-world physical dynamics, including representing arbitrary geometric shapes with suf\ufb01cient\nresolution to capture complex collisions, working with objects at different scales simultaneously, and\nhandling non-rigid objects of nontrivial complexity.\n\nFigure 1: Predicting physical dynamics. 
Given past observations the task is to predict the future\nphysical state of a system. In this example, a cube deforms as it collides with the ground. The top\nrow shows the ground truth and the bottom row the prediction of our physics prediction network.\nSeveral of these challenges are illustrated in the fast-moving deformable cube sequence depicted\nin Figure 1. Humans can \ufb02exibly vary the level of detail at which they perceive such objects in\nmotion: The cube may naturally be conceived as an undifferentiated point mass as it moves along\nits initial kinematic trajectory. But as it collides with and bounces up from the \ufb02oor, the cube\u2019s\ncomplex rectilinear substructure and nonrigid material properties become important for understanding\nwhat happens and predicting future interactions. The ease with which the human mind handles such\ncomplex scenarios is an important explicandum of cognitive science, and also a key challenge for\narti\ufb01cial intelligence. Motivated by both of these goals, our aim here is to develop a new class of\nneural network architectures with this human-like ability to reason \ufb02exibly about the physical world.\nTo this end, it would be natural to extend the interaction network framework by representing each\nobject as a (potentially large) set of connected particles. In such a representation, individual constituent\nparticles could move independently, allowing the object to deform while being constrained by pairwise\nrelations preventing the object from falling apart. However, this type of particle-based representation\nintroduces a number of challenges of its own. Conceptually, it is not immediately clear how to\nef\ufb01ciently propagate effects across such an object. 
Moreover, representing every object with hundreds\nor thousands of particles would result in an exploding number of pairwise relations, which is both\ncomputationally infeasible and cognitively unnatural.\nAs a solution to these issues, we propose a novel cognitively-inspired hierarchical graph-based object\nrepresentation that captures a wide variety of complex rigid and deformable bodies (Section 3), and\nan ef\ufb01cient hierarchical graph-convolutional neural network that learns physics prediction within this\nrepresentation (Section 4). Evaluating on complex 3D scenarios, we show substantial improvements\nrelative to strong baselines both in quantitative prediction accuracy and qualitative measures of\nprediction plausibility, and evidence for generalization to complex unseen scenarios (Section 5).\n\n2 Related Work\n\nAn ef\ufb01cient and \ufb02exible predictor of physical dynamics has been an outstanding question in neural\nnetwork design.\nIn computer vision, modeling moving objects in images or videos for action\nrecognition, future prediction, and object tracking is of great interest. Similarly in robotics, action-\nconditioned future prediction from images is crucial for navigation or object interactions. However,\nfuture predictors operating directly on 2D image representations often fail to generate sharp object\nboundaries and struggle with occlusions and remembering objects when they are no longer visually\nobservable [1, 17, 16, 28, 29, 33, 34, 19]. Representations using 3D convolution or point clouds are\nbetter at maintaining object shape [46, 47, 10, 36, 37], but do not entirely capture object permanence,\nand can be computationally inef\ufb01cient. More similar to our approach are inverse graphics methods\nthat extract a lower dimensional physical representation from images that is used to predict physics\n[25, 26, 51, 50, 52, 53, 7, 49]. Our work draws inspiration from and extends that of Chang et al. [11]\nand Battaglia et al. 
[4], which in turn use ideas from graph-based neural networks [39, 44, 9, 30, 22, 14, 13, 24, 8, 40]. Most of the existing work, however, does not naturally handle complex scene scenarios with objects of widely varying scales or deformable objects with complex materials.
Physics simulation has also long been studied in computer graphics, most commonly for rigid-body collisions [2, 12]. Particles or point masses have also been used to represent more complex physical objects, with the neural network-based NeuroAnimator being one of the earliest examples to use a hierarchical particle representation to advance the movement of physical objects [18]. Our particle-based object representation also draws inspiration from recent work on (non-neural-network) physics simulation, in particular the NVIDIA FleX engine [31, 6]. However, unlike this work, our solution is an end-to-end differentiable neural network that can learn from data.
Recent research in computational cognitive science has posited that humans run physics simulations in their mind [5, 3, 20, 48, 21]. It seems plausible that such simulations happen at just the right level of detail, flexibly adapted as needed, similar to our proposed representation. Both the ability to imagine object motion and the ability to flexibly decompose an environment into objects and parts form an important prior that humans rely on when learning new tasks, generalizing their skills to new environments, or adapting to changes in inputs and goals [27].

3 Hierarchical Particle Graph Representation

A key factor in predicting the future physical state of a system is the underlying representation. A simplifying but restrictive assumption, often made, is that all objects are rigid.
A rigid body can be represented with a single point mass and unambiguously situated in space by specifying its position and orientation, together with a separate data structure describing the object's shape and extent. Examples are 3D polygon meshes or various forms of 2D or 3D masks extracted from perceptual data [10, 16]. The rigid-body assumption describes only a fraction of the real world: it excludes, for example, soft bodies, cloths, fluids, and gases, and precludes objects breaking and combining. However, objects are divisible and made up of a potentially large number of smaller sub-parts.
Given a scene with a set of objects O, the core idea is to represent each object o ∈ O with a set of particles P_o ≡ {p_i | i ∈ o}. Each particle's state at time t is described by a vector in ℝ^7 consisting of its position x ∈ ℝ^3, velocity ẋ ∈ ℝ^3, and mass m ∈ ℝ^+. We refer to p_i and this vector interchangeably. Particles are spaced out across an object to fully describe its volume. In theory, particles can be arbitrarily placed within an object. Thus, less complex parts can be described with fewer particles (e.g. 8 particles fully define a cube), while more complicated parts (e.g. a long rod) can be represented with more particles. We define P as the set {p_i | 1 ≤ i ≤ N_P} of all N_P particles in the observed scene.

Figure 2: Hierarchical graph-based object representation. An object is decomposed into particles. Particles (of the same color) are grouped into a hierarchy representing multiple object scales. Pairwise relations constrain particles in the same group and to ancestors and descendants.

To fully physically describe a scene containing multiple objects with particles, we also need to define how the particles relate to each other. Similar to Battaglia et al. [4], we represent relations between particles p_i and p_j with K-dimensional pairwise relationships R = {r_ij ∈ ℝ^K}.
Each relationship r_ij within an object encodes material properties. For example, for a soft body r_ij ∈ ℝ represents the local material stiffness, which need not be uniform within an object. Arbitrarily-shaped objects with potentially nonuniform materials can be represented in this way. Note that the physical interpretation of r_ij is learned from data rather than hard-coded through equations. Overall, we represent the scene by a node-labeled graph G = ⟨P, R⟩ where the particles form the nodes P and the relations define the (directed) edges R. Except for the case of collisions, different objects are disconnected components within G.
The graph G is used to propagate effects through the scene. It is infeasible to use a fully connected graph for propagation, as pairwise-relationship computations grow with O(N_P^2). To achieve O(N_P log(N_P)) complexity, we construct a hierarchical scene (di)graph G_H from G in which the nodes of each connected component are organized into a tree structure: First, we initialize the leaf nodes L of G_H as the original particle set P. Then, we extend G_H by a root node for each connected component (object) in G. The root node states are defined as the aggregates of their leaf node states. The root nodes are connected to their leaves with directed edges and vice versa.
At this point, G_H consists of the leaf particles L representing the finest scene resolution and one root node for each connected component describing the scene at the object level. To obtain intermediate levels of detail, we then cluster the leaves L in each connected component into smaller subcomponents using a modified k-means algorithm. We add one node for each new subcomponent and connect its leaves to the newly added node and vice versa. This newly added node is then labeled as the direct ancestor of its leaves, and its leaves are siblings of each other.
We then connect the added intermediate nodes with each other if and only if their respective subcomponent leaves are connected. Lastly, we add directed edges from the root node of each connected component to the new intermediate nodes in that component, and remove edges between leaves not in the same cluster. The process then recurses within each new subcomponent. See Algorithm 1 in the supplementary for details.
We denote the sibling(s) of a particle p by sib(p), its ancestor(s) by anc(p), its parent by par(p), and its descendant(s) by des(p). We define leaves(p_a) = {p_l ∈ L | p_a ∈ anc(p_l)}. Note that in G_H, directed edges connect p_i and sib(p_i), leaves p_l and anc(p_l), and p_i and des(p_i); see Figure 3b.

4 Physics Prediction Model

In this section we introduce our physics prediction model. It is based on hierarchical graph convolution, an operation which propagates relevant physical effects through the graph hierarchy.

4.1 Hierarchical Graph Convolutions For Effect Propagation

In order to predict the future physical state, we need to resolve the constraints that particles connected in the hierarchical graph impose on each other. We use graph convolutions to compute and propagate these effects. Following Battaglia et al. [4], we implement a pairwise graph convolution using two basic building blocks: (1) a pairwise processing unit that takes the sender particle state p_s, the receiver particle state p_r, and their relation r_sr as input and outputs the effect e_sr ∈ ℝ^E of p_s on p_r, and (2) a commutative aggregation operation Σ which collects and computes the overall effect e_r ∈ ℝ^E. In our case, this is a simple summation over all effects on p_r.
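These two building blocks can be sketched in a few lines of NumPy. This is a minimal illustration rather than the trained model: the pairwise processing unit is a stand-in, randomly initialized two-layer MLP, and the state, relation, and effect dimensions are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the learned pairwise processing unit: a tiny two-layer MLP
# mapping [sender state, receiver state, relation] -> effect vector in R^E.
STATE, REL, E = 7, 1, 4
W1 = rng.standard_normal((2 * STATE + REL, 16)); b1 = np.zeros(16)
W2 = rng.standard_normal((16, E)); b2 = np.zeros(E)

def pairwise_effect(p_s, p_r, r_sr):
    h = np.tanh(np.concatenate([p_s, p_r, r_sr]) @ W1 + b1)
    return h @ W2 + b2  # e_sr in R^E

def graph_convolution(particles, edges, relations):
    """Aggregate: sum the pairwise effects over all senders of each receiver."""
    effects = np.zeros((len(particles), E))
    for (s, r), rel in zip(edges, relations):
        effects[r] += pairwise_effect(particles[s], particles[r], rel)
    return effects

# Three particles; edges 0->2 and 1->2, so particle 2 receives two summed effects.
P = rng.standard_normal((3, STATE))
edges = [(0, 2), (1, 2)]
rels = [np.ones(REL), np.ones(REL)]
e = graph_convolution(P, edges, rels)
```

In the full model the weights are learned end-to-end; the aggregation by summation is what makes the operation commutative over senders.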
Together these two building blocks form a convolution on graphs as shown in Figure 3a.

Figure 3: Effect propagation through graph convolutions. a) Pairwise graph convolution. A receiver particle p_r is constrained in its movement through graph relations r_sr with sender particle(s) p_s. Given p_s, p_r, and r_sr, the effect e_sr of p_s on p_r is computed using a fully connected neural network. The overall effect e_r is the sum of all effects on p_r. b) Hierarchical graph convolution ψ. Effects in the hierarchy are propagated in three consecutive steps: (1) L2A: leaf particles L propagate effects to all of their ancestors A. (2) WS: effects are exchanged between siblings S. (3) A2D: effects are propagated from the ancestors A to all of their descendants D.

Pairwise processing limits graph convolutions to only propagate effects between directly connected nodes. For a generic flat graph, we would have to repeatedly apply this operation until the information from all particles has propagated across the whole graph. This is infeasible in a scenario with many particles. Instead, we leverage direct connections between particles and their ancestors in our hierarchy to propagate all effects across the entire graph in one model step. We introduce a hierarchical graph convolution, a three-stage mechanism for effect propagation as seen in Figure 3b: The first L2A (Leaves to Ancestors) stage L2A(p_l, p_a, r_la, e^0_l) predicts the effect e^L2A_la ∈ ℝ^E of a leaf particle p_l on an ancestor particle p_a ∈ anc(p_l), given p_l, p_a, the material property information of r_la, and the input effect e^0_l on p_l. The second WS (Within Siblings) stage WS(p_i, p_j, r_ij, e^L2A_i) predicts the effect e^WS_ij ∈ ℝ^E of sibling particle p_i on p_j ∈ sib(p_i).
The third A2D (Ancestors to Descendants) stage A2D(p_a, p_d, r_ad, e^L2A_a + e^WS_a) predicts the effect e^A2D_ad ∈ ℝ^E of an ancestor particle p_a on a descendant particle p_d ∈ des(p_a). The total propagated effect e_i on particle p_i is computed by summing the various effects on that particle, e_i = e^L2A_i + e^WS_i + e^A2D_i, where

e^L2A_a = Σ_{p_l ∈ leaves(p_a)} L2A(p_l, p_a, r_la, e^0_l)
e^WS_j = Σ_{p_i ∈ sib(p_j)} WS(p_i, p_j, r_ij, e^L2A_i)
e^A2D_d = Σ_{p_a ∈ anc(p_d)} A2D(p_a, p_d, r_ad, e^L2A_a + e^WS_a).

In practice, L2A, WS, and A2D are realized as fully-connected networks with shared weights that receive an additional ternary input (0 for L2A, 1 for WS, and 2 for A2D) in the form of a one-hot vector. Since all particles within one object are connected to the root node, information can flow across the entire hierarchical graph in at most two propagation steps. We make use of this property in our model.

4.2 The Hierarchical Relation Network Architecture

This section introduces the Hierarchical Relation Network (HRN), a neural network for predicting future physical states, shown in Figure 4.

Figure 4: Hierarchical Relation Network. The model takes the past particle graphs G_H^(t-T,t] = ⟨P^(t-T,t], R^(t-T,t]⟩ as input and outputs the next states P^{t+1}. The inputs to each graph convolutional effect module are the particle states and relations; the outputs are the respective effects. φ^H processes past states, φ^C collisions, and φ^F external forces. The hierarchical graph convolutional module ψ takes the sum of all effects, the pairwise particle states, and relations, and propagates the effects through the graph. Finally, a state prediction module uses the propagated effects to compute the next particle states P^{t+1}.
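The three-stage propagation of Section 4.1 can be condensed into a small NumPy sketch. This is an illustrative toy, not the trained model: a single randomly initialized linear map stands in for the shared fully-connected network, the one-hot stage tag mirrors the ternary input described above, and the hierarchy is just one object root with two leaf siblings.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE, E = 7, 4
# One shared stand-in network for L2A/WS/A2D; the stage enters as a one-hot tag.
W = rng.standard_normal((2 * STATE + 3, E))

def stage_effect(p_send, p_recv, stage):
    one_hot = np.eye(3)[stage]                  # 0 = L2A, 1 = WS, 2 = A2D
    return np.concatenate([p_send, p_recv, one_hot]) @ W

# Tiny hierarchy: node 0 is the object root, nodes 1 and 2 are sibling leaves.
P = rng.standard_normal((3, STATE))
anc = {1: [0], 2: [0]}                          # leaf -> ancestors
sib = {1: [2], 2: [1]}                          # leaf -> siblings

e_l2a = np.zeros((3, E)); e_ws = np.zeros((3, E)); e_a2d = np.zeros((3, E))
for leaf, ancestors in anc.items():             # (1) L2A: leaves to ancestors
    for a in ancestors:
        e_l2a[a] += stage_effect(P[leaf], P[a], 0)
for i, sibs in sib.items():                     # (2) WS: within siblings
    for j in sibs:
        e_ws[j] += stage_effect(P[i], P[j], 1)
for leaf, ancestors in anc.items():             # (3) A2D: ancestors to descendants
    for a in ancestors:
        e_a2d[leaf] += stage_effect(P[a], P[leaf], 2)

e_total = e_l2a + e_ws + e_a2d                  # e_i = e_i^L2A + e_i^WS + e_i^A2D
```

Because every leaf is directly connected to the object root, the three stages suffice to carry information between any pair of particles in the object.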
At each time step t, HRN takes a history of T previous particle states P^(t-T,t] and relations R^(t-T,t] in the form of hierarchical scene graphs G_H^(t-T,t] as input. G_H^(t-T,t] dynamically changes over time as directed, unlabeled virtual collision relations are added for sufficiently close pairs of particles. HRN also takes external effects on the system (for example gravity g or external forces F) as input. The model consists of three pairwise graph convolution modules, one for external forces (φ^F), one for collisions (φ^C), and one for past states (φ^H), followed by a hierarchical graph convolution module ψ that propagates effects through the particle hierarchy. A fully-connected module then outputs the next states P^{t+1}.
In the following, we briefly describe each module. For ease of reading we drop the notation (t-T, t] and assume that all variables are subject to this time range unless otherwise noted.
External Force Module The external force module φ^F converts forces F ≡ {f_i} on leaf particles p_i ∈ P^L into effects φ^F(p_i, f_i) = e^F_i ∈ ℝ^E.
Collision Module Collisions between objects are handled by dynamically defining pairwise collision relations r^C_ij between leaf particles p_i ∈ P^L from one object and p_j ∈ P^L from another object that are close to each other [11]. The collision module φ^C uses p_i, p_j, and r^C_ij to compute the effects φ^C(p_j, p_i, r^C_ij) = e^C_ji ∈ ℝ^E of p_j on p_i and vice versa. With d^t(i, j) = ‖x^t_i - x^t_j‖, the overall collision effects equal e^C_i = Σ_j {e_ji | d^t(i, j) < D^C}. The hyperparameter D^C represents the maximum distance for a collision relation.
History Module The history module φ^H predicts the effects φ^H(p^(t-T,t-1]_i, p^t_i) = e^H_i ∈ ℝ^E on current leaf particle states p^t_i from the past states p^(t-T,t-1]_i, for p_i ∈ P^L.
Hierarchical Effect Propagation Module The hierarchical effect propagation module ψ propagates the overall effect e^0_i = e^F_i + e^C_i + e^H_i from external forces, collisions, and history on p_i through the particle hierarchy. ψ corresponds to the three-stage hierarchical graph convolution introduced in Section 4.1.
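The dynamic collision relations can be illustrated with a short NumPy sketch. This is a simplified stand-in for the engine's neighbor search, not the paper's implementation: a brute-force pairwise distance test against the threshold D^C (named d_c here), on made-up particle positions.

```python
import numpy as np

def collision_pairs(x_a, x_b, d_c):
    """Leaf particles from two objects closer than d_c get a collision relation."""
    # Pairwise distances between the two particle sets, shape (n_a, n_b).
    d = np.linalg.norm(x_a[:, None, :] - x_b[None, :, :], axis=-1)
    return np.argwhere(d < d_c)  # rows of (i, j) index pairs

# Two 2-particle "objects": only the closest pair falls under the threshold.
obj_a = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
obj_b = np.array([[0.1, 0.0, 0.0], [9.0, 0.0, 0.0]])
pairs = collision_pairs(obj_a, obj_b, d_c=0.5)  # -> [[0, 0]]
```

A real implementation would use a spatial hash or k-d tree rather than the O(n_a · n_b) distance matrix, but the resulting relations enter φ^C the same way.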
Given the pairwise particle states p_i and p_j, their relation r_ij, and the input effects e^0_i, ψ outputs the total propagated effect e_i on each particle p_i (Figure 3b).
State Prediction Module We use a simple fully-connected network to predict the next particle states P^{t+1}. In order to get more accurate predictions, we leverage the hierarchical particle representation by predicting the dynamics of any given particle within the local coordinate system originated at its parent. The only exceptions are object root particles, for which we predict the global dynamics. Specifically, given the particle state p_i, the total effect e_i on p_i, and the gravity g as input, the state prediction module predicts the local future delta position δ^{t+1}_{i,ℓ} = δ^{t+1}_i - δ^{t+1}_{par(i)}. As we only predict global dynamics for object root particles, gravity is only applied to these root particles. The final future delta position in world coordinates is computed from local information as δ^{t+1}_i = δ^{t+1}_{i,ℓ} + Σ_{j ∈ anc(i)} δ^{t+1}_{j,ℓ}.

4.3 Learning Physical Constraints through Loss Functions and Data

Traditionally, physical systems are modeled with equations providing fixed approximations of the real world. Instead, we choose to learn physical constraints, including the meaning of the material property vector, from data. The error signal we found to work best is a combination of three objectives.
(1) We predict the position change δ^{t+1}_{i,ℓ} between time steps t and t+1 independently for all particles in the hierarchy. In practice, we find that δ^{t+1}_{i,ℓ} differs in magnitude for particles in different levels. Therefore, we normalize the local dynamics using the statistics from all particles in the same level (local loss).
(2) We also require that the global future delta position δ^{t+1}_i is accurate (global loss). (3) We aim to preserve the intra-object particle structure by imposing that the pairwise distance between two connected particles p_i and p_j in the next time step, d^{t+1}(i, j), matches the ground truth. In the case of a rigid body this term works to preserve the distance between particles. For soft bodies, this objective ensures that pairwise local deformations are learned correctly (preservation loss).
The total objective function linearly combines (1), (2), and (3), weighted by hyperparameters α and β:

Loss = α Σ_{p_i} ‖δ̂^{t+1}_{i,ℓ} - δ^{t+1}_{i,ℓ}‖² + β Σ_{p_i} ‖δ̂^{t+1}_i - δ^{t+1}_i‖² + (1 - α - β) Σ_{p_i ∈ sib(p_j)} ‖d̂^{t+1}(i, j) - d^{t+1}(i, j)‖²

5 Experiments

In this section, we examine the HRN's ability to accurately predict the physical state across time in scenarios with rigid bodies, deformable bodies (soft bodies, cloths, and fluids), collisions, and external actions. We also evaluate the generalization performance across various object and environment properties. Finally, we present some more complex scenarios, including falling block towers and dominoes. Prediction roll-outs are generated by recursively feeding back the HRN's one-step prediction as input. We strongly encourage readers to have a look at the result examples shown in main-text figures and the supplementary materials, and at https://youtu.be/kD2U6lghyUE.
All training data for the experiments below was generated via a custom interactive particle-based environment based on the FleX physics engine [31] in Unity3D. This environment provides (1) an automated way to extract a particle representation given a 3D object mesh, (2) a convenient way to generate randomized physics scenes for generating static training data, and (3) a standardized way to interact with objects in the environment through forces.†
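The three-term objective of Section 4.3 can be sketched in NumPy. This is an illustrative version under stated assumptions: the function name hrn_loss and the example weights are ours, the three residual terms follow the local/global/preservation split described in the text, and weighting the third term by (1 - α - β) is our reading of the combined formula.

```python
import numpy as np

def hrn_loss(d_local_pred, d_local_true, d_glob_pred, d_glob_true,
             dist_pred, dist_true, alpha=0.4, beta=0.4):
    """Weighted sum of local, global, and pairwise-distance preservation terms."""
    local = np.sum((d_local_pred - d_local_true) ** 2)     # (1) local deltas
    glob = np.sum((d_glob_pred - d_glob_true) ** 2)        # (2) global deltas
    preserve = np.sum((dist_pred - dist_true) ** 2)        # (3) sibling distances
    return alpha * local + beta * glob + (1 - alpha - beta) * preserve

# Perfect predictions give zero loss.
z3 = np.zeros((5, 3))   # per-particle delta positions
z1 = np.zeros(4)        # pairwise sibling distances
perfect = hrn_loss(z3, z3, z3, z3, z1, z1)  # -> 0.0
```

In training, each squared term would additionally be normalized per hierarchy level as described for the local loss.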
Further details about the experimental setups and training procedure can be found in the supplement.

†HRN code and the Unity FleX environment can be found at https://neuroailab.github.io/physics/

5.1 Qualitative evaluation of physical phenomena

Rigid body kinematic motion and external forces. In a first experiment, rigid objects are pushed up, via an externally applied force, from a ground plane, then fall back down and collide with the plane. The model is trained on 10 different simple shapes (cube, sphere, pyramid, cylinder, cuboid, torus, prism, octahedron, ellipsoid, flat pyramid) with 50-300 particles each. The static plane is represented using 5,000 particles with a practically infinite mass. External forces, spatially dispersed with a Gaussian kernel, are applied at randomly chosen points on the object. Testing is performed on instances of the same rigid shapes, but with new force vectors and application points, resulting in new trajectories. Results can be seen in supplementary Figure F.9c-d, illustrating that the HRN correctly predicts the parabolic kinematic trajectories of tangentially accelerated objects, rotation due to torque, responses to initial external impulses, and the eventual elastic collisions of the object with the floor.

Figure 5: Prediction examples and ground truth. a) A cone bouncing off a plane. b) Parabolic motion of a bunny. A force is applied at the first frame. c) A cube falling on a slope. d) A cone colliding with a pentagonal prism. Both shapes were held out. e) Three objects colliding on a plane. f) Falling block tower not trained on. g) A cloth drops and folds after hitting the floor. h) A fluid drop bursts on the ground. We strongly recommend watching the videos in the supplement.

Complex shapes and surfaces. In more complex scenarios, we train on the simple shapes colliding with a plane, then generalize to complex non-convex shapes (e.g. bunny, duck, teddy).
Figure 5b shows\nan example prediction for the bunny; more examples are shown in supplementary Figure F.9g-h.\nWe also examine spheres and cubes falling on 5 complex surfaces: slope, stairs, half-pipe, bowl, and\na \u201crandom\u201d bumpy surface. See Figure 5c and supplementary Figure F.10c-e for results. We train on\nspheres and cubes falling on the 5 surfaces, and test on new trajectories.\nDynamic collisions. Collisions between two moving objects are more complicated to predict than\nstatic collisions (e.g. between an object and the ground). We \ufb01rst evaluate this setup in a zero-gravity\nenvironment to obtain purely dynamic collisions. Training was performed on collisions between 9\npairs of shapes sampled from the 10 shapes in the \ufb01rst experiment. Figure 5d shows predictions for\ncollisions involving shapes not seen during training, the cone and pentagonal prism, demonstrating\nHRN\u2019s ability to generalize across shapes. Additional examples can be found in supplementary\nFigure F.9e-f, showing results on trained shapes.\nMany-object interactions. Complex scenarios include simultaneous interactions between multiple\nmoving objects supported by static surfaces. For example, when three objects collide on a planar\nsurface, the model has to resolve direct object collisions, indirect collisions through intermediate\nobjects, and forces exerted by the surface to support the objects. To illustrate the HRN\u2019s ability\nto handle such scenarios, we train on combinations of two and three objects (cube, stick, sphere,\nellipsoid, triangular prism, cuboid, torus, pyramid) colliding simultaneously on a plane. See Figure 5e\nand supplementary Figure F.10f for results.\nWe also show that HRN trained on the two and three object collision data generalizes to complex new\nscenarios. Generalization tests were performed on a falling block tower, a falling domino chain, and\na bowl containing multiple spheres. All setups consist of 5 objects. 
See Figure 5f and supplementary Figures F.9b and F.10b,g for results. Although predictions sometimes differ from ground truth in their details, the results still appear plausible to human observers.
Soft bodies. We repeat the same experiments but with soft bodies of varying stiffness, showing that the HRN properly handles kinematics, external forces, and collisions with complex shapes and surfaces involving soft bodies. One illustrative result is depicted in Figure 1, showing a non-rigid cube as it deformably bounces off the floor. Additional examples are shown in supplementary Figure F.9g-h.
Cloth. We also experiment with various cloth setups. In the first experiment, a cloth drops onto the floor from a certain height and folds or deforms. In another experiment, a cloth is fixated at two points and swings back and forth. Cloth predictions are very challenging, as cloths do not spring back to their original shape and self-collisions have to be resolved in addition to collisions with the ground. To address this challenge, we add self-collisions, collision relationships between particles within the same object, to the collision module. Results can be seen in Figure 5g and supplementary Figure F.11 and show that the cloth motion and deformations are accurately predicted.
Fluids. In order to test our model's ability to predict fluids, we perform a simple experiment in which a fluid drop falls to the floor from a certain height. As effects within a fluid are mostly local, flat hierarchies with small groupings are better at fluid prediction.
Results can be seen in Figure 5h and show that the fall of a liquid drop is successfully predicted when trained in this scenario.
Response to parameter variation. To evaluate how the HRN responds to changes in mass, gravity, and stiffness, we train on datasets in which these properties vary. At test time we vary those parameters for the same initial starting state and evaluate how the trajectories change. In supplementary Figures F.14, F.13, and F.12 we show results for each variation, illustrating e.g. how objects accelerate more rapidly in a stronger gravitational field.
Heterogeneous materials. We leverage the hierarchical particle graph representation to construct objects that contain both rigid and soft parts. After training a model with objects of varying shapes and stiffnesses falling on a plane, we manually adjust individual stiffness relations to create a half-rigid half-soft object and generate HRN predictions. Supplementary Figure F.10h shows a half-rigid half-soft pyramid. Note that there is no ground truth for this example, as we surpass the capabilities of the underlying physics simulator, which is incapable of simulating objects with heterogeneous materials.

5.2 Quantitative evaluation and ablation

We compare the HRN to several baselines and model ablations. The first baseline is a simple Multi-Layer Perceptron (MLP) which takes the full particle representation and directly outputs the next particle states. The second baseline is the Interaction Network as defined by Battaglia et al. [4], denoted as fully connected graph, as it corresponds to removing our hierarchy and computing on a fully connected graph. In addition, to show the importance of the φ^C, φ^F, and φ^H modules, we remove them and replace them with simple alternatives. No φ^F replaces the force module by concatenating the forces to the particle states and directly feeding them into ψ.
Similarly, for no ψC, ψC is removed by adding the collision relations to the object relations and feeding them directly through η. For no ψH, ψH is simply removed without replacement. Next, we show that two input time steps (t, t−1) improve results by comparing against a model with a single input time step. Lastly, we evaluate the importance of the preservation loss and of the global loss component added to the local loss. All models are trained on scenarios in which two cubes fall on a plane and repeatedly collide after being pushed towards each other, and are tested on held-out trajectories of the same scenario. An additional evaluation of different grouping methods can be found in Section B of the supplement.

[Figure 6 line plots: from left to right, position MSE, delta position MSE, and preserve distance MSE, each accumulated over time steps t+1 through t+9, for the full model and each ablation.]

Figure 6: Quantitative evaluation. We compare the full HRN (global + local loss) to several baselines, namely local loss only, no preservation loss, no ψH, no ψC, no ψF, 1 time step, fully connected graph, and an MLP baseline. The line graphs from left to right show the mean squared error (MSE) between positions, delta positions, and distance preservation accumulated over time. Our model has the lowest position and delta position error and only a slightly higher preservation error.

Comparison metrics are the cumulative mean squared error of the absolute global position, the local position delta, and the preserve distance error up to time step t + 9.
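As a rough sketch of these three metrics (function names and array conventions are our own assumptions, not the paper's code), the errors can be computed from predicted and ground-truth particle trajectories and accumulated over the rollout:

```python
import numpy as np

def position_mse(pred, true):
    # pred, true: (T, N, 3) particle positions over a T-step rollout
    return ((pred - true) ** 2).mean(axis=(1, 2))          # (T,) per-step MSE

def delta_position_mse(pred, true):
    # MSE of per-step displacements x_{t+1} - x_t (the "delta positions")
    return (((np.diff(pred, axis=0) - np.diff(true, axis=0)) ** 2)
            .mean(axis=(1, 2)))                            # (T-1,)

def preserve_distance_mse(pred, true):
    # error in pairwise particle distances, measuring shape preservation
    def pdists(x):                                         # (T, N, 3) -> (T, N, N)
        return np.linalg.norm(x[:, :, None] - x[:, None, :], axis=-1)
    return ((pdists(pred) - pdists(true)) ** 2).mean(axis=(1, 2))

def cumulative(per_step_err):
    # accumulate the per-step error over time, as plotted in Figure 6
    return np.cumsum(per_step_err)
```

Note that a uniform translation of all predicted particles leaves the delta-position and distance-preservation errors at zero while the position error grows, which is one reason to report the three metrics separately.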
Results are reported in Figure 6. The HRN outperforms all controls most of the time. The hierarchy is especially important, with the fully connected graph and MLP baselines performing substantially worse. Likewise, the HRN without the hierarchical graph convolution mechanism performed significantly worse, as seen in supplementary Figure C.4, which shows the necessity of the three consecutive graph convolution stages. In qualitative evaluations, we found that using more than one input time step improves results, especially during collisions, as the acceleration is better estimated; the metrics in Figure 6 confirm this. We also found that splitting collisions, forces, history, and effect propagation into separate modules with separate weights allows each module to specialize, improving predictions. Lastly, the proposed loss structure is crucial to model training: without distance preservation or the global delta position prediction, our model performs much worse. See supplementary Section C for further discussion of the losses and graph structures.

5.3 Discussion

Our results show that the vast majority of complex multi-object interactions are predicted well, including multi-point collisions between non-convex geometries and complex scenarios like the bowl containing multiple rolling balls. Although not shown, one could in theory also simulate shattering objects by removing enough relations between particles within an object. Such manipulations are of substantial interest because they go beyond what can be generated in our simulation environment. Additionally, predictions of especially challenging situations such as multi-block towers were also mostly effective, with objects (mostly) retaining their shapes and rolling over each other convincingly as towers collapsed (see the supplement and the video).
The loss of shape preservation over time can be partially attributed to the compounding errors generated by the recursive roll-outs. In addition, our model predicts towers collapsing faster than in the ground truth, and predictions jitter when objects should stand absolutely still. These failures are mainly due to the fact that the training set contained only interactions between fast-moving pairs or triplets of objects, with no scenarios involving objects at rest; that the model generalized to towers as well as it did is a powerful illustration of our approach. Adding a fraction of training observations with objects at rest causes towers to behave more realistically and removes the jitter overall. The training data thus plays a crucial role in the final model performance and its generalization ability. Ideally, the training set would cover the entirety of physical phenomena in the world; however, designing such a dataset by hand is intractable. Thus, methods in which a self-driven agent sets up its own physical experiments will be crucial to maximize learning and understanding [19].

6 Conclusion

We have described a hierarchical graph-based scene representation that allows the scalable specification of arbitrary geometrical shapes and a wide variety of material properties. Using this representation, we introduced a learnable neural network based on hierarchical graph convolution that generates plausible trajectories for complex physical interactions over extended time horizons, generalizing well across shapes, masses, external and internal forces, and material properties. Because of the particle-based nature of our representation, it naturally captures object permanence, identified in cognitive science as a key feature of human object perception [43].
A wide variety of applications of this work are possible. Several of interest include developing
Several of interest include developing\npredictive models for grasping of rigid and soft objects in robotics, and modeling the physics of 3D\npoint cloud scans for video games or other simulations. To enable a pixel-based end-to-end trainable\nversion of the HRN for use in key computer vision applications, it will be critical to combine our\nwork with adaptations of existing methods (e.g. [54, 23, 15]) for inferring initial (non-hierarchical)\nscene graphs from LIDAR/RGBD/RGB image or video data. In the future, we also plan to remedy\nsome of HRN\u2019s limitations, expanding the classes of materials it can handle to including in\ufb02atables\nor gases, and to dynamic scenarios in which objects can shatter or merge. This should involve a\nmore sophisticated representation of material properties as well as a more nuanced hierarchical\nconstruction. Finally, it will be of great interest to evaluate to what extent HRN-type models describe\npatterns of human intuitive physical knowledge observed by cognitive scientists [32, 35, 38].\n\n9\n\n\fAcknowledgments\nWe thank Viktor Reutskyy, Miles Macklin, Mike Skolones and Rev Lebaredian for helpful discussions\nand their support with integrating NVIDIA FleX into our simulation environment. This work was\nsupported by grants from the James S. McDonnell Foundation, Simons Foundation, and Sloan\nFoundation (DLKY), a Berry Foundation postdoctoral fellowship (NH), the NVIDIA Corporation,\nONR - MURI (Stanford Lead) N00014-16-1-2127 and ONR - MURI (UCLA Lead) 1015 G TA275.\n\nReferences\n[1] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking:\nIn Advances in Neural Information Processing\n\nExperiential learning of intuitive physics.\nSystems, pages 5074\u20135082, 2016.\n\n[2] D. Baraff. Physically based modeling: Rigid body simulation. SIGGRAPH Course Notes, ACM\n\nSIGGRAPH, 2(1):2\u20131, 2001.\n\n[3] C. Bates, P. Battaglia, I. Yildirim, and J. B. Tenenbaum. 
Humans predict liquid dynamics using probabilistic simulation. In CogSci, 2015.

[4] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems 29, pages 4502-4510, 2016.

[5] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327-18332, 2013.

[6] J. Bender, M. Müller, and M. Macklin. Position-based simulation methods in computer graphics. In Eurographics (Tutorials), 2015.

[7] M. Brand. Physics-based visual understanding. Computer Vision and Image Understanding, 65(2):192-205, 1997.

[8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.

[9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[10] A. Byravan and D. Fox. SE3-nets: Learning rigid body motion using deep neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 173-180. IEEE, 2017.

[11] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

[12] E. Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 1:3, 2010.

[13] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844-3852, 2016.

[14] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224-2232, 2015.

[15] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.

[16] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64-72, 2016.

[17] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik. Learning visual predictive models of physics for playing billiards. arXiv preprint arXiv:1511.07404, 2015.

[18] R. Grzeszczuk, D. Terzopoulos, and G. Hinton. NeuroAnimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 9-20. ACM, 1998.

[19] N. Haber, D. Mrowca, L. Fei-Fei, and D. L. Yamins. Learning to play with intrinsically-motivated self-aware agents. arXiv preprint arXiv:1802.07442, 2018.

[20] J. Hamrick, P. Battaglia, and J. B. Tenenbaum. Internal physics models guide probabilistic judgments about object dynamics. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, pages 1545-1550. Cognitive Science Society, Austin, TX, 2011.

[21] M. Hegarty. Mechanical reasoning by mental simulation. Trends in Cognitive Sciences, 8(6):280-285, 2004.

[22] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[23] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.

[24] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[25] T. D. Kulkarni, V. K.
Mansinghka, P. Kohli, and J. B. Tenenbaum. Inverse graphics with probabilistic CAD models. arXiv preprint arXiv:1407.1339, 2014.

[26] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539-2547, 2015.

[27] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[28] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312, 2016.

[29] W. Li, S. Azimi, A. Leonardis, and M. Fritz. To fall or not to fall: A visual approach to physical stability prediction. arXiv preprint arXiv:1604.00066, 2016.

[30] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[31] M. Macklin, M. Müller, N. Chentanez, and T.-Y. Kim. Unified particle physics for real-time applications. ACM Transactions on Graphics (TOG), 33(4):153, 2014.

[32] M. McCloskey, A. Caramazza, and B. Green. Curvilinear motion in the absence of external forces: Naive beliefs about the motion of objects. Science, 210(4474):1139-1141, 1980.

[33] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi. Newtonian scene understanding: Unfolding the dynamics of objects in static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3521-3529, 2016.

[34] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi. "What happens if..." Learning to predict the effect of forces in images. In European Conference on Computer Vision, pages 269-285. Springer, 2016.

[35] L. Piloto, A. Weinstein, A. Ahuja, M. Mirza, G. Wayne, D. Amos, C.-c. Hung, and M. Botvinick. Probing physics knowledge using tools from developmental psychology.
arXiv preprint arXiv:1804.01128, 2018.

[36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

[37] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5105-5114, 2017.

[38] R. Riochet, M. Y. Castro, M. Bernard, A. Lerer, R. Fergus, V. Izard, and E. Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018.

[39] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009.

[40] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.

[41] K. A. Smith and E. Vul. Sources of uncertainty in intuitive physics. Topics in Cognitive Science, 5(1):185-199, 2013.

[42] E. S. Spelke. Principles of object perception. Cognitive Science, 14(1):29-56, 1990.

[43] E. S. Spelke, K. Breinlinger, J. Macomber, and K. Jacobson. Origins of knowledge. Psychological Review, 99(4):605, 1992.

[44] I. Sutskever and G. E. Hinton. Using matrices to model symbolic relationships. In Advances in Neural Information Processing Systems, pages 1593-1600, 2009.

[45] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279-1285, 2011.

[46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489-4497.
IEEE, 2015.

[47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep end2end voxel2voxel prediction. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2016 IEEE Conference on, pages 402-409. IEEE, 2016.

[48] T. Ullman, A. Stuhlmüller, N. Goodman, and J. B. Tenenbaum. Learning physics from dynamical scenes. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, pages 1640-1645, 2014.

[49] Z. Wang, S. Rosa, B. Yang, S. Wang, N. Trigoni, and A. Markham. 3D-PhysNet: Learning the intuitive physics of non-rigid object deformations. arXiv preprint arXiv:1805.00328, 2018.

[50] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. arXiv preprint arXiv:1706.01433, 2017.

[51] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.

[52] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In Advances in Neural Information Processing Systems, pages 127-135, 2015.

[53] J. Wu, J. J. Lim, H. Zhang, J. B. Tenenbaum, and W. T. Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, volume 2, page 7, 2016.

[54] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network.
In European Conference on Computer Vision, pages 365-382. Springer, 2016.