{"title": "Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 8985, "page_last": 8995, "abstract": "From infancy, humans have expectations about how objects will move and interact. Even young children expect objects not to move through one another, teleport, or disappear. They are surprised by mismatches between physical expectations and perceptual observations, even in unfamiliar scenes with completely novel objects. A model that exhibits human-like understanding of physics should be similarly surprised, and adjust its beliefs accordingly.  We propose ADEPT, a model that uses a coarse (approximate geometry) object-centric representation for dynamic 3D scene understanding. Inference integrates deep recognition networks, extended probabilistic physical simulation, and particle filtering for forming predictions and expectations across occlusion. We also present a new test set for measuring violations of physical expectations, using a range of scenarios derived from developmental psychology.  We systematically compare ADEPT, baseline models, and human expectations on this test set.  ADEPT outperforms standard network architectures in discriminating physically implausible scenes, and often performs this discrimination at the same level as people.", "full_text": "Modeling Expectation Violation in Intuitive Physics\nwith Coarse Probabilistic Object Representations\n\nKevin A. Smith1,2,\u2217, Lingjie Mei3,\u2217, Shunyu Yao4,\u2217, Jiajun Wu3\n\nElizabeth Spelke2,5, Joshua B. Tenenbaum1,2,3, Tomer D. Ullman2,5\n\n1 MIT BCS, 2 Center for Brains, Minds, & Machines, 3 MIT CSAIL,\n\n4 Princeton University, 5 Harvard Psychology\n\nAbstract\n\nFrom infancy, humans have expectations about how objects will move and interact.\nEven young children expect objects not to move through one another, teleport, or\ndisappear. 
They are surprised by mismatches between physical expectations and\nperceptual observations, even in unfamiliar scenes with completely novel objects.\nA model that exhibits human-like understanding of physics should be similarly\nsurprised, and adjust its beliefs accordingly. We propose ADEPT, a model that\nuses a coarse (approximate geometry) object-centric representation for dynamic\n3D scene understanding. Inference integrates deep recognition networks, extended\nprobabilistic physical simulation, and particle filtering for forming predictions\nand expectations across occlusion. We also present a new test set for measuring\nviolations of physical expectations, using a range of scenarios derived from\ndevelopmental psychology. We systematically compare ADEPT, baseline models,\nand human expectations on this test set. ADEPT outperforms standard network\narchitectures in discriminating physically implausible scenes, and often performs\nthis discrimination at the same level as people.\n\n1 Introduction\n\nPeople have a rich understanding of everyday physics that they use to predict how the future might\nunfold [Battaglia et al., 2013, Smith and Vul, 2013], plan actions, and manipulate tools [Osiurak and\nBadets, 2016]. This commonsense reasoning includes a set of early developing or possibly innate\nexpectations about the behavior of objects, which are part of \u2018core knowledge\u2019 [Spelke and Kinzler,\n2007]. For example, even very young infants generally expect objects to remain coherent wholes that\nfollow spatially contiguous paths, and not wink in and out of existence [Baillargeon, 1987, Spelke\net al., 1992]. 
Intelligent machines that can interact with the physical world in a human-like way\nshould hold human-like physical intuitions [Lake et al., 2017].\nWe propose that human-like physical understanding is based on explicit object-centric representations\nand their associated dynamics, similar to the idea of a Mental Game Engine [Battaglia et al., 2013,\nUllman et al., 2017]. Such object representations have constant physical properties regardless of their\nperceptual appearance, an appearance that can differ greatly depending on factors such as viewpoint,\ndistance, and occlusion. We assume these object-based representations as given, and propose that the\nmain computational burden is learning how visual input is parsed into these representations.\nOur work has two aims. We first aim to formalize human (especially infant) physical cognition.\nBut we also aim to show how this modeling of infant cognition can inspire more robust AI vision\nsystems that extract physical object representations from video and can detect violations of physical\nexpectations to use as learning signals, similar to infants [Stahl and Feigenson, 2015].\n\n\u2217Indicates equal contribution. Model code, training and testing stimuli, and data can be accessed at http://physadept.csail.mit.edu/\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Frames taken from a physically implausible video, in which a yellow cube seems to disappear behind\nthe occluder. Agents that observe this scene should find it surprising, and use this knowledge as a guide to\nexplore the properties of the objects or dynamics that caused the yellow cube to disappear.\n\nWe present a new model, \u201cApproximate Derenderer, Extended Physics, and Tracking\u201d (ADEPT;\nFig. 2), that closes the loop between cognitive models of intuitive physics which assume object\nparses are given [e.g., 
Battaglia et al., 2013, Smith and Vul, 2013, Hamrick et al., 2016, Ullman et al.,\n2018], and computer vision models that parse images into physical object representations [Wu et al.,\n2017a]. Importantly, we suggest that perception does not have to be exact to capture basic physical\nexpectations [Ullman et al., 2017]. Approximate perception allows the model to trivially extend to\nnovel objects at the cost of object identity information; this is similar to the way young infants can\nreason about new objects but are insensitive to changes of object shape [Xu and Carey, 1996]. Our\nmodel (i) learns to parse novel, arbitrarily shaped objects into approximate geometric\nforms, (ii) makes extended predictions about future world states, by using a robust dynamics model\nthat combines the predictions of a standard physics engine with the possibility of resampling\naspects of the dynamic scene from the prior, and (iii) uses a particle filter to tie together parsing and\npredicting, allowing it to track objects over occlusion.\nIn the spirit of Riochet et al. [2018] and Piloto et al. [2018], we evaluate our model using the Violation\nof Expectations (VoE) paradigm from developmental psychology. In this paradigm, models are shown\nscenes that are matched as closely as possible, except that one scene contains an event that violates\nintuitive physical expectations (e.g., an object disappearing; Fig. 1). A model passes the test if its\npredictions diverge from observations more strongly in the violation video than in the control.\nWe tested the ADEPT model on eight different scenarios, which replicate tests from developmental\npsychology and capture different aspects of early core object knowledge. These scenarios examine\nconcepts such as permanence (objects do not appear or disappear for no reason), continuity (objects\nmove along connected trajectories), and solidity (objects cannot move through one another). 
We\ncompared ADEPT to models that acquire physical expectations without explicit object representations,\nand found that only ADEPT discriminates violations from control stimuli at above-chance rates in all\neight scenarios, and does so at rates similar to humans.\nIn sum, this paper makes three contributions: (1) ADEPT, a novel, object-centric model that can\ndiscriminate physically implausible scenes at near-human levels; (2) a new stimulus set closely based\non developmental psychology for probing core physics knowledge, with more differentiated training\nand test sets to test stronger generalization than previous such stimuli; (3) a new data set of human\nadult judgments on our stimuli, allowing us to investigate not just whether a model discriminates\nviolations above chance, but whether it has human-like expectations about physical events.\n\n2 Related Work\nThe ADEPT model relies on two modules: an \u2018inverse graphics\u2019 module that infers an object-centric\nrepresentation from raw images, and a \u2018physical simulation\u2019 module that predicts future object\nstates from current beliefs. Working within the framework of \u2018vision-as-inverse-graphics\u2019 [Yuille\nand Kersten, 2006], researchers have developed various ways to integrate recognition nets with\nrendering engines [e.g., Wu et al., 2017b, Che et al., 2018, Kulkarni et al., 2015, Huang et al., 2018],\nin order to extract information about objects (e.g., their number and attributes) from pixel input. 
The\nADEPT model extends this work by extracting very coarse shape information, which allows it to\nnaturally generalize to new objects.\n\nFigure 2: The ADEPT model contains two parts. A. The perception module segments objects, and then extracts\ncoarse object attributes from each object segment, approximating all non-occluders as ellipsoids. B. The\nreasoning module tracks and updates beliefs based on the perception results, using the particle filter algorithm\nand an extended stochastic physics engine.\n\nRecent work on physical prediction can be roughly divided into models that learn in a pure end-to-end\nway from raw pixels with minimal built-in structure [e.g., Mottaghi et al., 2016, Agrawal et al., 2016,\nFinn et al., 2016, Fragkiadaki et al., 2016, Eslami et al., 2016, 2018, Lerer et al., 2016, Vondrick et al.,\n2016], and models that integrate object-based representations with off-the-shelf physics engines [e.g.,\nBattaglia et al., 2013, Wu et al., 2017a, Zheng et al., 2015, Kloss et al., 2017]. Graph-based networks\nlie in between these broad categories, allowing for minimal object representations and relations as\nnodes and edges in a graph [Battaglia et al., 2016, Chang et al., 2017, Mrowca et al., 2018]. ADEPT\ninfers approximate object representations from visual input rather than precise geometric bodies, and\nintegrates them with an extended physics engine that allows for robust dynamics and non-standard\nevents. Similar to Vul et al. [2009], we represent belief distributions over object states using a particle\nfilter to track objects and their properties even during occlusion.\nPhysical reasoning models are often evaluated by comparing a model\u2019s accuracy to the ground-truth\nunfolding of a scene, or the known physical parameters of objects. 
Recently, however, two groups\nof researchers have proposed that developmental psychology can offer benchmarks for evaluating\nmodel performance, by comparing it to the looking-time patterns of young infants [Piloto et al., 2018,\nRiochet et al., 2018]. Our paper builds on those proposals, but differs from them in three important\nways. First, the previous work used stimuli from developmental psychology in both training and\ntesting. While the test stimuli may differ from the training stimuli in terms of object properties (a blue cube\nflattened by a red screen in training, but by a yellow screen in test), the training and test sets are still\nrelatively similar. In contrast, we only use developmental-psychology-like stimuli during testing. Our\ntraining videos contain a small set of objects with relatively unconstrained motion that does not\ninclude collisions. Our test set can contain motions that do not appear in the training set, but are\ncompositions of trained motions. Similarly, our test set includes only objects outside of the training\nset. This train/test split provides a stricter test of generalization, reduces the risk of overfitting,\nand is more human-like in that infants do not see anything like developmental psychology test stimuli\nin their everyday life. Second, we focus on violations of core physical principles that are observed\nvery early in infancy (such as solidity and permanence), rather than violations that are noticed only\nin older infants (such as changes in object shape or color [Xu and Carey, 1996], or violations of\nconservation of energy). 
Third, we evaluate and compare both pixel-based and object-based models,\nwhile prior work focuses on pixel-based methods only.\n3 Approximate Derendering, Extended Physics, and Tracking (ADEPT)\nAs shown in Figure 2, the proposed Approximate Derendering, Extended Physics, and Tracking\n(ADEPT) model has two components: a perception module (Figure 2A) that estimates an abstract,\nobject-centric representation $o_t$ (observation) from a raw image $x_t$ at each moment $t$, and a reasoning\nmodule (Figure 2B) that maintains a belief about the scene\u2019s physical state $p(s_t | o_{<t})$, conditioning\non past observations $o_{<t}$ and using particle filtering. Further details can be found in the supplement.\n3.1 Perception\nThe perception module (Figure 2A) translates a raw image $x_t$ to an object-centric representation $o_t$\nin two stages. First, $x_t$ is fed into an instance segmentation network to obtain object segments\n$\\hat{o}_t = \\{\\hat{o}_{t,i}\\}_{i=1}^{n(o_t)}$, where $n(o_t)$ is the number of segments. We use each proposed object segment\n$\\hat{o}_{t,i}$ to mask the current image $x_t$ and concatenate the masked image with the original images $x_{t-4}$,\n$x_{t-2}$, and $x_t$ to account for the proposed object\u2019s visual, temporal, and contextual information. A deep\nconvolutional feature extractor takes the concatenation and returns $o_{t,i}$, a feature vector that encodes\nintrinsic shape attributes (type, scale), and state attributes (location, velocity, rotation) of the object in\na pre-defined format.\nOur perception module approximates each shape as one of two types: large, thin cuboids (typically occluders),\nor ellipsoids of varied scales (other objects). 
This approximation increases robustness to novel objects\nand shapes, and aligns better with the infant learning literature, which suggests object kind information\nis not used early on in spatio-temporal tracking [Xu and Carey, 1996, Ullman et al., 2017].\n3.2 Physical Reasoning via Particle Filtering\nThe physics reasoning module aims to estimate $p(o_t | o_{<t})$ at each moment $t$. It maintains and updates\na belief over the physical state of the scene, $s_t = \\{s_{t,i}\\}_{i=1}^{n(s_t)}$, where $n(s_t)$ is the number of objects,\nand each object state $s_{t,i}$ contains the same set of attributes as an object observation $o_{t,i}$. We assume\na Hidden Markov Model (HMM) relating $s_{1 \\cdots T}$ and $o_{1 \\cdots T}$, whose transition and observation probabilities are governed\nby two models:\n\u2022 Dynamics model $p(s_{t+1} | s_t)$. We employ a stochastic physics engine $\\Theta : s_t \\to s_{t+1}$ to sample\nobject dynamics. However, standard physical dynamics will place vanishingly low probabilities on\nphysically implausible events, and can cause degeneracy in the particle filter. We thus extend our\ndynamics model to include the possibility of re-sampling aspects of the dynamic scene from the\nprior. In the same way that any physical scene interpretation begins with an arbitrary draw from\nthe scene prior at the beginning of a scenario, we assume that at any moment as the scene unfolds,\naspects of the scene may occasionally change arbitrarily and unpredictably, which can be modeled\nby redrawing them from the prior, with very low probability. In principle any object or any subset\nof its properties can be re-sampled in this way to allow for violation of its expected dynamics,\nand can be handled efficiently via smart initialization such that new samples are constrained by\ngiven observations. 
Here we consider only re-samples of two basic object states: identity or\nexistence (accounting for unexpected appearances or disappearances), and velocity (accounting\nfor unexpected stopping, speeding up, or changing direction). We are also exploring more general\nimplementations that re-sample objects\u2019 positions, shapes, colors, and other properties.\n\n\u2022 Observation model $p(o_t | s_t)$. To estimate $p(o_t | s_t)$, we first match objects in our belief $\\{s_{t,i}\\}_{i=1}^{n(s_t)}$\nwith objects in the current observation $\\{o_{t,i}\\}_{i=1}^{n(o_t)}$, based on extrinsic attributes (color, shape, and\nlocation). We also adjust for mismatches between $s_{t,i}$ and $o_{t,i}$ in case (i) the model believes an\nobject exists but it is not observed (e.g., due to occlusion), or (ii) an object must be added to $s_{t,i}$\nbecause it was in a previously unobserved location (e.g., it comes out from behind an occluder).\nAfter objects are matched, the likelihood is computed based on the intersection over union (IoU) of\ntheir corresponding silhouettes (masks).\n\nWe use a particle filter to track and update beliefs. At any time $t$, ADEPT maintains a set of $M$\nparticles $\\{s_t^{(m)}\\}_{m=1}^{M}$ with normalized weights $\\{w_t^{(m)}\\}_{m=1}^{M}$ to approximate the belief distribution\n$p(s_t | o_{\\leq t})$. Each step of particle filtering involves a stochastic state transition $s_t^{(m)} = \\Theta(s_{t-1}^{(m)})$,\nre-weighting $w_t^{(m)} \\propto w_{t-1}^{(m)} p(o_t | s_t^{(m)})$, and importance re-sampling based on $w_t^{(m)}$. 
Within this\nframework, our model estimates $p(o_t | o_{<t})$ in the following way:\n\n$$p(o_t | o_{<t}) = \\int_{s_t} p(s_t | o_{<t}) p(o_t | s_t, o_{<t}) ds_t \\approx \\sum_{m=1}^{M} w_t^{(m)} p(o_t | s_t^{(m)}). \\qquad (1)$$\n\nFigure 3: Diagrams of the different expectation violation scenarios. Black arrows represent physically plausible\ntransitions between movie parts. Red dashed arrows represent transitions that violate physical expectations.\n\n4 Modeling Violation of Expectation in Intuitive Physics\n4.1 Framework\nFor a video $x$ of $T$ frames, $x = x_{1 \\cdots T}$, we want humans or models to output a level of surprise\n$c(x)$, a number that indicates the deviation of observations from expectations throughout $x$. For\ntwo videos, $c(x) > c(x')$ implies $x$ is more implausible than $x'$. In humans, this measure can be\nobtained by explicit report (for adults) or looking time (for infants). While several instantiations of\nthis measure are possible, for the purposes of our models here we take it to be the surprise at the most\nsurprising moment in the video, $c(x) = \\max_t c(x, t)$, where the level of surprise at each moment, $c(x, t)$,\nis defined according to the model. A probabilistic model like ADEPT may ground surprise as the\nnegative log-likelihood of observations under the model, $c(x, t) \\triangleq -\\log p(x_t | x_{<t})$, similar to how\ninfant measures of surprise have been found to be inversely related to the probability of an event\noccurring [T\u00e9gl\u00e1s et al., 2011, Kidd et al., 2012]. 
A deterministic prediction model might define\nsurprise as the squared L2 distance between predictions and observations, $c(x, t) \\triangleq \\|f(x_{<t}) - x_t\\|_2^2$.\n4.2 Stimuli\nInspired by developmental psychology, we designed a stimulus set to test for aspects of core knowledge\nof physics [Spelke and Kinzler, 2007]: that objects do not appear or disappear for no reason\n(permanence), that objects cannot occupy the same space at the same time (solidity), and that objects\nmove along continuous paths (continuity). These stimuli were grouped into eight scenarios, each\nof which included a \u201csurprising\u201d scene that contained a violation of physical expectation, and one\nto three \u201ccontrol\u201d scenes that matched the surprising scenes in different aspects but did not include\na violation of physical expectations (see Fig. 3, with further details and videos in the supplemental\nmaterial). These scenarios included:\n\u2022 Create: Based on Wynn [1992], this scenario tests the concept of permanence by showing an\nobject appear from behind an occluder. This is compared to control scenes in which nothing\nappears, or the object was already in the scene.\n\u2022 Vanish: This scenario is the converse of create, in which an object disappears while behind an\noccluder. This is compared to control scenes in which nothing disappears, or the object that is\nremoved in the violation scene was never observed to begin with.\n\u2022 Overturn (long): Based on Baillargeon et al. [1985], this scenario violates the principle of\nsolidity by showing a screen rotate backwards and through an object that was positioned behind\nit, then rotate back to show the object still exists. This scenario is thus doubly surprising. 
In the\ncontrol scenes, the screen stops rotating when it comes into contact with the object.\n\u2022 Overturn (short): This scenario is identical to the overturn (long) scenario, but the video ends\nafter the first rotation of the screen.\n\u2022 Discontinuous (invisible): Based on Spelke et al. 
[1995], this scenario tests the principle of\ncontinuity by showing an object disappear behind one occluder and reappear from behind a spatially\ndistinct occluder. The video ends with both occluders rotating down to show only one object,\nsuggesting the object teleported through the intervening space. In the control scenes, two identical\nobjects exist at the end, or there is no space between the occluders such that the object did not\nneed to teleport.\n\u2022 Discontinuous (visible): This scenario is similar to discontinuous (invisible) except that the\nobject moves visibly between the occluders. Due to this change, the surprising scene now includes\ntwo objects, and the scene with just one object is expected (any explanation of the scene involving\ntwo objects would be less parsimonious than one object up until the moment of reveal, involving\nfor example multiple suspicious-coincidence stop-and-start behaviors, or movement through a\nsolid object, etc.). This is shown together with discontinuous (invisible) in Fig. 3 because the ends\nof the surprising scenes can be swapped to form a control scene for the other violation type.\n\u2022 Delay: This scenario was designed to test the principle of continuity, by showing an object moving\nthrough an occluded area too quickly to be explained by continuous motion (involving either a\nsudden speed up and slow down, or a teleportation). This contrasts with a control condition that\nshows two identical objects at the end.\n\u2022 Block: Based on Spelke et al. [1992] and Stahl and Feigenson [2015], this scenario tests for\nnotions of solidity and continuity by allowing moving objects to appear on the other side of a\nsolid wall (seemingly having teleported, or moved through the block). 
In the control conditions,\nthe wall stops the object from going through.\n\nGAN\n\nLSTM\n\nADEPT\n\nEncoder-decoder\n0.51 [0.43, 0.58] 0.63 [0.60, 0.66] 0.47 [0.44, 0.51] 0.77 [0.74, 0.80]\nCreate\n0.52 [0.45, 0.58] 0.50 [0.44, 0.56] 0.50 [0.44, 0.56] 0.83 [0.76, 0.90]\nVanish\n0.53 [0.42, 0.66] 0.84 [0.74, 0.92] 0.63 [0.52, 0.74] 0.73 [0.63, 0.82]\nOverturn (long)\n0.61 [0.50, 0.73] 0.81 [0.71, 0.90] 0.52 [0.39, 0.65] 0.79 [0.70, 0.87]\nOverturn (short)\nDiscon. (invisible) 0.77 [0.72, 0.82] 0.76 [0.69, 0.82] 0.34 [0.25, 0.42] 0.79 [0.72, 0.85]\n0.82 [0.76, 0.88] 0.60 [0.53, 0.66]\n0.80 [0.73, 0.86]\nDiscon. (visible)\n0.29 [0.19, 0.40] 0.60 [0.48, 0.69] 0.40 [0.32, 0.48] 0.76 [0.68, 0.85]\nDelay\n0.52 [0.40, 0.65] 0.44 [0.32, 0.55] 0.44 [0.35, 0.55] 0.68 [0.57, 0.79]\nBlock\n0.57 [0.53, 0.61] 0.63 [0.59, 0.67] 0.49 [0.45, 0.53] 0.79 [0.76, 0.82]\nUnseen shapes\n0.59 [0.51, 0.66] 0.61 [0.53, 0.69] 0.45 [0.37, 0.52] 0.68 [0.59, 0.76]\nUnseen categories\n0.47 [0.35, 0.58] 0.66 [0.56, 0.77] 0.39 [0.28, 0.49] 0.84 [0.74, 0.94]\nGeometric shapes\n0.71 [0.58, 0.83] 0.57 [0.53, 0.62] 0.43 [0.35, 0.50] 0.85 [0.72, 0.96]\nToys\n0.57 [0.54, 0.60] 0.63 [0.60, 0.66] 0.47 [0.44, 0.51] 0.79 [0.72, 0.85]\n\n0.53 [0.46, 0.6]\n\nBy Stimulus\n\nBy Shape\n\nAverage\n\nTable 1: Relative accuracy within stimulus classes (top) and shape classes (bottom). Brackets indicate\nbootstrapped 95% CI. 
Bold indicates best performing model(s) on that category.\n\n5 Experiments\n\n5.1 Settings\nTraining data. The training set consists of 1,000 randomly generated videos of objects moving\naround a scene without collisions (e.g., traveling along a straight line, moving back and forth, rotating;\nsee example videos in the supplementary material). 
Although ADEPT only uses short sequences\nof frames to train the perception module, for a fair comparison with baselines we included longer\nvideos with motion patterns typical of objects in the control videos of the test set. Crucially, training\nvideos do not include any physical violations. Objects in these videos can be either occluders, or\nshapes drawn from 44 ShapeNet [Chang et al., 2015] object categories (the remaining 11 ShapeNet\ncategories are held out for test). Within each category we select only 80% of shapes for the training\nset, holding out the rest for testing. The training data is used to train the perception module of\nADEPT, and all baselines.\nTesting data. The test set contains 1,512 videos across the eight stimulus classes described in\nSection 4.2, which are labeled as \u2018surprise\u2019 or \u2018control\u2019 based on the stimulus class design. These videos\nwere designed to have well-controlled configurations of objects and motions that were distinct from\nthe training set. Non-occluder objects in these scenes were never observed in the training set, and\ncome from one of four different shape classes: (1) the training-set ShapeNet object categories, but\nrestricted to the 20% of shape instances not trained on; (2) the 11 unseen ShapeNet object categories; (3)\nfour simple geometric shapes (cubes, cylinders, cones, spheres); (4) four toy-like shapes similar to\nthose used in developmental experiments (bunny, bowling pin, truck, boot).\n\nFigure 4: Model performance on four trials. In each block, there are video frames at different moments (row\n1), corresponding object location beliefs (row 2) and surprise traces (row 3, the red line indicates the current\ntime). A. In Overturn (long): Control, the yellow cube is still believed to exist when the occluder overturns (but\nnot completely). B. 
In Overturn (long): Surprise, the yellow cube is believed to disappear when the occluder\noverturns (completely) and then reappear when the occluder overturns back, generating two spikes of surprise. C.\nIn Create: Control, objects are tracked appropriately and there are no spikes of surprise. D. In Create: Surprise,\nwhen a blue object suddenly appears, a spike of surprise is generated.\n\nBaselines. We contrast ADEPT with several standard deep learning models. The encoder-decoder\nand GAN baselines [Riochet et al., 2018] sequentially take frame pairs $(x_{t-s_1}, x_{t-s_2})$ to predict the\nsemantic mask of $x_t$, with a prediction span of 5 and 40, respectively. We have also implemented an\nLSTM model that explicitly incorporates memory for prediction.\nMetric. Similar to Riochet et al. [2018], we use relative accuracy to evaluate both model and\nhuman judgments. Given a paired group of $n_s$ surprising videos $\\{x_i^+\\}_{i=1}^{n_s}$ and $n_c$ control videos\n$\\{x_j^-\\}_{j=1}^{n_c}$, the relative accuracy within the group is defined as the proportion of surprise-control pairs\nthat are correctly ordered, i.e., $\\sum_{i,j} \\mathbf{1}[c(x_i^+) > c(x_j^-)] / (n_s n_c)$. The relative accuracy of several\nvideo groups is the average relative accuracy across groups.\nFurther details on training data, testing data, and baselines can be found in the supplementary material.\n5.2 Model Results\n\nWe present the results of ADEPT and baselines on the entire testing set, which can be grouped into\ndifferent physical violation cases or shape classes.\nQuantitative results. Table 1 shows ADEPT outperforms baselines overall. It provides the best\nperformance on four of the eight stimulus classes, and ties for best on three of the four remaining\nclasses. It is also the only model that performs above chance in all cases.\nGeneralization to unseen shapes and categories. 
All objects in the test set were instances that\nwere not present in the training set, allowing us to test for generalization to novel objects. As can\nbe seen in Table 1, the ADEPT model outperforms or ties with baselines across all object classes,\nsuggesting that it can gracefully handle novel objects. This generalization is driven by the approximate\nperception module, which was designed to ignore most fine shape information and therefore avoid\nreliance on previously observed shapes.\n\nFigure 5: Failure cases in the ADEPT model. Top: The scene is initialized with a small object positioned behind\nan occluder, and the occluder rotates through that object. However, ADEPT perceives the occluder to be only\npartially rotated, so believes that the object remains underneath and is unsurprised. Bottom: An occluder rotates\nup to partially cover a blocking wall while an object moves from the left into occlusion. In the control case the\noccluder comes down to show the object on the left side of the wall; ADEPT correctly recognizes this. In the\nviolation case, the object is on the right side of the wall. The perception module localizes the object correctly, but\nthe reasoning module fails to update its beliefs about the object location because there is only a small mismatch\nin distance.\nQualitative visualization. Fig. 4 shows the evolution of the model\u2019s beliefs (over object location)\nand surprise, for four videos from two stimulus classes (one surprising and one control from each).\nADEPT is highly surprised around the time when the unexpected event occurs: when the occluder\nmoves through the object (top), or an object appears from behind the occluder (bottom). 
Conversely,\nduring periods with normal physical dynamics the model demonstrates only a low level of surprise,\ndriven by uncertainty about exact object dynamics. See the supplemental videos for further examples\nof how ADEPT is surprised over time in additional scenarios.\nLimitations. We find some cases where the ADEPT model sees plausible scenarios as surprising,\nor surprising scenes as normal. For instance, in Fig. 5 (top), coarse representations can cause\nthe perception module to misperceive the occluder\u2019s rotation, which produces an unsurprising\nrepresentation that the object is lying underneath the occluder. In Fig. 5 (bottom), ADEPT observes\nan object on the right side of the block, but represents that object as remaining on the left; because\nthe observation model measures loss in Euclidean distance regardless of intervening objects, this is\nperceived as simply a perceptual error.\n5.3 Human Results\nIn order to directly and quantitatively compare the different models to human intuitions, we asked\npeople to indicate their surprise levels when watching a subset of 64 test videos. Sixty participants\nfrom Amazon\u2019s Mechanical Turk (using the psiTurk framework [Gureckis et al., 2016]) participated\nfor 15-20 minutes and were compensated with $2.50. The 64 clips included eight scenes from each of the eight\nconditions described in Section 4.2. These were further selected to use two objects from each of the\nfour stimulus categories (unseen shapes, unseen classes, geometric, or toys). 
Matching \u201csurprise\u201d\nand \u201ccontrol\u201d videos were selected for each base scene; each participant rated only one of the two.\nParticipants rated how surprising they found the events in each video, using a continuous slider scale\nfrom 0 \u201cnot at all surprising\u201d to 100 \u201cextremely surprising\u201d; ratings were z-scored within participants to\ncontrol for differences in how people used the slider. 
We then calculated the relative accuracy for\nhumans (see Section 5.1: Metric).\u2020\n\n\u2020Because relative accuracy requires comparisons between surprise and control scenes within a pair, but\nparticipants only observed one video per pair, we must aggregate across participants to calculate this metric.\n\n[Figure 5 labels: Overturn, Block; Control, Violation; Initial frame, Mid frame, Final frame, Derendering, Particle representation, Surprise over time (surprise vs. t).]\n\nFigure 6: Comparison of model relative accuracy (x-axis) versus human relative accuracy (y-axis) by type\nof surprise condition for each of the models. Bars represent 95% bootstrapped CIs, the grey lines represent\nchance performance (50%), and performance is calculated on only the experiment stimulus subset. The ADEPT\nmodel performs at near-human levels (on the dashed line) in most surprise conditions, and has the lowest\ndeviation from human behavior of all models.\n\nTo determine how well the models capture human surprise, we compare model performance on the\nsubset of scenarios presented in the human experiment. Because people\u2019s performance differed by\nviolation scenario (F(14, 3824) = 26.8, p \u2248 0) but not object type (F(48, 3776) = 0.81, p = 0.83),\nwe only compare models and humans by violation condition. In six of eight scenarios our model\nperformed at near-human levels, differing substantially only in the \u2018block\u2019 and \u2018create\u2019 scenarios (see\nFig. 6). In contrast, none of the other models exceeded chance performance on more than half of the\nconditions. 
We quantify this difference by calculating the RMSE between model and human relative\naccuracies across these conditions, and find that our model has the lowest deviation from human\nperformance by a factor of more than two.\n\n5.4 Generalization to other datasets\n\nAs a test of whether the ADEPT model generalizes outside of our dataset, we tested it on the IntPhys\nbenchmarks [Riochet et al., 2018] of object permanence (O1), which include more complex textures\nand motion patterns. In order to capture the texture differences, we retrained the approximate\nderenderer on IntPhys videos without physical violations, but we retained the same dynamics and\nobservation model used in the experiments above. We find that ADEPT achieves a relative accuracy\nof 0.73, outperforming all baselines (Enc-Dec: 0.61, GAN: 0.53, LSTM: 0.65).\n\n6 Discussion\n\nIn this paper, we proposed ADEPT, a model that takes image input and transforms it into an\napproximate object-based representation, predicts the behavior of objects with probabilistic physical\ndynamics, and tracks and updates beliefs about those representations. We created a novel Violation\nof Expectation dataset based on developmental studies, and tested whether ADEPT and pixel-based\nmodels hold expectations about objects that are similar to those found in very young infants. We\nfound that the ADEPT model best differentiates scenes that violate physical expectations from those that\ndo not, and is the only model that reliably performs above chance across eight different violation\nscenarios. We additionally measured human surprise on the same violation scenarios, and found that\nADEPT matches human performance on many of those scenarios.\nOur work highlights the importance of object-centric representations for physical understanding.\nWhile visual input might change drastically (due to pose, lighting, or occlusion), object properties tend\nto remain the same. 
Holding a core representation of objects can produce more accurate predictions\nand better generalization.\nFinally, we hope this work can help drive a virtuous cycle linking AI and developmental psychology.\nWith models that are surprised by physical events as humans are, we can make testable predictions\nabout which physical scenarios people should find surprising, as well as relative surprises across\ndifferent scenarios. These predictions can in turn drive infant experiments, allowing for model-based\ntests of the structure and content of infants\u2019 object representations. A further understanding of the\ndevelopment of human physical knowledge can then inform the minimal representations and expected\nlearning trajectories needed to design artificial agents that understand physics the way people do.\nAcknowledgments. This work was sponsored by the Army Research Office and accomplished under\nGrant Number W911NF-18-1-0019, by NSF STC award CCF-1231216, by ONR MURI\nN00014-13-1-0333, and by Honda\u2019s Curious Minded Machines research grant.\n\n[Figure 6 panel labels: ADEPT, Enc-Dec, GAN, LSTM; RMSE labels, in extraction order: 0.40, 1.10, 0.99, 1.37; grey lines: chance.]\n\nReferences\n\nPulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking:\nExperiential learning of intuitive physics. In NeurIPS, 2016. 2\n\nRenee Baillargeon. Object permanence in 3\u00bd- and 4\u00bd-month-old infants. Dev. Psychol., 23(5):655, 1987. 1\n\nRenee Baillargeon, Elizabeth S Spelke, and Stanley Wasserman. Object permanence in five-month-old infants.\nCognition, 20(3):191\u2013208, 1985. 5\n\nPeter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene\nunderstanding. PNAS, 110(45):18327\u201318332, 2013. 1, 2\n\nPeter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction\nnetworks for learning about objects, relations and physics. In NeurIPS, 2016. 2\n\nAngel X. 
Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese,\nManolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-\nRich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University \u2014 Princeton\nUniversity \u2014 Toyota Technological Institute at Chicago, 2015. 6\n\nMichael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based\napproach to learning physical dynamics. In ICLR, 2017. 2\n\nChengqian Che, Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. Inverse transport networks.\narXiv:1809.10820, 2018. 2\n\nSM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend,\ninfer, repeat: Fast scene understanding with generative models. In NeurIPS, 2016. 2\n\nSM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham\nRuderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering.\nSci., 360(6394):1204\u20131210, 2018. 2\n\nChelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video\nprediction. In NeurIPS, 2016. 2\n\nKaterina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of\nphysics for playing billiards. In ICLR, 2016. 2\n\nTodd M. Gureckis, Jay Martin, John McDonnell, Alexander S. Rich, Doug Markant, Anna Coenen, David\nHalpern, Jessica B. Hamrick, and Patricia Chan. psiTurk: An open-source framework for conducting replicable\nbehavioral experiments online. Behavior Research Methods, 48(3):829\u2013842, 2016. ISSN 1554-3528. doi:\n10.3758/s13428-015-0642-8. 8\n\nJessica B Hamrick, Peter W Battaglia, Thomas L Griffiths, and Joshua B Tenenbaum. Inferring mass in complex\nscenes by mental simulation. Cognition, 157:61\u201376, 2016. 
2\n\nSiyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, and Song-Chun Zhu. Cooperative holistic\nscene understanding: Unifying 3d object, layout, and camera pose estimation. In NeurIPS, 2018. 2\n\nCeleste Kidd, Steven T Piantadosi, and Richard N Aslin. The goldilocks effect: Human infants allocate attention\nto visual sequences that are neither too simple nor too complex. PloS one, 7(5):e36399, 2012. 5\n\nAlina Kloss, Stefan Schaal, and Jeannette Bohg. Combining learned and analytical models for predicting action\neffects. arXiv:1710.04102, 2017. 2\n\nTejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse\ngraphics network. In NeurIPS, 2015. 2\n\nBrenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that\nlearn and think like people. Behav. Brain Sci., 40, 2017. 1\n\nAdam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML,\n2016. 2\n\nRoozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Newtonian scene understanding:\nUnfolding the dynamics of objects in static images. In CVPR, 2016. 2\n\nDamian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li Fei-Fei, Joshua B Tenenbaum, and Daniel LK\nYamins. Flexible neural representation for physics prediction. In NeurIPS, 2018. 2\n\nFran\u00e7ois Osiurak and Arnaud Badets. Tool use and affordance: Manipulation-based versus reasoning-based\napproaches. Psychological Review, 123(5):534\u2013568, 2016. ISSN 1939-1471, 0033-295X. doi:\n10.1037/rev0000027. 1\n\nLuis Piloto, Ari Weinstein, Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chia-chun Hung, and Matt\nBotvinick. Probing physics knowledge using tools from developmental psychology. 
arXiv:1804.01128, 2018.\n2, 3\n\nRonan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, V\u00e9ronique Izard,\nand Emmanuel Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning.\narXiv:1803.07616, 2018. 2, 3, 7, 9\n\nKevin A Smith and Edward Vul. Sources of uncertainty in intuitive physics. topiCS, 5(1):185\u2013199, 2013. 1, 2\n\nElizabeth S Spelke and Katherine D Kinzler. Core knowledge. Dev. Sci., 10(1):89\u201396, 2007. 1, 5\n\nElizabeth S Spelke, Karen Breinlinger, Janet Macomber, and Kristen Jacobson. Origins of knowledge. Psychol.\nRev., 99(4):605, 1992. 1, 6\n\nElizabeth S. Spelke, Roberta Kestenbaum, Daniel J. Simons, and Debra Wein. Spatiotemporal continuity,\nsmoothness of motion and object identity in infancy. British Journal of Developmental Psychology, 13(2):\n113\u2013142, 1995. ISSN 0261510X. doi: 10.1111/j.2044-835X.1995.tb00669.x. 5\n\nAimee E Stahl and Lisa Feigenson. Observing the unexpected enhances infants\u2019 learning and exploration. Sci.,\n348(6230):91\u201394, 2015. 1, 6\n\nErn\u0151 T\u00e9gl\u00e1s, Edward Vul, Vittorio Girotto, Michel Gonzalez, Joshua B Tenenbaum, and Luca L Bonatti. Pure\nreasoning in 12-month-old infants as probabilistic inference. Sci., 332(6033):1054\u20131059, 2011. 5\n\nTomer D. Ullman, Elizabeth Spelke, Peter Battaglia, and Joshua B. Tenenbaum. Mind games: Game engines as\nan architecture for intuitive physics. TiCS, 21(9):649\u2013665, sep 2017. doi: 10.1016/j.tics.2017.05.012. 1, 2, 4\n\nTomer D Ullman, Andreas Stuhlm\u00fcller, Noah D Goodman, and Joshua B Tenenbaum. Learning physical\nparameters from dynamic scenes. Cognitive psychology, 104:57\u201382, 2018. 2\n\nCarl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS,\n2016. 2\n\nEd Vul, George Alvarez, Joshua B Tenenbaum, and Michael J Black. 
Explaining human multiple object tracking\nas resource-constrained approximate inference in a dynamic probabilistic model. In Neural Information\nProcessing Systems, page 9, 2009. 3\n\nJiajun Wu, Erika Lu, Pushmeet Kohli, William T Freeman, and Joshua B Tenenbaum. Learning to see physics\nvia visual de-animation. In NeurIPS, 2017a. 2\n\nJiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017b. 2\n\nKaren Wynn. Addition and subtraction by human infants. Nature, 358(6389):749\u2013750, 1992. ISSN 0028-0836,\n1476-4687. doi: 10.1038/358749a0. 5\n\nFei Xu and Susan Carey. Infants\u2019 metaphysics: The case of numerical identity. Cognit. Psychol., 30(2):111\u2013153,\n1996. 2, 3, 4\n\nAlan Yuille and Daniel Kersten. Vision as bayesian inference: analysis by synthesis? TiCS, 10(7):301\u2013308,\n2006. 2\n\nBo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning\nstability and safety. IJCV, 112(2):221\u2013238, 2015. 2\n", "award": [], "sourceid": 4818, "authors": [{"given_name": "Kevin", "family_name": "Smith", "institution": "MIT"}, {"given_name": "Lingjie", "family_name": "Mei", "institution": "MIT"}, {"given_name": "Shunyu", "family_name": "Yao", "institution": "Princeton University"}, {"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}, {"given_name": "Elizabeth", "family_name": "Spelke", "institution": "Harvard University"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}, {"given_name": "Tomer", "family_name": "Ullman", "institution": "Harvard"}]}