{"title": "Learning to Exploit Stability for 3D Scene Parsing", "book": "Advances in Neural Information Processing Systems", "page_first": 1726, "page_last": 1736, "abstract": "Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. Physics is a rich and universal cue which we exploit to enhance scene understanding. We integrate the physical cue of stability into the learning process using a REINFORCE approach coupled to a physics engine, and apply this to the problem of producing the 3D bounding boxes and poses of objects in a scene. We first show that applying physics supervision to an existing scene understanding model increases performance, produces more stable predictions, and allows training to an equivalent performance level with fewer annotated training examples. We then present a novel architecture for 3D scene parsing named Prim R-CNN, learning to predict bounding boxes as well as their 3D size, translation, and rotation. With physics supervision, Prim R-CNN outperforms existing scene understanding approaches on this problem. Finally, we show that applying physics supervision on unlabeled real images improves real domain transfer of models training on synthetic data.", "full_text": "Learning to Exploit Stability for 3D Scene Parsing\n\nYilun Du\nMIT CSAIL\n\nZhijian Liu\nMIT CSAIL\n\nHector Basevi\n\nUniversity of Birmingham\n\nAle\u0161 Leonardis\n\nUniversity of Birmingham\n\nWilliam T. Freeman\n\nMIT CSAIL\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nJiajun Wu\nMIT CSAIL\n\nAbstract\n\nHuman scene understanding uses a variety of visual and non-visual cues to perform\ninference on object types, poses, and relations. Physics is a rich and universal\ncue that we exploit to enhance scene understanding. 
In this paper, we integrate\nthe physical cue of stability into the learning process by looping a physics\nengine into bottom-up recognition models, and apply it to the problem of 3D scene\nparsing. We first show that applying physics supervision to an existing scene\nunderstanding model increases performance, produces more stable predictions, and\nallows training to an equivalent performance level with fewer annotated training\nexamples. We then present a novel architecture for 3D scene parsing named Prim\nR-CNN, learning to predict bounding boxes as well as their 3D size, translation,\nand rotation. With physics supervision, Prim R-CNN outperforms existing scene\nunderstanding approaches on this problem. Finally, we show that finetuning with\nphysics supervision on unlabeled real images improves real domain transfer of\nmodels trained on synthetic data.\n\n1 Introduction\nHuman scene understanding is rich, and operates robustly using limited information. Physics\ncomprises invisible causal relationships that are ubiquitous in natural scenes and crucial in scene\nunderstanding [Battaglia et al., 2013, Zhang et al., 2016]. In particular, the vast majority of natural\nscenes are physically stable, a prior most systems for visual scene understanding do not exploit.\nVisual scene understanding takes many forms. Most commonly, elements of a scene are detected,\nclassified, and localized, either through bounding boxes [Dai et al., 2016] or pixel labels [He et al.,\n2017]. If object instances are known, object poses can be inferred directly [Brachmann et al., 2016].\nSupervision here can take the form of ground truth pixel annotations, as well as pixel depth if depth\nimages are available. Physical supervision is more challenging to introduce because there are few\nvisual features directly associated with physical relationships. 
Image sequences enable the robust\ninference of physical properties through movement of visual features [Stewart and Ermon, 2017], but\nanalysis of single images requires a different approach.\nMachine learning has enabled end-to-end inference of object physical properties [Wu et al., 2015],\nscene physical properties such as stability [Li et al., 2017], and prediction of future states [Lerer\net al., 2016, Wu et al., 2017a]. However, these systems have been trained in heavily constrained\ndomains and cannot easily be adapted to the variety present in natural scenes. In contrast, systems\nthat incorporate physics engines operate on discrete, interpretable representations; they are amenable\nto levels of object abstraction, such as the approximation of geometry by simple geometric primitives,\nand are more easily adaptable. Further, these types of abstractions can be more readily extracted\nfrom natural scenes, as demonstrated by existing machine learning systems for semantic localization\n\nCorrespondence to: jiajunwu@mit.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: When we observe natural scenes, we understand that objects are physically stable. We wish\nto harness this inherent stability signal to generate modi\ufb01cations to model predictions that are physics\nstable. In the scene above, stability signals allow us to modify predictions due to object intersection\n(red, green and blue arrows) and due to bad alignment with walls (orange arrow).\nand segmentation [Song et al., 2017, He et al., 2017], and facilitate combinations of systems for\nunderstanding of natural scenes and physics engine\u2013based stability supervision.\nPhysical stability supervision is not used by modern scene reconstruction systems [Tulsiani et al.,\n2018], but can have multiple bene\ufb01ts as seen in Figure 1. 
First, it can correct certain types of pose\nerrors resulting from the limited resolution of visual features. These errors manifest in objects\nintersecting other objects, objects intersecting the boundaries of the scene, and objects not aligned\nwith their supporting surfaces. When abstracted to simple geometric primitives, these errors can\nbe identi\ufb01ed by a physics engine with high accuracy and ef\ufb01ciency, and corrected to both increase\naccuracy and the physical stability of the reconstructed scene.\nSecond, physical stability supervision is always applicable and requires no annotation. This enables\nthe inclusion of unlabeled data in the learning, which facilitates the training of systems that generalize\nto natural scenes. This is particularly important due to the impracticality of annotating suf\ufb01cient data\nto cover the variety of visual features and spatial con\ufb01gurations present in natural scenes.\nIn this paper, we systematically explore how the use of scene stability as a supervisory signal\ncan enhance scene understanding by improving the quality of solutions and reducing the need for\nannotated training data. We apply this to the problem of scene reconstruction, which consists of\nidentifying objects present in a scene, generating the 3-dimensional bounding box and 4-dimensional\npose for each object. We train using synthetic data for which ground truth bounding box annotations\nare available, and also take advantage of data from real scenes without annotations by using scene\nstability as a supervisory signal, under the natural assumption that real scenes are stable. 
We\nincorporate stability supervision via the use of a physics engine, and estimate gradients using\nREINFORCE [Williams, 1992].\nWe evaluate our framework across several synthetic and realistic domains: human-designed\nroom layouts from SUNCG [Song et al., 2017], photo-realistically rendered automatically generated\nroom layouts from SceneNet RGB-D [McCormac et al., 2017], and real scenes from SUN RGB-D\n[Song et al., 2015]. We validate that our framework makes use of unlabeled data to increase\nreconstruction performance and demonstrate that with physics supervision, we require fewer annotations\nto achieve the same performance as a fully-supervised framework.\nOur contributions are three-fold. First, we propose a framework to integrate physical stability into\nscene reconstruction and demonstrate its ability to improve data efficiency and its use of unsupervised\ndata. Second, we propose an end-to-end scene reconstruction network that achieves state-of-the-art\nperformance on SUNCG [Song et al., 2017]. Third, we show our framework helps models trained on\nsynthetic data to transfer to real data.\n2 Related Work\nPhysical scene understanding has attracted increasing attention in recent years [Jia et al., 2013,\nZheng et al., 2013, Shao et al., 2014, Zheng et al., 2015, Fragkiadaki et al., 2016, Finn et al., 2016,\nBattaglia et al., 2016, Chang et al., 2017, Li et al., 2017, Ehrhardt et al., 2017]. 
Beyond answering\n\u201cwhat is where\u201d, physical scene understanding models the future dynamics of objects [Lerer et al.,\n2016, Mottaghi et al., 2016] and facilitates inference of actions to reach a goal [Li et al., 2017].\nMany previous papers have inferred physics from pixel-level representations [Lerer et al., 2016,\nMottaghi et al., 2016, Wu et al., 2016, Wu, 2016]; a number of methods infer physics from voxel-level\nrepresentations [Zheng et al., 2013, 2015, Liu et al., 2018]; others propose to learn a flexible model\nof object interactions, but assume an existing decomposition of a scene into physical objects [Chang\net al., 2017, Battaglia et al., 2016, Fragkiadaki et al., 2016]. Our framework combines a flexible\nmodel of object interactions with the complex task of inferring the layout and object locations and\nposes in natural indoor scenes.\nOur work integrates physics and geometric context into solutions to the task of 3D scene reconstruction,\nwhich was first presented in Roberts\u2019s blocks world [Roberts, 1963], and repeatedly revisited\nin later studies with various techniques [Gupta et al., 2010, Silberman et al., 2012, Hoiem et al.,\n2005]. Recent work in 3D scene reconstruction has explored reconstruction representations such\nas depth [Eigen and Fergus, 2015], surface normals [Bansal and Russell, 2016], and volumetric\nreconstructions [Firman et al., 2016, Song et al., 2017]. Our work builds on the explicit object-based\ngeometric primitive representation of 3D scenes in Tulsiani et al. [2018], which facilitates physical\nprediction.\nPrevious approaches applying complex physics to 3D scene reconstruction have focused on inferring\noccluded portions of objects. Zheng et al. [2013, 2015] have applied stability constraints to segment\npoint clouds into grouped primitives. 
The stability-based completion of objects assumes that occluded\nportions of objects extend to the nearest boundary, and does not generalize in cases when such\nassumptions do not hold. The approach of Shao et al. [2014] uses stability as a selection criterion\nto infer occluded portions of objects in the form of geometric primitives. Their approach uses\n\ufb01xed heuristics to infer cuboids to add to an existing partial object shape but suffers when two\nobjects are close to each other. The approach of Jia et al. [2013] constructs geometric primitives for\nobjects by \ufb01tting surfaces one-by-one to depth information, and applies simple rules about supporting\nsurfaces and center of mass to approximate features related to stability. The approach of Gupta\net al. [2010] constructs physical representations of objects but uses a set of heuristic rules to infer\nphysics. By leveraging the representation of objects as geometric primitives consistently in our scene\nreconstruction and stability estimation components, our model is able to handle partially occluded\nobjects (Figure 6) and nearby objects and operate on the diversity of scene con\ufb01gurations and objects\nfound in natural scenes. The representation of objects as geometric primitives couples naturally with\nthe representations employed by the physics engine that we use to estimate scene stability, and the\nhigh \ufb01delity of the physics engine provides more accurate and direct physics supervision than used in\nother works.\nOur results relate to research that uses deep networks to explain scenes with multiple objects [Ba\net al., 2015, Huang and Murphy, 2015, Eslami et al., 2016, Wu et al., 2017b]. A few recent studies\nhave also modeled higher level relations in scenes [Fisher et al., 2011, 2012]. Our work extends the\nmodel introduced by Tulsiani et al. 
[2018], which infers the geometric shapes and scene layout from\na single image, but without modeling physics.\n3 Physics Stability Model\nTo recover a 3D reconstructed scene, our physics stability model consists of (a) a primitive prediction\nmodule, an inverse graphics component that builds object representations from an input image; (b) a\nlayout prediction module that estimates the enclosing space of objects; and (c) a physics stability\nmodule that simulates the stability of the prediction. We show our framework in Figure 2.\n3.1 Overview\nThe first component of our model is an inverse graphics component to estimate the physical state of all\nobjects in an image. We represent each object present in an image as a 3D bounding box primitive and\npredict translation, scale and rotation of the primitive in full 3D space. These primitive predictions\nare generated through two different architectures, one which generates 3D primitives through 2D\nbounding boxes and another which predicts 3D primitives in parallel with 2D bounding boxes. The\nsecond component of our model is a layout prediction module that represents the layout around all\nprimitives as a set of normal planes. The third component of our model is a physics stability module\nthat takes predictions from both primitives and layout to generate a 3D scene and infer the stability of\neach of the primitives through a physics engine [Coumans, 2010].\nOur model is able to combine powerful neural networks for primitive and layout prediction with a\nreal-world physics engine. This allows us to generate predictions that are interpretable and feasible\nin the real world. Our model can be trained in a semi-supervised manner on images without 3D\nannotations by learning to generate predictions that are physically stable.\n\nFigure 2: Overview of the Physical Stability Model (PSM). 
Our model consists of three modules.\nA primitive prediction module predicts each object in a scene as a set of cuboid primitives, a\nlayout prediction module predicts the walls surrounding a scene, and a physics stability module\nprovides feedback on the stability of the predictions. We show our end-to-end primitive prediction\nnetwork (Prim R-CNN) for the primitive prediction module; please see Tulsiani et al. [2018] for their\nFactored3D model.\n\n3.2 Primitive Prediction Module\n\nTo show generality of our physics supervision, we consider two different primitive prediction models.\n\nGround Truth Bounding Boxes Given that we have access to ground truth bounding boxes of\nobjects in an image, we use the Factored3D architecture described in Tulsiani et al. [2018]. The\nFactored3D architecture encodes images through both a coarse and \ufb01ne ResNet-18 encoder [He et al.,\n2015] and then uses ROI pooling to extract relevant image features. These features are concatenated\nwith ground truth bounding box coordinates to regress parameters for size, translation and classify\nrotations for each primitive.\n\nRaw Input Image For direct primitive prediction from a raw image, we propose a new network\narchitecture called Prim R-CNN. We use a Faster R-CNN [Ren et al., 2015] architecture with a\nResNet-50 Feature Pyramid Network (FPN) [Lin et al., 2017] to extract features from images.\nThe Faster R-CNN architecture consists of a region proposal network (RPN) to propose candidate\nbounding boxes and a R-CNN network to re\ufb01ne bounding boxes and classify their content. Inspired\nby the Mask R-CNN architecture [He et al., 2017], in parallel to bounding box prediction and class\nre\ufb01nement, we add a primitive prediction branch that independently regresses size, translation and\nclassi\ufb01es rotation for each possible class for each candidate bounding box, allowing priors to be\nformed for each class type. We use 24 rotation bins as in Tulsiani et al. [2018]. 
An overview of\nour network architecture can be found in Figure 2 with architecture details in the supplementary\nmaterial. In comparison, the Factored3D model generalizes to raw images by first using an off-the-shelf\nbounding box detector, Edge Boxes [Zitnick and Doll\u00e1r, 2014], to propose bounding boxes and then\nby forwarding each resultant proposal through a Factored3D model with an added head for existence\nclassification.\nOur overall architecture is end-to-end and allows features learned for 2D detection to also be used in\nprimitive prediction. Furthermore, we incur reduced computational cost during training time as we\nare able to offload much of the initial bounding box computation to the RPN as opposed to forwarding\nall possible candidates through the Factored3D model.\nPrim R-CNN Loss Function To train our network, we use the loss function $L = L_{cls} + L_{box} +\nL_{size} + L_{trans} + L_{rot}$. We use $L_{cls}$ and $L_{box}$ to represent the losses for classification and bounding box\nregression respectively as defined in Ren et al. [2015]. Given ground truth labels for translation, $t$,\nand size, $s$, and model predictions of $\\hat{t}$ and $\\hat{s}$, we have $L_{trans} = \\|t - \\hat{t}\\|_2$ and $L_{size} = \\|s - \\hat{s}\\|_2$. We\ndefine the rotation loss $L_{rot}$ assuming a predicted rotation distribution of $k_d$ and ground truth rotation\nbin $g$ as $L_{rot} = -\\log(k_d(g))$. We only apply the above losses to primitives whose corresponding\npredicted bounding boxes have significant overlap with ground truth bounding boxes.\n\n3.3 Layout Prediction Module\nA second component of our model is the layout prediction module. 
We represent the layout around\nan image as a set of walls, \ufb02oor and ceiling. We represent each such surface as a plane, parametrized\nas the set of points x \u2208 S where S = {x|n \u00b7 x + d0 = 0}\nWe predict each plane by predicting a normal vector n, with (cid:107)n(cid:107) = 1, and a distance offset d0. Our\nlayout prediction network consists of an LSTM with three outputs at each time step, consisting of\na plane existence probability, n, and d0. We initialize the hidden state and input into the recurrent\nnetwork as a convolutional encoding of the scene. Planes, if they exist, are predicted in the order \ufb02oor,\nceiling, walls from left to right. We use L2 loss on normal and offset predictions and cross entropy\nloss on existence probabilities. Details about network architecture can be found in the supplement.\n3.4 Physics Stability Module\nThroughout the paper, we use a rigid body physics simulator, Bullet [Coumans, 2010], for estimating\n3D object stability. Since the simulator is not differentiable, we train both layout prediction and prim-\nitive prediction modules jointly using a stability signal from multi-sample REINFORCE [Williams,\n1992]. Our physics stability module operates on primitive cuboid representations of objects. We\nfound that representing objects as voxels led to the same approximate performance but required\nsigni\ufb01cantly more expensive 3D simulation.\nStability Calculation Given predictions from both our layout and object prediction modules, we\ninfer the direction of gravity from the \ufb02oor prediction. We then initialize all predicted oriented\nbounding boxes in Bullet with the same mass and friction coef\ufb01cient. We simulate all objects for 50\nseconds to detect even small instabilities. Each primitive then receives binary stability labels of 0 and\n1 dependent on object displacement. 
The overall stability score S(c) of a scene is calculated by taking\nthe average of stability scores of all primitives, so in the situation where a scene has 1 stable object\nand 1 unstable object, the calculated stability score would be 0.5.\nREINFORCE Details We train layout and primitive prediction modules simultaneously with the\nstability signal and apply REINFORCE at the scene level. We vary predictions from both modules\nsimultaneously and sample translations, scales, and offsets of predicted wall planes from a normal\ndistribution and size from a log-normal distribution.\nGiven sampled values for primitives and layouts in a configuration c, we compute the overall\nprobability of the configuration P(c) as well as the corresponding stability score S(c). Our overall\nloss for REINFORCE is then $L_{stab} = -\\sum_c \\log(P(c)) \\left(S(c) - S(C)\\right)$, where we subtract the average\nstability S(C) across different sampled configurations from each individual calculated stability score.\nWe sample a total of 15 different sets of primitive proposals for each scene.\n4 Evaluation\nWe evaluate our physics model on three different scenarios: synthetic room images from both the\nSUNCG dataset [Song et al., 2017] and SceneNet RGB-D dataset [McCormac et al., 2017] and\nreal images from the SUN RGB-D dataset [Song et al., 2015]. We show that our physics-based\nsupervision helps performance in both uncluttered realistic room scenes in the SUNCG dataset and\ncluttered scenes in the SceneNet RGB-D dataset. We show that our physics stability model can take\nadvantage of not only synthetic data, but real data without 3D annotations, hinting that physics can\nhelp model transfer from synthetic to real data.\n4.1 Scene Parsing with Ground Truth 2D Bounding Boxes\nWe begin by showing that our proposed physical stability model provides gains to performance of the\nFactored3D model using ground truth 2D bounding boxes. 
It helps by making more effective use of both labeled\ndata and additional data without 3D annotations.\nData We evaluate on the SUNCG dataset [Song et al., 2017] and SceneNet RGB-D [McCormac\net al., 2017] synthetic datasets. We use physically based renderings for SUNCG found in Zhang et al.\n[2017]. We use splits of the SUNCG dataset in Tulsiani et al. [2018] with around 400,000 training\nimages, 50,000 validation images and 100,000 test images. The set of objects in SUNCG is diverse,\nincluding lights, doors, and candlesticks; predicting the 3D location of all such objects is beyond the scope\nof the work. We restrict our predictions to beds, sofas, chairs, refrigerators, bathtubs, tables, and\ndesks.\n\nDataset          # Labels  Model                 SPhys  mAP0.5  mAP0.3  mAP0.1  Avg IoU\nSUNCG            0.1%      Factored3D            0.23   0.048   0.280   0.623   0.149\nSUNCG            0.1%      Factored3D + P        0.28   0.067   0.335   0.671   0.164\nSUNCG            0.1%      Factored3D + P + FT   0.41   0.076   0.331   0.650   0.164\nSUNCG            1%        Factored3D            0.36   0.100   0.389   0.713   0.200\nSUNCG            1%        Factored3D + P        0.38   0.122   0.437   0.743   0.206\nSUNCG            1%        Factored3D + P + FT   0.54   0.124   0.450   0.751   0.208\nSUNCG            10%       Factored3D            0.31   0.202   0.540   0.809   0.256\nSUNCG            10%       Factored3D + P        0.54   0.213   0.554   0.813   0.261\nSUNCG            10%       Factored3D + P + FT   0.59   0.212   0.557   0.812   0.266\nSceneNet RGB-D   100%      Factored3D            0.21   0.010   0.091   0.337   0.0276\nSceneNet RGB-D   100%      Factored3D + P        0.27   0.010   0.089   0.349   0.0304\n\nTable 1: Quantitative results of Factored3D trained with limited data on SUNCG. P represents physics\nsupervision while FT represents finetuning. FT and P both improve performance.\n\n(a) SPhys\n\n(b) Average IoU\n\nFigure 3: Average IoU vs data fraction used to train the Factored3D model on SUNCG. 
Finetuning\nwith physics always gives around a 50% data ef\ufb01ciency boost and improves stability.\n\nWe also evaluate on the SceneNet RGB-D dataset to show that our physics-based method also\ngeneralizes to very cluttered scenes. Primitives in SceneNet RGB-D are dropped via simulation using\na physics engine and are often stacked. We split the SceneNet RGB-D dataset into 90% training and\n10% testing. We use a subsection of the SceneNet RGB-D dataset of approximately 300,000 images.\n\nSetup We test how our physics stability model can be used to take fuller advantage of both limited\nsupervised data and data without 3D annotations.\nOur training protocol consists of three different steps. We \ufb01rst train both our primitive prediction\nmodule and layout prediction module using existing labeled data. Second, we train both modules\nwith the addition of physical stability module. We \ufb01nd that adding the physical stability module\nbefore pretraining leads to slow training, possibly due to there being many possible stable positions\nthat are far away from ground truth. Third, we \ufb01netune using our remaining semi-supervised data\nwithout 3D annotations (so in the case of 1% data use, we use 99% of data without 3D annotations),\ncontaining only color images and ground truth bounding box annotations, to train our model using\nthe physics stability module by using alternate batches of supervised and semi-supervised data. We\nuse the same training settings for Factored3D as described in Tulsiani et al. [2018].\nTo quantify our results, we compute intersection over union (IoU) values between each predicted\nprimitive from a ground truth bounding box and the corresponding ground truth primitive label. We\nuse IoU as opposed to thresholds for closeness values in Tulsiani et al. [2018] as they more accurately\nrepresent how close a predicted primitive is to a real primitive. 
To further quantify our results, we also\ncompute IoU values between each predicted primitive and all possible ground truth primitives in a\nscene. We then compute mean average precision (mAP) values for ground truth primitives assuming\nthresholds for IoU matching of 0.1, 0.3, and 0.5.\n\nModel                     Walls      Stability  0.1 mAP  0.3 mAP  0.5 mAP\nFactored3D (GT Boxes)     None       0.34       0.851    0.623    0.308\nFactored3D (Edge Boxes)   None       0.10       0.790    0.296    0.046\nPrim R-CNN                None       0.42       0.808    0.672    0.402\nPrim R-CNN + Physics      Predicted  0.54       0.814    0.680    0.393\n\nTable 2: Quantitative results for training on the full SUNCG dataset. Prim R-CNN achieves better\nperformance than Factored3D models. Physics offers additional benefits.\n\nFigure 4: Qualitative results on the SUNCG dataset. Prim R-CNN is able to detect objects more reliably\nand infer localization better than Factored3D with Edge Boxes.\n\nResults We find that our physics stability module significantly improves the stability of network\npredictions and also improves overall primitive prediction metrics. We present quantitative results\nfor training each of the three steps of our models in Table 1 and a plot of the performance trend across\ndifferent data fractions in Figure 3.\nOur full model is able to achieve 50% increased data efficiency at a wide range of fractions of data\nusage on the SUNCG dataset with semi-supervised finetuning. Our model is also able to construct\nscenes with much higher physical stability after training. 
We found diminishing gains from the\nphysics module on the whole SUNCG dataset (approximately 400,000 images), perhaps because at the\nscale of so many images, primitives are already predicted at a close-to-stable location, reducing the need\nfor physics supervision, although we still observe increased stability. We find that at full dataset scale,\nphysics provides minor improvements to layout prediction, reducing wall offset prediction MSE from\n0.299 to 0.297, while surface normal MSE stays constant at 0.013.\nWe further test our physics loss on the SceneNet RGB-D dataset and are also able to observe gains on\nthe full dataset as seen in Table 1. This indicates that our stability supervision is also applicable in\ncases of very cluttered scenes.\n\n4.2 Scene Parsing from Raw Images\n\nWe further show that our proposed physical stability model is able to provide gains to an end-to-end\nprimitive prediction network with further gains from utilizing unlabeled input images. We also show\nthat our proposed model outperforms the previous state-of-the-art method (Factored3D).\n\nSetup We use the same data and training process as described in Section 4.1. The only\nexception is during stability training: since each predicted primitive now has an associated\nconfidence score, when we simulate a scene of primitives, we simulate the sets of primitives\npredicted above each of a set of confidence thresholds.\nFor metrics, since Prim R-CNN predicts possible sets of bounding boxes, there is no longer a direct\ncorrespondence between predicted primitives and ground truth primitives. Therefore, we cannot\ncompute the average IoU metric used in Section 4.1 and only report the mAP scores defined in\nSection 4.1.\n\n(a) SPhys\n\n(b) mAP0.3\n\nFigure 5: Plots of Prim R-CNN\u2019s performance using limited data on SUNCG. 
Finetuning and physics\nimprove performance, providing over 200% data efficiency at 0.1% data use.\n\n# Labels  Model                 SPhys  mAP0.5  mAP0.3  mAP0.1\n0.1%      Factored3D            0.27   0.105   0.377   0.629\n0.1%      Factored3D + P        0.38   0.117   0.403   0.627\n0.1%      Factored3D + P + FT   0.35   0.115   0.415   0.636\n1%        Factored3D            0.28   0.191   0.482   0.710\n1%        Factored3D + P        0.42   0.195   0.492   0.704\n1%        Factored3D + P + FT   0.39   0.187   0.502   0.729\n10%       Factored3D            0.416  0.303   0.608   0.796\n10%       Factored3D + P        0.53   0.300   0.623   0.802\n10%       Factored3D + P + FT   0.55   0.300   0.622   0.805\n\nTable 3: Quantitative results on training Prim R-CNN on the SUNCG dataset. P represents physics\nsupervision, FT represents finetuning. P and FT help mAP metrics and stability.\n\nResults We show quantitative results for training Prim R-CNN in Table 2. Prim R-CNN\noutperforms Factored3D trained both with and without ground truth bounding boxes, with a similar number\nof parameters and faster training. Our model achieves further gains from physics when trained\non all data in SUNCG. We show qualitative results on the SUNCG dataset in Figure 4.\nWe show plots for Prim R-CNN trained and finetuned with physics on limited data in Figure 5 and\nquantitative numbers in Table 3. We note that our model achieves 200% data efficiency at 0.1% data\nutilization with over 50% data efficiency at 10% data utilization. 
We further note that with physics\nsupervision, we get the same performance at 0.3 IoU mAP as the ground truth Factored3D model\nusing only 10% of the data.\n\n4.3 Real Data Transfer\nTo demonstrate the generalization of our approach, we further show that our model can be finetuned\non real images and obtain significant gains to performance on SUN RGB-D without using any 3D\nlabeled real data.\nData We use 10,335 real training color images from the SUN RGB-D dataset [Song et al., 2015] with 2D\nbounding box annotation.\nSetup Our training protocol consists of two steps. We first fully train both a primitive prediction\nmodule and a layout prediction module on SUNCG. Next, we finetune these modules on real images\nfrom SUN RGB-D, by mixing a batch of labeled RGB data from SUNCG with unlabeled raw images\nfrom SUN RGB-D supervised only by our physics stability module. In the case of Prim R-CNN, when\nfinetuning on unlabeled raw images from SUN RGB-D, we also train the corresponding bounding\nbox classification parts of the network. We use the mAP metrics in Section 4.1.\n\nModel                  Finetune  Stability  0.1 mAP  0.3 mAP  0.5 mAP\nFactored3D (GT Boxes)  \u2013         0.23       0.238    0.016    0.000\nFactored3D (GT Boxes)  +         0.49       0.367    0.056    0.012\nPrim R-CNN             \u2013         0.07       0.193    0.024    0.000\nPrim R-CNN             +         0.22       0.276    0.056    0.060\n\nTable 4: Quantitative results of finetuning on the SUN RGB-D dataset. We note that finetuning leads\nto large increases in performance for both Factored3D and Prim R-CNN.\n\nFigure 6: Qualitative results on the SUN RGB-D dataset. 
Finetuning places predicted primitives on\nthe floor and reduces collisions.\nResults We show quantitative results of finetuning on SUN RGB-D in Table 4 and qualitative\nresults of finetuning in Figure 6. Our model improves substantially on the reported metrics\nafter finetuning. Qualitatively, we observe that finetuning spreads out object configurations,\nlowers primitives to the ground, and pushes back walls.\nQuantitatively, our numbers are relatively low. We suspect that one reason is sensor disparity between\nimages in SUNCG and SUN RGB-D. From a qualitative point of view, it appears our overall model\nis capable of making more realistic looking reconstructions of the original input image.\nHuman Results We further run a human study to evaluate whether finetuning makes scenes\nmore qualitatively reasonable to humans. We randomly sample 100 images from SUN RGB-D\nand generate 3D reconstructions using both a finetuned and non-finetuned Factored3D model. We\nevaluate each image with 20 different people through Amazon Mechanical Turk. We found that\nin 13.3% of scenes, people disliked the finetuned Factored3D result (<35% approval), in 52.4% of\nscenes people were indifferent (between 35% approval and 65% approval), and in 32.5% of scenes\npeople preferred reconstructions from the finetuned model (>65% approval). Overall, we find that\nmany scenes are more appealing to humans after finetuning. We also note a significant number of\nscenes that humans are indifferent towards; these may be due to the scenes already being stable\nor both reconstructions being of sufficiently poor quality that humans are indifferent. We include\nfinetuned scenes that are most and least liked by people in the appendix.\n5 Conclusion\nWe propose a physical stability model that combines primitive and layout prediction modules with a\nphysics simulation engine. 
Our model achieves state-of-the-art results on the SUNCG dataset. Our\nmodel effectively uses unlabeled images for training and allows better domain transfer when applied\nto real images. We expect our model to have wider impact in the future due to the growing need for\naccurate 3D scene reconstruction methods and the increased prevalence of synthetic datasets.\n\n[Figure panels: Input Image, No Finetune, Finetune]\n\nAcknowledgements This work is in part supported by ONR MURI N00014-16-1-2007, the Center\nfor Brains, Minds and Machines (CBMM), Toyota Research Institute, NSF #1447476, and Facebook.\nWe acknowledge MoD/Dstl and EPSRC for providing the grant to support the UK academics\u2019\ninvolvement in a Department of Defense funded MURI project through EPSRC grant EP/N019415/1.\nReferences\nJimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In\n\nICLR, 2015. 3\n\nAayush Bansal and Bryan Russell. Marr revisited: 2d-3d alignment via surface normal prediction. In CVPR,\n\n2016. 3\n\nPeter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene\n\nunderstanding. PNAS, 110(45):18327\u201318332, 2013. 1\n\nPeter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction\n\nnetworks for learning about objects, relations and physics. In NIPS, 2016. 2, 3\n\nEric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, et al. Uncertainty-driven\n\n6d pose estimation of objects and scenes from a single rgb image. In CVPR, 2016. 1\n\nMichael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based\n\napproach to learning physical dynamics. In ICLR, 2017. 2, 3\n\nErwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 2010. 3, 5\n\nJifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional\n\nnetworks. 
In NIPS, 2016. 1\n\nSebastien Ehrhardt, Aron Monszpart, Niloy J Mitra, and Andrea Vedaldi. Learning a physical long-term predictor.\n\narXiv:1703.00247, 2017. 2\n\nDavid Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale\n\nconvolutional architecture. In ICCV, 2015. 3\n\nSM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend,\n\ninfer, repeat: Fast scene understanding with generative models. In NIPS, 2016. 3\n\nChelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video\n\nprediction. In NIPS, 2016. 2\n\nMichael Firman, Oisin Mac Aodha, Simon Julier, and Gabriel J Brostow. Structured prediction of unobserved\n\nvoxels from a single depth image. In CVPR, 2016. 3\n\nMatthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing structural relationships in scenes using graph\n\nkernels. ACM transactions on graphics (TOG), 30(4):34, 2011. 3\n\nMatthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan. Example-based\n\nsynthesis of 3d object arrangements. ACM TOG, 31(6):135, 2012. 3\n\nKaterina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of\n\nphysics for playing billiards. In ICLR, 2016. 2, 3\n\nAbhinav Gupta, Alexei A Efros, and Martial Hebert. Blocks world revisited: Image understanding using\n\nqualitative geometry and mechanics. In ECCV, 2010. 3\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In\n\nCVPR, 2015. 4\n\nKaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 1, 2, 4\n\nDerek Hoiem, Alexei A. Efros, and Martial Hebert. Geometric context from a single image. In ICCV, 2005. 3\n\nJonathan Huang and Kevin Murphy. Efficient inference in occlusion-aware generative models of images. In\n\nICLR Workshop, 2015. 
3\n\nZhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3d-based reasoning with blocks, support,\n\nand stability. In CVPR, 2013. 2, 3\n\nAdam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML,\n\n2016. 1, 2\n\nWenbin Li, Ales Leonardis, and Mario Fritz. Visual stability prediction for robotic manipulation. In ICRA. IEEE,\n\n2017. 1, 2\n\nTsung-Yi Lin, Piotr Doll\u00e1r, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature\n\npyramid networks for object detection. In CVPR, 2017. 4\n\nZhijian Liu, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Physical primitive decomposition. In\n\nECCV, 2018. 3\n\nJohn McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison. Scenenet rgb-d: Can 5m synthetic\n\nimages beat generic imagenet pre-training on indoor segmentation? In ICCV, 2017. 2, 5\n\nRoozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. \u201cwhat happens if...\u201d learning to\n\npredict the effect of forces in images. In ECCV, 2016. 2\n\nShaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection\n\nwith region proposal networks. In NIPS, 2015. 4\n\nLawrence G Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of\n\nTechnology, 1963. 3\n\nTianjia Shao, Aron Monszpart, Youyi Zheng, Bongjin Koo, Weiwei Xu, Kun Zhou, and Niloy J Mitra. Imagining\n\nthe unseen: Stability-based cuboid arrangements for scene understanding. ACM TOG, 33(6), 2014. 2, 3\n\nNathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference\n\nfrom rgbd images. In ECCV, 2012. 3\n\nShuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark\n\nsuite. In CVPR, 2015. 2, 5, 8\n\nShuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. 
Semantic scene\n\ncompletion from a single depth image. In CVPR, 2017. 2, 3, 5\n\nRussell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain\n\nknowledge. In AAAI, 2017. 1\n\nShubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A Efros, and Jitendra Malik. Factoring shape, pose,\n\nand layout from the 2d image of a 3d scene. In CVPR, 2018. 2, 3, 4, 5, 6\n\nRonald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.\n\nMLJ, 8(3-4):229\u2013256, 1992. 2, 5\n\nJiajun Wu. Computational perception of physical object properties. Master\u2019s thesis, Massachusetts Institute of\n\nTechnology, 2016. 2\n\nJiajun Wu, Ilker Yildirim, Joseph J Lim, William T Freeman, and Joshua B Tenenbaum. Galileo: Perceiving\n\nphysical object properties by integrating a physics engine with deep learning. In NIPS, 2015. 1\n\nJiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning\n\nphysical object properties from unlabeled videos. In BMVC, 2016. 2\n\nJiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual\n\nde-animation. In NIPS, 2017a. 1\n\nJiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017b. 3\nRenqiao Zhang, Jiajun Wu, Chengkai Zhang, William T Freeman, and Joshua B Tenenbaum. A comparative\nevaluation of approximate probabilistic simulation and deep neural networks as accounts of human physical\nscene understanding. In CogSci, 2016. 1\n\nYinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser.\nPhysically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR,\n2017. 5\n\nBo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Beyond point clouds: Scene\n\nunderstanding by reasoning geometry and physics. In CVPR, 2013. 
2, 3\n\nBo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning\n\nstability and safety. IJCV, 112(2):221\u2013238, 2015. 2, 3\n\nC. Lawrence Zitnick and Piotr Doll\u00e1r. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 4", "award": [], "sourceid": 872, "authors": [{"given_name": "Yilun", "family_name": "Du", "institution": "MIT"}, {"given_name": "Zhijian", "family_name": "Liu", "institution": "MIT"}, {"given_name": "Hector", "family_name": "Basevi", "institution": "University of Birmingham"}, {"given_name": "Ales", "family_name": "Leonardis", "institution": "University of Birmingham"}, {"given_name": "Bill", "family_name": "Freeman", "institution": "MIT/Google"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}, {"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}]}