{"title": "Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 207, "page_last": 218, "abstract": "Holistic 3D indoor scene understanding refers to jointly recovering the i) object bounding boxes, ii) room layout, and iii) camera pose, all in 3D. The existing methods either are ineffective or only tackle the problem partially. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real-time given only a single RGB image. The essence of the proposed method is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and ii) cooperative training across different modules in contrast to training these modules individually. Specifically, we parametrize the 3D object bounding boxes by the predictions from several modules, i.e., 3D camera pose and object attributes. The proposed method provides two major advantages: i) The parametrization helps maintain the consistency between the 2D image and the 3D world, thus largely reducing the prediction variances in 3D coordinates. ii) Constraints can be imposed on the parametrization to train different modules simultaneously. We call these constraints \"cooperative losses\" as they enable the joint training and inference. We employ three cooperative losses for 3D bounding boxes, 2D projections, and physical constraints to estimate a geometrically consistent and physically plausible 3D scene. 
Experiments on the SUN RGB-D dataset show that the proposed method significantly outperforms prior approaches on 3D layout estimation, 3D object detection, 3D camera pose estimation, and holistic scene understanding.", "full_text": "Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation\n\nSiyuan Huang 1\n\nhuangsiyuan@ucla.edu\n\nSiyuan Qi 2\n\nsyqi@cs.ucla.edu\n\nYinxue Xiao 2\n\nyinxuex@ucla.edu\n\nYixin Zhu 1\n\nyixin.zhu@ucla.edu\n\nYing Nian Wu 1\n\nywu@stat.ucla.edu\n\nSong-Chun Zhu 1,2\n\nsczhu@stat.ucla.edu\n\n1 Dept. of Statistics, UCLA 2 Dept. of Computer Science, UCLA\n\nAbstract\n\nHolistic 3D indoor scene understanding refers to jointly recovering the i) object bounding boxes, ii) room layout, and iii) camera pose, all in 3D. The existing methods either are ineffective or only tackle the problem partially. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real-time given only a single RGB image. The essence of the proposed method is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and ii) cooperative training across different modules in contrast to training these modules individually. Specifically, we parametrize the 3D object bounding boxes by the predictions from several modules, i.e., 3D camera pose and object attributes. The proposed method provides two major advantages: i) The parametrization helps maintain the consistency between the 2D image and the 3D world, thus largely reducing the prediction variances in 3D coordinates. ii) Constraints can be imposed on the parametrization to train different modules simultaneously. We call these constraints \"cooperative losses\" as they enable the joint training and inference. 
We employ three cooperative losses for 3D bounding boxes, 2D projections, and physical constraints to estimate a geometrically consistent and physically plausible 3D scene. Experiments on the SUN RGB-D dataset show that the proposed method significantly outperforms prior approaches on 3D object detection, 3D layout estimation, 3D camera pose estimation, and holistic scene understanding.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nHolistic 3D scene understanding from a single RGB image is a fundamental yet challenging computer vision problem, while humans are capable of performing such tasks effortlessly within 200 ms [Potter, 1975, 1976, Schyns and Oliva, 1994, Thorpe et al., 1996]. The primary difficulty of holistic 3D scene understanding lies in the vast but ambiguous 3D information to be recovered from a single RGB image. Such estimation includes three essential tasks:\n\u2022 The estimation of the 3D camera pose that captures the image. This component helps to maintain the consistency between the 2D image and the 3D world.\n\u2022 The estimation of the 3D room layout. Combined with the estimated 3D camera pose, it recovers a global geometry.\n\u2022 The estimation of the 3D bounding boxes for each object in the scene, recovering the local details.\n\nMost current methods either are inefficient or only tackle the problem partially. Specifically,\n\u2022 Traditional methods [Gupta et al., 2010, Zhao and Zhu, 2011, 2013, Choi et al., 2013, Schwing et al., 2013, Zhang et al., 2014, Izadinia et al., 2017, Huang et al., 2018] apply sampling or optimization methods to infer the geometry and semantics of indoor scenes. However, those methods are computationally expensive; they usually take a long time to converge and can easily be trapped in an unsatisfactory local minimum, especially for cluttered indoor environments. Thus both stability and scalability become issues.\n\u2022 Recently, researchers have attempted to tackle this problem using deep learning. The most straightforward way is to directly predict the desired targets (e.g., 3D room layouts or 3D bounding boxes) by training the individual modules separately with isolated losses for each module. Consequently, prior work [Mousavian et al., 2017, Lee et al., 2017, Kehl et al., 2017, Kundu et al., 2018, Zou et al., 2018, Liu et al., 2018] either focuses only on individual tasks or learns these tasks separately rather than jointly inferring all three, or considers only the inherent relations without explicitly modeling the connections among them [Tulsiani et al., 2018].\n\u2022 Another stream of approaches takes both an RGB-D image and the camera pose as the input [Lin et al., 2013, Song and Xiao, 2014, 2016, Song et al., 2017, Deng and Latecki, 2017, Zou et al., 2017, Qi et al., 2018, Lahoud and Ghanem, 2017, Zhang et al., 2017a], which provides sufficient geometric information from the depth images, thereby relying less on the consistency among different modules.\n\nFigure 1: Overview of the proposed framework for cooperative holistic scene understanding. (a) We first detect 2D objects and generate their bounding boxes, given a single RGB image as the input, from which (b) we can estimate 3D object bounding boxes, 3D room layout, and 3D camera pose. The blue bounding box is the estimated 3D room layout. (c) We project 3D objects to the image plane with the learned camera pose, forcing the projection from the 3D estimation to be consistent with the 2D estimation.\n\nIn this paper, we aim to address the missing piece in the literature: to recover a geometrically consistent and physically plausible 3D scene and jointly solve all three tasks in an efficient and cooperative way, only from a single RGB image. Specifically, we tackle three important problems:\n1. 
2D-3D consistency. A good solution to the aforementioned three tasks should maintain a high consistency between the 2D image plane and the 3D world coordinate. How should we design a method to achieve such consistency?\n\n2. Cooperation. Psychological studies have shown that our biological perception system is extremely good at rapid scene understanding [Schyns and Oliva, 1994], particularly utilizing the fusion of different visual cues [Landy et al., 1995, Jacobs, 2002]. Such findings support the necessity of cooperatively solving all the holistic scene tasks together. Can we devise an algorithm that cooperatively solves these tasks, making different modules reinforce each other?\n\n3. Physically Plausible. As humans, we excel in inferring the physical attributes and dynamics [Kubricht et al., 2017]. Such a deep understanding of the physical environment is imperative, especially for an interactive agent (e.g., a robot) to navigate the environment or collaborate with a human agent. How can the model estimate a 3D scene in a physically plausible fashion, or at least have some sense of physics?\n\nTo address these issues, we propose a novel parametrization of the 3D bounding box as well as a set of cooperative losses. Specifically, we parametrize the 3D boxes by the predicted camera pose and object attributes from individual modules. Hence, we can construct the 3D boxes starting from the 2D box centers to maintain a 2D-3D consistency, rather than predicting 3D coordinates directly or assuming the camera pose is given, both of which lose the 2D-3D consistency.\n\nCooperative losses are further imposed on the parametrization in addition to the direct losses to enable the joint training of all the individual modules. 
Specifically, we employ three cooperative losses on the parametrization to constrain the 3D bounding boxes, projected 2D bounding boxes, and physical plausibility, respectively:\n\u2022 The 3D bounding box loss encourages accurate 3D estimation.\n\u2022 The differentiable 2D projection loss measures the consistency between 3D and 2D bounding boxes, which permits our networks to learn the 3D structures with only 2D annotations (i.e., no 3D annotations are required). In fact, we can directly supervise the learning process with 2D object annotations using common-sense knowledge of the object sizes.\n\u2022 The physical plausibility loss penalizes the intersection between the reconstructed 3D object boxes and the 3D room layout, which prompts the networks to yield a physically plausible estimation.\n\nFigure 2: Illustration of (a) network architecture and (b) parametrization of 3D object bounding box.\n\nFigure 1 shows the proposed framework for cooperative holistic scene understanding. Our method starts with the detection of 2D object bounding boxes from a single RGB image. Two branches of convolutional neural networks are employed to learn the 3D scene from both the image and 2D boxes: i) The global geometry network (GGN) learns the global geometry of the scene, predicting both the 3D room layout and the camera pose. ii) The local object network (LON) learns the object attributes, estimating the object pose, size, distance between the 3D box center and camera center, and the 2D offset from the 2D box center to the projected 3D box center on the image plane. The details are discussed in Section 2. 
By combining the camera pose from the GGN and object attributes from the LON, we can parametrize 3D bounding boxes, which enables joint learning of both GGN and LON with 2D and 3D supervision.\n\nAnother benefit of the proposed parametrization is improved training stability through reduced variance of the 3D box predictions, because i) the estimated 2D offset has relatively low variance, and ii) we adopt a hybrid of classification and regression to estimate the variables with large variances, inspired by [Ren et al., 2015, Mousavian et al., 2017, Qi et al., 2018].\n\nWe evaluate our method on the SUN RGB-D dataset [Song et al., 2015]. The proposed method outperforms previous methods on four tasks, including 3D layout estimation, 3D object detection, 3D camera pose estimation, and holistic scene understanding. Our experiments demonstrate that a cooperative method performing holistic scene understanding tasks can significantly outperform existing methods tackling each task in isolation, further indicating the necessity of joint training.\n\nOur contributions are four-fold. i) We formulate an end-to-end model for 3D holistic scene understanding tasks. The essence of the proposed model is to cooperatively estimate 3D room layout, 3D camera pose, and 3D object bounding boxes. ii) We propose a novel parametrization of the 3D bounding boxes and integrate physical constraints, enabling the cooperative training of these tasks. iii) We bridge the gap between the 2D image plane and the 3D world by introducing a differentiable objective function between the 2D and 3D bounding boxes. iv) Our method significantly outperforms the state-of-the-art methods and runs in real-time.\n\n2 Method\n\nIn this section, we describe the parametrization of the 3D bounding boxes and the neural networks designed for 3D holistic scene understanding. 
The proposed model consists of two networks, shown in Figure 2: a global geometry network (GGN) that estimates the 3D room layout and camera pose, and a local object network (LON) that infers the attributes of each object. Based on these two networks, we further formulate differentiable loss functions to train the two networks cooperatively.\n\n2.1 Parametrization\n\n3D Objects. We use the 3D bounding box $X^W \\in \\mathbb{R}^{3 \\times 8}$ as the representation of the estimated 3D object in the world coordinate. The 3D bounding box is described by its 3D center $C^W \\in \\mathbb{R}^3$, size $S^W \\in \\mathbb{R}^3$, and orientation $R(\\theta^W) \\in \\mathbb{R}^{3 \\times 3}$: $X^W = h(C^W, R(\\theta^W), S^W)$, where $\\theta^W$ is the heading angle along the up-axis, and $h(\\cdot)$ is the function that composes the 3D bounding box.\n\nWithout any depth information, estimating the 3D object center $C^W$ directly from the 2D image may result in a large variance of the 3D bounding box estimation. To alleviate this issue and bridge the gap between 2D and 3D object bounding boxes, we parametrize the 3D center $C^W$ by its corresponding 2D bounding box center $C^I \\in \\mathbb{R}^2$ on the image plane, the distance $D$ between the camera center and the 3D object center, the camera intrinsic parameter $K \\in \\mathbb{R}^{3 \\times 3}$, and the camera extrinsic parameters $R(\\phi, \\psi) \\in \\mathbb{R}^{3 \\times 3}$ and $T \\in \\mathbb{R}^3$, where $\\phi$ and $\\psi$ are the camera rotation angles. As illustrated in Figure 2(b), since each 2D bounding box and its corresponding 3D bounding box are both manually annotated, there is always an offset $\\delta^I \\in \\mathbb{R}^2$ between the 2D box center and the projection of the 3D box center. 
Therefore, the 3D object center $C^W$ can be computed as\n\n$$C^W = T + D \\, R(\\phi, \\psi)^{-1} \\frac{K^{-1} [C^I + \\delta^I, 1]^T}{\\lVert K^{-1} [C^I + \\delta^I, 1]^T \\rVert}. \\tag{1}$$\n\nSince $T$ becomes $\\vec{0}$ when the data is captured from the first-person view, the above equation can be written as $C^W = p(C^I, \\delta^I, D, \\phi, \\psi, K)$, where $p$ is a differentiable projection function.\n\nIn this way, the parametrization of the 3D object bounding box unites the 3D object center $C^W$ and the 2D object center $C^I$, which helps maintain the 2D-3D consistency and reduces the variance of the 3D bounding box estimation. Moreover, it integrates both object attributes and camera pose, promoting the cooperative training of the two networks.\n\n3D Room Layout. Similar to 3D objects, we parametrize the 3D room layout in the world coordinate as a 3D bounding box $X^L \\in \\mathbb{R}^{3 \\times 8}$, which is represented by its 3D center $C^L \\in \\mathbb{R}^3$, size $S^L \\in \\mathbb{R}^3$, and orientation $R(\\theta^L) \\in \\mathbb{R}^{3 \\times 3}$, where $\\theta^L$ is the rotation angle. In this paper, we estimate the room layout center by predicting the offset from the pre-computed average layout center.\n\n2.2 Direct Estimations\n\nAs shown in Figure 2(a), the global geometry network (GGN) takes a single RGB image as the input, and predicts both the 3D room layout and the 3D camera pose. This design is driven by the fact that the estimations of both the 3D room layout and the 3D camera pose rely on low-level global geometric features. Specifically, GGN estimates the center $C^L$, size $S^L$, and heading angle $\\theta^L$ of the 3D room layout, as well as the two rotation angles $\\phi$ and $\\psi$ for predicting the camera pose.\n\nMeanwhile, the local object network (LON) takes 2D image patches as the input. For each object, LON estimates object attributes including distance $D$, size $S^W$, heading angle $\\theta^W$, and the 2D offset $\\delta^I$ between the 2D box center and the projection of the 3D box center.\n\nDirect estimations are supervised by two losses $\\mathcal{L}_{\\mathrm{GGN}}$ and $\\mathcal{L}_{\\mathrm{LON}}$. Specifically, $\\mathcal{L}_{\\mathrm{GGN}}$ is defined as\n\n$$\\mathcal{L}_{\\mathrm{GGN}} = \\mathcal{L}_{\\phi} + \\mathcal{L}_{\\psi} + \\mathcal{L}_{C^L} + \\mathcal{L}_{S^L} + \\mathcal{L}_{\\theta^L}, \\tag{2}$$\n\nand $\\mathcal{L}_{\\mathrm{LON}}$ is defined as\n\n$$\\mathcal{L}_{\\mathrm{LON}} = \\frac{1}{N} \\sum_{j=1}^{N} \\left( \\mathcal{L}_{D_j} + \\mathcal{L}_{\\delta^I_j} + \\mathcal{L}_{S^W_j} + \\mathcal{L}_{\\theta^W_j} \\right), \\tag{3}$$\n\nwhere $N$ is the number of objects in the scene. In practice, directly regressing objects\u2019 attributes (e.g., heading angle) may result in a large error. Inspired by [Ren et al., 2015, Mousavian et al., 2017, Qi et al., 2018], we adopt a hybrid method of classification and regression to predict the sizes and heading angles. Specifically, we pre-define several size templates or equally split the space into a set of angle bins. Our model first classifies sizes and heading angles into those pre-defined categories, and then predicts residual errors within each category. For example, in the case of the rotation angle $\\phi$, we define $\\mathcal{L}_{\\phi} = \\mathcal{L}_{\\phi\\text{-cls}} + \\mathcal{L}_{\\phi\\text{-reg}}$. Softmax is used for classification and smooth-L1 (Huber) loss is used for regression.\n\n2.3 Cooperative Estimations\n\nPsychological experiments have shown that human perception of the scene often relies on global information instead of local details, known as the gist of the scene [Oliva, 2005, Oliva and Torralba, 2006]. Furthermore, prior studies have demonstrated that human perception on specific tasks involves the cooperation of multiple visual cues, e.g., on depth perception [Landy et al., 1995, Jacobs, 2002]. 
These crucial observations motivate the idea that the attributes and properties are naturally coupled and tightly bound, and thus should be estimated cooperatively, so that the individual components help to boost each other.\n\nUsing the parametrization described in Subsection 2.1, we hope to cooperatively optimize GGN and LON, simultaneously estimating 3D camera pose, 3D room layout, and 3D object bounding boxes, in the sense that the two networks enhance each other and cooperate to make the definitive estimation during the learning process. Specifically, we propose three cooperative losses which jointly provide supervision and fuse 2D/3D information into a physically plausible estimation. Such cooperation improves the estimation accuracy of 3D bounding boxes, maintains the consistency between 2D and 3D, and generates a physically plausible scene. We further elaborate on these three aspects below.\n\n3D Bounding Box Loss. As neither GGN nor LON is directly optimized for the accuracy of the final estimation of the 3D bounding box, learning directly through GGN and LON is not sufficient and requires additional regularization. Ideally, the estimation of the object attributes and camera pose should be cooperatively optimized, as both contribute to the estimation of the 3D bounding box. To achieve this goal, we propose the 3D bounding box loss with respect to its 8 corners\n\n$$\\mathcal{L}_{\\mathrm{3D}} = \\frac{1}{N} \\sum_{j=1}^{N} \\left\\lVert h(C^W_j, R(\\theta_j), S_j) - X^{W*}_j \\right\\rVert_2^2, \\tag{4}$$\n\nwhere $X^{W*}$ denotes the ground truth 3D bounding boxes in the world coordinate. Qi et al. [2018] propose a similar regularization in which the parametrization of 3D bounding boxes is different.\n\n2D Projection Loss. In addition to the 3D parametrization of the 3D bounding boxes, we further impose an additional consistency as the 2D projection loss, which maintains the coherence between the 2D bounding boxes in the image plane and the 3D bounding boxes in the world coordinate. Specifically, we formulate the learning objective of the projection from 3D to 2D as\n\n$$\\mathcal{L}_{\\mathrm{PROJ}} = \\frac{1}{N} \\sum_{j=1}^{N} \\left\\lVert f(X^W_j, R, K) - X^{I*}_j \\right\\rVert_2^2, \\tag{5}$$\n\nwhere $f(\\cdot)$ denotes a differentiable projection function which projects a 3D bounding box to a 2D bounding box, and $X^{I*}_j \\in \\mathbb{R}^{2 \\times 4}$ is the 2D object bounding box (either detected or the ground truth).\n\nPhysical Loss. In the physical world, 3D objects and room layout should not intersect with each other. To produce a physically plausible 3D estimation of a scene, we integrate the physical loss that penalizes the physical violations between 3D objects and 3D room layout\n\n$$\\mathcal{L}_{\\mathrm{PHY}} = \\frac{1}{N} \\sum_{j=1}^{N} \\left( \\mathrm{ReLU}(\\mathrm{Max}(X^W_j) - \\mathrm{Max}(X^L)) + \\mathrm{ReLU}(\\mathrm{Min}(X^L) - \\mathrm{Min}(X^W_j)) \\right), \\tag{6}$$\n\nwhere ReLU is the activation function, and $\\mathrm{Max}(\\cdot)$ / $\\mathrm{Min}(\\cdot)$ takes a 3D bounding box as the input and outputs the max/min values along the three world axes. By adding the physical constraint loss, the proposed model connects the 3D environments and the 3D objects, resulting in a more natural estimation of both 3D objects and 3D room layout.\n\nTo summarize, the total loss can be written as\n\n$$\\mathcal{L}_{\\mathrm{Total}} = \\mathcal{L}_{\\mathrm{GGN}} + \\mathcal{L}_{\\mathrm{LON}} + \\lambda_{\\mathrm{COOP}} \\left( \\mathcal{L}_{\\mathrm{3D}} + \\mathcal{L}_{\\mathrm{PROJ}} + \\mathcal{L}_{\\mathrm{PHY}} \\right), \\tag{7}$$\n\nwhere $\\lambda_{\\mathrm{COOP}}$ is the trade-off parameter that balances the cooperative losses and the direct losses.\n\n3 Implementation\n\nBoth the GGN and LON adopt the ResNet-34 [He et al., 2016] architecture as the encoder, which encodes a 256x256 RGB image into a 2048-D feature vector. 
As each of the networks consists of multiple output channels, for each channel with an L-dimensional output, we stack two fully connected layers (2048-1024, 1024-L) on top of the encoder to make the prediction.\n\nFigure 3: Qualitative results (top 50%). (Left) Original RGB images. (Middle) Results projected in 2D. (Right) Results in 3D. Note that the depth input is only used to visualize the 3D results.\n\nWe adopt a two-step training procedure. First, we fine-tune the 2D detector [Dai et al., 2017, Bodla et al., 2017] with the 30 most common object categories to generate 2D bounding boxes. The 2D and 3D bounding boxes are matched to ensure each 2D bounding box has a corresponding 3D bounding box.\n\nSecond, we train two 3D estimation networks. To obtain good initial networks, both GGN and LON are first trained individually using the synthetic data (SUNCG dataset [Song et al., 2017]) with photo-realistically rendered images [Zhang et al., 2017b]. We then fix six blocks of the encoders of GGN and LON, respectively, and fine-tune the two networks jointly on the SUN RGB-D dataset [Song et al., 2015].\n\nTo avoid over-fitting, a data augmentation procedure is performed by randomly flipping the images or randomly shifting the 2D bounding boxes with corresponding labels during the cooperative training. We use Adam [Kingma and Ba, 2015] for optimization with a batch size of 1 and a learning rate of 0.0001. In practice, we train the two networks cooperatively for ten epochs, which takes about 10 minutes for each epoch. We implement the proposed approach in PyTorch [Paszke et al., 2017].\n\n4 Evaluation\n\nWe evaluate our model on the SUN RGB-D dataset [Song et al., 2015], including 5050 test images and 10335 images in total. The SUN RGB-D dataset has 47 scene categories with high-quality 3D room layout, 3D camera pose, and 3D object bounding box annotations. It also provides benchmarks for various 3D scene understanding tasks. 
Here, we only use the RGB images as the input. Figure 3 shows some qualitative results. We discard the rooms with no detected 2D objects or invalid 3D room layout annotation, resulting in a total of 4783 training images and 4220 test images. More results can be found in the supplementary materials.\n\nWe evaluate our model on five tasks: i) 3D layout estimation, ii) 3D object detection, iii) 3D box estimation, iv) 3D camera pose estimation, and v) holistic scene understanding, all with the test images across all scene categories. For each task, we compare our cooperatively trained model with the settings in which we train GGN and LON individually without the proposed parametrization of 3D object bounding box or cooperative losses. In the individual training setting, LON directly estimates the 3D object centers in the 3D world coordinate.\n\n3D Layout Estimation. Since the SUN RGB-D dataset provides the ground truth of 3D layout with arbitrary numbers of polygon corners, we parametrize each 3D room layout as a 3D bounding box by taking the output of the Manhattan Box baseline from [Song et al., 2015] with eight layout corners, which serves as the ground truth. We compare the estimation of the proposed model with three previous methods: 3DGP [Choi et al., 2013], IM2CAD [Izadinia et al., 2017], and HoPR [Huang et al., 2018]. Following the evaluation protocol defined in [Song et al., 2015], we compute the average Intersection over Union (IoU) between the free space of the ground truth and the free space estimated by the proposed method. Table 1 shows our model outperforms HoPR by 2.0%. The results further show that there is an additional 1.5% performance improvement compared with individual training, demonstrating the efficacy of our method. Note that IM2CAD [Izadinia et al., 2017] manually selected 484 images from 794 test images of living rooms and bedrooms. For fair comparisons, we evaluate our method on the entire set of living rooms and bedrooms, outperforming IM2CAD by 2.1%.\n\nTable 1: Comparison of 3D room layout estimation and holistic scene understanding on SUN RGB-D.\n\nMethod | 3D Layout Estimation: IoU | Holistic Scene Understanding: Pg | Rg | Rr | IoU\n3DGP [Choi et al., 2013] | 19.2 | 2.1 | 0.7 | 0.6 | 13.9\nHoPR [Huang et al., 2018] | 54.9 | 37.7 | 23.0 | 18.3 | 40.7\nOurs (individual) | 55.4 | 36.8 | 22.4 | 20.1 | 39.6\nOurs (cooperative) | 56.9 | 49.3 | 29.7 | 28.5 | 42.9\n\nTable 2: Comparisons of 3D object detection on SUN RGB-D.\n\nMethod | bed | chair | sofa | table | desk | toilet | bin | sink | shelf | lamp | mAP\nChoi et al. [2013] | 5.62 | 2.31 | 3.24 | 1.23 | - | - | - | - | - | - | -\nHuang et al. [2018] | 58.29 | 13.56 | 28.37 | 12.12 | 4.79 | 16.50 | 0.63 | 2.18 | 1.29 | 2.41 | 14.01\nOurs (individual) | 53.08 | 7.7 | 27.04 | 22.80 | 5.51 | 28.07 | 0.54 | 5.08 | 2.58 | 0.01 | 15.24\nOurs (cooperative) | 63.58 | 17.12 | 41.22 | 26.21 | 9.55 | 58.55 | 10.19 | 5.34 | 3.01 | 1.75 | 23.65\n\n3D Object Detection. We evaluate our 3D object detection results using the metrics defined in [Song et al., 2015]. Specifically, the mean average precision (mAP) is computed using the 3D IoU between the predicted and the ground truth 3D bounding boxes. In the absence of depth, the threshold of IoU is adjusted from 0.25 (evaluation setting with depth image input) to 0.15 to determine whether two bounding boxes overlap. The 3D object detection results are reported in Table 2. We report 10 out of 30 object categories here, and the rest are reported in the supplementary materials. 
The results indicate our method outperforms HoPR by 9.64% on mAP and improves the individual training result by 8.41%. Compared with the model using individual training, the proposed cooperative model makes a significant improvement, especially on small objects such as bins and lamps. The 3D detection of small objects is highly sensitive to the estimation accuracy; oftentimes, prior approaches can hardly detect them at all. In contrast, benefiting from the parametrization method and 2D projection loss, the proposed cooperative model maintains the consistency between 3D and 2D, substantially reducing the estimation variance. Note that although IM2CAD also evaluates the 3D detection, they use a metric related to a specific distance threshold. For fair comparisons, we further conduct experiments on the subset of living rooms and bedrooms, using the same object categories with respect to this particular metric rather than an IoU threshold. We obtain an mAP of 78.8%, 4.2% higher than the results reported in IM2CAD.\n\n3D Box Estimation. The 3D object detection performance of our model is determined by both the 2D object detection and the 3D bounding box estimation. We first evaluate the accuracy of the 3D bounding box estimation, which reflects the ability to predict 3D boxes from 2D image patches. Instead of using mAP, 3D IoU is directly computed between the ground truth and the estimated 3D boxes for each object category. To evaluate the 2D-3D consistency, the estimated 3D boxes are projected back to 2D, and the 2D IoU is evaluated between the projected and detected 2D boxes. Results using the full model are reported in Table 3, which shows that the 3D estimation is still unsatisfactory, despite the efforts to maintain a good 2D-3D consistency. The underlying reason for the gap between 3D and 2D performance is the increased estimation dimension. Another possible reason is the lack of contextual relations among objects. 
Results for all object categories can be found in the supplementary materials.\n\nTable 3: 3D box estimation results on SUN RGB-D.\n\nMetric | bed | chair | sofa | table | desk | toilet | bin | sink | shelf | lamp | mIoU\nIoU (3D) | 33.1 | 15.7 | 28.0 | 20.8 | 15.6 | 25.1 | 13.2 | 9.9 | 6.9 | 5.9 | 17.4\nIoU (2D) | 75.7 | 68.1 | 74.4 | 71.2 | 70.1 | 72.5 | 69.7 | 59.3 | 62.1 | 63.8 | 68.7\n\nTable 4: Comparisons of 3D camera pose estimation on SUN RGB-D.\n\nMethod | Mean Absolute Error (degree): yaw | roll\nHedau et al. [2009] | 3.45 | 33.85\nHuang et al. [2018] | 3.12 | 7.60\nOurs (individual) | 2.48 | 4.56\nOurs (cooperative) | 2.19 | 3.28\n\nCamera Pose Estimation. We evaluate the camera pose by computing the mean absolute error of yaw and roll between the model estimation and ground truth. As shown in Table 4, compared with the traditional geometry-based method [Hedau et al., 2009] and the previous learning-based method [Huang et al., 2018], the proposed cooperative model gains a significant improvement. It also improves the individual training performance by 0.29 degrees on yaw and 1.28 degrees on roll.\n\nHolistic Scene Understanding. Following the definition introduced in [Song et al., 2015], we further estimate the holistic 3D scene including 3D objects and 3D room layout on SUN RGB-D. Note that the holistic scene understanding task defined in [Song et al., 2015] misses 3D camera pose estimation compared to the definition in this paper, as the results are evaluated in the world coordinate.\n\nUsing the metric proposed in [Song et al., 2015], we evaluate the geometric precision Pg, the geometric recall Rg, and the semantic recall Rr with the IoU threshold set to 0.15. We also evaluate the IoU between free space (3D voxels inside the room polygon but outside any object bounding box) of the ground truth and the estimation. Table 1 shows that we improve the previous approaches by a significant margin. 
Moreover, we further improve the individually trained results by 8.8% on geometric precision, 5.6% on geometric recall, 6.6% on semantic recall, and 3.7% on free space estimation. The performance gain of total scene understanding directly demonstrates the effectiveness of the proposed parametrization method and cooperative learning process.\n\n5 Discussion\n\nIn the experiment, the proposed method outperforms the state-of-the-art methods on four tasks. Moreover, our model runs at 2.5 fps (0.4s for 2D detection and 0.02s for 3D estimation) on a single Titan Xp GPU, while other models take significantly more time; e.g., [Izadinia et al., 2017] takes about 5 minutes to estimate one image. Here, we further analyze the effects of different components in the proposed cooperative model through a set of ablative analyses, hoping to shed some light on how parametrization and cooperative training help the model.\n\n5.1 Ablative Analysis\n\nWe compare four variants of our model with the full model trained using the total loss of Eq. (7):\n1. The model trained without the supervision on 3D object bounding box corners (w/o $\\mathcal{L}_{\\mathrm{3D}}$, S1).\n2. The model trained without the 2D supervision (w/o $\\mathcal{L}_{\\mathrm{PROJ}}$, S2).\n3. The model trained without the penalty of physical constraint (w/o $\\mathcal{L}_{\\mathrm{PHY}}$, S3).\n4. The model trained in an unsupervised fashion where we only use 2D supervision to estimate the 3D bounding boxes (w/o $\\mathcal{L}_{\\mathrm{3D}} + \\mathcal{L}_{\\mathrm{GGN}} + \\mathcal{L}_{\\mathrm{LON}}$, S4).\n\nAdditionally, we compare two variants of training settings: i) the model trained directly on SUN RGB-D without pre-training (S5), and ii) the model trained with 2D bounding boxes projected from ground truth 3D bounding boxes (S6). We conduct the ablative analysis over all the test images on the task of holistic scene understanding. We also compare the 3D mIoU and 2D mIoU of 3D box estimation. Table 5 summarizes the quantitative results.\n\nTable 5: The ablative analysis of the proposed cooperative model on SUN RGB-D. 
We evaluate\nholistic scene understanding, 3D mIoU and 2D mIoU of box estimation under different settings.\nSetting | Full | S1 | S2 | S3 | S4 | S5 | S6\nIoU | 43.3 | 42.8 | 42.0 | 41.7 | 35.9 | 40.2 | 43.0\nPg | 46.5 | 41.8 | 48.3 | 47.2 | 28.1 | 36.3 | 45.4\nRg | 28.0 | 25.3 | 30.1 | 27.5 | 17.1 | 22.1 | 29.7\nRr | 26.7 | 23.8 | 28.7 | 26.4 | 15.6 | 20.6 | 27.1\n3D mIoU | 17.4 | 14.4 | 18.2 | 17.3 | 9.8 | 12.7 | 17.0\n2D mIoU | 68.7 | 65.2 | 60.7 | 68.5 | 64.3 | 65.3 | 67.7\n\nFigure 4: Comparison with two variants of our model.\n\nExperiment S1 and S3 Without the supervision on 3D object bounding box corners or the physical\nconstraint, the performance on all the tasks decreases, since doing so removes the cooperation between the\ntwo networks.\nExperiment S2 The performance on 3D detection is improved without the projection loss,\nwhile the 2D mIoU decreases by 8.0%. As shown in Figure 4(b), a possible reason is that the 2D-3D\nconsistency LPROJ may hurt the performance on 3D accuracy compared with directly using 3D\nsupervision, while the 2D performance is largely improved thanks to the consistency.\nExperiment S4 Training entirely in an unsupervised fashion for 3D bounding box estimation\nwould fail, since each 2D pixel could correspond to an infinite number of 3D points. Therefore, we\nintegrate some common sense into the unsupervised training by restricting the object size to be close\nto the average size. As shown in Figure 4(c), we can still estimate the 3D bounding box quite well without 3D\nsupervision, although the orientations are usually not accurate.\nExperiment S5 and S6 S5 demonstrates the benefit of using a large amount of synthetic training\ndata for pre-training, and S6 indicates that we can gain almost the same performance even if there are no 2D bounding\nbox annotations.\n5.2 Related Work\nSingle Image Scene Reconstruction Existing 3D scene reconstruction approaches fall into two\nstreams. 
i) Generative approaches model the reconfigurable graph structures in generative probabilistic\nmodels [Zhao and Zhu, 2011, 2013, Choi et al., 2013, Lin et al., 2013, Guo and Hoiem, 2013, Zhang\net al., 2014, Zou et al., 2017, Huang et al., 2018]. ii) Discriminative approaches [Izadinia et al.,\n2017, Tulsiani et al., 2018, Song et al., 2017] reconstruct the 3D scene using the representation\nof 3D bounding boxes or voxels through direct estimations. Generative approaches are better at\nmodeling and inferring scenes with complex context, but they rely on sampling mechanisms and are\ncomputationally inefficient. Compared with prior discriminative approaches, our model focuses\non establishing cooperation among the scene modules.\nGap between 2D and 3D It is intuitive to constrain the 3D estimation to be consistent with 2D\nimages. Previous research on 3D shape completion and 3D object reconstruction explores this idea\nby imposing differentiable 2D-3D constraints between the shape and silhouettes [Wu et al., 2016,\nRezende et al., 2016, Yan et al., 2016, Tulsiani and Malik, 2015, Wu et al., 2017]. Mousavian et al.\n[2017] infer the 3D bounding boxes by matching the projected 2D corners in autonomous driving. In\nthe proposed cooperative model, we introduce the parametrization of the 3D bounding box, together\nwith a differentiable loss function to impose the consistency between 2D and 3D bounding boxes for\nindoor scene understanding.\n6 Conclusion\nUsing a single RGB image as the input, we propose an end-to-end model that recovers a 3D indoor\nscene in real-time, including the 3D room layout, camera pose, and object bounding boxes. A\nnovel parametrization of 3D bounding boxes and a 2D projection loss are introduced to enforce the\nconsistency between 2D and 3D. We also design differentiable cooperative losses which help to train\ntwo major modules cooperatively and efficiently. 
Our method shows significant improvements in\nvarious benchmarks while achieving high accuracy and efficiency.\nAcknowledgement: The work reported herein was supported by DARPA XAI grant N66001-17-2-4029,\nONR MURI grant N00014-16-1-2007, ARO grant W911NF-18-1-0296, and an NVIDIA\nGPU donation grant. We thank Prof. Hongjing Lu from the UCLA Psychology Department for useful\ndiscussions on the motivation of this work, and three anonymous reviewers for their constructive\ncomments.\n\nFigure 4 panels: (a) Full model, (b) Model without 2D supervision, (c) Model without 3D supervision.\n\nReferences\nNavaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms \u2013 improving object\ndetection with one line of code. In IEEE International Conference on Computer Vision (ICCV),\n2017.\n\nWongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. Understanding indoor scenes\nusing 3d geometric phrases. In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2013.\n\nJifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable\nconvolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2017.\n\nZhuo Deng and Longin Jan Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes\nfrom 2d ones in rgb-depth images. In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2017.\n\nRuiqi Guo and Derek Hoiem. Support surface prediction in indoor scenes. In IEEE International\nConference on Computer Vision (ICCV), 2013.\n\nAbhinav Gupta, Martial Hebert, Takeo Kanade, and David M Blei. Estimating spatial layout of rooms\nusing volumetric reasoning about objects and surfaces. In Conference on Neural Information\nProcessing Systems (NIPS), 2010.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\nVarsha Hedau, Derek Hoiem, and David Forsyth. Recovering the spatial layout of cluttered rooms.\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.\n\nSiyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic 3d\nscene parsing and reconstruction from a single rgb image. In European Conference on Computer\nVision (ECCV), 2018.\n\nHamid Izadinia, Qi Shan, and Steven M Seitz. Im2cad. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2017.\n\nRobert A Jacobs. What determines visual cue reliability? Trends in cognitive sciences, 6(8):345\u2013350,\n2002.\n\nWadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. Ssd-6d: Making\nrgb-based 3d detection and 6d pose estimation great again. In IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), 2017.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International\nConference on Learning Representations (ICLR), 2015.\n\nJames R Kubricht, Keith J Holyoak, and Hongjing Lu. Intuitive physics: Current research and\ncontroversies. Trends in cognitive sciences, 21(10):749\u2013759, 2017.\n\nAbhijit Kundu, Yin Li, and James M Rehg. 3d-rcnn: Instance-level 3d object reconstruction via\nrender-and-compare. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n2018.\n\nJean Lahoud and Bernard Ghanem. 2d-driven 3d object detection in rgb-d images. In IEEE\nInternational Conference on Computer Vision (ICCV), 2017.\n\nMichael S Landy, Laurence T Maloney, Elizabeth B Johnston, and Mark Young. Measurement and\nmodeling of depth cue combination: In defense of weak fusion. Vision research, 35(3):389\u2013412,\n1995.\n\nChen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. Roomnet:\nEnd-to-end room layout estimation. 
In IEEE International Conference on Computer Vision (ICCV),\n2017.\n\nDahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3d object detection\nwith rgbd cameras. In IEEE International Conference on Computer Vision (ICCV), 2013.\n\nChen Liu, Jimei Yang, Duygu Ceylan, Ersin Yumer, and Yasutaka Furukawa. Planenet: Piece-wise\nplanar reconstruction from a single rgb image. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2018.\n\nArsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Ko\u0161eck\u00e1. 3d bounding box estimation\nusing deep learning and geometry. In IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2017.\n\nAude Oliva. Gist of the scene. In Neurobiology of attention, pages 251\u2013256. Elsevier, 2005.\n\nAude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in\nrecognition. Progress in brain research, 155:23\u201336, 2006.\n\nAdam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\npytorch. In NIPS-W, 2017.\n\nMary C Potter. Meaning in visual search. Science, 187(4180):965\u2013966, 1975.\n\nMary C Potter. Short-term conceptual memory for pictures. Journal of experimental psychology:\nhuman learning and memory, 2(5):509, 1976.\n\nCharles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d\nobject detection from rgb-d data. In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2018.\n\nShaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object\ndetection with region proposal networks. In Conference on Neural Information Processing Systems\n(NIPS), 2015.\n\nDanilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and\nNicolas Heess. Unsupervised learning of 3d structure from images. 
In Conference on Neural\nInformation Processing Systems (NIPS), 2016.\n\nAlexander G Schwing, Sanja Fidler, Marc Pollefeys, and Raquel Urtasun. Box in the box: Joint 3d\nlayout and object reasoning from single images. In IEEE International Conference on Computer\nVision (ICCV), 2013.\n\nPhilippe G Schyns and Aude Oliva. From blobs to boundary edges: Evidence for time- and\nspatial-scale-dependent scene recognition. Psychological science, 5(4):195\u2013200, 1994.\n\nShuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. In European\nConference on Computer Vision (ECCV), 2014.\n\nShuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images.\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\nShuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding\nbenchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.\n\nShuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser.\nSemantic scene completion from a single depth image. In IEEE Conference on Computer Vision\nand Pattern Recognition (CVPR), 2017.\n\nSimon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system.\nNature, 381(6582):520, 1996.\n\nShubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. In IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), 2015.\n\nShubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A Efros, and Jitendra Malik. Factoring\nshape, pose, and layout from the 2d image of a 3d scene. In IEEE Conference on Computer Vision\nand Pattern Recognition (CVPR), 2018.\n\nJiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and\nWilliam T Freeman. Single image 3d interpreter network. 
In European Conference on Computer\nVision (ECCV), 2016.\n\nJiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. Marrnet:\n3d shape reconstruction via 2.5 d sketches. In Conference on Neural Information Processing\nSystems (NIPS), 2017.\n\nXinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets:\nLearning single-view 3d object reconstruction without 3d supervision. In Conference on Neural\nInformation Processing Systems (NIPS), 2016.\n\nYinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. Panocontext: A whole-room 3d context\nmodel for panoramic scene understanding. In European Conference on Computer Vision (ECCV),\n2014.\n\nYinda Zhang, Mingru Bai, Pushmeet Kohli, Shahram Izadi, and Jianxiong Xiao. Deepcontext:\nContext-encoding neural pathways for 3d holistic scene understanding. In IEEE International\nConference on Computer Vision (ICCV), 2017a.\n\nYinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas\nFunkhouser. Physically-based rendering for indoor scene understanding using convolutional neural\nnetworks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017b.\n\nYibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In Conference on\nNeural Information Processing Systems (NIPS), 2011.\n\nYibiao Zhao and Song-Chun Zhu. Scene parsing by integrating function, geometry and appearance\nmodels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.\n\nChuhang Zou, Zhizhong Li, and Derek Hoiem. Complete 3d scene parsing from single rgbd image.\narXiv preprint arXiv:1710.09490, 2017.\n\nChuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. Layoutnet: Reconstructing the 3d room\nlayout from a single rgb image. 
In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2018.", "award": [], "sourceid": 175, "authors": [{"given_name": "Siyuan", "family_name": "Huang", "institution": "University of California, Los Angeles"}, {"given_name": "Siyuan", "family_name": "Qi", "institution": "UCLA"}, {"given_name": "Yinxue", "family_name": "Xiao", "institution": "University of California, Los Angeles"}, {"given_name": "Yixin", "family_name": "Zhu", "institution": "University of California, Los Angeles"}, {"given_name": "Ying Nian", "family_name": "Wu", "institution": "University of California, Los Angeles"}, {"given_name": "Song-Chun", "family_name": "Zhu", "institution": "UCLA"}]}