{"title": "SURGE: Surface Regularized Geometry Estimation from a Single Image", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "This paper introduces an approach to regularize 2.5D surface normal and depth predictions at each pixel given a single input image. The approach infers and reasons about the underlying 3D planar surfaces depicted in the image to snap predicted normals and depths to inferred planar surfaces, all while maintaining fine detail within objects. Our approach comprises two components: (i) a fourstream convolutional neural network (CNN) where depths, surface normals, and likelihoods of planar region and planar boundary are predicted at each pixel, followed by (ii) a dense conditional random field (DCRF) that integrates the four predictions such that the normals and depths are compatible with each other and regularized by the planar region and planar boundary information. The DCRF is formulated such that gradients can be passed to the surface normal and depth CNNs via backpropagation. In addition, we propose new planar wise metrics to evaluate geometry consistency within planar surfaces, which are more tightly related to dependent 3D editing applications. We show that our regularization yields a 30% relative improvement in planar consistency on the NYU v2 dataset.", "full_text": "SURGE: Surface Regularized Geometry Estimation\n\nfrom a Single Image\n\nPeng Wang1 Xiaohui Shen2 Bryan Russell2 Scott Cohen2 Brian Price2 Alan Yuille3\n\n1University of California, Los Angeles\n\n2Adobe Research 3Johns Hopkins University\n\nAbstract\n\nThis paper introduces an approach to regularize 2.5D surface normal and depth\npredictions at each pixel given a single input image. The approach infers and\nreasons about the underlying 3D planar surfaces depicted in the image to snap\npredicted normals and depths to inferred planar surfaces, all while maintaining\n\ufb01ne detail within objects. 
Our approach comprises two components: (i) a four-stream convolutional neural network (CNN) where depths, surface normals, and likelihoods of planar region and planar boundary are predicted at each pixel, followed by (ii) a dense conditional random field (DCRF) that integrates the four predictions such that the normals and depths are compatible with each other and regularized by the planar region and planar boundary information. The DCRF is formulated such that gradients can be passed to the surface normal and depth CNNs via backpropagation. In addition, we propose new planar-wise metrics to evaluate geometry consistency within planar surfaces, which are more tightly related to dependent 3D editing applications. We show that our regularization yields a 30% relative improvement in planar consistency on the NYU v2 dataset [24].

1 Introduction
Recent efforts to estimate the 2.5D layout of a depicted scene from a single image, such as per-pixel depths and surface normals, have yielded high-quality outputs respecting both the global scene layout and fine object detail [2, 6, 7, 29]. Upon closer inspection, however, the predicted depths and normals may fail to be consistent with the underlying surface geometry. For example, consider the depth and normal predictions from the contemporary approach of Eigen and Fergus [6] shown in Figure 1 (b) (Before DCRF). Notice the significant distortion in the predicted depth corresponding to the depicted planar surfaces, such as the back wall and cabinet. We argue that such distortion arises from the fact that the 2.5D predictions (i) are made independently per pixel from appearance information alone, and (ii) do not explicitly take into account the underlying surface geometry. When 3D geometry has been used, e.g., [29], it often consists of a boxy room layout constraint, which may be too coarse and fail to account for local planar regions that do not adhere to the box constraint. 
Moreover, when multiple 2.5D predictions are made (e.g., depth and normals), they are not explicitly enforced to agree with each other.
To overcome the above issues, we introduce an approach to identify depicted 3D planar regions in the image along with their spatial extent, and to leverage such planar regions to regularize the depth and surface normal outputs. We formulate our approach as a four-stream convolutional neural network (CNN), followed by a dense conditional random field (DCRF). The four-stream CNN independently predicts at each pixel the surface normal, depth, and likelihoods of planar region and planar boundary. The four cues are integrated into a DCRF, which encourages the output depths and normals to align with the inferred 3D planar surfaces while maintaining fine detail within objects. Furthermore, the output depths and normals are explicitly encouraged to agree with each other.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Framework of the SURGE system. (a) We induce surface regularization in geometry estimation through a DCRF, and enable joint learning with the CNNs, which largely improves the visual quality (b).

We show that our DCRF is differentiable with respect to depth and surface normals, and allows back-propagation to the depth and normal CNNs during training. We demonstrate that the proposed approach shows relative improvement over the base CNNs for both depth and surface normal prediction on the NYU v2 dataset using the standard evaluation criteria, and is significantly better when evaluated using our proposed plane-wise criteria.
2 Related work
From a single image, traditional geometry estimation approaches rely on extracting visual primitives such as vanishing points and lines [10] or abstract the scene with major plane and box representations [22, 26]. 
Those methods can only obtain sparse geometry representations, and some of them\nrequire certain assumptions (e.g. Manhattan world).\nWith the advance of deep neural networks and their strong feature representation, dense geometry,\ni.e., pixel-wise depth and normal maps, can be readily estimated from a single image [7]. Long-range\ncontext and semantic cues are also incorporated in later works to re\ufb01ne the dense prediction by\ncombining the networks with conditional random \ufb01elds (CRF) [19, 20, 28, 29]. Most recently,\nEigen and Fergus [6] further integrate depth and normal estimation into a large multi-scale network\nstructure, which signi\ufb01cantly improves the geometry estimation accuracy. Nevertheless, the output\nof the networks still lacks regularization over planar surfaces due to the adoption of pixel-wise\nloss functions during network training, resulting in unsatisfactory experience in 3D image editing\napplications.\nFor inducing non-local regularization, DCRF has been commonly used in various computer vision\nproblems such as semantic segmentation [5, 32], optical \ufb02ow [16] and stereo [3]. However, the\nfeatures for the af\ufb01nity term are mostly simple ones such as color and location. In contrast, we have\ndesigned a unique planar surface af\ufb01nity term and a novel compatibility term to enable 3D planar\nregularization over geometry estimation.\nFinally, there is also a rich literature in 3D reconstruction from RGBD images [8, 12, 24, 25, 30],\nwhere planar surfaces are usually inferred. However, they all assume that the depth data have been\nacquired. To the best of our knowledge, we are the \ufb01rst to explore using planar surface information to\nregularize dense geometry estimation by only using the information of a single RGB image.\n3 Overview\nFig. 1 illustrates our approach. 
An input image is passed through a four-stream convolutional neural network (CNN) that predicts at each pixel a surface normal, depth value, and whether the pixel belongs to a planar surface or edge (i.e., edge separating different planar surfaces or semantic regions), along with their prediction confidences. We build on existing CNNs [6, 31] to produce the four maps. While the CNNs for surface normals and depths produce high-fidelity outputs, they do not explicitly enforce their predictions to agree with depicted planar regions. To address this, we propose a fully-connected dense conditional random field (DCRF) that reasons over the CNN outputs to regularize the surface normals and depths. The DCRF jointly aligns the surface normals and depths to individual planar surfaces derived from the edge and planar surface maps, all while preserving fine detail within objects. Our DCRF leverages the advantages of previous fully-connected CRFs [15] in terms of both its non-local connectivity, which allows propagation of information across an entire planar surface, and efficiency during inference. We present our DCRF formulation in Section 4, followed by our algorithm for joint learning and inference within a CNN in Section 5.

Figure 2: The orthogonal compatibility constraint inside the DCRF. We recover 3D points from the depth map and require the difference vector to be perpendicular to the normal predictions.

4 DCRF for Surface Regularized Geometry Estimation
In this section, we present our DCRF that incorporates plane and edge predictions for depth and surface normal regularization. Specifically, the field of variables we optimize over are depths, $D = \{d_i\}_{i=1}^{K}$, where $K$ is the number of pixels, and normals, $N = \{n_i\}_{i=1}^{K}$, where $n_i = [n_{ix}, n_{iy}, n_{iz}]^T$ indicates the 3D normal direction.
In addition, as stated in the overview (Sec. 3), we have four types of information from the CNN predictions, namely a predicted normal map $N^o = \{n_i^o\}_{i=1}^{K}$, a depth map $D^o = \{d_i^o\}_{i=1}^{K}$, a plane probability map $P^o$ and edge predictions $E^o$. Following the general form of DCRF [16], our problem can be formulated as,

$$\min_{N,D} \sum_{i} \psi_u(n_i, d_i \mid N^o, D^o) + \lambda \sum_{i,j,\, i \neq j} \psi_r(n_i, n_j, d_i, d_j \mid P^o, E^o) \quad \text{with } \|n_i\|_2 = 1, \qquad (1)$$

where $\psi_u(\cdot)$ is a unary term encouraging the optimized surface normals $n_i$ and depths $d_i$ to be close to the outputs $n_i^o$ and $d_i^o$ from the networks. $\psi_r(\cdot,\cdot)$ is a pairwise fully connected regularization term depending on the information from the plane map $P^o$ and edge map $E^o$, where we seek to encourage consistency of surface normals and depths within planar regions with the underlying depicted 3D planar surfaces. Also, we constrain the normal predictions to have unit length. Specifically, the definitions of the unary and pairwise terms in our model are presented as follows.

4.1 Unary terms
Motivated by Monte Carlo dropout [27], we notice that when forward propagating multiple times with dropout, the CNN predictions have different variations across different pixels, indicating the prediction uncertainty. Based on the prediction variance from the normal and depth networks, we are able to obtain pixel-wise confidence values $w_i^n$ and $w_i^d$ for the normal and depth predictions. We leverage such information in DCRF inference by trusting the predictions with higher confidence while regularizing more over ones with low confidence. 
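As an illustration, the variance-to-confidence computation could be sketched as below (a minimal NumPy sketch, assuming the stochastic forward passes are stacked into one array and using the exponential mapping $w_i = \exp(-v_i/\sigma^2)$ given later in the implementation details; `dropout_confidence` is a hypothetical helper name):

```python
import numpy as np

def dropout_confidence(samples, sigma=0.1):
    """Per-pixel confidence from Monte Carlo dropout samples.

    samples: (T, H, W) array of T stochastic forward passes of the same
    network with dropout kept on at test time. Pixels whose predictions
    vary a lot across passes get low confidence; stable pixels get ~1.
    """
    v = samples.var(axis=0)            # per-pixel prediction variance
    return np.exp(-v / sigma ** 2)     # w_i = exp(-v_i / sigma^2)

# Toy example: 10 "passes" over a 2x2 map, one unstable pixel.
rng = np.random.default_rng(0)
passes = np.tile(np.ones((2, 2)), (10, 1, 1))
passes[:, 1, 1] += rng.normal(0.0, 0.5, size=10)   # noisy pixel
w = dropout_confidence(passes)
```

In the DCRF, such maps would supply the weights $w_i^n$ and $w_i^d$: high-confidence pixels are trusted, while low-confidence pixels are regularized more by their planar neighbors.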
By integrating the confidence values, our unary term is defined as,

$$\psi_u(n_i, d_i \mid N^o, D^o) = \frac{1}{2} w_i^n \psi_n(n_i \mid n_i^o) + \frac{1}{2} w_i^d \psi_d(d_i \mid d_i^o), \qquad (2)$$

where $\psi_n(n_i \mid n_i^o) = 1 - n_i \cdot n_i^o$ is the cosine distance between the input and output surface normals, and $\psi_d(d_i \mid d_i^o) = (d_i - d_i^o)^2$ is the squared difference between the input and output depths.

4.2 Pairwise term for regularization
We follow the convention of DCRF with Gibbs energy [17] for the pairwise design, but also bring in the confidence value of each pixel as described in Sec. 4.1. Formally, it is defined as,

$$\psi_r(n_i, n_j, d_i, d_j \mid P^o, E^o) = \left( w_{i,j}^n \mu_n(n_i, n_j) + w_{i,j}^d \mu_d(d_i, d_j, n_i, n_j) \right) A_{i,j}(P^o, E^o), \quad \text{where } w_{i,j}^n = \frac{1}{2}(w_i^n + w_j^n),\ w_{i,j}^d = \frac{1}{2}(w_i^d + w_j^d). \qquad (3)$$

Here, $A_{i,j}$ is a pairwise planar affinity indicating whether pixel locations $i$ and $j$ belong to the same planar surface, derived from the inferred edge and planar surface maps. $\mu_n(\cdot)$ and $\mu_d(\cdot)$ regularize the output surface normals and depths to be aligned inside the underlying 3D plane. Here, we use simplified notations, i.e. $A_{i,j}$, $\mu_n(\cdot)$ and $\mu_d(\cdot)$, for the corresponding terms.
For the normal compatibility $\mu_n(\cdot)$, we use the same function as $\psi_n(\cdot)$ in Eqn. (2), which measures the cosine distance between $n_i$ and $n_j$. For depths, we design an orthogonal compatibility function $\mu_d(\cdot)$ which encourages the normals and depths of each adjacent pixel pair to be consistent and aligned within a 3D planar surface. Next we define $\mu_d(\cdot)$ and $A_{i,j}$.

Figure 3: Pairwise surface affinity from the plane and edge predictions with computed NCut features. We highlight the computed affinity w.r.t. 
pixel $i$ (red dot).
Orthogonal compatibility: In principle, when two pixels fall in the same plane, the vector connecting their corresponding 3D world coordinates should be perpendicular to their normal directions, as illustrated in Fig. 2. Formally, this orthogonality constraint can be formulated as,

$$\mu_d(d_i, d_j, n_i, n_j) = \frac{1}{2} \left( n_i \cdot (x_i - x_j) \right)^2 + \frac{1}{2} \left( n_j \cdot (x_i - x_j) \right)^2, \quad \text{with } x_i = d_i K^{-1} p_i. \qquad (4)$$

Here $x_i$ is the 3D world coordinate back-projected from the 2D pixel coordinate $p_i$ (written in homogeneous coordinates), given the camera calibration matrix $K$ and depth value $d_i$. This compatibility encourages consistency between depths and normals.
Pairwise planar affinity: As noted in Eqn. (3), the planar affinity is used to determine whether pixels $i$ and $j$ belong to the same planar surface, based on the plane and edge information. Here $P^o$ helps to check whether two pixels are both inside planar regions, and $E^o$ helps to determine whether the two pixels belong to the same planar surface. For efficiency, we chose the form of a Gaussian bilateral affinity to represent such information, since it has been successfully adopted by many previous works with efficient inference, e.g. in discrete label space for semantic segmentation [5] or in continuous label space for edge-aware smoothing [3, 16]. Specifically, following the form of bilateral filters, our planar surface affinity is defined as,

$$A_{i,j}(P^o, E^o) = p_i p_j \left( \omega_1 \kappa(f_i, f_j; \theta_\alpha)\, \kappa(c_i, c_j; \theta_\beta) + \omega_2 \kappa(c_i, c_j; \theta_\gamma) \right), \qquad (5)$$

where $\kappa(z_i, z_j; \theta) = \exp\left(-\frac{1}{2\theta^2} \|z_i - z_j\|^2\right)$ is a Gaussian RBF kernel. $p_i$ is the predicted value from the planar map $P^o$ at pixel $i$, so the factor $p_i p_j$ indicates that the regularization is activated when both $i$ and $j$ are inside planar regions with high probability. 
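As a concrete illustration of the orthogonal compatibility in Eqn. (4), the sketch below back-projects two pixels with $x_i = d_i K^{-1} p_i$ and evaluates the penalty; the intrinsics in `K` are hypothetical values, not the NYU calibration:

```python
import numpy as np

K = np.array([[500., 0., 320.],       # hypothetical camera intrinsics
              [0., 500., 240.],
              [0., 0., 1.]])
K_inv = np.linalg.inv(K)

def backproject(pixel, depth):
    """3D point x = d * K^{-1} p for homogeneous pixel p = (u, v, 1)."""
    p = np.array([pixel[0], pixel[1], 1.0])
    return depth * (K_inv @ p)

def mu_d(d_i, d_j, n_i, n_j, pix_i, pix_j):
    """Orthogonal compatibility of Eqn. (4): penalize any component of
    the 3D displacement between two pixels along either normal."""
    diff = backproject(pix_i, d_i) - backproject(pix_j, d_j)
    return 0.5 * (n_i @ diff) ** 2 + 0.5 * (n_j @ diff) ** 2

n = np.array([0., 0., 1.])            # fronto-parallel plane normal
flat = mu_d(2.0, 2.0, n, n, (100, 120), (400, 300))  # both on z = 2
bent = mu_d(2.0, 2.3, n, n, (100, 120), (400, 300))  # depth kink
```

For two coplanar pixels the displacement lies inside the plane, so the penalty vanishes, while the depth kink produces a positive penalty that pushes depths and normals back into agreement.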
$f_i$ is the appearance feature derived from the edge map $E^o$, and $c_i$ is the 2D coordinate of pixel $i$ in the image. $\omega_1, \omega_2, \theta_\alpha, \theta_\beta, \theta_\gamma$ are parameters.
To transform the pairwise similarity derived from the edge map into the feature representation $f$ for efficient computation, we borrow the idea of Normalized Cuts (NCut) for segmentation [14, 23]: we first generate an affinity matrix between pixels using intervening contours [23], and then perform a normalized cut. We select the top 6 resulting eigenvectors as our feature $f$. A transformation from the edge map to the planar affinity using the eigenvectors is shown in Fig. 3. As can be seen from the affinity map, the NCut features are effective at determining whether two pixels lie in the same planar surface, where the regularization can then be performed.
5 Optimization
Given the formulation in Sec. 4, we first discuss the fast inference implementation for the DCRF, and then present the algorithm for joint training with CNNs through back-propagation.
5.1 Inference
To optimize the objective function defined in Eqn. (1), we use mean-field approximation for fast inference, as used in the optimization of DCRF [15]. In addition, we chose to use coordinate descent to sequentially optimize normals and depths. When optimizing normals, for simplicity and efficiency, we do not consider the term $\mu_d(\cdot)$ in Eqn. (3), yielding the update for pixel $i$ at iteration $t$ as,

$$n_i^{(t)} \leftarrow \frac{1}{2} w_i^n n_i^o + \frac{\lambda}{2} \sum_{j,\, j \neq i} w_j^n n_j^{(t-1)} A_{i,j}, \qquad n_i^{(t)} \leftarrow n_i^{(t)} / \|n_i^{(t)}\|_2, \qquad (6)$$

which is equivalent to first performing a dense bilateral filtering [4] with our pairwise planar affinity term $A_{i,j}$ on the predicted normal map, and then applying L2 normalization.
Given the optimized normal information, we further optimize the depth values. 
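On a toy fully-connected graph, one normal update step of this kind can be sketched as follows (the dense bilateral filtering is written here as an explicit affinity matrix `A`, which is only feasible for a handful of pixels; the efficient permutohedral filtering of [4] is what makes this tractable at image scale):

```python
import numpy as np

def normal_mean_field_step(n_prev, n_obs, w, A, lam=2.0):
    """One mean-field update in the spirit of Eqn. (6): blend each
    pixel's predicted normal with an affinity-weighted average of its
    neighbors' current estimates, then project back to unit length."""
    blended = 0.5 * w[:, None] * n_obs \
        + 0.5 * lam * (A @ (w[:, None] * n_prev))
    return blended / np.linalg.norm(blended, axis=1, keepdims=True)

# Toy case: 3 pixels on one plane, one noisy network prediction.
n_obs = np.array([[0., 0., 1.], [0., 0., 1.], [0.6, 0., 0.8]])
w = np.ones(3)                        # equal confidences
A = np.ones((3, 3)) - np.eye(3)       # all pairs share one plane
n = n_obs.copy()
for _ in range(5):                    # a few iterations, as in Sec. 5.1
    n = normal_mean_field_step(n, n_obs, w, A)
```

After a few iterations the noisy normal is snapped toward the plane's consensus direction while all normals stay unit length.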
Similar to normals, after performing mean-field approximation, the inferred update equation for depth at iteration $t$ is,

$$d_i^{(t)} \leftarrow \frac{1}{\nu_i} \left( w_i^d d_i^o + \lambda (n_i \cdot p_i) \sum_{j,\, j \neq i} A_{i,j} w_j^d d_j^{(t-1)} (n_j \cdot p_j) \right), \qquad (7)$$

where $\nu_i = w_i^d + \lambda (n_i \cdot p_i) \left( p_i \cdot \sum_{j,\, j \neq i} A_{i,j} w_j^d n_j \right)$. Since the graph is densely connected, previous work [16] indicates that only a few (<10) iterations are needed to achieve reasonable performance. In practice we found that 5 iterations for normal inference and 2 iterations for depth inference yielded reasonable results.
5.2 Joint training of CNN and DCRF
We further implement the DCRF inference as a trainable layer as in [32] by treating the inference as a feedforward process, to enable joint training together with the normal and depth neural networks. This allows the planar surface information to be back-propagated to the neural networks to further refine their outputs. We describe the gradients back-propagated to the two networks respectively.
Back-propagation to the normal network. Suppose the gradient of the normal passed from the upper layer after the DCRF for pixel $i$ is $\nabla f(n_i)$, which is a 3x1 vector. We first back-propagate it through the L2 normalization using $\nabla L2(n_i) = \left( I/\|n_i\| - n_i n_i^T/\|n_i\|^3 \right) \nabla f(n_i)$, and then back-propagate through the mean-field approximation in Eqn. (6) as,

$$\frac{\partial L(N)}{\partial n_i} = \frac{\nabla L2(n_i)}{2} + \frac{\lambda}{2} \sum_{j,\, j \neq i} A_{j,i} \nabla L2(n_j), \qquad (8)$$

where $L(N)$ is the loss on the normal predictions after the DCRF and $I$ is the identity matrix.
Back-propagation to the depth network. Similarly for depth, suppose the gradient from the upper layer is $\nabla f(d_i)$; the depth gradient for back-propagation through Eqn. (7) can be inferred as,

$$\frac{\partial L(D)}{\partial d_i} = \frac{1}{\nu_i} \nabla f(d_i) + \lambda (n_i \cdot p_i) \sum_{j,\, j \neq i} \frac{1}{\nu_j} A_{j,i} (n_j \cdot p_j) \nabla f(d_j), \qquad (9)$$

where $L(D)$ is the loss on the depth predictions after the DCRF.
Note that during back-propagation for both surface normals and depths we drop the confidences $w$, since using them during training would make the process very complicated and inefficient. We adopt the same surface normal and depth loss functions as in [6] during joint training. It is possible to also back-propagate the gradients of the depth values to the normal network via the surface normal and depth compatibility in Eqn. (4). However, this involves the depth values from all the pixels within the same plane, which may be intractable and cause difficulty during joint learning. We therefore chose not to back-propagate through the compatibility in our current implementation and leave it to future work.
6 Implementation details for DCRF
To predict the input surface normals and depths, we build on the publicly-available implementation from Eigen and Fergus [6], which is at or near the state of the art for both tasks. We compute prediction confidences for the surface normals and depths using Monte Carlo dropout [27]. Specifically, we forward propagate through the network 10 times with dropout during testing, and compute the prediction variance $v_i$ at each pixel. Predictions with larger variance $v_i$ are considered less stable, so we set the confidence as $w_i = \exp(-v_i/\sigma^2)$. We empirically set $\sigma_n = 0.1$ for normal prediction and $\sigma_d = 0.15$ for depth prediction to produce reasonable confidence values.
For predicting the plane map $P^o$, we adopt a semantic segmentation network structure similar to the Deeplab [5] network but with multi-scale output as in the FCN [21]. 
The training is formulated as a pixel-wise two-class classification problem (planar vs. non-planar). The output of the network is thus a plane probability map $P^o$, where $p_i$ indicates the probability of pixel $i$ belonging to a planar surface. The edge map $E^o$ indicates the plane boundaries. During training, the ground-truth edges are extracted from the corresponding ground-truth depth and normal maps, and refined by semantic annotations when available (see Fig. 4 for an example). We then adopt the recent Holistically-Nested Edge Detection (HED) network [31] for training. In addition, we augment the network by adding the predicted depth and normal maps as an additional 4-channel input to improve recall, which is very important for our regularization, since missing edges could mistakenly merge two planes and propagate errors during message passing.
For the surface bilateral filter in Eqn. (5), we set the parameters $\theta_\alpha = 0.1$, $\theta_\beta = 50$, $\theta_\gamma = 3$, $\omega_1 = 1$, $\omega_2 = 0.3$, and set $\lambda = 2$ in Eqn. (1) through a grid search over a validation set from [9]. The four types of inputs to the DCRF are aligned and resized to 294x218 to match the network output of [6]. During the joint training of the DCRF and CNNs, we fix the DCRF parameters and fine-tune the networks based on the weights pre-trained from [6], with the 795 training images, and use the same loss functions and learning rates as in their depth and normal networks respectively.

Figure 4: Four types of ground-truth from the NYU dataset that are used in our algorithm.

Due to limited space, the detailed edge and plane network structures, the learning and inference times, and visualizations of the confidence values are presented in the supplementary materials.
7 Experiments
We perform all our experiments on the NYU v2 dataset [24]. 
It contains 1449 images of size 640x480, split into 795 training images and 654 testing images. Each image has an aligned ground-truth depth map and a manually annotated semantic category map. In addition, we use the ground-truth surface normals generated by [18] from the depth maps. We further use the official NYU toolbox1 to extract planar surfaces from the ground-truth depth and refine them with the semantic annotations, from which a binary ground-truth plane map and an edge map are obtained. The details of generating the plane and edge ground-truth are elaborated in the supplementary materials. Fig. 4 shows the produced four types of ground-truth maps used for our learning and evaluation.
We implemented all our algorithms based on Caffe [13], including DCRF inference and learning, which are adapted from the implementations in [1, 32].
Evaluation setup. In the evaluation, we first compare the normals and depths generated by different baselines and components over the ground truth planar regions, since these are the regions we aim to improve, and they are the most important for 3D editing applications. We evaluate over the valid 561x427 area following the convention in [18, 20]. We also perform an evaluation over the ground truth edge area, showing that our results better preserve geometric details. Finally, we show the improvement achieved by our algorithm over the entire image region.
We compare our results against the recent work of Eigen et al. [6], since it is at or near the state of the art for both depth and normal estimation. In practice, we use their published results and models for comparison. In addition, we implemented a baseline method for hard planar regularization, in which planar surfaces are explicitly extracted from the network predictions. The normal and depth values within each plane are then used to fit the plane parameters, from which the regularized normal and depth values are obtained. 
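A least-squares plane fit of the kind this hard baseline needs can be sketched as follows (an assumed realization via SVD, not the authors' exact procedure; `fit_plane` is a hypothetical helper name):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a set of back-projected 3D points.

    Returns (normal, offset) with the plane n . x = offset; the normal
    is the right singular vector of the centered points with the
    smallest singular value (direction of least variance)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, float(normal @ centroid)

# Noiseless points lying on the plane z = 3.
pts = np.array([[0., 0., 3.], [1., 0., 3.], [0., 1., 3.], [2., 2., 3.]])
normal, offset = fit_plane(pts)
```

Snapping would then replace each pixel's normal inside the region by `normal` and re-derive its depth from the plane equation; this enforces exact planarity, which is what makes the baseline "hard".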
We refer to this baseline as "Post-Proc.". For normal prediction, we implemented another baseline in which a basic bilateral filter based on the RGB image is used to smooth the normal map.
In terms of evaluation criteria, we first adopt the pixel-wise criteria commonly used by previous works [6, 28]. However, as mentioned in [11], such metrics mainly evaluate pixel-wise depth and normal offsets, but do not reflect well the quality of the reconstructed structures over edges and planar surfaces. Thus, we further propose plane-wise metrics that evaluate the consistency of the predictions inside a ground truth planar region. In the following, we first present evaluations for normal prediction, and then report the results of depth estimation.
Surface normal criteria. For pixel-wise evaluation, we use the same metrics as in [6]. For plane-wise evaluation, given a set of ground truth planar regions $\{P_j^*\}_{j=1}^{N_P}$, we propose two metrics to evaluate the consistency of the normal predictions within the planar regions:
1. Degree variation (var.): It measures the overall planarity inside a plane, and is defined as $\frac{1}{N_P} \sum_j \frac{1}{|P_j^*|} \sum_{i \in P_j^*} \delta(n_i, \bar{n}_j)$, where $\delta(n_a, n_b) = \arccos(n_a \cdot n_b)$ is the angular difference between two normals, and $\bar{n}_j$ is the mean of the predicted normals inside $P_j^*$.
2. First-order degree gradient (grad.): It measures the smoothness of the normal transition inside a planar region. Formally, it is defined as $\frac{1}{N_P} \sum_j \frac{1}{|P_j^*|} \sum_{i \in P_j^*} \left( \delta(n_i, n_{h_i}) + \delta(n_i, n_{v_i}) \right)$, where $n_{h_i}, n_{v_i}$ are the normals of the right and bottom neighbor pixels of $i$.

1 http://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

Evaluation over the planar regions (mean, median, var., grad.: lower is better; 11.25°, 22.5°, 30°: higher is better):

| Method | mean | median | 11.25° | 22.5° | 30° | var. | grad. |
|---|---|---|---|---|---|---|---|
| Eigen-VGG [6] | 14.5425 | 8.9735 | 59.00 | 80.85 | 87.38 | 9.1534 | 1.1112 |
| RGB-Bilateral | 14.4665 | 8.9439 | 59.12 | 80.86 | 87.41 | 8.6454 | 1.1735 |
| Post-Proc. | 14.8154 | 8.6971 | 59.85 | 80.52 | 86.67 | 7.2753 | 0.9882 |
| Eigen-VGG (JT) | 14.4978 | 8.9371 | 59.12 | 80.90 | 87.43 | 8.9601 | 1.0795 |
| DCRF | 14.1934 | 8.8697 | 59.27 | 81.08 | 87.77 | 6.9688 | 0.7441 |
| DCRF (JT) | 14.2055 | 8.8696 | 59.34 | 81.13 | 87.78 | 6.8866 | 0.7302 |
| DCRF-conf | 13.9732 | 8.5320 | 60.89 | 81.87 | 88.09 | 6.8212 | 0.7407 |
| DCRF-conf (JT) | 13.9763 | 8.2535 | 62.20 | 82.35 | 88.08 | 6.3939 | 0.6858 |
| Oracle | 13.5804 | 8.1671 | 62.83 | 83.16 | 88.85 | 4.9199 | 0.5923 |
| Edge: Eigen-VGG [6] | 23.4141 | 18.3288 | 30.90 | 58.91 | 71.43 | - | - |
| Edge: DCRF-conf (JT) | 23.4694 | 17.6804 | 33.63 | 59.53 | 71.03 | - | - |
| Image: Eigen-VGG [6] | 20.9322 | 13.2214 | 44.43 | 67.25 | 75.83 | - | - |
| Image: DCRF-conf (JT) | 20.6093 | 12.1704 | 47.29 | 68.92 | 76.64 | - | - |

Table 1: Normal accuracy comparison over the NYU v2 dataset. We compare our final results (DCRF-conf (JT)) against various baselines over ground truth planar regions in the upper part, where JT means joint training of CNN and DCRF as presented in Sec. 5.2. We list additional comparisons over the edge and full image regions in the lower part.

Evaluation on surface normal estimation. In the upper part of Tab. 1, we show the comparison results. The first line, i.e. Eigen-VGG, is the result from [6] with the VGG net, which serves as our baseline. 
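The plane-wise degree-variation metric defined above can be sketched as follows (assuming unit predicted normals and boolean masks for the ground-truth planar regions; `degree_variation` is a hypothetical helper name):

```python
import numpy as np

def degree_variation(normals, plane_masks):
    """'var.' metric sketch: per ground-truth planar region, the mean
    angular deviation (degrees) of predicted normals from the region's
    mean normal, averaged over all regions."""
    errs = []
    for mask in plane_masks:
        n = normals[mask]                      # (M, 3) unit normals
        mean_n = n.mean(axis=0)
        mean_n /= np.linalg.norm(mean_n)
        cos = np.clip(n @ mean_n, -1.0, 1.0)
        errs.append(np.degrees(np.arccos(cos)).mean())
    return float(np.mean(errs))

flat = np.tile([0., 0., 1.], (4, 1))            # perfectly planar
tilted = np.vstack([flat[:3], [[0.6, 0., 0.8]]])
mask = np.ones(4, dtype=bool)
v0 = degree_variation(flat, [mask])
v1 = degree_variation(tilted, [mask])
```

A perfectly planar prediction scores zero, and any within-plane disagreement increases the metric, which is exactly the consistency this paper's regularization targets.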
The simple RGB-bilateral filtering can only slightly improve the network output, since it does not incorporate any planar surface information during the smoothing. The hard regularization over planar regions ("Post-Proc.") can improve the plane-wise consistency since hard constraints are enforced in each plane, but it also introduces strong artifacts and suffers a significant decrease in the pixel-wise accuracy metrics. Our "DCRF" brings improvement in both pixel-wise and plane-wise metrics, while integrating the network prediction confidence further makes the DCRF inference achieve much better results. Specifically, using "DCRF-conf", the plane-wise error metric var. drops from the 9.15 produced by the network to 6.8. This demonstrates that our non-local planar surface regularization does help the predictions, especially the consistency inside planar regions.
We also show the benefits of the joint training of DCRF and CNN. "Eigen-VGG (JT)" denotes the output of the CNN after joint training, which shows better results than the original network. This indicates that regularization using the DCRF during training also improves the network. By using the jointly trained CNN and DCRF ("DCRF (JT)"), we observe additional improvement over "DCRF". Finally, by combining the confidence from the jointly trained CNN, our final outputs ("DCRF-conf (JT)") achieve the best results among all the compared methods. In addition, we also use the ground-truth plane and edge maps to regularize the normal output ("Oracle") to obtain an upper bound for when the planar surface information is perfect. We can see that our final results are in fact quite close to "Oracle", demonstrating the high quality of our plane and edge predictions.
In the bottom part of Tab. 1, we show the evaluation over edge areas (rows marked "Edge") as well as over the entire images (marked "Image"). The edge areas are obtained by dilating the ground truth edges by 10 pixels. 
Compared with the baseline, although our results slightly drop in "mean" and 30°, they are much better in "median" and 11.25°. This shows that by preserving edge information, our geometry predictions are more accurate around boundaries. When evaluated over the entire images, our results outperform the baseline in all the metrics, showing that our algorithm not only largely improves the predictions in planar regions, but also keeps the good predictions within non-planar regions.
Depth criteria. When evaluating depths, we similarly first adopt the traditional pixel-wise depth metrics defined in [7, 28]. We refer readers to the original papers for the detailed definitions due to limited space. We then also propose plane-wise metrics. Specifically, we generate normals from the predicted depths using the NYU toolbox [24], and evaluate the degree variation (var.) of the generated normals within each plane.

Evaluation over the planar regions (Rel, Rel(sqr), log10, RMSElin, RMSElog, var.: lower is better; 1.25, 1.25², 1.25³: higher is better):

| Method | Rel | Rel(sqr) | log10 | RMSElin | RMSElog | 1.25 | 1.25² | 1.25³ | var. |
|---|---|---|---|---|---|---|---|---|---|
| Eigen-VGG [6] | 0.1441 | 0.0892 | 0.0635 | 0.5083 | 0.1968 | 78.7055 | 96.3516 | 99.3291 | 16.4460 |
| Post-Proc. | 0.1470 | 0.0937 | 0.0644 | 0.5200 | 0.2003 | 78.2290 | 96.1145 | 99.2258 | 11.1489 |
| Eigen-VGG (JT) | 0.1427 | 0.0881 | 0.0612 | 0.4900 | 0.1930 | 80.1163 | 96.4421 | 99.3029 | 17.5251 |
| DCRF | 0.1438 | 0.0893 | 0.0634 | 0.5100 | 0.1965 | 78.7311 | 96.3739 | 99.3321 | 12.0424 |
| DCRF (JT) | 0.1424 | 0.0874 | 0.0610 | 0.4873 | 0.1920 | 80.1800 | 96.5481 | 99.3326 | 10.5836 |
| DCRF-conf | 0.1437 | 0.0881 | 0.0631 | 0.5027 | 0.1957 | 78.9070 | 96.4336 | 99.3395 | 12.0420 |
| DCRF-conf (JT) | 0.1423 | 0.0874 | 0.0610 | 0.4874 | 0.1920 | 80.2453 | 96.5612 | 99.3229 | 10.5746 |
| Oracle | 0.1431 | 0.0879 | 0.0629 | 0.5043 | 0.1950 | 78.9777 | 96.4297 | 99.3605 | 8.0522 |
| Edge: Eigen-VGG [6] | 0.1369 | 0.1645 | 0.0735 | 0.7268 | 0.2275 | 72.9491 | 94.2890 | 98.6539 | - |
| Edge: DCRF-conf (JT) | 0.1328 | 0.1624 | 0.0707 | 0.6965 | 0.2214 | 74.7198 | 94.6927 | 98.7048 | - |
| Image: Eigen-VGG [6] | 0.1213 | 0.1583 | 0.0671 | 0.6388 | 0.2145 | 77.0536 | 95.0456 | 98.8140 | - |
| Image: DCRF-conf (JT) | 0.1179 | 0.1555 | 0.0672 | 0.6430 | 0.2139 | 76.8466 | 95.0946 | 98.8668 | - |

Table 2: Depth accuracy comparison over the NYU v2 dataset.

Evaluation on depth prediction. Similarly, we first report the results on planar regions in the upper part of Tab. 2, and then present the evaluation on edge areas and over the entire image. We observe similar trends across the different methods as in the normal evaluation, demonstrating the effectiveness of the proposed approach in both tasks.
Qualitative results. We also visually show an example to illustrate the improvements brought by our method. In Fig. 5, we visualize the predictions in 3D space, in which the reconstructed structure can be better observed. As can be seen, the results from the network output of [6] have many distortions in planar surfaces, and the transition is blurred across plane boundaries, yielding unsatisfactory quality. Our results largely alleviate such problems by incorporating plane and edge regularization, yielding visually much more satisfying results. Due to space limitations, we include more examples in the supplementary materials.
8 Conclusion
In this paper, we introduce SURGE, a system that induces surface regularization in depth and normal estimation from a single image. 
Specifically, we formulate the problem as a DCRF that embeds surface affinity and depth-normal compatibility into the regularization. Last but not least, our DCRF can be jointly trained with the CNNs. In our experiments, we achieve promising results and show that such regularization largely improves the quality of the estimated depths and surface normals over planar regions, which is important for 3D editing applications.
Acknowledgment. This work is supported by the NSF Expedition for Visual Cortex on Silicon NSF award CCF-1317376 and the Army Research Office ARO 62250-CS.

Figure 5: Visual comparison between the network outputs of Eigen et al. [6] and our results in a 3D view (panels: input image, followed by normals and depths from [6], ours, and ground truth). We project the RGB and normal colors onto the 3D points (best viewed in color).

References
[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pages 753–762. Wiley Online Library, 2010.
[2] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2D-3D alignment via surface normal prediction. In CVPR, 2016.
[3] J. T. Barron, A. Adams, Y. Shih, and C. Hernández. Fast bilateral-space stereo for synthetic defocus. CVPR, 2015.
[4] J. T. Barron and B. Poole. The fast bilateral solver. CoRR, 2015.
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[8] R. Guo and D. Hoiem. Support surface prediction in indoor scenes.
In ICCV, 2013.\n[9] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object\n\ndetection and segmentation. In ECCV. 2014.\n\n[10] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. In ICCV, 2007.\n[11] K. Honauer, L. Maier-Hein, and D. Kondermann. The hci stereo metrics: Geometry-aware performance\n\nanalysis of stereo algorithms. In ICCV, 2015.\n\n[12] S. Ikehata, H. Yang, and Y. Furukawa. Structured indoor modeling. In ICCV, 2015.\n[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe:\n\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n[14] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. ICLR, 2016.\n[15] P. Kr\u00e4henb\u00fchl and V. Koltun. Ef\ufb01cient inference in fully connected crfs with gaussian edge potentials.\n\nNIPS, 2012.\n\n[16] P. Kr\u00e4henb\u00fchl and V. Koltun. Ef\ufb01cient nonlocal regularization for optical \ufb02ow. In ECCV, 2012.\n[17] P. Kr\u00e4henb\u00fchl and V. Koltun. Parameter learning and convergent inference for dense random \ufb01elds. In\n\nICML, 2013.\n\n[18] L. Ladicky, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In D. J.\n\nFleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, 2014.\n\n[19] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from\n\nmonocular images using regression on deep features and hierarchical crfs. In CVPR, June 2015.\n\n[20] F. Liu, C. Shen, and G. Lin. Deep convolutional neural \ufb01elds for depth estimation from a single image. In\n\nCVPR, June 2015.\n\n[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR,\n\npages 3431\u20133440, 2015.\n\n[22] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. 
Box in the box: Joint 3d layout and object\n\nreasoning from single images. In ICCV, pages 353\u2013360. IEEE Computer Society, 2013.\n\n[23] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, 22(8):888\u2013905, 2000.\n[24] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd\n\nimages. In ECCV (5), pages 746\u2013760, 2012.\n\n[25] S. Song, S. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In\n\nCVPR, 2015.\n\n[26] F. Srajer, A. G. Schwing, M. Pollefeys, and T. Pajdla. Match box: Indoor image matching via box-like\nscene estimation. In 2nd International Conference on 3D Vision, 3DV 2014, Tokyo, Japan, December 8-11,\n2014, Volume 1, 2014.\n\n[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to\n\nprevent neural networks from over\ufb01tting. The Journal of Machine Learning Research, 15(1), 2014.\n\n[28] P. Wang, X. Shen, Z. Lin, S. Cohen, B. L. Price, and A. L. Yuille. Towards uni\ufb01ed depth and semantic\n\nprediction from a single image. In CVPR, 2015.\n\n[29] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In CVPR,\n\n2015.\n\n[30] J. Xiao and Y. Furukawa. Reconstructing the world\u2019s museums. In ECCV, 2012.\n[31] S. Xie and Z. Tu. Holistically-nested edge detection. ICCV, 2015.\n[32] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional\nrandom \ufb01elds as recurrent neural networks. 
In International Conference on Computer Vision (ICCV), 2015.