{"title": "Shape Recipes: Scene Representations that Refer to the Image", "book": "Advances in Neural Information Processing Systems", "page_first": 1359, "page_last": 1366, "abstract": null, "full_text": "Shape Recipes: Scene Representations that Refer\n\nto the Image\n\nWilliam T. Freeman and Antonio Torralba\n\nArti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nfwtf, torralbag@ai.mit.edu\n\nAbstract\n\nThe goal of low-level vision is to estimate an underlying scene, given\nan observed image. Real-world scenes (eg, albedos or shapes) can be\nvery complex, conventionally requiring high dimensional representations\nwhich are hard to estimate and store. We propose a low-dimensional rep-\nresentation, called a scene recipe, that relies on the image itself to de-\nscribe the complex scene con\ufb01gurations. Shape recipes are an example:\nthese are the regression coef\ufb01cients that predict the bandpassed shape\nfrom image data. We describe the bene\ufb01ts of this representation, and\nshow two uses illustrating their properties: (1) we improve stereo shape\nestimates by learning shape recipes at low resolution and applying them\nat full resolution; (2) Shape recipes implicitly contain information about\nlighting and materials and we use them for material segmentation.\n\n1 Introduction\n\nFrom images, we want to estimate various low-level scene properties such as shape, ma-\nterial, albedo or motion. For such an estimation task, the representation of the quantities\nto be estimated can be critical. Typically, these scene properties might be represented as a\nbitmap (eg [14]) or as a series expansion in a basis set of surface deformations (eg [10]).\nTo represent accurately the details of real-world shapes and textures requires either full-\nresolution images or very high order series expansions. Estimating such high dimensional\nquantities is intrinsically dif\ufb01cult [2]. 
Strong priors [14] are often needed, which can give unrealistic shape reconstructions.\n\nHere we propose a new scene representation with appealing qualities for estimation. The approach we propose is to let the image itself bear as much of the representational burden as possible. We assume that the image is always available, and we describe the underlying scene in reference to the image. The scene representation is a set of rules for transforming from the local image information to the desired scene quantities. We call this representation a scene recipe: a simple function for transforming local image data to local scene data. The computer doesn\u2019t have to represent every curve of an intricate shape; the image does that for us, and the computer just stores the rules for transforming from image to scene. In this paper, we focus on reconstructing the shapes that created the observed image, deriving shape recipes. The particular recipes we study here are regression coefficients for transforming bandpassed image data into bandpassed shape data.\n\nFigure 1: 1-d example: The image (a) is rendered from the shape (b). The shape depends on the image in a non-local way. Bandpass filtering both signals allows for a local shape recipe. The dotted line (which agrees closely with the true solid line) in (d) shows shape reconstruction from 9-parameter linear regression (9-tap convolution) from the bandpassed image, (c).\n\n2 Shape Recipes\n\nThe shape representation consists of describing, for a particular image, the functional relationship between image and shape. This relationship is not general for all images, but specific to the particular lighting and material conditions at hand.
We call this functional relationship the shape recipe.\n\nTo simplify the computation to obtain shape from image data, we require that the scene recipes be local: the scene structure in a region should only depend on a local neighborhood of the image. It is easy to show that, without taking special care, the shape-image relationship is not local. Fig. 1 (a) shows the intensity profile of a 1-d image arising from the shape profile shown in Fig. 1 (b) under particular rendering conditions (a Phong model with 10% specularity). Note that the function to recover the shape from the image cannot be local, because the identical local images on the left and right sides of the surface edge correspond to different shape heights.\n\nIn order to obtain locality in the shape-image relationship, we need to preprocess the shape and image signals. When shape and image are represented in a bandpass pyramid, within a subband, under generic rendering conditions [4], local shape changes lead to local image changes. (Representing the image in a Gaussian pyramid also gives a local relationship between image and bandpassed shape, effectively subsuming the image bandpass operation into the shape recipe. That formulation, explored in [16], can give slightly better performance and allows for simple non-linear extensions.) Figures 1 (c) and (d) are bandpass filtered versions of (a) and (b), using a second derivative of a Gaussian filter. In this example, (d) relates to (c) by a simple shape recipe: convolution with a 9-tap filter, learned by linear regression from rendered random shape data. The solid line shows the true bandpassed shape, while the dotted line is the linear regression estimate from Fig. 1 (c).\n\nFor 2-d images, we break the image and shape into subbands using a steerable pyramid [13], an oriented multi-scale decomposition with non-aliased subbands (Fig.
3 (a) and (b)). A shape subband can be related to an image intensity subband by a function\n\nZk = fk(Ik)   (1)\n\nwhere fk is a local function and Zk and Ik are the kth subbands of the steerable pyramid representation of the shape and image, respectively. The simplest functional relationship between shape and image intensity is via a linear filter with a finite size impulse response: Zk ≈ rk * Ik, where * is convolution. The convolution kernel rk (specific to each scale and orientation) transforms the image subband Ik into the shape subband Zk. The recipe rk at each subband is learned by minimizing Σx |Zk − Ik * rk|², regularizing rk as needed to avoid overfitting. rk contains information about the particular lighting conditions and the surface material. More general functions can be built by using non-linear filters and by combining image information from different orientations and scales [16].\n\nFigure 2: Shape estimate from stereo. (a) is one image of the stereo pair; the stereo reconstruction is depicted as (b) a range map, (c) a surface plot and (d) a re-rendering of the stereo shape. The stereo shape is noisy and misses fine details.\n\nWe conjecture that multiscale shape recipes have various desirable properties for estimation. First, they allow for a compact encoding of shape information, as much of the complexity of the shape is encoded in the image itself. The recipes need only specify how to translate image into shape. Secondly, regularities in how the shape recipes fk vary across scale and space provide a powerful mechanism for regularizing shape estimates. Instead of regularizing shape estimates by assuming a prior of smoothness of the surface, we can assume a slow spatial variation of the functional relationship between image and shape, which should make estimating shape recipes easier.
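The per-subband learning step described above (Zk ≈ rk * Ik, fit by regularized least squares) can be sketched in a few lines. This is a minimal illustration, not the paper\u2019s implementation: the function names (learn_recipe, apply_recipe) and the ridge weight lam are our own choices, and the kernel is written as a correlation, which is immaterial since it is learned in the same orientation it is applied.

```python
import numpy as np

def learn_recipe(I_k, Z_k, ksize=9, lam=1e-3):
    # Pose Zk ~ rk * Ik as least squares: each design-matrix row is a
    # ksize x ksize patch of the image subband, and the target is the
    # shape subband value at the patch center.
    h = ksize // 2
    H, W = I_k.shape
    rows = [I_k[y - h:y + h + 1, x - h:x + h + 1].ravel()
            for y in range(h, H - h) for x in range(h, W - h)]
    A = np.asarray(rows)
    b = np.asarray([Z_k[y, x]
                    for y in range(h, H - h) for x in range(h, W - h)])
    # The ridge term lam regularizes rk against overfitting, as in the text.
    r = np.linalg.solve(A.T @ A + lam * np.eye(ksize * ksize), A.T @ b)
    return r.reshape(ksize, ksize)

def apply_recipe(I_k, r_k):
    # Predict the shape subband by sliding the learned kernel over the
    # image subband (borders are left at zero).
    h = r_k.shape[0] // 2
    H, W = I_k.shape
    Z_hat = np.zeros_like(I_k)
    for y in range(h, H - h):
        for x in range(h, W - h):
            Z_hat[y, x] = np.sum(I_k[y - h:y + h + 1, x - h:x + h + 1] * r_k)
    return Z_hat
```

On synthetic data where the shape subband really is a filtered version of the image subband, learn_recipe recovers the generating kernel up to the small bias introduced by the ridge term.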
Third, shape recipes implicitly encode lighting and material information, which can be used for material-based segmentation. In the next two sections we discuss the properties of smoothness across scale and space, and we show potential applications in improving shape estimates from stereo and in image segmentation based on material properties.\n\n3 Scaling regularities of shape recipes\n\nFig. 2 shows one image of a stereo pair and the associated shape estimated from a stereo algorithm (footnote 1). The shape estimate is noisy in the high frequencies (see surface plot and re-rendered shape), but we assume it is accurate in the low spatial frequencies.\n\nFig. 3 shows the steerable pyramid representations of the image (a) and shape (b) and the learned shape recipes (c) for each subband (linear convolution kernels that give the shape subband from the image subband). We exploit the slow variation of shape recipes over scale and assume that the shape recipes are constant over the top four octaves of the pyramid (footnote 2). Thus, from the shape recipes learned at low resolution we can reconstruct a higher resolution shape estimate than the stereo output, by learning the rendering conditions and then taking advantage of shape details visible in the image but not exploited by the stereo algorithm. Fig. 4 (a) and (b) show the image and the implicit shape representation: the pyramid\u2019s low-resolution shape and the shape recipes used over the top four scales. Fig. 4 (c) and (d) show explicitly the reconstructed shape implied by (a) and (b): note the high resolution details, including the fine structure visible in the bottom left corner of (d). Compare with the stereo\n\nFootnote 1: We took our stereo photographs using a 3.3 Megapixel Olympus Camedia C-3040 camera, with a Pentax stereo adapter.
We calibrated the stereo images using the point matching algorithm of Zhang [18], and rectified the stereo pair (so that epipolar lines lie along scan lines) using the algorithm of [8], estimating disparity with the Zitnick\u2013Kanade stereo algorithm [19].\n\nFootnote 2: Except for a scale factor. We scale the amplitude of the fixed recipe convolution kernels by 2 for each octave, to account for the differentiation operation in the linear shading approximation to Lambertian rendering [7].\n\nFigure 3: Learning shape recipes at each subband. (a) and (b) are the steerable pyramid representations [13] of image and stereo shape. (c) shows the convolution kernels that best predict (b) from (a). The steerable pyramid isolates information according to scale (the smaller subband images represent larger spatial scales) and orientation (clockwise among subbands of one size: vertical, diagonal, horizontal, other diagonal).\n\nFigure 4: Reconstruction from shape recipes. Panels: (a) image; (b) low-res shape (center, top row) and recipes (for each subband orientation); (c) recipes shape (surface plot); (d) re-rendered recipes shape. The shape is represented by the information contained in the image (a), the low-res shape pyramid residual, and the shape recipes (b) estimated at the lowest resolution. The shape can be regenerated by applying the shape recipes (b) at the 4 highest resolution scales, then reconstructing from the shape pyramid. (d) shows the image re-rendered under different lighting conditions than (a). The reconstruction is not noisy and shows more detail than the stereo shape, Fig. 2, including the fine textures visible at the bottom left of the image (a) but not detected by the stereo algorithm.\n\noutput in Fig.
2.\n\n4 Segmenting shape recipes\n\nSegmenting an image into regions of uniform color or texture is often an approximation to an underlying goal: segmenting the image into regions of uniform material. Shape recipes, by describing how to transform from image to shape, implicitly encode both lighting and material properties. Across unchanging lighting conditions, segmenting by shape recipes allows us to segment according to a material\u2019s rendering properties, even overcoming changes of intensities or texture of the rendered image. (See [6] for a non-parametric approach to material segmentation.)\n\nWe expect shape recipes to vary smoothly over space, except for abrupt boundaries at changes in material or illumination. Within each subband, we can write the shape Zk as a mixture of recipes:\n\np(Zk | Ik) = Σ_{n=1..N} p(Zk − fk,n(Ik)) pn   (2)\n\nwhere N specifies the number of recipes needed to explain the underlying shape Zk. The weights pn, which will be a function of location, specify which recipe has to be used within each region and, therefore, provide a segmentation of the image.\n\nFigure 5: Segmentation example. Shape (a), with a horizontal orientation discontinuity, is rendered with two different shading models split vertically, (b). Based on image information alone, it is difficult to find a good segmentation into 2 groups, (c). A segmentation into 2 different shape recipes naturally falls along the vertical material boundary, (d).\n\nTo estimate the parameters of the mixture (shape recipes and weights), given known shape and the associated image, we use the EM algorithm [17]. We encourage spatial continuity for the weights pn, as neighboring pixels are likely to belong to the same material.
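The EM procedure just described can be sketched as follows. This is a simplified illustration of Eq. (2), assuming Gaussian residuals within each mixture component and a crude 5-point averaging of the responsibilities in place of the mean-field spatial prior; all names and parameter values are our own, not from the paper.

```python
import numpy as np

def extract_patches(I_k, ksize):
    # Rows are ksize x ksize patches of I_k around each interior pixel.
    h = ksize // 2
    H, W = I_k.shape
    rows = [I_k[y - h:y + h + 1, x - h:x + h + 1].ravel()
            for y in range(h, H - h) for x in range(h, W - h)]
    return np.asarray(rows), (H - 2 * h, W - 2 * h)

def em_recipes(I_k, Z_k, n_recipes=2, ksize=5, n_iter=25,
               sigma=0.2, lam=1e-6, init=None, seed=0):
    # EM for a mixture of linear shape recipes: each interior pixel is
    # softly assigned to one of n_recipes convolution kernels, and the
    # responsibilities give the material segmentation.
    A, grid = extract_patches(I_k, ksize)
    h = ksize // 2
    b = Z_k[h:-h, h:-h].ravel()
    rng = np.random.default_rng(seed)
    if init is None:
        resp = rng.dirichlet(np.ones(n_recipes), size=b.size)
    else:
        resp = init.reshape(-1, n_recipes).astype(float).copy()
    kernels = np.zeros((n_recipes, ksize * ksize))
    ridge = lam * np.eye(ksize * ksize)
    for _ in range(n_iter):
        # M step: weighted ridge regression for each recipe kernel.
        for n in range(n_recipes):
            Aw = A * resp[:, n][:, None]
            kernels[n] = np.linalg.solve(Aw.T @ A + ridge, Aw.T @ b)
        # E step: Gaussian likelihood of the per-pixel residual.
        err = (A @ kernels.T - b[:, None]) ** 2
        logp = -err / (2.0 * sigma ** 2)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # Spatial smoothing of the responsibilities (mean-field stand-in).
        R = resp.reshape(grid + (n_recipes,))
        R = (R + np.roll(R, 1, 0) + np.roll(R, -1, 0)
               + np.roll(R, 1, 1) + np.roll(R, -1, 1)) / 5.0
        resp = R.reshape(-1, n_recipes)
    return kernels.reshape(n_recipes, ksize, ksize), R
```

On a toy example where the left and right halves of the shape subband are generated by two different kernels, the responsibilities recover the vertical split (up to label permutation), much as Fig. 5 (d) does.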
We use the mean field approximation to implement the spatial smoothness prior in the E step, as suggested in [17].\n\nFigure 5 shows a segmentation example. (a) is a fractal shape, with diagonal-left structure across the top half, and diagonal-right structure across the bottom half. Onto that shape, we \u201cpainted\u201d two different Phong shading renderings in the two vertical halves, shown in (b) (the right half is shinier than the left). Thus, texture changes in each of the four quadrants, but the only material transition is across the vertical centerline. An image-based segmentation, which makes use of texture and intensity cues, among others, finds the four quadrants when looking for 4 groups, but can\u2019t segment well when forced to find 2 groups, (c). (We used the normalized cuts segmentation software, available on-line [11].) A segmentation based on shape recipes, which encode the relationship between image and shape, finds the vertical material boundary when asked for 2 groups, (d).\n\n5 Occlusion boundaries\n\nNot all image variations have a direct translation into shape. This is true for paint boundaries and for most occlusion boundaries. These cases need to be treated specially with shape recipes. To illustrate, in Fig. 6 (c) the occluding boundary in the shape only produces a smooth change in the image, Fig. 6 (a). In that region, a shape recipe will produce an incorrect shape estimate; however, the stereo algorithm will often succeed at finding those occlusion edges. On the other hand, stereo often fails to provide the shape of image regions with complex shape details, where the shape recipes succeed. For the special case of revising the stereo algorithm\u2019s output using shape recipes, we propose a statistical framework to combine both sources of information.
We want to estimate the shape Z that maximizes the likelihood given the shape from stereo, S, and the shape from image intensity, I, via shape recipes:\n\np(Z | S, I) = p(S, I | Z) p(Z) / p(S, I)   (3)\n\n(For notational simplicity, we omit the spatial dependency from I, S and Z.) As both stereo S and image intensity I provide strong constraints for the possible underlying shape Z, the factor p(Z) can be considered constant in the region of support of p(S, I | Z). p(S, I) is a normalization factor. Eq. (3) can be simplified by assuming that the shapes from stereo and from shape recipes are independent. Furthermore, we also assume independence between the pixels in the image and across subbands:\n\np(S, I | Z) = Π_k Π_{x,y} p(Sk | Zk) p(Ik | Zk)   (4)\n\nSk, Zk and Ik refer to the outputs of subband k. Although this is an oversimplification, it simplifies the analysis and provides good results.\n\nFigure 6: One way to handle occlusions with shape recipes. Image in full-res (a) and one steerable pyramid subband (b); stereo depth, full-res (c) and subband (d). (e) shows a subband of the shape reconstruction using the learned shape recipe. Direct application of the shape recipe across the occlusion boundary misses the shape discontinuity. The stereo algorithm catches that discontinuity, but misses other shape details. Probabilistic combination of the two shape estimates (f, subband; g, surface), assuming Laplacian shape statistics, captures the desirable details of both, comparing favorably with laser scanner ground truth (h, subband; i, surface, at slight misalignment from the photos).\n\nThe terms p(Sk | Zk) and p(Ik | Zk) will depend on the noise models for the depth from stereo and for the shape recipes.
For the shape estimate from stereo we assume a Gaussian distribution for the noise. At each subband and spatial location we have:\n\np(Sk | Zk) = ps(Zk − Sk) = exp(−|Zk − Sk|² / σs²) / ((2π)^(1/2) σs)   (5)\n\nIn the case of the shape recipes, a Gaussian noise model is not adequate. The distribution of the error Zk − fk(Ik) will depend on image noise but, more importantly, on all shape and image variations that are not functionally related with each other through the recipes. Fig. 6 illustrates this point: the image data, Fig. 6 (b), does not describe the discontinuity that exists in the shape, Fig. 6 (h). When trying to estimate shape using the shape recipe fk(Ik), it fails to capture the discontinuity, although it correctly captures other texture variations, Fig. 6 (e). Therefore, Zk − fk(Ik) will describe the distribution of occluding edges that do not produce image variations and paint edges that do not translate into shape variations. Due to the sparse distribution of edges in images (and range data), we expect Zk − fk(Ik) to have a Laplacian distribution typical of the statistics of wavelet outputs of natural images [12]:\n\np(Ik | Zk) = p(Zk − fk(Ik)) = exp(−|Zk − fk(Ik)|^p / σi^p) / (2 σi Γ(1/p) / p)   (6)\n\nIn order to verify this, we use the stereo information at the low spatial resolutions, which we expect to be correct, so that p(Zk − fk(Ik)) ≃ p(Sk − fk(Ik)). We obtain values of p in the range (0.6, 1.2). We set p = 1 for the results shown here. Note that p = 2 gives a Gaussian distribution.\n\nThe least-squares estimate for the shape subband Zk, given both stereo and image data, is:\n\nẐk = ∫ Zk p(Zk | Sk, Ik) dZk = ∫ Zk p(Sk | Zk) p(Ik | Zk) dZk / ∫ p(Sk | Zk) p(Ik | Zk) dZk   (7)\n\nThis integral can be evaluated numerically, independently at each pixel.
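As a concrete sketch, the per-pixel integral of Eq. (7) can be evaluated on a uniform grid of candidate Zk values, with the Gaussian stereo likelihood of Eq. (5) and the generalized-Laplacian recipe likelihood of Eq. (6). The function name, the grid limits (span, n_samples), and the parameter values below are illustrative assumptions, not from the paper.

```python
import numpy as np

def combine_stereo_and_recipe(S_k, F_k, sigma_s=1.0, sigma_i=0.2,
                              p=1.0, n_samples=401, span=5.0):
    # S_k: stereo shape subband; F_k: recipe prediction fk(Ik).
    # Build a per-pixel grid of candidate Zk values covering both
    # estimates plus a margin of span on either side.
    lo = np.minimum(S_k, F_k) - span
    hi = np.maximum(S_k, F_k) + span
    t = np.linspace(0.0, 1.0, n_samples)
    Z = lo[..., None] + (hi - lo)[..., None] * t
    # Log of the unnormalized posterior: Gaussian term around S_k
    # (Eq. 5) plus generalized-Laplacian term around F_k (Eq. 6).
    log_w = (-(Z - S_k[..., None]) ** 2 / sigma_s ** 2
             - np.abs(Z - F_k[..., None]) ** p / sigma_i ** p)
    # Subtract the per-pixel maximum before exponentiating, for stability.
    w = np.exp(log_w - log_w.max(axis=-1, keepdims=True))
    # Posterior mean = the least-squares estimate of Eq. (7).
    return (Z * w).sum(axis=-1) / w.sum(axis=-1)
```

The sketch reproduces the qualitative behavior described below: when stereo and recipe agree, the estimate sits between them, and when they differ greatly (as at occlusions, with p near 1) the flat Laplacian tail lets the estimate follow the stereo value.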
When p = 2, the LSE estimate is a weighted linear combination of the shape from stereo and the shape from recipes. However, with p ≃ 1 this problem is similar to that of image denoising from wavelet decompositions [12], providing a non-linear combination of stereo and shape recipes. The basic behavior of Eq. (7) is to take from the stereo everything that cannot be explained by the recipes, and to take the rest from the recipes. Whenever stereo and shape recipes give similar estimates, we prefer the recipes, because they are more accurate than the stereo information. Where stereo and shape recipes differ greatly, such as at occlusions, the shape estimate follows the stereo shape.\n\n6 Discussion and Summary\n\nUnlike shape-from-shading algorithms [5], shape recipes are fast, local procedures for computing shape from an image. The approximation of linear shading [7] also assumes a local linear relationship between image and shape subbands. However, learning the regression coefficients allows a linearized fit to more general rendering conditions than the special case of Lambertian shading for which linear shading was derived.\n\nWe have proposed shape recipes as a representation that leaves the burden of describing shape details to the image. Unlike many other shape representations, these are low-dimensional, and should change slowly over time, distance, and spatial scale. We expect that these properties will prove useful for estimation algorithms using these representations, including non-linear extensions [16].\n\nWe showed that some of these properties are indeed useful in practice. We developed a shape estimate improver that relies on an initial estimate being accurate at low resolutions. Assuming that shape recipes change slowly over 4 octaves of spatial scale, we learned the shape recipes at low resolution and applied them at high resolution to find shape from image details not exploited by the stereo algorithm.
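The scale extrapolation just summarized (and footnote 2) amounts to reusing the kernel learned at the lowest resolution at each finer octave, with its amplitude rescaled by a factor of two per octave. The sketch below is illustrative only: the function name, the octave indexing, and the direction of the factor (halving toward finer scales, our reading of the footnote) are assumptions.

```python
import numpy as np

def scale_recipe(r_low, octaves_finer, factor=0.5):
    # r_low: recipe kernel learned at the lowest-resolution subband.
    # octaves_finer: how many octaves toward finer scales it is applied.
    # The amplitude changes by a factor of two per octave to track the
    # differentiation in the linear shading approximation; factor=0.5
    # assumes the recipe gain shrinks toward finer scales.
    return r_low * (factor ** octaves_finer)
```

With factor as a parameter, the opposite convention (doubling toward finer scales) is a one-argument change.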
Comparisons with ground truth shapes show good results. Shape recipes fold in information about both lighting and material properties, and can also be used to estimate material boundaries over regions where the lighting is assumed to be constant.\n\nGilchrist and Adelson describe \u201catmospheres\u201d, which are local formulas for converting image intensities to perceived lightness values [3, 1]. In this framework, atmospheres are \u201clightness recipes\u201d. A full description of an image in terms of a scene recipe would require both shape recipes and reflectance recipes (for computing reflectance values from image data), which also requires labelling parts of the image as being caused by shading or reflectance changes, as in [15].\n\nAt a conceptual level, this representation is consistent with a theme in human vision research: that our visual systems use the world as a framebuffer or visual memory, not storing in the brain what can be obtained by looking [9]. Using shape recipes, we find simple transformation rules that let us convert from image to shape whenever we need to, by examining the image.\n\nWe thank Ray Jones and Leonard McMillan for providing Cyberware scans, and Hao Zhang for code for rectification of stereo images. This work was funded by the Nippon Telegraph and Telephone Corporation as part of the NTT/MIT Collaboration Agreement.\n\nReferences\n\n[1] E. H. Adelson. Lightness perception and lightness illusions. In M. Gazzaniga, editor, The New Cognitive Neurosciences, pages 339\u2013351. MIT Press, 2000.\n\n[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford, 1995.\n\n[3] A. Gilchrist et al. An anchoring theory of lightness. Psychological Review, 106(4):795\u2013834, 1999.\n\n[4] W. T. Freeman. The generic viewpoint assumption in a framework for visual perception. Nature, 368(6471):542\u2013545, April 7, 1994.\n\n[5] B. K. P. Horn and M. J. Brooks, editors. Shape from Shading.
The MIT Press, Cambridge, MA, 1989.\n\n[6] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. Intl. J. Comp. Vis., 43(1):29\u201344, 2001.\n\n[7] A. P. Pentland. Linear shape from shading. Intl. J. Comp. Vis., 1(4):153\u2013162, 1990.\n\n[8] M. Pollefeys, R. Koch, and L. V. Gool. A simple and efficient rectification method for general motion. In Intl. Conf. on Computer Vision (ICCV), pages 496\u2013501, 1999.\n\n[9] R. A. Rensink. The dynamic representation of scenes. Vis. Cognition, 7:17\u201342, 2000.\n\n[10] S. Sclaroff and A. Pentland. Generalized implicit functions for computer graphics. In Proc. SIGGRAPH 91, volume 25, pages 247\u2013250, 1991. In Computer Graphics, Annual Conference Series.\n\n[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n\n[12] E. P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In 31st Asilomar Conf. on Sig., Sys. and Computers, Pacific Grove, CA, 1997.\n\n[13] E. P. Simoncelli and W. T. Freeman. The steerable pyramid: a flexible architecture for multi-scale derivative computation. In 2nd Annual Intl. Conf. on Image Processing, Washington, DC, 1995. IEEE.\n\n[14] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. Intl. J. Comp. Vis., 5(3):271\u2013301, 1990.\n\n[15] M. F. Tappen, W. T. Freeman, and E. H. Adelson. Recovering intrinsic images from a single image. In Adv. in Neural Info. Proc. Systems, volume 15. MIT Press, 2003.\n\n[16] A. Torralba and W. T. Freeman. Properties and applications of shape recipes. Technical Report AIM-2002-019, MIT AI Lab, 2002.\n\n[17] Y. Weiss. Bayesian motion estimation and segmentation. PhD thesis, M.I.T., 1998.\n\n[18] Z. Zhang.
Determining the epipolar geometry and its uncertainty: A review. Technical Report 2927, Sophia-Antipolis Cedex, France, 1996. See http://www-sop.inria.fr/robotvis/demo/f-http/html/.\n\n[19] C. L. Zitnick and T. Kanade. A cooperative algorithm for stereo matching and occlusion detection. IEEE Pattern Analysis and Machine Intelligence, 22(7), July 2000.\n", "award": [], "sourceid": 2268, "authors": [{"given_name": "William", "family_name": "Freeman", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}]}