{"title": "Example-Based Image Synthesis of Articulated Figures", "book": "Advances in Neural Information Processing Systems", "page_first": 768, "page_last": 774, "abstract": null, "full_text": "Example Based Image Synthesis of Articulated \n\nFigures \n\nTrevor Darrell \n\nInterval Research. 1801C Page Mill Road. Palo Alto CA 94304 \n\ntrevor@interval.com, http://www.interval.com/-trevor/ \n\nAbstract \n\nWe present a method for learning complex appearance mappings. such \nas occur with images of articulated objects. Traditional interpolation \nnetworks fail on this case since appearance is not necessarily a smooth \nfunction nor a linear manifold for articulated objects. We define an ap(cid:173)\npearance mapping from examples by constructing a set of independently \nsmooth interpolation networks; these networks can cover overlapping re(cid:173)\ngions of parameter space. A set growing procedure is used to find ex(cid:173)\nample clusters which are well-approximated within their convex hull; \ninterpolation then proceeds only within these sets of examples. With this \nmethod physically valid images are produced even in regions of param(cid:173)\neter space where nearby examples have different appearances. We show \nresults generating both simulated and real arm images. \n\n1 Introduction \n\nImage-based view synthesis is.an important application of learning networks. offering the \nability to render realistic images without requiring detailed models of object shape and \nillumination effects. To date. much attention has been given to the problem of view synthe(cid:173)\nsis under varying camera pose or rigid object transformation. Several successful solutions \nhave been proposed in the computer graphics and vision literature. including view morph(cid:173)\ning [12], plenoptic modeling/depth recovery [8], \"lightfields\" [7], and recent approaches \nusing the trifocal tensor for view extrapolation [13]. \n\nFor non-rigid view synthesis. 
networks for model-based interpolation and manifold learning have been used successfully in some cases [14, 2, 4, 11]. Techniques based on Radial Basis Function (RBF) interpolation or on Principal Components Analysis (PCA) have been able to interpolate face images under varying pose, expression and identity [1, 5, 6]. How-

extends the notion of example clustering to the case of coupled shape and texture appearance models.

Our basic method is to find sets of examples which can be well approximated from their convex hull in parameter space. We define a set-growing criterion which enforces compactness and the good-interpolation property. To add a new point to an example set, we require both that the new point be well approximated by the previous set alone and that all interior points in the resulting set be well interpolated from the exterior examples. We define exterior examples to be those on the convex hull of the set in parameter space. Given a training subset s ⊂ Ω and a new point p ∈ Ω,

E(s, p) = max(E_I(s ∪ {p}), E_E(s, p)),

with the interior and extrapolation errors defined as

E_I(s) = max_{(x_i, y_i) ∈ s - H_x(s)} ||y_i - ŷ_{H_x(s)}(x_i)|| ,
E_E(s, p) = ||y_p - ŷ_s(x_p)|| ,

where ŷ_q(x) denotes the interpolated appearance at parameter x computed from the examples in q, and H_x(s) is the subset of s whose x vectors lie on the convex hull of all such vectors in s. To add a new point, we require E < ε, where ε is a free parameter of the clustering method. Given a seed example set, we look to nearest neighbors in appearance space to find the next candidate to add. Unless we are willing to test the extrapolation error of the current model at all points, we have to rely on a precomputed non-vectorized appearance distance (e.g., MSE between example images). If the examples are sparse in the appearance domain, this may not lead to effective groupings.
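The set-growing acceptance test above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes a one-dimensional parameter space (so the hull H_x(s) reduces to the two extreme examples), scalar appearance values, and the paper's linear basis f(r) = |r|; the names rbf_fit, interior_error and accept are invented here.

```python
import numpy as np

def rbf_fit(x, y):
    # c = y R^+ with R_ij = f(x_i - x_j) and f(r) = |r|, the paper's linear basis
    R = np.abs(x[:, None] - x[None, :])
    return y @ np.linalg.pinv(R)

def rbf_eval(c, x_train, x):
    # y_hat(x) = sum_i c_i f(x - x_i)
    return float(c @ np.abs(x - x_train))

def interior_error(x, y):
    # E_I(s): worst error interpolating interior examples from the hull examples;
    # in 1-D the hull H_x(s) is just the two extreme parameter values
    hull = [int(np.argmin(x)), int(np.argmax(x))]
    interior = [i for i in range(len(x)) if i not in hull]
    if not interior:
        return 0.0
    c = rbf_fit(x[hull], y[hull])
    return max(abs(rbf_eval(c, x[hull], x[i]) - y[i]) for i in interior)

def accept(x, y, x_new, y_new, eps):
    # E(s, p) = max(E_I(s ∪ {p}), E_E(s, p)) < eps
    c = rbf_fit(x, y)
    e_extrap = abs(rbf_eval(c, x, x_new) - y_new)
    e_interior = interior_error(np.append(x, x_new), np.append(y, y_new))
    return max(e_interior, e_extrap) < eps
```

A smooth continuation of the example trajectory passes the test, while a point whose appearance jumps (as at an articulation discontinuity) is rejected and would seed a new set.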
If examples are provided in sequence and are based on observations from an object with realistic dynamics, then we can find effective groupings even if observations are sparse in appearance space. We assume that along the trajectory of example observations over time, the underlying object is likely to remain smooth and to locally span regions of appearance which are possible to interpolate. We thus perform set growing along examples on their input trajectory. Specifically, in the results reported below, we select K seed points on the trajectory to form initial clusters. At each point p we find the set s which is the smallest interval on the example trajectory which contains p, has a non-empty interior region (s - H_x(s)), and for which E_I(s) < ε. If such a set exists, we continue to expand it, growing the set along the example trajectory until the set-growing criterion above is violated. Once we can no longer grow any set, we test whether any set is a proper subset of another, and delete it if so. We keep the remaining sets, and use them for interpolation as described below.

4 Synthesis using example sets

We generate new views using sets of examples: interpolation is restricted to occur only inside the convex hull of an example set found as above for which E_I(s) ≤ ε. Given a new parameter vector x, we test whether it is in the convex hull of the parameters in any example set. If the point does not lie in the convex hull of any example set, we find the nearest point on the convex hull of one of the example sets, and use that instead. This prevents erroneous extrapolation.

If a new parameter is in the convex hull of more than one example set, we select the set whose median example parameter is closest to the desired parameter. Once a set has been selected, we interpolate a new function value from its examples using the RBF method summarized above.
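The set-selection rule above can be sketched as follows. This is an illustrative 1-D version (hull membership becomes an interval test and hull projection becomes clamping), not the paper's implementation; select_set is a name invented here.

```python
import numpy as np

def select_set(x, example_sets):
    # example_sets: list of 1-D arrays of example parameters, one array per set.
    # A set is usable only if x lies inside its hull (an interval in 1-D);
    # ties go to the set whose median parameter is closest to x.
    containing = [s for s in example_sets if s.min() <= x <= s.max()]
    if containing:
        best = min(containing, key=lambda s: abs(np.median(s) - x))
        return best, x
    # No hull contains x: clamp to the nearest point on some set's hull,
    # preventing erroneous extrapolation.
    candidates = [(s, float(np.clip(x, s.min(), s.max()))) for s in example_sets]
    return min(candidates, key=lambda sx: abs(sx[1] - x))
```

The returned (set, query) pair would then be fed to the RBF interpolator built from that set's examples.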
To enforce temporal consistency of rendered images over time,

Figure 2: (a) Images of a real arm (from a sequence of 33 images) with changing appearance and elbow configuration. (b, c) Interpolated shape of arms tracked in the previous figure. (b) shows results using all examples in a single interpolation network; (c) shows results using the example sets algorithm. Open contours show arm example locations; the filled contour shows the interpolation result. Near regions of appearance singularity in parameter space the full-network method generates physically invalid arm shapes; the example sets method produces realistic images.

The method presented below for grouping examples into locally valid spaces is generally applicable to both the PCA- and RBF-based view synthesis techniques. However, our initial implementation, and the results reported in this paper, have been with RBF-based models.

3 Finding consistent example sets

Given examples from a complicated (non-linear, non-smooth) appearance mapping, we find local regions of appearance which are well behaved as smooth, possibly linear, functions. We wish to cluster our examples into sets which can be used for successful interpolation using our local appearance model.

Conceptually, this problem is similar to that faced by Bregler and Omohundro [2], who built image manifolds using a mixture of local PCA models. Their work was limited to modeling shape (lip outlines); they used K-means clustering of image appearance to form the initial groupings for PCA analysis. However, this approach had no model of texture and performed clustering using a mean-squared-error distance metric on simple appearance. Simple appearance clustering drastically over-partitions the appearance space compared to a model that jointly represents shape and texture.
Examples which are distant in simple appearance can often be close when considered in a 'vectorized' representation. Our work

[Figure 1 appears here.]

Figure 1: Arm appearance interpolated from examples using an approximation network. (a) A 2-DOF planar arm. Discontinuities in appearance due to workspace constraints make this a difficult function to learn from examples; the first and last examples are very close in parameter space, but far apart in appearance space. (b) shows results using all examples in a single network; (c) using the example sets algorithm described in the text. Note the poor approximation on the last two examples in (a); appearance discontinuities and extrapolation cause problems for the full network, but are handled well by the example sets method.

In PCA-based approaches, G projects a portion of u onto an optimal linear subspace found from D, and F projects a portion of u onto a subspace found from T [6, 5]. For example, G_D(u) = P_D S_g u, where S_g is a diagonal boolean matrix which selects the shape parameters from u and P_D is a matrix containing the m largest principal components of D. F warps the reconstructed texture according to the given shape: F_T(u, s) = [P_T S_t u] ∘ s.
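The principal-component projection used by models of this kind can be sketched with a plain SVD. This is an illustrative subspace fit under the paper's convention that the columns of D are example shape (displacement) vectors, not the paper's implementation; pca_basis and project are names invented here.

```python
import numpy as np

def pca_basis(D, m):
    # Columns of D are example shape vectors; return the mean and a matrix P
    # whose columns are the m largest principal components of the centered data.
    mu = D.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(D - mu, full_matrices=False)
    return mu, U[:, :m]

def project(mu, P, d):
    # Reconstruct a shape vector from its projection onto the learned subspace.
    return (mu + P @ (P.T @ (d[:, None] - mu)))[:, 0]
```

A shape lying in the span of the examples is reconstructed exactly; shapes off the subspace are snapped onto it, which is the source of the linear-manifold assumption discussed below.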
While interpolation is simple using a PCA approach, the parameters used in PCA models often do not have any direct physical interpretation. For the task of view synthesis, an additional mapping u = H(x) is needed to map from task parameters to PCA input values; a backpropagation neural net was used to perform this function for the task of eye gaze analysis [10].

Using the RBF-based approach [1], the application to view synthesis is straightforward. Both G and F are networks which compute locally weighted regression, and parameters are used directly (u = x). G computes an interpolated shape, and F warps and blends the example texture images according to that shape: G_D(x) = Σ_i c_i f(x - x_i), F_T(x, s) = [Σ_i c'_i f(x - x_i)] ∘ s, where f is a radial basis function. The coefficients c and c' are derived from D and T, respectively: C = D R^+, where r_ij = f(x_i - x_j) and C is the matrix of row vectors c_i; similarly C' = T R^+ [9]. We have found that both vector-norm and Gaussian basis functions give good results when the appearance data come from a smooth function; the results below use f(r) = ||r||.

ever, these methods are limited in the types of object appearance they can accurately model. PCA-based face analysis typically assumes that images of face shape and texture fall in a linear subspace; RBF approaches fare poorly when appearance is not a smooth function.

We want to extend non-rigid interpolation networks to handle cases where appearance is not a linear manifold and is not a smooth function, as with articulated bodies. The mapping from parameter to appearance for articulated bodies is often one-to-many, due to the multiple solutions possible for a given endpoint. It will also be discontinuous when constraints call for different solutions across a boundary in parameter space, such as in the example shown in Figure 1.
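The RBF shape regression summarized above (C = D R^+, G_D(x) = Σ_i c_i f(x - x_i) with f(r) = ||r||) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; fit_shape_rbf and interp_shape are names invented here.

```python
import numpy as np

def fit_shape_rbf(X, D):
    # X: (n, k) example parameter vectors; D: (d, n) example shape vectors,
    # one column per example. C = D R^+ with R_ij = f(x_i - x_j), f(r) = ||r||.
    R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return D @ np.linalg.pinv(R)

def interp_shape(C, X, x):
    # G_D(x) = sum_i c_i f(x - x_i): blend example shapes by radial weights.
    return C @ np.linalg.norm(x - X, axis=-1)
```

Because C R reproduces D when R is invertible, the network interpolates the training shapes exactly at the example parameters, which is the property the example sets method restricts to regions where it is physically meaningful.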
Our approach represents an appearance mapping as a set of piecewise smooth functions. We search for sets of examples which are well approximated by the examples on the convex hull of the set's parameter values. Once we have these 'safe' sets of examples, we perform interpolation using only the examples in a single set.

The clear advantage of this approach is that it prevents inconsistent examples from being combined during interpolation. It can also reduce the number of examples needed to fully interpolate the function, since only those examples which lie on the convex hull of one or more example sets are needed. If a new example is provided and it falls within, and is well approximated by, the convex hull of an existing set, it can be safely ignored.

The remainder of this paper proceeds as follows. First, we review methods for modeling appearance when it can be well approximated with a smooth and/or linear function. Next, we present a technique for clustering examples to find maximal subsets which are well approximated in their interior. We then detail how we select among the subsets during interpolation, and finally show results with both synthetic and real imagery.

2 Modeling smooth and/or linear appearance functions

Traditional interpolation networks work well when object appearance can be modeled either as a linear manifold or as a smooth function over the parameters of interest (describing pose, expression, identity, configuration, etc.). As mentioned above, both PCA and RBF approaches have been successfully applied to model facial expression.

In both approaches, a key step in modeling non-rigid shape appearance from examples is to couple shape and texture into a single representation. Interpolation of shape has been well studied in the computer graphics literature (e.g., splines for key-frame animation) but does not alone render realistic images.
PCA or RBF models of images without a shape model can only represent and interpolate within a very limited range of pose or object configuration.

In a coupled representation, texture is modeled in shape-normalized coordinates, and shape is modeled as disparity between examples or as displacement from a canonical example to all examples. Image warping is used to generate images for a particular texture and shape. Given a training set Ω = {(y_i, x_i, d_i), 0 ≤ i ≤ n}, where y_i is the image of example i, x_i is the associated pose or configuration parameter, and d_i is a dense correspondence map relative to a canonical pose, a set of shape-aligned texture images can be computed such that texture t_i warped with displacement d_i renders example image y_i: y_i = t_i ∘ d_i [5, 1, 6]. A new image is constructed using a coupled shape model G and texture model F, based on input u:

y(Ω, u) = F_T(G_D(u), u),

where D, T are the matrices [d_0 d_1 ... d_n], [t_0 t_1 ... t_n], respectively.

Figure 3: Interpolated shape and texture result. (a) shows exemplar contours (open) and interpolated shape (filled). (b) shows example texture images. (c) shows the final interpolated image.

we can use a simple additional constraint on subsequent frames. Once we have selected an example set, we keep using it until the desired parameter value leaves the valid region (convex hull) of that set. When this occurs, we allow transitions only to \"adjacent\" example sets; adjacency is defined over those pairs of sets for which at least one example on each convex hull is sufficiently close (||y_i - y_j|| < ε) in appearance space.

5 Results

First we show examples using a synthetic arm with several workspace constraints. Figure 1(a) shows examples of a simple planar 2-DOF arm and the inverse kinematic solution for a variety of endpoints.
Due to an artificial obstacle in the world, the arm is forced to switch between arm-up and arm-down configurations to avoid collision.

We trained an interpolation network using a single RBF model of the appearance of the arm as a function of endpoint location. Appearance was modeled as the vector of contour point locations, obtained from the synthetic arm rendering function. We first trained a single RBF network on a dense set of examples of this appearance function. Figure 1(b) shows results interpolating new arm images from these examples; results are accurate except where there are regions of appearance discontinuity due to workspace constraints, or where the network extrapolates erroneously.

We applied the clustering method described above to this data, yielding the results shown in Figure 1(c). None of the problems with discontinuities or erroneous extrapolation can be seen in these results, since our method enforces the constraint that an interpolated result must be returned from on or within the convex hull of a valid example set.

Next we applied our method to the images of real arms shown in Figure 2(a). Arm contours were obtained in a sequence of 33 such images using a semi-automated deformable contour tracker augmented with a local image distance metric [3]. Dense correspondences were interpolated from the values on the contour. Figure 2(b) shows interpolated arm shapes using a single RBF network on all examples; dramatic errors can be seen near where multiple different appearances exist within a small region of parameter space.

Figure 2(c) shows the results on the same points using sets of examples found by our clustering method; physically realistic arms are generated in each case. Figure 3 shows the final interpolated result rendered with both shape and texture.
6 Conclusion

View-based image interpolation is a powerful paradigm for generating realistic imagery without full models of the underlying scene geometry. Current techniques for non-rigid interpolation assume appearance is a smooth function. We apply an example clustering approach using on-line cross-validation to decompose a complex appearance mapping into sets of examples which can be smoothly interpolated. We show results on real imagery of human arms, with correspondences recovered from deformable contour tracking. Given images of an arm moving on a plane under various configuration conditions (elbow up and elbow down), and with associated parameter vectors marking the hand location, our method is able to discover a small set of manifolds, each with a small number of exemplars, which can render new examples that are always physically correct. A single interpolating manifold for this same data has errors near the boundary between different arm configurations, and where multiple images have the same parameter value.

References

[1] D. Beymer, A. Shashua and T. Poggio, Example Based Image Analysis and Synthesis, MIT AI Lab Memo No. 1431, MIT, 1993. See also D. Beymer and T. Poggio, Science 272:1905-1909, 1996.

[2] C. Bregler and S. Omohundro, Nonlinear Image Interpolation using Manifold Learning, NIPS-7, MIT Press, 1995.

[3] T. Darrell, A Radial Cumulative Similarity Transform for Robust Image Correspondence, Proc. CVPR-98, Santa Barbara, CA, IEEE CS Press, 1998.

[4] M. Jagersand, Image Based View Synthesis of Articulated Agents, Proc. CVPR-97, San Juan, Puerto Rico, pp. 1047-1053, IEEE CS Press, 1997.

[5] M. Jones and T. Poggio, Multidimensional Morphable Models, Proc. ICCV-98, Bombay, India, pp. 683-688, 1998.

[6] A. Lanitis, C. J. Taylor and T. F. Cootes, A Unified Approach to Coding and Interpreting Face Images, Proc. ICCV-95, pp. 368-373, Cambridge, MA, 1995.

[7] M. Levoy and P.
Hanrahan, Light Field Rendering, Proc. SIGGRAPH-96, pp. 31-42, 1996.

[8] L. McMillan and G. Bishop, Plenoptic Modeling: An Image-Based Rendering System, Proc. SIGGRAPH-95, pp. 39-46, 1995.

[9] T. Poggio and F. Girosi, A Theory of Networks for Approximation and Learning, MIT AI Lab Memo No. 1140, 1989.

[10] T. Rikert and M. Jones, Gaze Estimation using Morphable Models, Proc. IEEE Conf. Face and Gesture Recognition '98, pp. 436-441, Nara, Japan, IEEE CS Press, 1998.

[11] L. Saul and M. Jordan, A Variational Principle for Model-based Morphing, NIPS-9, MIT Press, 1997.

[12] S. Seitz and C. Dyer, View Morphing, Proc. SIGGRAPH-96, pp. 21-30, 1996.

[13] A. Shashua and M. Werman, Trilinearity of Three Perspective Views and its Associated Tensor, Proc. ICCV-95, pp. 920-935, Cambridge, MA, IEEE CS Press, 1995.

[14] J. Tenenbaum, Mapping a Manifold of Perceptual Observations, NIPS-10, MIT Press, 1998.
", "award": [], "sourceid": 1504, "authors": [{"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}