{"title": "Geometry-Aware Neural Rendering", "book": "Advances in Neural Information Processing Systems", "page_first": 11559, "page_last": 11569, "abstract": "Understanding the 3-dimensional structure of the world is a core challenge in computer vision and robotics. Neural rendering approaches learn an implicit 3D model by predicting what a camera would see from an arbitrary viewpoint. We extend existing neural rendering to more complex, higher dimensional scenes than previously possible. We propose Epipolar Cross Attention (ECA), an attention mechanism that leverages the geometry of the scene to perform efficient non-local operations, requiring only $O(n)$ comparisons per spatial dimension instead of $O(n^2)$. We introduce three new simulated datasets inspired by real-world robotics and demonstrate that ECA significantly improves the quantitative and qualitative performance of Generative Query Networks (GQN).", "full_text": "Geometry-Aware Neural Rendering\n\nJosh Tobin\n\nOpenAI & UC Berkeley\n\njosh@openai.com\n\nOpenAI Robotics\u2217\n\nOpenAI\n\nPieter Abbeel\n\nCovariant.AI & UC Berkeley\npabbeel@cs.berkeley.edu\n\nAbstract\n\nUnderstanding the 3-dimensional structure of the world is a core challenge in\ncomputer vision and robotics. Neural rendering approaches learn an implicit 3D\nmodel by predicting what a camera would see from an arbitrary viewpoint. We\nextend existing neural rendering to more complex, higher dimensional scenes than\npreviously possible. We propose Epipolar Cross Attention (ECA), an attention\nmechanism that leverages the geometry of the scene to perform ef\ufb01cient non-local\noperations, requiring only O(n) comparisons per spatial dimension instead of\nO(n2). 
We introduce three new simulated datasets inspired by real-world robotics\nand demonstrate that ECA signi\ufb01cantly improves the quantitative and qualitative\nperformance of Generative Query Networks (GQN) [7].\n\n1\n\nIntroduction\n\nThe ability to understand 3-dimensional structure has long been a fundamental topic of research in\ncomputer vision [10, 22, 26, 34]. Advances in 3D understanding, driven by geometric methods [14]\nand deep neural networks [7, 31, 40, 43, 44] have improved technologies like 3D reconstruction,\naugmented reality, and computer graphics. 3D understanding is also important in robotics. To interact\nwith their environments, robots must reason about the spatial structure of the world around them.\nAgents can learn 3D structure implicitly (e.g., using end-to-end reinforcement learning [24, 25]), but\nthese techniques can be data-inef\ufb01cient and the representations often have limited reuse. An explicit\n3D representation can be created using keypoints and geometry [14] or neural networks [44, 43, 30],\nbut these can lead to in\ufb02exible, high-dimensional representations. Some systems forego full scene\nrepresentations by choosing a lower-dimensional state representation. However, not all scenes admit\na compact state representation and learning state estimators often requires expensive labeling.\nPrevious work demonstrated that Generative Query Networks (GQN) [7] can perform neural rendering\nfor scenes with simple geometric objects. However, robotic manipulation applications require precise\nrepresentations of high degree-of-freedom (DoF) systems with complex objects. The goal of this\npaper is to explore the use of neural rendering in such environments.\nTo this end, we introduce an attention mechanism that leverages the geometric relationship between\ncamera viewpoints called Epipolar Cross-Attention (ECA). 
When rendering an image, ECA computes\na response at a given spatial position as a weighted sum at all relevant positions of feature maps from\nthe context images. Relevant features are those on the epipolar line in the context viewpoint.\n\n\u2217Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Alex Paino, Arthur Petron, Matthias\nPlappert, Raphael Ribas, Jonas Schneider, Jerry Tworek, Nik Tezak, Peter Welinder, Lilian Weng, Qiming Yuan,\nWojciech Zaremba, Lei Zhang\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Overview of the model architecture used in our experiments. Grey boxes denote intermediate\nrepresentations, zi latent variables, + element-wise addition, and blue and red boxes subcomponents\nof the neural network architecture. Green components are model inputs, blue are as in GQN [7], and\nred are contributions of our model. (1) Context images and corresponding viewpoints are passed\nthrough a convolutional neural network f to produce a context representation. We use the Tower\narchitecture from [7]. (2) We use the epipolar geometry between the query viewpoint and the context\nviewpoint to extract the features in the context representation rk that are relevant to rendering each\nspatial point in the query viewpoint vq. These extracted features are stored in a 3-dimensional tensor\nek called the epipolar representation. See Figure 3(a) for more details. (3) At each generation step l,\nwe compute the decoder input by attending over the epipolar representation. The attention map al\ncaptures the weighted contribution to each spatial position in the decoder hidden state of all relevant\npositions in the context representations. See Figure 3(b) for more details on how the attention map is\ncomputed. (4) The decoder, or generation network, is the skip-connection convolutional LSTM cell\nfrom [7]. 
It takes as input the attention map al, the previous hidden state hl\u22121, and a latent variable zl, which is used to model uncertainty in the predicted output. See [7] for more details.\n\nUnlike GQN, GQN with ECA (E-GQN) can model relationships between pixels that are spatially distant in the context images, and can use a different representation at each layer in the decoder. And unlike more generic approaches to non-local attention [42], which require comparing each spatial location to every other spatial location (O(n^2) comparisons per pixel for an n \u00d7 n image), E-GQN only requires each spatial location to be compared to a subset of the other spatial locations (O(n) comparisons per pixel for an n \u00d7 n image).\nWe evaluate our approach on datasets from the original GQN paper and on three new datasets designed to test the ability to render systems with many degrees of freedom and a wide variety of objects. We find significant improvements in a lower bound on the negative log likelihood (the ELBO), in per-pixel mean absolute error, and in qualitative performance on most of these datasets.\nTo summarize, our key contributions are as follows:\n\n1. We introduce a novel attention mechanism, Epipolar Cross-Attention (ECA), that leverages the geometry of the camera poses to perform efficient non-local attention.\n\n2. We introduce three datasets: Disco Humanoid, OpenAI Block, and Rooms-Random-Objects as a testbed for neural rendering with complex objects and high-dimensional state.\n\n3. 
We demonstrate that ECA with GQN (E-GQN) improves neural rendering performance on those datasets.\n\n2\n\n\f2 Background\n\n2.1 Problem description\nGiven K images xk \u2208 X and corresponding camera viewpoints vk \u2208 V of a scene s, the goal of neural rendering is to learn a model that can accurately predict the image xq the camera would see from a query viewpoint vq. More formally, for distributions of scenes p(S) and images with corresponding viewpoints p(V, X | s), the goal of neural rendering is to learn a model that maximizes\n\nE_{s\u223cp(S)} E_{vq,xq\u223cp(V,X|s)} E_{vk,xk\u223cp(V,X|s)} log p(xq | (xk, vk)_{k=1,\u00b7\u00b7\u00b7,K}, vq)\n\nThis can be viewed as an instance of few-shot density estimation [29].\n\n2.2 Generative Query Networks\n\nGenerative Query Networks [7] model the likelihood above with an encoder-decoder neural network architecture. The encoder, or representation network, is a convolutional neural network that takes vk and xk as input and produces a representation r.\nThe decoder, or generation network, takes r and vq as input and predicts the image rendered from that viewpoint. Uncertainty in the output is modeled using stochastic latent variables z, producing a density g(xq | vq, r) = \u222b g(xq, z | vq, r) dz that can be approximated tractably with a variational lower bound [20, 7]. The generation network architecture is based on the skip-connection convolutional LSTM decoder from DRAW [12].\n\n2.3 Epipolar Geometry\n\nThe epipolar geometry between camera viewpoints v1 and v2 describes the geometric relationship between 3D points in the scene and their projections in images x1 and x2 rendered from pinhole cameras at v1 and v2 [14]. 
Figure 2 describes the relationship.\n\nFigure 2: Illustration of the epipolar geometry. For any image point y in x1 corresponding to a 3D point Y, the image point y\u2032 in x2 that corresponds to Y lies on a line l\u2032 in x2. This line corresponds to the projection onto x2 of the ray passing through the camera center of v1 and y, and depends only on the intrinsic geometry between v1 and v2, not on the content of the scene.\n\nThere is a linear mapping called the fundamental matrix that captures this correspondence [14]. The fundamental matrix is a mapping F from an image point y in x1 to the epipolar line l\u2032. F is a 3 \u00d7 3 matrix that depends on v1 and v2. The image point y\u2032 = (h\u2032, w\u2032, 1) corresponding to y lies on the line l\u2032 if it satisfies ah\u2032 + bw\u2032 + c = 0, where F y = [a, b, c]^T.\n\n3 Epipolar Cross-Attention\n\nIn GQN, the scene representation is an element-wise sum of context representations from each context viewpoint. The context representations are created from the raw camera images through convolutional layers. Since convolutions are local operations, long-range dependencies are difficult to model [15, 16, 42]. As a result, information from distant image points in the context representation may not propagate to the hidden state of the generation network.\nThe core idea of Epipolar Cross-Attention is to allow the features at a given spatial position y in the generation network hidden state to depend directly on all of the relevant spatial positions in the context representations. Relevant spatial positions are those that lie on the epipolar line corresponding to y in each context viewpoint.\nFigure 1 describes our model architecture. Our model is a variant of GQN [7]. 
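To make the fundamental-matrix mapping of Section 2.3 concrete, here is a minimal sketch (our illustration, not the paper's implementation; numpy, with hypothetical point and matrix values):

```python
import numpy as np

def epipolar_line(F, y):
    """Map an image point y = (h, w) in x1 to its epipolar line in x2.

    F is the 3x3 fundamental matrix between the two viewpoints.
    Returns (a, b, c) such that every corresponding point (h', w')
    in x2 satisfies a*h' + b*w' + c = 0.
    """
    h, w = y
    a, b, c = F @ np.array([h, w, 1.0])
    return a, b, c

def row_on_line(line, w_prime):
    """Solve a*h' + b*w' + c = 0 for h' at column w' (a must be nonzero)."""
    a, b, c = line
    return -(b * w_prime + c) / a
```

Given a point in the query view, `row_on_line` returns, for each column of the context feature map, the row that lies on the epipolar line; rounding this index down is the gather used to build the epipolar representation in Section 3.1.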
Instead of using r = \u2211_k rk as input to compute the next generation network hidden state hl, we use an attention map computed using our epipolar cross-attention mechanism. The next two subsections describe the attention mechanism.\n\n3.1 Computing the Epipolar Representation\n\nFor a given spatial position y = (p0, p1) in the decoder hidden state hl, the epipolar representation ek stores at (p0, p1) all of the features from rk that are relevant to rendering the image at that position.2\n\n(a) Epipolar extraction\n\n(b) Attention mechanism\n\nFigure 3: (a) Constructing the epipolar representation ek for a given camera viewpoint vk. For a given spatial position in the decoder state hl, there is a 1-dimensional subset of feature maps l\u2032 in rk arising from the epipolar geometry. This can be viewed as the projection of the line passing from the camera center at vq through the image point onto rk. The epipolar representation ek is constructed by stacking these lines along a third spatial dimension. Note that hl and rk are h\u2032 \u00d7 w\u2032 \u00d7 d\u2032 feature maps, so ek has shape h\u2032 \u00d7 w\u2032 \u00d7 h\u2032 \u00d7 d\u2032. If h = w is the size of the image, h\u2032 = w\u2032 = h/4 and d\u2032 = 256 in our experiments. (b) Our attention mechanism. Blue rectangles denote convolutional layers with the given kernel size. \u201c\u00d7\" denotes batch-wise matrix multiplication, and \u201c+\" element-wise summation. The previous decoder hidden state hl\u22121 is used to compute a query tensor Ql by linear projection. The epipolar representation ek is also linearly projected to compute a key tensor Kk and a value tensor Vk. Kk and Ql are matrix-multiplied to form unnormalized attention weights, which are scaled by 1/\u221adk. A softmax is computed along the final dimension, and the result is multiplied by Vk to get an attention score as in [41]. All of the attention scores are linearly projected into the correct output dimension and summed element-wise.\nFigure 3(a) shows how we construct the epipolar representation. To compute the epipolar line l\u2032_y in rk, we first compute the fundamental matrix F^k_q arising from camera viewpoints vq and vk [14], and then find l\u2032_y = F^k_q [p0, p1, 1]^T. If hl has shape (h\u2032, w\u2032), then for each 0 \u2264 p\u2032_1 < w\u2032,\n\nek_{p0,p1,p\u2032_1} = rk_{p\u2032_0,p\u2032_1}\n\nwhere the subscripts denote array indexing and p\u2032_0 is the point on l\u2032_y corresponding to p\u2032_1.3\nAll of these operations can be performed efficiently and differentiably in automatic differentiation libraries like Tensorflow [1] as they can be formulated as matrix multiplication or gather operations.\n\n2Note that care must be taken that the representation network does not change the effective field of view of the camera.\n3To make sure p\u2032_0 are valid array indices, we round down to the nearest integer. For indices that are too large or too small, we instead use features of all zeros.\n\n4\n\n\f3.2 Attention mechanism\n\nFigure 3(b) describes our attention mechanism in more detail. 
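As a minimal illustration of the mechanism in Figure 3(b), the following sketch handles a single context view in plain numpy; the projection matrices Wq, Wk, Wv are hypothetical stand-ins for the paper's 1\u00d71 convolutions, and this is our reading of the mechanism, not the paper's TensorFlow implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def epipolar_attention(h_prev, ek, Wq, Wk, Wv):
    """Scaled dot-product attention over one epipolar representation.

    h_prev: (h', w', d)    previous decoder hidden state
    ek:     (h', w', m, d) epipolar representation (m candidate
                           positions along each epipolar line)
    Wq: (d, dk), Wk: (d, dk), Wv: (d, dv) linear projections.
    Returns the attention score a^k_l with shape (h', w', dv).
    """
    q = h_prev @ Wq                  # (h', w', dk) query tensor
    k = ek @ Wk                      # (h', w', m, dk) key tensor
    v = ek @ Wv                      # (h', w', m, dv) value tensor
    dk = Wq.shape[1]
    # Unnormalized weights: one score per position on the epipolar line,
    # scaled by 1/sqrt(dk) as in standard dot-product attention.
    logits = np.einsum('hwc,hwmc->hwm', q, k) / np.sqrt(dk)
    weights = softmax(logits, axis=-1)   # softmax along final dimension
    return np.einsum('hwm,hwmd->hwd', weights, v)
```

In the full model this score would be computed per context view, projected to the output dimension, and summed element-wise across views.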
We map the previous decoder hidden state hl\u22121 and the epipolar representations ek to an attention score ak_l. ak_l represents the weighted contribution to each spatial position of all of the geometrically relevant features in the context representation rk.\nTypically the weights for the projections are shared between context images and decoder steps. To facilitate passing gradients to the generation network, the attention maps ak_l are provided a skip connection to rk, producing\n\nal = \u03bb \u2211_k ak_l + \u2211_k rk,\n\nwhere \u03bb is a learnable parameter. al is used as input to produce the next hidden state hl.\n\n4 Experiments\n\n4.1 Datasets\n\nTo evaluate our proposed attention mechanism, we trained GQN with Epipolar Cross-Attention (E-GQN) on four datasets from the GQN paper: Rooms-Ring-Camera (RRC), Rooms-Free-Camera (RFC), Jaco, and Shepard-Metzler-7-Parts (SM7) [7, 35]. We chose these datasets for their diversity and suitability for our method. Other datasets are either easier versions of those we used (Rooms-Free-Camera-No-Object-Rotations and Shepard-Metzler-5-Parts) or focus on modeling the room layout of a large scene (Mazes). Our technique was designed to help improve detail resolution in scenes with high degrees of freedom and complex objects, so we would not expect it to improve performance on an expansive but relatively low-detail dataset like Mazes.\nThe GQN datasets are missing several important features for robotic representation learning. First, they contain only simple geometric objects. Second, they have relatively few degrees of freedom: objects are chosen from a fixed set and placed with two positional and one rotational degree of freedom. Third, they do not require generalizing to a wide range of objects. 
Finally, with the exception of the Rooms-Free-Camera dataset, all images are size 64 \u00d7 64 or smaller.\nTo address these limitations, we created three new datasets: OpenAI Block (OAB), Disco Humanoid (Disco), and Rooms-Random-Objects (RRO)4. All of our datasets are rendered at size 128 \u00d7 128. Examples from these datasets are shown alongside our model\u2019s renderings in Figure 6.\nThe OAB dataset is a modified version of the domain-randomized [38] in-hand block manipulation dataset from [28, 27] where camera poses are additionally randomized. Since this dataset is used for sim-to-real transfer for real-world robotic tasks, it captures much of the complexity needed to use neural rendering in real-world robotics, including a 24-DoF robotic actuator and a block with letters that must be rendered in the correct 6-DoF pose.\nThe Disco dataset is designed to test the model\u2019s ability to accurately capture many degrees of freedom. It consists of the 27-DoF MuJoCo [39] Humanoid model from OpenAI Gym [3] rendered with each of its joints in a random position in [\u2212\u03c0, \u03c0). Each of the geometric shape components of the Humanoid\u2019s body is rendered with a random simple texture.\nThe RRO dataset captures the ability of models to render a broad range of complex objects. Scenes are created by sampling 1-3 objects randomly from the ShapeNet [4] object database. The floor and walls of the room, as well as each of the objects, are rendered using random simple textures.\n\n4.2 Experimental setup\n\nWe use the \u201cTower\" representation network from [7]. Our generation network is from Figure S2 of [7] with the exception of our attention mechanism. The convolutional LSTM hidden state and skip connection state have 192 channels. The generation network has 12 layers, and weights are shared between generation steps.\n\n4Our datasets are available here: https://github.com/josh-tobin/egqn-datasets\n\n5\n\n\fFigure 4: ELBO (nats/dim) on the test set. The minimum y-value denotes the theoretical minimum error. We compute this value by setting the KL term to 0 and the mean of the output distribution to the true target image. Note that this value differs for the GQN datasets and ours because we use a different output variance on our datasets, as discussed in Section 4.2.\n\nWe always use 3 context images. Key dimension dk = 64 for all experiments, and value dimension dv = 128 on the GQN datasets with dv = 192 on our datasets.\nWe train our models using the Adam optimizer [19]. We ran a small hyperparameter sweep to choose the learning rate schedule and found that a learning rate of 1e-4 or 2e-4, linearly ramped up from 2e-5 over 25,000 optimizer steps and then linearly decayed by a factor of 10 over 1.6M optimizer steps, performs best in our experiments.\nWe use a batch size of 36 in experiments on the GQN datasets and 32 on our datasets. We train our models on 25M examples on 4 Tesla V-100s (GQN datasets) or 8 Tesla V-100s (our datasets).\nAs in [7], we evaluate samples from the model with random latent variables, but taking the mean of the output distribution. Input and output images are scaled to [\u22120.5, 0.5] on the GQN datasets and [\u22121, 1] on ours. 
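For reference, the per-pixel error metrics reported in our results can be computed as in the following sketch; the rescaling from the model's [\u22121, 1] output range back to 0\u2013255 is our assumption about how the "pixels" units are obtained:

```python
import numpy as np

def pixel_errors(pred, target):
    """Per-pixel mean absolute error and RMSE in 0-255 pixel units.

    pred and target are float arrays scaled to [-1, 1]; we map back
    to 0-255 before computing the metrics. (The [-1, 1] -> 0-255
    conversion is an assumption about the reported units.)
    """
    p = (np.asarray(pred, dtype=float) + 1.0) * 127.5
    t = (np.asarray(target, dtype=float) + 1.0) * 127.5
    err = p - t
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))
```

Note that RMSE is always at least as large as MAE, which is one quick sanity check on the two error columns.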
Output variance is scaled as in [7] on the GQN datasets but fixed at 1.4 on ours.\n\n4.3 Quantitative results\n\nDataset | MAE (pixels) GQN | MAE (pixels) E-GQN | RMSE (pixels) GQN | RMSE (pixels) E-GQN | ELBO (nats/dim) GQN | ELBO (nats/dim) E-GQN\nrrc | 7.40 \u00b1 6.22 | 3.59 \u00b1 2.10 | 14.62 \u00b1 12.77 | 6.80 \u00b1 5.23 | 0.5637 \u00b1 0.0013 | 0.5629 \u00b1 0.0008\nrfc | 12.44 \u00b1 12.89 | 12.05 \u00b1 12.79 | 26.80 \u00b1 21.35 | 27.65 \u00b1 20.72 | 0.5637 \u00b1 0.0011 | 0.5639 \u00b1 0.0012\njaco | 4.30 \u00b1 1.12 | 4.00 \u00b1 0.90 | 8.58 \u00b1 2.94 | 7.43 \u00b1 2.32 | 0.5634 \u00b1 0.0007 | 0.5631 \u00b1 0.0005\nsm7 | 3.13 \u00b1 1.30 | 2.14 \u00b1 0.53 | 9.97 \u00b1 4.34 | 5.63 \u00b1 2.21 | 0.5637 \u00b1 0.0009 | 0.5628 \u00b1 0.0004\noab | 10.99 \u00b1 5.13 | 5.47 \u00b1 2.54 | 22.11 \u00b1 8.00 | 10.39 \u00b1 4.55 | 1.2587 \u00b1 0.0018 | 1.2569 \u00b1 0.0011\ndisco | 18.86 \u00b1 7.16 | 12.46 \u00b1 9.27 | 32.72 \u00b1 6.32 | 22.04 \u00b1 11.08 | 1.2635 \u00b1 0.0055 | 1.2574 \u00b1 0.0007\nrro | 10.12 \u00b1 5.15 | 6.59 \u00b1 3.23 | 19.63 \u00b1 9.14 | 12.08 \u00b1 6.52 | 1.2573 \u00b1 0.0011 | 1.2566 \u00b1 0.0009\n\nFigure 5: Performance of GQN and E-GQN. Note: ELBO scaling is due to different choices of output variance as discussed in Figure 4.\n\nFigure 4 shows the learning performance of our method. Figure 5 shows the quantitative performance of the model after training. Both show that our method significantly outperforms the baseline on most datasets, with the exception of Jaco and RFC, where both methods perform about the same.\n\n6\n\n\f4.4 Qualitative results\n\nFigure 6 shows randomly chosen samples rendered by our model on our datasets. On OAB, our model near-perfectly captures the pose of the block and hand and faithfully reproduces their textures, whereas the baseline model often misrepresents the pose and textures. On Disco, ours more accurately renders the limbs and shadow of the humanoid. 
On RRO, ours faithfully (though not always accurately) renders the shape of objects, whereas the baseline often renders the wrong object in the wrong location. Quality differences are more subtle on the original GQN datasets.\nFor more examples, including on those datasets, see the website for this paper5.\n\n4.5 Discussion\n\nE-GQN improves quantitative and qualitative neural rendering performance on most of the datasets in our evaluations. We hypothesize that the improved performance is due to the ability of our model to query features from spatial locations in the context images that correspond in 3D space, even when those spatial locations are distant in pixel space.\nOur model does not improve over the baseline for the Jaco and RFC datasets. Jaco has relatively few degrees of freedom, and both methods perform well. In RFC, since the camera moves freely, objects contained in the target viewpoint are usually not contained in context images. Hence the lack of performance improvement on RFC is consistent with our intuition that E-GQN helps when there are 3D points contained in both the context and target viewpoints.\nThere are two performance disadvantages of our implementation of E-GQN. First, E-GQN requires computing the epipolar representation ek for each context viewpoint. Each ek is an h\u2032 \u00d7 w\u2032 \u00d7 h\u2032 \u00d7 d\u2032 tensor, which could cause issues fitting ek into GPU memory for larger image sizes. Second, due to the extra computation of ek and the attention maps al, E-GQN processes around 30% fewer samples per second than GQN in our experiments. In practice, E-GQN reaches a given loss value significantly faster in wall clock time on most datasets due to better data efficiency.\n\nFigure 6: Images rendered by our model. 
See the website for this paper6 for more examples.\n\n5 Related work\n\n5.1 Multi-view 3D reconstruction\n\nConstructing models of 3D scenes from multiple camera views is a widely explored sub\ufb01eld of\ncomputer vision. If the camera poses are unknown, Structure-from-Motion (SfM) techniques [32, 2]\n\n5https://sites.google.com/view/geometryaware-neuralrendering/home\n\n7\n\nOABDiscoRRO\f(for unordered images) or Simultaneous Localization and Mapping (SLAM) techniques [5] (for\nordered images from a real-time system) are typically used. If camera poses are known, multi-view\nstereo or multi-view reconstruction (MVR) can be applied.\nMVR techniques differ in how they represent the scene. Voxels [6], level-sets [8], depth maps\n[36], and combinations thereof are common [33]. They also differ in how they construct the scene\nrepresentation. Popular approaches include adding parts that meet a cost threshold [34], iteratively\nremoving parts that do not [22, 10], or \ufb01tting a surface to extracted feature points [26].\nMost MVR techniques do not rely on ground truth scene representations and instead depend on some\nnotion of consistency between the generated scene representation and the input images like scene\nspace or image space photo consistency measures [18, 22, 33].\n\n5.2 Deep learning for 3D reconstruction\n\nRecently, researchers have used deep learning to learn the mapping from images to a scene represen-\ntation consisting of voxels [40, 44, 43, 30, 45] or meshes [30], with supervisory signal coming from\nverifying the 3D volume against known depth images [40, 45] or coming from a large-scale 3D model\ndatabase [44, 30]. Loss functions include supervised losses [45], generative modeling objectives [30],\na 3D analog of deep belief networks [23, 44], and a generative adversarial loss [43, 11].\nSome neural network approaches to 3D understanding instead create implicit 3D models of the\nworld. 
By training an agent end-to-end using deep reinforcement learning [25] or path planning and imitation learning [13], agents can learn good enough models of their environments to perform tasks in them successfully. Like our work, Gupta and coauthors also incorporate geometric primitives into their model architecture, transforming viewpoint representations into world coordinates using spatial transformer layers [17]. Instead of attempting to learn 3D representations that help solve a downstream task, other approaches learn generic 3D representations by performing multi-task learning on a variety of supervised learning tasks like pose estimation [46].\n\n5.3 View synthesis and neural rendering\n\nNeural rendering or view synthesis approaches learn an implicit representation of the 3D structure of the scene by training a neural network end-to-end to render the scene from an unknown viewpoint. In [37], the authors map an image of a scene to an RGB-D image from an unknown viewpoint with an encoder-decoder architecture, and train their model using supervised learning. Others have proposed incorporating the geometry of the scene into the neural rendering task. In [9], plane-sweep volumes are used to estimate the depth of points in the scene, which are colored by a separate network to perform view interpolation (i.e., the input and output images are close together). Instead of synthesizing pixels from scratch, other work explores using CNNs to predict appearance flow [47].\nIn [7], the authors propose the Generative Query Network (GQN) model architecture for neural rendering. Previous extensions to GQN include augmenting it with a patch-attention mechanism [31] and extending it to temporal data [21].\n\n6 Conclusion\n\nIn this work, we present a geometrically motivated attention mechanism that allows neural rendering models to learn more accurate 3D representations and scale to more complex datasets with higher dimensional images. 
We show that our model outperforms an already strong baseline. Future work could explore extending our approach to real-world data and higher-dimensional images.\nThe core insight of this paper is that injecting geometric structure into neural networks can improve conditional generative modeling performance. Another interesting direction could be to apply this insight to other types of data. For example, future work could explore uncalibrated or moving camera systems, video modeling, depth prediction from stereo cameras, or constructing explicit 3D models.\n\n8\n\n\fReferences\n[1] Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265\u2013283, 2016.\n\n[2] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M Seitz, and Richard Szeliski. Building Rome in a day. In 2009 IEEE 12th International Conference on Computer Vision, pages 72\u201379. IEEE, 2009.\n\n[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.\n\n[5] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part I. IEEE Robotics & Automation Magazine, 13(2):99\u2013110, 2006.\n\n[6] Peter Eisert, Eckehard Steinbach, and Bernd Girod. Multi-hypothesis, volumetric reconstruction of 3-D objects from multiple calibrated camera views. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 
99CH36258),\nvolume 6, pages 3509\u20133512. IEEE, 1999.\n\n[7] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta\nGarnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene\nrepresentation and rendering. Science, 360(6394):1204\u20131210, 2018.\n\n[8] Olivier Faugeras and Renaud Keriven. Variational principles, surface evolution, PDE\u2019s, level\n\nset methods and the stereo problem. IEEE, 2002.\n\n[9] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict\nnew views from the world\u2019s imagery. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 5515\u20135524, 2016.\n\n[10] Pascal Fua and Yvan G Leclerc. Object-centered surface reconstruction: Combining multi-\n\nimage stereo and shading. International Journal of Computer Vision, 16(1):35\u201356, 1995.\n\n[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw:\n\nA recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.\n\n[13] Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying map and landmark\n\nbased representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017.\n\n[14] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge\n\nuniversity press, 2003.\n\n[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[16] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. 
Neural computation,\n\n9(8):1735\u20131780, 1997.\n\n[17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In\n\nAdvances in neural information processing systems, pages 2017\u20132025, 2015.\n\n[18] Hailin Jin et al. Tales of shape and radiance in multiview stereo. In Proceedings Ninth IEEE\n\nInternational Conference on Computer Vision, pages 974\u2013981. IEEE, 2003.\n\n9\n\n\f[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[20] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[21] Ananya Kumar, SM Eslami, Danilo J Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart,\nand Murray Shanahan. Consistent generative query networks. arXiv preprint arXiv:1807.02033,\n2018.\n\n[22] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International\n\njournal of computer vision, 38(3):199\u2013218, 2000.\n\n[23] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief\nnetworks for scalable unsupervised learning of hierarchical representations. In Proceedings of\nthe 26th annual international conference on machine learning, pages 609\u2013616. ACM, 2009.\n\n[24] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep\n\nvisuomotor policies. The Journal of Machine Learning Research, 17(1):1334\u20131373, 2016.\n\n[25] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino,\nMisha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in\ncomplex environments. arXiv preprint arXiv:1611.03673, 2016.\n\n[26] Daniel D Morris and Takeo Kanade. Image-consistent surface triangulation. In Proceedings\nIEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662),\nvolume 1, pages 332\u2013338. IEEE, 2000.\n\n[27] OpenAI. 
Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

[28] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

[29] Scott Reed, Yutian Chen, Thomas Paine, Aäron van den Oord, SM Eslami, Danilo Rezende, Oriol Vinyals, and Nando de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.

[30] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In Advances in Neural Information Processing Systems, pages 4996–5004, 2016.

[31] Dan Rosenbaum, Frederic Besse, Fabio Viola, Danilo J Rezende, and SM Eslami. Learning models for visual 3D localization with implicit mapping. arXiv preprint arXiv:1807.03149, 2018.

[32] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

[33] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 519–528. IEEE, 2006.

[34] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.

[35] Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.

[36] Richard Szeliski. A multi-view approach to motion and stereo. In Proceedings.
1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), volume 1, pages 157–163. IEEE, 1999.

[37] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.

[38] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

[39] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[40] Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2897–2905, 2018.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.

[43] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.

[44] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao.
3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[45] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.

[46] Amir R Zamir, Tilman Wekel, Pulkit Agrawal, Colin Wei, Jitendra Malik, and Silvio Savarese. Generic 3D representation via pose estimation and matching. In European Conference on Computer Vision, pages 535–553. Springer, 2016.

[47] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286–301. Springer, 2016.