{"title": "FPNN: Field Probing Neural Networks for 3D Data", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 315, "abstract": "Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3DCNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points -- sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space \"intelligently\", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3DCNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.", "full_text": "FPNN: Field Probing Neural Networks for 3D Data

Yangyan Li1,2  Sören Pirk1  Hao Su1  Charles R. Qi1  Leonidas J. Guibas1

1Stanford University, USA  2Shandong University, China

Abstract

Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. 
Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3DCNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points — sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space "intelligently", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3DCNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.

1 Introduction

Figure 1: The sparsity characteristic of 3D data in occupancy grid representation. 3D occupancy grids at resolutions 30, 64 and 128 are shown in this figure, together with their density, defined as #occupied grids / #total grids. It is clear that the 3D occupancy grid gets sparser and sparser as the fidelity of the surface approximation increases.

Rapid advances in 3D sensing technology have made 3D data ubiquitous and easily accessible, rendering it an important data source for high level semantic understanding in a variety of environments. 
The semantic understanding problem, however, remains very challenging for 3D data, as it is hard to find an effective scheme for converting input data into informative features for further processing by machine learning algorithms. For semantic understanding problems in 2D images, deep CNNs [15] have been widely used and have achieved great success, with the convolutional layers playing an essential role. They provide a set of 2D filters which, when convolved with input data, transform the data into informative features for higher level inference.

In this paper, we focus on the problem of learning a 3D shape representation by a deep neural network. We keep two goals in mind when designing the network: the shape features should be discriminative for shape recognition and efficient to extract at runtime. However, existing 3D CNN pipelines that simply replace the conventional 2D filters by 3D ones [31, 19] have difficulty in capturing geometric structures with sufficient efficiency. The input to these 3D CNNs is voxelized shapes represented by occupancy grids, in direct analogy to the pixel array representation for images. We observe that the computational cost of 3D convolution is quite high, since convolving 3D voxels has cubic complexity with respect to spatial resolution, one order higher than the 2D case. 
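This scaling argument can be checked with a short counting sketch; the kernel size and channel count below are illustrative assumptions, not the paper's architecture:

```python
# Multiply-accumulate (MAC) counts for a layer that slides a dense kernel
# over the whole domain. Doubling the resolution multiplies the 2D cost by
# 4, but the 3D cost by 8 -- one order higher, as stated above.

def conv2d_macs(resolution, kernel=3, channels=48):
    # kernel^2 MACs per output location, ~resolution^2 locations per channel
    return kernel ** 2 * channels * resolution ** 2

def conv3d_macs(resolution, kernel=3, channels=48):
    # kernel^3 MACs per output location, ~resolution^3 locations per channel
    return kernel ** 3 * channels * resolution ** 3

if __name__ == "__main__":
    print(conv2d_macs(64) // conv2d_macs(32))  # 4: quadratic growth
    print(conv3d_macs(64) // conv3d_macs(32))  # 8: cubic growth
```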
Due to this high computational cost, researchers typically choose a 30 × 30 × 30 resolution to voxelize shapes [31, 19], which is significantly lower than the widely adopted 227 × 227 resolution for processing images [24]. We suspect that the strong artifacts introduced at this level of quantization (see Figure 1) hinder the process of learning effective 3D convolutional filters.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

(Figure 1 grid densities: 10.41%, 5.09% and 2.41%.)

Figure 2: A visualization of probing filters before (a) and after (d) training them for extracting 3D features. The colors associated with each probing point visualize its filter weights. Note that probing points belonging to the same filter are linked together for visualization purposes. (b) and (c) are subsets of the probing filters of (a) and (d), shown to better visualize that not only the weights on the probing points but also their locations are optimized, so that the filters better "sense" the space.

Two significant differences between 2D images and 3D shapes interfere with the success of directly applying 2D CNNs on 3D data. First, as the voxel resolution grows, the grids occupied by shape surfaces get sparser and sparser (see Figure 1). The convolutional layers that are designed for 2D images thereby waste much computational resource in such a setting, since they convolve with 3D blocks that are largely empty and a large portion of the multiplications are with zeros. Moreover, as the voxel resolution grows, the local 3D blocks become less and less discriminative. 
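The sparsity trend in Figure 1 is easy to reproduce with a toy shape; the sphere-shell voxelization below is a simplified stand-in for the paper's voxelizer, not its actual pipeline:

```python
import numpy as np

# Fraction of occupied cells when voxelizing the *surface* of a sphere at
# increasing grid resolutions. Surface voxels grow ~r^2 while the grid grows
# ~r^3, so the density drops roughly as 1/r, matching the trend in Figure 1.

def surface_density(resolution, radius=0.4):
    centers = (np.arange(resolution) + 0.5) / resolution - 0.5
    x, y, z = np.meshgrid(centers, centers, centers, indexing="ij")
    dist = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    # a cell is "occupied" if its center lies within half a voxel of the surface
    occupied = np.abs(dist - radius) < 0.5 / resolution
    return occupied.mean()

if __name__ == "__main__":
    for r in (30, 64, 128):
        print(r, surface_density(r))  # density decreases as resolution grows
```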
To capture informative features, long range connections have to be established to take distant voxels into consideration. This long range effect demands larger 3D filters, which yields an even higher computation overhead.

To address these issues, we represent 3D data as 3D fields, and propose a field probing scheme, which samples the input field with a set of probing filters (see Figure 2). Each probing filter is composed of a set of probing points, which determine the shape and location of the filter, and filter weights associated with the probing points. In typical CNNs, only the filter weights are trained, while the filter shapes themselves are fixed. In our framework, due to the usage of a 3D field representation, both the weights and the probing point locations are trainable, making the filters highly flexible in coupling long range effects and adapting to the sparsity of 3D data when it comes to feature extraction. The amount of computation of our field probing scheme is determined by how many probing filters we place in the 3D space, and how many probing points are sampled per filter. Thus, the computational complexity does not grow as a function of the input resolution. We found that a small set of field probing filters is enough for sampling sufficient information, probably due to the sparsity characteristic of 3D data.

Intuitively, we can think of our field probing scheme as a set of sensors placed in the space to collect informative signals for high level semantic tasks. With the long range connections between the sensors, a global overview of the underlying object can easily be established for effective inference. Moreover, the sensors are "smart" in the sense that they learn how to sense the space (by optimizing the filter weights), as well as where to sense (by optimizing the probing point locations). 
Note that the intelligence of the sensors is not hand-crafted, but solely derived from data. We evaluate our field probing based neural networks (FPNN) on a classification task on the ModelNet [31] dataset, and show that they match the performance of 3DCNNs while requiring much less computation, as they are designed and trained to respect the sparsity of 3D data.

2 Related Work

3D Shape Descriptors. 3D shape descriptors lie at the core of shape analysis, and a large variety of shape descriptors have been designed in the past few decades. 3D shapes can be converted into 2D images and represented by descriptors of the converted images [13, 4]. 3D shapes can also be represented by their inherent statistical properties, such as distance distributions [22] and spherical harmonic decompositions [14]. Heat kernel signatures extract shape descriptions by simulating a heat diffusion process on 3D shapes [29, 3]. In contrast, we propose an approach for learning the shape descriptor extraction scheme, rather than hand-crafting it.

Convolutional Neural Networks. The architecture of CNNs [15] is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal), and CNNs have advanced the performance records in most image understanding tasks in computer vision [24]. An important reason for this success is that by leveraging large image datasets (e.g., ImageNet [6]), general purpose image descriptors can be directly learned from data, which adapt to the data better and outperform hand-crafted features [16]. Our approach follows this paradigm of feature learning, but is specifically designed for 3D data coming from object surface representations.

CNNs on Depth and 3D Data. With rapid advances in 3D sensing technology, depth has become available as an additional information channel beyond color. 
Such 2.5D data can be represented as multi-channel images and processed by 2D CNNs [26, 10, 8]. Wu et al. [31], in a pioneering paper, proposed to extend 2D CNNs to process 3D data directly (3D ShapeNets). A similar approach (VoxNet) was proposed in [19]. However, such approaches cannot work on high resolution 3D data, as their computational complexity is a cubic function of the voxel grid resolution. Since CNNs for images have been extensively studied, 3D shapes can instead be rendered into 2D images and represented by the CNN features of those images [25, 28], which, surprisingly, outperforms 3D CNN approaches on a 3D shape classification task. Recently, Qi et al. [23] presented an extensive study of these volumetric and multi-view CNNs and refreshed the performance records. In this work, we propose a feature learning approach that is specifically designed to take advantage of the sparsity of 3D data, and compare against the results reported in [23]. Note that our method is a purely extrinsic construction and was designed without explicit consideration of deformable objects. When 3D data is represented as meshes, neural networks can benefit from intrinsic constructions [17, 18, 1, 2] to learn object invariance to isometries, thus requiring less training data for handling deformable objects.

Our method can be viewed as an efficient scheme of sparse coding [7]. The learned weights of each probing filter can be interpreted as the entries of the coding matrix in the sparse coding framework. Compared with conventional sparse coding, our framework is not only computationally more tractable, but also enables an end-to-end learning system.

3 Field Probing Neural Network

3.1 Input 3D Fields

We study the 3D shape classification problem by employing a deep neural network. The input of our network is a 3D vector field built from the input shape and the output is an object category label. 
3D shapes represented as meshes or point clouds can be converted into 3D distance fields. Given a mesh (or point cloud), we first convert it into a binary occupancy grid representation, where the binary occupancy value in each grid cell is determined by whether it intersects with any mesh surface (or contains any sample point). Then we treat the occupied cells as the zero level set of a surface, and apply a distance transform to build a 3D distance field D, which is stored in a 3D array indexed by (i, j, k), where i, j, k = 1, 2, ..., R, and R is the resolution of the distance field. We denote the distance value at (i, j, k) by D(i,j,k). Note that D represents distance values at discrete grid locations. The distance value at an arbitrary location d(x, y, z) can be computed by standard trilinear interpolation over D. See Figure 3 for an illustration of the 3D data representations.

Figure 3: A 3D mesh (a) or point cloud (b) can be converted into an occupancy grid (c), from which the input to our algorithm — a 3D distance field (d) — is obtained via a distance transform. We further transform it into a Gaussian distance field (e) to focus attention on the space near the surface. The fields are visualized by two crossing slices.

Similar to 3D distance fields, other 3D fields, such as the normal fields Nx, Ny, and Nz, can also be used for representing shapes. Note that the normal fields can be derived from the gradient of the distance field: Nx(x, y, z) = (1/l) ∂d/∂x, Ny(x, y, z) = (1/l) ∂d/∂y, Nz(x, y, z) = (1/l) ∂d/∂z, where l = |(∂d/∂x, ∂d/∂y, ∂d/∂z)|. Our framework can employ any set of fields as input, as long as the gradients can be computed.

3.2 Field Probing Layers

The basic modules of deep neural networks are layers, which gradually convert input to output in a forward pass, and get updated during a backward pass through the back-propagation [30] mechanism. The key contribution of our approach is that we replace the convolutional layers in CNNs by field probing layers, a novel component that uses field probing filters to efficiently extract features from the 3D vector field. They are composed of three layers: the Sensor layer, the DotProduct layer and the Gaussian layer. The Sensor layer is responsible for collecting the signals (the values in the input fields) at the probing points in the forward pass, and updating the probing point locations in the backward pass. The DotProduct layer computes the dot product between the probing filter weights and the signals from the Sensor layer. The Gaussian layer is a utility layer that transforms the distance field into a representation that is more friendly for numerical computation. We introduce them in the following paragraphs, and show that they fit well for training a deep network.

Figure 4: Initialization of field probing layers. For simplicity, a subset of the filters are visualized.

Sensor Layer. The input to this layer is a 3D field V, where V(x, y, z) yields a T channel (T = 1 for a distance field and T = 3 for normal fields) vector at location (x, y, z). This layer contains C probing filters scattered in space, each with N probing points. 
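The core sensing operation of this layer — reading the field at arbitrary probing locations via the trilinear interpolation described in Section 3.1 — can be sketched with SciPy; the toy occupancy grid below is an assumption for illustration, not a ModelNet shape:

```python
import numpy as np
from scipy import ndimage

# Occupancy grid -> distance field (distance transform), then a Sensor-layer
# style read-out: trilinear interpolation at non-integer probing locations.

R = 32
occupancy = np.zeros((R, R, R), dtype=bool)
occupancy[8:24, 8:24, 12] = True  # a flat patch of "surface" voxels

# distance (in voxels) from each cell to the nearest occupied cell
D = ndimage.distance_transform_edt(~occupancy)

# probing points at arbitrary locations, one on and one off the surface
probing_points = np.array([[16.0, 16.0, 12.0],
                           [16.5, 16.5, 20.25]])

# order=1 gives exactly trilinear interpolation over the grid D
samples = ndimage.map_coordinates(D, probing_points.T, order=1)
print(samples[0], samples[1])  # 0.0 on the surface, > 0 away from it
```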
The parameters of this layer are the locations of all probing points {(xc,n, yc,n, zc,n)}, where c indexes the filter and n indexes the probing point within each filter. This layer simply outputs the vector at the probing points, V(xc,n, yc,n, zc,n). The output of this layer forms a data chunk of size C × N × T.

The gradient of this function, ∇V = (∂V/∂x, ∂V/∂y, ∂V/∂z), can be evaluated by numerical computation, and is used for updating the locations of the probing points in the back-propagation process. This formal definition emphasizes why we need the input to be represented as 3D fields: the gradients computed from the input fields are the forces that push the probing points towards more informative locations until they converge to a local optimum.

DotProduct Layer. The input to this layer is the output of the Sensor layer — a data chunk of size C × N × T, denoted as {pc,n,t}. The parameters of the DotProduct layer are the filter weights associated with the probing points, i.e., there are C filters, each of length N, in T channels. We denote the set of parameters as {wc,n,t}. The function at this layer computes a dot product between {pc,n,t} and {wc,n,t}, and outputs vc = v({pc,i,j}, {wc,i,j}) = Σ_{i=1,...,N; j=1,...,T} pc,i,j × wc,i,j, yielding one value per filter — a C-dimensional vector overall. The gradient for the backward pass is: ∇vc = (∂v/∂{pc,i,j}, ∂v/∂{wc,i,j}) = ({wc,i,j}, {pc,i,j}).

Typical convolution encourages weight sharing within an image patch by "zipping" the patch into a single value for upper layers via a dot product between the patch and a 2D filter. Our DotProduct layer shares the same "zipping" idea, which makes it practical to follow with fully connected layers: probing points are grouped into probing filters to generate output with lower dimensionality.

Another option in designing convolutional layers is to decide whether their weights should be shared across different spatial locations. In 2D CNNs, these parameters are usually shared when processing general images. In our case, we opt not to share the weights, as information is not evenly distributed in 3D space, and we encourage our probing filters to individually deviate to adapt to the data.

Gaussian Layer. Samples at locations distant from the object surface are associated with large values in the distance field. Directly feeding them into the DotProduct layer does not converge and thus does not yield reasonable performance. To emphasize the importance of samples in the vicinity of the object surface, we apply a Gaussian transform (inverse exponential) on the distances, so that regions approaching the zero surface have larger weights while distant regions matter less (see footnote 1). We implement this transform with a Gaussian layer. Its input is the output values of the Sensor layer. Assuming these values are {x}, this layer applies an element-wise Gaussian transform g(x) = exp(-x^2 / (2σ^2)), and the gradient for the backward pass is ∇g = -(x/σ^2) exp(-x^2 / (2σ^2)).

Complexity of Field Probing Layers. The complexity of field probing layers is O(C × N × T), where C is the number of probing filters, N is the number of probing points on each filter, and T is the number of input fields. The complexity of a convolutional layer is O(K^3 × C × S^3), where K is the 3D kernel size, C is the output channel number, and S is the number of sliding locations in each dimension. 
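A minimal numerical sketch of the DotProduct and Gaussian layers, with the complexity comparison spelled out; the shapes follow the text, while the random inputs are placeholders for real Sensor-layer outputs:

```python
import numpy as np

# Sensor output chunk: C filters x N probing points x T field channels.
C, N, T = 1024, 8, 4
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 5.0, size=(C, N, T))  # sensed distance-like values
w = rng.normal(size=(C, N, T))             # per-filter weights (not shared)

# Gaussian layer: element-wise g(x) = exp(-x^2 / (2 sigma^2)), with
# gradient dg/dx = -(x / sigma^2) * g(x), matching the formulas above.
sigma = 1.0
g = np.exp(-p ** 2 / (2 * sigma ** 2))
dg = -(p / sigma ** 2) * g

# DotProduct layer: one scalar per filter -> a C-dimensional output.
v = np.einsum("cnt,cnt->c", g, w)

# Cost comparison from the text: C*N*T vs K^3 * C_out * S^3.
probing_ops = C * N * T
conv_ops = 6 ** 3 * 48 * 12 ** 3
print(v.shape, probing_ops / conv_ops)  # (1024,) and roughly 0.0018
```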
In field probing layers, we typically use C = 1024, N = 8, and T = 4 (distance and normal fields), while in a 3D CNN, K = 6, C = 48 and S = 12. Compared with convolutional layers, field probing layers save the majority of the computation (1024 × 8 × 4 ≈ 0.18% of 6^3 × 48 × 12^3), as the probing filters in field probing layers are capable of learning where to "sense", whereas convolutional layers exhaustively examine everywhere by sliding the 3D kernels.

(Footnote 1: Applying batch normalization [11] on the distances also resolves the problem. However, the Gaussian transform has two advantages: 1. it can be approximated by truncated distance fields [5], which are widely used in real time scanning and can be compactly stored by voxel hashing [21]; 2. it is more efficient to compute than batch normalization, since it is an element-wise operation.)

Initialization of Field Probing Layers. There are two sets of parameters: the probing point locations and the weights associated with them. To encourage the probing points to explore as many potential locations as possible, we initialize them to be widely distributed in the input fields. We first divide the space into G × G × G grids and then generate P filters in each grid. Each filter is initialized as a line segment with a random orientation, a random length in [llow, lhigh] (we use [llow, lhigh] = [0.2, 0.8] * R by default), and a random center point within the grid it belongs to (Figure 4, left). Note that a probing filter spans a large distance in the 3D space, so it captures long range effects well. This is a property that distinguishes our design from convolutional layers, as they have to increase the kernel size to capture long range effects, at the cost of increased complexity. The weights of the field probing filters are initialized by the Xavier scheme [9]. 
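The initialization just described can be sketched as follows; G, P and N are set to representative sizes consistent with C = 1024, and the even spacing of points along each segment is an assumption:

```python
import numpy as np

# Initialize C = G^3 * P probing filters: each is a random line segment with
# N probing points, centered inside its own grid cell, with a random
# orientation and a random length in [l_low, l_high] * R.

def init_probing_filters(R=64, G=4, P=16, N=8, l_low=0.2, l_high=0.8, seed=0):
    rng = np.random.default_rng(seed)
    cell = R / G
    filters = []
    for gx in range(G):
        for gy in range(G):
            for gz in range(G):
                for _ in range(P):
                    center = (np.array([gx, gy, gz]) + rng.random(3)) * cell
                    direction = rng.normal(size=3)
                    direction /= np.linalg.norm(direction)
                    length = rng.uniform(l_low, l_high) * R
                    offsets = np.linspace(-0.5, 0.5, N)[:, None] * length
                    filters.append(center + offsets * direction)
    # clamp points so every sensor stays inside the field
    return np.clip(np.array(filters), 0.0, R)

filters = init_probing_filters()
print(filters.shape)  # (1024, 8, 3): C = 4^3 * 16 filters, N = 8 points each
```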
In Figure 4 (right), the weights for the distance field are visualized by probing point colors, and the weights for the normal fields by arrows attached to each probing point.

FPNN Architecture and Usage. Field probing layers transform input 3D fields into an intermediate representation, which can be further processed and eventually linked to task specific loss layers (Figure 5). To further encourage long range connections, we feed the output of our field probing layers into fully connected layers. The advantage of long range connections makes it possible to stick with a small number of probing filters, while the small number of probing filters makes it possible to directly use fully connected layers.

Figure 5: FPNN architecture. Field probing layers can be used together with other inference layers to minimize task specific losses.

Object classification is widely used in computer vision as a testbed for evaluating neural network designs, and the neural network parameters learned from this task may be transferred to other high-level understanding tasks such as object retrieval and scene parsing. Thus we choose 3D object classification as the task for evaluating our FPNN.

4 Results and Discussions

4.1 Timing

We implemented our field probing layers in Caffe [12]. The Sensor layer is parallelized by assigning the computation for each probing point to one GPU thread, and the DotProduct layer by assigning the computation for each probing filter to one GPU thread. Figure 6 shows a run time comparison between convolutional layers and field probing layers on different input resolutions. The computation cost of our field probing layers is agnostic to the input resolution; the slight increase of the run time at higher resolutions is due to GPU memory latency introduced by the larger 3D fields. 
Note that the convolutional layers in [12] are based on the highly optimized cuBLAS library from NVIDIA, while our field probing layers are implemented with our own naive parallelism, which is likely to be further improved.

4.2 Datasets and Evaluation Protocols

Figure 6: Running time of convolutional layers (same settings as in [31]) and field probing layers (C × N × T = 1024 × 8 × 4) on an Nvidia GTX TITAN with batch size 8 (see footnote 3). (The plotted running times range from 1.99 ms to 234.9 ms over grid resolutions 16 to 227.)

(Footnote 3: The batch size is chosen to make sure the largest resolution data fits well in GPU memory.)

We use ModelNet40 [31] (12,311 models from 40 categories, training/testing split with 9,843/2,468 models; see footnote 4), the standard benchmark for the 3D object classification task, in our experiments. Models in this dataset are already aligned with a canonical orientation.

(Footnote 4: The split is provided on the authors' website. In their paper, a split composed of at most 80/20 training/testing models per category was used, which is tiny for deep learning tasks and thus prone to overfitting. Therefore, we report and compare our performance on the whole ModelNet40 dataset.)

For 3D object recognition scenarios in the real world, the gravity direction can often be captured by the sensor, but the horizontal "facing" direction of the objects is unknown. We augment the ModelNet40 data by randomly rotating the shapes horizontally. Note that this is done for both training and testing samples; thus in the testing phase, the orientation of the inputs is unknown. 
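The horizontal-rotation augmentation can be sketched as follows; operating on surface sample points before voxelization is an assumption for brevity, and the gravity axis is taken to be z:

```python
import numpy as np

# Rotate a shape's sample points by a random angle about the vertical axis,
# so the network never sees a fixed horizontal "facing" direction.

def augment_horizontal(points, rng):
    """points: (n, 3) array of surface samples; returns a yaw-rotated copy."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
rotated = augment_horizontal(points, rng)
# heights and distances from the vertical axis are unchanged by the rotation
print(np.allclose(rotated[:, 2], points[:, 2]))  # True
```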
This allows us to assess how well the trained network performs on real world data.

Table 1: Top-1 accuracy of FPNNs on the 3D object classification task on the ModelNet40 dataset.

Setting  w/o FP  w/ FP  +NF
1-FC     79.1    85.0   86.0
4-FCs    86.6    87.5   88.4

4.3 Performance of Field Probing Layers

We train our FPNN for 80,000 iterations on 64 × 64 × 64 distance fields with batch size 1024 (see footnote 5), with an SGD solver, learning rate 0.01, momentum 0.9, and weight decay 0.0005.

To study the performance of our field probing layers separately, we build an FPNN with only one fully connected layer that converts the output of the field probing layers into the representation for the softmax classification loss (1-FC setting). Batch normalization [11] and rectified linear units [20] are used in-between our field probing layers and the fully connected layer for reducing internal covariate shift and introducing non-linearity. We train the network without/with updating the field probing layer parameters, and show the resulting top-1 accuracy on the 3D object classification task on the ModelNet40 dataset with a single testing view in Table 1. It is clear that our field probing layers learned to sense the input field more intelligently, with a 5.9% performance gain from 79.1% to 85.0%. Note that what is achieved by this simple network, 85.0%, is already better than the state-of-the-art 3DCNNs before [23] (83.0% in [31] and 83.8% in [19]).

We also evaluate the performance of our field probing layers in the context of a deeper FPNN, where four fully connected layers (see footnote 6), with in-between batch normalization, rectified linear unit and Dropout [27] layers, are used (4-FCs setting). As shown in Table 1, the deeper FPNN performs better, while the gap between with and without field probing layers, 87.5% − 86.6% = 0.9%, is smaller than that in the one fully connected layer FPNN setting. 
This is not surprising, as the additional fully connected layers, with the many parameters they introduce, have strong learning capability. The 0.9% performance gap contributed by our field probing layers is a precious extra over a strong baseline.

It is important to note that in both settings (1-FC and 4-FCs), our FPNNs provide reasonable performance even without optimizing the field probing layers. This confirms that long range connections among the sensors are beneficial.

Furthermore, we evaluate our FPNNs with multiple input fields (+NF setting). We employ not only distance fields, but also normal fields in our probing layers, and found a consistent performance gain for both of the aforementioned FPNNs (see Table 1). Since the normal fields are derived from the distance field, the same group of probing filters is used for both fields. Employing multiple fields in the field probing layers with different groups of filters potentially enables even higher performance.

Robustness Against Spatial Perturbations. We evaluate our FPNNs on different levels of spatial perturbations, and summarize the results in Table 2, where R indicates random horizontal rotation, R15 indicates R plus a small random rotation (−15°, 15°) in the other two directions, T0.1 indicates random translations within range (−0.1, 0.1) of the object size in all directions, and S indicates random scaling within range (0.9, 1.1) in all directions. R45 and T0.2 share the same notation, but with even stronger rotation and translation, and are used in [23] for evaluating the performance of [31]. Note that such perturbations are applied to both training and testing samples. It is clear that our FPNNs are robust against spatial perturbations.

Table 2: Performance on different perturbations.

FPNN Setting  R     R15   R15+T0.1+S  R45   T0.2
1-FC          85.0  82.4  74.1        76.2  72.2
4-FCs         87.5  86.8  85.3        84.9  85.4
[31]          84.7  -     -           83.0  84.8

Advantage of Long Range Connections. We evaluate our FPNNs with different range parameters [llow, lhigh] used in initializing the probing filters, and summarize the results in Table 3. Note that since the output dimensionality of our field probing layers is low enough to be directly fed into fully connected layers, distant sensor information is directly coupled by them. This is a desirable property; however, it makes it difficult to study the advantage of field probing layers in coupling long range information separately. Table 3 shows that even if the following fully connected layer has the capability to couple distant information, the long range connections introduced in our field probing layers are beneficial.

Table 3: Performance with different filter spans.

FPNN Setting  0.1-0.2  0.2-0.4  0.2-0.8
1-FC          82.8     84.1     85.0
4-FCs         86.9     86.8     87.5

(Footnote 5: To save disk I/O footprint, data augmentation is done on the fly: in each iteration, 256 data samples are loaded and augmented into 1024 samples for a batch.)

(Footnote 6: The first three of them output a 1024 dimensional feature vector.)

Performance on Different Field Resolutions. We evaluate our FPNNs on different input field resolutions, and summarize the results in Table 4. Higher resolution input fields can represent the input data more accurately, and Table 4 shows that our FPNN can take advantage of the more accurate representations. Since the computation cost of our field probing layers is agnostic to the resolution of the data representation, higher resolution input fields are preferred for better performance, while coupling them with efficient data structures reduces the I/O footprint.

Table 4: Performance on different field resolutions.

FPNN Setting  16 × 16 × 16  32 × 32 × 32  64 × 64 × 64
1-FC          84.2          84.5          85.0
4-FCs         87.3          87.3          87.5

"Sharpness" of the Gaussian Layer. The σ hyper-parameter in the Gaussian layer controls how "sharp" the transform is. We select its value empirically in our experiments, and the best performance is obtained with σ ≈ 10% of the object size. Smaller σ slightly hurts the performance (≈ 1%), but has the potential of reducing the I/O footprint.

FPNN Features and Visual Similarity. Figure 7 shows a visualization of the features extracted by the FPNN trained for a classification task. Our FPNN is capable of capturing 3D geometric structures such that it can map 3D models that belong to the same categories (indicated by colors) to similar regions in the feature space. More specifically, our FPNN maps 3D models to points in a high dimensional feature space, where the distances between the points measure the similarity between the corresponding 3D models. As can be seen from Figure 7 (better viewed zoomed in), the FPNN feature distances between 3D models represent their shape similarities, thus FPNN features can support shape exploration and retrieval tasks.

Figure 7: t-SNE visualization of FPNN features.

4.4 Generalizability of FPNN Features

Table 5: Generalizability test of FPNN features.

Testing Dataset  FP+FC  FC Only  FP+FC on Source, FC Only on Target
MN40_1           93.8   90.7     92.7
MN40_2           89.4   85.1     88.2

One superior characteristic of CNN features is that features from one task or dataset can be transferred to another task or dataset. 
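This kind of transfer — keep the learned feature extractor frozen and re-train only the classifier on the target categories, as evaluated in this subsection — can be sketched with a stand-in extractor; the random frozen projection and toy labels below are assumptions, not a trained FPNN or ModelNet data:

```python
import numpy as np

# Frozen "field probing" feature extractor + a re-trained linear classifier.

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(64, 16))  # frozen feature weights (stand-in)

def features(x):
    return np.maximum(x @ W_frozen, 0.0)  # fixed non-linear features

# toy "target dataset": two categories defined by a threshold in feature space
x = rng.normal(size=(200, 64))
f = np.hstack([features(x), np.ones((200, 1))])  # features + bias column
y = (f[:, 0] > np.median(f[:, 0])).astype(float)

# re-train only the classifier (least-squares fit to +-1 labels)
w, *_ = np.linalg.lstsq(f, 2.0 * y - 1.0, rcond=None)
accuracy = (((f @ w) > 0).astype(float) == y).mean()
print(accuracy)  # high: the frozen features suffice for the new labels
```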
We evaluate the
generalizability of FPNN features by cross validation: we train on one dataset and test on another.
We first split ModelNet40 (lexicographically by the category names) into two parts MN40_1 and
MN40_2, where each of them contains 20 non-overlapping categories. Then we train two FPNNs in
the 1-FC setting (updating both the field probing layers and the single fully connected layer) on these
two datasets, achieving 93.8% and 89.4% accuracy, respectively (the second column in Table 5).7
Finally, we fine-tune only the fully connected layer of these two FPNNs on the dataset that they were
not trained on, and achieve 92.7% and 88.2% on MN40_1 and MN40_2, respectively (the fourth
column in Table 5), which is comparable to training directly on the testing categories. We also
trained two FPNNs in the 1-FC setting updating only the fully connected layer, which achieve
90.7% and 85.1% accuracy on MN40_1 and MN40_2, respectively (the third column in Table 5).
These two FPNNs do not perform as well as the fine-tuned FPNNs (90.7% < 92.7% on MN40_1
and 85.1% < 88.2% on MN40_2), although all of them only update the fully connected layer. These
experiments show that the field probing filters learned from one dataset can be applied to another one.

7 The performance is higher than that on all 40 categories, since the classification task is simpler with fewer
categories. The performance gap between MN40_1 and MN40_2 is presumably due to the fact that MN40_1
categories are easier to classify than MN40_2 ones.

4.5 Comparison with State-of-the-art

Table 6: Comparison with state-of-the-art methods.

Our FPNN (4-FCs+NF)    SubvolSup+BN [23]    MVCNN-MultiRes [23]
88.4                   88.8                 93.8

We compare the performance of our FPNNs against two state-of-the-art approaches, SubvolSup+BN
and MVCNN-MultiRes, both from [23], in Table 6.
SubvolSup+BN is a subvolume-supervised
volumetric 3D CNN with batch normalization applied during training, and MVCNN-MultiRes
is a multi-view, multi-resolution image-based 2D CNN. Note that our FPNN achieves performance
comparable to SubvolSup+BN with less computational complexity. However, neither our FPNN nor
SubvolSup+BN performs as well as MVCNN-MultiRes. It is an intriguing open question why methods
directly operating on 3D data cannot yet match or outperform multi-view 2D CNNs. Research on
closing the gap between these modalities can lead to a deeper understanding of both 2D images and
3D shapes, or even higher dimensional data.

4.6 Limitations and Future Work

FPNN on Generic Fields. Our framework provides a general means of optimizing probing lo-
cations in 3D fields where gradients can be computed. We suspect this capability might be
particularly important for analyzing 3D data with invisible internal structures. Moreover, our ap-
proach can easily be extended to higher dimensional fields, although a careful storage design for the
input fields is then important to keep the I/O footprint tractable.

From Probing Filters to Probing Network.  In our current framework, the probing filters are
independent of each other: they share neither locations nor weights, which may result in
too many parameters for small training sets. On the other hand, fully shared weights greatly limit the
representation power of the probing filters. A trade-off might be to learn a probing network, where
each probing point belongs to multiple "paths" in the network, so that parameters are partially shared.

FPNN for Finer Shape Understanding. Our current approach is well suited to extracting robust
global descriptions of the input data, but lacks the capability of understanding finer structures inside
the input data.
This capability might be realized by strategically initializing the probing filters
hierarchically, and jointly optimizing filters at different hierarchies.

5 Conclusions

We proposed a novel design for feature extraction from 3D data, whose computation cost is agnostic
to the resolution of the data representation. A significant advantage of our design is that long range
interactions can easily be coupled. As 3D data becomes more accessible, we believe that our
method will stimulate more work on feature learning from 3D data. We open-source our code at
https://github.com/yangyanli/FPNN to encourage future developments.

Acknowledgments

We would first like to thank all the reviewers for their valuable comments and suggestions. Yangyan
thanks Daniel Cohen-Or and Zhenhua Wang for their insightful proofreading. The work was
supported in part by NSF grants DMS-1546206 and IIS-1528025, UCB MURI grant N00014-13-
1-0341, Chinese National 973 Program (2015CB352501), the Stanford AI Lab-Toyota Center for
Artificial Intelligence Research, the Max Planck Center for Visual Computing and Communication,
and a Google Focused Research award.

References

[1] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-
specific descriptors for deformable shapes using localized spectral convolutional networks. In SGP, pages
13–23. Eurographics Association, 2015.

[2] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Michael M. Bronstein, and Daniel Cremers.
Anisotropic diffusion descriptors. CGF, 35(2):431–441, 2016.

[3] Alexander M. Bronstein, Michael M. Bronstein, Leonidas J. Guibas, and Maks Ovsjanikov. Shape google:
Geometric words and expressions for invariant shape retrieval. ToG, 30(1):1:1–1:20, February 2011.

[4] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model
retrieval.
CGF, 22(3):223–232, 2003.

[5] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In
SIGGRAPH '96, pages 303–312, New York, NY, USA, 1996. ACM.

[6] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In CVPR, pages 248–255, June 2009.

[7] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[8] Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard.
Multimodal deep learning for robust rgb-d object recognition. In IEEE/RSJ International Conference on
Intelligent Robots and Systems, pages 681–687. IEEE, 2015.

[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[10] Saurabh Gupta, Ross Girshick, Pablo Arbelaez, and Jitendra Malik. Learning rich features from rgb-d
images for object detection and segmentation. In ECCV, volume 8695, pages 345–360. Springer
International Publishing, 2014.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In ICML, pages 448–456, 2015.

[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio
Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv
preprint arXiv:1408.5093, 2014.

[13] Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3d
scenes. TPAMI, 21(5):433–449, May 1999.

[14] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic
representation of 3d shape descriptors.
In SGP '03, pages 156–164, Aire-la-Ville, Switzerland,
2003. Eurographics Association.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.

[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[17] Roee Litman and Alexander M. Bronstein. Learning spectral descriptors for deformable shape correspon-
dence. TPAMI, 36(1):171–180, 2014.

[18] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Geodesic convolutional
neural networks on riemannian manifolds. In ICCV Workshop on 3D Representation and Recognition
(3dRR), 2015.

[19] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object
recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, September
2015.

[20] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML,
pages 807–814, 2010.

[21] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction
at scale using voxel hashing. ToG, 32(6):169:1–169:11, November 2013.

[22] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David Dobkin. Shape distributions. ToG,
21(4):807–832, October 2002.

[23] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. Volumetric
and multi-view cnns for object classification on 3d data. In CVPR, 2016.

[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large
scale visual recognition challenge. IJCV, 115(3):211–252, 2015.

[25] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai.
Deeppano: Deep panoramic representation for
3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, Dec 2015.

[26] Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, and Andrew Y. Ng. Convolutional-
recursive deep learning for 3d object classification. In NIPS, pages 665–673, 2012.

[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout:
A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research,
15(1):1929–1958, 2014.

[28] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional
neural networks for 3d shape recognition. In ICCV, 2015.

[29] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature
based on heat diffusion. In SGP, pages 1383–1392. Eurographics Association, 2009.

[30] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating
errors. Nature, 323:533–536, 1986.

[31] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao.
3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.