{"title": "Point-Voxel CNN for Efficient 3D Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 965, "page_last": 975, "abstract": "We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the sparse data, which has rather poor memory locality, rather than on the actual feature extraction. In this paper, we propose PVCNN, which represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve the locality. Our PVCNN model is both memory and computation efficient. Evaluated on semantic and part segmentation datasets, it achieves much higher accuracy than the voxel-based baseline with 10\u00d7 GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7\u00d7 measured speedup on average. Remarkably, the narrower version of PVCNN achieves 2\u00d7 speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by up to 2.4% mAP with 1.8\u00d7 measured speedup and 1.4\u00d7 GPU memory reduction.", "full_text": "Point-Voxel CNN for Efficient 3D Deep Learning\n\nZhijian Liu\u2217 (MIT), Haotian Tang\u2217 (Shanghai Jiao Tong University), Yujun Lin (MIT), Song Han (MIT)\n\nAbstract\n\nWe present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. 
Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the sparse data, which has rather poor memory locality, rather than on the actual feature extraction. In this paper, we propose PVCNN, which represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve the locality. Our PVCNN model is both memory and computation efficient. Evaluated on semantic and part segmentation datasets, it achieves a much higher accuracy than the voxel-based baseline with 10\u00d7 GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7\u00d7 measured speedup on average. Remarkably, the narrower version of PVCNN achieves 2\u00d7 speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by up to 2.4% mAP with 1.8\u00d7 measured speedup and 1.4\u00d7 GPU memory reduction.\n\n1 Introduction\n\n3D deep learning has received increased attention thanks to its wide applications, e.g., AR/VR and autonomous driving. These applications need to interact with people in real time and therefore require low latency. However, edge devices (such as mobile phones and VR headsets) are tightly constrained by hardware resources and battery. 
Therefore, it is important to design efficient and fast 3D deep learning models for real-time applications on the edge.\n\nCollected by LiDAR sensors, 3D data usually comes in the format of point clouds. Conventionally, researchers rasterize the point cloud into voxel grids and process them using 3D volumetric convolutions [4, 33]. At low resolutions, there is information loss during voxelization: multiple points are merged together if they lie in the same grid. Therefore, a high-resolution representation is needed to preserve the fine details in the input data. However, the computational cost and memory requirement both increase cubically with voxel resolution. Thus, it is infeasible to train a voxel-based model with high-resolution inputs: e.g., 3D-UNet [51] requires more than 10 GB of GPU memory on 64\u00d764\u00d764 inputs with a batch size of 16, and the large memory footprint makes it rather difficult to scale beyond this resolution.\n\nRecently, another stream of models attempts to directly process the input point clouds [17, 23, 30, 32]. These point-based models require much lower GPU memory than voxel-based models thanks to the sparse representation. However, they neglect the fact that random memory access is also very inefficient. As the points are scattered over the entire 3D space in an irregular manner, processing\n\n\u2217 indicates equal contributions. The first two authors are listed in alphabetical order.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) Off-chip DRAM accesses take two orders of magnitude more energy than arithmetic operations (640 pJ vs. 3 pJ [10]), while the bandwidth is two orders of magnitude less (30 GB/s vs. 668 GB/s [16]). 
Efficient 3D deep learning should reduce the memory footprint, which is the bottleneck of conventional voxel-based methods.\n\n(b) Random memory access is inefficient since it cannot take advantage of the DRAM burst and will cause bank conflicts [28], while contiguous memory access does not suffer from this issue. Efficient 3D deep learning should avoid random memory accesses, which are the bottleneck of conventional point-based methods.\n\nFigure 1: Efficient 3D models should reduce memory footprint and avoid random memory accesses.\n\nthem introduces random memory accesses. Most point-based models [23] mimic the 3D volumetric convolution: they extract the feature of each point by aggregating its neighboring features. However, neighbors are not stored contiguously in the point representation; therefore, indexing them requires a costly nearest neighbor search. To trade space for time, previous methods replicate the entire point cloud for each center point in the nearest neighbor search, and the memory cost then becomes O(n^2), where n is the number of input points. Another overhead is introduced by the dynamic kernel computation. Since the relative positions of neighbors are not fixed, these point-based models have to generate the convolution kernels dynamically based on different offsets.\n\nDesigning efficient 3D neural network models needs to take the hardware into account. Compared with arithmetic operations, memory operations are particularly expensive: they consume two orders of magnitude more energy and have two orders of magnitude lower bandwidth (Figure 1a). Another aspect is the memory access pattern: random access introduces memory bank conflicts and decreases the throughput (Figure 1b). From the hardware perspective, conventional 3D models are inefficient due to their large memory footprint and random memory access.\n\nThis paper provides a novel perspective to overcome these challenges. 
We propose Point-Voxel CNN (PVCNN), which represents the 3D input data as point clouds to take advantage of the sparsity to reduce the memory footprint, and leverages the voxel-based convolution to obtain a contiguous memory access pattern. Extensive experiments on multiple tasks demonstrate that PVCNN outperforms the voxel-based baseline with 10\u00d7 lower memory consumption. It also achieves 7\u00d7 measured speedup on average compared with the state-of-the-art point-based models.\n\n2 Related Work\n\nHardware-Efficient Deep Learning. Extensive attention has been paid to hardware-efficient deep learning for real-world applications. For instance, researchers have proposed to reduce the memory access cost by pruning and quantizing the models [7, 8, 9, 24, 39, 49] or by directly designing compact models [11, 12, 14, 25, 34, 48]. However, all these approaches are general-purpose and are suitable for arbitrary neural networks. In this paper, we instead design our efficient primitive based on domain-specific properties: e.g., 3D point clouds are highly sparse and spatially structured.\n\nVoxel-Based 3D Models. Conventionally, researchers relied on the volumetric representation to process 3D data [45]. For instance, Maturana et al. [27] proposed the vanilla volumetric CNN; Qi et al. [31] extended 2D CNNs to 3D and systematically analyzed the relationship between 3D CNNs and multi-view CNNs; Wang et al. [40] incorporated the octree into volumetric CNNs to reduce the memory consumption. Recent studies suggest that the volumetric representation can also be used in 3D shape segmentation [21, 37, 44] and 3D object detection [50].\n\nPoint-Based 3D Models. PointNet [30] takes advantage of a symmetric function to process unordered point sets in 3D. Later research [17, 32, 43] proposed to stack PointNets hierarchically to model neighborhood information and increase model capacity. Instead of stacking PointNets as basic
(a) Voxel-based: memory grows cubically. (b) Point-based: large memory/computation overheads.\nFigure 2: Both voxel-based and point-based NN models are inefficient. Left: the voxel-based model suffers from large information loss at acceptable GPU memory consumption (model: 3D-UNet [51]; dataset: ShapeNet Part [3]). Right: the point-based model suffers from large irregular memory access and dynamic kernel computation overheads.\n\nblocks, another type of methods [18, 23, 46] abstracts away the symmetric function using dynamically generated convolution kernels or a learned neighborhood permutation function. Other research, such as SPLATNet [36], which naturally extends the idea of 2D image SPLAT to 3D, and SO-Net [22], which uses the self-organization mechanism with a theoretical guarantee of invariance to point order, also shows great potential in general-purpose 3D modeling with point clouds as input.\n\nSpecial-Purpose 3D Models. There are also 3D models tailored for specific tasks. For instance, SegCloud [38], SGPN [42], SPGraph [19], ParamConv [41], SSCN [6] and RSNet [13] are specialized in 3D semantic/instance segmentation. 
As for 3D object detection, F-PointNet [29] is based on an RGB detector and point-based regional proposal networks; PointRCNN [35] follows a similar idea while abstracting away the RGB detector; PointPillars [20] and SECOND [47] focus on efficiency.\n\n3 Motivation\n\n3D data can be represented in the format of x = {xk} = {(pk, fk)}, where pk is the 3D coordinate of the kth input point or voxel grid, and fk is the feature corresponding to pk. Both voxel-based and point-based convolution can then be formulated as\n\nyk = \u2211_{xi \u2208 N(xk)} K(xk, xi) \u00d7 F(xi).  (1)\n\nDuring the convolution, we iterate the center xk over the entire input. For each center, we first index its neighbors xi in N(xk), then convolve the neighboring features F(xi) with the kernel K(xk, xi), and finally produce the corresponding output yk.\n\n3.1 Voxel-Based Models: Large Memory Footprint\n\nThe voxel-based representation is regular and has good memory locality. However, it requires a very high resolution in order not to lose information. When the resolution is low, multiple points are bucketed into the same voxel grid, and these points are no longer distinguishable. A point is kept only when it exclusively occupies one voxel grid. In Figure 2a, we analyze the number of distinguishable points and the memory consumption (during training with a batch size of 16) at different resolutions. On a single GPU (with 12 GB of memory), the largest affordable resolution is 64, which leads to 42% information loss (i.e., non-distinguishable points). To keep more than 90% of the information, we need to double the resolution to 128, consuming 7.2\u00d7 the GPU memory (82.6 GB), which is prohibitive for deployment. Although the GPU memory increases cubically with the resolution, the number of distinguishable points has a diminishing return. 
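The cubic scaling is easy to see with a back-of-the-envelope estimate (a minimal sketch; the channel count, batch size, and 4-byte floats here are illustrative assumptions, not the measured configuration behind the 10 GB / 82.6 GB numbers above):

```python
def voxel_activation_bytes(resolution, channels=32, batch=16, bytes_per_float=4):
    """A dense voxel feature map stores batch * channels * r^3 values."""
    return batch * channels * resolution ** 3 * bytes_per_float

mem_64 = voxel_activation_bytes(64)    # one feature map at 64^3
mem_128 = voxel_activation_bytes(128)  # one feature map at 128^3
print(mem_128 / mem_64)  # 8.0: doubling the resolution costs 8x memory per layer
```

Whatever the constants, the ratio between successive resolutions is fixed by the r^3 term, which is why scaling up the voxel grid is memory-prohibitive.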
Therefore, the voxel-based solution is not scalable.\n\n3.2 Point-Based Models: Irregular Memory Access and Dynamic Kernel Overhead\n\nPoint-based 3D modeling methods are memory efficient. The initial attempt, PointNet [30], is also computation efficient, but it lacks the local context modeling capability. Later research [23, 32, 43, 46] improves the expressiveness of PointNet by aggregating the neighborhood information in the point domain. However, this leads to an irregular memory access pattern and introduces the dynamic kernel computation overhead, which become the efficiency bottlenecks.\n\nIrregular Memory Access. Unlike the voxel-based representation, neighboring points xi \u2208 N(xk) in the point-based representation are not laid out contiguously in memory. Besides, 3D points are scattered in R^3; thus, we need to explicitly identify which points are in the neighboring set N(xk), rather than locating them by direct indexing. Point-based methods often define N(xk) as the nearest neighbors in the coordinate space [23, 46] or the feature space [43]. Either requires explicit and expensive KNN computation. After KNN, gathering all neighbors xi in N(xk) requires a large amount of random memory accesses, which is not cache friendly. Combining the cost of neighbor indexing and data movement, we summarize in Figure 2b that the point-based models spend 36% [23], 52% [43] and 57% [46] of the total runtime on structuring the irregular data and random memory access.\n\nDynamic Kernel Computation. 
For 3D volumetric convolutions, the kernel K(xk, xi) can be directly indexed since the relative positions of the neighbors xi are fixed for different centers xk: e.g., each axis of the coordinate offset pi \u2212 pk can only be 0 or \u00b11 for a convolution of size 3. However, for the point-based convolution, the points are scattered over the entire 3D space irregularly; therefore, the relative positions of neighbors become unpredictable, and we have to calculate the kernel K(xk, xi) for each neighbor xi on the fly. For instance, SpiderCNN [46] leverages a third-order Taylor expansion as a continuous approximation of the kernel K(xk, xi); PointCNN [23] permutes the neighboring points into a canonical order with the feature transformer F(xi). Both introduce additional matrix multiplications. Empirically, we find that for PointCNN, the overhead of dynamic kernel computation can be more than 50% (see Figure 2b)!\n\nIn summary, the combined overhead of irregular memory access and dynamic kernel computation ranges from 55% (for DGCNN) to 88% (for PointCNN), which indicates that most computations are wasted on dealing with the irregularity of the point-based representation.\n\n4 Point-Voxel Convolution\n\nBased on our analysis of the bottlenecks, we introduce a hardware-efficient primitive for 3D deep learning: Point-Voxel Convolution (PVConv), which combines the advantages of point-based methods (i.e., small memory footprint) and voxel-based methods (i.e., good data locality and regularity). Our PVConv disentangles the fine-grained feature transformation and the coarse-grained neighbor aggregation so that each branch can be implemented efficiently and effectively. 
As illustrated in Figure 3, the upper voxel-based branch first transforms the points into low-resolution voxel grids, then aggregates the neighboring points with voxel-based convolutions, followed by devoxelization to convert them back to points. Both voxelization and devoxelization require only one scan over all points, keeping the memory cost low. The lower point-based branch extracts the features for each individual point. As it does not aggregate the neighbors' information, it can afford a very high resolution.\n\n4.1 Voxel-Based Feature Aggregation\n\nA key component of convolution is aggregating the neighboring information to extract local features. We choose to perform this feature aggregation in the volumetric domain due to its regularity.\n\nNormalization. The scales of different point clouds might be significantly different. We therefore normalize the coordinates {pk} before converting the point cloud into the volumetric domain. First, we translate all points into the local coordinate system with the gravity center as the origin. After that, we normalize the points into the unit sphere by dividing all coordinates by max \u2016pk\u20162, and we then scale and translate the points to [0, 1]. Note that the point features {fk} remain unchanged during the normalization. We denote the normalized coordinates as {\u02c6pk}.\n\nVoxelization. We transform the normalized point cloud {(\u02c6pk, fk)} into the voxel grids {Vu,v,w} by averaging all features fk whose coordinate \u02c6pk = (\u02c6xk, \u02c6yk, \u02c6zk) falls into the voxel grid (u, v, w):\n\nVu,v,w,c = (1 / Nu,v,w) \u2211_{k=1}^{n} I[floor(\u02c6xk \u00d7 r) = u, floor(\u02c6yk \u00d7 r) = v, floor(\u02c6zk \u00d7 r) = w] \u00d7 fk,c,  (2)\n\nFigure 3: PVConv is composed of a low-resolution voxel-based branch and a high-resolution point-based branch. 
The voxel-based branch extracts coarse-grained neighborhood information, which is supplemented by the fine-grained individual point features extracted by the point-based branch.\n\nwhere r denotes the voxel resolution, I[\u00b7] is the binary indicator of whether the coordinate \u02c6pk belongs to the voxel grid (u, v, w), fk,c denotes the cth channel feature corresponding to \u02c6pk, and Nu,v,w is the normalization factor (i.e., the number of points that fall into that voxel grid). As the voxel resolution r does not have to be large to be effective in our formulation (which will be justified in Section 5), the voxelized representation does not introduce a very large memory footprint.\n\nFeature Aggregation. After converting the points into voxel grids, we apply a stack of 3D volumetric convolutions to aggregate the features. Similar to conventional 3D models, we apply batch normalization [15] and the nonlinear activation function [26] after each 3D convolution.\n\nDevoxelization. As we need to fuse the information with the point-based feature transformation branch, we then transform the voxel-based features back to the domain of the point cloud. A straightforward implementation of the voxel-to-point mapping is nearest-neighbor interpolation (i.e., assigning the feature of a grid to all points that fall into that grid). However, this would make the points in the same voxel grid always share the same features. Therefore, we instead leverage trilinear interpolation to transform the voxel grids to points, ensuring that the features mapped to each point are distinct.\n\nAs our voxelization and devoxelization are both differentiable, the entire voxel-based feature aggregation branch can be optimized in an end-to-end manner.\n\n4.2 Point-Based Feature Transformation\n\nThe voxel-based feature aggregation branch fuses the neighborhood information at a coarse granularity. 
However, in order to model finer-grained individual point features, low-resolution voxel-based methods alone might not be enough. To this end, we directly operate on each point to extract individual point features using an MLP. Though simple, the MLP outputs distinct and discriminative features for each point. Such high-resolution individual point information is critical to supplement the coarse-grained voxel-based information.\n\n4.3 Feature Fusion\n\nWith both individual point features and aggregated neighborhood information, we can efficiently fuse the two branches with an addition, as they provide complementary information.\n\n4.4 Discussions\n\nEfficiency: Better Data Locality and Regularity. Our PVConv is more efficient than conventional point-based convolutions due to its better data locality and regularity. Our proposed voxelization and devoxelization both require O(n) random memory accesses, where n is the number of points, since we only need to iterate over all points once to scatter them to their corresponding voxel grids. However, for conventional point-based methods, gathering the neighbors for all points requires at least O(kn) random memory accesses, where k is the number of neighbors. Therefore, our PVCNN is k\u00d7 more efficient from this viewpoint. 
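To make the O(n) scatter/gather round trip concrete, here is a minimal NumPy sketch of the pipeline described above (normalize, average-voxelize as in Eq. (2), devoxelize, and fuse with the point branch). The resolution, feature width, and the identity stand-in for both the voxel convolution stack and the point-branch MLP are illustrative assumptions, and nearest-neighbor devoxelization is used in place of the paper's trilinear interpolation for brevity:

```python
import numpy as np

def voxelize_avg(coords, feats, r):
    """Average-pool point features into an r^3 grid (cf. Eq. 2): one O(n) scan."""
    # Normalize: gravity center as origin, scale into the unit sphere, shift to [0, 1].
    p = coords - coords.mean(axis=0)
    p = p / (2 * np.linalg.norm(p, axis=1).max()) + 0.5
    idx = np.minimum((p * r).astype(int), r - 1)          # (n, 3) voxel indices
    flat = idx[:, 0] * r * r + idx[:, 1] * r + idx[:, 2]  # flattened grid index
    c = feats.shape[1]
    grid = np.zeros((r * r * r, c))
    count = np.zeros(r * r * r)
    np.add.at(grid, flat, feats)   # scatter-add features: O(n) accesses
    np.add.at(count, flat, 1.0)    # N_{u,v,w}: points per voxel
    mask = count > 0
    grid[mask] /= count[mask][:, None]                    # average occupied voxels
    return grid.reshape(r, r, r, c), flat

def devoxelize_nearest(grid, flat):
    """Gather each point's feature back from its voxel: also one O(n) scan."""
    r = grid.shape[0]
    return grid.reshape(r * r * r, -1)[flat]

rng = np.random.default_rng(0)
coords = rng.standard_normal((2048, 3))
feats = rng.standard_normal((2048, 8))
grid, flat = voxelize_avg(coords, feats, r=8)
neighborhood = devoxelize_nearest(grid, flat)  # coarse-grained voxel branch
fused = neighborhood + feats                   # additive fusion with the point branch
```

Note that no pairwise neighbor search appears anywhere: each point is touched a constant number of times, in contrast to the O(kn) gathering of point-based convolutions.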
As the typical value for k is 32/64 in PointNet++ [32] and 16 in PointCNN [23], we empirically reduce the number of incontiguous memory accesses by 16\u00d7 to 64\u00d7 through our design and achieve better data locality. Besides, as our convolutions are done in the voxel domain, which is regular, our PVConv requires neither the KNN computation nor the dynamic kernel computation, both of which are usually quite expensive.\n\nModel | Input Data | Convolution | Mean IoU | Latency | GPU Memory\nPointNet [30] | points (8\u00d72048) | none | 83.7 | 21.7 ms | 1.5 GB\n3D-UNet [51] | voxels (8\u00d796^3) | volumetric | 84.6 | 682.1 ms | 8.8 GB\nRSNet [13] | points (8\u00d72048) | point-based | 84.9 | 74.6 ms | 0.8 GB\nPointNet++ [32] | points (8\u00d72048) | point-based | 85.1 | 77.9 ms | 2.0 GB\nDGCNN [43] | points (8\u00d72048) | point-based | 85.1 | 87.8 ms | 2.4 GB\nPVCNN (Ours, 0.25\u00d7C) | points (8\u00d72048) | volumetric | 85.2 | 11.6 ms | 0.8 GB\nSpiderCNN [46] | points (8\u00d72048) | point-based | 85.3 | 170.7 ms | 6.5 GB\nPVCNN (Ours, 0.5\u00d7C) | points (8\u00d72048) | volumetric | 85.5 | 21.7 ms | 1.0 GB\nPointCNN [23] | points (8\u00d72048) | point-based | 86.1 | 135.8 ms | 2.5 GB\nPVCNN (Ours, 1\u00d7C) | points (8\u00d72048) | volumetric | 86.2 | 50.7 ms | 1.6 GB\n\nTable 1: Results of object part segmentation on ShapeNet Part. On average, PVCNN outperforms the point-based models with 5.5\u00d7 measured speedup and 3\u00d7 memory reduction, and outperforms the voxel-based baseline with 59\u00d7 measured speedup and 11\u00d7 memory reduction.\n\n(a) Trade-off: accuracy vs. measured latency. (b) Trade-off: accuracy vs. memory consumption.\nFigure 4: Comparisons between PVCNN and point/voxel-based baselines on ShapeNet Part.\n\nEffectiveness: Keeping Points in High Resolution. 
As our point-based feature extraction branch is implemented as an MLP, a natural advantage is that we are able to maintain the same number of points throughout the whole network while still having the capability to model neighborhood information. Let us make a comparison between our PVConv and the set abstraction (SA) module in PointNet++ [32]. Suppose we have a batch of 2048 points with 64-channel features (with a batch size of 16). We consider aggregating information from 125 neighbors of each point and transforming the aggregated features to output features of the same size. The SA module requires 75.2 ms of latency and 3.6 GB of memory, while our PVConv requires only 25.7 ms of latency and 1.0 GB of memory. The SA module would have to downsample to 685 points (i.e., around 3\u00d7 downsampling) to match the latency of our PVConv, while its memory consumption would still be 1.5\u00d7 higher. Thus, with the same latency, our PVConv is capable of modeling the full point cloud, while the SA module has to downsample the input aggressively, which inevitably induces information loss. Therefore, our PVCNN is more effective compared to its point-based counterpart.\n\n5 Experiments\n\nWe experimented on multiple 3D tasks including object part segmentation, indoor scene segmentation and 3D object detection. Our PVCNN achieves superior performance on all these tasks with lower measured latency and GPU memory consumption. 
More details are provided in the appendix.\n\nFigure 5: PVCNN runs efficiently on edge devices with low latency.\n\nModel | mIoU | Latency | GPU Memory\nPVCNN (1\u00d7R) | 86.2 | 50.7 ms | 1.59 GB\nPVCNN (0.75\u00d7R) | 85.7 | 36.8 ms | 1.56 GB\nPVCNN (0.5\u00d7R) | 85.5 | 28.9 ms | 1.55 GB\n\nTable 2: Results of different voxel resolutions.\n\nAblation | \u2206mIoU\nDevoxelization w/o trilinear interpolation | -0.5\n1\u00d7 voxel convolution in each PVConv | -0.6\n3\u00d7 voxel convolution in each PVConv | -0.1\n\nTable 3: Results of more ablation studies.\n\n(a) Top row: features extracted from the coarse-grained voxel-based branch (large, continuous).\n(b) Bottom row: features extracted from the fine-grained point-based branch (isolated, discontinuous).\nFigure 6: The two branches provide complementary information: the voxel-based branch focuses on the large, continuous parts, while the point-based branch focuses on the isolated, discontinuous parts.\n\n5.1 Object Part Segmentation\n\nSetups. We first conduct experiments on the large-scale 3D object dataset, ShapeNet Part [3]. For a fair comparison, we follow the same evaluation protocol as Li et al. [23] and Graham et al. [6]. The evaluation metric is mean intersection-over-union (mIoU): we first calculate the part-averaged IoU for each of the 2874 test models and then average these values as the final metric. Besides, we report the measured latency and GPU memory consumption on a single GTX 1080Ti GPU to reflect the efficiency. We ensure the input data have the same size, with 2048 points and a batch size of 8.\n\nModels. We build our PVCNN by replacing the MLP layers in PointNet [30] with our PVConv layers. 
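As a side note on the metric defined in Setups, the part-averaged IoU for a single shape can be sketched as follows (a generic sketch; treating parts that are absent from both prediction and label as IoU 1 is a common convention assumed here, not quoted from the evaluation code):

```python
import numpy as np

def part_averaged_iou(pred, label, num_parts):
    """Mean IoU over part classes for one shape; empty parts count as IoU 1."""
    ious = []
    for part in range(num_parts):
        inter = np.sum((pred == part) & (label == part))
        union = np.sum((pred == part) | (label == part))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
label = np.array([0, 1, 1, 1])
print(part_averaged_iou(pred, label, num_parts=2))  # (1/2 + 2/3) / 2 ~ 0.583
```

The reported mIoU is then the average of this per-shape value over the 2874 test models.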
We adopt PointNet [30], RSNet [13], PointNet++ [32] (with multi-scale grouping), DGCNN [43], SpiderCNN [46] and PointCNN [23] as our point-based baselines. We reimplement 3D-UNet [51] as our voxel-based baseline. Note that most baselines make their implementations publicly available; we therefore collect the statistics from their official implementations.\n\nResults. As shown in Table 1, our PVCNN outperforms all previous models. PVCNN directly improves the accuracy of its backbone (PointNet) by 2.5% with an even smaller overhead than PointNet++. We also design narrower versions of PVCNN by reducing the number of channels to 25% (denoted as 0.25\u00d7C) and 50% (denoted as 0.5\u00d7C). The resulting model requires only 53.5% of PointNet's latency, and it still outperforms several point-based methods with sophisticated neighborhood aggregation, including RSNet, PointNet++ and DGCNN, which are almost an order of magnitude slower.\n\nIn Figure 4, PVCNN achieves a significantly better accuracy vs. latency trade-off than all point-based methods. With similar accuracy, our PVCNN is 15\u00d7 faster than SpiderCNN and 2.7\u00d7 faster than PointCNN. Our PVCNN also achieves a significantly better accuracy vs. memory trade-off than the modern voxel-based baseline. 
With better accuracy, PVCNN reduces the GPU memory consumption by 10\u00d7 compared with 3D-UNet.\n\n[Figure 5 data, objects per second on Jetson Nano / TX2 / AGX Xavier: PointNet (83.7 mIoU) 8.2 / 20.3 / 76.0 vs. PVCNN 0.25\u00d7C (85.2 mIoU) 19.9 / 42.6 / 139.9; PointCNN (86.1 mIoU) 1.4 / 2.5 / 9.5 vs. PVCNN 1\u00d7C (86.2 mIoU) 3.3 / 7.7 / 20.2]\n\nModel | Input Data | Convolution | mAcc | mIoU | Latency | GPU Memory\nPointNet [30] | points (8\u00d74096) | none | 82.54 | 42.97 | 20.9 ms | 1.0 GB\nPVCNN (Ours, 0.125\u00d7C) | points (8\u00d74096) | volumetric | 82.60 | 46.94 | 8.5 ms | 0.6 GB\nDGCNN [43] | points (8\u00d74096) | point-based | 83.64 | 47.94 | 178.1 ms | 2.4 GB\nRSNet [13] | points (8\u00d74096) | point-based | - | 51.93 | 111.5 ms | 1.1 GB\nPVCNN (Ours, 0.25\u00d7C) | points (8\u00d74096) | volumetric | 85.25 | 52.25 | 11.9 ms | 0.7 GB\n3D-UNet [51] | voxels (8\u00d796^3) | volumetric | 86.12 | 54.93 | 574.7 ms | 6.8 GB\nPVCNN (Ours, 1\u00d7C) | points (8\u00d74096) | volumetric | 86.66 | 56.12 | 47.3 ms | 1.3 GB\nPVCNN++ (Ours, 0.5\u00d7C) | points (4\u00d78192) | volumetric | 86.87 | 57.63 | 41.1 ms | 0.7 GB\nPointCNN [23] | points (16\u00d72048) | point-based | 85.91 | 57.26 | 282.3 ms | 4.6 GB\nPVCNN++ (Ours, 1\u00d7C) | points (4\u00d78192) | volumetric | 87.12 | 58.98 | 69.5 ms | 0.8 GB\n\nTable 4: Results of indoor scene segmentation on S3DIS. On average, our PVCNN and PVCNN++ outperform the point-based models with 8\u00d7 measured speedup and 3\u00d7 memory reduction, and outperform the voxel-based baseline with 14\u00d7 measured speedup and 10\u00d7 memory reduction.\n\n(a) Trade-off: accuracy vs. measured latency. (b) Trade-off: accuracy vs. 
memory consumption.\nFigure 7: Comparisons between PVCNN and point/voxel-based baselines on S3DIS.\n\nFurthermore, we also measure the latency of PVCNN on three edge devices. In Figure 5, PVCNN consistently achieves a speedup of 2\u00d7 over PointNet and PointCNN on different devices. In particular, PVCNN is able to run at 19.9 objects per second on the Jetson Nano with PointNet++-level accuracy and 20.2 objects per second on the Jetson AGX Xavier with PointCNN-level accuracy.\n\nAnalysis. Conventional voxel-based methods saturate in performance as the input resolution increases, while their memory consumption grows cubically. PVCNN is much more efficient, and its memory increases sub-linearly (Table 2). By increasing the resolution from 16 (0.5\u00d7R) to 32 (1\u00d7R), the GPU memory usage increases from 1.55 GB to 1.59 GB, only 1.03\u00d7. Even if we squeeze the volumetric resolution to 16 (0.5\u00d7R), our method still outperforms 3D-UNet, which has a much higher voxel resolution (96), by a large margin (1%). PVCNN is very robust even with a small resolution in the voxel branch, thanks to the high-resolution point-based branch maintaining the individual points' information. We also compare different implementations of devoxelization in Table 3. Trilinear interpolation performs better than nearest-neighbor interpolation, because with the latter the points near the voxel boundaries introduce larger fluctuations to the gradient, making the model harder to optimize.\n\nVisualization. We illustrate the voxel- and point-branch features from the final PVConv in Figure 6, where warmer colors represent larger magnitudes. We can see that the voxel branch captures large, continuous parts (e.g., table top, lamp head), while the point branch captures isolated, discontinuous details (e.g., table legs, lamp neck). 
The two branches provide complementary information, which can be explained by the fact that the convolution operation extracts features with continuity and locality.\n\n5.2 Indoor Scene Segmentation\n\nSetups. We conduct experiments on the large-scale indoor scene segmentation dataset S3DIS [1, 2]. Following Tchapmi et al. [38] and Li et al. [23], we train the models on areas 1, 2, 3, 4 and 6, and test them on area 5, since it is the only area that does not overlap with any other area. Both the data processing and the evaluation protocol are the same as in PointCNN [23] for a fair comparison. We measure latency and memory consumption with 32768 points per batch at test time on a single GTX 1080Ti GPU.\n\nModels. Apart from PVCNN (which is based on PointNet), we also extend PointNet++ [32] with our PVConv to build PVCNN++. We compare our two models with the state-of-the-art point-based models [13, 23, 30, 43] and the voxel-based baseline [51].\n\n[Figure 7 plots: on S3DIS, PVCNN and PVCNN++ dominate both the accuracy-latency trade-off (6.9× speedup) and the accuracy-memory trade-off (6.6× reduction) over PointNet, DGCNN, RSNet, PointCNN, and 3D-UNet.]\n\nModel | Latency | GPU Mem. | Car (Easy/Mod./Hard) | Pedestrian (Easy/Mod./Hard) | Cyclist (Easy/Mod./Hard)\nF-PointNet [29] | 29.1 ms | 1.3 GB | 85.24 / 71.63 / 63.79 | 66.44 / 56.90 / 50.43 | 77.14 / 56.46 / 52.79\nF-PointNet++ [29] | 105.2 ms | 2.0 GB | 84.72 / 71.99 / 64.20 | 68.40 / 60.03 / 52.61 | 75.56 / 56.74 / 53.33\nF-PVCNN (Ours) | 58.9 ms | 1.4 GB | 85.25 / 72.12 / 64.24 | 70.60 / 61.24 / 56.25 | 78.10 / 57.45 / 53.65\n\nTable 5: Results of 3D object detection on the val set of KITTI. F-PVCNN significantly outperforms F-PointNet++ in all categories with 1.8× measured speedup and 1.4× memory reduction.\n\nResults. 
As shown in Table 4, PVCNN improves its backbone (PointNet) by more than 13% in mIoU, and it also outperforms DGCNN (which involves sophisticated graph convolutions) by a large margin in both accuracy and latency. Remarkably, our PVCNN++ outperforms the state-of-the-art point-based model (PointCNN) by 1.7% in mIoU with 4× lower latency, and the voxel-based baseline (3D-UNet) by 4% in mIoU with more than 8× lower latency and GPU memory consumption.\n\nSimilar to object part segmentation, we design compact models by reducing the number of channels in PVCNN to 12.5%, 25% and 50%, and in PVCNN++ to 50%. Remarkably, the narrower version of our PVCNN outperforms DGCNN with 15× measured speedup and RSNet with 9× measured speedup. Furthermore, it improves mIoU by 4% over PointNet while still being 2.5× faster than this extremely efficient model (which does not perform any neighborhood aggregation).\n\n5.3 3D Object Detection\n\nSetups. We finally conduct experiments on the driving-oriented dataset KITTI [5]. We follow Qi et al. [29] to construct the val set from the training set so that no instance in the val set belongs to the same video clip as any training instance. The val set contains 3769 samples, leaving the other 3711 samples for training. We evaluate all models 20 times and report the mean 3D average precision (AP).\n\nModels. We build our F-PVCNN on F-PointNet [29] by replacing the MLP layers within the instance segmentation network with our PVConv primitive, keeping the box proposal and refinement networks unchanged. We compare our model with F-PointNet (whose backbone is PointNet) and F-PointNet++ (whose backbone is PointNet++). We report their results based on our reproduction.\n\nResults. In Table 5, even though our F-PVCNN model does not aggregate neighboring features in the box estimation network while F-PointNet++ does, it still outperforms F-PointNet++ in all classes with 1.8× lower latency. 
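The PVConv primitive that replaces these MLP layers can be sketched compactly: point features are averaged into a coarse voxel grid, processed there, interpolated back to the points by trilinear devoxelization (which Table 3 favors over nearest-neighbor lookup), and fused with a per-point branch by addition. The NumPy sketch below illustrates only this data flow; the helper names, the tiny resolution, and the omitted 3D convolution and per-point MLP are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def voxelize(points, feats, r):
    """Average per-point features into an r x r x r grid (points in [0, 1))."""
    grid = np.zeros((r, r, r, feats.shape[1]))
    count = np.zeros((r, r, r, 1))
    idx = np.minimum((points * r).astype(int), r - 1)
    for (x, y, z), f in zip(idx, feats):
        grid[x, y, z] += f
        count[x, y, z] += 1
    return grid / np.maximum(count, 1)

def devoxelize_trilinear(grid, points):
    """Interpolate grid features back to continuous point coordinates."""
    r, c = grid.shape[0], grid.shape[3]
    coords = points * r - 0.5            # voxel centers sit at (i + 0.5) / r
    base = np.floor(coords).astype(int)
    frac = coords - base
    out = np.zeros((len(points), c))
    # accumulate the 8 neighboring voxels, weighted by trilinear coefficients
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[:, 0] if dx else 1 - frac[:, 0]) *
                     (frac[:, 1] if dy else 1 - frac[:, 1]) *
                     (frac[:, 2] if dz else 1 - frac[:, 2]))
                nbr = np.clip(base + [dx, dy, dz], 0, r - 1)
                out += w[:, None] * grid[nbr[:, 0], nbr[:, 1], nbr[:, 2]]
    return out

def pvconv(points, feats, r=4):
    """Fuse a coarse voxel branch with a fine point branch (sketch only)."""
    grid = voxelize(points, feats, r)    # the 3D convolution on `grid` is omitted
    voxel_branch = devoxelize_trilinear(grid, points)
    point_branch = feats                 # stand-in for the per-point MLP
    return voxel_branch + point_branch   # the two branches are fused by addition
```

Because the eight trilinear weights sum to one, a constant grid devoxelizes exactly, whereas a nearest-neighbor lookup changes discontinuously as points cross voxel boundaries, consistent with the gradient-fluctuation argument in the analysis above.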
Specifically, our model achieves a 2.4% average mAP improvement in the most challenging pedestrian class. Compared with F-PointNet, our F-PVCNN obtains up to a 4-5% mAP improvement on pedestrians, which indicates that our proposed model is both efficient and expressive.\n\n6 Conclusion\n\nWe propose Point-Voxel CNN (PVCNN) for fast and efficient 3D deep learning. We bring the best of both worlds together, voxels and points, reducing the memory footprint and irregular memory access. We represent the 3D input data efficiently with the sparse, irregular point representation and perform the convolutions efficiently in the dense, regular voxel representation. Extensive experiments on multiple tasks consistently demonstrate the effectiveness and efficiency of our proposed method. We believe that our research will break the stereotype that voxel-based convolution is inherently inefficient and shed light on co-designing voxel-based and point-based network architectures.\n\nAcknowledgements. We thank MIT Quest for Intelligence, MIT-IBM Watson AI Lab, Samsung, Facebook and SONY for supporting this research. We also thank AWS Machine Learning Research Awards for providing the computation resources and NVIDIA for donating the Jetson AGX Xavier.\n\nReferences\n\n[1] Iro Armeni, Alexander Sax, Amir R. Zamir, and Silvio Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. arXiv, 2017.\n[2] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In CVPR, 2016.\n[3] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. arXiv, 2015.\n[4] Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 
3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In ECCV, 2016.\n[5] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. IJRR, 2013.\n[6] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D Semantic Segmentation With Submanifold Sparse Convolutional Networks. In CVPR, 2018.\n[7] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.\n[8] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. In NeurIPS, 2015.\n[9] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In ECCV, 2018.\n[10] Mark Horowitz. Computing's Energy Problem. In ISSCC, 2014.\n[11] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. arXiv, 2019.\n[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.\n[13] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent Slice Networks for 3D Segmentation on Point Clouds. In CVPR, 2018.\n[14] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size. arXiv, 2016.\n[15] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015. 
[16] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA, 2017.\n[17] Roman Klokov and Victor S. Lempitsky. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models. In ICCV, 2017.\n[18] Shiyi Lan, Ruichi Yu, Gang Yu, and Larry S. Davis. Modeling Local Geometric Structure of 3D Point Clouds using Geo-CNN. In CVPR, 2019.\n[19] Loic Landrieu and Martin Simonovsky. Large-Scale Point Cloud Semantic Segmentation With Superpoint Graphs. In CVPR, 2018.\n[20] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, and Jiong Yang. PointPillars: Fast Encoders for Object Detection from Point Clouds. In CVPR, 2019.\n[21] Truc Le and Ye Duan. PointGrid: A Deep Network for 3D Shape Understanding. In CVPR, 2018.\n[22] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. SO-Net: Self-Organizing Network for Point Cloud Analysis. In CVPR, 2018.\n[23] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-Transformed Points. In NeurIPS, 2018.\n[24] Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed Point Quantization of Deep Convolutional Networks. In ICLR, 2016.\n[25] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In ECCV, 2018.\n[26] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In ICML, 2013.\n[27] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. In IROS, 2015.\n[28] Onur Mutlu. DDR Access Illustration. 
https://www.archive.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php?media=wiki:lectures:onur-740-fall11-lecture25-mainmemory.pdf.\n[29] Charles Ruizhongtai Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D Object Detection from RGB-D Data. In CVPR, 2018.\n[30] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, 2017.\n[31] Charles Ruizhongtai Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and Multi-View CNNs for Object Classification on 3D Data. In CVPR, 2016.\n[32] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In NeurIPS, 2017.\n[33] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning Deep 3D Representations at High Resolutions. In CVPR, 2017.\n[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR, 2018.\n[35] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. In CVPR, 2019.\n[36] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse Lattice Networks for Point Cloud Processing. In CVPR, 2018.\n[37] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree Generating Networks: Efficient Convolutional Architectures for High-Resolution 3D Outputs. In ICCV, 2017.\n[38] Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic Segmentation of 3D Point Clouds. In 3DV, 2017.\n[39] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 
HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR, 2019.\n[40] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. In SIGGRAPH, 2017.\n[41] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep Parametric Continuous Convolutional Neural Networks. In CVPR, 2018.\n[42] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation. In CVPR, 2018.\n[43] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic Graph CNN for Learning on Point Clouds. In SIGGRAPH, 2019.\n[44] Zongji Wang and Feng Lu. VoxSegNet: Volumetric CNNs for Semantic Part Segmentation of 3D Shapes. TVCG, 2019.\n[45] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In CVPR, 2015.\n[46] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters. In ECCV, 2018.\n[47] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely Embedded Convolutional Detection. Sensors, 2018.\n[48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In CVPR, 2018.\n[49] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. In ICLR, 2017.\n[50] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In CVPR, 2018.\n[51] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 
3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In MICCAI, 2016.", "award": [], "sourceid": 539, "authors": [{"given_name": "Zhijian", "family_name": "Liu", "institution": "MIT"}, {"given_name": "Haotian", "family_name": "Tang", "institution": "Shanghai Jiao Tong University"}, {"given_name": "Yujun", "family_name": "Lin", "institution": "MIT"}, {"given_name": "Song", "family_name": "Han", "institution": "MIT"}]}