{"title": "Self-Supervised Deep Learning on Point Clouds by Reconstructing Space", "book": "Advances in Neural Information Processing Systems", "page_first": 12962, "page_last": 12972, "abstract": "Point clouds provide a flexible and natural representation usable in countless applications such as robotics or self-driving cars. Recently, deep neural networks operating on raw point cloud data have shown promising results on supervised learning tasks such as object classification and semantic segmentation. While massive point cloud datasets can be captured using modern scanning technology, manually labelling such large 3D point clouds for supervised learning tasks is a cumbersome process. This necessitates methods that can learn from unlabelled data to significantly reduce the number of annotated samples needed in supervised learning. We propose a self-supervised learning task for deep learning on raw point cloud data in which a neural network is trained to reconstruct point clouds whose parts have been randomly rearranged. While solving this task, representations that capture semantic properties of the point cloud are learned. Our method is agnostic of network architecture and outperforms current unsupervised learning approaches in downstream object classification tasks. We show experimentally, that pre-training with our method before supervised training improves the performance of state-of-the-art models and significantly improves sample efficiency.", "full_text": "Self-Supervised Deep Learning on Point Clouds by\n\nReconstructing Space\n\nJonathan Sauder\n\nHasso Plattner Institute\n\nPotsdam, Germany\n\njonathan.sauder@student.hpi.de\n\nBjarne Sievers\n\nHasso Plattner Institute\n\nPotsdam, Germany\n\nbjarne.sievers@student.hpi.de\n\nAbstract\n\nPoint clouds provide a \ufb02exible and natural representation usable in countless\napplications such as robotics or self-driving cars. 
Recently, deep neural networks\noperating on raw point cloud data have shown promising results on supervised\nlearning tasks such as object classification and semantic segmentation. While\nmassive point cloud datasets can be captured using modern scanning technology,\nmanually labelling such large 3D point clouds for supervised learning tasks is a\ncumbersome process. This necessitates methods that can learn from unlabelled\ndata to significantly reduce the number of annotated samples needed in supervised\nlearning. We propose a self-supervised learning task for deep learning on raw point\ncloud data in which a neural network is trained to reconstruct point clouds whose\nparts have been randomly rearranged. While solving this task, representations\nthat capture semantic properties of the point cloud are learned. Our method is\nagnostic of network architecture and outperforms current unsupervised learning\napproaches in downstream object classification tasks. We show experimentally that\npre-training with our method before supervised training improves the performance\nof state-of-the-art models and significantly improves sample efficiency.\n\n1 Introduction\n\nPoint clouds provide a natural and flexible representation of objects in metric spaces. They can also be\neasily captured by modern scanning devices and techniques. Algorithms that can recognize objects in\npoint clouds are crucial to countless applications such as robotics and self-driving cars. Traditionally,\nsystems for such tasks have relied on the approximate computation of geometric features such as\nfaces, edges or corners [31, 11] and hand-crafted features encoding statistical properties [3, 27].\nHowever, these approaches are often tailored to specific tasks and thus do not provide the\nflexibility needed for modern applications. 
Recently, Convolutional Neural Networks (CNNs), which are\ndomain-independent, have shown promising performance on point clouds in supervised learning tasks\nsuch as object classification and semantic segmentation, outperforming conventional approaches\n[23, 24, 33, 16].\nThe advent of scalable 3D point cloud scanning technologies such as LiDAR scanners and stereo\ncameras gives rise to massive point cloud datasets, possibly spanning large entities such as entire\ncities or regions. However, manually annotating such massive amounts of data for supervised learning\ntasks such as semantic segmentation is problematic: typical real-world point clouds reach\nbillions of points and petabytes of data, while user interfaces for labelling 3D data\n(e.g. drawing bounding boxes) on 2D screens are inherently limited. Therefore, it is of great interest to develop\nmethods which can reduce the number of annotated samples required for strong performance on\nsupervised learning tasks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: A visual example of the proposed self-supervised learning task. (a) The original object is\nsplit into voxels along the axes, and each point is assigned a voxel label. (b) The voxels are randomly\nrearranged. (c) A neural network predicts the voxel labels, here visualized with the original point\npositions. (d) Points with correctly predicted voxel labels (blue) and misclassifications (red).\n\nUnsupervised or self-supervised learning approaches for deep learning have been shown to be effective in\nthis scenario in various domains [10, 20, 13, 7, 21, 9]. On point clouds, self-supervised approaches\nhave largely focused on applying either Autoencoders [13] or Generative Adversarial Networks\n(GANs) [10]. 
While GAN-based approaches have not been successfully applied to raw point cloud\ndata due to the non-triviality of sampling unordered sets with neural networks, Autoencoders for\npoint clouds rely on possibly problematic similarity metrics [1].\nIn this work we address these limitations and present a self-supervised learning method for neural\nnetworks operating on raw point cloud data in which a neural network is trained to correctly reconstruct\npoint clouds whose parts have been randomly displaced. An example of the proposed self-supervised\ntask is given in Figure 1. The proposed method is agnostic of the specific network architecture and can\nbe flexibly used to pre-train any deep learning model operating on raw point clouds for other tasks.\nIn a series of experiments, we show that powerful representations of point clouds are obtained from\nself-supervised training with our method. Our method outperforms previous unsupervised methods\nin a downstream object classification task in a transfer learning setting. We also explore per-point\nfeatures and show that pre-training with our method improves the performance and sample efficiency in\nsupervised tasks. Our main contributions are:\n\n\u2022 We present an architecture-agnostic self-supervised learning method operating on raw point\nclouds in which a neural network is trained to reconstruct a point cloud whose parts have\nbeen randomly displaced. Our method avoids computationally expensive and possibly\nflawed reconstruction losses or similarity metrics on point clouds.\n\n\u2022 We demonstrate the effectiveness of the learned representations: our method outperforms\nstate-of-the-art unsupervised methods in a downstream object classification task. 
Pre-training\nwith our method improves results in all evaluated supervised tasks.\n\n2 Related Work\n\n2.1 Deep Learning on Point Clouds\n\nDeep neural networks have shown impressive performance on regularly structured data representations\nsuch as images and time series. However, point clouds are unordered sets of vectors and therefore\nexemplify a class of problems that pose challenges for deep learning, for which the term geometric\ndeep learning [4] has been coined. Although deep learning methods for unordered sets [32, 39]\nhave been proposed and also applied to point clouds [25], these approaches do not leverage spatial\nstructure.\nTo address this problem, popular point cloud representations suitable for deep learning include\nvolumetric approaches, in which the containing space is voxelized to be suitable for 3D CNNs\n[18, 22, 36], and multi-view approaches [28, 30], in which 3D point clouds are rendered into 2D\nimages fed into 2D CNNs. However, voxelized representations can be difficult to use when the point\ncloud density varies, and as such are constrained by the resolution and limited by the computational\ncost of 3D convolutions. Although multi-view approaches have shown strong performance in\nclassification of standalone objects, it is unclear how to extend them to work reliably in larger scenes\n(e.g. with covered objects) and on per-point tasks such as part segmentation [23].\nA more recent approach, pioneered by PointNet [23], is feeding raw point cloud data into neural\nnetworks. As point clouds are unordered sets, these networks have to be permutation invariant.\nPointNet achieves this by using the max-pooling operation to form a single feature vector representing\nthe global context from a variable number of points. PointNet++ [24] proposes an extension that\nintroduces local context by stacking multiple PointNet layers. 
Further improvements were made by\nintroducing Dynamic Graph CNNs (DGCNNs) [33], in which a graph convolution is applied to edges\nof the k-nearest neighbor graph of the point clouds, which is dynamically recomputed in feature space\nafter each layer. Similar performance was achieved by PointCNN [16], which uses a hierarchical\nconvolution that is trained to learn permutation invariance. All neural networks operating on raw\npoint cloud data naturally provide per-point embeddings, making them particularly useful for point\nsegmentation tasks. Our proposed method can leverage these methods as it is \ufb02exible with regards to\nthe use of speci\ufb01c neural network architecture.\n\n2.2 Unsupervised and Self-Supervised Deep Learning\n\nDeep learning algorithms have demonstrated the ability to learn powerful internal hierarchical\nembeddings through unsupervised learning tasks, in which no supervision is given at all, or self-\nsupervised tasks, where the labels are generated from the data itself [14, 7, 9]. These representations\ncan be directly used in downstream tasks or as strong initializers for supervised tasks [20, 8]. In\ncases where large amounts of data are available but annotated samples are scarce, unsupervised or\nself-supervised learning can signi\ufb01cantly reduce the number of annotated training samples required\nfor strong performance in various tasks [37], making such methods particularly desirable for point\nclouds.\nFollowing the impressive results that have been achieved with GANs [10] and Autoencoders [13] in\nthe image domain, previous efforts for unsupervised learning on point clouds have been adaptations\nof these approaches. 
However, GANs for point clouds have been limited to either work on voxelized\nrepresentations [34], on 2D-rendered images of point clouds [12], or through adversarial learning\non the learned embedding space from an external Autoencoder [1], as sampling unordered but\nintra-dependent sets of points with neural networks is non-trivial. Autoencoders, on the other hand, work\nby learning to encode inputs into a latent space before reconstructing them, therefore requiring\nsimilarity or reconstruction metrics. Besides Autoencoders on voxelized representations [29], in\nwhich conventional loss functions can be applied per-voxel, Autoencoders have also been applied\non raw point clouds [37, 15]. When operating on raw point clouds, Autoencoder-based methods\nrely on similarity metrics such as the Chamfer (pseudo) distance, which acts\nas a differentiable approximation to the computationally infeasible Earth Mover\u2019s Distance [26].\nComputing the Chamfer distance can be limited by memory requirements in large point clouds, but\nmore importantly, the authors of [1] observe that specific pathological cases are handled incorrectly.\nThis motivates self-supervised methods such as ours which avoid potentially problematic similarity\nfunctions.\n\nA completely different approach to self-supervised learning in the image domain was taken by [7], in\nwhich a neural network is trained to predict the spatial relation between two randomly chosen image\npatches. The authors demonstrate the effectiveness of the learned features in a range of experiments\nand argue that such a classification task tackles the problem of the extremely large variety of pixels\nthat can arise from the same semantic object in images. This holds even more strongly when moving from\nimages to point clouds, i.e. from regular grids in 2D space to unordered sets in 3D space. 
These ideas\nwere extended in [21], where a neural network with a limited receptive field was trained to correctly\nplace randomly displaced image patches at their original position. The authors of [7, 21] identify the\nchallenge of trivial solutions for such self-supervised tasks in the image domain, such as chromatic\naberration or the matching of low-level features such as the position of lines in image segments. They\ntake extensive precautions to alleviate this problem, one of which is limiting the receptive field of\nthe neural network, which prevents the same neural network used for pre-training from being used\nwithout any changes in further supervised training. Another approach for self-supervised learning was\ntaken by [9], in which a neural network learns to identify the correct rotation of an image. However,\nthis approach is limited to domains in which a clear height-axis is defined. We build on the concepts\nof [21] and adapt the idea of reordering patches to point clouds, which have certain characteristics\nthat make them particularly well-suited for such a task.\n\n3 Method\n\nAlgorithm 1: Generation of Self-Supervised Labels\n1: function GET_SELF_SUPERVISED_LABEL(X \u2282 R^3, k \u2208 R)\n2:     X1 \u2190 scale_to_unit_cube(X)\n3:     X1, y \u2190 voxelize(X1, k)    \u25b7 get corresponding voxel ID for each point in X\n4:     \u03c0 \u2190 random_permutation(0..k^3)\n5:     for i in 0..k^3 do\n6:         new_position \u2190 move_to_voxel(X1[i], \u03c0[y[i]])\n7:         X1[i] \u2190 augment(new_position)\n8:     return X1, y\n\nIn this paper we propose a self-supervised method that learns powerful representations from raw\npoint cloud data. Our method works by training a neural network to reassemble point clouds whose\nparts have been randomly displaced. 
The key assumption of the proposed method is that learning to\nreassemble displaced point cloud segments is only possible by learning holistic representations that\ncapture the high-level semantics of the objects in the point cloud.\nWe phrase the self-supervised learning task as a point segmentation task, in which the label for each\npoint is generated from the point cloud itself with the following procedure: the input point cloud is\nscaled to the unit cube before each axis is split into k equal lengths, forming k^3 voxels. We use these\nto assign each point its voxel ID as a label. Subsequently, all voxels are randomly swapped with\nother voxels and a neural network is trained to predict the original voxel ID of each point. The\npoints in each voxel can also be augmented (e.g. randomly shifted by a small amount) to improve\ngeneralization. Pseudo-code for this entire procedure is provided in Algorithm 1. Note that using the\nvoxel ID as per-point label admits a unique solution even for almost all axis-symmetric point clouds,\nas long as the individual voxels are not all randomly rotated, i.e. as long as a general sense of the\norientation of the input point cloud is maintained. While k may be varied across domains, depending\non the amount of detail in the input point clouds, we list all results with k = 3. Additional details are\ndiscussed in Section 5.\nThe proposed method is agnostic of the specific neural network architecture at hand: any neural\nnetwork capable of point segmentation tasks, such as PointNet [23], PointNet++ [24], DGCNN [33],\nor PointCNN [16], can be used out-of-the-box. These network architectures can be pre-trained in a\nself-supervised manner with our method and used as-is for further supervised training. Furthermore,\nas point clouds do not suffer from the same trivial solutions as identified in the image domain by\n[7, 21], no limitation is needed on the receptive field size. 
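The label-generation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, a uniform (aspect-preserving) scaling into the unit cube, and Gaussian jitter as the augmentation; the function name make_self_supervised_sample and the jitter parameter are hypothetical.

```python
import numpy as np


def make_self_supervised_sample(points, k=3, jitter=0.0, rng=None):
    # points: (N, 3) array. Returns (rearranged_points, voxel_labels).
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=np.float64)

    # Scale into the unit cube [0, 1]^3 (assumption: one scale for all axes).
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    unit = (points - mins) / max(extent, 1e-12)

    # Per-axis voxel index in {0, ..., k-1}; clip so that 1.0 lands in the last voxel.
    idx = np.clip(np.floor(unit * k).astype(np.int64), 0, k - 1)
    labels = idx[:, 0] * k * k + idx[:, 1] * k + idx[:, 2]

    # Random permutation of the k^3 voxel IDs: voxel v is moved to slot perm[v].
    perm = rng.permutation(k ** 3)
    target = perm[labels]
    target_idx = np.stack(
        [target // (k * k), (target // k) % k, target % k], axis=1
    )

    # Keep each point's offset inside its voxel and swap only the voxel origin.
    local = unit - idx / k
    moved = target_idx / k + local

    # Optional augmentation: small random shift of every point.
    if jitter > 0:
        moved = moved + rng.normal(scale=jitter, size=moved.shape)
    return moved, labels
```

A segmentation network would then receive the rearranged points as input and the voxel labels as per-point targets, for example with a standard cross-entropy loss over the k^3 classes.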
Phrasing the self-supervised task as a point\nsegmentation task brings many advantages: there is no reliance on possibly flawed similarity metrics\nas with Autoencoders, it is not necessary to sample unordered sets of points from a neural network\nas with GANs, and the method can work on raw point cloud data and does not require voxelized\nor 2D-rendered representations of point clouds, making our approach universally applicable to any\npoint cloud data. Operating on raw point clouds enables flexibility with regard to the point cloud\ndensity and allows for learning of per-point embeddings instead of per-voxel or per-pixel embeddings\nwithout explicit supervision.\n\nTable 1: Comparison of our method against previous unsupervised methods in downstream object\nclassification on the ModelNet40 (MN40) and ModelNet10 (MN10) datasets in terms of accuracy. A linear SVM is\ntrained on the representations learned in an unsupervised manner on the ShapeNet dataset.\n\nModel                            MN40     MN10\nVConv-DAE [29]                   75.50%   80.50%\n3D-GAN [34]                      83.30%   91.00%\nLatent-GAN [1]                   85.70%   95.30%\nFoldingNet [37]                  88.40%   94.40%\nVIP-GAN [12]                     90.19%   92.18%\nPointNet + Pre-Training (Ours)   87.31%   91.61%\nDGCNN + Pre-Training (Ours)      90.64%   94.52%\n\n4 Experiments\n\n4.1 Object Classification\n\nIn this section, we show that the embeddings learned with our method outperform state-of-the-art\nunsupervised methods in a downstream object classification task and demonstrate the benefits of\npre-training with our method before fully supervised training. In line with previous approaches, we\nevaluate our performance on the object classification problem using the ModelNet dataset [35], which\ncontains CAD models from different categories of man-made objects. 
For this we use the standard\ntrain/test split, with the same uniform point sample as defined in [23], with ModelNet40 on 40 classes\ncontaining 9843 train and 2468 test models and ModelNet10 on ten classes containing 3991 and 909\nmodels respectively.\nIn the first experiment, we follow the same procedure as in [1, 34, 37, 12]. We train a model in a\nself-supervised manner on the ShapeNet dataset [5], which consists of 57448 models from 55 categories.\nAfter that, we train a linear Support Vector Machine (SVM) [6] on the obtained embeddings of the\nModelNet40 train split and evaluate it on the test split. We do this with a PointNet and a DGCNN\nwith the exact same setup as proposed by the authors for object classification [33, 23]; the object\nembeddings are obtained after the last max-pooling layer. This experiment evaluates the learned\nembeddings in a transfer learning task, demonstrating their generalizability. From every model in\nShapeNet we use the same random sample of 2048 points on the model surface as provided by\n[37]. The results are displayed in Table 1. Our method outperforms all previous approaches on\nModelNet40, and all except Latent-GAN on ModelNet10. However, as noted by [37], the point cloud\nformat and sampling procedure from Latent-GAN is not publicly available, making a comparison on\nModelNet10 accuracy inconclusive. Figure 2a shows that a decrease in self-supervised training loss\non ShapeNet gives a better downstream classification accuracy on ModelNet40, which suggests that\ncorrectly reconstructing the point cloud parts requires learning representations that capture the\nsemantics of the objects at hand. The obtained embeddings from a DGCNN with our method for the\nModelNet10 test data are visualized using t-SNE [17] in Figure 2b. One can see that clear, separable\nclusters are formed for each class except for the classes dresser (violet) vs nightstand (pink), which\nare almost visually indiscernible when scaled to the unit cube, as done in the ShapeNet dataset.\n\nFigure 2: (a) The self-supervised training loss on the ShapeNet\ndataset and the linear SVM accuracy trained on obtained\nembeddings for the ModelNet dataset, plotted over training epochs.\nPerforming better on the unsupervised task results in\nstronger embeddings for downstream object classification.\n(b) Visualization of the object embeddings of the\nModelNet10 test data (classes: bathtub, bed, chair, desk, dresser, monitor, nightstand, sofa, table, toilet)\nobtained through training\nwith the proposed self-supervised method on the\nShapeNet dataset. t-SNE with perplexity 10 and\n1000 iterations was used for dimensionality reduction.\n\nFigure 3: (a) Linear SVM classification\naccuracy for ModelNet40 when few\nannotated training samples are available.\n(b) The training curves on the ModelNet40 object\nclassification task of a DGCNN pre-trained with\nour self-supervised method (blue) on the ShapeNet\ndataset and a randomly initialized DGCNN (red).\n\nIn a second experiment, we show that a very small number of labelled samples can suffice to achieve\nstrong performance in a downstream task, which is one of the main motivations of self-supervised\nlearning. 
We evaluate our method in such a scenario by limiting the number of training samples\navailable in the ModelNet object classification task. We sample according to the following procedure:\nfirst we randomly sample one object per class, and then sample the remaining objects uniformly out\nof the entire training set. We compare the performance of a linear SVM trained on the embeddings\nobtained from training a DGCNN on ShapeNet with our method to those obtained with FoldingNet\n[37] in Figure 3a. The embeddings obtained from our method lead to higher accuracy than those\nobtained with FoldingNet for any number of training labels. Using only 1% of training data,\nequivalent to three or fewer samples per class, our model is able to achieve 65.2% accuracy on the test\nset. When using 10% of available training samples, this accuracy rises to 84.4%.\nFinally, we demonstrate the benefit of pre-training with our method, by pre-training a DGCNN in\na self-supervised manner on the ShapeNet dataset with 1024 points chosen randomly from each\nmodel for 100 epochs before fully supervised training on the ModelNet40 dataset. As seen in Figure 3b,\nself-supervised pre-training acts as a strong initializer, reducing the number of supervised epochs\nneeded for strong performance and even improving the final object classification accuracy with\nDGCNN (Table 2).\n\nTable 2: Comparison to state-of-the-art supervised methods in ModelNet40 classification accuracy.\nAll models are trained and evaluated on 1024 points. Self-supervised pre-training is performed on the\nShapeNet dataset.\n\nModel                          Accuracy\nPointNet [23]                  89.2%\nPointNet++ [24]                90.7%\nPointCNN [16]                  92.2%\nDGCNN + Random Init [33]       92.2%\nDGCNN + Pre-Training (Ours)    92.4%\n\n4.2 Part Segmentation\n\nTable 3: The effect of pre-training on ShapeNet Part Segmentation. Metric is mean IoU% of parts per\nobject class.\n\nCategory     # Shapes   PointNet   PointNet++   DGCNN   Ours\nMean         -          83.7       85.1         85.1    85.3\nAero         2690       83.4       82.4         84.2    84.1\nBag          76         78.7       79.0         83.7    84.0\nCap          55         82.5       87.7         84.4    85.8\nCar          898        74.9       77.3         77.1    77.0\nChair        3758       89.6       90.8         90.9    90.9\nEarphone     69         73.0       71.8         78.5    80.0\nGuitar       787        91.5       91.0         91.5    91.5\nKnife        392        85.9       85.9         87.3    87.0\nLamp         1547       80.8       83.7         82.9    83.2\nLaptop       451        95.3       95.3         96.0    95.8\nMotor        202        65.2       71.6         67.8    71.6\nMug          184        93.0       94.1         93.3    94.0\nPistol       283        81.2       81.3         82.6    82.6\nRocket       66         57.9       58.7         59.7    60.0\nSkateboard   152        72.8       76.4         75.5    77.9\nTable        5271       80.6       82.6         82.0    81.8\n\nIn this section we explore the per-point embeddings obtained through unsupervised training in a\npart segmentation task. Again, we train our model in a self-supervised fashion on the ShapeNet\ndataset. The supervised task is then to correctly classify each point of an object into the correct\nobject part on the ShapeNet Part dataset [38], which is a subset of the full ShapeNet containing\n16881 3D objects from 16 categories, annotated with 50 parts in total. We use the official train /\nvalidation / test splits [38]. Following the same procedure as in [23, 24, 33], the one-hot encoded\nobject class label of the object is given as an input during supervised training. During the 200 epochs\nof pre-training, a random class label is given to each object. Part segmentation is evaluated on\nthe mean Intersection-over-Union (mIoU) metric, calculated by averaging IoUs for each part in an\nobject before averaging the obtained values for each object class. 
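The two-stage averaging behind this metric can be illustrated with a short sketch. It assumes NumPy and integer part IDs per point; the function names are hypothetical, and the convention that a part absent from both prediction and ground truth counts as IoU 1 is an assumption rather than something stated in this paper.

```python
import numpy as np


def shape_iou(pred, target, part_ids):
    # Mean IoU over the parts that belong to this object's category.
    ious = []
    for part in part_ids:
        p, t = pred == part, target == part
        union = np.logical_or(p, t).sum()
        # Assumption: a part missing from both prediction and labels scores 1.
        ious.append(1.0 if union == 0 else np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


def class_mean_iou(shape_ious, shape_classes):
    # Average the per-object IoUs within each object class.
    result = {}
    for c in sorted(set(shape_classes)):
        vals = [iou for iou, cls in zip(shape_ious, shape_classes) if cls == c]
        result[c] = float(np.mean(vals))
    return result
```

Here shape_iou is evaluated once per object and class_mean_iou then aggregates those values per category, which corresponds to the per-class columns reported for part segmentation.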
The results are shown in Table 3.\nA DGCNN pre-trained with our method slightly outperforms a randomly initialized DGCNN, the\ndifferences in accuracy being particularly notable on the classes with few samples.\nIn Figure 4 we show a visualization of the features learned for objects after self-supervised training\nbut before any fully supervised training. The visualizations are obtained by selecting a random point\nand visualizing the distance to the two (sequentially chosen) furthest points in the learned feature\nspace using a color scale. The visualizations show that the features learned in a self-supervised\nmanner can capture high-level semantics such as object parts without ever having seen part IDs. In\nFigure 5 a visualization of the features for each point from ten airplanes and ten chairs is shown.\nThe features are projected into two dimensions using UMAP [19]. One can clearly see that the two\nobject classes form clear, separable clusters in the feature spaces and that clear, discernible clusters\nare formed for the individual object parts. Individual objects from the classes are not identi\ufb01able,\nshowing that the learned features generalize over reoccurring structures. This highlights the semantics\nof the high-level features learned with our method.\n\n4.3 Semantic Segmentation\n\nIn this semantic segmentation task we evaluate the effectiveness on our method on data that goes\nbeyond simple, free-standing objects. The task is evaluated on the Stanford Large-Scale 3D Indoor\nSpaces (S3DIS) dataset [2]. The dataset consists of 3D point cloud scans from 6 indoor areas totalling\n272 rooms. The points are classi\ufb01ed into 13 semantic classes such as board, chair, ceiling, beam,\nand clutter. Each room is split into blocks of 1m \u00d7 1m area and each point is given as a 6D vector\ncontaining XYZ coordinates and RGB color values. 
In this setup we evaluate the case in which there\nare large amounts of unlabelled data and only a few annotated samples are available. For this the largest\narea (area 5) is chosen as the test set, and the other areas form distinct training sets.\n\nFigure 4: A visualization of the features learned through self-supervised training with our method for\nindividual objects. A color scale shows the distance in feature space between a randomly sampled\npoint and its two (mutually) furthest neighbors in feature space.\n\nFigure 5: Visualization of the per-point features of 10 airplanes and 10 chairs from the ShapeNet Part\ndataset. UMAP is used for dimensionality reduction for visualization purposes.\n\nTable 4: Results of semantic segmentation on the S3DIS dataset. Results are evaluated on area 5.\n\n                                    Random Init          Pre-Training (ours)\nSupervised Train Area   # Samples   mIoU%     Acc%       mIoU%     Acc%\nArea 1                  3687        43.6%     82.9%      44.7%     83.5%\nArea 2                  4440        34.6%     81.2%      34.9%     81.2%\nArea 3                  1650        39.9%     82.8%      42.4%     84.0%\nArea 4                  3662        39.4%     82.8%      39.9%     82.9%\nArea 6                  3294        43.9%     83.1%      43.9%     83.3%\n\nWe compare two\nDGCNNs with the architecture proposed for semantic segmentation by the authors for each training\narea, one that has been pre-trained on all areas except area 5, and one that is not pre-trained. The task\nis evaluated in mIoU% per object class and total per-point classification accuracy. The results are\nshown in Table 4. Pre-training improves the mIoU and classification accuracy in all cases except two,\nin which the two methods are tied. 
As expected, the difference is the largest for area 3, where the\nnumber of training samples for fully supervised learning is the smallest.\n\n5 Discussion\n\nThroughout all experiments, our proposed method learns representations that prove to be effective.\nThis leads us to believe that trivial solutions to the task of reconstructing the inputs, as discussed for\nthe image domain by [7, 21], are not a significant problem for point clouds. Point clouds do not suffer\nfrom chromatic aberration and point cloud parts can be shifted and rotated freely in the coordinates,\nalleviating the issue of simply matching lines and edges. In this paper we performed all experiments\nwith k = 3 voxels per axis during self-supervised pre-training, which we observed to outperform\nboth k = 2 and k = 4. We found that randomly rotating 15% of the individual voxels and randomly\nreplacing one voxel in each input point cloud with a random voxel from a randomly drawn input\npoint cloud from the same dataset leads to a slightly higher quality of the embeddings in the object\nclassification task (consistently around 0.2% SVM accuracy in the downstream object classification\ntask); we therefore kept this setup throughout all experiments. An extensive evaluation on how to\nfine-tune the self-supervised task to a specific dataset or domain is not the focus of this paper; instead\nwe show that our simple approach works reliably in all evaluated cases.\n\n6 Conclusion\n\nIn this paper we propose a self-supervised method for learning representations from unlabelled raw\npoint cloud data. In this easy-to-implement method, a neural network learns to reconstruct input point\nclouds whose parts have been randomly displaced. While solving this task, high-level representations\nof the underlying input point clouds are learned. 
We demonstrate the effectiveness of the learned\nrepresentations in downstream tasks and show that our method can improve the sample efficiency and the\naccuracy of state-of-the-art models when used to pre-train with large amounts of data before fully\nsupervised training. As our method is independent of the specific neural network architecture, we\nexpect to see further benefits of using our results as more effective neural networks for processing\nraw point cloud data are developed in the future.\n\nReferences\n[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas J. Guibas. Representation\nlearning and adversarial generation of 3d point clouds. CoRR, abs/1707.02392, 2017.\n\n[2] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and\nSilvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE\nInternational Conference on Computer Vision and Pattern Recognition, 2016.\n\n[3] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A quantum\nmechanical approach to shape analysis. In Computer Vision Workshops (ICCV Workshops),\n2011 IEEE International Conference on, pages 1626\u20131633. IEEE, 2011.\n\n[4] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.\nGeometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine,\n34(4):18\u201342, 2017.\n\n[5] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang,\nZimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and\nFisher Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.\n\n[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. 
Machine learning, 20(3):273\u2013297, 1995.\n\n[7] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning\nby context prediction. In Proceedings of the IEEE International Conference on Computer\nVision, pages 1422\u20131430, 2015.\n\n[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent,\nand Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine\nLearning Research, 11(Feb):625\u2013660, 2010.\n\n[9] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning\nby predicting image rotations. International Conference on Learning Representations (ICLR),\n2018.\n\n[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural\ninformation processing systems, pages 2672\u20132680, 2014.\n\n[11] Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, and Jianwei Wan. 3d object\nrecognition in cluttered scenes with local surface features: a survey. IEEE Transactions on\nPattern Analysis and Machine Intelligence, 36(11):2270\u20132287, 2014.\n\n[12] Zhizhong Han, Mingyang Shang, Yuhang Liu, and Matthias Zwicker. View inter-prediction\ngan: Unsupervised representation learning for 3d shapes by learning global shape memories to\nsupport local view predictions. AAAI, abs/1811.02744, 2019.\n\n[13] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with\nneural networks. Science, 313(5786):504\u2013507, 2006.\n\n[14] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief\nnetworks for scalable unsupervised learning of hierarchical representations. In Proceedings of\nthe 26th Annual International Conference on Machine Learning, ICML \u201909, pages 609\u2013616,\nNew York, NY, USA, 2009. ACM.\n\n[15] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. 
So-net: Self-organizing network for point cloud\nanalysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June\n2018.\n\n[16] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn:\nConvolution on x-transformed points. In Advances in Neural Information Processing Systems,\npages 828\u2013838, 2018.\n\n[17] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine\nlearning research, 9(Nov):2579\u20132605, 2008.\n\n[18] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time\nobject recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International\nConference on, pages 922\u2013928. IEEE, 2015.\n\n[19] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation\nand projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.\n\n[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed\nrepresentations of words and phrases and their compositionality. In Advances in neural information\nprocessing systems, pages 3111\u20133119, 2013.\n\n[21] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving\njigsaw puzzles. In European Conference on Computer Vision, pages 69\u201384. Springer, 2016.\n\n[22] Charles R Qi, Hao Su, Matthias Nie\u00dfner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas.\nVolumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE\nconference on computer vision and pattern recognition, pages 5648\u20135656, 2016.\n\n[23] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning\non point sets for 3d classification and segmentation. Computer Vision and Pattern Recognition\n(CVPR), abs/1612.00593, 2017.\n\n[24] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 
Pointnet++: Deep hierarchical\nfeature learning on point sets in a metric space. In Advances in Neural Information Processing\nSystems, pages 5099\u20135108, 2017.\n\n[25] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point\nclouds. arXiv preprint arXiv:1611.04500, 2016.\n\n[26] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover\u2019s distance as a metric for\nimage retrieval. International journal of computer vision, 40(2):99\u2013121, 2000.\n\n[27] Radu Bogdan Rusu, Nico Blodow, Zoltan Csaba Marton, and Michael Beetz. Aligning point\ncloud views using persistent feature histograms. In Intelligent Robots and Systems, 2008. IROS\n2008. IEEE/RSJ International Conference on, pages 3384\u20133391. IEEE, 2008.\n\n[28] Manolis Savva, Fisher Yu, Hao Su, Asako Kanezaki, Takahiko Furuya, Ryutarou Ohbuchi,\nZhichao Zhou, Rui Yu, Song Bai, Xiang Bai, et al. Large-scale 3d shape retrieval from shapenet\ncore55: Shrec\u201917 track. In Proceedings of the Workshop on 3D Object Retrieval, pages 39\u201350.\nEurographics Association, 2017.\n\n[29] Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning\nwithout object labels. In European Conference on Computer Vision, pages 236\u2013250. Springer,\n2016.\n\n[30] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view\nconvolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international\nconference on computer vision, pages 945\u2013953, 2015.\n\n[31] Oliver Van Kaick, Hao Zhang, Ghassan Hamarneh, and Daniel Cohen-Or. A survey on shape\ncorrespondence. In Computer Graphics Forum, volume 30, pages 1681\u20131707. Wiley Online\nLibrary, 2011.\n\n[32] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for\nsets. arXiv preprint arXiv:1511.06391, 2015.\n\n[33] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. 
Bronstein, and Justin M.\nSolomon. Dynamic graph CNN for learning on point clouds. CoRR, abs/1801.07829, 2018.\n\n[34] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a\nprobabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances\nin Neural Information Processing Systems, pages 82\u201390, 2016.\n\n[35] Zhirong Wu, Shuran Song, Aditya Khosla, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets for\n2.5d object recognition and next-best-view prediction. CoRR, abs/1406.5670, 2014.\n\n[36] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and\nJianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of\nthe IEEE conference on computer vision and pattern recognition, pages 1912\u20131920, 2015.\n\n[37] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder\nvia deep grid deformation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition\n(CVPR), volume 3, 2018.\n\n[38] Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang,\nAlla Sheffer, Leonidas Guibas, et al. A scalable active framework for region annotation in 3d\nshape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.\n\n[39] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov,\nand Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems,\npages 3391\u20133401, 2017.\n", "award": [], "sourceid": 7097, "authors": [{"given_name": "Jonathan", "family_name": "Sauder", "institution": "Hasso Plattner Institute"}, {"given_name": "Bjarne", "family_name": "Sievers", "institution": "Hasso-Plattner-Institut"}]}